Re: Spark speed performance

2014-10-19 Thread jan.zikes
Thank you very much. A lot of very small JSON files was exactly the performance problem; using coalesce, my Spark program on a single node now runs only twice as slow (even including Spark startup) as the single-node Python program, which is acceptable. Jan
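
A minimal sketch of the fix (the input path and target partition count are illustrative, not from the thread): collapse the many tiny input partitions before doing the heavy work.

    import org.apache.spark.{SparkConf, SparkContext}

    object CoalesceSmallFiles {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("coalesce-small-files"))

        // Each small JSON file yields at least one partition, so thousands of
        // files mean thousands of tiny tasks; coalesce collapses them.
        val lines = sc.textFile("hdfs:///data/json/*").coalesce(16)

        val count = lines.filter(_.contains("\"event\"")).count()
        println(s"matching lines: $count")
        sc.stop()
      }
    }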

Why does the driver's connection to the master fail?

2014-10-19 Thread randylu
In my program, the application's connection to the master always fails for several iterations. The driver's log is as follows: WARN AppClient$ClientActor: Connection to akka.tcp://sparkMaster@master1:7077 failed; waiting for master to reconnect... Why does this warning happen, and how can I avoid it?

Re: Why does the driver's connection to the master fail?

2014-10-19 Thread randylu
In addition, the driver receives several DisassociatedEvent messages.

Re: What's wrong with my spark filter? I get org.apache.spark.SparkException: Task not serializable

2014-10-19 Thread Ilya Ganelin
Check for any variables you've declared in your class. Even if you're not calling them from the function, they are passed to the worker nodes as part of the context. Consequently, if you have something without a default serializer (like an imported class), it will also get passed. To fix this you
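
To make that fix concrete, here is a minimal sketch (class and field names are hypothetical): copy the needed field into a local val so the closure captures only that value instead of the whole enclosing object.

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    class Analyzer(sc: SparkContext) {
      // Imagine this sits next to non-serializable fields (e.g. an imported
      // client class); referencing it directly in a closure drags `this` along.
      val threshold = 0.5

      def strongSignals(scores: RDD[Double]): RDD[Double] = {
        val localThreshold = threshold // closure now captures only this value
        scores.filter(_ > localThreshold)
      }
    }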

scala.MatchError: class java.sql.Timestamp

2014-10-19 Thread Ge, Yao (Y.)
I am working with Spark 1.1.0 and I believe Timestamp is a supported data type for Spark SQL. However, I keep getting this MatchError for java.sql.Timestamp when I try to use reflection to register a Java Bean with a Timestamp field. Anything wrong with my code below? public

RE: scala.MatchError: class java.sql.Timestamp

2014-10-19 Thread Wang, Daoyuan
Can you provide the exception stack? Thanks, Daoyuan From: Ge, Yao (Y.) [mailto:y...@ford.com] Sent: Sunday, October 19, 2014 10:17 PM To: user@spark.apache.org Subject: scala.MatchError: class java.sql.Timestamp I am working with Spark 1.1.0 and I believe Timestamp is a supported data type

RE: scala.MatchError: class java.sql.Timestamp

2014-10-19 Thread Ge, Yao (Y.)
scala.MatchError: class java.sql.Timestamp (of class java.lang.Class) at org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$getSchema$1.apply(JavaSQLContext.scala:189) at

Error while running Streaming examples - no snappyjava in java.library.path

2014-10-19 Thread bdev
I built the latest Spark project and I'm running into these errors when attempting to run the streaming examples locally on the Mac. How do I fix these errors? java.lang.UnsatisfiedLinkError: no snappyjava in java.library.path at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1886)
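
One workaround sometimes suggested for this symptom (my assumption, not a reply from the thread) is to switch Spark's compression codec to the pure-Java LZF implementation, which avoids loading the native snappyjava library:

    import org.apache.spark.{SparkConf, SparkContext}

    // LZF is implemented in pure Java, so it sidesteps the native library
    // lookup that fails with UnsatisfiedLinkError on some Macs.
    val conf = new SparkConf()
      .setAppName("streaming-example")
      .setMaster("local[2]")
      .set("spark.io.compression.codec", "org.apache.spark.io.LZFCompressionCodec")
    val sc = new SparkContext(conf)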

Using SVMWithSGD model to predict

2014-10-19 Thread npomfret
Hi, I'm new to Spark and just trying to make sense of the SVMWithSGD example. I ran my dataset through it and built a model. When I call predict() on the testing data (after clearThreshold()) I was expecting to get answers in the range of 0 to 1. But they aren't; all predictions seem to be

Re: Using SVMWithSGD model to predict

2014-10-19 Thread Sean Owen
The problem is that you called clearThreshold(). The result becomes the SVM margin, not a 0/1 class prediction. There is no probability output. There was a very similar question last week. Is there an example out there suggesting clearThreshold()? I also wonder if it is good to overload the

Re: Using SVMWithSGD model to predict

2014-10-19 Thread Nick Pomfret
Thanks. The example I used is here: https://spark.apache.org/docs/latest/mllib-linear-methods.html (see SVMClassifier). So there's no way to get a probability-based output? What about from linear regression, or logistic regression? On 19 October 2014 19:52, Sean Owen so...@cloudera.com wrote:

Re: Using SVMWithSGD model to predict

2014-10-19 Thread Sean Owen
Ah right. It is important to use clearThreshold() in that example in order to generate margins, because the AUC metric needs the classifications to be ranked by some relative strength, rather than just 0/1. These outputs are not probabilities, and that is not what SVMs give you in general. There
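
For reference, a condensed sketch of the docs example being discussed (run in spark-shell, where sc is predefined): with the threshold cleared, predict() returns raw margins suitable for ranking metrics; with a threshold set, it returns 0/1 labels.

    import org.apache.spark.mllib.classification.SVMWithSGD
    import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
    import org.apache.spark.mllib.util.MLUtils

    val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
    val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
    val (training, test) = (splits(0).cache(), splits(1))

    val model = SVMWithSGD.train(training, 100)

    // Cleared threshold: predict() yields the raw SVM margin, a relative
    // score (not a probability), which is what AUC needs for ranking.
    model.clearThreshold()
    val scoreAndLabels = test.map(p => (model.predict(p.features), p.label))
    val auROC = new BinaryClassificationMetrics(scoreAndLabels).areaUnderROC()

    // Threshold restored: predict() yields hard 0/1 class labels again.
    model.setThreshold(0.0)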

Re: What executes on worker and what executes on driver side

2014-10-19 Thread Saurabh Wadhawan
Any response for this? 1. How do I know which statements out of the Spark script will be executed on the worker side in a stage? E.g., if I have val x = 1 (or any other code) in my driver code, will the same statements be executed on the worker side in a stage? 2. How can I do a map side
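
A small sketch of the usual rule (my illustration, not from a reply): top-level statements run once on the driver; only the functions passed to RDD transformations are shipped to and executed on the workers.

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("where-it-runs"))

    val x = 1                          // driver only: evaluated once, here
    val data = sc.parallelize(1 to 100)

    // Only this closure runs on the workers; `x` is captured by value and
    // serialized into the task, not re-executed as a driver statement.
    val shifted = data.map(n => n + x)

    println(shifted.reduce(_ + _))     // action: result returns to the driver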

Re: Submissions open for Spark Summit East 2015

2014-10-19 Thread Matei Zaharia
BTW several people asked about registration and student passes. Registration will open in a few weeks, and like in previous Spark Summits, I expect there to be a special pass for students. Matei On Oct 18, 2014, at 9:52 PM, Matei Zaharia matei.zaha...@gmail.com wrote: After successful

Re: Oryx + Spark mllib

2014-10-19 Thread Jayant Shekhar
Hi Deb, do check out https://github.com/OryxProject/oryx. It does integrate with Spark. Sean has put quite a bit of neat detail on the page about the architecture. It has all the things you are thinking about :) Thanks, Jayant On Sat, Oct 18, 2014 at 8:49 AM, Debasish Das

Is Spark the right tool?

2014-10-19 Thread kc66
I am very new to Spark. I am working on a project that involves reading stock transactions off a number of TCP connections and 1. periodically (once every few hours) uploading the transaction records to HBase, and 2. maintaining the records that are not yet written into HBase and acting as an HTTP query server
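
For the ingestion side, a minimal Spark Streaming sketch (host, port, and batch interval are illustrative): read lines from a TCP socket and handle each batch.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("stock-ingest")
    val ssc = new StreamingContext(conf, Seconds(10))

    // One receiver per TCP feed; each receiver occupies a core on a worker.
    val feed = ssc.socketTextStream("feed-host-1", 9999)
    feed.foreachRDD { rdd =>
      // Placeholder: parse transactions here and buffer or write them out
      // (e.g. the periodic HBase upload would hang off this hook).
      rdd.foreach(line => ())
    }

    ssc.start()
    ssc.awaitTermination()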

Re: Using SVMWithSGD model to predict

2014-10-19 Thread Nick Pomfret
Thanks for the info. On 19 October 2014 20:46, Sean Owen so...@cloudera.com wrote: Ah right. It is important to use clearThreshold() in that example in order to generate margins, because the AUC metric needs the classifications to be ranked by some relative strength, rather than just 0/1.

mllib model build and low CPU usage

2014-10-19 Thread Nick Pomfret
I'm building a model in a standalone cluster with just a single worker, limited to 3 cores and 4GB RAM. The node starts up and spits out the message: Starting Spark worker 192.168.1.185:60203 with 3 cores, 4.0 GB RAM. During model training (SVMWithSGD) the CPU usage on the worker is very low. It
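
One common cause (my assumption, not confirmed in the thread) is too few input partitions: with a single partition, only one of the three cores gets gradient work. A hedged sketch:

    import org.apache.spark.mllib.classification.SVMWithSGD
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.rdd.RDD

    def trainUsingAllCores(rawData: RDD[LabeledPoint]) = {
      // Repartition so each of the 3 cores gets a share of every iteration,
      // and cache because SGD makes many passes over the data.
      val training = rawData.repartition(3).cache()
      SVMWithSGD.train(training, 100)
    }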

Spark Streaming scheduling control

2014-10-19 Thread davidkl
Hello, I have a cluster with 1 master and 2 slaves running 1.1.0. I am having problems getting both slaves to work at the same time. When I launch the driver on the master, one of the slaves is assigned the receiver task, and initially both slaves start processing tasks. After a few tens of batches,
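
One thing often checked for this symptom (my assumption, not a reply from the thread): received blocks live on the receiver's node, so task locality can gradually pull all processing there. Lowering the locality wait lets tasks fall back to the other slave sooner:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("streaming-two-slaves")
      // Wait at most 500 ms for a data-local slot on the receiver's node
      // before scheduling the task on the other slave.
      .set("spark.locality.wait", "500")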

Re: Upgrade to Spark 1.1.0?

2014-10-19 Thread Pat Ferrel
Trying to upgrade from Spark 1.0.1 to 1.1.0. Can’t imagine the upgrade is the problem but anyway... I get a NoClassDefFoundError for RandomGenerator when running a driver from the CLI. But only when using a named master, even a standalone master. If I run using master = local[4] the job
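
RandomGenerator comes from Apache commons-math3, which the Spark assembly may not bundle at runtime; a hedged fix (the version number is illustrative) is to declare the dependency in your own build so it ships with your application:

    // build.sbt: add commons-math3 explicitly, since RandomGenerator
    // (org.apache.commons.math3.random.RandomGenerator) lives there.
    libraryDependencies += "org.apache.commons" % "commons-math3" % "3.3"

Alternatively, passing the jar to spark-submit via --jars should have the same effect.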

RE: how to build spark 1.1.0 to include org.apache.commons.math3 ?

2014-10-19 Thread Henry Hung
@Sean Owen, thank you for the information. I changed the pom file to include math3, because I needed the math3 library from my previous use with 1.0.2. Best regards, Henry -Original Message- From: Sean Owen [mailto:so...@cloudera.com] Sent: Saturday, October 18, 2014 2:19 AM To: MA33

RE: scala.MatchError: class java.sql.Timestamp

2014-10-19 Thread Cheng, Hao
Seems like a bug in JavaSQLContext.getSchema(), which doesn't enumerate all of the data types supported by Catalyst. From: Ge, Yao (Y.) [mailto:y...@ford.com] Sent: Sunday, October 19, 2014 11:44 PM To: Wang, Daoyuan; user@spark.apache.org Subject: RE: scala.MatchError: class java.sql.Timestamp

All executors run on just a few nodes

2014-10-19 Thread Tao Xiao
Hi all, I have a Spark 0.9 cluster of 16 nodes. I wrote a Spark application to read data from an HBase table, which has 86 regions spread over 20 RegionServers. I submitted the Spark app in standalone mode and found that there were 86 executors running on just 3 nodes and it

default parallelism bug?

2014-10-19 Thread Kevin Jung
Hi, I usually use a file on HDFS to make a PairRDD and analyze it using combineByKey, reduceByKey, etc. But sometimes it hangs when I set the spark.default.parallelism configuration, even though the file is small. If I remove this configuration, everything works fine. Can anyone tell me why this occurs?
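
For context, a sketch of what the setting controls (an illustration, not a diagnosis): it is the default partition count for shuffle operations such as reduceByKey when none is given explicitly, so passing the count per operation is one way to isolate whether the global setting triggers the hang.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._ // PairRDDFunctions implicits (pre-1.3)

    val conf = new SparkConf()
      .setAppName("parallelism-check")
      .set("spark.default.parallelism", "8") // default partitions for shuffles

    val sc = new SparkContext(conf)
    val pairs = sc.textFile("hdfs:///small/file")
      .map(line => (line.split("\t")(0), 1))

    // An explicit partition count overrides the global default for this
    // shuffle only, which helps narrow down the hang.
    val counts = pairs.reduceByKey(_ + _, 8)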

RE: scala.MatchError: class java.sql.Timestamp

2014-10-19 Thread Wang, Daoyuan
I have created an issue for this https://issues.apache.org/jira/browse/SPARK-4003 From: Cheng, Hao Sent: Monday, October 20, 2014 9:20 AM To: Ge, Yao (Y.); Wang, Daoyuan; user@spark.apache.org Subject: RE: scala.MatchError: class java.sql.Timestamp Seems bugs in the JavaSQLContext.getSchema(),
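
Until that fix lands, one hedged workaround (my suggestion, not from the thread) is to go through the Scala API, whose case-class reflection already maps java.sql.Timestamp:

    import java.sql.Timestamp
    import org.apache.spark.sql.SQLContext

    case class Event(name: String, ts: Timestamp)

    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD // implicit RDD -> SchemaRDD (Spark 1.1)

    val events = sc.parallelize(Seq(Event("open", new Timestamp(System.currentTimeMillis))))
    events.registerTempTable("events")
    sqlContext.sql("SELECT name, ts FROM events").collect().foreach(println)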

Re: How to write a RDD into One Local Existing File?

2014-10-19 Thread Rishi Yadav
Write to HDFS and then get one file locally by using hdfs dfs -getmerge... On Friday, October 17, 2014, Sean Owen so...@cloudera.com wrote: You can save to a local file. What are you trying, and what doesn't work? You can output one file by repartitioning to 1 partition, but this is probably
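
A sketch combining both suggestions (paths are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("single-file-output"))
    val rdd = sc.parallelize(1 to 100).map(_.toString)

    // Option 1: one partition => one part-00000 file in HDFS. All data flows
    // through a single task, so reserve this for modest output sizes.
    rdd.coalesce(1).saveAsTextFile("hdfs:///tmp/output")

    // Option 2: write normally, then merge outside Spark:
    //   hdfs dfs -getmerge /tmp/output /local/result.txt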