ALS.trainImplicit running out of mem when using higher rank

2015-01-10 Thread Antony Mayi
The memory requirements seem to be growing rapidly when using a higher rank... I am unable to get over 20 without running out of memory. Is this expected? Thanks, Antony.
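
For reference, a minimal sketch of the call in question (Scala MLlib API; the parameter values are illustrative, not taken from the thread). The user/product factor matrices and shuffled intermediate blocks grow linearly with the rank, and the per-block normal equations grow with rank squared, which is consistent with memory pressure at higher ranks:

    import org.apache.spark.mllib.recommendation.{ALS, Rating}
    import org.apache.spark.rdd.RDD

    // ratings: RDD[Rating] of (user, product, implicit preference strength)
    def train(ratings: RDD[Rating]) =
      ALS.trainImplicit(ratings, rank = 20, iterations = 10,
        lambda = 0.01, alpha = 40.0)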

Re: DeepLearning and Spark ?

2015-01-10 Thread Jaonary Rabarisoa
Can someone explain what the difference is between a parameter server and Spark? There's already an issue on this topic: https://issues.apache.org/jira/browse/SPARK-4590 Another example of DL in Spark, essentially based on Downpour SGD: http://deepdist.com On Sat, Jan 10, 2015 at 2:27 AM, Peng

Re: ALS.trainImplicit running out of mem when using higher rank

2015-01-10 Thread Antony Mayi
The actual case looks like this:
* Spark 1.1.0 on YARN (CDH 5.2.1)
* ~8-10 executors, 36GB phys RAM per host
* input RDD is roughly 3GB containing ~150-200M items (and this RDD is made persistent using .cache())
* using pyspark
YARN is configured with the limit yarn.nodemanager.resource.memory-mb of
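
A hedged sketch of the knobs that usually matter in this situation on YARN (standard Spark-on-YARN property names; the values are illustrative only): the executor heap plus spark.yarn.executor.memoryOverhead must fit under yarn.nodemanager.resource.memory-mb.

    import org.apache.spark.SparkConf

    // Illustrative values only -- tune to the 36GB hosts described above
    val conf = new SparkConf()
      .set("spark.executor.memory", "20g")
      .set("spark.yarn.executor.memoryOverhead", "2048") // MB of off-heap headroom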

Re: Data locality running Spark on Mesos

2015-01-10 Thread Timothy Chen
Hi Michael, I see you capped the cores to 60. I wonder what settings you used for the standalone mode that you compared with? I can try to run an MLlib workload on both to compare. Tim On Jan 9, 2015, at 6:42 AM, Michael V Le m...@us.ibm.com wrote: Hi Tim, Thanks for your response.

Re: Parquet compression codecs not applied

2015-01-10 Thread Ayoub Benali
It worked, thanks. This doc page https://spark.apache.org/docs/1.2.0/sql-programming-guide.html recommends using spark.sql.parquet.compression.codec to set the compression codec, and I thought this setting would be forwarded to the Hive context given that HiveContext extends SQLContext, but it was
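
A minimal sketch of the fix described (setting the codec on the HiveContext that actually performs the write; the codec name here is illustrative):

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc) // sc: an existing SparkContext
    hiveContext.setConf("spark.sql.parquet.compression.codec", "snappy")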

Re: Web Service + Spark

2015-01-10 Thread Cui Lin
Thanks, Gaurav and Corey. Probably I didn’t make myself clear. I am looking for the best Spark practice, similar to Shiny for R, where the analysis/visualization results can be easily published to a web server and shown in a web browser. Or is there any dashboard for Spark? Best regards, Cui Lin From: gtinside

Re: Does Spark automatically run different stages concurrently when possible?

2015-01-10 Thread Benyi Wang
You may try changing the scheduling mode to FAIR; the default is FIFO. Take a look at this page: https://spark.apache.org/docs/1.1.0/job-scheduling.html#scheduling-within-an-application On Sat, Jan 10, 2015 at 10:24 AM, YaoPau jonrgr...@gmail.com wrote: I'm looking for ways to reduce the
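
A minimal sketch of that change (this is the property behind the scheduling mode described on the linked page):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("concurrent-stages")
      .set("spark.scheduler.mode", "FAIR") // default is FIFO
    val sc = new SparkContext(conf)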

Re: IOError: [Errno 2] No such file or directory: '/tmp/spark-9e23f17e-2e23-4c26-9621-3cb4d8b832da/tmp3i3xno'

2015-01-10 Thread lucio raimondo
Update: I resolved this by increasing the granularity of RDD persistence for complex map-reduce operations, such as the one whose reduceByKey stage was failing. Coolio. Lucio

status of spark analytics functions? over, rank, percentile, row_number, etc.

2015-01-10 Thread Kevin Burton
I’m curious what the status is of implementing Hive analytics functions in Spark. https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics Many of these seem to be missing. I’m assuming they’re not implemented yet? Is there an ETA on them? Or am I the first to bring this

Does Spark automatically run different stages concurrently when possible?

2015-01-10 Thread YaoPau
I'm looking for ways to reduce the runtime of my Spark job. My code is a single Scala file, written in this order: (1) val lines = Import full dataset using sc.textFile (2) val ABonly = Parse out all rows that are not of type A or B (3) val processA = Process only the A rows from
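
A sketch of the shape of that pipeline, following the numbering in the message (paths and predicates are hypothetical placeholders; the .cache() reflects the suggestion in the replies below):

    // (1) import full dataset
    val lines = sc.textFile("hdfs:///path/to/dataset")
    // (2) keep only rows of type A or B; caching here lets the two
    //     downstream branches reuse this result instead of recomputing (1)-(2)
    val ABonly = lines.filter(l => l.startsWith("A") || l.startsWith("B")).cache()
    // (3) process only the A rows
    val processA = ABonly.filter(_.startsWith("A"))
    // (4) process only the B rows
    val processB = ABonly.filter(_.startsWith("B"))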

Re: FileNotFoundException in appcache shuffle files

2015-01-10 Thread lucio raimondo
Hey, I am having a similar issue, did you manage to find a solution yet? Please check my post below for reference: http://apache-spark-user-list.1001560.n3.nabble.com/IOError-Errno-2-No-such-file-or-directory-tmp-spark-9e23f17e-2e23-4c26-9621-3cb4d8b832da-tmp3i3xno-td21076.html Thank you,

Re: FileNotFoundException in appcache shuffle files

2015-01-10 Thread Aaron Davidson
As Jerry said, this is not related to shuffle file consolidation. The unique thing about this problem is that it's failing to find a file while trying to _write_ to it, in append mode. The simplest explanation for this would be that the file is deleted in between some check for existence and

Re: Does Spark automatically run different stages concurrently when possible?

2015-01-10 Thread Stéphane Verlet
From your pseudo code, it would be sequential and done twice: 1+2+3, then 1+2+4. If you do a .cache() in step 2, then you would have 1+2+3, then 4. I ran several steps in parallel from the same program, but never using the same source RDD, so I do not know the limitations there. I simply started
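
A minimal sketch of driving two independent actions concurrently from one driver with Scala futures (whether they truly overlap on the cluster depends on free resources and the scheduler mode):

    import org.apache.spark.rdd.RDD
    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration

    // rddA, rddB: two independent (ideally cached) RDDs
    def runConcurrently(rddA: RDD[String], rddB: RDD[String]): (Long, Long) = {
      val fa = Future { rddA.count() } // each action is submitted as its own job
      val fb = Future { rddB.count() }
      (Await.result(fa, Duration.Inf), Await.result(fb, Duration.Inf))
    }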

Re: Discrepancy in PCA values

2015-01-10 Thread Upul Bandara
Hi Xiangrui, Thanks a lot for your answer. So I fixed my Julia code, and also calculated PCA using R as well. R Code:

    data <- read.csv('/home/upul/Desktop/iris.csv')
    X <- data[,1:4]
    pca <- prcomp(X, center = TRUE, scale = FALSE)
    transformed <- predict(pca, newdata = X)

Julia Code (Fixed)
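
For comparison, the MLlib side of the same computation looks roughly like this (Spark 1.x RowMatrix API; note that prcomp above centers the data, so any difference in centering will show up in the projected values):

    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.mllib.linalg.distributed.RowMatrix
    import org.apache.spark.rdd.RDD

    // rows: RDD[Vector] holding the four iris feature columns
    def pca(rows: RDD[Vector]): RowMatrix = {
      val mat = new RowMatrix(rows)
      val pc = mat.computePrincipalComponents(4) // 4x4 matrix, one PC per column
      mat.multiply(pc)                           // rows projected onto the PCs
    }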

[no subject]

2015-01-10 Thread Krishna Sankar
Guys, registerTempTable("Employees") gives me the error Exception in thread "main" scala.ScalaReflectionException: class org.apache.spark.sql.catalyst.ScalaReflection in JavaMirror with primordial classloader with boot classpath

Re: Issue writing to Cassandra from Spark

2015-01-10 Thread Akhil Das
Just make sure you are not connecting to the old RPC port (9160); the new binary port is running on 9042. What is your rpc_address listed in cassandra.yaml? Also make sure you have start_native_transport: true in the yaml file. Thanks Best Regards On Sat, Jan 10, 2015 at 8:44 AM, Ankur Srivastava
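
With the spark-cassandra-connector, the corresponding settings look roughly like this (property names as in the 1.x connector; the host value is illustrative):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.cassandra.connection.host", "10.0.0.1")    // should match rpc_address
      .set("spark.cassandra.connection.native.port", "9042") // binary protocol port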

Re: status of spark analytics functions? over, rank, percentile, row_number, etc.

2015-01-10 Thread Will Benton
Hi Kevin, I'm currently working on implementing windowing. If you'd like to see something that's not covered by a JIRA, please file one! best, wb - Original Message - From: Kevin Burton bur...@spinn3r.com To: user@spark.apache.org Sent: Saturday, January 10, 2015 12:12:38 PM

train many decision tress with a single spark job

2015-01-10 Thread Josh Buffum
I've got a data set of activity by user. For each user, I'd like to train a decision tree model. I currently have the feature creation step implemented in Spark and would naturally like to use mllib's decision tree model. However, it looks like the decision tree model expects the whole RDD and
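
One naive way to express this, as a hedged sketch: since MLlib's DecisionTree trains on an RDD, the straightforward loop collects the grouped data to the driver and re-parallelizes each user's points. That is workable for a modest number of users but serializes the model fitting; a starting point, not a scalable answer.

    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.tree.DecisionTree
    import org.apache.spark.rdd.RDD

    // byUser: per-user feature vectors from the existing feature-creation step
    def trainPerUser(sc: SparkContext,
                     byUser: RDD[(String, Iterable[LabeledPoint])]) =
      byUser.collect().map { case (user, points) =>
        val model = DecisionTree.trainClassifier(
          sc.parallelize(points.toSeq), numClasses = 2,
          categoricalFeaturesInfo = Map[Int, Int](),
          impurity = "gini", maxDepth = 5, maxBins = 32)
        (user, model)
      }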

How can I measure the time an RDD takes to execute?

2015-01-10 Thread Saiph Kappa
Hi, How can I measure the time an RDD takes to execute? In particular, I want to do it for the following piece of code:

    val ssc = new StreamingContext(sparkConf, Seconds(5))
    val distFile = ssc.textFileStream("/home/myuser/twitter-dump")
    val words = distFile.flatMap(_.split(" ")).filter(_.length
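
RDDs are lazy, so what can be timed is an action that forces them to execute; in a streaming job the natural place is per batch. A minimal sketch:

    // words: the DStream from the snippet above
    words.foreachRDD { rdd =>
      val t0 = System.nanoTime()
      val n = rdd.count() // count() is an action, forcing the batch to run
      println(s"batch of $n words took ${(System.nanoTime() - t0) / 1e6} ms")
    }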

Re: Play Scala Spark Example

2015-01-10 Thread Akhil Das
What is your spark version that is running on the EC2 cluster? From the build file https://github.com/knoldus/Play-Spark-Scala/blob/master/build.sbt of your play application it seems that it uses Spark 1.0.1. Thanks Best Regards On Fri, Jan 9, 2015 at 7:17 PM, Eduardo Cusa

Re: Job priority

2015-01-10 Thread Mark Hamstra
-dev, +user http://spark.apache.org/docs/latest/job-scheduling.html On Sat, Jan 10, 2015 at 4:40 PM, Alessandro Baretta alexbare...@gmail.com wrote: Is it possible to specify a priority level for a job, such that the active jobs might be scheduled in order of priority? Alex

Re: SparkSQL schemaRDD MapPartitions calls - performance issues - columnar formats?

2015-01-10 Thread Nathan McCarthy
Thanks Cheng, Michael! Makes sense. Appreciate the tips! Idiomatic Scala isn't performant. I’ll definitely start using while loops or tail-recursive methods. I have noticed this in the Spark code base. I might try turning off columnar compression (via
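
The kind of rewrite being described, as a sketch: in a hot mapPartitions path, walk the iterator once with a while loop instead of chaining combinators on it.

    import scala.collection.mutable.ArrayBuffer

    def processPartition[T, U](iter: Iterator[T])(f: T => U): Iterator[U] = {
      val out = new ArrayBuffer[U]()
      while (iter.hasNext) out += f(iter.next()) // no intermediate iterator wrappers
      out.iterator
    }
    // usage sketch (transform is a placeholder for real per-row work):
    // rdd.mapPartitions(it => processPartition(it)(row => transform(row)))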

Re: Job priority

2015-01-10 Thread Cody Koeninger
http://spark.apache.org/docs/latest/job-scheduling.html#configuring-pool-properties Setting a high weight such as 1000 also makes it possible to implement *priority* between pools—in essence, the weight-1000 pool will always get to launch tasks first whenever it has jobs active. On Sat, Jan 10,
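
In code, jobs are steered into such a pool per thread; the weight itself lives in the fair scheduler allocation XML referenced by spark.scheduler.allocation.file. A sketch (pool name hypothetical):

    // assumes an allocation file defining a pool "production" with <weight>1000</weight>
    sc.setLocalProperty("spark.scheduler.pool", "production")
    // ... actions submitted from this thread now run in the weighted pool ...
    sc.setLocalProperty("spark.scheduler.pool", null) // revert to the default pool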

Re: Job priority

2015-01-10 Thread Alessandro Baretta
Cody, Maybe I'm not getting this, but it doesn't look like this page is describing a priority queue scheduling policy. What this section discusses is how resources are shared between queues. A weight-1000 pool will get 1000 times more resources allocated to it than a priority 1 queue. Great, but

Re: Spark Graph Visualizer

2015-01-10 Thread kevinkim
Hi Rajesh, There's a great web-based notebook visualization tool called Zeppelin. (And it's open source!) Check it out: http://zeppelin.incubator.apache.org Regards, Kevin

Re: Job priority

2015-01-10 Thread Alessandro Baretta
Mark, Thanks, but I don't see how this documentation solves my problem. You are referring me to documentation of fair scheduling; whereas, I am asking about as unfair a scheduling policy as can be: a priority queue. Alex On Sat, Jan 10, 2015 at 5:00 PM, Mark Hamstra m...@clearstorydata.com

Re: Removing JARs from spark-jobserver

2015-01-10 Thread abhishek
There is a path, /tmp/spark-jobserver/file, where all the jars are kept by default. Probably deleting from there should work. On 11 Jan 2015 12:51, Sasi [via Apache Spark User List] ml-node+s1001560n21081...@n3.nabble.com wrote: How to remove submitted JARs from spark-jobserver?

Removing JARs from spark-jobserver

2015-01-10 Thread Sasi
How to remove submitted JARs from spark-jobserver? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Removing-JARs-from-spark-jobserver-tp21081.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Does DecisionTree model in MLlib deal with missing values?

2015-01-10 Thread Carter
Hi, I am new to MLlib in Spark. Can the DecisionTree model in MLlib deal with missing values? If so, what data structure should I use for the input? Moreover, my data has categorical features, but LabeledPoint requires the double data type; in this case, what can I do? Thank you very much.
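
On the categorical-features half of the question: the MLlib 1.x pattern is to encode each category as a double (0.0, 1.0, 2.0, ...) and declare the arity through categoricalFeaturesInfo. Missing values are not handled natively by this DecisionTree, so they are typically imputed or given their own category first. A sketch (feature indices and arities illustrative):

    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.tree.DecisionTree
    import org.apache.spark.rdd.RDD

    // data: RDD[LabeledPoint]; feature 0 has 3 categories, feature 2 has 4
    def train(data: RDD[LabeledPoint]) =
      DecisionTree.trainClassifier(
        data, numClasses = 2,
        categoricalFeaturesInfo = Map(0 -> 3, 2 -> 4),
        impurity = "gini", maxDepth = 5, maxBins = 32)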