How to parallelize model fitting with different cross-validation folds?

2014-07-05 Thread sparkuser2345
Hi, I am trying to fit a logistic regression model with cross validation in Spark 0.9.0 using SVMWithSGD. I have created an array data_kfolded where each element is a pair of RDDs containing the training and test data: (training_data: (RDD[org.apache.spark.mllib.regression.LabeledPoint],

Re: How to parallelize model fitting with different cross-validation folds?

2014-07-05 Thread Sean Owen
If you call .par on data_kfolded it will become a parallel collection in Scala, and so the maps will happen in parallel. On Jul 5, 2014 9:35 AM, sparkuser2345 hm.spark.u...@gmail.com wrote: Hi, I am trying to fit a logistic regression model with cross validation in Spark 0.9.0 using
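The `.par` idea is driver-side parallelism: map a training function over the folds concurrently rather than sequentially. A minimal sketch of that pattern in plain Python (the `train_and_eval` function and the list-based folds are stand-ins for `SVMWithSGD.train` and the RDD pairs in data_kfolded, not real MLlib calls):

```python
from concurrent.futures import ThreadPoolExecutor

# Each "fold" is a (training_data, test_data) pair; plain lists stand in
# for the RDD pairs held in data_kfolded.
folds = [([1, 2, 3], [4]), ([2, 3, 4], [1]), ([1, 3, 4], [2])]

def train_and_eval(fold):
    train, test = fold
    # Stand-in for SVMWithSGD.train plus evaluation: the "model" here is
    # just the mean of the training labels.
    return sum(train) / len(train)

# Driver-side parallelism: every fold's job is submitted concurrently,
# analogous to data_kfolded.par.map(...) in Scala.
with ThreadPoolExecutor(max_workers=len(folds)) as pool:
    models = list(pool.map(train_and_eval, folds))

print(models)  # one fitted "model" per fold, in fold order
```

Each submitted job is still free to be a full distributed computation; the thread pool only overlaps their submission, as Evan notes later in the thread.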

Spark 1.0 failed on HDP 2.0 with absurd exception

2014-07-05 Thread Konstantin Kudryavtsev
Hi all, I have a cluster with HDP 2.0. I built Spark 1.0 on an edge node and am trying to run it with the command ./bin/spark-submit --class test.etl.RunETL --master yarn-cluster --num-executors 14 --driver-memory 3200m --executor-memory 3g --executor-cores 2 my-etl-1.0-SNAPSHOT-hadoop2.2.0.jar in result I

Re: Graphx traversal and merge interesting edges

2014-07-05 Thread Ankur Dave
Interesting problem! My understanding is that you want to (1) find paths matching a particular pattern, and (2) add edges between the start and end vertices of the matched paths. For (1), I implemented a pattern matcher for GraphX

Re: Graphx traversal and merge interesting edges

2014-07-05 Thread HHB
Thanks Ankur, I cannot thank you enough for this!!! I am still reading your example, digesting and grokking it :-) I had been breaking my head over this for the past few hours. In my last futile attempts I was looking at Pregel, e.g. whether it could be used to see at what step

[no subject]

2014-07-05 Thread Konstantin Kudryavtsev
I ran into very strange behavior of a job that I ran on a YARN Hadoop cluster. One of the stages (a map function) was split into 80 tasks; 10 of them finished successfully in ~2 min, but all the other tasks have been running for 40 min and still have not finished... I suspect they are hung. Any ideas what's going on and how

Re: How to parallelize model fitting with different cross-validation folds?

2014-07-05 Thread Evan R. Sparks
To be clear - each of the RDDs is still a distributed dataset and each of the individual SVM models will be trained in parallel across the cluster. Sean's suggestion effectively has you submitting multiple spark jobs simultaneously, which, depending on your cluster configuration and the size of

Re: taking top k values of rdd

2014-07-05 Thread Nick Pentreath
To make it efficient in your case you may need to write a bit of custom code to emit the top k per partition and then only send those to the driver. On the driver you can then take the top k of the combined per-partition top-k lists (assuming you have (object, count) pairs in each top-k list). — Sent from Mailbox
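The two-level scheme described here can be sketched in plain Python, with lists standing in for RDD partitions (a `mapPartitions`-style step would play the role of `top_k_of_partition` in real Spark code):

```python
import heapq

# Simulated partitions of (object, count) pairs.
partitions = [
    [("a", 5), ("b", 2), ("c", 9)],
    [("d", 7), ("e", 1)],
    [("f", 4), ("g", 8), ("h", 3)],
]

k = 2

def top_k_of_partition(part):
    # Emit only the k largest pairs (by count) from this partition,
    # analogous to a per-partition mapPartitions step.
    return heapq.nlargest(k, part, key=lambda kv: kv[1])

# Driver side: combine the small per-partition top-k lists and take the
# overall top k, instead of pulling every record back to the driver.
candidates = [kv for part in partitions for kv in top_k_of_partition(part)]
top_k = heapq.nlargest(k, candidates, key=lambda kv: kv[1])
print(top_k)  # [('c', 9), ('g', 8)]
```

Only k items per partition cross the network, so the driver sees at most k * p candidates rather than the whole dataset.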

Re: taking top k values of rdd

2014-07-05 Thread Koert Kuipers
hey nick, you are right. i didn't explain myself well and my code example was wrong... i am keeping a priority queue with k items per partition (using com.twitter.algebird.mutable.PriorityQueueMonoid.build to limit the sizes of the queues), but this still means i am sending k items per partition to

Re: taking top k values of rdd

2014-07-05 Thread Nick Pentreath
Right. That is unavoidable unless, as you say, you repartition into 1 partition, which may do the trick. When I say send the top k per partition I don't mean send the pq but the actual values. This may end up being relatively small if k and p are not too big. (I'm not sure how large serialized

Re: How to parallelize model fitting with different cross-validation folds?

2014-07-05 Thread Christopher Nguyen
Hi sparkuser2345, I'm inferring the problem statement is something like 'how do I make this complete faster (given my compute resources)?' Several comments. First, Spark only allows launching parallel tasks from the driver, not from workers, which is why you're seeing the exception when you try.

Re: How to parallelize model fitting with different cross-validation folds?

2014-07-05 Thread Nick Pentreath
For linear models the 3rd option is by far the most efficient, and I suspect it is what Evan is alluding to. Unfortunately it's not directly possible with the classes in MLlib now, so you'll have to roll your own using the underlying SGD / BFGS primitives. — Sent from Mailbox On Sat, Jul 5, 2014 at 10:45
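The "roll your own" option is to update all k fold-models in a single pass over the data, instead of launching one SGD job per fold. A hedged, plain-Python sketch of that idea (all names here are illustrative; this is not MLlib's API, and the hinge-loss step is just an SVM-flavored example):

```python
# One pass over the data updates every fold's weight vector; the fold
# that holds out the current point simply skips the update.
data = [([1.0, 0.0], 1.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0)]
num_folds = 3
lr = 0.1

# One weight vector per fold, all trained in the same scan.
weights = [[0.0, 0.0] for _ in range(num_folds)]

for epoch in range(10):
    for i, (x, y) in enumerate(data):
        fold_of_point = i % num_folds  # this fold holds the point out
        for f in range(num_folds):
            if f == fold_of_point:
                continue  # held-out: fold f never trains on its test point
            w = weights[f]
            margin = y * sum(wj * xj for wj, xj in zip(w, x))
            if margin < 1:  # hinge-loss subgradient step (SVM-style)
                for j in range(len(w)):
                    w[j] += lr * y * x[j]

print(weights)  # k trained weight vectors from one scan of the data
```

In real Spark code the inner updates would be vectorized (a weight matrix rather than a Python loop) and computed with the gradient primitives under MLlib, but the data-access pattern is the point: k models, one pass.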

Re: How to use groupByKey and CqlPagingInputFormat

2014-07-05 Thread Martin Gammelsæter
Ah, I see. Thank you! As we are in the process of building the system, we have not tried it with any large amount of data yet, but when the time comes I'll try both implementations and do a small benchmark. On Fri, Jul 4, 2014 at 9:20 PM, Mohammed Guller moham...@glassbeam.com wrote: As far as I

Re: Spark 1.0 failed on HDP 2.0 with absurd exception

2014-07-05 Thread Cesar Arevalo
From looking at the exception message that was returned, I would try the following command for running the application: ./bin/spark-submit --class test.etl.RunETL --master yarn-cluster --num-workers 14 --driver-memory 3200m --worker-memory 3g --worker-cores 2 --jar

Re: [mllib] strange/buggy results with RidgeRegressionWithSGD

2014-07-05 Thread DB Tsai
You may try LBFGS for more stable convergence. In Spark 1.1, we will be able to use LBFGS instead of GD in the training process. On Jul 4, 2014 1:23 PM, Thomas Robert tho...@creativedata.fr wrote: Hi all, I too am having some issues with *RegressionWithSGD algorithms. Concerning your issue

Re: window analysis with Spark and Spark streaming

2014-07-05 Thread Mayur Rustagi
The key idea is to simulate your app time as you enter data. So you can connect Spark Streaming to a queue and insert data into it spaced by time. Easier said than done :). What are the parallelism issues you are hitting with your static approach? On Friday, July 4, 2014, alessandro finamore
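The replay-into-a-queue idea can be sketched without Spark at all: feed pre-recorded, timestamped events into a queue spaced by their original inter-arrival times (scaled for speed), and let the streaming consumer see them as if they were live. Everything here is illustrative, not a Spark Streaming API:

```python
import queue
import threading
import time

# Pre-recorded (timestamp, value) events to replay as a simulated stream.
events = [(0.00, "a"), (0.05, "b"), (0.10, "c")]
q = queue.Queue()

def replay(events, q, speedup=10.0):
    # Re-emit events spaced by their original gaps, divided by `speedup`.
    start_ts = events[0][0]
    t0 = time.monotonic()
    for ts, val in events:
        delay = (ts - start_ts) / speedup - (time.monotonic() - t0)
        if delay > 0:
            time.sleep(delay)
        q.put(val)

t = threading.Thread(target=replay, args=(events, q))
t.start()
t.join()

received = [q.get() for _ in range(len(events))]
print(received)  # events arrive in original order, spaced by app time
```

In the Spark Streaming setup the consumer side of the queue would be the receiver; windowed operations then behave as they would have on the live stream.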

Re: graphx Joining two VertexPartitions with different indexes is slow.

2014-07-05 Thread Koert Kuipers
thanks for replying. why is joining two VertexRDDs without caching slow? what is recomputed unnecessarily? i am not sure what is different here from joining 2 regular RDDs (where nobody seems to recommend caching before joining, i think...) On Thu, Jul 3, 2014 at 10:52 PM, Ankur Dave

Re: reading compress lzo files

2014-07-05 Thread Nicholas Chammas
On Fri, Jul 4, 2014 at 3:33 PM, Gurvinder Singh gurvinder.si...@uninett.no wrote: csv = sc.newAPIHadoopFile(opts.input, com.hadoop.mapreduce.LzoTextInputFormat, org.apache.hadoop.io.LongWritable, org.apache.hadoop.io.Text).count() Does anyone know what the rough equivalent of this would be in

Re: graphx Joining two VertexPartitions with different indexes is slow.

2014-07-05 Thread Ankur Dave
When joining two VertexRDDs with identical indexes, GraphX can use a fast code path (a zip join without any hash lookups). However, the check for identical indexes is performed using reference equality. Without caching, two copies of the index are created. Although the two indexes are
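The fast path Ankur describes can be sketched generically: a join that zips positionally when both sides share the very same index object (reference equality), and falls back to hash lookups otherwise. The class and names below are a toy illustration, not GraphX internals:

```python
class IndexedRDD:
    """Toy stand-in for a VertexRDD: an index (id -> slot) plus values."""

    def __init__(self, index, values):
        self.index = index    # dict: vertex id -> position in values
        self.values = values  # list of attributes, one per slot

    def join(self, other):
        if self.index is other.index:
            # Fast path: identical index *object*, so a positional zip
            # works and no hash lookups are needed.
            return [(vid, (self.values[i], other.values[i]))
                    for vid, i in self.index.items()]
        # Slow path: the indexes may be equal in content but are distinct
        # objects (e.g. recomputed without caching), so per-key hash
        # lookups are required.
        return [(vid, (self.values[i], other.values[other.index[vid]]))
                for vid, i in self.index.items() if vid in other.index]

shared = {1: 0, 2: 1}
a = IndexedRDD(shared, ["a1", "a2"])
b = IndexedRDD(shared, ["b1", "b2"])        # same index object -> zip join
c = IndexedRDD({1: 0, 2: 1}, ["c1", "c2"])  # equal content, new object

print(a.join(b))  # fast path
print(a.join(c))  # slow path, same result here
```

This is why caching matters in the GraphX case: without it, each join recomputes its input and gets a fresh index object, so the `is` check fails even though the contents match.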

Re: reading compress lzo files

2014-07-05 Thread Sean Owen
The package com.hadoop.mapreduce certainly looks wrong. If it is a Hadoop class it starts with org.apache.hadoop. On Jul 6, 2014 4:20 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: On Fri, Jul 4, 2014 at 3:33 PM, Gurvinder Singh gurvinder.si...@uninett.no wrote: csv =