Hi,
I am trying to fit a logistic regression model with cross validation in
Spark 0.9.0 using SVMWithSGD. I have created an array data_kfolded where
each element is a pair of RDDs containing the training and test data:
(training_data: (RDD[org.apache.spark.mllib.regression.LabeledPoint],
If you call .par on data_kfolded it will become a parallel collection in
Scala, and so the maps will happen in parallel.
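A minimal sketch of this (iteration count is illustrative; assumes data_kfolded: Array[(RDD[LabeledPoint], RDD[LabeledPoint])] as described above):

```scala
// Train one SVM per fold concurrently by turning the array of folds
// into a Scala parallel collection with .par.
import org.apache.spark.mllib.classification.SVMWithSGD

val results = data_kfolded.par.map { case (training, test) =>
  val model = SVMWithSGD.train(training, 100)  // 100 iterations, illustrative
  val accuracy = test
    .map(p => if (model.predict(p.features) == p.label) 1.0 else 0.0)
    .mean()
  (model, accuracy)
}.toArray
```

Each map body still runs a distributed Spark job; .par only lets the driver submit the per-fold jobs concurrently instead of one after another.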
On Jul 5, 2014 9:35 AM, sparkuser2345 hm.spark.u...@gmail.com wrote:
Hi,
I am trying to fit a logistic regression model with cross validation in
Spark 0.9.0 using
Hi all,
I have a cluster with HDP 2.0. I built Spark 1.0 on the edge node and am trying to
run it with the command
./bin/spark-submit --class test.etl.RunETL --master yarn-cluster
--num-executors 14 --driver-memory 3200m --executor-memory 3g
--executor-cores 2 my-etl-1.0-SNAPSHOT-hadoop2.2.0.jar
and as a result I
Interesting problem! My understanding is that you want to (1) find paths
matching a particular pattern, and (2) add edges between the start and end
vertices of the matched paths.
For (1), I implemented a pattern matcher for GraphX
Thanks Ankur,
Cannot thank you enough for this!!! I am reading your example, still digesting and
grokking it though :-)
I was breaking my head over this for the past few hours.
In my last futile attempts over the past few hours, I was looking at Pregel, e.g.
whether that could be used to see at what step
I ran into very strange behavior of a job that I ran on a YARN Hadoop
cluster. One of the stages (a map function) was split into 80 tasks; 10 of them
finished successfully in ~2 min, but all the other tasks have been running for 40 min
and still haven't finished... I suspect they are hung.
Any ideas what's going on and how
To be clear - each of the RDDs is still a distributed dataset and each of
the individual SVM models will be trained in parallel across the cluster.
Sean's suggestion effectively has you submitting multiple spark jobs
simultaneously, which, depending on your cluster configuration and the size
of
To make it efficient in your case you may need to write a bit of custom code to
emit the top k per partition and then only send those to the driver. On the
driver you can just take the top k of the combined top-k lists from each partition (assuming you
have (object, count) pairs for each top-k list).
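A rough sketch of that approach (counts and k are assumptions; counts is taken to be an RDD[(String, Long)] of (object, count) pairs):

```scala
// Per-partition top k, then a final top k over the combined candidates
// on the driver; only k items per partition cross the network.
val k = 10
val topK = counts
  .mapPartitions { iter =>
    iter.toArray.sortBy(-_._2).take(k).iterator  // top k within the partition
  }
  .collect()          // at most k * numPartitions items reach the driver
  .sortBy(-_._2)
  .take(k)
```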
Hey Nick,
You are right, I didn't explain myself well and my code example was wrong...
I am keeping a priority queue with k items per partition (using
com.twitter.algebird.mutable.PriorityQueueMonoid.build to limit the sizes
of the queues).
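For reference, a hedged sketch of that bounding step (based on Algebird's mutable PriorityQueueMonoid API as of mid-2014; counts, k, and the pair type are assumptions):

```scala
import java.util.PriorityQueue
import scala.collection.JavaConverters._
import com.twitter.algebird.mutable.PriorityQueueMonoid

val k = 10
implicit val ord: Ordering[(String, Long)] = Ordering.by(_._2)
val qMonoid = new PriorityQueueMonoid[(String, Long)](k)

// One bounded queue per partition; merging two queues keeps at most k items.
val merged: PriorityQueue[(String, Long)] = counts
  .mapPartitions(iter => Iterator(qMonoid.build(iter.toList)))
  .reduce(qMonoid.plus)
val topK = merged.asScala.toList.sortBy(-_._2)
```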
But this still means I am sending k items per partition to
Right. That is unavoidable unless, as you say, you repartition into 1 partition,
which may do the trick.
When I say send the top k per partition I don't mean send the PQ but the actual
values. This may end up being relatively small if k and p are not too big. (I'm
not sure how large the serialized
Hi sparkuser2345,
I'm inferring that the problem statement is something like "how do I make this
complete faster (given my compute resources)?"
Several comments.
First, Spark only allows launching parallel tasks from the driver, not from
workers, which is why you're seeing the exception when you try.
For linear models the 3rd option is by far the most efficient, and I suspect it is what
Evan is alluding to.
Unfortunately it's not directly possible with the classes in MLlib now, so
you'll have to roll your own using the underlying SGD / L-BFGS primitives.
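A hedged illustration of what rolling your own on the SGD primitive might look like (Spark 1.0-era developer API; training, numFeatures, and all parameter values are assumptions):

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.optimization.{GradientDescent, LogisticGradient, SquaredL2Updater}

// Optimize directly against the gradient/updater primitives instead of
// going through LogisticRegressionWithSGD.
val data = training.map(p => (p.label, p.features))  // RDD[(Double, Vector)]
val (weights, lossHistory) = GradientDescent.runMiniBatchSGD(
  data,
  new LogisticGradient(),
  new SquaredL2Updater(),
  1.0,     // stepSize
  100,     // numIterations
  0.01,    // regParam
  1.0,     // miniBatchFraction
  Vectors.dense(new Array[Double](numFeatures)))
```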
On Sat, Jul 5, 2014 at 10:45
Ah, I see. Thank you!
As we are in the process of building the system we have not tried with
any large amounts of data yet, but when the time comes I'll try both
implementations and do a small benchmark.
On Fri, Jul 4, 2014 at 9:20 PM, Mohammed Guller moham...@glassbeam.com wrote:
As far as I
From looking at the exception message that was returned, I would try the
following command for running the application:
./bin/spark-submit --class test.etl.RunETL --master yarn-cluster
--num-workers 14 --driver-memory 3200m --worker-memory 3g --worker-cores 2
--jar
You may try LBFGS to get more stable convergence. In Spark 1.1, we will be
able to use LBFGS instead of GD in the training process.
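A hedged sketch of calling the L-BFGS primitive directly (Spark developer API as of 1.0/1.1; training, numFeatures, and the parameter values are assumptions):

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.optimization.{LBFGS, LogisticGradient, SquaredL2Updater}

val data = training.map(p => (p.label, p.features))  // RDD[(Double, Vector)]
val (weights, lossHistory) = LBFGS.runLBFGS(
  data,
  new LogisticGradient(),
  new SquaredL2Updater(),
  10,      // numCorrections (L-BFGS history size)
  1e-4,    // convergenceTol
  100,     // maxNumIterations
  0.01,    // regParam
  Vectors.dense(new Array[Double](numFeatures)))
```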
On Jul 4, 2014 1:23 PM, Thomas Robert tho...@creativedata.fr wrote:
Hi all,
I too am having some issues with *RegressionWithSGD algorithms.
Concerning your issue
The key idea is to simulate your app's time as you enter data. So you can
connect Spark Streaming to a queue and insert data into it, spaced by time.
Easier said than done :). What are the parallelism issues you are hitting
with your static approach?
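A hedged sketch of that queue-driven setup (sc is an existing SparkContext; simulatedBatches and the 1-second spacing are assumptions):

```scala
import scala.collection.mutable
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Feed Spark Streaming from an in-memory queue so batches arrive
// spaced by simulated time.
val ssc = new StreamingContext(sc, Seconds(1))
val queue = new mutable.Queue[RDD[String]]()
val stream = ssc.queueStream(queue)
stream.count().print()
ssc.start()

// Push one pre-built RDD per simulated time step.
for (batch <- simulatedBatches) {  // simulatedBatches: Seq[Seq[String]], assumed
  queue += sc.parallelize(batch)
  Thread.sleep(1000)
}
```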
On Friday, July 4, 2014, alessandro finamore
Thanks for replying. Why is joining two VertexRDDs without caching slow?
What is recomputed unnecessarily?
I am not sure what is different here from joining 2 regular RDDs (where
nobody seems to recommend caching before joining, I think...)
On Thu, Jul 3, 2014 at 10:52 PM, Ankur Dave
On Fri, Jul 4, 2014 at 3:33 PM, Gurvinder Singh gurvinder.si...@uninett.no
wrote:
csv = sc.newAPIHadoopFile(opts.input,
    com.hadoop.mapreduce.LzoTextInputFormat,
    org.apache.hadoop.io.LongWritable,
    org.apache.hadoop.io.Text).count()
Does anyone know what the rough equivalent of this would be in
When joining two VertexRDDs with identical indexes, GraphX can use a fast
code path (a zip join without any hash lookups). However, the check for
identical indexes is performed using reference equality.
Without caching, two copies of the index are created. Although the two
indexes are
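A hedged sketch of the caching fix implied here (graph is assumed to be an existing GraphX Graph):

```scala
// Caching the graph lets VertexRDDs derived from it share one index object,
// so innerJoin can take the fast zip-join path instead of a hash join.
graph.cache()
val joined = graph.vertices.innerJoin(graph.degrees) {
  (id, attr, deg) => (attr, deg)
}
```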
The package com.hadoop.mapreduce certainly looks wrong. If it is a Hadoop
class it starts with org.apache.hadoop
On Jul 6, 2014 4:20 AM, Nicholas Chammas nicholas.cham...@gmail.com
wrote:
On Fri, Jul 4, 2014 at 3:33 PM, Gurvinder Singh
gurvinder.si...@uninett.no wrote:
csv =