Also adding dev list in case anyone else has ideas / views.
On Sat, 12 Mar 2016 at 12:52, Nick Pentreath wrote:
Thanks for the feedback.
I think Spark can certainly meet your use case when your data size scales
up, as the actual model dimension is very small - you will need to use
those indexers or some other mapping mechanism.
There is ongoing work for Spark 2.0 to make it easier to use models outside
of Spark.
Thanks for the pointer to those indexers; those are some good examples. That
seems a good way to go for the trainer and any scoring done in Spark. I will
definitely have to deal with scoring in non-Spark systems, though.
I think I will need to scale up beyond what single-node liblinear can
practically provide
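For scoring in non-Spark systems, one common approach (a hypothetical sketch, not Spark API; the feature names and weights below are made up for illustration) is to export the feature-to-index map and the weight vector, then score with a sparse dot product:

```python
import math

# Hypothetical exported artifacts: a feature->index map and a weight array.
feature_index = {"country=US": 0, "device=mobile": 1, "age_bucket=25-34": 2}
weights = [0.8, -0.3, 0.5]
intercept = -0.1

def score(features):
    """Logistic-regression probability from sparse string-keyed features."""
    z = intercept
    for name, value in features.items():
        idx = feature_index.get(name)
        if idx is not None:  # features unseen at training time are ignored
            z += weights[idx] * value
    return 1.0 / (1.0 + math.exp(-z))

p = score({"country=US": 1.0, "device=mobile": 1.0})
```

Only the mapping and the weights need to leave Spark, so the serving side stays dependency-free.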
Ok, I think I understand things better now.
For Spark's current implementation, you would need to map those features as
you mention. You could also use, say, StringIndexer -> OneHotEncoder or
VectorIndexer. You could create a Pipeline to deal with the mapping and
training (e.g.
http://spark.apache.o
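Conceptually, what StringIndexer followed by OneHotEncoder does can be sketched without Spark (illustrative only; the real transformers also handle fit/transform separation, column metadata, and category-dropping conventions):

```python
from collections import Counter

def fit_string_indexer(values):
    # Like Spark's StringIndexer: assign indices by descending frequency,
    # breaking ties alphabetically.
    freq = Counter(values)
    ordered = sorted(freq, key=lambda v: (-freq[v], v))
    return {v: i for i, v in enumerate(ordered)}

def one_hot(index, size):
    # A sparse one-hot vector as (size, [active indices], [values]),
    # mirroring a SparseVector's layout.
    return (size, [index], [1.0])

labels = ["us", "uk", "us", "de", "us", "uk"]
mapping = fit_string_indexer(labels)
vec = one_hot(mapping["uk"], len(mapping))
```

The point is that only the (small) string-to-index map must be carried along with the model to score new data consistently.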
On Fri, Mar 11, 2016 at 5:29 AM, Nick Pentreath wrote:
Would you mind letting us know the # training examples in the datasets?
Also, what do your features look like? Are they text, categorical etc? You
mention that most rows only have a few features, and all rows together have
a few 10,000s features, yet your max feature value is 20 million. How are
yo
Hi Nick,
Thanks for the feedback and the pointers. I tried coalescing to fewer
partitions and that improved the situation dramatically. As you suggested, it is
communication overhead dominating the overall runtime.
The training run I mentioned originally had 900 partitions. Each tree
aggregation has t
Hi Daniel
The bottleneck in Spark ML is most likely (a) the fact that the weight
vector itself is dense, and (b) the related communication via the driver. A
tree aggregation mechanism is used for computing gradient sums (see
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apac
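The treeAggregate idea can be sketched in plain Python: partial results are combined pairwise in levels, so no single node (in particular the driver) has to merge one result per partition (an illustrative sketch, not the Spark implementation):

```python
def tree_aggregate(partials, combine, fan_in=2):
    """Reduce a list of partial results level by level, fan_in at a time."""
    level = list(partials)
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), fan_in):
            group = level[i:i + fan_in]
            acc = group[0]
            for g in group[1:]:
                acc = combine(acc, g)
            next_level.append(acc)
        level = next_level
    return level[0]

# Four partitions' partial gradient sums, combined elementwise.
total = tree_aggregate([[1, 2], [3, 4], [5, 6], [7, 8]],
                       lambda a, b: [x + y for x, y in zip(a, b)])
```

With dense ~20M-element vectors, even this tree structure moves a lot of data, which is why the dense weight vector is the bottleneck.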
Just for the heck of it I tried the old MLlib implementation, but it had
the same scalability problem.
Anyone familiar with the logistic regression implementation who could weigh
in?
On Mon, Mar 7, 2016 at 5:35 PM, Michał Zieliński <zielinski.mich...@gmail.com> wrote:
We're using SparseVector columns in a DataFrame, so they are definitely
supported. But maybe for LR some implicit magic is happening inside.
On 7 March 2016 at 23:04, Devin Jones wrote:
I could be wrong, but it's possible that toDF populates a DataFrame, which I
understand does not support SparseVectors at the moment.
If you use the MLlib logistic regression implementation (not ml) you can
pass the RDD[LabeledPoint] data type directly to the learner.
http://spark.apache.org/docs/late
Yes, it is a SparseVector. Most rows only have a few features, and all the
rows together only have tens of thousands of features, but the vector size
is ~ 20 million because that is the largest feature.
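The memory gap between the two representations is easy to quantify (a sketch assuming 8-byte values and 4-byte indices, roughly matching a double array and an int index array; 50,000 active features is an assumed figure in the "tens of thousands" range mentioned above):

```python
def dense_bytes(size, bytes_per_elem=8):
    # A dense vector stores every slot, active or not.
    return size * bytes_per_elem

def sparse_bytes(num_active, bytes_per_index=4, bytes_per_value=8):
    # A sparse vector stores parallel index and value arrays
    # for the active entries only.
    return num_active * (bytes_per_index + bytes_per_value)

d = dense_bytes(20_000_000)  # one dense vector of size ~20M
s = sparse_bytes(50_000)     # same data held sparsely
```

The rows stay cheap when sparse, but any dense intermediate (like the weight vector or a partial gradient) pays the full 20M-slot cost.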
On Mon, Mar 7, 2016 at 4:31 PM, Devin Jones wrote:
Hi,
Which data structure are you using to train the model? If you haven't tried
yet, you should consider the SparseVector
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.SparseVector
On Mon, Mar 7, 2016 at 4:03 PM, Daniel Siegmann wrote:
I recently tried to train a model using
org.apache.spark.ml.classification.LogisticRegression on a data set where
the feature vector size was around 20 million. It did *not* go well. It
took around 10 hours to train on a substantial cluster. Additionally, it
pulled a lot of data back to the driver - I even