Re: [EXTERNAL] - Re: Problem with the ML ALS algorithm

2019-06-26 Thread Nick Pentreath
t I create using some > likelihood distributions of the rating values. I am only experimenting / > learning. In practice though, the list of items is likely to be at least > in the 10’s if not 100’s. Are even these item numbers too low? > Thanks. > -S

Re: [EXTERNAL] - Re: Problem with the ML ALS algorithm

2019-06-26 Thread Nick Pentreath
; Number of items is 4 > Ratings values are either 120, 20, 0 > *From:* Nick Pentreath *Sent:* Wednesday, June 26, 2019 6:03 AM *To:* user@spark.apache.org *Subject:* [EXTERNAL] - Re: Problem with the ML ALS algorithm > This means that

Re: How to use StringIndexer for multiple input /output columns in Spark Java

2018-05-15 Thread Nick Pentreath
Multi column support for StringIndexer didn’t make it into Spark 2.3.0. The PR is still in progress I think - it should be available in 2.4.0. On Mon, 14 May 2018 at 22:32, Mina Aslani wrote: > Please take a look at the api doc: >
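
[A minimal sketch of the current per-column workaround: one StringIndexer stage per column, assembled programmatically. Column names are hypothetical.]

    import org.apache.spark.ml.{Pipeline, PipelineStage}
    import org.apache.spark.ml.feature.StringIndexer

    val categoricalCols = Seq("colA", "colB")  // hypothetical column names
    val indexers = categoricalCols.map { c =>
      new StringIndexer().setInputCol(c).setOutputCol(s"${c}_idx")
    }
    val pipeline = new Pipeline().setStages(indexers.toArray[PipelineStage])
    val indexed = pipeline.fit(df).transform(df)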

Re: A naive ML question

2018-04-29 Thread Nick Pentreath
One potential approach could be to construct a transition matrix showing the probability of moving from each state to another state. This can be visualized with a “heat map” encoding (I think matshow in matplotlib does this). On Sat, 28 Apr 2018 at 21:34, kant kodali

Re: StringIndexer with high cardinality huge data

2018-04-10 Thread Nick Pentreath
Also check out FeatureHasher in Spark 2.3.0 which is designed to handle this use case in a more natural way than HashingTF (and handles multiple columns at once). On Tue, 10 Apr 2018 at 16:00, Filipp Zhinkin wrote: > Hi Shahab, > > do you actually need to have a few
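
[A minimal sketch of FeatureHasher as suggested above (Spark 2.3+); the input columns are hypothetical.]

    import org.apache.spark.ml.feature.FeatureHasher

    val hasher = new FeatureHasher()
      .setInputCols(Array("userId", "category"))  // hypothetical columns; mixed types handled in one pass
      .setOutputCol("features")
      .setNumFeatures(1 << 18)                    // fixed dimensionality, no fitting over the data required
    val hashed = hasher.transform(df)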

Re: Spark MLlib: Should I call .cache before fitting a model?

2018-02-27 Thread Nick Pentreath
Currently, fit for many (most I think) models will cache the input data. For LogisticRegression this is definitely the case, so you won't get any benefit from caching it yourself. On Tue, 27 Feb 2018 at 21:25 Gevorg Hari wrote: > Imagine that I am training a Spark MLlib

Re: Reverse MinMaxScaler in SparkML

2018-01-29 Thread Nick Pentreath
This would be interesting and a good addition I think. It bears some thought about the API though. One approach is to have an "inverseTransform" method similar to sklearn. The other approach is to "formalize" something like StringIndexerModel -> IndexToString. Here, the inverse transformer is a
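
[Until an inverseTransform exists, a manual inversion is possible since MinMaxScalerModel exposes originalMin/originalMax. A sketch, assuming the default [0, 1] output range and a hypothetical "scaledFeatures" column; it ignores the constant-feature corner case.]

    import org.apache.spark.ml.linalg.{Vector, Vectors}
    import org.apache.spark.sql.functions.{col, udf}

    val lo = scalerModel.originalMin.toArray  // scalerModel: a fitted MinMaxScalerModel
    val hi = scalerModel.originalMax.toArray
    val invert = udf { (v: Vector) =>
      // undo scaled = (x - lo) / (hi - lo) for the default [0, 1] range
      Vectors.dense(v.toArray.zipWithIndex.map { case (x, i) => x * (hi(i) - lo(i)) + lo(i) })
    }
    val restored = scaled.withColumn("original", invert(col("scaledFeatures")))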

Re: Has there been any explanation on the performance degradation between spark.ml and Mllib?

2018-01-21 Thread Nick Pentreath
At least one of their comparisons is flawed. The Spark ML version of linear regression (*note* they use linear regression and not logistic regression, it is not clear why) uses L-BFGS as the solver, not SGD (as MLLIB uses). Hence it is typically going to be slower. However, it should in most

Re: [ML] Allow CrossValidation ParamGrid on SVMWithSGD

2018-01-19 Thread Nick Pentreath
SVMWithSGD sits in the older "mllib" package and is not compatible directly with the DataFrame API. I suppose one could write a ML-API wrapper around it. However, there is LinearSVC in Spark 2.2.x: http://spark.apache.org/docs/latest/ml-classification-regression.html#linear-support-vector-machine
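
[A minimal sketch of LinearSVC with a CrossValidator param grid, which is what the DataFrame API enables here; the dataset and grid values are hypothetical.]

    import org.apache.spark.ml.classification.LinearSVC
    import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
    import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

    val svc = new LinearSVC().setMaxIter(100)
    val grid = new ParamGridBuilder()
      .addGrid(svc.regParam, Array(0.01, 0.1, 1.0))
      .build()
    val cv = new CrossValidator()
      .setEstimator(svc)
      .setEvaluator(new BinaryClassificationEvaluator())  // uses rawPrediction by default
      .setEstimatorParamMaps(grid)
      .setNumFolds(3)
    val cvModel = cv.fit(training)  // training: DataFrame with "label" and "features"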

Re: does "Deep Learning Pipelines" scale out linearly?

2017-11-22 Thread Nick Pentreath
For that package specifically it’s best to see if they have a mailing list and if not perhaps ask on github issues. Having said that perhaps the folks involved in that package will reply here too. On Wed, 22 Nov 2017 at 20:03, Andy Davidson wrote: > I am starting

Re: StringIndexer on several columns in a DataFrame with Scala

2017-10-30 Thread Nick Pentreath
For now, you must follow this approach of constructing a pipeline consisting of a StringIndexer for each categorical column. See https://issues.apache.org/jira/browse/SPARK-11215 for the related JIRA to allow multiple columns for StringIndexer, which is being worked on currently. The reason

Re: How to run MLlib's word2vec in CBOW mode?

2017-09-28 Thread Nick Pentreath
MLlib currently doesn't support CBOW - there is an open PR for it (see https://issues.apache.org/jira/browse/SPARK-20372). On Thu, 28 Sep 2017 at 09:56 pun wrote: > Hello, > My understanding is that word2vec can be run in two modes: > >- continuous bag-of-words

Re: isCached

2017-09-01 Thread Nick Pentreath
t; > On Fri, Sep 1, 2017 at 11:46 AM, Nick Pentreath <nick.pentre...@gmail.com> > wrote: > >> Dataset does have storageLevel. So you can use isCached = (storageLevel >> != StorageLevel.NONE) as a test. >> >> Arguably isCached could be added to dataset too, sh

Re: isCached

2017-09-01 Thread Nick Pentreath
Dataset does have storageLevel. So you can use isCached = (storageLevel != StorageLevel.NONE) as a test. Arguably isCached could be added to dataset too, shouldn't be a controversial change. On Fri, 1 Sep 2017 at 17:31, Nathan Kronenfeld wrote: > I'm currently
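
[The test from the message, as a one-liner:]

    import org.apache.spark.storage.StorageLevel

    val isCached = df.storageLevel != StorageLevel.NONE  // Dataset.storageLevel is available from Spark 2.1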

Re: Updates on migration guides

2017-08-30 Thread Nick Pentreath
MLlib has tried quite hard to ensure the migration guide is up to date for each release. I think generally we catch all breaking and most major behavior changes On Wed, 30 Aug 2017 at 17:02, Dongjoon Hyun wrote: > +1 > > On Wed, Aug 30, 2017 at 7:54 AM, Xiao Li

Re: Setting initial weights of ml.classification.LogisticRegression similar to mllib.classification.LogisticRegressionWithLBFGS

2017-07-20 Thread Nick Pentreath
l > method does that > > On Thu, Jul 20, 2017 at 12:50 PM, Nick Pentreath <nick.pentre...@gmail.com > > wrote: > >> Currently it's not supported, but is on the roadmap: see >> https://issues.apache.org/jira/browse/SPARK-13025 >> >> The most recent attempt

Re: Setting initial weights of ml.classification.LogisticRegression similar to mllib.classification.LogisticRegressionWithLBFGS

2017-07-20 Thread Nick Pentreath
Currently it's not supported, but is on the roadmap: see https://issues.apache.org/jira/browse/SPARK-13025 The most recent attempt is to start with simple linear regression, as here: https://issues.apache.org/jira/browse/SPARK-21386 On Thu, 20 Jul 2017 at 08:36 Aseem Bansal

Re: Regarding Logistic Regression changes in Spark 2.2.0

2017-07-19 Thread Nick Pentreath
L-BFGS is the default optimization method since the initial ML package implementation. The OWLQN variant is used only when L1 regularization is specified (via the elasticNetParam). 2.2 adds the box constraints (optimized using the LBFGS-B variant). So no, no upgrade is required to use L-BFGS - if
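
[A minimal sketch of how the solver is selected via elasticNetParam; the parameter values are illustrative.]

    import org.apache.spark.ml.classification.LogisticRegression

    val lr = new LogisticRegression()
      .setRegParam(0.1)
      .setElasticNetParam(1.0)  // 1.0 = pure L1 -> OWLQN variant; 0.0 (default) = pure L2 -> plain L-BFGS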

Re: Spark 2.1.1: A bug in org.apache.spark.ml.linalg.* when using VectorAssembler.scala

2017-07-13 Thread Nick Pentreath
There are Vector classes under the ml.linalg package, and VectorAssembler and other feature transformers all work with ml.linalg vectors. If you try to use mllib.linalg vectors instead you will get an error, as the user-defined type for SQL is not correct. On Thu, 13 Jul 2017 at 11:23,

Re: [PySpark]: How to store NumPy array into single DataFrame cell efficiently

2017-06-28 Thread Nick Pentreath
You will need to use PySpark vectors to store in a DataFrame. They can be created from NumPy arrays as follows:

    from pyspark.ml.linalg import Vectors

    df = spark.createDataFrame([("src1", "pkey1", 1, Vectors.dense(np.array([0, 1, 2])))])

On Wed, 28 Jun 2017 at 12:23 Judit Planas

Re: Question about mllib.recommendation.ALS

2017-06-08 Thread Nick Pentreath
Spark 2.2 will support the recommend-all methods in ML. Also, both ML and MLLIB performance has been greatly improved for the recommend-all methods. Perhaps you could check out the current RC of Spark 2.2 or master branch to try it out? N On Thu, 8 Jun 2017 at 17:18, Sahib Aulakh [Search] <
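
[A sketch of the Spark 2.2 ML recommend-all API, assuming a fitted ml.recommendation.ALSModel named model:]

    val userRecs = model.recommendForAllUsers(10)  // top 10 items per user
    val itemRecs = model.recommendForAllItems(10)  // top 10 users per item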

Re: spark ML Recommender program

2017-05-18 Thread Nick Pentreath
It sounds like this may be the same as https://issues.apache.org/jira/browse/SPARK-20402 On Thu, 18 May 2017 at 08:16 Nick Pentreath <nick.pentre...@gmail.com> wrote: > Could you try setting the checkpoint interval for ALS (try 3, 5 say) and > see what the effect is? > > >

Re: spark ML Recommender program

2017-05-18 Thread Nick Pentreath
Could you try setting the checkpoint interval for ALS (try 3, 5 say) and see what the effect is? On Thu, 18 May 2017 at 07:32 Mark Vervuurt wrote: > If you are running locally try increasing driver memory to for example 4G > en executor memory to 3G. > Regards, Mark >
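
[A minimal sketch of the suggestion; the checkpoint directory is a hypothetical path and must be set for checkpointing to take effect.]

    import org.apache.spark.ml.recommendation.ALS

    spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")  // hypothetical path
    val als = new ALS()
      .setCheckpointInterval(3)  // checkpoint every 3 iterations to truncate the lineage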

Re: ElasticSearch Spark error

2017-05-15 Thread Nick Pentreath
It may be best to ask on the elasticsearch-Hadoop github project On Mon, 15 May 2017 at 13:19, nayan sharma wrote: > Hi All, > > *ERROR:-* > > *Caused by: org.apache.spark.util.TaskCompletionListenerException: > Connection error (check network and/or proxy settings)-

Re: pyspark vector

2017-04-25 Thread Nick Pentreath
Well the 3 in this case is the size of the sparse vector. This equates to the number of features, which for CountVectorizer (I assume that's what you're using) is also vocab size (number of unique terms). On Tue, 25 Apr 2017 at 04:06 Peyman Mohajerian wrote: > setVocabSize >
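
[A minimal sketch, assuming CountVectorizer and a hypothetical "tokens" column of array<string>:]

    import org.apache.spark.ml.feature.CountVectorizer

    val cv = new CountVectorizer()
      .setInputCol("tokens")
      .setOutputCol("tf")
      .setVocabSize(3)        // output sparse vectors will have size 3
    val cvModel = cv.fit(docs)  // docs: hypothetical input DataFrame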

Re: How to convert Spark MLlib vector to ML Vector?

2017-04-09 Thread Nick Pentreath
Why not use the RandomForest from Spark ML? On Sun, 9 Apr 2017 at 16:01, Md. Rezaul Karim < rezaul.ka...@insight-centre.org> wrote: > I have already posted this question to the StackOverflow > . >

Re: Spark 2.1 ml library scalability

2017-04-07 Thread Nick Pentreath
dently. That sounds like something which > could be ran in parallel. > > > On Fri, Apr 7, 2017 at 5:20 PM, Nick Pentreath <nick.pentre...@gmail.com> > wrote: > > What is the size of training data (number examples, number features)? > Dense or sparse features? How man

Re: Spark 2.1 ml library scalability

2017-04-07 Thread Nick Pentreath
What is the size of training data (number examples, number features)? Dense or sparse features? How many classes? What commands are you using to submit your job via spark-submit? On Fri, 7 Apr 2017 at 13:12 Aseem Bansal wrote: > When using spark ml's LogisticRegression,

Re: Collaborative filtering steps in spark

2017-03-29 Thread Nick Pentreath
No, it does a random initialization. It does use a slightly different approach from pure normal random - it chooses non-negative draws which results in very slightly better results empirically. In practice I'm not sure if the average rating approach will make a big difference (it's been a long

Re: Collaborative Filtering - scaling of the regularization parameter

2017-03-23 Thread Nick Pentreath
send a patch. > > On 23 March 2017 at 13:49, Nick Pentreath <nick.pentre...@gmail.com> > wrote: > > Yup, that is true and a reasonable clarification of the doc. > > > > On Thu, 23 Mar 2017 at 00:03 chris snow <chsnow...@gmail.com> wrote: > >> >

Re: Collaborative Filtering - scaling of the regularization parameter

2017-03-23 Thread Nick Pentreath
Yup, that is true and a reasonable clarification of the doc. On Thu, 23 Mar 2017 at 00:03 chris snow wrote: > The documentation for collaborative filtering is as follows: > > === > Scaling of the regularization parameter > > Since v1.1, we scale the regularization parameter

Re: Contributing to Spark

2017-03-19 Thread Nick Pentreath
If you have experience and interest in Python then PySpark is a good area to look into. Yes, adding things like tests & documentation is a good starting point. Start out relatively small and go from there. Adding new wrappers to python for ML is useful for slightly larger tasks. On Mon, 20

Re: Check if dataframe is empty

2017-03-07 Thread Nick Pentreath
I believe take on an empty dataset will return an empty Array rather than throw an exception. df.take(1).isEmpty should work On Tue, 7 Mar 2017 at 07:42, Deepak Sharma wrote: > If the df is empty , the .take would return > java.util.NoSuchElementException. > This can be
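
[As a one-liner:]

    val isEmpty = df.take(1).isEmpty  // returns an empty Array (no exception) on an empty DataFrame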

Re: Practical configuration to run LSH in Spark 2.1.0

2017-02-22 Thread Nick Pentreath
helpful > :) For instance, the similarity threshold, the number of hash tables, the > bucket width, etc... > > Thanks! > > On Mon, Feb 13, 2017 at 3:21 PM, Nick Pentreath <nick.pentre...@gmail.com> > wrote: > > The original Uber authors provided this performanc

Re: Practical configuration to run LSH in Spark 2.1.0

2017-02-13 Thread Nick Pentreath
) but the > error is still happens. And it happens when I call similarity join. After > transformation, the size of dataset is about 4G. > > 2017-02-11 3:07 GMT+07:00 Nick Pentreath <nick.pentre...@gmail.com>: > > What other params are you using for the lsh transformer? &g

Re: Practical configuration to run LSH in Spark 2.1.0

2017-02-10 Thread Nick Pentreath
What other params are you using for the lsh transformer? Are the issues occurring during transform or during the similarity join? On Fri, 10 Feb 2017 at 05:46, nguyen duc Tuan wrote: > hi Das, > In general, I will apply them to larger datasets, so I want to use LSH, >

Re: ML PIC

2017-01-16 Thread Nick Pentreath
this have some opportunity for newbs (like me) to volunteer some > time? > > Sent from my iPhone > > On Dec 21, 2016, at 9:08 AM, Nick Pentreath <nick.pentre...@gmail.com> > wrote: > > It is part of the general feature parity roadmap. I can't recall offhand > any

Re: ML PIC

2016-12-21 Thread Nick Pentreath
It is part of the general feature parity roadmap. I can't recall offhand any blocker reasons it's just resources On Wed, 21 Dec 2016 at 17:05, Robert Hamilton wrote: > Hi all. Is it on the roadmap to have an > Spark.ml.clustering.PowerIterationClustering? Are there

Re: Issue in using DenseVector in RowMatrix, error could be due to ml and mllib package changes

2016-12-08 Thread Nick Pentreath
Yes, most likely because HashingTF returns ml vectors while you need mllib vectors for RowMatrix. I'd recommend using the vector conversion utils (I think in mllib.linalg.Vectors but I'm on mobile right now so can't recall exactly). There are util methods for converting single vectors as well as
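
[For reference, a sketch of the conversion utilities referred to (Spark 2.x); oldVec and df are hypothetical.]

    import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
    import org.apache.spark.mllib.util.MLUtils

    val mlVec = oldVec.asML              // single vector: mllib -> ml
    val back  = OldVectors.fromML(mlVec) // single vector: ml -> mllib
    // whole DataFrame column, e.g. before building a RowMatrix:
    val converted = MLUtils.convertVectorColumnsFromML(df, "features")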

Re: how to print auc & prc for GBTClassifier, which is okay for RandomForestClassifier

2016-11-28 Thread Nick Pentreath
This is because currently GBTClassifier doesn't extend the ClassificationModel abstract class, which in turn has the rawPredictionCol and related methods for generating that column. I'm actually not sure off hand whether this was because the GBT implementation could not produce the raw prediction

Re: scala.MatchError while doing BinaryClassificationMetrics

2016-11-14 Thread Nick Pentreath
alyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) On Mon, Nov 14, 2016 at 1:44 PM, Nick Pentreath <nick.pentre...@gmail.com> wrote: DataFrame.rdd returns an RDD[Row]. You'll need to use map to extract the doubles from the test score and label DF. But you may prefer to just

Re: scala.MatchError while doing BinaryClassificationMetrics

2016-11-14 Thread Nick Pentreath
DataFrame.rdd returns an RDD[Row]. You'll need to use map to extract the doubles from the test score and label DF. But you may prefer to just use spark.ml evaluators, which work with DataFrames. Try BinaryClassificationEvaluator. On Mon, 14 Nov 2016 at 19:30, Bhaarat Sharma
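
[A minimal sketch of the suggested evaluator, assuming a predictions DataFrame from model.transform(test):]

    import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

    val evaluator = new BinaryClassificationEvaluator()
      .setRawPredictionCol("rawPrediction")
      .setMetricName("areaUnderROC")  // or "areaUnderPR"
    val auc = evaluator.evaluate(predictions)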

Re: Nearest neighbour search

2016-11-14 Thread Nick Pentreath
LSH-based NN search and similarity join should be out in Spark 2.1 - there's a little work being done still to clear up the APIs and some functionality. Check out https://issues.apache.org/jira/browse/SPARK-5992 On Mon, 14 Nov 2016 at 16:12, Kevin Mellott wrote: >
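
[A sketch of the LSH API as it shipped in Spark 2.1; the settings, query vector, and datasets are hypothetical.]

    import org.apache.spark.ml.feature.BucketedRandomProjectionLSH

    val lsh = new BucketedRandomProjectionLSH()
      .setBucketLength(2.0)   // hypothetical setting
      .setNumHashTables(3)
      .setInputCol("features")
      .setOutputCol("hashes")
    val lshModel = lsh.fit(df)
    val knn   = lshModel.approxNearestNeighbors(df, queryVector, 5)  // queryVector: hypothetical ml Vector
    val pairs = lshModel.approxSimilarityJoin(dfA, dfB, 1.5)         // hypothetical distance threshold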

Re: Finding a Spark Equivalent for Pandas' get_dummies

2016-11-11 Thread Nick Pentreath
For now OHE supports a single column. So you have to have 1000 OHE in a pipeline. However you can add them programmatically so it is not too bad. If the cardinality of each feature is quite low, it should be workable. After that use VectorAssembler to stitch the vectors together (which accepts
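
[A sketch of wiring the encoders up programmatically (pre-Spark-3.0 single-column OneHotEncoder); the column names are hypothetical.]

    import org.apache.spark.ml.{Pipeline, PipelineStage}
    import org.apache.spark.ml.feature.{OneHotEncoder, VectorAssembler}

    val cols = (0 until 1000).map(i => s"c$i")  // hypothetical categorical-index columns
    val encoders = cols.map { c =>
      new OneHotEncoder().setInputCol(c).setOutputCol(s"${c}_vec")
    }
    val assembler = new VectorAssembler()
      .setInputCols(cols.map(c => s"${c}_vec").toArray)
      .setOutputCol("features")
    val pipeline = new Pipeline().setStages((encoders :+ assembler).toArray[PipelineStage])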

Re: ALS.trainImplicit block sizes

2016-10-21 Thread Nick Pentreath
Oh also you mention 20 partitions. Is that how many you have? How many ratings? It may be worth trying to repartition to a larger number of partitions. On Fri, 21 Oct 2016 at 17:04, Nick Pentreath <nick.pentre...@gmail.com> wrote: > I wonder if you can try with setting different blocks

Re: ALS.trainImplicit block sizes

2016-10-21 Thread Nick Pentreath
t was going out of memory with the default size too. > > On Fri, Oct 21, 2016 at 5:31 AM, Nick Pentreath <nick.pentre...@gmail.com> > wrote: > > Did you try not setting the blocks parameter? It will then try to set it > automatically for your data size. > On Fri, 21 Oct 2016

Re: ALS.trainImplicit block sizes

2016-10-21 Thread Nick Pentreath
lock size to 20,000 also results in the same. So there is > something I don't understand about how this is working. > > BTW, I am trying to find 50 latent factors (rank = 50). > > Do you have some insights as to how I should tweak things to get this > working? > > Thanks, > Nik >

Re: [Spark ML] Using GBTClassifier in OneVsRest

2016-10-21 Thread Nick Pentreath
Currently no - GBT implements the predictors, not the classifier interface. It might be possible to wrap it in a wrapper that extends the Classifier trait. Hopefully GBT will support multi-class at some point. But you can use RandomForest which does support multi-class. On Fri, 21 Oct 2016 at

Re: ALS.trainImplicit block sizes

2016-10-21 Thread Nick Pentreath
The blocks params will set both user and item blocks. Spark 2.0 supports user and item blocks for PySpark: http://spark.apache.org/docs/latest/api/python/pyspark.ml.html#module-pyspark.ml.recommendation On Fri, 21 Oct 2016 at 08:12 Nikhil Mishra wrote: > Hi, > > I

Re: Making more features in Logistic Regression

2016-10-18 Thread Nick Pentreath
You can use the PolynomialExpansion in Spark ML ( http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.feature.PolynomialExpansion ) On Tue, 18 Oct 2016 at 21:47 miro wrote: > Yes, I was thinking going down this road: > > >
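
[A minimal sketch, assuming an existing "features" vector column:]

    import org.apache.spark.ml.feature.PolynomialExpansion

    val poly = new PolynomialExpansion()
      .setInputCol("features")
      .setOutputCol("polyFeatures")
      .setDegree(2)  // adds squared and pairwise-interaction terms
    val expanded = poly.transform(df)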

Re: can mllib Logistic Regression package handle 10 million sparse features?

2016-10-11 Thread Nick Pentreath
; > > > Sincerely, > > > > DB Tsai > > -- > > Web: https://www.dbtsai.com > > PGP Key ID: 0xAF08DF8D > > > > > > On Thu, Oct 6, 2016 at 4:09 AM, Nick Pentreath <nick.pentre...@gmail.com>

Re: why spark ml package doesn't contain svm algorithm

2016-09-27 Thread Nick Pentreath
There is a JIRA and PR for it - https://issues.apache.org/jira/browse/SPARK-14709 On Tue, 27 Sep 2016 at 09:10 hxw黄祥为 wrote: > I have found spark ml package have implement naivebayes algorithm and the > source code is simple,. > > I am confusing why spark ml package doesn’t

Re: Spark MLlib ALS algorithm

2016-09-24 Thread Nick Pentreath
The scale factor was only to scale up the number of ratings in the dataset for performance testing purposes, to illustrate the scalability of Spark ALS. It is not something you would normally do on your training dataset. On Fri, 23 Sep 2016 at 20:07, Roshani Nagmote

Re: Similar Items

2016-09-21 Thread Nick Pentreath
Sorry, the original repo: https://github.com/karlhigley/spark-neighbors On Wed, 21 Sep 2016 at 13:09 Nick Pentreath <nick.pentre...@gmail.com> wrote: > I should also point out another library I had not come across before : > https://github.com/sethah/spark-neighbors > > >

Re: Similar Items

2016-09-21 Thread Nick Pentreath
in a mere 65 seconds! Thanks so much for the help! > > On Tue, Sep 20, 2016 at 1:15 PM, Kevin Mellott <kevin.r.mell...@gmail.com> > wrote: > >> Thanks Nick - those examples will help a ton!! >> >> On Tue, Sep 20, 2016 at 12:20 PM, Nick Pentreath < >> nick

Re: Similar Items

2016-09-20 Thread Nick Pentreath
documents 1 and 2 need to be compared to one > another (via cosine similarity) because they both contain the token > 'hockey'. I will investigate the methods that you recommended to see if > they may resolve our problem. > > Thanks, > Kevin > > On Tue, Sep 20, 2016 at 1:45 AM,

Re: Is RankingMetrics' NDCG implementation correct?

2016-09-20 Thread Nick Pentreath
(cc'ing dev list also) I think a more general version of ranking metrics that allows arbitrary relevance scores could be useful. Ranking metrics are applicable to other settings like search or other learning-to-rank use cases, so it should be a little more generic than pure recommender settings.

Re: Similar Items

2016-09-20 Thread Nick Pentreath
How many products do you have? How large are your vectors? It could be that SVD / LSA could be helpful. But if you have many products then trying to compute all-pair similarity with brute force is not going to be scalable. In this case you may want to investigate hashing (LSH) techniques. On

Re: Issues while running MLlib matrix factorization ALS algorithm

2016-09-19 Thread Nick Pentreath
Try als.setCheckpointInterval ( http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.recommendation.ALS@setCheckpointInterval(checkpointInterval:Int):ALS.this.type ) On Mon, 19 Sep 2016 at 20:01 Roshani Nagmote wrote: > Hello Sean, > > Can

Re: Is RankingMetrics' NDCG implementation correct?

2016-09-19 Thread Nick Pentreath
The PR already exists for adding RankingEvaluator to ML - https://github.com/apache/spark/pull/12461. I need to revive and review it. DB, your review would be welcome too (and also on https://github.com/apache/spark/issues/12574 which has implications for the semantics of ranking metrics in the

Re: weightCol doesn't seem to be handled properly in PySpark

2016-09-12 Thread Nick Pentreath
Could you create a JIRA ticket for it? https://issues.apache.org/jira/browse/SPARK On Thu, 8 Sep 2016 at 07:50 evanzamir wrote: > When I am trying to use LinearRegression, it seems that unless there is a > column specified with weights, it will raise a py4j error. Seems

Re: How to convert an ArrayType to DenseVector within DataFrame?

2016-09-08 Thread Nick Pentreath
You can use a udf like this: [PySpark shell session: Spark 2.0.0, Python 2.7.12] In [1]: from
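
[The PySpark session is cut off in the archive; the same idea in Scala, as a sketch assuming a hypothetical array<double> column named "arr":]

    import org.apache.spark.ml.linalg.Vectors
    import org.apache.spark.sql.functions.{col, udf}

    val toVector = udf((a: Seq[Double]) => Vectors.dense(a.toArray))
    val withVec = df.withColumn("features", toVector(col("arr")))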

Re: I noticed LinearRegression sometimes produces negative R^2 values

2016-09-06 Thread Nick Pentreath
That does seem strange. Can you provide an example to reproduce? On Tue, 6 Sep 2016 at 21:49 evanzamir wrote: > Am I misinterpreting what r2() in the LinearRegression Model summary means? > By definition, R^2 should never be a negative number! > > > > -- > View this

Re: Spark 2.0.0 - has anyone used spark ML to do predictions under 20ms?

2016-09-01 Thread Nick Pentreath
at 15:37 Nick Pentreath <nick.pentre...@gmail.com> wrote: > Right now you are correct that Spark ML APIs do not support predicting on > a single instance (whether Vector for the models or a Row for a pipeline). > > See https://issues.apache.org/jira/browse/SPARK-10413 and > http

Re: Spark 2.0.0 - has anyone used spark ML to do predictions under 20ms?

2016-09-01 Thread Nick Pentreath
Right now you are correct that Spark ML APIs do not support predicting on a single instance (whether Vector for the models or a Row for a pipeline). See https://issues.apache.org/jira/browse/SPARK-10413 and https://issues.apache.org/jira/browse/SPARK-16431 (duplicate) for some discussion. There

Re: Equivalent of "predict" function from LogisticRegressionWithLBFGS in OneVsRest with LogisticRegression classifier (Spark 2.0)

2016-08-29 Thread Nick Pentreath
Try this:

    val df = spark.createDataFrame(Seq(Vectors.dense(Array(10.0, 590.0, 190.0, 700.0))).map(Tuple1.apply)).toDF("features")

On Sun, 28 Aug 2016 at 11:06 yaroslav wrote: > Hi, > > We use such kind of logic for training our model > > val model = new

Re: Breaking down text String into Array elements

2016-08-23 Thread Nick Pentreath
y and >> it will work. I was wondering if I could do this in Spark/Scala with my >> limited knowledge >> >> Cheers >> >> >> >> Dr Mich Talebzadeh >> >> >> >> LinkedIn * >> https://www.linkedin.com/profile/view?id=AAEWh

Re: Breaking down text String into Array elements

2016-08-23 Thread Nick Pentreath
what is "text"? i.e. what is the "val text = ..." definition? If text is a String itself then indeed sc.parallelize(Array(text)) is doing the correct thing in this case. On Tue, 23 Aug 2016 at 19:42 Mich Talebzadeh wrote: > I am sure someone know this :) > > Created

Re: Vector size mismatch in logistic regression - Spark ML 2.0

2016-08-22 Thread Nick Pentreath
I believe it may be because of this issue ( https://issues.apache.org/jira/browse/SPARK-13030). OHE is not an estimator - hence in cases where the number of categories differ between train and test, it's not usable in the current form. It's tricky to work around, though one option is to use

Re: Model Persistence

2016-08-18 Thread Nick Pentreath
Model metadata (mostly parameter values) are usually tiny. The parquet data is most often for model coefficients. So this depends on the size of your model, i.e. Your feature dimension. A high-dimensional linear model can be quite large - but still typically easy to fit into main memory on a

Re: OOM with StringIndexer, 800m rows & 56m distinct value column

2016-08-11 Thread Nick Pentreath
"{} min: {} max: {}".format(c, > min(mappings[c].values()), max(mappings[c].values( # some logging to > confirm the indexes. > logging.info("Missing value = {}".format(mappings[c]['missing'])) > return max_index, mappings > > I’d love to see

Re: Standardization with Sparse Vectors

2016-08-10 Thread Nick Pentreath
are computed. It almost but not quite enabled an > optimization. > > > On Wed, Aug 10, 2016, 18:10 Nick Pentreath <nick.pentre...@gmail.com> > wrote: > >> Sean by 'offset' do you mean basically subtracting the mean but only from >> the non-zero elements

Re: Spark2 SBT Assembly

2016-08-10 Thread Nick Pentreath
You're correct - Spark packaging has been shifted to not use the assembly jar. To build now use "build/sbt package" On Wed, 10 Aug 2016 at 19:40, Efe Selcuk wrote: > Hi Spark folks, > > With Spark 1.6 the 'assembly' target for sbt would build a fat jar with > all of the

Re: Standardization with Sparse Vectors

2016-08-10 Thread Nick Pentreath
Sean by 'offset' do you mean basically subtracting the mean but only from the non-zero elements in each row? On Wed, 10 Aug 2016 at 19:02, Sean Owen wrote: > Yeah I had thought the same, that perhaps it's fine to let the > StandardScaler proceed, if it's explicitly asked to

Re: OOM with StringIndexer, 800m rows & 56m distinct value column

2016-08-04 Thread Nick Pentreath
putStream.java:897) > 16/08/04 10:36:03 WARN DFSClient: Error Recovery for block > BP-292564-10.196.101.2-1366289936494:blk_2802150425_1105993467488 in > pipeline 10.10.66.3:50010, 10.10.66.1:50010, 10.10.95.29:50010: bad > datanode 10.10.95.29:50010 > 16/08/04 10:40:48 WARN

Re: OOM with StringIndexer, 800m rows & 56m distinct value column

2016-08-04 Thread Nick Pentreath
Hi Ben

Perhaps with this size cardinality it is worth looking at feature hashing for your problem. Spark has the HashingTF transformer that works on a column of "sentences" (i.e. [string]). For categorical features you can hack it a little by converting your feature value into a
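
[The sentence is truncated in the archive; one plausible reading of the trick is to tokenize each categorical value as a "name=value" string before hashing, sketched here with hypothetical columns:]

    import org.apache.spark.ml.feature.HashingTF
    import org.apache.spark.sql.functions.{array, col, concat, lit}

    // encode each categorical as "column=value" so distinct features hash apart
    val tokens = df.withColumn("terms", array(
      concat(lit("city="), col("city")),       // hypothetical columns
      concat(lit("device="), col("device"))))
    val tf = new HashingTF()
      .setInputCol("terms")
      .setOutputCol("features")
      .setNumFeatures(1 << 20)
    val hashed = tf.transform(tokens)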

Re: [MLlib] Term Frequency in TF-IDF seems incorrect

2016-08-02 Thread Nick Pentreath
Note that both HashingTF and CountVectorizer are usually used for creating TF-IDF normalized vectors. The definition ( https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Definition) of term frequency in TF-IDF is actually the "number of times the term occurs in the document". So it's perhaps a bit of a
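
[For reference, the definition Spark's IDF implementation uses (per the MLlib feature-extraction docs), where |D| is the number of documents:]

    \mathrm{IDF}(t, D) = \log\frac{|D| + 1}{\mathrm{DF}(t, D) + 1},
    \qquad \mathrm{TFIDF}(t, d, D) = \mathrm{TF}(t, d) \cdot \mathrm{IDF}(t, D)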

Re: Spark ml.ALS question -- RegressionEvaluator .evaluate giving ~1.5 output for same train and predict data

2016-07-27 Thread Nick Pentreath
> > wrote: > >> Thanks Nick. I also ran into this issue. >> VG, One workaround is to drop the NaN from predictions (df.na.drop()) and >> then use the dataset for the evaluator. In real life, probably detect the >> NaN and recommend most popular on some window. >

Re: Spark ml.ALS question -- RegressionEvaluator .evaluate giving ~1.5 output for same train and predict data

2016-07-24 Thread Nick Pentreath
s Nick. I also ran into this issue. > VG, One workaround is to drop the NaN from predictions (df.na.drop()) and > then use the dataset for the evaluator. In real life, probably detect the > NaN and recommend most popular on some window. > HTH. > Cheers > > > On Sun, Jul 24,

Re: Spark ml.ALS question -- RegressionEvaluator .evaluate giving ~1.5 output for same train and predict data

2016-07-24 Thread Nick Pentreath
It seems likely that you're running into https://issues.apache.org/jira/browse/SPARK-14489 - this occurs when the test dataset in the train/test split contains users or items that were not in the training set. Hence the model doesn't have computed factors for those ids, and ALS 'transform'
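
[A sketch of the workaround quoted in this thread, plus the cold-start strategy that later resolved SPARK-14489 in Spark 2.2; the evaluator and predictions are hypothetical.]

    import org.apache.spark.ml.recommendation.ALS

    // pre-2.2 workaround: drop rows with NaN predictions before evaluating
    val rmse = evaluator.evaluate(predictions.na.drop(Seq("prediction")))

    // Spark 2.2+: let ALS drop unseen user/item ids itself
    val als = new ALS().setColdStartStrategy("drop")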

Re: Deploying ML Pipeline Model

2016-07-05 Thread Nick Pentreath
Spark for evaluating requests. > > Regards, > Saurabh > > > > > > > On Fri, Jul 1, 2016 at 10:47 AM, Nick Pentreath <nick.pentre...@gmail.com> > wrote: > >> Generally there are 2 ways to use a trained pipeline model - (offline) >> batch scorin

Re: Deploying ML Pipeline Model

2016-07-05 Thread Nick Pentreath
't otherwise because the Affero license is > not Apache compatible.) > > On Fri, Jul 1, 2016 at 8:16 PM, Nick Pentreath <nick.pentre...@gmail.com> > wrote: > > I believe open-scoring is one of the well-known PMML serving frameworks > in > > Java land (https://github.com/j

Re: Deploying ML Pipeline Model

2016-07-01 Thread Nick Pentreath
> https://medium.com/@jaceklaskowski/ > Mastering Apache Spark http://bit.ly/mastering-apache-spark > Follow me at https://twitter.com/jaceklaskowski > > > On Fri, Jul 1, 2016 at 6:47 PM, Nick Pentreath <nick.pentre...@gmail.com> > wrote: > > Generally there are

Re: Deploying ML Pipeline Model

2016-07-01 Thread Nick Pentreath
Generally there are 2 ways to use a trained pipeline model - (offline) batch scoring, and real-time online scoring. For batch (or even "mini-batch" e.g. on Spark streaming data), then yes certainly loading the model back in Spark and feeding new data through the pipeline for prediction works just
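
[A minimal sketch of the batch-scoring path; paths and DataFrames are hypothetical.]

    import org.apache.spark.ml.PipelineModel

    fitted.write.overwrite().save("/models/pipeline-v1")  // fitted: a trained PipelineModel; hypothetical path
    val model = PipelineModel.load("/models/pipeline-v1")
    val scored = model.transform(newBatch)                // newBatch: hypothetical DataFrame of new data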

Re: Performance issue with spark ml model to make single predictions on server side

2016-06-24 Thread Nick Pentreath
Currently, spark-ml models and pipelines are only usable in Spark. This means you must use Spark's machinery (and pull in all its dependencies) to do model serving. Also currently there is no fast "predict" method for a single Vector instance. So for now, you are best off going with PMML, or

Re: Spark ml and PMML export

2016-06-23 Thread Nick Pentreath
Currently there is no way within Spark itself. You may want to check out this issue (https://issues.apache.org/jira/browse/SPARK-11171) and here is an external project working on it (https://github.com/jpmml/jpmml-sparkml), that covers quite a number of transformers and models (but not all). On

Re: Classpath hell and Elasticsearch 2.3.2...

2016-06-02 Thread Nick Pentreath
.. thanks Nick. Figured that out since your last email... I deleted > the 2.10 by accident but then put 2+2 together. > > Got it working now. > > Still sticking to my story that it's somewhat complicated to setup :) > > Kevin > > On Thu, Jun 2, 2016 at 3:59 PM, Nick Pentreat

Re: Classpath hell and Elasticsearch 2.3.2...

2016-06-02 Thread Nick Pentreath
voke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731) > at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181

Re: Classpath hell and Elasticsearch 2.3.2...

2016-06-02 Thread Nick Pentreath
Hey there When I used es-hadoop, I just pulled in the dependency into my pom.xml, with spark as a "provided" dependency, and built a fat jar with assembly. Then with spark-submit use the --jars option to include your assembly jar (IIRC I sometimes also needed to use --driver-classpath too, but

Re: Addign a new column to a dataframe (based on value of existing column)

2016-04-28 Thread Nick Pentreath
This should work:

    scala> val df = Seq((25.0, "foo"), (30.0, "bar")).toDF("age", "name")
    scala> df.withColumn("AgeInt", when(col("age") > 29.0, 1).otherwise(0)).show
    +----+----+------+
    | age|name|AgeInt|
    +----+----+------+
    |25.0| foo|     0|
    |30.0| bar|     1|
    +----+----+------+

On Thu, 28 Apr 2016 at

Re: VectorAssembler handling null values

2016-04-19 Thread Nick Pentreath
Could you provide an example of what your input data looks like? Supporting missing values in a sparse result vector makes sense. On Tue, 19 Apr 2016 at 23:55, Andres Perez wrote: > Hi everyone. org.apache.spark.ml.feature.VectorAssembler currently cannot > handle null

Re: [ML] Training with bias

2016-04-12 Thread Nick Pentreath
Are you referring to fitting the intercept term? You can use lr.setFitIntercept (though it is true by default):

    scala> lr.explainParam(lr.fitIntercept)
    res27: String = fitIntercept: whether to fit an intercept term (default: true)

On Mon, 11 Apr 2016 at 21:59 Daniel Siegmann

Re: HashingTF "compatibility" across Python, Scala?

2016-04-12 Thread Nick Pentreath
. On Thu, 7 Apr 2016 at 18:19 Nick Pentreath <nick.pentre...@gmail.com> wrote: > You're right Sean, the implementation depends on hash code currently so > may differ. I opened a JIRA (which duplicated this one - > https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK

Re: MLlib ALS MatrixFactorizationModel.save fails consistently

2016-04-08 Thread Nick Pentreath
Could you post some stack trace info? Generally, it can be problematic to run Spark within a web server framework as often there are dependency conflict and threading issues. You might prefer to run the model-building as a standalone app, or check out

Re: HashingTF "compatibility" across Python, Scala?

2016-04-07 Thread Nick Pentreath
You're right Sean, the implementation depends on hash code currently so may differ. I opened a JIRA (which duplicated this one - https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-10574 which is the active JIRA), for using murmurhash3 which should then be consistent across platforms

Re: ClassCastException when extracting and collecting DF array column type

2016-04-06 Thread Nick Pentreath
Ah I got it - Seq[(Int, Float)] is actually represented as Seq[Row] (seq of struct type) internally. So a further extraction is required, e.g. row => row.getSeq[Row](1).map { r => r.getInt(0) } On Wed, 6 Apr 2016 at 13:35 Nick Pentreath <nick.pentre...@gmail.com>

ClassCastException when extracting and collecting DF array column type

2016-04-06 Thread Nick Pentreath
Hi there, In writing some tests for a PR I'm working on, with a more complex array type in a DF, I ran into this issue (running off latest master). Any thoughts?

    // create DF with a column of Array[(Int, Double)]
    val df = sc.parallelize(Seq(
      (0, Array((1, 6.0), (1, 4.0))),
      (1, Array((1, 3.0),

Re: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0

2016-04-05 Thread Nick Pentreath
+1 for this proposal - as you mention I think it's the defacto current situation anyway. Note that from a developer view it's just the user-facing API that will be only "ml" - the majority of the actual algorithms still operate on RDDs under the hood currently. On Wed, 6 Apr 2016 at 05:03, Chris

Re: is there any way to make WEB UI auto-refresh?

2016-03-15 Thread Nick Pentreath
You may want to check out https://github.com/hammerlab/spree On Tue, 15 Mar 2016 at 10:43 charles li wrote: > every time I can only get the latest info by refreshing the page, that's a > little boring. > > so is there any way to make the WEB UI auto-refreshing ? > > >

Re: [MLlib - ALS] Merging two Models?

2016-03-15 Thread Nick Pentreath
By the way, I created a JIRA for supporting initial model for warm start ALS here: https://issues.apache.org/jira/browse/SPARK-13856 On Fri, 11 Mar 2016 at 09:14, Nick Pentreath <nick.pentre...@gmail.com> wrote: > Sean's old Myrrix slides contain an overview of the fold-in mat
