Re: Re: how to call recommend method from ml.recommendation.ALS

2017-03-15 Thread Yuhao Yang
This is something that was just added to ML and will probably be released with 2.2. For now you can try to copy from the master code: https://github.com/apache/spark/blob/70f9d7f71c63d2b1fdfed75cb7a59285c272a62b/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala#L352 and give it a tr

Re: [MLlib] kmeans random initialization, same seed every time

2017-03-14 Thread Yuhao Yang
Hi Julian, Thanks for reporting this. This is a valid issue and I created https://issues.apache.org/jira/browse/SPARK-19957 to track it. Right now the seed is set to this.getClass.getName.hashCode.toLong by default, which indeed stays the same across multiple fits. Feel free to leave your comments
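
To see why a seed derived from a class name never changes, here is a minimal plain-Python sketch (not Spark code). It emulates Java's String.hashCode and feeds the result to a random sampler. The class name string, the helper names, and the toy points are assumptions made for illustration only.

```python
import random


def java_string_hashcode(s: str) -> int:
    """Emulate Java's String.hashCode for ASCII strings (signed 32-bit result)."""
    h = 0
    for ch in s:
        # Java computes h = 31 * h + char with 32-bit overflow.
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    # Map the unsigned value back into Java's signed int range.
    return h - 0x100000000 if h >= 0x80000000 else h


# Assumed class name, used only to illustrate the idea.
CLASS_NAME = "org.apache.spark.ml.clustering.KMeans"


def default_seed(class_name: str = CLASS_NAME) -> int:
    """A seed that depends only on the class name, so it is constant."""
    return java_string_hashcode(class_name)


def pick_initial_centers(points, k, seed):
    """Hypothetical stand-in for random initialization: sample k points."""
    rng = random.Random(seed)
    return rng.sample(points, k)


points = [(float(i), float(i % 7)) for i in range(100)]

# Two "fits" with the default seed start from identical centers.
first = pick_initial_centers(points, 3, default_seed())
second = pick_initial_centers(points, 3, default_seed())
print(first == second)

# Passing an explicit seed per run is how a caller gets control over this.
other = pick_initial_centers(points, 3, seed=12345)
```

Until the issue is resolved, setting the seed explicitly on each fit is the obvious workaround if different initializations are wanted.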

Re: how to construct parameter for model.transform() from datafile

2017-03-14 Thread Yuhao Yang
Hi Jinhong, Based on the error message, your second collection of vectors has a dimension of 804202, while the dimension of your training vectors was 144109. So please make sure your test dataset has the same dimension as the training data. From the test dataset you posted, the vector dimens
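
A quick way to catch this kind of mismatch before calling transform is to compare the declared vector sizes up front. The following is a minimal plain-Python sketch, not Spark code; SparseVec and check_dimensions are hypothetical names, and the sample vectors are made up apart from the two dimensions quoted above.

```python
from typing import List, NamedTuple, Sequence, Tuple


class SparseVec(NamedTuple):
    """Hypothetical stand-in for a sparse vector: declared size plus entries."""
    size: int
    indices: Sequence[int]
    values: Sequence[float]


def check_dimensions(expected_size: int,
                     vectors: Sequence[SparseVec]) -> List[Tuple[int, int]]:
    """Return (position, size) for every vector whose size differs."""
    return [(i, v.size) for i, v in enumerate(vectors) if v.size != expected_size]


train_dim = 144109  # dimension of the training vectors, from the error message

test_vectors = [
    SparseVec(144109, [0, 5], [1.0, 2.0]),  # matches the training dimension
    SparseVec(804202, [3], [1.0]),          # the mismatching dimension
]

mismatches = check_dimensions(train_dim, test_vectors)
print(mismatches)  # [(1, 804202)]
```

An empty result means every test vector can be fed to a model trained on vectors of that size.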

Re: FPGrowth Model is taking too long to generate frequent item sets

2017-03-14 Thread Yuhao Yang
Hi Raju, Have you tried setNumPartitions with a larger number? 2017-03-07 0:30 GMT-08:00 Eli Super : > Hi > > It's area of knowledge , you will need to read online several hours about > it > > What is your programming language ? > > Try search online : "machine learning binning %my_programing_la

Sharing my DataFrame (DataSet) cheat sheet.

2017-03-04 Thread Yuhao Yang
Sharing some snippets I accumulated while developing with Apache Spark DataFrame (DataSet). Hope they help you in some way. https://github.com/hhbyyh/DataFrameCheatSheet. Regards, Yuhao Yang

Re: scikit-learn and mllib difference in predictions python

2016-12-25 Thread Yuhao Yang
Hi ioanna, I'd like to help look into it. Is there a way to access your training data? 2016-12-20 17:21 GMT-08:00 ioanna : > I have an issue with an SVM model trained for binary classification using > Spark 2.0.0. > I have followed the same logic using scikit-learn and MLlib, using the > exact >

Re: spark linear regression error training dataset is empty

2016-12-25 Thread Yuhao Yang
Hi Xiaomeng, Have you tried confirming the DataFrame contents before fitting, for example with assembleddata.show()? Regards, Yuhao 2016-12-21 10:05 GMT-08:00 Xiaomeng Wan : > Hi, > > I am running linear regression on a dataframe and get the following error: > > Exception in thread "main" ja

Re: Multilabel classification with Spark MLlib

2016-11-29 Thread Yuhao Yang
If problem transformation is not an option ( https://en.wikipedia.org/wiki/Multi-label_classification#Problem_transformation_methods), I would try to develop a customized algorithm based on MultilayerPerceptronClassifier, in which you probably need to rewrite LabelConverter. 2016-11-29 9:02 GMT-08

Re: OutOfMemoryError - When saving Word2Vec

2016-06-13 Thread Yuhao Yang
Hi Sharad, what's your vocabulary size and vector length for Word2Vec? Regards, Yuhao 2016-06-13 20:04 GMT+08:00 sharad82 : > Is this the right forum to post Spark related issues ? I have tried this > forum along with StackOverflow but not seeing any response. > > > > -- > View this message in

Re: Ignore features in Random Forest

2016-06-01 Thread Yuhao Yang
Hi Neha, This looks like a feature engineering task. I think VectorSlicer can help with your case. Please refer to http://spark.apache.org/docs/latest/ml-features.html#vectorslicer . Regards, Yuhao 2016-06-01 21:18 GMT+08:00 Neha Mehta : > Hi, > > I am performing Regression using Random Forest.
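
The idea behind slicing a feature vector can be sketched in plain Python as follows. This only illustrates the concept of keeping a subset of feature indices; it is not the VectorSlicer API, and the function names and sample row are hypothetical.

```python
from typing import List, Sequence


def indices_without(num_features: int, ignored: Sequence[int]) -> List[int]:
    """Indices of all features except the ones to ignore, in original order."""
    ignored_set = set(ignored)
    return [i for i in range(num_features) if i not in ignored_set]


def slice_features(features: Sequence[float],
                   keep_indices: Sequence[int]) -> List[float]:
    """Build a new, shorter feature vector from the kept indices."""
    return [features[i] for i in keep_indices]


row = [0.5, 3.0, 7.2, 1.1]                        # made-up feature vector
keep = indices_without(len(row), ignored=[1, 3])  # drop features 1 and 3
sliced = slice_features(row, keep)
print(keep, sliced)  # [0, 2] [0.5, 7.2]
```

The model is then trained on the sliced vectors, so the ignored features never reach it.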