Re: MLlib libsvm isssues with data

2015-05-19 Thread Xiangrui Meng
The index should start from 1 for LIBSVM format, as defined in the README of LIBSVM (https://github.com/cjlin1/libsvm/blob/master/README#L64). The only exception is the precomputed kernel, which MLlib doesn't support. -Xiangrui On Wed, May 6, 2015 at 1:42 AM, doyere wrote: > Hi all, > > After do
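
A minimal sketch of the 1-based convention described above, in plain Python. The helper name `to_libsvm_line` is hypothetical and is not part of MLlib or LIBSVM; it only shows how a 0-based position in a dense list maps to a 1-based LIBSVM index.

```python
def to_libsvm_line(label, features):
    """Format one example as a LIBSVM text line.

    `features` is a dense list of numbers. Zero entries are skipped,
    and the 0-based list position i is written as the 1-based index i + 1.
    """
    parts = [str(label)]
    for i, value in enumerate(features):
        if value != 0:
            parts.append(f"{i + 1}:{value}")
    return " ".join(parts)


line = to_libsvm_line(1, [0.5, 0.0, 2.0])
print(line)  # 1 1:0.5 3:2.0
```

A file whose first feature index is 0 would not follow this convention.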

Re: RandomSplit with Spark-ML and Dataframe

2015-05-19 Thread Xiangrui Meng
In 1.4, we added RAND as a DataFrame expression, which can be used for random split. Please check the example here: https://github.com/apache/spark/blob/master/python/pyspark/ml/tuning.py#L214. -Xiangrui On Thu, May 7, 2015 at 8:39 AM, Olivier Girardot wrote: > Hi, > is there any best practice to

Re: User Defined Type (UDT)

2015-05-19 Thread Xiangrui Meng
(Note that UDT is not a public API yet.) On Thu, May 7, 2015 at 7:11 AM, wjur wrote: > Hi all! > > I'm using Spark 1.3.0 and I'm struggling with a definition of a new type for > a project I'm working on. I've created a case class Person(name: String) and > now I'm trying to make Spark to be able

Re: Implicit matrix factorization returning different results between spark 1.2.0 and 1.3.0

2015-05-19 Thread Xiangrui Meng
>> work better in practice, but I have no evidence of this. It may make sense >> to weight lambda by sum_i cij instead? >> >> >> >> >> >> On Wed, Apr 1, 2015 at 7:59 PM, Xiangrui Meng wrote: >>> >>> Ravi, we just merged https://is

Re: Discretization

2015-05-19 Thread Xiangrui Meng
Thanks for asking! We should improve the documentation. The sample dataset is actually mimicking the MNIST digits dataset, where the values are gray levels (0-255). So by dividing by 16, we want to map it to 16 coarse bins for the gray levels. Actually, there is a bug in the doc, we should convert

Re: How to implement an Evaluator for a ML pipeline?

2015-05-19 Thread Xiangrui Meng
The documentation needs to be updated to state that higher metric values are better (https://issues.apache.org/jira/browse/SPARK-7740). I don't know why if you negate the return value of the Evaluator you still get the highest regularization parameter candidate. Maybe you should check the log messa

Re: Find KNN in Spark SQL

2015-05-19 Thread Xiangrui Meng
Spark SQL doesn't provide spatial features. Large-scale KNN is usually combined with locality-sensitive hashing (LSH). This Spark package may be helpful: http://spark-packages.org/package/mrsqueeze/spark-hash. -Xiangrui On Sat, May 9, 2015 at 9:25 PM, Dong Li wrote: > Hello experts, > > I’m new t

Re: SQL UserDefinedType can't be saved in parquet file when using assembly jar

2015-05-19 Thread Xiangrui Meng
ectorUDT do exist in the assembly jar and in this example all the >> code is in a single file to make sure every thing is included. >> >> On Tue, Apr 21, 2015 at 1:17 AM, Xiangrui Meng wrote: >>> >>> You should check where MyDenseVectorUDT is defined and whether it

Re: spark mllib kmeans

2015-05-19 Thread Xiangrui Meng
Just curious, what distance measure do you need? -Xiangrui On Mon, May 11, 2015 at 8:28 AM, Jaonary Rabarisoa wrote: > take a look at this > https://github.com/derrickburns/generalized-kmeans-clustering > > Best, > > Jao > > On Mon, May 11, 2015 at 3:55 PM, Driesprong, Fokko > wrote: >> >> Hi Pa

Re: Stratified sampling with DataFrames

2015-05-19 Thread Xiangrui Meng
You need to convert DataFrame to RDD, call sampleByKey, and then apply the schema back to create DataFrame. val df: DataFrame = ... val schema = df.schema val sampledRDD = df.rdd.keyBy(r => r.getAs[Int](0)).sampleByKey(...).values val sampled = sqlContext.createDataFrame(sampledRDD, schema) Hopef
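
To illustrate what per-key sampling does, here is a plain-Python sketch that does not use Spark. The function `sample_by_key` below is a hypothetical stand-in that keeps each (key, value) pair with the probability given for its key; it assumes sampling without replacement and is not the Spark implementation.

```python
import random


def sample_by_key(pairs, fractions, seed=None):
    """Keep each (key, value) pair with probability fractions[key].

    Keys missing from `fractions` are dropped (treated as fraction 0.0).
    """
    rng = random.Random(seed)
    return [(k, v) for k, v in pairs if rng.random() < fractions.get(k, 0.0)]


rows = [(0, "a"), (0, "b"), (1, "c"), (1, "d")]
# Keep every row of stratum 0 and no row of stratum 1.
sampled = sample_by_key(rows, {0: 1.0, 1: 0.0}, seed=42)
print(sampled)  # [(0, 'a'), (0, 'b')]
```

With fractions strictly between 0 and 1 the result depends on the seed, so each stratum is only approximately sampled at the requested rate.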

Re: question about customize kmeans distance measure

2015-05-19 Thread Xiangrui Meng
MLlib only supports Euclidean distance for k-means. You can find Bregman divergence support in Derrick's package: http://spark-packages.org/package/derrickburns/generalized-kmeans-clustering. Which distance measure do you want to use? -Xiangrui On Tue, May 12, 2015 at 7:23 PM, June wrote: > Dear

Re: Increase maximum amount of columns for covariance matrix for principal components

2015-05-19 Thread Xiangrui Meng
We use a dense array to store the covariance matrix on the driver node. So its length is limited by the integer range, which is 65536 * 65536 (actually half). -Xiangrui On Wed, May 13, 2015 at 1:57 AM, Sebastian Alfers wrote: > Hello, > > > in order to compute a huge dataset, the amount of column
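
The limit can be checked with a short calculation. This sketch assumes that only one triangle of the symmetric matrix is stored (the "half" mentioned above), so n columns need n * (n + 1) / 2 entries, and that the array length is bounded by the largest 32-bit signed integer. The function names are illustrative only.

```python
import math

INT_MAX = 2**31 - 1  # largest length of an array indexed by a 32-bit signed int


def packed_length(n):
    """Entries needed for one triangle (with diagonal) of an n x n matrix."""
    return n * (n + 1) // 2


def max_columns(limit=INT_MAX):
    """Largest n with n * (n + 1) / 2 <= limit, from the quadratic formula."""
    return (math.isqrt(8 * limit + 1) - 1) // 2


n = max_columns()
print(n)                               # 65535
print(packed_length(n) <= INT_MAX)     # True
print(packed_length(n + 1) <= INT_MAX) # False
```

Under these assumptions the matrix can have at most 65535 columns.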

Re: Word2Vec with billion-word corpora

2015-05-19 Thread Xiangrui Meng
With vocabulary size 4M and 400 vector size, you need 400 * 4M = 1.6B floats to store the model. That is 6.4GB. We store the model on the driver node in the current implementation. So I don't think it would work. You might try increasing the minCount to decrease the vocabulary size and reduce the vec

Re: k-means core function for temporal geo data

2015-05-19 Thread Xiangrui Meng
I'm not sure whether k-means would converge with this customized distance measure. You can list (weighted) time as a feature along with coordinates, and then use Euclidean distance. For other supported distance measures, you can check Derrick's package: http://spark-packages.org/package/derrickburn

Re: LogisticRegressionWithLBFGS with large feature set

2015-05-19 Thread Xiangrui Meng
For ML applications, the best setting is to set the number of partitions to match the number of cores to reduce shuffle size. You have 3072 partitions but 128 executors, which causes the overhead. For the MultivariateOnlineSummarizer, we plan to add flags to specify what needs to be computed to reduce

Re: User Defined Type (UDT)

2015-05-20 Thread Xiangrui Meng
ing them to support java 8's ZonedDateTime. > > On Tue, May 19, 2015 at 3:14 PM Xiangrui Meng wrote: >> >> (Note that UDT is not a public API yet.) >> >> On Thu, May 7, 2015 at 7:11 AM, wjur wrote: >> > Hi all! >> > >> > I'm using

Re: FP Growth saveAsTextFile

2015-05-20 Thread Xiangrui Meng
Could you post the stack trace? If you are using Spark 1.3 or 1.4, it would be easier to save freq itemsets as a Parquet file. -Xiangrui On Wed, May 20, 2015 at 12:16 PM, Eric Tanner wrote: > I am having trouble with saving an FP-Growth model as a text file. I can > print out the results, but wh

Re: FP Growth saveAsTextFile

2015-05-20 Thread Xiangrui Meng
esultTask.scala:61) > at org.apache.spark.scheduler.Task.run(Task.scala:64) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java

Re: Pandas timezone problems

2015-05-21 Thread Xiangrui Meng
These are relevant: JIRA: https://issues.apache.org/jira/browse/SPARK-6411 PR: https://github.com/apache/spark/pull/6250 On Thu, May 21, 2015 at 3:16 PM, Def_Os wrote: > After deserialization, something seems to be wrong with my pandas DataFrames. > It looks like the timezone information is lost

Re: MLLib: instance weight

2015-06-17 Thread Xiangrui Meng
Hi Gajan, Please subscribe to our user mailing list, which is the best place to get your questions answered. We don't have weighted instance support, but it should be easy to add and we plan to do it in the next release (1.5). Thanks for asking! Best, Xiangrui On Wed, Jun 17, 2015 at 2:33 PM, Gajan

Re: Parallel parameter tuning: distributed execution of MLlib algorithms

2015-06-17 Thread Xiangrui Meng
On Fri, May 22, 2015 at 6:15 AM, Hugo Ferreira wrote: > Hi, > > I am currently experimenting with linear regression (SGD) (Spark + MLlib, > ver. 1.2). At this point in time I need to fine-tune the hyper-parameters. I > do this (for now) by an exhaustive grid search of the step size and the > numbe

Re: Collabrative Filtering

2015-06-17 Thread Xiangrui Meng
Please follow the code examples from the user guide: http://spark.apache.org/docs/latest/programming-guide.html#passing-functions-to-spark. -Xiangrui On Tue, May 26, 2015 at 12:34 AM, Yasemin Kaya wrote: > Hi, > > In CF > > String path = "data/mllib/als/test.data"; > JavaRDD data = sc.textFile

Re: Inconsistent behavior with Dataframe Timestamp between 1.3.1 and 1.4.0

2015-06-17 Thread Xiangrui Meng
That sounds like a bug. Could you create a JIRA and ping Yin Huai (cc'ed). -Xiangrui On Wed, May 27, 2015 at 12:57 AM, Justin Yip wrote: > Hello, > > I am trying out 1.4.0 and notice there are some differences in behavior with > Timestamp between 1.3.1 and 1.4.0. > > In 1.3.1, I can compare a Tim

Re: spark mlib variance analysis

2015-06-17 Thread Xiangrui Meng
We don't have R-like model summary in MLlib, but we plan to add some in 1.5. Please watch https://issues.apache.org/jira/browse/SPARK-7674. -Xiangrui On Thu, May 28, 2015 at 3:47 PM, rafac wrote: > I have a simple problem: > i got mean number of people on one place by hour(time-series like), and

Re: RandomForest - subsamplingRate parameter

2015-06-17 Thread Xiangrui Meng
Because we don't have random access to the record, sampling still needs to go through the records sequentially. It does save some computation, which is perhaps noticeable only if you have data cached in memory. Different random seeds are used for trees. -Xiangrui On Wed, Jun 3, 2015 at 4:40 PM, And

Re: Why the default Params.copy doesn't work for Model.copy?

2015-06-17 Thread Xiangrui Meng
That is a bug, which will be fixed in https://github.com/apache/spark/pull/6622. I disabled Model.copy because models usually don't have a default constructor and hence the default Params.copy implementation won't work. Unfortunately, due to insufficient test coverage, StringIndexModel.copy is

Re: redshift spark

2015-06-17 Thread Xiangrui Meng
Hi Hafiz, As Ewan mentioned, the path is the path to the S3 files unloaded from Redshift. This is a more scalable way to get a large amount of data from Redshift than via JDBC. I'd recommend using the SQL API instead of the Hadoop API (https://github.com/databricks/spark-redshift). Best, Xiangrui

Re: Optimization module in Python mllib

2015-06-17 Thread Xiangrui Meng
There is no plan at this time. We haven't reached 100% coverage on user-facing API in PySpark yet, which would have higher priority. -Xiangrui On Sun, Jun 7, 2015 at 1:42 AM, martingoodson wrote: > Am I right in thinking that Python mllib does not contain the optimization > module? Are there plan

Re: Official Mllib API does not correspond to auto completion

2015-06-17 Thread Xiangrui Meng
I don't fully understand your question. Could you explain it in more detail? Thanks! -Xiangrui On Mon, Jun 8, 2015 at 2:26 AM, Jean-Charles RISCH < risch.jeanchar...@gmail.com> wrote: > Hi, > > I am playing with Mllib (Spark 1.3.1) and my auto completion propositions > don't correspond to the of

Re: k-means for text mining in a streaming context

2015-06-17 Thread Xiangrui Meng
Yes. You can apply HashingTF on your input stream and then use StreamingKMeans for training and prediction. -Xiangrui On Mon, Jun 8, 2015 at 11:05 AM, Ruslan Dautkhanov wrote: > Hello, > > https://spark.apache.org/docs/latest/mllib-feature-extraction.html > would Feature Extraction and Transforma

Re: How to use Apache spark mllib Model output in C++ component

2015-06-17 Thread Xiangrui Meng
In 1.3, we added some model save/load support in Parquet format. You can use Parquet's C++ library (https://github.com/Parquet/parquet-cpp) to load the data back. -Xiangrui On Wed, Jun 10, 2015 at 12:15 AM, Akhil Das wrote: > Hope Swig and JNA might help for accessing c++ libraries from Java. > >

Re: Efficient way to get top K values per key in (key, value) RDD?

2015-06-17 Thread Xiangrui Meng
This is implemented in MLlib: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/MLPairRDDFunctions.scala#L41. -Xiangrui On Wed, Jun 10, 2015 at 1:53 PM, erisa wrote: > Hi, > > I am a Spark newbie, and trying to solve the same problem, and have > implement

Re: Does MLLib has attribute importance?

2015-06-17 Thread Xiangrui Meng
We don't have it in MLlib. The closest would be the ChiSqSelector, which works for categorical data. -Xiangrui On Thu, Jun 11, 2015 at 4:33 PM, Ruslan Dautkhanov wrote: > What would be closest equivalent in MLLib to Oracle Data Miner's Attribute > Importance mining function? > > http://docs.oracl

Re: Not albe to run FP-growth Example

2015-06-17 Thread Xiangrui Meng
You should add spark-mllib_2.10 as a dependency instead of declaring it as the artifactId. And always use the same version for spark-core and spark-mllib. I saw you used 1.3.0 for spark-core but 1.4.0 for spark-mllib, which is not guaranteed to work. If you set the scope to "provided", mllib jar wo

Re: *Metrics API is odd in MLLib

2015-06-17 Thread Xiangrui Meng
LabeledPoint was used for both classification and regression, where label type is Double for simplicity. So in BinaryClassificationMetrics, we still use Double for labels. We compute the confusion matrix at each threshold internally, but this is not exposed to users ( https://github.com/apache/spar

Re: What is the right algorithm to do cluster analysis with mixed numeric, categorical, and string value attributes?

2015-06-17 Thread Xiangrui Meng
You can try hashing to control the feature dimension. MLlib's k-means implementation can handle sparse data efficiently if the number of features is not huge. -Xiangrui On Tue, Jun 16, 2015 at 2:44 PM, Rex X wrote: > Hi Sujit, > > That's a good point. But 1-hot encoding will make our data changin

Re: Issue with PySpark UDF on a column of Vectors

2015-06-18 Thread Xiangrui Meng
This is a known issue. See https://issues.apache.org/jira/browse/SPARK-7902 -Xiangrui On Thu, Jun 18, 2015 at 6:41 AM, calstad wrote: > I am having trouble using a UDF on a column of Vectors in PySpark which can > be illustrated here: > > from pyspark import SparkContext > from pyspark.sql import

Re: Does MLLib has attribute importance?

2015-06-18 Thread Xiangrui Meng
; > > -- > Ruslan Dautkhanov > > On Wed, Jun 17, 2015 at 5:50 PM, Xiangrui Meng wrote: >> >> We don't have it in MLlib. The closest would be the ChiSqSelector, >> which works for categorical data. -Xiangrui >> >> On Thu, Jun 11, 2015 at 4:33 PM, R

Re: Settings for K-Means Clustering in Mlib for large data set

2015-06-18 Thread Xiangrui Meng
With 80,000 features and 1000 clusters, you need 80,000,000 doubles to store the cluster centers. That is ~600MB. If there are 10 partitions, you might need 6GB on the driver to collect updates from workers. I guess the driver died. Did you specify driver memory with spark-submit? -Xiangrui On Thu
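
The estimate above can be reproduced with a few lines of arithmetic. This sketch assumes 8 bytes per double and, as in the message, one full copy of the centers collected per partition; the function name is illustrative only.

```python
BYTES_PER_DOUBLE = 8


def centers_bytes(num_features, num_clusters):
    """Memory needed for one dense copy of all cluster centers."""
    return num_features * num_clusters * BYTES_PER_DOUBLE


one_copy = centers_bytes(80_000, 1_000)  # 80,000,000 doubles
num_partitions = 10                       # value taken from the message
on_driver = one_copy * num_partitions     # one update per partition

print(one_copy)             # 640000000 bytes, about 610 MiB
print(on_driver / 2**30)    # about 5.96 GiB
```

If the driver memory passed to spark-submit is below this figure, the driver running out of memory is a plausible explanation.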

Re: NaiveBayes for MLPipeline is absent

2015-06-19 Thread Xiangrui Meng
Hi Justin, We plan to add it in 1.5, along with some other estimators. We are now preparing a list of JIRAs, but feel free to create a JIRA for this and submit a PR:) Best, Xiangrui On Thu, Jun 18, 2015 at 6:35 PM, Justin Yip wrote: > Hello, > > Currently, there is no NaiveBayes implementation

Re: Settings for K-Means Clustering in Mlib for large data set

2015-06-23 Thread Xiangrui Meng
; SPARK_WORKER_CORES=8 SPARK_WORKER_MEMORY=15g SPARK_MEM=30g OUR_JAVA_MEM=30g >> SPARK_DAEMON_JAVA_OPTS="-XX:MaxPermSize=30g -Xms30g -Xmx30g" IPYTHON=1 >> PYSPARK_PYTHON=/usr/bin/python SPARK_PRINT_LAUNCH_COMMAND=1 >> ./spark/bin/pyspark --master >> spark://54.165.202.17.

Re: Spark FP-Growth algorithm for frequent sequential patterns

2015-06-23 Thread Xiangrui Meng
This is on the wish list for Spark 1.5. Assuming that the items from the same transaction are distinct, we can still follow FP-Growth's steps: 1. find frequent items 2. filter transactions and keep only frequent items 3. do NOT order by frequency 4. use suffix to partition the transactions (whethe

Re: Failed stages and dropped executors when running implicit matrix factorization/ALS

2015-06-23 Thread Xiangrui Meng
It shouldn't be hard to handle 1 billion ratings in 1.3. Just need more information to guess what happened: 1. Could you share the ALS settings, e.g., number of blocks, rank and number of iterations, as well as number of users/items in your dataset? 2. If you monitor the progress in the WebUI, how

Re: How could output the StreamingLinearRegressionWithSGD prediction result?

2015-06-23 Thread Xiangrui Meng
Please check the input path to your test data, and call `.count()` and see whether there are records in it. -Xiangrui On Sat, Jun 20, 2015 at 9:23 PM, Gavin Yue wrote: > Hey, > > I am testing the StreamingLinearRegressionWithSGD following the tutorial. > > > It works, but I could not output the p

Re: which mllib algorithm for large multi-class classification?

2015-06-23 Thread Xiangrui Meng
We have multinomial logistic regression implemented. For your case, the model size is 500 * 300,000 = 150,000,000. MLlib's implementation might not be able to handle it efficiently; we plan to have a more scalable implementation in 1.5. However, it shouldn't give you an "array larger than MaxInt" e

Re: mutable vs. pure functional implementation - StatCounter

2015-06-23 Thread Xiangrui Meng
Creating millions of temporary (immutable) objects is bad for performance. It should be simple to do a micro-benchmark locally. -Xiangrui On Mon, Jun 22, 2015 at 7:25 PM, mzeltser wrote: > Using StatCounter as an example, I'd like to understand if "pure" functional > implementation would be more

Re: NaiveBayes for MLPipeline is absent

2015-06-25 Thread Xiangrui Meng
FYI, I made a JIRA for this: https://issues.apache.org/jira/browse/SPARK-8600. -Xiangrui On Fri, Jun 19, 2015 at 3:01 PM, Xiangrui Meng wrote: > Hi Justin, > > We plan to add it in 1.5, along with some other estimators. We are now > preparing a list of JIRAs, but feel free to creat

Re: Failed stages and dropped executors when running implicit matrix factorization/ALS

2015-06-26 Thread Xiangrui Meng
>> I am also having that with Pyspark 1.4 >> 380 Million observations >> 100 factors and 5 iterations >> Thanks >> Ayman >> >> On Jun 23, 2015, at 6:20 PM, Xiangrui Meng wrote: >> >> > It shouldn't be hard to handle 1 billion rating

Re: Failed stages and dropped executors when running implicit matrix factorization/ALS

2015-06-26 Thread Xiangrui Meng
Farahat >> wrote: >>> >>> was there any resolution to that problem? >>> I am also having that with Pyspark 1.4 >>> 380 Million observations >>> 100 factors and 5 iterations >>> Thanks >>> Ayman >>> >>> On Ju

Re: Failed stages and dropped executors when running implicit matrix factorization/ALS

2015-06-26 Thread Xiangrui Meng
ue.ygrid.yahoo.com:8088/proxy/application_1433921068880_943447/storage/rdd?id=54> > Memory > Deserialized 1x Replicated 100 100% 35.0 GB 0.0 B 0.0 B userOutBlocks > <http://mithrilblue-jt1.blue.ygrid.yahoo.com:8088/proxy/application_1433921068880_943447/storage/rdd?id=48> > Memory > Deserialized 1x Replicated 100 100% 1062.7 MB 0.0 B 0.0 B > > On Jun 26, 2015, at 8:26 AM, Xiangrui Meng wrote: > > number of CPU cores or less. > > >

Re: Spark FP-Growth algorithm for frequent sequential patterns

2015-06-28 Thread Xiangrui Meng
h I just submitted a PR for :) > > On Tue, Jun 23, 2015 at 6:09 PM, Xiangrui Meng wrote: >> >> This is on the wish list for Spark 1.5. Assuming that the items from >> the same transaction are distinct. We can still follow FP-Growth's >> steps: >> >> 1. find

Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0

2016-04-05 Thread Xiangrui Meng
Hi all, More than a year ago, in Spark 1.2 we introduced the ML pipeline API built on top of Spark SQL’s DataFrames. Since then the new DataFrame-based API has been developed under the spark.ml package, while the old RDD-based API has been developed in parallel under the spark.mllib package. While

Re: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0

2016-04-05 Thread Xiangrui Meng
amespace in the 2.x series ? > > Thanks > Shivaram > > On Tue, Apr 5, 2016 at 11:48 AM, Sean Owen wrote: > > FWIW, all of that sounds like a good plan to me. Developing one API is > > certainly better than two. > > > > On Tue, Apr 5, 2016 at 7:01 PM, Xiangrui M

Re: Is JavaSparkContext.wholeTextFiles distributed?

2016-04-28 Thread Xiangrui Meng
It implements CombineInputFormat from Hadoop. isSplittable=false means each individual file cannot be split. If you only see one partition even with a large minPartitions, perhaps the total size of files is not big enough. Those are configurable in Hadoop conf. -Xiangrui On Tue, Apr 26, 2016, 8:32

Re: dataframe stat corr for multiple columns

2016-05-19 Thread Xiangrui Meng
This is nice to have. Please create a JIRA for it. Right now, you can merge all columns into a vector column using RFormula or VectorAssembler, then convert it into an RDD and call corr from MLlib. On Tue, May 17, 2016, 7:09 AM Ankur Jain wrote: > Hello Team, > > > > In my current usecase I am l

Re: Custom Spark Error on Hadoop Cluster

2016-07-07 Thread Xiangrui Meng
This seems like a deployment or dependency issue. Please check the following: 1. The unmodified Spark jars were not on the classpath (already existed on the cluster or pulled in by other packages). 2. The modified jars were indeed deployed to both master and slave nodes. On Tue, Jul 5, 2016 at 12:

Re: Custom Spark Error on Hadoop Cluster

2016-07-11 Thread Xiangrui Meng
We change entirely the contents of the directory for SPARK_HOME. The newly > built customized spark is the new contents of the current SPARK_HOME we > have right now. > > Thanks, > > Alger > > On Fri, Jul 8, 2016 at 1:32 PM, Xiangrui Meng wrote: > >> This seems

Re: Custom Spark Error on Hadoop Cluster

2016-07-18 Thread Xiangrui Meng
>>>>> On Wed, Jul 13, 2016 at 4:21 AM, Alger Remirata < >>>>> abremirat...@gmail.com> wrote: >>>>> >>>>>> Thanks for the reply however I couldn't locate the MLlib jar. What I >>>>>> have is a fat '

Re: AFTSurvivalRegression Prediction and QuantilProbabilities

2016-02-17 Thread Xiangrui Meng
You can get the response, and also quantiles if you enable quantilesCol. You can change quantile probabilities as well. There is some example code from the user guide: http://spark.apache.org/docs/latest/ml-classification-regression.html#survival-regression. -Xiangrui On Mon, Feb 1, 2016 at 9:09 AM Chris

Re: How to train and predict in parallel via Spark MLlib?

2016-02-18 Thread Xiangrui Meng
If you have a big cluster, you can trigger training jobs in different threads on the driver. Putting RDDs inside an RDD won't work. -Xiangrui On Thu, Feb 18, 2016, 4:28 AM Igor L. wrote: > Good day, Spark team! > I have to solve regression problem for different restricitons. There is a > bunch o
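
The idea of launching independent training jobs from separate driver threads can be sketched with the standard library alone. Everything here is a placeholder: `train` stands in for a real MLlib training call on shared data, and the parameter grid is made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def train(params):
    """Placeholder for a training job submitted from the driver.

    In a real driver program this would call a training method on a
    shared dataset; here it only echoes its parameters.
    """
    return {"params": params, "model": f"model-for-{params['stepSize']}"}


# Hypothetical grid of settings, one training job per entry.
param_grid = [{"stepSize": s} for s in (0.01, 0.1, 1.0)]

# Each job runs in its own thread; results come back in grid order.
with ThreadPoolExecutor(max_workers=len(param_grid)) as pool:
    models = list(pool.map(train, param_grid))

print(len(models))  # 3
```

The threads only submit and wait for jobs, so the heavy computation would still run on the cluster rather than on the driver.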

Re: Why no computations run on workers/slaves in cluster mode?

2016-02-18 Thread Xiangrui Meng
Did the test program finish and did you see any error messages from the console? -Xiangrui On Wed, Feb 17, 2016, 1:49 PM Junjie Qian wrote: > Hi all, > > I am new to Spark, and have one problem that, no computations run on > workers/slave_servers in the standalone cluster mode. > > The Spark ver

Re: How to train and predict in parallel via Spark MLlib?

2016-02-19 Thread Xiangrui Meng
e. > > Thanks a lot for your participation! > --Igor > > 2016-02-18 17:24 GMT+03:00 Xiangrui Meng : > >> If you have a big cluster, you can trigger training jobs in different >> threads on the driver. Putting RDDs inside an RDD won't work. -Xiangrui >> >> On Th

Re: How to unpersist RDDs generated by ALS/MatrixFactorizationModel

2015-07-27 Thread Xiangrui Meng
Hi Stahlman, finalRDDStorageLevel is the storage level for the final user/item factors. It is not common to set it to StorageLevel.NONE, unless you want to save the factors directly to disk. So if it is NONE, we cannot unpersist the intermediate RDDs (in/out blocks) because the final user/item fac

Re: Proper saving/loading of MatrixFactorizationModel

2015-07-27 Thread Xiangrui Meng
The partitioner is not saved with the RDD. So when you load the model back, we lose the partitioner information. You can call repartition on the user/product factors and then create a new MatrixFactorizationModel object using the repartitioned RDDs. It would be useful to create a utility method for

Re: MovieALS Implicit Error

2015-07-27 Thread Xiangrui Meng
Hi Benedict, Did you set lambda to zero? -Xiangrui On Mon, Jul 13, 2015 at 4:18 AM, Benedict Liang wrote: > Hi Sean, > > This user dataset is organic. What do you think is a good ratings threshold > then? I am only encountering this with the implicit type though. The > explicit type works fine th

Re: Out of Memory Errors on less number of cores in proportion to Partitions in Data

2015-07-27 Thread Xiangrui Meng
Hi Aniruddh, Increasing number of partitions doesn't always help in ALS due to communication/computation trade-off. What rank did you set? If the rank is not large, I'd recommend a small number of partitions. There are some other numbers to watch. Do you have super popular items/users in your data

Re: Cluster sizing for recommendations

2015-07-27 Thread Xiangrui Meng
Hi Danny, You might need to reduce the number of partitions (or set userBlocks and productBlocks directly in ALS). Using a large number of partitions increases shuffle size and memory requirement. If you have 16 x 16 = 256 cores, I would recommend 64 or 128 instead of 2048. model.recommendProduct

Re: Failed stages and dropped executors when running implicit matrix factorization/ALS : Too many values to unpack

2015-07-27 Thread Xiangrui Meng
Scheduler.scala:730) > at scala.Option.foreach(Option.scala:236) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1450) > at > org.apache.spar

Re: Why transformer from ml.Pipeline transform only a DataFrame ?

2015-08-28 Thread Xiangrui Meng
Yes, we will open up APIs in the next release. There was some discussion about the APIs. One approach is to have multiple methods for different outputs like predicted class and probabilities. -Xiangrui On Aug 28, 2015 6:39 AM, "Jaonary Rabarisoa" wrote: > Hi there, > > The actual API of ml.Transform

Re: question about barrier execution mode in Spark 2.4.0

2018-12-19 Thread Xiangrui Meng
On Mon, Nov 12, 2018 at 7:33 AM Joe wrote: > Hello, > I was reading Spark 2.4.0 release docs and I'd like to find out more > about barrier execution mode. > In particular I'd like to know what happens when number of partitions > exceeds number of nodes (which I think is allowed, Spark tuning doc

Re: Spark ML with null labels

2019-01-10 Thread Xiangrui Meng
In your custom transformer that produces labels, can you filter null labels? A transformer doesn't always need to do 1:1 mapping. On Thu, Jan 10, 2019, 7:53 AM Patrick McCarthy I'm trying to implement an algorithm on the MNIST digits that runs like so: > > >- for every pair of digits (0,1), (

SPIP: DataFrame-based Property Graphs, Cypher Queries, and Algorithms

2019-01-15 Thread Xiangrui Meng
Hi all, I want to re-send the previous SPIP on introducing a DataFrame-based graph component to collect more feedback. It supports property graphs, Cypher graph queries, and graph algorithms built on top of the DataFrame API. If you are a GraphX user or your workload is essentially graph queries,

Re: Should python-2 be supported in Spark 3.0?

2019-05-29 Thread Xiangrui Meng
Hi all, I want to revive this old thread since no action was taken so far. If we plan to mark Python 2 as deprecated in Spark 3.0, we should do it as early as possible and let users know ahead. PySpark depends on Python, numpy, pandas, and pyarrow, all of which are sunsetting Python 2 support by 2

Re: Should python-2 be supported in Spark 3.0?

2019-05-30 Thread Xiangrui Meng
From:* Reynold Xin > *Sent:* Thursday, May 30, 2019 12:59:14 AM > *To:* shane knapp > *Cc:* Erik Erlandson; Mark Hamstra; Matei Zaharia; Sean Owen; Wenchen > Fen; Xiangrui Meng; dev; user > *Subject:* Re: Should python-2 be supported in Spark 3.0? > > +1 on Xiangrui’s plan. &

Re: Should python-2 be supported in Spark 3.0?

2019-05-30 Thread Xiangrui Meng
, I'm going to upload it to Spark website and announce it here. Let me know if you think we should do a VOTE instead. On Thu, May 30, 2019 at 9:21 AM Xiangrui Meng wrote: > I created https://issues.apache.org/jira/browse/SPARK-27884 to track the > work. > > On Thu, May 30, 2019

Re: Should python-2 be supported in Spark 3.0?

2019-06-03 Thread Xiangrui Meng
-- > *From:* shane knapp > *Sent:* Friday, May 31, 2019 7:38:10 PM > *To:* Denny Lee > *Cc:* Holden Karau; Bryan Cutler; Erik Erlandson; Felix Cheung; Mark > Hamstra; Matei Zaharia; Reynold Xin; Sean Owen; Wenchen Fen; Xiangrui Meng; > dev; user > *Subj

[ANNOUNCEMENT] Plan for dropping Python 2 support

2019-06-03 Thread Xiangrui Meng
Hi all, Today we announced the plan for dropping Python 2 support [1] in Apache Spark: As many of you already know, the Python core development team and many widely used Python packages like Pandas and NumPy will drop Python 2 suppor

Re: MLib: How to set preferences for ALS implicit feedback in Collaborative Filtering?

2015-01-20 Thread Xiangrui Meng
The assumption of the implicit feedback model is that the unobserved ratings are more likely to be negative. So you may want to add some negatives for evaluation. Otherwise, the input ratings are all 1 and the test ratings are all 1 as well. The baseline predictor, which uses the average rating (that i

Re: How to create distributed matrixes from hive tables.

2015-01-20 Thread Xiangrui Meng
You can get a SchemaRDD from the Hive table, map it into an RDD of Vectors, and then construct a RowMatrix. The transformations are lazy, so there is no external storage requirement for intermediate data. -Xiangrui On Sun, Jan 18, 2015 at 4:07 AM, guxiaobo1982 wrote: > Hi, > > We have large datase

Re: Saving a mllib model in Spark SQL

2015-01-20 Thread Xiangrui Meng
You can save the cluster centers as a SchemaRDD of two columns (id: Int, center: Array[Double]). When you load it back, you can construct the k-means model from its cluster centers. -Xiangrui On Tue, Jan 20, 2015 at 11:55 AM, Cheng Lian wrote: > This is because KMeanModel is neither a built-in ty

Re: KNN for large data set

2015-01-21 Thread Xiangrui Meng
For large datasets, you need hashing in order to compute k-nearest neighbors locally. You can start with LSH + k-nearest in Google scholar: http://scholar.google.com/scholar?q=lsh+k+nearest -Xiangrui On Tue, Jan 20, 2015 at 9:55 PM, DEVAN M.S. wrote: > Hi all, > > Please help me to find out best

Re: word2vec: how to save an mllib model and reload it?

2015-02-09 Thread Xiangrui Meng
We are working on import/export for MLlib models. The umbrella JIRA is https://issues.apache.org/jira/browse/SPARK-4587. In 1.3, we are going to have save/load for linear models, naive Bayes, ALS, and tree models. I created a JIRA for Word2Vec and set the target version to 1.4. If anyone is interes

Re: rdd filter

2015-02-09 Thread Xiangrui Meng
How was this RDD generated? Any randomness involved? -Xiangrui On Mon, Feb 9, 2015 at 10:47 AM, SK wrote: > Hi, > > I am using the filter() method to separate the rdds based on a predicate,as > follows: > > val rdd1 = data.filter (t => { t._2 >0.0 && t._2 <= 1.0}) // t._2 is a > Double > val rdd

Re: word2vec more distributed

2015-02-09 Thread Xiangrui Meng
The C implementation of Word2Vec updates the model using multiple threads without locking. It is hard to implement it in a distributed way. In the MLlib implementation, each worker holds the entire model in memory and outputs the part of the model that gets updated. The driver still needs to collect and aggre

Re: Number of goals to win championship

2015-02-09 Thread Xiangrui Meng
Logistic regression outputs probabilities if the data fits the model assumption. Otherwise, you might need to calibrate its output to correctly read it. You may be interested in reading this: http://fastml.com/classifier-calibration-with-platts-scaling-and-isotonic-regression/. We have isotonic reg

Re: no option to add intercepts for StreamingLinearAlgorithm

2015-02-09 Thread Xiangrui Meng
No particular reason. We didn't add it in the first version. Let's add it in 1.4. -Xiangrui On Thu, Feb 5, 2015 at 3:44 PM, jamborta wrote: > hi all, > > just wondering if there is a reason why it is not possible to add intercepts > for streaming regression models? I understand that run method in

Re: naive bayes text classifier with tf-idf in pyspark

2015-02-09 Thread Xiangrui Meng
On Fri, Feb 6, 2015 at 2:08 PM, Imran Akbar wrote: > Hi, > > I've got the following code that's almost complete, but I have 2 questions: > > 1) Once I've computed the TF-IDF vector, how do I compute the vector for > each string to feed into the LabeledPoint? > If I understand your code correctly

Re: MLLib: feature standardization

2015-02-09 Thread Xiangrui Meng
`mean()` and `variance()` are not defined in `Vector`. You can use the mean and variance implementation from commons-math3 (http://commons.apache.org/proper/commons-math/javadocs/api-3.4.1/index.html) if you don't want to implement them. -Xiangrui On Fri, Feb 6, 2015 at 12:50 PM, SK wrote: > Hi,
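For readers who would rather implement the two statistics themselves, here is a small plain-Python sketch of per-column mean and variance, followed by standardization. The helper names `column_mean_variance` and `standardize` are hypothetical, and the use of the sample variance (dividing by n - 1) is an assumption; adjust it if you need the population variance.

```python
def column_mean_variance(rows):
    """Return (means, variances) per column for a list of equal-length rows.

    Uses the sample variance (divides by n - 1), which is an assumption here.
    """
    n = len(rows)
    if n < 2:
        raise ValueError("need at least two rows to compute a sample variance")
    dim = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dim)]
    variances = [
        sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1) for j in range(dim)
    ]
    return means, variances


def standardize(rows):
    """Shift each column to zero mean and scale it to unit variance.

    Constant columns (zero variance) are mapped to 0.0 to avoid dividing by zero.
    """
    means, variances = column_mean_variance(rows)
    stds = [v ** 0.5 for v in variances]
    return [
        [(r[j] - means[j]) / stds[j] if stds[j] > 0 else 0.0 for j in range(len(r))]
        for r in rows
    ]
```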

Re: [MLlib] Performance issues when building GBM models

2015-02-09 Thread Xiangrui Meng
Could you check the Spark UI and see whether there are RDDs being kicked out during the computation? We cache the residual RDD after each iteration. If we don't have enough memory/disk, it gets recomputed and results in something like `t(n) = t(n-1) + const`. We might cache the features multiple times
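To see why that recurrence matters, here is a short worked expansion. It assumes, as one reading of the reply, that t(n) is the time of iteration n and that the constant c is the extra recomputation cost added per iteration; N (the number of iterations) and T(N) (the total time) are symbols introduced here for illustration.

```latex
t(n) = t(n-1) + c
\;\Longrightarrow\;
t(n) = t(1) + (n-1)\,c,
\qquad
T(N) = \sum_{n=1}^{N} t(n) = N\,t(1) + \frac{c\,N(N-1)}{2}.
```

Under this reading the total training time grows quadratically in the number of iterations, rather than linearly as it would if the cached residuals stayed in memory.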

Re: [POWERED BY] Radius Intelligence

2015-02-17 Thread Xiangrui Meng
Thanks! I added Radius to https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark. -Xiangrui On Tue, Feb 10, 2015 at 12:02 AM, Alexis Roos wrote: > Also long due given our usage of Spark .. > > Radius Intelligence: > URL: radius.com > > Description: > Spark, MLLib > Using Scala, Spark

Re: Stepsize with Linear Regression

2015-02-17 Thread Xiangrui Meng
The best step size depends on the condition number of the problem. You can try some conditioning heuristics first, e.g., normalizing the columns, and then try a common step size like 0.01. We should implement line search for linear regression in the future, as in LogisticRegressionWithLBFGS. Line s

Re: Unknown sample in Naive Baye's

2015-02-17 Thread Xiangrui Meng
If there exists a sample that does not belong to A/B/C, it means that there exists another class D or Unknown besides A/B/C. You should have some of these samples in the training set in order to let naive Bayes learn the priors. -Xiangrui On Tue, Feb 10, 2015 at 10:44 PM, jatinpreet wrote: > H

Re: Naive Bayes model fails after a few predictions

2015-02-17 Thread Xiangrui Meng
Could you share the error log? What do you mean by "500 instead of 200"? If this is the number of files, try using `repartition` before calling naive Bayes, which works best when the number of partitions matches the number of cores or is even smaller. -Xiangrui On Tue, Feb 10, 2015 at 10:34 PM, rk

Re: high GC in the Kmeans algorithm

2015-02-17 Thread Xiangrui Meng
Did you cache the data? Was it fully cached? The k-means implementation doesn't create many temporary objects. I guess you need more RAM to avoid frequent GC. Please monitor the memory usage using YourKit or VisualVM. -Xiangrui On Wed, Feb 11, 2015 at 1:35 AM, lihu wrote: > I just wan

Re: feeding DataFrames into predictive algorithms

2015-02-17 Thread Xiangrui Meng
Hey Sandy, The work should be done by a VectorAssembler, which combines multiple columns (double/int/vector) into a vector column, which becomes the features column for regression. We are going to create JIRAs for each of these standard feature transformers. It would be great if you can help imple

Re: WARN from Similarity Calculation

2015-02-17 Thread Xiangrui Meng
It may be caused by GC pause. Did you check the GC time in the Spark UI? -Xiangrui On Sun, Feb 15, 2015 at 8:10 PM, Debasish Das wrote: > Hi, > > I am sometimes getting WARN from running Similarity calculation: > > 15/02/15 23:07:55 WARN BlockManagerMasterActor: Removing BlockManager > BlockManag

Re: MLib usage on Spark Streaming

2015-02-17 Thread Xiangrui Meng
JavaDStream.foreachRDD (https://spark.apache.org/docs/1.2.1/api/java/org/apache/spark/streaming/api/java/JavaDStreamLike.html#foreachRDD(org.apache.spark.api.java.function.Function)) and Statistics.corr (https://spark.apache.org/docs/1.2.1/api/java/org/apache/spark/mllib/stat/Statistics.html#corr(o

Re: Large Similarity Job failing

2015-02-17 Thread Xiangrui Meng
The complexity of DIMSUM is independent of the number of rows but still has a quadratic dependency on the number of columns. 1.5M columns may be too large to use DIMSUM. Try to increase the threshold and see whether it helps. -Xiangrui On Tue, Feb 17, 2015 at 6:28 AM, Debasish Das wrote: > Hi, > >

Re: Unknown sample in Naive Baye's

2015-02-19 Thread Xiangrui Meng
> > I appreciate any help on this. > > Thanks, > Jatin > > On Wed, Feb 18, 2015 at 3:07 AM, Xiangrui Meng wrote: >> >> If there exists a sample that doesn't not belong to A/B/C, it means >> that there exists another class D or Unknown besides A/B/C. Yo

Re: high GC in the Kmeans algorithm

2015-02-20 Thread Xiangrui Meng
, so 10^7 is a little huge? > > On Wed, Feb 18, 2015 at 5:43 AM, Xiangrui Meng wrote: >> >> Did you cache the data? Was it fully cached? The k-means >> implementation doesn't create many temporary objects. I guess you need >> more RAM to avoid GC triggered frequently.
