[ANNOUNCEMENT] Plan for dropping Python 2 support

2019-06-03 Thread Xiangrui Meng
Hi all, Today we announced the plan for dropping Python 2 support [1] in Apache Spark: As many of you already know, the Python core development team and many widely used Python packages like Pandas and NumPy will drop Python 2

Re: Should python-2 be supported in Spark 3.0?

2019-06-03 Thread Xiangrui Meng
-- > *From:* shane knapp > *Sent:* Friday, May 31, 2019 7:38:10 PM > *To:* Denny Lee > *Cc:* Holden Karau; Bryan Cutler; Erik Erlandson; Felix Cheung; Mark > Hamstra; Matei Zaharia; Reynold Xin; Sean Owen; Wenchen Fan; Xiangrui Meng; > dev; user > *Subj

Re: Should python-2 be supported in Spark 3.0?

2019-05-30 Thread Xiangrui Meng
, I'm going to upload it to Spark website and announce it here. Let me know if you think we should do a VOTE instead. On Thu, May 30, 2019 at 9:21 AM Xiangrui Meng wrote: > I created https://issues.apache.org/jira/browse/SPARK-27884 to track the > work. > > On Thu, May 30, 2019 at 2

Re: Should python-2 be supported in Spark 3.0?

2019-05-30 Thread Xiangrui Meng
From:* Reynold Xin > *Sent:* Thursday, May 30, 2019 12:59:14 AM > *To:* shane knapp > *Cc:* Erik Erlandson; Mark Hamstra; Matei Zaharia; Sean Owen; Wenchen > Fan; Xiangrui Meng; dev; user > *Subject:* Re: Should python-2 be supported in Spark 3.0? > > +1 on Xiangrui’s plan. &

Re: Should python-2 be supported in Spark 3.0?

2019-05-29 Thread Xiangrui Meng
Hi all, I want to revive this old thread since no action has been taken so far. If we plan to mark Python 2 as deprecated in Spark 3.0, we should do it as early as possible and let users know ahead of time. PySpark depends on Python, numpy, pandas, and pyarrow, all of which are sunsetting Python 2 support by

SPIP: DataFrame-based Property Graphs, Cypher Queries, and Algorithms

2019-01-15 Thread Xiangrui Meng
Hi all, I want to re-send the previous SPIP on introducing a DataFrame-based graph component to collect more feedback. It supports property graphs, Cypher graph queries, and graph algorithms built on top of the DataFrame API. If you are a GraphX user or your workload is essentially graph queries,

Re: Spark ML with null labels

2019-01-10 Thread Xiangrui Meng
In your custom transformer that produces labels, can you filter out the null labels? A transformer doesn't always need to do a 1:1 mapping. On Thu, Jan 10, 2019, 7:53 AM Patrick McCarthy I'm trying to implement an algorithm on the MNIST digits that runs like so: > - for every pair of digits (0,1),

Re: question about barrier execution mode in Spark 2.4.0

2018-12-19 Thread Xiangrui Meng
On Mon, Nov 12, 2018 at 7:33 AM Joe wrote: > Hello, > I was reading Spark 2.4.0 release docs and I'd like to find out more > about barrier execution mode. > In particular I'd like to know what happens when number of partitions > exceeds number of nodes (which I think is allowed, Spark tuning doc

Re: Custom Spark Error on Hadoop Cluster

2016-07-18 Thread Xiangrui Meng
>> Thanks for the reply, however I couldn't locate the MLlib jar. What I >>>>>> have is a fat 'spark-assembly-1.5.1-hadoop2.6.0.jar'. >>>>>> >>>>>> I get an error when copying user@spark.apache.org. The message >>>>>

Re: Custom Spark Error on Hadoop Cluster

2016-07-11 Thread Xiangrui Meng
not on the classpath? > We change entirely the contents of the directory for SPARK_HOME. The newly > built customized spark is the new contents of the current SPARK_HOME we > have right now. > > Thanks, > > Alger > > On Fri, Jul 8, 2016 at 1:32 PM, Xiangrui Meng &l

Re: Custom Spark Error on Hadoop Cluster

2016-07-07 Thread Xiangrui Meng
This seems like a deployment or dependency issue. Please check the following: 1. The unmodified Spark jars were not on the classpath (already existed on the cluster or pulled in by other packages). 2. The modified jars were indeed deployed to both master and slave nodes. On Tue, Jul 5, 2016 at

Re: dataframe stat corr for multiple columns

2016-05-19 Thread Xiangrui Meng
This is nice to have. Please create a JIRA for it. Right now, you can merge all columns into a vector column using RFormula or VectorAssembler, then convert it into an RDD and call corr from MLlib. On Tue, May 17, 2016, 7:09 AM Ankur Jain wrote: > Hello Team, > > > > In my
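
A minimal sketch of that workaround (the DataFrame `df` and its column names are hypothetical; assumes Spark 1.6-era APIs):

    import org.apache.spark.ml.feature.VectorAssembler
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.mllib.stat.Statistics

    // Merge the numeric columns into a single vector column.
    val assembler = new VectorAssembler()
      .setInputCols(Array("c1", "c2", "c3"))
      .setOutputCol("features")

    // Convert to RDD[Vector] and compute the pairwise correlation matrix.
    val vectors = assembler.transform(df)
      .select("features")
      .rdd
      .map(_.getAs[Vector]("features"))
    val corrMatrix = Statistics.corr(vectors, "pearson")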

Re: Is JavaSparkContext.wholeTextFiles distributed?

2016-04-28 Thread Xiangrui Meng
It implements CombineInputFormat from Hadoop. isSplittable=false means each individual file cannot be split. If you only see one partition even with a large minPartitions, perhaps the total size of files is not big enough. Those are configurable in Hadoop conf. -Xiangrui On Tue, Apr 26, 2016,
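
A quick sketch of the call (the path and partition count are placeholders):

    // Each element is (filePath, fileContent); minPartitions is only a hint --
    // a small total input size may still collapse into one partition.
    val files = sc.wholeTextFiles("hdfs:///data/docs", minPartitions = 100)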

Re: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0

2016-04-05 Thread Xiangrui Meng
ace in the 2.x series ? > > Thanks > Shivaram > > On Tue, Apr 5, 2016 at 11:48 AM, Sean Owen <so...@cloudera.com> wrote: > > FWIW, all of that sounds like a good plan to me. Developing one API is > > certainly better than two. > > > > On Tue, Apr 5, 20

Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0

2016-04-05 Thread Xiangrui Meng
Hi all, More than a year ago, in Spark 1.2 we introduced the ML pipeline API built on top of Spark SQL’s DataFrames. Since then the new DataFrame-based API has been developed under the spark.ml package, while the old RDD-based API has been developed in parallel under the spark.mllib package.

Re: How to train and predict in parallel via Spark MLlib?

2016-02-19 Thread Xiangrui Meng
u show me anyone. > > Thanks a lot for your participation! > --Igor > > 2016-02-18 17:24 GMT+03:00 Xiangrui Meng <men...@gmail.com>: > >> If you have a big cluster, you can trigger training jobs in different >> threads on the driver. Putting RDDs inside an RDD w

Re: Why no computations run on workers/slaves in cluster mode?

2016-02-18 Thread Xiangrui Meng
Did the test program finish and did you see any error messages from the console? -Xiangrui On Wed, Feb 17, 2016, 1:49 PM Junjie Qian wrote: > Hi all, > > I am new to Spark, and have one problem that, no computations run on > workers/slave_servers in the standalone

Re: How to train and predict in parallel via Spark MLlib?

2016-02-18 Thread Xiangrui Meng
If you have a big cluster, you can trigger training jobs in different threads on the driver. Putting RDDs inside an RDD won't work. -Xiangrui On Thu, Feb 18, 2016, 4:28 AM Igor L. wrote: > Good day, Spark team! > I have to solve a regression problem under different restrictions.

Re: AFTSurvivalRegression Prediction and QuantilProbabilities

2016-02-17 Thread Xiangrui Meng
You can get the response, and quantiles if you enable quantilesCol. You can change the quantile probabilities as well. There is some example code in the user guide: http://spark.apache.org/docs/latest/ml-classification-regression.html#survival-regression. -Xiangrui On Mon, Feb 1, 2016 at 9:09 AM
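
A short sketch of enabling quantiles (the probabilities and column name are illustrative; `training` and `test` are assumed DataFrames with the label, censor, and features columns the estimator expects):

    import org.apache.spark.ml.regression.AFTSurvivalRegression

    val aft = new AFTSurvivalRegression()
      .setQuantileProbabilities(Array(0.25, 0.5, 0.75)) // customizable
      .setQuantilesCol("quantiles")                     // enables per-row quantiles

    val model = aft.fit(training)
    model.transform(test).select("prediction", "quantiles").show()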

Re: Why transformer from ml.Pipeline transform only a DataFrame ?

2015-08-28 Thread Xiangrui Meng
Yes, we will open up APIs in the next release. There was some discussion about the APIs. One approach is to have multiple methods for different outputs, like predicted class and probabilities. -Xiangrui On Aug 28, 2015 6:39 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Hi there, The actual API of

Re: How to unpersist RDDs generated by ALS/MatrixFactorizationModel

2015-07-28 Thread Xiangrui Meng
Hi Stahlman, finalRDDStorageLevel is the storage level for the final user/item factors. It is not common to set it to StorageLevel.NONE, unless you want to save the factors directly to disk. So if it is NONE, we cannot unpersist the intermediate RDDs (in/out blocks) because the final user/item

Re: Proper saving/loading of MatrixFactorizationModel

2015-07-28 Thread Xiangrui Meng
The partitioner is not saved with the RDD. So when you load the model back, we lose the partitioner information. You can call repartition on the user/product factors and then create a new MatrixFactorizationModel object using the repartitioned RDDs. It would be useful to create a utility method
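
A sketch of that workaround (the path and partition count are placeholders; relies on the public MatrixFactorizationModel constructor the message refers to):

    import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

    val loaded = MatrixFactorizationModel.load(sc, "path/to/model")
    // Repartition the factor RDDs, then rebuild the model around them.
    val userF = loaded.userFeatures.repartition(64).cache()
    val prodF = loaded.productFeatures.repartition(64).cache()
    val model = new MatrixFactorizationModel(loaded.rank, userF, prodF)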

Re: MovieALS Implicit Error

2015-07-28 Thread Xiangrui Meng
Hi Benedict, Did you set lambda to zero? -Xiangrui On Mon, Jul 13, 2015 at 4:18 AM, Benedict Liang bli...@thecarousell.com wrote: Hi Sean, This user dataset is organic. What do you think is a good ratings threshold then? I am only encountering this with the implicit type though. The explicit

Re: Out of Memory Errors on less number of cores in proportion to Partitions in Data

2015-07-28 Thread Xiangrui Meng
Hi Aniruddh, Increasing the number of partitions doesn't always help in ALS due to the communication/computation trade-off. What rank did you set? If the rank is not large, I'd recommend a small number of partitions. There are some other numbers to watch. Do you have super popular items/users in your

Re: Cluster sizing for recommendations

2015-07-28 Thread Xiangrui Meng
Hi Danny, You might need to reduce the number of partitions (or set userBlocks and productBlocks directly in ALS). Using a large number of partitions increases shuffle size and memory requirements. If you have 16 x 16 = 256 cores, I would recommend 64 or 128 partitions instead of 2048.
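
For illustration, a sketch with the block counts set explicitly (all values are placeholders):

    import org.apache.spark.mllib.recommendation.ALS

    val als = new ALS()
      .setRank(10)
      .setIterations(10)
      .setUserBlocks(64)     // fewer, larger user blocks
      .setProductBlocks(64)  // fewer, larger item blocks
    val model = als.run(ratings) // ratings: RDD[Rating]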

Re: Failed stages and dropped executors when running implicit matrix factorization/ALS : Too many values to unpack

2015-07-28 Thread Xiangrui Meng
. Any idea what my errors mean, and why increasing memory causes them to go away? Thanks. On Fri, Jun 26, 2015 at 11:26 AM, Xiangrui Meng men...@gmail.com wrote: Please see my comments inline. It would be helpful if you can attach the full stack trace. -Xiangrui On Fri, Jun 26, 2015 at 7:18

Re: Spark FP-Growth algorithm for frequent sequential patterns

2015-06-28 Thread Xiangrui Meng
for this which I just submitted a PR for :) On Tue, Jun 23, 2015 at 6:09 PM, Xiangrui Meng men...@gmail.com wrote: This is on the wish list for Spark 1.5. Assuming that the items from the same transaction are distinct. We can still follow FP-Growth's steps: 1. find frequent items 2. filter

Re: Failed stages and dropped executors when running implicit matrix factorization/ALS

2015-06-26 Thread Xiangrui Meng
:20 PM, Xiangrui Meng men...@gmail.com wrote: It shouldn't be hard to handle 1 billion ratings in 1.3. Just need more information to guess what happened: 1. Could you share the ALS settings, e.g., number of blocks, rank and number of iterations, as well as number of users/items in your

Re: Failed stages and dropped executors when running implicit matrix factorization/ALS

2015-06-26 Thread Xiangrui Meng
On Jun 23, 2015, at 6:20 PM, Xiangrui Meng men...@gmail.com wrote: It shouldn't be hard to handle 1 billion ratings in 1.3. Just need more information to guess what happened: 1. Could you share the ALS settings, e.g., number of blocks, rank and number of iterations, as well as number

Re: NaiveBayes for MLPipeline is absent

2015-06-25 Thread Xiangrui Meng
FYI, I made a JIRA for this: https://issues.apache.org/jira/browse/SPARK-8600. -Xiangrui On Fri, Jun 19, 2015 at 3:01 PM, Xiangrui Meng men...@gmail.com wrote: Hi Justin, We plan to add it in 1.5, along with some other estimators. We are now preparing a list of JIRAs, but feel free to create

Re: Settings for K-Means Clustering in Mlib for large data set

2015-06-23 Thread Xiangrui Meng
should be adding an extra argument --conf spark.driver.memory=15g. Is that correct? Regards, Rogers Jeffrey L On Thu, Jun 18, 2015 at 7:50 PM, Xiangrui Meng men...@gmail.com wrote: With 80,000 features and 1000 clusters, you need 80,000,000 doubles to store the cluster centers

Re: Spark FP-Growth algorithm for frequent sequential patterns

2015-06-23 Thread Xiangrui Meng
This is on the wish list for Spark 1.5. Assuming that the items from the same transaction are distinct, we can still follow FP-Growth's steps: 1. find frequent items 2. filter transactions and keep only frequent items 3. do NOT order by frequency 4. use suffix to partition the transactions

Re: Failed stages and dropped executors when running implicit matrix factorization/ALS

2015-06-23 Thread Xiangrui Meng
It shouldn't be hard to handle 1 billion ratings in 1.3. Just need more information to guess what happened: 1. Could you share the ALS settings, e.g., number of blocks, rank and number of iterations, as well as number of users/items in your dataset? 2. If you monitor the progress in the WebUI,

Re: How could output the StreamingLinearRegressionWithSGD prediction result?

2015-06-23 Thread Xiangrui Meng
Please check the input path to your test data, and call `.count()` and see whether there are records in it. -Xiangrui On Sat, Jun 20, 2015 at 9:23 PM, Gavin Yue yue.yuany...@gmail.com wrote: Hey, I am testing the StreamingLinearRegressionWithSGD following the tutorial. It works, but I could

Re: which mllib algorithm for large multi-class classification?

2015-06-23 Thread Xiangrui Meng
We have multinomial logistic regression implemented. For your case, the model size is 500 * 300,000 = 150,000,000. MLlib's implementation might not be able to handle it efficiently; we plan to have a more scalable implementation in 1.5. However, it shouldn't give you an array larger than MaxInt

Re: mutable vs. pure functional implementation - StatCounter

2015-06-23 Thread Xiangrui Meng
Creating millions of temporary (immutable) objects is bad for performance. It should be simple to do a micro-benchmark locally. -Xiangrui On Mon, Jun 22, 2015 at 7:25 PM, mzeltser mzelt...@gmail.com wrote: Using StatCounter as an example, I'd like to understand if pure functional implementation

Re: NaiveBayes for MLPipeline is absent

2015-06-19 Thread Xiangrui Meng
Hi Justin, We plan to add it in 1.5, along with some other estimators. We are now preparing a list of JIRAs, but feel free to create a JIRA for this and submit a PR:) Best, Xiangrui On Thu, Jun 18, 2015 at 6:35 PM, Justin Yip yipjus...@prediction.io wrote: Hello, Currently, there is no

Re: Issue with PySpark UDF on a column of Vectors

2015-06-18 Thread Xiangrui Meng
This is a known issue. See https://issues.apache.org/jira/browse/SPARK-7902 -Xiangrui On Thu, Jun 18, 2015 at 6:41 AM, calstad colin.als...@gmail.com wrote: I am having trouble using a UDF on a column of Vectors in PySpark which can be illustrated here: from pyspark import SparkContext from

Re: Does MLLib has attribute importance?

2015-06-18 Thread Xiangrui Meng
, Jun 17, 2015 at 5:50 PM, Xiangrui Meng men...@gmail.com wrote: We don't have it in MLlib. The closest would be the ChiSqSelector, which works for categorical data. -Xiangrui On Thu, Jun 11, 2015 at 4:33 PM, Ruslan Dautkhanov dautkha...@gmail.com wrote: What would be closest equivalent in MLLib

Re: Settings for K-Means Clustering in Mlib for large data set

2015-06-18 Thread Xiangrui Meng
With 80,000 features and 1000 clusters, you need 80,000,000 doubles to store the cluster centers. That is ~600MB. If there are 10 partitions, you might need 6GB on the driver to collect updates from workers. I guess the driver died. Did you specify driver memory with spark-submit? -Xiangrui On
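
For reference, driver memory is set at submit time; a sketch (the class and jar names are placeholders):

    spark-submit --driver-memory 15g --class com.example.KMeansApp kmeans-app.jar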

Re: MLLib: instance weight

2015-06-17 Thread Xiangrui Meng
Hi Gajan, Please subscribe to our user mailing list, which is the best place to get your questions answered. We don't have weighted instance support, but it should be easy to add, and we plan to do it in the next release (1.5). Thanks for asking! Best, Xiangrui On Wed, Jun 17, 2015 at 2:33 PM,

Re: spark mlib variance analysis

2015-06-17 Thread Xiangrui Meng
We don't have an R-like model summary in MLlib, but we plan to add some in 1.5. Please watch https://issues.apache.org/jira/browse/SPARK-7674. -Xiangrui On Thu, May 28, 2015 at 3:47 PM, rafac rafaelme...@hotmail.com wrote: I have a simple problem: I got the mean number of people in one place by

Re: Why the default Params.copy doesn't work for Model.copy?

2015-06-17 Thread Xiangrui Meng
That is a bug, which will be fixed in https://github.com/apache/spark/pull/6622. I disabled Model.copy because models usually don't have a default constructor, and hence the default Params.copy implementation won't work. Unfortunately, due to insufficient test coverage, StringIndexerModel.copy is

Re: Optimization module in Python mllib

2015-06-17 Thread Xiangrui Meng
There is no plan at this time. We haven't reached 100% coverage on user-facing API in PySpark yet, which would have higher priority. -Xiangrui On Sun, Jun 7, 2015 at 1:42 AM, martingoodson martingood...@gmail.com wrote: Am I right in thinking that Python mllib does not contain the optimization

Re: Does MLLib has attribute importance?

2015-06-17 Thread Xiangrui Meng
We don't have it in MLlib. The closest would be the ChiSqSelector, which works for categorical data. -Xiangrui On Thu, Jun 11, 2015 at 4:33 PM, Ruslan Dautkhanov dautkha...@gmail.com wrote: What would be closest equivalent in MLLib to Oracle Data Miner's Attribute Importance mining function?

Re: *Metrics API is odd in MLLib

2015-06-17 Thread Xiangrui Meng
LabeledPoint was used for both classification and regression, where label type is Double for simplicity. So in BinaryClassificationMetrics, we still use Double for labels. We compute the confusion matrix at each threshold internally, but this is not exposed to users (
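
A minimal usage sketch (the `scoreAndLabels` RDD is assumed):

    import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

    // scoreAndLabels: RDD[(Double, Double)] of (predicted score, label in {0.0, 1.0})
    val metrics = new BinaryClassificationMetrics(scoreAndLabels)
    val auROC = metrics.areaUnderROC()
    val pr = metrics.pr() // (recall, precision) points, one per threshold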

Re: Parallel parameter tuning: distributed execution of MLlib algorithms

2015-06-17 Thread Xiangrui Meng
On Fri, May 22, 2015 at 6:15 AM, Hugo Ferreira h...@inesctec.pt wrote: Hi, I am currently experimenting with linear regression (SGD) (Spark + MLlib, ver. 1.2). At this point in time I need to fine-tune the hyper-parameters. I do this (for now) by an exhaustive grid search of the step size and

Re: Collabrative Filtering

2015-06-17 Thread Xiangrui Meng
Please follow the code examples in the user guide: http://spark.apache.org/docs/latest/programming-guide.html#passing-functions-to-spark. -Xiangrui On Tue, May 26, 2015 at 12:34 AM, Yasemin Kaya godo...@gmail.com wrote: Hi, In CF String path = data/mllib/als/test.data; JavaRDDString

Re: How to use Apache spark mllib Model output in C++ component

2015-06-17 Thread Xiangrui Meng
In 1.3, we added some model save/load support in Parquet format. You can use Parquet's C++ library (https://github.com/Parquet/parquet-cpp) to load the data back. -Xiangrui On Wed, Jun 10, 2015 at 12:15 AM, Akhil Das ak...@sigmoidanalytics.com wrote: Hope Swig and JNA might help for accessing

Re: RandomForest - subsamplingRate parameter

2015-06-17 Thread Xiangrui Meng
Because we don't have random access to the records, sampling still needs to go through them sequentially. It does save some computation, which is perhaps noticeable only if you have data cached in memory. Different random seeds are used for different trees. -Xiangrui On Wed, Jun 3, 2015 at 4:40 PM,

Re: redshift spark

2015-06-17 Thread Xiangrui Meng
Hi Hafiz, As Ewan mentioned, the path is the path to the S3 files unloaded from Redshift. This is a more scalable way to get a large amount of data from Redshift than via JDBC. I'd recommend using the SQL API instead of the Hadoop API (https://github.com/databricks/spark-redshift). Best,

Re: k-means for text mining in a streaming context

2015-06-17 Thread Xiangrui Meng
Yes. You can apply HashingTF on your input stream and then use StreamingKMeans for training and prediction. -Xiangrui On Mon, Jun 8, 2015 at 11:05 AM, Ruslan Dautkhanov dautkha...@gmail.com wrote: Hello, https://spark.apache.org/docs/latest/mllib-feature-extraction.html would Feature
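
A rough sketch of that pipeline (stream name, dimension, and parameters are illustrative):

    import org.apache.spark.mllib.clustering.StreamingKMeans
    import org.apache.spark.mllib.feature.HashingTF

    val dim = 1 << 18
    val htf = new HashingTF(dim)
    // lines: DStream[String]; hash each document into a term-frequency vector
    val vectors = lines.map(line => htf.transform(line.split(" ").toSeq))

    val model = new StreamingKMeans()
      .setK(10)
      .setDecayFactor(1.0)
      .setRandomCenters(dim, 0.0)
    model.trainOn(vectors)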

Re: What is the right algorithm to do cluster analysis with mixed numeric, categorical, and string value attributes?

2015-06-17 Thread Xiangrui Meng
You can try hashing to control the feature dimension. MLlib's k-means implementation can handle sparse data efficiently if the number of features is not huge. -Xiangrui On Tue, Jun 16, 2015 at 2:44 PM, Rex X dnsr...@gmail.com wrote: Hi Sujit, That's a good point. But 1-hot encoding will make

Re: Inconsistent behavior with Dataframe Timestamp between 1.3.1 and 1.4.0

2015-06-17 Thread Xiangrui Meng
That sounds like a bug. Could you create a JIRA and ping Yin Huai (cc'ed)? -Xiangrui On Wed, May 27, 2015 at 12:57 AM, Justin Yip yipjus...@prediction.io wrote: Hello, I am trying out 1.4.0 and noticed there are some differences in behavior with Timestamp between 1.3.1 and 1.4.0. In 1.3.1, I

Re: Efficient way to get top K values per key in (key, value) RDD?

2015-06-17 Thread Xiangrui Meng
This is implemented in MLlib: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/MLPairRDDFunctions.scala#L41. -Xiangrui On Wed, Jun 10, 2015 at 1:53 PM, erisa erisa...@gmail.com wrote: Hi, I am a Spark newbie, and trying to solve the same problem, and
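
Usage is a one-liner once the implicit conversion is imported (the `pairs` RDD is assumed):

    import org.apache.spark.mllib.rdd.MLPairRDDFunctions._

    // pairs: RDD[(String, Double)]; keep the 3 largest values per key
    val topK = pairs.topByKey(3) // RDD[(String, Array[Double])]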

Re: Not albe to run FP-growth Example

2015-06-17 Thread Xiangrui Meng
You should add spark-mllib_2.10 as a dependency instead of declaring it as the artifactId. And always use the same version for spark-core and spark-mllib. I saw you used 1.3.0 for spark-core but 1.4.0 for spark-mllib, which is not guaranteed to work. If you set the scope to provided, mllib jar
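
The thread concerns a Maven POM, but the same constraint in sbt form looks like this (a sketch; the point is that both artifacts share one version):

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"  % "1.4.0" % "provided",
      "org.apache.spark" %% "spark-mllib" % "1.4.0" % "provided"
    )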

Re: FP Growth saveAsTextFile

2015-05-21 Thread Xiangrui Meng
) at java.lang.Thread.run(Thread.java:745) On Wed, May 20, 2015 at 2:05 PM, Xiangrui Meng men...@gmail.com wrote: Could you post the stack trace? If you are using Spark 1.3 or 1.4, it would be easier to save freq itemsets as a Parquet file. -Xiangrui On Wed, May 20, 2015 at 12:16 PM, Eric Tanner

Re: Pandas timezone problems

2015-05-21 Thread Xiangrui Meng
These are relevant: JIRA: https://issues.apache.org/jira/browse/SPARK-6411 PR: https://github.com/apache/spark/pull/6250 On Thu, May 21, 2015 at 3:16 PM, Def_Os njde...@gmail.com wrote: After deserialization, something seems to be wrong with my pandas DataFrames. It looks like the timezone

Re: FP Growth saveAsTextFile

2015-05-20 Thread Xiangrui Meng
Could you post the stack trace? If you are using Spark 1.3 or 1.4, it would be easier to save freq itemsets as a Parquet file. -Xiangrui On Wed, May 20, 2015 at 12:16 PM, Eric Tanner eric.tan...@justenough.com wrote: I am having trouble with saving an FP-Growth model as a text file. I can
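
A sketch of saving the frequent itemsets as Parquet (column names and the output path are illustrative; uses the Spark 1.4 writer API, while 1.3 would use saveAsParquetFile):

    import sqlContext.implicits._

    // model: FPGrowthModel[String]; flatten each itemset into a saveable row
    val freqDF = model.freqItemsets
      .map(fi => (fi.items.toSeq, fi.freq))
      .toDF("items", "freq")
    freqDF.write.parquet("freq-itemsets.parquet")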

Re: User Defined Type (UDT)

2015-05-20 Thread Xiangrui Meng
using them to support Java 8's ZonedDateTime. On Tue, May 19, 2015 at 3:14 PM Xiangrui Meng men...@gmail.com wrote: (Note that UDT is not a public API yet.) On Thu, May 7, 2015 at 7:11 AM, wjur wojtek.jurc...@gmail.com wrote: Hi all! I'm using Spark 1.3.0 and I'm struggling

Re: How to implement an Evaluator for a ML pipeline?

2015-05-19 Thread Xiangrui Meng
The documentation needs to be updated to state that higher metric values are better (https://issues.apache.org/jira/browse/SPARK-7740). I don't know why you still get the highest regularization parameter candidate after negating the return value of the Evaluator. Maybe you should check the log

Re: Word2Vec with billion-word corpora

2015-05-19 Thread Xiangrui Meng
With a vocabulary size of 4M and vector size 400, you need 400 * 4M = 1.6B floats to store the model. That is about 6.4GB. We store the model on the driver node in the current implementation. So I don't think it would work. You might try increasing the minCount to decrease the vocabulary size and reduce the

Re: Discretization

2015-05-19 Thread Xiangrui Meng
Thanks for asking! We should improve the documentation. The sample dataset actually mimics the MNIST digits dataset, where the values are gray levels (0-255). So by dividing by 16, we map them to 16 coarse bins for the gray levels. Actually, there is a bug in the doc; we should convert

Re: Stratified sampling with DataFrames

2015-05-19 Thread Xiangrui Meng
You need to convert the DataFrame to an RDD, call sampleByKey, and then apply the schema back to create a DataFrame:

    val df: DataFrame = ...
    val schema = df.schema
    val sampledRDD = df.rdd.keyBy(r => r.getAs[Int](0)).sampleByKey(...).values
    val sampled = sqlContext.createDataFrame(sampledRDD, schema)

Re: SQL UserDefinedType can't be saved in parquet file when using assembly jar

2015-05-19 Thread Xiangrui Meng
: MyDenseVectorUDT does exist in the assembly jar, and in this example all the code is in a single file to make sure everything is included. On Tue, Apr 21, 2015 at 1:17 AM, Xiangrui Meng men...@gmail.com wrote: You should check where MyDenseVectorUDT is defined and whether it was on the classpath

Re: Increase maximum amount of columns for covariance matrix for principal components

2015-05-19 Thread Xiangrui Meng
We use a dense array to store the covariance matrix on the driver node. So its length is limited by the integer range, which is 65536 * 65536 (actually half). -Xiangrui On Wed, May 13, 2015 at 1:57 AM, Sebastian Alfers sebastian.alf...@googlemail.com wrote: Hello, in order to compute a huge

Re: k-means core function for temporal geo data

2015-05-19 Thread Xiangrui Meng
I'm not sure whether k-means would converge with this customized distance measure. You can list (weighted) time as a feature along with coordinates, and then use Euclidean distance. For other supported distance measures, you can check Derrick's package:
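
A small sketch of that encoding (the weight is a hypothetical tuning knob; `points` is an assumed RDD of (time, x, y) triples):

    import org.apache.spark.mllib.linalg.Vectors

    val w = 0.5 // relative weight of the time dimension vs. the coordinates
    val features = points.map { case (t, x, y) => Vectors.dense(w * t, x, y) }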

Re: spark mllib kmeans

2015-05-19 Thread Xiangrui Meng
Just curious, what distance measure do you need? -Xiangrui On Mon, May 11, 2015 at 8:28 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: take a look at this https://github.com/derrickburns/generalized-kmeans-clustering Best, Jao On Mon, May 11, 2015 at 3:55 PM, Driesprong, Fokko

Re: LogisticRegressionWithLBFGS with large feature set

2015-05-19 Thread Xiangrui Meng
For ML applications, the best setting is to match the number of partitions to the number of cores, which reduces shuffle size. You have 3072 partitions but 128 executors, which causes the overhead. For the MultivariateOnlineSummarizer, we plan to add flags to specify what needs to be computed to reduce
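
A sketch of that adjustment (the core count is a placeholder):

    val numCores = 128 // total executor cores in this hypothetical cluster
    val training = data.repartition(numCores).cache() // match partitions to cores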

Re: Find KNN in Spark SQL

2015-05-19 Thread Xiangrui Meng
Spark SQL doesn't provide spatial features. Large-scale KNN is usually combined with locality-sensitive hashing (LSH). This Spark package may be helpful: http://spark-packages.org/package/mrsqueeze/spark-hash. -Xiangrui On Sat, May 9, 2015 at 9:25 PM, Dong Li lid...@lidong.net.cn wrote: Hello

Re: question about customize kmeans distance measure

2015-05-19 Thread Xiangrui Meng
MLlib only supports Euclidean distance for k-means. You can find Bregman divergence support in Derrick's package: http://spark-packages.org/package/derrickburns/generalized-kmeans-clustering. Which distance measure do you want to use? -Xiangrui On Tue, May 12, 2015 at 7:23 PM, June

Re: MLlib libsvm isssues with data

2015-05-19 Thread Xiangrui Meng
The index should start from 1 for LIBSVM format, as defined in the README of LIBSVM (https://github.com/cjlin1/libsvm/blob/master/README#L64). The only exception is the precomputed kernel, which MLlib doesn't support. -Xiangrui On Wed, May 6, 2015 at 1:42 AM, doyere doy...@doyere.cn wrote: Hi
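
For reference, a minimal loading sketch (the path is a placeholder; note the 1-based feature indices in the file format):

    import org.apache.spark.mllib.util.MLUtils

    // A LIBSVM line looks like: "1.0 1:0.5 3:1.2" -- indices start at 1
    val data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")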

Re: RandomSplit with Spark-ML and Dataframe

2015-05-19 Thread Xiangrui Meng
In 1.4, we added RAND as a DataFrame expression, which can be used for random split. Please check the example here: https://github.com/apache/spark/blob/master/python/pyspark/ml/tuning.py#L214. -Xiangrui On Thu, May 7, 2015 at 8:39 AM, Olivier Girardot o.girar...@lateral-thoughts.com wrote: Hi,
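
A sketch of a rand-based split in Scala (the seed and fractions are illustrative; caching keeps both filters consistent with one random draw):

    import org.apache.spark.sql.functions.{col, rand}

    val withR = df.withColumn("r", rand(42)).cache() // one draw shared by both splits
    val train = withR.where(col("r") < 0.8).drop("r")
    val test  = withR.where(col("r") >= 0.8).drop("r")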

Re: User Defined Type (UDT)

2015-05-19 Thread Xiangrui Meng
(Note that UDT is not a public API yet.) On Thu, May 7, 2015 at 7:11 AM, wjur wojtek.jurc...@gmail.com wrote: Hi all! I'm using Spark 1.3.0 and I'm struggling with a definition of a new type for a project I'm working on. I've created a case class Person(name: String) and now I'm trying to

Re: Implicit matrix factorization returning different results between spark 1.2.0 and 1.3.0

2015-05-19 Thread Xiangrui Meng
:59 PM, Xiangrui Meng men...@gmail.com wrote: Ravi, we just merged https://issues.apache.org/jira/browse/SPARK-6642 and used the same lambda scaling as in 1.2. The change will be included in Spark 1.3.1, which will be released soon. Thanks for reporting this issue! -Xiangrui On Tue, Mar 31

Re: MLLib SVMWithSGD is failing for large dataset

2015-05-18 Thread Xiangrui Meng
Reducing the number of instances won't help in this case. We use the driver to collect partial gradients. Even with tree aggregation, it still puts heavy workload on the driver with 20M features. Please try to reduce the number of partitions before training. We are working on a more scalable

Re: StandardScaler failing with OOM errors in PySpark

2015-05-18 Thread Xiangrui Meng
in how the JVM is created. No matter which memory settings I specify, the JVM for the driver is always made with 512Mb of memory. So I'm not sure if this is a feature or a bug? rok On Mon, Apr 27, 2015 at 6:54 PM, Xiangrui Meng men...@gmail.com wrote: You might need to specify driver memory

Re: bug: numClasses is not a valid argument of LogisticRegressionWithSGD

2015-05-18 Thread Xiangrui Meng
LogisticRegressionWithSGD doesn't support multi-class. Please use LogisticRegressionWithLBFGS instead. -Xiangrui On Mon, Apr 27, 2015 at 12:37 PM, Pagliari, Roberto rpagli...@appcomsci.com wrote: With the Python APIs, the available arguments I got (using inspect module) are the following:
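
A minimal sketch of the suggested replacement (the class count is illustrative):

    import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS

    val model = new LogisticRegressionWithLBFGS()
      .setNumClasses(10) // multinomial logistic regression
      .run(training)     // training: RDD[LabeledPoint]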

Re: OOM error with GMMs on 4GB dataset

2015-05-06 Thread Xiangrui Meng
Did you set `--driver-memory` with spark-submit? -Xiangrui On Mon, May 4, 2015 at 5:16 PM, Vinay Muttineni vmuttin...@ebay.com wrote: Hi, I am training a GMM with 10 gaussians on a 4 GB dataset(720,000 * 760). The spark (1.3.1) job is allocated 120 executors with 6GB each and the driver also

Re: Getting error running MLlib example with new cluster

2015-04-27 Thread Xiangrui Meng
How did you run the example app? Did you use spark-submit? -Xiangrui On Thu, Apr 23, 2015 at 2:27 PM, Su She suhsheka...@gmail.com wrote: Sorry, accidentally sent the last email before finishing. I had asked this question before, but wanted to ask again as I think it is now related to my pom

Re: MLlib - Collaborative Filtering - trainImplicit task size

2015-04-27 Thread Xiangrui Meng
) org.apache.spark.ml.recommendation.ALS$.train(ALS.scala:527) org.apache.spark.mllib.recommendation.ALS.run(ALS.scala:203) On Thu, Apr 23, 2015 at 2:49 AM, Xiangrui Meng men...@gmail.com wrote: This is the size of the serialized task closure. Is stage 246 part of ALS iterations, or something

Re: StandardScaler failing with OOM errors in PySpark

2015-04-27 Thread Xiangrui Meng
-- I have to run in client mode so I can't set spark.driver.memory -- I've tried setting the spark.yarn.am.memory and overhead parameters but it doesn't seem to have an effect. Thanks, Rok On Apr 23, 2015, at 7:47 AM, Xiangrui Meng men...@gmail.com wrote: What is the feature dimension

Re: gridsearch - python

2015-04-27 Thread Xiangrui Meng
We will try to make them available in 1.4, which is coming soon. -Xiangrui On Thu, Apr 23, 2015 at 10:18 PM, Pagliari, Roberto rpagli...@appcomsci.com wrote: I know grid search with cross validation is not supported. However, I was wondering if there is something available for the time being.

Re: setting cost in linear SVM [Python]

2015-04-23 Thread Xiangrui Meng
If by C you mean the parameter C in LIBLINEAR, the corresponding parameter in MLlib is regParam: https://github.com/apache/spark/blob/master/python/pyspark/mllib/classification.py#L273, where regParam = 1/C. -Xiangrui On Wed, Apr 22, 2015 at 3:25 PM, Pagliari, Roberto rpagli...@appcomsci.com
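
A sketch of the mapping in Scala (hyperparameter values are placeholders; `training` is an assumed RDD[LabeledPoint]):

    import org.apache.spark.mllib.classification.SVMWithSGD

    // LIBLINEAR's C maps to MLlib's regularization as regParam = 1/C
    val C = 10.0
    val regParam = 1.0 / C // 0.1
    val model = SVMWithSGD.train(training, 100, 1.0, regParam, 1.0)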

Re: StackOverflow Error when run ALS with 100 iterations

2015-04-23 Thread Xiangrui Meng
ALS.setCheckpointInterval was added in Spark 1.3.1. You need to upgrade Spark to use this feature. -Xiangrui On Wed, Apr 22, 2015 at 9:03 PM, amghost zhengweita...@outlook.com wrote: Hi, would you please explain how to checkpoint the training set RDD, since all things are done in the ALS.train method.
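
A sketch of enabling checkpointing for long ALS runs (the directory and parameters are placeholders; requires Spark 1.3.1+):

    import org.apache.spark.mllib.recommendation.ALS

    sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")
    val als = new ALS()
      .setRank(20)
      .setIterations(100)
      .setCheckpointInterval(10) // cut the RDD lineage every 10 iterations
    val model = als.run(ratings)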

Re: MLlib - Collaborative Filtering - trainImplicit task size

2015-04-22 Thread Xiangrui Meng
This is the size of the serialized task closure. Is stage 246 part of ALS iterations, or something before or after it? -Xiangrui On Tue, Apr 21, 2015 at 10:36 AM, Christian S. Perone christian.per...@gmail.com wrote: Hi Sean, thanks for the answer. I tried to call repartition() on the input

Re: Spark 1.3.1 Dataframe breaking ALS.train?

2015-04-22 Thread Xiangrui Meng
The patch was merged, and it will be included in 1.3.2 and 1.4.0. Thanks for reporting the bug! -Xiangrui On Tue, Apr 21, 2015 at 2:51 PM, ayan guha guha.a...@gmail.com wrote: Thank you all. On 22 Apr 2015 04:29, Xiangrui Meng men...@gmail.com wrote: SchemaRDD subclasses RDD in 1.2

Re: Problem with using Spark ML

2015-04-22 Thread Xiangrui Meng
Please try reducing the step size. The native BLAS library is not required. -Xiangrui On Tue, Apr 21, 2015 at 5:15 AM, Staffan staffan.arvids...@gmail.com wrote: Hi, I've written an application that performs some machine learning on some data. I've validated that the data _should_ give a good

Re: [MLlib] fail to run word2vec

2015-04-22 Thread Xiangrui Meng
We store the vectors on the driver node, so it is hard to handle a really large vocabulary. You can use setMinCount to filter out infrequent words and reduce the model size. -Xiangrui On Wed, Apr 22, 2015 at 12:32 AM, gm yu husty...@gmail.com wrote: When using Mllib.Word2Vec, I meet the following
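
A sketch of trimming the vocabulary (the threshold is illustrative; `corpus` is an assumed RDD[Seq[String]]):

    import org.apache.spark.mllib.feature.Word2Vec

    val w2v = new Word2Vec()
      .setVectorSize(100)
      .setMinCount(20) // drop words seen fewer than 20 times
    val model = w2v.fit(corpus)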

Re: the indices of SparseVector must be ordered while computing SVD

2015-04-22 Thread Xiangrui Meng
Having ordered indices is a contract of SparseVector: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.SparseVector. We do not verify it for performance. -Xiangrui On Wed, Apr 22, 2015 at 8:26 AM, yaochunnan yaochun...@gmail.com wrote: Hi all, I am using

Re: LDA code little error @Xiangrui Meng

2015-04-22 Thread Xiangrui Meng

Re: StandardScaler failing with OOM errors in PySpark

2015-04-22 Thread Xiangrui Meng
What is the feature dimension? Did you set the driver memory? -Xiangrui On Tue, Apr 21, 2015 at 6:59 AM, rok rokros...@gmail.com wrote: I'm trying to use the StandardScaler in pyspark on a relatively small (a few hundred Mb) dataset of sparse vectors with 800k features. The fit method of

Re: Spark 1.3.1 Dataframe breaking ALS.train?

2015-04-21 Thread Xiangrui Meng
SchemaRDD subclasses RDD in 1.2, but DataFrame is no longer an RDD in 1.3. We should allow DataFrames in ALS.train. I will submit a patch. You can use `ALS.train(training.rdd, ...)` for now as a workaround. -Xiangrui On Tue, Apr 21, 2015 at 10:51 AM, Joseph Bradley jos...@databricks.com wrote:

Re: Streaming Linear Regression problem

2015-04-20 Thread Xiangrui Meng
Did you keep adding new files under the `train/` folder? What was the exact warn message? -Xiangrui On Fri, Apr 17, 2015 at 4:56 AM, barisak baris.akg...@gmail.com wrote: Hi, I wrote this code just to train the streaming linear regression, but I got a 'no data found' warning, so the weights were not

Re: ChiSquared Test from user response flat files to RDD[Vector]?

2015-04-20 Thread Xiangrui Meng
You can find the user guide for vector creation here: http://spark.apache.org/docs/latest/mllib-data-types.html#local-vector. -Xiangrui On Mon, Apr 20, 2015 at 2:32 PM, Dan DeCapria, CivicScience dan.decap...@civicscience.com wrote: Hi Spark community, I'm very new to the Apache Spark

Re: MLlib - Naive Bayes Problem

2015-04-20 Thread Xiangrui Meng
Could you attach the full stack trace? Please also include the stack trace from the executors, which you can find on the Spark WebUI. -Xiangrui On Thu, Apr 16, 2015 at 1:00 PM, riginos samarasrigi...@gmail.com wrote: I have a big dataset of categories of cars and descriptions of cars. So I want to

Re: SQL UserDefinedType can't be saved in parquet file when using assembly jar

2015-04-20 Thread Xiangrui Meng
You should check where MyDenseVectorUDT is defined and whether it was on the classpath (or in the assembly jar) at runtime. Make sure the full class name (with package name) is used. Btw, UDTs are not public yet, so please use it with caution. -Xiangrui On Fri, Apr 17, 2015 at 12:45 AM, Jaonary

Re: multinomial and Bernoulli model in NaiveBayes

2015-04-15 Thread Xiangrui Meng
CC Leah, who added the Bernoulli option to MLlib's NaiveBayes. -Xiangrui On Wed, Apr 15, 2015 at 4:49 AM, 姜林和 linhe_ji...@163.com wrote: Dear Meng: Thanks for the great work on Spark machine learning. I saw the changes to the NaiveBayes algorithm, separating it into: multinomial

Re: org.apache.spark.ml.recommendation.ALS

2015-04-14 Thread Xiangrui Meng
? Thanks, Jay On Apr 9, 2015, at 4:38 PM, Xiangrui Meng men...@gmail.com wrote: Could you share ALSNew.scala? Which Scala version did you use? -Xiangrui On Wed, Apr 8, 2015 at 4:09 PM, Jay Katukuri jkatuk...@apple.com wrote: Hi Xiangrui, I tried running this on my local machine

Re: spark ml model info

2015-04-14 Thread Xiangrui Meng
If you are using Scala/Java or pyspark.mllib.classification.LogisticRegressionModel, you should be able to call weights and intercept to get the model coefficients. If you are using the pipeline API in Python, you can try model._java_model.weights(); we are going to add a method to get the weights
