Re: Spark LogisticRegression got stuck on dataset with millions of columns

2019-04-23 Thread Weichen Xu
Could you provide your code and info about the cluster you are running on? On Tue, Apr 23, 2019 at 4:10 PM Qian He wrote: > The dataset was using a sparse representation before feeding into > LogisticRegression. > > On Tue, Apr 23, 2019 at 3:15 PM Weichen Xu > wrote: >> Hi Qian, …

Re: Spark LogisticRegression got stuck on dataset with millions of columns

2019-04-23 Thread Weichen Xu
Hi Qian, Does your dataset use a sparse vector format? On Mon, Apr 22, 2019 at 5:03 PM Qian He wrote: > Hi all, > > I'm using Spark's provided LogisticRegression to fit a dataset. Each row of > the data has 1.7 million columns, but it is sparse with only hundreds of > 1s. The Spark UI reported hig…
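
For reference, a minimal sketch (in Scala) of what a sparse row looks like in spark.ml; the sizes and index positions here are made up for illustration:

```scala
import org.apache.spark.ml.linalg.Vectors

// A row with 1.7 million columns but only a handful of non-zero entries:
// give the total size, the sorted indices of the 1s, and their values.
val numFeatures = 1700000
val indices = Array(3, 101, 999999) // positions of the non-zeros
val values = Array(1.0, 1.0, 1.0)
val sparseRow = Vectors.sparse(numFeatures, indices, values)
```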

Re: Has there been any explanation on the performance degradation between spark.ml and Mllib?

2018-01-22 Thread Weichen Xu
Hi Stephen, I agree with what Nick said; the ML vs. MLlib comparison test seems to be flawed. LR in Spark MLlib uses SGD: in each iteration during training, SGD samples only a small fraction of the data and computes the gradient on it, whereas in each iteration LBFGS needs to aggregate over the whole input dataset. So…
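
For context, a rough sketch of the two entry points being compared (Spark 2.x era; `LogisticRegressionWithSGD` was already deprecated then, and the toy data is only for illustration):

```scala
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.ml.classification.LogisticRegression

// Tiny toy dataset, assuming a spark-shell session (`sc` and `spark` in scope).
val rdd = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(0.0, 1.0)),
  LabeledPoint(1.0, Vectors.dense(1.0, 0.0))))

// RDD-based MLlib LR: each SGD iteration samples a fraction of the data.
val mllibModel = LogisticRegressionWithSGD.train(rdd, 10)

// DataFrame-based spark.ml LR: each LBFGS iteration aggregates the full dataset.
val df = spark.createDataFrame(rdd.map(lp => (lp.label, lp.features.asML)))
  .toDF("label", "features")
val mlModel = new LogisticRegression().setMaxIter(10).fit(df)
```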

Re: Please Help with DecisionTree/FeatureIndexer

2017-12-19 Thread Weichen Xu
…ta) // Make predictions. val predictions = model.transform(testData) Thanks. On Wed, Dec 20, 2017 at 5:26 AM, Marco Mistroni wrote: > Hello Weichen > i will try it out and let you know > But, if i add assembler to the pipeline, do i still have to call > Assembler.transfor…
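
A sketch of what that pipeline might look like with the VectorAssembler as the first stage, so that fit/transform invoke it automatically (the input column names and the `trainingData`/`testData` DataFrames are assumptions):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.feature.{VectorAssembler, VectorIndexer}

// Hypothetical raw input columns; adjust to the actual schema.
val assembler = new VectorAssembler()
  .setInputCols(Array("col1", "col2", "col3"))
  .setOutputCol("features")

val indexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(5)

val dt = new DecisionTreeClassifier()
  .setLabelCol("label")
  .setFeaturesCol("indexedFeatures")

// Once the assembler is a pipeline stage, there is no separate
// Assembler.transform call: the pipeline runs it for you.
val pipeline = new Pipeline().setStages(Array(assembler, indexer, dt))
val model = pipeline.fit(trainingData)
val predictions = model.transform(testData)
```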

Re: Help Required on Spark - Convert DataFrame to List with out using collect

2017-12-18 Thread Weichen Xu
Hi Sunitha, in the mapper function you cannot update outer variables, e.g. via `personLst.add(person)`; that doesn't work, which is why you got an empty list. You can use `rdd.collect()` to get a local list of `Person` objects first; then you can safely iterate over the local list and do any up…
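
A minimal sketch of the suggested approach, assuming a `Dataset[Person]` named `ds`:

```scala
// Bring the rows to the driver first, then iterate locally;
// driver-side variables are safe to update here.
val localList: Array[Person] = ds.collect()
val personLst = new scala.collection.mutable.ListBuffer[Person]()
localList.foreach { person =>
  personLst += person
}

// Or, if a java.util.List is needed:
val javaList: java.util.List[Person] = ds.collectAsList()
```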

Re: Please Help with DecisionTree/FeatureIndexer

2017-12-18 Thread Weichen Xu
> > Could you advise if this is the proper way to follow when using an > Assembler? > I was unable to add the Assembler at the beginning of the pipeline... it > seems it didn't get invoked, as, at the moment of calling the FeatureIndexer, > the column 'features' was not fou…

Re: Please Help with DecisionTree/FeatureIndexer

2017-12-16 Thread Weichen Xu
es") > .setMaxCategories(5) // features with > 4 distinct values are > treated as continuous. > .fit(transformedData) > > ? > Apologies for the basic question btu last time i worked on an ML project i > was using Spark 1.x > > kr > marco > &g

Re: Please Help with DecisionTree/FeatureIndexer

2017-12-16 Thread Weichen Xu
Hi Marco, val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt") The data now includes a feature column with the name "features". val featureIndexer = new VectorIndexer() .setInputCol("features") <-- Here specify the "features" column to index. .setOutputCol("inde…
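
The completed version of that snippet, following the standard spark.ml example, might look like this:

```scala
import org.apache.spark.ml.feature.VectorIndexer

val data = spark.read.format("libsvm")
  .load("data/mllib/sample_libsvm_data.txt") // yields "label" and "features" columns

val featureIndexer = new VectorIndexer()
  .setInputCol("features")         // the column to index
  .setOutputCol("indexedFeatures")
  .setMaxCategories(4)             // features with > 4 distinct values are treated as continuous
  .fit(data)

val indexedData = featureIndexer.transform(data)
```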

Re: Row Encoder For DataSet

2017-12-07 Thread Weichen Xu
You can groupBy multiple columns on a dataframe, so why do you need such a complicated schema? Suppose the df schema is (x, y, u, v, z): df.groupBy($"x", $"y").agg(...) Is this what you want? On Fri, Dec 8, 2017 at 11:51 AM, Sandip Mehta wrote: > Hi, > > During my aggregation I end up having the following schema. >…
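
A small self-contained sketch of grouping by two columns at once (the data and aggregates are illustrative):

```scala
import org.apache.spark.sql.functions._
import spark.implicits._ // assuming a spark-shell session

// df has schema (x, y, u, v, z)
val df = Seq((1, "a", 2.0, 3.0, 4.0), (1, "a", 5.0, 6.0, 7.0), (2, "b", 1.0, 1.0, 1.0))
  .toDF("x", "y", "u", "v", "z")

val aggregated = df.groupBy($"x", $"y")
  .agg(sum($"u").as("sum_u"), avg($"v").as("avg_v"), max($"z").as("max_z"))
```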

Re: StringIndexer on several columns in a DataFrame with Scala

2017-10-30 Thread Weichen Xu
Yes, I am working on this. Sorry for the delay, but I will try to submit a PR ASAP. Thanks! On Mon, Oct 30, 2017 at 5:19 PM, Nick Pentreath wrote: > For now, you must follow this approach of constructing a pipeline > consisting of a StringIndexer for each categorical column. See > https://issues.apache.…
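
A sketch of that workaround, building one StringIndexer per categorical column and chaining them in a Pipeline (the column names and input `df` are hypothetical):

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.StringIndexer

val categoricalCols = Seq("colA", "colB", "colC") // hypothetical columns

// One StringIndexer stage per categorical column.
val indexers = categoricalCols.map { c =>
  new StringIndexer().setInputCol(c).setOutputCol(s"${c}_indexed")
}

val pipeline = new Pipeline().setStages(indexers.toArray[PipelineStage])
val indexedDf = pipeline.fit(df).transform(df)
```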

Re: Zero Coefficient in logistic regression

2017-10-24 Thread Weichen Xu
Yes, the chi-squared statistic can only be used with categorical features, so it does not look appropriate here. Thanks! On Tue, Oct 24, 2017 at 5:13 PM, Simon Dirmeier wrote: > Hey, > as far as I know, feature selection using a chi-squared statistic can > only be done on categorical features and not on possibly cont…
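
For illustration, this is roughly what chi-squared selection looks like in spark.ml; it presumes the features vector holds categorical values, which is exactly the caveat above (the `df` and parameter values are assumptions):

```scala
import org.apache.spark.ml.feature.ChiSqSelector

// "features" must contain categorical features for the test to be meaningful.
val selector = new ChiSqSelector()
  .setNumTopFeatures(50)
  .setFeaturesCol("features")
  .setLabelCol("label")
  .setOutputCol("selectedFeatures")

val selected = selector.fit(df).transform(df)
```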

Re: Spark ML - LogisticRegression interpreting prediction

2017-10-22 Thread Weichen Xu
The values you want (the ones that add up to 1.0) are in "probability", not "rawPrediction". Thanks! On Mon, Oct 23, 2017 at 1:20 AM, pun wrote: > Hello, > I have a LogisticRegression model for predicting a binary label. Once I > train the model, I run it to get some predictions. I get the following > val…
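
A short sketch of reading the right column, assuming a fitted model `lrModel` and a `testData` DataFrame:

```scala
// "probability" holds the per-class probabilities (they sum to 1.0);
// "rawPrediction" holds raw margins, which do not.
val predictions = lrModel.transform(testData)
predictions.select("rawPrediction", "probability", "prediction")
  .show(truncate = false)
```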

Re: jar file problem

2017-10-19 Thread Weichen Xu
Use the `bin/spark-submit --jars` option. On Thu, Oct 19, 2017 at 11:54 PM, 郭鹏飞 wrote: > You can use the bin/spark-submit tool to submit your jar to the cluster. > > > On Oct 19, 2017, at 11:24 PM, Uğur Sopaoğlu wrote: > > > > Hello, > > > > I have a very simple problem. Whenever I run a Spark job, I must copy the jar file >…
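
For example (the paths, class name, and master URL are placeholders):

```
# --jars ships the extra, comma-separated jars to the driver and executors.
bin/spark-submit \
  --class com.example.MyApp \
  --master spark://master:7077 \
  --jars /path/to/dep1.jar,/path/to/dep2.jar \
  /path/to/your-app.jar
```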

Re: Is there a difference between df.cache() vs df.rdd.cache()

2017-10-13 Thread Weichen Xu
…ppreciate it if you can shed more insight on this issue or >>> point me to documentation where I can learn about it. >>> >>> Thank you in advance. >>> >>> On Fri, Oct 13, 2017 at 3:19 AM, Weichen Xu >>> wrote: >>> >>>> You…

Re: Is there a difference between df.cache() vs df.rdd.cache()

2017-10-13 Thread Weichen Xu
You should use `df.cache()`. `df.rdd.cache()` won't work, because `df.rdd` generates a new RDD from the original `df`, and you would then be caching that new RDD rather than the DataFrame. On Fri, Oct 13, 2017 at 3:35 PM, Supun Nakandala wrote: > Hi all, > > I have been experimenting with the cache/persist/unpersist methods with > respect to…
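
A quick illustration of the difference, assuming a spark-shell session:

```scala
val df = spark.range(0, 1000000).toDF("id")

// Caches the DataFrame itself; later actions on df reuse the cache.
df.cache()
df.count() // materializes the cache
df.count() // served from the cache

// In contrast, every call to df.rdd produces a new RDD, so
// df.rdd.cache() caches an RDD that df itself never reuses.
```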

Re: [MLlib] RowMatrix computeSVD Native ARPACK support not detecting.

2017-10-09 Thread Weichen Xu
Do you get warning messages such as: `Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS` `Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS`? These two warnings are thrown in `com.github.fommil.netlib.BLAS`, but it catches the original exception…

Re: Example of GBTClassifier

2017-10-02 Thread Weichen Xu
It is probably an Eclipse issue. The method is there, in the superclass `Predictor`. On Mon, Oct 2, 2017 at 11:51 PM, mckunkel wrote: > Greetings, > I am trying to run the example in the examples directory for the > GBTClassifier. But when I view this code in Eclipse, I get an error such > that > "The…

Re: Replicating a row n times

2017-09-29 Thread Weichen Xu
I suggest you use `monotonicallyIncreasingId`, which is highly efficient. But note that the IDs it generates will not be consecutive. On Fri, Sep 29, 2017 at 3:21 PM, Kanagha Kumar wrote: > Thanks for the response. > I can use either row_number() or monotonicallyIncreasingId to generate > uniqueI…
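
A minimal sketch (using the snake_case name the function has in org.apache.spark.sql.functions; the input `df` is assumed):

```scala
import org.apache.spark.sql.functions.monotonically_increasing_id

// IDs are unique and increasing but not consecutive: each partition
// writes into its own range of the 64-bit space.
val withIds = df.withColumn("uniqueId", monotonically_increasing_id())
```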

Re: Applying a Java script to many files: Java API or also Python API?

2017-09-29 Thread Weichen Xu
…n that case, is any difference in performance (from coding directly in > the Java API) expected? Thanks. > > > > On Sep 28, 2017, at 3:32 AM, Weichen Xu wrote: > > I think you have to use the Spark Java API; in PySpark, functions running on > Spark executors (such as map functi…

Re: pyspark histogram

2017-09-27 Thread Weichen Xu
If you want to avoid pulling values into Python, you can use the Hive function "histogram_numeric"; you will need to set `SparkSession.enableHiveSupport()`. But note that calling Hive functions in Spark will also slow down performance. Spark SQL hasn't implemented "histogram_numeric" yet, but I think it will b…
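
A sketch of calling the Hive UDAF, assuming Hive support is available (the table and column names are illustrative):

```scala
import org.apache.spark.sql.SparkSession

// Hive support is required for histogram_numeric.
val spark = SparkSession.builder()
  .appName("histogram-example")
  .enableHiveSupport()
  .getOrCreate()

spark.range(0, 1000).toDF("value").createOrReplaceTempView("t")

// Approximate 10-bin histogram, computed in the engine rather than
// by pulling all the values out to the driver.
spark.sql("SELECT histogram_numeric(value, 10) FROM t").show(truncate = false)
```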

Re: Applying a Java script to many files: Java API or also Python API?

2017-09-27 Thread Weichen Xu
I think you have to use the Spark Java API; in PySpark, functions running on Spark executors (such as map functions) can only be written in Python. On Thu, Sep 28, 2017 at 12:48 AM, Giuseppe Celano < cel...@informatik.uni-leipzig.de> wrote: > Hi everyone, > > I would like to apply a Java script to many f…

Re: How to pass sparkSession from driver to executor

2017-09-21 Thread Weichen Xu
Spark does not allow executor code to use `sparkSession`. But I think you can move all of the json files into one directory and then run: ``` spark.read.json("/path/to/jsonFileDir") ``` But if you want to get the filename at the same time, you can use ``` spark.sparkContext.wholeTextFiles("/path/to/jsonFileDir") ```…
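
Putting both suggestions together (the directory path comes from the message above):

```scala
// Read every JSON file under the directory as one DataFrame.
val df = spark.read.json("/path/to/jsonFileDir")

// Or keep the originating filename: wholeTextFiles yields (path, content) pairs.
val files = spark.sparkContext.wholeTextFiles("/path/to/jsonFileDir")
files.map { case (path, content) => (path, content.length) }
  .take(5)
  .foreach(println)
```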

Re: Pyspark define UDF for windows

2017-09-20 Thread Weichen Xu
A UDF cannot be used as a window function. You can use a built-in window function or a UDAF instead. On Wed, Sep 20, 2017 at 7:23 PM, Simon Dirmeier wrote: > Dear all, > I am trying to partition a DataFrame into windows and then, for every > column and window, use a custom function (udf) using Spark's Python > in…
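
A small sketch of the built-in route, assuming a spark-shell session and an illustrative (group, value) schema:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(("a", 1.0), ("a", 2.0), ("b", 3.0)).toDF("group", "value")

// A built-in aggregate applied over a window; a plain UDF cannot be
// used here, but built-ins (and UDAFs) can.
val w = Window.partitionBy("group").orderBy("value")
val result = df.withColumn("runningAvg", avg($"value").over(w))
```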

Re: Is there a SparkILoop for Java?

2017-09-20 Thread Weichen Xu
I haven't heard of one. It seems that Java does not have an official REPL. On Wed, Sep 20, 2017 at 8:38 PM, kant kodali wrote: > Hi All, > > I am wondering if there is a SparkILoop for Java, so I can pass Java…

Re: for loops in pyspark

2017-09-20 Thread Weichen Xu
Spark manages memory allocation and release automatically. Can you post the complete program? That would help with checking where it goes wrong. On Wed, Sep 20, 2017 at 8:12 PM, Alexander Czech < alexander.cz...@googlemail.com> wrote: > Hello all, > > I'm running a pyspark script that makes use of a for loop to cr…