Re: [VOTE] [SPARK-25994] SPIP: DataFrame-based Property Graphs, Cypher Queries, and Algorithms

2019-02-06 Thread Yanbo Liang
+1 for the proposal On Thu, Jan 31, 2019 at 12:46 PM Mingjie Tang wrote: > +1, this is a very very important feature. > > Mingjie > > On Thu, Jan 31, 2019 at 12:42 AM Xiao Li wrote: > >> Change my vote from +1 to ++1 >> >> On Wed, Jan 30, 2019 at 6:20 AM, Xiangrui Meng wrote: >> >>> Correction: +0 vote

Re: [VOTE] SPARK 2.3.2 (RC1)

2018-07-11 Thread Yanbo Liang
+1 On Tue, Jul 10, 2018 at 10:15 PM Saisai Shao wrote: > https://issues.apache.org/jira/browse/SPARK-24530 is just merged, I will > cancel this vote and prepare a new RC2 cut with doc fixed. > > Thanks > Saisai > > On Wed, Jul 11, 2018 at 12:25 PM, Wenchen Fan wrote: > >> +1 >> >> On Wed, Jul 11, 2018 at 1:31

Re: [MLLib] Logistic Regression and standardization

2018-04-13 Thread Yanbo Liang
Hi Filipp, MLlib’s LR implementation handles standardization the same way as R’s glmnet. You don’t need to care about the implementation detail: the coefficients are always returned on the original scale, so it should return the same result as other popular ML libraries. Could
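
To illustrate the "original scale" point above, here is a minimal sketch in plain Python. It assumes a linear model fit on features that were both centered and scaled; the function names and all numbers are made up for the example, and whether a given library centers as well as scales is an implementation detail this sketch does not try to reproduce.

```python
def to_original_scale(w_std, b_std, means, stds):
    """Map weights fit on standardized features back to the raw feature scale."""
    # Each weight is divided by its feature's standard deviation.
    w = [ws / s for ws, s in zip(w_std, stds)]
    # The intercept absorbs the shift introduced by centering.
    b = b_std - sum(ws * m / s for ws, m, s in zip(w_std, means, stds))
    return w, b


def margin(w, b, x):
    """Linear margin b + w . x, the input to the logistic function."""
    return b + sum(wi * xi for wi, xi in zip(w, x))


# Hypothetical feature statistics and a hypothetical standardized-scale model.
means = [10.0, 200.0]
stds = [2.0, 50.0]
w_std, b_std = [0.8, -1.5], 0.3

w, b = to_original_scale(w_std, b_std, means, stds)

# The same raw point, scored both ways, gives the same margin.
x = [13.0, 150.0]
x_std = [(xi - m) / s for xi, m, s in zip(x, means, stds)]
same = abs(margin(w, b, x) - margin(w_std, b_std, x_std)) < 1e-9
```

Because the two margins agree, predictions do not depend on which scale the optimizer worked in, which is why the reported coefficients can be compared across libraries.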

[CFP] DataWorks Summit, San Jose, 2018

2018-02-07 Thread Yanbo Liang
Hi All, DataWorks Summit, San Jose, 2018 is a good place to share your experience of advanced analytics, data science, machine learning and deep learning. We have an Artificial Intelligence and Data Science session to cover technologies such as: Apache Spark, scikit-learn, TensorFlow, Keras,

Re: Hinge Gradient

2017-12-16 Thread Yanbo Liang
Hello Deb, Optimizing a non-smooth function with LBFGS really should be considered carefully. Is there any literature that proves changing max to soft-max can behave well? I’m more than happy to see some benchmarks if you have any. + Yuhao, who made a similar effort in this PR:
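
For readers unfamiliar with the idea under discussion, the sketch below shows one common way to smooth max(0, z): the softplus function log(1 + exp(beta * z)) / beta. This is an assumption about what "changing max to soft-max" refers to, not the code from the thread or the PR, and the parameter name beta is made up here. It says nothing about how well L-BFGS behaves on the result; it only shows that the smoothed function is differentiable everywhere and stays within log(2) / beta of the original.

```python
import math


def hinge(z):
    """Non-smooth piece of the hinge loss: max(0, z), with z = 1 - y * margin."""
    return max(0.0, z)


def soft_hinge(z, beta=10.0):
    """Softplus smoothing of max(0, z), written to avoid overflow for large |z|."""
    t = beta * z
    return (max(t, 0.0) + math.log1p(math.exp(-abs(t)))) / beta


def soft_hinge_grad(z, beta=10.0):
    """Derivative of soft_hinge with respect to z: the logistic sigmoid of beta * z."""
    t = beta * z
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


# Gap between the smoothed and the original function at a few points.
gaps = [soft_hinge(z) - hinge(z) for z in (-5.0, -0.1, 0.0, 0.1, 5.0)]
```

A larger beta tracks max(0, z) more closely but makes the gradient change more abruptly near z = 0, so the choice is a trade-off rather than a free parameter.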

[CFP] DataWorks Summit Europe 2018 - Call for abstracts

2017-12-09 Thread Yanbo Liang
The DataWorks Summit Europe is in Berlin, Germany this year, on April 16-19, 2018. This is a great place to talk about work you are doing in Apache Spark or how you are using Spark for SQL/streaming processing, machine learning and data science. Information on submitting an abstract is at

Re: Welcoming Tejas Patil as a Spark committer

2017-10-06 Thread Yanbo Liang
Congratulations Tejas. On Fri, Oct 6, 2017 at 1:31 PM, DB Tsai wrote: > Congratulations! > > On Wed, Oct 4, 2017 at 6:55 PM, Liwei Lin wrote: > > Congratulations! > > > > Cheers, > > Liwei > > > > On Wed, Oct 4, 2017 at 2:27 PM, Yuval Itzchakov

Re: [VOTE][SPIP] SPARK-21866 Image support in Apache Spark

2017-09-23 Thread Yanbo Liang
+1 On Sat, Sep 23, 2017 at 7:08 PM, Noman Khan wrote: > +1 > > Regards > Noman > -- > *From:* Denny Lee > *Sent:* Friday, September 22, 2017 2:59:33 AM > *To:* Apache Spark Dev; Sean Owen; Tim Hunter > *Cc:* Danil

Re: Welcoming Saisai (Jerry) Shao as a committer

2017-08-28 Thread Yanbo Liang
Congratulations, Jerry. On Tue, Aug 29, 2017 at 9:42 AM, John Deng wrote: > > Congratulations, Jerry ! > > On 8/29/2017 09:28,Matei Zaharia > wrote: > > Hi everyone, > > The PMC recently voted to add Saisai (Jerry) Shao as a >

Re: Welcoming Hyukjin Kwon and Sameer Agarwal as committers

2017-08-07 Thread Yanbo Liang
Great. Congratulations, Hyukjin and Sameer! On Tue, Aug 8, 2017 at 7:53 AM, Holden Karau wrote: > Congrats! > > On Mon, Aug 7, 2017 at 3:54 PM Bryan Cutler wrote: > >> Great work Hyukjin and Sameer! >> >> On Mon, Aug 7, 2017 at 10:22 AM, Mridul

Re: [VOTE] Apache Spark 2.2.0 (RC6)

2017-07-03 Thread Yanbo Liang
+1 On Mon, Jul 3, 2017 at 5:35 AM, Herman van Hövell tot Westerflier < hvanhov...@databricks.com> wrote: > +1 > > On Sun, Jul 2, 2017 at 11:32 PM, Ricardo Almeida < > ricardo.alme...@actnowib.com> wrote: > >> +1 (non-binding) >> >> Built and tested with -Phadoop-2.7 -Dhadoop.version=2.7.3 -Pyarn

Re: how to retain part of the features in LogisticRegressionModel (spark2.0)

2017-03-20 Thread Yanbo Liang
Do you want a sparse model in which most of the coefficients are zero? If yes, using L1 regularization leads to sparsity. But the size of the LogisticRegressionModel coefficients vector is still equal to the number of features; you can get the non-zero elements manually. Actually, it would be a
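
Getting the non-zero elements "manually" can be as simple as the sketch below. It works on a plain Python list; the helper name and the coefficient values are invented for illustration. With PySpark, something like `model.coefficients.toArray()` could supply the list, but that wiring is left out so the example runs without a cluster.

```python
def nonzero_coefficients(coefficients, tol=0.0):
    """Return {feature_index: value} for coefficients with magnitude above tol."""
    return {i: c for i, c in enumerate(coefficients) if abs(c) > tol}


# Hypothetical coefficients from an L1-regularized fit: length equals the
# number of features, but most entries are exactly zero.
coefficients = [0.0, 1.25, 0.0, 0.0, -0.4, 0.0]

kept = nonzero_coefficients(coefficients)
```

The keys of `kept` are the indices of the retained features in the original feature vector, so they can be used to select the matching input columns.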

Re: welcoming Takuya Ueshin as a new Apache Spark committer

2017-02-13 Thread Yanbo Liang
Congratulations! On Mon, Feb 13, 2017 at 3:29 PM, Kazuaki Ishizaki wrote: > Congrats! > > Kazuaki Ishizaki > > > > From:Reynold Xin > To:"dev@spark.apache.org" > Date:2017/02/14 04:18 > Subject:

Re: welcoming Burak and Holden as committers

2017-01-24 Thread Yanbo Liang
Congratulations, Burak and Holden. On Tue, Jan 24, 2017 at 7:32 PM, Chester Chen wrote: > Congratulation to both. > > > > Holden, we need catch up. > > > > > > *Chester Chen * > > ■ Senior Manager – Data Science & Engineering > > 3000 Clearview Way > > San Mateo, CA

Re: Straw poll: dropping support for things like Scala 2.10

2016-10-27 Thread Yanbo Liang
+1 On Thu, Oct 27, 2016 at 3:15 AM, Reynold Xin wrote: > I created a JIRA ticket to track this: https://issues.apache. > org/jira/browse/SPARK-18138 > > > > On Thu, Oct 27, 2016 at 10:19 AM, Steve Loughran > wrote: > >> >> On 27 Oct 2016, at 10:03,

Re: Could we expose log likelihood of EM algorithm in MLLIB?

2016-10-07 Thread Yanbo Liang
> Thanks! > > From: didi <wangleikidd...@didichuxing.com> > Date: Saturday, October 8, 2016, 12:21 AM > To: Yanbo Liang <yblia...@gmail.com> > > Cc: "dev@spark.apache.org" <dev@spark.apache.org>, "u...@spark.apache.org" > <u...@spark.apache.org> >

Re: Could we expose log likelihood of EM algorithm in MLLIB?

2016-10-07 Thread Yanbo Liang
It's a good question and I had a similar requirement in my work. I'm currently copying the implementation from mllib to ml, and then exposing the maximum log likelihood. I will send this PR soon. Thanks. Yanbo On Fri, Oct 7, 2016 at 1:37 AM, 王磊(安全部) wrote: > > Hi,

Re: welcoming Xiao Li as a committer

2016-10-04 Thread Yanbo Liang
Congrats and welcome! On Tue, Oct 4, 2016 at 9:01 AM, Herman van Hövell tot Westerflier < hvanhov...@databricks.com> wrote: > Congratulations Xiao! Very well deserved! > > On Mon, Oct 3, 2016 at 10:46 PM, Reynold Xin wrote: > >> Hi all, >> >> Xiao Li, aka gatorsmile, has

Re: [VOTE] Release Apache Spark 2.0.1 (RC3)

2016-09-26 Thread Yanbo Liang
+1 On Mon, Sep 26, 2016 at 4:53 PM, akchin wrote: > +1 (non-bind) > -Pyarn -Phadoop-2.7 -Phive -Phive-thriftserver -Psparkr > CentOS 7.2 / openjdk version "1.8.0_101" > > > > > - > IBM Spark Technology Center > -- > View this message in context: http://apache-spark- >

Discuss SparkR executors/workers support virtualenv

2016-09-07 Thread Yanbo Liang
Hi All, Many users need to use third-party R packages in executors/workers, but SparkR cannot satisfy this requirement elegantly. For example, you have to bother the IT/administrators of the cluster to deploy these R packages on each executor/worker node, which is very

Re: KMeans calls takeSample() twice?

2016-08-31 Thread Yanbo Liang
I added a println at the start of the takeSample function and found it was printed only once for each run of KMeans. Thanks Yanbo On Tue, Aug 30, 2016 at 10:31 AM, Georgios Samaras < georgesamaras...@gmail.com> wrote: > Good catch Shivaram. However, the very next line states: > > // this shouldn't

Re: KMeans calls takeSample() twice?

2016-08-30 Thread Yanbo Liang
I ran KMeans with probes and found that takeSample() was actually called only once. It looks like this issue was caused by a display mistake in the Spark UI. Thanks Yanbo On Mon, Aug 29, 2016 at 2:34 PM, gsamaras wrote: > After reading the internal code of Spark about it,

Re: Welcoming Felix Cheung as a committer

2016-08-08 Thread Yanbo Liang
Congrats Felix! 2016-08-08 18:21 GMT-07:00 Kai Jiang : > Congrats Felix! > > On Mon, Aug 8, 2016, 18:14 Jeff Zhang wrote: > >> Congrats Felix! >> >> On Tue, Aug 9, 2016 at 8:49 AM, Hyukjin Kwon wrote: >> >>> Congratulations! >>> >>>

Re: [MLlib] Term Frequency in TF-IDF seems incorrect

2016-08-01 Thread Yanbo Liang
Hi Hao, HashingTF directly applies a hash function (MurmurHash3) to the features to determine their column index. It does not take the term frequency or the length of the document into account. It does similar work to sklearn's FeatureHasher. The result is increased speed and reduced
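
The hashing step described above can be sketched in a few lines of plain Python. This is an illustration of the general hashing trick, not Spark's implementation: `zlib.crc32` stands in for MurmurHash3 so the example needs only the standard library, and the function name and `num_features` value are assumptions. The sketch accumulates raw counts per bucket and applies no normalization by document length.

```python
import zlib


def hashing_tf(tokens, num_features=16):
    """Map tokens to bucket indices by hashing and accumulate a count per bucket."""
    counts = {}
    for token in tokens:
        # The column index comes straight from the hash; no vocabulary is kept.
        index = zlib.crc32(token.encode("utf-8")) % num_features
        counts[index] = counts.get(index, 0.0) + 1.0
    return counts


doc = ["spark", "mllib", "spark", "hashing", "spark"]
vector = hashing_tf(doc)
```

Because no term-to-index dictionary is built, two different terms can land in the same bucket; choosing a larger `num_features` makes such collisions less likely at the cost of a wider vector.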

Re: mllib based on dataset or dataframe

2016-07-10 Thread Yanbo Liang
DataFrame is a special case of Dataset, so they mean the same thing. Actually, the ML pipeline API will accept Dataset[_] instead of DataFrame in Spark 2.0. More accurately, we can say that MLlib will focus on the Dataset-based API for further development. Thanks Yanbo 2016-07-10 20:35

Re: MinMaxScaler With features include category variables

2016-07-01 Thread Yanbo Liang
You can combine the columns that need to be normalized into a vector with VectorAssembler and normalize it. Assemble the columns that should not be normalized separately. Finally, assemble the two vectors into one vector as the feature column and feed it into model training.
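
The three steps above can be mimicked without Spark to show the data flow. In this sketch plain lists stand in for the assembled vectors, `min_max_scale` is a hypothetical helper written for the example rather than a Spark API, and mapping a constant column to 0.5 is a choice made here.

```python
def min_max_scale(rows):
    """Rescale each column of rows to [0, 1]; constant columns map to 0.5."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.5
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


# Step 1: the columns that need normalizing, assembled per row.
numeric = [[10.0, 200.0], [20.0, 400.0], [15.0, 300.0]]
# Step 2: the columns that must stay as they are (e.g. encoded categories).
categorical = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
# Step 3: scale only the first group, then join both groups into one feature row.
features = [n + c for n, c in zip(min_max_scale(numeric), categorical)]
```

Keeping the categorical block out of the scaler leaves its 0/1 indicator values untouched in the final feature vector.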

Re: Creation of SparkML Estimators in Java broken?

2016-05-27 Thread Yanbo Liang
Created JIRA https://issues.apache.org/jira/browse/SPARK-15605 . 2016-05-27 1:02 GMT-07:00 Yanbo Liang <yblia...@gmail.com>: > This is because we do not have excellent coverage for Java-friendly > wrappers. > I found we only implement JavaParams, which is the wrapper of Scala Par

Re: Creation of SparkML Estimators in Java broken?

2016-05-27 Thread Yanbo Liang
This is because we do not have excellent coverage for Java-friendly wrappers. I found we only implement JavaParams, which is the wrapper of Scala Params. We still need Java-friendly wrappers for other traits that extend Scala Params. For example, in Scala we have: trait HasLabelCol extends

Re: Cross Validator to work with K-Fold value of 1?

2016-05-04 Thread Yanbo Liang
Here is the JIRA and PR for supporting PolynomialExpansion with degree 1, and it has been merged. https://issues.apache.org/jira/browse/SPARK-13338 https://github.com/apache/spark/pull/11216 2016-05-02 9:20 GMT-07:00 Nick Pentreath : > There is a JIRA and PR around for

Re: [spark.ml] Why is private class ColumnPruner?

2016-04-19 Thread Yanbo Liang
Hi Jacek, This is because ColumnPruner is currently only used for RFormula; we did not expose it as a feature transformer. Please feel free to create a JIRA and work on it. Thanks Yanbo 2016-03-25 8:50 GMT-07:00 Jacek Laskowski : > Hi, > > Came across `private class ColumnPruner`

Re: Organizing Spark ML example packages

2016-04-19 Thread Yanbo Liang
This sounds good to me, and it will make the ML examples neater. 2016-04-14 5:28 GMT-07:00 Nick Pentreath : > Hey Spark devs > > I noticed that we now have a large number of examples for ML & MLlib in > the examples project - 57 for ML and 67 for MLLIB to be precise.

Re: SparkML algos limitations question.

2016-01-04 Thread Yanbo Liang
can handle large models. (master should > have more memory because it runs LBFGS) In my experiments, I’ve trained the > models 12M and 32M parameters without issues. > > > > Best regards, Alexander > > > > *From:* Yanbo Liang [mailto:yblia...@gmail.com] > *Sent:* Sunda

Re: Reply: How can I get the column data based on specific column name and then stored these data in array or list ?

2015-12-25 Thread Yanbo Liang
Actually you can call df.collect_list("a"). 2015-12-25 16:00 GMT+08:00 Jeff Zhang : > You can use udf to convert one column for array type. Here's one sample > > val conf = new SparkConf().setMaster("local[4]").setAppName("test") > val sc = new SparkContext(conf) > val

Re: query on SVD++

2015-12-02 Thread Yanbo Liang
Do you mean the SVDPlusPlus in GraphX? If you want to use SVD++ to train a CF model, I recommend using ALS, which is more efficient and has a Python interface. 2015-12-02 11:21 GMT+08:00 张志强(旺轩) : > Hi All, > > > > I came across the SVD++ algorithm implementation in

[mllib] useFeatureScaling likes hardcode in LogisticRegressionWithLBFGS and is not comprehensive for users.

2014-11-26 Thread Yanbo Liang
Hi All, LogisticRegressionWithLBFGS sets useFeatureScaling to true by default, which can improve the convergence during optimization. However, other model training methods such as LogisticRegressionWithSGD do not set useFeatureScaling to true by default, and the corresponding set function is private

Re: [MLlib] LogisticRegressionWithSGD and LogisticRegressionWithLBFGS converge with different weights.

2014-09-29 Thread Yanbo Liang
by multiply a constant to the weights. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Sun, Sep 28, 2014 at 11:48 AM, Yanbo Liang yanboha...@gmail.com wrote: Hi We have used

Re: A Spark Compilation Question

2014-09-26 Thread Yanbo Liang
Hi Hansu, I have encountered the same problem. Maven compiled the Avro file and generated the corresponding Java file in a new directory which is not a source directory of the project. I modified the pom.xml file and it now works. The lines marked in red are added; you can add them to your

Re: Spark SQL use of alias in where clause

2014-09-24 Thread Yanbo Liang
Maybe it's the way SQL works. The select part is executed after the where filter is applied, so you cannot use an alias declared in the select part in the where clause. Hive and Oracle behave the same as Spark SQL. 2014-09-25 8:58 GMT+08:00 Du Li l...@yahoo-inc.com.invalid: Hi, The following query
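
Two portable ways around the restriction described above are to define the alias in an inner query and filter in the outer one, or to repeat the expression in the where clause. The sketch below runs both against an in-memory SQLite database purely as a convenient local engine; the table `t`, column `a` and alias `b` are made up, and SQLite may be more lenient about aliases than the engines named in the thread, so it is not used here to reproduce the error itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER)")
conn.executemany("INSERT INTO t (a) VALUES (?)", [(1,), (2,), (3,)])

# Option 1: declare the alias in a subquery, then filter on it outside.
via_subquery = conn.execute(
    "SELECT b FROM (SELECT a + 1 AS b FROM t) AS s WHERE b > 2 ORDER BY b"
).fetchall()

# Option 2: repeat the aliased expression inside the WHERE clause.
via_repeat = conn.execute(
    "SELECT a + 1 AS b FROM t WHERE a + 1 > 2 ORDER BY b"
).fetchall()

conn.close()
```

Both queries return the same rows; the subquery form avoids duplicating a long expression, while the repeated form keeps the query flat.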

[mllib] LogisticRegressionWithLBFGS interface is not consistent with LogisticRegressionWithSGD

2014-09-13 Thread Yanbo Liang
Hi All, I found that the LogisticRegressionWithLBFGS interface is not consistent with LogisticRegressionWithSGD in master and the 1.1 release. https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala#L199 In the above code snippet,

Re: [mllib] LogisticRegressionWithLBFGS interface is not consistent with LogisticRegressionWithSGD

2014-09-13 Thread Yanbo Liang
I also found https://github.com/apache/spark/commit/8f6e2e9df41e7de22b1d1cbd524e20881f861dd0 had resolved this issue, but it seems that the correct code snippet does not appear in master or the 1.1 release. 2014-09-13 17:12 GMT+08:00 Yanbo Liang yanboha...@gmail.com: Hi All, I found