[CFP] DataWorks Summit, San Jose, 2018

2018-02-07 Thread Yanbo Liang
, Apache MXNet, PyTorch/Torch, XGBoost, Apache Livy, Apache Zeppelin, Jupyter, etc. Please consider submitting an abstract at https://dataworkssummit.com/san-jose-2018/ Thanks Yanbo

[CFP] DataWorks Summit Europe 2018 - Call for abstracts

2017-12-09 Thread Yanbo Liang
The DataWorks Summit Europe is in Berlin, Germany this year, on April 16-19, 2018. This is a great place to talk about work you are doing in Apache Spark or how you are using Spark for SQL/streaming processing, machine learning and data science. Information on submitting an abstract is at

Re: Apache Spark: Parallelization of Multiple Machine Learning ALgorithm

2017-09-05 Thread Yanbo Liang
You are right, native Spark MLlib CrossValidation can't run different algorithms in parallel. Thanks Yanbo On Tue, Sep 5, 2017 at 10:56 PM, Timsina, Prem <prem.tims...@mssm.edu> wrote: > Hi Yanboo, > > Thank You, I very much appreciate your help. > > For the current use c
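
A rough driver-side workaround, not a built-in MLlib feature (a sketch; `trainingData` is an assumed DataFrame with the default label/features columns): launch the fits from separate futures so their Spark jobs can be scheduled concurrently.

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration
    import org.apache.spark.ml.classification.{LogisticRegression, RandomForestClassifier}

    // Each fit() submits its own Spark jobs; running them from futures lets the
    // scheduler interleave the two trainings if cluster resources allow.
    val lrFuture = Future { new LogisticRegression().setMaxIter(50).fit(trainingData) }
    val rfFuture = Future { new RandomForestClassifier().setNumTrees(100).fit(trainingData) }

    val lrModel = Await.result(lrFuture, Duration.Inf)
    val rfModel = Await.result(rfFuture, Duration.Inf)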

Re: sparkR 3rd library

2017-09-05 Thread Yanbo Liang
of SparkR UDF, please refer to this test case: https://github.com/apache/spark/blob/master/R/pkg/tests/fulltests/test_context.R#L171 Thanks Yanbo On Tue, Sep 5, 2017 at 6:42 AM, Felix Cheung <felixcheun...@hotmail.com> wrote: > Can you include the code you call spa

Re: Apache Spark: Parallelization of Multiple Machine Learning ALgorithm

2017-09-05 Thread Yanbo Liang
If yes, you can also try spark-sklearn, which can distribute multiple model trainings (single-node training with sklearn) across a distributed cluster and perform parameter search. FYI: https://github.com/databricks/spark-sklearn Thanks Yanbo On Tue, Sep 5, 2017 at 9:56 PM, Patrick McCarthy <pmc

Re: Training A ML Model on a Huge Dataframe

2017-08-24 Thread Yanbo Liang
Hi Sea, Could you let us know which ML algorithm you use? What's the number of instances and the dimension of your dataset? AFAIK, Spark MLlib can train models with several million features if you configure it correctly. Thanks Yanbo On Thu, Aug 24, 2017 at 7:07 AM, Suzen, Mehmet <su...@acm.

Re: [BlockMatrix] multiply is an action or a transformation ?

2017-08-20 Thread Yanbo Liang
Yanbo On Sun, Aug 13, 2017 at 10:30 PM, Jose Francisco Saray Villamizar < jsa...@gmail.com> wrote: > Hi Everyone, > > Sorry if the question is simple, or confusing, but I have not seen > anywhere in the documentation > the answer: > > Is multiply method in BlockMatrix a

Re: Huber regression in PySpark?

2017-08-20 Thread Yanbo Liang
be merged into LinearRegression. I will update this PR ASAP, and I'm looking forward to your reviews and comments. After the Scala implementation is merged, it's very easy to add the corresponding PySpark API, and then you can use it to train a huber regression model in a distributed environment. Thanks Yanbo

Re: Collecting matrix's entries raises an error only when run inside a test

2017-07-06 Thread Yanbo Liang
Hi Simone, Would you mind sharing a minimized code example to reproduce this issue? Yanbo On Wed, Jul 5, 2017 at 10:52 PM, Simone Robutti <simone.robu...@gmail.com> wrote: > Hello, I have this problem and Google is not helping. Instead, it looks > like an unreported bug and there

Re: PySpark 2.1.1 Can't Save Model - Permission Denied

2017-06-28 Thread Yanbo Liang
and file system. Could you write a Spark DataFrame to this file system and check whether it works well? Thanks Yanbo On Tue, Jun 27, 2017 at 8:47 PM, John Omernik <j...@omernik.com> wrote: > Hello all, I am running PySpark 2.1.1 as a user, jomernik. I am working > through some docume

Re: Help in Parsing 'Categorical' type of data

2017-06-23 Thread Yanbo Liang
Please consider using other classification models such as logistic regression or GBT. Naive Bayes usually treats features as counts, which makes it unsuitable for features generated by a one-hot encoder. Thanks Yanbo On Wed, May 31, 2017 at 3:58 PM, Amlan Jyoti <amlan.jy...@tcs.com>

Re: RowMatrix: tallSkinnyQR

2017-06-23 Thread Yanbo Liang
Since this function is used to compute the QR decomposition of a RowMatrix with a tall-and-skinny shape, the output R is always a small matrix. On Fri, Jun 9, 2017 at 10:33 PM, Arun wrote: > hi > > *def tallSkinnyQR(computeQ: Boolean = false):

Re: spark higher order functions

2017-06-23 Thread Yanbo Liang
See reply here: http://apache-spark-developers-list.1001551.n3.nabble.com/Will-higher-order-functions-in-spark-SQL-be-pushed-upstream-td21703.html On Tue, Jun 20, 2017 at 10:02 PM, AssafMendelson wrote: > Hi, > > I have seen that databricks have higher order functions

Re: gfortran runtime library for Spark

2017-06-23 Thread Yanbo Liang
gfortran runtime library is still required for Spark 2.1 for better performance. If it's not present on your nodes, you will see a warning message and a pure JVM implementation will be used instead, but you will not get the best performance. Thanks Yanbo On Wed, Jun 21, 2017 at 5:30 PM, Saroj C

Re: BinaryClassificationMetrics only supports AreaUnderPR and AreaUnderROC?

2017-05-12 Thread Yanbo Liang
Yeah, for binary data, you can also use MulticlassClassificationEvaluator to evaluate other metrics which BinaryClassificationEvaluator doesn't cover, such as accuracy, f1, weightedPrecision and weightedRecall. Thanks Yanbo On Thu, May 11, 2017 at 10:31 PM, Lan Jiang <lanjiang...@gmail.
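
A minimal sketch (assuming `predictions` is the DataFrame produced by model.transform, with label, rawPrediction, and prediction columns):

    import org.apache.spark.ml.evaluation.{BinaryClassificationEvaluator, MulticlassClassificationEvaluator}

    // Area under ROC from the binary evaluator.
    val auc = new BinaryClassificationEvaluator()
      .setLabelCol("label")
      .setRawPredictionCol("rawPrediction")
      .setMetricName("areaUnderROC")
      .evaluate(predictions)

    // F1 (or weightedPrecision / weightedRecall) from the multiclass evaluator.
    val f1 = new MulticlassClassificationEvaluator()
      .setLabelCol("label")
      .setPredictionCol("prediction")
      .setMetricName("f1")
      .evaluate(predictions)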

[CFP] DataWorks Summit/Hadoop Summit Sydney - Call for abstracts

2017-05-03 Thread Yanbo Liang
The Australia/Pacific version of DataWorks Summit is in Sydney this year, September 20-21. This is a great place to talk about work you are doing in Apache Spark or how you are using Spark. Information on submitting an abstract is at

Re: Initialize Gaussian Mixture Model using Spark ML dataframe API

2017-05-01 Thread Yanbo Liang
Hi Tim, The Spark ML API doesn't support setting an initial model for GMM currently. I hope we can get this feature into Spark 2.3. Thanks Yanbo On Fri, Apr 28, 2017 at 1:46 AM, Tim Smith <secs...@gmail.com> wrote: > Hi, > > I am trying to figure out the API to initialize a gaussian mixtur

Re: How to create SparkSession using SparkConf?

2017-04-28 Thread Yanbo Liang
StreamingContext is an old API; if you want to process streaming data, you can use SparkSession directly. FYI: http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html Thanks Yanbo On Fri, Apr 28, 2017 at 12:12 AM, kant kodali <kanth...@gmail.com> wrote: > Act

Re: How to create SparkSession using SparkConf?

2017-04-27 Thread Yanbo Liang
Could you try the following way? val spark = SparkSession.builder.appName("my-application").config("spark.jars", "a.jar, b.jar").getOrCreate() Thanks Yanbo On Thu, Apr 27, 2017 at 9:21 AM, kant kodali <kanth...@gmail.com> wrote: > I am using Spark 2.

Re: Synonym handling replacement issue with UDF in Apache Spark

2017-04-27 Thread Yanbo Liang
What about JOINing your table with a mapping table? On Thu, Apr 27, 2017 at 9:58 PM, Nishanth wrote: > I am facing a major issue on replacement of Synonyms in my DataSet. > > I am trying to replace the synonym of the Brand names to its equivalent > names. > > I have
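
A sketch of that approach (`data`, `synonyms`, and all column names are assumptions): left-join against a mapping of alias to canonical name and fall back to the original value when there is no match.

    import org.apache.spark.sql.functions.coalesce

    // data: DataFrame with a "brand" column; synonyms: DataFrame with "alias" and "canonical" columns.
    val resolved = data
      .join(synonyms, data("brand") === synonyms("alias"), "left_outer")
      .withColumn("brand", coalesce(synonyms("canonical"), data("brand")))
      .drop("alias")
      .drop("canonical")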

Re: how to create List in pyspark

2017-04-27 Thread Yanbo Liang
;split_value", split_func("value")).show() Thanks Yanbo On Tue, Apr 25, 2017 at 12:27 AM, Selvam Raman <sel...@gmail.com> wrote: > documentDF = spark.createDataFrame([ > > ("Hi I heard about Spark".split(" "), ), > > ("I

Re: how to retain part of the features in LogisticRegressionModel (spark2.0)

2017-03-20 Thread Yanbo Liang
be a sparse vector (or matrix for multinomial case) if it's sparse enough. Thanks Yanbo On Sun, Mar 19, 2017 at 5:02 AM, Dhanesh Padmanabhan <dhanesh12...@gmail.com > wrote: > It shouldn't be difficult to convert the coefficients to a sparse vector. > Not sure if that is what you
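
For example, a sketch assuming `model` is a fitted LogisticRegressionModel:

    // Compact the dense coefficient vector and inspect only the non-zero entries.
    val sparse = model.coefficients.toSparse
    sparse.indices.zip(sparse.values).foreach { case (i, v) =>
      println(s"feature $i -> $v")
    }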

Re: How does preprocessing fit into Spark MLlib pipeline

2017-03-17 Thread Yanbo Liang
Hi Adrian, Did you try SQLTransformer? Your preprocessing steps are SQL operations and can be handled by SQLTransformer in MLlib pipeline scope. Thanks Yanbo On Thu, Mar 9, 2017 at 11:02 AM, aATv <adr...@vidora.com> wrote: > I want to start using PySpark Mllib pipelines, but I don't u
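
A small sketch of SQLTransformer as a preprocessing stage (the statement and the "amount" column are made up for illustration; "__THIS__" stands for the input DataFrame):

    import org.apache.spark.ml.feature.SQLTransformer

    val sqlTrans = new SQLTransformer().setStatement(
      "SELECT *, log(amount + 1) AS logAmount FROM __THIS__ WHERE amount IS NOT NULL")
    val preprocessed = sqlTrans.transform(df)
    // sqlTrans can also be added directly as a Pipeline stage before the feature assembler and estimator.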

Re: ML PIC

2016-12-21 Thread Yanbo Liang
You can track https://issues.apache.org/jira/browse/SPARK-15784 for the progress. On Wed, Dec 21, 2016 at 7:08 AM, Nick Pentreath wrote: > It is part of the general feature parity roadmap. I can't recall offhand > any blocker reasons it's just resources > On Wed, 21

Re: Usage of mllib api in ml

2016-11-20 Thread Yanbo Liang
You can refer to this example ( http://spark.apache.org/docs/latest/ml-tuning.html#example-model-selection-via-cross-validation ), which uses BinaryClassificationEvaluator; it should be very straightforward to switch to MulticlassClassificationEvaluator. Thanks Yanbo On Sat, Nov 19, 2016 at 9:03 AM

Re: Spark ML DataFrame API - need cosine similarity, how to convert to RDD Vectors?

2016-11-19 Thread Yanbo Liang
ix(oldRDD, nRows, nCols) mat.columnSimilarities() Please feel free to let me know whether it can satisfy your requirements. Thanks Yanbo On Wed, Nov 16, 2016 at 9:26 AM, Russell Jurney <russell.jur...@gmail.com> wrote: > Asher, can you cast like that? Does that casting work? That is my > confusion: I
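
A sketch of that conversion (assuming the `features` column of `df` already holds spark.mllib Vectors; if it holds the new ml.linalg type, convert it first, e.g. with MLUtils.convertVectorColumnsFromML):

    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    val vectorRDD = df.select("features").rdd.map(row => row.getAs[Vector](0))
    val mat = new RowMatrix(vectorRDD)
    // Upper-triangular CoordinateMatrix of pairwise cosine similarities between columns.
    val similarities = mat.columnSimilarities()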

Re: VectorUDT and ml.Vector

2016-11-19 Thread Yanbo Liang
dataframe (Vector or Matrix)? I think it's ml.linalg.Vector, so you should use MLUtils.convertVectorColumnsFromML. Thanks Yanbo On Mon, Nov 7, 2016 at 5:25 AM, Ganesh <m...@ganeshkrishnan.com> wrote: > I am trying to run a SVD on a dataframe and I have used ml TF-IDF which > has crea
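
For example, a sketch (the "features" column name is an assumption):

    import org.apache.spark.mllib.util.MLUtils

    // Convert the new ml.linalg.Vector column back to the old mllib.linalg.Vector type
    // expected by RDD-based APIs such as RowMatrix / computeSVD.
    val converted = MLUtils.convertVectorColumnsFromML(df, "features")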

Re: why is method predict protected in PredictionModel

2016-11-19 Thread Yanbo Liang
This function is only used internally currently; we will expose it as public to support making predictions on a single instance. See the discussion at https://issues.apache.org/jira/browse/SPARK-10413. Thanks Yanbo On Thu, Nov 17, 2016 at 1:24 AM, wobu <buchn...@gmail.com> wrote: > Hi, > >

Re: Spark R guidelines for non-spark functions and coxph (Cox Regression for Time-Dependent Covariates)

2016-11-16 Thread Yanbo Liang
requirements. BTW, I'm the author of Spark AFTSurvivalRegression. If you have any more questions, please feel free to let me know. http://spark.apache.org/docs/latest/ml-classification-regression.html#survival-regression http://spark.apache.org/docs/latest/api/R/index.html Thanks Yanbo On Tue, Nov 15, 2016
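
A minimal AFTSurvivalRegression sketch of that usage (`df` and its columns are assumptions; "censor" uses 1.0 for an observed event and 0.0 for censored):

    import org.apache.spark.ml.regression.AFTSurvivalRegression

    val aft = new AFTSurvivalRegression()
      .setFeaturesCol("features")
      .setLabelCol("label")      // survival time
      .setCensorCol("censor")    // 1.0 = event observed, 0.0 = censored
    val model = aft.fit(df)
    println(s"coefficients: ${model.coefficients}, intercept: ${model.intercept}, scale: ${model.scale}")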

Re: HashingTF for TF.IDF computation

2016-10-23 Thread Yanbo Liang
generated by HashingTF or CountVectorizer. FYI http://spark.apache.org/docs/latest/ml-features.html#tf-idf Thanks Yanbo On Thu, Oct 20, 2016 at 10:00 AM, Ciumac Sergiu <ciumac.ser...@gmail.com> wrote: > Hello everyone, > > I'm having a usage issue with HashingTF class from Spark
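
A short sketch of that TF-IDF flow (assuming `docs` is a DataFrame with a "text" column; all names here are illustrative):

    import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}

    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(1 << 18)
    val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")

    val tf = hashingTF.transform(tokenizer.transform(docs))
    val tfidf = idf.fit(tf).transform(tf)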

Re: Did anybody come across this random-forest issue with spark 2.0.1.

2016-10-17 Thread Yanbo Liang
Please increase the value of "maxMemoryInMB" of your RandomForestClassifier or RandomForestRegressor. It's a warning which will not affect the result but may make your training slower. Thanks Yanbo On Mon, Oct 17, 2016 at 8:21 PM, 张建鑫(市场部) <zhangjian...@didichuxing.com> wrot
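
For example, a sketch with assumed column names:

    import org.apache.spark.ml.classification.RandomForestClassifier

    val rf = new RandomForestClassifier()
      .setLabelCol("indexedLabel")
      .setFeaturesCol("indexedFeatures")
      .setMaxMemoryInMB(512)   // default is 256 MB; raising it silences the warning and can speed up training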

Re: Logistic Regression Standardization in ML

2016-10-10 Thread Yanbo Liang
#L551 https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala#L588 Thanks Yanbo On Mon, Oct 10, 2016 at 7:27 AM, Sean Owen <so...@cloudera.com> wrote: > (BTW I think it means "when no standardization is a

Re: SVD output within Spark

2016-08-31 Thread Yanbo Liang
The signs of the eigenvectors are essentially arbitrary, so both the Spark and Matlab results are correct. Thanks On Thu, Jul 21, 2016 at 3:50 PM, Martin Somers wrote: > > just looking at a comparison between Matlab and Spark for svd with an > input matrix N > > > this is

Re: Spark MLlib question: load model failed with exception:org.json4s.package$MappingException: Did not find value which can be converted into java.lang.String

2016-08-18 Thread Yanbo Liang
is in maintenance mode. So do all your work under the same APIs. Thanks Yanbo 2016-08-17 1:30 GMT-07:00 <luohui20...@sina.com>: > Hello guys: > I have a problem in loading recommend model. I have 2 models, one is > good(able to get recommend result) and another is not working. I ch

Re: SPARK MLLib - How to tie back Model.predict output to original data?

2016-08-18 Thread Yanbo Liang
If you want to tie them with other data, I think the best way is to use DataFrame join operation on condition that they share an identity column. Thanks Yanbo 2016-08-16 20:39 GMT-07:00 ayan guha <guha.a...@gmail.com>: > Hi > > Thank you for your reply. Yes, I can get predicti
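
One way to sketch that (monotonically_increasing_id, `rawData`, and the column names are illustrative assumptions):

    import org.apache.spark.sql.functions.monotonically_increasing_id

    // Tag every row with an id before feature extraction, then join the predictions back on it.
    val withId = rawData.withColumn("id", monotonically_increasing_id())
    val predictions = model.transform(withId)
    val tied = withId.join(predictions.select("id", "prediction"), "id")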

Re: VectorUDT with spark.ml.linalg.Vector

2016-08-18 Thread Yanbo Liang
mode, so we strongly recommend users to use the DataFrame-based spark.ml API. Thanks Yanbo 2016-08-17 11:46 GMT-07:00 Michał Zieliński <zielinski.mich...@gmail.com>: > I'm using Spark 1.6.2 for Vector-based UDAF and this works: > > def inputSchema: StructType = new StructType().a

Re: VectorUDT with spark.ml.linalg.Vector

2016-08-16 Thread Yanbo Liang
It seems that VectorUDT is private and cannot be accessed outside of Spark currently. It should be public, but we need to do some refactoring before making it public. You can refer to the discussion at https://github.com/apache/spark/pull/12259 . Thanks Yanbo 2016-08-16 9:48 GMT-07:00 alexeys <a

Re: SPARK MLLib - How to tie back Model.predict output to original data?

2016-08-16 Thread Yanbo Liang
MLlib keeps the original dataset during transformation; it just appends new columns to the existing DataFrame. That is, you can get both the prediction values and the original features from the output DataFrame of model.transform. Thanks Yanbo 2016-08-16 17:48 GMT-07:00 ayan guha <guha.a...@gmail.
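
For instance, a sketch for a classification model (`testData` and the selected columns are assumptions):

    val predictions = model.transform(testData)
    // The output keeps all input columns and appends the new ones.
    predictions.printSchema()
    predictions.select("features", "probability", "prediction").show(5)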

Re: Spark's Logistic Regression runs unstable on Yarn cluster

2016-08-16 Thread Yanbo Liang
Could you check the log to see how many iterations your LoR runs? Does your program output the same model across different attempts? Thanks Yanbo 2016-08-12 3:08 GMT-07:00 olivierjeunen <olivierjeu...@gmail.com>: > I'm using pyspark ML's logistic regression implementation t

Re: Linear regression, weights constraint

2016-08-16 Thread Yanbo Liang
Spark MLlib does not support boxed constraints on model coefficients currently. Thanks Yanbo 2016-08-15 3:53 GMT-07:00 letaiv <tleginev...@gmail.com>: > Hi all, > > Is there any approach to add constrain for weights in linear regression? > What I need is least squares r

Re: using matrix as column datatype in SparkSQL Dataframe

2016-08-10 Thread Yanbo Liang
. Thanks Yanbo 2016-08-08 11:06 GMT-07:00 Vadla, Karthik <karthik.va...@intel.com>: > Hello all, > > > > I'm trying to load set of medical images(dicom) into spark SQL dataframe. > Here each image is loaded into matrix column of dataframe. I see spark > recently added Mat

Re: Random forest binary classification H20 difference Spark

2016-08-10 Thread Yanbo Liang
Hi Samir, Did you use VectorAssembler to assemble some columns into the feature column? If there are NULLs in your dataset, VectorAssembler will throw this exception. You can use DataFrame.drop() or DataFrame.replace() to drop/substitute NULL values. Thanks Yanbo 2016-08-07 19:51 GMT-07:00
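
For example, a sketch with assumed column names:

    // Drop rows that contain NULLs in the columns being assembled, or fill them with a default,
    // before running VectorAssembler.
    val cleaned = df.na.drop(Seq("age", "income"))
    // alternatively: val filled = df.na.fill(0.0, Seq("age", "income"))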

Re: Logistic regression formula string

2016-08-10 Thread Yanbo Liang
I think you can output the schema of the DataFrame which will be fed into the estimator such as LogisticRegression. The output array will be the encoded feature names corresponding to the coefficients of the model. Thanks Yanbo 2016-08-08 15:53 GMT-07:00 Cesar <ces...@gmail.com>: > > I

Re: [MLlib] Term Frequency in TF-IDF seems incorrect

2016-08-01 Thread Yanbo Liang
to compute term frequency divided by the length of the document, you should write your own function based on transformers provided by MLlib. Thanks Yanbo 2016-08-01 15:29 GMT-07:00 Hao Ren <inv...@gmail.com>: > When computing term frequency, we can use either HashTF or CountVectorizer > featur

Re: K-means Evaluation metrics

2016-07-24 Thread Yanbo Liang
Spark MLlib KMeansModel provides a "computeCost" function which returns the sum of squared distances of points to their nearest center as the k-means cost on the given dataset. Thanks Yanbo 2016-07-24 17:30 GMT-07:00 janardhan shetty <janardhan...@gmail.com>: > Hi, > > I
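
For example, a sketch (assuming `dataset` has a "features" vector column):

    import org.apache.spark.ml.clustering.KMeans

    val kmeans = new KMeans().setK(3).setSeed(1L)
    val model = kmeans.fit(dataset)
    val wssse = model.computeCost(dataset)   // sum of squared distances of points to their nearest center
    println(s"Within Set Sum of Squared Errors = $wssse")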

Re: Frequent Item Pattern Spark ML Dataframes

2016-07-24 Thread Yanbo Liang
You can refer to this JIRA (https://issues.apache.org/jira/browse/SPARK-14501) for porting spark.mllib.fpm to spark.ml. Thanks Yanbo 2016-07-24 11:18 GMT-07:00 janardhan shetty <janardhan...@gmail.com>: > Is there any implementation of FPGrowth and Association rules in Spark > Dataf
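
Until that porting lands, the RDD-based spark.mllib API can be used; a sketch (assuming `transactions` is an RDD[Array[String]] of item baskets):

    import org.apache.spark.mllib.fpm.FPGrowth

    val fpg = new FPGrowth().setMinSupport(0.2).setNumPartitions(4)
    val model = fpg.run(transactions)

    model.freqItemsets.collect().foreach { itemset =>
      println(s"${itemset.items.mkString("[", ",", "]")}, ${itemset.freq}")
    }
    model.generateAssociationRules(0.8).collect().foreach { rule =>
      println(s"${rule.antecedent.mkString(",")} => ${rule.consequent.mkString(",")} : ${rule.confidence}")
    }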

Re: Locality sensitive hashing

2016-07-24 Thread Yanbo Liang
Hi Janardhan, Please refer to the JIRA (https://issues.apache.org/jira/browse/SPARK-5992) for the discussion about LSH. Regards Yanbo 2016-07-24 7:13 GMT-07:00 Karl Higley <kmhig...@gmail.com>: > Hi Janardhan, > > I collected some LSH papers while working on an RDD-based implemen

Re: Saving a pyspark.ml.feature.PCA model

2016-07-24 Thread Yanbo Liang
Sorry for the wrong link; what you should refer to is jpmml-sparkml ( https://github.com/jpmml/jpmml-sparkml). Thanks Yanbo 2016-07-24 4:46 GMT-07:00 Yanbo Liang <yblia...@gmail.com>: > Spark does not support exporting ML models to PMML currently. You can try > the third party jpmml-

Re: Saving a pyspark.ml.feature.PCA model

2016-07-24 Thread Yanbo Liang
Spark does not support exporting ML models to PMML currently. You can try the third-party jpmml-spark (https://github.com/jpmml/jpmml-spark) package, which supports a subset of ML models. Thanks Yanbo 2016-07-20 11:14 GMT-07:00 Ajinkya Kale <kaleajin...@gmail.com>: > Just found Google dat

Re: Distributed Matrices - spark mllib

2016-07-24 Thread Yanbo Liang
, MatrixEntry l = [(1, 1, 10), (2, 2, 20), (3, 3, 30)] df = sqlContext.createDataFrame(l, ['row', 'column', 'value']) rdd = df.select('row', 'column', 'value').rdd.map(lambda row: MatrixEntry(*row)) mat = CoordinateMatrix(rdd) mat.entries.collect() Thanks Yanbo 2016-07-22 13:14 GMT-07:00 Gourav

Re: Filtering RDD Using Spark.mllib's ChiSqSelector

2016-07-17 Thread Yanbo Liang
Hi Tobi, Thanks for clarifying the question. It's very straightforward to convert the filtered RDD to a DataFrame; you can refer to the following code snippet: from pyspark.sql import Row rdd2 = filteredRDD.map(lambda v: Row(features=v)) df = rdd2.toDF() Thanks Yanbo 2016-07-16 14:51 GMT-07:00

Re: Feature importance IN random forest

2016-07-16 Thread Yanbo Liang
="indexed", seed=42) model = rf.fit(td) model.featureImportances Then you can get the feature importances which is a Vector. Thanks Yanbo 2016-07-12 10:30 GMT-07:00 pseudo oduesp <pseudo20...@gmail.com>: > Hi, > i use pyspark 1.5.0 > can i ask you how i can get feature imp

Re: bisecting kmeans model tree

2016-07-16 Thread Yanbo Liang
Currently we do not expose the APIs to get the Bisecting KMeans tree structure; they are private to the ml.clustering package scope. But I think we should make a plan to expose these APIs like what we did for Decision Tree. Thanks Yanbo 2016-07-12 11:45 GMT-07:00 roni <roni.epi...@gmail.

Re: Dense Vectors outputs in feature engineering

2016-07-16 Thread Yanbo Liang
orm(df2) df3.show() // Decode to get the original categories. val group = AttributeGroup.fromStructField(df3.schema("encodedName")) val categories = group.attributes.get.map(_.name.get) println(categories.mkString(",")) // Output: b,a,c Thanks Yanbo 2016-07-14 6:4

Re: Filtering RDD Using Spark.mllib's ChiSqSelector

2016-07-16 Thread Yanbo Liang
= sc.parallelize(data) model = ChiSqSelector(1).fit(rdd) filteredRDD = model.transform(rdd.map(lambda lp: lp.features)) filteredRDD.collect() However, we strongly recommend you to migrate to the DataFrame-based API, since the RDD-based API has switched to maintenance mode. Thanks Yanbo 2016-07-14 13:23 GMT

Re: QuantileDiscretizer not working properly with big dataframes

2016-07-16 Thread Yanbo Liang
Could you tell us the Spark version you used? We have fixed this bug in Spark 1.6.2 and Spark 2.0; please upgrade to one of these versions and retry. If the issue still exists, please let us know. Thanks Yanbo 2016-07-12 11:03 GMT-07:00 Pasquinell Urbani < pasquinell.urb...@exalitica.

Re: Isotonic Regression, run method overloaded Error

2016-07-11 Thread Yanbo Liang
diction").rdd.map { case Row(pred) => pred }.collect() assert(predictions === Array(1, 2, 2, 2, 6, 16.5, 16.5, 17, 18)) Thanks Yanbo 2016-07-11 6:14 GMT-07:00 Fridtjof Sander <fridtjof.san...@googlemail.com>: > Hi Swaroop, > > from my understanding, Isotonic Regress

Re: Isotonic Regression, run method overloaded Error

2016-07-10 Thread Yanbo Liang
Hi Swaroop, Would you mind sharing your code so that others can help you figure out what caused this error? I can run the isotonic regression examples well. Thanks Yanbo 2016-07-08 13:38 GMT-07:00 dsp <durgaswar...@gmail.com>: > Hi I am trying to perform Isotonic Regression on a

Re: mllib based on dataset or dataframe

2016-07-10 Thread Yanbo Liang
DataFrame is a special case of Dataset, so they mean the same thing. Actually the ML pipeline API will accept Dataset[_] instead of DataFrame in Spark 2.0. More accurately, we can say that MLlib will focus on the Dataset-based API for further development. Thanks Yanbo 2016-07-10 20:35 GMT

Re: Spark MLlib: MultilayerPerceptronClassifier error?

2016-07-04 Thread Yanbo Liang
Would you mind filing a JIRA to track this issue? I will take a look when I have time. 2016-07-04 14:09 GMT-07:00 mshiryae : > Hi, > > I am trying to train a model by MultilayerPerceptronClassifier. > > It works on sample data from >

Re: Graphframe Error

2016-07-04 Thread Yanbo Liang
with bin/pyspark --py-files ***/graphframes.jar --jars ***/graphframes.jar to launch PySpark with graphframes enabled. You should set "--py-files" and "--jars" options with the directory where you saved graphframes.jar. Thanks Yanbo 2016-07-03 15:48 GMT-07:00 Arun Patel <

Re: Several questions about how pyspark.ml works

2016-07-02 Thread Yanbo Liang
Hi Nick, Please see my inline reply. Thanks Yanbo 2016-06-12 3:08 GMT-07:00 XapaJIaMnu <nhe...@gmail.com>: > Hey, > > I have some additional Spark ML algorithms implemented in scala that I > would > like to make available in pyspark. For a reference I am looking at the

Re: Trainning a spark ml linear regresion model fail after migrating from 1.5.2 to 1.6.1

2016-07-02 Thread Yanbo Liang
Yes, WeightedLeastSquares cannot solve some ill-conditioned problems currently; community members have put some effort into resolving it (SPARK-13777). As a workaround, you can set the solver to "l-bfgs", which will train the LinearRegressionModel with the L-BFGS optimization method. 2016-06-09
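
A sketch of that workaround (`trainingData` is an assumed DataFrame with default label/features columns):

    import org.apache.spark.ml.regression.LinearRegression

    val lr = new LinearRegression()
      .setSolver("l-bfgs")   // bypass the WeightedLeastSquares ("normal") solver
      .setMaxIter(100)
      .setRegParam(0.01)
    val model = lr.fit(trainingData)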

Re: Get both feature importance and ROC curve from a random forest classifier

2016-07-02 Thread Yanbo Liang
ble, label: Double) => (rawPrediction, label) } val metrics = new BinaryClassificationMetrics(scoreAndLabels) metrics.roc() Thanks Yanbo 2016-06-15 7:13 GMT-07:00 matd <matd...@gmail.com>: > Hi ml folks ! > > I'm using a Random Forest for a binary classification. > I'm in

Re: Ideas to put a Spark ML model in production

2016-07-02 Thread Yanbo Liang
/lr-model") val data = newDataset val prediction = model.transform(data) However, usually we save/load PipelineModel which include necessary feature transformers and model training process rather than the single model, but they are similar operations. Thanks Yanbo 2016-06-23 10:54 GMT-07:00

Re: Custom Optimizer

2016-07-02 Thread Yanbo Liang
Spark MLlib does not support plugging in a custom optimizer, since the optimizer interface is private. Thanks Yanbo 2016-06-23 16:56 GMT-07:00 Stephen Boesch <java...@gmail.com>: > My team has a custom optimization routine that we would have wanted to > plug in as a replacement for the de

Re: Spark ML - Java implementation of custom Transformer

2016-07-02 Thread Yanbo Liang
the solution for the compatibility issue has been figured out, we will add it back at 2.1. Thanks Yanbo 2016-06-27 11:57 GMT-07:00 Mehdi Meziane <mehdi.mezi...@ldmobile.net>: > Hi all, > > We have some problems while implementing custom Transformers in JAVA > (SPARK 1.6.1)

Re: ML regression - spark context dies without error

2016-06-05 Thread Yanbo Liang
Could you tell me which regression algorithm you used, the parameters you set, and the detailed exception information? It would be better to paste your code and the exception here if applicable, so that other members can help you diagnose the problem. Thanks Yanbo 2016-05-12 2:03 GMT-07:00 AlexModestov

Re: Running glm in sparkR (data pre-processing step)

2016-05-30 Thread Yanbo Liang
Yes, you are right. 2016-05-30 2:34 GMT-07:00 Abhishek Anand <abhis.anan...@gmail.com>: > > Thanks Yanbo. > > So, you mean that if I have a variable which is of type double but I want > to treat it like String in my model I just have to cast those columns into > strin

Re: Running glm in sparkR (data pre-processing step)

2016-05-30 Thread Yanbo Liang
Hi Abhi, In SparkR glm, categorical features (columns of type string) will be one-hot encoded automatically, so pre-processing like `as.factor` is not necessary; you can feed your data directly to the model training. Thanks Yanbo 2016-05-30 2:06 GMT-07:00 Abhishek Anand <abhis.anan...@gmail.

Re: Possible bug involving Vectors with a single element

2016-05-27 Thread Yanbo Liang
Spark MLlib Vector only supports data of double type, so it's reasonable to throw an exception when you create a Vector with elements of unicode type. 2016-05-24 7:27 GMT-07:00 flyinggip : > Hi there, > > I notice that there might be a bug in pyspark.mllib.linalg.Vectors when

Re: Reg:Reading a csv file with String label into labelepoint

2016-03-16 Thread Yanbo Liang
featureCol and labelCol. Thanks Yanbo 2016-03-16 13:41 GMT+08:00 Dharmin Siddesh J <siddeshjdhar...@gmail.com>: > Hi > > I am trying to read a csv with few double attributes and String Label . > How can i convert it to labelpoint RDD so that i can run it with spark > mllib classificati

Re: SparkML Using Pipeline API locally on driver

2016-02-28 Thread Yanbo Liang
the progress of https://issues.apache.org/jira/browse/SPARK-10413. Thanks Yanbo 2016-02-27 8:52 GMT+08:00 Eugene Morozov <evgeny.a.moro...@gmail.com>: > Hi everyone. > > I have a requirement to run prediction for random forest model locally on > a web-service without touching sp

Re: Saving and Loading Dataframes

2016-02-28 Thread Yanbo Liang
("parquet").mode("overwrite").save(output) > val data = sqlContext.read.format("parquet").load(output) Thanks Yanbo 2016-02-27 2:01 GMT+08:00 Raj Kumar <raj.ku...@hooklogic.com>: > Thanks for the response Yanbo. Here is the source (it uses the > sample_libs

Re: Survival Curves using AFT implementation in Spark

2016-02-26 Thread Yanbo Liang
/ml/AFTSurvivalRegressionExample.scala#L48> . Maybe we can add this feature later. Thanks Yanbo 2016-02-26 14:35 GMT+08:00 Stuti Awasthi <stutiawas...@hcl.com>: > Hi All, > > I wanted to apply Survival Analysis using Spark AFT algorithm > implementation. Now I perform the sam

Re: Calculation of histogram bins and frequency in Apache spark 1.6

2016-02-25 Thread Yanbo Liang
Actually Spark SQL `groupBy` with `count` can compute the frequency in each bin. You can also try DataFrameStatFunctions.freqItems() to get the frequent items for columns. Thanks Yanbo 2016-02-24 1:21 GMT+08:00 Burak Yavuz <brk...@gmail.com>: > You could use the Bucketizer transformer in
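
A sketch combining the Bucketizer suggestion from the quoted reply with groupBy/count (the splits and the "value" column are assumptions):

    import org.apache.spark.ml.feature.Bucketizer

    val splits = Array(Double.NegativeInfinity, 0.0, 10.0, 20.0, Double.PositiveInfinity)
    val bucketizer = new Bucketizer()
      .setInputCol("value")
      .setOutputCol("bucket")
      .setSplits(splits)

    // Frequency per histogram bin.
    bucketizer.transform(df).groupBy("bucket").count().orderBy("bucket").show()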

Re: Saving and Loading Dataframes

2016-02-25 Thread Yanbo Liang
Hi Raj, Could you share your code so that others can help diagnose this issue? Which version did you use? I cannot reproduce this problem in my environment. Thanks Yanbo 2016-02-26 10:49 GMT+08:00 raj.kumar <raj.ku...@hooklogic.com>: > Hi, > > I am using mllib. I use the m

Re: mllib:Survival Analysis : assertion failed: AFTAggregator loss sum is infinity. Error for unknown reason.

2016-02-16 Thread Yanbo Liang
= standardScaler.fit(ovarian2) val ovarian3 = ssModel.transform(ovarian2) val aft = new AFTSurvivalRegression().setFeaturesCol("standardized_features") val model = aft.fit(ovarian3) val newCoefficients = model.coefficients.toArray.zip(ssModel.std.toArray).map { x => x._1 / x._2 }

Re: mllib:Survival Analysis : assertion failed: AFTAggregator loss sum is infinity. Error for unknown reason.

2016-02-15 Thread Yanbo Liang
Hi Stuti, This is a bug of AFTSurvivalRegression, we did not handle "lossSum == infinity" properly. I have open https://issues.apache.org/jira/browse/SPARK-13322 to track this issue and will send a PR. Thanks for reporting this issue. Yanbo 2016-02-12 15:03 GMT+08:00 Stuti Awasthi

Re: [MLLib] Is the order of the coefficients in a LogisticRegresionModel kept ?

2016-02-02 Thread Yanbo Liang
For your case, it's true. But it is not always correct for a pipeline model; some transformers in the pipeline, such as OneHotEncoder, will change the features. 2016-02-03 1:21 GMT+08:00 jmvllt : > Hi everyone, > > This may sound like a stupid question but I need to be sure of this

Re: Extracting p values in Logistic regression using mllib scala

2016-01-24 Thread Yanbo Liang
Hi Chandan, MLlib only supports getting p-values and t-values from the Linear Regression model; other models such as the Logistic Regression model are not supported currently. This feature is under development and will be released in the next version (Spark 2.0). Thanks Yanbo 2016-01-18 16:45 GMT+08:00 Chandan Verma
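
For reference, a sketch of how this looks once released: in Spark 2.0 the linear regression training summary exposes these statistics when the "normal" solver is used (`trainingData` is an assumption):

    import org.apache.spark.ml.regression.LinearRegression

    val lrModel = new LinearRegression().setSolver("normal").fit(trainingData)
    val summary = lrModel.summary
    println("t-values: " + summary.tValues.mkString(", "))
    println("p-values: " + summary.pValues.mkString(", "))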

Re: has any one implemented TF_IDF using ML transformers?

2016-01-24 Thread Yanbo Liang
Hi Andy, I will take a look at your code after you share it. Thanks! Yanbo 2016-01-23 0:18 GMT+08:00 Andy Davidson <a...@santacruzintegration.com>: > Hi Yanbo > > I recently code up the trivial example from > http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-tex

Re: can we create dummy variables from categorical variables, using sparkR

2016-01-24 Thread Yanbo Liang
Yanbo 2016-01-20 1:15 GMT+08:00 Vinayak Agrawal <vinayakagrawa...@gmail.com>: > Yes, you can use Rformula library. Please see > > https://databricks.com/blog/2015/10/05/generalized-linear-models-in-sparkr-and-r-formula-support-in-mllib.html > > On Tue, Jan 19, 2016 at 10:34

Re: how to save Matrix type result to hdfs file using java

2016-01-24 Thread Yanbo Liang
A Matrix can be saved as a column of type MatrixUDT.

Re: has any one implemented TF_IDF using ML transformers?

2016-01-19 Thread Yanbo Liang
/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/IDF.scala#L226 Thanks Yanbo 2016-01-19 7:05 GMT+08:00 Andy Davidson <a...@santacruzintegration.com>: > Hi Yanbo > > I am using 1.6.0. I am having a hard of time trying to figure out what the > exact

Re: has any one implemented TF_IDF using ML transformers?

2016-01-17 Thread Yanbo Liang
/spark/ml/feature/IDF.scala#L121 I found the documentation of IDF is not very clear; we need to update it. Thanks Yanbo 2016-01-16 6:10 GMT+08:00 Andy Davidson <a...@santacruzintegration.com>: > I wonder if I am missing something? TF-IDF is very popular. Spark ML has a > lot of transform

Re: Feature importance for RandomForestRegressor in Spark 1.5

2016-01-17 Thread Yanbo Liang
-classification-regression.html#random-forest-classifier . Thanks Yanbo 2016-01-16 0:16 GMT+08:00 Robin East <robin.e...@xense.co.uk>: > re 1. > The pull requests reference the JIRA ticket in this case > https://issues.apache.org/jira/browse/SPARK-5133. The JIRA says it was &g

Re: AIC in Linear Regression in ml pipeline

2016-01-15 Thread Yanbo Liang
Hi Arunkumar, Outputting the AIC value for Linear Regression is not supported currently. This feature is under development and will be released in Spark 2.0. Thanks Yanbo 2016-01-15 17:20 GMT+08:00 Arunkumar Pillai <arunkumar1...@gmail.com>: > Hi > > Is it possible to get AIC

Re: ml.classification.NaiveBayesModel how to reshape theta

2016-01-13 Thread Yanbo Liang
Yep, the number of rows of the theta matrix is the number of classes, and the number of columns is the number of features. 2016-01-13 10:47 GMT+08:00 Andy Davidson : > I am trying to debug my trained model by exploring theta > Theta is a Matrix. The java Doc for Matrix says that it is

Re: Deploying model built in SparkR

2016-01-11 Thread Yanbo Liang
Hi Chandan, Could you tell us what you mean by deploying the model? Using the model to make predictions from R? Thanks Yanbo 2016-01-11 20:40 GMT+08:00 Chandan Verma <chandan.ve...@citiustech.com>: > Hi All, > > Does any one over here has deployed a model produced in SparkR or at

Re: broadcast params to workers at the very beginning

2016-01-11 Thread Yanbo Liang
Hi, The parameters should be broadcast again after you update them on the driver side; then you can get the updated version on the worker side. Thanks Yanbo 2016-01-09 23:12 GMT+08:00 octavian.ganea <octavian.ga...@inf.ethz.ch>: > Hi, > > In my app, I have a Params scala object that keeps a
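
A rough sketch of that pattern (the `data` RDD, the parameter map, and the threshold are all made up for illustration):

    // `data` is an assumed RDD[Double]; `sc` is the SparkContext.
    var params = Map("threshold" -> 0.5)
    var paramsBc = sc.broadcast(params)

    def countAboveThreshold(): Long =
      data.filter(_ > paramsBc.value("threshold")).count()

    countAboveThreshold()              // uses threshold = 0.5 on the workers

    params = Map("threshold" -> 0.9)   // update on the driver side...
    paramsBc.unpersist()               // ...release the stale broadcast
    paramsBc = sc.broadcast(params)    // ...and broadcast the new version
    countAboveThreshold()              // now uses threshold = 0.9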

Re: StandardScaler in spark.ml.feature requires vector input?

2016-01-11 Thread Yanbo Liang
into StandardScaler. Thanks Yanbo 2016-01-10 8:10 GMT+08:00 Kristina Rogale Plazonic <kpl...@gmail.com>: > Hi, > > The code below gives me an unexpected result. I expected that > StandardScaler (in ml, not mllib) will take a specified column of an input > dataframe and subtract t

Re: Date Time Regression as Feature

2016-01-07 Thread Yanbo Liang
input into the features which can be fed into the model trainer. OneHotEncoder and VectorAssembler are feature transformers provided by Spark ML; you can refer to https://spark.apache.org/docs/latest/ml-features.html Thanks Yanbo 2016-01-08 7:52 GMT+08:00 Annabel Melongo <melongo_a

Re: sparkR ORC support.

2016-01-06 Thread Yanbo Liang
You should ensure your sqlContext is a HiveContext. sc <- sparkR.init() sqlContext <- sparkRHive.init(sc) 2016-01-06 20:35 GMT+08:00 Sandeep Khurana : > Felix > > I tried the option suggested by you. It gave below error. I am going to > try the option suggested by Prem .

Re: finding distinct count using dataframe

2016-01-05 Thread Yanbo Liang
Hi Arunkumar, You can use datasetDF.select(countDistinct(col1, col2, col3, ...)) or approxCountDistinct for an approximate result. 2016-01-05 17:11 GMT+08:00 Arunkumar Pillai : > Hi > > Is there any functions to find distinct count of all the variables in > dataframe.
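
For example, a sketch with assumed column names:

    import org.apache.spark.sql.functions.{approxCountDistinct, col, countDistinct}

    datasetDF.select(countDistinct(col("col1")), countDistinct(col("col2"))).show()
    // Cheaper approximation for large data:
    datasetDF.select(approxCountDistinct(col("col1"))).show()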

Re: SparkML algos limitations question.

2016-01-04 Thread Yanbo Liang
Hi Alexander, That's cool! Thanks for the clarification. Yanbo 2016-01-05 5:06 GMT+08:00 Ulanov, Alexander <alexander.ula...@hpe.com>: > Hi Yanbo, > > > > As long as two models fit into memory of a single machine, there should be > no problems, so even 16GB machines

Re: Problem embedding GaussianMixtureModel in a closure

2016-01-04 Thread Yanbo Liang
like the following code snippet: gmmModel.predictSoft(rdd) then you will get a new RDD which is the soft prediction result. And all the models in ML package follow this rule. Yanbo 2016-01-04 22:16 GMT+08:00 Tomasz Fruboes <tomasz.frub...@ncbj.gov.pl>: > Hi Yanbo, > >

Re: GLM I'm ml pipeline

2016-01-03 Thread Yanbo Liang
AFAIK, Spark MLlib will improve and support most GLM functions in the next release (Spark 2.0). 2016-01-03 23:02 GMT+08:00 : > keyStoneML could be an alternative. > > Ardo. > > On 03 Jan 2016, at 15:50, Arunkumar Pillai > wrote: > > Is there any road

Re: Problem embedding GaussianMixtureModel in a closure

2016-01-02 Thread Yanbo Liang
in map(). Cheers Yanbo 2016-01-01 4:12 GMT+08:00 Tomasz Fruboes <tomasz.frub...@ncbj.gov.pl>: > Dear All, > > I'm trying to implement a procedure that iteratively updates a rdd using > results from GaussianMixtureModel.predictSoft. In order to avoid problems > with local v

Re: frequent itemsets

2016-01-02 Thread Yanbo Liang
Hi Roberto, Could you share your code snippet that others can help to diagnose your problems? 2016-01-02 7:51 GMT+08:00 Roberto Pagliari : > When using the frequent itemsets APIs, I’m running into stackOverflow > exception whenever there are too many combinations to
