Re: How to use StringIndexer for multiple input /output columns in Spark Java

2018-05-16 Thread Bryan Cutler
ject: > OneHotEncoderEstimator - java.lang.NoSuchMethodError: org.apache.spark.sql > .Dataset.withColumns > > Regards, > Mina > > On Tue, May 15, 2018 at 2:37 AM, Nick Pentreath <nick.pentre...@gmail.com> > wrote: > >> Multi column support for StringIndexer didn’t make it into Spark

Re: How to use StringIndexer for multiple input /output columns in Spark Java

2018-05-15 Thread Mina Aslani
: OneHotEncoderEstimator - java.lang.NoSuchMethodError: org.apache.spark.sql .Dataset.withColumns Regards, Mina On Tue, May 15, 2018 at 2:37 AM, Nick Pentreath <nick.pentre...@gmail.com> wrote: > Multi column support for StringIndexer didn’t make it into Spark 2.3.0 > > The PR is stil

Re: How to use StringIndexer for multiple input /output columns in Spark Java

2018-05-15 Thread Nick Pentreath
Multi column support for StringIndexer didn’t make it into Spark 2.3.0 The PR is still in progress I think - should be available in 2.4.0 On Mon, 14 May 2018 at 22:32, Mina Aslani <aslanim...@gmail.com> wrote: > Please take a look at the api doc: > https://spark.apache.org/docs/2.

Re: How to use StringIndexer for multiple input /output columns in Spark Java

2018-05-14 Thread Mina Aslani
Please take a look at the api doc: https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/ml/feature/StringIndexer.html On Mon, May 14, 2018 at 4:30 PM, Mina Aslani <aslanim...@gmail.com> wrote: > Hi, > > There is no SetInputCols/SetOutputCols for StringIndexer in Sp

How to use StringIndexer for multiple input /output columns in Spark Java

2018-05-14 Thread Mina Aslani
Hi, There is no setInputCols/setOutputCols for StringIndexer in Spark Java. How can multiple input/output columns be specified then? Regards, Mina

Re: StringIndexer with high cardinality huge data

2018-04-10 Thread Shahab Yunus
> If no, then you may use value's hash code as a category or combine all >> columns into a single vector using HashingTF. >> >> Regards, >> Filipp. >> >> On Tue, Apr 10, 2018 at 4:01 PM, Shahab Yunus <shahab.yu...@gmail.com> >> wrote: >> >

Re: StringIndexer with high cardinality huge data

2018-04-10 Thread Nick Pentreath
Tue, Apr 10, 2018 at 4:01 PM, Shahab Yunus <shahab.yu...@gmail.com> > wrote: > > Is the StringIndexer keeps all the mapped label to indices in the memory > of > > the driver machine? It seems to be unless I am missing something. > > > > What if our data that ne

Re: StringIndexer with high cardinality huge data

2018-04-10 Thread Filipp Zhinkin
, 2018 at 4:01 PM, Shahab Yunus <shahab.yu...@gmail.com> wrote: > Is the StringIndexer keeps all the mapped label to indices in the memory of > the driver machine? It seems to be unless I am missing something. > > What if our data that needs to be indexed is huge and columns to be

StringIndexer with high cardinality huge data

2018-04-10 Thread Shahab Yunus
Does the StringIndexer <https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala> keep all the label-to-index mappings in the memory of the driver machine? It seems so, unless I am missing something. What if our data that

Re: StringIndexer on several columns in a DataFrame with Scala

2017-10-30 Thread Md. Rezaul Karim
> > On Mon, Oct 30, 2017 at 5:19 PM, Nick Pentreath <nick.pentre...@gmail.com> > wrote: > >> For now, you must follow this approach of constructing a pipeline >> consisting of a StringIndexer for each categorical column. See >> https://issues.apache.org/jira/browse

Re: StringIndexer on several columns in a DataFrame with Scala

2017-10-30 Thread Weichen Xu
Yes, I am working on this. Sorry for the delay, but I will try to submit a PR ASAP. Thanks! On Mon, Oct 30, 2017 at 5:19 PM, Nick Pentreath <nick.pentre...@gmail.com> wrote: > For now, you must follow this approach of constructing a pipeline > consisting of a StringIndexer for each catego

Re: StringIndexer on several columns in a DataFrame with Scala

2017-10-30 Thread Nick Pentreath
For now, you must follow this approach of constructing a pipeline consisting of a StringIndexer for each categorical column. See https://issues.apache.org/jira/browse/SPARK-11215 for the related JIRA to allow multiple columns for StringIndexer, which is being worked on currently. The reason
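The per-column approach described above can be sketched in PySpark roughly as follows. This is a minimal illustration rather than the poster's code: the helper names `indexed_name` and `build_indexing_pipeline`, the `_index` suffix, and the `features` output column are assumptions made for the example, and the final `VectorAssembler` stage is included only because the original question asks for a feature vector.

```python
def indexed_name(column, suffix="_index"):
    """Output column name for the indexed version of `column` (naming is an assumption)."""
    return column + suffix


def build_indexing_pipeline(categorical_cols, features_col="features"):
    """Build one StringIndexer per categorical column, then assemble the indices."""
    # Imported lazily so the naming helper above works without a Spark installation.
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, VectorAssembler

    # One indexer stage per categorical column.
    indexers = [
        StringIndexer(inputCol=c, outputCol=indexed_name(c))
        for c in categorical_cols
    ]
    # Combine all indexed columns into a single vector column.
    assembler = VectorAssembler(
        inputCols=[indexed_name(c) for c in categorical_cols],
        outputCol=features_col,
    )
    return Pipeline(stages=indexers + [assembler])
```

With an active SparkSession and a DataFrame `df` that has, say, hypothetical columns `color` and `shape`, the pipeline would be used as `build_indexing_pipeline(["color", "shape"]).fit(df).transform(df)`.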

StringIndexer on several columns in a DataFrame with Scala

2017-10-27 Thread Md. Rezaul Karim
Hi All, There are several categorical columns in my dataset as follows: [image: Inline images 1] How can I transform values in each (categorical) column into numeric using StringIndexer so that the resulting DataFrame can be fed into VectorAssembler to generate a feature vector? A naive

Re: Why StringIndexer uses double instead of int for indexing?

2017-01-21 Thread Holden Karau
.org/docs/latest/ml-features.html#stringindexer. > What's the rationale for using double to index? Would it be more > appropriate to use int to index (which is consistent with other place > like Vector.sparse) > > Shiyuan > > >

Why StringIndexer uses double instead of int for indexing?

2017-01-21 Thread Shiyuan
Hi Spark, StringIndexer uses double instead of int for indexing http://spark.apache.org/docs/latest/ml-features.html#stringindexer. What's the rationale for using double to index? Would it be more appropriate to use int to index (which is consistent with other places like Vector.sparse)? Shiyuan

Re: OOM with StringIndexer, 800m rows & 56m distinct value column

2016-08-22 Thread Ben Teeuwen
<bteeu...@gmail.com> wrote: >>> >>> So I wrote some code to reproduce the problem. >>> >>> I assume here that a pipeline should be able to transform a categorical >>> feature with a few million levels. >>> So I create a dataframe

Re: OOM with StringIndexer, 800m rows & 56m distinct value column

2016-08-19 Thread Davies Liu
reate a dataframe with the categorical feature (‘id’), apply a >> StringIndexer and OneHotEncoder transformer, and run a loop where I increase >> the amount of levels. >> It breaks at 1.276.000 levels. >> >> Shall I report this as a ticket in JIRA? >> &

Re: OOM with StringIndexer, 800m rows & 56m distinct value column

2016-08-19 Thread Davies Liu
a pipeline should be able to transform a categorical > feature with a few million levels. > So I create a dataframe with the categorical feature (‘id’), apply a > StringIndexer and OneHotEncoder transformer, and run a loop where I increase > the amount of levels. > It breaks at 1.2

Re: OOM with StringIndexer, 800m rows & 56m distinct value column

2016-08-19 Thread Ben Teeuwen
So I wrote some code to reproduce the problem. I assume here that a pipeline should be able to transform a categorical feature with a few million levels. So I create a dataframe with the categorical feature (‘id’), apply a StringIndexer and OneHotEncoder transformer, and run a loop where I

Re: OOM with StringIndexer, 800m rows & 56m distinct value column

2016-08-11 Thread Nick Pentreath
ictionary to be broadcasted > later on, used for creating sparse vectors > max_index = grouped.selectExpr("MAX(index) t").rdd.map(lambda r: > r.t).first() > > logging.info("Sanity check for indexes:") > for c in cat_int[:]: > logging.info(

Re: OOM with StringIndexer, 800m rows & 56m distinct value column

2016-08-11 Thread Ben Teeuwen
# some logging to confirm the indexes. logging.info("Missing value = {}".format(mappings[c]['missing'])) return max_index, mappings I’d love to see the StringIndexer + OneHotEncoder transformers cope with missing values during prediction; for now I’ll work with the ha

Re: OOM with StringIndexer, 800m rows & 56m distinct value column

2016-08-04 Thread Nick Pentreath
Sure, I understand there are some issues with handling this missing value situation in StringIndexer currently. Your workaround is not ideal but I see that it is probably the only mechanism available currently to avoid the problem. But the OOM issues seem to be more about the feature cardinality

Re: OOM with StringIndexer, 800m rows & 56m distinct value column

2016-08-04 Thread Ben Teeuwen
was missing in the training data. I wouldn’t need this workaround, if I had a better strategy in Spark for dealing with missing levels. How Spark can deal with it: "Additionally, there are two strategies regarding how StringIndexer will handle unseen labels when you have fit a StringIndexer o
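The unseen-label strategies that the quoted documentation refers to can be illustrated with a plain-Python sketch. It does not call Spark; it only mimics the behaviour of the `handleInvalid` parameter, where `"error"` raises on an unseen label and `"skip"` drops the row. Later Spark releases also document a `"keep"` option that puts unseen labels in an extra bucket at index numLabels. The function name `transform_labels` and the sample mapping are assumptions for illustration.

```python
def transform_labels(values, mapping, handle_invalid="error"):
    """Apply a label -> index mapping learned on training data to new values."""
    indexed = []
    for value in values:
        if value in mapping:
            indexed.append(float(mapping[value]))
        elif handle_invalid == "skip":
            # Drop rows whose label was not seen during fitting.
            continue
        elif handle_invalid == "keep":
            # Put every unseen label into one extra bucket at index numLabels.
            indexed.append(float(len(mapping)))
        else:
            raise ValueError("Unseen label: {!r}".format(value))
    return indexed


# Mapping as it might have been learned from training data (hypothetical).
train_mapping = {"a": 0, "b": 1, "c": 2}
print(transform_labels(["a", "d", "c"], train_mapping, handle_invalid="skip"))
print(transform_labels(["a", "d", "c"], train_mapping, handle_invalid="keep"))
```

The point of the sketch is that the mapping is fixed at fit time, so any level that appears only at prediction time has to be rejected, dropped, or routed to a catch-all index.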

Re: OOM with StringIndexer, 800m rows & 56m distinct value column

2016-08-04 Thread Nick Pentreath
. The disadvantages include (i) no way to reverse the mapping from feature_index -> feature_name; (ii) potential for hash collisions (can be helped a bit by increasing your feature vector size). Here is a minimal example: In [1]: from pyspark.ml.feature import StringIndexer, OneH
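The two disadvantages listed above can be seen in a small plain-Python sketch of the hashing trick. It is not Spark code and does not use the hash function Spark uses: `zlib.crc32` is chosen here only because it is deterministic across runs, and the names `hashed_index` and `count_collisions` are assumptions for illustration.

```python
import zlib


def hashed_index(value, num_features):
    """Map a category string to a bucket in [0, num_features) with a stable hash."""
    return zlib.crc32(value.encode("utf-8")) % num_features


def count_collisions(values, num_features):
    """Number of distinct values that share a bucket with another value."""
    distinct = set(values)
    buckets = {hashed_index(v, num_features) for v in distinct}
    return len(distinct) - len(buckets)


levels = ["id_{}".format(i) for i in range(100)]
# With only 8 buckets, at least 92 of the 100 levels must collide.
print(count_collisions(levels, 8))
# A much larger vector size typically leaves far fewer collisions.
print(count_collisions(levels, 1 << 20))
```

Because only the bucket number is stored, there is no way to go from a feature index back to the original value, and distinct values can land in the same bucket. A larger vector size usually reduces collisions but cannot rule them out.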

Re: OOM with StringIndexer, 800m rows & 56m distinct value column

2016-08-04 Thread Ben Teeuwen
I raised driver memory to 30G and maxresultsize to 25G, this time in pyspark. Code run: cat_int = ['bigfeature'] stagesIndex = [] stagesOhe = [] for c in cat_int: stagesIndex.append(StringIndexer(inputCol=c, outputCol="{}Index".format(c))) stagesOhe.append(OneHotEncoder(dropL

OOM with StringIndexer, 800m rows & 56m distinct value column

2016-08-03 Thread Ben Teeuwen
Hi, I want to one-hot encode a column containing 56 million distinct values. My dataset is 800m rows + 17 columns. I first apply a StringIndexer, but it already breaks there with an OOM Java heap space error. I launch my app on YARN with: /opt/spark/2.0.0/bin/spark-shell --executor-memory 10G

StringIndexer

2016-06-16 Thread pseudo oduesp
Hi, I have a DataFrame with 1000 columns to convert to dummies with StringIndexer. When I apply the pipeline it takes a long time, and when I want to merge the result with another DataFrame (I mean: original DataFrame + columns indexed by the StringIndexers), the save stage is slow. Why? Code: indexers = [StringIndexer

Re: General question on using StringIndexer in SparkML

2015-12-02 Thread Vishnu Viswanath
5 at 12:32 AM, Vishnu Viswanath < >> vishnu.viswanat...@gmail.com> wrote: >> >> Thank you Jeff. >>> >>> On Sun, Nov 29, 2015 at 7:36 PM, Jeff Zhang <zjf...@gmail.com> wrote: >>> >>>> StringIndexer is an estimator which would train a model t

Re: General question on using StringIndexer in SparkML

2015-12-02 Thread Yanbo Liang
>>> >>> Thanks, >>> >>> On Mon, Nov 30, 2015 at 12:32 AM, Vishnu Viswanath < >>> vishnu.viswanat...@gmail.com> wrote: >>> >>> Thank you Jeff. >>>> >>>> On Sun, Nov 29, 2015 at 7:36 PM, Jeff Zhang <zjf...@gma

Re: General question on using StringIndexer in SparkML

2015-12-02 Thread Vishnu Viswanath
>>> having values which was not there in train data. >>>> Since fit() is done only on the train data, the indexing is failing. >>>> >>>> Can you suggest me what can be done in this situation. >>>> >>>> Thanks, >>>> >>>

Re: General question on using StringIndexer in SparkML

2015-12-01 Thread Vishnu Viswanath
< vishnu.viswanat...@gmail.com> wrote: Thank you Jeff. > > On Sun, Nov 29, 2015 at 7:36 PM, Jeff Zhang <zjf...@gmail.com> wrote: > >> StringIndexer is an estimator which would train a model to be used both >> in training & prediction. So it is consistent between trai

Re: General question on using StringIndexer in SparkML

2015-11-29 Thread Vishnu Viswanath
Thanks for the reply Yanbo. I understand that the model will be trained using the indexer map created during the training stage. But since I am getting a new set of data during prediction, and I have to do StringIndexing on the new data also, Right now I am using a new StringIndexer

Re: General question on using StringIndexer in SparkML

2015-11-29 Thread Jeff Zhang
StringIndexer is an estimator which would train a model to be used both in training & prediction. So it is consistent between training & prediction. You may want to read this section of spark ml doc http://spark.apache.org/docs/latest/ml-guide.html#how-it-works On Mon, Nov 30, 2015 at

Re: General question on using StringIndexer in SparkML

2015-11-29 Thread Vishnu Viswanath
Thank you Jeff. On Sun, Nov 29, 2015 at 7:36 PM, Jeff Zhang <zjf...@gmail.com> wrote: > StringIndexer is an estimator which would train a model to be used both in > training & prediction. So it is consistent between training & prediction. > > You may want to read th

General question on using StringIndexer in SparkML

2015-11-28 Thread Vishnu Viswanath
Hi All, I have a general question on using StringIndexer. StringIndexer gives an index to each label in the feature starting from 0 (0 for the most frequent word). Suppose I am building a model, and I use StringIndexer for transforming one of my columns. e.g., suppose A was most frequent word

Re: Random Forest and StringIndexer in pyspark ML Pipeline

2015-08-21 Thread Yanbo Liang
The ML package aims to provide machine learning pipelines that let users build machine learning workflows more efficiently. It is more general to let StringIndexer chain with any kind of Estimator. I think we can make StringIndexer and the reverse process automatic in the future. If you want to know your original labels, you

Random Forest and StringIndexer in pyspark ML Pipeline

2015-08-10 Thread pkphlam
Hi, If I understand the RandomForest model in the ML Pipeline implementation in the ml package correctly, I have to first run my outcome label variable through the StringIndexer, even if my labels are numeric. The StringIndexer then converts the labels into numeric indices based on frequency
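The frequency-based indexing of numeric labels, and the reverse step needed to recover the original labels, can be sketched in plain Python. This only mimics the documented behaviour (numeric input is cast to string, and the most frequent label gets index 0) and does not call Spark. The function names are assumptions, and so is the alphabetical tie-break among equally frequent labels.

```python
from collections import Counter


def fit_labels(values):
    """Return labels ordered by descending frequency (ties broken by label text)."""
    counts = Counter(str(v) for v in values)
    return sorted(counts, key=lambda label: (-counts[label], label))


def to_indices(values, labels):
    """Replace each value with the position of its label, as a double."""
    position = {label: float(i) for i, label in enumerate(labels)}
    return [position[str(v)] for v in values]


def to_labels(indices, labels):
    """Reverse step: look each index up in the ordered label list."""
    return [labels[int(i)] for i in indices]


outcome = [3, 1, 3, 2, 3, 1]
labels = fit_labels(outcome)           # "3" is most frequent, so it gets index 0
indexed = to_indices(outcome, labels)
print(labels, indexed, to_labels(indexed, labels))
```

In Spark the ordered label list is kept on the fitted model, and the reverse lookup is what the `IndexToString` transformer performs. The sketch shows why predictions come back as indices rather than as the original numeric labels.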

StringIndexer + VectorAssembler equivalent to HashingTF?

2015-08-07 Thread praveen S
Is StringIndexer + VectorAssembler equivalent to HashingTF while converting the document for analysis?

Re: StringIndexer + VectorAssembler equivalent to HashingTF?

2015-08-07 Thread Peter Rudenko
(SI1, SI2).setOutputCol(features) - features 00 11 01 22 HashingTF.setNumFeatures(2).setInputCol(COL1).setOutputCol(HT1) bucket1 bucket2 a,a,b c HT1 3 //Hash collision 3 3 1 Thanks, Peter Rudenko On 2015-08-07 09:55, praveen S wrote: Is StringIndexer + VectorAssembler equivalent

Interaction between StringIndexer feature transformer and CrossValidator

2015-06-18 Thread cyz
Hi, I encountered errors fitting a model using a CrossValidator. The training set contained a feature which was initially a String with many unique values. I used a StringIndexer to transform this feature column into label indices. Fitting a model with a regular pipeline worked fine, but I ran