Re: How to use StringIndexer for multiple input /output columns in Spark Java

2018-05-16 Thread Bryan Cutler
java.lang.NoSuchMethodError: org.apache.spark.sql.Dataset.withColumns > > Regards, > Mina > > On Tue, May 15, 2018 at 2:37 AM, Nick Pentreath > wrote: > >> Multi column support for StringIndexer didn’t make it into Spark 2.3.0 >> >> The PR is still in progress I think -

Re: How to use StringIndexer for multiple input /output columns in Spark Java

2018-05-15 Thread Mina Aslani
: OneHotEncoderEstimator - java.lang.NoSuchMethodError: org.apache.spark.sql.Dataset.withColumns Regards, Mina On Tue, May 15, 2018 at 2:37 AM, Nick Pentreath wrote: > Multi column support for StringIndexer didn’t make it into Spark 2.3.0 > > The PR is still in progress I think - should be

Re: How to use StringIndexer for multiple input /output columns in Spark Java

2018-05-14 Thread Nick Pentreath
Multi column support for StringIndexer didn’t make it into Spark 2.3.0 The PR is still in progress I think - should be available in 2.4.0 On Mon, 14 May 2018 at 22:32, Mina Aslani wrote: > Please take a look at the api doc: > https://spark.apache.org/docs/2.3.0/api/java/org/apache/sp

Re: How to use StringIndexer for multiple input /output columns in Spark Java

2018-05-14 Thread Mina Aslani
Please take a look at the api doc: https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/ml/feature/StringIndexer.html On Mon, May 14, 2018 at 4:30 PM, Mina Aslani wrote: > Hi, > > There is no SetInputCols/SetOutputCols for StringIndexer in Spark java. > How multiple

How to use StringIndexer for multiple input /output columns in Spark Java

2018-05-14 Thread Mina Aslani
Hi, There is no setInputCols/setOutputCols for StringIndexer in Spark Java. How can multiple input/output columns be specified then? Regards, Mina

Re: StringIndexer with high cardinality huge data

2018-04-10 Thread Shahab Yunus
a category or combine all >> columns into a single vector using HashingTF. >> >> Regards, >> Filipp. >> >> On Tue, Apr 10, 2018 at 4:01 PM, Shahab Yunus >> wrote: >> > Is the StringIndexer keeps all the mapped label to indices in the >> memory of

Re: StringIndexer with high cardinality huge data

2018-04-10 Thread Nick Pentreath
at 4:01 PM, Shahab Yunus > wrote: > > Is the StringIndexer keeps all the mapped label to indices in the memory > of > > the driver machine? It seems to be unless I am missing something. > > > > What if our data that needs to be indexed is huge and columns to be > index

Re: StringIndexer with high cardinality huge data

2018-04-10 Thread Filipp Zhinkin
ue, Apr 10, 2018 at 4:01 PM, Shahab Yunus wrote: > Is the StringIndexer keeps all the mapped label to indices in the memory of > the driver machine? It seems to be unless I am missing something. > > What if our data that needs to be indexed is huge and columns to be indexed > are high

StringIndexer with high cardinality huge data

2018-04-10 Thread Shahab Yunus
Does the StringIndexer <https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala> keep all the label-to-index mappings in the memory of the driver machine? It seems so, unless I am missing something. What if our data that needs

Re: StringIndexer on several columns in a DataFrame with Scala

2017-10-30 Thread Md. Rezaul Karim
7 at 5:19 PM, Nick Pentreath > wrote: > >> For now, you must follow this approach of constructing a pipeline >> consisting of a StringIndexer for each categorical column. See >> https://issues.apache.org/jira/browse/SPARK-11215 for the related JIRA >> to allow multiple col

Re: StringIndexer on several columns in a DataFrame with Scala

2017-10-30 Thread Weichen Xu
Yes, I am working on this. Sorry for the delay, but I will try to submit a PR ASAP. Thanks! On Mon, Oct 30, 2017 at 5:19 PM, Nick Pentreath wrote: > For now, you must follow this approach of constructing a pipeline > consisting of a StringIndexer for each categorical column. See

Re: StringIndexer on several columns in a DataFrame with Scala

2017-10-30 Thread Nick Pentreath
For now, you must follow this approach of constructing a pipeline consisting of a StringIndexer for each categorical column. See https://issues.apache.org/jira/browse/SPARK-11215 for the related JIRA to allow multiple columns for StringIndexer, which is being worked on currently. The reason
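The "one indexer per categorical column" approach described in this reply can be pictured with a small plain-Python sketch. This is not Spark code and not Spark's implementation: `fit_indexer`, `index_columns`, and the sample rows are hypothetical names invented here, and the alphabetical tie-break is an assumption made only to keep the output deterministic. The indices are emitted as floats, mirroring the double-valued output discussed elsewhere in these threads.

```python
from collections import Counter


def fit_indexer(values):
    """Map each distinct label to an index, most frequent label first."""
    counts = Counter(values)
    # Sort by descending frequency; ties are broken alphabetically
    # (an assumption of this sketch, chosen for determinism).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


def index_columns(rows, categorical_cols):
    """Fit one indexer per column, then assemble the indices into vectors."""
    indexers = {
        col: fit_indexer([row[col] for row in rows]) for col in categorical_cols
    }
    # Equivalent in spirit to running each indexer and then assembling
    # the indexed columns into one feature vector per row.
    vectors = [
        [indexers[col][row[col]] for col in categorical_cols] for row in rows
    ]
    return indexers, vectors


rows = [
    {"color": "red", "size": "S"},
    {"color": "red", "size": "M"},
    {"color": "blue", "size": "M"},
]
indexers, vectors = index_columns(rows, ["color", "size"])
print(vectors)
```

Each column keeps its own label-to-index mapping, which is why a separate indexer stage per column is needed when a single stage cannot take several input columns.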

StringIndexer on several columns in a DataFrame with Scala

2017-10-27 Thread Md. Rezaul Karim
Hi All, There are several categorical columns in my dataset as follows: [image: Inline images 1] How can I transform the values in each categorical column into numeric values using StringIndexer so that the resulting DataFrame can be fed into VectorAssembler to generate a feature vector? A naive

Re: Why StringIndexer uses double instead of int for indexing?

2017-01-21 Thread Holden Karau
t/ml-features.html#stringindexer. > What's the rationale for using double to index? Would it be more > appropriate to use int to index (which is consistent with other place > like Vector.sparse) > > Shiyuan > > >

Why StringIndexer uses double instead of int for indexing?

2017-01-21 Thread Shiyuan
Hi Spark, StringIndexer uses double instead of int for indexing http://spark.apache.org/docs/latest/ml-features.html#stringindexer. What's the rationale for using double to index? Would it be more appropriate to use int to index (which is consistent with other places like Vector.sparse)? Shiyuan

Re: OOM with StringIndexer, 800m rows & 56m distinct value column

2016-08-22 Thread Ben Teeuwen
de to reproduce the problem. >>> >>> I assume here that a pipeline should be able to transform a categorical >>> feature with a few million levels. >>> So I create a dataframe with the categorical feature (‘id’), apply a >>> StringIndexer and OneHotEncoder

Re: OOM with StringIndexer, 800m rows & 56m distinct value column

2016-08-19 Thread Davies Liu
8:19 AM, Ben Teeuwen wrote: >> >> So I wrote some code to reproduce the problem. >> >> I assume here that a pipeline should be able to transform a categorical >> feature with a few million levels. >> So I create a dataframe with the categorical feature (‘i

Re: OOM with StringIndexer, 800m rows & 56m distinct value column

2016-08-19 Thread Davies Liu
to transform a categorical > feature with a few million levels. > So I create a dataframe with the categorical feature (‘id’), apply a > StringIndexer and OneHotEncoder transformer, and run a loop where I increase > the amount of levels. > It breaks at 1.276.000 levels. > > S

Re: OOM with StringIndexer, 800m rows & 56m distinct value column

2016-08-19 Thread Ben Teeuwen
So I wrote some code to reproduce the problem. I assume here that a pipeline should be able to transform a categorical feature with a few million levels. So I create a dataframe with the categorical feature (‘id’), apply a StringIndexer and OneHotEncoder transformer, and run a loop where I

Re: OOM with StringIndexer, 800m rows & 56m distinct value column

2016-08-11 Thread Nick Pentreath
d[c], d['index'])).collectAsMap() # build up dictionary to be broadcasted > later on, used for creating sparse vectors > max_index = grouped.selectExpr("MAX(index) t").rdd.map(lambda r: > r.t).first() > > logging.info("Sanity check for indexes

Re: OOM with StringIndexer, 800m rows & 56m distinct value column

2016-08-11 Thread Ben Teeuwen
"{} min: {} max: {}".format(c, min(mappings[c].values()), max(mappings[c].values( # some logging to confirm the indexes. logging.info("Missing value = {}".format(mappings[c]['missing'])) return max_index, mappings I’d love to see the StringIndexe

Re: OOM with StringIndexer, 800m rows & 56m distinct value column

2016-08-04 Thread Nick Pentreath
Sure, I understand there are some issues with handling this missing value situation in StringIndexer currently. Your workaround is not ideal but I see that it is probably the only mechanism available currently to avoid the problem. But the OOM issues seem to be more about the feature cardinality

Re: OOM with StringIndexer, 800m rows & 56m distinct value column

2016-08-04 Thread Ben Teeuwen
missing in the training data. I wouldn’t need this workaround, if I had a better strategy in Spark for dealing with missing levels. How Spark can deal with it: "Additionally, there are two strategies regarding how StringIndexer will handle unseen labels when you have fit a StringIndexer o
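The quoted documentation is cut off before it names the two strategies. As far as I recall from the Spark documentation of that period, they are raising an error (the default) and skipping the row that contains the unseen label. The plain-Python sketch below only illustrates that behaviour; `transform`, the `handle_invalid` argument, and the `fitted` mapping are hypothetical names for this example and are not Spark API.

```python
def transform(values, mapping, handle_invalid="error"):
    """Apply a fitted label-to-index mapping to new data.

    handle_invalid="error" raises on a label that was not seen at fit time;
    handle_invalid="skip" silently drops the row containing that label.
    """
    indexed = []
    for value in values:
        if value in mapping:
            indexed.append((value, mapping[value]))
        elif handle_invalid == "skip":
            continue  # drop the row with the unseen label
        else:
            raise ValueError("Unseen label: {}".format(value))
    return indexed


# Mapping assumed to have been fitted on training data only.
fitted = {"a": 0.0, "c": 1.0, "b": 2.0}

# "d" never appeared in training, so it is dropped under "skip".
print(transform(["a", "d", "b"], fitted, handle_invalid="skip"))
```

Neither strategy assigns an index to the unseen level, which is why the poster falls back on a dedicated 'missing' bucket as a workaround.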

Re: OOM with StringIndexer, 800m rows & 56m distinct value column

2016-08-04 Thread Nick Pentreath
. The disadvantages include (i) no way to reverse the mapping from feature_index -> feature_name; (ii) potential for hash collisions (can be helped a bit by increasing your feature vector size). Here is a minimal example: In [1]: from pyspark.ml.feature import StringIndexer, OneH
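The two disadvantages listed here can be seen in a minimal plain-Python sketch of the hashing trick. It is an illustration only: `hash_index` and `count_collisions` are hypothetical helpers, and MD5 is used just because it is deterministic and available in the standard library, not because it is the hash function Spark uses.

```python
import hashlib


def hash_index(value, num_features):
    """Map a string to a bucket in [0, num_features) without a lookup table."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_features


def count_collisions(values, num_features):
    """Count distinct values that share a bucket with an earlier value."""
    buckets = {}
    for value in set(values):
        buckets.setdefault(hash_index(value, num_features), []).append(value)
    return sum(len(members) - 1 for members in buckets.values())


ids = ["user_{}".format(i) for i in range(1000)]

# No dictionary of labels is kept anywhere, so the mapping cannot be
# reversed from index back to name, but memory use does not grow with
# the number of distinct values.
print(hash_index("user_42", 1 << 20))

# A small vector forces collisions; a larger one makes them less likely.
print(count_collisions(ids, 16), count_collisions(ids, 1 << 20))
```

With 1000 distinct ids and only 16 buckets, at least 984 ids must share a bucket with another id, which is the collision cost that a larger feature vector size reduces.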

Re: OOM with StringIndexer, 800m rows & 56m distinct value column

2016-08-04 Thread Ben Teeuwen
I raised driver memory to 30G and maxresultsize to 25G, this time in pyspark. Code run: cat_int = ['bigfeature'] stagesIndex = [] stagesOhe = [] for c in cat_int: stagesIndex.append(StringIndexer(inputCol=c, outputCol="{}Index".format(c))) stagesOhe.append(OneHotEnc

OOM with StringIndexer, 800m rows & 56m distinct value column

2016-08-03 Thread Ben Teeuwen
Hi, I want to one-hot encode a column containing 56 million distinct values. My dataset is 800m rows + 17 columns. I first apply a StringIndexer, but it already breaks there, giving an OOM Java heap space error. I launch my app on YARN with: /opt/spark/2.0.0/bin/spark-shell --executor-memory 10G

StringIndexer

2016-06-16 Thread pseudo oduesp
Hi, I have a DataFrame with 1000 columns to convert to dummy variables with StringIndexer. When I apply the pipeline it takes a long time, and when I want to merge the result with another DataFrame (I mean: original DataFrame + columns indexed by the StringIndexers), the save stage is slow. Why? Code: indexers = [StringIndexer

Re: General question on using StringIndexer in SparkML

2015-12-02 Thread Vishnu Viswanath
ne only on the train data, the indexing is failing. >>>> >>>> Can you suggest me what can be done in this situation. >>>> >>>> Thanks, >>>> >>>> On Mon, Nov 30, 2015 at 12:32 AM, Vishnu Viswanath < >>>> vishnu.viswa

Re: General question on using StringIndexer in SparkML

2015-12-02 Thread Yanbo Liang
h was not there in train data. >>> Since fit() is done only on the train data, the indexing is failing. >>> >>> Can you suggest me what can be done in this situation. >>> >>> Thanks, >>> >>> On Mon, Nov 30, 2015 at 12:32 AM, Vishnu Viswanath <

Re: General question on using StringIndexer in SparkML

2015-12-02 Thread Vishnu Viswanath
train data. >> Since fit() is done only on the train data, the indexing is failing. >> >> Can you suggest me what can be done in this situation. >> >> Thanks, >> >> On Mon, Nov 30, 2015 at 12:32 AM, Vishnu Viswanath < >> vishnu.viswanat...@gmail.c

Re: General question on using StringIndexer in SparkML

2015-12-02 Thread Yanbo Liang
te: > > Thank you Jeff. >> >> On Sun, Nov 29, 2015 at 7:36 PM, Jeff Zhang wrote: >> >>> StringIndexer is an estimator which would train a model to be used both >>> in training & prediction. So it is consistent between training & prediction. >>> >

Re: General question on using StringIndexer in SparkML

2015-12-01 Thread Vishnu Viswanath
< vishnu.viswanat...@gmail.com> wrote: Thank you Jeff. > > On Sun, Nov 29, 2015 at 7:36 PM, Jeff Zhang wrote: > >> StringIndexer is an estimator which would train a model to be used both >> in training & prediction. So it is consistent between training & prediction. >

Re: General question on using StringIndexer in SparkML

2015-11-29 Thread Vishnu Viswanath
Thank you Jeff. On Sun, Nov 29, 2015 at 7:36 PM, Jeff Zhang wrote: > StringIndexer is an estimator which would train a model to be used both in > training & prediction. So it is consistent between training & prediction. > > You may want to read this section of

Re: General question on using StringIndexer in SparkML

2015-11-29 Thread Jeff Zhang
StringIndexer is an estimator which would train a model to be used both in training & prediction. So it is consistent between training & prediction. You may want to read this section of spark ml doc http://spark.apache.org/docs/latest/ml-guide.html#how-it-works On Mon, Nov 30, 2015 at

Re: General question on using StringIndexer in SparkML

2015-11-29 Thread Vishnu Viswanath
Thanks for the reply Yanbo. I understand that the model will be trained using the indexer map created during the training stage. But since I am getting a new set of data during prediction, I have to do StringIndexing on the new data as well. Right now I am using a new StringIndexer for this

Re: General question on using StringIndexer in SparkML

2015-11-29 Thread Yanbo Liang
: > Hi All, > > I have a general question on using StringIndexer. > StringIndexer gives an index to each label in the feature starting from 0 ( > 0 for least frequent word). > > Suppose I am building a model, and I use StringIndexer for transforming on > of my column. > e.g.,

General question on using StringIndexer in SparkML

2015-11-28 Thread Vishnu Viswanath
Hi All, I have a general question on using StringIndexer. StringIndexer gives an index to each label in the feature starting from 0 (0 for least frequent word). Suppose I am building a model, and I use StringIndexer for transforming one of my columns. e.g., suppose A was the most frequent word

Re: Random Forest and StringIndexer in pyspark ML Pipeline

2015-08-21 Thread Yanbo Liang
ML aims to provide machine learning pipelines so that users can build machine learning workflows more efficiently. It is more general to let StringIndexer chain with any kind of Estimator. I think we can make StringIndexer and the reverse process automatic in the future. If you want to know your original labels

Random Forest and StringIndexer in pyspark ML Pipeline

2015-08-10 Thread pkphlam
Hi, If I understand the RandomForest model in the ML Pipeline implementation in the ml package correctly, I have to first run my outcome label variable through the StringIndexer, even if my labels are numeric. The StringIndexer then converts the labels into numeric indices based on frequency of

Re: StringIndexer + VectorAssembler equivalent to HashingTF?

2015-08-07 Thread Peter Rudenko
two, 2->three) SI1 0 1 1 2 VectorAssembler.setInputCols(SI1, SI2).setOutputCol(features) -> features 00 11 01 22 HashingTF.setNumFeatures(2).setInputCol(COL1).setOutputCol(HT1) bucket1 bucket2 a,a,b c HT1 3 //Hash collision 3 3 1 Thanks, Peter Rudenko On 2015-08-07 09:55, praveen S wrote: Is StringInde

StringIndexer + VectorAssembler equivalent to HashingTF?

2015-08-06 Thread praveen S
Is StringIndexer + VectorAssembler equivalent to HashingTF while converting the document for analysis?

Interaction between StringIndexer feature transformer and CrossValidator

2015-06-18 Thread cyz
Hi, I encountered errors fitting a model using a CrossValidator. The training set contained a feature which was initially a String with many unique values. I used a StringIndexer to transform this feature column into label indices. Fitting a model with a regular pipeline worked fine, but I ran