OneHotEncoderEstimator - java.lang.NoSuchMethodError: org.apache.spark.sql.Dataset.withColumns
Regards,
Mina
On Tue, May 15, 2018 at 2:37 AM, Nick Pentreath <nick.pentre...@gmail.com>
wrote:
Multi column support for StringIndexer didn’t make it into Spark 2.3.0
The PR is still in progress I think - should be available in 2.4.0
On Mon, 14 May 2018 at 22:32, Mina Aslani <aslanim...@gmail.com> wrote:
Please take a look at the api doc:
https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/ml/feature/StringIndexer.html
On Mon, May 14, 2018 at 4:30 PM, Mina Aslani <aslanim...@gmail.com> wrote:
Hi,
There is no setInputCols/setOutputCols for StringIndexer in Spark Java.
How can multiple input/output columns be specified then?
Regards,
Mina
If no, then you may use the value's hash code as a category, or combine all
columns into a single vector using HashingTF.
Regards,
Filipp.
On Tue, Apr 10, 2018 at 4:01 PM, Shahab Yunus <shahab.yu...@gmail.com> wrote:
> Does the StringIndexer
> <https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala>
> keep all the mapped labels to indices in the memory of the driver machine?
> It seems so, unless I am missing something.
> What if our data that needs to be indexed is huge and columns to be
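The hash-code suggestion above can be sketched in plain Python. This is only an illustration of the idea, not the Spark API; the helper name is made up.

```python
import hashlib

def hash_bucket(value, num_buckets):
    """Map a categorical value to a bucket index deterministically.

    hashlib is used instead of Python's built-in hash() so the mapping is
    stable across processes, which matters on a distributed cluster.
    """
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

# No label-to-index table has to fit in driver memory: each value is mapped
# independently of all the others.
buckets = [hash_bucket(v, 1000) for v in ["red", "green", "blue", "red"]]
```

The trade-off, raised later in this thread, is that distinct values can collide in the same bucket and the mapping cannot be reversed back to the original label.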
Yes, I am working on this. Sorry for the late reply, but I will try to submit a PR ASAP.
Thanks!
On Mon, Oct 30, 2017 at 5:19 PM, Nick Pentreath <nick.pentre...@gmail.com> wrote:
For now, you must follow this approach of constructing a pipeline
consisting of a StringIndexer for each categorical column. See
https://issues.apache.org/jira/browse/SPARK-11215 for the related JIRA to
allow multiple columns for StringIndexer, which is being worked on
currently.
The reason
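The per-column approach described above can be sketched in plain Python. This is a toy stand-in for the Spark pipeline with invented names; the real API fits transformers on DataFrames, not lists of dicts.

```python
from collections import Counter

def fit_indexers(rows, columns):
    """For each column, fit a label -> index map (most frequent label -> 0),
    mimicking one StringIndexer per categorical column."""
    indexers = {}
    for col in columns:
        counts = Counter(row[col] for row in rows)
        ordered = [label for label, _ in counts.most_common()]
        indexers[col] = {label: i for i, label in enumerate(ordered)}
    return indexers

def transform(rows, indexers):
    """Apply every fitted per-column map, like a Pipeline of indexers."""
    return [{col: idx[row[col]] for col, idx in indexers.items()} for row in rows]

train = [{"color": "red", "size": "L"},
         {"color": "red", "size": "M"},
         {"color": "blue", "size": "L"}]
indexers = fit_indexers(train, ["color", "size"])
encoded = transform(train, indexers)
# "red" and "L" are the most frequent values, so each gets index 0.
```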
Hi All,
There are several categorical columns in my dataset as follows:
[image: Inline images 1]
How can I transform the values in each (categorical) column into numeric using
StringIndexer, so that the resulting DataFrame can be fed into
VectorAssembler to generate a feature vector?
A naive
Hi Spark,
StringIndexer uses double instead of int for indexing:
http://spark.apache.org/docs/latest/ml-features.html#stringindexer. What's
the rationale for using double to index? Would it be more appropriate to
use int to index (which would be consistent with other places like Vectors.sparse)?
Shiyuan
So I wrote some code to reproduce the problem.
I assume here that a pipeline should be able to transform a categorical
feature with a few million levels.
So I create a dataframe with the categorical feature (‘id’), apply a
StringIndexer and OneHotEncoder transformer, and run a loop where I increase
the amount of levels.
It breaks at 1,276,000 levels.
Shall I report this as a ticket in JIRA?
# dictionary to be broadcast later on, used for creating sparse vectors
max_index = grouped.selectExpr("MAX(index) t").rdd.map(lambda r: r.t).first()

logging.info("Sanity check for indexes:")
for c in cat_int[:]:
    # some logging to confirm the indexes
    logging.info("Missing value = {}".format(mappings[c]['missing']))
return max_index, mappings
I’d love to see the StringIndexer + OneHotEncoder transformers cope with
missing values during prediction; for now I’ll work with the ha
Sure, I understand there are some issues with handling this missing value
situation in StringIndexer currently. Your workaround is not ideal but I
see that it is probably the only mechanism available currently to avoid the
problem.
But the OOM issues seem to be more about the feature cardinality
was
missing in the training data. I wouldn’t need this workaround, if I had a
better strategy in Spark for dealing with missing levels. How Spark can deal
with it:
"Additionally, there are two strategies regarding how StringIndexer will handle
unseen labels when you have fit a StringIndexer o
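The quoted passage refers to StringIndexer's handleInvalid strategies. A rough plain-Python sketch (not the Spark API; note that newer Spark versions also offer a "keep" option that sends all unseen labels to one reserved index):

```python
def transform_col(values, mapping, handle_invalid="error"):
    """Apply a fitted label -> index map, with a strategy for unseen labels."""
    out = []
    for v in values:
        if v in mapping:
            out.append(mapping[v])
        elif handle_invalid == "error":
            raise ValueError("Unseen label: {!r}".format(v))
        elif handle_invalid == "skip":
            continue  # drop rows containing unseen labels
        elif handle_invalid == "keep":
            out.append(len(mapping))  # one reserved index for all unseen labels
    return out

mapping = {"a": 0, "b": 1}
print(transform_col(["a", "b", "c"], mapping, "skip"))  # [0, 1]
print(transform_col(["a", "b", "c"], mapping, "keep"))  # [0, 1, 2]
```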
The disadvantages
include (i) no way to reverse the mapping from feature_index ->
feature_name; (ii) potential for hash collisions (can be helped a bit by
increasing your feature vector size).
Here is a minimal example:
In [1]: from pyspark.ml.feature import StringIndexer, OneH
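The collision point above can be illustrated without Spark. This is a plain-Python sketch: the exact counts depend on the hash function, but the trend (fewer collisions as the feature vector grows) is the thing to see.

```python
import hashlib

def bucket(value, num_features):
    # Deterministic stand-in for a hashing-trick bucket assignment.
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16) % num_features

values = ["feat_{}".format(i) for i in range(1000)]
collisions = []
for n in (100, 10_000, 1_000_000):
    distinct_buckets = len({bucket(v, n) for v in values})
    collisions.append(len(values) - distinct_buckets)
# Larger feature vectors leave fewer distinct values sharing a bucket.
```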
I raised driver memory to 30G and maxResultSize to 25G, this time in PySpark.
Code run:
cat_int = ['bigfeature']
stagesIndex = []
stagesOhe = []
for c in cat_int:
    stagesIndex.append(StringIndexer(inputCol=c, outputCol="{}Index".format(c)))
    stagesOhe.append(OneHotEncoder(dropL
Hi,
I want to one hot encode a column containing 56 million distinct values. My
dataset is 800m rows + 17 columns.
I first apply a StringIndexer, but it already breaks there, giving an OOM Java
heap space error.
I launch my app on YARN with:
/opt/spark/2.0.0/bin/spark-shell --executor-memory 10G
Hi,
I have a dataframe with 1000 columns to turn into dummies with StringIndexer.
When I apply the pipeline it takes a long time; then I want to merge the result
with another data frame, I mean: the original data frame plus the columns
indexed by the StringIndexers.
The problem is the save stage: it is slow. Why?
Code:
indexers = [StringIndexer
... having values which were not there in the train data.
Since fit() is done only on the train data, the indexing is failing.
Can you suggest what can be done in this situation?
Thanks,
Thanks for the reply, Yanbo.
I understand that the model will be trained using the indexer map created
during the training stage.
But since I am getting a new set of data during prediction, and I have to
do StringIndexing on the new data also, right now I am using a new StringIndexer
StringIndexer is an estimator which would train a model to be used both in
training & prediction. So it is consistent between training & prediction.
You may want to read this section of the Spark ML docs:
http://spark.apache.org/docs/latest/ml-guide.html#how-it-works
On Mon, Nov 30, 2015 at
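The point above, fit the mapping once on the training data and reuse it at prediction time, can be sketched in plain Python (a toy model, not the Spark API):

```python
from collections import Counter

def fit(labels):
    """Fit a label -> index map on training data (most frequent label -> 0)."""
    ordered = [l for l, _ in Counter(labels).most_common()]
    return {l: i for i, l in enumerate(ordered)}

train = ["cat", "cat", "cat", "dog", "dog", "bird"]
model = fit(train)  # fit on the training data only

train_idx = [model[l] for l in train]
test_idx = [model[l] for l in ["dog", "cat"]]  # same fitted map at prediction
# An index means the same label in both phases, so the encoding is consistent.
assert train_idx[0] == test_idx[1]  # "cat" gets the same index in both
```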
Thank you Jeff.
On Sun, Nov 29, 2015 at 7:36 PM, Jeff Zhang <zjf...@gmail.com> wrote:
Hi All,
I have a general question on using StringIndexer.
StringIndexer gives an index to each label in the feature, starting from 0
(0 for the most frequent label).
Suppose I am building a model, and I use StringIndexer for transforming one
of my columns.
e.g., suppose A was most frequent word
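Note that by default Spark orders labels by descending frequency, so the most frequent label receives index 0. A plain-Python sketch of that ordering (not the Spark API):

```python
from collections import Counter

words = ["A", "A", "A", "B", "B", "C"]
counts = Counter(words)
# most_common() yields labels in descending frequency order,
# so the most frequent label ends up with index 0.
index = {label: i for i, (label, _) in enumerate(counts.most_common())}
print(index)  # {'A': 0, 'B': 1, 'C': 2}
```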
ML plans to provide a Machine Learning pipeline so that users can do machine
learning more efficiently.
It is more general to let StringIndexer chain with any kind of Estimator.
I think we can make StringIndexer and the reverse process automatic in the
future.
If you want to know your original labels, you
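Recovering the original labels amounts to inverting the fitted mapping (Spark ships an IndexToString transformer for this). A plain-Python sketch:

```python
# Invert a fitted label -> index map to decode predicted indices back
# into the original labels (what IndexToString does in spark.ml).
index = {"A": 0, "B": 1, "C": 2}
labels = {i: label for label, i in index.items()}

predictions = [2, 0, 0, 1]
decoded = [labels[i] for i in predictions]
print(decoded)  # ['C', 'A', 'A', 'B']
```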
Hi,
If I understand the RandomForest model in the ML Pipeline implementation in
the ml package correctly, I have to first run my outcome label variable
through the StringIndexer, even if my labels are numeric. The StringIndexer
then converts the labels into numeric indices based on frequency
Is StringIndexer + VectorAssembler equivalent to HashingTF while converting
the document for analysis?
(SI1, SI2).setOutputCol(features) ->
features:
0 0
1 1
0 1
2 2

HashingTF.setNumFeatures(2).setInputCol(COL1).setOutputCol(HT1):
bucket1: a, a, b    bucket2: c
HT1:
3  // hash collision
3
3
1
Thanks,
Peter Rudenko
On 2015-08-07 09:55, praveen S wrote:
Is StringIndexer + VectorAssembler equivalent
Hi,
I encountered errors fitting a model using a CrossValidator. The training
set contained a feature which was initially a String with many unique
values. I used a StringIndexer to transform this feature column into label
indices. Fitting a model with a regular pipeline worked fine, but I ran