OneHotEncoderEstimator - java.lang.NoSuchMethodError: org.apache.spark.sql.Dataset.withColumns

Regards,
Mina

On Tue, May 15, 2018 at 2:37 AM, Nick Pentreath wrote:
> Multi-column support for StringIndexer didn't make it into Spark 2.3.0.
> The PR is still in progress, I think - it should be available in 2.4.0.
On Mon, 14 May 2018 at 22:32, Mina Aslani wrote:
> Please take a look at the api doc:
> https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/ml/feature/StringIndexer.html
On Mon, May 14, 2018 at 4:30 PM, Mina Aslani wrote:
> Hi,
> There is no setInputCols/setOutputCols for StringIndexer in Spark Java.
> How can multiple input/output columns be specified then?
> Regards,
> Mina
…a category or combine all
columns into a single vector using HashingTF.

Regards,
Filipp.
On Tue, Apr 10, 2018 at 4:01 PM, Shahab Yunus wrote:
Does the StringIndexer
<https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala>
keep all the mapped labels-to-indices in the memory of the driver machine?
It seems to, unless I am missing something.
What if the data that needs to be indexed is huge and the columns to be indexed are high…
Yes, I am working on this. Sorry for the delay, but I will try to submit a PR ASAP.
Thanks!

On Mon, Oct 30, 2017 at 5:19 PM, Nick Pentreath wrote:
> For now, you must follow this approach of constructing a pipeline
> consisting of a StringIndexer for each categorical column. See…
For now, you must follow this approach of constructing a pipeline
consisting of a StringIndexer for each categorical column. See
https://issues.apache.org/jira/browse/SPARK-11215 for the related JIRA to
allow multiple columns for StringIndexer, which is being worked on
currently.
The reason…
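A pure-Python sketch (not actual Spark API code) of the approach Nick describes: fit one label-to-index mapping per categorical column, the way a pipeline of per-column StringIndexers does. The column names and rows below are made up for illustration.

```python
from collections import Counter

def fit_string_indexer(values):
    # Mimics StringIndexer.fit: the most frequent label gets index 0,
    # the next most frequent gets 1, and so on (ties broken alphabetically).
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {label: float(i) for i, label in enumerate(ordered)}

def fit_per_column(rows, columns):
    # One indexer per column, analogous to building
    # [StringIndexer(inputCol=c, outputCol=c + "_idx") for c in columns]
    # and chaining them in a Pipeline.
    return {c: fit_string_indexer([row[c] for row in rows]) for c in columns}

rows = [
    {"color": "red", "size": "L"},
    {"color": "red", "size": "S"},
    {"color": "blue", "size": "S"},
]
models = fit_per_column(rows, ["color", "size"])
print(models["color"])  # {'red': 0.0, 'blue': 1.0}
```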
Hi All,
There are several categorical columns in my dataset as follows:
[image: Inline images 1]
How can I transform the values in each categorical column into numeric values using
StringIndexer, so that the resulting DataFrame can be fed into
VectorAssembler to generate a feature vector?
A naive…
Hi Spark,
StringIndexer uses double instead of int for indexing:
http://spark.apache.org/docs/latest/ml-features.html#stringindexer. What's
the rationale for using double to index? Would it be more appropriate to
use int (which would be consistent with other places, like Vectors.sparse)?
Shiyuan
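One likely part of the rationale (my reading, not an authoritative answer): ml feature columns feed into Vectors whose slots are doubles, and a double represents integer values exactly up to 2^53, far beyond any realistic label cardinality, so double-valued indices lose nothing. A quick check:

```python
# Doubles hold integer values exactly up to 2**53; only beyond that
# does precision run out - irrelevant for label indices in practice.
limit = 2 ** 53
assert float(limit - 1) == limit - 1       # exact below the limit
assert float(limit) + 1.0 == float(limit)  # precision is lost past 2**53
print("doubles are exact integers up to", limit)
```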
So I wrote some code to reproduce the problem.
I assume here that a pipeline should be able to transform a categorical feature
with a few million levels.
So I create a dataframe with the categorical feature ('id'), apply a
StringIndexer and OneHotEncoder transformer, and run a loop where I increase
the amount of levels.
It breaks at 1,276,000 levels.
…d[c], d['index'])).collectAsMap()  # build up dictionary to be broadcasted
# later on, used for creating sparse vectors
max_index = grouped.selectExpr("MAX(index) t").rdd.map(lambda r: r.t).first()

logging.info("Sanity check for indexes of {} min: {} max: {}".format(
    c, min(mappings[c].values()), max(mappings[c].values())))  # some logging to confirm the indexes
logging.info("Missing value = {}".format(mappings[c]['missing']))
return max_index, mappings

I'd love to see the StringIndexe…
Sure, I understand there are some issues with handling this missing-value
situation in StringIndexer currently. Your workaround is not ideal, but I
see that it is probably the only mechanism available currently to avoid the
problem.
But the OOM issues seem to be more about the feature cardinality…
…missing in the training data. I wouldn't need this workaround if I had a
better strategy in Spark for dealing with missing levels. How Spark can deal
with it:
"Additionally, there are two strategies regarding how StringIndexer will handle
unseen labels when you have fit a StringIndexer o…"
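The quoted docs refer to StringIndexer's handleInvalid parameter: "error" throws on an unseen label, "skip" drops those rows, and newer Spark versions add "keep", which maps all unseen labels to one extra index. A pure-Python sketch of the idea, not actual Spark code:

```python
def transform(mapping, values, handle_invalid="error"):
    # `mapping` was fit on training data; `values` may contain unseen labels.
    out = []
    for v in values:
        if v in mapping:
            out.append(mapping[v])
        elif handle_invalid == "skip":
            continue                         # drop rows with unseen labels
        elif handle_invalid == "keep":
            out.append(float(len(mapping)))  # one extra bucket for all unseen labels
        else:
            raise ValueError("Unseen label: %s" % v)
    return out

mapping = {"a": 0.0, "b": 1.0}  # fit on training data only; 'c' is unseen
print(transform(mapping, ["a", "c"], "skip"))  # [0.0]
print(transform(mapping, ["a", "c"], "keep"))  # [0.0, 2.0]
```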
…The disadvantages
include (i) no way to reverse the mapping from feature_index ->
feature_name; (ii) potential for hash collisions (which can be helped a bit by
increasing your feature vector size).
Here is a minimal example:
In [1]: from pyspark.ml.feature import StringIndexer, OneH…
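Both drawbacks can be seen with a toy hashing function (deterministic CRC32 here; HashingTF uses its own hash, so these bucket indices will not match Spark's):

```python
import zlib

def hash_index(name, num_features):
    # The hashing trick: bucket = hash(feature) mod vector size.
    # Not invertible: you cannot recover `name` from the bucket index.
    return zlib.crc32(name.encode("utf8")) % num_features

num_features = 2
buckets = {}
for name in ["a", "b", "c"]:
    buckets.setdefault(hash_index(name, num_features), []).append(name)

# Three names into two buckets: at least one collision is guaranteed.
print(buckets)
```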
I raised driver memory to 30G and maxResultSize to 25G, this time in pyspark.
Code run:

cat_int = ['bigfeature']
stagesIndex = []
stagesOhe = []
for c in cat_int:
    stagesIndex.append(StringIndexer(inputCol=c, outputCol="{}Index".format(c)))
    stagesOhe.append(OneHotEnc…
Hi,
I want to one-hot encode a column containing 56 million distinct values. My
dataset is 800m rows x 17 columns.
I first apply a StringIndexer, but it already breaks there, giving an OOM java
heap space error.
I launch my app on YARN with:
/opt/spark/2.0.0/bin/spark-shell --executor-memory 10G
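A rough back-of-envelope calculation of why fitting on 56 million distinct labels strains the driver: the fitted StringIndexerModel holds every label in driver memory. The per-entry overhead below is a guess for illustration, not a measured number.

```python
# Assume ~100 bytes per entry (JVM String object + char data + hash-map
# node overhead) - a ballpark figure, not a measurement.
n_labels = 56_000_000
bytes_per_entry = 100
total_gb = n_labels * bytes_per_entry / 1024 ** 3
print("approx label-map size: %.1f GB" % total_gb)  # several GB before any skew
```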
Hi,
I have a dataframe with 1000 columns to turn into dummies with StringIndexer.
When I apply the pipeline, it takes a long time when I want to merge the result
with another data frame.
I mean: the original data frame + the columns indexed by StringIndexers.
The problem is the save stage - it is long. Why?
Code:
indexers = [StringIndexer…
…which was not there in the train data.
Since fit() is done only on the train data, the indexing is failing.

Can you suggest what can be done in this situation?

Thanks,

On Mon, Nov 30, 2015 at 12:32 AM, Vishnu Viswanath
<vishnu.viswanat...@gmail.com> wrote:
> Thank you Jeff.
>
> On Sun, Nov 29, 2015 at 7:36 PM, Jeff Zhang wrote:
>
>> StringIndexer is an estimator which would train a model to be used both
>> in training & prediction. So it is consistent between training & prediction.
StringIndexer is an estimator which would train a model to be used both in
training & prediction. So it is consistent between training & prediction.
You may want to read this section of the spark ml doc:
http://spark.apache.org/docs/latest/ml-guide.html#how-it-works
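Jeff's point, sketched in plain Python (not Spark API code): an estimator's fit() freezes the label-to-index map once, and that same frozen model transforms both training and prediction data, so indices stay consistent.

```python
from collections import Counter

class ToyStringIndexer:
    # Estimator: fit() learns the mapping from training data only.
    def fit(self, values):
        counts = Counter(values)
        order = sorted(counts, key=lambda v: (-counts[v], v))
        self.labels_ = {l: float(i) for i, l in enumerate(order)}
        return self  # the "model", with a frozen mapping

    # Model: transform() reuses the frozen mapping, never re-learns it.
    def transform(self, values):
        return [self.labels_[v] for v in values]

model = ToyStringIndexer().fit(["a", "a", "b"])
print(model.transform(["b", "a"]))  # [1.0, 0.0] - same map at prediction time
```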
Thanks for the reply, Yanbo.
I understand that the model will be trained using the indexer map created
during the training stage.
But since I am getting a new set of data during prediction, and I have to
do StringIndexing on the new data also,
right now I am using a new StringIndexer for this…
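Refitting a fresh StringIndexer on the prediction data is exactly what breaks consistency: label frequencies differ between datasets, so the same label can receive a different index. A minimal pure-Python illustration with made-up data:

```python
from collections import Counter

def fit(values):
    # Frequency-descending assignment, like StringIndexer's default ordering.
    counts = Counter(values)
    order = sorted(counts, key=lambda v: (-counts[v], v))
    return {l: float(i) for i, l in enumerate(order)}

train_map = fit(["a", "a", "b"])  # 'a' is most frequent here
new_map = fit(["b", "b", "a"])    # 'b' is most frequent here
print(train_map["a"], new_map["a"])  # 0.0 1.0 - same label, different index
```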
Hi All,
I have a general question on using StringIndexer.
StringIndexer gives an index to each label in the feature, starting from 0
(0 for the most frequent label).
Suppose I am building a model, and I use StringIndexer for transforming one
of my columns.
e.g., suppose A was the most frequent word…
Spark ML aims to provide a Machine Learning pipeline API that lets users do
machine learning more efficiently.
It is more general to have StringIndexer chain with any kind of Estimator.
I think we can make StringIndexer and the reverse process automatic in the
future.
If you want to know your original labels…
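The reverse step mentioned here exists today as spark.ml's IndexToString transformer; conceptually it just inverts the fitted map. A tiny sketch (not Spark code, with a made-up fitted map):

```python
# Invert a fitted label->index map to recover the original labels,
# which is what IndexToString does with a StringIndexerModel's labels.
label_to_index = {"red": 0.0, "blue": 1.0, "green": 2.0}
index_to_label = {i: l for l, i in label_to_index.items()}

predictions = [2.0, 0.0, 0.0]
print([index_to_label[p] for p in predictions])  # ['green', 'red', 'red']
```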
Hi,
If I understand the RandomForest model in the ML Pipeline implementation in
the ml package correctly, I have to first run my outcome label variable
through the StringIndexer, even if my labels are numeric. The StringIndexer
then converts the labels into numeric indices based on frequency of…
SI1
0
1
1
2
VectorAssembler.setInputCols(SI1, SI2).setOutputCol(features) ->
features
00
11
01
22
HashingTF.setNumFeatures(2).setInputCol(COL1).setOutputCol(HT1)
bucket1 bucket2
a,a,b c
HT1
3 //Hash collision
3
3
1
Thanks,
Peter Rudenko
On 2015-08-07 09:55, praveen S wrote:
Is StringInde
Is StringIndexer + VectorAssembler equivalent to HashingTF while converting
the document for analysis?
Hi,
I encountered errors fitting a model using a CrossValidator. The training
set contained a feature which was initially a String with many unique
values. I used a StringIndexer to transform this feature column into label
indices. Fitting a model with a regular pipeline worked fine, but I ran…