Re: [C++] CSV string column category to dictionary/indices?

2019-12-03 Thread ntfs hard
Hello

Thank you for your advice! I'll try to adapt it to my code.

Best,


Re: [C++] CSV string column category to dictionary/indices?

2019-12-03 Thread Antoine Pitrou


Agreed.  I've opened https://issues.apache.org/jira/browse/ARROW-7302 to
track it.

Regards

Antoine.




Re: [C++] CSV string column category to dictionary/indices?

2019-12-02 Thread Wes McKinney
An option was recently added to dictionary-encode all string columns:

https://github.com/apache/arrow/blob/master/cpp/src/arrow/csv/options.h#L82

I think it would be useful to be able to hard-opt-in to
dictionary-encode a particular column (regardless of what the
cardinality ends up being). Whatever the way to do this, it should be
clear and well documented. A new JIRA issue may be in order. Antoine,
what do you think?
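
To illustrate the distinction between the two behaviours, here is a minimal
sketch in plain C++. It is a toy model, not Arrow code: the helper
TryEncodeWithCap and its parameters are hypothetical names introduced only for
this example. It assumes that automatic encoding gives up once a column has
more distinct values than some cap, whereas a hard opt-in would behave as if
there were no cap.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper (not an Arrow API): try to dictionary-encode a column.
// Returns false and leaves *codes empty as soon as a new distinct value would
// push the number of categories past max_cardinality.
bool TryEncodeWithCap(const std::vector<std::string>& column,
                      std::size_t max_cardinality,
                      std::vector<std::int32_t>* codes) {
  assert(codes != nullptr);
  std::unordered_map<std::string, std::int32_t> index_of;
  std::vector<std::int32_t> result;
  result.reserve(column.size());
  for (const std::string& value : column) {
    auto it = index_of.find(value);
    if (it == index_of.end()) {
      if (index_of.size() >= max_cardinality) {
        // Too many distinct values: the caller keeps the plain strings.
        codes->clear();
        return false;
      }
      // New category: its code is the number of categories seen so far.
      const std::int32_t code = static_cast<std::int32_t>(index_of.size());
      it = index_of.emplace(value, code).first;
    }
    result.push_back(it->second);
  }
  codes->swap(result);
  return true;
}
```

With a cap of 2, the column {"a", "b", "a"} is encoded as 0, 1, 0, while
{"a", "b", "c"} is rejected and stays as plain strings. Passing
std::numeric_limits<std::size_t>::max() (from <limits>) as the cap models the
hard opt-in, where the column is encoded whatever its cardinality.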



[C++] CSV string column category to dictionary/indices?

2019-12-01 Thread ntfs hard
Hello

I'm a newcomer and not quite sure how to use the library. I looked for
documentation on this but couldn't find any.

I have a dataset in a CSV file where one column (let's call it colour) is a
string category. I'd like to get indices instead of the text values so that I
can pass them to an algorithm.
I tried to set column_types in ConvertOptions to
{{"colour", arrow::dictionary(std::make_shared<arrow::Int32Type>(),
arrow::utf8()) }} but that doesn't seem to be the right API usage; a run-time
error appears: NotImplemented: CSV conversion to dictionary<values=string, indices=int32, ordered=0> is not supported
I also found a merged PR #5785 <https://github.com/apache/arrow/pull/5785> but
I'm not sure it applies to my case.

So, my question is: can I get the indices for a category column using only the
library API? And if so, what am I doing wrong? :)

*In other words,* I'd like to do something like this Python pandas code:
df[column] = df[column].cat.codes # if str(column_data_type) == "category"

Thank you!
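
For reference, the result being asked for can be sketched in plain C++ with
only the standard library. This is a minimal illustration of dictionary
encoding, not Arrow's implementation: EncodedColumn and DictionaryEncode are
hypothetical names made up for this example. Codes are assigned in order of
first appearance, which is an assumption of the sketch; pandas may number its
categories differently (for example in sorted order).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <string>
#include <unordered_map>
#include <vector>

// Result of dictionary-encoding a string column: the distinct values in
// order of first appearance, plus one integer code per row.
struct EncodedColumn {
  std::vector<std::string> dictionary;
  std::vector<std::int32_t> codes;
};

// Hypothetical helper (not an Arrow API): replace each string by the index
// of its category in the dictionary built so far.
EncodedColumn DictionaryEncode(const std::vector<std::string>& column) {
  EncodedColumn out;
  std::unordered_map<std::string, std::int32_t> index_of;
  out.codes.reserve(column.size());
  for (const std::string& value : column) {
    auto it = index_of.find(value);
    if (it == index_of.end()) {
      // New category: make sure its code still fits into an int32 index.
      assert(out.dictionary.size() <
             static_cast<std::size_t>(std::numeric_limits<std::int32_t>::max()));
      const std::int32_t code =
          static_cast<std::int32_t>(out.dictionary.size());
      it = index_of.emplace(value, code).first;
      out.dictionary.push_back(value);
    }
    out.codes.push_back(it->second);
  }
  return out;
}
```

For a colour column holding red, green, red, blue, green, the codes are
0, 1, 0, 2, 1 and the dictionary is red, green, blue. The codes vector plays
the role of df[column].cat.codes in the pandas line above.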