Re: UTF-8 Support for TextParser

2018-03-09 Thread Anirudh
I won't run the entire text through a converter; I will just skip the BOM
during the parsing stage.

Anirudh
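
As a rough illustration of skipping the BOM during parsing (a minimal
sketch, not the actual dmlc-core change; the helper name SkipUTF8BOM is an
assumption made for this example), the check only needs to look at the
first three bytes of the buffer:

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper (not a dmlc-core function): returns a pointer to the
// first byte after a leading UTF-8 BOM (0xEF 0xBB 0xBF), or `begin`
// unchanged when the buffer does not start with a BOM.
inline const char* SkipUTF8BOM(const char* begin, const char* end) {
  static const unsigned char kBOM[3] = {0xEF, 0xBB, 0xBF};
  if (end - begin >= 3 && std::memcmp(begin, kBOM, 3) == 0) {
    return begin + 3;
  }
  return begin;
}
```

Because the check runs once per file rather than once per byte, it should
not affect the speed of the ASCII parsing loop itself.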

On Fri, Mar 9, 2018 at 1:12 PM, Chris Olivier wrote:

> For this, are you going to run the entire text through a converter, or just
> prepend the UTF-8 header to the file (0xEF,0xBB,0xBF)?


Re: UTF-8 Support for TextParser

2018-03-09 Thread Chris Olivier
For this, are you going to run the entire text through a converter, or just
prepend the UTF-8 header to the file (0xEF,0xBB,0xBF)?

On Fri, Mar 9, 2018 at 12:43 PM, Anirudh wrote:

> Hi,
>
> After looking more closely at the customer requirement, we found that the
> customer uses only ASCII data with MXNet. They just want files that start
> with a UTF-8 BOM, and files that use different control characters for
> newlines, to work correctly. dmlc-core already supports the different
> newline control characters.
> Since a UTF-8 BOM at the start of a file is a common case for other MXNet
> users too (for example, when saving an Excel sheet as UTF-8 CSV), I will add
> support for handling the UTF-8 BOM in dmlc-core.
> I won't be working on UTF8CSVParser unless a customer requirement for it
> comes up later.
>
> Anirudh


Re: UTF-8 Support for TextParser

2018-03-09 Thread Anirudh
Hi,

After looking more closely at the customer requirement, we found that the
customer uses only ASCII data with MXNet. They just want files that start
with a UTF-8 BOM, and files that use different control characters for
newlines, to work correctly. dmlc-core already supports the different
newline control characters.
Since a UTF-8 BOM at the start of a file is a common case for other MXNet
users too (for example, when saving an Excel sheet as UTF-8 CSV), I will add
support for handling the UTF-8 BOM in dmlc-core.
I won't be working on UTF8CSVParser unless a customer requirement for it
comes up later.

Anirudh
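
For readers unfamiliar with the newline issue mentioned above, here is a
minimal sketch of splitting a buffer on "\n", "\r" and "\r\n". It is only an
illustration under stated assumptions: SplitLines is a hypothetical name,
and this is not how dmlc-core implements its line handling.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: splits `text` into lines, treating '\n' and '\r' as
// line terminators. A "\r\n" pair therefore counts as a single break, and
// blank lines are skipped.
inline std::vector<std::string> SplitLines(const std::string& text) {
  std::vector<std::string> lines;
  std::size_t start = 0;
  while (start < text.size()) {
    std::size_t stop = text.find_first_of("\r\n", start);
    if (stop == std::string::npos) stop = text.size();
    // Only keep non-empty spans; this is what collapses "\r\n" into one break.
    if (stop > start) lines.emplace_back(text, start, stop - start);
    start = stop + 1;
  }
  return lines;
}
```

Skipping blank lines keeps the sketch short; a parser that must preserve
empty records would need to treat "\r\n" as a unit instead.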



On Wed, Feb 28, 2018 at 11:50 PM, Anirudh wrote:

> Hi Tianqi,
>
> What do you think about adding a separate parser for CSV with UTF-8 support
> in dmlc-core? We can then just add a flag in MXNet for UTF-8 and use the
> UTF-8 or the ASCII parser based on this flag. (This idea was suggested by
> Mu.)
>
> I think some small changes will be required in the base class
> "TextParserBase", as the method "BackFindEndLine" will need more logic
> to check for other code points that act as line breaks; this can be
> refactored.
> This approach will likely retain the performance of the existing ASCII CSV
> parser, while letting MXNet users choose between the usability of the
> UTF-8 CSV parser and the performance of the ASCII CSV parser.
>
> Thanks,
> Anirudh
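
To make the "other code points for line breaks" remark in the quoted
proposal concrete, here is a minimal sketch of a backward scan that also
recognizes a few Unicode line terminators in UTF-8 encoded data. It is an
assumption-laden illustration: BackFindLineBreakUTF8 is a hypothetical
name, it is not dmlc-core's BackFindEndLine, and it covers only U+0085,
U+2028 and U+2029 in addition to '\n' and '\r'.

```cpp
#include <cstddef>

// Hypothetical sketch: scans backwards from `end` and returns a pointer to
// the first byte of the last line terminator in [begin, end). Returns
// `begin` when no terminator is found. Assumes the input is valid UTF-8.
inline const char* BackFindLineBreakUTF8(const char* begin, const char* end) {
  for (const char* p = end; p != begin;) {
    --p;
    const unsigned char c = static_cast<unsigned char>(*p);
    // ASCII terminators.
    if (c == '\n' || c == '\r') return p;
    // U+0085 (NEXT LINE) is encoded as C2 85.
    if (c == 0x85 && p - begin >= 1 &&
        static_cast<unsigned char>(p[-1]) == 0xC2) {
      return p - 1;
    }
    // U+2028 / U+2029 (LINE / PARAGRAPH SEPARATOR) are E2 80 A8 / E2 80 A9.
    if ((c == 0xA8 || c == 0xA9) && p - begin >= 2 &&
        static_cast<unsigned char>(p[-1]) == 0x80 &&
        static_cast<unsigned char>(p[-2]) == 0xE2) {
      return p - 2;
    }
  }
  return begin;
}
```

The extra comparisons run for every byte examined, which is the kind of
per-byte cost that a separate ASCII-only parser would avoid.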


Re: UTF-8 Support for TextParser

2018-02-26 Thread Anirudh
Hi Marco,

I understand that there needs to be a separate discussion about the strong
dependency of MXNet on dmlc-core and how to fix it.

Having said that, I think the goals of dmlc-core and MXNet are somewhat
aligned. Posting to the MXNet dev list in this case is a good way to gather
feedback from both communities, since I consider the MXNet community to be
mostly a superset of the dmlc-core community.

Anirudh

On Mon, Feb 26, 2018 at 5:00 PM, Subramanian, Anirudh wrote:

> Hi Tianqi,
>
> UTF-8 support would make other formats like CSV more usable.
> Otherwise, users have to normalize their data in some way before
> using MXNet.
> I understand that there is a tradeoff here because of the efficiency gains
> from the parser, but the expectation of having to normalize their UTF-8
> files may turn users away.
>
> Anirudh


Re: UTF-8 Support for TextParser

2018-02-26 Thread Marco de Abreu
The problem is that the DMLC organization and dmlc-core are not part of the
Apache Software Foundation. If that change is specifically for dmlc-core,
it has to be discussed in that community. This email list is for MXNet
under the Apache Incubator.

Apparently, MXNet has a very strong dependency on the dmlc-core package,
which is not managed by this community. Setting aside risks such as code over
there not being properly validated by our CI (Chris created a thread about
this just recently), this is not the way an Apache project should work.
MXNet is currently under the Apache Software Foundation while the actual
core is managed by the DMLC organization, leaving the MXNet community
without any say in decisions made over there.

We as a community should discuss whether we want to keep this strong
dependency.

Anirudh wrote on Tue, Feb 27, 2018, 00:51:

> The code is going to go in the dmlc repository. What is wrong with
> referencing the dmlc repository issue ?


Re: UTF-8 Support for TextParser

2018-02-26 Thread Tianqi Chen
Since the LibSVM format is only going to involve numbers and possibly ASCII
characters, is there any reason to add UTF-8 support? Note that
generalization always comes at a cost in efficiency, and some effort has
been spent on making the parser fast.

Tianqi

On Mon, Feb 26, 2018 at 3:38 PM, Anirudh wrote:

> Hi all,
>
> Currently there is no UTF-8 support in the LibSVM, LibFM or CSV text parsers.
> I am working on adding UTF-8 support to the text parsers. Since C++
> doesn't have great built-in support for UTF-8, I am looking at
> third-party libraries which provide Unicode support. I am considering ICU
> currently. Any comments, suggestions, past experience, or gotchas about
> Unicode third-party libraries or adding Unicode support in general are
> highly appreciated.
>
> I have created an issue about the same:
> https://github.com/dmlc/dmlc-core/issues/372
> Please feel free to reply to this email or comment on the github issue if
> you have any inputs.
>
> Anirudh
>


Re: UTF-8 Support for TextParser

2018-02-26 Thread Anirudh
The code is going to go in the dmlc repository. What is wrong with
referencing the dmlc repository issue?

On Mon, Feb 26, 2018 at 3:48 PM, Marco de Abreu <
marco.g.ab...@googlemail.com> wrote:

> That's not what I mean. Please create a proper issue and don't just
> reference the DMLC repository.


Re: UTF-8 Support for TextParser

2018-02-26 Thread Marco de Abreu
That's not what I mean. Please create a proper issue and don't just
reference the DMLC repository.

Anirudh wrote on Tue, Feb 27, 2018, 00:46:

> Sure! Here is the link to the issue in MXNet repo:
> https://github.com/apache/incubator-mxnet/issues/9891


Re: UTF-8 Support for TextParser

2018-02-26 Thread Anirudh
Sure! Here is the link to the issue in the MXNet repo:
https://github.com/apache/incubator-mxnet/issues/9891

On Mon, Feb 26, 2018 at 3:41 PM, Marco de Abreu <
marco.g.ab...@googlemail.com> wrote:

> Hello,
>
> since DMLC is not affiliated with Apache, please create a GitHub issue on
> our repository and link the issue here in order to provide a base for
> discussions.
>
> -Marco


Re: UTF-8 Support for TextParser

2018-02-26 Thread Marco de Abreu
Hello,

Since DMLC is not affiliated with Apache, please create a GitHub issue on
our repository and link the issue here in order to provide a base for
discussions.

-Marco

Anirudh wrote on Tue, Feb 27, 2018, 00:38:

> Hi all,
>
> Currently there is no UTF-8 support in the LibSVM, LibFM or CSV text parsers.
> I am working on adding UTF-8 support to the text parsers. Since C++
> doesn't have great built-in support for UTF-8, I am looking at
> third-party libraries which provide Unicode support. I am considering ICU
> currently. Any comments, suggestions, past experience, or gotchas about
> Unicode third-party libraries or adding Unicode support in general are
> highly appreciated.
>
> I have created an issue about the same:
> https://github.com/dmlc/dmlc-core/issues/372
> Please feel free to reply to this email or comment on the github issue if
> you have any inputs.
>
> Anirudh
>