Re: UTF-8 Support for TextParser

Anirudh Wed, 28 Feb 2018 23:50:33 -0800

Hi Tianqi,

What do you think about adding a separate CSV parser with UTF-8 support
to dmlc-core? We could then add a UTF-8 flag in MXNet and select either the
UTF-8 or the ASCII parser based on that flag. (Mu suggested this idea.)
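
As a rough illustration of the flag-based selection, here is a minimal
sketch. All names in it ("CsvParser", "AsciiCsvParser", "Utf8CsvParser",
"CreateCsvParser") are hypothetical stand-ins and not existing dmlc-core or
MXNet APIs; the bool argument plays the role of the proposed flag.

```cpp
#include <memory>
#include <string>

// Hypothetical interface standing in for a CSV parser; not the dmlc-core API.
struct CsvParser {
  virtual ~CsvParser() = default;
  virtual std::string Name() const = 0;
};

// Stand-in for the existing fast ASCII-only parser.
struct AsciiCsvParser : CsvParser {
  std::string Name() const override { return "ascii"; }
};

// Stand-in for the proposed UTF-8 aware parser.
struct Utf8CsvParser : CsvParser {
  std::string Name() const override { return "utf8"; }
};

// Hypothetical factory: `utf8` represents the flag that MXNet would pass down.
inline std::unique_ptr<CsvParser> CreateCsvParser(bool utf8) {
  if (utf8) return std::unique_ptr<CsvParser>(new Utf8CsvParser());
  return std::unique_ptr<CsvParser>(new AsciiCsvParser());
}
```

With the flag unset, callers would keep getting the ASCII parser, so the
existing code path stays unchanged.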


I think this will require some small changes to the base class
"TextParserBase": the method "BackFindEndLine" will need more logic to
check for the other code points that act as line breaks, and it can be
refactored to accommodate that. This approach will likely retain the
performance of the existing ASCII CSV parser, while letting MXNet users
choose between the usability of the UTF-8 CSV parser and the performance of
the ASCII CSV parser.
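
To make the "BackFindEndLine" point concrete, here is a minimal sketch of a
backward scan that also recognizes multi-byte UTF-8 line breaks. The names
"LineBreakLenUtf8" and "BackFindEndLineUtf8", the extra "end" argument, and
the choice of NEL (U+0085), LINE SEPARATOR (U+2028) and PARAGRAPH SEPARATOR
(U+2029) as the additional line breaks are assumptions for illustration, not
the existing dmlc-core code.

```cpp
#include <cstddef>

// Hypothetical helper: returns the byte length of a line-break sequence that
// starts at p, or 0 if none starts there. Never reads at or past `end`.
inline std::size_t LineBreakLenUtf8(const char* p, const char* end) {
  if (p >= end) return 0;
  const unsigned char c0 = static_cast<unsigned char>(p[0]);
  if (c0 == '\n' || c0 == '\r') return 1;
  // U+0085 (NEL) is encoded as C2 85.
  if (c0 == 0xC2 && end - p >= 2 &&
      static_cast<unsigned char>(p[1]) == 0x85) {
    return 2;
  }
  // U+2028 and U+2029 are encoded as E2 80 A8 and E2 80 A9.
  if (c0 == 0xE2 && end - p >= 3 &&
      static_cast<unsigned char>(p[1]) == 0x80) {
    const unsigned char c2 = static_cast<unsigned char>(p[2]);
    if (c2 == 0xA8 || c2 == 0xA9) return 3;
  }
  return 0;
}

// Hypothetical UTF-8 aware backward scan. Walks back from bptr towards begin
// and returns a pointer to the first byte of the nearest line break, or
// begin if no line break is found.
inline const char* BackFindEndLineUtf8(const char* bptr, const char* begin,
                                       const char* end) {
  for (; bptr != begin; --bptr) {
    if (LineBreakLenUtf8(bptr, end) != 0) return bptr;
  }
  return begin;
}
```

Byte-wise matching is safe here because 0xC2 and 0xE2 are always lead bytes
in valid UTF-8, so these sequences cannot start in the middle of another
character. The extra comparisons only run in the UTF-8 parser, which is why
the ASCII path should keep its current speed.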

Thanks,
Anirudh


On Mon, Feb 26, 2018 at 5:18 PM, Anirudh <anirudh2...@gmail.com> wrote:

> Hi Marco,
>
> I understand that there needs to be a separate discussion on the strong
> dependency between MXNet and dmlc-core and how to fix it.
>
> Having said that, I think the goals of dmlc-core and mxnet are somewhat
> aligned. Posting in the MXNet dev list for this case
> is a good way to gather feedback from both the communities since I
> consider the MXNet community to be mostly a superset of the dmlc-core
> community.
>
> Anirudh
>
> On Mon, Feb 26, 2018 at 5:00 PM, Subramanian, Anirudh <ani...@amazon.com>
> wrote:
>
>> Hi Tianqi,
>>
>> UTF-8 support would make other formats like CSV more usable.
>> Otherwise, users have to normalize their data in some way before
>> using MXNet.
>> I understand that there is a tradeoff here because of the efficiency
>> gains from the parser, but the expectation of having to normalize their
>> UTF-8 files may turn users away.
>>
>> Anirudh
>>
>> On 2/26/18, 3:54 PM, "workc...@gmail.com on behalf of Tianqi Chen" <
>> workc...@gmail.com on behalf of tqc...@cs.washington.edu> wrote:
>>
>>     Since the LibSVM format is only going to involve numbers and possibly
>>     ASCII characters, is there any reason to add UTF-8 support? Note that
>>     generalization always comes at a cost in efficiency, and some
>>     effort has been spent on making the parser fast.
>>
>>     Tianqi
>>
>>     On Mon, Feb 26, 2018 at 3:38 PM, Anirudh <anirudh2...@gmail.com> wrote:
>>
>>     > Hi all,
>>     >
>>     > Currently there is no UTF-8 support in the LibSVM, LibFM or CSV text
>>     > parsers. I am working on adding UTF-8 support to the text parsers.
>>     > Since C++ doesn't have great built-in support for UTF-8, I am looking
>>     > at third-party libraries which provide Unicode support. I am
>>     > currently considering ICU. Any comments, suggestions, past experience
>>     > or gotchas about third-party Unicode libraries, or about adding
>>     > Unicode support in general, are highly appreciated.
>>     >
>>     > I have created an issue about the same:
>>     > https://github.com/dmlc/dmlc-core/issues/372
>>     > Please feel free to reply to this email or comment on the GitHub
>>     > issue if you have any input.
>>     >
>>     > Anirudh
>>     >
>>
>>
>>
>
