Re: UTF-8 Support for TextParser

Anirudh Fri, 09 Mar 2018 13:59:05 -0800

I won't run the entire text through a converter; the parser will just skip the
BOM character during the parsing stage.
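
A minimal sketch of what skipping the BOM could look like, assuming the parser
sees the start of the file as a [begin, end) byte range. The helper name
SkipUtf8Bom and its signature are hypothetical and are not taken from
dmlc-core; the only fact relied on is the UTF-8 BOM byte sequence
0xEF,0xBB,0xBF mentioned below.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper (not dmlc-core API): if the buffer [begin, end) starts
// with the UTF-8 BOM bytes 0xEF 0xBB 0xBF, return a pointer just past them;
// otherwise return begin unchanged.
inline const char* SkipUtf8Bom(const char* begin, const char* end) {
  static const unsigned char kBom[3] = {0xEF, 0xBB, 0xBF};
  if (end - begin >= 3 && std::memcmp(begin, kBom, 3) == 0) {
    return begin + 3;
  }
  return begin;
}
```

Under this assumption the check would only need to run once, on the first
chunk of a file, so the per-line parsing path stays untouched.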


Anirudh

On Fri, Mar 9, 2018 at 1:12 PM, Chris Olivier <cjolivie...@gmail.com> wrote:

> For this, are you going to run the entire text through a converter, or just
> prepend the UTF-8 header to the file (0xEF,0xBB,0xBF)?
>
> On Fri, Mar 9, 2018 at 12:43 PM, Anirudh <anirudh2...@gmail.com> wrote:
>
> > Hi,
> >
> > On closer examination of the customer requirement, we found that the
> > customer uses only ASCII data with MXNet. They just want files that
> > contain a UTF-8 BOM at the start, and files that use different control
> > characters for newlines, to work. dmlc-core already supports the
> > different newline control characters.
> > Since a UTF-8 BOM in files is a common case for other MXNet users too
> > (for example, saving an Excel sheet as a UTF-8 CSV), I will add support
> > for handling the UTF-8 BOM in dmlc-core.
> > I won't be working on UTF8CSVParser unless a customer requirement comes
> > up later on.
> >
> > Anirudh
> >
> >
> >
> > On Wed, Feb 28, 2018 at 11:50 PM, Anirudh <anirudh2...@gmail.com> wrote:
> >
> > > Hi Tianqi,
> > >
> > > What do you think about adding a separate CSV parser with UTF-8 support
> > > in dmlc-core? We can then just add a flag in MXNet for UTF-8 and use the
> > > UTF-8 or the ASCII parser based on this flag. (This idea was suggested
> > > by Mu.)
> > >
> > > I think some small changes will be required in the base class
> > > "TextParserBase", as the method "BackFindEndLine" will need more logic
> > > to check for other code points that act as line breaks; this can be
> > > refactored.
> > > This approach will likely retain the performance of the existing ASCII
> > > CSV parser, while letting MXNet users choose between the usability of
> > > the UTF-8 CSV parser and the performance of the ASCII CSV parser.
> > >
> > > Thanks,
> > > Anirudh
> > >
> > >
> > > On Mon, Feb 26, 2018 at 5:18 PM, Anirudh <anirudh2...@gmail.com> wrote:
> > >
> > >> Hi Marco,
> > >>
> > >> I understand that there needs to be a separate discussion on the strong
> > >> dependency between mxnet and dmlc-core and how to fix it.
> > >>
> > >> Having said that, I think the goals of dmlc-core and mxnet are somewhat
> > >> aligned. Posting to the MXNet dev list in this case is a good way to
> > >> gather feedback from both communities, since I consider the MXNet
> > >> community to be mostly a superset of the dmlc-core community.
> > >>
> > >> Anirudh
> > >>
> > >> On Mon, Feb 26, 2018 at 5:00 PM, Subramanian, Anirudh <ani...@amazon.com> wrote:
> > >>
> > >>> Hi Tianqi,
> > >>>
> > >>> UTF-8 support would make other formats like CSV more usable.
> > >>> Otherwise, users have to normalize their data in some way before
> > >>> using mxnet.
> > >>> I understand that there is a tradeoff here because of the efficiency
> > >>> gains from the parser, but the expectation of having to normalize
> > >>> their UTF-8 files may turn users away.
> > >>>
> > >>> Anirudh
> > >>>
> > >>> On 2/26/18, 3:54 PM, "workc...@gmail.com on behalf of Tianqi Chen" <
> > >>> workc...@gmail.com on behalf of tqc...@cs.washington.edu> wrote:
> > >>>
> > >>>     Since the LibSVM format is only going to involve numbers and possibly
> > >>>     ASCII characters, is there any reason to add UTF-8 support? Note that
> > >>>     generalization always comes at the cost of efficiency, and some
> > >>>     effort has been spent on making the parser fast.
> > >>>
> > >>>     Tianqi
> > >>>
> > >>>     On Mon, Feb 26, 2018 at 3:38 PM, Anirudh <anirudh2...@gmail.com> wrote:
> > >>>
> > >>>     > Hi all,
> > >>>     >
> > >>>     > Currently there is no UTF-8 support in the LibSVM, LibFM or CSV
> > >>>     > text parsers. I am working on adding UTF-8 support for the text
> > >>>     > parsers. Since C++ doesn't have great built-in support for UTF-8,
> > >>>     > I am looking at third-party libraries that provide Unicode
> > >>>     > support. I am currently considering ICU. Any comments,
> > >>>     > suggestions, past experience, or gotchas about third-party Unicode
> > >>>     > libraries, or about adding Unicode support in general, are highly
> > >>>     > appreciated.
> > >>>     >
> > >>>     > I have created an issue for this:
> > >>>     > https://github.com/dmlc/dmlc-core/issues/372
> > >>>     > Please feel free to reply to this email or comment on the GitHub
> > >>>     > issue if you have any input.
> > >>>     >
> > >>>     > Anirudh
> > >>>     >
> > >>>
> > >>>
> > >>>
> > >>
> > >
> >
>
