Hi Tianqi,

What do you think about adding a separate CSV parser with UTF-8 support in dmlc-core? We can then add a UTF-8 flag in MXNet and select the UTF-8 or the ASCII parser based on that flag. (This idea was suggested by Mu.)
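To make the idea concrete, here is a rough sketch of how the backward line-end search could differ between the two modes. The names LineBreakLen and FindLastLineBreak and the bool flag are hypothetical and only for illustration; this is not the current dmlc-core code. It assumes the input is valid UTF-8 and treats only U+0085, U+2028 and U+2029 as extra line breaks.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: byte length of a line break starting at p, or 0 if
// there is none. ASCII mode knows only '\n' and '\r'. UTF-8 mode also accepts
// U+0085 (C2 85), U+2028 (E2 80 A8) and U+2029 (E2 80 A9).
inline std::size_t LineBreakLen(const char* p, const char* end, bool utf8) {
  const unsigned char c0 = static_cast<unsigned char>(p[0]);
  if (c0 == '\n' || c0 == '\r') return 1;
  if (!utf8) return 0;
  const std::ptrdiff_t left = end - p;  // bytes available from p
  if (c0 == 0xC2 && left >= 2 &&
      static_cast<unsigned char>(p[1]) == 0x85) {
    return 2;
  }
  if (c0 == 0xE2 && left >= 3 &&
      static_cast<unsigned char>(p[1]) == 0x80) {
    const unsigned char c2 = static_cast<unsigned char>(p[2]);
    if (c2 == 0xA8 || c2 == 0xA9) return 3;
  }
  return 0;
}

// Hypothetical search: scan backward over [begin, end) and return a pointer
// to the first byte of the last line break, or begin if none is found.
inline const char* FindLastLineBreak(const char* begin, const char* end,
                                     bool utf8) {
  assert(begin <= end);
  for (const char* p = end; p != begin;) {
    --p;
    if (LineBreakLen(p, end, utf8) != 0) return p;
  }
  return begin;
}
```

With utf8 set to false, the loop does a single byte comparison per step, so the ASCII path stays cheap. Because UTF-8 lead bytes never occur as continuation bytes, matching on the lead byte 0xC2 or 0xE2 cannot hit the middle of another character in valid input.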
I think some small changes will be required to the base class "TextParserBase", since the method "BackFindEndLine" will need more logic to check for the other code points that act as line breaks; that logic can be refactored. This approach will likely retain the performance of the existing ASCII CSV parser, while letting MXNet users choose between the usability of the UTF-8 CSV parser and the performance of the ASCII CSV parser.

Thanks,
Anirudh

On Mon, Feb 26, 2018 at 5:18 PM, Anirudh <anirudh2...@gmail.com> wrote:

> Hi Marco,
>
> I understand that there needs to be a different discussion on the strong
> dependency between mxnet and dmlc-core and how to fix it.
>
> Having said that, I think the goals of dmlc-core and mxnet are somewhat
> aligned. Posting in the MXNet dev list for this case is a good way to
> gather feedback from both the communities since I consider the MXNet
> community to be mostly a superset of the dmlc-core community.
>
> Anirudh
>
> On Mon, Feb 26, 2018 at 5:00 PM, Subramanian, Anirudh <ani...@amazon.com>
> wrote:
>
>> Hi Tianqi,
>>
>> UTF-8 support would make other formats like CSV more usable.
>> Otherwise, users have to handle normalizing their data in some way
>> before using mxnet.
>> I understand that there is a tradeoff here because of the efficiency
>> gains from the parser, but the expectation of having to normalize their
>> UTF-8 files may turn users away.
>>
>> Anirudh
>>
>> On 2/26/18, 3:54 PM, "workc...@gmail.com on behalf of Tianqi Chen" <
>> workc...@gmail.com on behalf of tqc...@cs.washington.edu> wrote:
>>
>> Since LibSVM format is only going to involve numbers and possibly
>> ascii characters, is there any reason to add UTF-8 support? Note that
>> generalization always comes with a cost in efficiency, and there is
>> some effort spent on making the parser fast.
>>
>> Tianqi
>>
>> On Mon, Feb 26, 2018 at 3:38 PM, Anirudh <anirudh2...@gmail.com>
>> wrote:
>>
>> > Hi all,
>> >
>> > Currently there is no UTF-8 support for the LibSVM, LibFM or CSV text
>> > parsers. I am currently working on adding UTF-8 support for text
>> > parsers. Since C++ doesn't have great built-in support for UTF-8, I
>> > am looking at third-party libraries which provide Unicode support. I
>> > am considering ICU currently. Any comments, suggestions, past
>> > experience, or gotchas about Unicode third-party libraries or adding
>> > Unicode support in general are highly appreciated.
>> >
>> > I have created an issue about the same:
>> > https://github.com/dmlc/dmlc-core/issues/372
>> > Please feel free to reply to this email or comment on the github
>> > issue if you have any inputs.
>> >
>> > Anirudh