I won't run the entire text through a converter; I will just ignore the BOM character during the parsing stage.
Anirudh

On Fri, Mar 9, 2018 at 1:12 PM, Chris Olivier <cjolivie...@gmail.com> wrote:

> For this, are you going to run the entire text through a converter, or just
> prepend the UTF-8 header to the file (0xEF,0xBB,0xBF)?
>
> On Fri, Mar 9, 2018 at 12:43 PM, Anirudh <anirudh2...@gmail.com> wrote:
>
> > Hi,
> >
> > Upon deeper understanding of customer requirement we found out that the
> > customer uses only ASCII data with MXNet, just that they want the files
> > containing UTF-8 BOM at the start and files with different control
> > characters for newline to play well. dmlc-core already supports control
> > characters for newline.
> > Since, the UTF-8 BOM in files is a common use case for other users of MXNet
> > too (for example, saving excel as UTF-8 csv) I will add support for
> > handling the UTF-8 BOM in dmlc-core.
> > I won't be working on UTF8CSVParser unless there is a customer requirement
> > that comes up later on.
> >
> > Anirudh
> >
> > On Wed, Feb 28, 2018 at 11:50 PM, Anirudh <anirudh2...@gmail.com> wrote:
> >
> > > Hi Tianqi,
> > >
> > > What do you think about adding a separate parser for CSV with UTF8 support
> > > in dmlc-core? We can then just add a flag in MXNet for UTF8 and use the
> > > UTF8 or the ASCII parser based on this flag. (This idea was suggested by
> > > Mu).
> > >
> > > I think there will be some small changes required to the base class
> > > "TextParserBase" as the method "BackFindEndLine" will have more logic in it
> > > to check for other code-points for line-breaks, which can be refactored.
> > > This approach will likely retain the performance of the existing ASCII CSV
> > > Parser, while allowing MXNet users to make the decision w.r.t usability
> > > with UTF-8 CSV parser / performance with ASCII CSV parser.
> > >
> > > Thanks,
> > > Anirudh
> > >
> > > On Mon, Feb 26, 2018 at 5:18 PM, Anirudh <anirudh2...@gmail.com> wrote:
> > >
> > >> Hi Marco,
> > >>
> > >> I understand that there needs to be a different discussion on strong
> > >> dependency of mxnet and dmlc-core and how to fix it.
> > >>
> > >> Having said that, I think the goals of dmlc-core and mxnet are somewhat
> > >> aligned. Posting in the MXNet dev list for this case is a good way to
> > >> gather feedback from both the communities since I consider the MXNet
> > >> community to be mostly a superset of the dmlc-core community.
> > >>
> > >> Anirudh
> > >>
> > >> On Mon, Feb 26, 2018 at 5:00 PM, Subramanian, Anirudh <ani...@amazon.com> wrote:
> > >>
> > >>> Hi Tianqi,
> > >>>
> > >>> The UTF-8 support would enable other formats like CSV more usable.
> > >>> Otherwise, they have to handle normalizing their data in some way before
> > >>> using mxnet.
> > >>> I understand that there is a tradeoff here because of the efficiency
> > >>> gains from the parser but the expectation of having to normalize their
> > >>> UTF-8 files may turn users away.
> > >>>
> > >>> Anirudh
> > >>>
> > >>> On 2/26/18, 3:54 PM, "workc...@gmail.com on behalf of Tianqi Chen" <workc...@gmail.com on behalf of tqc...@cs.washington.edu> wrote:
> > >>>
> > >>> Since LibSVM format is only going to involve numbers and possibly ascii
> > >>> characters, is there any reason adding UTF-8 support? Note that
> > >>> generalization always comes with cost of efficiency and there is some
> > >>> effort spent on making parser fast
> > >>>
> > >>> Tianqi
> > >>>
> > >>> On Mon, Feb 26, 2018 at 3:38 PM, Anirudh <anirudh2...@gmail.com> wrote:
> > >>>
> > >>> > Hi all,
> > >>> >
> > >>> > Currently there is no UTF-8 Support for LibSVM, LibFM or CSV Text parsers.
> > >>> > I am currently working on adding UTF-8 support for Text parsers. Since C++
> > >>> > doesn't have a great built-in support for UTF-8, I am looking at
> > >>> > third-party libraries which provide Unicode support. I am considering ICU
> > >>> > currently. Any comments, suggestions, past experience, gotchas about
> > >>> > unicode third party libraries or adding unicode support in general is
> > >>> > highly appreciated.
> > >>> >
> > >>> > I have created an issue about the same:
> > >>> > https://github.com/dmlc/dmlc-core/issues/372
> > >>> > Please feel free to reply to this email or comment on the github issue if
> > >>> > you have any inputs.
> > >>> >
> > >>> > Anirudh
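
The approach settled on at the top of this thread, ignoring a leading UTF-8 BOM (0xEF, 0xBB, 0xBF) during parsing rather than converting the whole text, might look roughly like the C++ sketch below. `SkipUtf8Bom` is a hypothetical helper name used only for illustration, not a dmlc-core function, and the sketch assumes the parser sees the start of the file as one contiguous byte range.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical helper (not part of dmlc-core): given the byte range
// [begin, end) at the very start of a text file, return a pointer to the
// first byte after a leading UTF-8 BOM, or `begin` if no BOM is present.
inline const char* SkipUtf8Bom(const char* begin, const char* end) {
  assert(begin <= end);
  // The UTF-8 encoding of U+FEFF, as quoted earlier in the thread.
  static const unsigned char kUtf8Bom[3] = {0xEF, 0xBB, 0xBF};
  // Only skip when all three bytes are available and match exactly.
  if (end - begin >= 3 && std::memcmp(begin, kUtf8Bom, 3) == 0) {
    return begin + 3;
  }
  return begin;
}
```

A parser would presumably call this once, on the first chunk of a file, and start tokenizing from the returned pointer. Later chunks are left untouched, so the existing ASCII fast path should not be affected.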