Re: UTF-8 Support for TextParser
Won't run the entire text through a converter; I will just ignore the BOM character during the parsing stage.

Anirudh

On Fri, Mar 9, 2018 at 1:12 PM, Chris Olivier wrote:
> For this, are you going to run the entire text through a converter, or just
> prepend the UTF-8 header to the file (0xEF,0xBB,0xBF)?
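A minimal sketch of what "ignore the BOM during parsing" could look like, assuming the parser sees the start of the file as a byte range. The helper name `SkipUtf8Bom` is hypothetical and is not taken from dmlc-core; the only fact it relies on is the UTF-8 BOM byte sequence 0xEF,0xBB,0xBF mentioned in the thread.

```cpp
#include <cassert>

// Hypothetical helper (not a dmlc-core API): returns a pointer just past a
// leading UTF-8 BOM (0xEF,0xBB,0xBF) if one is present, otherwise returns
// `begin` unchanged. Only the first three bytes of the range are inspected.
inline const char* SkipUtf8Bom(const char* begin, const char* end) {
  assert(begin <= end);
  if (end - begin >= 3 &&
      static_cast<unsigned char>(begin[0]) == 0xEF &&
      static_cast<unsigned char>(begin[1]) == 0xBB &&
      static_cast<unsigned char>(begin[2]) == 0xBF) {
    return begin + 3;
  }
  return begin;
}
```

Because the check touches at most three bytes once per file, it should not affect the per-line parsing cost that the ASCII parsers are tuned for.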
Re: UTF-8 Support for TextParser
For this, are you going to run the entire text through a converter, or just prepend the UTF-8 header to the file (0xEF,0xBB,0xBF)?

On Fri, Mar 9, 2018 at 12:43 PM, Anirudh wrote:
> Hi,
>
> On a deeper look at the customer requirement, we found that the customer
> uses only ASCII data with MXNet; they just want files that contain a UTF-8
> BOM at the start, and files with different control characters for
> newlines, to play well. dmlc-core already supports the different control
> characters for newlines.
> Since a UTF-8 BOM in files is a common use case for other MXNet users too
> (for example, saving an Excel sheet as UTF-8 CSV), I will add support for
> handling the UTF-8 BOM in dmlc-core.
> I won't be working on UTF8CSVParser unless a customer requirement comes up
> later on.
>
> Anirudh
Re: UTF-8 Support for TextParser
Hi,

On a deeper look at the customer requirement, we found that the customer uses only ASCII data with MXNet; they just want files that contain a UTF-8 BOM at the start, and files with different control characters for newlines, to play well. dmlc-core already supports the different control characters for newlines.

Since a UTF-8 BOM in files is a common use case for other MXNet users too (for example, saving an Excel sheet as UTF-8 CSV), I will add support for handling the UTF-8 BOM in dmlc-core.

I won't be working on UTF8CSVParser unless a customer requirement comes up later on.

Anirudh

On Wed, Feb 28, 2018 at 11:50 PM, Anirudh wrote:
> Hi Tianqi,
>
> What do you think about adding a separate parser for CSV with UTF8 support
> in dmlc-core? We can then just add a flag in MXNet for UTF8 and use the
> UTF8 or the ASCII parser based on this flag. (This idea was suggested by
> Mu).
>
> I think there will be some small changes required to the base class
> "TextParserBase", as the method "BackFindEndLine" will have more logic in
> it to check for other code points for line breaks, which can be refactored.
> This approach will likely retain the performance of the existing ASCII CSV
> parser, while allowing MXNet users to choose between usability with the
> UTF-8 CSV parser and performance with the ASCII CSV parser.
>
> Thanks,
> Anirudh
>
> On Mon, Feb 26, 2018 at 5:18 PM, Anirudh wrote:
>> Hi Marco,
>>
>> I understand that there needs to be a separate discussion on the strong
>> dependency of MXNet on dmlc-core and how to fix it.
>>
>> Having said that, I think the goals of dmlc-core and MXNet are somewhat
>> aligned. Posting to the MXNet dev list in this case is a good way to
>> gather feedback from both communities, since I consider the MXNet
>> community to be mostly a superset of the dmlc-core community.
>>
>> Anirudh
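The extra logic that the Feb 28 proposal anticipates for "BackFindEndLine" can be sketched roughly as follows. This is an illustration under stated assumptions, not dmlc-core code: the names `LineBreakLength` and `BackFindEndLineUtf8` are hypothetical, the signature is invented for the example, and the set of line breaks recognized beyond '\n' and '\r' (U+0085, U+2028, U+2029) is one possible choice that the thread does not specify.

```cpp
#include <cassert>

// Hypothetical helper: length in bytes of a line break starting at `p`, or 0
// if there is none. Requires p < end. Recognizes '\n', '\r', U+0085 (C2 85),
// U+2028 (E2 80 A8), and U+2029 (E2 80 A9).
inline int LineBreakLength(const char* p, const char* end) {
  const unsigned char c0 = static_cast<unsigned char>(p[0]);
  if (c0 == '\n' || c0 == '\r') return 1;
  if (c0 == 0xC2 && end - p >= 2 &&
      static_cast<unsigned char>(p[1]) == 0x85) {
    return 2;
  }
  if (c0 == 0xE2 && end - p >= 3 &&
      static_cast<unsigned char>(p[1]) == 0x80) {
    const unsigned char c2 = static_cast<unsigned char>(p[2]);
    if (c2 == 0xA8 || c2 == 0xA9) return 3;
  }
  return 0;
}

// Hypothetical UTF-8-aware backward scan: walks from `last` (inclusive) down
// toward `begin` and returns the first byte of the nearest line break, or
// `begin` if none is found. In valid UTF-8, 0xC2 and 0xE2 occur only as lead
// bytes, so matching on them cannot land in the middle of another character.
inline const char* BackFindEndLineUtf8(const char* last, const char* begin,
                                       const char* end) {
  assert(begin <= last && last < end);
  for (const char* p = last; p != begin; --p) {
    if (LineBreakLength(p, end) > 0) return p;
  }
  return begin;
}
```

The multi-byte comparisons run only when a byte equals 0xC2 or 0xE2, so pure-ASCII input takes the same single-byte checks as a plain '\n' / '\r' scan; whether that is cheap enough to avoid a separate parser is the tradeoff discussed in the thread.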
Re: UTF-8 Support for TextParser
Hi Marco,

I understand that there needs to be a separate discussion on the strong dependency of MXNet on dmlc-core and how to fix it.

Having said that, I think the goals of dmlc-core and MXNet are somewhat aligned. Posting to the MXNet dev list in this case is a good way to gather feedback from both communities, since I consider the MXNet community to be mostly a superset of the dmlc-core community.

Anirudh

On Mon, Feb 26, 2018 at 5:00 PM, Subramanian, Anirudh <ani...@amazon.com> wrote:
> Hi Tianqi,
>
> UTF-8 support would make other formats like CSV more usable. Otherwise,
> users have to normalize their data in some way before using mxnet.
> I understand that there is a tradeoff here because of the efficiency gains
> from the parser, but the expectation of having to normalize their UTF-8
> files may turn users away.
>
> Anirudh
>
> On 2/26/18, 3:54 PM, "workc...@gmail.com on behalf of Tianqi Chen" <
> workc...@gmail.com on behalf of tqc...@cs.washington.edu> wrote:
>
>> Since LibSVM format is only going to involve numbers and possibly ASCII
>> characters, is there any reason to add UTF-8 support? Note that
>> generalization always comes at a cost in efficiency, and some effort has
>> been spent on making the parser fast.
>>
>> Tianqi
Re: UTF-8 Support for TextParser
The problem is that the DMLC organization and dmlc-core are not part of the Apache Software Foundation. If that change is specifically for dmlc-core, it has to be discussed in that community. This email list is for MXNet under the Apache Incubator.

Apparently, there is a very strong dependency of MXNet on the dmlc-core package, which is not managed by this community. Setting aside risks such as code over there not being properly validated by our CI (Chris created a thread about this just recently), this is not the way an Apache project should work. MXNet is currently under the Apache Software Foundation while the actual core is managed by the DMLC organization, leaving the MXNet community without any say in decisions made over there. We as a community should discuss whether we want to keep this strong dependency.

Anirudh wrote on Tue, Feb 27, 2018, 00:51:
> The code is going to go in the dmlc repository. What is wrong with
> referencing the dmlc repository issue?
Re: UTF-8 Support for TextParser
Since LibSVM format is only going to involve numbers and possibly ASCII characters, is there any reason to add UTF-8 support? Note that generalization always comes at a cost in efficiency, and some effort has been spent on making the parser fast.

Tianqi

On Mon, Feb 26, 2018 at 3:38 PM, Anirudh wrote:
> Hi all,
>
> Currently there is no UTF-8 support for the LibSVM, LibFM or CSV text
> parsers. I am currently working on adding UTF-8 support for text parsers.
> Since C++ doesn't have great built-in support for UTF-8, I am looking at
> third-party libraries which provide Unicode support. I am considering ICU
> currently. Any comments, suggestions, past experience, or gotchas about
> Unicode third-party libraries or adding Unicode support in general are
> highly appreciated.
>
> I have created an issue about the same:
> https://github.com/dmlc/dmlc-core/issues/372
> Please feel free to reply to this email or comment on the GitHub issue if
> you have any input.
>
> Anirudh
Re: UTF-8 Support for TextParser
The code is going to go in the dmlc repository. What is wrong with referencing the dmlc repository issue?

On Mon, Feb 26, 2018 at 3:48 PM, Marco de Abreu <marco.g.ab...@googlemail.com> wrote:
> That's not what I mean. Please create a proper issue and don't just
> reference the DMLC repository.
Re: UTF-8 Support for TextParser
That's not what I mean. Please create a proper issue and don't just reference the DMLC repository.

Anirudh wrote on Tue, Feb 27, 2018, 00:46:
> Sure! Here is the link to the issue in MXNet repo:
> https://github.com/apache/incubator-mxnet/issues/9891
Re: UTF-8 Support for TextParser
Sure! Here is the link to the issue in MXNet repo:
https://github.com/apache/incubator-mxnet/issues/9891

On Mon, Feb 26, 2018 at 3:41 PM, Marco de Abreu <marco.g.ab...@googlemail.com> wrote:
> Hello,
>
> since DMLC is not affiliated with Apache, please create a GitHub issue on
> our repository and link the issue here in order to provide a base for
> discussions.
>
> -Marco
Re: UTF-8 Support for TextParser
Hello,

since DMLC is not affiliated with Apache, please create a GitHub issue on our repository and link the issue here in order to provide a base for discussions.

-Marco

Anirudh wrote on Tue, Feb 27, 2018, 00:38:
> Hi all,
>
> Currently there is no UTF-8 support for the LibSVM, LibFM or CSV text
> parsers. I am currently working on adding UTF-8 support for text parsers.
> Since C++ doesn't have great built-in support for UTF-8, I am looking at
> third-party libraries which provide Unicode support. I am considering ICU
> currently. Any comments, suggestions, past experience, or gotchas about
> Unicode third-party libraries or adding Unicode support in general are
> highly appreciated.
>
> I have created an issue about the same:
> https://github.com/dmlc/dmlc-core/issues/372
> Please feel free to reply to this email or comment on the GitHub issue if
> you have any input.
>
> Anirudh