[ https://issues.apache.org/jira/browse/CSV-196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16156165#comment-16156165 ]

Matt Sun edited comment on CSV-196 at 9/6/17 10:58 PM:
-------------------------------------------------------

I'm reopening this issue because I found that getCharacterPosition doesn't serve the purpose when characters take up multiple bytes. I will submit a pull request on GitHub to suggest a fix.


was (Author: mattsun):
I'm reopening this issue because I found that getCharacterPosition doesn't serve the position when the characters are multiple bytes. I will submit a pull request on Github to suggest a fix.

> Store the information of raw data read by lexer
> -----------------------------------------------
>
>                 Key: CSV-196
>                 URL: https://issues.apache.org/jira/browse/CSV-196
>             Project: Commons CSV
>          Issue Type: Improvement
>          Components: Parser
>    Affects Versions: 1.4
>            Reporter: Matt Sun
>              Labels: patch
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> It would be good for the CSVParser class to store whether a field was
> enclosed by quotes in the original source file.
> For example, for this data sample:
> A, B, C
> a1, "b1", c1
> CSVParser gives us the record a1, b1, c1. That is helpful because it has
> parsed the double quotes, but at the same time we lose information about
> the original data: we can't tell from the returned CSVRecord whether the
> original field was enclosed by double quotes or not.
> In our use case, we are integrating Apache Hadoop APIs with Commons CSV. CSV
> is one kind of input to Hadoop jobs, which should support splitting input
> data. To accurately split a CSV file into pieces, we need to count the bytes
> of data CSVParser actually read. CSVParser doesn't have accurate information
> about whether a field was enclosed by quotes, nor does it store the raw data
> of the original source. Downstream users of the Commons CSV CSVParser are
> not able to get that information.
> To suggest a fix: extend the token/CSVRecord to have a boolean field
> indicating whether the column was enclosed by quotes.
> While the Lexer is doing getNextToken, set the flag if a field is
> encapsulated and successfully parsed.
> I found another issue reporting a similar request, but it was marked as
> resolved: [CSV-91]
> https://issues.apache.org/jira/browse/CSV-91?jql=project%20%3D%20CSV%20AND%20text%20~%20%22with%20quotes%22
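
To illustrate the gap between a character position and a byte position described in the comment above, here is a minimal sketch using only the JDK. It does not call Commons CSV. The class and method names (CharVsBytePosition, charLength, utf8ByteLength) are hypothetical. It assumes the input is UTF-8 and that a character-based position advances once per Java char.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of Commons CSV.
class CharVsBytePosition {

    // Number of Java chars (UTF-16 code units) in the text.
    // Assumption: a character-based position advances by this amount.
    public static long charLength(String text) {
        return text.length();
    }

    // Number of bytes the same text occupies when encoded as UTF-8.
    // A byte-accurate split of the input file needs this count.
    public static long utf8ByteLength(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // ASCII-only record: both counts agree.
        String ascii = "a1,\"b1\",c1\n";
        // Same shape with one 2-byte and two 3-byte characters
        // (written as Unicode escapes to avoid source-encoding issues).
        String multibyte = "\u00e91,\"\u65e5\u672c\",c1\n";

        System.out.println(charLength(ascii) + " chars, "
                + utf8ByteLength(ascii) + " bytes");      // 11 chars, 11 bytes
        System.out.println(charLength(multibyte) + " chars, "
                + utf8ByteLength(multibyte) + " bytes");  // 11 chars, 16 bytes
    }
}
```

Both records have the same character count, but the second one is five bytes longer on disk. That is why a split boundary computed from character positions can drift away from the real byte offset once multi-byte characters appear.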