[jira] [Commented] (CSV-196) Store the information of raw data read by lexer
[ https://issues.apache.org/jira/browse/CSV-196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16172439#comment-16172439 ]

Matt Sun commented on CSV-196:
------------------------------

https://github.com/apache/commons-csv/pull/22

> Store the information of raw data read by lexer
> -----------------------------------------------
>
>                 Key: CSV-196
>                 URL: https://issues.apache.org/jira/browse/CSV-196
>             Project: Commons CSV
>          Issue Type: Improvement
>          Components: Parser
>    Affects Versions: 1.4
>            Reporter: Matt Sun
>              Labels: patch
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> It will be good to have CSVParser class to store the info of whether a field
> was enclosed by quotes in the original source file.
> For example, for this data sample:
> A, B, C
> a1, "b1", c1
> CSVParser gives us record a1, b1, c1, which is helpful because it parsed
> double quotes, but we also lost the information of original data at the same
> time. We can't tell from the CSVRecord returned whether the original data is
> enclosed by double quotes or not.
> In our use case, we are integrating Apache Hadoop APIs with Commons CSV. CSV
> is one kind of input of Hadoop Jobs, which should support splitting input
> data. To accurately split a CSV file into pieces, we need to count the bytes
> of data CSVParser actually read. CSVParser doesn't have accurate information
> of whether a field was enclosed by quotes, neither does it store raw data of
> the original source. Downstream users of commons CSVParser is not able to get
> those info.
> To suggest a fix: Extend the token/CSVRecord to have a boolean field
> indicating whether the column was enclosed by quotes. While Lexer is doing
> getNextToken, set the flag if a field is encapsulated and successfully parsed.
> I find another issue reported with similar request, but it was marked as
> resolved: [CSV91]
> https://issues.apache.org/jira/browse/CSV-91?jql=project%20%3D%20CSV%20AND%20text%20~%20%22with%20quotes%22

--
This message was sent by Atlassian JIRA (v6.4.14#64029)
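The fix suggested in the issue description (a boolean on the token, set while the lexer reads a field) could look roughly like the sketch below. This is only an illustration: `QuotedFlagSketch`, `Field`, and `lex` are hypothetical names, not the Commons CSV `Lexer` or `Token` API, and the sketch assumes a comma delimiter, a double-quote encapsulator, a doubled quote as the escape, and no handling of unterminated quotes.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch only; this is not the Commons CSV Lexer or Token API. */
final class QuotedFlagSketch {

    /** A parsed field plus the boolean flag proposed in the issue. */
    static final class Field {
        final String value;
        final boolean quoted;

        Field(String value, boolean quoted) {
            this.value = value;
            this.quoted = quoted;
        }
    }

    /** Lexes one CSV line (',' delimiter, '"' encapsulator, "" as an escaped quote). */
    static List<Field> lex(String line) {
        List<Field> fields = new ArrayList<>();
        StringBuilder content = new StringBuilder();
        boolean inQuotes = false;
        boolean quoted = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c != '"') {
                    content.append(c);
                } else if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                    content.append('"'); // doubled quote inside a quoted field
                    i++;
                } else {
                    inQuotes = false; // closing quote
                }
            } else if (c == '"' && !quoted && content.toString().trim().isEmpty()) {
                content.setLength(0); // drop leading spaces before the opening quote
                inQuotes = true;
                quoted = true; // remember that an encapsulator opened this field
            } else if (c == ',') {
                fields.add(finish(content, quoted));
                content.setLength(0);
                quoted = false;
            } else if (!quoted || !Character.isWhitespace(c)) {
                content.append(c); // spaces after a closing quote are skipped
            }
        }
        fields.add(finish(content, quoted));
        return fields;
    }

    private static Field finish(StringBuilder content, boolean quoted) {
        String text = content.toString();
        return new Field(quoted ? text : text.trim(), quoted);
    }

    public static void main(String[] args) {
        for (Field f : lex("a1, \"b1\", c1")) {
            System.out.println(f.value + " quoted=" + f.quoted);
        }
    }
}
```

For the sample row from the description, the values come back as a1, b1, c1 and only the middle field carries the flag, which is the one piece of information the parsed record otherwise loses.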
[jira] [Commented] (CSV-196) Store the information of raw data read by lexer
[ https://issues.apache.org/jira/browse/CSV-196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16156340#comment-16156340 ]

Matt Sun commented on CSV-196:
------------------------------

This is exactly what I'm proposing. We should track both, with the byte position being optional. That keeps the change completely backward compatible.
[jira] [Comment Edited] (CSV-196) Store the information of raw data read by lexer
[ https://issues.apache.org/jira/browse/CSV-196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16156165#comment-16156165 ]

Matt Sun edited comment on CSV-196 at 9/6/17 10:58 PM:
------------------------------------------------------

I'm reopening this issue because I found that getCharacterPosition doesn't serve the purpose when characters occupy multiple bytes. I will submit a pull request on GitHub to suggest a fix.

was (Author: mattsun):
I'm reopening this issue because I found that getCharacterPosition doesn't serve the position when the characters are multiple bytes. I will submit a pull request on Github to suggest a fix.
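The gap between a character position and a byte position, which is the reason given for reopening, can be seen with plain JDK calls. The sketch below is an illustration under the assumption that the input is UTF-8; `CharVsBytePosition` and `utf8ByteOffset` are hypothetical names and nothing here calls Commons CSV.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch: converts a char-based position into a UTF-8 byte offset. */
final class CharVsBytePosition {

    /** UTF-8 byte offset corresponding to a position counted in Java chars. */
    static long utf8ByteOffset(String text, int charPosition) {
        // Assumes charPosition does not fall between the two halves of a surrogate pair.
        return text.substring(0, charPosition).getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String ascii = "ab,c\n";
        String accented = "\u00e9b,c\n"; // U+00E9 is one char but two bytes in UTF-8
        System.out.println(utf8ByteOffset(ascii, ascii.length()));       // prints 5
        System.out.println(utf8ByteOffset(accented, accented.length())); // prints 6
    }
}
```

Both records are five characters long, but the second occupies six bytes, so a split boundary computed from character counts drifts by one byte for every two-byte character that precedes it.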
[jira] [Reopened] (CSV-196) Store the information of raw data read by lexer
[ https://issues.apache.org/jira/browse/CSV-196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matt Sun reopened CSV-196:
--------------------------

I'm reopening this issue because I found that getCharacterPosition doesn't serve the purpose when characters occupy multiple bytes. I will submit a pull request on GitHub to suggest a fix.
[jira] [Closed] (CSV-196) Store the information of raw data read by lexer
[ https://issues.apache.org/jira/browse/CSV-196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matt Sun closed CSV-196.
------------------------
       Resolution: Not A Bug
    Fix Version/s:     (was: Patch Needed)
[jira] [Commented] (CSV-196) Store the information of raw data read by lexer
[ https://issues.apache.org/jira/browse/CSV-196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15962321#comment-15962321 ]

Matt Sun commented on CSV-196:
------------------------------

I found that the offset in CSVRecord meets our need to some extent, so there is no need to store additional raw data for now. I'm going to close this issue unless anyone has a different opinion.
[jira] [Commented] (CSV-196) Store the information of raw data read by lexer
[ https://issues.apache.org/jira/browse/CSV-196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15866217#comment-15866217 ]

Matt Sun commented on CSV-196:
------------------------------

[~britter] I want to clarify that I never asked to have CSVParser store the byte information. I was saying that if the CSV Token could store some information about the raw data, downstream users could use that information to do some computation, FOR EXAMPLE counting bytes.

Regarding performance, I don't think storing raw data will incur *significant* cost. Timing-wise, the Lexer is already reading the input file char by char, so storing the characters read will not increase the time complexity; it stays in the same order. You may argue that appending to a StringBuffer is a cost, and I agree. However, I wouldn't say it's *significant*. Memory-wise, given that a CSV token is fairly small, I also don't think it will add much memory pressure. But your suggestion of "opt-in" sounds fine and reasonable. Another suggestion I have is to store only the number of characters read by the Lexer in the Token. That saves a little time and memory.

[~b.eckenfels] Do you mean the offset from the beginning of the file? In the splitting case, it is more useful to store the offset of the *END* of each record. While Hadoop is processing a split, it wants to make sure it doesn't go across the split boundary. After reading a CSV record, the program can figure out its current position by retrieving the offset of the *END* of that record. If only the beginning offset is given and Hadoop knows the beginning is within the boundary, the end of the record may still go beyond the boundary. I'm not sure how *easy* this is to do today. Could you briefly point out how to achieve it? What is the performance and memory impact?
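The end-of-record offsets asked for in the comment above can be illustrated with a small byte scanner. This is a sketch under stated assumptions, not how Commons CSV or Hadoop implement it: the data is UTF-8 (where the bytes for a double quote and a line feed never appear inside a multi-byte sequence), records end at a line feed outside quotes, and `RecordEndOffsets`, `endOffsets`, and `recordsStartingBefore` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: byte offsets of record ends, checked against a split boundary. */
final class RecordEndOffsets {

    /** Returns the byte offset just past the end of each record in UTF-8 encoded CSV data. */
    static List<Long> endOffsets(byte[] utf8) {
        List<Long> ends = new ArrayList<>();
        boolean inQuotes = false;
        long lastEnd = 0;
        for (int i = 0; i < utf8.length; i++) {
            byte b = utf8[i];
            if (b == '"') {
                inQuotes = !inQuotes; // a doubled "" toggles twice, leaving the state unchanged
            } else if (b == '\n' && !inQuotes) {
                lastEnd = i + 1L;
                ends.add(lastEnd);
            }
        }
        if (lastEnd < utf8.length) {
            ends.add((long) utf8.length); // final record without a trailing line break
        }
        return ends;
    }

    /** Counts records that start before splitEnd; the last of them may end past the boundary. */
    static int recordsStartingBefore(List<Long> ends, long splitEnd) {
        int count = 0;
        long start = 0;
        for (long end : ends) {
            if (start >= splitEnd) {
                break;
            }
            count++;
            start = end; // the next record starts where this one ended
        }
        return count;
    }
}
```

With only start offsets, a reader cannot tell that the second record of `a,b\n"x\ny",z\nlast` runs from byte 4 to byte 12; with end offsets it can compare 12 against the split boundary immediately after reading the record.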
[jira] [Updated] (CSV-196) Store the information of raw data read by lexer
[ https://issues.apache.org/jira/browse/CSV-196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matt Sun updated CSV-196:
-------------------------
    Labels: patch  (was: easyfix features patch)
[jira] [Commented] (CSV-196) Store the information of raw data read by lexer
[ https://issues.apache.org/jira/browse/CSV-196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15864138#comment-15864138 ]

Matt Sun commented on CSV-196:
------------------------------

I just changed the title because I realized that the problem is more complicated than I previously thought. Delimiter information might not be the only information missing for downstream users.

The scenario is the same as described before: the Commons CSV library is used together with the Hadoop library to provide a CSV input format. To support splitting the input of Hadoop jobs, the program needs to know how much of the input file has been read, so that it can stay within the split boundary. To leverage the capabilities of CSVParser, we usually want to "ignoreSurroundingSpace", "trim", and handle the "encapsulator". Thus, for a CSV record like

A, "B" , C

the parser gives us back A, B and C. To the Java program it looks as if only three characters were read from the input file, which is not true.

So what I'm now proposing is to use another StringBuilder in the token that stores the raw data read for the token, including spaces and encapsulators.

Maintainers, what do you think?
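The proposal in the comment above, a second buffer on the token that keeps everything the lexer consumed, might be sketched as follows. The names `RawTokenSketch`, `Token`, `lex`, and `clean` are hypothetical and the cleaning rules (trim, strip one pair of enclosing quotes, collapse doubled quotes) are simplifying assumptions; the real Commons CSV `Token` is not shown here.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: each token keeps both its cleaned content and the raw text read. */
final class RawTokenSketch {

    static final class Token {
        final String raw;     // every character consumed for this field, delimiter excluded
        final String content; // what the parser would hand back to the caller

        Token(String raw) {
            this.raw = raw;
            this.content = clean(raw);
        }
    }

    /** Splits one line on ',' outside quotes, accumulating the raw text of each field. */
    static List<Token> lex(String line) {
        List<Token> tokens = new ArrayList<>();
        StringBuilder raw = new StringBuilder();
        boolean inQuotes = false;
        for (char c : line.toCharArray()) {
            if (c == '"') {
                inQuotes = !inQuotes;
            }
            if (c == ',' && !inQuotes) {
                tokens.add(new Token(raw.toString()));
                raw.setLength(0);
            } else {
                raw.append(c);
            }
        }
        tokens.add(new Token(raw.toString()));
        return tokens;
    }

    /** Trims surrounding spaces and removes one pair of enclosing quotes, if present. */
    static String clean(String raw) {
        String s = raw.trim();
        if (s.length() >= 2 && s.startsWith("\"") && s.endsWith("\"")) {
            s = s.substring(1, s.length() - 1).replace("\"\"", "\"");
        }
        return s;
    }
}
```

For `A, "B" , C` the contents are A, B, and C (three characters), while the raw lengths are 1, 5, and 2; adding the two delimiters gives the ten characters actually present in the line, which is the figure a split-aware reader needs.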
[jira] [Updated] (CSV-196) Store the information of raw data read by lexer
[ https://issues.apache.org/jira/browse/CSV-196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matt Sun updated CSV-196:
-------------------------
    Summary: Store the information of raw data read by lexer  (was: Store the info of whether a field is enclosed by quotes)
[jira] [Comment Edited] (CSV-196) Store the info of whether a field is enclosed by quotes
[ https://issues.apache.org/jira/browse/CSV-196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15566230#comment-15566230 ]

Matt Sun edited comment on CSV-196 at 10/11/16 6:40 PM:
--------------------------------------------------------

To add to what I just said, the CSV parser already takes care of spaces (if any). So if we configure the parser not to remove leading and trailing spaces when it is initialized, we will get the spacing of the original data. Whether or not the field was encapsulated is the only missing information.

was (Author: mattsun):
Add to what I just said, the CSV parser also takes care of spaces (if any). So if we format the parser not to remove leading and trailing spaces during initialization, we will get info of the original data. Whether or not the field was encapsulated is the only missing info.
[jira] [Commented] (CSV-196) Store the info of whether a field is enclosed by quotes
[ https://issues.apache.org/jira/browse/CSV-196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15566230#comment-15566230 ]

Matt Sun commented on CSV-196:
------------------------------

To add to what I just said, the CSV parser also takes care of spaces (if any). So if we configure the parser not to remove leading and trailing spaces during initialization, we will get the spacing of the original data. Whether or not the field was encapsulated is the only missing information.
[jira] [Commented] (CSV-196) Store the info of whether a field is enclosed by quotes
[ https://issues.apache.org/jira/browse/CSV-196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15566209#comment-15566209 ]

Matt Sun commented on CSV-196:
------------------------------

You're right, the parser reads characters. But downstream users can count bytes from the characters read. As I said, because we are integrating with the Hadoop API, we want to know the original data read by the parser. What I asked for is not to have the parser count the bytes, but only to set a flag indicating whether the field was enclosed by quotes.
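How a downstream user might count bytes from the characters read plus the requested flag can be sketched as below. This is an estimate under loudly stated assumptions, not a method from the thread: the source is UTF-8, surrounding spaces are preserved in the value, the encapsulator is a double quote, and an embedded quote was written as a doubled quote. `RawFieldSize` and `rawFieldBytes` are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch: estimates the bytes a field occupied in a UTF-8 source file. */
final class RawFieldSize {

    /** Byte length of the field as written, given its parsed value and the quoted flag. */
    static int rawFieldBytes(String value, boolean quoted) {
        int bytes = value.getBytes(StandardCharsets.UTF_8).length;
        if (quoted) {
            bytes += 2; // opening and closing encapsulator
            for (int i = 0; i < value.length(); i++) {
                if (value.charAt(i) == '"') {
                    bytes++; // assumed to have been written as "" in the source
                }
            }
        }
        return bytes;
    }
}
```

Without the flag the first assumption already fails: the parsed values `b1` and `"b1"` are indistinguishable, yet they differ by two bytes in the file.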
[jira] [Commented] (CSV-196) Store the info of whether a field is enclosed by quotes
[ https://issues.apache.org/jira/browse/CSV-196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1534#comment-1534 ]

Matt Sun commented on CSV-196:
---

Yes, pretty much: the bytes read from the source file, including encapsulators, spaces, line breaks, etc.

> Store the info of whether a field is enclosed by quotes
> ---
>
> Key: CSV-196
> URL: https://issues.apache.org/jira/browse/CSV-196
> Project: Commons CSV
> Issue Type: Improvement
> Components: Parser
> Affects Versions: 1.4
> Reporter: Matt Sun
> Labels: easyfix, features, patch
> Fix For: Patch Needed
> Original Estimate: 48h
> Remaining Estimate: 48h
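For illustration only, the raw byte count described in this comment (encapsulators, spaces, and line breaks included) could be sketched as below. This is not Commons CSV code: `RawRecordLengthSketch` and `recordEnd` are hypothetical names, and the sketch assumes an ASCII-compatible encoding such as UTF-8, a double quote as the encapsulator, and quotes escaped by doubling.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch; not part of Commons CSV. */
final class RawRecordLengthSketch {

    /**
     * Returns the offset just past the record that starts at {@code start},
     * counting every raw byte: quotes, spaces, and the line terminator.
     * Line breaks inside a quoted field do not end the record.
     */
    static int recordEnd(byte[] data, int start) {
        boolean inQuotes = false;
        for (int i = start; i < data.length; i++) {
            byte b = data[i];
            if (b == '"') {
                // A doubled quote ("") toggles twice, so the state is unchanged.
                inQuotes = !inQuotes;
            } else if (!inQuotes && b == '\n') {
                return i + 1;
            } else if (!inQuotes && b == '\r') {
                // Treat CRLF as a single two-byte terminator.
                boolean crlf = i + 1 < data.length && data[i + 1] == '\n';
                return crlf ? i + 2 : i + 1;
            }
        }
        return data.length; // last record without a trailing line break
    }

    public static void main(String[] args) {
        byte[] data = "A, B, C\r\na1, \"b\n1\", c1\n".getBytes(StandardCharsets.UTF_8);
        int first = recordEnd(data, 0);
        int second = recordEnd(data, first);
        System.out.println("first record: " + first + " bytes");
        System.out.println("second record: " + (second - first) + " bytes");
    }
}
```

A split reader could sum these lengths to know how far into the file it has consumed, which is the information the parsed CSVRecord alone does not carry.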
[jira] [Commented] (CSV-196) Store the info of whether a field is enclosed by quotes
[ https://issues.apache.org/jira/browse/CSV-196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15519443#comment-15519443 ]

Matt Sun commented on CSV-196:
---

Sure, I can do that. Which version should the patch be based on?

> Store the info of whether a field is enclosed by quotes
> ---
>
> Key: CSV-196
> URL: https://issues.apache.org/jira/browse/CSV-196
> Project: Commons CSV
> Issue Type: Improvement
> Components: Parser
> Affects Versions: 1.4
> Reporter: Matt Sun
> Labels: easyfix, features, patch
> Fix For: Patch Needed
> Original Estimate: 48h
> Remaining Estimate: 48h
[jira] [Updated] (CSV-196) Store the info of whether a field is enclosed by quotes
[ https://issues.apache.org/jira/browse/CSV-196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matt Sun updated CSV-196:
---

Description:

It will be good to have CSVParser class to store the info of whether a field was enclosed by quotes in the original source file.

For example, for this data sample:
A, B, C
a1, "b1", c1

CSVParser gives us record a1, b1, c1, which is helpful because it parsed double quotes, but we also lost the information of original data at the same time. We can't tell from the CSVRecord returned whether the original data is enclosed by double quotes or not.

In our use case, we are integrating Apache Hadoop APIs with Commons CSV. CSV is one kind of input of Hadoop Jobs, which should support splitting input data. To accurately split a CSV file into pieces, we need to count the bytes of data CSVParser actually read. CSVParser doesn't have accurate information of whether a field was enclosed by quotes, neither does it store raw data of the original source. Downstream users of commons CSVParser is not able to get those info.

To suggest a fix: Extend the token/CSVRecord to have a boolean field indicating whether the column was enclosed by quotes. While Lexer is doing getNextToken, set the flag if a field is encapsulated and successfully parsed.

I find another issue reported with similar request, but it was marked as resolved: [CSV91] https://issues.apache.org/jira/browse/CSV-91?jql=project%20%3D%20CSV%20AND%20text%20~%20%22with%20quotes%22

was:

It will be good to have CSVParser class to store the info of whether a field was enclosed by quotes in the original source file.

For example, for this data sample:
A, B, C
a1, "b1", c1

CSVParser gives us record a1, b1, c1, which is helpful because it parsed double quotes, but we also lost the information of original data at the same time. We can tell from the CSVRecord returned whether the original data is enclosed by double quotes or not.

In our use case, we are integrating Apache Hadoop APIs with Commons CSV. CSV is one kind of input of Hadoop, which should splitting input data. To accurately split a CSV file into pieces, the program needs to count the bytes of data CSVParser actually read. CSVParser doesn't have accurate information of whether a field was enclosed by quotes, neither does it store raw data of the original source. Downstream users of commons CSVParser is not able to get those info.

To suggest a fix: Extend the token/CSVRecord to have a field indicating whether the column was enclosed by quotes. While Lexer is doing getNextToken, set the flag if a field is encapsulated and successfully parsed.

I find another issue reported, but it was marked as resolved: [CSV91] https://issues.apache.org/jira/browse/CSV-91?jql=project%20%3D%20CSV%20AND%20text%20~%20%22with%20quotes%22

> Store the info of whether a field is enclosed by quotes
> ---
>
> Key: CSV-196
> URL: https://issues.apache.org/jira/browse/CSV-196
> Project: Commons CSV
> Issue Type: Improvement
> Components: Parser
> Affects Versions: 1.4
> Reporter: Matt Sun
> Labels: easyfix, features, patch
> Original Estimate: 48h
> Remaining Estimate: 48h
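The fix suggested in the description (a boolean flag set while a field is being tokenized) might look roughly like the following. This is a minimal sketch under stated assumptions, not the Commons CSV Lexer or the patch itself: `QuotedFieldSketch`, `Field`, and `parseLine` are hypothetical names, the input is a single record with no embedded line breaks, the delimiter is a comma, and quotes are escaped by doubling.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; not part of Commons CSV. */
final class QuotedFieldSketch {

    /** A parsed value plus the flag the issue asks for. */
    static final class Field {
        final String value;
        final boolean quoted;

        Field(String value, boolean quoted) {
            this.value = value;
            this.quoted = quoted;
        }
    }

    /** Splits one record and remembers which fields were enclosed by quotes. */
    static List<Field> parseLine(String line) {
        List<Field> fields = new ArrayList<>();
        StringBuilder value = new StringBuilder();
        boolean inQuotes = false;   // currently inside an encapsulated field
        boolean wasQuoted = false;  // the current field opened with a quote
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c != '"') {
                    value.append(c);
                } else if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                    value.append('"'); // doubled quote is a literal quote
                    i++;
                } else {
                    inQuotes = false;  // closing quote
                }
            } else if (c == '"' && !wasQuoted && value.toString().trim().isEmpty()) {
                value.setLength(0);    // drop whitespace before the opening quote
                inQuotes = true;
                wasQuoted = true;      // this is the flag the issue proposes
            } else if (c == ',') {
                fields.add(finish(value, wasQuoted));
                value.setLength(0);
                wasQuoted = false;
            } else if (!wasQuoted) {
                value.append(c);       // text after a closing quote is ignored
            }
        }
        if (inQuotes) {
            throw new IllegalArgumentException("unterminated quoted field");
        }
        fields.add(finish(value, wasQuoted));
        return fields;
    }

    private static Field finish(StringBuilder value, boolean wasQuoted) {
        // Unquoted values are trimmed; quoted values are kept as written.
        String text = wasQuoted ? value.toString() : value.toString().trim();
        return new Field(text, wasQuoted);
    }

    public static void main(String[] args) {
        for (Field f : parseLine("a1, \"b1\", c1")) {
            System.out.println(f.value + " quoted=" + f.quoted);
        }
    }
}
```

With the sample row from the description, only the second field is reported as quoted, which is the distinction the returned record currently loses.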
[jira] [Updated] (CSV-196) Store the info of whether a field is enclosed by quotes
[ https://issues.apache.org/jira/browse/CSV-196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matt Sun updated CSV-196:
---

Priority: Major (was: Minor)

> Store the info of whether a field is enclosed by quotes
> ---
>
> Key: CSV-196
> URL: https://issues.apache.org/jira/browse/CSV-196
> Project: Commons CSV
> Issue Type: Improvement
> Components: Parser
> Affects Versions: 1.4
> Reporter: Matt Sun
> Labels: easyfix, features, patch
> Original Estimate: 48h
> Remaining Estimate: 48h
[jira] [Created] (CSV-196) Store the info of whether a field is enclosed by quotes
Matt Sun created CSV-196:

Summary: Store the info of whether a field is enclosed by quotes
Key: CSV-196
URL: https://issues.apache.org/jira/browse/CSV-196
Project: Commons CSV
Issue Type: Improvement
Components: Parser
Affects Versions: 1.4
Reporter: Matt Sun
Priority: Minor

It will be good to have CSVParser class to store the info of whether a field was enclosed by quotes in the original source file.

For example, for this data sample:
A, B, C
a1, "b1", c1

CSVParser gives us record a1, b1, c1, which is helpful because it parsed double quotes, but we also lost the information of original data at the same time. We can tell from the CSVRecord returned whether the original data is enclosed by double quotes or not.

In our use case, we are integrating Apache Hadoop APIs with Commons CSV. CSV is one kind of input of Hadoop, which should splitting input data. To accurately split a CSV file into pieces, the program needs to count the bytes of data CSVParser actually read. CSVParser doesn't have accurate information of whether a field was enclosed by quotes, neither does it store raw data of the original source. Downstream users of commons CSVParser is not able to get those info.

To suggest a fix: Extend the token/CSVRecord to have a field indicating whether the column was enclosed by quotes. While Lexer is doing getNextToken, set the flag if a field is encapsulated and successfully parsed.

I find another issue reported, but it was marked as resolved: [CSV91] https://issues.apache.org/jira/browse/CSV-91?jql=project%20%3D%20CSV%20AND%20text%20~%20%22with%20quotes%22