[jira] [Commented] (CSV-196) Store the information of raw data read by lexer

2018-08-08 Thread Serge P. Nekoval (JIRA)


[ 
https://issues.apache.org/jira/browse/CSV-196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16574019#comment-16574019
 ] 

Serge P. Nekoval commented on CSV-196:
--

FYI, I've submitted a patch, CSV-229, with a similar feature. I'm not sure how 
it compares.

> Store the information of raw data read by lexer
> ---
>
> Key: CSV-196
> URL: https://issues.apache.org/jira/browse/CSV-196
> Project: Commons CSV
> Issue Type: Improvement
> Components: Parser
> Affects Versions: 1.4
> Reporter: Matt Sun
> Priority: Major
> Labels: patch
> Original Estimate: 48h
> Remaining Estimate: 48h
>
> It would be good for the CSVParser class to store whether a field was 
> enclosed in quotes in the original source file.
> For example, for this data sample:
> A, B, C
> a1, "b1", c1
> CSVParser gives us the record a1, b1, c1, which is helpful because it parsed 
> the double quotes, but we also lose information about the original data at 
> the same time. We can't tell from the returned CSVRecord whether the original 
> data was enclosed in double quotes or not.
> In our use case, we are integrating Apache Hadoop APIs with Commons CSV. CSV 
> is one kind of input for Hadoop jobs, which should support splitting input 
> data. To accurately split a CSV file into pieces, we need to count the bytes 
> of data CSVParser actually read. CSVParser doesn't have accurate information 
> about whether a field was enclosed in quotes, nor does it store the raw data 
> of the original source. Downstream users of the Commons CSV CSVParser are not 
> able to get that information.
> Suggested fix: extend the token/CSVRecord with a boolean field indicating 
> whether the column was enclosed in quotes. While the Lexer is running 
> getNextToken, set the flag if a field is encapsulated and successfully parsed.
> I found another issue with a similar request, but it was marked as 
> resolved: [CSV-91] 
> https://issues.apache.org/jira/browse/CSV-91?jql=project%20%3D%20CSV%20AND%20text%20~%20%22with%20quotes%22



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CSV-196) Store the information of raw data read by lexer

2017-09-19 Thread Matt Sun (JIRA)

[ 
https://issues.apache.org/jira/browse/CSV-196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16172439#comment-16172439
 ] 

Matt Sun commented on CSV-196:
--

https://github.com/apache/commons-csv/pull/22



[jira] [Commented] (CSV-196) Store the information of raw data read by lexer

2017-09-06 Thread Matt Sun (JIRA)

[ 
https://issues.apache.org/jira/browse/CSV-196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16156340#comment-16156340
 ] 

Matt Sun commented on CSV-196:
--

This is exactly what I'm proposing. We should track both, with the byte 
position being optional. That is completely backward compatible.



[jira] [Commented] (CSV-196) Store the information of raw data read by lexer

2017-09-06 Thread Gary Gregory (JIRA)

[ 
https://issues.apache.org/jira/browse/CSV-196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16156194#comment-16156194
 ] 

Gary Gregory commented on CSV-196:
--

A character is different from a byte, so maybe we need to track both the 
character position and the byte position. Some folks might rely on the current 
behavior...

Thoughts?
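
To illustrate the character/byte distinction raised above, here is a minimal sketch using only the JDK. The class and method names are hypothetical and are not part of Commons CSV; the sketch assumes the input is UTF-8 encoded, which the thread does not state.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not a Commons CSV class.
final class PositionDemo {

    /** Number of UTF-16 code units, which is what a Reader-based lexer counts. */
    static int charLength(String s) {
        return s.length();
    }

    /** Number of bytes the same text occupies when encoded as UTF-8 (assumed encoding). */
    static int utf8ByteLength(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // One record containing a single non-ASCII character (e with acute accent).
        String record = "a1,\"b\u00e9\",c1\n";
        System.out.println("chars = " + charLength(record));
        System.out.println("bytes = " + utf8ByteLength(record));
    }
}
```

For pure ASCII input the two counts agree, which may be why code that treats the character position as a byte offset appears to work until non-ASCII data shows up.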



[jira] [Commented] (CSV-196) Store the information of raw data read by lexer

2017-04-09 Thread Matt Sun (JIRA)

[ 
https://issues.apache.org/jira/browse/CSV-196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15962321#comment-15962321
 ] 

Matt Sun commented on CSV-196:
--

I found that the offset in CSVRecord can meet our needs to some extent, so 
there is no need to store additional raw data for now. I'm going to close this 
issue unless you have other opinions.



[jira] [Commented] (CSV-196) Store the information of raw data read by lexer

2017-03-25 Thread Gary Gregory (JIRA)

[ 
https://issues.apache.org/jira/browse/CSV-196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15942129#comment-15942129
 ] 

Gary Gregory commented on CSV-196:
--

Is this issue dead or are we waiting for a patch?



[jira] [Commented] (CSV-196) Store the information of raw data read by lexer

2017-02-14 Thread Gary Gregory (JIRA)

[ 
https://issues.apache.org/jira/browse/CSV-196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866468#comment-15866468
 ] 

Gary Gregory commented on CSV-196:
--

I think we need to talk about an actual patch before judging performance. 
Personally, I have cases where I read and write millions of rows (around 50 
million), so any change might be quite significant.



[jira] [Commented] (CSV-196) Store the information of raw data read by lexer

2017-02-14 Thread Matt Sun (JIRA)

[ 
https://issues.apache.org/jira/browse/CSV-196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866217#comment-15866217
 ] 

Matt Sun commented on CSV-196:
--

[~britter] I want to clarify that I never asked to have CSVParser store the 
byte information. I was saying that if the CSV Token could store some 
information about the raw data, downstream users could use that information to 
do some computation, FOR EXAMPLE counting bytes. Regarding performance, I don't 
think storing raw data will incur a *significant* cost. Timing-wise, the Lexer 
is already reading the input file char by char, so storing the characters read 
will not increase the time complexity; it stays the same order. You may argue 
that appending to a StringBuffer is a cost, and I agree. However, I wouldn't 
say it's *significant*. Memory-wise, given that a CSV token is fairly small, I 
also don't think it will increase the memory burden.
But your suggestion of "opt-in" sounds fine and reasonable. Another suggestion 
I have is to store only the number of characters read by the Lexer in the 
Token. That saves a little time and memory.

[~b.eckenfels] Do you mean the offset from the beginning of the file? In the 
splitting case, it will be more useful to store the offset of the *END* of each 
record. While Hadoop is processing the split, it wants to make sure it doesn't 
go across the split boundary. After reading a CSV record, the program could 
figure out the current position by retrieving the offset of the *END* of that 
record. If only the beginning offset is given and Hadoop knows the beginning is 
within the boundary, the end of the record may still go beyond the boundary.
I'm not sure how *easy* it is to do this today. Could you briefly point out how 
to achieve this? What is the performance and memory impact?
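
The point about *END* offsets can be sketched with a small stand-alone scanner. This is only an illustration under stated assumptions: `RecordEndOffsets` and `endOffsets` are hypothetical names, not Commons CSV API; offsets are counted in characters, not bytes; records end at '\n'; and a double quote is the only encapsulator.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical scanner for illustration only; not the Commons CSV Lexer.
final class RecordEndOffsets {

    /** Returns the exclusive end offset, in characters, of each record in csv. */
    static List<Integer> endOffsets(String csv) {
        List<Integer> ends = new ArrayList<>();
        boolean inQuotes = false;
        int recordStart = 0;
        for (int i = 0; i < csv.length(); i++) {
            char c = csv.charAt(i);
            if (c == '"') {
                // An escaped quote ("") toggles twice, so the state ends up unchanged.
                inQuotes = !inQuotes;
            } else if (c == '\n' && !inQuotes) {
                // A line break outside quotes terminates the record.
                ends.add(i + 1);
                recordStart = i + 1;
            }
        }
        if (recordStart < csv.length()) {
            // Last record has no trailing line break.
            ends.add(csv.length());
        }
        return ends;
    }
}
```

A split reader could compare each end offset with its split boundary: a record that starts inside the split but whose end offset lies past the boundary is exactly the case that a start offset alone cannot detect.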





[jira] [Commented] (CSV-196) Store the information of raw data read by lexer

2017-02-14 Thread Bernd Eckenfels (JIRA)

[ 
https://issues.apache.org/jira/browse/CSV-196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15865324#comment-15865324
 ] 

Bernd Eckenfels commented on CSV-196:
-

What about storing the beginning of each record as a byte/char offset? That's 
more generally useful and easier to extract.

(Besides, for a splitting use case I would split on the normalized data size, 
since this is the chunk Hadoop will work on.)



[jira] [Commented] (CSV-196) Store the information of raw data read by lexer

2017-02-13 Thread Benedikt Ritter (JIRA)

[ 
https://issues.apache.org/jira/browse/CSV-196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15865306#comment-15865306
 ] 

Benedikt Ritter commented on CSV-196:
-

I've asked the ML to comment on the proposal.



[jira] [Commented] (CSV-196) Store the information of raw data read by lexer

2017-02-13 Thread Benedikt Ritter (JIRA)

[ 
https://issues.apache.org/jira/browse/CSV-196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15865296#comment-15865296
 ] 

Benedikt Ritter commented on CSV-196:
-

[~mattsun] So you don't need to know which bytes have been read, but rather 
need information about the characters? Do you think storing the additional 
characters will have a significant impact on performance? Maybe this should be 
opt-in, so the user can configure it.



[jira] [Commented] (CSV-196) Store the information of raw data read by lexer

2017-02-13 Thread Matt Sun (JIRA)

[ 
https://issues.apache.org/jira/browse/CSV-196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15864138#comment-15864138
 ] 

Matt Sun commented on CSV-196:
--

I just changed the title because I realized that the problem is more 
complicated than previously thought. Delimiter information might not be the 
only information missing for downstream users. Again, in the same scenario as 
described, the Commons CSV library is being used with the Hadoop library as an 
input format (CSV). To support splitting the input of Hadoop jobs, the program 
needs to know how much of the input file has been read (and thus stay within 
the split boundary). And to leverage the capability of CSVParser, we usually 
want to use "ignoreSurroundingSpaces" and "trim", and also handle the 
"encapsulator". Thus, for a CSV record like A, "B"   , C the parser gives us 
back A, B and C. It seems to the Java program that only three characters were 
read from the input file, which is not true.

So what I'm now proposing is to use another StringBuilder in the token that 
stores the raw data read for the token, including spaces and the encapsulator.

Maintainers, what do you think?
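
A rough sketch of the proposal, for discussion only. `RawToken` and `RawLexerSketch` are hypothetical names and do not reflect the actual Commons CSV Lexer or Token classes. The sketch assumes a comma delimiter and a double-quote encapsulator, works on a single line, and does not handle escaped quotes inside a quoted field.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical token: the parsed value plus the proposed raw text and a "was quoted" flag.
final class RawToken {
    String content = "";                            // value as the parser returns it today
    boolean quoted;                                 // true if the field was enclosed in quotes
    final StringBuilder raw = new StringBuilder();  // every character consumed for this field
}

// Hypothetical single-line lexer for illustration only.
final class RawLexerSketch {

    static List<RawToken> lexLine(String line) {
        List<RawToken> tokens = new ArrayList<>();
        RawToken token = new RawToken();
        StringBuilder content = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == ',' && !inQuotes) {
                // Delimiter outside quotes ends the field; it is not stored in any token.
                tokens.add(finish(token, content));
                token = new RawToken();
                content.setLength(0);
                continue;
            }
            token.raw.append(c); // raw keeps spaces and encapsulators
            if (c == '"') {
                if (!token.quoted) {
                    content.setLength(0); // drop whitespace seen before the opening quote
                    token.quoted = true;
                }
                inQuotes = !inQuotes;
            } else if (inQuotes || !token.quoted) {
                // Keep characters inside quotes, or any character of an unquoted field.
                content.append(c);
            }
        }
        tokens.add(finish(token, content));
        return tokens;
    }

    private static RawToken finish(RawToken token, StringBuilder content) {
        // Unquoted fields are trimmed; quoted fields keep their inner whitespace.
        token.content = token.quoted ? content.toString() : content.toString().trim();
        return token;
    }
}
```

With the record from the comment above, `lexLine("A, \"B\"   , C")` yields the contents A, B and C, while the `raw` lengths plus the two delimiters add up to the full length of the line, which is the count a split-aware reader would need.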
