> On Aug. 28, 2015, 9:42 p.m., Jacques Nadeau wrote:
> > Can you please start out by discussing an approach.  I need to look at this 
> > in more detail but I think you're trying to correct a symptom rather than 
> > the root problem.  This is generic code (sometimes used for tab delimited, 
> > sometimes for comma, somtimes for space or pipe).  As such, I don't expect 
> > to see a tab character in the common code.  If we need to change conditions 
> > for this situation, we need to figure out the right way.  For example, what 
> > if someone uses a space delimiter for fields?  It seems like we're going to 
> > hit a similar problem.
> > 
> > Additionally, this is extremely performance sensitive code.  Before doing 
> > submitting any change on this code, you need to do performance testing.  I 
> > used a 2gb CSV file for performance testing purposes previously.

Let me first explain the symptom.

The problem happens when we have double-quoted fields in the file.

If there are whitespaces after the second quote (e.g., “current field”      
,next field),

TextReader will try to ignore them (the if-statement is doing this task).

However, it is incorrect because those whitespaces could be used as delimiter. 


My solution:

Thus, my solution is to add one more condition as follows:

“if character is whitespace and it is not used as delimiter” 

Besides, I found a new writing which would be more generic and not need to have 
‘\t’ in the code:

if (ch != newLine && ch <= ' ' && ch != delimiter)   

——

Now I am doing some performance testing. Once it is done and makes sense, I 
will give the result and upload the latest patch.


- Sean Hsuan-Yi


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/37893/#review96936
-----------------------------------------------------------


On Aug. 28, 2015, 9:35 p.m., Sean Hsuan-Yi Chu wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/37893/
> -----------------------------------------------------------
> 
> (Updated Aug. 28, 2015, 9:35 p.m.)
> 
> 
> Review request for drill, Jacques Nadeau and Mehant Baid.
> 
> 
> Bugs: DRILL-3718
>     https://issues.apache.org/jira/browse/DRILL-3718
> 
> 
> Repository: drill-git
> 
> 
> Description
> -------
> 
> For TSV files, if the TextReader reads a double quote, it would keep scanning 
> until it gets the second double quote.
> 
> However, even getting the second double quote, the current reader will keep 
> going in order to trim the space (i.e., ' '). 
> 
> In tsv, there is no need to trim '\t' (tab), which is used to separate fields.
> 
> 
> Diffs
> -----
> 
>   
> exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/text/compliant/TextReader.java
>  3899509 
>   exec/java-exec/src/test/java/org/apache/drill/TestExampleQueries.java 
> 6b74ecf 
>   exec/java-exec/src/test/resources/store/text/WithQuote.tsv PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/37893/diff/
> 
> 
> Testing
> -------
> 
> All
> 
> 
> Thanks,
> 
> Sean Hsuan-Yi Chu
> 
>

Reply via email to