Re: Review Request 37893: DRILL-3718: TSV reader fails when "" appears

Sean Hsuan-Yi Chu Sat, 29 Aug 2015 18:05:00 -0700


> On Aug. 28, 2015, 9:42 p.m., Jacques Nadeau wrote:
> > Can you please start out by discussing an approach.  I need to look at this 
> > in more detail but I think you're trying to correct a symptom rather than 
> > the root problem.  This is generic code (sometimes used for tab delimited, 
> > sometimes for comma, somtimes for space or pipe).  As such, I don't expect 
> > to see a tab character in the common code.  If we need to change conditions 
> > for this situation, we need to figure out the right way.  For example, what 
> > if someone uses a space delimiter for fields?  It seems like we're going to 
> > hit a similar problem.
> > 
> > Additionally, this is extremely performance sensitive code.  Before doing 
> > submitting any change on this code, you need to do performance testing.  I 
> > used a 2gb CSV file for performance testing purposes previously.
> 
> Sean Hsuan-Yi Chu wrote:
>     Let me first explain the symptom.
>     
>     The problem happens when we have double-quoted fields in the file.
>     
>     If there are whitespaces after the second quote (e.g., “current field”    
>   ,next field),
>     
>     TextReader will try to ignore them (the if-statement is doing this task).
>     
>     However, it is incorrect because those whitespaces could be used as 
> delimiter. 
>     
>     
>     
>     
>     
>     My solution:
>     
>     Thus, my solution is to add one more condition as follows:
>     
>     “if character is whitespace and it is not used as delimiter” 
>     
>     Besides, I found a new writing which would be more generic and not need 
> to have ‘\t’ in the code:
>     
>     if (ch != newLine && ch <= ' ' && ch != delimiter)   
>     
>     ——
>     
>     Now I am doing some performance testing. Once it is done and makes sense, 
> I will give the result and upload the latest patch.
> 
> Sean Hsuan-Yi Chu wrote:
>     I did an experiment and collected statistics here. 
>     
>     1. Data: a csv file with 6 rows and 60 fields per row.
>     2. Let the two versions of TextReader read the same csv file three times: 
> On average, the running times are 394.68 (the one on master) versus 413.11 
> (this patch). It is 4% slower. It seems ok?


My bad. It is 6 million rows. The size of the file is 3 gb.
(I do apologize. Drill should never have spent 400 secs to scan just 6 rows...)


- Sean Hsuan-Yi


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/37893/#review96936
-----------------------------------------------------------


On Aug. 29, 2015, 3:10 a.m., Sean Hsuan-Yi Chu wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/37893/
> -----------------------------------------------------------
> 
> (Updated Aug. 29, 2015, 3:10 a.m.)
> 
> 
> Review request for drill, Jacques Nadeau and Mehant Baid.
> 
> 
> Bugs: DRILL-3718
>     https://issues.apache.org/jira/browse/DRILL-3718
> 
> 
> Repository: drill-git
> 
> 
> Description
> -------
> 
> For TSV files, if the TextReader reads a double quote, it would keep scanning 
> until it gets the second double quote.
> 
> However, even getting the second double quote, the current reader will keep 
> going in order to trim the space (i.e., ' '). 
> 
> In tsv, there is no need to trim '\t' (tab), which is used to separate fields.
> 
> 
> Diffs
> -----
> 
>   
> exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/text/compliant/TextReader.java
>  3899509 
>   exec/java-exec/src/test/java/org/apache/drill/TestExampleQueries.java 
> 6b74ecf 
>   exec/java-exec/src/test/resources/store/text/WithQuote.tsv PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/37893/diff/
> 
> 
> Testing
> -------
> 
> All
> 
> 
> Thanks,
> 
> Sean Hsuan-Yi Chu
> 
>

Re: Review Request 37893: DRILL-3718: TSV reader fails when "" appears

Reply via email to