Hsuan, A test with 6 rows and 60 fields per row is not the best test (at least to evaluate the performance). Please note a previous comment in this email chain mentions the use of a 2gb CSV file for performance testing purposes previously.
I would suggest that you should run your patch of a larger dataset and then evaluate performance regressions if any. On Fri, Aug 28, 2015 at 7:32 PM, Sean Hsuan-Yi Chu <[email protected]> wrote: > > > > On Aug. 28, 2015, 9:42 p.m., Jacques Nadeau wrote: > > > Can you please start out by discussing an approach. I need to look at > this in more detail but I think you're trying to correct a symptom rather > than the root problem. This is generic code (sometimes used for tab > delimited, sometimes for comma, somtimes for space or pipe). As such, I > don't expect to see a tab character in the common code. If we need to > change conditions for this situation, we need to figure out the right way. > For example, what if someone uses a space delimiter for fields? It seems > like we're going to hit a similar problem. > > > > > > Additionally, this is extremely performance sensitive code. Before > doing submitting any change on this code, you need to do performance > testing. I used a 2gb CSV file for performance testing purposes previously. > > > > Sean Hsuan-Yi Chu wrote: > > Let me first explain the symptom. > > > > The problem happens when we have double-quoted fields in the file. > > > > If there are whitespaces after the second quote (e.g., “current > field” ,next field), > > > > TextReader will try to ignore them (the if-statement is doing this > task). > > > > However, it is incorrect because those whitespaces could be used as > delimiter. > > > > > > > > > > > > My solution: > > > > Thus, my solution is to add one more condition as follows: > > > > “if character is whitespace and it is not used as delimiter” > > > > Besides, I found a new writing which would be more generic and not > need to have ‘\t’ in the code: > > > > if (ch != newLine && ch <= ' ' && ch != delimiter) > > > > —— > > > > Now I am doing some performance testing. Once it is done and makes > sense, I will give the result and upload the latest patch. > > I did an experiment and collected statistics here. > > 1. Data: a csv file with 6 rows and 60 fields per row. > 2. Let the two versions of TextReader read the same csv file three times: > On average, the running times are 394.68 (the one on master) versus 413.11 > (this patch). It is 4% slower. It seems ok? > > > - Sean Hsuan-Yi > > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/37893/#review96936 > ----------------------------------------------------------- > > > On Aug. 28, 2015, 9:35 p.m., Sean Hsuan-Yi Chu wrote: > > > > ----------------------------------------------------------- > > This is an automatically generated e-mail. To reply, visit: > > https://reviews.apache.org/r/37893/ > > ----------------------------------------------------------- > > > > (Updated Aug. 28, 2015, 9:35 p.m.) > > > > > > Review request for drill, Jacques Nadeau and Mehant Baid. > > > > > > Bugs: DRILL-3718 > > https://issues.apache.org/jira/browse/DRILL-3718 > > > > > > Repository: drill-git > > > > > > Description > > ------- > > > > For TSV files, if the TextReader reads a double quote, it would keep > scanning until it gets the second double quote. > > > > However, even getting the second double quote, the current reader will > keep going in order to trim the space (i.e., ' '). > > > > In tsv, there is no need to trim '\t' (tab), which is used to separate > fields. > > > > > > Diffs > > ----- > > > > > > exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/text/compliant/TextReader.java > 3899509 > > exec/java-exec/src/test/java/org/apache/drill/TestExampleQueries.java > 6b74ecf > > exec/java-exec/src/test/resources/store/text/WithQuote.tsv PRE-CREATION > > > > Diff: https://reviews.apache.org/r/37893/diff/ > > > > > > Testing > > ------- > > > > All > > > > > > Thanks, > > > > Sean Hsuan-Yi Chu > > > > > >
