[ https://issues.apache.org/jira/browse/SPARK-14480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wenchen Fan resolved SPARK-14480.
---------------------------------
    Resolution: Fixed

> Remove meaningless StringIteratorReader for CSV data source for better performance
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-14480
>                 URL: https://issues.apache.org/jira/browse/SPARK-14480
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Hyukjin Kwon
>            Assignee: Hyukjin Kwon
>             Fix For: 2.1.0
>
>
> Currently, the CSV data source reads and parses CSV data byte by byte (not line by line).
> In {{CSVParser.scala}}, there is a {{Reader}} wrapping an {{Iterator}}. I think it was made this way for better performance. However, there appear to be two problems.
> Firstly, it was actually not faster than processing line by line with the {{Iterator}}, because of the additional logic needed to wrap the {{Iterator}} in a {{Reader}}.
> Secondly, it added some complexity, because additional logic is needed to allow every line to be read byte by byte. This made it pretty difficult to track down parsing issues (e.g. SPARK-14103). In fact, almost all of the code in {{CSVParser}} might not be needed.
> I made a rough patch and tested it. The test results for the first problem are below:
> h4. Results
> - Original code with {{Reader}} wrapping {{Iterator}}
> ||End-to-end (ns)||Parse Time (ns)||
> | 14116265034 | 2008277960 |
> - New code with {{Iterator}}
> ||End-to-end (ns)||Parse Time (ns)||
> | 13451699644 | 1549050564 |
> In more detail:
> h4. Method
> - The TPC-H lineitem table is tested.
> - Only 1000000 rows are collected.
> - End-to-end tests and parsing-time tests are each performed 10 times, and the averages are calculated for each.
> h4. Environment
> - Machine: MacBook Pro Retina
> - CPU: 4
> - Memory: 8GB
> h4. Dataset
> - [TPC-H|http://www.tpc.org/tpch/] Lineitem table created with factor 1 ([generate data|https://github.com/rxin/TPC-H-Hive/tree/master/dbgen])
> - Size: 724.66 MB
> h4. Test Code
> - Function to measure time
> {code}
> // Runs `f` once, prints the elapsed wall-clock time in milliseconds,
> // and returns the result of `f`.
> def time[A](f: => A) = {
>   val s = System.nanoTime
>   val ret = f
>   println("time: "+(System.nanoTime-s)/1e6+"ms")
>   ret
> }
> {code}
> - End-to-end test
> {code}
> val path = "lineitem.tbl"
> val df = sqlContext
>   .read
>   .format("csv")
>   .option("header", "false")
>   .option("delimiter", "|")
>   .load(path)
> time(df.take(1000000))
> {code}
> - Parsing time test for the original code (in {{BulkCsvParser}})
> {code}
> ...
> // `reader` is a wrapper for an Iterator.
> private val reader = new StringIteratorReader(iter)
> parser.beginParsing(reader)
> ...
> time(parser.parseNext())
> ...
> {code}
> - Parsing time test for the new code (in {{BulkCsvParser}})
> {code}
> ...
> time(parser.parseLine(iter.next()))
> ...
> {code}
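To make the difference between the two approaches concrete, here is a minimal sketch using only the Scala and Java standard libraries. It is not Spark's {{StringIteratorReader}} and it does not use the CSV parser library that {{BulkCsvParser}} relies on; the names {{IteratorReader}}, {{parseViaReader}} and {{parseByLine}} are hypothetical, and the field splitting is deliberately simplified (no quoting or escaping). It only illustrates why the wrapper adds work: lines are re-joined into a character stream and then split into lines again, whereas the direct approach hands each line to the parsing step as-is.

```scala
import java.io.{BufferedReader, Reader}
import java.util.regex.Pattern

object LineParsingSketch {

  /** Hypothetical stand-in for a Reader that wraps an Iterator of lines.
    * It re-joins the lines with '\n' so a stream-oriented consumer can read them. */
  final class IteratorReader(lines: Iterator[String]) extends Reader {
    private var current: String = ""
    private var pos: Int = 0

    override def read(buf: Array[Char], off: Int, len: Int): Int = {
      if (len == 0) return 0
      // Refill from the iterator once the current line is exhausted.
      while (pos >= current.length) {
        if (!lines.hasNext) return -1
        current = lines.next() + "\n"
        pos = 0
      }
      val n = math.min(len, current.length - pos)
      current.getChars(pos, pos + n, buf, off)
      pos += n
      n
    }

    override def close(): Unit = ()
  }

  // Simplified field splitting (assumption: no quoted or escaped delimiters).
  // The delimiter is quoted because '|' is a regex metacharacter.
  private def split(line: String, delimiter: Char): Vector[String] =
    line.split(Pattern.quote(delimiter.toString), -1).toVector

  /** Reader-based path: lines -> character stream -> lines again -> fields. */
  def parseViaReader(lines: Iterator[String], delimiter: Char = '|'): Vector[Vector[String]] = {
    val reader = new BufferedReader(new IteratorReader(lines))
    Iterator
      .continually(reader.readLine())
      .takeWhile(_ != null)
      .map(split(_, delimiter))
      .toVector
  }

  /** Direct path: each line from the iterator goes straight to field splitting. */
  def parseByLine(lines: Iterator[String], delimiter: Char = '|'): Vector[Vector[String]] =
    lines.map(split(_, delimiter)).toVector

  def main(args: Array[String]): Unit = {
    val sample = Seq("1|155190|7706|", "2|106170|1191|")
    println(parseViaReader(sample.iterator))
    println(parseByLine(sample.iterator))
  }
}
```

Both functions should produce the same rows for input without embedded line breaks; the direct version simply skips the intermediate buffering and re-splitting. Records that span multiple lines are not handled by either function in this sketch.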