[ 
https://issues.apache.org/jira/browse/SPARK-14480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reopened SPARK-14480:
--------------------------------

This patch introduced a regression: a column containing an escaped newline can no 
longer be parsed correctly.
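For context, a minimal self-contained illustration of the failure mode (a sketch, not Spark's actual code path): a quoted field containing a newline spans multiple physical lines, so any approach that splits the input on newlines before parsing tears such records apart.

```scala
// Hypothetical illustration: a CSV record whose quoted field contains a
// newline occupies two physical lines. Splitting on '\n' before parsing
// breaks the record; a parser fed the whole character stream keeps it intact.
object EscapedNewlineDemo {
  def main(args: Array[String]): Unit = {
    val csv = "id,comment\n1,\"first line\nsecond line\"\n2,plain"
    // Naive line-by-line split: 4 physical lines, but only 3 logical records
    // (the header plus two data rows).
    val naive = csv.split("\n")
    println(naive.length) // prints 4
  }
}
```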

> Remove meaningless StringIteratorReader for CSV data source for better 
> performance
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-14480
>                 URL: https://issues.apache.org/jira/browse/SPARK-14480
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Hyukjin Kwon
>            Assignee: Hyukjin Kwon
>             Fix For: 2.1.0
>
>
> Currently, the CSV data source reads and parses CSV data byte by byte (not line 
> by line).
> In {{CSVParser.scala}}, there is a {{Reader}} wrapping an {{Iterator}}. I think 
> it was made like this for better performance. However, it looks like there are 
> two problems.
> Firstly, it was actually not faster than processing line by line with an 
> {{Iterator}}, due to the additional logic needed to wrap the {{Iterator}} in a 
> {{Reader}}.
> Secondly, this brought a bit of complexity, because it needs additional logic 
> to allow every line to be read byte by byte. So, it was pretty difficult to 
> figure out parsing issues (e.g. SPARK-14103). Actually, almost all of the code 
> in {{CSVParser}} might not be needed.
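To make the "additional logic" concrete, here is a rough sketch of the idea behind such a wrapper (an assumption-based simplification, not Spark's actual {{StringIteratorReader}}): an {{Iterator[String]}} of lines is exposed as a character stream by re-inserting the stripped newlines and tracking a read position.

```scala
import java.io.Reader

// Simplified sketch of wrapping an Iterator[String] in a java.io.Reader.
// All of this bookkeeping exists only so a parser can consume the input
// as a character stream instead of line by line.
class IteratorReader(iter: Iterator[String]) extends Reader {
  private var current: String = ""
  private var pos: Int = 0

  override def read(buf: Array[Char], off: Int, len: Int): Int = {
    if (pos >= current.length) {    // current line exhausted: refill
      if (!iter.hasNext) return -1  // end of stream
      current = iter.next() + "\n"  // restore the newline the iterator stripped
      pos = 0
    }
    val n = math.min(len, current.length - pos)
    current.getChars(pos, pos + n, buf, off)
    pos += n
    n
  }

  override def close(): Unit = ()
}
```

Removing a wrapper like this means the parser can simply be handed each element of the iterator directly.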
> I made a rough patch and tested this. The test results for the first problem 
> are below:
> h4. Results
> - Original codes with {{Reader}} wrapping {{Iterator}}
> ||End-to-end (ns)||Parse Time (ns)||
> | 14116265034 | 2008277960 |
> - New codes with {{Iterator}}
> ||End-to-end (ns)||Parse Time (ns)||
> | 13451699644 | 1549050564 |
> In more detail:
> h4. Method
> - The TPC-H lineitem table was used for testing.
> - The results were collected over the first 1,000,000 rows only.
> - The end-to-end tests and parsing time tests were performed 10 times and the 
> averages were calculated for each.
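The measurement loop implied above can be sketched as follows (an assumption: a plain arithmetic mean over the runs; JIT warm-up and outlier handling are omitted).

```scala
// Averages the elapsed time of `f` over `runs` executions, in nanoseconds.
object AverageTiming {
  def averageNanos[A](runs: Int)(f: => A): Double = {
    val samples = (1 to runs).map { _ =>
      val start = System.nanoTime
      f                           // evaluate the body under test
      System.nanoTime - start
    }
    samples.sum.toDouble / runs   // mean elapsed time in nanoseconds
  }

  def main(args: Array[String]): Unit = {
    val avg = averageNanos(10)((1 to 1000).sum)
    println(f"average: $avg%.0f ns")
  }
}
```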
> h4. Environment
> - Machine: MacBook Pro Retina
> - CPU: 4
> - Memory: 8GB
> h4. Dataset
> - [TPC-H|http://www.tpc.org/tpch/] Lineitem Table created with factor 1 
> ([generate data|https://github.com/rxin/TPC-H-Hive/tree/master/dbgen])
> - Size: 724.66 MB
> h4. Test Code
> - Function to measure time
> {code}
> def time[A](f: => A) = {
>   val s = System.nanoTime
>   val ret = f
>   println("time: " + (System.nanoTime - s) / 1e6 + "ms")
>   ret
> }
> {code}
> - End-to-end test
> {code}
> val path = "lineitem.tbl"
> val df = sqlContext
>       .read
>       .format("csv")
>       .option("header", "false")
>       .option("delimiter", "|")
>       .load(path)
> time(df.take(1000000))
> {code}
> - Parsing time test for original (in {{BulkCsvParser}})
> {code}
> ...
> // `reader` is a wrapper for an Iterator.
> private val reader = new StringIteratorReader(iter)
> parser.beginParsing(reader)
> ...
> time(parser.parseNext())
> ...
> {code}
> - Parsing time test for new (in {{BulkCsvParser}})
> {code}
> ...
> time(parser.parseLine(iter.next()))
> ...
> {code}
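The two code paths above differ only in how the parser is fed. A toy sketch of the new line-by-line path (assumption: a naive {{'|'}} splitter stands in for the real {{parser.parseLine}}, and quoting/escaping is ignored):

```scala
// Each line from the iterator is parsed directly, with no Reader wrapper
// in between. The splitter below is a hypothetical stand-in for the parser.
object LineByLineSketch {
  def parseLine(line: String): Array[String] = line.split('|')

  def main(args: Array[String]): Unit = {
    val iter = Iterator("1|apple|0.5", "2|banana|0.25")
    val parsed = iter.map(parseLine).toArray
    println(parsed.map(_.mkString(",")).mkString("; ")) // 1,apple,0.5; 2,banana,0.25
  }
}
```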



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
