[jira] [Updated] (SPARK-5863) Performance regression in Spark SQL/Parquet due to ScalaReflection.convertRowToScala

2015-02-19 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-5863:
---
Priority: Critical  (was: Major)

 Performance regression in Spark SQL/Parquet due to 
 ScalaReflection.convertRowToScala
 

 Key: SPARK-5863
 URL: https://issues.apache.org/jira/browse/SPARK-5863
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0, 1.2.1
Reporter: Cristian
Priority: Critical

 While perf testing Parquet reads, I noticed that performance is 3x worse 
 after moving from Spark 1.1 to 1.2. The profiler points to 
 ScalaReflection.convertRowToScala as the culprit.
 In particular, this zip is the issue:
 {code}
 r.toSeq.zip(schema.fields.map(_.dataType))
 {code}
 There is already a comment on that line noting that it is slow, but it was 
 never fixed. It produces a 3x degradation in Parquet read performance, at 
 least in my test case.
 Edit: the map is part of the issue as well. This whole block runs in a 
 tight loop, and each transformation allocates a new ListBuffer that has to 
 grow. A possible solution is to use seq.view, which would allocate 
 iterators instead.
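
To make the allocation pattern above concrete, here is a minimal sketch in plain Scala. It is not Spark's code: `DataType`, `Field`, `Schema`, and `convert` are hypothetical stand-ins for the Spark SQL types, and the index-based loop is only one possible alternative to the zip (the report itself suggests seq.view).

```scala
// Hypothetical stand-ins for Spark SQL's schema types; not the real classes.
sealed trait DataType
case object IntType extends DataType
case object StringType extends DataType
final case class Field(name: String, dataType: DataType)
final case class Schema(fields: Array[Field])

object RowConversion {
  // Placeholder per-value conversion, assumed for illustration only.
  def convert(value: Any, dataType: DataType): Any = dataType match {
    case StringType => value.toString
    case IntType    => value
  }

  // Same shape as the reported line: for every row this builds an array of
  // data types, a collection of (value, type) pairs, and the result.
  def convertWithZip(row: Seq[Any], schema: Schema): Seq[Any] =
    row.zip(schema.fields.map(_.dataType)).map { case (v, t) => convert(v, t) }

  // Alternative sketch: the caller extracts the data types once per schema,
  // and each row is converted by index into one preallocated array.
  def convertIndexed(row: IndexedSeq[Any], dataTypes: Array[DataType]): Array[Any] = {
    val out = new Array[Any](row.length)
    var i = 0
    while (i < row.length) {
      out(i) = convert(row(i), dataTypes(i))
      i += 1
    }
    out
  }
}
```

With this shape, `schema.fields.map(_.dataType)` would be computed once outside the per-row loop and passed to `convertIndexed` for every row, so the only per-row allocation left is the output array.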



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5863) Performance regression in Spark SQL/Parquet due to ScalaReflection.convertRowToScala

2015-02-17 Thread Cristian O (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cristian O updated SPARK-5863:
--
Description: 
While perf testing Parquet reads, I noticed that performance is 3x worse 
after moving from Spark 1.1 to 1.2. The profiler points to 
ScalaReflection.convertRowToScala as the culprit.

In particular, this zip is the issue:

{code}
r.toSeq.zip(schema.fields.map(_.dataType))
{code}

There is already a comment on that line noting that it is slow, but it was 
never fixed. It produces a 3x degradation in Parquet read performance, at 
least in my test case.

Edit: the map is part of the issue as well. This whole block runs in a tight 
loop, and each transformation allocates a new ListBuffer that has to grow. 
A possible solution is to use seq.view, which would allocate iterators 
instead.


  was:
While perf testing Parquet reads, I noticed that performance is 3x worse 
after moving from Spark 1.1 to 1.2. The profiler points to 
ScalaReflection.convertRowToScala as the culprit.

In particular, this zip is the issue:

{code}
r.toSeq.zip(schema.fields.map(_.dataType))
{code}

There is already a comment on that line noting that it is slow, but it was 
never fixed. It produces a 3x degradation in Parquet read performance, at 
least in my test case.



 Performance regression in Spark SQL/Parquet due to 
 ScalaReflection.convertRowToScala
 

 Key: SPARK-5863
 URL: https://issues.apache.org/jira/browse/SPARK-5863
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0, 1.2.1
Reporter: Cristian O



