[jira] [Commented] (SPARK-7447) Large Job submission lag when using Parquet w/ Schema Merging

2015-05-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14535018#comment-14535018
 ] 

Apache Spark commented on SPARK-7447:
-------------------------------------

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/6012

 Large Job submission lag when using Parquet w/ Schema Merging
 --------------------------------------------------------------

 Key: SPARK-7447
 URL: https://issues.apache.org/jira/browse/SPARK-7447
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Spark Core, Spark Submit
Affects Versions: 1.3.0, 1.3.1
 Environment: Spark 1.3.1, AWS, persistent HDFS version 2 with EBS
 storage, PySpark, 8 x c3.8xlarge nodes.
 spark-conf:
 spark.executor.memory 50g
 spark.driver.cores 32
 spark.driver.memory 50g
 spark.default.parallelism 512
 spark.sql.shuffle.partitions 512
 spark.task.maxFailures 30
 spark.executor.logs.rolling.maxRetainedFiles 2
 spark.executor.logs.rolling.size.maxBytes 102400
 spark.executor.logs.rolling.strategy size
 spark.shuffle.spill false
 spark.sql.parquet.cacheMetadata true
 spark.sql.parquet.filterPushdown true
 spark.sql.codegen true
 spark.akka.threads 64
Reporter: Brad Willard
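
For context, a minimal PySpark sketch (not from the issue) of how the runtime-settable values above could be applied; memory, core, and Akka settings normally have to be given in spark-defaults.conf or on the spark-submit command line, and the app name below is illustrative:

    from pyspark import SparkConf, SparkContext
    from pyspark.sql import SQLContext

    # Settings that can be passed when the context is created.
    conf = (SparkConf()
            .setAppName("parquet-schema-merge")          # illustrative name
            .set("spark.default.parallelism", "512")
            .set("spark.task.maxFailures", "30"))
    sc = SparkContext(conf=conf)

    # SQL-specific options from the environment above, set per session.
    sqlContext = SQLContext(sc)
    sqlContext.setConf("spark.sql.shuffle.partitions", "512")
    sqlContext.setConf("spark.sql.parquet.cacheMetadata", "true")
    sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")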

 I have 2.6 billion rows in Parquet format and I'm trying to use the new 
 schema merging feature (in 0.8-1.2 I was enforcing a consistent schema 
 manually, which was annoying). 
 I have approximately 200 Parquet files partitioned by key=date. When I load 
 the dataframe with the sqlContext, that process is understandably slow, 
 because I assume it's reading all the metadata from the Parquet files and 
 doing the initial schema merging. So that's OK.
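 A minimal sketch of that load, assuming a layout like hdfs:///data/events/key=2015-05-01/part-*.parquet (the path is illustrative, not from the issue); in Spark 1.3, parquetFile() discovers the key= partitions and merges the per-file schemas while building the DataFrame:

    # Illustrative path; 1.3 merges schemas across the ~200 files at load time.
    df = sqlContext.parquetFile("hdfs:///data/events")
    df.printSchema()   # the merged schema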
 However, the problem is that once I have the dataframe, any operation on it 
 has a 10-30 second lag before it actually starts processing the job and 
 shows up as an Active Job in the Spark UI. This was an instant operation in 
 all previous versions of Spark. Once the job is actually running the 
 performance is fantastic, but this job submission lag is horrible.
 I'm wondering if there is a bug in recomputing the schema merge. Running 
 top on the master node shows a single thread maxed out on one CPU during the 
 lag, which makes me think it's not network I/O but some pre-processing 
 happening before job submission.
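 One rough way to observe the lag (the filter value below is illustrative, not from the issue) is to time a trivial action and compare the elapsed wall-clock time with when the job first appears as Active in the web UI:

    import time

    start = time.time()
    # Filter on the key=date partition column, then count. On 1.3.x with
    # merged schemas, the reported 10-30 s pass before any job shows up.
    n = df.filter(df.key == "2015-05-01").count()
    print("rows: %d, elapsed: %.1f s" % (n, time.time() - start))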



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7447) Large Job submission lag when using Parquet w/ Schema Merging

2015-05-08 Thread Brad Willard (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14535208#comment-14535208
 ] 

Brad Willard commented on SPARK-7447:
-------------------------------------

Thanks, you are a hero.
