[ 
https://issues.apache.org/jira/browse/SPARK-7447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14535018#comment-14535018
 ] 

Apache Spark commented on SPARK-7447:
-------------------------------------

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/6012

> Large Job submission lag when using Parquet w/ Schema Merging
> -------------------------------------------------------------
>
>                 Key: SPARK-7447
>                 URL: https://issues.apache.org/jira/browse/SPARK-7447
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, Spark Core, Spark Submit
>    Affects Versions: 1.3.0, 1.3.1
>         Environment: Spark 1.3.1, aws, persistent hdfs version 2 with ebs 
> storage, pyspark, 8 x c3.8xlarge nodes. 
> spark-conf
> spark.executor.memory 50g
> spark.driver.cores 32
> spark.driver.memory 50g
> spark.default.parallelism 512
> spark.sql.shuffle.partitions 512
> spark.task.maxFailures  30
> spark.executor.logs.rolling.maxRetainedFiles 2
> spark.executor.logs.rolling.size.maxBytes 102400
> spark.executor.logs.rolling.strategy size
> spark.shuffle.spill false
> spark.sql.parquet.cacheMetadata true
> spark.sql.parquet.filterPushdown true
> spark.sql.codegen true
> spark.akka.threads 64
>            Reporter: Brad Willard
>
> I have 2.6 billion rows in Parquet format and I'm trying to use the new 
> schema merging feature (I was enforcing a consistent schema manually before, 
> in 0.8-1.2, which was annoying). 
> I have approximately 200 Parquet files partitioned by key=<date>. When I load 
> the DataFrame with the SQLContext, that process is understandably slow, 
> because I assume it's reading all the metadata from the Parquet files and 
> doing the initial schema merging. So that's OK.
> However, the problem is that once I have the DataFrame, doing any operation 
> on it has a 10-30 second lag before it actually starts processing and shows 
> up as an Active Job in the Spark UI. This was an instant operation in all 
> previous versions of Spark. Once the job is actually running, the performance 
> is fantastic, but this job submission lag is terrible.
> I'm wondering if there is a bug in recomputing the merged schema. Running 
> top on the master node shows one thread maxed out on a single CPU during the 
> lag, which makes me think it's not network I/O but some pre-processing on 
> the driver before job submission.
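For reference, a minimal PySpark sketch of the setup described above (a hypothetical reproduction, not the reporter's actual code: the HDFS path, app name, and partition layout are placeholder assumptions; uses the Spark 1.3-era API, where parquetFile merges schemas across part-files):

```python
# Hypothetical reproduction sketch for SPARK-7447 (Spark 1.3.x API).
# The path "hdfs:///data/events" and the app name are placeholders.
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="parquet-schema-merge-lag")
sqlContext = SQLContext(sc)

# ~200 Parquet files laid out as .../key=<date>/...; in Spark 1.3,
# parquetFile reads the footers of the part-files and merges their schemas.
df = sqlContext.parquetFile("hdfs:///data/events")

# Any action on the merged DataFrame exhibits the reported 10-30 s delay
# before the job appears as Active in the UI.
df.count()
```

If the lag were driven by footer re-reads or schema re-merging on every action, it would show up exactly here: as single-threaded driver-side CPU time between calling count() and the job becoming Active.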



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
