[ https://issues.apache.org/jira/browse/SPARK-7447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-7447:
-----------------------------------

    Assignee:  (was: Apache Spark)

> Large Job submission lag when using Parquet w/ Schema Merging
> -------------------------------------------------------------
>
>                 Key: SPARK-7447
>                 URL: https://issues.apache.org/jira/browse/SPARK-7447
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, Spark Core, Spark Submit
>    Affects Versions: 1.3.0, 1.3.1
>         Environment: Spark 1.3.1, AWS, persistent HDFS version 2 with EBS storage, PySpark, 8 x c3.8xlarge nodes.
> spark-conf:
> spark.executor.memory 50g
> spark.driver.cores 32
> spark.driver.memory 50g
> spark.default.parallelism 512
> spark.sql.shuffle.partitions 512
> spark.task.maxFailures 30
> spark.executor.logs.rolling.maxRetainedFiles 2
> spark.executor.logs.rolling.size.maxBytes 102400
> spark.executor.logs.rolling.strategy size
> spark.shuffle.spill false
> spark.sql.parquet.cacheMetadata true
> spark.sql.parquet.filterPushdown true
> spark.sql.codegen true
> spark.akka.threads 64
>            Reporter: Brad Willard
>
> I have 2.6 billion rows in Parquet format and I'm trying to use the new
> schema merging feature (in 0.8-1.2 I was enforcing a consistent schema
> manually, which was annoying).
> I have approximately 200 Parquet files partitioned with key=<date>. Loading
> the dataframe with the SQLContext is understandably slow, because I assume
> it's reading all the metadata from the Parquet files and doing the initial
> schema merging. So that's OK.
> The problem is that once I have the dataframe, any operation on it has a
> 10-30 second lag before it actually starts processing and shows up as an
> Active Job in the Spark Manager. This was instant in all previous versions
> of Spark. Once the job is actually running, performance is fantastic; the
> job submission lag, however, is horrible.
> I'm wondering if there is a bug with recomputing the schema merging.
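For context, Parquet schema merging conceptually takes the union of the per-file schemas, failing when two files disagree on a field's type. A minimal pure-Python sketch of that idea (illustration only, with hypothetical field names; this is not Spark's actual implementation):

```python
# Conceptual sketch of Parquet schema merging: the merged schema is the
# union of each part-file's fields; a name mapped to two different types
# is a conflict. Spark's real merge logic is more involved than this.

def merge_schemas(schemas):
    """Merge a list of {field_name: type} dicts into one schema."""
    merged = {}
    for schema in schemas:
        for name, dtype in schema.items():
            if name in merged and merged[name] != dtype:
                raise ValueError(
                    "conflicting types for field %r: %s vs %s"
                    % (name, merged[name], dtype))
            merged[name] = dtype
    return merged

# Example: two part-files with overlapping but unequal schemas.
merged = merge_schemas([
    {"id": "int64", "ts": "int96"},
    {"id": "int64", "value": "double"},
])
print(sorted(merged))  # ['id', 'ts', 'value']
```

In Spark 1.3 this merge is performed when the dataframe is first loaded (e.g. via `sqlContext.parquetFile(path)`), which is why the initial load over ~200 part-files is expected to be slow; the reported issue is that the cost appears to recur on every job submission.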
> Running top on the master node during the lag shows a single thread maxed
> out on one CPU, which makes me think the bottleneck is not network I/O but
> some pre-processing step before job submission.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org