[jira] [Assigned] (SPARK-7447) Large Job submission lag when using Parquet w/ Schema Merging
[ https://issues.apache.org/jira/browse/SPARK-7447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-7447:

Assignee: (was: Apache Spark)

Large Job submission lag when using Parquet w/ Schema Merging

Key: SPARK-7447
URL: https://issues.apache.org/jira/browse/SPARK-7447
Project: Spark
Issue Type: Bug
Components: PySpark, Spark Core, Spark Submit
Affects Versions: 1.3.0, 1.3.1
Environment: Spark 1.3.1, AWS, persistent HDFS version 2 with EBS storage, PySpark, 8 x c3.8xlarge nodes.

spark-conf:
  spark.executor.memory 50g
  spark.driver.cores 32
  spark.driver.memory 50g
  spark.default.parallelism 512
  spark.sql.shuffle.partitions 512
  spark.task.maxFailures 30
  spark.executor.logs.rolling.maxRetainedFiles 2
  spark.executor.logs.rolling.size.maxBytes 102400
  spark.executor.logs.rolling.strategy size
  spark.shuffle.spill false
  spark.sql.parquet.cacheMetadata true
  spark.sql.parquet.filterPushdown true
  spark.sql.codegen true
  spark.akka.threads 64

Reporter: Brad Willard

I have 2.6 billion rows in Parquet format, and I'm trying to use the new schema-merging feature (in 0.8-1.2 I had to enforce a consistent schema manually, which was annoying). I have approximately 200 Parquet files partitioned by key=date.

Loading the dataframe with the sqlContext is understandably slow, because I assume it reads all the metadata from the Parquet files and does the initial schema merge. That part is fine. The problem is that once I have the dataframe, any operation on it has a 10-30 second lag before it actually starts processing and shows up as an Active Job in the Spark UI. This was an instant operation in all previous versions of Spark. Once the job is actually running, performance is fantastic, but the job submission lag is horrible. I'm wondering if there is a bug that recomputes the schema merge on every submission.
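The guess above, that the merged schema is recomputed before each job rather than cached, would fit the symptoms: merging N per-file schemas is linear in the number of files, so every submission would pay the full per-file cost. A rough illustration only, in plain Python with a hand-rolled `merge_schemas` helper, not Spark's actual Parquet code:

```python
def merge_schemas(schemas):
    """Merge a list of per-file schemas (name -> type dicts) into one.

    Raises ValueError on conflicting types, mirroring how Parquet
    schema merging must reconcile every file footer it reads.
    """
    merged = {}
    for schema in schemas:           # one pass per file: O(files)
        for name, dtype in schema.items():
            if name in merged and merged[name] != dtype:
                raise ValueError("conflicting type for column %r" % name)
            merged[name] = dtype
    return merged

# ~200 date-partitioned files, as in the report; one file adds a column.
files = [{"id": "long", "date": "string"} for _ in range(200)]
files[100]["new_col"] = "double"
merged = merge_schemas(files)
```

If a merge like this is redone before every job instead of being cached once, the cost scales with file count and runs single-threaded on the driver, consistent with the one maxed-out CPU observed below.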
Running top on the master node shows a single thread maxed out on one CPU during the lag, which makes me think it's not network I/O but some pre-processing on the driver before job submission.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7447) Large Job submission lag when using Parquet w/ Schema Merging
[ https://issues.apache.org/jira/browse/SPARK-7447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-7447:

Assignee: Apache Spark