[jira] [Updated] (SPARK-24316) Spark sql queries stall for column width more than 6k for parquet based table
[ https://issues.apache.org/jira/browse/SPARK-24316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-24316:
----------------------------------
    Affects Version/s:     (was: 3.0.0)
                       3.1.0

> Spark sql queries stall for column width more than 6k for parquet based table
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-24316
>                 URL: https://issues.apache.org/jira/browse/SPARK-24316
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Bimalendu Choudhary
>            Priority: Major
>
> When we create a table from a DataFrame using Spark SQL with around 6k or
> more columns, even a simple query fetching 70k rows takes 20 minutes, while
> the same query on the same table created through Hive with the same data
> takes just 5 minutes.
>
> Instrumenting the code, we see that the executors are looping in the while
> loop of the function initializeInternal(). The majority of the time is spent
> in the for loop in the code below, which iterates over the columns, and the
> executor appears to be stalled for a long time.
>
> {code:java|title=spark/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java|borderStyle=solid}
> private void initializeInternal() ..
>  ..
>  for (int i = 0; i < requestedSchema.getFieldCount(); ++i)
>  { ... }
> }
> {code}
>
> When Spark SQL creates the table, it also stores the metadata in
> TBLPROPERTIES in JSON format. If we remove this metadata from the table, the
> queries become fast, which is also the case when we create the same table
> through Hive. The exact same table takes 5 times longer with the JSON
> metadata than without it.
>
> So it looks like once the number of columns grows beyond 5k to 6k, processing
> and comparing the metadata becomes more and more expensive, and performance
> degrades drastically.
>
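
To make the suspected scaling concrete, here is a toy Python model of a per-column loop. It is not Spark's code. It assumes, purely for illustration, that each iteration looks up one requested column by scanning every column of the file schema, and compares that with a variant that indexes the file columns once. The function names and the linear-scan assumption are hypothetical; the report does not identify which operation inside the loop is slow.

```python
def flag_missing_linear(requested, file_columns):
    """Flag requested columns absent from the file, scanning the list each time.

    Hypothetical model: cost grows with len(requested) * len(file_columns).
    """
    return [name not in file_columns for name in requested]


def flag_missing_indexed(requested, file_columns):
    """Same result, but the file columns are indexed once in a set."""
    present = set(file_columns)
    return [name not in present for name in requested]


def worst_case_comparisons(num_requested, num_file_columns):
    """Upper bound on name comparisons made by the linear variant."""
    return num_requested * num_file_columns
```

Under this assumption, 6,000 requested columns checked against 6,000 file columns give an upper bound of 36,000,000 name comparisons per initialization, versus roughly 6,000 set lookups for the indexed variant.
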
> To recreate the problem, simply run the following query:
>   import org.apache.spark.sql.SparkSession
>   val resp_data = spark.sql("SELECT * FROM duplicatefgv limit 7")
>   resp_data.write.format("csv").save("/tmp/filename")
>
> The table should be created by Spark SQL from a DataFrame so that the JSON
> metadata is stored. For example:
>   val dff = spark.read.format("csv").load("hdfs:///tmp/test.csv")
>   dff.createOrReplaceTempView("my_temp_table")
>   val tmp = spark.sql("Create table tableName stored as parquet as select * from my_temp_table")
>
>   from pyspark.sql import SQLContext
>   sqlContext = SQLContext(sc)
>   resp_data = spark.sql("select * from test").limit(2000)
>   print(resp_data.count())
>   resp_data.write.option('header', False).mode('overwrite').csv('/tmp/2.csv')
>
> The performance seems to be even worse in the loop if the schema does not
> match or the fields are empty, so that the code enters the if branch where
> the missing column is marked true:
>   missingColumns[i] = true;
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
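
The reproduction steps assume that a CSV file with roughly 6k columns already exists. A minimal sketch for generating such a file locally with the Python standard library is shown below. The function name, the default sizes, and the integer cell values are illustrative assumptions, and the file would still need to be copied to the HDFS path used in the report. No header row is written, because the Scala snippet loads the file without a header option.

```python
import csv


def write_wide_csv(path, num_columns=6000, num_rows=100):
    """Write a headerless CSV with num_columns integer cells per row."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        for row in range(num_rows):
            # Each cell is a small integer derived from its row and column position.
            writer.writerow([row * num_columns + col for col in range(num_columns)])
    return path
```

For example, `write_wide_csv("/tmp/test.csv")` writes 100 rows of 6,000 columns each; raising `num_rows` toward the 70k rows mentioned in the report makes the file correspondingly larger.
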
[jira] [Updated] (SPARK-24316) Spark sql queries stall for column width more than 6k for parquet based table
[ https://issues.apache.org/jira/browse/SPARK-24316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-24316:
----------------------------------
    Affects Version/s:     (was: 2.4.0)
                           (was: 2.2.1)
                           (was: 2.3.0)
                           (was: 2.2.0)
                       3.0.0

> Spark sql queries stall for column width more than 6k for parquet based table
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-24316
>                 URL: https://issues.apache.org/jira/browse/SPARK-24316
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Bimalendu Choudhary
>            Priority: Major
[jira] [Updated] (SPARK-24316) Spark sql queries stall for column width more than 6k for parquet based table
[ https://issues.apache.org/jira/browse/SPARK-24316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bimalendu Choudhary updated SPARK-24316:
----------------------------------------
    Summary: Spark sql queries stall for column width more than 6k for parquet based table  (was: Spark sql queries stall for column width more 6k for parquet based table)

> Spark sql queries stall for column width more than 6k for parquet based table
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-24316
>                 URL: https://issues.apache.org/jira/browse/SPARK-24316
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.2.0, 2.2.1, 2.3.0, 2.4.0
>            Reporter: Bimalendu Choudhary
>            Priority: Major