[ https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15386395#comment-15386395 ]
Michael Allman commented on SPARK-16320:
----------------------------------------

The code path for reading data from parquet files has been refactored extensively. The fact that [~maver1ck] is testing performance on a table with 400 partitions makes me wonder if my PR for https://issues.apache.org/jira/browse/SPARK-15968 will make a difference for repeated queries on partitioned tables. That PR was merged into master and backported to 2.0. The commit short hash is d5d2457.

Another issue that's in play when querying Hive metastore tables with a large number of partitions is the time it takes to query the Hive metastore for the table's partition metadata. Especially if your metastore is not configured to use direct sql, this in and of itself can take from seconds to minutes. With 400 partitions and without direct sql, you might see 10 to 20 seconds or so of query planning time.

It would be helpful to look at query times minus the time spent in query planning. Since query planning happens in the driver before the job's stages are launched, you can estimate this by looking at the actual stage times in your SQL query job.
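One rough way to separate driver-side planning time from stage execution time is sketched below. This is only a sketch, assuming the PySpark setup from the script quoted underneath ({{sqlctx}}, {{path}}, {{MAX_SIZE}}). It uses {{explain()}} to force analysis and physical planning without launching a job, so the first timing only approximates the planning cost; and because the quoted script reads the parquet files directly rather than through the metastore, it isolates planning versus execution but not the metastore call itself:

{code}
import time

# Build the example query from the quoted script below.
df = sqlctx.read.parquet(path)
query = df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2)))

# explain() forces the analyzed/optimized/physical plans to be computed
# without running any stages, so this roughly measures planning time.
start = time.time()
query.explain()
planning_secs = time.time() - start

# The full action includes planning plus actual stage execution.
start = time.time()
result = query.count()
total_secs = time.time() - start

print("count = {}, approx planning = {:.1f}s, total = {:.1f}s".format(
    result, planning_secs, total_secs))
{code}

The per-stage times in the web UI for the corresponding job give the execution side directly, which is what I'd compare between 1.6 and 2.0.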
> Spark 2.0 slower than 1.6 when querying nested columns
> -------------------------------------------------------
>
>                 Key: SPARK-16320
>                 URL: https://issues.apache.org/jira/browse/SPARK-16320
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Maciej Bryński
>            Priority: Critical
>
> I did some test on parquet file with many nested columns (about 30G in 400 partitions) and Spark 2.0 is sometimes slower.
> I tested following queries:
> 1) {code}select count(*) where id > some_id{code}
> In this query performance is similar. (about 1 sec)
> 2) {code}select count(*) where nested_column.id > some_id{code}
> Spark 1.6 -> 1.6 min
> Spark 2.0 -> 2.1 min
> Should I expect such a drop in performance ?
> I don't know how to prepare sample data to show the problem.
> Any ideas ? Or public data with many nested columns ?
> *UPDATE*
> I created script to generate data and to confirm this problem.
> {code}
> #Initialization
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import HiveContext
> from pyspark.sql.functions import struct
>
> conf = SparkConf()
> conf.set('spark.cores.max', 15)
> conf.set('spark.executor.memory', '30g')
> conf.set('spark.driver.memory', '30g')
> sc = SparkContext(conf=conf)
> sqlctx = HiveContext(sc)
>
> #Data creation
> MAX_SIZE = 2**32 - 1
> path = '/mnt/mfs/parquet_nested'
>
> def create_sample_data(levels, rows, path):
>
>     def _create_column_data(cols):
>         import random
>         random.seed()
>         return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in range(cols)}
>
>     def _create_sample_df(cols, rows):
>         rdd = sc.parallelize(range(rows))
>         data = rdd.map(lambda r: _create_column_data(cols))
>         df = sqlctx.createDataFrame(data)
>         return df
>
>     def _create_nested_data(levels, rows):
>         if len(levels) == 1:
>             return _create_sample_df(levels[0], rows).cache()
>         else:
>             df = _create_nested_data(levels[1:], rows)
>             return df.select([struct(df.columns).alias("column{}".format(i)) for i in range(levels[0])])
>
>     df = _create_nested_data(levels, rows)
>     df.write.mode('overwrite').parquet(path)
>
> #Sample data
> create_sample_data([2,10,200], 1000000, path)
>
> #Query
> df = sqlctx.read.parquet(path)
>
> %%timeit
> df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count()
> {code}
> Results
> Spark 1.6
> 1 loop, best of 3: *1min 5s* per loop
> Spark 2.0
> 1 loop, best of 3: *1min 21s* per loop

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org