[ https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15386395#comment-15386395 ]

Michael Allman commented on SPARK-16320:
----------------------------------------

The code path for reading data from Parquet files has been refactored 
extensively. The fact that [~maver1ck] is testing performance on a table with 
400 partitions makes me wonder whether my PR for 
https://issues.apache.org/jira/browse/SPARK-15968 will make a difference for 
repeated queries on partitioned tables. That PR was merged into master and 
backported to 2.0. The commit short hash is d5d2457.
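
A rough way to check whether that change matters here (just a sketch, not a 
definitive benchmark) is to run the same query twice in one session and 
compare the wall-clock times; if the partition metadata caching from that PR 
kicks in, the second run should spend noticeably less time in planning. The 
table name below is a placeholder for a partitioned metastore table:

{code}
import time

# Sketch only: assumes sqlctx (a HiveContext) as in the reporter's script and
# a partitioned Hive table; "your_partitioned_table" is a placeholder name.
df = sqlctx.table("your_partitioned_table")
query = df.where("nested_column.id > 0")

for run in (1, 2):
    start = time.time()
    query.count()
    print("run {}: {:.1f}s".format(run, time.time() - start))
{code}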

Another issue that's in play when querying Hive metastore tables with a large 
number of partitions is the time it takes to fetch the table's partition 
metadata from the metastore. Especially if your metastore is not configured to 
use direct SQL, this alone can take from seconds to minutes. With 400 
partitions and without direct SQL, you might see roughly 10 to 20 seconds of 
query planning time. It would be helpful to look at query times minus the time 
spent in query planning. Since query planning happens in the driver before the 
job's stages are launched, you can estimate this by looking at the actual 
stage times in your SQL query job.
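
For example (a minimal sketch, reusing sc, sqlctx, path and MAX_SIZE from the 
reporter's script): tag the job with a job group, time the query end to end in 
the driver, and compare that against the stage durations shown for that job in 
the web UI. The difference is roughly the driver-side planning (and, for 
metastore tables, partition metadata lookup) time:

{code}
import time

# Tag the job so its stages are easy to find in the web UI.
sc.setJobGroup("nested-filter", "count with nested-column filter")

start = time.time()
df = sqlctx.read.parquet(path)
df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count()
total = time.time() - start

# Compare 'total' against the stage durations for this job in the web UI;
# the difference approximates time spent in query planning.
print("end-to-end: {:.1f}s".format(total))
{code}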

> Spark 2.0 slower than 1.6 when querying nested columns
> ------------------------------------------------------
>
>                 Key: SPARK-16320
>                 URL: https://issues.apache.org/jira/browse/SPARK-16320
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Maciej BryƄski
>            Priority: Critical
>
> I did some tests on a Parquet file with many nested columns (about 30 GB in
> 400 partitions), and Spark 2.0 is sometimes slower.
> I tested the following queries:
> 1) {code}select count(*) where id > some_id{code}
> For this query, performance is similar (about 1 sec).
> 2) {code}select count(*) where nested_column.id > some_id{code}
> Spark 1.6 -> 1.6 min
> Spark 2.0 -> 2.1 min
> Should I expect such a drop in performance?
> I don't know how to prepare sample data to show the problem.
> Any ideas? Or public data with many nested columns?
> *UPDATE*
> I created a script to generate the data and confirm the problem.
> {code}
> #Initialization
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import HiveContext
> from pyspark.sql.functions import struct
> conf = SparkConf()
> conf.set('spark.cores.max', 15)
> conf.set('spark.executor.memory', '30g')
> conf.set('spark.driver.memory', '30g')
> sc = SparkContext(conf=conf)
> sqlctx = HiveContext(sc)
> #Data creation
> MAX_SIZE = 2**32 - 1
> path = '/mnt/mfs/parquet_nested'
> def create_sample_data(levels, rows, path):
>     
>     def _create_column_data(cols):
>         import random
>         random.seed()
>         return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in 
> range(cols)}
>     
>     def _create_sample_df(cols, rows):
>         rdd = sc.parallelize(range(rows))           
>         data = rdd.map(lambda r: _create_column_data(cols))
>         df = sqlctx.createDataFrame(data)
>         return df
>     
>     def _create_nested_data(levels, rows):
>         if len(levels) == 1:
>             return _create_sample_df(levels[0], rows).cache()
>         else:
>             df = _create_nested_data(levels[1:], rows)
>             return df.select([struct(df.columns).alias("column{}".format(i))
>                               for i in range(levels[0])])
>     df = _create_nested_data(levels, rows)
>     df.write.mode('overwrite').parquet(path)
>     
> #Sample data
> create_sample_data([2,10,200], 1000000, path)
> #Query
> df = sqlctx.read.parquet(path)
> %%timeit
> df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count()
> {code}
> Results
> Spark 1.6
> 1 loop, best of 3: *1min 5s* per loop
> Spark 2.0
> 1 loop, best of 3: *1min 21s* per loop


