Hi Harsh,

Thanks a lot for your reply.
I added a predicate to my query to select a single partition of the table, and tested with the "spark.sql.hive.metastorePartitionPruning" setting both on and off: there is no difference in DataFrame creation time.

Yes, Michael's proposed workaround works. But I was under the impression that this workaround was only needed for Spark versions < 1.5. With the Hive metastore partition pruning feature from Spark 1.5, I thought there would be no more delay, so I could create DataFrames left and right.

I noticed that regardless of the setting, when I create a DataFrame with or without a predicate, I get a log message from the HadoopFsRelation class which lists HDFS file paths to ALL partitions in my table (see the logInfo call in the code
<https://github.com/apache/spark/blob/branch-1.5/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala#L770>).
Is this expected? I am not sure what is creating this Array of file paths, but I am guessing this is the source of my delay.

Thanks,
Isabelle

On Thu, Dec 10, 2015 at 7:38 PM, Harsh J <ha...@cloudera.com> wrote:

> The option "spark.sql.hive.metastorePartitionPruning=true" will not
> work unless you have a partition-column predicate in your query. Your
> query of "select * from temp.log" does not do this. The slowdown appears
> to be due to the need to load all partition metadata.
>
> Have you also tried to see if Michael's temp-table suggestion helps you
> cache the expensive partition lookup? (re-quoted below)
>
> """
> If you run sqlContext.table("...").registerTempTable("...") that temp table
> will cache the lookup of partitions [the first time is slow, but subsequent
> lookups will be faster].
> """ - X-Ref: Permalink
> <https://issues.apache.org/jira/browse/SPARK-6910?focusedCommentId=14529666&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14529666>
>
> Also, do you absolutely need to use "select * from temp.log"?
> Adding a where clause to the query with a partition condition will help
> Spark prune the request to just the required partitions (vs. all, which
> is proving expensive).
>
> On Fri, Dec 11, 2015 at 3:59 AM Isabelle Phan <nlip...@gmail.com> wrote:
>
>> Hi Michael,
>>
>> We have just upgraded to Spark 1.5.0 (actually 1.5.0_cdh-5.5, since we
>> are on Cloudera) and Parquet-formatted tables. I turned on
>> spark.sql.hive.metastorePartitionPruning=true, but DataFrame creation
>> still takes a long time.
>> Is there any other configuration to consider?
>>
>> Thanks a lot for your help,
>>
>> Isabelle
>>
>> On Fri, Sep 4, 2015 at 1:42 PM, Michael Armbrust <mich...@databricks.com>
>> wrote:
>>
>>> If you run sqlContext.table("...").registerTempTable("...") that
>>> temp table will cache the lookup of partitions.
>>>
>>> On Fri, Sep 4, 2015 at 1:16 PM, Isabelle Phan <nlip...@gmail.com> wrote:
>>>
>>>> Hi Michael,
>>>>
>>>> Thanks a lot for your reply.
>>>>
>>>> This table is stored as a text file with tab-delimited columns.
>>>>
>>>> You are correct: the problem is that my table has too many
>>>> partitions (1825 in total). Since I am on Spark 1.4, I think I am
>>>> hitting SPARK-6984 <https://issues.apache.org/jira/browse/SPARK-6984>.
>>>>
>>>> Not sure when my company can move to 1.5. Would you know of a
>>>> workaround for this bug?
>>>> If I cannot find a workaround, we will have to change our schema
>>>> design to reduce the number of partitions.
>>>>
>>>> Thanks,
>>>>
>>>> Isabelle
>>>>
>>>> On Fri, Sep 4, 2015 at 12:56 PM, Michael Armbrust <
>>>> mich...@databricks.com> wrote:
>>>>
>>>>> Also, do you mean two partitions or two partition columns? If there
>>>>> are many partitions it can be much slower. In Spark 1.5 I'd consider
>>>>> setting spark.sql.hive.metastorePartitionPruning=true if you have
>>>>> predicates over the partition columns.
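Putting the quoted suggestions together, a spark-shell session (Spark 1.5.x, Hive-enabled sqlContext) might look like the sketch below. The table and partition columns (temp.log, dt, test_id) come from the thread; the partition values are made-up examples:

```scala
// Pruning only helps when the query has a predicate on a partition column.
sqlContext.setConf("spark.sql.hive.metastorePartitionPruning", "true")
val oneDay = sqlContext.sql(
  "select * from temp.log where dt = '2015-09-04' and test_id = 1")

// Michael's workaround: registering a temp table caches the expensive
// partition lookup, so only the first access is slow.
sqlContext.table("temp.log").registerTempTable("log_cached")
val logs = sqlContext.sql("select * from log_cached")
```

This is a sketch against a live Hive metastore, not something runnable standalone; the temp-table name "log_cached" is arbitrary.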
>>>>>
>>>>> On Fri, Sep 4, 2015 at 12:54 PM, Michael Armbrust <
>>>>> mich...@databricks.com> wrote:
>>>>>
>>>>>> What format is this table? For Parquet and other optimized formats
>>>>>> we cache a bunch of file metadata on first access to make interactive
>>>>>> queries faster.
>>>>>>
>>>>>> On Thu, Sep 3, 2015 at 8:17 PM, Isabelle Phan <nlip...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> I am using Spark SQL to query some Hive tables. Most of the time,
>>>>>>> when I create a DataFrame using the sqlContext.sql("select * from
>>>>>>> table") command, DataFrame creation takes less than 0.5 seconds.
>>>>>>> But I have this one table with which it takes almost 12 seconds!
>>>>>>>
>>>>>>> scala> val start = scala.compat.Platform.currentTime; val logs =
>>>>>>> sqlContext.sql("select * from temp.log"); val execution =
>>>>>>> scala.compat.Platform.currentTime - start
>>>>>>> 15/09/04 12:07:02 INFO ParseDriver: Parsing command: select * from
>>>>>>> temp.log
>>>>>>> 15/09/04 12:07:02 INFO ParseDriver: Parse Completed
>>>>>>> start: Long = 1441336022731
>>>>>>> logs: org.apache.spark.sql.DataFrame = [user_id: string, option:
>>>>>>> int, log_time: string, tag: string, dt: string, test_id: int]
>>>>>>> execution: Long = *11567*
>>>>>>>
>>>>>>> This table has 3.6 B rows, and 2 partitions (on dt and test_id
>>>>>>> columns).
>>>>>>> I have created DataFrames on even larger tables and do not see such
>>>>>>> delay.
>>>>>>> So my questions are:
>>>>>>> - What can impact DataFrame creation time?
>>>>>>> - Is it related to the table partitions?
>>>>>>>
>>>>>>> Thanks much for your help!
>>>>>>>
>>>>>>> Isabelle
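As an aside, the one-liner timing pattern used in the quoted session can be factored into a small helper. This is plain Scala with no Spark dependency; the name `timed` and the stand-in computation are my own:

```scala
// Returns a block's result together with its wall-clock time in millis,
// mirroring the start/stop measurement in the spark-shell session above.
def timed[T](body: => T): (T, Long) = {
  val start = System.currentTimeMillis()
  val result = body
  (result, System.currentTimeMillis() - start)
}

// Usage: wrap any expression, e.g. a stand-in computation.
val (sum, elapsedMs) = timed { (1 to 1000).sum }
println(s"result=$sum, took ${elapsedMs}ms")
```

In the thread's setting, `timed { sqlContext.sql("select * from temp.log") }` would measure DataFrame creation the same way.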