Hi Michael,

Thanks a lot for your reply.

This table is stored as a text file with tab-delimited columns.

You are correct: the problem is that my table has too many partitions
(1825 in total). Since I am on Spark 1.4, I think I am hitting SPARK-6984
<https://issues.apache.org/jira/browse/SPARK-6984>.
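
In case it is useful, here is my understanding of how the setting you
mention would be applied on 1.5. This is only a sketch: I have not been
able to try it yet, and the predicate values below are made up.

// On 1.5, enable metastore-side partition pruning first:
sqlContext.setConf("spark.sql.hive.metastorePartitionPruning", "true")
// Then filter on the partition columns, so the metastore does not have
// to list all 1825 partitions (example values only):
val logs = sqlContext.sql(
  "select * from temp.log where dt = '2015-09-04' and test_id = 1")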

Not sure when my company can move to 1.5, though. Would you know of a
workaround for this bug in the meantime? If I cannot find one, I will have
to change our schema design to reduce the number of partitions. A rough
sketch of what I am considering as a stopgap is below.
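
The idea is to read one partition's files directly and apply the schema by
hand, so the metastore is never asked to enumerate all 1825 partitions.
The warehouse path is hypothetical, and dt/test_id are left out because
partition column values are not stored in the data files themselves:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("user_id", StringType),
  StructField("option", IntegerType),
  StructField("log_time", StringType),
  StructField("tag", StringType)))

// Hypothetical location of one partition's tab-delimited files:
val path = "/user/hive/warehouse/temp.db/log/dt=2015-09-04/test_id=1"

val rows = sc.textFile(path).map(_.split("\t", -1)).map { f =>
  Row(f(0), f(1).toInt, f(2), f(3))
}
val logs = sqlContext.createDataFrame(rows, schema)

Does that seem reasonable, or is there a better approach?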


Thanks,

Isabelle



On Fri, Sep 4, 2015 at 12:56 PM, Michael Armbrust <mich...@databricks.com>
wrote:

> Also, do you mean two partitions or two partition columns?  If there are
> many partitions it can be much slower.  In Spark 1.5 I'd consider setting 
> spark.sql.hive.metastorePartitionPruning=true
> if you have predicates over the partition columns.
>
> On Fri, Sep 4, 2015 at 12:54 PM, Michael Armbrust <mich...@databricks.com>
> wrote:
>
>> What format is this table?  For Parquet and other optimized formats we
>> cache a bunch of file metadata on first access to make interactive queries
>> faster.
>>
>> On Thu, Sep 3, 2015 at 8:17 PM, Isabelle Phan <nlip...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> I am using SparkSQL to query some Hive tables. Most of the time, when I
>>> create a DataFrame using a sqlContext.sql("select * from table") command,
>>> DataFrame creation takes less than 0.5 seconds.
>>> But for this one table it takes almost 12 seconds!
>>>
>>> scala>  val start = scala.compat.Platform.currentTime; val logs =
>>> sqlContext.sql("select * from temp.log"); val execution =
>>> scala.compat.Platform.currentTime - start
>>> 15/09/04 12:07:02 INFO ParseDriver: Parsing command: select * from
>>> temp.log
>>> 15/09/04 12:07:02 INFO ParseDriver: Parse Completed
>>> start: Long = 1441336022731
>>> logs: org.apache.spark.sql.DataFrame = [user_id: string, option: int,
>>> log_time: string, tag: string, dt: string, test_id: int]
>>> execution: Long = 11567
>>>
>>> This table has 3.6 B rows, and 2 partitions (on dt and test_id columns).
>>> I have created DataFrames on even larger tables and do not see such a
>>> delay.
>>> So my questions are:
>>> - What can impact DataFrame creation time?
>>> - Is it related to the table partitions?
>>>
>>>
>>> Thanks much for your help!
>>>
>>> Isabelle
>>>
>>
>>
>
