If you run sqlContext.table("...").registerTempTable("..."), that temp table
will cache the lookup of partitions.
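
A rough sketch of that workaround (assuming the temp.log table from the
thread below; the log_tmp name is just for illustration):

scala> // resolving the table here is where 1.4 pays the slow partition lookup
scala> val logs = sqlContext.table("temp.log")
scala> logs.registerTempTable("log_tmp")
scala> // later queries go through the registered temp table and skip the metastore lookup
scala> sqlContext.sql("select * from log_tmp where dt = '2015-09-01'")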

On Fri, Sep 4, 2015 at 1:16 PM, Isabelle Phan <nlip...@gmail.com> wrote:

> Hi Michael,
>
> Thanks a lot for your reply.
>
> This table is stored as a text file with tab-delimited columns.
>
> You are correct, the problem is that my table has too many partitions
> (1,825 in total). Since I am on Spark 1.4, I think I am hitting bug SPARK-6984
> <https://issues.apache.org/jira/browse/SPARK-6984>.
>
> Not sure when my company can move to 1.5. Would you know of a workaround
> for this bug?
> If I cannot find one, I will have to change our schema design to reduce
> the number of partitions.
>
>
> Thanks,
>
> Isabelle
>
>
>
> On Fri, Sep 4, 2015 at 12:56 PM, Michael Armbrust <mich...@databricks.com>
> wrote:
>
>> Also, do you mean two partitions or two partition columns? If there are
>> many partitions, it can be much slower. In Spark 1.5, I'd consider setting
>> spark.sql.hive.metastorePartitionPruning=true if you have predicates over
>> the partition columns.
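>>
>> A minimal sketch of that setting (assumes Spark 1.5 and a HiveContext;
>> the dt filter below is just an example predicate):
>>
>> scala> // ask the metastore to prune partitions rather than fetching all of them
>> scala> sqlContext.setConf("spark.sql.hive.metastorePartitionPruning", "true")
>> scala> // a filter on the partition column dt can now be pushed into the metastore call
>> scala> sqlContext.sql("select * from temp.log where dt = '2015-09-01'")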
>>
>> On Fri, Sep 4, 2015 at 12:54 PM, Michael Armbrust <mich...@databricks.com
>> > wrote:
>>
>>> What format is this table? For Parquet and other optimized formats, we
>>> cache a bunch of file metadata on first access to make interactive queries
>>> faster.
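>>>
>>> One way to check the format (a sketch, assuming a HiveContext):
>>>
>>> scala> // "describe formatted" reports the table's input format and SerDe
>>> scala> sqlContext.sql("describe formatted temp.log").collect().foreach(println)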
>>>
>>> On Thu, Sep 3, 2015 at 8:17 PM, Isabelle Phan <nlip...@gmail.com> wrote:
>>>
>>>> Hello,
>>>>
>>>> I am using SparkSQL to query some Hive tables. Most of the time, when I
>>>> create a DataFrame with a sqlContext.sql("select * from table") command,
>>>> DataFrame creation takes less than 0.5 seconds.
>>>> But I have one table for which it takes almost 12 seconds!
>>>>
>>>> scala>  val start = scala.compat.Platform.currentTime; val logs =
>>>> sqlContext.sql("select * from temp.log"); val execution =
>>>> scala.compat.Platform.currentTime - start
>>>> 15/09/04 12:07:02 INFO ParseDriver: Parsing command: select * from
>>>> temp.log
>>>> 15/09/04 12:07:02 INFO ParseDriver: Parse Completed
>>>> start: Long = 1441336022731
>>>> logs: org.apache.spark.sql.DataFrame = [user_id: string, option: int,
>>>> log_time: string, tag: string, dt: string, test_id: int]
>>>> execution: Long = 11567
>>>>
>>>> This table has 3.6B rows and 2 partitions (on the dt and test_id columns).
>>>> I have created DataFrames on even larger tables and do not see such a
>>>> delay.
>>>> So my questions are:
>>>> - What can impact DataFrame creation time?
>>>> - Is it related to the table partitions?
>>>>
>>>>
>>>> Thanks much for your help!
>>>>
>>>> Isabelle
>>>>
>>>
>>>
>>
>
