Re: DataFrame creation delay?

2015-12-11 Thread Isabelle Phan
Hi Harsh,

Thanks a lot for your reply.

I added a predicate to my query to select a single partition in the table,
and tested with both "spark.sql.hive.metastorePartitionPruning" setting on
and off, and there is no difference in DataFrame creation time.

Yes, Michael's proposed workaround works. But I was under the impression
that this workaround was only for Spark version < 1.5. With the Hive
metastore partition pruning feature from Spark 1.5, I thought there would
be no more delay, so I could create DataFrames left and right.

I noticed that regardless of the setting, when I create a DataFrame with or
without a predicate, I get a log message from HadoopFsRelation class which
lists hdfs filepaths to ALL partitions in my table (see logInfo call in the
code
).
Is this expected? I am not sure who is creating this Array of filepaths,
but I am guessing this is the source of my delay.


Thanks,

Isabelle



On Thu, Dec 10, 2015 at 7:38 PM, Harsh J  wrote:

> The option of "spark.sql.hive.metastorePartitionPruning=true" will not
> work unless you have a partition column predicate in your query. Your query
> of "select * from temp.log" does not do this. The slowdown appears to be
> due to the need of loading all partition metadata.
>
> Have you also tried to see if Michael's temp-table suggestion helps you
> cache the expensive partition lookup? (re-quoted below)
>
> """
> If you run sqlContext.table("...").registerTempTable("...") that temptable
> will cache the lookup of partitions [the first time is slow, but subsequent
> lookups will be faster].
> """ - X-Ref: Permalink
> 
>
> Also, do you absolutely need to use "select * from temp.log"? Adding a
> where clause to the query with a partition condition will help Spark prune
> the request to just the required partitions (vs. all, which is proving
> expensive).
>
> On Fri, Dec 11, 2015 at 3:59 AM Isabelle Phan  wrote:
>
>> Hi Michael,
>>
>> We have just upgraded to Spark 1.5.0 (actually 1.5.0_cdh-5.5 since we are
>> on cloudera), and Parquet formatted tables. I turned on  spark
>> .sql.hive.metastorePartitionPruning=true, but DataFrame creation still
>> takes a long time.
>> Is there any other configuration to consider?
>>
>>
>> Thanks a lot for your help,
>>
>> Isabelle
>>
>> On Fri, Sep 4, 2015 at 1:42 PM, Michael Armbrust 
>> wrote:
>>
>>> If you run sqlContext.table("...").registerTempTable("...") that
>>> temptable will cache the lookup of partitions.
>>>
>>> On Fri, Sep 4, 2015 at 1:16 PM, Isabelle Phan  wrote:
>>>
 Hi Michael,

 Thanks a lot for your reply.

 This table is stored as text file with tab delimited columns.

 You are correct, the problem is because my table has too many
 partitions (1825 in total). Since I am on Spark 1.4, I think I am hitting
 bug 6984 .

 Not sure when my company can move to 1.5. Would you know some
 workaround for this bug?
 If I cannot find workaround for this, will have to change our schema
 design to reduce number of partitions.


 Thanks,

 Isabelle



 On Fri, Sep 4, 2015 at 12:56 PM, Michael Armbrust <
 mich...@databricks.com> wrote:

> Also, do you mean two partitions or two partition columns?  If there
> are many partitions it can be much slower.  In Spark 1.5 I'd consider
> setting spark.sql.hive.metastorePartitionPruning=true if you have
> predicates over the partition columns.
>
> On Fri, Sep 4, 2015 at 12:54 PM, Michael Armbrust <
> mich...@databricks.com> wrote:
>
>> What format is this table.  For parquet and other optimized formats
>> we cache a bunch of file metadata on first access to make interactive
>> queries faster.
>>
>> On Thu, Sep 3, 2015 at 8:17 PM, Isabelle Phan 
>> wrote:
>>
>>> Hello,
>>>
>>> I am using SparkSQL to query some Hive tables. Most of the time,
>>> when I create a DataFrame using sqlContext.sql("select * from
>>> table") command, DataFrame creation is less than 0.5 second.
>>> But I have this one table with which it takes almost 12 seconds!
>>>
>>> scala>  val start = scala.compat.Platform.currentTime; val logs =
>>> sqlContext.sql("select * from temp.log"); val execution =
>>> scala.compat.Platform.currentTime - start
>>> 15/09/04 12:07:02 INFO ParseDriver: Parsing command: select * from
>>> temp.log
>>> 15/09/04 12:07:02 INFO ParseDriver: Parse Completed
>>> start: Long = 1441336022731
>>> logs: org.apache.spark.sql.DataFrame = [user_id: string, option:
>>> int, log_time: string, tag: string, dt: string, test_id: int]
>>> execu

Re: DataFrame creation delay?

2015-12-10 Thread Harsh J
The option of "spark.sql.hive.metastorePartitionPruning=true" will not work
unless you have a partition column predicate in your query. Your query of
"select * from temp.log" does not do this. The slowdown appears to be due
to the need of loading all partition metadata.

Have you also tried to see if Michael's temp-table suggestion helps you
cache the expensive partition lookup? (re-quoted below)

"""
If you run sqlContext.table("...").registerTempTable("...") that temptable
will cache the lookup of partitions [the first time is slow, but subsequent
lookups will be faster].
""" - X-Ref: Permalink


Also, do you absolutely need to use "select * from temp.log"? Adding a
where clause to the query with a partition condition will help Spark prune
the request to just the required partitions (vs. all, which is proving
expensive).

On Fri, Dec 11, 2015 at 3:59 AM Isabelle Phan  wrote:

> Hi Michael,
>
> We have just upgraded to Spark 1.5.0 (actually 1.5.0_cdh-5.5 since we are
> on cloudera), and Parquet formatted tables. I turned on  spark
> .sql.hive.metastorePartitionPruning=true, but DataFrame creation still
> takes a long time.
> Is there any other configuration to consider?
>
>
> Thanks a lot for your help,
>
> Isabelle
>
> On Fri, Sep 4, 2015 at 1:42 PM, Michael Armbrust 
> wrote:
>
>> If you run sqlContext.table("...").registerTempTable("...") that
>> temptable will cache the lookup of partitions.
>>
>> On Fri, Sep 4, 2015 at 1:16 PM, Isabelle Phan  wrote:
>>
>>> Hi Michael,
>>>
>>> Thanks a lot for your reply.
>>>
>>> This table is stored as text file with tab delimited columns.
>>>
>>> You are correct, the problem is because my table has too many partitions
>>> (1825 in total). Since I am on Spark 1.4, I think I am hitting bug 6984
>>> .
>>>
>>> Not sure when my company can move to 1.5. Would you know some workaround
>>> for this bug?
>>> If I cannot find workaround for this, will have to change our schema
>>> design to reduce number of partitions.
>>>
>>>
>>> Thanks,
>>>
>>> Isabelle
>>>
>>>
>>>
>>> On Fri, Sep 4, 2015 at 12:56 PM, Michael Armbrust <
>>> mich...@databricks.com> wrote:
>>>
 Also, do you mean two partitions or two partition columns?  If there
 are many partitions it can be much slower.  In Spark 1.5 I'd consider
 setting spark.sql.hive.metastorePartitionPruning=true if you have
 predicates over the partition columns.

 On Fri, Sep 4, 2015 at 12:54 PM, Michael Armbrust <
 mich...@databricks.com> wrote:

> What format is this table.  For parquet and other optimized formats we
> cache a bunch of file metadata on first access to make interactive queries
> faster.
>
> On Thu, Sep 3, 2015 at 8:17 PM, Isabelle Phan 
> wrote:
>
>> Hello,
>>
>> I am using SparkSQL to query some Hive tables. Most of the time, when
>> I create a DataFrame using sqlContext.sql("select * from table") command,
>> DataFrame creation is less than 0.5 second.
>> But I have this one table with which it takes almost 12 seconds!
>>
>> scala>  val start = scala.compat.Platform.currentTime; val logs =
>> sqlContext.sql("select * from temp.log"); val execution =
>> scala.compat.Platform.currentTime - start
>> 15/09/04 12:07:02 INFO ParseDriver: Parsing command: select * from
>> temp.log
>> 15/09/04 12:07:02 INFO ParseDriver: Parse Completed
>> start: Long = 1441336022731
>> logs: org.apache.spark.sql.DataFrame = [user_id: string, option: int,
>> log_time: string, tag: string, dt: string, test_id: int]
>> execution: Long = *11567*
>>
>> This table has 3.6 B rows, and 2 partitions (on dt and test_id
>> columns).
>> I have created DataFrames on even larger tables and do not see such
>> delay.
>> So my questions are:
>> - What can impact DataFrame creation time?
>> - Is it related to the table partitions?
>>
>>
>> Thanks much your help!
>>
>> Isabelle
>>
>
>

>>>
>>
>


Re: DataFrame creation delay?

2015-12-10 Thread Isabelle Phan
Hi Michael,

We have just upgraded to Spark 1.5.0 (actually 1.5.0_cdh-5.5 since we are
on cloudera), and Parquet formatted tables. I turned on  spark
.sql.hive.metastorePartitionPruning=true, but DataFrame creation still
takes a long time.
Is there any other configuration to consider?


Thanks a lot for your help,

Isabelle

On Fri, Sep 4, 2015 at 1:42 PM, Michael Armbrust 
wrote:

> If you run sqlContext.table("...").registerTempTable("...") that
> temptable will cache the lookup of partitions.
>
> On Fri, Sep 4, 2015 at 1:16 PM, Isabelle Phan  wrote:
>
>> Hi Michael,
>>
>> Thanks a lot for your reply.
>>
>> This table is stored as text file with tab delimited columns.
>>
>> You are correct, the problem is because my table has too many partitions
>> (1825 in total). Since I am on Spark 1.4, I think I am hitting bug 6984
>> .
>>
>> Not sure when my company can move to 1.5. Would you know some workaround
>> for this bug?
>> If I cannot find workaround for this, will have to change our schema
>> design to reduce number of partitions.
>>
>>
>> Thanks,
>>
>> Isabelle
>>
>>
>>
>> On Fri, Sep 4, 2015 at 12:56 PM, Michael Armbrust > > wrote:
>>
>>> Also, do you mean two partitions or two partition columns?  If there are
>>> many partitions it can be much slower.  In Spark 1.5 I'd consider setting 
>>> spark.sql.hive.metastorePartitionPruning=true
>>> if you have predicates over the partition columns.
>>>
>>> On Fri, Sep 4, 2015 at 12:54 PM, Michael Armbrust <
>>> mich...@databricks.com> wrote:
>>>
 What format is this table.  For parquet and other optimized formats we
 cache a bunch of file metadata on first access to make interactive queries
 faster.

 On Thu, Sep 3, 2015 at 8:17 PM, Isabelle Phan 
 wrote:

> Hello,
>
> I am using SparkSQL to query some Hive tables. Most of the time, when
> I create a DataFrame using sqlContext.sql("select * from table") command,
> DataFrame creation is less than 0.5 second.
> But I have this one table with which it takes almost 12 seconds!
>
> scala>  val start = scala.compat.Platform.currentTime; val logs =
> sqlContext.sql("select * from temp.log"); val execution =
> scala.compat.Platform.currentTime - start
> 15/09/04 12:07:02 INFO ParseDriver: Parsing command: select * from
> temp.log
> 15/09/04 12:07:02 INFO ParseDriver: Parse Completed
> start: Long = 1441336022731
> logs: org.apache.spark.sql.DataFrame = [user_id: string, option: int,
> log_time: string, tag: string, dt: string, test_id: int]
> execution: Long = *11567*
>
> This table has 3.6 B rows, and 2 partitions (on dt and test_id
> columns).
> I have created DataFrames on even larger tables and do not see such
> delay.
> So my questions are:
> - What can impact DataFrame creation time?
> - Is it related to the table partitions?
>
>
> Thanks much your help!
>
> Isabelle
>


>>>
>>
>


Re: DataFrame creation delay?

2015-09-04 Thread Michael Armbrust
If you run sqlContext.table("...").registerTempTable("...") that temptable
will cache the lookup of partitions.

On Fri, Sep 4, 2015 at 1:16 PM, Isabelle Phan  wrote:

> Hi Michael,
>
> Thanks a lot for your reply.
>
> This table is stored as text file with tab delimited columns.
>
> You are correct, the problem is because my table has too many partitions
> (1825 in total). Since I am on Spark 1.4, I think I am hitting bug 6984
> .
>
> Not sure when my company can move to 1.5. Would you know some workaround
> for this bug?
> If I cannot find workaround for this, will have to change our schema
> design to reduce number of partitions.
>
>
> Thanks,
>
> Isabelle
>
>
>
> On Fri, Sep 4, 2015 at 12:56 PM, Michael Armbrust 
> wrote:
>
>> Also, do you mean two partitions or two partition columns?  If there are
>> many partitions it can be much slower.  In Spark 1.5 I'd consider setting 
>> spark.sql.hive.metastorePartitionPruning=true
>> if you have predicates over the partition columns.
>>
>> On Fri, Sep 4, 2015 at 12:54 PM, Michael Armbrust > > wrote:
>>
>>> What format is this table.  For parquet and other optimized formats we
>>> cache a bunch of file metadata on first access to make interactive queries
>>> faster.
>>>
>>> On Thu, Sep 3, 2015 at 8:17 PM, Isabelle Phan  wrote:
>>>
 Hello,

 I am using SparkSQL to query some Hive tables. Most of the time, when I
 create a DataFrame using sqlContext.sql("select * from table") command,
 DataFrame creation is less than 0.5 second.
 But I have this one table with which it takes almost 12 seconds!

 scala>  val start = scala.compat.Platform.currentTime; val logs =
 sqlContext.sql("select * from temp.log"); val execution =
 scala.compat.Platform.currentTime - start
 15/09/04 12:07:02 INFO ParseDriver: Parsing command: select * from
 temp.log
 15/09/04 12:07:02 INFO ParseDriver: Parse Completed
 start: Long = 1441336022731
 logs: org.apache.spark.sql.DataFrame = [user_id: string, option: int,
 log_time: string, tag: string, dt: string, test_id: int]
 execution: Long = *11567*

 This table has 3.6 B rows, and 2 partitions (on dt and test_id columns).
 I have created DataFrames on even larger tables and do not see such
 delay.
 So my questions are:
 - What can impact DataFrame creation time?
 - Is it related to the table partitions?


 Thanks much your help!

 Isabelle

>>>
>>>
>>
>


Re: DataFrame creation delay?

2015-09-04 Thread Isabelle Phan
Hi Michael,

Thanks a lot for your reply.

This table is stored as text file with tab delimited columns.

You are correct, the problem is because my table has too many partitions
(1825 in total). Since I am on Spark 1.4, I think I am hitting bug 6984
.

Not sure when my company can move to 1.5. Would you know some workaround
for this bug?
If I cannot find workaround for this, will have to change our schema design
to reduce number of partitions.


Thanks,

Isabelle



On Fri, Sep 4, 2015 at 12:56 PM, Michael Armbrust 
wrote:

> Also, do you mean two partitions or two partition columns?  If there are
> many partitions it can be much slower.  In Spark 1.5 I'd consider setting 
> spark.sql.hive.metastorePartitionPruning=true
> if you have predicates over the partition columns.
>
> On Fri, Sep 4, 2015 at 12:54 PM, Michael Armbrust 
> wrote:
>
>> What format is this table.  For parquet and other optimized formats we
>> cache a bunch of file metadata on first access to make interactive queries
>> faster.
>>
>> On Thu, Sep 3, 2015 at 8:17 PM, Isabelle Phan  wrote:
>>
>>> Hello,
>>>
>>> I am using SparkSQL to query some Hive tables. Most of the time, when I
>>> create a DataFrame using sqlContext.sql("select * from table") command,
>>> DataFrame creation is less than 0.5 second.
>>> But I have this one table with which it takes almost 12 seconds!
>>>
>>> scala>  val start = scala.compat.Platform.currentTime; val logs =
>>> sqlContext.sql("select * from temp.log"); val execution =
>>> scala.compat.Platform.currentTime - start
>>> 15/09/04 12:07:02 INFO ParseDriver: Parsing command: select * from
>>> temp.log
>>> 15/09/04 12:07:02 INFO ParseDriver: Parse Completed
>>> start: Long = 1441336022731
>>> logs: org.apache.spark.sql.DataFrame = [user_id: string, option: int,
>>> log_time: string, tag: string, dt: string, test_id: int]
>>> execution: Long = *11567*
>>>
>>> This table has 3.6 B rows, and 2 partitions (on dt and test_id columns).
>>> I have created DataFrames on even larger tables and do not see such
>>> delay.
>>> So my questions are:
>>> - What can impact DataFrame creation time?
>>> - Is it related to the table partitions?
>>>
>>>
>>> Thanks much your help!
>>>
>>> Isabelle
>>>
>>
>>
>


Re: DataFrame creation delay?

2015-09-04 Thread Michael Armbrust
Also, do you mean two partitions or two partition columns?  If there are
many partitions it can be much slower.  In Spark 1.5 I'd consider
setting spark.sql.hive.metastorePartitionPruning=true
if you have predicates over the partition columns.

On Fri, Sep 4, 2015 at 12:54 PM, Michael Armbrust 
wrote:

> What format is this table.  For parquet and other optimized formats we
> cache a bunch of file metadata on first access to make interactive queries
> faster.
>
> On Thu, Sep 3, 2015 at 8:17 PM, Isabelle Phan  wrote:
>
>> Hello,
>>
>> I am using SparkSQL to query some Hive tables. Most of the time, when I
>> create a DataFrame using sqlContext.sql("select * from table") command,
>> DataFrame creation is less than 0.5 second.
>> But I have this one table with which it takes almost 12 seconds!
>>
>> scala>  val start = scala.compat.Platform.currentTime; val logs =
>> sqlContext.sql("select * from temp.log"); val execution =
>> scala.compat.Platform.currentTime - start
>> 15/09/04 12:07:02 INFO ParseDriver: Parsing command: select * from
>> temp.log
>> 15/09/04 12:07:02 INFO ParseDriver: Parse Completed
>> start: Long = 1441336022731
>> logs: org.apache.spark.sql.DataFrame = [user_id: string, option: int,
>> log_time: string, tag: string, dt: string, test_id: int]
>> execution: Long = *11567*
>>
>> This table has 3.6 B rows, and 2 partitions (on dt and test_id columns).
>> I have created DataFrames on even larger tables and do not see such
>> delay.
>> So my questions are:
>> - What can impact DataFrame creation time?
>> - Is it related to the table partitions?
>>
>>
>> Thanks much your help!
>>
>> Isabelle
>>
>
>


Re: DataFrame creation delay?

2015-09-04 Thread Michael Armbrust
What format is this table.  For parquet and other optimized formats we
cache a bunch of file metadata on first access to make interactive queries
faster.

On Thu, Sep 3, 2015 at 8:17 PM, Isabelle Phan  wrote:

> Hello,
>
> I am using SparkSQL to query some Hive tables. Most of the time, when I
> create a DataFrame using sqlContext.sql("select * from table") command,
> DataFrame creation is less than 0.5 second.
> But I have this one table with which it takes almost 12 seconds!
>
> scala>  val start = scala.compat.Platform.currentTime; val logs =
> sqlContext.sql("select * from temp.log"); val execution =
> scala.compat.Platform.currentTime - start
> 15/09/04 12:07:02 INFO ParseDriver: Parsing command: select * from temp.log
> 15/09/04 12:07:02 INFO ParseDriver: Parse Completed
> start: Long = 1441336022731
> logs: org.apache.spark.sql.DataFrame = [user_id: string, option: int,
> log_time: string, tag: string, dt: string, test_id: int]
> execution: Long = *11567*
>
> This table has 3.6 B rows, and 2 partitions (on dt and test_id columns).
> I have created DataFrames on even larger tables and do not see such delay.
> So my questions are:
> - What can impact DataFrame creation time?
> - Is it related to the table partitions?
>
>
> Thanks much your help!
>
> Isabelle
>