spark.sql.hive.convertMetastoreParquet is true. I can't reproduce the issue of scanning all partitions now... :P

Anyway, I found another email thread, "Re: Spark Sql behaves strangely with tables with a lot of partitions", and I observe the same issue as Jerrick: the Spark driver calls listStatus on the whole table folder even when I load a pure Hive table (not one created by Spark), which takes noticeable time for a large partitioned table. Even though spark.sql.parquet.cacheMetadata is true, this listStatus runs before every query; the result is not cached.
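A minimal sketch of one way to check the settings and observe the repeated listing cost from spark-shell (Spark 1.5, Scala); the events table and its date partition column are hypothetical stand-ins for a large Hive-partitioned Parquet table:

    // Spark 1.5 spark-shell with Hive support: sqlContext is a HiveContext.
    // Confirm the two settings discussed in this thread.
    println(sqlContext.getConf("spark.sql.hive.convertMetastoreParquet", "<unset>")) // expect "true"
    println(sqlContext.getConf("spark.sql.parquet.cacheMetadata", "<unset>"))        // expect "true"

    // Run the same query twice. If the driver-side listStatus of the whole
    // table folder were cached, the second run should plan noticeably
    // faster; in the behavior described above, both runs pay the listing cost.
    def timed[T](label: String)(body: => T): T = {
      val start = System.nanoTime()
      val result = body
      println(s"$label: ${(System.nanoTime() - start) / 1e6} ms")
      result
    }

    val q = "SELECT count(*) FROM events WHERE date = '2015-11-01'"  // hypothetical table
    timed("first run")(sqlContext.sql(q).collect())
    timed("second run")(sqlContext.sql(q).collect())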
2015-11-05 8:50 GMT+08:00 Cheng Lian <lian.cs....@gmail.com>:

> Is there any chance that "spark.sql.hive.convertMetastoreParquet" is
> turned off?
>
> Cheng
>
> On 11/4/15 5:15 PM, Rex Xiong wrote:
>
> Thanks, Cheng Lian.
> I found that in 1.5, if I use Spark to create this table with partition
> discovery, partition pruning can be performed, but for my old table
> definition in pure Hive, the execution plan does a Parquet scan across
> all partitions, and it runs very slowly.
> It looks like the execution plan optimization is different.
>
> 2015-11-03 23:10 GMT+08:00 Cheng Lian <lian.cs....@gmail.com>:
>
>> SPARK-11153 should be irrelevant, because you are filtering on a
>> partition key, while SPARK-11153 is about Parquet filter push-down and
>> doesn't affect partition pruning.
>>
>> Cheng
>>
>> On 11/3/15 7:14 PM, Rex Xiong wrote:
>>
>> We found that query performance is very poor due to this issue:
>> https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-11153
>> We usually filter on the partition key, the date; it's a string type in
>> 1.3.1 and works great.
>> But in 1.5, it needs to do a Parquet scan over all partitions.
>>
>> On Oct 31, 2015, at 7:38 PM, "Rex Xiong" <bycha...@gmail.com> wrote:
>>
>>> Adding this thread back to the email list; I forgot to reply all.
>>>
>>> On Oct 31, 2015, at 7:23 PM, "Michael Armbrust" <mich...@databricks.com> wrote:
>>>
>>>> Not that I know of.
>>>>
>>>> On Sat, Oct 31, 2015 at 12:22 PM, Rex Xiong <bycha...@gmail.com> wrote:
>>>>
>>>>> Good to know; I will give it a try.
>>>>> So there is no easy way to achieve this with a pure Hive table?
>>>>>
>>>>> On Oct 31, 2015, at 7:17 PM, "Michael Armbrust" <mich...@databricks.com> wrote:
>>>>>
>>>>>> Yeah, this was rewritten to be faster in Spark 1.5. We use it with
>>>>>> 10,000s of partitions.
>>>>>>
>>>>>> On Sat, Oct 31, 2015 at 7:17 AM, Rex Xiong <bycha...@gmail.com> wrote:
>>>>>>
>>>>>>> 1.3.1.
>>>>>>> Is it much improved in 1.5+?
>>>>>>>
>>>>>>> 2015-10-30 19:23 GMT+08:00 Michael Armbrust <mich...@databricks.com>:
>>>>>>>
>>>>>>>>> We have tried the schema merging feature, but it's too slow; there
>>>>>>>>> are hundreds of partitions.
>>>>>>>>
>>>>>>>> Which version of Spark?
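A related sketch for checking whether partition pruning actually happens for a filter on the partition key, assuming the same hypothetical events table plus a hypothetical /data/events partition-discovery layout; if pruning works, the physical plan should reference only the matching partition directory rather than all partitions:

    // Metastore table case: inspect the plan for the partition-key filter.
    sqlContext.sql(
      "SELECT count(*) FROM events WHERE date = '2015-11-01'").explain(true)

    // Partition-discovery case: a root folder containing date=.../*.parquet
    // subdirectories, read directly as Parquet.
    val df = sqlContext.read.parquet("/data/events")  // hypothetical path
    df.filter(df("date") === "2015-11-01").explain(true)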