Re: How useful are tools for Hive data modeling

Austin Hackett Wed, 11 Nov 2020 12:31:08 -0800

Hi Mich

Understood, I was thinking along the lines of the tool being able to 
auto-generate SQL join syntax etc, rather than in terms of scan performance.


I’m not so familiar with Parquet with Hive. I know that Parquet also has min 
and max indexes, and more recently bloom filters. However, I recall reading 
that Hive can’t take advantage of them. That might have changed since though? 
In order to make the most of these, you usually need to sort your data at 
insert time, which may or may not be feasible.
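
To illustrate why sorting matters for min/max statistics, here is a minimal Python sketch. It is not Hive or Parquet code: the helper names (row_group_stats, groups_to_read), the group size and the data are all made up for illustration. It only models the general idea that a reader can skip any block whose [min, max] range cannot contain the value being looked up.

```python
import random


def row_group_stats(values, group_size):
    """Split values into fixed-size groups and record (min, max) per group."""
    groups = [values[i:i + group_size] for i in range(0, len(values), group_size)]
    return [(min(g), max(g)) for g in groups]


def groups_to_read(stats, target):
    """Count groups whose [min, max] range could contain the target value."""
    return sum(1 for lo, hi in stats if lo <= target <= hi)


# Hypothetical data: 10,000 distinct keys written in random order.
random.seed(0)
values = list(range(10_000))
random.shuffle(values)

unsorted_stats = row_group_stats(values, 1_000)
sorted_stats = row_group_stats(sorted(values), 1_000)

# With unsorted data nearly every group spans the whole key range, so few
# (if any) groups can be skipped. With sorted data the ranges do not overlap.
print("unsorted:", groups_to_read(unsorted_stats, 4242), "of", len(unsorted_stats))
print("sorted:  ", groups_to_read(sorted_stats, 4242), "of", len(sorted_stats))
```

With sorted input the ranges are disjoint, so a point lookup touches a single group. Whether a given Hive version actually uses such statistics for Parquet is a separate question, as noted above.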

If a nicely selective partitioning key, plus a columnar file format (which of 
course Parquet is), doesn’t give you the performance you need, I guess a 
hand-rolled "materialised view" is where I’d look next (Hive 3.x does have 
native MV support, but I think only with ORC).
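
By hand-rolled "materialised view" I mean a pre-aggregated summary table that you rebuild yourself on a schedule. A rough sketch of the pattern follows, using Python's built-in sqlite3 purely as a stand-in for Hive. The table and column names (sales, sales_daily_summary) are hypothetical; in Hive the refresh step would more likely be a CREATE TABLE ... AS SELECT or an INSERT OVERWRITE into the summary table.

```python
import sqlite3

# In-memory database standing in for the warehouse (assumption for the sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_date TEXT, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("2020-11-10", "EU", 10.0),
        ("2020-11-10", "EU", 5.0),
        ("2020-11-10", "US", 7.0),
        ("2020-11-11", "EU", 3.0),
    ],
)


def refresh_summary(conn):
    """Rebuild the summary table from the base table (a full refresh)."""
    conn.execute("DROP TABLE IF EXISTS sales_daily_summary")
    conn.execute(
        "CREATE TABLE sales_daily_summary AS "
        "SELECT sale_date, region, SUM(amount) AS total_amount, COUNT(*) AS n "
        "FROM sales GROUP BY sale_date, region"
    )
    conn.commit()


refresh_summary(conn)

# Queries read the small summary table instead of scanning the base table.
rows = conn.execute(
    "SELECT region, total_amount FROM sales_daily_summary "
    "WHERE sale_date = '2020-11-10' ORDER BY region"
).fetchall()
print(rows)
```

The trade-off is that the summary is only as fresh as the last refresh, so the rebuild has to be scheduled after each load of the base table.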

Thanks

Austin



> On 11 Nov 2020, at 19:59, Mich Talebzadeh <[email protected]> wrote:
> 
> Many thanks Austin.
> 
> The challenge, I have been told, is how to effectively query a subset of data 
> while avoiding a full table scan. The tables, I believe, are Parquet.
> 
> I know performance in Hive is not that great, so anything that could help 
> would be great.
> 
> Cheers,
> 
>  
> 
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> 
> On Wed, 11 Nov 2020 at 19:32, Austin Hackett <[email protected]> wrote:
> Hi Mich
> 
> Hive also has non-validated primary key, foreign key etc constraints. Whilst 
> I’m not too familiar with the modelling tools you mention, perhaps they’re 
> able to use these for generating SQL etc?
> 
> ORC files have indexes (min, max, bloom filters) - not particularly relevant 
> to the data modelling tools question, but mentioning it for completeness…
> 
> Thanks
> 
> Austin
> 
> 
>> On 11 Nov 2020, at 17:14, Mich Talebzadeh <[email protected]> wrote:
>> 
>> Many thanks Peter. 
>> 
>> On Wed, 11 Nov 2020 at 16:58, Peter Vary <[email protected]> wrote:
>> Hi Mich,
>> 
>> Index support was removed from Hive:
>> https://issues.apache.org/jira/browse/HIVE-21968
>> https://issues.apache.org/jira/browse/HIVE-18715
>> 
>> Thanks,
>> Peter
>> 
>>> On Nov 11, 2020, at 17:25, Mich Talebzadeh <[email protected]> wrote:
>>> 
>>> Hi all,
>>> 
>>> I wrote these notes earlier this year. 
>>> 
>>> I heard today that someone mentioned Hive 1 does not support indexes but 
>>> Hive 2 does.
>>> 
>>> I still believe that Hive does not support indexing as per below. Has this 
>>> been changed?
>>> 
>>> Regards,
>>> 
>>> Mich
>>> 
>>> ---------- Forwarded message ---------
>>> From: Mich Talebzadeh <[email protected]>
>>> Date: Thu, 2 Apr 2020 at 12:17
>>> Subject: How useful are tools for Hive data modeling
>>> To: user <[email protected]>
>>> 
>>> 
>>> Hi,
>>> 
>>> Fundamentally Hive tables have structure and support provided by desc 
>>> formatted <TABLE> and show partitions <TABLE>.
>>> 
>>> Hive does not support indexes in real HQL operations (I stand corrected). 
>>> So what we have are tables, partitions and clustering (AKA hash 
>>> partitioning). 
>>> 
>>> Hive does not support indexes because Hadoop lacks the block locality 
>>> necessary for indexes. So if I use a tool like Collibra, Ab Initio etc., 
>>> what advantage(s) is one going to gain on top of a simple shell script to 
>>> get table and partition definitions?
>>> 
>>> Thanks,
>>> 
>> 
>
