Hello,

Sorry for the late reply, but this problem is very interesting. How did you
end up solving it?

I have an idea which is very ugly but might work:

Create a big view that is a union of all partitions:

SELECT
'2019-10-01' as ds, *
FROM test_1 a
JOIN test_2 b ON a.id = b.id
JOIN test_3 c ON c.id = a.id
WHERE a.ds = '2019-10-01' AND b.ds = '2019-10-01' AND c.ds = '2019-10-01'
UNION ALL
SELECT
'2019-10-02' as ds, *
FROM test_1 a
JOIN test_2 b ON a.id = b.id
JOIN test_3 c ON c.id = a.id
WHERE a.ds = '2019-10-02' AND b.ds = '2019-10-02' AND c.ds = '2019-10-02'
UNION ALL
...
;


That way, each branch of the union will use the *sortedmerge* optimization,
and hopefully, if the optimizer is smart enough, it will be able to prune the
branches of the union that don't match when you filter on several days.
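
For example, assuming the union above is wrapped in a view (the name
test_all_days is just illustrative, and the view would need to expose a single
ds column, e.g. by listing columns explicitly instead of using *), a multi-day
query would just filter on that ds:

SELECT *
FROM test_all_days
WHERE ds IN ('2019-10-01', '2019-10-02');

If the pruning works as hoped, only the two matching branches of the union
should actually be read.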

I haven't tested it, so I can't tell whether it works or not; it's just an
idea. If Hive has a limit on the maximum resolved query size, this might
reach it, though.

Hope this helps.

Furcy


On Wed, 2 Oct 2019 at 11:09, Pau Tallada <tall...@pic.es> wrote:

> Hi,
>
> I would say the most efficient way would be option (3), where all the
> subtables are partitioned by date, and clustered+**sorted** by id.
> This way, efficient SMB map joins can be performed over the 10 tables of
> the same partition.
>
> Unfortunately, I haven't found a way to achieve SMB map joins *over more
> than one* partition :(
>
> Example:
>
> CREATE TABLE test_1 (id INT, c1 FLOAT, c2 FLOAT)
> PARTITIONED BY (ds STRING)
> CLUSTERED BY (id)
> SORTED BY (id ASC)
> INTO 4 BUCKETS
> STORED AS ORC;
>
> CREATE TABLE test_2 (id INT, c3 FLOAT, c4 FLOAT)
> PARTITIONED BY (ds STRING)
> CLUSTERED BY (id)
> SORTED BY (id ASC)
> INTO 4 BUCKETS
> STORED AS ORC;
>
> CREATE TABLE test_3 (id INT, c5 FLOAT, c6 FLOAT)
> PARTITIONED BY (ds STRING)
> CLUSTERED BY (id)
> SORTED BY (id ASC)
> INTO 4 BUCKETS
> STORED AS ORC;
>
> Over these tables, one can perform efficient single-stage SMB map joins, *if
> you filter on a single partition*:
>
> set hive.execution.engine=tez;
> set hive.enforce.sortmergebucketmapjoin=false;
> set hive.optimize.bucketmapjoin=true;
> set hive.optimize.bucketmapjoin.sortedmerge=true;
> set hive.auto.convert.sortmerge.join=true;
> set hive.auto.convert.join=true;
> set hive.auto.convert.join.noconditionaltask.size=0;
>
> EXPLAIN
> SELECT *
> FROM test_1 a
> JOIN test_2 b
>   ON a.id = b.id AND a.ds = b.ds
> JOIN test_3 c
>   ON b.id = c.id AND b.ds = c.ds
> WHERE a.ds = '2019-10-01'
> ;
>
> When you query two partitions, however, it does a shuffle :(
>
> set hive.execution.engine=tez;
> set hive.enforce.sortmergebucketmapjoin=false;
> set hive.optimize.bucketmapjoin=true;
> set hive.optimize.bucketmapjoin.sortedmerge=true;
> set hive.auto.convert.sortmerge.join=true;
> set hive.auto.convert.join=true;
> set hive.auto.convert.join.noconditionaltask.size=0;
>
> EXPLAIN
> SELECT *
> FROM test_1 a
> JOIN test_2 b
>   ON a.id = b.id AND a.ds = b.ds
> JOIN test_3 c
>   ON b.id = c.id AND b.ds = c.ds
> WHERE a.ds IN ('2019-10-01', '2019-10-02')
> ;
>
>
> My problem is very similar, so let's hope someone out there has the answer
> :)
>
> Cheers,
>
> Pau.
>
> Message from Saurabh Santhosh <saurabhsanth...@gmail.com> on Tue, 1 Oct
> 2019 at 8:48:
>
>> Hi,
>>
>> I am facing the following problem while trying to store/use a huge
>> partitioned table with 1000+ columns in Hive. I would like to know how to
>> solve this problem either using Hive or any other store.
>>
>> Requirement:
>>
>> 1). There is a table with around 1000+ columns which is partitioned by
>> date.
>> 2). Each day consists of data for around 500 million entities. There is an
>> id column with the id of the entity and around 1000+ columns which
>> represent the attributes of that entity for the day.
>> 3). We have to store data for around 2 years.
>> 4). New columns may be added, or the logic of an existing column may be
>> changed, on any day, and when this happens we have to populate the data
>> for that column for the last 2 years.
>>
>>
>> Our Solution 1:
>>
>> 1). We created a table with 1000+ columns, partitioned by date.
>> 2). Every day we create a new partition and delete partitions older than 2
>> years.
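>>
>> A minimal sketch of this layout (names are illustrative, and only a couple
>> of the 1000+ attribute columns are shown):
>>
>> CREATE TABLE entity_wide (id BIGINT, attr_1 FLOAT, attr_2 FLOAT)
>> PARTITIONED BY (ds STRING)
>> STORED AS ORC;
>>
>> -- daily housekeeping: create the new day's partition and drop the one
>> -- that falls out of the 2-year window
>> ALTER TABLE entity_wide ADD PARTITION (ds = '2019-10-02');
>> ALTER TABLE entity_wide DROP IF EXISTS PARTITION (ds < '2017-10-03');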
>>
>> Problems Faced in Solution 1:
>>
>> Whenever we had to add/modify certain columns, the backfill took a long
>> time: it was taking months to backfill the data for 2 years (because of
>> the large amount of IO needed to read and rewrite each partition).
>>
>>
>> Our Solution 2:
>>
>> 1). We created 10 tables with around 100+ columns each, each of them
>> partitioned by date.
>> 2). Every day we create a new partition in each of the small tables and
>> delete partitions older than 2 years.
>> 3). We created a view which joins all the tables, with id and date as the
>> join keys.
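>>
>> A rough sketch of the view (only 3 of the 10 small tables are shown, and
>> all names are illustrative):
>>
>> CREATE VIEW entity_all AS
>> SELECT a.id, a.ds, a.attr_1, b.attr_101, c.attr_201
>> FROM entity_part_1 a
>> JOIN entity_part_2 b ON a.id = b.id AND a.ds = b.ds
>> JOIN entity_part_3 c ON a.id = c.id AND a.ds = c.ds;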
>>
>>
>> Problems Faced in Solution 2:
>>
>> Now the backfill time was considerably reduced, from months to weeks, as
>> we only need to refresh the small table which contains the columns to be
>> backfilled, thus reducing the IO drastically.
>> But this led to very slow queries on top of the view. Especially when we
>> query for 6 months of data, the queries take more than 10 hrs due to the
>> shuffling of data.
>>
>> Our Solution 3:
>>
>> 1). We also tried to bucket each small table by the id column, but this
>> did not give us the desired result, as the shuffling was still happening.
>>
>>
>> Can anyone suggest the best approach for the above scenario?
>>
>
>
> --
> ----------------------------------
> Pau Tallada Crespí
> Dep. d'Astrofísica i Cosmologia
> Port d'Informació Científica (PIC)
> Tel: +34 93 170 2729
> ----------------------------------
>
>
