Hello,

Sorry for the late reply, but this problem is very interesting. How did you end up solving it in the end?
I have an idea which is very ugly but might work: create a big view that is a union of all the partitions:

SELECT '2019-10-01' as ds, *
FROM test_1 a
JOIN test_2 b ON a.id = b.id
JOIN test_3 c ON c.id = a.id
WHERE a.ds = '2019-10-01'
  AND b.ds = '2019-10-01'
  AND c.ds = '2019-10-01'
UNION ALL
SELECT '2019-10-02' as ds, *
FROM test_1 a
JOIN test_2 b ON a.id = b.id
JOIN test_3 c ON c.id = a.id
WHERE a.ds = '2019-10-02'
  AND b.ds = '2019-10-02'
  AND c.ds = '2019-10-02'
UNION ALL
...
;

That way, each part of the union will use the *sortedmerge* optimization, and hopefully, if the optimizer is smart enough, it will be able to prune the parts of the union that don't match when you filter on several days.
I haven't tested it, so I can't tell whether it works or not; it is just an idea. If Hive has a limit on the maximum resolved query size, a view like this might reach it, though.
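Concretely, the view could look something like this (an untested sketch: the view name is made up, and I list the columns explicitly because a plain * would also pull in the ds partition column from each of the three tables, giving duplicate column names in the view):

-- Hypothetical view name; one SELECT branch per existing partition.
-- The DDL would need to be regenerated whenever partitions are added or dropped.
CREATE VIEW all_days AS
SELECT '2019-10-01' AS ds, a.id, a.c1, a.c2, b.c3, b.c4, c.c5, c.c6
FROM test_1 a
JOIN test_2 b ON a.id = b.id
JOIN test_3 c ON c.id = a.id
WHERE a.ds = '2019-10-01' AND b.ds = '2019-10-01' AND c.ds = '2019-10-01'
UNION ALL
SELECT '2019-10-02' AS ds, a.id, a.c1, a.c2, b.c3, b.c4, c.c5, c.c6
FROM test_1 a
JOIN test_2 b ON a.id = b.id
JOIN test_3 c ON c.id = a.id
WHERE a.ds = '2019-10-02' AND b.ds = '2019-10-02' AND c.ds = '2019-10-02'
;

A query such as SELECT * FROM all_days WHERE ds = '2019-10-01' should then only touch the first branch, provided the optimizer prunes the others.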
Hope this helps.

Furcy

On Wed, 2 Oct 2019 at 11:09, Pau Tallada <tall...@pic.es> wrote:

> Hi,
>
> I would say the most efficient way would be option (3), where all the
> subtables are partitioned by date and clustered + *sorted* by id.
> This way, efficient SMB map joins can be performed over the 10 tables of
> the same partition.
>
> Unfortunately, I haven't found a way to achieve SMB map joins *over more
> than one* partition :(
>
> Example:
>
> CREATE TABLE test_1 (id INT, c1 FLOAT, c2 FLOAT)
> PARTITIONED BY (ds STRING)
> CLUSTERED BY (id)
> SORTED BY (id ASC)
> INTO 4 BUCKETS
> STORED AS ORC;
>
> CREATE TABLE test_2 (id INT, c3 FLOAT, c4 FLOAT)
> PARTITIONED BY (ds STRING)
> CLUSTERED BY (id)
> SORTED BY (id ASC)
> INTO 4 BUCKETS
> STORED AS ORC;
>
> CREATE TABLE test_3 (id INT, c5 FLOAT, c6 FLOAT)
> PARTITIONED BY (ds STRING)
> CLUSTERED BY (id)
> SORTED BY (id ASC)
> INTO 4 BUCKETS
> STORED AS ORC;
>
> Over these tables, one can perform efficient single-stage SMB map joins,
> *if you filter on a single partition*:
>
> set hive.execution.engine=tez;
> set hive.enforce.sortmergebucketmapjoin=false;
> set hive.optimize.bucketmapjoin=true;
> set hive.optimize.bucketmapjoin.sortedmerge=true;
> set hive.auto.convert.sortmerge.join=true;
> set hive.auto.convert.join=true;
> set hive.auto.convert.join.noconditionaltask.size=0;
>
> EXPLAIN
> SELECT *
> FROM test_1 a
> JOIN test_2 b
> ON a.id = b.id AND a.ds = b.ds
> JOIN test_3 c
> ON b.id = c.id AND b.ds = c.ds
> WHERE a.ds = '2019-10-01'
> ;
>
> When you query two partitions, however, it does a shuffle :(
>
> set hive.execution.engine=tez;
> set hive.enforce.sortmergebucketmapjoin=false;
> set hive.optimize.bucketmapjoin=true;
> set hive.optimize.bucketmapjoin.sortedmerge=true;
> set hive.auto.convert.sortmerge.join=true;
> set hive.auto.convert.join=true;
> set hive.auto.convert.join.noconditionaltask.size=0;
>
> EXPLAIN
> SELECT *
> FROM test_1 a
> JOIN test_2 b
> ON a.id = b.id AND a.ds = b.ds
> JOIN test_3 c
> ON b.id = c.id AND b.ds = c.ds
> WHERE a.ds IN ('2019-10-01', '2019-10-02')
> ;
>
> My problem is very similar, so let's hope someone out there has the
> answer :)
>
> Cheers,
>
> Pau.
>
> On Tue, 1 Oct 2019 at 8:48, Saurabh Santhosh <saurabhsanth...@gmail.com> wrote:
>
>> Hi,
>>
>> I am facing the following problem while trying to store/use a huge
>> partitioned table with 1000+ columns in Hive. I would like to know how
>> to solve this problem either using Hive or any other store.
>>
>> Requirements:
>>
>> 1). There is a table with around 1000+ columns which is partitioned by
>> date.
>> 2). Every day consists of data about around 500 million entities. There
>> is an id column with the id of the entity and around 1000+ columns
>> which represent the attributes of the given entity for each day.
>> 3). We have to store data for around 2 years.
>> 4). New columns may be added, or the logic of an existing column may be
>> changed, on any day, and when this happens we have to populate data for
>> the given column for the last 2 years.
>>
>> Our Solution 1:
>>
>> 1). We created a table with 1000+ columns, partitioned by date.
>> 2). Every day we create a new partition and delete partitions older
>> than 2 years.
>>
>> Problems Faced in Solution 1:
>>
>> Whenever we had to add or modify certain columns, the backfill took a
>> very long time: months to backfill the data for 2 years, because of the
>> large amount of IO from reading and rewriting each partition.
>>
>> Our Solution 2:
>>
>> 1). We created 10 tables with around 100+ columns each, each of them
>> partitioned by date.
>> 2). Every day we create a new partition in each of the small tables and
>> delete partitions older than 2 years.
>> 3). We created a view which is a join between all the tables, with (id,
>> date) as the join key.
>>
>> Problems Faced in Solution 2:
>>
>> The backfill time was considerably reduced, from months to weeks, as we
>> only need to refresh the small table that contains the columns to be
>> backfilled, reducing the IO drastically.
>> But this led to very slow queries on top of the view. Especially when
>> we query 6 months of data, the queries take more than 10 hrs due to the
>> shuffling of data.
>>
>> Our Solution 3:
>>
>> 1). We also tried bucketing each small table on the id column, but this
>> did not give us the desired result, as the shuffling still happened.
>>
>> Can anyone suggest the best approach for the above scenario?
>
> --
> ----------------------------------
> Pau Tallada Crespí
> Dep. d'Astrofísica i Cosmologia
> Port d'Informació Científica (PIC)
> Tel: +34 93 170 2729
> ----------------------------------