I would still need some time to dig deeper into this. Are you using a specific 
distribution? Would it be possible to upgrade to a more recent Hive version?

However, having so many small partitions is bad practice and seriously hurts 
performance. Each partition should contain at least several ORC stripes of data; 
300 rows is definitely too little. When in doubt, fewer partitions are better 
than more, and in this respect Hive is no different from a relational database. 
Subpartitions in particular should be avoided if possible. Partitions are also 
not a good solution when they vary greatly in size. Instead, you should insert 
the data sorted on those columns to get the most out of ORC indexes and bloom 
filters (see the sketch below). Having only string types is also very bad for 
performance in any database. Additionally, you can use bucketing.
Alternatively, if your use case is search, you may want to look at Solr or 
Elasticsearch on Hadoop.
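
As a rough illustration of the sorting/bucketing/bloom-filter suggestion, here is 
a minimal HiveQL sketch. The table name (events_orc), the bucket count, and the 
column list are hypothetical stand-ins, not taken from your schema, and ORC bloom 
filters (orc.bloom.filter.columns) are only available in Hive releases newer than 
1.0, which is one more reason to consider upgrading.

-- Hypothetical example: names and bucket count are illustrative, not from the
-- original schemas. Partition coarsely (dt only), then bucket and sort on the
-- high-cardinality column and give it a bloom filter instead of partitioning on it.
CREATE EXTERNAL TABLE IF NOT EXISTS events_orc (
  title      string,
  title_type string,
  field1     string,
  field2     string
)
PARTITIONED BY (dt string)
CLUSTERED BY (title) SORTED BY (title) INTO 32 BUCKETS
STORED AS ORC
TBLPROPERTIES ("orc.bloom.filter.columns"="title,title_type");

-- Writing the data sorted on title keeps ORC stripes and row groups clustered,
-- so the built-in ORC index and the bloom filter can skip most of the file on reads.
SET hive.enforce.bucketing=true;
SET hive.enforce.sorting=true;
INSERT OVERWRITE TABLE events_orc PARTITION (dt='2016-04-05')
SELECT title, title_type, field1, field2
FROM original_table
WHERE dt = '2016-04-05';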


> On 30 Apr 2016, at 03:49, Matt Olson <maolso...@gmail.com> wrote:
> 
> Hi all,
> 
> I am using Hive 1.0.1 and trying to do a simple insert into an ORC table, 
> creating dynamic partitions. I am selecting from a table partitioned by dt 
> and category, and inserting into a table partitioned by dt, title, and 
> title_type. Other than the partitioning, the tables have the same schemas. 
> Both title and title_type are fields in the first table, and when I insert 
> into the second table, I am using them to create dynamic partitions. The .q 
> file with the CREATE and INSERT statements is copied below.
> 
> SET hive.optimize.sort.dynamic.partition=true;
> SET hive.exec.orc.memory.pool=1.0;
> SET hive.exec.max.dynamic.partitions = 5000;
> SET hive.exec.max.dynamic.partitions.pernode = 5000;
> SET hive.merge.mapfiles = true;
> SET mapred.min.split.size=134217728;
> SET mapred.min.split.size.per.node=134217728;
> SET mapred.min.split.size.per.rack=134217728;
> SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzoCodec;
> SET mapred.map.output.compression.codec=com.hadoop.compression.lzo.LzoCodec;
> SET mapred.max.split.size=134217728;
> SET hive.map.aggr.hash.percentmemory=0.125;
> SET hive.exec.parallel=true;
> SET hive.exec.compress.intermediate=true;
> SET hive.exec.compress.output=true;
> SET mapred.map.child.java.opts=-Xmx2048M; 
> SET mapred.child.java.opts=-Xmx2048M;
> SET mapred.task.profile=false;
> 
> CREATE EXTERNAL TABLE IF NOT EXISTS dynamic_partition_table (
> field1 string,
> field2 string,
> ...
> field26 string
> )
> PARTITIONED BY (dt string, title string, title_type string)
> STORED AS ORC 
> LOCATION '/hive/warehouse/partitioned_table'
> TBLPROPERTIES ("orc.compress.size"="16000");
> 
> INSERT OVERWRITE TABLE dynamic_partition_table PARTITION (dt="2016-04-05", 
> title, title_type)
> SELECT 
> field1,
> field2,
> ...
> title,
> title_type
> FROM original_table
> WHERE dt = "2016-04-05";
> 
> The original table has about 250 GB of data for 2016-04-05, and about 260 
> different titles (some titles have very little data, some have ~20 GB). There 
> is generally only one title_type per title. The INSERT action succeeds on 
> that data set, but when I add 2000 new titles with 300 rows each to the 
> original table, I get the following error during the INSERT:
> 
> Container [pid=6278,containerID=container_e26_1460661845156_49295_01_000244] 
> is running beyond physical memory limits. Current usage: 2.2 GB of 2 GB 
> physical memory used; 2.7 GB of 4.2 GB virtual memory used. Killing container.
> 
> I've found a couple questions online about this same error message for ORC 
> files with lots of dynamic partitions, on an older version of Hive:
> https://qnalist.com/questions/4836037/hive-0-12-orc-heap-issues-on-write
> 
> Based on that and the information about configuration properties at 
> https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-ORCFileFormat,
>  I have tried setting hive.exec.orc.memory.pool=1.0 in order to give as much 
> heap space as possible to the ORC file writers. As you can see from the 
> CREATE TABLE statement, I also decreased the orc.compress.size from the 
> default 256 kb to 16 kb. After making these changes, the INSERT is still 
> failing with the "beyond physical memory limits" error.
> 
> I've tried inserting into a table stored as RCFile rather than ORC, and in 
> that case the action succeeds even with the additional 2000 titles.
> 
> Can anyone explain how exactly the two ORC parameters above affect the 
> writing of dynamic partitions in ORC files, and why I'm not getting the OOM 
> error when I use the RCFile format instead?  I'd also appreciate any 
> suggestions for other tuning I could do to fix the memory management when 
> using ORC.
> 
> Thanks for any help,
> Matt
