Thanks for the 'small files' heads-up. The current LOGDATE file is about 7.2 GB gzipped, so I don't think we'll run into many small files when that's split into 24 hours. But thanks for the reminder all the same.
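A quick back-of-the-envelope check supports that (a rough sketch; it assumes the day's volume is spread evenly across hours and a default 128 MB HDFS block, so check dfs.blocksize on your own cluster):

```python
# Rough per-hour partition size if a 7.2 GB gzipped day splits evenly into 24 hours.
daily_gzipped_gb = 7.2
hours_per_day = 24
per_hour_mb = daily_gzipped_gb * 1024 / hours_per_day
print(f"~{per_hour_mb:.0f} MB per hourly partition")  # ~307 MB

hdfs_block_mb = 128  # common default block size; yours may differ
print(per_hour_mb > hdfs_block_mb)  # True -> hourly files stay above block size
```

So even hourly partitions land at a couple of block sizes each, well clear of small-files territory.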
I guess I'd rather just reload the 2,500 or so logs than have to jump through hoops...

On Tue, Jun 19, 2012 at 10:45 AM, Mark Grover <mgro...@oanda.com> wrote:
> I agree.
> It would be best if you wrote a script that iterates through each
> leaf-level partition of your existing table (logdate='A', source='B',
> datacenter='C', hostname='D') and populates the corresponding leaf-level
> partitions in the new table (logdate='A', source='B', datacenter='C',
> hostname='D', loghour).
>
> By leaf-level partition, I am referring to the partition column that
> appears last in the list of partition columns.
>
> Technically, you can do dynamic partitioning in one query with something
> like:
>
> FROM test_table src
> INSERT OVERWRITE TABLE dest PARTITION(logdate, source, datacenter, hostname, loghour)
> SELECT ts, exec_time, domain_id, domain_name, logdate, source, datacenter, hostname, loghour;
>
> However, as far as I understand, you wouldn't be able to take advantage of
> the existing partitioning in your source table (test_table). If you would
> like to take advantage of the existing partitioning, you would have to
> issue a series of queries like this (note that with the static partition
> spec, only the dynamic column loghour belongs in the SELECT list):
>
> FROM test_table src
> INSERT OVERWRITE TABLE dest PARTITION(logdate='A', source='B', datacenter='C', hostname='D', loghour)
> SELECT ts, exec_time, domain_id, domain_name, loghour
> WHERE logdate='A' AND source='B' AND datacenter='C' AND hostname='D';
>
> Just as a side note, whenever you think about your table partitioning,
> keep in mind not to overdo it. Creating more partitions can lead to a lot
> of small files on HDFS, which reduces the performance of your Hadoop
> cluster.
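The per-partition script Mark suggests could be sketched roughly like this in Python. This is a hypothetical helper, not tested against a cluster: the table names (test_table, dest) and column list come from this thread, and the partition strings are assumed to look like Hive's SHOW PARTITIONS output (key=value pairs joined by '/'):

```python
# Generate one static-partition INSERT per existing leaf partition, so each
# query only scans that partition's data instead of the whole source table.
# A partition line is assumed to look like:
#   "logdate=2012-06-19/source=app/datacenter=dc1/hostname=web01"
def make_insert(partition_line):
    parts = dict(kv.split("=", 1) for kv in partition_line.split("/"))
    static = ", ".join(f"{k}='{v}'" for k, v in parts.items())
    where = " AND ".join(f"{k}='{v}'" for k, v in parts.items())
    return (
        "FROM test_table src "
        f"INSERT OVERWRITE TABLE dest PARTITION({static}, loghour) "
        "SELECT ts, exec_time, domain_id, domain_name, loghour "
        f"WHERE {where};"
    )

# One line of SHOW PARTITIONS output -> one INSERT statement to run.
print(make_insert("logdate=2012-06-19/source=app/datacenter=dc1/hostname=web01"))
```

If I recall correctly, dynamic partition inserts also need `SET hive.exec.dynamic.partition=true;` in the session, and the all-dynamic variant additionally needs `SET hive.exec.dynamic.partition.mode=nonstrict;` since it has no static partition column.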
> A couple of people have talked about this small files problem:
>
> http://arunxjacob.blogspot.ca/2011/04/hdfs-file-size-vs-allocation-other.html
> http://www.cloudera.com/blog/2009/02/the-small-files-problem/
> http://blog.rapleaf.com/dev/2008/11/20/give-me-liberty-or-give-me-death-but-dont-give-me-small-files/
>
> In general, you would like your file sizes to be at least the HDFS block
> size, most likely a small multiple of the block size. If you do find that
> you are running into the small files problem, there are other ways to get
> around it, like bucketing.
>
> Good luck!
> Mark
>
> ----- Original Message -----
> From: "Edward Capriolo" <edlinuxg...@gmail.com>
> To: user@hive.apache.org
> Sent: Tuesday, June 19, 2012 11:12:48 AM
> Subject: Re: Please HELP: HIVE alter table add new partition to schema...
>
> You cannot change the partition columns. I would use a dynamic
> partition insert to select all the data from the original table into
> the new table.
>
> On 6/19/12, Tim Havens <timhav...@gmail.com> wrote:
> > So...I have a table that has thousands of files and billions of rows
> > related to it.
> >
> > Let's make this a simple table:
> >
> > CREATE TABLE test_table (
> >     ts BIGINT,
> >     exec_time DOUBLE,
> >     domain_id BIGINT,
> >     domain_name STRING
> > )
> > PARTITIONED BY (logdate STRING, source STRING, datacenter STRING, hostname STRING)
> > ROW FORMAT DELIMITED
> >     FIELDS TERMINATED BY '\t'
> >     COLLECTION ITEMS TERMINATED BY '\001'
> >     MAP KEYS TERMINATED BY '\002'
> >     LINES TERMINATED BY '\n'
> > STORED AS TEXTFILE;
> >
> > So...what I need to do is ADD a partition column to the PARTITIONED BY
> > spec above.
> >
> > The partition column I want to add is 'loghour STRING'.
> >
> > I can't seem to find any way to add a NEW PARTITION COLUMN to the
> > 'PARTITIONED BY' spec without completely recreating and reloading the
> > table.
> >
> > What's the correct way of adding a new column like 'loghour STRING' to
> > the partition schema?
> >
> > I'm not trying to add an entry into the table DATA, I'm trying to add
> > a completely new PARTITIONED BY column...

--
"The whole world is you. Yet you keep thinking there is something else." - Xuefeng Yicun 822-902 A.D.

Tim R. Havens
Google Phone: 573.454.1232
ICQ: 495992798
ICBM: 37°51'34.79"N 90°35'24.35"W
ham radio callsign: NW0W