I don't think bz2 files are splittable here, so each file ends up being read
by a single task. Compression on SequenceFiles is splittable at SequenceFile
block boundaries.
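
To illustrate why block-level compression can be split, here is a toy Python
sketch. It is not Hadoop's SequenceFile format or API; the helper names
(write_blocks, read_block), the in-memory index, and the block size are all
assumptions made up for the example. The idea it shows is only that when each
block is compressed independently, a reader can start at any block boundary
without decompressing what came before it.

```python
import bz2


def write_blocks(records, records_per_block):
    """Compress records in independent bz2 blocks.

    Returns the concatenated bytes plus an index of (offset, length) pairs,
    one per block. Hypothetical helper, not a Hadoop API.
    """
    blob = bytearray()
    index = []
    for start in range(0, len(records), records_per_block):
        chunk = "\n".join(records[start:start + records_per_block])
        compressed = bz2.compress(chunk.encode("utf-8"))
        index.append((len(blob), len(compressed)))
        blob.extend(compressed)
    return bytes(blob), index


def read_block(blob, offset, length):
    """Decompress a single block without touching any other block."""
    data = bz2.decompress(blob[offset:offset + length])
    return data.decode("utf-8").split("\n")


records = [f"record-{i}" for i in range(10000)]
blob, index = write_blocks(records, 2500)

# Each (offset, length) pair could be handed to a separate task.
third_block = read_block(blob, *index[2])
```

A single large .bz2 file handled as one stream has no such index available to
the job, which is consistent with one long-running task per file.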

Ashish 

-----Original Message-----
From: Bill Craig [mailto:bcra...@gmail.com] 
Sent: Tuesday, July 21, 2009 8:06 AM
To: hive-user@hadoop.apache.org
Subject: bz2 Splits.

I loaded 5 files of bzip2-compressed data into a table in Hive. Three are small
test files containing 10,000 records. Two are large, ~8 GB compressed.
When I run a query against the table, I see three tasks that complete almost
immediately and two tasks that run for a very long time. It appears to me that
Hive/Hadoop is not splitting the input of the *.bz2 files. I have seen some old
mails about this, but could not find any resolution for this problem. I
compressed the files using the Apache bz2 jar, and the files are named *.bz2.
I am using Hadoop 0.19.1 r745977.
