I don't think bz2 files are splittable here, so each file ends up being read by a single task. Compression on SequenceFiles, by contrast, is splittable across SequenceFile blocks.
Ashish

-----Original Message-----
From: Bill Craig [mailto:bcra...@gmail.com]
Sent: Tuesday, July 21, 2009 8:06 AM
To: hive-user@hadoop.apache.org
Subject: bz2 Splits.

I loaded 5 files of bzip2-compressed data into a table in Hive. Three are small test files containing 10,000 records. Two are large, ~8 GB compressed. When I run a query against the table, I see three tasks that complete almost immediately and two tasks that run for a very long time. It appears to me that Hive/Hadoop is not splitting the input of the *.bz2 files. I have seen some old mails about this, but could not find any resolution for this problem.

I compressed the files using the Apache bz2 jar; the files are named *.bz2. I am using Hadoop 0.19.1 r745977.
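
To illustrate the point in the reply about block-level compression, here is a minimal Python sketch of the general idea. It is not Hadoop's SequenceFile format or API: the function names `write_blocks` and `read_block` and the explicit offset index are assumptions made up for this illustration (SequenceFiles locate block boundaries with sync markers instead). The sketch only shows why data compressed in independent blocks can be divided among several readers.

```python
import bz2


def write_blocks(records, records_per_block):
    """Compress records in independent bz2 blocks.

    Returns the concatenated bytes and an index of (offset, length)
    pairs, one per block. Hypothetical helper for illustration only.
    """
    blob = bytearray()
    index = []
    for start in range(0, len(records), records_per_block):
        chunk = "\n".join(records[start:start + records_per_block])
        compressed = bz2.compress(chunk.encode("utf-8"))
        index.append((len(blob), len(compressed)))
        blob.extend(compressed)
    return bytes(blob), index


def read_block(blob, offset, length):
    """Decompress one block without touching any other block."""
    data = bz2.decompress(blob[offset:offset + length])
    return data.decode("utf-8").split("\n")


records = [f"record-{n}" for n in range(10)]
blob, index = write_blocks(records, records_per_block=4)

# Each (offset, length) entry could be handed to a different worker.
for offset, length in index:
    print(read_block(blob, offset, length))
```

Because every block is a complete compressed stream of its own, a reader can start at any block boundary. A reader handed a byte range from the middle of one large compressed stream has no such self-contained starting point unless the input format knows how to find one.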