Re: Intermittent BindException during long MR jobs

2015-03-25 Thread Krishna Rao
Thanks for the responses. In our case the port is 0, so according to the link
Ted mentioned (http://wiki.apache.org/hadoop/BindException), a collision is
highly unlikely:

If the port is 0, then the OS is looking for any free port - so the
port-in-use and port-below-1024 problems are highly unlikely to be the
cause of the problem.

I think load may be the culprit, since the nodes are heavily used at the
times the exception occurs.

Is there any way to set or increase the timeout for the call/connection
attempt? In all cases so far it seems to happen on a call to delete a file in
HDFS. I searched through the HDFS code base but couldn't see an obvious way
to set a timeout, nor could I see one being set.
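
For what it's worth, the knobs I'm planning to experiment with are the IPC
client connect settings. This is a sketch only: the values are illustrative,
these are the Hadoop 2.x property names (the first is the per-attempt connect
timeout in milliseconds, the other two are retry counts), and whether a
per-session SET from Hive reaches the DFS client is an assumption on my part;
putting the same values in core-site.xml is presumably the safer route:

-- sketch only: illustrative values, Hadoop 2.x property names
SET ipc.client.connect.timeout=40000;
SET ipc.client.connect.max.retries.on.timeouts=60;
SET ipc.client.connect.max.retries=20;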

Krishna

On 28 February 2015 at 15:20, Ted Yu yuzhih...@gmail.com wrote:

 Krishna:
 Please take a look at:
 http://wiki.apache.org/hadoop/BindException

 Cheers

 On Thu, Feb 26, 2015 at 10:30 PM, hadoop.supp...@visolve.com wrote:

 Hello Krishna,

 The exception seems to be IP specific. It might have occurred because no IP
 address was available on the system to assign. Double-check the IP address
 availability and run the job.

 Thanks,
 S.RagavendraGanesh

 ViSolve Hadoop Support Team
 ViSolve Inc. | San Jose, California
 Website: www.visolve.com
 email: servi...@visolve.com | Phone: 408-850-2243

 From: Krishna Rao [mailto:krishnanj...@gmail.com]
 Sent: Thursday, February 26, 2015 9:48 PM
 To: user@hive.apache.org; u...@hadoop.apache.org
 Subject: Intermittent BindException during long MR jobs

 Hi,

 We occasionally run into a BindException that causes long-running jobs to fail.

 The stacktrace is below.

 Any ideas what this could be caused by?

 Cheers,

 Krishna

Intermittent BindException during long MR jobs

2015-02-26 Thread Krishna Rao
Hi,

We occasionally run into a BindException that causes long-running jobs to
fail.

The stacktrace is below.

Any ideas what this could be caused by?

Cheers,

Krishna


Stacktrace:
379969 [Thread-980] ERROR org.apache.hadoop.hive.ql.exec.Task  - Job
Submission failed with exception 'java.net.BindException(Problem binding to
[back10/10.4.2.10:0] java.net.BindException: Cannot assign requested address;
For more details see: http://wiki.apache.org/hadoop/BindException)'
java.net.BindException: Problem binding to [back10/10.4.2.10:0]
java.net.BindException: Cannot assign requested address; For more details
see:  http://wiki.apache.org/hadoop/BindException
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:718)
at org.apache.hadoop.ipc.Client.call(Client.java:1242)
at
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
at com.sun.proxy.$Proxy10.create(Unknown Source)
at
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:193)
at sun.reflect.GeneratedMethodAccessor43.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
at com.sun.proxy.$Proxy11.create(Unknown Source)
at
org.apache.hadoop.hdfs.DFSOutputStream.init(DFSOutputStream.java:1376)
at
org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:1395)
at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1255)
at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1212)
at
org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:276)
at
org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:265)
at
org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:82)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:888)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:869)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:768)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:757)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:558)
at
org.apache.hadoop.mapreduce.split.JobSplitWriter.createFile(JobSplitWriter.java:96)
at
org.apache.hadoop.mapreduce.split.JobSplitWriter.createSplitFiles(JobSplitWriter.java:85)
at
org.apache.hadoop.mapreduce.JobSubmitter.writeOldSplits(JobSubmitter.java:517)
at
org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:487)
at
org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:369)
at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1286)
at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1283)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1438)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1283)
at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:606)
at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:601)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1438)
at
org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:601)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:586)
at
org.apache.hadoop.hive.ql.exec.ExecDriver.execute(ExecDriver.java:448)
at
org.apache.hadoop.hive.ql.exec.MapRedTask.execute(MapRedTask.java:138)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:138)
at
org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:66)
at org.apache.hadoop.hive.ql.exec.TaskRunner.run(TaskRunner.java:56)


Reduce the amount of logging going into /var/log/hive/userlogs

2014-06-13 Thread Krishna Rao
Last time I looked there wasn't much info available on how to reduce the
size of the logs written here (the only suggestion being to delete them after
a day).

Is there anything I can do now to reduce what's logged there in the first
place?
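
For example, would something like the following be a sensible way to cut down
what the tasks write? A sketch only: WARN is just an illustrative level, and
the property names are the Hadoop 2.x ones (MR1 would use
mapred.map.child.log.level / mapred.reduce.child.log.level).

-- sketch: lower task log verbosity (illustrative level)
SET mapreduce.map.log.level=WARN;
SET mapreduce.reduce.log.level=WARN;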

Cheers,

Krishna


Hive query parser bug resulting in FAILED: NullPointerException null

2014-02-27 Thread Krishna Rao
Hi all,

we've experienced a bug which seems to be caused by a query constraint
involving partitioned columns. The following query returns 'FAILED:
NullPointerException null' almost instantly:

EXPLAIN SELECT
  col1
FROM
  tbl1
WHERE
(part_col1 = 2014 AND part_col2 = 2)
OR part_col1 < 2014;

The exception doesn't happen if any of the conditions are removed. The
table is defined like the following:

CREATE TABLE tbl1 (
  col1    STRING,
  ...
  col12   STRING
)
PARTITIONED BY (part_col1 INT, part_col2 TINYINT, part_col3 TINYINT)
STORED AS SEQUENCEFILE;


Unfortunately I cannot construct a test case to replicate this. Seeing as it
appears to be a query parser bug, I thought the following would replicate it:

CREATE TABLE tbl2 LIKE tbl1;
EXPLAIN SELECT
  col1
FROM
  tbl2
WHERE
(part_col1 = 2014 AND part_col2 = 2)
OR part_col1 < 2014;

But it does not. Could it somehow be data specific? Does the query parser
use partition information?
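
One thing I plan to check (a sketch; tbl1/tbl2 as defined above) is whether
the partition metadata itself is the trigger, since a table created with LIKE
starts with no partitions registered:

-- compare partition metadata between the failing table and the LIKE copy
SHOW PARTITIONS tbl1;
SHOW PARTITIONS tbl2;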

Are there any logs I could see to investigate this further? Or is this a
known bug?

We're using hive 0.10.0-cdh4.4.0.


Cheers,

Krishna


Failed to report status for x minutes

2013-11-29 Thread Krishna Rao
Hi all,

We've been running into this problem a lot recently on a particular reduce
task. I'm aware that I can work around it by upping mapred.task.timeout.

However, I would like to know what the underlying problem is. How can I
find this out?

Alternatively, can I force a generated hive task to report a status, maybe
just increment a counter?
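
For reference, the workaround mentioned above is just a per-session bump along
these lines (a sketch; the value is in milliseconds, 20 minutes purely as an
illustration, and MR2/YARN names the property mapreduce.task.timeout instead):

-- sketch: 20 minutes, in milliseconds (MR1 property name)
SET mapred.task.timeout=1200000;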

Cheers,

Krishna


Add external jars automatically

2013-03-13 Thread Krishna Rao
Hi all,

I'm using the Hive JSON SerDe and need to run ADD JAR
/usr/lib/hive/lib/hive-json-serde-0.2.jar; before I can use tables that
require it.

Is it possible to have this jar available automatically?

I could do it by adding the statement to a .hiverc file, but I was wondering
if there is a better way...
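
For completeness, the .hiverc approach is just a file of Hive statements run
at the start of every CLI session; a minimal sketch with the jar above (this
can be the per-user ~/.hiverc, or the shared one discussed elsewhere on the
list):

-- contents of .hiverc
ADD JAR /usr/lib/hive/lib/hive-json-serde-0.2.jar;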

Cheers,

Krishna


Re: Add external jars automatically

2013-03-13 Thread Krishna Rao
Ah great, the auxlib dir option sounds perfect.

Cheers


On 13 March 2013 17:41, Alex Kozlov ale...@cloudera.com wrote:

 If you look into the ${HIVE_HOME}/bin/hive script there are multiple ways to
 add the jar. One of my favorites, besides the .hiverc file, has been to put
 the jar into the ${HIVE_HOME}/auxlib dir. There is always the
 HIVE_AUX_JARS_PATH environment variable (but this introduces a dependency
 on the environment).


 On Wed, Mar 13, 2013 at 10:26 AM, Krishna Rao krishnanj...@gmail.comwrote:

 Hi all,

 I'm using the Hive JSON SerDe and need to run ADD JAR
 /usr/lib/hive/lib/hive-json-serde-0.2.jar; before I can use tables that
 require it.

 Is it possible to have this jar available automatically?

 I could do it by adding the statement to a .hiverc file, but I was wondering
 if there is a better way...

 Cheers,

 Krishna





NoClassDefFoundError: org/apache/hadoop/mapreduce/util/HostUtil

2013-02-07 Thread Krishna Rao
Hi all,

I'm occasionally getting the following error, usually after running an
expensive Hive query (creating 20 or so MR jobs):

***
Error during job, obtaining debugging information...
Examining task ID: task_201301291405_1640_r_01 (and more) from job
job_201301291405_1640
Exception in thread Thread-29 java.lang.NoClassDefFoundError:
org/apache/hadoop/mapreduce/util/HostUtil
at
org.apache.hadoop.hive.shims.Hadoop23Shims.getTaskAttemptLogUrl(Hadoop23Shims.java:51)
at
org.apache.hadoop.hive.ql.exec.JobDebugger$TaskInfoGrabber.getTaskInfos(JobDebugger.java:186)
at
org.apache.hadoop.hive.ql.exec.JobDebugger$TaskInfoGrabber.run(JobDebugger.java:142)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.ClassNotFoundException:
org.apache.hadoop.mapreduce.util.HostUtil
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
... 4 more
CmdRunner::runCmd: Error running cmd in script, error: FAILED: Execution
Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
***

Any ideas on what's causing it?
How can I find out more info on this error?

Cheers,

Krishna


Find out what's causing an InvalidOperationException

2013-01-09 Thread Krishna Rao
Hi all,

On running a statement of the form INSERT INTO TABLE tbl1 PARTITION(p1)
SELECT x1 FROM tbl2, I get the following error:

Failed with exception java.lang.ClassCastException:
org.apache.hadoop.hive.metastore.api.InvalidOperationException cannot be
cast to java.lang.RuntimeException

How can I find out what is causing this error? I.e. which logs should I
look at?

Cheers,

Krishna


Re: Find out what's causing an InvalidOperationException

2013-01-09 Thread Krishna Rao
The data types are the same. In fact, the statement works the first time,
but not the second (I change a WHERE constraint to give different data).

I presume it is some invalid data, but is there any way to find a clue in a
log file?


On 9 January 2013 13:21, Nitin Pawar nitinpawar...@gmail.com wrote:

 Can you give the table definitions of both tables?

 Are both columns of the same type?


 On Wed, Jan 9, 2013 at 5:15 AM, Krishna Rao krishnanj...@gmail.comwrote:

 Hi all,

 On running a statement of the form INSERT INTO TABLE tbl1 PARTITION(p1)
 SELECT x1 FROM tbl2, I get the following error:

 Failed with exception java.lang.ClassCastException:
 org.apache.hadoop.hive.metastore.api.InvalidOperationException cannot be
 cast to java.lang.RuntimeException

 How can I find out what is causing this error? I.e. which logs should I
 look at?

 Cheers,

 Krishna





 --
 Nitin Pawar



Re: Job counters limit exceeded exception

2013-01-04 Thread Krishna Rao
I ended up increasing the counters limit to 130 which solved my issue.
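
For the record, a sketch of the sort of change involved (property name as in
the original mail; depending on the Hadoop/MR version a per-session SET may
not be enough and the limit may also need raising in mapred-site.xml
cluster-side, and newer MR2 releases name it mapreduce.job.counters.max):

-- sketch only; see caveats above
SET mapreduce.job.counters.limit=130;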

Do you know of any good sources for learning how to decipher Hive's EXPLAIN
output?

Cheers,

Krishna


On 2 January 2013 11:20, Alexander Alten-Lorenz wget.n...@gmail.com wrote:

 Hi,

 This happens when many operators are used in queries (Hive operators). Hive
 creates 4 counters per operator, to a maximum of 1000, plus a few additional
 counters for things like file read/write, partitions and tables. Hence the
 number of counters required depends upon the query.

 Using EXPLAIN EXTENDED and grep -ri operators | wc -l prints out the
 number of operators used. Use this value to tweak the MR settings
 carefully.

 Praveen has a good explanation about counters online:

 http://www.thecloudavenue.com/2011/12/limiting-usage-counters-in-hadoop.html

 Rule of thumb for Hive:
 count of operators * 4 + n (n for file ops and other stuff).

 cheers,
  Alex


 On Jan 2, 2013, at 10:35 AM, Krishna Rao krishnanj...@gmail.com wrote:

  A particular query that I run fails with the following error:
 
  ***
  Job 18: Map: 2  Reduce: 1   Cumulative CPU: 3.67 sec   HDFS Read: 0 HDFS
  Write: 0 SUCCESS
  Exception in thread main
  org.apache.hadoop.mapreduce.counters.LimitExceededException: Too many
  counters: 121 max=120
  ...
  ***
 
  Googling suggests that I should increase mapreduce.job.counters.limit, and
  that the number of counters a job uses affects the memory used by the
  JobTracker, so I shouldn't increase this number too high.
 
  Is there a rule of thumb for what this number should be as a function of
  JobTracker memory? That is, should I be cautious and increase it by 5 at a
  time, or could I just double it?
 
  Cheers,
 
  Krishna

 --
 Alexander Alten-Lorenz
 http://mapredit.blogspot.com
 German Hadoop LinkedIn Group: http://goo.gl/N8pCF




Re: Possible to set map/reduce log level in configuration file?

2012-12-18 Thread Krishna Rao
On 18 December 2012 02:05, Mark Grover grover.markgro...@gmail.com wrote:

 I usually put it in my home directory and that works. Did you try that?


I need it to work for all users. So the cleanest non-duplicating solution
seems to be to put it in the Hive bin directory (and then the conf dir, once
I upgrade Hive).


Re: Possible to set map/reduce log level in configuration file?

2012-12-17 Thread Krishna Rao
Thanks for the replies.

I went for the hiverc option. Unfortunately, with the version of Hive I'm
using, it meant I had to place the file in the bin directory. Our sysadmin
was not pleased, but it looks like that issue is fixed in a later version
of Hive (https://issues.apache.org/jira/browse/HIVE-2911)!
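
For anyone hitting the same thing, the file itself is tiny; a sketch only,
with the properties from the original question and an example level:

-- hiverc placed in $HIVE_HOME/bin/ for this Hive version
-- (per-user ~/.hiverc also works; the conf dir once HIVE-2911 is in)
SET mapreduce.map.log.level=WARN;
SET mapreduce.reduce.log.level=WARN;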



On 14 December 2012 17:10, Ted Reynolds t...@hortonworks.com wrote:

 Hi Krishna,

 You can also set these properties in the mapred-site.xml, but this would
 require a restart of your cluster.

 Ted.

 Ted Reynolds
 Technical Support Engineer
 Hortonworks
 Work Phone: 408-645-7079

 http://hortonworks.com/download/



 On Fri, Dec 14, 2012 at 2:44 AM, Nitin Pawar nitinpawar...@gmail.comwrote:

 If you want this to be set for every query you execute, the best option would
 be to have a hiverc file and include it with hive -i hiverc.

 Alternatively, you can create a .hiverc in your home directory and set the
 parameters you want; these will be included in each session.


 On Fri, Dec 14, 2012 at 4:05 PM, Krishna Rao krishnanj...@gmail.comwrote:

 Hi all,

 is it possible to set mapreduce.map.log.level and mapreduce.reduce.log.level
 within some config file?

 At the moment I have to remember to set these at the start of a hive
 session, or script.

 Cheers,

 Krishna




 --
 Nitin Pawar





Possible to set map/reduce log level in configuration file?

2012-12-14 Thread Krishna Rao
Hi all,

is it possible to set mapreduce.map.log.level and mapreduce.reduce.log.level
within some config file?

At the moment I have to remember to set these at the start of a hive
session, or script.

Cheers,

Krishna


Problems Sqoop importing columns with NULLs

2012-12-04 Thread Krishna Rao
Hi all,

I'm having trouble transferring NULLs in a VARCHAR column from a table in
PostgreSQL into Hive. A null value ends up as an empty string in Hive,
rather than NULL.

I'm running the following command:

sqoop import --username username -P --hive-import --hive-overwrite
--null-string='\\N' --null-non-string '\\N' --direct --compression-codec
org.apache.hadoop.io.compress.SnappyCodec

I'm using Sqoop version 1.4.1 and Hive 0.9.0.

Cheers


Re: Problems Sqoop importing columns with NULLs

2012-12-04 Thread Krishna Rao
Thanks Jarek. Good to hear it's at least a known issue.
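
In the meantime, a possible Hive-side stopgap (a sketch only; the table name
is hypothetical and it assumes the imported table uses the default
text-file/LazySimpleSerDe storage) is to tell Hive to read the empty strings
back as NULLs:

-- hypothetical table name; read empty fields back as NULL
ALTER TABLE imported_tbl SET SERDEPROPERTIES ('serialization.null.format'='');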

On 4 December 2012 17:20, Jarek Jarcec Cecho jar...@apache.org wrote:

 Hi Krishna,
 I'm afraid that this is a known limitation of the current PostgreSQL direct
 connector. We already have a JIRA to address this - SQOOP-654 [1].

 The currently suggested workaround is to use a JDBC-based import by dropping
 the --direct argument.

 Links:
 1: https://issues.apache.org/jira/browse/SQOOP-654

 On Tue, Dec 04, 2012 at 05:04:56PM +, Krishna Rao wrote:
  Hi all,
 
  I'm having trouble transferring NULLs in a VARCHAR column from a table in
  PostgreSQL into Hive. A null value ends up as an empty string in Hive,
  rather than NULL.
 
  I'm running the following command:
 
  sqoop import --username username -P --hive-import --hive-overwrite
  --null-string='\\N' --null-non-string '\\N' --direct --compression-codec
  org.apache.hadoop.io.compress.SnappyCodec
 
  I'm using Sqoop version 1.4.1 and Hive 0.9.0.
 
  Cheers



Re: Hive compression with external table

2012-11-06 Thread Krishna Rao
Thanks for the reply. Block-compressed SequenceFiles might work. However,
it's not clear to me whether it's possible to read SequenceFiles using an
external table.
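
For what it's worth, this is the kind of thing I have in mind to test (a
sketch only; the table name, columns and path are made up, and it assumes the
files are block-compressed SequenceFiles):

-- hypothetical external table over block-compressed SequenceFiles
CREATE EXTERNAL TABLE events_seq (
  col1 STRING,
  col2 STRING
)
STORED AS SEQUENCEFILE
LOCATION '/data/events_seq';

-- writing block-compressed (e.g. Snappy) SequenceFile output from Hive,
-- so other tools can share the same files
SET hive.exec.compress.output=true;
SET mapred.output.compression.type=BLOCK;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;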

On 5 November 2012 16:04, Edward Capriolo edlinuxg...@gmail.com wrote:

 Compression is a confusing issue. Sequence files that are in block
 format are always splittable regardless of which compression codec is chosen
 for the block. The Programming Hive book has an entire section
 dedicated to the permutations of compression options.

 Edward
 On Mon, Nov 5, 2012 at 10:57 AM, Krishna Rao krishnanj...@gmail.com
 wrote:
  Hi all,
 
  I'm looking into finding a suitable format to store data in HDFS, so that
  it's available for processing by Hive. Ideally I would like to satisfy
 the
  following:
 
  1. store the data in a format that is readable by multiple Hadoop
 projects
  (eg. Pig, Mahout, etc.), not just Hive
  2. work with a Hive external table
  3. store data in a compressed format that is splittable
 
  (1) is a requirement because Hive isn't appropriate for all the problems
  that we want to throw at Hadoop.
 
  (2) is really more of a consequence of (1). Ideally we want the data
 stored
  in some open format that is compressed in HDFS.
  This way we can just point Hive, Pig, Mahout, etc at it depending on the
  problem.
 
  (3) is obviously so it plays well with Hadoop.
 
  Gzip is no good because it is not splittable. Snappy looked promising,
 but
  it is splittable only if used with a non-external Hive table.
  LZO also looked promising, but I wonder whether it is future-proof
  given the licensing issues surrounding it.
 
  So far, the only solution I could find that satisfies all the above
 seems to
  be bzip2 compression, but concerns about its performance make me wary
 about
  choosing it.
 
  Is bzip2 the only option I have? Or have I missed some other compression
  option?
 
  Cheers,
 
  Krishna



Hive compression with external table

2012-11-05 Thread Krishna Rao
Hi all,

I'm looking into finding a suitable format to store data in HDFS, so that
it's available for processing by Hive. Ideally I would like to satisfy the
following:

1. store the data in a format that is readable by multiple Hadoop projects
(eg. Pig, Mahout, etc.), not just Hive
2. work with a Hive external table
3. store data in a compressed format that is splittable

(1) is a requirement because Hive isn't appropriate for all the problems
that we want to throw at Hadoop.

(2) is really more of a consequence of (1). Ideally we want the data stored
in some open format that is compressed in HDFS.
This way we can just point Hive, Pig, Mahout, etc at it depending on the
problem.

(3) is obviously so it plays well with Hadoop.

Gzip is no good because it is not splittable. Snappy looked promising, but
it is splittable only if used with a non-external Hive table.
LZO also looked promising, but I wonder whether it is future-proof
given the licensing issues surrounding it.

So far, the only solution I could find that satisfies all of the above seems
to be bzip2 compression, but concerns about its performance make me wary
about choosing it.

Is bzip2 the only option I have? Or have I missed some other compression
option?

Cheers,

Krishna