RE: Kerberized Hive | Remote Access using Keytab

2014-04-22 Thread Savant, Keshav
Hi All,

Can someone provide some information on the problem below?

Kind Regards,
Keshav C Savant

From: Savant, Keshav [mailto:keshav.c.sav...@fisglobal.com]
Sent: Friday, April 18, 2014 3:52 PM
To: user@hive.apache.org
Subject: Kerberized Hive | Remote Access using Keytab

Hi All,

I have successfully Kerberized the CDH5 & Hive. Now I can do a kinit & then 
issue hive queries.

Next I wanted to access hive remotely from standalone java client using keytab 
file so that kinit (or credential prompt) can be avoided.

I have written Java code with the following lines (based on input from a cdh-user
google group thread:
https://groups.google.com/a/cloudera.org/forum/#!topic/cdh-user/S7nPFx0w90U)
to solve the above problem, but after that I am getting a GSS initiate failed
exception.

Configuration conf = new Configuration();
conf.addResource(new java.io.FileInputStream("/installer/hive_jdbc/core-site.xml")); // file placed at this path
SecurityUtil.login(conf, "/path/to/my/keytab/file/user.keytab", "user@domain");

I have also posted the same problem at
https://groups.google.com/a/cloudera.org/forum/#!topic/cdh-user/S7nPFx0w90U;
the sample code & logs are posted there.

As per the Apache Hive wiki page
https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients#HiveServer2Clients-JDBCClientSetupforaSecureCluster,
a valid ticket needs to be present in the ticket cache when connecting to a
kerberized Hive. Can I bypass this & use a keytab for connecting to kerberized
Hive from a standalone Java program?

Kindly provide some input/pointers/examples to solve this.

Kind regards,
Keshav C Savant


Kerberized Hive | Remote Access using Keytab

2014-04-18 Thread Savant, Keshav
Hi All,

I have successfully Kerberized the CDH5 & Hive. Now I can do a kinit & then 
issue hive queries.

Next I wanted to access hive remotely from standalone java client using keytab 
file so that kinit (or credential prompt) can be avoided.

I have written Java code with the following lines (based on input from a cdh-user
google group thread) to solve the above problem, but after that I am getting a
GSS initiate failed exception.

Configuration conf = new Configuration();
conf.addResource(new java.io.FileInputStream("/installer/hive_jdbc/core-site.xml")); // file placed at this path
SecurityUtil.login(conf, "/path/to/my/keytab/file/user.keytab", "user@domain");

I have also posted the same problem at
https://groups.google.com/a/cloudera.org/forum/#!topic/cdh-user/S7nPFx0w90U;
the sample code & logs are posted there.

As per the Apache Hive wiki page
https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients#HiveServer2Clients-JDBCClientSetupforaSecureCluster,
a valid ticket needs to be present in the ticket cache when connecting to a
kerberized Hive. Can I bypass this & use a keytab for connecting to kerberized
Hive from a standalone Java program?

Kindly provide some input/pointers/examples to solve this.
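For context, below is a minimal sketch of the keytab-based approach, using
UserGroupInformation instead of the SecurityUtil call above; the principal, realm,
keytab path, HiveServer2 host, and port are placeholders and would need to be adapted.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

public class KerberosHiveClient {
    public static void main(String[] args) throws Exception {
        // Tell the Hadoop security layer that Kerberos is in use.
        Configuration conf = new Configuration();
        conf.set("hadoop.security.authentication", "kerberos");
        UserGroupInformation.setConfiguration(conf);

        // Log in from the keytab instead of relying on a kinit ticket cache.
        // Principal and keytab path below are placeholders.
        UserGroupInformation.loginUserFromKeytab(
                "user@EXAMPLE.COM", "/path/to/my/keytab/file/user.keytab");

        // The HiveServer2 JDBC URL carries the *server* principal; host, port,
        // and realm here are assumptions for illustration only.
        String url = "jdbc:hive2://hs2-host:10000/default;principal=hive/_HOST@EXAMPLE.COM";
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection(url);
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery("SHOW TABLES")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}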

Kind regards,
Keshav C Savant



RE: Hive 0.11.0 | Issue with ORC Tables

2013-09-20 Thread Savant, Keshav
Hi Nitin,

Thanks for your reply; we were under the impression that the codec would also be
responsible for the ORC format conversion.
However, as per your reply it seems that a conversion from plain CSV to ORC is
required before the data is loaded into Hive.

We got some leads from the following URLs
https://cwiki.apache.org/Hive/languagemanual-orc.html
http://www.math.uic.edu/t3m/SnapPy/installing.html

Please suggest how this can be done using already available libraries, or whether
we need to write our own converter.
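For illustration, one way to do the conversion with Hive itself (no external
converter) is to load the CSV into a text-format staging table and then
INSERT ... SELECT into the ORC table, letting Hive rewrite the rows as ORC. The
sketch below assumes a HiveServer2 endpoint and placeholder table names.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CsvToOrcLoader {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Hypothetical HiveServer2 URL; adjust host/port/database as needed.
        try (Connection con = DriverManager.getConnection("jdbc:hive2://localhost:10000/default");
             Statement stmt = con.createStatement()) {

            // 1. Text-format staging table that matches the CSV layout.
            stmt.execute("CREATE TABLE IF NOT EXISTS person_text (id INT, name STRING) "
                    + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' STORED AS TEXTFILE");

            // 2. ORC table with the same schema.
            stmt.execute("CREATE TABLE IF NOT EXISTS person_orc (id INT, name STRING) "
                    + "STORED AS ORC TBLPROPERTIES (\"orc.compress\"=\"SNAPPY\")");

            // 3. Load the raw CSV into the staging table (no format conversion here).
            stmt.execute("LOAD DATA LOCAL INPATH 'test.txt' INTO TABLE person_text");

            // 4. Hive rewrites the rows as ORC during this insert.
            stmt.execute("INSERT OVERWRITE TABLE person_orc SELECT id, name FROM person_text");
        }
    }
}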

Kind Regards,
Keshav

From: Nitin Pawar [mailto:nitinpawar...@gmail.com]
Sent: Thursday, September 19, 2013 5:56 PM
To: user@hive.apache.org
Subject: Re: Hive 0.11.0 | Issue with ORC Tables

How did you create "test.txt" as ORC file?


On Thu, Sep 19, 2013 at 5:34 PM, Savant, Keshav <keshav.c.sav...@fisglobal.com> wrote:
Hi All,

We have set up Apache Hive 0.11.0 services on a Hadoop cluster (Apache Hadoop 
0.20.203.0). Hive is showing expected results when tables are stored as TextFile.
However, Hive 0.11.0's new ORC (Optimized Row Columnar) feature is throwing an 
exception when we run select queries on tables stored as ORC.
Stacktrace of the exception :

2013-09-19 20:33:38,095 ERROR CliDriver (SessionState.java:printError(386)) - 
Failed with exception 
java.io.IOException:com.google.protobuf.InvalidProtocolBufferException: While 
parsing a protocol message, the input ended unexpectedly in the middle of a 
field.  This could mean either than the input has been truncated or that an 
embedded message misreported its own length.
java.io.IOException: com.google.protobuf.InvalidProtocolBufferException: While 
parsing a protocol message, the input ended unexpectedly in the middle of a 
field.  This could mean either than the input has been truncated or that an 
embedded message misreported its own length.
at 
org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:544)
at 
org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:488)
at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:136)
at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:1412)
at 
org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:271)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:216)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:413)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:756)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:614)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Caused by: com.google.protobuf.InvalidProtocolBufferException: While parsing a 
protocol message, the input ended unexpectedly in the middle of a field.  This 
could mean either than the input has been truncated or that an embedded message 
misreported its own length.
at 
com.google.protobuf.InvalidProtocolBufferException.truncatedMessage(InvalidProtocolBufferException.java:49)
at 
com.google.protobuf.CodedInputStream.readRawBytes(CodedInputStream.java:754)
at 
com.google.protobuf.CodedInputStream.readBytes(CodedInputStream.java:294)
at 
com.google.protobuf.UnknownFieldSet$Builder.mergeFieldFrom(UnknownFieldSet.java:484)
at 
com.google.protobuf.GeneratedMessage$Builder.parseUnknownField(GeneratedMessage.java:438)
at 
org.apache.hadoop.hive.ql.io.orc.OrcProto$PostScript$Builder.mergeFrom(OrcProto.java:10129)
at 
org.apache.hadoop.hive.ql.io.orc.OrcProto$PostScript$Builder.mergeFrom(OrcProto.java:9993)
at 
com.google.protobuf.AbstractMessage$Builder.mergeFrom(AbstractMessage.java:300)
at 
org.apache.hadoop.hive.ql.io.orc.OrcProto$PostScript.parseFrom(OrcProto.java:9970)
at 
org.apache.hadoop.hive.ql.io.orc.ReaderImpl.<init>(ReaderImpl.java:193)
at 
org.apache.hadoop.hive.ql.io.orc.OrcFile.createReader(OrcFile.java:56)
at 
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getRecordReader(OrcInputFormat.java:168)
at 
org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:432)
at 
org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:508)


We did the following steps, which lead to the above exception:

* SET mapred.output.compression.codec= 
org.apache.hadoop.io.compress.SnappyCodec;

* CREATE TABLE person(id INT, name STRING) ROW FORMAT DELIMITED FIELDS 
TERMINATED BY ' ' STORED AS ORC tblproperties ("orc.compress"="Snappy");

* LOAD DATA LOCAL INPATH 'test.txt' INTO TABLE person;

Hive 0.11.0 | Issue with ORC Tables

2013-09-19 Thread Savant, Keshav
Hi All,

We have set up Apache Hive 0.11.0 services on a Hadoop cluster (Apache Hadoop 
0.20.203.0). Hive is showing expected results when tables are stored as TextFile.
However, Hive 0.11.0's new ORC (Optimized Row Columnar) feature is throwing an 
exception when we run select queries on tables stored as ORC.
Stacktrace of the exception :

2013-09-19 20:33:38,095 ERROR CliDriver (SessionState.java:printError(386)) - 
Failed with exception 
java.io.IOException:com.google.protobuf.InvalidProtocolBufferException: While 
parsing a protocol message, the input ended unexpectedly in the middle of a 
field.  This could mean either than the input has been truncated or that an 
embedded message misreported its own length.
java.io.IOException: com.google.protobuf.InvalidProtocolBufferException: While 
parsing a protocol message, the input ended unexpectedly in the middle of a 
field.  This could mean either than the input has been truncated or that an 
embedded message misreported its own length.
at 
org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:544)
at 
org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:488)
at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:136)
at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:1412)
at 
org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:271)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:216)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:413)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:756)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:614)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Caused by: com.google.protobuf.InvalidProtocolBufferException: While parsing a 
protocol message, the input ended unexpectedly in the middle of a field.  This 
could mean either than the input has been truncated or that an embedded message 
misreported its own length.
at 
com.google.protobuf.InvalidProtocolBufferException.truncatedMessage(InvalidProtocolBufferException.java:49)
at 
com.google.protobuf.CodedInputStream.readRawBytes(CodedInputStream.java:754)
at 
com.google.protobuf.CodedInputStream.readBytes(CodedInputStream.java:294)
at 
com.google.protobuf.UnknownFieldSet$Builder.mergeFieldFrom(UnknownFieldSet.java:484)
at 
com.google.protobuf.GeneratedMessage$Builder.parseUnknownField(GeneratedMessage.java:438)
at 
org.apache.hadoop.hive.ql.io.orc.OrcProto$PostScript$Builder.mergeFrom(OrcProto.java:10129)
at 
org.apache.hadoop.hive.ql.io.orc.OrcProto$PostScript$Builder.mergeFrom(OrcProto.java:9993)
at 
com.google.protobuf.AbstractMessage$Builder.mergeFrom(AbstractMessage.java:300)
at 
org.apache.hadoop.hive.ql.io.orc.OrcProto$PostScript.parseFrom(OrcProto.java:9970)
at 
org.apache.hadoop.hive.ql.io.orc.ReaderImpl.<init>(ReaderImpl.java:193)
at 
org.apache.hadoop.hive.ql.io.orc.OrcFile.createReader(OrcFile.java:56)
at 
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getRecordReader(OrcInputFormat.java:168)
at 
org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:432)
at 
org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:508)


We did the following steps, which lead to the above exception:

* SET mapred.output.compression.codec= 
org.apache.hadoop.io.compress.SnappyCodec;

* CREATE TABLE person(id INT, name STRING) ROW FORMAT DELIMITED FIELDS 
TERMINATED BY ' ' STORED AS ORC tblproperties ("orc.compress"="Snappy");

* LOAD DATA LOCAL INPATH 'test.txt' INTO TABLE person;

* Executing  : SELECT * FROM person;
Results :

Failed with exception 
java.io.IOException:com.google.protobuf.InvalidProtocolBufferException: While 
parsing a protocol message, the input ended unexpectedly in the middle of a 
field.  This could mean either than the input has been truncated or that an 
embedded message misreported its own length.



Also, we included the codec property in core-site.xml on our hadoop cluster, along 
with the other configuration settings:

<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>





Following are the new jars with their placements



1.   Placed a new jar at $HIVE_HOME/lib/config-1.0.0.jar

2.   Placed a new jar for metastore connection 
$HIVE_HOME/lib/mysql-connector-java-5.1.17-bin.jar

3.   Moved jackson-core-asl-1.8.8.jar from $HIVE_HOME/lib to 
$HADOOP_HOME/lib

4.   Moved jackson-mapper-asl-1.8

RE: Query

2012-10-22 Thread Savant, Keshav
Hi Krishnan,

Generally hive clients do not support this option (please check yours once). If 
not, then for your hive client you can try overriding the default behavior and 
implementing the suggestion below (if it is open source). We did this for some of 
our projects: we externalized a property that specifies whether the column header 
should be printed, along with code that handles that property, and that worked 
pretty well at our end.


Kind regards,
Keshav C Savant

From: Venugopal Krishnan [mailto:venugopal_krish...@mindtree.com]
Sent: Tuesday, October 23, 2012 11:55 AM
To: Savant, Keshav; user@hive.apache.org
Subject: RE: Query

Hi Savant,

Thanks for your reply. We are also trying this option, but we are using the below 
query from the hive client, which generates a text file on the cluster. Is it not 
possible to generate the column headers in the text file with the default query, 
without using the metadata of the resultset?

insert overwrite directory '/user/hdev/temp' select * from customer;

Regards,
Venu

From: Savant, Keshav [mailto:keshav.c.sav...@fisglobal.com]
Sent: Tuesday, October 23, 2012 11:43 AM
To: user@hive.apache.org<mailto:user@hive.apache.org>; Venugopal Krishnan
Subject: RE: Query

Hi Krishnan,

To do this dynamically for any query:

Check if the hive JDBC driver supports/provides metadata for the result set. If 
yes, then you can get the metadata of the resultset, get its column count, and 
then get the column names, types, etc.

http://javasourcecode.org/html/open-source/hive/hive-0.7.1/org/apache/hadoop/hive/jdbc/HiveResultSetMetaData.html
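For illustration, a minimal sketch of that metadata-based approach using the
HiveServer1-era driver; the connection URL, query, output path, and tab delimiter
are placeholders.

import java.io.PrintWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.Statement;

public class QueryWithHeader {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/default");
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT * FROM customer");
             PrintWriter out = new PrintWriter("/tmp/customer_with_header.txt")) {

            ResultSetMetaData md = rs.getMetaData();
            int cols = md.getColumnCount();

            // Header line built from the result set metadata.
            StringBuilder header = new StringBuilder();
            for (int i = 1; i <= cols; i++) {
                if (i > 1) header.append('\t');
                header.append(md.getColumnName(i));
            }
            out.println(header);

            // Data rows.
            while (rs.next()) {
                StringBuilder row = new StringBuilder();
                for (int i = 1; i <= cols; i++) {
                    if (i > 1) row.append('\t');
                    row.append(rs.getString(i));
                }
                out.println(row);
            }
        }
    }
}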

Kind regards,
Keshav
From: Venugopal Krishnan [mailto:venugopal_krish...@mindtree.com]
Sent: Tuesday, October 23, 2012 11:27 AM
To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: Query

Hi,

We have a requirement where we need to print the column headers in the 
generated file on executing a query. We are using Jdbc hive client to execute 
the query.

Regards,
Venugopal






RE: Query

2012-10-22 Thread Savant, Keshav
Hi Krishnan,

To do this dynamically for any query:

Check if the hive JDBC driver supports/provides metadata for the result set. If 
yes, then you can get the metadata of the resultset, get its column count, and 
then get the column names, types, etc.

http://javasourcecode.org/html/open-source/hive/hive-0.7.1/org/apache/hadoop/hive/jdbc/HiveResultSetMetaData.html

Kind regards,
Keshav
From: Venugopal Krishnan [mailto:venugopal_krish...@mindtree.com]
Sent: Tuesday, October 23, 2012 11:27 AM
To: user@hive.apache.org
Subject: Query

Hi,

We have a requirement where we need to print the column headers in the 
generated file on executing a query. We are using Jdbc hive client to execute 
the query.

Regards,
Venugopal






RE: Problem loading a CSV file

2012-09-27 Thread Savant, Keshav
Hi Sarath,

Considering your two-step approach...

The load command by default searches for the file in HDFS, so that is what you are 
doing with the following command

hive> load data inpath '/user/hduser/dumps/table_dump.csv' overwrite into table 
table1;

Instead, you can use 'local' to tell hive that the CSV file is on the local file 
system and not on HDFS, as below

hive> load data local inpath '/user/hduser/dumps/table_dump.csv' overwrite into 
table table1;
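For illustration, the same load issued from a JDBC client (the connection URL is a
placeholder). Without 'local', hive expects the path to already exist in HDFS and
moves that file into the table's directory; also note that for the external-table
variant, LOCATION is normally expected to be a directory rather than a single file.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class LoadCsvExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/default");
             Statement stmt = con.createStatement()) {
            // 'local' => the CSV is read from the local file system and copied
            // into the table's HDFS directory, instead of being moved within HDFS.
            stmt.execute("LOAD DATA LOCAL INPATH '/user/hduser/dumps/table_dump.csv' "
                    + "OVERWRITE INTO TABLE table1");
        }
    }
}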

Hope that helps

Kind regards,
Keshav C Savant

From: Sarath [mailto:sarathchandra.jos...@algofusiontech.com]
Sent: Friday, September 28, 2012 11:28 AM
To: user@hive.apache.org
Subject: Problem loading a CSV file

Hi,

I have created a new table using reference to a file on HDFS -
create external table table1 (field1 STRING, field2 STRING, field3 STRING, 
field3 STRING, field4 STRING, field5 FLOAT, field6 FLOAT, field7 FLOAT, field8 
STRING, field9 STRING) row format delimited fields terminated by ',' location 
'/user/hduser/dumps/table_dump.csv';

The table got created successfully. But when I try retrieving rows from this 
table, it returns me nothing.
hive> select * from table1;
OK
Time taken: 0.156 seconds

I also tried creating the table first and then loading the HDFS file data into 
it -
hive> create table table1 (field1 STRING, field2 STRING, field3 STRING, field3 
STRING, field4 STRING, field5 FLOAT, field6 FLOAT, field7 FLOAT, field8 STRING, 
field9 STRING) row format delimited fields terminated by ',';
OK
Time taken: 0.088 seconds

But when I try to load data into this table I'm getting below error -
hive> load data inpath '/user/hduser/dumps/table_dump.csv' overwrite into table 
table1;
FAILED: Error in semantic analysis: Line 1:17 Invalid path 
''/user/hduser/dumps/table_dump.csv'': No files matching path 
hdfs://master:54310/user/hduser/dumps/table_dump.csv

What is going wrong? Is there a different way to load a CSV file using hive?

Regards,
Sarath.



RE: zip file or tar file cosumption

2012-09-27 Thread Savant, Keshav
True Manish.

Keshav C Savant

From: Manish.Bhoge [mailto:manish.bh...@target.com]
Sent: Thursday, September 27, 2012 4:26 PM
To: user@hive.apache.org; manishbh...@rocketmail.com
Subject: RE: zip file or tar file cosumption

Thanks Savant. I believe this will hold good for .zip file also.

Thank You,
Manish.

From: Savant, Keshav [mailto:keshav.c.sav...@fisglobal.com]
Sent: Thursday, September 27, 2012 10:19 AM
To: user@hive.apache.org<mailto:user@hive.apache.org>; 
manishbh...@rocketmail.com<mailto:manishbh...@rocketmail.com>
Subject: RE: zip file or tar file cosumption

Manish, the table that has been created for the zipped text files should be defined 
as a sequence file, for example

CREATE TABLE my_table_zip(col1 STRING,col2 STRING) ROW FORMAT DELIMITED FIELDS 
TERMINATED BY ',' stored as sequencefile;

After this you can use regular load command to load these files, for example

load data local inpath 'path-to-csv-file.gz' into table my_table_zip;

hope this helps

Keshav C Savant

From: Manish Bhoge [mailto:manishbh...@rocketmail.com]
Sent: Wednesday, September 26, 2012 9:43 PM
To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: Re: zip file or tar file cosumption

Hi Richin,

Thanks! Yes this is what I wanted to understand how to load zip file to Hive 
table. Now, I'll try this option.

Thank You,
Manish.
Sent from my BlackBerry, pls excuse typo

From: <richin.j...@nokia.com>
Date: Wed, 26 Sep 2012 14:51:39 +
To: <user@hive.apache.org>
ReplyTo: user@hive.apache.org
Subject: RE: zip file or tar file cosumption

You are right Chuck. I thought his question was how to use zip files or any 
compressed files in Hive tables.

Yeah, seems like you can't do that see: 
http://mail-archives.apache.org/mod_mbox/hive-user/201203.mbox/%3CCAENxBwxkF--3PzCkpz1HX21=gb9yvasr2jl0u3yul2tfgu0...@mail.gmail.com%3E
But you can always compress your files in gzip format and they should be good 
to go.

Richin

From: ext Connell, Chuck [mailto:chuck.conn...@nuance.com]
Sent: Wednesday, September 26, 2012 10:44 AM
To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: RE: zip file or tar file cosumption

But TEXTFILE in Hive always has newline as the record delimiter. How could this 
possibly work with a zip/tar file that can contain ASCII 10 characters at 
random locations, and certainly does not have ASCII 10 at the end of each data 
record?

Chuck Connell
Nuance R&D Data Team
Burlington, MA


From: richin.j...@nokia.com<mailto:richin.j...@nokia.com> 
[mailto:richin.j...@nokia.com]
Sent: Wednesday, September 26, 2012 10:14 AM
To: user@hive.apache.org<mailto:user@hive.apache.org>; 
manishbh...@rocketmail.com<mailto:manishbh...@rocketmail.com>
Subject: RE: zip file or tar file cosumption

Hi Manish,

If you have your zip file at location -  /home/manish/zipfile, you can just 
point your external table to that location like
CREATE EXTERNAL TABLE manish_test (field1 string, field2 string) ROW FORMAT 
DELIMITED FIELDS TERMINATED BY  STORED AS TEXTFILE 
LOCATION '/home/manish/zipfile';

OR

If you already have external table pointing to a certain location you can load 
this zip file into your table as
LOAD DATA INPATH '/home/manish/zipfile' INTO TABLE manish_test;

Hope this helps.

Richin

From: ext Manish Bhoge [mailto:manishbh...@rocketmail.com]
Sent: Wednesday, September 26, 2012 9:13 AM
To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: Re: zip file or tar file cosumption

Hi Savant,

Got it. But I still need to understand how to load a zip file. Can I directly use a 
zip file in an external table? Can you please help with the load statement.
Sent from my BlackBerry, pls excuse typo

From: "Savant, Keshav" <keshav.c.sav...@fisglobal.com>
Date: Wed, 26 Sep 2012 12:25:38 +
To: user@hive.apache.org
ReplyTo: user@hive.apache.org
Cc: manish.bh...@target.com; chuck.conn...@nuance.com
Subject: RE: zip file or tar file cosumption

Another solution would be

Using shell script do following

1.   unzip txt files,

2.   one by one merge those 50 (or N number of) text files into one text 
file,

3.   then zip/tar that bigger text file,

4.   then that big zip/tar file can be uploaded into hive.

Keshav C Savant

From: Connell, Chuck [mailto:chuck.conn...@nuance.com]
Sent: Wednesday, September 26, 2012 4:04 PM
To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: RE: zip file or tar file cosumption

This could be a problem. Hive uses newline as the record s

RE: zip file or tar file cosumption

2012-09-26 Thread Savant, Keshav
Manish, the table that has been created for the zipped text files should be defined 
as a sequence file, for example

CREATE TABLE my_table_zip(col1 STRING,col2 STRING) ROW FORMAT DELIMITED FIELDS 
TERMINATED BY ',' stored as sequencefile;

After this you can use regular load command to load these files, for example

load data local inpath 'path-to-csv-file.gz' into table my_table_zip;

hope this helps

Keshav C Savant

From: Manish Bhoge [mailto:manishbh...@rocketmail.com]
Sent: Wednesday, September 26, 2012 9:43 PM
To: user@hive.apache.org
Subject: Re: zip file or tar file cosumption

Hi Richin,

Thanks! Yes this is what I wanted to understand how to load zip file to Hive 
table. Now, I'll try this option.

Thank You,
Manish.
Sent from my BlackBerry, pls excuse typo

From: <richin.j...@nokia.com>
Date: Wed, 26 Sep 2012 14:51:39 +
To: <user@hive.apache.org>
ReplyTo: user@hive.apache.org
Subject: RE: zip file or tar file cosumption

You are right Chuck. I thought his question was how to use zip files or any 
compressed files in Hive tables.

Yeah, seems like you can't do that see: 
http://mail-archives.apache.org/mod_mbox/hive-user/201203.mbox/%3CCAENxBwxkF--3PzCkpz1HX21=gb9yvasr2jl0u3yul2tfgu0...@mail.gmail.com%3E
But you can always compress your files in gzip format and they should be good 
to go.

Richin

From: ext Connell, Chuck [mailto:chuck.conn...@nuance.com]
Sent: Wednesday, September 26, 2012 10:44 AM
To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: RE: zip file or tar file cosumption

But TEXTFILE in Hive always has newline as the record delimiter. How could this 
possibly work with a zip/tar file that can contain ASCII 10 characters at 
random locations, and certainly does not have ASCII 10 at the end of each data 
record?

Chuck Connell
Nuance R&D Data Team
Burlington, MA


From: richin.j...@nokia.com<mailto:richin.j...@nokia.com> 
[mailto:richin.j...@nokia.com]
Sent: Wednesday, September 26, 2012 10:14 AM
To: user@hive.apache.org<mailto:user@hive.apache.org>; 
manishbh...@rocketmail.com<mailto:manishbh...@rocketmail.com>
Subject: RE: zip file or tar file cosumption

Hi Manish,

If you have your zip file at location -  /home/manish/zipfile, you can just 
point your external table to that location like
CREATE EXTERNAL TABLE manish_test (field1 string, field2 string) ROW FORMAT 
DELIMITED FIELDS TERMINATED BY  STORED AS TEXTFILE 
LOCATION '/home/manish/zipfile';

OR

If you already have external table pointing to a certain location you can load 
this zip file into your table as
LOAD DATA INPATH '/home/manish/zipfile' INTO TABLE manish_test;

Hope this helps.

Richin

From: ext Manish Bhoge [mailto:manishbh...@rocketmail.com]
Sent: Wednesday, September 26, 2012 9:13 AM
To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: Re: zip file or tar file cosumption

Hi Savant,

Got it. But I still need to understand how to load a zip file. Can I directly use a 
zip file in an external table? Can you please help with the load statement.
Sent from my BlackBerry, pls excuse typo

From: "Savant, Keshav" <keshav.c.sav...@fisglobal.com>
Date: Wed, 26 Sep 2012 12:25:38 +
To: user@hive.apache.org
ReplyTo: user@hive.apache.org
Cc: manish.bh...@target.com; chuck.conn...@nuance.com
Subject: RE: zip file or tar file cosumption

Another solution would be

Using shell script do following

1.   unzip txt files,

2.   one by one merge those 50 (or N number of) text files into one text 
file,

3.   then zip/tar that bigger text file,

4.   then that big zip/tar file can be uploaded into hive.

Keshav C Savant

From: Connell, Chuck [mailto:chuck.conn...@nuance.com]
Sent: Wednesday, September 26, 2012 4:04 PM
To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: RE: zip file or tar file cosumption

This could be a problem. Hive uses newline as the record separator. A ZIP file 
will certainly contain newline characters. So I doubt this is possible.

BUT, I would like to hear from anyone who has solved the "newline is always a 
record separator" problem, because we ran into it for another type of 
compressed file.

Chuck

From: Manish.Bhoge [manish.bh...@target.com]
Sent: Wednesday, September 26, 2012 3:17 AM
To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: zip file or tar file cosumption
Hivers,

I want to understand that would it be possible to utilize zip/tar files 
directly into Hive. All the files has similar schema (s

RE: zip file or tar file cosumption

2012-09-26 Thread Savant, Keshav
Another solution would be

Using shell script do following

1.   unzip txt files,

2.   one by one merge those 50 (or N number of) text files into one text 
file,

3.   then zip/tar that bigger text file,

4.   then that big zip/tar file can be uploaded into hive.
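For illustration, a rough Java sketch of that unzip/merge/re-compress procedure; the
file names and target table are placeholders, and a shell script as described above
works equally well.

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.util.zip.GZIPOutputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class MergeZippedTextFiles {
    public static void main(String[] args) throws Exception {
        // Steps 1-3: unzip every .txt entry and append it to one merged,
        // gzip-compressed output file.
        try (ZipInputStream zin = new ZipInputStream(new FileInputStream("input_files.zip"));
             OutputStream out = new GZIPOutputStream(new FileOutputStream("merged.txt.gz"))) {
            byte[] buf = new byte[8192];
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                if (entry.isDirectory() || !entry.getName().endsWith(".txt")) {
                    continue;
                }
                int n;
                while ((n = zin.read(buf)) > 0) {
                    out.write(buf, 0, n);
                }
                zin.closeEntry();
            }
        }
        // Step 4: the merged file can then be loaded into hive, e.g.
        //   LOAD DATA LOCAL INPATH 'merged.txt.gz' INTO TABLE my_table;
    }
}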

Keshav C Savant

From: Connell, Chuck [mailto:chuck.conn...@nuance.com]
Sent: Wednesday, September 26, 2012 4:04 PM
To: user@hive.apache.org
Subject: RE: zip file or tar file cosumption

This could be a problem. Hive uses newline as the record separator. A ZIP file 
will certainly contain newline characters. So I doubt this is possible.

BUT, I would like to hear from anyone who has solved the "newline is always a 
record separator" problem, because we ran into it for another type of 
compressed file.

Chuck


From: Manish.Bhoge [manish.bh...@target.com]
Sent: Wednesday, September 26, 2012 3:17 AM
To: user@hive.apache.org
Subject: zip file or tar file cosumption
Hivers,

I want to understand whether it would be possible to utilize zip/tar files 
directly in Hive. All the files have a similar schema (structure). Say 50 *.txt 
files are zipped into a single zip file - can we load data directly from this zip 
file, OR do we need to unzip them first?

Thanks & Regards
Manish Bhoge | Technical Architect  * Target DW/BI| * +919379850010 (M) Ext: 
5691 VOIP: 22165 | * "Excellence is not a skill, It is an attitude." 
MySite



RE: Hive on Standalone Machine

2012-04-25 Thread Savant, Keshav
Thanks Ashish, that was helpful.

Keshav
From: Ashish Thusoo [mailto:athu...@qubole.com]
Sent: Wednesday, April 25, 2012 6:37 PM
To: user@hive.apache.org
Subject: Re: Hive on Standalone Machine

Hive needs the hadoop jars to talk to hadoop. The machine that it is installed 
on has to have those jars installed. However, it does not need to be a "part" 
of the hadoop cluster in the sense that it does not need to have a TaskTracker 
or DataNode running. The machine can operate purely as a client to the hadoop 
cluster but it needs the hadoop jars to talk to the hadoop cluster.

Hope that helps...

Ashish
On Wed, Apr 25, 2012 at 12:43 AM, Savant, Keshav <keshav.c.sav...@fisglobal.com> wrote:
Hi All,

Is it possible to install hive on a machine that is not a member of Hadoop 
cluster (and it does not have hadoop installation on it, i.e. no HADOOP_HOME or 
its entry in path)?

Thanks,
Keshav




Hive on Standalone Machine

2012-04-25 Thread Savant, Keshav
Hi All,

Is it possible to install hive on a machine that is not a member of Hadoop 
cluster (and it does not have hadoop installation on it, i.e. no HADOOP_HOME or 
its entry in path)?

Thanks,
Keshav




Hive | HBase Integration

2012-02-28 Thread Savant, Keshav
Hi All,

We did a successful setup of hadoop-0.20.203.0 and hive-0.7.1.

In our next step we are eyeing HBase integration with Hive. As far as we 
understand from the articles available on the internet and the Apache site, we can 
use HBase instead of Derby as a metastore for Hive; this gives us more flexibility 
while handling very large data.

We are using hbase-0.92.0 to integrate with Hive. So far HBase has been set up and 
we can create a sample table on it and insert sample data into it, but we are not 
able to integrate it with Hive: when we issue the command to create the 
hive-specific table on HBase (below in box), the command does not execute 
completely, a new command line is shown with an asterisk (*), and the table does 
not get created.

CREATE TABLE hive_hbasetable_k(key int, value string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")
TBLPROPERTIES ("hbase.table.name" = "hivehbasek");



Please provide us some pointers (steps to follow) for doing this integration, or 
tell us what we are not doing correctly. So far we have found the URLs below on 
this topic; any help is appreciated.

http://mevivs.wordpress.com/2010/11/24/hivehbase-integration/
https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration

Kind regards,
Keshav C Savant



RE: Data loading from Datanode

2011-12-08 Thread Savant, Keshav
Hi Vikas,

 

I think there is some problem in understanding. I have my cluster set up with Hive
installed on the namenode, and I can insert data into HDFS using hive.

My question is: can I install hive on any of the datanodes (instead of the
namenode) and load data from there into the datanode directly? And once I do a
data insertion from a datanode (and to that datanode), does the master sync with
the datanode later and keep track of the newly added file on the datanode?

 

Kind Regards,

Keshav C Savant

 

From: Vikas Srivastava [mailto:vikas.srivast...@one97.net] 
Sent: Thursday, December 08, 2011 1:03 PM
To: user@hive.apache.org
Subject: Re: Data loading from Datanode

 

hey

yes, it's possible to load data through hive into hadoop, but you can't
decide where the data file should be stored (on which node). That can
only be decided by the namenode.

Regards
Vikas Srivastava



On Thu, Dec 8, 2011 at 12:49 PM, Savant, Keshav wrote:

Hi All,

 

Is it possible to load data (into HDFS) using a Hive 'load data' query from
any of the datanodes?

So, that means: can we insert files into a datanode directly (or from hive
installed on a datanode), with the master node syncing with the datanodes
later?

 

Keshav C Savant





-- 
With Regards
Vikas Srivastava

DWH & Analytics Team

Mob:+91 9560885900
One97 | Let's get talking !

 



Data loading from Datanode

2011-12-07 Thread Savant, Keshav
Hi All,

 

Is it possible to load data (into HDFS) using a Hive 'load data' query from
any of the datanodes?

So, that means: can we insert files into a datanode directly (or from hive
installed on a datanode), with the master node syncing with the datanodes
later?

 

Keshav C Savant



RE: Hive query taking too much time

2011-12-07 Thread Savant, Keshav
You are right Wojciech Langiewicz, we did the same thing and posted our
results yesterday. Now we are planning to do this using a shell script
because of the dynamic nature of our environment, where files keep on
coming. We will schedule the shell script using a cron job.

A query on this: we are planning to merge files based on either of the
following approaches
1. Based on file count: if the file count reaches X number of files, then
merge and insert into HDFS.
2. Based on merged file size: if the merged file size crosses X number of
bytes, then insert into HDFS.

I think option 2 is better because that way we can say that all merged
files will be of almost the same size. What do you suggest?
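For illustration, a rough sketch of the size-based option 2; the input directory,
size threshold, JDBC URL, and table name are placeholders, not a tested
implementation.

import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class SizeBasedBatchLoader {
    private static final long MAX_BATCH_BYTES = 512L * 1024 * 1024; // ~512 MB per merged file

    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/default");
             Statement stmt = con.createStatement()) {

            Path batch = Paths.get("/tmp/hive_batch.csv");
            Files.deleteIfExists(batch);
            Files.createFile(batch);
            long batchSize = 0;

            File[] csvs = new File("/data/incoming").listFiles((dir, name) -> name.endsWith(".csv"));
            if (csvs == null) {
                return; // directory missing or unreadable
            }
            for (File csv : csvs) {
                // Append this CSV to the current batch file.
                Files.write(batch, Files.readAllBytes(csv.toPath()), StandardOpenOption.APPEND);
                batchSize += csv.length();

                if (batchSize >= MAX_BATCH_BYTES) {
                    // Size threshold crossed: load the merged file and start a new batch.
                    stmt.execute("LOAD DATA LOCAL INPATH '" + batch + "' INTO TABLE a_table");
                    Files.write(batch, new byte[0], StandardOpenOption.TRUNCATE_EXISTING);
                    batchSize = 0;
                }
            }
            if (batchSize > 0) {
                // Load whatever is left in the final partial batch.
                stmt.execute("LOAD DATA LOCAL INPATH '" + batch + "' INTO TABLE a_table");
            }
        }
    }
}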

Kind Regards,
Keshav C Savant


-Original Message-
From: Wojciech Langiewicz [mailto:wlangiew...@gmail.com] 
Sent: Wednesday, December 07, 2011 8:15 PM
To: user@hive.apache.org
Subject: Re: Hive query taking too much time

Hi,
In this case it's much easier and faster to merge all files using this
command:

cat *.csv > output.csv
hive -e "load data local inpath 'output.csv' into table $table"

On 07.12.2011 07:00, Vikas Srivastava wrote:
> hey if u having the same col of  all the files then you can easily 
> merge by shell script
>
> list=`ls *.csv`
> table=yourtable
> for file in $list
> do
> cat $file >> new_file.csv
> done
> hive -e "load data local inpath 'new_file.csv' into table $table"
>
> it will merge all the files in single file then you can upload it in 
> the same query
>
> On Tue, Dec 6, 2011 at 8:16 PM, Mohit Gupta
> wrote:
>
>> Hi Paul,
>> I am having the same problem. Do you know any efficient way of 
>> merging the files?
>>
>> -Mohit
>>
>>
>> On Tue, Dec 6, 2011 at 8:14 PM, Paul Mackles
wrote:
>>
>>> How much time is it spending in the map/reduce phases, respectively?

>>> The large number of files could be creating a lot of mappers which 
>>> create a lot of overhead. What happens if you merge the 2624 files 
>>> into a smaller number like 24 or 48. That should speed up the mapper

>>> phase significantly.
>>>
>>> ** **
>>>
>>> *From:* Savant, Keshav [mailto:keshav.c.sav...@fisglobal.com]
>>> *Sent:* Tuesday, December 06, 2011 6:01 AM
>>> *To:* user@hive.apache.org
>>> *Subject:* Hive query taking too much time
>>>
>>> ** **
>>>
>>> Hi All,
>>>
>>> ** **
>>>
>>> My setup is 
>>>
>>> hadoop-0.20.203.0
>>>
>>> hive-0.7.1
>>>
>>> ** **
>>>
>>> I am having a total of 5 node cluster: 4 data nodes, 1 namenode (it 
>>> is also acting as secondary name node). On namenode I have setup 
>>> hive with HiveDerbyServerMode to support multiple hive server 
>>> connection.
>>>
>>> ** **
>>>
>>> I have inserted plain text CSV files in HDFS using 'LOAD DATA' hive 
>>> query statements, total number of files is 2624 and their combined 
>>> size is only
>>> 713 MB, which is very less from Hadoop perspective that can handle 
>>> TBs of data very easily.
>>>
>>> ** **
>>>
>>> The problem is, when I run a simple count query (i.e. *select 
>>> count(*) from a_table*), it takes too much time in executing the 
>>> query.
>>>
>>> ** **
>>>
>>> For instance it takes almost 17 minutes to execute the said query if

>>> the table has 950,000 rows, I understand that time is too much for 
>>> executing a query with only such small data. 
>>>
>>> This is only a dev environment and in production environment the 
>>> number of files and their combined size will move into millions and 
>>> GBs
>>> respectively.
>>>
>>> ** **
>>>
>>> On analyzing the logs on all the datanodes and namenode/secondary 
>>> namenode I do not find any error in them.
>>>
>>> ** **
>>>
>>> I have tried setting mapred.reduce.tasks to a fixed number also, but

>>> number of reduce always remains 1 while number of maps is determined

>>> by hive only.
>>>
>>> ** **
>>>
>>> Any suggestion what I am doing wrong, or how can I improve the 
>>> performance of hive queries? Any suggestion or pointer is highly 
>>> appreciated. 
>>>
>>> ** **
>>>
>>> Keshav
>>>

RE: Hive query taking too much time

2011-12-07 Thread Savant, Keshav
Hi Wojciech Langiewicz/Paul Mackles,

 

I tried your suggestion and it worked; the performance has increased many
fold. Here are the results from my testing after implementing your
suggestion:

 

Number of Files on HDFS                     | File Size                              | count(*) time taken (seconds) | count(*) result
--------------------------------------------|----------------------------------------|-------------------------------|----------------
1 (created from 2624 CSVs)                  | 708.8 MB                               | 66.258                        | 3,567,922
3 (each created from 2624 CSVs)             | 708.8 MB * 3                           | 119.92                        | 10,703,766
3 (each created from 2624 CSVs) +           | 708.8 MB * 3 + 708.8 MB combined       | 153.306                       | 14,271,688
14 (each created from almost 200 CSVs)      | (14 files ranging from 48 MB to 68 MB) |                               |

 

Thanks a lot for your help.

 

Kind Regards,

Keshav C Savant

 

From: Paul Mackles [mailto:pmack...@adobe.com] 
Sent: Tuesday, December 06, 2011 8:14 PM
To: user@hive.apache.org
Subject: RE: Hive query taking too much time

 

How much time is it spending in the map/reduce phases, respectively? The
large number of files could be creating a lot of mappers which create a
lot of overhead. What happens if you merge the 2624 files into a smaller
number like 24 or 48. That should speed up the mapper phase
significantly.

 

From: Savant, Keshav [mailto:keshav.c.sav...@fisglobal.com] 
Sent: Tuesday, December 06, 2011 6:01 AM
To: user@hive.apache.org
Subject: Hive query taking too much time

 

Hi All,

 

My setup is 

hadoop-0.20.203.0

hive-0.7.1

 

I am having a total of 5 node cluster: 4 data nodes, 1 namenode (it is
also acting as secondary name node). On namenode I have setup hive with
HiveDerbyServerMode to support multiple hive server connection.

 

I have inserted plain text CSV files in HDFS using 'LOAD DATA' hive
query statements, total number of files is 2624 and their combined size
is only 713 MB, which is very less from Hadoop perspective that can
handle TBs of data very easily.

 

The problem is, when I run a simple count query (i.e. select count(*)
from a_table), it takes too much time in executing the query.

 

For instance it takes almost 17 minutes to execute the said query if the
table has 950,000 rows, I understand that time is too much for executing
a query with only such small data. 

This is only a dev environment and in production environment the number
of files and their combined size will move into millions and GBs
respectively.

 

On analyzing the logs on all the datanodes and namenode/secondary
namenode I do not find any error in them.

 

I have tried setting mapred.reduce.tasks to a fixed number also, but
number of reduce always remains 1 while number of maps is determined by
hive only.

 

Any suggestion what I am doing wrong, or how can I improve the
performance of hive queries? Any suggestion or pointer is highly
appreciated. 

 

Keshav



Hive query taking too much time

2011-12-06 Thread Savant, Keshav
Hi All,

 

My setup is 

hadoop-0.20.203.0

hive-0.7.1

 

I have a 5 node cluster in total: 4 data nodes and 1 namenode (which is
also acting as the secondary namenode). On the namenode I have set up hive with
HiveDerbyServerMode to support multiple hive server connections.

 

I have inserted plain text CSV files into HDFS using 'LOAD DATA' hive
query statements; the total number of files is 2624 and their combined size
is only 713 MB, which is very small from a Hadoop perspective, since Hadoop
can handle TBs of data very easily.

 

The problem is, when I run a simple count query (i.e. select count(*)
from a_table), it takes too much time to execute.

 

For instance it takes almost 17 minutes to execute the said query when the
table has 950,000 rows; I understand that is too much time for a query over
such a small amount of data. 

This is only a dev environment; in the production environment the number
of files and their combined size will move into millions and GBs
respectively.

 

On analyzing the logs on all the datanodes and namenode/secondary
namenode I do not find any error in them.

 

I have also tried setting mapred.reduce.tasks to a fixed number, but the
number of reducers always remains 1, while the number of maps is determined
by hive.

 

Any suggestions on what I am doing wrong, or how I can improve the
performance of hive queries? Any suggestion or pointer is highly
appreciated. 

 

Keshav



LOAD DATA gives Error for multiple files

2011-11-25 Thread Savant, Keshav
Hi,

 

My setup is

hadoop-0.20.203.0

hive-0.7.1

 

I am using a java program that issues LOAD DATA LOCAL INPATH queries to
insert files into a hive table. My program is trying to insert almost
2500+ files using this command, one by one. The problem I am facing is that
after processing some files (sometimes 180 or 190 files) I get the
following exception

 

java.sql.SQLException: Query returned non-zero code: 9, cause: FAILED:
Execution Error, return code 1 from
org.apache.hadoop.hive.ql.exec.CopyTask

at
org.apache.hadoop.hive.jdbc.HiveStatement.executeQuery(HiveStatement.jav
a:192)

 

I am using Statements, as I read somewhere that PreparedStatements can
cause this problem, but the problem is the same either way. Once this
problem appears it keeps occurring for the rest of the files; file permissions
are OK for the files I am inserting into the hive table, and the java process
owner has rwx permission on all files.

 

If this exception occurs and I force-close the java program (or it
completes), restart the hive server, and run my java program again, my
java program again processes (180/190) files and then the same problem
occurs.

 

From JIRA I found a new version of CopyTask.java, but that version does not
compile, as it has a dependency on some other
class (CopyWork.java or Task.java) that has an isErrorOnSrcEmpty() method.
If I add new methods isErrorOnSrcEmpty() and setErrorOnSrcEmpty(boolean
boolValue) to CopyWork.java, then I am not sure from where the
setter would be called to satisfy the hive CopyTask logic.

 

Any help is appreciated. The source code is attached for ready
reference.

 

Kind regards,

Keshav C Savant 



FeedHdfsDAO.java
Description: FeedHdfsDAO.java