How to manage huge partitioned table with 1000+ columns in Hive

2019-09-30 Thread Saurabh Santhosh
Hi,

I am facing the following problem while trying to store/use a huge
partitioned table with 1000+ columns in Hive. I would like to know how to
solve this problem either using hive or any other store.

Requirement:

1). There is a table with around 1000+ columns which is partitioned by date.
2). Every day consists of data for around 500 million entities. There is an
id column with the id of the entity and around 1000+ columns which represent
attributes of the given entity for each day.
3). We have to store data for around 2 years.
4). New columns may be added, or the logic of an existing column may be
changed, on any day; when this happens we have to populate data for the
given column for the last 2 years.


Our Solution 1:

1). We created a table with 1000+ columns, partitioned by date.
2). Every day we create a new partition and delete the partition older than
2 years.

Problems Faced in Solution 1:

Whenever we had to add/modify certain columns, the backfill of data took a
long time; it was taking months to backfill the data for 2 years (this was
because there is a lot of IO due to the read/write of each partition).


Our Solution 2:

1). We created 10 tables with around 100+ columns each, and each of them was
partitioned by date.
2). Every day we create a new partition in each of the small tables and
delete the partition older than 2 years.
3). Created a view which is a join between all the tables, with id and date
as the join key (a rough sketch of this layout follows below).
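
A minimal sketch of the Solution 2 layout, assuming illustrative names
(entity_attrs_group1 .. entity_attrs_group10, attr_001 .. attr_100 and a dt
partition column are not the actual schema):

CREATE TABLE entity_attrs_group1 (
  id BIGINT,
  attr_001 STRING,
  -- ... roughly 100 attribute columns per group table ...
  attr_100 STRING
)
PARTITIONED BY (dt STRING);

-- One such table per column group, all partitioned by the same dt.
CREATE VIEW entity_attrs_all AS
SELECT
  g1.id,
  g1.dt,
  g1.attr_001,
  g1.attr_100,   -- ... the rest of group 1's columns ...
  g2.attr_101    -- ... and so on for the remaining group tables
FROM entity_attrs_group1 g1
JOIN entity_attrs_group2 g2
  ON g1.id = g2.id AND g1.dt = g2.dt;
-- Additional JOINs for entity_attrs_group3 .. entity_attrs_group10 follow
-- the same (id, dt) join key.

With this layout, backfilling one column only rewrites the partitions of the
single group table that contains it, which is what reduces the IO described
below.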


Problems Faced in Solution 2:

Now the backfill time was considerably reduced from months to weeks, as we
only need to refresh the small table which contains the columns to be
backfilled, thus reducing the IO drastically.
But this led to very slow queries on top of the view. Especially when we
query 6 months of data, the queries were taking more than 10 hrs due to
the shuffling of data.

Our Solution 3:

1). We also tried to bucket each small table based on the id column, but
this did not give us the desired result, as the shuffling was still happening.


Can anyone suggest the best approach to take in the above scenario?


Random failure in HIVE tez engine

2019-03-28 Thread Saurabh Mishra
Hi,

While running the following query with the 'tez' execution engine I am
getting an ArrayIndexOutOfBounds exception. Not sure why, but any suggestion
would be appreciated.

Query : select distinct -1, -1;
Distribution : HDP
HIVE Version : 2.6.3.0-235
Execution Engine : tez
Error Message :

Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing vector batch (tag=0) [Error getting row data with exception java.lang.ArrayIndexOutOfBoundsException: 15
    at org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch.toString(VectorizedRowBatch.java:179)
    at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.processVectorGroup(ReduceRecordSource.java:449)
    at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecordVector(ReduceRecordSource.java:381)
    at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:232)
    at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.run(ReduceRecordProcessor.java:266)
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:150)
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:139)
    at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:347)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:194)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:185)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:185)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:181)
    at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
 ]
    at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.processVectorGroup(ReduceRecordSource.java:454)
    at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecordVector(ReduceRecordSource.java:381)
    ... 17 more

Caused by: java.lang.NullPointerException
    at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:136)
    at java.io.DataOutputStream.write(DataOutputStream.java:107)
    at org.apache.hadoop.hive.ql.exec.vector.VectorGroupKeyHelper.copyGroupKey(VectorGroupKeyHelper.java:98)
    at org.apache.hadoop.hive.ql.exec.vector.VectorGroupByOperator$ProcessingModeGroupBatches.processBatch(VectorGroupByOperator.java:729)
    at org.apache.hadoop.hive.ql.exec.vector.VectorGroupByOperator.process(VectorGroupByOperator.java:881)
    at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.processVectorGroup(ReduceRecordSource.java:440)
    ... 18 more

]], Vertex did not succeed due to OWN_TASK_FAILURE, failedTasks:1 killedTasks:0, Vertex vertex_1553746301792_3720_7_01 [Reducer 2] killed/failed due to:OWN_TASK_FAILURE] DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:0
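
Not part of the original report, but since the failure is inside the
vectorized reduce path (VectorGroupByOperator / VectorizedRowBatch), one
assumption-based way to narrow it down is to rerun the statement with
vectorization switched off and see whether the error disappears:

-- Hedged diagnostic sketch, not a confirmed fix from this thread.
SET hive.vectorized.execution.enabled=false;
SET hive.vectorized.execution.reduce.enabled=false;
SELECT DISTINCT -1, -1;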

Regards
Saurabh Mishra


Unsigned Data Type Support

2015-07-23 Thread saurabh
Hi All,

Is there any data type in Hive which only stores positive values and
converts negative values to ZERO?

This can be handled during data processing, e.g. with a CASE statement or
COALESCE(IF()); I just wanted to check whether this can be handled at the
table definition level.
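
A hedged sketch of the query-time handling mentioned above (Hive has no
unsigned numeric type, so the clamping has to live in the query; my_table
and val are assumed names, and GREATEST() is only available in newer Hive
releases):

SELECT
  CASE WHEN val < 0 THEN 0 ELSE val END AS val_case,      -- CASE form
  COALESCE(IF(val < 0, 0, val), 0)      AS val_if,         -- IF/COALESCE form
  GREATEST(val, 0)                      AS val_greatest    -- compact alternative
FROM my_table;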

Thanks in Advance.

Regards,
Saurabh


Hive With tez

2015-07-05 Thread saurabh
Hi,

We are in the process of exploring Tez for Hive 0.14 and need some pointers
to get started on Hive with Tez.
For example, in Hive the HDFS block size plays a vital role in determining
the number of mappers, and the independent execution of those mappers can
accelerate processing substantially.

I understand this is a very vast topic and cannot be covered in full,
however some quick pointers will be helpful.

I am currently working on:
Query vectorization and CBO with ORC tables (a sketch of starter settings
follows below).
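
As a starting point, a hedged sketch of the session-level settings that are
usually toggled when trying Hive 0.14 on Tez with ORC (illustrative starting
values, not tuned recommendations):

-- Run on Tez instead of MapReduce.
SET hive.execution.engine=tez;
-- Vectorized query execution (pairs well with ORC).
SET hive.vectorized.execution.enabled=true;
-- Cost-based optimizer and the statistics it relies on.
SET hive.cbo.enable=true;
SET hive.compute.query.using.stats=true;
SET hive.stats.fetch.column.stats=true;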

Thanks,
Saurabh


Date Functions in Hive

2015-06-23 Thread saurabh
Hi,

I need some suggestions on date functions in Hive.
The data appearing in the source file is in the format "May 31, 2015". I
want to convert it to MM-DD-YYYY format.

Please suggest whether this can be done with existing Hive 0.14 functionality
or whether it will require a custom UDF.
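
A hedged sketch of doing this with built-in functions, no custom UDF
(src_dt and my_table are assumed names; the patterns follow Java's
SimpleDateFormat, which unix_timestamp()/from_unixtime() use):

-- Parse "May 31, 2015" and re-emit it as MM-DD-YYYY (e.g. 05-31-2015).
SELECT
  src_dt,
  from_unixtime(unix_timestamp(src_dt, 'MMMM dd, yyyy'), 'MM-dd-yyyy') AS formatted_dt
FROM my_table;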

Please let me know if any more information is required on the same.

Thanks,
Saurabh


Re: Which [open-souce] SQL engine atop Hadoop?

2015-02-02 Thread Saurabh B
This is not open source but we are using Vertica and it works very nicely
for us. There is a 1TB community edition but above that it costs money.
It has really advanced SQL (analytical functions, etc), works like an
RDBMS, has R/Java/C++ SDK and scales nicely. There is a similar option of
Redshift available but Vertica has more features (pattern matching
functions, etc).

Again, not open source so I would be interested to know what you end up
going with and what your experience is.

On Mon, Feb 2, 2015 at 12:08 AM, Samuel Marks  wrote:

> Well what I am seeking is a Big Data database that can work with Small
> Data also. I.e.: scaleable from one node to vast clusters; whilst
> maintaining relatively low latency throughout.
>
> Which fit into this category?
>
> Samuel Marks
> http://linkedin.com/in/samuelmarks
>


Executing Hive Queries in Parallel

2014-04-21 Thread saurabh
Hi,
I need some inputs to execute Hive queries in parallel. I tried doing this
using the CLI (by opening multiple ssh connections) and executed 4 HQLs; it
was observed that the queries were getting executed sequentially. All four
queries got submitted, however while the first one was executing the others
were in a pending state. I was performing this activity on EMR running in
batch mode and hence was not able to dig into the logs.

The Hive CLI uses a native Hive connection which by default uses the FIFO
scheduler. This might be one of the reasons for the queries getting executed
in sequence.

I also observed that when multiple queries are executed using multiple HUE
sessions, they do run in parallel. Can you please suggest how this HUE
behaviour can be replicated using the CLI?

I am aware of the Beeswax client, however I am not sure how it can be used
during EMR batch mode processing.

Thanks in advance for going through this. Kindly let me know your thoughts
on the same.


Re: Converting from textfile to sequencefile using Hive

2013-09-30 Thread Saurabh B
Thanks Sean, that is exactly what I want.


On Mon, Sep 30, 2013 at 3:09 PM, Sean Busbey  wrote:

> S,
>
> Check out these presentations from Data Science Maryland back in May[1].
>
> 1. working with Tweets in Hive:
>
>
> http://www.slideshare.net/JoeyEcheverria/analyzing-twitter-data-with-hadoop-20929978
>
> 2. then pulling stuff out of Hive to use with Mahout:
>
> http://files.meetup.com/6195792/Working%20With%20Mahout.pdf
>
> The Mahout talk didn't have a directly useful outcome (largely because it
> tried to work with the tweets as individual text documents), but it does
> get through all the mechanics of exactly what you state you want.
>
> The meetup page also has links to video, if the slides don't give enough
> context.
>
> HTH
>
> [1]: http://www.meetup.com/Data-Science-MD/events/111081282/
>
>
> On Mon, Sep 30, 2013 at 11:54 AM, Saurabh B wrote:
>
>> Hi Nitin,
>>
>> No offense taken. Thank you for your response. Part of this is also
>> trying to find the right tool for the job.
>>
>> I am doing queries to determine the cuts of tweets that I want, then
>> doing some modest normalization (through a python script) and then I want
>> to create sequenceFiles from that.
>>
>> So far Hive seems to be the most convenient way to do this. But I can
>> take a look at PIG too. It looked like the "STORED AS SEQUENCEFILE" gets me
>> 99% way there. So I was wondering if there was a way to get those ids in
>> there as well. The last piece is always the stumbler :)
>>
>> Thanks again,
>>
>> S
>>
>>
>>
>>
>> On Mon, Sep 30, 2013 at 2:41 PM, Nitin Pawar wrote:
>>
>>> are you using hive to just convert your text files to sequence files?
>>> If thats the case then you may want to look at the purpose why hive was
>>> developed.
>>>
>>> If you want to modify data or process data which does not involve any
>>> kind of analytics functions on a routine basis.
>>>
>>> If you want to do a data manipulation or enrichment and do not want to
>>> code a lot of map reduce job, you can take a look at pig scripts.
>>> basically what you want to do is generate an  UUID for each of your
>>> tweet and then feed it to mahout algorithms.
>>>
>>> Sorry if I understood it wrong or it sounds rude.
>>>
>>
>>
>
>
> --
> Sean
>


Re: Converting from textfile to sequencefile using Hive

2013-09-30 Thread Saurabh B
Hi Nitin,

No offense taken. Thank you for your response. Part of this is also trying
to find the right tool for the job.

I am doing queries to determine the cuts of tweets that I want, then doing
some modest normalization (through a python script) and then I want to
create sequenceFiles from that.

So far Hive seems to be the most convenient way to do this. But I can take
a look at Pig too. It looks like "STORED AS SEQUENCEFILE" gets me 99% of the
way there. So I was wondering if there was a way to get those ids in there
as well. The last piece is always the stumbler :)

Thanks again,

S




On Mon, Sep 30, 2013 at 2:41 PM, Nitin Pawar wrote:

> are you using hive to just convert your text files to sequence files?
> If thats the case then you may want to look at the purpose why hive was
> developed.
>
> If you want to modify data or process data which does not involve any kind
> of analytics functions on a routine basis.
>
> If you want to do a data manipulation or enrichment and do not want to
> code a lot of map reduce job, you can take a look at pig scripts.
> basically what you want to do is generate an  UUID for each of your tweet
> and then feed it to mahout algorithms.
>
> Sorry if I understood it wrong or it sounds rude.
>


Converting from textfile to sequencefile using Hive

2013-09-30 Thread Saurabh B
Hi,

I have a lot of tweets saved as text. I created an external table on top of
it to access it as textfile. I need to convert these to sequencefiles with
each tweet as its own record. To do this, I created another table as a
sequencefile table like so -

CREATE EXTERNAL TABLE tweetseq(
  tweet STRING
  )
 ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054'
 STORED AS SEQUENCEFILE
LOCATION '/user/hdfs/tweetseq'


Now when I insert into this table from my original tweets table, each line
gets its own record as expected. This is great. However, I don't have any
record ids here. How can I get it to write ids?
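
For reference, the insert referred to above would look roughly like this
(assuming the original text-backed table is simply called tweets, which is
an illustrative name):

-- Copy each tweet from the text table into the SequenceFile-backed table.
INSERT OVERWRITE TABLE tweetseq
SELECT tweet
FROM tweets;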

PS, I need the ids to be there because mahout seq2sparse expects that.

Regards,
S


Converting from textfile to sequencefile using Hive

2013-09-30 Thread Saurabh Bhatnagar (Business Intelligence)
Hi,

I have a lot of tweets saved as text. I created an external table on top of
it to access it as textfile. I need to convert these to sequencefiles with
each tweet as its own record. To do this, I created another table as a
sequencefile table like so -

CREATE EXTERNAL TABLE tweetseq(
  tweet STRING
  )
 ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054'
 STORED AS SEQUENCEFILE
LOCATION '/user/hdfs/tweetseq'


Now when I insert into this table from my original tweets table, each line
gets its own record as expected. This is great. However, I don't have any
record ids here. Short of writing my own UDF to make that happen, are there
any obvious solutions I am missing here?

PS, I need the ids to be there because mahout seq2sparse expects that.
Without ids, it fails with -

java.lang.ClassCastException: org.apache.hadoop.io.BytesWritable cannot be cast to org.apache.hadoop.io.Text
    at org.apache.mahout.vectorizer.document.SequenceFileTokenizerMapper.map(SequenceFileTokenizerMapper.java:37)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:140)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:672)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
    at org.apache.hadoop.mapred.Child.main(Child.java:262)

Regards,
S


Converting from textfile to sequencefile using Hive

2013-09-29 Thread Saurabh Bhatnagar (Business Intelligence)
Hi,

I have a lot of tweets saved as text. I created an external table on top of
it to access it as textfile. I need to convert these to sequencefiles with
each tweet as its own record. To do this, I created another table as a
sequencefile table like so -

CREATE EXTERNAL TABLE tweetseq(
  tweet STRING
  )
 ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054'
 STORED AS SEQUENCEFILE
LOCATION '/user/hdfs/tweetseq'


Now when I insert into this table from my original tweets table, each line
gets its own record as expected. This is great. However, I don't have any
record ids here. Short of writing my own UDF to make that happen, are there
any obvious solutions I am missing here?

PS, I need the ids to be there because mahout seq2sparse expects that.
Without ids, it fails with -

java.lang.ClassCastException: org.apache.hadoop.io.BytesWritable cannot be cast to org.apache.hadoop.io.Text
    at org.apache.mahout.vectorizer.document.SequenceFileTokenizerMapper.map(SequenceFileTokenizerMapper.java:37)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:140)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:672)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
    at org.apache.hadoop.mapred.Child.main(Child.java:262)

Regards,
S


Re: Semantics of Rank.

2013-07-26 Thread saurabh
Hi all,

Below are some observations based on the ongoing rank function discussion.

1. I executed the below-mentioned queries, and only the query with "rank"
(lowercase) executed successfully; the rest were throwing the exception
"FAILED: SemanticException Failed to breakup Windowing invocations into
Groups."

-  select cust_id, ord_dt, RANK() w from cust_ord window w as (partition by
cust_id order by ord_dt);

-  select cust_id, ord_dt, Rank() w from cust_ord window w as (partition by
cust_id order by ord_dt);

-   select cust_id, ord_dt, rank() w from cust_ord window w as (partition
by cust_id order by ord_dt);

It seems "rank" keyword is case-sensitive. Attached is the screenshot for
reference.

2. I created a dummy table with the data provided in the mail trail below
and achieved the expected output using the below-mentioned query.

*select cust_id, ord_dt, rank() over (partition by cust_id order by ord_dt)
from cust_ord;*

Request all to kindly review these details and suggest whether this was of
any help!

Thanks.


On Sat, Jul 27, 2013 at 12:07 AM, j.barrett Strausser <
j.barrett.straus...@gmail.com> wrote:

> Any further help on this, otherwise I'll file a jira.
>
>
> On Wed, Jul 24, 2013 at 11:32 PM, j.barrett Strausser <
> j.barrett.straus...@gmail.com> wrote:
>
>> As an example : If I run my query above removing the arg the following is
>> thrown.
>>
>> FAILED: SemanticException Failed to breakup Windowing invocations into
>> Groups. At least 1 group must only depend on input columns. Also check for
>> circular dependencies.
>> Underlying error:
>> org.apache.hadoop.hive.ql.exec.UDFArgumentTypeException: One or more
>> arguments are expected.
>>
>>
>> Similar issue and fix here:
>>
>> http://www.marshut.com/rqvpz/use-rank-over-partition-function-in-hive-11.html
>>
>> Even if it didn't require an arg it still doesn't explain my anomalous
>> output.
>>
>>
>>
>> On Wed, Jul 24, 2013 at 11:28 PM, j.barrett Strausser <
>> j.barrett.straus...@gmail.com> wrote:
>>
>>> That isn't true. If you try to run the above HIVE without an argument,
>>> it will throw an exception. I have seen other users replicate this problem
>>> as well.
>>>
>>> I can file a JIRA if someone can confirm that my query should work.
>>>
>>>
>>> On Wed, Jul 24, 2013 at 11:02 PM, manishbh...@rocketmail.com <
>>> manishbh...@rocketmail.com> wrote:
>>>
 Analytical function doesn't expect any argument. Rank() itself enough
 to sequence based on the window you have defined in partition by. So

 Rank() over (partition by cmscustid  order by orderdate)

 Should work as long as I have wrote right syntax for hive.

 Sent via Rocket from my HTC

 - Reply message -
 From: "j.barrett Strausser" 
 To: 
 Subject: Semantics of Rank.
 Date: Thu, Jul 25, 2013 1:08 AM


 Thanks for the reply. Perhaps my misunderstanding of the relation
 between
 rank and the windowing function is wrong.

 What I want to achieve for the following is : For a given customer id,
 sort his orders. I thought the below would work.

 SELECT eh.cmsorderid, eh.orderdate, RANK(orderdate) w FROM order_data eh
 window w as (partition by cmscustid  order by orderdate);

 The rank function instead returns the rank of the order date over all
 all
 order dates.

 Example snippet from above

 Actual :

 6758783  27APR2012  94
 6758783  23JUN2012  95
 6758785  14DEC2012  96
 6758795  18DEC2011  97
 6758796  06MAY2012  98
 6758798  24MAR2013  99
 6758799  23NOV2012  100


 Expected :

 6758783  27APR2012  1
 6758783  23JUN2012  2
 6758785  14DEC2012  1
 6758795  18DEC2011  1
 6758796  06MAY2012  1
 6758798  24MAR2013  1
 6758799  23NOV2012  1


 -b




 On Wed, Jul 24, 2013 at 3:17 PM, Shahar Glixman wrote:

 > the argument to rank is simply some value, whereas the rank function
 > compare this value
 >  to the previous value received, if value is same, rank returns
 ++index,
 > otherwise, rank return 1.
 > pseudo code:
 >
 > class Rank {
 > int index;
 > Object previousValue = null;
 > int evaluate(Object value) {
 >   if (value == previousValue) {
 > return ++index;
 >   }
 >   previousValue = value;
 >   index = 1;
 >   return 1;
 > }
 >
 >
 > On Wed, Jul 24, 2013 at 9:59 PM, j.barrett Strausser <
 > j.barrett.straus...@gmail.com> wrote:
 >
 >> It seems as though I am required to pass in an argument to RANK().
 >>
 >> What is the effect of passing this argument in ?
 >>
 >> In a RANK function the output for RANK should be the number of rows
 >> preceding
 >> a row in a given window. As  windows are specified by the partition
 and
 >> order by fields I don't u

Re: Question regarding external table and csv in NFS

2013-07-17 Thread Saurabh M
Hi Mainak,

Can you try using this:

 create external table outside_supplier (S_SUPPKEY INT, S_NAME STRING,
S_ADDRESS STRING, S_NATIONKEY INT, S_PHONE STRING, S_ACCTBAL DOUBLE,
S_COMMENT STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS
TEXTFILE LOCATION 'file:///mnt/h/tpc-h-impala/data/supplier.tbl';

I assume that "supplier.tbl" is a directory and the csv file is present in
the same.

Let me know if it worked!

Thanks,

Saurabh


On Thu, Jul 18, 2013 at 1:55 AM, Mainak Ghosh  wrote:

> Hello,
>
> I have just started using Hive and I was trying to create an external
> table with the csv file placed in NFS. I tried using file:// and local://.
> Both of these attempts failed with the error:
>
> create external table outside_supplier (S_SUPPKEY INT, S_NAME STRING,
> S_ADDRESS STRING, S_NATIONKEY INT, S_PHONE STRING, S_ACCTBAL DOUBLE,
> S_COMMENT STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS
> TEXTFILE LOCATION 'local://mnt/h/tpc-h-impala/data/supplier.tbl';
>
> FAILED: Error in metadata: MetaException(message:Got exception:
> java.io.IOException No FileSystem for scheme: local)
>
> and
>
> create external table outside_supplier (S_SUPPKEY INT, S_NAME STRING,
> S_ADDRESS STRING, S_NATIONKEY INT, S_PHONE STRING, S_ACCTBAL DOUBLE,
> S_COMMENT STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS
> TEXTFILE LOCATION 'file://mnt/h/tpc-h-impala/data/supplier.tbl';
>
> FAILED: Error in metadata:
> MetaException(message:file:/h/tpc-h-impala/data/supplier.tbl is not a
> directory or unable to create one)
>
> Am I missing some configuration? Any help would be really appreciated.
>
> Thanks and Regards,
> Mainak.
>


RE: Connecting to Hive from R through JDBC

2013-05-08 Thread Saurabh S
Thanks for the reply, Sanjay.
The Hive server is definitely running. I had once connected Hive from within 
Python, so it's not a server issue.
It looks more like java libraries issue on my local machine, but I can't figure 
out what's wrong.

From: sanjay.subraman...@wizecommerce.com
To: user@hive.apache.org
Subject: Re: Connecting to Hive from R through JDBC
Date: Wed, 8 May 2013 00:27:35 +






Hi Saurabh



The usual suspect looks like the hive-server service is not running on the
server where Hive is installed. The hive-server service needs to be installed
and started. It listens on port 10000 by default.



Also, on a side note, I hope your Hive is connecting to MySQL or some
non-Derby RDBMS :-)



Thanks
sanjay 





From: Saurabh S 

Reply-To: "user@hive.apache.org" 

Date: Tuesday, May 7, 2013 11:55 AM

To: "user@hive.apache.org" 

Subject: Connecting to Hive from R through JDBC







Hi,



I'm trying to use package RJDBC to connect to hive through R. My client 
machine, on which R is installed, is Windows 7. I installed the package, then 
copied the file hive-jdbc-0.9.0-cdh4.1.2.jar to my local R work folder. Then 
used the following commands:



I was able to read the driver through the following command:

drv <- JDBC('org.apache.hadoop.hive.jdbc.HiveDriver', 'C:/Users/Saurabh/Documents/RWork/hive-jdbc-0.9.0-cdh4.1.2.jar')



But when I try to make the connection using the following command:

conn <- dbConnect(drv, 'jdbc:hive://<server>:<port>/default')



I get the following error:
java.lang.NoClassDefFoundError: org/apache/hadoop/hive/metastore/api/MetaException



Any idea why this is happening? Some say that it's best to have R and Hive on 
the same server so that you can just pass localhost in the dbConnect command. 
But that's not an option for me because even if R is installed on the Hadoop 
server,
 RJDBC isn't available and I'm just a local user.



Thanks,
Saurabh








Connecting to Hive from R through JDBC

2013-05-07 Thread Saurabh S
Hi,
I'm trying to use package RJDBC to connect to hive through R. My client
machine, on which R is installed, is Windows 7. I installed the package, then
copied the file hive-jdbc-0.9.0-cdh4.1.2.jar to my local R work folder. Then
used the following commands.

I was able to read the driver through the following command:

drv <- JDBC('org.apache.hadoop.hive.jdbc.HiveDriver', 'C:/Users/Saurabh/Documents/RWork/hive-jdbc-0.9.0-cdh4.1.2.jar')

But when I try to make the connection using the following command:

conn <- dbConnect(drv, 'jdbc:hive://<server>:<port>/default')

I get the following error:

java.lang.NoClassDefFoundError: org/apache/hadoop/hive/metastore/api/MetaException

Any idea why this is happening? Some say that it's best to have R and Hive on
the same server so that you can just pass localhost in the dbConnect command.
But that's not an option for me because even if R is installed on the Hadoop
server, RJDBC isn't available and I'm just a local user.

Thanks,
Saurabh

Give Custom Username to Hive Output Files

2012-12-11 Thread Saurabh Mishra
Hi,
When I try to insert some data into a Hive table mapped to a specific location
in HDFS, the file which gets created has user information as 'hive' and
permissions as '755', i.e. 'rwxr-xr-x'. Is there any way to change this so that
I can give my own username, or at least the user from which I have initiated
the Hive client interface? Kindly suggest.

Sincerely
Saurabh Mishra

Hive UDAF Limitation on Internally used Collections

2012-10-20 Thread Saurabh Mishra
Hi,

I am trying to write a UDAF for merging a group of rows, such that the
resulting merged string has no duplicates. This duplicate check is case
insensitive (I am not aware whether there is any inbuilt function for this,
since I tried looking on the
https://cwiki.apache.org/Hive/languagemanual-udf.html page and found nothing).
The problem I am facing is that, when using this UDAF, an error is thrown:
java.lang.NoSuchMethodException: java.util.Set.<init>()
and I noticed that I was using a Set collection object inside the UDAF to
store the results. This made me think that there might be some limitation on
collections which can be used inside a Hive UDAF, UDF or UDTF function.
Could someone kindly provide some information on this?

Here is the UDAF class I created; attached at the end is the entire error
stack trace:
/**
 * 
 */
package com.thomsonreuters.ims.util.udaf;

import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

import org.apache.hadoop.hive.ql.exec.UDAF;
import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;

/**
 * UDAF to return, merged and concatenated value of group of entries, after 
sorting out duplicates, case insensitively
 * 
 * @author Saurabh
 */
public class UDAFCaseInsensitiveDistinctMerge extends UDAF {

/**
 * Default Separator Defined and used unless overriden.
 */
private static final String DEFAULT_SEPARATOR = ";";

/**
 * Nested Class to Store the Updated Set of Unique Entries and provide 
methods to interact with the stored Set of
 * Entries.
 * 
 * @author Saurabh
 * 
 */
public static class UniqueEntries {
private Set<String> uniqueEntries = new LinkedHashSet<String>();

/**
 * Add to the Set of String, after converting the String to Upper Case
 * 
 * @param entry
 */
public void addToSet(String entry) {
uniqueEntries.add(entry.toUpperCase());
}

/**
 * Return the Unique values stored in the Set after sorting and 
separated by the default Separator.
 * 
 * @return mergedString
 */
public String getMergedString() {
String[] uniqueArray = uniqueEntries.toArray(new String[0]);
Arrays.sort(uniqueArray);
StringBuilder stringBuilder = new StringBuilder();
for (String entry : uniqueArray) {
stringBuilder.append(entry).append(DEFAULT_SEPARATOR);
}

return stringBuilder.substring(0, stringBuilder.length() - 1);
}

/**
 * Add a Collection of Strings to the Existing String Set
 * 
 * @param entriesCollection
 */
public void addCollectionToSet(Set<String> entriesCollection) {
uniqueEntries.addAll(entriesCollection);
}

/**
 * Retrieve the Stored String Set
 * 
 * @return stringSet
 */
public Set<String> getUniqueEntriesCollection() {
return uniqueEntries;
}

/**
 * Clear the Entries of the String Set.
 */
public void clear() {
uniqueEntries.clear();
}
}

/**
 * Private Constructor to Prevent the Instantiation of the Class.
 */
private UDAFCaseInsensitiveDistinctMerge() {
// Prevent Instantiation
}

/**
 * Provided Evaluator Implementation to Process the Group of Rows Passed to 
the UDAF.
 * 
 * @author Saurabh
 * 
 */
public class UDAFCaseInsensitiveDistinctMergeEvaluator implements UDAFEvaluator {

private UniqueEntries uniqueEntries;

/**
 * Constructor to initialize the intermediate Storage Class and 
invoking the init method.
 */
public UDAFCaseInsensitiveDistinctMergeEvaluator() {
super();
uniqueEntries = new UniqueEntries();
init();
}

/**
 * Initializing Method. Clearing the Stored String Set.
 */
public void init() {
uniqueEntries.clear();
}

/**
 * Iterating over the group of rows and passing their values to the 
String set for non duplicate storage.
 * 
 * @param entry
 * @return
 */
public boolean iterate(String entry) {
uniqueEntries.addToSet(entry);
return true;
}

/**
 * Handle for Partially Terminated UDAF Evaluation Call.
 * 
 * @return
 */
public UniqueEntries terminatePartial() {
return uniqueEntries;
}

/**
 * Handler for resuming Partially Terminated Evaluation Call.
 * 
 * @param previousEntries
 * @return
 */
public boolean merge(UniqueEntries previousEntries) {

uniqueEntries.addCollectionToSet(previousEntries.getUniqueEntriesCollection());
r

RE: Hive Query Unable to distribute load evenly in reducers

2012-10-18 Thread Saurabh Mishra
Any views on the problem?

From: saurabhmishra.i...@outlook.com
To: user@hive.apache.org; navis@nexr.com
Subject: RE: Hive Query Unable to distribute load evenly in reducers
Date: Tue, 16 Oct 2012 11:23:29 +0530




by using mapjoin if you are implying setting 
set hive.auto.convert.join=true;
then this configuration i am already using, but to no avail...:(

Date: Tue, 16 Oct 2012 14:17:47 +0900
Subject: Re: Hive Query Unable to distribute load evenly in reducers
From: navis@nexr.com
To: user@hive.apache.org

How about using MapJoin?

2012/10/16 Saurabh Mishra 




no there is apparently no heavy skewing. also another stats i wanted to point 
was, following is approximate table contents in this 4 table join query : 
tableA : 170 million (actual number, + i am also exploding these records, so 
the number could be much much higher)

tableB:15
tableC:45
tableD:45
tableE : 45
tableF  : 14000

Also i cannot put any filter condition on tableA ,situation does not permit so. 
:( 
Kindly suggest, some alternative solution or some hive configuration to better 
load distribute in the reducers


> Date: Mon, 15 Oct 2012 16:29:56 +0100
> Subject: Re: Hive Query Unable to distribute load evenly in reducers
> From: philip.j.trom...@gmail.com

> To: user@hive.apache.org
> 
> Is your data heavily skewed towards certain values of a.x etc?
> 
> On 15 October 2012 15:23, Saurabh Mishra  
> wrote:

> > The queries are simple joins, something on the lines of
> > select a, b, c, count(D) from tableA join tableB on a.x=b.y join group
> > by a, b,c;
> >
> >
> >> From: liy...@gmail.com

> >> Date: Mon, 15 Oct 2012 21:10:39 +0800
> >> Subject: Re: Hive Query Unable to distribute load evenly in reducers
> >> To: user@hive.apache.org

> >
> >>
> >> And your queries were?
> >>
> >> On Mon, Oct 15, 2012 at 8:09 PM, Saurabh Mishra
> >>  wrote:

> >> > Hi,
> >> > I am firing some hive queries joining tables containing upto 30millions
> >> > records each. Since the load on the reducers is very significant in
> >> > these

> >> > cases, i specifically set the following parameters before executing the
> >> > queries :
> >> >
> >> > set mapred.reduce.tasks=100;
> >> > set hive.exec.reducers.bytes.per.reducer=5;

> >> > set hive.optimize.cp=true;
> >> >
> >> > The number of reducer the job spouts in now 160, but despite the high
> >> > number
> >> > most of the load remains upon 1 or 2 reducers. Hence in the final

> >> > statistics, 158 reducers go completed with 2-3 minutes of start and 2
> >> > reducers took 2 hrs to run.
> >> > Is there any way to overcome this load distribution disparity.

> >> > Any help in this regards will be highly appreciated.
> >> >
> >> > Sincerely
> >> > Saurabh Mishra
  


  

RE: Hive Query Unable to distribute load evenly in reducers

2012-10-15 Thread Saurabh Mishra
If by using mapjoin you mean setting
set hive.auto.convert.join=true;
then I am already using this configuration, but to no avail... :(

Date: Tue, 16 Oct 2012 14:17:47 +0900
Subject: Re: Hive Query Unable to distribute load evenly in reducers
From: navis@nexr.com
To: user@hive.apache.org

How about using MapJoin?

2012/10/16 Saurabh Mishra 




no there is apparently no heavy skewing. also another stats i wanted to point 
was, following is approximate table contents in this 4 table join query : 
tableA : 170 million (actual number, + i am also exploding these records, so 
the number could be much much higher)

tableB:15
tableC:45
tableD:45
tableE : 45
tableF  : 14000

Also i cannot put any filter condition on tableA ,situation does not permit so. 
:( 
Kindly suggest, some alternative solution or some hive configuration to better 
load distribute in the reducers


> Date: Mon, 15 Oct 2012 16:29:56 +0100
> Subject: Re: Hive Query Unable to distribute load evenly in reducers
> From: philip.j.trom...@gmail.com

> To: user@hive.apache.org
> 
> Is your data heavily skewed towards certain values of a.x etc?
> 
> On 15 October 2012 15:23, Saurabh Mishra  
> wrote:

> > The queries are simple joins, something on the lines of
> > select a, b, c, count(D) from tableA join tableB on a.x=b.y join group
> > by a, b,c;
> >
> >
> >> From: liy...@gmail.com

> >> Date: Mon, 15 Oct 2012 21:10:39 +0800
> >> Subject: Re: Hive Query Unable to distribute load evenly in reducers
> >> To: user@hive.apache.org

> >
> >>
> >> And your queries were?
> >>
> >> On Mon, Oct 15, 2012 at 8:09 PM, Saurabh Mishra
> >>  wrote:

> >> > Hi,
> >> > I am firing some hive queries joining tables containing upto 30millions
> >> > records each. Since the load on the reducers is very significant in
> >> > these

> >> > cases, i specifically set the following parameters before executing the
> >> > queries :
> >> >
> >> > set mapred.reduce.tasks=100;
> >> > set hive.exec.reducers.bytes.per.reducer=5;

> >> > set hive.optimize.cp=true;
> >> >
> >> > The number of reducer the job spouts in now 160, but despite the high
> >> > number
> >> > most of the load remains upon 1 or 2 reducers. Hence in the final

> >> > statistics, 158 reducers go completed with 2-3 minutes of start and 2
> >> > reducers took 2 hrs to run.
> >> > Is there any way to overcome this load distribution disparity.

> >> > Any help in this regards will be highly appreciated.
> >> >
> >> > Sincerely
> >> > Saurabh Mishra
  

  

RE: Hive Query Unable to distribute load evenly in reducers

2012-10-15 Thread Saurabh Mishra
No, there is apparently no heavy skewing. Also, another stat I wanted to point
out: the following are the approximate table contents in this join query:
tableA : 170 million (actual number; I am also exploding these records, so
the number could be much, much higher)
tableB : 15
tableC : 45
tableD : 45
tableE : 45
tableF : 14000

Also, I cannot put any filter condition on tableA; the situation does not
permit it. :(
Kindly suggest some alternative solution or some Hive configuration to better
distribute the load across the reducers.

> Date: Mon, 15 Oct 2012 16:29:56 +0100
> Subject: Re: Hive Query Unable to distribute load evenly in reducers
> From: philip.j.trom...@gmail.com
> To: user@hive.apache.org
> 
> Is your data heavily skewed towards certain values of a.x etc?
> 
> On 15 October 2012 15:23, Saurabh Mishra  
> wrote:
> > The queries are simple joins, something on the lines of
> > select a, b, c, count(D) from tableA join tableB on a.x=b.y join group
> > by a, b,c;
> >
> >
> >> From: liy...@gmail.com
> >> Date: Mon, 15 Oct 2012 21:10:39 +0800
> >> Subject: Re: Hive Query Unable to distribute load evenly in reducers
> >> To: user@hive.apache.org
> >
> >>
> >> And your queries were?
> >>
> >> On Mon, Oct 15, 2012 at 8:09 PM, Saurabh Mishra
> >>  wrote:
> >> > Hi,
> >> > I am firing some hive queries joining tables containing upto 30millions
> >> > records each. Since the load on the reducers is very significant in
> >> > these
> >> > cases, i specifically set the following parameters before executing the
> >> > queries :
> >> >
> >> > set mapred.reduce.tasks=100;
> >> > set hive.exec.reducers.bytes.per.reducer=5;
> >> > set hive.optimize.cp=true;
> >> >
> >> > The number of reducer the job spouts in now 160, but despite the high
> >> > number
> >> > most of the load remains upon 1 or 2 reducers. Hence in the final
> >> > statistics, 158 reducers go completed with 2-3 minutes of start and 2
> >> > reducers took 2 hrs to run.
> >> > Is there any way to overcome this load distribution disparity.
> >> > Any help in this regards will be highly appreciated.
> >> >
> >> > Sincerely
> >> > Saurabh Mishra
  

RE: Hive Query Unable to distribute load evenly in reducers

2012-10-15 Thread Saurabh Mishra
The queries are simple joins, something along the lines of
select a, b, c, count(D) from tableA join tableB on a.x=b.y join group by 
a, b,c;


> From: liy...@gmail.com
> Date: Mon, 15 Oct 2012 21:10:39 +0800
> Subject: Re: Hive Query Unable to distribute load evenly in reducers
> To: user@hive.apache.org
> 
> And your queries were?
> 
> On Mon, Oct 15, 2012 at 8:09 PM, Saurabh Mishra
>  wrote:
> > Hi,
> > I am firing some hive queries joining tables containing upto 30millions
> > records each. Since the load on the reducers is very significant in these
> > cases, i specifically set the following parameters before executing the
> > queries :
> >
> > set mapred.reduce.tasks=100;
> > set hive.exec.reducers.bytes.per.reducer=5;
> > set hive.optimize.cp=true;
> >
> > The number of reducer the job spouts in now 160, but despite the high number
> > most of the load remains upon 1 or 2 reducers. Hence in the final
> > statistics, 158 reducers go completed with 2-3 minutes of start and 2
> > reducers took 2 hrs to run.
> > Is there any way to overcome this load distribution disparity.
> > Any help in this regards will be highly appreciated.
> >
> > Sincerely
> > Saurabh Mishra
  

Hive Query Unable to distribute load evenly in reducers

2012-10-15 Thread Saurabh Mishra
Hi,
I am firing some Hive queries joining tables containing up to 30 million
records each. Since the load on the reducers is very significant in these
cases, I specifically set the following parameters before executing the
queries:

set mapred.reduce.tasks=100;
set hive.exec.reducers.bytes.per.reducer=5;
set hive.optimize.cp=true;

The number of reducers the job spawns is now 160, but despite the high number
most of the load remains on 1 or 2 reducers. Hence in the final statistics,
158 reducers completed within 2-3 minutes of starting and 2 reducers took 2
hrs to run.
Is there any way to overcome this load distribution disparity?
Any help in this regards will be highly appreciated.

Sincerely
Saurabh Mishra
  

Custom UDF in Python?

2012-06-05 Thread Saurabh S

Is it possible to write Hive UDFs in Python? I googled but didn't find
anything. I would be happy with RTFM replies if you can give a link to the
manual.
  

RE: 'set cli header' throws null pointer exception

2012-06-01 Thread Saurabh S

The ${HIVE_HOME}/lib contains files like hive-*-0.7.1-cdh3u3.jar, so I guess 
it's version 0.7.1 provided by Cloudera.


> From: jgho...@gmail.com
> Date: Fri, 1 Jun 2012 17:11:18 -0700
> Subject: Re: 'set cli header' throws null pointer exception
> To: user@hive.apache.org
>
> Which version of Hive are you running?
>
> On Fri, Jun 1, 2012 at 3:49 PM, Saurabh S  wrote:
> >
> > Well it seems that simply moving the set header statement after the 'create 
> > temporary function' statement works just fine.
> >
> > 
> >> From: saurab...@live.com
> >> To: user@hive.apache.org
> >> Subject: 'set cli header' throws null pointer exception
> >> Date: Fri, 1 Jun 2012 14:11:09 -0700
> >>
> >>
> >> This is a weird issue. I've been setting hive CLI headers for a long time 
> >> using the following:
> >>
> >> set hive.cli.print.header=true;
> >>
> >> Recently, I started using custom Hive UDF. Here is my test file, 
> >> myfunc.sql:
> >>
> >> --
> >> add jar ;
> >> create temporary function  as ;
> >> ;
> >> ;
> >> --
> >>
> >> This works just fine. However, if I add the line hive cli header to the 
> >> top, Hive crashes at line 2, i.e. the create temporary function with the 
> >> following error:
> >>
> >> --
> >> Added /home/somejar.jar to class path
> >> Added resource: /home/somejar.jar
> >> OK
> >> Exception in thread "main" java.lang.NullPointerException
> >> at 
> >> org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:221)
> >> at 
> >> org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:286)
> >> at 
> >> org.apache.hadoop.hive.cli.CliDriver.processReader(CliDriver.java:310)
> >> at 
> >> org.apache.hadoop.hive.cli.CliDriver.processFile(CliDriver.java:317)
> >> at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:490)
> >> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >> at 
> >> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >> at 
> >> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >> at java.lang.reflect.Method.invoke(Method.java:597)
> >> at org.apache.hadoop.util.RunJar.main(RunJar.java:197)
> >> --
> >>
> >> Any idea what's going on?
> >>
> >> Regards,
> >> Saurabh
> >>
> >
  

RE: 'set cli header' throws null pointer exception

2012-06-01 Thread Saurabh S

Well it seems that simply moving the set header statement after the 'create 
temporary function' statement works just fine.


> From: saurab...@live.com
> To: user@hive.apache.org
> Subject: 'set cli header' throws null pointer exception
> Date: Fri, 1 Jun 2012 14:11:09 -0700
>
>
> This is a weird issue. I've been setting hive CLI headers for a long time 
> using the following:
>
> set hive.cli.print.header=true;
>
> Recently, I started using custom Hive UDF. Here is my test file, myfunc.sql:
>
> --
> add jar ;
> create temporary function  as ;
> ;
> ;
> --
>
> This works just fine. However, if I add the line hive cli header to the top, 
> Hive crashes at line 2, i.e. the create temporary function with the following 
> error:
>
> --
> Added /home/somejar.jar to class path
> Added resource: /home/somejar.jar
> OK
> Exception in thread "main" java.lang.NullPointerException
> at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:221)
> at 
> org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:286)
> at 
> org.apache.hadoop.hive.cli.CliDriver.processReader(CliDriver.java:310)
> at 
> org.apache.hadoop.hive.cli.CliDriver.processFile(CliDriver.java:317)
> at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:490)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:197)
> --
>
> Any idea what's going on?
>
> Regards,
> Saurabh
>
  

'set cli header' throws null pointer exception

2012-06-01 Thread Saurabh S

This is a weird issue. I've been setting hive CLI headers for a long time using 
the following:

set hive.cli.print.header=true;

Recently, I started using custom Hive UDF. Here is my test file, myfunc.sql:

--
add jar <path to jar>;
create temporary function <function name> as <class name>;
<query 1>;
<query 2>;
--

This works just fine. However, if I add the hive cli header line to the top,
Hive crashes at line 2, i.e. the create temporary function statement, with the
following error:

--
Added /home/somejar.jar to class path
Added resource: /home/somejar.jar
OK
Exception in thread "main" java.lang.NullPointerException
    at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:221)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:286)
    at org.apache.hadoop.hive.cli.CliDriver.processReader(CliDriver.java:310)
    at org.apache.hadoop.hive.cli.CliDriver.processFile(CliDriver.java:317)
    at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:490)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:197)
--

Any idea what's going on?

Regards,
Saurabh
  

Hive equivalent of group_concat() ?

2012-05-11 Thread Saurabh S

As far as I understand, there is no equivalent of MySQL group_concat() in Hive.
This stackoverflow question is from Sept 2010:
http://stackoverflow.com/questions/3703740/combine-multiple-rows-into-one-space-separated-string

Does anyone know any other method to create a delimited list from table rows?
There are a few methods listed in Anthony Molinaro's SQL Cookbook, but all of
them use the over(partition by) method, which is also not available in Hive.
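
For later readers, one approach that works in Hive without windowing
functions is to combine a collecting aggregate with concat_ws() (a hedged
sketch; my_table, id and val are assumed names, val is assumed to be a
string, and collect_set() removes duplicates along the way):

-- Build a comma-delimited list of values per id, roughly what MySQL's
-- group_concat() does: collect_set() gathers the values into an array,
-- concat_ws() joins the array into one delimited string.
SELECT
  id,
  concat_ws(',', collect_set(val)) AS val_list
FROM my_table
GROUP BY id;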
  

RE: Passing date as hive configuration variable

2012-05-10 Thread Saurabh S

Whew, thanks everyone! I think wrapping quotes around that did it.

Nicole, I was going to attempt that as a last resort. But the actual query is 
much longer and it would be extremely undesirable to do so.

Regards,
Saurabh


> From: nicole.ge...@memorylane.com
> To: user@hive.apache.org
> Subject: Re: Passing date as hive configuration variable
> Date: Thu, 10 May 2012 18:39:10 +
>
> Another option is to do it all in the shell:
>
> #/bin/sh
> somedate=$(date -d '2 day ago' +"%Y-%m-%d")
> echo "$somedate"
> script="
> select count(*)
> from myschema.mytable
> where local_dt > $somedate
> "
> echo "$script"
> hive -e $script > output.dat
>
>
>
> On 5/10/12 11:34 AM, "Tucker, Matt"  wrote:
>
> >You'll want to wrap ${hiveconf:ref_date} in quotes, so that's it's passed
> >as a string in the query.
> >
> >SELECT "${hiveconf:ref_date}" FROM dummytable LIMIT 1;
> >
> >Matt Tucker
> >Associate eBusiness Analyst
> >Walt Disney Parks and Resorts Online
> >Ph: 407-566-2545
> >Tie: 8-296-2545
> >
> >CONFIDENTIAL
> >
> >-Original Message-
> >From: Saurabh S [mailto:saurab...@live.com]
> >Sent: Thursday, May 10, 2012 2:06 PM
> >To: user@hive.apache.org
> >Subject: Passing date as hive configuration variable
> >
> >
> >I'm having a hard time passing a date as a hive environment variable.
> >
> >The setting is this: The table I'm querying is partitioned on a date
> >column, say, local_dt. I wish to query on last two days' worth of data.
> >Unfortunately there seems to be no way of getting the current date
> >without either scanning the entire table on all local dates
> >(current_timestamp()) or writing a custom UDF. So, I'm trying to pass it
> >as a parameter within shell.
> >
> >Here is my test_query.sql:
> >
> >
> >select count(*)
> >from myschema.mytable
> >where local_dt > ${hiveconf:ref_date}
> >;
> >
> >
> >and here is the driver shell script:
> >
> >
> >#/bin/sh
> >somedate=$(date -d '2 day ago' +"%Y-%m-%d") echo $somedate hive -hiveconf
> >ref_date=$somedate -f test_query.sql > output.dat
> >
> >
> >The problem is that Hive is performing subtractions in the date format.
> >echo $somedate produces "2012-05-08" and "select ${hiveconf:ref_date}
> >from dummytable limit 1" produces "1999".
> >
> >I noticed that there is an option to "set
> >hive.variable.substitute=false;", but in that case, hive throws the
> >following error:
> >FAILED: Parse Error: line 3:7 cannot recognize input near '$' '{'
> >'hiveconf' in select clause
> >
> >Regards,
> >Saurabh
> >
> >P.S. I'm using this as reference:
> >http://hive.apache.org/docs/r0.9.0/language_manual/var_substitution.html
> >
> >
>
  

Passing date as hive configuration variable

2012-05-10 Thread Saurabh S

I'm having a hard time passing a date as a hive environment variable.

The setting is this: The table I'm querying is partitioned on a date column, 
say, local_dt. I wish to query on last two days' worth of data. Unfortunately 
there seems to be no way of getting the current date without either scanning 
the entire table on all local dates (current_timestamp()) or writing a custom 
UDF. So, I'm trying to pass it as a parameter within shell.

Here is my test_query.sql:


select count(*)
from myschema.mytable
where local_dt > ${hiveconf:ref_date}
;


and here is the driver shell script:


#/bin/sh
somedate=$(date -d '2 day ago' +"%Y-%m-%d")
echo $somedate
hive -hiveconf ref_date=$somedate -f test_query.sql > output.dat


The problem is that Hive is treating the date as an arithmetic expression:
echo $somedate produces "2012-05-08", while "select ${hiveconf:ref_date} from
dummytable limit 1" produces "1999" (i.e. 2012 - 05 - 08).

I noticed that there is an option to "set hive.variable.substitute=false;", but 
in that case, hive throws the following error:
FAILED: Parse Error: line 3:7 cannot recognize input near '$' '{' 'hiveconf' in 
select clause

Regards,
Saurabh

P.S. I'm using this as reference: 
http://hive.apache.org/docs/r0.9.0/language_manual/var_substitution.html

  

Get current date in hive

2012-04-25 Thread Saurabh S

Hi,

How do I get the current date in Hive? Specifically, I'm looking for the
equivalent of the following SQL where clause:

where LOCAL_DT >= current date - 3 day

I tried using

where local_dt >= date_sub(to_date(unix_timestamp()), 3)

but this method seems to be many times slower than hardcoding the date.
local_dt is the local date and also the partition key for the table.
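
For reference, a hedged sketch of building the cutoff from built-ins (note
that because the expression is not a constant, the partition filter on
local_dt may not be pruned, which would be consistent with the slowness
described above; passing the date in from the shell, as in the "Passing date
as hive configuration variable" thread, avoids that):

-- Current date as a 'yyyy-MM-dd' string, minus 3 days.
select count(*)
from myschema.mytable
where local_dt >= date_sub(from_unixtime(unix_timestamp(), 'yyyy-MM-dd'), 3);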



  

Hive equivalent of row_number()

2012-04-12 Thread Saurabh S

I have a table with three columns, A, B, and Score, where A and B are some
items, and Score is some kind of affinity between A and B. There are N items
of each of A and B, so the total number of rows in the table is N^2.

Is there a way to fetch "top 5 items in B" for each item in A? So, for each 
distinct item in A, I want to look up 5 items in B which have the highest value 
in Score.

If this were to be done in DB2, I would probably use some kind of windowing 
function using row_number().
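
For later readers: windowing functions were added in Hive 0.11 (after this
post), so the DB2-style approach carries over directly. A hedged sketch, with
affinity_table as an assumed table name and columns named after the
description above:

-- Top 5 items of B per item of A, by Score.
SELECT a, b, score
FROM (
  SELECT a, b, score,
         row_number() OVER (PARTITION BY a ORDER BY score DESC) AS rn
  FROM affinity_table
) ranked
WHERE rn <= 5;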
  

RE: Help in aggregating comma separated values

2012-03-28 Thread Saurabh S

I downloaded the data to a file and checked. Like you said, there was a space 
after the '3'. Removing that space gives the exact, expected result. In fact, I 
do have these artifacts in the 'real' data and will have to treat them first.

Thanks a lot, Matt! I've been trying to crack this problem for hours.

From: matt.tuc...@disney.com
To: user@hive.apache.org
Date: Wed, 28 Mar 2012 14:46:59 -0400
Subject: RE: Help in aggregating comma separated values



Are you sure that the values in the ID column are the same? If it’s a string
datatype, do you have any spaces after the ‘3’ but before the tab?

Matt Tucker

From: Saurabh S [mailto:saurab...@live.com]
Sent: Wednesday, March 28, 2012 2:45 PM
To: user@hive.apache.org
Subject: RE: Help in aggregating comma separated values

Thanks for the reply, Matt. This is exactly what I'm looking for. I'll look
into the explode function more deeply.

Another question: For that example table, your query generates a slightly
different result. The values for id 3 are not aggregated. So the result is as
follows. Any idea what's happening?

id   value cnt
1    10    1
1    20    1
1    30    1
2    20    2
2    30    1
3    20    1
3    30    1
3    40    1
3    20    1
3    30    1
3    40    1

> From: matt.tuc...@disney.com
3401 > From: matt.tuc...@disney.com
> To: user@hive.apache.org
> Date: Wed, 28 Mar 2012 14:25:43 -0400
> Subject: RE: Help in aggregating comma separated values
> 
> Hi,
> 
> The values column needs to be split across multiple rows, so you can then 
> aggregate them. Try this:
> 
> SELECT id, value, COUNT(*) cnt
> FROM table1 LATERAL VIEW explode(split(values, ",")) values_tbl as value
> GROUP BY id, value
> 
> 
> 
> Matt Tucker
> 
> -Original Message-
> From: Saurabh S [mailto:saurab...@live.com] 
> Sent: Wednesday, March 28, 2012 2:21 PM
> To: user@hive.apache.org
> Subject: Help in aggregating comma separated values
> 
> 
> Hello,
> 
> How do I get count from a list of comma separated values? For the lack of 
> better wording, here is an example:
> 
> Suppose there is a table with two columns, id (integers) and values (string) 
> in the following fashion.
> 
> id   values
> 1    10,20,30
> 2    20,30
> 2    20
> 3    20,30,40
> 3    40,20,30
> 
> I want an output like this:
> 
> id   value count
> 1    10    1
> 1    20    1
> 1    30    1
> 2    20    2
> 2    30    1
> 3    20    2
> 3    30    2
> 3    40    2
> 
> So, it's a list containing the values those appear against each id, along 
> with their count.
> 
> I understand this is more of a SQL kind of question rather than one specific 
> to Hive, but I'm at a roadblock here.
> 
> Thanks,
> Saurabh
> 

RE: Help in aggregating comma separated values

2012-03-28 Thread Saurabh S

Thanks for the reply, Matt. This is exactly what I'm looking for. I'll look 
into the explode function more deeply.

Another question: For that example table, your query generates a slightly
different result. The values for id 3 are not aggregated. So the result is as
follows. Any idea what's happening?

id   value cnt
1    10    1
1    20    1
1    30    1
2    20    2
2    30    1
3    20    1
3    30    1
3    40    1
3    20    1
3    30    1
3    40    1


> From: matt.tuc...@disney.com
> To: user@hive.apache.org
> Date: Wed, 28 Mar 2012 14:25:43 -0400
> Subject: RE: Help in aggregating comma separated values
> 
> Hi,
> 
> The values column needs to be split across multiple rows, so you can then 
> aggregate them.  Try this:
> 
> SELECT id, value, COUNT(*) cnt
> FROM table1 LATERAL VIEW explode(split(values, ",")) values_tbl as value
> GROUP BY id, value
> 
> 
> 
> Matt Tucker
> 
> -Original Message-
> From: Saurabh S [mailto:saurab...@live.com] 
> Sent: Wednesday, March 28, 2012 2:21 PM
> To: user@hive.apache.org
> Subject: Help in aggregating comma separated values
> 
> 
> Hello,
> 
> How do I get count from a list of comma separated values? For the lack of 
> better wording, here is an example:
> 
> Suppose there is a table with two columns, id (integers) and values (string) 
> in the following fashion.
> 
> id   values
> 1    10,20,30
> 2    20,30
> 2    20
> 3    20,30,40
> 3    40,20,30
> 
> I want an output like this:
> 
> id   value count
> 1    10    1
> 1    20    1
> 1    30    1
> 2    20    2
> 2    30    1
> 3    20    2
> 3    30    2
> 3    40    2
> 
> So, it's a list containing the values those appear against each id, along 
> with their count.
> 
> I understand this is more of a SQL kind of question rather than one specific 
> to Hive, but I'm at a roadblock here.
> 
> Thanks,
> Saurabh
> 
  

Help in aggregating comma separated values

2012-03-28 Thread Saurabh S

Hello,

How do I get count from a list of comma separated values? For the lack of 
better wording, here is an example:

Suppose there is a table with two columns, id (integers) and values (string) in 
the following fashion.

id   values
1    10,20,30
2    20,30
2    20
3    20,30,40
3    40,20,30

I want an output like this:

id   value count
1    10    1
1    20    1
1    30    1
2    20    2
2    30    1
3    20    2
3    30    2
3    40    2

So, it's a list containing the values that appear against each id, along with 
their count.

I understand this is more of a SQL kind of question rather than one specific to 
Hive, but I'm at a roadblock here.

Thanks,
Saurabh
  

Length of an array

2012-03-21 Thread Saurabh S

How do I get the length of an array in Hive?

Specifically, I'm looking at the following problem: I'm splitting a column 
using the split() function and a pattern. However, the resulting array can have a 
variable number of entries, and I want to handle each case separately.
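
Hive's size() function returns the number of elements in an array, which makes the 
variable-length case manageable. A rough sketch, with urls_table and url as placeholder 
names that are not taken from the thread:

-- size() gives the element count of the array produced by split();
-- a CASE on it lets short and long splits be handled differently
SELECT url,
       size(split(url, '/')) AS n_parts,
       CASE WHEN size(split(url, '/')) >= 3
            THEN split(url, '/')[2]
            ELSE NULL
       END AS third_part
FROM urls_table;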
  

RE: Accessing elements from array returned by split() function

2012-03-01 Thread Saurabh S

Of course, it works fine now. I feel like an idiot.

And that problem using parse_url also went away and I can use that as well.

Thanks a bunch, Phil.

> Date: Thu, 1 Mar 2012 21:22:27 +
> Subject: Re: Accessing elements from array returned by split() function
> From: philip.j.trom...@gmail.com
> To: user@hive.apache.org
> 
> I guess that split(...)[1] is giving you what's in between the 1st and
> 2nd '/' characters, which is nothing. Try split(...)[2].
> 
> Phil.
> 
> On 1 March 2012 21:19, Saurabh S  wrote:
> > Hello,
> >
> > I have a set of URLs which I need to parse. For example, if the url is,
> > http://www.google.com/anything/goes/here,
> >
> > I need to extract www.google.com, i.e. everything between the second and third
> > forward slashes.
> >
> > I can't figure out the regex pattern to do so, and am trying to use the split()
> > function instead. So, my hive query looks like
> > select url, split(url,'/')
> > ...
> >
> > The second column contains the entire array returned by the split function.
> > Is there any way to access only the second element of the array, which will
> > give me what I need?
> >
> > When I try the following statement select url, split(url,'/')[1], I get an
> > empty second column.
> >
> > Is this the expected behavior? Any other suggestions on how to parse the
> > URL?
> >
> > Oh by the way, I'm aware that the function parse_url(url,'HOST') will give
> > me something similar to what I want, but for some reason, that function on
> > my database is running extremely slow.
> >
> > First time posting to this list. If there is anything wrong, please let me
> > know.
> >
> > Regards,
> > Saurabh
> >
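
A rough sketch of both routes discussed in this thread, with urls_table as a placeholder 
name: split() indexes are zero-based, so [2] is the piece between the second and third 
'/', and regexp_extract() is one possible alternative when parse_url() is slow, though 
whether it is actually faster will depend on the data and cluster.

-- split()[2] and a regular expression both pull out the host portion of the URL
SELECT url,
       split(url, '/')[2] AS host_via_split,
       regexp_extract(url, '^[a-zA-Z]+://([^/]+).*', 1) AS host_via_regexp
FROM urls_table;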
  

Accessing elements from array returned by split() function

2012-03-01 Thread Saurabh S

Hello,

I have a set of URLs which I need to parse. For example, if the url is,
http://www.google.com/anything/goes/here,

I need to extract www.google.com, i.e. everything between the second and third 
forward slashes.

I can't figure out the regex pattern to do so, and am trying to use the split() 
function instead. So, my Hive query looks like
select url, split(url,'/')
...

The second column contains the entire array returned by the split function. Is 
there any way to access only the second element of the array, which will give 
me what I need?

When I try the following statement select url, split(url,'/')[1], I get an 
empty second column.

Is this the expected behavior? Any other suggestions on how to parse the URL?

Oh by the way, I'm aware that the function parse_url(url,'HOST') will give me 
something similar to what I want, but for some reason, that function on my 
database is running extremely slow.

First time posting to this list. If there is anything wrong, please let me know.

Regards,
Saurabh

  

RE: Error in running Hive with Postgresql as metastore DB

2012-01-10 Thread Saurabh Bajaj
Hi Guys,

I have found the issue.
I did not exit Hive using quit; but instead suspended it with Ctrl+Z.
As a result, background Hive processes were still running and causing trouble in 
the query execution.

I killed the running Hive processes and was able to run Hive with derby as well 
as MySQL as metastore DB.

Thanks!
Saurabh Bajaj | Senior Business Analyst | +91 9986588089 | www.mu-sigma.com |



From: Saurabh Bajaj
Sent: Tuesday, January 10, 2012 2:44 PM
To: 'user@hive.apache.org'
Subject: Error in running Hive with Postgresql as metastore DB

Hi Everyone,

I am getting an error while installing and running Hive using a PostgreSQL database 
as the metastore DB.
The error, which occurs while running any query in Hive from the terminal, is as 
follows:


FAILED: Error in metadata: javax.jdo.JDOFatalInternalException: Error creating 
transactional connection factory
NestedThrowables:
java.lang.reflect.InvocationTargetException
FAILED: Execution Error, return code 1 from 
org.apache.hadoop.hive.ql.exec.DDLTask


I installed the CDH3U2 versions of Hadoop and Hive, and the only changes I have made 
to the Hive configuration are as follows:

1. export HIVE_AUX_JARS_PATH=/usr/local/hadoop

2. There is no hive-site.xml, so I made the changes in hive-default.xml as 
follows:


<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:postgresql://localhost:5432/hadoop</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>org.postgresql.Driver</value>
  <description>Driver class name for a JDBC metastore</description>
</property>

<property>
  <name>javax.jdo.PersistenceManagerFactoryClass</name>
  <value>org.datanucleus.jdo.JDOPersistenceManagerFactory</value>
  <description>class implementing the jdo persistence</description>
</property>

<property>
  <name>javax.jdo.option.DetachAllOnCommit</name>
  <value>true</value>
  <description>detaches all objects from session so that they can be used after transaction is committed</description>
</property>

<property>
  <name>javax.jdo.option.NonTransactionalRead</name>
  <value>true</value>
  <description>reads outside of transactions</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hadoop</value>
  <description>username to use against metastore database</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hadoop</value>
  <description>password to use against metastore database</description>
</property>



3. I have placed postgresql-9.1-901.jdbc4.jar in the /usr/local/hadoop path
4. Installed PostgreSQL and created the username/password hadoop/hadoop
5. In postgresql.conf, uncommented listen_addresses and set it to 'localhost','*' 
(the '*' part was added).


Please help me understand why this error would be occurring.

Thanks in advance!


Saurabh Bajaj
+91 9986588089








Error in running Hive with Postgresql as metastore DB

2012-01-10 Thread Saurabh Bajaj
Hi Everyone,

I am getting an error while installing and running Hive using a PostgreSQL database 
as the metastore DB.
The error, which occurs while running any query in Hive from the terminal, is as 
follows:


FAILED: Error in metadata: javax.jdo.JDOFatalInternalException: Error creating 
transactional connection factory
NestedThrowables:
java.lang.reflect.InvocationTargetException
FAILED: Execution Error, return code 1 from 
org.apache.hadoop.hive.ql.exec.DDLTask


I installed the CDH3U2 versions of Hadoop and Hive, and the only changes I have made 
to the Hive configuration are as follows:

1. export HIVE_AUX_JARS_PATH=/usr/local/hadoop

2. There is no hive-site.xml, so I made the changes in hive-default.xml as 
follows:


<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:postgresql://localhost:5432/hadoop</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>org.postgresql.Driver</value>
  <description>Driver class name for a JDBC metastore</description>
</property>

<property>
  <name>javax.jdo.PersistenceManagerFactoryClass</name>
  <value>org.datanucleus.jdo.JDOPersistenceManagerFactory</value>
  <description>class implementing the jdo persistence</description>
</property>

<property>
  <name>javax.jdo.option.DetachAllOnCommit</name>
  <value>true</value>
  <description>detaches all objects from session so that they can be used after transaction is committed</description>
</property>

<property>
  <name>javax.jdo.option.NonTransactionalRead</name>
  <value>true</value>
  <description>reads outside of transactions</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hadoop</value>
  <description>username to use against metastore database</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hadoop</value>
  <description>password to use against metastore database</description>
</property>



3. I have placed postgresql-9.1-901.jdbc4.jar in the /usr/local/hadoop path
4. Installed PostgreSQL and created the username/password hadoop/hadoop
5. In postgresql.conf, uncommented listen_addresses and set it to 'localhost','*' 
(the '*' part was added).


Please help me understand why this error would be occurring.

Thanks in advance!


Saurabh Bajaj
+91 9986588089





