Is the USE database command hive server-wide or session specific

2012-06-26 Thread Ladda, Anand
I am connecting to Hive through a client tool via Hive Server. The client tool 
tries to set a database context by running the USE [database] command when it 
makes a connection. However, when I create another session from the client and 
do not specify any database context (i.e., implicitly connecting to the DEFAULT 
database), I get an error when trying to query the tables. It appears as though 
the previous client connection's USE database command is still in effect.
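
This is consistent with the original Hive Server sharing session state across 
connections (it is not thread-safe, as noted elsewhere in this archive). A 
minimal, hedged workaround sketch (the table name is hypothetical) is to reset 
the database context explicitly on every new connection instead of relying on 
DEFAULT:

```sql
-- Hedged workaround: issue USE explicitly at the start of each connection
-- rather than assuming the DEFAULT database is in effect.
USE default;
SELECT * FROM my_table LIMIT 10;   -- my_table is a hypothetical table name
```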



Block Sampling Impact

2012-06-15 Thread Ladda, Anand
Hi
I was trying block sampling on a 6-million-row (~400 MB) table and can see that 
if I sample about 1 percent of the data I get about a 3x faster response on the 
queries (I can also see a difference in the data returned). The input format, 
though, is 'org.apache.hadoop.mapred.TextInputFormat' and not 
CombineHiveInputFormat as mentioned in the Block Sampling documentation. 
Question for the experts: is block sampling expected to work with other input 
formats as well?
Thanks
Anand


hive> desc formatted orderdetail2;
OK
# col_name              data_type       comment

order_id                int             None
item_id                 int             None
order_date              string          None
emp_id                  int             None
promotion_id            int             None
qty_sold                float           None
unit_price              float           None
unit_cost               float           None
discount                float           None
customer_id             int             None

# Detailed Table Information
Database:               default
Owner:                  hdfs
CreateTime:             Fri Jun 15 16:51:44 EDT 2012
LastAccessTime:         UNKNOWN
Protect Mode:           None
Retention:              0
Location:               --
Table Type:             MANAGED_TABLE
Table Parameters:
        transient_lastDdlTime   1339793622

# Storage Information
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:            org.apache.hadoop.mapred.TextInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed:             No
Num Buckets:            -1
Bucket Columns:         []
Sort Columns:           []
Storage Desc Params:
        serialization.format    1
Time taken: 0.124 seconds
hive>


RE: Block Sampling

2012-06-15 Thread Ladda, Anand
Thanks Carl. Could you give me edit rights to the wiki 
(ala...@microstrategy.com) so I can update the sampling page with this info?

From: Carl Steinbach [mailto:c...@cloudera.com]
Sent: Friday, June 15, 2012 3:20 PM
To: user@hive.apache.org
Subject: Re: Block Sampling

Hi Anand,

This feature was implemented in HIVE-2121 and appeared in Hive 0.8.0.

Ref: https://issues.apache.org/jira/browse/HIVE-2121

Thanks.

Carl
On Fri, Jun 15, 2012 at 11:59 AM, Ladda, Anand 
<lan...@microstrategy.com> wrote:
Has the block sampling feature been added to one of the latest releases (Hive 
0.8 or Hive 0.9)? The wiki has the blurb below on block sampling:
Block Sampling
It is a feature that is still on trunk and is not yet in any release version.
block_sample: TABLESAMPLE (n PERCENT)
This allows Hive to pick up at least n% of the data size (note that this doesn't 
necessarily mean the number of rows) as input. Only CombineHiveInputFormat is 
supported, and some special compression formats are not handled. If sampling 
fails, the input of the MapReduce job will be the whole table/partition. Sampling 
is done at the HDFS block level, so the sampling granularity is the block size. 
For example, if the block size is 256MB, even if n% of the input size is only 
100MB, you get 256MB of data.
In the following example, 0.1% or more of the input size will be used for the query.
SELECT *
FROM source TABLESAMPLE(0.1 PERCENT) s;
To sample the same data with different blocks, change the seed number:
set hive.sample.seednumber=<integer>;




Block Sampling

2012-06-15 Thread Ladda, Anand
Has the block sampling feature been added to one of the latest releases (Hive 
0.8 or Hive 0.9)? The wiki has the blurb below on block sampling:
Block Sampling
It is a feature that is still on trunk and is not yet in any release version.
block_sample: TABLESAMPLE (n PERCENT)
This allows Hive to pick up at least n% of the data size (note that this doesn't 
necessarily mean the number of rows) as input. Only CombineHiveInputFormat is 
supported, and some special compression formats are not handled. If sampling 
fails, the input of the MapReduce job will be the whole table/partition. Sampling 
is done at the HDFS block level, so the sampling granularity is the block size. 
For example, if the block size is 256MB, even if n% of the input size is only 
100MB, you get 256MB of data.
In the following example, 0.1% or more of the input size will be used for the query.
SELECT *
FROM source TABLESAMPLE(0.1 PERCENT) s;
To sample the same data with different blocks, change the seed number:
set hive.sample.seednumber=<integer>;
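
A hedged sketch putting the quoted wiki text together (the table name is taken 
from the example above; the input-format and seed settings are assumptions to 
verify against your Hive version):

```sql
-- Block sampling requires CombineHiveInputFormat, per the wiki text above.
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
set hive.sample.seednumber=3;   -- change this to sample different blocks
SELECT count(1) FROM source TABLESAMPLE(1 PERCENT) s;
```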



RE: Filtering on TIMESTAMP data type

2012-06-04 Thread Ladda, Anand
Can anyone help out with the TIMESTAMP literals piece? So far, I've gotten

Select day_timestamp from lu_day where day_timestamp > 
to_utc_timestamp('2012-06-04 00:00:00', 'GMT') to work OK and give me back 
timestamps greater than the one in the literal. Is this the best function to 
get this to work, or is there something else I should be using?
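
A hedged alternative sketch: a plain CAST should also produce a TIMESTAMP value 
for the comparison (assuming Hive 0.8+ timestamp support; verify that the 
timezone semantics match to_utc_timestamp for your data):

```sql
SELECT day_timestamp
FROM lu_day
WHERE day_timestamp > cast('2012-06-04 00:00:00' as timestamp);
```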

From: Ladda, Anand
Sent: Monday, May 28, 2012 11:00 AM
To: user@hive.apache.org
Subject: RE: FW: Filtering on TIMESTAMP data type

Debarshi
Didn't quite follow your first comment. I get the write-your-own UDF part but 
was wondering how others have been transitioning from STRING dates to TIMESTAMP 
dates and getting filtering, partition pruning, etc to work with constants
-Anand

From: Debarshi Basak [mailto:debarshi.ba...@tcs.com]
Sent: Saturday, May 26, 2012 11:54 AM
To: user@hive.apache.org
Subject: Re: FW: Filtering on TIMESTAMP data type

I guess it exists, gotta check.
BTW... you can always go and write a UDF.


Debarshi Basak
Tata Consultancy Services
Mailto: debarshi.ba...@tcs.com
Website: http://www.tcs.com

Experience certainty. IT Services
Business Solutions
Outsourcing
________

-----"Ladda, Anand" wrote: -----
To: "user@hive.apache.org", "d...@hive.apache.org"
From: "Ladda, Anand" <lan...@microstrategy.com>
Date: 05/26/2012 06:58PM
Subject: FW: Filtering on TIMESTAMP data type
How do I set up a filter constant for the TIMESTAMP data type? In Hive 0.7, 
since timestamps were represented as strings, a query like this would return data:

select * from LU_day where day_date ='2010-01-01 00:00:00';

But now, with day_date as a TIMESTAMP column, it doesn't. Is there some type of a 
TO_TIMESTAMP function in Hive to convert the string constant into a TIMESTAMP 
one?

As a workaround I can do

select * from LU_DAY where TO_DATE(day_date) = '2010-01-01', but that would be a 
problem for partition pruning, etc.



=-=-=
Notice: The information contained in this e-mail
message and/or attachments to it may contain
confidential or privileged information. If you are
not the intended recipient, any dissemination, use,
review, distribution, printing or copying of the
information contained in this e-mail message
and/or attachments to it are strictly prohibited. If
you have received this communication in error,
please notify us by reply e-mail or telephone and
immediately and permanently delete the message
and any attachments. Thank you


RE: Front end visualization tool with Hive (when using as a warehouse)

2012-06-04 Thread Ladda, Anand
I agree with Bejoy's assessment: Hive is good for processing large volumes of 
data in a batch manner, but for real-time or any complex SQL-based analysis you 
would typically want some type of RDBMS in the mix along with Hadoop/Hive. In 
terms of what's missing in Hive today: on the query side, Hive doesn't yet 
support all flavors of subqueries (correlated subqueries, to be specific; there 
are potential workarounds for the non-correlated ones), and it doesn't support 
inserting data from a stream, i.e., INSERT INTO TABLE ... VALUES (...) type 
syntax. Hive's query optimizer is mostly rule-based at this time, although 
there's a push to move towards a cost-based one. On the administration side, 
there's no workload management/job prioritization scheme like a typical RDBMS 
has, Hive Server isn't thread-safe, and it doesn't yet have any kind of 
security/authentication scheme.
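
For the non-correlated subquery workarounds mentioned above, the usual pattern 
is a LEFT SEMI JOIN, which Hive does support. A hedged sketch with hypothetical 
table and column names:

```sql
-- An IN-subquery rewritten as LEFT SEMI JOIN; equivalent in spirit to:
--   SELECT ... FROM orders
--   WHERE customer_id IN (SELECT customer_id FROM big_customers)
SELECT o.order_id, o.qty_sold
FROM orders o
LEFT SEMI JOIN big_customers c ON (o.customer_id = c.customer_id);
```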



From: Bejoy Ks [mailto:bejoy...@yahoo.com]
Sent: Monday, June 04, 2012 7:20 AM
To: user@hive.apache.org
Subject: Re: Front end visualization tool with Hive (when using as a warehouse)

Hi Sreenath

First of all, don't treat Hive like an RDBMS system while designing your 
solution. It is an awesome tool when it comes to processing huge volumes of 
data in non-real-time mode. If any of your use cases involves 'updates' on 
rows, that is not supported in Hive, and it is pretty expensive to work around 
updates as well. (You can implement them by overwriting at the most granular 
partition level, but it is still expensive.)
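
The partition-overwrite workaround described above can be sketched as follows 
(table, column, and partition names are hypothetical):

```sql
-- "Update" rows for one day by rewriting the whole partition from a
-- staging table that holds the corrected data.
INSERT OVERWRITE TABLE orders PARTITION (order_date='2012-01-01')
SELECT order_id, customer_id, qty_sold
FROM staged_orders
WHERE order_date = '2012-01-01';
```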

By the way I'm not a DWH guy, may be others can add on their experience over 
these.

Regards
Bejoy KS


From: Sreenath Menon <sreenathmen...@gmail.com>
To: user@hive.apache.org; Bejoy Ks <bejoy...@yahoo.com>
Sent: Monday, June 4, 2012 4:25 PM
Subject: Re: Front end visualization tool with Hive (when using as a warehouse)


Hi Bejoy

I am not looking for just a UI for queries (even though at first, when working 
on twitter data, that was my interest). Now I am planning on using Hive as a 
warehouse with a front-end in-memory processing engine; MicroStrategy or 
Tableau would be a good choice.

Further refining the problem: what is the warehousing power of Hive compared to 
a traditional warehouse? Can Hive perform all operations performed/required in 
a warehouse? If not, where are the shortcomings I need to deal with?

Always thankful for your apt assistance.
On Mon, Jun 4, 2012 at 3:49 PM, Bejoy Ks 
<bejoy...@yahoo.com> wrote:
Hi Sreenath

 If you are looking for a UI for queries, then Cloudera's Hue is the best 
choice. There are also ODBC connectors that integrate BI tools like 
MicroStrategy, Tableau, etc. with Hive.

Regards
Bejoy KS


From: Sreenath Menon <sreenathmen...@gmail.com>
To: user@hive.apache.org
Sent: Monday, June 4, 2012 2:42 PM
Subject: Front end visualization tool with Hive (when using as a warehouse)

Hi all

I am new to Hive and am working on analysis of twitter data with Hive and 
Hadoop on a 27-node cluster.
At present I am using Microsoft PowerPivot as the visualization tool for visual 
representation of the analysis done using Hive. I have gotten some really good 
results and am stunned by the scalability of the Hadoop system.

As a next step, I would like to evaluate the warehousing capabilities of Hive 
for business data.
Any insights into this are welcome. I am also facing the problem of delegating 
work between Hive and PowerPivot, as PowerPivot itself has capabilities of 
being a warehouse tool. Any other good visualization tools for use with Hive 
are also welcome.

For analyzing twitter data, I just ran complex Hive queries for each piece of 
analysis. But for a warehouse, this does not sound like a good solution.

Any help is greatly appreciated.

Thanks.





Edit Rights to Hive Wiki

2012-05-29 Thread Ladda, Anand
Can someone grant me edit rights to the Hive Wiki?
Thanks
Anand


RE: FW: Filtering on TIMESTAMP data type

2012-05-28 Thread Ladda, Anand
Debarshi
Didn't quite follow your first comment. I get the write-your-own UDF part but 
was wondering how others have been transitioning from STRING dates to TIMESTAMP 
dates and getting filtering, partition pruning, etc to work with constants
-Anand

From: Debarshi Basak [mailto:debarshi.ba...@tcs.com]
Sent: Saturday, May 26, 2012 11:54 AM
To: user@hive.apache.org
Subject: Re: FW: Filtering on TIMESTAMP data type

I guess it exists, gotta check.
BTW... you can always go and write a UDF.


Debarshi Basak
Tata Consultancy Services
Mailto: debarshi.ba...@tcs.com
Website: http://www.tcs.com

Experience certainty. IT Services
Business Solutions
Outsourcing
____

-----"Ladda, Anand" wrote: -----
To: "user@hive.apache.org", "d...@hive.apache.org"
From: "Ladda, Anand" <lan...@microstrategy.com>
Date: 05/26/2012 06:58PM
Subject: FW: Filtering on TIMESTAMP data type
How do I set up a filter constant for the TIMESTAMP data type? In Hive 0.7, 
since timestamps were represented as strings, a query like this would return data:

select * from LU_day where day_date ='2010-01-01 00:00:00';

But now, with day_date as a TIMESTAMP column, it doesn't. Is there some type of a 
TO_TIMESTAMP function in Hive to convert the string constant into a TIMESTAMP 
one?

As a workaround I can do

select * from LU_DAY where TO_DATE(day_date) = '2010-01-01', but that would be a 
problem for partition pruning, etc.





FW: Filtering on TIMESTAMP data type

2012-05-26 Thread Ladda, Anand
How do I set up a filter constant for the TIMESTAMP data type? In Hive 0.7, 
since timestamps were represented as strings, a query like this would return data:

select * from LU_day where day_date ='2010-01-01 00:00:00';

But now, with day_date as a TIMESTAMP column, it doesn't. Is there some type of a 
TO_TIMESTAMP function in Hive to convert the string constant into a TIMESTAMP 
one?

As a workaround I can do

select * from LU_DAY where TO_DATE(day_date) = '2010-01-01', but that would be a 
problem for partition pruning, etc.




RE: [Marketing Mail] Doubts related to Amazon EMR

2012-04-23 Thread Ladda, Anand
Once you have a Hive job flow running on Amazon EMR, you'll have access to the 
file system on the underlying EC2 machines (you'll get the machine names, etc. 
once the cluster is running). You can then move your data files onto the EC2 
machines' file system and load them into HDFS/Hive. I am not sure about the 
Sqoop part.

From: Bhavesh Shah [mailto:bhavesh25s...@gmail.com]
Sent: Monday, April 23, 2012 8:42 AM
To: user@hive.apache.org; d...@hive.apache.org
Subject: [Marketing Mail] Doubts related to Amazon EMR


Hello all,
I want to deploy my task on Amazon EMR. But as I am new to Amazon Web Services 
I am confused in understanding the concepts.

My Use Case:

I want to import a large amount of data from EC2 into Hive through Sqoop. The 
imported data will be processed in Hive by applying some algorithm and will 
generate some result (in table form, in Hive only). The generated result will 
then be exported back to EC2, again through Sqoop only.

I am new to Amazon Web Services and want to implement this use case with the 
help of AWS EMR. I have implemented it on local machine.

I have read some links related to AWS EMR for launching the instance and about 
what is EMR, How it works and etc...

I have some doubts about EMR like:

1) EMR uses S3 buckets, which hold input and output data for Hadoop processing 
(in the form of objects). ---> I didn't get how to store the data in the form 
of objects on S3 (my data will be files).

2) As already said, I have implemented a task for my use case in Java. So if I 
create a JAR of my program and create the job flow with a Custom JAR, will it 
be possible to implement it like this, or do I need to do something extra?

3) As I said in my use case, I want to export my result back to EC2 with the 
help of Sqoop. Does EMR have support for Sqoop?



If you have any kind of idea related to AWS, please reply with your answer as 
soon as possible. I want to do this as early as possible.

Many thanks.




--
Regards,
Bhavesh Shah



Row Group Size of RCFile

2012-04-18 Thread Ladda, Anand
How do I set the row group size of RCFile in Hive?

CREATE TABLE OrderFactPartClustRcFile(
  order_id INT,
  emp_id INT,
  order_amt FLOAT,
  order_cost FLOAT,
  qty_sold FLOAT,
  freight FLOAT,
  gross_dollar_sales FLOAT,
  ship_date STRING,
  rush_order STRING,
  customer_id INT,
  pymt_type INT,
  shipper_id INT
  )
PARTITIONED BY (order_date STRING)
CLUSTERED BY (order_id) SORTED BY (order_id) INTO 256 BUCKETS
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'
STORED as RCFILE;
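
To my knowledge the row-group size is not part of the DDL at all. A hedged 
sketch, assuming the record-buffer setting used by the RCFile writer (verify 
the parameter name and default against your Hive version before relying on it):

```sql
-- The RCFile writer flushes a row group when its in-memory record buffer
-- fills, so this buffer size effectively caps the row-group size
-- (the default is believed to be 4 MB).
set hive.io.rcfile.record.buffer.size=8388608;   -- ~8 MB row groups (assumption)
```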


RE: Hive Queries Performance Tuning - Map side joins, Map side aggregations, Partitioning/Clustering

2012-04-03 Thread Ladda, Anand
Thanks Bejoy and Nitin. I've read through the join presentation by Namit Jain 
and Liyin Tang from Facebook and got some ideas on how to improve the join 
performance.

* I understand how map joins work but wasn't clear on the workflow of 
bucketed map joins. Is having map joins enabled a prerequisite for bucketed map 
joins, i.e., do I need to set both hive.auto.convert.join=true and 
hive.optimize.bucketmapjoin=true in order for bucketed map joins to work? 
From what I understand, bucketed map joins are meant for the scenario where 
neither table in the join is "small enough" to be map-join enabled. In that 
case, if you have the tables bucketed on the same columns (and the bucket 
counts are multiples of each other), then you can use the bucket map join 
technique to help improve performance. Is this accurate?

* Also, you mention improving the performance of "group by" queries. Are you 
referring to the use of map-side aggregation? Any resources you can point me to 
where I can study this further?
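
A hedged sketch of the bucketed-map-join settings asked about in the first 
bullet above (table and column names are hypothetical; both tables must be 
bucketed on the join key, with bucket counts that are multiples of each other):

```sql
set hive.optimize.bucketmapjoin = true;
-- With the map-join hint (or hive.auto.convert.join=true), the join runs
-- map-side, loading one bucket of b per bucket of a:
SELECT /*+ MAPJOIN(b) */ a.order_id, b.item_id
FROM orders a JOIN items b ON (a.item_id = b.item_id);
```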
Thanks
Anand


From: Bejoy Ks [mailto:bejoy...@yahoo.com]
Sent: Sunday, April 01, 2012 5:35 PM
To: user@hive.apache.org
Subject: Re: Hive Queries Performance Tuning - Map side joins, Map side 
aggregations, Partitioning/Clustering

Anand
 You can optimize pretty much all Hive queries; based on your queries, you 
need to choose the optimizations. For example, GROUP BY has some specific ways 
to be optimized, and sometimes DISTRIBUTE BY comes in handy for optimizing 
certain queries. Skew joins are good for balancing the reducer loads, etc.
 Map joins are used if one of the tables involved in the join is small. 
For medium-sized bucketed tables you can go for a bucketed map join (with some 
conditions on the number of buckets and on bucketed columns vs. join columns).

Regards
Bejoy KS

________
From: "Ladda, Anand" <lan...@microstrategy.com>
To: "user@hive.apache.org"
Sent: Sunday, April 1, 2012 11:59 PM
Subject: Hive Queries Performance Tuning - Map side joins, Map side 
aggregations, Partitioning/Clustering


I am trying to understand some of the options/settings available to tune the 
performance of Hive queries. I have seen the benefits of map-side joins and 
partitioning/clustering. However, I have yet to see the impact map-side 
aggregation has on query performance. I tried running this query with and 
without map-side aggregation turned on and did not see much difference in the 
execution times. The raw data in this partition is about 5.5 million rows. 
Looking for some pointers on what type of queries benefit from map-side 
aggregation.


set hive.auto.convert.join=false;


set hive.map.aggr=false;

Non-partitioned, non-clustered single table with a where clause on date and no 
map-side aggregation:

select a11.emp_id, count(1), count (distinct a11.customer_id), 
sum(a11.qty_sold) from orderdetailrcfile a11 where order_date ='01-01-2008' 
group by a11.emp_id;

400 secs


set hive.map.aggr=true;

Non-partitioned, non-clustered single table with a where clause on date and 
map-side aggregation:

select a11.emp_id, count(1), count (distinct a11.customer_id), 
sum(a11.qty_sold) from orderdetailrcfile a11 where order_date ='01-01-2008' 
group by a11.emp_id;

390 secs


Also, is there any reason not to turn on map-side joins all the time? In my 
tests I have always seen the performance either stay the same or improve with 
map-side joins turned on. Are there any other parameters or Hive features that 
can help improve the performance of Hive queries?
Thanks
Anand




RE: Hive Meta Information

2012-04-03 Thread Ladda, Anand
Thanks Thiruvel and Ed. Might be a dumb question, but how exactly do I get to 
the Hive metastore audit logs? I looked through the JIRA but there isn't a 
mention of this. Also, is there some variable/parameter I need to set/enable 
for the audit logging to be turned on?

-Original Message-
From: Thiruvel Thirumoolan [mailto:thiru...@yahoo-inc.com] 
Sent: Monday, April 02, 2012 5:47 AM
To: user@hive.apache.org
Subject: Re: Hive Meta Information

> 1.   recent users of table,
> 2.   top users of table,

Hive metastore has audit support (HIVE-1948). While what Edward suggests will 
be accurate, audit log might give you a superset of that, it will also include 
"desc table" requests.

Thiruvel



On 3/31/12 8:39 PM, "Edward Capriolo"  wrote:

>hive does not capture this information.
>
>I have a tool on github that connects to the JobTracker and pulls stat 
>information.
>
>https://github.com/edwardcapriolo/hadoop_cluster_profiler
>
>It would be pretty easy to add code to record some of information you 
>are looking for.
>
>Hive is close to #4 with a statistics DB used for indexing/query planning.
>
>Edward
>
>On Sat, Mar 31, 2012 at 2:50 AM, Nitin Pawar wrote:
>> Anand,
>>
>> I doubt this information is readily available in Hive, as this is not 
>>meta information but rather access information.
>>
>> For number of records in a table you can just run a query like select
>> count(1) from table;
>>
>>
>> For the access details on table data, you will need to process Hadoop 
>>logs, and based on that you can figure out who accessed the data and 
>>how.
>>
>> Thanks,
>> Nitin
>>
>> On Sat, Mar 31, 2012 at 3:36 AM, Ladda, Anand 
>> 
>> wrote:
>>>
>>> How do I get the following meta information about a table
>>>
>>> 1.   recent users of table,
>>>
>>> 2.   top users of table,
>>>
>>> 3.   recent queries/jobs/reports,
>>>
>>> 4.   number of rows in a table
>>>
>>>
>>>
>>> I don't see anything either in DESCRIBE FORMATTED or SHOW TABLE 
>>>EXTENDED LIKE commands.
>>>
>>> Thanks
>>>
>>> Anand
>>
>>
>>
>>
>> --
>> Nitin Pawar
>>



Table Statistics In Hive

2012-04-02 Thread Ladda, Anand
I've tried to collect statistics on an existing table in Hive using the 
commands mentioned in this wiki page - 
https://cwiki.apache.org/confluence/display/Hive/StatsDev
ANALYZE TABLE [TABLENAME] PARTITION(partcol1=..., partcol2=) COMPUTE 
STATISTICS
But when I do a DESCRIBE EXTENDED [TABLENAME] after the stats collection has 
completed, I see the number of rows is still 0. Is there anything I am 
missing here? I am using Hive 0.7.1.
Thanks
Anand
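
One possible explanation, offered as a hedged sketch (table and partition names 
are hypothetical): for a partitioned table, ANALYZE stores row counts in each 
partition's parameters, so a table-level DESCRIBE EXTENDED can still show 0; 
checking the analyzed partition itself may reveal the stats:

```sql
ANALYZE TABLE orderdetail PARTITION (order_date='2008-01-01') COMPUTE STATISTICS;
-- Look for numRows in the partition parameters rather than in the
-- table-level output:
DESCRIBE EXTENDED orderdetail PARTITION (order_date='2008-01-01');
```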


Hive Queries Performance Tuning - Map side joins, Map side aggregations, Partitioning/Clustering

2012-04-01 Thread Ladda, Anand
I am trying to understand some of the options/settings available to tune the 
performance of Hive queries. I have seen the benefits of map-side joins and 
partitioning/clustering. However, I have yet to see the impact map-side 
aggregation has on query performance. I tried running this query with and 
without map-side aggregation turned on and did not see much difference in the 
execution times. The raw data in this partition is about 5.5 million rows. 
Looking for some pointers on what type of queries benefit from map-side 
aggregation.


set hive.auto.convert.join=false;


set hive.map.aggr=false;

Non-partitioned, non-clustered single table with a where clause on date and no 
map-side aggregation:

select a11.emp_id, count(1), count (distinct a11.customer_id), 
sum(a11.qty_sold) from orderdetailrcfile a11 where order_date ='01-01-2008' 
group by a11.emp_id;

400 secs


set hive.map.aggr=true;

Non-partitioned, non-clustered single table with a where clause on date and 
map-side aggregation:

select a11.emp_id, count(1), count (distinct a11.customer_id), 
sum(a11.qty_sold) from orderdetailrcfile a11 where order_date ='01-01-2008' 
group by a11.emp_id;

390 secs


Also, is there any reason not to turn on map-side joins all the time? In my 
tests I have always seen the performance either stay the same or improve with 
map-side joins turned on. Are there any other parameters or Hive features that 
can help improve the performance of Hive queries?
Thanks
Anand



Hive Meta Information

2012-03-30 Thread Ladda, Anand
How do I get the following meta information about a table

1.   recent users of table,

2.   top users of table,

3.   recent queries/jobs/reports,

4.   number of rows in a table


I don't see anything either in DESCRIBE FORMATTED or SHOW TABLE EXTENDED LIKE 
commands.
Thanks
Anand