Has the block sampling feature been added to one of the latest releases (Hive
0.8 or Hive 0.9)? The wiki has the blurb below on block sampling:
Block Sampling
It is a feature that is still on trunk and is not yet in any release version.
block_sample: TABLESAMPLE (n PERCENT)
This will allow Hive to
Hi Anand,
This feature was implemented in HIVE-2121 and appeared in Hive 0.8.0.
Ref: https://issues.apache.org/jira/browse/HIVE-2121
Thanks.
Carl
On Fri, Jun 15, 2012 at 11:59 AM, Ladda, Anand
lan...@microstrategy.com wrote:
Has the block sampling feature been
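For reference, a minimal sketch of the block sampling syntax discussed in this thread. The table name page_views is hypothetical and used only for illustration. The percentage refers to input data size rather than row count, so the number of rows returned can vary. The wiki also notes that block sampling relies on CombineHiveInputFormat, which is why the sketch sets it explicitly; whether that is already the default depends on the installation.

```sql
-- Assumption: page_views is a hypothetical table name.
-- Block sampling works at the input-split level, so Hive's input format
-- (not the table's storage format) needs to be CombineHiveInputFormat.
SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

-- Read roughly 1 percent of the table's input data.
SELECT *
FROM page_views TABLESAMPLE (1 PERCENT) s;
```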
Hi
I was trying block sampling on a 6-million-row table (~400 MB) and can see that
if I sample about 1 percent of the data, I get about 3x faster response on the
queries (I can also see a difference in the data returned). The input format
though is 'org.apache.hadoop.mapred.TextInputFormat' and not
I agree with Bejoy's assessment: Hive is good for processing large volumes of
data in a batch manner. But for real-time or any complex SQL-based analysis you
would typically want to have some type of RDBMS in the mix along with
Hadoop/Hive. In terms of what's missing in Hive today - On the
this to work or is there something else I should be using
From: Ladda, Anand
Sent: Monday, May 28, 2012 11:00 AM
To: user@hive.apache.org
Subject: RE: FW: Filtering on TIMESTAMP data type
Debarshi
Didn't quite follow your first comment. I get the write-your-own UDF part but
was wondering how others have
Can someone grant me edit rights to the Hive Wiki?
Thanks
Anand
-Ladda, Anand wrote: -
To: user@hive.apache.org
How do I set up a filter constant for the TIMESTAMP data type? In Hive 0.7,
since timestamps were represented as strings, a query like this would return data:
select * from LU_day where day_date ='2010-01-01 00:00:00';
But now with day_date as a TIMESTAMP column it doesn't. Is there some type of a
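One approach that may work here is to make the conversion explicit rather than relying on the string literal being interpreted as a timestamp. This is a sketch only, assuming the Hive version in use supports casting a string to TIMESTAMP; LU_day and day_date are taken from the query above.

```sql
-- Sketch: convert the string constant to TIMESTAMP explicitly before comparing.
-- Assumes CAST(... AS TIMESTAMP) is available in the Hive version in use.
SELECT *
FROM LU_day
WHERE day_date = CAST('2010-01-01 00:00:00' AS TIMESTAMP);
```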
Once you have a Hive job flow running on Amazon EMR, you'll have access to the
file system on the underlying EC2 machines (you'll get the machine name, etc.,
once the cluster is running). You can then move your data files onto the EC2
machine's file system and load them into HDFS/Hive. I am not sure
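As a rough illustration of the load step, a file sitting on the node's local file system can be loaded into an existing Hive table with LOAD DATA LOCAL INPATH. The path and table name below are hypothetical, and the file layout is assumed to match the table's declared row format.

```sql
-- Assumptions: /home/hadoop/orders.txt and the orders table are hypothetical.
-- LOCAL tells Hive to read from the machine's file system instead of HDFS.
LOAD DATA LOCAL INPATH '/home/hadoop/orders.txt'
INTO TABLE orders;
```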
How do I set the Row Group Size of RCFile in Hive?
CREATE TABLE OrderFactPartClustRcFile(
order_id INT,
emp_id INT,
order_amt FLOAT,
order_cost FLOAT,
qty_sold FLOAT,
freight FLOAT,
gross_dollar_sales FLOAT,
ship_date STRING,
rush_order STRING,
customer_id INT,
pymt_type INT,
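On the row group size question, one setting that appears relevant is hive.io.rcfile.record.buffer.size, which controls how much data the RCFile writer buffers before flushing a row group. The sketch below assumes it is set in the same session that writes the data into the RCFile table; the 8 MB value is an arbitrary illustration, not a recommendation.

```sql
-- Assumption: 8388608 (8 MB) is an arbitrary example value.
-- Set this before the statement that writes data into the RCFile table,
-- since the value is read by the writer when the files are produced.
SET hive.io.rcfile.record.buffer.size=8388608;
```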
to process hadoop
logs and based on that you can figure out who accessed the data and
how
Thanks,
Nitin
On Sat, Mar 31, 2012 at 3:36 AM, Ladda, Anand
lan...@microstrategy.com
wrote:
How do I get the following meta information about a table
1. recent users of table,
2. top
Bejoy KS
From: Ladda, Anand lan...@microstrategy.com
To: user@hive.apache.org
Sent: Sunday, April 1, 2012 11:59 PM
Subject: Hive Queries Performance Tuning - Map
I've tried to collect statistics on an existing table in hive using the
commands mentioned in this wiki page -
https://cwiki.apache.org/confluence/display/Hive/StatsDev
ANALYZE TABLE [TABLENAME] PARTITION(partcol1=..., partcol2=...) COMPUTE STATISTICS
But when I do a DESCRIBE EXTENDED
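A concrete instance of the two steps might look like the following sketch. The table sales and its partition column ds are hypothetical. If statistics collection succeeded, values such as numRows would be expected among the parameters in the DESCRIBE EXTENDED output for that partition.

```sql
-- Assumptions: sales is a hypothetical table partitioned by a hypothetical column ds.
ANALYZE TABLE sales PARTITION (ds='2012-04-01') COMPUTE STATISTICS;

-- Inspect the partition's metadata; gathered statistics are stored as parameters.
DESCRIBE EXTENDED sales PARTITION (ds='2012-04-01');
```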
I am trying to understand what options/settings are available to tune the
performance of Hive queries. I have seen the benefits of map-side joins and
partitioning/clustering. However, I have yet to realize the impact map-side
aggregation has on query performance. I tried running this
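A simple way to compare the two behaviours is to run the same GROUP BY query with map-side aggregation switched on and then off via hive.map.aggr. The table and column names below are hypothetical; the effect generally depends on how much the grouping reduces the data within each mapper, so little or no difference is possible when the grouping key has very many distinct values.

```sql
-- Assumptions: order_fact, customer_id and order_amt are hypothetical names.

-- Run 1: partial aggregation happens in the mappers before the shuffle.
SET hive.map.aggr=true;
SELECT customer_id, SUM(order_amt) AS total_amt
FROM order_fact
GROUP BY customer_id;

-- Run 2: all aggregation is deferred to the reducers.
SET hive.map.aggr=false;
SELECT customer_id, SUM(order_amt) AS total_amt
FROM order_fact
GROUP BY customer_id;
```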
How do I get the following meta information about a table
1. recent users of table,
2. top users of table,
3. recent queries/jobs/reports,
4. number of rows in a table
I don't see anything in either the DESCRIBE FORMATTED or the SHOW TABLE EXTENDED
LIKE commands.
Thanks