Re: Index not getting used for the queries

2014-02-04 Thread Thilina Gunarathne
Thanks Peter. It helped. That property, combined with setting
'hive.optimize.index.filter' to 'true', got the index working.
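
For anyone else who runs into this, here is a minimal sketch of the session settings that made the difference (the value 0 is only an illustration; any threshold below the estimated input size of the query should do, and the table/column names are the ones from the query in the thread below):

set hive.optimize.index.filter=true;
-- default is roughly 5 GB; lower it so smaller queries still use the index
set hive.optimize.index.filter.compact.minsize=0;
select a, b from tm_top50_p where a=113231 and month=201308;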

thanks,
Thilina


On Mon, Feb 3, 2014 at 6:12 PM, Peter Marron <
peter.mar...@trilliumsoftware.com> wrote:

>  Hi,
>
>
>
> Not sure if it is relevant to your problem but I'm just checking
>
> that you know about
>
> hive.optimize.index.filter.compact.minsize.
>
> It's set to 5 GB by default, and if the estimated query size is
>
> less than this, the index won't be used.
>
> HTH.
>
>
>
> Regards
>
>
>
> *Peter Marron*
>
> Senior Developer, Research & Development
>
>
>
> Office: +44 (0) 118-940-7609  peter.mar...@trilliumsoftware.com
>
> Theale Court First Floor, 11-13 High Street, Theale, RG7 5AH, UK
>
>
>
> www.trilliumsoftware.com
>
> Be Certain About Your Data. Be Trillium Certain.
>
>
>
> *From:* Thilina Gunarathne [mailto:cset...@gmail.com]
> *Sent:* 03 February 2014 16:08
> *To:* user
> *Subject:* Index not getting used for the queries
>
>
>
> Dear all,
>
> I created a compact index for a table with several hundred million records
> as follows. The table is partitioned by the month. The index on A and B was
> created successfully, but I can't see it getting used in the queries. It
> would be great if one of you experts could shed some light on what I am
> missing. I'm using Hive 0.9.
>
> set hive.exec.parallel=false;
> CREATE INDEX idx_
> ON TABLE (a,b)
> AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
> WITH DEFERRED REBUILD
> COMMENT 'Index for  table. Indexing on A and B';
> ALTER INDEX idx_ on  REBUILD;
>
>
> hive> describe ;
> OK
> a        bigint
> ...
> b        bigint
> 
>
> month int
>
>
>
> hive> show index on ;
> OK
> idx_    a, b
> default___p_idx___compact Index for tm top50
> table. Indexing on A and B
>
>
> hive> explain select a,b from tm_top50_p where a=113231 and
> month=201308;
> OK
> ABSTRACT SYNTAX TREE:
>   (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME ))) (TOK_INSERT
> (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR
> (TOK_TABLE_OR_COL a)) (TOK_SELEXPR (TOK_TABLE_OR_COL b))) (TOK_WHERE (and
> (= (TOK_TABLE_OR_COL a) 113231) (= (TOK_TABLE_OR_COL month) 201308)
>
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 is a root stage
>
> STAGE PLANS:
>   Stage: Stage-1
> Map Reduce
>   Alias -> Map Operator Tree:
> 
>   TableScan
> alias: 
> Filter Operator
>   predicate:
>   expr: (a = 113231)
>   type: boolean
>   Select Operator
> expressions:
>   expr: a
>   type: bigint
>   expr: b
>   type: bigint
> outputColumnNames: _col0, _col1
> File Output Operator
>   compressed: false
>   GlobalTableId: 0
>   table:
>   input format:
> org.apache.hadoop.mapred.TextInputFormat
>   output format:
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>
>   Stage: Stage-0
> Fetch Operator
>   limit: -1
>
>   thanks a lot,
> Thilina
>
>
> --
> https://www.cs.indiana.edu/~tgunarat/
> http://www.linkedin.com/in/thilina
>
> http://thilina.gunarathne.org
>



-- 
https://www.cs.indiana.edu/~tgunarat/
http://www.linkedin.com/in/thilina
http://thilina.gunarathne.org

Index not getting used for the queries

2014-02-03 Thread Thilina Gunarathne
Dear all,
I created a compact index for a table with several hundred million records
as follows. The table is partitioned by the month. The index on A and B was
created successfully, but I can't see it getting used in the queries. It
would be great if one of you experts could shed some light on what I am
missing. I'm using Hive 0.9.

set hive.exec.parallel=false;
CREATE INDEX idx_
ON TABLE (a,b)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD
COMMENT 'Index for  table. Indexing on A and B';
ALTER INDEX idx_ on  REBUILD;


hive> describe ;
OK
a        bigint
...
b        bigint

month int



hive> show index on ;
OK
idx_    a, b
default___p_idx___compact Index for tm top50
table. Indexing on A and B
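
(Aside: the compact index handler materializes the index into a regular Hive table, named roughly default__<table>_<index>__, so one way to sanity-check the rebuild is to query that table directly. The names below are placeholders, since the real names are elided above.)

hive> show tables 'default__*';
hive> select * from default__mytable_idx__ limit 5;   -- placeholder for the generated index table name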


hive> explain select a,b from tm_top50_p where a=113231 and
month=201308;
OK
ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME ))) (TOK_INSERT
(TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR
(TOK_TABLE_OR_COL a)) (TOK_SELEXPR (TOK_TABLE_OR_COL b))) (TOK_WHERE (and
(= (TOK_TABLE_OR_COL a) 113231) (= (TOK_TABLE_OR_COL month) 201308)

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
Map Reduce
  Alias -> Map Operator Tree:

  TableScan
alias: 
Filter Operator
  predicate:
  expr: (a = 113231)
  type: boolean
  Select Operator
expressions:
  expr: a
  type: bigint
  expr: b
  type: bigint
outputColumnNames: _col0, _col1
File Output Operator
  compressed: false
  GlobalTableId: 0
  table:
  input format: org.apache.hadoop.mapred.TextInputFormat
  output format:
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
Fetch Operator
  limit: -1


thanks a lot,
Thilina

-- 
https://www.cs.indiana.edu/~tgunarat/
http://www.linkedin.com/in/thilina
http://thilina.gunarathne.org


Re: Using Hive generated SeqeunceFiles and RC files with Java MapReduce and PIG

2014-01-28 Thread Thilina Gunarathne
Thanks for the information Edward.

> When you use the default SerDe (LazySerDe) with sequence files, Hive writes a
> SequenceFile (create table x stored as sequencefile) where the key is null
> and Hive serializes all the columns into a Text writable that is easy for
> other tools to read.
>
Does this mean that using the default SerDe would not give us many advantages
over using a TextFile, other than the splittability (and the compression
options that come with it) of SequenceFiles?
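
For reference, a rough sketch of the splittable-plus-compressed setup being discussed; the codec and table names here are placeholders for illustration, not our actual schema:

set hive.exec.compress.output=true;
set mapred.output.compression.type=BLOCK;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
CREATE TABLE target_seq (a BIGINT, b BIGINT) STORED AS SEQUENCEFILE;
INSERT OVERWRITE TABLE target_seq SELECT a, b FROM source_table;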

thanks,
Thilina


> Hive does not dictate the input format or the output format, usually you
> can get hive to produce exactly what you want by mixing and matching serde
> and output format options.
>
>
> On Tue, Jan 28, 2014 at 8:05 PM, Thilina Gunarathne wrote:
>
>> Hi,
>> We have a requirement to store a large data set (more than 5TB) mapped to
>> a Hive table. This Hive table would be populated (and appended
>> periodically) using a Hive query from another Hive table. In addition to
>> the Hive queries, we need to be able to run Java MapReduce and preferably
>> Pig jobs as well on top of this data.
>>
>> I'm wondering what would be the best storage format for this Hive table.
>> How easy is it to use Java MapReduce on Hive-generated sequence files (e.g.
>> stored as SequenceFile)? How easy is it to use Java MapReduce on RCFiles?
>> Any pointers to examples of these would be really great. Does using
>> compressed text files (deflate) sound like the best option for this use case?
>>
>> BTW we are stuck with Hive 0.9 for the foreseeable future, so ORC is out
>> of the question.
>>
>> thanks,
>> Thilina
>>
>> --
>> https://www.cs.indiana.edu/~tgunarat/
>> http://www.linkedin.com/in/thilina
>> http://thilina.gunarathne.org
>>
>
>


-- 
https://www.cs.indiana.edu/~tgunarat/
http://www.linkedin.com/in/thilina
http://thilina.gunarathne.org


Using Hive generated SeqeunceFiles and RC files with Java MapReduce and PIG

2014-01-28 Thread Thilina Gunarathne
Hi,
We have a requirement to store a large data set (more than 5TB) mapped to a
Hive table. This Hive table would be populated (and appended periodically)
using a Hive query from another Hive table. In addition to the Hive
queries, we need to be able to run Java MapReduce and preferably Pig jobs
as well on top of this data.

I'm wondering what would be the best storage format for this Hive table.
How easy is it to use Java MapReduce on Hive-generated sequence files (e.g.
stored as SequenceFile)? How easy is it to use Java MapReduce on RCFiles?
Any pointers to examples of these would be really great. Does using
compressed text files (deflate) sound like the best option for this use case?

BTW we are stuck with Hive 0.9 for the foreseeable future, so ORC is out of
the question.

thanks,
Thilina

-- 
https://www.cs.indiana.edu/~tgunarat/
http://www.linkedin.com/in/thilina
http://thilina.gunarathne.org


Re: RCFile vs SequenceFile vs text files

2014-01-27 Thread Thilina Gunarathne
Thanks Edward. I'm actually populating this table periodically from another
temporary table, and ORC sounds like a good fit. But unfortunately we are
stuck with Hive 0.9.

I wonder how easy/hard it is to use the data stored as RCFile or ORC with Java
MapReduce?

thanks,
Thilina


On Mon, Jan 27, 2014 at 3:09 PM, Edward Capriolo wrote:

> The thing about ORC is that it is great for tables created from other
> tables (like the other columnar formats), but if you are logging directly
> to HDFS, a columnar format is not easy (or even possible) to write directly.
> Normally people store data in a very direct row-oriented form and then
> their first MapReduce job buckets/partitions/columnar-izes it.
>
>
> On Mon, Jan 27, 2014 at 2:44 PM, Thilina Gunarathne wrote:
>
>> Thanks Eric and Sharath for the pointers to ORC. Unfortunately ORC would
>> not be an option for us as our cluster still runs Hive 0.9 and we won't be
>> migrating any time soon.
>>
>> thanks,
>> Thilina
>>
>>
>> On Mon, Jan 27, 2014 at 2:35 PM, Sharath Punreddy 
>> wrote:
>>
>>> Quick insights:
>>>
>>>
>>> http://hortonworks.com/blog/orcfile-in-hdp-2-better-compression-better-performance/
>>>
>>>
>>>
>>>
>>> On Mon, Jan 27, 2014 at 1:29 PM, Eric Hanson (BIG DATA) <
>>> eric.n.han...@microsoft.com> wrote:
>>>
>>>>  It sounds like ORC would be best.
>>>>
>>>>
>>>>
>>>> -Eric
>>>>
>>>>
>>>>
>>>> *From:* Thilina Gunarathne [mailto:cset...@gmail.com]
>>>> *Sent:* Monday, January 27, 2014 11:05 AM
>>>> *To:* user@hive.apache.org
>>>> *Subject:* RCFile vs SequenceFile vs text files
>>>>
>>>>
>>>>
>>>> Dear all,
>>>>
>>>> We are trying to pick the right data storage format for the Hive table
>>>> with the following requirements and would really appreciate any insights you
>>>> can provide to help with our decision.
>>>>
>>>> 1. ~50 billion records per month, ~14 columns per record, and each record
>>>> is ~100 bytes. The table is partitioned by date and gets populated
>>>> periodically from another Hive query.
>>>>
>>>> 2. The columns are dense, so I'm not sure whether we'll get any space
>>>> savings by using RCFiles.
>>>>
>>>> 3. Data needs to be compressed.
>>>>
>>>> 4. We will be doing a lot of aggregation queries for selected columns.
>>>> There will be ad-hoc queries for whole records as well.
>>>>
>>>> 5. We need the ability to run Java MapReduce programs on the underlying
>>>> data. We have existing programs which use custom InputFormats with
>>>> compressed text files as input, and we are willing to port them to use other
>>>> formats. (How easy is it to use Java MapReduce with RCFiles vs SequenceFiles?)
>>>>
>>>> 6. Ability to use Hive indexing.
>>>>
>>>> thanks a ton in advance,
>>>>
>>>> Thilina
>>>>
>>>>
>>>>
>>>> --
>>>> https://www.cs.indiana.edu/~tgunarat/
>>>> http://www.linkedin.com/in/thilina
>>>>
>>>> http://thilina.gunarathne.org
>>>>
>>>
>>>
>>>
>>> --
>>> Thank you
>>>
>>> Sharath Punreddy
>>> 1201 Golden gate Dr,
>>> Southlake TX 76092.
>>> Phone:626-470-7867
>>>
>>
>>
>>
>> --
>> https://www.cs.indiana.edu/~tgunarat/
>> http://www.linkedin.com/in/thilina
>> http://thilina.gunarathne.org
>>
>
>
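
As a side note for the archive, here is a minimal sketch of the staging pattern Edward describes above: land the data in a row-oriented staging table, then let a Hive insert columnar-ize it into RCFile. The schemas, names, and codec below are placeholders for illustration only:

CREATE TABLE logs_staging (a BIGINT, b BIGINT, msg STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

CREATE TABLE logs_rc (a BIGINT, b BIGINT, msg STRING)
PARTITIONED BY (dt STRING)
STORED AS RCFILE;

set hive.exec.compress.output=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
INSERT OVERWRITE TABLE logs_rc PARTITION (dt='2014-01-27')
SELECT a, b, msg FROM logs_staging;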


-- 
https://www.cs.indiana.edu/~tgunarat/
http://www.linkedin.com/in/thilina
http://thilina.gunarathne.org


Re: RCFile vs SequenceFile vs text files

2014-01-27 Thread Thilina Gunarathne
Thanks Eric and Sharath for the pointers to ORC. Unfortunately ORC would
not be an option for us as our cluster still runs Hive 0.9 and we won't be
migrating any time soon.

thanks,
Thilina


On Mon, Jan 27, 2014 at 2:35 PM, Sharath Punreddy wrote:

> Quick insights:
>
>
> http://hortonworks.com/blog/orcfile-in-hdp-2-better-compression-better-performance/
>
>
>
>
> On Mon, Jan 27, 2014 at 1:29 PM, Eric Hanson (BIG DATA) <
> eric.n.han...@microsoft.com> wrote:
>
>>  It sounds like ORC would be best.
>>
>>
>>
>> -Eric
>>
>>
>>
>> *From:* Thilina Gunarathne [mailto:cset...@gmail.com]
>> *Sent:* Monday, January 27, 2014 11:05 AM
>> *To:* user@hive.apache.org
>> *Subject:* RCFile vs SequenceFile vs text files
>>
>>
>>
>> Dear all,
>>
>> We are trying to pick the right data storage format for the Hive table
>> with the following requirements and would really appreciate any insights you
>> can provide to help with our decision.
>>
>> 1. ~50 billion records per month, ~14 columns per record, and each record
>> is ~100 bytes. The table is partitioned by date and gets populated
>> periodically from another Hive query.
>>
>> 2. The columns are dense, so I'm not sure whether we'll get any space
>> savings by using RCFiles.
>>
>> 3. Data needs to be compressed.
>>
>> 4. We will be doing a lot of aggregation queries for selected columns.
>> There will be ad-hoc queries for whole records as well.
>>
>> 5. We need the ability to run Java MapReduce programs on the underlying
>> data. We have existing programs which use custom InputFormats with
>> compressed text files as input, and we are willing to port them to use other
>> formats. (How easy is it to use Java MapReduce with RCFiles vs SequenceFiles?)
>>
>> 6. Ability to use Hive indexing.
>>
>> thanks a ton in advance,
>>
>> Thilina
>>
>>
>>
>> --
>> https://www.cs.indiana.edu/~tgunarat/
>> http://www.linkedin.com/in/thilina
>>
>> http://thilina.gunarathne.org
>>
>
>
>
> --
> Thank you
>
> Sharath Punreddy
> 1201 Golden gate Dr,
> Southlake TX 76092.
> Phone:626-470-7867
>



-- 
https://www.cs.indiana.edu/~tgunarat/
http://www.linkedin.com/in/thilina
http://thilina.gunarathne.org


RCFile vs SequenceFile vs text files

2014-01-27 Thread Thilina Gunarathne
Dear all,
We are trying to pick the right data storage format for the Hive table with
the following requirements and would really appreciate any insights you can
provide to help with our decision.

1. ~50 billion records per month, ~14 columns per record, and each record is
~100 bytes. The table is partitioned by date and gets populated
periodically from another Hive query.
2. The columns are dense, so I'm not sure whether we'll get any space
savings by using RCFiles.
3. Data needs to be compressed.
4. We will be doing a lot of aggregation queries for selected columns. There
will be ad-hoc queries for whole records as well.
5. We need the ability to run Java MapReduce programs on the underlying
data. We have existing programs which use custom InputFormats with
compressed text files as input, and we are willing to port them to use other
formats. (How easy is it to use Java MapReduce with RCFiles vs SequenceFiles?)
6. Ability to use Hive indexing.

thanks a ton in advance,
Thilina


-- 
https://www.cs.indiana.edu/~tgunarat/
http://www.linkedin.com/in/thilina
http://thilina.gunarathne.org