Re: Index not getting used for the queries
Thanks Peter, that helped. That property, combined with setting 'hive.optimize.index.filter' to 'true', got the index working.

thanks,
Thilina

On Mon, Feb 3, 2014 at 6:12 PM, Peter Marron <peter.mar...@trilliumsoftware.com> wrote:

> Hi,
>
> Not sure if it is relevant to your problem, but I'm just checking that
> you know about hive.optimize.index.filter.compact.minsize. It's set to
> 5 GB by default, and if the estimated query size is less than this then
> the index won't be used.
>
> HTH.
>
> Regards
>
> Peter Marron
> Senior Developer, Research & Development
>
> Office: +44 (0) 118-940-7609
> peter.mar...@trilliumsoftware.com
> Theale Court First Floor, 11-13 High Street, Theale, RG7 5AH, UK
> www.trilliumsoftware.com
> Be Certain About Your Data. Be Trillium Certain.
>
> From: Thilina Gunarathne [mailto:cset...@gmail.com]
> Sent: 03 February 2014 16:08
> To: user
> Subject: Index not getting used for the queries
>
> Dear all,
>
> I created a compact index for a table with several hundred million
> records as follows. The table is partitioned by month. [...]

--
https://www.cs.indiana.edu/~tgunarat/
http://www.linkedin.com/in/thilina
http://thilina.gunarathne.org
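For the archive, a minimal sketch of the session settings that make the compact index kick in; setting the minsize to 0 is illustrative, since any value below the query's estimated input size should have the same effect:

SET hive.optimize.index.filter=true;               -- let Hive rewrite queries to use the index automatically
SET hive.optimize.index.filter.compact.minsize=0;  -- default is ~5 GB; queries estimated smaller than this skip the index
SELECT a, b FROM tm_top50_p WHERE a = 113231 AND month = 201308;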
Index not getting used for the queries
Dear all,

I created a compact index for a table with several hundred million records
as follows. The table is partitioned by month. The index on A and B was
created successfully, but I can't see it getting used in the queries. It
would be great if one of you experts could shed some light on what I'm
missing. I'm using Hive 0.9.

set hive.exec.parallel=false;
CREATE INDEX idx_
ON TABLE  (a,b)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD
COMMENT 'Index for table. Indexing on A and B';
ALTER INDEX idx_ on  REBUILD;

hive> describe ;
OK
a        bigint
...
b        bigint
month    int

hive> show index on ;
OK
idx_    a, b    default___p_idx___compact    Index for tm top50 table. Indexing on A and B

hive> explain select a,b from tm_top50_p where a=113231 and month=201308;
OK
ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME ))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL a)) (TOK_SELEXPR (TOK_TABLE_OR_COL b))) (TOK_WHERE (and (= (TOK_TABLE_OR_COL a) 113231) (= (TOK_TABLE_OR_COL month) 201308)))))

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        TableScan
          alias:
          Filter Operator
            predicate:
                expr: (a = 113231)
                type: boolean
            Select Operator
              expressions:
                    expr: a
                    type: bigint
                    expr: b
                    type: bigint
              outputColumnNames: _col0, _col1
              File Output Operator
                compressed: false
                GlobalTableId: 0
                table:
                    input format: org.apache.hadoop.mapred.TextInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1

thanks a lot,
Thilina

--
https://www.cs.indiana.edu/~tgunarat/
http://www.linkedin.com/in/thilina
http://thilina.gunarathne.org
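A side note for later readers, not from the original thread: on a table partitioned by month, the compact index is itself partitioned, so each month can be rebuilt individually after a load instead of rebuilding the whole index. A hedged sketch, borrowing the tm_top50_p name from the EXPLAIN above; the index name and partition spec are illustrative:

ALTER INDEX idx_ ON tm_top50_p PARTITION (month=201308) REBUILD;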
Re: Using Hive generated SequenceFiles and RC files with Java MapReduce and Pig
Thanks for the information, Edward.

> When you use the default SerDe (LazySerDe) and sequence files, Hive writes
> a SequenceFile (create table x stored as sequencefile); the key is null,
> and Hive serializes all the columns into a Text writable that is easy for
> other tools to read.

Does this mean that using the default SerDe would not give us much advantage
over using a TextFile, other than the splittability (and the compression
options due to it) of SequenceFiles?

thanks,
Thilina

> Hive does not dictate the input format or the output format; usually you
> can get Hive to produce exactly what you want by mixing and matching SerDe
> and output format options.
>
> On Tue, Jan 28, 2014 at 8:05 PM, Thilina Gunarathne wrote:
>
>> Hi,
>> We have a requirement to store a large data set (more than 5 TB) mapped
>> to a Hive table. [...]

--
https://www.cs.indiana.edu/~tgunarat/
http://www.linkedin.com/in/thilina
http://thilina.gunarathne.org
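To make the "easy for other tools to read" part concrete, a minimal sketch of a map-only Java job over a Hive-written SequenceFile. Assumptions: the table uses Hive's default field delimiter, Ctrl-A (\u0001); the class, column positions, and paths are illustrative; and since Hive writes an empty key, the key type is declared as a plain Writable and ignored.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class HiveSeqFileRead {

  // Each value is one row: column values joined with Ctrl-A (\u0001),
  // Hive's default field delimiter. The key Hive writes is empty, so it
  // is treated as opaque and ignored.
  public static class RowMapper extends Mapper<Writable, Text, Text, Text> {
    @Override
    protected void map(Writable ignoredKey, Text row, Context ctx)
        throws IOException, InterruptedException {
      String[] cols = row.toString().split("\u0001", -1); // -1 keeps trailing empty columns
      if (cols.length >= 2) {
        ctx.write(new Text(cols[0]), new Text(cols[1]));  // e.g. the first two columns
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "read-hive-seqfile");
    job.setJarByClass(HiveSeqFileRead.class);
    job.setMapperClass(RowMapper.class);
    job.setNumReduceTasks(0);  // map-only pass over the table files
    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));  // e.g. the table/partition directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

So versus a TextFile the parsing work is the same; what the SequenceFile buys you is splittable, block-compressible containers, which matches the point above.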
Using Hive generated SequenceFiles and RC files with Java MapReduce and Pig
Hi,

We have a requirement to store a large data set (more than 5 TB) mapped to
a Hive table. This Hive table would be populated (and appended to
periodically) using a Hive query from another Hive table. In addition to
the Hive queries, we need to be able to run Java MapReduce, and preferably
Pig, jobs on top of this data.

I'm wondering what would be the best storage format for this Hive table.
How easy is it to use Java MapReduce on Hive-generated sequence files (e.g.
stored as SequenceFile)? How easy is it to use Java MapReduce on RC files?
Any pointers to examples of these would be really great. Does using
compressed text files (deflate) sound like the best option for this use case?

BTW, we are stuck with Hive 0.9 for the foreseeable future, and ORC is out
of the options.

thanks,
Thilina

--
https://www.cs.indiana.edu/~tgunarat/
http://www.linkedin.com/in/thilina
http://thilina.gunarathne.org
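For context in the archive, a hedged sketch of the Hive side being discussed. The property names are real Hive/Hadoop settings of that era, but the table, column, and partition names are illustrative; DefaultCodec gives the deflate compression mentioned above, applied block-wise inside the SequenceFile:

SET hive.exec.compress.output=true;
SET mapred.output.compression.type=BLOCK;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.DefaultCodec;

CREATE TABLE big_table_seq (a BIGINT, b BIGINT)
PARTITIONED BY (month INT)
STORED AS SEQUENCEFILE;

INSERT OVERWRITE TABLE big_table_seq PARTITION (month=201401)
SELECT a, b FROM staging_table WHERE month=201401;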
Re: RCFile vs SequenceFile vs text files
Thanks Edward. I'm actually populating this table periodically from another
temporary table, and ORC sounds like a good fit. But unfortunately we are
stuck with Hive 0.9.

I wonder how easy/hard it is to use the data stored as RCFile or ORC with
Java MapReduce?

thanks,
Thilina

On Mon, Jan 27, 2014 at 3:09 PM, Edward Capriolo wrote:

> The thing about ORC is that it is great for tables created from other
> tables (like the other columnar formats), but if you are logging directly
> to HDFS, a columnar format is not easy (or even possible) to write
> directly. Normally people store data in a very direct, row-oriented form,
> and then their first MapReduce job buckets/partitions/columnar-izes it.
>
> On Mon, Jan 27, 2014 at 2:44 PM, Thilina Gunarathne wrote:
>
>> Thanks Eric and Sharath for the pointers to ORC. Unfortunately ORC would
>> not be an option for us, as our cluster still runs Hive 0.9 and we won't
>> be migrating any time soon. [...]

--
https://www.cs.indiana.edu/~tgunarat/
http://www.linkedin.com/in/thilina
http://thilina.gunarathne.org
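On the "how easy/hard" question, a hedged sketch of reading RCFile-backed table data from plain Java MapReduce. Assumptions: Hive's RCFileInputFormat from the old org.apache.hadoop.mapred API (what shipped in the Hive 0.9 era), illustrative column indexes, and columns stored as UTF-8 text, the default for a Hive-written RCFile.

import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.io.RCFileInputFormat;
import org.apache.hadoop.hive.serde2.ColumnProjectionUtils;
import org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable;
import org.apache.hadoop.hive.serde2.columnar.BytesRefWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextOutputFormat;

public class RCFileReaderJob {

  // RCFileInputFormat hands each row to the mapper as a BytesRefArrayWritable
  // holding the raw bytes of each projected column.
  public static class RCMapper extends MapReduceBase
      implements Mapper<LongWritable, BytesRefArrayWritable, Text, Text> {
    public void map(LongWritable key, BytesRefArrayWritable row,
        OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
      BytesRefWritable a = row.get(0);  // column indexes are illustrative
      BytesRefWritable b = row.get(1);
      String aStr = new String(a.getData(), a.getStart(), a.getLength(), "UTF-8");
      String bStr = new String(b.getData(), b.getStart(), b.getLength(), "UTF-8");
      out.collect(new Text(aStr), new Text(bStr));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(RCFileReaderJob.class);
    conf.setJobName("read-rcfile");
    // Column pruning: only materialize columns 0 and 1; this is the point
    // of a columnar format for selected-column scans.
    ColumnProjectionUtils.setReadColumnIDs(conf,
        new ArrayList<Integer>(Arrays.asList(0, 1)));
    conf.setInputFormat(RCFileInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);
    conf.setMapperClass(RCMapper.class);
    conf.setNumReduceTasks(0);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}

The ColumnProjectionUtils call is where RCFile pays off compared to a SequenceFile: only the listed columns are decompressed and read off disk, which is the same mechanism Hive itself relies on.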
Re: RCFile vs SequenceFile vs text files
Thanks Eric and Sharath for the pointers to ORC. Unfortunately, ORC would
not be an option for us, as our cluster still runs Hive 0.9 and we won't be
migrating any time soon.

thanks,
Thilina

On Mon, Jan 27, 2014 at 2:35 PM, Sharath Punreddy wrote:

> Quick insights:
>
> http://hortonworks.com/blog/orcfile-in-hdp-2-better-compression-better-performance/
>
> On Mon, Jan 27, 2014 at 1:29 PM, Eric Hanson (BIG DATA) <
> eric.n.han...@microsoft.com> wrote:
>
>> It sounds like ORC would be best.
>>
>> -Eric
>>
>> From: Thilina Gunarathne [mailto:cset...@gmail.com]
>> Sent: Monday, January 27, 2014 11:05 AM
>> To: user@hive.apache.org
>> Subject: RCFile vs SequenceFile vs text files
>>
>> [...]
>
> --
> Thank you
>
> Sharath Punreddy
> 1201 Golden gate Dr,
> Southlake TX 76092.
> Phone: 626-470-7867

--
https://www.cs.indiana.edu/~tgunarat/
http://www.linkedin.com/in/thilina
http://thilina.gunarathne.org
RCFile vs SequenceFile vs text files
Dear all,

We are trying to pick the right data storage format for a Hive table with
the following requirements, and would really appreciate any insights you
can provide to help our decision.

1. ~50 billion records per month, ~14 columns per record, and each record
is ~100 bytes. The table is partitioned by date and gets populated
periodically from another Hive query.

2. The columns are dense, so I'm not sure whether we'll get any space
savings by using RCFiles.

3. Data needs to be compressed.

4. We will be doing a lot of aggregation queries over selected columns.
There will be ad-hoc queries over whole records as well.

5. We need the ability to run Java MapReduce programs on the underlying
data. We have existing programs which use custom input formats with
compressed text files as input, and we are willing to port them to use
other formats. (How easy is it to use Java MapReduce with RCFiles vs
SequenceFiles?)

6. Ability to use Hive indexing.

thanks a ton in advance,
Thilina

--
https://www.cs.indiana.edu/~tgunarat/
http://www.linkedin.com/in/thilina
http://thilina.gunarathne.org
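Since requirements 2 to 4 hinge on columnar storage and compression, a hedged sketch of what the RCFile option would look like as DDL; the table and column names are illustrative, and DefaultCodec is the stock deflate codec:

SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.DefaultCodec;

CREATE TABLE records_rc (c1 BIGINT, c2 BIGINT)  -- ~14 columns in practice
PARTITIONED BY (ds STRING)
STORED AS RCFILE;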