Re: Optimize hive external tables with serde
The non-gzipped version will be faster. One of the use cases for a GZIP file is when your input records are in a multiline format. For example, in the following case you have to send all the contents between the begin and end sections to the mapper. If you do NOT gzip this file (containing multiple records), then HDFS will split it into 128 MB blocks (or whatever you have set the HDFS block size to be), and it may happen that part of the contents between the begin and end sections gets split across blocks, and you will get errors processing the records.

[begin_20100922_12345678_serverid_foofla01]
blah bhah blah bhah blah bhah blah bhah blah bhah blah bhah
[end_20100922_12345678_serverid_foofla01]

So if your records are not multiline, don't use GZIP. I have used LZO, which is fast, but getting LZO to work correctly needs a little bit of work on your cluster. It's not difficult, just a few steps to do. You can refer to my documentation on the Hive site:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LZO

I rewrote your query. (I created an initial subset jt1 with the shared filters and reused it later in the query.) I did not understand two things in your query:
- the join clause: ON (1=1)?
- the where clause: 1=1?
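For the LZO route mentioned above, a rough sketch of what the table declaration looks like, using the input/output format classes documented on the linked wiki page (the table name, columns, and paths here are illustrative only, and the hadoop-lzo libraries must already be installed on the cluster):

```sql
-- Illustrative sketch: an external table reading .lzo files.
CREATE EXTERNAL TABLE lzo_logs (line STRING)
STORED AS
  INPUTFORMAT  'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/data/lzo_logs';

-- The .lzo files must then be indexed so they become splittable
-- across mappers, e.g. (path and jar name illustrative):
-- hadoop jar hadoop-lzo.jar com.hadoop.compression.lzo.DistributedLzoIndexer /data/lzo_logs
```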
a) Code heatmap - how many users used some "code":

WITH jt1 AS (
  SELECT *
  FROM json_table
  WHERE code IN (1,3,54,300,222,111)
    AND partitionDate > '20140922'
    AND partitionDate <= '20141022'
)
SELECT
  A.code,
  A.NumberOfDistinctCustomers,
  B.NumberOfAllDistinctCustomers,
  A.NumberOfEntries,
  B.NumberOfAllEntries
FROM (
  SELECT jt1.code AS code,
         COUNT(DISTINCT jt1.customerId) AS NumberOfDistinctCustomers,
         COUNT(*) AS NumberOfEntries
  FROM jt1
  GROUP BY jt1.code
) A
JOIN (
  SELECT COUNT(DISTINCT jt1.customerId) AS NumberOfAllDistinctCustomers,
         COUNT(*) AS NumberOfAllEntries
  FROM jt1
) B
ON (1=1);

b) Select users which use some code, but have some special attributes in the second table:

NOTE: You are querying this with no partition filter, so if json_table has 500-1000+ partitions then this query might be really slow, or may not run at all. One way you can solve this is to write a bash script that loops through a set of partitions, store the results in an intermediate Hive managed table, and then have another query aggregate that.

SELECT jt.customerId
FROM (
  SELECT *
  FROM json_table
  WHERE code IN (1,3,54,300,222,111)
) jt
JOIN (
  SELECT *
  FROM attribute_table
  WHERE attribute_X1 = '1'
    AND attribute_X2 = '1'
) at
ON jt.customerId = at.customerId;

From: ptrst
To: user@hive.apache.org; Sanjay Subramanian
Sent: Wednesday, October 22, 2014 1:02 AM
Subject: Re: Optimize hive external tables with serde
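The intermediate-table approach suggested in the reply above might look roughly like this in HiveQL (all table and column names here are hypothetical, not from the thread; the per-partition INSERT is what the bash loop would run once per date):

```sql
-- Hypothetical staging table; one partition of results per source date.
CREATE TABLE code_counts_stage (
  code INT,
  distinct_customers BIGINT,
  entries BIGINT
)
PARTITIONED BY (partitionDate STRING);

-- Run once per source partition (the bash loop substitutes the date);
-- each run scans only a single partition of json_table.
INSERT OVERWRITE TABLE code_counts_stage PARTITION (partitionDate = '20141020')
SELECT code, COUNT(DISTINCT customerId), COUNT(*)
FROM json_table
WHERE partitionDate = '20141020'
  AND code IN (1,3,54,300,222,111)
GROUP BY code;

-- Final aggregation over the much smaller staging table. Entry counts
-- can be summed, but per-partition distinct-customer counts cannot
-- simply be added, since the same customer may appear on several days.
SELECT code, SUM(entries) AS NumberOfEntries
FROM code_counts_stage
GROUP BY code;
```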
Re: Optimize hive external tables with serde
ad 1) My files are not bigger than the block size. To be precise, all data from one day are up to 2 GB gzipped; unzipped they are ~15 GB. They are split in one folder into files smaller than the block size (the block size in my case is 128 MB, the files are ~100 MB). I can transform them to another format if you think it will speed up my queries, but as I understand the documentation, such a structure should behave the same as splittable files. If I add one more partition dimension with code, so that I partition by date and code, single files are 3-10 MB, because I have 3000 unique codes.

ad 2) I read your code and I use the same serde, only a newer one. Should I check something more?

ad 4) Most queries will be about the last 3-6 months.

ad 3) My example queries are:

a) Code heatmap - how many users used some "code":

SELECT
  A.code,
  A.NumberOfDistinctCustomers,
  B.NumberOfAllDistinctCustomers,
  A.NumberOfEntries,
  B.NumberOfAllEntries
FROM (
  SELECT code,
         COUNT(DISTINCT customerId) AS NumberOfDistinctCustomers,
         COUNT(*) AS NumberOfEntries
  FROM json_table
  WHERE 1=1
    AND code IN (1,3,54,300,222,111)
    AND partitionDate > '20140922'
    AND partitionDate <= '20141022'
  GROUP BY code
) A
JOIN (
  SELECT COUNT(DISTINCT customerId) AS NumberOfAllDistinctCustomers,
         COUNT(*) AS NumberOfAllEntries
  FROM json_table
  WHERE 1=1
    AND code IN (1,3,54,300,222,111)
    AND partitionDate > '20140922'
    AND partitionDate <= '20141022'
) B
ON (1=1);

b) Select users which use some code, but have some special attributes in the second table:

SELECT json_table.customerId
FROM json_table
JOIN attribute_table
  ON (json_table.customerId = attribute_table.customerId)
 AND attribute_table.attribute_X1 = '1'
 AND attribute_table.attribute_X2 = '1'
 AND json_table.code IN (1,3,54,300,222,111);

c) Mix of above.

On Tue, Oct 21, 2014 at 11:12 PM, Sanjay Subramanian <sanjaysubraman...@yahoo.com> wrote:
Re: Optimize hive external tables with serde
1. The gzip files are not splittable, so gzip itself will make the queries slower.

2. As a reference for JSON serdes, here is an example from my blog:
http://bigdatalatte.wordpress.com/2014/08/21/denormalizing-json-arrays-in-hive/

3. Need to see your query first to try and optimize it.

4. Even if you only have date-wise partitions and you have 5 years of data, that is about 1825 partitions. Trying to do a SELECT COUNT(*) without a WHERE clause might make Hive crawl.

From: Ja Sam
To: user@hive.apache.org
Sent: Tuesday, October 21, 2014 10:37 AM
Subject: Optimize hive external tables with serde
Optimize hive external tables with serde
*Part 1: my environment*

I have the following files uploaded to Hadoop:

1. They are plain text.
2. Each line contains JSON like:

{code:[int], customerId:[string], data:{[something more here]}}

3. code values are numbers from 1 to 3000.
4. customerIds total up to 4 million, with up to 0.5 million daily.
5. All files are gzipped.
6. In Hive I created an external table with a custom JSON serde (let's call it CUSTOMER_DATA).
7. All files from each date are stored in a separate directory, and I use those directories as partitions in Hive tables.

Most queries I run filter by date, code and customerId. I also have a second file (let's call it CUSTOMER_ATTRIBUTES) with the format:

[customerId] [attribute_1] [attribute_2] ... [attribute_n]

which contains data for all my customers, so it has up to 4 million rows. I query and filter my data in the following ways:

1. Filtering by date - partitions do the job here, using WHERE partitionDate IN (20141020,20141020)
2. Filtering by code, using a statement like WHERE code IN (1,4,5,33,6784)
3. Joining table CUSTOMER_ATTRIBUTES with CUSTOMER_DATA, with a query like SELECT customerId FROM CUSTOMER_DATA JOIN CUSTOMER_ATTRIBUTES ON (CUSTOMER_ATTRIBUTES.customerId=CUSTOMER_DATA.customerId) WHERE CUSTOMER_ATTRIBUTES.attribute_1=[something]

*Part 2: question*

Is there any efficient way I can optimize my queries? I read about indexes and buckets, but I don't know if I can use them with external tables and whether they will optimize my queries.
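The date-directory-per-partition setup described above can be sketched roughly as follows. This is illustrative only: the thread does not name the serde class or the HDFS paths, so the serde shown here (the OpenX JSON serde, a commonly used one) and the locations are assumptions.

```sql
-- Illustrative sketch of an external table over date-partitioned JSON files.
CREATE EXTERNAL TABLE CUSTOMER_DATA (
  code INT,
  customerId STRING,
  data STRING
)
PARTITIONED BY (partitionDate STRING)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'  -- assumed serde class
LOCATION '/data/customer_data';

-- Register each day's directory as a partition:
ALTER TABLE CUSTOMER_DATA ADD PARTITION (partitionDate = '20141020')
LOCATION '/data/customer_data/20141020';
```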