Re: Optimize hive external tables with serde

2014-10-22 Thread Sanjay Subramanian
The non-gzipped version will be faster. One of the use cases for GZIP is when your input records are long, multi-line formats. In the following example, all of the content between the begin and end markers has to go to the same mapper. If you do NOT gzip such a file (containing multiple records), HDFS will split it into 128 MB blocks (or whatever you have set the HDFS block size to be), and it may happen that the content between a begin and end marker gets split across blocks, and you will get errors processing the records.

[begin_20100922_12345678_serverid_foofla01]
blah bhah blah bhah blah bhah blah bhah blah bhah blah bhah
[end_20100922_12345678_serverid_foofla01]
So if your records are not multi-line, don't use GZIP. I have used LZO, which is fast, but getting LZO to work correctly needs a little bit of work on your cluster. It is not difficult, just a few steps; you can refer to my documentation on the Hive wiki:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LZO
(A rough sketch of that setup is at the end of this mail, after the queries.)

I rewrote your query below (I created an initial subset jt1 with the common filters and reused it in both aggregations). I did not understand two things in your query:
- The join clause   ---   ON (1=1) ?
- The where clause  ---   1=1 ?
a) Code heatmap - how many users used some "code"

WITH jt1 AS (
  SELECT *
  FROM json_table
  WHERE 1=1
    AND code in (1,3,54,300,222,111)
    AND partitionDate > '20140922'
    AND partitionDate <= '20141022'
)
SELECT
  A.code,
  A.NumberOfDistinctCustomers,
  B.NumberOfAllDistinctCustomers,
  A.NumberOfEntries,
  B.NumberOfAllEntries
FROM
  (SELECT
     jt1.code AS code,
     COUNT(DISTINCT jt1.customerId) AS NumberOfDistinctCustomers,
     COUNT(*) AS NumberOfEntries
   FROM jt1
   GROUP BY jt1.code
  ) A
JOIN
  (SELECT
     COUNT(DISTINCT jt1.customerId) AS NumberOfAllDistinctCustomers,
     COUNT(*) AS NumberOfAllEntries
   FROM jt1
  ) B
ON (1=1);
 
b) Select users which use some code, but have some special attributes in the second table:

NOTE: You are querying this with no partition filter, so if json_table has 500-1000+ partitions this query might be really slow, or may not run at all. One way you can solve this is to write a bash script that loops over a set of partitions, stores the results in an intermediate Hive managed table, and then runs one more query to aggregate that.
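For example, something along these lines (purely a sketch: the intermediate table tmp_code_customers is made up and assumed to be created beforehand, and the date list is illustrative):

#!/bin/bash
# Scan one partition (day) at a time and keep only the filtered rows in a
# small intermediate managed table.
for d in 20140922 20140923 20140924; do    # extend to the range you need
  hive -e "
    INSERT INTO TABLE tmp_code_customers
    SELECT customerId
    FROM json_table
    WHERE partitionDate = '${d}'
      AND code in (1,3,54,300,222,111);"
done

# Then join once against the much smaller intermediate table:
hive -e "
  SELECT DISTINCT t.customerId
  FROM tmp_code_customers t
  JOIN attribute_table a
    ON t.customerId = a.customerId
  WHERE a.attribute_X1 = '1'
    AND a.attribute_X2 = '1';"

And here is the rewritten query itself: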
SELECT
  jt.customerId
FROM
  (SELECT *
   FROM json_table
   WHERE 1=1
     AND code in (1,3,54,300,222,111)
  ) jt
JOIN
  (SELECT *
   FROM attribute_table
   WHERE attribute_X1='1'
     AND attribute_X2='1'
  ) at
ON jt.customerId = at.customerId;
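As promised above, a rough sketch of the LZO setup (the class names are the standard hadoop-lzo ones; the paths, table name and columns are only illustrative, and the wiki page has the full steps):

  1. Install the hadoop-lzo jar and native libraries on all nodes, and register
     the codecs in core-site.xml (keep your existing codecs in the list):
       io.compression.codecs = <existing codecs>,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec
       io.compression.codec.lzo.class = com.hadoop.compression.lzo.LzoCodec
  2. Index the .lzo files so they become splittable:
       hadoop jar /path/to/hadoop-lzo.jar com.hadoop.compression.lzo.DistributedLzoIndexer /data/json_table
  3. Declare the Hive table on top of them:
       CREATE EXTERNAL TABLE json_table_lzo (
         code INT,
         customerId STRING
       )
       PARTITIONED BY (partitionDate STRING)
       STORED AS
         INPUTFORMAT  "com.hadoop.mapred.DeprecatedLzoTextInputFormat"
         OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
       LOCATION '/data/json_table';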






Re: Optimize hive external tables with serde

2014-10-22 Thread ptrstpppp
ad 1) My files are not bigger than the block size. To be precise, all data from
one day is up to 2GB gzipped; unzipped it is ~15GB. It is split within one
folder into files smaller than the block size (the block size in my case is
128MB, the files are ~100MB). I can transform them to another format if you
think it will speed up my queries (a sketch of what I had in mind is below),
but as I understand the documentation, such a structure should behave the same
as splittable files.
If I add one more partition dimension with code, so that I partition by date
and code, single files are 3-10MB, because I have 3000 unique codes.

ad 2) I read your code, and I use the same serde, only a newer version. Should I
check anything more?

Ad 4) Most queries will be about the last 3-6 months.

Ad 3) My example queries are:
a) Code heatmap - how many users used some "code"
SELECT
  A.code,
  A.NumberOfDistinctCustomers,
  B.NumberOfAllDistinctCustomers,
  A.NumberOfEntries,
  B.NumberOfAllEntries
FROM
(
  SELECT
code,
COUNT(DISTINCT customerId) AS NumberOfDistinctCustomers,
COUNT(*) AS NumberOfEntries
  FROM
json_table
  WHERE 1=1
   AND code in (1,3,54,300,222,111)
   AND partitionDate > '20140922' AND partitionDate <= '20141022'
  GROUP BY code
) A
JOIN
(
  SELECT
COUNT(DISTINCT customerId) AS NumberOfAllDistinctCustomers,
COUNT(*) AS NumberOfAllEntries
  FROM
json_table
  WHERE 1=1
   AND code in (1,3,54,300,222,111)
   AND partitionDate > '20140922' AND partitionDate <= '20141022'
) B
ON(1=1);

b) Select users which use some code, but have some special attributes in
second table:
SELECT json_table.customerId
 FROM json_table
 JOIN attribute_table ON (json_table.customerId=attribute_table.customerId)
 AND attribute_table.attribute_X1='1'
 AND attribute_table.attribute_X2='1'
 AND json_table.code in (1,3,54,300,222,111);

c) Mix of above.

On Tue, Oct 21, 2014 at 11:12 PM, Sanjay Subramanian <
sanjaysubraman...@yahoo.com> wrote:

> 1. The gzip files are not splittable, so gzip itself will make the queries
> slower.
>
> 2. As a reference for JSON serdes , here is a example from my blog
> http://bigdatalatte.wordpress.com/2014/08/21/denormalizing-json-arrays-in-hive/
>
> 3. Need to see your query first to try and optimize it
>
> 4. Even if you have datewise partitions and u have 5 years of data i.e.
> about 1825 partitions.
>-- Trying to do a select count(*) without where clause might make hive
> crawl.
>

Re: Optimize hive external tables with serde

2014-10-21 Thread Sanjay Subramanian
1. The gzip files are not splittable, so gzip itself will make the queries
slower.
2. As a reference for JSON serdes, here is an example from my blog:
http://bigdatalatte.wordpress.com/2014/08/21/denormalizing-json-arrays-in-hive/
3. Need to see your query first to try and optimize it.
4. Even if you have date-wise partitions, with 5 years of data that is about
1825 partitions.
   -- Trying to do a select count(*) without a where clause might make Hive crawl.
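To make point 4 concrete, a small sketch (CUSTOMER_DATA and the dates echo your mail; hive.mapred.mode is a long-standing Hive setting, but verify it on your version):

-- Restricting on the partition column keeps the scan bounded to those directories:
SELECT COUNT(*)
FROM CUSTOMER_DATA
WHERE partitionDate >= '20141001' AND partitionDate <= '20141020';

-- Optional guard: strict mode rejects queries on partitioned tables that have no
-- partition filter (it also rejects ORDER BY without LIMIT and cartesian joins):
SET hive.mapred.mode=strict;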


Optimize hive external tables with serde

2014-10-21 Thread Ja Sam
*Part 1: my environment*

I have the following files uploaded to Hadoop:

   1. They are plain text.
   2. Each line contains JSON like:

      {code:[int], customerId:[string], data:{[something more here]}}

   3. code are numbers from 1 to 3000,
   4. customerIds total up to 4 million, with up to 0.5 million per day,
   5. All files are gzipped,
   6. In Hive I created an external table with a custom JSON serde (let's call
      it CUSTOMER_DATA) - a simplified sketch of the definition is below,
   7. All files from each date are stored in a separate directory, and I use
      that as the partitions of the Hive table.
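Simplified, the table definition is roughly like the following (the SerDe class here is just an example of a JSON serde, the jar path and locations are illustrative, and I left out the nested data column):

ADD JAR /path/to/json-serde.jar;

CREATE EXTERNAL TABLE CUSTOMER_DATA (
  code INT,
  customerId STRING
)
PARTITIONED BY (partitionDate STRING)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION '/data/customer_data';

-- one directory per date, registered as a partition:
ALTER TABLE CUSTOMER_DATA ADD PARTITION (partitionDate='20141020')
LOCATION '/data/customer_data/20141020';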

Most queries I run filter by date, code and customerId. I also have a second
file (let's call it CUSTOMER_ATTRIBUTES) with the format:
[customerId] [attribute_1] [attribute_2] ... [attribute_n]
which contains data for all my customers, so it has up to 4 million rows.

I query and filter my data in the following way:

   1. Filtering by date - partitions do the job here, using WHERE
   partitionDate IN (20141020,20141020)
   2. Filtering by code, using a statement like WHERE code IN
   (1,4,5,33,6784)
   3. Joining table CUSTOMER_ATTRIBUTES with CUSTOMER_DATA, with a query like
   SELECT CUSTOMER_DATA.customerId FROM CUSTOMER_DATA JOIN
   CUSTOMER_ATTRIBUTES ON
   (CUSTOMER_ATTRIBUTES.customerId=CUSTOMER_DATA.customerId) WHERE
   CUSTOMER_ATTRIBUTES.attribute_1=[something]

*Part 2: question*

Is there an efficient way I can optimize my queries? I read about indexes and
buckets, but I don't know if I can use them with external tables and whether
they will actually speed up my queries.