Re: External partition table question

2014-07-17 Thread Lefty Leverenz
Thanks for this clarification.  I've revised the Add Partitions section
in the wiki accordingly.

-- Lefty


On Fri, Jul 18, 2014 at 12:45 AM, Satish Mittal 
wrote:

> 'ALTER TABLE .. ADD PARTITION..' would just add a partition entry for the
> table in the Hive metastore. It doesn't perform any data loading; instead, it
> expects the data to already be in the directory pointed to by LOCATION.
>
>
> On Tue, Jul 15, 2014 at 5:39 AM, Raymond Lau  wrote:
>
>> I've created an external table partitioned by a field and am attempting
>> to load in the data via the command 'ALTER TABLE partitioned_table_test ADD
>> PARTITION (pcode = '123') LOCATION '/path/to/parquet/files';' using a
>> custom Parquet SerDe.
>>
>> Does loading in the data this way call the serializer() function in the
>> SerDe at all?
>>
>>  I've tried adding System.out.println statements in my deserializer and
>> serializer to debug and no output seems to come from the Serializer
>> function.
>>
>> --
>> *Raymond Lau*
>> Software Engineer - Intern |
>> r...@ooyala.com | (925) 395-3806
>>
>
>
> _
> The information contained in this communication is intended solely for the
> use of the individual or entity to whom it is addressed and others
> authorized to receive it. It may contain confidential or legally privileged
> information. If you are not the intended recipient you are hereby notified
> that any disclosure, copying, distribution or taking any action in reliance
> on the contents of this information is strictly prohibited and may be
> unlawful. If you have received this communication in error, please notify
> us immediately by responding to this email and then delete it from your
> system. The firm is neither liable for the proper and complete transmission
> of the information contained in this communication nor for any delay in its
> receipt.


Re: External partition table question

2014-07-17 Thread Satish Mittal
'ALTER TABLE .. ADD PARTITION..' would just add a partition entry for the
table in the Hive metastore. It doesn't perform any data loading; instead, it
expects the data to already be in the directory pointed to by LOCATION.
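
To illustrate the distinction, here is a toy Python sketch. It is not Hive
code: ToyMetastore and its methods are hypothetical names invented for this
example. It models ADD PARTITION as a pure metadata update, while writing a
row is the kind of operation that goes through a serializer. In Hive itself,
the SerDe serializes when rows are written (for example by an INSERT) and
deserializes when rows are read (for example by a SELECT), so seeing no
serializer output after ADD PARTITION is expected.

```python
class ToyMetastore:
    """Toy stand-in for a metastore; hypothetical, not a Hive API."""

    def __init__(self):
        # Maps (table, partition spec) -> storage location.
        self.partitions = {}
        # Counts how often the serializer-like step has run.
        self.serialize_calls = 0

    def add_partition(self, table, spec, location):
        # Metadata only: record where the data lives. No file is opened,
        # read, or written, so nothing is serialized or deserialized.
        key = (table, tuple(sorted(spec.items())))
        self.partitions[key] = location

    def insert_row(self, table, spec, row):
        # Writing a row is where a serializer would be needed.
        self.serialize_calls += 1
        return "\t".join(str(value) for value in row)


ms = ToyMetastore()
ms.add_partition("partitioned_table_test", {"pcode": "123"},
                 "/path/to/parquet/files")
print(ms.serialize_calls)  # 0: registering the partition serialized nothing
ms.insert_row("partitioned_table_test", {"pcode": "123"}, [1, "a"])
print(ms.serialize_calls)  # 1: only the write touched the serializer
```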


On Tue, Jul 15, 2014 at 5:39 AM, Raymond Lau  wrote:

> I've created an external table partitioned by a field and am attempting to
> load in the data via the command 'ALTER TABLE partitioned_table_test ADD
> PARTITION (pcode = '123') LOCATION '/path/to/parquet/files';' using a
> custom Parquet SerDe.
>
> Does loading in the data this way call the serializer() function in the
> SerDe at all?
>
> I've tried adding System.out.println statements in my deserializer and
> serializer to debug and no output seems to come from the Serializer
> function.
>
> --
> *Raymond Lau*
> Software Engineer - Intern |
> r...@ooyala.com | (925) 395-3806
>



External partition table question

2014-07-14 Thread Raymond Lau
I've created an external table partitioned by a field and am attempting to
load in the data via the command 'ALTER TABLE partitioned_table_test ADD
PARTITION (pcode = '123') LOCATION '/path/to/parquet/files';' using a
custom Parquet SerDe.

Does loading in the data this way call the serializer() function in the
SerDe at all?

I've tried adding System.out.println statements in my deserializer and
serializer to debug and no output seems to come from the Serializer
function.

-- 
*Raymond Lau*
Software Engineer - Intern |
r...@ooyala.com | (925) 395-3806


Compression for a HDFS text file - Hive External Partition Table

2013-11-13 Thread Raj Hadoop
Hi,

1)  My requirement is to load a file (a tar.gz file that contains multiple
tab-separated values files, one of which is the main file with about 10 GB
of data per day) into an externally partitioned Hive table.

2)  I have automated the process: it extracts the tar.gz file, takes the
main data file (a 10 GB text file), and loads it into HDFS as a text file.

3)  I want to compress the files. What is the procedure for it?

4)  Do I need to use a utility to compress the hit data file before
loading it to HDFS? Also, should I define an input structure for the HDFS
file format through a Java program?
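
For points 3 and 4, one possible approach is sketched below in Python, under
stated assumptions. Hive can read gzip-compressed text files directly when the
file name ends in .gz, so the main file could be recompressed on its own
before it is copied to HDFS. Note that gzip is not a splittable format, so a
single large .gz file is read by one task. The function name
recompress_member and the file names in the usage comment are hypothetical.

```python
import gzip
import shutil
import tarfile


def recompress_member(archive_path, member_name, out_path):
    """Stream one member of a .tar.gz archive into a standalone .gz file."""
    with tarfile.open(archive_path, "r:gz") as archive:
        # extractfile returns None for members that are not regular files.
        member = archive.extractfile(member_name)
        if member is None:
            raise ValueError(f"{member_name} is not a regular file")
        # Copy in chunks so the 10 GB file is never held in memory at once.
        with member, gzip.open(out_path, "wb") as out:
            shutil.copyfileobj(member, out)
    return out_path


# Hypothetical usage:
#   recompress_member("daily.tar.gz", "hit_data.tsv", "hit_data.tsv.gz")
# The resulting .gz file can then be copied into the partition directory,
# for example with `hadoop fs -put`.
```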
 
Regards,
Raj

Re: External Partition Table

2013-10-31 Thread Raj Hadoop


Thanks Tim. I am using a String column for the partition column. 



On Thursday, October 31, 2013 6:49 PM, Timothy Potter  
wrote:
 
Hi Raj,
This seems like a matter of style vs. any performance benefit / cost ... if 
you're going to do a lot of queries just based on month or year, then #2 might 
be easier, e.g.

select * from foo where year = 2013 seems a little cleaner than select * from 
foo where date >= 20130101 and date <= 20131231 (not sure how you're encoding 
dates into an INT but I think you get the idea)

I do something similar but my partition fields are strings, like 
2013-10-31_ (which has the nice property of lexically sorting the same as 
numeric sort).

I'm assuming they will both have the same performance because Hive is still 
selecting the same number of input paths in both scenarios, one just happens to 
be a little deeper.

Cheers,
Tim



On Thu, Oct 31, 2013 at 4:34 PM, Raj Hadoop  wrote:

Hi,
>
>
>I am planning for a Hive External Partition Table based on a date.
>
>
>Which one of the below yields a better performance or both have the same 
>performance?
>
>
>1) Partition based on one folder per day
>LIKE date INT
>2) Partition based on one folder per year / month / day ( So it has three 
>folders) 
>LIKE year INT, month INT, day INT
>
>
>Thanks,
>Raj
>
>

Re: External Partition Table

2013-10-31 Thread Timothy Potter
Hi Raj,

This seems like a matter of style vs. any performance benefit / cost ... if
you're going to do a lot of queries just based on month or year, then #2
might be easier, e.g.

select * from foo where year = 2013 seems a little cleaner than select *
from foo where date >= 20130101 and date <= 20131231 (not sure how you're
encoding dates into an INT but I think you get the idea)

I do something similar but my partition fields are strings, like
2013-10-31_ (which has the nice property of lexically sorting the same
as numeric sort).
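
The sorting property can be checked with a short Python sketch. It uses only
the date part of the key (whatever follows the underscore in the key above is
left out), and the helper names are hypothetical. Zero-padded keys sort
lexically in chronological order; keys without zero padding do not.

```python
from datetime import date, timedelta


def iso_key(d):
    return d.isoformat()            # e.g. '2013-10-31'


def compact_key(d):
    return d.strftime("%Y%m%d")     # e.g. '20131031'


def unpadded_key(d):
    return f"{d.year}-{d.month}-{d.day}"   # e.g. '2013-9-25' (no padding)


def sorts_chronologically(key_fn, days):
    """True if sorting by the string key gives chronological order."""
    return sorted(days, key=key_fn) == sorted(days)


# 30 consecutive days that cross from September into October.
days = [date(2013, 9, 25) + timedelta(days=i) for i in range(30)]

print(sorts_chronologically(iso_key, days))       # True
print(sorts_chronologically(compact_key, days))   # True
print(sorts_chronologically(unpadded_key, days))  # False: '2013-10-..' < '2013-9-..'
```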

I'm assuming they will both have the same performance because Hive is still
selecting the same number of input paths in both scenarios, one just
happens to be a little deeper.
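
A rough Python sketch of that reasoning, assuming the usual column=value
directory naming for partitions (the helper names are hypothetical): a query
over all of 2013 touches the same number of leaf directories in either
layout, and only the depth of each path differs.

```python
from datetime import date, timedelta


def days_in_year(year):
    """Yield every calendar day of the given year."""
    d = date(year, 1, 1)
    while d.year == year:
        yield d
        d += timedelta(days=1)


def flat_path(d):
    # Layout 1: one partition column, one directory level per day.
    return f"date={d:%Y%m%d}"


def nested_path(d):
    # Layout 2: three partition columns, three directory levels per day.
    return f"year={d.year}/month={d.month}/day={d.day}"


flat = {flat_path(d) for d in days_in_year(2013)}
nested = {nested_path(d) for d in days_in_year(2013)}

# Both layouts give one leaf directory per day of the year.
print(len(flat), len(nested))  # 365 365
```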

Cheers,
Tim


On Thu, Oct 31, 2013 at 4:34 PM, Raj Hadoop  wrote:

> Hi,
>
> I am planning for a Hive External Partition Table based on a date.
>
> Which one of the below yields a better performance or both have the same
> performance?
>
> 1) Partition based on one folder per day
> LIKE date INT
> 2) Partition based on one folder per year / month / day ( So it has three
> folders)
> LIKE year INT, month INT, day INT
>
> Thanks,
> Raj
>
>


Re: External Partition Table

2013-10-31 Thread Brad Ruderman
Personally, from my limited understanding of your requirements, I would
think partitioning by day would be fine. Perhaps use the "YYYYMMDD" method,
so the partition for today would be 20131031 and tomorrow's would be 20131101.
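
As a minimal sketch of how such daily partitions could be registered for an
external table, the Python below generates one ALTER TABLE ... ADD PARTITION
statement per day. The table name weblogs, the string partition column dt,
and the base directory /data/weblogs are assumptions made up for the example.

```python
from datetime import date, timedelta


def add_partition_statements(table, column, base_dir, first, last):
    """Yield one ADD PARTITION statement per day from first to last, inclusive."""
    d = first
    while d <= last:
        key = d.strftime("%Y%m%d")  # e.g. '20131031'
        yield (
            f"ALTER TABLE {table} ADD IF NOT EXISTS "
            f"PARTITION ({column} = '{key}') "
            f"LOCATION '{base_dir}/{key}';"
        )
        d += timedelta(days=1)


# Hypothetical table, column, and directory names.
for stmt in add_partition_statements(
    "weblogs", "dt", "/data/weblogs", date(2013, 10, 31), date(2013, 11, 1)
):
    print(stmt)
```

Each statement only registers the directory as a partition; the day's data
is expected to be in that directory already.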

Thanks,
Brad


On Thu, Oct 31, 2013 at 3:42 PM, Raj Hadoop  wrote:

> Hi Brad,
>
> Thanks for the quick response.
>
> I have a file of about 10 GB per day (web logs), and I am creating one
> folder (partition) per day. Is that uncommon?
>
> I do not know at this juncture what kind of queries I would be executing
> on this table. I just wanted to know whether this is a normal thing to do
> or not.
>
> Thanks,
> Raj
>
>
>   On Thursday, October 31, 2013 6:39 PM, Brad Ruderman <
> bruder...@radiumone.com> wrote:
>  Wow, that question won't be answerable in general. It all depends on the
> amount of data per partition and the queries you are going to be executing
> on it, as well as the structure of the data. In general, in Hive (depending
> on your cluster size) you need to balance the number of files against their
> size: a smaller number of files is typically preferred, but partitions will
> help when restricting by date.
>
> Thx,
> Brad
>
>
> On Thu, Oct 31, 2013 at 3:34 PM, Raj Hadoop  wrote:
>
> Hi,
>
> I am planning for a Hive External Partition Table based on a date.
>
> Which one of the below yields a better performance or both have the same
> performance?
>
> 1) Partition based on one folder per day
> LIKE date INT
> 2) Partition based on one folder per year / month / day ( So it has three
> folders)
> LIKE year INT, month INT, day INT
>
>  Thanks,
> Raj
>
>
>
>
>


Re: External Partition Table

2013-10-31 Thread Raj Hadoop
Hi Brad,

Thanks for the quick response.

I have a file of about 10 GB per day (web logs), and I am creating one
folder (partition) per day. Is that uncommon?

I do not know at this juncture what kind of queries I would be executing
on this table. I just wanted to know whether this is a normal thing to do
or not.

Thanks,
Raj



On Thursday, October 31, 2013 6:39 PM, Brad Ruderman  
wrote:
 
Wow, that question won't be answerable in general. It all depends on the
amount of data per partition and the queries you are going to be executing
on it, as well as the structure of the data. In general, in Hive (depending
on your cluster size) you need to balance the number of files against their
size: a smaller number of files is typically preferred, but partitions will
help when restricting by date.

Thx,
Brad



On Thu, Oct 31, 2013 at 3:34 PM, Raj Hadoop  wrote:

Hi,
>
>
>I am planning for a Hive External Partition Table based on a date.
>
>
>Which one of the below yields a better performance or both have the same 
>performance?
>
>
>1) Partition based on one folder per day
>LIKE date INT
>2) Partition based on one folder per year / month / day ( So it has three 
>folders) 
>LIKE year INT, month INT, day INT
>
>
>Thanks,
>Raj
>
>

Re: External Partition Table

2013-10-31 Thread Brad Ruderman
Wow, that question won't be answerable in general. It all depends on the
amount of data per partition and the queries you are going to be executing
on it, as well as the structure of the data. In general, in Hive (depending
on your cluster size) you need to balance the number of files against their
size: a smaller number of files is typically preferred, but partitions will
help when restricting by date.

Thx,
Brad


On Thu, Oct 31, 2013 at 3:34 PM, Raj Hadoop  wrote:

> Hi,
>
> I am planning for a Hive External Partition Table based on a date.
>
> Which one of the below yields a better performance or both have the same
> performance?
>
> 1) Partition based on one folder per day
> LIKE date INT
> 2) Partition based on one folder per year / month / day ( So it has three
> folders)
> LIKE year INT, month INT, day INT
>
> Thanks,
> Raj
>
>


External Partition Table

2013-10-31 Thread Raj Hadoop
Hi,

I am planning for a Hive External Partition Table based on a date.

Which of the options below yields better performance, or do both perform
the same?

1) Partition based on one folder per day,
like: date INT
2) Partition based on one folder per year / month / day (so it has three
folder levels),
like: year INT, month INT, day INT

Thanks,
Raj