Re: ORC vs TEXT file

2013-08-12 Thread pandees waran
Thanks Owen and Hongzhan.
I understand the behavior now.




On Tue, Aug 13, 2013 at 6:28 AM, hongzhan li wrote:

> If you select all the columns, ORC will not be faster than a text file. But
> if you select only some of the columns (not all of them), ORC will run faster.
> —
>
>
> On Mon, Aug 12, 2013 at 6:40 PM, pandees waran  wrote:
>
>> Hi,
>>
>> Currently, we use the TEXTFILE format in Hive 0.8 when creating
>> external tables for intermediate processing.
>> I have read about ORC in 0.11 and have created the same table in 0.11
>> with the ORC format.
>> Without any compression, the ORC table (3 files in total) occupied
>> roughly twice the space of the TEXTFILE table (a single file).
>> Also, when I query the data from ORC:
>> Select count(*) from orc_table
>>
>> It took more time than the same query against the text file,
>> but I see that the cumulative CPU time is lower for ORC than for the text file.
>>
>> What sort of queries will benefit if we use ORC?
>> In which cases would TEXTFILE be preferred over ORC?
>>
>> Thanks.
>>
>
>


-- 
Thanks,
Pandeeswaran


Re: ORC vs TEXT file

2013-08-12 Thread hongzhan li
If you select all the columns, ORC will not be faster than a text file. But if you
select only some of the columns (not all of them), ORC will run faster.
—
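
A minimal sketch of the contrast (the table name orders and its columns are
hypothetical):

-- Reads every column stream; little advantage over a plain text file.
SELECT * FROM orders;

-- Reads only the order_id and amount streams, so far less data is scanned.
SELECT order_id, amount FROM orders;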

On Mon, Aug 12, 2013 at 6:40 PM, pandees waran  wrote:

> Hi,
> Currently, we use the TEXTFILE format in Hive 0.8 when creating
> external tables for intermediate processing.
> I have read about ORC in 0.11 and have created the same table in 0.11
> with the ORC format.
> Without any compression, the ORC table (3 files in total) occupied
> roughly twice the space of the TEXTFILE table (a single file).
> Also, when I query the data from ORC:
> Select count(*) from orc_table
> It took more time than the same query against the text file,
> but I see that the cumulative CPU time is lower for ORC than for the text file.
> What sort of queries will benefit if we use ORC?
> In which cases would TEXTFILE be preferred over ORC?
> Thanks.

Re: ORC vs TEXT file

2013-08-12 Thread Owen O'Malley
OK, given the large number of doubles in the schema and the bzip2 compression,
I can see why the text version would be smaller.

ORC doesn't do compression on floats or doubles, although there is a JIRA to
add it. (https://issues.apache.org/jira/browse/HIVE-3889)

Bzip2 is a very aggressive compressor. We should probably add it as an
option for ORC for long-term storage files. (
https://issues.apache.org/jira/browse/HIVE-5067)
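
In the meantime, ORC's own codecs can be enabled per table. A hedged sketch
(the table name and columns are hypothetical; orc.compress is the standard ORC
table property):

-- Hypothetical table; orc.compress selects ORC's internal codec.
CREATE TABLE test_orc_zlib (
  col1 BIGINT,
  col2 STRING,
  col17 DOUBLE
)
STORED AS ORC
TBLPROPERTIES ("orc.compress"="ZLIB");  -- NONE, ZLIB, or SNAPPY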

ORC will give you significant performance gains when you select a subset of
the columns.

select COL70, COL74 from test;

will run much faster on ORC than on text files.



On Mon, Aug 12, 2013 at 9:27 AM, pandees waran  wrote:

> Hi Owen,
>
> Thanks for your response.
>
> My table structure is as follows:
>
> a) Textfile:
> CREATE EXTERNAL TABLE test_textfile (
> COL1 BIGINT,
> COL2 STRING,
> COL3 BIGINT,
> COL4 STRING,
> COL5 STRING,
> COL6 BIGINT,
> COL7 BIGINT,
> COL8 BIGINT,
> COL9 BIGINT,
> COl10 BIGINT,
> COl11 BIGINT,
> COL12 STRING,
> COl13 STRING,
> COl14 STRING,
> COl15 BIGINT,
> COl16 STRING,
> COL17 DOUBLE,
> COl18 DOUBLE,
> COl19 DOUBLE,
> COl20 DOUBLE,
> COl21 DOUBLE,
> COL22 DOUBLE,
> COl23 DOUBLE,
> COL24 DOUBLE,
> COl25 DOUBLE,
> COL26 DOUBLE,
> COl27 DOUBLE,
> COL28 DOUBLE,
> COL29 DOUBLE,
> COl30 DOUBLE,
> COl31 DOUBLE,
> COL32 DOUBLE,
> COL33 STRING,
> COl34 STRING,
> COl35 DOUBLE,
> COL36 DOUBLE,
> COl37 DOUBLE,
> COL38 DOUBLE,
> COl39 DOUBLE,
> COL40 DOUBLE,
> COl41 DOUBLE,
> COL42 DOUBLE,
> COL43 DOUBLE,
> COl44 DOUBLE,
> COl45 DOUBLE,
> COL46 DOUBLE,
> COL47 DOUBLE,
> COl48 DOUBLE,
> COl49 DOUBLE,
> COL50 DOUBLE,
> COL51 DOUBLE,
> COl52 DOUBLE,
> COl53 DOUBLE,
> COl54 DOUBLE,
> COL55 DOUBLE,
> COL56 STRING,
> COL57 DOUBLE,
> COL58 DOUBLE,
> COL59 DOUBLE,
> COl60 DOUBLE,
> COl61 STRING,
> COL62 STRING,
> COL63 STRING,
> COL64 STRING,
> COl65 STRING,
> COl66 STRING,
> COl67 STRING,
> COL68 STRING,
> Col69 STRING,
> COL70 STRING,
> COL71 STRING,
> COl72 STRING,
> COl73 STRING,
> COL74  STRING
> ) PARTITIONED BY (
> COL75 STRING,
> COL76 STRING
> ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
> STORED AS TEXTFILE LOCATION 's3://test/textfile/';
> Using block-level compression and the BZip2Codec for output.
>
> b) With the same set of columns, I only changed the storage clause to STORED AS
> ORC for the ORC table. No compression option is used.
>
> c) Inserted 7,256,852 records into both tables.
>
> d) Space occupied in S3:
>
> ORC (3 files): 153.4 MB * 3 = 460.2 MB
> TEXT (single bzip2-compressed file): 306 MB
>
> I need to check ORC with compression enabled.
>
> Please let me know if I have missed anything.
>
> Thanks,
>
>
>
>
> On Mon, Aug 12, 2013 at 8:50 PM, Owen O'Malley  wrote:
>
>> Pandees,
>>   I've never seen a table that was larger with ORC than with text. Can
>> you share your text file's schema with us? Is the table very small? How
>> many rows and how many GB are the tables? The overhead for ORC is typically
>> small, but as Ed says, in rare cases it is possible for the overhead to
>> dominate the data size itself.
>>
>> -- Owen
>>
>>
>> On Mon, Aug 12, 2013 at 6:52 AM, pandees waran wrote:
>>
>>> Thanks Edward.  I shall try compression with ORC and let you know.
>>> Also, it looks like the CPU usage is lower when querying ORC rather
>>> than the text file.
>>> But the total time taken by the query is slightly higher for ORC than for the
>>> text file.  Could you please explain the difference between the cumulative CPU
>>> time and the total time taken (usually reported on the last line, in seconds)?
>>> Which one should we give preference to?
>>> On Aug 12, 2013 7:01 PM, "Edward Capriolo" 
>>> wrote:
>>>
 Columnar formats do not always beat row-wise storage. Many times gzip
 plus block storage will compress data better than columnar storage,
 especially when you have repeated data in different columns.

 Based on what you are saying, it is possible that you missed a
 setting and the ORC files are not compressed.


 On Monday, August 12, 2013, pandees waran  wrote:
 > Hi,
 >
 > Currently, we use the TEXTFILE format in Hive 0.8 when creating
 > external tables for intermediate processing.
 > I have read about ORC in 0.11 and have created the same table in 0.11
 > with the ORC format.
 > Without any compression, the ORC table (3 files in total) occupied
 > roughly twice the space of the TEXTFILE table (a single file).
 > Also, when I query the data from ORC:
 > Select count(*) from orc_table
 >
 > It took more time than the same query against the text file,
 > but I see that the cumulative CPU time is lower for ORC than for the text file.
 >
 > What sort of queries will benefit if we use ORC?
 > In which cases would TEXTFILE be preferred over ORC?

Re: ORC vs TEXT file

2013-08-12 Thread pandees waran
Hi Owen,

Thanks for your response.

My table structure is as follows:

a) Textfile:
CREATE EXTERNAL TABLE test_textfile (
COL1 BIGINT,
COL2 STRING,
COL3 BIGINT,
COL4 STRING,
COL5 STRING,
COL6 BIGINT,
COL7 BIGINT,
COL8 BIGINT,
COL9 BIGINT,
COl10 BIGINT,
COl11 BIGINT,
COL12 STRING,
COl13 STRING,
COl14 STRING,
COl15 BIGINT,
COl16 STRING,
COL17 DOUBLE,
COl18 DOUBLE,
COl19 DOUBLE,
COl20 DOUBLE,
COl21 DOUBLE,
COL22 DOUBLE,
COl23 DOUBLE,
COL24 DOUBLE,
COl25 DOUBLE,
COL26 DOUBLE,
COl27 DOUBLE,
COL28 DOUBLE,
COL29 DOUBLE,
COl30 DOUBLE,
COl31 DOUBLE,
COL32 DOUBLE,
COL33 STRING,
COl34 STRING,
COl35 DOUBLE,
COL36 DOUBLE,
COl37 DOUBLE,
COL38 DOUBLE,
COl39 DOUBLE,
COL40 DOUBLE,
COl41 DOUBLE,
COL42 DOUBLE,
COL43 DOUBLE,
COl44 DOUBLE,
COl45 DOUBLE,
COL46 DOUBLE,
COL47 DOUBLE,
COl48 DOUBLE,
COl49 DOUBLE,
COL50 DOUBLE,
COL51 DOUBLE,
COl52 DOUBLE,
COl53 DOUBLE,
COl54 DOUBLE,
COL55 DOUBLE,
COL56 STRING,
COL57 DOUBLE,
COL58 DOUBLE,
COL59 DOUBLE,
COl60 DOUBLE,
COl61 STRING,
COL62 STRING,
COL63 STRING,
COL64 STRING,
COl65 STRING,
COl66 STRING,
COl67 STRING,
COL68 STRING,
Col69 STRING,
COL70 STRING,
COL71 STRING,
COl72 STRING,
COl73 STRING,
COL74  STRING
) PARTITIONED BY (
COL75 STRING,
COL76 STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
STORED AS TEXTFILE LOCATION 's3://test/textfile/';
Using block-level compression and the BZip2Codec for output.
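
A minimal sketch of the session settings this likely corresponds to (these are
standard Hive/Hadoop properties of that era; their exact use here is an
assumption):

-- Assumed settings for block-level bzip2 compression of the job output.
SET hive.exec.compress.output=true;
SET mapred.output.compression.type=BLOCK;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec;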

b) With the same set of columns, I only changed the storage clause to STORED AS
ORC for the ORC table. No compression option is used.
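
A hedged sketch of that ORC variant (the table name, location, and the three
representative columns shown are assumptions; the full column list matches the
TEXTFILE DDL above):

-- Hypothetical ORC counterpart of the table above (columns abbreviated).
CREATE EXTERNAL TABLE test_orcfile (
  COL1 BIGINT,
  COL2 STRING,
  COL17 DOUBLE
) PARTITIONED BY (
  COL75 STRING,
  COL76 STRING
)
STORED AS ORC
LOCATION 's3://test/orcfile/';
-- ORC's internal codec can be controlled via the orc.compress table property
-- (NONE, ZLIB, or SNAPPY), as in the earlier sketch.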

c) Inserted 7,256,852 records into both tables.

d) Space occupied in S3:

ORC (3 files): 153.4 MB * 3 = 460.2 MB
TEXT (single bzip2-compressed file): 306 MB

I need to check ORC with compression enabled.

Please let me know if I have missed anything.

Thanks,




On Mon, Aug 12, 2013 at 8:50 PM, Owen O'Malley  wrote:

> Pandees,
>   I've never seen a table that was larger with ORC than with text. Can you
> share your text file's schema with us? Is the table very small? How many
> rows and how many GB are the tables? The overhead for ORC is typically small,
> but as Ed says, in rare cases it is possible for the overhead to dominate the
> data size itself.
>
> -- Owen
>
>
> On Mon, Aug 12, 2013 at 6:52 AM, pandees waran  wrote:
>
>> Thanks Edward.  I shall try compression with ORC and let you know. Also,
>> it looks like the CPU usage is lower when querying ORC rather
>> than the text file.
>> But the total time taken by the query is slightly higher for ORC than for the
>> text file.  Could you please explain the difference between the cumulative CPU
>> time and the total time taken (usually reported on the last line, in seconds)?
>> Which one should we give preference to?
>> On Aug 12, 2013 7:01 PM, "Edward Capriolo"  wrote:
>>
>>> Columnar formats do not always beat row-wise storage. Many times gzip
>>> plus block storage will compress data better than columnar storage,
>>> especially when you have repeated data in different columns.
>>>
>>> Based on what you are saying, it is possible that you missed a
>>> setting and the ORC files are not compressed.
>>>
>>>
>>> On Monday, August 12, 2013, pandees waran  wrote:
>>> > Hi,
>>> >
>>> > Currently, we use the TEXTFILE format in Hive 0.8 when creating
>>> > external tables for intermediate processing.
>>> > I have read about ORC in 0.11 and have created the same table in 0.11
>>> > with the ORC format.
>>> > Without any compression, the ORC table (3 files in total) occupied
>>> > roughly twice the space of the TEXTFILE table (a single file).
>>> > Also, when I query the data from ORC:
>>> > Select count(*) from orc_table
>>> >
>>> > It took more time than the same query against the text file,
>>> > but I see that the cumulative CPU time is lower for ORC than for the text file.
>>> >
>>> > What sort of queries will benefit if we use ORC?
>>> > In which cases would TEXTFILE be preferred over ORC?
>>> >
>>> > Thanks.
>>> >
>>
>>
>


-- 
Thanks,
Pandeeswaran


Re: ORC vs TEXT file

2013-08-12 Thread Owen O'Malley
Pandees,
  I've never seen a table that was larger with ORC than with text. Can you
share your text file's schema with us? Is the table very small? How many
rows and how many GB are the tables? The overhead for ORC is typically small,
but as Ed says, in rare cases it is possible for the overhead to dominate the
data size itself.

-- Owen


On Mon, Aug 12, 2013 at 6:52 AM, pandees waran  wrote:

> Thanks Edward.  I shall try compression with ORC and let you know. Also,
> it looks like the CPU usage is lower when querying ORC rather
> than the text file.
> But the total time taken by the query is slightly higher for ORC than for the
> text file.  Could you please explain the difference between the cumulative CPU
> time and the total time taken (usually reported on the last line, in seconds)?
> Which one should we give preference to?
> On Aug 12, 2013 7:01 PM, "Edward Capriolo"  wrote:
>
>> Columnar formats do not always beat row-wise storage. Many times gzip plus
>> block storage will compress data better than columnar storage,
>> especially when you have repeated data in different columns.
>>
>> Based on what you are saying, it is possible that you missed a
>> setting and the ORC files are not compressed.
>>
>>
>> On Monday, August 12, 2013, pandees waran  wrote:
>> > Hi,
>> >
>> > Currently, we use the TEXTFILE format in Hive 0.8 when creating
>> > external tables for intermediate processing.
>> > I have read about ORC in 0.11 and have created the same table in 0.11
>> > with the ORC format.
>> > Without any compression, the ORC table (3 files in total) occupied
>> > roughly twice the space of the TEXTFILE table (a single file).
>> > Also, when I query the data from ORC:
>> > Select count(*) from orc_table
>> >
>> > It took more time than the same query against the text file,
>> > but I see that the cumulative CPU time is lower for ORC than for the text file.
>> >
>> > What sort of queries will benefit if we use ORC?
>> > In which cases would TEXTFILE be preferred over ORC?
>> >
>> > Thanks.
>> >
>
>


Re: ORC vs TEXT file

2013-08-12 Thread pandees waran
Thanks Edward.  I shall try compression with ORC and let you know. Also,
it looks like the CPU usage is lower when querying ORC rather than
the text file.
But the total time taken by the query is slightly higher for ORC than for the
text file.  Could you please explain the difference between the cumulative CPU
time and the total time taken (usually reported on the last line, in seconds)?
Which one should we give preference to?
On Aug 12, 2013 7:01 PM, "Edward Capriolo"  wrote:

> Columnar formats do not always beat row-wise storage. Many times gzip plus
> block storage will compress data better than columnar storage,
> especially when you have repeated data in different columns.
>
> Based on what you are saying, it is possible that you missed a
> setting and the ORC files are not compressed.
>
>
> On Monday, August 12, 2013, pandees waran  wrote:
> > Hi,
> >
> > Currently, we use the TEXTFILE format in Hive 0.8 when creating
> > external tables for intermediate processing.
> > I have read about ORC in 0.11 and have created the same table in 0.11
> > with the ORC format.
> > Without any compression, the ORC table (3 files in total) occupied
> > roughly twice the space of the TEXTFILE table (a single file).
> > Also, when I query the data from ORC:
> > Select count(*) from orc_table
> >
> > It took more time than the same query against the text file,
> > but I see that the cumulative CPU time is lower for ORC than for the text file.
> >
> > What sort of queries will benefit if we use ORC?
> > In which cases would TEXTFILE be preferred over ORC?
> >
> > Thanks.
> >


Re: ORC vs TEXT file

2013-08-12 Thread Edward Capriolo
Columnar formats do not always beat row-wise storage. Many times gzip plus
block storage will compress data better than columnar storage,
especially when you have repeated data in different columns.

Based on what you are saying, it is possible that you missed a setting
and the ORC files are not compressed.
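
A quick, hedged way to check for such a missed setting (the table name is
hypothetical; look for an orc.compress entry, typically NONE, ZLIB, or SNAPPY):

-- Inspect the storage format and table properties of the ORC table.
SHOW TBLPROPERTIES test_orcfile;
DESCRIBE FORMATTED test_orcfile;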


On Monday, August 12, 2013, pandees waran  wrote:
> Hi,
>
> Currently, we use the TEXTFILE format in Hive 0.8 when creating
> external tables for intermediate processing.
> I have read about ORC in 0.11 and have created the same table in 0.11
> with the ORC format.
> Without any compression, the ORC table (3 files in total) occupied
> roughly twice the space of the TEXTFILE table (a single file).
> Also, when I query the data from ORC:
> Select count(*) from orc_table
>
> It took more time than the same query against the text file,
> but I see that the cumulative CPU time is lower for ORC than for the text file.
>
> What sort of queries will benefit if we use ORC?
> In which cases would TEXTFILE be preferred over ORC?
>
> Thanks.
>


ORC vs TEXT file

2013-08-12 Thread pandees waran
Hi,

Currently, we use the TEXTFILE format in Hive 0.8 when creating
external tables for intermediate processing.
I have read about ORC in 0.11 and have created the same table in 0.11
with the ORC format.
Without any compression, the ORC table (3 files in total) occupied
roughly twice the space of the TEXTFILE table (a single file).
Also, when I query the data from ORC:
Select count(*) from orc_table

It took more time than the same query against the text file,
but I see that the cumulative CPU time is lower for ORC than for the text file.

What sort of queries will benefit if we use ORC?
In which cases would TEXTFILE be preferred over ORC?

Thanks.