Re: orc vs parquet aggregation, orc is really slow

Mich Talebzadeh Sat, 16 Apr 2016 04:14:44 -0700

Apologies that should read

*desc formatted <table_name>*


Example for table dummy

hive> desc formatted  dummy;
OK
# col_name              data_type               comment
id                      int
clustered               int
scattered               int
randomised              int
random_string           varchar(50)
small_vc                varchar(10)
padding                 varchar(10)
# Detailed Table Information
Database:               test
Owner:                  hduser
CreateTime:             Sun Jan 31 06:09:56 GMT 2016
LastAccessTime:         UNKNOWN
Retention:              0
Location:
hdfs://rhes564:9000/user/hive/warehouse/test.db/dummy
Table Type:             MANAGED_TABLE
Table Parameters:
        COLUMN_STATS_ACCURATE   {\"BASIC_STATS\":\"true\"}
        numFiles                35
        numRows                 100000000
        orc.bloom.filter.columns        ID
        orc.bloom.filter.fpp    0.05
        orc.compress            SNAPPY
        orc.create.index        true
        orc.row.index.stride    10000
        orc.stripe.size         16777216
        rawDataSize             0
        totalSize               5660813776
        transient_lastDdlTime   1458636871
# Storage Information
SerDe Library:          org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat:            org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Compressed:             No
Num Buckets:            256
Bucket Columns:         [id]
Sort Columns:           []
Storage Desc Params:
        serialization.format    1
Time taken: 0.065 seconds, Fetched: 42 row(s)

Store dates as dates.

To convert a string to date for a column like paymentdate (note in below UK
date formatted converted to Hive yyyy-MM-dd)use the following:

        
TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP(paymentdate,'dd/MM/yyyy'),'yyyy-MM-dd'))
AS paymentdate
HTH


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 16 April 2016 at 09:09, Maurin Lenglart <[email protected]> wrote:

> Hi,
> I have : 17970737 rows
> I tried to do a “desc formatted statistics myTable” but I get “Error
> while compiling statement: FAILED: SemanticException [Error 10001]: Table
> not found statistics”
> Even after doing something like : “ANALYZE TABLE myTable COMPUTE
> STATISTICS FOR COLUMNS"
>
> Thank you for your answer.
>
> From: Mich Talebzadeh <[email protected]>
> Date: Saturday, April 16, 2016 at 12:32 AM
> To: maurin lenglart <[email protected]>
> Cc: "user @spark" <[email protected]>
> Subject: Re: orc vs parquet aggregation, orc is really slow
>
> Have you analysed statistics on the ORC table? How many rows are there?
>
> Also send the outp of
>
> desc formatted statistics <TABLE_NAME>
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 16 April 2016 at 08:20, Maurin Lenglart <[email protected]> wrote:
>
>> Hi,
>> I am executing one query :
>> “SELECT `event_date` as `event_date`,sum(`bookings`) as
>> `bookings`,sum(`dealviews`) as `dealviews` FROM myTable WHERE  `event_date`
>> >= '2016-01-06' AND `event_date` <= '2016-04-02' GROUP BY `event_date`
>> LIMIT 20000”
>>
>> My table was created something like :
>>   CREATE TABLE myTable (
>>   bookings            DOUBLE
>>   , deal views          INT
>>   )
>>    STORED AS ORC or PARQUET
>>      PARTITION BY (event_date STRING)
>>
>> PARQUET take 9second of cumulative CPU
>> ORC take 50second of cumulative CPU.
>>
>> For ORC I have tried to hiveContext.setConf(“Spark.Sql.Orc.FilterPushdown
>> ”,“true”)
>> But it didn’t change anything
>>
>> I am missing something, or parquet is better for this type of query?
>>
>> I am using spark 1.6.0 with hive 1.1.0
>>
>> thanks
>>
>>
>>
>

Re: orc vs parquet aggregation, orc is really slow

Reply via email to