Apologies that should read
*desc formatted <table_name>*
Example for table dummy
hive> desc formatted dummy;
OK
# col_name data_type comment
id int
clustered int
scattered int
randomised int
random_string varchar(50)
small_vc varchar(10)
padding varchar(10)
# Detailed Table Information
Database: test
Owner: hduser
CreateTime: Sun Jan 31 06:09:56 GMT 2016
LastAccessTime: UNKNOWN
Retention: 0
Location:
hdfs://rhes564:9000/user/hive/warehouse/test.db/dummy
Table Type: MANAGED_TABLE
Table Parameters:
COLUMN_STATS_ACCURATE {\"BASIC_STATS\":\"true\"}
numFiles 35
numRows 100000000
orc.bloom.filter.columns ID
orc.bloom.filter.fpp 0.05
orc.compress SNAPPY
orc.create.index true
orc.row.index.stride 10000
orc.stripe.size 16777216
rawDataSize 0
totalSize 5660813776
transient_lastDdlTime 1458636871
# Storage Information
SerDe Library: org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Compressed: No
Num Buckets: 256
Bucket Columns: [id]
Sort Columns: []
Storage Desc Params:
serialization.format 1
Time taken: 0.065 seconds, Fetched: 42 row(s)
Store dates as dates.
To convert a string to date for a column like paymentdate (note in below UK
date formatted converted to Hive yyyy-MM-dd)use the following:
TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP(paymentdate,'dd/MM/yyyy'),'yyyy-MM-dd'))
AS paymentdate
HTH
Dr Mich Talebzadeh
LinkedIn *
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
http://talebzadehmich.wordpress.com
On 16 April 2016 at 09:09, Maurin Lenglart <[email protected]> wrote:
> Hi,
> I have : 17970737 rows
> I tried to do a “desc formatted statistics myTable” but I get “Error
> while compiling statement: FAILED: SemanticException [Error 10001]: Table
> not found statistics”
> Even after doing something like : “ANALYZE TABLE myTable COMPUTE
> STATISTICS FOR COLUMNS"
>
> Thank you for your answer.
>
> From: Mich Talebzadeh <[email protected]>
> Date: Saturday, April 16, 2016 at 12:32 AM
> To: maurin lenglart <[email protected]>
> Cc: "user @spark" <[email protected]>
> Subject: Re: orc vs parquet aggregation, orc is really slow
>
> Have you analysed statistics on the ORC table? How many rows are there?
>
> Also send the outp of
>
> desc formatted statistics <TABLE_NAME>
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn *
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 16 April 2016 at 08:20, Maurin Lenglart <[email protected]> wrote:
>
>> Hi,
>> I am executing one query :
>> “SELECT `event_date` as `event_date`,sum(`bookings`) as
>> `bookings`,sum(`dealviews`) as `dealviews` FROM myTable WHERE `event_date`
>> >= '2016-01-06' AND `event_date` <= '2016-04-02' GROUP BY `event_date`
>> LIMIT 20000”
>>
>> My table was created something like :
>> CREATE TABLE myTable (
>> bookings DOUBLE
>> , deal views INT
>> )
>> STORED AS ORC or PARQUET
>> PARTITION BY (event_date STRING)
>>
>> PARQUET take 9second of cumulative CPU
>> ORC take 50second of cumulative CPU.
>>
>> For ORC I have tried to hiveContext.setConf(“Spark.Sql.Orc.FilterPushdown
>> ”,“true”)
>> But it didn’t change anything
>>
>> I am missing something, or parquet is better for this type of query?
>>
>> I am using spark 1.6.0 with hive 1.1.0
>>
>> thanks
>>
>>
>>
>