Re: [SQL] Reading from hive table is listing all files in S3

2016-08-03 Thread Mich Talebzadeh
I gather you have an issue where predicate pushdown is not taking place, so
it takes a long time to get the data?


   1. How many rows do you have?
   2. Which version of Hive are you on?
   3. Have you analyzed statistics for this Hive table?

In summary, you expect only the partition named in the query to be read, as
opposed to every partition?
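On point 3, statistics can be gathered in Hive; a minimal sketch against the
nadata table from this thread's DDL (partition columns assumed from that
definition, Hive 0.14+ syntax):

```sql
-- Compute basic and column-level statistics for all partitions so the
-- optimizer has something to work with when pruning and costing.
ANALYZE TABLE nadata PARTITION (day, mba_id, partition_id)
  COMPUTE STATISTICS;
ANALYZE TABLE nadata PARTITION (day, mba_id, partition_id)
  COMPUTE STATISTICS FOR COLUMNS;
```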

HTH

Dr Mich Talebzadeh



LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 3 August 2016 at 17:08, Mehdi Meziane  wrote:

> Hi Mich,
>
> The data is stored as parquet.
> The table definition looks like:
>
> CREATE EXTERNAL TABLE nadata (
> extract_date TIMESTAMP,
> date_formatted STRING,
> day_of_week INT,
> hour_of_day INT,
> entity_label STRING,
> entity_currency_id INT,
> entity_currency_label STRING,
> entity_margin_percentage FLOAT,
> entity2_id INT,
> entity2_label STRING,
> entity2_categories ARRAY,
> entity3_id INT,
> entity3_label STRING,
> entity3_categories ARRAY,
> entity4_id INT,
> entity4_hid INT,
> entity4_label STRING,
> entity4_total_budget DOUBLE
> )
> PARTITIONED BY (day STRING,mba_id BIGINT,partition_id INT)
> STORED AS PARQUET
> LOCATION 's3a://bucketname/'
>
> Do you think the definition can be the source of the problem?
> Thanks
>
> - Mail Original -
> De: "Mich Talebzadeh" 
> À: "Mehdi Meziane" 
> Cc: "user @spark" 
> Envoyé: Mercredi 3 Août 2016 16h47:46 GMT +01:00 Amsterdam / Berlin /
> Berne / Rome / Stockholm / Vienne
> Objet: Re: [SQL] Reading from hive table is listing all files in S3
>
>
> Hi,
>
> Do you have a schema definition for this Hive table?
>
> In what format is this table stored?
>
> HTH
>
>
>
>
>
>
> On 3 August 2016 at 15:03, Mehdi Meziane 
> wrote:
>
>> Hi all,
>>
>> We have a Hive table stored in S3 and registered in a Hive metastore.
>> This table is partitioned by a key "day".
>>
>> So we access this table through the Spark DataFrame API as:
>>
>> sqlContext.read()
>> .table("tablename")
>> .where(col("day").between("2016-08-01","2016-08-02"))
>>
>> When the job is launched, we can see that Spark has "table" tasks that
>> report a small duration (seconds) but take minutes.
>> In the logs we see that the paths for every partition are listed,
>> regardless of the partition key values, for minutes.
>>
>> 16/08/03 13:17:16 INFO HadoopFsRelation: Listing
>> s3a://buckets3/day=2016-07-24
>> 16/08/03 13:17:16 INFO HadoopFsRelation: Listing
>> s3a://buckets3/day=2016-07-25
>> 
>>
>> Is this normal behaviour? Can we specify something in read().table(),
>> maybe some options?
>> I tried to find such options but I cannot find anything.
>>
>> Thanks,
>> Mehdi
>>
>
>


Re: [SQL] Reading from hive table is listing all files in S3

2016-08-03 Thread Gourav Sengupta
From what I am observing, your path is s3a://buckets3/day=2016-07-25 and
the partition keys are day, mba_id and partition_id. Are the subfolders in
the form s3://buckets3/day=2016-07-25/mba_id=1122/partition_id=111?

Can you please also include the ADD PARTITION statement for one single
partition?
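For reference, such a statement typically looks like the following; the
partition values and location here are hypothetical, taken only from the
folder layout asked about above:

```sql
-- Register one partition explicitly with the metastore.
-- Values and LOCATION are illustrative, not from a real cluster.
ALTER TABLE nadata ADD IF NOT EXISTS
  PARTITION (day='2016-07-25', mba_id=1122, partition_id=111)
  LOCATION 's3a://bucketname/day=2016-07-25/mba_id=1122/partition_id=111';
```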

The other thing that is a bit confusing is that you have declared day as
STRING but are treating it as a DATE in your filter. Does that work?


Regards,
Gourav Sengupta

On Wed, Aug 3, 2016 at 5:08 PM, Mehdi Meziane 
wrote:

> Hi Mich,
>
> The data is stored as parquet.
> The table definition looks like:
>
> CREATE EXTERNAL TABLE nadata (
> extract_date TIMESTAMP,
> date_formatted STRING,
> day_of_week INT,
> hour_of_day INT,
> entity_label STRING,
> entity_currency_id INT,
> entity_currency_label STRING,
> entity_margin_percentage FLOAT,
> entity2_id INT,
> entity2_label STRING,
> entity2_categories ARRAY,
> entity3_id INT,
> entity3_label STRING,
> entity3_categories ARRAY,
> entity4_id INT,
> entity4_hid INT,
> entity4_label STRING,
> entity4_total_budget DOUBLE
> )
> PARTITIONED BY (day STRING,mba_id BIGINT,partition_id INT)
> STORED AS PARQUET
> LOCATION 's3a://bucketname/'
>
> Do you think the definition can be the source of the problem?
> Thanks
>
> - Mail Original -
> De: "Mich Talebzadeh" 
> À: "Mehdi Meziane" 
> Cc: "user @spark" 
> Envoyé: Mercredi 3 Août 2016 16h47:46 GMT +01:00 Amsterdam / Berlin /
> Berne / Rome / Stockholm / Vienne
> Objet: Re: [SQL] Reading from hive table is listing all files in S3
>
>
> Hi,
>
> Do you have a schema definition for this Hive table?
>
> In what format is this table stored?
>
> HTH
>
>
>
>
>
>
> On 3 August 2016 at 15:03, Mehdi Meziane 
> wrote:
>
>> Hi all,
>>
>> We have a Hive table stored in S3 and registered in a Hive metastore.
>> This table is partitioned by a key "day".
>>
>> So we access this table through the Spark DataFrame API as:
>>
>> sqlContext.read()
>> .table("tablename")
>> .where(col("day").between("2016-08-01","2016-08-02"))
>>
>> When the job is launched, we can see that Spark has "table" tasks that
>> report a small duration (seconds) but take minutes.
>> In the logs we see that the paths for every partition are listed,
>> regardless of the partition key values, for minutes.
>>
>> 16/08/03 13:17:16 INFO HadoopFsRelation: Listing
>> s3a://buckets3/day=2016-07-24
>> 16/08/03 13:17:16 INFO HadoopFsRelation: Listing
>> s3a://buckets3/day=2016-07-25
>> 
>>
>> Is this normal behaviour? Can we specify something in read().table(),
>> maybe some options?
>> I tried to find such options but I cannot find anything.
>>
>> Thanks,
>> Mehdi
>>
>
>


Re: [SQL] Reading from hive table is listing all files in S3

2016-08-03 Thread Mehdi Meziane
Hi Mich, 


The data is stored as parquet. 
The table definition looks like:



CREATE EXTERNAL TABLE nadata ( 
extract_date TIMESTAMP, 
date_formatted STRING, 
day_of_week INT, 
hour_of_day INT, 
entity_label STRING, 
entity_currency_id INT, 
entity_currency_label STRING, 
entity_margin_percentage FLOAT, 
entity2_id INT, 
entity2_label STRING, 
entity2_categories ARRAY, 
entity3_id INT, 
entity3_label STRING, 
entity3_categories ARRAY, 
entity4_id INT, 
entity4_hid INT, 
entity4_label STRING, 
entity4_total_budget DOUBLE 
) 

PARTITIONED BY (day STRING,mba_id BIGINT,partition_id INT) 
STORED AS PARQUET 
LOCATION 's3a://bucketname/' 


Do you think the definition can be the source of the problem?
Thanks 

- Mail Original - 
De: "Mich Talebzadeh"  
À: "Mehdi Meziane"  
Cc: "user @spark"  
Envoyé: Mercredi 3 Août 2016 16h47:46 GMT +01:00 Amsterdam / Berlin / Berne / 
Rome / Stockholm / Vienne 
Objet: Re: [SQL] Reading from hive table is listing all files in S3 



Hi, 


Do you have a schema definition for this Hive table? 


In what format is this table stored?


HTH 





On 3 August 2016 at 15:03, Mehdi Meziane < mehdi.mezi...@ldmobile.net > wrote: 





Hi all, 


We have a Hive table stored in S3 and registered in a Hive metastore.
This table is partitioned by a key "day".

So we access this table through the Spark DataFrame API as:


sqlContext.read() 
.table("tablename")
.where(col("day").between("2016-08-01","2016-08-02")) 


When the job is launched, we can see that Spark has "table" tasks that
report a small duration (seconds) but take minutes.
In the logs we see that the paths for every partition are listed, regardless
of the partition key values, for minutes.


16/08/03 13:17:16 INFO HadoopFsRelation: Listing s3a://buckets3/day=2016-07-24 
16/08/03 13:17:16 INFO HadoopFsRelation: Listing s3a://buckets3/day=2016-07-25 
 


Is this normal behaviour? Can we specify something in read().table(), maybe
some options?
I tried to find such options but I cannot find anything.


Thanks, 
Mehdi 


Re: [SQL] Reading from hive table is listing all files in S3

2016-08-03 Thread Mich Talebzadeh
Hi,

Do you have a schema definition for this Hive table?

In what format is this table stored?

HTH






On 3 August 2016 at 15:03, Mehdi Meziane  wrote:

> Hi all,
>
> We have a Hive table stored in S3 and registered in a Hive metastore.
> This table is partitioned by a key "day".
>
> So we access this table through the Spark DataFrame API as:
>
> sqlContext.read()
> .table("tablename")
> .where(col("day").between("2016-08-01","2016-08-02"))
>
> When the job is launched, we can see that Spark has "table" tasks that
> report a small duration (seconds) but take minutes.
> In the logs we see that the paths for every partition are listed,
> regardless of the partition key values, for minutes.
>
> 16/08/03 13:17:16 INFO HadoopFsRelation: Listing
> s3a://buckets3/day=2016-07-24
> 16/08/03 13:17:16 INFO HadoopFsRelation: Listing
> s3a://buckets3/day=2016-07-25
> 
>
> Is this normal behaviour? Can we specify something in read().table(),
> maybe some options?
> I tried to find such options but I cannot find anything.
>
> Thanks,
> Mehdi
>


[SQL] Reading from hive table is listing all files in S3

2016-08-03 Thread Mehdi Meziane

Hi all, 


We have a Hive table stored in S3 and registered in a Hive metastore.
This table is partitioned by a key "day".

So we access this table through the Spark DataFrame API as:


sqlContext.read() 
.table("tablename")
.where(col("day").between("2016-08-01","2016-08-02")) 


When the job is launched, we can see that Spark has "table" tasks that
report a small duration (seconds) but take minutes.
In the logs we see that the paths for every partition are listed, regardless
of the partition key values, for minutes.


16/08/03 13:17:16 INFO HadoopFsRelation: Listing s3a://buckets3/day=2016-07-24 
16/08/03 13:17:16 INFO HadoopFsRelation: Listing s3a://buckets3/day=2016-07-25 
 


Is this normal behaviour? Can we specify something in read().table(), maybe
some options?
I tried to find such options but I cannot find anything.
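One avenue worth trying, as a sketch only (assuming the setting applies to
your Spark version, and with tablename standing in for your real table):
enable metastore-side partition pruning and filter on the partition column
in SQL, so that only the matching partitions should be listed:

```sql
-- Sketch only: ask the Hive metastore for just the matching partitions
-- instead of listing every s3a:// prefix up front.
-- spark.sql.hive.metastorePartitionPruning exists in Spark 1.5+ and
-- defaults to false in that line of releases.
SET spark.sql.hive.metastorePartitionPruning=true;

SELECT *
FROM tablename
WHERE day BETWEEN '2016-08-01' AND '2016-08-02';
```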


Thanks, 
Mehdi