Querying Hive tables from Spark

2016-06-27 Thread Mich Talebzadeh
Hi, I have done some extensive tests with Spark querying Hive tables. It appears to me that Spark does not rely on statistics that are collected by Hive on, say, ORC tables. It seems that Spark uses its own optimization to query the Hive tables irrespective of what Hive has collected by way of statistic

Re: Optimize Hive Query

2016-06-27 Thread Mich Talebzadeh
Hi, Curious to see if this issue (performance) has been resolved after compaction? Thanks Dr Mich Talebzadeh

Enable hive logs in Hive-2.0.1

2016-06-27 Thread prabhu Mahendran
Hi All, How do I enable the Hive logs in Hive-2.0.1? I have tried the ordinary way, as in hive-1.2.1: just go into the log4j properties and set log.directory, and it creates the directory automatically. But in Hive-2.0.1 it is totally different: I just give the directory in the log4j2 properties bu

RE: Querying Hive tables from Spark

2016-06-27 Thread Markovitz, Dudu
Hi Mich, I could not figure out what point you are trying to make. Could you please clarify? Thanks Dudu From: Mich Talebzadeh [mailto:mich.talebza...@gmail.com] Sent: Monday, June 27, 2016 12:20 PM To: user @spark ; user Subject: Querying Hive tables from Spark Hi, I have done some e

Re: Hash table in map join - Hive

2016-06-27 Thread Gopal Vijayaraghavan
> 1. Is there a way to check the size of the hash table created during map side join in Hive/Tez? Only from the log files. However, if you enable hive.tez.exec.print.summary=true; then the Hive CLI will print out the total # of items shuffled from the broadcast edges feeding the hashtable. Not sure
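As a rough illustration of the setting mentioned in the reply above, the following Python sketch builds a Hive CLI invocation that turns on hive.tez.exec.print.summary for a single query. The helper name, the query, and the table names are hypothetical; only the property name comes from the thread, and --hiveconf and -e are the usual Hive CLI options for passing a property and an inline query.

```python
from typing import List

def hive_command_with_summary(query: str) -> List[str]:
    """Build (but do not run) a Hive CLI command with the Tez summary enabled."""
    return [
        "hive",
        "--hiveconf", "hive.tez.exec.print.summary=true",
        "-e", query,
    ]

# Hypothetical map-join query; table and column names are placeholders.
cmd = hive_command_with_summary(
    "SELECT /*+ MAPJOIN(d) */ f.id FROM fact f JOIN dim d ON f.id = d.id"
)
print(" ".join(cmd))
```

On a machine with Hive installed, the list could be handed to subprocess.run; the sketch itself only assembles the arguments.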

Re: Querying Hive tables from Spark

2016-06-27 Thread Gopal Vijayaraghavan
> It appears to me that Spark does not rely on statistics that are collected by Hive on say ORC tables. > It seems that Spark uses its own optimization to query the Hive tables irrespective of Hive has collected by way of statistics etc? Spark does not have a cost-based optimizer yet - please fo

Re: How does tez calculate the number of Mappers/Reducers?

2016-06-27 Thread Gopal Vijayaraghavan
> Correct me if I'm wrong but at this point isn't the number of splits calculated? Yes, you are correct, but the grouping kicks in after that. The real reason for grouping is that Shuffle operations are internally MxN and explode out of control if grouping hasn't been done. Running through 50
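To illustrate the MxN point in the reply above, here is a minimal Python sketch of how grouping splits into fewer mapper tasks shrinks the number of shuffle connections. All numbers are hypothetical and do not come from the thread, the function names are illustrative rather than part of Tez, and the fixed group size is a simplification of however Tez actually sizes its groups.

```python
import math

def shuffle_connections(num_mappers: int, num_reducers: int) -> int:
    """Every reducer fetches from every mapper, so the shuffle is M x N."""
    return num_mappers * num_reducers

def grouped_mappers(num_splits: int, splits_per_group: int) -> int:
    """Mapper tasks left after packing splits into fixed-size groups (simplified)."""
    return math.ceil(num_splits / splits_per_group)

# Hypothetical workload: 10,000 input splits feeding 500 reducers.
num_splits = 10_000
num_reducers = 500

ungrouped = shuffle_connections(num_splits, num_reducers)
grouped = shuffle_connections(grouped_mappers(num_splits, 20), num_reducers)
print(ungrouped, grouped)
```

With these made-up numbers, one task per split means 5,000,000 mapper-to-reducer fetches, while grouping 20 splits per task brings that down to 250,000.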

Re: Optimize Hive Query

2016-06-27 Thread Eugene Koifman
If you have many ACID tables, you almost certainly want more than 2 workers. If you have 2 workers (and a single metastore instance), you can run at most 2 compaction jobs at a time. Unless the tables are very small, compaction may fall behind if it's configured to run too serially. In order fo
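The limit described above can be sketched with simple arithmetic. This is a minimal Python illustration, assuming every compaction job takes the same amount of time and that each worker runs one job at a time; the function name and the job counts are hypothetical. In Hive, the worker count is normally set through the hive.compactor.worker.threads property on the metastore.

```python
import math

def compaction_rounds(pending_jobs: int, workers: int) -> int:
    """Back-to-back rounds needed when each worker runs one compaction at a time."""
    if workers < 1:
        raise ValueError("at least one compaction worker is required")
    return math.ceil(pending_jobs / workers)

# Hypothetical backlog of 10 tables waiting for compaction.
print(compaction_rounds(10, 2))  # rounds with 2 workers
print(compaction_rounds(10, 5))  # rounds with 5 workers
```

Under these assumptions the same backlog clears in fewer rounds as workers are added, which is why a small worker count can let compaction fall behind.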

Implementing a custom StorageHandler

2016-06-27 Thread Long, Andrew
Hello everyone, I'm in the process of implementing a custom StorageHandler and I had some questions. 1) What is the difference between org.apache.hadoop.mapred.InputFormat and org.apache.hadoop.mapreduce.InputFormat? 2) How is numSplits calculated in org.apache.hadoop.mapred.Input

Re: Querying Hive tables from Spark

2016-06-27 Thread Mich Talebzadeh
Thanks Gopal. I added a compact index to this table as below on 5 columns hive> show formatted indexes on sales2; OK idx_name tab_name col_names idx_tab_name idx_type comment sales2_idx sales2 prod_id, cust_id

Re: Hash table in map join - Hive

2016-06-27 Thread Ross Guth
Hi Gopal, Thanks a lot for the answers. They were helpful. I have a few more questions regarding this: 1. OOM condition -- I get the following error when I force a map join in Hive/Tez with low container size and heap size: "java.lang.OutOfMemoryError: Java heap space". I was wondering what is th

Re: Querying Hive tables from Spark

2016-06-27 Thread Gopal Vijayaraghavan
> I added a compact index to this table as below on 5 columns No, those are not what I recommend in this scenario. You made a statement that the table was sorted and it wasn't. >> Table is sorted in the order of prod_id, cust_id, time_id, channel_id and promo_id. It has 22 million rows. >> No

Re: Hash table in map join - Hive

2016-06-27 Thread Gopal Vijayaraghavan
> 1. OOM condition -- I get the following error when I force a map join in hive/tez with low container size and heap size: "java.lang.OutOfMemoryError: Java heap space". I was wondering what is the condition which leads to this error. You are not modifying the noconditionaltasksize to match th

External_Tables_Disadvantages

2016-06-27 Thread Ajay Chander
Hi Everyone, I would like to know the disadvantages of using external tables in Hive. I was told that "Managing security with Sentry will be very limited for external tables". Is it true? Can someone explain it please? Thank you. Regards, Aj