Hey, I finally managed to improve spark-hive SQL performance a lot.
I had a problem with a topology_script.py that produced huge error traces in the logs and degraded Spark performance in Python mode. I fixed this by making the python2 script python3-ready.

I also had a problem with broadcast variables while joining tables; I simply deactivated that functionality.

As a result, our users are now able to use spark-hive with very limited resources (2 executors with 4 cores) and get decent performance for analytics.

Compared to JDBC Presto, this has several advantages:
- integrated solution
- single security layer (hive/kerberos)
- direct partitioned lazy datasets versus complicated jdbc dataset management
- more robust for analytics with less memory (apparently)

However, Presto still makes sense for sub-second analytics, OLTP-like queries and data discovery.

On Nov 5, 2017 at 13:57, Nicolas Paris wrote:
> Hi
>
> After some testing, I have been quite disappointed with the hiveContext way
> of accessing hive tables.
>
> The main problem is resource allocation: I have tons of users and they get
> a limited subset of workers. This does not allow them to query huge
> datasets, because too little memory is allocated (or maybe I am missing
> something).
>
> If using Hive jdbc, Hive resources are shared by all my users and then
> queries are able to finish.
>
> I have then been testing other jdbc-based approaches and for now, "presto"
> looks like the most appropriate solution to access hive tables.
>
> In order to load huge datasets into spark, the proposed approach is to use
> a Presto distributed CTAS to build an ORC dataset, and access that dataset
> through the spark dataframe loader (instead of direct jdbc access, which
> would break the driver memory).
>
> On Oct 15, 2017 at 19:24, Gourav Sengupta wrote:
> > Hi Nicolas,
> >
> > without the hive thrift server, if you try to run a select * on a table
> > which has around 10,000 partitions, SPARK will give you some surprises.
> > PRESTO works fine in these scenarios, and I am sure the SPARK community
> > will soon learn from their algorithms.
> >
> > Regards,
> > Gourav
> >
> > On Sun, Oct 15, 2017 at 3:43 PM, Nicolas Paris <nipari...@gmail.com> wrote:
> > > > I do not think that SPARK will automatically determine the
> > > > partitions. Actually it does not automatically determine the
> > > > partitions. In case a table has a few million records, it all goes
> > > > through the driver.
> > >
> > > Hi Gourav
> > >
> > > Actually the spark jdbc driver is able to deal directly with partitions.
> > > Spark creates a jdbc connection for each partition.
> > >
> > > All details are explained in this post:
> > > http://www.gatorsmile.io/numpartitionsinjdbc/
> > >
> > > Also an example with the greenplum database:
> > > http://engineering.pivotal.io/post/getting-started-with-greenplum-spark/
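Deactivating the broadcast join machinery is a single setting. Assuming it goes in spark-defaults.conf (it can equally be set per session), the change was along these lines:

```
# spark-defaults.conf
# -1 disables automatic broadcast joins entirely
spark.sql.autoBroadcastJoinThreshold  -1
```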
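The Presto CTAS route described above would look roughly like this (catalog, schema and table names are made up for illustration):

```sql
-- Materialize a heavy query as ORC with Presto's distributed CTAS
CREATE TABLE hive.staging.events_2017
WITH (format = 'ORC')
AS
SELECT *
FROM hive.warehouse.events
WHERE year = 2017;
```

Spark can then load the result lazily with `spark.read.orc("hdfs:///path/to/staging/events_2017")` (path made up) instead of pulling everything through a single JDBC connection into the driver.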
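For reference, the python3 port of the topology script mostly boiled down to fixing the print statements. A minimal, hypothetical sketch (the real script's rack mapping is site-specific, so the mapping below is made up):

```python
#!/usr/bin/env python3
# Hypothetical sketch of a python3-ready topology_script.py.
# Hadoop invokes it with host names/IPs as arguments and expects
# one rack path per line on stdout.
import sys

DEFAULT_RACK = "/default/rack"

def resolve_rack(host):
    # A real script maps IP prefixes or host names to racks;
    # here everything falls back to the default rack.
    return DEFAULT_RACK

if __name__ == "__main__":
    # python2 versions used `print rack`; python3 requires print(...).
    for host in sys.argv[1:]:
        print(resolve_rack(host))
```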
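To illustrate the connection-per-partition behaviour described in the gatorsmile post: here is a simplified pure-Python sketch (not Spark's exact algorithm, which also handles NULLs and type details) of how a numeric lowerBound/upperBound range is cut into per-partition WHERE clauses:

```python
def jdbc_partition_predicates(column, lower, upper, num_partitions):
    """Split the range [lower, upper) into per-partition WHERE clauses,
    roughly the way Spark's JDBC source does. The first and last
    partitions are left open-ended so no row is missed."""
    if num_partitions <= 1:
        return []  # a single partition needs no predicate: full scan
    stride = (upper - lower) // num_partitions
    predicates = []
    for i in range(num_partitions):
        lo = lower + i * stride
        hi = lo + stride
        if i == 0:
            predicates.append(f"{column} < {hi}")
        elif i == num_partitions - 1:
            predicates.append(f"{column} >= {lo}")
        else:
            predicates.append(f"{column} >= {lo} AND {column} < {hi}")
    return predicates

# Each predicate becomes one task with its own JDBC connection:
# jdbc_partition_predicates("id", 0, 100, 4)
#   -> ["id < 25", "id >= 25 AND id < 50", "id >= 50 AND id < 75", "id >= 75"]
```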