Hey

Finally I managed to improve the spark-hive sql performances a lot.

I had a problem with a topology_script.py that produced huge error
traces in the logs and reduced spark performances in python mode. I just
corrected the python2 scripts to be python3 ready.
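
For reference, the change is essentially converting python2 print
statements into python3 print calls. A minimal sketch of such a rack
topology script (the real topology_script.py shipped by a distribution
is longer, and the rack value here is hypothetical):

    #!/usr/bin/env python3
    # hadoop rack-awareness script: receives host names/IPs as
    # arguments and must print one rack path per host on stdout
    import sys

    DEFAULT_RACK = "/default-rack"

    # python2 wrote `print DEFAULT_RACK`; under python3 that is a
    # syntax error which hadoop logs as a huge trace on every call
    for _host in sys.argv[1:]:
        print(DEFAULT_RACK)
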
I also had a problem with broadcast variables while joining tables. I
just deactivated this functionality.
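
Deactivating it boils down to a single setting (a sketch, assuming the
usual SparkSession named spark; -1 disables the size threshold that
triggers automatic broadcast joins):

    # turn off automatic broadcast (map-side) joins entirely, so no
    # table ever gets shipped to every executor during a join
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")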

As a result our users are now able to use spark-hive with very limited
resources (2 executors with 4 cores each) and get decent performances
for analytics.
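
For the record, that sizing looks roughly like this (a sketch; the app
name and memory value are hypothetical):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("analytics")                     # hypothetical
             .config("spark.executor.instances", "2")  # 2 executors
             .config("spark.executor.cores", "4")      # 4 cores each
             .config("spark.executor.memory", "4g")    # hypothetical
             .enableHiveSupport()
             .getOrCreate())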

Compared to JDBC presto, this has several advantages:
- integrated solution
- single security layer (hive/kerberos)
- direct partitioned lazy datasets versus complicated jdbc dataset
  management (see the sketch after this list)
- more robust for analytics with less memory (apparently)
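
To illustrate the third point: on the spark-hive side a lazy,
partition-pruned dataset is a one-liner, whereas jdbc needs
hand-managed split columns and bounds (a sketch; table and column
names are hypothetical):

    # lazy: nothing is fetched yet, and the filter on the partition
    # column is pruned against the hive metastore
    events = spark.table("warehouse.events").where("day = '2017-11-01'")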

However presto still makes sense for sub-second analytics, OLTP-like
queries, and data discovery.

On 05 Nov 2017 at 13:57, Nicolas Paris wrote:
> Hi
> 
> After some testing, I have been quite disappointed with the hiveContext
> way of accessing hive tables.
> 
> The main problem is resource allocation: I have tons of users and they
> get a limited subset of workers. This does not allow querying huge
> datasets, because too little memory is allocated (or maybe I am missing
> something).
> 
> When using Hive jdbc, Hive resources are shared by all my users, and
> queries are then able to finish.
> 
> I have then been testing other jdbc-based approaches, and for now
> "presto" looks like the most appropriate solution to access hive tables.
> 
> In order to load huge datasets into spark, the proposed approach is to
> use a presto distributed CTAS to build an ORC dataset, and to access
> that dataset through spark's dataframe loading ability (instead of
> direct jdbc access, which would blow up the driver memory).
> 
> 
> 
> On 15 Oct 2017 at 19:24, Gourav Sengupta wrote:
> > Hi Nicolas,
> > 
> > without the hive thrift server, if you try to run a select * on a
> > table which has around 10,000 partitions, SPARK will give you some
> > surprises. PRESTO works fine in these scenarios, and I am sure the
> > SPARK community will soon learn from their algorithms.
> > 
> > 
> > Regards,
> > Gourav
> > 
> > On Sun, Oct 15, 2017 at 3:43 PM, Nicolas Paris <nipari...@gmail.com> wrote:
> > 
> >     > I do not think that SPARK will automatically determine the
> >     > partitions. Actually it does not automatically determine the
> >     > partitions. In case a table has a few million records, it all
> >     > goes through the driver.
> > 
> >     Hi Gourav
> > 
> >     Actually the spark jdbc driver is able to deal directly with
> >     partitions. Spark creates a jdbc connection for each partition.
> > 
> >     All details are explained in this post:
> >     http://www.gatorsmile.io/numpartitionsinjdbc/
> > 
> >     Also an example with greenplum database:
> >     http://engineering.pivotal.io/post/getting-started-with-greenplum-spark/
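> > 
> >     A minimal sketch of such a partitioned jdbc read (url, table and
> >     column names are hypothetical, and the presto jdbc driver is
> >     assumed to be on the classpath):
> > 
> >         # spark opens one jdbc connection per partition, each one
> >         # fetching its own slice of the event_id range
> >         df = (spark.read.format("jdbc")
> >               .option("url", "jdbc:presto://presto-host:8080/hive/default")
> >               .option("dbtable", "events")
> >               .option("partitionColumn", "event_id")
> >               .option("lowerBound", "1")
> >               .option("upperBound", "10000000")
> >               .option("numPartitions", "8")
> >               .load())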
> > 
> > 
