On 05 Nov 2017 at 14:11, Gourav Sengupta wrote:

> thanks a ton for your kind response. Have you used SPARK Session? I think
> that hiveContext is a very old way of solving things in SPARK, and since
> then new algorithms have been introduced in SPARK.
I will give sparkSession a try.

> It will be a lot of help, given how kind you have been by sharing your
> experience, if you could kindly share your code as well and provide
> details like SPARK, HADOOP, HIVE, and other environment versions and
> details.

I am testing an HDP 2.6 distribution with:
SPARK: 2.1.1
HADOOP: 2.7.3
HIVE: 1.2.1000
PRESTO: 1.87

> After all, no one wants to use a SPARK 1.x version to solve problems
> anymore, though I have seen a couple of companies who are stuck with
> these versions as they are using in-house deployments which they cannot
> upgrade because of incompatibility issues.

I didn't know hiveContext was the legacy spark way. I will give
sparkSession a try and conclude. After all, I would prefer to provide our
users a unique and uniform framework such as spark, instead of multiple
complicated layers such as spark + whatever jdbc access.

> Regards,
> Gourav Sengupta
>
> On Sun, Nov 5, 2017 at 12:57 PM, Nicolas Paris <nipari...@gmail.com> wrote:
>
> > Hi
> >
> > After some testing, I have been quite disappointed with the hiveContext
> > way of accessing hive tables.
> >
> > The main problem is resource allocation: I have tons of users and they
> > get a limited subset of workers. This does not allow them to query huge
> > datasets, because too little memory is allocated (or maybe I am missing
> > something).
> >
> > If using the Hive jdbc, Hive resources are shared by all my users and
> > then queries are able to finish.
> >
> > I have then been testing other jdbc-based approaches and for now,
> > "presto" looks like the most appropriate solution to access hive
> > tables.
> >
> > In order to load huge datasets into spark, the proposed approach is to
> > use a presto distributed CTAS to build an ORC dataset, and to access
> > that dataset through spark's dataframe loader (instead of direct jdbc
> > access, which would break the driver memory).
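[Editor's note: the Presto-CTAS-then-ORC approach described above can be sketched as follows. The table names, paths, and the Presto catalog/schema are hypothetical; `enableHiveSupport` on `SparkSession` is the Spark 2.x replacement for the deprecated `HiveContext`.]

```scala
// 1) On the Presto side (through the presto CLI or any JDBC client),
//    a distributed CTAS materializes the query result as ORC files.
//    Catalog/schema/table names here are hypothetical:
//
//    CREATE TABLE hive.tmp.big_extract
//    WITH (format = 'ORC')
//    AS SELECT * FROM hive.warehouse.huge_table WHERE year = 2017;

// 2) On the Spark side, SparkSession (Spark 2.x) replaces HiveContext:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("presto-ctas-extract")
  .enableHiveSupport() // Hive metastore access, as HiveContext provided
  .getOrCreate()

// Read the CTAS output through the metastore...
val dfFromTable = spark.table("tmp.big_extract")

// ...or straight from the ORC files, bypassing the metastore
// (the HDFS path below is a hypothetical warehouse location):
val dfFromFiles = spark.read.orc("hdfs:///warehouse/tmp.db/big_extract")

dfFromTable.printSchema()
```

This keeps the heavy extraction on Presto's shared workers and lets Spark read the result in parallel from ORC, instead of funneling it through a single JDBC connection on the driver.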
> > On 15 Oct 2017 at 19:24, Gourav Sengupta wrote:
> >
> > > Hi Nicolas,
> > >
> > > without the hive thrift server, if you try to run a select * on a
> > > table which has around 10,000 partitions, SPARK will give you some
> > > surprises. PRESTO works fine in these scenarios, and I am sure the
> > > SPARK community will soon learn from their algorithms.
> > >
> > > Regards,
> > > Gourav
> > >
> > > On Sun, Oct 15, 2017 at 3:43 PM, Nicolas Paris <nipari...@gmail.com> wrote:
> > >
> > > > > I do not think that SPARK will automatically determine the
> > > > > partitions. Actually it does not automatically determine the
> > > > > partitions. In case a table has a few million records, it all
> > > > > goes through the driver.
> > > >
> > > > Hi Gourav
> > > >
> > > > Actually the spark jdbc driver is able to deal directly with
> > > > partitions. Spark creates a jdbc connection for each partition.
> > > >
> > > > All the details are explained in this post:
> > > > http://www.gatorsmile.io/numpartitionsinjdbc/
> > > >
> > > > Also an example with the greenplum database:
> > > > http://engineering.pivotal.io/post/getting-started-with-greenplum-spark/

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
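[Editor's note: the partitioned JDBC read discussed in the thread can be sketched as below. The connection URL, table, and bounds are hypothetical; the option names are Spark's standard JDBC data source options.]

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("partitioned-jdbc-read")
  .getOrCreate()

// With partitionColumn + lowerBound/upperBound + numPartitions, Spark
// splits the read into numPartitions range queries and opens one JDBC
// connection per partition, rather than pulling every row through a
// single connection on the driver.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/warehouse") // hypothetical
  .option("dbtable", "public.huge_table")                   // hypothetical
  .option("user", "etl")
  .option("partitionColumn", "id") // must be a numeric column in Spark 2.1
  .option("lowerBound", "1")
  .option("upperBound", "10000000")
  .option("numPartitions", "16")   // 16 parallel connections / range scans
  .load()

// Each partition issues a query of roughly the form:
//   SELECT * FROM public.huge_table WHERE id >= x AND id < y
// Note: the bounds only set the stride of the ranges; rows outside
// [lowerBound, upperBound] still land in the first or last partition.
```

Without these options, Spark falls back to a single-partition read over one connection, which is what makes large tables overwhelm the driver.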