Hi

After some testing, I have been quite disappointed with the hiveContext way
of accessing Hive tables.

The main problem is resource allocation: I have a lot of users, and each of
them gets a limited subset of workers. That makes it impossible to query
huge datasets, because too little memory is allocated to each user (or maybe
I am missing something).

With Hive JDBC, Hive's resources are shared by all my users, so the queries
are able to finish.

I have since been testing other JDBC-based approaches and, for now, Presto
looks like the most appropriate solution for accessing Hive tables.

To load huge datasets into Spark, the proposed approach is to use Presto's
distributed CTAS to build an ORC dataset, then read that dataset with
Spark's DataFrame loader (instead of direct JDBC access, which would exhaust
the driver's memory).
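
A minimal sketch of that workflow, in case it helps. The table names, the
warehouse path and the helper names (build_ctas, load_orc) are made up for
illustration, and I am assuming the Presto Hive connector is mounted as a
catalog called "hive". The CTAS statement can be submitted through whatever
Presto client you already use (CLI or JDBC); only the Spark side is shown.

```python
def build_ctas(target_table, source_query, fmt="ORC"):
    """Build a Presto CTAS statement that materializes a query as a table.

    The WITH (format = ...) table property is the Hive connector's way of
    choosing the storage format. Table names here are hypothetical.
    """
    return (
        f"CREATE TABLE {target_table} "
        f"WITH (format = '{fmt}') AS "
        f"{source_query}"
    )


def load_orc(path, app_name="orc-loader"):
    """Read the ORC files written by the CTAS into a Spark DataFrame.

    pyspark is imported lazily so the SQL helper above can be used on
    machines that do not have Spark installed.
    """
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName(app_name).getOrCreate()
    # Executors read the ORC files in parallel; the rows are not
    # funneled through a single JDBC connection on the driver.
    return spark.read.orc(path)


# Step 1: run this statement on Presto (hypothetical table names).
ctas_sql = build_ctas(
    "hive.tmp.big_extract",
    "SELECT * FROM hive.prod.events",
)

# Step 2: once the CTAS has finished, load the result from Spark.
# The path is an assumption; use the location of the new table.
# df = load_orc("hdfs:///user/hive/warehouse/tmp.db/big_extract")
```

The point of the design is that Presto does the heavy scan with its shared
resources, and Spark only has to read files that are already laid out for
parallel loading.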



On 15 Oct 2017 at 19:24, Gourav Sengupta wrote:
> Hi Nicolas,
> 
> without the hive thrift server, if you try to run a select * on a table which
> has around 10,000 partitions, SPARK will give you some surprises. PRESTO works
> fine in these scenarios, and I am sure SPARK community will soon learn from
> their algorithms.
> 
> 
> Regards,
> Gourav
> 
> On Sun, Oct 15, 2017 at 3:43 PM, Nicolas Paris <nipari...@gmail.com> wrote:
> 
>     > I do not think that SPARK will automatically determine the partitions.
>     > Actually it does not automatically determine the partitions. In case a
>     > table has a few million records, it all goes through the driver.
> 
>     Hi Gourav
> 
>     Actually, the Spark JDBC driver is able to deal directly with partitions.
>     Spark creates a JDBC connection for each partition.
> 
>     All the details are explained in this post:
>     http://www.gatorsmile.io/numpartitionsinjdbc/
> 
>     Also an example with greenplum database:
>     http://engineering.pivotal.io/post/getting-started-with-greenplum-spark/
> 
> 
