Hey
I finally improved the spark-hive sql performance a lot.
I had a problem with a topology_script.py that produced huge error
traces in the logs and reduced spark performance in python mode. I just
corrected the python2 scripts to be python3 ready.
I had some problem with broadcast variables while
Yes, my thought exactly. Kindly let me know if you need any help to port it
to pyspark.
On Mon, Nov 6, 2017 at 8:54 AM, Nicolas Paris wrote:
> On Nov 5, 2017 at 22:46, ayan guha wrote:
> > Thank you for the clarification. That was my understanding too. However
> how to
> >
On Nov 5, 2017 at 22:46, ayan guha wrote:
> Thank you for the clarification. That was my understanding too. However how to
> provide the upper bound as it changes for every call in real life. For example
> it is not required for sqoop.
True. AFAIK sqoop begins with doing a
"select min(col), max(col)" boundary query to compute the bounds itself.
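If it helps, here is a hedged sketch of doing the same from Spark: run your
own boundary query first, then hand the result to the partitioned JDBC
reader. The URL, credentials, table and column names below are all
hypothetical.

```scala
import java.sql.DriverManager

// Hypothetical connection details; adjust for your environment.
val url = "jdbc:postgresql://dbhost:5432/mydb"

// Run the boundary query ourselves, as sqoop would:
val conn = DriverManager.getConnection(url, "user", "pass")
val rs = conn.createStatement()
  .executeQuery("SELECT min(id), max(id) FROM people")
rs.next()
val (lower, upper) = (rs.getLong(1), rs.getLong(2))
conn.close()

// Feed the fresh bounds to Spark's partitioned JDBC reader.
val df = spark.read.format("jdbc")
  .option("url", url)
  .option("user", "user")
  .option("password", "pass")
  .option("dbtable", "people")
  .option("partitionColumn", "id")
  .option("lowerBound", lower)
  .option("upperBound", upper)
  .option("numPartitions", 8)
  .load()
```

Since the bounds are recomputed on every run, the "upper bound changes for
every call" problem stays out of the application code.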
Thank you for the clarification. That was my understanding too. However,
how to provide the upper bound, as it changes for every call in real life?
For example, it is not required for sqoop.
On Mon, 6 Nov 2017 at 8:20 am, Nicolas Paris wrote:
> On Nov 5, 2017 at 22:02, ayan
On Nov 5, 2017 at 22:02, ayan guha wrote:
> Can you confirm if JDBC DF Reader actually loads all data from source to
> driver
> memory and then distributes to the executors?
apparently yes, when not using a partition column
> And this is true even when a
> partition column is provided?
No, in that case each executor reads its own slice directly.
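To make the partitioned case concrete: given partitionColumn, lowerBound,
upperBound and numPartitions, Spark derives one WHERE predicate per
partition, and each executor runs its own bounded query, so the data no
longer funnels through the driver. The helper below is a sketch of that
predicate generation, not Spark's actual internal code:

```scala
// Roughly what the JDBC reader derives from partitionColumn,
// lowerBound, upperBound and numPartitions: one WHERE clause per
// partition, each executed by a separate task against the database.
def partitionPredicates(col: String, lower: Long, upper: Long,
                        numPartitions: Int): Seq[String] = {
  val stride = (upper - lower) / numPartitions
  (0 until numPartitions).map { i =>
    val lo = lower + i * stride
    val hi = lo + stride
    if (i == 0) s"$col < $hi OR $col IS NULL"
    else if (i == numPartitions - 1) s"$col >= $lo"
    else s"$col >= $lo AND $col < $hi"
  }
}
```

Note the first slice also catches NULLs and the last is open-ended, so rows
outside the given bounds are still read, just skewed into the edge
partitions.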
Hi
Can you confirm if JDBC DF Reader actually loads all data from source to
driver memory and then distributes to the executors? And this is true even
when a partition column is provided?
Best
Ayan
On Mon, Nov 6, 2017 at 3:00 AM, David Hodeffi <
david.hode...@niceactimize.com> wrote:
> Testing
Testing Spark group e-mail
On Nov 5, 2017 at 14:11, Gourav Sengupta wrote:
> thanks a ton for your kind response. Have you used SPARK Session ? I think
> that
> hiveContext is a very old way of solving things in SPARK, and since then new
> algorithms have been introduced in SPARK.
I will give sparkSession a try.
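For what it's worth, a minimal sketch of the SparkSession route (Spark 2.x),
assuming Hive support was compiled in and hive-site.xml is on the classpath;
the database and table names are made up:

```scala
import org.apache.spark.sql.SparkSession

// SparkSession replaces the old HiveContext entry point in Spark 2.x.
// enableHiveSupport() wires in the Hive metastore and HiveQL support.
val spark = SparkSession.builder()
  .appName("hive-access")
  .enableHiveSupport()
  .getOrCreate()

// Query a Hive table directly; "mydb.people" is hypothetical.
val df = spark.sql("SELECT count(*) FROM mydb.people")
df.show()
```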
Hi Nicolas,
thanks a ton for your kind response. Have you used SPARK Session? I think
that hiveContext is a very old way of solving things in SPARK, and since
then new algorithms have been introduced in SPARK.
It will be a lot of help, given how kind you have been by sharing your
experience,
Hi
After some testing, I have been quite disappointed with the hiveContext way
of accessing hive tables.
The main problem is resource allocation: I have tons of users and they
get a limited subset of workers. Then this does not allow querying huge
datasets because too little memory is allocated (or maybe I
Hi Nicolas,
without the hive thrift server, if you try to run a select * on a table
which has around 10,000 partitions, SPARK will give you some surprises.
PRESTO works fine in these scenarios, and I am sure the SPARK community
will soon learn from their algorithms.
Regards,
Gourav
On Sun, Oct 15,
> I do not think that SPARK will automatically determine the partitions.
> Actually
> it does not automatically determine the partitions. In case a table has a few
> million records, it all goes through the driver.
Hi Gourav
Actually the spark jdbc driver is able to deal directly with partitions.
Hi Gourav
> what if the table has partitions and sub-partitions?
well this also works with multiple orc files having the same schema:
val people = sqlContext.read.format("orc").load("hdfs://cluster/people*")
Am I missing something?
> And you do not want to access the entire data?
This works for
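A possible sketch of restricting the read to particular partitions, assuming
the usual key=value partition layout on HDFS (paths and column names
hypothetical):

```scala
// Load only selected partition directories instead of the whole table.
val y2017 = sqlContext.read.format("orc")
  .load("hdfs://cluster/people/year=2017/month=*")

// Or load from the table root and let partition discovery plus a
// filter on the partition column prune the directories that are read.
val pruned = sqlContext.read.format("orc")
  .load("hdfs://cluster/people")
  .filter("year = 2017")
```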
Hi Nicolas,
what if the table has partitions and sub-partitions? And you do not want to
access the entire data?
Regards,
Gourav
On Sun, Oct 15, 2017 at 12:55 PM, Nicolas Paris wrote:
> On Oct 3, 2017 at 20:08, Nicolas Paris wrote:
> > I wonder the differences
On Oct 3, 2017 at 20:08, Nicolas Paris wrote:
> I wonder the differences accessing HIVE tables in two different ways:
> - with jdbc access
> - with sparkContext
Well there is also a third way to access the hive data from spark:
- with direct file access (here ORC format)
For example:
val people = sqlContext.read.format("orc").load("hdfs://cluster/people")
My take on this might sound a bit different. Here are a few points to
consider below:
1. Going through Hive JDBC means that the application is restricted by the #
of queries that can be compiled. HS2 can only compile one SQL at a time and if
users have bad SQL, it can take a long time just to
> In case a table has a few
> million records, it all goes through the driver.
This sounds clear in JDBC mode: the driver gets all the rows and then it
spreads the RDD over the executors.
I'd say that most use cases deal with SQL to aggregate huge datasets,
and retrieve a small number of rows to be
Hi,
I do not think that SPARK will automatically determine the partitions.
Actually it does not automatically determine the partitions. In case a
table has a few million records, it all goes through the driver.
Of course, I have only tried JDBC connections in AURORA, Oracle and Postgres.
To: user@spark.apache.org
Subject: Re: Hive From Spark: Jdbc VS sparkContext
Is Hive from Spark via JDBC working for you? In case it does, I would be
interested in your setup :-)
We can't get this working. See bug here, especially my last comment:
https://issues.apache.org
Is Hive from Spark via JDBC working for you? In case it does, I would be
interested in your setup :-)
We can't get this working. See bug here, especially my last comment:
https://issues.apache.org/jira/browse/SPARK-21063
Regards
Andreas
That is not correct, IMHO. If I am not wrong, Spark will still load data in
the executors, by running some stats on the data itself to identify
partitions
On Tue, Oct 10, 2017 at 9:23 PM, 郭鹏飞 wrote:
>
> On Oct 4, 2017, at 2:08 AM, Nicolas Paris wrote:
>
> Hi
>
> I wonder the differences accessing HIVE tables in two different ways:
> - with jdbc access
> - with sparkContext
>
> I would say that jdbc is better since it uses HIVE that is based on
> map-reduce / TEZ and then works on
Well the obvious point is security. Ranger and Sentry can secure jdbc
endpoints only. On the performance aspect, I am equally curious.
On Wed, 4 Oct 2017 at 10:30 pm, Gourav Sengupta
wrote:
> Hi,
>
> I am genuinely curious to see whether any one responds to this
Hi,
I am genuinely curious to see whether anyone responds to this question.
It's very hard to shake off JAVA, OOPs and JDBCs :)
Regards,
Gourav Sengupta
On Tue, Oct 3, 2017 at 7:08 PM, Nicolas Paris wrote:
> Hi
>
> I wonder the differences accessing HIVE tables in two
Hi
I wonder the differences accessing HIVE tables in two different ways:
- with jdbc access
- with sparkContext
I would say that jdbc is better since it uses HIVE that is based on
map-reduce / TEZ and then works on disk.
Using spark rdd can lead to memory errors on very huge datasets.
Anybody