Re: Hive using Spark engine vs native spark with hive integration.

2020-10-07 Thread Patrick McCarthy
I think a lot will depend on what the scripts do. I've seen some legacy
Hive scripts that were written in an awkward way (e.g. lots of subqueries,
nested explodes) because, pre-Spark, that was the only way to express certain
logic. For fairly straightforward operations I'd expect Catalyst to reduce
both kinds of code to similar plans.
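
As a rough illustration (the table and column names below are made up, not
from your scripts), you can compare the plans Catalyst produces for the SQL
and DataFrame versions of the same query:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("plan-comparison")
         .enableHiveSupport()   # read tables registered in the Hive metastore
         .getOrCreate())

# HQL-style query submitted through spark.sql()
sql_df = spark.sql("""
    SELECT region, SUM(amount) AS total
    FROM sales
    WHERE year = 2020
    GROUP BY region
""")

# The same logic expressed with the DataFrame API
api_df = (spark.table("sales")
          .where(F.col("year") == 2020)
          .groupBy("region")
          .agg(F.sum("amount").alias("total")))

# Both should print very similar (often identical) physical plans
sql_df.explain()
api_df.explain()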



-- 

Patrick McCarthy
Senior Data Scientist, Machine Learning Engineering
Dstillery
470 Park Ave South, 17th Floor, NYC 10016


Re: Hive using Spark engine vs native spark with hive integration.

2020-10-06 Thread Ricardo Martinelli de Oliveira
My 2 cents is that this is a complicated question, since I'm not confident
that Spark is 100% compatible with Hive in terms of query language. I have an
unanswered question on this list about exactly that:

http://apache-spark-user-list.1001560.n3.nabble.com/Should-SHOW-TABLES-statement-return-a-hive-compatible-output-td38577.html

One important thing to check is whether the objects you use are supported in
both Hive and Spark. One example is the lack of support for materialized
views in Spark:
https://issues.apache.org/jira/browse/SPARK-29038
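
As a quick sanity check (a rough sketch; the database name below is just a
placeholder), you can list what Spark actually sees through the Hive
metastore and compare it with what Hive reports:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-object-check")
         .enableHiveSupport()
         .getOrCreate())

# Objects Spark can see in the Hive metastore; Hive-only objects such as
# materialized views may be missing or unusable from Spark.
for table in spark.catalog.listTables("default"):
    print(table.name, table.tableType)

# Note that the columns returned here differ from Hive's SHOW TABLES output.
spark.sql("SHOW TABLES IN default").show()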

With that being said, I'd recommend going with option 2, as this will force
your code to use what Spark offers.

Hope that helps.



-- 

Ricardo Martinelli De Oliveira
Data Engineer, AI CoE
Red Hat Brazil
Av. Brigadeiro Faria Lima, 3900, 8th floor
rmart...@redhat.com
T: +551135426125
M: +5511970696531




Hive using Spark engine vs native spark with hive integration.

2020-10-06 Thread Manu Jacob
Hi All,

Not sure if I need to ask this question in the Spark community or the Hive community.

We have a set of Hive scripts that run on EMR (Tez engine). We would like to 
experiment with moving some of them onto Spark. We are planning to experiment 
with two options.


  1.  Use the current code based on HQL, with the execution engine set to Spark.
  2.  Write pure Spark code in Scala/Python using Spark SQL and Hive integration.

The first approach helps us transition to Spark quickly, but I am not sure if 
it is the best approach in terms of performance. I could not find any 
reasonable comparison of these two approaches. It looks like writing pure 
Spark code gives us more control to add logic and also to tune some of the 
performance features, for example caching and evicting data.
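
For context, a minimal sketch of what we imagine for option 2 (table and
column names below are just placeholders) would be something like:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("option-2-sketch")
         .enableHiveSupport()   # Hive integration: read/write metastore tables
         .getOrCreate())

# Read a Hive table, apply the logic, and cache only the hot subset.
orders = spark.table("analytics.orders")
recent = orders.where(F.col("order_date") >= "2020-01-01").cache()

summary = (recent.groupBy("customer_id")
           .agg(F.count("*").alias("orders"),
                F.sum("amount").alias("revenue")))

summary.write.mode("overwrite").saveAsTable("analytics.order_summary")

# Evict the cached data once it is no longer needed.
recent.unpersist()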


Any advice on this is much appreciated.


Thanks,
-Manu