Re: Building Spark to use for Hive on Spark

2015-11-22 Thread Lefty Leverenz
Thanks Xuefu!

-- Lefty



Re: Building Spark to use for Hive on Spark

2015-11-22 Thread Xuefu Zhang
Hive on Spark is supposed to work with any version of Hive (1.1+) and a version of
Spark built without Hive. Thus, to make HoS work reliably and also to simplify
matters, I think it still makes sense to require that the spark-assembly jar
shouldn't contain Hive jars. Otherwise, you have to make sure that your
Hive version matches the "other" Hive version that's included
in Spark.
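
For illustration, a rough sketch of how such a Hive-free assembly is typically
produced (the profile list is only an example for Spark 1.5 on Hadoop 2.x; check
the Hive on Spark getting-started doc for the exact flags for your versions):

    # Build a Spark distribution whose assembly does NOT bundle Hive classes;
    # note the absence of -Phive / -Phive-thriftserver in the profile list.
    ./make-distribution.sh --name "hadoop2-without-hive" --tgz \
        "-Pyarn,hadoop-provided,hadoop-2.4,parquet-provided"

    # Sanity check: the resulting assembly jar (path may differ) should contain
    # no Hive classes at all.
    jar tf dist/lib/spark-assembly-*.jar | grep 'org/apache/hadoop/hive' \
        || echo "no Hive classes found"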

In CDH 5.x, the Spark version is 1.5, and we still build the Spark jar without Hive.

Therefore, I don't see a need to update the doc.

--Xuefu



Re: Building Spark to use for Hive on Spark

2015-11-22 Thread Lefty Leverenz
Gopal, can you confirm the doc change that Jone Zhang suggests?  The second
sentence confuses me:  "You can choose Spark1.5.0+ which  build include the
Hive jars."

Thanks.

-- Lefty




Re: Building Spark to use for Hive on Spark

2015-11-19 Thread Jone Zhang
I should add that Spark 1.5.0+ uses Hive 1.2.1 by default when you build with -Phive.

So this page should be written like below:
“Note that you must have a version of Spark which does *not* include the
Hive jars if you use Spark 1.4.1 or earlier. You can choose Spark 1.5.0+,
whose build includes the Hive jars.”
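
For contrast, here is a rough sketch of the kind of Spark 1.5 build I mean, i.e.
one whose assembly does bundle Spark's forked Hive 1.2.1 (the profiles are only
illustrative; see the Spark build docs for the exact set for your Hadoop version):

    # Building WITH the Hive profiles pulls Spark's own Hive 1.2.1 fork into the assembly.
    mvn -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver -DskipTests clean package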




Re: Building Spark to use for Hive on Spark

2015-11-18 Thread Gopal Vijayaraghavan


> I wanted to know why it is necessary to remove the Hive jars from the
> Spark build as mentioned on this

Because SparkSQL was originally based on Hive & still uses Hive AST to
parse SQL.

The org.apache.spark.sql.hive package contains the parser, which has
hard references to Hive's internal AST, which is unfortunately
auto-generated code (HiveParser.TOK_TABNAME, etc.).

Every time Hive makes a release, those constants change in value, and they
are private API because of the lack of backwards compatibility, which
SparkSQL violates.
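
A rough way to see this for yourself (jar names and versions below are only
examples): the token IDs are plain int constants compiled into HiveParser, so
you can dump them from two releases and compare.

    # Dump the generated parser constants from two Hive releases and compare one token.
    javap -classpath hive-exec-1.1.0.jar -constants org.apache.hadoop.hive.ql.parse.HiveParser | grep 'TOK_TABNAME'
    javap -classpath hive-exec-1.2.1.jar -constants org.apache.hadoop.hive.ql.parse.HiveParser | grep 'TOK_TABNAME'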

So Hive-on-Spark forces mismatched versions of Hive classes, because it's
a circular dependency of Hive(v1) -> Spark -> Hive(v2) due to the basic
laws of causality.

Spark cannot depend on a version of Hive that is unreleased, and a
Hive-on-Spark release cannot depend on a version of Spark that is
unreleased.

Cheers,
Gopal




Building Spark to use for Hive on Spark

2015-11-18 Thread Udit Mehta
Hi,

I am planning to test out the Hive on Spark functionality provided by the
newer versions of Hive. I wanted to know why it is necessary to remove the
Hive jars from the Spark build, as mentioned on this page.

This would require me to have two Spark builds, one with the Hive jars and
one without.

Any help is appreciated,
Udit