Re: Removing old HiveMetastore(0.12~0.14) from Spark 3.0.0?

2019-01-22 Thread Xiao Li
Based on my experience developing Spark SQL, the maintenance cost of
supporting different versions of the Hive metastore is very small. Feel free
to ping me if we hit any issues with it.

Cheers,

Xiao

Reynold Xin wrote on Tuesday, January 22, 2019 at 11:18 PM:

> Actually, a non-trivial fraction of the users and customers I interact with
> still use very old Hive metastores, because it's very difficult to upgrade a
> Hive metastore wholesale (it would require all the production jobs that
> access the same metastore to be upgraded at once). This is even harder than
> a JVM upgrade, which can be done on a per-job basis, or an OS upgrade, which
> can be done on a per-machine basis.
>
> Is there a high maintenance cost to keeping these? My understanding is
> that Michael did a good job initially with classloader isolation and a
> modular design, so they are very easy to maintain.
>
> On Jan 22, 2019, at 11:13 PM, Hyukjin Kwon  wrote:
>
> Yeah, I was thinking about that too. They are too old to keep. +1 for
> removing them.
>
> On Wednesday, January 23, 2019 at 11:30 AM, Dongjoon Hyun wrote:
>
>> Hi, All.
>>
>> Currently, Apache Spark supports Hive Metastore (HMS) 0.12 ~ 2.3.
>> Among them, the HMS 0.x releases look very old now that we are in 2019.
>> If these are no longer used in production, can we drop HMS 0.x
>> support in 3.0.0?
>>
>> hive-0.12.0 2013-10-10
>> hive-0.13.0 2014-04-15
>> hive-0.13.1 2014-11-16
>> hive-0.14.0 2014-11-16
>> ( https://archive.apache.org/dist/hive/ )
>>
>> In addition, if there is someone who is still using these HMS versions
>> and has a plan to install and use Spark 3.0.0 with them, could you reply
>> to this email thread? Knowing the reason would be very helpful for me.
>>
>> Thanks,
>> Dongjoon.
>>
>


Re: Removing old HiveMetastore(0.12~0.14) from Spark 3.0.0?

2019-01-22 Thread Reynold Xin
Actually, a non-trivial fraction of the users and customers I interact with still
use very old Hive metastores, because it's very difficult to upgrade a Hive
metastore wholesale (it would require all the production jobs that access the same
metastore to be upgraded at once). This is even harder than a JVM upgrade, which
can be done on a per-job basis, or an OS upgrade, which can be done on a
per-machine basis.

Is there a high maintenance cost to keeping these? My understanding is that
Michael did a good job initially with classloader isolation and a modular design,
so they are very easy to maintain.
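
For reference, the metastore client version is already selectable per
application through configuration, so old metastores can keep being reached
without upgrading them. A minimal sketch (assuming the documented
spark.sql.hive.metastore.* options; "maven" tells Spark to pull matching Hive
client jars; the app name is made up):

import org.apache.spark.sql.SparkSession

// Talk to an existing, older Hive metastore from this application only.
val spark = SparkSession.builder()
  .appName("legacy-hms-example")                          // hypothetical app name
  .config("spark.sql.hive.metastore.version", "0.13.1")   // version of the remote HMS
  .config("spark.sql.hive.metastore.jars", "maven")       // fetch a matching client
  .enableHiveSupport()
  .getOrCreate()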

> On Jan 22, 2019, at 11:13 PM, Hyukjin Kwon  wrote:
> 
> Yeah, I was thinking about that too. They are too old to keep. +1 for
> removing them.
> 
> On Wednesday, January 23, 2019 at 11:30 AM, Dongjoon Hyun wrote:
>> Hi, All.
>> 
>> Currently, Apache Spark supports Hive Metastore (HMS) 0.12 ~ 2.3.
>> Among them, the HMS 0.x releases look very old now that we are in 2019.
>> If these are no longer used in production, can we drop HMS 0.x
>> support in 3.0.0?
>> 
>> hive-0.12.0 2013-10-10
>> hive-0.13.0 2014-04-15
>> hive-0.13.1 2014-11-16
>> hive-0.14.0 2014-11-16
>> ( https://archive.apache.org/dist/hive/ )
>> 
>> In addition, if there is someone who is still using these HMS versions and
>> has a plan to install and use Spark 3.0.0 with them, could you reply to this
>> email thread? Knowing the reason would be very helpful for me.
>> 
>> Thanks,
>> Dongjoon.


Re: Removing old HiveMetastore(0.12~0.14) from Spark 3.0.0?

2019-01-22 Thread Hyukjin Kwon
Yeah, I was thinking about that too. They are too old to keep. +1 for
removing them.

On Wednesday, January 23, 2019 at 11:30 AM, Dongjoon Hyun wrote:

> Hi, All.
>
> Currently, Apache Spark supports Hive Metastore (HMS) 0.12 ~ 2.3.
> Among them, the HMS 0.x releases look very old now that we are in 2019.
> If these are no longer used in production, can we drop HMS 0.x
> support in 3.0.0?
>
> hive-0.12.0 2013-10-10
> hive-0.13.0 2014-04-15
> hive-0.13.1 2014-11-16
> hive-0.14.0 2014-11-16
> ( https://archive.apache.org/dist/hive/ )
>
> In addition, if there is someone who is still using these HMS versions and
> has a plan to install and use Spark 3.0.0 with them, could you reply to this
> email thread? Knowing the reason would be very helpful for me.
>
> Thanks,
> Dongjoon.
>


Removing old HiveMetastore(0.12~0.14) from Spark 3.0.0?

2019-01-22 Thread Dongjoon Hyun
Hi, All.

Currently, Apache Spark supports Hive Metastore (HMS) 0.12 ~ 2.3.
Among them, the HMS 0.x releases look very old now that we are in 2019.
If these are no longer used in production, can we drop HMS 0.x
support in 3.0.0?

hive-0.12.0 2013-10-10
hive-0.13.0 2014-04-15
hive-0.13.1 2014-11-16
hive-0.14.0 2014-11-16
( https://archive.apache.org/dist/hive/ )

In addition, if there is someone who is still using these HMS versions and
has a plan to install and use Spark 3.0.0 with them, could you reply to this
email thread? Knowing the reason would be very helpful for me.

Thanks,
Dongjoon.


Re: [DISCUSS] Identifiers with multi-catalog support

2019-01-22 Thread Ryan Blue
Thanks for reviewing this! I'll create an SPIP doc and issue for it and
call a vote.

On Tue, Jan 22, 2019 at 11:41 AM Matt Cheah  wrote:

> +1 for n-part namespace as proposed. Agree that a short SPIP would be
> appropriate for this. Perhaps also a JIRA ticket?
>
>
>
> -Matt Cheah
>
>
>
> From: Felix Cheung
> Date: Sunday, January 20, 2019 at 4:48 PM
> To: "rb...@netflix.com", Spark Dev List <dev@spark.apache.org>
> Subject: Re: [DISCUSS] Identifiers with multi-catalog support
>
>
>
> +1, I like Ryan's last mail. Thank you for putting it clearly (it should be
> a spec/SPIP!)
>
>
>
> I agree and understand the need for a 3-part id. However, I don't think we
> should assume that it must be, or can only be, three parts. Once the catalog
> is identified (i.e., the first part), the catalog should be responsible for
> resolving the namespace or schema, etc. I also agree that a path is a good
> idea to add, to support the file-based variant. Should the separator be
> optional (perhaps carried in the space field) to keep this extensible? It
> might not always be '.'.
>
>
>
> Also this whole scheme will need to play nice with column identifier as
> well.
>
>
>
>
> --
>
> From: Ryan Blue
> Sent: Thursday, January 17, 2019 11:38 AM
> To: Spark Dev List
> Subject: Re: [DISCUSS] Identifiers with multi-catalog support
>
>
>
> Any discussion on how Spark should manage identifiers when multiple
> catalogs are supported?
>
>
>
> I know this is an area where a lot of people are interested in making
> progress, and it is a blocker for both multi-catalog support and CTAS in
> DSv2.
>
>
>
> On Sun, Jan 13, 2019 at 2:22 PM Ryan Blue  wrote:
>
> I think that the solution to this problem is to mix the two approaches by
> supporting 3 identifier parts: catalog, namespace, and name, where
> namespace can be an n-part identifier:
>
> type Namespace = Seq[String]
>
> case class CatalogIdentifier(space: Namespace, name: String)
>
> This allows catalogs to work with the hierarchy of the external store, but
> the catalog API only requires a few discovery methods to list namespaces
> and to list each type of object in a namespace.
>
> def listNamespaces(): Seq[Namespace]
>
> def listNamespaces(space: Namespace, prefix: String): Seq[Namespace]
>
> def listTables(space: Namespace): Seq[CatalogIdentifier]
>
> def listViews(space: Namespace): Seq[CatalogIdentifier]
>
> def listFunctions(space: Namespace): Seq[CatalogIdentifier]
>
> The methods to list tables, views, or functions, would only return
> identifiers for the type queried, not namespaces or the other objects.
>
> The SQL parser would be updated so that identifiers are parsed to
> UnresolvedIdentifier(parts: Seq[String]), and resolution would work like
> this pseudo-code:
>
> def resolveIdentifier(ident: UnresolvedIdentifier): (CatalogPlugin, 
> CatalogIdentifier) = {
>
>   val maybeCatalog = sparkSession.catalog(ident.parts.head)
>
>   ident.parts match {
>
> case Seq(catalogName, *space, name) if maybeCatalog.isDefined =>
>
>   (maybeCatalog.get, CatalogIdentifier(space, name))
>
> case Seq(*space, name) =>
>
>   (sparkSession.defaultCatalog, CatalogIdentifier(space, name))
>
>   }
>
> }
>
> I think this is a good approach because it allows Spark users to reference
> or discover any name in the hierarchy of an external store, it uses a few
> well-defined methods for discovery, and it makes the name hierarchy a user
> concern.
>
> - SHOW (DATABASES|SCHEMAS|NAMESPACES) would return the result of
>   listNamespaces()
>
> - SHOW NAMESPACES LIKE a.b% would return the result of
>   listNamespaces(Seq("a"), "b")
>
> - USE a.b would set the current namespace to Seq("a", "b")
>
> - SHOW TABLES would return the result of listTables(currentNamespace)
>
> Also, I think that we could generalize this a little more to support
> path-based tables by adding a path to CatalogIdentifier, either as a
> namespace or as a separate optional string. Then, the identifier passed to
> a catalog would work for either a path-based table or a catalog table,
> without needing a path-based catalog API.
>
> Thoughts?
>
>
>
> On Sun, Jan 13, 2019 at 1:38 PM Ryan Blue  wrote:
>
> In the DSv2 sync up, we tried to discuss the Table metadata proposal but
> were side-tracked on its use of TableIdentifier. There were good points
> about how Spark should identify tables, views, functions, etc, and I want
> to start a discussion here.
>
> Identifiers are orthogonal to the TableCatalog proposal that can be
> updated to use whatever identifier class we choose. That proposal is
> concerned with what information should be passed to define a table, and how
> to pass that information.
>
> The main question for *this* discussion is: *how should Spark identify
> tables, views, and functions when it supports multiple catalogs?*
>
> There are two main approaches:
>
> 1.   Use a 3-part identifier, catalog.database.table
>
> 2. Use an identifier with an arbitrary number of parts

Re: [DISCUSS] Identifiers with multi-catalog support

2019-01-22 Thread Matt Cheah
+1 for n-part namespace as proposed. Agree that a short SPIP would be 
appropriate for this. Perhaps also a JIRA ticket?

 

-Matt Cheah

 

From: Felix Cheung 
Date: Sunday, January 20, 2019 at 4:48 PM
To: "rb...@netflix.com" , Spark Dev List 

Subject: Re: [DISCUSS] Identifiers with multi-catalog support

 

+1, I like Ryan's last mail. Thank you for putting it clearly (it should be a
spec/SPIP!)

 

I agree and understand the need for a 3-part id. However, I don't think we should
assume that it must be, or can only be, three parts. Once the catalog is
identified (i.e., the first part), the catalog should be responsible for resolving
the namespace or schema, etc. I also agree that a path is a good idea to add, to
support the file-based variant. Should the separator be optional (perhaps carried
in the space field) to keep this extensible? It might not always be '.'.

 

Also this whole scheme will need to play nice with column identifier as well.

 

 

From: Ryan Blue 
Sent: Thursday, January 17, 2019 11:38 AM
To: Spark Dev List
Subject: Re: [DISCUSS] Identifiers with multi-catalog support 

 

Any discussion on how Spark should manage identifiers when multiple catalogs 
are supported? 

 

I know this is an area where a lot of people are interested in making progress, 
and it is a blocker for both multi-catalog support and CTAS in DSv2.

 

On Sun, Jan 13, 2019 at 2:22 PM Ryan Blue  wrote:

I think that the solution to this problem is to mix the two approaches by 
supporting 3 identifier parts: catalog, namespace, and name, where namespace 
can be an n-part identifier:
type Namespace = Seq[String]
case class CatalogIdentifier(space: Namespace, name: String)
This allows catalogs to work with the hierarchy of the external store, but the 
catalog API only requires a few discovery methods to list namespaces and to 
list each type of object in a namespace.
def listNamespaces(): Seq[Namespace]
def listNamespaces(space: Namespace, prefix: String): Seq[Namespace]
def listTables(space: Namespace): Seq[CatalogIdentifier]
def listViews(space: Namespace): Seq[CatalogIdentifier]
def listFunctions(space: Namespace): Seq[CatalogIdentifier]
The methods to list tables, views, or functions, would only return identifiers 
for the type queried, not namespaces or the other objects.
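
To make the shape of the discovery API concrete, a toy in-memory catalog
(purely illustrative, not part of the proposal; the Namespace and
CatalogIdentifier types are the ones sketched above) could back these methods
like this:

class InMemoryCatalog {
  type Namespace = Seq[String]
  case class CatalogIdentifier(space: Namespace, name: String)

  // tables registered so far, keyed by the namespace that contains them
  private var tables = Map.empty[Namespace, Set[String]]

  def createTable(space: Namespace, name: String): Unit =
    tables += space -> (tables.getOrElse(space, Set.empty[String]) + name)

  def listNamespaces(): Seq[Namespace] = tables.keys.toSeq

  // namespaces under `space` whose next level starts with `prefix`
  def listNamespaces(space: Namespace, prefix: String): Seq[Namespace] =
    listNamespaces().filter { ns =>
      ns.startsWith(space) &&
        ns.drop(space.length).headOption.exists(_.startsWith(prefix))
    }

  def listTables(space: Namespace): Seq[CatalogIdentifier] =
    tables.getOrElse(space, Set.empty[String]).toSeq.map(CatalogIdentifier(space, _))
}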

The SQL parser would be updated so that identifiers are parsed to
UnresolvedIdentifier(parts: Seq[String]), and resolution would work like this
pseudo-code:
def resolveIdentifier(ident: UnresolvedIdentifier): (CatalogPlugin, 
CatalogIdentifier) = {
  val maybeCatalog = sparkSession.catalog(ident.parts.head)
  ident.parts match {
    case Seq(catalogName, *space, name) if maybeCatalog.isDefined =>
  (maybeCatalog.get, CatalogIdentifier(space, name))
    case Seq(*space, name) =>
  (sparkSession.defaultCatalog, CatalogIdentifier(space, name))
  }
}
I think this is a good approach because it allows Spark users to reference or
discover any name in the hierarchy of an external store, it uses a few
well-defined methods for discovery, and it makes the name hierarchy a user concern.

- SHOW (DATABASES|SCHEMAS|NAMESPACES) would return the result of
  listNamespaces()

- SHOW NAMESPACES LIKE a.b% would return the result of
  listNamespaces(Seq("a"), "b")

- USE a.b would set the current namespace to Seq("a", "b")

- SHOW TABLES would return the result of listTables(currentNamespace)

Also, I think that we could generalize this a little more to support path-based 
tables by adding a path to CatalogIdentifier, either as a namespace or as a 
separate optional string. Then, the identifier passed to a catalog would work 
for either a path-based table or a catalog table, without needing a path-based 
catalog API.
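
One way to read the "separate optional string" variant (again only an
illustrative sketch, not the proposal itself; the path values are made up):

// An identifier that can name either a catalog object or a path-based table.
case class CatalogIdentifier(
    space: Seq[String],           // namespace within the catalog, possibly empty
    name: String,                 // table, view, or function name
    path: Option[String] = None)  // set only for path-based tables

val byName = CatalogIdentifier(Seq("prod", "sales"), "orders")
val byPath = CatalogIdentifier(Seq.empty, "orders", Some("hdfs:///data/orders"))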

Thoughts?

 

On Sun, Jan 13, 2019 at 1:38 PM Ryan Blue  wrote:

In the DSv2 sync up, we tried to discuss the Table metadata proposal but were 
side-tracked on its use of TableIdentifier. There were good points about how 
Spark should identify tables, views, functions, etc, and I want to start a 
discussion here.

Identifiers are orthogonal to the TableCatalog proposal that can be updated to 
use whatever identifier class we choose. That proposal is concerned with what 
information should be passed to define a table, and how to pass that 
information.

The main question for this discussion is: how should Spark identify tables, 
views, and functions when it supports multiple catalogs?

There are two main approaches:

1.   Use a 3-part identifier, catalog.database.table 

2.   Use an identifier with an arbitrary number of parts 

Option 1: use 3-part identifiers

The argument for option #1 is that it is simple. If an external data store has 
additional logical hierarchy layers, then that hierarchy would be mapped to 
multiple catalogs in Spark. Spark can support show tables and show databases 
without much trouble. This is the approach used by Presto, so there is some 
precedent for it.

The drawback is that mapping a more complex hierarch

Re: Make proactive check for closure serializability optional?

2019-01-22 Thread Sean Owen
Agreed, I'm not pushing for it unless there's other evidence. Note that the
closure check does entail serialization, not just checking serializability.
I don't like flags either, but this one sounded like it could actually be
something a user would want to vary, globally, for runs of the same code.
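
For anyone unfamiliar with what the proactive check amounts to, it is
essentially one eager round of Java serialization on the driver. A rough
illustration of the idea (not Spark's actual ClosureCleaner code):

import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Serialize the closure eagerly so a NotSerializableException surfaces at
// definition time on the driver, rather than when the first action runs.
def ensureSerializable(closure: AnyRef): Unit = {
  val out = new ObjectOutputStream(new ByteArrayOutputStream())
  try out.writeObject(closure)
  finally out.close()
}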

On Tue, Jan 22, 2019 at 11:25 AM Reynold Xin  wrote:

> Typically very large closures include some array, and the serialization
> itself should be much more expensive than the closure check. Does anybody
> have actual data showing this could be a problem? We don't want to add a
> config flag if it doesn't make sense to change in virtually any case.
>
>


Re: Make proactive check for closure serializability optional?

2019-01-22 Thread Reynold Xin
Typically very large closures include some array, and the serialization itself
should be much more expensive than the closure check. Does anybody have actual
data showing this could be a problem? We don't want to add a config flag if it
doesn't make sense to change in virtually any case.

On Mon, Jan 21, 2019 at 12:37 PM, Felix Cheung <felixcheun...@hotmail.com> wrote:

> 
> Agreed on the pros/cons, especially since the driver could be a data science
> notebook. Is it worthwhile making it configurable?
> 
> 
> 
>  
> From: Sean Owen <sro...@gmail.com>
> Sent: Monday, January 21, 2019 10:42 AM
> To: Reynold Xin
> Cc: dev
> Subject: Re: Make proactive check for closure serializability optional?
>  
> None except the bug / PR I linked to, which is really just a bug in
> the RowMatrix implementation; a 2GB closure isn't reasonable.
> I doubt it's much overhead in the common case, because closures are
> small and this extra check happens once per execution of the closure.
> 
> I can also imagine middle-ground cases where people are dragging along
> largeish 10MB closures (like, a model or some data) and this could add
> non-trivial memory pressure on the driver. They should be broadcasting
> those things, sure.
> 
> Given just that I'd leave it alone, but was wondering if anyone had
> ever had the same thought or more arguments that it should be
> disable-able. In 'production' one would imagine all the closures do
> serialize correctly and so this is just a bit overhead that could be
> skipped.
> 
> On Mon, Jan 21, 2019 at 12:17 PM Reynold Xin <r...@databricks.com> wrote:
> >
> > Did you actually observe a perf issue?
> >
> > On Mon, Jan 21, 2019 at 10:04 AM Sean Owen <sro...@gmail.com> wrote:
> >>
> >> The ClosureCleaner proactively checks that closures passed to
> >> transformations like RDD.map() are serializable, before they're
> >> executed. It does this by just serializing it with the JavaSerializer.
> >>
> >> That's a nice feature, although there's overhead in always trying to
> >> serialize the closure ahead of time, especially if the closure is
> >> large. It shouldn't be large, usually. But I noticed it when coming up
> >> with this fix: https://github.com/apache/spark/pull/23600
> >>
> >> It made me wonder, should this be optional, or even not the default?
> >> Closures that don't serialize still fail, just later when an action is
> >> invoked. I don't feel strongly about it, just checking if anyone had
> >> pondered this before.
> >>
> 
>
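
As a footnote to Sean's point above about broadcasting largeish objects rather
than dragging them along in the closure, the pattern is roughly this (a sketch
that assumes an existing SparkContext `sc` and an RDD[Double] `rdd`; the model
and scoring function are stand-ins):

val model: Array[Double] = Array.fill(1000000)(0.5)            // largeish stand-in model
def score(m: Array[Double], x: Double): Double = m.length * x  // stand-in scoring fn

// Capturing `model` directly means it rides along inside the serialized closure:
val scored = rdd.map(x => score(model, x))

// Broadcasting sends it to each executor once; the closure only captures the handle:
val modelBc = sc.broadcast(model)
val scoredBc = rdd.map(x => score(modelBc.value, x))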

Re: [VOTE] Release Apache Spark 2.3.3 (RC1)

2019-01-22 Thread Felix Cheung
I’ve tried a couple of times. The latest test run took 12+ hours.

1 aborted suite:
00:53:25.769 WARN org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite: 
Failed to download Spark 2.3.2 from 
http://mirrors.koehn.com/apache//spark/spark-2.3.2/spark-2.3.2-bin-hadoop2.7.tgz:
 Error writing to server
00:53:25.812 WARN org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite: 
Failed to download Spark 2.3.2 from 
http://mirror.cc.columbia.edu/pub/software/apache//spark/spark-2.3.2/spark-2.3.2-bin-hadoop2.7.tgz:
 Error writing to server
00:53:25.838 WARN org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite: 
Failed to download Spark 2.3.2 from 
https://archive.apache.org/dist/spark/spark-2.3.2/spark-2.3.2-bin-hadoop2.7.tgz:
 Socket closed

org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite *** ABORTED ***
Exception encountered when invoking run on a nested suite - Unable to download 
Spark 2.3.2 (HiveExternalCatalogVersionsSuite.scala:97)

And then it stopped. I checked this morning and the archive link should be valid.
I'll try again and see if I can resume from it.



From: Takeshi Yamamuro 
Sent: Sunday, January 20, 2019 6:45 PM
To: Sean Owen
Cc: Spark dev list
Subject: Re: [VOTE] Release Apache Spark 2.3.3 (RC1)

Oh, sorry about that; I misunderstood the Apache release policy.
Yeah, it's OK to keep the RC1 vote open.

Best,
Takeshi

On Mon, Jan 21, 2019 at 11:07 AM Sean Owen <sro...@gmail.com> wrote:
OK, if it passes tests, I'm +1 on the release.
Can anyone else verify the tests pass?

What is the reason for a new RC? I didn't see any other issues reported.

On Sun, Jan 20, 2019 at 8:03 PM Takeshi Yamamuro <linguin@gmail.com> wrote:
>
> Hi, all
>
> Thanks for the checks, Sean and Felix.
> I'll start the next vote as RC2 this Tuesday noon (PST).
>
> > Sean
> I re-ran JavaTfIdfSuite in my environment and it passed.
> I used `-Pyarn -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos -Psparkr` and
> ran the tests on an EC2 instance described below (I launched a new instance
> for the tests):
> 
> $ cat /etc/os-release
> NAME="Amazon Linux"
> VERSION="2"
> ID="amzn"
> ID_LIKE="centos rhel fedora"
> VERSION_ID="2"
> PRETTY_NAME="Amazon Linux 2"
> ANSI_COLOR="0;33"
> CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
> HOME_URL="https://amazonlinux.com/";
> $ java -version
> openjdk version "1.8.0_191"
> OpenJDK Runtime Environment (build 1.8.0_191-b12)
> OpenJDK 64-Bit Server VM (build 25.191-b12, mixed mode)
>
>
>
>
> On Mon, Jan 21, 2019 at 9:53 AM Felix Cheung <felixcheun...@hotmail.com> wrote:
>>
>> +1
>>
>> My focus is on R (sorry, I couldn't cross-validate what Sean is seeing).
>>
>> tested:
>> reviewed doc
>> R package test
>> win-builder, r-hub
>> Tarball/package signature
>>
>>
>>
>> 
>> From: Takeshi Yamamuro <linguin@gmail.com>
>> Sent: Thursday, January 17, 2019 6:49 PM
>> To: Spark dev list
>> Subject: [VOTE] Release Apache Spark 2.3.3 (RC1)
>>
>> Please vote on releasing the following candidate as Apache Spark version 
>> 2.3.3.
>>
>> The vote is open until January 20, 8:00 PM (PST) and passes if a majority of
>> +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 2.3.3
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v2.3.3-rc1 (commit 
>> b5ea9330e3072e99841270b10dc1d2248127064b):
>> https://github.com/apache/spark/tree/v2.3.3-rc1
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v2.3.3-rc1-bin/
>>
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1297
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v2.3.3-rc1-docs/
>>
>> The list of bug fixes going into 2.3.3 can be found at the following URL:
>> https://issues.apache.org/jira/projects/SPARK/versions/12343759
>>
>> FAQ
>>
>> =
>> How can I help test this release?
>> =
>>
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> If you're working in PySpark, you can set up a virtual env and install
>> the current RC and see if anything important breaks. In Java/Scala,
>> you can add the staging repository to your project's resolvers and test
>> with the RC (make sure to clean up the artifact cache before/after so
>> you don't end up building with an out-of-date RC going forward).
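
For the Java/Scala path, adding the staging repository in sbt amounts to
something like the following (a sketch for a build.sbt; the resolver URL is
the staging repository listed above):

// build.sbt -- resolve Spark 2.3.3 RC1 artifacts from the staging repository
resolvers += "spark-2.3.3-rc1-staging" at
  "https://repository.apache.org/content/repositories/orgapachespark-1297/"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.3" % "provided"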
>>
>> ===
>> What should happen to JIRA tickets still targeting 2.3.3?
>> ==