Re: Removing old HiveMetastore(0.12~0.14) from Spark 3.0.0?
Based on my experience developing Spark SQL, the maintenance cost of supporting different versions of the Hive metastore is very small. Feel free to ping me if we hit any issue with it.

Cheers,
Xiao

On Tue, Jan 22, 2019 at 11:18 PM, Reynold Xin wrote:
> Actually a non-trivial fraction of users / customers I interact with still use very old Hive metastores. ...
Re: Removing old HiveMetastore(0.12~0.14) from Spark 3.0.0?
Actually, a non-trivial fraction of the users / customers I interact with still use very old Hive metastores, because it is very difficult to upgrade a Hive metastore wholesale (it would require all the production jobs that access the same metastore to be upgraded at once). This is even harder than a JVM upgrade, which can be done on a per-job basis, or an OS upgrade, which can be done on a per-machine basis.

Is there a high maintenance cost to keeping these? My understanding is that Michael initially did a good job with classloader isolation and modular design, so that they are very easy to maintain.

On Jan 22, 2019, at 11:13 PM, Hyukjin Kwon wrote:
> Yea, I was thinking about that too. They are too old to keep. +1 for removing them. ...
Re: Removing old HiveMetastore(0.12~0.14) from Spark 3.0.0?
Yea, I was thinking about that too. They are too old to keep. +1 for removing them.

On Wed, Jan 23, 2019 at 11:30 AM, Dongjoon Hyun wrote:
> Hi, All. Currently, Apache Spark supports Hive Metastore (HMS) 0.12 ~ 2.3. ...
Removing old HiveMetastore(0.12~0.14) from Spark 3.0.0?
Hi, All.

Currently, Apache Spark supports Hive Metastore (HMS) 0.12 ~ 2.3. Among them, the HMS 0.x releases look very old since we are in 2019. If these are not used in production any more, can we drop HMS 0.x support in 3.0.0?

hive-0.12.0  2013-10-10
hive-0.13.0  2014-04-15
hive-0.13.1  2014-11-16
hive-0.14.0  2014-11-16
( https://archive.apache.org/dist/hive/ )

In addition, if there is someone who is still using these HMS versions and has a plan to install and use Spark 3.0.0 with them, could you reply to this email thread? If there is a reason, that would be very helpful for me.

Thanks,
Dongjoon.
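For readers checking whether they depend on an old metastore: Spark selects the Hive metastore client version through the documented `spark.sql.hive.metastore.version` and `spark.sql.hive.metastore.jars` settings. A minimal illustrative fragment follows; the classpath below is a placeholder, not a real location:

```
# spark-defaults.conf fragment (illustrative; the jar path is a placeholder)
spark.sql.hive.metastore.version   0.13.1
spark.sql.hive.metastore.jars      /opt/hive-0.13.1/lib/*
```

If your deployment sets `spark.sql.hive.metastore.version` to a 0.x value, it would be affected by the removal discussed in this thread.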
Re: [DISCUSS] Identifiers with multi-catalog support
Thanks for reviewing this! I'll create an SPIP doc and an issue for it and call a vote.

On Tue, Jan 22, 2019 at 11:41 AM, Matt Cheah wrote:
> +1 for the n-part namespace as proposed. Agree that a short SPIP would be appropriate for this. Perhaps also a JIRA ticket? ...
Re: [DISCUSS] Identifiers with multi-catalog support
+1 for the n-part namespace as proposed. Agree that a short SPIP would be appropriate for this. Perhaps also a JIRA ticket?

-Matt Cheah

From: Felix Cheung
Date: Sunday, January 20, 2019 at 4:48 PM
To: "rb...@netflix.com", Spark Dev List <dev@spark.apache.org>
Subject: Re: [DISCUSS] Identifiers with multi-catalog support

+1. I like Ryan's last mail; thank you for putting it clearly (it should be a spec/SPIP!).

I agree with and understand the need for a 3-part id. However, I don't think we should make the assumption that it must be, or can only be, as long as 3 parts. Once the catalog is identified (i.e. the first part), the catalog should be responsible for resolving the namespace or schema, etc. I also agree that a path is a good idea to add, to support the file-based variant. Should the separator be optional (perhaps in *space) to keep this extensible? It might not always be '.'.

Also, this whole scheme will need to play nicely with column identifiers as well.

From: Ryan Blue
Sent: Thursday, January 17, 2019 11:38 AM
To: Spark Dev List
Subject: Re: [DISCUSS] Identifiers with multi-catalog support

Any discussion on how Spark should manage identifiers when multiple catalogs are supported? I know this is an area where a lot of people are interested in making progress, and it is a blocker for both multi-catalog support and CTAS in DSv2.

On Sun, Jan 13, 2019 at 2:22 PM, Ryan Blue wrote:

I think that the solution to this problem is to mix the two approaches by supporting 3 identifier parts -- catalog, namespace, and name -- where namespace can be an n-part identifier:

  type Namespace = Seq[String]
  case class CatalogIdentifier(space: Namespace, name: String)

This allows catalogs to work with the hierarchy of the external store, but the catalog API only requires a few discovery methods to list namespaces and to list each type of object in a namespace:

  def listNamespaces(): Seq[Namespace]
  def listNamespaces(space: Namespace, prefix: String): Seq[Namespace]
  def listTables(space: Namespace): Seq[CatalogIdentifier]
  def listViews(space: Namespace): Seq[CatalogIdentifier]
  def listFunctions(space: Namespace): Seq[CatalogIdentifier]

The methods to list tables, views, or functions would only return identifiers for the type queried, not namespaces or the other objects.

The SQL parser would be updated so that identifiers are parsed to UnresolvedIdentifier(parts: Seq[String]), and resolution would work like this pseudo-code:

  def resolveIdentifier(ident: UnresolvedIdentifier): (CatalogPlugin, CatalogIdentifier) = {
    val maybeCatalog = sparkSession.catalog(ident.parts.head)
    ident.parts match {
      case Seq(catalogName, *space, name) if maybeCatalog.isDefined =>
        (maybeCatalog.get, CatalogIdentifier(space, name))
      case Seq(*space, name) =>
        (sparkSession.defaultCatalog, CatalogIdentifier(space, name))
    }
  }

I think this is a good approach because it allows Spark users to reference or discover any name in the hierarchy of an external store, it uses a few well-defined methods for discovery, and it makes name hierarchy a user concern.

  - SHOW (DATABASES|SCHEMAS|NAMESPACES) would return the result of listNamespaces()
  - SHOW NAMESPACES LIKE a.b% would return the result of listNamespaces(Seq("a"), "b")
  - USE a.b would set the current namespace to Seq("a", "b")
  - SHOW TABLES would return the result of listTables(currentNamespace)

Also, I think that we could generalize this a little more to support path-based tables by adding a path to CatalogIdentifier, either as a namespace or as a separate optional string. Then, the identifier passed to a catalog would work for either a path-based table or a catalog table, without needing a path-based catalog API.

Thoughts?

On Sun, Jan 13, 2019 at 1:38 PM, Ryan Blue wrote:

In the DSv2 sync-up, we tried to discuss the Table metadata proposal but were side-tracked by its use of TableIdentifier. There were good points about how Spark should identify tables, views, functions, etc., and I want to start a discussion here.

Identifiers are orthogonal to the TableCatalog proposal, which can be updated to use whatever identifier class we choose. That proposal is concerned with what information should be passed to define a table, and how to pass that information.

The main question for this discussion is: how should Spark identify tables, views, and functions when it supports multiple catalogs?

There are two main approaches:

1. Use a 3-part identifier, catalog.database.table
2. Use an identifier with an arbitrary number of parts

Option 1: use 3-part identifiers

The argument for option #1 is that it is simple. If an external data store has additional logical hierarchy layers, then that hierarchy would be mapped to multiple catalogs in Spark. Spark can support SHOW TABLES and SHOW DATABASES without much trouble. This is the approach used by Presto, so there is some precedent for it. The drawback is that mapping a more complex hierarch
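As an aside, the resolution pseudo-code quoted in this thread can be exercised with a small, self-contained Scala sketch. This is not Spark's API: `resolveIdentifier` below, the `Map`-based catalog lookup, and the catalog names are simplified stand-ins for illustration only.

```scala
// Hypothetical sketch of the proposed n-part identifier resolution.
// A real implementation would consult SparkSession's registered catalogs;
// here a plain Map stands in for that lookup.
case class CatalogIdentifier(space: Seq[String], name: String)

def resolveIdentifier(
    parts: Seq[String],
    catalogs: Map[String, String], // catalog name -> catalog (stand-in for CatalogPlugin)
    defaultCatalog: String): (String, CatalogIdentifier) =
  parts match {
    // First part names a known catalog: the rest is namespace + object name.
    case head +: rest if rest.nonEmpty && catalogs.contains(head) =>
      (catalogs(head), CatalogIdentifier(rest.init, rest.last))
    // Otherwise the whole identifier resolves against the default catalog.
    case _ =>
      (defaultCatalog, CatalogIdentifier(parts.init, parts.last))
  }
```

With a registered catalog named "prod", Seq("prod", "db", "tbl") resolves to the prod catalog with namespace Seq("db"), while Seq("db", "tbl") falls through to the default catalog with the same namespace and name -- the ambiguity between the two cases is exactly what the thread is discussing.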
Re: Make proactive check for closure serializability optional?
Agreed; I'm not pushing for it unless there's other evidence. Note that the closure check does entail actually serializing the closure, not just checking serializability. I don't like flags either, but this one sounded like something a user might actually want to vary, globally, for runs of the same code.

On Tue, Jan 22, 2019 at 11:25 AM, Reynold Xin wrote:
> Typically, very large closures include some array, and the serialization itself should be much more expensive than the closure check. ...
Re: Make proactive check for closure serializability optional?
Typically, very large closures include some array, and the serialization itself should be much more expensive than the closure check. Does anybody have actual data showing this could be a problem? We don't want to add a config flag if, for virtually any case, it doesn't make sense to change it.

On Mon, Jan 21, 2019 at 12:37 PM, Felix Cheung <felixcheun...@hotmail.com> wrote:
> Agreed on the pros / cons, especially that the driver could be a data science notebook. Is it worthwhile making it configurable?
>
> From: Sean Owen <sro...@gmail.com>
> Sent: Monday, January 21, 2019 10:42 AM
> To: Reynold Xin
> Cc: dev
> Subject: Re: Make proactive check for closure serializability optional?
>
> None except the bug / PR I linked to, which is really just a bug in the RowMatrix implementation; a 2GB closure isn't reasonable. I doubt it's much overhead in the common case, because closures are small and this extra check happens once per execution of the closure.
>
> I can also imagine middle-ground cases where people are dragging along largish 10MB closures (like a model or some data), and this could add non-trivial memory pressure on the driver. They should be broadcasting those things, sure.
>
> Given just that, I'd leave it alone, but I was wondering if anyone had ever had the same thought, or more arguments that it should be disable-able. In 'production' one would imagine all the closures do serialize correctly, so this is just a bit of overhead that could be skipped.
>
> On Mon, Jan 21, 2019 at 12:17 PM, Reynold Xin <r...@databricks.com> wrote:
>> Did you actually observe a perf issue?
>>
>> On Mon, Jan 21, 2019 at 10:04 AM, Sean Owen <sro...@gmail.com> wrote:
>>> The ClosureCleaner proactively checks that closures passed to transformations like RDD.map() are serializable before they're executed. It does this by just serializing them with the JavaSerializer.
>>>
>>> That's a nice feature, although there's overhead in always trying to serialize the closure ahead of time, especially if the closure is large. It shouldn't be large, usually. But I noticed it when coming up with this fix: https://github.com/apache/spark/pull/23600
>>>
>>> It made me wonder: should this be optional, or even not the default? Closures that don't serialize still fail, just later, when an action is invoked. I don't feel strongly about it; just checking if anyone had pondered this before.
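To make the trade-off concrete, the proactive check amounts to attempting a Java serialization pass over the closure up front, so a non-serializable capture fails at definition time rather than when an action runs. A self-contained sketch follows; it mimics only the idea, not Spark's actual ClosureCleaner, which does considerably more (such as nulling out unused outer references). The `Handle` class and both closures are invented for illustration.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Proactive check: try to serialize the closure now. This serialization
// pass is exactly the overhead being discussed in the thread.
def isSerializable(closure: AnyRef): Boolean =
  try {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    out.writeObject(closure)
    out.close()
    true
  } catch {
    case _: NotSerializableException => false
  }

class Handle // deliberately NOT Serializable

val fine = (x: Int) => x + 1 // captures nothing
val broken = { val h = new Handle; (x: Int) => h.hashCode + x } // captures a non-serializable object
```

In Scala 2.12+, lambdas themselves are serializable, so `fine` passes the check, while `broken` fails only because of its captured `Handle`; this is the same failure Spark surfaces as "Task not serializable".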
Re: [VOTE] Release Apache Spark 2.3.3 (RC1)
I’ve tried a couple of times. The latest test run took 12+ hours, with one aborted suite:

00:53:25.769 WARN org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite: Failed to download Spark 2.3.2 from http://mirrors.koehn.com/apache//spark/spark-2.3.2/spark-2.3.2-bin-hadoop2.7.tgz: Error writing to server
00:53:25.812 WARN org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite: Failed to download Spark 2.3.2 from http://mirror.cc.columbia.edu/pub/software/apache//spark/spark-2.3.2/spark-2.3.2-bin-hadoop2.7.tgz: Error writing to server
00:53:25.838 WARN org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite: Failed to download Spark 2.3.2 from https://archive.apache.org/dist/spark/spark-2.3.2/spark-2.3.2-bin-hadoop2.7.tgz: Socket closed
org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite *** ABORTED ***
Exception encountered when invoking run on a nested suite - Unable to download Spark 2.3.2 (HiveExternalCatalogVersionsSuite.scala:97)

And then it stopped. I checked this morning, and the archive link should be valid. I'll see if I can try again / resume from it.

From: Takeshi Yamamuro
Sent: Sunday, January 20, 2019 6:45 PM
To: Sean Owen
Cc: Spark dev list
Subject: Re: [VOTE] Release Apache Spark 2.3.3 (RC1)

Oh, sorry for that; I had misunderstood the Apache release policy. Yes, it's OK to keep the RC1 vote going.

Best,
Takeshi

On Mon, Jan 21, 2019 at 11:07 AM, Sean Owen <sro...@gmail.com> wrote:
> OK, if it passes tests, I'm +1 on the release. Can anyone else verify that the tests pass? What is the reason for a new RC? I didn't see any other issues reported.
>
> On Sun, Jan 20, 2019 at 8:03 PM, Takeshi Yamamuro wrote:
>> Hi, all
>>
>> Thanks for the checks, Sean and Felix. I'll start the next vote as RC2 this Tuesday noon (PST).
>>
>> Sean: I re-ran JavaTfIdfSuite on my env and it passed. I used `-Pyarn -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos -Psparkr` and ran the tests on the EC2 instance below (I launched a new instance for the tests):
>>
>> $ cat /etc/os-release
>> NAME="Amazon Linux"
>> VERSION="2"
>> ID="amzn"
>> ID_LIKE="centos rhel fedora"
>> VERSION_ID="2"
>> PRETTY_NAME="Amazon Linux 2"
>> ANSI_COLOR="0;33"
>> CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
>> HOME_URL="https://amazonlinux.com/"
>> $ java -version
>> openjdk version "1.8.0_191"
>> OpenJDK Runtime Environment (build 1.8.0_191-b12)
>> OpenJDK 64-Bit Server VM (build 25.191-b12, mixed mode)
>>
>> On Mon, Jan 21, 2019 at 9:53 AM, Felix Cheung <felixcheun...@hotmail.com> wrote:
>>> +1
>>>
>>> My focus is on R (sorry, I couldn't cross-validate what Sean is seeing). Tested:
>>> - reviewed doc
>>> - R package tests
>>> - win-builder, r-hub
>>> - tarball / package signatures
>>>
>>> From: Takeshi Yamamuro
>>> Sent: Thursday, January 17, 2019 6:49 PM
>>> To: Spark dev list
>>> Subject: [VOTE] Release Apache Spark 2.3.3 (RC1)
>>>
>>> Please vote on releasing the following candidate as Apache Spark version 2.3.3.
>>>
>>> The vote is open until January 20, 8:00 PM (PST) and passes if a majority of +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>>
>>> [ ] +1 Release this package as Apache Spark 2.3.3
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>
>>> The tag to be voted on is v2.3.3-rc1 (commit b5ea9330e3072e99841270b10dc1d2248127064b):
>>> https://github.com/apache/spark/tree/v2.3.3-rc1
>>>
>>> The release files, including signatures, digests, etc., can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v2.3.3-rc1-bin/
>>>
>>> Signatures used for Spark RCs can be found in this file:
>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1297
>>>
>>> The documentation corresponding to this release can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v2.3.3-rc1-docs/
>>>
>>> The list of bug fixes going into 2.3.3 can be found at the following URL:
>>> https://issues.apache.org/jira/projects/SPARK/versions/12343759
>>>
>>> FAQ
>>>
>>> =
>>> How can I help test this release?
>>> =
>>>
>>> If you are a Spark user, you can help us test this release by taking an existing Spark workload and running it on this release candidate, then reporting any regressions.
>>>
>>> If you're working in PySpark, you can set up a virtual env, install the current RC, and see if anything important breaks. In Java/Scala, you can add the staging repository to your project's resolvers and test with the RC (make sure to clean up the artifact cache before/after so you don't end up building with an out-of-date RC going forward).
>>>
>>> ===
>>> What should happen to JIRA tickets still targeting 2.3.3?
>>> ==