Re: [VOTE] SPARK 2.4.0 (RC1)

2018-09-17 Thread Holden Karau
Deprecating Py 2 in the 2.4 release probably doesn't belong in the RC vote
thread. Personally I think we might be a little too late in the game to
deprecate it in 2.4, but I think calling it out as "soon to be deprecated"
in the release docs would be sensible to give folks extra time to prepare.

On Mon, Sep 17, 2018 at 2:04 PM Erik Erlandson  wrote:

>
> I have no binding vote but I second Stavros’ recommendation for spark-23200
>
> Per parallel threads on Py2 support I would also like to propose
> deprecating Py2 starting with this 2.4 release
>
> On Mon, Sep 17, 2018 at 10:38 AM Marcelo Vanzin
>  wrote:
>
>> You can log in to https://repository.apache.org and see what's wrong.
>> Just find that staging repo and look at the messages. In your case it
>> seems related to your signature.
>>
>> failureMessageNo public key: Key with id: () was not able to be
>> located on http://gpg-keyserver.de/. Upload your public key and try
>> the operation again.
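
A minimal sketch of acting on that failure, assuming a standard gpg setup (KEY_ID is a placeholder for the release manager's key, and the keyserver is the one named in the error):

    # find the key id locally, then push the public key to the keyserver
    gpg --list-keys
    gpg --keyserver hkp://gpg-keyserver.de --send-keys KEY_ID
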
>> On Sun, Sep 16, 2018 at 10:00 PM Wenchen Fan  wrote:
>> >
>> > I confirmed that
>> https://repository.apache.org/content/repositories/orgapachespark-1285
>> is not accessible. I did it via ./dev/create-release/do-release-docker.sh
>> -d /my/work/dir -s publish , not sure what's going wrong. I didn't see any
>> error message during it.
>> >
>> > Any insights are appreciated! So that I can fix it in the next RC.
>> Thanks!
>> >
>> > On Mon, Sep 17, 2018 at 11:31 AM Sean Owen  wrote:
>> >>
>> >> I think one build is enough, but haven't thought it through. The
>> >> Hadoop 2.6/2.7 builds are already nearly redundant. 2.12 is probably
>> >> best advertised as a 'beta'. So maybe publish a no-hadoop build of it?
>> >> Really, whatever's the easy thing to do.
>> >> On Sun, Sep 16, 2018 at 10:28 PM Wenchen Fan 
>> wrote:
>> >> >
>> >> > Ah I missed the Scala 2.12 build. Do you mean we should publish a
>> Scala 2.12 build this time? Currently for Scala 2.11 we have 3 builds: with
>> hadoop 2.7, with hadoop 2.6, without hadoop. Shall we do the same thing for
>> Scala 2.12?
>> >> >
>> >> > On Mon, Sep 17, 2018 at 11:14 AM Sean Owen 
>> wrote:
>> >> >>
>> >> >> A few preliminary notes:
>> >> >>
>> >> >> Wenchen for some weird reason when I hit your key in gpg --import,
>> it
>> >> >> asks for a passphrase. When I skip it, it's fine, gpg can still
>> verify
>> >> >> the signature. No issue there really.
>> >> >>
>> >> >> The staging repo gives a 404:
>> >> >>
>> https://repository.apache.org/content/repositories/orgapachespark-1285/
>> >> >> 404 - Repository "orgapachespark-1285 (staging: open)"
>> >> >> [id=orgapachespark-1285] exists but is not exposed.
>> >> >>
>> >> >> The (revamped) licenses are OK, though there are some minor glitches
>> >> >> in the final release tarballs (my fault) : there's an extra
>> directory,
>> >> >> and the source release has both binary and source licenses. I'll fix
>> >> >> that. Not strictly necessary to reject the release over those.
>> >> >>
>> >> >> Last, when I check the staging repo I'll get my answer, but, were
>> you
>> >> >> able to build 2.12 artifacts as well?
>> >> >>
>> >> >> On Sun, Sep 16, 2018 at 9:48 PM Wenchen Fan 
>> wrote:
>> >> >> >
>> >> >> > Please vote on releasing the following candidate as Apache Spark
>> version 2.4.0.
>> >> >> >
>> >> >> > The vote is open until September 20 PST and passes if a majority
>> +1 PMC votes are cast, with
>> >> >> > a minimum of 3 +1 votes.
>> >> >> >
>> >> >> > [ ] +1 Release this package as Apache Spark 2.4.0
>> >> >> > [ ] -1 Do not release this package because ...
>> >> >> >
>> >> >> > To learn more about Apache Spark, please see
>> http://spark.apache.org/
>> >> >> >
>> >> >> > The tag to be voted on is v2.4.0-rc1 (commit
>> 1220ab8a0738b5f67dc522df5e3e77ffc83d207a):
>> >> >> > https://github.com/apache/spark/tree/v2.4.0-rc1
>> >> >> >
>> >> >> > The release files, including signatures, digests, etc. can be
>> found at:
>> >> >> > https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc1-bin/
>> >> >> >
>> >> >> > Signatures used for Spark RCs can be found in this file:
>> >> >> > https://dist.apache.org/repos/dist/dev/spark/KEYS
>> >> >> >
>> >> >> > The staging repository for this release can be found at:
>> >> >> >
>> https://repository.apache.org/content/repositories/orgapachespark-1285/
>> >> >> >
>> >> >> > The documentation corresponding to this release can be found at:
>> >> >> > https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc1-docs/
>> >> >> >
>> >> >> > The list of bug fixes going into 2.4.0 can be found at the
>> following URL:
>> >> >> > https://issues.apache.org/jira/projects/SPARK/versions/2.4.0
>> >> >> >
>> >> >> > FAQ
>> >> >> >
>> >> >> > =
>> >> >> > How can I help test this release?
>> >> >> > =
>> >> >> >
>> >> >> > If you are a Spark user, you can help us test this release by
>> taking
>> >> >> > an existing Spark workload and running on this release candidate,
>> then
>> >> >> > reporting any 

Re: Metastore problem on Spark2.3 with Hive3.0

2018-09-17 Thread Dongjoon Hyun
Hi, Jerry.

There is a JIRA issue for that,
https://issues.apache.org/jira/browse/SPARK-24360 .

So far, Hive 3.1.0 metastore support is in progress, targeted at Apache Spark 2.5.0.
You can track progress on that issue.
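
As an aside, a hedged sketch of pinning Spark's built-in Hive metastore client to a version it already supports; the values below are illustrative, and Spark 2.3 does not accept 3.x versions here, which is exactly the gap SPARK-24360 tracks:

    # app jar and version values are placeholders; metastore.jars=maven downloads the matching client
    spark-submit \
      --conf spark.sql.hive.metastore.version=2.1.1 \
      --conf spark.sql.hive.metastore.jars=maven \
      your-app.jar
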

Bests,
Dongjoon.


On Mon, Sep 17, 2018 at 7:01 PM 白也诗无敌 <445484...@qq.com> wrote:

> Hi, guys
>   I am using Spark 2.3 and I have run into a metastore problem.
>   It looks like a compatibility issue, since Spark 2.3 still uses
> the hive-metastore-1.2.1-spark2.
>   Is there any solution?
>   The Hive metastore version is 3.0 and the stacktrace is below:
>
> org.apache.thrift.TApplicationException: Required field 'filesAdded' is
> unset! Struct:InsertEventRequestData(filesAdded:null)
> at
> org.apache.thrift.TApplicationException.read(TApplicationException.java:111)
> at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:79)
> at
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_fire_listener_event(ThriftHiveMetastore.java:4182)
> at
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.fire_listener_event(ThriftHiveMetastore.java:4169)
> at
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.fireListenerEvent(HiveMetaStoreClient.java:1954)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:156)
> at com.sun.proxy.$Proxy5.fireListenerEvent(Unknown Source)
> at org.apache.hadoop.hive.ql.metadata.Hive.fireInsertEvent(Hive.java:1947)
> at org.apache.hadoop.hive.ql.metadata.Hive.loadTable(Hive.java:1673)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at
> org.apache.spark.sql.hive.client.Shim_v0_14.loadTable(HiveShim.scala:847)
> at
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadTable$1.apply$mcV$sp(HiveClientImpl.scala:757)
> at
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadTable$1.apply(HiveClientImpl.scala:757)
> at
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadTable$1.apply(HiveClientImpl.scala:757)
> at
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:272)
> at
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:210)
> at
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:209)
> at
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:255)
> at
> org.apache.spark.sql.hive.client.HiveClientImpl.loadTable(HiveClientImpl.scala:756)
> at
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadTable$1.apply$mcV$sp(HiveExternalCatalog.scala:829)
> at
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadTable$1.apply(HiveExternalCatalog.scala:827)
> at
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadTable$1.apply(HiveExternalCatalog.scala:827)
> at
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
> at
> org.apache.spark.sql.hive.HiveExternalCatalog.loadTable(HiveExternalCatalog.scala:827)
> at
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.loadTable(SessionCatalog.scala:416)
> at
> org.apache.spark.sql.execution.command.LoadDataCommand.run(tables.scala:403)
> at
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
> at
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
> at
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
> at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190)
> at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190)
> at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3253)
> at
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
> at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3252)
> at org.apache.spark.sql.Dataset.<init>(Dataset.scala:190)
> at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:75)
> at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:638)
> at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:694)
> at
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:62)
> at
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:355)
> at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
> at
> 

Metastore problem on Spark2.3 with Hive3.0

2018-09-17 Thread 白也诗无敌
Hi, guys
  I am using Spark 2.3 and I have run into a metastore problem.
  It looks like a compatibility issue, since Spark 2.3 still uses the
hive-metastore-1.2.1-spark2.
  Is there any solution?
  The Hive metastore version is 3.0 and the stacktrace is below:

  
org.apache.thrift.TApplicationException: Required field 'filesAdded' is unset! 
Struct:InsertEventRequestData(filesAdded:null)
at 
org.apache.thrift.TApplicationException.read(TApplicationException.java:111)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:79)
at 
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_fire_listener_event(ThriftHiveMetastore.java:4182)
at 
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.fire_listener_event(ThriftHiveMetastore.java:4169)
at 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.fireListenerEvent(HiveMetaStoreClient.java:1954)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:156)
at com.sun.proxy.$Proxy5.fireListenerEvent(Unknown Source)
at 
org.apache.hadoop.hive.ql.metadata.Hive.fireInsertEvent(Hive.java:1947)
at org.apache.hadoop.hive.ql.metadata.Hive.loadTable(Hive.java:1673)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.spark.sql.hive.client.Shim_v0_14.loadTable(HiveShim.scala:847)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadTable$1.apply$mcV$sp(HiveClientImpl.scala:757)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadTable$1.apply(HiveClientImpl.scala:757)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadTable$1.apply(HiveClientImpl.scala:757)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:272)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:210)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:209)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:255)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.loadTable(HiveClientImpl.scala:756)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadTable$1.apply$mcV$sp(HiveExternalCatalog.scala:829)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadTable$1.apply(HiveExternalCatalog.scala:827)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadTable$1.apply(HiveExternalCatalog.scala:827)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.loadTable(HiveExternalCatalog.scala:827)
at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.loadTable(SessionCatalog.scala:416)
at 
org.apache.spark.sql.execution.command.LoadDataCommand.run(tables.scala:403)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190)
at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190)
at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3253)
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3252)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:190)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:75)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:638)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:694)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:62)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:355)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:263)
at 

Re: Should python-2 be supported in Spark 3.0?

2018-09-17 Thread Erik Erlandson
I think that makes sense. The main benefit of deprecating *prior* to 3.0
would be informational - making the community aware of the upcoming
transition earlier. But there are other ways to start informing the
community between now and 3.0, besides formal deprecation.

I have some residual curiosity about what it might mean for a release like
2.4 to still be in its support lifetime after Py2 goes EOL. I asked Apache
Legal to comment. It is
possible there are no issues with this at all.


On Mon, Sep 17, 2018 at 4:26 PM, Reynold Xin  wrote:

> i'd like to second that.
>
> if we want to communicate timeline, we can add to the release notes saying
> py2 will be deprecated in 3.0, and removed in a 3.x release.
>
> --
> excuse the brevity and lower case due to wrist injury
>
>
> On Mon, Sep 17, 2018 at 4:24 PM Matei Zaharia 
> wrote:
>
>> That’s a good point — I’d say there’s just a risk of creating a
>> perception issue. First, some users might feel that this means they have to
>> migrate now, which is before Python itself drops support; they might also
>> be surprised that we did this in a minor release (e.g. might we drop Python
>> 2 altogether in a Spark 2.5 if that later comes out?). Second, contributors
>> might feel that this means new features no longer have to work with Python
>> 2, which would be confusing. Maybe it’s OK on both fronts, but it just
>> seems scarier for users to do this now if we do plan to have Spark 3.0 in
>> the next 6 months anyway.
>>
>> Matei
>>
>> > On Sep 17, 2018, at 1:04 PM, Mark Hamstra 
>> wrote:
>> >
>> > What is the disadvantage to deprecating now in 2.4.0? I mean, it
>> doesn't change the code at all; it's just a notification that we will
>> eventually cease supporting Py2. Wouldn't users prefer to get that
>> notification sooner rather than later?
>> >
>> > On Mon, Sep 17, 2018 at 12:58 PM Matei Zaharia 
>> wrote:
>> > I’d like to understand the maintenance burden of Python 2 before
>> deprecating it. Since it is not EOL yet, it might make sense to only
>> deprecate it once it’s EOL (which is still over a year from now).
>> Supporting Python 2+3 seems less burdensome than supporting, say, multiple
>> Scala versions in the same codebase, so what are we losing out?
>> >
>> > The other thing is that even though Python core devs might not support
>> 2.x later, it’s quite possible that various Linux distros will if moving
>> from 2 to 3 remains painful. In that case, we may want Apache Spark to
>> continue releasing for it despite the Python core devs not supporting it.
>> >
>> > Basically, I’d suggest to deprecate this in Spark 3.0 and then remove
>> it later in 3.x instead of deprecating it in 2.4. I’d also consider looking
>> at what other data science tools are doing before fully removing it: for
>> example, if Pandas and TensorFlow no longer support Python 2 past some
>> point, that might be a good point to remove it.
>> >
>> > Matei
>> >
>> > > On Sep 17, 2018, at 11:01 AM, Mark Hamstra 
>> wrote:
>> > >
>> > > If we're going to do that, then we need to do it right now, since
>> 2.4.0 is already in release candidates.
>> > >
>> > > On Mon, Sep 17, 2018 at 10:57 AM Erik Erlandson 
>> wrote:
>> > > I like Mark’s concept for deprecating Py2 starting with 2.4: It may
>> seem like a ways off but even now there may be some spark versions
>> supporting Py2 past the point where Py2 is no longer receiving security
>> patches
>> > >
>> > >
>> > > On Sun, Sep 16, 2018 at 12:26 PM Mark Hamstra <
>> m...@clearstorydata.com> wrote:
>> > > We could also deprecate Py2 already in the 2.4.0 release.
>> > >
>> > > On Sat, Sep 15, 2018 at 11:46 AM Erik Erlandson 
>> wrote:
>> > > In case this didn't make it onto this thread:
>> > >
>> > > There is a 3rd option, which is to deprecate Py2 for Spark-3.0, and
>> remove it entirely on a later 3.x release.
>> > >
>> > > On Sat, Sep 15, 2018 at 11:09 AM, Erik Erlandson 
>> wrote:
>> > > On a separate dev@spark thread, I raised a question of whether or
>> not to support python 2 in Apache Spark, going forward into Spark 3.0.
>> > >
>> > > Python-2 is going EOL at the end of 2019. The upcoming release of
>> Spark 3.0 is an opportunity to make breaking changes to Spark's APIs, and
>> so it is a good time to consider support for Python-2 on PySpark.
>> > >
>> > > Key advantages to dropping Python 2 are:
>> > >   • Support for PySpark becomes significantly easier.
>> > >   • Avoid having to support Python 2 until Spark 4.0, which is
>> likely to imply supporting Python 2 for some time after it goes EOL.
>> > > (Note that supporting python 2 after EOL means, among other things,
>> that PySpark would be supporting a version of python that was no longer
>> receiving security patches)
>> > >
>> > > The main disadvantage is that PySpark users who have legacy python-2
>> code would have to migrate their code to python 3 to take advantage of
>> Spark 3.0
>> > >
>> > > This decision obviously has 

Re: [VOTE] SPARK 2.3.2 (RC6)

2018-09-17 Thread Saisai Shao
+1 from my own side.

Thanks
Saisai

On Tue, Sep 18, 2018 at 9:34 AM Wenchen Fan  wrote:

> +1. All the blocker issues are resolved in 2.3.2 AFAIK.
>
> On Tue, Sep 18, 2018 at 9:23 AM Sean Owen  wrote:
>
>> +1 . Licenses and sigs check out as in previous 2.3.x releases. A
>> build from source with most profiles passed for me.
>> On Mon, Sep 17, 2018 at 8:17 AM Saisai Shao 
>> wrote:
>> >
>> > Please vote on releasing the following candidate as Apache Spark
>> version 2.3.2.
>> >
>> > The vote is open until September 21 PST and passes if a majority +1 PMC
>> votes are cast, with a minimum of 3 +1 votes.
>> >
>> > [ ] +1 Release this package as Apache Spark 2.3.2
>> > [ ] -1 Do not release this package because ...
>> >
>> > To learn more about Apache Spark, please see http://spark.apache.org/
>> >
>> > The tag to be voted on is v2.3.2-rc6 (commit
>> 02b510728c31b70e6035ad541bfcdc2b59dcd79a):
>> > https://github.com/apache/spark/tree/v2.3.2-rc6
>> >
>> > The release files, including signatures, digests, etc. can be found at:
>> > https://dist.apache.org/repos/dist/dev/spark/v2.3.2-rc6-bin/
>> >
>> > Signatures used for Spark RCs can be found in this file:
>> > https://dist.apache.org/repos/dist/dev/spark/KEYS
>> >
>> > The staging repository for this release can be found at:
>> > https://repository.apache.org/content/repositories/orgapachespark-1286/
>> >
>> > The documentation corresponding to this release can be found at:
>> > https://dist.apache.org/repos/dist/dev/spark/v2.3.2-rc6-docs/
>> >
>> > The list of bug fixes going into 2.3.2 can be found at the following
>> URL:
>> > https://issues.apache.org/jira/projects/SPARK/versions/12343289
>> >
>> >
>> > FAQ
>> >
>> > =
>> > How can I help test this release?
>> > =
>> >
>> > If you are a Spark user, you can help us test this release by taking
>> > an existing Spark workload and running on this release candidate, then
>> > reporting any regressions.
>> >
>> > If you're working in PySpark you can set up a virtual env and install
>> > the current RC and see if anything important breaks; in Java/Scala,
>> > you can add the staging repository to your project's resolvers and test
>> > with the RC (make sure to clean up the artifact cache before/after so
>> > you don't end up building with an out-of-date RC going forward).
>> >
>> > ===
>> > What should happen to JIRA tickets still targeting 2.3.2?
>> > ===
>> >
>> > The current list of open tickets targeted at 2.3.2 can be found at:
>> > https://issues.apache.org/jira/projects/SPARK and search for "Target
>> Version/s" = 2.3.2
>> >
>> > Committers should look at those and triage. Extremely important bug
>> > fixes, documentation, and API tweaks that impact compatibility should
>> > be worked on immediately. Everything else please retarget to an
>> > appropriate release.
>> >
>> > ==
>> > But my bug isn't fixed?
>> > ==
>> >
>> > In order to make timely releases, we will typically not hold the
>> > release unless the bug in question is a regression from the previous
>> > release. That being said, if there is something which is a regression
>> > that has not been correctly targeted please ping me or a committer to
>> > help target the issue.
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: [VOTE] SPARK 2.4.0 (RC1)

2018-09-17 Thread Saisai Shao
Hi Wenchen,

I think you need to set SPHINXPYTHON to your Python 3 executable before building
the docs, to work around the doc issue (
https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc1-docs/_site/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression
).

Here are the notes from the release page:


>- Ensure you have Python 3 with Sphinx installed, and that the SPHINXPYTHON
>  environment variable is set to point to your Python 3 executable (see SPARK-24530).
>
>
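
A minimal sketch of what that looks like when building the docs locally (the interpreter path is an assumption; the jekyll invocation follows docs/README.md):

    # make the Sphinx-based PySpark API docs build with Python 3 (see SPARK-24530)
    export SPHINXPYTHON=/usr/bin/python3
    cd docs
    jekyll build
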
On Mon, Sep 17, 2018 at 10:48 AM Wenchen Fan  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 2.4.0.
>
> The vote is open until September 20 PST and passes if a majority +1 PMC
> votes are cast, with
> a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 2.4.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.4.0-rc1 (commit
> 1220ab8a0738b5f67dc522df5e3e77ffc83d207a):
> https://github.com/apache/spark/tree/v2.4.0-rc1
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc1-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1285/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc1-docs/
>
> The list of bug fixes going into 2.4.0 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/2.4.0
>
> FAQ
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks; in Java/Scala,
> you can add the staging repository to your project's resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out-of-date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 2.4.0?
> ===
>
> The current list of open tickets targeted at 2.4.0 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 2.4.0
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>


Re: [VOTE] SPARK 2.3.2 (RC6)

2018-09-17 Thread Wenchen Fan
+1. All the blocker issues are resolved in 2.3.2 AFAIK.

On Tue, Sep 18, 2018 at 9:23 AM Sean Owen  wrote:

> +1 . Licenses and sigs check out as in previous 2.3.x releases. A
> build from source with most profiles passed for me.
> On Mon, Sep 17, 2018 at 8:17 AM Saisai Shao 
> wrote:
> >
> > Please vote on releasing the following candidate as Apache Spark version
> 2.3.2.
> >
> > The vote is open until September 21 PST and passes if a majority +1 PMC
> votes are cast, with a minimum of 3 +1 votes.
> >
> > [ ] +1 Release this package as Apache Spark 2.3.2
> > [ ] -1 Do not release this package because ...
> >
> > To learn more about Apache Spark, please see http://spark.apache.org/
> >
> > The tag to be voted on is v2.3.2-rc6 (commit
> 02b510728c31b70e6035ad541bfcdc2b59dcd79a):
> > https://github.com/apache/spark/tree/v2.3.2-rc6
> >
> > The release files, including signatures, digests, etc. can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v2.3.2-rc6-bin/
> >
> > Signatures used for Spark RCs can be found in this file:
> > https://dist.apache.org/repos/dist/dev/spark/KEYS
> >
> > The staging repository for this release can be found at:
> > https://repository.apache.org/content/repositories/orgapachespark-1286/
> >
> > The documentation corresponding to this release can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v2.3.2-rc6-docs/
> >
> > The list of bug fixes going into 2.3.2 can be found at the following URL:
> > https://issues.apache.org/jira/projects/SPARK/versions/12343289
> >
> >
> > FAQ
> >
> > =
> > How can I help test this release?
> > =
> >
> > If you are a Spark user, you can help us test this release by taking
> > an existing Spark workload and running on this release candidate, then
> > reporting any regressions.
> >
> > If you're working in PySpark you can set up a virtual env and install
> > the current RC and see if anything important breaks; in Java/Scala,
> > you can add the staging repository to your project's resolvers and test
> > with the RC (make sure to clean up the artifact cache before/after so
> > you don't end up building with an out-of-date RC going forward).
> >
> > ===
> > What should happen to JIRA tickets still targeting 2.3.2?
> > ===
> >
> > The current list of open tickets targeted at 2.3.2 can be found at:
> > https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 2.3.2
> >
> > Committers should look at those and triage. Extremely important bug
> > fixes, documentation, and API tweaks that impact compatibility should
> > be worked on immediately. Everything else please retarget to an
> > appropriate release.
> >
> > ==
> > But my bug isn't fixed?
> > ==
> >
> > In order to make timely releases, we will typically not hold the
> > release unless the bug in question is a regression from the previous
> > release. That being said, if there is something which is a regression
> > that has not been correctly targeted please ping me or a committer to
> > help target the issue.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [Discuss] Datasource v2 support for manipulating partitions

2018-09-17 Thread tigerquoll
Hi Jayesh,
I get where you are coming from: partitions are just an implementation
optimisation that we really shouldn’t be bothering the end user with.
Unfortunately, that view is like saying RPC is just a procedure call and that
the details of the network transport should be hidden from the end user. CORBA
tried this approach for RPC and failed, for the same reason that no major
vendor of DBMS systems that support partitions tries to hide them from the end
user. Partitions have a substantial real-world effect that is impossible to
hide from the user (in particular when writing/modifying the data source). Any
attempt to “take care” of partitions automatically invariably guesses wrong
and ends up frustrating the end user, as “substantial real-world effect”
turns into “show-stopping performance penalty” if the user attempts to fight
against a partitioning scheme she has no idea exists.

So if we are not hiding partitions from the user, we need to allow users to
manipulate them: either by representing them generically in the API, by
allowing pass-through commands to manipulate them, or by some other means.

Regards,
Dale.




--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] SPARK 2.3.2 (RC6)

2018-09-17 Thread Sean Owen
+1 . Licenses and sigs check out as in previous 2.3.x releases. A
build from source with most profiles passed for me.
On Mon, Sep 17, 2018 at 8:17 AM Saisai Shao  wrote:
>
> Please vote on releasing the following candidate as Apache Spark version 
> 2.3.2.
>
> The vote is open until September 21 PST and passes if a majority +1 PMC votes 
> are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 2.3.2
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.3.2-rc6 (commit 
> 02b510728c31b70e6035ad541bfcdc2b59dcd79a):
> https://github.com/apache/spark/tree/v2.3.2-rc6
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.3.2-rc6-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
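
A minimal sketch of checking a downloaded artifact against those keys (the artifact name below is illustrative):

    curl -O https://dist.apache.org/repos/dist/dev/spark/KEYS
    gpg --import KEYS
    # each artifact in the -bin directory ships with a detached .asc signature
    gpg --verify spark-2.3.2-bin-hadoop2.7.tgz.asc spark-2.3.2-bin-hadoop2.7.tgz
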
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1286/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.3.2-rc6-docs/
>
> The list of bug fixes going into 2.3.2 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12343289
>
>
> FAQ
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks; in Java/Scala,
> you can add the staging repository to your project's resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out-of-date RC going forward).
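
For the PySpark route, a hedged sketch (the artifact URL and file name are assumptions based on the -bin directory above):

    python3 -m venv /tmp/spark-2.3.2-rc6
    source /tmp/spark-2.3.2-rc6/bin/activate
    pip install https://dist.apache.org/repos/dist/dev/spark/v2.3.2-rc6-bin/pyspark-2.3.2.tar.gz
    python -c "import pyspark; print(pyspark.__version__)"
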
>
> ===
> What should happen to JIRA tickets still targeting 2.3.2?
> ===
>
> The current list of open tickets targeted at 2.3.2 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target 
> Version/s" = 2.3.2
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Should python-2 be supported in Spark 3.0?

2018-09-17 Thread Reynold Xin
i'd like to second that.

if we want to communicate timeline, we can add to the release notes saying
py2 will be deprecated in 3.0, and removed in a 3.x release.

--
excuse the brevity and lower case due to wrist injury


On Mon, Sep 17, 2018 at 4:24 PM Matei Zaharia 
wrote:

> That’s a good point — I’d say there’s just a risk of creating a perception
> issue. First, some users might feel that this means they have to migrate
> now, which is before Python itself drops support; they might also be
> surprised that we did this in a minor release (e.g. might we drop Python 2
> altogether in a Spark 2.5 if that later comes out?). Second, contributors
> might feel that this means new features no longer have to work with Python
> 2, which would be confusing. Maybe it’s OK on both fronts, but it just
> seems scarier for users to do this now if we do plan to have Spark 3.0 in
> the next 6 months anyway.
>
> Matei
>
> > On Sep 17, 2018, at 1:04 PM, Mark Hamstra 
> wrote:
> >
> > What is the disadvantage to deprecating now in 2.4.0? I mean, it doesn't
> change the code at all; it's just a notification that we will eventually
> cease supporting Py2. Wouldn't users prefer to get that notification sooner
> rather than later?
> >
> > On Mon, Sep 17, 2018 at 12:58 PM Matei Zaharia 
> wrote:
> > I’d like to understand the maintenance burden of Python 2 before
> deprecating it. Since it is not EOL yet, it might make sense to only
> deprecate it once it’s EOL (which is still over a year from now).
> Supporting Python 2+3 seems less burdensome than supporting, say, multiple
> Scala versions in the same codebase, so what are we losing out?
> >
> > The other thing is that even though Python core devs might not support
> 2.x later, it’s quite possible that various Linux distros will if moving
> from 2 to 3 remains painful. In that case, we may want Apache Spark to
> continue releasing for it despite the Python core devs not supporting it.
> >
> > Basically, I’d suggest to deprecate this in Spark 3.0 and then remove it
> later in 3.x instead of deprecating it in 2.4. I’d also consider looking at
> what other data science tools are doing before fully removing it: for
> example, if Pandas and TensorFlow no longer support Python 2 past some
> point, that might be a good point to remove it.
> >
> > Matei
> >
> > > On Sep 17, 2018, at 11:01 AM, Mark Hamstra 
> wrote:
> > >
> > > If we're going to do that, then we need to do it right now, since
> 2.4.0 is already in release candidates.
> > >
> > > On Mon, Sep 17, 2018 at 10:57 AM Erik Erlandson 
> wrote:
> > > I like Mark’s concept for deprecating Py2 starting with 2.4: It may
> seem like a ways off but even now there may be some spark versions
> supporting Py2 past the point where Py2 is no longer receiving security
> patches
> > >
> > >
> > > On Sun, Sep 16, 2018 at 12:26 PM Mark Hamstra 
> wrote:
> > > We could also deprecate Py2 already in the 2.4.0 release.
> > >
> > > On Sat, Sep 15, 2018 at 11:46 AM Erik Erlandson 
> wrote:
> > > In case this didn't make it onto this thread:
> > >
> > > There is a 3rd option, which is to deprecate Py2 for Spark-3.0, and
> remove it entirely on a later 3.x release.
> > >
> > > On Sat, Sep 15, 2018 at 11:09 AM, Erik Erlandson 
> wrote:
> > > On a separate dev@spark thread, I raised a question of whether or not
> to support python 2 in Apache Spark, going forward into Spark 3.0.
> > >
> > > Python-2 is going EOL at the end of 2019. The upcoming release of
> Spark 3.0 is an opportunity to make breaking changes to Spark's APIs, and
> so it is a good time to consider support for Python-2 on PySpark.
> > >
> > > Key advantages to dropping Python 2 are:
> > >   • Support for PySpark becomes significantly easier.
> > >   • Avoid having to support Python 2 until Spark 4.0, which is
> likely to imply supporting Python 2 for some time after it goes EOL.
> > > (Note that supporting python 2 after EOL means, among other things,
> that PySpark would be supporting a version of python that was no longer
> receiving security patches)
> > >
> > > The main disadvantage is that PySpark users who have legacy python-2
> code would have to migrate their code to python 3 to take advantage of
> Spark 3.0
> > >
> > > This decision obviously has large implications for the Apache Spark
> community and we want to solicit community feedback.
> > >
> > >
> >
>
>


Re: Should python-2 be supported in Spark 3.0?

2018-09-17 Thread Matei Zaharia
That’s a good point — I’d say there’s just a risk of creating a perception 
issue. First, some users might feel that this means they have to migrate now, 
which is before Python itself drops support; they might also be surprised that 
we did this in a minor release (e.g. might we drop Python 2 altogether in a 
Spark 2.5 if that later comes out?). Second, contributors might feel that this 
means new features no longer have to work with Python 2, which would be 
confusing. Maybe it’s OK on both fronts, but it just seems scarier for users to 
do this now if we do plan to have Spark 3.0 in the next 6 months anyway.

Matei

> On Sep 17, 2018, at 1:04 PM, Mark Hamstra  wrote:
> 
> What is the disadvantage to deprecating now in 2.4.0? I mean, it doesn't 
> change the code at all; it's just a notification that we will eventually 
> cease supporting Py2. Wouldn't users prefer to get that notification sooner 
> rather than later?
> 
> On Mon, Sep 17, 2018 at 12:58 PM Matei Zaharia  
> wrote:
> I’d like to understand the maintenance burden of Python 2 before deprecating 
> it. Since it is not EOL yet, it might make sense to only deprecate it once 
> it’s EOL (which is still over a year from now). Supporting Python 2+3 seems 
> less burdensome than supporting, say, multiple Scala versions in the same 
> codebase, so what are we losing out?
> 
> The other thing is that even though Python core devs might not support 2.x 
> later, it’s quite possible that various Linux distros will if moving from 2 
> to 3 remains painful. In that case, we may want Apache Spark to continue 
> releasing for it despite the Python core devs not supporting it.
> 
> Basically, I’d suggest to deprecate this in Spark 3.0 and then remove it 
> later in 3.x instead of deprecating it in 2.4. I’d also consider looking at 
> what other data science tools are doing before fully removing it: for 
> example, if Pandas and TensorFlow no longer support Python 2 past some point, 
> that might be a good point to remove it.
> 
> Matei
> 
> > On Sep 17, 2018, at 11:01 AM, Mark Hamstra  wrote:
> > 
> > If we're going to do that, then we need to do it right now, since 2.4.0 is 
> > already in release candidates.
> > 
> > On Mon, Sep 17, 2018 at 10:57 AM Erik Erlandson  wrote:
> > I like Mark’s concept for deprecating Py2 starting with 2.4: It may seem 
> > like a ways off but even now there may be some spark versions supporting 
> > Py2 past the point where Py2 is no longer receiving security patches 
> > 
> > 
> > On Sun, Sep 16, 2018 at 12:26 PM Mark Hamstra  
> > wrote:
> > We could also deprecate Py2 already in the 2.4.0 release.
> > 
> > On Sat, Sep 15, 2018 at 11:46 AM Erik Erlandson  wrote:
> > In case this didn't make it onto this thread:
> > 
> > There is a 3rd option, which is to deprecate Py2 for Spark-3.0, and remove 
> > it entirely on a later 3.x release.
> > 
> > On Sat, Sep 15, 2018 at 11:09 AM, Erik Erlandson  
> > wrote:
> > On a separate dev@spark thread, I raised a question of whether or not to 
> > support python 2 in Apache Spark, going forward into Spark 3.0.
> > 
> > Python-2 is going EOL at the end of 2019. The upcoming release of Spark 3.0 
> > is an opportunity to make breaking changes to Spark's APIs, and so it is a 
> > good time to consider support for Python-2 on PySpark.
> > 
> > Key advantages to dropping Python 2 are:
> >   • Support for PySpark becomes significantly easier.
> >   • Avoid having to support Python 2 until Spark 4.0, which is likely 
> > to imply supporting Python 2 for some time after it goes EOL.
> > (Note that supporting python 2 after EOL means, among other things, that 
> > PySpark would be supporting a version of python that was no longer 
> > receiving security patches)
> > 
> > The main disadvantage is that PySpark users who have legacy python-2 code 
> > would have to migrate their code to python 3 to take advantage of Spark 3.0
> > 
> > This decision obviously has large implications for the Apache Spark 
> > community and we want to solicit community feedback.
> > 
> > 
> 


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] SPARK 2.4.0 (RC1)

2018-09-17 Thread Stavros Kontopoulos
Hi Xiao,

I just tested it; it seems OK. There are some questions about which
properties we should keep when restoring the config, but otherwise it looks
OK to me.
The reason this should go into 2.4 is that streaming on k8s is something
people will want to try on day one (or at least it is cool to try), and since
2.4 ships with heavily refactored k8s support,
it would be disappointing not to have it in... IMHO.

Best,
Stavros

On Mon, Sep 17, 2018 at 11:13 PM, Yinan Li  wrote:

> We can merge the PR and get SPARK-23200 resolved if the whole point is to
> make streaming on k8s work first. But given that this is not a blocker for
> 2.4, I think we can take a bit more time here and get it right. With that
> being said, I would expect it to be resolved soon.
>
> On Mon, Sep 17, 2018 at 11:47 AM Xiao Li  wrote:
>
>> Hi, Erik and Stavros,
>>
>> This bug fix SPARK-23200 is not a blocker of the 2.4 release. It sounds
>> important for the Streaming on K8S. Could the K8S oriented committers speed
>> up the reviews?
>>
>> Thanks,
>>
>> Xiao
>>
>> On Mon, Sep 17, 2018 at 11:04 AM Erik Erlandson  wrote:
>>
>>>
>>> I have no binding vote but I second Stavros’ recommendation for
>>> spark-23200
>>>
>>> Per parallel threads on Py2 support I would also like to propose
>>> deprecating Py2 starting with this 2.4 release
>>>
>>> On Mon, Sep 17, 2018 at 10:38 AM Marcelo Vanzin
>>>  wrote:
>>>
 You can log in to https://repository.apache.org and see what's wrong.
 Just find that staging repo and look at the messages. In your case it
 seems related to your signature.

 failureMessageNo public key: Key with id: () was not able to be
 located on http://gpg-keyserver.de/. Upload your public key and try
 the operation again.
 On Sun, Sep 16, 2018 at 10:00 PM Wenchen Fan 
 wrote:
 >
 > I confirmed that https://repository.apache.org/content/repositories/
 orgapachespark-1285 is not accessible. I did it via
 ./dev/create-release/do-release-docker.sh -d /my/work/dir -s publish ,
 not sure what's going wrong. I didn't see any error message during it.
 >
 > Any insights are appreciated! So that I can fix it in the next RC.
 Thanks!
 >
 > On Mon, Sep 17, 2018 at 11:31 AM Sean Owen  wrote:
 >>
 >> I think one build is enough, but haven't thought it through. The
 >> Hadoop 2.6/2.7 builds are already nearly redundant. 2.12 is probably
 >> best advertised as a 'beta'. So maybe publish a no-hadoop build of
 it?
 >> Really, whatever's the easy thing to do.
 >> On Sun, Sep 16, 2018 at 10:28 PM Wenchen Fan 
 wrote:
 >> >
 >> > Ah I missed the Scala 2.12 build. Do you mean we should publish a
 Scala 2.12 build this time? Currently for Scala 2.11 we have 3 builds: with
 hadoop 2.7, with hadoop 2.6, without hadoop. Shall we do the same thing for
 Scala 2.12?
 >> >
 >> > On Mon, Sep 17, 2018 at 11:14 AM Sean Owen 
 wrote:
 >> >>
 >> >> A few preliminary notes:
 >> >>
 >> >> Wenchen for some weird reason when I hit your key in gpg
 --import, it
 >> >> asks for a passphrase. When I skip it, it's fine, gpg can still
 verify
 >> >> the signature. No issue there really.
 >> >>
 >> >> The staging repo gives a 404:
 >> >> https://repository.apache.org/content/repositories/
 orgapachespark-1285/
 >> >> 404 - Repository "orgapachespark-1285 (staging: open)"
 >> >> [id=orgapachespark-1285] exists but is not exposed.
 >> >>
 >> >> The (revamped) licenses are OK, though there are some minor
 glitches
 >> >> in the final release tarballs (my fault) : there's an extra
 directory,
 >> >> and the source release has both binary and source licenses. I'll
 fix
 >> >> that. Not strictly necessary to reject the release over those.
 >> >>
 >> >> Last, when I check the staging repo I'll get my answer, but, were
 you
 >> >> able to build 2.12 artifacts as well?
 >> >>
 >> >> On Sun, Sep 16, 2018 at 9:48 PM Wenchen Fan 
 wrote:
 >> >> >
 >> >> > Please vote on releasing the following candidate as Apache
 Spark version 2.4.0.
 >> >> >
 >> >> > The vote is open until September 20 PST and passes if a
 majority +1 PMC votes are cast, with
 >> >> > a minimum of 3 +1 votes.
 >> >> >
 >> >> > [ ] +1 Release this package as Apache Spark 2.4.0
 >> >> > [ ] -1 Do not release this package because ...
 >> >> >
 >> >> > To learn more about Apache Spark, please see
 http://spark.apache.org/
 >> >> >
 >> >> > The tag to be voted on is v2.4.0-rc1 (commit
 1220ab8a0738b5f67dc522df5e3e77ffc83d207a):
 >> >> > https://github.com/apache/spark/tree/v2.4.0-rc1
 >> >> >
 >> >> > The release files, including signatures, digests, etc. can be
 found at:
 >> >> > https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc1-bin/
 >> >> >
 >> >> > Signatures used for Spark RCs can be found in 

Re: [VOTE] SPARK 2.4.0 (RC1)

2018-09-17 Thread Yinan Li
We can merge the PR and get SPARK-23200 resolved if the whole point is to
make streaming on k8s work first. But given that this is not a blocker for
2.4, I think we can take a bit more time here and get it right. With that
being said, I would expect it to be resolved soon.

On Mon, Sep 17, 2018 at 11:47 AM Xiao Li  wrote:

> Hi, Erik and Stavros,
>
> This bug fix SPARK-23200 is not a blocker of the 2.4 release. It sounds
> important for the Streaming on K8S. Could the K8S oriented committers speed
> up the reviews?
>
> Thanks,
>
> Xiao
>
> On Mon, Sep 17, 2018 at 11:04 AM Erik Erlandson  wrote:
>
>>
>> I have no binding vote but I second Stavros’ recommendation for
>> spark-23200
>>
>> Per parallel threads on Py2 support I would also like to propose
>> deprecating Py2 starting with this 2.4 release
>>
>> On Mon, Sep 17, 2018 at 10:38 AM Marcelo Vanzin
>>  wrote:
>>
>>> You can log in to https://repository.apache.org and see what's wrong.
>>> Just find that staging repo and look at the messages. In your case it
>>> seems related to your signature.
>>>
>>> failureMessageNo public key: Key with id: () was not able to be
>>> located on http://gpg-keyserver.de/. Upload your public key and try
>>> the operation again.
>>> On Sun, Sep 16, 2018 at 10:00 PM Wenchen Fan 
>>> wrote:
>>> >
>>> > I confirmed that
>>> https://repository.apache.org/content/repositories/orgapachespark-1285
>>> is not accessible. I did it via ./dev/create-release/do-release-docker.sh
>>> -d /my/work/dir -s publish , not sure what's going wrong. I didn't see any
>>> error message during it.
>>> >
>>> > Any insights are appreciated! So that I can fix it in the next RC.
>>> Thanks!
>>> >
>>> > On Mon, Sep 17, 2018 at 11:31 AM Sean Owen  wrote:
>>> >>
>>> >> I think one build is enough, but haven't thought it through. The
>>> >> Hadoop 2.6/2.7 builds are already nearly redundant. 2.12 is probably
>>> >> best advertised as a 'beta'. So maybe publish a no-hadoop build of it?
>>> >> Really, whatever's the easy thing to do.
>>> >> On Sun, Sep 16, 2018 at 10:28 PM Wenchen Fan 
>>> wrote:
>>> >> >
>>> >> > Ah I missed the Scala 2.12 build. Do you mean we should publish a
>>> Scala 2.12 build this time? Currently for Scala 2.11 we have 3 builds: with
>>> hadoop 2.7, with hadoop 2.6, without hadoop. Shall we do the same thing for
>>> Scala 2.12?
>>> >> >
>>> >> > On Mon, Sep 17, 2018 at 11:14 AM Sean Owen 
>>> wrote:
>>> >> >>
>>> >> >> A few preliminary notes:
>>> >> >>
>>> >> >> Wenchen for some weird reason when I hit your key in gpg --import,
>>> it
>>> >> >> asks for a passphrase. When I skip it, it's fine, gpg can still
>>> verify
>>> >> >> the signature. No issue there really.
>>> >> >>
>>> >> >> The staging repo gives a 404:
>>> >> >>
>>> https://repository.apache.org/content/repositories/orgapachespark-1285/
>>> >> >> 404 - Repository "orgapachespark-1285 (staging: open)"
>>> >> >> [id=orgapachespark-1285] exists but is not exposed.
>>> >> >>
>>> >> >> The (revamped) licenses are OK, though there are some minor
>>> glitches
>>> >> >> in the final release tarballs (my fault) : there's an extra
>>> directory,
>>> >> >> and the source release has both binary and source licenses. I'll
>>> fix
>>> >> >> that. Not strictly necessary to reject the release over those.
>>> >> >>
>>> >> >> Last, when I check the staging repo I'll get my answer, but, were
>>> you
>>> >> >> able to build 2.12 artifacts as well?
>>> >> >>
>>> >> >> On Sun, Sep 16, 2018 at 9:48 PM Wenchen Fan 
>>> wrote:
>>> >> >> >
>>> >> >> > Please vote on releasing the following candidate as Apache Spark
>>> version 2.4.0.
>>> >> >> >
>>> >> >> > The vote is open until September 20 PST and passes if a majority
>>> +1 PMC votes are cast, with
>>> >> >> > a minimum of 3 +1 votes.
>>> >> >> >
>>> >> >> > [ ] +1 Release this package as Apache Spark 2.4.0
>>> >> >> > [ ] -1 Do not release this package because ...
>>> >> >> >
>>> >> >> > To learn more about Apache Spark, please see
>>> http://spark.apache.org/
>>> >> >> >
>>> >> >> > The tag to be voted on is v2.4.0-rc1 (commit
>>> 1220ab8a0738b5f67dc522df5e3e77ffc83d207a):
>>> >> >> > https://github.com/apache/spark/tree/v2.4.0-rc1
>>> >> >> >
>>> >> >> > The release files, including signatures, digests, etc. can be
>>> found at:
>>> >> >> > https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc1-bin/
>>> >> >> >
>>> >> >> > Signatures used for Spark RCs can be found in this file:
>>> >> >> > https://dist.apache.org/repos/dist/dev/spark/KEYS
>>> >> >> >
>>> >> >> > The staging repository for this release can be found at:
>>> >> >> >
>>> https://repository.apache.org/content/repositories/orgapachespark-1285/
>>> >> >> >
>>> >> >> > The documentation corresponding to this release can be found at:
>>> >> >> > https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc1-docs/
>>> >> >> >
>>> >> >> > The list of bug fixes going into 2.4.0 can be found at the
>>> following URL:
>>> >> >> > https://issues.apache.org/jira/projects/SPARK/versions/2.4.0
>>> >> >> >

Re: Should python-2 be supported in Spark 3.0?

2018-09-17 Thread Erik Erlandson
FWIW, Pandas is dropping Py2 support at the end of this year. TensorFlow is
less clear: they only support Py3 on Windows, but there is no reference to any
policy about Py2 on their roadmap or in the TF 2.0 announcement.


Re: Should python-2 be supported in Spark 3.0?

2018-09-17 Thread Mark Hamstra
What is the disadvantage to deprecating now in 2.4.0? I mean, it doesn't
change the code at all; it's just a notification that we will eventually
cease supporting Py2. Wouldn't users prefer to get that notification sooner
rather than later?

On Mon, Sep 17, 2018 at 12:58 PM Matei Zaharia 
wrote:

> I’d like to understand the maintenance burden of Python 2 before
> deprecating it. Since it is not EOL yet, it might make sense to only
> deprecate it once it’s EOL (which is still over a year from now).
> Supporting Python 2+3 seems less burdensome than supporting, say, multiple
> Scala versions in the same codebase, so what are we losing out?
>
> The other thing is that even though Python core devs might not support 2.x
> later, it’s quite possible that various Linux distros will if moving from 2
> to 3 remains painful. In that case, we may want Apache Spark to continue
> releasing for it despite the Python core devs not supporting it.
>
> Basically, I’d suggest to deprecate this in Spark 3.0 and then remove it
> later in 3.x instead of deprecating it in 2.4. I’d also consider looking at
> what other data science tools are doing before fully removing it: for
> example, if Pandas and TensorFlow no longer support Python 2 past some
> point, that might be a good point to remove it.
>
> Matei
>
> > On Sep 17, 2018, at 11:01 AM, Mark Hamstra 
> wrote:
> >
> > If we're going to do that, then we need to do it right now, since 2.4.0
> is already in release candidates.
> >
> > On Mon, Sep 17, 2018 at 10:57 AM Erik Erlandson 
> wrote:
> > I like Mark’s concept for deprecating Py2 starting with 2.4: It may seem
> like a ways off but even now there may be some spark versions supporting
> Py2 past the point where Py2 is no longer receiving security patches
> >
> >
> > On Sun, Sep 16, 2018 at 12:26 PM Mark Hamstra 
> wrote:
> > We could also deprecate Py2 already in the 2.4.0 release.
> >
> > On Sat, Sep 15, 2018 at 11:46 AM Erik Erlandson 
> wrote:
> > In case this didn't make it onto this thread:
> >
> > There is a 3rd option, which is to deprecate Py2 for Spark-3.0, and
> remove it entirely on a later 3.x release.
> >
> > On Sat, Sep 15, 2018 at 11:09 AM, Erik Erlandson 
> wrote:
> > On a separate dev@spark thread, I raised a question of whether or not
> to support python 2 in Apache Spark, going forward into Spark 3.0.
> >
> > Python-2 is going EOL at the end of 2019. The upcoming release of Spark
> 3.0 is an opportunity to make breaking changes to Spark's APIs, and so it
> is a good time to consider support for Python-2 on PySpark.
> >
> > Key advantages to dropping Python 2 are:
> >   • Support for PySpark becomes significantly easier.
> >   • Avoid having to support Python 2 until Spark 4.0, which is
> likely to imply supporting Python 2 for some time after it goes EOL.
> > (Note that supporting python 2 after EOL means, among other things, that
> PySpark would be supporting a version of python that was no longer
> receiving security patches)
> >
> > The main disadvantage is that PySpark users who have legacy python-2
> code would have to migrate their code to python 3 to take advantage of
> Spark 3.0
> >
> > This decision obviously has large implications for the Apache Spark
> community and we want to solicit community feedback.
> >
> >
>
>


Re: Should python-2 be supported in Spark 3.0?

2018-09-17 Thread Matei Zaharia
I’d like to understand the maintenance burden of Python 2 before deprecating 
it. Since it is not EOL yet, it might make sense to only deprecate it once it’s 
EOL (which is still over a year from now). Supporting Python 2+3 seems less 
burdensome than supporting, say, multiple Scala versions in the same codebase, 
so what are we losing out?

The other thing is that even though Python core devs might not support 2.x 
later, it’s quite possible that various Linux distros will if moving from 2 to 
3 remains painful. In that case, we may want Apache Spark to continue releasing 
for it despite the Python core devs not supporting it.

Basically, I’d suggest to deprecate this in Spark 3.0 and then remove it later 
in 3.x instead of deprecating it in 2.4. I’d also consider looking at what 
other data science tools are doing before fully removing it: for example, if 
Pandas and TensorFlow no longer support Python 2 past some point, that might be 
a good point to remove it.

Matei

> On Sep 17, 2018, at 11:01 AM, Mark Hamstra  wrote:
> 
> If we're going to do that, then we need to do it right now, since 2.4.0 is 
> already in release candidates.
> 
> On Mon, Sep 17, 2018 at 10:57 AM Erik Erlandson  wrote:
> I like Mark’s concept for deprecating Py2 starting with 2.4: It may seem like 
> a ways off but even now there may be some spark versions supporting Py2 past 
> the point where Py2 is no longer receiving security patches 
> 
> 
> On Sun, Sep 16, 2018 at 12:26 PM Mark Hamstra  wrote:
> We could also deprecate Py2 already in the 2.4.0 release.
> 
> On Sat, Sep 15, 2018 at 11:46 AM Erik Erlandson  wrote:
> In case this didn't make it onto this thread:
> 
> There is a 3rd option, which is to deprecate Py2 for Spark-3.0, and remove it 
> entirely on a later 3.x release.
> 
> On Sat, Sep 15, 2018 at 11:09 AM, Erik Erlandson  wrote:
> On a separate dev@spark thread, I raised a question of whether or not to 
> support python 2 in Apache Spark, going forward into Spark 3.0.
> 
> Python-2 is going EOL at the end of 2019. The upcoming release of Spark 3.0 
> is an opportunity to make breaking changes to Spark's APIs, and so it is a 
> good time to consider support for Python-2 on PySpark.
> 
> Key advantages to dropping Python 2 are:
>   • Support for PySpark becomes significantly easier.
>   • Avoid having to support Python 2 until Spark 4.0, which is likely to 
> imply supporting Python 2 for some time after it goes EOL.
> (Note that supporting python 2 after EOL means, among other things, that 
> PySpark would be supporting a version of python that was no longer receiving 
> security patches)
> 
> The main disadvantage is that PySpark users who have legacy python-2 code 
> would have to migrate their code to python 3 to take advantage of Spark 3.0
> 
> This decision obviously has large implications for the Apache Spark community 
> and we want to solicit community feedback.
> 
> 


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] SPARK 2.4.0 (RC1)

2018-09-17 Thread Xiao Li
Hi, Erik and Stavros,

This bug fix, SPARK-23200, is not a blocker for the 2.4 release, but it sounds
important for streaming on K8S. Could the K8S-oriented committers speed
up the reviews?

Thanks,

Xiao

On Mon, Sep 17, 2018 at 11:04 AM, Erik Erlandson  wrote:

>
> I have no binding vote but I second Stavros’ recommendation for spark-23200
>
> Per parallel threads on Py2 support I would also like to propose
> deprecating Py2 starting with this 2.4 release
>
> On Mon, Sep 17, 2018 at 10:38 AM Marcelo Vanzin
>  wrote:
>
>> You can log in to https://repository.apache.org and see what's wrong.
>> Just find that staging repo and look at the messages. In your case it
>> seems related to your signature.
>>
>> failureMessageNo public key: Key with id: () was not able to be
>> located on http://gpg-keyserver.de/. Upload your public key and try
>> the operation again.
>> On Sun, Sep 16, 2018 at 10:00 PM Wenchen Fan  wrote:
>> >
>> > I confirmed that
>> https://repository.apache.org/content/repositories/orgapachespark-1285
>> is not accessible. I did it via ./dev/create-release/do-release-docker.sh
>> -d /my/work/dir -s publish , not sure what's going wrong. I didn't see any
>> error message during it.
>> >
>> > Any insights are appreciated! So that I can fix it in the next RC.
>> Thanks!
>> >
>> > On Mon, Sep 17, 2018 at 11:31 AM Sean Owen  wrote:
>> >>
>> >> I think one build is enough, but haven't thought it through. The
>> >> Hadoop 2.6/2.7 builds are already nearly redundant. 2.12 is probably
>> >> best advertised as a 'beta'. So maybe publish a no-hadoop build of it?
>> >> Really, whatever's the easy thing to do.
>> >> On Sun, Sep 16, 2018 at 10:28 PM Wenchen Fan 
>> wrote:
>> >> >
>> >> > Ah I missed the Scala 2.12 build. Do you mean we should publish a
>> Scala 2.12 build this time? Current for Scala 2.11 we have 3 builds: with
>> hadoop 2.7, with hadoop 2.6, without hadoop. Shall we do the same thing for
>> Scala 2.12?
>> >> >
>> >> > On Mon, Sep 17, 2018 at 11:14 AM Sean Owen 
>> wrote:
>> >> >>
>> >> >> A few preliminary notes:
>> >> >>
>> >> >> Wenchen for some weird reason when I hit your key in gpg --import,
>> it
>> >> >> asks for a passphrase. When I skip it, it's fine, gpg can still
>> verify
>> >> >> the signature. No issue there really.
>> >> >>
>> >> >> The staging repo gives a 404:
>> >> >>
>> https://repository.apache.org/content/repositories/orgapachespark-1285/
>> >> >> 404 - Repository "orgapachespark-1285 (staging: open)"
>> >> >> [id=orgapachespark-1285] exists but is not exposed.
>> >> >>
>> >> >> The (revamped) licenses are OK, though there are some minor glitches
>> >> >> in the final release tarballs (my fault) : there's an extra
>> directory,
>> >> >> and the source release has both binary and source licenses. I'll fix
>> >> >> that. Not strictly necessary to reject the release over those.
>> >> >>
>> >> >> Last, when I check the staging repo I'll get my answer, but, were
>> you
>> >> >> able to build 2.12 artifacts as well?
>> >> >>
>> >> >> On Sun, Sep 16, 2018 at 9:48 PM Wenchen Fan 
>> wrote:
>> >> >> >
>> >> >> > Please vote on releasing the following candidate as Apache Spark
>> version 2.4.0.
>> >> >> >
>> >> >> > The vote is open until September 20 PST and passes if a majority
>> +1 PMC votes are cast, with
>> >> >> > a minimum of 3 +1 votes.
>> >> >> >
>> >> >> > [ ] +1 Release this package as Apache Spark 2.4.0
>> >> >> > [ ] -1 Do not release this package because ...
>> >> >> >
>> >> >> > To learn more about Apache Spark, please see
>> http://spark.apache.org/
>> >> >> >
>> >> >> > The tag to be voted on is v2.4.0-rc1 (commit
>> 1220ab8a0738b5f67dc522df5e3e77ffc83d207a):
>> >> >> > https://github.com/apache/spark/tree/v2.4.0-rc1
>> >> >> >
>> >> >> > The release files, including signatures, digests, etc. can be
>> found at:
>> >> >> > https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc1-bin/
>> >> >> >
>> >> >> > Signatures used for Spark RCs can be found in this file:
>> >> >> > https://dist.apache.org/repos/dist/dev/spark/KEYS
>> >> >> >
>> >> >> > The staging repository for this release can be found at:
>> >> >> >
>> https://repository.apache.org/content/repositories/orgapachespark-1285/
>> >> >> >
>> >> >> > The documentation corresponding to this release can be found at:
>> >> >> > https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc1-docs/
>> >> >> >
>> >> >> > The list of bug fixes going into 2.4.0 can be found at the
>> following URL:
>> >> >> > https://issues.apache.org/jira/projects/SPARK/versions/2.4.0
>> >> >> >
>> >> >> > FAQ
>> >> >> >
>> >> >> > =
>> >> >> > How can I help test this release?
>> >> >> > =
>> >> >> >
>> >> >> > If you are a Spark user, you can help us test this release by
>> taking
>> >> >> > an existing Spark workload and running on this release candidate,
>> then
>> >> >> > reporting any regressions.
>> >> >> >
>> >> >> > If you're working in PySpark you can set up a virtual env and
>> install
>> >> >> 

Re: [VOTE] SPARK 2.4.0 (RC1)

2018-09-17 Thread Erik Erlandson
I have no binding vote but I second Stavros’ recommendation for spark-23200

Per parallel threads on Py2 support I would also like to propose
deprecating Py2 starting with this 2.4 release

On Mon, Sep 17, 2018 at 10:38 AM Marcelo Vanzin 
wrote:

> You can log in to https://repository.apache.org and see what's wrong.
> Just find that staging repo and look at the messages. In your case it
> seems related to your signature.
>
> failureMessageNo public key: Key with id: () was not able to be
> located on http://gpg-keyserver.de/. Upload your public key and try
> the operation again.
> On Sun, Sep 16, 2018 at 10:00 PM Wenchen Fan  wrote:
> >
> > I confirmed that
> https://repository.apache.org/content/repositories/orgapachespark-1285 is
> not accessible. I did it via ./dev/create-release/do-release-docker.sh -d
> /my/work/dir -s publish , not sure what's going wrong. I didn't see any
> error message during it.
> >
> > Any insights are appreciated! So that I can fix it in the next RC.
> Thanks!
> >
> > On Mon, Sep 17, 2018 at 11:31 AM Sean Owen  wrote:
> >>
> >> I think one build is enough, but haven't thought it through. The
> >> Hadoop 2.6/2.7 builds are already nearly redundant. 2.12 is probably
> >> best advertised as a 'beta'. So maybe publish a no-hadoop build of it?
> >> Really, whatever's the easy thing to do.
> >> On Sun, Sep 16, 2018 at 10:28 PM Wenchen Fan 
> wrote:
> >> >
> >> > Ah I missed the Scala 2.12 build. Do you mean we should publish a
> Scala 2.12 build this time? Current for Scala 2.11 we have 3 builds: with
> hadoop 2.7, with hadoop 2.6, without hadoop. Shall we do the same thing for
> Scala 2.12?
> >> >
> >> > On Mon, Sep 17, 2018 at 11:14 AM Sean Owen  wrote:
> >> >>
> >> >> A few preliminary notes:
> >> >>
> >> >> Wenchen for some weird reason when I hit your key in gpg --import, it
> >> >> asks for a passphrase. When I skip it, it's fine, gpg can still
> verify
> >> >> the signature. No issue there really.
> >> >>
> >> >> The staging repo gives a 404:
> >> >>
> https://repository.apache.org/content/repositories/orgapachespark-1285/
> >> >> 404 - Repository "orgapachespark-1285 (staging: open)"
> >> >> [id=orgapachespark-1285] exists but is not exposed.
> >> >>
> >> >> The (revamped) licenses are OK, though there are some minor glitches
> >> >> in the final release tarballs (my fault) : there's an extra
> directory,
> >> >> and the source release has both binary and source licenses. I'll fix
> >> >> that. Not strictly necessary to reject the release over those.
> >> >>
> >> >> Last, when I check the staging repo I'll get my answer, but, were you
> >> >> able to build 2.12 artifacts as well?
> >> >>
> >> >> On Sun, Sep 16, 2018 at 9:48 PM Wenchen Fan 
> wrote:
> >> >> >
> >> >> > Please vote on releasing the following candidate as Apache Spark
> version 2.4.0.
> >> >> >
> >> >> > The vote is open until September 20 PST and passes if a majority
> +1 PMC votes are cast, with
> >> >> > a minimum of 3 +1 votes.
> >> >> >
> >> >> > [ ] +1 Release this package as Apache Spark 2.4.0
> >> >> > [ ] -1 Do not release this package because ...
> >> >> >
> >> >> > To learn more about Apache Spark, please see
> http://spark.apache.org/
> >> >> >
> >> >> > The tag to be voted on is v2.4.0-rc1 (commit
> 1220ab8a0738b5f67dc522df5e3e77ffc83d207a):
> >> >> > https://github.com/apache/spark/tree/v2.4.0-rc1
> >> >> >
> >> >> > The release files, including signatures, digests, etc. can be
> found at:
> >> >> > https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc1-bin/
> >> >> >
> >> >> > Signatures used for Spark RCs can be found in this file:
> >> >> > https://dist.apache.org/repos/dist/dev/spark/KEYS
> >> >> >
> >> >> > The staging repository for this release can be found at:
> >> >> >
> https://repository.apache.org/content/repositories/orgapachespark-1285/
> >> >> >
> >> >> > The documentation corresponding to this release can be found at:
> >> >> > https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc1-docs/
> >> >> >
> >> >> > The list of bug fixes going into 2.4.0 can be found at the
> following URL:
> >> >> > https://issues.apache.org/jira/projects/SPARK/versions/2.4.0
> >> >> >
> >> >> > FAQ
> >> >> >
> >> >> > =
> >> >> > How can I help test this release?
> >> >> > =
> >> >> >
> >> >> > If you are a Spark user, you can help us test this release by
> taking
> >> >> > an existing Spark workload and running on this release candidate,
> then
> >> >> > reporting any regressions.
> >> >> >
> >> >> > If you're working in PySpark you can set up a virtual env and
> install
> >> >> > the current RC and see if anything important breaks, in the
> Java/Scala
> >> >> > you can add the staging repository to your projects resolvers and
> test
> >> >> > with the RC (make sure to clean up the artifact cache before/after
> so
> >> >> > you don't end up building with a out of date RC going forward).
> >> >> >
> >> >> > ===

Re: Should python-2 be supported in Spark 3.0?

2018-09-17 Thread Mark Hamstra
If we're going to do that, then we need to do it right now, since 2.4.0 is
already in release candidates.

On Mon, Sep 17, 2018 at 10:57 AM Erik Erlandson  wrote:

> I like Mark’s concept for deprecating Py2 starting with 2.4: It may seem
> like a ways off but even now there may be some spark versions supporting
> Py2 past the point where Py2 is no longer receiving security patches
>
>
> On Sun, Sep 16, 2018 at 12:26 PM Mark Hamstra 
> wrote:
>
>> We could also deprecate Py2 already in the 2.4.0 release.
>>
>> On Sat, Sep 15, 2018 at 11:46 AM Erik Erlandson 
>> wrote:
>>
>>> In case this didn't make it onto this thread:
>>>
>>> There is a 3rd option, which is to deprecate Py2 for Spark-3.0, and
>>> remove it entirely on a later 3.x release.
>>>
>>> On Sat, Sep 15, 2018 at 11:09 AM, Erik Erlandson 
>>> wrote:
>>>
 On a separate dev@spark thread, I raised a question of whether or not
 to support python 2 in Apache Spark, going forward into Spark 3.0.

 Python-2 is going EOL  at
 the end of 2019. The upcoming release of Spark 3.0 is an opportunity to
 make breaking changes to Spark's APIs, and so it is a good time to consider
 support for Python-2 on PySpark.

 Key advantages to dropping Python 2 are:

- Support for PySpark becomes significantly easier.
- Avoid having to support Python 2 until Spark 4.0, which is likely
to imply supporting Python 2 for some time after it goes EOL.

 (Note that supporting python 2 after EOL means, among other things,
 that PySpark would be supporting a version of python that was no longer
 receiving security patches)

 The main disadvantage is that PySpark users who have legacy python-2
 code would have to migrate their code to python 3 to take advantage of
 Spark 3.0

 This decision obviously has large implications for the Apache Spark
 community and we want to solicit community feedback.


>>>


Re: Should python-2 be supported in Spark 3.0?

2018-09-17 Thread Erik Erlandson
I like Mark’s concept for deprecating Py2 starting with 2.4: it may seem
like a long way off, but even now there may be some Spark versions supporting
Py2 past the point where Py2 is no longer receiving security patches.
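
If we did deprecate in 2.4, the smallest possible form would be a runtime
warning when PySpark starts under a Python 2 interpreter. A minimal sketch —
the wording and the exact hook point (e.g. SparkContext startup) are
assumptions, not a proposed patch:

import sys
import warnings

def warn_if_python2():
    # Hypothetical deprecation notice; the message text is an assumption.
    if sys.version_info[0] < 3:
        warnings.warn(
            "Python 2 support is deprecated and may be removed in a future "
            "Spark release; please migrate to Python 3.",
            DeprecationWarning)

warn_if_python2()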


On Sun, Sep 16, 2018 at 12:26 PM Mark Hamstra 
wrote:

> We could also deprecate Py2 already in the 2.4.0 release.
>
> On Sat, Sep 15, 2018 at 11:46 AM Erik Erlandson 
> wrote:
>
>> In case this didn't make it onto this thread:
>>
>> There is a 3rd option, which is to deprecate Py2 for Spark-3.0, and
>> remove it entirely on a later 3.x release.
>>
>> On Sat, Sep 15, 2018 at 11:09 AM, Erik Erlandson 
>> wrote:
>>
>>> On a separate dev@spark thread, I raised a question of whether or not
>>> to support python 2 in Apache Spark, going forward into Spark 3.0.
>>>
>>> Python-2 is going EOL  at
>>> the end of 2019. The upcoming release of Spark 3.0 is an opportunity to
>>> make breaking changes to Spark's APIs, and so it is a good time to consider
>>> support for Python-2 on PySpark.
>>>
>>> Key advantages to dropping Python 2 are:
>>>
>>>- Support for PySpark becomes significantly easier.
>>>- Avoid having to support Python 2 until Spark 4.0, which is likely
>>>to imply supporting Python 2 for some time after it goes EOL.
>>>
>>> (Note that supporting python 2 after EOL means, among other things, that
>>> PySpark would be supporting a version of python that was no longer
>>> receiving security patches)
>>>
>>> The main disadvantage is that PySpark users who have legacy python-2
>>> code would have to migrate their code to python 3 to take advantage of
>>> Spark 3.0
>>>
>>> This decision obviously has large implications for the Apache Spark
>>> community and we want to solicit community feedback.
>>>
>>>
>>


Re: [VOTE] SPARK 2.4.0 (RC1)

2018-09-17 Thread Marcelo Vanzin
You can log in to https://repository.apache.org and see what's wrong.
Just find that staging repo and look at the messages. In your case it
seems related to your signature.

failureMessageNo public key: Key with id: () was not able to be
located on http://gpg-keyserver.de/. Upload your public key and try
the operation again.
On Sun, Sep 16, 2018 at 10:00 PM Wenchen Fan  wrote:
>
> I confirmed that 
> https://repository.apache.org/content/repositories/orgapachespark-1285 is not 
> accessible. I did it via ./dev/create-release/do-release-docker.sh -d 
> /my/work/dir -s publish , not sure what's going wrong. I didn't see any error 
> message during it.
>
> Any insights are appreciated! So that I can fix it in the next RC. Thanks!
>
> On Mon, Sep 17, 2018 at 11:31 AM Sean Owen  wrote:
>>
>> I think one build is enough, but haven't thought it through. The
>> Hadoop 2.6/2.7 builds are already nearly redundant. 2.12 is probably
>> best advertised as a 'beta'. So maybe publish a no-hadoop build of it?
>> Really, whatever's the easy thing to do.
>> On Sun, Sep 16, 2018 at 10:28 PM Wenchen Fan  wrote:
>> >
>> > Ah I missed the Scala 2.12 build. Do you mean we should publish a Scala 
>> > 2.12 build this time? Current for Scala 2.11 we have 3 builds: with hadoop 
>> > 2.7, with hadoop 2.6, without hadoop. Shall we do the same thing for Scala 
>> > 2.12?
>> >
>> > On Mon, Sep 17, 2018 at 11:14 AM Sean Owen  wrote:
>> >>
>> >> A few preliminary notes:
>> >>
>> >> Wenchen for some weird reason when I hit your key in gpg --import, it
>> >> asks for a passphrase. When I skip it, it's fine, gpg can still verify
>> >> the signature. No issue there really.
>> >>
>> >> The staging repo gives a 404:
>> >> https://repository.apache.org/content/repositories/orgapachespark-1285/
>> >> 404 - Repository "orgapachespark-1285 (staging: open)"
>> >> [id=orgapachespark-1285] exists but is not exposed.
>> >>
>> >> The (revamped) licenses are OK, though there are some minor glitches
>> >> in the final release tarballs (my fault) : there's an extra directory,
>> >> and the source release has both binary and source licenses. I'll fix
>> >> that. Not strictly necessary to reject the release over those.
>> >>
>> >> Last, when I check the staging repo I'll get my answer, but, were you
>> >> able to build 2.12 artifacts as well?
>> >>
>> >> On Sun, Sep 16, 2018 at 9:48 PM Wenchen Fan  wrote:
>> >> >
>> >> > Please vote on releasing the following candidate as Apache Spark 
>> >> > version 2.4.0.
>> >> >
>> >> > The vote is open until September 20 PST and passes if a majority +1 PMC 
>> >> > votes are cast, with
>> >> > a minimum of 3 +1 votes.
>> >> >
>> >> > [ ] +1 Release this package as Apache Spark 2.4.0
>> >> > [ ] -1 Do not release this package because ...
>> >> >
>> >> > To learn more about Apache Spark, please see http://spark.apache.org/
>> >> >
>> >> > The tag to be voted on is v2.4.0-rc1 (commit 
>> >> > 1220ab8a0738b5f67dc522df5e3e77ffc83d207a):
>> >> > https://github.com/apache/spark/tree/v2.4.0-rc1
>> >> >
>> >> > The release files, including signatures, digests, etc. can be found at:
>> >> > https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc1-bin/
>> >> >
>> >> > Signatures used for Spark RCs can be found in this file:
>> >> > https://dist.apache.org/repos/dist/dev/spark/KEYS
>> >> >
>> >> > The staging repository for this release can be found at:
>> >> > https://repository.apache.org/content/repositories/orgapachespark-1285/
>> >> >
>> >> > The documentation corresponding to this release can be found at:
>> >> > https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc1-docs/
>> >> >
>> >> > The list of bug fixes going into 2.4.0 can be found at the following 
>> >> > URL:
>> >> > https://issues.apache.org/jira/projects/SPARK/versions/2.4.0
>> >> >
>> >> > FAQ
>> >> >
>> >> > =
>> >> > How can I help test this release?
>> >> > =
>> >> >
>> >> > If you are a Spark user, you can help us test this release by taking
>> >> > an existing Spark workload and running on this release candidate, then
>> >> > reporting any regressions.
>> >> >
>> >> > If you're working in PySpark you can set up a virtual env and install
>> >> > the current RC and see if anything important breaks, in the Java/Scala
>> >> > you can add the staging repository to your projects resolvers and test
>> >> > with the RC (make sure to clean up the artifact cache before/after so
>> >> > you don't end up building with a out of date RC going forward).
>> >> >
>> >> > ===
>> >> > What should happen to JIRA tickets still targeting 2.4.0?
>> >> > ===
>> >> >
>> >> > The current list of open tickets targeted at 2.4.0 can be found at:
>> >> > https://issues.apache.org/jira/projects/SPARK and search for "Target 
>> >> > Version/s" = 2.4.0
>> >> >
>> >> > Committers should look at those and triage. Extremely important bug
>> >> > fixes, 

Re: Python friendly API for Spark 3.0

2018-09-17 Thread Leif Walsh
I agree with Reynold, at some point you’re going to run into the parts of
the pandas API that aren’t distributable. More feature parity will be good,
but users are still eventually going to hit a feature cliff. Moreover, it’s
not just the pandas API that people want to use, but also the set of
libraries built around the pandas DataFrame structure.

I think rather than similarity to pandas, we should target smoother
interoperability with pandas, to ease the pain of hitting this cliff.

We’ve been working on part of this problem with the pandas UDF stuff, but
there’s a lot more to do.
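
For readers less familiar with that work: a scalar pandas UDF (available since
Spark 2.3, assuming pyarrow is installed) already lets column-wise pandas code
run inside Spark, roughly like the sketch below; the app name, column names,
and data are made up for illustration:

from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType

spark = SparkSession.builder.appName("pandas-udf-sketch").getOrCreate()
df = spark.createDataFrame([(1.0,), (2.0,), (3.0,)], ["x"])

# The UDF body receives and returns a pandas.Series, so existing
# pandas-oriented logic can be reused per column.
@pandas_udf("double", PandasUDFType.SCALAR)
def plus_one(v):
    return v + 1

df.withColumn("x_plus_one", plus_one(df["x"])).show()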

On Sun, Sep 16, 2018 at 17:13 Reynold Xin  wrote:

> Most of those are pretty difficult to add though, because they are
> fundamentally difficult to do in a distributed setting and with lazy
> execution.
>
> We should add some but at some point there are fundamental differences
> between the underlying execution engine that are pretty difficult to
> reconcile.
>
> On Sun, Sep 16, 2018 at 2:09 PM Matei Zaharia 
> wrote:
>
>> My 2 cents on this is that the biggest room for improvement in Python is
>> similarity to Pandas. We already made the Python DataFrame API different
>> from Scala/Java in some respects, but if there’s anything we can do to make
>> it more obvious to Pandas users, that will help the most. The other issue
>> though is that a bunch of Pandas functions are just missing in Spark — it
>> would be awesome to set up an umbrella JIRA to just track those and let
>> people fill them in.
>>
>> Matei
>>
>> > On Sep 16, 2018, at 1:02 PM, Mark Hamstra 
>> wrote:
>> >
>> > It's not splitting hairs, Erik. It's actually very close to something
>> that I think deserves some discussion (perhaps on a separate thread.) What
>> I've been thinking about also concerns API "friendliness" or style. The
>> original RDD API was very intentionally modeled on the Scala parallel
>> collections API. That made it quite friendly for some Scala programmers,
>> but not as much so for users of the other language APIs when they
>> eventually came about. Similarly, the Dataframe API drew a lot from pandas
>> and R, so it is relatively friendly for those used to those abstractions.
>> Of course, the Spark SQL API is modeled closely on HiveQL and standard SQL.
>> The new barrier scheduling draws inspiration from MPI. With all of these
>> models and sources of inspiration, as well as multiple language targets,
>> there isn't really a strong sense of coherence across Spark -- I mean, even
>> though one of the key advantages of Spark is the ability to do within a
>> single framework things that would otherwise require multiple frameworks,
>> actually doing that is requiring more than one programming style or
>> multiple design abstractions more than what is strictly necessary even when
>> writing Spark code in just a single language.
>> >
>> > For me, that raises questions over whether we want to start designing,
>> implementing and supporting APIs that are designed to be more consistent,
>> friendly and idiomatic to particular languages and abstractions -- e.g. an
>> API covering all of Spark that is designed to look and feel as much like
>> "normal" code for a Python programmer, another that looks and feels more
>> like "normal" Java code, another for Scala, etc. That's a lot more work and
>> support burden than the current approach where sometimes it feels like you
>> are writing "normal" code for your prefered programming environment, and
>> sometimes it feels like you are trying to interface with something foreign,
>> but underneath it hopefully isn't too hard for those writing the
>> implementation code below the APIs, and it is not too hard to maintain
>> multiple language bindings that are each fairly lightweight.
>> >
>> > It's a cost-benefit judgement, of course, whether APIs that are heavier
>> (in terms of implementing and maintaining) and friendlier (for end users)
>> are worth doing, and maybe some of these "friendlier" APIs can be done
>> outside of Spark itself (imo, Frameless is doing a very nice job for the
>> parts of Spark that it is currently covering --
>> https://github.com/typelevel/frameless); but what we have currently is a
>> bit too ad hoc and fragmentary for my taste.
>> >
>> > On Sat, Sep 15, 2018 at 10:33 AM Erik Erlandson 
>> wrote:
>> > I am probably splitting hairs too finely, but I was considering the
>> difference between improvements to the jvm-side (py4j and the scala/java
>> code) that would make it easier to write the python layer ("python-friendly
>> api"), and actual improvements to the python layers ("friendly python api").
>> >
>> > They're not mutually exclusive of course, and both worth working on.
>> But it's *possible* to improve either without the other.
>> >
>> > Stub files look like a great solution for type annotations, maybe even
>> if only python 3 is supported.
>> >
>> > I definitely agree that any decision to drop python 2 should not be
>> taken lightly. Anecdotally, I'm seeing an increase in 

Re: [VOTE] SPARK 2.4.0 (RC1)

2018-09-17 Thread Stavros Kontopoulos
I am just following Wenchen Fan's comment (of course it is not merged yet, but
I wanted to bring this to the attention of the dev list):

"We should definitely merge it to branch 2.4, but I won't block the release
since it's not that critical and it's still in progress. After it's merged,
feel free to vote -1 on the RC voting email to include this change, if
necessary."


So if the -1 vote is not valid, we can ignore it. But IMHO this should have
been in before 2.4 was cut, anyway.


Stavros


On Mon, Sep 17, 2018 at 4:53 PM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> I believe -1 votes are merited only for correctness bugs and regressions
> since the previous release.
>
> Does SPARK-23200 count as either?
>
> On Mon, Sep 17, 2018 at 9:40 AM Stavros Kontopoulos  wrote:
>
>> -1
>>
>> I would like to see: https://github.com/apache/spark/pull/22392 in, as
>> discussed here: https://issues.apache.org/jira/browse/SPARK-23200. It is
>> important IMHO for streaming on K8s.
>> I just started testing it btw.
>>
>> Also, 2.12.7 (https://contributors.scala-lang.org/t/2-12-7-release/2301,
>> https://github.com/scala/scala/milestone/73) is coming out (it will be
>> staged this week); do we want to build the beta 2.12 build against it?
>>
>> Stavros
>>
>> On Mon, Sep 17, 2018 at 8:00 AM, Wenchen Fan  wrote:
>>
>>> I confirmed that https://repository.apache.org/content/
>>> repositories/orgapachespark-1285 is not accessible. I did it via
>>> ./dev/create-release/do-release-docker.sh -d /my/work/dir -s publish ,
>>> not sure what's going wrong. I didn't see any error message during it.
>>>
>>> Any insights are appreciated! So that I can fix it in the next RC.
>>> Thanks!
>>>
>>> On Mon, Sep 17, 2018 at 11:31 AM Sean Owen  wrote:
>>>
 I think one build is enough, but haven't thought it through. The
 Hadoop 2.6/2.7 builds are already nearly redundant. 2.12 is probably
 best advertised as a 'beta'. So maybe publish a no-hadoop build of it?
 Really, whatever's the easy thing to do.
 On Sun, Sep 16, 2018 at 10:28 PM Wenchen Fan 
 wrote:
 >
 > Ah I missed the Scala 2.12 build. Do you mean we should publish a
 Scala 2.12 build this time? Current for Scala 2.11 we have 3 builds: with
 hadoop 2.7, with hadoop 2.6, without hadoop. Shall we do the same thing for
 Scala 2.12?
 >
 > On Mon, Sep 17, 2018 at 11:14 AM Sean Owen  wrote:
 >>
 >> A few preliminary notes:
 >>
 >> Wenchen for some weird reason when I hit your key in gpg --import, it
 >> asks for a passphrase. When I skip it, it's fine, gpg can still
 verify
 >> the signature. No issue there really.
 >>
 >> The staging repo gives a 404:
 >> https://repository.apache.org/content/repositories/
 orgapachespark-1285/
 >> 404 - Repository "orgapachespark-1285 (staging: open)"
 >> [id=orgapachespark-1285] exists but is not exposed.
 >>
 >> The (revamped) licenses are OK, though there are some minor glitches
 >> in the final release tarballs (my fault) : there's an extra
 directory,
 >> and the source release has both binary and source licenses. I'll fix
 >> that. Not strictly necessary to reject the release over those.
 >>
 >> Last, when I check the staging repo I'll get my answer, but, were you
 >> able to build 2.12 artifacts as well?
 >>
 >> On Sun, Sep 16, 2018 at 9:48 PM Wenchen Fan 
 wrote:
 >> >
 >> > Please vote on releasing the following candidate as Apache Spark
 version 2.4.0.
 >> >
 >> > The vote is open until September 20 PST and passes if a majority
 +1 PMC votes are cast, with
 >> > a minimum of 3 +1 votes.
 >> >
 >> > [ ] +1 Release this package as Apache Spark 2.4.0
 >> > [ ] -1 Do not release this package because ...
 >> >
 >> > To learn more about Apache Spark, please see
 http://spark.apache.org/
 >> >
 >> > The tag to be voted on is v2.4.0-rc1 (commit
 1220ab8a0738b5f67dc522df5e3e77ffc83d207a):
 >> > https://github.com/apache/spark/tree/v2.4.0-rc1
 >> >
 >> > The release files, including signatures, digests, etc. can be
 found at:
 >> > https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc1-bin/
 >> >
 >> > Signatures used for Spark RCs can be found in this file:
 >> > https://dist.apache.org/repos/dist/dev/spark/KEYS
 >> >
 >> > The staging repository for this release can be found at:
 >> > https://repository.apache.org/content/repositories/
 orgapachespark-1285/
 >> >
 >> > The documentation corresponding to this release can be found at:
 >> > https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc1-docs/
 >> >
 >> > The list of bug fixes going into 2.4.0 can be found at the
 following URL:
 >> > https://issues.apache.org/jira/projects/SPARK/versions/2.4.0
 >> >
 >> > FAQ
 >> >
 >> > =
 >> > How can I help test this 

Re: [VOTE] SPARK 2.4.0 (RC1)

2018-09-17 Thread Nicholas Chammas
I believe -1 votes are merited only for correctness bugs and regressions
since the previous release.

Does SPARK-23200 count as either?

On Mon, Sep 17, 2018 at 9:40 AM Stavros Kontopoulos <
stavros.kontopou...@lightbend.com> wrote:

> -1
>
> I would like to see: https://github.com/apache/spark/pull/22392 in, as
> discussed here: https://issues.apache.org/jira/browse/SPARK-23200. It is
> important IMHO for streaming on K8s.
> I just started testing it btw.
>
> Also, 2.12.7 (https://contributors.scala-lang.org/t/2-12-7-release/2301,
> https://github.com/scala/scala/milestone/73) is coming out (it will be staged
> this week); do we want to build the beta 2.12 build against it?
>
> Stavros
>
> On Mon, Sep 17, 2018 at 8:00 AM, Wenchen Fan  wrote:
>
>> I confirmed that
>> https://repository.apache.org/content/repositories/orgapachespark-1285
>> is not accessible. I did it via ./dev/create-release/do-release-docker.sh
>> -d /my/work/dir -s publish , not sure what's going wrong. I didn't see
>> any error message during it.
>>
>> Any insights are appreciated! So that I can fix it in the next RC. Thanks!
>>
>> On Mon, Sep 17, 2018 at 11:31 AM Sean Owen  wrote:
>>
>>> I think one build is enough, but haven't thought it through. The
>>> Hadoop 2.6/2.7 builds are already nearly redundant. 2.12 is probably
>>> best advertised as a 'beta'. So maybe publish a no-hadoop build of it?
>>> Really, whatever's the easy thing to do.
>>> On Sun, Sep 16, 2018 at 10:28 PM Wenchen Fan 
>>> wrote:
>>> >
>>> > Ah I missed the Scala 2.12 build. Do you mean we should publish a
>>> Scala 2.12 build this time? Current for Scala 2.11 we have 3 builds: with
>>> hadoop 2.7, with hadoop 2.6, without hadoop. Shall we do the same thing for
>>> Scala 2.12?
>>> >
>>> > On Mon, Sep 17, 2018 at 11:14 AM Sean Owen  wrote:
>>> >>
>>> >> A few preliminary notes:
>>> >>
>>> >> Wenchen for some weird reason when I hit your key in gpg --import, it
>>> >> asks for a passphrase. When I skip it, it's fine, gpg can still verify
>>> >> the signature. No issue there really.
>>> >>
>>> >> The staging repo gives a 404:
>>> >>
>>> https://repository.apache.org/content/repositories/orgapachespark-1285/
>>> >> 404 - Repository "orgapachespark-1285 (staging: open)"
>>> >> [id=orgapachespark-1285] exists but is not exposed.
>>> >>
>>> >> The (revamped) licenses are OK, though there are some minor glitches
>>> >> in the final release tarballs (my fault) : there's an extra directory,
>>> >> and the source release has both binary and source licenses. I'll fix
>>> >> that. Not strictly necessary to reject the release over those.
>>> >>
>>> >> Last, when I check the staging repo I'll get my answer, but, were you
>>> >> able to build 2.12 artifacts as well?
>>> >>
>>> >> On Sun, Sep 16, 2018 at 9:48 PM Wenchen Fan 
>>> wrote:
>>> >> >
>>> >> > Please vote on releasing the following candidate as Apache Spark
>>> version 2.4.0.
>>> >> >
>>> >> > The vote is open until September 20 PST and passes if a majority +1
>>> PMC votes are cast, with
>>> >> > a minimum of 3 +1 votes.
>>> >> >
>>> >> > [ ] +1 Release this package as Apache Spark 2.4.0
>>> >> > [ ] -1 Do not release this package because ...
>>> >> >
>>> >> > To learn more about Apache Spark, please see
>>> http://spark.apache.org/
>>> >> >
>>> >> > The tag to be voted on is v2.4.0-rc1 (commit
>>> 1220ab8a0738b5f67dc522df5e3e77ffc83d207a):
>>> >> > https://github.com/apache/spark/tree/v2.4.0-rc1
>>> >> >
>>> >> > The release files, including signatures, digests, etc. can be found
>>> at:
>>> >> > https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc1-bin/
>>> >> >
>>> >> > Signatures used for Spark RCs can be found in this file:
>>> >> > https://dist.apache.org/repos/dist/dev/spark/KEYS
>>> >> >
>>> >> > The staging repository for this release can be found at:
>>> >> >
>>> https://repository.apache.org/content/repositories/orgapachespark-1285/
>>> >> >
>>> >> > The documentation corresponding to this release can be found at:
>>> >> > https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc1-docs/
>>> >> >
>>> >> > The list of bug fixes going into 2.4.0 can be found at the
>>> following URL:
>>> >> > https://issues.apache.org/jira/projects/SPARK/versions/2.4.0
>>> >> >
>>> >> > FAQ
>>> >> >
>>> >> > =
>>> >> > How can I help test this release?
>>> >> > =
>>> >> >
>>> >> > If you are a Spark user, you can help us test this release by taking
>>> >> > an existing Spark workload and running on this release candidate,
>>> then
>>> >> > reporting any regressions.
>>> >> >
>>> >> > If you're working in PySpark you can set up a virtual env and
>>> install
>>> >> > the current RC and see if anything important breaks, in the
>>> Java/Scala
>>> >> > you can add the staging repository to your projects resolvers and
>>> test
>>> >> > with the RC (make sure to clean up the artifact cache before/after
>>> so
>>> >> > you don't end up building with a out of date RC going forward).
>>> >> >
>>> >> > 

Re: [VOTE] SPARK 2.4.0 (RC1)

2018-09-17 Thread Stavros Kontopoulos
-1

I would like to see: https://github.com/apache/spark/pull/22392 in, as
discussed here: https://issues.apache.org/jira/browse/SPARK-23200. It is
important IMHO for streaming on K8s.
I just started testing it btw.

Also, 2.12.7 (https://contributors.scala-lang.org/t/2-12-7-release/2301,
https://github.com/scala/scala/milestone/73) is coming out (it will be staged
this week); do we want to build the beta 2.12 build against it?

Stavros

On Mon, Sep 17, 2018 at 8:00 AM, Wenchen Fan  wrote:

> I confirmed that https://repository.apache.org/content/
> repositories/orgapachespark-1285 is not accessible. I did it via
> ./dev/create-release/do-release-docker.sh -d /my/work/dir -s publish ,
> not sure what's going wrong. I didn't see any error message during it.
>
> Any insights are appreciated! So that I can fix it in the next RC. Thanks!
>
> On Mon, Sep 17, 2018 at 11:31 AM Sean Owen  wrote:
>
>> I think one build is enough, but haven't thought it through. The
>> Hadoop 2.6/2.7 builds are already nearly redundant. 2.12 is probably
>> best advertised as a 'beta'. So maybe publish a no-hadoop build of it?
>> Really, whatever's the easy thing to do.
>> On Sun, Sep 16, 2018 at 10:28 PM Wenchen Fan  wrote:
>> >
>> > Ah I missed the Scala 2.12 build. Do you mean we should publish a Scala
>> 2.12 build this time? Current for Scala 2.11 we have 3 builds: with hadoop
>> 2.7, with hadoop 2.6, without hadoop. Shall we do the same thing for Scala
>> 2.12?
>> >
>> > On Mon, Sep 17, 2018 at 11:14 AM Sean Owen  wrote:
>> >>
>> >> A few preliminary notes:
>> >>
>> >> Wenchen for some weird reason when I hit your key in gpg --import, it
>> >> asks for a passphrase. When I skip it, it's fine, gpg can still verify
>> >> the signature. No issue there really.
>> >>
>> >> The staging repo gives a 404:
>> >> https://repository.apache.org/content/repositories/
>> orgapachespark-1285/
>> >> 404 - Repository "orgapachespark-1285 (staging: open)"
>> >> [id=orgapachespark-1285] exists but is not exposed.
>> >>
>> >> The (revamped) licenses are OK, though there are some minor glitches
>> >> in the final release tarballs (my fault) : there's an extra directory,
>> >> and the source release has both binary and source licenses. I'll fix
>> >> that. Not strictly necessary to reject the release over those.
>> >>
>> >> Last, when I check the staging repo I'll get my answer, but, were you
>> >> able to build 2.12 artifacts as well?
>> >>
>> >> On Sun, Sep 16, 2018 at 9:48 PM Wenchen Fan 
>> wrote:
>> >> >
>> >> > Please vote on releasing the following candidate as Apache Spark
>> version 2.4.0.
>> >> >
>> >> > The vote is open until September 20 PST and passes if a majority +1
>> PMC votes are cast, with
>> >> > a minimum of 3 +1 votes.
>> >> >
>> >> > [ ] +1 Release this package as Apache Spark 2.4.0
>> >> > [ ] -1 Do not release this package because ...
>> >> >
>> >> > To learn more about Apache Spark, please see
>> http://spark.apache.org/
>> >> >
>> >> > The tag to be voted on is v2.4.0-rc1 (commit
>> 1220ab8a0738b5f67dc522df5e3e77ffc83d207a):
>> >> > https://github.com/apache/spark/tree/v2.4.0-rc1
>> >> >
>> >> > The release files, including signatures, digests, etc. can be found
>> at:
>> >> > https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc1-bin/
>> >> >
>> >> > Signatures used for Spark RCs can be found in this file:
>> >> > https://dist.apache.org/repos/dist/dev/spark/KEYS
>> >> >
>> >> > The staging repository for this release can be found at:
>> >> > https://repository.apache.org/content/repositories/
>> orgapachespark-1285/
>> >> >
>> >> > The documentation corresponding to this release can be found at:
>> >> > https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc1-docs/
>> >> >
>> >> > The list of bug fixes going into 2.4.0 can be found at the following
>> URL:
>> >> > https://issues.apache.org/jira/projects/SPARK/versions/2.4.0
>> >> >
>> >> > FAQ
>> >> >
>> >> > =
>> >> > How can I help test this release?
>> >> > =
>> >> >
>> >> > If you are a Spark user, you can help us test this release by taking
>> >> > an existing Spark workload and running on this release candidate,
>> then
>> >> > reporting any regressions.
>> >> >
>> >> > If you're working in PySpark you can set up a virtual env and install
>> >> > the current RC and see if anything important breaks, in the
>> Java/Scala
>> >> > you can add the staging repository to your projects resolvers and
>> test
>> >> > with the RC (make sure to clean up the artifact cache before/after so
>> >> > you don't end up building with a out of date RC going forward).
>> >> >
>> >> > ===
>> >> > What should happen to JIRA tickets still targeting 2.4.0?
>> >> > ===
>> >> >
>> >> > The current list of open tickets targeted at 2.4.0 can be found at:
>> >> > https://issues.apache.org/jira/projects/SPARK and search for
>> "Target Version/s" = 2.4.0
>> >> >
>> >> > 

[VOTE] SPARK 2.3.2 (RC6)

2018-09-17 Thread Saisai Shao
Please vote on releasing the following candidate as Apache Spark version
2.3.2.

The vote is open until September 21 PST and passes if a majority +1 PMC
votes are cast, with a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 2.3.2
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v2.3.2-rc6 (commit
02b510728c31b70e6035ad541bfcdc2b59dcd79a):
https://github.com/apache/spark/tree/v2.3.2-rc6

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.3.2-rc6-bin/

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1286/

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.3.2-rc6-docs/

The list of bug fixes going into 2.3.2 can be found at the following URL:
https://issues.apache.org/jira/projects/SPARK/versions/12343289


FAQ

=
How can I help test this release?
=

If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running on this release candidate, then
reporting any regressions.

If you're working in PySpark you can set up a virtual env and install
the current RC and see if anything important breaks. In Java/Scala,
you can add the staging repository to your project's resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out-of-date RC going forward).
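
For example, a bare-bones PySpark smoke test against this RC could look like
the sketch below; it assumes the RC's pyspark has been installed into the
active virtual env (e.g. from the -bin tarball listed above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rc-smoke-test").getOrCreate()
print("Spark version:", spark.version)  # expect 2.3.2 for this candidate

# Trivial job: bucket a range of ids and count each bucket.
df = spark.range(1000).selectExpr("id", "id % 7 AS bucket")
counts = df.groupBy("bucket").count().collect()
print(sorted((row["bucket"], row["count"]) for row in counts))

spark.stop()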

===
What should happen to JIRA tickets still targeting 2.3.2?
===

The current list of open tickets targeted at 2.3.2 can be found at:
https://issues.apache.org/jira/projects/SPARK and search for "Target
Version/s" = 2.3.2

Committers should look at those and triage. Extremely important bug
fixes, documentation, and API tweaks that impact compatibility should
be worked on immediately. Everything else please retarget to an
appropriate release.

==
But my bug isn't fixed?
==

In order to make timely releases, we will typically not hold the
release unless the bug in question is a regression from the previous
release. That being said, if there is something which is a regression
that has not been correctly targeted please ping me or a committer to
help target the issue.


Re: how can solve this error

2018-09-17 Thread Wenchen Fan
Have you read
https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html
?
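
For reference, the structured-streaming route described in that guide looks
roughly like the sketch below; the broker address is a placeholder, and it
assumes the job is launched with --packages
org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-structured-sketch").getOrCreate()

# Subscribe to the "test1" topic; replace the placeholder broker address.
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "test1")
      .load())

# Kafka records arrive as binary key/value columns; cast to strings to inspect.
query = (df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
         .writeStream
         .format("console")
         .start())

query.awaitTermination()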

On Mon, Sep 17, 2018 at 4:46 AM hagersaleh 
wrote:

> I wrote code to connect Kafka with Spark using Python, and I run the code in
> Jupyter.
> My code:
> import os
> #os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars
>
> /home/hadoop/Desktop/spark-program/kafka/spark-streaming-kafka-0-8-assembly_2.10-2.0.0-preview.jar
> pyspark-shell'
> os.environ['PYSPARK_SUBMIT_ARGS'] = "--packages
> org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0 pyspark-shell"
>
> os.environ['PYSPARK_SUBMIT_ARGS'] = "--packages
> org.apache.spark:spark-streaming-kafka-0-8_2.11:2.3.0 pyspark-shell"
>
> import pyspark
> from pyspark.streaming.kafka import KafkaUtils
> from pyspark.streaming import StreamingContext
>
> #sc = SparkContext()
> ssc = StreamingContext(sc,1)
>
> broker = "iotmsgs"
> directKafkaStream = KafkaUtils.createDirectStream(ssc, ["test1"],
> {"metadata.broker.list": broker})
> directKafkaStream.pprint()
> ssc.start()
>
> The error displayed is:
> Spark Streaming's Kafka libraries not found in class path. Try one of the
> following.
>
>   1. Include the Kafka library and its dependencies with in the
>  spark-submit command as
>
>  $ bin/spark-submit --packages
> org.apache.spark:spark-streaming-kafka-0-8:2.3.0 ...
>
>   2. Download the JAR of the artifact from Maven Central
> http://search.maven.org/,
>  Group Id = org.apache.spark, Artifact Id =
> spark-streaming-kafka-0-8-assembly, Version = 2.3.0.
>  Then, include the jar in the spark-submit command as
>
>  $ bin/spark-submit --jars  ...
>
>
>
>
>
>