Re: [VOTE] SPARK 2.4.0 (RC1)

2018-09-16 Thread Wenchen Fan
I confirmed that https://repository.apache.org/content/repositories/orgapachespark-1285 is not accessible. I published via ./dev/create-release/do-release-docker.sh -d /my/work/dir -s publish, so I'm not sure what's going wrong; I didn't see any error message during the run. Any insights are appreciated! So

Re: [VOTE] SPARK 2.4.0 (RC1)

2018-09-16 Thread Sean Owen
I think one build is enough, but haven't thought it through. The Hadoop 2.6/2.7 builds are already nearly redundant. 2.12 is probably best advertised as a 'beta'. So maybe publish a no-hadoop build of it? Really, whatever's the easy thing to do. On Sun, Sep 16, 2018 at 10:28 PM Wenchen Fan wrote:

Re: [VOTE] SPARK 2.4.0 (RC1)

2018-09-16 Thread Wenchen Fan
Ah, I missed the Scala 2.12 build. Do you mean we should publish a Scala 2.12 build this time? Currently, for Scala 2.11 we have 3 builds: with Hadoop 2.7, with Hadoop 2.6, and without Hadoop. Shall we do the same for Scala 2.12? On Mon, Sep 17, 2018 at 11:14 AM Sean Owen wrote: > A few

Re: [VOTE] SPARK 2.4.0 (RC1)

2018-09-16 Thread Sean Owen
A few preliminary notes: Wenchen, for some weird reason, when I import your key via gpg --import, it asks for a passphrase. When I skip it, it's fine; gpg can still verify the signature, so no real issue there. The staging repo gives a 404:

[VOTE] SPARK 2.4.0 (RC1)

2018-09-16 Thread Wenchen Fan
Please vote on releasing the following candidate as Apache Spark version 2.4.0. The vote is open until September 20 PST and passes if a majority of the PMC votes cast are +1, with a minimum of 3 +1 votes. [ ] +1 Release this package as Apache Spark 2.4.0 [ ] -1 Do not release this package because ...

Re: Some PRs not automatically linked to JIRAs

2018-09-16 Thread Hyukjin Kwon
It seems the same thing is happening again. For instance: - https://issues.apache.org/jira/browse/SPARK-25440 / https://github.com/apache/spark/pull/22429 - https://issues.apache.org/jira/browse/SPARK-25429 / https://github.com/apache/spark/pull/22420 On Thu, Aug 3, 2017 at 9:06 AM, Hyukjin Kwon wrote: > I

Re: from_csv

2018-09-16 Thread Hyukjin Kwon
+1 for this idea, since text parsing of CSV/JSON is quite common. One open question is schema inference, as with the JSON functionality: for JSON we added schema_of_json, and the same approach should apply to CSV too. If we see more need for it, we can consider a function
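
A minimal Scala sketch of the pattern under discussion: schema_of_json and from_json with a Column schema ship in Spark 2.4, while from_csv and schema_of_csv are only proposed in this thread, so those names are hypothetical.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{from_json, schema_of_json}

    val spark = SparkSession.builder().master("local[*]").appName("from_csv-sketch").getOrCreate()
    import spark.implicits._

    val df = Seq("""{"a": 1, "b": 0.8}""").toDF("value")
    // Infer a schema from a sample record, then use it to parse the column.
    df.select(from_json($"value", schema_of_json("""{"a": 1, "b": 0.8}"""))).show()

    // Hypothetical CSV analogue, if the proposal mirrors the JSON functions:
    // df.select(from_csv($"value", schema_of_csv("1,0.8"))).show()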

Re: Should python-2 be supported in Spark 3.0?

2018-09-16 Thread Hyukjin Kwon
I think we can deprecate it in 3.x.0 and remove it in Spark 4.0.0. Many people still use Python 2. Also, technically 2.7 support is not officially dropped yet - https://pythonclock.org/ On Mon, Sep 17, 2018 at 9:31 AM, Aakash Basu wrote: > Removing support for an API in a major release makes poor

Re: [Discuss] Datasource v2 support for Kerberos

2018-09-16 Thread Wenchen Fan
I'm +1 for this proposal: "Extend SessionConfigSupport to support passing specific white-listed configuration values". One goal of the data source v2 API is to not depend on high-level APIs like SparkSession, SQLConf, etc. If users do want to access these high-level APIs, there is a workaround:
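
To make the shape of the proposal concrete: today SessionConfigSupport only declares a key prefix, and the extension discussed here would let a source additionally name specific keys outside its namespace. A minimal Scala sketch follows; the extra method is hypothetical, not an existing Spark API (read/write support mixins omitted).

    import org.apache.spark.sql.sources.v2.{DataSourceV2, SessionConfigSupport}

    class KerberizedSource extends DataSourceV2 with SessionConfigSupport {
      // Existing contract: session configs under spark.datasource.kerberized.*
      // are forwarded into this source's options.
      override def keyPrefix(): String = "kerberized"

      // Hypothetical extension sketched in this thread: an explicit white-list
      // of configs outside the spark.datasource.* namespace.
      // def whitelistedConfigs(): Array[String] =
      //   Array("spark.yarn.keytab", "spark.yarn.principal")
    }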

Re: [Discuss] Datasource v2 support for manipulating partitions

2018-09-16 Thread Thakrar, Jayesh
I am not involved with the design or development of the V2 API, so these may be naïve comments/thoughts. Just as Dataset abstracts away from RDD, which otherwise required somewhat more intimate knowledge of Spark internals, I am guessing the absence of partition operations is either

Re: Should python-2 be supported in Spark 3.0?

2018-09-16 Thread Felix Cheung
I don’t think we should remove any API even in a major release without deprecating it first... From: Mark Hamstra Sent: Sunday, September 16, 2018 12:26 PM To: Erik Erlandson Cc: u...@spark.apache.org; dev Subject: Re: Should python-2 be supported in Spark 3.0?

Re: Python friendly API for Spark 3.0

2018-09-16 Thread Mark Hamstra
> > difficult to reconcile > That's a big chunk of what I'm getting at: How much is it even possible to do this kind of reconciliation from the underlying implementation to a more normal/expected/friendly API for a given programming environment? How much more work is it for us to maintain

Re: Python friendly API for Spark 3.0

2018-09-16 Thread Reynold Xin
Most of those are pretty difficult to add, though, because they are fundamentally difficult to do in a distributed setting and with lazy execution. We should add some, but at some point there are fundamental differences in the underlying execution engine that are pretty difficult to reconcile.

Re: Python friendly API for Spark 3.0

2018-09-16 Thread Matei Zaharia
My 2 cents on this is that the biggest room for improvement in Python is similarity to Pandas. We already made the Python DataFrame API different from Scala/Java in some respects, but if there’s anything we can do to make it more obvious to Pandas users, that will help the most. The other issue

How can I solve this error?

2018-09-16 Thread hagersaleh
I wrote code to connect Kafka with Spark using Python, and I run the code in Jupyter. My code:

    import os
    #os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars /home/hadoop/Desktop/spark-program/kafka/spark-streaming-kafka-0-8-assembly_2.10-2.0.0-preview.jar pyspark-shell'
    os.environ['PYSPARK_SUBMIT_ARGS'] =

Re: Python friendly API for Spark 3.0

2018-09-16 Thread Mark Hamstra
It's not splitting hairs, Erik. It's actually very close to something that I think deserves some discussion (perhaps on a separate thread). What I've been thinking about also concerns API "friendliness" or style. The original RDD API was very intentionally modeled on the Scala parallel collections

Re: Should python-2 be supported in Spark 3.0?

2018-09-16 Thread Mark Hamstra
We could also deprecate Py2 as early as the 2.4.0 release. On Sat, Sep 15, 2018 at 11:46 AM Erik Erlandson wrote: > In case this didn't make it onto this thread: > > There is a 3rd option, which is to deprecate Py2 for Spark 3.0 and remove > it entirely in a later 3.x release. > > On Sat, Sep

Re: from_csv

2018-09-16 Thread Maxim Gekk
Hi Reynold, > I'd make this as consistent with to_json / from_json as possible. Sure, the new function from_csv() has the same signature as from_json(). > How would this work in SQL? I.e., how would passing options in work? The options are passed to the function via a map, for example: select
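
For reference, from_json already accepts options as a map in Spark 2.4 SQL; a sketch (assuming an active SparkSession named spark), with the from_csv call hypothetical until the new function lands:

    // from_json with an options map, as supported in Spark 2.4 SQL:
    spark.sql(
      """SELECT from_json('{"time": "26/08/2018"}', 'time Timestamp',
        |                 map('timestampFormat', 'dd/MM/yyyy'))""".stripMargin).show()

    // Hypothetical from_csv analogue with the same options-as-map convention:
    // spark.sql("SELECT from_csv('26/08/2018', 'time Timestamp', map('timestampFormat', 'dd/MM/yyyy'))").show()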

[Discuss] Datasource v2 support for Kerberos

2018-09-16 Thread tigerquoll
The current V2 Datasource API provides support for querying a portion of the Spark configuration namespace (spark.datasource.*) via the SessionConfigSupport API. This was designed on the assumption that the configuration of each v2 data source should be kept separate from the others.
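
A minimal sketch of that existing contract from the user's side, assuming a v2 source registered under the hypothetical name "mysource" whose keyPrefix() returns "mysource":

    // Session configs under spark.datasource.mysource.* are extracted by Spark
    // and forwarded into the source's options with the prefix stripped:
    spark.conf.set("spark.datasource.mysource.token", "s3cr3t")

    // When the source is loaded, its reader sees options.get("token") == "s3cr3t".
    val df = spark.read.format("mysource").load()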

[Discuss] Datasource v2 support for manipulating partitions

2018-09-16 Thread tigerquoll
I've been following the development of the new data source abstraction with keen interest. One of the issues that has occurred to me as I sat down and planned how I would implement a data source is how I would support manipulating partitions. My reading of the current prototype is that Data
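
Nothing like this exists in the current prototype; purely as a strawman for the discussion, "manipulating partitions" could mean an optional mixin along these lines (all names hypothetical):

    import org.apache.spark.sql.sources.v2.DataSourceV2

    // Strawman only: a v2 source could opt in to explicit partition management.
    trait SupportsPartitionManipulation { self: DataSourceV2 =>
      /** Lists the partitions currently known to the source, as spec maps. */
      def listPartitions(): Seq[Map[String, String]]

      /** Creates a partition for a spec such as Map("date" -> "2018-09-16"). */
      def addPartition(spec: Map[String, String]): Unit

      /** Drops the partition matching the given spec. */
      def dropPartition(spec: Map[String, String]): Unit
    }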