Re: Branch 2.4 is cut

2018-09-10 Thread Wenchen Fan
he next release, like > updating the write path. I think it would be better not to change this only > to include another major change in the next release. > > On Sun, Sep 9, 2018 at 10:41 PM Wenchen Fan wrote: > >> Strictly speaking, data source v2 is always half-finished until w

Re: DataSourceWriter V2 Api questions

2018-09-10 Thread Wenchen Fan
Regardless of the API, to use Spark to write data atomically, it requires: 1. Write data in a distributed manner, with a central coordinator at the Spark driver. 2. The distributed writers are not guaranteed to run together at the same time. (This can be relaxed if we can extend the barrier scheduling feature) 3.
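
The requirements in this excerpt amount to a driver-side commit protocol. As a rough illustration only (plain Scala with invented names, not Spark's actual `DataSourceWriter` API), a coordinator that commits a job only when every distributed writer has reported success might look like:

```scala
// Minimal sketch of a driver-side commit coordinator: distributed writer
// tasks stage their output somewhere, and the driver commits the job
// atomically only once every partition has reported success.
// All names here are illustrative, not Spark's real v2 interfaces.
case class TaskCommit(partitionId: Int, stagedPath: String)

class CommitCoordinator(numPartitions: Int) {
  private val commits = scala.collection.mutable.Map[Int, TaskCommit]()

  // Called when a writer task finishes; the last commit per partition wins,
  // which tolerates speculative/retried tasks reporting more than once.
  def onTaskCommit(c: TaskCommit): Unit = synchronized {
    commits(c.partitionId) = c
  }

  // Commit succeeds only if all partitions reported; otherwise abort.
  def tryCommitJob(): Either[String, Seq[String]] = synchronized {
    if (commits.size == numPartitions) Right(commits.values.map(_.stagedPath).toSeq)
    else Left(s"only ${commits.size} of $numPartitions tasks committed; aborting")
  }
}
```

Point 2 in the email is why the coordinator cannot assume all writers are alive at once: it can only collect per-task commit messages as they arrive.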

Re: code freeze and branch cut for Apache Spark 2.4

2018-08-29 Thread Wenchen Fan
t matter. >> >> Tom >> >> On Wednesday, August 8, 2018, 9:06:43 AM CDT, Imran Rashid >> wrote: >> >> >> On Tue, Aug 7, 2018 at 8:39 AM, Wenchen Fan wrote: >> >> SPARK-23243 <https://issues.apache.org/jira/browse/SPARK-23243&

Re: [VOTE] SPARK 2.3.2 (RC1)

2018-07-10 Thread Wenchen Fan
+1 On Wed, Jul 11, 2018 at 1:31 AM John Zhuge wrote: > +1 > > On Sun, Jul 8, 2018 at 1:30 AM Saisai Shao wrote: > >> Please vote on releasing the following candidate as Apache Spark version >> 2.3.2. >> >> The vote is open until July 11th PST and passes if a majority +1 PMC >> votes are cast,

Re: [DISCUSS] SPIP: Standardize SQL logical plans

2018-07-11 Thread Wenchen Fan
Hi Ryan, Great job on this! Shall we call a vote for the plan standardization SPIP? I think this is a good idea and we should do it. Notes: We definitely need new user-facing APIs to produce these new logical plans like DeleteData. But we need a design doc for these new APIs after the SPIP

Re: [VOTE] Spark 2.3.0 (RC2)

2018-01-22 Thread Wenchen Fan
+1 All the blocking issues are resolved(AFAIK), and important data source v2 features have been merged. On Tue, Jan 23, 2018 at 9:09 AM, Marcelo Vanzin wrote: > +0 > > Signatures check out. Code compiles, although I see the errors in [1] > when untarring the source

Re: SparkContext - parameter for RDD, but not serializable, why?

2018-02-28 Thread Wenchen Fan
My understanding: RDD, like SparkContext, is a driver-side thing; it works like a handle to your distributed data on the cluster. However, `RDD.compute` (which defines how to produce data for each partition) needs to be executed on the remote nodes. It's more convenient to make RDD serializable, and
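
A minimal sketch of the handle-vs-compute split described here (toy classes, not Spark's real RDD; Java serialization stands in for shipping a task to an executor):

```scala
import java.io._

// The "RDD" below is a driver-side handle; its compute function must be
// serializable because the scheduler ships it to remote executors.
class MiniRDD[T](val numPartitions: Int, compute: Int => Seq[T]) extends Serializable {
  def computePartition(split: Int): Seq[T] = compute(split)
}

// Simulate sending the handle over the wire, as Spark does with tasks.
def roundTrip[A](a: A): A = {
  val buf = new ByteArrayOutputStream()
  new ObjectOutputStream(buf).writeObject(a)
  new ObjectInputStream(new ByteArrayInputStream(buf.toByteArray))
    .readObject().asInstanceOf[A]
}

val rdd = new MiniRDD[Int](2, split => Seq(split * 10, split * 10 + 1))
val onExecutor = roundTrip(rdd) // works because MiniRDD is Serializable
```

A field holding a SparkContext would break this round trip, which is why the real RDD marks its SparkContext reference as transient.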

Re: SparkContext - parameter for RDD, but not serializable, why?

2018-02-28 Thread Wenchen Fan
rtition = split.asInstanceOf[ > MyDataSourcePartition] > val partitionId = myDataSourcePartition.index > val rows = myDataSourcePartition.rowCount > val partitionData = 1 to rows map(r => Row(s"Partition: ${partitionId}, row ${r} of ${rows}"))

Re: [discuss][data source v2] remove type parameter in DataReader/WriterFactory

2018-04-17 Thread Wenchen Fan
’ in the interface. Then > specifically how are we going to express capability of the given reader of > its supported format(s), or specific support for each of “real-time data in > row format, and history data in columnar format”? > > > ---------- > *

Re: [discuss][data source v2] remove type parameter in DataReader/WriterFactory

2018-04-18 Thread Wenchen Fan
> interfaces entirely and simplify what implementers must provide and would > reduce confusion over what to do. >> >> Using InternalRow doesn’t cover the case where we want to produce >> ColumnarBatch instead, so what you’re proposing might still be a good >> id

[discuss][data source v2] remove type parameter in DataReader/WriterFactory

2018-04-15 Thread Wenchen Fan
Hi all, I'd like to propose an API change to the data source v2. One design goal of data source v2 is API type safety. The FileFormat API is a bad example: it asks the implementation to return InternalRow even when the data is actually a ColumnarBatch. In data source v2 we add a type parameter to
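
To make the trade-off concrete, here is a toy version of the tension (illustrative names only, not the real Spark interfaces): a type parameter on the factory forces one interface per output format, even though the engine ultimately consumes both row and columnar output through the same code path.

```scala
// A parameterized factory looks type-safe...
trait ReaderFactory[T] { def createReader(partition: Int): Iterator[T] }

// ...but the engine must handle both of these shapes anyway, so a columnar
// source either needs a second parallel interface or has to cast. Removing
// the parameter and reporting the format as a capability avoids the split.
final case class ToyRow(values: Seq[Any])
final case class ToyBatch(column: Array[Int])

class RowSource extends ReaderFactory[ToyRow] {
  def createReader(partition: Int): Iterator[ToyRow] =
    Iterator(ToyRow(Seq(partition)))
}

class ColumnarSource extends ReaderFactory[ToyBatch] {
  def createReader(partition: Int): Iterator[ToyBatch] =
    Iterator(ToyBatch(Array(partition)))
}
```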

Re: DataSourceV2 write input requirements

2018-03-26 Thread Wenchen Fan
yuzhih...@gmail.com> wrote: > >> Hmm. Ryan seems to be right. >> >> Looking at sql/core/src/main/java/org/apache/spark/sql/sources/v2/re >> ader/SupportsReportPartitioning.java : >> >> import org.apache.spark.sql.sources.v2.reader.partitioning.Partitioning; &

Re: DataSourceV2 write input requirements

2018-03-26 Thread Wenchen Fan
Actually clustering is already supported, please take a look at SupportsReportPartitioning Ordering is not proposed yet, might be similar to what Ryan proposed. On Mon, Mar 26, 2018 at 6:11 PM, Ted Yu wrote: > Interesting. > > Should requiredClustering return a Set of

Re: DataSourceV2 write input requirements

2018-03-27 Thread Wenchen Fan
10:05 PM, Ted Yu <yuzhih...@gmail.com> wrote: >> >>> Hmm. Ryan seems to be right. >>> >>> Looking at sql/core/src/main/java/org/apache/spark/sql/sources/v2/re >>> ader/SupportsReportPartitioning.java : >>> >>> import org.apache.spark.sql.so

Welcome Zhenhua Wang as a Spark committer

2018-04-01 Thread Wenchen Fan
Hi all, The Spark PMC recently added Zhenhua Wang as a committer on the project. Zhenhua is the major contributor of the CBO project, and has been contributing across several areas of Spark for a while, focusing especially on the analyzer and optimizer in Spark SQL. Please join me in welcoming Zhenhua!

Re: Any reason for not exposing internalCreateDataFrame or isStreaming beyond sql package?

2018-03-22 Thread Wenchen Fan
org.apache.spark.sql.execution.streaming.Source is for internal use only. The official stream data source API is the data source v2 API. You can take a look at the Spark built-in streaming data sources as examples. Note: data source v2 is still experimental, you may need to update your code in a

Re: Any reason for not exposing internalCreateDataFrame or isStreaming beyond sql package?

2018-03-22 Thread Wenchen Fan
ing" to > true. > > > > But I should confess that I don't know the source code very well, so will > appreciate if you can point me to any other pointers/examples please. > > > > *From: *Wenchen Fan <cloud0...@gmail.com> > *Date: *Thursday, March 22, 2018 at 2:5

Re: Welcoming some new committers

2018-03-02 Thread Wenchen Fan
Congratulations to everyone and welcome! On Sat, Mar 3, 2018 at 7:26 AM, Cody Koeninger wrote: > Congrats to the new committers, and I appreciate the vote of confidence. > > On Fri, Mar 2, 2018 at 4:41 PM, Matei Zaharia > wrote: > > Hi everyone, > >

Re: why BroadcastHashJoinExec is not implemented with outputOrdering?

2018-06-28 Thread Wenchen Fan
ed on >> join keys, we can output the ordered join keys as output ordering. >> >> >> Chrysan Wu >> 吴晓菊 >> Phone:+86 17717640807 >> >> >> 2018-06-28 22:53 GMT+08:00 Wenchen Fan : >> >>> SortMergeJoin only reports ordering of the join

Re: why BroadcastHashJoinExec is not implemented with outputOrdering?

2018-06-28 Thread Wenchen Fan
SortMergeJoin only reports ordering of the join keys, not the output ordering of any child. It seems reasonable to me that broadcast join should respect the output ordering of the children. Feel free to submit a PR to fix it, thanks! On Thu, Jun 28, 2018 at 10:07 PM 吴晓菊 wrote: > Why we cannot

Re: Time for 2.3.2?

2018-06-28 Thread Wenchen Fan
Thu 11:40 AM wrote: > >> +1. SPARK-24589 / SPARK-24552 are kinda nasty and we should get fixes >> for those out. >> >> (Those are what delayed 2.2.2 and 2.1.3 for those watching...) >> >> On Wed, Jun 27, 2018 at 7:59 PM, Wenchen Fan wrote: >> > Hi all

Re: [VOTE] Spark 2.2.2 (RC2)

2018-06-27 Thread Wenchen Fan
+1 On Thu, Jun 28, 2018 at 10:19 AM zhenya Sun wrote: > +1 > > On Jun 28, 2018, at 10:15 AM, Hyukjin Kwon wrote: > > +1 > > On Thu, Jun 28, 2018 at 8:42 AM, Sean Owen wrote: >> +1 from me too. >> >> On Wed, Jun 27, 2018 at 3:31 PM Tom Graves >> wrote: >> >>> Please vote on releasing the following candidate as

Time for 2.3.2?

2018-06-27 Thread Wenchen Fan
Hi all, Spark 2.3.1 was released just a while ago, but unfortunately we discovered and fixed some critical issues afterward. *SPARK-24495: SortMergeJoin may produce wrong results.* This is a serious correctness bug, and it is easy to hit: have duplicated join keys in the left table, e.g. `WHERE
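
The failing query itself is truncated above, so as a reference point only: here is what correct sort-merge-join semantics require when the left side has duplicated join keys — every duplicate must pair with every matching right row. This is plain Scala, not Spark code; SPARK-24495 was a case where Spark 2.3.1 could violate this invariant.

```scala
// Reference two-pointer merge join over Int keys. The inner loop gathers
// the full run of matching right rows so that each duplicated left key
// pairs with every match -- the invariant the 2.3.1 bug broke.
def mergeJoin[A, B](left: Seq[(Int, A)], right: Seq[(Int, B)]): Seq[(Int, A, B)] = {
  val l = left.sortBy(_._1)
  val r = right.sortBy(_._1)
  val out = scala.collection.mutable.ArrayBuffer[(Int, A, B)]()
  var i = 0
  var j = 0
  while (i < l.length && j < r.length) {
    val lk = l(i)._1
    val rk = r(j)._1
    if (lk < rk) i += 1
    else if (lk > rk) j += 1
    else {
      var jEnd = j
      while (jEnd < r.length && r(jEnd)._1 == lk) jEnd += 1
      while (i < l.length && l(i)._1 == lk) {
        var k = j
        while (k < jEnd) { out += ((lk, l(i)._2, r(k)._2)); k += 1 }
        i += 1
      }
      j = jEnd
    }
  }
  out.toSeq
}
```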

Re: Remove Flume support in 3.0.0?

2018-10-11 Thread Wenchen Fan
Note that, it was deprecated in 2.3.0 already: https://spark.apache.org/docs/2.3.0/streaming-flume-integration.html On Fri, Oct 12, 2018 at 12:46 AM Reynold Xin wrote: > Sounds like a good idea... > > > On Oct 11, 2018, at 6:40 PM, Sean Owen wrote: > > > > Yep, that already exists as Bahir. >

Re: Coalesce behaviour

2018-10-12 Thread Wenchen Fan
8 partitions and 2048 tasks and then a map > phase with 10 partitions and 10 tasks that writes to hdfs? > > every time i try to do this using coalesce the shuffle ends up having 10 > tasks which is unacceptable due to OOM. this makes coalesce somewhat > useless. > > On Wed, Oc

Re: Coalesce behaviour

2018-10-14 Thread Wenchen Fan
olution. > > how can we insert a coalesce as a new map-phase (new job on application > manager with narrow dependency) instead of modifying the existing reduce > phase? i am saying map-phase because it should not introduce a new shuffle: > this is wasteful and unnecessary. > > &

Re: data source api v2 refactoring

2018-10-18 Thread Wenchen Fan
, 2018 at 4:51 AM Thakrar, Jayesh < jthak...@conversantmedia.com> wrote: > Thanks for the info Ryan – very helpful! > > > > *From: *Ryan Blue > *Reply-To: *"rb...@netflix.com" > *Date: *Wednesday, September 19, 2018 at 3:17 PM > *To: *"Thakrar, Jayesh&q

Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-22 Thread Wenchen Fan
t; >> at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210) > >> at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208) > >> at org.apache.spark.deploy.k8s.integrationtest.KubernetesSuite.org > $scalatest$BeforeAndAfter$$super$ru

Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-22 Thread Wenchen Fan
Since GitHub and Jenkins are in a chaotic state, I didn't wait for a green Jenkins QA job for the RC4 commit. We should fail this RC if Jenkins turns out to be broken (very unlikely). I'm adding my own +1; all known blockers are resolved. On Tue, Oct 23, 2018 at 1:42 AM Wenchen Fan wrote: > Please v

[VOTE] SPARK 2.4.0 (RC4)

2018-10-22 Thread Wenchen Fan
Please vote on releasing the following candidate as Apache Spark version 2.4.0. The vote is open until October 26 PST and passes if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes. [ ] +1 Release this package as Apache Spark 2.4.0 [ ] -1 Do not release this package because ... To

Re: About introduce function sum0 to Spark

2018-10-23 Thread Wenchen Fan
This is logically `sum( if(isnull(col), 0, col) )`, right? On Tue, Oct 23, 2018 at 2:58 PM 陶 加涛 wrote: > The name is from Apache Calcite, and it doesn’t matter, we can introduce > our own. > > > > > > --- > > Regards! > > Aron Tao > > > > *From:* Mark Hamstra > *Date:* Tuesday, October 23, 2018, 12:28 >
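
The identity quoted here is easy to check outside Spark. Below is a plain-Scala model of `sum0` with null modeled as `None`; note that in SQL the empty-group case additionally needs `COALESCE(SUM(col), 0)`, since `SUM` over zero rows is NULL regardless of the null handling per row.

```scala
// sum0: a sum that yields 0 instead of null -- nulls (None) count as 0,
// and the empty input also yields 0 rather than null.
def sum0(xs: Seq[Option[Long]]): Long = xs.map(_.getOrElse(0L)).sum
```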

Re: [VOTE] SPARK 2.4.0 (RC3)

2018-10-19 Thread Wenchen Fan
. > > > On Wed, Oct 10, 2018 at 9:47 AM Wenchen Fan wrote: > >> Please vote on releasing the following candidate as Apache Spark version >> 2.4.0. >> >> The vote is open until October 1 PST and passes if a majority +1 PMC >> votes are cast, with >>

Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-23 Thread Wenchen Fan
nd call it the same release. >> >> I'd still like some confirmation that someone can build and pass tests >> with -Pkubernetes, maybe? It actually all passed with the 2.11 build. >> I don't think it's a 2.12 incompatibility, but rather than the K8S >> tests maybe don't quite w

Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-24 Thread Wenchen Fan
e to lower the priority. Always, I respect PMC's > decision. > > I'm sending this email to draw more attention to this bug and to give some > warning on the new feature's limitation to the community. > > Bests, > Dongjoon. > > > On Mon, Oct 22, 2018 at 10:42 AM Wenchen Fan

Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-24 Thread Wenchen Fan
rk-sql> select map(1,2,1,3); // Spark 2.4.0 RC4 > {1:3} > > > hive> select map(1,2,1,3); // Hive 1.2.2 > OK > {1:3} > > > presto> SELECT map_concat(map(array[1],array[2]), map(array[1],array[3])); > // Presto 0.212 > _col0 > --- > {1=3} > > > Bes
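
All three engines quoted above resolve the duplicate key as "last wins". Scala's own Map construction happens to follow the same rule, which makes the semantics easy to demonstrate. (This thread led to the 2.4 behavior being treated as undefined; Spark 3.0 later made duplicate map keys an error by default unless `spark.sql.mapKeyDedupPolicy` is set to `LAST_WIN`.)

```scala
// Inserting the same key twice: the later binding overwrites the earlier,
// mirroring the {1:3} results from Spark 2.4, Hive, and Presto above.
val m = Seq(1 -> 2, 1 -> 3).toMap
```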

Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-24 Thread Wenchen Fan
her words, `map_filter` works like `push-downed filter` to the map in > terms of the output result > while users assumed that `map_filter` works on top of the result of `m`. > > This is a function semantic issue. > > > On Wed, Oct 24, 2018 at 6:06 PM Wenchen Fan wrote: > >&g

[VOTE] SPARK 2.4.0 (RC5)

2018-10-29 Thread Wenchen Fan
Please vote on releasing the following candidate as Apache Spark version 2.4.0. The vote is open until November 1 PST and passes if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes. [ ] +1 Release this package as Apache Spark 2.4.0 [ ] -1 Do not release this package because ... To

Re: python lint is broken on master branch

2018-10-31 Thread Wenchen Fan
be a pycodestyle or flake8 version issue? >>> On Wed, Oct 31, 2018 at 7:43 AM Wenchen Fan wrote: >>> > >>> > The Jenkins job spark-master-lint keeps failing. The error message is >>> > flake8.exceptions.FailedToLoadPlugin: Flake8 failed

python lint is broken on master branch

2018-10-31 Thread Wenchen Fan
The Jenkins job spark-master-lint keeps failing. The error message is flake8.exceptions.FailedToLoadPlugin: Flake8 failed to load plugin "pycodestyle.break_after_binary_operator" due to 'module' object has no attribute 'break_after_binary_operator'. flake8 checks failed. As an example please see

Re: [VOTE] SPARK 2.4.0 (RC5)

2018-11-01 Thread Wenchen Fan
This vote passes! I'll follow up with a formal release announcement soon. +1: Xiao Li (binding) Sean Owen (binding) Gengliang Wang Hyukjin Kwon Wenchen Fan (binding) Ryan Blue Bryan Cutler Marcelo Vanzin (binding) Reynold Xin (binding) Chitral Verma Dilip Biswal Denny Lee Felix Cheung (binding

Re: DataSourceV2 hangouts sync

2018-10-30 Thread Wenchen Fan
Hi all, I spent some time thinking about the roadmap, and came up with an initial list: SPARK-25390: data source V2 API refactoring SPARK-24252: add catalog support SPARK-25531: new write APIs for data source v2 SPARK-25190: better operator pushdown API Streaming rate control API Custom metrics

Re: [VOTE] SPARK 2.4.0 (RC5)

2018-10-30 Thread Wenchen Fan
gt;> >> > >> >> > These items are still targeted to 2.4.0; Xiangrui I assume these >> >> > should just be untargeted now, or resolved? >> >> > SPARK-25584 Document libsvm data source in doc site >> >> > SPARK-25346 Document Spark b

Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-25 Thread Wenchen Fan
a wrong result due to our undefined behavior. I agree we >> should revert the newly added map-oriented functions. In 3.0 release, we >> need to define the behavior of duplicate keys in the data type MAP and fix >> all the related issues that are confusing to our end users. >

Re: What if anything to fix about k8s for the 2.4.0 RC5?

2018-10-25 Thread Wenchen Fan
Any updates on this topic? https://github.com/apache/spark/pull/22827 is merged and 2.4 is unblocked. I'll cut RC5 shortly after the weekend, and it will be great to include the change proposed here. Thanks, Wenchen On Fri, Oct 26, 2018 at 12:55 AM Stavros Kontopoulos <

Re: DataSourceV2 hangouts sync

2018-10-25 Thread Wenchen Fan
Big +1 on this! I live in UTC+8 and I'm available from 8 am, which is 5 pm in the bay area. Hopefully we can coordinate a time that fits everyone. Thanks Wenchen On Fri, Oct 26, 2018 at 7:21 AM Dongjoon Hyun wrote: > +1. Thank you for volunteering, Ryan! > > Bests, > Dongjoon. > > > On Thu,

Re: DataSourceV2 hangouts sync

2018-10-25 Thread Wenchen Fan
people here in the bay > area. > > If that makes it so some people can't attend, we can do the next one > earlier for people in Europe. > > If we go with 5PM PDT, then what day works best for everyone? > > On Thu, Oct 25, 2018 at 5:01 PM Wenchen Fan wrote: > >> Big +1

Re: [CRAN-pretest-archived] CRAN submission SparkR 2.4.0

2018-11-06 Thread Wenchen Fan
Do you mean we should have a 2.4.0 release without CRAN and then do a 2.4.1 immediately? On Wed, Nov 7, 2018 at 12:34 AM Felix Cheung wrote: > Shivaram and I were discussing. > Actually we worked with them before. Another possible approach is to > remove the vignettes eval and all test from the

Re: Make Scala 2.12 as default Scala version in Spark 3.0

2018-11-06 Thread Wenchen Fan
We made Scala 2.11 the default in Spark 2.0, then dropped Scala 2.10 in Spark 2.3. Shall we follow that pattern and drop Scala 2.11 at some point in Spark 3.x? On Wed, Nov 7, 2018 at 8:55 AM Reynold Xin wrote: > Have we deprecated Scala 2.11 already in an existing release? > > On Tue, Nov 6, 2018 at

Re: [ANNOUNCE] Announcing Apache Spark 2.4.0

2018-11-08 Thread Wenchen Fan
resend On Thu, Nov 8, 2018 at 11:02 PM Wenchen Fan wrote: > > > -- Forwarded message - > From: Wenchen Fan > Date: Thu, Nov 8, 2018 at 10:55 PM > Subject: [ANNOUNCE] Announcing Apache Spark 2.4.0 > To: Spark dev list > > > Hi all, > > Apa

Re: Did the 2.4 release email go out?

2018-11-08 Thread Wenchen Fan
Actually I sent it 3 hours ago; however, the mail server seems to have had some problems and my email was lost. Let me resend it. On Fri, Nov 9, 2018 at 1:56 AM Reynold Xin wrote: > The website is already up but I didn’t see any email announcement yet. >

Re: Did the 2.4 release email go out?

2018-11-08 Thread Wenchen Fan
ping On Fri, Nov 9, 2018 at 2:20 AM Wenchen Fan wrote: > Actually I did it 3 hours ago, however the mail server seems to have some > problems and my email was lost. Let me resend it. > > On Fri, Nov 9, 2018 at 1:56 AM Reynold Xin wrote: > >> The website is already up but I

[ANNOUNCE] Announcing Apache Spark 2.4.0

2018-11-08 Thread Wenchen Fan
Hi all, Apache Spark 2.4.0 is the fifth release in the 2.x line. This release adds Barrier Execution Mode for better integration with deep learning frameworks, introduces 30+ built-in and higher-order functions to deal with complex data types more easily, improves the K8s integration, along with

Re: [ANNOUNCE] Announcing Apache Spark 2.4.0

2018-11-08 Thread Wenchen Fan
+ user list On Fri, Nov 9, 2018 at 2:20 AM Wenchen Fan wrote: > resend > > On Thu, Nov 8, 2018 at 11:02 PM Wenchen Fan wrote: > >> >> >> -- Forwarded message - >> From: Wenchen Fan >> Date: Thu, Nov 8, 2018 at 10:55 PM >> S

Re: [CRAN-pretest-archived] CRAN submission SparkR 2.4.0

2018-11-08 Thread Wenchen Fan
_ > > From: Sean Owen > > Sent: Tuesday, November 6, 2018 10:51 AM > > To: Shivaram Venkataraman > > Cc: Felix Cheung; Wenchen Fan; Matei Zaharia; dev > > Subject: Re: [CRAN-pretest-archived] CRAN submission SparkR 2.4.0 > > > > I think the second optio

Re: DataSourceV2 capability API

2018-11-12 Thread Wenchen Fan
I think this works, but there are also other solutions, e.g. mixin traits and runtime exceptions. Assuming the general abstraction is: table -> scan builder -> scan -> batch/batches (see alternative #2 in the doc

which classes/methods are considered as private in Spark?

2018-11-13 Thread Wenchen Fan
Hi all, Recently I updated the MiMa exclusion rules, and found MiMa tracks some private classes/methods unexpectedly. Note that "private" here means we make no guarantee about compatibility: we don't provide documentation, and users take a risk when using them. In the API document,

Re: need assistance debugging a strange build failure...

2018-11-05 Thread Wenchen Fan
Have you tried Maven instead of SBT? This looks like a Java dependency problem, e.g. a wrong version of Avro is picked. On Tue, Nov 6, 2018 at 8:30 AM shane knapp wrote: > i'm really close (for real: really close!) on the ubuntu port... but one > build has been a thorn in my side and i was

Re: which classes/methods are considered as private in Spark?

2018-11-13 Thread Wenchen Fan
es are meant to be private. We should make it consistent between MiMa and doc generator though. On Wed, Nov 14, 2018 at 10:41 AM Marcelo Vanzin wrote: > On Tue, Nov 13, 2018 at 6:26 PM Wenchen Fan wrote: > > Recently I updated the MiMa exclusion rules, and found MiMa tracks some >

Re: welcome a new batch of committers

2018-10-03 Thread Wenchen Fan
Congratulations! On Wed, Oct 3, 2018 at 9:24 PM Madhusudanan Kandasamy < madhusuda...@in.ibm.com> wrote: > Congratulations Ishizaki-san.. > > Thanks, > Madhu. > _ > -Denny Lee wrote: - > To: Dongjin Lee > From: Denny Lee > Date: 10/03/2018 06:31PM > Cc:

[VOTE] SPARK 2.4.0 (RC2)

2018-09-27 Thread Wenchen Fan
Please vote on releasing the following candidate as Apache Spark version 2.4.0. The vote is open until October 1 PST and passes if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes. [ ] +1 Release this package as Apache Spark 2.4.0 [ ] -1 Do not release this package because ... To

Re: [VOTE] SPARK 2.4.0 (RC2)

2018-09-27 Thread Wenchen Fan
I'm adding my own +1, since all the problems mentioned in the RC1 voting email are resolved. And there is no blocker issue for 2.4.0 AFAIK. On Fri, Sep 28, 2018 at 10:59 AM Wenchen Fan wrote: > Please vote on releasing the following candidate as Apache Spark version > 2.4.0. > &

Re: [VOTE] SPARK 2.4.0 (RC2)

2018-09-27 Thread Wenchen Fan
Yes, that's proposed by Sean. This time we should publish a Scala 2.12 build, both in maven and the download page. On Fri, Sep 28, 2018 at 11:34 AM Saisai Shao wrote: > Only "without-hadoop" profile has 2.12 binary, is it expected? > > Thanks > Saisai > > Wenchen

Re: Coalesce behaviour

2018-10-10 Thread Wenchen Fan
Note that RDD partitions and Spark tasks do not always map one-to-one. Assuming `rdd1` has 100 partitions, and `rdd2 = rdd1.coalesce(10)`: then `rdd2` has 10 partitions, and there is no shuffle between `rdd1` and `rdd2`. During scheduling, `rdd1` and `rdd2` are in the same stage, and this stage
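
A sketch of the partition grouping this describes (simplified relative to Spark's `DefaultPartitionCoalescer`, which also considers data locality): `coalesce(10)` packs the 100 parent partitions into 10 groups with no shuffle, so the combined stage runs only 10 tasks, each computing all of its parent partitions.

```scala
// Group parent partition indices into contiguous buckets, one bucket per
// output partition; each output task then computes every parent in its
// bucket, which is why the whole stage runs with only numOutput tasks.
def coalescePartitions(numParents: Int, numOutput: Int): Seq[Seq[Int]] = {
  val groupSize = math.ceil(numParents.toDouble / numOutput).toInt
  (0 until numParents).grouped(groupSize).map(_.toSeq).toSeq
}
```

If the upstream computation should keep its full parallelism, `repartition(10)` (which inserts a shuffle and therefore a stage boundary) is the alternative.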

Re: [VOTE] SPARK 2.4.0 (RC3)

2018-10-10 Thread Wenchen Fan
I'm adding my own +1, since there are no known blocker issues. The correctness issue has been fixed, the streaming Java API problem has been resolved, and we have upgraded to Scala 2.12.7. On Thu, Oct 11, 2018 at 12:46 AM Wenchen Fan wrote: > Please vote on releasing the following candid

[VOTE] SPARK 2.4.0 (RC3)

2018-10-10 Thread Wenchen Fan
Please vote on releasing the following candidate as Apache Spark version 2.4.0. The vote is open until October 1 PST and passes if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes. [ ] +1 Release this package as Apache Spark 2.4.0 [ ] -1 Do not release this package because ... To

Fwd: [VOTE] SPARK 2.4.0 (RC3)

2018-10-11 Thread Wenchen Fan
Forgot to cc dev-list -- Forwarded message - From: Wenchen Fan Date: Thu, Oct 11, 2018 at 10:14 AM Subject: Re: [VOTE] SPARK 2.4.0 (RC3) To: Cc: Sean Owen Ah sorry guys, I just copy-paste the voting email from the last RC and forgot to update the date :P The voting should

Re: Possible bug in DatasourceV2

2018-10-11 Thread Wenchen Fan
source, schema.toAttributes, options, ident, userSpecifiedSchema) >> >> } >> >> >> >> Correct this? >> >> >> >> Or even creating a new create which simply gets the schema as non >> optional? >> >> >> >

Re: BroadcastJoin failed on partitioned parquet table

2018-10-01 Thread Wenchen Fan
I'm not sure if Spark 1.6 is still maintained; can you try a 2.x Spark version and see if the problem still exists? On Sun, Sep 30, 2018 at 4:14 PM 白也诗无敌 <445484...@qq.com> wrote: > Besides, I have tried the ANALYZE statement. It has no use because I need the > single partition but get the total table

Re: Data source V2 in spark 2.4.0

2018-10-01 Thread Wenchen Fan
Ryan, thanks for putting up a list! Generally there were a few adjustments to the data source v2 API in 2.4, and it shouldn't be too hard if you already have a data source v2 implementation and you want to upgrade to Spark 2.4. However, we do want to make some big API changes to data source v2 in the

Re: On Scala 2.12.7

2018-10-01 Thread Wenchen Fan
My major concern is how it will affect end-users if Spark 2.4 is built with a Scala version prior to 2.12.7. Generally I'm hesitant to upgrade the Scala version when we are very close to a release, and the Scala 2.12 build of Spark 2.4 is beta anyway. On Sat, Sep 29, 2018 at 6:46 AM Sean Owen wrote: >

Re: [VOTE] SPARK 2.4.0 (RC2)

2018-10-01 Thread Wenchen Fan
This RC fails because of the correctness bug: SPARK-25538 I'll start a new RC once the fix(https://github.com/apache/spark/pull/22602) is merged. Thanks, Wenchen On Tue, Oct 2, 2018 at 1:21 AM Sean Owen wrote: > Given that this release is probably still 2 weeks from landing, I don't > think

Re: On Scala 2.12.7

2018-10-01 Thread Wenchen Fan
ar two > informed opinions (Darcy and the scala release notes) that it was > relevant. As we have no prior 2.12 support, I guess my feeling was > indeed to get this update in to the first 2.12-supporting release. > > On Mon, Oct 1, 2018 at 9:43 PM Wenchen Fan wrote: > >

Re: [DISCUSS] Syntax for table DDL

2018-10-03 Thread Wenchen Fan
Thank you Ryan for proposing the DDL syntax! I think it's good to follow mainstream databases, and the proposed syntax looks very reasonable. About Hive compatibility, I think it's not that important now, but it's still good if we keep it. Shall we support the Hive syntax as an alternative? It

Re: [VOTE] SPARK 2.4.0 (RC1)

2018-09-21 Thread Wenchen Fan
". Removing 3.0.0 would work in > this case? > > 2018년 9월 21일 (금) 오후 2:29, Wenchen Fan 님이 작성: > >> There is an issue in the merge script, when resolving a ticket, the >> default fixed version is 3.0.0. I guess someone forgot to type the fixed >> version and lead to

Re: [VOTE] SPARK 2.4.0 (RC1)

2018-09-20 Thread Wenchen Fan
>>>>> FYI: SPARK-23200 has been resolved. >>>>> >>>>> On Tue, Sep 18, 2018 at 8:49 AM Felix Cheung < >>>>> felixcheun...@hotmail.com> wrote: >>>>> >>>>>> If we could work on this quickly - it might get on to future RCs. >>>>>> &g

Re: 2.4.0 Blockers, Critical, etc

2018-09-21 Thread Wenchen Fan
Sean thanks for checking them! I made one pass and re-targeted/closed some of them. Most of them are documentation and auditing, do we need to block the release for them? On Fri, Sep 21, 2018 at 6:01 AM Sean Owen wrote: > Because we're into 2.4 release candidates, I thought I'd look at >

Re: SPIP: support decimals with negative scale in decimal operation

2018-09-21 Thread Wenchen Fan
Hi Marco, Thanks for sending it! The problem is clearly explained in this email, but I would not treat it as a SPIP. It proposes a fix for a very tricky bug, and SPIP is usually for new features. Others please correct me if I was wrong. Thanks, Wenchen On Fri, Sep 21, 2018 at 5:47 PM Marco

Re: [Discuss] Datasource v2 support for Kerberos

2018-09-24 Thread Wenchen Fan
> All of the Kerberos options already exist in their own legacy locations though - changing their location could break a lot of systems. We can define the prefix for shared options, and we can strip the prefix when passing these options to the data source. Will this work for your case? On Tue,

Re: [VOTE] SPARK 2.4.0 (RC2)

2018-09-28 Thread Wenchen Fan
t tests at least. > > > On Fri, 28 Sep 2018 10:59:41 +0800 Wenchen Fan > wrote > > Please vote on releasing the following candidate as Apache Spark version > 2.4.0. > > The vote is open until October 1 PST and passes if a majority +1 PMC votes > are cast

Re: Spark Packaging Jenkins

2019-01-05 Thread Wenchen Fan
IIRC there was a change to the release process: we stopped using the shared gpg key on Jenkins and use the personal key of the release manager instead. I'm not sure Jenkins can help test the packaging anymore. BTW, the release manager needs to run the packaging script themselves. If there is a problem, the release

Re: proposal for expanded & consistent timestamp types

2018-12-15 Thread Wenchen Fan
I like this proposal. > We'll get agreement across Spark, Hive, and Impala. Shall we include Parquet and ORC? If they don't support it, it's hard for general query engines like Spark to support it. On Wed, Dec 12, 2018 at 3:36 AM Li Jin wrote: > Of course. I added some comments in the doc. >

Re: [DISCUSS] Default values and data sources

2018-12-20 Thread Wenchen Fan
end-user to >> than having a consistent behavior between native clients and Spark clients. >> This is why I asked if the goal was to just have a common "Spark" behavior >> because I don't think it makes sense if you consider multiple interaction >> points f

Re: Support SqlStreaming in spark

2018-12-21 Thread Wenchen Fan
It would be great to add pure-SQL support to structured streaming. It goes without saying how important SQL support is, but we should make a complete design first. Looking at the Kafka streaming syntax, it has

Re: removing most of the config functions in SQLConf?

2018-12-14 Thread Wenchen Fan
IIRC, the reason we did it is that `SQLConf` was in the SQL core module, so we needed to create methods in `CatalystConf` and have `SQLConf` implement `CatalystConf`. Now the problem has gone away: we moved `SQLConf` to the catalyst module. I think we can remove these methods. On Fri, Dec 14, 2018 at 3:45 PM Reynold

Re: Support SqlStreaming in spark

2018-12-24 Thread Wenchen Fan
Hi JackyLee, Can you put the answers to these questions in the design doc? E.g. if we don't want to support manipulating a streaming query, then is `SELECT STREAM ...` a blocking action? And how can users create a Spark application with multiple streaming jobs? How can users run Structured

Re: DataSourceV2 community sync #3

2018-11-28 Thread Wenchen Fan
Hi Ryan, Thanks for hosting the discussion! I think the table catalog is super useful, but since this is the first time we allow users to extend the catalog, it's better to write down some details, from end-user APIs to internal management. 1. How would end-users register/unregister a catalog with SQL

Re: [SPARK-26160] Make assertNotBucketed call in DataFrameWriter::save optional

2018-12-05 Thread Wenchen Fan
The bucket feature is designed to only work with data sources with table support, and currently the table support is not public yet, which means no external data sources can access bucketing information right now. The bucket feature only works with Spark native file source tables. We are working

Re: Run a specific PySpark test or group of tests

2018-12-05 Thread Wenchen Fan
Great job! Thanks a lot! On Thu, Dec 6, 2018 at 9:39 AM Hyukjin Kwon wrote: > It's merged now and in the developer tools page - > http://spark.apache.org/developer-tools.html#individual-tests > Have some fun with PySpark testing! > > On Wed, Dec 5, 2018 at 4:30 PM, Hyukjin Kwon wrote: > >> Hey all, I

Re: [VOTE] SPARK 2.2.3 (RC1)

2019-01-08 Thread Wenchen Fan
+1 On Wed, Jan 9, 2019 at 3:37 AM DB Tsai wrote: > +1 > > Sincerely, > > DB Tsai > -- > Web: https://www.dbtsai.com > PGP Key ID: 0x5CED8B896A6BDFA0 > > On Tue, Jan 8, 2019 at 11:14 AM Dongjoon Hyun > wrote: > > > > Please vote on

Re: [DISCUSS] Support decimals with negative scale in decimal operation

2019-01-08 Thread Wenchen Fan
PM Wenchen Fan wrote: > I'm OK with it, i.e. fail the write if there are negative-scale decimals > (we need to document it though). We can improve it later in data source v2. > > On Mon, Jan 7, 2019 at 10:09 PM Marco Gaido > wrote: > >> In general we can say that som

Re: DataSourceV2 community sync tonight

2019-01-09 Thread Wenchen Fan
There are 2 remaining problems in the write-side API refactor: 1. how to put the `queryId` parameter in the write API; 2. how to put the streaming OutputMode parameter in the write API. I'd like to

Re: [DISCUSS] Support decimals with negative scale in decimal operation

2019-01-07 Thread Wenchen Fan
; On Mon, Jan 7, 2019 at 15:03, Wenchen Fan wrote: > >> AFAIK the Parquet spec says decimal scale can't be negative. If we want to >> officially support negative-scale decimals, we should clearly define the >> behavior when writing negative-scale decimals to parq

Re: [DISCUSS] Support decimals with negative scale in decimal operation

2019-01-07 Thread Wenchen Fan
, > Marco > > On Mon, Jan 7, 2019 at 05:53, Wenchen Fan wrote: > >> I think we need to do this for backward compatibility, and according to >> the discussion in the doc, the SQL standard allows negative scale. >> >> To do this, I think the P

Re: [DISCUSS] Support decimals with negative scale in decimal operation

2019-01-06 Thread Wenchen Fan
I think we need to do this for backward compatibility, and according to the discussion in the doc, the SQL standard allows negative scale. To do this, I think the PR should also include a doc for the decimal type, like the definition of precision and scale (this one
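
For reference, `java.math.BigDecimal` (which backs Spark's `Decimal`) already permits negative scale, where the represented value is `unscaledValue × 10^(-scale)`:

```scala
// Unscaled value 12 with scale -2 represents 12 * 10^2 = 1200, while only
// two digits of precision are stored -- the situation the thread debates.
val d = new java.math.BigDecimal(java.math.BigInteger.valueOf(12), -2)
```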

Re: [Discuss] Datasource v2 support for Kerberos

2018-09-16 Thread Wenchen Fan
I'm +1 for this proposal: "Extend SessionConfigSupport to support passing specific white-listed configuration values" One goal of data source v2 API is to not depend on any high-level APIs like SparkSession, SQLConf, etc. If users do want to access these high-level APIs, there is a workaround:

Re: [VOTE] SPARK 2.4.0 (RC1)

2018-09-16 Thread Wenchen Fan
oop build of it? > Really, whatever's the easy thing to do. > On Sun, Sep 16, 2018 at 10:28 PM Wenchen Fan wrote: > > > > Ah I missed the Scala 2.12 build. Do you mean we should publish a Scala > 2.12 build this time? Current for Scala 2.11 we have 3 builds: with hadoop > 2.7

Re: [VOTE] SPARK 2.3.2 (RC6)

2018-09-17 Thread Wenchen Fan
+1. All the blocker issues are resolved in 2.3.2 AFAIK. On Tue, Sep 18, 2018 at 9:23 AM Sean Owen wrote: > +1. Licenses and sigs check out as in previous 2.3.x releases. A > build from source with most profiles passed for me. > On Mon, Sep 17, 2018 at 8:17 AM Saisai Shao > wrote: > > > >

Re: how can solve this error

2018-09-17 Thread Wenchen Fan
Have you read https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html ? On Mon, Sep 17, 2018 at 4:46 AM hagersaleh wrote: > I wrote code to connect Kafka with Spark using Python and I run the code on > Jupyter > my code > import os > #os.environ['PYSPARK_SUBMIT_ARGS'] =

[VOTE] SPARK 2.4.0 (RC1)

2018-09-16 Thread Wenchen Fan
Please vote on releasing the following candidate as Apache Spark version 2.4.0. The vote is open until September 20 PST and passes if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes. [ ] +1 Release this package as Apache Spark 2.4.0 [ ] -1 Do not release this package because ...

Re: [VOTE] SPARK 2.4.0 (RC1)

2018-09-16 Thread Wenchen Fan
xtra directory, > and the source release has both binary and source licenses. I'll fix > that. Not strictly necessary to reject the release over those. > > Last, when I check the staging repo I'll get my answer, but, were you > able to build 2.12 artifacts as well? > > On Sun, Sep 16
