Re: Coalesce behaviour

2018-10-10 Thread Wenchen Fan
Note that RDD partitions and Spark tasks do not always map 1-1. Assuming `rdd1` has 100 partitions, and `rdd2 = rdd1.coalesce(10)`, then `rdd2` has 10 partitions, and there is no shuffle between `rdd1` and `rdd2`. During scheduling, `rdd1` and `rdd2` are in the same stage, and this stage
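
A minimal sketch of the coalesce behavior described above (the dataset, app name, and local master are illustrative assumptions, not from the thread):

    import org.apache.spark.sql.SparkSession

    object CoalesceSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().master("local[*]").appName("coalesce-sketch").getOrCreate()
        // 100 parent partitions, narrowed to 10 without a shuffle:
        val rdd1 = spark.sparkContext.parallelize(1 to 1000, numSlices = 100)
        val rdd2 = rdd1.coalesce(10)      // shuffle defaults to false, so rdd1 and rdd2 stay in one stage
        println(rdd2.getNumPartitions)    // 10
        // With shuffle = true, coalesce inserts a shuffle and hence a stage boundary instead:
        println(rdd1.coalesce(10, shuffle = true).getNumPartitions)
        spark.stop()
      }
    }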

Re: [DISCUSS] Syntax for table DDL

2018-10-03 Thread Wenchen Fan
Thank you Ryan for proposing the DDL syntax! I think it's good to follow mainstream databases, and the proposed syntax looks very reasonable. About Hive compatibility, I think it's not that important now, but it's still good if we keep it. Shall we support the Hive syntax as an alternative? It

Re: welcome a new batch of committers

2018-10-03 Thread Wenchen Fan
Congratulations! On Wed, Oct 3, 2018 at 9:24 PM Madhusudanan Kandasamy <madhusuda...@in.ibm.com> wrote: > Congratulations Ishizaki-san.. > > Thanks, > Madhu. > Denny Lee wrote: > To: Dongjin Lee > From: Denny Lee > Date: 10/03/2018 06:31PM > Cc:

Re: On Scala 2.12.7

2018-10-01 Thread Wenchen Fan
ar two > informed opinions (Darcy and the scala release notes) that it was > relevant. As we have no prior 2.12 support, I guess my feeling was > indeed to get this update in to the first 2.12-supporting release. > > On Mon, Oct 1, 2018 at 9:43 PM Wenchen Fan wrote: > >

Re: On Scala 2.12.7

2018-10-01 Thread Wenchen Fan
My major concern is how it will affect end-users if Spark 2.4 is built with Scala versions prior to 2.12.7. Generally I'm hesitant to upgrade the Scala version when we are very close to a release, and the Scala 2.12 build of Spark 2.4 is beta anyway. On Sat, Sep 29, 2018 at 6:46 AM Sean Owen wrote: >

Re: BroadcastJoin failed on partitioned parquet table

2018-10-01 Thread Wenchen Fan
I'm not sure if Spark 1.6 is still maintained; can you try a 2.x Spark version and see if the problem still exists? On Sun, Sep 30, 2018 at 4:14 PM 白也诗无敌 <445484...@qq.com> wrote: > Besides, I have tried the ANALYZE statement. It is no use because I need the > single partition but get the total table

Re: Data source V2 in spark 2.4.0

2018-10-01 Thread Wenchen Fan
Ryan, thanks for putting up a list! Generally there are a few tunings to the data source v2 API in 2.4, and it shouldn't be too hard to upgrade to Spark 2.4 if you already have a data source v2 implementation. However, we do want to do some big API changes for data source v2 in the

Re: [VOTE] SPARK 2.4.0 (RC2)

2018-10-01 Thread Wenchen Fan
This RC fails because of the correctness bug SPARK-25538. I'll start a new RC once the fix (https://github.com/apache/spark/pull/22602) is merged. Thanks, Wenchen On Tue, Oct 2, 2018 at 1:21 AM Sean Owen wrote: > Given that this release is probably still 2 weeks from landing, I don't > think

Re: [VOTE] SPARK 2.4.0 (RC2)

2018-09-28 Thread Wenchen Fan
t tests at least. > > On Fri, 28 Sep 2018 10:59:41 +0800, Wenchen Fan wrote: > > Please vote on releasing the following candidate as Apache Spark version > 2.4.0. > > The vote is open until October 1 PST and passes if a majority +1 PMC votes > are cast

Re: [VOTE] SPARK 2.4.0 (RC2)

2018-09-27 Thread Wenchen Fan
Yes, that was proposed by Sean. This time we should publish a Scala 2.12 build, both in Maven and on the download page. On Fri, Sep 28, 2018 at 11:34 AM Saisai Shao wrote: > Only the "without-hadoop" profile has a 2.12 binary, is that expected? > > Thanks > Saisai > > Wenchen

Re: [VOTE] SPARK 2.4.0 (RC2)

2018-09-27 Thread Wenchen Fan
I'm adding my own +1, since all the problems mentioned in the RC1 voting email are resolved, and there is no blocker issue for 2.4.0 AFAIK. On Fri, Sep 28, 2018 at 10:59 AM Wenchen Fan wrote: > Please vote on releasing the following candidate as Apache Spark version > 2.4.0.

[VOTE] SPARK 2.4.0 (RC2)

2018-09-27 Thread Wenchen Fan
Please vote on releasing the following candidate as Apache Spark version 2.4.0. The vote is open until October 1 PST and passes if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes. [ ] +1 Release this package as Apache Spark 2.4.0 [ ] -1 Do not release this package because ... To

Re: [Discuss] Datasource v2 support for Kerberos

2018-09-24 Thread Wenchen Fan
> All of the Kerberos options already exist in their own legacy locations though - changing their location could break a lot of systems. We can define the prefix for shared options, and we can strip the prefix when passing these options to the data source. Will this work for your case? On Tue,
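
A hypothetical sketch of the prefix-stripping idea proposed above; the prefix string and helper function are illustrative assumptions, not a Spark API:

    // Options under a shared, white-listed prefix are stripped before being
    // handed to the data source, so they arrive under their legacy names.
    val sharedPrefix = "spark.datasource.shared."   // assumed prefix, for illustration

    def optionsForSource(allOptions: Map[String, String]): Map[String, String] =
      allOptions.collect {
        case (key, value) if key.startsWith(sharedPrefix) =>
          key.stripPrefix(sharedPrefix) -> value
      }

    // optionsForSource(Map("spark.datasource.shared.keytab" -> "/etc/security/x.keytab"))
    //   == Map("keytab" -> "/etc/security/x.keytab")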

Re: SPIP: support decimals with negative scale in decimal operation

2018-09-21 Thread Wenchen Fan
Hi Marco, Thanks for sending it! The problem is clearly explained in this email, but I would not treat it as a SPIP. It proposes a fix for a very tricky bug, and SPIP is usually for new features. Others please correct me if I was wrong. Thanks, Wenchen On Fri, Sep 21, 2018 at 5:47 PM Marco

Re: [VOTE] SPARK 2.4.0 (RC1)

2018-09-21 Thread Wenchen Fan
". Removing 3.0.0 would work in > this case? > > 2018년 9월 21일 (금) 오후 2:29, Wenchen Fan 님이 작성: > >> There is an issue in the merge script, when resolving a ticket, the >> default fixed version is 3.0.0. I guess someone forgot to type the fixed >> version and lead to

Re: 2.4.0 Blockers, Critical, etc

2018-09-21 Thread Wenchen Fan
Sean, thanks for checking them! I made one pass and re-targeted/closed some of them. Most of them are documentation and auditing; do we need to block the release for them? On Fri, Sep 21, 2018 at 6:01 AM Sean Owen wrote: > Because we're into 2.4 release candidates, I thought I'd look at >

Re: [VOTE] SPARK 2.4.0 (RC1)

2018-09-20 Thread Wenchen Fan
t;>>>> FYI: SPARK-23200 has been resolved. >>>>> >>>>> On Tue, Sep 18, 2018 at 8:49 AM Felix Cheung < >>>>> felixcheun...@hotmail.com> wrote: >>>>> >>>>>> If we could work on this quickly - it might get on to future RCs. >>>>>> &g

Re: [VOTE] SPARK 2.4.0 (RC1)

2018-09-18 Thread Wenchen Fan
r case it seems related to your signature. > failureMessage: No public key: Key with id: () was not able to be located on http://gpg-keyserver.de/. Upload your public key and try the operation again.

Re: [VOTE] SPARK 2.3.2 (RC6)

2018-09-17 Thread Wenchen Fan
+1. All the blocker issues are resolved in 2.3.2 AFAIK. On Tue, Sep 18, 2018 at 9:23 AM Sean Owen wrote: > +1. Licenses and sigs check out as in previous 2.3.x releases. A > build from source with most profiles passed for me. > On Mon, Sep 17, 2018 at 8:17 AM Saisai Shao > wrote: >

Re: how can solve this error

2018-09-17 Thread Wenchen Fan
Have you read https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html? On Mon, Sep 17, 2018 at 4:46 AM hagersaleh wrote: > I wrote code to connect Kafka with Spark using Python and I run the code on > Jupyter > my code: > import os > #os.environ['PYSPARK_SUBMIT_ARGS'] =

Re: [VOTE] SPARK 2.4.0 (RC1)

2018-09-16 Thread Wenchen Fan
oop build of it? > Really, whatever's the easy thing to do. > On Sun, Sep 16, 2018 at 10:28 PM Wenchen Fan wrote: > > Ah I missed the Scala 2.12 build. Do you mean we should publish a Scala > 2.12 build this time? Currently for Scala 2.11 we have 3 builds: with hadoop > 2.7

Re: [VOTE] SPARK 2.4.0 (RC1)

2018-09-16 Thread Wenchen Fan
xtra directory, > and the source release has both binary and source licenses. I'll fix > that. Not strictly necessary to reject the release over those. > > Last, when I check the staging repo I'll get my answer, but, were you > able to build 2.12 artifacts as well? > > On Sun, Sep 16

[VOTE] SPARK 2.4.0 (RC1)

2018-09-16 Thread Wenchen Fan
Please vote on releasing the following candidate as Apache Spark version 2.4.0. The vote is open until September 20 PST and passes if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes. [ ] +1 Release this package as Apache Spark 2.4.0 [ ] -1 Do not release this package because ...

Re: [Discuss] Datasource v2 support for Kerberos

2018-09-16 Thread Wenchen Fan
I'm +1 for this proposal: "Extend SessionConfigSupport to support passing specific white-listed configuration values". One goal of the data source v2 API is to not depend on any high-level APIs like SparkSession, SQLConf, etc. If users do want to access these high-level APIs, there is a workaround:
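
A rough sketch of what this looks like from a source implementer's side, based on the SessionConfigSupport interface as it stood around Spark 2.3/2.4 (the class name and prefix below are illustrative):

    import org.apache.spark.sql.sources.v2.{DataSourceV2, SessionConfigSupport}

    class MySource extends DataSourceV2 with SessionConfigSupport {
      // Session configs named spark.datasource.mysource.* are forwarded to this
      // source as plain options (e.g. spark.datasource.mysource.url arrives as
      // "url"), without the source touching SparkSession or SQLConf directly.
      override def keyPrefix(): String = "mysource"
    }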

Re: DataSourceWriter V2 Api questions

2018-09-10 Thread Wenchen Fan
Regardless of the API, to use Spark to write data atomically, it requires: 1. Writing data distributively, with a central coordinator at the Spark driver. 2. The distributed writers are not guaranteed to run together at the same time. (This can be relaxed if we can extend the barrier scheduling feature.) 3.

Re: Branch 2.4 is cut

2018-09-10 Thread Wenchen Fan
he next release, like > updating the write path. I think it would be better not to change this only > to include another major change in the next release. > > On Sun, Sep 9, 2018 at 10:41 PM Wenchen Fan wrote: > >> Strictly speaking, data source v2 is always half-finished until w

Re: Branch 2.4 is cut

2018-09-10 Thread Wenchen Fan
xpected to change further due redesigns before 3.0 so don't see much > value releasing it in 2.4. > > On Sun, 9 Sep 2018 at 22:42, Wenchen Fan wrote: > >> Strictly speaking, data source v2 is always half-finished until we mark >> it as stable. We need some small milestones to move

Re: Branch 2.4 is cut

2018-09-09 Thread Wenchen Fan
Strictly speaking, data source v2 is always half-finished until we mark it as stable. We need some small milestones to move forward step by step. The redesign also happens in an incremental way. SPARK-24882 mostly focuses on the "RDD" part of the API: the separation of reader factory and input

Re: Branch 2.4 is cut

2018-09-07 Thread Wenchen Fan
prevent any >> regression. >> >> >> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/ >> >> Bests, >> Dongjoon. >> >> >> On Thu, Sep 6, 2018 at 6:56 AM Wenchen Fan wrote: >> >>> Good news! I'll try and upd

Re: data source api v2 refactoring

2018-09-07 Thread Wenchen Fan
th what you propose, >> assuming that I understand it correctly. >> >> rb >> >> On Tue, Sep 4, 2018 at 8:42 PM Wenchen Fan wrote: >> >>> I'm switching to my another Gmail account, let's see if it still gets >>> dropped this time. >>>

Re: Branch 2.4 is cut

2018-09-06 Thread Wenchen Fan
> Let's try also producing a 2.12 build with this release. The machinery > should be there in the release scripts, but let me know if something fails > while running the release for 2.12. > > On Thu, Sep 6, 2018 at 12:32 AM Wenchen Fan wrote: > >> Hi all, >> >

Re: Datasource v2 Select Into support

2018-09-06 Thread Wenchen Fan
Data source v2 catalog support (table/view) is still in progress. There are several threads on the dev list discussing it; please join the discussion if you are interested. Thanks for trying! On Thu, Sep 6, 2018 at 7:23 PM Ross Lawley wrote: > Hi, > > I hope this is the correct mailing list. I've

Branch 2.4 is cut

2018-09-06 Thread Wenchen Fan
Hi all, I've cut branch-2.4 since all the major blockers are resolved. If there are no objections, I'll shortly follow up with an RC to get the QA started in parallel. Committers, please only merge PRs to branch-2.4 that are bug fixes, performance regression fixes, documentation changes, or test suites

Re: code freeze and branch cut for Apache Spark 2.4

2018-09-05 Thread Wenchen Fan
The repartition correctness bug fix is merged. The Scala 2.12 PRs mentioned in this thread are all merged. The Kryo upgrade is done. I'm going to cut the branch 2.4 since all the major blockers are now resolved. Thanks, Wenchen On Sun, Sep 2, 2018 at 12:07 AM sadhen wrote: >

Re: Select top (100) percent equivalent in spark

2018-09-04 Thread Wenchen Fan
+ Liang-Chi and Herman. I think this is a common requirement: getting the top N records. For now we guarantee it via the `TakeOrderedAndProject` operator. However, this operator may not be used if the spark.sql.execution.topKSortFallbackThreshold config has a small value. Shall we reconsider
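
A minimal sketch of the top-N pattern under discussion (the data and local master are illustrative): ORDER BY plus LIMIT is planned as TakeOrderedAndProject when the limit does not exceed spark.sql.execution.topKSortFallbackThreshold, and falls back to a full sort followed by a limit otherwise:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq((3, "a"), (1, "b"), (5, "c"), (2, "d")).toDF("score", "name")
    val top2 = df.orderBy($"score".desc).limit(2)
    top2.explain()   // look for TakeOrderedAndProject in the physical plan
    top2.show()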

Re: data source api v2 refactoring

2018-09-04 Thread Wenchen Fan
ase it was dropped. > > -- Forwarded message - > From: Wenchen Fan > Date: Mon, Sep 3, 2018 at 6:16 AM > Subject: Re: data source api v2 refactoring > To: > Cc: Ryan Blue , Reynold Xin , < > dev@spark.apache.org> > > > Hi Mridul, > >

Re: code freeze and branch cut for Apache Spark 2.4

2018-08-29 Thread Wenchen Fan
t matter. >> >> Tom >> >> On Wednesday, August 8, 2018, 9:06:43 AM CDT, Imran Rashid >> wrote: >> >> On Tue, Aug 7, 2018 at 8:39 AM, Wenchen Fan wrote: >> >> SPARK-23243 <https://issues.apache.org/jira/browse/SPARK-23243>

Re: [DISCUSS] SparkR support on k8s back-end for Spark 2.4

2018-08-15 Thread Wenchen Fan
I'm also happy to see we have R support on k8s for Spark 2.4. I'll do the manual testing for it if we don't want to upgrade the OS now. If the Python support is also merged in this way, I think we can merge the R support PR too? On Thu, Aug 16, 2018 at 7:23 AM shane knapp wrote: > >> What is

[SPARK-24771] Upgrade AVRO version from 1.7.7 to 1.8

2018-08-14 Thread Wenchen Fan
Hi all, We've upgraded Avro from 1.7 to 1.8, to support date/timestamp/decimal types in the newly added Avro data source in the coming Spark 2.4, and also to make Avro work with Parquet. Since Avro 1.8 is not binary compatible with Avro 1.7 (see https://issues.apache.org/jira/browse/AVRO-1502),

Re: [VOTE] SPARK 2.3.2 (RC5)

2018-08-14 Thread Wenchen Fan
SPARK-25051 is resolved; can we start a new RC? SPARK-16406 is an improvement; generally we should not backport it. On Wed, Aug 15, 2018 at 5:16 AM Sean Owen wrote: > (We wouldn't consider lack of an improvement to block a maintenance > release. It's reasonable to raise this elsewhere as a big

Re: code freeze and branch cut for Apache Spark 2.4

2018-08-10 Thread Wenchen Fan
ocumentation of the issue so that users are less likely to stumble into >> this unaware; but really we need to fix at least the most common cases of >> this bug. Backports to maintenance branches are also probably in order. >> >> On Wed, Aug 8, 2018 at 7:06 AM Imran Rashid >> wr

Re: code freeze and branch cut for Apache Spark 2.4

2018-08-07 Thread Wenchen Fan
Some updates on the JIRA tickets that we want to resolve before Spark 2.4 (green = merged, orange = in progress, red = likely to miss): SPARK-24374: Support Barrier Execution Mode in Apache Spark. The core functionality is finished, but we still need

Re: Set up Scala 2.12 test build in Jenkins

2018-08-05 Thread Wenchen Fan
It seems to me that the closure cleaner fails to clean up something. The failed test case defines a serializable class inside the test case, and the class doesn't refer to anything in the outer class. Ideally it can be serialized after cleaning up the closure. This is somehow a very weird way to

Re: [DISCUSS] Multiple catalog support

2018-08-01 Thread Wenchen Fan
he “feature” that you can write a different > schema to a path-based JSON table without needing to run an “alter table” > on it to update the schema. If this is behavior we want to preserve (and I > think it is) then we need to clearly state what that behavior is. > > Second, I think that we

DISCUSS: SPARK-24882 data source v2 API improvement

2018-07-31 Thread Wenchen Fan
Hi all, Data source v2 has been out for a while. During this release, we migrated most of the streaming sources to the v2 API (SPARK-22911), started to migrate file sources (SPARK-23817), started to

Re: Writing file

2018-07-31 Thread Wenchen Fan
It depends on how you deploy Spark. The writer just writes data to your specified path (HDFS or a local path), but the writer runs on executors. If you deploy Spark in local mode, i.e. executor and driver are together, then you will see the output file on the driver node. If you deploy Spark

Re: [DISCUSS] Multiple catalog support

2018-07-31 Thread Wenchen Fan
Here is my interpretation of your proposal, please correct me if something is wrong. End users can read/write a data source with its name and some options. e.g. `df.read.format("xyz").option(...).load`. This is currently the only end-user API for data source v2, and is widely used by Spark users

Re: [DISCUSS] Adaptive execution in Spark SQL

2018-07-31 Thread Wenchen Fan
Hi Carson and Yuanjian, Thanks for contributing to this project and sharing the production use cases! I believe the adaptive execution will be a very important feature of Spark SQL and will definitely benefit a lot of users. I went through the design docs and the high-level design totally makes

Re: Data source V2

2018-07-31 Thread Wenchen Fan
Hi assaf, Thanks for trying data source v2! Data source v2 is still evolving (we marked all the data source v2 interfaces as @Evolving), and we've already made a lot of API changes in this release (some renaming, switching to InternalRow, etc.). So I'd not encourage people to use data source v2 in

Re: code freeze and branch cut for Apache Spark 2.4

2018-07-30 Thread Wenchen Fan
I went through the open JIRA tickets and here is a list that we should consider for Spark 2.4: *High Priority*: SPARK-24374: Support Barrier Execution Mode in Apache Spark. This one is critical to the Spark ecosystem for deep learning. It only

Re: [VOTE] SPARK 2.3.2 (RC3)

2018-07-30 Thread Wenchen Fan
Another two correctness bug fixes were merged to 2.3 today: https://issues.apache.org/jira/browse/SPARK-24934 https://issues.apache.org/jira/browse/SPARK-24957 On Mon, Jul 30, 2018 at 1:19 PM Xiao Li wrote: > Sounds good to me. Thanks! Today, we merged another correctness fix >

Re: code freeze and branch cut for Apache Spark 2.4

2018-07-29 Thread Wenchen Fan
, I am close but need some more time. > We could get it into 2.4. > > Stavros > > On Fri, Jul 27, 2018 at 9:27 AM, Wenchen Fan wrote: > >> This seems fine to me. >> >> BTW Ryan Blue and I are working on some data source v2 stuff and >> hopefully we can

Re: [DISCUSS] Multiple catalog support

2018-07-27 Thread Wenchen Fan
I think the major issue is, now users have 2 ways to create a specific data source table: 1) use the USING syntax. 2) create the table in the specific catalog. It can be super confusing if users create a Cassandra table in the HBase data source. Also we can't drop the USING syntax, as data source v1

Re: code freeze and branch cut for Apache Spark 2.4

2018-07-27 Thread Wenchen Fan
This seems fine to me. BTW Ryan Blue and I are working on some data source v2 stuff and hopefully we can get more things done with one more week. Thanks, Wenchen On Thu, Jul 26, 2018 at 1:14 PM Xingbo Jiang wrote: > Xiangrui and I are leading an effort to implement a highly desirable >

Re: [VOTE] SPIP: Standardize SQL logical plans

2018-07-17 Thread Wenchen Fan
+1 (binding). I think this is clearer to both users and developers, compared to the existing one, which only supports append/overwrite and doesn't work well with tables in data sources (like JDBC tables). On Wed, Jul 18, 2018 at 2:06 AM Ryan Blue wrote: > +1 (not binding) > > On Tue, Jul 17,

Re: [DISCUSS] SPIP: Standardize SQL logical plans

2018-07-17 Thread Wenchen Fan
ved before we can start getting the API in? If so, what do you think > needs to be decided to get it ready? > > Thanks! > > rb > > On Wed, Jul 11, 2018 at 8:24 PM Wenchen Fan wrote: > >> Hi Ryan, >> >> Great job on this! Shall we call a vote for the plan stan

Re: [VOTE] SPARK 2.3.2 (RC3)

2018-07-15 Thread Wenchen Fan
+1. The Spark 2.3 regressions I'm aware of are all fixed. On Sun, Jul 15, 2018 at 4:09 PM Saisai Shao wrote: > Please vote on releasing the following candidate as Apache Spark version > 2.3.2. > > The vote is open until July 20 PST and passes if a majority +1 PMC votes > are cast, with a

Re: [DISCUSS] SPIP: Standardize SQL logical plans

2018-07-11 Thread Wenchen Fan
Hi Ryan, Great job on this! Shall we call a vote for the plan standardization SPIP? I think this is a good idea and we should do it. Notes: We definitely need new user-facing APIs to produce these new logical plans like DeleteData. But we need a design doc for these new APIs after the SPIP

Re: [VOTE] SPARK 2.3.2 (RC1)

2018-07-10 Thread Wenchen Fan
+1 On Wed, Jul 11, 2018 at 1:31 AM John Zhuge wrote: > +1 > > On Sun, Jul 8, 2018 at 1:30 AM Saisai Shao wrote: > >> Please vote on releasing the following candidate as Apache Spark version >> 2.3.2. >> >> The vote is open until July 11th PST and passes if a majority +1 PMC >> votes are cast,

Re: Spark data source resiliency

2018-07-03 Thread Wenchen Fan
I believe you are using something like `local[8]` as your Spark master, which can't retry tasks. Please try `local[8, 3]`, which can retry failed tasks 3 times. On Tue, Jul 3, 2018 at 2:42 PM assaf.mendelson wrote: > That is what I expected, however, I did a very simple test (using println >
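
A minimal sketch of the difference between the two master strings (the app name and thread count are illustrative):

    import org.apache.spark.sql.SparkSession

    // local[8]: 8 worker threads, tasks are NOT retried on failure.
    // local[8, 3]: 8 worker threads; tasks can be retried (maxFailures = 3,
    // per the message above), so transient reader failures can recover.
    val spark = SparkSession.builder()
      .master("local[8, 3]")
      .appName("retry-sketch")
      .getOrCreate()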

Re: Spark data source resiliency

2018-07-03 Thread Wenchen Fan
A failure in the data reader results in a task failure, and Spark will retry the task for you (IIRC it retries 3 times before failing the job). Can you check your Spark log and see if the task fails consistently? On Tue, Jul 3, 2018 at 2:17 PM assaf.mendelson wrote: > Hi All, > > I implemented a

Re: why BroadcastHashJoinExec is not implemented with outputOrdering?

2018-06-28 Thread Wenchen Fan
ed on >> join keys, we can output the ordered join keys as output ordering. >> >> Chrysan Wu 吴晓菊 >> >> 2018-06-28 22:53 GMT+08:00 Wenchen Fan: >> >>> SortMergeJoin only reports ordering of the join

Re: why BroadcastHashJoinExec is not implemented with outputOrdering?

2018-06-28 Thread Wenchen Fan
SortMergeJoin only reports ordering of the join keys, not the output ordering of any child. It seems reasonable to me that broadcast join should respect the output ordering of the children. Feel free to submit a PR to fix it, thanks! On Thu, Jun 28, 2018 at 10:07 PM 吴晓菊 wrote: > Why we cannot

Re: Time for 2.3.2?

2018-06-28 Thread Wenchen Fan
(Thu) at 11:40 AM wrote: > >> +1. SPARK-24589 / SPARK-24552 are kinda nasty and we should get fixes >> for those out. >> >> (Those are what delayed 2.2.2 and 2.1.3, for those watching...) >> >> On Wed, Jun 27, 2018 at 7:59 PM, Wenchen Fan wrote: >> > Hi all

Time for 2.3.2?

2018-06-27 Thread Wenchen Fan
Hi all, Spark 2.3.1 was released just a while ago, but unfortunately we discovered and fixed some critical issues afterward. *SPARK-24495: SortMergeJoin may produce wrong results.* This is a serious correctness bug, and it is easy to hit: have duplicated join keys in the left table, e.g. `WHERE

Re: [VOTE] Spark 2.2.2 (RC2)

2018-06-27 Thread Wenchen Fan
+1 On Thu, Jun 28, 2018 at 10:19 AM zhenya Sun wrote: > +1 > > On Jun 28, 2018, at 10:15 AM, Hyukjin Kwon wrote: > +1 > > On Thu, Jun 28, 2018 at 8:42 AM, Sean Owen wrote: >> +1 from me too. >> >> On Wed, Jun 27, 2018 at 3:31 PM Tom Graves >> wrote: >> >>> Please vote on releasing the following candidate as

Re: Time for 2.1.3

2018-06-15 Thread Wenchen Fan
+1 On Fri, Jun 15, 2018 at 7:10 AM, Tom Graves wrote: > +1 for doing a 2.1.3 release. > > Tom > > On Wednesday, June 13, 2018, 7:28:26 AM CDT, Marco Gaido < > marcogaid...@gmail.com> wrote: > > > Yes, you're right Herman. Sorry, my bad. > > Thanks. > Marco > > 2018-06-13 14:01 GMT+02:00 Herman

Re: [VOTE] [SPARK-24374] SPIP: Support Barrier Scheduling in Apache Spark

2018-06-04 Thread Wenchen Fan
+1 On Tue, Jun 5, 2018 at 1:20 AM, Henry Robinson wrote: > +1 > > (I hope there will be a fuller design document to review, since the SPIP > is really light on details). > > On 4 June 2018 at 10:17, Joseph Bradley wrote: > >> +1 >> >> On Sun, Jun 3, 2018 at 9:59 AM, Weichen Xu >> wrote: >>

Re: [VOTE] Spark 2.3.1 (RC4)

2018-06-02 Thread Wenchen Fan
+1 On Sun, Jun 3, 2018 at 6:54 AM, Marcelo Vanzin wrote: > If you're building your own Spark, definitely try the hadoop-cloud > profile. Then you don't even need to pull anything at runtime, > everything is already packaged with Spark. > > On Fri, Jun 1, 2018 at 6:51 PM, Nicholas Chammas >

Re: [VOTE] Spark 2.3.1 (RC2)

2018-05-23 Thread Wenchen Fan
We found a critical bug in tungsten that can lead to silent data corruption: https://github.com/apache/spark/pull/21311 This is a long-standing bug that starts with Spark 2.0 (not a regression), but since we are going to release 2.3.1, I think it's a good chance to include this fix. We will also

Re: [VOTE] Spark 2.3.1 (RC1)

2018-05-17 Thread Wenchen Fan
SPARK-22371 turns an error into a warning, so it won't break any existing workloads. Let me backport it to 2.3 so users won't hit this problem in the new release. On Fri, May 18, 2018 at 5:59 AM, Imran Rashid wrote: > I just found

Re: Preventing predicate pushdown

2018-05-15 Thread Wenchen Fan
Applying predicate pushdown is an optimization, and it makes sense to provide configs to turn off certain optimizations. Feel free to create a JIRA. Thanks, Wenchen On Tue, May 15, 2018 at 8:33 PM, Tomasz Gawęda wrote: > Hi, > > while working with the JDBC datasource I saw

Re: Custom datasource as a wrapper for existing ones?

2018-05-03 Thread Wenchen Fan
Hi Jakub, Yea I think a data source would be the most elegant way to solve your problem. Unfortunately in Spark 2.3 the only stable data source API is data source v1, which can't be used to implement high-performance data sources. Data source v2 is still a preview version in Spark 2.3 and may change

Re: AccumulatorV2 vs AccumulableParam (V1)

2018-05-03 Thread Wenchen Fan
Hi Sergey, Thanks for your valuable feedback! For 1: yea this is definitely a bug and I have sent a PR to fix it. For 2: I have left my comments on the JIRA ticket. For 3: I don't quite understand it, can you give some concrete examples? For 4: yea this is a problem, but I think it's not a big

Re: [discuss][data source v2] remove type parameter in DataReader/WriterFactory

2018-04-18 Thread Wenchen Fan
> interfaces entirely and simplify what implementers must provide and would > reduce confusion over what to do. >> >> Using InternalRow doesn’t cover the case where we want to produce >> ColumnarBatch instead, so what you’re proposing might still be a good >> id

Re: [discuss][data source v2] remove type parameter in DataReader/WriterFactory

2018-04-17 Thread Wenchen Fan
’ in the interface. Then > specifically how are we going to express the capability of the given reader of > its supported format(s), or specific support for each of “real-time data in > row format, and history data in columnar format”?

[discuss][data source v2] remove type parameter in DataReader/WriterFactory

2018-04-15 Thread Wenchen Fan
Hi all, I'd like to propose an API change to the data source v2. One design goal of data source v2 is API type safety. The FileFormat API is a bad example: it asks the implementation to return InternalRow even if it's actually ColumnarBatch. In data source v2 we add a type parameter to
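
A rough, simplified sketch of the type parameter being discussed; these traits are illustrative stand-ins for the Java interfaces in Spark 2.3, not the exact definitions:

    // The factory's type parameter makes the produced data type part of the
    // contract, so a reader declares up front whether it emits rows or columnar
    // batches, unlike FileFormat which claims InternalRow even when the real
    // payload is ColumnarBatch.
    trait DataReader[T] extends AutoCloseable {
      def next(): Boolean
      def get(): T
    }

    trait DataReaderFactory[T] extends Serializable {
      def createDataReader(): DataReader[T]
    }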

Welcome Zhenhua Wang as a Spark committer

2018-04-01 Thread Wenchen Fan
Hi all, The Spark PMC recently added Zhenhua Wang as a committer on the project. Zhenhua is the major contributor of the CBO project, and has been contributing across several areas of Spark for a while, focusing especially on analyzer, optimizer in Spark SQL. Please join me in welcoming Zhenhua!

Re: DataSourceV2 write input requirements

2018-03-27 Thread Wenchen Fan
10:05 PM, Ted Yu <yuzhih...@gmail.com> wrote: >> >>> Hmm. Ryan seems to be right. >>> >>> Looking at sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/SupportsReportPartitioning.java: >>> >>> import org.apache.spark.sql.so

Re: DataSourceV2 write input requirements

2018-03-26 Thread Wenchen Fan
yuzhih...@gmail.com> wrote: > >> Hmm. Ryan seems to be right. >> >> Looking at sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/SupportsReportPartitioning.java: >> >> import org.apache.spark.sql.sources.v2.reader.partitioning.Partitioning;

Re: DataSourceV2 write input requirements

2018-03-26 Thread Wenchen Fan
Actually, clustering is already supported; please take a look at SupportsReportPartitioning. Ordering is not proposed yet; it might be similar to what Ryan proposed. On Mon, Mar 26, 2018 at 6:11 PM, Ted Yu wrote: > Interesting. > > Should requiredClustering return a Set of

Re: Any reason for not exposing internalCreateDataFrame or isStreaming beyond sql package?

2018-03-22 Thread Wenchen Fan
ing" to > true. > > > > But I should confess that I don't know the source code very well, so will > appreciate if you can point me to any other pointers/examples please. > > > > *From: *Wenchen Fan <cloud0...@gmail.com> > *Date: *Thursday, March 22, 2018 at 2:5

Re: Any reason for not exposing internalCreateDataFrame or isStreaming beyond sql package?

2018-03-22 Thread Wenchen Fan
org.apache.spark.sql.execution.streaming.Source is for internal use only. The official stream data source API is the data source v2 API. You can take a look at the Spark built-in streaming data sources as examples. Note: data source v2 is still experimental, you may need to update your code in a

Re: Welcoming some new committers

2018-03-02 Thread Wenchen Fan
Congratulations to everyone and welcome! On Sat, Mar 3, 2018 at 7:26 AM, Cody Koeninger wrote: > Congrats to the new committers, and I appreciate the vote of confidence. > > On Fri, Mar 2, 2018 at 4:41 PM, Matei Zaharia > wrote: > > Hi everyone, > >

Re: SparkContext - parameter for RDD, but not serializable, why?

2018-02-28 Thread Wenchen Fan
rtition = split.asInstanceOf[MyDataSourcePartition] val partitionId = myDataSourcePartition.index val rows = myDataSourcePartition.rowCount val partitionData = 1 to rows map(r => Row(s"Partition: ${partitionId}, row ${r} of ${rows}"))

Re: SparkContext - parameter for RDD, but not serializable, why?

2018-02-28 Thread Wenchen Fan
My understanding: an RDD is also a driver-side object, like SparkContext; it works as a handle to your distributed data on the cluster. However, `RDD.compute` (which defines how to produce the data for each partition) needs to be executed on the remote nodes. It's more convenient to make RDD serializable, and
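
A minimal sketch of that split: the RDD object itself is created on the driver, but compute() runs inside tasks on executors (the class below is an illustrative assumption, not from the thread):

    import org.apache.spark.{Partition, SparkContext, TaskContext}
    import org.apache.spark.rdd.RDD

    class ConstantRDD(sc: SparkContext, numParts: Int, value: Int)
        extends RDD[Int](sc, Nil) {

      // Called on the driver while planning the job.
      override protected def getPartitions: Array[Partition] =
        (0 until numParts).map { i =>
          new Partition { override def index: Int = i }
        }.toArray

      // Called on remote executors; must not touch the (non-serializable) SparkContext.
      override def compute(split: Partition, context: TaskContext): Iterator[Int] =
        Iterator.single(value)
    }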

Re: [VOTE] Spark 2.3.0 (RC5)

2018-02-22 Thread Wenchen Fan
+1 On Fri, Feb 23, 2018 at 6:23 AM, Sameer Agarwal wrote: > Please vote on releasing the following candidate as Apache Spark version > 2.3.0. The vote is open until Tuesday February 27, 2018 at 8:00:00 am UTC > and passes if a majority of at least 3 PMC +1 votes are cast. >

Re: [VOTE] Spark 2.3.0 (RC4)

2018-02-21 Thread Wenchen Fan
>>> SPARK-23470: the All Jobs page may be too slow and cause "read timeout" when there are lots of jobs and stages. This is one of the most i

Re: [VOTE] Spark 2.3.0 (RC4)

2018-02-21 Thread Wenchen Fan
Shixiong (Ryan) Zhu <shixi...@databricks.com> wrote: >>> I'm -1 because of the UI regression >>> https://issues.apache.org/jira/browse/SPARK-23470:

Re: [VOTE] Spark 2.3.0 (RC4)

2018-02-19 Thread Wenchen Fan
+1 On Tue, Feb 20, 2018 at 12:53 PM, Reynold Xin wrote: > +1 > > On Feb 20, 2018, 5:51 PM +1300, Sameer Agarwal wrote: > > this file shouldn't be included? https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc4-bin/spark-parent_2.11.iml

Re: There is no space for new record

2018-02-09 Thread Wenchen Fan
It should be fixed by https://github.com/apache/spark/pull/20561 soon. On Fri, Feb 9, 2018 at 6:16 PM, Wenchen Fan <cloud0...@gmail.com> wrote: > This has been reported before: http://apache-spark-developers-list.1001551.n3.nabble.com/java-lang-IllegalStateException-Th

Re: There is no space for new record

2018-02-09 Thread Wenchen Fan
This has been reported before: http://apache-spark-developers-list.1001551.n3.nabble.com/java-lang-IllegalStateException-There-is-no-space-for-new-record-tc20108.html I think we may have a real bug here, but we need a reproduction. Can you provide one? Thanks! On Fri, Feb 9, 2018 at 5:59 PM,

Re: SQL logical plans and DataSourceV2 (was: data source v2 online meetup)

2018-02-05 Thread Wenchen Fan
I think many advanced Spark users already have custom catalyst rules to deal with the query plan directly, so it makes a lot of sense to standardize the logical plan. However, instead of exploring possible operations ourselves, I think we should follow the SQL standard. ReplaceTable, RTAS:

Re: [SQL] [Suggestion] Add top() to Dataset

2018-01-30 Thread Wenchen Fan
You can use `Dataset.limit`, which returns a new `Dataset` instead of an Array. Then you can transform it and still get the top-k optimization from Spark. On Wed, Jan 31, 2018 at 3:39 PM, Yacine Mazari wrote: > Thanks for the quick reply and explanation @rxin. > > So if one
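
A minimal sketch of the suggestion (assumes an existing SparkSession named `spark`): limit() keeps the result as a lazy Dataset, so later transformations still run under the top-k plan, unlike head(n)/take(n), which pull an Array to the driver:

    import spark.implicits._

    val ds = Seq(5, 3, 9, 1, 7).toDS()
    val top3 = ds.orderBy($"value".desc).limit(3)   // still a Dataset[Int], nothing computed yet
    val doubled = top3.map(_ * 2)                   // runs on the cluster, not the driver
    doubled.show()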

Re: ClassNotFoundException while running unit test with local cluster mode in Intellij IDEA

2018-01-30 Thread Wenchen Fan
You can run tests in SBT and attach your IDEA to it for debugging, which works for me. On Tue, Jan 30, 2018 at 7:44 PM, wuyi wrote: > Dear devs, > I've got stuck on this issue for several days, and I need help now. > At first, I ran into an old issue, which is the

Re: Why Dataset.hint uses logicalPlan (= analyzed not planWithBarrier)?

2018-01-26 Thread Wenchen Fan
Looks like we missed this one. Feel free to submit a patch; thanks for finding it! On Fri, Jan 26, 2018 at 3:39 PM, Jacek Laskowski wrote: > Hi, > > I've just noticed that every time Dataset.hint is used it triggers > execution of logical commands, their unions and hint

Re: [VOTE] Spark 2.3.0 (RC2)

2018-01-22 Thread Wenchen Fan
+1 All the blocking issues are resolved (AFAIK), and important data source v2 features have been merged. On Tue, Jan 23, 2018 at 9:09 AM, Marcelo Vanzin wrote: > +0 > > Signatures check out. Code compiles, although I see the errors in [1] > when untarring the source

Re: Broken SQL Visualization?

2018-01-15 Thread Wenchen Fan
Hi, thanks for reporting, can you include the steps to reproduce this bug? On Tue, Jan 16, 2018 at 7:07 AM, Ted Yu wrote: > Did you include any picture ? > > Looks like the picture didn't go thru. > > Please use third party site. > > Thanks > > Original message

Re: Distinct on Map data type -- SPARK-19893

2018-01-12 Thread Wenchen Fan
Actually Spark 2.1.0 doesn't work for your case; it may give you a wrong result... We are still working on adding this feature, but before that, we should fail earlier instead of returning a wrong result. On Sat, Jan 13, 2018 at 11:02 AM, ckhari4u wrote: > I see SPARK-19893 is

Re: Why some queries use logical.stats while others analyzed.stats?

2018-01-06 Thread Wenchen Fan
l/basicLogicalOperators.scala#L895 > > Regards, > Jacek Laskowski > > https://about.me/JacekLaskowski > Mastering Spark SQL https://bit.ly/mastering-spark-sql > Spark Structured Streaming https://bit.ly/spark-structured-streaming > Mastering Kafka Stream
