Re: DataSourceV2 : Transactional Write support

2019-08-05 Thread Wenchen Fan
I agree with the temp table approach. One idea is: maybe we only need one temp table, and each task writes to this temp table. At the end we read the data from the temp table and write it to the target table. AFAIK JDBC can handle concurrent table writing very well, and it's better than creating

Re: [Discuss] Follow ANSI SQL on table insertion

2019-08-05 Thread Wenchen Fan
defaults, given that there are a lot > of varying opinions on this thread. > > On Mon, Aug 5, 2019 at 12:49 AM Wenchen Fan wrote: > >> I think we need to clarify one thing before further discussion: the >> proposal is for the next release but not a long term solutio

Re: [Discuss] Follow ANSI SQL on table insertion

2019-08-05 Thread Wenchen Fan
, because > different users have different levels of tolerance for different kinds of > errors. I’d expect these sorts of configurations to be set up at an > infrastructure level, e.g. to maintain consistent standards throughout a > whole organization. > > > > *From: *Gengli

Re: [Discuss] Follow ANSI SQL on table insertion

2019-07-30 Thread Wenchen Fan
nal > effect of the null decision would be different depending on the insertion > target if the target has different behaviors for null. > > On Mon, Jul 29, 2019 at 5:26 AM Wenchen Fan wrote: > >> > I'm a big -1 on null values for invalid casts. >> >> This is why

Re: [Discuss] Follow ANSI SQL on table insertion

2019-07-29 Thread Wenchen Fan
behavior >>> anyway. >>> Eventually, most sources are supposed to be migrated to DataSourceV2 V2. >>> I think we can discuss and make a decision now. >>> >>> > Fixing the silent corruption by adding a runtime exception is not a >>> good opti

Re: Re: How to force sorted merge join to broadcast join

2019-07-29 Thread Wenchen Fan
You can try EXPLAIN COST query and see if it works for you. On Mon, Jul 29, 2019 at 5:34 PM Rubén Berenguel wrote: > I think there is no way of doing that (at least don't remember one right > now). The closer I remember now, is you can run the SQL "ANALYZE TABLE > table_name COMPUTE STATISTIC"
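A sketch of what the suggestion amounts to (table names are hypothetical); note that EXPLAIN COST only shows useful estimates once statistics have been collected:

```scala
// Collect table-level statistics first so the optimizer has row/size estimates
// (table names here are invented for illustration).
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS")
spark.sql("ANALYZE TABLE dim COMPUTE STATISTICS")

// EXPLAIN COST prints the optimized plan annotated with size estimates, which
// shows whether a relation falls under spark.sql.autoBroadcastJoinThreshold.
spark.sql("EXPLAIN COST SELECT * FROM sales s JOIN dim d ON s.key = d.key")
  .show(truncate = false)
```

A broadcast can also be forced directly with the broadcast() hint, e.g. `salesDF.join(broadcast(dimDF), "key")`.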

Re: [Discuss] Follow ANSI SQL on table insertion

2019-07-26 Thread Wenchen Fan
mals and fail simple insert > statements. We already came up with two alternatives to fix that problem in > the DSv2 sync and I think it is a better idea to go with one of those > instead of “fixing” Spark in a way that will corrupt data or cause runtime > failures. > > On T

Re: [Discuss] Follow ANSI SQL on table insertion

2019-07-25 Thread Wenchen Fan
I have heard about many complaints about the old table insertion behavior. Blindly casting everything will leak the user mistake to a late stage of the data pipeline, and make it very hard to debug. When a user writes string values to an int column, it's probably a mistake and the columns are

Re: DataSourceV2 sync notes - 10 July 2019

2019-07-23 Thread Wenchen Fan
Hi Ryan, Thanks for summarizing and sending out the meeting notes! Unfortunately, I missed the last sync, but the topics are really interesting, especially the stats integration. The ideal solution I can think of is to refactor the optimizer/planner and move all the stats-based optimization to

Re: Disabling `Merge Commits` from GitHub Merge Button

2019-07-02 Thread Wenchen Fan
+1 as well On Tue, Jul 2, 2019 at 12:13 PM Dongjoon Hyun wrote: > Thank you so much for the replies, Reynold, Sean, Takeshi, Hyukjin! > > Bests, > Dongjoon. > > On Mon, Jul 1, 2019 at 6:00 PM Hyukjin Kwon wrote: > >> +1 >> >> 2019년 7월 2일 (화) 오전 9:39, Takeshi Yamamuro 님이 작성: >> >>> I'm also

Re: API for SparkContext ?

2019-06-30 Thread Wenchen Fan
You can call `SparkContext#addSparkListener` with a listener that implements `onApplicationEnd`. On Tue, May 14, 2019 at 1:51 AM Nasrulla Khan Haris wrote: > HI All, > > > > Is there a API for sparkContext where we can add our custom code before > stopping sparkcontext ? > > Appreciate your
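The reply's suggestion can be sketched as follows; the body of the hook is a placeholder for the asker's custom code:

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerApplicationEnd}

// Register a listener whose onApplicationEnd hook fires just before the
// application shuts down; put custom teardown code there.
sc.addSparkListener(new SparkListener {
  override def onApplicationEnd(applicationEnd: SparkListenerApplicationEnd): Unit = {
    println(s"Application ending at ${applicationEnd.time}; running cleanup")
  }
})
```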

Re: Access to live data of cached dataFrame

2019-05-21 Thread Wenchen Fan
When you cache a dataframe, you actually cache a logical plan. That's why re-creating the dataframe doesn't work: Spark finds out the logical plan is cached and picks the cached data. You need to uncache the dataframe, or go back to the SQL way: df.createTempView("abc") spark.table("abc").cache()
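A sketch of the two behaviors described above (the input path is hypothetical):

```scala
// Caching a DataFrame caches its logical plan, so a re-created DataFrame with
// the same plan resolves to the cached (possibly stale) data.
val df = spark.read.parquet("/data/events")   // hypothetical path
df.cache()
df.count()        // materializes the cache

// Option 1: drop the cached plan to see live data again
df.unpersist()

// Option 2: cache through a named view instead, as suggested in the reply
df.createTempView("abc")
spark.table("abc").cache()
```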

Re: RDD object Out of scope.

2019-05-21 Thread Wenchen Fan
RDD is kind of a pointer to the actual data. Unless it's cached, we don't need to clean up the RDD. On Tue, May 21, 2019 at 1:48 PM Nasrulla Khan Haris wrote: > HI Spark developers, > > > > Can someone point out the code where RDD objects go out of scope ?. I > found the contextcleaner >

Re: [VOTE] Release Apache Spark 2.4.3

2019-05-06 Thread Wenchen Fan
+1. The Scala version problem has been resolved, which is the main motivation of 2.4.3. On Mon, May 6, 2019 at 12:38 AM Felix Cheung wrote: > I ran basic tests on R, r-hub etc. LGTM. > > +1 (limited - I didn’t get to run other usual tests) > > -- > *From:* Sean Owen

Re: [VOTE] Release Apache Spark 2.4.2

2019-04-29 Thread Wenchen Fan
-port and re-roll the > RC. That said I think we did / had to already drop the ability to build <= > 2.3 from the master release script already. > > On Sun, Apr 28, 2019 at 9:25 PM Wenchen Fan wrote: > >> > ... by using the release script of Spark 2.4 branch >> &

Re: DataFrameWriter does not adjust spark.sql.session.timeZone offset while writing orc files

2019-04-24 Thread Wenchen Fan
ml) > > On Wed, Apr 24, 2019 at 6:24 PM Wenchen Fan wrote: > >> How did you read/write the timestamp value from/to ORC file? >> >> On Wed, Apr 24, 2019 at 6:30 PM Shubham Chaurasia < >> shubh.chaura...@gmail.com> wrote: >> >>> Hi Al

Re: DataFrameWriter does not adjust spark.sql.session.timeZone offset while writing orc files

2019-04-24 Thread Wenchen Fan
How did you read/write the timestamp value from/to ORC file? On Wed, Apr 24, 2019 at 6:30 PM Shubham Chaurasia wrote: > Hi All, > > Consider the following(spark v2.4.0): > > Basically I change values of `spark.sql.session.timeZone` and perform an > orc write. Here are 3 samples:- > > 1) >

Re: [VOTE] Release Apache Spark 2.4.2

2019-04-21 Thread Wenchen Fan
this one for sure as it's easy to > overlook with all the pages being updated per release. > > On Thu, Apr 18, 2019 at 9:51 PM Wenchen Fan wrote: > > > > Please vote on releasing the following candidate as Apache Spark version > 2.4.2. > > > > The vote is open until

Re: Spark 2.4.2

2019-04-18 Thread Wenchen Fan
endencies but > bringing them into the process breaks users code easily) > > > -- > *From:* Michael Heuer > *Sent:* Thursday, April 18, 2019 6:41 AM > *To:* Reynold Xin > *Cc:* Sean Owen; Michael Armbrust; Ryan Blue; Spark Dev List; Wenchen > Fan;

[VOTE] Release Apache Spark 2.4.2

2019-04-18 Thread Wenchen Fan
Please vote on releasing the following candidate as Apache Spark version 2.4.2. The vote is open until April 23 PST and passes if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes. [ ] +1 Release this package as Apache Spark 2.4.2 [ ] -1 Do not release this package because ... To

Re: Spark 2.4.2

2019-04-17 Thread Wenchen Fan
I volunteer to be the release manager for 2.4.2, as I was also going to propose 2.4.2 because of the reverting of SPARK-25250. Are there any other ongoing bug fixes we want to include in 2.4.2? If not, I'd like to start the release process today (CST). Thanks, Wenchen On Thu, Apr 18, 2019 at 3:44

Re: DataSourceV2 exceptions

2019-04-08 Thread Wenchen Fan
Like `RDD.map`, you can throw whatever exceptions and they will be propagated to the driver side and fail the Spark job. On Mon, Apr 8, 2019 at 3:10 PM Andrew Melo wrote: > Hello, > > I'm developing a (java) DataSourceV2 to read a columnar fileformat > popular in a number of physical sciences
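The RDD.map analogy in the reply can be made concrete; the failing condition below is invented for illustration:

```scala
// Exceptions thrown inside task code -- an RDD.map closure or a DataSourceV2
// reader alike -- are captured on the executor and re-raised on the driver,
// failing the job.
val rdd = sc.parallelize(Seq(1, 2, 3)).map { i =>
  if (i == 2) throw new java.io.IOException("corrupt record")  // invented error
  i
}
// rdd.collect() on the driver now fails with a SparkException whose cause is
// the IOException thrown on the executor.
```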

Re: [VOTE] Release Apache Spark 2.4.1 (RC9)

2019-03-27 Thread Wenchen Fan
+1, all the known blockers are resolved. Thanks for driving this! On Wed, Mar 27, 2019 at 11:31 AM DB Tsai wrote: > Please vote on releasing the following candidate as Apache Spark version > 2.4.1. > > The vote is open until March 30 PST and passes if a majority +1 PMC votes > are cast, with >

Re: [DISCUSS] Spark Columnar Processing

2019-03-25 Thread Wenchen Fan
Do you have some initial perf numbers? It seems fine to me to remain row-based inside Spark with whole-stage-codegen, and convert rows to columnar batches when communicating with external systems. On Mon, Mar 25, 2019 at 1:05 PM Bobby Evans wrote: > This thread is to discuss adding in support

Re: Cross Join

2019-03-22 Thread Wenchen Fan
Spark 2.0 is EOL. Can you try 2.3 or 2.4? On Thu, Mar 21, 2019 at 10:23 AM asma zgolli wrote: > Hello , > > I need to cross my data and i'm executing a cross join on two dataframes . > > C = A.crossJoin(B) > A has 50 records > B has 5 records > > the result im getting with spark 2.0 is a

Re: Manually reading parquet files.

2019-03-22 Thread Wenchen Fan
Try `val enconder = RowEncoder(df.schema).resolveAndBind()` ? On Thu, Mar 21, 2019 at 5:39 PM Long, Andrew wrote: > Thanks a ton for the help! > > > > Is there a standardized way of converting the internal row to rows? > > > > I’ve tried this but im getting an exception > > > > *val *enconder =
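A hedged sketch of the suggestion as it applied to Spark 2.4 (the deserialization call changed in later versions):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.encoders.RowEncoder

// resolveAndBind() binds the encoder to the schema's own attributes so it can
// deserialize InternalRow values back into external Row objects.
val encoder = RowEncoder(df.schema).resolveAndBind()

// In Spark 2.x the deserializing call was encoder.fromRow; in Spark 3.x it
// became encoder.createDeserializer().apply(internalRow).
def toExternal(internal: InternalRow): Row = encoder.fromRow(internal)
```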

Re: spark sql occer error

2019-03-22 Thread Wenchen Fan
Did you include the whole error message? On Fri, Mar 22, 2019 at 12:45 AM 563280193 <563280...@qq.com> wrote: > Hi , > I ran a spark sql like this: > > *select imei,tag, product_id,* > * sum(case when succ1>=1 then 1 else 0 end) as succ,* > * sum(case when fail1>=1 and succ1=0 then 1

Re: [build system] jenkins wedged again, rebooting master node

2019-03-15 Thread Wenchen Fan
cool, thanks! On Sat, Mar 16, 2019 at 1:08 AM shane knapp wrote: > well, that box rebooted in record time! we're back up and building. > > and as always, i'll keep a close eye on things today... jenkins usually > works great, until it doesn't. :\ > > On Fri, Mar 15, 2019 at 9:52 AM shane

Re: [VOTE] Release Apache Spark 2.4.1 (RC6)

2019-03-10 Thread Wenchen Fan
Which version of Parquet has this bug? Maybe we can downgrade it. On Mon, Mar 11, 2019 at 10:34 AM Mark Hamstra wrote: > It worked in 2.3. We broke it with 2.4.0 and were informed of that > regression late in the 2.4.0 release process. Since we didn't fix it before > the 2.4.0 release, it

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-01 Thread Wenchen Fan
+1 On Sat, Mar 2, 2019 at 6:11 AM Yinan Li wrote: > +1 > > On Fri, Mar 1, 2019 at 12:37 PM Tom Graves > wrote: > >> +1 for the SPIP. >> >> Tom >> >> On Friday, March 1, 2019, 8:14:43 AM CST, Xingbo Jiang < >> jiangxb1...@gmail.com> wrote: >> >> >> Hi all, >> >> I want to call for a vote of

Re: [VOTE] SPIP: Spark API for Table Metadata

2019-03-01 Thread Wenchen Fan
+1, thanks for making it clear that this SPIP focuses on high-level direction! On Sat, Mar 2, 2019 at 9:35 AM Reynold Xin wrote: > Thanks Ryan. +1. > > > > > On Fri, Mar 01, 2019 at 5:33 PM, Ryan Blue wrote: > >> Actually, I went ahead and removed the confusing section. There is no >> public

Re: [DISCUSS] Spark 3.0 and DataSourceV2

2019-02-27 Thread Wenchen Fan
it in 3.0. >> > >> > On Tue, Feb 26, 2019 at 5:10 PM Matt Cheah wrote: >> > Will that then require an API break down the line? Do we save that for >> Spark 4? >> > >> > >> > >> > >> > -Matt Cheah? >> > >> >

Re: Moving forward with the timestamp proposal

2019-02-20 Thread Wenchen Fan
I think this is the right direction to go, but I'm wondering how Spark can support these new types if the underlying data sources (like Parquet files) do not support them yet. I took a quick look at the new doc for file formats, but not sure what's the proposal. Are we going to implement these new

Re: [build system] Jenkins stopped working

2019-02-19 Thread Wenchen Fan
Thanks Shane! On Wed, Feb 20, 2019 at 6:48 AM shane knapp wrote: > alright, i increased the httpd and proxy timeouts and kicked apache. i'll > keep an eye on things, but as of right now we're happily building. > > On Tue, Feb 19, 2019 at 2:25 PM shane knapp wrote: > >> aand i had to issue

Re: [VOTE] SPIP: Identifiers for multi-catalog Spark

2019-02-18 Thread Wenchen Fan
+1 On Tue, Feb 19, 2019 at 10:50 AM Ryan Blue wrote: > Hi everyone, > > It looks like there is consensus on the proposal, so I'd like to start a > vote thread on the SPIP for identifiers in multi-catalog Spark. > > The doc is available here: >

Re: [DISCUSS] SPIP: Identifiers for multi-catalog Spark

2019-02-18 Thread Wenchen Fan
I think this is the right direction to go. Shall we move forward with a vote and detailed designs? On Mon, Feb 4, 2019 at 9:57 AM Ryan Blue wrote: > Hi everyone, > > This is a follow-up to the "Identifiers with multi-catalog support" > discussion thread. I've taken the proposal I posted to that

Re: [ANNOUNCE] Announcing Apache Spark 2.3.3

2019-02-18 Thread Wenchen Fan
great job! On Mon, Feb 18, 2019 at 4:24 PM Hyukjin Kwon wrote: > Yay! Good job Takeshi! > > On Mon, 18 Feb 2019, 14:47 Takeshi Yamamuro >> We are happy to announce the availability of Spark 2.3.3! >> >> Apache Spark 2.3.3 is a maintenance release, based on the branch-2.3 >> maintenance branch

Re: Compatibility on build-in DateTime functions with Hive/Presto

2019-02-17 Thread Wenchen Fan
in the Spark SQL example, `year("1912")` means, first cast "1912" to date type, and then call the "year" function. in the Postgres example, `date_part('year',TIMESTAMP '2017')` means, get a timestamp literal, and call the "date_part" function. Can you try date literal in Postgres? On Mon, Feb
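The implicit-cast behavior described above can be checked directly; the Postgres equivalent shown in the comment is an assumption based on the reply:

```scala
// Spark SQL implicitly casts the string literal to DATE first ('1912' is read
// as 1912-01-01), then applies year() to the resulting date.
spark.sql("SELECT year('1912')").show()

// Postgres' date_part('year', TIMESTAMP '2017') takes an explicit literal;
// the comparable Postgres query would be:
//   SELECT date_part('year', DATE '1912-01-01')
```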

Re: Time to cut an Apache 2.4.1 release?

2019-02-14 Thread Wenchen Fan
Do you know which bug ORC 1.5.2 introduced? Or is it because Hive uses a legacy version of ORC which has a bug? On Thu, Feb 14, 2019 at 2:35 PM Darcy Shen wrote: > > We found that ORC table created by Spark 2.4 failed to be read by Hive > 2.1.1. > > > spark-sql -e 'CREATE TABLE tmp.orcTable2

Re: Time to cut an Apache 2.4.1 release?

2019-02-12 Thread Wenchen Fan
+1 for 2.4.1 On Tue, Feb 12, 2019 at 7:55 PM Hyukjin Kwon wrote: > +1 for 2.4.1 > > 2019년 2월 12일 (화) 오후 4:56, Dongjin Lee 님이 작성: > >> > SPARK-23539 is a non-trivial improvement, so probably would not be >> back-ported to 2.4.x. >> >> Got it. It seems reasonable. >> >> Committers: >> >> Please

Re: Tungsten Memory Consumer

2019-02-11 Thread Wenchen Fan
What do you mean by "Tungsten Consumer"? On Fri, Feb 8, 2019 at 6:11 PM Jack Kolokasis wrote: > Hello all, > I am studying about Tungsten Project and I am wondering when Spark > creates a Tungsten consumer. While I am running some applications, I see > that Spark creates Tungsten Consumer

Re: DataSourceV2 community sync tonight

2019-01-09 Thread Wenchen Fan
There are 2 remaining problems in the write side API refactor : 1. how to put the `queryId` parameter in the write API 2. how to put the streaming OutputMode parameter in the write API I'd like to

Re: [DISCUSS] Support decimals with negative scale in decimal operation

2019-01-08 Thread Wenchen Fan
PM Wenchen Fan wrote: > I'm OK with it, i.e. fail the write if there are negative-scale decimals > (we need to document it though). We can improve it later in data source v2. > > On Mon, Jan 7, 2019 at 10:09 PM Marco Gaido > wrote: > >> In general we can say that som

Re: [VOTE] SPARK 2.2.3 (RC1)

2019-01-08 Thread Wenchen Fan
+1 On Wed, Jan 9, 2019 at 3:37 AM DB Tsai wrote: > +1 > > Sincerely, > > DB Tsai > -- > Web: https://www.dbtsai.com > PGP Key ID: 0x5CED8B896A6BDFA0 > > On Tue, Jan 8, 2019 at 11:14 AM Dongjoon Hyun > wrote: > > > > Please vote on

Re: [DISCUSS] Support decimals with negative scale in decimal operation

2019-01-07 Thread Wenchen Fan
; Il giorno lun 7 gen 2019 alle ore 15:03 Wenchen Fan > ha scritto: > >> AFAIK parquet spec says decimal scale can't be negative. If we want to >> officially support negative-scale decimal, we should clearly define the >> behavior when writing negative-scale decimals to parq

Re: [DISCUSS] Support decimals with negative scale in decimal operation

2019-01-07 Thread Wenchen Fan
, > Marco > > Il giorno lun 7 gen 2019 alle ore 05:53 Wenchen Fan > ha scritto: > >> I think we need to do this for backward compatibility, and according to >> the discussion in the doc, SQL standard allows negative scale. >> >> To do this, I think the P

Re: [DISCUSS] Support decimals with negative scale in decimal operation

2019-01-06 Thread Wenchen Fan
I think we need to do this for backward compatibility, and according to the discussion in the doc, SQL standard allows negative scale. To do this, I think the PR should also include a doc for the decimal type, like the definition of precision and scale(this one

Re: Spark Packaging Jenkins

2019-01-05 Thread Wenchen Fan
IIRC there was a change to the release process: we stopped using the shared gpg key on Jenkins and now use the personal key of the release manager. I'm not sure Jenkins can help test the package anymore. BTW the release manager needs to run the packaging script himself. If there is a problem, the release

Re: Support SqlStreaming in spark

2018-12-24 Thread Wenchen Fan
Hi JackyLee, Can you put the answers to these questions in the design doc? e.g. if we don't want to support manipulating a streaming query, then is `SELECT STREAM ...` a blocking action? And how users can create a Spark application with multiple streaming jobs? How users can run Structured

Re: Support SqlStreaming in spark

2018-12-21 Thread Wenchen Fan
It will be great to add pure-SQL support to structured streaming. I think it goes without saying how important SQL support is, but we should make a complete design first. Looking at the Kafka streaming syntax, it has

Re: [DISCUSS] Default values and data sources

2018-12-20 Thread Wenchen Fan
end-user to >> than having a consistent behavior between native clients and Spark clients. >> This is why I asked if the goal was to just have a common "Spark" behavior >> because I don't think it makes sense if you consider multiple interaction >> points f

Re: [DISCUSS] Default values and data sources

2018-12-19 Thread Wenchen Fan
syntax. Isn't the right thing (for > option 1) to pass the default through to the underlying data source? > Sources that don't support defaults would throw an exception. > > On Wed, Dec 19, 2018 at 6:29 PM Wenchen Fan wrote: > >> The standard ADD COLUMN SQL syntax is: ALTER

Re: Noisy spark-website notifications

2018-12-19 Thread Wenchen Fan
+1, at least it should only send one email when a PR is merged. On Thu, Dec 20, 2018 at 10:58 AM Nicholas Chammas < nicholas.cham...@gmail.com> wrote: > Can we somehow disable these new email alerts coming through for the Spark > website repo? > > On Wed, Dec 19, 2018 at 8:25 PM GitBox wrote: >

Re: [DISCUSS] Default values and data sources

2018-12-19 Thread Wenchen Fan
or not missing columns are > OK and let the Datasource deal with the missing data based on it's > underlying storage. > > On Wed, Dec 19, 2018 at 8:23 AM Wenchen Fan wrote: > >> I agree that we should not rewrite existing parquet files when a new >> column is added, but we should a

Re: [DISCUSS] Default values and data sources

2018-12-19 Thread Wenchen Fan
I agree that we should not rewrite existing parquet files when a new column is added, but we should also try out best to make the behavior same as RDBMS/SQL standard. 1. it should be the user who decides the default value of a column, by CREATE TABLE, or ALTER TABLE ADD COLUMN, or ALTER TABLE

Re: proposal for expanded & consistent timestamp types

2018-12-15 Thread Wenchen Fan
I like this proposal. > We'll get agreement across Spark, Hive, and Impala. Shall we include Parquet and ORC? If they don't support it, it's hard for general query engines like Spark to support it. On Wed, Dec 12, 2018 at 3:36 AM Li Jin wrote: > Of course. I added some comments in the doc. >

Re: removing most of the config functions in SQLConf?

2018-12-14 Thread Wenchen Fan
IIRC, the reason we did it is: `SQLConf` was in SQL core module. So we need to create methods in `CatalystConf`, and `SQLConf` implements `CatalystConf`. Now the problem has gone: we moved `SQLConf` to catalyst module. I think we can remove these methods. On Fri, Dec 14, 2018 at 3:45 PM Reynold

Re: Pushdown in DataSourceV2 question

2018-12-09 Thread Wenchen Fan
expressions/functions can be expensive and I do think Spark should trust data source and not re-apply pushed filters. If data source lies, many things can go wrong... On Sun, Dec 9, 2018 at 8:17 PM Jörn Franke wrote: > Well even if it has to apply it again, if pushdown is activated then it >

Re: Run a specific PySpark test or group of tests

2018-12-05 Thread Wenchen Fan
great job! thanks a lot! On Thu, Dec 6, 2018 at 9:39 AM Hyukjin Kwon wrote: > It's merged now and in developer tools page - > http://spark.apache.org/developer-tools.html#individual-tests > Have some func with PySpark testing! > > 2018년 12월 5일 (수) 오후 4:30, Hyukjin Kwon 님이 작성: > >> Hey all, I

Re: [SPARK-26160] Make assertNotBucketed call in DataFrameWriter::save optional

2018-12-05 Thread Wenchen Fan
The bucket feature is designed to only work with data sources with table support, and currently the table support is not public yet, which means no external data sources can access bucketing information right now. The bucket feature only works with Spark native file source tables. We are working

Re: DataSourceV2 community sync #3

2018-11-28 Thread Wenchen Fan
Hi Ryan, Thanks for hosting the discussion! I think the table catalog is super useful, but since this is the first time we allow users to extend catalog, it's better to write down some details from end-user APIs to internal management. 1. How would end-users register/unregister catalog with SQL

Re: which classes/methods are considered as private in Spark?

2018-11-13 Thread Wenchen Fan
es are meant to be private. We should make it consistent between MiMa and doc generator though. On Wed, Nov 14, 2018 at 10:41 AM Marcelo Vanzin wrote: > On Tue, Nov 13, 2018 at 6:26 PM Wenchen Fan wrote: > > Recently I updated the MiMa exclusion rules, and found MiMa tracks some >

which classes/methods are considered as private in Spark?

2018-11-13 Thread Wenchen Fan
Hi all, Recently I updated the MiMa exclusion rules, and found MiMa tracks some private classes/methods unexpectedly. Note that, "private" here means that, we have no guarantee about compatibility. We don't provide documents and users need to take the risk when using them. In the API document,

Re: DataSourceV2 capability API

2018-11-12 Thread Wenchen Fan
I think this works, but there are also other solutions, e.g. mixin traits and runtime exceptions. Assuming the general abstraction is: table -> scan builder -> scan -> batch/batches (see alternative #2 in the doc

Re: [CRAN-pretest-archived] CRAN submission SparkR 2.4.0

2018-11-08 Thread Wenchen Fan
_ > > From: Sean Owen > > Sent: Tuesday, November 6, 2018 10:51 AM > > To: Shivaram Venkataraman > > Cc: Felix Cheung; Wenchen Fan; Matei Zaharia; dev > > Subject: Re: [CRAN-pretest-archived] CRAN submission SparkR 2.4.0 > > > > I think the second optio

[ANNOUNCE] Announcing Apache Spark 2.4.0

2018-11-08 Thread Wenchen Fan
Hi all, Apache Spark 2.4.0 is the fifth release in the 2.x line. This release adds Barrier Execution Mode for better integration with deep learning frameworks, introduces 30+ built-in and higher-order functions to deal with complex data type easier, improves the K8s integration, along with

Re: [ANNOUNCE] Announcing Apache Spark 2.4.0

2018-11-08 Thread Wenchen Fan
+ user list On Fri, Nov 9, 2018 at 2:20 AM Wenchen Fan wrote: > resend > > On Thu, Nov 8, 2018 at 11:02 PM Wenchen Fan wrote: > >> >> >> -- Forwarded message - >> From: Wenchen Fan >> Date: Thu, Nov 8, 2018 at 10:55 PM >> S

Re: Did the 2.4 release email go out?

2018-11-08 Thread Wenchen Fan
Actually I did it 3 hours ago, however the mail server seems to have some problems and my email was lost. Let me resend it. On Fri, Nov 9, 2018 at 1:56 AM Reynold Xin wrote: > The website is already up but I didn’t see any email announcement yet. >

Re: Did the 2.4 release email go out?

2018-11-08 Thread Wenchen Fan
ping On Fri, Nov 9, 2018 at 2:20 AM Wenchen Fan wrote: > Actually I did it 3 hours ago, however the mail server seems to have some > problems and my email was lost. Let me resend it. > > On Fri, Nov 9, 2018 at 1:56 AM Reynold Xin wrote: > >> The website is already up but I

Re: [ANNOUNCE] Announcing Apache Spark 2.4.0

2018-11-08 Thread Wenchen Fan
resend On Thu, Nov 8, 2018 at 11:02 PM Wenchen Fan wrote: > > > -- Forwarded message - > From: Wenchen Fan > Date: Thu, Nov 8, 2018 at 10:55 PM > Subject: [ANNOUNCE] Announcing Apache Spark 2.4.0 > To: Spark dev list > > > Hi all, > > Apa

Re: Make Scala 2.12 as default Scala version in Spark 3.0

2018-11-06 Thread Wenchen Fan
We make Scala 2.11 the default one in Spark 2.0, then drop Scala 2.10 in Spark 2.3. Shall we follow it and drop Scala 2.11 at some point of Spark 3.x? On Wed, Nov 7, 2018 at 8:55 AM Reynold Xin wrote: > Have we deprecated Scala 2.11 already in an existing release? > > On Tue, Nov 6, 2018 at

Re: [CRAN-pretest-archived] CRAN submission SparkR 2.4.0

2018-11-06 Thread Wenchen Fan
Do you mean we should have a 2.4.0 release without CRAN and then do a 2.4.1 immediately? On Wed, Nov 7, 2018 at 12:34 AM Felix Cheung wrote: > Shivaram and I were discussing. > Actually we worked with them before. Another possible approach is to > remove the vignettes eval and all test from the

Re: need assistance debugging a strange build failure...

2018-11-05 Thread Wenchen Fan
Have you tried Maven instead of SBT? This looks like a Java dependency problem, e.g. a wrong version of Avro is picked. On Tue, Nov 6, 2018 at 8:30 AM shane knapp wrote: > i'm really close (for real: really close!) on the ubuntu port... but one > build has been a thorn in my side and i was

Re: [VOTE] SPARK 2.4.0 (RC5)

2018-11-01 Thread Wenchen Fan
This vote passes! I'll follow up with a formal release announcement soon. +1: Xiao Li (binding) Sean Owen (binding) Gengliang Wang Hyukjin Kwon Wenchen Fan (binding) Ryan Blue Bryan Cutler Marcelo Vanzin (binding) Reynold Xin (binding) Chitral Verma Dilip Biswal Denny Lee Felix Cheung (binding

Re: python lint is broken on master branch

2018-10-31 Thread Wenchen Fan
be a pycodestyle or flake8 version issue? >>> On Wed, Oct 31, 2018 at 7:43 AM Wenchen Fan wrote: >>> > >>> > The Jenkins job spark-master-lint keeps failing. The error message is >>> > flake8.exceptions.FailedToLoadPlugin: Flake8 failed

python lint is broken on master branch

2018-10-31 Thread Wenchen Fan
The Jenkins job spark-master-lint keeps failing. The error message is flake8.exceptions.FailedToLoadPlugin: Flake8 failed to load plugin "pycodestyle.break_after_binary_operator" due to 'module' object has no attribute 'break_after_binary_operator'. flake8 checks failed. As an example please see

Re: DataSourceV2 hangouts sync

2018-10-30 Thread Wenchen Fan
Hi all, I spent some time thinking about the roadmap, and came up with an initial list: SPARK-25390: data source V2 API refactoring SPARK-24252: add catalog support SPARK-25531: new write APIs for data source v2 SPARK-25190: better operator pushdown API Streaming rate control API Custom metrics

Re: [VOTE] SPARK 2.4.0 (RC5)

2018-10-30 Thread Wenchen Fan
gt;> >> > >> >> > These items are still targeted to 2.4.0; Xiangrui I assume these >> >> > should just be untargeted now, or resolved? >> >> > SPARK-25584 Document libsvm data source in doc site >> >> > SPARK-25346 Document Spark b

[VOTE] SPARK 2.4.0 (RC5)

2018-10-29 Thread Wenchen Fan
Please vote on releasing the following candidate as Apache Spark version 2.4.0. The vote is open until November 1 PST and passes if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes. [ ] +1 Release this package as Apache Spark 2.4.0 [ ] -1 Do not release this package because ... To

Re: DataSourceV2 hangouts sync

2018-10-25 Thread Wenchen Fan
people here in the bay > area. > > If that makes it so some people can't attend, we can do the next one > earlier for people in Europe. > > If we go with 5PM PDT, then what day works best for everyone? > > On Thu, Oct 25, 2018 at 5:01 PM Wenchen Fan wrote: > >> Big +1

Re: DataSourceV2 hangouts sync

2018-10-25 Thread Wenchen Fan
Big +1 on this! I live in UTC+8 and I'm available from 8 am, which is 5 pm in the bay area. Hopefully we can coordinate a time that fits everyone. Thanks Wenchen On Fri, Oct 26, 2018 at 7:21 AM Dongjoon Hyun wrote: > +1. Thank you for volunteering, Ryan! > > Bests, > Dongjoon. > > > On Thu,

Re: What if anything to fix about k8s for the 2.4.0 RC5?

2018-10-25 Thread Wenchen Fan
Any updates on this topic? https://github.com/apache/spark/pull/22827 is merged and 2.4 is unblocked. I'll cut RC5 shortly after the weekend, and it will be great to include the change proposed here. Thanks, Wenchen On Fri, Oct 26, 2018 at 12:55 AM Stavros Kontopoulos <

Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-25 Thread Wenchen Fan
a wrong result due to our undefined behavior. I agree we >> should revert the newly added map-oriented functions. In 3.0 release, we >> need to define the behavior of duplicate keys in the data type MAP and fix >> all the related issues that are confusing to our end users. >

Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-24 Thread Wenchen Fan
her words, `map_filter` works like `push-downed filter` to the map in > terms of the output result > while users assumed that `map_filter` works on top of the result of `m`. > > This is a function semantic issue. > > > On Wed, Oct 24, 2018 at 6:06 PM Wenchen Fan wrote: > >&g

Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-24 Thread Wenchen Fan
rk-sql> select map(1,2,1,3); // Spark 2.4.0 RC4 > {1:3} > > > hive> select map(1,2,1,3); // Hive 1.2.2 > OK > {1:3} > > > presto> SELECT map_concat(map(array[1],array[2]), map(array[1],array[3])); > // Presto 0.212 > _col0 > --- > {1=3} > > > Bes

Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-24 Thread Wenchen Fan
e to lower the priority. Always, I respect PMC's > decision. > > I'm sending this email to draw more attention to this bug and to give some > warning on the new feature's limitation to the community. > > Bests, > Dongjoon. > > > On Mon, Oct 22, 2018 at 10:42 AM Wenchen Fan

Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-23 Thread Wenchen Fan
nd call it the same release. >> >> I'd still like some confirmation that someone can build and pass tests >> with -Pkubernetes, maybe? It actually all passed with the 2.11 build. >> I don't think it's a 2.12 incompatibility, but rather than the K8S >> tests maybe don't quite w

Re: About introduce function sum0 to Spark

2018-10-23 Thread Wenchen Fan
This is logically `sum( if(isnull(col), 0, col) )` right? On Tue, Oct 23, 2018 at 2:58 PM 陶 加涛 wrote: > The name is from Apache Calcite, And it doesn’t matter, we can introduce > our own. > > > > > > --- > > Regards! > > Aron Tao > > > > *发件人**: *Mark Hamstra > *日期**: *2018年10月23日 星期二 12:28 >
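The rewrite in the reply can be sketched with existing DataFrame functions (assuming a SparkSession named `spark`):

```scala
import spark.implicits._
import org.apache.spark.sql.functions.{coalesce, col, lit, sum}

// "sum0" (a sum that yields 0 instead of NULL for null inputs) expressed with
// existing functions: sum(if(isnull(c), 0, c)) is sum(coalesce(c, 0)).
val df = Seq[java.lang.Integer](1, null, 2).toDF("c")
df.select(sum(coalesce(col("c"), lit(0)))).show()   // nulls counted as 0

// Note: over an empty input the aggregate itself is still NULL; to get 0 there
// as well, wrap the whole aggregate: coalesce(sum(col("c")), lit(0))
```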

Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-22 Thread Wenchen Fan
t; >> at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210) > >> at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208) > >> at org.apache.spark.deploy.k8s.integrationtest.KubernetesSuite.org > $scalatest$BeforeAndAfter$$super$ru

Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-22 Thread Wenchen Fan
Since GitHub and Jenkins are in a chaotic state, I didn't wait for a green Jenkins QA job for the RC4 commit. We should fail this RC if the Jenkins is broken (very unlikely). I'm adding my own +1, all known blockers are resolved. On Tue, Oct 23, 2018 at 1:42 AM Wenchen Fan wrote: > Please v

[VOTE] SPARK 2.4.0 (RC4)

2018-10-22 Thread Wenchen Fan
Please vote on releasing the following candidate as Apache Spark version 2.4.0. The vote is open until October 26 PST and passes if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes. [ ] +1 Release this package as Apache Spark 2.4.0 [ ] -1 Do not release this package because ... To

Re: [VOTE] SPARK 2.4.0 (RC3)

2018-10-19 Thread Wenchen Fan
. > > > On Wed, Oct 10, 2018 at 9:47 AM Wenchen Fan wrote: > >> Please vote on releasing the following candidate as Apache Spark version >> 2.4.0. >> >> The vote is open until October 1 PST and passes if a majority +1 PMC >> votes are cast, with >>

Re: data source api v2 refactoring

2018-10-18 Thread Wenchen Fan
, 2018 at 4:51 AM Thakrar, Jayesh < jthak...@conversantmedia.com> wrote: > Thanks for the info Ryan – very helpful! > > > > *From: *Ryan Blue > *Reply-To: *"rb...@netflix.com" > *Date: *Wednesday, September 19, 2018 at 3:17 PM > *To: *"Thakrar, Jayesh&q

Re: Coalesce behaviour

2018-10-14 Thread Wenchen Fan
olution. > > how can we insert a coalesce as a new map-phase (new job on application > manager with narrow dependency) instead of modifying the existing reduce > phase? i am saying map-phase because it should not introduce a new shuffle: > this is wasteful and unnecessary. > > &

Re: Coalesce behaviour

2018-10-12 Thread Wenchen Fan
8 partitions and 2048 tasks and then a map > phase with 10 partitions and 10 tasks that writes to hdfs? > > every time i try to do this using coalesce the shuffle ends up having 10 > tasks which is unacceptable due to OOM. this makes coalesce somewhat > useless. > > On Wed, Oc

Re: Remove Flume support in 3.0.0?

2018-10-11 Thread Wenchen Fan
Note that, it was deprecated in 2.3.0 already: https://spark.apache.org/docs/2.3.0/streaming-flume-integration.html On Fri, Oct 12, 2018 at 12:46 AM Reynold Xin wrote: > Sounds like a good idea... > > > On Oct 11, 2018, at 6:40 PM, Sean Owen wrote: > > > > Yep, that already exists as Bahir. >

Re: Possible bug in DatasourceV2

2018-10-11 Thread Wenchen Fan
source, schema.toAttributes, options, ident, userSpecifiedSchema) >> >> } >> >> >> >> Correct this? >> >> >> >> Or even creating a new create which simply gets the schema as non >> optional? >> >> >> >

Fwd: [VOTE] SPARK 2.4.0 (RC3)

2018-10-11 Thread Wenchen Fan
Forgot to cc dev-list -- Forwarded message - From: Wenchen Fan Date: Thu, Oct 11, 2018 at 10:14 AM Subject: Re: [VOTE] SPARK 2.4.0 (RC3) To: Cc: Sean Owen Ah sorry guys, I just copy-paste the voting email from the last RC and forgot to update the date :P The voting should

Re: [VOTE] SPARK 2.4.0 (RC3)

2018-10-10 Thread Wenchen Fan
I'm adding my own +1, since there are no known blocker issues. The correctness issue has been fixed, the streaming Java API problem has been resolved, and we have upgraded to Scala 2.12.7. On Thu, Oct 11, 2018 at 12:46 AM Wenchen Fan wrote: > Please vote on releasing the following candid

[VOTE] SPARK 2.4.0 (RC3)

2018-10-10 Thread Wenchen Fan
Please vote on releasing the following candidate as Apache Spark version 2.4.0. The vote is open until October 1 PST and passes if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes. [ ] +1 Release this package as Apache Spark 2.4.0 [ ] -1 Do not release this package because ... To
