Re: Inconsistent schema on Encoders.bean (reported issues from user@)

2020-05-10 Thread Wenchen Fan
[quoted table of grouped ids and counts elided] > ... > > This can happen with such inconsistent schemas because State in Structured > Streaming doesn't check the schema (both

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-05-08 Thread Wenchen Fan
y seems to be >>>> different), and once we notice the issue it would be really odd if we >>>> publish it as it is, and try to fix it later (the fix may not be even >>>> included in 3.0.x as it might bring behavioral change). >>>> >>>>

Re: Inconsistent schema on Encoders.bean (reported issues from user@)

2020-05-08 Thread Wenchen Fan
Can you give some simple examples to demonstrate the problem? I think the inconsistency would bring problems but don't know how. On Fri, May 8, 2020 at 3:49 PM Jungtaek Lim wrote: > (bump to expose the discussion to more readers) > > On Mon, May 4, 2020 at 4:57 PM Jungtaek Lim > wrote: > >> Hi

Re: is there any tool to visualize the spark physical plan or spark plan

2020-04-30 Thread Wenchen Fan
Does the Spark SQL web UI work for you? https://spark.apache.org/docs/3.0.0-preview/web-ui.html#sql-tab On Thu, Apr 30, 2020 at 5:30 PM Manu Zhang wrote: > Hi Kelly, > > If you can parse event log, then try listening on > `SparkListenerSQLExecutionStart` event and build a `SparkPlanGraph` like

Re: [DISCUSS] Java specific APIs design concern and choice

2020-04-27 Thread Wenchen Fan
IIUC we are moving away from having two classes for Java and Scala, like JavaRDD and RDD. It's much simpler to maintain and use a single class. I don't have a strong preference between options 3 and 4. We may need to collect more data points from actual users. On Mon, Apr 27, 2020 at 9:50 PM

Re: Error while reading hive tables with tmp/hidden files inside partitions

2020-04-23 Thread Wenchen Fan
ards,Dhrubajyoti Hati.* > > > On Thu, Apr 23, 2020 at 10:12 AM Wenchen Fan wrote: > >> This looks like a bug that path filter doesn't work for hive table >> reading. Can you open a JIRA ticket? >> >> On Thu, Apr 23, 2020 at 3:15 AM Dhrubajyoti Hati >

Re: Error while reading hive tables with tmp/hidden files inside partitions

2020-04-22 Thread Wenchen Fan
This looks like a bug that path filter doesn't work for hive table reading. Can you open a JIRA ticket? On Thu, Apr 23, 2020 at 3:15 AM Dhrubajyoti Hati wrote: > Just wondering if any one could help me out on this. > > Thank you! > > > > > *Regards,Dhrubajyoti Hati.* > > > On Wed, Apr 22, 2020

Re: [VOTE] Apache Spark 3.0.0 RC1

2020-04-09 Thread Wenchen Fan
The ongoing critical issues I'm aware of are: SPARK-31257 : Fix ambiguous two different CREATE TABLE syntaxes SPARK-31404 : backward compatibility issues after switching to Proleptic Gregorian

Re: DSv2 & DataSourceRegister

2020-04-08 Thread Wenchen Fan
llo > > On Tue, Apr 7, 2020 at 23:16 Wenchen Fan wrote: > >> Are you going to provide a single artifact for Spark 2.4 and 3.0? I'm not >> sure this is possible as the DS V2 API is very different in 3.0, e.g. there >> is no `DataSourceV2` anymore, and you should implement `T

Re: DSv2 & DataSourceRegister

2020-04-07 Thread Wenchen Fan
Are you going to provide a single artifact for Spark 2.4 and 3.0? I'm not sure this is possible as the DS V2 API is very different in 3.0, e.g. there is no `DataSourceV2` anymore, and you should implement `TableProvider` (if you don't have database/table). On Wed, Apr 8, 2020 at 6:58 AM Andrew
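For readers following the API change: a minimal sketch of what implementing `TableProvider` looks like on the 3.0 connector API (class names here are hypothetical, and a real `MyTable` would also mix in SupportsRead/SupportsWrite):

    import java.util
    import org.apache.spark.sql.connector.catalog.{Table, TableCapability, TableProvider}
    import org.apache.spark.sql.connector.expressions.Transform
    import org.apache.spark.sql.types.StructType
    import org.apache.spark.sql.util.CaseInsensitiveStringMap

    // Hypothetical source: TableProvider replaces the removed `DataSourceV2` entry point.
    class MyTableProvider extends TableProvider {
      override def inferSchema(options: CaseInsensitiveStringMap): StructType =
        new StructType().add("value", "string")

      override def getTable(
          schema: StructType,
          partitioning: Array[Transform],
          properties: util.Map[String, String]): Table = new MyTable(schema)
    }

    class MyTable(tableSchema: StructType) extends Table {
      override def name(): String = "my_table"
      override def schema(): StructType = tableSchema
      // Declares what the source supports; a real source needs SupportsRead to back this up.
      override def capabilities(): util.Set[TableCapability] =
        util.EnumSet.of(TableCapability.BATCH_READ)
    }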

Re: [VOTE] Apache Spark 3.0.0 RC1

2020-03-31 Thread Wenchen Fan
Yea, release candidates are different from the preview version: release candidates are not official releases, so they won't appear in Maven Central, can't be downloaded from the official Spark website, etc. On Wed, Apr 1, 2020 at 12:32 PM Sean Owen wrote: > These are release candidates, not

Re: Release Manager's official `branch-3.0` Assessment?

2020-03-29 Thread Wenchen Fan
I agree that we can cut the RC anyway even if there are blockers, to move us to a more official "code freeze" status. About the CREATE TABLE unification, it's still WIP and not close-to-merge yet. Can we fix some specific problems like CREATE EXTERNAL TABLE surgically and leave the unification to

Re: Programmatic: parquet file corruption error

2020-03-27 Thread Wenchen Fan
Running a Spark application from an IDE is not officially supported. It may work in some cases but there is no guarantee at all. The official way is to run interactive queries with spark-shell or package your application into a jar and use spark-submit. On Thu, Mar 26, 2020 at 4:12 PM Zahid Rahman

Re: BUG: take with SparkSession.master[url]

2020-03-27 Thread Wenchen Fan
g from maven. > > Backbutton.co.uk > ¯\_(ツ)_/¯ > ♡۶Java♡۶RMI ♡۶ > Make Use Method {MUM} > makeuse.org > <http://www.backbutton.co.uk> > > > On Fri, 27 Mar 2020 at 05:45, Wenchen Fan wrote: > >> Which Spark/Scala version do you use? >

Re: BUG: take with SparkSession.master[url]

2020-03-26 Thread Wenchen Fan
Which Spark/Scala version do you use? On Fri, Mar 27, 2020 at 1:24 PM Zahid Rahman wrote: > > with the following sparksession configuration > > val spark = SparkSession.builder().master("local[*]").appName("Spark Session > take").getOrCreate(); > > this line works > > flights.filter(flight_row

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-03-24 Thread Wenchen Fan
Hi Ryan, It's great to hear that you are cleaning up this long-standing mess. Please let me know if you hit any problems that I can help with. Thanks, Wenchen On Sat, Mar 21, 2020 at 3:16 AM Nicholas Chammas wrote: > On Thu, Mar 19, 2020 at 3:46 AM Wenchen Fan wrote: > >> 2.

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-03-19 Thread Wenchen Fan
with Hive connected. >>>> >>>> But since we are even thinking about native syntax as a first class and >>>> dropping Hive one implicitly (hide in doc) or explicitly, does it really >>>> matter we require a marker (like "HIVE") in rule

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-03-18 Thread Wenchen Fan
entation to make things be clear, but if the approach >> would be explaining the difference of rules and guide the tip to make the >> query be bound to the specific rule, the same could be applied to parser >> rule to address the root cause. >> >> >> On Wed, M

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-03-18 Thread Wenchen Fan
ry but I think we are making bad assumption on end users which is a > serious problem. > > If we really want to promote Spark's one for CREATE TABLE, then would it > really matter to treat Hive CREATE TABLE be "exceptional" one and try to > isolate each other? What's the point of

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-03-18 Thread Wenchen Fan
I think the general guideline is to promote Spark's own CREATE TABLE syntax instead of the Hive one. Previously these two rules were mutually exclusive because the native syntax requires the USING clause while the Hive syntax makes the ROW FORMAT or STORED AS clause optional. It's a good move to make
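As a concrete illustration of the overlap, here are the two forms side by side (a sketch assuming a spark-shell session with Hive support enabled):

    spark.sql("CREATE TABLE t1 (id INT, data STRING) USING parquet")     // native syntax
    spark.sql("CREATE TABLE t2 (id INT, data STRING) STORED AS parquet") // Hive syntax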

Re: Spark 2.4.x and 3.x datasourcev2 api documentation & references

2020-03-18 Thread Wenchen Fan
For now you can take a look at `DataSourceV2Suite`, which contains both Java/Scala implementations. There is also an ongoing PR to implement catalog APIs for JDBC: https://github.com/apache/spark/pull/27345 We are still working on the user guide. On Mon, Mar 16, 2020 at 4:59 AM MadDoxX wrote:

Re: FYI: The evolution on `CHAR` type behavior

2020-03-17 Thread Wenchen Fan
OK let me put a proposal here: 1. Permanently ban CHAR for native data source tables, and only keep it for Hive compatibility. It's OK to drop padding, as Snowflake and MySQL have done. But it's hard for Spark to require consistent behavior for the CHAR type in all data sources. Since

Re: FYI: The evolution on `CHAR` type behavior

2020-03-17 Thread Wenchen Fan
I agree that Spark can define the semantic of CHAR(x) differently than the SQL standard (no padding), and ask the data sources to follow it. But the problem is, some data sources may not be able to skip padding, like the Hive serde table. On the other hand, it's easier to require padding for

Re: [DISCUSS] Null-handling of primitive-type of untyped Scala UDF in Scala 2.12

2020-03-17 Thread Wenchen Fan
I don't think option 1 is possible. For option 2: I think we need to do it anyway. It's kind of a bug that the typed Scala UDF doesn't support case classes and thus can't support struct-type input columns. For option 3: It's a bit risky to add a new API but it seems like we have a good reason. The
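A small sketch of the typed vs. untyped Scala UDF APIs under discussion (assuming a spark-shell session):

    import org.apache.spark.sql.functions.udf
    import org.apache.spark.sql.types.IntegerType

    // Typed: Spark sees the primitive input type, so a null input yields a null result.
    val typedInc = udf((x: Int) => x + 1)
    // Untyped: with Scala 2.12 Spark cannot recover the primitive type from the
    // closure, so a null input may silently become 0 -- the issue in this thread.
    val untypedInc = udf(((x: Int) => x + 1): AnyRef, IntegerType)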

Re: [VOTE] Amend Spark's Semantic Versioning Policy

2020-03-09 Thread Wenchen Fan
+1 (binding), assuming that this is for public stable APIs, not APIs that are marked as unstable, evolving, etc. On Mon, Mar 9, 2020 at 1:10 AM Ismaël Mejía wrote: > +1 (non-binding) > > Michael's section on the trade-offs of maintaining / removing an API are > one of > the best reads I have

Re: Datasource V2 support in Spark 3.x

2020-03-05 Thread Wenchen Fan
Data Source V2 has evolved into the Connector API, which supports both data (the data source API) and metadata (the catalog API). The new APIs are under the package org.apache.spark.sql.connector. You can keep using Data Source V1 as there is no plan to deprecate it in the near future. But if you'd like to
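For example, a catalog plugin from the new API is wired in purely by configuration (catalog name and class here are hypothetical):

    // Register a custom catalog implementation under the name `my_catalog`.
    spark.conf.set("spark.sql.catalog.my_catalog", "com.example.MyCatalog")
    // Its tables are then addressable with a multi-part identifier.
    spark.sql("SELECT * FROM my_catalog.db.tbl")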

Re: [DISCUSSION] Avoiding duplicate work

2020-02-21 Thread Wenchen Fan
The JIRA ticket will show the linked PR if there are any, which indicates that someone is working on it if the PR is active. Maybe the bot should also leave a comment on the JIRA ticket to make it clearer? On Fri, Feb 21, 2020 at 10:54 PM younggyu Chun wrote: > Hi All, > > I would like to

Re: [DISCUSSION] Esoteric Spark function `TRIM/LTRIM/RTRIM`

2020-02-18 Thread Wenchen Fan
+ non-foldable trimStr >>> 3. non-foldable srcStr + foldable trimStr >>> 4. non-foldable srcStr + non-foldable trimStr >>> >>> The case # 2 seems a rare case, and # 3 is probably the most common >>> case. Once we see the second case, we could outp

Re: [DISCUSSION] Esoteric Spark function `TRIM/LTRIM/RTRIM`

2020-02-15 Thread Wenchen Fan
It's unfortunate that we don't have a clear document to talk about breaking changes (I'm working on it BTW). I believe the general guidance is: *avoid breaking changes unless we have to*. For example, the previous result was so broken that we have to fix it, moving to Scala 2.12 makes us have to

Re: Adaptive Query Execution performance results in 3TB TPC-DS

2020-02-13 Thread Wenchen Fan
Thanks for providing the benchmark numbers! The result is very promising and I'm looking forward to seeing more feedback from real-world workloads. On Wed, Feb 12, 2020 at 3:43 PM Jia, Ke A wrote: > Hi all, > > We have completed the Spark 3.0 Adaptive Query Execution(AQE) performance > tests in

Re: [DISCUSS] naming policy of Spark configs

2020-02-12 Thread Wenchen Fan
t;>>> The new policy looks clear to me. +1 for the explicit policy. >>>> >>>> So, are we going to revise the existing conf names before 3.0.0 release? >>>> >>>> Or, is it applied to new up-coming configurations from now? >>>> >&

[DISCUSS] naming policy of Spark configs

2020-02-12 Thread Wenchen Fan
Hi all, I'd like to discuss the naming policy of Spark configs, as naming currently depends on personal preference, which leads to inconsistent names. In general, a config name should be a noun that describes its meaning clearly. Good examples: spark.sql.session.timeZone

Re: Request to document the direct relationship between other configurations

2020-02-12 Thread Wenchen Fan
In general I think it's better to have more detailed documents, but we don't have to force everyone to do it if the config name is structured. I would +1 to document the relationship if we can't tell it from the config names, e.g. spark.shuffle.service.enabled and spark.dynamicAllocation.enabled.
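For reference, the pair cited above must be enabled together in 3.0-era Spark (spark-defaults.conf style; values illustrative):

    spark.dynamicAllocation.enabled   true
    # dynamic allocation needs shuffle files to outlive their executors:
    spark.shuffle.service.enabled     true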

Re: comparable and orderable CalendarInterval

2020-02-11 Thread Wenchen Fan
What's your use case to compare intervals? It's tricky in Spark as there is only one interval type and you can't really compare one month with 30 days. On Wed, Feb 12, 2020 at 12:01 AM Enrico Minack wrote: > Hi Devs, > > I would like to know what is the current roadmap of making >

Re: [ANNOUNCE] Announcing Apache Spark 2.4.5

2020-02-10 Thread Wenchen Fan
Great job, Dongjoon! On Mon, Feb 10, 2020 at 4:18 PM Hyukjin Kwon wrote: > Thanks Dongjoon! > > On Sun, Feb 9, 2020 at 10:49 AM, Takeshi Yamamuro wrote: > >> Happy to hear the release news! >> >> Bests, >> Takeshi >> >> On Sun, Feb 9, 2020 at 10:28 AM Dongjoon Hyun >> wrote: >> >>> There was a typo

Re: [SQL] Is it worth it (and advisable) to implement native UDFs?

2020-02-05 Thread Wenchen Fan
This is really a hack and we don't recommend users to access internal classes directly. That's why there is no public document. If you really need to do it and are aware of the risks, you can read the source code. All expressions (the so-called "native UDFs") extend the base class `Expression`.
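For the curious, a bare-bones sketch of such an expression (hedged: these are private Catalyst classes with no compatibility guarantees, signatures vary across versions, and e.g. Spark 3.2+ also requires `withNewChildInternal`):

    import org.apache.spark.sql.catalyst.expressions.{Expression, UnaryExpression}
    import org.apache.spark.sql.catalyst.expressions.codegen.CodegenFallback
    import org.apache.spark.sql.types.{DataType, IntegerType}

    // Interpreted-only "native UDF": CodegenFallback avoids writing codegen by hand.
    case class PlusOne(child: Expression) extends UnaryExpression with CodegenFallback {
      override def dataType: DataType = IntegerType
      // Called only for non-null input; null propagation is handled by the base class.
      override def nullSafeEval(input: Any): Any = input.asInstanceOf[Int] + 1
    }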

Re: [VOTE] Release Apache Spark 2.4.5 (RC2)

2020-02-03 Thread Wenchen Fan
AFAIK there is no ongoing critical bug fixes, +1 On Mon, Feb 3, 2020 at 11:46 PM Dongjoon Hyun wrote: > Yes, it does officially since 2.4.0. > > 2.4.5 is a maintenance release of 2.4.x line and the community didn't > support Hadoop 3.x on 'branch-2.4'. We didn't run test at all. > > Bests, >

Re: [FYI] `Target Version` on `Improvement`/`New Feature` JIRA issues

2020-02-02 Thread Wenchen Fan
Thanks for cleaning this up! On Sun, Feb 2, 2020 at 2:08 PM Xiao Li wrote: > Thanks! Dongjoon. > > Xiao > > On Sat, Feb 1, 2020 at 5:15 PM Hyukjin Kwon wrote: > >> Thanks Dongjoon. >> >> On Sun, 2 Feb 2020, 09:08 Dongjoon Hyun, wrote: >> >>> Hi, All. >>> >>> From Today, we have `branch-3.0`

Re: [DISCUSS] Revert and revisit the public custom expression API for partition (a.k.a. Transform API)

2020-01-23 Thread Wenchen Fan
> Also, the followup JIRA requested seems still open >>>> https://issues.apache.org/jira/browse/SPARK-27386 >>>> I heard this was already discussed but I cannot find the summary of the >>>> meeting or any documentation. >>>> >>>>

Re: Enabling push-based shuffle in Spark

2020-01-23 Thread Wenchen Fan
The name "push-based shuffle" is a little misleading. This seems like a better shuffle service that co-locates shuffle blocks of one reducer at the map phase. I think this is a good idea. Is it possible to make it completely external via the shuffle plugin API? This looks like a good use case of

Re: Correctness and data loss issues

2020-01-21 Thread Wenchen Fan
I think we need to go through them during the 3.0 QA period, and try to fix the valid ones. For example, the first ticket should be fixed already in https://issues.apache.org/jira/browse/SPARK-28344 On Mon, Jan 20, 2020 at 2:07 PM Dongjoon Hyun wrote: > Hi, All. > > According to our policy,

Re: [Discuss] Metrics Support for DS V2

2020-01-17 Thread Wenchen Fan
I think there are a few details we need to discuss. How frequently should a source update its metrics? For example, if a file source needs to report size metrics per row, it'll be super slow. What metrics should a source report? Data size? numFiles? Read time? Shall we show metrics in the SQL web UI

Re: [DISCUSS] Support year-month and day-time Intervals

2020-01-16 Thread Wenchen Fan
The proposal makes sense to me. If we are not going to make interval type ANSI-compliant in this release, we should not expose it widely. Thanks for driving it, Kent! On Fri, Jan 17, 2020 at 10:52 AM Dr. Kent Yao wrote: > Following ANSI might be a good option but also a serious user behavior >

Re: [DISCUSS] Revert and revisit the public custom expression API for partition (a.k.a. Transform API)

2020-01-16 Thread Wenchen Fan
The DS v2 project is still evolving, so half-baked is inevitable sometimes. This feature is definitely in the right direction to allow more flexible partition implementations, but there are a few problems we can discuss. About expression duplication: this is an existing design choice. We don't

Re: [VOTE] Release Apache Spark 2.4.5 (RC1)

2020-01-15 Thread Wenchen Fan
Recently we merged several fixes to 2.4: https://issues.apache.org/jira/browse/SPARK-30325 a driver hang issue https://issues.apache.org/jira/browse/SPARK-30246 a memory leak issue https://issues.apache.org/jira/browse/SPARK-29708 a correctness issue (for a rarely used feature, so not merged

Re: Question about Datasource V2

2020-01-13 Thread Wenchen Fan
1. We plan to add view support in future releases. 2. Can you open a JIRA ticket? This seems like a bug to me. 3. Instead of defining a lot of fields in the table, we decided to use properties to keep all the extra information. We've defined some reserved properties like "comment", "location",

Re: How executor Understand which RDDs needed to be persist from the submitted Task

2020-01-09 Thread Wenchen Fan
> > Iacovos > On 1/9/20 5:03 PM, Wenchen Fan wrote: > > RDD has a flag `storageLevel` which will be set by calling persist(). RDD > will be serialized and sent to executors for running tasks. So executors > just look at RDD.storageLevel and store output in its block manager wh

Re: How executor Understand which RDDs needed to be persist from the submitted Task

2020-01-09 Thread Wenchen Fan
RDD has a flag `storageLevel` which will be set by calling persist(). RDD will be serialized and sent to executors for running tasks. So executors just look at RDD.storageLevel and store output in its block manager when needed. On Thu, Jan 9, 2020 at 5:53 PM Jack Kolokasis wrote: > Hello all, >
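In code, the flag-then-cache behavior looks like this (a sketch assuming a spark-shell session with SparkContext `sc`):

    import org.apache.spark.storage.StorageLevel

    val rdd = sc.parallelize(1 to 1000).map(_ * 2)
    rdd.persist(StorageLevel.MEMORY_ONLY) // only sets the storageLevel flag, nothing runs yet
    rdd.count() // tasks run; executors check storageLevel and cache partitions in their block managers
    rdd.count() // the second action is served from the cached blocks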

Re: [SPARK-30319][SQL] Add a stricter version of as[T]

2020-01-07 Thread Wenchen Fan
I think it's simply because as[T] is lazy. You will see the right schema if you do `df.as[T].map(identity)`. On Tue, Jan 7, 2020 at 4:42 PM Enrico Minack wrote: > Hi Devs, > > I'd like to propose a stricter version of as[T]. Given the interface def > as[T](): Dataset[T], it is
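A quick sketch of the laziness (assumes spark-shell with `spark.implicits._` in scope):

    case class Person(name: String, age: Long)

    val df = Seq(("alice", 30L, "extra")).toDF("name", "age", "other")
    df.as[Person].schema               // still shows all three columns: as[T] only records the target type
    df.as[Person].map(identity).schema // the typed map forces deserialization; only name and age remain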

Re: [DISCUSS] Support subdirectories when accessing partitioned Parquet Hive table

2020-01-06 Thread Wenchen Fan
Isn't your directory structure malformed? The directory name under the table path should be in the form of "partitionCol=value". And AFAIK this is the Hive standard. On Mon, Jan 6, 2020 at 6:59 PM Lotkowski, Michael wrote: > Hi all, > > > > Reviving this thread, we still have this issue and
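For reference, a well-formed Hive-style partitioned layout looks like this (paths hypothetical):

    /warehouse/my_table/year=2020/month=01/part-00000.parquet
    /warehouse/my_table/year=2020/month=02/part-00000.parquet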

Re: Release Apache Spark 2.4.5

2020-01-05 Thread Wenchen Fan
+1 On Mon, Jan 6, 2020 at 12:02 PM Jungtaek Lim wrote: > +1 to have another Spark 2.4 release, as Spark 2.4.4 was released about 4 > months ago and there's a release window for this. > > On Mon, Jan 6, 2020 at 12:38 PM Hyukjin Kwon wrote: > >> Yeah, I think it's nice to have another maintenance

Re: Fw:Re: [VOTE][SPARK-29018][SPIP]:Build spark thrift server based on protocol v11

2019-12-29 Thread Wenchen Fan
+1 for the new thrift server to get rid of the Hive dependencies! On Mon, Dec 23, 2019 at 7:55 PM Yuming Wang wrote: > I'm +1 for this SPIP for these two reasons: > > 1. The current thriftserver has some issues that are not easy to solve, > such as: SPARK-28636

Re: Spark 3.0 branch cut and code freeze on Jan 31?

2019-12-23 Thread Wenchen Fan
Sounds good! On Tue, Dec 24, 2019 at 7:48 AM Reynold Xin wrote: > We've pushed out 3.0 multiple times. The latest release window documented > on the website says we'd > code freeze and cut branch-3.0 early Dec. It looks like we are suffering a >

Re: [VOTE] SPARK 3.0.0-preview2 (RC2)

2019-12-18 Thread Wenchen Fan
+1, all tests pass On Thu, Dec 19, 2019 at 7:18 AM Takeshi Yamamuro wrote: > Thanks, Yuming! > > I checked the links and the prepared binaries. > Also, I run tests with -Pyarn -Phadoop-2.7 -Phive -Phive-thriftserver > -Pmesos -Pkubernetes -Psparkr > on java version "1.8.0_181. > All the things

Re: how to get partition column info in Data Source V2 writer

2019-12-18 Thread Wenchen Fan
Hi Aakash, You can try the latest DS v2 with the 3.0 preview, and the API is in a quite stable shape now. With the latest API, a Writer is created from a Table, and the Table has the partitioning information. Thanks, Wenchen On Wed, Dec 18, 2019 at 3:22 AM aakash aakash wrote: > Thanks

Re: I would like to add JDBCDialect to support Vertica database

2019-12-11 Thread Wenchen Fan
Can we make the JDBCDialect a public API that users can plug in? It looks like an endless job to make sure the Spark JDBC source supports all databases. On Wed, Dec 11, 2019 at 11:41 PM Xiao Li wrote: > You can follow how we test the other JDBC dialects. All JDBC dialects > require the docker
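A sketch of such a plugin on the existing developer API (the URL prefix is an assumption; behavior overrides are elided):

    import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

    object VerticaDialect extends JdbcDialect {
      override def canHandle(url: String): Boolean =
        url.toLowerCase.startsWith("jdbc:vertica")
      // override quoteIdentifier / getCatalystType / etc. as needed
    }

    JdbcDialects.registerDialect(VerticaDialect) // must run before the JDBC read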

Re: Release Apache Spark 2.4.5 and 2.4.6

2019-12-10 Thread Wenchen Fan
Sounds good. Thanks for bringing this up! On Wed, Dec 11, 2019 at 3:18 PM Takeshi Yamamuro wrote: > That looks nice, thanks! > I checked the previous v2.4.4 release; it has around 130 commits (from > 2.4.3 to 2.4.4), so > I think branch-2.4 already has enough commits for the next release. > > A

Re: [DISCUSS] Add close() on DataWriter interface

2019-12-10 Thread Wenchen Fan
PartitionReader extends Closeable, so it seems reasonable to me to do the same for DataWriter. On Wed, Dec 11, 2019 at 1:35 PM Jungtaek Lim wrote: > Hi devs, > > I'd like to propose to add close() on DataWriter explicitly, which is the > place for resource cleanup. > > The rationalization of the
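A sketch of what the proposal means for an implementation (hypothetical writer holding an external connection):

    import java.io.Closeable
    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.sql.connector.write.{DataWriter, WriterCommitMessage}

    class MyWriter(conn: Closeable) extends DataWriter[InternalRow] {
      override def write(record: InternalRow): Unit = { /* buffer or send the row */ }
      override def commit(): WriterCommitMessage = new WriterCommitMessage {}
      override def abort(): Unit = { /* discard buffered data */ }
      override def close(): Unit = conn.close() // the explicit cleanup hook proposed here
    }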

Re: DataSourceWriter V2 Api questions

2019-12-05 Thread Wenchen Fan
ng tables on a periodic basis. >> >> It gets messy and probably moves you towards write-once-only tables, >> etc. >> >> >> >> Finally using views in a generic mongoDB connector may not be good and >> flexible enough. >> >> >

Re: [DISCUSS] Consistent relation resolution behavior in SparkSQL

2019-12-04 Thread Wenchen Fan
> proposal > <https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing> > . > > Note that this proposal is a breaking change, but the impact should be > minimal since this applies only when there are temp views and tables with > the same name. >

Re: Slower than usual on PRs

2019-12-02 Thread Wenchen Fan
Sorry to hear that. Hope you get better soon! On Tue, Dec 3, 2019 at 1:28 AM Holden Karau wrote: > Hi Spark dev folks, > > Just an FYI I'm out dealing with recovering from a motorcycle accident so > my lack of (or slow) responses on PRs/docs is health related and please > don't block on any of

Re: Fw:Re:Re: A question about radd bytes size

2019-12-02 Thread Wenchen Fan
From: "zhangliyun" > Sent: 2019-12-03 05:56:55 > To: "Wenchen Fan" > Subject: Re:Re: A question about radd bytes size > > Hi Fan: > thanks for the reply, I agree that how the data is stored decides the > total bytes of the table file. > In my experiment, I fou

Re: A question about radd bytes size

2019-12-01 Thread Wenchen Fan
When we talk about bytes size, we need to specify how the data is stored. For example, if we cache the dataframe, then the bytes size is the number of bytes of the binary format of the table cache. If we write to hive tables, then the bytes size is the total size of the data files of the table.

[DISCUSS] PostgreSQL dialect

2019-11-26 Thread Wenchen Fan
Hi all, Recently we started an effort to achieve feature parity between Spark and PostgreSQL: https://issues.apache.org/jira/browse/SPARK-27764 This goes very well. We've added many missing features (parser rules, built-in functions, etc.) to Spark, and also corrected several inappropriate

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-16 Thread Wenchen Fan
Do we have a limit on the number of pre-built distributions? It seems this time we need 1. hadoop 2.7 + hive 1.2 2. hadoop 2.7 + hive 2.3 3. hadoop 3 + hive 2.3 AFAIK we always build with JDK 8 (but make it JDK 11 compatible), so we don't need to add the JDK version to the combination. On Sat, Nov

Re: Why not implement CodegenSupport in class ShuffledHashJoinExec?

2019-11-10 Thread Wenchen Fan
shuffle hash join? Like code generation for ShuffledHashJoinExec or > something…. > > > > *From: *Wenchen Fan > *Date: *Sunday, November 10, 2019 at 5:57 PM > *To: *"Wang, Gang" > *Cc: *"dev@spark.apache.org" > *Subject: *Re: Why not implement CodegenSupport

Re: Why not implement CodegenSupport in class ShuffledHashJoinExec?

2019-11-10 Thread Wenchen Fan
By default sort merge join is preferred over shuffle hash join; that's why we haven't spent resources to implement codegen for it. On Sun, Nov 10, 2019 at 3:15 PM Wang, Gang wrote: > There are some cases, shuffle hash join performs even better than sort > merge join. > > While, I noticed that
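For workloads where shuffle hash join wins, the preference can be flipped by config (a sketch; the small-side size limits still apply):

    spark.conf.set("spark.sql.join.preferSortMergeJoin", "false")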

Re: [DISCUSS] Expensive deterministic UDFs

2019-11-07 Thread Wenchen Fan
We really need some documents to define what non-deterministic means. AFAIK, non-deterministic expressions may produce a different result for the same input row, if the already processed input rows are different. The optimizer tries its best to not change the input sequence of non-deterministic
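The usual lever in this discussion is marking a UDF non-deterministic so the optimizer won't duplicate or reorder it (a sketch; `slowLookup` is a hypothetical expensive call):

    import org.apache.spark.sql.functions.udf

    val slowLookup = (s: String) => s.reverse // stand-in for an expensive lookup
    val expensiveUdf = udf(slowLookup).asNondeterministic()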

Re: [DISCUSS] Remove sorting of fields in PySpark SQL Row construction

2019-11-06 Thread Wenchen Fan
Sounds reasonable to me. We should make the behavior consistent within Spark. On Tue, Nov 5, 2019 at 6:29 AM Bryan Cutler wrote: > Currently, when a PySpark Row is created with keyword arguments, the > fields are sorted alphabetically. This has created a lot of confusion with > users because it

Re: [VOTE] SPARK 3.0.0-preview (RC2)

2019-11-01 Thread Wenchen Fan
The PR builder uses the Hadoop 2.7 profile, which makes me think that 2.7 is more stable and we should make releases using 2.7 by default. +1 On Fri, Nov 1, 2019 at 7:16 AM Xiao Li wrote: > Spark 3.0 will still use the Hadoop 2.7 profile by default, I think. > Hadoop 2.7 profile is much more

Re: Re: A question about broadcast nest loop join

2019-10-23 Thread Wenchen Fan
Ah sorry I made a mistake. "Spark can only pick BroadcastNestedLoopJoin to implement left/right join" this should be "left/right non-equal join" On Thu, Oct 24, 2019 at 6:32 AM zhangliyun wrote: > > Hi Herman: >I guess what you mentioned before > ``` > if you are OK with slightly different

Re: A question about broadcast nest loop join

2019-10-23 Thread Wenchen Fan
I haven't looked into your query yet, just want to let you know that: Spark can only pick BroadcastNestedLoopJoin to implement left/right join. If the table is very big, then OOM happens. Maybe there is an algorithm to implement left/right join in a distributed environment without broadcast, but
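For example, a join like this has no equality keys, so the hash and sort-merge strategies don't apply (assumes tables a and b are registered):

    // Only BroadcastNestedLoopJoin (or a cartesian product) can plan this:
    spark.sql("SELECT * FROM a LEFT JOIN b ON a.x < b.y").explain()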

Re: DataSourceV2 sync notes - 2 October 2019

2019-10-18 Thread Wenchen Fan
Ryan Blue wrote: > Here are my notes from last week's DSv2 sync. > > *Attendees*: > > Ryan Blue > Terry Kim > Wenchen Fan > > *Topics*: > >- SchemaPruning only supports Parquet and ORC? >- Out of order optimizer rules >- 3.0 work >

Re: Apache Spark 3.0 timeline

2019-10-16 Thread Wenchen Fan
> I figure we are probably moving to code freeze late in the year, release early next year? Sounds good! On Thu, Oct 17, 2019 at 7:51 AM Dongjoon Hyun wrote: > Thanks! That sounds reasonable. I'm +1. :) > > Historically, 2.0-preview was on May 2016 and 2.0 was on July, 2016. 3.0 > seems to be

Re: branch-3.0 vs branch-3.0-preview (?)

2019-10-16 Thread Wenchen Fan
Does anybody remember what we did for the 2.0 preview? Personally I'd like to avoid cutting branch-3.0 right now, otherwise we need to merge PRs into two branches over the next several months. Thanks, Wenchen On Wed, Oct 16, 2019 at 3:01 PM Xingbo Jiang wrote: > Hi Dongjoon, > > I'm not sure

Re: [DISCUSS] ViewCatalog interface for DSv2

2019-10-14 Thread Wenchen Fan
I'm fine with the view definition proposed here, but my major concern is how to make sure table/view share the same namespace. According to the SQL spec, if there is a view named "a", we can't create a table named "a" anymore. We can add documents and ask the implementation to guarantee it, but

Re: [build system] IMPORTANT! northern california fire danger, potential power outage(s)

2019-10-09 Thread Wenchen Fan
Thanks for the updates! On Thu, Oct 10, 2019 at 5:34 AM Shane Knapp wrote: > quick update: > > campus is losing power @ 8pm. this is after we were told 4am, 8am, > noon, and 2-4pm. :) > > PG expects to start bringing alameda county back online at noon > tomorrow, but i believe that target to

Re: [SS] How to create a streaming DataFrame (for a custom Source in Spark 2.4.4 / MicroBatch / DSv1)?

2019-10-08 Thread Wenchen Fan
d to apply "package hack" but also need to > depend on catalyst. > > > On Mon, Oct 7, 2019 at 9:45 PM Wenchen Fan wrote: > >> AFAIK there is no public streaming data source API before DS v2. The >> Source and Sink API is private and is only for builtin streaming sourc

Re: Spark 3.0 preview release feature list and major changes

2019-10-08 Thread Wenchen Fan
Regarding DS v2, I'd like to remove SPARK-26785 data source v2 API refactor: streaming write SPARK-26956 remove streaming output mode from data source v2 APIs and put the umbrella ticket

Re: [VOTE][SPARK-28885] Follow ANSI store assignment rules in table insertion by default

2019-10-07 Thread Wenchen Fan
+1 I think this is the most reasonable default behavior among the three. On Mon, Oct 7, 2019 at 6:06 PM Alessandro Solimando < alessandro.solima...@gmail.com> wrote: > +1 (non-binding) > > I have been following this standardization effort and I think it is sound > and it provides the needed

Re: [SS] How to create a streaming DataFrame (for a custom Source in Spark 2.4.4 / MicroBatch / DSv1)?

2019-10-07 Thread Wenchen Fan
AFAIK there is no public streaming data source API before DS v2. The Source and Sink API is private and is only for builtin streaming sources. Advanced users can still implement custom stream sources with private Spark APIs (you can put your classes under the org.apache.spark.sql package to access
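A sketch of the package trick (hedged: these are internal, unstable APIs, and the package/class names here are hypothetical):

    // Declared under org.apache.spark.sql so private[sql] members are visible.
    package org.apache.spark.sql.mysource

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.execution.streaming.{Offset, Source}
    import org.apache.spark.sql.types.StructType

    class MyCustomSource extends Source {
      override def schema: StructType = new StructType().add("value", "string")
      override def getOffset: Option[Offset] = None // no data available yet
      override def getBatch(start: Option[Offset], end: Offset): DataFrame = ???
      override def stop(): Unit = {}
    }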

Re: [DISCUSS] Out of order optimizer rules?

2019-10-02 Thread Wenchen Fan
dynamic partition pruning rule generates "hidden" filters that will be converted to real predicates at runtime, so it doesn't matter where we run the rule. For PruneFileSourcePartitions, I'm not quite sure. Seems to me it's better to run it before join reorder. On Sun, Sep 29, 2019 at 5:51 AM

Re: Spark 3.0 preview release on-going features discussion

2019-09-20 Thread Wenchen Fan
> New pushdown API for DataSourceV2 One correction: I want to revisit the pushdown API to make sure it works for dynamic partition pruning and can be extended to support limit/aggregate/... pushdown in the future. It should be a small API update instead of a new API. On Fri, Sep 20, 2019 at 3:46

Re: [DISCUSS][SPIP][SPARK-29031] Materialized columns

2019-09-15 Thread Wenchen Fan
> 1. It is a waste of IO. The whole column (in Map format) should be read and Spark extract the required keys from the map, even though the query requires only one or a few keys in the map This sounds like a similar use case to nested column pruning. We should push down the map key extracting to

Re: Thoughts on Spark 3 release, or a preview release

2019-09-15 Thread Wenchen Fan
I don't expect to see a large DS V2 API change from now on. But we may update the API a little bit if we find problems during the preview. On Sat, Sep 14, 2019 at 10:16 PM Sean Owen wrote: > I don't think this suggests anything is finalized, including APIs. I > would not guess there will be

Re: [VOTE][SPARK-28885] Follow ANSI store assignment rules in table insertion by default

2019-09-12 Thread Wenchen Fan
5:28 PM > *To:* Alastair Green > *Cc:* Reynold Xin; Wenchen Fan; Spark dev list; Gengliang Wang > *Subject:* Re: [VOTE][SPARK-28885] Follow ANSI store assignment rules in > table insertion by default > > > We discussed this thread quite a bit in the DSv2 sync up and Russell >

Re: Welcoming some new committers and PMC members

2019-09-09 Thread Wenchen Fan
Congratulations! On Tue, Sep 10, 2019 at 10:19 AM Yuanjian Li wrote: > Congratulations! > > On Tue, Sep 10, 2019 at 10:15 AM, sujith chacko wrote: > >> Congratulations all. >> >> On Tue, 10 Sep 2019 at 7:27 AM, Haibo wrote: >> >>> congratulations~ >>> >>> >>> >>> On Sep 10, 2019 at 09:30, Joseph Torres >>> wrote: >>>

Re: DSv2 sync - 4 September 2019

2019-09-09 Thread Wenchen Fan
Hi Nicholas, You are talking about a different thing. The PERMISSIVE mode is the failure mode for reading text-based data source (json, csv, etc.). It's not the general failure mode for Spark table insertion. I agree with you that the PERMISSIVE mode is hard to use. Feel free to open a JIRA

Re: [VOTE][SPARK-28885] Follow ANSI store assignment rules in table insertion by default

2019-09-05 Thread Wenchen Fan
+1 To be honest I don't like the legacy policy. It's too loose and easy for users to make mistakes, especially when Spark returns null if a function hit errors like overflow. The strict policy is not good either. It's too strict and stops valid use cases like writing timestamp values to a date
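The three policies under vote correspond to a 3.0 config (ANSI being the proposed default):

    spark.conf.set("spark.sql.storeAssignmentPolicy", "ANSI") // or "LEGACY" / "STRICT"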

Re: [ANNOUNCE] Announcing Apache Spark 2.4.4

2019-09-01 Thread Wenchen Fan
Great! Thanks! On Mon, Sep 2, 2019 at 5:55 AM Dongjoon Hyun wrote: > We are happy to announce the availability of Spark 2.4.4! > > Spark 2.4.4 is a maintenance release containing stability fixes. This > release is based on the branch-2.4 maintenance branch of Spark. We strongly > recommend all

Re: [VOTE] Release Apache Spark 2.4.4 (RC3)

2019-08-28 Thread Wenchen Fan
+1, no more blocking issues that I'm aware of. On Wed, Aug 28, 2019 at 8:33 PM Sean Owen wrote: > +1 from me again. > > On Tue, Aug 27, 2019 at 6:06 PM Dongjoon Hyun > wrote: > > > > Please vote on releasing the following candidate as Apache Spark version > 2.4.4. > > > > The vote is open

Re: [VOTE] Release Apache Spark 2.3.4 (RC1)

2019-08-27 Thread Wenchen Fan
+1 On Wed, Aug 28, 2019 at 2:43 AM DB Tsai wrote: > +1 > > Sincerely, > > DB Tsai > -- > Web: https://www.dbtsai.com > PGP Key ID: 42E5B25A8F7A82C1 > > On Tue, Aug 27, 2019 at 11:31 AM Dongjoon Hyun > wrote: > > > > +1. > > > > I also

Re: Apache Spark git repo moved to gitbox.apache.org

2019-08-26 Thread Wenchen Fan
yea I think we should, but no need to worry too much about it because gitbox still works in the release scripts. On Tue, Aug 27, 2019 at 3:23 AM Shane Knapp wrote: > revisiting this old thread... > > i noticed from the committers' page on the spark site that the 'apache' > remote should be

Re: JDK11 Support in Apache Spark

2019-08-25 Thread Wenchen Fan
Great work! On Sun, Aug 25, 2019 at 6:03 AM Xiao Li wrote: > Thank you for your contributions! This is a great feature for Spark > 3.0! We finally achieve it! > > Xiao > > On Sat, Aug 24, 2019 at 12:18 PM Felix Cheung > wrote: > >> That’s great! >> >> -- >> *From:*

Re: [VOTE] Release Apache Spark 2.4.4 (RC1)

2019-08-19 Thread Wenchen Fan
Unfortunately, I need to -1. Recently we found that the repartition correctness bug can still be reproduced. The root cause has been identified and there are 2 PRs to fix 2 related issues: https://github.com/apache/spark/pull/25491 https://github.com/apache/spark/pull/25498 I think we should

Re: Release Spark 2.3.4

2019-08-18 Thread Wenchen Fan
+1 On Sat, Aug 17, 2019 at 3:37 PM Hyukjin Kwon wrote: > +1 too > > On Sat, Aug 17, 2019 at 3:06 PM, Dilip Biswal wrote: >> +1 >> >> Regards, >> Dilip Biswal >> Tel: 408-463-4980 >> dbis...@us.ibm.com >> >> >> >> - Original message - >> From: John Zhuge >> To: Xiao Li >> Cc: Takeshi

Re: [build system] colo maintenance & outage tomorrow, 10am-2pm PDT

2019-08-15 Thread Wenchen Fan
Thanks for tracking it Shane! On Fri, Aug 16, 2019 at 7:41 AM Shane Knapp wrote: > just got an update: > > there was a problem w/the replacement part, and they're trying to fix it. > if that's successful, the expect to have power restored within the hour. > > if that doesn't work, a new (new)

Re: Release Apache Spark 2.4.4

2019-08-13 Thread Wenchen Fan
+1 On Wed, Aug 14, 2019 at 12:52 PM Holden Karau wrote: > +1 > Does anyone have any critical fixes they’d like to see in 2.4.4? > > On Tue, Aug 13, 2019 at 5:22 PM Sean Owen wrote: > >> Seems fine to me if there are enough valuable fixes to justify another >> release. If there are any other

Re: displaying "Test build" in PR

2019-08-13 Thread Wenchen Fan
"Can one of the admins verify this patch?" is a corrected message, as Jenkins won't test your PR until an admin approves it. BTW I think "5 minutes" is a reasonable delay for PR testing. It usually takes days to review and merge a PR, so I don't think seeing test progress right after PR creation

Re: [SPARK-23207] Repro

2019-08-12 Thread Wenchen Fan
Hi Tyson, Thanks for reporting it! I quickly checked the related scheduler code but can't find an obvious place that can go wrong with cached RDDs. Sean said that he can't reproduce it, but the second job fails. This is actually expected. We need a lot more changes to completely fix this problem,
