Re: [Spark Catalog API] Support for metadata Backup/Restore

2021-05-11 Thread Wenchen Fan
ient APIs to the end users >> in this approach? The users can only call backup or restore, right? >> >> Thanks, >> Tianchen >> >> On Fri, May 7, 2021 at 12:27 PM Wenchen Fan wrote: >> >>> If a catalog implements backup/restore, it can easily expose some

Re: [VOTE] Release Spark 2.4.8 (RC4)

2021-05-11 Thread Wenchen Fan
I checked the log in https://repository.apache.org/#stagingRepositories, seems the gpg key is not uploaded to the public keyserver. Liang-Chi can you take a look? On Tue, May 11, 2021 at 3:47 PM Wenchen Fan wrote: > +1 > > On Tue, May 11, 2021 at 2:59 AM Holden Kar

Re: [ANNOUNCE] Apache Spark 2.4.8 released

2021-05-18 Thread Wenchen Fan
Thank you, Liang-Chi! On Tue, May 18, 2021 at 1:32 PM Dongjoon Hyun wrote: > Finally! Thank you, Liang-Chi. > > Bests, > Dongjoon. > > > On Mon, May 17, 2021 at 10:14 PM Takeshi Yamamuro > wrote: > >> Thank you for the release work, Liang-Chi~ >> >> On Tue, May 18, 2021 at 2:11 PM Hyukjin Kwon

Re: Apache Spark 3.1.2 Release?

2021-05-18 Thread Wenchen Fan
+1, thanks! On Tue, May 18, 2021 at 1:37 PM Xiao Li wrote: > +1 Thanks, Dongjoon! > > Xiao > > > > On Mon, May 17, 2021 at 8:45 PM Kent Yao wrote: > >> +1. thanks Dongjoon >> >> *Kent Yao * >> @ Data Science Center, Hangzhou Research Institute, NetEase Corp. >> *a spark enthusiast* >> *kyuubi

Re: [Spark Catalog API] Support for metadata Backup/Restore

2021-05-07 Thread Wenchen Fan
If a catalog implements backup/restore, it can easily expose some client APIs to the end-users (e.g. REST API), I don't see a strong reason to expose the APIs to Spark. Do you plan to add new SQL commands in Spark to backup/restore a catalog? On Tue, May 4, 2021 at 2:39 AM Tianchen Zhang wrote:

Re: Resolves too old JIRAs as incomplete

2021-05-20 Thread Wenchen Fan
+1 On Thu, May 20, 2021 at 11:59 AM Dongjoon Hyun wrote: > +1. > > Thank you, Takeshi. > > On Wed, May 19, 2021 at 7:49 PM Hyukjin Kwon wrote: > >> Yeah, I wanted to discuss this. I agree since 2.4.x became EOL >> >> On Thu, May 20, 2021 at 10:54 AM, Sean Owen wrote: >> >>> I agree. Such old JIRAs

Re: [Spark Core]: Adding support for size based partition coalescing

2021-05-24 Thread Wenchen Fan
Ideally this should be handled by the underlying data source to produce a reasonably partitioned RDD as the input data. However if we already have a poorly partitioned RDD at hand and want to repartition it properly, I think an extra shuffle is required so that we can know the partition size
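The point about needing partition sizes before coalescing can be illustrated with a small sketch. This is plain Python, not Spark code: `coalesce_by_size`, the size list, and the 64 MB target are all hypothetical names and values chosen for illustration. It only shows why the sizes must be known up front, which is what materializing the data through a shuffle would provide.

```python
from typing import List


def coalesce_by_size(partition_sizes: List[int], target_size: int) -> List[List[int]]:
    """Group adjacent partition indices so each group stays near target_size."""
    groups: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for index, size in enumerate(partition_sizes):
        # Close the current group if adding this partition would exceed the target.
        if current and current_size + size > target_size:
            groups.append(current)
            current, current_size = [], 0
        current.append(index)
        current_size += size
    if current:
        groups.append(current)
    return groups


# Hypothetical sizes (in MB) of six poorly balanced partitions.
sizes_mb = [10, 20, 200, 5, 5, 60]
groups = coalesce_by_size(sizes_mb, target_size=64)
print(groups)
```

An oversized partition (index 2 here) ends up in a group of its own; splitting it would need a different step that this sketch does not attempt.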

Re: Secrets store for DSv2

2021-05-24 Thread Wenchen Fan
You can take a look at PartitionReaderFactory. It's created at the driver side, serialized and sent to the executor side. For the write side, there is a similar channel: DataWriterFactory On Wed, May 19, 2021 at 4:37 AM Andrew Melo wrote: > Hello, > > When implementing a DSv2 datasource, where
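The driver-to-executor channel described here can be sketched without Spark. The Python below is only an analogy: `ReaderFactory`, its fields, the JSON encoding, and the example endpoint are hypothetical and are not the DSv2 interfaces. It shows the general shape: state captured when the factory is built on the driver travels with the serialized factory and is available when a reader is created on the executor.

```python
import json
from typing import Iterator


class ReaderFactory:
    """Hypothetical stand-in for a factory that is built on the driver."""

    def __init__(self, endpoint: str, token: str) -> None:
        self.endpoint = endpoint
        self.token = token

    def to_bytes(self) -> bytes:
        # Everything the executor needs must be part of the serialized state.
        return json.dumps({"endpoint": self.endpoint, "token": self.token}).encode("utf-8")

    @staticmethod
    def from_bytes(payload: bytes) -> "ReaderFactory":
        state = json.loads(payload.decode("utf-8"))
        return ReaderFactory(state["endpoint"], state["token"])

    def create_reader(self, partition_id: int) -> Iterator[str]:
        # A real reader would use self.token to authenticate; here we only emit labels.
        for row in range(2):
            yield f"{self.endpoint}/p{partition_id}/row{row}"


# "Driver side": build the factory and serialize it.
payload = ReaderFactory("https://example.invalid/data", "secret-123").to_bytes()

# "Executor side": rebuild the factory and create a reader for one partition.
rebuilt = ReaderFactory.from_bytes(payload)
rows = list(rebuilt.create_reader(3))
print(rows)
```

Anything placed in such a factory is visible wherever the serialized bytes go, so whether that is acceptable for a secret depends on the deployment.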

Re: Purpose of OffsetHolder as a LeafNode?

2021-05-24 Thread Wenchen Fan
It's just an intermediate placeholder used to update the query plan in each micro-batch. On Sat, May 15, 2021 at 10:29 PM Jacek Laskowski wrote: > Hi, > > Just stumbled upon OffsetHolder [1] and am curious why it's a LeafNode? > What logical plan could it be part of? > > [1] >

Re: About Spark executs sqlscript

2021-05-24 Thread Wenchen Fan
It's not possible to load everything into memory. We should use a big query connector (one should already exist?) and register tables B and C as temp views in Spark. On Fri, May 14, 2021 at 8:50 AM bo zhao wrote: > Hi Team, > > I've followed Spark community for several years. This is my first

Re: SPIP: Catalog API for view metadata

2021-05-24 Thread Wenchen Fan
hen it happens well >> before table resolution. And, View and Table are very different objects; >> returning Object from this API doesn't make much sense. >> >> One extra RPC is not unreasonable, and the choice should be left to >> sources. That's the easiest place

Re: Bridging gap between Spark UI and Code

2021-05-24 Thread Wenchen Fan
I believe you can already see each plan change Spark did to your query plan in the debug-level logs. I think it's hard to do in the web UI as keeping all these historical query plans is expensive. Mapping the query plan to your application code is nearly impossible, as so many optimizations can

Re: [VOTE] SPIP: Catalog API for view metadata

2021-05-26 Thread Wenchen Fan
OK, then I'd vote for TableViewCatalog, because 1. This is how Hive catalog works, and we need to migrate Hive catalog to the v2 API sooner or later. 2. Because of 1, TableViewCatalog is easy to support in the current table/view resolution framework. 3. It's better to avoid name conflicts between

Re: Bridging gap between Spark UI and Code

2021-05-25 Thread Wenchen Fan
You can see the SQL plan node name in the DAG visualization. Please refer to https://spark.apache.org/docs/latest/web-ui.html for more details. If you still have any confusion, please let us know and we will keep improving the document. On Tue, May 25, 2021 at 4:41 AM mhawes wrote: > @Wenc

Re: [Spark Core]: Adding support for size based partition coalescing

2021-05-25 Thread Wenchen Fan
what does a > repartition() call do if AQE is not enabled? this is essentially a new api > so would repartitionBySize or something be less confusing to users who > already use repartition(num_partitions). > > Tom > > On Monday, May 24, 2021, 12:30:20 PM CDT, Wenchen Fan > wrote: >

Re: [VOTE] Release Spark 2.4.8 (RC3)

2021-04-28 Thread Wenchen Fan
+1 (binding) On Thu, Apr 29, 2021 at 1:05 AM DB Tsai wrote: > +1 (binding) > > > On Apr 28, 2021, at 9:26 AM, Liang-Chi Hsieh wrote: > > > > > > Please vote on releasing the following candidate as Apache Spark version > > 2.4.8. > > > > The vote is open until May 4th at 9AM PST and passes if a

Re: Bintray replacement for spark-packages.org

2021-04-28 Thread Wenchen Fan
Shall we make new releases for 3.0 and 3.1? So that people don't need to change the sbt resolver/pom files to work around Bintray sunset. It's also been a while since the last 3.0 and 3.1 releases. On Tue, Apr 27, 2021 at 9:02 AM Matthew Powers wrote: > Great job fixing this!! I just checked

Re: Binary compatibility issues in 3.1.1?

2021-02-08 Thread Wenchen Fan
This is the cost of relying on Spark internal APIs, and the external connectors need to take care of it. BTW, the Alias change is source-compatible, and it shouldn't break the external connectors if they are compiled with Spark 3.1. On Tue, Feb 9, 2021 at 2:26 AM Alex Ott wrote: > although no,

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-08 Thread Wenchen Fan
This is a very important feature, thanks for working on it! Spark uses codegen by default, and it's a bit unfortunate to see that codegen support is treated as a non-goal. I think it's better to not ask the UDF implementations to provide two different code paths for interpreted evaluation and

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-09 Thread Wenchen Fan
w will Spark detect and produce those representations? What > if a function uses both String *and* UTF8String? Will Spark detect this > for each parameter? Having one or two functions called by Spark is much > easier to maintain in Spark and avoid a lot of debugging headaches when > somethin

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-09 Thread Wenchen Fan
Hi Holden, As Hyukjin said, following existing designs is not the principle of DS v2 API design. We should make sure the DS v2 API makes sense. AFAIK we didn't fully follow the catalog API design from Hive and I believe Ryan also agrees with it. I think the problem here is we were discussing

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-09 Thread Wenchen Fan
(Trino). On Wed, Feb 10, 2021 at 10:18 AM Wenchen Fan wrote: > Hi Holden, > > As Hyukjin said, following existing designs is not the principle of DS v2 > API design. We should make sure the DS v2 API makes sense. AFAIK we didn't > fully follow the catalog API design from Hive an

Re: Welcoming six new Apache Spark committers

2021-03-28 Thread Wenchen Fan
Congrats! On Mon, Mar 29, 2021 at 12:04 PM 郑瑞峰 wrote: > Congratulations to all! > > > -- Original Message -- > *From:* "Yuanjian Li" ; > *Sent:* Monday, March 29, 2021, 10:38 AM > *To:* "Yi Wu"; > *Cc:* "Gengliang Wang";"Xiao Li";"Chao > Sun";"Mridul Muralidharan";"Dongjoon >

Re: [VOTE] SPIP: Support pandas API layer on PySpark

2021-03-28 Thread Wenchen Fan
+1 On Mon, Mar 29, 2021 at 1:45 PM Holden Karau wrote: > +1 > > On Sun, Mar 28, 2021 at 10:25 PM sarutak wrote: > >> +1 (non-binding) >> >> - Kousuke >> >> > +1 (non-binding) >> > >> > On Sun, Mar 28, 2021 at 9:06 PM 郑瑞峰 >> > wrote: >> > >> >> +1 (non-binding) >> >> >> >> --

Re: PR testing and flaky tests (triggering executions separately)

2021-03-29 Thread Wenchen Fan
AFAIK, the checks triggered by GitHub Actions are almost the same as SparkPullRequestBuilder, except that they include one more Scala 2.13 check. So at least we don't have to wait for both SparkPullRequestBuilder and GitHub Actions before merging a PR. On Fri, Mar 26, 2021 at 6:09 PM Attila Zsolt Piros <

Re: [DISCUSS] Support pandas API layer on PySpark

2021-03-16 Thread Wenchen Fan
+1, it's great to have Pandas support in Spark out of the box. On Tue, Mar 16, 2021 at 10:12 PM Takeshi Yamamuro wrote: > +1; the pandas interfaces are pretty popular and supporting them in > pyspark looks promising, I think. > one question I have; what's an initial goal of the proposal? > Is

Re: SessionCatalog lock issue

2021-03-18 Thread Wenchen Fan
The `synchronized` is needed for getting `currentDb` IIUC. So a small change is to only wrap `formatDatabaseName(name.database.getOrElse(currentDb))` with `synchronized`. On Thu, Mar 18, 2021 at 3:38 PM Chang Chen wrote: > hi all > > We met an issue which is related with SessionCatalog
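The suggestion to narrow the lock can be sketched in a language-neutral way. The Python below is an analogy, not the SessionCatalog code: `MiniCatalog` and its methods are hypothetical, and `threading.Lock` stands in for Scala's `synchronized`. The lock is held only while the shared current-database value is read, and the rest of the work runs outside it.

```python
import threading
from typing import Optional


class MiniCatalog:
    """Hypothetical, heavily simplified stand-in for a session catalog."""

    def __init__(self, current_db: str) -> None:
        self._lock = threading.Lock()
        self._current_db = current_db

    def set_current_database(self, name: str) -> None:
        with self._lock:
            self._current_db = name

    def _resolve_db(self, database: Optional[str]) -> str:
        # Only this read of shared mutable state needs the lock.
        with self._lock:
            return (database or self._current_db).lower()

    def qualify(self, table: str, database: Optional[str] = None) -> str:
        db = self._resolve_db(database)
        # Any slow work (for example a metastore call) would happen here,
        # outside the lock, so other threads are not blocked by it.
        return f"{db}.{table.lower()}"


catalog = MiniCatalog("Default")
qualified_default = catalog.qualify("Events")
catalog.set_current_database("Sales")
qualified_sales = catalog.qualify("Events")
qualified_explicit = catalog.qualify("Events", database="Archive")
print(qualified_default, qualified_sales, qualified_explicit)
```

The trade-off is that the database name is a snapshot: it may change between the locked read and the later work, which is only acceptable if callers tolerate that.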

Re: Observable Metrics on Spark Datasets

2021-03-18 Thread Wenchen Fan
I think a listener-based API makes sense for streaming (since you need to keep watching the result), but may not be very reasonable for batch queries (you only get the result once). The idea of Observation looks good, but we should define what happens if `observation.get` is called before the

Re: [VOTE] SPIP: Add FunctionCatalog

2021-03-09 Thread Wenchen Fan
+1 (binding) On Tue, Mar 9, 2021 at 1:47 PM Russell Spitzer wrote: > +1 (for what it's worth) > > Thanks for making such a robust proposal, i'm excited to see the new work > coming from this > > On Mar 8, 2021, at 11:44 PM, Dongjoon Hyun > wrote: > > +1 (binding) > > Thank you, Ryan. > >

Re: Apache Spark 3.2 Expectation

2021-03-11 Thread Wenchen Fan
There are many projects going on right now, such as new DS v2 APIs, ANSI interval types, join improvement, disaggregated shuffle, etc. I don't think it's realistic to do the branch cut in April. I'm +1 to release 3.2 around July, but it doesn't mean we have to cut the branch 3 months earlier. We

Re: [DISCUSS] SPIP: FunctionCatalog

2021-03-02 Thread Wenchen Fan
Yes, GenericInternalRow is safe when types mismatch, at the cost of using Object[] and boxing primitive types. And this is a runtime failure, which is absolutely worse than query-compile-time checks. Also, don't forget my previous point: users need to specify the type and index

Re: [DISCUSS] SPIP: FunctionCatalog

2021-03-03 Thread Wenchen Fan
ap that requires string keys and some consistent type for >>> values. This would be easy with the InternalRow API because you can use >>> getString(pos) and get(pos + 1, valueType) to get the key/value pairs. >>> Use of UTF8String vs String will be checked at compile time. &

Re: [ANNOUNCE] Announcing Apache Spark 3.1.1

2021-03-03 Thread Wenchen Fan
Great work and congrats! On Wed, Mar 3, 2021 at 3:51 PM Kent Yao wrote: > Congrats, all! > > Bests, > *Kent Yao * > @ Data Science Center, Hangzhou Research Institute, NetEase Corp. > *a spark enthusiast* > *kyuubi is a unified multi-tenant JDBC > interface

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-23 Thread Wenchen Fan
s? > > Especially, Wenchen, could you make your PR based on Ryan's PR? > > If we collect the scattered ideas into a single PR, that would be greatly > helpful not only for further discussions, but also when we go on a vote on > Ryan's PR or Wenchen's PR. > > Bests, > Dongjo

Re: [VOTE] Release Spark 2.4.8 (RC1)

2021-04-07 Thread Wenchen Fan
+1 On Thu, Apr 8, 2021 at 9:24 AM Sean Owen wrote: > Looks good to me testing on Java 8, Hadoop 2.7, Ubuntu, with about all > profiles enabled. > I still get an odd failure in the Hive versions suite, but I keep seeing > that in my env and think it's something odd about my setup. > +1 >

Re: Increase the number of parallel jobs in GitHub Actions at ASF organization level

2021-04-07 Thread Wenchen Fan
> for example, having sub-groups where each group shares the resources - currently one GitHub organisation shares all resources across the projects. That's a good idea. We do need to thank GitHub for giving free resources to ASF projects, but it's better if we can make it a business: we allow

Re: Big Broadcast Hash Join with Dynamic Partition Pruning gives wrong results

2021-04-07 Thread Wenchen Fan
Hi Tomas, thanks for reporting this bug! Is it possible to share your dataset so that other people can reproduce and debug it? On Thu, Apr 8, 2021 at 7:52 AM Tomas Bartalos wrote: > when I try to do a Broadcast Hash Join on a bigger table (6Mil rows) I get > an incorrect result of 0 rows. > >

Re: Apache Spark 3.2 Expectation

2021-04-11 Thread Wenchen Fan
ust for an update, I will send a discussion email about my idea late >> this week or early next week. >> >> On Thu, Mar 11, 2021 at 7:00 PM, Wenchen Fan wrote: >> >>> There are many projects going on right now, such as new DS v2 APIs, ANSI >>> interval types, join improve

Re: [VOTE] Release Spark 2.4.8 (RC2)

2021-04-14 Thread Wenchen Fan
+1 (binding) On Thu, Apr 15, 2021 at 12:22 AM Maxim Gekk wrote: > +1 (non-binding) > > On Wed, Apr 14, 2021 at 6:39 PM Dongjoon Hyun > wrote: > >> +1 >> >> Bests, >> Dongjoon. >> >> On Tue, Apr 13, 2021 at 10:38 PM Kent Yao wrote: >> >>> +1 (non-binding) >>> >>> *Kent Yao * >>> @ Data Science

Re: [DISCUSS] Add error IDs

2021-04-21 Thread Wenchen Fan
I think severity makes sense for logs, but not sure about errors. +1 to the proposal to improve the error message further. On Fri, Apr 16, 2021 at 6:01 PM Yuming Wang wrote: > +1 for this proposal. > > On Fri, Apr 16, 2021 at 5:15 AM Karen wrote: > >> We could leave space in the numbering

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-17 Thread Wenchen Fan
>>>> >> would >>>> >>>> >> also simplify using the functions in other contexts like >>>> pushing down >>>> >>>> >> filters into the ORC & Parquet readers although there are a lot >>>>

Re: [VOTE] Release Spark 3.0.2 (RC1)

2021-02-16 Thread Wenchen Fan
+1 On Wed, Feb 17, 2021 at 1:43 PM Dongjoon Hyun wrote: > +1 > > Bests, > Dongjoon. > > > On Tue, Feb 16, 2021 at 2:27 AM Herman van Hovell > wrote: > >> +1 >> >> On Tue, Feb 16, 2021 at 11:08 AM Hyukjin Kwon >> wrote: >> >>> +1 >>> >>> On Tue, Feb 16, 2021 at 5:10 PM, Prashant Sharma wrote: >>>

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-17 Thread Wenchen Fan
y of >>>>> >>>> >> getting to and calling UDFs. >>>>> >>>> >> >>>>> >>>> >> I like having the ScalarFunction as the API to call the UDFs. >>>>> It is >>>>> >>>>

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-21 Thread Wenchen Fan
oblem when implementations define methods with the wrong > types? The InternalRow approach helps implementations catch that problem > (as demonstrated above) and also provides a fallback when there is a but > preventing the invoke optimization from working. That seems like a good > approach

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-22 Thread Wenchen Fan
to the Spark analyzer that the output should > be of type MapData (i.e., based on what was seen in the > input at compile time). The whole UDF becomes a Spark Expression > <https://github.com/linkedin/transport/blob/master/transportable-udfs-spark/src/main/scala/com/linkedin/transp

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-18 Thread Wenchen Fan
alRow and that it isn’t a usability >> problem to include it. >> >> Oh, and one last thought is that we already have users that call >> Dataset.map and use InternalRow. This would allow converting that code >> directly to a UDF. >> >> I think we’re

Re: [SQL] s.s.a.coalescePartitions.parallelismFirst true but recommends false

2021-09-06 Thread Wenchen Fan
This is correct. It's true by default so that AQE doesn't cause a performance regression. If you run a benchmark, larger parallelism usually means better performance. However, it's recommended to set it to false, so that AQE can give better resource utilization, which is good for a busy Spark

Re: [SQL] When SQLConf vals gets own accessor defs?

2021-09-06 Thread Wenchen Fan
I think SQLConf doesn't need defs anymore. In the beginning, SQLConf lived in sql/core, so we had to add defs if the code in sql/catalyst needed to access configs. Now that SQLConf is in sql/catalyst (this was done a few years ago), defs are only needed if we have some special logic that is not just

Re: [ANNOUNCE] Apache Spark 3.2.0

2021-10-19 Thread Wenchen Fan
Yea the file naming is a bit confusing, we can fix it in the next release. 3.2 actually means 3.2 or higher, so not a big deal I think. Congrats and thanks! On Wed, Oct 20, 2021 at 3:44 AM Jungtaek Lim wrote: > Thanks to Gengliang for driving this huge release! > > On Wed, Oct 20, 2021 at 1:50

Re: [VOTE] Release Spark 3.2.0 (RC7)

2021-10-10 Thread Wenchen Fan
+1 On Sat, Oct 9, 2021 at 2:36 PM angers zhu wrote: > +1 (non-binding) > > On Sat, Oct 9, 2021 at 2:06 PM, Cheng Pan wrote: > >> +1 (non-binding) >> >> Integration test passed[1] with my project[2]. >> >> [1] >> https://github.com/housepower/spark-clickhouse-connector/runs/3834335017 >> [2]

Re: [VOTE] SPIP: Row-level operations in Data Source V2

2021-11-16 Thread Wenchen Fan
+1 On Mon, Nov 15, 2021 at 2:54 AM John Zhuge wrote: > +1 (non-binding) > > On Sun, Nov 14, 2021 at 10:33 AM Chao Sun wrote: > >> +1 (non-binding). Thanks Anton for the work! >> >> On Sun, Nov 14, 2021 at 10:01 AM Ryan Blue wrote: >> >>> +1 >>> >>> Thanks to Anton for all this great work! >>>

Re: [FYI] Build and run tests on Java 17 for Apache Spark 3.3

2021-11-16 Thread Wenchen Fan
Great job! On Sat, Nov 13, 2021 at 11:18 AM Hyukjin Kwon wrote: > Awesome! > > On Sat, Nov 13, 2021 at 12:04 PM Xiao Li wrote: > >> Thank you! Great job! >> >> Xiao >> >> >> On Fri, Nov 12, 2021 at 7:02 PM Mridul Muralidharan >> wrote: >> >>> >>> Nice job ! >>> There are some nice API's which

Re: Supports Dynamic Table Options for Spark SQL

2021-11-16 Thread Wenchen Fan
It's useful to have a SQL API to specify table options, similar to the DataFrameReader API. However, I share the same concern from @Hyukjin Kwon and am not very comfortable with using hints to do it. In the PR, someone mentioned TVF. I think it's better than hints, but still has problems. For

Re: [DISCUSS] SPIP: Storage Partitioned Join for Data Source V2

2021-10-27 Thread Wenchen Fan
et started because it fills > an existing gap. More complex use cases can be supported over time. > > Ryan > > On Wed, Oct 27, 2021 at 9:08 AM Wenchen Fan wrote: > >> IIUC, the general idea is to let each input split report its partition >> value, and Spark can

Re: [DISCUSS] SPIP: Storage Partitioned Join for Data Source V2

2021-10-26 Thread Wenchen Fan
+1 to this SPIP and nice writeup of the design doc! Can we open comment permission in the doc so that we can discuss details there? On Tue, Oct 26, 2021 at 8:29 PM Hyukjin Kwon wrote: > Seems making sense to me. > > Would be great to have some feedback from people such as @We

Re: [DISCUSS] SPIP: Storage Partitioned Join for Data Source V2

2021-10-28 Thread Wenchen Fan
> `BoundFunction` directly. That is easier than defining a way for Spark to > query the function catalog. > > In any case, I'm sure it's easy to understand how this works once you get > a concrete implementation. > > On Wed, Oct 27, 2021 at 9:35 AM Wenchen Fan wrote: > >> `B

Re: [VOTE] SPIP: Storage Partitioned Join for Data Source V2

2021-10-31 Thread Wenchen Fan
+1 On Sat, Oct 30, 2021 at 8:58 AM Cheng Su wrote: > +1 > > > > Thanks, > > Cheng Su > > > > *From: *Holden Karau > *Date: *Friday, October 29, 2021 at 12:41 PM > *To: *DB Tsai > *Cc: *Dongjoon Hyun , Ryan Blue , > dev , huaxin gao > *Subject: *Re: [VOTE] SPIP: Storage Partitioned Join for

Re: Issue Upgrading to 3.2

2021-11-01 Thread Wenchen Fan
Hi Adam, Thanks for reporting this issue! Do you have the full stacktrace or a code snippet to reproduce the issue at Spark side? It looks like a bug, but it's not obvious to me how this bug can happen. Thanks, Wenchen On Sat, Oct 30, 2021 at 1:08 AM Adam Binford wrote: > Hi devs, > > I'm

Re: [DISCUSS] SPIP: Row-level operations in Data Source V2

2021-11-01 Thread Wenchen Fan
The general idea looks great. This is indeed a complicated API and we probably need more time to evaluate the API design. It's better to commit this work earlier so that we have more time to verify it before the 3.3 release. Maybe we can commit the group-based API first, then the delta-based one,

Re: Issue Upgrading to 3.2

2021-11-01 Thread Wenchen Fan
Function registration: > Catalog.expressions.foreach(f => { > val functionIdentifier = > FunctionIdentifier(f.getClass.getSimpleName.dropRight(1)) > val expressionInfo = new ExpressionInfo( > f.getClass.getCanonicalName, > functionIdentifier.database.orNull, &g

Re: [DISCUSS] SPIP: Storage Partitioned Join for Data Source V2

2021-10-27 Thread Wenchen Fan
; >> >> >> Another major benefit for having bucketed table, is to avoid shuffle >> before aggregate. Just want to bring to our attention that it would be >> great to consider aggregate as well when doing this proposal. >> >> >> >>1. Any major us

Re: Time for Spark 3.2.1?

2021-12-06 Thread Wenchen Fan
+1 to make new maintenance releases for all 3.x branches. On Tue, Dec 7, 2021 at 8:57 AM Sean Owen wrote: > Always fine by me if someone wants to roll a release. > > It's been ~6 months since the last 3.0.x and 3.1.x releases, too; a new > release of those wouldn't hurt either, if any of our

Re: [Apache Spark Jenkins] build system shutting down Dec 23th, 2021

2021-12-06 Thread Wenchen Fan
Thanks, Shane! Really appreciate it! Wenchen On Tue, Dec 7, 2021 at 12:38 PM Xiao Li wrote: > Hi, Shane, > > Thank you for your work on it! > > Xiao > > > > > On Mon, Dec 6, 2021 at 6:20 PM L. C. Hsieh wrote: > >> Thank you, Shane. >> >> On Mon, Dec 6, 2021 at 4:27 PM Holden Karau wrote: >>

Re: Difference in behavior for Spark 3.0 vs Spark 3.1 "create database "

2022-01-11 Thread Wenchen Fan
Hopefully, this StackOverflow answer can solve your problem: https://stackoverflow.com/questions/47523037/how-do-i-configure-pyspark-to-write-to-hdfs-by-default Spark doesn't control the behavior of qualifying paths. It's decided by certain configs and/or config files. On Tue, Jan 11, 2022 at

Re: Apache Spark 3.3 Release

2022-03-16 Thread Wenchen Fan
+1 to define an allowlist of features that we want to backport to branch 3.3. I also have a few in mind: complex type support in the vectorized Parquet reader (https://github.com/apache/spark/pull/34659); refining the DS v2 filter API for JDBC v2 (https://github.com/apache/spark/pull/35768); a few new SQL

Re: Data correctness issue with Repartition + FetchFailure

2022-03-16 Thread Wenchen Fan
ous fix for > repartition works for deterministic data. With non-deterministic data, I > didn't find an API to pass DeterministicLevel to underlying rdd. > Do you plan to continue work on integration with SQL operators? If not, > I'm available to take a stab. > > On Mon, Mar 14, 20

Re: Data correctness issue with Repartition + FetchFailure

2022-03-14 Thread Wenchen Fan
We fixed the repartition correctness bug before, by sorting the data before doing round-robin partitioning. But the issue is that we need to propagate the isDeterministic property through SQL operators. On Tue, Mar 15, 2022 at 1:50 AM Jason Xu wrote: > Hi Reynold, do you suggest removing
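Why sorting before round-robin partitioning helps can be shown with a toy model. This is plain Python, not Spark's implementation: `round_robin` and the sample rows are hypothetical. It assumes a retried task produces the same rows in a different order, which is the case the sort-based fix addresses.

```python
from typing import List, Sequence


def round_robin(rows: Sequence[int], num_partitions: int, sort_first: bool) -> List[List[int]]:
    """Assign rows to partitions by position, optionally sorting them first."""
    ordered = sorted(rows) if sort_first else list(rows)
    partitions: List[List[int]] = [[] for _ in range(num_partitions)]
    for position, row in enumerate(ordered):
        partitions[position % num_partitions].append(row)
    return partitions


first_attempt = [5, 1, 4, 2, 3]
retried_attempt = [3, 2, 5, 1, 4]  # same rows, different arrival order

# Without sorting, the two attempts place rows in different partitions.
unsorted_same = round_robin(first_attempt, 2, False) == round_robin(retried_attempt, 2, False)
# With sorting, placement depends only on the row values, not on arrival order.
sorted_same = round_robin(first_attempt, 2, True) == round_robin(retried_attempt, 2, True)
print(unsorted_same, sorted_same)
```

Sorting only helps when the rows themselves are identical across attempts. If the upstream output is non-deterministic, the rows differ and sorting cannot restore consistency, which is why knowing whether the input is deterministic matters.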

Re: [VOTE] Spark 3.1.3 RC4

2022-02-15 Thread Wenchen Fan
+1 On Tue, Feb 15, 2022 at 3:59 PM Yuming Wang wrote: > +1 (non-binding). > > On Tue, Feb 15, 2022 at 10:22 AM Ruifeng Zheng > wrote: > >> +1 (non-binding) >> >> checked the release script issue Dongjoon mentioned: >> >> curl -s >>

Re: Apache Spark 3.3 Release

2022-03-21 Thread Wenchen Fan
Shall we revisit this list after a week? Ideally, they should be either merged or rejected for 3.3, so that we can cut rc1. We can still discuss them case by case at that time if there are exceptions. On Sat, Mar 19, 2022 at 5:27 AM Dongjoon Hyun wrote: > Thank you for your summarization. > > I

Re: Apache Spark 3.3 Release

2022-03-21 Thread Wenchen Fan
Just checked the release calendar, the planned RC cut date is April: [image: image.png] Let's revisit after 2 weeks then? On Mon, Mar 21, 2022 at 2:47 PM Wenchen Fan wrote: > Shall we revisit this list after a week? Ideally, they should be either > merged or rejected for 3.3, so that we c

Re: bazel and external/

2022-03-21 Thread Wenchen Fan
How about renaming it to `connectors` if docker is the only exception and will be moved out? On Sat, Mar 19, 2022 at 6:18 PM Alkis Evlogimenos wrote: > It looks like renaming the directory and moving components can be separate > steps. If there is consensus that connectors will move out, should

Re: [VOTE] Spark 3.1.3 RC3

2022-02-07 Thread Wenchen Fan
Shall we use the release scripts of branch 3.1 to release 3.1? On Fri, Feb 4, 2022 at 4:57 AM Holden Karau wrote: > Good catch Dongjoon :) > > This release candidate fails, but feel free to keep testing for any other > potential blockers. > > I’ll roll RC4 next week with the older release

Re: [VOTE] SPIP: Catalog API for view metadata

2022-02-07 Thread Wenchen Fan
+1 (binding) On Sun, Feb 6, 2022 at 10:27 AM Jacky Lee wrote: > +1 (non-binding). Thanks John! > It's great to see ViewCatalog moving on, it's a nice feature. > > On Sat, Feb 5, 2022 at 11:57, Terry Kim wrote: > >> +1 (non-binding). Thanks John! >> >> Terry >> >> On Fri, Feb 4, 2022 at 4:13 PM Yufei Gu

Re: [VOTE] Release Spark 3.2.1 (RC2)

2022-01-24 Thread Wenchen Fan
+1 On Tue, Jan 25, 2022 at 10:13 AM Ruifeng Zheng wrote: > +1 (non-binding) > > > -- Original Message -- > *From:* "Kent Yao" ; > *Sent:* Tuesday, January 25, 2022, 10:09 AM > *To:* "John Zhuge"; > *Cc:* "dev"; > *Subject:* Re: [VOTE] Release Spark 3.2.1 (RC2) > > +1, non-binding > > John

Re: [VOTE] Release Apache Spark 3.5.0 (RC5)

2023-09-11 Thread Wenchen Fan
+1 On Tue, Sep 12, 2023 at 9:00 AM Yuanjian Li wrote: > +1 (non-binding) > > On Mon, Sep 11, 2023 at 09:36, Yuanjian Li wrote: > >> @Peter Toth I've looked into the details of this >> issue, and it appears that it's neither a regression in version 3.5.0 nor a >> correctness issue. It's a bug related to a

Re: [VOTE] Release Apache Spark 3.5.0 (RC3)

2023-08-31 Thread Wenchen Fan
Sorry for the last-minute bug report, but we found a regression in 3.5: the SQL INSERT command without a column list fills missing columns with NULL, while Spark 3.4 does not allow it. According to the SQL standard, this shouldn't be allowed, and thus it is a regression in 3.5. The fix has been merged

Re: Welcome to Our New Apache Spark Committer and PMCs

2023-10-03 Thread Wenchen Fan
Congrats! On Wed, Oct 4, 2023 at 8:25 AM Hyukjin Kwon wrote: > Woohoo! > > On Tue, 3 Oct 2023 at 22:47, Hussein Awala wrote: > >> Congrats to all of you! >> >> On Tue 3 Oct 2023 at 08:15, Rui Wang wrote: >> >>> Congratulations! Well deserved! >>> >>> -Rui >>> >>> >>> On Mon, Oct 2, 2023 at

Re: [VOTE] SPIP: State Data Source - Reader

2023-10-23 Thread Wenchen Fan
+1 On Mon, Oct 23, 2023 at 4:03 PM Jungtaek Lim wrote: > Starting with my +1 (non-binding). Thanks! > > On Mon, Oct 23, 2023 at 1:23 PM Jungtaek Lim > wrote: > >> Hi all, >> >> I'd like to start the vote for SPIP: State Data Source - Reader. >> >> The high level summary of the SPIP is that we

Re: Spark writing API

2023-08-17 Thread Wenchen Fan
o Wenchen, > > On Wed, Aug 16, 2023 at 23:33 Wenchen Fan wrote: > >> > is there a way to hint to the downstream users on the number of rows >> expected to write? >> >> It will be very hard to do. Spark pipelines the execution (within shuffle >> boundaries) a

Re: Spark writing API

2023-08-16 Thread Wenchen Fan
> is there a way to hint to the downstream users on the number of rows expected to write? It will be very hard to do. Spark pipelines the execution (within shuffle boundaries) and we can't predict the number of final output rows. On Mon, Aug 7, 2023 at 8:27 PM Steve Loughran wrote: > > > On

Re: PR builder not working now

2022-04-19 Thread Wenchen Fan
Thank you, Hyukjin! On Wed, Apr 20, 2022 at 7:48 AM Dongjoon Hyun wrote: > It's great! Thank you. :) > > On Tue, Apr 19, 2022 at 4:42 PM Hyukjin Kwon wrote: > >> It's fixed now. >> >> On Tue, 19 Apr 2022 at 08:33, Hyukjin Kwon wrote: >> >>> It's still persistent. I will send an email to

Re: Unable to create view due to up cast error when migrating from Hive to Spark

2022-05-18 Thread Wenchen Fan
A view is essentially a SQL query. It's fragile to share views between Spark and Hive because different systems have different SQL dialects. They may interpret the view SQL query differently and introduce unexpected behaviors. In this case, Spark returns decimal type for gender * 0.3 - 0.1 but

Re: [VOTE] Release Spark 3.3.0 (RC2)

2022-05-19 Thread Wenchen Fan
I think it should have been fixed by https://github.com/apache/spark/commit/0fdb6757946e2a0991256a3b73c0c09d6e764eed . Maybe the fix is not completed... On Thu, May 19, 2022 at 2:16 PM Kent Yao wrote: > Thanks, Maxim. > > Leave my -1 for this release candidate. > > Unfortunately, I don't know

Re: SIGMOD System Award for Apache Spark

2022-05-13 Thread Wenchen Fan
Great! Congratulations to everyone! On Fri, May 13, 2022 at 10:38 AM Gengliang Wang wrote: > Congratulations to the whole spark community! > > On Fri, May 13, 2022 at 10:14 AM Jungtaek Lim < > kabhwan.opensou...@gmail.com> wrote: > >> Congrats Spark community! >> >> On Fri, May 13, 2022 at

Re: [DISCUSS][Catalog API] Deprecate 4 Catalog API that takes two parameters which are (dbName, tableName/functionName)

2022-07-08 Thread Wenchen Fan
It's better to keep all APIs working. But in this case, I really have no idea how to make these 4 APIs reasonable. For example, tableExists(dbName: String, tableName: String) currently checks if table "dbName.tableName" exists in the Hive metastore, and does not work with v2 catalogs at all. It's

Re: Apache Spark 3.2.2 Release?

2022-07-06 Thread Wenchen Fan
+1 On Thu, Jul 7, 2022 at 10:41 AM Xinrong Meng wrote: > +1 > > Thanks! > > > Xinrong Meng > > Software Engineer > > Databricks > > > On Wed, Jul 6, 2022 at 7:25 PM Xiao Li wrote: > >> +1 >> >> Xiao >> >> On Wed, Jul 6, 2022 at 19:16, Cheng Su wrote: >> >>> +1 (non-binding) >>> >>> Thanks, >>> Cheng Su >>>

Re: [VOTE][SPIP] Spark Connect

2022-06-14 Thread Wenchen Fan
+1 On Tue, Jun 14, 2022 at 9:38 AM Ruifeng Zheng wrote: > +1 > > > -- Original Message -- > *From:* "huaxin gao" ; > *Sent:* Tuesday, June 14, 2022, 8:47 AM > *To:* "L. C. Hsieh"; > *Cc:* "Spark dev list"; > *Subject:* Re: [VOTE][SPIP] Spark Connect > > +1 > > On Mon, Jun 13, 2022 at 5:42

Re: [VOTE] Release Spark 3.3.0 (RC6)

2022-06-13 Thread Wenchen Fan
+1, tests are all green and there are no more blocker issues AFAIK. On Fri, Jun 10, 2022 at 12:27 PM Maxim Gekk wrote: > Please vote on releasing the following candidate as > Apache Spark version 3.3.0. > > The vote is open until 11:59pm Pacific time June 14th and passes if a > majority +1 PMC

Re: [VOTE] Release Spark 3.2.2 (RC1)

2022-07-14 Thread Wenchen Fan
+1 On Wed, Jul 13, 2022 at 7:29 PM Yikun Jiang wrote: > +1 (non-binding) > > Checked out tag and built from source on Linux aarch64 and ran some basic > test. > > > Regards, > Yikun > > > On Wed, Jul 13, 2022 at 5:54 AM Mridul Muralidharan > wrote: > >> >> +1 >> >> Signatures, digests, etc

Re: [VOTE] Release Spark 3.3.0 (RC1)

2022-05-10 Thread Wenchen Fan
I'd like to see an RC2 as well. There is kind of a correctness bug fixed after RC1 was cut: https://github.com/apache/spark/pull/36468 Users may hit this bug much more frequently if they enable ANSI mode. It's not a regression so I'd vote -0. On Wed, May 11, 2022 at 5:24 AM Thomas graves wrote:

Re: [VOTE] Release Spark 3.3.1 (RC4)

2022-10-18 Thread Wenchen Fan
+1 On Wed, Oct 19, 2022 at 4:59 AM Chao Sun wrote: > +1. Thanks Yuming! > > Chao > > On Tue, Oct 18, 2022 at 1:18 PM Thomas graves wrote: > > > > +1. Ran internal test suite. > > > > Tom > > > > On Sun, Oct 16, 2022 at 9:14 PM Yuming Wang wrote: > > > > > > Please vote on releasing the

Re: [DISCUSS] SPIP: Support Docker Official Image for Spark

2022-09-19 Thread Wenchen Fan
+1 On Mon, Sep 19, 2022 at 2:59 PM Yang,Jie(INF) wrote: > +1 (non-binding) > > > > Yang Jie > -- > *From:* Yikun Jiang > *Sent:* September 19, 2022 14:23:14 > *To:* Denny Lee > *Cc:* bo zhaobo; Yuming Wang; Kent Yao; Gengliang Wang; Hyukjin Kwon; > dev; zrf > *Subject:* Re:

Re: Non-deterministic function duplicated in final Spark plan

2022-08-01 Thread Wenchen Fan
This is a hard one. Spark duplicates the join child plan if it's a self-join because Spark does not support diamond-shaped query plans. It seems the only option here is to write the join child plan to a parquet table (or use a shuffle) and read it back. On Mon, Aug 1, 2022 at 4:46 PM Enrico Minack
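The effect described in this message can be illustrated without Spark. What follows is a minimal plain-Python sketch, not Spark code: `child_plan` and `self_join` are hypothetical stand-ins, and holding the rows in a list plays the role of "write it to a table and read it back". It only shows why evaluating a non-deterministic child twice gives the two join sides different values, while materializing it once keeps them consistent.

```python
import random


def child_plan():
    """Hypothetical stand-in for a join child with a non-deterministic column."""
    return [(i, random.random()) for i in range(5)]


def self_join(left, right):
    """Join two row lists on the id column, pairing the value from each side."""
    right_by_id = dict(right)
    return [(i, value, right_by_id[i]) for i, value in left]


# Duplicated plan: each side evaluates the child again, so values disagree.
duplicated = self_join(child_plan(), child_plan())

# Materialized once, then reused for both sides of the join.
materialized = child_plan()
stable = self_join(materialized, materialized)
```

In `stable` the two value columns always agree; in `duplicated` they almost certainly do not.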

Re: [ANNOUNCE] Apache Spark 3.2.3 released

2022-11-30 Thread Wenchen Fan
Thanks, Chao! On Wed, Nov 30, 2022 at 1:33 AM Chao Sun wrote: > We are happy to announce the availability of Apache Spark 3.2.3! > > Spark 3.2.3 is a maintenance release containing stability fixes. This > release is based on the branch-3.2 maintenance branch of Spark. We strongly > recommend

Re: [VOTE] Release Spark 3.2.3 (RC1)

2022-11-16 Thread Wenchen Fan
+1 On Thu, Nov 17, 2022 at 10:20 AM Yang,Jie(INF) wrote: > +1, non-binding > > > > The test combination of Java 11 + Scala 2.12 and Java 11 + Scala 2.13 has > passed. > > > > Yang Jie > > > > *From**: *Chris Nauroth > *Date**: *Thursday, November 17, 2022 04:27 > *To**: *Yuming Wang > *Cc**:

Re: [VOTE][SPIP] Better Spark UI scalability and Driver stability for large applications

2022-11-16 Thread Wenchen Fan
+1, I'm looking forward to it! On Thu, Nov 17, 2022 at 9:44 AM Ye Zhou wrote: > +1 (non-binding) > Thanks for proposing this improvement to SHS, it resolves the main > performance issue within SHS. > > On Wed, Nov 16, 2022 at 1:15 PM Jungtaek Lim > wrote: > >> +1 >> >> Nice to see the chance

Re: [DISCUSSION] SPIP: Asynchronous Offset Management in Structured Streaming

2022-11-30 Thread Wenchen Fan
+1 to improve the widely used micro-batch mode first. On Thu, Dec 1, 2022 at 8:49 AM Hyukjin Kwon wrote: > +1 > > On Thu, 1 Dec 2022 at 08:10, Shixiong Zhu wrote: > >> +1 >> >> This is exciting. I agree with Jerry that this SPIP and continuous >> processing are orthogonal. This SPIP itself

Re: [VOTE][SPIP] Asynchronous Offset Management in Structured Streaming

2022-12-01 Thread Wenchen Fan
+1 On Thu, Dec 1, 2022 at 12:31 PM Shixiong Zhu wrote: > +1 > > > On Wed, Nov 30, 2022 at 8:04 PM Hyukjin Kwon wrote: > >> +1 >> >> On Thu, 1 Dec 2022 at 12:39, Mridul Muralidharan >> wrote: >> >>> >>> +1 >>> >>> Regards, >>> Mridul >>> >>> On Wed, Nov 30, 2022 at 8:55 PM Xingbo Jiang >>>

Re: [DISCUSS] SPIP: Better Spark UI scalability and Driver stability for large applications

2022-11-15 Thread Wenchen Fan
This looks great! UI stability/scalability has been a pain point for a long time. On Sat, Nov 12, 2022 at 5:24 AM Gengliang Wang wrote: > Hi Everyone, > > I want to discuss the "Better Spark UI scalability and Driver stability > for large applications" proposal. Please find the links below: > >

Re: [VOTE] Release Apache Spark 3.4.0 (RC5)

2023-04-03 Thread Wenchen Fan
Sorry for the last-minute change, but we found two wrong behaviors and want to fix them before the release: https://github.com/apache/spark/pull/40641 We missed a corner case when the input index for `array_insert` is 0. It should fail as 0 is an invalid index.
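The corner case mentioned here comes down to a 1-based insert position, where 0 names no slot. Below is a minimal plain-Python sketch of that rule, not Spark's implementation: `array_insert_model` is a hypothetical name, negative positions are deliberately left out, and padding with `None` for positions past the end is an assumption about the intended behavior.

```python
def array_insert_model(arr, pos, value):
    """Toy model of inserting at a 1-based position; 0 is rejected."""
    if pos == 0:
        raise ValueError("index 0 is invalid: positions are 1-based")
    if pos < 0:
        # Negative positions are outside the scope of this sketch.
        raise NotImplementedError("negative positions are not modelled")
    idx = pos - 1
    # Assumption: positions past the end pad the gap with None.
    padded = list(arr) + [None] * max(0, idx - len(arr))
    return padded[:idx] + [value] + padded[idx:]
```

For example, `array_insert_model([1, 2, 3], 1, 0)` puts the new element first, while position 0 raises `ValueError` instead of being treated as some default slot.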
