Re: [SQL] s.s.a.coalescePartitions.parallelismFirst true but recommends false

2021-09-06 Thread Wenchen Fan
This is correct. It's true by default so that AQE doesn't cause a performance regression. If you run a benchmark, larger parallelism usually means better performance. However, it's recommended to set it to false, so that AQE can give better resource utilization, which is good for a busy Spark
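
The trade-off described above can be illustrated with a toy model. This is a simplified sketch based on the documented behavior of `spark.sql.adaptive.coalescePartitions.parallelismFirst`, not Spark's actual implementation; the function name, parameters, and the 1 MiB floor are assumptions made for illustration.

```python
MIB = 1024 * 1024


def toy_target_size(total_bytes, advisory_bytes, default_parallelism,
                    parallelism_first, min_partition_bytes=1 * MIB):
    """Toy model of the coalescing target size (hypothetical helper, not a Spark API).

    parallelism_first=False: respect the advisory partition size as-is.
    parallelism_first=True: shrink the target so that roughly
    `default_parallelism` partitions survive, bounded below by a minimum size.
    """
    if not parallelism_first:
        return advisory_bytes
    # Ceiling division: bytes per partition if every core gets one partition.
    per_core = -(-total_bytes // default_parallelism)
    return max(min(advisory_bytes, per_core), min_partition_bytes)


# 640 MiB of shuffle data, 64 MiB advisory size, 200 cores (illustrative numbers).
small_target = toy_target_size(640 * MIB, 64 * MIB, 200, parallelism_first=True)
large_target = toy_target_size(640 * MIB, 64 * MIB, 200, parallelism_first=False)
```

In this model the parallelism-first target is far smaller than the advisory size, so many small partitions (and tasks) remain; with the flag off, the same data collapses into about ten 64 MiB partitions, which uses fewer task slots on a busy cluster.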

Re: [VOTE] SPIP: Catalog API for view metadata

2021-05-26 Thread Wenchen Fan
OK, then I'd vote for TableViewCatalog, because 1. This is how Hive catalog works, and we need to migrate Hive catalog to the v2 API sooner or later. 2. Because of 1, TableViewCatalog is easy to support in the current table/view resolution framework. 3. It's better to avoid name conflicts between

Re: [Spark Core]: Adding support for size based partition coalescing

2021-05-25 Thread Wenchen Fan
what does a > repartition() call do if AQE is not enabled? this is essentially a new api > so would repartitionBySize or something be less confusing to users who > already use repartition(num_partitions). > > Tom > > On Monday, May 24, 2021, 12:30:20 PM CDT, Wenchen Fan > wrote: >

Re: Bridging gap between Spark UI and Code

2021-05-25 Thread Wenchen Fan
You can see the SQL plan node name in the DAG visualization. Please refer to https://spark.apache.org/docs/latest/web-ui.html for more details. If you still have any confusion, please let us know and we will keep improving the document. On Tue, May 25, 2021 at 4:41 AM mhawes wrote: > @Wenc

Re: SPIP: Catalog API for view metadata

2021-05-24 Thread Wenchen Fan
hen it happens well >> before table resolution. And, View and Table are very different objects; >> returning Object from this API doesn't make much sense. >> >> One extra RPC is not unreasonable, and the choice should be left to >> sources. That's the easiest place

Re: About Spark executs sqlscript

2021-05-24 Thread Wenchen Fan
It's not possible to load everything into memory. We should use a big query connector (should be existing already?) and register tables B and C as temp views in Spark. On Fri, May 14, 2021 at 8:50 AM bo zhao wrote: > Hi Team, > > I've followed Spark community for several years. This is my first

Re: Purpose of OffsetHolder as a LeafNode?

2021-05-24 Thread Wenchen Fan
It's just an immediate placeholder to update the query plan in each micro-batch. On Sat, May 15, 2021 at 10:29 PM Jacek Laskowski wrote: > Hi, > > Just stumbled upon OffsetHolder [1] and am curious why it's a LeafNode? > What logical plan could it be part of? > > [1] >

Re: Secrets store for DSv2

2021-05-24 Thread Wenchen Fan
You can take a look at PartitionReaderFactory. It's created at the driver side, serialized and sent to the executor side. For the write side, there is a similar channel: DataWriterFactory On Wed, May 19, 2021 at 4:37 AM Andrew Melo wrote: > Hello, > > When implementing a DSv2 datasource, where

Re: [Spark Core]: Adding support for size based partition coalescing

2021-05-24 Thread Wenchen Fan
Ideally this should be handled by the underlying data source to produce a reasonably partitioned RDD as the input data. However if we already have a poorly partitioned RDD at hand and want to repartition it properly, I think an extra shuffle is required so that we can know the partition size
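
Once partition sizes are known (which, as noted above, requires the data to have been shuffled or otherwise measured first), size-based coalescing can be sketched as a greedy merge of adjacent partitions. This is a minimal illustration under stated assumptions, not Spark's implementation; `coalesce_by_size` and its parameters are hypothetical names.

```python
def coalesce_by_size(partition_sizes, target_bytes):
    """Greedily group adjacent partitions so each group stays within target_bytes.

    Returns a list of groups, each a list of original partition indices.
    A single partition larger than the target is kept as its own group.
    """
    groups = []
    current, current_bytes = [], 0
    for index, size in enumerate(partition_sizes):
        # Close the current group if adding this partition would exceed the target.
        if current and current_bytes + size > target_bytes:
            groups.append(current)
            current, current_bytes = [], 0
        current.append(index)
        current_bytes += size
    if current:
        groups.append(current)
    return groups


# Six measured partitions (sizes in arbitrary units) and a target of 50 per group.
groups = coalesce_by_size([10, 20, 30, 40, 5, 5], target_bytes=50)
```

Only adjacent partitions are merged here, which keeps the original partition order intact; the sketch does not split oversized partitions.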

Re: Bridging gap between Spark UI and Code

2021-05-24 Thread Wenchen Fan
I believe you can already see each plan change Spark did to your query plan in the debug-level logs. I think it's hard to do in the web UI as keeping all these historical query plans is expensive. Mapping the query plan to your application code is nearly impossible, as so many optimizations can

Re: Resolves too old JIRAs as incomplete

2021-05-20 Thread Wenchen Fan
+1 On Thu, May 20, 2021 at 11:59 AM Dongjoon Hyun wrote: > +1. > > Thank you, Takeshi. > > On Wed, May 19, 2021 at 7:49 PM Hyukjin Kwon wrote: > >> Yeah, I wanted to discuss this. I agree since 2.4.x became EOL >> >> On Thu, May 20, 2021 at 10:54 AM, Sean Owen wrote: >> >>> I agree. Such old JIRAs

Re: [ANNOUNCE] Apache Spark 2.4.8 released

2021-05-18 Thread Wenchen Fan
Thank you, Liang-Chi! On Tue, May 18, 2021 at 1:32 PM Dongjoon Hyun wrote: > Finally! Thank you, Liang-Chi. > > Bests, > Dongjoon. > > > On Mon, May 17, 2021 at 10:14 PM Takeshi Yamamuro > wrote: > >> Thank you for the release work, Liang-Chi~ >> >> On Tue, May 18, 2021 at 2:11 PM Hyukjin Kwon

Re: Apache Spark 3.1.2 Release?

2021-05-18 Thread Wenchen Fan
+1, thanks! On Tue, May 18, 2021 at 1:37 PM Xiao Li wrote: > +1 Thanks, Dongjoon! > > Xiao > > > > On Mon, May 17, 2021 at 8:45 PM Kent Yao wrote: > >> +1. thanks Dongjoon >> >> *Kent Yao * >> @ Data Science Center, Hangzhou Research Institute, NetEase Corp. >> *a spark enthusiast* >> *kyuubi

Re: [Spark Catalog API] Support for metadata Backup/Restore

2021-05-11 Thread Wenchen Fan
ient APIs to the end users >> in this approach? The users can only call backup or restore, right? >> >> Thanks, >> Tianchen >> >> On Fri, May 7, 2021 at 12:27 PM Wenchen Fan wrote: >> >>> If a catalog implements backup/restore, it can easily expose some

Re: [VOTE] Release Spark 2.4.8 (RC4)

2021-05-11 Thread Wenchen Fan
[image: image.png] I checked the log in https://repository.apache.org/#stagingRepositories, seems the gpg key is not uploaded to the public keyserver. Liang-Chi can you take a look? On Tue, May 11, 2021 at 3:47 PM Wenchen Fan wrote: > +1 > > On Tue, May 11, 2021 at 2:59 AM Holden Kar

Re: [VOTE] Release Spark 2.4.8 (RC4)

2021-05-11 Thread Wenchen Fan
+1 On Tue, May 11, 2021 at 2:59 AM Holden Karau wrote: > +1 - pip install with Py 2.7 works (with the understandable warnings > regarding Python 2.7 no longer being maintained). > > On Mon, May 10, 2021 at 11:18 AM sarutak wrote: > > > > +1 (non-binding) > > > > - Kousuke > > > > > It looks

Re: [Spark Catalog API] Support for metadata Backup/Restore

2021-05-07 Thread Wenchen Fan
If a catalog implements backup/restore, it can easily expose some client APIs to the end-users (e.g. REST API), I don't see a strong reason to expose the APIs to Spark. Do you plan to add new SQL commands in Spark to backup/restore a catalog? On Tue, May 4, 2021 at 2:39 AM Tianchen Zhang wrote:

Re: Bintray replacement for spark-packages.org

2021-04-28 Thread Wenchen Fan
Shall we make new releases for 3.0 and 3.1? So that people don't need to change the sbt resolver/pom files to work around Bintray sunset. It's also been a while since the last 3.0 and 3.1 releases. On Tue, Apr 27, 2021 at 9:02 AM Matthew Powers wrote: > Great job fixing this!! I just checked

Re: [VOTE] Release Spark 2.4.8 (RC3)

2021-04-28 Thread Wenchen Fan
+1 (binding) On Thu, Apr 29, 2021 at 1:05 AM DB Tsai wrote: > +1 (binding) > > > On Apr 28, 2021, at 9:26 AM, Liang-Chi Hsieh wrote: > > > > > > Please vote on releasing the following candidate as Apache Spark version > > 2.4.8. > > > > The vote is open until May 4th at 9AM PST and passes if a

Re: [DISCUSS] Add error IDs

2021-04-21 Thread Wenchen Fan
I think severity makes sense for logs, but not sure about errors. +1 to the proposal to improve the error message further. On Fri, Apr 16, 2021 at 6:01 PM Yuming Wang wrote: > +1 for this proposal. > > On Fri, Apr 16, 2021 at 5:15 AM Karen wrote: > >> We could leave space in the numbering

Re: [VOTE] Release Spark 2.4.8 (RC2)

2021-04-14 Thread Wenchen Fan
+1 (binding) On Thu, Apr 15, 2021 at 12:22 AM Maxim Gekk wrote: > +1 (non-binding) > > On Wed, Apr 14, 2021 at 6:39 PM Dongjoon Hyun > wrote: > >> +1 >> >> Bests, >> Dongjoon. >> >> On Tue, Apr 13, 2021 at 10:38 PM Kent Yao wrote: >> >>> +1 (non-binding) >>> >>> *Kent Yao * >>> @ Data Science

Re: Apache Spark 3.2 Expectation

2021-04-11 Thread Wenchen Fan
ust for an update, I will send a discussion email about my idea late >> this week or early next week. >> >> On Thu, Mar 11, 2021 at 7:00 PM, Wenchen Fan wrote: >> >>> There are many projects going on right now, such as new DS v2 APIs, ANSI >>> interval types, join improve

Re: Increase the number of parallel jobs in GitHub Actions at ASF organization level

2021-04-07 Thread Wenchen Fan
> for example, having sub-groups where each group shares the resources - currently one GitHub organisation shares all resources across the projects. That's a good idea. We do need to thank GitHub for giving free resources to ASF projects, but it would be better if we could make it a business: we allow

Re: Big Broadcast Hash Join with Dynamic Partition Pruning gives wrong results

2021-04-07 Thread Wenchen Fan
Hi Tomas, thanks for reporting this bug! Is it possible to share your dataset so that other people can reproduce and debug it? On Thu, Apr 8, 2021 at 7:52 AM Tomas Bartalos wrote: > when I try to do a Broadcast Hash Join on a bigger table (6Mil rows) I get > an incorrect result of 0 rows. > >

Re: [VOTE] Release Spark 2.4.8 (RC1)

2021-04-07 Thread Wenchen Fan
+1 On Thu, Apr 8, 2021 at 9:24 AM Sean Owen wrote: > Looks good to me testing on Java 8, Hadoop 2.7, Ubuntu, with about all > profiles enabled. > I still get an odd failure in the Hive versions suite, but I keep seeing > that in my env and think it's something odd about my setup. > +1 >

Re: PR testing and flaky tests (triggering executions separately)

2021-03-29 Thread Wenchen Fan
AFAIK, the checks triggered by GitHub Actions are almost the same as SparkPullRequestBuilder's, except that they include one more Scala 2.13 check. So at least we don't have to wait for both SparkPullRequestBuilder and GitHub Actions to merge a PR. On Fri, Mar 26, 2021 at 6:09 PM Attila Zsolt Piros <

Re: [VOTE] SPIP: Support pandas API layer on PySpark

2021-03-28 Thread Wenchen Fan
+1 On Mon, Mar 29, 2021 at 1:45 PM Holden Karau wrote: > +1 > > On Sun, Mar 28, 2021 at 10:25 PM sarutak wrote: > >> +1 (non-binding) >> >> - Kousuke >> >> > +1 (non-binding) >> > >> > On Sun, Mar 28, 2021 at 9:06 PM 郑瑞峰 >> > wrote: >> > >> >> +1 (non-binding) >> >> >> >> --

Re: Welcoming six new Apache Spark committers

2021-03-28 Thread Wenchen Fan
Congrats! On Mon, Mar 29, 2021 at 12:04 PM 郑瑞峰 wrote: > Congratulations to all! > > > -- Original Message -- > *From:* "Yuanjian Li" ; > *Sent:* Monday, March 29, 2021, 10:38 AM > *To:* "Yi Wu"; > *Cc:* "Gengliang Wang";"Xiao Li";"Chao > Sun";"Mridul Muralidharan";"Dongjoon >

Re: Observable Metrics on Spark Datasets

2021-03-18 Thread Wenchen Fan
I think a listener-based API makes sense for streaming (since you need to keep watching the result), but may not be very reasonable for batch queries (you only get the result once). The idea of Observation looks good, but we should define what happens if `observation.get` is called before the

Re: SessionCatalog lock issue

2021-03-18 Thread Wenchen Fan
The `synchronized` is needed for getting `currentDb` IIUC. So a small change is to only wrap `formatDatabaseName(name.database.getOrElse(currentDb))` with `synchronized`. On Thu, Mar 18, 2021 at 3:38 PM Chang Chen wrote: > hi all > > We met an issue which is related with SessionCatalog
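
The suggestion above, guarding only the read of the shared current database and running the expensive work outside the lock, can be sketched with a toy class. This is an illustration of the locking pattern only, not Spark's `SessionCatalog`; every name below is hypothetical, and lowercasing stands in for whatever name normalization the real code does.

```python
import threading


class ToySessionCatalog:
    """Toy catalog showing a narrowed critical section (hypothetical class)."""

    def __init__(self, current_db="default"):
        self._lock = threading.Lock()
        self._current_db = current_db

    def set_current_database(self, db):
        with self._lock:
            self._current_db = db

    def _qualified_db(self, database):
        # Only this short step touches shared mutable state, so only it is locked.
        with self._lock:
            return (database or self._current_db).lower()

    def lookup(self, table, database, slow_lookup):
        db = self._qualified_db(database)
        # The potentially slow call (for example a metastore RPC) runs unlocked,
        # so other threads are not blocked while it is in flight.
        return slow_lookup(db, table)


catalog = ToySessionCatalog()
result = catalog.lookup("events", None, lambda db, table: f"{db}.{table}")
```

The design choice is that a thread stuck in a slow lookup no longer holds the lock, at the cost that the current database may change between name resolution and the lookup itself.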

Re: [DISCUSS] Support pandas API layer on PySpark

2021-03-16 Thread Wenchen Fan
+1, it's great to have Pandas support in Spark out of the box. On Tue, Mar 16, 2021 at 10:12 PM Takeshi Yamamuro wrote: > +1; the pandas interfaces are pretty popular and supporting them in > pyspark looks promising, I think. > one question I have; what's an initial goal of the proposal? > Is

Re: Apache Spark 3.2 Expectation

2021-03-11 Thread Wenchen Fan
There are many projects going on right now, such as new DS v2 APIs, ANSI interval types, join improvement, disaggregated shuffle, etc. I don't think it's realistic to do the branch cut in April. I'm +1 to release 3.2 around July, but it doesn't mean we have to cut the branch 3 months earlier. We

Re: [VOTE] SPIP: Add FunctionCatalog

2021-03-09 Thread Wenchen Fan
+1 (binding) On Tue, Mar 9, 2021 at 1:47 PM Russell Spitzer wrote: > +1 (for what it's worth) > > Thanks for making such a robust proposal, i'm excited to see the new work > coming from this > > On Mar 8, 2021, at 11:44 PM, Dongjoon Hyun > wrote: > > +1 (binding) > > Thank you, Ryan. > >

Re: [DISCUSS] SPIP: FunctionCatalog

2021-03-03 Thread Wenchen Fan
ap that requires string keys and some consistent type for >>> values. This would be easy with the InternalRow API because you can use >>> getString(pos) and get(pos + 1, valueType) to get the key/value pairs. >>> Use of UTF8String vs String will be checked at compile time.

Re: [ANNOUNCE] Announcing Apache Spark 3.1.1

2021-03-03 Thread Wenchen Fan
Great work and congrats! On Wed, Mar 3, 2021 at 3:51 PM Kent Yao wrote: > Congrats, all! > > Bests, > *Kent Yao * > @ Data Science Center, Hangzhou Research Institute, NetEase Corp. > *a spark enthusiast* > *kyuubi is a unified multi-tenant JDBC > interface

Re: [DISCUSS] SPIP: FunctionCatalog

2021-03-02 Thread Wenchen Fan
Yes, GenericInternalRow is safe when types mismatch, at the cost of using Object[] and boxing primitive types. And this is a runtime failure, which is absolutely worse than query-compile-time checks. Also, don't forget my previous point: users need to specify the type and index

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-23 Thread Wenchen Fan
s? > > Especially, Wenchen, could you make your PR based on Ryan's PR? > > If we collect the scattered ideas into a single PR, that would be greatly > helpful not only for further discussions, but also when we go on a vote on > Ryan's PR or Wenchen's PR. > > Bests, > Dongjo

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-22 Thread Wenchen Fan
to the Spark analyzer that the output should > be of type MapData (i.e., based on what was seen in the > input at compile time). The whole UDF becomes a Spark Expression > <https://github.com/linkedin/transport/blob/master/transportable-udfs-spark/src/main/scala/com/linkedin/transp

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-21 Thread Wenchen Fan
oblem when implementations define methods with the wrong > types? The InternalRow approach helps implementations catch that problem > (as demonstrated above) and also provides a fallback when there is a bug > preventing the invoke optimization from working. That seems like a good > approach

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-18 Thread Wenchen Fan
alRow and that it isn’t a usability >> problem to include it. >> >> Oh, and one last thought is that we already have users that call >> Dataset.map and use InternalRow. This would allow converting that code >> directly to a UDF. >> >> I think we’re

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-17 Thread Wenchen Fan
y of >>>>> >>>> >> getting to and calling UDFs. >>>>> >>>> >> >>>>> >>>> >> I like having the ScalarFunction as the API to call the UDFs. >>>>> It is >>>>> >>>>

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-17 Thread Wenchen Fan
>>>> >> would >>>> >>>> >> also simplify using the functions in other contexts like >>>> pushing down >>>> >>>> >> filters into the ORC & Parquet readers although there are a lot >>>>

Re: [VOTE] Release Spark 3.0.2 (RC1)

2021-02-16 Thread Wenchen Fan
+1 On Wed, Feb 17, 2021 at 1:43 PM Dongjoon Hyun wrote: > +1 > > Bests, > Dongjoon. > > > On Tue, Feb 16, 2021 at 2:27 AM Herman van Hovell > wrote: > >> +1 >> >> On Tue, Feb 16, 2021 at 11:08 AM Hyukjin Kwon >> wrote: >> >>> +1 >>> >>> On Tue, Feb 16, 2021 at 5:10 PM, Prashant Sharma wrote: >>>

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-09 Thread Wenchen Fan
(Trino). On Wed, Feb 10, 2021 at 10:18 AM Wenchen Fan wrote: > Hi Holden, > > As Hyukjin said, following existing designs is not the principle of DS v2 > API design. We should make sure the DS v2 API makes sense. AFAIK we didn't > fully follow the catalog API design from Hive an

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-09 Thread Wenchen Fan
Hi Holden, As Hyukjin said, following existing designs is not the principle of DS v2 API design. We should make sure the DS v2 API makes sense. AFAIK we didn't fully follow the catalog API design from Hive and I believe Ryan also agrees with it. I think the problem here is we were discussing

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-09 Thread Wenchen Fan
w will Spark detect and produce those representations? What > if a function uses both String *and* UTF8String? Will Spark detect this > for each parameter? Having one or two functions called by Spark is much > easier to maintain in Spark and avoid a lot of debugging headaches when > somethin

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-08 Thread Wenchen Fan
This is a very important feature, thanks for working on it! Spark uses codegen by default, and it's a bit unfortunate to see that codegen support is treated as a non-goal. I think it's better to not ask the UDF implementations to provide two different code paths for interpreted evaluation and

Re: Binary compatibility issues in 3.1.1?

2021-02-08 Thread Wenchen Fan
This is the cost of relying on Spark internal APIs, and the external connectors need to take care of it. BTW, the Alias change is source-compatible, and it shouldn't break the external connectors if they are compiled with Spark 3.1. On Tue, Feb 9, 2021 at 2:26 AM Alex Ott wrote: > although no,

Re: [VOTE] Release Spark 3.1.1 (RC1)

2021-01-22 Thread Wenchen Fan
ing a performance regression in some TPC-DS queries > (q88 for instance) that is caused by a recent commit in 3.1, highly likely > in the period from 19th November, 2020 to 18th December, 2020. > > Maxim Gekk > > Software Engineer > > Databricks, Inc. > > > On Fri, Jan 22,

Re: [VOTE] Release Spark 3.1.1 (RC1)

2021-01-21 Thread Wenchen Fan
-1 as I just found a regression in 3.1. A self-join query works well in 3.0 but fails in 3.1. It's being fixed at https://github.com/apache/spark/pull/31287 On Fri, Jan 22, 2021 at 4:34 AM Tom Graves wrote: > +1 > > built from tarball, verified sha and regular CI and tests all pass. > > Tom > >

Re: [VOTE] Release Spark 3.1.0 (RC1)

2021-01-06 Thread Wenchen Fan
I agree with Jungtaek that people are likely to be biased when testing 3.1.0. At least this will not be the same community-blessed release as previous ones, because the voting is already affected by the fact that 3.1.0 is already in maven central. Skipping 3.1.0 sounds better to me. On Thu, Jan

Re: FYI: SPARK-30098 Use default datasource as provider for CREATE TABLE syntax

2020-12-01 Thread Wenchen Fan
I'm reviving this thread because this feature was reverted before the 3.0 release, and now we are trying to add it back since the CREATE TABLE syntax is unified. The benefits are pretty clear: CREATE TABLE by default (without USING or STORED AS) should create native tables that work best with

Re: How to convert InternalRow to Row.

2020-11-30 Thread Wenchen Fan
748) > --- > Any idea about this error? > > Thanks > Jason > > On Mon, 30 Nov 2020 at 19:34, Jia, Ke A wrote: > >> The fromRow method is removed in spark3.0. And the new API is : >> >> val encoder = RowEncoder(schema) >> >> v

Re: How to convert InternalRow to Row.

2020-11-27 Thread Wenchen Fan
InternalRow is an internal/developer API that might change over time. Right now, the way to convert it to Row is to use `RowEncoder`, but you need to know the data schema: `val encoder = RowEncoder(schema)`; `val row = encoder.fromRow(internalRow)` On Fri, Nov 27, 2020 at 6:16 AM Jason Jun wrote: >

Re: SPIP: Catalog API for view metadata

2020-11-09 Thread Wenchen Fan
iMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing> > has been updated. Please review. > > On Thu, Sep 3, 2020 at 9:22 AM John Zhuge wrote: > >> Wenchen, sorry for the delay, I will post an update shortly. >> >> On Thu, Sep 3, 2020 at 2:00 AM Wenchen Fan wrote: >>

Re: [VOTE] Standardize Spark Exception Messages SPIP

2020-11-05 Thread Wenchen Fan
+1 On Fri, Nov 6, 2020 at 12:56 PM kalyan wrote: > +1 > > On Fri, Nov 6, 2020, 5:58 AM Matei Zaharia > wrote: > >> +1 >> >> Matei >> >> > On Nov 5, 2020, at 10:25 AM, EveLiao wrote: >> > >> > +1 >> > Thanks! >> > >> > >> > >> > -- >> > Sent from:

Re: [DISCUSS] preferred behavior when fails to instantiate configured v2 session catalog

2020-10-26 Thread Wenchen Fan
+1 to fail fast. Thanks for reporting this, Jungtaek! On Mon, Oct 26, 2020 at 8:36 AM Jungtaek Lim wrote: > Yeah I'm in favor of fast-fail if things are not working out as end users > intended. Spark should only fall back when it doesn't make any difference > but only some sort of performance.

Re: My I report a special comparaison of executions leading on issues on Spark JIRA ?

2020-10-13 Thread Wenchen Fan
It will speed up the process a lot if a simple code snippet to reproduce the error is provided. On Sat, Oct 3, 2020 at 4:40 AM Marc Le Bihan wrote: > Yes. As I explained at the beginning of the message. > > For com/fasterxml/jackson/module/scala/ScalaObjectMapper missing > I will check myself

Re: Official support of CREATE EXTERNAL TABLE

2020-10-07 Thread Wenchen Fan
nk Hive compatibility itself is a “use case”. > > Why? > > Hive is an external database that defines its own behavior and with which > Spark claims to be compatible. If Hive isn’t a valid use case, then why is > EXTERNAL supported at all? > > On Wed, Oct 7, 2020 at 10:17

Re: Official support of CREATE EXTERNAL TABLE

2020-10-07 Thread Wenchen Fan
a Hive-compatible catalog. A great recent > example is Nessie <https://projectnessie.org/tools/hive/>, which enables > branching and tagging table states across several table formats and aims to > be compatible with Hive. > > On Wed, Oct 7, 2020 at 5:51 AM Wenchen Fan wrote: >

Re: Official support of CREATE EXTERNAL TABLE

2020-10-07 Thread Wenchen Fan
log >> implementation. >> >> I don’t think that there is a good reason to force catalogs to break >> compatibility with Hive SQL, while making it appear as though DDL is >> compatible. Because removing EXTERNAL would be a breaking change to the >> SQL parser, I thin

Re: SQL DDL statements with replacing default catalog with custom catalog

2020-10-07 Thread Wenchen Fan
n the length of identifier, > so things that work with custom catalog no longer work when it replaces > default session catalog. > > On Wed, Oct 7, 2020 at 6:05 PM Wenchen Fan wrote: > >> Ah, this is by design. V1 tables should still go through the v1 session >> catalog.

Re: SQL DDL statements with replacing default catalog with custom catalog

2020-10-07 Thread Wenchen Fan
table, and then the catalyst goes with v1 exec. I guess all commands > leveraging TempViewOrV1Table to determine whether the table is v1 vs v2 > would all suffer from this issue. > > On Wed, Oct 7, 2020 at 5:45 PM Wenchen Fan wrote: > >> Not all the DDL commands support v2

Re: SQL DDL statements with replacing default catalog with custom catalog

2020-10-07 Thread Wenchen Fan
Not all the DDL commands support v2 catalog APIs (e.g. CREATE TABLE LIKE), so it's possible that some commands still go through the v1 session catalog although you configured a custom v2 session catalog. Can you create JIRA tickets if you hit any DDL commands that don't support v2 catalog? We

Official support of CREATE EXTERNAL TABLE

2020-10-06 Thread Wenchen Fan
Hi all, I'd like to start a discussion thread about this topic, as it blocks an important feature that we target for Spark 3.1: unify the CREATE TABLE SQL syntax. A bit more background for CREATE EXTERNAL TABLE: it's kind of a hidden feature in Spark for Hive compatibility. When you write

Re: [VOTE][SPARK-30602] SPIP: Support push-based shuffle to improve shuffle efficiency

2020-09-15 Thread Wenchen Fan
+1 On Tue, Sep 15, 2020 at 2:42 PM Dongjoon Hyun wrote: > +1 > > Bests, > Dongjoon. > > On Mon, Sep 14, 2020 at 9:19 PM kalyan wrote: > >> +1 >> >> Will positively improve the performance and reliability of spark... >> Looking fwd to this.. >> >> Regards >> Kalyan. >> >> On Tue, Sep 15, 2020,

Re: [ANNOUNCE] Announcing Apache Spark 3.0.1

2020-09-11 Thread Wenchen Fan
Great work, thanks, Ruifeng! On Fri, Sep 11, 2020 at 11:09 PM Gengliang Wang < gengliang.w...@databricks.com> wrote: > Congrats! > Thanks for the work, Ruifeng! > > > On Fri, Sep 11, 2020 at 9:51 PM Takeshi Yamamuro > wrote: > >> Congrats and thanks, Ruifeng! >> >> >> On Fri, Sep 11, 2020 at

Re: [VOTE] Release Spark 2.4.7 (RC3)

2020-09-10 Thread Wenchen Fan
details. >> >> I have now updated the key in those keyservers. Now, how do I refresh >> nexus? >> >> Thanks, >> >> On Thu, Sep 10, 2020 at 9:13 AM Sean Owen wrote: >> >>> Yes I can do that and I am sure it's fine, but why has it been visible >>

Re: [VOTE] Release Spark 2.4.7 (RC3)

2020-09-09 Thread Wenchen Fan
keyserver. > > Regards, > Mridul > > [1] wget https://dist.apache.org/repos/dist/dev/spark/KEYS -O - | gpg > --import > > On Wed, Sep 9, 2020 at 8:03 PM Wenchen Fan wrote: > >> I checked >> https://repository.apache.org/content/repositories/orgapachespark-13

Re: [VOTE] Release Spark 2.4.7 (RC3)

2020-09-09 Thread Wenchen Fan
I checked https://repository.apache.org/content/repositories/orgapachespark-1361/ , it says the Signature Validation failed. Prashant, can you double-check your gpg key and make sure it's uploaded to public key servers like the following? http://pool.sks-keyservers.net:11371

Re: SPIP: Catalog API for view metadata

2020-09-03 Thread Wenchen Fan
Any updates here? I agree that a new View API is better, but we need a solution to avoid performance regression. We need to elaborate on the cache idea. On Thu, Aug 20, 2020 at 7:43 AM Ryan Blue wrote: > I think it is a good idea to keep tables and views separate. > > The main two arguments

Re: [VOTE] Release Spark 2.4.7 (RC1)

2020-08-19 Thread Wenchen Fan
I think so. I don't see other bug reports for 2.4. On Thu, Aug 20, 2020 at 12:11 AM Nicholas Marion wrote: > It appears all 3 issues slated for Spark 2.4.7 have been merged. Should we > be looking at getting RC2 ready? > > > Regards, > > *NICHOLAS T. MARION * > IBM Open Data Analytics for z/OS

Re: [SparkSql] Casting of Predicate Literals

2020-08-19 Thread Wenchen Fan
; CAST(short_col AS LONG) < 1000, can we still push down "short_col < 1000" > without the cast? > > On Tue, Aug 4, 2020 at 6:55 PM Russell Spitzer > wrote: > >> Thanks! That's exactly what I was hoping for! Thanks for finding the Jira >> for me! >>

Re: SPIP: Catalog API for view metadata

2020-08-18 Thread Wenchen Fan
en >> though executing the view in Hive results in data that has the most recent >> schema when underlying tables evolve -- so newly added nested field data >> shows up in the view evaluation query result but not in the view schema). >> >> On Fri, Aug 14, 2020 at 2

Re: SPIP: Catalog API for view metadata

2020-08-14 Thread Wenchen Fan
"dual" catalog. >>>>> - The implementation for a "dual" catalog plugin should ensure: >>>>> - Creating a view in view catalog when a table of the same >>>>> name exists should fail. >>>>> - Creating a table i

Re: SPIP: Catalog API for view metadata

2020-08-12 Thread Wenchen Fan
Hi John, Thanks for working on this! View support is very important to the catalog plugin API. After reading your doc, I have one high-level question: should views be a separate API, or are they just a special type of table? AFAIK in most databases, tables and views share the same namespace. You

Re: [SparkSql] Casting of Predicate Literals

2020-08-04 Thread Wenchen Fan
I think this is not a problem in 3.0 anymore, see https://issues.apache.org/jira/browse/SPARK-27638 On Wed, Aug 5, 2020 at 12:08 AM Russell Spitzer wrote: > I've just run into this issue again with another user and I feel like most > folks here have seen some flavor of this at some point. > >

Re: [VOTE] Update the committer guidelines to clarify when to commit changes.

2020-07-30 Thread Wenchen Fan
+1, thanks for driving it, Holden! On Fri, Jul 31, 2020 at 10:24 AM Holden Karau wrote: > +1 from myself :) > > On Thu, Jul 30, 2020 at 2:53 PM Jungtaek Lim > wrote: > >> +1 (non-binding, I guess) >> >> Thanks for raising the issue and sorting it out! >> >> On Fri, Jul 31, 2020 at 6:47 AM

Re: InterpretedUnsafeProjection - error in getElementSize

2020-07-24 Thread Wenchen Fan
Can you create a JIRA ticket? There are many people happy to help fix it. On Tue, Jul 21, 2020 at 9:49 PM Janda Martin wrote: > Hi, > I think that I found error in > InterpretedUnsafeProjection::getElementSize. This method differs from > similar implementation in GenerateUnsafeProjection.

Re: Catalog API for Partition

2020-07-20 Thread Wenchen Fan
Yea we don't want the partitions to be Hive-specific. My point is, we call it "Partition Catalog APIs", which makes me confused about the relationship between this and the "partitions" in `TableCatalog.createTable`. Are these two orthogonal? Or do you kind of unify them? On Sat, Jul 18, 2020 at

Re: Catalog API for Partition

2020-07-17 Thread Wenchen Fan
In Hive, a partition does two things: 1. Act as an index to speed up data scans 2. Act as a way to manage the data. People can add/drop partitions. How do you unify these 2 things in your API design? On Fri, Jul 17, 2020 at 12:03 AM JackyLee wrote: > Hi devs, > > In order to support Partition

Re: [DISCUSS] -1s and commits

2020-07-16 Thread Wenchen Fan
It looks like there are two topics: 1. PRs with -1 2. PRs with someone asking to wait for certain days. Holden, it seems you are hitting 2? I think 2 can be problematic if there are people who keep asking to wait, and block the PR indefinitely. But if it's only asked once, this seems OK. BTW,

Re: [DISCUSS] Apache Spark 3.0.1 Release

2020-07-15 Thread Wenchen Fan
not done for Spark 2.4.6 because it was too late on the vote > process but it makes perfect sense to have this in 2.4.7. > > On Wed, Jul 15, 2020 at 9:07 AM Wenchen Fan wrote: > > > > Yea I think 2.4.7 is good to go. Let's start! > > > > On Wed, Jul 15, 2020 at 1:50 PM Pr

Re: [DISCUSS] Apache Spark 3.0.1 Release

2020-07-15 Thread Wenchen Fan
Yea I think 2.4.7 is good to go. Let's start! On Wed, Jul 15, 2020 at 1:50 PM Prashant Sharma wrote: > Hi Folks, > > So, I am back, and searched the JIRAS with target version as "2.4.7" and > Resolved, found only 2 jiras. So, are we good to go, with just a couple of > jiras fixed ? Shall I

Re: Welcoming some new Apache Spark committers

2020-07-15 Thread Wenchen Fan
Congrats and welcome! On Wed, Jul 15, 2020 at 2:18 PM Mridul Muralidharan wrote: > > Congratulations ! > > Regards, > Mridul > > On Tue, Jul 14, 2020 at 12:37 PM Matei Zaharia > wrote: > >> Hi all, >> >> The Spark PMC recently voted to add several new committers. Please join >> me in welcoming

Re: [PSA] Apache Spark uses GitHub Actions to run the tests

2020-07-14 Thread Wenchen Fan
To clarify, we need to wait for: 1. Java documentation build test in GitHub Actions 2. dependency test in GitHub Actions 3. either GitHub Actions all green or Jenkins pass. If the PR touches Kinesis, or it uses other profiles, we must wait for Jenkins to pass. Am I missing something? On Tue, Jul 14,

Re: [VOTE] Decommissioning SPIP

2020-07-02 Thread Wenchen Fan
+1 On Fri, Jul 3, 2020 at 12:06 AM DB Tsai wrote: > +1 > > On Thu, Jul 2, 2020 at 8:59 AM Ryan Blue > wrote: > >> +1 >> >> On Thu, Jul 2, 2020 at 8:00 AM Dongjoon Hyun >> wrote: >> >>> +1. >>> >>> Thank you, Holden. >>> >>> Bests, >>> Dongjoon. >>> >>> On Thu, Jul 2, 2020 at 6:43 AM wuyi

Re: [DISCUSS] Apache Spark 3.0.1 Release

2020-06-30 Thread Wenchen Fan
Hi Jason, Thanks for reporting! https://issues.apache.org/jira/browse/SPARK-32136 looks like a breaking change and we should investigate. On Wed, Jul 1, 2020 at 11:31 AM Holden Karau wrote: > I can take care of 2.4.7 unless someone else wants to do it. > > On Tue, Jun 30, 2020 at 8:29 PM Jason

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2020-06-24 Thread Wenchen Fan
Shall we start a new thread to discuss the bundled Hadoop version in PySpark? I don't have a strong opinion on changing the default, as users can still download the Hadoop 2.7 version. On Thu, Jun 25, 2020 at 2:23 AM Dongjoon Hyun wrote: > To Xiao. > Why Apache project releases should be

Re: Datasource with ColumnBatchScan support.

2020-06-17 Thread Wenchen Fan
If you already have your own `FileFormat` implementation: just override the `supportBatch` method. On Tue, Jun 16, 2020 at 5:39 AM Nasrulla Khan Haris wrote: > HI Spark developers, > > > > FileSourceScanExec >

Re: Quick sync: what goes in migration guide vs release notes?

2020-06-11 Thread Wenchen Fan
these > accomplishes that. That's valuable, but is what a summary blog is for. > > I can't feel strongly about this, so, would just say, propose process > changes for 3.1 and codify in the contributing guide but stick with what we > have for 3.0. > > > On Wed, Jun 10, 2020 at 10

Re: Quick sync: what goes in migration guide vs release notes?

2020-06-10 Thread Wenchen Fan
> They aren't anywhere then (3.0 is done, so not the migration guide). Some >> are important. >> Change could be OK but how about proposing this going forward? >> >> >> On Wed, Jun 10, 2020 at 10:35 AM Wenchen Fan wrote: >> >>> My 2 cents: >>

Re: Quick sync: what goes in migration guide vs release notes?

2020-06-10 Thread Wenchen Fan
My 2 cents: Since we have a migration guide, I think people who hit problems when upgrading Spark will read it. We should mention all the breaking changes there, except for trivial ones like obvious bug fixes. Even if there is no meaningful migration to guide for things like removing a deprecated

Re: [vote] Apache Spark 3.0 RC3

2020-06-09 Thread Wenchen Fan
+1 (binding) On Tue, Jun 9, 2020 at 6:15 PM Dr. Kent Yao wrote: > +1 (non-binding) > > > > -- > Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/ > > - > To unsubscribe e-mail:

Re: [VOTE] Release Spark 2.4.6 (RC8)

2020-05-31 Thread Wenchen Fan
+1 (binding), although I don't know why we jump from RC 3 to RC 8... On Mon, Jun 1, 2020 at 7:47 AM Holden Karau wrote: > Please vote on releasing the following candidate as Apache Spark > version 2.4.6. > > The vote is open until June 5th at 9AM PST and passes if a majority +1 PMC > votes are

Re: [VOTE] Apache Spark 3.0 RC2

2020-05-20 Thread Wenchen Fan
Seems the priority of SPARK-31706 is incorrectly marked, and it's a blocker now. The fix was merged just a few hours ago. This should be a -1 for RC2. On Wed, May 20, 2020 at 2:42 PM rickestcode wrote: > +1 > > > > -- > Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/ > >

Re: [VOTE] Release Spark 2.4.6 (RC3)

2020-05-18 Thread Wenchen Fan
+1, no known blockers. On Mon, May 18, 2020 at 12:49 AM DB Tsai wrote: > +1 as well. Thanks. > > On Sun, May 17, 2020 at 7:39 AM Sean Owen wrote: > >> +1 , same response as to the last RC. >> This looks like it includes the fix discussed last time, as well as a >> few more small good fixes. >>

Re: [DatasourceV2] Allowing Partial Writes to DSV2 Tables

2020-05-13 Thread Wenchen Fan
I think we already have this table capability: ACCEPT_ANY_SCHEMA. Can you try that? On Thu, May 14, 2020 at 6:17 AM Russell Spitzer wrote: > I would really appreciate that, I'm probably going to just write a planner > rule for now which matches up my table schema with the query output if they >

Re: [Datasource V2] Exception Handling for Catalogs - Naming Suggestions

2020-05-13 Thread Wenchen Fan
This looks a bit specific and maybe it's better to allow catalogs to customize the error message, which is more general. On Wed, May 13, 2020 at 12:16 AM Russell Spitzer wrote: > Currently the way some actions work, we receive an error during analysis > phase. For example, doing a "SELECT *

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-05-11 Thread Wenchen Fan
SPARK-30098 was merged about 6 months ago. It's not a clean revert and we may need to spend quite a bit of time to resolve conflicts and fix tests. I don't see why it's still a problem if a feature is disabled and hidden from end-users (it's undocumented, the config is internal). The related code
