Re: Enabling push-based shuffle in Spark

2020-01-23 Thread Wenchen Fan
The name "push-based shuffle" is a little misleading. This seems like a better shuffle service that co-locates shuffle blocks of one reducer at the map phase. I think this is a good idea. Is it possible to make it completely external via the shuffle plugin API? This looks like a good use case of

Re: [FYI] `Target Version` on `Improvement`/`New Feature` JIRA issues

2020-02-02 Thread Wenchen Fan
Thanks for cleaning this up! On Sun, Feb 2, 2020 at 2:08 PM Xiao Li wrote: > Thanks! Dongjoon. > > Xiao > > On Sat, Feb 1, 2020 at 5:15 PM Hyukjin Kwon wrote: > >> Thanks Dongjoon. >> >> On Sun, 2 Feb 2020, 09:08 Dongjoon Hyun, wrote: >> >>> Hi, All. >>> >>> From Today, we have `branch-3.0`

Re: [VOTE] Release Apache Spark 2.4.5 (RC2)

2020-02-03 Thread Wenchen Fan
AFAIK there are no ongoing critical bug fixes, +1 On Mon, Feb 3, 2020 at 11:46 PM Dongjoon Hyun wrote: > Yes, it does officially since 2.4.0. > > 2.4.5 is a maintenance release of 2.4.x line and the community didn't > support Hadoop 3.x on 'branch-2.4'. We didn't run test at all. > > Bests, >

Re: [SQL] Is it worth it (and advisable) to implement native UDFs?

2020-02-05 Thread Wenchen Fan
This is really a hack, and we don't recommend that users access internal classes directly. That's why there is no public documentation. If you really need to do it and are aware of the risks, you can read the source code. All expressions (or the so-called "native UDF") extend the base class `Expression`.

Re: [DISCUSSION] Esoteric Spark function `TRIM/LTRIM/RTRIM`

2020-02-18 Thread Wenchen Fan
+ non-foldable trimStr >>> 3. non-foldable srcStr + foldable trimStr >>> 4. non-foldable srcStr + non-foldable trimStr >>> >>> The case # 2 seems a rare case, and # 3 is probably the most common >>> case. Once we see the second case, we could outp

Re: comparable and orderable CalendarInterval

2020-02-11 Thread Wenchen Fan
What's your use case to compare intervals? It's tricky in Spark as there is only one interval type and you can't really compare one month with 30 days. On Wed, Feb 12, 2020 at 12:01 AM Enrico Minack wrote: > Hi Devs, > > I would like to know what is the current roadmap of making >
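To make the "one month vs. 30 days" point concrete, here is a minimal sketch using plain `java.time` rather than Spark's `CalendarInterval`. The object and method names are hypothetical and chosen only for illustration; the sketch shows nothing more than that the length of "one month" in days depends on the start date.

```scala
import java.time.LocalDate
import java.time.temporal.ChronoUnit

// Hypothetical helper, not part of Spark: measures how many days
// "one month" covers when it is added to a given start date.
object IntervalAmbiguity {
  def daysInOneMonthFrom(start: LocalDate): Long =
    ChronoUnit.DAYS.between(start, start.plusMonths(1))

  def main(args: Array[String]): Unit = {
    val starts = Seq(
      LocalDate.of(2020, 1, 1),
      LocalDate.of(2020, 2, 1),
      LocalDate.of(2020, 4, 1)
    )
    // The same "1 month" interval yields a different day count per start date.
    starts.foreach { s =>
      println(s"$s + 1 month = ${daysInOneMonthFrom(s)} days")
    }
  }
}
```

Because the day count varies with the anchor date, an ordering between "1 month" and "30 days" cannot be defined without fixing a reference date or splitting the type.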

Re: [DISCUSSION] Avoiding duplicate work

2020-02-21 Thread Wenchen Fan
The JIRA ticket will show the linked PR if there are any, which indicates that someone is working on it if the PR is active. Maybe the bot should also leave a comment on the JIRA ticket to make it clearer? On Fri, Feb 21, 2020 at 10:54 PM younggyu Chun wrote: > Hi All, > > I would like to

Re: [DISCUSSION] Esoteric Spark function `TRIM/LTRIM/RTRIM`

2020-02-15 Thread Wenchen Fan
It's unfortunate that we don't have a clear document to talk about breaking changes (I'm working on it BTW). I believe the general guidance is: *avoid breaking changes unless we have to*. For example, the previous result was so broken that we have to fix it, moving to Scala 2.12 makes us have to

Re: [DISCUSS] naming policy of Spark configs

2020-02-12 Thread Wenchen Fan
t;>>> The new policy looks clear to me. +1 for the explicit policy. >>>> >>>> So, are we going to revise the existing conf names before 3.0.0 release? >>>> >>>> Or, is it applied to new up-coming configurations from now? >>>> >&

Re: How executor Understand which RDDs needed to be persist from the submitted Task

2020-01-09 Thread Wenchen Fan
RDD has a flag `storageLevel` which will be set by calling persist(). RDD will be serialized and sent to executors for running tasks. So executors just look at RDD.storageLevel and store output in its block manager when needed. On Thu, Jan 9, 2020 at 5:53 PM Jack Kolokasis wrote: > Hello all, >

Re: How executor Understand which RDDs needed to be persist from the submitted Task

2020-01-09 Thread Wenchen Fan
> > Iacovos > On 1/9/20 5:03 PM, Wenchen Fan wrote: > > RDD has a flag `storageLevel` which will be set by calling persist(). RDD > will be serialized and sent to executors for running tasks. So executors > just look at RDD.storageLevel and store output in its block manager wh

Re: Question about Datasource V2

2020-01-13 Thread Wenchen Fan
1. we plan to add view support in future releases. 2. can you open a JIRA ticket? This seems like a bug to me. 3. instead of defining a lot of fields in the table, we decide to use properties to keep all the extra information. We've defined some reserved properties like "comment", "location",

Re: [VOTE] Release Apache Spark 2.4.5 (RC1)

2020-01-15 Thread Wenchen Fan
Recently we merged several fixes to 2.4: https://issues.apache.org/jira/browse/SPARK-30325 a driver hang issue https://issues.apache.org/jira/browse/SPARK-30246 a memory leak issue https://issues.apache.org/jira/browse/SPARK-29708 a correctness issue(for a rarely used feature, so not merged

Re: Correctness and data loss issues

2020-01-21 Thread Wenchen Fan
I think we need to go through them during the 3.0 QA period, and try to fix the valid ones. For example, the first ticket should be fixed already in https://issues.apache.org/jira/browse/SPARK-28344 On Mon, Jan 20, 2020 at 2:07 PM Dongjoon Hyun wrote: > Hi, All. > > According to our policy,

Re: [DISCUSS] Support year-month and day-time Intervals

2020-01-16 Thread Wenchen Fan
The proposal makes sense to me. If we are not going to make interval type ANSI-compliant in this release, we should not expose it widely. Thanks for driving it, Kent! On Fri, Jan 17, 2020 at 10:52 AM Dr. Kent Yao wrote: > Following ANSI might be a good option but also a serious user behavior >

Re: [Discuss] Metrics Support for DS V2

2020-01-17 Thread Wenchen Fan
I think there are a few details we need to discuss. how frequently a source should update its metrics? For example, if file source needs to report size metrics per row, it'll be super slow. what metrics a source should report? data size? numFiles? read time? shall we show metrics in SQL web UI

Re: [DISCUSS] Revert and revisit the public custom expression API for partition (a.k.a. Transform API)

2020-01-16 Thread Wenchen Fan
The DS v2 project is still evolving so half-baked is inevitable sometimes. This feature is definitely in the right direction to allow more flexible partition implementations, but there are a few problems we can discuss. About expression duplication. This is an existing design choice. We don't

Re: Spark 3.0 branch cut and code freeze on Jan 31?

2019-12-23 Thread Wenchen Fan
Sounds good! On Tue, Dec 24, 2019 at 7:48 AM Reynold Xin wrote: > We've pushed out 3.0 multiple times. The latest release window documented > on the website says we'd > code freeze and cut branch-3.0 early Dec. It looks like we are suffering a >

Re: Fw:Re: [VOTE][SPARK-29018][SPIP]:Build spark thrift server based on protocol v11

2019-12-29 Thread Wenchen Fan
+1 for the new thrift server to get rid of the Hive dependencies! On Mon, Dec 23, 2019 at 7:55 PM Yuming Wang wrote: > I'm +1 for this SPIP for these two reasons: > > 1. The current thriftserver has some issues that are not easy to solve, > such as: SPARK-28636

Re: Release Apache Spark 2.4.5

2020-01-05 Thread Wenchen Fan
+1 On Mon, Jan 6, 2020 at 12:02 PM Jungtaek Lim wrote: > +1 to have another Spark 2.4 release, as Spark 2.4.4 was released in 4 > months old and there's release window for this. > > On Mon, Jan 6, 2020 at 12:38 PM Hyukjin Kwon wrote: > >> Yeah, I think it's nice to have another maintenance

Re: [DISCUSS] Support subdirectories when accessing partitioned Parquet Hive table

2020-01-06 Thread Wenchen Fan
Isn't your directory structure malformed? The directory name under the table path should be in the form of "partitionCol=value". And AFAIK this is the Hive standard. On Mon, Jan 6, 2020 at 6:59 PM Lotkowski, Michael wrote: > Hi all, > > > > Reviving this thread, we still have this issue and
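As a rough illustration of the "partitionCol=value" directory form mentioned above, the following sketch checks a relative path and extracts the column/value pairs. `PartitionPath.parse` is a hypothetical helper, not a Spark or Hive API, and it ignores details such as value escaping.

```scala
// Hypothetical helper, not a Spark or Hive API: extracts (column, value)
// pairs from path segments shaped like "partitionCol=value".
object PartitionPath {
  def parse(relativePath: String): Either[String, Seq[(String, String)]] = {
    val segments = relativePath.split("/").filter(_.nonEmpty).toSeq
    // Each segment must split into a non-empty column and a non-empty value.
    val parsed = segments.map { seg =>
      seg.split("=", 2) match {
        case Array(col, value) if col.nonEmpty && value.nonEmpty =>
          Right(col -> value)
        case _ =>
          Left(s"not in partitionCol=value form: $seg")
      }
    }
    // Report the first malformed segment, otherwise return all pairs in order.
    parsed.collectFirst { case Left(err) => err } match {
      case Some(err) => Left(err)
      case None      => Right(parsed.collect { case Right(kv) => kv })
    }
  }
}
```

Under this sketch, a layout such as `year=2020/month=01` is accepted, while plain subdirectories such as `2020/01` are rejected as malformed.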

Re: [SPARK-30319][SQL] Add a stricter version of as[T]

2020-01-07 Thread Wenchen Fan
I think it's simply because as[T] is lazy. You will see the right schema if you do `df.as[T].map(identity)`. On Tue, Jan 7, 2020 at 4:42 PM Enrico Minack wrote: > Hi Devs, > > I'd like to propose a stricter version of as[T]. Given the interface def > as[T](): Dataset[T], it is

Re: [VOTE] Amend Spark's Semantic Versioning Policy

2020-03-09 Thread Wenchen Fan
+1 (binding), assuming that this is for public stable APIs, not APIs that are marked as unstable, evolving, etc. On Mon, Mar 9, 2020 at 1:10 AM Ismaël Mejía wrote: > +1 (non-binding) > > Michael's section on the trade-offs of maintaining / removing an API are > one of > the best reads I have

Re: [VOTE] Apache Spark 3.0.0 RC1

2020-04-09 Thread Wenchen Fan
The ongoing critical issues I'm aware of are: SPARK-31257 : Fix ambiguous two different CREATE TABLE syntaxes SPARK-31404 : backward compatibility issues after switching to Proleptic Gregorian

Re: DSv2 & DataSourceRegister

2020-04-07 Thread Wenchen Fan
Are you going to provide a single artifact for Spark 2.4 and 3.0? I'm not sure this is possible as the DS V2 API is very different in 3.0, e.g. there is no `DataSourceV2` anymore, and you should implement `TableProvider` (if you don't have database/table). On Wed, Apr 8, 2020 at 6:58 AM Andrew

Re: DSv2 & DataSourceRegister

2020-04-08 Thread Wenchen Fan
llo > > On Tue, Apr 7, 2020 at 23:16 Wenchen Fan wrote: > >> Are you going to provide a single artifact for Spark 2.4 and 3.0? I'm not >> sure this is possible as the DS V2 API is very different in 3.0, e.g. there >> is no `DataSourceV2` anymore, and you should implement `T

Re: [VOTE] Apache Spark 3.0.0 RC1

2020-03-31 Thread Wenchen Fan
Yea, release candidates are different from the preview version, as release candidates are not official releases, so they won't appear in Maven Central, can't be downloaded from the official Spark website, etc. On Wed, Apr 1, 2020 at 12:32 PM Sean Owen wrote: > These are release candidates, not

Re: BUG: take with SparkSession.master[url]

2020-03-26 Thread Wenchen Fan
Which Spark/Scala version do you use? On Fri, Mar 27, 2020 at 1:24 PM Zahid Rahman wrote: > > with the following sparksession configuration > > val spark = SparkSession.builder().master("local[*]").appName("Spark Session > take").getOrCreate(); > > this line works > > flights.filter(flight_row

Re: BUG: take with SparkSession.master[url]

2020-03-27 Thread Wenchen Fan
g from maven. > > Backbutton.co.uk > ¯\_(ツ)_/¯ > ♡۶Java♡۶RMI ♡۶ > Make Use Method {MUM} > makeuse.org > <http://www.backbutton.co.uk> > > > On Fri, 27 Mar 2020 at 05:45, Wenchen Fan wrote: > >> Which Spark/Scala version do you use? >

Re: Programmatic: parquet file corruption error

2020-03-27 Thread Wenchen Fan
Running a Spark application from an IDE is not officially supported. It may work in some cases but there is no guarantee at all. The official way is to run interactive queries with spark-shell or package your application to a jar and use spark-submit. On Thu, Mar 26, 2020 at 4:12 PM Zahid Rahman

Re: Release Manager's official `branch-3.0` Assessment?

2020-03-29 Thread Wenchen Fan
I agree that we can cut the RC anyway even if there are blockers, to move us to a more official "code freeze" status. About the CREATE TABLE unification, it's still WIP and not close-to-merge yet. Can we fix some specific problems like CREATE EXTERNAL TABLE surgically and leave the unification to

Re: [DISCUSS] Java specific APIs design concern and choice

2020-04-27 Thread Wenchen Fan
IIUC We are moving away from having 2 classes for Java and Scala, like JavaRDD and RDD. It's much simpler to maintain and use with a single class. I don't have a strong preference over option 3 or 4. We may need to collect more data points from actual users. On Mon, Apr 27, 2020 at 9:50 PM

Re: Error while reading hive tables with tmp/hidden files inside partitions

2020-04-22 Thread Wenchen Fan
This looks like a bug that path filter doesn't work for hive table reading. Can you open a JIRA ticket? On Thu, Apr 23, 2020 at 3:15 AM Dhrubajyoti Hati wrote: > Just wondering if any one could help me out on this. > > Thank you! > > > > > *Regards,Dhrubajyoti Hati.* > > > On Wed, Apr 22, 2020

Re: Error while reading hive tables with tmp/hidden files inside partitions

2020-04-23 Thread Wenchen Fan
ards,Dhrubajyoti Hati.* > > > On Thu, Apr 23, 2020 at 10:12 AM Wenchen Fan wrote: > >> This looks like a bug that path filter doesn't work for hive table >> reading. Can you open a JIRA ticket? >> >> On Thu, Apr 23, 2020 at 3:15 AM Dhrubajyoti Hati >

Re: is there any tool to visualize the spark physical plan or spark plan

2020-04-30 Thread Wenchen Fan
Does the Spark SQL web UI work for you? https://spark.apache.org/docs/3.0.0-preview/web-ui.html#sql-tab On Thu, Apr 30, 2020 at 5:30 PM Manu Zhang wrote: > Hi Kelly, > > If you can parse event log, then try listening on > `SparkListenerSQLExecutionStart` event and build a `SparkPlanGraph` like

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-05-11 Thread Wenchen Fan
SPARK-30098 was merged about 6 months ago. It's not a clean revert and we may need to spend quite a bit of time to resolve conflicts and fix tests. I don't see why it's still a problem if a feature is disabled and hidden from end-users (it's undocumented, the config is internal). The related code

Re: Inconsistent schema on Encoders.bean (reported issues from user@)

2020-05-10 Thread Wenchen Fan
99|146327314953| > |18995|243603134985| > |18991|476309451025| > |18993|287916490001| > |18998|324427845137| > |18992|412640801297| > |18994|302012976401| > +-++ > ... > > This can happen with such inconsistent schemas because State in Structured > Streaming doesn't check the schema (both

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-05-08 Thread Wenchen Fan
y seems to be >>>> different), and once we notice the issue it would be really odd if we >>>> publish it as it is, and try to fix it later (the fix may not be even >>>> included in 3.0.x as it might bring behavioral change). >>>> >>>>

Re: [DatasourceV2] Allowing Partial Writes to DSV2 Tables

2020-05-13 Thread Wenchen Fan
I think we already have this table capability: ACCEPT_ANY_SCHEMA. Can you try that? On Thu, May 14, 2020 at 6:17 AM Russell Spitzer wrote: > I would really appreciate that, I'm probably going to just write a planner > rule for now which matches up my table schema with the query output if they >

Re: [VOTE] Release Spark 2.4.6 (RC3)

2020-05-18 Thread Wenchen Fan
+1, no known blockers. On Mon, May 18, 2020 at 12:49 AM DB Tsai wrote: > +1 as well. Thanks. > > On Sun, May 17, 2020 at 7:39 AM Sean Owen wrote: > >> +1 , same response as to the last RC. >> This looks like it includes the fix discussed last time, as well as a >> few more small good fixes. >>

Re: [Datasource V2] Exception Handling for Catalogs - Naming Suggestions

2020-05-13 Thread Wenchen Fan
This looks a bit specific and maybe it's better to allow catalogs to customize the error message, which is more general. On Wed, May 13, 2020 at 12:16 AM Russell Spitzer wrote: > Currently the way some actions work, we receive an error during analysis > phase. For example, doing a "SELECT *

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-03-18 Thread Wenchen Fan
entation to make things be clear, but if the approach >> would be explaining the difference of rules and guide the tip to make the >> query be bound to the specific rule, the same could be applied to parser >> rule to address the root cause. >> >> >> On Wed, M

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-03-19 Thread Wenchen Fan
with Hive connected. >>>> >>>> But since we are even thinking about native syntax as a first class and >>>> dropping Hive one implicitly (hide in doc) or explicitly, does it really >>>> matter we require a marker (like "HIVE") in rule

Re: FYI: The evolution on `CHAR` type behavior

2020-03-17 Thread Wenchen Fan
I agree that Spark can define the semantic of CHAR(x) differently than the SQL standard (no padding), and ask the data sources to follow it. But the problem is, some data sources may not be able to skip padding, like the Hive serde table. On the other hand, it's easier to require padding for

Re: Spark 2.4.x and 3.x datasourcev2 api documentation & references

2020-03-18 Thread Wenchen Fan
For now you can take a look at `DataSourceV2Suite`, which contains both Java/Scala implementations. There is also an ongoing PR to implement catalog APIs for JDBC: https://github.com/apache/spark/pull/27345 We are still working on the user guide. On Mon, Mar 16, 2020 at 4:59 AM MadDoxX wrote:

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-03-18 Thread Wenchen Fan
I think the general guideline is to promote Spark's own CREATE TABLE syntax instead of the Hive one. Previously these two rules were mutually exclusive because the native syntax requires the USING clause while the Hive syntax makes ROW FORMAT or STORED AS clause optional. It's a good move to make

Re: FYI: The evolution on `CHAR` type behavior

2020-03-17 Thread Wenchen Fan
OK let me put a proposal here: 1. Permanently ban CHAR for native data source tables, and only keep it for Hive compatibility. It's OK to forget about padding like what Snowflake and MySQL have done. But it's hard for Spark to require consistent behavior about CHAR type in all data sources. Since

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-03-18 Thread Wenchen Fan
ry but I think we are making bad assumption on end users which is a > serious problem. > > If we really want to promote Spark's one for CREATE TABLE, then would it > really matter to treat Hive CREATE TABLE be "exceptional" one and try to > isolate each other? What's the point of

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-03-24 Thread Wenchen Fan
Hi Ryan, It's great to hear that you are cleaning up this long-standing mess. Please let me know if you hit any problems that I can help with. Thanks, Wenchen On Sat, Mar 21, 2020 at 3:16 AM Nicholas Chammas wrote: > On Thu, Mar 19, 2020 at 3:46 AM Wenchen Fan wrote: > >> 2.

Re: [DISCUSS] Null-handling of primitive-type of untyped Scala UDF in Scala 2.12

2020-03-17 Thread Wenchen Fan
I don't think option 1 is possible. For option 2: I think we need to do it anyway. It's kind of a bug that the typed Scala UDF doesn't support case classes and thus can't support struct-type input columns. For option 3: It's a bit risky to add a new API but seems like we have a good reason. The

Re: Inconsistent schema on Encoders.bean (reported issues from user@)

2020-05-08 Thread Wenchen Fan
Can you give some simple examples to demonstrate the problem? I think the inconsistency would bring problems but don't know how. On Fri, May 8, 2020 at 3:49 PM Jungtaek Lim wrote: > (bump to expose the discussion to more readers) > > On Mon, May 4, 2020 at 4:57 PM Jungtaek Lim > wrote: > >> Hi

Re: [ANNOUNCE] Announcing Apache Spark 3.0.1

2020-09-11 Thread Wenchen Fan
Great work, thanks, Ruifeng! On Fri, Sep 11, 2020 at 11:09 PM Gengliang Wang < gengliang.w...@databricks.com> wrote: > Congrats! > Thanks for the work, Ruifeng! > > > On Fri, Sep 11, 2020 at 9:51 PM Takeshi Yamamuro > wrote: > >> Congrats and thanks, Ruifeng! >> >> >> On Fri, Sep 11, 2020 at

Re: [VOTE] Release Spark 2.4.7 (RC3)

2020-09-09 Thread Wenchen Fan
I checked https://repository.apache.org/content/repositories/orgapachespark-1361/ , it says the Signature Validation failed. Prashant, can you double-check your gpg key and make sure it's uploaded to public key servers like the following? http://pool.sks-keyservers.net:11371

Re: [VOTE] Release Spark 2.4.7 (RC3)

2020-09-09 Thread Wenchen Fan
keyserver. > > Regards, > Mridul > > [1] wget https://dist.apache.org/repos/dist/dev/spark/KEYS -O - | gpg > --import > > On Wed, Sep 9, 2020 at 8:03 PM Wenchen Fan wrote: > >> I checked >> https://repository.apache.org/content/repositories/orgapachespark-13

Re: [VOTE] Release Spark 2.4.7 (RC3)

2020-09-10 Thread Wenchen Fan
details. >> >> I have now updated the key in those keyservers. Now, how do I refresh >> nexus? >> >> Thanks, >> >> On Thu, Sep 10, 2020 at 9:13 AM Sean Owen wrote: >> >>> Yes I can do that and I am sure it's fine, but why has it been visible >>

Re: [VOTE][SPARK-30602] SPIP: Support push-based shuffle to improve shuffle efficiency

2020-09-15 Thread Wenchen Fan
+1 On Tue, Sep 15, 2020 at 2:42 PM Dongjoon Hyun wrote: > +1 > > Bests, > Dongjoon. > > On Mon, Sep 14, 2020 at 9:19 PM kalyan wrote: > >> +1 >> >> Will positively improve the performance and reliability of spark... >> Looking fwd to this.. >> >> Regards >> Kalyan. >> >> On Tue, Sep 15, 2020,

Re: SPIP: Catalog API for view metadata

2020-09-03 Thread Wenchen Fan
Any updates here? I agree that a new View API is better, but we need a solution to avoid performance regression. We need to elaborate on the cache idea. On Thu, Aug 20, 2020 at 7:43 AM Ryan Blue wrote: > I think it is a good idea to keep tables and views separate. > > The main two arguments

Re: SQL DDL statements with replacing default catalog with custom catalog

2020-10-07 Thread Wenchen Fan
Not all the DDL commands support v2 catalog APIs (e.g. CREATE TABLE LIKE), so it's possible that some commands still go through the v1 session catalog although you configured a custom v2 session catalog. Can you create JIRA tickets if you hit any DDL commands that don't support v2 catalog? We

Re: SQL DDL statements with replacing default catalog with custom catalog

2020-10-07 Thread Wenchen Fan
n the length of identifier, > so things that work with custom catalog no longer work when it replaces > default session catalog. > > On Wed, Oct 7, 2020 at 6:05 PM Wenchen Fan wrote: > >> Ah, this is by design. V1 tables should still go through the v1 session >> catalog.

Re: SQL DDL statements with replacing default catalog with custom catalog

2020-10-07 Thread Wenchen Fan
table, and then the catalyst goes with v1 exec. I guess all commands > leveraging TempViewOrV1Table to determine whether the table is v1 vs v2 > would all suffer from this issue. > > On Wed, Oct 7, 2020 at 5:45 PM Wenchen Fan wrote: > >> Not all the DDL commands support v2

Re: Official support of CREATE EXTERNAL TABLE

2020-10-07 Thread Wenchen Fan
log >> implementation. >> >> I don’t think that there is a good reason to force catalogs to break >> compatibility with Hive SQL, while making it appear as though DDL is >> compatible. Because removing EXTERNAL would be a breaking change to the >> SQL parser, I thin

Re: Official support of CREATE EXTERNAL TABLE

2020-10-07 Thread Wenchen Fan
a Hive-compatible catalog. A great recent > example is Nessie <https://projectnessie.org/tools/hive/>, which enables > branching and tagging table states across several table formats and aims to > be compatible with Hive. > > On Wed, Oct 7, 2020 at 5:51 AM Wenchen Fan wrote: > >

Re: Official support of CREATE EXTERNAL TABLE

2020-10-07 Thread Wenchen Fan
nk Hive compatibility itself is a “use case”. > > Why? > > Hive is an external database that defines its own behavior and with which > Spark claims to be compatible. If Hive isn’t a valid use case, then why is > EXTERNAL supported at all? > > On Wed, Oct 7, 2020 at 10:17

Re: My I report a special comparaison of executions leading on issues on Spark JIRA ?

2020-10-13 Thread Wenchen Fan
It will speed up the process a lot if a simple code snippet to reproduce the error is provided. On Sat, Oct 3, 2020 at 4:40 AM Marc Le Bihan wrote: > Yes. As I explained at the beginning of the message. > > For com/fasterxml/jackson/module/scala/ScalaObjectMapper missing > I will check myself

Official support of CREATE EXTERNAL TABLE

2020-10-06 Thread Wenchen Fan
Hi all, I'd like to start a discussion thread about this topic, as it blocks an important feature that we target for Spark 3.1: unify the CREATE TABLE SQL syntax. A bit more background for CREATE EXTERNAL TABLE: it's kind of a hidden feature in Spark for Hive compatibility. When you write

Re: SPIP: Catalog API for view metadata

2020-08-18 Thread Wenchen Fan
en >> though executing the view in Hive results in data that has the most recent >> schema when underlying tables evolve -- so newly added nested field data >> shows up in the view evaluation query result but not in the view schema). >> >> On Fri, Aug 14, 2020 at 2

Re: [SparkSql] Casting of Predicate Literals

2020-08-19 Thread Wenchen Fan
CAST(short_col AS LONG) < 1000, can we still push down "short_col < 1000" > without the cast? > > On Tue, Aug 4, 2020 at 6:55 PM Russell Spitzer > wrote: > >> Thanks! That's exactly what I was hoping for! Thanks for finding the Jira >> for me! >>

Re: [VOTE] Release Spark 2.4.7 (RC1)

2020-08-19 Thread Wenchen Fan
I think so. I don't see other bug reports for 2.4. On Thu, Aug 20, 2020 at 12:11 AM Nicholas Marion wrote: > It appears all 3 issues slated for Spark 2.4.7 have been merged. Should we > be looking at getting RC2 ready? > > > Regards, > > *NICHOLAS T. MARION * > IBM Open Data Analytics for z/OS

Re: [VOTE] Release Spark 2.4.6 (RC8)

2020-05-31 Thread Wenchen Fan
+1 (binding), although I don't know why we jump from RC 3 to RC 8... On Mon, Jun 1, 2020 at 7:47 AM Holden Karau wrote: > Please vote on releasing the following candidate as Apache Spark > version 2.4.6. > > The vote is open until June 5th at 9AM PST and passes if a majority +1 PMC > votes are

Re: [VOTE] Apache Spark 3.0 RC2

2020-05-20 Thread Wenchen Fan
Seems the priority of SPARK-31706 is incorrectly marked, and it's a blocker now. The fix was merged just a few hours ago. This should be a -1 for RC2. On Wed, May 20, 2020 at 2:42 PM rickestcode wrote: > +1 > > > > -- > Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/ > >

Re: [DISCUSS] preferred behavior when fails to instantiate configured v2 session catalog

2020-10-26 Thread Wenchen Fan
+1 to fail fast. Thanks for reporting this, Jungtaek! On Mon, Oct 26, 2020 at 8:36 AM Jungtaek Lim wrote: > Yeah I'm in favor of fast-fail if things are not working out as end users > intended. Spark should only fail back when it doesn't make any difference > but only some sort of performance.

Re: [SparkSql] Casting of Predicate Literals

2020-08-04 Thread Wenchen Fan
I think this is not a problem in 3.0 anymore, see https://issues.apache.org/jira/browse/SPARK-27638 On Wed, Aug 5, 2020 at 12:08 AM Russell Spitzer wrote: > I've just run into this issue again with another user and I feel like most > folks here have seen some flavor of this at some point. > >

Re: SPIP: Catalog API for view metadata

2020-08-12 Thread Wenchen Fan
Hi John, Thanks for working on this! View support is very important to the catalog plugin API. After reading your doc, I have one high-level question: should view be a separated API or it's just a special type of table? AFAIK in most databases, tables and views share the same namespace. You

Re: [VOTE] Decommissioning SPIP

2020-07-02 Thread Wenchen Fan
+1 On Fri, Jul 3, 2020 at 12:06 AM DB Tsai wrote: > +1 > > On Thu, Jul 2, 2020 at 8:59 AM Ryan Blue > wrote: > >> +1 >> >> On Thu, Jul 2, 2020 at 8:00 AM Dongjoon Hyun >> wrote: >> >>> +1. >>> >>> Thank you, Holden. >>> >>> Bests, >>> Dongjoon. >>> >>> On Thu, Jul 2, 2020 at 6:43 AM wuyi

Re: [PSA] Apache Spark uses GitHub Actions to run the tests

2020-07-14 Thread Wenchen Fan
To clarify, we need to wait for: 1. Java documentation build test in GitHub Actions 2. dependency test in GitHub Actions 3. either GitHub Actions all green or Jenkins passing. If the PR touches Kinesis, or it uses other profiles, we must wait for Jenkins to pass. Am I missing something? On Tue, Jul 14,

Re: Welcoming some new Apache Spark committers

2020-07-15 Thread Wenchen Fan
Congrats and welcome! On Wed, Jul 15, 2020 at 2:18 PM Mridul Muralidharan wrote: > > Congratulations ! > > Regards, > Mridul > > On Tue, Jul 14, 2020 at 12:37 PM Matei Zaharia > wrote: > >> Hi all, >> >> The Spark PMC recently voted to add several new committers. Please join >> me in welcoming

Re: [DISCUSS] Apache Spark 3.0.1 Release

2020-07-15 Thread Wenchen Fan
Yea I think 2.4.7 is good to go. Let's start! On Wed, Jul 15, 2020 at 1:50 PM Prashant Sharma wrote: > Hi Folks, > > So, I am back, and searched the JIRAS with target version as "2.4.7" and > Resolved, found only 2 jiras. So, are we good to go, with just a couple of > jiras fixed ? Shall I

Re: [DISCUSS] Apache Spark 3.0.1 Release

2020-07-15 Thread Wenchen Fan
not done for Spark 2.4.6 because it was too late on the vote > process but it makes perfect sense to have this in 2.4.7. > > On Wed, Jul 15, 2020 at 9:07 AM Wenchen Fan wrote: > > > > Yea I think 2.4.7 is good to go. Let's start! > > > > On Wed, Jul 15, 2020 at 1:50 PM Pr

Re: [DISCUSS] Apache Spark 3.0.1 Release

2020-06-30 Thread Wenchen Fan
Hi Jason, Thanks for reporting! https://issues.apache.org/jira/browse/SPARK-32136 looks like a breaking change and we should investigate. On Wed, Jul 1, 2020 at 11:31 AM Holden Karau wrote: > I can take care of 2.4.7 unless someone else wants to do it. > > On Tue, Jun 30, 2020 at 8:29 PM Jason

Re: Datasource with ColumnBatchScan support.

2020-06-17 Thread Wenchen Fan
If you already have your own `FileFormat` implementation: just override the `supportBatch` method. On Tue, Jun 16, 2020 at 5:39 AM Nasrulla Khan Haris wrote: > HI Spark developers, > > > > FileSourceScanExec >

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2020-06-24 Thread Wenchen Fan
Shall we start a new thread to discuss the bundled Hadoop version in PySpark? I don't have a strong opinion on changing the default, as users can still download the Hadoop 2.7 version. On Thu, Jun 25, 2020 at 2:23 AM Dongjoon Hyun wrote: > To Xiao. > Why Apache project releases should be

Re: Quick sync: what goes in migration guide vs release notes?

2020-06-10 Thread Wenchen Fan
My 2 cents: Since we have a migration guide, I think people who hit problems when upgrading Spark will read it. We should mention all the breaking changes there, except for trivial ones like obvious bug fixes. Even if there is no meaningful migration to guide for things like removing a deprecated

Re: [vote] Apache Spark 3.0 RC3

2020-06-09 Thread Wenchen Fan
+1 (binding) On Tue, Jun 9, 2020 at 6:15 PM Dr. Kent Yao wrote: > +1 (non-binding) > > > > -- > Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/ > > - > To unsubscribe e-mail:

Re: Quick sync: what goes in migration guide vs release notes?

2020-06-10 Thread Wenchen Fan
> They aren't anywhere then (3.0 is done, so not the migration guide). Some >> are important. >> Change could be OK but how about proposing this going forward? >> >> >> On Wed, Jun 10, 2020 at 10:35 AM Wenchen Fan wrote: >> >>> My 2 cents: >>

Re: Quick sync: what goes in migration guide vs release notes?

2020-06-11 Thread Wenchen Fan
these > accomplishes that. That's valuable, but is what a summary blog is for. > > I can't feel strongly about this, so, would just say, propose process > changes for 3.1 and codify in the contributing guide but stick with what we > have for 3.0. > > > On Wed, Jun 10, 2020 at 10

Re: InterpretedUnsafeProjection - error in getElementSize

2020-07-24 Thread Wenchen Fan
Can you create a JIRA ticket? There are many people happy to help to fix it. On Tue, Jul 21, 2020 at 9:49 PM Janda Martin wrote: > Hi, > I think that I found error in > InterpretedUnsafeProjection::getElementSize. This method differs from > similar implementation in GenerateUnsafeProjection.

Re: [DISCUSS] -1s and commits

2020-07-16 Thread Wenchen Fan
It looks like there are two topics: 1. PRs with -1 2. PRs with someone asking to wait for certain days. Holden, it seems you are hitting 2? I think 2 can be problematic if there are people who keep asking to wait, and block the PR indefinitely. But if it's only asked once, this seems OK. BTW,

Re: Catalog API for Partition

2020-07-17 Thread Wenchen Fan
In Hive, partition does two things: 1. Act as an index to speed up data scan 2. Act as a way to manage the data. People can add/drop partitions. How do you unify these 2 things in your API design? On Fri, Jul 17, 2020 at 12:03 AM JackyLee wrote: > Hi devs, > > In order to support Partition

Re: [VOTE] Update the committer guidelines to clarify when to commit changes.

2020-07-30 Thread Wenchen Fan
+1, thanks for driving it, Holden! On Fri, Jul 31, 2020 at 10:24 AM Holden Karau wrote: > +1 from myself :) > > On Thu, Jul 30, 2020 at 2:53 PM Jungtaek Lim > wrote: > >> +1 (non-binding, I guess) >> >> Thanks for raising the issue and sorting it out! >> >> On Fri, Jul 31, 2020 at 6:47 AM

Re: Catalog API for Partition

2020-07-20 Thread Wenchen Fan
Yea we don't want the partitions to be Hive-specific. My point is, we call it "Partition Catalog APIs", which makes me confused about the relationship between this and the "partitions" in `TableCatalog.createTable`. Are these two orthogonal? Or you kind of unify them? On Sat, Jul 18, 2020 at

Re: SPIP: Catalog API for view metadata

2020-08-14 Thread Wenchen Fan
"dual" catalog. >>>>> - The implementation for a "dual" catalog plugin should ensure: >>>>> - Creating a view in view catalog when a table of the same >>>>> name exists should fail. >>>>> - Creating a table i

Re: [VOTE] Release Spark 3.1.0 (RC1)

2021-01-06 Thread Wenchen Fan
I agree with Jungtaek that people are likely to be biased when testing 3.1.0. At least this will not be the same community-blessed release as previous ones, because the voting is already affected by the fact that 3.1.0 is already in maven central. Skipping 3.1.0 sounds better to me. On Thu, Jan

Re: How to convert InternalRow to Row.

2020-11-27 Thread Wenchen Fan
InternalRow is an internal/developer API that might change over time. Right now, the way to convert it to Row is to use `RowEncoder`, but you need to know the data schema: val encoder = RowEncoder(schema) val row = encoder.fromRow(internalRow) On Fri, Nov 27, 2020 at 6:16 AM Jason Jun wrote: >

Re: How to convert InternalRow to Row.

2020-11-30 Thread Wenchen Fan
748) > --- > Any idea about this error? > > Thanks > Jason > > On Mon, 30 Nov 2020 at 19:34, Jia, Ke A wrote: > >> The fromRow method is removed in spark3.0. And the new API is : >> >> val encoder = RowEncoder(schema) >> >> v

Re: FYI: SPARK-30098 Use default datasource as provider for CREATE TABLE syntax

2020-12-01 Thread Wenchen Fan
I'm reviving this thread because this feature was reverted before the 3.0 release, and now we are trying to add it back since the CREATE TABLE syntax is unified. The benefits are pretty clear: CREATE TABLE by default (without USING or STORED AS) should create native tables that work best with

Re: SPIP: Catalog API for view metadata

2020-11-09 Thread Wenchen Fan
iMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing> > has been updated. Please review. > > On Thu, Sep 3, 2020 at 9:22 AM John Zhuge wrote: > >> Wenchen, sorry for the delay, I will post an update shortly. >> >> On Thu, Sep 3, 2020 at 2:00 AM Wenchen Fan wrote: >>

Re: [VOTE] Standardize Spark Exception Messages SPIP

2020-11-05 Thread Wenchen Fan
+1 On Fri, Nov 6, 2020 at 12:56 PM kalyan wrote: > +1 > > On Fri, Nov 6, 2020, 5:58 AM Matei Zaharia > wrote: > >> +1 >> >> Matei >> >> > On Nov 5, 2020, at 10:25 AM, EveLiao wrote: >> > >> > +1 >> > Thanks! >> > >> > >> > >> > -- >> > Sent from:

Re: [VOTE] Release Spark 3.1.1 (RC1)

2021-01-22 Thread Wenchen Fan
ing a performance regression in some TPC-DS queries > (q88 for instance) that is caused by a recent commit in 3.1, highly likely > in the period from 19th November, 2020 to 18th December, 2020. > > Maxim Gekk > > Software Engineer > > Databricks, Inc. > > > On Fri, Jan 22,

Re: [VOTE] Release Spark 3.1.1 (RC1)

2021-01-21 Thread Wenchen Fan
-1 as I just found a regression in 3.1. A self-join query works well in 3.0 but fails in 3.1. It's being fixed at https://github.com/apache/spark/pull/31287 On Fri, Jan 22, 2021 at 4:34 AM Tom Graves wrote: > +1 > > built from tarball, verified sha and regular CI and tests all pass. > > Tom > >

Re: [VOTE] Release Spark 2.4.8 (RC4)

2021-05-11 Thread Wenchen Fan
+1 On Tue, May 11, 2021 at 2:59 AM Holden Karau wrote: > +1 - pip install with Py 2.7 works (with the understandable warnings > regarding Python 2.7 no longer being maintained). > > On Mon, May 10, 2021 at 11:18 AM sarutak wrote: > > > > +1 (non-binding) > > > > - Kousuke > > > > > It looks
