Late +1
On Sat, Feb 13 2021 at 2:49 PM, Liang-Chi Hsieh < vii...@gmail.com > wrote:
>
>
>
> Hi devs,
>
>
>
> Thanks for all the inputs. I think overall the input in the Spark community
> about having the RocksDB state store as an external module is positive. Then
> let's go forward with this direction
Enrico - do feel free to reopen the PRs or email people directly, unless you
are told otherwise.
On Thu, Feb 18, 2021 at 9:09 AM, Nicholas Chammas < nicholas.cham...@gmail.com
> wrote:
>
> On Thu, Feb 18, 2021 at 10:34 AM Sean Owen < sro...@gmail.com > wrote:
>
>
>>
+1 Correctness issues are serious!
On Wed, Feb 24, 2021 at 11:08 AM, Mridul Muralidharan < mri...@gmail.com >
wrote:
>
> That is indeed cause for concern.
> +1 on extending the voting deadline until we finish investigation of this.
>
> Regards,
> Mridul
>
>
>
> On Wed, Feb 24, 2021
I don't think we should deprecate existing APIs.
Spark's own Python API is relatively stable and not difficult to support. It
has a pretty large number of users and a lot of existing code, and it's also
pretty easy for data engineers to learn.
The pandas API is great for data science, but isn't that great for some
+1. Would open up a huge persona for Spark.
On Fri, Mar 26 2021 at 11:30 AM, Bryan Cutler < cutl...@gmail.com > wrote:
>
> +1 (non-binding)
>
>
> On Fri, Mar 26, 2021 at 9:49 AM Maciej < mszymkiew...@gmail.com > wrote:
>
>
>> +1 (nonbinding)
>>
>>
>>
>> On 3/26/21 3:52 PM, Hyukjin Kwon wr
+1
On Thu, Oct 07, 2021 at 11:54 PM, Yuming Wang < wgy...@gmail.com > wrote:
>
> +1 (non-binding).
>
>
> On Fri, Oct 8, 2021 at 1:02 PM Dongjoon Hyun < dongjoon.h...@gmail.com > wrote:
>
>
>> +1 for Apache Spark 3.2.0 RC7.
>>
>>
>> It looks good to me. I te
Read up on Unsafe here: https://mechanical-sympathy.blogspot.com/
On Sat, Oct 16, 2021 at 12:41 AM, Rohan Bajaj < rohanbaja...@gmail.com > wrote:
>
> In 2015 Reynold Xin made improvements to Spark and it was basically moving
> some structures that were on the java heap and movin
tl;dr: there's no easy way to implement aggregate expressions that require
multiple passes over the data. It's simply not something that's supported, and
doing so would come at a very high cost.
Would you be OK using approximate percentile? That's relatively cheap.
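Something like this (a rough sketch; percentile_approx assumes Spark 3.1+):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(100000).withColumn("v", F.rand())

# Single pass over the data; the last argument trades accuracy for memory.
df.select(F.percentile_approx("v", 0.95, 10000).alias("p95")).show()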
On Mon, Dec 13, 2021 at 6:43 PM, Nichola
This is why RoundRobinPartitioning shouldn't be used ...
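For context, it's repartition without partitioning expressions that produces
RoundRobinPartitioning (a minimal sketch):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1000)

rr = df.repartition(8)            # no partition expressions -> RoundRobinPartitioning
hashed = df.repartition(8, "id")  # with expressions -> deterministic hash partitioning
rr.explain()                      # the exchange node shows RoundRobinPartitioning(8)

If the input isn't deterministic, a retried map task can round-robin rows
differently, which is roughly how the duplication/loss described in the JIRA
arises.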
On Sat, Mar 12, 2022 at 12:08 PM, Jason Xu < jasonxu.sp...@gmail.com > wrote:
>
> Hi Spark community,
>
> I reported a data correctness issue in
> https://issues.apache.org/jira/browse/SPARK-38388
Nice! Going to order a few items myself ...
On Tue, Jun 14, 2022 at 7:54 PM, Gengliang Wang < ltn...@gmail.com > wrote:
>
> FYI now you can find the shopping information on
> https://spark.apache.org/community as well :)
>
>
>
> Gengliang
>
+1 super excited about this. I think it'd make Spark a lot more usable in
application development and in cloud settings:
(1) Makes it easier to embed in applications with thinner client dependencies.
(2) Easier to isolate user code vs system code in the driver.
(3) Opens up the potential to upgrade
Spark Connect :)
(It’s work in progress)
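Roughly, the idea looks like this from Python (a sketch based on the Connect
API as it later shipped in Spark 3.4; the endpoint and port are placeholders):

from pyspark.sql import SparkSession

# Each Python process creates its own thin client session
# against the same Spark Connect server.
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
spark.range(10).show()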
On Mon, Dec 12 2022 at 2:29 PM, Kevin Su < pings...@gmail.com > wrote:
>
> Hey there, How can I get the same spark context in two different python
> processes?
> Let’s say I create a context in Process A, and then I want to use python
> subprocess B to g
+1
On Thu, Jan 12, 2023 at 9:46 PM, Dongjoon Hyun < dongjoon.h...@gmail.com >
wrote:
>
> +1 for the proposal (guiding only without any code change).
>
>
> Thanks,
> Dongjoon.
>
> On Thu, Jan 12, 2023 at 9:33 PM Shixiong Zhu < zsxw...@gmail.com > wrote:
>
>
>> +1
>
+1
This is a great idea.
On Wed, Jun 21, 2023 at 8:29 AM, Holden Karau < hol...@pigscanfly.ca > wrote:
>
> I’d like to start with a +1, better Python testing tools integrated into
> the project make sense.
>
> On Wed, Jun 21, 2023 at 8:11 AM Amanda Liu < amandastephanieliu@gmail.com >
Personally I'd love this, but I agree with some of the earlier comments that
this should not be Python specific (meaning I should be able to implement a
data source in Python and then make it usable across all languages Spark
supports). I think we should find a way to make this reusable beyond Python.
+1!
On Fri, Jul 7 2023 at 11:58 AM, Holden Karau < hol...@pigscanfly.ca > wrote:
>
> +1
>
>
> On Fri, Jul 7, 2023 at 9:55 AM huaxin gao < huaxin.ga...@gmail.com > wrote:
>
>
>
>> +1
>>
>>
>> On Fri, Jul 7, 2023 at 8:59 AM Mich Talebzadeh < mich.talebza...@gmail.com
>> > wrote:
>>
>>
>>>
It should be the same as SQL. Otherwise it takes away a lot of potential future
optimization opportunities.
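To make that concrete, a small sketch (any recent PySpark; explain(True)
prints the analyzed and optimized plans):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(100).filter("id > 10").filter("id > 20").select("id")

# With SQL-table semantics, the optimizer can collapse both filters
# into one, exactly as it would for the equivalent SQL query.
df.explain(True)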
On Mon, Sep 18 2023 at 8:47 AM, Nicholas Chammas < nicholas.cham...@gmail.com >
wrote:
>
> I’ve always considered DataFrames to be logically equivalent to SQL tables
> or queries.
>
>
Why do we need this? The reason data source APIs need it is because it will be
used by very unsophisticated end users and used all the time (for each
connection / query). Shuffle is something you set up once, presumably by fairly
sophisticated admins / engineers.
On Sat, Nov 04, 2023 at 2:42 PM
+1
On Fri, Nov 24, 2023 at 10:19 PM, Dongjoon Hyun < dongjoon.h...@gmail.com >
wrote:
>
> +1
>
>
> Thanks,
> Dongjoon.
>
> On Fri, Nov 24, 2023 at 7:14 PM Ye Zhou < zhouye...@gmail.com > wrote:
>
>
>> +1(non-binding)
>>
>> On Fri, Nov 24, 2023 at 11:16 Mridul M
+1 on doing this in 3.0.
On Mon, Mar 25, 2019 at 9:31 PM, Felix Cheung < felixcheun...@hotmail.com >
wrote:
>
> I’m +1 if 3.0
>
> *From:* Sean Owen < sro...@gmail.com >
> *Sent:* Monday, March 25, 2019 6:48 PM
> *To:* Hyukjin Kwon
> *Cc:* dev; Bryan Cutler; Tak
At some point we should celebrate having the largest RC number ever in Spark ...
On Mon, Mar 25, 2019 at 9:44 PM, DB Tsai < dbt...@dbtsai.com.invalid > wrote:
>
>
>
> RC9 was just cut. Will send out another thread once the build is finished.
>
> Sincerely,
>
>
>
> DB Tsai
> ---
We have been thinking about some of these issues. Some of them are harder
to do, e.g. Spark DataFrames are fundamentally immutable, and making the
logical plan mutable is a significant deviation from the current paradigm
that might confuse the hell out of some users. We are considering building
a s
s working on it - I'd prefer
> collaborating.
>
> Note - I'm not recommending we make the logical plan mutable (as I am
> scared of that too!). I think there are other ways of handling that - but
> we can go into details later.
>
> On Tue, Mar 26, 2019 at 11:58 AM R
n Kwon < gurwls...@gmail.com > wrote:
>
>
>> BTW, I am working on the documentation related to this subject at
>> https://issues.apache.org/jira/browse/SPARK-26022 to desc
26% improvement is underwhelming if it requires massive refactoring of the
codebase. Also you can't just add the benefits up this way, because:
- Both vectorization and codegen reduce the overhead of virtual function calls
- Vectorized code is friendlier to compilers / CPUs, but requires
Yes this is known and an issue for performance. Do you have any thoughts on
how to fix this?
On Wed, Mar 27, 2019 at 4:19 PM Erik Erlandson wrote:
> I describe some of the details here:
> https://issues.apache.org/jira/browse/SPARK-27296
>
> The short version of the story is that aggregating dat
All of these options are likely to have implications for the catalyst
> systems. I'm not sure if they are minor or more substantial.
>
>
> On Wed, Mar 27, 2019 at 4:20 PM Reynold Xin < r...@databricks.com > wrote:
>
>
>> Yes this is known a
of DataType classes.
>>
>>
>> All of these options are likely to have implications for the catalyst
>> systems. I'm not sure if they are minor or more substantial.
>>
>>
>> On Wed, Mar 27, 2019 at 4:20 PM Reynold Xin < r...@databricks.com >
>
We tried enabling blacklisting for some customers and in the cloud, very
quickly they end up having 0 executors due to various transient errors. So
unfortunately I think the current implementation is terrible for cloud
deployments, and shouldn't be on by default. The heart of the issue is that t
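In the meantime, a sketch of keeping it off explicitly (spark.blacklist.enabled
was the config name at the time; later Spark versions renamed these configs
under spark.excludeOnFailure.*):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         # Transient cloud errors can otherwise blacklist every executor over time.
         .config("spark.blacklist.enabled", "false")
         .getOrCreate())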
As part of evolving the Scala language, the Scala team is considering removing
single-quote syntax for representing symbols. Single-quote syntax is one of the
ways to represent a column in Spark's DataFrame API. While I personally don't
use them (I prefer just using strings for column names, or
I just realized I didn't make my stance here very clear ... here's another
try:
I think it's a no brainer to have a good columnar UDF interface. This would
facilitate a lot of high performance applications, e.g. GPU-based accelerations
for machine learning algorithms.
On rewriting the entir
gt; Do you have design doc? I'm also interested in this topic and want to help
>>>> contribute.
>>>>
>>>> On Tue, Apr 2, 2019 at 10:00 PM Bobby Evans < bo...@apache.org > wrote:
>>>>
>>>>
Are you talking about the ones that are defined in a dictionary? If yes, that
was actually not that great in hindsight (makes it harder to read & change), so
I'm OK changing it.
E.g.
_functions = {
    'lit': _lit_doc,
    'col': 'Returns a :class:`Column` based on the given column name.',
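The explicit style would look roughly like this (a sketch of what the generated
wrappers do under the hood; it leans on pyspark internals and is untested):

from pyspark import SparkContext
from pyspark.sql.column import Column

def col(name):
    """Returns a :class:`Column` based on the given column name."""
    sc = SparkContext._active_spark_context
    return Column(sc._jvm.functions.col(name))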
For Jackson - are you worrying about JSON parsing for users or internal
Spark functionality breaking?
On Wed, Apr 17, 2019 at 6:02 PM Sean Owen wrote:
> There's only one other item on my radar, which is considering updating
> Jackson to 2.9 in branch-2.4 to get security fixes. Pros: it's come up
normally wouldn't backport, except that I've heard a
> few times about concerns about CVEs affecting Databind, so wondering
> who else out there might have an opinion. I'm not pushing for it
> necessarily.
>
> On Wed, Apr 17, 2019 at 6:18 PM Reynold Xin wrote:
> >
>
"if others think it would be helpful, we can cancel this vote, update the SPIP
to clarify exactly what I am proposing, and then restart the vote after we have
gotten more agreement on what APIs should be exposed"
That'd be very useful. At least I was confused by what the SPIP was about. No
poin
I do feel it'd be better not to switch default Scala versions in a minor
release. I don't know how much this impacts downstream projects. Dotnet is a
good data point. Anybody else hit this issue?
On Thu, Apr 25, 2019 at 11:36 PM, Terry Kim < yumin...@gmail.com > wrote:
>
>
>
> Very much interested in
Looks like a great idea to make changes in Spark 3.0 to prepare for the Scala
2.13 upgrade.
Are there breaking changes that would require us to maintain two different
source trees for 2.12 vs 2.13?
On Fri, May 10, 2019 at 11:41 AM, Sean Owen < sro...@gmail.com > wrote:
>
>
>
> While that's not happe
d. I failed and gave up.
>>
>>
>> At some point maybe we figure out whether we can remove the SBT-based
>> build if it's super painful, but only if there's not much other choice.
>> That is for a future thread.
>>
>>
>>
>> On
> Interested in thoughts on how to proceed on something like this, as there
> will probably be a few more similar issues.
>
>
>
> On Fri, May 10, 2019 at 3:32 PM Reynold Xin < r...@databricks.com > wrote:
>
Can we push this to June 1st? I have been meaning to read it but
unfortunately keep traveling...
On Sat, May 25, 2019 at 8:31 PM Dongjoon Hyun
wrote:
> +1
>
> Thanks,
> Dongjoon.
>
> On Fri, May 24, 2019 at 17:03 DB Tsai wrote:
>
>> +1 on exposing the APIs for columnar processing support.
>>
>
Thanks Tom.
I finally had time to look at the updated SPIP 10 mins ago. I support the high
level idea and +1 on the SPIP.
That said, I think the proposed API is too complicated and an invasive change
to the existing internals. A much simpler API would be to expose a columnar batch
iterator interf
+1 on Xiangrui’s plan.
On Thu, May 30, 2019 at 7:55 AM shane knapp wrote:
> I don't have a good sense of the overhead of continuing to support
>> Python 2; is it large enough to consider dropping it in Spark 3.0?
>>
>> from the build/test side, it will actually be pretty easy to continue
> suppo
Seems like a good idea. Can we test this with a component first?
On Thu, Jun 13, 2019 at 6:17 AM Dongjoon Hyun
wrote:
> Hi, All.
>
> Since we use both Apache JIRA and GitHub actively for Apache Spark
> contributions, we have lots of JIRAs and PRs consequently. One specific
> thing I've been long
That's a good idea. We should only be using squash.
On Mon, Jul 01, 2019 at 1:52 PM, Dongjoon Hyun < dongjoon.h...@gmail.com >
wrote:
>
> Hi, Apache Spark PMC members and committers.
>
>
> We are using GitHub `Merge Button` in `spark-website` repository
> because it's very convenient.
>
>
>
Hi all,
In the past two years, pandas UDFs have been perhaps the most important change
to Spark for Python data science. However, these functionalities have evolved
organically, leading to some inconsistencies and confusion among users. I
created a ticket and a document summarizing the issues,
There is no explicit limit but a JVM string cannot be bigger than 2G. It will
also at some point run out of memory with too big of a query plan tree or
become incredibly slow due to query planning complexity. I've seen queries that
are tens of MBs in size.
On Thu, Jul 11, 2019 at 5:01 AM, 李书明 <
tens of MB's. Any samples to share :)
>
>
> Regards,
> Gourav
>
> On Thu, Jul 11, 2019 at 5:03 PM Reynold Xin < r...@databricks.com > wrote:
>
>
>> There is no explicit limit but a JVM string cannot be bigger than 2G. It
I like the spirit, but not sure about the exact proposal. Take a look at
k8s':
https://raw.githubusercontent.com/kubernetes/kubernetes/master/.github/PULL_REQUEST_TEMPLATE.md
On Tue, Jul 23, 2019 at 8:27 PM, Hyukjin Kwon wrote:
> (Plus, it helps to track history too. Spark's commit logs are gr
Matt, what do you mean by maximizing 3 while also allowing operations to
overflow without throwing errors? Those two seem contradictory.
On Wed, Jul 31, 2019 at 9:55 AM, Matt Cheah < mch...@palantir.com > wrote:
>
>
>
> I’m -1, simply from disagreeing with the premise that we can afford to not
>
> Agreed that a separate discussion about overflow might be warranted. I’m
> surprised we don’t throw an error now, but it might be warranted to do so.
>
> -Matt Cheah
>
> *Fro
We can also just write using one partition, which will be sufficient for
most use cases.
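i.e. something like (a minimal sketch; the output path is a placeholder):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1000)

# coalesce(1) avoids a full shuffle and writes a single output file;
# fine for small results, a bottleneck for big ones.
df.coalesce(1).write.mode("overwrite").parquet("/tmp/out")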
On Mon, Aug 5, 2019 at 7:48 PM Matt Cheah wrote:
> There might be some help from the staging table catalog as well.
>
>
>
> -Matt Cheah
>
>
>
> *From: *Wenchen Fan
> *Date: *Monday, August 5, 2019 at 7:40 P
Would it be possible to have one build that works for both?
On Mon, Aug 26, 2019 at 10:22 AM Dongjoon Hyun
wrote:
> Thank you all!
>
> Let me add more explanation on the current status.
>
> - If you want to run on JDK8, you need to build on JDK8
> - If you want to run on JDK11, you need
rote:
>>>
>>>
>>>> maybe in the future, but not right now as the hadoop 2.7 build is broken.
>>>>
>>>>
>>>> also, i busted dev/run-tests.py in my changes
>>>> to support java11 in PRBs:
>>>> https://github.com/apache/spark/pull/25585
>>>>
>>>>
>>>>
>>>> quick fix, testing now.
>>>>
>>>> On Mon, Aug 26, 2019 at 10:23 AM Reynold Xin < r...@databricks.com > wrote:
>>>>
>>>>
>>>>> Would it be possible to have one build that works for both?
Having three modes is a lot. Why not just use ANSI mode as the default, and
legacy for backward compatibility? Then over time there's only the ANSI mode,
which is standard compliant and easy to understand. We also don't need to
invent a standard just for Spark.
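Concretely, with the flag as it eventually shipped (a sketch, Spark 3.0+):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.ansi.enabled", "true")

# Errors under ANSI mode; returns NULL under the legacy behavior.
spark.sql("SELECT CAST('abc' AS INT)").show()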
On Thu, Sep 05, 2019 at 12:27 AM, Wen
+1! Long due for a preview release.
On Thu, Sep 12, 2019 at 5:26 PM, Holden Karau < hol...@pigscanfly.ca > wrote:
>
> I like the idea from the PoV of giving folks something to start testing
> against and exploring so they can raise issues with us earlier in the
> process and we have more time to
DSv2 is far from stable right? All the actual data types are unstable and you
guys have completely ignored that. We'd need to work on that and that will be a
breaking change. If the goal is to make DSv2 work across 3.x and 2.x, that
seems too invasive of a change to backport once you consider th
lity would be a problem in maintaining compatibility
> between the 2.5 version and the 3.0 version. If we find that we need to
> make API changes (other than additions) then we can make those in the 3.1
> release. Because the goals we set for the 3.0 release have been reached
> with the
> UnsafeRow
> was part of the original proposal.
>
>
>
> In any case, the goal for 3.0 was not to replace the use of InternalRow ,
> it was to get the majority of SQL working on top of the interface added
> after 2.4. That’s done and stable, so I think a 2.5 release
gt;> Sounds like we agree, then. We will use it for 3.0, but there are known
>>>> problems with it.
>>>>
>>>> Thinking we’d have dsv2 working in both 3.x (which will change and
>>>> progress towards more stable, but will have to break certain APIs)
lRow,
> but I think we would want to carefully consider whether that is the right
> decision. And in any case, we would be able to keep 2.5 and 3.0
> compatible, which is the main goal.
>
> On Sat, Sep 21, 2019 at 2:28 PM Reynold Xin < r...@databricks.com > wrote:
>
>
&
A while ago we changed it so the task gets broadcasted too, so I think the two
are fairly similar.
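For reference, the explicit broadcast form looks like this (a sketch; the
lookup table is a placeholder):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

lookup = {"a": 1, "b": 2}
bc = sc.broadcast(lookup)  # shipped to executors once, like the task binary itself

print(sc.parallelize(["a", "b", "a"]).map(lambda k: bc.value[k]).collect())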
On Mon, Sep 23, 2019 at 8:17 PM, Dhrubajyoti Hati < dhruba.w...@gmail.com >
wrote:
>
> I was wondering if anyone could help with this question.
>
> On Fri, 20 Sep, 2019, 11:52 AM Dhrubajyoti Hat
Whoever created the JIRA years ago didn't describe DPP (dynamic partition
pruning) correctly, but the linked JIRA in Hive was correct (which
unfortunately is much more terse than any of the patches we have in Spark:
https://issues.apache.org/jira/browse/HIVE-9152 ). Henry R's description was
also correct.
On Wed, Oct 02,
gt;>
>> Regarding the place in the optimizer rules, it's preferred to happen late
>> in the optimization, and definitely after join reorder.
>>
>>
>> Thanks,
>> Maryann
>>
>> On Wed, Oct 2, 2019 at 12:20 PM Reynold Xin wrote:
>>
>>
l write up, but I think we should at least give some
> up-to-date description on that JIRA entry.
>
> On Wed, Oct 2, 2019 at 3:13 PM Reynold Xin < r...@databricks.com > wrote:
>
>
>> No there is no separate write up internally.
>>
>> On Wed, Oct 2
Can we just tag master?
On Wed, Oct 16, 2019 at 12:34 AM, Wenchen Fan < cloud0...@gmail.com > wrote:
>
> Does anybody remember what we did for 2.0 preview? Personally I'd like to
> avoid cutting branch-3.0 right now, otherwise we need to merge PRs into
> two branches in the following several mon
Just curious - did we discuss why this shouldn't be another Apache sister
project?
On Wed, Oct 16, 2019 at 10:21 AM, Sean Owen < sro...@gmail.com > wrote:
>
>
>
> We don't all have to agree on whether to add this -- there are like 10
> people with an opinion -- and I certainly would not veto
Does the description make sense? This is a preview release so there is no
need to retarget versions.
On Tue, Oct 29, 2019 at 7:01 PM Xingbo Jiang wrote:
> Please vote on releasing the following candidate as Apache Spark version
> 3.0.0-preview.
>
> The vote is open until November 2 PST and passe
It’s mainly due to compilation speed. The Scala compiler is known to be slow.
Even javac is quite slow. We use Janino, which is a simpler compiler, to get
faster compilation speed at runtime.
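For the curious, you can inspect the generated Java that Janino compiles (a
sketch; the mode argument to explain assumes Spark 3.0+):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1000).selectExpr("id * 2 AS doubled")

# Prints the generated Java source that Janino compiles at runtime.
df.explain(mode="codegen")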
Also for low level code we can’t use (due to perf concerns) any of the
edges Scala has over Java, e.g. we can’t us
If the cost is low, why don't we just do monthly previews until we code freeze?
If it is high, maybe we should discuss and do it when there are people who
volunteer.
On Sun, Dec 08, 2019 at 10:32 PM, Xiao Li < gatorsm...@gmail.com > wrote:
>
>
>
> I got many great feedbacks from the com
We've pushed out 3.0 multiple times. The latest release window documented on
the website ( http://spark.apache.org/versioning-policy.html ) says we'd code
freeze and cut branch-3.0 early Dec. It looks like we are suffering a bit from
the tragedy of the commons, in that nobody is pushing for getting
Can this perhaps exist as a utility function outside Spark?
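A minimal version can already live outside Spark on the public API, e.g. (a
sketch; diff, df_before and df_after are hypothetical names):

def diff(left, right):
    """Rows only in `left` and rows only in `right` (multiset semantics)."""
    return left.exceptAll(right), right.exceptAll(left)

only_left, only_right = diff(df_before, df_after)  # df_* are placeholder DataFrames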
On Tue, Jan 07, 2020 at 12:18 AM, Enrico Minack < m...@enrico.minack.dev >
wrote:
>
>
>
> Hi Devs,
>
>
>
> I'd like to get your thoughts on this Dataset feature proposal. Comparing
> datasets is a central operation when regressio
Introducing a new data type has high overhead, both in terms of internal
complexity and users' cognitive load. Introducing two data types would have
even higher overhead.
I looked quickly and it looks like both Redshift and Snowflake, two of the most
recent SQL analytics successes, have only one i
This seems reasonable!
On Tue, Jan 21, 2020 at 3:23 PM, Dongjoon Hyun < dongjoon.h...@gmail.com >
wrote:
>
> +1, I'm supporting the following proposal.
>
>
> > this mirror as the primary repo in the build, falling back to Central if
> needed.
>
>
> Thanks,
> Dongjoon.
>
>
>
> On Tue, Jan
Thanks for writing this up.
Usually when people talk about push-based shuffle, they are motivating it
primarily to reduce the latency of short queries, by pipelining the map phase,
shuffle phase, and the reduce phase (which this design isn't going to address).
It's interesting you are targetin
If your UDF itself is very CPU intensive, it probably won't make that much of a
difference, because the UDF itself will dwarf the serialization/deserialization
overhead.
If your UDF is cheap, it will help tremendously.
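To make the difference concrete (a sketch; the type-hint style of pandas_udf
assumes Spark 3.0+):

import pandas as pd
from pyspark.sql.functions import udf, pandas_udf

@udf("double")
def plus_one(v):  # row-at-a-time: pays ser/de per row
    return v + 1.0

@pandas_udf("double")
def plus_one_vec(v: pd.Series) -> pd.Series:  # Arrow batches: overhead amortized
    return v + 1.0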
On Mon, Jan 20, 2020 at 6:33 PM, < em...@yeikel.com > wrote:
>
>
>
> Hi,
>
;>>> named output from CleanupAliases
>>>>> SPARK-25640 Clarify/Improve EvalType for grouped aggregate and window
>>>>> aggregate
>>>>> SPARK-25531 new write APIs for data source v2
>>>>> SPARK-25547 Pluggable jdbc connection factory
>>>>> SPARK-2
Note that branch-3.0 was cut. Please focus on testing, polish, and let's get
the release out!
On Wed, Jan 29, 2020 at 3:41 PM, Reynold Xin < r...@databricks.com > wrote:
>
> Just a reminder - code freeze is coming this Fri !
>
>
>
> There can always be ex
This is really cool. We should also be more opinionated about how we specify
time and intervals.
On Wed, Feb 12, 2020 at 3:15 PM, Dongjoon Hyun < dongjoon.h...@gmail.com >
wrote:
>
> Thank you, Wenchen.
>
>
> The new policy looks clear to me. +1 for the explicit policy.
>
>
> So, are we go
It's a good discussion to have though: should we deprecate DStreams, and what
do we need to do to make that happen? My experience working with a lot of
Spark users is that in general I recommend they stay away from DStreams, due
to a lot of design and architectural issues.
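For anyone starting fresh, the Structured Streaming equivalent of the classic
DStream word count is roughly (a sketch; host and port are placeholders):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

lines = (spark.readStream.format("socket")
         .option("host", "localhost").option("port", 9999).load())

# Classic streaming word count on the DataFrame API.
words = lines.select(F.explode(F.split("value", " ")).alias("word"))
query = (words.groupBy("word").count()
         .writeStream.outputMode("complete").format("console").start())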
On Mon, Mar 02, 2020
+1
On Mon, Mar 09, 2020 at 3:53 PM, John Zhuge < jzh...@apache.org > wrote:
>
> +1 (non-binding)
>
>
> On Mon, Mar 9, 2020 at 1:32 PM Michael Heuer < heue...@gmail.com > wrote:
>
>
>> +1 (non-binding)
>>
>>
>> I am disappointed however that this only mentions API
I don’t understand this change. Wouldn’t this “ban” confuse the hell out of
both new and old users?
For old users, their old code that was working for char(3) would now stop
working.
> For new users, depending on the underlying metastore, char(3) is
> either supported but different from ANSI S
the banning was
> the proposed alternative to reduce the potential issue.
>
>
> Please give us your opinion since it's still PR.
>
>
> Bests,
> Dongjoon.
>
> On Sat, Mar 14, 2020 at 17:54 Reynold Xin < r...@databricks.com > wrote:
,
>>
>>
>> 100% agree with Reynold.
>>
>> Regards,
>> Gourav Sengupta
>>
>>
>> On Mon, Mar 16, 2020 at 3:31 AM Reynold Xin < r...@databricks.com > wrote:
>>
(
>>> dongjoon.h...@gmail.com ) > wrote:
>>>
>>> Hi, Reynold.
>>> (And +Michael Armbrust)
>>>
>>>
>>> If you think so, do you think it's okay that we change the return value
>>> silently? Then, I'm won
systems also deviate away from the standard on this
specific behavior.
On Mon, Mar 16, 2020 at 5:29 PM, Reynold Xin < r...@databricks.com > wrote:
>
> I looked up our usage logs (sorry I can't share this publicly) and trim
> has at least four orders of magnitude higher usage
te away
> from the standard on this specific behavior.
>
>
> Bests,
> Dongjoon.
>
> On Mon, Mar 16, 2020 at 5:35 PM Reynold Xin < r...@databricks.com > wrote:
>
>
>> BTW I'm not opposing us sticking to SQL standard (I
periences with the non-negligible cases in on-prem.
>
>
>
> Bests,
> Dongjoon.
>
>
> On Mon, Mar 16, 2020 at 5:42 PM Reynold Xin < r...@databricks.com > wrote:
>
>
>> −User
>>
>>
>>
>> char barely sh
You are joking when you said "informed widely and discussed in many ways
twice", right?
This thread doesn't even talk about char/varchar:
https://lists.apache.org/thread.html/493f88c10169680191791f9f6962fd16cd0ffa3b06726e92ed04cbe1%40%3Cdev.spark.apache.org%3E
(Yes it talked about changing the
default datasource as provider for CREATE TABLE
> syntax", 2019/12/06
> > https://lists.apache.org/thread.html/493f88c10169680191791f9f6962fd16cd0ffa3b06726e92ed04cbe1%40%3Cdev.spark.apache.org%3E
I actually think we should start cutting RCs. We can cut RCs even with blockers.
On Tue, Mar 24, 2020 at 12:51 PM, Dongjoon Hyun < dongjoon.h...@gmail.com >
wrote:
>
> Hi, All.
>
> First of all, always "Community Over Code"!
> I wish you the best health and happiness.
>
> As we know, we are s
bcc dev, +user
You need to print out the result. Take itself doesn't print. You only got the
results printed to the console because the Scala REPL automatically prints the
returned value from take.
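i.e. something like (a minimal sketch):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10)

rows = df.take(3)  # returns a list of Row objects; prints nothing by itself
print(rows)
df.show(3)         # or let Spark format the output for you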
On Thu, Mar 26, 2020 at 12:15 PM, Zahid Rahman < zahidr1...@gmail.com > wrote:
>
> I am running
ll have the blockers that will fail the
>>> RCs.
>>>
>>>
>>>
>>> Cheers,
>>>
>>>
>>> Xiao
>>>
>>>
>>>
>>> On Tue, Mar 24, 2020 at 6:56 PM Dongjoon Hyun < dongjoon.hyun@gmail.com >
Please vote on releasing the following candidate as Apache Spark version 3.0.0.
The vote is open until 11:59pm Pacific time Fri Apr 3, and passes if a
majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
[ ] +1 Release this package as Apache Spark 3.0.0
[ ] -1 Do not release this pack
The Apache Software Foundation requires voting before any release can be
published.
On Tue, Mar 31, 2020 at 11:27 PM, Stephen Coy < s...@infomedia.com.au.invalid >
wrote:
>
>
>> On 1 Apr 2020, at 5:20 pm, Sean Owen < sro...@gmail.com > wrote:
>>
>> It can be publish
The RDD is the DAG.
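You can see this directly from the lineage printout (a sketch; toDebugString
returns bytes in PySpark, hence the decode):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
rdd = (spark.sparkContext.parallelize(range(100))
       .map(lambda x: x * 2)
       .filter(lambda x: x > 10))

# The lineage printout is the DAG the scheduler walks.
print(rdd.toDebugString().decode())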
On Thu, Apr 16, 2020 at 3:16 PM, Mania Abdi < abdi...@husky.neu.edu > wrote:
>
> Hello everyone,
>
> I am implementing a caching mechanism for analytic workloads running on
> top of Spark and I need to retrieve the Spark DAG right after it is
> generated and the DAG schedule
bdi...@husky.neu.edu > wrote:
>
> Is it correct to say, the nodes in the DAG are RDDs and the edges are
> computations?
>
>
> On Thu, Apr 16, 2020 at 6:21 PM Reynold Xin < r...@databricks.com > wrote:
>
>
>> The RDD is the DAG.
>>
The con is much more than just the effort to maintain a parallel API. It
puts the burden on all libraries and library developers to maintain a
parallel API as well. That's one of the primary reasons we moved away from
the RDD vs JavaRDD approach in the old RDD API.
On Tue, Apr 28, 2020 at 12:3
Please vote on releasing the following candidate as Apache Spark version 3.0.0.
The vote is open until Thu May 21 11:59pm Pacific time and passes if a majority
+1 PMC votes are cast, with a minimum of 3 +1 votes.
[ ] +1 Release this package as Apache Spark 3.0.0
[ ] -1 Do not release this packa
Please vote on releasing the following candidate as Apache Spark version 3.0.0.
The vote is open until [DUE DAY] and passes if a majority +1 PMC votes are
cast, with a minimum of 3 +1 votes.
[ ] +1 Release this package as Apache Spark 3.0.0
[ ] -1 Do not release this package because ...
To lea