Re: Welcome Yikun Jiang as a Spark committer

2022-10-08 Thread Jungtaek Lim
Congrats! On Sat, Oct 8, 2022 at 3:24 PM, huaxin gao wrote: > Congratulations! > > On Fri, Oct 7, 2022 at 11:22 PM Yang,Jie(INF) wrote: > >> Congratulations Yikun! >> >> Regards, >> Yang Jie >> -- >> *From:* Mridul Muralidharan >> *Sent:* October 8, 2022 14:16:02 >> *To:*

Re: Dropping Apache Spark Hadoop2 Binary Distribution?

2022-10-05 Thread Jungtaek Lim
+1 On Thu, Oct 6, 2022 at 5:59 AM Chao Sun wrote: > +1 > > > and specifically may allow us to finally move off of the ancient version > of Guava (?) > > I think the Guava issue comes from Hive 2.3 dependency, not Hadoop. > > On Wed, Oct 5, 2022 at 1:55 PM Xinrong Meng > wrote: > >> +1. >> >>

Re: [Structured Streaming + Kafka] Reduced support for alternative offset management

2022-09-01 Thread Jungtaek Lim
: https://github.com/HeartSaVioR/spark-sql-kafka-offset-committer Hope this helps. Thanks, Jungtaek Lim (HeartSaVioR) On Tue, Aug 30, 2022 at 5:05 PM Martin Andersson wrote: > I was looking around for some documentation regarding how checkpointing > (or rather, delivery semantics) is don

Re: Welcoming three new PMC members

2022-08-09 Thread Jungtaek Lim
Congrats everyone! On Wed, Aug 10, 2022 at 8:57 AM Hyukjin Kwon wrote: > Congrats everybody! > > On Wed, 10 Aug 2022 at 05:50, Mridul Muralidharan > wrote: > >> >> Congratulations ! >> Great to have you join the PMC !! >> >> Regards, >> Mridul >> >> On Tue, Aug 9, 2022 at 11:57 AM vaquar khan

Re: Welcome Xinrong Meng as a Spark committer

2022-08-09 Thread Jungtaek Lim
Congrats Xinrong! Well deserved. On Tue, Aug 9, 2022 at 5:13 PM, Hyukjin Kwon wrote: > Hi all, > > The Spark PMC recently added Xinrong Meng as a committer on the project. > Xinrong is the major contributor of PySpark especially Pandas API on Spark. > She has guided a lot of new contributors

Re: [DISCUSS] Deprecate Trigger.Once and promote Trigger.AvailableNow

2022-07-18 Thread Jungtaek Lim
available data in a single microbatch. While this can behave the same with Trigger.Once on processing new available data (watermark advancement happens after processing all the data), this can also handle previous uncommitted batch(es) as well as no-data batch. On Tue, Jul 12, 2022 at 9:43 AM Jungtaek Lim

Re: [DISCUSS] Deprecate Trigger.Once and promote Trigger.AvailableNow

2022-07-11 Thread Jungtaek Lim
Final reminder. I'll leave this thread for a couple of days to see further voices, and go forward if there is no outstanding comment. On Sat, Jul 9, 2022 at 9:54 PM Jungtaek Lim wrote: > It sounds like none of the approaches perfectly solve the issue of > backfill. > > 1. Trigger

Re: [DISCUSS] Deprecate Trigger.Once and promote Trigger.AvailableNow

2022-07-09 Thread Jungtaek Lim
batches are processed > in the correct event time order when starting from scratch. > > I'm not against deprecating Trigger.Once, just wanted to chime in that > someone was using it! I'm itching to upgrade and try out the new stuff. > > Adam > > On Fri, Jul 8, 2022 at 9:16 AM Ju

Re: [DISCUSS] Deprecate Trigger.Once and promote Trigger.AvailableNow

2022-07-08 Thread Jungtaek Lim
t its own design to deal with.) > > Adam > > On Fri, Jul 8, 2022 at 3:24 AM Jungtaek Lim > wrote: > >> Bump to get a chance to expose the proposal to wider audiences. >> >> Given that there are not many active contributors/maintainers in area >> Stru

Re: [DISCUSS] Deprecate Trigger.Once and promote Trigger.AvailableNow

2022-07-08 Thread Jungtaek Lim
ve forward if there are no outstanding objections. On Wed, Jul 6, 2022 at 8:46 PM Jungtaek Lim wrote: > Hi dev, > > I would like to hear voices about deprecating Trigger.Once, and promoting > Trigger.AvailableNow as a replacement [1] in Structured Streaming. > (It doesn't mean we remove

[DISCUSS] Deprecate Trigger.Once and promote Trigger.AvailableNow

2022-07-06 Thread Jungtaek Lim
to the behavior of Trigger.AvailableNow, it handles no-data batch as well before termination of the query. Please review and let us know if you have any feedback or concerns on the proposal. Thanks! Jungtaek Lim 1. https://issues.apache.org/jira/browse/SPARK-36533
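
A minimal sketch of what the proposed migration could look like on the PySpark side, under stated assumptions: `DataStreamWriter.trigger` accepts `once=True`, and from Spark 3.3 also `availableNow=True`. The helper name `one_time_trigger_kwargs` and its version-tuple argument are hypothetical and only for illustration; they are not part of the proposal.

```python
from typing import Dict, Tuple


def one_time_trigger_kwargs(spark_version: Tuple[int, int]) -> Dict[str, bool]:
    """Pick keyword arguments for DataStreamWriter.trigger(...).

    Hypothetical helper: prefers Trigger.AvailableNow where it exists
    (exposed in PySpark as availableNow=True from Spark 3.3) and falls
    back to Trigger.Once (once=True) on older versions.
    """
    if spark_version >= (3, 3):
        return {"availableNow": True}
    return {"once": True}


# Intended use (needs a running Spark session, so it is left as a comment):
#   df.writeStream.trigger(**one_time_trigger_kwargs((3, 3))).start()
```

Keeping the choice in one place means existing jobs can switch triggers without touching the rest of the query definition.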

Observed consistent test failure in master (ParquetIOSuite)

2022-06-27 Thread Jungtaek Lim
looks into this sooner. Thanks! Jungtaek Lim (HeartSaVioR)

Re: Reply: [VOTE] Release Spark 3.3.0 (RC6)

2022-06-14 Thread Jungtaek Lim
+1 (non-binding) Checked signature and checksum. Confirmed SPARK-39412 is resolved. Built source tgz with JDK 11. Thanks Max for driving the efforts of this huge release! On Tue, Jun 14, 2022 at 2:51 PM huaxin gao wrote: > +1 (non-binding) >

Re: [VOTE] Release Spark 3.3.0 (RC5)

2022-06-08 Thread Jungtaek Lim
Apologies for the late participation. I'm sorry, but -1 (non-binding) from me. Unfortunately I found a major user-facing issue which seriously hurts the UX of the Kafka data source. In some cases, the Kafka data source can throw IllegalStateException for the case of failOnDataLoss=true which condition is

Re: SIGMOD System Award for Apache Spark

2022-05-12 Thread Jungtaek Lim
Congrats Spark community! On Fri, May 13, 2022 at 10:40 AM Qian Sun wrote: > Congratulations !!! > > On May 13, 2022, at 3:44 AM, Matei Zaharia wrote: > > Hi all, > > We recently found out that Apache Spark received > the SIGMOD System Award this > year, given by

Re: [DISCUSS] Migration guide on upgrading Kafka to 3.1 in Spark 3.3

2022-03-23 Thread Jungtaek Lim
ations (does it require a server-side > update or not?), and document the change itself for sure along with any > Spark-side migration notes. > > On Fri, Mar 18, 2022 at 8:47 PM Jungtaek Lim > wrote: > >> The thing is, it is “us” who upgrades Kafka client and makes possible >

Re: [DISCUSS] Migration guide on upgrading Kafka to 3.1 in Spark 3.3

2022-03-23 Thread Jungtaek Lim
Bump to try gathering more voices before taking action. For now, I see two voices, for options 2 and 5 (similar to option 2, but in the release note rather than the migration note). On Fri, Mar 18, 2022 at 7:15 PM Jungtaek Lim wrote: > CORRECTION: in option 2, we enumerate KIPs which ma

Re: bazel and external/

2022-03-22 Thread Jungtaek Lim
teps. If there is consensus that connectors will move out, should >>>> the directory be named misc for everything else until there is some >>>> direction for the remaining modules? >>>> >>>> On Fri, 18 Mar 2022 at 03:03 Jungtaek Lim >>>> wrote: >>

Re: [DISCUSS] Migration guide on upgrading Kafka to 3.1 in Spark 3.3

2022-03-18 Thread Jungtaek Lim
would affect > Kafka usage itself; focus on the connector-related issues. > > On Fri, Mar 18, 2022 at 5:15 AM Jungtaek Lim > wrote: > >> CORRECTION: in option 2, we enumerate KIPs which may bring >> incompatibility with older brokers (not all KIPs). >> >

Re: [DISCUSS] Migration guide on upgrading Kafka to 3.1 in Spark 3.3

2022-03-18 Thread Jungtaek Lim
s important to check :) >> >> Seems like my Kafka Spark compatibility gist is out-of-date so maybe I >> need to invest some time to resurrect it: >> https://gist.github.com/gaborgsomogyi/3476c32d69ff2087ed5d7d031653c7a9 >> >> Hope my thoughts are helpful! >> >>

Re: [DISCUSS] Migration guide on upgrading Kafka to 3.1 in Spark 3.3

2022-03-18 Thread Jungtaek Lim
CORRECTION: in option 2, we enumerate KIPs which may bring incompatibility with older brokers (not all KIPs). On Fri, Mar 18, 2022 at 7:12 PM Jungtaek Lim wrote: > Hi dev, > > I would like to initiate the discussion about how to deal with the > migration guide on upgrading Kafka

[DISCUSS] Migration guide on upgrading Kafka to 3.1 in Spark 3.3

2022-03-18 Thread Jungtaek Lim
End users can indicate the upgrade in the release note, and we expect end users to actively check the notable changes (& KIPs) from Kafka doc. 5. Options not described above... Please take a look and provide your voice on this. Thanks, Jungtaek Lim (HeartSaVioR) ps. Probably this would

Re: bazel and external/

2022-03-17 Thread Jungtaek Lim
rs. > > On Thu, Mar 17, 2022 at 7:33 PM Jungtaek Lim > wrote: > >> We seem to just focus on how to avoid the conflict with the name >> "external" used in bazel. Since we consider the possibility of renaming, >> why not revisit the modules "external" conta

Re: bazel and external/

2022-03-17 Thread Jungtaek Lim
We seem to focus only on how to avoid the conflict with the name "external" used in bazel. Since we are considering the possibility of renaming, why not revisit the modules "external" contains? The kinds of modules the external directory contains seem to be 1) Docker 2) Connectors 3) Sink on Dropwizard

Re: Apache Spark 3.3 Release

2022-03-03 Thread Jungtaek Lim
Thanks Maxim for volunteering to drive the release! I support the plan (March 15th) to perform a release branch cut. Btw, would we be open for modification of critical/blocker issues after the release branch cut? I have a blocker JIRA ticket and the PR is open for reviewing, but need some time to

Re: [MISC] Should we add .github/FUNDING.yml

2021-12-15 Thread Jungtaek Lim
If ASF wants to do it, INFRA could probably deal with it for entire projects, like ASF code of conduct being exposed to the right side of the all ASF github repos recently. On Wed, Dec 15, 2021 at 11:49 PM Sean Owen wrote: > It might imply that this is a way to fund Spark alone, and it isn't. >

Re: [Proposal] Deprecate Trigger.Once and replace with Trigger.AvailableNow

2021-12-12 Thread Jungtaek Lim
Friendly reminder. I'll submit the proposed change if there is no objection observed this week. On Wed, Dec 8, 2021 at 4:16 PM Jungtaek Lim wrote: > Hi dev, > > I would like to hear voices about deprecating Trigger.Once, and replacing > it with Trigger.AvailableNow [1] in Structur

[Proposal] Deprecate Trigger.Once and replace with Trigger.AvailableNow

2021-12-07 Thread Jungtaek Lim
in migration guide - Replace all usages of Trigger.Once with Trigger.AvailableNow, except the test cases of Trigger.Once itself Please review the proposal and share your voice on this. Thanks! Jungtaek Lim 1. https://issues.apache.org/jira/browse/SPARK-36533

Re: Time for Spark 3.2.1?

2021-12-07 Thread Jungtaek Lim
+1 for both releases and the time! On Wed, Dec 8, 2021 at 3:46 PM Mridul Muralidharan wrote: > > +1 for maintenance release, and also +1 for doing this in Jan ! > > Thanks, > Mridul > > On Tue, Dec 7, 2021 at 11:41 PM Gengliang Wang wrote: > >> +1 for new maintenance releases for all 3.x

Re: [Apache Spark Jenkins] build system shutting down Dec 23th, 2021

2021-12-06 Thread Jungtaek Lim
Thanks for all the hard work you have been doing, Shane! On Tue, Dec 7, 2021 at 2:17 PM Nick Pentreath wrote: > Wow! end of an era > > Thanks so much to you Shane for all your work over 10 (!!) years. And to > Amplab also! > > Farewell Spark Jenkins! > > N > > On Tue, Dec 7, 2021 at 6:49 AM

Re: Update Spark 3.3 release window?

2021-10-28 Thread Jungtaek Lim
+1 for mid-March 2022. +1 for EOL 2.x as well. I guess we did it already according to Dongjoon's quote from the Spark website. On Fri, Oct 29, 2021 at 3:49 AM Dongjoon Hyun wrote: > +1 for mid March for Spark 3.3. > > For 2.4, our document already mentioned its EOL like > > " For example,

Re: [ANNOUNCE] Apache Spark 3.2.0

2021-10-19 Thread Jungtaek Lim
Thanks to Gengliang for driving this huge release! On Wed, Oct 20, 2021 at 1:50 AM Dongjoon Hyun wrote: > Thank you so much, Gengliang and all! > > Dongjoon. > > On Tue, Oct 19, 2021 at 8:48 AM Xiao Li wrote: > >> Thank you, Gengliang! >> >> Congrats to our community and all the contributors!

Re: [DISCUSS] SPIP: Row-level operations in Data Source V2

2021-06-24 Thread Jungtaek Lim
Meta question: this doesn't target Spark 3.2, right? Many folks have been working on branch cut for Spark 3.2, so might be less active to jump in new feature proposals right now. On Fri, Jun 25, 2021 at 9:00 AM Holden Karau wrote: > I took an initial look at the PRs this morning and I’ll go

Re: [VOTE] Release Spark 3.0.3 (RC1)

2021-06-21 Thread Jungtaek Lim
+1 (non-binding) Thanks for your efforts! On Mon, Jun 21, 2021 at 2:40 PM Kent Yao wrote: > +1 (non-binding) > > *Kent Yao * > @ Data Science Center, Hangzhou Research Institute, NetEase Corp. > *a spark enthusiast* > *kyuubi is a unified multi-tenant JDBC >

Re: Apache Spark 3.0.3 Release?

2021-06-09 Thread Jungtaek Lim
Late +1 Thanks! On Thu, Jun 10, 2021 at 12:06 PM Yi Wu wrote: > Thanks all, I'll start the RC soon. > > On Wed, Jun 9, 2021 at 7:07 PM Gengliang Wang wrote: > >> +1, thanks Yi >> >> Gengliang Wang >> >> >> >> >> On Jun 9, 2021, at 6:03 PM, 郑瑞峰 wrote: >> >> +1, thanks Yi >> >> >>

Re: [ANNOUNCE] Apache Spark 3.1.2 released

2021-06-02 Thread Jungtaek Lim
Nice! Thanks Dongjoon for your amazing efforts! On Wed, Jun 2, 2021 at 2:59 PM Liang-Chi Hsieh wrote: > Thank you, Dongjoon! > > > > Takeshi Yamamuro wrote > > Thank you, Dongjoon! > > > > On Wed, Jun 2, 2021 at 2:29 PM Xiao Li > > > lixiao@ > > > wrote: > > > >> Thank you! > >> > >> Xiao >

Re: Apache Spark 3.1.2 Release?

2021-05-18 Thread Jungtaek Lim
Late +1 here as well, thanks for volunteering! On Wed, May 19, 2021 at 11:24 AM, 郑瑞峰 wrote: > late +1. thanks Dongjoon! > > > -- Original Message -- > *From:* "Dongjoon Hyun" ; > *Sent:* Wednesday, May 19, 2021, 1:29 AM > *To:* "Wenchen Fan"; > *Cc:* "Xiao Li";"Kent Yao";"John >

Re: [ANNOUNCE] Apache Spark 2.4.8 released

2021-05-18 Thread Jungtaek Lim
Thanks for the huge efforts on driving the release! On Tue, May 18, 2021 at 4:53 PM Wenchen Fan wrote: > Thank you, Liang-Chi! > > On Tue, May 18, 2021 at 1:32 PM Dongjoon Hyun > wrote: > >> Finally! Thank you, Liang-Chi. >> >> Bests, >> Dongjoon. >> >> >> On Mon, May 17, 2021 at 10:14 PM

Re: [DISCUSS] Add RocksDB StateStore

2021-04-27 Thread Jungtaek Lim
I think adding RocksDB state store to sql/core directly would be OK. Personally I also voted "either way is fine with me" against RocksDB state store implementation in Spark ecosystem. The overall stance hasn't changed, but I'd like to point out that the risk becomes quite lower than before, given

Re: [VOTE] Release Spark 2.4.8 (RC2)

2021-04-13 Thread Jungtaek Lim
+1 (non-binding) signature OK, extracting tgz files OK, build source without running tests OK. On Tue, Apr 13, 2021 at 5:02 PM Herman van Hovell wrote: > +1 > > On Tue, Apr 13, 2021 at 2:40 AM sarutak wrote: > >> +1 (non-binding) >> >> > +1 >> > >> > On Tue, 13 Apr 2021, 02:58 Sean Owen,

Re: Welcoming six new Apache Spark committers

2021-03-26 Thread Jungtaek Lim
Congrats all! On Sat, Mar 27, 2021 at 6:56 AM, Liang-Chi Hsieh wrote: > Congrats! Welcome! > > > Matei Zaharia wrote > > Hi all, > > > > The Spark PMC recently voted to add several new committers. Please join > me > > in welcoming them to their new role! Our new committers are: > > > > - Maciej

Re: Checkpointing in Spark Structured Streaming

2021-03-22 Thread Jungtaek Lim
> Rohit > > On Mon, Mar 22, 2021 at 4:09 PM Jungtaek Lim > wrote: > >> I see some points making async checkpoint be tricky to add in >> micro-batch; one example is "end to end exactly-once", as the commit phase >> in sink for the batch N can be run "after"

Re: Checkpointing in Spark Structured Streaming

2021-03-22 Thread Jungtaek Lim
I see some points that make async checkpointing tricky to add in micro-batch; one example is "end to end exactly-once", as the commit phase in the sink for batch N can run "after" batch N + 1 has started, and the write for batch N + 1 can happen before batch N is committed. state store

Re: Determine global watermark via StreamingQueryProgress eventTime watermark String

2021-03-16 Thread Jungtaek Lim
There was a similar question (but another approach) and I've explained the current status a bit. https://lists.apache.org/thread.html/r89a61a10df71ccac132ce5d50b8fe405635753db7fa2aeb79f82fb77%40%3Cuser.spark.apache.org%3E I guess this would also answer your question as well. At least for now,

Re: Observable Metrics on Spark Datasets

2021-03-16 Thread Jungtaek Lim
egistration happens. I think this qualifies > as: "all the logic happens in the JVM". All that is transferred to Python > is a row's data. No listeners needed. > > Enrico > > > > Am 16.03.21 um 00:13 schrieb Jungtaek Lim: > > If I remember correctly, the ma

Re: Observable Metrics on Spark Datasets

2021-03-15 Thread Jungtaek Lim
If I remember correctly, the major audience of the "observe" API is Structured Streaming, micro-batch mode. From the example, the abstraction in 2 isn't something working with Structured Streaming. It could be still done with callback, but it remains the question how much complexity is hidden from

Re: [VOTE] SPIP: Add FunctionCatalog

2021-03-11 Thread Jungtaek Lim
+1 (non-binding) Excellent description on SPIP doc! Thanks for the amazing effort! On Wed, Mar 10, 2021 at 3:19 AM Liang-Chi Hsieh wrote: > > +1 (non-binding). > > Thanks for the work! > > > Erik Krogen wrote > > +1 from me (non-binding) > > > > On Tue, Mar 9, 2021 at 9:27 AM huaxin gao > > >

Re: Property spark.sql.streaming.minBatchesToRetain

2021-03-09 Thread Jungtaek Lim
That property decides how many log files (log file is created per batch per type - types are like offsets, commits, etc.) to retain on the checkpoint. Unless you're struggling with a small files problem on checkpoint, you wouldn't need to tune the value. I guess that's why the configuration is
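
To make the retention rule concrete, here is a simplified model in plain Python. It is only a sketch, not Spark's actual purge logic: it assumes one log file per batch per log type, named by batch id under `offsets/` and `commits/`, and it keeps the most recent `min_batches_to_retain` batches. The function names are hypothetical.

```python
from typing import Dict, List


def retained_batches(latest_batch_id: int, min_batches_to_retain: int) -> List[int]:
    """Batch ids whose log files survive under this simplified model."""
    # Keep the newest N batches; batch ids start at 0.
    oldest = max(0, latest_batch_id - min_batches_to_retain + 1)
    return list(range(oldest, latest_batch_id + 1))


def retained_log_files(latest_batch_id: int, min_batches_to_retain: int) -> Dict[str, List[str]]:
    """One file per batch per log type, so the file count is N per type."""
    batches = retained_batches(latest_batch_id, min_batches_to_retain)
    return {
        log_type: [f"{log_type}/{batch_id}" for batch_id in batches]
        for log_type in ("offsets", "commits")
    }
```

Under this model, lowering the value only reduces the number of small files kept on the checkpoint, which matches the point that tuning is rarely needed unless small files are a problem.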

Re: using accumulators in (MicroBatch) InputPartitionReader

2021-03-07 Thread Jungtaek Lim
I'm not sure about the accumulator approach; one possible approach which might work (DISCLAIMER: a random thought) would be employing an RPC endpoint on the driver side which receives such information from executors and plays as a coordinator. Beware that Spark's RPC implementation is package

Re: [ANNOUNCE] Announcing Apache Spark 3.1.1

2021-03-03 Thread Jungtaek Lim
Thanks Hyukjin for driving the huge release, and thanks everyone for contributing to the release! On Wed, Mar 3, 2021 at 6:54 PM angers zhu wrote: > Great work, Hyukjin ! > > Bests, > Angers > > On Wed, Mar 3, 2021 at 5:02 PM, Wenchen Fan wrote: > >> Great work and congrats! >> >> On Wed, Mar 3, 2021 at 3:51 PM

Re: Please take a look at the draft of the Spark 3.1.1 release notes

2021-02-27 Thread Jungtaek Lim
Thanks Hyukjin! I've only looked into the SS part, and added a comment. Otherwise it looks great! On Sat, Feb 27, 2021 at 7:12 PM Dongjoon Hyun wrote: > Thank you for sharing, Hyukjin! > > Dongjoon. > > On Sat, Feb 27, 2021 at 12:36 AM Hyukjin Kwon wrote: > >> Hi all, >> >> I am preparing to

Re: [VOTE] Release Spark 3.1.1 (RC3)

2021-02-22 Thread Jungtaek Lim
+1 (non-binding) Verified signatures. Only a few commits added after RC2 which don't seem to change the SS behavior, so I'd carry over my +1 from RC2. On Mon, Feb 22, 2021 at 3:57 PM Hyukjin Kwon wrote: > Starting with my +1 (binding). > > On Mon, Feb 22, 2021 at 3:56 PM, Hyukjin Kwon wrote: > >>

Re: Please use Jekyll via "bundle exec" from now on

2021-02-18 Thread Jungtaek Lim
Nice fix. Thanks! On Thu, Feb 18, 2021 at 7:13 PM Hyukjin Kwon wrote: > Thanks Attila for fixing and sharing this. > > On Thu, Feb 18, 2021 at 6:17 PM, Attila Zsolt Piros wrote: > >> Hello everybody, >> >> To pin the exact same version of Jekyll across all the contributors, Ruby >> Bundler is

Re: [DISCUSS] assignee practice on committers+ (possible issue on preemption)

2021-02-18 Thread Jungtaek Lim
oposal. There may be no disagreement. It might result in the > other person joining your PR. As I say, not sure if there's a deeper issue > than that if even this hasn't been tried? > > On Mon, Feb 15, 2021 at 8:35 PM Jungtaek Lim > wrote: > >> Thanks for the input, Hyukj

Re: [DISCUSS] assignee practice on committers+ (possible issue on preemption)

2021-02-15 Thread Jungtaek Lim
think that the actual issue by setting an assignee happens > rarely, and it is an issue to several specific cases that would need a look > case-by-case. > Were there specific cases that made you concerned? > > > On Mon, Feb 15, 2021 at 8:58 AM, Jungtaek Lim wrote: > >> Hi

[DISCUSS] assignee practice on committers+ (possible issue on preemption)

2021-02-14 Thread Jungtaek Lim
g JIRA issues with only sketched ideas or even just rationalizations.) Would like to hear everyone's voices. Thanks, Jungtaek Lim (HeartSaVioR) ps. better yet, probably it's better then to restrict something explicitly if we sincerely respect the underlying culture on the statement "In case several people contributed, prefer to assign to the more ‘junior’, non-committer contributor".

Re: [VOTE] Release Spark 3.1.1 (RC2)

2021-02-09 Thread Jungtaek Lim
+1 (non-binding) * verified signatures * built custom distribution with enabling kubernetes & hadoop-cloud profile * built custom docker image from dist * ran applications "rate to kafka" & "kafka to kafka" on k8s cluster (local k3s) Thanks for driving the release

Re: [DISCUSS] Add RocksDB StateStore

2021-02-08 Thread Jungtaek Lim
+1 to add, no matter to add under sql-core vs external module. Rationalization for myself: * The discussion thread and voices here show strong demand for adding RocksDB state store out of the box. * No workaround on huge state store problem out of the box. Direct competitors on streaming

Re: [VOTE] Release Spark 3.1.1 (RC1)

2021-01-18 Thread Jungtaek Lim
+1 (non-binding) * verified signature and sha for all files (there's a glitch which I'll describe in below) * built source (DISCLAIMER: didn't run tests) and made custom distribution, and built a docker image based on the distribution - used profiles: kubernetes, hadoop-3.2, hadoop-cloud * ran

Re: [VOTE] Release Spark 3.1.0 (RC1)

2021-01-06 Thread Jungtaek Lim
No worries about the accident. We're human beings, and everyone can make a mistake. Let's wait and see the response on INFRA-21266. Just my 2 cents: I'm actually leaning toward skipping 3.1.0 and starting the release process for 3.1.1, as anyone could be some sort of "rushing" on verification on

Re: [VOTE] Release Spark 3.1.0 (RC1)

2021-01-05 Thread Jungtaek Lim
There's an issue SPARK-33635 [1] reported due to performance regression on Kafka read between Spark 2.4 vs 3.0, which sounds like a blocker. I'll mark this as a blocker, unless anyone has different opinions. 1. https://issues.apache.org/jira/browse/SPARK-33635 On Wed, Jan 6, 2021 at 9:01 AM

Re: What's the root cause of not supporting multiple aggregations in structured streaming?

2020-11-28 Thread Jungtaek Lim
e mode support. That is to say, if we use "complete" mode for every >> aggregation operators, the wrong result will return. >> >> SPARK-26655 would be a good start, which only considers about "append" >> mode. Maybe we need more discussion on the watermark

Re: Seeking committers' help to review on SS PR

2020-11-27 Thread Jungtaek Lim
/pull/27649 https://github.com/apache/spark/pull/28363 These are under 100 lines of changes per each, and not invasive. On Sat, Nov 28, 2020 at 11:34 AM Jungtaek Lim wrote: > Thanks for providing valuable feedback. Appreciate it. Sorry I haven't had > time to reply to this in time (w

Re: Seeking committers' help to review on SS PR

2020-11-27 Thread Jungtaek Lim
long if you have other capable >>> reviews and you are a committer, if you don't see that it impacts other >>> code meaningfully in a way that really demands review from others, and in >>> good faith judge that it is worthwhile. I think you are the one de facto >>> expert o

Re: [SS] full outer stream-stream join

2020-11-22 Thread Jungtaek Lim
Adding my rationale here: I asked to raise the thread on the dev mailing list to figure out possible reasons why full outer join was not added when left/right outer join was added. This is rather historical knowledge, so I have no idea about this. Most likely a limited number of folks

Seeking committers' help to review on SS PR

2020-11-22 Thread Jungtaek Lim
to continue struggling with new PRs. Thanks, Jungtaek Lim (HeartSaVioR) 1. https://github.com/apache/spark/pull/24173 2. https://issues.apache.org/jira/browse/SPARK-27237

Re: [DISCUSS] Review/merge phase, and post-review

2020-11-13 Thread Jungtaek Lim
more to see if anyone has further comments. Otherwise I'll merge this.". I see both are used across various PRs, so it's not really something I want to blame. Just want to make us think about what would be the ideal approach we'd be better to prefer. On Sat, Nov 14, 2020 at 3:46 PM Ju

Re: [DISCUSS] Review/merge phase, and post-review

2020-11-13 Thread Jungtaek Lim
d enough to know what they are. Can you point them out? I think that > is most productive for everyone to understand. > > On Fri, Nov 13, 2020 at 10:16 PM Jungtaek Lim < > kabhwan.opensou...@gmail.com> wrote: > >> Hi devs, >> >> I know this is a super sensitiv

[DISCUSS] Review/merge phase, and post-review

2020-11-13 Thread Jungtaek Lim
merging. Again I know it's super hard to reconsider the ongoing practice while the project has gone for the long way (10 years), but just wanted to hear the voices about this. Thanks, Jungtaek Lim (HeartSaVioR)

Re: [DISCUSS] Disable streaming query with possible correctness issue by default

2020-11-08 Thread Jungtaek Lim
mark and implementing operator-wise watermark properly. This is just a workaround, but fixing watermark would require major effort. Thanks, Jungtaek Lim (HeartSaVioR) 1. https://issues.apache.org/jira/browse/SPARK-24634 On Sat, Nov 7, 2020 at 3:59 PM Liang-Chi Hsieh wrote: > Hi devs, > >

Re: [DISCUSS] preferred behavior when fails to instantiate configured v2 session catalog

2020-10-25 Thread Jungtaek Lim
incorrect. > > On Fri, Oct 23, 2020 at 5:24 AM Russell Spitzer > wrote: > >> I was convinced that we should probably just fail, but if that is too >> much of a change, then logging the exception is also acceptable. >> >> On Thu, Oct 22, 2020, 10:32 PM Jungtaek Lim >

[DISCUSS] preferred behavior when fails to instantiate configured v2 session catalog

2020-10-22 Thread Jungtaek Lim
to add the exception information in the error log message at least. Would like to hear the voices. Thanks, Jungtaek Lim (HeartSaVioR)

Re: SQL DDL statements with replacing default catalog with custom catalog

2020-10-07 Thread Jungtaek Lim
the catalog name when writing table >> names, you can set your custom catalog as the default catalog (See >> SQLConf.DEFAULT_CATALOG). SQLConf.V2_SESSION_CATALOG_IMPLEMENTATION is >> used to extend the v1 session catalog, not replace it. >> >> On Wed, Oct 7, 2020
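
The two settings named in the quoted reply can be sketched as configuration maps. This assumes the keys behind them are `spark.sql.defaultCatalog` (SQLConf.DEFAULT_CATALOG) and `spark.sql.catalog.spark_catalog` (SQLConf.V2_SESSION_CATALOG_IMPLEMENTATION), and that a named catalog is registered under `spark.sql.catalog.<name>`. The helper names and the class name `com.example.MyCatalog` are hypothetical.

```python
from typing import Dict


def custom_default_catalog_conf(name: str, impl_class: str) -> Dict[str, str]:
    """Register a custom v2 catalog and make it the default catalog."""
    return {
        f"spark.sql.catalog.{name}": impl_class,
        "spark.sql.defaultCatalog": name,
    }


def session_catalog_extension_conf(impl_class: str) -> Dict[str, str]:
    """Extend (not replace) the built-in session catalog, per the quoted reply."""
    return {"spark.sql.catalog.spark_catalog": impl_class}


# Example with a hypothetical implementation class:
conf = custom_default_catalog_conf("my_catalog", "com.example.MyCatalog")
```

With the first map, table names can be written without the catalog prefix; the second map keeps `spark_catalog` as the entry point and only swaps its implementation.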

Re: SQL DDL statements with replacing default catalog with custom catalog

2020-10-07 Thread Jungtaek Lim
about > the new v2 DDL commands that work with v2 catalog APIs. > > On Wed, Oct 7, 2020 at 5:00 PM Jungtaek Lim > wrote: > >> My case is DROP TABLE and DROP TABLE supports both v1 and v2 (as it >> simply works when I use custom catalog without replacing the default >> c

Re: SQL DDL statements with replacing default catalog with custom catalog

2020-10-07 Thread Jungtaek Lim
CREATE TABLE LIKE), > so it's possible that some commands still go through the v1 session catalog > although you configured a custom v2 session catalog. > > Can you create JIRA tickets if you hit any DDL commands that don't support > v2 catalog? We should fix them. > > On Wed, O

Re: SQL DDL statements with replacing default catalog with custom catalog

2020-10-06 Thread Jungtaek Lim
st suites, but we haven't been > exploring the use of a session catalog for fallback. We use v2 for > everything now, which avoids the problem and comes with multi-catalog > support. > > On Tue, Oct 6, 2020 at 5:55 PM Jungtaek Lim > wrote: > >> Hi devs, >> >>

SQL DDL statements with replacing default catalog with custom catalog

2020-10-06 Thread Jungtaek Lim
interfaces. That sounds to me as being stuck and the only "clear" approach seems to disallow default catalog with custom one. Am I missing something? Thanks, Jungtaek Lim (HeartSaVioR)

Re: [DISCUSS] "latestFirst" option and metadata growing issue in File stream source

2020-09-27 Thread Jungtaek Lim
Bump to see if anyone is interested in or concerned about this. On Tue, Aug 25, 2020 at 4:56 PM Jungtaek Lim wrote: > Bump this again. > > On Tue, Aug 18, 2020 at 12:11 PM Jungtaek Lim < > kabhwan.opensou...@gmail.com> wrote: > >> Bump again. >> >> Unlike file st

Re: Output mode in Structured Streaming and DSv1 sink/DSv2 table

2020-09-27 Thread Jungtaek Lim
Bump to see if anyone is interested in or concerned about this. On Sun, Sep 20, 2020 at 1:59 PM Jungtaek Lim wrote: > Hi devs, > > We have a capability check in DSv2 defining which operations can be done > against the data source both read and write. The concept was brought in > DSv2, so

Re: What's the root cause of not supporting multiple aggregations in structured streaming?

2020-09-25 Thread Jungtaek Lim
rrect them. On Fri, Sep 4, 2020 at 5:55 PM Etienne Chauchot wrote: > Hi Jungtaek Lim, > > Nice to hear from you again since last time we talked :) and congrats on > becoming a Spark committer in the meantime ! (if I'm not mistaken you were > not at the time) > > I totall

Output mode in Structured Streaming and DSv1 sink/DSv2 table

2020-09-19 Thread Jungtaek Lim
urce is unable to truncate? (Foreach and Kafka output tables will be unable to apply complete mode afterwards.) Looking forward to hearing everyone's thoughts. Thanks, Jungtaek Lim (HeartSaVioR)

Re: [DISCUSS] Time to evaluate "continuous mode" in SS?

2020-09-15 Thread Jungtaek Lim
ng ago. >> >> Anecdotally, yes there are people using it that I know of at least, >> but I wouldn't know a lot of them. >> I think the question is, is it causing a problem, like a lot of >> maintenance? doesn't sound like it. >> >> On Tue, Sep 15, 2020 a

Re: [DISCUSS] Time to evaluate "continuous mode" in SS?

2020-09-15 Thread Jungtaek Lim
obably not? I don't see that it's > anywhere near deprecated, and not sure it's unmaintained - obviously > tests etc still have to keep passing. > > On Mon, Sep 14, 2020 at 11:34 PM Jungtaek Lim > wrote: > > > > Hi devs, > > > > It was Spark 2.3 in Feb 2018 w

[DISCUSS] Time to evaluate "continuous mode" in SS?

2020-09-14 Thread Jungtaek Lim
enance. I know there's a tendency to avoid discontinuing support where possible, but it sounds weird to keep something "unmaintained", especially when it's still "experimental" and the main authors are no longer active enough to promise maintenance/improvement of the module. Thoughts? Thanks, Jungtaek Lim (HeartSaVioR)

Re: What's the root cause of not supporting multiple aggregations in structured streaming?

2020-09-04 Thread Jungtaek Lim
Unfortunately I don't see enough active committers working on Structured Streaming; I don't expect major features/improvements can be brought in this situation. Technically I can review and merge the PR on major improvements in SS, but that depends on how huge the proposal is changing. If the

Re: [VOTE] Release Spark 3.0.1 (RC3)

2020-08-29 Thread Jungtaek Lim
all asc and sha512 files. - Checked no blocker issues exist on 3.0.1. Thanks, Jungtaek Lim (HeartSaVioR) On Sat, Aug 29, 2020 at 11:28 AM Sean Owen wrote: > +1 from me. Same result as the last RC. I did see this test failure > but I think it was transient; unless anyone else sees it. >

Re: [DISCUSS] "latestFirst" option and metadata growing issue in File stream source

2020-08-25 Thread Jungtaek Lim
Bump this again. On Tue, Aug 18, 2020 at 12:11 PM Jungtaek Lim wrote: > Bump again. > > Unlike file stream sink which has lots of limitations and many of us have > been suggesting alternatives, file stream source is the only way if end > users want to read the data from files.

Re: [DISCUSS] "latestFirst" option and metadata growing issue in File stream source

2020-08-17 Thread Jungtaek Lim
2020 at 3:06 PM Jungtaek Lim wrote: > Hi German, > > option 1 isn't about "deleting" the old files, as your input directory may > be accessed by multiple queries. Kafka centralizes the maintenance of input > data hence possible to apply retention without problem. >

Re: [DISCUSS] "latestFirst" option and metadata growing issue in File stream source

2020-07-31 Thread Jungtaek Lim
> > How I see it, I think It would be interesting to have a retention period > to delete old files and/or the possibility of indicating an offset > (Timestamp). It would be very "similar" to how we do it with kafka. > > WDYT? > > On Thu, 30 Jul 2020 at 23:51, Jungtaek

Re: [VOTE] Update the committer guidelines to clarify when to commit changes.

2020-07-30 Thread Jungtaek Lim
+1 (non-binding, I guess) Thanks for raising the issue and sorting it out! On Fri, Jul 31, 2020 at 6:47 AM Holden Karau wrote: > Hi Spark Developers, > > After the discussion of the proposal to amend Spark committer guidelines, > it appears folks are generally in agreement on policy

Re: [DISCUSS] "latestFirst" option and metadata growing issue in File stream source

2020-07-30 Thread Jungtaek Lim
ost. Is there > any way we can avoid listing the entire base directory and then filtering > out the new files? If the data is organized as partitions using date, will > it help to list only those partitions where new files were added? > > > On Thu, Jul 30, 2020 at 11:22 AM Jungta
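The question quoted above (listing only the date partitions that may contain new files, rather than the entire base directory) can be sketched roughly as follows. This is only an illustration under assumptions: the `date=YYYY-MM-DD` directory layout, the function name `partitions_to_list`, and its parameters are hypothetical and are not part of Spark's file stream source.

```python
from datetime import date, timedelta
from typing import List


def partitions_to_list(base_dir: str, last_seen: date, today: date) -> List[str]:
    """Return partition directories from last_seen through today, inclusive.

    Assumes a hypothetical layout of <base_dir>/date=YYYY-MM-DD.
    Returns an empty list if today is earlier than last_seen.
    """
    root = base_dir.rstrip("/")
    days = (today - last_seen).days
    return [
        f"{root}/date={(last_seen + timedelta(days=i)).isoformat()}"
        for i in range(days + 1)
    ]
```

Note that this only helps if files never arrive late in older partitions; any file written to a partition before `last_seen` would be missed by such a scheme.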

Re: [DISCUSS] "latestFirst" option and metadata growing issue in File stream source

2020-07-19 Thread Jungtaek Lim
at 6:18 AM Jungtaek Lim wrote: > Hi devs, > > As I have been going through the various issues on metadata log growing, > it's not only the issue of sink, but also the issue of source. > Unlike the sink metadata log, whose entries should be available to readers, > the source me

[DISCUSS] "latestFirst" option and metadata growing issue in File stream source

2020-07-19 Thread Jungtaek Lim
uch timestamp and forward order. This doesn't cover all use cases of "latestFirst", but since "latestFirst" doesn't seem natural with the concept of SS (think about watermark), I'd prefer to support alternatives instead of struggling with "latestFirst". Would like to hear your opinions. Thanks, Jungtaek Lim (HeartSaVioR)
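As a rough illustration of the alternative mentioned above (starting from a timestamp and reading in forward order), the selection logic might look like the sketch below. It is not Spark's implementation: `FileEntry`, `select_files`, and the `start_ms` / `max_files` parameters are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class FileEntry:
    # Hypothetical record for illustration; not a Spark class.
    path: str
    modified_ms: int  # modification time in epoch milliseconds


def select_files(entries: Iterable[FileEntry], start_ms: int, max_files: int) -> List[FileEntry]:
    """Return up to max_files entries modified at or after start_ms, oldest first."""
    # Drop everything older than the starting timestamp.
    eligible = [e for e in entries if e.modified_ms >= start_ms]
    # Forward order: oldest first, with the path as a tie-breaker for determinism.
    eligible.sort(key=lambda e: (e.modified_ms, e.path))
    return eligible[:max_files]
```

Because files are consumed oldest first, event times tend to advance with each batch, which fits the watermark concern raised in the message better than a newest-first order would.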

Re: Use /usr/bin/env python3 in scripts?

2020-07-17 Thread Jungtaek Lim
For me the merge script worked with Python 2.7, but I had some trouble with an encoding issue (probably from a contributor's name), so now I use the merge script with virtualenv & Python 3.7.7. "python3" would be OK for me as well, as it doesn't break virtualenv with Python 3. On Sat, Jul 18, 2020 at

Re: [DISCUSS] -1s and commits

2020-07-16 Thread Jungtaek Lim
On Fri, Jul 17, 2020 at 8:06 AM Holden Karau wrote: > > > On Thu, Jul 16, 2020 at 3:34 PM Jungtaek Lim > wrote: > >> I agree with Wenchen that there are different topics. >> > I agree. I mentioned it in my postscript because I wanted to provide the > context

Re: [DISCUSS] -1s and commits

2020-07-16 Thread Jungtaek Lim
I agree with Wenchen that there are different topics. The veto policy is clear, as the ASF doc describes it and explicitly says it cannot be overridden per project. In any case, the approach to resolving the situation should lead to voters withdrawing their vetoes. There's nothing to interpret

Re: Welcoming some new Apache Spark committers

2020-07-15 Thread Jungtaek Lim
>>> Congratulations! >>> >>> Regards, >>> Mridul >>> >>> On Tue, Jul 14, 2020 at 12:37 PM Matei Zaharia >>> wrote: >>> >>>> Hi all, >>>> >>>> The Spark PMC recently voted to add several new committers. P

Re: [DISCUSS] remove the incomplete code path on aggregation for continuous mode

2020-07-12 Thread Jungtaek Lim
Just submitted the patch: https://github.com/apache/spark/pull/29077 On Tue, Jun 16, 2020 at 3:40 PM Jungtaek Lim wrote: > Bump this again. I filed SPARK-31985 [1] and plan to submit a PR in a > couple of days if nobody voices a reason we should keep it. > > 1. https://issue

Re: restarting jenkins build system tomorrow (7/8) ~930am PDT

2020-07-09 Thread Jungtaek Lim
As a side note, I've raised patches for addressing two frequent flaky tests, CliSuite [1] and HiveSessionImplSuite [2]. Hope this helps to mitigate the situation. 1. https://github.com/apache/spark/pull/29036 2. https://github.com/apache/spark/pull/29039 On Thu, Jul 9, 2020 at 11:51 AM Hyukjin

Re: m2 cache issues in Jenkins?

2020-07-06 Thread Jungtaek Lim
Jungtaek Lim wrote: > Could this be a flaky or persistent issue? It failed with Scala gendoc but > it didn't fail with the part the PR modified. It ran from worker-05. > > > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125121/consoleFull > > On Tue, Jul 7,
