Re: [VOTE] Release Spark 3.4.2 (RC1)

2023-11-30 Thread Jia Fan
+1

L. C. Hsieh wrote on Thu, Nov 30, 2023 at 12:33:

> +1
>
> Thanks Dongjoon!
>
> On Wed, Nov 29, 2023 at 7:53 PM Mridul Muralidharan 
> wrote:
> >
> > +1
> >
> > Signatures, digests, etc check out fine.
> > Checked out tag and build/tested with -Phive -Pyarn -Pmesos -Pkubernetes
> >
> > Regards,
> > Mridul
> >
> > On Wed, Nov 29, 2023 at 5:08 AM Yang Jie  wrote:
> >>
> >> +1(non-binding)
> >>
> >> Jie Yang
> >>
> >> On 2023/11/29 02:08:04 Kent Yao wrote:
> >> > +1(non-binding)
> >> >
> >> > Kent Yao
> >> >
> >> > On 2023/11/27 01:12:53 Dongjoon Hyun wrote:
> >> > > Hi, Marc.
> >> > >
> >> > > Given that it exists in 3.4.0 and 3.4.1, I don't think it's a
> release
> >> > > blocker for Apache Spark 3.4.2.
> >> > >
> >> > > When the patch is ready, we can consider it for 3.4.3.
> >> > >
> >> > > In addition, note that we categorized release-blocker-level issues
> by
> >> > > marking 'Blocker' priority with `Target Version` before the vote.
> >> > >
> >> > > Best,
> >> > > Dongjoon.
> >> > >
> >> > >
> >> > > On Sat, Nov 25, 2023 at 12:01 PM Marc Le Bihan <
> mlebiha...@gmail.com> wrote:
> >> > >
> >> > > > -1, if you can wait until the last remaining problem with generics
> (?) is
> >> > > > entirely solved; it causes this exception to be thrown:
> >> > > >
> >> > > > java.lang.ClassCastException: class [Ljava.lang.Object; cannot be
> cast to class [Ljava.lang.reflect.TypeVariable; ([Ljava.lang.Object; and
> [Ljava.lang.reflect.TypeVariable; are in module java.base of loader
> 'bootstrap')
> >> > > > at
> org.apache.spark.sql.catalyst.JavaTypeInference$.encoderFor(JavaTypeInference.scala:116)
> >> > > > at
> org.apache.spark.sql.catalyst.JavaTypeInference$.$anonfun$encoderFor$1(JavaTypeInference.scala:140)
> >> > > > at
> scala.collection.ArrayOps$.map$extension(ArrayOps.scala:929)
> >> > > > at
> org.apache.spark.sql.catalyst.JavaTypeInference$.encoderFor(JavaTypeInference.scala:138)
> >> > > > at
> org.apache.spark.sql.catalyst.JavaTypeInference$.encoderFor(JavaTypeInference.scala:60)
> >> > > > at
> org.apache.spark.sql.catalyst.JavaTypeInference$.encoderFor(JavaTypeInference.scala:53)
> >> > > > at
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.javaBean(ExpressionEncoder.scala:62)
> >> > > > at org.apache.spark.sql.Encoders$.bean(Encoders.scala:179)
> >> > > > at org.apache.spark.sql.Encoders.bean(Encoders.scala)
> >> > > >
> >> > > >
> >> > > > https://issues.apache.org/jira/browse/SPARK-45311
> >> > > >
> >> > > > Thanks !
> >> > > >
> >> > > > Marc Le Bihan
> >> > > >
> >> > > >
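For context on the -1 above, a minimal, hypothetical sketch of the bean-with-generics
pattern that SPARK-45311 reports failing inside JavaTypeInference.encoderFor; the class
names are invented for illustration, and the exact trigger in the ticket may differ:

    import org.apache.spark.sql.Encoders

    // A hypothetical bean hierarchy with a type parameter, the shape that,
    // per the stack trace quoted above, fails in Encoders.bean on 3.4.x.
    class ValueHolder[T] extends Serializable {
      private var value: T = null.asInstanceOf[T]
      def getValue: T = value
      def setValue(v: T): Unit = { value = v }
    }
    class StringHolder extends ValueHolder[String]

    // On affected versions, this call is where the quoted
    // ClassCastException surfaces:
    val encoder = Encoders.bean(classOf[StringHolder])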
> >> > > > On 25/11/2023 11:48, Dongjoon Hyun wrote:
> >> > > >
> >> > > > Please vote on releasing the following candidate as Apache Spark
> version
> >> > > > 3.4.2.
> >> > > >
> >> > > > The vote is open until November 30th 1AM (PST) and passes if a
> majority +1
> >> > > > PMC votes are cast, with a minimum of 3 +1 votes.
> >> > > >
> >> > > > [ ] +1 Release this package as Apache Spark 3.4.2
> >> > > > [ ] -1 Do not release this package because ...
> >> > > >
> >> > > > To learn more about Apache Spark, please see
> https://spark.apache.org/
> >> > > >
> >> > > > The tag to be voted on is v3.4.2-rc1 (commit
> >> > > > 0c0e7d4087c64efca259b4fb656b8be643be5686)
> >> > > > https://github.com/apache/spark/tree/v3.4.2-rc1
> >> > > >
> >> > > > The release files, including signatures, digests, etc. can be
> found at:
> >> > > > https://dist.apache.org/repos/dist/dev/spark/v3.4.2-rc1-bin/
> >> > > >
> >> > > > Signatures used for Spark RCs can be found in this file:
> >> > > > https://dist.apache.org/repos/dist/dev/spark/KEYS
> >> > > >
> >> > > > The staging repository for this release can be found at:
> >> > > >
> https://repository.apache.org/content/repositories/orgapachespark-1450/
> >> > > >
> >> > > > The documentation corresponding to this release can be found at:
> >> > > > https://dist.apache.org/repos/dist/dev/spark/v3.4.2-rc1-docs/
> >> > > >
> >> > > > The list of bug fixes going into 3.4.2 can be found at the
> following URL:
> >> > > > https://issues.apache.org/jira/projects/SPARK/versions/12353368
> >> > > >
> >> > > > This release is using the release script of the tag v3.4.2-rc1.
> >> > > >
> >> > > > FAQ
> >> > > >
> >> > > > =
> >> > > > How can I help test this release?
> >> > > > =
> >> > > >
> >> > > > If you are a Spark user, you can help us test this release by
> taking
> >> > > > an existing Spark workload and running on this release candidate,
> then
> >> > > > reporting any regressions.
> >> > > >
> >> > > > If you're working in PySpark you can set up a virtual env and
> install
> >> > > > the current RC and see if anything important breaks; in the
> Java/Scala
> >> > > > you can add the staging repository to your project's resolvers and
> test
> >> > > > with the RC (make sure to clean up the artifact cache
> before/after so
> >> > > > you don't end up building with an out-of-date RC going forward).
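For anyone following the FAQ above from Java/Scala, a minimal sbt sketch for resolving
the RC from the staging repository quoted in this thread; the artifact version shown is
an assumption for illustration, and both the URL and version change with every RC:

    // build.sbt: a sketch for testing against a Spark RC staging repository.
    // The staging repository disappears once the vote closes.
    resolvers += "Apache Spark RC staging" at
      "https://repository.apache.org/content/repositories/orgapachespark-1450/"

    libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.4.2"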
> >> > > >
> >> 

Re: [VOTE] SPIP: State Data Source - Reader

2023-10-24 Thread Jia Fan
+1

L. C. Hsieh wrote on Tue, Oct 24, 2023 at 13:23:

> +1
>
> On Mon, Oct 23, 2023 at 6:31 PM Anish Shrigondekar
>  wrote:
> >
> > +1 (non-binding)
> >
> > Thanks,
> > Anish
> >
> > On Mon, Oct 23, 2023 at 5:01 PM Wenchen Fan  wrote:
> >>
> >> +1
> >>
> >> On Mon, Oct 23, 2023 at 4:03 PM Jungtaek Lim <
> kabhwan.opensou...@gmail.com> wrote:
> >>>
> >>> Starting with my +1 (non-binding). Thanks!
> >>>
> >>> On Mon, Oct 23, 2023 at 1:23 PM Jungtaek Lim <
> kabhwan.opensou...@gmail.com> wrote:
> 
>  Hi all,
> 
>  I'd like to start the vote for SPIP: State Data Source - Reader.
> 
>  The high-level summary of the SPIP is that we propose a new data
> source which enables reading the state store in a checkpoint via a
> batch query. This would enable two major use cases: 1) constructing tests
> that verify the state store, and 2) inspecting values in the state store
> when investigating an incident.
> 
>  References:
> 
>  JIRA ticket
>  SPIP doc
>  Discussion thread
> 
>  Please vote on the SPIP for the next 72 hours:
> 
>  [ ] +1: Accept the proposal as an official SPIP
>  [ ] +0
>  [ ] -1: I don’t think this is a good idea because …
> 
>  Thanks!
>  Jungtaek Lim (HeartSaVioR)
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>
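As an illustration of the proposal above, a sketch of the kind of batch query the SPIP
describes; the format name "statestore", the checkpoint path, and the resulting schema
are assumptions based on the SPIP discussion, not a released API:

    // A sketch: load a streaming query's state store from its checkpoint
    // as a batch DataFrame. All names here are illustrative.
    val stateDf = spark.read
      .format("statestore")
      .load("/tmp/checkpoints/my-streaming-query")

    stateDf.printSchema() // expected: key and value columns of the state rows
    stateDf.show()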


Re: [VOTE] Release Apache Spark 3.5.0 (RC5)

2023-09-11 Thread Jia Fan
+1

Ruifeng Zheng wrote on Tue, Sep 12, 2023 at 08:46:

> +1
>
> On Tue, Sep 12, 2023 at 7:24 AM Hyukjin Kwon  wrote:
>
>> +1
>>
>> On Tue, Sep 12, 2023 at 7:05 AM Xiao Li  wrote:
>>
>>> +1
>>>
>>> Xiao
>>>
Yuanjian Li wrote on Mon, Sep 11, 2023 at 10:53:
>>>
 @Peter Toth  I've looked into the details of
 this issue, and it appears that it's neither a regression in version 3.5.0
 nor a correctness issue. It's a bug related to a new feature. I think we
 can fix this in 3.5.1 and list it as a known issue of the Scala client of
 Spark Connect in 3.5.0.

 Mridul Muralidharan wrote on Sun, Sep 10, 2023 at 04:12:

>
> +1
>
> Signatures, digests, etc check out fine.
> Checked out tag and build/tested with -Phive -Pyarn -Pmesos
> -Pkubernetes
>
> Regards,
> Mridul
>
> On Sat, Sep 9, 2023 at 10:02 AM Yuanjian Li 
> wrote:
>
>> Please vote on releasing the following candidate (RC5) as Apache Spark
>> version 3.5.0.
>>
>> The vote is open until 11:59pm Pacific time Sep 11th and passes if a
>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 3.5.0
>>
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v3.5.0-rc5 (commit
>> ce5ddad990373636e94071e7cef2f31021add07b):
>>
>> https://github.com/apache/spark/tree/v3.5.0-rc5
>>
>> The release files, including signatures, digests, etc. can be found
>> at:
>>
>> https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc5-bin/
>>
>> Signatures used for Spark RCs can be found in this file:
>>
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>>
>> https://repository.apache.org/content/repositories/orgapachespark-1449
>>
>> The documentation corresponding to this release can be found at:
>>
>> https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc5-docs/
>>
>> The list of bug fixes going into 3.5.0 can be found at the following
>> URL:
>>
>> https://issues.apache.org/jira/projects/SPARK/versions/12352848
>>
>> This release is using the release script of the tag v3.5.0-rc5.
>>
>>
>> FAQ
>>
>> =
>>
>> How can I help test this release?
>>
>> =
>>
>> If you are a Spark user, you can help us test this release by taking
>>
>> an existing Spark workload and running on this release candidate, then
>>
>> reporting any regressions.
>>
>> If you're working in PySpark you can set up a virtual env and install
>>
>> the current RC and see if anything important breaks, in the Java/Scala
>>
>> you can add the staging repository to your project's resolvers and test
>>
>> with the RC (make sure to clean up the artifact cache before/after so
>>
>> you don't end up building with an out of date RC going forward).
>>
>> ===
>>
>> What should happen to JIRA tickets still targeting 3.5.0?
>>
>> ===
>>
>> The current list of open tickets targeted at 3.5.0 can be found at:
>>
>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>> Version/s" = 3.5.0
>>
>> Committers should look at those and triage. Extremely important bug
>>
>> fixes, documentation, and API tweaks that impact compatibility should
>>
>> be worked on immediately. Everything else please retarget to an
>>
>> appropriate release.
>>
>> ==
>>
>> But my bug isn't fixed?
>>
>> ==
>>
>> In order to make timely releases, we will typically not hold the
>>
>> release unless the bug in question is a regression from the previous
>>
>> release. That being said, if there is something which is a regression
>>
>> that has not been correctly targeted please ping me or a committer to
>>
>> help target the issue.
>>
>> Thanks,
>>
>> Yuanjian Li
>>
>


Re: [DISCUSS] Incremental statistics collection

2023-08-28 Thread Jia Fan
For databases with automatic deduplication capabilities, such as HBase:
suppose we insert 100 rows with the same rowkey, but in fact only one row
exists in HBase. Is the new statistics delta 100 or 1? And if HBase already
contained this rowkey, should the delta be 0? How should we handle this
situation?

Mich Talebzadeh wrote on Tue, Aug 29, 2023 at 07:22:

> I have never been fond of the notion that measuring inserts, updates, and
> deletes (referred to as DML) is the sole criterion for signaling a
> necessity to update statistics for Spark's CBO. Nevertheless, in the
> absence of an alternative mechanism, it seems this is the only approach at
> our disposal (can we use AI for it?). Personally, I would prefer some
> form of indication regarding shifts in the distribution of values in the
> histogram, overall density, and similar indicators. The decision to execute
> "ANALYZE TABLE xyz COMPUTE STATISTICS FOR COLUMNS" revolves around
> column-level statistics, which is why I would tend to focus on monitoring
> individual column-level statistics to detect any signals warranting a
> statistics update.
> HTH
>
> Mich Talebzadeh,
> Distinguished Technologist, Solutions Architect & Engineer
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Sat, 26 Aug 2023 at 21:30, Mich Talebzadeh 
> wrote:
>
>> Hi,
>>
>> Impressive, yet in the realm of classic DBMSs, it could be seen as a case
>> of old wine in a new bottle. The objective, I assume, is to employ dynamic
>> sampling to enhance the optimizer's capacity to create effective execution
>> plans without the burden of complete I/O and in less time.
>>
>> For instance:
>> ANALYZE TABLE xyz COMPUTE STATISTICS WITH SAMPLING = 5 percent
>>
>> This approach could potentially aid in estimating deltas by utilizing
>> sampling.
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Distinguished Technologist, Solutions Architect & Engineer
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Sat, 26 Aug 2023 at 20:58, RAKSON RAKESH 
>> wrote:
>>
>>> Hi all,
>>>
>>> I would like to propose the incremental collection of statistics in
>>> spark. SPARK-44817 
>>> has been raised for the same.
>>>
>>> Currently, Spark invalidates the stats after data-changing commands,
>>> which makes CBO non-functional. To update these stats, the user either
>>> needs to run the `ANALYZE TABLE` command or turn on
>>> `spark.sql.statistics.size.autoUpdate.enabled`. Both of these ways have
>>> their own drawbacks: executing the `ANALYZE TABLE` command triggers a
>>> full table scan, while the other only updates table and partition stats
>>> and can be costly in certain cases.
>>>
>>> The goal of this proposal is to collect stats incrementally while
>>> executing data changing commands by utilizing the framework introduced in
>>> SPARK-21669 .
>>>
>>> SPIP Document has been attached along with JIRA:
>>>
>>> https://docs.google.com/document/d/1CNPWg_L1fxfB4d2m6xfizRyYRoWS2uPCwTKzhL2fwaQ/edit?usp=sharing
>>>
>>> Hive also supports automatic collection of statistics to keep the stats
>>> consistent.
>>> I can find multiple spark JIRAs asking for the same:
>>> https://issues.apache.org/jira/browse/SPARK-28872
>>> https://issues.apache.org/jira/browse/SPARK-33825
>>>
>>> Regards,
>>> Rakesh
>>>
>>
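For readers unfamiliar with the commands under discussion, a brief sketch of how
statistics are collected and inspected today; the table and column names are
illustrative, and the sampling variant Mich suggests above is hypothetical syntax,
not an existing command:

    // Today's (non-incremental) statistics collection that the proposal
    // wants to avoid re-running in full after every data change.
    spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS") // table-level stats
    spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS price, qty")
    spark.sql("DESCRIBE TABLE EXTENDED sales").show(truncate = false)

    // The size-only auto-update mentioned in the proposal (no column stats):
    spark.conf.set("spark.sql.statistics.size.autoUpdate.enabled", "true")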


Re: Some questions about Spark github action

2023-08-24 Thread Jia Fan
Thanks, Xinrong and Jack. I will take a look; I also found that
https://github.com/apache/spark/pull/32092 is what I want. Thanks a lot.

Xinrong Meng wrote on Fri, Aug 25, 2023 at 04:30:

> Hi Jia,
>
> Consider reviewing GitHub Actions variables like
> $GITHUB_REPOSITORY. Detailed information can be found at
> https://docs.github.com/en/actions/learn-github-actions/variables.
> Additionally, you might find the code segment 
> https://github.com/apache/spark/blob/master/.github/workflows/build_and_test.yml#L72
> to be helpful.
>
> Thanks,
>
> Xinrong Meng
>
> On Thu, Aug 24, 2023 at 11:52 AM Jack Wells 
> wrote:
>
>> Hi Jia,
>>
>> Github Action workflows are stored in the .github/workflows directory off
>> the base of the git repo. Here’s a link:
>> https://github.com/apache/spark/tree/master/.github/workflows. Does this
>> help?
>>
>> Jack
>>
>> On Aug 24, 2023 at 04:54:31, Jia Fan  wrote:
>>
>>> Hi, folks
>>>   I'm a PMC member of Apache SeaTunnel. Recently, I've been optimizing the
>>> GitHub Actions process on SeaTunnel. The main goal is to do what Spark does:
>>> when a developer submits a PR, GitHub Actions runs automatically on
>>> the fork repository instead of the main repository. In this way, all
>>> developers can control enabling and retrying the GitHub Actions runs for
>>> their own PRs. I checked Spark's GitHub Actions configuration and found
>>> nothing special; is there any key point that I haven't noticed? I'd be very
>>> grateful if anyone could help.
>>>
>>> Best regards,
>>> Jia Fan
>>>
>>


Some questions about Spark github action

2023-08-24 Thread Jia Fan
Hi, folks
  I'm a PMC member of Apache SeaTunnel. Recently, I've been optimizing the
GitHub Actions process on SeaTunnel. The main goal is to do what Spark does:
when a developer submits a PR, GitHub Actions runs automatically on the fork
repository instead of the main repository. In this way, all developers can
control enabling and retrying the GitHub Actions runs for their own PRs. I
checked Spark's GitHub Actions configuration and found nothing special; is
there any key point that I haven't noticed? I'd be very grateful if anyone
could help.

Best regards,
Jia Fan


Re: [VOTE] Release Apache Spark 3.3.3 (RC1)

2023-08-13 Thread Jia Fan
+1

Mridul Muralidharan wrote on Fri, Aug 11, 2023 at 15:57:

>
> +1
>
> Signatures, digests, etc check out fine.
> Checked out tag and build/tested with -Phive -Pyarn -Pmesos -Pkubernetes
>
> Regards,
> Mridul
>
>
> On Fri, Aug 11, 2023 at 2:00 AM Cheng Pan  wrote:
>
>> +1 (non-binding)
>>
>> Passed integration test with Apache Kyuubi.
>>
>> Thanks for driving this release.
>>
>> Thanks,
>> Cheng Pan
>>
>>
>> > On Aug 11, 2023, at 06:36, L. C. Hsieh  wrote:
>> >
>> > +1
>> >
>> > Thanks Yuming.
>> >
>> > On Thu, Aug 10, 2023 at 3:24 PM Dongjoon Hyun 
>> wrote:
>> >>
>> >> +1
>> >>
>> >> Dongjoon
>> >>
>> >> On 2023/08/10 07:14:07 yangjie01 wrote:
>> >>> +1
>> >>> Thanks, Jie Yang
>> >>>
>> >>>
>> >>> From: Yuming Wang
>> >>> Date: Thursday, August 10, 2023, 13:33
>> >>> To: Dongjoon Hyun
>> >>> Cc: dev
>> >>> Subject: Re: [VOTE] Release Apache Spark 3.3.3 (RC1)
>> >>>
>> >>> +1 myself.
>> >>>
>> >>> On Tue, Aug 8, 2023 at 12:41 AM Dongjoon Hyun <
>> dongjoon.h...@gmail.com> wrote:
>> >>> Thank you, Yuming.
>> >>>
>> >>> Dongjoon.
>> >>>
>> >>> On Mon, Aug 7, 2023 at 9:30 AM yangjie01 <yangji...@baidu.com> wrote:
>> >>> Hi, Dongjoon and Yuming
>> >>>
>> >>> I submitted a PR a few days ago to try to fix this issue:
>> https://github.com/apache/spark/pull/42167. The reason for the failure is
>> that the branch daily tests and master use the same yml file.
>> >>>
>> >>> Jie Yang
>> >>>
>> >>> From: Dongjoon Hyun <dongjoon.h...@gmail.com>
>> >>> Date: Tuesday, August 8, 2023, 00:18
>> >>> To: Yuming Wang <yumw...@apache.org>
>> >>> Cc: dev <dev@spark.apache.org>
>> >>> Subject: Re: [VOTE] Release Apache Spark 3.3.3 (RC1)
>> >>>
>> >>> Hi, Yuming.
>> >>>
>> >>> One of the community GitHub Action test pipelines is consistently
>> unhealthy due to the Python mypy linter.
>> >>>
>> >>> https://github.com/apache/spark/actions/workflows/build_branch33.yml
>> >>>
>> >>> It seems to be due to a pipeline difference, since the same Python mypy
>> linter already passes in the commit build.
>> >>>
>> >>> Dongjoon.
>> >>>
>> >>>
>> >>> On Fri, Aug 4, 2023 at 8:09 PM Yuming Wang wrote:
>> >>> Please vote on releasing the following candidate as Apache Spark
>> version 3.3.3.
>> >>>
>> >>> The vote is open until 11:59pm Pacific time August 10th and passes if
>> a majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>> >>>
>> >>> [ ] +1 Release this package as Apache Spark 3.3.3
>> >>> [ ] -1 Do not release this package because ...
>> >>>
>> >>> To learn more about Apache Spark, please see https://spark.apache.org
>> >>>
>> >>> The tag to be voted on is v3.3.3-rc1 (commit
>> 8c2b3319c6734250ff9d72f3d7e5cab56b142195):
>> >>> https://github.com/apache/spark/tree/v3.3.3-rc1
>> >>>
>> >>> The release files, including signatures, digests, etc. can be found
>> at:
>> >>> https://dist.apache.org/repos/dist/dev/spark/v3.3.3-rc1-bin
>> >>>
>> >>> Signatures used for Spark RCs can be found in this file:
>> >>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>> >>>
>> >>> The staging repository for this release can be found at:
>> >>>
>> https://repository.apache.org/content/repositories/orgapachespark-1445
>> >>>
>> >>> The documentation corresponding to this release can be found at:
>> >>> https://dist.apache.org/repos/dist/dev/spark/v3.3.3-rc1-docs
>> >>>
>> >>> The list of bug fixes going into 3.3.3 can be found at the following
>> URL:
>> >>> https://s.apache.org/rjci4
>> >>>
>> >>> This release is using the release script of the tag v3.3.3-rc1.
>> >>>
>> >>>
>> >>> FAQ
>> >>>
>> >>> =
>> >>> How can I help test this release?
>> >>> =
>> >>> If you are a Spark user, you can help us test this release by taking
>> >>> an existing Spark workload and running on this release candidate, then
>> >>> reporting any regressions.

Re: What else could be removed in Spark 4?

2023-08-07 Thread Jia Fan
Thanks, Sean, for opening this discussion.

1. I think dropping Scala 2.12 is a good option.

2. Personally, I think we should remove most methods that have been deprecated 
since 2.x/1.x unless a good replacement can't be found. There is already a 3.x 
line as a buffer, and I don't think it is good practice to use a method deprecated 
in 2.x on 4.x. (As a concrete instance, see the createExternalTable sketch below.)

3. For Mesos, I think we should remove it from the docs first.


Jia Fan
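As a concrete instance of the deprecated-method question in point 2 above (and Sean's
createExternalTable example below), a sketch assuming a SparkSession named spark; the
table name and path are illustrative:

    // Deprecated since Spark 2.2 in favor of createTable:
    spark.catalog.createExternalTable("events", "/data/events")

    // The documented replacement (use one or the other):
    spark.catalog.createTable("events", "/data/events")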



> On Aug 8, 2023, at 05:47, Sean Owen wrote:
> 
> While we're noodling on the topic, what else might be worth removing in Spark 
> 4?
> 
> For example, looks like we're finally hitting problems supporting Java 8 
> through 21 all at once, related to Scala 2.13.x updates. It would be 
> reasonable to require Java 11, or even 17, as a baseline for the multi-year 
> lifecycle of Spark 4.
> 
> Dare I ask: drop Scala 2.12? supporting 2.12 / 2.13 / 3.0 might get hard 
> otherwise.
> 
> There was a good discussion about whether old deprecated methods should be 
> removed. They can't be removed at other times, but, doesn't mean they all 
> should be. createExternalTable was brought up as a first example. What 
> deprecated methods are worth removing?
> 
> There's Mesos support, long since deprecated, which seems like something to 
> prune.
> 
> Are there old Hive/Hadoop version combos we should just stop supporting?



Re: Welcome two new Apache Spark committers

2023-08-06 Thread Jia Fan
Congratulations!


Jia Fan


> On Aug 7, 2023, at 11:28, Ye Xianjin wrote:
> 
> Congratulations!
> 
> Sent from my iPhone
> 
>> On Aug 7, 2023, at 11:16 AM, Yuming Wang  wrote:
>> 
>> 
>> 
>> Congratulations!
>> 
>> On Mon, Aug 7, 2023 at 11:11 AM Kent Yao <y...@apache.org> wrote:
>>> Congrats! Peter and Xiduo!
>>> 
>>> Cheng Pan <pan3...@gmail.com> wrote on Mon, Aug 7, 2023 at 11:01:
>>> >
>>> > Congratulations! Peter and Xiduo!
>>> >
>>> > Thanks,
>>> > Cheng Pan
>>> >
>>> >
>>> > > On Aug 7, 2023, at 10:58, Gengliang Wang <ltn...@gmail.com> wrote:
>>> > >
>>> > > Congratulations! Peter and Xiduo!
>>> >
>>> >
>>> >
>>> > -
>>> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>> >
>>> 
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>> 



Re: [VOTE] SPIP: XML data source support

2023-07-28 Thread Jia Fan

+1


> On Jul 29, 2023, at 13:06, Adrian Pop-Tifrea wrote:
> 
> +1, the more data source formats, the better, and if the solution is already 
> thoroughly tested, I say we should go for it.
> 
> On Sat, Jul 29, 2023, 06:35 Xiao Li wrote:
>> +1
>> 
>> On Fri, Jul 28, 2023 at 15:54 Sean Owen wrote:
>>> +1 I think that porting the package 'as is' into Spark is probably 
>>> worthwhile.
>>> That's relatively easy; the code is already pretty battle-tested and not 
>>> that big and even originally came from Spark code, so is more or less 
>>> similar already.
>>> 
>>> One thing it never got was DSv2 support, which means XML reading would 
>>> still be somewhat behind other formats. (I was not able to implement it.)
>>> This isn't a necessary goal right now, but would be possibly part of the 
>>> logic of moving it into the Spark code base.
>>> 
>>> On Fri, Jul 28, 2023 at 5:38 PM Sandip Agarwala 
>>>  wrote:
 Dear Spark community,
 
 I would like to start the vote for "SPIP: XML data source support".
 
 XML is a widely used data format. An external spark-xml package 
 (https://github.com/databricks/spark-xml) is available to read and write 
 XML data in spark. Making spark-xml built-in will provide a better user 
 experience for Spark SQL and structured streaming. The proposal is to 
 inline code from the spark-xml package.
 
 SPIP link:
 https://docs.google.com/document/d/1ZaOBT4-YFtN58UCx2cdFhlsKbie1ugAn-Fgz_Dddz-Q/edit?usp=sharing
 
 JIRA:
 https://issues.apache.org/jira/browse/SPARK-44265
 
 Discussion Thread:
 https://lists.apache.org/thread/q32hxgsp738wom03mgpg9ykj9nr2n1fh
 
 Please vote on the SPIP for the next 72 hours:
 [ ] +1: Accept the proposal as an official SPIP
 [ ] +0
 [ ] -1: I don’t think this is a good idea because __.
 
 Thanks, Sandip
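For readers unfamiliar with the package proposed for inlining, a short sketch of the
external spark-xml read path; the format name "xml" and the rowTag option follow the
databricks/spark-xml README, the file path is illustrative, and the built-in API, if
adopted, may differ:

    // Reading XML with the external spark-xml package discussed above;
    // requires the com.databricks:spark-xml artifact on the classpath.
    val books = spark.read
      .format("xml")
      .option("rowTag", "book") // which XML element becomes one row
      .load("books.xml")

    books.printSchema()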



Re: [Reminder] Spark 3.5 Branch Cut

2023-07-14 Thread Jia Fan
Can we put [SPARK-44262][SQL] Add `dropTable` and `getInsertStatement` to
JdbcDialect into 3.5.0?
https://github.com/apache/spark/pull/41855
Since this is the last major version update of 3.x, I think we need to make
sure JdbcDialect can support more databases.
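For context, a sketch of the kind of JdbcDialect hooks the PR title describes; the
method signatures here are guesses for illustration, so check the merged PR for the
real ones:

    import org.apache.spark.sql.jdbc.JdbcDialect

    // A custom dialect sketching the hooks named in SPARK-44262.
    object MyDbDialect extends JdbcDialect {
      override def canHandle(url: String): Boolean = url.startsWith("jdbc:mydb")

      // Illustrative shapes of the two methods from the PR title:
      def dropTable(tableName: String): String = s"DROP TABLE $tableName"
      def getInsertStatement(table: String, columns: Array[String]): String =
        s"INSERT INTO $table (${columns.mkString(", ")}) " +
          s"VALUES (${columns.map(_ => "?").mkString(", ")})"
    }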


Gengliang Wang wrote on Sat, Jul 15, 2023 at 05:20:

> Hi Yuanjian,
>
> Besides the abovementioned changes, it would be great to include the UI
> page for Spark Connect: SPARK-44394.
>
> Best Regards,
> Gengliang
>
> On Fri, Jul 14, 2023 at 11:44 AM Julek Sompolski
>  wrote:
>
>> Thank you,
>> My changes that you listed are tracked under this Epic:
>> https://issues.apache.org/jira/browse/SPARK-43754
>> I am also working on https://issues.apache.org/jira/browse/SPARK-44422,
>> didn't mention it before because I have hopes that this one will make it
>> before the cut.
>>
>> (Unrelated) My colleague is also working on
>> https://issues.apache.org/jira/browse/SPARK-43923 and I am reviewing
>> https://github.com/apache/spark/pull/41443, so I hope that that one will
>> also make it before the cut.
>>
>> Best regards,
>> Juliusz Sompolski
>>
>> On Fri, Jul 14, 2023 at 7:34 PM Yuanjian Li 
>> wrote:
>>
>>> Hi everyone,
>>> As discussed earlier in "Time for Spark v3.5.0 release", I will cut
>>> branch-3.5 on *Monday, July 17th at 1 pm PST* as scheduled.
>>>
>>> Please plan your PR merge accordingly with the given timeline.
>>> Currently, we have received the following exception merge requests:
>>>
>>>- SPARK-44421: Reattach to existing execute in Spark Connect (server
>>>mechanism)
>>>- SPARK-44423:  Reattach to existing execute in Spark Connect (scala
>>>client)
>>>- SPARK-44424:  Reattach to existing execute in Spark Connect
>>>(python client)
>>>
>>> If there are any other exception feature requests, please reply to this
>>> email. We will not merge any new features in 3.5 after the branch cut.
>>>
>>> Best,
>>> Yuanjian
>>>
>>


Re: Time for Spark v3.5.0 release

2023-07-04 Thread Jia Fan
+1

Maxim Gekk wrote on Tue, Jul 4, 2023 at 17:23:

> +1
>
> On Tue, Jul 4, 2023 at 11:55 AM Kent Yao  wrote:
>
>> +1, thank you
>>
>> Kent
>>
>> On 2023/07/04 05:32:52 Dongjoon Hyun wrote:
>> > +1
>> >
>> > Thank you, Yuanjian
>> >
>> > Dongjoon
>> >
>> > On Tue, Jul 4, 2023 at 1:03 AM Hyukjin Kwon 
>> wrote:
>> >
>> > > Yeah one day postponed shouldn't be a big deal.
>> > >
>> > > On Tue, Jul 4, 2023 at 7:10 AM Yuanjian Li 
>> wrote:
>> > >
>> > >> Hi All,
>> > >>
>> > >> According to the Spark versioning policy at
>> > >> https://spark.apache.org/versioning-policy.html, should we cut
>> > >> *branch-3.5* on *July 17th, 2023*? (We initially proposed July
>> 16th,
>> > >> but since it's a Sunday, I suggest we postpone it by one day).
>> > >>
>> > >> I would like to volunteer as the release manager for Apache Spark
>> 3.5.0.
>> > >>
>> > >> Best,
>> > >> Yuanjian
>> > >>
>> > >
>> >
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: Beginner - Looking for starter issues

2023-06-29 Thread Jia Fan
Hi Harry,
Maybe you can start with 
https://issues.apache.org/jira/browse/SPARK-37935



Jia Fan


> On Jun 28, 2023, at 08:09, Harry wrote:
> 
> Hi, 
> 
> I am looking to pick up some tasks on ASF Jira.
> I have a basic understanding of how things work in the Spark code base.
> So I am thinking if I can start with some simple tasks to get ramped up.
> I tried searching on JIRA open issues and there were many.
> It was confusing as some tasks are interdependent on others or are being 
> completed as part of a bigger task. 
> Is there a tag for starter issues?
> Please let me know if you know of any small tasks. 
> I can start with a big one but I don't want to take up too much of the 
> reporter's time.
> 
> Thanking you in advance,
> Harry



Re: [VOTE] Release Spark 3.4.1 (RC1)

2023-06-19 Thread Jia Fan
+1

Dongjoon Hyun wrote on Tue, Jun 20, 2023 at 10:41:

> Please vote on releasing the following candidate as Apache Spark version
> 3.4.1.
>
> The vote is open until June 23rd 1AM (PST) and passes if a majority +1 PMC
> votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.4.1
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see https://spark.apache.org/
>
> The tag to be voted on is v3.4.1-rc1 (commit
> 6b1ff22dde1ead51cbf370be6e48a802daae58b6)
> https://github.com/apache/spark/tree/v3.4.1-rc1
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.4.1-rc1-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1443/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.4.1-rc1-docs/
>
> The list of bug fixes going into 3.4.1 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12352874
>
> This release is using the release script of the tag v3.4.1-rc1.
>
> FAQ
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your project's resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out-of-date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 3.4.1?
> ===
>
> The current list of open tickets targeted at 3.4.1 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.4.1
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>


Re: [VOTE] Release Plan for Apache Spark 4.0.0 (June 2024)

2023-06-12 Thread Jia Fan
By the way, as Holden said, what are the big features for 4.0.0? I think a
very big version change always brings some differences.

Jia Fan wrote on Tue, Jun 13, 2023 at 08:25:

> +1
>
> ____
>
> Jia Fan
>
>
>
> On Jun 13, 2023, at 03:51, Chao Sun wrote:
>
> +1
>
> On Mon, Jun 12, 2023 at 12:50 PM kazuyuki tanimura
>  wrote:
>
>> +1 (non-binding)
>>
>> Thank you!
>> Kazu
>>
>>
>> On Jun 12, 2023, at 11:32 AM, Holden Karau  wrote:
>>
>> -0
>>
>> I'd like to see more of a doc around what we're planning on for a 4.0
>> before we pick a target release date etc. (feels like cart before the
>> horse).
>>
>> But it's a weak preference.
>>
>> On Mon, Jun 12, 2023 at 11:24 AM Xiao Li  wrote:
>>
>>> Thanks for starting the vote.
>>>
>>> I do have a concern about the target release date of Spark 4.0.
>>>
>>> L. C. Hsieh wrote on Mon, Jun 12, 2023 at 11:09:
>>>
>>>> +1
>>>>
>>>> On Mon, Jun 12, 2023 at 11:06 AM huaxin gao 
>>>> wrote:
>>>> >
>>>> > +1
>>>> >
>>>> > On Mon, Jun 12, 2023 at 11:05 AM Dongjoon Hyun 
>>>> wrote:
>>>> >>
>>>> >> +1
>>>> >>
>>>> >> Dongjoon
>>>> >>
>>>> >> On 2023/06/12 18:00:38 Dongjoon Hyun wrote:
>>>> >> > Please vote on the release plan for Apache Spark 4.0.0.
>>>> >> >
>>>> >> > The vote is open until June 16th 1AM (PST) and passes if a
>>>> majority +1 PMC
>>>> >> > votes are cast, with a minimum of 3 +1 votes.
>>>> >> >
>>>> >> > [ ] +1 Have a release plan for Apache Spark 4.0.0 (June 2024)
>>>> >> > [ ] -1 Do not have a plan for Apache Spark 4.0.0 because ...
>>>> >> >
>>>> >> > ===
>>>> >> > Apache Spark 4.0.0 Release Plan
>>>> >> > ===
>>>> >> >
>>>> >> > 1. After creating `branch-3.5`, set "4.0.0-SNAPSHOT" in master
>>>> branch.
>>>> >> >
>>>> >> > 2. Creating `branch-4.0` on April 1st, 2024.
>>>> >> >
>>>> >> > 3. Apache Spark 4.0.0 RC1 on May 1st, 2024.
>>>> >> >
>>>> >> > 4. Apache Spark 4.0.0 Release in June, 2024.
>>>> >> >
>>>> >>
>>>> >> -
>>>> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>> >>
>>>>
>>>> -
>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>
>>>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>>
>>
>


Re: [VOTE] Release Plan for Apache Spark 4.0.0 (June 2024)

2023-06-12 Thread Jia Fan
+1



Jia Fan



> On Jun 13, 2023, at 03:51, Chao Sun wrote:
> 
> +1
> 
> On Mon, Jun 12, 2023 at 12:50 PM kazuyuki tanimura 
>  wrote:
>> +1 (non-binding)
>> 
>> Thank you!
>> Kazu
>> 
>> 
>>> On Jun 12, 2023, at 11:32 AM, Holden Karau <hol...@pigscanfly.ca> wrote:
>>> 
>>> -0
>>> 
>>> I'd like to see more of a doc around what we're planning on for a 4.0 
>>> before we pick a target release date etc. (feels like cart before the 
>>> horse).
>>> 
>>> But it's a weak preference.
>>> 
>>>> On Mon, Jun 12, 2023 at 11:24 AM Xiao Li <gatorsm...@gmail.com> wrote:
>>>> Thanks for starting the vote. 
>>>> 
>>>> I do have a concern about the target release date of Spark 4.0. 
>>>> 
>>>>> L. C. Hsieh <vii...@gmail.com> wrote on Mon, Jun 12, 2023 at 11:09:
>>>>> +1
>>>>> 
>>>>> > On Mon, Jun 12, 2023 at 11:06 AM huaxin gao <huaxin.ga...@gmail.com> wrote:
>>>>> >
>>>>> > +1
>>>>> >
>>>>> > > On Mon, Jun 12, 2023 at 11:05 AM Dongjoon Hyun <dongj...@apache.org> wrote:
>>>>> >>
>>>>> >> +1
>>>>> >>
>>>>> >> Dongjoon
>>>>> >>
>>>>> >> On 2023/06/12 18:00:38 Dongjoon Hyun wrote:
>>>>> >> > Please vote on the release plan for Apache Spark 4.0.0.
>>>>> >> >
>>>>> >> > The vote is open until June 16th 1AM (PST) and passes if a majority 
>>>>> >> > +1 PMC
>>>>> >> > votes are cast, with a minimum of 3 +1 votes.
>>>>> >> >
>>>>> >> > [ ] +1 Have a release plan for Apache Spark 4.0.0 (June 2024)
>>>>> >> > [ ] -1 Do not have a plan for Apache Spark 4.0.0 because ...
>>>>> >> >
>>>>> >> > ===
>>>>> >> > Apache Spark 4.0.0 Release Plan
>>>>> >> > ===
>>>>> >> >
>>>>> >> > 1. After creating `branch-3.5`, set "4.0.0-SNAPSHOT" in master 
>>>>> >> > branch.
>>>>> >> >
>>>>> >> > 2. Creating `branch-4.0` on April 1st, 2024.
>>>>> >> >
>>>>> >> > 3. Apache Spark 4.0.0 RC1 on May 1st, 2024.
>>>>> >> >
>>>>> >> > 4. Apache Spark 4.0.0 Release in June, 2024.
>>>>> >> >
>>>>> >>
>>>>> >> -
>>>>> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>> >>
>>>>> 
>>>>> -
>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>> 
>>> 
>>> 
>>> -- 
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.): 
>>> https://amzn.to/2MaRAG9
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>> 



Re: Apache Spark 3.4.1 Release?

2023-06-08 Thread Jia Fan
+1



Jia Fan



> On Jun 9, 2023, at 08:00, Yuming Wang wrote:
> 
> +1.
> 
>> On Fri, Jun 9, 2023 at 7:14 AM Chao Sun <sunc...@apache.org> wrote:
>> +1 too
>> 
>> On Thu, Jun 8, 2023 at 2:34 PM kazuyuki tanimura
>>  wrote:
>> >
>> > +1 (non-binding), Thank you Dongjoon
>> >
>> > Kazu
>> >
>> 
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> 



Re: Apache Spark 3.5.0 Expectations (?)

2023-05-28 Thread Jia Fan
Thanks Dongjoon!
There are some tickets I want to share.
SPARK-39420 Support ANALYZE TABLE on v2 tables
SPARK-42750 Support INSERT INTO by name
SPARK-43521 Support CREATE TABLE LIKE FILE

Dongjoon Hyun wrote on Mon, May 29, 2023 at 08:42:

> Hi, All.
>
> Apache Spark 3.5.0 is scheduled for August (1st Release Candidate) and
> currently a few notable things are under discussions in the mailing list.
>
> I believe it's a good time to share a short summary list (containing both
> completed and in-progress items) to give a highlight in advance and to
> collect your targets too.
>
> Please share your expectations or working items if you want to prioritize
> them more in the community in Apache Spark 3.5.0 timeframe.
>
> (Sorted by ID)
> SPARK-40497 Upgrade Scala 2.13.11
> SPARK-42452 Remove hadoop-2 profile from Apache Spark 3.5.0
> SPARK-42913 Upgrade to Hadoop 3.3.5 (aws-java-sdk-bundle: 1.12.262 ->
> 1.12.316)
> SPARK-43024 Upgrade Pandas to 2.0.0
> SPARK-43200 Remove Hadoop 2 reference in docs
> SPARK-43347 Remove Python 3.7 Support
> SPARK-43348 Support Python 3.8 in PyPy3
> SPARK-43351 Add Spark Connect Go prototype code and example
> SPARK-43379 Deprecate old Java 8 versions prior to 8u371
> SPARK-43394 Upgrade to Maven 3.8.8
> SPARK-43436 Upgrade to RocksDbjni 8.1.1.1
> SPARK-43446 Upgrade to Apache Arrow 12.0.0
> SPARK-43447 Support R 4.3.0
> SPARK-43489 Remove protobuf 2.5.0
> SPARK-43519 Bump Parquet to 1.13.1
> SPARK-43581 Upgrade kubernetes-client to 6.6.2
> SPARK-43588 Upgrade to ASM 9.5
> SPARK-43600 Update K8s doc to recommend K8s 1.24+
> SPARK-43738 Upgrade to DropWizard Metrics 4.2.18
> SPARK-43831 Build and Run Spark on Java 21
> SPARK-43832 Upgrade to Scala 2.12.18
> SPARK-43836 Make Scala 2.13 as default in Spark 3.5
> SPARK-43842 Upgrade gcs-connector to 2.2.14
> SPARK-43844 Update to ORC 1.9.0
> UMBRELLA: Add SQL functions into Scala, Python and R API
>
> Thanks,
> Dongjoon.
>
> PS. The above is not a list of release blockers. Instead, it could be a
> nice-to-have from someone's perspective.
>


Re: [DISCUSS] Add SQL functions into Scala, Python and R API

2023-05-24 Thread Jia Fan
+1
It is important that different APIs can be used to call the same function.
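For illustration, a minimal Scala sketch of the expr() workaround described in the
quoted proposal below, assuming a SparkSession named spark and using percentile as
the example of a SQL-only function:

    import org.apache.spark.sql.functions.{col, expr}

    val df = spark.range(100).withColumn("v", col("id").cast("double"))

    // There is no functions.percentile today, but the SQL function can still
    // be reached through expr(), which parses a SQL expression string:
    df.select(expr("percentile(v, 0.5)").as("median")).show()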

Ryan Berti wrote on Thu, May 25, 2023 at 01:48:

> During my recent experience developing functions, I found that identifying
> locations (sql + connect functions.scala + functions.py, FunctionRegistry,
> + whatever is required for R) and standards for adding function signatures
> was not straightforward (should you use optional args or overload
> functions? which col/lit helpers should be used when?). Are there docs
> describing all of the locations + standards for defining a function? If
> not, that'd be great to have too.
>
> Ryan Berti
>
> Senior Data Engineer  |  Ads DE
>
> M 7023217573
>
> 5808 W Sunset Blvd  |  Los Angeles, CA 90028
>
>
>
> On Wed, May 24, 2023 at 12:44 AM Enrico Minack 
> wrote:
>
>> +1
>>
>> Functions available in SQL (more general in one API) should be available
>> in all APIs. I am very much in favor of this.
>>
>> Enrico
>>
>>
>> Am 24.05.23 um 09:41 schrieb Hyukjin Kwon:
>>
>> Hi all,
>>
>> I would like to discuss adding all SQL functions into Scala, Python and R
>> API.
>> We have SQL functions that do not exist in Scala, Python and R around 175.
>> For example, we don’t have pyspark.sql.functions.percentile but you can
>> invoke
>> it as a SQL function, e.g., SELECT percentile(...).
>>
>> The reason why we do not have all functions in the first place is that we
>> want to
>> only add commonly used functions, see also
>> https://github.com/apache/spark/pull/21318 (which I agreed at that time)
>>
>> However, this has been raised multiple times over years, from the OSS
>> community, dev mailing list, JIRAs, stackoverflow, etc.
>> Seems it’s confusing about which function is available or not.
>>
>> Yes, we have a workaround. We can call all expressions by expr("...") or 
>> call_udf("...",
>> Columns ...)
>> But still it seems that it’s not very user-friendly because they expect
>> them available under the functions namespace.
>>
>> Therefore, I would like to propose adding all expressions into all
>> languages so that Spark is simpler and less confusing, e.g., which API is
>> in functions or not.
>>
>> Any thoughts?
>>
>>
>>


Re: [CONNECT] New Clients for Go and Rust

2023-05-19 Thread Jia Fan
Hi,

Thanks for the contribution!
I prefer (1), for a few reasons:

1. Different repositories can maintain independent versions, different
release schedules, and faster bug-fix releases.

2. Different languages have different build tools. Putting them in one
repository will make the main repository more and more complicated, and it
will become extremely difficult to perform a complete build in the main
repository.

3. Separate repositories make CI configuration and execution easier, and
the PR and commit lists will be clearer.

4. Other projects also govern clients in separate repositories; for example,
ClickHouse uses separate repositories for its JDBC, ODBC, and C++ clients:
https://github.com/ClickHouse/clickhouse-java
https://github.com/ClickHouse/clickhouse-odbc
https://github.com/ClickHouse/clickhouse-cpp

PS: I'm looking forward to the javascript connect client!

Thanks Regards
Jia Fan

Martin Grund wrote on Fri, May 19, 2023 at 20:03:

> Hi folks,
>
> When Bo (thanks for the time and contribution) started the work on
> https://github.com/apache/spark/pull/41036 he started the Go client
> directly in the Spark repository. In the meantime, I was approached by
> other engineers who are willing to contribute to working on a Rust client
> for Spark Connect.
>
> Now one of the key questions is where should these connectors live and how
> we manage expectations most effectively.
>
> At the high level, there are two approaches:
>
> (1) "3rd party" (non-JVM / Python) clients should live in separate
> repositories owned and governed by the Apache Spark community.
>
> (2) All clients should live in the main Apache Spark repository in the
> `connector/connect/client` directory.
>
> (3) Non-native (Python, JVM) Spark Connect clients should not be part of
> the Apache Spark repository and governance rules.
>
> Before we iron out how exactly, we mark these clients as experimental and
> how we align their release process etc with Spark, my suggestion would be
> to get a consensus on this first question.
>
> Personally, I'm fine with (1) and (2) with a preference for (2).
>
> Would love to get feedback from other members of the community!
>
> Thanks
> Martin
>
>
>
>


Re: The Spark email setting should be updated

2023-04-19 Thread Jia Fan
Thanks for Kelly's explanation, yes, it is the same as what you described.

Jonathan Kelly wrote on Thu, Apr 20, 2023 at 07:03:

> In Gmail, if I click the Reply button at the bottom of this thread, it
> defaults to sending the reply only to the individual who sent the last
> message. Similarly, if I click the Reply arrow button to the right of each
> message, it responds only to the person who sent that message.
>
> In order to respond to the list, I had to click "Reply All", move the list
> to the To field and remove everybody else.
>
> Is this the same issue you are talking about, Jia?
>
> ~ Jonathan Kelly
>
> On Wed, Apr 19, 2023 at 3:29 PM Rui Wang  wrote:
>
>> I am replying now and the default address is dev@spark.apache.org.
>>
>>
>> -Rui
>>
>> On Mon, Apr 17, 2023 at 4:27 AM Jia Fan  wrote:
>>
>>> Hi, everyone.
>>>
>>> I find that every time I reply to dev's mailing list, the default
>>> address of the reply is the sender of the mail, not dev@spark.apache.org.
>>> Several times this caused me to think that my reply to dev had been sent
>>> successfully when it hadn't. This should not be a common problem, because when I
>>> reply to emails from other communities, the default reply address is
>>> d...@xxx.apache.org. Can Spark modify the corresponding settings to
>>> reduce the chance of developers replying incorrectly?
>>>
>>> Thanks
>>>
>>>
>>> 
>>>
>>>
>>> Jia Fan
>>>
>>


The Spark email setting should be updated

2023-04-17 Thread Jia Fan
Hi, everyone.

I find that every time I reply to the dev mailing list, the default address
of the reply is the sender of the mail, not dev@spark.apache.org. Several
times this caused me to think that my reply to dev had been sent successfully
when it hadn't. This should not be a common problem, because when I reply to
emails from other communities, the default reply address is
d...@xxx.apache.org. Can Spark modify the corresponding settings to reduce
the chance of developers replying incorrectly?

Thanks





Jia Fan


Re: [VOTE] Release Apache Spark 3.4.0 (RC7)

2023-04-11 Thread Jia Fan
+1

Wenchen Fan wrote on Tue, Apr 11, 2023 at 14:32:

> +1
>
> On Tue, Apr 11, 2023 at 9:57 AM Yuming Wang  wrote:
>
>> +1.
>>
>> On Tue, Apr 11, 2023 at 9:14 AM Yikun Jiang  wrote:
>>
>>> +1 (non-binding)
>>>
>>> Also ran the docker image related test (signatures/standalone/k8s) with
>>> rc7: https://github.com/apache/spark-docker/pull/32
>>>
>>> Regards,
>>> Yikun
>>>
>>>
>>> On Tue, Apr 11, 2023 at 4:44 AM Jacek Laskowski  wrote:
>>>
 +1

 * Built fine with Scala 2.13
 and -Pkubernetes,hadoop-cloud,hive,hive-thriftserver,scala-2.13,volcano
 * Ran some demos on Java 17
 * Mac mini / Apple M2 Pro / Ventura 13.3.1

 Pozdrawiam,
 Jacek Laskowski
 
 "The Internals Of" Online Books 
 Follow me on https://twitter.com/jaceklaskowski

 


 On Sat, Apr 8, 2023 at 1:30 AM Xinrong Meng 
 wrote:

> Please vote on releasing the following candidate(RC7) as Apache Spark
> version 3.4.0.
>
> The vote is open until 11:59pm Pacific time *April 12th* and passes
> if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.4.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v3.4.0-rc7 (commit
> 87a5442f7ed96b11051d8a9333476d080054e5a0):
> https://github.com/apache/spark/tree/v3.4.0-rc7
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.4.0-rc7-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1441
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.4.0-rc7-docs/
>
> The list of bug fixes going into 3.4.0 can be found at the following
> URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12351465
>
> This release is using the release script of the tag v3.4.0-rc7.
>
>
> FAQ
>
> =
> How can I help test this release?
> =
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your project's resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out of date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 3.4.0?
> ===
> The current list of open tickets targeted at 3.4.0 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.4.0
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>
> Thanks,
> Xinrong Meng
>



Re: Undelivered Mail Returned to Sender

2023-03-08 Thread Jia Fan
Hi guys,
  After I reloaded and rebuilt the project, I finally ran the test case
successfully. Thanks for your help, Herman and Hyukjin.

Mail Delivery System wrote on Wed, Mar 8, 2023 at 22:38:

> This is the mail system at host mxout1-he-de.apache.org.
>
> I'm sorry to have to inform you that your message could not
> be delivered to one or more recipients. It's attached below.
>
> For further assistance, please send mail to postmaster.
>
> If you do so, please include this problem report. You can
> delete your own text from the attached returned message.
>
>The mail system
>
> : Host or domain name not found. Name
> service
> error for name=databricks.com.invalid type=: Host not found
>
>
>
> -- Forwarded message --
> From: Jia Fan 
> To: Herman van Hovell 
> Cc: Hyukjin Kwon , dev@spark.apache.org
> Bcc:
> Date: Wed, 8 Mar 2023 22:37:33 +0800
> Subject: Re: [Question] Can't start Spark Connect
> Hi Herman,
I just use ./build/mvn -DskipTests clean package; I have also tried
./build/mvn -DskipTests clean install.
>
Herman van Hovell wrote on Wed, Mar 8, 2023 at 21:17:
>
>> Hi Jia,
>>
>> How are you building connect?
>>
>> Kind regards,
>> Herman
>>
>> On Wed, Mar 8, 2023 at 8:48 AM Jia Fan  wrote:
>>
>> Thanks for the reply,
>> I have done a clean build with Maven a few times, but it always reports:
>>>
>>> /Users/xxx/Code/spark/core/target/generated-sources/org/apache/spark/status/protobuf/StoreTypes.java:658:9
>>> java: symbol not found
>>>Symbol: class UnusedPrivateParameter
>>>Location: class org.apache.spark.status.protobuf.StoreTypes.JobData
>>>    I think maybe it's a protobuf version conflict?
>>> https://user-images.githubusercontent.com/32387433/223716946-85761a34-f86c-4ba1-9557-a59d0d5b9958.png
>>>
>>>
Hyukjin Kwon wrote on Wed, Mar 8, 2023 at 19:09:
>>>
>>>> Just doing a clean build with Maven, and running a test case like
>>>> `SparkConnectServiceSuite` in IntelliJ should work.
>>>>
>>>> On Wed, 8 Mar 2023 at 15:02, Jia Fan  wrote:
>>>>
>>>>> Hi developers,
>>>>>    I want to contribute some code for Spark Connect. Are there any docs
>>>>> for getting started? I want to debug SimpleSparkConnectService, but I
>>>>> can't start it in IntelliJ IDEA. I would appreciate any help.
>>>>>
>>>>> Thanks
>>>>>
>>>>> 
>>>>>
>>>>>
>>>>> Jia Fan
>>>>>
>>>>


Re: [Question] Can't start Spark Connect

2023-03-08 Thread Jia Fan
Hi Herman,
I just use ./build/mvn -DskipTests clean package; I have also tried
./build/mvn -DskipTests clean install.

Herman van Hovell wrote on Wed, Mar 8, 2023 at 21:17:

> Hi Jia,
>
> How are you building connect?
>
> Kind regards,
> Herman
>
> On Wed, Mar 8, 2023 at 8:48 AM Jia Fan  wrote:
>
>> Thanks for the reply,
>> I have done a clean build with Maven a few times, but it always reports:
>>
>> /Users/xxx/Code/spark/core/target/generated-sources/org/apache/spark/status/protobuf/StoreTypes.java:658:9
>> java: symbol not found
>>Symbol: class UnusedPrivateParameter
>>Location: class org.apache.spark.status.protobuf.StoreTypes.JobData
>>    I think maybe it's a protobuf version conflict?
>> https://user-images.githubusercontent.com/32387433/223716946-85761a34-f86c-4ba1-9557-a59d0d5b9958.png
>>
>>
Hyukjin Kwon wrote on Wed, Mar 8, 2023 at 19:09:
>>
>>> Just doing a clean build with Maven, and running a test case like
>>> `SparkConnectServiceSuite` in IntelliJ should work.
>>>
>>> On Wed, 8 Mar 2023 at 15:02, Jia Fan  wrote:
>>>
>>>> Hi developers,
>>>>    I want to contribute some code for Spark Connect. Are there any docs
>>>> for getting started? I want to debug SimpleSparkConnectService, but I
>>>> can't start it in IntelliJ IDEA. I would appreciate any help.
>>>>
>>>> Thanks
>>>>
>>>> 
>>>>
>>>>
>>>> Jia Fan
>>>>
>>>


Re: [Question] Can't start Spark Connect

2023-03-08 Thread Jia Fan
Thanks for the reply,
I have done a clean build with Maven a few times, but it always reports:
/Users/xxx/Code/spark/core/target/generated-sources/org/apache/spark/status/protobuf/StoreTypes.java:658:9
java: symbol not found
   Symbol: class UnusedPrivateParameter
   Location: class org.apache.spark.status.protobuf.StoreTypes.JobData
   I think maybe it's a protobuf version conflict?
https://user-images.githubusercontent.com/32387433/223716946-85761a34-f86c-4ba1-9557-a59d0d5b9958.png


Hyukjin Kwon wrote on Wed, Mar 8, 2023 at 19:09:

> Just doing a clean build with Maven, and running a test case like
> `SparkConnectServiceSuite` in IntelliJ should work.
>
> On Wed, 8 Mar 2023 at 15:02, Jia Fan  wrote:
>
>> Hi developers,
>>    I want to contribute some code for Spark Connect. Are there any docs
>> for getting started? I want to debug SimpleSparkConnectService, but I
>> can't start it in IntelliJ IDEA. I would appreciate any help.
>>
>> Thanks
>>
>> 
>>
>>
>> Jia Fan
>>
>


[Question] Can't start Spark Connect

2023-03-07 Thread Jia Fan
Hi developers,
   I want to contribute some code for Spark Connect. Are there any docs for
getting started? I want to debug SimpleSparkConnectService, but I can't start
it in IntelliJ IDEA. I would appreciate any help.

Thanks




Jia Fan