Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-12 Thread Chao Sun
This is an important feature that can unblock several other projects,
including bucket join support for DataSource v2, complete support for
enforcing DataSource v2 distribution requirements on the write path, etc. I
like Ryan's proposals, which look simple and elegant, with nice support for
function overloading and variadic arguments. On the other hand, I think
Wenchen made a very good point about performance. Overall, I'm excited to
see the active discussion on this topic and believe the community will arrive
at a proposal that takes the best of both sides.

Chao

On Fri, Feb 12, 2021 at 7:58 PM Hyukjin Kwon  wrote:

> +1 for Liang-chi's.
>
> Thanks Ryan and Wenchen for leading this.
>
>
> On Sat, Feb 13, 2021 at 12:18 PM, Liang-Chi Hsieh wrote:
>
>> Basically, I think the proposal makes sense to me and I'd like to support
>> the SPIP, as it looks like we have a strong need for this important feature.
>>
>> Thanks Ryan for working on this, and I also look forward to Wenchen's
>> implementation. Thanks for the discussion too.
>>
>> Actually, I think the SupportsInvoke proposed by Ryan looks like a good
>> alternative to me. Besides Wenchen's alternative implementation, is there
>> a chance we could also have a SupportsInvoke implementation for comparison?
>>
>>
>> John Zhuge wrote
>> > Excited to see our Spark community rallying behind this important
>> feature!
>> >
>> > The proposal lays a solid foundation of minimal feature set with careful
>> > considerations for future optimizations and extensions. Can't wait to
>> see
>> > it leading to more advanced functionalities like views with shared
>> custom
>> > functions, function pushdown, lambda, etc. It has already borne fruit
>> from
>> > the constructive collaborations in this thread. Looking forward to
>> > Wenchen's prototype and further discussions including the SupportsInvoke
>> > extension proposed by Ryan.
>> >
>> >
>> > On Fri, Feb 12, 2021 at 4:35 PM Owen O'Malley 
>>
>> > owen.omalley@
>>
>> > 
>> > wrote:
>> >
>> >> I think this proposal is a very good thing giving Spark a standard way
>> of
>> >> getting to and calling UDFs.
>> >>
>> >> I like having the ScalarFunction as the API to call the UDFs. It is
>> >> simple, yet covers all of the polymorphic type cases well. I think it
>> >> would
>> >> also simplify using the functions in other contexts like pushing down
>> >> filters into the ORC & Parquet readers although there are a lot of
>> >> details
>> >> that would need to be considered there.
>> >>
>> >> .. Owen
>> >>
>> >>
>> >> On Fri, Feb 12, 2021 at 11:07 PM Erik Krogen 
>>
>> > ekrogen@.com
>>
>> > 
>> >> wrote:
>> >>
>> >>> I agree that there is a strong need for a FunctionCatalog within Spark
>> >>> to
>> >>> provide support for shareable UDFs, as well as make movement towards
>> >>> more
>> >>> advanced functionality like views which themselves depend on UDFs, so
>> I
>> >>> support this SPIP wholeheartedly.
>> >>>
>> >>> I find both of the proposed UDF APIs to be sufficiently user-friendly
>> >>> and
>> >>> extensible. I generally think Wenchen's proposal is easier for a user
>> to
>> >>> work with in the common case, but has greater potential for confusing
>> >>> and
>> >>> hard-to-debug behavior due to use of reflective method signature
>> >>> searches.
>> >>> The merits on both sides can hopefully be more properly examined with
>> >>> code,
>> >>> so I look forward to seeing an implementation of Wenchen's ideas to
>> >>> provide
>> >>> a more concrete comparison. I am optimistic that we will not let the
>> >>> debate
>> >>> over this point unreasonably stall the SPIP from making progress.
>> >>>
>> >>> Thank you to both Wenchen and Ryan for your detailed consideration and
>> >>> evaluation of these ideas!
>> >>> --
>> >>> *From:* Dongjoon Hyun 
>>
>> > dongjoon.hyun@
>>
>> > 
>> >>> *Sent:* Wednesday, February 10, 2021 6:06 PM
>> >>> *To:* Ryan Blue 
>>
>> > blue@
>>
>> > 
>> >>> *Cc:* Holden Karau 
>>
>> > holden@
>>
>> > ; Hyukjin Kwon <
>> >>>
>>
>> > gurwls223@
>>
>> >>; Spark Dev List 
>>
>> > dev@.apache
>>
>> > ; Wenchen Fan
>> >>> 
>>
>> > cloud0fan@
>>
>> > 
>> >>> *Subject:* Re: [DISCUSS] SPIP: FunctionCatalog
>> >>>
>> >>> BTW, I forgot to add my opinion explicitly in this thread because I
>> was
>> >>> on the PR before this thread.
>> >>>
>> >>> 1. The `FunctionCatalog API` PR was made on May 9, 2019 and has been
>> >>> there for almost two years.
>> >>> 2. I already gave my +1 on that PR last Saturday because I agreed with
>> >>> the latest updated design docs and AS-IS PR.
>> >>>
>> >>> And, the rest of the progress in this thread is also very satisfying
>> to
>> >>> me.
>> >>> (e.g. Ryan's extension suggestion and Wenchen's alternative)
>> >>>
>> >>> To All:
>> >>> Please take a look at the design doc and the PR, and give us some
>> >>> opinions.
>> >>> We really need your participation in order to make DSv2 more complete.
>> >>> This will unblock other DSv2 features, too.
>> >>>
>> >>> Bests,
>> >>> Dongjoon.
>> >>>
>> >>>
>> 

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-12 Thread Hyukjin Kwon
+1 for Liang-chi's.

Thanks Ryan and Wenchen for leading this.


On Sat, Feb 13, 2021 at 12:18 PM, Liang-Chi Hsieh wrote:

> Basically, I think the proposal makes sense to me and I'd like to support
> the SPIP, as it looks like we have a strong need for this important feature.
>
> Thanks Ryan for working on this, and I also look forward to Wenchen's
> implementation. Thanks for the discussion too.
>
> Actually, I think the SupportsInvoke proposed by Ryan looks like a good
> alternative to me. Besides Wenchen's alternative implementation, is there a
> chance we could also have a SupportsInvoke implementation for comparison?
>
>
> John Zhuge wrote
> > Excited to see our Spark community rallying behind this important
> feature!
> >
> > The proposal lays a solid foundation of minimal feature set with careful
> > considerations for future optimizations and extensions. Can't wait to see
> > it leading to more advanced functionalities like views with shared custom
> > functions, function pushdown, lambda, etc. It has already borne fruit
> from
> > the constructive collaborations in this thread. Looking forward to
> > Wenchen's prototype and further discussions including the SupportsInvoke
> > extension proposed by Ryan.
> >
> >
> > On Fri, Feb 12, 2021 at 4:35 PM Owen O'Malley 
>
> > owen.omalley@
>
> > 
> > wrote:
> >
> >> I think this proposal is a very good thing giving Spark a standard way
> of
> >> getting to and calling UDFs.
> >>
> >> I like having the ScalarFunction as the API to call the UDFs. It is
> >> simple, yet covers all of the polymorphic type cases well. I think it
> >> would
> >> also simplify using the functions in other contexts like pushing down
> >> filters into the ORC & Parquet readers although there are a lot of
> >> details
> >> that would need to be considered there.
> >>
> >> .. Owen
> >>
> >>
> >> On Fri, Feb 12, 2021 at 11:07 PM Erik Krogen 
>
> > ekrogen@.com
>
> > 
> >> wrote:
> >>
> >>> I agree that there is a strong need for a FunctionCatalog within Spark
> >>> to
> >>> provide support for shareable UDFs, as well as make movement towards
> >>> more
> >>> advanced functionality like views which themselves depend on UDFs, so I
> >>> support this SPIP wholeheartedly.
> >>>
> >>> I find both of the proposed UDF APIs to be sufficiently user-friendly
> >>> and
> >>> extensible. I generally think Wenchen's proposal is easier for a user
> to
> >>> work with in the common case, but has greater potential for confusing
> >>> and
> >>> hard-to-debug behavior due to use of reflective method signature
> >>> searches.
> >>> The merits on both sides can hopefully be more properly examined with
> >>> code,
> >>> so I look forward to seeing an implementation of Wenchen's ideas to
> >>> provide
> >>> a more concrete comparison. I am optimistic that we will not let the
> >>> debate
> >>> over this point unreasonably stall the SPIP from making progress.
> >>>
> >>> Thank you to both Wenchen and Ryan for your detailed consideration and
> >>> evaluation of these ideas!
> >>> --
> >>> *From:* Dongjoon Hyun 
>
> > dongjoon.hyun@
>
> > 
> >>> *Sent:* Wednesday, February 10, 2021 6:06 PM
> >>> *To:* Ryan Blue 
>
> > blue@
>
> > 
> >>> *Cc:* Holden Karau 
>
> > holden@
>
> > ; Hyukjin Kwon <
> >>>
>
> > gurwls223@
>
> >>; Spark Dev List 
>
> > dev@.apache
>
> > ; Wenchen Fan
> >>> 
>
> > cloud0fan@
>
> > 
> >>> *Subject:* Re: [DISCUSS] SPIP: FunctionCatalog
> >>>
> >>> BTW, I forgot to add my opinion explicitly in this thread because I was
> >>> on the PR before this thread.
> >>>
> >>> 1. The `FunctionCatalog API` PR was made on May 9, 2019 and has been
> >>> there for almost two years.
> >>> 2. I already gave my +1 on that PR last Saturday because I agreed with
> >>> the latest updated design docs and AS-IS PR.
> >>>
> >>> And, the rest of the progress in this thread is also very satisfying to
> >>> me.
> >>> (e.g. Ryan's extension suggestion and Wenchen's alternative)
> >>>
> >>> To All:
> >>> Please take a look at the design doc and the PR, and give us some
> >>> opinions.
> >>> We really need your participation in order to make DSv2 more complete.
> >>> This will unblock other DSv2 features, too.
> >>>
> >>> Bests,
> >>> Dongjoon.
> >>>
> >>>
> >>>
> >>> On Wed, Feb 10, 2021 at 10:58 AM Dongjoon Hyun 
>
> > dongjoon.hyun@
>
> > 
> >>> wrote:
> >>>
> >>> Hi, Ryan.
> >>>
> >>> We didn't move past anything (both yours and Wenchen's). What Wenchen
> >>> suggested is double-checking the alternatives with the implementation
> to
> >>> give more momentum to our discussion.
> >>>
> >>> Your new suggestion about an optional extension also sounds like a
> >>> reasonable new alternative to me.
> >>>
> >>> We are still discussing this topic together, and I hope we can come to a
> >>> conclusion this time (for Apache Spark 3.2) without getting stuck like
> >>> last time.
> >>>
> >>> I really appreciate your leadership in this discussion, and the direction
> >>> it is moving looks constructive to me. Let's give 

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-12 Thread Liang-Chi Hsieh
Basically, I think the proposal makes sense to me and I'd like to support the
SPIP, as it looks like we have a strong need for this important feature.

Thanks Ryan for working on this, and I also look forward to Wenchen's
implementation. Thanks for the discussion too.

Actually, I think the SupportsInvoke proposed by Ryan looks like a good
alternative to me. Besides Wenchen's alternative implementation, is there a
chance we could also have a SupportsInvoke implementation for comparison?


John Zhuge wrote
> Excited to see our Spark community rallying behind this important feature!
> 
> The proposal lays a solid foundation of minimal feature set with careful
> considerations for future optimizations and extensions. Can't wait to see
> it leading to more advanced functionalities like views with shared custom
> functions, function pushdown, lambda, etc. It has already borne fruit from
> the constructive collaborations in this thread. Looking forward to
> Wenchen's prototype and further discussions including the SupportsInvoke
> extension proposed by Ryan.
> 
> 
> On Fri, Feb 12, 2021 at 4:35 PM Owen O'Malley 

> owen.omalley@

> 
> wrote:
> 
>> I think this proposal is a very good thing giving Spark a standard way of
>> getting to and calling UDFs.
>>
>> I like having the ScalarFunction as the API to call the UDFs. It is
>> simple, yet covers all of the polymorphic type cases well. I think it
>> would
>> also simplify using the functions in other contexts like pushing down
>> filters into the ORC & Parquet readers although there are a lot of
>> details
>> that would need to be considered there.
>>
>> .. Owen
>>
>>
>> On Fri, Feb 12, 2021 at 11:07 PM Erik Krogen 

> ekrogen@.com

> 
>> wrote:
>>
>>> I agree that there is a strong need for a FunctionCatalog within Spark
>>> to
>>> provide support for shareable UDFs, as well as make movement towards
>>> more
>>> advanced functionality like views which themselves depend on UDFs, so I
>>> support this SPIP wholeheartedly.
>>>
>>> I find both of the proposed UDF APIs to be sufficiently user-friendly
>>> and
>>> extensible. I generally think Wenchen's proposal is easier for a user to
>>> work with in the common case, but has greater potential for confusing
>>> and
>>> hard-to-debug behavior due to use of reflective method signature
>>> searches.
>>> The merits on both sides can hopefully be more properly examined with
>>> code,
>>> so I look forward to seeing an implementation of Wenchen's ideas to
>>> provide
>>> a more concrete comparison. I am optimistic that we will not let the
>>> debate
>>> over this point unreasonably stall the SPIP from making progress.
>>>
>>> Thank you to both Wenchen and Ryan for your detailed consideration and
>>> evaluation of these ideas!
>>> --
>>> *From:* Dongjoon Hyun 

> dongjoon.hyun@

> 
>>> *Sent:* Wednesday, February 10, 2021 6:06 PM
>>> *To:* Ryan Blue 

> blue@

> 
>>> *Cc:* Holden Karau 

> holden@

> ; Hyukjin Kwon <
>>> 

> gurwls223@

>>; Spark Dev List 

> dev@.apache

> ; Wenchen Fan
>>> 

> cloud0fan@

> 
>>> *Subject:* Re: [DISCUSS] SPIP: FunctionCatalog
>>>
>>> BTW, I forgot to add my opinion explicitly in this thread because I was
>>> on the PR before this thread.
>>>
>>> 1. The `FunctionCatalog API` PR was made on May 9, 2019 and has been
>>> there for almost two years.
>>> 2. I already gave my +1 on that PR last Saturday because I agreed with
>>> the latest updated design docs and AS-IS PR.
>>>
>>> And, the rest of the progress in this thread is also very satisfying to
>>> me.
>>> (e.g. Ryan's extension suggestion and Wenchen's alternative)
>>>
>>> To All:
>>> Please take a look at the design doc and the PR, and give us some
>>> opinions.
>>> We really need your participation in order to make DSv2 more complete.
>>> This will unblock other DSv2 features, too.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>>
>>> On Wed, Feb 10, 2021 at 10:58 AM Dongjoon Hyun 

> dongjoon.hyun@

> 
>>> wrote:
>>>
>>> Hi, Ryan.
>>>
>>> We didn't move past anything (both yours and Wenchen's). What Wenchen
>>> suggested is double-checking the alternatives with the implementation to
>>> give more momentum to our discussion.
>>>
>>> Your new suggestion about an optional extension also sounds like a
>>> reasonable new alternative to me.
>>>
>>> We are still discussing this topic together, and I hope we can come to a
>>> conclusion this time (for Apache Spark 3.2) without getting stuck like
>>> last time.
>>>
>>> I really appreciate your leadership in this discussion, and the direction
>>> it is moving looks constructive to me. Let's give some time
>>> to the alternatives.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>> On Wed, Feb 10, 2021 at 10:14 AM Ryan Blue 

> blue@

>  wrote:
>>>
>>> I don’t think we should so quickly move past the drawbacks of this
>>> approach. The problems are significant enough that using invoke is not
>>> sufficient on its own. But, I think we can add it as an optional
>>> extension
>>> to shore up the weaknesses.
>>>

Re: [VOTE] Release Spark 3.1.1 (RC2)

2021-02-12 Thread Hyukjin Kwon
I have received a ping about a new blocker: a regression with a temporary
function in a CTE, which worked before but is now broken (
https://github.com/apache/spark/pull/31550). Thank you @Peter Toth

I tend to treat this as a legitimate blocker. I will cut another RC right
after this fix if we're all good with it.

On Thu, Feb 11, 2021 at 9:20 AM, Takeshi Yamamuro wrote:

> +1
>
> I looked around the jira tickets and I think there is no explicit blocker
> issue on the Spark SQL component.
> Also, I ran the tests on AWS envs and I couldn't find any issues there either.
>
> Bests,
> Takeshi
>
> On Thu, Feb 11, 2021 at 7:37 AM Mridul Muralidharan 
> wrote:
>
>>
>> Signatures, digests, etc check out fine.
>> Checked out tag and build/tested with -Pyarn -Phadoop-2.7 -Phive
>> -Phive-thriftserver -Pmesos -Pkubernetes
>>
>> I keep getting test failures
>> with org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite; removing this
>> suite gets the build through, though. Does anyone have suggestions on how
>> to fix it?
>> Perhaps it's a local problem on my end?
>>
>>
>> Regards,
>> Mridul
>>
>>
>>
>> On Mon, Feb 8, 2021 at 6:24 PM Hyukjin Kwon  wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark version
>>> 3.1.1.
>>>
>>> The vote is open until February 15th 5PM PST and passes if a majority +1
>>> PMC votes are cast, with a minimum of 3 +1 votes.
>>>
>>> Note that the vote is open for 7 days this time because it is a holiday
>>> season in several countries, including South Korea (where I live), China,
>>> and others, and I would like to make sure people do not miss it.
>>>
>>> [ ] +1 Release this package as Apache Spark 3.1.1
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>
>>> The tag to be voted on is v3.1.1-rc2 (commit
>>> cf0115ac2d60070399af481b14566f33d22ec45e):
>>> https://github.com/apache/spark/tree/v3.1.1-rc2
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> 
>>> https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc2-bin/
>>>
>>> Signatures used for Spark RCs can be found in this file:
>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1365
>>>
>>> The documentation corresponding to this release can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc2-docs/
>>>
>>> The list of bug fixes going into 3.1.1 can be found at the following URL:
>>> https://s.apache.org/41kf2
>>>
>>> This release is using the release script of the tag v3.1.1-rc2.
>>>
>>> FAQ
>>>
>>> ===
>>> What happened to 3.1.0?
>>> ===
>>>
>>> There was a technical issue during Apache Spark 3.1.0 preparation, and
>>> it was discussed and decided to skip 3.1.0.
>>> Please see
>>> https://spark.apache.org/news/next-official-release-spark-3.1.1.html for
>>> more details.
>>>
>>> =
>>> How can I help test this release?
>>> =
>>>
>>> If you are a Spark user, you can help us test this release by taking
>>> an existing Spark workload and running on this release candidate, then
>>> reporting any regressions.
>>>
>>> If you're working in PySpark you can set up a virtual env and install
>>> the current RC via "pip install
>>> https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc2-bin/pyspark-3.1.1.tar.gz
>>> "
>>> and see if anything important breaks.
>>> In Java/Scala, you can add the staging repository to your project's
>>> resolvers and test
>>> with the RC (make sure to clean up the artifact cache before/after so
>>> you don't end up building with an out-of-date RC going forward).
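
For the Java/Scala route, a minimal sketch of what pointing an sbt project at the
staging repository could look like (the resolver URL is the orgapachespark-1365
repository listed above; the module list is only an example, so swap in whatever
your project actually uses):

```
// build.sbt sketch for testing against the 3.1.1 RC from the staging repository.
resolvers += "Apache Spark 3.1.1 RC2 staging" at
  "https://repository.apache.org/content/repositories/orgapachespark-1365/"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "3.1.1" % "provided",
  "org.apache.spark" %% "spark-sql"  % "3.1.1" % "provided"
)
```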
>>>
>>> ===
>>> What should happen to JIRA tickets still targeting 3.1.1?
>>> ===
>>>
>>> The current list of open tickets targeted at 3.1.1 can be found at:
>>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>>> Version/s" = 3.1.1
>>>
>>> Committers should look at those and triage. Extremely important bug
>>> fixes, documentation, and API tweaks that impact compatibility should
>>> be worked on immediately. Everything else please retarget to an
>>> appropriate release.
>>>
>>> ==
>>> But my bug isn't fixed?
>>> ==
>>>
>>> In order to make timely releases, we will typically not hold the
>>> release unless the bug in question is a regression from the previous
>>> release. That being said, if there is something which is a regression
>>> that has not been correctly targeted please ping me or a committer to
>>> help target the issue.
>>>
>>>
>>>
>
> --
> ---
> Takeshi Yamamuro
>


Re: Apache Spark 3.0.2 Release ?

2021-02-12 Thread Yuming Wang
+1.

On Sat, Feb 13, 2021 at 10:38 AM Takeshi Yamamuro 
wrote:

> +1, too. Thanks, Dongjoon!
>
> On 2021/02/13 at 11:07, Xiao Li wrote:
>
> 
> +1
>
> Happy Lunar New Year!
>
> Xiao
>
> On Fri, Feb 12, 2021 at 5:33 PM Hyukjin Kwon  wrote:
>
>> Yeah, +1 too
>>
>> On Sat, Feb 13, 2021 at 4:49 AM, Dongjoon Hyun wrote:
>>
>>> Thank you, Sean!
>>>
>>> On Fri, Feb 12, 2021 at 11:41 AM Sean Owen  wrote:
>>>
 Sounds like a fine time to me, sure.

 On Fri, Feb 12, 2021 at 1:39 PM Dongjoon Hyun 
 wrote:

> Hi, All.
>
> As of today, `branch-3.0` has 307 patches (including 25 correctness
> patches) since v3.0.1 tag (released on September 8th, 2020).
>
> Since we stabilized branch-3.0 during 3.1.x preparation so far,
> it would be great if we start to release Apache Spark 3.0.2 next week.
> And, I'd like to volunteer for Apache Spark 3.0.2 release manager.
>
> What do you think about the Apache Spark 3.0.2 release?
>
> Bests,
> Dongjoon.
>
>
> --
> SPARK-31511 Make BytesToBytesMap iterator() thread-safe
> SPARK-32635 When pyspark.sql.functions.lit() function is used with
> dataframe cache, it returns wrong result
> SPARK-32753 Deduplicating and repartitioning the same column create
> duplicate rows with AQE
> SPARK-32764 compare of -0.0 < 0.0 return true
> SPARK-32840 Invalid interval value can happen to be just adhesive with
> the unit
> SPARK-32908 percentile_approx() returns incorrect results
> SPARK-33019 Use
> spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 by default
> SPARK-33183 Bug in optimizer rule EliminateSorts
> SPARK-33260 SortExec produces incorrect results if sortOrder is a
> Stream
> SPARK-33290 REFRESH TABLE should invalidate cache even though the
> table itself may not be cached
> SPARK-33358 Spark SQL CLI command processing loop can't exit while one
> comand fail
> SPARK-33404 "date_trunc" expression returns incorrect results
> SPARK-33435 DSv2: REFRESH TABLE should invalidate caches
> SPARK-33591 NULL is recognized as the "null" string in partition specs
> SPARK-33593 Vector reader got incorrect data with binary partition
> value
> SPARK-33726 Duplicate field names causes wrong answers during
> aggregation
> SPARK-33950 ALTER TABLE .. DROP PARTITION doesn't refresh cache
> SPARK-34011 ALTER TABLE .. RENAME TO PARTITION doesn't refresh cache
> SPARK-34027 ALTER TABLE .. RECOVER PARTITIONS doesn't refresh cache
> SPARK-34055 ALTER TABLE .. ADD PARTITION doesn't refresh cache
> SPARK-34187 Use available offset range obtained during polling when
> checking offset validation
> SPARK-34212 For parquet table, after changing the precision and scale
> of decimal type in hive, spark reads incorrect value
> SPARK-34213 LOAD DATA doesn't refresh v1 table cache
> SPARK-34229 Avro should read decimal values with the file schema
> SPARK-34262 ALTER TABLE .. SET LOCATION doesn't refresh v1 table cache
>

>
> --
>
>


Re: Apache Spark 3.0.2 Release ?

2021-02-12 Thread Takeshi Yamamuro
+1, too. Thanks, Dongjoon!

> On 2021/02/13 at 11:07, Xiao Li wrote:
> 
> 
> +1 
> 
> Happy Lunar New Year!
> 
> Xiao
> 
>> On Fri, Feb 12, 2021 at 5:33 PM Hyukjin Kwon  wrote:
>> Yeah, +1 too
>> 
>> On Sat, Feb 13, 2021 at 4:49 AM, Dongjoon Hyun wrote:
>>> Thank you, Sean!
>>> 
 On Fri, Feb 12, 2021 at 11:41 AM Sean Owen  wrote:
 Sounds like a fine time to me, sure.
 
> On Fri, Feb 12, 2021 at 1:39 PM Dongjoon Hyun  
> wrote:
> Hi, All.
> 
> As of today, `branch-3.0` has 307 patches (including 25 correctness 
> patches) since v3.0.1 tag (released on September 8th, 2020).
> 
> Since we stabilized branch-3.0 during 3.1.x preparation so far,
> it would be great if we start to release Apache Spark 3.0.2 next week.
> And, I'd like to volunteer for Apache Spark 3.0.2 release manager.
> 
> What do you think about the Apache Spark 3.0.2 release?
> 
> Bests,
> Dongjoon.
> 
> 
> --
> SPARK-31511 Make BytesToBytesMap iterator() thread-safe
> SPARK-32635 When pyspark.sql.functions.lit() function is used with 
> dataframe cache, it returns wrong result
> SPARK-32753 Deduplicating and repartitioning the same column create 
> duplicate rows with AQE
> SPARK-32764 compare of -0.0 < 0.0 return true
> SPARK-32840 Invalid interval value can happen to be just adhesive with 
> the unit
> SPARK-32908 percentile_approx() returns incorrect results
> SPARK-33019 Use 
> spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 by default
> SPARK-33183 Bug in optimizer rule EliminateSorts
> SPARK-33260 SortExec produces incorrect results if sortOrder is a Stream
> SPARK-33290 REFRESH TABLE should invalidate cache even though the table 
> itself may not be cached
> SPARK-33358 Spark SQL CLI command processing loop can't exit while one 
> comand fail
> SPARK-33404 "date_trunc" expression returns incorrect results
> SPARK-33435 DSv2: REFRESH TABLE should invalidate caches
> SPARK-33591 NULL is recognized as the "null" string in partition specs
> SPARK-33593 Vector reader got incorrect data with binary partition value
> SPARK-33726 Duplicate field names causes wrong answers during aggregation
> SPARK-33950 ALTER TABLE .. DROP PARTITION doesn't refresh cache
> SPARK-34011 ALTER TABLE .. RENAME TO PARTITION doesn't refresh cache
> SPARK-34027 ALTER TABLE .. RECOVER PARTITIONS doesn't refresh cache
> SPARK-34055 ALTER TABLE .. ADD PARTITION doesn't refresh cache
> SPARK-34187 Use available offset range obtained during polling when 
> checking offset validation
> SPARK-34212 For parquet table, after changing the precision and scale of 
> decimal type in hive, spark reads incorrect value
> SPARK-34213 LOAD DATA doesn't refresh v1 table cache
> SPARK-34229 Avro should read decimal values with the file schema
> SPARK-34262 ALTER TABLE .. SET LOCATION doesn't refresh v1 table cache
> 
> 
> -- 
> 


Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-12 Thread John Zhuge
Excited to see our Spark community rallying behind this important feature!

The proposal lays a solid foundation of minimal feature set with careful
considerations for future optimizations and extensions. Can't wait to see
it leading to more advanced functionalities like views with shared custom
functions, function pushdown, lambda, etc. It has already borne fruit from
the constructive collaborations in this thread. Looking forward to
Wenchen's prototype and further discussions including the SupportsInvoke
extension proposed by Ryan.


On Fri, Feb 12, 2021 at 4:35 PM Owen O'Malley 
wrote:

> I think this proposal is a very good thing giving Spark a standard way of
> getting to and calling UDFs.
>
> I like having the ScalarFunction as the API to call the UDFs. It is
> simple, yet covers all of the polymorphic type cases well. I think it would
> also simplify using the functions in other contexts like pushing down
> filters into the ORC & Parquet readers although there are a lot of details
> that would need to be considered there.
>
> .. Owen
>
>
> On Fri, Feb 12, 2021 at 11:07 PM Erik Krogen 
> wrote:
>
>> I agree that there is a strong need for a FunctionCatalog within Spark to
>> provide support for shareable UDFs, as well as make movement towards more
>> advanced functionality like views which themselves depend on UDFs, so I
>> support this SPIP wholeheartedly.
>>
>> I find both of the proposed UDF APIs to be sufficiently user-friendly and
>> extensible. I generally think Wenchen's proposal is easier for a user to
>> work with in the common case, but has greater potential for confusing and
>> hard-to-debug behavior due to use of reflective method signature searches.
>> The merits on both sides can hopefully be more properly examined with code,
>> so I look forward to seeing an implementation of Wenchen's ideas to provide
>> a more concrete comparison. I am optimistic that we will not let the debate
>> over this point unreasonably stall the SPIP from making progress.
>>
>> Thank you to both Wenchen and Ryan for your detailed consideration and
>> evaluation of these ideas!
>> --
>> *From:* Dongjoon Hyun 
>> *Sent:* Wednesday, February 10, 2021 6:06 PM
>> *To:* Ryan Blue 
>> *Cc:* Holden Karau ; Hyukjin Kwon <
>> gurwls...@gmail.com>; Spark Dev List ; Wenchen Fan
>> 
>> *Subject:* Re: [DISCUSS] SPIP: FunctionCatalog
>>
>> BTW, I forgot to add my opinion explicitly in this thread because I was
>> on the PR before this thread.
>>
>> 1. The `FunctionCatalog API` PR was made on May 9, 2019 and has been
>> there for almost two years.
>> 2. I already gave my +1 on that PR last Saturday because I agreed with
>> the latest updated design docs and AS-IS PR.
>>
>> And, the rest of the progress in this thread is also very satisfying to
>> me.
>> (e.g. Ryan's extension suggestion and Wenchen's alternative)
>>
>> To All:
>> Please take a look at the design doc and the PR, and give us some
>> opinions.
>> We really need your participation in order to make DSv2 more complete.
>> This will unblock other DSv2 features, too.
>>
>> Bests,
>> Dongjoon.
>>
>>
>>
>> On Wed, Feb 10, 2021 at 10:58 AM Dongjoon Hyun 
>> wrote:
>>
>> Hi, Ryan.
>>
>> We didn't move past anything (both yours and Wenchen's). What Wenchen
>> suggested is double-checking the alternatives with the implementation to
>> give more momentum to our discussion.
>>
>> Your new suggestion about an optional extension also sounds like a
>> reasonable new alternative to me.
>>
>> We are still discussing this topic together, and I hope we can come to a
>> conclusion this time (for Apache Spark 3.2) without getting stuck like
>> last time.
>>
>> I really appreciate your leadership in this discussion, and the direction
>> it is moving looks constructive to me. Let's give some time
>> to the alternatives.
>>
>> Bests,
>> Dongjoon.
>>
>> On Wed, Feb 10, 2021 at 10:14 AM Ryan Blue  wrote:
>>
>> I don’t think we should so quickly move past the drawbacks of this
>> approach. The problems are significant enough that using invoke is not
>> sufficient on its own. But, I think we can add it as an optional extension
>> to shore up the weaknesses.
>>
>> Here’s a summary of the drawbacks:
>>
>>- Magic function signatures are error-prone
>>- Spark would need considerable code to help users find what went
>>wrong
>>- Spark would likely need to coerce arguments (e.g., String,
>>Option[Int]) for usability
>>- It is unclear how Spark will find the Java Method to call
>>- Use cases that require varargs fall back to casting; users will
>>also get this wrong (cast to String instead of UTF8String)
>>- The non-codegen path is significantly slower
>>
>> The benefit of invoke is to avoid moving data into a row, like this:
>>
>> // using invoke
>> int result = udfFunction(x, y);
>>
>> // using row
>> udfRow.update(0, x); // actual: values[0] = x;
>> udfRow.update(1, y);
>> int result = udfFunction(udfRow);
>>
>> And, again, 

Re: Apache Spark 3.0.2 Release ?

2021-02-12 Thread Xiao Li
+1

Happy Lunar New Year!

Xiao

On Fri, Feb 12, 2021 at 5:33 PM Hyukjin Kwon  wrote:

> Yeah, +1 too
>
> On Sat, Feb 13, 2021 at 4:49 AM, Dongjoon Hyun wrote:
>
>> Thank you, Sean!
>>
>> On Fri, Feb 12, 2021 at 11:41 AM Sean Owen  wrote:
>>
>>> Sounds like a fine time to me, sure.
>>>
>>> On Fri, Feb 12, 2021 at 1:39 PM Dongjoon Hyun 
>>> wrote:
>>>
 Hi, All.

 As of today, `branch-3.0` has 307 patches (including 25 correctness
 patches) since v3.0.1 tag (released on September 8th, 2020).

 Since we stabilized branch-3.0 during 3.1.x preparation so far,
 it would be great if we start to release Apache Spark 3.0.2 next week.
 And, I'd like to volunteer for Apache Spark 3.0.2 release manager.

 What do you think about the Apache Spark 3.0.2 release?

 Bests,
 Dongjoon.


 --
 SPARK-31511 Make BytesToBytesMap iterator() thread-safe
 SPARK-32635 When pyspark.sql.functions.lit() function is used with
 dataframe cache, it returns wrong result
 SPARK-32753 Deduplicating and repartitioning the same column create
 duplicate rows with AQE
 SPARK-32764 compare of -0.0 < 0.0 return true
 SPARK-32840 Invalid interval value can happen to be just adhesive with
 the unit
 SPARK-32908 percentile_approx() returns incorrect results
 SPARK-33019 Use
 spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 by default
 SPARK-33183 Bug in optimizer rule EliminateSorts
 SPARK-33260 SortExec produces incorrect results if sortOrder is a Stream
 SPARK-33290 REFRESH TABLE should invalidate cache even though the table
 itself may not be cached
 SPARK-33358 Spark SQL CLI command processing loop can't exit while one
 comand fail
 SPARK-33404 "date_trunc" expression returns incorrect results
 SPARK-33435 DSv2: REFRESH TABLE should invalidate caches
 SPARK-33591 NULL is recognized as the "null" string in partition specs
 SPARK-33593 Vector reader got incorrect data with binary partition value
 SPARK-33726 Duplicate field names causes wrong answers during
 aggregation
 SPARK-33950 ALTER TABLE .. DROP PARTITION doesn't refresh cache
 SPARK-34011 ALTER TABLE .. RENAME TO PARTITION doesn't refresh cache
 SPARK-34027 ALTER TABLE .. RECOVER PARTITIONS doesn't refresh cache
 SPARK-34055 ALTER TABLE .. ADD PARTITION doesn't refresh cache
 SPARK-34187 Use available offset range obtained during polling when
 checking offset validation
 SPARK-34212 For parquet table, after changing the precision and scale
 of decimal type in hive, spark reads incorrect value
 SPARK-34213 LOAD DATA doesn't refresh v1 table cache
 SPARK-34229 Avro should read decimal values with the file schema
 SPARK-34262 ALTER TABLE .. SET LOCATION doesn't refresh v1 table cache

>>>

--


Re: Apache Spark 3.0.2 Release ?

2021-02-12 Thread Hyukjin Kwon
Yeah, +1 too

On Sat, Feb 13, 2021 at 4:49 AM, Dongjoon Hyun wrote:

> Thank you, Sean!
>
> On Fri, Feb 12, 2021 at 11:41 AM Sean Owen  wrote:
>
>> Sounds like a fine time to me, sure.
>>
>> On Fri, Feb 12, 2021 at 1:39 PM Dongjoon Hyun 
>> wrote:
>>
>>> Hi, All.
>>>
>>> As of today, `branch-3.0` has 307 patches (including 25 correctness
>>> patches) since v3.0.1 tag (released on September 8th, 2020).
>>>
>>> Since we stabilized branch-3.0 during 3.1.x preparation so far,
>>> it would be great if we start to release Apache Spark 3.0.2 next week.
>>> And, I'd like to volunteer for Apache Spark 3.0.2 release manager.
>>>
>>> What do you think about the Apache Spark 3.0.2 release?
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>> --
>>> SPARK-31511 Make BytesToBytesMap iterator() thread-safe
>>> SPARK-32635 When pyspark.sql.functions.lit() function is used with
>>> dataframe cache, it returns wrong result
>>> SPARK-32753 Deduplicating and repartitioning the same column create
>>> duplicate rows with AQE
>>> SPARK-32764 compare of -0.0 < 0.0 return true
>>> SPARK-32840 Invalid interval value can happen to be just adhesive with
>>> the unit
>>> SPARK-32908 percentile_approx() returns incorrect results
>>> SPARK-33019 Use
>>> spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 by default
>>> SPARK-33183 Bug in optimizer rule EliminateSorts
>>> SPARK-33260 SortExec produces incorrect results if sortOrder is a Stream
>>> SPARK-33290 REFRESH TABLE should invalidate cache even though the table
>>> itself may not be cached
>>> SPARK-33358 Spark SQL CLI command processing loop can't exit while one
>>> comand fail
>>> SPARK-33404 "date_trunc" expression returns incorrect results
>>> SPARK-33435 DSv2: REFRESH TABLE should invalidate caches
>>> SPARK-33591 NULL is recognized as the "null" string in partition specs
>>> SPARK-33593 Vector reader got incorrect data with binary partition value
>>> SPARK-33726 Duplicate field names causes wrong answers during aggregation
>>> SPARK-33950 ALTER TABLE .. DROP PARTITION doesn't refresh cache
>>> SPARK-34011 ALTER TABLE .. RENAME TO PARTITION doesn't refresh cache
>>> SPARK-34027 ALTER TABLE .. RECOVER PARTITIONS doesn't refresh cache
>>> SPARK-34055 ALTER TABLE .. ADD PARTITION doesn't refresh cache
>>> SPARK-34187 Use available offset range obtained during polling when
>>> checking offset validation
>>> SPARK-34212 For parquet table, after changing the precision and scale of
>>> decimal type in hive, spark reads incorrect value
>>> SPARK-34213 LOAD DATA doesn't refresh v1 table cache
>>> SPARK-34229 Avro should read decimal values with the file schema
>>> SPARK-34262 ALTER TABLE .. SET LOCATION doesn't refresh v1 table cache
>>>
>>


Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-12 Thread Owen O'Malley
I think this proposal is a very good thing giving Spark a standard way of
getting to and calling UDFs.

I like having the ScalarFunction as the API to call the UDFs. It is simple,
yet covers all of the polymorphic type cases well. I think it would also
simplify using the functions in other contexts like pushing down filters
into the ORC & Parquet readers although there are a lot of details that
would need to be considered there.

.. Owen
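
To make the shape of the API concrete, here is a minimal sketch of a scalar UDF
written against the proposal as described in the design doc. The names used here
(ScalarFunction, produceResult, inputTypes, resultType) follow the SPIP draft and
may change before anything is committed; IntAdd is an invented example.

```
// Sketch only: a two-argument integer addition UDF under the proposed ScalarFunction
// API. Interface and method names follow the SPIP draft, not a released Spark API.
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.catalog.functions.ScalarFunction
import org.apache.spark.sql.types.{DataType, IntegerType}

class IntAdd extends ScalarFunction[Integer] {
  override def name(): String = "int_add"
  override def inputTypes(): Array[DataType] = Array(IntegerType, IntegerType)
  override def resultType(): DataType = IntegerType

  // Row-based entry point: Spark packs the two arguments into an InternalRow.
  override def produceResult(input: InternalRow): Integer =
    input.getInt(0) + input.getInt(1)
}
```

Under the proposal, a FunctionCatalog implementation would return a function like
this through its load and bind path, so the same UDF can be shared by SQL and
DataFrame callers.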


On Fri, Feb 12, 2021 at 11:07 PM Erik Krogen 
wrote:

> I agree that there is a strong need for a FunctionCatalog within Spark to
> provide support for shareable UDFs, as well as make movement towards more
> advanced functionality like views which themselves depend on UDFs, so I
> support this SPIP wholeheartedly.
>
> I find both of the proposed UDF APIs to be sufficiently user-friendly and
> extensible. I generally think Wenchen's proposal is easier for a user to
> work with in the common case, but has greater potential for confusing and
> hard-to-debug behavior due to use of reflective method signature searches.
> The merits on both sides can hopefully be more properly examined with code,
> so I look forward to seeing an implementation of Wenchen's ideas to provide
> a more concrete comparison. I am optimistic that we will not let the debate
> over this point unreasonably stall the SPIP from making progress.
>
> Thank you to both Wenchen and Ryan for your detailed consideration and
> evaluation of these ideas!
> --
> *From:* Dongjoon Hyun 
> *Sent:* Wednesday, February 10, 2021 6:06 PM
> *To:* Ryan Blue 
> *Cc:* Holden Karau ; Hyukjin Kwon <
> gurwls...@gmail.com>; Spark Dev List ; Wenchen Fan <
> cloud0...@gmail.com>
> *Subject:* Re: [DISCUSS] SPIP: FunctionCatalog
>
> BTW, I forgot to add my opinion explicitly in this thread because I was on
> the PR before this thread.
>
> 1. The `FunctionCatalog API` PR was made on May 9, 2019 and has been there
> for almost two years.
> 2. I already gave my +1 on that PR last Saturday because I agreed with the
> latest updated design docs and AS-IS PR.
>
> And, the rest of the progress in this thread is also very satisfying to me.
> (e.g. Ryan's extension suggestion and Wenchen's alternative)
>
> To All:
> Please take a look at the design doc and the PR, and give us some opinions.
> We really need your participation in order to make DSv2 more complete.
> This will unblock other DSv2 features, too.
>
> Bests,
> Dongjoon.
>
>
>
> On Wed, Feb 10, 2021 at 10:58 AM Dongjoon Hyun 
> wrote:
>
> Hi, Ryan.
>
> We didn't move past anything (both yours and Wenchen's). What Wenchen
> suggested is double-checking the alternatives with the implementation to
> give more momentum to our discussion.
>
> Your new suggestion about an optional extension also sounds like a
> reasonable new alternative to me.
>
> We are still discussing this topic together, and I hope we can come to a
> conclusion this time (for Apache Spark 3.2) without getting stuck like
> last time.
>
> I really appreciate your leadership in this discussion, and the direction
> it is moving looks constructive to me. Let's give some time
> to the alternatives.
>
> Bests,
> Dongjoon.
>
> On Wed, Feb 10, 2021 at 10:14 AM Ryan Blue  wrote:
>
> I don’t think we should so quickly move past the drawbacks of this
> approach. The problems are significant enough that using invoke is not
> sufficient on its own. But, I think we can add it as an optional extension
> to shore up the weaknesses.
>
> Here’s a summary of the drawbacks:
>
>- Magic function signatures are error-prone
>- Spark would need considerable code to help users find what went wrong
>- Spark would likely need to coerce arguments (e.g., String,
>Option[Int]) for usability
>- It is unclear how Spark will find the Java Method to call
>- Use cases that require varargs fall back to casting; users will also
>get this wrong (cast to String instead of UTF8String)
>- The non-codegen path is significantly slower
>
> The benefit of invoke is to avoid moving data into a row, like this:
>
> // using invoke
> int result = udfFunction(x, y);
>
> // using row
> udfRow.update(0, x); // actual: values[0] = x;
> udfRow.update(1, y);
> int result = udfFunction(udfRow);
>
> And, again, that won’t actually help much in cases that require varargs.
>
> I suggest we add a new marker trait for BoundMethod called SupportsInvoke.
> If that interface is implemented, then Spark will look for a method that
> matches the expected signature based on the bound input type. If it isn’t
> found, Spark can print a warning and fall back to the InternalRow call:
> “Cannot find udfFunction(int, int)”.
>
> This approach allows the invoke optimization, but solves many of the
> problems:
>
>- The method to invoke is found using the proposed load and bind
>approach
>- Magic function signatures are optional and do not cause runtime
>failures
>- Because this is an optional optimization, 

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-12 Thread Erik Krogen
I agree that there is a strong need for a FunctionCatalog within Spark to 
provide support for shareable UDFs, as well as make movement towards more 
advanced functionality like views which themselves depend on UDFs, so I support 
this SPIP wholeheartedly.

I find both of the proposed UDF APIs to be sufficiently user-friendly and 
extensible. I generally think Wenchen's proposal is easier for a user to work 
with in the common case, but has greater potential for confusing and 
hard-to-debug behavior due to use of reflective method signature searches. The 
merits on both sides can hopefully be more properly examined with code, so I 
look forward to seeing an implementation of Wenchen's ideas to provide a more 
concrete comparison. I am optimistic that we will not let the debate over this 
point unreasonably stall the SPIP from making progress.

Thank you to both Wenchen and Ryan for your detailed consideration and 
evaluation of these ideas!

From: Dongjoon Hyun 
Sent: Wednesday, February 10, 2021 6:06 PM
To: Ryan Blue 
Cc: Holden Karau ; Hyukjin Kwon ; 
Spark Dev List ; Wenchen Fan 
Subject: Re: [DISCUSS] SPIP: FunctionCatalog

BTW, I forgot to add my opinion explicitly in this thread because I was on the 
PR before this thread.

1. The `FunctionCatalog API` PR was made on May 9, 2019 and has been there for 
almost two years.
2. I already gave my +1 on that PR last Saturday because I agreed with the 
latest updated design docs and AS-IS PR.

And, the rest of the progress in this thread is also very satisfying to me.
(e.g. Ryan's extension suggestion and Wenchen's alternative)

To All:
Please take a look at the design doc and the PR, and give us some opinions.
We really need your participation in order to make DSv2 more complete.
This will unblock other DSv2 features, too.

Bests,
Dongjoon.



On Wed, Feb 10, 2021 at 10:58 AM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
Hi, Ryan.

We didn't move past anything (both yours and Wenchen's). What Wenchen suggested 
is double-checking the alternatives with the implementation to give more 
momentum to our discussion.

Your new suggestion about an optional extension also sounds like a reasonable new
alternative to me.

We are still discussing this topic together, and I hope we can come to a conclusion
this time (for Apache Spark 3.2) without getting stuck like last time.

I really appreciate your leadership in this discussion, and the direction it is
moving looks constructive to me. Let's give some time to the alternatives.

Bests,
Dongjoon.

On Wed, Feb 10, 2021 at 10:14 AM Ryan Blue <b...@apache.org> wrote:

I don’t think we should so quickly move past the drawbacks of this approach. 
The problems are significant enough that using invoke is not sufficient on its 
own. But, I think we can add it as an optional extension to shore up the 
weaknesses.

Here’s a summary of the drawbacks:

  *   Magic function signatures are error-prone
  *   Spark would need considerable code to help users find what went wrong
  *   Spark would likely need to coerce arguments (e.g., String, Option[Int]) 
for usability
  *   It is unclear how Spark will find the Java Method to call
  *   Use cases that require varargs fall back to casting; users will also get 
this wrong (cast to String instead of UTF8String)
  *   The non-codegen path is significantly slower

The benefit of invoke is to avoid moving data into a row, like this:

// using invoke
int result = udfFunction(x, y);

// using row
udfRow.update(0, x); // actual: values[0] = x;
udfRow.update(1, y);
int result = udfFunction(udfRow);


And, again, that won’t actually help much in cases that require varargs.

I suggest we add a new marker trait for BoundMethod called SupportsInvoke. If 
that interface is implemented, then Spark will look for a method that matches 
the expected signature based on the bound input type. If it isn’t found, Spark 
can print a warning and fall back to the InternalRow call: “Cannot find 
udfFunction(int, int)”.

This approach allows the invoke optimization, but solves many of the problems:

  *   The method to invoke is found using the proposed load and bind approach
  *   Magic function signatures are optional and do not cause runtime failures
  *   Because this is an optional optimization, Spark can be more strict about 
types
  *   Varargs cases can still use rows
  *   Non-codegen can use an evaluation method rather than falling back to slow 
Java reflection

This seems like a good extension to me; this provides a plan for optimizing the 
UDF call to avoid building a row, while the existing proposal covers the other 
cases well and addresses how to locate these function calls.

This also highlights that the approach used in DSv2 and this proposal is 
working: start small and use extensions to layer on more complex support.
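
To make the extension concrete, here is a rough sketch of the same kind of UDF with
the optional fast path Ryan describes: the magic invoke method is matched against the
bound input types, and produceResult stays as the row-based fallback. SupportsInvoke
is only the marker proposed in this thread (it does not exist yet), so it is declared
locally here purely for illustration, and the other names again follow the SPIP draft.

```
// Sketch only, not a released API: ScalarFunction/produceResult follow the SPIP
// draft; SupportsInvoke is the marker trait proposed above, declared here just
// so the example is self-contained.
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.catalog.functions.ScalarFunction
import org.apache.spark.sql.types.{DataType, IntegerType}

trait SupportsInvoke  // marker: "look for a magic method matching the bound types"

class IntAdd extends ScalarFunction[Integer] with SupportsInvoke {
  override def name(): String = "int_add"
  override def inputTypes(): Array[DataType] = Array(IntegerType, IntegerType)
  override def resultType(): DataType = IntegerType

  // Fast path: with (int, int) bound inputs, Spark would locate this method and
  // call it directly (or via codegen), avoiding the intermediate InternalRow.
  def invoke(x: Int, y: Int): Int = x + y

  // Fallback path: used when no matching invoke method is found (after the proposed
  // "Cannot find ..." warning) or for cases such as varargs that still need a row.
  override def produceResult(input: InternalRow): Integer =
    input.getInt(0) + input.getInt(1)
}
```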

On Wed, Feb 10, 2021 at 9:04 AM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:

Thank you all for making a 

Re: Apache Spark 3.0.2 Release ?

2021-02-12 Thread Dongjoon Hyun
Thank you, Sean!

On Fri, Feb 12, 2021 at 11:41 AM Sean Owen  wrote:

> Sounds like a fine time to me, sure.
>
> On Fri, Feb 12, 2021 at 1:39 PM Dongjoon Hyun 
> wrote:
>
>> Hi, All.
>>
>> As of today, `branch-3.0` has 307 patches (including 25 correctness
>> patches) since v3.0.1 tag (released on September 8th, 2020).
>>
>> Since we stabilized branch-3.0 during 3.1.x preparation so far,
>> it would be great if we start to release Apache Spark 3.0.2 next week.
>> And, I'd like to volunteer for Apache Spark 3.0.2 release manager.
>>
>> What do you think about the Apache Spark 3.0.2 release?
>>
>> Bests,
>> Dongjoon.
>>
>>
>> --
>> SPARK-31511 Make BytesToBytesMap iterator() thread-safe
>> SPARK-32635 When pyspark.sql.functions.lit() function is used with
>> dataframe cache, it returns wrong result
>> SPARK-32753 Deduplicating and repartitioning the same column create
>> duplicate rows with AQE
>> SPARK-32764 compare of -0.0 < 0.0 return true
>> SPARK-32840 Invalid interval value can happen to be just adhesive with
>> the unit
>> SPARK-32908 percentile_approx() returns incorrect results
>> SPARK-33019 Use
>> spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 by default
>> SPARK-33183 Bug in optimizer rule EliminateSorts
>> SPARK-33260 SortExec produces incorrect results if sortOrder is a Stream
>> SPARK-33290 REFRESH TABLE should invalidate cache even though the table
>> itself may not be cached
>> SPARK-33358 Spark SQL CLI command processing loop can't exit while one
>> comand fail
>> SPARK-33404 "date_trunc" expression returns incorrect results
>> SPARK-33435 DSv2: REFRESH TABLE should invalidate caches
>> SPARK-33591 NULL is recognized as the "null" string in partition specs
>> SPARK-33593 Vector reader got incorrect data with binary partition value
>> SPARK-33726 Duplicate field names causes wrong answers during aggregation
>> SPARK-33950 ALTER TABLE .. DROP PARTITION doesn't refresh cache
>> SPARK-34011 ALTER TABLE .. RENAME TO PARTITION doesn't refresh cache
>> SPARK-34027 ALTER TABLE .. RECOVER PARTITIONS doesn't refresh cache
>> SPARK-34055 ALTER TABLE .. ADD PARTITION doesn't refresh cache
>> SPARK-34187 Use available offset range obtained during polling when
>> checking offset validation
>> SPARK-34212 For parquet table, after changing the precision and scale of
>> decimal type in hive, spark reads incorrect value
>> SPARK-34213 LOAD DATA doesn't refresh v1 table cache
>> SPARK-34229 Avro should read decimal values with the file schema
>> SPARK-34262 ALTER TABLE .. SET LOCATION doesn't refresh v1 table cache
>>
>


Re: Apache Spark 3.0.2 Release ?

2021-02-12 Thread Sean Owen
Sounds like a fine time to me, sure.

On Fri, Feb 12, 2021 at 1:39 PM Dongjoon Hyun 
wrote:

> Hi, All.
>
> As of today, `branch-3.0` has 307 patches (including 25 correctness
> patches) since v3.0.1 tag (released on September 8th, 2020).
>
> Since we stabilized branch-3.0 during 3.1.x preparation so far,
> it would be great if we start to release Apache Spark 3.0.2 next week.
> And, I'd like to volunteer for Apache Spark 3.0.2 release manager.
>
> What do you think about the Apache Spark 3.0.2 release?
>
> Bests,
> Dongjoon.
>
>
> --
> SPARK-31511 Make BytesToBytesMap iterator() thread-safe
> SPARK-32635 When pyspark.sql.functions.lit() function is used with
> dataframe cache, it returns wrong result
> SPARK-32753 Deduplicating and repartitioning the same column create
> duplicate rows with AQE
> SPARK-32764 compare of -0.0 < 0.0 return true
> SPARK-32840 Invalid interval value can happen to be just adhesive with the
> unit
> SPARK-32908 percentile_approx() returns incorrect results
> SPARK-33019 Use
> spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 by default
> SPARK-33183 Bug in optimizer rule EliminateSorts
> SPARK-33260 SortExec produces incorrect results if sortOrder is a Stream
> SPARK-33290 REFRESH TABLE should invalidate cache even though the table
> itself may not be cached
> SPARK-33358 Spark SQL CLI command processing loop can't exit while one
> comand fail
> SPARK-33404 "date_trunc" expression returns incorrect results
> SPARK-33435 DSv2: REFRESH TABLE should invalidate caches
> SPARK-33591 NULL is recognized as the "null" string in partition specs
> SPARK-33593 Vector reader got incorrect data with binary partition value
> SPARK-33726 Duplicate field names causes wrong answers during aggregation
> SPARK-33950 ALTER TABLE .. DROP PARTITION doesn't refresh cache
> SPARK-34011 ALTER TABLE .. RENAME TO PARTITION doesn't refresh cache
> SPARK-34027 ALTER TABLE .. RECOVER PARTITIONS doesn't refresh cache
> SPARK-34055 ALTER TABLE .. ADD PARTITION doesn't refresh cache
> SPARK-34187 Use available offset range obtained during polling when
> checking offset validation
> SPARK-34212 For parquet table, after changing the precision and scale of
> decimal type in hive, spark reads incorrect value
> SPARK-34213 LOAD DATA doesn't refresh v1 table cache
> SPARK-34229 Avro should read decimal values with the file schema
> SPARK-34262 ALTER TABLE .. SET LOCATION doesn't refresh v1 table cache
>


Apache Spark 3.0.2 Release ?

2021-02-12 Thread Dongjoon Hyun
Hi, All.

As of today, `branch-3.0` has 307 patches (including 25 correctness
patches) since v3.0.1 tag (released on September 8th, 2020).

Since we stabilized branch-3.0 during 3.1.x preparation so far,
it would be great if we start to release Apache Spark 3.0.2 next week.
And, I'd like to volunteer for Apache Spark 3.0.2 release manager.

What do you think about the Apache Spark 3.0.2 release?

Bests,
Dongjoon.


--
SPARK-31511 Make BytesToBytesMap iterator() thread-safe
SPARK-32635 When pyspark.sql.functions.lit() function is used with
dataframe cache, it returns wrong result
SPARK-32753 Deduplicating and repartitioning the same column create
duplicate rows with AQE
SPARK-32764 compare of -0.0 < 0.0 return true
SPARK-32840 Invalid interval value can happen to be just adhesive with the
unit
SPARK-32908 percentile_approx() returns incorrect results
SPARK-33019 Use
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 by default
SPARK-33183 Bug in optimizer rule EliminateSorts
SPARK-33260 SortExec produces incorrect results if sortOrder is a Stream
SPARK-33290 REFRESH TABLE should invalidate cache even though the table
itself may not be cached
SPARK-33358 Spark SQL CLI command processing loop can't exit while one
comand fail
SPARK-33404 "date_trunc" expression returns incorrect results
SPARK-33435 DSv2: REFRESH TABLE should invalidate caches
SPARK-33591 NULL is recognized as the "null" string in partition specs
SPARK-33593 Vector reader got incorrect data with binary partition value
SPARK-33726 Duplicate field names causes wrong answers during aggregation
SPARK-33950 ALTER TABLE .. DROP PARTITION doesn't refresh cache
SPARK-34011 ALTER TABLE .. RENAME TO PARTITION doesn't refresh cache
SPARK-34027 ALTER TABLE .. RECOVER PARTITIONS doesn't refresh cache
SPARK-34055 ALTER TABLE .. ADD PARTITION doesn't refresh cache
SPARK-34187 Use available offset range obtained during polling when
checking offset validation
SPARK-34212 For parquet table, after changing the precision and scale of
decimal type in hive, spark reads incorrect value
SPARK-34213 LOAD DATA doesn't refresh v1 table cache
SPARK-34229 Avro should read decimal values with the file schema
SPARK-34262 ALTER TABLE .. SET LOCATION doesn't refresh v1 table cache


Re: Using bundler for Jekyll?

2021-02-12 Thread attilapiros
Managed to improve the site building a bit more: with a Gemfile we can pin
Jekyll to an exact version. For this we just have to call Jekyll via `bundle
exec jekyll`.
 
The PR [1] has been opened.

[1] https://github.com/apache/spark-website/pull/303






Re: Using bundler for Jekyll?

2021-02-12 Thread attilapiros
Sure I will do that, too.






Re: Using bundler for Jekyll?

2021-02-12 Thread Sean Owen
Seems fine to me. How about just regenerating the whole site once with the
latest version and requiring that?

On Fri, Feb 12, 2021 at 7:09 AM attilapiros 
wrote:

> I ran into the same problem today and tried to find the version where the
> diff is minimal, so I wrote a script:
>
> ```
> #!/bin/zsh
>
> versions=('3.7.3' '3.7.2' '3.7.0' '3.6.3' '3.6.2' '3.6.1' '3.6.0' '3.5.2'
> '3.5.1' '3.5.0' '3.4.5' '3.4.4' '3.4.3' '3.4.2' '3.4.1' '3.4.0')
>
> for i in $versions; do
>   gem uninstall -a -x jekyll rouge
>   gem install jekyll --version $i
>   jekyll build
>   git diff --stat
>   git reset --hard HEAD
> done
> ```
>
> Based on this, the best version is jekyll-3.6.3:
>
> ```
> site/community.html |  2 +-
>  site/sitemap.xml| 14 +++---
>  2 files changed, 8 insertions(+), 8 deletions(-)
> ```
>
> What about changing the README.md [1] and specifying this exact version?
>
> Moreover changing the command to install it to:
>
> ```
>  gem install jekyll --version 3.6.3
> ```
>
> This installs the right rouge version, as it is a dependency.
>
> Finally, I would give this command too as a prerequisite:
>
> ```
>   gem uninstall -a -x jekyll rouge
> ```
>
> Because gem keeps all the installed versions and only one is used.
>
>
> [1]
>
> https://github.com/apache/spark-website/blob/6a5fc2ccaa5ad648dc0b25575ff816c10e648bdf/README.md#L5
>
>
>
>
>


Re: Using bundler for Jekyll?

2021-02-12 Thread attilapiros
I ran into the same problem today and tried to find the version where the
diff is minimal, so I wrote a script:

```
#!/bin/zsh

versions=('3.7.3' '3.7.2' '3.7.0' '3.6.3' '3.6.2' '3.6.1' '3.6.0' '3.5.2'
'3.5.1' '3.5.0' '3.4.5' '3.4.4' '3.4.3' '3.4.2' '3.4.1' '3.4.0')

for i in $versions; do
  gem uninstall -a -x jekyll rouge
  gem install jekyll --version $i
  jekyll build
  git diff --stat
  git reset --hard HEAD
done
```

Based on this, the best version is jekyll-3.6.3:

```
site/community.html |  2 +-
 site/sitemap.xml| 14 +++---
 2 files changed, 8 insertions(+), 8 deletions(-)
```

What about changing the README.md [1] and specifying this exact version? 

Moreover changing the command to install it to:
 
```
 gem install jekyll --version 3.6.3
``` 
 
This installs the right rouge version, as it is a dependency.

Finally, I would give this command too as a prerequisite:

```
  gem uninstall -a -x jekyll rouge
```

Because gem keeps all the installed versions and only one is used.

 
[1]
https://github.com/apache/spark-website/blob/6a5fc2ccaa5ad648dc0b25575ff816c10e648bdf/README.md#L5


