Re: LLM script for error message improvement

2023-08-02 Thread Ruifeng Zheng
+1 from my side, I'm fine to have it as a helper script

On Thu, Aug 3, 2023 at 10:53 AM Hyukjin Kwon wrote:

> I think adding that dev tool script to improve the error message is fine.
>
> On Thu, 3 Aug 2023 at 10:24, Haejoon Lee wrote:
>
>> Dear contributors, I hope you are doing well!
>>
>> I see there are contributors who are interested in working on error
>> message improvements and in contributing on an ongoing basis, so I want
>> to share an LLM-based error message improvement script to help with your
>> contributions.
>>
>> You can find details about the script at
>> https://github.com/apache/spark/pull/41711. I believe this can help your
>> error message improvement work, so I encourage you to take a look at the
>> pull request and leverage the script.
>>
>> Please let me know if you have any questions or concerns.
>>
>> Thanks all for your time and contributions!
>>
>> Best regards,
>>
>> Haejoon
>>
>


Re: Spark writing API

2023-08-02 Thread Andrew Melo
Hello Spark Devs

Could anyone help me with this?

Thanks,
Andrew

On Wed, May 31, 2023 at 20:57 Andrew Melo wrote:

> Hi all
>
> For some time I've been developing a Spark DSv2 plugin, "Laurelin"
> (https://github.com/spark-root/laurelin), to read the ROOT
> (https://root.cern) file format, which is used in high energy physics.
> I recently presented my work at a conference
> (https://indico.jlab.org/event/459/contributions/11603/).
>
> All of that to say,
>
> A) Is there any reason that the built-in (e.g. Parquet) data sources can't
> consume the external APIs? It's hard to write a plugin that has to use a
> specific API when you're competing with another source that gets direct
> access to the internals.
>
> B) What is the Spark-approved API to code against for writes? There is a
> mess of *ColumnWriter classes in the Java namespace, and since there is no
> documentation, it's unclear which is preferred by the core (maybe
> ArrowWriterColumnVector?). We can provide a zero-copy write if the API
> describes it.
>
> C) Putting aside everything above, is there a way to hint to the
> downstream users the number of rows expected to be written? Any smart
> writer will use off-heap memory to write to disk/memory, so the current API
> that shoves rows in doesn't do the trick. You don't want to keep
> reallocating buffers constantly.
>
> D) What is Spark's plan for using Arrow-based columnar data
> representations? I see that there are a lot of external efforts whose only
> option is to inject themselves into the CLASSPATH. The regular DSv2 API is
> already crippled for reads, and for writes it's even worse. Is there a
> commitment from the Spark core to bring the API to parity? Or is it
> instead just a YMMV commitment?
>
> Thanks!
> Andrew
>
>
>
>
>
> --
> It's dark in this basement.
>
-- 
It's dark in this basement.
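
A note on point C of the quoted message above: the cost being described is repeated buffer growth when a writer does not know the row count in advance. The sketch below only illustrates that arithmetic. It is not a Spark API; `count_reallocations` and its parameters are hypothetical names, and the doubling growth policy is an assumption.

```python
def count_reallocations(total_rows, initial_capacity, growth_factor=2):
    """Count how often a growable buffer is reallocated to hold total_rows.

    Hypothetical helper for illustration only; it models a buffer that is
    multiplied by growth_factor whenever it is too small.
    """
    if initial_capacity <= 0:
        raise ValueError("initial_capacity must be positive")
    if growth_factor <= 1:
        raise ValueError("growth_factor must be greater than 1")

    capacity = initial_capacity
    reallocations = 0
    while capacity < total_rows:
        capacity *= growth_factor  # grow the buffer (one reallocation)
        reallocations += 1
    return reallocations


# No hint: the writer starts with a small default capacity.
without_hint = count_reallocations(1_000_000, initial_capacity=1_024)

# With a row-count hint: the writer sizes the buffer up front.
with_hint = count_reallocations(1_000_000, initial_capacity=1_000_000)

print(without_hint, with_hint)
```

Starting from 1,024 rows and doubling, one million rows force ten reallocations; a buffer sized from a row-count hint needs none.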


Re: LLM script for error message improvement

2023-08-02 Thread Hyukjin Kwon
I think adding that dev tool script to improve the error message is fine.

On Thu, 3 Aug 2023 at 10:24, Haejoon Lee wrote:

> Dear contributors, I hope you are doing well!
>
> I see there are contributors who are interested in working on error
> message improvements and in contributing on an ongoing basis, so I want to
> share an LLM-based error message improvement script to help with your
> contributions.
>
> You can find details about the script at
> https://github.com/apache/spark/pull/41711. I believe this can help your
> error message improvement work, so I encourage you to take a look at the
> pull request and leverage the script.
>
> Please let me know if you have any questions or concerns.
>
> Thanks all for your time and contributions!
>
> Best regards,
>
> Haejoon
>


LLM script for error message improvement

2023-08-02 Thread Haejoon Lee
Dear contributors, I hope you are doing well!

I see there are contributors who are interested in working on error message
improvements and in contributing on an ongoing basis, so I want to share an
LLM-based error message improvement script to help with your contributions.

You can find details about the script at
https://github.com/apache/spark/pull/41711. I believe this can help your
error message improvement work, so I encourage you to take a look at the
pull request and leverage the script.

Please let me know if you have any questions or concerns.

Thanks all for your time and contributions!

Best regards,

Haejoon


Query hints visible to DSV2 connectors?

2023-08-02 Thread Alex Cruise
Hey folks,

I'm adding an optional feature to my DSV2 connector where it can choose
between a row-based or columnar PartitionReader dynamically depending on a
query's schema. I'd like to be able to supply a hint at query time that's
visible to the connector, but at the moment I can't see any way to
accomplish that.

From what I can see, the artifacts produced by the existing hint system [
https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-hints.html
or sql("select 1").hint("foo").show()] aren't visible from the
TableCatalog/Table/ScanBuilder.

I guess I could set a config parameter but I'd rather do this on a
per-query basis. Any tips?

Thanks!

-0xe1a
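
The decision described in the message above (row-based or columnar reader chosen from the query's schema, with a per-query override) can be sketched as plain logic. This is a minimal model, not connector code: the option name `readerMode`, the set of "columnar-friendly" types, and the function itself are all hypothetical, and the sketch assumes the override reaches the connector as a string key/value option. Whether such an option can be supplied through the hint system is exactly the open question in the thread.

```python
# Hypothetical: which column types the columnar path handles well.
PRIMITIVE_TYPES = {"boolean", "int", "long", "float", "double"}


def choose_reader(schema, options=None):
    """Return "columnar" or "row" for a scan.

    schema:  list of (column_name, type_name) pairs.
    options: per-query string options; keys and values are compared
             case-insensitively. "readerMode" is a made-up option name.
    """
    normalized = {k.lower(): v.lower() for k, v in (options or {}).items()}
    mode = normalized.get("readermode", "auto")

    # An explicit per-query override wins over the schema heuristic.
    if mode in ("row", "columnar"):
        return mode
    if mode != "auto":
        raise ValueError(f"unknown readerMode: {mode!r}")

    # Heuristic: go columnar only when every column is a primitive type.
    all_primitive = all(type_name in PRIMITIVE_TYPES for _, type_name in schema)
    return "columnar" if schema and all_primitive else "row"


print(choose_reader([("id", "long"), ("score", "double")]))
print(choose_reader([("id", "long"), ("payload", "struct")]))
print(choose_reader([("payload", "struct")], {"readerMode": "columnar"}))
```

Keeping the override separate from the heuristic means a missing option leaves the default behaviour unchanged.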


[VOTE][RESULT] XML data source support

2023-08-02 Thread Sandip Agarwala
The vote passes with 7 +1s (4 binding +1s).
Thank you all for your comments and votes!

(* = binding)
Adrian Pop-Tifrea
Hyukjin Kwon *
Jia Fan
Mich Talebzadeh
Maciej Szymkiewicz *
Sean Owen *
Xiao Li *

SPIP link:
https://docs.google.com/document/d/1ZaOBT4-YFtN58UCx2cdFhlsKbie1ugAn-Fgz_Dddz-Q/edit?usp=sharing
JIRA:
https://issues.apache.org/jira/browse/SPARK-44265
Discussion Thread:
https://lists.apache.org/thread/q32hxgsp738wom03mgpg9ykj9nr2n1fh
Vote thread:
https://lists.apache.org/thread/vmcl0tyhlbf4b6njb4no2ztjxmjh1b24

Best regards,
Sandip


Re: [Reminder] Spark 3.5 RC Cut

2023-08-02 Thread Bjørn Jørgensen
@Dongjoon Hyun FYI
[image: image.png]

We had better ask common-...@hadoop.apache.org.

On Wed, 2 Aug 2023 at 18:03, Dongjoon Hyun wrote:

> Oh, I got it, Emil and Bjorn.
>
> Dongjoon.
>
> On Wed, Aug 2, 2023 at 12:32 AM Bjørn Jørgensen wrote:
>
>> "*As far as I can tell this makes both 3.3.5 and 3.3.6 unusable with s3
>> without providing an alternative committer code.*"
>>
>> https://github.com/apache/hadoop/pull/5706#issuecomment-1619927992

Re: [Reminder] Spark 3.5 RC Cut

2023-08-02 Thread Dongjoon Hyun
Oh, I got it, Emil and Bjorn.

Dongjoon.

On Wed, Aug 2, 2023 at 12:32 AM Bjørn Jørgensen wrote:

> "*As far as I can tell this makes both 3.3.5 and 3.3.6 unusable with s3
> without providing an alternative committer code.*"
>
> https://github.com/apache/hadoop/pull/5706#issuecomment-1619927992

Re: [Reminder] Spark 3.5 RC Cut

2023-08-02 Thread Bjørn Jørgensen
"*As far as I can tell this makes both 3.3.5 and 3.3.6 unusable with s3
without providing an alternative committer code.*"

https://github.com/apache/hadoop/pull/5706#issuecomment-1619927992

On Wed, 2 Aug 2023 at 08:05, Emil Ejbyfeldt wrote:

>  > Apache Spark is not affected by HADOOP-18757 because it is not a part of
>  > both Apache Hadoop 3.3.5 and 3.3.6.
>
> I am not sure I follow what you are trying to say here. Is it that the
> Jira says only 3.3.5 is affected? If so, I think the Jira is just
> incorrect. The Jira (and the PR with the fix) was created before 3.3.6 was
> released, and I think the Jira has simply not been updated to reflect the
> fact that 3.3.6 is also affected.
>
>  > HADOOP-18757 seems to be merged just two weeks ago and there is no
>  > Apache Hadoop release with it, isn't it?
>
> That is correct, there is no Hadoop release containing the fix, so 3.3.6
> is also affected by the regression.
>
> Best,
> Emil

-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297


Re: [Reminder] Spark 3.5 RC Cut

2023-08-02 Thread Emil Ejbyfeldt

> Apache Spark is not affected by HADOOP-18757 because it is not a part of
> both Apache Hadoop 3.3.5 and 3.3.6.

I am not sure I follow what you are trying to say here. Is it that the
Jira says only 3.3.5 is affected? If so, I think the Jira is just
incorrect. The Jira (and the PR with the fix) was created before 3.3.6 was
released, and I think the Jira has simply not been updated to reflect the
fact that 3.3.6 is also affected.

> HADOOP-18757 seems to be merged just two weeks ago and there is no
> Apache Hadoop release with it, isn't it?

That is correct, there is no Hadoop release containing the fix, so 3.3.6
is also affected by the regression.


Best,
Emil
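
For anyone testing the RC against S3, the version claim in this message can be turned into a quick check. This is a minimal sketch that only encodes what is reported in this thread (Hadoop 3.3.5 and 3.3.6 affected, no release containing the fix yet); the function name and the affected-version set are illustrative assumptions, and the set would need updating once a fixed Hadoop release exists.

```python
# As reported in this thread; an assumption, not an authoritative list.
AFFECTED_HADOOP_VERSIONS = {"3.3.5", "3.3.6"}


def is_affected_by_s3a_regression(hadoop_version):
    """Return True if the given Hadoop version string is one of the
    releases reported in this thread as affected by HADOOP-18757."""
    return hadoop_version.strip() in AFFECTED_HADOOP_VERSIONS


# In a running PySpark session the Hadoop version can usually be read via
# Hadoop's VersionInfo class through the private _jvm gateway, e.g.
#   spark.sparkContext._jvm.org.apache.hadoop.util.VersionInfo.getVersion()
# (relying on _jvm is an assumption about the environment).
for version in ("3.3.4", "3.3.5", "3.3.6"):
    print(version, is_affected_by_s3a_regression(version))
```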

On 02/08/2023 07:51, Dongjoon Hyun wrote:

It's still invalid information, Emil.

Apache Spark is not affected by HADOOP-18757 because it is not a part of 
both Apache Hadoop 3.3.5 and 3.3.6.


HADOOP-18757 seems to be merged just two weeks ago and there is no 
Apache Hadoop release with it, isn't it?


Could you check your local branch once more, please?

Dongjoon.



On Tue, Aug 1, 2023 at 9:46 PM Emil Ejbyfeldt wrote:


Hi,

Yes, sorry about that; I seem to have messed up the link. It should have been
https://issues.apache.org/jira/browse/HADOOP-18757


Best,
Emil

On 01/08/2023 19:08, Dongjoon Hyun wrote:
 > Hi, Emil.
 >
 > HADOOP-18568 is still open and it seems never to have been part of the
 > Hadoop trunk branch.
 >
 > Do you mean another JIRA?
 >
 > Dongjoon.
 >
 >
 >
 > On Tue, Aug 1, 2023 at 2:59 AM Emil Ejbyfeldt wrote:
 >
 >     Hi,
 >
 >     We previously ran some experiments on builds from the 3.5 branch and
 >     noticed that Hadoop had a regression
 >     (https://issues.apache.org/jira/browse/HADOOP-18568) in their s3a
 >     committer affecting 3.3.5 and 3.3.6 (Spark 3.4 uses Hadoop 3.3.4). The
 >     fix has been merged into Hadoop and will be part of the next release
 >     of Hadoop.
 >
 >     From our testing, the regression when writing data to S3 with a large
 >     number of tasks is severe enough that we would need to revert to
 >     Hadoop 3.3.4 in order to use the Spark 3.5 release.
 >
 >     Since it only affects S3, I am not sure it warrants changes in Spark
 >     (e.g. rolling back Hadoop to 3.3.4). But it is probably something
 >     people testing the RC against S3 should be aware of.
 >
 >     Best,
 >     Emil
 >
 >     On 29/07/2023 10:29, Yuanjian Li wrote:
 >      > Hi everyone,
 >      >
 >      > Following the release timeline, I will cut the RC on *Tuesday, Aug
 >      > 1st at 1 pm PST* as scheduled.
 >      >
 >      > Date            Event
 >      > July 17th 2023  Code freeze. Release branch cut.
 >      > Late July 2023  QA period. Focus on bug fixes, tests, stability and
 >      >                 docs. Generally, no new features merged.
 >      > August 2023     Release candidates (RC), voting, etc. until final
 >      >                 release passes
 >      >
 >      > Best,
 >      > Yuanjian



-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org