Re: What else could be removed in Spark 4?

2023-08-16 Thread Yang Jie
I would like to know how we should handle the two Kinesis-related modules in 
Spark 4.0. They see very few code updates, and because the corresponding tests 
are not run in any GitHub Actions pipeline, I think they significantly lack 
quality assurance. On top of that, I am not certain whether the test cases in 
these modules that require AWS credentials are verified for each Spark version 
release.

Thanks,
Jie Yang

On 2023/08/08 08:28:37 Cheng Pan wrote:
> What do you think about removing HiveContext and even SQLContext?
> 
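> Both already have a SparkSession-based replacement. A rough migration sketch, 
> for illustration only (assuming a Hive-enabled build; the app name is made up):
> 
>   // Old, deprecated entry points:
>   //   val sqlContext = new org.apache.spark.sql.SQLContext(sc)
>   //   val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
>   // SparkSession has been the unified replacement since 2.0:
>   import org.apache.spark.sql.SparkSession
> 
>   val spark = SparkSession.builder()
>     .appName("hive-migration-example")   // hypothetical app name
>     .enableHiveSupport()                 // replaces HiveContext
>     .getOrCreate()
> 
>   spark.sql("SHOW TABLES").show()        // same SQL surface as before
> 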
> And as an extension of this question, should we re-implement the Hive 
> connector using the DSv2 API in Spark 4?
> 
> Developers who want to implement a custom DataSource plugin may want to 
> learn from the Spark built-in ones[1], and Hive is a good candidate. A 
> legacy-style implementation may confuse those developers.
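> 
> To make that concrete, the entry point such a plugin implements looks roughly 
> like the sketch below (illustrative only, written against the Spark 3.x 
> connector interfaces; the class, table, and column names are invented):
> 
>   import java.util
>   import org.apache.spark.sql.connector.catalog.{Table, TableCapability, TableProvider}
>   import org.apache.spark.sql.connector.expressions.Transform
>   import org.apache.spark.sql.types.StructType
>   import org.apache.spark.sql.util.CaseInsensitiveStringMap
> 
>   // Discovered via META-INF/services or by passing the class name to format(...).
>   class ExampleSource extends TableProvider {
> 
>     // Called when the user does not supply a schema explicitly.
>     override def inferSchema(options: CaseInsensitiveStringMap): StructType =
>       new StructType().add("id", "long").add("value", "string")
> 
>     override def getTable(
>         schema: StructType,
>         partitioning: Array[Transform],
>         properties: util.Map[String, String]): Table = {
>       val tableSchema = schema
>       new Table {
>         override def name(): String = "example_table"
>         override def schema(): StructType = tableSchema
>         // A real connector would also mix in SupportsRead / SupportsWrite
>         // and provide the matching ScanBuilder / WriteBuilder.
>         override def capabilities(): util.Set[TableCapability] =
>           util.EnumSet.of(TableCapability.BATCH_READ)
>       }
>     }
>   }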
> 
> It was discussed/requested in [2][3][4][5]
> 
> There have also been requests for multiple Hive metastore support[6], and I 
> have seen users choose Presto/Trino instead of Spark because the former 
> supports multiple HMS instances.
> 
> BTW, there are known third-party Hive DSv2 implementations[7][8].
> 
> [1] https://www.mail-archive.com/dev@spark.apache.org/msg30353.html
> [2] https://www.mail-archive.com/dev@spark.apache.org/msg25715.html
> [3] https://issues.apache.org/jira/browse/SPARK-31241
> [4] https://issues.apache.org/jira/browse/SPARK-39797
> [5] https://issues.apache.org/jira/browse/SPARK-44518
> [6] https://www.mail-archive.com/dev@spark.apache.org/msg30228.html
> [7] https://github.com/permanentstar/spark-sql-dsv2-extension
> [8] 
> https://github.com/apache/kyuubi/tree/master/extensions/spark/kyuubi-spark-connector-hive
> 
> Thanks,
> Cheng Pan
> 
> 
> > On Aug 8, 2023, at 10:09, Wenchen Fan  wrote:
> > 
> > I think the principle is we should remove things that block us from 
> > supporting new things like Java 21, or come with a significant maintenance 
> > cost. If there is no benefit to removing deprecated APIs (just to keep the 
> > codebase clean?), I'd prefer to leave them there and not bother.
> > 
> > On Tue, Aug 8, 2023 at 9:00 AM Jia Fan  wrote:
> > Thanks Sean for opening this discussion.
> > 
> > 1. I think dropping Scala 2.12 is a good option.
> > 
> > 2. Personally, I think we should remove most methods that have been 
> > deprecated since 2.x/1.x, unless there is no good replacement. The 3.x line 
> > has already served as a buffer, and I don't think it is good practice to 
> > keep using methods deprecated in 2.x on 4.x.
> > 
> > 3. For Mesos, I think we should remove it from the docs first.
> > 
> > 
> > Jia Fan
> > 
> > 
> > 
> >> On Aug 8, 2023, at 05:47, Sean Owen  wrote:
> >> 
> >> While we're noodling on the topic, what else might be worth removing in 
> >> Spark 4?
> >> 
> >> For example, looks like we're finally hitting problems supporting Java 8 
> >> through 21 all at once, related to Scala 2.13.x updates. It would be 
> >> reasonable to require Java 11, or even 17, as a baseline for the 
> >> multi-year lifecycle of Spark 4.
> >> 
> >> Dare I ask: drop Scala 2.12? Supporting 2.12 / 2.13 / 3.0 might get hard 
> >> otherwise.
> >> 
> >> There was a good discussion about whether old deprecated methods should be 
> >> removed. They can't be removed at other times, but that doesn't mean they 
> >> all should be. createExternalTable was brought up as a first example. What 
> >> deprecated methods are worth removing?
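> >> 
> >> For createExternalTable specifically, the replacement has existed since 2.2; 
> >> a small sketch for illustration (the table name and path are made up, and 
> >> `spark` is assumed to be the active SparkSession, e.g. in spark-shell):
> >> 
> >>   // Deprecated since 2.2:
> >>   //   spark.catalog.createExternalTable("events", "/data/events")
> >>   // Replacement; with a path argument it still creates an external table:
> >>   val events = spark.catalog.createTable("events", "/data/events")
> >>   events.printSchema()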
> >> 
> >> There's Mesos support, long since deprecated, which seems like something 
> >> to prune.
> >> 
> >> Are there old Hive/Hadoop version combos we should just stop supporting?
> > 
> 
> 




Re: Spark writing API

2023-08-16 Thread Andrew Melo
Hello Wenchen,

On Wed, Aug 16, 2023 at 23:33 Wenchen Fan  wrote:

> > is there a way to hint to the downstream writer about the number of rows it
> > should expect to write?
>
> It will be very hard to do. Spark pipelines the execution (within shuffle
> boundaries) and we can't predict the number of final output rows.
>

Perhaps I don't understand -- even in the case of multiple shuffles, you
can assume that there is exactly one shuffle boundary before the write
operation, and that shuffle boundary knows the number of input rows for
that shuffle. That number of rows is, by construction, an upper bound
on the number of rows that will be passed to the writer.

If the writer can be given that bound as a hint, it can do something smart with
allocating (memory or disk). By comparison, the current API just hands over
rows/batches one at a time, and in the case of off-heap allocation (like
with Arrow's off-heap storage), it's wildly inefficient to keep doing the
equivalent of realloc() to grow the buffer.
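
To make the realloc() point concrete, here is a rough sketch using Arrow's Java
API of what a row-count hint would enable. The hint itself is hypothetical --
no such value is passed down by the current write API:

  import org.apache.arrow.memory.RootAllocator
  import org.apache.arrow.vector.BigIntVector

  val allocator = new RootAllocator(Long.MaxValue)
  val vector = new BigIntVector("value", allocator)

  // With a hint, the writer can size its off-heap buffer exactly once...
  val hintedRowCount = 1000000  // hypothetical value from the shuffle boundary
  vector.allocateNew(hintedRowCount)

  var i = 0
  while (i < hintedRowCount) {
    // set() assumes capacity is already there; without the hint you must use
    // setSafe(), which reallocates and copies whenever the index exceeds the
    // current capacity -- the realloc() pattern described above.
    vector.set(i, i.toLong)
    i += 1
  }
  vector.setValueCount(hintedRowCount)

  vector.close()
  allocator.close()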

Thanks
Andrew



> On Mon, Aug 7, 2023 at 8:27 PM Steve Loughran 
> wrote:
>
>>
>>
>> On Thu, 1 Jun 2023 at 00:58, Andrew Melo  wrote:
>>
>>> Hi all
>>>
>>> For some time I've been developing a Spark DSv2 plugin, "Laurelin" (
>>> https://github.com/spark-root/laurelin
>>> ), to read the ROOT (https://root.cern) file format (which is used in
>>> high energy physics). I've recently presented my work at a conference (
>>> https://indico.jlab.org/event/459/contributions/11603/).
>>>
>>>
>> nice paper given the esoteric nature of HEP file formats.
>>
>> All of that to say,
>>>
>>> A) Is there a reason that the builtin (e.g. parquet) data sources can't
>>> consume the external APIs? It's hard to write a plugin that has to use a
>>> specific API when you're competing with another source that gets access to
>>> the internals directly.
>>>
>>> B) What is the Spark-approved API to code against for writes? There is
>>> a mess of *ColumnWriter classes in the Java namespace, and since there is
>>> no documentation, it's unclear which one the core prefers (maybe
>>> ArrowWriterColumnVector?). We can provide a zero-copy write if the API
>>> allows for it.
>>>
>>
>> There's a dangerous tendency for things that libraries need to be tagged
>> private[spark], normally worked around by people putting their code into
>> org.apache.spark packages. Really, everyone who does that should try to get
>> a longer-term fix in, as well as that quick-and-effective workaround.
>> Knowing where the problems lie would be a good first step. The Spark
>> sub-modules are probably a place to get insight into where those low-level
>> internal operations are considered important, although many uses may be for
>> historical "we wrote it that way a long time ago" reasons.
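>>
>> To spell out that workaround: code compiled into an org.apache.spark.*
>> package can see members scoped to that package. A purely illustrative
>> sketch (the package and object names are invented):
>>
>>   // Lives in the connector's jar, NOT in Spark itself.
>>   package org.apache.spark.sql.myconnector
>>
>>   import org.apache.spark.sql.SparkSession
>>
>>   object InternalBridge {
>>     // sessionState is private[sql]; it is visible here only because this
>>     // file is compiled under the org.apache.spark.sql package.
>>     def analyzer(spark: SparkSession) = spark.sessionState.analyzer
>>   }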
>>
>>
>>>
>>> C) Putting aside everything above, is there a way to hint to the downstream
>>> writer about the number of rows it should expect to write? Any smart writer
>>> will use off-heap memory to write to disk/memory, so the current API that
>>> shoves rows in one at a time doesn't do the trick. You don't want to keep
>>> reallocating buffers constantly.
>>>
>>> D) What is Spark's plan for arrow-based columnar data representations?
>>> I see that there are a lot of external efforts whose only option is to
>>> inject themselves into the CLASSPATH. The regular DSv2 API is already
>>> crippled for reads, and for writes it's even worse. Is there a commitment
>>> from the Spark core to bring the API to parity? Or is it instead just a
>>> YMMV commitment?
>>>
>>
>> No idea, I'm afraid. I do think Arrow makes a good format for processing,
>> and it'd be interesting to see how well it actually works as a wire format
>> to replace other things (e.g. Hive's protocol), especially on RDMA networks
>> and the like. I'm not up to date with ongoing work there - if anyone has
>> pointers, that'd be interesting.
>>
>>>
>>> Thanks!
>>> Andrew
>>>
>>>
>>>
>>>
>>>
>>> --
>>> It's dark in this basement.
>>>
>> --
It's dark in this basement.


Re: Spark writing API

2023-08-16 Thread Wenchen Fan
> is there a way to hint to the downstream writer about the number of rows it
> should expect to write?

It will be very hard to do. Spark pipelines the execution (within shuffle
boundaries) and we can't predict the number of final output rows.
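
As a small illustration of the pipelining point (paths and column names are
made up, and `spark` is assumed to be an active SparkSession): any operator
that runs between the last shuffle and the write can change the row count, so
at best an upper bound is known when the writer starts.

  import org.apache.spark.sql.functions.col

  // The aggregation is the last shuffle; the filter after it is pipelined
  // with the write, so the number of rows reaching the writer is
  // data-dependent and only bounded above by the shuffle's output.
  spark.read.parquet("/data/events")          // hypothetical input path
    .groupBy("userId").count()                // shuffle boundary
    .filter(col("count") > 10)                // pipelined, data-dependent
    .write.parquet("/data/active_users")      // hypothetical output path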

On Mon, Aug 7, 2023 at 8:27 PM Steve Loughran 
wrote:

>
>
> On Thu, 1 Jun 2023 at 00:58, Andrew Melo  wrote:
>
>> Hi all
>>
>> For some time I've been developing a Spark DSv2 plugin, "Laurelin" (
>> https://github.com/spark-root/laurelin
>> ), to read the ROOT (https://root.cern) file format (which is used in
>> high energy physics). I've recently presented my work at a conference (
>> https://indico.jlab.org/event/459/contributions/11603/).
>>
>>
> nice paper given the esoteric nature of HEP file formats.
>
> All of that to say,
>>
>> A) Is there a reason that the builtin (e.g. parquet) data sources can't
>> consume the external APIs? It's hard to write a plugin that has to use a
>> specific API when you're competing with another source that gets access to
>> the internals directly.
>>
>> B) What is the Spark-approved API to code against for writes? There is
>> a mess of *ColumnWriter classes in the Java namespace, and since there is
>> no documentation, it's unclear which one the core prefers (maybe
>> ArrowWriterColumnVector?). We can provide a zero-copy write if the API
>> allows for it.
>>
>
> There's a dangerous tendency for things that libraries need to be tagged
> private[spark], normally worked around by people putting their code into
> org.apache.spark packages. Really, everyone who does that should try to get
> a longer-term fix in, as well as that quick-and-effective workaround.
> Knowing where the problems lie would be a good first step. The Spark
> sub-modules are probably a place to get insight into where those low-level
> internal operations are considered important, although many uses may be for
> historical "we wrote it that way a long time ago" reasons.
>
>
>>
>> C) Putting aside everything above, is there a way to hint to the downstream
>> writer about the number of rows it should expect to write? Any smart writer
>> will use off-heap memory to write to disk/memory, so the current API that
>> shoves rows in one at a time doesn't do the trick. You don't want to keep
>> reallocating buffers constantly.
>>
>> D) What is Spark's plan for arrow-based columnar data representations?
>> I see that there are a lot of external efforts whose only option is to
>> inject themselves into the CLASSPATH. The regular DSv2 API is already
>> crippled for reads, and for writes it's even worse. Is there a commitment
>> from the Spark core to bring the API to parity? Or is it instead just a
>> YMMV commitment?
>>
>
> No idea, I'm afraid. I do think Arrow makes a good format for processing,
> and it'd be interesting to see how well it actually works as a wire format
> to replace other things (e.g. Hive's protocol), especially on RDMA networks
> and the like. I'm not up to date with ongoing work there - if anyone has
> pointers, that'd be interesting.
>
>>
>> Thanks!
>> Andrew
>>
>>
>>
>>
>>
>> --
>> It's dark in this basement.
>>
>


Unsubscribe

2023-08-16 Thread 赵军
赵军