> Would you mind if I ask the condition of being public API?

The module doesn't matter, but the package matters. We have many public APIs
in the catalyst module as well (e.g. DataType).

There are 3 packages in Spark SQL that are meant to be private:

1. org.apache.spark.sql.catalyst
2. org.apache.spark.sql.execution
3. org.apache.spark.sql.internal

You can check out the full list of Spark's private packages in
project/SparkBuild.scala#Unidoc#ignoreUndocumentedPackages. Basically,
classes/interfaces that don't appear in the official Spark API docs are
private. The Source/Sink traits are in org.apache.spark.sql.execution, and
thus they are private.

On Tue, Oct 8, 2019 at 6:19 AM Jungtaek Lim <kabhwan.opensou...@gmail.com>
wrote:

> Would you mind if I ask the condition of being public API? The Source/Sink
> traits are not marked as @DeveloperApi, but they are defined as public and
> live in sql-core, so they are not even semantically private (as catalyst
> is), which easily signals that they are public APIs.
>
> Also, if I'm not missing something here, creating a streaming DataFrame
> from an RDD[Row] is not available even as a private API. There are two
> other approaches using private APIs: 1) SQLContext.internalCreateDataFrame
> - as it requires RDD[InternalRow], implementers would also have to depend
> on catalyst and deal with InternalRow, which the Spark community seems to
> want to change eventually; 2) Dataset.ofRows - it requires a LogicalPlan,
> which is also in catalyst. So implementers not only need to apply the
> "package hack" but also need to depend on catalyst.
>
> On Mon, Oct 7, 2019 at 9:45 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>
>> AFAIK there is no public streaming data source API before DSv2. The
>> Source and Sink API is private and is only for built-in streaming
>> sources. Advanced users can still implement custom stream sources with
>> private Spark APIs (you can put your classes under the
>> org.apache.spark.sql package to access the private methods).
>>
>> That said, DSv2 is the first public streaming data source API. It's
>> really hard to design a stable, efficient and flexible data source API
>> that is unified between batch and streaming. DSv2 has evolved a lot in
>> the master branch, and hopefully there will be no big breaking changes
>> anymore.
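
[Editor's note: below is a minimal sketch of the "package hack" plus
SQLContext/SparkSession.internalCreateDataFrame approach described in this
thread, assuming Spark 2.4.x internals. The object name
StreamingDataFrameBridge is made up for illustration; internalCreateDataFrame
and CatalystTypeConverters are the private[sql]/catalyst pieces named above,
so treat this as an unsupported workaround, not a stable API.]

    // Must live under org/apache/spark/sql/ in your own project so that
    // private[sql] members become reachable (the "package hack").
    package org.apache.spark.sql

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.catalyst.{CatalystTypeConverters, InternalRow}
    import org.apache.spark.sql.types.StructType

    // Hypothetical helper -- not part of Spark.
    object StreamingDataFrameBridge {

      // Builds a DataFrame whose logical plan reports isStreaming = true,
      // which is what MicroBatchExecution's assert checks for.
      def streamingDataFrame(
          spark: SparkSession,
          rows: RDD[Row],
          schema: StructType): DataFrame = {
        // internalCreateDataFrame wants RDD[InternalRow], so convert
        // Row -> InternalRow per partition -- note this drags in a catalyst
        // dependency, exactly as pointed out above.
        val internalRows = rows.mapPartitions { iter =>
          val toCatalyst = CatalystTypeConverters.createToCatalystConverter(schema)
          iter.map(row => toCatalyst(row).asInstanceOf[InternalRow])
        }
        spark.internalCreateDataFrame(internalRows, schema, isStreaming = true)
      }
    }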

>> On Sat, Oct 5, 2019 at 12:24 PM Jungtaek Lim <kabhwan.opensou...@gmail.com>
>> wrote:
>>
>>> I remember an actual case from a developer who implemented a custom data
>>> source:
>>>
>>> https://lists.apache.org/thread.html/c1a210510b48bb1fea89828c8e2f5db8c27eba635e0079a97b0c7faf@%3Cdev.spark.apache.org%3E
>>>
>>> Quoting here:
>>> We started implementing DSv2 in the 2.4 branch, but quickly discovered
>>> that the DSv2 in 3.0 was a complete breaking change (to the point where
>>> it could have been named DSv3 and it wouldn't have come as a surprise).
>>> Since the DSv2 in 3.0 has a compatibility layer for DSv1 data sources,
>>> we decided to fall back to DSv1 in order to ease the future transition
>>> to Spark 3.
>>>
>>> Given that DSv2 for Spark 2.x and 3.x has diverged a lot, a realistic
>>> way of dealing with the DSv2 breaking change is to keep DSv1 as a
>>> temporary solution, even once DSv2 for 3.x becomes available. Developers
>>> need some time to make the transition.
>>>
>>> I'll file an issue to support streaming data sources in DSv1 and submit
>>> a patch, unless someone objects.
>>>
>>> On Wed, Oct 2, 2019 at 4:08 PM Jacek Laskowski <ja...@japila.pl> wrote:
>>>
>>>> Hi Jungtaek,
>>>>
>>>> Thanks a lot for your very prompt response!
>>>>
>>>> > Looks like it's missing, or intended to force custom streaming
>>>> > sources to be implemented as DSv2.
>>>>
>>>> That's exactly my understanding = no more DSv1 data sources. That,
>>>> however, is not consistent with the official message, is it? Spark
>>>> 2.4.4 does not actually say "we're abandoning DSv1", and people can't
>>>> really jump to DSv2 yet since it's not recommended (unless I missed
>>>> that).
>>>>
>>>> I love surprises (as that's where people pay more for consulting :)),
>>>> but not necessarily before public talks (with one at SparkAISummit in
>>>> two weeks!). It's gonna be challenging! I hope I won't spread the wrong
>>>> word.
>>>>
>>>> Pozdrawiam,
>>>> Jacek Laskowski
>>>> ----
>>>> https://about.me/JacekLaskowski
>>>> The Internals of Spark SQL https://bit.ly/spark-sql-internals
>>>> The Internals of Spark Structured Streaming
>>>> https://bit.ly/spark-structured-streaming
>>>> The Internals of Apache Kafka https://bit.ly/apache-kafka-internals
>>>> Follow me at https://twitter.com/jaceklaskowski
>>>>
>>>> On Wed, Oct 2, 2019 at 6:16 AM Jungtaek Lim
>>>> <kabhwan.opensou...@gmail.com> wrote:
>>>>
>>>>> Looks like it's missing, or intended to force custom streaming
>>>>> sources to be implemented as DSv2.
>>>>>
>>>>> I'm not sure the Spark community wants to expand the DSv1 API; I
>>>>> could propose the change if we get some support here.
>>>>>
>>>>> To the Spark community: given that we are bringing major changes to
>>>>> DSv2, some users may want to rely on DSv1 while the transition from
>>>>> the old DSv2 to the new DSv2 happens and the new DSv2 stabilizes.
>>>>> Would we like to provide the necessary changes to DSv1?
>>>>>
>>>>> Thanks,
>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>
>>>>> On Wed, Oct 2, 2019 at 4:27 AM Jacek Laskowski <ja...@japila.pl>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I think I've got stuck, and without your help I won't move any
>>>>>> further. Please help.
>>>>>>
>>>>>> I'm on Spark 2.4.4 and am developing a streaming Source (DSv1,
>>>>>> MicroBatch). In the getBatch phase, when requested for a DataFrame,
>>>>>> there is this assert [1] that I can't seem to get past with any
>>>>>> DataFrame I managed to create, as it's not streaming:
>>>>>>
>>>>>> assert(batch.isStreaming,
>>>>>>   s"DataFrame returned by getBatch from $source did not have isStreaming=true\n" +
>>>>>>   s"${batch.queryExecution.logical}")
>>>>>>
>>>>>> [1] https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala#L439-L441
>>>>>>
>>>>>> All I could find is private[sql], e.g.
>>>>>> SQLContext.internalCreateDataFrame(..., isStreaming = true) [2] or
>>>>>> Dataset.ofRows [3].
>>>>>>
>>>>>> [2] https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L422-L428
>>>>>> [3] https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L62-L81
>>>>>>
>>>>>> Pozdrawiam,
>>>>>> Jacek Laskowski
>>>>>> ----
>>>>>> https://about.me/JacekLaskowski
>>>>>> The Internals of Spark SQL https://bit.ly/spark-sql-internals
>>>>>> The Internals of Spark Structured Streaming
>>>>>> https://bit.ly/spark-structured-streaming
>>>>>> The Internals of Apache Kafka https://bit.ly/apache-kafka-internals
>>>>>> Follow me at https://twitter.com/jaceklaskowski
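
[Editor's note: for completeness, a toy DSv1 Source built on top of the
StreamingDataFrameBridge sketch above, again only a sketch against Spark
2.4.x internals. CounterSource and its package are made up, and the
StreamSourceProvider wiring needed to actually register the source with
spark.readStream.format(...) is omitted.]

    package com.example.streaming

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{DataFrame, Row, SQLContext, StreamingDataFrameBridge}
    import org.apache.spark.sql.execution.streaming.{LongOffset, Offset, Source}
    import org.apache.spark.sql.types.{LongType, StructField, StructType}

    // Toy source: emits one row (the current batch number) per micro-batch.
    class CounterSource(sqlContext: SQLContext) extends Source {

      private var counter = 0L

      override val schema: StructType =
        StructType(StructField("value", LongType) :: Nil)

      override def getOffset: Option[Offset] = {
        counter += 1
        Some(LongOffset(counter))
      }

      override def getBatch(start: Option[Offset], end: Offset): DataFrame = {
        val rows: RDD[Row] = sqlContext.sparkContext.parallelize(Seq(Row(counter)))
        // A plain sqlContext.createDataFrame(rows, schema) would trip the
        // assert quoted above, because its plan has isStreaming = false;
        // the bridge marks the plan as streaming instead.
        StreamingDataFrameBridge.streamingDataFrame(sqlContext.sparkSession, rows, schema)
      }

      override def stop(): Unit = ()
    }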