> Would you mind if I ask the condition of being public API?

The module doesn't matter, but the package matters. We have many public APIs
in the catalyst module as well (e.g. DataType).

There are 3 packages in Spark SQL that are meant to be private:

1. org.apache.spark.sql.catalyst
2. org.apache.spark.sql.execution
3. org.apache.spark.sql.internal

You can check out the full list of Spark's private packages in
project/SparkBuild.scala#Unidoc#ignoreUndocumentedPackages. Basically,
classes/interfaces that don't appear in the official Spark API docs are
private. The Source/Sink traits are in org.apache.spark.sql.execution, and
thus they are private.

On Tue, Oct 8, 2019 at 6:19 AM Jungtaek Lim <kabhwan.opensou...@gmail.com>
wrote:

> Would you mind if I ask the condition of being public API? The Source/Sink
> traits are not marked as @DeveloperApi, but they are defined as public and
> live in sql-core, so they are not even semantically private (as catalyst
> is), which easily signals that they are public APIs.
>
> Also, if I'm not missing something here, creating a streaming DataFrame
> from an RDD[Row] is not available even as a private API. There are two
> other approaches using private APIs: 1) SQLContext.internalCreateDataFrame
> - as it requires RDD[InternalRow], implementers would also have to depend
> on catalyst and deal with InternalRow, which the Spark community seems to
> want to change eventually; 2) Dataset.ofRows - it requires a LogicalPlan,
> which is also in catalyst. So implementers not only need to apply the
> "package hack" but also need to depend on catalyst.
>
> On Mon, Oct 7, 2019 at 9:45 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>
>> AFAIK there is no public streaming data source API before DSv2. The
>> Source and Sink API is private and is only for built-in streaming
>> sources. Advanced users can still implement custom stream sources with
>> private Spark APIs (you can put your classes under the
>> org.apache.spark.sql package to access the private methods).
>>
>> That said, DSv2 is the first public streaming data source API. It's
>> really hard to design a stable, efficient and flexible data source API
>> that is unified between batch and streaming. DSv2 has evolved a lot in
>> the master branch, and hopefully there will be no big breaking changes
>> anymore.
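
[Editor's note: below is a minimal sketch of the "package hack" plus
SQLContext/SparkSession.internalCreateDataFrame approach described in this
thread, assuming Spark 2.4.x internals. The object name
StreamingDataFrameBridge is made up for illustration; internalCreateDataFrame
and CatalystTypeConverters are the private[sql]/catalyst pieces named above,
so treat this as an unsupported workaround, not a stable API.]

    // Must live under org/apache/spark/sql/ in your own project so that
    // private[sql] members become reachable (the "package hack").
    package org.apache.spark.sql

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.catalyst.{CatalystTypeConverters, InternalRow}
    import org.apache.spark.sql.types.StructType

    // Hypothetical helper -- not part of Spark.
    object StreamingDataFrameBridge {

      // Builds a DataFrame whose logical plan reports isStreaming = true,
      // which is what MicroBatchExecution's assert checks for.
      def streamingDataFrame(
          spark: SparkSession,
          rows: RDD[Row],
          schema: StructType): DataFrame = {
        // internalCreateDataFrame wants RDD[InternalRow], so convert
        // Row -> InternalRow per partition -- note this drags in a catalyst
        // dependency, exactly as pointed out above.
        val internalRows = rows.mapPartitions { iter =>
          val toCatalyst = CatalystTypeConverters.createToCatalystConverter(schema)
          iter.map(row => toCatalyst(row).asInstanceOf[InternalRow])
        }
        spark.internalCreateDataFrame(internalRows, schema, isStreaming = true)
      }
    }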

>> On Sat, Oct 5, 2019 at 12:24 PM Jungtaek Lim <kabhwan.opensou...@gmail.com>
>> wrote:
>>
>>> I remember an actual case from a developer who implemented a custom data
>>> source:
>>>
>>> https://lists.apache.org/thread.html/c1a210510b48bb1fea89828c8e2f5db8c27eba635e0079a97b0c7faf@%3Cdev.spark.apache.org%3E
>>>
>>> Quoting here:
>>> We started implementing DSv2 in the 2.4 branch, but quickly discovered
>>> that the DSv2 in 3.0 was a complete breaking change (to the point where
>>> it could have been named DSv3 and it wouldn't have come as a surprise).
>>> Since the DSv2 in 3.0 has a compatibility layer for DSv1 data sources,
>>> we decided to fall back to DSv1 in order to ease the future transition
>>> to Spark 3.
>>>
>>> Given that DSv2 for Spark 2.x and 3.x has diverged a lot, a realistic
>>> way of dealing with the DSv2 breaking change is to keep DSv1 as a
>>> temporary solution, even once DSv2 for 3.x becomes available. Developers
>>> need some time to make the transition.
>>>
>>> I'll file an issue to support streaming data sources in DSv1 and submit
>>> a patch, unless someone objects.
>>>
>>> On Wed, Oct 2, 2019 at 4:08 PM Jacek Laskowski <ja...@japila.pl> wrote:
>>>
>>>> Hi Jungtaek,
>>>>
>>>> Thanks a lot for your very prompt response!
>>>>
>>>> > Looks like it's missing, or intended to force custom streaming
>>>> > sources to be implemented as DSv2.
>>>>
>>>> That's exactly my understanding = no more DSv1 data sources. That,
>>>> however, is not consistent with the official message, is it? Spark
>>>> 2.4.4 does not actually say "we're abandoning DSv1", and people can't
>>>> really jump to DSv2 yet since it's not recommended (unless I missed
>>>> that).
>>>>
>>>> I love surprises (as that's where people pay more for consulting :)),
>>>> but not necessarily before public talks (with one at SparkAISummit in
>>>> two weeks!). It's gonna be challenging! I hope I won't spread the wrong
>>>> word.
>>>>
>>>> Pozdrawiam,
>>>> Jacek Laskowski
>>>> ----
>>>> https://about.me/JacekLaskowski
>>>> The Internals of Spark SQL https://bit.ly/spark-sql-internals
>>>> The Internals of Spark Structured Streaming
>>>> https://bit.ly/spark-structured-streaming
>>>> The Internals of Apache Kafka https://bit.ly/apache-kafka-internals
>>>> Follow me at https://twitter.com/jaceklaskowski
>>>>
>>>> On Wed, Oct 2, 2019 at 6:16 AM Jungtaek Lim
>>>> <kabhwan.opensou...@gmail.com> wrote:
>>>>
>>>>> Looks like it's missing, or intended to force custom streaming
>>>>> sources to be implemented as DSv2.
>>>>>
>>>>> I'm not sure the Spark community wants to expand the DSv1 API; I
>>>>> could propose the change if we get some support here.
>>>>>
>>>>> To the Spark community: given that we are bringing major changes to
>>>>> DSv2, some users may want to rely on DSv1 while the transition from
>>>>> the old DSv2 to the new DSv2 happens and the new DSv2 stabilizes.
>>>>> Would we like to provide the necessary changes to DSv1?
>>>>>
>>>>> Thanks,
>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>
>>>>> On Wed, Oct 2, 2019 at 4:27 AM Jacek Laskowski <ja...@japila.pl>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I think I've got stuck, and without your help I won't move any
>>>>>> further. Please help.
>>>>>>
>>>>>> I'm on Spark 2.4.4 and am developing a streaming Source (DSv1,
>>>>>> MicroBatch). In the getBatch phase, when requested for a DataFrame,
>>>>>> there is this assert [1] that I can't seem to get past with any
>>>>>> DataFrame I managed to create, as it's not streaming:
>>>>>>
>>>>>> assert(batch.isStreaming,
>>>>>>   s"DataFrame returned by getBatch from $source did not have isStreaming=true\n" +
>>>>>>   s"${batch.queryExecution.logical}")
>>>>>>
>>>>>> [1] https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala#L439-L441
>>>>>>
>>>>>> All I could find is private[sql], e.g.
>>>>>> SQLContext.internalCreateDataFrame(..., isStreaming = true) [2] or
>>>>>> Dataset.ofRows [3].
>>>>>>
>>>>>> [2] https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L422-L428
>>>>>> [3] https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L62-L81
>>>>>>
>>>>>> Pozdrawiam,
>>>>>> Jacek Laskowski
>>>>>> ----
>>>>>> https://about.me/JacekLaskowski
>>>>>> The Internals of Spark SQL https://bit.ly/spark-sql-internals
>>>>>> The Internals of Spark Structured Streaming
>>>>>> https://bit.ly/spark-structured-streaming
>>>>>> The Internals of Apache Kafka https://bit.ly/apache-kafka-internals
>>>>>> Follow me at https://twitter.com/jaceklaskowski
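
[Editor's note: for completeness, a toy DSv1 Source built on top of the
StreamingDataFrameBridge sketch above, again only a sketch against Spark
2.4.x internals. CounterSource and its package are made up, and the
StreamSourceProvider wiring needed to actually register the source with
spark.readStream.format(...) is omitted.]

    package com.example.streaming

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{DataFrame, Row, SQLContext, StreamingDataFrameBridge}
    import org.apache.spark.sql.execution.streaming.{LongOffset, Offset, Source}
    import org.apache.spark.sql.types.{LongType, StructField, StructType}

    // Toy source: emits one row (the current batch number) per micro-batch.
    class CounterSource(sqlContext: SQLContext) extends Source {

      private var counter = 0L

      override val schema: StructType =
        StructType(StructField("value", LongType) :: Nil)

      override def getOffset: Option[Offset] = {
        counter += 1
        Some(LongOffset(counter))
      }

      override def getBatch(start: Option[Offset], end: Offset): DataFrame = {
        val rows: RDD[Row] = sqlContext.sparkContext.parallelize(Seq(Row(counter)))
        // A plain sqlContext.createDataFrame(rows, schema) would trip the
        // assert quoted above, because its plan has isStreaming = false;
        // the bridge marks the plan as streaming instead.
        StreamingDataFrameBridge.streamingDataFrame(sqlContext.sparkSession, rows, schema)
      }

      override def stop(): Unit = ()
    }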