Hi,

Thanks so much for such a thorough conversation. Enjoyed it very much.
> Source/Sink traits are in org.apache.spark.sql.execution and thus they are private.

That would explain why I couldn't find the scaladocs.

Pozdrawiam,
Jacek Laskowski
----
https://about.me/JacekLaskowski
The Internals of Spark SQL https://bit.ly/spark-sql-internals
The Internals of Spark Structured Streaming https://bit.ly/spark-structured-streaming
The Internals of Apache Kafka https://bit.ly/apache-kafka-internals
Follow me at https://twitter.com/jaceklaskowski

On Wed, Oct 9, 2019 at 7:46 AM Wenchen Fan <cloud0...@gmail.com> wrote:

> Would you mind if I ask what the conditions are for being a public API?

The module doesn't matter, but the package matters. We have many public APIs in the catalyst module as well (e.g. DataType).

There are 3 packages in Spark SQL that are meant to be private:
1. org.apache.spark.sql.catalyst
2. org.apache.spark.sql.execution
3. org.apache.spark.sql.internal

You can check out the full list of private packages of Spark in project/SparkBuild.scala#Unidoc#ignoreUndocumentedPackages.

Basically, classes/interfaces that don't appear in the official Spark API doc are private.

Source/Sink traits are in org.apache.spark.sql.execution and thus they are private.

On Tue, Oct 8, 2019 at 6:19 AM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:

Would you mind if I ask what the conditions are for being a public API? The Source/Sink traits are not marked as @DeveloperApi but they are defined as public, and they live in sql-core, so they are not even semantically private (unlike catalyst); it's easy to take that as a signal that they are public APIs.

Also, if I'm not missing something here, creating a streaming DataFrame from an RDD[Row] is not available even through the private APIs. There are some other approaches that use private APIs: 1) SQLContext.internalCreateDataFrame - as it requires an RDD[InternalRow], implementers also have to depend on catalyst and deal with InternalRow, which the Spark community seems to want to change eventually; 2) Dataset.ofRows - it requires a LogicalPlan, which is also in catalyst. So implementers not only need to apply the "package hack" but also need to depend on catalyst.

On Mon, Oct 7, 2019 at 9:45 PM Wenchen Fan <cloud0...@gmail.com> wrote:

AFAIK there is no public streaming data source API before DSv2. The Source and Sink API is private and is only for built-in streaming sources. Advanced users can still implement custom stream sources with private Spark APIs (you can put your classes under the org.apache.spark.sql package to access the private methods).

That said, DSv2 is the first public streaming data source API. It's really hard to design a stable, efficient and flexible data source API that is unified between batch and streaming. DSv2 has evolved a lot in the master branch and hopefully there will be no big breaking changes anymore.
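A minimal sketch of that package trick, assuming Spark 2.4.x. The helper object below is made up for illustration and is not part of Spark; because the file declares package org.apache.spark.sql, it can reach the private[sql] SQLContext.internalCreateDataFrame and ask for isStreaming = true:

    package org.apache.spark.sql

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.sql.types.StructType

    // Declared inside org.apache.spark.sql only so that private[sql] members become visible.
    object StreamingDataFrameHelper {

      // Thin wrapper around the private[sql] SQLContext.internalCreateDataFrame so the
      // rest of a custom Source can live in its own package and never touch private[sql].
      def createStreamingDataFrame(
          sqlContext: SQLContext,
          rows: RDD[InternalRow],
          schema: StructType): DataFrame = {
        // isStreaming = true marks the underlying relation as streaming.
        sqlContext.internalCreateDataFrame(rows, schema, isStreaming = true)
      }
    }

Under the hood, internalCreateDataFrame essentially builds a LogicalRDD with isStreaming = true and hands it to Dataset.ofRows, which is the other private route Jungtaek mentions; either way the helper has to be compiled against catalyst because of InternalRow.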
On Sat, Oct 5, 2019 at 12:24 PM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:

I remembered the actual case from a developer who implements a custom data source:

https://lists.apache.org/thread.html/c1a210510b48bb1fea89828c8e2f5db8c27eba635e0079a97b0c7faf@%3Cdev.spark.apache.org%3E

Quoting here:

> We started implementing DSv2 in the 2.4 branch, but quickly discovered that the DSv2 in 3.0 was a complete breaking change (to the point where it could have been named DSv3 and it wouldn't have come as a surprise). Since the DSv2 in 3.0 has a compatibility layer for DSv1 datasources, we decided to fall back to DSv1 in order to ease the future transition to Spark 3.

Given that DSv2 for Spark 2.x and 3.x have diverged a lot, a realistic way of dealing with the DSv2 breaking change is to keep DSv1 as a temporary solution, even once DSv2 for 3.x is available. They need some time to make the transition.

I would file an issue to support streaming data sources in DSv1 and submit a patch, unless someone objects.

On Wed, Oct 2, 2019 at 4:08 PM Jacek Laskowski <ja...@japila.pl> wrote:

Hi Jungtaek,

Thanks a lot for your very prompt response!

> Looks like it's missing, or intended to force custom streaming sources to be implemented as DSv2.

That's exactly my understanding = no more DSv1 data sources. That, however, is not consistent with the official message, is it? Spark 2.4.4 does not actually say "we're abandoning DSv1", and people would not really want to jump onto DSv2 since it's not recommended (unless I missed that).

I love surprises (as that's where people pay more for consulting :)), but not necessarily before public talks (with one at SparkAISummit in two weeks!). Gonna be challenging! Hope I won't spread a wrong word.

Pozdrawiam,
Jacek Laskowski

On Wed, Oct 2, 2019 at 6:16 AM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:

Looks like it's missing, or intended to force custom streaming sources to be implemented as DSv2.

I'm not sure the Spark community wants to expand the DSv1 API; I could propose the change if we get some support here.

To the Spark community: given that we bring major changes in DSv2, someone may want to rely on DSv1 while the transition from the old DSv2 to the new DSv2 happens and the new DSv2 gets stabilized. Would we like to provide the necessary changes in DSv1?

Thanks,
Jungtaek Lim (HeartSaVioR)

On Wed, Oct 2, 2019 at 4:27 AM Jacek Laskowski <ja...@japila.pl> wrote:

Hi,

I think I've got stuck and without your help I won't move any further. Please help.

I'm on Spark 2.4.4 and am developing a streaming Source (DSv1, MicroBatch), and in the getBatch phase, when a DataFrame is requested, there is this assert [1] I can't seem to get past with any DataFrame I managed to create, as it's not streaming:

    assert(batch.isStreaming,
      s"DataFrame returned by getBatch from $source did not have isStreaming=true\n" +
        s"${batch.queryExecution.logical}")

[1] https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala#L439-L441
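(A quick illustration, not part of the original mail, of what that assert checks, assuming a SparkSession named spark: isStreaming is only true when the logical plan contains a streaming source, which is exactly what the public batch APIs never produce.)

    // Any DataFrame built from a local collection or a batch source fails the assert.
    val batchDF = spark.range(3).toDF("value")
    batchDF.isStreaming      // false

    // Only plans rooted in a streaming source pass, e.g. the built-in rate source.
    val streamingDF = spark.readStream.format("rate").load()
    streamingDF.isStreaming  // true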
All I could find is private[sql], e.g. SQLContext.internalCreateDataFrame(..., isStreaming = true) [2] or [3].

[2] https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L422-L428
[3] https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L62-L81

Pozdrawiam,
Jacek Laskowski
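Putting the pieces of the thread together, a sketch (not from the original mails) of a DSv1 source whose getBatch satisfies the assert above by going through the hypothetical StreamingDataFrameHelper shown earlier. The class name, package, schema, and toy offset handling are all made up; in a real source the SQLContext would typically be handed over by StreamSourceProvider.createSource:

    package com.example.source  // any package; only the helper needs the Spark package

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{DataFrame, SQLContext, StreamingDataFrameHelper}
    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.sql.execution.streaming.{LongOffset, Offset, Source}
    import org.apache.spark.sql.types.{LongType, StructField, StructType}

    // sqlContext is assumed to be passed in from StreamSourceProvider.createSource.
    class ExampleSource(sqlContext: SQLContext) extends Source {

      override val schema: StructType = StructType(Seq(StructField("value", LongType)))

      // Toy offset handling: a single fixed offset, just enough to drive one micro-batch.
      override def getOffset: Option[Offset] = Some(LongOffset(0L))

      override def getBatch(start: Option[Offset], end: Offset): DataFrame = {
        // Build catalyst rows directly; richer schemas would go through RowEncoder
        // (Spark 2.x) or CatalystTypeConverters instead.
        val rows: RDD[InternalRow] =
          sqlContext.sparkContext.parallelize(0L until 5L).map(v => InternalRow(v))

        // isStreaming = true on the resulting plan is what satisfies the assert above.
        StreamingDataFrameHelper.createStreamingDataFrame(sqlContext, rows, schema)
      }

      override def stop(): Unit = {}
    }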