[jira] [Commented] (SPARK-26314) support Confluent encoded Avro in Spark Structured Streaming

Jordan Moore (JIRA) Sat, 15 Dec 2018 15:59:27 -0800


    [ 
https://issues.apache.org/jira/browse/SPARK-26314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16722326#comment-16722326
 ]


Jordan Moore commented on SPARK-26314:
--------------------------------------

It is powered by, yes. But I say it became purely based on the issue creator, 
who is now the CEO for the company that maintains and distributes the source 
for the more popular Avro Schema Registry used with Kafka. 

Putting that aside, how about we meet in the middle: append a section to the 
official Spark-Avro documentation on how to build said UDF or Encoder/Decoder 
for other forms of -Avro- binary serialization? (Note: we're no longer talking 
about just Avro, since, as mentioned in the Structured Streaming Kafka docs, 
the key and value are always returned as binary.)

The {{CAST(value AS STRING)}} and {{from_json}} functions seem to be 
well documented from what I have seen, but how to actually implement equivalent 
functions for other non-string, non-primitive datatypes seems to be missing. To 
quote the documentation: "Use DataFrame operations to explicitly deserialize the 
keys". Personally, I don't know what those operations are, or where to 
immediately find them. Does one then go to the SparkSQL page or the 
Java/Scala/Python docs for the DataFrame object and get flooded with functions 
that don't do what's actually needed, which is deserializing a simple message? 

Let's go through a scenario where, as a developer, all I know is that I have 
Avro data in Kafka. In this scenario, I am new to Spark, maybe new to Kafka as 
well, and let's say I do not know whether I am using "Confluent encoded Avro" 
or not. 
Naturally, one might go to the [SparkSQL > 
DataSources|https://spark.apache.org/docs/2.4.0/sql-data-sources.html] docs. 
Immediately confused, we see that it says "Avro *Files*"; but we are using 
Spark Structured Streaming with Kafka, and that doesn't sound like what is 
needed. We venture on anyway and find the link to the page listing the to/from 
Avro functions. To our surprise, it shows Kafka examples! We copy-paste some 
code into our application, then are confused when we either get an error or 
just get null returned to us, because the deserializer that expected an Avro 
schema as part of the message failed. Reading up and down this Avro SparkSQL 
page, we see nothing explaining why that might be, and so we venture off into 
the depths of Google, StackOverflow, JIRA, GitHub issues, etc., still not 
really understanding what the root problem is: that our Avro is really encoded 
differently. 
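
To make "encoded differently" concrete, here is a minimal sketch, assuming the 
commonly documented Confluent framing: one magic byte equal to 0, then a 4-byte 
big-endian schema ID, then the plain Avro binary body. The function and 
constant names are hypothetical and are not part of Spark or of any Confluent 
library; the sketch only splits the frame and does not look up the schema or 
decode the Avro body.

```python
import struct

# Assumed Confluent framing: 1 magic byte (0) + 4-byte big-endian schema ID.
MAGIC_BYTE = 0
HEADER_SIZE = 5


def split_confluent_frame(value: bytes):
    """Split a Kafka message value into (schema_id, avro_body).

    Hypothetical helper for illustration only. Raises ValueError when the
    bytes do not look like a Confluent-framed message.
    """
    if len(value) < HEADER_SIZE:
        raise ValueError("value is shorter than the 5-byte header")
    # ">bI" = big-endian, one signed byte, one unsigned 32-bit integer.
    magic, schema_id = struct.unpack(">bI", value[:HEADER_SIZE])
    if magic != MAGIC_BYTE:
        raise ValueError("unexpected magic byte: %d" % magic)
    # Everything after the header is the Avro binary body.
    return schema_id, value[HEADER_SIZE:]


# Build a fake frame with schema ID 42 and an arbitrary 3-byte body.
frame = b"\x00" + struct.pack(">I", 42) + b"\x02hi"
schema_id, body = split_confluent_frame(frame)
```

Under that assumption, a reader expecting plain Avro binary sees five 
unexpected leading bytes, which would be consistent with the errors or nulls 
described above. Something like this, wrapped in a UDF and applied to the 
binary value column before the Avro functions, is the kind of recipe the docs 
could show.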

I hope I've highlighted the issue here: make it less error-prone and more 
beginner-friendly rather than losing community trust that the product will 
solve their use cases.

> support Confluent encoded Avro in Spark Structured Streaming
> ------------------------------------------------------------
>
>                 Key: SPARK-26314
>                 URL: https://issues.apache.org/jira/browse/SPARK-26314
>             Project: Spark
>          Issue Type: Improvement
>          Components: Structured Streaming
>    Affects Versions: 2.4.0
>            Reporter: David Ahern
>            Priority: Major
>
> As Avro has now been added as a first class citizen,
> [https://spark.apache.org/docs/latest/sql-data-sources-avro.html]
> please make Confluent encoded avro work out of the box with Spark Structured 
> Streaming
> as described in this link, Avro messages on Kafka encoded with confluent 
> serializer also need to be decoded with confluent.  It would be great if this 
> worked out of the box
> [https://developer.ibm.com/answers/questions/321440/ibm-iidr-cdc-db2-to-kafka.html?smartspace=blockchain]
> here are details on the Confluent encoding
> [https://www.sderosiaux.com/articles/2017/03/02/serializing-data-efficiently-with-apache-avro-and-dealing-with-a-schema-registry/#encodingdecoding-the-messages-with-the-schema-id]
> It's been a year since i worked on anything to do with Avro and Spark 
> Structured Streaming, but i had to take an approach such as this when getting 
> it to work.  This is what i  used as a reference at that time
> [https://github.com/tubular/confluent-spark-avro]
> Also, here is another link i found that someone has done in the meantime
> [https://github.com/AbsaOSS/ABRiS]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
