[ https://issues.apache.org/jira/browse/SPARK-26314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16722326#comment-16722326 ]
Jordan Moore commented on SPARK-26314:
--------------------------------------

It is powered by, yes. But I'd say it became so purely because of the issue creator, who is now the CEO of the company that maintains and distributes the source for the more popular Avro Schema Registry used with Kafka.

Putting that aside, how about we meet in the middle: append a section to the official Spark-Avro documentation on how to build said UDF or Encoder/Decoder for other forms of -Avro- binary serialization? (Note: we're no longer talking about just Avro, since, as mentioned in the Structured Streaming Kafka docs, the key and value are always returned as binary.)

The {{CAST(value AS STRING)}} and {{from_json}} functions seem to be well documented from what I have seen, but how to actually implement such functions for other non-string, non-primitive datatypes seems to be missing. To quote the documentation: "Use DataFrame operations to explicitly deserialize the keys". Personally, I don't know what those operations are, or where to immediately find them. Does one then go to the SparkSQL page or the Java/Scala/Python docs for the DataFrame object and get flooded with functions that don't do what's really needed, which is deserializing a simple message?

Let's go through a scenario where, as a developer, all I know is that I have Avro data in Kafka. In this scenario, I am new to Spark, maybe new to Kafka as well, and let's say I do not know whether I am using "Confluent encoded Avro" or not.

Naturally, one might go to the [SparkSQL > DataSources|https://spark.apache.org/docs/2.4.0/sql-data-sources.html] docs. Immediately we are confused, because it says "Avro *Files*", but we are using Spark Structured Streaming with Kafka, and that doesn't sound like what is needed. We venture on anyway and find the link with the to/from Avro functions listed. To our surprise, it shows Kafka examples!
We copy-paste some code into our application, then are confused when we either get an error or just get null returned to us, because the deserializer that expected an Avro schema as part of the message failed. Reading up and down this Avro SparkSQL page, I see nothing explaining why that might be, and so we venture off into the depths of Google, StackOverflow, JIRA, GitHub issues, etc., still not really understanding what the root problem is: that our Avro is really encoded differently.

I hope I've highlighted the issue here: make it less error-prone and more beginner-friendly, rather than losing the community's trust that the product will solve their use cases.

> support Confluent encoded Avro in Spark Structured Streaming
> ------------------------------------------------------------
>
>                 Key: SPARK-26314
>                 URL: https://issues.apache.org/jira/browse/SPARK-26314
>             Project: Spark
>          Issue Type: Improvement
>          Components: Structured Streaming
>    Affects Versions: 2.4.0
>            Reporter: David Ahern
>            Priority: Major
>
> As Avro has now been added as a first class citizen,
> [https://spark.apache.org/docs/latest/sql-data-sources-avro.html]
> please make Confluent encoded avro work out of the box with Spark Structured
> Streaming
> as described in this link, Avro messages on Kafka encoded with confluent
> serializer also need to be decoded with confluent. It would be great if this
> worked out of the box
> [https://developer.ibm.com/answers/questions/321440/ibm-iidr-cdc-db2-to-kafka.html?smartspace=blockchain]
> here are details on the Confluent encoding
> [https://www.sderosiaux.com/articles/2017/03/02/serializing-data-efficiently-with-apache-avro-and-dealing-with-a-schema-registry/#encodingdecoding-the-messages-with-the-schema-id]
> It's been a year since i worked on anything to do with Avro and Spark
> Structured Streaming, but i had to take an approach such as this when getting
> it to work.
> This is what i used as a reference at that time
> [https://github.com/tubular/confluent-spark-avro]
> Also, here is another link i found that someone has done in the meantime
> [https://github.com/AbsaOSS/ABRiS]
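
For context on the "Confluent encoding" referenced in the issue description, here is a minimal sketch, not taken from this thread, of the message framing that the linked article describes: one magic byte (0), a 4-byte big-endian schema ID, then the Avro binary payload. The function name split_confluent_frame and the constants are hypothetical and chosen only for illustration. Looking up the schema in a registry and decoding the Avro payload are deliberately left out.

```python
import struct

# Assumed layout of a Confluent-framed Kafka value:
#   byte 0      magic byte, always 0
#   bytes 1-4   schema ID, unsigned 32-bit big-endian
#   bytes 5-    Avro binary payload (no embedded schema)
MAGIC_BYTE = 0
HEADER_LEN = 5


def split_confluent_frame(value: bytes):
    """Split a raw Kafka value into (schema_id, avro_payload).

    Hypothetical helper: raises ValueError when the bytes do not look
    like the Confluent framing described above.
    """
    if len(value) < HEADER_LEN:
        raise ValueError("message too short to carry a Confluent header")
    # ">" selects big-endian with no padding, so "BI" consumes exactly 5 bytes.
    magic, schema_id = struct.unpack(">BI", value[:HEADER_LEN])
    if magic != MAGIC_BYTE:
        raise ValueError(f"unexpected magic byte {magic}; not Confluent-framed")
    return schema_id, value[HEADER_LEN:]


if __name__ == "__main__":
    # Build a synthetic frame: magic byte, schema ID 42, placeholder payload.
    frame = bytes([MAGIC_BYTE]) + (42).to_bytes(4, "big") + b"avro-bytes"
    schema_id, payload = split_confluent_frame(frame)
    print(schema_id, payload)
```

Under this assumption, a plain Avro decoder given the whole value would begin reading at the magic byte instead of at the payload, which is consistent with the error or null result described in the comment. A UDF for Spark would likely wrap logic of this kind and pass only the payload, together with the schema resolved from the ID, to an Avro decoder.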