[jira] [Commented] (SPARK-26314) support Confluent encoded Avro in Spark Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-26314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695828#comment-17695828 ] Gustavo Martin commented on SPARK-26314:

My team just stumbled upon this problem :( I was hoping Spark would make use of the Avro capabilities for finding the right schema associated with an event when using a Schema Registry.

> support Confluent encoded Avro in Spark Structured Streaming
>
> Key: SPARK-26314
> URL: https://issues.apache.org/jira/browse/SPARK-26314
> Project: Spark
> Issue Type: Improvement
> Components: Structured Streaming
> Affects Versions: 2.4.0
> Reporter: David Ahern
> Priority: Major
>
> As Avro has now been added as a first-class citizen,
> [https://spark.apache.org/docs/latest/sql-data-sources-avro.html]
> please make Confluent-encoded Avro work out of the box with Spark Structured Streaming.
> As described in this link, Avro messages on Kafka encoded with the Confluent serializer also need to be decoded with Confluent. It would be great if this worked out of the box.
> [https://developer.ibm.com/answers/questions/321440/ibm-iidr-cdc-db2-to-kafka.html?smartspace=blockchain]
> Here are details on the Confluent encoding:
> [https://www.sderosiaux.com/articles/2017/03/02/serializing-data-efficiently-with-apache-avro-and-dealing-with-a-schema-registry/#encodingdecoding-the-messages-with-the-schema-id]
> It's been a year since I worked on anything to do with Avro and Spark Structured Streaming, but I had to take an approach such as this when getting it to work. This is what I used as a reference at that time:
> [https://github.com/tubular/confluent-spark-avro]
> Also, here is another link I found that someone has done in the meantime:
> [https://github.com/AbsaOSS/ABRiS]

--
This message was sent by Atlassian Jira (v8.20.10#820010)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
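For anyone landing on this thread: the Confluent framing described in the links above is tiny and easy to sketch. Here is an illustrative pure-Python sketch (this is not Spark or Confluent code; the helper names `frame`/`unframe` are my own):

```python
import struct

# Confluent wire format, as described in the links above:
#   byte 0     - "magic byte", always 0
#   bytes 1-4  - schema ID as a big-endian 32-bit integer (Schema Registry lookup key)
#   bytes 5... - the Avro binary-encoded record, WITHOUT an embedded schema

MAGIC_BYTE = 0

def frame(schema_id: int, avro_payload: bytes) -> bytes:
    """Prepend the 5-byte Confluent header to an Avro binary payload."""
    return struct.pack(">bI", MAGIC_BYTE, schema_id) + avro_payload

def unframe(message: bytes) -> tuple:
    """Split a Confluent-framed Kafka value into (schema_id, avro_payload)."""
    magic, schema_id = struct.unpack(">bI", message[:5])
    if magic != MAGIC_BYTE:
        raise ValueError("not Confluent-framed: bad magic byte")
    return schema_id, message[5:]

msg = frame(42, b"\x06foo")  # pretend b"\x06foo" is an Avro-encoded record
print(unframe(msg))          # -> (42, b'\x06foo')
```

This is why a plain Avro decoder fails on these messages: the first five bytes are not Avro at all, and the schema itself lives only in the registry.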
[ https://issues.apache.org/jira/browse/SPARK-26314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16722326#comment-16722326 ] Jordan Moore commented on SPARK-26314:

It is powered by, yes. But I say it "became" purely based on the issue creator, who is now the CEO of the company that maintains and distributes the source for the most popular Avro Schema Registry used with Kafka.

Putting that aside, how about we meet in the middle -- append a section to the official Spark-Avro documentation on how to build such a UDF or Encoder/Decoder for other forms of binary serialization? (Note: we're no longer talking about just Avro anymore, since, as mentioned in the Structured Streaming Kafka docs, the key and value are always returned as binary.)

The {{cast(value) as STRING}} and {{from_json}} functions seem to be well documented from what I have seen, but how to actually implement such functions for other non-string, non-primitive datatypes seems to be missing. To quote the documentation, "Use DataFrame operations to explicitly deserialize the keys". Personally, I don't know what those operations are, or where to immediately find them. Does one then go to the SparkSQL page or the Java/Scala/Python docs for the DataFrame object and get flooded with functions that don't do what's really needed, which is deserializing a simple message?

Let's go through a scenario where, as a developer, all I know is that I have Avro data in Kafka. In this scenario, I am new to Spark, maybe new to Kafka as well, and let's say I do not know whether I am using "Confluent encoded Avro" or not. Naturally, one might go to the [SparkSQL > DataSources|https://spark.apache.org/docs/2.4.0/sql-data-sources.html] docs. Immediately, confused, we see that it says "Avro *Files*"; but we are using Spark Structured Streaming with Kafka, and that doesn't sound like what is needed. In any case, we venture on anyway to find the page with the to_avro/from_avro functions listed.

To our surprise, it shows Kafka examples! We copy-paste some code into our application, then are confused when we either get an error or just null returned to us, because the deserializer that expected an Avro schema as part of the message failed. Reading up and down this Avro SparkSQL page, I see nothing explaining why that might be, and so we venture off into the depths of Google, StackOverflow, JIRA, GitHub issues, etc., still not really understanding what the root problem is -- that our Avro is encoded differently.

I hope I've highlighted the issue here - make it less error-prone and more beginner-friendly rather than losing community trust that the product will solve their use cases.
[ https://issues.apache.org/jira/browse/SPARK-26314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16722322#comment-16722322 ] Dongjoon Hyun commented on SPARK-26314:

Sorry, but I totally disagree with the first point.

bq. AVRO-1124 effectively became the Confluent Schema Registry

One could say the opposite: the `Confluent Schema Registry` is powered by Apache Avro.

For the second point, `--packages` allows any libraries. Again, Apache Spark's built-in Avro library also works in the same way, and there was no change from the previous Apache Spark behavior. If a company's product has problems, before or now, the company had better fix those problems inside its own libraries instead of in the Apache Spark community.

My suggestion is to file a general Apache Spark issue if there are specific features or bugs on UDFs or Encoders/Decoders in general.
[ https://issues.apache.org/jira/browse/SPARK-26314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16722314#comment-16722314 ] Jordan Moore commented on SPARK-26314:

[~dongjoon], I however feel that AVRO-1124 won't get any traction. Even Avro releases themselves have been nearly non-existent over the last few years. My point was that AVRO-1124 effectively _became_ the Confluent Schema Registry, with Jay now being CEO of Confluent.

On my second point: while I understand Apache projects are maintained by different developers, each with their own direction for their project, those other projects _do support_ it. You're effectively losing a large portion of developers by not having it easily available as a Spark-offered library, or at the very, very least showing how to integrate it in the Spark documentation. Simply adding --packages is not enough, as it still requires a SparkSQL UDF or Encoder/Decoder to wrap Confluent's KafkaAvroSerializer classes.
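To make the "wrap it in a UDF" point concrete, here is a rough, hypothetical sketch of the glue each team currently has to write for itself. This is plain Python for illustration only: a dict stands in for the Schema Registry HTTP client, and the decode step is a stub where real code would use Confluent's KafkaAvroDeserializer or an Avro DatumReader with the fetched schema.

```python
import struct

# Stand-in for the Schema Registry client: schema_id -> Avro schema JSON.
# Real code would do an HTTP GET against /schemas/ids/{id} (and cache it).
FAKE_REGISTRY = {7: '{"type": "string"}'}

def decode_confluent_value(value: bytes, registry=FAKE_REGISTRY) -> str:
    """Hypothetical body of a UDF that decodes one Confluent-framed Kafka value."""
    magic, schema_id = struct.unpack(">bI", value[:5])
    if magic != 0:
        raise ValueError("value is not Confluent-framed")
    schema = registry[schema_id]   # registry lookup by the 4-byte schema ID
    payload = value[5:]
    # Stub "decode": real code would deserialize `payload` using `schema`.
    return f"decoded {len(payload)} byte(s) using schema {schema}"

# In PySpark this would be registered with something like:
#   spark.udf.register("from_confluent_avro", decode_confluent_value)
print(decode_confluent_value(b"\x00\x00\x00\x00\x07payload"))
```

Only the shape matters here: header parse, registry fetch, then an ordinary Avro decode. None of the three steps is documented in the Spark Avro guide today.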
[ https://issues.apache.org/jira/browse/SPARK-26314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16722310#comment-16722310 ] Dongjoon Hyun commented on SPARK-26314:

[~cricket007]. Thank you for mentioning AVRO-1124 and the related issues. :)

So, I believe Apache Spark will automatically support what you requested (`Confluent encoded Avro`) if AVRO-1124 and the related issues are resolved and released as a part of Apache Avro.
[ https://issues.apache.org/jira/browse/SPARK-26314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16722240#comment-16722240 ] Jordan Moore commented on SPARK-26314:

Hi [~dongjoon], I'm sure you are speaking for the Spark community, and perhaps think that "Confluent Platform" is a specific, customized version of Kafka, or that the Confluent Schema Registry isn't using Apache Avro? Neither is true. The Schema Registry is an HTTP service that sits alongside Kafka (in Confluent's implementation) and acts as a kind of key-value database for Apache Avro schemas. And, as a product, it grew out of Jay Kreps' initial proposal at https://issues.apache.org/jira/browse/AVRO-1124

From a streaming perspective, it is more performant because every message is not required to contain the full Avro schema, only a reference to an external source. Therefore, I think the proposal here is to open up or extend the Avro encoders in Spark to allow for these other ways to get data in that format.

Why can't Spark add support for it like other projects in this streaming space? When others see these projects, they may feel Spark is behind, in a way.
- Flink - https://github.com/apache/flink/tree/master/flink-formats/flink-avro-confluent-registry
- NiFi - https://github.com/apache/nifi/tree/master/nifi-nar-bundles/nifi-confluent-platform-bundle
- Streamsets - https://streamsets.com/blog/evolving-avro-schemas-apache-kafka-streamsets-data-collector/

Other than Hortonworks and HDInsight-derived products using their own registry, I feel the Confluent Schema Registry is the only other similar product out there for a well-known Avro-on-Kafka implementation. Otherwise, we see these other GitHub issues/repos and StackOverflow questions pop up where people are basically baking the Confluent Schema Registry HTTP client Java class in themselves, fetching a schema, then parsing it in Spark, and continuing on their way using Avro as normal.

However, as I mentioned above regarding storing the whole schema as part of every message, this is not as performant as it could be; the Kafka serializers & deserializers within Spark shouldn't need to care whether the schema is part of the message content or not.

Thoughts?
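The performance point above can be quantified with a back-of-the-envelope sketch. The record schema below is made up purely to get a realistic-ish size in bytes, and note a real Avro container file amortizes its schema per file, not per message; per-message embedding is what the hand-rolled workarounds imply.

```python
import json

# A made-up record schema, just to measure a plausible per-message cost.
schema = json.dumps({
    "type": "record", "name": "Order", "fields": [
        {"name": "id", "type": "long"},
        {"name": "customer", "type": "string"},
        {"name": "amount", "type": "double"},
    ],
})

schema_bytes = len(schema.encode("utf-8"))  # embedded-schema overhead per message
registry_bytes = 5                          # Confluent header: magic byte + 4-byte schema ID

print(f"embedding the schema costs {schema_bytes} bytes per message")
print(f"a registry reference costs {registry_bytes} bytes per message")
print(f"over 1M messages that is ~{(schema_bytes - registry_bytes) / 1e6:.1f} MB saved")
```

Real schemas are usually far larger than this toy one, so the gap only widens in practice.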
[ https://issues.apache.org/jira/browse/SPARK-26314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16714133#comment-16714133 ] Dongjoon Hyun commented on SPARK-26314:

One more thing: Apache Spark supports Apache Kafka, not a single vendor's Kafka distribution. That's the simplest way to support all Apache Kafka-powered distributions.
[ https://issues.apache.org/jira/browse/SPARK-26314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16714131#comment-16714131 ] Dongjoon Hyun commented on SPARK-26314:

Hi, [~davidahern]. Apache Spark supports Apache Avro. I don't see any specific reason why Spark should support a specific vendor's library. If you look at the release notes, Spark supports Avro with `--packages`, which is the same way 3rd-party libraries are supported.

I'm not sure what the problem with Confluent-encoded Avro is. But, in general, if a 3rd-party library has any problem out of the box, it's the problem of that library. The creator should fix and release it in their own repository.
[ https://issues.apache.org/jira/browse/SPARK-26314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16713986#comment-16713986 ] David Ahern commented on SPARK-26314:

https://docs.confluent.io/current/schema-registry/docs/serializer-formatter.html
[ https://issues.apache.org/jira/browse/SPARK-26314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16713985#comment-16713985 ] David Ahern commented on SPARK-26314:

https://stackoverflow.com/questions/48882723/integrating-spark-structured-streaming-with-the-kafka-schema-registry