justaparth commented on code in PR #41498: URL: https://github.com/apache/spark/pull/41498#discussion_r1223192039
########## connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala: ##########
@@ -247,12 +247,86 @@ private[sql] class ProtobufDeserializer(
         updater.setLong(ordinal, micros + TimeUnit.NANOSECONDS.toMicros(nanoSeconds))

       case (MESSAGE, StringType)
-          if protoType.getMessageType.getFullName == "google.protobuf.Any" =>
+        if protoType.getMessageType.getFullName == "google.protobuf.Any" =>
         (updater, ordinal, value) =>
           // Convert 'Any' protobuf message to JSON string.
           val jsonStr = jsonPrinter.print(value.asInstanceOf[DynamicMessage])
           updater.set(ordinal, UTF8String.fromString(jsonStr))
+      // Handle well known wrapper types. We unpack the value field instead of keeping

Review Comment:
> Can we have a concrete example where this makes a difference? What problem are we solving?
> Can we have a fully spelled out example in Spark that shows the benefits?

Did you check the PR description? There's an example of what the deserialization would look like before and after. Beyond that, I'm not sure what kind of example would help. Could you be specific about what you're looking for here?

> Wrapper types are used because the wrapper is important, otherwise no need to use it. I don't see how stripping the wrapper is the right thing.
> Can we have a fully spelled out example in Spark that shows the benefits?

Let me give you a concrete example from the existing code. There are well-known Timestamp and Duration types, and the spark-protobuf library already handles them in a custom way, converting them to TimestampType / DayTimeIntervalType in Spark.
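To make both conversions under discussion concrete, here is a minimal Python sketch. Plain ints and dicts stand in for protobuf messages and Spark rows; the function names and the `age` field name are illustrative assumptions, not the connector's API.

```python
# Existing special case: google.protobuf.Timestamp carries (seconds, nanos)
# and is NOT surfaced as a two-field struct; it is converted to Spark's
# TimestampType, i.e. microseconds since the epoch (mirrors the deserializer's
# `micros + TimeUnit.NANOSECONDS.toMicros(nanoSeconds)` logic).
def timestamp_to_micros(seconds: int, nanos: int) -> int:
    return seconds * 1_000_000 + nanos // 1_000

# Analogous treatment proposed for wrapper types: unpack the single
# `value` field instead of surfacing a one-field nested struct.
def deserialize_wrapper_field(keep_wrapper: bool, value: int) -> dict:
    if keep_wrapper:
        return {"age": {"value": value}}   # before: nested struct
    return {"age": value}                  # after: plain primitive

print(timestamp_to_micros(1, 500_000))       # 1000500
print(deserialize_wrapper_field(True, 42))   # {'age': {'value': 42}}
print(deserialize_wrapper_field(False, 42))  # {'age': 42}
```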
proto message in tests: https://github.com/apache/spark/blob/master/connector/protobuf/src/test/resources/protobuf/functions_suite.proto#L167-L175
schema conversion logic: https://github.com/apache/spark/blob/master/connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/SchemaConverters.scala#L79-L90
deserialization logic: https://github.com/apache/spark/blob/master/connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala#L229-L247

By your logic, we should NOT be doing this; we should treat them as opaque messages. However, the library makes a decision based on the semantic meaning of those well-known types. They're not just opaque messages but messages with a well-known meaning, and the library is saying "I know what these are meant to be and I will deserialize them differently." In the same way, the rest of the wrapper types are not simply opaque messages; they are intended to feel like primitives, and I think we should handle the rest of the well-known types just like we're handling Timestamp / Duration.

_(As an aside: the matching code for Timestamp and Duration seems overly broad; it will catch any class that just happens to be named Timestamp or Duration, which is likely not what we want. And the tests aren't referencing the google protobuf classes but an exact copy of them committed to the repo. I can send a separate PR for consideration of how to fix that.)_

> These are just utilities, not a Protobuf spec.

Protobuf actually defines how the JSON output should look, so that is part of the spec. But I agree they never said "the Spark struct should look like X". I just think that a Spark struct is much closer to JSON (a data format, not code) than to Java classes or the like, so JSON provides a good guide. And as I said above, the existing code already handles some well-known types in a special way.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org