[ https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16569071#comment-16569071 ]
Hyukjin Kwon edited comment on SPARK-24924 at 8/4/18 5:44 AM: -------------------------------------------------------------- If it already throws an error for CSV case too, I would prefer to have the improved error message of course. {quote} I don't buy this agrument, the code has been restructured a lot and you could have introduced bugs, behavior changes, etc. {quote} I have followed the changes in Avro and I don't think there are big differences. We should keep the behaviours in particular within 2.4.0. If I missed some and this introduced a bug or behaviour changes, I personally think we should fix them within 2.4.0. That was one of key things I took into account when I merged some changes. {quote} Users could have also made their own modified version of the databricks spark-avro package (which we actually have to support primitive types) and thus the implementation is not the same and yet you are assuming it is. {quote} In this case, users should provide their own short name of the package. I would say it's discouraged to use the same name with Spark's builtin datasources, or other packages name reserved - I wonder if users would actually try to have the same name in practice. {quote} I'm worried about other users who didn't happen to see this jira. {quote} We will make this in the release note - I think I listed up the possible stories about this in https://issues.apache.org/jira/browse/SPARK-24924?focusedCommentId=16567708&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16567708 {quote} I also realize these are 3rd party packages but I think we are making the assumption here based on this being a databricks package, which in my opinion we shouldn't. What if this was companyX package which we didn't know about, what would/should be the expected behavior? {quote} I think the main reason for this is that the code is actually ported from Avro {{com.databricks.\*}}. The problem here is a worry that {{com.databricks.*}} indicates the builtin Avro, right? {quote} How many users complained about the csv thing? {quote} For instance, I saw these comments/issues below: https://github.com/databricks/spark-csv/issues/367 https://github.com/databricks/spark-csv/issues/373 https://github.com/apache/spark/pull/17916#issuecomment-301898567 For clarification, it's not personally related to me in any way at all but I thought we better keep it consistent with CSV's. To sum up, I get your position but I think the current approach makes a coherent point too. In that case, I think we better follow what we have done with CSV. was (Author: hyukjin.kwon): If it already throws an error for CSV case too, I would prefer to have the improved error message of course. {quote} I don't buy this agrument, the code has been restructured a lot and you could have introduced bugs, behavior changes, etc. {quote} I have followed the changes in Avro and I don't think there are big differences. We should keep the behaviours in particular within 2.4.0. If I missed some and this introduced a bug or behaviour changes, I personally think we should fix them within 2.4.0. That was one of key things I took into account when I merged some changes. {quote} Users could have also made their own modified version of the databricks spark-avro package (which we actually have to support primitive types) and thus the implementation is not the same and yet you are assuming it is. {quote} In this case, users should provide their own short name of the package. I would say it's discouraged to use the same name with Spark's builtin datasources, or other packages name reserved - I wonder if users would actually try to have the same name in practice. {quote} I'm worried about other users who didn't happen to see this jira. {quote} We will make this in the release note - I think I listed up the possible stories about this in https://issues.apache.org/jira/browse/SPARK-24924?focusedCommentId=16567708&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16567708 {quote} I also realize these are 3rd party packages but I think we are making the assumption here based on this being a databricks package, which in my opinion we shouldn't. What if this was companyX package which we didn't know about, what would/should be the expected behavior? {quote} I think the main reason for this is that the code is actually ported from Avro {{com.databricks.\*}}. The problem here is a worry that {{com.databricks.*}} indicates the builtin Avro, right? {quote} How many users complained about the csv thing? {quote} For instance, I saw these comments/issues below: https://github.com/databricks/spark-csv/issues/367 https://github.com/databricks/spark-csv/issues/373 https://github.com/apache/spark/pull/17916#issuecomment-301898567 For clarification, it's related to me in any way but I thought we better keep it consistent with CSV's. To sum up, I get your position but I think the current approach makes a coherent point too. In that case, I think we better follow what we have done with CSV. > Add mapping for built-in Avro data source > ----------------------------------------- > > Key: SPARK-24924 > URL: https://issues.apache.org/jira/browse/SPARK-24924 > Project: Spark > Issue Type: Sub-task > Components: SQL > Affects Versions: 2.4.0 > Reporter: Dongjoon Hyun > Assignee: Dongjoon Hyun > Priority: Minor > Fix For: 2.4.0 > > > This issue aims to the followings. > # Like `com.databricks.spark.csv` mapping, we had better map > `com.databricks.spark.avro` to built-in Avro data source. > # Remove incorrect error message, `Please find an Avro package at ...`. -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org