[GitHub] spark pull request #21878: [SPARK-24924][SQL] Add mapping for built-in Avro ...

2018-07-26 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/21878


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21878: [SPARK-24924][SQL] Add mapping for built-in Avro ...

2018-07-26 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21878#discussion_r205365847
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
 ---
@@ -635,12 +637,6 @@ object DataSource extends Logging {
 "Hive built-in ORC data source must be used with Hive 
support enabled. " +
 "Please use the native ORC data source by setting 
'spark.sql.orc.impl' to " +
 "'native'")
-} else if (provider1.toLowerCase(Locale.ROOT) == "avro" ||
-  provider1 == "com.databricks.spark.avro") {
-  throw new AnalysisException(
--- End diff --

Thanks @gengliangwang for clarifying this.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21878: [SPARK-24924][SQL] Add mapping for built-in Avro ...

2018-07-26 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21878#discussion_r205360960
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
 ---
@@ -635,12 +637,6 @@ object DataSource extends Logging {
 "Hive built-in ORC data source must be used with Hive 
support enabled. " +
 "Please use the native ORC data source by setting 
'spark.sql.orc.impl' to " +
 "'native'")
-} else if (provider1.toLowerCase(Locale.ROOT) == "avro" ||
-  provider1 == "com.databricks.spark.avro") {
-  throw new AnalysisException(
--- End diff --

Looks same thing could happen in Kafka too already.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21878: [SPARK-24924][SQL] Add mapping for built-in Avro ...

2018-07-26 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21878#discussion_r205360211
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
 ---
@@ -635,12 +637,6 @@ object DataSource extends Logging {
 "Hive built-in ORC data source must be used with Hive 
support enabled. " +
 "Please use the native ORC data source by setting 
'spark.sql.orc.impl' to " +
 "'native'")
-} else if (provider1.toLowerCase(Locale.ROOT) == "avro" ||
-  provider1 == "com.databricks.spark.avro") {
-  throw new AnalysisException(
--- End diff --

Ah, I see. Okay. Then, how about this: I assume the documentation about 
Avro will be done to Spark for 2.4.0 soon. When it's done, we add a message 
here for Avro like please see the documentation to use Spark's avro package?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21878: [SPARK-24924][SQL] Add mapping for built-in Avro ...

2018-07-26 Thread gengliangwang
Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/21878#discussion_r205353836
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
 ---
@@ -635,12 +637,6 @@ object DataSource extends Logging {
 "Hive built-in ORC data source must be used with Hive 
support enabled. " +
 "Please use the native ORC data source by setting 
'spark.sql.orc.impl' to " +
 "'native'")
-} else if (provider1.toLowerCase(Locale.ROOT) == "avro" ||
-  provider1 == "com.databricks.spark.avro") {
-  throw new AnalysisException(
--- End diff --

I totally agree with the mapping, we should do it.
The comment here is about when Spark can't find any avro package, we should 
show a message for loading the spark-avro jar(org.apache.spark.sql.avro). 
Different from CSV, the package spark-avro is not loaded by default within 
Spark(at least as I tried spark-shell).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21878: [SPARK-24924][SQL] Add mapping for built-in Avro ...

2018-07-26 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21878#discussion_r205350947
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
 ---
@@ -635,12 +637,6 @@ object DataSource extends Logging {
 "Hive built-in ORC data source must be used with Hive 
support enabled. " +
 "Please use the native ORC data source by setting 
'spark.sql.orc.impl' to " +
 "'native'")
-} else if (provider1.toLowerCase(Locale.ROOT) == "avro" ||
-  provider1 == "com.databricks.spark.avro") {
-  throw new AnalysisException(
--- End diff --

Or is the Avro jar meant to be separately distributed? I thought it'd be 
included within Spark.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21878: [SPARK-24924][SQL] Add mapping for built-in Avro ...

2018-07-26 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21878#discussion_r205350125
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
 ---
@@ -635,12 +637,6 @@ object DataSource extends Logging {
 "Hive built-in ORC data source must be used with Hive 
support enabled. " +
 "Please use the native ORC data source by setting 
'spark.sql.orc.impl' to " +
 "'native'")
-} else if (provider1.toLowerCase(Locale.ROOT) == "avro" ||
-  provider1 == "com.databricks.spark.avro") {
-  throw new AnalysisException(
--- End diff --

Eh, if users were using the external avro, they will likely meet the error 
if they directly upgrade Spark.
Otherwise, users will see the release note that Avro pacakge is included in 
2.4.0, and they will not provide this jar.
If users miss this release note, then they will try to explicitly provide 
the thirdparty jar which will give the error message above.

FWIW, if it's fully qualified path, the thridparty jar will still be used 
in theory.

Did I misunderstand or miss something maybe?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21878: [SPARK-24924][SQL] Add mapping for built-in Avro ...

2018-07-26 Thread gengliangwang
Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/21878#discussion_r205348403
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
 ---
@@ -635,12 +637,6 @@ object DataSource extends Logging {
 "Hive built-in ORC data source must be used with Hive 
support enabled. " +
 "Please use the native ORC data source by setting 
'spark.sql.orc.impl' to " +
 "'native'")
-} else if (provider1.toLowerCase(Locale.ROOT) == "avro" ||
-  provider1 == "com.databricks.spark.avro") {
-  throw new AnalysisException(
--- End diff --

No, I mean by default the avro package is not loaded. E.g. If we start 
spark-shell without loading the jar, then it will show error "Failed to find 
data source: avro. Please find an Avro package at 
http://spark.apache.org/third-party-projects.html;. 



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21878: [SPARK-24924][SQL] Add mapping for built-in Avro ...

2018-07-26 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21878#discussion_r205346891
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
 ---
@@ -635,12 +637,6 @@ object DataSource extends Logging {
 "Hive built-in ORC data source must be used with Hive 
support enabled. " +
 "Please use the native ORC data source by setting 
'spark.sql.orc.impl' to " +
 "'native'")
-} else if (provider1.toLowerCase(Locale.ROOT) == "avro" ||
-  provider1 == "com.databricks.spark.avro") {
-  throw new AnalysisException(
--- End diff --

Actually, I think it would be okay. If user provide `avro`, it will show an 
error like:

```
17/05/10 09:47:44 WARN DataSource: Multiple sources found for csv 
(org.apache.spark.sql.execution.datasources.csv.CSVFileFormat,
com.databricks.spark.csv.DefaultSource15), defaulting to the internal 
datasource (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat).
```

in most cases (see https://github.com/apache/spark/pull/17916)


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21878: [SPARK-24924][SQL] Add mapping for built-in Avro ...

2018-07-26 Thread gengliangwang
Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/21878#discussion_r205345025
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
 ---
@@ -635,12 +637,6 @@ object DataSource extends Logging {
 "Hive built-in ORC data source must be used with Hive 
support enabled. " +
 "Please use the native ORC data source by setting 
'spark.sql.orc.impl' to " +
 "'native'")
-} else if (provider1.toLowerCase(Locale.ROOT) == "avro" ||
-  provider1 == "com.databricks.spark.avro") {
-  throw new AnalysisException(
--- End diff --

Should we show message to user for loading the built-in spark-avro jar? 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21878: [SPARK-24924][SQL] Add mapping for built-in Avro ...

2018-07-25 Thread dongjoon-hyun
GitHub user dongjoon-hyun opened a pull request:

https://github.com/apache/spark/pull/21878

[SPARK-24924][SQL] Add mapping for built-in Avro data source

## What changes were proposed in this pull request?

This PR aims to the followings.
1. Like `com.databricks.spark.csv` mapping, we had better map 
`com.databricks.spark.avro` to built-in Avro data source.
2. Remove incorrect error message, `Please find an Avro package at ...`.

## How was this patch tested?

Pass the newly added tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dongjoon-hyun/spark SPARK-24924

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21878.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21878


commit d95ba4081ac1188515b7e6363640700d56f2c93f
Author: Dongjoon Hyun 
Date:   2018-07-25T22:51:56Z

[SPARK-24924][SQL] Add mapping for built-in Avro data source




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org