[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source
[ https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16569104#comment-16569104 ] Hyukjin Kwon commented on SPARK-24924: -- For a fully qualified path, we can already specify something like {{com.databricks.spark.avro.AvroFormat}}, and I guess that will use the third-party one if I am not mistaken. Probably we should not do this, but this is what we do with CSV, which kind of makes a point as well. Wouldn't we be better off just following what we do? If we should make an error for this case, I guess it should target 3.0.0 for CSV and revert this PR. > Add mapping for built-in Avro data source > - > > Key: SPARK-24924 > URL: https://issues.apache.org/jira/browse/SPARK-24924 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 2.4.0 > > > This issue aims to the followings. > # Like `com.databricks.spark.csv` mapping, we had better map > `com.databricks.spark.avro` to built-in Avro data source. > # Remove incorrect error message, `Please find an Avro package at ...`. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14220) Build and test Spark against Scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-14220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16569088#comment-16569088 ] Rifaqat Shah commented on SPARK-14220: -- great! congratulations and thanks.. > Build and test Spark against Scala 2.12 > --- > > Key: SPARK-14220 > URL: https://issues.apache.org/jira/browse/SPARK-14220 > Project: Spark > Issue Type: Umbrella > Components: Build, Project Infra >Affects Versions: 2.1.0 >Reporter: Josh Rosen >Priority: Blocker > Labels: release-notes > Fix For: 2.4.0 > > > This umbrella JIRA tracks the requirements for building and testing Spark > against the current Scala 2.12 milestone. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source
[ https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16569085#comment-16569085 ] Wenchen Fan commented on SPARK-24924: - when the short name conflicts, I feel it's better to pick the built-in data source than failing the job and say it conflicts. When the full class name of the data source is specified like com.databricks.spark.avro, we should respect it. > Add mapping for built-in Avro data source > - > > Key: SPARK-24924 > URL: https://issues.apache.org/jira/browse/SPARK-24924 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 2.4.0 > > > This issue aims to the followings. > # Like `com.databricks.spark.csv` mapping, we had better map > `com.databricks.spark.avro` to built-in Avro data source. > # Remove incorrect error message, `Please find an Avro package at ...`. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
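For illustration, a minimal sketch of the two spellings the comments above distinguish, a short name versus a fully qualified provider name. The paths are hypothetical, and which implementation wins for each spelling is exactly the question under discussion in this thread.
{code:scala}
// Short name: resolved through Spark's built-in source mapping (assuming the
// built-in Avro module is on the classpath in 2.4).
val byShortName = spark.read.format("avro").load("/tmp/events.avro")

// Fully qualified provider name, spelled out explicitly by the user; the
// debate above is whether this should keep resolving to a third-party class
// on the classpath or be redirected to the built-in implementation.
val byFullName = spark.read
  .format("com.databricks.spark.avro")
  .load("/tmp/events.avro")
{code}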
[jira] [Resolved] (SPARK-24940) Coalesce Hint for SQL Queries
[ https://issues.apache.org/jira/browse/SPARK-24940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-24940. - Resolution: Fixed Assignee: John Zhuge Fix Version/s: 2.4.0 > Coalesce Hint for SQL Queries > - > > Key: SPARK-24940 > URL: https://issues.apache.org/jira/browse/SPARK-24940 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.1 >Reporter: John Zhuge >Assignee: John Zhuge >Priority: Major > Fix For: 2.4.0 > > > Many Spark SQL users in my company have asked for a way to control the number > of output files in Spark SQL. The users prefer not to use function > repartition\(n\) or coalesce(n, shuffle) that require them to write and > deploy Scala/Java/Python code. > > There are use cases to either reduce or increase the number. > > The DataFrame API has repartition/coalesce for a long time. However, we do > not have an equivalent functionality in SQL queries. We propose adding the > following Hive-style Coalesce hint to Spark SQL. > {noformat} > /*+ COALESCE(n, shuffle) */ > /*+ REPARTITION(n) */ > {noformat} > REPARTITION\(n\) is equal to COALESCE(n, shuffle=true). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
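For illustration, a hedged sketch of how the hints described above look in practice, shown here through the Scala spark.sql entry point (the same statements work in plain SQL). The table name t1 and the partition counts are hypothetical.
{code:scala}
// Reduce the number of output partitions (and therefore files) without a shuffle.
val coalesced = spark.sql("SELECT /*+ COALESCE(5) */ * FROM t1")

// Force a shuffle into exactly 10 partitions; per the description this is the
// COALESCE(n, shuffle=true) form.
val repartitioned = spark.sql("SELECT /*+ REPARTITION(10) */ * FROM t1")

coalesced.write.parquet("/tmp/t1_coalesced")  // illustrative output path
{code}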
[jira] [Resolved] (SPARK-24722) Column-based API for pivoting
[ https://issues.apache.org/jira/browse/SPARK-24722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24722. -- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21699 [https://github.com/apache/spark/pull/21699] > Column-based API for pivoting > - > > Key: SPARK-24722 > URL: https://issues.apache.org/jira/browse/SPARK-24722 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 2.4.0 > > > Currently, the pivot() function accepts the pivot column as a string. It is > not consistent with the groupBy API and causes an additional problem when using nested > columns as the pivot column. > `Column` support is needed for (a) API consistency, (b) user productivity and > (c) performance. In general, we should follow the POLA - > https://en.wikipedia.org/wiki/Principle_of_least_astonishment in designing > the API. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24722) Column-based API for pivoting
[ https://issues.apache.org/jira/browse/SPARK-24722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-24722: Assignee: Maxim Gekk > Column-based API for pivoting > - > > Key: SPARK-24722 > URL: https://issues.apache.org/jira/browse/SPARK-24722 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 2.4.0 > > > Currently, the pivot() function accepts the pivot column as a string. It is > not consistent with the groupBy API and causes an additional problem when using nested > columns as the pivot column. > `Column` support is needed for (a) API consistency, (b) user productivity and > (c) performance. In general, we should follow the POLA - > https://en.wikipedia.org/wiki/Principle_of_least_astonishment in designing > the API. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
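For illustration, a minimal sketch of the difference this ticket introduces, using a hypothetical sales DataFrame; the string overload is the pre-existing API and the Column overload is the one added here.
{code:scala}
// assumes a spark-shell session with spark.implicits._ in scope
import org.apache.spark.sql.functions._

val sales = Seq((2018, "dotNET", 5000.0), (2018, "Java", 20000.0), (2019, "dotNET", 10000.0))
  .toDF("year", "course", "earnings")

// Existing API: the pivot column is passed as a string.
val byString = sales.groupBy($"year").pivot("course").sum("earnings")

// New API: the pivot column is passed as a Column, which keeps the call
// consistent with groupBy and allows expressions such as nested fields.
val byColumn = sales.groupBy($"year").pivot($"course").sum("earnings")
{code}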
[jira] [Comment Edited] (SPARK-24924) Add mapping for built-in Avro data source
[ https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16569071#comment-16569071 ] Hyukjin Kwon edited comment on SPARK-24924 at 8/4/18 5:44 AM: -- If it already throws an error for CSV case too, I would prefer to have the improved error message of course. {quote} I don't buy this agrument, the code has been restructured a lot and you could have introduced bugs, behavior changes, etc. {quote} I have followed the changes in Avro and I don't think there are big differences. We should keep the behaviours in particular within 2.4.0. If I missed some and this introduced a bug or behaviour changes, I personally think we should fix them within 2.4.0. That was one of key things I took into account when I merged some changes. {quote} Users could have also made their own modified version of the databricks spark-avro package (which we actually have to support primitive types) and thus the implementation is not the same and yet you are assuming it is. {quote} In this case, users should provide their own short name of the package. I would say it's discouraged to use the same name with Spark's builtin datasources, or other packages name reserved - I wonder if users would actually try to have the same name in practice. {quote} I'm worried about other users who didn't happen to see this jira. {quote} We will make this in the release note - I think I listed up the possible stories about this in https://issues.apache.org/jira/browse/SPARK-24924?focusedCommentId=16567708&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16567708 {quote} I also realize these are 3rd party packages but I think we are making the assumption here based on this being a databricks package, which in my opinion we shouldn't. What if this was companyX package which we didn't know about, what would/should be the expected behavior? {quote} I think the main reason for this is that the code is actually ported from Avro {{com.databricks.\*}}. The problem here is a worry that {{com.databricks.*}} indicates the builtin Avro, right? {quote} How many users complained about the csv thing? {quote} For instance, I saw these comments/issues below: https://github.com/databricks/spark-csv/issues/367 https://github.com/databricks/spark-csv/issues/373 https://github.com/apache/spark/pull/17916#issuecomment-301898567 For clarification, it's not personally related to me in any way at all but I thought we better keep it consistent with CSV's. To sum up, I get your position but I think the current approach makes a coherent point too. In that case, I think we better follow what we have done with CSV. was (Author: hyukjin.kwon): If it already throws an error for CSV case too, I would prefer to have the improved error message of course. {quote} I don't buy this agrument, the code has been restructured a lot and you could have introduced bugs, behavior changes, etc. {quote} I have followed the changes in Avro and I don't think there are big differences. We should keep the behaviours in particular within 2.4.0. If I missed some and this introduced a bug or behaviour changes, I personally think we should fix them within 2.4.0. That was one of key things I took into account when I merged some changes. {quote} Users could have also made their own modified version of the databricks spark-avro package (which we actually have to support primitive types) and thus the implementation is not the same and yet you are assuming it is. 
{quote} In this case, users should provide their own short name of the package. I would say it's discouraged to use the same name with Spark's builtin datasources, or other packages name reserved - I wonder if users would actually try to have the same name in practice. {quote} I'm worried about other users who didn't happen to see this jira. {quote} We will make this in the release note - I think I listed up the possible stories about this in https://issues.apache.org/jira/browse/SPARK-24924?focusedCommentId=16567708&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16567708 {quote} I also realize these are 3rd party packages but I think we are making the assumption here based on this being a databricks package, which in my opinion we shouldn't. What if this was companyX package which we didn't know about, what would/should be the expected behavior? {quote} I think the main reason for this is that the code is actually ported from Avro {{com.databricks.\*}}. The problem here is a worry that {{com.databricks.*}} indicates the builtin Avro, right? {quote} How many users complained about the csv thing? {quote} For instance, I saw these comments/issues below: https://github.com/databricks/spark-csv/issues/367 https://github.com/databricks/spark-csv/issues/373 https://github.co
[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source
[ https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16569072#comment-16569072 ] Hyukjin Kwon commented on SPARK-24924: -- Also, for clarification, we already issue warnings: {code} 17/05/10 09:47:44 WARN DataSource: Multiple sources found for csv (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat, com.databricks.spark.csv.DefaultSource15), defaulting to the internal datasource (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat). {code} So, I guess it's virtually error vs warning. > Add mapping for built-in Avro data source > - > > Key: SPARK-24924 > URL: https://issues.apache.org/jira/browse/SPARK-24924 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 2.4.0 > > > This issue aims to the followings. > # Like `com.databricks.spark.csv` mapping, we had better map > `com.databricks.spark.avro` to built-in Avro data source. > # Remove incorrect error message, `Please find an Avro package at ...`. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24924) Add mapping for built-in Avro data source
[ https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16569071#comment-16569071 ] Hyukjin Kwon edited comment on SPARK-24924 at 8/4/18 5:42 AM: -- If it already throws an error for CSV case too, I would prefer to have the improved error message of course. {quote} I don't buy this agrument, the code has been restructured a lot and you could have introduced bugs, behavior changes, etc. {quote} I have followed the changes in Avro and I don't think there are big differences. We should keep the behaviours in particular within 2.4.0. If I missed some and this introduced a bug or behaviour changes, I personally think we should fix them within 2.4.0. That was one of key things I took into account when I merged some changes. {quote} Users could have also made their own modified version of the databricks spark-avro package (which we actually have to support primitive types) and thus the implementation is not the same and yet you are assuming it is. {quote} In this case, users should provide their own short name of the package. I would say it's discouraged to use the same name with Spark's builtin datasources, or other packages name reserved - I wonder if users would actually try to have the same name in practice. {quote} I'm worried about other users who didn't happen to see this jira. {quote} We will make this in the release note - I think I listed up the possible stories about this in https://issues.apache.org/jira/browse/SPARK-24924?focusedCommentId=16567708&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16567708 {quote} I also realize these are 3rd party packages but I think we are making the assumption here based on this being a databricks package, which in my opinion we shouldn't. What if this was companyX package which we didn't know about, what would/should be the expected behavior? {quote} I think the main reason for this is that the code is actually ported from Avro {{com.databricks.\*}}. The problem here is a worry that {{com.databricks.*}} indicates the builtin Avro, right? {quote} How many users complained about the csv thing? {quote} For instance, I saw these comments/issues below: https://github.com/databricks/spark-csv/issues/367 https://github.com/databricks/spark-csv/issues/373 https://github.com/apache/spark/pull/17916#issuecomment-301898567 For clarification, it's related to me in any way but I thought we better keep it consistent with CSV's. To sum up, I get your position but I think the current approach makes a coherent point too. In that case, I think we better follow what we have done with CSV. was (Author: hyukjin.kwon): If it already throws an error for CSV case too, I would prefer to have the improved error message of course. {quote} I don't buy this agrument, the code has been restructured a lot and you could have introduced bugs, behavior changes, etc. {quote} I have followed the changes in Avro and I don't think there are big differences. We should keep the behaviours in particular within 2.4.0. If I missed some and this introduced a bug or behaviour changes, I personally think we should fix them within 2.4.0. That was one of key things I took into account when I merged some changes. {quote} Users could have also made their own modified version of the databricks spark-avro package (which we actually have to support primitive types) and thus the implementation is not the same and yet you are assuming it is. 
{quote} In this case, users should provide their own short name of the package. I would say it's discouraged to use the same name with Spark's builtin datasources, or other packages name reserved - I wonder if users would actually try to have the same name in practice. {quote} I'm worried about other users who didn't happen to see this jira. {quote} We will make this in the release note - I think I listed up the possible stories about this in https://issues.apache.org/jira/browse/SPARK-24924?focusedCommentId=16567708&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16567708 {quote} I also realize these are 3rd party packages but I think we are making the assumption here based on this being a databricks package, which in my opinion we shouldn't. What if this was companyX package which we didn't know about, what would/should be the expected behavior? {quote} I think the main reason for this is that the code is actually ported from Avro {{com.databricks.\*}}. The problem here is a worry that {{com.databricks.*}} indicates the builtin Avro, right? {quote} How many users complained about the csv thing? {quote} So far, I see some issues as below: https://github.com/databricks/spark-csv/issues/367 https://github.com/databricks/spark-csv/issues/373 https://github.com/apache/spark/pull/17916#issuecomm
[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source
[ https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16569071#comment-16569071 ] Hyukjin Kwon commented on SPARK-24924: -- If it already throws an error for CSV case too, I would prefer to have the improved error message of course. {quote} I don't buy this agrument, the code has been restructured a lot and you could have introduced bugs, behavior changes, etc. {quote} I have followed the changes in Avro and I don't think there are big differences. We should keep the behaviours in particular within 2.4.0. If I missed some and this introduced a bug or behaviour changes, I personally think we should fix them within 2.4.0. That was one of key things I took into account when I merged some changes. {quote} Users could have also made their own modified version of the databricks spark-avro package (which we actually have to support primitive types) and thus the implementation is not the same and yet you are assuming it is. {quote} In this case, users should provide their own short name of the package. I would say it's discouraged to use the same name with Spark's builtin datasources, or other packages name reserved - I wonder if users would actually try to have the same name in practice. {quote} I'm worried about other users who didn't happen to see this jira. {quote} We will make this in the release note - I think I listed up the possible stories about this in https://issues.apache.org/jira/browse/SPARK-24924?focusedCommentId=16567708&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16567708 {quote} I also realize these are 3rd party packages but I think we are making the assumption here based on this being a databricks package, which in my opinion we shouldn't. What if this was companyX package which we didn't know about, what would/should be the expected behavior? {quote} I think the main reason for this is that the code is actually ported from Avro {{com.databricks.\*}}. The problem here is a worry that {{com.databricks.*}} indicates the builtin Avro, right? {quote} How many users complained about the csv thing? {quote} So far, I see some issues as below: https://github.com/databricks/spark-csv/issues/367 https://github.com/databricks/spark-csv/issues/373 https://github.com/apache/spark/pull/17916#issuecomment-301898567 For clarification, it's related to me in any way but I thought we better keep it consistent with CSV's. To sum up, I get your position but I think the current approach makes a coherent point too. In that case, I think we better follow what we have done with CSV. > Add mapping for built-in Avro data source > - > > Key: SPARK-24924 > URL: https://issues.apache.org/jira/browse/SPARK-24924 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 2.4.0 > > > This issue aims to the followings. > # Like `com.databricks.spark.csv` mapping, we had better map > `com.databricks.spark.avro` to built-in Avro data source. > # Remove incorrect error message, `Please find an Avro package at ...`. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24888) spark-submit --master spark://host:port --status driver-id does not work
[ https://issues.apache.org/jira/browse/SPARK-24888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24888: Assignee: (was: Apache Spark) > spark-submit --master spark://host:port --status driver-id does not work > - > > Key: SPARK-24888 > URL: https://issues.apache.org/jira/browse/SPARK-24888 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 2.3.1 >Reporter: srinivasan >Priority: Major > > spark-submit --master spark://host:port --status driver-id > does not return anything. The command terminates without any error or output. > Behaviour is the same from linux or windows -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24888) spark-submit --master spark://host:port --status driver-id does not work
[ https://issues.apache.org/jira/browse/SPARK-24888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568969#comment-16568969 ] Apache Spark commented on SPARK-24888: -- User 'devaraj-kavali' has created a pull request for this issue: https://github.com/apache/spark/pull/21996 > spark-submit --master spark://host:port --status driver-id does not work > - > > Key: SPARK-24888 > URL: https://issues.apache.org/jira/browse/SPARK-24888 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 2.3.1 >Reporter: srinivasan >Priority: Major > > spark-submit --master spark://host:port --status driver-id > does not return anything. The command terminates without any error or output. > Behaviour is the same from linux or windows -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24888) spark-submit --master spark://host:port --status driver-id does not work
[ https://issues.apache.org/jira/browse/SPARK-24888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24888: Assignee: Apache Spark > spark-submit --master spark://host:port --status driver-id does not work > - > > Key: SPARK-24888 > URL: https://issues.apache.org/jira/browse/SPARK-24888 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 2.3.1 >Reporter: srinivasan >Assignee: Apache Spark >Priority: Major > > spark-submit --master spark://host:port --status driver-id > does not return anything. The command terminates without any error or output. > Behaviour is the same from linux or windows -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24928) spark sql cross join running time too long
[ https://issues.apache.org/jira/browse/SPARK-24928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568893#comment-16568893 ] Matthew Normyle commented on SPARK-24928: - In CartesianRDD.compute, changing:
{code}
for (x <- rdd1.iterator(currSplit.s1, context);
     y <- rdd2.iterator(currSplit.s2, context)) yield (x, y)
{code}
to:
{code}
val it1 = rdd1.iterator(currSplit.s1, context)
val it2 = rdd2.iterator(currSplit.s2, context)
for (x <- it1; y <- it2) yield (x, y)
{code}
Seems to resolve this issue. I am brand new to Scala and Spark. Does anyone have any insight as to why this seemingly superficial change could make such a large difference? > spark sql cross join running time too long > -- > > Key: SPARK-24928 > URL: https://issues.apache.org/jira/browse/SPARK-24928 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 1.6.2 >Reporter: LIFULONG >Priority: Minor > > spark sql running time is too long while input left table and right table is > small hdfs text format data, > the sql is: select * from t1 cross join t2 > the line of t1 is 49, three column > the line of t2 is 1, one column only > running more than 30mins and then failed > > > spark CartesianRDD also has the same problem, example test code is: > val ones = sc.textFile("hdfs://host:port/data/cartesian_data/t1b") //1 line > 1 column > val twos = sc.textFile("hdfs://host:port/data/cartesian_data/t2b") //49 > line 3 column > val cartesian = new CartesianRDD(sc, twos, ones) > cartesian.count() > running more than 5 mins,while use CartesianRDD(sc, ones, twos) , it only use > less than 10 seconds -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 0.10.0.1 to 2.0.0
[ https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568863#comment-16568863 ] Apache Spark commented on SPARK-18057: -- User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/21995 > Update structured streaming kafka from 0.10.0.1 to 2.0.0 > > > Key: SPARK-18057 > URL: https://issues.apache.org/jira/browse/SPARK-18057 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Cody Koeninger >Assignee: Ted Yu >Priority: Major > Fix For: 2.4.0 > > > There are a couple of relevant KIPs here, > https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25024) Update mesos documentation to be clear about security supported
[ https://issues.apache.org/jira/browse/SPARK-25024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568810#comment-16568810 ] Imran Rashid commented on SPARK-25024: -- [~rvesse] [~arand] maybe you could take a stab at this based on your work on SPARK-16501? Also pinging some other folks knowledgeable about mesos: [~skonto] [~tnachen] > Update mesos documentation to be clear about security supported > --- > > Key: SPARK-25024 > URL: https://issues.apache.org/jira/browse/SPARK-25024 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 2.2.2 >Reporter: Thomas Graves >Priority: Major > > I was reading through our mesos deployment docs and security docs and its not > clear at all what type of security and how to set it up for mesos. I think > we should clarify this and have something about exactly what is supported and > what is not. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24529) Add spotbugs into maven build process
[ https://issues.apache.org/jira/browse/SPARK-24529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24529: Assignee: Kazuaki Ishizaki (was: Apache Spark) > Add spotbugs into maven build process > - > > Key: SPARK-24529 > URL: https://issues.apache.org/jira/browse/SPARK-24529 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.4.0 >Reporter: Kazuaki Ishizaki >Assignee: Kazuaki Ishizaki >Priority: Minor > > We will enable a Java bytecode check tool > [spotbugs|https://spotbugs.github.io/] to avoid possible integer overflow at > multiplication. Due to the tool limitation, some other checks will be enabled. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24529) Add spotbugs into maven build process
[ https://issues.apache.org/jira/browse/SPARK-24529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568806#comment-16568806 ] Apache Spark commented on SPARK-24529: -- User 'kiszk' has created a pull request for this issue: https://github.com/apache/spark/pull/21994 > Add spotbugs into maven build process > - > > Key: SPARK-24529 > URL: https://issues.apache.org/jira/browse/SPARK-24529 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.4.0 >Reporter: Kazuaki Ishizaki >Assignee: Kazuaki Ishizaki >Priority: Minor > > We will enable a Java bytecode check tool > [spotbugs|https://spotbugs.github.io/] to avoid possible integer overflow at > multiplication. Due to the tool limitation, some other checks will be enabled. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24529) Add spotbugs into maven build process
[ https://issues.apache.org/jira/browse/SPARK-24529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24529: Assignee: Apache Spark (was: Kazuaki Ishizaki) > Add spotbugs into maven build process > - > > Key: SPARK-24529 > URL: https://issues.apache.org/jira/browse/SPARK-24529 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.4.0 >Reporter: Kazuaki Ishizaki >Assignee: Apache Spark >Priority: Minor > > We will enable a Java bytecode check tool > [spotbugs|https://spotbugs.github.io/] to avoid possible integer overflow at > multiplication. Due to the tool limitation, some other checks will be enabled. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25023) Clarify Spark security documentation
[ https://issues.apache.org/jira/browse/SPARK-25023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568805#comment-16568805 ] Thomas Graves commented on SPARK-25023: --- note some of this was already updated with [https://github.com/apache/spark/pull/20742] but I think there are still a few clarifications to make > Clarify Spark security documentation > > > Key: SPARK-25023 > URL: https://issues.apache.org/jira/browse/SPARK-25023 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 2.2.2 >Reporter: Thomas Graves >Priority: Major > > I was reading through our deployment docs and security docs and it's not clear > at all what deployment modes support exactly what for security. I think we > should clarify that security is off by default on all > deployments. We may also want to clarify the types of communication used > that would need to be secured. We may also want to clarify multi-tenant safe > vs other things, like standalone mode for instance in my opinion is just not > secure, we do talk about using spark.authenticate for a secret but all > applications would use the same secret. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-25024) Update mesos documentation to be clear about security supported
[ https://issues.apache.org/jira/browse/SPARK-25024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-25024: -- Comment: was deleted (was: I'm going to work on this.) > Update mesos documentation to be clear about security supported > --- > > Key: SPARK-25024 > URL: https://issues.apache.org/jira/browse/SPARK-25024 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 2.2.2 >Reporter: Thomas Graves >Priority: Major > > I was reading through our mesos deployment docs and security docs and its not > clear at all what type of security and how to set it up for mesos. I think > we should clarify this and have something about exactly what is supported and > what is not. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25023) Clarify Spark security documentation
[ https://issues.apache.org/jira/browse/SPARK-25023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568767#comment-16568767 ] Thomas Graves commented on SPARK-25023: --- I'm going to work on this. > Clarify Spark security documentation > > > Key: SPARK-25023 > URL: https://issues.apache.org/jira/browse/SPARK-25023 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 2.2.2 >Reporter: Thomas Graves >Priority: Major > > I was reading through our deployment docs and security docs and it's not clear > at all what deployment modes support exactly what for security. I think we > should clarify that security is off by default on all > deployments. We may also want to clarify the types of communication used > that would need to be secured. We may also want to clarify multi-tenant safe > vs other things, like standalone mode for instance in my opinion is just not > secure, we do talk about using spark.authenticate for a secret but all > applications would use the same secret. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25024) Update mesos documentation to be clear about security supported
[ https://issues.apache.org/jira/browse/SPARK-25024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568766#comment-16568766 ] Thomas Graves commented on SPARK-25024: --- I'm going to work on this. > Update mesos documentation to be clear about security supported > --- > > Key: SPARK-25024 > URL: https://issues.apache.org/jira/browse/SPARK-25024 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 2.2.2 >Reporter: Thomas Graves >Priority: Major > > I was reading through our mesos deployment docs and security docs and its not > clear at all what type of security and how to set it up for mesos. I think > we should clarify this and have something about exactly what is supported and > what is not. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25024) Update mesos documentation to be clear about security supported
Thomas Graves created SPARK-25024: - Summary: Update mesos documentation to be clear about security supported Key: SPARK-25024 URL: https://issues.apache.org/jira/browse/SPARK-25024 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 2.2.2 Reporter: Thomas Graves I was reading through our mesos deployment docs and security docs and its not clear at all what type of security and how to set it up for mesos. I think we should clarify this and have something about exactly what is supported and what is not. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25023) Clarify Spark security documentation
Thomas Graves created SPARK-25023: - Summary: Clarify Spark security documentation Key: SPARK-25023 URL: https://issues.apache.org/jira/browse/SPARK-25023 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 2.2.2 Reporter: Thomas Graves I was reading through our deployment docs and security docs and it's not clear at all what deployment modes support exactly what for security. I think we should clarify that security is off by default on all deployments. We may also want to clarify the types of communication used that would need to be secured. We may also want to clarify multi-tenant safe vs other things, like standalone mode for instance in my opinion is just not secure, we do talk about using spark.authenticate for a secret but all applications would use the same secret. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24983) Collapsing multiple project statements with dependent When-Otherwise statements on the same column can OOM the driver
[ https://issues.apache.org/jira/browse/SPARK-24983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568730#comment-16568730 ] Apache Spark commented on SPARK-24983: -- User 'dvogelbacher' has created a pull request for this issue: https://github.com/apache/spark/pull/21993 > Collapsing multiple project statements with dependent When-Otherwise > statements on the same column can OOM the driver > - > > Key: SPARK-24983 > URL: https://issues.apache.org/jira/browse/SPARK-24983 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 2.3.1 >Reporter: David Vogelbacher >Priority: Major > > I noticed that writing a spark job that includes many sequential > {{when-otherwise}} statements on the same column can easily OOM the driver > while generating the optimized plan because the project node will grow > exponentially in size. > Example: > {noformat} > scala> import org.apache.spark.sql.functions._ > import org.apache.spark.sql.functions._ > scala> val df = Seq("a", "b", "c", "1").toDF("text") > df: org.apache.spark.sql.DataFrame = [text: string] > scala> var dfCaseWhen = df.filter($"text" =!= lit("0")) > dfCaseWhen: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [text: > string] > scala> for( a <- 1 to 5) { > | dfCaseWhen = dfCaseWhen.withColumn("text", when($"text" === > lit(a.toString), lit("r" + a.toString)).otherwise($"text")) > | } > scala> dfCaseWhen.queryExecution.analyzed > res6: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = > Project [CASE WHEN (text#12 = 5) THEN r5 ELSE text#12 END AS text#14] > +- Project [CASE WHEN (text#10 = 4) THEN r4 ELSE text#10 END AS text#12] >+- Project [CASE WHEN (text#8 = 3) THEN r3 ELSE text#8 END AS text#10] > +- Project [CASE WHEN (text#6 = 2) THEN r2 ELSE text#6 END AS text#8] > +- Project [CASE WHEN (text#3 = 1) THEN r1 ELSE text#3 END AS text#6] > +- Filter NOT (text#3 = 0) >+- Project [value#1 AS text#3] > +- LocalRelation [value#1] > scala> dfCaseWhen.queryExecution.optimizedPlan > res5: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = > Project [CASE WHEN (CASE WHEN (CASE WHEN (CASE WHEN (CASE WHEN (value#1 = 1) > THEN r1 ELSE value#1 END = 2) THEN r2 ELSE CASE WHEN (value#1 = 1) THEN r1 > ELSE value#1 END END = 3) THEN r3 ELSE CASE WHEN (CASE WHEN (value#1 = 1) > THEN r1 ELSE value#1 END = 2) THEN r2 ELSE CASE WHEN (value#1 = 1) THEN r1 > ELSE value#1 END END END = 4) THEN r4 ELSE CASE WHEN (CASE WHEN (CASE WHEN > (value#1 = 1) THEN r1 ELSE value#1 END = 2) THEN r2 ELSE CASE WHEN (value#1 = > 1) THEN r1 ELSE value#1 END END = 3) THEN r3 ELSE CASE WHEN (CASE WHEN > (value#1 = 1) THEN r1 ELSE value#1 END = 2) THEN r2 ELSE CASE WHEN (value#1 = > 1) THEN r1 ELSE value#1 END END END END = 5) THEN r5 ELSE CASE WHEN (CASE > WHEN (CASE WHEN (CASE WHEN (value#1 = 1) THEN r1 ELSE va... > {noformat} > As one can see the optimized plan grows exponentially in the number of > {{when-otherwise}} statements here. > I can see that this comes from the {{CollapseProject}} optimizer rule. > Maybe we should put a limit on the resulting size of the project node after > collapsing and only collapse if we stay under the limit. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24983) Collapsing multiple project statements with dependent When-Otherwise statements on the same column can OOM the driver
[ https://issues.apache.org/jira/browse/SPARK-24983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24983: Assignee: Apache Spark > Collapsing multiple project statements with dependent When-Otherwise > statements on the same column can OOM the driver > - > > Key: SPARK-24983 > URL: https://issues.apache.org/jira/browse/SPARK-24983 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 2.3.1 >Reporter: David Vogelbacher >Assignee: Apache Spark >Priority: Major > > I noticed that writing a spark job that includes many sequential > {{when-otherwise}} statements on the same column can easily OOM the driver > while generating the optimized plan because the project node will grow > exponentially in size. > Example: > {noformat} > scala> import org.apache.spark.sql.functions._ > import org.apache.spark.sql.functions._ > scala> val df = Seq("a", "b", "c", "1").toDF("text") > df: org.apache.spark.sql.DataFrame = [text: string] > scala> var dfCaseWhen = df.filter($"text" =!= lit("0")) > dfCaseWhen: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [text: > string] > scala> for( a <- 1 to 5) { > | dfCaseWhen = dfCaseWhen.withColumn("text", when($"text" === > lit(a.toString), lit("r" + a.toString)).otherwise($"text")) > | } > scala> dfCaseWhen.queryExecution.analyzed > res6: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = > Project [CASE WHEN (text#12 = 5) THEN r5 ELSE text#12 END AS text#14] > +- Project [CASE WHEN (text#10 = 4) THEN r4 ELSE text#10 END AS text#12] >+- Project [CASE WHEN (text#8 = 3) THEN r3 ELSE text#8 END AS text#10] > +- Project [CASE WHEN (text#6 = 2) THEN r2 ELSE text#6 END AS text#8] > +- Project [CASE WHEN (text#3 = 1) THEN r1 ELSE text#3 END AS text#6] > +- Filter NOT (text#3 = 0) >+- Project [value#1 AS text#3] > +- LocalRelation [value#1] > scala> dfCaseWhen.queryExecution.optimizedPlan > res5: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = > Project [CASE WHEN (CASE WHEN (CASE WHEN (CASE WHEN (CASE WHEN (value#1 = 1) > THEN r1 ELSE value#1 END = 2) THEN r2 ELSE CASE WHEN (value#1 = 1) THEN r1 > ELSE value#1 END END = 3) THEN r3 ELSE CASE WHEN (CASE WHEN (value#1 = 1) > THEN r1 ELSE value#1 END = 2) THEN r2 ELSE CASE WHEN (value#1 = 1) THEN r1 > ELSE value#1 END END END = 4) THEN r4 ELSE CASE WHEN (CASE WHEN (CASE WHEN > (value#1 = 1) THEN r1 ELSE value#1 END = 2) THEN r2 ELSE CASE WHEN (value#1 = > 1) THEN r1 ELSE value#1 END END = 3) THEN r3 ELSE CASE WHEN (CASE WHEN > (value#1 = 1) THEN r1 ELSE value#1 END = 2) THEN r2 ELSE CASE WHEN (value#1 = > 1) THEN r1 ELSE value#1 END END END END = 5) THEN r5 ELSE CASE WHEN (CASE > WHEN (CASE WHEN (CASE WHEN (value#1 = 1) THEN r1 ELSE va... > {noformat} > As one can see the optimized plan grows exponentially in the number of > {{when-otherwise}} statements here. > I can see that this comes from the {{CollapseProject}} optimizer rule. > Maybe we should put a limit on the resulting size of the project node after > collapsing and only collapse if we stay under the limit. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24983) Collapsing multiple project statements with dependent When-Otherwise statements on the same column can OOM the driver
[ https://issues.apache.org/jira/browse/SPARK-24983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24983: Assignee: (was: Apache Spark) > Collapsing multiple project statements with dependent When-Otherwise > statements on the same column can OOM the driver > - > > Key: SPARK-24983 > URL: https://issues.apache.org/jira/browse/SPARK-24983 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 2.3.1 >Reporter: David Vogelbacher >Priority: Major > > I noticed that writing a spark job that includes many sequential > {{when-otherwise}} statements on the same column can easily OOM the driver > while generating the optimized plan because the project node will grow > exponentially in size. > Example: > {noformat} > scala> import org.apache.spark.sql.functions._ > import org.apache.spark.sql.functions._ > scala> val df = Seq("a", "b", "c", "1").toDF("text") > df: org.apache.spark.sql.DataFrame = [text: string] > scala> var dfCaseWhen = df.filter($"text" =!= lit("0")) > dfCaseWhen: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [text: > string] > scala> for( a <- 1 to 5) { > | dfCaseWhen = dfCaseWhen.withColumn("text", when($"text" === > lit(a.toString), lit("r" + a.toString)).otherwise($"text")) > | } > scala> dfCaseWhen.queryExecution.analyzed > res6: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = > Project [CASE WHEN (text#12 = 5) THEN r5 ELSE text#12 END AS text#14] > +- Project [CASE WHEN (text#10 = 4) THEN r4 ELSE text#10 END AS text#12] >+- Project [CASE WHEN (text#8 = 3) THEN r3 ELSE text#8 END AS text#10] > +- Project [CASE WHEN (text#6 = 2) THEN r2 ELSE text#6 END AS text#8] > +- Project [CASE WHEN (text#3 = 1) THEN r1 ELSE text#3 END AS text#6] > +- Filter NOT (text#3 = 0) >+- Project [value#1 AS text#3] > +- LocalRelation [value#1] > scala> dfCaseWhen.queryExecution.optimizedPlan > res5: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = > Project [CASE WHEN (CASE WHEN (CASE WHEN (CASE WHEN (CASE WHEN (value#1 = 1) > THEN r1 ELSE value#1 END = 2) THEN r2 ELSE CASE WHEN (value#1 = 1) THEN r1 > ELSE value#1 END END = 3) THEN r3 ELSE CASE WHEN (CASE WHEN (value#1 = 1) > THEN r1 ELSE value#1 END = 2) THEN r2 ELSE CASE WHEN (value#1 = 1) THEN r1 > ELSE value#1 END END END = 4) THEN r4 ELSE CASE WHEN (CASE WHEN (CASE WHEN > (value#1 = 1) THEN r1 ELSE value#1 END = 2) THEN r2 ELSE CASE WHEN (value#1 = > 1) THEN r1 ELSE value#1 END END = 3) THEN r3 ELSE CASE WHEN (CASE WHEN > (value#1 = 1) THEN r1 ELSE value#1 END = 2) THEN r2 ELSE CASE WHEN (value#1 = > 1) THEN r1 ELSE value#1 END END END END = 5) THEN r5 ELSE CASE WHEN (CASE > WHEN (CASE WHEN (CASE WHEN (value#1 = 1) THEN r1 ELSE va... > {noformat} > As one can see the optimized plan grows exponentially in the number of > {{when-otherwise}} statements here. > I can see that this comes from the {{CollapseProject}} optimizer rule. > Maybe we should put a limit on the resulting size of the project node after > collapsing and only collapse if we stay under the limit. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
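For illustration, a hedged sketch of a user-side way to sidestep the exponential growth described above: build one chained CASE WHEN expression and apply it with a single withColumn call instead of N dependent withColumn calls. This assumes, as in the ticket's example, that the rewritten values ("r1", "r2", ...) can never match a later condition, so the chained form produces the same result.
{code:scala}
// assumes a spark-shell session with spark.implicits._ in scope
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._

val df = Seq("a", "b", "c", "1").toDF("text")

// Fold the five conditions into one nested expression of linear size.
var rewritten: Column = $"text"
for (a <- 1 to 5) {
  rewritten = when($"text" === lit(a.toString), lit("r" + a.toString)).otherwise(rewritten)
}

val dfCaseWhen = df.filter($"text" =!= lit("0")).withColumn("text", rewritten)
// The optimized plan now contains a single Project with one CASE WHEN chain
// instead of the nested blow-up shown in the description.
{code}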
[jira] [Commented] (SPARK-21986) QuantileDiscretizer picks wrong split point for data with lots of 0's
[ https://issues.apache.org/jira/browse/SPARK-21986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568718#comment-16568718 ] Barry Becker commented on SPARK-21986: -- Here are a couple more test cases that show the problem: {code:java} test("Quantile discretizer on data with that is only -1, and 1 (and mostly -1)") { verify(Seq(-1, -1, 1, -1, -1, -1, 1, -1, -1, 1, -1), Seq(Double.NegativeInfinity, -1, Double.PositiveInfinity)) } test("Quantile discretizer on data with that is only -1, 0, and 1 (and mostly -1)") { verify(Seq(-1, -1, 1, -1, -1, -1, 1, 0, -1, -1, -1, 1, -1), Seq(Double.NegativeInfinity, -1, Double.PositiveInfinity)) } test("Quantile discretizer on data with that is only -1, 0, and 1 ") { // this is ok verify(Seq(-1, -1, 1, -1, -1, -1, 1, 0, -1, 1, -1), Seq(Double.NegativeInfinity, -1, 0, Double.PositiveInfinity)) }{code} If the bin were defined as (low, high] instead of [low, high), then I believe all the cases would be correct. Maybe if all the cuts has a very small epsilon added, or simply selected the next distinct value, then they would also all be correct. > QuantileDiscretizer picks wrong split point for data with lots of 0's > - > > Key: SPARK-21986 > URL: https://issues.apache.org/jira/browse/SPARK-21986 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.1.1 >Reporter: Barry Becker >Priority: Minor > > I have some simple test cases to help illustrate (see below). > I discovered this with data that had 96,000 rows, but can reproduce with much > smaller data that has roughly the same distribution of values. > If I have data like > Seq(0, 0, 0, 0, 0, 40, 0, 0, 45, 46, 0) > and ask for 3 buckets, then it does the right thing and yields splits of > Seq(Double.NegativeInfinity, 0.0, 40.0, Double.PositiveInfinity) > However, if I add just one more zero, such that I have data like > Seq(0, 0, 0, 0, 0, 0, 40, 0, 0, 45, 46, 0) > then it will do the wrong thing and give splits of > Seq(Double.NegativeInfinity, 0.0, Double.PositiveInfinity)) > I'm not bothered that it gave fewer buckets than asked for (that is to be > expected), but I am bothered that it picked 0.0 instead of 40 as the one > split point. > The way it did it, now I have 1 bucket with all the data, and a second with > none of the data. > Am I interpreting something wrong? > Here are my 2 test cases in scala: > {code} > class QuantileDiscretizerSuite extends FunSuite { > test("Quantile discretizer on data with lots of 0") { > verify(Seq(0, 0, 0, 0, 0, 0, 40, 0, 0, 45, 46, 0), > Seq(Double.NegativeInfinity, 0.0, Double.PositiveInfinity)) > } > test("Quantile discretizer on data with one less 0") { > verify(Seq(0, 0, 0, 0, 0, 40, 0, 0, 45, 46, 0), > Seq(Double.NegativeInfinity, 0.0, 40.0, Double.PositiveInfinity)) > } > > def verify(data: Seq[Int], expectedSplits: Seq[Double]): Unit = { > val theData: Seq[(Int, Double)] = data.map { > case x: Int => (x, 0.0) > case _ => (0, 0.0) > } > val df = SPARK_SESSION.sqlContext.createDataFrame(theData).toDF("rawCol", > "unused") > val qb = new QuantileDiscretizer() > .setInputCol("rawCol") > .setOutputCol("binnedColumn") > .setRelativeError(0.0) > .setNumBuckets(3) > .fit(df) > assertResult(expectedSplits) {qb.getSplits} > } > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
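For illustration, a hedged sketch of where the 0.0 split appears to come from: as far as I can tell QuantileDiscretizer computes its splits with approxQuantile, and with 9 of the 12 values equal to 0 both the 1/3 and 2/3 quantiles land on 0, which then collapses into the single cut reported above.
{code:scala}
// assumes a spark-shell session with spark.implicits._ in scope
val data = Seq(0, 0, 0, 0, 0, 0, 40, 0, 0, 45, 46, 0).map(_.toDouble)
val df = data.toDF("rawCol")

// 3 buckets -> interior quantiles at 1/3 and 2/3, with relativeError = 0.0
val cuts = df.stat.approxQuantile("rawCol", Array(1.0 / 3, 2.0 / 3), 0.0)
println(cuts.mkString(", "))  // expected: 0.0, 0.0 -> distinct splits (-Inf, 0.0, +Inf)
{code}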
[jira] [Commented] (SPARK-24375) Design sketch: support barrier scheduling in Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-24375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568664#comment-16568664 ] Mridul Muralidharan commented on SPARK-24375: - {quote} It's not desired behavior to catch exception thrown by TaskContext.barrier() silently. However, in case this really happens, we can detect that because we have `epoch` both in driver side and executor side, more details will go to the design doc of BarrierTaskContext.barrier() SPARK-24581 {quote} The current 'barrier' function does not identify 'which' barrier it is from a user point of view. Here, due to exceptions raised (not necessarily from barrier(), but could be from user code as well), different tasks are waiting on different barriers. {code} try { ... snippet A ... // Barrier 1 context.barrier() ... snippet B ... } catch { ... } ... snippet C ... // Barrier 2 context.barrier() {code} T1 waits on barrier 1, T2 could have raised exception in snippet A and ends up waiting on Barrier 2 (having never seen Barrier 1). In this scenario, how is spark making progress ? (And ofcourse, when T1 reaches barrier 2, when T2 has moved past it). I did not see this clarified in the design or in the implementation. > Design sketch: support barrier scheduling in Apache Spark > - > > Key: SPARK-24375 > URL: https://issues.apache.org/jira/browse/SPARK-24375 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Jiang Xingbo >Priority: Major > > This task is to outline a design sketch for the barrier scheduling SPIP > discussion. It doesn't need to be a complete design before the vote. But it > should at least cover both Scala/Java and PySpark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25020) Unable to Perform Graceful Shutdown in Spark Streaming with Hadoop 2.8
[ https://issues.apache.org/jira/browse/SPARK-25020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ricky Saltzer updated SPARK-25020: -- Description: Opening this up to give you guys some insight in an issue that will occur when using Spark Streaming with Hadoop 2.8. Hadoop 2.8 added HADOOP-12950 which adds a upper limit of a 10 second timeout for its shutdown hook. From our tests, if the Spark job takes longer than 10 seconds to shutdown gracefully, the Hadoop shutdown thread seems to trample over the graceful shutdown and throw an exception like {code:java} 18/08/03 17:21:04 ERROR Utils: Uncaught exception in thread pool-1-thread-1 java.lang.InterruptedException at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1039) at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328) at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277) at org.apache.spark.streaming.scheduler.ReceiverTracker.stop(ReceiverTracker.scala:177) at org.apache.spark.streaming.scheduler.JobScheduler.stop(JobScheduler.scala:114) at org.apache.spark.streaming.StreamingContext$$anonfun$stop$1.apply$mcV$sp(StreamingContext.scala:682) at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1317) at org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:681) at org.apache.spark.streaming.StreamingContext.org$apache$spark$streaming$StreamingContext$$stopOnShutdown(StreamingContext.scala:715) at org.apache.spark.streaming.StreamingContext$$anonfun$start$1.apply$mcV$sp(StreamingContext.scala:599) at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1948) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188) at scala.util.Try$.apply(Try.scala:192) at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188) at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748){code} The reason I hit this issue is because we recently upgraded to EMR 5.15, which has both Spark 2.3 & Hadoop 2.8. The following workaround has proven successful to us (in limited testing) Instead of just running {code:java} ... ssc.start() ssc.awaitTermination(){code} We needed to do the following {code:java} ... 
ssc.start() sys.ShutdownHookThread { ssc.stop(true, true) } ssc.awaitTermination(){code} As far as I can tell, there is no way to override the default {{10 second}} timeout in HADOOP-12950, which is why we had to go with the workaround. Note: I also verified this bug exists even with EMR 5.12.1 which runs Spark 2.2.x & Hadoop 2.8. Ricky Epic Games was: Opening this up to give you guys some insight in an issue that will occur when using Spark Streaming with Hadoop 2.8. Hadoop 2.8 added HADOOP-12950 which adds a upper limit of a 10 second timeout for its shutdown hook. From our tests, if the Spark job takes longer than 10 seconds to shutdown gracefully, the Hadoop shutdown thread seems to trample over the graceful shutdown and throw an exception like {code:java} 18/08/03 17:21:04 ERROR Utils: Uncaught exception in thread pool-1-thread-1 java.lang.InterruptedException at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1039) at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328) at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277) at org.apache.spark.streaming.scheduler.ReceiverTracker.stop(ReceiverTracker.scala:177) at org.apache.spark.streaming.scheduler.JobScheduler.stop(JobScheduler.scala:114) at org.apache.spark.streaming.StreamingContext$$anonfun$stop$1.apply$mcV$sp(StreamingContext.scala:682) at org.apache.spark.util.Utils$.tryLogN
[jira] [Created] (SPARK-25022) Add spark.executor.pyspark.memory support to Mesos
Ryan Blue created SPARK-25022: - Summary: Add spark.executor.pyspark.memory support to Mesos Key: SPARK-25022 URL: https://issues.apache.org/jira/browse/SPARK-25022 Project: Spark Issue Type: Bug Components: Mesos Affects Versions: 2.3.0 Reporter: Ryan Blue SPARK-25004 adds {{spark.executor.pyspark.memory}} to control the memory allocation for PySpark and updates YARN to add this memory to its container requests. Mesos should do something similar to account for the python memory allocation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25021) Add spark.executor.pyspark.memory support to Kubernetes
Ryan Blue created SPARK-25021: - Summary: Add spark.executor.pyspark.memory support to Kubernetes Key: SPARK-25021 URL: https://issues.apache.org/jira/browse/SPARK-25021 Project: Spark Issue Type: Bug Components: Kubernetes Affects Versions: 2.3.0 Reporter: Ryan Blue SPARK-25004 adds {{spark.executor.pyspark.memory}} to control the memory allocation for PySpark and updates YARN to add this memory to its container requests. Kubernetes should do something similar to account for the python memory allocation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25020) Unable to Perform Graceful Shutdown in Spark Streaming with Hadoop 2.8
Ricky Saltzer created SPARK-25020: - Summary: Unable to Perform Graceful Shutdown in Spark Streaming with Hadoop 2.8 Key: SPARK-25020 URL: https://issues.apache.org/jira/browse/SPARK-25020 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.3.1, 2.3.0, 2.2.2, 2.2.1, 2.2.0 Environment: Spark Streaming -- Tested on 2.2 & 2.3 (more than likely affects all versions with graceful shutdown) Hadoop 2.8 Reporter: Ricky Saltzer Opening this up to give you guys some insight in an issue that will occur when using Spark Streaming with Hadoop 2.8. Hadoop 2.8 added HADOOP-12950 which adds a upper limit of a 10 second timeout for its shutdown hook. From our tests, if the Spark job takes longer than 10 seconds to shutdown gracefully, the Hadoop shutdown thread seems to trample over the graceful shutdown and throw an exception like {code:java} 18/08/03 17:21:04 ERROR Utils: Uncaught exception in thread pool-1-thread-1 java.lang.InterruptedException at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1039) at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328) at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277) at org.apache.spark.streaming.scheduler.ReceiverTracker.stop(ReceiverTracker.scala:177) at org.apache.spark.streaming.scheduler.JobScheduler.stop(JobScheduler.scala:114) at org.apache.spark.streaming.StreamingContext$$anonfun$stop$1.apply$mcV$sp(StreamingContext.scala:682) at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1317) at org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:681) at org.apache.spark.streaming.StreamingContext.org$apache$spark$streaming$StreamingContext$$stopOnShutdown(StreamingContext.scala:715) at org.apache.spark.streaming.StreamingContext$$anonfun$start$1.apply$mcV$sp(StreamingContext.scala:599) at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1948) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188) at scala.util.Try$.apply(Try.scala:192) at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188) at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748){code} The reason I hit this issue is because we recently upgraded to EMR 5.15, which has both Spark 2.3 & Hadoop 2.8. 
The following workaround has proven successful for us (in limited testing). Instead of just running
{code:java}
...
ssc.start()
ssc.awaitTermination()
{code}
we needed to do the following
{code:java}
...
ssc.start()
sys.ShutdownHookThread {
  ssc.stop(true, true)
}
ssc.awaitTermination()
{code}
As far as I can tell, there is no way to override the default {{10 second}} timeout in HADOOP-12950, which is why we had to go with the workaround. Note: I also verified this bug exists even with EMR 5.12.1, which runs Spark 2.2.x & Hadoop 2.8. Ricky Epic Games -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
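A fuller, self-contained version of that workaround might look like the sketch below. The object name, batch interval, and the dummy queue stream are placeholders; the essential part is registering our own {{sys.ShutdownHookThread}} that stops the StreamingContext gracefully before Hadoop's 10-second shutdown-hook timeout can interrupt Spark's own hook:
{code:scala}
import scala.collection.mutable

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object GracefulShutdownExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("graceful-shutdown-example").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))

    // A trivial input/output pair so the context has something to run;
    // replace this with the real Kafka/receiver pipeline.
    val queue = mutable.Queue[RDD[Int]]()
    ssc.queueStream(queue).print()

    ssc.start()
    // Our own shutdown hook: stop the StreamingContext
    // (stopSparkContext = true, stopGracefully = true) ourselves rather than
    // relying only on the hook Spark registers, which Hadoop 2.8 can cut short.
    sys.ShutdownHookThread {
      ssc.stop(stopSparkContext = true, stopGracefully = true)
    }
    ssc.awaitTermination()
  }
}
{code}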
[jira] [Commented] (SPARK-25019) The published spark sql pom does not exclude the normal version of orc-core
[ https://issues.apache.org/jira/browse/SPARK-25019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568554#comment-16568554 ] Yin Huai commented on SPARK-25019: -- [~dongjoon] can you help us fix this issue? Or there is a reason that the parent pom and sql/core/pom are not consistent? > The published spark sql pom does not exclude the normal version of orc-core > > > Key: SPARK-25019 > URL: https://issues.apache.org/jira/browse/SPARK-25019 > Project: Spark > Issue Type: Bug > Components: Build, SQL >Affects Versions: 2.4.0 >Reporter: Yin Huai >Priority: Critical > > I noticed that > [https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-sql_2.11/2.4.0-SNAPSHOT/spark-sql_2.11-2.4.0-20180803.100335-189.pom] > does not exclude the normal version of orc-core. Comparing with > [https://github.com/apache/spark/blob/92b48842b944a3e430472294cdc3c481bad6b804/sql/core/pom.xml#L108] > and > [https://github.com/apache/spark/blob/92b48842b944a3e430472294cdc3c481bad6b804/pom.xml#L1767,] > we only exclude the normal version of orc-core in the parent pom. So, the > problem is that if a developer depends on spark-sql-core directly, orc-core > and orc-core-nohive will be in the dependency list. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25019) The published spark sql pom does not exclude the normal version of orc-core
Yin Huai created SPARK-25019: Summary: The published spark sql pom does not exclude the normal version of orc-core Key: SPARK-25019 URL: https://issues.apache.org/jira/browse/SPARK-25019 Project: Spark Issue Type: Bug Components: Build, SQL Affects Versions: 2.4.0 Reporter: Yin Huai I noticed that [https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-sql_2.11/2.4.0-SNAPSHOT/spark-sql_2.11-2.4.0-20180803.100335-189.pom] does not exclude the normal version of orc-core. Comparing with [https://github.com/apache/spark/blob/92b48842b944a3e430472294cdc3c481bad6b804/sql/core/pom.xml#L108] and [https://github.com/apache/spark/blob/92b48842b944a3e430472294cdc3c481bad6b804/pom.xml#L1767,] we only exclude the normal version of orc-core in the parent pom. So, the problem is that if a developer depends on spark-sql-core directly, orc-core and orc-core-nohive will be in the dependency list. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25018) Use `Co-Authored-By` git trailer in `merge_spark_pr.py`
[ https://issues.apache.org/jira/browse/SPARK-25018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25018: Assignee: Apache Spark > Use `Co-Authored-By` git trailer in `merge_spark_pr.py` > --- > > Key: SPARK-25018 > URL: https://issues.apache.org/jira/browse/SPARK-25018 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 2.4.0 >Reporter: DB Tsai >Assignee: Apache Spark >Priority: Major > > Many projects such as openstack are using `Co-Authored-By: name > ` in commit messages to indicate people who worked on a > particular patch. > It's a convention for recognizing multiple authors, and can encourage people > to collaborate. > Co-authored commits are visible on GitHub and can be included in the profile > contributions graph and the repository's statistics. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25018) Use `Co-Authored-By` git trailer in `merge_spark_pr.py`
[ https://issues.apache.org/jira/browse/SPARK-25018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25018: Assignee: (was: Apache Spark) > Use `Co-Authored-By` git trailer in `merge_spark_pr.py` > --- > > Key: SPARK-25018 > URL: https://issues.apache.org/jira/browse/SPARK-25018 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 2.4.0 >Reporter: DB Tsai >Priority: Major > > Many projects such as openstack are using `Co-Authored-By: name > ` in commit messages to indicate people who worked on a > particular patch. > It's a convention for recognizing multiple authors, and can encourage people > to collaborate. > Co-authored commits are visible on GitHub and can be included in the profile > contributions graph and the repository's statistics. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25018) Use `Co-Authored-By` git trailer in `merge_spark_pr.py`
[ https://issues.apache.org/jira/browse/SPARK-25018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568545#comment-16568545 ] Apache Spark commented on SPARK-25018: -- User 'dbtsai' has created a pull request for this issue: https://github.com/apache/spark/pull/21991 > Use `Co-Authored-By` git trailer in `merge_spark_pr.py` > --- > > Key: SPARK-25018 > URL: https://issues.apache.org/jira/browse/SPARK-25018 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 2.4.0 >Reporter: DB Tsai >Priority: Major > > Many projects such as openstack are using `Co-Authored-By: name > ` in commit messages to indicate people who worked on a > particular patch. > It's a convention for recognizing multiple authors, and can encourage people > to collaborate. > Co-authored commits are visible on GitHub and can be included in the profile > contributions graph and the repository's statistics. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25018) Use `Co-Authored-By` git trailer in `merge_spark_pr.py`
DB Tsai created SPARK-25018: --- Summary: Use `Co-Authored-By` git trailer in `merge_spark_pr.py` Key: SPARK-25018 URL: https://issues.apache.org/jira/browse/SPARK-25018 Project: Spark Issue Type: Improvement Components: Project Infra Affects Versions: 2.4.0 Reporter: DB Tsai Many projects such as openstack are using `Co-Authored-By: name ` in commit messages to indicate people who worked on a particular patch. It's a convention for recognizing multiple authors, and can encourage people to collaborate. Co-authored commits are visible on GitHub and can be included in the profile contributions graph and the repository's statistics. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
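For reference, a commit message using that trailer could look like the following (the JIRA number, names, and e-mail addresses are placeholders):
{code}
[SPARK-XXXXX][CORE] Short description of the change

Longer explanation of what the patch does and why.

Co-Authored-By: Jane Doe <jane.doe@example.com>
Co-Authored-By: John Roe <john.roe@example.com>
{code}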
[jira] [Created] (SPARK-25017) Add test suite for ContextBarrierState
Jiang Xingbo created SPARK-25017: Summary: Add test suite for ContextBarrierState Key: SPARK-25017 URL: https://issues.apache.org/jira/browse/SPARK-25017 Project: Spark Issue Type: Test Components: Spark Core Affects Versions: 2.4.0 Reporter: Jiang Xingbo We should be able to add unit tests for ContextBarrierState with a mocked RpcCallContext. Currently it's only covered by the end-to-end tests in `BarrierTaskContextSuite`. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25016) remove Support for hadoop 2.6
Thomas Graves created SPARK-25016: - Summary: remove Support for hadoop 2.6 Key: SPARK-25016 URL: https://issues.apache.org/jira/browse/SPARK-25016 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 2.4.0 Reporter: Thomas Graves Hadoop 2.6 is now old; no releases have been done in 2 years and it doesn't look like they are patching it for security or bug fixes. We should stop supporting it in Spark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25016) remove Support for hadoop 2.6
[ https://issues.apache.org/jira/browse/SPARK-25016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-25016: -- Target Version/s: 3.0.0 > remove Support for hadoop 2.6 > - > > Key: SPARK-25016 > URL: https://issues.apache.org/jira/browse/SPARK-25016 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.4.0 >Reporter: Thomas Graves >Priority: Major > > Hadoop 2.6 is now old, no releases have been done in 2 year and it doesn't > look like they are patching it for security or bug fixes. We should stop > supporting it in spark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24918) Executor Plugin API
[ https://issues.apache.org/jira/browse/SPARK-24918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid updated SPARK-24918: - Labels: SPIP memory-analysis (was: memory-analysis) > Executor Plugin API > --- > > Key: SPARK-24918 > URL: https://issues.apache.org/jira/browse/SPARK-24918 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Priority: Major > Labels: SPIP, memory-analysis > > It would be nice if we could specify an arbitrary class to run within each > executor for debugging and instrumentation. Its hard to do this currently > because: > a) you have no idea when executors will come and go with DynamicAllocation, > so don't have a chance to run custom code before the first task > b) even with static allocation, you'd have to change the code of your spark > app itself to run a special task to "install" the plugin, which is often > tough in production cases when those maintaining regularly running > applications might not even know how to make changes to the application. > For example, https://github.com/squito/spark-memory could be used in a > debugging context to understand memory use, just by re-running an application > with extra command line arguments (as opposed to rebuilding spark). > I think one tricky part here is just deciding the api, and how its versioned. > Does it just get created when the executor starts, and thats it? Or does it > get more specific events, like task start, task end, etc? Would we ever add > more events? It should definitely be a {{DeveloperApi}}, so breaking > compatibility would be allowed ... but still should be avoided. We could > create a base class that has no-op implementations, or explicitly version > everything. > Note that this is not needed in the driver as we already have SparkListeners > (even if you don't care about the SparkListenerEvents and just want to > inspect objects in the JVM, its still good enough). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24918) Executor Plugin API
[ https://issues.apache.org/jira/browse/SPARK-24918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568471#comment-16568471 ] Imran Rashid commented on SPARK-24918: -- attached an [spip proposal|https://docs.google.com/document/d/1a20gHGMyRbCM8aicvq4LhWfQmoA5cbHBQtyqIA2hgtc/edit?usp=sharing] > Executor Plugin API > --- > > Key: SPARK-24918 > URL: https://issues.apache.org/jira/browse/SPARK-24918 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Priority: Major > Labels: memory-analysis > > It would be nice if we could specify an arbitrary class to run within each > executor for debugging and instrumentation. Its hard to do this currently > because: > a) you have no idea when executors will come and go with DynamicAllocation, > so don't have a chance to run custom code before the first task > b) even with static allocation, you'd have to change the code of your spark > app itself to run a special task to "install" the plugin, which is often > tough in production cases when those maintaining regularly running > applications might not even know how to make changes to the application. > For example, https://github.com/squito/spark-memory could be used in a > debugging context to understand memory use, just by re-running an application > with extra command line arguments (as opposed to rebuilding spark). > I think one tricky part here is just deciding the api, and how its versioned. > Does it just get created when the executor starts, and thats it? Or does it > get more specific events, like task start, task end, etc? Would we ever add > more events? It should definitely be a {{DeveloperApi}}, so breaking > compatibility would be allowed ... but still should be avoided. We could > create a base class that has no-op implementations, or explicitly version > everything. > Note that this is not needed in the driver as we already have SparkListeners > (even if you don't care about the SparkListenerEvents and just want to > inspect objects in the JVM, its still good enough). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
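As a concrete strawman for the questions above (when does the plugin get created, what events does it see), the simplest shape would be a trait instantiated reflectively on each executor at startup. Everything below (the trait name, the methods, the config key) is hypothetical and only illustrates that shape; the attached SPIP is what defines the actual proposal:
{code:scala}
// Hypothetical developer API: a no-op base so future events could be added
// without breaking existing plugins.
trait ExecutorPlugin {
  /** Called once when the executor starts, before any task has run. */
  def init(): Unit = {}

  /** Called when the executor is shutting down. */
  def shutdown(): Unit = {}
}

// An instrumentation plugin in the spirit of the memory-monitoring use case above.
// It would be enabled with something like a (hypothetical)
//   --conf spark.executor.plugins=com.example.HeapReportPlugin
// without touching the application's own code.
class HeapReportPlugin extends ExecutorPlugin {
  override def init(): Unit = {
    val rt = Runtime.getRuntime
    println(s"executor heap: current=${rt.totalMemory() / (1024 * 1024)} MB, " +
      s"max=${rt.maxMemory() / (1024 * 1024)} MB")
  }
}
{code}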
[jira] [Assigned] (SPARK-24954) Fail fast on job submit if run a barrier stage with dynamic resource allocation enabled
[ https://issues.apache.org/jira/browse/SPARK-24954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-24954: - Assignee: Jiang Xingbo > Fail fast on job submit if run a barrier stage with dynamic resource > allocation enabled > --- > > Key: SPARK-24954 > URL: https://issues.apache.org/jira/browse/SPARK-24954 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Jiang Xingbo >Assignee: Jiang Xingbo >Priority: Blocker > Fix For: 2.4.0 > > > Since we explicitly listed "Support running barrier stage with dynamic > resource allocation" a Non-Goal in the design doc, we shall fail fast on job > submit if running a barrier stage with dynamic resource allocation enabled, > to avoid some confusing behaviors (can refer to SPARK-24942 for some > examples). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24954) Fail fast on job submit if run a barrier stage with dynamic resource allocation enabled
[ https://issues.apache.org/jira/browse/SPARK-24954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-24954. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21915 [https://github.com/apache/spark/pull/21915] > Fail fast on job submit if run a barrier stage with dynamic resource > allocation enabled > --- > > Key: SPARK-24954 > URL: https://issues.apache.org/jira/browse/SPARK-24954 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Jiang Xingbo >Assignee: Jiang Xingbo >Priority: Blocker > Fix For: 2.4.0 > > > Since we explicitly listed "Support running barrier stage with dynamic > resource allocation" a Non-Goal in the design doc, we shall fail fast on job > submit if running a barrier stage with dynamic resource allocation enabled, > to avoid some confusing behaviors (can refer to SPARK-24942 for some > examples). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
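From the user's point of view, the fail-fast would surface roughly as in the sketch below. This assumes dynamic allocation has been turned on via {{spark.dynamicAllocation.enabled=true}}; the exact exception type and message are up to the implementation, not this example:
{code:scala}
// sc: an existing SparkContext running with spark.dynamicAllocation.enabled=true.
// A job containing a barrier stage should now be rejected at job-submission time
// instead of misbehaving later at runtime.
val rdd = sc.parallelize(1 to 10, 2).barrier().mapPartitions(iter => iter)
rdd.collect() // expected to fail fast with an error pointing at dynamic allocation
{code}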
[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source
[ https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568454#comment-16568454 ] Reynold Xin commented on SPARK-24924: - I like the improved error message (I didn't read the earlier comments in this thread). > Add mapping for built-in Avro data source > - > > Key: SPARK-24924 > URL: https://issues.apache.org/jira/browse/SPARK-24924 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 2.4.0 > > > This issue aims to the followings. > # Like `com.databricks.spark.csv` mapping, we had better map > `com.databricks.spark.avro` to built-in Avro data source. > # Remove incorrect error message, `Please find an Avro package at ...`. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20696) tf-idf document clustering with K-means in Apache Spark putting points into one cluster
[ https://issues.apache.org/jira/browse/SPARK-20696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568439#comment-16568439 ] Aditya Kamath commented on SPARK-20696: --- [~rajanimaski] Please let me know which implementation of Kmeans you created. I'd really like this information. Thank you. > tf-idf document clustering with K-means in Apache Spark putting points into > one cluster > --- > > Key: SPARK-20696 > URL: https://issues.apache.org/jira/browse/SPARK-20696 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.0 >Reporter: Nassir >Priority: Major > > I am trying to do the classic job of clustering text documents by > pre-processing, generating tf-idf matrix, and then applying K-means. However, > testing this workflow on the classic 20NewsGroup dataset results in most > documents being clustered into one cluster. (I have initially tried to > cluster all documents from 6 of the 20 groups - so expecting to cluster into > 6 clusters). > I am implementing this in Apache Spark as my purpose is to utilise this > technique on millions of documents. Here is the code written in Pyspark on > Databricks: > #declare path to folder containing 6 of 20 news group categories > path = "/mnt/%s/20news-bydate.tar/20new-bydate-train-lessFolders/*/*" % > MOUNT_NAME > #read all the text files from the 6 folders. Each entity is an entire > document. > text_files = sc.wholeTextFiles(path).cache() > #convert rdd to dataframe > df = text_files.toDF(["filePath", "document"]).cache() > from pyspark.ml.feature import HashingTF, IDF, Tokenizer, CountVectorizer > #tokenize the document text > tokenizer = Tokenizer(inputCol="document", outputCol="tokens") > tokenized = tokenizer.transform(df).cache() > from pyspark.ml.feature import StopWordsRemover > remover = StopWordsRemover(inputCol="tokens", > outputCol="stopWordsRemovedTokens") > stopWordsRemoved_df = remover.transform(tokenized).cache() > hashingTF = HashingTF (inputCol="stopWordsRemovedTokens", > outputCol="rawFeatures", numFeatures=20) > tfVectors = hashingTF.transform(stopWordsRemoved_df).cache() > idf = IDF(inputCol="rawFeatures", outputCol="features", minDocFreq=5) > idfModel = idf.fit(tfVectors) > tfIdfVectors = idfModel.transform(tfVectors).cache() > #note that I have also tried to use normalized data, but get the same result > from pyspark.ml.feature import Normalizer > from pyspark.ml.linalg import Vectors > normalizer = Normalizer(inputCol="features", outputCol="normFeatures") > l2NormData = normalizer.transform(tfIdfVectors) > from pyspark.ml.clustering import KMeans > # Trains a KMeans model. > kmeans = KMeans().setK(6).setMaxIter(20) > km_model = kmeans.fit(l2NormData) > clustersTable = km_model.transform(l2NormData) > [output showing most documents get clustered into cluster 0][1] > ID number_of_documents_in_cluster > 0 3024 > 3 5 > 1 3 > 5 2 > 2 2 > 4 1 > As you can see most of my data points get clustered into cluster 0, and I > cannot figure out what I am doing wrong as all the tutorials and code I have > come across online point to using this method. > In addition I have also tried normalizing the tf-idf matrix before K-means > but that also produces the same result. I know cosine distance is a better > measure to use, but I expected using standard K-means in Apache Spark would > provide meaningful results. > Can anyone help with regards to whether I have a bug in my code, or if > something is missing in my data clustering pipeline? 
> (Question also asked in Stackoverflow before: > http://stackoverflow.com/questions/43863373/tf-idf-document-clustering-with-k-means-in-apache-spark-putting-points-into-one) > Thank you in advance! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25011) Add PrefixSpan to __all__ in fpm.py
[ https://issues.apache.org/jira/browse/SPARK-25011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yuhao yang updated SPARK-25011: --- Summary: Add PrefixSpan to __all__ in fpm.py (was: Add PrefixSpan to __all__) > Add PrefixSpan to __all__ in fpm.py > --- > > Key: SPARK-25011 > URL: https://issues.apache.org/jira/browse/SPARK-25011 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.4.0 >Reporter: yuhao yang >Assignee: yuhao yang >Priority: Minor > Fix For: 2.4.0 > > > Add PrefixSpan to __all__ in fpm.py -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25003) Pyspark Does not use Spark Sql Extensions
[ https://issues.apache.org/jira/browse/SPARK-25003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568415#comment-16568415 ] Russell Spitzer commented on SPARK-25003: - [~holden.karau] , Wrote up a PR for each branch target since I'm not sure what version you would think best for the update. Please let me know if you have any feedback or advice on how to get an automatic test in :) > Pyspark Does not use Spark Sql Extensions > - > > Key: SPARK-25003 > URL: https://issues.apache.org/jira/browse/SPARK-25003 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.2, 2.3.1 >Reporter: Russell Spitzer >Priority: Major > > When creating a SparkSession here > [https://github.com/apache/spark/blob/v2.2.2/python/pyspark/sql/session.py#L216] > {code:python} > if jsparkSession is None: > jsparkSession = self._jvm.SparkSession(self._jsc.sc()) > self._jsparkSession = jsparkSession > {code} > I believe it ends up calling the constructor here > https://github.com/apache/spark/blob/v2.2.2/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L85-L87 > {code:scala} > private[sql] def this(sc: SparkContext) { > this(sc, None, None, new SparkSessionExtensions) > } > {code} > Which creates a new SparkSessionsExtensions object and does not pick up new > extensions that could have been set in the config like the companion > getOrCreate does. > https://github.com/apache/spark/blob/v2.2.2/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L928-L944 > {code:scala} > //in getOrCreate > // Initialize extensions if the user has defined a configurator class. > val extensionConfOption = > sparkContext.conf.get(StaticSQLConf.SPARK_SESSION_EXTENSIONS) > if (extensionConfOption.isDefined) { > val extensionConfClassName = extensionConfOption.get > try { > val extensionConfClass = > Utils.classForName(extensionConfClassName) > val extensionConf = extensionConfClass.newInstance() > .asInstanceOf[SparkSessionExtensions => Unit] > extensionConf(extensions) > } catch { > // Ignore the error if we cannot find the class or when the class > has the wrong type. > case e @ (_: ClassCastException | > _: ClassNotFoundException | > _: NoClassDefFoundError) => > logWarning(s"Cannot use $extensionConfClassName to configure > session extensions.", e) > } > } > {code} > I think a quick fix would be to use the getOrCreate method from the companion > object instead of calling the constructor from the SparkContext. Or we could > fix this by ensuring that all constructors attempt to pick up custom > extensions if they are set. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
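For context, the extension mechanism the description refers to is driven entirely by the {{spark.sql.extensions}} conf when the session is built through {{getOrCreate()}}; a minimal Scala sketch of that path is below (the configurator class, object name, and the injected-rule comment are made-up illustrations):
{code:scala}
import org.apache.spark.sql.{SparkSession, SparkSessionExtensions}

// A configurator class of the kind spark.sql.extensions points at.
class MyExtensions extends (SparkSessionExtensions => Unit) {
  override def apply(ext: SparkSessionExtensions): Unit = {
    // e.g. ext.injectOptimizerRule(session => MyCustomRule(session))  // MyCustomRule is made up
  }
}

object ExtensionsExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .config("spark.sql.extensions", classOf[MyExtensions].getName)
      .getOrCreate() // getOrCreate() reads the conf and applies the configurator;
                     // constructing SparkSession directly, as the PySpark path does, skips it
    spark.stop()
  }
}
{code}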
[jira] [Commented] (SPARK-25003) Pyspark Does not use Spark Sql Extensions
[ https://issues.apache.org/jira/browse/SPARK-25003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568414#comment-16568414 ] Apache Spark commented on SPARK-25003: -- User 'RussellSpitzer' has created a pull request for this issue: https://github.com/apache/spark/pull/21990 > Pyspark Does not use Spark Sql Extensions > - > > Key: SPARK-25003 > URL: https://issues.apache.org/jira/browse/SPARK-25003 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.2, 2.3.1 >Reporter: Russell Spitzer >Priority: Major > > When creating a SparkSession here > [https://github.com/apache/spark/blob/v2.2.2/python/pyspark/sql/session.py#L216] > {code:python} > if jsparkSession is None: > jsparkSession = self._jvm.SparkSession(self._jsc.sc()) > self._jsparkSession = jsparkSession > {code} > I believe it ends up calling the constructor here > https://github.com/apache/spark/blob/v2.2.2/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L85-L87 > {code:scala} > private[sql] def this(sc: SparkContext) { > this(sc, None, None, new SparkSessionExtensions) > } > {code} > Which creates a new SparkSessionsExtensions object and does not pick up new > extensions that could have been set in the config like the companion > getOrCreate does. > https://github.com/apache/spark/blob/v2.2.2/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L928-L944 > {code:scala} > //in getOrCreate > // Initialize extensions if the user has defined a configurator class. > val extensionConfOption = > sparkContext.conf.get(StaticSQLConf.SPARK_SESSION_EXTENSIONS) > if (extensionConfOption.isDefined) { > val extensionConfClassName = extensionConfOption.get > try { > val extensionConfClass = > Utils.classForName(extensionConfClassName) > val extensionConf = extensionConfClass.newInstance() > .asInstanceOf[SparkSessionExtensions => Unit] > extensionConf(extensions) > } catch { > // Ignore the error if we cannot find the class or when the class > has the wrong type. > case e @ (_: ClassCastException | > _: ClassNotFoundException | > _: NoClassDefFoundError) => > logWarning(s"Cannot use $extensionConfClassName to configure > session extensions.", e) > } > } > {code} > I think a quick fix would be to use the getOrCreate method from the companion > object instead of calling the constructor from the SparkContext. Or we could > fix this by ensuring that all constructors attempt to pick up custom > extensions if they are set. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25003) Pyspark Does not use Spark Sql Extensions
[ https://issues.apache.org/jira/browse/SPARK-25003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568405#comment-16568405 ] Apache Spark commented on SPARK-25003: -- User 'RussellSpitzer' has created a pull request for this issue: https://github.com/apache/spark/pull/21989 > Pyspark Does not use Spark Sql Extensions > - > > Key: SPARK-25003 > URL: https://issues.apache.org/jira/browse/SPARK-25003 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.2, 2.3.1 >Reporter: Russell Spitzer >Priority: Major > > When creating a SparkSession here > [https://github.com/apache/spark/blob/v2.2.2/python/pyspark/sql/session.py#L216] > {code:python} > if jsparkSession is None: > jsparkSession = self._jvm.SparkSession(self._jsc.sc()) > self._jsparkSession = jsparkSession > {code} > I believe it ends up calling the constructor here > https://github.com/apache/spark/blob/v2.2.2/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L85-L87 > {code:scala} > private[sql] def this(sc: SparkContext) { > this(sc, None, None, new SparkSessionExtensions) > } > {code} > Which creates a new SparkSessionsExtensions object and does not pick up new > extensions that could have been set in the config like the companion > getOrCreate does. > https://github.com/apache/spark/blob/v2.2.2/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L928-L944 > {code:scala} > //in getOrCreate > // Initialize extensions if the user has defined a configurator class. > val extensionConfOption = > sparkContext.conf.get(StaticSQLConf.SPARK_SESSION_EXTENSIONS) > if (extensionConfOption.isDefined) { > val extensionConfClassName = extensionConfOption.get > try { > val extensionConfClass = > Utils.classForName(extensionConfClassName) > val extensionConf = extensionConfClass.newInstance() > .asInstanceOf[SparkSessionExtensions => Unit] > extensionConf(extensions) > } catch { > // Ignore the error if we cannot find the class or when the class > has the wrong type. > case e @ (_: ClassCastException | > _: ClassNotFoundException | > _: NoClassDefFoundError) => > logWarning(s"Cannot use $extensionConfClassName to configure > session extensions.", e) > } > } > {code} > I think a quick fix would be to use the getOrCreate method from the companion > object instead of calling the constructor from the SparkContext. Or we could > fix this by ensuring that all constructors attempt to pick up custom > extensions if they are set. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25003) Pyspark Does not use Spark Sql Extensions
[ https://issues.apache.org/jira/browse/SPARK-25003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25003: Assignee: (was: Apache Spark) > Pyspark Does not use Spark Sql Extensions > - > > Key: SPARK-25003 > URL: https://issues.apache.org/jira/browse/SPARK-25003 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.2, 2.3.1 >Reporter: Russell Spitzer >Priority: Major > > When creating a SparkSession here > [https://github.com/apache/spark/blob/v2.2.2/python/pyspark/sql/session.py#L216] > {code:python} > if jsparkSession is None: > jsparkSession = self._jvm.SparkSession(self._jsc.sc()) > self._jsparkSession = jsparkSession > {code} > I believe it ends up calling the constructor here > https://github.com/apache/spark/blob/v2.2.2/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L85-L87 > {code:scala} > private[sql] def this(sc: SparkContext) { > this(sc, None, None, new SparkSessionExtensions) > } > {code} > Which creates a new SparkSessionsExtensions object and does not pick up new > extensions that could have been set in the config like the companion > getOrCreate does. > https://github.com/apache/spark/blob/v2.2.2/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L928-L944 > {code:scala} > //in getOrCreate > // Initialize extensions if the user has defined a configurator class. > val extensionConfOption = > sparkContext.conf.get(StaticSQLConf.SPARK_SESSION_EXTENSIONS) > if (extensionConfOption.isDefined) { > val extensionConfClassName = extensionConfOption.get > try { > val extensionConfClass = > Utils.classForName(extensionConfClassName) > val extensionConf = extensionConfClass.newInstance() > .asInstanceOf[SparkSessionExtensions => Unit] > extensionConf(extensions) > } catch { > // Ignore the error if we cannot find the class or when the class > has the wrong type. > case e @ (_: ClassCastException | > _: ClassNotFoundException | > _: NoClassDefFoundError) => > logWarning(s"Cannot use $extensionConfClassName to configure > session extensions.", e) > } > } > {code} > I think a quick fix would be to use the getOrCreate method from the companion > object instead of calling the constructor from the SparkContext. Or we could > fix this by ensuring that all constructors attempt to pick up custom > extensions if they are set. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25003) Pyspark Does not use Spark Sql Extensions
[ https://issues.apache.org/jira/browse/SPARK-25003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25003: Assignee: Apache Spark > Pyspark Does not use Spark Sql Extensions > - > > Key: SPARK-25003 > URL: https://issues.apache.org/jira/browse/SPARK-25003 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.2, 2.3.1 >Reporter: Russell Spitzer >Assignee: Apache Spark >Priority: Major > > When creating a SparkSession here > [https://github.com/apache/spark/blob/v2.2.2/python/pyspark/sql/session.py#L216] > {code:python} > if jsparkSession is None: > jsparkSession = self._jvm.SparkSession(self._jsc.sc()) > self._jsparkSession = jsparkSession > {code} > I believe it ends up calling the constructor here > https://github.com/apache/spark/blob/v2.2.2/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L85-L87 > {code:scala} > private[sql] def this(sc: SparkContext) { > this(sc, None, None, new SparkSessionExtensions) > } > {code} > Which creates a new SparkSessionsExtensions object and does not pick up new > extensions that could have been set in the config like the companion > getOrCreate does. > https://github.com/apache/spark/blob/v2.2.2/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L928-L944 > {code:scala} > //in getOrCreate > // Initialize extensions if the user has defined a configurator class. > val extensionConfOption = > sparkContext.conf.get(StaticSQLConf.SPARK_SESSION_EXTENSIONS) > if (extensionConfOption.isDefined) { > val extensionConfClassName = extensionConfOption.get > try { > val extensionConfClass = > Utils.classForName(extensionConfClassName) > val extensionConf = extensionConfClass.newInstance() > .asInstanceOf[SparkSessionExtensions => Unit] > extensionConf(extensions) > } catch { > // Ignore the error if we cannot find the class or when the class > has the wrong type. > case e @ (_: ClassCastException | > _: ClassNotFoundException | > _: NoClassDefFoundError) => > logWarning(s"Cannot use $extensionConfClassName to configure > session extensions.", e) > } > } > {code} > I think a quick fix would be to use the getOrCreate method from the companion > object instead of calling the constructor from the SparkContext. Or we could > fix this by ensuring that all constructors attempt to pick up custom > extensions if they are set. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25003) Pyspark Does not use Spark Sql Extensions
[ https://issues.apache.org/jira/browse/SPARK-25003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568396#comment-16568396 ] Apache Spark commented on SPARK-25003: -- User 'RussellSpitzer' has created a pull request for this issue: https://github.com/apache/spark/pull/21988 > Pyspark Does not use Spark Sql Extensions > - > > Key: SPARK-25003 > URL: https://issues.apache.org/jira/browse/SPARK-25003 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.2, 2.3.1 >Reporter: Russell Spitzer >Priority: Major > > When creating a SparkSession here > [https://github.com/apache/spark/blob/v2.2.2/python/pyspark/sql/session.py#L216] > {code:python} > if jsparkSession is None: > jsparkSession = self._jvm.SparkSession(self._jsc.sc()) > self._jsparkSession = jsparkSession > {code} > I believe it ends up calling the constructor here > https://github.com/apache/spark/blob/v2.2.2/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L85-L87 > {code:scala} > private[sql] def this(sc: SparkContext) { > this(sc, None, None, new SparkSessionExtensions) > } > {code} > Which creates a new SparkSessionsExtensions object and does not pick up new > extensions that could have been set in the config like the companion > getOrCreate does. > https://github.com/apache/spark/blob/v2.2.2/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L928-L944 > {code:scala} > //in getOrCreate > // Initialize extensions if the user has defined a configurator class. > val extensionConfOption = > sparkContext.conf.get(StaticSQLConf.SPARK_SESSION_EXTENSIONS) > if (extensionConfOption.isDefined) { > val extensionConfClassName = extensionConfOption.get > try { > val extensionConfClass = > Utils.classForName(extensionConfClassName) > val extensionConf = extensionConfClass.newInstance() > .asInstanceOf[SparkSessionExtensions => Unit] > extensionConf(extensions) > } catch { > // Ignore the error if we cannot find the class or when the class > has the wrong type. > case e @ (_: ClassCastException | > _: ClassNotFoundException | > _: NoClassDefFoundError) => > logWarning(s"Cannot use $extensionConfClassName to configure > session extensions.", e) > } > } > {code} > I think a quick fix would be to use the getOrCreate method from the companion > object instead of calling the constructor from the SparkContext. Or we could > fix this by ensuring that all constructors attempt to pick up custom > extensions if they are set. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25015) Update Hadoop 2.7 to 2.7.7
[ https://issues.apache.org/jira/browse/SPARK-25015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25015: Assignee: Apache Spark (was: Sean Owen) > Update Hadoop 2.7 to 2.7.7 > -- > > Key: SPARK-25015 > URL: https://issues.apache.org/jira/browse/SPARK-25015 > Project: Spark > Issue Type: Task > Components: Build >Affects Versions: 2.1.3, 2.2.2, 2.3.1 >Reporter: Sean Owen >Assignee: Apache Spark >Priority: Minor > > We should update the Hadoop 2.7 dependency to 2.7.7, to pick up bug and > security fixes. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25015) Update Hadoop 2.7 to 2.7.7
[ https://issues.apache.org/jira/browse/SPARK-25015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568392#comment-16568392 ] Apache Spark commented on SPARK-25015: -- User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/21987 > Update Hadoop 2.7 to 2.7.7 > -- > > Key: SPARK-25015 > URL: https://issues.apache.org/jira/browse/SPARK-25015 > Project: Spark > Issue Type: Task > Components: Build >Affects Versions: 2.1.3, 2.2.2, 2.3.1 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Minor > > We should update the Hadoop 2.7 dependency to 2.7.7, to pick up bug and > security fixes. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source
[ https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568393#comment-16568393 ] Thomas Graves commented on SPARK-24924: --- {quote} It wouldn't be very different for 2.4.0. It could be different but I guess it should be incremental improvement without behaviour changes. {quote} I don't buy this argument; the code has been restructured a lot and you could have introduced bugs, behavior changes, etc. If a user has been using the Databricks spark-avro version for other releases and it was working fine, and now we magically map it to a different implementation and they break, they are going to complain and say: I didn't change anything, why did this break? Users could also have made their own modified version of the Databricks spark-avro package (which we actually have, to support primitive types), and thus the implementation is not the same, and yet you are assuming it is. Just a note: the fact that we use a different version isn't my issue, and I'm happy to make that work; I'm worried about other users who didn't happen to see this JIRA. I also realize these are 3rd party packages, but I think we are making the assumption here based on this being a Databricks package, which in my opinion we shouldn't. What if this was a companyX package which we didn't know about, what would/should be the expected behavior? How many users complained about the CSV thing? Could we just improve the error message to state more simply: "Multiple sources found; perhaps you are including an external package that also supports Avro. Spark supports Avro internally as of release X.Y, please remove the external package or rewrite to use a different function." > Add mapping for built-in Avro data source > - > > Key: SPARK-24924 > URL: https://issues.apache.org/jira/browse/SPARK-24924 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 2.4.0 > > > This issue aims to the followings. > # Like `com.databricks.spark.csv` mapping, we had better map > `com.databricks.spark.avro` to built-in Avro data source. > # Remove incorrect error message, `Please find an Avro package at ...`. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
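For what it's worth, the behaviour being debated boils down to which implementation these two calls resolve to once the built-in Avro module is on the classpath (the paths below are placeholders, and "expected" reflects the mapping described in this issue rather than a guarantee):
{code:scala}
// spark: an existing SparkSession with the 2.4 built-in Avro module on the classpath.
// Both reads are expected to hit the built-in Avro source once the mapping is in place:
// "avro" is the new short name, the Databricks class name is the legacy spelling being remapped.
val df1 = spark.read.format("avro").load("/path/to/data.avro")
val df2 = spark.read.format("com.databricks.spark.avro").load("/path/to/data.avro")
{code}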
[jira] [Assigned] (SPARK-25015) Update Hadoop 2.7 to 2.7.7
[ https://issues.apache.org/jira/browse/SPARK-25015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25015: Assignee: Sean Owen (was: Apache Spark) > Update Hadoop 2.7 to 2.7.7 > -- > > Key: SPARK-25015 > URL: https://issues.apache.org/jira/browse/SPARK-25015 > Project: Spark > Issue Type: Task > Components: Build >Affects Versions: 2.1.3, 2.2.2, 2.3.1 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Minor > > We should update the Hadoop 2.7 dependency to 2.7.7, to pick up bug and > security fixes. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25015) Update Hadoop 2.7 to 2.7.7
Sean Owen created SPARK-25015: - Summary: Update Hadoop 2.7 to 2.7.7 Key: SPARK-25015 URL: https://issues.apache.org/jira/browse/SPARK-25015 Project: Spark Issue Type: Task Components: Build Affects Versions: 2.3.1, 2.2.2, 2.1.3 Reporter: Sean Owen Assignee: Sean Owen We should update the Hadoop 2.7 dependency to 2.7.7, to pick up bug and security fixes. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18057) Update structured streaming kafka from 0.10.0.1 to 2.0.0
[ https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-18057: -- Priority: Major (was: Blocker) > Update structured streaming kafka from 0.10.0.1 to 2.0.0 > > > Key: SPARK-18057 > URL: https://issues.apache.org/jira/browse/SPARK-18057 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Cody Koeninger >Assignee: Ted Yu >Priority: Major > Fix For: 2.4.0 > > > There are a couple of relevant KIPs here, > https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18057) Update structured streaming kafka from 0.10.0.1 to 2.0.0
[ https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-18057. --- Resolution: Fixed Fix Version/s: 2.4.0 > Update structured streaming kafka from 0.10.0.1 to 2.0.0 > > > Key: SPARK-18057 > URL: https://issues.apache.org/jira/browse/SPARK-18057 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Cody Koeninger >Assignee: Ted Yu >Priority: Major > Fix For: 2.4.0 > > > There are a couple of relevant KIPs here, > https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25014) When we tried to read kafka topic through spark streaming spark submit is getting failed with Python worker exited unexpectedly (crashed) error
[ https://issues.apache.org/jira/browse/SPARK-25014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-25014: -- Priority: Major (was: Blocker) Fix Version/s: (was: 2.3.2) Please read [https://spark.apache.org/contributing.html] For example, don't set Blocker. Nothing about this rules out an env problem or code problem. Jira is for reporting issues narrowed down to Spark, rather than asking for support. > When we tried to read kafka topic through spark streaming spark submit is > getting failed with Python worker exited unexpectedly (crashed) error > > > Key: SPARK-25014 > URL: https://issues.apache.org/jira/browse/SPARK-25014 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.1 >Reporter: KARTHIKEYAN RASIPALAYAM DURAIRAJ >Priority: Major > Original Estimate: 2h > Remaining Estimate: 2h > > Hi Team , > > TOPIC = 'NBC_APPS.TBL_MS_ADVERTISER' > PARTITION = 0 > topicAndPartition = TopicAndPartition(TOPIC, PARTITION) > fromOffsets1 = \{topicAndPartition:int(PARTITION)} > > def handler(message): > records = message.collect() > for record in records: > value_all=record[1] > value_key=record[0] > # print(value_all) > > schema_registry_client = > CachedSchemaRegistryClient(url='http://localhost:8081') > serializer = MessageSerializer(schema_registry_client) > sc = SparkContext(appName="PythonStreamingAvro") > ssc = StreamingContext(sc, 10) > kvs = KafkaUtils.createDirectStream(ssc, ['NBC_APPS.TBL_MS_ADVERTISER'], > \{"metadata.broker.list": > 'localhost:9092'},valueDecoder=serializer.decode_message) > lines = kvs.map(lambda x: x[1]) > lines.pprint() > kvs.foreachRDD(handler) > > ssc.start() > ssc.awaitTermination() > > This is code we trying to pull the data from kafka topic . when we execute > through spark submit we are getting below error > > > 2018-08-03 11:10:40 INFO VerifiableProperties:68 - Property > zookeeper.connect is overridden to > 2018-08-03 11:10:40 ERROR PythonRunner:91 - Python worker exited unexpectedly > (crashed) > org.apache.spark.api.python.PythonException: Traceback (most recent call > last): > File > "/Users/KarthikeyanDurairaj/Desktop/Sparkshell/python/lib/pyspark.zip/pyspark/worker.py", > line 215, in main > eval_type = read_int(infile) > File > "/Users/KarthikeyanDurairaj/Desktop/Sparkshell/python/lib/pyspark.zip/pyspark/serializers.py", > line 685, in read_int > raise EOFError > EOFError > > at > org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:298) > at > org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:438) > at > org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:421) > at > org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:252) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25014) When we tried to read kafka topic through spark streaming spark submit is getting failed with Python worker exited unexpectedly (crashed) error
[ https://issues.apache.org/jira/browse/SPARK-25014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-25014. --- Resolution: Invalid > When we tried to read kafka topic through spark streaming spark submit is > getting failed with Python worker exited unexpectedly (crashed) error > > > Key: SPARK-25014 > URL: https://issues.apache.org/jira/browse/SPARK-25014 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.1 >Reporter: KARTHIKEYAN RASIPALAYAM DURAIRAJ >Priority: Major > Original Estimate: 2h > Remaining Estimate: 2h > > Hi Team , > > TOPIC = 'NBC_APPS.TBL_MS_ADVERTISER' > PARTITION = 0 > topicAndPartition = TopicAndPartition(TOPIC, PARTITION) > fromOffsets1 = \{topicAndPartition:int(PARTITION)} > > def handler(message): > records = message.collect() > for record in records: > value_all=record[1] > value_key=record[0] > # print(value_all) > > schema_registry_client = > CachedSchemaRegistryClient(url='http://localhost:8081') > serializer = MessageSerializer(schema_registry_client) > sc = SparkContext(appName="PythonStreamingAvro") > ssc = StreamingContext(sc, 10) > kvs = KafkaUtils.createDirectStream(ssc, ['NBC_APPS.TBL_MS_ADVERTISER'], > \{"metadata.broker.list": > 'localhost:9092'},valueDecoder=serializer.decode_message) > lines = kvs.map(lambda x: x[1]) > lines.pprint() > kvs.foreachRDD(handler) > > ssc.start() > ssc.awaitTermination() > > This is code we trying to pull the data from kafka topic . when we execute > through spark submit we are getting below error > > > 2018-08-03 11:10:40 INFO VerifiableProperties:68 - Property > zookeeper.connect is overridden to > 2018-08-03 11:10:40 ERROR PythonRunner:91 - Python worker exited unexpectedly > (crashed) > org.apache.spark.api.python.PythonException: Traceback (most recent call > last): > File > "/Users/KarthikeyanDurairaj/Desktop/Sparkshell/python/lib/pyspark.zip/pyspark/worker.py", > line 215, in main > eval_type = read_int(infile) > File > "/Users/KarthikeyanDurairaj/Desktop/Sparkshell/python/lib/pyspark.zip/pyspark/serializers.py", > line 685, in read_int > raise EOFError > EOFError > > at > org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:298) > at > org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:438) > at > org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:421) > at > org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:252) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source
[ https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568352#comment-16568352 ] Hyukjin Kwon commented on SPARK-24924: -- cc [~cloud_fan] since we talked about this for CSV, and [~rxin] who agreed upon not adding .avro for now, FYI. > Add mapping for built-in Avro data source > - > > Key: SPARK-24924 > URL: https://issues.apache.org/jira/browse/SPARK-24924 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 2.4.0 > > > This issue aims to the followings. > # Like `com.databricks.spark.csv` mapping, we had better map > `com.databricks.spark.avro` to built-in Avro data source. > # Remove incorrect error message, `Please find an Avro package at ...`. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24924) Add mapping for built-in Avro data source
[ https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568348#comment-16568348 ] Hyukjin Kwon edited comment on SPARK-24924 at 8/3/18 3:29 PM: -- {quote} but at the same time we aren't adding the spark.read.avro syntax so it break in that case or they get a different implementation by default? {quote} If users call this, that's still going to use the built-in implementation (https://github.com/databricks/spark-avro/blob/branch-4.0/src/main/scala/com/databricks/spark/avro/package.scala#L26) as it's a short name for {{format("com.databricks.spark.avro")}}. {quote} our internal implementation which could very well be different. {quote} It wouldn't be very different for 2.4.0. It could be different, but I guess it should be an incremental improvement without behaviour changes. {quote} I would rather just plain error out saying these conflict, either update or change your external package to use a different name. {quote} IIRC, in the past, we did this for the CSV datasource and many users complained about it. {code} java.lang.RuntimeException: Multiple sources found for csv (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat, com.databricks.spark.csv.DefaultSource15), please specify the fully qualified class name. {code} In practice, I am actually a bit more sure on the current approach since users actually complained about this a lot and now I am not seeing (so far) the complaints about the current approach. {quote} There is also the case one might be able to argue its breaking api compatilibity since .avro option went away, buts it a third party library so you can probably get away with that. {quote} It went away, so I guess if the jar is provided with the implicit import to support this, it should work as usual and use the internal implementation in theory. If the jar is not given, the .avro API is not supported and the internal implementation will be used. was (Author: hyukjin.kwon): {quote} but at the same time we aren't adding the spark.read.avro syntax so it break in that case or they get a different implementation by default? {quote} If users call this, that's still going to use the builtin implemtnation (https://github.com/databricks/spark-avro/blob/branch-4.0/src/main/scala/com/databricks/spark/avro/package.scala#L26) as it's a short name for {{format("com.databricks.spark.avro")}}. {quote} our internal implementation which could very well be different. {quote} It wouldn't be very different for 2.4.0. It could be different but I guess it should be incremental improvement without behaviour changes. {quote} I would rather just plain error out saying these conflict, either update or change your external package to use a different name. {quote} IIRC, in the past, we did for CSV datasource and many users complained about this. {code} java.lang.RuntimeException: Multiple sources found for csv (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat, com.databricks.spark.csv.DefaultSource15), please specify the fully qualified class name. {code} In practice, I am actually a bit more sure on the current approach since users actually complained about his a lot and now I am not seeing (so far) the complains about the current approach. {code} There is also the case one might be able to argue its breaking api compatilibity since .avro option went away, buts it a third party library so you can probably get away with that. 
{code} It's went away so I guess if the jar is provided with implicit import to support this, this should work as usual and use the internal implementation in theory. If the jar is not given, .avro API is not supported and the internal implmentation will be used. > Add mapping for built-in Avro data source > - > > Key: SPARK-24924 > URL: https://issues.apache.org/jira/browse/SPARK-24924 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 2.4.0 > > > This issue aims to the followings. > # Like `com.databricks.spark.csv` mapping, we had better map > `com.databricks.spark.avro` to built-in Avro data source. > # Remove incorrect error message, `Please find an Avro package at ...`. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source
[ https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568348#comment-16568348 ] Hyukjin Kwon commented on SPARK-24924: -- {quote} but at the same time we aren't adding the spark.read.avro syntax so it break in that case or they get a different implementation by default? {quote} If users call this, that's still going to use the built-in implementation (https://github.com/databricks/spark-avro/blob/branch-4.0/src/main/scala/com/databricks/spark/avro/package.scala#L26) as it's a short name for {{format("com.databricks.spark.avro")}}. {quote} our internal implementation which could very well be different. {quote} It wouldn't be very different for 2.4.0. It could be different, but I guess it should be an incremental improvement without behaviour changes. {quote} I would rather just plain error out saying these conflict, either update or change your external package to use a different name. {quote} IIRC, in the past, we did this for the CSV datasource and many users complained about it. {code} java.lang.RuntimeException: Multiple sources found for csv (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat, com.databricks.spark.csv.DefaultSource15), please specify the fully qualified class name. {code} In practice, I am actually a bit more sure on the current approach since users actually complained about this a lot and now I am not seeing (so far) the complaints about the current approach. {code} There is also the case one might be able to argue its breaking api compatilibity since .avro option went away, buts it a third party library so you can probably get away with that. {code} It went away, so I guess if the jar is provided with the implicit import to support this, it should work as usual and use the internal implementation in theory. If the jar is not given, the .avro API is not supported and the internal implementation will be used. > Add mapping for built-in Avro data source > - > > Key: SPARK-24924 > URL: https://issues.apache.org/jira/browse/SPARK-24924 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 2.4.0 > > > This issue aims to the followings. > # Like `com.databricks.spark.csv` mapping, we had better map > `com.databricks.spark.avro` to built-in Avro data source. > # Remove incorrect error message, `Please find an Avro package at ...`. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
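For illustration only (not part of the ticket): under the mapping discussed above, both of the following calls would be expected to resolve to the built-in Avro implementation in 2.4.0, assuming the spark-avro module is on the classpath; the path is a placeholder.
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("avro-mapping-sketch").getOrCreate()

// Short name for the built-in data source.
val df1 = spark.read.format("avro").load("/tmp/events.avro")

// Legacy third-party name, now mapped to the same built-in implementation.
val df2 = spark.read.format("com.databricks.spark.avro").load("/tmp/events.avro")
{code}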
[jira] [Commented] (SPARK-23937) High-order function: map_filter(map, function) → MAP
[ https://issues.apache.org/jira/browse/SPARK-23937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568337#comment-16568337 ] Apache Spark commented on SPARK-23937: -- User 'mgaido91' has created a pull request for this issue: https://github.com/apache/spark/pull/21986 > High-order function: map_filter(map, function) → MAP > -- > > Key: SPARK-23937 > URL: https://issues.apache.org/jira/browse/SPARK-23937 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Constructs a map from those entries of map for which function returns true: > {noformat} > SELECT map_filter(MAP(ARRAY[], ARRAY[]), (k, v) -> true); -- {} > SELECT map_filter(MAP(ARRAY[10, 20, 30], ARRAY['a', NULL, 'c']), (k, v) -> v > IS NOT NULL); -- {10 -> a, 30 -> c} > SELECT map_filter(MAP(ARRAY['k1', 'k2', 'k3'], ARRAY[20, 3, 15]), (k, v) -> v > > 10); -- {k1 -> 20, k3 -> 15} > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23937) High-order function: map_filter(map, function) → MAP
[ https://issues.apache.org/jira/browse/SPARK-23937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23937: Assignee: (was: Apache Spark) > High-order function: map_filter(map, function) → MAP > -- > > Key: SPARK-23937 > URL: https://issues.apache.org/jira/browse/SPARK-23937 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Constructs a map from those entries of map for which function returns true: > {noformat} > SELECT map_filter(MAP(ARRAY[], ARRAY[]), (k, v) -> true); -- {} > SELECT map_filter(MAP(ARRAY[10, 20, 30], ARRAY['a', NULL, 'c']), (k, v) -> v > IS NOT NULL); -- {10 -> a, 30 -> c} > SELECT map_filter(MAP(ARRAY['k1', 'k2', 'k3'], ARRAY[20, 3, 15]), (k, v) -> v > > 10); -- {k1 -> 20, k3 -> 15} > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23937) High-order function: map_filter(map, function) → MAP
[ https://issues.apache.org/jira/browse/SPARK-23937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23937: Assignee: Apache Spark > High-order function: map_filter(map, function) → MAP > -- > > Key: SPARK-23937 > URL: https://issues.apache.org/jira/browse/SPARK-23937 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Apache Spark >Priority: Major > > Constructs a map from those entries of map for which function returns true: > {noformat} > SELECT map_filter(MAP(ARRAY[], ARRAY[]), (k, v) -> true); -- {} > SELECT map_filter(MAP(ARRAY[10, 20, 30], ARRAY['a', NULL, 'c']), (k, v) -> v > IS NOT NULL); -- {10 -> a, 30 -> c} > SELECT map_filter(MAP(ARRAY['k1', 'k2', 'k3'], ARRAY[20, 3, 15]), (k, v) -> v > > 10); -- {k1 -> 20, k3 -> 15} > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
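As an illustrative sketch (assuming the Spark implementation follows the same {{(k, v) -> ...}} lambda syntax as the higher-order function framework referenced above), the Presto examples in the description would translate to Spark SQL roughly as follows:
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("map-filter-sketch").master("local[*]").getOrCreate()

// Keep only the entries whose value is greater than 10, analogous to the last Presto example.
spark.sql(
  "SELECT map_filter(map('k1', 20, 'k2', 3, 'k3', 15), (k, v) -> v > 10) AS filtered"
).show(false)
// Expected: {k1 -> 20, k3 -> 15}
{code}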
[jira] [Created] (SPARK-25014) When we tried to read kafka topic through spark streaming spark submit is getting failed with Python worker exited unexpectedly (crashed) error
KARTHIKEYAN RASIPALAYAM DURAIRAJ created SPARK-25014: Summary: When we tried to read kafka topic through spark streaming spark submit is getting failed with Python worker exited unexpectedly (crashed) error Key: SPARK-25014 URL: https://issues.apache.org/jira/browse/SPARK-25014 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.3.1 Reporter: KARTHIKEYAN RASIPALAYAM DURAIRAJ Fix For: 2.3.2 Hi Team , TOPIC = 'NBC_APPS.TBL_MS_ADVERTISER' PARTITION = 0 topicAndPartition = TopicAndPartition(TOPIC, PARTITION) fromOffsets1 = \{topicAndPartition:int(PARTITION)} def handler(message): records = message.collect() for record in records: value_all=record[1] value_key=record[0] # print(value_all) schema_registry_client = CachedSchemaRegistryClient(url='http://localhost:8081') serializer = MessageSerializer(schema_registry_client) sc = SparkContext(appName="PythonStreamingAvro") ssc = StreamingContext(sc, 10) kvs = KafkaUtils.createDirectStream(ssc, ['NBC_APPS.TBL_MS_ADVERTISER'], \{"metadata.broker.list": 'localhost:9092'},valueDecoder=serializer.decode_message) lines = kvs.map(lambda x: x[1]) lines.pprint() kvs.foreachRDD(handler) ssc.start() ssc.awaitTermination() This is code we trying to pull the data from kafka topic . when we execute through spark submit we are getting below error 2018-08-03 11:10:40 INFO VerifiableProperties:68 - Property zookeeper.connect is overridden to 2018-08-03 11:10:40 ERROR PythonRunner:91 - Python worker exited unexpectedly (crashed) org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/Users/KarthikeyanDurairaj/Desktop/Sparkshell/python/lib/pyspark.zip/pyspark/worker.py", line 215, in main eval_type = read_int(infile) File "/Users/KarthikeyanDurairaj/Desktop/Sparkshell/python/lib/pyspark.zip/pyspark/serializers.py", line 685, in read_int raise EOFError EOFError at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:298) at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:438) at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:421) at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:252) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14220) Build and test Spark against Scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-14220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568238#comment-16568238 ] Nick Poorman commented on SPARK-14220: -- Awesome job! (y) > Build and test Spark against Scala 2.12 > --- > > Key: SPARK-14220 > URL: https://issues.apache.org/jira/browse/SPARK-14220 > Project: Spark > Issue Type: Umbrella > Components: Build, Project Infra >Affects Versions: 2.1.0 >Reporter: Josh Rosen >Priority: Blocker > Labels: release-notes > Fix For: 2.4.0 > > > This umbrella JIRA tracks the requirements for building and testing Spark > against the current Scala 2.12 milestone. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24884) Implement regexp_extract_all
[ https://issues.apache.org/jira/browse/SPARK-24884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24884: Assignee: (was: Apache Spark) > Implement regexp_extract_all > > > Key: SPARK-24884 > URL: https://issues.apache.org/jira/browse/SPARK-24884 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Nick Nicolini >Priority: Major > > I've recently hit many cases of regexp parsing where we need to match on > something that is always arbitrary in length; for example, a text block that > looks something like: > {code:java} > AAA:WORDS| > BBB:TEXT| > MSG:ASDF| > MSG:QWER| > ... > MSG:ZXCV|{code} > Where I need to pull out all values between "MSG:" and "|", which can occur > in each instance between 1 and n times. I cannot reliably use the existing > {{regexp_extract}} method since the number of occurrences is always > arbitrary, and while I can write a UDF to handle this it'd be great if this > was supported natively in Spark. > Perhaps we can implement something like {{regexp_extract_all}} as > [Presto|https://prestodb.io/docs/current/functions/regexp.html] and > [Pig|https://pig.apache.org/docs/latest/api/org/apache/pig/builtin/REGEX_EXTRACT_ALL.html] > have? > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24884) Implement regexp_extract_all
[ https://issues.apache.org/jira/browse/SPARK-24884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568215#comment-16568215 ] Apache Spark commented on SPARK-24884: -- User 'xueyumusic' has created a pull request for this issue: https://github.com/apache/spark/pull/21985 > Implement regexp_extract_all > > > Key: SPARK-24884 > URL: https://issues.apache.org/jira/browse/SPARK-24884 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Nick Nicolini >Priority: Major > > I've recently hit many cases of regexp parsing where we need to match on > something that is always arbitrary in length; for example, a text block that > looks something like: > {code:java} > AAA:WORDS| > BBB:TEXT| > MSG:ASDF| > MSG:QWER| > ... > MSG:ZXCV|{code} > Where I need to pull out all values between "MSG:" and "|", which can occur > in each instance between 1 and n times. I cannot reliably use the existing > {{regexp_extract}} method since the number of occurrences is always > arbitrary, and while I can write a UDF to handle this it'd be great if this > was supported natively in Spark. > Perhaps we can implement something like {{regexp_extract_all}} as > [Presto|https://prestodb.io/docs/current/functions/regexp.html] and > [Pig|https://pig.apache.org/docs/latest/api/org/apache/pig/builtin/REGEX_EXTRACT_ALL.html] > have? > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24884) Implement regexp_extract_all
[ https://issues.apache.org/jira/browse/SPARK-24884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24884: Assignee: Apache Spark > Implement regexp_extract_all > > > Key: SPARK-24884 > URL: https://issues.apache.org/jira/browse/SPARK-24884 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Nick Nicolini >Assignee: Apache Spark >Priority: Major > > I've recently hit many cases of regexp parsing where we need to match on > something that is always arbitrary in length; for example, a text block that > looks something like: > {code:java} > AAA:WORDS| > BBB:TEXT| > MSG:ASDF| > MSG:QWER| > ... > MSG:ZXCV|{code} > Where I need to pull out all values between "MSG:" and "|", which can occur > in each instance between 1 and n times. I cannot reliably use the existing > {{regexp_extract}} method since the number of occurrences is always > arbitrary, and while I can write a UDF to handle this it'd be great if this > was supported natively in Spark. > Perhaps we can implement something like {{regexp_extract_all}} as > [Presto|https://prestodb.io/docs/current/functions/regexp.html] and > [Pig|https://pig.apache.org/docs/latest/api/org/apache/pig/builtin/REGEX_EXTRACT_ALL.html] > have? > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
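Until a native {{regexp_extract_all}} exists, the UDF workaround mentioned in the description could look roughly like the sketch below ({{regexpExtractAll}} is a hypothetical name, not a Spark API):
{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{lit, udf}

val spark = SparkSession.builder().appName("regexp-extract-all-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Returns every occurrence of capture group 1 (or the whole match if the pattern has no group).
val regexpExtractAll = udf { (text: String, pattern: String) =>
  if (text == null) Seq.empty[String]
  else pattern.r.findAllMatchIn(text)
    .map(m => if (m.groupCount > 0) m.group(1) else m.matched)
    .toSeq
}

val df = Seq("AAA:WORDS|BBB:TEXT|MSG:ASDF|MSG:QWER|MSG:ZXCV|").toDF("block")
df.select(regexpExtractAll($"block", lit("MSG:([^|]+)\\|")).as("msgs")).show(false)
// Expected: [ASDF, QWER, ZXCV]
{code}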
[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source
[ https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568204#comment-16568204 ] Thomas Graves commented on SPARK-24924: --- [~felixcheung] did your discussion on the same thing with csv get resolved? > Add mapping for built-in Avro data source > - > > Key: SPARK-24924 > URL: https://issues.apache.org/jira/browse/SPARK-24924 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 2.4.0 > > > This issue aims to the followings. > # Like `com.databricks.spark.csv` mapping, we had better map > `com.databricks.spark.avro` to built-in Avro data source. > # Remove incorrect error message, `Please find an Avro package at ...`. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source
[ https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568199#comment-16568199 ] Thomas Graves commented on SPARK-24924: --- Hmm, so we are adding this for ease of upgrading I guess (so the user doesn't have to change their code), but at the same time we aren't adding the spark.read.avro syntax, so it breaks in that case, or they get a different implementation by default? This doesn't make sense to me. Personally I don't like having some other add-on package names in our code at all, and here we are mapping what the user thought they would get to our internal implementation, which could very well be different. I would rather just plain error out saying these conflict: either update or change your external package to use a different name. There is also the case one might be able to argue it's breaking API compatibility since the .avro option went away, but it's a third-party library so you can probably get away with that. > Add mapping for built-in Avro data source > - > > Key: SPARK-24924 > URL: https://issues.apache.org/jira/browse/SPARK-24924 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 2.4.0 > > > This issue aims to the followings. > # Like `com.databricks.spark.csv` mapping, we had better map > `com.databricks.spark.avro` to built-in Avro data source. > # Remove incorrect error message, `Please find an Avro package at ...`. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23937) High-order function: map_filter(map, function) → MAP
[ https://issues.apache.org/jira/browse/SPARK-23937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568152#comment-16568152 ] Marco Gaido commented on SPARK-23937: - I am working on this, thanks. > High-order function: map_filter(map, function) → MAP > -- > > Key: SPARK-23937 > URL: https://issues.apache.org/jira/browse/SPARK-23937 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Constructs a map from those entries of map for which function returns true: > {noformat} > SELECT map_filter(MAP(ARRAY[], ARRAY[]), (k, v) -> true); -- {} > SELECT map_filter(MAP(ARRAY[10, 20, 30], ARRAY['a', NULL, 'c']), (k, v) -> v > IS NOT NULL); -- {10 -> a, 30 -> c} > SELECT map_filter(MAP(ARRAY['k1', 'k2', 'k3'], ARRAY[20, 3, 15]), (k, v) -> v > > 10); -- {k1 -> 20, k3 -> 15} > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24772) support reading AVRO logical types - Decimal
[ https://issues.apache.org/jira/browse/SPARK-24772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568139#comment-16568139 ] Apache Spark commented on SPARK-24772: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/21984 > support reading AVRO logical types - Decimal > > > Key: SPARK-24772 > URL: https://issues.apache.org/jira/browse/SPARK-24772 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24772) support reading AVRO logical types - Decimal
[ https://issues.apache.org/jira/browse/SPARK-24772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24772: Assignee: (was: Apache Spark) > support reading AVRO logical types - Decimal > > > Key: SPARK-24772 > URL: https://issues.apache.org/jira/browse/SPARK-24772 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24772) support reading AVRO logical types - Decimal
[ https://issues.apache.org/jira/browse/SPARK-24772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24772: Assignee: Apache Spark > support reading AVRO logical types - Decimal > > > Key: SPARK-24772 > URL: https://issues.apache.org/jira/browse/SPARK-24772 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24998) spark-sql will scan the same table repeatedly when doing multi-insert
[ https://issues.apache.org/jira/browse/SPARK-24998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ice bai updated SPARK-24998: Summary: spark-sql will scan the same table repeatedly when doing multi-insert (was: spark-sql will scan the same table repeatedly when doing multi-insert") > spark-sql will scan the same table repeatedly when doing multi-insert > - > > Key: SPARK-24998 > URL: https://issues.apache.org/jira/browse/SPARK-24998 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: ice bai >Priority: Major > Attachments: scan_table_many_times.png > > > Such as the query likes "From xx SELECT yy INSERT INTO a INSERT INTO b INSERT > INTO c ..." . > Following screenshot shows the stages: > !scan_table_many_times.png! > But, Hive only scan once. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24998) spark-sql will scan the same table repeatedly when doing multi-insert"
[ https://issues.apache.org/jira/browse/SPARK-24998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ice bai updated SPARK-24998: Description: Such as the query likes "From xx SELECT yy INSERT INTO a INSERT INTO b INSERT INTO c ..." . Following screenshot shows the stages: !scan_table_many_times.png! But, Hive only scan once. was: Such as the following screenshot: !scan_table_many_times.png! But, Hive only scan once. > spark-sql will scan the same table repeatedly when doing multi-insert" > -- > > Key: SPARK-24998 > URL: https://issues.apache.org/jira/browse/SPARK-24998 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: ice bai >Priority: Major > Attachments: scan_table_many_times.png > > > Such as the query likes "From xx SELECT yy INSERT INTO a INSERT INTO b INSERT > INTO c ..." . > Following screenshot shows the stages: > !scan_table_many_times.png! > But, Hive only scan once. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24998) spark-sql will scan the same table repeatedly when doing multi-insert"
[ https://issues.apache.org/jira/browse/SPARK-24998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ice bai updated SPARK-24998: Summary: spark-sql will scan the same table repeatedly when doing multi-insert" (was: spark-sql will scan the same table repeatedly when the query likes "From xx SELECT yy INSERT a INSERT b INSERT c ...") > spark-sql will scan the same table repeatedly when doing multi-insert" > -- > > Key: SPARK-24998 > URL: https://issues.apache.org/jira/browse/SPARK-24998 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: ice bai >Priority: Major > Attachments: scan_table_many_times.png > > > Such as the following screenshot: > !scan_table_many_times.png! > But, Hive only scan once. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
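To make the reported shape concrete, here is a minimal sketch of the multi-insert pattern the summary refers to (the table names {{src}}, {{a}}, {{b}}, {{c}} are hypothetical):
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("multi-insert-sketch").enableHiveSupport().getOrCreate()

// One source table feeding several targets; the report is that src is scanned once per INSERT.
spark.sql("""
  FROM src
  INSERT INTO TABLE a SELECT key, value
  INSERT INTO TABLE b SELECT key, value
  INSERT INTO TABLE c SELECT key, value
""")
{code}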
[jira] [Created] (SPARK-25013) JDBC urls with jdbc:mariadb don't work as expected
Dieter Vekeman created SPARK-25013: -- Summary: JDBC urls with jdbc:mariadb don't work as expected Key: SPARK-25013 URL: https://issues.apache.org/jira/browse/SPARK-25013 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.1 Reporter: Dieter Vekeman When using the MariaDB JDBC driver, the JDBC connection url should be {code:java} jdbc:mariadb://localhost:3306/DB?user=someuser&password=somepassword {code} https://mariadb.com/kb/en/library/about-mariadb-connector-j/ However, this does not work well in Spark (see below). *Workaround* The MariaDB driver also supports using mysql, which does work. The problem seems to have been described and identified in: https://jira.mariadb.org/browse/CONJ-421 Everything works well with Spark when using a connection string with {{"jdbc:mysql:..."}}, but not when using {{"jdbc:mariadb:..."}}, because the MySQL dialect is then not used. When it is not used, the default quote is {{"}}, not {{`}}. So an internal query generated by Spark like {{SELECT `i`,`ip` FROM tmp}} will then be executed as {{SELECT "i","ip" FROM tmp}} with the previously retrieved dataType, causing the exception. The author of the comment says {quote}I'll make a pull request to spark so "jdbc:mariadb:" connection string can be handle{quote} Did the pull request get lost or should a new one be made? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
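A possible user-side workaround (a sketch, not an official fix) is to register a custom JdbcDialect that treats {{jdbc:mariadb}} URLs like MySQL, so that generated queries quote identifiers with backticks:
{code:scala}
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

// Quote identifiers with backticks for jdbc:mariadb URLs, mirroring the MySQL dialect behaviour.
object MariaDbDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.toLowerCase.startsWith("jdbc:mariadb")
  override def quoteIdentifier(colName: String): String = s"`$colName`"
}

// Register once, before reading or writing through a jdbc:mariadb URL.
JdbcDialects.registerDialect(MariaDbDialect)
{code}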
[jira] [Commented] (SPARK-23932) High-order function: zip_with(array, array, function) → array
[ https://issues.apache.org/jira/browse/SPARK-23932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568059#comment-16568059 ] Takuya Ueshin commented on SPARK-23932: --- Hi [~crafty-coder], Are you still working on this? Recently I added a base framework for higher-order functions. You can use it to implement this. If you don't have enough time, I can take this over, so please let me know. Thanks! > High-order function: zip_with(array, array, function) → > array > --- > > Key: SPARK-23932 > URL: https://issues.apache.org/jira/browse/SPARK-23932 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/array.html > Merges the two given arrays, element-wise, into a single array using > function. Both arrays must be the same length. > {noformat} > SELECT zip_with(ARRAY[1, 3, 5], ARRAY['a', 'b', 'c'], (x, y) -> (y, x)); -- > [ROW('a', 1), ROW('b', 3), ROW('c', 5)] > SELECT zip_with(ARRAY[1, 2], ARRAY[3, 4], (x, y) -> x + y); -- [4, 6] > SELECT zip_with(ARRAY['a', 'b', 'c'], ARRAY['d', 'e', 'f'], (x, y) -> > concat(x, y)); -- ['ad', 'be', 'cf'] > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24987) Kafka Cached Consumer Leaking File Descriptors
[ https://issues.apache.org/jira/browse/SPARK-24987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24987: Assignee: (was: Apache Spark) > Kafka Cached Consumer Leaking File Descriptors > -- > > Key: SPARK-24987 > URL: https://issues.apache.org/jira/browse/SPARK-24987 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.0, 2.3.1 > Environment: Spark 2.3.1 > Java(TM) SE Runtime Environment (build 1.8.0_112-b15) > Java HotSpot(TM) 64-Bit Server VM (build 25.112-b15, mixed mode) > >Reporter: Yuval Itzchakov >Priority: Critical > > Setup: > * Spark 2.3.1 > * Java 1.8.0 (112) > * Standalone Cluster Manager > * 3 Nodes, 1 Executor per node. > Spark 2.3.0 introduced a new mechanism for caching Kafka consumers > (https://issues.apache.org/jira/browse/SPARK-23623?page=com.atlassian.jira.plugin.system.issuetabpanels%3Aall-tabpanel) > via KafkaDataConsumer.acquire. > It seems that there are situations (I've been trying to debug it, haven't > been able to find the root cause as of yet) where cached consumers remain "in > use" throughout the life time of the task and are never released. This can be > identified by the following line of the stack trace: > at > org.apache.spark.sql.kafka010.KafkaDataConsumer$.acquire(KafkaDataConsumer.scala:460) > Which points to: > {code:java} > } else if (existingInternalConsumer.inUse) { > // If consumer is already cached but is currently in use, then return a new > consumer > NonCachedKafkaDataConsumer(newInternalConsumer) > {code} > Meaning the existing consumer created for that `TopicPartition` is still in > use for some reason. The weird thing is that you can see this for very old > tasks which have already finished successfully. > I've traced down this leak using file leak detector, attaching it to the > running Executor JVM process. I've emitted the list of open file descriptors > which [you can find > here|https://gist.github.com/YuvalItzchakov/cdbdd7f67604557fccfbcce673c49e5d], > and you can see that the majority of them are epoll FD used by Kafka > Consumers, indicating that they aren't closing. > Spark graph: > {code:java} > kafkaStream > .load() > .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") > .as[(String, String)] > .flatMap {...} > .groupByKey(...) > .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout())(...) > .foreach(...) > .outputMode(OutputMode.Update) > .option("checkpointLocation", > sparkConfiguration.properties.checkpointDirectory) > .start() > .awaitTermination(){code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24987) Kafka Cached Consumer Leaking File Descriptors
[ https://issues.apache.org/jira/browse/SPARK-24987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24987: Assignee: Apache Spark > Kafka Cached Consumer Leaking File Descriptors > -- > > Key: SPARK-24987 > URL: https://issues.apache.org/jira/browse/SPARK-24987 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.0, 2.3.1 > Environment: Spark 2.3.1 > Java(TM) SE Runtime Environment (build 1.8.0_112-b15) > Java HotSpot(TM) 64-Bit Server VM (build 25.112-b15, mixed mode) > >Reporter: Yuval Itzchakov >Assignee: Apache Spark >Priority: Critical > > Setup: > * Spark 2.3.1 > * Java 1.8.0 (112) > * Standalone Cluster Manager > * 3 Nodes, 1 Executor per node. > Spark 2.3.0 introduced a new mechanism for caching Kafka consumers > (https://issues.apache.org/jira/browse/SPARK-23623?page=com.atlassian.jira.plugin.system.issuetabpanels%3Aall-tabpanel) > via KafkaDataConsumer.acquire. > It seems that there are situations (I've been trying to debug it, haven't > been able to find the root cause as of yet) where cached consumers remain "in > use" throughout the life time of the task and are never released. This can be > identified by the following line of the stack trace: > at > org.apache.spark.sql.kafka010.KafkaDataConsumer$.acquire(KafkaDataConsumer.scala:460) > Which points to: > {code:java} > } else if (existingInternalConsumer.inUse) { > // If consumer is already cached but is currently in use, then return a new > consumer > NonCachedKafkaDataConsumer(newInternalConsumer) > {code} > Meaning the existing consumer created for that `TopicPartition` is still in > use for some reason. The weird thing is that you can see this for very old > tasks which have already finished successfully. > I've traced down this leak using file leak detector, attaching it to the > running Executor JVM process. I've emitted the list of open file descriptors > which [you can find > here|https://gist.github.com/YuvalItzchakov/cdbdd7f67604557fccfbcce673c49e5d], > and you can see that the majority of them are epoll FD used by Kafka > Consumers, indicating that they aren't closing. > Spark graph: > {code:java} > kafkaStream > .load() > .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") > .as[(String, String)] > .flatMap {...} > .groupByKey(...) > .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout())(...) > .foreach(...) > .outputMode(OutputMode.Update) > .option("checkpointLocation", > sparkConfiguration.properties.checkpointDirectory) > .start() > .awaitTermination(){code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24987) Kafka Cached Consumer Leaking File Descriptors
[ https://issues.apache.org/jira/browse/SPARK-24987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568022#comment-16568022 ] Apache Spark commented on SPARK-24987: -- User 'YuvalItzchakov' has created a pull request for this issue: https://github.com/apache/spark/pull/21983 > Kafka Cached Consumer Leaking File Descriptors > -- > > Key: SPARK-24987 > URL: https://issues.apache.org/jira/browse/SPARK-24987 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.0, 2.3.1 > Environment: Spark 2.3.1 > Java(TM) SE Runtime Environment (build 1.8.0_112-b15) > Java HotSpot(TM) 64-Bit Server VM (build 25.112-b15, mixed mode) > >Reporter: Yuval Itzchakov >Priority: Critical > > Setup: > * Spark 2.3.1 > * Java 1.8.0 (112) > * Standalone Cluster Manager > * 3 Nodes, 1 Executor per node. > Spark 2.3.0 introduced a new mechanism for caching Kafka consumers > (https://issues.apache.org/jira/browse/SPARK-23623?page=com.atlassian.jira.plugin.system.issuetabpanels%3Aall-tabpanel) > via KafkaDataConsumer.acquire. > It seems that there are situations (I've been trying to debug it, haven't > been able to find the root cause as of yet) where cached consumers remain "in > use" throughout the life time of the task and are never released. This can be > identified by the following line of the stack trace: > at > org.apache.spark.sql.kafka010.KafkaDataConsumer$.acquire(KafkaDataConsumer.scala:460) > Which points to: > {code:java} > } else if (existingInternalConsumer.inUse) { > // If consumer is already cached but is currently in use, then return a new > consumer > NonCachedKafkaDataConsumer(newInternalConsumer) > {code} > Meaning the existing consumer created for that `TopicPartition` is still in > use for some reason. The weird thing is that you can see this for very old > tasks which have already finished successfully. > I've traced down this leak using file leak detector, attaching it to the > running Executor JVM process. I've emitted the list of open file descriptors > which [you can find > here|https://gist.github.com/YuvalItzchakov/cdbdd7f67604557fccfbcce673c49e5d], > and you can see that the majority of them are epoll FD used by Kafka > Consumers, indicating that they aren't closing. > Spark graph: > {code:java} > kafkaStream > .load() > .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") > .as[(String, String)] > .flatMap {...} > .groupByKey(...) > .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout())(...) > .foreach(...) > .outputMode(OutputMode.Update) > .option("checkpointLocation", > sparkConfiguration.properties.checkpointDirectory) > .start() > .awaitTermination(){code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24598) SPARK SQL:Datatype overflow conditions gives incorrect result
[ https://issues.apache.org/jira/browse/SPARK-24598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568005#comment-16568005 ] Marco Gaido commented on SPARK-24598: - [~smilegator] as we just enhanced the doc, but we have not really addressed the overflow condition, which I think we are targeting for a fix for 3.0, shall we leave this open for now and resolve it once the actual fix is in place? What do you think? Thanks. > SPARK SQL:Datatype overflow conditions gives incorrect result > - > > Key: SPARK-24598 > URL: https://issues.apache.org/jira/browse/SPARK-24598 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: navya >Assignee: Marco Gaido >Priority: Major > Fix For: 2.4.0 > > > Execute an sql query, so that it results in overflow conditions. > EX - SELECT 9223372036854775807 + 1 result = -9223372036854776000 > > Expected result - Error should be throw like mysql. > mysql> SELECT 9223372036854775807 + 1; > ERROR 1690 (22003): BIGINT value is out of range in '(9223372036854775807 + > 1)' -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25012) dataframe creation results in matcherror
uwe created SPARK-25012: --- Summary: dataframe creation results in matcherror Key: SPARK-25012 URL: https://issues.apache.org/jira/browse/SPARK-25012 Project: Spark Issue Type: Bug Components: Input/Output Affects Versions: 2.3.1 Environment: spark 2.3.1 mac scala 2.11.12 Reporter: uwe hi, running the attached code results in a {code:java} scala.MatchError: 2017-02-09 00:09:27.0 (of class java.sql.Timestamp) {code} # i do think this is wrong (at least i do not see the issue in my code) # the error is the ein 90% of the cases (it sometimes passes). that makes me think something weird is going on {code:java} package misc import java.sql.Timestamp import java.time.LocalDateTime import java.time.format.DateTimeFormatter import org.apache.spark.rdd.RDD import org.apache.spark.sql.sources._ import org.apache.spark.sql.types.{StringType, StructField, StructType, TimestampType} import org.apache.spark.sql.{Row, SQLContext, SparkSession} case class LogRecord(application:String, dateTime: Timestamp, component: String, level: String, message: String) class LogRelation(val sqlContext: SQLContext, val path: String) extends BaseRelation with PrunedFilteredScan { override def schema: StructType = StructType(Seq( StructField("application", StringType, false), StructField("dateTime", TimestampType, false), StructField("component", StringType, false), StructField("level", StringType, false), StructField("message", StringType, false))) override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = { val str = "2017-02-09T00:09:27" val ts =Timestamp.valueOf(LocalDateTime.parse(str, DateTimeFormatter.ISO_LOCAL_DATE_TIME)) val data=List(Row("app",ts,"comp","level","mess"),Row("app",ts,"comp","level","mess")) sqlContext.sparkContext.parallelize(data) } } class LogDataSource extends DataSourceRegister with RelationProvider { override def shortName(): String = "log" override def createRelation(sqlContext: SQLContext, parameters: Map[String, String]): BaseRelation = new LogRelation(sqlContext, parameters("path")) } object f0 extends App { lazy val spark: SparkSession = SparkSession.builder().master("local").appName("spark session").getOrCreate() val df = spark.read.format("log").load("hdfs:///logs") df.show() } {code} results in the following stacktrace {noformat} 11:20:06 [task-result-getter-0] ERROR o.a.spark.scheduler.TaskSetManager - Task 0 in stage 0.0 failed 1 times; aborting job Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): scala.MatchError: 2017-02-09 00:09:27.0 (of class java.sql.Timestamp) at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:276) at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:275) at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103) at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:379) at org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$3.apply(ExistingRDD.scala:60) at org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$3.apply(ExistingRDD.scala:57) at scala.collection.Iterator$$anon$11.next(Iterator.scala:410) at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:109) at org.apache.spark.executor.Executor
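One detail worth checking (an assumption on my part, not confirmed in the report): {{PrunedFilteredScan.buildScan}} is expected to return rows containing only {{requiredColumns}}, in that order, while the snippet above always returns all five columns. When the requested column order differs from the full schema, a Timestamp can land in a position Spark converts as a String, which would match this MatchError and its intermittent nature. A sketch of a buildScan that projects to the requested columns (a drop-in for the method in the snippet above, relying on the imports already in that file):
{code:scala}
override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
  val str = "2017-02-09T00:09:27"
  val ts = Timestamp.valueOf(LocalDateTime.parse(str, DateTimeFormatter.ISO_LOCAL_DATE_TIME))
  val fullRows = List(Row("app", ts, "comp", "level", "mess"), Row("app", ts, "comp", "level", "mess"))

  // Project each full row down to the requested columns, in the requested order.
  val indices = requiredColumns.map(schema.fieldNames.indexOf(_))
  val data = fullRows.map(row => Row.fromSeq(indices.map(row.get).toSeq))
  sqlContext.sparkContext.parallelize(data)
}
{code}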
[jira] [Commented] (SPARK-24928) spark sql cross join running time too long
[ https://issues.apache.org/jira/browse/SPARK-24928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16567986#comment-16567986 ] LIFULONG commented on SPARK-24928: -- The code {{for (x <- rdd1.iterator(currSplit.s1, context); y <- rdd2.iterator(currSplit.s2, context)) yield (x, y)}} is from the CartesianRDD.compute() method; it looks like it will reload the right RDD from the text file for each record in the left RDD. > spark sql cross join running time too long > -- > > Key: SPARK-24928 > URL: https://issues.apache.org/jira/browse/SPARK-24928 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 1.6.2 >Reporter: LIFULONG >Priority: Minor > > spark sql running time is too long while input left table and right table is > small hdfs text format data, > the sql is: select * from t1 cross join t2 > the line of t1 is 49, three column > the line of t2 is 1, one column only > running more than 30mins and then failed > > > spark CartesianRDD also has the same problem, example test code is: > val ones = sc.textFile("hdfs://host:port/data/cartesian_data/t1b") //1 line > 1 column > val twos = sc.textFile("hdfs://host:port/data/cartesian_data/t2b") //49 > line 3 column > val cartesian = new CartesianRDD(sc, twos, ones) > cartesian.count() > running more than 5 mins,while use CartesianRDD(sc, ones, twos) , it only use > less than 10 seconds -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
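If that observation holds, a user-side mitigation (a sketch under that assumption, not a verified fix) is to persist the inner RDD so it is served from the block manager instead of being re-read from HDFS for every record of the outer RDD; the paths are the ones from the issue description:
{code:scala}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("cartesian-cache-sketch").setMaster("local[*]"))

// Cache and materialize the inner (right-hand) RDD before the cartesian product.
val ones = sc.textFile("hdfs://host:port/data/cartesian_data/t1b").cache()
ones.count()
val twos = sc.textFile("hdfs://host:port/data/cartesian_data/t2b")

// Equivalent to new CartesianRDD(sc, twos, ones), but the inner RDD now comes from cache.
val cartesian = twos.cartesian(ones)
println(cartesian.count())
{code}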
[jira] [Commented] (SPARK-23911) High-order function: reduce(array, initialState S, inputFunction, outputFunction) → R
[ https://issues.apache.org/jira/browse/SPARK-23911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16567963#comment-16567963 ] Apache Spark commented on SPARK-23911: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/21982 > High-order function: reduce(array, initialState S, inputFunction, > outputFunction) → R > --- > > Key: SPARK-23911 > URL: https://issues.apache.org/jira/browse/SPARK-23911 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Herman van Hovell >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/array.html > Returns a single value reduced from array. inputFunction will be invoked for > each element in array in order. In addition to taking the element, > inputFunction takes the current state, initially initialState, and returns > the new state. outputFunction will be invoked to turn the final state into > the result value. It may be the identity function (i -> i). > {noformat} > SELECT reduce(ARRAY [], 0, (s, x) -> s + x, s -> s); -- 0 > SELECT reduce(ARRAY [5, 20, 50], 0, (s, x) -> s + x, s -> s); -- 75 > SELECT reduce(ARRAY [5, 20, NULL, 50], 0, (s, x) -> s + x, s -> s); -- NULL > SELECT reduce(ARRAY [5, 20, NULL, 50], 0, (s, x) -> s + COALESCE(x, 0), s -> > s); -- 75 > SELECT reduce(ARRAY [5, 20, NULL, 50], 0, (s, x) -> IF(x IS NULL, s, s + x), > s -> s); -- 75 > SELECT reduce(ARRAY [2147483647, 1], CAST (0 AS BIGINT), (s, x) -> s + x, s > -> s); -- 2147483648 > SELECT reduce(ARRAY [5, 6, 10, 20], -- calculates arithmetic average: 10.25 > CAST(ROW(0.0, 0) AS ROW(sum DOUBLE, count INTEGER)), > (s, x) -> CAST(ROW(x + s.sum, s.count + 1) AS ROW(sum DOUBLE, > count INTEGER)), > s -> IF(s.count = 0, NULL, s.sum / s.count)); > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23909) High-order function: filter(array, function) → array
[ https://issues.apache.org/jira/browse/SPARK-23909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16567941#comment-16567941 ] Apache Spark commented on SPARK-23909: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/21982 > High-order function: filter(array, function) → array > -- > > Key: SPARK-23909 > URL: https://issues.apache.org/jira/browse/SPARK-23909 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Herman van Hovell >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/array.html > Constructs an array from those elements of array for which function returns > true: > {noformat} > SELECT filter(ARRAY [], x -> true); -- [] > SELECT filter(ARRAY [5, -6, NULL, 7], x -> x > 0); -- [5, 7] > SELECT filter(ARRAY [5, NULL, 7, NULL], x -> x IS NOT NULL); -- [5, 7] > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24993) Make Avro fast again
[ https://issues.apache.org/jira/browse/SPARK-24993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai resolved SPARK-24993. - Resolution: Fixed Fix Version/s: 2.4.0 > Make Avro fast again > > > Key: SPARK-24993 > URL: https://issues.apache.org/jira/browse/SPARK-24993 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: DB Tsai >Priority: Major > Fix For: 2.4.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24993) Make Avro fast again
[ https://issues.apache.org/jira/browse/SPARK-24993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai updated SPARK-24993: Affects Version/s: (was: 2.3.0) 2.4.0 > Make Avro fast again > > > Key: SPARK-24993 > URL: https://issues.apache.org/jira/browse/SPARK-24993 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: DB Tsai >Priority: Major > Fix For: 2.4.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24989) BlockFetcher should retry while getting OutOfDirectMemoryError
[ https://issues.apache.org/jira/browse/SPARK-24989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Yuanjian resolved SPARK-24989. - Resolution: Not A Problem The param `spark.reducer.maxBlocksInFlightPerAddress` added in SPARK-21243 can solve this problem, close this jira. Thanks [~Dhruve Ashar] ! > BlockFetcher should retry while getting OutOfDirectMemoryError > -- > > Key: SPARK-24989 > URL: https://issues.apache.org/jira/browse/SPARK-24989 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 2.2.0 >Reporter: Li Yuanjian >Priority: Major > Attachments: FailedStage.png > > > h3. Description > This problem can be reproduced stably by a large parallelism job migrate from > map reduce to Spark in our practice, some metrics list below: > ||Item||Value|| > |spark.executor.instances|1000| > |spark.executor.cores|5| > |task number of shuffle writer stage|18038| > |task number of shuffle reader stage|8| > While the shuffle writer stage successful ended, the shuffle reader stage > starting and keep failing by FetchFail. Each fetch request need the netty > sever allocate a buffer in 16MB(detailed stack attached below), the huge > amount of fetch request will use up default maxDirectMemory rapidly, even > though we bump up io.netty.maxDirectMemory to 50GB! > {code:java} > org.apache.spark.shuffle.FetchFailedException: failed to allocate 16777216 > byte(s) of direct memory (used: 21474836480, max: 21474836480) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:514) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:445) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:61) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at > org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:199) > at > org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:119) > at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:105) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: io.netty.util.internal.OutOfDirectMemoryError: failed to allocate > 16777216 byte(s) of direct memory (use
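For reference, a minimal sketch of applying the parameter mentioned in the resolution; the value 64 is only an illustration and should be tuned for the job in question:
{code:scala}
import org.apache.spark.sql.SparkSession

// Bound how many shuffle blocks a reducer may fetch from a single address at once,
// so the number of outstanding fetch buffers stays limited.
val spark = SparkSession.builder()
  .appName("fetch-throttle-sketch")
  .config("spark.reducer.maxBlocksInFlightPerAddress", "64")
  .getOrCreate()
{code}
The same setting can also be passed on spark-submit via {{--conf spark.reducer.maxBlocksInFlightPerAddress=64}}.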
[jira] [Resolved] (SPARK-25009) Standalone Cluster mode application submit is not working
[ https://issues.apache.org/jira/browse/SPARK-25009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai resolved SPARK-25009. - Resolution: Fixed Assignee: Devaraj K Fix Version/s: 2.4.0 > Standalone Cluster mode application submit is not working > - > > Key: SPARK-25009 > URL: https://issues.apache.org/jira/browse/SPARK-25009 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Devaraj K >Assignee: Devaraj K >Priority: Critical > Fix For: 2.4.0 > > > It is not showing any error while submitting but the app is not running and > as well as not showing in the web UI. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25011) Add PrefixSpan to __all__
[ https://issues.apache.org/jira/browse/SPARK-25011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-25011. -- Resolution: Fixed Assignee: yuhao yang Fix Version/s: 2.4.0 Fixed in https://github.com/apache/spark/pull/21981 > Add PrefixSpan to __all__ > - > > Key: SPARK-25011 > URL: https://issues.apache.org/jira/browse/SPARK-25011 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.4.0 >Reporter: yuhao yang >Assignee: yuhao yang >Priority: Minor > Fix For: 2.4.0 > > > Add PrefixSpan to __all__ in fpm.py -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org