[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source

2018-08-03 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16569104#comment-16569104
 ] 

Hyukjin Kwon commented on SPARK-24924:
--

For a fully qualified path, we can already specify something like 
{{com.databricks.spark.avro.AvroFormat}}, and I guess that will use the 
third-party one if I am not mistaken.
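
To make this concrete, a rough sketch (the paths are made up, {{spark}} is a 
SparkSession, and the class name is just the one written above):

{code:scala}
// Short name: goes through the data source registry / short-name mapping that
// this issue is about; with both implementations present, the built-in one wins.
val byShortName = spark.read.format("avro").load("/tmp/events.avro")

// Fully qualified class name: names one concrete implementation class directly,
// bypassing the short-name mapping.
val byClassName = spark.read
  .format("com.databricks.spark.avro.AvroFormat")
  .load("/tmp/events.avro")
{code}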

Probably we should not do this, but this is what we already do with CSV, which 
kind of makes a point as well. Wouldn't it be better to just follow what we do 
there?

If we want to raise an error for this case, I guess it should target 3.0.0 for 
CSV, and this PR should be reverted.

> Add mapping for built-in Avro data source
> -
>
> Key: SPARK-24924
> URL: https://issues.apache.org/jira/browse/SPARK-24924
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 2.4.0
>
>
> This issue aims at the following:
>  # Like the `com.databricks.spark.csv` mapping, we had better map 
> `com.databricks.spark.avro` to the built-in Avro data source.
>  # Remove the incorrect error message, `Please find an Avro package at ...`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14220) Build and test Spark against Scala 2.12

2018-08-03 Thread Rifaqat Shah (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-14220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16569088#comment-16569088
 ] 

Rifaqat Shah commented on SPARK-14220:
--

Great! Congratulations, and thanks.

> Build and test Spark against Scala 2.12
> ---
>
> Key: SPARK-14220
> URL: https://issues.apache.org/jira/browse/SPARK-14220
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Project Infra
>Affects Versions: 2.1.0
>Reporter: Josh Rosen
>Priority: Blocker
>  Labels: release-notes
> Fix For: 2.4.0
>
>
> This umbrella JIRA tracks the requirements for building and testing Spark 
> against the current Scala 2.12 milestone.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source

2018-08-03 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16569085#comment-16569085
 ] 

Wenchen Fan commented on SPARK-24924:
-

When the short name conflicts, I feel it's better to pick the built-in data 
source than to fail the job and say it conflicts. When the full class name of 
the data source is specified, like com.databricks.spark.avro, we should respect 
it.

> Add mapping for built-in Avro data source
> -
>
> Key: SPARK-24924
> URL: https://issues.apache.org/jira/browse/SPARK-24924
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 2.4.0
>
>
> This issue aims at the following:
>  # Like the `com.databricks.spark.csv` mapping, we had better map 
> `com.databricks.spark.avro` to the built-in Avro data source.
>  # Remove the incorrect error message, `Please find an Avro package at ...`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24940) Coalesce Hint for SQL Queries

2018-08-03 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-24940.
-
   Resolution: Fixed
 Assignee: John Zhuge
Fix Version/s: 2.4.0

> Coalesce Hint for SQL Queries
> -
>
> Key: SPARK-24940
> URL: https://issues.apache.org/jira/browse/SPARK-24940
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: John Zhuge
>Assignee: John Zhuge
>Priority: Major
> Fix For: 2.4.0
>
>
> Many Spark SQL users in my company have asked for a way to control the number 
> of output files in Spark SQL. The users prefer not to use the functions 
> repartition\(n\) or coalesce(n, shuffle), which require them to write and 
> deploy Scala/Java/Python code.
>   
>  There are use cases for either reducing or increasing the number of files.
>   
>  The DataFrame API has had repartition/coalesce for a long time. However, we do 
> not have equivalent functionality in SQL queries. We propose adding the 
> following Hive-style Coalesce hint to Spark SQL.
> {noformat}
> /*+ COALESCE(n, shuffle) */
> /*+ REPARTITION(n) */
> {noformat}
> REPARTITION\(n\) is equal to COALESCE(n, shuffle=true).
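> A usage sketch (the table name {{t}} and output paths are made up), assuming the 
> syntax above:
> {code:scala}
> // Coalesce the query result down to 2 partitions, i.e. 2 output files.
> spark.sql("SELECT /*+ COALESCE(2) */ * FROM t")
>   .write.parquet("/tmp/coalesced_output")
> // Repartition (with a shuffle) to 10 partitions before writing.
> spark.sql("SELECT /*+ REPARTITION(10) */ * FROM t")
>   .write.parquet("/tmp/repartitioned_output")
> {code}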



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24722) Column-based API for pivoting

2018-08-03 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24722.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21699
[https://github.com/apache/spark/pull/21699]

> Column-based API for pivoting
> -
>
> Key: SPARK-24722
> URL: https://issues.apache.org/jira/browse/SPARK-24722
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 2.4.0
>
>
> Currently, the pivot() function accepts the pivot column as a string. It is 
> not consistent with the groupBy API and causes an additional problem when 
> using nested columns as the pivot column.
> `Column` support is needed for (a) API consistency, (b) user productivity, and 
> (c) performance. In general, we should follow the POLA - 
> https://en.wikipedia.org/wiki/Principle_of_least_astonishment - in designing 
> the API.
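> For illustration, a short sketch (the DataFrame {{df}} and the column names are 
> made up) of the existing string-based call next to the proposed Column-based one:
> {code:scala}
> import org.apache.spark.sql.functions.col
> // Existing API: the pivot column can only be named by a plain string.
> df.groupBy("year").pivot("course").sum("earnings")
> // Column-based API: pass a Column, which also works for a nested field.
> df.groupBy(col("year")).pivot(col("training.course")).sum("earnings")
> {code}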



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24722) Column-based API for pivoting

2018-08-03 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-24722:


Assignee: Maxim Gekk

> Column-based API for pivoting
> -
>
> Key: SPARK-24722
> URL: https://issues.apache.org/jira/browse/SPARK-24722
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 2.4.0
>
>
> Currently, the pivot() function accepts the pivot column as a string. It is 
> not consistent with the groupBy API and causes an additional problem when 
> using nested columns as the pivot column.
> `Column` support is needed for (a) API consistency, (b) user productivity, and 
> (c) performance. In general, we should follow the POLA - 
> https://en.wikipedia.org/wiki/Principle_of_least_astonishment - in designing 
> the API.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24924) Add mapping for built-in Avro data source

2018-08-03 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16569071#comment-16569071
 ] 

Hyukjin Kwon edited comment on SPARK-24924 at 8/4/18 5:44 AM:
--

If it already throws an error for the CSV case too, I would prefer to have the 
improved error message, of course.

{quote}
I don't buy this argument, the code has been restructured a lot and you could 
have introduced bugs, behavior changes, etc.
{quote}

I have followed the changes in Avro and I don't think there are big 
differences. We should keep the behaviours the same, in particular within 
2.4.0. If I missed something and this introduced a bug or behaviour changes, I 
personally think we should fix them within 2.4.0. That was one of the key 
things I took into account when I merged some changes.

{quote}
Users could have also made their own modified version of the databricks 
spark-avro package (which we actually have to support primitive types) and thus 
the implementation is not the same and yet you are assuming it is.  
{quote}

In this case, users should provide their own short name for the package. I 
would say it's discouraged to use the same name as Spark's built-in data 
sources, or other reserved package names - I wonder whether users would 
actually try to use the same name in practice.

{quote}
 I'm worried about other users who didn't happen to see this jira.
{quote}

We will mention this in the release notes - I think I listed the possible 
scenarios about this in 
https://issues.apache.org/jira/browse/SPARK-24924?focusedCommentId=16567708&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16567708

{quote}
I also realize these are 3rd party packages but I think we are making the 
assumption here based on this being a databricks package, which in my opinion 
we shouldn't.   What if this was companyX package which we didn't know about, 
what would/should be the expected behavior? 
{quote}

I think the main reason for this is that the code is actually ported from the 
{{com.databricks.\*}} Avro package. The concern here is that 
{{com.databricks.*}} would now indicate the built-in Avro, right? 

{quote}
How many users complained about the csv thing? 
{quote}

For instance, I saw these comments/issues below:

https://github.com/databricks/spark-csv/issues/367
https://github.com/databricks/spark-csv/issues/373
https://github.com/apache/spark/pull/17916#issuecomment-301898567

For clarification, it's not personally related to me in any way at all; I just 
thought we had better keep it consistent with CSV's behaviour.
To sum up, I get your position, but I think the current approach makes a 
coherent point too. In that case, I think we had better follow what we have 
done with CSV.




was (Author: hyukjin.kwon):
If it already throws an error for the CSV case too, I would prefer to have the 
improved error message, of course.

{quote}
I don't buy this argument, the code has been restructured a lot and you could 
have introduced bugs, behavior changes, etc.
{quote}

I have followed the changes in Avro and I don't think there are big 
differences. We should keep the behaviours the same, in particular within 
2.4.0. If I missed something and this introduced a bug or behaviour changes, I 
personally think we should fix them within 2.4.0. That was one of the key 
things I took into account when I merged some changes.

{quote}
Users could have also made their own modified version of the databricks 
spark-avro package (which we actually have to support primitive types) and thus 
the implementation is not the same and yet you are assuming it is.  
{quote}

In this case, users should provide their own short name for the package. I 
would say it's discouraged to use the same name as Spark's built-in data 
sources, or other reserved package names - I wonder whether users would 
actually try to use the same name in practice.

{quote}
 I'm worried about other users who didn't happen to see this jira.
{quote}

We will mention this in the release notes - I think I listed the possible 
scenarios about this in 
https://issues.apache.org/jira/browse/SPARK-24924?focusedCommentId=16567708&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16567708

{quote}
I also realize these are 3rd party packages but I think we are making the 
assumption here based on this being a databricks package, which in my opinion 
we shouldn't.   What if this was companyX package which we didn't know about, 
what would/should be the expected behavior? 
{quote}

I think the main reason for this is that the code is actually ported from the 
{{com.databricks.\*}} Avro package. The concern here is that 
{{com.databricks.*}} would now indicate the built-in Avro, right? 

{quote}
How many users complained about the csv thing? 
{quote}

For instance, I saw these comments/issues below:

https://github.com/databricks/spark-csv/issues/367
https://github.com/databricks/spark-csv/issues/373
https://github.co

[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source

2018-08-03 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16569072#comment-16569072
 ] 

Hyukjin Kwon commented on SPARK-24924:
--

Also, for clarification, we already issue warnings:

{code}
17/05/10 09:47:44 WARN DataSource: Multiple sources found for csv 
(org.apache.spark.sql.execution.datasources.csv.CSVFileFormat,
com.databricks.spark.csv.DefaultSource15), defaulting to the internal 
datasource (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat).
{code}

So, I guess it effectively comes down to error vs. warning.

> Add mapping for built-in Avro data source
> -
>
> Key: SPARK-24924
> URL: https://issues.apache.org/jira/browse/SPARK-24924
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 2.4.0
>
>
> This issue aims at the following:
>  # Like the `com.databricks.spark.csv` mapping, we had better map 
> `com.databricks.spark.avro` to the built-in Avro data source.
>  # Remove the incorrect error message, `Please find an Avro package at ...`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24924) Add mapping for built-in Avro data source

2018-08-03 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16569071#comment-16569071
 ] 

Hyukjin Kwon edited comment on SPARK-24924 at 8/4/18 5:42 AM:
--

If it already throws an error for the CSV case too, I would prefer to have the 
improved error message, of course.

{quote}
I don't buy this argument, the code has been restructured a lot and you could 
have introduced bugs, behavior changes, etc.
{quote}

I have followed the changes in Avro and I don't think there are big 
differences. We should keep the behaviours the same, in particular within 
2.4.0. If I missed something and this introduced a bug or behaviour changes, I 
personally think we should fix them within 2.4.0. That was one of the key 
things I took into account when I merged some changes.

{quote}
Users could have also made their own modified version of the databricks 
spark-avro package (which we actually have to support primitive types) and thus 
the implementation is not the same and yet you are assuming it is.  
{quote}

In this case, users should provide their own short name for the package. I 
would say it's discouraged to use the same name as Spark's built-in data 
sources, or other reserved package names - I wonder whether users would 
actually try to use the same name in practice.

{quote}
 I'm worried about other users who didn't happen to see this jira.
{quote}

We will mention this in the release notes - I think I listed the possible 
scenarios about this in 
https://issues.apache.org/jira/browse/SPARK-24924?focusedCommentId=16567708&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16567708

{quote}
I also realize these are 3rd party packages but I think we are making the 
assumption here based on this being a databricks package, which in my opinion 
we shouldn't.   What if this was companyX package which we didn't know about, 
what would/should be the expected behavior? 
{quote}

I think the main reason for this is that the code is actually ported from the 
{{com.databricks.\*}} Avro package. The concern here is that 
{{com.databricks.*}} would now indicate the built-in Avro, right? 

{quote}
How many users complained about the csv thing? 
{quote}

For instance, I saw these comments/issues below:

https://github.com/databricks/spark-csv/issues/367
https://github.com/databricks/spark-csv/issues/373
https://github.com/apache/spark/pull/17916#issuecomment-301898567

For clarification, it's not related to me in any way; I just thought we had 
better keep it consistent with CSV's behaviour.
To sum up, I get your position, but I think the current approach makes a 
coherent point too. In that case, I think we had better follow what we have 
done with CSV.




was (Author: hyukjin.kwon):
If it already throws an error for the CSV case too, I would prefer to have the 
improved error message, of course.

{quote}
I don't buy this argument, the code has been restructured a lot and you could 
have introduced bugs, behavior changes, etc.
{quote}

I have followed the changes in Avro and I don't think there are big 
differences. We should keep the behaviours the same, in particular within 
2.4.0. If I missed something and this introduced a bug or behaviour changes, I 
personally think we should fix them within 2.4.0. That was one of the key 
things I took into account when I merged some changes.

{quote}
Users could have also made their own modified version of the databricks 
spark-avro package (which we actually have to support primitive types) and thus 
the implementation is not the same and yet you are assuming it is.  
{quote}

In this case, users should provide their own short name for the package. I 
would say it's discouraged to use the same name as Spark's built-in data 
sources, or other reserved package names - I wonder whether users would 
actually try to use the same name in practice.

{quote}
 I'm worried about other users who didn't happen to see this jira.
{quote}

We will mention this in the release notes - I think I listed the possible 
scenarios about this in 
https://issues.apache.org/jira/browse/SPARK-24924?focusedCommentId=16567708&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16567708

{quote}
I also realize these are 3rd party packages but I think we are making the 
assumption here based on this being a databricks package, which in my opinion 
we shouldn't.   What if this was companyX package which we didn't know about, 
what would/should be the expected behavior? 
{quote}

I think the main reason for this is that the code is actually ported from the 
{{com.databricks.\*}} Avro package. The concern here is that 
{{com.databricks.*}} would now indicate the built-in Avro, right? 

{quote}
How many users complained about the csv thing? 
{quote}

So far, I see some issues as below:

https://github.com/databricks/spark-csv/issues/367
https://github.com/databricks/spark-csv/issues/373
https://github.com/apache/spark/pull/17916#issuecomm

[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source

2018-08-03 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16569071#comment-16569071
 ] 

Hyukjin Kwon commented on SPARK-24924:
--

If it already throws an error for the CSV case too, I would prefer to have the 
improved error message, of course.

{quote}
I don't buy this argument, the code has been restructured a lot and you could 
have introduced bugs, behavior changes, etc.
{quote}

I have followed the changes in Avro and I don't think there are big 
differences. We should keep the behaviours the same, in particular within 
2.4.0. If I missed something and this introduced a bug or behaviour changes, I 
personally think we should fix them within 2.4.0. That was one of the key 
things I took into account when I merged some changes.

{quote}
Users could have also made their own modified version of the databricks 
spark-avro package (which we actually have to support primitive types) and thus 
the implementation is not the same and yet you are assuming it is.  
{quote}

In this case, users should provide their own short name for the package. I 
would say it's discouraged to use the same name as Spark's built-in data 
sources, or other reserved package names - I wonder whether users would 
actually try to use the same name in practice.

{quote}
 I'm worried about other users who didn't happen to see this jira.
{quote}

We will mention this in the release notes - I think I listed the possible 
scenarios about this in 
https://issues.apache.org/jira/browse/SPARK-24924?focusedCommentId=16567708&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16567708

{quote}
I also realize these are 3rd party packages but I think we are making the 
assumption here based on this being a databricks package, which in my opinion 
we shouldn't.   What if this was companyX package which we didn't know about, 
what would/should be the expected behavior? 
{quote}

I think the main reason for this is that the code is actually ported from the 
{{com.databricks.\*}} Avro package. The concern here is that 
{{com.databricks.*}} would now indicate the built-in Avro, right? 

{quote}
How many users complained about the csv thing? 
{quote}

So far, I see some issues as below:

https://github.com/databricks/spark-csv/issues/367
https://github.com/databricks/spark-csv/issues/373
https://github.com/apache/spark/pull/17916#issuecomment-301898567

For clarification, it's not related to me in any way; I just thought we had 
better keep it consistent with CSV's behaviour.
To sum up, I get your position, but I think the current approach makes a 
coherent point too. In that case, I think we had better follow what we have 
done with CSV.



> Add mapping for built-in Avro data source
> -
>
> Key: SPARK-24924
> URL: https://issues.apache.org/jira/browse/SPARK-24924
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 2.4.0
>
>
> This issue aims at the following:
>  # Like the `com.databricks.spark.csv` mapping, we had better map 
> `com.databricks.spark.avro` to the built-in Avro data source.
>  # Remove the incorrect error message, `Please find an Avro package at ...`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24888) spark-submit --master spark://host:port --status driver-id does not work

2018-08-03 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24888:


Assignee: (was: Apache Spark)

> spark-submit --master spark://host:port --status driver-id does not work 
> -
>
> Key: SPARK-24888
> URL: https://issues.apache.org/jira/browse/SPARK-24888
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.3.1
>Reporter: srinivasan
>Priority: Major
>
> spark-submit --master spark://host:port --status driver-id
> does not return anything. The command terminates without any error or output.
> The behaviour is the same on Linux and Windows.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24888) spark-submit --master spark://host:port --status driver-id does not work

2018-08-03 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568969#comment-16568969
 ] 

Apache Spark commented on SPARK-24888:
--

User 'devaraj-kavali' has created a pull request for this issue:
https://github.com/apache/spark/pull/21996

> spark-submit --master spark://host:port --status driver-id does not work 
> -
>
> Key: SPARK-24888
> URL: https://issues.apache.org/jira/browse/SPARK-24888
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.3.1
>Reporter: srinivasan
>Priority: Major
>
> spark-submit --master spark://host:port --status driver-id
> does not return anything. The command terminates without any error or output.
> The behaviour is the same on Linux and Windows.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24888) spark-submit --master spark://host:port --status driver-id does not work

2018-08-03 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24888:


Assignee: Apache Spark

> spark-submit --master spark://host:port --status driver-id does not work 
> -
>
> Key: SPARK-24888
> URL: https://issues.apache.org/jira/browse/SPARK-24888
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.3.1
>Reporter: srinivasan
>Assignee: Apache Spark
>Priority: Major
>
> spark-submit --master spark://host:port --status driver-id
> does not return anything. The command terminates without any error or output.
> The behaviour is the same on Linux and Windows.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24928) spark sql cross join running time too long

2018-08-03 Thread Matthew Normyle (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568893#comment-16568893
 ] 

Matthew Normyle commented on SPARK-24928:
-

In CartesianRDD.compute, changing:

{code}
for (x <- rdd1.iterator(currSplit.s1, context);
     y <- rdd2.iterator(currSplit.s2, context)) yield (x, y)
{code}

to:

{code}
val it1 = rdd1.iterator(currSplit.s1, context)
val it2 = rdd2.iterator(currSplit.s2, context)

for (x <- it1; y <- it2) yield (x, y)
{code}

Seems to resolve this issue.

I am brand new to Scala and Spark. Does anyone have any insight as to why this 
seemingly superficial change could make such a large difference?

> spark sql cross join running time too long
> --
>
> Key: SPARK-24928
> URL: https://issues.apache.org/jira/browse/SPARK-24928
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 1.6.2
>Reporter: LIFULONG
>Priority: Minor
>
> Spark SQL running time is too long when the input left table and right table 
> are small HDFS text-format data.
> The SQL is:  select * from t1 cross join t2  
> t1 has 49 lines and three columns.
> t2 has 1 line and one column only.
> It ran for more than 30 minutes and then failed.
>  
>  
> Spark's CartesianRDD also has the same problem; example test code is:
> val ones = sc.textFile("hdfs://host:port/data/cartesian_data/t1b")  //1 line 
> 1 column
>  val twos = sc.textFile("hdfs://host:port/data/cartesian_data/t2b")  //49 
> line 3 column
>  val cartesian = new CartesianRDD(sc, twos, ones)
> cartesian.count()
> This runs for more than 5 minutes, while using CartesianRDD(sc, ones, twos) 
> takes less than 10 seconds.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 0.10.0.1 to 2.0.0

2018-08-03 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568863#comment-16568863
 ] 

Apache Spark commented on SPARK-18057:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/21995

> Update structured streaming kafka from 0.10.0.1 to 2.0.0
> 
>
> Key: SPARK-18057
> URL: https://issues.apache.org/jira/browse/SPARK-18057
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>Assignee: Ted Yu
>Priority: Major
> Fix For: 2.4.0
>
>
> There are a couple of relevant KIPs here, 
> https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25024) Update mesos documentation to be clear about security supported

2018-08-03 Thread Imran Rashid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568810#comment-16568810
 ] 

Imran Rashid commented on SPARK-25024:
--

[~rvesse] [~arand] maybe you could take a stab at this based on your work on 
SPARK-16501?

Also pinging some other folks knowledgeable about mesos: [~skonto] [~tnachen]

> Update mesos documentation to be clear about security supported
> ---
>
> Key: SPARK-25024
> URL: https://issues.apache.org/jira/browse/SPARK-25024
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.2.2
>Reporter: Thomas Graves
>Priority: Major
>
> I was reading through our Mesos deployment docs and security docs, and it's not 
> clear at all what type of security is supported and how to set it up for Mesos.  
> I think we should clarify this and spell out exactly what is supported and 
> what is not.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24529) Add spotbugs into maven build process

2018-08-03 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24529:


Assignee: Kazuaki Ishizaki  (was: Apache Spark)

> Add spotbugs into maven build process
> -
>
> Key: SPARK-24529
> URL: https://issues.apache.org/jira/browse/SPARK-24529
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
>Priority: Minor
>
> We will enable a Java bytecode check tool 
> [spotbugs|https://spotbugs.github.io/] to avoid possible integer overflow at 
> multiplication. Due to the tool limitation, some other checks will be enabled.
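> For example, the kind of pattern such an overflow check flags (a hypothetical 
> snippet, not code from the Spark build):
> {code:scala}
> val numRecords: Int = 100000
> val recordSize: Int = 64 * 1024
> // Both operands are Ints, so the multiplication overflows before the result
> // is widened to Long.
> val totalBytes: Long = numRecords * recordSize
> // Safe form: widen one operand first.
> val totalBytesOk: Long = numRecords.toLong * recordSize
> {code}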



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24529) Add spotbugs into maven build process

2018-08-03 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568806#comment-16568806
 ] 

Apache Spark commented on SPARK-24529:
--

User 'kiszk' has created a pull request for this issue:
https://github.com/apache/spark/pull/21994

> Add spotbugs into maven build process
> -
>
> Key: SPARK-24529
> URL: https://issues.apache.org/jira/browse/SPARK-24529
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
>Priority: Minor
>
> We will enable a Java bytecode check tool 
> [spotbugs|https://spotbugs.github.io/] to avoid possible integer overflow at 
> multiplication. Due to the tool limitation, some other checks will be enabled.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24529) Add spotbugs into maven build process

2018-08-03 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24529:


Assignee: Apache Spark  (was: Kazuaki Ishizaki)

> Add spotbugs into maven build process
> -
>
> Key: SPARK-24529
> URL: https://issues.apache.org/jira/browse/SPARK-24529
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Kazuaki Ishizaki
>Assignee: Apache Spark
>Priority: Minor
>
> We will enable a Java bytecode check tool 
> [spotbugs|https://spotbugs.github.io/] to avoid possible integer overflow at 
> multiplication. Due to the tool limitation, some other checks will be enabled.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25023) Clarify Spark security documentation

2018-08-03 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568805#comment-16568805
 ] 

Thomas Graves commented on SPARK-25023:
---

Note that some of this was already updated with 
[https://github.com/apache/spark/pull/20742], but I think there are still a few 
clarifications to make.

> Clarify Spark security documentation
> 
>
> Key: SPARK-25023
> URL: https://issues.apache.org/jira/browse/SPARK-25023
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.2.2
>Reporter: Thomas Graves
>Priority: Major
>
> I was reading through our deployment docs and security docs, and it's not clear 
> at all which deployment modes support exactly what for security.  I think we 
> should clarify that security is off by default on all 
> deployments.  We may also want to clarify the types of communication used 
> that would need to be secured.  We may also want to distinguish multi-tenant-safe 
> setups from other things; standalone mode, for instance, is in my opinion just not 
> secure - we do talk about using spark.authenticate for a secret, but all 
> applications would use the same secret.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-25024) Update mesos documentation to be clear about security supported

2018-08-03 Thread Thomas Graves (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-25024:
--
Comment: was deleted

(was: I'm going to work on this.)

> Update mesos documentation to be clear about security supported
> ---
>
> Key: SPARK-25024
> URL: https://issues.apache.org/jira/browse/SPARK-25024
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.2.2
>Reporter: Thomas Graves
>Priority: Major
>
> I was reading through our Mesos deployment docs and security docs, and it's not 
> clear at all what type of security is supported and how to set it up for Mesos.  
> I think we should clarify this and spell out exactly what is supported and 
> what is not.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25023) Clarify Spark security documentation

2018-08-03 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568767#comment-16568767
 ] 

Thomas Graves commented on SPARK-25023:
---

I'm going to work on this

> Clarify Spark security documentation
> 
>
> Key: SPARK-25023
> URL: https://issues.apache.org/jira/browse/SPARK-25023
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.2.2
>Reporter: Thomas Graves
>Priority: Major
>
> I was reading through our deployment docs and security docs, and it's not clear 
> at all which deployment modes support exactly what for security.  I think we 
> should clarify that security is off by default on all 
> deployments.  We may also want to clarify the types of communication used 
> that would need to be secured.  We may also want to distinguish multi-tenant-safe 
> setups from other things; standalone mode, for instance, is in my opinion just not 
> secure - we do talk about using spark.authenticate for a secret, but all 
> applications would use the same secret.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25024) Update mesos documentation to be clear about security supported

2018-08-03 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568766#comment-16568766
 ] 

Thomas Graves commented on SPARK-25024:
---

I'm going to work on this.

> Update mesos documentation to be clear about security supported
> ---
>
> Key: SPARK-25024
> URL: https://issues.apache.org/jira/browse/SPARK-25024
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.2.2
>Reporter: Thomas Graves
>Priority: Major
>
> I was reading through our Mesos deployment docs and security docs, and it's not 
> clear at all what type of security is supported and how to set it up for Mesos.  
> I think we should clarify this and spell out exactly what is supported and 
> what is not.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25024) Update mesos documentation to be clear about security supported

2018-08-03 Thread Thomas Graves (JIRA)
Thomas Graves created SPARK-25024:
-

 Summary: Update mesos documentation to be clear about security 
supported
 Key: SPARK-25024
 URL: https://issues.apache.org/jira/browse/SPARK-25024
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 2.2.2
Reporter: Thomas Graves


I was reading through our Mesos deployment docs and security docs, and it's not 
clear at all what type of security is supported and how to set it up for Mesos.  
I think we should clarify this and spell out exactly what is supported and what 
is not.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25023) Clarify Spark security documentation

2018-08-03 Thread Thomas Graves (JIRA)
Thomas Graves created SPARK-25023:
-

 Summary: Clarify Spark security documentation
 Key: SPARK-25023
 URL: https://issues.apache.org/jira/browse/SPARK-25023
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 2.2.2
Reporter: Thomas Graves


I was reading through our deployment docs and security docs, and it's not clear 
at all which deployment modes support exactly what for security.  I think we 
should clarify that security is off by default on all deployments.  We may also 
want to clarify the types of communication used that would need to be secured.  
We may also want to distinguish multi-tenant-safe setups from other things; 
standalone mode, for instance, is in my opinion just not secure - we do talk 
about using spark.authenticate for a secret, but all applications would use the 
same secret.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24983) Collapsing multiple project statements with dependent When-Otherwise statements on the same column can OOM the driver

2018-08-03 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568730#comment-16568730
 ] 

Apache Spark commented on SPARK-24983:
--

User 'dvogelbacher' has created a pull request for this issue:
https://github.com/apache/spark/pull/21993

> Collapsing multiple project statements with dependent When-Otherwise 
> statements on the same column can OOM the driver
> -
>
> Key: SPARK-24983
> URL: https://issues.apache.org/jira/browse/SPARK-24983
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.3.1
>Reporter: David Vogelbacher
>Priority: Major
>
> I noticed that writing a spark job that includes many sequential 
> {{when-otherwise}} statements on the same column can easily OOM the driver 
> while generating the optimized plan because the project node will grow 
> exponentially in size.
> Example:
> {noformat}
> scala> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.functions._
> scala> val df = Seq("a", "b", "c", "1").toDF("text")
> df: org.apache.spark.sql.DataFrame = [text: string]
> scala> var dfCaseWhen = df.filter($"text" =!= lit("0"))
> dfCaseWhen: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [text: 
> string]
> scala> for( a <- 1 to 5) {
>  | dfCaseWhen = dfCaseWhen.withColumn("text", when($"text" === 
> lit(a.toString), lit("r" + a.toString)).otherwise($"text"))
>  | }
> scala> dfCaseWhen.queryExecution.analyzed
> res6: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
> Project [CASE WHEN (text#12 = 5) THEN r5 ELSE text#12 END AS text#14]
> +- Project [CASE WHEN (text#10 = 4) THEN r4 ELSE text#10 END AS text#12]
>+- Project [CASE WHEN (text#8 = 3) THEN r3 ELSE text#8 END AS text#10]
>   +- Project [CASE WHEN (text#6 = 2) THEN r2 ELSE text#6 END AS text#8]
>  +- Project [CASE WHEN (text#3 = 1) THEN r1 ELSE text#3 END AS text#6]
> +- Filter NOT (text#3 = 0)
>+- Project [value#1 AS text#3]
>   +- LocalRelation [value#1]
> scala> dfCaseWhen.queryExecution.optimizedPlan
> res5: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
> Project [CASE WHEN (CASE WHEN (CASE WHEN (CASE WHEN (CASE WHEN (value#1 = 1) 
> THEN r1 ELSE value#1 END = 2) THEN r2 ELSE CASE WHEN (value#1 = 1) THEN r1 
> ELSE value#1 END END = 3) THEN r3 ELSE CASE WHEN (CASE WHEN (value#1 = 1) 
> THEN r1 ELSE value#1 END = 2) THEN r2 ELSE CASE WHEN (value#1 = 1) THEN r1 
> ELSE value#1 END END END = 4) THEN r4 ELSE CASE WHEN (CASE WHEN (CASE WHEN 
> (value#1 = 1) THEN r1 ELSE value#1 END = 2) THEN r2 ELSE CASE WHEN (value#1 = 
> 1) THEN r1 ELSE value#1 END END = 3) THEN r3 ELSE CASE WHEN (CASE WHEN 
> (value#1 = 1) THEN r1 ELSE value#1 END = 2) THEN r2 ELSE CASE WHEN (value#1 = 
> 1) THEN r1 ELSE value#1 END END END END = 5) THEN r5 ELSE CASE WHEN (CASE 
> WHEN (CASE WHEN (CASE WHEN (value#1 = 1) THEN r1 ELSE va...
> {noformat}
> As one can see the optimized plan grows exponentially in the number of 
> {{when-otherwise}} statements here.
> I can see that this comes from the {{CollapseProject}} optimizer rule.
> Maybe we should put a limit on the resulting size of the project node after 
> collapsing and only collapse if we stay under the limit.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24983) Collapsing multiple project statements with dependent When-Otherwise statements on the same column can OOM the driver

2018-08-03 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24983:


Assignee: Apache Spark

> Collapsing multiple project statements with dependent When-Otherwise 
> statements on the same column can OOM the driver
> -
>
> Key: SPARK-24983
> URL: https://issues.apache.org/jira/browse/SPARK-24983
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.3.1
>Reporter: David Vogelbacher
>Assignee: Apache Spark
>Priority: Major
>
> I noticed that writing a spark job that includes many sequential 
> {{when-otherwise}} statements on the same column can easily OOM the driver 
> while generating the optimized plan because the project node will grow 
> exponentially in size.
> Example:
> {noformat}
> scala> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.functions._
> scala> val df = Seq("a", "b", "c", "1").toDF("text")
> df: org.apache.spark.sql.DataFrame = [text: string]
> scala> var dfCaseWhen = df.filter($"text" =!= lit("0"))
> dfCaseWhen: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [text: 
> string]
> scala> for( a <- 1 to 5) {
>  | dfCaseWhen = dfCaseWhen.withColumn("text", when($"text" === 
> lit(a.toString), lit("r" + a.toString)).otherwise($"text"))
>  | }
> scala> dfCaseWhen.queryExecution.analyzed
> res6: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
> Project [CASE WHEN (text#12 = 5) THEN r5 ELSE text#12 END AS text#14]
> +- Project [CASE WHEN (text#10 = 4) THEN r4 ELSE text#10 END AS text#12]
>+- Project [CASE WHEN (text#8 = 3) THEN r3 ELSE text#8 END AS text#10]
>   +- Project [CASE WHEN (text#6 = 2) THEN r2 ELSE text#6 END AS text#8]
>  +- Project [CASE WHEN (text#3 = 1) THEN r1 ELSE text#3 END AS text#6]
> +- Filter NOT (text#3 = 0)
>+- Project [value#1 AS text#3]
>   +- LocalRelation [value#1]
> scala> dfCaseWhen.queryExecution.optimizedPlan
> res5: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
> Project [CASE WHEN (CASE WHEN (CASE WHEN (CASE WHEN (CASE WHEN (value#1 = 1) 
> THEN r1 ELSE value#1 END = 2) THEN r2 ELSE CASE WHEN (value#1 = 1) THEN r1 
> ELSE value#1 END END = 3) THEN r3 ELSE CASE WHEN (CASE WHEN (value#1 = 1) 
> THEN r1 ELSE value#1 END = 2) THEN r2 ELSE CASE WHEN (value#1 = 1) THEN r1 
> ELSE value#1 END END END = 4) THEN r4 ELSE CASE WHEN (CASE WHEN (CASE WHEN 
> (value#1 = 1) THEN r1 ELSE value#1 END = 2) THEN r2 ELSE CASE WHEN (value#1 = 
> 1) THEN r1 ELSE value#1 END END = 3) THEN r3 ELSE CASE WHEN (CASE WHEN 
> (value#1 = 1) THEN r1 ELSE value#1 END = 2) THEN r2 ELSE CASE WHEN (value#1 = 
> 1) THEN r1 ELSE value#1 END END END END = 5) THEN r5 ELSE CASE WHEN (CASE 
> WHEN (CASE WHEN (CASE WHEN (value#1 = 1) THEN r1 ELSE va...
> {noformat}
> As one can see the optimized plan grows exponentially in the number of 
> {{when-otherwise}} statements here.
> I can see that this comes from the {{CollapseProject}} optimizer rule.
> Maybe we should put a limit on the resulting size of the project node after 
> collapsing and only collapse if we stay under the limit.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24983) Collapsing multiple project statements with dependent When-Otherwise statements on the same column can OOM the driver

2018-08-03 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24983:


Assignee: (was: Apache Spark)

> Collapsing multiple project statements with dependent When-Otherwise 
> statements on the same column can OOM the driver
> -
>
> Key: SPARK-24983
> URL: https://issues.apache.org/jira/browse/SPARK-24983
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.3.1
>Reporter: David Vogelbacher
>Priority: Major
>
> I noticed that writing a spark job that includes many sequential 
> {{when-otherwise}} statements on the same column can easily OOM the driver 
> while generating the optimized plan because the project node will grow 
> exponentially in size.
> Example:
> {noformat}
> scala> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.functions._
> scala> val df = Seq("a", "b", "c", "1").toDF("text")
> df: org.apache.spark.sql.DataFrame = [text: string]
> scala> var dfCaseWhen = df.filter($"text" =!= lit("0"))
> dfCaseWhen: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [text: 
> string]
> scala> for( a <- 1 to 5) {
>  | dfCaseWhen = dfCaseWhen.withColumn("text", when($"text" === 
> lit(a.toString), lit("r" + a.toString)).otherwise($"text"))
>  | }
> scala> dfCaseWhen.queryExecution.analyzed
> res6: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
> Project [CASE WHEN (text#12 = 5) THEN r5 ELSE text#12 END AS text#14]
> +- Project [CASE WHEN (text#10 = 4) THEN r4 ELSE text#10 END AS text#12]
>+- Project [CASE WHEN (text#8 = 3) THEN r3 ELSE text#8 END AS text#10]
>   +- Project [CASE WHEN (text#6 = 2) THEN r2 ELSE text#6 END AS text#8]
>  +- Project [CASE WHEN (text#3 = 1) THEN r1 ELSE text#3 END AS text#6]
> +- Filter NOT (text#3 = 0)
>+- Project [value#1 AS text#3]
>   +- LocalRelation [value#1]
> scala> dfCaseWhen.queryExecution.optimizedPlan
> res5: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
> Project [CASE WHEN (CASE WHEN (CASE WHEN (CASE WHEN (CASE WHEN (value#1 = 1) 
> THEN r1 ELSE value#1 END = 2) THEN r2 ELSE CASE WHEN (value#1 = 1) THEN r1 
> ELSE value#1 END END = 3) THEN r3 ELSE CASE WHEN (CASE WHEN (value#1 = 1) 
> THEN r1 ELSE value#1 END = 2) THEN r2 ELSE CASE WHEN (value#1 = 1) THEN r1 
> ELSE value#1 END END END = 4) THEN r4 ELSE CASE WHEN (CASE WHEN (CASE WHEN 
> (value#1 = 1) THEN r1 ELSE value#1 END = 2) THEN r2 ELSE CASE WHEN (value#1 = 
> 1) THEN r1 ELSE value#1 END END = 3) THEN r3 ELSE CASE WHEN (CASE WHEN 
> (value#1 = 1) THEN r1 ELSE value#1 END = 2) THEN r2 ELSE CASE WHEN (value#1 = 
> 1) THEN r1 ELSE value#1 END END END END = 5) THEN r5 ELSE CASE WHEN (CASE 
> WHEN (CASE WHEN (CASE WHEN (value#1 = 1) THEN r1 ELSE va...
> {noformat}
> As one can see the optimized plan grows exponentially in the number of 
> {{when-otherwise}} statements here.
> I can see that this comes from the {{CollapseProject}} optimizer rule.
> Maybe we should put a limit on the resulting size of the project node after 
> collapsing and only collapse if we stay under the limit.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21986) QuantileDiscretizer picks wrong split point for data with lots of 0's

2018-08-03 Thread Barry Becker (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568718#comment-16568718
 ] 

Barry Becker commented on SPARK-21986:
--

Here are a couple more test cases that show the problem:
{code:java}
test("Quantile discretizer on data that is only -1 and 1 (and mostly -1)") {
  verify(Seq(-1, -1, 1, -1, -1, -1, 1, -1, -1, 1, -1),
    Seq(Double.NegativeInfinity, -1, Double.PositiveInfinity))
}

test("Quantile discretizer on data that is only -1, 0, and 1 (and mostly -1)") {
  verify(Seq(-1, -1, 1, -1, -1, -1, 1, 0, -1, -1, -1, 1, -1),
    Seq(Double.NegativeInfinity, -1, Double.PositiveInfinity))
}

// this one is ok
test("Quantile discretizer on data that is only -1, 0, and 1") {
  verify(Seq(-1, -1, 1, -1, -1, -1, 1, 0, -1, 1, -1),
    Seq(Double.NegativeInfinity, -1, 0, Double.PositiveInfinity))
}
{code}
If the bins were defined as (low, high] instead of [low, high), then I believe 
all the cases would be correct. Alternatively, if all the cuts had a very small 
epsilon added, or simply used the next distinct value, then they would also all 
be correct.
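
A tiny illustration (values taken from the first test case above) of why the bin 
convention matters:

{code:scala}
// With splits (-Inf, -1, +Inf) and Spark's [low, high) buckets, -1 and 1 both
// land in the second bucket, so only one bucket is used; with (low, high]
// buckets, -1 falls in the first bucket and 1 in the second.
val splits = Seq(Double.NegativeInfinity, -1.0, Double.PositiveInfinity)

def bucketLowInclusive(x: Double): Int = splits.indexWhere(x < _) - 1   // [low, high)
def bucketHighInclusive(x: Double): Int = splits.indexWhere(x <= _) - 1 // (low, high]

Seq(-1.0, 1.0).map(bucketLowInclusive)   // List(1, 1)  -> everything in one bucket
Seq(-1.0, 1.0).map(bucketHighInclusive)  // List(0, 1)  -> both buckets used
{code}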

> QuantileDiscretizer picks wrong split point for data with lots of 0's
> -
>
> Key: SPARK-21986
> URL: https://issues.apache.org/jira/browse/SPARK-21986
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.1.1
>Reporter: Barry Becker
>Priority: Minor
>
> I have some simple test cases to help illustrate (see below).
> I discovered this with data that had 96,000 rows, but can reproduce with much 
> smaller data that has roughly the same distribution of values.
> If I have data like
>   Seq(0, 0, 0, 0, 0, 40, 0, 0, 45, 46, 0)
> and ask for 3 buckets, then it does the right thing and yields splits of 
> Seq(Double.NegativeInfinity, 0.0, 40.0, Double.PositiveInfinity)
> However, if I add just one more zero, such that I have data like
>  Seq(0, 0, 0, 0, 0, 0, 40, 0, 0, 45, 46, 0)
> then it will do the wrong thing and give splits of 
>   Seq(Double.NegativeInfinity, 0.0, Double.PositiveInfinity))
> I'm not bothered that it gave fewer buckets than asked for (that is to be 
> expected), but I am bothered that it picked 0.0 instead of 40 as the one 
> split point.
> The way it did it, now I have 1 bucket with all the data, and a second with 
> none of the data.
> Am I interpreting something wrong?
> Here are my 2 test cases in scala:
> {code}
> class QuantileDiscretizerSuite extends FunSuite {
>   test("Quantile discretizer on data with lots of 0") {
> verify(Seq(0, 0, 0, 0, 0, 0, 40, 0, 0, 45, 46, 0),
>   Seq(Double.NegativeInfinity, 0.0, Double.PositiveInfinity))
>   }
>   test("Quantile discretizer on data with one less 0") {
> verify(Seq(0, 0, 0, 0, 0, 40, 0, 0, 45, 46, 0),
>   Seq(Double.NegativeInfinity, 0.0, 40.0, Double.PositiveInfinity))
>   }
>   
>   def verify(data: Seq[Int], expectedSplits: Seq[Double]): Unit = {
> val theData: Seq[(Int, Double)] = data.map {
>   case x: Int => (x, 0.0)
>   case _ => (0, 0.0)
> }
> val df = SPARK_SESSION.sqlContext.createDataFrame(theData).toDF("rawCol", 
> "unused")
> val qb = new QuantileDiscretizer()
>   .setInputCol("rawCol")
>   .setOutputCol("binnedColumn")
>   .setRelativeError(0.0)
>   .setNumBuckets(3)
>   .fit(df)
> assertResult(expectedSplits) {qb.getSplits}
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24375) Design sketch: support barrier scheduling in Apache Spark

2018-08-03 Thread Mridul Muralidharan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568664#comment-16568664
 ] 

Mridul Muralidharan commented on SPARK-24375:
-

{quote}
It's not desired behavior to catch exception thrown by TaskContext.barrier() 
silently. However, in case this really happens, we can detect that because we 
have `epoch` both in driver side and executor side, more details will go to the 
design doc of BarrierTaskContext.barrier() SPARK-24581
{quote}

The current 'barrier' function does not identify 'which' barrier it is, from a 
user's point of view.

Here, due to exceptions raised (not necessarily from barrier(); they could come 
from user code as well), different tasks can end up waiting on different 
barriers.

{code}
try {
  ... snippet A ...
  // Barrier 1
  context.barrier()
  ... snippet B ...
} catch { ... }
... snippet C ...
// Barrier 2
context.barrier()
{code}

T1 waits on Barrier 1, while T2 could have raised an exception in snippet A and 
ends up waiting on Barrier 2 (having never seen Barrier 1).
In this scenario, how is Spark making progress?
(And of course, what happens when T1 eventually reaches Barrier 2 after T2 has 
already moved past it?)
I did not see this clarified in the design or in the implementation.


> Design sketch: support barrier scheduling in Apache Spark
> -
>
> Key: SPARK-24375
> URL: https://issues.apache.org/jira/browse/SPARK-24375
> Project: Spark
>  Issue Type: Story
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Jiang Xingbo
>Priority: Major
>
> This task is to outline a design sketch for the barrier scheduling SPIP 
> discussion. It doesn't need to be a complete design before the vote. But it 
> should at least cover both Scala/Java and PySpark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25020) Unable to Perform Graceful Shutdown in Spark Streaming with Hadoop 2.8

2018-08-03 Thread Ricky Saltzer (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ricky Saltzer updated SPARK-25020:
--
Description: 
Opening this up to give you some insight into an issue that will occur when 
using Spark Streaming with Hadoop 2.8. 

Hadoop 2.8 added HADOOP-12950, which adds an upper limit of a 10 second timeout 
for its shutdown hook. From our tests, if the Spark job takes longer than 10 
seconds to shut down gracefully, the Hadoop shutdown thread seems to trample 
over the graceful shutdown and throw an exception like:
{code:java}
18/08/03 17:21:04 ERROR Utils: Uncaught exception in thread pool-1-thread-1
java.lang.InterruptedException
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1039)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277)
at 
org.apache.spark.streaming.scheduler.ReceiverTracker.stop(ReceiverTracker.scala:177)
at 
org.apache.spark.streaming.scheduler.JobScheduler.stop(JobScheduler.scala:114)
at 
org.apache.spark.streaming.StreamingContext$$anonfun$stop$1.apply$mcV$sp(StreamingContext.scala:682)
at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1317)
at org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:681)
at 
org.apache.spark.streaming.StreamingContext.org$apache$spark$streaming$StreamingContext$$stopOnShutdown(StreamingContext.scala:715)
at 
org.apache.spark.streaming.StreamingContext$$anonfun$start$1.apply$mcV$sp(StreamingContext.scala:599)
at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216)
at 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188)
at 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
at 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1948)
at 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188)
at 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
at 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
at scala.util.Try$.apply(Try.scala:192)
at 
org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
at 
org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748){code}
The reason I hit this issue is that we recently upgraded to EMR 5.15, which 
ships both Spark 2.3 & Hadoop 2.8. The following workaround has proven successful 
for us (in limited testing).

Instead of just running
{code:java}
...
ssc.start()
ssc.awaitTermination(){code}
We needed to do the following
{code:java}
...
ssc.start()
sys.ShutdownHookThread {
  ssc.stop(true, true)
}
ssc.awaitTermination(){code}
As far as I can tell, there is no way to override the default {{10 second}} 
timeout in HADOOP-12950, which is why we had to go with the workaround. 
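
For completeness, the workaround in a minimal self-contained app (the socket 
source, batch interval and app name below are illustrative placeholders, not part 
of our actual job):

{code:scala}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object GracefulShutdownExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("graceful-shutdown-example")
    val ssc = new StreamingContext(conf, Seconds(10))

    ssc.socketTextStream("localhost", 9999).count().print()

    ssc.start()
    // Workaround from above: register a separate JVM shutdown hook that performs
    // the graceful stop ourselves, rather than relying on Spark's hook, which
    // (per the stack trace) gets interrupted by Hadoop 2.8's 10 second
    // shutdown-hook timeout (HADOOP-12950).
    sys.ShutdownHookThread {
      ssc.stop(stopSparkContext = true, stopGracefully = true)
    }
    ssc.awaitTermination()
  }
}
{code}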

Note: I also verified this bug exists even with EMR 5.12.1 which runs Spark 
2.2.x & Hadoop 2.8. 

Ricky
 Epic Games


[jira] [Created] (SPARK-25022) Add spark.executor.pyspark.memory support to Mesos

2018-08-03 Thread Ryan Blue (JIRA)
Ryan Blue created SPARK-25022:
-

 Summary: Add spark.executor.pyspark.memory support to Mesos
 Key: SPARK-25022
 URL: https://issues.apache.org/jira/browse/SPARK-25022
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Affects Versions: 2.3.0
Reporter: Ryan Blue


SPARK-25004 adds {{spark.executor.pyspark.memory}} to control the memory 
allocation for PySpark and updates YARN to add this memory to its container 
requests. Mesos should do something similar to account for the python memory 
allocation.
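
For reference, the setting itself (introduced by SPARK-25004) is just a regular 
executor config; a sketch of how a job would request it, with placeholder sizes:

{code:scala}
import org.apache.spark.SparkConf

// Sketch only: 2g for the executor JVM plus 512m reserved for the Python workers.
val conf = new SparkConf()
  .set("spark.executor.memory", "2g")
  .set("spark.executor.pyspark.memory", "512m")
{code}

The Mesos scheduler would then need to add the extra amount to the resources it 
requests for each executor, as YARN does in SPARK-25004.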



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25021) Add spark.executor.pyspark.memory support to Kubernetes

2018-08-03 Thread Ryan Blue (JIRA)
Ryan Blue created SPARK-25021:
-

 Summary: Add spark.executor.pyspark.memory support to Kubernetes
 Key: SPARK-25021
 URL: https://issues.apache.org/jira/browse/SPARK-25021
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes
Affects Versions: 2.3.0
Reporter: Ryan Blue


SPARK-25004 adds {{spark.executor.pyspark.memory}} to control the memory 
allocation for PySpark and updates YARN to add this memory to its container 
requests. Kubernetes should do something similar to account for the python 
memory allocation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25020) Unable to Perform Graceful Shutdown in Spark Streaming with Hadoop 2.8

2018-08-03 Thread Ricky Saltzer (JIRA)
Ricky Saltzer created SPARK-25020:
-

 Summary: Unable to Perform Graceful Shutdown in Spark Streaming 
with Hadoop 2.8
 Key: SPARK-25020
 URL: https://issues.apache.org/jira/browse/SPARK-25020
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.1, 2.3.0, 2.2.2, 2.2.1, 2.2.0
 Environment: Spark Streaming

-- Tested on 2.2 & 2.3 (more than likely affects all versions with graceful 
shutdown) 

Hadoop 2.8
Reporter: Ricky Saltzer


Opening this up to give you some insight into an issue that will occur when 
using Spark Streaming with Hadoop 2.8. 

Hadoop 2.8 added HADOOP-12950, which adds an upper limit of a 10 second timeout 
for its shutdown hook. From our tests, if the Spark job takes longer than 10 
seconds to shut down gracefully, the Hadoop shutdown thread seems to trample 
over the graceful shutdown and throw an exception like:
{code:java}
18/08/03 17:21:04 ERROR Utils: Uncaught exception in thread pool-1-thread-1
java.lang.InterruptedException
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1039)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277)
at 
org.apache.spark.streaming.scheduler.ReceiverTracker.stop(ReceiverTracker.scala:177)
at 
org.apache.spark.streaming.scheduler.JobScheduler.stop(JobScheduler.scala:114)
at 
org.apache.spark.streaming.StreamingContext$$anonfun$stop$1.apply$mcV$sp(StreamingContext.scala:682)
at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1317)
at org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:681)
at 
org.apache.spark.streaming.StreamingContext.org$apache$spark$streaming$StreamingContext$$stopOnShutdown(StreamingContext.scala:715)
at 
org.apache.spark.streaming.StreamingContext$$anonfun$start$1.apply$mcV$sp(StreamingContext.scala:599)
at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216)
at 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188)
at 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
at 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1948)
at 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188)
at 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
at 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
at scala.util.Try$.apply(Try.scala:192)
at 
org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
at 
org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748){code}
The reason I hit this issue is that we recently upgraded to EMR 5.15, which 
ships both Spark 2.3 & Hadoop 2.8. The following workaround has proven successful 
for us (in limited testing).

Instead of just running
{code:java}
...
ssc.start()
ssc.awaitTermination(){code}
We needed to do the following
{code:java}
...
ssc.start()
sys.ShutdownHookThread {
  ssc.stop(true, true)
}
ssc.awaitTermination(){code}
As far as I can tell, there is no way to override the default {{10 second}} 
timeout in HADOOP-12950, which is why we had to go with the workaround. 

Note: I also verified this bug exists even with EMR 5.12.1 which runs Spark 
2.2.x & Hadoop 2.8. 

Ricky
Epic Games



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25019) The published spark sql pom does not exclude the normal version of orc-core

2018-08-03 Thread Yin Huai (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568554#comment-16568554
 ] 

Yin Huai commented on SPARK-25019:
--

[~dongjoon] can you help us fix this issue? Or is there a reason that the 
parent pom and sql/core/pom are not consistent?

> The published spark sql pom does not exclude the normal version of orc-core 
> 
>
> Key: SPARK-25019
> URL: https://issues.apache.org/jira/browse/SPARK-25019
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SQL
>Affects Versions: 2.4.0
>Reporter: Yin Huai
>Priority: Critical
>
> I noticed that 
> [https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-sql_2.11/2.4.0-SNAPSHOT/spark-sql_2.11-2.4.0-20180803.100335-189.pom]
>  does not exclude the normal version of orc-core. Comparing with 
> [https://github.com/apache/spark/blob/92b48842b944a3e430472294cdc3c481bad6b804/sql/core/pom.xml#L108]
>  and 
> [https://github.com/apache/spark/blob/92b48842b944a3e430472294cdc3c481bad6b804/pom.xml#L1767],
>  we only exclude the normal version of orc-core in the parent pom. So, the 
> problem is that if a developer depends on spark-sql-core directly, orc-core 
> and orc-core-nohive will be in the dependency list. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25019) The published spark sql pom does not exclude the normal version of orc-core

2018-08-03 Thread Yin Huai (JIRA)
Yin Huai created SPARK-25019:


 Summary: The published spark sql pom does not exclude the normal 
version of orc-core 
 Key: SPARK-25019
 URL: https://issues.apache.org/jira/browse/SPARK-25019
 Project: Spark
  Issue Type: Bug
  Components: Build, SQL
Affects Versions: 2.4.0
Reporter: Yin Huai


I noticed that 
[https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-sql_2.11/2.4.0-SNAPSHOT/spark-sql_2.11-2.4.0-20180803.100335-189.pom]
 does not exclude the normal version of orc-core. Comparing with 
[https://github.com/apache/spark/blob/92b48842b944a3e430472294cdc3c481bad6b804/sql/core/pom.xml#L108]
 and 
[https://github.com/apache/spark/blob/92b48842b944a3e430472294cdc3c481bad6b804/pom.xml#L1767],
 we only exclude the normal version of orc-core in the parent pom. So, the 
problem is that if a developer depends on spark-sql-core directly, orc-core and 
orc-core-nohive will be in the dependency list. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25018) Use `Co-Authored-By` git trailer in `merge_spark_pr.py`

2018-08-03 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25018:


Assignee: Apache Spark

> Use `Co-Authored-By` git trailer in `merge_spark_pr.py`
> ---
>
> Key: SPARK-25018
> URL: https://issues.apache.org/jira/browse/SPARK-25018
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 2.4.0
>Reporter: DB Tsai
>Assignee: Apache Spark
>Priority: Major
>
> Many projects, such as OpenStack, are using `Co-Authored-By: name <email>` in 
> commit messages to indicate people who worked on a particular patch. 
> It's a convention for recognizing multiple authors, and can encourage people 
> to collaborate.
> Co-authored commits are visible on GitHub and can be included in the profile 
> contributions graph and the repository's statistics.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25018) Use `Co-Authored-By` git trailer in `merge_spark_pr.py`

2018-08-03 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25018:


Assignee: (was: Apache Spark)

> Use `Co-Authored-By` git trailer in `merge_spark_pr.py`
> ---
>
> Key: SPARK-25018
> URL: https://issues.apache.org/jira/browse/SPARK-25018
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 2.4.0
>Reporter: DB Tsai
>Priority: Major
>
> Many projects, such as OpenStack, are using `Co-Authored-By: name <email>` in 
> commit messages to indicate people who worked on a particular patch. 
> It's a convention for recognizing multiple authors, and can encourage people 
> to collaborate.
> Co-authored commits are visible on GitHub and can be included in the profile 
> contributions graph and the repository's statistics.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25018) Use `Co-Authored-By` git trailer in `merge_spark_pr.py`

2018-08-03 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568545#comment-16568545
 ] 

Apache Spark commented on SPARK-25018:
--

User 'dbtsai' has created a pull request for this issue:
https://github.com/apache/spark/pull/21991

> Use `Co-Authored-By` git trailer in `merge_spark_pr.py`
> ---
>
> Key: SPARK-25018
> URL: https://issues.apache.org/jira/browse/SPARK-25018
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 2.4.0
>Reporter: DB Tsai
>Priority: Major
>
> Many projects, such as OpenStack, are using `Co-Authored-By: name <email>` in 
> commit messages to indicate people who worked on a particular patch. 
> It's a convention for recognizing multiple authors, and can encourage people 
> to collaborate.
> Co-authored commits are visible on GitHub and can be included in the profile 
> contributions graph and the repository's statistics.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25018) Use `Co-Authored-By` git trailer in `merge_spark_pr.py`

2018-08-03 Thread DB Tsai (JIRA)
DB Tsai created SPARK-25018:
---

 Summary: Use `Co-Authored-By` git trailer in `merge_spark_pr.py`
 Key: SPARK-25018
 URL: https://issues.apache.org/jira/browse/SPARK-25018
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Affects Versions: 2.4.0
Reporter: DB Tsai


Many projects, such as OpenStack, are using `Co-Authored-By: name <email>` in 
commit messages to indicate people who worked on a particular patch. 

It's a convention for recognizing multiple authors, and can encourage people to 
collaborate.

Co-authored commits are visible on GitHub and can be included in the profile 
contributions graph and the repository's statistics.
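
For reference, a commit message using the trailer might look like this (issue 
number, names and addresses are placeholders):

{code}
[SPARK-XXXXX][INFRA] Example change with two authors

Longer description of what the patch does.

Co-Authored-By: Alice Example <alice@example.com>
Co-Authored-By: Bob Example <bob@example.com>
{code}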



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25017) Add test suite for ContextBarrierState

2018-08-03 Thread Jiang Xingbo (JIRA)
Jiang Xingbo created SPARK-25017:


 Summary: Add test suite for ContextBarrierState
 Key: SPARK-25017
 URL: https://issues.apache.org/jira/browse/SPARK-25017
 Project: Spark
  Issue Type: Test
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: Jiang Xingbo


We should be able to add unit tests for ContextBarrierState with a mocked 
RpcCallContext. Currently it's only covered by the end-to-end tests in 
`BarrierTaskContextSuite`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25016) remove Support for hadoop 2.6

2018-08-03 Thread Thomas Graves (JIRA)
Thomas Graves created SPARK-25016:
-

 Summary: remove Support for hadoop 2.6
 Key: SPARK-25016
 URL: https://issues.apache.org/jira/browse/SPARK-25016
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 2.4.0
Reporter: Thomas Graves


Hadoop 2.6 is now old: no releases have been made in two years, and it doesn't 
look like it is being patched for security or bug fixes. We should stop 
supporting it in Spark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25016) remove Support for hadoop 2.6

2018-08-03 Thread Thomas Graves (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-25016:
--
Target Version/s: 3.0.0

> remove Support for hadoop 2.6
> -
>
> Key: SPARK-25016
> URL: https://issues.apache.org/jira/browse/SPARK-25016
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Thomas Graves
>Priority: Major
>
> Hadoop 2.6 is now old: no releases have been made in two years, and it doesn't 
> look like it is being patched for security or bug fixes. We should stop 
> supporting it in Spark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24918) Executor Plugin API

2018-08-03 Thread Imran Rashid (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid updated SPARK-24918:
-
Labels: SPIP memory-analysis  (was: memory-analysis)

> Executor Plugin API
> ---
>
> Key: SPARK-24918
> URL: https://issues.apache.org/jira/browse/SPARK-24918
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Priority: Major
>  Labels: SPIP, memory-analysis
>
> It would be nice if we could specify an arbitrary class to run within each 
> executor for debugging and instrumentation.  It's hard to do this currently 
> because:
> a) you have no idea when executors will come and go with DynamicAllocation, 
> so don't have a chance to run custom code before the first task
> b) even with static allocation, you'd have to change the code of your spark 
> app itself to run a special task to "install" the plugin, which is often 
> tough in production cases when those maintaining regularly running 
> applications might not even know how to make changes to the application.
> For example, https://github.com/squito/spark-memory could be used in a 
> debugging context to understand memory use, just by re-running an application 
> with extra command line arguments (as opposed to rebuilding spark).
> I think one tricky part here is just deciding the API, and how it's versioned. 
>  Does it just get created when the executor starts, and that's it?  Or does it 
> get more specific events, like task start, task end, etc?  Would we ever add 
> more events?  It should definitely be a {{DeveloperApi}}, so breaking 
> compatibility would be allowed ... but still should be avoided.  We could 
> create a base class that has no-op implementations, or explicitly version 
> everything.
> Note that this is not needed in the driver as we already have SparkListeners 
> (even if you don't care about the SparkListenerEvents and just want to 
> inspect objects in the JVM, it's still good enough).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24918) Executor Plugin API

2018-08-03 Thread Imran Rashid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568471#comment-16568471
 ] 

Imran Rashid commented on SPARK-24918:
--

Attached an [SPIP 
proposal|https://docs.google.com/document/d/1a20gHGMyRbCM8aicvq4LhWfQmoA5cbHBQtyqIA2hgtc/edit?usp=sharing].
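
Purely to illustrate the shape under discussion (this is my own sketch, not the 
interface from the SPIP doc), something along these lines would cover the 
"create on executor start" case while leaving room for more lifecycle events 
later:

{code:scala}
import org.apache.spark.annotation.DeveloperApi

// Hypothetical sketch only. A base class with no-op defaults lets new lifecycle
// events be added later without breaking existing plugins, which is one way to
// address the versioning concern in the description below.
@DeveloperApi
abstract class ExecutorPlugin {
  /** Called once when the executor starts, before it runs any task. */
  def init(): Unit = {}

  /** Called when the executor is shutting down. */
  def shutdown(): Unit = {}
}
{code}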

> Executor Plugin API
> ---
>
> Key: SPARK-24918
> URL: https://issues.apache.org/jira/browse/SPARK-24918
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Priority: Major
>  Labels: memory-analysis
>
> It would be nice if we could specify an arbitrary class to run within each 
> executor for debugging and instrumentation.  It's hard to do this currently 
> because:
> a) you have no idea when executors will come and go with DynamicAllocation, 
> so don't have a chance to run custom code before the first task
> b) even with static allocation, you'd have to change the code of your spark 
> app itself to run a special task to "install" the plugin, which is often 
> tough in production cases when those maintaining regularly running 
> applications might not even know how to make changes to the application.
> For example, https://github.com/squito/spark-memory could be used in a 
> debugging context to understand memory use, just by re-running an application 
> with extra command line arguments (as opposed to rebuilding spark).
> I think one tricky part here is just deciding the API, and how it's versioned. 
>  Does it just get created when the executor starts, and that's it?  Or does it 
> get more specific events, like task start, task end, etc?  Would we ever add 
> more events?  It should definitely be a {{DeveloperApi}}, so breaking 
> compatibility would be allowed ... but still should be avoided.  We could 
> create a base class that has no-op implementations, or explicitly version 
> everything.
> Note that this is not needed in the driver as we already have SparkListeners 
> (even if you don't care about the SparkListenerEvents and just want to 
> inspect objects in the JVM, it's still good enough).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24954) Fail fast on job submit if run a barrier stage with dynamic resource allocation enabled

2018-08-03 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-24954:
-

Assignee: Jiang Xingbo

> Fail fast on job submit if run a barrier stage with dynamic resource 
> allocation enabled
> ---
>
> Key: SPARK-24954
> URL: https://issues.apache.org/jira/browse/SPARK-24954
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Jiang Xingbo
>Assignee: Jiang Xingbo
>Priority: Blocker
> Fix For: 2.4.0
>
>
> Since we explicitly listed "Support running barrier stage with dynamic 
> resource allocation" as a Non-Goal in the design doc, we should fail fast on job 
> submit when running a barrier stage with dynamic resource allocation enabled, 
> to avoid some confusing behaviors (see SPARK-24942 for some 
> examples).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24954) Fail fast on job submit if run a barrier stage with dynamic resource allocation enabled

2018-08-03 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-24954.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21915
[https://github.com/apache/spark/pull/21915]

> Fail fast on job submit if run a barrier stage with dynamic resource 
> allocation enabled
> ---
>
> Key: SPARK-24954
> URL: https://issues.apache.org/jira/browse/SPARK-24954
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Jiang Xingbo
>Assignee: Jiang Xingbo
>Priority: Blocker
> Fix For: 2.4.0
>
>
> Since we explicitly listed "Support running barrier stage with dynamic 
> resource allocation" as a Non-Goal in the design doc, we should fail fast on job 
> submit when running a barrier stage with dynamic resource allocation enabled, 
> to avoid some confusing behaviors (see SPARK-24942 for some 
> examples).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source

2018-08-03 Thread Reynold Xin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568454#comment-16568454
 ] 

Reynold Xin commented on SPARK-24924:
-

I like the improved error message (I didn't read the earlier comments in this 
thread).

 

> Add mapping for built-in Avro data source
> -
>
> Key: SPARK-24924
> URL: https://issues.apache.org/jira/browse/SPARK-24924
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 2.4.0
>
>
> This issue aims to the followings.
>  # Like `com.databricks.spark.csv` mapping, we had better map 
> `com.databricks.spark.avro` to built-in Avro data source.
>  # Remove incorrect error message, `Please find an Avro package at ...`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20696) tf-idf document clustering with K-means in Apache Spark putting points into one cluster

2018-08-03 Thread Aditya Kamath (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568439#comment-16568439
 ] 

Aditya Kamath commented on SPARK-20696:
---

[~rajanimaski] Please let me know which implementation of Kmeans you created. 
I'd really like this information. Thank you.

> tf-idf document clustering with K-means in Apache Spark putting points into 
> one cluster
> ---
>
> Key: SPARK-20696
> URL: https://issues.apache.org/jira/browse/SPARK-20696
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Nassir
>Priority: Major
>
> I am trying to do the classic job of clustering text documents by 
> pre-processing, generating tf-idf matrix, and then applying K-means. However, 
> testing this workflow on the classic 20NewsGroup dataset results in most 
> documents being clustered into one cluster. (I have initially tried to 
> cluster all documents from 6 of the 20 groups - so expecting to cluster into 
> 6 clusters).
> I am implementing this in Apache Spark as my purpose is to utilise this 
> technique on millions of documents. Here is the code written in Pyspark on 
> Databricks:
> #declare path to folder containing 6 of 20 news group categories
> path = "/mnt/%s/20news-bydate.tar/20new-bydate-train-lessFolders/*/*" % 
> MOUNT_NAME
> #read all the text files from the 6 folders. Each entity is an entire 
> document. 
> text_files = sc.wholeTextFiles(path).cache()
> #convert rdd to dataframe
> df = text_files.toDF(["filePath", "document"]).cache()
> from pyspark.ml.feature import HashingTF, IDF, Tokenizer, CountVectorizer 
> #tokenize the document text
> tokenizer = Tokenizer(inputCol="document", outputCol="tokens")
> tokenized = tokenizer.transform(df).cache()
> from pyspark.ml.feature import StopWordsRemover
> remover = StopWordsRemover(inputCol="tokens", 
> outputCol="stopWordsRemovedTokens")
> stopWordsRemoved_df = remover.transform(tokenized).cache()
> hashingTF = HashingTF (inputCol="stopWordsRemovedTokens", 
> outputCol="rawFeatures", numFeatures=20)
> tfVectors = hashingTF.transform(stopWordsRemoved_df).cache()
> idf = IDF(inputCol="rawFeatures", outputCol="features", minDocFreq=5)
> idfModel = idf.fit(tfVectors)
> tfIdfVectors = idfModel.transform(tfVectors).cache()
> #note that I have also tried to use normalized data, but get the same result
> from pyspark.ml.feature import Normalizer
> from pyspark.ml.linalg import Vectors
> normalizer = Normalizer(inputCol="features", outputCol="normFeatures")
> l2NormData = normalizer.transform(tfIdfVectors)
> from pyspark.ml.clustering import KMeans
> # Trains a KMeans model.
> kmeans = KMeans().setK(6).setMaxIter(20)
> km_model = kmeans.fit(l2NormData)
> clustersTable = km_model.transform(l2NormData)
> [output showing most documents get clustered into cluster 0][1]
> ID number_of_documents_in_cluster 
> 0 3024 
> 3 5 
> 1 3 
> 5 2
> 2 2 
> 4 1
> As you can see most of my data points get clustered into cluster 0, and I 
> cannot figure out what I am doing wrong as all the tutorials and code I have 
> come across online point to using this method.
> In addition I have also tried normalizing the tf-idf matrix before K-means 
> but that also produces the same result. I know cosine distance is a better 
> measure to use, but I expected using standard K-means in Apache Spark would 
> provide meaningful results.
> Can anyone help with regards to whether I have a bug in my code, or if 
> something is missing in my data clustering pipeline?
> (Question also asked in Stackoverflow before: 
> http://stackoverflow.com/questions/43863373/tf-idf-document-clustering-with-k-means-in-apache-spark-putting-points-into-one)
> Thank you in advance!
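
One thing worth double-checking in the snippet above (an observation, not a 
confirmed diagnosis): numFeatures=20 gives HashingTF only 20 hash buckets, so 
nearly all terms collide, which can badly degrade the clustering on its own. A 
sketch of the same pipeline with a larger hash space, written against the Scala 
ML API:

{code:scala}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.{HashingTF, IDF, Normalizer, StopWordsRemover, Tokenizer}
import org.apache.spark.sql.DataFrame

// Sketch only: df is assumed to have a string column named "document",
// as in the PySpark code above.
def clusterDocs(df: DataFrame): DataFrame = {
  val tokenizer = new Tokenizer().setInputCol("document").setOutputCol("tokens")
  val remover = new StopWordsRemover().setInputCol("tokens").setOutputCol("filtered")
  val tf = new HashingTF()
    .setInputCol("filtered").setOutputCol("rawFeatures")
    .setNumFeatures(1 << 18)        // a much larger hash space than 20
  val idf = new IDF().setInputCol("rawFeatures").setOutputCol("tfidf").setMinDocFreq(5)
  val norm = new Normalizer().setInputCol("tfidf").setOutputCol("features")
  val kmeans = new KMeans().setK(6).setMaxIter(20).setFeaturesCol("features")
  new Pipeline()
    .setStages(Array(tokenizer, remover, tf, idf, norm, kmeans))
    .fit(df)
    .transform(df)
}
{code}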



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25011) Add PrefixSpan to __all__ in fpm.py

2018-08-03 Thread yuhao yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuhao yang updated SPARK-25011:
---
Summary: Add PrefixSpan to __all__ in fpm.py  (was: Add PrefixSpan to 
__all__)

> Add PrefixSpan to __all__ in fpm.py
> ---
>
> Key: SPARK-25011
> URL: https://issues.apache.org/jira/browse/SPARK-25011
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: yuhao yang
>Assignee: yuhao yang
>Priority: Minor
> Fix For: 2.4.0
>
>
> Add PrefixSpan to __all__ in fpm.py



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25003) Pyspark Does not use Spark Sql Extensions

2018-08-03 Thread Russell Spitzer (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568415#comment-16568415
 ] 

Russell Spitzer commented on SPARK-25003:
-

[~holden.karau], I wrote up a PR for each branch target since I'm not sure which 
version you think is best for the update. Please let me know if you have any 
feedback or advice on how to get an automated test in :)
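
For context, here is how extensions are normally wired up through 
spark.sql.extensions (a minimal sketch of my own, using a no-op rule); the 
constructor path described below never reads this config, which is the bug:

{code:scala}
import org.apache.spark.sql.{SparkSession, SparkSessionExtensions}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// A do-nothing optimizer rule, used only to show the registration path.
case class NoOpRule(spark: SparkSession) extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan
}

// Configurator class referenced by --conf spark.sql.extensions=MyExtensions.
class MyExtensions extends (SparkSessionExtensions => Unit) {
  override def apply(extensions: SparkSessionExtensions): Unit = {
    extensions.injectOptimizerRule(NoOpRule)
  }
}
{code}

With that config set, a session obtained via getOrCreate applies the rule, while 
a session built through the constructor that pyspark calls does not.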

> Pyspark Does not use Spark Sql Extensions
> -
>
> Key: SPARK-25003
> URL: https://issues.apache.org/jira/browse/SPARK-25003
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.2, 2.3.1
>Reporter: Russell Spitzer
>Priority: Major
>
> When creating a SparkSession here
> [https://github.com/apache/spark/blob/v2.2.2/python/pyspark/sql/session.py#L216]
> {code:python}
> if jsparkSession is None:
>   jsparkSession = self._jvm.SparkSession(self._jsc.sc())
> self._jsparkSession = jsparkSession
> {code}
> I believe it ends up calling the constructor here
> https://github.com/apache/spark/blob/v2.2.2/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L85-L87
> {code:scala}
>   private[sql] def this(sc: SparkContext) {
> this(sc, None, None, new SparkSessionExtensions)
>   }
> {code}
> This constructor creates a new SparkSessionExtensions object and does not pick up 
> extensions that may have been set in the config, the way the companion object's 
> getOrCreate does.
> https://github.com/apache/spark/blob/v2.2.2/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L928-L944
> {code:scala}
> //in getOrCreate
> // Initialize extensions if the user has defined a configurator class.
> val extensionConfOption = 
> sparkContext.conf.get(StaticSQLConf.SPARK_SESSION_EXTENSIONS)
> if (extensionConfOption.isDefined) {
>   val extensionConfClassName = extensionConfOption.get
>   try {
> val extensionConfClass = 
> Utils.classForName(extensionConfClassName)
> val extensionConf = extensionConfClass.newInstance()
>   .asInstanceOf[SparkSessionExtensions => Unit]
> extensionConf(extensions)
>   } catch {
> // Ignore the error if we cannot find the class or when the class 
> has the wrong type.
> case e @ (_: ClassCastException |
>   _: ClassNotFoundException |
>   _: NoClassDefFoundError) =>
>   logWarning(s"Cannot use $extensionConfClassName to configure 
> session extensions.", e)
>   }
> }
> {code}
> I think a quick fix would be to use the getOrCreate method from the companion 
> object instead of calling the constructor from the SparkContext. Or we could 
> fix this by ensuring that all constructors attempt to pick up custom 
> extensions if they are set.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25003) Pyspark Does not use Spark Sql Extensions

2018-08-03 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568414#comment-16568414
 ] 

Apache Spark commented on SPARK-25003:
--

User 'RussellSpitzer' has created a pull request for this issue:
https://github.com/apache/spark/pull/21990

> Pyspark Does not use Spark Sql Extensions
> -
>
> Key: SPARK-25003
> URL: https://issues.apache.org/jira/browse/SPARK-25003
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.2, 2.3.1
>Reporter: Russell Spitzer
>Priority: Major
>
> When creating a SparkSession here
> [https://github.com/apache/spark/blob/v2.2.2/python/pyspark/sql/session.py#L216]
> {code:python}
> if jsparkSession is None:
>   jsparkSession = self._jvm.SparkSession(self._jsc.sc())
> self._jsparkSession = jsparkSession
> {code}
> I believe it ends up calling the constructor here
> https://github.com/apache/spark/blob/v2.2.2/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L85-L87
> {code:scala}
>   private[sql] def this(sc: SparkContext) {
> this(sc, None, None, new SparkSessionExtensions)
>   }
> {code}
> This constructor creates a new SparkSessionExtensions object and does not pick up 
> extensions that may have been set in the config, the way the companion object's 
> getOrCreate does.
> https://github.com/apache/spark/blob/v2.2.2/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L928-L944
> {code:scala}
> //in getOrCreate
> // Initialize extensions if the user has defined a configurator class.
> val extensionConfOption = 
> sparkContext.conf.get(StaticSQLConf.SPARK_SESSION_EXTENSIONS)
> if (extensionConfOption.isDefined) {
>   val extensionConfClassName = extensionConfOption.get
>   try {
> val extensionConfClass = 
> Utils.classForName(extensionConfClassName)
> val extensionConf = extensionConfClass.newInstance()
>   .asInstanceOf[SparkSessionExtensions => Unit]
> extensionConf(extensions)
>   } catch {
> // Ignore the error if we cannot find the class or when the class 
> has the wrong type.
> case e @ (_: ClassCastException |
>   _: ClassNotFoundException |
>   _: NoClassDefFoundError) =>
>   logWarning(s"Cannot use $extensionConfClassName to configure 
> session extensions.", e)
>   }
> }
> {code}
> I think a quick fix would be to use the getOrCreate method from the companion 
> object instead of calling the constructor from the SparkContext. Or we could 
> fix this by ensuring that all constructors attempt to pick up custom 
> extensions if they are set.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25003) Pyspark Does not use Spark Sql Extensions

2018-08-03 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568405#comment-16568405
 ] 

Apache Spark commented on SPARK-25003:
--

User 'RussellSpitzer' has created a pull request for this issue:
https://github.com/apache/spark/pull/21989

> Pyspark Does not use Spark Sql Extensions
> -
>
> Key: SPARK-25003
> URL: https://issues.apache.org/jira/browse/SPARK-25003
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.2, 2.3.1
>Reporter: Russell Spitzer
>Priority: Major
>
> When creating a SparkSession here
> [https://github.com/apache/spark/blob/v2.2.2/python/pyspark/sql/session.py#L216]
> {code:python}
> if jsparkSession is None:
>   jsparkSession = self._jvm.SparkSession(self._jsc.sc())
> self._jsparkSession = jsparkSession
> {code}
> I believe it ends up calling the constructor here
> https://github.com/apache/spark/blob/v2.2.2/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L85-L87
> {code:scala}
>   private[sql] def this(sc: SparkContext) {
> this(sc, None, None, new SparkSessionExtensions)
>   }
> {code}
> This constructor creates a new SparkSessionExtensions object and does not pick up 
> extensions that may have been set in the config, the way the companion object's 
> getOrCreate does.
> https://github.com/apache/spark/blob/v2.2.2/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L928-L944
> {code:scala}
> //in getOrCreate
> // Initialize extensions if the user has defined a configurator class.
> val extensionConfOption = 
> sparkContext.conf.get(StaticSQLConf.SPARK_SESSION_EXTENSIONS)
> if (extensionConfOption.isDefined) {
>   val extensionConfClassName = extensionConfOption.get
>   try {
> val extensionConfClass = 
> Utils.classForName(extensionConfClassName)
> val extensionConf = extensionConfClass.newInstance()
>   .asInstanceOf[SparkSessionExtensions => Unit]
> extensionConf(extensions)
>   } catch {
> // Ignore the error if we cannot find the class or when the class 
> has the wrong type.
> case e @ (_: ClassCastException |
>   _: ClassNotFoundException |
>   _: NoClassDefFoundError) =>
>   logWarning(s"Cannot use $extensionConfClassName to configure 
> session extensions.", e)
>   }
> }
> {code}
> I think a quick fix would be to use the getOrCreate method from the companion 
> object instead of calling the constructor from the SparkContext. Or we could 
> fix this by ensuring that all constructors attempt to pick up custom 
> extensions if they are set.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25003) Pyspark Does not use Spark Sql Extensions

2018-08-03 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25003:


Assignee: (was: Apache Spark)

> Pyspark Does not use Spark Sql Extensions
> -
>
> Key: SPARK-25003
> URL: https://issues.apache.org/jira/browse/SPARK-25003
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.2, 2.3.1
>Reporter: Russell Spitzer
>Priority: Major
>
> When creating a SparkSession here
> [https://github.com/apache/spark/blob/v2.2.2/python/pyspark/sql/session.py#L216]
> {code:python}
> if jsparkSession is None:
>   jsparkSession = self._jvm.SparkSession(self._jsc.sc())
> self._jsparkSession = jsparkSession
> {code}
> I believe it ends up calling the constructor here
> https://github.com/apache/spark/blob/v2.2.2/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L85-L87
> {code:scala}
>   private[sql] def this(sc: SparkContext) {
> this(sc, None, None, new SparkSessionExtensions)
>   }
> {code}
> This constructor creates a new SparkSessionExtensions object and does not pick up 
> extensions that may have been set in the config, the way the companion object's 
> getOrCreate does.
> https://github.com/apache/spark/blob/v2.2.2/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L928-L944
> {code:scala}
> //in getOrCreate
> // Initialize extensions if the user has defined a configurator class.
> val extensionConfOption = 
> sparkContext.conf.get(StaticSQLConf.SPARK_SESSION_EXTENSIONS)
> if (extensionConfOption.isDefined) {
>   val extensionConfClassName = extensionConfOption.get
>   try {
> val extensionConfClass = 
> Utils.classForName(extensionConfClassName)
> val extensionConf = extensionConfClass.newInstance()
>   .asInstanceOf[SparkSessionExtensions => Unit]
> extensionConf(extensions)
>   } catch {
> // Ignore the error if we cannot find the class or when the class 
> has the wrong type.
> case e @ (_: ClassCastException |
>   _: ClassNotFoundException |
>   _: NoClassDefFoundError) =>
>   logWarning(s"Cannot use $extensionConfClassName to configure 
> session extensions.", e)
>   }
> }
> {code}
> I think a quick fix would be to use the getOrCreate method from the companion 
> object instead of calling the constructor from the SparkContext. Or we could 
> fix this by ensuring that all constructors attempt to pick up custom 
> extensions if they are set.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25003) Pyspark Does not use Spark Sql Extensions

2018-08-03 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25003:


Assignee: Apache Spark

> Pyspark Does not use Spark Sql Extensions
> -
>
> Key: SPARK-25003
> URL: https://issues.apache.org/jira/browse/SPARK-25003
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.2, 2.3.1
>Reporter: Russell Spitzer
>Assignee: Apache Spark
>Priority: Major
>
> When creating a SparkSession here
> [https://github.com/apache/spark/blob/v2.2.2/python/pyspark/sql/session.py#L216]
> {code:python}
> if jsparkSession is None:
>   jsparkSession = self._jvm.SparkSession(self._jsc.sc())
> self._jsparkSession = jsparkSession
> {code}
> I believe it ends up calling the constructor here
> https://github.com/apache/spark/blob/v2.2.2/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L85-L87
> {code:scala}
>   private[sql] def this(sc: SparkContext) {
> this(sc, None, None, new SparkSessionExtensions)
>   }
> {code}
> This constructor creates a new SparkSessionExtensions object and does not pick up 
> extensions that may have been set in the config, the way the companion object's 
> getOrCreate does.
> https://github.com/apache/spark/blob/v2.2.2/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L928-L944
> {code:scala}
> //in getOrCreate
> // Initialize extensions if the user has defined a configurator class.
> val extensionConfOption = 
> sparkContext.conf.get(StaticSQLConf.SPARK_SESSION_EXTENSIONS)
> if (extensionConfOption.isDefined) {
>   val extensionConfClassName = extensionConfOption.get
>   try {
> val extensionConfClass = 
> Utils.classForName(extensionConfClassName)
> val extensionConf = extensionConfClass.newInstance()
>   .asInstanceOf[SparkSessionExtensions => Unit]
> extensionConf(extensions)
>   } catch {
> // Ignore the error if we cannot find the class or when the class 
> has the wrong type.
> case e @ (_: ClassCastException |
>   _: ClassNotFoundException |
>   _: NoClassDefFoundError) =>
>   logWarning(s"Cannot use $extensionConfClassName to configure 
> session extensions.", e)
>   }
> }
> {code}
> I think a quick fix would be to use the getOrCreate method from the companion 
> object instead of calling the constructor from the SparkContext. Or we could 
> fix this by ensuring that all constructors attempt to pick up custom 
> extensions if they are set.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25003) Pyspark Does not use Spark Sql Extensions

2018-08-03 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568396#comment-16568396
 ] 

Apache Spark commented on SPARK-25003:
--

User 'RussellSpitzer' has created a pull request for this issue:
https://github.com/apache/spark/pull/21988

> Pyspark Does not use Spark Sql Extensions
> -
>
> Key: SPARK-25003
> URL: https://issues.apache.org/jira/browse/SPARK-25003
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.2, 2.3.1
>Reporter: Russell Spitzer
>Priority: Major
>
> When creating a SparkSession here
> [https://github.com/apache/spark/blob/v2.2.2/python/pyspark/sql/session.py#L216]
> {code:python}
> if jsparkSession is None:
>   jsparkSession = self._jvm.SparkSession(self._jsc.sc())
> self._jsparkSession = jsparkSession
> {code}
> I believe it ends up calling the constructor here
> https://github.com/apache/spark/blob/v2.2.2/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L85-L87
> {code:scala}
>   private[sql] def this(sc: SparkContext) {
> this(sc, None, None, new SparkSessionExtensions)
>   }
> {code}
> This constructor creates a new SparkSessionExtensions object and does not pick up 
> extensions that may have been set in the config, the way the companion object's 
> getOrCreate does.
> https://github.com/apache/spark/blob/v2.2.2/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L928-L944
> {code:scala}
> //in getOrCreate
> // Initialize extensions if the user has defined a configurator class.
> val extensionConfOption = 
> sparkContext.conf.get(StaticSQLConf.SPARK_SESSION_EXTENSIONS)
> if (extensionConfOption.isDefined) {
>   val extensionConfClassName = extensionConfOption.get
>   try {
> val extensionConfClass = 
> Utils.classForName(extensionConfClassName)
> val extensionConf = extensionConfClass.newInstance()
>   .asInstanceOf[SparkSessionExtensions => Unit]
> extensionConf(extensions)
>   } catch {
> // Ignore the error if we cannot find the class or when the class 
> has the wrong type.
> case e @ (_: ClassCastException |
>   _: ClassNotFoundException |
>   _: NoClassDefFoundError) =>
>   logWarning(s"Cannot use $extensionConfClassName to configure 
> session extensions.", e)
>   }
> }
> {code}
> I think a quick fix would be to use the getOrCreate method from the companion 
> object instead of calling the constructor from the SparkContext. Or we could 
> fix this by ensuring that all constructors attempt to pick up custom 
> extensions if they are set.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25015) Update Hadoop 2.7 to 2.7.7

2018-08-03 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25015:


Assignee: Apache Spark  (was: Sean Owen)

> Update Hadoop 2.7 to 2.7.7
> --
>
> Key: SPARK-25015
> URL: https://issues.apache.org/jira/browse/SPARK-25015
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Affects Versions: 2.1.3, 2.2.2, 2.3.1
>Reporter: Sean Owen
>Assignee: Apache Spark
>Priority: Minor
>
> We should update the Hadoop 2.7 dependency to 2.7.7, to pick up bug and 
> security fixes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25015) Update Hadoop 2.7 to 2.7.7

2018-08-03 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568392#comment-16568392
 ] 

Apache Spark commented on SPARK-25015:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/21987

> Update Hadoop 2.7 to 2.7.7
> --
>
> Key: SPARK-25015
> URL: https://issues.apache.org/jira/browse/SPARK-25015
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Affects Versions: 2.1.3, 2.2.2, 2.3.1
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
>
> We should update the Hadoop 2.7 dependency to 2.7.7, to pick up bug and 
> security fixes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source

2018-08-03 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568393#comment-16568393
 ] 

Thomas Graves commented on SPARK-24924:
---

{quote}
It wouldn't be very different for 2.4.0. It could be different but I guess it 
should be incremental improvement without behaviour changes.
{quote}

I don't buy this argument; the code has been restructured a lot and you could 
have introduced bugs, behavior changes, etc.  If a user has been using the 
Databricks spark-avro version for other releases and it was working fine, and now 
we magically map it to a different implementation and they break, they are going to 
complain and say: I didn't change anything, so why did this break?

Users could also have made their own modified version of the Databricks 
spark-avro package (which we actually have, to support primitive types), so 
the implementation is not the same, and yet you are assuming it is.  Just to note, 
the fact that we use a different version isn't my issue; I'm happy to make that 
work. I'm worried about other users who didn't happen to see this JIRA.   I also 
realize these are third-party packages, but I think we are making the assumption 
here based on this being a Databricks package, which in my opinion we 
shouldn't.   What if this were some companyX package that we didn't know about; what 
would/should be the expected behavior?

How many users complained about the CSV thing?  Could we just improve the error 
message to state more simply: "Multiple sources found; perhaps you are 
including an external package that also supports Avro. Spark started supporting 
Avro internally as of release X.Y; please remove the external package or rewrite to 
use a different function"?
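
For readers landing here, what the mapping concretely changes, sketched with 
placeholder inputs (and assuming the 2.4 spark-avro module is on the classpath):

{code:scala}
import org.apache.spark.sql.{DataFrame, SparkSession}

// Sketch only; `path` is a placeholder.
def readAvro(spark: SparkSession, path: String): (DataFrame, DataFrame) = {
  // The built-in source under its short name:
  val viaShortName = spark.read.format("avro").load(path)
  // With the SPARK-24924 mapping, the old Databricks name resolves to the same
  // built-in implementation instead of an external package:
  val viaOldName = spark.read.format("com.databricks.spark.avro").load(path)
  (viaShortName, viaOldName)
}
{code}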

> Add mapping for built-in Avro data source
> -
>
> Key: SPARK-24924
> URL: https://issues.apache.org/jira/browse/SPARK-24924
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 2.4.0
>
>
> This issue aims to the followings.
>  # Like `com.databricks.spark.csv` mapping, we had better map 
> `com.databricks.spark.avro` to built-in Avro data source.
>  # Remove incorrect error message, `Please find an Avro package at ...`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25015) Update Hadoop 2.7 to 2.7.7

2018-08-03 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25015:


Assignee: Sean Owen  (was: Apache Spark)

> Update Hadoop 2.7 to 2.7.7
> --
>
> Key: SPARK-25015
> URL: https://issues.apache.org/jira/browse/SPARK-25015
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Affects Versions: 2.1.3, 2.2.2, 2.3.1
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
>
> We should update the Hadoop 2.7 dependency to 2.7.7, to pick up bug and 
> security fixes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25015) Update Hadoop 2.7 to 2.7.7

2018-08-03 Thread Sean Owen (JIRA)
Sean Owen created SPARK-25015:
-

 Summary: Update Hadoop 2.7 to 2.7.7
 Key: SPARK-25015
 URL: https://issues.apache.org/jira/browse/SPARK-25015
 Project: Spark
  Issue Type: Task
  Components: Build
Affects Versions: 2.3.1, 2.2.2, 2.1.3
Reporter: Sean Owen
Assignee: Sean Owen


We should update the Hadoop 2.7 dependency to 2.7.7, to pick up bug and 
security fixes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18057) Update structured streaming kafka from 0.10.0.1 to 2.0.0

2018-08-03 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-18057:
--
Priority: Major  (was: Blocker)

> Update structured streaming kafka from 0.10.0.1 to 2.0.0
> 
>
> Key: SPARK-18057
> URL: https://issues.apache.org/jira/browse/SPARK-18057
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>Assignee: Ted Yu
>Priority: Major
> Fix For: 2.4.0
>
>
> There are a couple of relevant KIPs here, 
> https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18057) Update structured streaming kafka from 0.10.0.1 to 2.0.0

2018-08-03 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-18057.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

> Update structured streaming kafka from 0.10.0.1 to 2.0.0
> 
>
> Key: SPARK-18057
> URL: https://issues.apache.org/jira/browse/SPARK-18057
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>Assignee: Ted Yu
>Priority: Major
> Fix For: 2.4.0
>
>
> There are a couple of relevant KIPs here, 
> https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25014) When we tried to read kafka topic through spark streaming spark submit is getting failed with Python worker exited unexpectedly (crashed) error

2018-08-03 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-25014:
--
 Priority: Major  (was: Blocker)
Fix Version/s: (was: 2.3.2)

Please read [https://spark.apache.org/contributing.html]

For example, don't set Blocker.

Nothing about this rules out an env problem or code problem. Jira is for 
reporting issues narrowed down to Spark, rather than asking for support.

> When we tried to read kafka topic through spark streaming spark submit is 
> getting failed with Python worker exited unexpectedly (crashed) error 
> 
>
> Key: SPARK-25014
> URL: https://issues.apache.org/jira/browse/SPARK-25014
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: KARTHIKEYAN RASIPALAYAM DURAIRAJ
>Priority: Major
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> Hi Team , 
>  
> TOPIC = 'NBC_APPS.TBL_MS_ADVERTISER'
> PARTITION = 0
> topicAndPartition = TopicAndPartition(TOPIC, PARTITION)
> fromOffsets1 = \{topicAndPartition:int(PARTITION)}
>  
> def handler(message):
>     records = message.collect()
>     for record in records:
>         value_all=record[1]
>         value_key=record[0]
> #        print(value_all)
>  
> schema_registry_client = 
> CachedSchemaRegistryClient(url='http://localhost:8081')
> serializer = MessageSerializer(schema_registry_client)
> sc = SparkContext(appName="PythonStreamingAvro")
> ssc = StreamingContext(sc, 10)
> kvs = KafkaUtils.createDirectStream(ssc, ['NBC_APPS.TBL_MS_ADVERTISER'], 
> \{"metadata.broker.list": 
> 'localhost:9092'},valueDecoder=serializer.decode_message)
> lines = kvs.map(lambda x: x[1])
> lines.pprint()
> kvs.foreachRDD(handler)
>  
> ssc.start()
> ssc.awaitTermination()
>  
> This is code we trying to pull the data from kafka topic . when we execute 
> through spark submit we are getting below error 
>  
>  
> 2018-08-03 11:10:40 INFO  VerifiableProperties:68 - Property 
> zookeeper.connect is overridden to 
> 2018-08-03 11:10:40 ERROR PythonRunner:91 - Python worker exited unexpectedly 
> (crashed)
> org.apache.spark.api.python.PythonException: Traceback (most recent call 
> last):
>   File 
> "/Users/KarthikeyanDurairaj/Desktop/Sparkshell/python/lib/pyspark.zip/pyspark/worker.py",
>  line 215, in main
>     eval_type = read_int(infile)
>   File 
> "/Users/KarthikeyanDurairaj/Desktop/Sparkshell/python/lib/pyspark.zip/pyspark/serializers.py",
>  line 685, in read_int
>     raise EOFError
> EOFError
>  
>  at 
> org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:298)
>  at 
> org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:438)
>  at 
> org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:421)
>  at 
> org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:252)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25014) When we tried to read kafka topic through spark streaming spark submit is getting failed with Python worker exited unexpectedly (crashed) error

2018-08-03 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-25014.
---
Resolution: Invalid

> When we tried to read kafka topic through spark streaming spark submit is 
> getting failed with Python worker exited unexpectedly (crashed) error 
> 
>
> Key: SPARK-25014
> URL: https://issues.apache.org/jira/browse/SPARK-25014
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: KARTHIKEYAN RASIPALAYAM DURAIRAJ
>Priority: Major
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> Hi Team , 
>  
> TOPIC = 'NBC_APPS.TBL_MS_ADVERTISER'
> PARTITION = 0
> topicAndPartition = TopicAndPartition(TOPIC, PARTITION)
> fromOffsets1 = \{topicAndPartition:int(PARTITION)}
>  
> def handler(message):
>     records = message.collect()
>     for record in records:
>         value_all=record[1]
>         value_key=record[0]
> #        print(value_all)
>  
> schema_registry_client = 
> CachedSchemaRegistryClient(url='http://localhost:8081')
> serializer = MessageSerializer(schema_registry_client)
> sc = SparkContext(appName="PythonStreamingAvro")
> ssc = StreamingContext(sc, 10)
> kvs = KafkaUtils.createDirectStream(ssc, ['NBC_APPS.TBL_MS_ADVERTISER'], 
> \{"metadata.broker.list": 
> 'localhost:9092'},valueDecoder=serializer.decode_message)
> lines = kvs.map(lambda x: x[1])
> lines.pprint()
> kvs.foreachRDD(handler)
>  
> ssc.start()
> ssc.awaitTermination()
>  
> This is code we trying to pull the data from kafka topic . when we execute 
> through spark submit we are getting below error 
>  
>  
> 2018-08-03 11:10:40 INFO  VerifiableProperties:68 - Property 
> zookeeper.connect is overridden to 
> 2018-08-03 11:10:40 ERROR PythonRunner:91 - Python worker exited unexpectedly 
> (crashed)
> org.apache.spark.api.python.PythonException: Traceback (most recent call 
> last):
>   File 
> "/Users/KarthikeyanDurairaj/Desktop/Sparkshell/python/lib/pyspark.zip/pyspark/worker.py",
>  line 215, in main
>     eval_type = read_int(infile)
>   File 
> "/Users/KarthikeyanDurairaj/Desktop/Sparkshell/python/lib/pyspark.zip/pyspark/serializers.py",
>  line 685, in read_int
>     raise EOFError
> EOFError
>  
>  at 
> org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:298)
>  at 
> org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:438)
>  at 
> org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:421)
>  at 
> org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:252)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source

2018-08-03 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568352#comment-16568352
 ] 

Hyukjin Kwon commented on SPARK-24924:
--

cc [~cloud_fan] since we talked about this for CSV, and [~rxin] who agreed upon 
not adding .avro for now, FYI.

> Add mapping for built-in Avro data source
> -
>
> Key: SPARK-24924
> URL: https://issues.apache.org/jira/browse/SPARK-24924
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 2.4.0
>
>
> This issue aims to the followings.
>  # Like `com.databricks.spark.csv` mapping, we had better map 
> `com.databricks.spark.avro` to built-in Avro data source.
>  # Remove incorrect error message, `Please find an Avro package at ...`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24924) Add mapping for built-in Avro data source

2018-08-03 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568348#comment-16568348
 ] 

Hyukjin Kwon edited comment on SPARK-24924 at 8/3/18 3:29 PM:
--

{quote}
but at the same time we aren't adding the spark.read.avro syntax, so does it
break in that case, or do they get a different implementation by default?
{quote}

If users call this, that's still going to use the built-in implementation
(https://github.com/databricks/spark-avro/blob/branch-4.0/src/main/scala/com/databricks/spark/avro/package.scala#L26)
as it's just a short name for {{format("com.databricks.spark.avro")}}.

{quote}
our internal implementation which could very well be different.
{quote}

It wouldn't be very different for 2.4.0. It could be different, but I expect it
to be an incremental improvement without behaviour changes.

{quote}
I would rather just plainly error out saying these conflict: either update or
change your external package to use a different name.
{quote}

IIRC, in the past, we did this for the CSV data source and many users
complained about it.

{code}
java.lang.RuntimeException: Multiple sources found for csv 
(org.apache.spark.sql.execution.datasources.csv.CSVFileFormat, 
com.databricks.spark.csv.DefaultSource15), please specify the fully qualified 
class name.
{code}

In practice, I am actually a bit more confident in the current approach, since
users complained about that error a lot, and so far I am not seeing complaints
about the current approach.

{quote}
One might also be able to argue that it's breaking API compatibility since the
.avro option went away, but it's a third-party library so you can probably get
away with that.
{quote}

It went away, so if the jar is provided with the implicit import to support it,
this should work as usual and, in theory, use the internal implementation. If
the jar is not given, the .avro API is not available and the internal
implementation will be used.



was (Author: hyukjin.kwon):
{quote}
but at the same time we aren't adding the spark.read.avro syntax so it break in 
that case or they get a different implementation by default?   
{quote}

If users call this, that's still going to use the builtin implemtnation 
(https://github.com/databricks/spark-avro/blob/branch-4.0/src/main/scala/com/databricks/spark/avro/package.scala#L26)
 as it's a short name for {{format("com.databricks.spark.avro")}}.

{quote}
our internal implementation which could very well be different.
{quote}

It wouldn't be very different for 2.4.0. It could be different but I guess it 
should be incremental improvement without behaviour changes.

{quote}
 I would rather just plain error out saying these conflict, either update or 
change your external package to use a different name. 
{quote}

IIRC, in the past, we did for CSV datasource and many users complained about 
this.

{code}
java.lang.RuntimeException: Multiple sources found for csv 
(org.apache.spark.sql.execution.datasources.csv.CSVFileFormat, 
com.databricks.spark.csv.DefaultSource15), please specify the fully qualified 
class name.
{code}

In practice, I am actually a bit more sure on the current approach since users 
actually complained about his a lot and now I am not seeing (so far) the 
complains about the current approach.

{code}
There is also the case one might be able to argue its breaking api 
compatilibity since .avro option went away, buts it a third party library so 
you can probably get away with that. 
{code}

It's went away so I guess if the jar is provided with implicit import to 
support this, this should work as usual and use the internal implementation in 
theory. If the jar is not given, .avro API is not supported and the internal 
implmentation will be used. 


> Add mapping for built-in Avro data source
> -
>
> Key: SPARK-24924
> URL: https://issues.apache.org/jira/browse/SPARK-24924
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 2.4.0
>
>
> This issue aims to the followings.
>  # Like `com.databricks.spark.csv` mapping, we had better map 
> `com.databricks.spark.avro` to built-in Avro data source.
>  # Remove incorrect error message, `Please find an Avro package at ...`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source

2018-08-03 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568348#comment-16568348
 ] 

Hyukjin Kwon commented on SPARK-24924:
--

{quote}
but at the same time we aren't adding the spark.read.avro syntax, so does it
break in that case, or do they get a different implementation by default?
{quote}

If users call this, that's still going to use the built-in implementation
(https://github.com/databricks/spark-avro/blob/branch-4.0/src/main/scala/com/databricks/spark/avro/package.scala#L26)
as it's just a short name for {{format("com.databricks.spark.avro")}}.

{quote}
our internal implementation which could very well be different.
{quote}

It wouldn't be very different for 2.4.0. It could be different, but I expect it
to be an incremental improvement without behaviour changes.

{quote}
I would rather just plainly error out saying these conflict: either update or
change your external package to use a different name.
{quote}

IIRC, in the past, we did this for the CSV data source and many users
complained about it.

{code}
java.lang.RuntimeException: Multiple sources found for csv 
(org.apache.spark.sql.execution.datasources.csv.CSVFileFormat, 
com.databricks.spark.csv.DefaultSource15), please specify the fully qualified 
class name.
{code}

In practice, I am actually a bit more confident in the current approach, since
users complained about that error a lot, and so far I am not seeing complaints
about the current approach.

{quote}
One might also be able to argue that it's breaking API compatibility since the
.avro option went away, but it's a third-party library so you can probably get
away with that.
{quote}

It went away, so if the jar is provided with the implicit import to support it,
this should work as usual and, in theory, use the internal implementation. If
the jar is not given, the .avro API is not available and the internal
implementation will be used.
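
For reference, a rough sketch of what that implicit wrapper looks like,
reconstructed from the package object linked above rather than copied from it
(names may not match the real package exactly):

{code:scala}
import org.apache.spark.sql.{DataFrame, DataFrameReader, DataFrameWriter}

// Approximation of the third-party implicits; with the mapping in this issue,
// both calls below end up on Spark's built-in Avro source in 2.4.0.
object AvroImplicits {

  implicit class AvroDataFrameReader(reader: DataFrameReader) {
    // Equivalent to reader.format("com.databricks.spark.avro").load(path).
    def avro(path: String): DataFrame =
      reader.format("com.databricks.spark.avro").load(path)
  }

  implicit class AvroDataFrameWriter[T](writer: DataFrameWriter[T]) {
    // Equivalent to writer.format("com.databricks.spark.avro").save(path).
    def avro(path: String): Unit =
      writer.format("com.databricks.spark.avro").save(path)
  }
}
{code}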


> Add mapping for built-in Avro data source
> -
>
> Key: SPARK-24924
> URL: https://issues.apache.org/jira/browse/SPARK-24924
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 2.4.0
>
>
> This issue aims to the followings.
>  # Like `com.databricks.spark.csv` mapping, we had better map 
> `com.databricks.spark.avro` to built-in Avro data source.
>  # Remove incorrect error message, `Please find an Avro package at ...`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23937) High-order function: map_filter(map, function) → MAP

2018-08-03 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568337#comment-16568337
 ] 

Apache Spark commented on SPARK-23937:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/21986

> High-order function: map_filter(map, function) → MAP
> --
>
> Key: SPARK-23937
> URL: https://issues.apache.org/jira/browse/SPARK-23937
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Constructs a map from those entries of map for which function returns true:
> {noformat}
> SELECT map_filter(MAP(ARRAY[], ARRAY[]), (k, v) -> true); -- {}
> SELECT map_filter(MAP(ARRAY[10, 20, 30], ARRAY['a', NULL, 'c']), (k, v) -> v 
> IS NOT NULL); -- {10 -> a, 30 -> c}
> SELECT map_filter(MAP(ARRAY['k1', 'k2', 'k3'], ARRAY[20, 3, 15]), (k, v) -> v 
> > 10); -- {k1 -> 20, k3 -> 15}
> {noformat}
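
For readers less familiar with the Presto syntax quoted above, here are the
same semantics on plain Scala collections, purely as an illustration (the
proposed Spark function would operate on MapType columns, not Scala Maps):

{code:scala}
// map_filter keeps the entries for which the predicate returns true.
def mapFilter[K, V](m: Map[K, V])(f: (K, V) => Boolean): Map[K, V] =
  m.filter { case (k, v) => f(k, v) }

mapFilter(Map(10 -> "a", 20 -> null, 30 -> "c"))((_, v) => v != null)
// Map(10 -> "a", 30 -> "c")

mapFilter(Map("k1" -> 20, "k2" -> 3, "k3" -> 15))((_, v) => v > 10)
// Map(k1 -> 20, k3 -> 15)
{code}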



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23937) High-order function: map_filter(map, function) → MAP

2018-08-03 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23937:


Assignee: (was: Apache Spark)

> High-order function: map_filter(map, function) → MAP
> --
>
> Key: SPARK-23937
> URL: https://issues.apache.org/jira/browse/SPARK-23937
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Constructs a map from those entries of map for which function returns true:
> {noformat}
> SELECT map_filter(MAP(ARRAY[], ARRAY[]), (k, v) -> true); -- {}
> SELECT map_filter(MAP(ARRAY[10, 20, 30], ARRAY['a', NULL, 'c']), (k, v) -> v 
> IS NOT NULL); -- {10 -> a, 30 -> c}
> SELECT map_filter(MAP(ARRAY['k1', 'k2', 'k3'], ARRAY[20, 3, 15]), (k, v) -> v 
> > 10); -- {k1 -> 20, k3 -> 15}
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23937) High-order function: map_filter(map, function) → MAP

2018-08-03 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23937:


Assignee: Apache Spark

> High-order function: map_filter(map, function) → MAP
> --
>
> Key: SPARK-23937
> URL: https://issues.apache.org/jira/browse/SPARK-23937
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>Priority: Major
>
> Constructs a map from those entries of map for which function returns true:
> {noformat}
> SELECT map_filter(MAP(ARRAY[], ARRAY[]), (k, v) -> true); -- {}
> SELECT map_filter(MAP(ARRAY[10, 20, 30], ARRAY['a', NULL, 'c']), (k, v) -> v 
> IS NOT NULL); -- {10 -> a, 30 -> c}
> SELECT map_filter(MAP(ARRAY['k1', 'k2', 'k3'], ARRAY[20, 3, 15]), (k, v) -> v 
> > 10); -- {k1 -> 20, k3 -> 15}
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25014) When we tried to read kafka topic through spark streaming spark submit is getting failed with Python worker exited unexpectedly (crashed) error

2018-08-03 Thread KARTHIKEYAN RASIPALAYAM DURAIRAJ (JIRA)
KARTHIKEYAN RASIPALAYAM DURAIRAJ created SPARK-25014:


 Summary: When we tried to read kafka topic through spark streaming 
spark submit is getting failed with Python worker exited unexpectedly (crashed) 
error 
 Key: SPARK-25014
 URL: https://issues.apache.org/jira/browse/SPARK-25014
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.3.1
Reporter: KARTHIKEYAN RASIPALAYAM DURAIRAJ
 Fix For: 2.3.2


Hi Team,

 

TOPIC = 'NBC_APPS.TBL_MS_ADVERTISER'
PARTITION = 0
topicAndPartition = TopicAndPartition(TOPIC, PARTITION)
fromOffsets1 = \{topicAndPartition:int(PARTITION)}

def handler(message):
    records = message.collect()
    for record in records:
        value_all=record[1]
        value_key=record[0]
#        print(value_all)

schema_registry_client = CachedSchemaRegistryClient(url='http://localhost:8081')
serializer = MessageSerializer(schema_registry_client)
sc = SparkContext(appName="PythonStreamingAvro")
ssc = StreamingContext(sc, 10)
kvs = KafkaUtils.createDirectStream(ssc, ['NBC_APPS.TBL_MS_ADVERTISER'], 
\{"metadata.broker.list": 'localhost:9092'}, valueDecoder=serializer.decode_message)
lines = kvs.map(lambda x: x[1])
lines.pprint()
kvs.foreachRDD(handler)

ssc.start()
ssc.awaitTermination()

 

This is the code we are using to pull data from the Kafka topic. When we
execute it through spark-submit, we get the error below:

 

 

2018-08-03 11:10:40 INFO  VerifiableProperties:68 - Property zookeeper.connect 
is overridden to 

2018-08-03 11:10:40 ERROR PythonRunner:91 - Python worker exited unexpectedly 
(crashed)

org.apache.spark.api.python.PythonException: Traceback (most recent call last):

  File 
"/Users/KarthikeyanDurairaj/Desktop/Sparkshell/python/lib/pyspark.zip/pyspark/worker.py",
 line 215, in main

    eval_type = read_int(infile)

  File 
"/Users/KarthikeyanDurairaj/Desktop/Sparkshell/python/lib/pyspark.zip/pyspark/serializers.py",
 line 685, in read_int

    raise EOFError

EOFError

 

 at 
org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:298)

 at 
org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:438)

 at 
org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:421)

 at 
org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:252)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14220) Build and test Spark against Scala 2.12

2018-08-03 Thread Nick Poorman (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-14220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568238#comment-16568238
 ] 

Nick Poorman commented on SPARK-14220:
--

Awesome job! (y)

> Build and test Spark against Scala 2.12
> ---
>
> Key: SPARK-14220
> URL: https://issues.apache.org/jira/browse/SPARK-14220
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Project Infra
>Affects Versions: 2.1.0
>Reporter: Josh Rosen
>Priority: Blocker
>  Labels: release-notes
> Fix For: 2.4.0
>
>
> This umbrella JIRA tracks the requirements for building and testing Spark 
> against the current Scala 2.12 milestone.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24884) Implement regexp_extract_all

2018-08-03 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24884:


Assignee: (was: Apache Spark)

> Implement regexp_extract_all
> 
>
> Key: SPARK-24884
> URL: https://issues.apache.org/jira/browse/SPARK-24884
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Nick Nicolini
>Priority: Major
>
> I've recently hit many cases of regexp parsing where we need to match on 
> something that is always arbitrary in length; for example, a text block that 
> looks something like:
> {code:java}
> AAA:WORDS|
> BBB:TEXT|
> MSG:ASDF|
> MSG:QWER|
> ...
> MSG:ZXCV|{code}
> Where I need to pull out all values between "MSG:" and "|", which can occur 
> in each instance between 1 and n times. I cannot reliably use the existing 
> {{regexp_extract}} method since the number of occurrences is always 
> arbitrary, and while I can write a UDF to handle this it'd be great if this 
> was supported natively in Spark.
> Perhaps we can implement something like {{regexp_extract_all}} as 
> [Presto|https://prestodb.io/docs/current/functions/regexp.html] and 
> [Pig|https://pig.apache.org/docs/latest/api/org/apache/pig/builtin/REGEX_EXTRACT_ALL.html]
>  have?
>  
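
Until such a function exists, a UDF along the lines the reporter mentions is a
reasonable workaround. A hedged sketch (the column name, pattern and
{{regexpExtractAll}} helper are made up for illustration):

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{lit, udf}

val spark = SparkSession.builder()
  .appName("regexp-extract-all-udf")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Return every capture of group 1, e.g. all values between "MSG:" and "|".
val regexpExtractAll = udf { (text: String, pattern: String) =>
  if (text == null) List.empty[String]
  else pattern.r.findAllMatchIn(text).map(_.group(1)).toList
}

val df = Seq("AAA:WORDS|BBB:TEXT|MSG:ASDF|MSG:QWER|MSG:ZXCV|").toDF("raw")
df.select(regexpExtractAll($"raw", lit("MSG:([^|]*)\\|")).as("msgs")).show(false)
// msgs: [ASDF, QWER, ZXCV]
{code}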



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24884) Implement regexp_extract_all

2018-08-03 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568215#comment-16568215
 ] 

Apache Spark commented on SPARK-24884:
--

User 'xueyumusic' has created a pull request for this issue:
https://github.com/apache/spark/pull/21985

> Implement regexp_extract_all
> 
>
> Key: SPARK-24884
> URL: https://issues.apache.org/jira/browse/SPARK-24884
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Nick Nicolini
>Priority: Major
>
> I've recently hit many cases of regexp parsing where we need to match on 
> something that is always arbitrary in length; for example, a text block that 
> looks something like:
> {code:java}
> AAA:WORDS|
> BBB:TEXT|
> MSG:ASDF|
> MSG:QWER|
> ...
> MSG:ZXCV|{code}
> Where I need to pull out all values between "MSG:" and "|", which can occur 
> in each instance between 1 and n times. I cannot reliably use the existing 
> {{regexp_extract}} method since the number of occurrences is always 
> arbitrary, and while I can write a UDF to handle this it'd be great if this 
> was supported natively in Spark.
> Perhaps we can implement something like {{regexp_extract_all}} as 
> [Presto|https://prestodb.io/docs/current/functions/regexp.html] and 
> [Pig|https://pig.apache.org/docs/latest/api/org/apache/pig/builtin/REGEX_EXTRACT_ALL.html]
>  have?
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24884) Implement regexp_extract_all

2018-08-03 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24884:


Assignee: Apache Spark

> Implement regexp_extract_all
> 
>
> Key: SPARK-24884
> URL: https://issues.apache.org/jira/browse/SPARK-24884
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Nick Nicolini
>Assignee: Apache Spark
>Priority: Major
>
> I've recently hit many cases of regexp parsing where we need to match on 
> something that is always arbitrary in length; for example, a text block that 
> looks something like:
> {code:java}
> AAA:WORDS|
> BBB:TEXT|
> MSG:ASDF|
> MSG:QWER|
> ...
> MSG:ZXCV|{code}
> Where I need to pull out all values between "MSG:" and "|", which can occur 
> in each instance between 1 and n times. I cannot reliably use the existing 
> {{regexp_extract}} method since the number of occurrences is always 
> arbitrary, and while I can write a UDF to handle this it'd be great if this 
> was supported natively in Spark.
> Perhaps we can implement something like {{regexp_extract_all}} as 
> [Presto|https://prestodb.io/docs/current/functions/regexp.html] and 
> [Pig|https://pig.apache.org/docs/latest/api/org/apache/pig/builtin/REGEX_EXTRACT_ALL.html]
>  have?
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source

2018-08-03 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568204#comment-16568204
 ] 

Thomas Graves commented on SPARK-24924:
---

[~felixcheung] did your discussion on the same thing with csv get resolved?  

> Add mapping for built-in Avro data source
> -
>
> Key: SPARK-24924
> URL: https://issues.apache.org/jira/browse/SPARK-24924
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 2.4.0
>
>
> This issue aims to the followings.
>  # Like `com.databricks.spark.csv` mapping, we had better map 
> `com.databricks.spark.avro` to built-in Avro data source.
>  # Remove incorrect error message, `Please find an Avro package at ...`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source

2018-08-03 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568199#comment-16568199
 ] 

Thomas Graves commented on SPARK-24924:
---

Hmm, so we are adding this for ease of upgrading, I guess (so users don't have
to change their code), but at the same time we aren't adding the
spark.read.avro syntax, so does it break in that case, or do they get a
different implementation by default?

This doesn't make sense to me. Personally I don't like having other add-on
package names in our code at all, and here we are mapping what the user thought
they would get to our internal implementation, which could very well be
different. I would rather just plainly error out saying these conflict: either
update or change your external package to use a different name. One might also
be able to argue that it's breaking API compatibility since the .avro option
went away, but it's a third-party library so you can probably get away with
that.
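
To make the compatibility question concrete, a small usage sketch of the three
call styles being discussed (the paths are made up, and the behaviour is as
described in this thread rather than verified against a build):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("avro-mapping-example")
  .master("local[*]")
  .getOrCreate()

// 1) Short name: resolves to the built-in Avro source in 2.4.0.
val a = spark.read.format("avro").load("/tmp/events.avro")

// 2) Old fully qualified name: with the mapping in this issue, this is
//    redirected to the built-in source instead of the external package.
val b = spark.read.format("com.databricks.spark.avro").load("/tmp/events.avro")

// 3) spark.read.avro(...) only exists when the external package's implicits
//    are imported; Spark itself does not add that syntax.
{code}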

> Add mapping for built-in Avro data source
> -
>
> Key: SPARK-24924
> URL: https://issues.apache.org/jira/browse/SPARK-24924
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 2.4.0
>
>
> This issue aims to the followings.
>  # Like `com.databricks.spark.csv` mapping, we had better map 
> `com.databricks.spark.avro` to built-in Avro data source.
>  # Remove incorrect error message, `Please find an Avro package at ...`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23937) High-order function: map_filter(map, function) → MAP

2018-08-03 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568152#comment-16568152
 ] 

Marco Gaido commented on SPARK-23937:
-

I am working on this, thanks.

> High-order function: map_filter(map, function) → MAP
> --
>
> Key: SPARK-23937
> URL: https://issues.apache.org/jira/browse/SPARK-23937
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Constructs a map from those entries of map for which function returns true:
> {noformat}
> SELECT map_filter(MAP(ARRAY[], ARRAY[]), (k, v) -> true); -- {}
> SELECT map_filter(MAP(ARRAY[10, 20, 30], ARRAY['a', NULL, 'c']), (k, v) -> v 
> IS NOT NULL); -- {10 -> a, 30 -> c}
> SELECT map_filter(MAP(ARRAY['k1', 'k2', 'k3'], ARRAY[20, 3, 15]), (k, v) -> v 
> > 10); -- {k1 -> 20, k3 -> 15}
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24772) support reading AVRO logical types - Decimal

2018-08-03 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568139#comment-16568139
 ] 

Apache Spark commented on SPARK-24772:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/21984

> support reading AVRO logical types - Decimal
> 
>
> Key: SPARK-24772
> URL: https://issues.apache.org/jira/browse/SPARK-24772
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24772) support reading AVRO logical types - Decimal

2018-08-03 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24772:


Assignee: (was: Apache Spark)

> support reading AVRO logical types - Decimal
> 
>
> Key: SPARK-24772
> URL: https://issues.apache.org/jira/browse/SPARK-24772
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24772) support reading AVRO logical types - Decimal

2018-08-03 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24772:


Assignee: Apache Spark

> support reading AVRO logical types - Decimal
> 
>
> Key: SPARK-24772
> URL: https://issues.apache.org/jira/browse/SPARK-24772
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24998) spark-sql will scan the same table repeatedly when doing multi-insert

2018-08-03 Thread ice bai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ice bai updated SPARK-24998:

Summary: spark-sql will scan the same table repeatedly when doing 
multi-insert  (was: spark-sql will scan the same table repeatedly when doing 
multi-insert")

> spark-sql will scan the same table repeatedly when doing multi-insert
> -
>
> Key: SPARK-24998
> URL: https://issues.apache.org/jira/browse/SPARK-24998
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: ice bai
>Priority: Major
> Attachments: scan_table_many_times.png
>
>
> Such as the query likes "From xx SELECT yy INSERT INTO a INSERT INTO b INSERT 
> INTO c ..." .
> Following screenshot shows the stages:
> !scan_table_many_times.png!
> But, Hive only scan once.
>  
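
One possible mitigation while the plan behaves this way, sketched with the
placeholder names from the description (not verified; caching only ensures the
repeated scans hit memory rather than re-reading the files):

{code:scala}
// Assumes an existing SparkSession named `spark` and tables xx, a, b and c.
spark.sql("CACHE TABLE xx")

// Hive-style multi-insert of the shape described above; each INSERT branch
// otherwise triggers its own scan of xx.
spark.sql(
  """FROM xx
    |INSERT INTO TABLE a SELECT yy
    |INSERT INTO TABLE b SELECT yy
    |INSERT INTO TABLE c SELECT yy
  """.stripMargin)
{code}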



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24998) spark-sql will scan the same table repeatedly when doing multi-insert"

2018-08-03 Thread ice bai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ice bai updated SPARK-24998:

Description: 
Such as the query likes "From xx SELECT yy INSERT INTO a INSERT INTO b INSERT 
INTO c ..." .

Following screenshot shows the stages:

!scan_table_many_times.png!

But, Hive only scan once.

 

  was:
Such as the following screenshot:

!scan_table_many_times.png!

But, Hive only scan once.

 


> spark-sql will scan the same table repeatedly when doing multi-insert"
> --
>
> Key: SPARK-24998
> URL: https://issues.apache.org/jira/browse/SPARK-24998
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: ice bai
>Priority: Major
> Attachments: scan_table_many_times.png
>
>
> Such as the query likes "From xx SELECT yy INSERT INTO a INSERT INTO b INSERT 
> INTO c ..." .
> Following screenshot shows the stages:
> !scan_table_many_times.png!
> But, Hive only scan once.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24998) spark-sql will scan the same table repeatedly when doing multi-insert"

2018-08-03 Thread ice bai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ice bai updated SPARK-24998:

Summary: spark-sql will scan the same table repeatedly when doing 
multi-insert"  (was: spark-sql will scan the same table repeatedly when the 
query likes "From xx SELECT yy INSERT a INSERT b INSERT c ...")

> spark-sql will scan the same table repeatedly when doing multi-insert"
> --
>
> Key: SPARK-24998
> URL: https://issues.apache.org/jira/browse/SPARK-24998
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: ice bai
>Priority: Major
> Attachments: scan_table_many_times.png
>
>
> Such as the following screenshot:
> !scan_table_many_times.png!
> But, Hive only scan once.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25013) JDBC urls with jdbc:mariadb don't work as expected

2018-08-03 Thread Dieter Vekeman (JIRA)
Dieter Vekeman created SPARK-25013:
--

 Summary: JDBC urls with jdbc:mariadb don't work as expected
 Key: SPARK-25013
 URL: https://issues.apache.org/jira/browse/SPARK-25013
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.1
Reporter: Dieter Vekeman


When using the MariaDB JDBC driver, the JDBC connection URL should be:
{code:java}
jdbc:mariadb://localhost:3306/DB?user=someuser&password=somepassword
{code}
https://mariadb.com/kb/en/library/about-mariadb-connector-j/
However, this does not work well in Spark (see below).

*Workaround*
The MariaDB driver also accepts the mysql prefix, which does work.

The problem seems to have been described and identified in:
https://jira.mariadb.org/browse/CONJ-421

Everything works in Spark with a connection string of the form
{{"jdbc:mysql:..."}}, but not with {{"jdbc:mariadb:..."}}, because the MySQL
dialect is then not used. When the dialect is not used, the default quote
character is {{"}}, not {{`}}. So an internal query generated by Spark, such as
{{SELECT `i`,`ip` FROM tmp}}, will be executed as {{SELECT "i","ip" FROM tmp}}
with the dataType previously retrieved, causing the exception.

The author of the comment says
{quote}I'll make a pull request to spark so "jdbc:mariadb:" connection string 
can be handle{quote}

Did the pull request get lost or should a new one be made?
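
Until the mariadb prefix is handled upstream, one workaround besides switching
to {{jdbc:mysql:}} is to register a dialect that claims {{jdbc:mariadb:}} URLs
and quotes identifiers with backticks, mirroring the MySQL dialect. A sketch
(this is not the pending pull request, just an illustration using the public
JdbcDialect API):

{code:scala}
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

// Minimal dialect so that "jdbc:mariadb:" URLs get backtick quoting like MySQL.
object MariaDbDialect extends JdbcDialect {

  override def canHandle(url: String): Boolean =
    url.toLowerCase.startsWith("jdbc:mariadb")

  // Quote identifiers with backticks so generated queries such as
  // SELECT `i`,`ip` FROM tmp stay valid instead of becoming SELECT "i","ip".
  override def quoteIdentifier(colName: String): String =
    s"`${colName.replace("`", "``")}`"
}

JdbcDialects.registerDialect(MariaDbDialect)
{code}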




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23932) High-order function: zip_with(array, array, function) → array

2018-08-03 Thread Takuya Ueshin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568059#comment-16568059
 ] 

Takuya Ueshin commented on SPARK-23932:
---

Hi [~crafty-coder],
 Are you still working on this?
 Recently I added a base framework for higher-order functions. You can use it 
to implement this.
 If you don't have enough time, I can take this over, so please let me know.
 Thanks!

> High-order function: zip_with(array, array, function) → 
> array
> ---
>
> Key: SPARK-23932
> URL: https://issues.apache.org/jira/browse/SPARK-23932
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Merges the two given arrays, element-wise, into a single array using 
> function. Both arrays must be the same length.
> {noformat}
> SELECT zip_with(ARRAY[1, 3, 5], ARRAY['a', 'b', 'c'], (x, y) -> (y, x)); -- 
> [ROW('a', 1), ROW('b', 3), ROW('c', 5)]
> SELECT zip_with(ARRAY[1, 2], ARRAY[3, 4], (x, y) -> x + y); -- [4, 6]
> SELECT zip_with(ARRAY['a', 'b', 'c'], ARRAY['d', 'e', 'f'], (x, y) -> 
> concat(x, y)); -- ['ad', 'be', 'cf']
> {noformat}
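
For reference, the same semantics on plain Scala collections, purely as an
illustration (the Spark function would operate on ArrayType columns through
the higher-order-function framework mentioned above):

{code:scala}
// zip_with merges two equal-length sequences element-wise with a function.
def zipWith[A, B, C](xs: Seq[A], ys: Seq[B])(f: (A, B) => C): Seq[C] = {
  require(xs.length == ys.length, "both arrays must be the same length")
  xs.zip(ys).map { case (x, y) => f(x, y) }
}

zipWith(Seq(1, 2), Seq(3, 4))(_ + _)                     // Seq(4, 6)
zipWith(Seq("a", "b", "c"), Seq("d", "e", "f"))(_ + _)   // Seq("ad", "be", "cf")
{code}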



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24987) Kafka Cached Consumer Leaking File Descriptors

2018-08-03 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24987:


Assignee: (was: Apache Spark)

> Kafka Cached Consumer Leaking File Descriptors
> --
>
> Key: SPARK-24987
> URL: https://issues.apache.org/jira/browse/SPARK-24987
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0, 2.3.1
> Environment: Spark 2.3.1
> Java(TM) SE Runtime Environment (build 1.8.0_112-b15)
>  Java HotSpot(TM) 64-Bit Server VM (build 25.112-b15, mixed mode)
>  
>Reporter: Yuval Itzchakov
>Priority: Critical
>
> Setup:
>  * Spark 2.3.1
>  * Java 1.8.0 (112)
>  * Standalone Cluster Manager
>  * 3 Nodes, 1 Executor per node.
> Spark 2.3.0 introduced a new mechanism for caching Kafka consumers 
> (https://issues.apache.org/jira/browse/SPARK-23623?page=com.atlassian.jira.plugin.system.issuetabpanels%3Aall-tabpanel)
>  via KafkaDataConsumer.acquire.
> It seems that there are situations (I've been trying to debug it, haven't 
> been able to find the root cause as of yet) where cached consumers remain "in 
> use" throughout the life time of the task and are never released. This can be 
> identified by the following line of the stack trace:
> at 
> org.apache.spark.sql.kafka010.KafkaDataConsumer$.acquire(KafkaDataConsumer.scala:460)
> Which points to:
> {code:java}
> } else if (existingInternalConsumer.inUse) {
>   // If consumer is already cached but is currently in use, then return a new 
> consumer
>   NonCachedKafkaDataConsumer(newInternalConsumer)
> {code}
>  Meaning the existing consumer created for that `TopicPartition` is still in 
> use for some reason. The weird thing is that you can see this for very old 
> tasks which have already finished successfully.
> I've traced down this leak using file leak detector, attaching it to the 
> running Executor JVM process. I've emitted the list of open file descriptors 
> which [you can find 
> here|https://gist.github.com/YuvalItzchakov/cdbdd7f67604557fccfbcce673c49e5d],
>  and you can see that the majority of them are epoll FD used by Kafka 
> Consumers, indicating that they aren't closing.
>  Spark graph:
> {code:java}
> kafkaStream
>   .load()
>   .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
>   .as[(String, String)]
>   .flatMap {...}
>   .groupByKey(...)
>   .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout())(...)
>   .foreach(...)
>   .outputMode(OutputMode.Update)
>   .option("checkpointLocation",
> sparkConfiguration.properties.checkpointDirectory)
>   .start()
>   .awaitTermination(){code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24987) Kafka Cached Consumer Leaking File Descriptors

2018-08-03 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24987:


Assignee: Apache Spark

> Kafka Cached Consumer Leaking File Descriptors
> --
>
> Key: SPARK-24987
> URL: https://issues.apache.org/jira/browse/SPARK-24987
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0, 2.3.1
> Environment: Spark 2.3.1
> Java(TM) SE Runtime Environment (build 1.8.0_112-b15)
>  Java HotSpot(TM) 64-Bit Server VM (build 25.112-b15, mixed mode)
>  
>Reporter: Yuval Itzchakov
>Assignee: Apache Spark
>Priority: Critical
>
> Setup:
>  * Spark 2.3.1
>  * Java 1.8.0 (112)
>  * Standalone Cluster Manager
>  * 3 Nodes, 1 Executor per node.
> Spark 2.3.0 introduced a new mechanism for caching Kafka consumers 
> (https://issues.apache.org/jira/browse/SPARK-23623?page=com.atlassian.jira.plugin.system.issuetabpanels%3Aall-tabpanel)
>  via KafkaDataConsumer.acquire.
> It seems that there are situations (I've been trying to debug it, haven't 
> been able to find the root cause as of yet) where cached consumers remain "in 
> use" throughout the life time of the task and are never released. This can be 
> identified by the following line of the stack trace:
> at 
> org.apache.spark.sql.kafka010.KafkaDataConsumer$.acquire(KafkaDataConsumer.scala:460)
> Which points to:
> {code:java}
> } else if (existingInternalConsumer.inUse) {
>   // If consumer is already cached but is currently in use, then return a new 
> consumer
>   NonCachedKafkaDataConsumer(newInternalConsumer)
> {code}
>  Meaning the existing consumer created for that `TopicPartition` is still in 
> use for some reason. The weird thing is that you can see this for very old 
> tasks which have already finished successfully.
> I've traced down this leak using file leak detector, attaching it to the 
> running Executor JVM process. I've emitted the list of open file descriptors 
> which [you can find 
> here|https://gist.github.com/YuvalItzchakov/cdbdd7f67604557fccfbcce673c49e5d],
>  and you can see that the majority of them are epoll FD used by Kafka 
> Consumers, indicating that they aren't closing.
>  Spark graph:
> {code:java}
> kafkaStream
>   .load()
>   .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
>   .as[(String, String)]
>   .flatMap {...}
>   .groupByKey(...)
>   .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout())(...)
>   .foreach(...)
>   .outputMode(OutputMode.Update)
>   .option("checkpointLocation",
> sparkConfiguration.properties.checkpointDirectory)
>   .start()
>   .awaitTermination(){code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24987) Kafka Cached Consumer Leaking File Descriptors

2018-08-03 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568022#comment-16568022
 ] 

Apache Spark commented on SPARK-24987:
--

User 'YuvalItzchakov' has created a pull request for this issue:
https://github.com/apache/spark/pull/21983

> Kafka Cached Consumer Leaking File Descriptors
> --
>
> Key: SPARK-24987
> URL: https://issues.apache.org/jira/browse/SPARK-24987
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0, 2.3.1
> Environment: Spark 2.3.1
> Java(TM) SE Runtime Environment (build 1.8.0_112-b15)
>  Java HotSpot(TM) 64-Bit Server VM (build 25.112-b15, mixed mode)
>  
>Reporter: Yuval Itzchakov
>Priority: Critical
>
> Setup:
>  * Spark 2.3.1
>  * Java 1.8.0 (112)
>  * Standalone Cluster Manager
>  * 3 Nodes, 1 Executor per node.
> Spark 2.3.0 introduced a new mechanism for caching Kafka consumers 
> (https://issues.apache.org/jira/browse/SPARK-23623?page=com.atlassian.jira.plugin.system.issuetabpanels%3Aall-tabpanel)
>  via KafkaDataConsumer.acquire.
> It seems that there are situations (I've been trying to debug it, haven't 
> been able to find the root cause as of yet) where cached consumers remain "in 
> use" throughout the life time of the task and are never released. This can be 
> identified by the following line of the stack trace:
> at 
> org.apache.spark.sql.kafka010.KafkaDataConsumer$.acquire(KafkaDataConsumer.scala:460)
> Which points to:
> {code:java}
> } else if (existingInternalConsumer.inUse) {
>   // If consumer is already cached but is currently in use, then return a new 
> consumer
>   NonCachedKafkaDataConsumer(newInternalConsumer)
> {code}
>  Meaning the existing consumer created for that `TopicPartition` is still in 
> use for some reason. The weird thing is that you can see this for very old 
> tasks which have already finished successfully.
> I've traced down this leak using file leak detector, attaching it to the 
> running Executor JVM process. I've emitted the list of open file descriptors 
> which [you can find 
> here|https://gist.github.com/YuvalItzchakov/cdbdd7f67604557fccfbcce673c49e5d],
>  and you can see that the majority of them are epoll FD used by Kafka 
> Consumers, indicating that they aren't closing.
>  Spark graph:
> {code:java}
> kafkaStream
>   .load()
>   .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
>   .as[(String, String)]
>   .flatMap {...}
>   .groupByKey(...)
>   .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout())(...)
>   .foreach(...)
>   .outputMode(OutputMode.Update)
>   .option("checkpointLocation",
> sparkConfiguration.properties.checkpointDirectory)
>   .start()
>   .awaitTermination(){code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24598) SPARK SQL:Datatype overflow conditions gives incorrect result

2018-08-03 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568005#comment-16568005
 ] 

Marco Gaido commented on SPARK-24598:
-

[~smilegator] since we only enhanced the docs and have not really addressed the
overflow condition itself, which I think we are targeting to fix in 3.0, shall
we leave this open for now and resolve it once the actual fix is in place? What
do you think? Thanks.
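
For reference, a minimal illustration of the wrap-around behaviour being
discussed (a sketch assuming default settings; the exact displayed value can
depend on how the literal is typed):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("overflow-demo")
  .master("local[1]")
  .getOrCreate()

// JVM long arithmetic wraps around silently ...
assert(Long.MaxValue + 1L == Long.MinValue)

// ... and Spark SQL currently follows the same semantics instead of raising
// an error the way MySQL does for BIGINT overflow.
spark.sql(
  "SELECT CAST(9223372036854775807 AS BIGINT) + CAST(1 AS BIGINT) AS wrapped"
).show()
{code}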

> SPARK SQL:Datatype overflow conditions gives incorrect result
> -
>
> Key: SPARK-24598
> URL: https://issues.apache.org/jira/browse/SPARK-24598
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: navya
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 2.4.0
>
>
> Execute an sql query, so that it results in overflow conditions. 
> EX - SELECT 9223372036854775807 + 1 result = -9223372036854776000
>  
> Expected result - Error should be throw like mysql. 
> mysql> SELECT 9223372036854775807 + 1;
> ERROR 1690 (22003): BIGINT value is out of range in '(9223372036854775807 + 
> 1)'



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25012) dataframe creation results in matcherror

2018-08-03 Thread uwe (JIRA)
uwe created SPARK-25012:
---

 Summary: dataframe creation results in matcherror
 Key: SPARK-25012
 URL: https://issues.apache.org/jira/browse/SPARK-25012
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 2.3.1
 Environment: spark 2.3.1

mac

scala 2.11.12

 
Reporter: uwe


Hi,

Running the attached code results in a

{code:java}
scala.MatchError: 2017-02-09 00:09:27.0 (of class java.sql.Timestamp)
{code}
 # I do think this is wrong (at least I do not see the issue in my code).
 # The error occurs in 90% of the cases (it sometimes passes), which makes me
think something weird is going on.

 

 
{code:java}
package misc

import java.sql.Timestamp
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.sources._
import org.apache.spark.sql.types.{StringType, StructField, StructType, TimestampType}
import org.apache.spark.sql.{Row, SQLContext, SparkSession}

case class LogRecord(application: String, dateTime: Timestamp, component: String, level: String, message: String)

class LogRelation(val sqlContext: SQLContext, val path: String) extends BaseRelation with PrunedFilteredScan {

  override def schema: StructType = StructType(Seq(
    StructField("application", StringType, false),
    StructField("dateTime", TimestampType, false),
    StructField("component", StringType, false),
    StructField("level", StringType, false),
    StructField("message", StringType, false)))

  override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
    val str = "2017-02-09T00:09:27"
    val ts = Timestamp.valueOf(LocalDateTime.parse(str, DateTimeFormatter.ISO_LOCAL_DATE_TIME))

    val data = List(Row("app", ts, "comp", "level", "mess"), Row("app", ts, "comp", "level", "mess"))
    sqlContext.sparkContext.parallelize(data)
  }
}

class LogDataSource extends DataSourceRegister with RelationProvider {
  override def shortName(): String = "log"

  override def createRelation(sqlContext: SQLContext, parameters: Map[String, String]): BaseRelation =
    new LogRelation(sqlContext, parameters("path"))
}

object f0 extends App {
  lazy val spark: SparkSession =
    SparkSession.builder().master("local").appName("spark session").getOrCreate()

  val df = spark.read.format("log").load("hdfs:///logs")
  df.show()
}
{code}
 

results in the following stacktrace

 
{noformat}
11:20:06 [task-result-getter-0] ERROR o.a.spark.scheduler.TaskSetManager - Task 0 in stage 0.0 failed 1 times; aborting job
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): scala.MatchError: 2017-02-09 00:09:27.0 (of class java.sql.Timestamp)
 at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:276)
 at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:275)
 at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103)
 at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:379)
 at org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$3.apply(ExistingRDD.scala:60)
 at org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$3.apply(ExistingRDD.scala:57)
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
 at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
 at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
 at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
 at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
 at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
 at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
 at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
 at org.apache.spark.scheduler.Task.run(Task.scala:109)
 at org.apache.spark.executor.Executor
{noformat}

[jira] [Commented] (SPARK-24928) spark sql cross join running time too long

2018-08-03 Thread LIFULONG (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16567986#comment-16567986
 ] 

LIFULONG commented on SPARK-24928:
--

for (x <- rdd1.iterator(currSplit.s1, context);
 y <- rdd2.iterator(currSplit.s2, context)) yield (x, y)

 

This code is from the CartesianRDD.compute() method; it looks like it will re-load the right RDD from text for each record in the left RDD.
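
If that is the cause, persisting the right-hand RDD (rdd2) makes its recomputation cheap. A hedged sketch of a workaround (my suggestion, not from this thread), reusing the paths from the quoted example and the public {{cartesian}} API:

{code:scala}
// Cache the RDD that ends up as rdd2 (the right-hand side), so it is served from
// the block manager instead of being re-read from HDFS for every left-hand record.
val ones = sc.textFile("hdfs://host:port/data/cartesian_data/t1b").cache()
ones.count()                                 // materialize the cache once

val twos = sc.textFile("hdfs://host:port/data/cartesian_data/t2b")

val cartesian = twos.cartesian(ones)         // same pairing as new CartesianRDD(sc, twos, ones)
cartesian.count()
{code}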

> spark sql cross join running time too long
> --
>
> Key: SPARK-24928
> URL: https://issues.apache.org/jira/browse/SPARK-24928
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 1.6.2
>Reporter: LIFULONG
>Priority: Minor
>
> Spark SQL running time is too long while the input left table and right table are 
> small HDFS text-format data.
> The SQL is:  select * from t1 cross join t2  
> t1 has 49 lines and three columns
> t2 has 1 line and one column only
> It ran for more than 30 minutes and then failed
>  
>  
> Spark's CartesianRDD also has the same problem; example test code is:
> val ones = sc.textFile("hdfs://host:port/data/cartesian_data/t1b")  //1 line 
> 1 column
>  val twos = sc.textFile("hdfs://host:port/data/cartesian_data/t2b")  //49 
> line 3 column
>  val cartesian = new CartesianRDD(sc, twos, ones)
> cartesian.count()
> It ran for more than 5 minutes, while using CartesianRDD(sc, ones, twos) takes 
> less than 10 seconds



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23911) High-order function: reduce(array, initialState S, inputFunction, outputFunction) → R

2018-08-03 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16567963#comment-16567963
 ] 

Apache Spark commented on SPARK-23911:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/21982

> High-order function: reduce(array, initialState S, inputFunction, 
> outputFunction) → R
> ---
>
> Key: SPARK-23911
> URL: https://issues.apache.org/jira/browse/SPARK-23911
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Herman van Hovell
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Returns a single value reduced from array. inputFunction will be invoked for 
> each element in array in order. In addition to taking the element, 
> inputFunction takes the current state, initially initialState, and returns 
> the new state. outputFunction will be invoked to turn the final state into 
> the result value. It may be the identity function (i -> i).
> {noformat}
> SELECT reduce(ARRAY [], 0, (s, x) -> s + x, s -> s); -- 0
> SELECT reduce(ARRAY [5, 20, 50], 0, (s, x) -> s + x, s -> s); -- 75
> SELECT reduce(ARRAY [5, 20, NULL, 50], 0, (s, x) -> s + x, s -> s); -- NULL
> SELECT reduce(ARRAY [5, 20, NULL, 50], 0, (s, x) -> s + COALESCE(x, 0), s -> 
> s); -- 75
> SELECT reduce(ARRAY [5, 20, NULL, 50], 0, (s, x) -> IF(x IS NULL, s, s + x), 
> s -> s); -- 75
> SELECT reduce(ARRAY [2147483647, 1], CAST (0 AS BIGINT), (s, x) -> s + x, s 
> -> s); -- 2147483648
> SELECT reduce(ARRAY [5, 6, 10, 20], -- calculates arithmetic average: 10.25
>   CAST(ROW(0.0, 0) AS ROW(sum DOUBLE, count INTEGER)),
>   (s, x) -> CAST(ROW(x + s.sum, s.count + 1) AS ROW(sum DOUBLE, 
> count INTEGER)),
>   s -> IF(s.count = 0, NULL, s.sum / s.count));
> {noformat}
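
The quoted semantics map directly onto a fold; a plain-Scala sketch (illustrative only, not Spark's implementation) of how {{inputFunction}} and {{outputFunction}} interact:

{code:scala}
// reduce(array, initialState, inputFunction, outputFunction) expressed as a fold.
val arr = Seq(5, 20, 50)
val initialState = 0
val inputFunction: (Int, Int) => Int = (s, x) => s + x   // folds each element into the state
val outputFunction: Int => Int = s => s                  // identity: turns the final state into the result

val result = outputFunction(arr.foldLeft(initialState)(inputFunction))   // 75
{code}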



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23909) High-order function: filter(array, function) → array

2018-08-03 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16567941#comment-16567941
 ] 

Apache Spark commented on SPARK-23909:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/21982

> High-order function: filter(array, function) → array
> --
>
> Key: SPARK-23909
> URL: https://issues.apache.org/jira/browse/SPARK-23909
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Herman van Hovell
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Constructs an array from those elements of array for which function returns 
> true:
> {noformat}
> SELECT filter(ARRAY [], x -> true); -- []
> SELECT filter(ARRAY [5, -6, NULL, 7], x -> x > 0); -- [5, 7]
> SELECT filter(ARRAY [5, NULL, 7, NULL], x -> x IS NOT NULL); -- [5, 7]
> {noformat}
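
A plain-Scala analogue of the quoted {{filter}} semantics (illustrative only; {{Option}} stands in for SQL NULL):

{code:scala}
// filter(array, function): keep only the elements for which the predicate holds.
val xs: Seq[Option[Int]] = Seq(Some(5), Some(-6), None, Some(7))

val positives = xs.flatten.filter(x => x > 0)   // Seq(5, 7), like x -> x > 0
val nonNull   = xs.flatten                      // Seq(5, -6, 7), like x -> x IS NOT NULL
{code}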



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24993) Make Avro fast again

2018-08-03 Thread DB Tsai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai resolved SPARK-24993.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

> Make Avro fast again
> 
>
> Key: SPARK-24993
> URL: https://issues.apache.org/jira/browse/SPARK-24993
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: DB Tsai
>Priority: Major
> Fix For: 2.4.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24993) Make Avro fast again

2018-08-03 Thread DB Tsai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-24993:

Affects Version/s: (was: 2.3.0)
   2.4.0

> Make Avro fast again
> 
>
> Key: SPARK-24993
> URL: https://issues.apache.org/jira/browse/SPARK-24993
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: DB Tsai
>Priority: Major
> Fix For: 2.4.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24989) BlockFetcher should retry while getting OutOfDirectMemoryError

2018-08-03 Thread Li Yuanjian (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Yuanjian resolved SPARK-24989.
-
Resolution: Not A Problem

The parameter {{spark.reducer.maxBlocksInFlightPerAddress}} added in SPARK-21243 can 
solve this problem, so I am closing this JIRA. Thanks [~Dhruve Ashar]!
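
For reference, a minimal sketch of setting that parameter when building the session (the value 64 is an illustrative placeholder, not a recommendation from this thread):

{code:scala}
import org.apache.spark.sql.SparkSession

// Cap the number of shuffle blocks fetched in flight from any single remote
// address, so a reducer cannot flood one server with 16MB buffer allocations.
val spark = SparkSession.builder()
  .appName("shuffle-read-tuning")
  .config("spark.reducer.maxBlocksInFlightPerAddress", "64")   // placeholder value
  .getOrCreate()
{code}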

> BlockFetcher should retry while getting OutOfDirectMemoryError
> --
>
> Key: SPARK-24989
> URL: https://issues.apache.org/jira/browse/SPARK-24989
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 2.2.0
>Reporter: Li Yuanjian
>Priority: Major
> Attachments: FailedStage.png
>
>
> h3. Description
> This problem can be reproduced stably by a large-parallelism job migrated from 
> MapReduce to Spark in our practice; some metrics are listed below:
> ||Item||Value||
> |spark.executor.instances|1000|
> |spark.executor.cores|5|
> |task number of shuffle writer stage|18038|
> |task number of shuffle reader stage|8|
> While the shuffle writer stage ended successfully, the shuffle reader stage 
> keeps failing with FetchFailed. Each fetch request needs the Netty server to 
> allocate a 16MB buffer (detailed stack attached below); the huge number of 
> fetch requests uses up the default maxDirectMemory rapidly, even though we 
> bumped io.netty.maxDirectMemory up to 50GB!
> {code:java}
> org.apache.spark.shuffle.FetchFailedException: failed to allocate 16777216 byte(s) of direct memory (used: 21474836480, max: 21474836480)
>   at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:514)
>   at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:445)
>   at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:61)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
>   at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:199)
>   at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:119)
>   at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:105)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:108)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 16777216 byte(s) of direct memory (use

[jira] [Resolved] (SPARK-25009) Standalone Cluster mode application submit is not working

2018-08-03 Thread DB Tsai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai resolved SPARK-25009.
-
   Resolution: Fixed
 Assignee: Devaraj K
Fix Version/s: 2.4.0

> Standalone Cluster mode application submit is not working
> -
>
> Key: SPARK-25009
> URL: https://issues.apache.org/jira/browse/SPARK-25009
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Devaraj K
>Assignee: Devaraj K
>Priority: Critical
> Fix For: 2.4.0
>
>
> It does not show any error while submitting, but the app is not running and 
> it does not show up in the web UI.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25011) Add PrefixSpan to __all__

2018-08-03 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25011.
--
   Resolution: Fixed
 Assignee: yuhao yang
Fix Version/s: 2.4.0

Fixed in https://github.com/apache/spark/pull/21981

> Add PrefixSpan to __all__
> -
>
> Key: SPARK-25011
> URL: https://issues.apache.org/jira/browse/SPARK-25011
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: yuhao yang
>Assignee: yuhao yang
>Priority: Minor
> Fix For: 2.4.0
>
>
> Add PrefixSpan to __all__ in fpm.py



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


