[jira] [Resolved] (SPARK-22300) Update ORC to 1.4.1

2017-10-18 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-22300.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19521
[https://github.com/apache/spark/pull/19521]

> Update ORC to 1.4.1
> ---
>
> Key: SPARK-22300
> URL: https://issues.apache.org/jira/browse/SPARK-22300
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
> Fix For: 2.3.0
>
>
> Apache ORC 1.4.1 was released yesterday:
> - https://orc.apache.org/news/2017/10/16/ORC-1.4.1/
> It contains several important fixes, such as ORC-233 (allow `orc.include.columns` 
> to be empty).
> We should upgrade to the latest release.






[jira] [Assigned] (SPARK-22300) Update ORC to 1.4.1

2017-10-18 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-22300:
---

Assignee: Dongjoon Hyun

> Update ORC to 1.4.1
> ---
>
> Key: SPARK-22300
> URL: https://issues.apache.org/jira/browse/SPARK-22300
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>
> Apache ORC 1.4.1 was released yesterday:
> - https://orc.apache.org/news/2017/10/16/ORC-1.4.1/
> It contains several important fixes, such as ORC-233 (allow `orc.include.columns` 
> to be empty).
> We should upgrade to the latest release.






[jira] [Assigned] (SPARK-22312) Spark job stuck with no executor due to bug in Executor Allocation Manager

2017-10-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22312:


Assignee: Apache Spark

> Spark job stuck with no executor due to bug in Executor Allocation Manager
> --
>
> Key: SPARK-22312
> URL: https://issues.apache.org/jira/browse/SPARK-22312
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.2.0
>Reporter: Sital Kedia
>Assignee: Apache Spark
>
> We often see Spark jobs get stuck because the Executor Allocation Manager 
> (EAM) does not request any executors, even though there are pending tasks, 
> when dynamic allocation is turned on. Looking at the logic in the EAM that 
> tracks running tasks, the calculation can go wrong and the number of running 
> tasks can become negative.






[jira] [Assigned] (SPARK-22312) Spark job stuck with no executor due to bug in Executor Allocation Manager

2017-10-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22312:


Assignee: (was: Apache Spark)

> Spark job stuck with no executor due to bug in Executor Allocation Manager
> --
>
> Key: SPARK-22312
> URL: https://issues.apache.org/jira/browse/SPARK-22312
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.2.0
>Reporter: Sital Kedia
>
> We often see Spark jobs get stuck because the Executor Allocation Manager 
> (EAM) does not request any executors, even though there are pending tasks, 
> when dynamic allocation is turned on. Looking at the logic in the EAM that 
> tracks running tasks, the calculation can go wrong and the number of running 
> tasks can become negative.






[jira] [Commented] (SPARK-22312) Spark job stuck with no executor due to bug in Executor Allocation Manager

2017-10-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16210602#comment-16210602
 ] 

Apache Spark commented on SPARK-22312:
--

User 'sitalkedia' has created a pull request for this issue:
https://github.com/apache/spark/pull/19534

> Spark job stuck with no executor due to bug in Executor Allocation Manager
> --
>
> Key: SPARK-22312
> URL: https://issues.apache.org/jira/browse/SPARK-22312
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.2.0
>Reporter: Sital Kedia
>
> We often see Spark jobs get stuck because the Executor Allocation Manager 
> (EAM) does not request any executors, even though there are pending tasks, 
> when dynamic allocation is turned on. Looking at the logic in the EAM that 
> tracks running tasks, the calculation can go wrong and the number of running 
> tasks can become negative.






[jira] [Created] (SPARK-22312) Spark job stuck with no executor due to bug in Executor Allocation Manager

2017-10-18 Thread Sital Kedia (JIRA)
Sital Kedia created SPARK-22312:
---

 Summary: Spark job stuck with no executor due to bug in Executor 
Allocation Manager
 Key: SPARK-22312
 URL: https://issues.apache.org/jira/browse/SPARK-22312
 Project: Spark
  Issue Type: Bug
  Components: Scheduler
Affects Versions: 2.2.0
Reporter: Sital Kedia


We often see Spark jobs get stuck because the Executor Allocation Manager (EAM) 
does not request any executors, even though there are pending tasks, when 
dynamic allocation is turned on. Looking at the logic in the EAM that tracks 
running tasks, the calculation can go wrong and the number of running tasks can 
become negative.
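
For illustration, here is a minimal, self-contained sketch (not the actual 
ExecutorAllocationManager code; the event names and state below are made up) of 
how counting a task's completion both when the task ends and again when its 
stage is cleared can drive a running-task counter negative, after which an 
allocation policy that derives demand from that counter never requests executors:

{code}
object EamCounterSketch {
  // Illustrative listener state: tasks we believe are running, per stage.
  private var runningTasks = Map.empty[Int, Int].withDefaultValue(0)

  def onTaskStart(stageId: Int): Unit =
    runningTasks += stageId -> (runningTasks(stageId) + 1)

  def onTaskEnd(stageId: Int): Unit =
    runningTasks += stageId -> (runningTasks(stageId) - 1)

  // Bug sketch: resetting the stage here *and* still processing late
  // task-end events for it counts the same tasks twice.
  def onStageCompleted(stageId: Int): Unit =
    runningTasks += stageId -> 0

  def totalRunning: Int = runningTasks.values.sum

  def main(args: Array[String]): Unit = {
    onTaskStart(stageId = 1)
    onTaskStart(stageId = 1)
    onStageCompleted(stageId = 1)   // stage finishes; counter reset to 0
    onTaskEnd(stageId = 1)          // late task-end events still arrive...
    onTaskEnd(stageId = 1)          // ...and are subtracted a second time
    println(s"running tasks = $totalRunning")  // prints -2
  }
}
{code}

With a negative running-task count, the demand for executors looks smaller than 
it really is, so no new executors are requested even though tasks are pending.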






[jira] [Updated] (SPARK-21551) pyspark's collect fails when getaddrinfo is too slow

2017-10-18 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-21551:
-
Fix Version/s: 2.2.1

> pyspark's collect fails when getaddrinfo is too slow
> 
>
> Key: SPARK-21551
> URL: https://issues.apache.org/jira/browse/SPARK-21551
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: peay
>Assignee: peay
>Priority: Critical
> Fix For: 2.2.1, 2.3.0
>
>
> Pyspark's {{RDD.collect}}, as well as {{DataFrame.toLocalIterator}} and 
> {{DataFrame.collect}} all work by starting an ephemeral server in the driver, 
> and having Python connect to it to download the data.
> All three are implemented along the lines of:
> {code}
> port = self._jdf.collectToPython()
> return list(_load_from_socket(port, BatchedSerializer(PickleSerializer())))
> {code}
> The server has **a hardcoded timeout of 3 seconds** 
> (https://github.com/apache/spark/blob/e26dac5feb02033f980b1e69c9b0ff50869b6f9e/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L695)
>  -- i.e., the Python process has 3 seconds to connect to it from the very 
> moment the driver server starts.
> In general, that seems fine, but I have been encountering frequent timeouts 
> leading to `Exception: could not open socket`.
> After investigating a bit, it turns out that {{_load_from_socket}} makes a 
> call to {{getaddrinfo}}:
> {code}
> def _load_from_socket(port, serializer):
> sock = None
> # Support for both IPv4 and IPv6.
> # On most of IPv6-ready systems, IPv6 will take precedence.
> for res in socket.getaddrinfo("localhost", port, socket.AF_UNSPEC, 
> socket.SOCK_STREAM):
>.. connect ..
> {code}
> I am not sure why, but while most such calls to {{getaddrinfo}} on my machine 
> only take a couple milliseconds, about 10% of them take between 2 and 10 
> seconds, leading to about 10% of jobs failing. I don't think we can always 
> expect {{getaddrinfo}} to return instantly. More generally, Python may 
> sometimes pause for a couple seconds, which may not leave enough time for the 
> process to connect to the server.
> Especially since the server timeout is hardcoded, I think it would be best to 
> set a rather generous value (15 seconds?) to avoid such situations.
> A {{getaddrinfo}}-specific fix could avoid calling it every single time, or do 
> it before starting up the driver server.
>  
> cc SPARK-677 [~davies]
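> A sketch of the kind of server-side change suggested above (illustrative only, 
> not the actual PythonRDD code; the 15-second default mirrors the value proposed 
> in this description):
> {code}
> import java.net.{InetAddress, ServerSocket, Socket}
>
> // An ephemeral local server whose accept timeout is generous and configurable
> // instead of hardcoded to 3 seconds, so a slow getaddrinfo on the Python
> // side does not lead to "could not open socket".
> def serveOnce(handler: Socket => Unit, acceptTimeoutMs: Int = 15000): Int = {
>   val server = new ServerSocket(0, 1, InetAddress.getByName("localhost"))
>   server.setSoTimeout(acceptTimeoutMs)
>   new Thread("serve-once") {
>     setDaemon(true)
>     override def run(): Unit = {
>       try handler(server.accept()) finally server.close()
>     }
>   }.start()
>   server.getLocalPort  // Python connects to this port
> }
> {code}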






[jira] [Assigned] (SPARK-22310) Refactor join estimation to incorporate estimation logic for different kinds of statistics

2017-10-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22310:


Assignee: Apache Spark

> Refactor join estimation to incorporate estimation logic for different kinds 
> of statistics
> --
>
> Key: SPARK-22310
> URL: https://issues.apache.org/jira/browse/SPARK-22310
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Zhenhua Wang
>Assignee: Apache Spark
>
> The current join estimation logic is based only on basic column statistics 
> (such as ndv). If we want to add estimation for other kinds of statistics 
> (such as histograms), it's not easy to incorporate them into the current 
> algorithm:
> 1. When we have multiple pairs of join keys, the current algorithm computes 
> cardinality in a single formula. But if different join keys have different 
> kinds of stats, the computation logic for each pair of join keys becomes 
> different, so the previous formula no longer applies.
> 2. Currently it computes cardinality and updates the join keys' column stats 
> separately. It's better to do these two steps together, since both the 
> computation and the update logic differ for different kinds of stats.






[jira] [Assigned] (SPARK-22310) Refactor join estimation to incorporate estimation logic for different kinds of statistics

2017-10-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22310:


Assignee: (was: Apache Spark)

> Refactor join estimation to incorporate estimation logic for different kinds 
> of statistics
> --
>
> Key: SPARK-22310
> URL: https://issues.apache.org/jira/browse/SPARK-22310
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Zhenhua Wang
>
> The current join estimation logic is based only on basic column statistics 
> (such as ndv). If we want to add estimation for other kinds of statistics 
> (such as histograms), it's not easy to incorporate them into the current 
> algorithm:
> 1. When we have multiple pairs of join keys, the current algorithm computes 
> cardinality in a single formula. But if different join keys have different 
> kinds of stats, the computation logic for each pair of join keys becomes 
> different, so the previous formula no longer applies.
> 2. Currently it computes cardinality and updates the join keys' column stats 
> separately. It's better to do these two steps together, since both the 
> computation and the update logic differ for different kinds of stats.






[jira] [Commented] (SPARK-22310) Refactor join estimation to incorporate estimation logic for different kinds of statistics

2017-10-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16210517#comment-16210517
 ] 

Apache Spark commented on SPARK-22310:
--

User 'wzhfy' has created a pull request for this issue:
https://github.com/apache/spark/pull/19531

> Refactor join estimation to incorporate estimation logic for different kinds 
> of statistics
> --
>
> Key: SPARK-22310
> URL: https://issues.apache.org/jira/browse/SPARK-22310
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Zhenhua Wang
>
> The current join estimation logic is based only on basic column statistics 
> (such as ndv). If we want to add estimation for other kinds of statistics 
> (such as histograms), it's not easy to incorporate them into the current 
> algorithm:
> 1. When we have multiple pairs of join keys, the current algorithm computes 
> cardinality in a single formula. But if different join keys have different 
> kinds of stats, the computation logic for each pair of join keys becomes 
> different, so the previous formula no longer applies.
> 2. Currently it computes cardinality and updates the join keys' column stats 
> separately. It's better to do these two steps together, since both the 
> computation and the update logic differ for different kinds of stats.






[jira] [Created] (SPARK-22311) stage api modify the description format, add version api, modify the duration real-time calculation

2017-10-18 Thread guoxiaolongzte (JIRA)
guoxiaolongzte created SPARK-22311:
--

 Summary: stage api modify the description format, add version api, 
modify the duration real-time calculation
 Key: SPARK-22311
 URL: https://issues.apache.org/jira/browse/SPARK-22311
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.3.0
Reporter: guoxiaolongzte
Priority: Trivial


1. Stage API: modify the description format. The entry reads "A list of all 
stages for a given application" and "?status=[active|complete|pending|failed] 
lists only stages in the given state"; the ?status content should be included 
in the description.

2. Add a version API doc for '/api/v1/version'.

3. Modify the duration to use a real-time calculation.
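
For context, the REST endpoints involved look roughly like this (host, port and 
the application id are placeholders; '/api/v1/version' is the endpoint whose 
documentation this issue adds):

{code}
import scala.io.Source

val base = "http://localhost:4040/api/v1"   // placeholder host/port

// Stage list, optionally filtered by state (the entry whose description format is fixed here).
val stages = Source.fromURL(s"$base/applications/app-20171018000000-0000/stages?status=complete").mkString

// Version endpoint documented by this change.
val version = Source.fromURL(s"$base/version").mkString
{code}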







[jira] [Created] (SPARK-22310) Refactor join estimation to incorporate estimation logic for different kinds of statistics

2017-10-18 Thread Zhenhua Wang (JIRA)
Zhenhua Wang created SPARK-22310:


 Summary: Refactor join estimation to incorporate estimation logic 
for different kinds of statistics
 Key: SPARK-22310
 URL: https://issues.apache.org/jira/browse/SPARK-22310
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.3.0
Reporter: Zhenhua Wang


The current join estimation logic is based only on basic column statistics 
(such as ndv). If we want to add estimation for other kinds of statistics 
(such as histograms), it's not easy to incorporate them into the current algorithm:
1. When we have multiple pairs of join keys, the current algorithm computes 
cardinality in a single formula. But if different join keys have different 
kinds of stats, the computation logic for each pair of join keys becomes 
different, so the previous formula no longer applies.
2. Currently it computes cardinality and updates the join keys' column stats 
separately. It's better to do these two steps together, since both the 
computation and the update logic differ for different kinds of stats.
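
One way to read this proposal (a hedged sketch of the intended structure, not 
the actual Catalyst code; all names below are illustrative) is to estimate each 
join-key pair with whatever statistics it has, returning the selectivity and the 
updated column stats together:

{code}
object JoinEstimationSketch {
  // Illustrative stand-in for per-column statistics.
  case class ColStat(ndv: Long, histogram: Option[Seq[(Double, Long)]] = None)

  // One estimator per kind of statistics; each returns the selectivity of a
  // single equi-join key pair together with the updated column statistic.
  trait KeyPairEstimator {
    def estimate(left: ColStat, right: ColStat): (Double, ColStat)
  }

  object NdvEstimator extends KeyPairEstimator {
    def estimate(left: ColStat, right: ColStat): (Double, ColStat) = {
      val maxNdv = math.max(left.ndv, right.ndv).max(1L)
      (1.0 / maxNdv, ColStat(math.min(left.ndv, right.ndv)))
    }
  }

  object HistogramEstimator extends KeyPairEstimator {
    def estimate(left: ColStat, right: ColStat): (Double, ColStat) =
      NdvEstimator.estimate(left, right)  // a real version would overlap buckets
  }

  // Cardinality and column-stat updates are produced in one pass over the key
  // pairs, so each pair can use a different kind of statistics.
  def estimateJoin(leftRows: BigInt, rightRows: BigInt,
                   keyPairs: Seq[(ColStat, ColStat)]): (BigInt, Seq[ColStat]) = {
    val perPair = keyPairs.map { case (l, r) =>
      val est =
        if (l.histogram.isDefined && r.histogram.isDefined) HistogramEstimator
        else NdvEstimator
      est.estimate(l, r)
    }
    val selectivity = perPair.map(_._1).product
    val rows = BigDecimal(leftRows) * BigDecimal(rightRows) * selectivity
    (rows.toBigInt.max(BigInt(1)), perPair.map(_._2))
  }
}
{code}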






[jira] [Commented] (SPARK-22309) Remove unused param in \`LDAModel.getTopicDistributionMethod\` & destroy \`nodeToFeaturesBc\` in RandomForest

2017-10-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16210513#comment-16210513
 ] 

Apache Spark commented on SPARK-22309:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/19530

> Remove unused param in `LDAModel.getTopicDistributionMethod` & destroy 
> `nodeToFeaturesBc` in RandomForest
> -
>
> Key: SPARK-22309
> URL: https://issues.apache.org/jira/browse/SPARK-22309
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: zhengruifeng
>Priority: Minor
>
> 1. The param {{sc: SparkContext}} in {{LocalLDAModel.getTopicDistributionMethod}} 
> is redundant.
> 2. The broadcast object {{nodeToFeaturesBc}} is never destroyed in 
> {{RandomForest}}.






[jira] [Assigned] (SPARK-22309) Remove unused param in \`LDAModel.getTopicDistributionMethod\` & destroy \`nodeToFeaturesBc\` in RandomForest

2017-10-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22309:


Assignee: (was: Apache Spark)

> Remove unused param in `LDAModel.getTopicDistributionMethod` & destroy 
> `nodeToFeaturesBc` in RandomForest
> -
>
> Key: SPARK-22309
> URL: https://issues.apache.org/jira/browse/SPARK-22309
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: zhengruifeng
>Priority: Minor
>
> 1. The param {{sc: SparkContext}} in {{LocalLDAModel.getTopicDistributionMethod}} 
> is redundant.
> 2. The broadcast object {{nodeToFeaturesBc}} is never destroyed in 
> {{RandomForest}}.






[jira] [Assigned] (SPARK-22309) Remove unused param in \`LDAModel.getTopicDistributionMethod\` & destroy \`nodeToFeaturesBc\` in RandomForest

2017-10-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22309:


Assignee: Apache Spark

> Remove unused param in `LDAModel.getTopicDistributionMethod` & destroy 
> `nodeToFeaturesBc` in RandomForest
> -
>
> Key: SPARK-22309
> URL: https://issues.apache.org/jira/browse/SPARK-22309
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: zhengruifeng
>Assignee: Apache Spark
>Priority: Minor
>
> 1. The param {{sc: SparkContext}} in {{LocalLDAModel.getTopicDistributionMethod}} 
> is redundant.
> 2. The broadcast object {{nodeToFeaturesBc}} is never destroyed in 
> {{RandomForest}}.






[jira] [Created] (SPARK-22309) Remove unused param in \`LDAModel.getTopicDistributionMethod\` & destroy \`nodeToFeaturesBc\` in RandomForest

2017-10-18 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-22309:


 Summary: Remove unused param in 
`LDAModel.getTopicDistributionMethod` & destroy `nodeToFeaturesBc` in 
RandomForest
 Key: SPARK-22309
 URL: https://issues.apache.org/jira/browse/SPARK-22309
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.3.0
Reporter: zhengruifeng
Priority: Minor


1. The param {{sc: SparkContext}} in {{LocalLDAModel.getTopicDistributionMethod}} 
is redundant.
2. The broadcast object {{nodeToFeaturesBc}} is never destroyed in {{RandomForest}}.
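
For context on the second point, this is the general broadcast lifecycle it 
refers to (a generic sketch, not the actual RandomForest code; the data and 
variable names are illustrative):

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").appName("broadcast-lifecycle").getOrCreate()
val sc = spark.sparkContext

// Ship a lookup structure to the executors once instead of with every task.
val nodeToFeatures = Map(0 -> Array(1, 2, 3), 1 -> Array(4, 5))
val nodeToFeaturesBc = sc.broadcast(nodeToFeatures)

val featureCounts = sc.parallelize(Seq(0, 1, 0))
  .map(id => nodeToFeaturesBc.value(id).length)
  .collect()

// Release the broadcast explicitly once it is no longer needed; forgetting this
// keeps its blocks around on the driver and executors.
nodeToFeaturesBc.destroy()

spark.stop()
{code}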






[jira] [Assigned] (SPARK-22308) Support unit tests of Spark code using ScalaTest suites other than FunSuite

2017-10-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22308:


Assignee: Apache Spark

> Support unit tests of Spark code using ScalaTest suites other than 
> FunSuite
> -
>
> Key: SPARK-22308
> URL: https://issues.apache.org/jira/browse/SPARK-22308
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Spark Core, SQL, Tests
>Affects Versions: 2.2.0
>Reporter: Nathan Kronenfeld
>Assignee: Apache Spark
>Priority: Minor
>  Labels: scalatest, test-suite, test_issue
>
> External codebases that contain Spark code can test it with SharedSparkContext 
> no matter how they write their ScalaTest suites - based on FunSuite, FunSpec, 
> FlatSpec, or WordSpec.
> SharedSQLContext, however, only supports FunSuite.






[jira] [Commented] (SPARK-22308) Support unit tests of Spark code using ScalaTest suites other than FunSuite

2017-10-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16210361#comment-16210361
 ] 

Apache Spark commented on SPARK-22308:
--

User 'nkronenfeld' has created a pull request for this issue:
https://github.com/apache/spark/pull/19529

> Support unit tests of Spark code using ScalaTest suites other than 
> FunSuite
> -
>
> Key: SPARK-22308
> URL: https://issues.apache.org/jira/browse/SPARK-22308
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Spark Core, SQL, Tests
>Affects Versions: 2.2.0
>Reporter: Nathan Kronenfeld
>Priority: Minor
>  Labels: scalatest, test-suite, test_issue
>
> External codebases that contain Spark code can test it with SharedSparkContext 
> no matter how they write their ScalaTest suites - based on FunSuite, FunSpec, 
> FlatSpec, or WordSpec.
> SharedSQLContext, however, only supports FunSuite.






[jira] [Assigned] (SPARK-22308) Support unit tests of Spark code using ScalaTest suites other than FunSuite

2017-10-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22308:


Assignee: (was: Apache Spark)

> Support unit tests of Spark code using ScalaTest suites other than 
> FunSuite
> -
>
> Key: SPARK-22308
> URL: https://issues.apache.org/jira/browse/SPARK-22308
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Spark Core, SQL, Tests
>Affects Versions: 2.2.0
>Reporter: Nathan Kronenfeld
>Priority: Minor
>  Labels: scalatest, test-suite, test_issue
>
> External codebases that contain Spark code can test it with SharedSparkContext 
> no matter how they write their ScalaTest suites - based on FunSuite, FunSpec, 
> FlatSpec, or WordSpec.
> SharedSQLContext, however, only supports FunSuite.






[jira] [Created] (SPARK-22308) Support unit tests of Spark code using ScalaTest suites other than FunSuite

2017-10-18 Thread Nathan Kronenfeld (JIRA)
Nathan Kronenfeld created SPARK-22308:
-

 Summary: Support unit tests of Spark code using ScalaTest suites 
other than FunSuite
 Key: SPARK-22308
 URL: https://issues.apache.org/jira/browse/SPARK-22308
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, Spark Core, SQL, Tests
Affects Versions: 2.2.0
Reporter: Nathan Kronenfeld
Priority: Minor


External codebases that contain Spark code can test it with SharedSparkContext 
no matter how they write their ScalaTest suites - based on FunSuite, FunSpec, 
FlatSpec, or WordSpec.

SharedSQLContext, however, only supports FunSuite.
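
A sketch of the kind of generalization being requested (illustrative only, not 
the actual SharedSQLContext or the associated pull request; the trait and class 
names are made up): a shared-session helper that self-types on {{Suite}} works 
with any of the styles above.

{code}
import org.apache.spark.sql.SparkSession
import org.scalatest.{BeforeAndAfterAll, Suite, WordSpec}

// Mixes into any ScalaTest style (FunSuite, FunSpec, FlatSpec, WordSpec, ...)
// because it only requires Suite, not FunSuite.
trait SharedSparkSessionForAnySuite extends BeforeAndAfterAll { self: Suite =>
  @transient protected var spark: SparkSession = _

  override def beforeAll(): Unit = {
    super.beforeAll()
    spark = SparkSession.builder().master("local[2]").appName(suiteName).getOrCreate()
  }

  override def afterAll(): Unit = {
    try if (spark != null) spark.stop()
    finally super.afterAll()
  }
}

// Example usage with a WordSpec-style suite.
class RangeSpec extends WordSpec with SharedSparkSessionForAnySuite {
  "spark.range" should {
    "produce the requested number of rows" in {
      assert(spark.range(10).count() === 10)
    }
  }
}
{code}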






[jira] [Commented] (SPARK-22306) INFER_AND_SAVE overwrites important metadata in Metastore

2017-10-18 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16210127#comment-16210127
 ] 

Dongjoon Hyun commented on SPARK-22306:
---

[~WhoisDavid].
Your table is 'STORED AS PARQUET', right?
Could you try `spark.sql.hive.convertMetastoreParquet.mergeSchema=false`, too?
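
For readers following along, a sketch of how the settings discussed in this 
thread could be applied (configuration keys as referenced above; the values 
shown are the workaround, not a recommendation):

{code}
// In an existing session:
spark.conf.set("spark.sql.hive.caseSensitiveInferenceMode", "NEVER_INFER")
spark.conf.set("spark.sql.hive.convertMetastoreParquet.mergeSchema", "false")

// Or at submit time:
// spark-submit --conf spark.sql.hive.caseSensitiveInferenceMode=NEVER_INFER ...
{code}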


> INFER_AND_SAVE overwrites important metadata in Metastore
> -
>
> Key: SPARK-22306
> URL: https://issues.apache.org/jira/browse/SPARK-22306
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: Hive 2.3.0 (PostgresQL metastore)
> Spark 2.2.0
>Reporter: David Malinge
>
> I noticed some critical changes on my Hive tables and realized that they were 
> caused by a simple SELECT in Spark SQL. Looking at the logs, I found out that 
> this select was actually performing an update on the metastore ("Saving 
> case-sensitive schema for table"). 
> I then found out that Spark 2.2.0 introduces a new default value for 
> spark.sql.hive.caseSensitiveInferenceMode (see SPARK-20888): INFER_AND_SAVE.
> The issue is that this update changes critical metadata of the table, in 
> particular:
> - it changes the owner to the current user
> - it removes bucketing metadata (BUCKETING_COLS, SDS)
> - it removes sorting metadata (SORT_COLS)
> Switching the property to NEVER_INFER prevents the issue.
> Also, note that the damage can be fixed manually in Hive with e.g.:
> {code:sql}
> alter table [table_name] 
> clustered by ([col1], [col2]) 
> sorted by ([colA], [colB])
> into [n] buckets
> {code}






[jira] [Commented] (SPARK-22306) INFER_AND_SAVE overwrites important metadata in Metastore

2017-10-18 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16210105#comment-16210105
 ] 

Dongjoon Hyun commented on SPARK-22306:
---

Hi, [~cloud_fan] and [~smilegator].
Should we change the default value to prevent this regression?

> INFER_AND_SAVE overwrites important metadata in Metastore
> -
>
> Key: SPARK-22306
> URL: https://issues.apache.org/jira/browse/SPARK-22306
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: Hive 2.3.0 (PostgresQL metastore)
> Spark 2.2.0
>Reporter: David Malinge
>
> I noticed some critical changes on my Hive tables and realized that they were 
> caused by a simple SELECT in Spark SQL. Looking at the logs, I found out that 
> this select was actually performing an update on the metastore ("Saving 
> case-sensitive schema for table"). 
> I then found out that Spark 2.2.0 introduces a new default value for 
> spark.sql.hive.caseSensitiveInferenceMode (see SPARK-20888): INFER_AND_SAVE.
> The issue is that this update changes critical metadata of the table, in 
> particular:
> - it changes the owner to the current user
> - it removes bucketing metadata (BUCKETING_COLS, SDS)
> - it removes sorting metadata (SORT_COLS)
> Switching the property to NEVER_INFER prevents the issue.
> Also, note that the damage can be fixed manually in Hive with e.g.:
> {code:sql}
> alter table [table_name] 
> clustered by ([col1], [col2]) 
> sorted by ([colA], [colB])
> into [n] buckets
> {code}






[jira] [Commented] (SPARK-22284) Code of class \"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection\" grows beyond 64 KB

2017-10-18 Thread Ben (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16210089#comment-16210089
 ] 

Ben commented on SPARK-22284:
-

[~hyukjin.kwon], thank you very much for your assistance, but I don't think I 
can reproduce it easily.
Everything has been working fine until now, but it only got stuck in the last 
batch.

If I'm correct, you are also the developer of spark-xml, which is exactly what I 
use for reading the source files.
I read more than a million XML files per batch, and these files have complex 
nested structures, so a lot of data is parsed into a DataFrame.
There are cases that the XML files may have very deep levels of nested tags, so 
I'm not sure if that may be the cause, linking this issue to spark-xml and not 
the actual join step when the error happens.

Furthermore, I tried splitting the batch where it got stuck into smaller 
batches to process. The first half of around 700 thousand files worked, but the 
second half had the same error. Then I split the second half in 100 thousand 
files per batch, and then it worked. I cannot make the XML files available, 
which means I can't give you a way to reproduce it. I can send you the 
generated code that is included in the error, but I'm guessing that also 
doesn't help. However, if you think the deep nested levels may be a cause, 
maybe this can provide some insight.

> Code of class 
> \"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection\"
>  grows beyond 64 KB
> --
>
> Key: SPARK-22284
> URL: https://issues.apache.org/jira/browse/SPARK-22284
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, PySpark, SQL
>Affects Versions: 2.1.0
>Reporter: Ben
>
> I am using pySpark 2.1.0 in a production environment, and trying to join two 
> DataFrames, one of which is very large and has complex nested structures.
> Basically, I load both DataFrames and cache them.
> Then, in the large DataFrame, I extract 3 nested values and save them as 
> direct columns.
> Finally, I join on these three columns with the smaller DataFrame.
> This would be a short code for this:
> {code}
> dataFrame.read..cache()
> dataFrameSmall.read...cache()
> dataFrame = dataFrame.selectExpr(['*','nested.Value1 AS 
> Value1','nested.Value2 AS Value2','nested.Value3 AS Value3'])
> dataFrame = dataFrame.dropDuplicates().join(dataFrameSmall, 
> ['Value1','Value2','Value3'])
> dataFrame.count()
> {code}
> And this is the error I get when it gets to the count():
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 11 in 
> stage 7.0 failed 4 times, most recent failure: Lost task 11.3 in stage 7.0 
> (TID 11234, somehost.com, executor 10): 
> java.util.concurrent.ExecutionException: java.lang.Exception: failed to 
> compile: org.codehaus.janino.JaninoRuntimeException: Code of method 
> \"apply_1$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificUnsafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;)V\"
>  of class 
> \"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection\"
>  grows beyond 64 KB
> {code}
> I have seen many tickets with similar issues here, but no proper solution. 
> Most of the fixes target versions up to Spark 2.1.0, so I don't know whether 
> running it on Spark 2.2.0 would fix it. In any case I cannot change the Spark 
> version since it is in production.
> I have also tried setting 
> {code:java}
> spark.sql.codegen.wholeStage=false
> {code}
>  but still the same error.
> The job worked well up to now, also with large datasets, but apparently this 
> batch got larger, and that is the only thing that changed. Is there any 
> workaround for this?






[jira] [Updated] (SPARK-22307) NOT condition working incorrectly

2017-10-18 Thread Andrey Yakovenko (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Yakovenko updated SPARK-22307:
-
Description: 
Suggested test case: a table with x records filtered by expression expr returns y 
records (y < x), but not(expr) does not return x - y records. Workaround: when(expr, 
false).otherwise(true) works fine.
{code}
val ctg = spark.sqlContext.read.json("/user/Catalog.json")
scala> ctg.printSchema
root
 |-- Id: string (nullable = true)
 |-- Name: string (nullable = true)
 |-- Parent: struct (nullable = true)
 |    |-- Id: string (nullable = true)
 |    |-- Name: string (nullable = true)
 |    |-- Parent: struct (nullable = true)
 |    |    |-- Id: string (nullable = true)
 |    |    |-- Name: string (nullable = true)
 |    |    |-- Parent: struct (nullable = true)
 |    |    |    |-- Id: string (nullable = true)
 |    |    |    |-- Name: string (nullable = true)
 |    |    |    |-- Parent: string (nullable = true)
 |    |    |    |-- SKU: string (nullable = true)
 |    |    |-- SKU: string (nullable = true)
 |    |-- SKU: string (nullable = true)
 |-- SKU: string (nullable = true)

val col1 = expr("((((Id IN ('13MXIIAA4', '13MXIBAA4')) OR (Parent.Id IN 
('13MXIIAA4', '13MXIBAA4'))) OR (Parent.Parent.Id IN ('13MXIIAA4', 
'13MXIBAA4'))) OR (Parent.Parent.Parent.Id IN ('13MXIIAA4', '13MXIBAA4')))")
col1: org.apache.spark.sql.Column = ((((Id IN (13MXIIAA4, 13MXIBAA4)) OR 
(Parent.Id IN (13MXIIAA4, 13MXIBAA4))) OR (Parent.Parent.Id IN (13MXIIAA4, 
13MXIBAA4))) OR (Parent.Parent.Parent.Id IN (13MXIIAA4, 13MXIBAA4)))

scala> ctg.count
res5: Long = 623

scala> ctg.filter(col1).count
res2: Long = 2

scala> ctg.filter(not(col1)).count
res3: Long = 4

scala> ctg.filter(when(col1, false).otherwise(true)).count
res4: Long = 621
{code}
The table has a hierarchy-like structure, with records having different numbers 
of levels filled in. I suspect that, due to the partly filled hierarchy, the 
condition sometimes evaluates to null (neither true nor false).

  was:
Suggest test case: table with x record filtered by expression expr returns y 
records (< x), not(expr) does not returns x-y records. Work around: when(expr, 
false).otherwise(true) is working fine.
{code}
val ctg = spark.sqlContext.read.json("/user/Catalog.json")
scala> ctg.printSchema
root[^attachment-name.zip]
 |-- Id: string (nullable = true)
 |-- Name: string (nullable = true)
 |-- Parent: struct (nullable = true)
 ||-- Id: string (nullable = true)
 ||-- Name: string (nullable = true)
 ||-- Parent: struct (nullable = true)
 |||-- Id: string (nullable = true)
 |||-- Name: string (nullable = true)
 |||-- Parent: struct (nullable = true)
 ||||-- Id: string (nullable = true)
 ||||-- Name: string (nullable = true)
 ||||-- Parent: string (nullable = true)
 ||||-- SKU: string (nullable = true)
 |||-- SKU: string (nullable = true)
 ||-- SKU: string (nullable = true)
 |-- SKU: string (nullable = true)

val col1 = expr("Id IN ('13MXIIAA4', '13MXIBAA4')) OR (Parent.Id IN 
('13MXIIAA4', '13MXIBAA4'))) OR (Parent.Parent.Id IN ('13MXIIAA4', 
'13MXIBAA4'))) OR (Parent.Parent.Parent.Id IN ('13MXIIAA4', '13MXIBAA4')))")
col1: org.apache.spark.sql.Column = Id IN (13MXIIAA4, 13MXIBAA4)) OR 
(Parent.Id IN (13MXIIAA4, 13MXIBAA4))) OR (Parent.Parent.Id IN (13MXIIAA4, 
13MXIBAA4))) OR (Parent.Parent.Parent.Id IN (13MXIIAA4, 13MXIBAA4)))

scala> ctg.count
res5: Long = 623

scala> ctg.filter(col1).count
res2: Long = 2

scala> ctg.filter(not(col1)).count
res3: Long = 4

scala> ctg.filter(when(col1, false).otherwise(true)).count
res4: Long = 621
{code}
Table is hierarchy like structure and has a records with different number of 
levels filled up. I have a suspicion that due to partly filled hierarchy 
condition return null/undefined/failed/nan some times (neither true or false).


> NOT condition working incorrectly
> -
>
> Key: SPARK-22307
> URL: https://issues.apache.org/jira/browse/SPARK-22307
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.1.1
>Reporter: Andrey Yakovenko
> Attachments: Catalog.json.gz
>
>
> Suggested test case: a table with x records filtered by expression expr returns y 
> records (y < x), but not(expr) does not return x - y records. Workaround: 
> when(expr, false).otherwise(true) works fine.
> {code}
> val ctg = spark.sqlContext.read.json("/user/Catalog.json")
> scala> ctg.printSchema
> root
>  |-- Id: string (nullable = true)
>  |-- Name: string (nullable = true)
>  |-- Parent: struct (nullable = true)
>  ||-- Id: string (nullable = true)
>  ||-- Name: string (nullable = true)
>  ||-- Parent: struct (nullable = true)
>  |||-- Id: string (nullable = true)
>  |||-- 

[jira] [Updated] (SPARK-22307) NOT condition working incorrectly

2017-10-18 Thread Andrey Yakovenko (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Yakovenko updated SPARK-22307:
-
Attachment: Catalog.json.gz

> NOT condition working incorrectly
> -
>
> Key: SPARK-22307
> URL: https://issues.apache.org/jira/browse/SPARK-22307
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.1.1
>Reporter: Andrey Yakovenko
> Attachments: Catalog.json.gz
>
>
> Suggested test case: a table with x records filtered by expression expr returns y 
> records (y < x), but not(expr) does not return x - y records. Workaround: 
> when(expr, false).otherwise(true) works fine.
> {code}
> val ctg = spark.sqlContext.read.json("/user/Catalog.json")
> scala> ctg.printSchema
> root
>  |-- Id: string (nullable = true)
>  |-- Name: string (nullable = true)
>  |-- Parent: struct (nullable = true)
>  |    |-- Id: string (nullable = true)
>  |    |-- Name: string (nullable = true)
>  |    |-- Parent: struct (nullable = true)
>  |    |    |-- Id: string (nullable = true)
>  |    |    |-- Name: string (nullable = true)
>  |    |    |-- Parent: struct (nullable = true)
>  |    |    |    |-- Id: string (nullable = true)
>  |    |    |    |-- Name: string (nullable = true)
>  |    |    |    |-- Parent: string (nullable = true)
>  |    |    |    |-- SKU: string (nullable = true)
>  |    |    |-- SKU: string (nullable = true)
>  |    |-- SKU: string (nullable = true)
>  |-- SKU: string (nullable = true)
> val col1 = expr("((((Id IN ('13MXIIAA4', '13MXIBAA4')) OR (Parent.Id IN 
> ('13MXIIAA4', '13MXIBAA4'))) OR (Parent.Parent.Id IN ('13MXIIAA4', 
> '13MXIBAA4'))) OR (Parent.Parent.Parent.Id IN ('13MXIIAA4', '13MXIBAA4')))")
> col1: org.apache.spark.sql.Column = ((((Id IN (13MXIIAA4, 13MXIBAA4)) OR 
> (Parent.Id IN (13MXIIAA4, 13MXIBAA4))) OR (Parent.Parent.Id IN (13MXIIAA4, 
> 13MXIBAA4))) OR (Parent.Parent.Parent.Id IN (13MXIIAA4, 13MXIBAA4)))
> scala> ctg.count
> res5: Long = 623
> scala> ctg.filter(col1).count
> res2: Long = 2
> scala> ctg.filter(not(col1)).count
> res3: Long = 4
> scala> ctg.filter(when(col1, false).otherwise(true)).count
> res4: Long = 621
> {code}
> The table has a hierarchy-like structure, with records having different numbers 
> of levels filled in. I suspect that, due to the partly filled hierarchy, the 
> condition sometimes evaluates to null (neither true nor false).






[jira] [Created] (SPARK-22307) NOT condition working incorrectly

2017-10-18 Thread Andrey Yakovenko (JIRA)
Andrey Yakovenko created SPARK-22307:


 Summary: NOT condition working incorrectly
 Key: SPARK-22307
 URL: https://issues.apache.org/jira/browse/SPARK-22307
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.1.1, 2.1.0
Reporter: Andrey Yakovenko


Suggested test case: a table with x records filtered by expression expr returns y 
records (y < x), but not(expr) does not return x - y records. Workaround: when(expr, 
false).otherwise(true) works fine.
{code}
val ctg = spark.sqlContext.read.json("/user/Catalog.json")
scala> ctg.printSchema
root
 |-- Id: string (nullable = true)
 |-- Name: string (nullable = true)
 |-- Parent: struct (nullable = true)
 |    |-- Id: string (nullable = true)
 |    |-- Name: string (nullable = true)
 |    |-- Parent: struct (nullable = true)
 |    |    |-- Id: string (nullable = true)
 |    |    |-- Name: string (nullable = true)
 |    |    |-- Parent: struct (nullable = true)
 |    |    |    |-- Id: string (nullable = true)
 |    |    |    |-- Name: string (nullable = true)
 |    |    |    |-- Parent: string (nullable = true)
 |    |    |    |-- SKU: string (nullable = true)
 |    |    |-- SKU: string (nullable = true)
 |    |-- SKU: string (nullable = true)
 |-- SKU: string (nullable = true)

val col1 = expr("((((Id IN ('13MXIIAA4', '13MXIBAA4')) OR (Parent.Id IN 
('13MXIIAA4', '13MXIBAA4'))) OR (Parent.Parent.Id IN ('13MXIIAA4', 
'13MXIBAA4'))) OR (Parent.Parent.Parent.Id IN ('13MXIIAA4', '13MXIBAA4')))")
col1: org.apache.spark.sql.Column = ((((Id IN (13MXIIAA4, 13MXIBAA4)) OR 
(Parent.Id IN (13MXIIAA4, 13MXIBAA4))) OR (Parent.Parent.Id IN (13MXIIAA4, 
13MXIBAA4))) OR (Parent.Parent.Parent.Id IN (13MXIIAA4, 13MXIBAA4)))

scala> ctg.count
res5: Long = 623

scala> ctg.filter(col1).count
res2: Long = 2

scala> ctg.filter(not(col1)).count
res3: Long = 4

scala> ctg.filter(when(col1, false).otherwise(true)).count
res4: Long = 621
{code}
The table has a hierarchy-like structure, with records having different numbers 
of levels filled in. I suspect that, due to the partly filled hierarchy, the 
condition sometimes evaluates to null (neither true nor false).
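
One way to check that suspicion (an editor's sketch, not part of the original 
report): rows where the nested Parent levels are missing make col1 evaluate to 
null, NOT null is still null, and filter() drops null rows, so they are counted 
by neither filter(col1) nor filter(not(col1)), while the when/otherwise 
workaround maps null to true.

{code}
import org.apache.spark.sql.functions.{not, when}

// If the three-valued-logic explanation is right, these counts add up to ctg.count:
ctg.filter(col1).count          // rows where col1 is true   (2 above)
ctg.filter(not(col1)).count     // rows where col1 is false  (4 above)
ctg.filter(col1.isNull).count   // rows where col1 is null   (should be 623 - 2 - 4 = 617)

// The workaround counts null together with false:
ctg.filter(when(col1, false).otherwise(true)).count   // 621 above
{code}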






[jira] [Commented] (SPARK-17902) collect() ignores stringsAsFactors

2017-10-18 Thread Hossein Falaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16209958#comment-16209958
 ] 

Hossein Falaki commented on SPARK-17902:


A simple unit test we could add would be:

{code}
> df <- createDataFrame(iris)
> sapply(iris, typeof) == sapply(collect(df, stringsAsFactors = T), typeof)
Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species
        TRUE         TRUE         TRUE         TRUE        FALSE
{code}

As for the solution, I suggest performing the conversion inside [this 
loop|https://github.com/apache/spark/blob/master/R/pkg/R/DataFrame.R#L1168].

> collect() ignores stringsAsFactors
> --
>
> Key: SPARK-17902
> URL: https://issues.apache.org/jira/browse/SPARK-17902
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>
> The `collect()` function signature includes an optional flag named 
> `stringsAsFactors`. It seems to be completely ignored.
> {code}
> str(collect(createDataFrame(iris), stringsAsFactors = TRUE))
> {code}






[jira] [Commented] (SPARK-17902) collect() ignores stringsAsFactors

2017-10-18 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16209927#comment-16209927
 ] 

Shivaram Venkataraman commented on SPARK-17902:
---

I think [~falaki] might have a test case that we could test against?

> collect() ignores stringsAsFactors
> --
>
> Key: SPARK-17902
> URL: https://issues.apache.org/jira/browse/SPARK-17902
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>
> The `collect()` function signature includes an optional flag named 
> `stringsAsFactors`. It seems to be completely ignored.
> {code}
> str(collect(createDataFrame(iris), stringsAsFactors = TRUE))
> {code}






[jira] [Commented] (SPARK-21122) Address starvation issues when dynamic allocation is enabled

2017-10-18 Thread Craig Ingram (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16209918#comment-16209918
 ] 

Craig Ingram commented on SPARK-21122:
--

Finally getting back around to this. Thanks for the feedback [~tgraves]. I 
agree with pretty much everything you pointed out. Newer versions of YARN do 
have a [Cluster Metrics 
API|https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Metrics_API].
 I think this was introduced in 2.8 though. Regardless I believe working with 
YARN's PreemptionMessage is a better path forward than what I was initially 
proposing. I wish there was an elegant way to do this generically, but I 
believe I can at least make an abstraction of the PreemptionMessage that can be 
used by other RM clients. For now, I will focus on a YARN specific solution.

> Address starvation issues when dynamic allocation is enabled
> 
>
> Key: SPARK-21122
> URL: https://issues.apache.org/jira/browse/SPARK-21122
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, YARN
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Craig Ingram
> Attachments: Preventing Starvation with Dynamic Allocation Enabled.pdf
>
>
> When dynamic resource allocation is enabled on a cluster, it’s currently 
> possible for one application to consume all the cluster’s resources, 
> effectively starving any other application trying to start. This is 
> particularly painful in a notebook environment where notebooks may be idle 
> for tens of minutes while the user is figuring out what to do next (or eating 
> their lunch). Ideally the application should give resources back to the 
> cluster when monitoring indicates other applications are pending.
> Before delving into the specifics of the solution. There are some workarounds 
> to this problem that are worth mentioning:
> * Set spark.dynamicAllocation.maxExecutors to a small value, so that users 
> are unlikely to use the entire cluster even when many of them are doing work. 
> This approach will hurt cluster utilization.
> * If using YARN, enable preemption and have each application (or 
> organization) run in a separate queue. The downside of this is that when YARN 
> preempts, it doesn't know anything about which executor it's killing. It 
> would just as likely kill a long running executor with cached data as one 
> that just spun up. Moreover, given a feature like 
> https://issues.apache.org/jira/browse/SPARK-21097 (preserving cached data on 
> executor decommission), YARN may not wait long enough between trying to 
> gracefully and forcefully shut down the executor. This would mean the blocks 
> that belonged to that executor would be lost and have to be recomputed.
> * Configure YARN to use the capacity scheduler with multiple scheduler 
> queues. Put high-priority notebook users into a high-priority queue. Prevents 
> high-priority users from being starved out by low-priority notebook users. 
> Does not prevent users in the same priority class from starving each other.
> Obviously any solution to this problem that depends on YARN would leave other 
> resource managers out in the cold. The solution proposed in this ticket will 
> afford Spark clusters the flexibility to hook in different resource allocation 
> policies to fulfill their users' needs regardless of resource manager choice. 
> Initially the focus will be on users in a notebook environment. When 
> operating in a notebook environment with many users, the goal is fair 
> resource allocation. Given that all users will be using the same memory 
> configuration, this solution will focus primarily on fair sharing of cores.
> The fair resource allocation policy should pick executors to remove based on 
> three factors initially: idleness, presence of cached data, and uptime. The 
> policy will favor removing executors that are idle, short-lived, and have no 
> cached data. The policy will only preemptively remove executors if there are 
> pending applications or cores (otherwise the default dynamic allocation 
> timeout/removal process is followed). The policy will also allow an 
> application's resource consumption to expand based on cluster utilization. 
> For example if there are 3 applications running but 2 of them are idle, the 
> policy will allow a busy application with pending tasks to consume more than 
> one third of the cluster's resources.
> More complexity could be added to take advantage of task/stage metrics, 
> histograms, and heuristics (i.e. favor removing executors running tasks that 
> are quick). The important thing here is to benchmark effectively before 
> adding complexity so we can measure the impact of the changes.
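> A toy sketch of the removal policy described above (purely illustrative; not 
> part of this proposal's code or Spark's API):
> {code}
> // Favor removing executors that are idle, short-lived, and hold no cached data.
> case class ExecutorInfo(id: String, idleMillis: Long, uptimeMillis: Long, cachedBlocks: Int)
>
> def removalCandidates(executors: Seq[ExecutorInfo], slotsNeeded: Int): Seq[String] =
>   executors
>     .filter(e => e.idleMillis > 0 && e.cachedBlocks == 0)   // idle and nothing cached
>     .sortBy(e => (e.uptimeMillis, -e.idleMillis))           // short-lived, then longest-idle
>     .take(slotsNeeded)
>     .map(_.id)
> {code}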





[jira] [Commented] (SPARK-20393) Strengthen Spark to prevent XSS vulnerabilities

2017-10-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16209902#comment-16209902
 ] 

Apache Spark commented on SPARK-20393:
--

User 'ambauma' has created a pull request for this issue:
https://github.com/apache/spark/pull/19528

> Strengthen Spark to prevent XSS vulnerabilities
> ---
>
> Key: SPARK-20393
> URL: https://issues.apache.org/jira/browse/SPARK-20393
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.5.2, 2.0.2, 2.1.0
>Reporter: Nicholas Marion
>Assignee: Nicholas Marion
>  Labels: security
> Fix For: 2.1.2, 2.2.0
>
>
> Using IBM Security AppScan Standard, we discovered several easy to recreate 
> MHTML cross site scripting vulnerabilities in the Apache Spark Web GUI 
> application and these vulnerabilities were found to exist in Spark version 
> 1.5.2 and 2.0.2, the two levels we initially tested. Cross-site scripting 
> attack is not really an attack on the Spark server as much as an attack on 
> the end user, taking advantage of their trust in the Spark server to get them 
> to click on a URL like the ones in the examples below.  So whether the user 
> could or could not change lots of stuff on the Spark server is not the key 
> point.  It is an attack on the user themselves.  If they click the link the 
> script could run in their browser and compromise their device.  Once the 
> browser is compromised it could submit Spark requests but it also might not.
> https://blogs.technet.microsoft.com/srd/2011/01/28/more-information-about-the-mhtml-script-injection-vulnerability/
> {quote}
> Request: GET 
> /app/?appId=Content-Type:%20multipart/related;%20boundary=_AppScan%0d%0a--
> _AppScan%0d%0aContent-Location:foo%0d%0aContent-Transfer-
> Encoding:base64%0d%0a%0d%0aPGh0bWw%2bPHNjcmlwdD5hbGVydCgiWFNTIik8L3NjcmlwdD48L2h0bWw%2b%0d%0a
> HTTP/1.1
> Excerpt from response: No running application with ID 
> Content-Type: multipart/related;
> boundary=_AppScan
> --_AppScan
> Content-Location:foo
> Content-Transfer-Encoding:base64
> PGh0bWw+PHNjcmlwdD5hbGVydCgiWFNTIik8L3NjcmlwdD48L2h0bWw+
> 
> Result: In the above payload the BASE64 data decodes as:
> <html><script>alert("XSS")</script></html>
> Request: GET 
> /history/app-20161012202114-0038/stages/stage?id=1=0=Content-
> Type:%20multipart/related;%20boundary=_AppScan%0d%0a--_AppScan%0d%0aContent-
> Location:foo%0d%0aContent-Transfer-
> Encoding:base64%0d%0a%0d%0aPGh0bWw%2bPHNjcmlwdD5hbGVydCgiWFNTIik8L3NjcmlwdD48L2h0bWw%2b%0d%0a
> k.pageSize=100 HTTP/1.1
> Excerpt from response: Content-Type: multipart/related;
> boundary=_AppScan
> --_AppScan
> Content-Location:foo
> Content-Transfer-Encoding:base64
> PGh0bWw+PHNjcmlwdD5hbGVydCgiWFNTIik8L3NjcmlwdD48L2h0bWw+
> Result: In the above payload the BASE64 data decodes as:
> <html><script>alert("XSS")</script></html>
> Request: GET /log?appId=app-20170113131903-=0=Content-
> Type:%20multipart/related;%20boundary=_AppScan%0d%0a--_AppScan%0d%0aContent-
> Location:foo%0d%0aContent-Transfer-
> Encoding:base64%0d%0a%0d%0aPGh0bWw%2bPHNjcmlwdD5hbGVydCgiWFNTIik8L3NjcmlwdD48L2h0bWw%2b%0d%0a
> eLength=0 HTTP/1.1
> Excerpt from response:  Bytes 0-0 of 0 of 
> /u/nmarion/Spark_2.0.2.0/Spark-DK/work/app-20170113131903-/0/Content-
> Type: multipart/related; boundary=_AppScan
> --_AppScan
> Content-Location:foo
> Content-Transfer-Encoding:base64
> PGh0bWw+PHNjcmlwdD5hbGVydCgiWFNTIik8L3NjcmlwdD48L2h0bWw+
> Result: In the above payload the BASE64 data decodes as:
> <html><script>alert("XSS")</script></html>
> {quote}
> security@apache was notified and recommended a PR.






[jira] [Commented] (SPARK-22296) CodeGenerator - failed to compile when constructor has scala.collection.mutable.Seq vs. scala.collection.Seq

2017-10-18 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16209451#comment-16209451
 ] 

Sean Owen commented on SPARK-22296:
---

Indeed, does it happen in 2.2.0? It could already be fixed there.

> CodeGenerator - failed to compile when constructor has 
> scala.collection.mutable.Seq vs. scala.collection.Seq
> 
>
> Key: SPARK-22296
> URL: https://issues.apache.org/jira/browse/SPARK-22296
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.1.0
>Reporter: Randy Tidd
>
> This is with Scala 2.11.
> We have a case class that has a constructor with 85 args, the last two of 
> which are:
>  var chargesInst : 
> scala.collection.mutable.Seq[ChargeInstitutional] = 
> scala.collection.mutable.Seq.empty[ChargeInstitutional],
>  var chargesProf : 
> scala.collection.mutable.Seq[ChargeProfessional] = 
> scala.collection.mutable.Seq.empty[ChargeProfessional]
> A mutable Seq in the constructor of a case class is probably poor form, but 
> Scala allows it.  When we run this job we get this error:
> build   17-Oct-2017 05:30:502017-10-17 09:30:50 [Executor task launch 
> worker-1] ERROR o.a.s.s.c.e.codegen.CodeGenerator - failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 8217, Column 70: No applicable constructor/method found for actual parameters 
> "java.lang.String, java.lang.String, long, java.lang.String, long, long, 
> long, java.lang.String, long, long, double, scala.Option, scala.Option, 
> java.lang.String, java.lang.String, long, java.lang.String, java.lang.String, 
> java.lang.String, java.lang.String, java.lang.String, java.lang.String, int, 
> java.lang.String, java.lang.String, java.lang.String, java.lang.String, 
> java.lang.String, java.lang.String, java.lang.String, java.lang.String, 
> java.lang.String, java.lang.String, long, long, long, long, long, 
> scala.Option, scala.Option, scala.Option, scala.Option, scala.Option, 
> java.lang.String, java.lang.String, java.lang.String, java.lang.String, 
> java.lang.String, java.lang.String, long, java.lang.String, int, double, 
> double, java.lang.String, java.lang.String, java.lang.String, long, 
> java.lang.String, java.lang.String, java.lang.String, java.lang.String, long, 
> long, long, long, java.lang.String, com.xyz.xyz.xyz.domain.Patient, 
> com.xyz.xyz.xyz.domain.Physician, scala.collection.Seq, scala.collection.Seq, 
> java.lang.String, long, java.lang.String, int, int, boolean, boolean, 
> scala.collection.Seq, boolean, scala.collection.Seq, boolean, 
> scala.collection.Seq, scala.collection.Seq"; candidates are: 
> "com.xyz.xyz.xyz.domain.Account(java.lang.String, java.lang.String, long, 
> java.lang.String, long, long, long, java.lang.String, long, long, double, 
> scala.Option, scala.Option, java.lang.String, java.lang.String, long, 
> java.lang.String, java.lang.String, java.lang.String, java.lang.String, 
> java.lang.String, java.lang.String, int, java.lang.String, java.lang.String, 
> java.lang.String, java.lang.String, java.lang.String, java.lang.String, 
> java.lang.String, java.lang.String, java.lang.String, java.lang.String, long, 
> long, long, long, long, scala.Option, scala.Option, scala.Option, 
> scala.Option, scala.Option, java.lang.String, java.lang.String, 
> java.lang.String, java.lang.String, java.lang.String, java.lang.String, long, 
> java.lang.String, int, double, double, java.lang.String, java.lang.String, 
> java.lang.String, long, java.lang.String, java.lang.String, java.lang.String, 
> java.lang.String, long, long, long, long, java.lang.String, 
> com.xyz.xyz.xyz.domain.Patient, com.xyz.xyz.xyz.domain.Physician, 
> scala.collection.Seq, scala.collection.Seq, java.lang.String, long, 
> java.lang.String, int, int, boolean, boolean, scala.collection.Seq, boolean, 
> scala.collection.Seq, boolean, scala.collection.mutable.Seq, 
> scala.collection.mutable.Seq)"
> The relevant lines are:
> build   17-Oct-2017 05:30:50/* 093 */   private scala.collection.Seq 
> argValue84;
> build   17-Oct-2017 05:30:50/* 094 */   private scala.collection.Seq 
> argValue85;
> and
> build   17-Oct-2017 05:30:54/* 8217 */ final 
> com.xyz.xyz.xyz.domain.Account value1 = false ? null : new 
> com.xyz.xyz.xyz.domain.Account(argValue2, argValue3, argValue4, argValue5, 
> argValue6, argValue7, argValue8, argValue9, argValue10, argValue11, 
> argValue12, argValue13, argValue14, argValue15, argValue16, argValue17, 
> argValue18, argValue19, argValue20, argValue21, argValue22, argValue23, 
> argValue24, argValue25, argValue26, argValue27, argValue28, argValue29, 
> argValue30, argValue31, argValue32, argValue33, argValue34, argValue35, 
> 
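
A minimal sketch of the pattern described above, with hypothetical class and field names rather than the reporter's actual domain model; declaring the collection fields as plain (immutable) scala.collection.Seq is a likely way to avoid the constructor mismatch:

{code:scala}
import scala.collection.mutable

// Hypothetical stand-in for the reporter's domain classes.
case class Charge(amount: Double)

// Problematic shape: mutable.Seq fields in a case class used as a Dataset
// element type; the generated code passes scala.collection.Seq arguments and
// the constructor lookup fails.
case class Account(
  id: String,
  charges: mutable.Seq[Charge] = mutable.Seq.empty[Charge])

// Likely workaround: declare the fields as plain (immutable) Seq.
case class AccountFixed(
  id: String,
  charges: Seq[Charge] = Seq.empty[Charge])
{code}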

[jira] [Updated] (SPARK-22306) INFER_AND_SAVE overwrites important metadata in Metastore

2017-10-18 Thread David Malinge (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Malinge updated SPARK-22306:
--
Description: 
I noticed some critical changes on my hive tables and realized that they were 
caused by a simple select on SparkSQL. Looking at the logs, I found out that 
this select was actually performing an update on the database "Saving 
case-sensitive schema for table". 
I then found out that Spark 2.2.0 introduces a new default value for 
spark.sql.hive.caseSensitiveInferenceMode (see SPARK-20888): INFER_AND_SAVE

The issue is that this update changes critical metadata of the table, in 
particular:
- changes the owner to the current user
- removes bucketing metadata (BUCKETING_COLS, SDS)
- removes sorting metadata (SORT_COLS)

Switching the property to NEVER_INFER prevents the issue.

Also, note that the damage can be fixed manually in Hive with, e.g.:
{code:sql}
alter table [table_name] 
clustered by ([col1], [col2]) 
sorted by ([colA], [colB])
into [n] buckets
{code}


  was:
I noticed some critical changes on my hive tables and realized that they were 
caused by a simple select on SparkSQL. Looking at the logs, I found out that 
this select was actually performing an update on the database "Saving 
case-sensitive schema for table". 
I then found out that Spark 2.2.0 introduces a new default value for 
spark.sql.hive.caseSensitiveInferenceMode (see SPARK-20888): INFER_AND_SAVE

The issue is that this update changes critical metadata of the table, in 
particular:
- changes the owner to the current user
- removes bucketing metadata (BUCKETING_COLS, SDS)
- removes sorting metadata (SORT_COLS)

Note that this can be fixed manually in Hive with, e.g.:
{code:sql}
alter table [table_name] 
clustered by ([col1], [col2]) 
sorted by ([colA], [colB])
into [n] buckets
{code}



> INFER_AND_SAVE overwrites important metadata in Metastore
> -
>
> Key: SPARK-22306
> URL: https://issues.apache.org/jira/browse/SPARK-22306
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: Hive 2.3.0 (PostgresQL metastore)
> Spark 2.2.0
>Reporter: David Malinge
>
> I noticed some critical changes on my hive tables and realized that they were 
> caused by a simple select on SparkSQL. Looking at the logs, I found out that 
> this select was actually performing an update on the database "Saving 
> case-sensitive schema for table". 
> I then found out that Spark 2.2.0 introduces a new default value for 
> spark.sql.hive.caseSensitiveInferenceMode (see SPARK-20888): INFER_AND_SAVE
> The issue is that this update changes critical metadata of the table, in 
> particular:
> - changes the owner to the current user
> - removes bucketing metadata (BUCKETING_COLS, SDS)
> - removes sorting metadata (SORT_COLS)
> Switching the property to NEVER_INFER prevents the issue.
> Also, note that the damage can be fixed manually in Hive with, e.g.:
> {code:sql}
> alter table [table_name] 
> clustered by ([col1], [col2]) 
> sorted by ([colA], [colB])
> into [n] buckets
> {code}
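
As a reference, a minimal sketch (illustrative application name) of pinning the inference mode back to the pre-2.2.0 default NEVER_INFER when building the session, which avoids the schema-saving path described above:

{code:scala}
import org.apache.spark.sql.SparkSession

// Never infer (and therefore never save) a case-sensitive schema back to the
// Hive metastore on read.
val spark = SparkSession.builder()
  .appName("no-schema-inference") // illustrative name
  .enableHiveSupport()
  .config("spark.sql.hive.caseSensitiveInferenceMode", "NEVER_INFER")
  .getOrCreate()
{code}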



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22306) INFER_AND_SAVE overwrites important metadata in Metastore

2017-10-18 Thread David Malinge (JIRA)
David Malinge created SPARK-22306:
-

 Summary: INFER_AND_SAVE overwrites important metadata in Metastore
 Key: SPARK-22306
 URL: https://issues.apache.org/jira/browse/SPARK-22306
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
 Environment: Hive 2.3.0 (PostgresQL metastore)
Spark 2.2.0
Reporter: David Malinge


I noticed some critical changes on my hive tables and realized that they were 
caused by a simple select on SparkSQL. Looking at the logs, I found out that 
this select was actually performing an update on the database "Saving 
case-sensitive schema for table". 
I then found out that Spark 2.2.0 introduces a new default value for 
spark.sql.hive.caseSensitiveInferenceMode (see SPARK-20888): INFER_AND_SAVE

The issue is that this update changes critical metadata of the table, in 
particular:
- changes the owner to the current user
- removes bucketing metadata (BUCKETING_COLS, SDS)
- removes sorting metadata (SORT_COLS)

Note that this can be fixed manually in Hive with, e.g.:
{code:sql}
alter table [table_name] 
clustered by ([col1], [col2]) 
sorted by ([colA], [colB])
into [n] buckets
{code}




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22229) SPIP: RDMA Accelerated Shuffle Engine

2017-10-18 Thread Rob Vesse (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16209396#comment-16209396
 ] 

Rob Vesse commented on SPARK-22229:
---

I'd be interested to know more about what performance testing you carried out.

In our experience of running Spark over various high-performance interconnects, 
you can get a fairly good performance boost by simply setting 
{{spark.shuffle.compress=false}} and relying on TCP/IP performance over the 
interconnect.  It is pretty difficult to write a Spark job that saturates the 
available bandwidth of such an interconnect so disabling compression means you 
don't waste CPU cycles compressing data and instead simply pump it across the 
network ASAP.

I also wonder if you could comment on the choice of the underlying RDMA 
libraries and their licensing?

It looks like {{libdisni}} is ASLv2, which is fine, but some of its dependencies 
appear to be at least in part GPL (e.g. {{librdmacm}}), which would mean they 
could not be depended on by an Apache project even as optional dependencies, due 
to foundation-level policies around dependency licensing. Given the licensing of 
most of the libraries in this space, it may be legally impossible for RDMA 
support to make it into Spark proper, in which case you would likely have to 
stick with the external plug-in approach, as you do currently.
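
For context, a minimal sketch (illustrative application name) of the configuration tweak mentioned above, which disables shuffle compression and leaves the raw bytes to the interconnect:

{code:scala}
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Skip compressing shuffle blocks and let the fast interconnect carry the
// extra bytes instead of burning CPU cycles on compression.
val conf = new SparkConf()
  .setAppName("shuffle-over-fast-interconnect") // illustrative name
  .set("spark.shuffle.compress", "false")

val spark = SparkSession.builder().config(conf).getOrCreate()
{code}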

> SPIP: RDMA Accelerated Shuffle Engine
> -
>
> Key: SPARK-22229
> URL: https://issues.apache.org/jira/browse/SPARK-22229
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Yuval Degani
> Attachments: 
> SPARK-22229_SPIP_RDMA_Accelerated_Shuffle_Engine_Rev_1.0.pdf
>
>
> An RDMA-accelerated shuffle engine can provide enormous performance benefits 
> to shuffle-intensive Spark jobs, as demonstrated in the “SparkRDMA” plugin 
> open-source project ([https://github.com/Mellanox/SparkRDMA]).
> Using RDMA for shuffle improves CPU utilization significantly and reduces I/O 
> processing overhead by bypassing the kernel and networking stack as well as 
> avoiding memory copies entirely. Those valuable CPU cycles are then consumed 
> directly by the actual Spark workloads, and help reducing the job runtime 
> significantly. 
> This performance gain is demonstrated with both industry standard HiBench 
> TeraSort (shows 1.5x speedup in sorting) as well as shuffle intensive 
> customer applications. 
> SparkRDMA will be presented at Spark Summit 2017 in Dublin 
> ([https://spark-summit.org/eu-2017/events/accelerating-shuffle-a-tailor-made-rdma-solution-for-apache-spark/]).
> Please see attached proposal document for more information.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22266) The same aggregate function was evaluated multiple times

2017-10-18 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-22266:
---

Assignee: Maryann Xue

> The same aggregate function was evaluated multiple times
> 
>
> Key: SPARK-22266
> URL: https://issues.apache.org/jira/browse/SPARK-22266
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Maryann Xue
>Assignee: Maryann Xue
>Priority: Minor
> Fix For: 2.3.0
>
>
> We should avoid the same aggregate function being evaluated more than once, 
> and this is what has been stated in the code comment below 
> (patterns.scala:206). However, things didn't work as expected.
> {code}
>   // A single aggregate expression might appear multiple times in 
> resultExpressions.
>   // In order to avoid evaluating an individual aggregate function 
> multiple times, we'll
>   // build a set of the distinct aggregate expressions and build a 
> function which can
>   // be used to re-write expressions so that they reference the single 
> copy of the
>   // aggregate function which actually gets computed.
> {code}
> For example, the physical plan of
> {code}
> SELECT a, max(b+1), max(b+1) + 1 FROM testData2 GROUP BY a
> {code}
> was
> {code}
> HashAggregate(keys=[a#23], functions=[max((b#24 + 1)), max((b#24 + 1))], 
> output=[a#23, max((b + 1))#223, (max((b + 1)) + 1)#224])
> +- HashAggregate(keys=[a#23], functions=[partial_max((b#24 + 1)), 
> partial_max((b#24 + 1))], output=[a#23, max#231, max#232])
>+- SerializeFromObject [assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$TestData2, true]).a AS a#23, 
> assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData2, 
> true]).b AS b#24]
>   +- Scan ExternalRDDScan[obj#22]
> {code}
> , where in each HashAggregate there were two identical aggregate functions 
> "max(b#24 + 1)".



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22266) The same aggregate function was evaluated multiple times

2017-10-18 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-22266.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19488
[https://github.com/apache/spark/pull/19488]

> The same aggregate function was evaluated multiple times
> 
>
> Key: SPARK-22266
> URL: https://issues.apache.org/jira/browse/SPARK-22266
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Maryann Xue
>Priority: Minor
> Fix For: 2.3.0
>
>
> We should avoid the same aggregate function being evaluated more than once, 
> and this is what has been stated in the code comment below 
> (patterns.scala:206). However, things didn't work as expected.
> {code}
>   // A single aggregate expression might appear multiple times in 
> resultExpressions.
>   // In order to avoid evaluating an individual aggregate function 
> multiple times, we'll
>   // build a set of the distinct aggregate expressions and build a 
> function which can
>   // be used to re-write expressions so that they reference the single 
> copy of the
>   // aggregate function which actually gets computed.
> {code}
> For example, the physical plan of
> {code}
> SELECT a, max(b+1), max(b+1) + 1 FROM testData2 GROUP BY a
> {code}
> was
> {code}
> HashAggregate(keys=[a#23], functions=[max((b#24 + 1)), max((b#24 + 1))], 
> output=[a#23, max((b + 1))#223, (max((b + 1)) + 1)#224])
> +- HashAggregate(keys=[a#23], functions=[partial_max((b#24 + 1)), 
> partial_max((b#24 + 1))], output=[a#23, max#231, max#232])
>+- SerializeFromObject [assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$TestData2, true]).a AS a#23, 
> assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData2, 
> true]).b AS b#24]
>   +- Scan ExternalRDDScan[obj#22]
> {code}
> , where in each HashAggregate there were two identical aggregate functions 
> "max(b#24 + 1)".



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22296) CodeGenerator - failed to compile when constructor has scala.collection.mutable.Seq vs. scala.collection.Seq

2017-10-18 Thread Randy Tidd (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16209289#comment-16209289
 ] 

Randy Tidd commented on SPARK-22296:


Potentially similar to SPARK-16792

> CodeGenerator - failed to compile when constructor has 
> scala.collection.mutable.Seq vs. scala.collection.Seq
> 
>
> Key: SPARK-22296
> URL: https://issues.apache.org/jira/browse/SPARK-22296
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.1.0
>Reporter: Randy Tidd
>
> This is with Scala 2.11.
> We have a case class that has a constructor with 85 args, the last two of 
> which are:
>  var chargesInst : 
> scala.collection.mutable.Seq[ChargeInstitutional] = 
> scala.collection.mutable.Seq.empty[ChargeInstitutional],
>  var chargesProf : 
> scala.collection.mutable.Seq[ChargeProfessional] = 
> scala.collection.mutable.Seq.empty[ChargeProfessional]
> A mutable Seq in the constructor of a case class is probably poor form but 
> Scala allows it.  When we run this job we get this error:
> build   17-Oct-2017 05:30:502017-10-17 09:30:50 [Executor task launch 
> worker-1] ERROR o.a.s.s.c.e.codegen.CodeGenerator - failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 8217, Column 70: No applicable constructor/method found for actual parameters 
> "java.lang.String, java.lang.String, long, java.lang.String, long, long, 
> long, java.lang.String, long, long, double, scala.Option, scala.Option, 
> java.lang.String, java.lang.String, long, java.lang.String, java.lang.String, 
> java.lang.String, java.lang.String, java.lang.String, java.lang.String, int, 
> java.lang.String, java.lang.String, java.lang.String, java.lang.String, 
> java.lang.String, java.lang.String, java.lang.String, java.lang.String, 
> java.lang.String, java.lang.String, long, long, long, long, long, 
> scala.Option, scala.Option, scala.Option, scala.Option, scala.Option, 
> java.lang.String, java.lang.String, java.lang.String, java.lang.String, 
> java.lang.String, java.lang.String, long, java.lang.String, int, double, 
> double, java.lang.String, java.lang.String, java.lang.String, long, 
> java.lang.String, java.lang.String, java.lang.String, java.lang.String, long, 
> long, long, long, java.lang.String, com.xyz.xyz.xyz.domain.Patient, 
> com.xyz.xyz.xyz.domain.Physician, scala.collection.Seq, scala.collection.Seq, 
> java.lang.String, long, java.lang.String, int, int, boolean, boolean, 
> scala.collection.Seq, boolean, scala.collection.Seq, boolean, 
> scala.collection.Seq, scala.collection.Seq"; candidates are: 
> "com.xyz.xyz.xyz.domain.Account(java.lang.String, java.lang.String, long, 
> java.lang.String, long, long, long, java.lang.String, long, long, double, 
> scala.Option, scala.Option, java.lang.String, java.lang.String, long, 
> java.lang.String, java.lang.String, java.lang.String, java.lang.String, 
> java.lang.String, java.lang.String, int, java.lang.String, java.lang.String, 
> java.lang.String, java.lang.String, java.lang.String, java.lang.String, 
> java.lang.String, java.lang.String, java.lang.String, java.lang.String, long, 
> long, long, long, long, scala.Option, scala.Option, scala.Option, 
> scala.Option, scala.Option, java.lang.String, java.lang.String, 
> java.lang.String, java.lang.String, java.lang.String, java.lang.String, long, 
> java.lang.String, int, double, double, java.lang.String, java.lang.String, 
> java.lang.String, long, java.lang.String, java.lang.String, java.lang.String, 
> java.lang.String, long, long, long, long, java.lang.String, 
> com.xyz.xyz.xyz.domain.Patient, com.xyz.xyz.xyz.domain.Physician, 
> scala.collection.Seq, scala.collection.Seq, java.lang.String, long, 
> java.lang.String, int, int, boolean, boolean, scala.collection.Seq, boolean, 
> scala.collection.Seq, boolean, scala.collection.mutable.Seq, 
> scala.collection.mutable.Seq)"
> The relevant lines are:
> build   17-Oct-2017 05:30:50/* 093 */   private scala.collection.Seq 
> argValue84;
> build   17-Oct-2017 05:30:50/* 094 */   private scala.collection.Seq 
> argValue85;
> and
> build   17-Oct-2017 05:30:54/* 8217 */ final 
> com.xyz.xyz.xyz.domain.Account value1 = false ? null : new 
> com.xyz.xyz.xyz.domain.Account(argValue2, argValue3, argValue4, argValue5, 
> argValue6, argValue7, argValue8, argValue9, argValue10, argValue11, 
> argValue12, argValue13, argValue14, argValue15, argValue16, argValue17, 
> argValue18, argValue19, argValue20, argValue21, argValue22, argValue23, 
> argValue24, argValue25, argValue26, argValue27, argValue28, argValue29, 
> argValue30, argValue31, argValue32, argValue33, argValue34, argValue35, 
> 

[jira] [Commented] (SPARK-13030) Change OneHotEncoder to Estimator

2017-10-18 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16209203#comment-16209203
 ] 

Liang-Chi Hsieh commented on SPARK-13030:
-

[~josephkb] I think that, since we are adding a new class, it is better to add 
multi-column support from the beginning.

> Change OneHotEncoder to Estimator
> -
>
> Key: SPARK-13030
> URL: https://issues.apache.org/jira/browse/SPARK-13030
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.6.0
>Reporter: Wojciech Jurczyk
>
> OneHotEncoder should be an Estimator, just like in scikit-learn 
> (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html).
> In its current form, it is impossible to use when the number of categories 
> differs between the training dataset and the test dataset.
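
To make the limitation concrete, a small sketch with illustrative column names and assumed trainIndexed/testIndexed DataFrames: as a pure Transformer, the encoder derives the vector width from whatever dataset it sees, so train and test can disagree.

{code:scala}
import org.apache.spark.ml.feature.OneHotEncoder

// trainIndexed and testIndexed are assumed DataFrames that already contain a
// numeric "categoryIndex" column (e.g. produced by StringIndexer).
val encoder = new OneHotEncoder()
  .setInputCol("categoryIndex")
  .setOutputCol("categoryVec")

val trainEncoded = encoder.transform(trainIndexed) // vector width inferred from train
val testEncoded  = encoder.transform(testIndexed)  // width inferred from test: may differ
{code}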



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22295) Chi Square selector not recognizing field in Data frame

2017-10-18 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16209166#comment-16209166
 ] 

Peng Meng commented on SPARK-22295:
---

Why is there a space in the column names: ' plas', ' pres', ' skin', ' test', 
' mass', ' pedi', ' age'?

> Chi Square selector not recognizing field in Data frame
> ---
>
> Key: SPARK-22295
> URL: https://issues.apache.org/jira/browse/SPARK-22295
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.1.1
>Reporter: Cheburakshu
>
> The ChiSquare selector does not recognize the field 'class', which is present in 
> the data frame, when fitting the model. I am using the PIMA Indians diabetes 
> dataset from UCI. The complete code and output are below for reference. However, 
> when some rows of the input file are used to create a dataframe manually, it 
> works. I couldn't understand the pattern here.
> Kindly help.
> {code:python}
> from pyspark.ml.feature import VectorAssembler, ChiSqSelector
> import sys
> file_name='data/pima-indians-diabetes.data'
> df=spark.read.format("csv").option("inferSchema","true").option("header","true").load(file_name).cache()
> df.show(1)
> assembler = VectorAssembler(inputCols=['preg', ' plas', ' pres', ' skin', ' 
> test', ' mass', ' pedi', ' age'],outputCol="features")
> df=assembler.transform(df)
> df.show(1)
> try:
> css=ChiSqSelector(numTopFeatures=5, featuresCol="features",
>   outputCol="selected", labelCol='class').fit(df)
> except:
> print(sys.exc_info())
> {code}
> Output:
> ++-+-+-+-+-+-++--+
> |preg| plas| pres| skin| test| mass| pedi| age| class|
> ++-+-+-+-+-+-++--+
> |   6|  148|   72|   35|0| 33.6|0.627|  50| 1|
> ++-+-+-+-+-+-++--+
> only showing top 1 row
> ++-+-+-+-+-+-++--++
> |preg| plas| pres| skin| test| mass| pedi| age| class|features|
> ++-+-+-+-+-+-++--++
> |   6|  148|   72|   35|0| 33.6|0.627|  50| 1|[6.0,148.0,72.0,3...|
> ++-+-+-+-+-+-++--++
> only showing top 1 row
> (, 
> IllegalArgumentException('Field "class" does not exist.', 
> 'org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:264)\n\t
>  at 
> org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:264)\n\t
>  at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)\n\t at 
> scala.collection.AbstractMap.getOrElse(Map.scala:59)\n\t at 
> org.apache.spark.sql.types.StructType.apply(StructType.scala:263)\n\t at 
> org.apache.spark.ml.util.SchemaUtils$.checkNumericType(SchemaUtils.scala:71)\n\t
>  at 
> org.apache.spark.ml.feature.ChiSqSelector.transformSchema(ChiSqSelector.scala:183)\n\t
>  at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)\n\t 
> at org.apache.spark.ml.feature.ChiSqSelector.fit(ChiSqSelector.scala:159)\n\t 
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\n\t at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\n\t
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\t
>  at java.lang.reflect.Method.invoke(Method.java:498)\n\t at 
> py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\n\t at 
> py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\n\t at 
> py4j.Gateway.invoke(Gateway.java:280)\n\t at 
> py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\n\t at 
> py4j.commands.CallCommand.execute(CallCommand.java:79)\n\t at 
> py4j.GatewayConnection.run(GatewayConnection.java:214)\n\t at 
> java.lang.Thread.run(Thread.java:745)'), )
> *The code below works fine:*
> {code:python}
> from pyspark.ml.feature import VectorAssembler, ChiSqSelector
> import sys
> file_name='data/pima-indians-diabetes.data'
> #df=spark.read.format("csv").option("inferSchema","true").option("header","true").load(file_name).cache()
> # Just pasted a few rows from the input file and created a data frame. This 
> will work, but not the frame picked up from the file
> df = spark.createDataFrame([
> [6,148,72,35,0,33.6,0.627,50,1],
> [1,85,66,29,0,26.6,0.351,31,0],
> [8,183,64,0,0,23.3,0.672,32,1],
> ], ['preg', ' plas', ' pres', ' skin', ' test', ' mass', ' pedi', ' age', 
> "class"])
> df.show(1)
> assembler = VectorAssembler(inputCols=['preg', ' plas', ' pres', ' skin', ' 
> test', ' mass', ' pedi', ' age'],outputCol="features")
> df=assembler.transform(df)
> df.show(1)
> try:
> css=ChiSqSelector(numTopFeatures=5, featuresCol="features",
>   outputCol="selected", 

[jira] [Commented] (SPARK-22295) Chi Square selector not recognizing field in Data frame

2017-10-18 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16209164#comment-16209164
 ] 

Peng Meng commented on SPARK-22295:
---

Please use  labelCol=" class", not  labelCol="class".
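
A small sketch (Scala; the path and session are the same assumptions as in the reporter's snippet) of normalizing the column names right after the CSV read, so that plain names like "class" resolve:

{code:scala}
// Trim whitespace from every column name after reading the CSV, so downstream
// stages can use labelCol = "class" instead of the literal " class".
val df = spark.read
  .option("inferSchema", "true")
  .option("header", "true")
  .csv("data/pima-indians-diabetes.data")

val trimmed = df.columns.foldLeft(df)((d, c) => d.withColumnRenamed(c, c.trim))
trimmed.printSchema() // columns now read "plas", "pres", ..., "class"
{code}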

> Chi Square selector not recognizing field in Data frame
> ---
>
> Key: SPARK-22295
> URL: https://issues.apache.org/jira/browse/SPARK-22295
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.1.1
>Reporter: Cheburakshu
>
> The ChiSquare selector does not recognize the field 'class', which is present in 
> the data frame, when fitting the model. I am using the PIMA Indians diabetes 
> dataset from UCI. The complete code and output are below for reference. However, 
> when some rows of the input file are used to create a dataframe manually, it 
> works. I couldn't understand the pattern here.
> Kindly help.
> {code:python}
> from pyspark.ml.feature import VectorAssembler, ChiSqSelector
> import sys
> file_name='data/pima-indians-diabetes.data'
> df=spark.read.format("csv").option("inferSchema","true").option("header","true").load(file_name).cache()
> df.show(1)
> assembler = VectorAssembler(inputCols=['preg', ' plas', ' pres', ' skin', ' 
> test', ' mass', ' pedi', ' age'],outputCol="features")
> df=assembler.transform(df)
> df.show(1)
> try:
> css=ChiSqSelector(numTopFeatures=5, featuresCol="features",
>   outputCol="selected", labelCol='class').fit(df)
> except:
> print(sys.exc_info())
> {code}
> Output:
> ++-+-+-+-+-+-++--+
> |preg| plas| pres| skin| test| mass| pedi| age| class|
> ++-+-+-+-+-+-++--+
> |   6|  148|   72|   35|0| 33.6|0.627|  50| 1|
> ++-+-+-+-+-+-++--+
> only showing top 1 row
> ++-+-+-+-+-+-++--++
> |preg| plas| pres| skin| test| mass| pedi| age| class|features|
> ++-+-+-+-+-+-++--++
> |   6|  148|   72|   35|0| 33.6|0.627|  50| 1|[6.0,148.0,72.0,3...|
> ++-+-+-+-+-+-++--++
> only showing top 1 row
> (, 
> IllegalArgumentException('Field "class" does not exist.', 
> 'org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:264)\n\t
>  at 
> org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:264)\n\t
>  at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)\n\t at 
> scala.collection.AbstractMap.getOrElse(Map.scala:59)\n\t at 
> org.apache.spark.sql.types.StructType.apply(StructType.scala:263)\n\t at 
> org.apache.spark.ml.util.SchemaUtils$.checkNumericType(SchemaUtils.scala:71)\n\t
>  at 
> org.apache.spark.ml.feature.ChiSqSelector.transformSchema(ChiSqSelector.scala:183)\n\t
>  at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)\n\t 
> at org.apache.spark.ml.feature.ChiSqSelector.fit(ChiSqSelector.scala:159)\n\t 
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\n\t at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\n\t
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\t
>  at java.lang.reflect.Method.invoke(Method.java:498)\n\t at 
> py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\n\t at 
> py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\n\t at 
> py4j.Gateway.invoke(Gateway.java:280)\n\t at 
> py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\n\t at 
> py4j.commands.CallCommand.execute(CallCommand.java:79)\n\t at 
> py4j.GatewayConnection.run(GatewayConnection.java:214)\n\t at 
> java.lang.Thread.run(Thread.java:745)'), )
> *The code below works fine:*
> {code:python}
> from pyspark.ml.feature import VectorAssembler, ChiSqSelector
> import sys
> file_name='data/pima-indians-diabetes.data'
> #df=spark.read.format("csv").option("inferSchema","true").option("header","true").load(file_name).cache()
> # Just pasted a few rows from the input file and created a data frame. This 
> will work, but not the frame picked up from the file
> df = spark.createDataFrame([
> [6,148,72,35,0,33.6,0.627,50,1],
> [1,85,66,29,0,26.6,0.351,31,0],
> [8,183,64,0,0,23.3,0.672,32,1],
> ], ['preg', ' plas', ' pres', ' skin', ' test', ' mass', ' pedi', ' age', 
> "class"])
> df.show(1)
> assembler = VectorAssembler(inputCols=['preg', ' plas', ' pres', ' skin', ' 
> test', ' mass', ' pedi', ' age'],outputCol="features")
> df=assembler.transform(df)
> df.show(1)
> try:
> css=ChiSqSelector(numTopFeatures=5, featuresCol="features",
>   outputCol="selected", labelCol="class").fit(df)
> except:
> print(sys.exc_info())
> 

[jira] [Comment Edited] (SPARK-22304) HDFSBackedStateStoreProvider stackoverflow due to reloading of very old state

2017-10-18 Thread Yuval Itzchakov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16209128#comment-16209128
 ] 

Yuval Itzchakov edited comment on SPARK-22304 at 10/18/17 10:41 AM:


Accidentally opened twice (SPARK-22305)


was (Author: yuval.itzchakov):
Accidently opened twice (SPARK-22304)

> HDFSBackedStateStoreProvider stackoverflow due to reloading of very old state
> -
>
> Key: SPARK-22304
> URL: https://issues.apache.org/jira/browse/SPARK-22304
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Yuval Itzchakov
>
> Setup:
> Spark 2.2.0
> Java version: "1.8.0_112"
> spark.sql.streaming.minBatchesToRetain: 100
> During a failure of an Executor due to an OOM exception, we're seeing the 
> HDFS backed state store attempt to reload old state in memory. This fails due 
> to a StackOverflow exception as the `HDFSBackedStateStoreProvider.loadMap` 
> method is recursive. 
> The gist of the exception is:
> {code:java}
> java.io.IOException: com.google.protobuf.ServiceException: 
> java.lang.StackOverflowError
>   at 
> org.apache.hadoop.ipc.ProtobufHelper.getRemoteException(ProtobufHelper.java:47)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:260)
>   at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>   at com.sun.proxy.$Proxy18.getBlockLocations(Unknown Source)
>   at 
> org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1240)
>   at 
> org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1227)
>   at 
> org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1215)
>   at 
> org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:303)
>   at 
> org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:269)
>   at org.apache.hadoop.hdfs.DFSInputStream.(DFSInputStream.java:261)
>   at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1540)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:304)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:299)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:312)
>   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:767)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$readSnapshotFile(HDFSBackedStateStoreProvider.scala:405)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:296)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:295)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap(HDFSBackedStateStoreProvider.scala:295)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$4.apply(HDFSBackedStateStoreProvider.scala:297)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$4.apply(HDFSBackedStateStoreProvider.scala:296)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:296)
>   at 
> 

[jira] [Resolved] (SPARK-22304) HDFSBackedStateStoreProvider stackoverflow due to reloading of very old state

2017-10-18 Thread Yuval Itzchakov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuval Itzchakov resolved SPARK-22304.
-
Resolution: Duplicate

Accidentally opened twice (SPARK-22304)

> HDFSBackedStateStoreProvider stackoverflow due to reloading of very old state
> -
>
> Key: SPARK-22304
> URL: https://issues.apache.org/jira/browse/SPARK-22304
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Yuval Itzchakov
>
> Setup:
> Spark 2.2.0
> Java version: "1.8.0_112"
> spark.sql.streaming.minBatchesToRetain: 100
> During a failure of an Executor due to an OOM exception, we're seeing the 
> HDFS backed state store attempt to reload old state in memory. This fails due 
> to a StackOverflow exception as the `HDFSBackedStateStoreProvider.loadMap` 
> method is recursive. 
> The gist of the exception is:
> {code:java}
> java.io.IOException: com.google.protobuf.ServiceException: 
> java.lang.StackOverflowError
>   at 
> org.apache.hadoop.ipc.ProtobufHelper.getRemoteException(ProtobufHelper.java:47)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:260)
>   at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>   at com.sun.proxy.$Proxy18.getBlockLocations(Unknown Source)
>   at 
> org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1240)
>   at 
> org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1227)
>   at 
> org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1215)
>   at 
> org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:303)
>   at 
> org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:269)
>   at org.apache.hadoop.hdfs.DFSInputStream.(DFSInputStream.java:261)
>   at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1540)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:304)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:299)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:312)
>   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:767)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$readSnapshotFile(HDFSBackedStateStoreProvider.scala:405)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:296)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:295)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap(HDFSBackedStateStoreProvider.scala:295)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$4.apply(HDFSBackedStateStoreProvider.scala:297)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$4.apply(HDFSBackedStateStoreProvider.scala:296)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:296)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:295)
>   at 

[jira] [Updated] (SPARK-22305) HDFSBackedStateStoreProvider fails with StackOverflowException when attempting to recover state

2017-10-18 Thread Yuval Itzchakov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuval Itzchakov updated SPARK-22305:

Description: 
Environment:

Spark: 2.2.0
Java version: 1.8.0_112
spark.sql.streaming.minBatchesToRetain: 100

After an application failure due to OOM exceptions, restarting the application 
with the existing state fails with the following StackOverflowError:

{code:java}
java.io.IOException: com.google.protobuf.ServiceException: 
java.lang.StackOverflowError
at 
org.apache.hadoop.ipc.ProtobufHelper.getRemoteException(ProtobufHelper.java:47)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:260)
at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy18.getBlockLocations(Unknown Source)
at 
org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1240)
at 
org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1227)
at 
org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1215)
at 
org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:303)
at 
org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:269)
at org.apache.hadoop.hdfs.DFSInputStream.(DFSInputStream.java:261)
at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1540)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:304)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:299)
at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:312)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:767)
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$readSnapshotFile(HDFSBackedStateStoreProvider.scala:405)
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:296)
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:295)
at scala.Option.getOrElse(Option.scala:121)
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap(HDFSBackedStateStoreProvider.scala:295)
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$4.apply(HDFSBackedStateStoreProvider.scala:297)
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$4.apply(HDFSBackedStateStoreProvider.scala:296)
at scala.Option.getOrElse(Option.scala:121)
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:296)
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:295)
at scala.Option.getOrElse(Option.scala:121)
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap(HDFSBackedStateStoreProvider.scala:295)
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$4.apply(HDFSBackedStateStoreProvider.scala:297)
at 

[jira] [Updated] (SPARK-22305) HDFSBackedStateStoreProvider fails with StackOverflowException when attempting to recover state

2017-10-18 Thread Yuval Itzchakov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuval Itzchakov updated SPARK-22305:

Description: 
Environment:

Spark: 2.2.0
Java version: 1.8.0_112
spark.sql.streaming.stateStore.maintenanceInterval: 100

After an application failure due to OOM exceptions, restarting the application 
with the existing state fails with the following StackOverflowError:

{code:java}
java.io.IOException: com.google.protobuf.ServiceException: 
java.lang.StackOverflowError
at 
org.apache.hadoop.ipc.ProtobufHelper.getRemoteException(ProtobufHelper.java:47)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:260)
at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy18.getBlockLocations(Unknown Source)
at 
org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1240)
at 
org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1227)
at 
org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1215)
at 
org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:303)
at 
org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:269)
at org.apache.hadoop.hdfs.DFSInputStream.(DFSInputStream.java:261)
at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1540)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:304)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:299)
at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:312)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:767)
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$readSnapshotFile(HDFSBackedStateStoreProvider.scala:405)
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:296)
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:295)
at scala.Option.getOrElse(Option.scala:121)
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap(HDFSBackedStateStoreProvider.scala:295)
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$4.apply(HDFSBackedStateStoreProvider.scala:297)
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$4.apply(HDFSBackedStateStoreProvider.scala:296)
at scala.Option.getOrElse(Option.scala:121)
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:296)
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:295)
at scala.Option.getOrElse(Option.scala:121)
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap(HDFSBackedStateStoreProvider.scala:295)
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$4.apply(HDFSBackedStateStoreProvider.scala:297)
at 

[jira] [Created] (SPARK-22305) HDFSBackedStateStoreProvider fails with StackOverflowException when attempting to recover state

2017-10-18 Thread Yuval Itzchakov (JIRA)
Yuval Itzchakov created SPARK-22305:
---

 Summary: HDFSBackedStateStoreProvider fails with 
StackOverflowException when attempting to recover state
 Key: SPARK-22305
 URL: https://issues.apache.org/jira/browse/SPARK-22305
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.2.0
Reporter: Yuval Itzchakov


Environment:

Spark: 2.2.0
Java version: 1.8.0_112
spark.sql.streaming.stateStore.maintenanceInterval: 100

After an application failure due to OOM exceptions, restarting the application 
with the existing state fails with the following StackOverflowError:

{code:java}
java.io.IOException: com.google.protobuf.ServiceException: 
java.lang.StackOverflowError
at 
org.apache.hadoop.ipc.ProtobufHelper.getRemoteException(ProtobufHelper.java:47)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:260)
at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy18.getBlockLocations(Unknown Source)
at 
org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1240)
at 
org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1227)
at 
org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1215)
at 
org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:303)
at 
org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:269)
at org.apache.hadoop.hdfs.DFSInputStream.(DFSInputStream.java:261)
at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1540)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:304)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:299)
at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:312)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:767)
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$readSnapshotFile(HDFSBackedStateStoreProvider.scala:405)
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:296)
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:295)
at scala.Option.getOrElse(Option.scala:121)
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap(HDFSBackedStateStoreProvider.scala:295)
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$4.apply(HDFSBackedStateStoreProvider.scala:297)
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$4.apply(HDFSBackedStateStoreProvider.scala:296)
at scala.Option.getOrElse(Option.scala:121)
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:296)
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:295)
at scala.Option.getOrElse(Option.scala:121)
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap(HDFSBackedStateStoreProvider.scala:295)
at 

[jira] [Created] (SPARK-22304) HDFSBackedStateStoreProvider stackoverflow due to reloading of very old state

2017-10-18 Thread Yuval Itzchakov (JIRA)
Yuval Itzchakov created SPARK-22304:
---

 Summary: HDFSBackedStateStoreProvider stackoverflow due to 
reloading of very old state
 Key: SPARK-22304
 URL: https://issues.apache.org/jira/browse/SPARK-22304
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.2.0
Reporter: Yuval Itzchakov


Setup:

Spark 2.2.0
Java version: "1.8.0_112"
spark.sql.streaming.minBatchesToRetain: 100

During a failure of an Executor due to an OOM exception, we're seeing the HDFS 
backed state store attempt to reload old state in memory. This fails due to a 
StackOverflow exception as the `HDFSBackedStateStoreProvider.loadMap` method is 
recursive. 

The gist of the exception is:

{code:java}
java.io.IOException: com.google.protobuf.ServiceException: 
java.lang.StackOverflowError
at 
org.apache.hadoop.ipc.ProtobufHelper.getRemoteException(ProtobufHelper.java:47)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:260)
at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy18.getBlockLocations(Unknown Source)
at 
org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1240)
at 
org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1227)
at 
org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1215)
at 
org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:303)
at 
org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:269)
at org.apache.hadoop.hdfs.DFSInputStream.(DFSInputStream.java:261)
at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1540)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:304)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:299)
at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:312)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:767)
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$readSnapshotFile(HDFSBackedStateStoreProvider.scala:405)
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:296)
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:295)
at scala.Option.getOrElse(Option.scala:121)
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap(HDFSBackedStateStoreProvider.scala:295)
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$4.apply(HDFSBackedStateStoreProvider.scala:297)
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$4.apply(HDFSBackedStateStoreProvider.scala:296)
at scala.Option.getOrElse(Option.scala:121)
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:296)
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:295)
at scala.Option.getOrElse(Option.scala:121)
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap(HDFSBackedStateStoreProvider.scala:295)
at 

[jira] [Commented] (SPARK-13030) Change OneHotEncoder to Estimator

2017-10-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16209018#comment-16209018
 ] 

Apache Spark commented on SPARK-13030:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/19527

> Change OneHotEncoder to Estimator
> -
>
> Key: SPARK-13030
> URL: https://issues.apache.org/jira/browse/SPARK-13030
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.6.0
>Reporter: Wojciech Jurczyk
>
> OneHotEncoder should be an Estimator, just like in scikit-learn 
> (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html).
> In its current form, it is impossible to use when the number of categories 
> differs between the training dataset and the test dataset.
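
For illustration only, a minimal plain-Scala sketch of the Estimator/Model 
contract being requested. The class names are hypothetical and this is not the 
Spark ML API: fit() learns the number of categories from the training data, so 
transform() produces vectors of the same width on any later dataset:

{code}
// fit() records the category count seen in the training data; the returned
// model reuses that width when encoding test data, even if the test data
// happens to contain fewer distinct categories.
class ToyOneHotEncoder {
  def fit(trainingCategories: Seq[Int]): ToyOneHotModel =
    new ToyOneHotModel(trainingCategories.max + 1)
}

class ToyOneHotModel(val numCategories: Int) {
  def transform(category: Int): Array[Double] = {
    val vec = Array.fill(numCategories)(0.0)
    if (category >= 0 && category < numCategories) vec(category) = 1.0
    vec
  }
}

val model = new ToyOneHotEncoder().fit(Seq(0, 1, 2)) // 3 categories in training
val encodedTest = Seq(0, 1).map(model.transform)     // still width-3 vectors
{code}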



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13030) Change OneHotEncoder to Estimator

2017-10-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13030:


Assignee: (was: Apache Spark)

> Change OneHotEncoder to Estimator
> -
>
> Key: SPARK-13030
> URL: https://issues.apache.org/jira/browse/SPARK-13030
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.6.0
>Reporter: Wojciech Jurczyk
>
> OneHotEncoder should be an Estimator, just like in scikit-learn 
> (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html).
> In its current form, it is impossible to use when the number of categories 
> differs between the training dataset and the test dataset.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13030) Change OneHotEncoder to Estimator

2017-10-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13030:


Assignee: Apache Spark

> Change OneHotEncoder to Estimator
> -
>
> Key: SPARK-13030
> URL: https://issues.apache.org/jira/browse/SPARK-13030
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.6.0
>Reporter: Wojciech Jurczyk
>Assignee: Apache Spark
>
> OneHotEncoder should be an Estimator, just like in scikit-learn 
> (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html).
> In its current form, it is impossible to use when the number of categories 
> differs between the training dataset and the test dataset.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22014) Sample windows in Spark SQL

2017-10-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22014:


Assignee: Apache Spark

> Sample windows in Spark SQL
> ---
>
> Key: SPARK-22014
> URL: https://issues.apache.org/jira/browse/SPARK-22014
> Project: Spark
>  Issue Type: Wish
>  Components: DStreams, SQL
>Affects Versions: 2.2.0
>Reporter: Simon Schiff
>Assignee: Apache Spark
>Priority: Minor
>
> Hello,
> I am using Spark to process measurement data. It is possible to create sample 
> windows in Spark Streaming, where the duration of the window is smaller than 
> the slide. But when I try to do the same with Spark SQL (the measurement data 
> has a timestamp column), I get an analysis exception:
> {code}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot 
> resolve 'timewindow(timestamp, 6000, 18000, 0)' due to data type 
> mismatch: The slide duration (18000) must be less than or equal to the 
> windowDuration (6000)
> {code}
> Here is an example:
> {code:java}
> import java.sql.Timestamp;
> import java.text.SimpleDateFormat;
> import java.util.ArrayList;
> import java.util.Date;
> import java.util.List;
> import org.apache.spark.api.java.function.Function;
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Encoders;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.RowFactory;
> import org.apache.spark.sql.SparkSession;
> import org.apache.spark.sql.functions;
> import org.apache.spark.sql.types.DataTypes;
> import org.apache.spark.sql.types.StructField;
> import org.apache.spark.sql.types.StructType;
> public class App {
>   public static Timestamp createTimestamp(String in) throws Exception {
>   SimpleDateFormat dateFormat = new SimpleDateFormat("yyyy-MM-dd 
> hh:mm:ss");
>   Date parsedDate = dateFormat.parse(in);
>   return new Timestamp(parsedDate.getTime());
>   }
>   
>   public static void main(String[] args) {
>   SparkSession spark = SparkSession.builder().appName("Window 
> Sampling Example").getOrCreate();
>   
>   List<String> sensorData = new ArrayList<String>();
>   sensorData.add("2017-08-04 00:00:00, 22.75");
>   sensorData.add("2017-08-04 00:01:00, 23.82");
>   sensorData.add("2017-08-04 00:02:00, 24.15");
>   sensorData.add("2017-08-04 00:03:00, 23.16");
>   sensorData.add("2017-08-04 00:04:00, 22.62");
>   sensorData.add("2017-08-04 00:05:00, 22.89");
>   sensorData.add("2017-08-04 00:06:00, 23.21");
>   sensorData.add("2017-08-04 00:07:00, 24.59");
>   sensorData.add("2017-08-04 00:08:00, 24.44");
>   
>   Dataset<String> in = spark.createDataset(sensorData, 
> Encoders.STRING());
>   
>   StructType sensorSchema = DataTypes.createStructType(new 
> StructField[] { 
>   DataTypes.createStructField("timestamp", 
> DataTypes.TimestampType, false),
>   DataTypes.createStructField("value", 
> DataTypes.DoubleType, false),
>   });
>   
>   Dataset<Row> data = 
> spark.createDataFrame(in.toJavaRDD().map(new Function<String, Row>() {
>   public Row call(String line) throws Exception {
>   return 
> RowFactory.create(createTimestamp(line.split(",")[0]), 
> Double.parseDouble(line.split(",")[1]));
>   }
>   }), sensorSchema);
>   
>   data.groupBy(functions.window(data.col("timestamp"), "1 
> minutes", "3 minutes")).avg("value").orderBy("window").show(false);
>   }
> }
> {code}
> I think there should be no difference (in duration and slide) between a "Spark 
> Streaming window" and a "Spark SQL window" function.
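
For illustration, one possible workaround sketch under the current restriction, 
assuming a DataFrame `data` with `timestamp` and `value` columns as in the 
example above: aggregate over tumbling 3-minute windows, but keep only rows from 
the first minute of each window, which approximates a 1-minute sample every 
3 minutes:

{code}
import org.apache.spark.sql.functions._

// Tumbling 3-minute buckets; the filter keeps only the first minute of each
// bucket before averaging, so each window's average is computed from a sample.
val sampled = data
  .withColumn("bucket", window(col("timestamp"), "3 minutes"))
  .filter(expr("timestamp < bucket.start + INTERVAL 1 MINUTE"))
  .groupBy(col("bucket"))
  .avg("value")
  .orderBy("bucket")

sampled.show(false)
{code}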



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22014) Sample windows in Spark SQL

2017-10-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16208981#comment-16208981
 ] 

Apache Spark commented on SPARK-22014:
--

User 'SimonUzL' has created a pull request for this issue:
https://github.com/apache/spark/pull/19526

> Sample windows in Spark SQL
> ---
>
> Key: SPARK-22014
> URL: https://issues.apache.org/jira/browse/SPARK-22014
> Project: Spark
>  Issue Type: Wish
>  Components: DStreams, SQL
>Affects Versions: 2.2.0
>Reporter: Simon Schiff
>Priority: Minor
>
> Hello,
> I am using Spark to process measurement data. It is possible to create sample 
> windows in Spark Streaming, where the duration of the window is smaller than 
> the slide. But when I try to do the same with Spark SQL (the measurement data 
> has a timestamp column), I get an analysis exception:
> {code}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot 
> resolve 'timewindow(timestamp, 6000, 18000, 0)' due to data type 
> mismatch: The slide duration (18000) must be less than or equal to the 
> windowDuration (6000)
> {code}
> Here is an example:
> {code:java}
> import java.sql.Timestamp;
> import java.text.SimpleDateFormat;
> import java.util.ArrayList;
> import java.util.Date;
> import java.util.List;
> import org.apache.spark.api.java.function.Function;
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Encoders;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.RowFactory;
> import org.apache.spark.sql.SparkSession;
> import org.apache.spark.sql.functions;
> import org.apache.spark.sql.types.DataTypes;
> import org.apache.spark.sql.types.StructField;
> import org.apache.spark.sql.types.StructType;
> public class App {
>   public static Timestamp createTimestamp(String in) throws Exception {
>   SimpleDateFormat dateFormat = new SimpleDateFormat("yyyy-MM-dd 
> hh:mm:ss");
>   Date parsedDate = dateFormat.parse(in);
>   return new Timestamp(parsedDate.getTime());
>   }
>   
>   public static void main(String[] args) {
>   SparkSession spark = SparkSession.builder().appName("Window 
> Sampling Example").getOrCreate();
>   
>   List<String> sensorData = new ArrayList<String>();
>   sensorData.add("2017-08-04 00:00:00, 22.75");
>   sensorData.add("2017-08-04 00:01:00, 23.82");
>   sensorData.add("2017-08-04 00:02:00, 24.15");
>   sensorData.add("2017-08-04 00:03:00, 23.16");
>   sensorData.add("2017-08-04 00:04:00, 22.62");
>   sensorData.add("2017-08-04 00:05:00, 22.89");
>   sensorData.add("2017-08-04 00:06:00, 23.21");
>   sensorData.add("2017-08-04 00:07:00, 24.59");
>   sensorData.add("2017-08-04 00:08:00, 24.44");
>   
>   Dataset<String> in = spark.createDataset(sensorData, 
> Encoders.STRING());
>   
>   StructType sensorSchema = DataTypes.createStructType(new 
> StructField[] { 
>   DataTypes.createStructField("timestamp", 
> DataTypes.TimestampType, false),
>   DataTypes.createStructField("value", 
> DataTypes.DoubleType, false),
>   });
>   
>   Dataset<Row> data = 
> spark.createDataFrame(in.toJavaRDD().map(new Function<String, Row>() {
>   public Row call(String line) throws Exception {
>   return 
> RowFactory.create(createTimestamp(line.split(",")[0]), 
> Double.parseDouble(line.split(",")[1]));
>   }
>   }), sensorSchema);
>   
>   data.groupBy(functions.window(data.col("timestamp"), "1 
> minutes", "3 minutes")).avg("value").orderBy("window").show(false);
>   }
> }
> {code}
> I think there should be no difference (in duration and slide) between a "Spark 
> Streaming window" and a "Spark SQL window" function.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22014) Sample windows in Spark SQL

2017-10-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22014:


Assignee: (was: Apache Spark)

> Sample windows in Spark SQL
> ---
>
> Key: SPARK-22014
> URL: https://issues.apache.org/jira/browse/SPARK-22014
> Project: Spark
>  Issue Type: Wish
>  Components: DStreams, SQL
>Affects Versions: 2.2.0
>Reporter: Simon Schiff
>Priority: Minor
>
> Hello,
> I am using Spark to process measurement data. It is possible to create sample 
> windows in Spark Streaming, where the duration of the window is smaller than 
> the slide. But when I try to do the same with Spark SQL (the measurement data 
> has a timestamp column), I get an analysis exception:
> {code}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot 
> resolve 'timewindow(timestamp, 6000, 18000, 0)' due to data type 
> mismatch: The slide duration (18000) must be less than or equal to the 
> windowDuration (6000)
> {code}
> Here is an example:
> {code:java}
> import java.sql.Timestamp;
> import java.text.SimpleDateFormat;
> import java.util.ArrayList;
> import java.util.Date;
> import java.util.List;
> import org.apache.spark.api.java.function.Function;
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Encoders;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.RowFactory;
> import org.apache.spark.sql.SparkSession;
> import org.apache.spark.sql.functions;
> import org.apache.spark.sql.types.DataTypes;
> import org.apache.spark.sql.types.StructField;
> import org.apache.spark.sql.types.StructType;
> public class App {
>   public static Timestamp createTimestamp(String in) throws Exception {
>   SimpleDateFormat dateFormat = new SimpleDateFormat("yyyy-MM-dd 
> hh:mm:ss");
>   Date parsedDate = dateFormat.parse(in);
>   return new Timestamp(parsedDate.getTime());
>   }
>   
>   public static void main(String[] args) {
>   SparkSession spark = SparkSession.builder().appName("Window 
> Sampling Example").getOrCreate();
>   
>   List<String> sensorData = new ArrayList<String>();
>   sensorData.add("2017-08-04 00:00:00, 22.75");
>   sensorData.add("2017-08-04 00:01:00, 23.82");
>   sensorData.add("2017-08-04 00:02:00, 24.15");
>   sensorData.add("2017-08-04 00:03:00, 23.16");
>   sensorData.add("2017-08-04 00:04:00, 22.62");
>   sensorData.add("2017-08-04 00:05:00, 22.89");
>   sensorData.add("2017-08-04 00:06:00, 23.21");
>   sensorData.add("2017-08-04 00:07:00, 24.59");
>   sensorData.add("2017-08-04 00:08:00, 24.44");
>   
>   Dataset<String> in = spark.createDataset(sensorData, 
> Encoders.STRING());
>   
>   StructType sensorSchema = DataTypes.createStructType(new 
> StructField[] { 
>   DataTypes.createStructField("timestamp", 
> DataTypes.TimestampType, false),
>   DataTypes.createStructField("value", 
> DataTypes.DoubleType, false),
>   });
>   
>   Dataset<Row> data = 
> spark.createDataFrame(in.toJavaRDD().map(new Function<String, Row>() {
>   public Row call(String line) throws Exception {
>   return 
> RowFactory.create(createTimestamp(line.split(",")[0]), 
> Double.parseDouble(line.split(",")[1]));
>   }
>   }), sensorSchema);
>   
>   data.groupBy(functions.window(data.col("timestamp"), "1 
> minutes", "3 minutes")).avg("value").orderBy("window").show(false);
>   }
> }
> {code}
> I think there should be no difference (in duration and slide) between a "Spark 
> Streaming window" and a "Spark SQL window" function.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5594) SparkException: Failed to get broadcast (TorrentBroadcast)

2017-10-18 Thread Hristo Angelov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16208936#comment-16208936
 ] 

Hristo Angelov commented on SPARK-5594:
---

[~hryhoriev.nick] I have exactly the same problem. The application works fine 
(in my case for 4-5 days) and then this exception occurs. The strange thing is 
that my broadcast variable has index 0:

{code:java}
INFO TorrentBroadcast: Started reading broadcast variable 0
{code}

So this broadcast variable is only read when the exception occurs and never 
before that. I'm using Spark version 2.2.0. Did you find out what led to that 
error?
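
For context, a minimal illustrative sketch of why a broadcast can be read for 
the first time long after the application starts: executors fetch 
TorrentBroadcast blocks lazily, on the first task that dereferences `.value` 
there, so a rarely-hit code path can defer the read for days:

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("broadcast-lazy-read").master("local[2]").getOrCreate()
val sc = spark.sparkContext

val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))

// The broadcast blocks are only fetched on an executor when a task running
// there first calls lookup.value; until then nothing is read.
val total = sc.parallelize(Seq("a", "b", "c"))
  .map(key => lookup.value.getOrElse(key, 0))
  .sum()

println(total)
{code}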


> SparkException: Failed to get broadcast (TorrentBroadcast)
> --
>
> Key: SPARK-5594
> URL: https://issues.apache.org/jira/browse/SPARK-5594
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0, 1.3.0
>Reporter: John Sandiford
>Priority: Critical
>
> I am uncertain whether this is a bug; however, I am getting the error below 
> when running on a cluster (it works locally), and have no idea what is causing 
> it, or where to look for more information.
> Any help is appreciated.  Others appear to experience the same issue, but I 
> have not found any solutions online.
> Please note that this only happens with certain code and is repeatable; all 
> my other Spark jobs work fine.
> {noformat}
> ERROR TaskSetManager: Task 3 in stage 6.0 failed 4 times; aborting job
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 3 in stage 6.0 failed 4 times, most recent failure: 
> Lost task 3.3 in stage 6.0 (TID 24, ): java.io.IOException: 
> org.apache.spark.SparkException: Failed to get broadcast_6_piece0 of 
> broadcast_6
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1011)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:164)
> at 
> org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
> at 
> org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:87)
> at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:58)
> at org.apache.spark.scheduler.Task.run(Task.scala:56)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:744)
> Caused by: org.apache.spark.SparkException: Failed to get broadcast_6_piece0 
> of broadcast_6
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:137)
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:137)
> at scala.Option.getOrElse(Option.scala:120)
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:136)
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:119)
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:119)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:119)
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:174)
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1008)
> ... 11 more
> {noformat}
> Driver stacktrace:
> {noformat}
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>

[jira] [Commented] (SPARK-9135) Filter fails when filtering with a method reference to overloaded method

2017-10-18 Thread yue long (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16208929#comment-16208929
 ] 

yue long commented on SPARK-9135:
-

It still happens in spark 2.1

> Filter fails when filtering with a method reference to overloaded method
> 
>
> Key: SPARK-9135
> URL: https://issues.apache.org/jira/browse/SPARK-9135
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 1.4.0
>Reporter: Mateusz Michalowski
>
> Filter fails when filtering with a method reference to overloaded method.
> In the example below we filter by Fruit::isRed, which is overloaded by 
> Apple::isRed and Banana::isRed. 
> {code}
> apples.filter(Fruit::isRed)
> bananas.filter(Fruit::isRed) //throws!
> {code}
> Spark will try to cast Apple::isRed to Banana::isRed - and then throw as a 
> result.
> However, if we filter the more generic RDD first, all works fine
> {code}
> fruit.filter(Fruit::isRed)
> bananas.filter(Fruit::isRed) //works fine!
> {code}
> It also works well if we use a lambda instead of the method reference
> {code}
> apples.filter(f -> f.isRed())
> bananas.filter(f -> f.isRed()) //works fine!
> {code} 
> I attach a test setup below:
> {code:java}
> package com.doggybites;
> import org.apache.spark.SparkConf;
> import org.apache.spark.api.java.JavaRDD;
> import org.apache.spark.api.java.JavaSparkContext;
> import org.junit.After;
> import org.junit.Before;
> import org.junit.Test;
> import java.io.Serializable;
> import java.util.Arrays;
> import static org.hamcrest.CoreMatchers.equalTo;
> import static org.junit.Assert.assertThat;
> public class SparkTest {
> static abstract class Fruit implements Serializable {
> abstract boolean isRed();
> }
> static class Banana extends Fruit {
> @Override
> boolean isRed() {
> return false;
> }
> }
> static class Apple extends Fruit {
> @Override
> boolean isRed() {
> return true;
> }
> }
> private JavaSparkContext sparkContext;
> @Before
> public void setUp() throws Exception {
> SparkConf sparkConf = new 
> SparkConf().setAppName("test").setMaster("local[2]");
> sparkContext = new JavaSparkContext(sparkConf);
> }
> @After
> public void tearDown() throws Exception {
> sparkContext.stop();
> }
> private <T> JavaRDD<T> toRdd(T... array) {
> return sparkContext.parallelize(Arrays.asList(array));
> }
> @Test
> public void filters_apples_and_bananas_with_method_reference() {
> JavaRDD<Apple> appleRdd = toRdd(new Apple());
> JavaRDD<Banana> bananaRdd = toRdd(new Banana());
> 
> long redAppleCount = appleRdd.filter(Fruit::isRed).count();
> long redBananaCount = bananaRdd.filter(Fruit::isRed).count();
> assertThat(redAppleCount, equalTo(1L));
> assertThat(redBananaCount, equalTo(0L));
> }
> }
> {code}
> The test above throws:
> {code}
> 15/07/17 14:10:04 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID 3)
> java.lang.ClassCastException: com.doggybites.SparkTest$Banana cannot be cast 
> to com.doggybites.SparkTest$Apple
>   at com.doggybites.SparkTest$$Lambda$2/976119300.call(Unknown Source)
>   at 
> org.apache.spark.api.java.JavaRDD$$anonfun$filter$1.apply(JavaRDD.scala:78)
>   at 
> org.apache.spark.api.java.JavaRDD$$anonfun$filter$1.apply(JavaRDD.scala:78)
>   at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:390)
>   at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1626)
>   at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1099)
>   at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1099)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1767)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1767)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
>   at org.apache.spark.scheduler.Task.run(Task.scala:70)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> 15/07/17 14:10:04 WARN TaskSetManager: Lost task 1.0 in stage 1.0 (TID 3, 
> localhost): java.lang.ClassCastException: com.doggybites.SparkTest$Banana 
> cannot be cast to com.doggybites.SparkTest$Apple
>   at com.doggybites.SparkTest$$Lambda$2/976119300.call(Unknown Source)
>   at 
> org.apache.spark.api.java.JavaRDD$$anonfun$filter$1.apply(JavaRDD.scala:78)
>   at 
> 

[jira] [Commented] (SPARK-16599) java.util.NoSuchElementException: None.get at at org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343)

2017-10-18 Thread Pranav Singhania (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16208925#comment-16208925
 ] 

Pranav Singhania commented on SPARK-16599:
--

More detail on the error, from the particular node where the exception occurred:

It fails to get a particular broadcast piece:
{code:java}
17/10/18 00:09:48 INFO Executor: Running task 216.0 in stage 715.0 (TID 1645)
17/10/18 00:09:48 INFO TorrentBroadcast: Started reading broadcast variable 554
17/10/18 00:09:48 INFO log: Logging initialized @532075ms
17/10/18 00:09:48 ERROR Utils: Exception encountered
org.apache.spark.SparkException: Failed to get broadcast_554_piece0 of 
broadcast_554
at 
org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:146)
at 
org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:125)
at 
org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:125)
at scala.collection.immutable.List.foreach(List.scala:381)
at 
org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:125)
at 
org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:186)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1273)
at 
org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:174)
at 
org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:65)
at 
org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:65)
at 
org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:89)
at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:67)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
17/10/18 00:09:48 ERROR Executor: Exception in task 216.0 in stage 715.0 (TID 
1645)
java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:347)
at scala.None$.get(Option.scala:345)
at 
org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343)
at 
org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:646)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:281)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
17/10/18 00:09:48 INFO CoarseGrainedExecutorBackend: Got assigned task 1656
17/10/18 00:09:48 INFO Executor: Running task 109.0 in stage 715.0 (TID 1656)
17/10/18 00:09:48 INFO TorrentBroadcast: Started reading broadcast variable 554
17/10/18 00:09:48 ERROR Utils: Exception encountered
org.apache.spark.SparkException: Failed to get broadcast_554_piece0 of 
broadcast_554
at 
org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:146)
at 
org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:125)
at 
org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:125)
at scala.collection.immutable.List.foreach(List.scala:381)
at 
org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:125)
at 
org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:186)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1273)
at 
org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:174)
at 
org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:65)
at 
org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:65)
at 
org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:89)
at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
at 

[jira] [Commented] (SPARK-22289) Cannot save LogisticRegressionClassificationModel with bounds on coefficients

2017-10-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16208898#comment-16208898
 ] 

Apache Spark commented on SPARK-22289:
--

User 'hhbyyh' has created a pull request for this issue:
https://github.com/apache/spark/pull/19525

> Cannot save LogisticRegressionClassificationModel with bounds on coefficients
> -
>
> Key: SPARK-22289
> URL: https://issues.apache.org/jira/browse/SPARK-22289
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Nic Eggert
>
> I think this was introduced in SPARK-20047.
> Trying to call save on a logistic regression model trained with bounds on its 
> parameters throws an error. This seems to be because Spark doesn't know how 
> to serialize the Matrix parameter.
> Model is set up like this:
> {code}
> val calibrator = new LogisticRegression()
>   .setFeaturesCol("uncalibrated_probability")
>   .setLabelCol("label")
>   .setWeightCol("weight")
>   .setStandardization(false)
>   .setLowerBoundsOnCoefficients(new DenseMatrix(1, 1, Array(0.0)))
>   .setFamily("binomial")
>   .setProbabilityCol("probability")
>   .setPredictionCol("logistic_prediction")
>   .setRawPredictionCol("logistic_raw_prediction")
> {code}
> {code}
> 17/10/16 15:36:59 ERROR ApplicationMaster: User class threw exception: 
> scala.NotImplementedError: The default jsonEncode only supports string and 
> vector. org.apache.spark.ml.param.Param must override jsonEncode for 
> org.apache.spark.ml.linalg.DenseMatrix.
> scala.NotImplementedError: The default jsonEncode only supports string and 
> vector. org.apache.spark.ml.param.Param must override jsonEncode for 
> org.apache.spark.ml.linalg.DenseMatrix.
>   at org.apache.spark.ml.param.Param.jsonEncode(params.scala:98)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1$$anonfun$2.apply(ReadWrite.scala:296)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1$$anonfun$2.apply(ReadWrite.scala:295)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1.apply(ReadWrite.scala:295)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1.apply(ReadWrite.scala:295)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$.getMetadataToSave(ReadWrite.scala:295)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$.saveMetadata(ReadWrite.scala:277)
>   at 
> org.apache.spark.ml.classification.LogisticRegressionModel$LogisticRegressionModelWriter.saveImpl(LogisticRegression.scala:1182)
>   at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:114)
>   at 
> org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$saveImpl$1.apply(Pipeline.scala:254)
>   at 
> org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$saveImpl$1.apply(Pipeline.scala:253)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at 
> org.apache.spark.ml.Pipeline$SharedReadWrite$.saveImpl(Pipeline.scala:253)
>   at 
> org.apache.spark.ml.PipelineModel$PipelineModelWriter.saveImpl(Pipeline.scala:337)
>   at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:114)
>   -snip-
> {code}
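
For illustration, a hedged sketch (not the actual fix) of what JSON-encoding a 
DenseMatrix param value could look like: record the shape, layout, and flat 
value array so the matrix can be rebuilt on load:

{code}
import org.apache.spark.ml.linalg.DenseMatrix
import org.json4s.JsonDSL._
import org.json4s.jackson.JsonMethods._

// Encode enough information (shape, layout, values) to reconstruct the matrix.
def encodeDenseMatrix(m: DenseMatrix): String = {
  compact(render(
    ("numRows" -> m.numRows) ~
    ("numCols" -> m.numCols) ~
    ("values" -> m.values.toSeq) ~
    ("isTransposed" -> m.isTransposed)
  ))
}

// Example: the 1x1 lower-bounds matrix from the report.
println(encodeDenseMatrix(new DenseMatrix(1, 1, Array(0.0))))
{code}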



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22289) Cannot save LogisticRegressionClassificationModel with bounds on coefficients

2017-10-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22289:


Assignee: (was: Apache Spark)

> Cannot save LogisticRegressionClassificationModel with bounds on coefficients
> -
>
> Key: SPARK-22289
> URL: https://issues.apache.org/jira/browse/SPARK-22289
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Nic Eggert
>
> I think this was introduced in SPARK-20047.
> Trying to call save on a logistic regression model trained with bounds on its 
> parameters throws an error. This seems to be because Spark doesn't know how 
> to serialize the Matrix parameter.
> Model is set up like this:
> {code}
> val calibrator = new LogisticRegression()
>   .setFeaturesCol("uncalibrated_probability")
>   .setLabelCol("label")
>   .setWeightCol("weight")
>   .setStandardization(false)
>   .setLowerBoundsOnCoefficients(new DenseMatrix(1, 1, Array(0.0)))
>   .setFamily("binomial")
>   .setProbabilityCol("probability")
>   .setPredictionCol("logistic_prediction")
>   .setRawPredictionCol("logistic_raw_prediction")
> {code}
> {code}
> 17/10/16 15:36:59 ERROR ApplicationMaster: User class threw exception: 
> scala.NotImplementedError: The default jsonEncode only supports string and 
> vector. org.apache.spark.ml.param.Param must override jsonEncode for 
> org.apache.spark.ml.linalg.DenseMatrix.
> scala.NotImplementedError: The default jsonEncode only supports string and 
> vector. org.apache.spark.ml.param.Param must override jsonEncode for 
> org.apache.spark.ml.linalg.DenseMatrix.
>   at org.apache.spark.ml.param.Param.jsonEncode(params.scala:98)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1$$anonfun$2.apply(ReadWrite.scala:296)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1$$anonfun$2.apply(ReadWrite.scala:295)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1.apply(ReadWrite.scala:295)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1.apply(ReadWrite.scala:295)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$.getMetadataToSave(ReadWrite.scala:295)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$.saveMetadata(ReadWrite.scala:277)
>   at 
> org.apache.spark.ml.classification.LogisticRegressionModel$LogisticRegressionModelWriter.saveImpl(LogisticRegression.scala:1182)
>   at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:114)
>   at 
> org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$saveImpl$1.apply(Pipeline.scala:254)
>   at 
> org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$saveImpl$1.apply(Pipeline.scala:253)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at 
> org.apache.spark.ml.Pipeline$SharedReadWrite$.saveImpl(Pipeline.scala:253)
>   at 
> org.apache.spark.ml.PipelineModel$PipelineModelWriter.saveImpl(Pipeline.scala:337)
>   at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:114)
>   -snip-
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22289) Cannot save LogisticRegressionClassificationModel with bounds on coefficients

2017-10-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22289:


Assignee: Apache Spark

> Cannot save LogisticRegressionClassificationModel with bounds on coefficients
> -
>
> Key: SPARK-22289
> URL: https://issues.apache.org/jira/browse/SPARK-22289
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Nic Eggert
>Assignee: Apache Spark
>
> I think this was introduced in SPARK-20047.
> Trying to call save on a logistic regression model trained with bounds on its 
> parameters throws an error. This seems to be because Spark doesn't know how 
> to serialize the Matrix parameter.
> Model is set up like this:
> {code}
> val calibrator = new LogisticRegression()
>   .setFeaturesCol("uncalibrated_probability")
>   .setLabelCol("label")
>   .setWeightCol("weight")
>   .setStandardization(false)
>   .setLowerBoundsOnCoefficients(new DenseMatrix(1, 1, Array(0.0)))
>   .setFamily("binomial")
>   .setProbabilityCol("probability")
>   .setPredictionCol("logistic_prediction")
>   .setRawPredictionCol("logistic_raw_prediction")
> {code}
> {code}
> 17/10/16 15:36:59 ERROR ApplicationMaster: User class threw exception: 
> scala.NotImplementedError: The default jsonEncode only supports string and 
> vector. org.apache.spark.ml.param.Param must override jsonEncode for 
> org.apache.spark.ml.linalg.DenseMatrix.
> scala.NotImplementedError: The default jsonEncode only supports string and 
> vector. org.apache.spark.ml.param.Param must override jsonEncode for 
> org.apache.spark.ml.linalg.DenseMatrix.
>   at org.apache.spark.ml.param.Param.jsonEncode(params.scala:98)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1$$anonfun$2.apply(ReadWrite.scala:296)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1$$anonfun$2.apply(ReadWrite.scala:295)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1.apply(ReadWrite.scala:295)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1.apply(ReadWrite.scala:295)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$.getMetadataToSave(ReadWrite.scala:295)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$.saveMetadata(ReadWrite.scala:277)
>   at 
> org.apache.spark.ml.classification.LogisticRegressionModel$LogisticRegressionModelWriter.saveImpl(LogisticRegression.scala:1182)
>   at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:114)
>   at 
> org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$saveImpl$1.apply(Pipeline.scala:254)
>   at 
> org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$saveImpl$1.apply(Pipeline.scala:253)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at 
> org.apache.spark.ml.Pipeline$SharedReadWrite$.saveImpl(Pipeline.scala:253)
>   at 
> org.apache.spark.ml.PipelineModel$PipelineModelWriter.saveImpl(Pipeline.scala:337)
>   at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:114)
>   -snip-
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2984) FileNotFoundException on _temporary directory

2017-10-18 Thread Soumitra Sulav (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16208883#comment-16208883
 ] 

Soumitra Sulav commented on SPARK-2984:
---

[~ste...@apache.org] Multiple batches/jobs run at the same time because we have 
set concurrentJobs to 4, and when load increases suddenly, jobs run in parallel.
The directory tree comprises the partitions, with the current destination folder 
being the current hour, hence the same directory.
We could add the batch time for each job to separate the parent folder, but we 
may end up with a lot of directories, which we don't want in the current 
scenario.
Is there a simpler way to avoid deleting the _temporary folder while any write 
is still in progress?
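
For illustration, a hedged sketch of one common pattern (the helper below is 
hypothetical, not an existing Spark API) that avoids two concurrent jobs sharing 
the same _temporary directory: each batch commits into its own staging directory 
first and only then moves its part files into the shared hourly folder:

{code}
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Each batch writes (and commits) under its own staging path, so concurrent
// batches never touch each other's _temporary directories; the rename into the
// shared hourly folder happens only after the write has fully committed.
def saveBatch(spark: SparkSession, records: RDD[String], hourDir: String, batchTime: Long): Unit = {
  val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
  val staging = new Path(s"$hourDir/.staging-$batchTime")
  records.saveAsTextFile(staging.toString)
  fs.listStatus(staging)
    .filter(_.getPath.getName.startsWith("part-"))
    .foreach { status =>
      fs.rename(status.getPath, new Path(s"$hourDir/$batchTime-${status.getPath.getName}"))
    }
  fs.delete(staging, true)
}
{code}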

> FileNotFoundException on _temporary directory
> -
>
> Key: SPARK-2984
> URL: https://issues.apache.org/jira/browse/SPARK-2984
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Andrew Ash
>Assignee: Josh Rosen
>Priority: Critical
> Fix For: 1.3.0
>
>
> We've seen several stacktraces and threads on the user mailing list where 
> people are having issues with a {{FileNotFoundException}} stemming from an 
> HDFS path containing {{_temporary}}.
> I ([~aash]) think this may be related to {{spark.speculation}}.  I think the 
> error condition might manifest in this circumstance:
> 1) task T starts on an executor E1
> 2) it takes a long time, so task T' is started on another executor E2
> 3) T finishes in E1 so moves its data from {{_temporary}} to the final 
> destination and deletes the {{_temporary}} directory during cleanup
> 4) T' finishes in E2 and attempts to move its data from {{_temporary}}, but 
> those files no longer exist, so it throws an exception
> Some samples:
> {noformat}
> 14/08/11 08:05:08 ERROR JobScheduler: Error running job streaming job 
> 140774430 ms.0
> java.io.FileNotFoundException: File 
> hdfs://hadoopc/user/csong/output/human_bot/-140774430.out/_temporary/0/task_201408110805__m_07
>  does not exist.
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:654)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:102)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:712)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:708)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:708)
> at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:360)
> at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:310)
> at 
> org.apache.hadoop.mapred.FileOutputCommitter.commitJob(FileOutputCommitter.java:136)
> at 
> org.apache.spark.SparkHadoopWriter.commitJob(SparkHadoopWriter.scala:126)
> at 
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:841)
> at 
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:724)
> at 
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:643)
> at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1068)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$8.apply(DStream.scala:773)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$8.apply(DStream.scala:771)
> at 
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41)
> at 
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
> at 
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
> at scala.util.Try$.apply(Try.scala:161)
> at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32)
> at 
> org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:172)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> -- Chen Song at 
> http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFiles-file-not-found-exception-td10686.html
> {noformat}
> I am running a Spark Streaming job that uses saveAsTextFiles to save results 
> into hdfs files. However, it has an exception after 20 batches
> result-140631234/_temporary/0/task_201407251119__m_03 does not 
> 

[jira] [Commented] (SPARK-22303) [SQL] Getting java.sql.SQLException: Unsupported type 101 for BINARY_DOUBLE

2017-10-18 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16208873#comment-16208873
 ] 

Sean Owen commented on SPARK-22303:
---

If it's Oracle-specific, I don't think it's reasonable to expect support. 

> [SQL] Getting java.sql.SQLException: Unsupported type 101 for BINARY_DOUBLE
> ---
>
> Key: SPARK-22303
> URL: https://issues.apache.org/jira/browse/SPARK-22303
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kohki Nishio
>Priority: Minor
>
> When a table contains columns such as BINARY_DOUBLE or BINARY_FLOAT, the 
> JDBC connector throws a SQLException:
> {code}
> java.sql.SQLException: Unsupported type 101
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$getCatalystType(JdbcUtils.scala:235)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$8.apply(JdbcUtils.scala:292)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$8.apply(JdbcUtils.scala:292)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getSchema(JdbcUtils.scala:291)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:64)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.(JDBCRelation.scala:113)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:47)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:306)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:146)
> {code}
> These types are Oracle-specific ones, described here:
> https://docs.oracle.com/cd/E11882_01/timesten.112/e21642/types.htm#TTSQL148
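
For illustration, a hedged sketch of a possible application-side workaround: 
register a custom JdbcDialect that maps these types before reading. The mapping 
of type code 101 to DoubleType follows the report; the code 100 for BINARY_FLOAT 
is an assumption:

{code}
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}
import org.apache.spark.sql.types.{DataType, DoubleType, FloatType, MetadataBuilder}

// Maps the Oracle-specific binary float/double type codes to Catalyst types so
// JdbcUtils.getCatalystType no longer falls through to "Unsupported type 101".
object OracleBinaryTypesDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:oracle")

  override def getCatalystType(
      sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] =
    sqlType match {
      case 101 => Some(DoubleType) // BINARY_DOUBLE (type code from the report)
      case 100 => Some(FloatType)  // BINARY_FLOAT (type code assumed)
      case _   => None
    }
}

JdbcDialects.registerDialect(OracleBinaryTypesDialect)
{code}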



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org