[jira] [Resolved] (SPARK-20850) Improve division and multiplication mixing process the data

2017-05-23 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-20850.
---
Resolution: Not A Problem

> Improve division and multiplication mixing process the data
> ---
>
> Key: SPARK-20850
> URL: https://issues.apache.org/jira/browse/SPARK-20850
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: caoxuewen
>
> spark-sql> select (1234567890123456789012 / 12345678901234567890120) * 12345678901234567890120;
> NULL
> spark-sql> select (12345678901234567890 / 123) * 123;
> NULL
> When the length of the literal (getText) is greater than 19, the result is 
> not what we expected, but MySQL handles the same values correctly:
> mysql> select (1234567890123456789012 / 12345678901234567890120) * 12345678901234567890120;
> +--------------------------------------------------------------------------------+
> | (1234567890123456789012 / 12345678901234567890120) * 12345678901234567890120 |
> +--------------------------------------------------------------------------------+
> | 1234567890123456789012. |
> +--------------------------------------------------------------------------------+
> 1 row in set (0.00 sec)
> mysql> select (12345678901234567890 / 123) * 123;
> +-------------------------------------+
> | (12345678901234567890 / 123) * 123 |
> +-------------------------------------+
> | 12345678901234567890. |
> +-------------------------------------+
> 1 row in set (0.00 sec)
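
A likely explanation for the NULLs above (and for the "Not A Problem" resolution) is decimal precision overflow: literals this long are parsed as DecimalType, and in Spark 2.x an arithmetic result that no longer fits the maximum precision of 38 comes back as NULL rather than raising an error. A minimal sketch, assuming a Spark 2.2 spark-shell session, showing the inferred type and a cast-to-double workaround:

{code}
// Sketch for spark-shell (Scala); `spark` is the session provided by the shell.
val df = spark.sql("SELECT (12345678901234567890 / 123) * 123 AS product")
df.printSchema()   // product is a DecimalType; the exact precision/scale follow
                   // Spark's decimal arithmetic rules
df.show(false)     // NULL when an intermediate result overflows precision 38

// Casting to DOUBLE trades exactness for a non-NULL (approximate) result.
spark.sql("SELECT (CAST(12345678901234567890 AS DOUBLE) / 123) * 123").show(false)
{code}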






[jira] [Reopened] (SPARK-20850) Improve division and multiplication mixing process the data

2017-05-23 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reopened SPARK-20850:
---

> Improve division and multiplication mixing process the data
> ---
>
> Key: SPARK-20850
> URL: https://issues.apache.org/jira/browse/SPARK-20850
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: caoxuewen
>
> spark-sql> select (1234567890123456789012 / 12345678901234567890120) * 12345678901234567890120;
> NULL
> spark-sql> select (12345678901234567890 / 123) * 123;
> NULL
> When the length of the literal (getText) is greater than 19, the result is 
> not what we expected, but MySQL handles the same values correctly:
> mysql> select (1234567890123456789012 / 12345678901234567890120) * 12345678901234567890120;
> +--------------------------------------------------------------------------------+
> | (1234567890123456789012 / 12345678901234567890120) * 12345678901234567890120 |
> +--------------------------------------------------------------------------------+
> | 1234567890123456789012. |
> +--------------------------------------------------------------------------------+
> 1 row in set (0.00 sec)
> mysql> select (12345678901234567890 / 123) * 123;
> +-------------------------------------+
> | (12345678901234567890 / 123) * 123 |
> +-------------------------------------+
> | 12345678901234567890. |
> +-------------------------------------+
> 1 row in set (0.00 sec)






[jira] [Commented] (SPARK-20864) I tried to run spark mllib PIC algorithm, but got error

2017-05-23 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16022345#comment-16022345
 ] 

yuhao yang commented on SPARK-20864:


[~yuanjie] Could you please provide more code to help the investigation? From 
the exception, it looks like the issue is not caused by the algorithm but by 
something in the data processing.

> I tried to run spark mllib PIC algorithm, but got error
> ---
>
> Key: SPARK-20864
> URL: https://issues.apache.org/jira/browse/SPARK-20864
> Project: Spark
>  Issue Type: Question
>  Components: MLlib
>Affects Versions: 2.1.1
>Reporter: yuanjie
>Priority: Blocker
>
> I use very simple data:
> 1 2 3
> 2 1 3
> 3 1 3
> 4 5 2
> 4 6 2
> 5 6 2
> but when running it I got:
> Exception in thread "main": java.io.IOException: 
> com.google.protobuf.ServiceException: java.lang.UnsupportedOperationException: 
> This is supposed to be overridden by subclasses
> Why?
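
For reference, a minimal sketch of how triples like the ones above can be fed to MLlib's PowerIterationClustering from a spark-shell session, assuming the three columns are (srcId, dstId, similarity) and using a hypothetical input path. The protobuf "This is supposed to be overridden by subclasses" error is commonly associated with mismatched protobuf/Hadoop versions on the classpath rather than with the algorithm itself:

{code}
import org.apache.spark.mllib.clustering.PowerIterationClustering

// "similarities.txt" is a hypothetical path holding lines such as "1 2 3",
// read here as (srcId, dstId, similarity).
val similarities = sc.textFile("similarities.txt").map { line =>
  val Array(src, dst, sim) = line.trim.split("\\s+")
  (src.toLong, dst.toLong, sim.toDouble)
}

val model = new PowerIterationClustering()
  .setK(2)                 // number of clusters
  .setMaxIterations(10)
  .run(similarities)

model.assignments.collect().foreach(a => println(s"${a.id} -> ${a.cluster}"))
{code}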






[jira] [Updated] (SPARK-20864) I tried to run spark mllib PIC algorithm, but got error

2017-05-23 Thread yuanjie (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuanjie updated SPARK-20864:

Description: 
I use very simple data:
1 2 3
2 1 3
3 1 3
4 5 2
4 6 2
5 6 2
but when running it I got:
Exception in thread "main": java.io.IOException: 
com.google.protobuf.ServiceException: java.lang.UnsupportedOperationException: 
This is supposed to be overridden by subclasses

Why?

> I tried to run spark mllib PIC algorithm, but got error
> ---
>
> Key: SPARK-20864
> URL: https://issues.apache.org/jira/browse/SPARK-20864
> Project: Spark
>  Issue Type: Question
>  Components: MLlib
>Affects Versions: 2.1.1
>Reporter: yuanjie
>Priority: Blocker
>
> I use very simple data:
> 1 2 3
> 2 1 3
> 3 1 3
> 4 5 2
> 4 6 2
> 5 6 2
> but when running it I got:
> Exception in thread "main": java.io.IOException: 
> com.google.protobuf.ServiceException: java.lang.UnsupportedOperationException: 
> This is supposed to be overridden by subclasses
> Why?






[jira] [Created] (SPARK-20864) I tried to run spark mllib PIC algorithm, but got error

2017-05-23 Thread yuanjie (JIRA)
yuanjie created SPARK-20864:
---

 Summary: I tried to run spark mllib PIC algorithm, but got error
 Key: SPARK-20864
 URL: https://issues.apache.org/jira/browse/SPARK-20864
 Project: Spark
  Issue Type: Question
  Components: MLlib
Affects Versions: 2.1.1
Reporter: yuanjie
Priority: Blocker









[jira] [Resolved] (SPARK-20861) Pyspark CrossValidator & TrainValidationSplit should delegate parameter looping to estimators

2017-05-23 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-20861.
---
   Resolution: Fixed
Fix Version/s: 2.3.0
   2.2.1

Issue resolved by pull request 18077
[https://github.com/apache/spark/pull/18077]

> Pyspark CrossValidator & TrainValidationSplit should delegate parameter 
> looping to estimators
> -
>
> Key: SPARK-20861
> URL: https://issues.apache.org/jira/browse/SPARK-20861
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.1.1
>Reporter: Bago Amirbekian
>Assignee: Bago Amirbekian
>Priority: Minor
> Fix For: 2.2.1, 2.3.0
>
>
> The CrossValidator & TrainValidationSplit should call estimator.fit with all 
> their parameter maps instead of passing params one by one to fit. This 
> behaviour would make PySpark more consistent with Scala Spark and allow 
> individual estimators to parallelize or optimize fitting over multiple 
> parameter maps.
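
For context, the Scala side already exposes the behaviour being asked for: Estimator.fit accepts an array of ParamMaps and returns one model per map. A small sketch of that overload (toy data, runnable in spark-shell), which is what the PySpark CrossValidator/TrainValidationSplit are meant to delegate to:

{code}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.tuning.ParamGridBuilder

// Tiny toy training set so the call below is runnable as-is.
val training = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0, 1.1, 0.1)),
  (0.0, Vectors.dense(2.0, 1.0, -1.0)),
  (1.0, Vectors.dense(0.0, 1.2, -0.5))
)).toDF("label", "features")

val lr = new LogisticRegression()
val paramMaps = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.01, 0.1))
  .addGrid(lr.maxIter, Array(10, 50))
  .build()

// One fit call with every ParamMap, instead of looping and fitting one map at
// a time; the result is one fitted model per map.
val models = lr.fit(training, paramMaps)
{code}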






[jira] [Commented] (SPARK-20861) Pyspark CrossValidator & TrainValidationSplit should delegate parameter looping to estimators

2017-05-23 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16022267#comment-16022267
 ] 

Joseph K. Bradley commented on SPARK-20861:
---

I targeted this at 2.2.1 and 2.3.0; backporting is probably worthwhile since 
this can manifest as a bug (slowness) when comparing Python vs Scala APIs.

> Pyspark CrossValidator & TrainValidationSplit should delegate parameter 
> looping to estimators
> -
>
> Key: SPARK-20861
> URL: https://issues.apache.org/jira/browse/SPARK-20861
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.1.1
>Reporter: Bago Amirbekian
>Assignee: Bago Amirbekian
>Priority: Minor
>
> The CrossValidator & TrainValidationSplit should call estimator.fit with all 
> their parameter maps instead of passing params one by one to fit. This 
> behaviour would make PySpark more consistent with Scala Spark and allow 
> individual estimators to parallelize or optimize fitting over multiple 
> parameter maps.






[jira] [Updated] (SPARK-20861) Pyspark CrossValidator & TrainValidationSplit should delegate parameter looping to estimators

2017-05-23 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-20861:
--
Target Version/s: 2.2.1, 2.3.0  (was: 2.2.1)

> Pyspark CrossValidator & TrainValidationSplit should delegate parameter 
> looping to estimators
> -
>
> Key: SPARK-20861
> URL: https://issues.apache.org/jira/browse/SPARK-20861
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.1.1
>Reporter: Bago Amirbekian
>Assignee: Bago Amirbekian
>Priority: Minor
>
> The CrossValidator & TrainValidationSplit should call estimator.fit with all 
> their parameter maps instead of passing params one by one to fit. This 
> behaviour would make PySpark more consistent with Scala Spark and allow 
> individual estimators to parallelize or optimize fitting over multiple 
> parameter maps.






[jira] [Assigned] (SPARK-20861) Pyspark CrossValidator & TrainValidationSplit should delegate parameter looping to estimators

2017-05-23 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-20861:
-

Assignee: Bago Amirbekian

> Pyspark CrossValidator & TrainValidationSplit should delegate parameter 
> looping to estimators
> -
>
> Key: SPARK-20861
> URL: https://issues.apache.org/jira/browse/SPARK-20861
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.1.1
>Reporter: Bago Amirbekian
>Assignee: Bago Amirbekian
>Priority: Minor
>
> The CrossValidator & TrainValidationSplit should call estimator.fit with all 
> their parameter maps instead of passing params one by one to fit. This 
> behaviour would make PySpark more consistent with Scala Spark and allow 
> individual estimators to parallelize or optimize fitting over multiple 
> parameter maps.






[jira] [Resolved] (SPARK-20850) Improve division and multiplication mixing process the data

2017-05-23 Thread caoxuewen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caoxuewen resolved SPARK-20850.
---
Resolution: Fixed

> Improve division and multiplication mixing process the data
> ---
>
> Key: SPARK-20850
> URL: https://issues.apache.org/jira/browse/SPARK-20850
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: caoxuewen
>
> spark-sql> select (1234567890123456789012 / 12345678901234567890120) * 12345678901234567890120;
> NULL
> spark-sql> select (12345678901234567890 / 123) * 123;
> NULL
> When the length of the literal (getText) is greater than 19, the result is 
> not what we expected, but MySQL handles the same values correctly:
> mysql> select (1234567890123456789012 / 12345678901234567890120) * 12345678901234567890120;
> +--------------------------------------------------------------------------------+
> | (1234567890123456789012 / 12345678901234567890120) * 12345678901234567890120 |
> +--------------------------------------------------------------------------------+
> | 1234567890123456789012. |
> +--------------------------------------------------------------------------------+
> 1 row in set (0.00 sec)
> mysql> select (12345678901234567890 / 123) * 123;
> +-------------------------------------+
> | (12345678901234567890 / 123) * 123 |
> +-------------------------------------+
> | 12345678901234567890. |
> +-------------------------------------+
> 1 row in set (0.00 sec)






[jira] [Commented] (SPARK-19256) Hive bucketing support

2017-05-23 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16022246#comment-16022246
 ] 

Wenchen Fan commented on SPARK-19256:
-

Let's wait for https://github.com/apache/spark/pull/18064, which also 
refactors the insertion plan node.

> Hive bucketing support
> --
>
> Key: SPARK-19256
> URL: https://issues.apache.org/jira/browse/SPARK-19256
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Tejas Patil
>Priority: Minor
>
> JIRA to track design discussions and tasks related to Hive bucketing support 
> in Spark.
> Proposal : 
> https://docs.google.com/document/d/1a8IDh23RAkrkg9YYAeO51F4aGO8-xAlupKwdshve2fc/edit?usp=sharing






[jira] [Reopened] (SPARK-19900) [Standalone] Master registers application again when driver relaunched

2017-05-23 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reopened SPARK-19900:
-

> [Standalone] Master registers application again when driver relaunched
> --
>
> Key: SPARK-19900
> URL: https://issues.apache.org/jira/browse/SPARK-19900
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Spark Core
>Affects Versions: 1.6.2
> Environment: Centos 6.5, spark standalone
>Reporter: Sergey
>Priority: Critical
>  Labels: Spark, network, standalone, supervise
>
> I've found some problems when the node where the driver is running has an 
> unstable network. A situation is possible in which two identical applications 
> are running on a cluster.
> *Steps to Reproduce:*
> # prepare 3 nodes: one for the Spark master and two for the Spark workers.
> # submit an application with the parameter spark.driver.supervise = true
> # go to the node where the driver is running (for example spark-worker-1) and 
> block port 7077
> {code}
> # iptables -A OUTPUT -p tcp --dport 7077 -j DROP
> {code}
> # wait more than 60 seconds
> # look at the Spark master UI
> There are two Spark applications and one driver. The new application is in the 
> WAITING state and the second application is in the RUNNING state. The driver is 
> in the RUNNING or RELAUNCHING state (it depends on the resources available, as I 
> understand it) and it is launched on another node (for example spark-worker-2).
> # open the port again
> {code}
> # iptables -D OUTPUT -p tcp --dport 7077 -j DROP
> {code}
> # look at the Spark master UI again
> There are no changes.
> In addition, if you look at the processes on node spark-worker-1
> {code}
> # ps ax | grep spark
> {code}
> you will see that the old driver is still running!
> *Spark master logs:*
> {code}
> 17/03/10 05:26:27 WARN Master: Removing 
> worker-20170310052240-spark-worker-1-35039 because we got no heartbeat in 60 
> seconds
> 17/03/10 05:26:27 INFO Master: Removing worker 
> worker-20170310052240-spark-worker-1-35039 on spark-worker-1:35039
> 17/03/10 05:26:27 INFO Master: Telling app of lost executor: 1
> 17/03/10 05:26:27 INFO Master: Telling app of lost executor: 0
> 17/03/10 05:26:27 INFO Master: Re-launching driver-20170310052347-
> 17/03/10 05:26:27 INFO Master: Launching driver driver-20170310052347- on 
> worker worker-20170310052411-spark-worker-2-40473
> 17/03/10 05:26:35 INFO Master: Registering app TestApplication
> 17/03/10 05:26:35 INFO Master: Registered app TestApplication with ID 
> app-20170310052635-0001
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got status update for unknown executor 
> app-20170310052354-/1
> 17/03/10 05:31:07 WARN Master: Got status update for unknown executor 
> app-20170310052354-/0
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat 

[jira] [Assigned] (SPARK-20863) Add metrics/instrumentation to LiveListenerBus

2017-05-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20863:


Assignee: Apache Spark  (was: Josh Rosen)

> Add metrics/instrumentation to LiveListenerBus
> --
>
> Key: SPARK-20863
> URL: https://issues.apache.org/jira/browse/SPARK-20863
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Josh Rosen
>Assignee: Apache Spark
>
> I think that we should add Coda Hale metrics to the LiveListenerBus in order 
> to count the number of queued, processed, and dropped events, as well as a 
> timer tracking per-event processing times.
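
An illustration (not Spark's actual implementation) of the Coda Hale primitives the ticket proposes: counters for queued/processed/dropped events and a timer wrapped around per-event processing, using the com.codahale.metrics API:

{code}
import com.codahale.metrics.{MetricRegistry, Timer}

// Sketch only: counters and a timer of the kind the ticket describes.
class ListenerBusMetrics(registry: MetricRegistry) {
  val queued = registry.counter("queue.size")
  val dropped = registry.counter("events.dropped")
  val processed = registry.counter("events.processed")
  val processingTime: Timer = registry.timer("events.processingTime")

  def onPost(): Unit = queued.inc()
  def onDrop(): Unit = dropped.inc()

  def timeProcessing[T](body: => T): T = {
    val ctx = processingTime.time()   // start the per-event timer
    try body
    finally {
      ctx.stop()                      // record the elapsed processing time
      processed.inc()
      queued.dec()
    }
  }
}
{code}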






[jira] [Assigned] (SPARK-20863) Add metrics/instrumentation to LiveListenerBus

2017-05-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20863:


Assignee: Josh Rosen  (was: Apache Spark)

> Add metrics/instrumentation to LiveListenerBus
> --
>
> Key: SPARK-20863
> URL: https://issues.apache.org/jira/browse/SPARK-20863
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> I think that we should add Coda Hale metrics to the LiveListenerBus in order 
> to count the number of queued, processed, and dropped events, as well as a 
> timer tracking per-event processing times.






[jira] [Commented] (SPARK-20863) Add metrics/instrumentation to LiveListenerBus

2017-05-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16022192#comment-16022192
 ] 

Apache Spark commented on SPARK-20863:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/18083

> Add metrics/instrumentation to LiveListenerBus
> --
>
> Key: SPARK-20863
> URL: https://issues.apache.org/jira/browse/SPARK-20863
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> I think that we should add Coda Hale metrics to the LiveListenerBus in order 
> to count the number of queued, processed, and dropped events, as well as a 
> timer tracking per-event processing times.






[jira] [Commented] (SPARK-20665) Spark-sql, "Bround" and "Round" function return NULL

2017-05-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16022182#comment-16022182
 ] 

Apache Spark commented on SPARK-20665:
--

User '10110346' has created a pull request for this issue:
https://github.com/apache/spark/pull/18082

> Spark-sql, "Bround" and "Round" function return NULL
> 
>
> Key: SPARK-20665
> URL: https://issues.apache.org/jira/browse/SPARK-20665
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0, 2.2.0
>Reporter: liuxian
>Assignee: liuxian
> Fix For: 2.0.3, 2.1.2, 2.2.0
>
>
> >select bround(12.3, 2);
> >NULL
> For this case, the expected result is 12.3, but it returns NULL.
> "Round" has the same problem:
> >select round(12.3, 2);
> >NULL






[jira] [Assigned] (SPARK-20862) LogisticRegressionModel throws TypeError

2017-05-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20862:


Assignee: Apache Spark

> LogisticRegressionModel throws TypeError
> 
>
> Key: SPARK-20862
> URL: https://issues.apache.org/jira/browse/SPARK-20862
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 2.1.1
>Reporter: Bago Amirbekian
>Assignee: Apache Spark
>Priority: Minor
>
> LogisticRegressionModel throws a TypeError using python3 and numpy 1.12.1:
> **
> File "/Users/bago/repos/spark/python/pyspark/mllib/classification.py", line 
> 155, in __main__.LogisticRegressionModel
> Failed example:
> mcm = LogisticRegressionWithLBFGS.train(data, iterations=10, numClasses=3)
> Exception raised:
> Traceback (most recent call last):
>   File 
> "/usr/local/Cellar/python3/3.6.1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/doctest.py",
>  line 1330, in __run
> compileflags, 1), test.globs)
>   File "", line 1, in 
> 
> mcm = LogisticRegressionWithLBFGS.train(data, iterations=10, 
> numClasses=3)
>   File "/Users/bago/repos/spark/python/pyspark/mllib/classification.py", 
> line 398, in train
> return _regression_train_wrapper(train, LogisticRegressionModel, 
> data, initialWeights)
>   File "/Users/bago/repos/spark/python/pyspark/mllib/regression.py", line 
> 216, in _regression_train_wrapper
> return modelClass(weights, intercept, numFeatures, numClasses)
>   File "/Users/bago/repos/spark/python/pyspark/mllib/classification.py", 
> line 176, in __init__
> self._dataWithBiasSize)
> TypeError: 'float' object cannot be interpreted as an integer






[jira] [Assigned] (SPARK-20862) LogisticRegressionModel throws TypeError

2017-05-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20862:


Assignee: (was: Apache Spark)

> LogisticRegressionModel throws TypeError
> 
>
> Key: SPARK-20862
> URL: https://issues.apache.org/jira/browse/SPARK-20862
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 2.1.1
>Reporter: Bago Amirbekian
>Priority: Minor
>
> LogisticRegressionModel throws a TypeError using python3 and numpy 1.12.1:
> **
> File "/Users/bago/repos/spark/python/pyspark/mllib/classification.py", line 
> 155, in __main__.LogisticRegressionModel
> Failed example:
> mcm = LogisticRegressionWithLBFGS.train(data, iterations=10, numClasses=3)
> Exception raised:
> Traceback (most recent call last):
>   File 
> "/usr/local/Cellar/python3/3.6.1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/doctest.py",
>  line 1330, in __run
> compileflags, 1), test.globs)
>   File "", line 1, in 
> 
> mcm = LogisticRegressionWithLBFGS.train(data, iterations=10, 
> numClasses=3)
>   File "/Users/bago/repos/spark/python/pyspark/mllib/classification.py", 
> line 398, in train
> return _regression_train_wrapper(train, LogisticRegressionModel, 
> data, initialWeights)
>   File "/Users/bago/repos/spark/python/pyspark/mllib/regression.py", line 
> 216, in _regression_train_wrapper
> return modelClass(weights, intercept, numFeatures, numClasses)
>   File "/Users/bago/repos/spark/python/pyspark/mllib/classification.py", 
> line 176, in __init__
> self._dataWithBiasSize)
> TypeError: 'float' object cannot be interpreted as an integer






[jira] [Commented] (SPARK-20862) LogisticRegressionModel throws TypeError

2017-05-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16022179#comment-16022179
 ] 

Apache Spark commented on SPARK-20862:
--

User 'MrBago' has created a pull request for this issue:
https://github.com/apache/spark/pull/18081

> LogisticRegressionModel throws TypeError
> 
>
> Key: SPARK-20862
> URL: https://issues.apache.org/jira/browse/SPARK-20862
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 2.1.1
>Reporter: Bago Amirbekian
>Priority: Minor
>
> LogisticRegressionModel throws a TypeError using python3 and numpy 1.12.1:
> **
> File "/Users/bago/repos/spark/python/pyspark/mllib/classification.py", line 
> 155, in __main__.LogisticRegressionModel
> Failed example:
> mcm = LogisticRegressionWithLBFGS.train(data, iterations=10, numClasses=3)
> Exception raised:
> Traceback (most recent call last):
>   File 
> "/usr/local/Cellar/python3/3.6.1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/doctest.py",
>  line 1330, in __run
> compileflags, 1), test.globs)
>   File "", line 1, in 
> 
> mcm = LogisticRegressionWithLBFGS.train(data, iterations=10, 
> numClasses=3)
>   File "/Users/bago/repos/spark/python/pyspark/mllib/classification.py", 
> line 398, in train
> return _regression_train_wrapper(train, LogisticRegressionModel, 
> data, initialWeights)
>   File "/Users/bago/repos/spark/python/pyspark/mllib/regression.py", line 
> 216, in _regression_train_wrapper
> return modelClass(weights, intercept, numFeatures, numClasses)
>   File "/Users/bago/repos/spark/python/pyspark/mllib/classification.py", 
> line 176, in __init__
> self._dataWithBiasSize)
> TypeError: 'float' object cannot be interpreted as an integer






[jira] [Created] (SPARK-20863) Add metrics/instrumentation to LiveListenerBus

2017-05-23 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-20863:
--

 Summary: Add metrics/instrumentation to LiveListenerBus
 Key: SPARK-20863
 URL: https://issues.apache.org/jira/browse/SPARK-20863
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Josh Rosen
Assignee: Josh Rosen


I think that we should add Coda Hale metrics to the LiveListenerBus in order to 
count the number of queued, processed, and dropped events, as well as a timer 
tracking per-event processing times.






[jira] [Created] (SPARK-20862) LogisticRegressionModel throws TypeError

2017-05-23 Thread Bago Amirbekian (JIRA)
Bago Amirbekian created SPARK-20862:
---

 Summary: LogisticRegressionModel throws TypeError
 Key: SPARK-20862
 URL: https://issues.apache.org/jira/browse/SPARK-20862
 Project: Spark
  Issue Type: Bug
  Components: MLlib, PySpark
Affects Versions: 2.1.1
Reporter: Bago Amirbekian
Priority: Minor


LogisticRegressionModel throws a TypeError using python3 and numpy 1.12.1:

**
File "/Users/bago/repos/spark/python/pyspark/mllib/classification.py", line 
155, in __main__.LogisticRegressionModel
Failed example:
mcm = LogisticRegressionWithLBFGS.train(data, iterations=10, numClasses=3)
Exception raised:
Traceback (most recent call last):
  File 
"/usr/local/Cellar/python3/3.6.1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/doctest.py",
 line 1330, in __run
compileflags, 1), test.globs)
  File "", line 1, in 
mcm = LogisticRegressionWithLBFGS.train(data, iterations=10, 
numClasses=3)
  File "/Users/bago/repos/spark/python/pyspark/mllib/classification.py", 
line 398, in train
return _regression_train_wrapper(train, LogisticRegressionModel, data, 
initialWeights)
  File "/Users/bago/repos/spark/python/pyspark/mllib/regression.py", line 
216, in _regression_train_wrapper
return modelClass(weights, intercept, numFeatures, numClasses)
  File "/Users/bago/repos/spark/python/pyspark/mllib/classification.py", 
line 176, in __init__
self._dataWithBiasSize)
TypeError: 'float' object cannot be interpreted as an integer







[jira] [Assigned] (SPARK-20771) Usability issues with weekofyear()

2017-05-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20771:


Assignee: (was: Apache Spark)

> Usability issues with weekofyear()
> --
>
> Key: SPARK-20771
> URL: https://issues.apache.org/jira/browse/SPARK-20771
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiangrui Meng
>Priority: Minor
>
> The weekofyear() implementation follows the Hive / ISO 8601 week number. 
> However, it is not useful on its own because it doesn't return the year the 
> week starts in. For example,
> weekofyear("2017-01-01") returns 52
> Anyone using this with groupBy('week) might do the aggregation or ordering 
> wrong. A better implementation should return the year of the week as well.
> MySQL's yearweek() is much better in this sense: 
> https://dev.mysql.com/doc/refman/5.5/en/date-and-time-functions.html#function_yearweek.
> Maybe we should implement that in Spark.
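
A small illustration of the pitfall, runnable in spark-shell, showing why pairing weekofyear() with year() is not enough at year boundaries:

{code}
// 2017-01-01 falls in ISO week 52 of 2016, but year() reports 2017, so a
// (year, week) grouping key would file the row under the wrong year.
spark.sql("SELECT weekofyear('2017-01-01') AS week, year('2017-01-01') AS year").show()
// +----+----+
// |week|year|
// +----+----+
// |  52|2017|
// +----+----+
{code}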






[jira] [Assigned] (SPARK-20771) Usability issues with weekofyear()

2017-05-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20771:


Assignee: Apache Spark

> Usability issues with weekofyear()
> --
>
> Key: SPARK-20771
> URL: https://issues.apache.org/jira/browse/SPARK-20771
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiangrui Meng
>Assignee: Apache Spark
>Priority: Minor
>
> The weekofyear() implementation follows the Hive / ISO 8601 week number. 
> However, it is not useful on its own because it doesn't return the year the 
> week starts in. For example,
> weekofyear("2017-01-01") returns 52
> Anyone using this with groupBy('week) might do the aggregation or ordering 
> wrong. A better implementation should return the year of the week as well.
> MySQL's yearweek() is much better in this sense: 
> https://dev.mysql.com/doc/refman/5.5/en/date-and-time-functions.html#function_yearweek.
> Maybe we should implement that in Spark.






[jira] [Commented] (SPARK-20771) Usability issues with weekofyear()

2017-05-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16022145#comment-16022145
 ] 

Apache Spark commented on SPARK-20771:
--

User 'setjet' has created a pull request for this issue:
https://github.com/apache/spark/pull/18080

> Usability issues with weekofyear()
> --
>
> Key: SPARK-20771
> URL: https://issues.apache.org/jira/browse/SPARK-20771
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiangrui Meng
>Priority: Minor
>
> The weekofyear() implementation follows the Hive / ISO 8601 week number. 
> However, it is not useful on its own because it doesn't return the year the 
> week starts in. For example,
> weekofyear("2017-01-01") returns 52
> Anyone using this with groupBy('week) might do the aggregation or ordering 
> wrong. A better implementation should return the year of the week as well.
> MySQL's yearweek() is much better in this sense: 
> https://dev.mysql.com/doc/refman/5.5/en/date-and-time-functions.html#function_yearweek.
> Maybe we should implement that in Spark.






[jira] [Assigned] (SPARK-10643) Support remote application download in client mode spark submit

2017-05-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10643:


Assignee: (was: Apache Spark)

> Support remote application download in client mode spark submit
> ---
>
> Key: SPARK-10643
> URL: https://issues.apache.org/jira/browse/SPARK-10643
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Submit
>Reporter: Alan Braithwaite
>Priority: Minor
>
> When using Mesos with Docker and Marathon, it would be nice to be able to 
> make spark-submit deployable on Marathon and have it download a jar from 
> HDFS instead of having to package the jar into the Docker image.
> {code}
> $ docker run -it docker.example.com/spark:latest 
> /usr/local/spark/bin/spark-submit  --class 
> com.example.spark.streaming.EventHandler hdfs://hdfs/tmp/application.jar 
> Warning: Skip remote jar hdfs://hdfs/tmp/application.jar.
> java.lang.ClassNotFoundException: com.example.spark.streaming.EventHandler
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:348)
> at org.apache.spark.util.Utils$.classForName(Utils.scala:173)
> at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:639)
> at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}
> Although I'm aware that we can run in cluster mode with mesos, we've already 
> built some nice tools surrounding marathon for logging and monitoring.
> Code in question:
> https://github.com/apache/spark/blob/132718ad7f387e1002b708b19e471d9cd907e105/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L723-L736






[jira] [Assigned] (SPARK-10643) Support remote application download in client mode spark submit

2017-05-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10643:


Assignee: Apache Spark

> Support remote application download in client mode spark submit
> ---
>
> Key: SPARK-10643
> URL: https://issues.apache.org/jira/browse/SPARK-10643
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Submit
>Reporter: Alan Braithwaite
>Assignee: Apache Spark
>Priority: Minor
>
> When using Mesos with Docker and Marathon, it would be nice to be able to 
> make spark-submit deployable on Marathon and have it download a jar from 
> HDFS instead of having to package the jar into the Docker image.
> {code}
> $ docker run -it docker.example.com/spark:latest 
> /usr/local/spark/bin/spark-submit  --class 
> com.example.spark.streaming.EventHandler hdfs://hdfs/tmp/application.jar 
> Warning: Skip remote jar hdfs://hdfs/tmp/application.jar.
> java.lang.ClassNotFoundException: com.example.spark.streaming.EventHandler
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:348)
> at org.apache.spark.util.Utils$.classForName(Utils.scala:173)
> at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:639)
> at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}
> Although I'm aware that we can run in cluster mode with mesos, we've already 
> built some nice tools surrounding marathon for logging and monitoring.
> Code in question:
> https://github.com/apache/spark/blob/132718ad7f387e1002b708b19e471d9cd907e105/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L723-L736






[jira] [Commented] (SPARK-10643) Support remote application download in client mode spark submit

2017-05-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16022143#comment-16022143
 ] 

Apache Spark commented on SPARK-10643:
--

User 'loneknightpy' has created a pull request for this issue:
https://github.com/apache/spark/pull/18078

> Support remote application download in client mode spark submit
> ---
>
> Key: SPARK-10643
> URL: https://issues.apache.org/jira/browse/SPARK-10643
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Submit
>Reporter: Alan Braithwaite
>Priority: Minor
>
> When using Mesos with Docker and Marathon, it would be nice to be able to 
> make spark-submit deployable on Marathon and have it download a jar from 
> HDFS instead of having to package the jar into the Docker image.
> {code}
> $ docker run -it docker.example.com/spark:latest 
> /usr/local/spark/bin/spark-submit  --class 
> com.example.spark.streaming.EventHandler hdfs://hdfs/tmp/application.jar 
> Warning: Skip remote jar hdfs://hdfs/tmp/application.jar.
> java.lang.ClassNotFoundException: com.example.spark.streaming.EventHandler
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:348)
> at org.apache.spark.util.Utils$.classForName(Utils.scala:173)
> at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:639)
> at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}
> Although I'm aware that we can run in cluster mode with mesos, we've already 
> built some nice tools surrounding marathon for logging and monitoring.
> Code in question:
> https://github.com/apache/spark/blob/132718ad7f387e1002b708b19e471d9cd907e105/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L723-L736






[jira] [Commented] (SPARK-20861) Pyspark CrossValidator & TrainValidationSplit should delegate parameter looping to estimators

2017-05-23 Thread Bago Amirbekian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16022137#comment-16022137
 ] 

Bago Amirbekian commented on SPARK-20861:
-

[~josephkb]

> Pyspark CrossValidator & TrainValidationSplit should delegate parameter 
> looping to estimators
> -
>
> Key: SPARK-20861
> URL: https://issues.apache.org/jira/browse/SPARK-20861
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.1.1
>Reporter: Bago Amirbekian
>Priority: Minor
>
> The CrossValidator & TrainValidationSplit should call estimator.fit with all 
> their parameter maps instead of passing params one by one to fit. This 
> behaviour would make PySpark more consistent with Scala Spark and allow 
> individual estimators to parallelize or optimize fitting over multiple 
> parameter maps.






[jira] [Assigned] (SPARK-20861) Pyspark CrossValidator & TrainValidationSplit should delegate parameter looping to estimators

2017-05-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20861:


Assignee: Apache Spark

> Pyspark CrossValidator & TrainValidationSplit should delegate parameter 
> looping to estimators
> -
>
> Key: SPARK-20861
> URL: https://issues.apache.org/jira/browse/SPARK-20861
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.1.1
>Reporter: Bago Amirbekian
>Assignee: Apache Spark
>Priority: Minor
>
> The CrossValidator & TrainValidationSplit should call estimator.fit with all 
> their parameter maps instead of passing params one by one to fit. This 
> behaviour would make PySpark more consistent with Scala Spark and allow 
> individual estimators to parallelize or optimize fitting over multiple 
> parameter maps.






[jira] [Issue Comment Deleted] (SPARK-20861) Pyspark CrossValidator & TrainValidationSplit should delegate parameter looping to estimators

2017-05-23 Thread Bago Amirbekian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bago Amirbekian updated SPARK-20861:

Comment: was deleted

(was: I've made a PR to address this issue: 
https://github.com/apache/spark/pull/18077.)

> Pyspark CrossValidator & TrainValidationSplit should delegate parameter 
> looping to estimators
> -
>
> Key: SPARK-20861
> URL: https://issues.apache.org/jira/browse/SPARK-20861
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.1.1
>Reporter: Bago Amirbekian
>Priority: Minor
>
> The CrossValidator & TrainValidationSplit should call estimator.fit with all 
> their parameter maps instead of passing params one by one to fit. This 
> behaviour would make PySpark more consistent with Scala Spark and allow 
> individual estimators to parallelize or optimize fitting over multiple 
> parameter maps.






[jira] [Assigned] (SPARK-20861) Pyspark CrossValidator & TrainValidationSplit should delegate parameter looping to estimators

2017-05-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20861:


Assignee: (was: Apache Spark)

> Pyspark CrossValidator & TrainValidationSplit should delegate parameter 
> looping to estimators
> -
>
> Key: SPARK-20861
> URL: https://issues.apache.org/jira/browse/SPARK-20861
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.1.1
>Reporter: Bago Amirbekian
>Priority: Minor
>
> The CrossValidator & TrainValidationSplit should call estimator.fit with all 
> their parameter maps instead of passing params one by one to fit. This 
> behaviour would make PySpark more consistent with Scala Spark and allow 
> individual estimators to parallelize or optimize fitting over multiple 
> parameter maps.






[jira] [Commented] (SPARK-20861) Pyspark CrossValidator & TrainValidationSplit should delegate parameter looping to estimators

2017-05-23 Thread Bago Amirbekian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16022134#comment-16022134
 ] 

Bago Amirbekian commented on SPARK-20861:
-

I've made a PR to address this issue: 
https://github.com/apache/spark/pull/18077.

> Pyspark CrossValidator & TrainValidationSplit should delegate parameter 
> looping to estimators
> -
>
> Key: SPARK-20861
> URL: https://issues.apache.org/jira/browse/SPARK-20861
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.1.1
>Reporter: Bago Amirbekian
>Priority: Minor
>
> The CrossValidator & TrainValidationSplit should call estimator.fit with all 
> their parameter maps instead of passing params one by one to fit. This 
> behaviour would make PySpark more consistent with Scala Spark and allow 
> individual estimators to parallelize or optimize fitting over multiple 
> parameter maps.






[jira] [Commented] (SPARK-20861) Pyspark CrossValidator & TrainValidationSplit should delegate parameter looping to estimators

2017-05-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16022131#comment-16022131
 ] 

Apache Spark commented on SPARK-20861:
--

User 'MrBago' has created a pull request for this issue:
https://github.com/apache/spark/pull/18077

> Pyspark CrossValidator & TrainValidationSplit should delegate parameter 
> looping to estimators
> -
>
> Key: SPARK-20861
> URL: https://issues.apache.org/jira/browse/SPARK-20861
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.1.1
>Reporter: Bago Amirbekian
>Priority: Minor
>
> The CrossValidator & TrainValidationSplit should call estimator.fit with all 
> their parameter maps instead of passing params one by one to fit. This 
> behaviour would make PySpark more consistent with Scala Spark and allow 
> individual estimators to parallelize or optimize fitting over multiple 
> parameter maps.






[jira] [Created] (SPARK-20861) Pyspark CrossValidator & TrainValidationSplit should delegate parameter looping to estimators

2017-05-23 Thread Bago Amirbekian (JIRA)
Bago Amirbekian created SPARK-20861:
---

 Summary: Pyspark CrossValidator & TrainValidationSplit should 
delegate parameter looping to estimators
 Key: SPARK-20861
 URL: https://issues.apache.org/jira/browse/SPARK-20861
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Affects Versions: 2.1.1
Reporter: Bago Amirbekian
Priority: Minor


The CrossValidator & TrainValidationSplit should call estimator.fit with all 
their parameter maps instead of passing params one by one to fit. This 
behaviour would make PySpark more consistent with Scala Spark and allow 
individual estimators to parallelize or optimize fitting over multiple 
parameter maps.






[jira] [Resolved] (SPARK-20860) Make spark-submit download remote files to local in client mode

2017-05-23 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-20860.

Resolution: Duplicate

> Make spark-submit download remote files to local in client mode
> ---
>
> Key: SPARK-20860
> URL: https://issues.apache.org/jira/browse/SPARK-20860
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy
>Affects Versions: 2.2.0
>Reporter: Yu Peng
>
> Currently, the spark-submit script doesn't allow remote files in client mode. It 
> would be great to make it able to download remote files (e.g. files on S3) to 
> the local machine before executing the Spark application. 
> cc: [~mengxr] [~joshrosen]






[jira] [Assigned] (SPARK-20841) Support table column aliases in FROM clause

2017-05-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20841:


Assignee: Apache Spark

> Support table column aliases in FROM clause
> ---
>
> Key: SPARK-20841
> URL: https://issues.apache.org/jira/browse/SPARK-20841
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>Assignee: Apache Spark
>Priority: Minor
>
> Some SQL dialects support a relatively obscure "table column aliases" feature 
> where you can rename columns when aliasing a relation in a {{FROM}} clause. 
> For example:
> {code}
> SELECT * FROM onecolumn AS a(x) JOIN onecolumn AS b(y) ON a.x = b.y
> {code}
> Spark does not currently support this. I would like to add support for this 
> in order to allow me to run a corpus of existing queries which depend on this 
> syntax.
> There's a good writeup on this at 
> http://modern-sql.com/feature/table-column-aliases, which has additional 
> examples and describes other databases' degrees of support for this feature.
> One tricky thing to figure out will be whether FROM clause column aliases 
> take precedence over aliases in the SELECT clause. When adding support for 
> this, we should make sure to add sufficient testing of several corner-cases, 
> including:
> * Aliasing in both the SELECT and FROM clause
> * Aliasing columns in the FROM clause both with and without an explicit AS.
> * Aliasing the wrong number of columns in the FROM clause, both greater and 
> fewer columns than were selected in the SELECT clause.
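
Until this is supported, the same renaming can be expressed with derived-table subqueries, which Spark does accept. A hedged workaround sketch for spark-shell (the column name "col" is hypothetical, since the example table's schema isn't shown):

{code}
// Rename inside subqueries instead of in the FROM alias; "col" stands in for
// the single column of the example's "onecolumn" table.
spark.sql("""
  SELECT *
  FROM (SELECT col AS x FROM onecolumn) a
  JOIN (SELECT col AS y FROM onecolumn) b
    ON a.x = b.y
""").show()
{code}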






[jira] [Assigned] (SPARK-20841) Support table column aliases in FROM clause

2017-05-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20841:


Assignee: (was: Apache Spark)

> Support table column aliases in FROM clause
> ---
>
> Key: SPARK-20841
> URL: https://issues.apache.org/jira/browse/SPARK-20841
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>Priority: Minor
>
> Some SQL dialects support a relatively obscure "table column aliases" feature 
> where you can rename columns when aliasing a relation in a {{FROM}} clause. 
> For example:
> {code}
> SELECT * FROM onecolumn AS a(x) JOIN onecolumn AS b(y) ON a.x = b.y
> {code}
> Spark does not currently support this. I would like to add support for this 
> in order to allow me to run a corpus of existing queries which depend on this 
> syntax.
> There's a good writeup on this at 
> http://modern-sql.com/feature/table-column-aliases, which has additional 
> examples and describes other databases' degrees of support for this feature.
> One tricky thing to figure out will be whether FROM clause column aliases 
> take precedence over aliases in the SELECT clause. When adding support for 
> this, we should make sure to add sufficient testing of several corner-cases, 
> including:
> * Aliasing in both the SELECT and FROM clause
> * Aliasing columns in the FROM clause both with and without an explicit AS.
> * Aliasing the wrong number of columns in the FROM clause, both greater and 
> fewer columns than were selected in the SELECT clause.






[jira] [Commented] (SPARK-20841) Support table column aliases in FROM clause

2017-05-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16022122#comment-16022122
 ] 

Apache Spark commented on SPARK-20841:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/18079

> Support table column aliases in FROM clause
> ---
>
> Key: SPARK-20841
> URL: https://issues.apache.org/jira/browse/SPARK-20841
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>Priority: Minor
>
> Some SQL dialects support a relatively obscure "table column aliases" feature 
> where you can rename columns when aliasing a relation in a {{FROM}} clause. 
> For example:
> {code}
> SELECT * FROM onecolumn AS a(x) JOIN onecolumn AS b(y) ON a.x = b.y
> {code}
> Spark does not currently support this. I would like to add support for this 
> in order to allow me to run a corpus of existing queries which depend on this 
> syntax.
> There's a good writeup on this at 
> http://modern-sql.com/feature/table-column-aliases, which has additional 
> examples and describes other databases' degrees of support for this feature.
> One tricky thing to figure out will be whether FROM clause column aliases 
> take precedence over aliases in the SELECT clause. When adding support for 
> this, we should make sure to add sufficient testing of several corner-cases, 
> including:
> * Aliasing in both the SELECT and FROM clause
> * Aliasing columns in the FROM clause both with and without an explicit AS.
> * Aliasing the wrong number of columns in the FROM clause, both more and 
> fewer columns than are selected in the SELECT clause.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20860) Make spark-submit download remote files to local in client mode

2017-05-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20860:


Assignee: (was: Apache Spark)

> Make spark-submit download remote files to local in client mode
> ---
>
> Key: SPARK-20860
> URL: https://issues.apache.org/jira/browse/SPARK-20860
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy
>Affects Versions: 2.2.0
>Reporter: Yu Peng
>
> Currently, the spark-submit script doesn't allow remote files in client mode. It 
> would be great to make it able to download remote files (e.g. files on S3) to 
> the local filesystem before executing the Spark application.
> cc: [~mengxr] [~joshrosen]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20860) Make spark-submit download remote files to local in client mode

2017-05-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20860:


Assignee: Apache Spark

> Make spark-submit download remote files to local in client mode
> ---
>
> Key: SPARK-20860
> URL: https://issues.apache.org/jira/browse/SPARK-20860
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy
>Affects Versions: 2.2.0
>Reporter: Yu Peng
>Assignee: Apache Spark
>
> Currently, the spark-submit script doesn't allow remote files in client mode. It 
> would be great to make it able to download remote files (e.g. files on S3) to 
> the local filesystem before executing the Spark application.
> cc: [~mengxr] [~joshrosen]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20860) Make spark-submit download remote files to local in client mode

2017-05-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16022120#comment-16022120
 ] 

Apache Spark commented on SPARK-20860:
--

User 'loneknightpy' has created a pull request for this issue:
https://github.com/apache/spark/pull/18078

> Make spark-submit download remote files to local in client mode
> ---
>
> Key: SPARK-20860
> URL: https://issues.apache.org/jira/browse/SPARK-20860
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy
>Affects Versions: 2.2.0
>Reporter: Yu Peng
>
> Currently, the spark-submit script doesn't allow remote files in client mode. It 
> would be great to make it able to download remote files (e.g. files on S3) to 
> the local filesystem before executing the Spark application.
> cc: [~mengxr] [~joshrosen]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20860) Make spark-submit download remote files to local in client mode

2017-05-23 Thread Yu Peng (JIRA)
Yu Peng created SPARK-20860:
---

 Summary: Make spark-submit download remote files to local in 
client mode
 Key: SPARK-20860
 URL: https://issues.apache.org/jira/browse/SPARK-20860
 Project: Spark
  Issue Type: New Feature
  Components: Deploy
Affects Versions: 2.2.0
Reporter: Yu Peng


Currently, the spark-submit script doesn't allow remote files in client mode. It 
would be great to make it able to download remote files (e.g. files on S3) to 
the local filesystem before executing the Spark application.

cc: [~mengxr] [~joshrosen]
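
As a stopgap, here is a hedged sketch of one possible client-side workaround (not the change proposed here): download plain HTTP(S) resources to a local temp directory and pass the local paths to spark-submit. The URL, jar name, and application script are illustrative assumptions; S3 paths would need an S3 client instead of urllib2.

{code:python}
# Hedged workaround sketch: fetch remote HTTP(S) files locally, then invoke
# spark-submit with local paths. Everything named here is illustrative.
import os
import subprocess
import tempfile
import urllib2  # Python 2, matching the other snippets in this thread

def fetch_to_local(url, target_dir):
    """Download one remote file and return its local path."""
    local_path = os.path.join(target_dir, os.path.basename(url))
    with open(local_path, "wb") as out:
        out.write(urllib2.urlopen(url).read())
    return local_path

remote_jars = ["https://example.com/libs/my-udfs.jar"]  # hypothetical URL
work_dir = tempfile.mkdtemp(prefix="spark-submit-downloads-")
local_jars = [fetch_to_local(u, work_dir) for u in remote_jars]

subprocess.check_call([
    "spark-submit",
    "--deploy-mode", "client",
    "--jars", ",".join(local_jars),
    "my_app.py",  # hypothetical application file
])
{code}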



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18406) Race between end-of-task and completion iterator read lock release

2017-05-23 Thread Yongqin Xiao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16022077#comment-16022077
 ] 

Yongqin Xiao commented on SPARK-18406:
--

Thanks for the fix. Which Spark release will include it?
Can we get a patch on top of Spark 2.1.0?

> Race between end-of-task and completion iterator read lock release
> --
>
> Key: SPARK-18406
> URL: https://issues.apache.org/jira/browse/SPARK-18406
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Spark Core
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Josh Rosen
>
> The following log comes from a production streaming job where executors 
> periodically die due to uncaught exceptions during block release:
> {code}
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7921
> 16/11/07 17:11:06 INFO Executor: Running task 0.0 in stage 2390.0 (TID 7921)
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7922
> 16/11/07 17:11:06 INFO Executor: Running task 1.0 in stage 2390.0 (TID 7922)
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7923
> 16/11/07 17:11:06 INFO Executor: Running task 2.0 in stage 2390.0 (TID 7923)
> 16/11/07 17:11:06 INFO TorrentBroadcast: Started reading broadcast variable 
> 2721
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7924
> 16/11/07 17:11:06 INFO Executor: Running task 3.0 in stage 2390.0 (TID 7924)
> 16/11/07 17:11:06 INFO MemoryStore: Block broadcast_2721_piece0 stored as 
> bytes in memory (estimated size 5.0 KB, free 4.9 GB)
> 16/11/07 17:11:06 INFO TorrentBroadcast: Reading broadcast variable 2721 took 
> 3 ms
> 16/11/07 17:11:06 INFO MemoryStore: Block broadcast_2721 stored as values in 
> memory (estimated size 9.4 KB, free 4.9 GB)
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_1 locally
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_3 locally
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_2 locally
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_4 locally
> 16/11/07 17:11:06 INFO PythonRunner: Times: total = 2, boot = -566, init = 
> 567, finish = 1
> 16/11/07 17:11:06 INFO PythonRunner: Times: total = 7, boot = -540, init = 
> 541, finish = 6
> 16/11/07 17:11:06 INFO Executor: Finished task 2.0 in stage 2390.0 (TID 
> 7923). 1429 bytes result sent to driver
> 16/11/07 17:11:06 INFO PythonRunner: Times: total = 8, boot = -532, init = 
> 533, finish = 7
> 16/11/07 17:11:06 INFO Executor: Finished task 3.0 in stage 2390.0 (TID 
> 7924). 1429 bytes result sent to driver
> 16/11/07 17:11:06 ERROR Executor: Exception in task 0.0 in stage 2390.0 (TID 
> 7921)
> java.lang.AssertionError: assertion failed
>   at scala.Predef$.assert(Predef.scala:165)
>   at 
> org.apache.spark.storage.BlockInfo.checkInvariants(BlockInfoManager.scala:84)
>   at 
> org.apache.spark.storage.BlockInfo.readerCount_$eq(BlockInfoManager.scala:66)
>   at 
> org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:362)
>   at 
> org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:361)
>   at scala.Option.foreach(Option.scala:236)
>   at 
> org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:361)
>   at 
> org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:356)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:356)
>   at 
> org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:646)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:281)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7925
> 16/11/07 17:11:06 INFO Executor: Running task 0.1 in stage 2390.0 (TID 7925)
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_1 locally
> 16/11/07 17:11:06 INFO PythonRunner: Times: total = 41, boot = -536, init = 
> 576, finish = 1
> 16/11/07 17:11:06 INFO Executor: Finished task 1.0 in stage 2390.0 (TID 
> 7922). 1429 bytes result sent to driver
> 16/11/07 17:11:06 ERROR Utils: Uncaught exception in thread stdout writer for 
> /databricks/python/bin/python
> java.lang.AssertionError: assertion failed: Block 

[jira] [Assigned] (SPARK-18406) Race between end-of-task and completion iterator read lock release

2017-05-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18406:


Assignee: (was: Apache Spark)

> Race between end-of-task and completion iterator read lock release
> --
>
> Key: SPARK-18406
> URL: https://issues.apache.org/jira/browse/SPARK-18406
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Spark Core
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Josh Rosen
>
> The following log comes from a production streaming job where executors 
> periodically die due to uncaught exceptions during block release:
> {code}
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7921
> 16/11/07 17:11:06 INFO Executor: Running task 0.0 in stage 2390.0 (TID 7921)
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7922
> 16/11/07 17:11:06 INFO Executor: Running task 1.0 in stage 2390.0 (TID 7922)
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7923
> 16/11/07 17:11:06 INFO Executor: Running task 2.0 in stage 2390.0 (TID 7923)
> 16/11/07 17:11:06 INFO TorrentBroadcast: Started reading broadcast variable 
> 2721
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7924
> 16/11/07 17:11:06 INFO Executor: Running task 3.0 in stage 2390.0 (TID 7924)
> 16/11/07 17:11:06 INFO MemoryStore: Block broadcast_2721_piece0 stored as 
> bytes in memory (estimated size 5.0 KB, free 4.9 GB)
> 16/11/07 17:11:06 INFO TorrentBroadcast: Reading broadcast variable 2721 took 
> 3 ms
> 16/11/07 17:11:06 INFO MemoryStore: Block broadcast_2721 stored as values in 
> memory (estimated size 9.4 KB, free 4.9 GB)
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_1 locally
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_3 locally
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_2 locally
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_4 locally
> 16/11/07 17:11:06 INFO PythonRunner: Times: total = 2, boot = -566, init = 
> 567, finish = 1
> 16/11/07 17:11:06 INFO PythonRunner: Times: total = 7, boot = -540, init = 
> 541, finish = 6
> 16/11/07 17:11:06 INFO Executor: Finished task 2.0 in stage 2390.0 (TID 
> 7923). 1429 bytes result sent to driver
> 16/11/07 17:11:06 INFO PythonRunner: Times: total = 8, boot = -532, init = 
> 533, finish = 7
> 16/11/07 17:11:06 INFO Executor: Finished task 3.0 in stage 2390.0 (TID 
> 7924). 1429 bytes result sent to driver
> 16/11/07 17:11:06 ERROR Executor: Exception in task 0.0 in stage 2390.0 (TID 
> 7921)
> java.lang.AssertionError: assertion failed
>   at scala.Predef$.assert(Predef.scala:165)
>   at 
> org.apache.spark.storage.BlockInfo.checkInvariants(BlockInfoManager.scala:84)
>   at 
> org.apache.spark.storage.BlockInfo.readerCount_$eq(BlockInfoManager.scala:66)
>   at 
> org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:362)
>   at 
> org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:361)
>   at scala.Option.foreach(Option.scala:236)
>   at 
> org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:361)
>   at 
> org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:356)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:356)
>   at 
> org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:646)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:281)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7925
> 16/11/07 17:11:06 INFO Executor: Running task 0.1 in stage 2390.0 (TID 7925)
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_1 locally
> 16/11/07 17:11:06 INFO PythonRunner: Times: total = 41, boot = -536, init = 
> 576, finish = 1
> 16/11/07 17:11:06 INFO Executor: Finished task 1.0 in stage 2390.0 (TID 
> 7922). 1429 bytes result sent to driver
> 16/11/07 17:11:06 ERROR Utils: Uncaught exception in thread stdout writer for 
> /databricks/python/bin/python
> java.lang.AssertionError: assertion failed: Block rdd_2741_1 is not locked 
> for reading
>   at scala.Predef$.assert(Predef.scala:179)
>

[jira] [Commented] (SPARK-18406) Race between end-of-task and completion iterator read lock release

2017-05-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16022067#comment-16022067
 ] 

Apache Spark commented on SPARK-18406:
--

User 'jiangxb1987' has created a pull request for this issue:
https://github.com/apache/spark/pull/18076

> Race between end-of-task and completion iterator read lock release
> --
>
> Key: SPARK-18406
> URL: https://issues.apache.org/jira/browse/SPARK-18406
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Spark Core
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Josh Rosen
>
> The following log comes from a production streaming job where executors 
> periodically die due to uncaught exceptions during block release:
> {code}
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7921
> 16/11/07 17:11:06 INFO Executor: Running task 0.0 in stage 2390.0 (TID 7921)
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7922
> 16/11/07 17:11:06 INFO Executor: Running task 1.0 in stage 2390.0 (TID 7922)
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7923
> 16/11/07 17:11:06 INFO Executor: Running task 2.0 in stage 2390.0 (TID 7923)
> 16/11/07 17:11:06 INFO TorrentBroadcast: Started reading broadcast variable 
> 2721
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7924
> 16/11/07 17:11:06 INFO Executor: Running task 3.0 in stage 2390.0 (TID 7924)
> 16/11/07 17:11:06 INFO MemoryStore: Block broadcast_2721_piece0 stored as 
> bytes in memory (estimated size 5.0 KB, free 4.9 GB)
> 16/11/07 17:11:06 INFO TorrentBroadcast: Reading broadcast variable 2721 took 
> 3 ms
> 16/11/07 17:11:06 INFO MemoryStore: Block broadcast_2721 stored as values in 
> memory (estimated size 9.4 KB, free 4.9 GB)
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_1 locally
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_3 locally
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_2 locally
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_4 locally
> 16/11/07 17:11:06 INFO PythonRunner: Times: total = 2, boot = -566, init = 
> 567, finish = 1
> 16/11/07 17:11:06 INFO PythonRunner: Times: total = 7, boot = -540, init = 
> 541, finish = 6
> 16/11/07 17:11:06 INFO Executor: Finished task 2.0 in stage 2390.0 (TID 
> 7923). 1429 bytes result sent to driver
> 16/11/07 17:11:06 INFO PythonRunner: Times: total = 8, boot = -532, init = 
> 533, finish = 7
> 16/11/07 17:11:06 INFO Executor: Finished task 3.0 in stage 2390.0 (TID 
> 7924). 1429 bytes result sent to driver
> 16/11/07 17:11:06 ERROR Executor: Exception in task 0.0 in stage 2390.0 (TID 
> 7921)
> java.lang.AssertionError: assertion failed
>   at scala.Predef$.assert(Predef.scala:165)
>   at 
> org.apache.spark.storage.BlockInfo.checkInvariants(BlockInfoManager.scala:84)
>   at 
> org.apache.spark.storage.BlockInfo.readerCount_$eq(BlockInfoManager.scala:66)
>   at 
> org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:362)
>   at 
> org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:361)
>   at scala.Option.foreach(Option.scala:236)
>   at 
> org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:361)
>   at 
> org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:356)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:356)
>   at 
> org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:646)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:281)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7925
> 16/11/07 17:11:06 INFO Executor: Running task 0.1 in stage 2390.0 (TID 7925)
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_1 locally
> 16/11/07 17:11:06 INFO PythonRunner: Times: total = 41, boot = -536, init = 
> 576, finish = 1
> 16/11/07 17:11:06 INFO Executor: Finished task 1.0 in stage 2390.0 (TID 
> 7922). 1429 bytes result sent to driver
> 16/11/07 17:11:06 ERROR Utils: Uncaught exception in thread stdout writer for 
> /databricks/python/bin/python
> java.lang.AssertionError: assertion failed: 

[jira] [Assigned] (SPARK-18406) Race between end-of-task and completion iterator read lock release

2017-05-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18406:


Assignee: Apache Spark

> Race between end-of-task and completion iterator read lock release
> --
>
> Key: SPARK-18406
> URL: https://issues.apache.org/jira/browse/SPARK-18406
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Spark Core
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Josh Rosen
>Assignee: Apache Spark
>
> The following log comes from a production streaming job where executors 
> periodically die due to uncaught exceptions during block release:
> {code}
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7921
> 16/11/07 17:11:06 INFO Executor: Running task 0.0 in stage 2390.0 (TID 7921)
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7922
> 16/11/07 17:11:06 INFO Executor: Running task 1.0 in stage 2390.0 (TID 7922)
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7923
> 16/11/07 17:11:06 INFO Executor: Running task 2.0 in stage 2390.0 (TID 7923)
> 16/11/07 17:11:06 INFO TorrentBroadcast: Started reading broadcast variable 
> 2721
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7924
> 16/11/07 17:11:06 INFO Executor: Running task 3.0 in stage 2390.0 (TID 7924)
> 16/11/07 17:11:06 INFO MemoryStore: Block broadcast_2721_piece0 stored as 
> bytes in memory (estimated size 5.0 KB, free 4.9 GB)
> 16/11/07 17:11:06 INFO TorrentBroadcast: Reading broadcast variable 2721 took 
> 3 ms
> 16/11/07 17:11:06 INFO MemoryStore: Block broadcast_2721 stored as values in 
> memory (estimated size 9.4 KB, free 4.9 GB)
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_1 locally
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_3 locally
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_2 locally
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_4 locally
> 16/11/07 17:11:06 INFO PythonRunner: Times: total = 2, boot = -566, init = 
> 567, finish = 1
> 16/11/07 17:11:06 INFO PythonRunner: Times: total = 7, boot = -540, init = 
> 541, finish = 6
> 16/11/07 17:11:06 INFO Executor: Finished task 2.0 in stage 2390.0 (TID 
> 7923). 1429 bytes result sent to driver
> 16/11/07 17:11:06 INFO PythonRunner: Times: total = 8, boot = -532, init = 
> 533, finish = 7
> 16/11/07 17:11:06 INFO Executor: Finished task 3.0 in stage 2390.0 (TID 
> 7924). 1429 bytes result sent to driver
> 16/11/07 17:11:06 ERROR Executor: Exception in task 0.0 in stage 2390.0 (TID 
> 7921)
> java.lang.AssertionError: assertion failed
>   at scala.Predef$.assert(Predef.scala:165)
>   at 
> org.apache.spark.storage.BlockInfo.checkInvariants(BlockInfoManager.scala:84)
>   at 
> org.apache.spark.storage.BlockInfo.readerCount_$eq(BlockInfoManager.scala:66)
>   at 
> org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:362)
>   at 
> org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:361)
>   at scala.Option.foreach(Option.scala:236)
>   at 
> org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:361)
>   at 
> org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:356)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:356)
>   at 
> org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:646)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:281)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7925
> 16/11/07 17:11:06 INFO Executor: Running task 0.1 in stage 2390.0 (TID 7925)
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_1 locally
> 16/11/07 17:11:06 INFO PythonRunner: Times: total = 41, boot = -536, init = 
> 576, finish = 1
> 16/11/07 17:11:06 INFO Executor: Finished task 1.0 in stage 2390.0 (TID 
> 7922). 1429 bytes result sent to driver
> 16/11/07 17:11:06 ERROR Utils: Uncaught exception in thread stdout writer for 
> /databricks/python/bin/python
> java.lang.AssertionError: assertion failed: Block rdd_2741_1 is not locked 
> for reading
>   at 

[jira] [Comment Edited] (SPARK-20307) SparkR: pass on setHandleInvalid to spark.mllib functions that use StringIndexer

2017-05-23 Thread Miao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16022049#comment-16022049
 ] 

Miao Wang edited comment on SPARK-20307 at 5/23/17 11:16 PM:
-

[~felixcheung] I will wait for Wayne Zhang to reply on whether he is working on 
this one.

Recently, I have moved to a product development role, so I mainly work on open 
source on weekends. Please keep me updated.

I still want to work on Spark. ;)

Thanks!


was (Author: wm624):
[~felixcheung] I will wait for [~wayen.zh...@263.net] to reply on whether he is 
working on this one.

Recently, I have moved to a product development role, so I mainly work on open 
source on weekends. Please keep me updated.

I still want to work on Spark. ;)

Thanks!

> SparkR: pass on setHandleInvalid to spark.mllib functions that use 
> StringIndexer
> 
>
> Key: SPARK-20307
> URL: https://issues.apache.org/jira/browse/SPARK-20307
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Anne Rutten
>Priority: Minor
>
> When training a model in SparkR with string variables (tested with 
> spark.randomForest, but I assume this is valid for all spark.xx functions that 
> apply a StringIndexer under the hood), testing on a new dataset with factor 
> levels that are not in the training set will throw an "Unseen label" error. 
> I think this can be solved if there is a way to pass setHandleInvalid on to 
> the StringIndexers when calling spark.randomForest.
> code snippet:
> {code}
> # (i've run this in Zeppelin which already has SparkR and the context loaded)
> #library(SparkR)
> #sparkR.session(master = "local[*]") 
> data = data.frame(clicked = base::sample(c(0,1),100,replace=TRUE),
>   someString = base::sample(c("this", "that"), 
> 100, replace=TRUE), stringsAsFactors=FALSE)
> trainidxs = base::sample(nrow(data), nrow(data)*0.7)
> traindf = as.DataFrame(data[trainidxs,])
> testdf = as.DataFrame(rbind(data[-trainidxs,],c(0,"the other")))
> rf = spark.randomForest(traindf, clicked~., type="classification", 
> maxDepth=10, 
> maxBins=41,
> numTrees = 100)
> predictions = predict(rf, testdf)
> SparkR::collect(predictions)
> {code}
> stack trace:
> {quote}
> Error in handleErrors(returnStatus, conn): org.apache.spark.SparkException: 
> Job aborted due to stage failure: Task 0 in stage 607.0 failed 1 times, most 
> recent failure: Lost task 0.0 in stage 607.0 (TID 1581, localhost, executor 
> driver): org.apache.spark.SparkException: Failed to execute user defined 
> function($anonfun$4: (string) => double)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:99)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.SparkException: Unseen label: the other.
> at 
> org.apache.spark.ml.feature.StringIndexerModel$$anonfun$4.apply(StringIndexer.scala:170)
> at 
> org.apache.spark.ml.feature.StringIndexerModel$$anonfun$4.apply(StringIndexer.scala:166)
> ... 16 more
> Driver stacktrace:
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
> at 
> 
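
For reference, a minimal sketch in the Python ML API of the knob the SparkR wrappers would need to expose. The toy data mirrors the snippet above; handleInvalid="skip" drops rows with unseen labels (newer releases also accept "keep"), and everything beyond the StringIndexer parameter itself is an illustrative assumption.

{code:python}
# Hedged sketch (pyspark.ml): handleInvalid on StringIndexer is what the SparkR
# wrappers would need to pass through. Data and column names are illustrative.
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer

spark = SparkSession.builder.appName("handle-invalid-sketch").getOrCreate()

train = spark.createDataFrame([(0, "this"), (1, "that")], ["clicked", "someString"])
test = spark.createDataFrame([(0, "the other")], ["clicked", "someString"])

# "skip" drops rows with labels unseen during fitting instead of failing.
indexer = StringIndexer(inputCol="someString", outputCol="someStringIdx",
                        handleInvalid="skip")
model = indexer.fit(train)
model.transform(test).show()  # no "Unseen label" error; the unseen row is skipped
{code}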

[jira] [Commented] (SPARK-20307) SparkR: pass on setHandleInvalid to spark.mllib functions that use StringIndexer

2017-05-23 Thread Miao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16022049#comment-16022049
 ] 

Miao Wang commented on SPARK-20307:
---

[~felixcheung] I will wait for [~wayen.zh...@263.net] to reply on whether he is 
working on this one.

Recently, I have moved to a product development role, so I mainly work on open 
source on weekends. Please keep me updated.

I still want to work on Spark. ;)

Thanks!

> SparkR: pass on setHandleInvalid to spark.mllib functions that use 
> StringIndexer
> 
>
> Key: SPARK-20307
> URL: https://issues.apache.org/jira/browse/SPARK-20307
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Anne Rutten
>Priority: Minor
>
> When training a model in SparkR with string variables (tested with 
> spark.randomForest, but I assume this is valid for all spark.xx functions that 
> apply a StringIndexer under the hood), testing on a new dataset with factor 
> levels that are not in the training set will throw an "Unseen label" error. 
> I think this can be solved if there is a way to pass setHandleInvalid on to 
> the StringIndexers when calling spark.randomForest.
> code snippet:
> {code}
> # (i've run this in Zeppelin which already has SparkR and the context loaded)
> #library(SparkR)
> #sparkR.session(master = "local[*]") 
> data = data.frame(clicked = base::sample(c(0,1),100,replace=TRUE),
>   someString = base::sample(c("this", "that"), 
> 100, replace=TRUE), stringsAsFactors=FALSE)
> trainidxs = base::sample(nrow(data), nrow(data)*0.7)
> traindf = as.DataFrame(data[trainidxs,])
> testdf = as.DataFrame(rbind(data[-trainidxs,],c(0,"the other")))
> rf = spark.randomForest(traindf, clicked~., type="classification", 
> maxDepth=10, 
> maxBins=41,
> numTrees = 100)
> predictions = predict(rf, testdf)
> SparkR::collect(predictions)
> {code}
> stack trace:
> {quote}
> Error in handleErrors(returnStatus, conn): org.apache.spark.SparkException: 
> Job aborted due to stage failure: Task 0 in stage 607.0 failed 1 times, most 
> recent failure: Lost task 0.0 in stage 607.0 (TID 1581, localhost, executor 
> driver): org.apache.spark.SparkException: Failed to execute user defined 
> function($anonfun$4: (string) => double)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:99)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.SparkException: Unseen label: the other.
> at 
> org.apache.spark.ml.feature.StringIndexerModel$$anonfun$4.apply(StringIndexer.scala:170)
> at 
> org.apache.spark.ml.feature.StringIndexerModel$$anonfun$4.apply(StringIndexer.scala:166)
> ... 16 more
> Driver stacktrace:
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
> at 
> 

[jira] [Comment Edited] (SPARK-15703) Make ListenerBus event queue size configurable

2017-05-23 Thread Ruslan Dautkhanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16013208#comment-16013208
 ] 

Ruslan Dautkhanov edited comment on SPARK-15703 at 5/23/17 10:53 PM:
-

We keep running into this issue too; it would be great to document 
spark.scheduler.listenerbus.eventqueue.size.

SPARK-20858


was (Author: tagar):
We keep running into this issue too; it would be great to document 
spark.scheduler.listenerbus.eventqueue.size.

> Make ListenerBus event queue size configurable
> --
>
> Key: SPARK-15703
> URL: https://issues.apache.org/jira/browse/SPARK-15703
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Web UI
>Affects Versions: 2.0.0
>Reporter: Thomas Graves
>Assignee: Dhruve Ashar
>Priority: Minor
> Fix For: 2.0.1, 2.1.0
>
> Attachments: Screen Shot 2016-06-01 at 11.21.32 AM.png, Screen Shot 
> 2016-06-01 at 11.23.48 AM.png, spark-dynamic-executor-allocation.png, 
> SparkListenerBus .png
>
>
> The Spark UI doesn't seem to be showing all the tasks and metrics.
> I ran a job with 10 tasks but the stage detail page says it completed 93029:
> Summary Metrics for 93029 Completed Tasks
> The Stages for All Jobs page lists that only 89519/10 tasks finished, even 
> though it is completed. The metrics for shuffle write and input are also incorrect.
> I will attach screen shots.
> I checked the logs and it does show that all the tasks actually finished.
> 16/06/01 16:15:42 INFO TaskSetManager: Finished task 59880.0 in stage 2.0 
> (TID 54038) in 265309 ms on 10.213.45.51 (10/10)
> 16/06/01 16:15:42 INFO YarnClusterScheduler: Removed TaskSet 2.0, whose tasks 
> have all completed, from pool



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20859) SQL Loader does not recognize multidimensional columns in postgresql (like integer[][])

2017-05-23 Thread Pablo Alcaraz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16022011#comment-16022011
 ] 

Pablo Alcaraz commented on SPARK-20859:
---

The patch from SPARK-14536 only works for one-dimensional arrays.
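
Until nested arrays are supported, here is a hedged workaround sketch, assuming PostgreSQL and the arrays_test table from the description quoted below: cast the multidimensional column to text inside the JDBC subquery so Spark only ever sees a string column. Connection details mirror the repro and are illustrative.

{code:python}
# Hedged workaround sketch: push a ::text cast into the JDBC subquery so the
# integer[][] type never reaches Spark; parse the resulting string later if needed.
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = (SparkConf()
        .setMaster("local[*]")
        .setAppName("multi-array-workaround")
        .set("spark.jars.packages", "org.postgresql:postgresql:9.4.1212"))
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

url = "jdbc:postgresql://localhost:5432/test"  # change as appropriate
query = "(SELECT eid, simple, multi::text AS multi_text FROM arrays_test) AS t"

df = (sqlContext.read.format("jdbc")
      .option("url", url)
      .option("driver", "org.postgresql.Driver")
      .option("user", "user")
      .option("password", "password")
      .option("dbtable", query)
      .load())
df.show()  # multi_text arrives as a string such as '{{1,1},{2,2}}'
{code}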

> SQL Loader does not recognize multidimensional columns in postgresql (like 
> integer[][])
> 
>
> Key: SPARK-20859
> URL: https://issues.apache.org/jira/browse/SPARK-20859
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Pablo Alcaraz
>Priority: Critical
>
> The fix in SPARK-14536 does not accept columns like integer[][] 
> (multidimensional arrays).
> To reproduce this error:
> 1) Create a SQL table in postgresql
> {code:sql}
> CREATE TABLE arrays_test
> (
>   eid integer NOT NULL,
>   simple integer[],
>   multi integer[][]
> );
> {code}
> 2) Insert a row like this one:
> {code:sql}
> insert into arrays_test (eid, simple, multi)
> values
> (1, '{1, 1}', NULL);
> {code}
> 3) Execute a Spark SQL query like this one and observe how it works:
> {code:python}
> from pyspark import SparkConf
> from pyspark import SparkContext
> from pyspark.sql import SQLContext
> master = "spark://spark211:7077"  # local is OK too
> conf = (
> SparkConf()
> .setMaster(master)
> .setAppName("Connection Test 5")
> .set("spark.jars.packages", "org.postgresql:postgresql:9.4.1212")   
> ## This one works ok
> .set("spark.driver.memory", "2G")
> .set("spark.executor.memory", "2G")
> .set("spark.driver.cores", "10")
> )
> sc = SparkContext(conf=conf)
> # sc.setLogLevel("ALL")
> print ">", 1
> print(sc)
> sqlContext = SQLContext(sc)
> print ">", 2
> print sqlContext
> url = "postgresql://localhost:5432/test"   # change properly
> url = 'jdbc:'+url
> properties = {'user': 'user', 'password': 'password'}   # change user and 
> password if needed
> df = sqlContext.read.format("jdbc"). \
> option("url", url). \
> option("driver", "org.postgresql.Driver"). \
> option("useUnicode", "true"). \
> option("continueBatchOnError","true"). \
> option("useSSL", "false"). \
> option("user", "user"). \
> option("password", "password"). \
> option("dbtable", "arrays_test"). \
> option("partitionColumn", "eid"). \
> option("lowerBound", "115"). \
> option("upperBound", "6026289"). \
> option("numPartitions", "100"). \
> load()
> print ">", 3
> df.registerTempTable("arrays_test")
> df = sqlContext.sql("SELECT * FROM arrays_test limit 5")
> print ">", 4
> print df.collect()
> {code}
> 4) Observe how it works.
> 5) Now, to reproduce the error, insert a multi dimensional array into the SQL 
> table:
> {code:sql}
> insert into arrays_test (eid, simple, multi)
> values
> (2, '{1, 1}', '{{1, 1},{2, 2}}');
> {code}
> 6) Execute step 3) again.
> 7) Observe the exception
> {code}
> 17/05/23 15:23:38 ERROR TaskSetManager: Task 0 in stage 0.0 failed 4 times; 
> aborting job
> Traceback (most recent call last):
>   File 
> "/home/pablo/develop/physiosigns/livebetter/modelling2/modelling2/scripts/runSparkTest2.py",
>  line 65, in 
> print df.collect()
>   File 
> "/home/pablo/myProgs/virt-pablo/local/lib/python2.7/site-packages/pyspark/sql/dataframe.py",
>  line 391, in collect
> port = self._jdf.collectToPython()
>   File 
> "/home/pablo/myProgs/virt-pablo/local/lib/python2.7/site-packages/py4j/java_gateway.py",
>  line 1133, in __call__
> answer, self.gateway_client, self.target_id, self.name)
>   File 
> "/home/pablo/myProgs/virt-pablo/local/lib/python2.7/site-packages/pyspark/sql/utils.py",
>  line 63, in deco
> return f(*a, **kw)
>   File 
> "/home/pablo/myProgs/virt-pablo/local/lib/python2.7/site-packages/py4j/protocol.py",
>  line 319, in get_return_value
> format(target_id, ".", name), value)
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> o49.collectToPython.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 
> (TID 3, 172.17.0.58, executor 0): java.lang.ClassCastException: 
> [Ljava.lang.Integer; cannot be cast to java.lang.Integer
>   at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:101)
>   at 
> org.apache.spark.sql.catalyst.util.GenericArrayData.getInt(GenericArrayData.scala:62)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
>   at 
> 

[jira] [Created] (SPARK-20859) SQL Loader does not recognize multidimensional columns in postgresql (like integer[][])

2017-05-23 Thread Pablo Alcaraz (JIRA)
Pablo Alcaraz created SPARK-20859:
-

 Summary: SQL Loader does not recognize multidimensional columns in 
postgresql (like integer[][])
 Key: SPARK-20859
 URL: https://issues.apache.org/jira/browse/SPARK-20859
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.1
Reporter: Pablo Alcaraz
Priority: Critical


The fix in SPARK-14536 does not accept columns like integer[][] 
(multidimensional arrays).

To reproduce this error:

1) Create a SQL table in postgresql
{code:sql}
CREATE TABLE arrays_test
(
  eid integer NOT NULL,
  simple integer[],
  multi integer[][]
);
{code}

2) Insert a row like this one:
{code:sql}
insert into arrays_test (eid, simple, multi)
values
(1, '{1, 1}', NULL);
{code}

3) Execute a Spark SQL query like this one and observe how it works:
{code:python}
from pyspark import SparkConf
from pyspark import SparkContext
from pyspark.sql import SQLContext

master = "spark://spark211:7077"  # local is OK too
conf = (
SparkConf()
.setMaster(master)
.setAppName("Connection Test 5")
.set("spark.jars.packages", "org.postgresql:postgresql:9.4.1212")   ## 
This one works ok
.set("spark.driver.memory", "2G")
.set("spark.executor.memory", "2G")
.set("spark.driver.cores", "10")
)

sc = SparkContext(conf=conf)
# sc.setLogLevel("ALL")

print ">", 1
print(sc)

sqlContext = SQLContext(sc)

print ">", 2
print sqlContext

url = "postgresql://localhost:5432/test"   # change properly
url = 'jdbc:'+url
properties = {'user': 'user', 'password': 'password'}   # change user and 
password if needed

df = sqlContext.read.format("jdbc"). \
option("url", url). \
option("driver", "org.postgresql.Driver"). \
option("useUnicode", "true"). \
option("continueBatchOnError","true"). \
option("useSSL", "false"). \
option("user", "user"). \
option("password", "password"). \
option("dbtable", "arrays_test"). \
option("partitionColumn", "eid"). \
option("lowerBound", "115"). \
option("upperBound", "6026289"). \
option("numPartitions", "100"). \
load()

print ">", 3

df.registerTempTable("arrays_test")
df = sqlContext.sql("SELECT * FROM arrays_test limit 5")


print ">", 4
print df.collect()

{code}

4) Observe how it works.

5) Now, to reproduce the error, insert a multi dimensional array into the SQL 
table:
{code:sql}
insert into arrays_test (eid, simple, multi)
values
(2, '{1, 1}', '{{1, 1},{2, 2}}');
{code}

6) Execute step 3) again.

7) Observe the exception
{code}

17/05/23 15:23:38 ERROR TaskSetManager: Task 0 in stage 0.0 failed 4 times; 
aborting job
Traceback (most recent call last):
  File 
"/home/pablo/develop/physiosigns/livebetter/modelling2/modelling2/scripts/runSparkTest2.py",
 line 65, in 
print df.collect()
  File 
"/home/pablo/myProgs/virt-pablo/local/lib/python2.7/site-packages/pyspark/sql/dataframe.py",
 line 391, in collect
port = self._jdf.collectToPython()
  File 
"/home/pablo/myProgs/virt-pablo/local/lib/python2.7/site-packages/py4j/java_gateway.py",
 line 1133, in __call__
answer, self.gateway_client, self.target_id, self.name)
  File 
"/home/pablo/myProgs/virt-pablo/local/lib/python2.7/site-packages/pyspark/sql/utils.py",
 line 63, in deco
return f(*a, **kw)
  File 
"/home/pablo/myProgs/virt-pablo/local/lib/python2.7/site-packages/py4j/protocol.py",
 line 319, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling 
o49.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 
3, 172.17.0.58, executor 0): java.lang.ClassCastException: [Ljava.lang.Integer; 
cannot be cast to java.lang.Integer
at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:101)
at 
org.apache.spark.sql.catalyst.util.GenericArrayData.getInt(GenericArrayData.scala:62)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at 

[jira] [Commented] (SPARK-20847) Error reading NULL int[] element from postgres -- null pointer exception.

2017-05-23 Thread Pablo Alcaraz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16022004#comment-16022004
 ] 

Pablo Alcaraz commented on SPARK-20847:
---

This is fixed by SPARK-14536 in Spark 2.1.1.

However, the patch does not fix multidimensional array columns.

> Error reading NULL int[] element from postgres -- null pointer exception.
> -
>
> Key: SPARK-20847
> URL: https://issues.apache.org/jira/browse/SPARK-20847
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Stuart Reynolds
>
> -- maybe fixed already? 
> https://github.com/apache/spark/commit/f174cdc7478d0b81f9cfa896284a5ec4c6bb952d
> {code:python}
> def query_int_array():
> import pandas as pd
> from pyspark.sql import SQLContext
> user,password = ... , 
> hostname = 
> dbName = ...
> url = "jdbc:postgresql://{hostname}:5432/{dbName}".format(**locals())
> properties = {'user': user, 'password': password}
> sql_create = """DROP TABLE IF EXISTS public._df10;
> CREATE TABLE IF NOT EXISTS public._df10 (
> id  integer,
> f_21 integer[]
> );
> INSERT INTO public._df10(id, f_21) VALUES
> (1, ARRAY[1,2])   --OK
>,(2, ARRAY[3,NULL])  --OK
>,(3, NULL)  --FAIL   *< PROBLEM
> ;"""
> engine = 
> sqlalchemy.create_engine('postgresql+psycopg2://{user}:{password}@{hostname}:5432/{dbName}'.format(**locals()))
> with engine.connect().execution_options(autocommit=True) as con:
> con.execute(sql_create)
> # Export postgres _df10 to spark as table df10
> sc = get_spark_context(master="local")
> sqlContext = SQLContext(sc)
> df10 = sqlContext.read.format("jdbc"). \
> option("url", url). \
> option("driver", "org.postgresql.Driver"). \
> option("useUnicode", "true"). \
> option("continueBatchOnError","true"). \
> option("useSSL", "false"). \
> option("user", user). \
> option("password", password). \
> option("dbtable", "_df10"). \
> load()
> df10.registerTempTable("df10")
> print "DF inferred from postgres:"
> df10.printSchema()
> df10.show()
> print "DF queried from postgres:"
> df10 = sqlContext.sql("select * from df10")
> df10.printSchema()
> df10.show()
> print df10.collect()
> {code}
> Explodes with:
> {noformat}
> DF inferred from postgres:
> root
>  |-- id: integer (nullable = true)
>  |-- f_21: array (nullable = true)
>  ||-- element: integer (containsNull = true)
> 17/05/22 15:46:30 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13.apply(JdbcUtils.scala:427)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13.apply(JdbcUtils.scala:425)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:286)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:268)
>   at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at 

[jira] [Commented] (SPARK-14536) NPE in JDBCRDD when array column contains nulls (postgresql)

2017-05-23 Thread Pablo Alcaraz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16022002#comment-16022002
 ] 

Pablo Alcaraz commented on SPARK-14536:
---

The fix does not accept columns like integer[][] (multidimensional arrays).

To reproduce this error:

1) Create a SQL table in postgresql
{code:sql}
CREATE TABLE arrays_test
(
  eid integer NOT NULL,
  simple integer[],
  multi integer[][]
);
{code}

2) Insert a row like this one:
{code:sql}
insert into arrays_test (eid, simple, multi)
values
(1, '{1, 1}', NULL);
{code}

3) Execute a Spark SQL query like this one and observe how it works:
{code:python}
from pyspark import SparkConf
from pyspark import SparkContext
from pyspark.sql import SQLContext

master = "spark://spark211:7077"  # local is OK too
conf = (
SparkConf()
.setMaster(master)
.setAppName("Connection Test 5")
.set("spark.jars.packages", "org.postgresql:postgresql:9.4.1212")   ## 
This one works ok
.set("spark.driver.memory", "2G")
.set("spark.executor.memory", "2G")
.set("spark.driver.cores", "10")
)

sc = SparkContext(conf=conf)
# sc.setLogLevel("ALL")

print ">", 1
print(sc)

sqlContext = SQLContext(sc)

print ">", 2
print sqlContext

url = "postgresql://localhost:5432/test"   # change properly
url = 'jdbc:'+url
properties = {'user': 'user', 'password': 'password'}   # change user and 
password if needed

df = sqlContext.read.format("jdbc"). \
option("url", url). \
option("driver", "org.postgresql.Driver"). \
option("useUnicode", "true"). \
option("continueBatchOnError","true"). \
option("useSSL", "false"). \
option("user", "user"). \
option("password", "password"). \
option("dbtable", "arrays_test"). \
option("partitionColumn", "eid"). \
option("lowerBound", "115"). \
option("upperBound", "6026289"). \
option("numPartitions", "100"). \
load()

print ">", 3

df.registerTempTable("arrays_test")
df = sqlContext.sql("SELECT * FROM arrays_test limit 5")


print ">", 4
print df.collect()

{code}

4) Observe how it works.

5) Now, to reproduce the error, insert a multi dimensional array into the SQL 
table:
{code:sql}
insert into arrays_test (eid, simple, multi)
values
(2, '{1, 1}', '{{1, 1},{2, 2}}');
{code}

6) Execute step 3) again.

7) Observe the exception
{code}

17/05/23 15:23:38 ERROR TaskSetManager: Task 0 in stage 0.0 failed 4 times; 
aborting job
Traceback (most recent call last):
  File 
"/home/pablo/develop/physiosigns/livebetter/modelling2/modelling2/scripts/runSparkTest2.py",
 line 65, in 
print df.collect()
  File 
"/home/pablo/myProgs/virt-pablo/local/lib/python2.7/site-packages/pyspark/sql/dataframe.py",
 line 391, in collect
port = self._jdf.collectToPython()
  File 
"/home/pablo/myProgs/virt-pablo/local/lib/python2.7/site-packages/py4j/java_gateway.py",
 line 1133, in __call__
answer, self.gateway_client, self.target_id, self.name)
  File 
"/home/pablo/myProgs/virt-pablo/local/lib/python2.7/site-packages/pyspark/sql/utils.py",
 line 63, in deco
return f(*a, **kw)
  File 
"/home/pablo/myProgs/virt-pablo/local/lib/python2.7/site-packages/py4j/protocol.py",
 line 319, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling 
o49.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 
3, 172.17.0.58, executor 0): java.lang.ClassCastException: [Ljava.lang.Integer; 
cannot be cast to java.lang.Integer
at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:101)
at 
org.apache.spark.sql.catalyst.util.GenericArrayData.getInt(GenericArrayData.scala:62)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at 

[jira] [Created] (SPARK-20858) Document ListenerBus event queue size property

2017-05-23 Thread Bjorn Jonsson (JIRA)
Bjorn Jonsson created SPARK-20858:
-

 Summary: Document ListenerBus event queue size property
 Key: SPARK-20858
 URL: https://issues.apache.org/jira/browse/SPARK-20858
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 2.1.1
Reporter: Bjorn Jonsson
Priority: Minor


SPARK-15703 made the ListenerBus event queue size configurable via 
spark.scheduler.listenerbus.eventqueue.size. This should be documented.
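
For anyone needing to raise the limit in the meantime, a minimal sketch of setting the property; the value 20000 is only an illustrative assumption.

{code:python}
# Hedged sketch: raise the listener bus event queue size via SparkConf.
# The same property can also be passed to spark-submit with --conf.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("listenerbus-queue-size-example")
        .set("spark.scheduler.listenerbus.eventqueue.size", "20000"))
sc = SparkContext(conf=conf)
{code}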



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18016) Code Generation: Constant Pool Past Limit for Wide/Nested Dataset

2017-05-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16021814#comment-16021814
 ] 

Apache Spark commented on SPARK-18016:
--

User 'bdrillard' has created a pull request for this issue:
https://github.com/apache/spark/pull/18075

> Code Generation: Constant Pool Past Limit for Wide/Nested Dataset
> -
>
> Key: SPARK-18016
> URL: https://issues.apache.org/jira/browse/SPARK-18016
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Aleksander Eskilson
>
> When attempting to encode collections of large Java objects to Datasets 
> having very wide or deeply nested schemas, code generation can fail, yielding:
> {code}
> Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for 
> class 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection
>  has grown past JVM limit of 0x
>   at 
> org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantNameAndTypeInfo(ClassFile.java:439)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantMethodrefInfo(ClassFile.java:358)
>   at 
> org.codehaus.janino.UnitCompiler.writeConstantMethodrefInfo(UnitCompiler.java:4)
>   at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4547)
>   at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762)
>   at 
> org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180)
>   at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112)
>   at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370)
>   at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370)
>   at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894)
>   at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420)
>   at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.Java$AbstractPackageMemberClassDeclaration.accept(Java.java:1309)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
>   at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:345)
>   at 
> 

[jira] [Commented] (SPARK-4899) Support Mesos features: roles and checkpoints

2017-05-23 Thread Michael Gummelt (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16021766#comment-16021766
 ] 

Michael Gummelt commented on SPARK-4899:


[~drcrallen] Can you link me to the conversation you had with Tim?  I can't 
find it on the mailing list.

> Support Mesos features: roles and checkpoints
> -
>
> Key: SPARK-4899
> URL: https://issues.apache.org/jira/browse/SPARK-4899
> Project: Spark
>  Issue Type: New Feature
>  Components: Mesos
>Affects Versions: 1.2.0
>Reporter: Andrew Ash
>
> Inspired by https://github.com/apache/spark/pull/60
> Mesos has two features that would be nice for Spark to take advantage of:
> 1. Roles -- a way to specify ACLs and priorities for users
> 2. Checkpoints -- a way to restart a failed Mesos slave without losing all 
> the work that was happening on the box
> Some of these may require a Mesos upgrade past our current 0.18.1



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4899) Support Mesos features: roles and checkpoints

2017-05-23 Thread Michael Gummelt (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16021750#comment-16021750
 ] 

Michael Gummelt commented on SPARK-4899:


These are two separate features, which need two separate JIRAs. Roles are 
already supported, though, so this should either be renamed or closed in favor 
of a JIRA just for checkpointing.
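For reference, a minimal sketch of using the existing role support (assuming the spark.mesos.role configuration key; the role name and master URL below are made-up placeholders):

{code}
import org.apache.spark.SparkConf

// Hedged sketch: register the Spark framework under a Mesos role.
// "analytics" and the ZooKeeper master URL are illustrative placeholders only.
val conf = new SparkConf()
  .setMaster("mesos://zk://zk1:2181/mesos")
  .set("spark.mesos.role", "analytics")
{code}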


> Support Mesos features: roles and checkpoints
> -
>
> Key: SPARK-4899
> URL: https://issues.apache.org/jira/browse/SPARK-4899
> Project: Spark
>  Issue Type: New Feature
>  Components: Mesos
>Affects Versions: 1.2.0
>Reporter: Andrew Ash
>
> Inspired by https://github.com/apache/spark/pull/60
> Mesos has two features that would be nice for Spark to take advantage of:
> 1. Roles -- a way to specify ACLs and priorities for users
> 2. Checkpoints -- a way to restart a failed Mesos slave without losing all 
> the work that was happening on the box
> Some of these may require a Mesos upgrade past our current 0.18.1



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15648) add TeradataDialect

2017-05-23 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-15648.
-
   Resolution: Fixed
 Assignee: Kirby Linvill
Fix Version/s: 2.3.0

> add TeradataDialect
> ---
>
> Key: SPARK-15648
> URL: https://issues.apache.org/jira/browse/SPARK-15648
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.1
> Environment: Teradata database
>Reporter: lihongli
>Assignee: Kirby Linvill
>Priority: Minor
> Fix For: 2.3.0
>
>
> I found that Teradata does not have a dialect in Spark, so I want to add 
> TeradataDialect.scala in package org.apache.spark.sql.jdbc to support the 
> Teradata database better.
> I override three functions in TeradataDialect.scala:
> 1. The URL: jdbc:teradata
> 2. The JDBCType: Teradata does not support "TEXT", so we replace it 
> with "VARCHAR(255)". We also replace "BLOB" with "VARBYTE(4)".
> 3. Teradata does not support SQL like "LIMIT 1", so we replace it with 
> "TOP 1".



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-20853) spark.ui.reverseProxy=true leads to hanging communication to master

2017-05-23 Thread Alex Bozarth (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Bozarth closed SPARK-20853.

Resolution: Not A Problem

> spark.ui.reverseProxy=true leads to hanging communication to master
> ---
>
> Key: SPARK-20853
> URL: https://issues.apache.org/jira/browse/SPARK-20853
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.0
> Environment: ppc64le GNU/Linux, POWER8; only the master node is reachable 
> externally, other nodes are in an internal network
>Reporter: Benno Staebler
>  Labels: network, web-ui
>
> When *reverse proxy is enabled*
> {quote}
> spark.ui.reverseProxy=true
> spark.ui.reverseProxyUrl=/
> {quote}
>  first of all, any invocation of the Spark master Web UI hangs forever, both locally 
> (e.g. http://192.168.10.16:25001) and via the external URL, without any data 
> received. 
> One, sometimes two Spark applications succeed without error, and then workers 
> start throwing exceptions:
> {quote}
> Caused by: java.io.IOException: Failed to connect to /192.168.10.16:25050
> {quote}
> The application dies during creation of SparkContext:
> {quote}
> 2017-05-22 16:11:23 INFO  StandaloneAppClient$ClientEndpoint:54 - Connecting 
> to master spark://node0101:25000...
> 2017-05-22 16:11:23 INFO  TransportClientFactory:254 - Successfully created 
> connection to node0101/192.168.10.16:25000 after 169 ms (132 ms spent in 
> bootstraps)
> 2017-05-22 16:11:43 INFO  StandaloneAppClient$ClientEndpoint:54 - Connecting 
> to master spark://node0101:25000...
> 2017-05-22 16:12:03 INFO  StandaloneAppClient$ClientEndpoint:54 - Connecting 
> to master spark://node0101:25000...
> 2017-05-22 16:12:23 ERROR StandaloneSchedulerBackend:70 - Application has 
> been killed. Reason: All masters are unresponsive! Giving up.
> 2017-05-22 16:12:23 WARN  StandaloneSchedulerBackend:66 - Application ID is 
> not initialized yet.
> 2017-05-22 16:12:23 INFO  Utils:54 - Successfully started service 
> 'org.apache.spark.network.netty.NettyBlockTransferService' on port 25056.
> .
> Caused by: java.lang.IllegalArgumentException: requirement failed: Can only 
> call getServletHandlers on a running MetricsSystem
> {quote}
> *This definitively does not happen without reverse proxy enabled!*



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20853) spark.ui.reverseProxy=true leads to hanging communication to master

2017-05-23 Thread Alex Bozarth (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16021636#comment-16021636
 ] 

Alex Bozarth commented on SPARK-20853:
--

Closing this as it is a question for the user email list. For reference, though, 
spark.ui.reverseProxyUrl should be set to a full URL including http(s)://, not 
a relative one like /
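For illustration only, with a made-up proxy host (a sketch, however the configuration is actually supplied to the master):

{code}
import org.apache.spark.SparkConf

// Hedged sketch: reverseProxyUrl set to a full URL rather than a relative path.
// proxy.example.com/spark is a placeholder, not a real endpoint.
val conf = new SparkConf()
  .set("spark.ui.reverseProxy", "true")
  .set("spark.ui.reverseProxyUrl", "http://proxy.example.com/spark")
{code}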

> spark.ui.reverseProxy=true leads to hanging communication to master
> ---
>
> Key: SPARK-20853
> URL: https://issues.apache.org/jira/browse/SPARK-20853
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.0
> Environment: ppc64le GNU/Linux, POWER8; only the master node is reachable 
> externally, other nodes are in an internal network
>Reporter: Benno Staebler
>  Labels: network, web-ui
>
> When *reverse proxy is enabled*
> {quote}
> spark.ui.reverseProxy=true
> spark.ui.reverseProxyUrl=/
> {quote}
>  first of all, any invocation of the Spark master Web UI hangs forever, both locally 
> (e.g. http://192.168.10.16:25001) and via the external URL, without any data 
> received. 
> One, sometimes two Spark applications succeed without error, and then workers 
> start throwing exceptions:
> {quote}
> Caused by: java.io.IOException: Failed to connect to /192.168.10.16:25050
> {quote}
> The application dies during creation of SparkContext:
> {quote}
> 2017-05-22 16:11:23 INFO  StandaloneAppClient$ClientEndpoint:54 - Connecting 
> to master spark://node0101:25000...
> 2017-05-22 16:11:23 INFO  TransportClientFactory:254 - Successfully created 
> connection to node0101/192.168.10.16:25000 after 169 ms (132 ms spent in 
> bootstraps)
> 2017-05-22 16:11:43 INFO  StandaloneAppClient$ClientEndpoint:54 - Connecting 
> to master spark://node0101:25000...
> 2017-05-22 16:12:03 INFO  StandaloneAppClient$ClientEndpoint:54 - Connecting 
> to master spark://node0101:25000...
> 2017-05-22 16:12:23 ERROR StandaloneSchedulerBackend:70 - Application has 
> been killed. Reason: All masters are unresponsive! Giving up.
> 2017-05-22 16:12:23 WARN  StandaloneSchedulerBackend:66 - Application ID is 
> not initialized yet.
> 2017-05-22 16:12:23 INFO  Utils:54 - Successfully started service 
> 'org.apache.spark.network.netty.NettyBlockTransferService' on port 25056.
> .
> Caused by: java.lang.IllegalArgumentException: requirement failed: Can only 
> call getServletHandlers on a running MetricsSystem
> {quote}
> *This definitively does not happen without reverse proxy enabled!*



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20799) Unable to infer schema for ORC on reading ORC from S3

2017-05-23 Thread Jork Zijlstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16021450#comment-16021450
 ] 

Jork Zijlstra commented on SPARK-20799:
---

[~dongjoon]
I don't know, since we don't use Parquet files. But I can of course generate 
one from the ORC. I will try this tomorrow and let you know.

> Unable to infer schema for ORC on reading ORC from S3
> -
>
> Key: SPARK-20799
> URL: https://issues.apache.org/jira/browse/SPARK-20799
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Jork Zijlstra
>
> We are getting the following exception: 
> {code}org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC. 
> It must be specified manually.{code}
> Combining following factors will cause it:
> - Use S3
> - Use format ORC
> - Don't apply partitioning on the data
> - Embed AWS credentials in the path
> The problem is in the PartitioningAwareFileIndex def allFiles()
> {code}
> leafDirToChildrenFiles.get(qualifiedPath)
>   .orElse { leafFiles.get(qualifiedPath).map(Array(_)) }
>   .getOrElse(Array.empty)
> {code}
> leafDirToChildrenFiles uses the path WITHOUT credentials as its key while the 
> qualifiedPath contains the path WITH credentials.
> So leafDirToChildrenFiles.get(qualifiedPath) doesn't find any files, so no 
> data is read and the schema cannot be defined.
> Spark does output the warning "S3xLoginHelper:90 - The Filesystem URI contains login 
> details. This is insecure and may be unsupported in future.", but this should 
> not mean that it stops working altogether.
> Workaround:
> Move the AWS credentials from the path to the SparkSession
> {code}
> SparkSession.builder
>   .config("spark.hadoop.fs.s3n.awsAccessKeyId", {awsAccessKeyId})
>   .config("spark.hadoop.fs.s3n.awsSecretAccessKey", {awsSecretAccessKey})
> {code}
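Spelled out end to end, the workaround amounts to something like the sketch below (bucket, path, and the environment-variable lookups are placeholders, not part of the original report):

{code}
import org.apache.spark.sql.SparkSession

// Hedged sketch of the workaround: credentials go into the Hadoop configuration,
// and the S3 path itself no longer embeds them.
val awsAccessKeyId = sys.env("AWS_ACCESS_KEY_ID")         // placeholder credential source
val awsSecretAccessKey = sys.env("AWS_SECRET_ACCESS_KEY")

val spark = SparkSession.builder()
  .appName("orc-from-s3")
  .config("spark.hadoop.fs.s3n.awsAccessKeyId", awsAccessKeyId)
  .config("spark.hadoop.fs.s3n.awsSecretAccessKey", awsSecretAccessKey)
  .getOrCreate()

val df = spark.read.orc("s3n://bucket/path/to/orc")       // no credentials in the path
{code}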



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20857) Generic resolved hint node

2017-05-23 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-20857.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

> Generic resolved hint node
> --
>
> Key: SPARK-20857
> URL: https://issues.apache.org/jira/browse/SPARK-20857
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.2.0
>
>
> This patch renames BroadcastHint to ResolvedHint so it is more generic and 
> would allow us to introduce other hint types in the future without 
> introducing new hint nodes.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4921) TaskSetManager mistakenly returns PROCESS_LOCAL for NO_PREF tasks

2017-05-23 Thread Nan Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16021417#comment-16021417
 ] 

Nan Zhu commented on SPARK-4921:


I forgot most of the details... but the final conclusion was that "it's a typo 
resulting in no performance issue, but for some reason we cannot fully prove that 
changing this will not hurt anything... so won't fix".

> TaskSetManager mistakenly returns PROCESS_LOCAL for NO_PREF tasks
> -
>
> Key: SPARK-4921
> URL: https://issues.apache.org/jira/browse/SPARK-4921
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Xuefu Zhang
> Attachments: NO_PREF.patch
>
>
> During research for HIVE-9153, we found that TaskSetManager returns 
> PROCESS_LOCAL for NO_PREF tasks, which may cause performance degradation. 
> Changing the return value to NO_PREF, as demonstrated in the attached patch, 
> seemingly improves the performance.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20841) Support table column aliases in FROM clause

2017-05-23 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16021413#comment-16021413
 ] 

Takeshi Yamamuro commented on SPARK-20841:
--

I'll work on this. I feel we could implement this in a similar way to 
https://github.com/apache/spark/pull/17928

> Support table column aliases in FROM clause
> ---
>
> Key: SPARK-20841
> URL: https://issues.apache.org/jira/browse/SPARK-20841
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>Priority: Minor
>
> Some SQL dialects support a relatively obscure "table column aliases" feature 
> where you can rename columns when aliasing a relation in a {{FROM}} clause. 
> For example:
> {code}
> SELECT * FROM onecolumn AS a(x) JOIN onecolumn AS b(y) ON a.x = b.y
> {code}
> Spark does not currently support this. I would like to add support for this 
> in order to allow me to run a corpus of existing queries which depend on this 
> syntax.
> There's a good writeup on this at 
> http://modern-sql.com/feature/table-column-aliases, which has additional 
> examples and describes other databases' degrees of support for this feature.
> One tricky thing to figure out will be whether FROM clause column aliases 
> take precedence over aliases in the SELECT clause. When adding support for 
> this, we should make sure to add sufficient testing of several corner-cases, 
> including:
> * Aliasing in both the SELECT and FROM clause
> * Aliasing columns in the FROM clause both with and without an explicit AS.
> * Aliasing the wrong number of columns in the FROM clause, both greater and 
> fewer columns than were selected in the SELECT clause.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4921) TaskSetManager mistakenly returns PROCESS_LOCAL for NO_PREF tasks

2017-05-23 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16021410#comment-16021410
 ] 

Imran Rashid commented on SPARK-4921:
-

I want to point out that this also results in really confusing behavior in the 
event logs and the UI -- no pref tasks show up as "Process local" in the UI.  
This doesn't adversely affect scheduling, but it is very confusing for the user.

I know this is really old, but just curious [~CodingCat] if you remember why 
this was done in SPARK-2294.

> TaskSetManager mistakenly returns PROCESS_LOCAL for NO_PREF tasks
> -
>
> Key: SPARK-4921
> URL: https://issues.apache.org/jira/browse/SPARK-4921
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Xuefu Zhang
> Attachments: NO_PREF.patch
>
>
> During research for HIVE-9153, we found that TaskSetManager returns 
> PROCESS_LOCAL for NO_PREF tasks, which may cause performance degradation. 
> Changing the return value to NO_PREF, as demonstrated in the attached patch, 
> seemingly improves the performance.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-16209) Convert Hive Tables to Data Source Tables for CREATE TABLE AS SELECT

2017-05-23 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-16209.
-
Resolution: Won't Fix

Hi, [~smilegator].
According to the discussion on the PR, I'm closing this issue.
Please reopen this if I'm wrong.
Thanks.

> Convert Hive Tables to Data Source Tables for CREATE TABLE AS SELECT
> 
>
> Key: SPARK-16209
> URL: https://issues.apache.org/jira/browse/SPARK-16209
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> Currently, the following created table is Hive Table.
> {noformat}
> CREATE TABLE t STORED AS parquet SELECT 1 as a, 1 as b
> {noformat}
> When users create a table as a query with {{STORED AS}} or {{ROW FORMAT}}, we 
> will not convert it to a data source table even when 
> {{spark.sql.hive.convertCTAS}} is set to true. Actually, for parquet and orc 
> formats, we can still convert them to data source tables.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20851) Drop spark table failed if a column name is a numeric string

2017-05-23 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16021348#comment-16021348
 ] 

Takeshi Yamamuro commented on SPARK-20851:
--

I think we could use numeric strings as column names and we could drop them;
{code}
scala> sql("CREATE TABLE t(`1` INT)")
scala> sql("INSERT INTO t VALUES(1)")
scala> sql("SELECT * FROM t").show
+---+
|  1|
+---+
|  1|
+---+

scala> sql("DROP TABLE t")
{code}
Could you describe more and show us a simpler query to reproduce this? Thanks.

> Drop spark table failed if a column name is a numeric string
> 
>
> Key: SPARK-20851
> URL: https://issues.apache.org/jira/browse/SPARK-20851
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: linux redhat
>Reporter: Chen Gong
>
> I tried to read a json file to a spark dataframe
> {noformat}
> df = spark.read.json('path.json')
> df.write.parquet('dataframe', compression='snappy')
> {noformat}
> However, some of the column names are numeric strings, such as 
> "989238883". Then I created a Spark SQL table by using this
> {noformat}
> create table if not exists `a` using org.apache.spark.sql.parquet options 
> (path 'dataframe');  // It works well
> {noformat}
> But after creating the table, any operation on it, like select or drop 
> table, will raise the same exception below
> {noformat}
> org.apache.spark.SparkException: Cannot recognize hive type string: 
> array>,url:string,width:bigint>>,audit_id:bigint,author_id:bigint,body:string,brand_id:string,created_at:string,custom_ticket_fields:struct<49244727:string,51588527:string,51591767:string,51950848:string,51950868:string,51950888:string,51950928:string,52359587:string,55276747:string,56958227:string,57080067:string,57080667:string,57107727:string,57112447:string,57113207:string,57411128:string,57424648:string,57442588:string,62382188:string,74862088:string,74871788:string>,event_type:string,group_id:bigint,html_body:string,id:bigint,is_public:string,locale_id:string,organization_id:string,plain_body:string,previous_value:string,priority:string,public:boolean,rel:string,removed_tags:array,requester_id:bigint,satisfaction_probability:string,satisfaction_score:string,sla_policy:string,status:string,tags:array,ticket_form_id:string,type:string,via:string,via_reference_id:bigint>>
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.org$apache$spark$sql$hive$client$HiveClientImpl$$fromHiveColumn(HiveClientImpl.scala:785)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$10$$anonfun$7.apply(HiveClientImpl.scala:365)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$10$$anonfun$7.apply(HiveClientImpl.scala:365)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$10.apply(HiveClientImpl.scala:365)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$10.apply(HiveClientImpl.scala:361)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1.apply(HiveClientImpl.scala:361)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1.apply(HiveClientImpl.scala:359)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:283)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:230)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:229)
>   at 
> 

[jira] [Commented] (SPARK-20799) Unable to infer schema for ORC on reading ORC from S3

2017-05-23 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16021340#comment-16021340
 ] 

Dongjoon Hyun commented on SPARK-20799:
---

Hi, [~jzijlstra]. What about Parquet?

> Unable to infer schema for ORC on reading ORC from S3
> -
>
> Key: SPARK-20799
> URL: https://issues.apache.org/jira/browse/SPARK-20799
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Jork Zijlstra
>
> We are getting the following exception: 
> {code}org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC. 
> It must be specified manually.{code}
> Combining following factors will cause it:
> - Use S3
> - Use format ORC
> - Don't apply partitioning on the data
> - Embed AWS credentials in the path
> The problem is in the PartitioningAwareFileIndex def allFiles()
> {code}
> leafDirToChildrenFiles.get(qualifiedPath)
>   .orElse { leafFiles.get(qualifiedPath).map(Array(_)) }
>   .getOrElse(Array.empty)
> {code}
> leafDirToChildrenFiles uses the path WITHOUT credentials as its key while the 
> qualifiedPath contains the path WITH credentials.
> So leafDirToChildrenFiles.get(qualifiedPath) doesn't find any files, so no 
> data is read and the schema cannot be defined.
> Spark does output the warning "S3xLoginHelper:90 - The Filesystem URI contains login 
> details. This is insecure and may be unsupported in future.", but this should 
> not mean that it stops working altogether.
> Workaround:
> Move the AWS credentials from the path to the SparkSession
> {code}
> SparkSession.builder
>   .config("spark.hadoop.fs.s3n.awsAccessKeyId", {awsAccessKeyId})
>   .config("spark.hadoop.fs.s3n.awsSecretAccessKey", {awsSecretAccessKey})
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15343) NoClassDefFoundError when initializing Spark with YARN

2017-05-23 Thread Bing Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16021310#comment-16021310
 ] 

Bing Li commented on SPARK-15343:
-

The property in Yarn should be yarn.timeline-service.enabled=false, instead of 
hadoop.yarn.timeline-service.enabled.
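If it helps, one way to pass that Hadoop/YARN property from the Spark side is the spark.hadoop.* prefix (a sketch; disabling the timeline service is a workaround rather than a fix for the missing jersey classes):

{code}
import org.apache.spark.SparkConf

// Hedged sketch: spark.hadoop.* entries are forwarded into the Hadoop Configuration,
// so this should be equivalent to setting yarn.timeline-service.enabled=false in yarn-site.xml.
val conf = new SparkConf()
  .set("spark.hadoop.yarn.timeline-service.enabled", "false")
{code}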

> NoClassDefFoundError when initializing Spark with YARN
> --
>
> Key: SPARK-15343
> URL: https://issues.apache.org/jira/browse/SPARK-15343
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Priority: Critical
>
> I'm trying to connect Spark 2.0 (compiled from branch-2.0) with Hadoop.
> Spark compiled with:
> {code}
> ./dev/make-distribution.sh -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver 
> -Dhadoop.version=2.6.0 -DskipTests
> {code}
> I'm getting following error
> {code}
> mbrynski@jupyter:~/spark$ bin/pyspark
> Python 3.4.0 (default, Apr 11 2014, 13:05:11)
> [GCC 4.8.2] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> Warning: Master yarn-client is deprecated since 2.0. Please use master "yarn" 
> with specified deploy mode instead.
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel).
> 16/05/16 11:54:41 WARN SparkConf: The configuration key 'spark.yarn.jar' has 
> been deprecated as of Spark 2.0 and may be removed in the future. Please use 
> the new key 'spark.yarn.jars' instead.
> 16/05/16 11:54:41 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 16/05/16 11:54:42 WARN AbstractHandler: No Server set for 
> org.spark_project.jetty.server.handler.ErrorHandler@f7989f6
> 16/05/16 11:54:43 WARN DomainSocketFactory: The short-circuit local reads 
> feature cannot be used because libhadoop cannot be loaded.
> Traceback (most recent call last):
>   File "/home/mbrynski/spark/python/pyspark/shell.py", line 38, in 
> sc = SparkContext()
>   File "/home/mbrynski/spark/python/pyspark/context.py", line 115, in __init__
> conf, jsc, profiler_cls)
>   File "/home/mbrynski/spark/python/pyspark/context.py", line 172, in _do_init
> self._jsc = jsc or self._initialize_context(self._conf._jconf)
>   File "/home/mbrynski/spark/python/pyspark/context.py", line 235, in 
> _initialize_context
> return self._jvm.JavaSparkContext(jconf)
>   File 
> "/home/mbrynski/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", 
> line 1183, in __call__
>   File 
> "/home/mbrynski/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 
> 312, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> None.org.apache.spark.api.java.JavaSparkContext.
> : java.lang.NoClassDefFoundError: 
> com/sun/jersey/api/client/config/ClientConfig
> at 
> org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:45)
> at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:163)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at 
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:150)
> at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
> at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:148)
> at org.apache.spark.SparkContext.(SparkContext.scala:502)
> at 
> org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:58)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:240)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> at py4j.Gateway.invoke(Gateway.java:236)
> at 
> py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
> at 
> py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
> at py4j.GatewayConnection.run(GatewayConnection.java:211)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ClassNotFoundException: 
> com.sun.jersey.api.client.config.ClientConfig
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
> at 

[jira] [Commented] (SPARK-20848) Dangling threads when reading parquet files in local mode

2017-05-23 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16021237#comment-16021237
 ] 

Liang-Chi Hsieh commented on SPARK-20848:
-

OK. Since it seems better not to change the concurrency, I added a shutdown and a test 
case for it.

> Dangling threads when reading parquet files in local mode
> -
>
> Key: SPARK-20848
> URL: https://issues.apache.org/jira/browse/SPARK-20848
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, SQL
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Nick Pritchard
> Attachments: Screen Shot 2017-05-22 at 4.13.52 PM.png
>
>
> On each call to {{spark.read.parquet}}, a new ForkJoinPool is created. One of 
> the threads in the pool is kept in the {{WAITING}} state, and never stopped, 
> which leads to unbounded growth in number of threads.
> This behavior is a regression from v2.1.0.
> Reproducible example:
> {code}
> val spark = SparkSession
>   .builder()
>   .appName("test")
>   .master("local")
>   .getOrCreate()
> while(true) {
>   spark.read.parquet("/path/to/file")
>   Thread.sleep(5000)
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20848) Dangling threads when reading parquet files in local mode

2017-05-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20848:


Assignee: Apache Spark

> Dangling threads when reading parquet files in local mode
> -
>
> Key: SPARK-20848
> URL: https://issues.apache.org/jira/browse/SPARK-20848
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, SQL
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Nick Pritchard
>Assignee: Apache Spark
> Attachments: Screen Shot 2017-05-22 at 4.13.52 PM.png
>
>
> On each call to {{spark.read.parquet}}, a new ForkJoinPool is created. One of 
> the threads in the pool is kept in the {{WAITING}} state, and never stopped, 
> which leads to unbounded growth in number of threads.
> This behavior is a regression from v2.1.0.
> Reproducible example:
> {code}
> val spark = SparkSession
>   .builder()
>   .appName("test")
>   .master("local")
>   .getOrCreate()
> while(true) {
>   spark.read.parquet("/path/to/file")
>   Thread.sleep(5000)
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20848) Dangling threads when reading parquet files in local mode

2017-05-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20848:


Assignee: (was: Apache Spark)

> Dangling threads when reading parquet files in local mode
> -
>
> Key: SPARK-20848
> URL: https://issues.apache.org/jira/browse/SPARK-20848
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, SQL
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Nick Pritchard
> Attachments: Screen Shot 2017-05-22 at 4.13.52 PM.png
>
>
> On each call to {{spark.read.parquet}}, a new ForkJoinPool is created. One of 
> the threads in the pool is kept in the {{WAITING}} state, and never stopped, 
> which leads to unbounded growth in number of threads.
> This behavior is a regression from v2.1.0.
> Reproducible example:
> {code}
> val spark = SparkSession
>   .builder()
>   .appName("test")
>   .master("local")
>   .getOrCreate()
> while(true) {
>   spark.read.parquet("/path/to/file")
>   Thread.sleep(5000)
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20848) Dangling threads when reading parquet files in local mode

2017-05-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16021236#comment-16021236
 ] 

Apache Spark commented on SPARK-20848:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/18073

> Dangling threads when reading parquet files in local mode
> -
>
> Key: SPARK-20848
> URL: https://issues.apache.org/jira/browse/SPARK-20848
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, SQL
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Nick Pritchard
> Attachments: Screen Shot 2017-05-22 at 4.13.52 PM.png
>
>
> On each call to {{spark.read.parquet}}, a new ForkJoinPool is created. One of 
> the threads in the pool is kept in the {{WAITING}} state, and never stopped, 
> which leads to unbounded growth in number of threads.
> This behavior is a regression from v2.1.0.
> Reproducible example:
> {code}
> val spark = SparkSession
>   .builder()
>   .appName("test")
>   .master("local")
>   .getOrCreate()
> while(true) {
>   spark.read.parquet("/path/to/file")
>   Thread.sleep(5000)
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20857) Generic resolved hint node

2017-05-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20857:


Assignee: Reynold Xin  (was: Apache Spark)

> Generic resolved hint node
> --
>
> Key: SPARK-20857
> URL: https://issues.apache.org/jira/browse/SPARK-20857
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> This patch renames BroadcastHint to ResolvedHint so it is more generic and 
> would allow us to introduce other hint types in the future without 
> introducing new hint nodes.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20857) Generic resolved hint node

2017-05-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20857:


Assignee: Apache Spark  (was: Reynold Xin)

> Generic resolved hint node
> --
>
> Key: SPARK-20857
> URL: https://issues.apache.org/jira/browse/SPARK-20857
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> This patch renames BroadcastHint to ResolvedHint so it is more generic and 
> would allow us to introduce other hint types in the future without 
> introducing new hint nodes.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20857) Generic resolved hint node

2017-05-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16021200#comment-16021200
 ] 

Apache Spark commented on SPARK-20857:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/18072

> Generic resolved hint node
> --
>
> Key: SPARK-20857
> URL: https://issues.apache.org/jira/browse/SPARK-20857
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> This patch renames BroadcastHint to ResolvedHint so it is more generic and 
> would allow us to introduce other hint types in the future without 
> introducing new hint nodes.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20178) Improve Scheduler fetch failures

2017-05-23 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16021199#comment-16021199
 ] 

Thomas Graves commented on SPARK-20178:
---

| My understanding of today's code is that a single FetchFailed task will 
trigger a stage failure and parent stage retry and that the task which 
experienced the fetch failure will not be retried within the same task set that 
scheduled it. I'm basing this off the comment at 
https://github.com/apache/spark/blob/9b09101938399a3490c3c9bde9e5f07031140fdf/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L77
 and the code at 
https://github.com/apache/spark/blob/9b09101938399a3490c3c9bde9e5f07031140fdf/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L770
 where the TSM prevents re-attempts of FetchFailed tasks.

That is correct, but that doesn't mean we can't track the fetch failures on a 
host across stages. You may or may not get multiple fetch failures in the 
first stage before it is aborted (very timing dependent), so you are correct 
that you can't rely on that. But if you track those across stage attempts, and 
if the max is set to 2 or 3, then it will clear the entire host before the 4 
default stage failures. This might give us a little more confidence that it is a 
hard failure vs. a transient failure. But that does take extra tracking, and 
right now I don't have good metrics to tell me how many of each kind of 
failure occur. So, to get the robustness, for now I'm fine with just 
invalidating it immediately and seeing how that works.


> Improve Scheduler fetch failures
> 
>
> Key: SPARK-20178
> URL: https://issues.apache.org/jira/browse/SPARK-20178
> Project: Spark
>  Issue Type: Epic
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Thomas Graves
>
> We have been having a lot of discussions around improving the handling of 
> fetch failures.  There are 4 jira currently related to this.  
> We should try to get a list of things we want to improve and come up with one 
> cohesive design.
> SPARK-20163,  SPARK-20091,  SPARK-14649 , and SPARK-19753
> I will put my initial thoughts in a follow on comment.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20857) Turn BroadcastHint into a more generic hint node

2017-05-23 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-20857:

Summary: Turn BroadcastHint into a more generic hint node  (was: Make 
BroadcastHint a more ResolvedHint node)

> Turn BroadcastHint into a more generic hint node
> 
>
> Key: SPARK-20857
> URL: https://issues.apache.org/jira/browse/SPARK-20857
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> This patch renames BroadcastHint to ResolvedHint so it is more generic and 
> would allow us to introduce other hint types in the future without 
> introducing new hint nodes.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20857) Make BroadcastHint a more ResolvedHint node

2017-05-23 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-20857:
---

 Summary: Make BroadcastHint a more ResolvedHint node
 Key: SPARK-20857
 URL: https://issues.apache.org/jira/browse/SPARK-20857
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Reynold Xin
Assignee: Reynold Xin


This patch renames BroadcastHint to ResolvedHint so it is more generic and 
would allow us to introduce other hint types in the future without introducing 
new hint nodes.




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20857) Generic resolved hint node

2017-05-23 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-20857:

Summary: Generic resolved hint node  (was: Turn BroadcastHint into a more 
generic hint node)

> Generic resolved hint node
> --
>
> Key: SPARK-20857
> URL: https://issues.apache.org/jira/browse/SPARK-20857
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> This patch renames BroadcastHint to ResolvedHint so it is more generic and 
> would allow us to introduce other hint types in the future without 
> introducing new hint nodes.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20848) Dangling threads when reading parquet files in local mode

2017-05-23 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16021186#comment-16021186
 ] 

Sean Owen commented on SPARK-20848:
---

Yes, possibly. The main tradeoff is that concurrent read jobs share a pool of 
threads and don't get their own. You don't need to spin up new threads, but you 
will also have a thread pool lying around for the whole app lifetime. No big 
deal. The main question is whether concurrent jobs were intended to limit their 
total concurrency on purpose by sharing a pool or not.

> Dangling threads when reading parquet files in local mode
> -
>
> Key: SPARK-20848
> URL: https://issues.apache.org/jira/browse/SPARK-20848
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, SQL
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Nick Pritchard
> Attachments: Screen Shot 2017-05-22 at 4.13.52 PM.png
>
>
> On each call to {{spark.read.parquet}}, a new ForkJoinPool is created. One of 
> the threads in the pool is kept in the {{WAITING}} state, and never stopped, 
> which leads to unbounded growth in number of threads.
> This behavior is a regression from v2.1.0.
> Reproducible example:
> {code}
> val spark = SparkSession
>   .builder()
>   .appName("test")
>   .master("local")
>   .getOrCreate()
> while(true) {
>   spark.read.parquet("/path/to/file")
>   Thread.sleep(5000)
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20848) Dangling threads when reading parquet files in local mode

2017-05-23 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16021166#comment-16021166
 ] 

Liang-Chi Hsieh commented on SPARK-20848:
-

It seems to me that sharing the task support between parquet file reads is 
better than shutting it down after each read?

> Dangling threads when reading parquet files in local mode
> -
>
> Key: SPARK-20848
> URL: https://issues.apache.org/jira/browse/SPARK-20848
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, SQL
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Nick Pritchard
> Attachments: Screen Shot 2017-05-22 at 4.13.52 PM.png
>
>
> On each call to {{spark.read.parquet}}, a new ForkJoinPool is created. One of 
> the threads in the pool is kept in the {{WAITING}} state, and never stopped, 
> which leads to unbounded growth in number of threads.
> This behavior is a regression from v2.1.0.
> Reproducible example:
> {code}
> val spark = SparkSession
>   .builder()
>   .appName("test")
>   .master("local")
>   .getOrCreate()
> while(true) {
>   spark.read.parquet("/path/to/file")
>   Thread.sleep(5000)
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20443) The blockSize of MLLIB ALS should be setting by the User

2017-05-23 Thread Peng Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peng Meng updated SPARK-20443:
--
Attachment: blockSize.jpg

blockSize and the performance of ALS recommendForAll

> The blockSize of MLLIB ALS should be setting  by the User
> -
>
> Key: SPARK-20443
> URL: https://issues.apache.org/jira/browse/SPARK-20443
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>Priority: Minor
> Attachments: blockSize.jpg
>
>
> The blockSize of MLLIB ALS is very important for ALS performance. 
> In our test, when the blockSize is 128, the performance is about 4X better 
> compared with a blockSize of 4096 (the default value).
> The following are our test results: 
> BlockSize (recommendationForAll time):
> 128 (124s), 256 (160s), 512 (184s), 1024 (244s), 2048 (332s), 4096 (488s), 8192 (OOM)
> The test environment:
> 3 workers: each worker has 10 cores, 30G of memory, and 1 executor.
> The data: 480,000 users and 17,000 items.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20856) support statement using nested joins

2017-05-23 Thread N Campbell (JIRA)
N Campbell created SPARK-20856:
--

 Summary: support statement using nested joins
 Key: SPARK-20856
 URL: https://issues.apache.org/jira/browse/SPARK-20856
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.1.0
Reporter: N Campbell


While DB2, Oracle, etc. support a join expressed as follows, Spark SQL does not. 

Not supported
select * from 
  cert.tsint tsint inner join cert.tint tint inner join cert.tbint tbint
 on tbint.rnum = tint.rnum
 on tint.rnum = tsint.rnum

versus the same query written as shown below:
select * from 
  cert.tsint tsint inner join cert.tint tint on tsint.rnum = tint.rnum inner 
join cert.tbint tbint on tint.rnum = tbint.rnum
   


ERROR_STATE, SQL state: org.apache.spark.sql.catalyst.parser.ParseException: 
extraneous input 'on' expecting {, ',', '.', '[', 'WHERE', 'GROUP', 
'ORDER', 'HAVING', 'LIMIT', 'OR', 'AND', 'IN', NOT, 'BETWEEN', 'LIKE', RLIKE, 
'IS', 'JOIN', 'CROSS', 'INNER', 'LEFT', 'RIGHT', 'FULL', 'NATURAL', 'LATERAL', 
'WINDOW', 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', EQ, '<=>', '<>', '!=', '<', 
LTE, '>', GTE, '+', '-', '*', '/', '%', 'DIV', '&', '|', '^', 'SORT', 
'CLUSTER', 'DISTRIBUTE', 'ANTI'}(line 4, pos 5)

== SQL ==
select * from 
  cert.tsint tsint inner join cert.tint tint inner join cert.tbint tbint
 on tbint.rnum = tint.rnum
 on tint.rnum = tsint.rnum
-^^^
, Query: select * from 
  cert.tsint tsint inner join cert.tint tint inner join cert.tbint tbint
 on tbint.rnum = tint.rnum
 on tint.rnum = tsint.rnum.
SQLState:  HY000
ErrorCode: 500051





--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20855) Update the Spark kinesis docs to use the KinesisInputDStream builder instead of deprecated KinesisUtils

2017-05-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20855:


Assignee: Apache Spark

> Update the Spark kinesis docs to use the KinesisInputDStream builder instead 
> of deprecated KinesisUtils
> ---
>
> Key: SPARK-20855
> URL: https://issues.apache.org/jira/browse/SPARK-20855
> Project: Spark
>  Issue Type: Documentation
>  Components: DStreams
>Affects Versions: 2.1.1
>Reporter: Yash Sharma
>Assignee: Apache Spark
>Priority: Minor
>  Labels: docs, examples, kinesis, streaming
>
> The examples and docs for Spark-Kinesis integrations use the deprecated 
> KinesisUtils. We should update the docs to use the KinesisInputDStream 
> builder to create DStreams.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20855) Update the Spark kinesis docs to use the KinesisInputDStream builder instead of deprecated KinesisUtils

2017-05-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20855:


Assignee: (was: Apache Spark)

> Update the Spark kinesis docs to use the KinesisInputDStream builder instead 
> of deprecated KinesisUtils
> ---
>
> Key: SPARK-20855
> URL: https://issues.apache.org/jira/browse/SPARK-20855
> Project: Spark
>  Issue Type: Documentation
>  Components: DStreams
>Affects Versions: 2.1.1
>Reporter: Yash Sharma
>Priority: Minor
>  Labels: docs, examples, kinesis, streaming
>
> The examples and docs for Spark-Kinesis integrations use the deprecated 
> KinesisUtils. We should update the docs to use the KinesisInputDStream 
> builder to create DStreams.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20855) Update the Spark kinesis docs to use the KinesisInputDStream builder instead of deprecated KinesisUtils

2017-05-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16021095#comment-16021095
 ] 

Apache Spark commented on SPARK-20855:
--

User 'yssharma' has created a pull request for this issue:
https://github.com/apache/spark/pull/18071

> Update the Spark kinesis docs to use the KinesisInputDStream builder instead 
> of deprecated KinesisUtils
> ---
>
> Key: SPARK-20855
> URL: https://issues.apache.org/jira/browse/SPARK-20855
> Project: Spark
>  Issue Type: Documentation
>  Components: DStreams
>Affects Versions: 2.1.1
>Reporter: Yash Sharma
>Priority: Minor
>  Labels: docs, examples, kinesis, streaming
>
> The examples and docs for Spark-Kinesis integrations use the deprecated 
> KinesisUtils. We should update the docs to use the KinesisInputDStream 
> builder to create DStreams.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20681) DataFram.Drop doesn't take effective, neither does error

2017-05-23 Thread lyc (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16021093#comment-16021093
 ] 

lyc commented on SPARK-20681:
-

As stated in the Spark source code, `drop` can only be used to drop top-level 
columns. Why did you expect this to work?

> DataFrame.drop doesn't take effect, nor does it raise an error
> 
>
> Key: SPARK-20681
> URL: https://issues.apache.org/jira/browse/SPARK-20681
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xi Wang
>Priority: Critical
>
> I am running the following code to try to drop nested columns, but it 
> doesn't work, nor does it return an error.
> *I read the DF from this json:*
> {'parent':{'child':{'grandchild':{'val':'1','val_to_be_deleted':'0'}}}}
> scala> spark.read.format("json").load("c:/tmp/spark_issue.json")
> res0: org.apache.spark.sql.DataFrame = [father: struct<child: struct<grandchild: 
> struct<val: bigint, val_to_be_deleted: bigint>>>]
> *read the df:*
> scala> res0.printSchema
> root
> |-- parent: struct (nullable = true)
> ||-- child: struct (nullable = true)
> |||-- grandchild: struct (nullable = true)
> ||||-- val: long (nullable = true)
> ||||-- val_to_be_deleted: long (nullable = true)
> *drop the column (I tried different ways: "quoted", `back-ticked`, col(object), 
> ...); the column remains anyway:*
> scala> res0.drop(col("father.child.grandchild.val_to_be_deleted")).printSchema
> root
> |-- father: struct (nullable = true)
> ||-- child: struct (nullable = true)
> |||-- grandchild: struct (nullable = true)
> ||||-- val: long (nullable = true)
> ||||-- val_to_be_deleted: long (nullable = true)
> scala> res0.drop("father.child.grandchild.val_to_be_deleted").printSchema
> root
> |-- father: struct (nullable = true)
> ||-- child: struct (nullable = true)
> |||-- grandchild: struct (nullable = true)
> ||||-- val: long (nullable = true)
> ||||-- val_to_be_deleted: long (nullable = true)
> Any help is appreciated.
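
A possible workaround, sketched here under the assumption that the schema is exactly
as printed above: since drop() only removes top-level columns, the enclosing struct
has to be rebuilt without the unwanted field.

{code}
// Hypothetical workaround sketch (not from the report): rebuild the nested struct
// without val_to_be_deleted, because drop() only handles top-level columns.
import org.apache.spark.sql.functions.struct
import spark.implicits._

val cleaned = res0.withColumn(
  "father",
  struct(
    struct(
      struct($"father.child.grandchild.val".alias("val")).alias("grandchild")
    ).alias("child")
  )
)
cleaned.printSchema()  // the rebuilt struct no longer contains val_to_be_deleted
{code}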



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20855) Update the Spark kinesis docs to use the KinesisInputDStream builder instead of deprecated KinesisUtils

2017-05-23 Thread Yash Sharma (JIRA)
Yash Sharma created SPARK-20855:
---

 Summary: Update the Spark kinesis docs to use the 
KinesisInputDStream builder instead of deprecated KinesisUtils
 Key: SPARK-20855
 URL: https://issues.apache.org/jira/browse/SPARK-20855
 Project: Spark
  Issue Type: Documentation
  Components: DStreams
Affects Versions: 2.1.1
Reporter: Yash Sharma
Priority: Minor


The examples and docs for Spark-Kinesis integrations use the deprecated 
KinesisUtils. We should update the docs to use the KinesisInputDStream builder 
to create DStreams.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20838) Spark ML ngram feature extractor should support ngram range like scikit

2017-05-23 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-20838.
---
Resolution: Duplicate

> Spark ML ngram feature extractor should support ngram range like scikit
> ---
>
> Key: SPARK-20838
> URL: https://issues.apache.org/jira/browse/SPARK-20838
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.1.1
>Reporter: Nick Lothian
>
> Currently the Spark ML NGram extractor requires a single ngram size (which 
> defaults to 2).
> This means that to tokenize to words, bigrams and trigrams (which is pretty 
> common) you need a pipeline like this:
> tokenizer = Tokenizer(inputCol="text", outputCol="tokenized_text")
> remover = StopWordsRemover(inputCol=tokenizer.getOutputCol(), 
> outputCol="words")
> bigram = NGram(n=2, inputCol=remover.getOutputCol(), outputCol="bigrams")
> trigram = NGram(n=3, inputCol=remover.getOutputCol(), 
> outputCol="trigrams")
> 
> pipeline = Pipeline(stages=[tokenizer, remover, bigram, trigram])
> That's not terrible, but the big problem is that the words, bigrams and 
> trigrams end up in separate fields, and the only way (in pyspark) to combine 
> them is to explode each of the words, bigrams and trigrams fields and then 
> union them together.
> In my experience this makes feature extraction slower than using a Python 
> UDF, which seems preposterous!
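
One way to avoid the explode-and-union step is to append the three array columns
into a single column after the pipeline runs; a minimal Scala sketch, assuming the
column names used in the pipeline above and an already-transformed DataFrame (a UDF
is used because array concat is not available in Spark SQL 2.1):

{code}
// Sketch: merge words, bigrams and trigrams into one array column with a UDF,
// instead of exploding each column and unioning the results.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}

val combineGrams = udf((a: Seq[String], b: Seq[String], c: Seq[String]) => a ++ b ++ c)

// `transformed` is assumed to be the output of pipeline.fit(df).transform(df)
def withAllGrams(transformed: DataFrame): DataFrame =
  transformed.withColumn("all_grams",
    combineGrams(col("words"), col("bigrams"), col("trigrams")))
{code}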



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13819) using a regexp_replace in a group by clause raises a nullpointerexception

2017-05-23 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-13819.
---
Resolution: Duplicate

> using a regexp_replace in a group by clause raises a nullpointerexception
> -
>
> Key: SPARK-13819
> URL: https://issues.apache.org/jira/browse/SPARK-13819
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Javier Pérez
>
> 1. Start start-thriftserver.sh
> 2. connect with beeline
> 3. Perform the following query over a table:
>   SELECT t0.textsample 
>   FROM test t0 
>   ORDER BY regexp_replace(
> t0.code, 
> concat('\\Q', 'a', '\\E'), 
> regexp_replace(
>regexp_replace('zz', '', ''),
> '\\$', 
> '\\$')) DESC;
> Problem: NullPointerException
> Trace:
>  java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.RegExpReplace.nullSafeEval(regexpExpressions.scala:224)
>   at 
> org.apache.spark.sql.catalyst.expressions.TernaryExpression.eval(Expression.scala:458)
>   at 
> org.apache.spark.sql.catalyst.expressions.InterpretedOrdering.compare(ordering.scala:36)
>   at 
> org.apache.spark.sql.catalyst.expressions.InterpretedOrdering.compare(ordering.scala:27)
>   at scala.math.Ordering$class.gt(Ordering.scala:97)
>   at 
> org.apache.spark.sql.catalyst.expressions.InterpretedOrdering.gt(ordering.scala:27)
>   at org.apache.spark.RangePartitioner.getPartition(Partitioner.scala:168)
>   at 
> org.apache.spark.sql.execution.Exchange$$anonfun$doExecute$1$$anonfun$4$$anonfun$apply$4.apply(Exchange.scala:180)
>   at 
> org.apache.spark.sql.execution.Exchange$$anonfun$doExecute$1$$anonfun$4$$anonfun$apply$4.apply(Exchange.scala:180)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.insertAll(BypassMergeSortShuffleWriter.java:119)
>   at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13819) using a regexp_replace in a group by clause raises a nullpointerexception

2017-05-23 Thread DazhuangSu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16021024#comment-16021024
 ] 

DazhuangSu commented on SPARK-13819:


This is a duplicate of https://issues.apache.org/jira/browse/SPARK-18368
Please close this issue, thx.

> using a regexp_replace in a group by clause raises a nullpointerexception
> -
>
> Key: SPARK-13819
> URL: https://issues.apache.org/jira/browse/SPARK-13819
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Javier Pérez
>
> 1. Start start-thriftserver.sh
> 2. connect with beeline
> 3. Perform the following query over a table:
>   SELECT t0.textsample 
>   FROM test t0 
>   ORDER BY regexp_replace(
> t0.code, 
> concat('\\Q', 'a', '\\E'), 
> regexp_replace(
>regexp_replace('zz', '', ''),
> '\\$', 
> '\\$')) DESC;
> Problem: NullPointerException
> Trace:
>  java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.RegExpReplace.nullSafeEval(regexpExpressions.scala:224)
>   at 
> org.apache.spark.sql.catalyst.expressions.TernaryExpression.eval(Expression.scala:458)
>   at 
> org.apache.spark.sql.catalyst.expressions.InterpretedOrdering.compare(ordering.scala:36)
>   at 
> org.apache.spark.sql.catalyst.expressions.InterpretedOrdering.compare(ordering.scala:27)
>   at scala.math.Ordering$class.gt(Ordering.scala:97)
>   at 
> org.apache.spark.sql.catalyst.expressions.InterpretedOrdering.gt(ordering.scala:27)
>   at org.apache.spark.RangePartitioner.getPartition(Partitioner.scala:168)
>   at 
> org.apache.spark.sql.execution.Exchange$$anonfun$doExecute$1$$anonfun$4$$anonfun$apply$4.apply(Exchange.scala:180)
>   at 
> org.apache.spark.sql.execution.Exchange$$anonfun$doExecute$1$$anonfun$4$$anonfun$apply$4.apply(Exchange.scala:180)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.insertAll(BypassMergeSortShuffleWriter.java:119)
>   at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20854) extend hint syntax to support any expression, not just identifiers or strings

2017-05-23 Thread Bogdan Raducanu (JIRA)
Bogdan Raducanu created SPARK-20854:
---

 Summary: extend hint syntax to support any expression, not just 
identifiers or strings
 Key: SPARK-20854
 URL: https://issues.apache.org/jira/browse/SPARK-20854
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Bogdan Raducanu


Currently the SQL hint syntax supports only identifiers as parameters, while the 
Dataset hint syntax supports only strings.

Both should support arbitrary expressions as parameters, for example numbers. This 
would be useful for implementing other hints in the future.

Examples:
{code}
df.hint("hint1", Seq(1, 2, 3))
df.hint("hint2", "A", 1)

sql("select /*+ hint1((1,2,3)) */")
sql("select /*+ hint2('A', 1) */")
{code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-20852) NullPointerException on distinct (Dataset)

2017-05-23 Thread Paride Casulli (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paride Casulli closed SPARK-20852.
--
Resolution: Duplicate

duplicate issue of https://issues.apache.org/jira/browse/SPARK-18528

> NullPointerException on distinct (Dataset)
> ---
>
> Key: SPARK-20852
> URL: https://issues.apache.org/jira/browse/SPARK-20852
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Shell, SQL
>Affects Versions: 2.1.0
> Environment: Hive 1.2.1
>Reporter: Paride Casulli
>
> Hi, I get a NullPointerException on these instructions. I've also tried the 
> same commands in Spark 1.5.1 and they return the correct result; can you 
> help me? The table is stored as a Hive table in Parquet files, partitioned by 
> more than one column (ost_date is a partitioning field).
> scala> spark.sql("select ost_date from bda_ia.cdr_centrale_normalizzato where 
> ost_Date='2017-05-23' limit 200")
> res5: org.apache.spark.sql.DataFrame = [ost_date: date]
> scala> res5.distinct()
> res6: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [ost_date: 
> date]
> scala> res6.show()
> [Stage 13:>   (53 + 2) / 
> 57]17/05/23 11:42:57 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 
> 14.0 (TID 484, esedatanode3.telecomitalia.local, executor 37): 
> java.lang.NullPointerException
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:99)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> 17/05/23 11:42:58 ERROR scheduler.TaskSetManager: Task 0 in stage 14.0 failed 
> 4 times; aborting job
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 14.0 failed 4 times, most recent failure: Lost task 0.3 in stage 14.0 
> (TID 487, esedatanode3.telecomitalia.local, executor 37): 
> java.lang.NullPointerException
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:99)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> 

[jira] [Commented] (SPARK-20852) NullPointerException on distinct (Dataset)

2017-05-23 Thread Paride Casulli (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16020994#comment-16020994
 ] 

Paride Casulli commented on SPARK-20852:


Thank you Sean, you're right, without limit works fine :)
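
For anyone hitting the same thing, a sketch of the workaround discussed here (table
and column names taken from the report): compute the distinct values without the
intermediate limit, for example directly in the query.

{code}
// Sketch of the workaround: distinct without the intermediate limit
val dates = spark.sql(
  "select distinct ost_date from bda_ia.cdr_centrale_normalizzato " +
  "where ost_date = '2017-05-23'")
dates.show()
{code}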

> NullPointerException on distinct (Dataset)
> ---
>
> Key: SPARK-20852
> URL: https://issues.apache.org/jira/browse/SPARK-20852
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Shell, SQL
>Affects Versions: 2.1.0
> Environment: Hive 1.2.1
>Reporter: Paride Casulli
>
> Hi, I get a NullPointerException on these instructions. I've also tried the 
> same commands in Spark 1.5.1 and they return the correct result; can you 
> help me? The table is stored as a Hive table in Parquet files, partitioned by 
> more than one column (ost_date is a partitioning field).
> scala> spark.sql("select ost_date from bda_ia.cdr_centrale_normalizzato where 
> ost_Date='2017-05-23' limit 200")
> res5: org.apache.spark.sql.DataFrame = [ost_date: date]
> scala> res5.distinct()
> res6: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [ost_date: 
> date]
> scala> res6.show()
> [Stage 13:>   (53 + 2) / 
> 57]17/05/23 11:42:57 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 
> 14.0 (TID 484, esedatanode3.telecomitalia.local, executor 37): 
> java.lang.NullPointerException
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:99)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> 17/05/23 11:42:58 ERROR scheduler.TaskSetManager: Task 0 in stage 14.0 failed 
> 4 times; aborting job
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 14.0 failed 4 times, most recent failure: Lost task 0.3 in stage 14.0 
> (TID 487, esedatanode3.telecomitalia.local, executor 37): 
> java.lang.NullPointerException
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:99)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

[jira] [Commented] (SPARK-20713) Speculative task that got CommitDenied exception shows up as failed

2017-05-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16020984#comment-16020984
 ] 

Apache Spark commented on SPARK-20713:
--

User 'liyichao' has created a pull request for this issue:
https://github.com/apache/spark/pull/18070

> Speculative task that got CommitDenied exception shows up as failed
> ---
>
> Key: SPARK-20713
> URL: https://issues.apache.org/jira/browse/SPARK-20713
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Thomas Graves
>
> When running speculative tasks you can end up with a task failure on a 
> speculative task (while the other task succeeded) because that task got a 
> CommitDenied exception when it was really "killed" by the driver. It is a 
> race between when the driver kills the task and when the executor tries to commit.
> Ideally we should fix up the task state in this case to be "killed", because 
> the fact that this task failed doesn't matter since the other speculative 
> task succeeded. Tasks showing up as failures confuse users and could make 
> other scheduler cases harder.
> This is somewhat related to SPARK-13343, where I think we should correctly 
> account for speculative tasks: only one of the two tasks really succeeded and 
> committed, and the other should be marked differently.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20713) Speculative task that got CommitDenied exception shows up as failed

2017-05-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20713:


Assignee: (was: Apache Spark)

> Speculative task that got CommitDenied exception shows up as failed
> ---
>
> Key: SPARK-20713
> URL: https://issues.apache.org/jira/browse/SPARK-20713
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Thomas Graves
>
> When running speculative tasks you can end up with a task failure on a 
> speculative task (while the other task succeeded) because that task got a 
> CommitDenied exception when it was really "killed" by the driver. It is a 
> race between when the driver kills the task and when the executor tries to commit.
> Ideally we should fix up the task state in this case to be "killed", because 
> the fact that this task failed doesn't matter since the other speculative 
> task succeeded. Tasks showing up as failures confuse users and could make 
> other scheduler cases harder.
> This is somewhat related to SPARK-13343, where I think we should correctly 
> account for speculative tasks: only one of the two tasks really succeeded and 
> committed, and the other should be marked differently.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20713) Speculative task that got CommitDenied exception shows up as failed

2017-05-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20713:


Assignee: Apache Spark

> Speculative task that got CommitDenied exception shows up as failed
> ---
>
> Key: SPARK-20713
> URL: https://issues.apache.org/jira/browse/SPARK-20713
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Thomas Graves
>Assignee: Apache Spark
>
> When running speculative tasks you can end up with a task failure on a 
> speculative task (while the other task succeeded) because that task got a 
> CommitDenied exception when it was really "killed" by the driver. It is a 
> race between when the driver kills the task and when the executor tries to commit.
> Ideally we should fix up the task state in this case to be "killed", because 
> the fact that this task failed doesn't matter since the other speculative 
> task succeeded. Tasks showing up as failures confuse users and could make 
> other scheduler cases harder.
> This is somewhat related to SPARK-13343, where I think we should correctly 
> account for speculative tasks: only one of the two tasks really succeeded and 
> committed, and the other should be marked differently.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20852) NullPointerException on distinct (Dataset)

2017-05-23 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16020974#comment-16020974
 ] 

Sean Owen commented on SPARK-20852:
---

Looks like another duplicate of 
https://issues.apache.org/jira/browse/SPARK-18528

> NullPointerException on distinct (Dataset)
> ---
>
> Key: SPARK-20852
> URL: https://issues.apache.org/jira/browse/SPARK-20852
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Shell, SQL
>Affects Versions: 2.1.0
> Environment: Hive 1.2.1
>Reporter: Paride Casulli
>
> Hi, I get a NullPointerException on these instructions. I've also tried the 
> same commands in Spark 1.5.1 and they return the correct result; can you 
> help me? The table is stored as a Hive table in Parquet files, partitioned by 
> more than one column (ost_date is a partitioning field).
> scala> spark.sql("select ost_date from bda_ia.cdr_centrale_normalizzato where 
> ost_Date='2017-05-23' limit 200")
> res5: org.apache.spark.sql.DataFrame = [ost_date: date]
> scala> res5.distinct()
> res6: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [ost_date: 
> date]
> scala> res6.show()
> [Stage 13:>   (53 + 2) / 
> 57]17/05/23 11:42:57 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 
> 14.0 (TID 484, esedatanode3.telecomitalia.local, executor 37): 
> java.lang.NullPointerException
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:99)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> 17/05/23 11:42:58 ERROR scheduler.TaskSetManager: Task 0 in stage 14.0 failed 
> 4 times; aborting job
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 14.0 failed 4 times, most recent failure: Lost task 0.3 in stage 14.0 
> (TID 487, esedatanode3.telecomitalia.local, executor 37): 
> java.lang.NullPointerException
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:99)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
> at 
> 

[jira] [Resolved] (SPARK-18651) KeyValueGroupedDataset[K, V].reduceGroups cannot handle primitive for V

2017-05-23 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-18651.
---
   Resolution: Fixed
Fix Version/s: 2.1.1
   2.2.0

> KeyValueGroupedDataset[K, V].reduceGroups cannot handle primitive for V
> ---
>
> Key: SPARK-18651
> URL: https://issues.apache.org/jira/browse/SPARK-18651
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: koert kuipers
> Fix For: 2.2.0, 2.1.1
>
>
> run:
> {noformat}
> val df = Seq(1, 2, 3)
>   .toDS
>   .groupByKey(x => x)
>   .reduceGroups(_ + _)
> df.show
> {noformat}
> result:
> {noformat}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in 
> stage 143.0 failed 1 times, most recent failure: Lost task 2.0 in stage 143.0 
> (TID 514, localhost): java.lang.NullPointerException
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.aggregate.HashAggregateExec.createHashMap(HashAggregateExec.scala:296)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> The issue is the null in ReduceAggregator.zero:
> for a primitive type this null leads to the NPE. Instead, for primitive types 
> we should use a dummy/default value (0 for Int, false for Boolean, etc.)
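
Until a fix lands, one way to sidestep ReduceAggregator (and its null zero) entirely
is to reduce inside mapGroups; a minimal sketch, assuming a spark-shell session:

{code}
// Sketch workaround: mapGroups never touches ReduceAggregator.zero, so no NPE
import spark.implicits._

val reduced = Seq(1, 2, 3)
  .toDS
  .groupByKey(x => x)
  .mapGroups((key, values) => (key, values.reduce(_ + _)))
reduced.show()
{code}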



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


