[jira] [Created] (SPARK-21189) Handle unknown error codes in Jenkins rather than leaving incomplete comments in PRs

2017-06-23 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-21189:


 Summary: Handle unknown error codes in Jenkins rather than leaving 
incomplete comments in PRs
 Key: SPARK-21189
 URL: https://issues.apache.org/jira/browse/SPARK-21189
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Affects Versions: 2.2.0
Reporter: Hyukjin Kwon


Recently, Jenkins tests have been unstable for unknown reasons, as shown below:

{code}
 /home/jenkins/workspace/SparkPullRequestBuilder/dev/lint-r ; process was 
terminated by signal 9
test_result_code, test_result_note = run_tests(tests_timeout)
  File "./dev/run-tests-jenkins.py", line 140, in run_tests
test_result_note = ' * This patch **fails %s**.' % 
failure_note_by_errcode[test_result_code]
KeyError: -9
{code}

{code}
Traceback (most recent call last):
  File "./dev/run-tests-jenkins.py", line 226, in 
main()
  File "./dev/run-tests-jenkins.py", line 213, in main
test_result_code, test_result_note = run_tests(tests_timeout)
  File "./dev/run-tests-jenkins.py", line 140, in run_tests
test_result_note = ' * This patch **fails %s**.' % 
failure_note_by_errcode[test_result_code]
KeyError: -10
{code}

This exception appears to prevent the PR comment from being updated. For 
example, a comment such as:

{code}
Test build #78508 has started for PR 18320 at commit 53e00d7.
{code}

just remains as it is, never updated with the final test result. 

This always imposes overhead on both reviewers and the author, who have to 
click through and check the logs, which I believe is not really useful.
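
A minimal sketch (plain Python, not the project's actual patch to 
dev/run-tests-jenkins.py; the dictionary entry is illustrative only) of the kind 
of fallback that avoids the KeyError above:

{code}
# Hypothetical sketch: look up the failure note with a default instead of
# indexing directly, so unknown exit codes (e.g. -9/-10 from signals) no
# longer raise KeyError and the PR comment can still be completed.
failure_note_by_errcode = {
    1: 'some tests',  # illustrative entry only
}

def result_note(test_result_code):
    if test_result_code == 0:
        return ' * This patch **passes all tests**.'
    note = failure_note_by_errcode.get(
        test_result_code,
        'due to an unknown error code, %d' % test_result_code)
    return ' * This patch **fails %s**.' % note

print(result_note(-9))  # ' * This patch **fails due to an unknown error code, -9**.'
{code}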







[jira] [Assigned] (SPARK-21189) Handle unknown error codes in Jenkins rather than leaving incomplete comments in PRs

2017-06-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21189:


Assignee: (was: Apache Spark)

> Handle unknown error codes in Jenkins rather than leaving incomplete comments 
> in PRs
> ---
>
> Key: SPARK-21189
> URL: https://issues.apache.org/jira/browse/SPARK-21189
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>
> Recently, Jenkins tests have been unstable for unknown reasons, as shown below:
> {code}
>  /home/jenkins/workspace/SparkPullRequestBuilder/dev/lint-r ; process was 
> terminated by signal 9
> test_result_code, test_result_note = run_tests(tests_timeout)
>   File "./dev/run-tests-jenkins.py", line 140, in run_tests
> test_result_note = ' * This patch **fails %s**.' % 
> failure_note_by_errcode[test_result_code]
> KeyError: -9
> {code}
> {code}
> Traceback (most recent call last):
>   File "./dev/run-tests-jenkins.py", line 226, in 
> main()
>   File "./dev/run-tests-jenkins.py", line 213, in main
> test_result_code, test_result_note = run_tests(tests_timeout)
>   File "./dev/run-tests-jenkins.py", line 140, in run_tests
> test_result_note = ' * This patch **fails %s**.' % 
> failure_note_by_errcode[test_result_code]
> KeyError: -10
> {code}
> This exception appears to prevent the PR comment from being updated. For 
> example, a comment such as:
> {code}
> Test build #78508 has started for PR 18320 at commit 53e00d7.
> {code}
> just remains as it is, never updated with the final test result. 
> This always imposes overhead on both reviewers and the author, who have to 
> click through and check the logs, which I believe is not really useful.






[jira] [Assigned] (SPARK-21189) Handle unknown error codes in Jenkins rather than leaving incomplete comments in PRs

2017-06-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21189:


Assignee: Apache Spark

> Handle unknown error codes in Jenkins rather than leaving incomplete comments 
> in PRs
> ---
>
> Key: SPARK-21189
> URL: https://issues.apache.org/jira/browse/SPARK-21189
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>
> Recently, Jenkins tests have been unstable for unknown reasons, as shown below:
> {code}
>  /home/jenkins/workspace/SparkPullRequestBuilder/dev/lint-r ; process was 
> terminated by signal 9
> test_result_code, test_result_note = run_tests(tests_timeout)
>   File "./dev/run-tests-jenkins.py", line 140, in run_tests
> test_result_note = ' * This patch **fails %s**.' % 
> failure_note_by_errcode[test_result_code]
> KeyError: -9
> {code}
> {code}
> Traceback (most recent call last):
>   File "./dev/run-tests-jenkins.py", line 226, in 
> main()
>   File "./dev/run-tests-jenkins.py", line 213, in main
> test_result_code, test_result_note = run_tests(tests_timeout)
>   File "./dev/run-tests-jenkins.py", line 140, in run_tests
> test_result_note = ' * This patch **fails %s**.' % 
> failure_note_by_errcode[test_result_code]
> KeyError: -10
> {code}
> This exception appears to prevent the PR comment from being updated. For 
> example, a comment such as:
> {code}
> Test build #78508 has started for PR 18320 at commit 53e00d7.
> {code}
> just remains as it is, never updated with the final test result. 
> This always imposes overhead on both reviewers and the author, who have to 
> click through and check the logs, which I believe is not really useful.






[jira] [Commented] (SPARK-21189) Handle unknown error codes in Jenkins rather than leaving incomplete comments in PRs

2017-06-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16060501#comment-16060501
 ] 

Apache Spark commented on SPARK-21189:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/18399

> Handle unknown error codes in Jenkins rather than leaving incomplete comments 
> in PRs
> ---
>
> Key: SPARK-21189
> URL: https://issues.apache.org/jira/browse/SPARK-21189
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>
> Recently, Jenkins tests have been unstable for unknown reasons, as shown below:
> {code}
>  /home/jenkins/workspace/SparkPullRequestBuilder/dev/lint-r ; process was 
> terminated by signal 9
> test_result_code, test_result_note = run_tests(tests_timeout)
>   File "./dev/run-tests-jenkins.py", line 140, in run_tests
> test_result_note = ' * This patch **fails %s**.' % 
> failure_note_by_errcode[test_result_code]
> KeyError: -9
> {code}
> {code}
> Traceback (most recent call last):
>   File "./dev/run-tests-jenkins.py", line 226, in 
> main()
>   File "./dev/run-tests-jenkins.py", line 213, in main
> test_result_code, test_result_note = run_tests(tests_timeout)
>   File "./dev/run-tests-jenkins.py", line 140, in run_tests
> test_result_note = ' * This patch **fails %s**.' % 
> failure_note_by_errcode[test_result_code]
> KeyError: -10
> {code}
> This exception appears to prevent the PR comment from being updated. For 
> example, a comment such as:
> {code}
> Test build #78508 has started for PR 18320 at commit 53e00d7.
> {code}
> just remains as it is, never updated with the final test result. 
> This always imposes overhead on both reviewers and the author, who have to 
> click through and check the logs, which I believe is not really useful.






[jira] [Updated] (SPARK-21176) Master UI hangs with spark.ui.reverseProxy=true if the master node has many CPUs

2017-06-23 Thread Ingo Schuster (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ingo Schuster updated SPARK-21176:
--
Description: 
In reverse proxy mode, Spark exhausts the Jetty thread pool if the master node 
has too many CPUs or the cluster has too many executors:

For each ProxyServlet, Jetty creates Selector threads: minimum 4, maximum half 
the number of available CPUs:
{{this(Math.max(1, Runtime.getRuntime().availableProcessors() / 2));}}
(see 
https://github.com/eclipse/jetty.project/blob/0c8273f2ca1f9bf2064cd9c4c939d2546443f759/jetty-client/src/main/java/org/eclipse/jetty/client/http/HttpClientTransportOverHTTP.java)

In reverse proxy mode, a proxy servlet is set up for each executor.
I have a system with 7 executors and 88 CPUs on the master node. Jetty tries to 
instantiate 7*44 = 308 selector threads just for the reverse proxy servlets, 
but since the QueuedThreadPool is initialized with 200 threads by default, the 
UI gets stuck.

I have patched JettyUtils.scala to enlarge the thread pool ({{val pool = new 
QueuedThreadPool(400)}}). With this hack, the UI works.

Obviously, the Jetty defaults are meant for a real web server: if one has 88 
CPUs, you certainly expect a lot of traffic.
For the Spark admin UI however, there will rarely be concurrent accesses for 
the same application or the same executor.
I therefore propose to dramatically reduce the number of selector threads that 
get instantiated - at least by default.

I will propose a fix in a pull request.
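
A back-of-the-envelope check of the numbers above (a plain-Python sketch, not 
Spark or Jetty code; all figures come from this description):

{code}
# Mirror the selector sizing quoted above: max(1, availableProcessors / 2).
cpus = 88
executors = 7
selectors_per_proxy_servlet = max(1, cpus // 2)            # 44
total_selectors = executors * selectors_per_proxy_servlet  # 7 * 44 = 308
default_queued_thread_pool = 200

# 308 > 200: the selector threads alone exceed the default pool, which is why
# the UI hangs until the pool is enlarged (e.g. QueuedThreadPool(400)).
print(total_selectors, total_selectors > default_queued_thread_pool)
{code}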

  was:
In reverse proxy mode, Sparks exhausts the Jetty thread pool if the master node 
has too many cpus or the cluster has too many executers:

For each ProxyServlet, Jetty creates Selector threads: minimum 4, maximum half 
the number of available CPUs:
{{this(Math.max(1, Runtime.getRuntime().availableProcessors() / 2));}}
(see 
https://github.com/eclipse/jetty.project/blob/0c8273f2ca1f9bf2064cd9c4c939d2546443f759/jetty-client/src/main/java/org/eclipse/jetty/client/http/HttpClientTransportOverHTTP.java)

In reverse proxy mode, a proxy servlet is set up for each executor.
I have a system with 7 executors and 88 CPUs on the master node. Jetty tries to 
instantiate 7*44 = 309 selector threads just for the reverse proxy servlets, 
but since the QueuedThreadPool is initialized with 200 threads by default, the 
UI gets stuck.

I have patched JettyUtils.scala to extend the thread pool ( {{val pool = new 
QueuedThreadPool*(400)*}}). With this hack, the UI works.

Obviously, the Jetty defaults are meant for a real web server. If that has 88 
CPUs, you do certainly expect a lot of traffic.
For the Spark admin UI however, there will rarely be concurrent accesses for 
the same application or the same executor.
I therefore propose to dramatically reduce the number of selector threads that 
get instantiated - at least by default.

I will propose a fix in a pull request.


> Master UI hangs with spark.ui.reverseProxy=true if the master node has many 
> CPUs
> 
>
> Key: SPARK-21176
> URL: https://issues.apache.org/jira/browse/SPARK-21176
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.0, 2.1.1, 2.2.0, 2.2.1
> Environment: ppc64le GNU/Linux, POWER8, only master node is reachable 
> externally other nodes are in an internal network
>Reporter: Ingo Schuster
>  Labels: network, web-ui
>
> In reverse proxy mode, Spark exhausts the Jetty thread pool if the master 
> node has too many CPUs or the cluster has too many executors:
> For each ProxyServlet, Jetty creates Selector threads: minimum 4, maximum 
> half the number of available CPUs:
> {{this(Math.max(1, Runtime.getRuntime().availableProcessors() / 2));}}
> (see 
> https://github.com/eclipse/jetty.project/blob/0c8273f2ca1f9bf2064cd9c4c939d2546443f759/jetty-client/src/main/java/org/eclipse/jetty/client/http/HttpClientTransportOverHTTP.java)
> In reverse proxy mode, a proxy servlet is set up for each executor.
> I have a system with 7 executors and 88 CPUs on the master node. Jetty tries 
> to instantiate 7*44 = 308 selector threads just for the reverse proxy 
> servlets, but since the QueuedThreadPool is initialized with 200 threads by 
> default, the UI gets stuck.
> I have patched JettyUtils.scala to extend the thread pool ( {{val pool = new 
> QueuedThreadPool(400)}}). With this hack, the UI works.
> Obviously, the Jetty defaults are meant for a real web server: if one has 88 
> CPUs, you certainly expect a lot of traffic.
> For the Spark admin UI however, there will rarely be concurrent accesses for 
> the same application or the same executor.
> I therefore propose to dramatically reduce the number of selector threads 
> that get instantiated - at least by default.
> I will propose a fix in a pull 

[jira] [Updated] (SPARK-21176) Master UI hangs with spark.ui.reverseProxy=true if the master node has many CPUs

2017-06-23 Thread Ingo Schuster (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ingo Schuster updated SPARK-21176:
--
Description: 
In reverse proxy mode, Sparks exhausts the Jetty thread pool if the master node 
has too many cpus or the cluster has too many executers:

For each ProxyServlet, Jetty creates Selector threads: minimum 4, maximum half 
the number of available CPUs:
{{this(Math.max(1, Runtime.getRuntime().availableProcessors() / 2));}}
(see 
https://github.com/eclipse/jetty.project/blob/0c8273f2ca1f9bf2064cd9c4c939d2546443f759/jetty-client/src/main/java/org/eclipse/jetty/client/http/HttpClientTransportOverHTTP.java)

In reverse proxy mode, a proxy servlet is set up for each executor.
I have a system with 7 executors and 88 CPUs on the master node. Jetty tries to 
instantiate 7*44 = 309 selector threads just for the reverse proxy servlets, 
but since the QueuedThreadPool is initialized with 200 threads by default, the 
UI gets stuck.

I have patched JettyUtils.scala to extend the thread pool ( {{val pool = new 
QueuedThreadPool*(400)*}}). With this hack, the UI works.

Obviously, the Jetty defaults are meant for a real web server. If that has 88 
CPUs, you do certainly expect a lot of traffic.
For the Spark admin UI however, there will rarely be concurrent accesses for 
the same application or the same executor.
I therefore propose to dramatically reduce the number of selector threads that 
get instantiated - at least by default.

I will propose a fix in a pull request.

  was:
In reverse proxy mode, Sparks exhausts the Jetty thread pool if the master node 
has too many cpus or the cluster has too many executers:

For each ProxyServlet, Jetty creates Selector threads: minimum 4, maximum half 
the number of available CPUs:
{{this(Math.max(1, Runtime.getRuntime().availableProcessors() / 2));}}
(see 
https://github.com/eclipse/jetty.project/blob/0c8273f2ca1f9bf2064cd9c4c939d2546443f759/jetty-client/src/main/java/org/eclipse/jetty/client/http/HttpClientTransportOverHTTP.java)

In reverse proxy mode, a proxy servlet is set up for each executor.
I have a system with 7 executors and 88 CPUs on the master node. Jetty tries to 
instantiate 7*44 = 309 selector threads just for the reverse proxy servlets, 
but since the QueuedThreadPool is initialized with 200 threads by default, the 
UI gets stuck.

I have patched JettyUtils.scala to extend the thread pool ( {{val pool = new 
QueuedThreadPool*(400)* }}). With this hack, the UI works.

Obviously, the Jetty defaults are meant for a real web server. If that has 88 
CPUs, you do certainly expect a lot of traffic.
For the Spark admin UI however, there will rarely be concurrent accesses for 
the same application or the same executor.
I therefore propose to dramatically reduce the number of selector threads that 
get instantiated - at least by default.

I will propose a fix in a pull request.


> Master UI hangs with spark.ui.reverseProxy=true if the master node has many 
> CPUs
> 
>
> Key: SPARK-21176
> URL: https://issues.apache.org/jira/browse/SPARK-21176
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.0, 2.1.1, 2.2.0, 2.2.1
> Environment: ppc64le GNU/Linux, POWER8, only master node is reachable 
> externally other nodes are in an internal network
>Reporter: Ingo Schuster
>  Labels: network, web-ui
>
> In reverse proxy mode, Sparks exhausts the Jetty thread pool if the master 
> node has too many cpus or the cluster has too many executers:
> For each ProxyServlet, Jetty creates Selector threads: minimum 4, maximum 
> half the number of available CPUs:
> {{this(Math.max(1, Runtime.getRuntime().availableProcessors() / 2));}}
> (see 
> https://github.com/eclipse/jetty.project/blob/0c8273f2ca1f9bf2064cd9c4c939d2546443f759/jetty-client/src/main/java/org/eclipse/jetty/client/http/HttpClientTransportOverHTTP.java)
> In reverse proxy mode, a proxy servlet is set up for each executor.
> I have a system with 7 executors and 88 CPUs on the master node. Jetty tries 
> to instantiate 7*44 = 309 selector threads just for the reverse proxy 
> servlets, but since the QueuedThreadPool is initialized with 200 threads by 
> default, the UI gets stuck.
> I have patched JettyUtils.scala to extend the thread pool ( {{val pool = new 
> QueuedThreadPool*(400)*}}). With this hack, the UI works.
> Obviously, the Jetty defaults are meant for a real web server. If that has 88 
> CPUs, you do certainly expect a lot of traffic.
> For the Spark admin UI however, there will rarely be concurrent accesses for 
> the same application or the same executor.
> I therefore propose to dramatically reduce the number of selector threads 
> that get instantiated - at least by default.
> I will propose a fix in a 

[jira] [Updated] (SPARK-21176) Master UI hangs with spark.ui.reverseProxy=true if the master node has many CPUs

2017-06-23 Thread Ingo Schuster (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ingo Schuster updated SPARK-21176:
--
Description: 
In reverse proxy mode, Sparks exhausts the Jetty thread pool if the master node 
has too many cpus or the cluster has too many executers:

For each ProxyServlet, Jetty creates Selector threads: minimum 4, maximum half 
the number of available CPUs:
{{this(Math.max(1, Runtime.getRuntime().availableProcessors() / 2));}}
(see 
https://github.com/eclipse/jetty.project/blob/0c8273f2ca1f9bf2064cd9c4c939d2546443f759/jetty-client/src/main/java/org/eclipse/jetty/client/http/HttpClientTransportOverHTTP.java)

In reverse proxy mode, a proxy servlet is set up for each executor.
I have a system with 7 executors and 88 CPUs on the master node. Jetty tries to 
instantiate 7*44 = 309 selector threads just for the reverse proxy servlets, 
but since the QueuedThreadPool is initialized with 200 threads by default, the 
UI gets stuck.

I have patched JettyUtils.scala to extend the thread pool ( {{val pool = new 
QueuedThreadPool*(400)* }}). With this hack, the UI works.

Obviously, the Jetty defaults are meant for a real web server. If that has 88 
CPUs, you do certainly expect a lot of traffic.
For the Spark admin UI however, there will rarely be concurrent accesses for 
the same application or the same executor.
I therefore propose to dramatically reduce the number of selector threads that 
get instantiated - at least by default.

I will propose a fix in a pull request.

  was:
In reverse proxy mode, Sparks exhausts the Jetty thread pool if the master node 
has too many cpus or the cluster has too many executers:

For each connector, Jetty creates Selector threads: minimum 4, maximum half the 
number of available CPUs:
{{this(Math.max(1, Runtime.getRuntime().availableProcessors() / 2));}}
(see 
https://github.com/eclipse/jetty.project/blob/0c8273f2ca1f9bf2064cd9c4c939d2546443f759/jetty-client/src/main/java/org/eclipse/jetty/client/http/HttpClientTransportOverHTTP.java)

In reverse proxy mode, a connector is set up for each executor and one for the 
master UI.
I have a system with 88 CPUs on the master node and 7 executors. Jetty tries to 
instantiate 8*44 = 352 selector threads, but since the QueuedThreadPool is 
initialized with 200 threads by default, the UI gets stuck.

I have patched JettyUtils.scala to extend the thread pool ( {{val pool = new 
QueuedThreadPool*(400)* }}). With this hack, the UI works.

Obviously, the Jetty defaults are meant for a real web server. If that has 88 
CPUs, you do certainly expect a lot of traffic.
For the Spark admin UI however, there will rarely be concurrent accesses for 
the same application or the same executor.
I therefore propose to dramatically reduce the number of selector threads that 
get instantiated - at least by default.

I will propose a fix in a pull request.


> Master UI hangs with spark.ui.reverseProxy=true if the master node has many 
> CPUs
> 
>
> Key: SPARK-21176
> URL: https://issues.apache.org/jira/browse/SPARK-21176
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.0, 2.1.1, 2.2.0, 2.2.1
> Environment: ppc64le GNU/Linux, POWER8, only master node is reachable 
> externally other nodes are in an internal network
>Reporter: Ingo Schuster
>  Labels: network, web-ui
>
> In reverse proxy mode, Sparks exhausts the Jetty thread pool if the master 
> node has too many cpus or the cluster has too many executers:
> For each ProxyServlet, Jetty creates Selector threads: minimum 4, maximum 
> half the number of available CPUs:
> {{this(Math.max(1, Runtime.getRuntime().availableProcessors() / 2));}}
> (see 
> https://github.com/eclipse/jetty.project/blob/0c8273f2ca1f9bf2064cd9c4c939d2546443f759/jetty-client/src/main/java/org/eclipse/jetty/client/http/HttpClientTransportOverHTTP.java)
> In reverse proxy mode, a proxy servlet is set up for each executor.
> I have a system with 7 executors and 88 CPUs on the master node. Jetty tries 
> to instantiate 7*44 = 309 selector threads just for the reverse proxy 
> servlets, but since the QueuedThreadPool is initialized with 200 threads by 
> default, the UI gets stuck.
> I have patched JettyUtils.scala to extend the thread pool ( {{val pool = new 
> QueuedThreadPool*(400)* }}). With this hack, the UI works.
> Obviously, the Jetty defaults are meant for a real web server. If that has 88 
> CPUs, you do certainly expect a lot of traffic.
> For the Spark admin UI however, there will rarely be concurrent accesses for 
> the same application or the same executor.
> I therefore propose to dramatically reduce the number of selector threads 
> that get instantiated - at least by default.
> I will propose a fix in a pull request.


[jira] [Commented] (SPARK-21187) Complete support for remaining Spark data types in Arrow Converters

2017-06-23 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16060510#comment-16060510
 ] 

Bryan Cutler commented on SPARK-21187:
--

Pandas only supports flat columns; I'm not sure there is an equivalent to 
array or map. I was thinking more about what Arrow supports here, but since 
toPandas() is the only consumer of Arrow data, I will focus on matching the 
behavior we get today without Arrow.
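
For context, a minimal sketch of the toPandas() path under discussion (an 
assumption-laden example, not part of this ticket: it presumes Spark 2.3's 
spark.sql.execution.arrow.enabled flag, pyarrow installed, and a running 
SparkSession named {{spark}}); only flat primitive columns convert through 
Arrow today, which is what this umbrella tracks:

{code}
# Sketch only: enable Arrow-backed toPandas() and convert a flat-column frame.
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

# Date, timestamp, decimal and complex types (struct/array/map) are the
# remaining work tracked by SPARK-21187.
pdf = spark.range(0, 10).selectExpr("id", "cast(id as double) as x").toPandas()
print(pdf.dtypes)
{code}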

> Complete support for remaining Spark data types in Arrow Converters
> ---
>
> Key: SPARK-21187
> URL: https://issues.apache.org/jira/browse/SPARK-21187
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>
> This is to track adding the remaining type support in Arrow Converters. 
> Currently, only primitive data types are supported.
> Remaining types:
> * *Date*
> * *Timestamp*
> * *Complex*: Struct, Array, Map
> * *Decimal*






[jira] [Resolved] (SPARK-21145) Restarted queries reuse same StateStoreProvider, causing multiple concurrent tasks to update same StateStore

2017-06-23 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-21145.
---
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18355
[https://github.com/apache/spark/pull/18355]

> Restarted queries reuse same StateStoreProvider, causing multiple concurrent 
> tasks to update same StateStore
> 
>
> Key: SPARK-21145
> URL: https://issues.apache.org/jira/browse/SPARK-21145
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
> Fix For: 2.3.0
>
>
> StateStoreProvider instances are loaded on-demand in an executor when a query 
> is started. When a query is restarted, the loaded provider instance will get 
> reused. Now, there is a non-trivial chance that a task of the previous query 
> run is still running while the tasks of the restarted run have started. 
> So for a stateful partition, there may be two concurrent tasks related to the 
> same stateful partition, and therefore using the same provider instance. This 
> can lead to inconsistent results and possibly random failures, as state store 
> implementations are not designed to be thread-safe.
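
A toy illustration (plain Python, unrelated to Spark's actual StateStore code) 
of the hazard described above: two concurrent tasks doing unsynchronized 
read-modify-write against one shared store can interleave and lose updates.

{code}
# Hypothetical, simplified model: a shared "store" with no locking.
import threading

store = {"count": 0}

def task(n):
    for _ in range(n):
        current = store["count"]      # read
        store["count"] = current + 1  # write; a switch in between loses an update

threads = [threading.Thread(target=task, args=(1_000_000,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# A thread-safe store would print 2000000; this one may print less.
print(store["count"])
{code}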






[jira] [Created] (SPARK-21190) SPIP: Vectorized UDFs for Python

2017-06-23 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-21190:
---

 Summary: SPIP: Vectorized UDFs for Python
 Key: SPARK-21190
 URL: https://issues.apache.org/jira/browse/SPARK-21190
 Project: Spark
  Issue Type: New Feature
  Components: PySpark, SQL
Affects Versions: 2.2.0
Reporter: Reynold Xin









[jira] [Updated] (SPARK-21190) SPIP: Vectorized UDFs for Python

2017-06-23 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-21190:

Description: 
*Background and Motivation*
 
Python is one of the most popular programming languages among Spark users. 
Spark currently exposes a row-at-a-time interface for defining and executing 
user-defined functions (UDFs). This introduces high overhead in serialization 
and deserialization, and also makes it difficult to leverage Python libraries 
that are written in native code. This proposal advocates introducing new APIs 
to support vectorized UDFs in Python, in which a block of data is transferred 
over to Python in some column format for execution.
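
For concreteness, a sketch of today's row-at-a-time Python UDF path that this 
proposal aims to improve (the proposed vectorized API itself is still listed as 
"todo" below); it assumes only the existing pyspark.sql.functions.udf API and a 
running SparkSession named {{spark}}, e.g. a pyspark shell:

{code}
# Existing row-at-a-time UDF: each row is serialized to the Python worker,
# evaluated, and serialized back -- the overhead described above.
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

plus_one = udf(lambda x: x + 1.0, DoubleType())

df = spark.range(0, 1000).selectExpr("cast(id as double) as x")
df.select(plus_one("x")).show(3)
{code}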
 
 
*Target Personas*

Data scientists, data engineers, library developers.
 

*Goals*

... todo ...
 

*Non-Goals*

- Define block oriented UDFs in other languages (that are not Python).
- Define aggregate UDFs
 
 
*Proposed API Changes*
 
... todo ...
 
 
 
*Optional Design Sketch*
The implementation should be pretty straightforward and is not a huge concern 
at this point. I’m more concerned about getting proper feedback for API design.
 
 
*Optional Rejected Designs*
See above.
 
 
 
 


> SPIP: Vectorized UDFs for Python
> 
>
> Key: SPARK-21190
> URL: https://issues.apache.org/jira/browse/SPARK-21190
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>  Labels: SPIP
>
> *Background and Motivation*
>  
> Python is one of the most popular programming languages among Spark users. 
> Spark currently exposes a row-at-a-time interface for defining and executing 
> user-defined functions (UDFs). This introduces high overhead in serialization 
> and deserialization, and also makes it difficult to leverage Python libraries 
> that are written in native code. This proposal advocates introducing new APIs 
> to support vectorized UDFs in Python, in which a block of data is transferred 
> over to Python in some column format for execution.
>  
>  
> *Target Personas*
> Data scientists, data engineers, library developers.
>  
> *Goals*
> ... todo ...
>  
> *Non-Goals*
> - Define block oriented UDFs in other languages (that are not Python).
> - Define aggregate UDFs
>  
>  
> *Proposed API Changes*
>  
> ... todo ...
>  
>  
>  
> *Optional Design Sketch*
> The implementation should be pretty straightforward and is not a huge concern 
> at this point. I’m more concerned about getting proper feedback for API 
> design.
>  
>  
> *Optional Rejected Designs*
> See above.
>  
>  
>  
>  






[jira] [Assigned] (SPARK-21190) SPIP: Vectorized UDFs for Python

2017-06-23 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin reassigned SPARK-21190:
---

Assignee: Reynold Xin

> SPIP: Vectorized UDFs for Python
> 
>
> Key: SPARK-21190
> URL: https://issues.apache.org/jira/browse/SPARK-21190
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>  Labels: SPIP
>
> *Background and Motivation*
>  
> Python is one of the most popular programming languages among Spark users. 
> Spark currently exposes a row-at-a-time interface for defining and executing 
> user-defined functions (UDFs). This introduces high overhead in serialization 
> and deserialization, and also makes it difficult to leverage Python libraries 
> that are written in native code. This proposal advocates introducing new APIs 
> to support vectorized UDFs in Python, in which a block of data is transferred 
> over to Python in some column format for execution.
>  
>  
> *Target Personas*
> Data scientists, data engineers, library developers.
>  
> *Goals*
> ... todo ...
>  
> *Non-Goals*
> - Define block oriented UDFs in other languages (that are not Python).
> - Define aggregate UDFs
>  
>  
> *Proposed API Changes*
>  
> ... todo ...
>  
>  
>  
> *Optional Design Sketch*
> The implementation should be pretty straightforward and is not a huge concern 
> at this point. I’m more concerned about getting proper feedback for API 
> design.
>  
>  
> *Optional Rejected Designs*
> See above.
>  
>  
>  
>  






[jira] [Updated] (SPARK-21190) SPIP: Vectorized UDFs in Python

2017-06-23 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-21190:

Summary: SPIP: Vectorized UDFs in Python  (was: SPIP: Vectorized UDFs for 
Python)

> SPIP: Vectorized UDFs in Python
> ---
>
> Key: SPARK-21190
> URL: https://issues.apache.org/jira/browse/SPARK-21190
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>  Labels: SPIP
>
> *Background and Motivation*
>  
> Python is one of the most popular programming languages among Spark users. 
> Spark currently exposes a row-at-a-time interface for defining and executing 
> user-defined functions (UDFs). This introduces high overhead in serialization 
> and deserialization, and also makes it difficult to leverage Python libraries 
> that are written in native code. This proposal advocates introducing new APIs 
> to support vectorized UDFs in Python, in which a block of data is transferred 
> over to Python in some column format for execution.
>  
>  
> *Target Personas*
> Data scientists, data engineers, library developers.
>  
> *Goals*
> ... todo ...
>  
> *Non-Goals*
> - Define block oriented UDFs in other languages (that are not Python).
> - Define aggregate UDFs
>  
>  
> *Proposed API Changes*
>  
> ... todo ...
>  
>  
>  
> *Optional Design Sketch*
> The implementation should be pretty straightforward and is not a huge concern 
> at this point. I’m more concerned about getting proper feedback for API 
> design.
>  
>  
> *Optional Rejected Designs*
> See above.
>  
>  
>  
>  






[jira] [Updated] (SPARK-21190) SPIP: Vectorized UDFs in Python

2017-06-23 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-21190:

Description: 
*Background and Motivation*
Python is one of the most popular programming languages among Spark users. 
Spark currently exposes a row-at-a-time interface for defining and executing 
user-defined functions (UDFs). This introduces high overhead in serialization 
and deserialization, and also makes it difficult to leverage Python libraries 
that are written in native code. This proposal advocates introducing new APIs 
to support vectorized UDFs in Python, in which a block of data is transferred 
over to Python in some column format for execution.
 
 
*Target Personas*
Data scientists, data engineers, library developers.
 

*Goals*
... todo ...
 

*Non-Goals*
- Define block oriented UDFs in other languages (that are not Python).
- Define aggregate UDFs
 
 
*Proposed API Changes*
... todo ...
 
 
 
*Optional Design Sketch*
The implementation should be pretty straightforward and is not a huge concern 
at this point. I’m more concerned about getting proper feedback for API design.
 
 
*Optional Rejected Designs*
See above.
 
 
 
 


  was:
*Background and Motivation*
 
Python is one of the most popular programming languages among Spark users. 
Spark currently exposes a row-at-a-time interface for defining and executing 
user-defined functions (UDFs). This introduces high overhead in serialization 
and deserialization, and also makes it difficult to leverage Python libraries 
that are written in native code. This proposal advocates introducing new APIs 
to support vectorized UDFs in Python, in which a block of data is transferred 
over to Python in some column format for execution.
 
 
*Target Personas*

Data scientists, data engineers, library developers.
 

*Goals*

... todo ...
 

*Non-Goals*

- Define block oriented UDFs in other languages (that are not Python).
- Define aggregate UDFs
 
 
*Proposed API Changes*
 
... todo ...
 
 
 
*Optional Design Sketch*
The implementation should be pretty straightforward and is not a huge concern 
at this point. I’m more concerned about getting proper feedback for API design.
 
 
*Optional Rejected Designs*
See above.
 
 
 
 



> SPIP: Vectorized UDFs in Python
> ---
>
> Key: SPARK-21190
> URL: https://issues.apache.org/jira/browse/SPARK-21190
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>  Labels: SPIP
>
> *Background and Motivation*
> Python is one of the most popular programming languages among Spark users. 
> Spark currently exposes a row-at-a-time interface for defining and executing 
> user-defined functions (UDFs). This introduces high overhead in serialization 
> and deserialization, and also makes it difficult to leverage Python libraries 
> that are written in native code. This proposal advocates introducing new APIs 
> to support vectorized UDFs in Python, in which a block of data is transferred 
> over to Python in some column format for execution.
>  
>  
> *Target Personas*
> Data scientists, data engineers, library developers.
>  
> *Goals*
> ... todo ...
>  
> *Non-Goals*
> - Define block oriented UDFs in other languages (that are not Python).
> - Define aggregate UDFs
>  
>  
> *Proposed API Changes*
> ... todo ...
>  
>  
>  
> *Optional Design Sketch*
> The implementation should be pretty straightforward and is not a huge concern 
> at this point. I’m more concerned about getting proper feedback for API 
> design.
>  
>  
> *Optional Rejected Designs*
> See above.
>  
>  
>  
>  






[jira] [Commented] (SPARK-14220) Build and test Spark against Scala 2.12

2017-06-23 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16060545#comment-16060545
 ] 

Sean Owen commented on SPARK-14220:
---

I have a branch locally for 2.12. That much isn't a lot of work. The 
dependencies maybe. I agree that it's probably not going to happen for 2.3.x 
but maybe by the end of the year.

> Build and test Spark against Scala 2.12
> ---
>
> Key: SPARK-14220
> URL: https://issues.apache.org/jira/browse/SPARK-14220
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Project Infra
>Affects Versions: 2.1.0
>Reporter: Josh Rosen
>Priority: Blocker
>
> This umbrella JIRA tracks the requirements for building and testing Spark 
> against the current Scala 2.12 milestone.






[jira] [Commented] (SPARK-14220) Build and test Spark against Scala 2.12

2017-06-23 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16060549#comment-16060549
 ] 

Reynold Xin commented on SPARK-14220:
-

Making it build isn't that much work, but getting the API to work (especially 
the Dataset one) will be huge.


> Build and test Spark against Scala 2.12
> ---
>
> Key: SPARK-14220
> URL: https://issues.apache.org/jira/browse/SPARK-14220
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Project Infra
>Affects Versions: 2.1.0
>Reporter: Josh Rosen
>Priority: Blocker
>
> This umbrella JIRA tracks the requirements for building and testing Spark 
> against the current Scala 2.12 milestone.






[jira] [Commented] (SPARK-21176) Master UI hangs with spark.ui.reverseProxy=true if the master node has many CPUs

2017-06-23 Thread Ingo Schuster (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16060553#comment-16060553
 ] 

Ingo Schuster commented on SPARK-21176:
---

The best way I can see to control the number of selectors is to override the 
newHttpClient() method of ProxyServlet in our JettyUtils.scala.
It's a pity that the constructor of ProxyServlet does not allow overriding the 
selector default.
That may be something to clean up in the future; I have opened a corresponding 
request for Jetty: https://github.com/eclipse/jetty.project/issues/1643

> Master UI hangs with spark.ui.reverseProxy=true if the master node has many 
> CPUs
> 
>
> Key: SPARK-21176
> URL: https://issues.apache.org/jira/browse/SPARK-21176
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.0, 2.1.1, 2.2.0, 2.2.1
> Environment: ppc64le GNU/Linux, POWER8, only master node is reachable 
> externally other nodes are in an internal network
>Reporter: Ingo Schuster
>  Labels: network, web-ui
>
> In reverse proxy mode, Spark exhausts the Jetty thread pool if the master 
> node has too many CPUs or the cluster has too many executors:
> For each ProxyServlet, Jetty creates Selector threads: minimum 4, maximum 
> half the number of available CPUs:
> {{this(Math.max(1, Runtime.getRuntime().availableProcessors() / 2));}}
> (see 
> https://github.com/eclipse/jetty.project/blob/0c8273f2ca1f9bf2064cd9c4c939d2546443f759/jetty-client/src/main/java/org/eclipse/jetty/client/http/HttpClientTransportOverHTTP.java)
> In reverse proxy mode, a proxy servlet is set up for each executor.
> I have a system with 7 executors and 88 CPUs on the master node. Jetty tries 
> to instantiate 7*44 = 308 selector threads just for the reverse proxy 
> servlets, but since the QueuedThreadPool is initialized with 200 threads by 
> default, the UI gets stuck.
> I have patched JettyUtils.scala to extend the thread pool ( {{val pool = new 
> QueuedThreadPool(400)}}). With this hack, the UI works.
> Obviously, the Jetty defaults are meant for a real web server: if one has 88 
> CPUs, you certainly expect a lot of traffic.
> For the Spark admin UI however, there will rarely be concurrent accesses for 
> the same application or the same executor.
> I therefore propose to dramatically reduce the number of selector threads 
> that get instantiated - at least by default.
> I will propose a fix in a pull request.






[jira] [Resolved] (SPARK-21171) Speculative task scheduling blocks the driver from handling normal tasks when a job has more than one hundred thousand tasks

2017-06-23 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21171.
---
Resolution: Invalid

> Speculative task scheduling blocks the driver from handling normal tasks when 
> a job has more than one hundred thousand tasks
> --
>
> Key: SPARK-21171
> URL: https://issues.apache.org/jira/browse/SPARK-21171
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Affects Versions: 2.0.0
> Environment: We have more than two hundred high-performance machine 
> to handle more than 2T data by one query
>Reporter: wangminfeng
>
> If a job has more than one hundred thousand tasks and spark.speculation is 
> true, then once speculatable tasks start, choosing a speculatable task wastes 
> a lot of time and blocks other tasks. We run ad-hoc queries for data analysis, 
> so we can't tolerate one job wasting time, even if it is a large job.






[jira] [Commented] (SPARK-21183) Unable to return Google BigQuery INTEGER data type into Spark via google BigQuery JDBC driver: java.sql.SQLDataException: [Simba][JDBC](10140) Error converting value t

2017-06-23 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16060557#comment-16060557
 ] 

Sean Owen commented on SPARK-21183:
---

I'm not clear that this is a Spark problem. What's the underlying error? What 
value failed to convert?

> Unable to return Google BigQuery INTEGER data type into Spark via google 
> BigQuery JDBC driver: java.sql.SQLDataException: [Simba][JDBC](10140) Error 
> converting value to long.
> --
>
> Key: SPARK-21183
> URL: https://issues.apache.org/jira/browse/SPARK-21183
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 1.6.0, 2.0.0, 2.1.1
> Environment: OS:  Linux
> Spark  version 2.1.1
> JDBC:  Download the latest google BigQuery JDBC Driver from Google
>Reporter: Matthew Walton
>
> I'm trying to fetch back data in Spark using a JDBC connection to Google 
> BigQuery.  Unfortunately, when I try to query data that resides in an INTEGER 
> column I get the following error:  
> java.sql.SQLDataException: [Simba][JDBC](10140) Error converting value to 
> long.  
> Steps to reproduce:
> 1) On Google BigQuery console create a simple table with an INT column and 
> insert some data 
> 2) Copy the Google BigQuery JDBC driver to the machine where you will run 
> Spark Shell
> 3) Start Spark shell loading the GoogleBigQuery JDBC driver jar files
> ./spark-shell --jars 
> /home/ec2-user/jdbc/gbq/GoogleBigQueryJDBC42.jar,/home/ec2-user/jdbc/gbq/google-api-client-1.22.0.jar,/home/ec2-user/jdbc/gbq/google-api-services-bigquery-v2-rev320-1.22.0.jar,/home/ec2-user/jdbc/gbq/google-http-client-1.22.0.jar,/home/ec2-user/jdbc/gbq/google-http-client-jackson2-1.22.0.jar,/home/ec2-user/jdbc/gbq/google-oauth-client-1.22.0.jar,/home/ec2-user/jdbc/gbq/jackson-core-2.1.3.jar
> 4) In Spark shell load the data from Google BigQuery using the JDBC driver
> val gbq = spark.read.format("jdbc").options(Map("url" -> 
> "jdbc:bigquery://https://www.googleapis.com/bigquery/v2;ProjectId=your-project-name-here;OAuthType=0;OAuthPvtKeyPath=/usr/lib/spark/YourProjectPrivateKey.json;OAuthServiceAcctEmail=YourEmail@gmail.comAllowLargeResults=1;LargeResultDataset=_bqodbc_temp_tables;LargeResultTable=_matthew;Timeout=600","dbtable";
>  -> 
> "test.lu_test_integer")).option("driver","com.simba.googlebigquery.jdbc42.Driver").option("user","").option("password","").load()
> 5) In Spark shell try to display the data
> gbq.show()
> At this point you should see the error:
> scala> gbq.show()
> 17/06/22 19:34:57 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 6, 
> ip-172-31-37-165.ec2.internal, executor 3): java.sql.SQLDataException: 
> [Simba][JDBC](10140) Error converting value to long.
> at com.simba.exceptions.ExceptionConverter.toSQLException(Unknown 
> Source)
> at com.simba.utilities.conversion.TypeConverter.toLong(Unknown Source)
> at com.simba.jdbc.common.SForwardResultSet.getLong(Unknown Source)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$8.apply(JdbcUtils.scala:365)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$8.apply(JdbcUtils.scala:364)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:286)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:268)
> at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
> at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.a

[jira] [Commented] (SPARK-19726) Failed to insert null timestamp value to MySQL using Spark JDBC

2017-06-23 Thread wangshuangshuang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16060558#comment-16060558
 ] 

wangshuangshuang commented on SPARK-19726:
--

How can I reproduce this issue via a test case rather than through MySQL JDBC? Can anyone help me?

> Failed to insert null timestamp value to MySQL using Spark JDBC
> --
>
> Key: SPARK-19726
> URL: https://issues.apache.org/jira/browse/SPARK-19726
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: AnfengYuan
>
> 1. create a table in mysql
> {code:borderStyle=solid}
> CREATE TABLE `timestamp_test` (
>   `id` bigint(23) DEFAULT NULL,
>   `time_stamp` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE 
> CURRENT_TIMESTAMP
> ) ENGINE=InnoDB DEFAULT CHARSET=utf8
> {code}
> 2. insert one row using spark
> {code:borderStyle=solid}
> CREATE OR REPLACE TEMPORARY VIEW jdbcTable
> USING org.apache.spark.sql.jdbc
> OPTIONS (
>   url 
> 'jdbc:mysql://xxx.xxx.xxx.xxx:3306/default?characterEncoding=utf8&useServerPrepStmts=false&rewriteBatchedStatements=true',
>   dbtable 'timestamp_test',
>   driver 'com.mysql.jdbc.Driver',
>   user 'root',
>   password 'root'
> );
> insert into jdbcTable values (1, null);
> {code}
> the insert statement failed with exceptions:
> {code:borderStyle=solid}
> Error: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 599 in stage 1.0 failed 4 times, most recent failure: Lost task 599.3 in 
> stage 1.0 (TID 1202, A03-R07-I12-135.JD.LOCAL): 
> java.sql.BatchUpdateException: Data truncation: Incorrect datetime value: 
> '1970-01-01 08:00:00' for column 'time_stamp' at row 1
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
>   at com.mysql.jdbc.Util.handleNewInstance(Util.java:404)
>   at com.mysql.jdbc.Util.getInstance(Util.java:387)
>   at 
> com.mysql.jdbc.SQLError.createBatchUpdateException(SQLError.java:1154)
>   at 
> com.mysql.jdbc.PreparedStatement.executeBatchedInserts(PreparedStatement.java:1582)
>   at 
> com.mysql.jdbc.PreparedStatement.executeBatchInternal(PreparedStatement.java:1248)
>   at com.mysql.jdbc.StatementImpl.executeBatch(StatementImpl.java:959)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:227)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:300)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:299)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:902)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:902)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1899)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1899)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: com.mysql.jdbc.MysqlDataTruncation: Data truncation: Incorrect 
> datetime value: '1970-01-01 08:00:00' for column 'time_stamp' at row 1
>   at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3876)
>   at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3814)
>   at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2478)
>   at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2625)
>   at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2551)
>   at 
> com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1861)
>   at 
> com.mysql.jdbc.PreparedStatement.executeUpdateInternal(PreparedStatement.java:2073)
>   at 
> com.mysql.jdbc.PreparedStatement.executeUpdateInternal(PreparedStatement.java:2009)
>   at 
> com.mysql.jdbc.PreparedStatement.executeLargeUpdate(PreparedStatement.java:5094)
>   at 
> com.mysql.jdbc.PreparedStatement.executeBatchedInserts(PreparedStatement.java:1543)
>   ... 15 more
> {code}




[jira] [Commented] (SPARK-19726) Failed to insert null timestamp value to MySQL using Spark JDBC

2017-06-23 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16060562#comment-16060562
 ] 

Sean Owen commented on SPARK-19726:
---

I'm not sure this is a Spark problem. You're trying to insert a null into a 
non-null column. Something should go wrong.
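
For reference, a hypothetical DataFrame-writer equivalent of the failing SQL 
insert (a sketch only; it exercises the same JdbcUtils.savePartition path seen 
in the stack trace below, and assumes a reachable MySQL instance with the 
{{timestamp_test}} table above, the MySQL JDBC driver on the classpath, and a 
running SparkSession named {{spark}}):

{code}
# Sketch: append a row whose NOT NULL datetime column is null via Spark's JDBC sink.
from pyspark.sql.types import StructType, StructField, LongType, TimestampType

schema = StructType([
    StructField("id", LongType()),
    StructField("time_stamp", TimestampType()),  # NOT NULL on the MySQL side
])
df = spark.createDataFrame([(1, None)], schema)

(df.write
   .mode("append")
   .jdbc("jdbc:mysql://xxx.xxx.xxx.xxx:3306/default", "timestamp_test",
         properties={"user": "root", "password": "root",
                     "driver": "com.mysql.jdbc.Driver"}))
{code}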

> Failed to insert null timestamp value to MySQL using Spark JDBC
> --
>
> Key: SPARK-19726
> URL: https://issues.apache.org/jira/browse/SPARK-19726
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: AnfengYuan
>
> 1. create a table in mysql
> {code:borderStyle=solid}
> CREATE TABLE `timestamp_test` (
>   `id` bigint(23) DEFAULT NULL,
>   `time_stamp` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE 
> CURRENT_TIMESTAMP
> ) ENGINE=InnoDB DEFAULT CHARSET=utf8
> {code}
> 2. insert one row using spark
> {code:borderStyle=solid}
> CREATE OR REPLACE TEMPORARY VIEW jdbcTable
> USING org.apache.spark.sql.jdbc
> OPTIONS (
>   url 
> 'jdbc:mysql://xxx.xxx.xxx.xxx:3306/default?characterEncoding=utf8&useServerPrepStmts=false&rewriteBatchedStatements=true',
>   dbtable 'timestamp_test',
>   driver 'com.mysql.jdbc.Driver',
>   user 'root',
>   password 'root'
> );
> insert into jdbcTable values (1, null);
> {code}
> the insert statement failed with exceptions:
> {code:borderStyle=solid}
> Error: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 599 in stage 1.0 failed 4 times, most recent failure: Lost task 599.3 in 
> stage 1.0 (TID 1202, A03-R07-I12-135.JD.LOCAL): 
> java.sql.BatchUpdateException: Data truncation: Incorrect datetime value: 
> '1970-01-01 08:00:00' for column 'time_stamp' at row 1
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
>   at com.mysql.jdbc.Util.handleNewInstance(Util.java:404)
>   at com.mysql.jdbc.Util.getInstance(Util.java:387)
>   at 
> com.mysql.jdbc.SQLError.createBatchUpdateException(SQLError.java:1154)
>   at 
> com.mysql.jdbc.PreparedStatement.executeBatchedInserts(PreparedStatement.java:1582)
>   at 
> com.mysql.jdbc.PreparedStatement.executeBatchInternal(PreparedStatement.java:1248)
>   at com.mysql.jdbc.StatementImpl.executeBatch(StatementImpl.java:959)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:227)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:300)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:299)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:902)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:902)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1899)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1899)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: com.mysql.jdbc.MysqlDataTruncation: Data truncation: Incorrect 
> datetime value: '1970-01-01 08:00:00' for column 'time_stamp' at row 1
>   at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3876)
>   at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3814)
>   at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2478)
>   at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2625)
>   at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2551)
>   at 
> com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1861)
>   at 
> com.mysql.jdbc.PreparedStatement.executeUpdateInternal(PreparedStatement.java:2073)
>   at 
> com.mysql.jdbc.PreparedStatement.executeUpdateInternal(PreparedStatement.java:2009)
>   at 
> com.mysql.jdbc.PreparedStatement.executeLargeUpdate(PreparedStatement.java:5094)
>   at 
> com.mysql.jdbc.PreparedStatement.executeBatchedInserts(PreparedStatement.java:1543)
>   ... 15 more
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

[jira] [Resolved] (SPARK-21185) Spurious errors in unidoc causing PRs to fail

2017-06-23 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21185.
---
Resolution: Duplicate

Yes, the underlying issue is that javadoc 8 is stricter. We have a few issues 
about that and [~hyukjin.kwon] has been cleaning that up as fast as possible. 

As to warn vs error, that's weird, but I suspect it's because some build 
machines are actually running Java 8 vs 7?

> Spurious errors in unidoc causing PRs to fail
> -
>
> Key: SPARK-21185
> URL: https://issues.apache.org/jira/browse/SPARK-21185
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.2.0
>Reporter: Tathagata Das
>Assignee: Hyukjin Kwon
>
> Some PRs are failing because of unidoc throwing random errors. When 
> GenJavaDoc generates Java files from Scala files, the generated java files 
> can have errors in them. When JavaDoc attempts to generate docs on these 
> generated java files, it throws errors. Usually, the errors are marked as 
> warnings, so the unidoc does not fail the build. 
> Example - 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78270/consoleFull
> {code}
> [info] Constructing Javadoc information...
> [warn] 
> /home/jenkins/workspace/SparkPullRequestBuilder/core/target/java/org/apache/spark/scheduler/BlacklistTracker.java:117:
>  error: ExecutorAllocationClient is not public in org.apache.spark; cannot be 
> accessed from outside package
> [warn]   public   BlacklistTracker 
> (org.apache.spark.scheduler.LiveListenerBus listenerBus, 
> org.apache.spark.SparkConf conf, 
> scala.Option allocationClient, 
> org.apache.spark.util.Clock clock)  { throw new RuntimeException(); }
> [warn]
> ^
> [warn] 
> /home/jenkins/workspace/SparkPullRequestBuilder/core/target/java/org/apache/spark/scheduler/BlacklistTracker.java:118:
>  error: ExecutorAllocationClient is not public in org.apache.spark; cannot be 
> accessed from outside package
> [warn]   public   BlacklistTracker (org.apache.spark.SparkContext sc, 
> scala.Option allocationClient)  { 
> throw new RuntimeException(); }
> {code}
> However in some PR builds these are marked as errors, thus causing the build 
> to fail due to unidoc. Example - 
> https://github.com/apache/spark/pull/18355
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78484/consoleFull
> {code}
> [info] Constructing Javadoc information...
> [error] 
> /home/jenkins/workspace/SparkPullRequestBuilder@3/core/target/java/org/apache/spark/scheduler/BlacklistTracker.java:117:
>  error: ExecutorAllocationClient is not public in org.apache.spark; cannot be 
> accessed from outside package
> [error]   public   BlacklistTracker 
> (org.apache.spark.scheduler.LiveListenerBus listenerBus, 
> org.apache.spark.SparkConf conf, 
> scala.Option allocationClient, 
> org.apache.spark.util.Clock clock)  { throw new RuntimeException(); }
> [error]   
>  ^
> [error] 
> /home/jenkins/workspace/SparkPullRequestBuilder@3/core/target/java/org/apache/spark/scheduler/BlacklistTracker.java:118:
>  error: ExecutorAllocationClient is not public in org.apache.spark; cannot be 
> accessed from outside package
> [error]   public   BlacklistTracker (org.apache.spark.SparkContext sc, 
> scala.Option allocationClient)  { 
> throw new RuntimeException(); }
> [error] 
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21152) Use level 3 BLAS operations in LogisticAggregator

2017-06-23 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16060603#comment-16060603
 ] 

Yanbo Liang commented on SPARK-21152:
-

[~sethah] This is an interesting topic, thanks for working on it. Could you 
show a performance comparison with respect to the size and type of the data? 
AFAIK, the most common use case for MLlib LR is training on {{low dimensional 
dense/sparse or high dimensional sparse}} data. If the blocked gradient update 
gives a significant performance improvement for these cases, I think it's worth the 
investment. Thanks.

> Use level 3 BLAS operations in LogisticAggregator
> -
>
> Key: SPARK-21152
> URL: https://issues.apache.org/jira/browse/SPARK-21152
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.1.1
>Reporter: Seth Hendrickson
>
> In the logistic regression gradient update, we currently compute the update row 
> by row. If we block rows together, we can do a blocked gradient 
> update that leverages the BLAS GEMM operation.
> On high dimensional dense datasets, I've observed ~10x speedups. The problem 
> here, though, is that it likely won't improve the sparse case so we need to 
> keep both implementations around, and this blocked algorithm will require 
> caching a new dataset of type:
> {code}
> BlockInstance(label: Vector, weight: Vector, features: Matrix)
> {code}
> We have avoided caching anything besides the original dataset passed to train 
> in the past because it adds memory overhead if the user has cached this 
> original dataset for other reasons. Here, I'd like to discuss whether we 
> think this patch would be worth the investment, given that it only improves a 
> subset of the use cases.
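
A minimal sketch of the blocked idea, using only the public org.apache.spark.ml.linalg matrix API: for brevity it shows the binary case with one matrix-vector product per block (the multinomial case would use a full GEMM against the coefficient matrix), and the InstanceBlock container and method names are illustrative assumptions, not the actual LogisticAggregator code.

{code:scala}
import org.apache.spark.ml.linalg.{DenseMatrix, DenseVector}

// Hypothetical container for a block of rows, mirroring the proposed
// BlockInstance(label, weight, features) layout.
case class InstanceBlock(labels: Array[Double], weights: Array[Double], features: DenseMatrix)

// Gradient contribution of one block computed with two matrix-vector products
// instead of one dot product per row (binary logistic loss, no intercept).
def blockGradient(block: InstanceBlock, coefficients: DenseVector): Array[Double] = {
  val margins = block.features.multiply(coefficients)  // X * beta in one BLAS call
  val multipliers = new Array[Double](block.labels.length)
  var i = 0
  while (i < multipliers.length) {
    val p = 1.0 / (1.0 + math.exp(-margins(i)))
    multipliers(i) = block.weights(i) * (p - block.labels(i))
    i += 1
  }
  // X^T * multipliers, again a single BLAS call over the whole block.
  block.features.transpose.multiply(new DenseVector(multipliers)).toArray
}
{code}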



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21191) DataFrame Row StructType check duplicate name

2017-06-23 Thread darion yaphet (JIRA)
darion yaphet created SPARK-21191:
-

 Summary: DataFrame Row StructType check duplicate name
 Key: SPARK-21191
 URL: https://issues.apache.org/jira/browse/SPARK-21191
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.1, 2.0.2, 2.0.0
Reporter: darion yaphet


Currently, when we create a DataFrame with *toDF(columns: String)* or build it from a 
*StructType*, we can't avoid duplicate column names (see the sketch after the example output below).

{code:scala}
val dataset = Seq(
  (0, 3, 4),
  (0, 4, 3),
  (0, 5, 2),
  (1, 3, 3),
  (1, 5, 6),
  (1, 4, 2),
  (2, 3, 5),
  (2, 5, 4),
  (2, 4, 3)
).toDF("1", "1", "2").show
{code}

{code}
+---+---+---+
|  1|  1|  2|
+---+---+---+
|  0|  3|  4|
|  0|  4|  3|
|  0|  5|  2|
|  1|  3|  3|
|  1|  5|  6|
|  1|  4|  2|
|  2|  3|  5|
|  2|  5|  4|
|  2|  4|  3|
+---+---+---+
{code}
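
The kind of check this issue asks for could look roughly like the following minimal sketch. It is not Spark's current behavior (duplicate column names are allowed until they are referenced ambiguously), and the helper name is made up for illustration.

{code:scala}
import org.apache.spark.sql.types.StructType

// Reject a schema whose field names collide before a DataFrame is built from it.
def assertNoDuplicateNames(schema: StructType): Unit = {
  val duplicates = schema.fieldNames
    .groupBy(identity)
    .collect { case (name, occurrences) if occurrences.length > 1 => name }
  require(duplicates.isEmpty, s"Duplicate column name(s): ${duplicates.mkString(", ")}")
}
{code}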



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21191) DataFrame Row StructType check duplicate name

2017-06-23 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16060629#comment-16060629
 ] 

Sean Owen commented on SPARK-21191:
---

What is the issue? You have told it to use duplicate names. 

> DataFrame Row StructType check duplicate name
> -
>
> Key: SPARK-21191
> URL: https://issues.apache.org/jira/browse/SPARK-21191
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.2, 2.1.1
>Reporter: darion yaphet
>
> Currently, when we create a DataFrame with *toDF(columns: String)* or build 
> it from a *StructType*, we can't avoid duplicate column names.
> {code:scala}
> val dataset = Seq(
>   (0, 3, 4),
>   (0, 4, 3),
>   (0, 5, 2),
>   (1, 3, 3),
>   (1, 5, 6),
>   (1, 4, 2),
>   (2, 3, 5),
>   (2, 5, 4),
>   (2, 4, 3)
> ).toDF("1", "1", "2").show
> {code}
> {code}
> +---+---+---+
> |  1|  1|  2|
> +---+---+---+
> |  0|  3|  4|
> |  0|  4|  3|
> |  0|  5|  2|
> |  1|  3|  3|
> |  1|  5|  6|
> |  1|  4|  2|
> |  2|  3|  5|
> |  2|  5|  4|
> |  2|  4|  3|
> +---+---+---+
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21191) DataFrame Row StructType check duplicate name

2017-06-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21191:


Assignee: Apache Spark

> DataFrame Row StructType check duplicate name
> -
>
> Key: SPARK-21191
> URL: https://issues.apache.org/jira/browse/SPARK-21191
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.2, 2.1.1
>Reporter: darion yaphet
>Assignee: Apache Spark
>
> Currently, when we create a DataFrame with *toDF(columns: String)* or build 
> it from a *StructType*, we can't avoid duplicate column names.
> {code:scala}
> val dataset = Seq(
>   (0, 3, 4),
>   (0, 4, 3),
>   (0, 5, 2),
>   (1, 3, 3),
>   (1, 5, 6),
>   (1, 4, 2),
>   (2, 3, 5),
>   (2, 5, 4),
>   (2, 4, 3)
> ).toDF("1", "1", "2").show
> {code}
> {code}
> +---+---+---+
> |  1|  1|  2|
> +---+---+---+
> |  0|  3|  4|
> |  0|  4|  3|
> |  0|  5|  2|
> |  1|  3|  3|
> |  1|  5|  6|
> |  1|  4|  2|
> |  2|  3|  5|
> |  2|  5|  4|
> |  2|  4|  3|
> +---+---+---+
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21191) DataFrame Row StructType check duplicate name

2017-06-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21191:


Assignee: (was: Apache Spark)

> DataFrame Row StructType check duplicate name
> -
>
> Key: SPARK-21191
> URL: https://issues.apache.org/jira/browse/SPARK-21191
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.2, 2.1.1
>Reporter: darion yaphet
>
> Currently, when we create a DataFrame with *toDF(columns: String)* or build 
> it from a *StructType*, we can't avoid duplicate column names.
> {code:scala}
> val dataset = Seq(
>   (0, 3, 4),
>   (0, 4, 3),
>   (0, 5, 2),
>   (1, 3, 3),
>   (1, 5, 6),
>   (1, 4, 2),
>   (2, 3, 5),
>   (2, 5, 4),
>   (2, 4, 3)
> ).toDF("1", "1", "2").show
> {code}
> {code}
> +---+---+---+
> |  1|  1|  2|
> +---+---+---+
> |  0|  3|  4|
> |  0|  4|  3|
> |  0|  5|  2|
> |  1|  3|  3|
> |  1|  5|  6|
> |  1|  4|  2|
> |  2|  3|  5|
> |  2|  5|  4|
> |  2|  4|  3|
> +---+---+---+
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21191) DataFrame Row StructType check duplicate name

2017-06-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16060689#comment-16060689
 ] 

Apache Spark commented on SPARK-21191:
--

User 'darionyaphet' has created a pull request for this issue:
https://github.com/apache/spark/pull/18401

> DataFrame Row StructType check duplicate name
> -
>
> Key: SPARK-21191
> URL: https://issues.apache.org/jira/browse/SPARK-21191
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.2, 2.1.1
>Reporter: darion yaphet
>
> Currently, when we create a DataFrame with *toDF(columns: String)* or build 
> it from a *StructType*, we can't avoid duplicate column names.
> {code:scala}
> val dataset = Seq(
>   (0, 3, 4),
>   (0, 4, 3),
>   (0, 5, 2),
>   (1, 3, 3),
>   (1, 5, 6),
>   (1, 4, 2),
>   (2, 3, 5),
>   (2, 5, 4),
>   (2, 4, 3)
> ).toDF("1", "1", "2").show
> {code}
> {code}
> +---+---+---+
> |  1|  1|  2|
> +---+---+---+
> |  0|  3|  4|
> |  0|  4|  3|
> |  0|  5|  2|
> |  1|  3|  3|
> |  1|  5|  6|
> |  1|  4|  2|
> |  2|  3|  5|
> |  2|  5|  4|
> |  2|  4|  3|
> +---+---+---+
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21192) Preserve State Store provider class configuration across restarts

2017-06-23 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-21192:
-

 Summary: Preserve State Store provider class configuration across 
restarts
 Key: SPARK-21192
 URL: https://issues.apache.org/jira/browse/SPARK-21192
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 2.3.0
Reporter: Tathagata Das
Assignee: Tathagata Das


If the SQL conf for StateStore provider class is changed between restarts (i.e. 
query started with providerClass1 and attempted to restart using 
providerClass2), then the query will fail in an unpredictable way as files saved 
by one provider class cannot be used by the newer one. 

Ideally, the provider class used to start the query should be used to restart 
the query, and the configuration in the session where it is being restarted 
should be ignored. 
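
A minimal, hypothetical sketch of that idea: record the provider class with the checkpoint when the query first starts, and prefer the recorded value over the current session conf on every restart. The function and parameter names are illustrative assumptions, not Spark's actual StateStore or checkpoint API.

{code:scala}
// Decide which StateStore provider class a (re)started query should use.
def resolveProviderClass(
    sessionConfValue: String,
    checkpointedValue: Option[String]): String = {
  checkpointedValue match {
    case Some(recorded) =>
      // Restart: the state files on disk were written by `recorded`,
      // so only that class can read them back; ignore the session setting.
      recorded
    case None =>
      // First start: adopt the session setting and persist it with the checkpoint.
      sessionConfValue
  }
}
{code}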



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21192) Preserve State Store provider class configuration across StreamingQuery restarts

2017-06-23 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-21192:
--
Summary: Preserve State Store provider class configuration across 
StreamingQuery restarts  (was: Preserve State Store provider class 
configuration across restarts)

> Preserve State Store provider class configuration across StreamingQuery 
> restarts
> 
>
> Key: SPARK-21192
> URL: https://issues.apache.org/jira/browse/SPARK-21192
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>
> If the SQL conf for StateStore provider class is changed between restarts 
> (i.e. query started with providerClass1 and attempted to restart using 
> providerClass2), then the query will fail in an unpredictable way as files 
> saved by one provider class cannot be used by the newer one. 
> Ideally, the provider class used to start the query should be used to restart 
> the query, and the configuration in the session where it is being restarted 
> should be ignored. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21192) Preserve State Store provider class configuration across StreamingQuery restarts

2017-06-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21192:


Assignee: Apache Spark  (was: Tathagata Das)

> Preserve State Store provider class configuration across StreamingQuery 
> restarts
> 
>
> Key: SPARK-21192
> URL: https://issues.apache.org/jira/browse/SPARK-21192
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Tathagata Das
>Assignee: Apache Spark
>
> If the SQL conf for StateStore provider class is changed between restarts 
> (i.e. query started with providerClass1 and attempted to restart using 
> providerClass2), then the query will fail in an unpredictable way as files 
> saved by one provider class cannot be used by the newer one. 
> Ideally, the provider class used to start the query should be used to restart 
> the query, and the configuration in the session where it is being restarted 
> should be ignored. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21192) Preserve State Store provider class configuration across StreamingQuery restarts

2017-06-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21192:


Assignee: Tathagata Das  (was: Apache Spark)

> Preserve State Store provider class configuration across StreamingQuery 
> restarts
> 
>
> Key: SPARK-21192
> URL: https://issues.apache.org/jira/browse/SPARK-21192
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>
> If the SQL conf for StateStore provider class is changed between restarts 
> (i.e. query started with providerClass1 and attempted to restart using 
> providerClass2), then the query will fail in an unpredictable way as files 
> saved by one provider class cannot be used by the newer one. 
> Ideally, the provider class used to start the query should be used to restart 
> the query, and the configuration in the session where it is being restarted 
> should be ignored. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21192) Preserve State Store provider class configuration across StreamingQuery restarts

2017-06-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16060699#comment-16060699
 ] 

Apache Spark commented on SPARK-21192:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/18402

> Preserve State Store provider class configuration across StreamingQuery 
> restarts
> 
>
> Key: SPARK-21192
> URL: https://issues.apache.org/jira/browse/SPARK-21192
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>
> If the SQL conf for StateStore provider class is changed between restarts 
> (i.e. query started with providerClass1 and attempted to restart using 
> providerClass2), then the query will fail in an unpredictable way as files 
> saved by one provider class cannot be used by the newer one. 
> Ideally, the provider class used to start the query should be used to restart 
> the query, and the configuration in the session where it is being restarted 
> should be ignored. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21192) Preserve State Store provider class configuration across StreamingQuery restarts

2017-06-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21192:


Assignee: Apache Spark  (was: Tathagata Das)

> Preserve State Store provider class configuration across StreamingQuery 
> restarts
> 
>
> Key: SPARK-21192
> URL: https://issues.apache.org/jira/browse/SPARK-21192
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Tathagata Das
>Assignee: Apache Spark
>
> If the SQL conf for StateStore provider class is changed between restarts 
> (i.e. query started with providerClass1 and attempted to restart using 
> providerClass2), then the query will fail in an unpredictable way as files 
> saved by one provider class cannot be used by the newer one. 
> Ideally, the provider class used to start the query should be used to restart 
> the query, and the configuration in the session where it is being restarted 
> should be ignored. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21193) Specify Pandas version in setup.py

2017-06-23 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-21193:


 Summary: Specify Pandas version in setup.py
 Key: SPARK-21193
 URL: https://issues.apache.org/jira/browse/SPARK-21193
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 2.2.0
Reporter: Hyukjin Kwon
Priority: Minor


It looks like we don't specify a Pandas version in 
https://github.com/apache/spark/blob/master/python/setup.py#L202. It looks like a few 
versions no longer work with Spark. It might be better to set this 
explicitly.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21193) Specify Pandas version in setup.py

2017-06-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21193:


Assignee: Apache Spark

> Specify Pandas version in setup.py
> --
>
> Key: SPARK-21193
> URL: https://issues.apache.org/jira/browse/SPARK-21193
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Minor
>
> It looks like we don't specify a Pandas version in 
> https://github.com/apache/spark/blob/master/python/setup.py#L202. It looks like 
> a few versions no longer work with Spark. It might be better to set this 
> explicitly.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21193) Specify Pandas version in setup.py

2017-06-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16060719#comment-16060719
 ] 

Apache Spark commented on SPARK-21193:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/18403

> Specify Pandas version in setup.py
> --
>
> Key: SPARK-21193
> URL: https://issues.apache.org/jira/browse/SPARK-21193
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> It looks like we don't specify a Pandas version in 
> https://github.com/apache/spark/blob/master/python/setup.py#L202. It looks like 
> a few versions no longer work with Spark. It might be better to set this 
> explicitly.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21193) Specify Pandas version in setup.py

2017-06-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21193:


Assignee: (was: Apache Spark)

> Specify Pandas version in setup.py
> --
>
> Key: SPARK-21193
> URL: https://issues.apache.org/jira/browse/SPARK-21193
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> It looks like we don't specify a Pandas version in 
> https://github.com/apache/spark/blob/master/python/setup.py#L202. It looks like 
> a few versions no longer work with Spark. It might be better to set this 
> explicitly.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21191) DataFrame Row StructType check duplicate name

2017-06-23 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16060724#comment-16060724
 ] 

Takeshi Yamamuro commented on SPARK-21191:
--

I think this is by design: https://issues.apache.org/jira/browse/SPARK-16347

> DataFrame Row StructType check duplicate name
> -
>
> Key: SPARK-21191
> URL: https://issues.apache.org/jira/browse/SPARK-21191
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.2, 2.1.1
>Reporter: darion yaphet
>
> Currently, when we create a DataFrame with *toDF(columns: String)* or build 
> it from a *StructType*, we can't avoid duplicate column names.
> {code:scala}
> val dataset = Seq(
>   (0, 3, 4),
>   (0, 4, 3),
>   (0, 5, 2),
>   (1, 3, 3),
>   (1, 5, 6),
>   (1, 4, 2),
>   (2, 3, 5),
>   (2, 5, 4),
>   (2, 4, 3)
> ).toDF("1", "1", "2").show
> {code}
> {code}
> +---+---+---+
> |  1|  1|  2|
> +---+---+---+
> |  0|  3|  4|
> |  0|  4|  3|
> |  0|  5|  2|
> |  1|  3|  3|
> |  1|  5|  6|
> |  1|  4|  2|
> |  2|  3|  5|
> |  2|  5|  4|
> |  2|  4|  3|
> +---+---+---+
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21191) DataFrame Row StructType check duplicate name

2017-06-23 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21191.
---
Resolution: Duplicate

> DataFrame Row StructType check duplicate name
> -
>
> Key: SPARK-21191
> URL: https://issues.apache.org/jira/browse/SPARK-21191
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.2, 2.1.1
>Reporter: darion yaphet
>
> Currently, when we create a DataFrame with *toDF(columns: String)* or build 
> it from a *StructType*, we can't avoid duplicate column names.
> {code:scala}
> val dataset = Seq(
>   (0, 3, 4),
>   (0, 4, 3),
>   (0, 5, 2),
>   (1, 3, 3),
>   (1, 5, 6),
>   (1, 4, 2),
>   (2, 3, 5),
>   (2, 5, 4),
>   (2, 4, 3)
> ).toDF("1", "1", "2").show
> {code}
> {code}
> +---+---+---+
> |  1|  1|  2|
> +---+---+---+
> |  0|  3|  4|
> |  0|  4|  3|
> |  0|  5|  2|
> |  1|  3|  3|
> |  1|  5|  6|
> |  1|  4|  2|
> |  2|  3|  5|
> |  2|  5|  4|
> |  2|  4|  3|
> +---+---+---+
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19726) Failed to insert null timestamp value to mysql using spark jdbc

2017-06-23 Thread Chang chen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16060739#comment-16060739
 ] 

Chang chen commented on SPARK-19726:


@srowen, as we investigated, Spark translates null to 0 for timestamp columns, 
which is wrong, since 0 is interpreted as "1970-01-01 00:00:00".

{code:title=UnsafeRow.java|borderStyle=solid}
  public void setNullAt(int i) {
    assertIndexIsValid(i);
    BitSetMethods.set(baseObject, baseOffset, i);
    // To preserve row equality, zero out the value when setting the column to null.
    // Since this row does not currently support updates to variable-length values,
    // we don't have to worry about zeroing out that data.
    Platform.putLong(baseObject, getFieldOffset(i), 0);
  }
{code}

Yes, the user inserts null into a non-null column, but either:
# Spark should pass *null* to the underlying DB engine instead of 0, and then let the DB report the error, or
# Spark should report the error by itself.
A minimal sketch of option 1 follows.
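
The sketch below is illustrative only and assumes the value is bound through plain JDBC; it is not the actual JdbcUtils.savePartition code, and the helper name is made up. The point is to bind SQL NULL instead of a zeroed long, so the database itself decides whether NULL is allowed for the column.

{code:scala}
import java.sql.{PreparedStatement, Types}

// Bind a possibly-null timestamp at position `pos` of a prepared INSERT statement.
def setNullableTimestamp(stmt: PreparedStatement, pos: Int, value: java.sql.Timestamp): Unit = {
  if (value == null) {
    stmt.setNull(pos, Types.TIMESTAMP)  // pass NULL through; MySQL rejects it for NOT NULL columns
  } else {
    stmt.setTimestamp(pos, value)
  }
}
{code}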

> Failed to insert null timestamp value to mysql using spark jdbc
> --
>
> Key: SPARK-19726
> URL: https://issues.apache.org/jira/browse/SPARK-19726
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: AnfengYuan
>
> 1. create a table in mysql
> {code:borderStyle=solid}
> CREATE TABLE `timestamp_test` (
>   `id` bigint(23) DEFAULT NULL,
>   `time_stamp` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE 
> CURRENT_TIMESTAMP
> ) ENGINE=InnoDB DEFAULT CHARSET=utf8
> {code}
> 2. insert one row using spark
> {code:borderStyle=solid}
> CREATE OR REPLACE TEMPORARY VIEW jdbcTable
> USING org.apache.spark.sql.jdbc
> OPTIONS (
>   url 
> 'jdbc:mysql://xxx.xxx.xxx.xxx:3306/default?characterEncoding=utf8&useServerPrepStmts=false&rewriteBatchedStatements=true',
>   dbtable 'timestamp_test',
>   driver 'com.mysql.jdbc.Driver',
>   user 'root',
>   password 'root'
> );
> insert into jdbcTable values (1, null);
> {code}
> the insert statement failed with exceptions:
> {code:borderStyle=solid}
> Error: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 599 in stage 1.0 failed 4 times, most recent failure: Lost task 599.3 in 
> stage 1.0 (TID 1202, A03-R07-I12-135.JD.LOCAL): 
> java.sql.BatchUpdateException: Data truncation: Incorrect datetime value: 
> '1970-01-01 08:00:00' for column 'time_stamp' at row 1
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
>   at com.mysql.jdbc.Util.handleNewInstance(Util.java:404)
>   at com.mysql.jdbc.Util.getInstance(Util.java:387)
>   at 
> com.mysql.jdbc.SQLError.createBatchUpdateException(SQLError.java:1154)
>   at 
> com.mysql.jdbc.PreparedStatement.executeBatchedInserts(PreparedStatement.java:1582)
>   at 
> com.mysql.jdbc.PreparedStatement.executeBatchInternal(PreparedStatement.java:1248)
>   at com.mysql.jdbc.StatementImpl.executeBatch(StatementImpl.java:959)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:227)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:300)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:299)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:902)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:902)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1899)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1899)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: com.mysql.jdbc.MysqlDataTruncation: Data truncation: Incorrect 
> datetime value: '1970-01-01 08:00:00' for column 'time_stamp' at row 1
>   at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3876)
>   at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3814)
>   at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2478)
>   at com.mys

[jira] [Commented] (SPARK-21184) QuantileSummaries implementation is wrong and QuantileSummariesSuite fails with larger n

2017-06-23 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16060741#comment-16060741
 ] 

Sean Owen commented on SPARK-21184:
---

[~timhunter] [~clockfly] do you have opinions on this?
Probably best to first make sure the implementation works. Then if there's a 
variation that's better, move to that if possible. I don't think we'd 
reimplement completely.

> QuantileSummaries implementation is wrong and QuantileSummariesSuite fails 
> with larger n
> 
>
> Key: SPARK-21184
> URL: https://issues.apache.org/jira/browse/SPARK-21184
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Andrew Ray
>
> 1. QuantileSummaries implementation does not match the paper it is supposed 
> to be based on.
> 1a. The compress method 
> (https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/QuantileSummaries.scala#L240)
>  merges neighboring buckets, but that's not what the paper says to do. The 
> paper 
> (http://infolab.stanford.edu/~datar/courses/cs361a/papers/quantiles.pdf) 
> describes an implicit tree structure and the compress method deletes selected 
> subtrees.
> 1b. The paper does not discuss merging these summary data structures at all. 
> The following comment is in the merge method of QuantileSummaries:
> {quote}  // The GK algorithm is a bit unclear about it, but it seems 
> there is no need to adjust the
>   // statistics during the merging: the invariants are still respected 
> after the merge.{quote}
> Unless I'm missing something that needs substantiation, it's not clear that 
> the invariants hold.
> 2. QuantileSummariesSuite fails with n = 1 (and other non-trivial values)
> https://github.com/apache/spark/blob/master/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/QuantileSummariesSuite.scala#L27
> One possible solution if these issues can't be resolved would be to move to 
> an algorithm that explicitly supports merging and is well tested like 
> https://github.com/tdunning/t-digest
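
For reference, the property at stake can be stated independently of Spark's QuantileSummaries API. The following is a minimal sketch (the helper name and the +1 rounding slack are illustrative assumptions) of the epsilon-approximation guarantee the suite is effectively checking: the value returned for a target quantile must have a true rank within epsilon * n of the target rank.

{code:scala}
// True iff `answer` is an acceptable epsilon-approximate phi-quantile of `data`.
def withinGuarantee(data: Array[Double], answer: Double, phi: Double, epsilon: Double): Boolean = {
  val sorted = data.sorted
  val n = sorted.length
  val rank = sorted.count(_ <= answer)          // true rank of the returned value
  math.abs(rank - phi * n) <= epsilon * n + 1   // +1 tolerates integer rounding at the edges
}
{code}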



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18004) DataFrame filter Predicate push-down fails for Oracle Timestamp type columns

2017-06-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18004:


Assignee: (was: Apache Spark)

> DataFrame filter Predicate push-down fails for Oracle Timestamp type columns
> 
>
> Key: SPARK-18004
> URL: https://issues.apache.org/jira/browse/SPARK-18004
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Suhas Nalapure
>Priority: Critical
>
> DataFrame filter Predicate push-down fails for Oracle Timestamp type columns 
> with Exception java.sql.SQLDataException: ORA-01861: literal does not match 
> format string:
> Java source code (this code works fine for mysql & mssql databases) :
> {noformat}
> //DataFrame df = create a DataFrame over an Oracle table
> df = df.filter(df.col("TS").lt(new 
> java.sql.Timestamp(System.currentTimeMillis(;
>   df.explain();
>   df.show();
> {noformat}
> Log statements with the Exception:
> {noformat}
> Schema: root
>  |-- ID: string (nullable = false)
>  |-- TS: timestamp (nullable = true)
>  |-- DEVICE_ID: string (nullable = true)
>  |-- REPLACEMENT: string (nullable = true)
> {noformat}
> {noformat}
> == Physical Plan ==
> Filter (TS#1 < 1476861841934000)
> +- Scan 
> JDBCRelation(jdbc:oracle:thin:@10.0.0.111:1521:orcl,ORATABLE,[Lorg.apache.spark.Partition;@78c74647,{user=user,
>  password=pwd, url=jdbc:oracle:thin:@10.0.0.111:1521:orcl, dbtable=ORATABLE, 
> driver=oracle.jdbc.driver.OracleDriver})[ID#0,TS#1,DEVICE_ID#2,REPLACEMENT#3] 
> PushedFilters: [LessThan(TS,2016-10-19 12:54:01.934)]
> 2016-10-19 12:54:04,268 ERROR [Executor task launch worker-0] 
> org.apache.spark.executor.Executor
> Exception in task 0.0 in stage 0.0 (TID 0)
> java.sql.SQLDataException: ORA-01861: literal does not match format string
>   at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:461)
>   at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:402)
>   at oracle.jdbc.driver.T4C8Oall.processError(T4C8Oall.java:1065)
>   at oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:681)
>   at oracle.jdbc.driver.T4CTTIfun.doRPC(T4CTTIfun.java:256)
>   at oracle.jdbc.driver.T4C8Oall.doOALL(T4C8Oall.java:577)
>   at 
> oracle.jdbc.driver.T4CPreparedStatement.doOall8(T4CPreparedStatement.java:239)
>   at 
> oracle.jdbc.driver.T4CPreparedStatement.doOall8(T4CPreparedStatement.java:75)
>   at 
> oracle.jdbc.driver.T4CPreparedStatement.executeForDescribe(T4CPreparedStatement.java:1043)
>   at 
> oracle.jdbc.driver.OracleStatement.executeMaybeDescribe(OracleStatement.java:)
>   at 
> oracle.jdbc.driver.OracleStatement.doExecuteWithTimeout(OracleStatement.java:1353)
>   at 
> oracle.jdbc.driver.OraclePreparedStatement.executeInternal(OraclePreparedStatement.java:4485)
>   at 
> oracle.jdbc.driver.OraclePreparedStatement.executeQuery(OraclePreparedStatement.java:4566)
>   at 
> oracle.jdbc.driver.OraclePreparedStatementWrapper.executeQuery(OraclePreparedStatementWrapper.java:5251)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.(JDBCRDD.scala:383)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:359)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
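
One direction such a fix could take, sketched here under assumptions (the helper below is illustrative and is not Spark's JDBC dialect API): when a Timestamp filter is pushed down to Oracle, render the value as an explicit TO_TIMESTAMP literal with a fixed format mask rather than a bare string, so the comparison no longer depends on the session's NLS date format.

{code:scala}
import java.sql.Timestamp

// Render a timestamp as an Oracle literal with an explicit format mask.
def oracleTimestampLiteral(ts: Timestamp): String =
  s"TO_TIMESTAMP('${ts.toString}', 'YYYY-MM-DD HH24:MI:SS.FF')"

// e.g. WHERE "TS" < TO_TIMESTAMP('2016-10-19 12:54:01.934', 'YYYY-MM-DD HH24:MI:SS.FF')
{code}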



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18004) DataFrame filter Predicate push-down fails for Oracle Timestamp type columns

2017-06-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18004:


Assignee: Apache Spark

> DataFrame filter Predicate push-down fails for Oracle Timestamp type columns
> 
>
> Key: SPARK-18004
> URL: https://issues.apache.org/jira/browse/SPARK-18004
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Suhas Nalapure
>Assignee: Apache Spark
>Priority: Critical
>
> DataFrame filter Predicate push-down fails for Oracle Timestamp type columns 
> with Exception java.sql.SQLDataException: ORA-01861: literal does not match 
> format string:
> Java source code (this code works fine for mysql & mssql databases) :
> {noformat}
> //DataFrame df = create a DataFrame over an Oracle table
> df = df.filter(df.col("TS").lt(new 
> java.sql.Timestamp(System.currentTimeMillis(;
>   df.explain();
>   df.show();
> {noformat}
> Log statements with the Exception:
> {noformat}
> Schema: root
>  |-- ID: string (nullable = false)
>  |-- TS: timestamp (nullable = true)
>  |-- DEVICE_ID: string (nullable = true)
>  |-- REPLACEMENT: string (nullable = true)
> {noformat}
> {noformat}
> == Physical Plan ==
> Filter (TS#1 < 1476861841934000)
> +- Scan 
> JDBCRelation(jdbc:oracle:thin:@10.0.0.111:1521:orcl,ORATABLE,[Lorg.apache.spark.Partition;@78c74647,{user=user,
>  password=pwd, url=jdbc:oracle:thin:@10.0.0.111:1521:orcl, dbtable=ORATABLE, 
> driver=oracle.jdbc.driver.OracleDriver})[ID#0,TS#1,DEVICE_ID#2,REPLACEMENT#3] 
> PushedFilters: [LessThan(TS,2016-10-19 12:54:01.934)]
> 2016-10-19 12:54:04,268 ERROR [Executor task launch worker-0] 
> org.apache.spark.executor.Executor
> Exception in task 0.0 in stage 0.0 (TID 0)
> java.sql.SQLDataException: ORA-01861: literal does not match format string
>   at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:461)
>   at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:402)
>   at oracle.jdbc.driver.T4C8Oall.processError(T4C8Oall.java:1065)
>   at oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:681)
>   at oracle.jdbc.driver.T4CTTIfun.doRPC(T4CTTIfun.java:256)
>   at oracle.jdbc.driver.T4C8Oall.doOALL(T4C8Oall.java:577)
>   at 
> oracle.jdbc.driver.T4CPreparedStatement.doOall8(T4CPreparedStatement.java:239)
>   at 
> oracle.jdbc.driver.T4CPreparedStatement.doOall8(T4CPreparedStatement.java:75)
>   at 
> oracle.jdbc.driver.T4CPreparedStatement.executeForDescribe(T4CPreparedStatement.java:1043)
>   at 
> oracle.jdbc.driver.OracleStatement.executeMaybeDescribe(OracleStatement.java:)
>   at 
> oracle.jdbc.driver.OracleStatement.doExecuteWithTimeout(OracleStatement.java:1353)
>   at 
> oracle.jdbc.driver.OraclePreparedStatement.executeInternal(OraclePreparedStatement.java:4485)
>   at 
> oracle.jdbc.driver.OraclePreparedStatement.executeQuery(OraclePreparedStatement.java:4566)
>   at 
> oracle.jdbc.driver.OraclePreparedStatementWrapper.executeQuery(OraclePreparedStatementWrapper.java:5251)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.(JDBCRDD.scala:383)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:359)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18004) DataFrame filter Predicate push-down fails for Oracle Timestamp type columns

2017-06-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16060795#comment-16060795
 ] 

Apache Spark commented on SPARK-18004:
--

User 'SharpRay' has created a pull request for this issue:
https://github.com/apache/spark/pull/18404

> DataFrame filter Predicate push-down fails for Oracle Timestamp type columns
> 
>
> Key: SPARK-18004
> URL: https://issues.apache.org/jira/browse/SPARK-18004
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Suhas Nalapure
>Priority: Critical
>
> DataFrame filter Predicate push-down fails for Oracle Timestamp type columns 
> with Exception java.sql.SQLDataException: ORA-01861: literal does not match 
> format string:
> Java source code (this code works fine for mysql & mssql databases) :
> {noformat}
> //DataFrame df = create a DataFrame over an Oracle table
> df = df.filter(df.col("TS").lt(new 
> java.sql.Timestamp(System.currentTimeMillis(;
>   df.explain();
>   df.show();
> {noformat}
> Log statements with the Exception:
> {noformat}
> Schema: root
>  |-- ID: string (nullable = false)
>  |-- TS: timestamp (nullable = true)
>  |-- DEVICE_ID: string (nullable = true)
>  |-- REPLACEMENT: string (nullable = true)
> {noformat}
> {noformat}
> == Physical Plan ==
> Filter (TS#1 < 1476861841934000)
> +- Scan 
> JDBCRelation(jdbc:oracle:thin:@10.0.0.111:1521:orcl,ORATABLE,[Lorg.apache.spark.Partition;@78c74647,{user=user,
>  password=pwd, url=jdbc:oracle:thin:@10.0.0.111:1521:orcl, dbtable=ORATABLE, 
> driver=oracle.jdbc.driver.OracleDriver})[ID#0,TS#1,DEVICE_ID#2,REPLACEMENT#3] 
> PushedFilters: [LessThan(TS,2016-10-19 12:54:01.934)]
> 2016-10-19 12:54:04,268 ERROR [Executor task launch worker-0] 
> org.apache.spark.executor.Executor
> Exception in task 0.0 in stage 0.0 (TID 0)
> java.sql.SQLDataException: ORA-01861: literal does not match format string
>   at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:461)
>   at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:402)
>   at oracle.jdbc.driver.T4C8Oall.processError(T4C8Oall.java:1065)
>   at oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:681)
>   at oracle.jdbc.driver.T4CTTIfun.doRPC(T4CTTIfun.java:256)
>   at oracle.jdbc.driver.T4C8Oall.doOALL(T4C8Oall.java:577)
>   at 
> oracle.jdbc.driver.T4CPreparedStatement.doOall8(T4CPreparedStatement.java:239)
>   at 
> oracle.jdbc.driver.T4CPreparedStatement.doOall8(T4CPreparedStatement.java:75)
>   at 
> oracle.jdbc.driver.T4CPreparedStatement.executeForDescribe(T4CPreparedStatement.java:1043)
>   at 
> oracle.jdbc.driver.OracleStatement.executeMaybeDescribe(OracleStatement.java:)
>   at 
> oracle.jdbc.driver.OracleStatement.doExecuteWithTimeout(OracleStatement.java:1353)
>   at 
> oracle.jdbc.driver.OraclePreparedStatement.executeInternal(OraclePreparedStatement.java:4485)
>   at 
> oracle.jdbc.driver.OraclePreparedStatement.executeQuery(OraclePreparedStatement.java:4566)
>   at 
> oracle.jdbc.driver.OraclePreparedStatementWrapper.executeQuery(OraclePreparedStatementWrapper.java:5251)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.(JDBCRDD.scala:383)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:359)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21047) Add test suites for complicated cases in ColumnarBatchSuite

2017-06-23 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-21047:
---

Assignee: jin xing

> Add test suites for complicated cases in ColumnarBatchSuite
> ---
>
> Key: SPARK-21047
> URL: https://issues.apache.org/jira/browse/SPARK-21047
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Kazuaki Ishizaki
>Assignee: jin xing
> Fix For: 2.3.0
>
>
> The current {{ColumnarBatchSuite}} has only very simple test cases for arrays. This 
> JIRA will add test suites for more complicated cases, such as nested arrays in 
> {{ColumnVector}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21047) Add test suites for complicated cases in ColumnarBatchSuite

2017-06-23 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-21047.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18327
[https://github.com/apache/spark/pull/18327]

> Add test suites for complicated cases in ColumnarBatchSuite
> ---
>
> Key: SPARK-21047
> URL: https://issues.apache.org/jira/browse/SPARK-21047
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Kazuaki Ishizaki
>Assignee: jin xing
> Fix For: 2.3.0
>
>
> The current {{ColumnarBatchSuite}} has only very simple test cases for arrays. This 
> JIRA will add test suites for more complicated cases, such as nested arrays in 
> {{ColumnVector}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21165) Fail to write into partitioned hive table due to attribute reference not working with cast on partition column

2017-06-23 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-21165.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 18386
[https://github.com/apache/spark/pull/18386]

> Fail to write into partitioned hive table due to attribute reference not 
> working with cast on partition column
> --
>
> Key: SPARK-21165
> URL: https://issues.apache.org/jira/browse/SPARK-21165
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Imran Rashid
>Assignee: Xiao Li
>Priority: Blocker
> Fix For: 2.2.0
>
>
> A simple "insert into ... select" involving partitioned hive tables fails.  
> Here's a simpler repro which doesn't involve hive at all -- this succeeds on 
> 2.1.1, but fails on 2.2.0-rc5:
> {noformat}
> spark.sql("""SET hive.exec.dynamic.partition.mode=nonstrict""")
> spark.sql("""DROP TABLE IF EXISTS src""")
> spark.sql("""DROP TABLE IF EXISTS dest""")
> spark.sql("""
> CREATE TABLE src (first string, word string)
>   PARTITIONED BY (length int)
> """)
> spark.sql("""
> INSERT INTO src PARTITION(length) VALUES
>   ('a', 'abc', 3),
>   ('b', 'bcde', 4),
>   ('c', 'cdefg', 5)
> """)
> spark.sql("""
>   CREATE TABLE dest (word string, length int)
> PARTITIONED BY (first string)
> """)
> spark.sql("""
>   INSERT INTO TABLE dest PARTITION(first) SELECT word, length, cast(first as 
> string) as first FROM src
> """)
> {noformat}
> The exception is
> {noformat}
> 17/06/21 14:25:53 WARN TaskSetManager: Lost task 1.0 in stage 4.0 (TID 10, 
> localhost, executor driver): 
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
> attribute
> , tree: first#74
> at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
> at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88)
> at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
> at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$$anonfun$bind$1.apply(GenerateOrdering.scala:49)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$$anonfun$bind$1.apply(GenerateOrdering.scala:49)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.AbstractTraversable.map(Traversable.scala:104)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.bind(GenerateOrdering.scala:49)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.bind(GenerateOrdering.scala:43)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:884)
> at 
> org.apache.spark.sql.execution.SparkPlan.newOrdering(SparkPlan.scala:363)
> at 
> org.apache.spark.sql.execution.SortExec.createSorter(SortExec.scala:63)
> at 
> org.apache.spark.sql.execution.SortExec$$anonfun$1.apply(SortExec.scala:102)
> at 
> or

[jira] [Assigned] (SPARK-21115) If the cores left is less than the coresPerExecutor, the cores left will not be allocated, so it should not check in every schedule

2017-06-23 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-21115:
---

Assignee: eaton

> If the cores left is less than the coresPerExecutor, the cores left will not 
> be allocated, so it should not check in every schedule
> -
>
> Key: SPARK-21115
> URL: https://issues.apache.org/jira/browse/SPARK-21115
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: eaton
>Assignee: eaton
>Priority: Minor
> Fix For: 2.3.0
>
>
> If we start an app with the parameters --total-executor-cores=4 and 
> spark.executor.cores=3, the number of cores left is always 1, so it will try to 
> allocate executors in the function 
> org.apache.spark.deploy.master.startExecutorsOnWorkers in every schedule round.
> Another question: would it be better to allocate another executor with 1 
> core for the cores that are left?
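
A minimal, hypothetical sketch of the suggested check (illustrative names, not the actual Master/startExecutorsOnWorkers code): skip the per-schedule allocation attempt once the app's remaining core budget cannot fit another executor.

{code:scala}
final case class AppBudget(coresLeft: Int, coresPerExecutor: Option[Int])

// Returns true only if another executor of the requested size can still be placed.
def shouldTryToLaunchExecutor(app: AppBudget): Boolean = {
  val needed = app.coresPerExecutor.getOrElse(1)  // one core per executor when unset
  app.coresLeft >= needed
}

// With --total-executor-cores=4 and spark.executor.cores=3, one executor is launched
// and then coresLeft == 1 < 3, so the scheduler can stop retrying for this app:
shouldTryToLaunchExecutor(AppBudget(coresLeft = 1, coresPerExecutor = Some(3)))  // false
{code}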



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21115) If the cores left is less than the coresPerExecutor, the cores left will not be allocated, so it should not check in every schedule

2017-06-23 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-21115.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18322
[https://github.com/apache/spark/pull/18322]

> If the cores left is less than the coresPerExecutor, the cores left will not 
> be allocated, so it should not check in every schedule
> -
>
> Key: SPARK-21115
> URL: https://issues.apache.org/jira/browse/SPARK-21115
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: eaton
>Priority: Minor
> Fix For: 2.3.0
>
>
> If we start an app with the parameter --total-executor-cores=4 and 
> spark.executor.cores=3, the cores left is always 1, so the function 
> org.apache.spark.deploy.master.startExecutorsOnWorkers will try to allocate 
> executors on every scheduling pass.
> Another question is whether it would be better to allocate another executor 
> with 1 core for the leftover core.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21194) Fail the putNull method when containsNull=false.

2017-06-23 Thread jin xing (JIRA)
jin xing created SPARK-21194:


 Summary: Fail the putNull method when containsNull=false.
 Key: SPARK-21194
 URL: https://issues.apache.org/jira/browse/SPARK-21194
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.1
Reporter: jin xing


Currently there's no check for putting null into an {{ArrayType(IntegerType, 
false)}}; it's better to fail the {{putNull}} method in {{OffHeapColumnVector}} 
and {{OnHeapColumnVector}} when {{containsNull}}=false.
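As a rough illustration of the proposed behaviour, a hedged sketch using a simplified stand-in class rather than the real {{OnHeapColumnVector}} / {{OffHeapColumnVector}} code:

{code}
// Hedged sketch: a simplified column vector that rejects nulls when the
// element type is declared non-nullable (containsNull = false).
class SimpleIntColumnVector(capacity: Int, containsNull: Boolean) {
  private val values = new Array[Int](capacity)
  private val isNull = new Array[Boolean](capacity)

  def putInt(rowId: Int, value: Int): Unit = values(rowId) = value

  def putNull(rowId: Int): Unit = {
    if (!containsNull) {
      throw new RuntimeException(
        s"Cannot put null at row $rowId: element type is declared non-nullable")
    }
    isNull(rowId) = true
  }
}

// ArrayType(IntegerType, containsNull = false) would correspond to
// containsNull = false here, so putNull fails fast instead of writing a null.
val vector = new SimpleIntColumnVector(capacity = 8, containsNull = false)
// vector.putNull(0)  // would throw
{code}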



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21194) Fail the putNull method when containsNull=false.

2017-06-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21194:


Assignee: Apache Spark

> Fail the putNull method when containsNull=false.
> ---
>
> Key: SPARK-21194
> URL: https://issues.apache.org/jira/browse/SPARK-21194
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: jin xing
>Assignee: Apache Spark
>
> Currently there's no check for putting null into an {{ArrayType(IntegerType, 
> false)}}; it's better to fail the {{putNull}} method in {{OffHeapColumnVector}} 
> and {{OnHeapColumnVector}} when {{containsNull}}=false.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21194) Fail the putNull method when containsNull=false.

2017-06-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21194:


Assignee: (was: Apache Spark)

> Fail the putNull method when containsNull=false.
> ---
>
> Key: SPARK-21194
> URL: https://issues.apache.org/jira/browse/SPARK-21194
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: jin xing
>
> Currently there's no check for putting null into an {{ArrayType(IntegerType, 
> false)}}; it's better to fail the {{putNull}} method in {{OffHeapColumnVector}} 
> and {{OnHeapColumnVector}} when {{containsNull}}=false.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21194) Fail the putNull method when containsNull=false.

2017-06-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16060919#comment-16060919
 ] 

Apache Spark commented on SPARK-21194:
--

User 'jinxing64' has created a pull request for this issue:
https://github.com/apache/spark/pull/18405

> Fail the putNull method when containsNull=false.
> ---
>
> Key: SPARK-21194
> URL: https://issues.apache.org/jira/browse/SPARK-21194
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: jin xing
>
> Currently there's no check for putting null into an {{ArrayType(IntegerType, 
> false)}}; it's better to fail the {{putNull}} method in {{OffHeapColumnVector}} 
> and {{OnHeapColumnVector}} when {{containsNull}}=false.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21193) Specify Pandas version in setup.py

2017-06-23 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-21193:
---

Assignee: Hyukjin Kwon

> Specify Pandas version in setup.py
> --
>
> Key: SPARK-21193
> URL: https://issues.apache.org/jira/browse/SPARK-21193
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.3.0
>
>
> It looks like we don't specify a Pandas version in 
> https://github.com/apache/spark/blob/master/python/setup.py#L202. It looks 
> like a few versions no longer work with Spark. It might be better to set this 
> explicitly.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21193) Specify Pandas version in setup.py

2017-06-23 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-21193.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18403
[https://github.com/apache/spark/pull/18403]

> Specify Pandas version in setup.py
> --
>
> Key: SPARK-21193
> URL: https://issues.apache.org/jira/browse/SPARK-21193
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.3.0
>
>
> It looks like we don't specify a Pandas version in 
> https://github.com/apache/spark/blob/master/python/setup.py#L202. It looks 
> like a few versions no longer work with Spark. It might be better to set this 
> explicitly.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21195) MetricSystem should pick up dynamically registered metrics in sources

2017-06-23 Thread Robert Kruszewski (JIRA)
Robert Kruszewski created SPARK-21195:
-

 Summary: MetricSystem should pick up dynamically registered 
metrics in sources
 Key: SPARK-21195
 URL: https://issues.apache.org/jira/browse/SPARK-21195
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.1.1
Reporter: Robert Kruszewski
Priority: Minor


Currently, when MetricsSystem registers a source it only picks up the metrics 
that are already registered. This is quite cumbersome and leads to a lot of 
boilerplate to preregister all metrics, especially with systems that use 
instrumentation. This change proposes teaching MetricsSystem to watch metrics 
added to sources and register them dynamically.
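A hedged sketch of one way this could work with the Dropwizard metrics API that MetricsSystem already builds on; {{watchSource}} is a made-up helper name, not the actual patch:

{code}
// Hedged sketch: mirror metrics that a source registers later into the
// system-wide registry, instead of copying them only once at registration time.
import com.codahale.metrics.{Counter, Gauge, MetricRegistry, MetricRegistryListener}

def watchSource(systemRegistry: MetricRegistry,
                sourceName: String,
                sourceRegistry: MetricRegistry): Unit = {
  // addListener also replays metrics that already exist in sourceRegistry.
  sourceRegistry.addListener(new MetricRegistryListener.Base {
    override def onGaugeAdded(name: String, gauge: Gauge[_]): Unit =
      systemRegistry.register(MetricRegistry.name(sourceName, name), gauge)
    override def onCounterAdded(name: String, counter: Counter): Unit =
      systemRegistry.register(MetricRegistry.name(sourceName, name), counter)
    // onMeterAdded / onTimerAdded / onHistogramAdded could be forwarded the same way.
  })
}
{code}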



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21195) MetricSystem should pick up dynamically registered metrics in sources

2017-06-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16060939#comment-16060939
 ] 

Apache Spark commented on SPARK-21195:
--

User 'robert3005' has created a pull request for this issue:
https://github.com/apache/spark/pull/18406

> MetricSystem should pick up dynamically registered metrics in sources
> -
>
> Key: SPARK-21195
> URL: https://issues.apache.org/jira/browse/SPARK-21195
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Robert Kruszewski
>Priority: Minor
>
> Currently, when MetricsSystem registers a source it only picks up the metrics 
> that are already registered. This is quite cumbersome and leads to a lot of 
> boilerplate to preregister all metrics, especially with systems that use 
> instrumentation. This change proposes teaching MetricsSystem to watch metrics 
> added to sources and register them dynamically.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21195) MetricSystem should pick up dynamically registered metrics in sources

2017-06-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21195:


Assignee: Apache Spark

> MetricSystem should pick up dynamically registered metrics in sources
> -
>
> Key: SPARK-21195
> URL: https://issues.apache.org/jira/browse/SPARK-21195
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Robert Kruszewski
>Assignee: Apache Spark
>Priority: Minor
>
> Currently, when MetricsSystem registers a source it only picks up the metrics 
> that are already registered. This is quite cumbersome and leads to a lot of 
> boilerplate to preregister all metrics, especially with systems that use 
> instrumentation. This change proposes teaching MetricsSystem to watch metrics 
> added to sources and register them dynamically.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21195) MetricSystem should pick up dynamically registered metrics in sources

2017-06-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21195:


Assignee: (was: Apache Spark)

> MetricSystem should pick up dynamically registered metrics in sources
> -
>
> Key: SPARK-21195
> URL: https://issues.apache.org/jira/browse/SPARK-21195
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Robert Kruszewski
>Priority: Minor
>
> Currently, when MetricsSystem registers a source it only picks up the metrics 
> that are already registered. This is quite cumbersome and leads to a lot of 
> boilerplate to preregister all metrics, especially with systems that use 
> instrumentation. This change proposes teaching MetricsSystem to watch metrics 
> added to sources and register them dynamically.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18004) DataFrame filter Predicate push-down fails for Oracle Timestamp type columns

2017-06-23 Thread Rui Zha (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16060963#comment-16060963
 ] 

Rui Zha commented on SPARK-18004:
-

PR 18404 is closed. I will resend the PR later.

> DataFrame filter Predicate push-down fails for Oracle Timestamp type columns
> 
>
> Key: SPARK-18004
> URL: https://issues.apache.org/jira/browse/SPARK-18004
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Suhas Nalapure
>Priority: Critical
>
> DataFrame filter Predicate push-down fails for Oracle Timestamp type columns 
> with Exception java.sql.SQLDataException: ORA-01861: literal does not match 
> format string:
> Java source code (this code works fine for mysql & mssql databases) :
> {noformat}
> //DataFrame df = create a DataFrame over an Oracle table
> df = df.filter(df.col("TS").lt(new 
> java.sql.Timestamp(System.currentTimeMillis(;
>   df.explain();
>   df.show();
> {noformat}
> Log statements with the Exception:
> {noformat}
> Schema: root
>  |-- ID: string (nullable = false)
>  |-- TS: timestamp (nullable = true)
>  |-- DEVICE_ID: string (nullable = true)
>  |-- REPLACEMENT: string (nullable = true)
> {noformat}
> {noformat}
> == Physical Plan ==
> Filter (TS#1 < 1476861841934000)
> +- Scan 
> JDBCRelation(jdbc:oracle:thin:@10.0.0.111:1521:orcl,ORATABLE,[Lorg.apache.spark.Partition;@78c74647,{user=user,
>  password=pwd, url=jdbc:oracle:thin:@10.0.0.111:1521:orcl, dbtable=ORATABLE, 
> driver=oracle.jdbc.driver.OracleDriver})[ID#0,TS#1,DEVICE_ID#2,REPLACEMENT#3] 
> PushedFilters: [LessThan(TS,2016-10-19 12:54:01.934)]
> 2016-10-19 12:54:04,268 ERROR [Executor task launch worker-0] 
> org.apache.spark.executor.Executor
> Exception in task 0.0 in stage 0.0 (TID 0)
> java.sql.SQLDataException: ORA-01861: literal does not match format string
>   at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:461)
>   at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:402)
>   at oracle.jdbc.driver.T4C8Oall.processError(T4C8Oall.java:1065)
>   at oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:681)
>   at oracle.jdbc.driver.T4CTTIfun.doRPC(T4CTTIfun.java:256)
>   at oracle.jdbc.driver.T4C8Oall.doOALL(T4C8Oall.java:577)
>   at 
> oracle.jdbc.driver.T4CPreparedStatement.doOall8(T4CPreparedStatement.java:239)
>   at 
> oracle.jdbc.driver.T4CPreparedStatement.doOall8(T4CPreparedStatement.java:75)
>   at 
> oracle.jdbc.driver.T4CPreparedStatement.executeForDescribe(T4CPreparedStatement.java:1043)
>   at 
> oracle.jdbc.driver.OracleStatement.executeMaybeDescribe(OracleStatement.java:)
>   at 
> oracle.jdbc.driver.OracleStatement.doExecuteWithTimeout(OracleStatement.java:1353)
>   at 
> oracle.jdbc.driver.OraclePreparedStatement.executeInternal(OraclePreparedStatement.java:4485)
>   at 
> oracle.jdbc.driver.OraclePreparedStatement.executeQuery(OraclePreparedStatement.java:4566)
>   at 
> oracle.jdbc.driver.OraclePreparedStatementWrapper.executeQuery(OraclePreparedStatementWrapper.java:5251)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.(JDBCRDD.scala:383)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:359)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
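A hedged, standalone sketch of the kind of literal rendering the fix needs (not the actual JDBC data source code): Oracle rejects the plain yyyy-MM-dd HH:mm:ss string pushed into the filter, so the timestamp has to be spelled as an explicit TO_TIMESTAMP literal.

{code}
// Hedged sketch: render a java.sql.Timestamp as an Oracle-compatible literal
// so a pushed-down predicate such as TS < ... does not trigger ORA-01861.
import java.sql.Timestamp
import java.text.SimpleDateFormat

def oracleTimestampLiteral(ts: Timestamp): String = {
  val fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS")
  s"TO_TIMESTAMP('${fmt.format(ts)}', 'YYYY-MM-DD HH24:MI:SS.FF3')"
}

// e.g. "TS < TO_TIMESTAMP('2016-10-19 12:54:01.934', 'YYYY-MM-DD HH24:MI:SS.FF3')"
val predicate = s"TS < ${oracleTimestampLiteral(new Timestamp(System.currentTimeMillis()))}"
{code}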



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20952) ParquetFileFormat should forward TaskContext to its forkjoinpool

2017-06-23 Thread Robert Kruszewski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Kruszewski updated SPARK-20952:
--
Summary: ParquetFileFormat should forward TaskContext to its forkjoinpool  
(was: TaskContext should be an InheritableThreadLocal)

> ParquetFileFormat should forward TaskContext to its forkjoinpool
> 
>
> Key: SPARK-20952
> URL: https://issues.apache.org/jira/browse/SPARK-20952
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Robert Kruszewski
>Priority: Minor
>
> TaskContext is a ThreadLocal; as a result, when you fork a thread inside your 
> executor task you lose the handle on the original context set by the 
> executor. We should change it to an InheritableThreadLocal so we can access it 
> inside thread pools on executors. 
> See ParquetFileFormat#readFootersInParallel for an example of code that uses 
> thread pools inside tasks.
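To illustrate the problem with a hedged sketch (plain Scala, not Spark's TaskContext plumbing): a value held in a plain ThreadLocal is invisible from pool threads, so it has to be captured on the task thread and re-installed inside each forked thread.

{code}
// Hedged sketch of the capture-and-restore pattern; `taskContext` is a stand-in
// ThreadLocal, not Spark's real TaskContext.
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}

object FooterDemo {
  val taskContext = new ThreadLocal[String]()   // hypothetical per-task state

  def readFootersInParallel(paths: Seq[String]): Seq[Future[String]] = {
    implicit val ec: ExecutionContext =
      ExecutionContext.fromExecutor(Executors.newFixedThreadPool(4))
    val captured = taskContext.get()            // capture on the task thread
    paths.map { path =>
      Future {
        taskContext.set(captured)               // re-install inside the pool thread
        try s"$path read with context ${taskContext.get()}"
        finally taskContext.remove()
      }
    }
  }
}
{code}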



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21181) Suppress memory leak errors reported by netty

2017-06-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16060970#comment-16060970
 ] 

Apache Spark commented on SPARK-21181:
--

User 'dhruve' has created a pull request for this issue:
https://github.com/apache/spark/pull/18407

> Suppress memory leak errors reported by netty
> -
>
> Key: SPARK-21181
> URL: https://issues.apache.org/jira/browse/SPARK-21181
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.1.0
>Reporter: Dhruve Ashar
>Priority: Minor
>
> We are seeing netty report memory leak errors like the one below after 
> switching to 2.1. 
> {code}
> ERROR ResourceLeakDetector: LEAK: ByteBuf.release() was not called before 
> it's garbage-collected. Enable advanced leak reporting to find out where the 
> leak occurred. To enable advanced leak reporting, specify the JVM option 
> '-Dio.netty.leakDetection.level=advanced' or call 
> ResourceLeakDetector.setLevel() See 
> http://netty.io/wiki/reference-counted-objects.html for more information.
> {code}
> Looking a bit deeper, Spark is not leaking any memory here, but it is 
> confusing for the user to see the error message in the driver logs. 
> After enabling '-Dio.netty.leakDetection.level=advanced', netty reveals the 
> SparkSaslServer to be the source of these leaks.
> Sample trace: https://gist.github.com/dhruve/b299ebc35aa0a185c244a0468927daf1
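For reference, the advanced leak reporting mentioned above can also be switched on programmatically with the plain netty API (a hedged snippet, not a Spark change):

{code}
// Equivalent of -Dio.netty.leakDetection.level=advanced, set before any
// ByteBuf allocation happens.
import io.netty.util.ResourceLeakDetector

ResourceLeakDetector.setLevel(ResourceLeakDetector.Level.ADVANCED)
{code}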



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20952) ParquetFileFormat should forward TaskContext to its forkjoinpool

2017-06-23 Thread Robert Kruszewski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16060972#comment-16060972
 ] 

Robert Kruszewski commented on SPARK-20952:
---

[~zsxwing] updated the issue title and PR with that fix. I'm not sure if there's 
a comprehensive change we can make that ensures all usages are safe, but I hope 
this change is non-controversial now.

> ParquetFileFormat should forward TaskContext to its forkjoinpool
> 
>
> Key: SPARK-20952
> URL: https://issues.apache.org/jira/browse/SPARK-20952
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Robert Kruszewski
>Priority: Minor
>
> TaskContext is a ThreadLocal; as a result, when you fork a thread inside your 
> executor task you lose the handle on the original context set by the 
> executor. We should change it to an InheritableThreadLocal so we can access it 
> inside thread pools on executors. 
> See ParquetFileFormat#readFootersInParallel for an example of code that uses 
> thread pools inside tasks.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20307) SparkR: pass on setHandleInvalid to spark.mllib functions that use StringIndexer

2017-06-23 Thread Joseph Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16060982#comment-16060982
 ] 

Joseph Wang commented on SPARK-20307:
-

Hello, I am new to this discussion. I am now running into the same issue in a 
production environment where SparkR is used to train a model. An 
"allow.new.labels" flag would have a huge impact in making this frictionless. 
Does anyone know the current status of this issue? Thanks. :D

> SparkR: pass on setHandleInvalid to spark.mllib functions that use 
> StringIndexer
> 
>
> Key: SPARK-20307
> URL: https://issues.apache.org/jira/browse/SPARK-20307
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Anne Rutten
>Priority: Minor
>
> When training a model in SparkR with string variables (tested with 
> spark.randomForest, but I assume this is valid for all spark.xx functions that 
> apply a StringIndexer under the hood), testing on a new dataset with factor 
> levels that are not in the training set will throw an "Unseen label" error. 
> I think this can be solved if there's a method to pass setHandleInvalid on to 
> the StringIndexers when calling spark.randomForest.
> code snippet:
> {code}
> # (i've run this in Zeppelin which already has SparkR and the context loaded)
> #library(SparkR)
> #sparkR.session(master = "local[*]") 
> data = data.frame(clicked = base::sample(c(0,1),100,replace=TRUE),
>   someString = base::sample(c("this", "that"), 
> 100, replace=TRUE), stringsAsFactors=FALSE)
> trainidxs = base::sample(nrow(data), nrow(data)*0.7)
> traindf = as.DataFrame(data[trainidxs,])
> testdf = as.DataFrame(rbind(data[-trainidxs,],c(0,"the other")))
> rf = spark.randomForest(traindf, clicked~., type="classification", 
> maxDepth=10, 
> maxBins=41,
> numTrees = 100)
> predictions = predict(rf, testdf)
> SparkR::collect(predictions)
> {code}
> stack trace:
> {quote}
> Error in handleErrors(returnStatus, conn): org.apache.spark.SparkException: 
> Job aborted due to stage failure: Task 0 in stage 607.0 failed 1 times, most 
> recent failure: Lost task 0.0 in stage 607.0 (TID 1581, localhost, executor 
> driver): org.apache.spark.SparkException: Failed to execute user defined 
> function($anonfun$4: (string) => double)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:99)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.SparkException: Unseen label: the other.
> at 
> org.apache.spark.ml.feature.StringIndexerModel$$anonfun$4.apply(StringIndexer.scala:170)
> at 
> org.apache.spark.ml.feature.StringIndexerModel$$anonfun$4.apply(StringIndexer.scala:166)
> ... 16 more
> Driver stacktrace:
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
> at 
> org.apache.spark.scheduler.DAGSc

[jira] [Commented] (SPARK-21152) Use level 3 BLAS operations in LogisticAggregator

2017-06-23 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16061158#comment-16061158
 ] 

Seth Hendrickson commented on SPARK-21152:
--

[~yanboliang] I can certainly do performance testing and post the results. 
Still, do you have any thoughts about the caching issues? I wanted to see 
whether they are a deal-breaker before going so far as to conduct exhaustive 
performance tests.

> Use level 3 BLAS operations in LogisticAggregator
> -
>
> Key: SPARK-21152
> URL: https://issues.apache.org/jira/browse/SPARK-21152
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.1.1
>Reporter: Seth Hendrickson
>
> In the logistic regression gradient update, we currently compute the 
> contribution of each individual row. If we block the rows together, we can do 
> a blocked gradient update that leverages the BLAS GEMM operation.
> On high dimensional dense datasets, I've observed ~10x speedups. The problem 
> here, though, is that it likely won't improve the sparse case so we need to 
> keep both implementations around, and this blocked algorithm will require 
> caching a new dataset of type:
> {code}
> BlockInstance(label: Vector, weight: Vector, features: Matrix)
> {code}
> We have avoided caching anything besides the original dataset passed to train 
> in the past because it adds memory overhead if the user has cached this 
> original dataset for other reasons. Here, I'd like to discuss whether we 
> think this patch would be worth the investment, given that it only improves a 
> subset of the use cases.
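A hedged sketch of the blocked update idea using Breeze (which Spark already depends on); the block layout loosely follows the {{BlockInstance}} shape above, and this is not the actual LogisticAggregator code:

{code}
// Hedged sketch: per-block binary logistic gradient via one BLAS matrix-vector
// product (matrix-matrix in the multinomial case) instead of row-by-row updates.
import breeze.linalg.{DenseMatrix, DenseVector}
import breeze.numerics.sigmoid

def blockGradient(features: DenseMatrix[Double],      // numRows x numFeatures
                  labels: DenseVector[Double],        // numRows
                  weights: DenseVector[Double],       // per-row instance weights
                  coefficients: DenseVector[Double]): DenseVector[Double] = {
  val margins = features * coefficients               // one BLAS call per block
  val errors = (sigmoid(margins) - labels) *:* weights
  features.t * errors                                  // block's gradient contribution
}
{code}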



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20307) SparkR: pass on setHandleInvalid to spark.mllib functions that use StringIndexer

2017-06-23 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16061159#comment-16061159
 ] 

Felix Cheung commented on SPARK-20307:
--

Not much update as far as I know.
Would you like to contribute?




> SparkR: pass on setHandleInvalid to spark.mllib functions that use 
> StringIndexer
> 
>
> Key: SPARK-20307
> URL: https://issues.apache.org/jira/browse/SPARK-20307
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Anne Rutten
>Priority: Minor
>
> When training a model in SparkR with string variables (tested with 
> spark.randomForest, but I assume this is valid for all spark.xx functions that 
> apply a StringIndexer under the hood), testing on a new dataset with factor 
> levels that are not in the training set will throw an "Unseen label" error. 
> I think this can be solved if there's a method to pass setHandleInvalid on to 
> the StringIndexers when calling spark.randomForest.
> code snippet:
> {code}
> # (i've run this in Zeppelin which already has SparkR and the context loaded)
> #library(SparkR)
> #sparkR.session(master = "local[*]") 
> data = data.frame(clicked = base::sample(c(0,1),100,replace=TRUE),
>   someString = base::sample(c("this", "that"), 
> 100, replace=TRUE), stringsAsFactors=FALSE)
> trainidxs = base::sample(nrow(data), nrow(data)*0.7)
> traindf = as.DataFrame(data[trainidxs,])
> testdf = as.DataFrame(rbind(data[-trainidxs,],c(0,"the other")))
> rf = spark.randomForest(traindf, clicked~., type="classification", 
> maxDepth=10, 
> maxBins=41,
> numTrees = 100)
> predictions = predict(rf, testdf)
> SparkR::collect(predictions)
> {code}
> stack trace:
> {quote}
> Error in handleErrors(returnStatus, conn): org.apache.spark.SparkException: 
> Job aborted due to stage failure: Task 0 in stage 607.0 failed 1 times, most 
> recent failure: Lost task 0.0 in stage 607.0 (TID 1581, localhost, executor 
> driver): org.apache.spark.SparkException: Failed to execute user defined 
> function($anonfun$4: (string) => double)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:99)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.SparkException: Unseen label: the other.
> at 
> org.apache.spark.ml.feature.StringIndexerModel$$anonfun$4.apply(StringIndexer.scala:170)
> at 
> org.apache.spark.ml.feature.StringIndexerModel$$anonfun$4.apply(StringIndexer.scala:166)
> ... 16 more
> Driver stacktrace:
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
> at scala.Option.foreach(Option.sca

[jira] [Commented] (SPARK-20307) SparkR: pass on setHandleInvalid to spark.mllib functions that use StringIndexer

2017-06-23 Thread Joseph Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16061161#comment-16061161
 ] 

Joseph Wang commented on SPARK-20307:
-

Hi Felix, the problem for me is how to get at the backend where the SparkR API 
lives, starting from the installed Spark folder in my environment. If I can get 
some hints on that, I may be able to do something there. Thanks.



> SparkR: pass on setHandleInvalid to spark.mllib functions that use 
> StringIndexer
> 
>
> Key: SPARK-20307
> URL: https://issues.apache.org/jira/browse/SPARK-20307
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Anne Rutten
>Priority: Minor
>
> When training a model in SparkR with string variables (tested with 
> spark.randomForest, but I assume this is valid for all spark.xx functions that 
> apply a StringIndexer under the hood), testing on a new dataset with factor 
> levels that are not in the training set will throw an "Unseen label" error. 
> I think this can be solved if there's a method to pass setHandleInvalid on to 
> the StringIndexers when calling spark.randomForest.
> code snippet:
> {code}
> # (i've run this in Zeppelin which already has SparkR and the context loaded)
> #library(SparkR)
> #sparkR.session(master = "local[*]") 
> data = data.frame(clicked = base::sample(c(0,1),100,replace=TRUE),
>   someString = base::sample(c("this", "that"), 
> 100, replace=TRUE), stringsAsFactors=FALSE)
> trainidxs = base::sample(nrow(data), nrow(data)*0.7)
> traindf = as.DataFrame(data[trainidxs,])
> testdf = as.DataFrame(rbind(data[-trainidxs,],c(0,"the other")))
> rf = spark.randomForest(traindf, clicked~., type="classification", 
> maxDepth=10, 
> maxBins=41,
> numTrees = 100)
> predictions = predict(rf, testdf)
> SparkR::collect(predictions)
> {code}
> stack trace:
> {quote}
> Error in handleErrors(returnStatus, conn): org.apache.spark.SparkException: 
> Job aborted due to stage failure: Task 0 in stage 607.0 failed 1 times, most 
> recent failure: Lost task 0.0 in stage 607.0 (TID 1581, localhost, executor 
> driver): org.apache.spark.SparkException: Failed to execute user defined 
> function($anonfun$4: (string) => double)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:99)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.SparkException: Unseen label: the other.
> at 
> org.apache.spark.ml.feature.StringIndexerModel$$anonfun$4.apply(StringIndexer.scala:170)
> at 
> org.apache.spark.ml.feature.StringIndexerModel$$anonfun$4.apply(StringIndexer.scala:166)
> ... 16 more
> Driver stacktrace:
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scal

[jira] [Resolved] (SPARK-21144) Unexpected results when the data schema and partition schema have the duplicate columns

2017-06-23 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-21144.
-
   Resolution: Fixed
 Assignee: Takeshi Yamamuro
Fix Version/s: 2.2.0

> Unexpected results when the data schema and partition schema have the 
> duplicate columns
> ---
>
> Key: SPARK-21144
> URL: https://issues.apache.org/jira/browse/SPARK-21144
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Takeshi Yamamuro
> Fix For: 2.2.0
>
>
> {noformat}
> withTempPath { dir =>
>   val basePath = dir.getCanonicalPath
>   spark.range(0, 3).toDF("foo").write.parquet(new Path(basePath, 
> "foo=1").toString)
>   spark.range(0, 3).toDF("foo").write.parquet(new Path(basePath, 
> "foo=a").toString)
>   spark.read.parquet(basePath).show()
> }
> {noformat}
> The result of the above case is
> {noformat}
> +---+
> |foo|
> +---+
> |  1|
> |  1|
> |  a|
> |  a|
> |  1|
> |  a|
> +---+
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21164) Remove isTableSample from Sample and isGenerated from Alias and AttributeReference

2017-06-23 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-21164:

Summary: Remove isTableSample from Sample and isGenerated from Alias and 
AttributeReference  (was: Remove isTableSample from Sample)

> Remove isTableSample from Sample and isGenerated from Alias and 
> AttributeReference
> --
>
> Key: SPARK-21164
> URL: https://issues.apache.org/jira/browse/SPARK-21164
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> {{isTableSample}} was introduced for SQL Generation. Since SQL Generation is 
> removed, we do not need to keep {{isTableSample}}. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21164) Remove isTableSample from Sample and isGenerated from Alias and AttributeReference

2017-06-23 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-21164:

Description: 
isTableSample and isGenerated were introduced for SQL Generation respectively 
by #11148 and #11050

Since SQL Generation is removed, we do not need to keep isTableSample.

  was:{{isTableSample}} was introduced for SQL Generation. Since SQL Generation 
is removed, we do not need to keep {{isTableSample}}. 


> Remove isTableSample from Sample and isGenerated from Alias and 
> AttributeReference
> --
>
> Key: SPARK-21164
> URL: https://issues.apache.org/jira/browse/SPARK-21164
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> isTableSample and isGenerated were introduced for SQL Generation respectively 
> by #11148 and #11050
> Since SQL Generation is removed, we do not need to keep isTableSample.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21164) Remove isTableSample from Sample and isGenerated from Alias and AttributeReference

2017-06-23 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-21164:

Description: 
isTableSample and isGenerated were introduced for SQL Generation respectively 
by PR 11148 and PR 11050

Since SQL Generation is removed, we do not need to keep isTableSample.

  was:
isTableSample and isGenerated were introduced for SQL Generation respectively 
by #11148 and #11050

Since SQL Generation is removed, we do not need to keep isTableSample.


> Remove isTableSample from Sample and isGenerated from Alias and 
> AttributeReference
> --
>
> Key: SPARK-21164
> URL: https://issues.apache.org/jira/browse/SPARK-21164
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> isTableSample and isGenerated were introduced for SQL Generation respectively 
> by PR 11148 and PR 11050
> Since SQL Generation is removed, we do not need to keep isTableSample.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21149) Add job description API for R

2017-06-23 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-21149.
--
  Resolution: Fixed
Assignee: Hyukjin Kwon
   Fix Version/s: 2.3.0
Target Version/s: 2.3.0

> Add job description API for R
> -
>
> Key: SPARK-21149
> URL: https://issues.apache.org/jira/browse/SPARK-21149
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Felix Cheung
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.3.0
>
>
> see SPARK-21125



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21155) Add (? running tasks) into Spark UI progress

2017-06-23 Thread Eric Vandenberg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16061225#comment-16061225
 ] 

Eric Vandenberg commented on SPARK-21155:
-

Added screen shot with skipped tasks for reference.

> Add (? running tasks) into Spark UI progress
> 
>
> Key: SPARK-21155
> URL: https://issues.apache.org/jira/browse/SPARK-21155
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.1.1
>Reporter: Eric Vandenberg
>Priority: Minor
> Attachments: Screen Shot 2017-06-20 at 12.32.58 PM.png, Screen Shot 
> 2017-06-20 at 3.40.39 PM.png, Screen Shot 2017-06-22 at 9.58.08 AM.png
>
>
> The progress UI for Active Jobs / Tasks should show the exact number of 
> running tasks. See the attached screen shot for what this looks like.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21180) Remove conf from stats functions since now we have conf in LogicalPlan

2017-06-23 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-21180.
-
   Resolution: Fixed
 Assignee: Zhenhua Wang
Fix Version/s: 2.3.0

> Remove conf from stats functions since now we have conf in LogicalPlan
> --
>
> Key: SPARK-21180
> URL: https://issues.apache.org/jira/browse/SPARK-21180
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Zhenhua Wang
>Assignee: Zhenhua Wang
> Fix For: 2.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21181) Suppress memory leak errors reported by netty

2017-06-23 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-21181.

   Resolution: Fixed
 Assignee: Dhruve Ashar
Fix Version/s: 2.3.0
   2.2.1
   2.1.2

> Suppress memory leak errors reported by netty
> -
>
> Key: SPARK-21181
> URL: https://issues.apache.org/jira/browse/SPARK-21181
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.1.0
>Reporter: Dhruve Ashar
>Assignee: Dhruve Ashar
>Priority: Minor
> Fix For: 2.1.2, 2.2.1, 2.3.0
>
>
> We are seeing netty report memory leak errors like the one below after 
> switching to 2.1. 
> {code}
> ERROR ResourceLeakDetector: LEAK: ByteBuf.release() was not called before 
> it's garbage-collected. Enable advanced leak reporting to find out where the 
> leak occurred. To enable advanced leak reporting, specify the JVM option 
> '-Dio.netty.leakDetection.level=advanced' or call 
> ResourceLeakDetector.setLevel() See 
> http://netty.io/wiki/reference-counted-objects.html for more information.
> {code}
> Looking a bit deeper, Spark is not leaking any memory here, but it is 
> confusing for the user to see the error message in the driver logs. 
> After enabling '-Dio.netty.leakDetection.level=advanced', netty reveals the 
> SparkSaslServer to be the source of these leaks.
> Sample trace: https://gist.github.com/dhruve/b299ebc35aa0a185c244a0468927daf1



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20555) Incorrect handling of Oracle's decimal types via JDBC

2017-06-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16061304#comment-16061304
 ] 

Apache Spark commented on SPARK-20555:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/18408

> Incorrect handling of Oracle's decimal types via JDBC
> -
>
> Key: SPARK-20555
> URL: https://issues.apache.org/jira/browse/SPARK-20555
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Gabor Feher
>
> When querying an Oracle database, Spark maps some Oracle numeric data types 
> to incorrect Catalyst data types:
> 1. DECIMAL(1) becomes BooleanType
> In Oracle, a DECIMAL(1) can have values from -9 to 9.
> In Spark now, values larger than 1 become the boolean value true.
> 2. DECIMAL(3,2) becomes IntegerType
> In Oracle, a DECIMAL(3,2) can have values like 1.23
> In Spark now, digits after the decimal point are dropped.
> 3. DECIMAL(10) becomes IntegerType
> In Oracle, a DECIMAL(10) can have the value 9999999999 (ten nines), which is 
> more than 2^31
> Spark throws an exception: "java.sql.SQLException: Numeric Overflow"
> I think the best solution is to always keep Oracle's decimal types. (In 
> theory we could introduce a FloatType in some case of #2, and fix #3 by only 
> introducing IntegerType for DECIMAL(9). But in my opinion, that would end up 
> complicated and error-prone.)
> Note: I think the above problems were introduced as part of  
> https://github.com/apache/spark/pull/14377
> The main purpose of that PR seems to be converting Spark types to correct 
> Oracle types, and that part seems good to me. But it also adds the inverse 
> conversions. As it turns out in the above examples, that is not possible.
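A hedged sketch of the direction suggested above (always preserving Oracle's declared precision and scale as a Catalyst DecimalType), written as a standalone mapping function rather than the actual dialect change:

{code}
// Hedged sketch: map Oracle NUMBER(p, s) to DecimalType(p, s) unconditionally,
// instead of special-casing it into BooleanType / IntegerType.
import java.sql.Types
import org.apache.spark.sql.types.{DataType, DecimalType}

def oracleDecimalToCatalyst(sqlType: Int, precision: Int, scale: Int): Option[DataType] = {
  if (sqlType == Types.NUMERIC || sqlType == Types.DECIMAL) {
    val p =
      if (precision <= 0 || precision > DecimalType.MAX_PRECISION) DecimalType.MAX_PRECISION
      else precision
    val s = math.min(math.max(scale, 0), p)
    Some(DecimalType(p, s))
  } else {
    None
  }
}

// DECIMAL(1)   -> DecimalType(1, 0)   values -9..9 are preserved
// DECIMAL(3,2) -> DecimalType(3, 2)   1.23 keeps its fractional digits
// DECIMAL(10)  -> DecimalType(10, 0)  9999999999 no longer overflows an Int
{code}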



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21192) Preserve State Store provider class configuration across StreamingQuery restarts

2017-06-23 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-21192.
--
   Resolution: Fixed
 Assignee: Tathagata Das  (was: Apache Spark)
Fix Version/s: 2.3.0

> Preserve State Store provider class configuration across StreamingQuery 
> restarts
> 
>
> Key: SPARK-21192
> URL: https://issues.apache.org/jira/browse/SPARK-21192
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
> Fix For: 2.3.0
>
>
> If the SQL conf for StateStore provider class is changed between restarts 
> (i.e. query started with providerClass1 and attempted to restart using 
> providerClass2), then the query will fail in an unpredictable way as files 
> saved by one provider class cannot be used by the newer one. 
> Ideally, the provider class used to start the query should be used to restart 
> the query, and the configuration in the session where it is being restarted 
> should be ignored. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20417) Move error reporting for subquery from Analyzer to CheckAnalysis

2017-06-23 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-20417.
-
   Resolution: Fixed
 Assignee: Dilip Biswal
Fix Version/s: 2.3.0

> Move error reporting for subquery from Analyzer to CheckAnalysis
> 
>
> Key: SPARK-20417
> URL: https://issues.apache.org/jira/browse/SPARK-20417
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Dilip Biswal
>Assignee: Dilip Biswal
> Fix For: 2.3.0
>
>
> Currently we do a lot of validations for subqueries in the Analyzer. We should 
> move them to CheckAnalysis, which is the framework to catch and report 
> analysis errors. This was mentioned as a review comment in SPARK-18874.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21196) Split codegen info of query plan into sequence

2017-06-23 Thread Gengliang Wang (JIRA)
Gengliang Wang created SPARK-21196:
--

 Summary: Split codegen info of query plan into sequence
 Key: SPARK-21196
 URL: https://issues.apache.org/jira/browse/SPARK-21196
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.0
Reporter: Gengliang Wang
Priority: Minor


The codegen info of a query plan can be very long. 
In the debugging console / web page, it would be more readable if the subtrees 
and their corresponding codegen were split into a sequence. 





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21196) Split codegen info of query plan into sequence

2017-06-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16061364#comment-16061364
 ] 

Apache Spark commented on SPARK-21196:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/18409

> Split codegen info of query plan into sequence
> --
>
> Key: SPARK-21196
> URL: https://issues.apache.org/jira/browse/SPARK-21196
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Gengliang Wang
>Priority: Minor
>
> The codegen info of a query plan can be very long. 
> In the debugging console / web page, it would be more readable if the subtrees 
> and their corresponding codegen were split into a sequence. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21196) Split codegen info of query plan into sequence

2017-06-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21196:


Assignee: (was: Apache Spark)

> Split codegen info of query plan into sequence
> --
>
> Key: SPARK-21196
> URL: https://issues.apache.org/jira/browse/SPARK-21196
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Gengliang Wang
>Priority: Minor
>
> The codegen info of a query plan can be very long. 
> In the debugging console / web page, it would be more readable if the subtrees 
> and their corresponding codegen were split into a sequence. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21196) Split codegen info of query plan into sequence

2017-06-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21196:


Assignee: Apache Spark

> Split codegen info of query plan into sequence
> --
>
> Key: SPARK-21196
> URL: https://issues.apache.org/jira/browse/SPARK-21196
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Minor
>
> The codegen info of a query plan can be very long. 
> In the debugging console / web page, it would be more readable if the subtrees 
> and their corresponding codegen were split into a sequence. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21110) Structs should be usable in inequality filters

2017-06-23 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-21110:
-
Target Version/s: 2.3.0

> Structs should be usable in inequality filters
> --
>
> Key: SPARK-21110
> URL: https://issues.apache.org/jira/browse/SPARK-21110
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Nicholas Chammas
>Priority: Minor
>
> It seems like a missing feature that you can't compare structs in a filter on 
> a DataFrame.
> Here's a simple demonstration of a) where this would be useful and b) how 
> it's different from simply comparing each of the components of the structs.
> {code}
> import pyspark
> from pyspark.sql.functions import col, struct, concat
> spark = pyspark.sql.SparkSession.builder.getOrCreate()
> df = spark.createDataFrame(
> [
> ('Boston', 'Bob'),
> ('Boston', 'Nick'),
> ('San Francisco', 'Bob'),
> ('San Francisco', 'Nick'),
> ],
> ['city', 'person']
> )
> pairs = (
> df.select(
> struct('city', 'person').alias('p1')
> )
> .crossJoin(
> df.select(
> struct('city', 'person').alias('p2')
> )
> )
> )
> print("Everything")
> pairs.show()
> print("Comparing parts separately (doesn't give me what I want)")
> (pairs
> .where(col('p1.city') < col('p2.city'))
> .where(col('p1.person') < col('p2.person'))
> .show())
> print("Comparing parts together with concat (gives me what I want but is 
> hacky)")
> (pairs
> .where(concat('p1.city', 'p1.person') < concat('p2.city', 'p2.person'))
> .show())
> print("Comparing parts together with struct (my desired solution but 
> currently yields an error)")
> (pairs
> .where(col('p1') < col('p2'))
> .show())
> {code}
> The last query yields the following error in Spark 2.1.1:
> {code}
> org.apache.spark.sql.AnalysisException: cannot resolve '(`p1` < `p2`)' due to 
> data type mismatch: '(`p1` < `p2`)' requires (boolean or tinyint or smallint 
> or int or bigint or float or double or decimal or timestamp or date or string 
> or binary) type, not struct;;
> 'Filter (p1#5 < p2#8)
> +- Join Cross
>:- Project [named_struct(city, city#0, person, person#1) AS p1#5]
>:  +- LogicalRDD [city#0, person#1]
>+- Project [named_struct(city, city#0, person, person#1) AS p2#8]
>   +- LogicalRDD [city#0, person#1]
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21110) Structs should be usable in inequality filters

2017-06-23 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16061398#comment-16061398
 ] 

Michael Armbrust commented on SPARK-21110:
--

It seems that if you can call {{min}} and {{max}} on structs, you should be able to 
use comparison operations as well.
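
As an editorial aside, until {{<}} works on structs directly, the desired lexicographic ordering can be spelled out field by field. Below is a minimal PySpark sketch; it assumes the {{pairs}} DataFrame built in the issue's example, and {{struct_lt}} is a hypothetical helper, not an existing API:

{code}
from functools import reduce
from pyspark.sql.functions import col

def struct_lt(p1, p2, fields):
    """Hypothetical helper: build a lexicographic p1 < p2 over the named fields."""
    clauses = []
    for i, f in enumerate(fields):
        # all earlier fields equal, current field strictly less
        eqs = [col(p1 + '.' + g) == col(p2 + '.' + g) for g in fields[:i]]
        lt = col(p1 + '.' + f) < col(p2 + '.' + f)
        clauses.append(reduce(lambda a, b: a & b, eqs + [lt]))
    return reduce(lambda a, b: a | b, clauses)

# `pairs` is the cross-joined DataFrame from the issue's example above.
pairs.where(struct_lt('p1', 'p2', ['city', 'person'])).show()
{code}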

> Structs should be usable in inequality filters
> --
>
> Key: SPARK-21110
> URL: https://issues.apache.org/jira/browse/SPARK-21110
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Nicholas Chammas
>Priority: Minor
>
> It seems like a missing feature that you can't compare structs in a filter on 
> a DataFrame.
> Here's a simple demonstration of a) where this would be useful and b) how 
> it's different from simply comparing each of the components of the structs.
> {code}
> import pyspark
> from pyspark.sql.functions import col, struct, concat
> spark = pyspark.sql.SparkSession.builder.getOrCreate()
> df = spark.createDataFrame(
> [
> ('Boston', 'Bob'),
> ('Boston', 'Nick'),
> ('San Francisco', 'Bob'),
> ('San Francisco', 'Nick'),
> ],
> ['city', 'person']
> )
> pairs = (
> df.select(
> struct('city', 'person').alias('p1')
> )
> .crossJoin(
> df.select(
> struct('city', 'person').alias('p2')
> )
> )
> )
> print("Everything")
> pairs.show()
> print("Comparing parts separately (doesn't give me what I want)")
> (pairs
> .where(col('p1.city') < col('p2.city'))
> .where(col('p1.person') < col('p2.person'))
> .show())
> print("Comparing parts together with concat (gives me what I want but is 
> hacky)")
> (pairs
> .where(concat('p1.city', 'p1.person') < concat('p2.city', 'p2.person'))
> .show())
> print("Comparing parts together with struct (my desired solution but 
> currently yields an error)")
> (pairs
> .where(col('p1') < col('p2'))
> .show())
> {code}
> The last query yields the following error in Spark 2.1.1:
> {code}
> org.apache.spark.sql.AnalysisException: cannot resolve '(`p1` < `p2`)' due to 
> data type mismatch: '(`p1` < `p2`)' requires (boolean or tinyint or smallint 
> or int or bigint or float or double or decimal or timestamp or date or string 
> or binary) type, not struct;;
> 'Filter (p1#5 < p2#8)
> +- Join Cross
>:- Project [named_struct(city, city#0, person, person#1) AS p1#5]
>:  +- LogicalRDD [city#0, person#1]
>+- Project [named_struct(city, city#0, person, person#1) AS p2#8]
>   +- LogicalRDD [city#0, person#1]
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-21137) Spark reads many small files slowly

2017-06-23 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reopened SPARK-21137:
---

> Spark reads many small files slowly
> ---
>
> Key: SPARK-21137
> URL: https://issues.apache.org/jira/browse/SPARK-21137
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: sam
>Priority: Minor
>
> A very common use case in big data is to read a large number of small files.  
> For example the Enron email dataset has 1,227,645 small files.
> When one tries to read this data using Spark, one will hit many issues.  
> Firstly, even if the data is small (each file only say 1K), any job can take a 
> very long time. I have a simple job that has been running for 3 hours and has 
> not yet got to the point of starting any tasks; I doubt it will ever 
> finish.
> It seems all the code in Spark that manages file listing is single-threaded 
> and not well optimised.  When I hand-crank the code and don't use Spark, my 
> job runs much faster.
> Is it possible that I'm missing some configuration option? It seems kinda 
> surprising to me that Spark cannot read Enron data given that it's such a 
> quintessential example.
> It takes 1 hour to output the line "1,227,645 input paths to process", and then 
> another hour to output the same line again. Then it outputs a CSV of all 
> the input paths (creating a text storm).
> Now it's been stuck on the following:
> {code}
> 17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo 
> library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608]
> {code}
> for 2.5 hours.
> I've provided full reproduction steps (including code and cluster setup) at 
> https://github.com/samthebest/scenron; scroll down to "Bug In Spark". You can 
> simply clone it and follow the README to reproduce exactly.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21137) Spark reads many small files slowly

2017-06-23 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-21137:
--
Affects Version/s: (was: 2.2.1)
   2.1.1
 Priority: Minor  (was: Major)
   Issue Type: Improvement  (was: Bug)
  Summary: Spark reads many small files slowly  (was: Spark cannot 
read many small files (wholeTextFiles))

So just to move this along, I did the thread dump. Yeah, it's spending a huge 
amount of time examining the input files in the Hadoop {{InputFormat}}:

{code}
"main" #1 prio=5 os_prio=31 tid=0x7fe85f004000 nid=0x1c03 runnable 
[0x79a5e000]
   java.lang.Thread.State: RUNNABLE
at java.lang.UNIXProcess.forkAndExec(Native Method)
at java.lang.UNIXProcess.(UNIXProcess.java:247)
at java.lang.ProcessImpl.start(ProcessImpl.java:134)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:522)
at org.apache.hadoop.util.Shell.run(Shell.java:478)
at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:766)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:859)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:842)
at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1097)
at 
org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:587)
at 
org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:562)
at 
org.apache.hadoop.fs.LocatedFileStatus.(LocatedFileStatus.java:47)
at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1701)
at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1681)
at 
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:303)
at 
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:264)
at 
org.apache.spark.input.WholeTextFileInputFormat.setMinPartitions(WholeTextFileInputFormat.scala:55)
at 
org.apache.spark.rdd.WholeTextFileRDD.getPartitions(WholeTextFileRDD.scala:49)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
...
{code}

This is slow because it's single-threaded, and in the case of a local file 
system, actually uses things like {{ls}} to traverse the directories. I suspect 
it can only be worse on S3; not sure about HDFS.

Does this work need to be done at all? Well, Spark here is trying to figure 
out the max split size to configure, in order to enforce a minimum number of 
partitions. To me, this {{minPartitions}} argument probably should never have 
been in there: just repartition as desired. It's even capped at 2. But the arg 
exists, and the default behavior is still used by a lot of methods.

I found there's a Hadoop option to list the dirs in parallel, and that sped 
things up a lot -- it still took a minute or so to crunch on my laptop, but that's 
much better than about 10. I think it's valid to set this listing parallelism to 
something like {{Runtime.getRuntime.availableProcessors}} just for these 
methods. They're expecting to encounter a bunch of files, after all, even if a 
million is probably not a great idea.

That much I think is an unobtrusive change that makes a big difference. I'd be 
OK with that.
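
The comment above does not name the option; assuming it is the standard {{mapreduce.input.fileinputformat.list-status.num-threads}} property honoured by the Hadoop {{FileInputFormat}} listing code shown in the stack trace, here is a hedged sketch of setting it from the application side (the input path is a placeholder):

{code}
from pyspark.sql import SparkSession

# Hedged sketch: the property below is assumed to be the parallel-listing
# option referred to above; spark.hadoop.* settings are copied into the
# Hadoop Configuration used for input listing.
spark = (
    SparkSession.builder
    .appName("many-small-files")
    .config("spark.hadoop.mapreduce.input.fileinputformat.list-status.num-threads", "16")
    .getOrCreate()
)

# Placeholder path standing in for the directory of ~1.2M small files.
files = spark.sparkContext.wholeTextFiles("/data/enron/maildir/*/*")
print(files.count())
{code}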

> Spark reads many small files slowly
> ---
>
> Key: SPARK-21137
> URL: https://issues.apache.org/jira/browse/SPARK-21137
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: sam
>Priority: Minor
>
> A very common use case in big data is to read a large number of small files.  
> For example the Enron email dataset has 1,227,645 small files.
> When one tries to read this data using Spark, one will hit many issues.  
> Firstly, even if the data is small (each file only say 1K), any job can take a 
> very long time. I have a simple job that has been running for 3 hours and has 
> not yet got to the point of starting any tasks; I doubt it will ever 
> finish.
> It seems all the code in Spark that manages file listi

[jira] [Updated] (SPARK-21190) SPIP: Vectorized UDFs in Python

2017-06-23 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-21190:

Description: 
*Background and Motivation*
Python is one of the most popular programming languages among Spark users. 
Spark currently exposes a row-at-a-time interface for defining and executing 
user-defined functions (UDFs). This introduces high overhead in serialization 
and deserialization, and also makes it difficult to leverage Python libraries 
(e.g. numpy, Pandas) that are written in native code.
 
This proposal advocates introducing new APIs to support vectorized UDFs in 
Python, in which a block of data is transferred over to Python in some columnar 
format for execution.

 
 
*Target Personas*
Data scientists, data engineers, library developers.
 

*Goals*
- Support vectorized UDFs that apply on chunks of the data frame
- Low system overhead: Substantially reduce serialization and deserialization 
overhead when compared with row-at-a-time interface
- UDF performance: Enable users to leverage native libraries in Python (e.g. 
numpy, Pandas) for data manipulation in these UDFs
 

*Non-Goals*
The following are explicitly out of scope for the current SPIP, and should be 
done in future SPIPs. Nonetheless, it would be good to consider these future 
use cases during API design, so we can achieve some consistency when rolling 
out new APIs.
 
- Define block oriented UDFs in other languages (that are not Python).
- Define aggregate UDFs
- Tight integration with machine learning frameworks

 
*Proposed API Changes*
The following sketches some possibilities. I haven’t spent a lot of time 
thinking about the API (wrote it down in 5 mins) and I am not attached to this 
design at all. The main purpose of the SPIP is to get feedback on use cases and 
see how they can impact API design.
 
Two things to consider are:
 
1. Python is dynamically typed, whereas DataFrames/SQL requires static, 
analysis-time typing. This means users would need to specify the return type of 
their UDFs.
 
2. Ratio of input rows to output rows. We propose that initially we require the 
number of output rows to be the same as the number of input rows. In the future, 
we can consider relaxing this constraint with support for vectorized aggregate 
UDFs.
 
Proposed API sketch (using examples):
 
Use case 1. A function that defines all the columns of a DataFrame (similar to 
a “map” function):
 
{code}
@spark_udf(some way to describe the return schema)
def my_func_on_entire_df(input):
  """ Some user-defined function.
 
  :param input: A Pandas DataFrame with two columns, a and b.
  :return: :class: A Pandas data frame.
  """
  input['c'] = input['a'] + input['b']
  return input
 
spark.range(1000).selectExpr("id a", "id / 2 b")
  .mapBatches(my_func_on_entire_df)
{code}
 
 
Use case 2. A function that defines only one column (similar to existing UDFs):
 
{code}
@spark_udf(some way to describe the return schema)
def my_func_that_returns_one_column(input):
  """ Some user-defined function.
 
  :param input: A Pandas DataFrame with two columns, a and b.
  :return: :class: A numpy array
  """
  return input['a'] + input['b']
 
my_func = udf(my_func_that_returns_one_column)
 
df = spark.range(1000).selectExpr("id a", "id / 2 b")
df.withColumn("c", my_func(df.a, df.b))
{code}
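 
For contrast, an editorial sketch (not part of the SPIP) of how the same per-column computation looks with the existing row-at-a-time {{udf}} API, which is what incurs the per-row invocation and serialization overhead described above; {{spark}} is an existing SparkSession, as in the sketches above:
 
{code}
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

# Existing row-at-a-time API: the lambda is invoked once per row, with values
# shipped to and from the Python worker row by row rather than in batches.
add_ab = udf(lambda a, b: a + b, DoubleType())

df = spark.range(1000).selectExpr("id * 1.0 AS a", "id / 2 AS b")
df.withColumn("c", add_ab(df.a, df.b)).show()
{code}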

 
 
 
*Optional Design Sketch*
I’m more concerned about getting proper feedback for API design. The 
implementation should be pretty straightforward and is not a huge concern at 
this point. We can leverage the same implementation for faster toPandas (using 
Arrow).

 
 
*Optional Rejected Designs*
See above.
 
 
 
 


  was:
*Background and Motivation*
Python is one of the most popular programming languages among Spark users. 
Spark currently exposes a row-at-a-time interface for defining and executing 
user-defined functions (UDFs). This introduces high overhead in serialization 
and deserialization, and also makes it difficult to leverage Python libraries 
that are written in native code. This proposal advocates introducing new APIs 
to support vectorized UDFs in Python, in which a block of data is transferred 
over to Python in some column format for execution.
 
 
*Target Personas*
Data scientists, data engineers, library developers.
 

*Goals*
... todo ...
 

*Non-Goals*
- Define block oriented UDFs in other languages (that are not Python).
- Define aggregate UDFs
 
 
*Proposed API Changes*
... todo ...
 
 
 
*Optional Design Sketch*
The implementation should be pretty straightforward and is not a huge concern 
at this point. I’m more concerned about getting proper feedback for API design.
 
 
*Optional Rejected Designs*
See above.
 
 
 
 



> SPIP: Vectorized UDFs in Python
> ---
>
> Key: SPARK-21190
> URL: https://issues.apache.org/jira/browse/SPARK-21190
> Project: Spark
>  Issue Type: New Feature
>   

[jira] [Closed] (SPARK-20817) Benchmark.getProcessorName() returns "Unknown processor" on ppc and 390 platforms

2017-06-23 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin closed SPARK-20817.
---
Resolution: Won't Fix

See github discussions.


> Benchmark.getProcessorName() returns "Unknown processor" on ppc and 390 
> platforms
> -
>
> Key: SPARK-20817
> URL: https://issues.apache.org/jira/browse/SPARK-20817
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Kazuaki Ishizaki
>Priority: Minor
>
> {{Benchmark.getProcessorName()}} returns "Unknown processor" string on ppc 
> and 390 Linux platforms



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21190) SPIP: Vectorized UDFs in Python

2017-06-23 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-21190:

Description: 
*Background and Motivation*
Python is one of the most popular programming languages among Spark users. 
Spark currently exposes a row-at-a-time interface for defining and executing 
user-defined functions (UDFs). This introduces high overhead in serialization 
and deserialization, and also makes it difficult to leverage Python libraries 
(e.g. numpy, Pandas) that are written in native code.
 
This proposal advocates introducing new APIs to support vectorized UDFs in 
Python, in which a block of data is transferred over to Python in some columnar 
format for execution.

 
 
*Target Personas*
Data scientists, data engineers, library developers.
 

*Goals*
- Support vectorized UDFs that apply on chunks of the data frame
- Low system overhead: Substantially reduce serialization and deserialization 
overhead when compared with row-at-a-time interface
- UDF performance: Enable users to leverage native libraries in Python (e.g. 
numpy, Pandas) for data manipulation in these UDFs
 

*Non-Goals*
The following are explicitly out of scope for the current SPIP, and should be 
done in future SPIPs. Nonetheless, it would be good to consider these future 
use cases during API design, so we can achieve some consistency when rolling 
out new APIs.
 
- Define block oriented UDFs in other languages (that are not Python).
- Define aggregate UDFs
- Tight integration with machine learning frameworks

 
*Proposed API Changes*
The following sketches some possibilities. I haven’t spent a lot of time 
thinking about the API (wrote it down in 5 mins) and I am not attached to this 
design at all. The main purpose of the SPIP is to get feedback on use cases and 
see how they can impact API design.
 
Two things to consider are:
 
1. Python is dynamically typed, whereas DataFrames/SQL requires static, 
analysis-time typing. This means users would need to specify the return type of 
their UDFs.
 
2. Ratio of input rows to output rows. We propose that initially we require the 
number of output rows to be the same as the number of input rows. In the future, 
we can consider relaxing this constraint with support for vectorized aggregate 
UDFs.
 
Proposed API sketch (using examples):
 
Use case 1. A function that defines all the columns of a DataFrame (similar to 
a “map” function):
 
{code}
@spark_udf(some way to describe the return schema)
def my_func_on_entire_df(input):
  """ Some user-defined function.
 
  :param input: A Pandas DataFrame with two columns, a and b.
  :return: :class: A Pandas data frame.
  """
  input['c'] = input['a'] + input['b']
  input['d'] = input['a'] - input['b']
  return input
 
spark.range(1000).selectExpr("id a", "id / 2 b")
  .mapBatches(my_func_on_entire_df)
{code}
 
Use case 2. A function that defines only one column (similar to existing UDFs):
 
{code}
@spark_udf(some way to describe the return schema)
def my_func_that_returns_one_column(input):
  """ Some user-defined function.
 
  :param input: A Pandas DataFrame with two columns, a and b.
  :return: :class: A numpy array
  """
  return input['a'] + input['b']
 
my_func = udf(my_func_that_returns_one_column)
 
df = spark.range(1000).selectExpr("id a", "id / 2 b")
df.withColumn("c", my_func(df.a, df.b))
{code}

 
 
 
*Optional Design Sketch*
I’m more concerned about getting proper feedback for API design. The 
implementation should be pretty straightforward and is not a huge concern at 
this point. We can leverage the same implementation for faster toPandas (using 
Arrow).

 
 
*Optional Rejected Designs*
See above.
 
 
 
 


  was:
*Background and Motivation*
Python is one of the most popular programming languages among Spark users. 
Spark currently exposes a row-at-a-time interface for defining and executing 
user-defined functions (UDFs). This introduces high overhead in serialization 
and deserialization, and also makes it difficult to leverage Python libraries 
(e.g. numpy, Pandas) that are written in native code.
 
This proposal advocates introducing new APIs to support vectorized UDFs in 
Python, in which a block of data is transferred over to Python in some columnar 
format for execution.

 
 
*Target Personas*
Data scientists, data engineers, library developers.
 

*Goals*
- Support vectorized UDFs that apply on chunks of the data frame
- Low system overhead: Substantially reduce serialization and deserialization 
overhead when compared with row-at-a-time interface
- UDF performance: Enable users to leverage native libraries in Python (e.g. 
numpy, Pandas) for data manipulation in these UDFs
 

*Non-Goals*
The following are explicitly out of scope for the current SPIP, and should be 
done in future SPIPs. Nonetheless, it would be good to consider these future 
use cases during API design, so we can achieve some consistency when rolling 
out new APIs.
 
- Define 

[jira] [Commented] (SPARK-20307) SparkR: pass on setHandleInvalid to spark.mllib functions that use StringIndexer

2017-06-23 Thread Miao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16061525#comment-16061525
 ] 

Miao Wang commented on SPARK-20307:
---

[~Monday0927!] I am working on it now. Thanks! Miao cc [~felixcheung]
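
For reference, a hedged editorial sketch (in PySpark, since the SparkR wrappers sit on the same ML pipeline) of the {{handleInvalid}} behaviour the SparkR functions would need to pass through; the column names and data are made up:

{code}
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer

spark = SparkSession.builder.getOrCreate()
train = spark.createDataFrame([("this",), ("that",)], ["someString"])
test = spark.createDataFrame([("this",), ("the other",)], ["someString"])

# handleInvalid="skip" drops rows whose label was not seen during fit(),
# instead of failing with "Unseen label".
indexer = StringIndexer(inputCol="someString", outputCol="someStringIdx",
                        handleInvalid="skip")
indexer.fit(train).transform(test).show()
{code}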


> SparkR: pass on setHandleInvalid to spark.mllib functions that use 
> StringIndexer
> 
>
> Key: SPARK-20307
> URL: https://issues.apache.org/jira/browse/SPARK-20307
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Anne Rutten
>Priority: Minor
>
> When training a model in SparkR with string variables (tested with 
> spark.randomForest, but I assume this is valid for all spark.xx functions that 
> apply a StringIndexer under the hood), testing on a new dataset with factor 
> levels that are not in the training set will throw an "Unseen label" error. 
> I think this can be solved if there's a way to pass setHandleInvalid on to 
> the StringIndexers when calling spark.randomForest.
> code snippet:
> {code}
> # (i've run this in Zeppelin which already has SparkR and the context loaded)
> #library(SparkR)
> #sparkR.session(master = "local[*]") 
> data = data.frame(clicked = base::sample(c(0,1),100,replace=TRUE),
>   someString = base::sample(c("this", "that"), 
> 100, replace=TRUE), stringsAsFactors=FALSE)
> trainidxs = base::sample(nrow(data), nrow(data)*0.7)
> traindf = as.DataFrame(data[trainidxs,])
> testdf = as.DataFrame(rbind(data[-trainidxs,],c(0,"the other")))
> rf = spark.randomForest(traindf, clicked~., type="classification", 
> maxDepth=10, 
> maxBins=41,
> numTrees = 100)
> predictions = predict(rf, testdf)
> SparkR::collect(predictions)
> {code}
> stack trace:
> {quote}
> Error in handleErrors(returnStatus, conn): org.apache.spark.SparkException: 
> Job aborted due to stage failure: Task 0 in stage 607.0 failed 1 times, most 
> recent failure: Lost task 0.0 in stage 607.0 (TID 1581, localhost, executor 
> driver): org.apache.spark.SparkException: Failed to execute user defined 
> function($anonfun$4: (string) => double)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:99)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.SparkException: Unseen label: the other.
> at 
> org.apache.spark.ml.feature.StringIndexerModel$$anonfun$4.apply(StringIndexer.scala:170)
> at 
> org.apache.spark.ml.feature.StringIndexerModel$$anonfun$4.apply(StringIndexer.scala:166)
> ... 16 more
> Driver stacktrace:
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
> at scala.Option.foreach(Option.scala

[jira] [Commented] (SPARK-20307) SparkR: pass on setHandleInvalid to spark.mllib functions that use StringIndexer

2017-06-23 Thread Joseph Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16061532#comment-16061532
 ] 

Joseph Wang commented on SPARK-20307:
-

Thanks a lot. Let me know when you have done that. This will really boost 
SparkR's standing for industry usage, making it easier to embrace the advantages 
of R with Spark in production environments.


> SparkR: pass on setHandleInvalid to spark.mllib functions that use 
> StringIndexer
> 
>
> Key: SPARK-20307
> URL: https://issues.apache.org/jira/browse/SPARK-20307
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Anne Rutten
>Priority: Minor
>
> When training a model in SparkR with string variables (tested with 
> spark.randomForest, but I assume this is valid for all spark.xx functions that 
> apply a StringIndexer under the hood), testing on a new dataset with factor 
> levels that are not in the training set will throw an "Unseen label" error. 
> I think this can be solved if there's a way to pass setHandleInvalid on to 
> the StringIndexers when calling spark.randomForest.
> code snippet:
> {code}
> # (i've run this in Zeppelin which already has SparkR and the context loaded)
> #library(SparkR)
> #sparkR.session(master = "local[*]") 
> data = data.frame(clicked = base::sample(c(0,1),100,replace=TRUE),
>   someString = base::sample(c("this", "that"), 
> 100, replace=TRUE), stringsAsFactors=FALSE)
> trainidxs = base::sample(nrow(data), nrow(data)*0.7)
> traindf = as.DataFrame(data[trainidxs,])
> testdf = as.DataFrame(rbind(data[-trainidxs,],c(0,"the other")))
> rf = spark.randomForest(traindf, clicked~., type="classification", 
> maxDepth=10, 
> maxBins=41,
> numTrees = 100)
> predictions = predict(rf, testdf)
> SparkR::collect(predictions)
> {code}
> stack trace:
> {quote}
> Error in handleErrors(returnStatus, conn): org.apache.spark.SparkException: 
> Job aborted due to stage failure: Task 0 in stage 607.0 failed 1 times, most 
> recent failure: Lost task 0.0 in stage 607.0 (TID 1581, localhost, executor 
> driver): org.apache.spark.SparkException: Failed to execute user defined 
> function($anonfun$4: (string) => double)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:99)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.SparkException: Unseen label: the other.
> at 
> org.apache.spark.ml.feature.StringIndexerModel$$anonfun$4.apply(StringIndexer.scala:170)
> at 
> org.apache.spark.ml.feature.StringIndexerModel$$anonfun$4.apply(StringIndexer.scala:166)
> ... 16 more
> Driver stacktrace:
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
> at 
> org.apache.spark.s

[jira] [Resolved] (SPARK-21164) Remove isTableSample from Sample and isGenerated from Alias and AttributeReference

2017-06-23 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-21164.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

> Remove isTableSample from Sample and isGenerated from Alias and 
> AttributeReference
> --
>
> Key: SPARK-21164
> URL: https://issues.apache.org/jira/browse/SPARK-21164
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.3.0
>
>
> isTableSample and isGenerated were introduced for SQL Generation by PR 11148 
> and PR 11050, respectively.
> Since SQL Generation has been removed, we no longer need to keep them.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21157) Report Total Memory Used by Spark Executors

2017-06-23 Thread Jose Soltren (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16061582#comment-16061582
 ] 

Jose Soltren commented on SPARK-21157:
--

Hello Marcelo - It's true, the design doc doesn't discuss flights very well. 
Let me give my thoughts on them here, and I'll propagate this back to the 
design doc at a later time.

So, first off, I thought for a while to come up with a catchy name for "a 
period of time during which the number of stages in execution is constant". I 
came up with "flight". If you have a better term I would love to hear your 
thoughts. Let's stick with "flight" for now.

After giving it some thought, I don't think the end user needs to care about 
flights at all. Here's what I think the user does care about: seeing some 
metrics (min/max/mean/stdev) for how different types of memory are consumed for 
a particular stage.

I haven't worked out the details of the data store yet, but, what I envision is 
a data store of key-value pairs, where the key is the start time of a 
particular flight, and the values are the metrics associated with that flight 
and the stages that were running for the duration of that flight. Then, for a 
particular stage, we would be able to query all of the flights during which 
this stage was active, get min/max/mean/stdev metrics for each of those 
flights, and aggregate them to get total metrics for that particular stage.

These total metrics for the stage would be shown in the Stages UI.

Of course, with this data store, you could directly query statistics for a 
particular flight.

Note that there is not a precise way to determine memory used for a particular 
stage at a given time unless it was the only stage active in that flight. If 
memory usage for stages were constant then we could possibly impute the memory 
usage for a single stage given all of its flight statistics. This is not 
feasible, so the UI would be clear that these were total memory metrics for 
executors while the stage was running, and not specific to that stage. Even 
this should be enough for an end user to do some detective work and determine 
which stage is hogging memory.

I glossed over some of these details since I thought they were well covered in 
SPARK-9103. I hope this clarifies things somewhat. If not, please let me know 
how I can clarify this further. Cheers.
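
To make the aggregation concrete, here is a toy editorial sketch of the flight-keyed store described above; all names and numbers are hypothetical and only illustrate the per-stage rollup:

{code}
from statistics import mean, pstdev

# Hypothetical flight store: flight start time -> (active stage ids,
# per-executor total-memory samples in MB collected during that flight).
flights = {
    100.0: ({1, 2}, [512, 640, 600]),
    160.0: ({2},    [700, 720, 680]),
    220.0: ({2, 3}, [900, 950, 910]),
}

def stage_memory_metrics(stage_id):
    """Aggregate executor memory over every flight in which the stage was active."""
    samples = [m for stages, mems in flights.values()
               if stage_id in stages for m in mems]
    return {"min": min(samples), "max": max(samples),
            "mean": mean(samples), "stdev": pstdev(samples)}

# Totals while stage 2 was running, not memory attributable to stage 2 alone.
print(stage_memory_metrics(2))
{code}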

> Report Total Memory Used by Spark Executors
> ---
>
> Key: SPARK-21157
> URL: https://issues.apache.org/jira/browse/SPARK-21157
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 2.1.1
>Reporter: Jose Soltren
> Attachments: TotalMemoryReportingDesignDoc.pdf
>
>
> Building on some of the core ideas of SPARK-9103, this JIRA proposes tracking 
> total memory used by Spark executors, and a means of broadcasting, 
> aggregating, and reporting memory usage data in the Spark UI.
> Here, "total memory used" refers to memory usage that is visible outside of 
> Spark, to an external observer such as YARN, Mesos, or the operating system. 
> The goal of this enhancement is to give Spark users more information about 
> how Spark clusters are using memory. Total memory will include non-Spark JVM 
> memory and all off-heap memory.
> Please consult the attached design document for further details.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20971) Purge the metadata log for FileStreamSource

2017-06-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16061596#comment-16061596
 ] 

Apache Spark commented on SPARK-20971:
--

User 'CodingCat' has created a pull request for this issue:
https://github.com/apache/spark/pull/18410

> Purge the metadata log for FileStreamSource
> ---
>
> Key: SPARK-20971
> URL: https://issues.apache.org/jira/browse/SPARK-20971
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.1.1
>Reporter: Shixiong Zhu
>
> Currently 
> [FileStreamSource.commit|https://github.com/apache/spark/blob/16186cdcbce1a2ec8f839c550e6b571bf5dc2692/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala#L258]
>  is empty. We can delete unused metadata logs in this method to reduce the 
> size of log files.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20971) Purge the metadata log for FileStreamSource

2017-06-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20971:


Assignee: Apache Spark

> Purge the metadata log for FileStreamSource
> ---
>
> Key: SPARK-20971
> URL: https://issues.apache.org/jira/browse/SPARK-20971
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.1.1
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>
> Currently 
> [FileStreamSource.commit|https://github.com/apache/spark/blob/16186cdcbce1a2ec8f839c550e6b571bf5dc2692/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala#L258]
>  is empty. We can delete unused metadata logs in this method to reduce the 
> size of log files.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20971) Purge the metadata log for FileStreamSource

2017-06-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20971:


Assignee: (was: Apache Spark)

> Purge the metadata log for FileStreamSource
> ---
>
> Key: SPARK-20971
> URL: https://issues.apache.org/jira/browse/SPARK-20971
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.1.1
>Reporter: Shixiong Zhu
>
> Currently 
> [FileStreamSource.commit|https://github.com/apache/spark/blob/16186cdcbce1a2ec8f839c550e6b571bf5dc2692/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala#L258]
>  is empty. We can delete unused metadata logs in this method to reduce the 
> size of log files.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21197) Tricky use cases makes dead application struggle for a long duration

2017-06-23 Thread Nan Zhu (JIRA)
Nan Zhu created SPARK-21197:
---

 Summary: Tricky use cases makes dead application struggle for a 
long duration
 Key: SPARK-21197
 URL: https://issues.apache.org/jira/browse/SPARK-21197
 Project: Spark
  Issue Type: Bug
  Components: DStreams, Spark Core
Affects Versions: 2.1.1, 2.0.2
Reporter: Nan Zhu


The use case is in Spark Streaming while the root cause is in DAGScheduler, so 
I set the components to both DStreams and Spark Core.

Use case: 

The user has a thread periodically triggering Spark jobs, and in the same 
application they retrieve data through Spark Streaming from somewhere. In 
the Streaming logic an exception is thrown, so the whole application is 
supposed to be shut down and restarted by YARN.

The user observed that, even 18 hours after the exception was propagated to 
Spark core and SparkContext.stop() was called, the application was still 
running.

The root cause is that when we call DAGScheduler.stop(), we will wait for 
eventLoop's thread to finish 
(https://github.com/apache/spark/blob/03eb6117affcca21798be25706a39e0d5a2f7288/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1704
 and 
https://github.com/apache/spark/blob/03eb6117affcca21798be25706a39e0d5a2f7288/core/src/main/scala/org/apache/spark/util/EventLoop.scala#L40)

Since there is a thread periodically pushing events to the DAGScheduler's event 
queue, it will never finish.

A potential solution is to allow EventLoop to interrupt its thread directly in 
some cases (e.g. this one) while still allowing graceful shutdown in other cases 
(e.g. the ListenerBus one); see the sketch below.
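
To illustrate the two shutdown modes, a generic editorial sketch in plain Python threading (not Spark's Scala EventLoop); a graceful stop drains the queue before joining, while an immediate stop exits regardless of pending events:

{code}
import queue
import threading

class ToyEventLoop(object):
    """Generic sketch: graceful stop drains the queue; immediate stop does not."""

    def __init__(self):
        self.events = queue.Queue()
        self._stop = threading.Event()
        self._drain = True
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def _run(self):
        while not self._stop.is_set() or (self._drain and not self.events.empty()):
            try:
                self.events.get(timeout=0.1)  # process one event
            except queue.Empty:
                pass

    def stop(self, graceful=True):
        # graceful=True mirrors today's behaviour: if another thread keeps
        # posting events, the queue never empties and join() never returns.
        self._drain = graceful
        self._stop.set()
        self._thread.join()
{code}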





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21197) Tricky use case makes dead application struggle for a long duration

2017-06-23 Thread Nan Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nan Zhu updated SPARK-21197:

Summary: Tricky use case makes dead application struggle for a long 
duration  (was: Tricky use cases makes dead application struggle for a long 
duration)

> Tricky use case makes dead application struggle for a long duration
> ---
>
> Key: SPARK-21197
> URL: https://issues.apache.org/jira/browse/SPARK-21197
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams, Spark Core
>Affects Versions: 2.0.2, 2.1.1
>Reporter: Nan Zhu
>
> The use case is in Spark Streaming while the root cause is in DAGScheduler, 
> so I set the components to both DStreams and Spark Core.
> Use case: 
> The user has a thread periodically triggering Spark jobs, and in the same 
> application they retrieve data through Spark Streaming from somewhere. In 
> the Streaming logic an exception is thrown, so the whole application is 
> supposed to be shut down and restarted by YARN.
> The user observed that, even 18 hours after the exception was propagated to 
> Spark core and SparkContext.stop() was called, the application was still 
> running.
> The root cause is that when we call DAGScheduler.stop(), we will wait for 
> eventLoop's thread to finish 
> (https://github.com/apache/spark/blob/03eb6117affcca21798be25706a39e0d5a2f7288/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1704
>  and 
> https://github.com/apache/spark/blob/03eb6117affcca21798be25706a39e0d5a2f7288/core/src/main/scala/org/apache/spark/util/EventLoop.scala#L40)
> Since there is a thread periodically pushing events to the DAGScheduler's event 
> queue, it will never finish.
> A potential solution is to allow EventLoop to interrupt its thread directly in 
> some cases (e.g. this one) while still allowing graceful shutdown in other 
> cases (e.g. the ListenerBus one).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21198) SparkSession catalog is terribly slow

2017-06-23 Thread Saif Addin (JIRA)
Saif Addin created SPARK-21198:
--

 Summary: SparkSession catalog is terribly slow
 Key: SPARK-21198
 URL: https://issues.apache.org/jira/browse/SPARK-21198
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.1.0
Reporter: Saif Addin


We have a considerably large Hive metastore and a Spark program that checks 
which Hive data is available.

In Spark 1.x, we were using sqlContext.tableNames or sqlContext.sql() to go 
through Hive.
Once migrated to Spark 2.x, we switched over to SparkSession.catalog instead, but 
it turns out that both listDatabases() and listTables() take between 5 and 20 
minutes, depending on the database, to return results, using operations such as 
the following one:

spark.catalog.listTables(db).filter(_.isTemporary).map(_.name).collect

which makes the program unbearably slow just to return a list of tables.

I know we still have spark.sqlContext.tableNames as a workaround, but I assume 
this is going to be deprecated soon?
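
For completeness, a hedged editorial sketch of timing both listing paths mentioned above from PySpark ({{somedb}} is a placeholder database name):

{code}
import time
from pyspark.sql import SparkSession, SQLContext

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
sql_context = SQLContext(spark.sparkContext)
db = "somedb"  # placeholder database name

start = time.time()
via_catalog = [t.name for t in spark.catalog.listTables(db)]
print("catalog.listTables:    %.1fs, %d tables" % (time.time() - start, len(via_catalog)))

start = time.time()
via_sqlcontext = sql_context.tableNames(db)
print("sqlContext.tableNames: %.1fs, %d tables" % (time.time() - start, len(via_sqlcontext)))
{code}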



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


