[jira] [Updated] (SPARK-18686) Several cleanup and improvements for spark.logit

2016-12-01 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-18686:

Description: 
Several cleanup and improvements for {{spark.logit}}:
* {{summary}} should return the coefficients matrix, and should output labels 
for each class if the model is a multinomial logistic regression model.

  was:Several cleanup and improvements for spark.logit:


> Several cleanup and improvements for spark.logit
> 
>
> Key: SPARK-18686
> URL: https://issues.apache.org/jira/browse/SPARK-18686
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Reporter: Yanbo Liang
>
> Several cleanup and improvements for {{spark.logit}}:
> * {{summary}} should return the coefficients matrix, and should output labels 
> for each class if the model is a multinomial logistic regression model.
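
For context, a minimal Scala sketch of the underlying ML model (not the SparkR 
wrapper itself), showing the shape {{summary}} would need to expose for a 
multinomial model; {{spark}} is an assumed SparkSession and the data path is 
illustrative:

{code}
import org.apache.spark.ml.classification.LogisticRegression

// For a multinomial fit, the coefficients form a numClasses x numFeatures
// matrix with one intercept per class; summary() in SparkR would surface this
// matrix together with the class labels.
val training = spark.read.format("libsvm")
  .load("data/mllib/sample_multiclass_classification_data.txt")

val model = new LogisticRegression()
  .setFamily("multinomial")
  .fit(training)

println(s"numClasses = ${model.numClasses}, numFeatures = ${model.numFeatures}")
println(model.coefficientMatrix)   // numClasses x numFeatures
println(model.interceptVector)     // one intercept per class
{code}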






[jira] [Created] (SPARK-18686) Several cleanup and improvements for spark.logit

2016-12-01 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-18686:
---

 Summary: Several cleanup and improvements for spark.logit
 Key: SPARK-18686
 URL: https://issues.apache.org/jira/browse/SPARK-18686
 Project: Spark
  Issue Type: Improvement
  Components: ML, SparkR
Reporter: Yanbo Liang


Several cleanup and improvements for spark.logit:






[jira] [Assigned] (SPARK-18685) Fix all tests in ExecutorClassLoaderSuite to pass on Windows

2016-12-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18685:


Assignee: (was: Apache Spark)

> Fix all tests in ExecutorClassLoaderSuite to pass on Windows
> 
>
> Key: SPARK-18685
> URL: https://issues.apache.org/jira/browse/SPARK-18685
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Shell, Tests
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> There are two problems, as shown below: the tests in 
> {{ExecutorClassLoaderSuite}} should build the URI correctly, and the 
> {{BufferedSource}} returned by {{Source.fromInputStream}} should be closed 
> after it is opened. Currently, these issues lead to test failures on Windows.
> {code}
> ExecutorClassLoaderSuite:
> [info] - child first *** FAILED *** (78 milliseconds)
> [info]   java.net.URISyntaxException: Illegal character in authority at index 
> 7: 
> file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b
> [info]   at java.net.URI$Parser.fail(URI.java:2848)
> [info]   at java.net.URI$Parser.parseAuthority(URI.java:3186)
> ...
> [info] - parent first *** FAILED *** (15 milliseconds)
> [info]   java.net.URISyntaxException: Illegal character in authority at index 
> 7: 
> file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b
> [info]   at java.net.URI$Parser.fail(URI.java:2848)
> [info]   at java.net.URI$Parser.parseAuthority(URI.java:3186)
> ...
> [info] - child first can fall back *** FAILED *** (0 milliseconds)
> [info]   java.net.URISyntaxException: Illegal character in authority at index 
> 7: 
> file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b
> [info]   at java.net.URI$Parser.fail(URI.java:2848)
> [info]   at java.net.URI$Parser.parseAuthority(URI.java:3186)
> ...
> [info] - child first can fail *** FAILED *** (0 milliseconds)
> [info]   java.net.URISyntaxException: Illegal character in authority at index 
> 7: 
> file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b
> [info]   at java.net.URI$Parser.fail(URI.java:2848)
> [info]   at java.net.URI$Parser.parseAuthority(URI.java:3186)
> ...
> [info] - resource from parent *** FAILED *** (0 milliseconds)
> [info]   java.net.URISyntaxException: Illegal character in authority at index 
> 7: 
> file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b
> [info]   at java.net.URI$Parser.fail(URI.java:2848)
> [info]   at java.net.URI$Parser.parseAuthority(URI.java:3186)
> ...
> [info] - resources from parent *** FAILED *** (0 milliseconds)
> [info]   java.net.URISyntaxException: Illegal character in authority at index 
> 7: 
> file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b
> [info]   at java.net.URI$Parser.fail(URI.java:2848)
> [info]   at java.net.URI$Parser.parseAuthority(URI.java:3186)
> {code}
> {code}
> [info] Exception encountered when attempting to run a suite with class name: 
> org.apache.spark.repl.ExecutorClassLoaderSuite *** ABORTED *** (7 seconds, 
> 333 milliseconds)
> [info]   java.io.IOException: Failed to delete: 
> C:\projects\spark\target\tmp\spark-77b2f37b-6405-47c4-af1c-4a6a206511f2
> [info]   at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:1010)
> [info]   at 
> org.apache.spark.repl.ExecutorClassLoaderSuite.afterAll(ExecutorClassLoaderSuite.scala:76)
> [info]   at 
> org.scalatest.BeforeAndAfterAll$class.afterAll(BeforeAndAfterAll.scala:213)
> ...
> {code}






[jira] [Commented] (SPARK-18685) Fix all tests in ExecutorClassLoaderSuite to pass on Windows

2016-12-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15714356#comment-15714356
 ] 

Apache Spark commented on SPARK-18685:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/16116

> Fix all tests in ExecutorClassLoaderSuite to pass on Windows
> 
>
> Key: SPARK-18685
> URL: https://issues.apache.org/jira/browse/SPARK-18685
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Shell, Tests
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> There are two problems, as shown below: the tests in 
> {{ExecutorClassLoaderSuite}} should build the URI correctly, and the 
> {{BufferedSource}} returned by {{Source.fromInputStream}} should be closed 
> after it is opened. Currently, these issues lead to test failures on Windows.
> {code}
> ExecutorClassLoaderSuite:
> [info] - child first *** FAILED *** (78 milliseconds)
> [info]   java.net.URISyntaxException: Illegal character in authority at index 
> 7: 
> file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b
> [info]   at java.net.URI$Parser.fail(URI.java:2848)
> [info]   at java.net.URI$Parser.parseAuthority(URI.java:3186)
> ...
> [info] - parent first *** FAILED *** (15 milliseconds)
> [info]   java.net.URISyntaxException: Illegal character in authority at index 
> 7: 
> file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b
> [info]   at java.net.URI$Parser.fail(URI.java:2848)
> [info]   at java.net.URI$Parser.parseAuthority(URI.java:3186)
> ...
> [info] - child first can fall back *** FAILED *** (0 milliseconds)
> [info]   java.net.URISyntaxException: Illegal character in authority at index 
> 7: 
> file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b
> [info]   at java.net.URI$Parser.fail(URI.java:2848)
> [info]   at java.net.URI$Parser.parseAuthority(URI.java:3186)
> ...
> [info] - child first can fail *** FAILED *** (0 milliseconds)
> [info]   java.net.URISyntaxException: Illegal character in authority at index 
> 7: 
> file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b
> [info]   at java.net.URI$Parser.fail(URI.java:2848)
> [info]   at java.net.URI$Parser.parseAuthority(URI.java:3186)
> ...
> [info] - resource from parent *** FAILED *** (0 milliseconds)
> [info]   java.net.URISyntaxException: Illegal character in authority at index 
> 7: 
> file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b
> [info]   at java.net.URI$Parser.fail(URI.java:2848)
> [info]   at java.net.URI$Parser.parseAuthority(URI.java:3186)
> ...
> [info] - resources from parent *** FAILED *** (0 milliseconds)
> [info]   java.net.URISyntaxException: Illegal character in authority at index 
> 7: 
> file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b
> [info]   at java.net.URI$Parser.fail(URI.java:2848)
> [info]   at java.net.URI$Parser.parseAuthority(URI.java:3186)
> {code}
> {code}
> [info] Exception encountered when attempting to run a suite with class name: 
> org.apache.spark.repl.ExecutorClassLoaderSuite *** ABORTED *** (7 seconds, 
> 333 milliseconds)
> [info]   java.io.IOException: Failed to delete: 
> C:\projects\spark\target\tmp\spark-77b2f37b-6405-47c4-af1c-4a6a206511f2
> [info]   at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:1010)
> [info]   at 
> org.apache.spark.repl.ExecutorClassLoaderSuite.afterAll(ExecutorClassLoaderSuite.scala:76)
> [info]   at 
> org.scalatest.BeforeAndAfterAll$class.afterAll(BeforeAndAfterAll.scala:213)
> ...
> {code}






[jira] [Assigned] (SPARK-18685) Fix all tests in ExecutorClassLoaderSuite to pass on Windows

2016-12-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18685:


Assignee: Apache Spark

> Fix all tests in ExecutorClassLoaderSuite to pass on Windows
> 
>
> Key: SPARK-18685
> URL: https://issues.apache.org/jira/browse/SPARK-18685
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Shell, Tests
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Minor
>
> There are two problems, as shown below: the tests in 
> {{ExecutorClassLoaderSuite}} should build the URI correctly, and the 
> {{BufferedSource}} returned by {{Source.fromInputStream}} should be closed 
> after it is opened. Currently, these issues lead to test failures on Windows.
> {code}
> ExecutorClassLoaderSuite:
> [info] - child first *** FAILED *** (78 milliseconds)
> [info]   java.net.URISyntaxException: Illegal character in authority at index 
> 7: 
> file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b
> [info]   at java.net.URI$Parser.fail(URI.java:2848)
> [info]   at java.net.URI$Parser.parseAuthority(URI.java:3186)
> ...
> [info] - parent first *** FAILED *** (15 milliseconds)
> [info]   java.net.URISyntaxException: Illegal character in authority at index 
> 7: 
> file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b
> [info]   at java.net.URI$Parser.fail(URI.java:2848)
> [info]   at java.net.URI$Parser.parseAuthority(URI.java:3186)
> ...
> [info] - child first can fall back *** FAILED *** (0 milliseconds)
> [info]   java.net.URISyntaxException: Illegal character in authority at index 
> 7: 
> file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b
> [info]   at java.net.URI$Parser.fail(URI.java:2848)
> [info]   at java.net.URI$Parser.parseAuthority(URI.java:3186)
> ...
> [info] - child first can fail *** FAILED *** (0 milliseconds)
> [info]   java.net.URISyntaxException: Illegal character in authority at index 
> 7: 
> file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b
> [info]   at java.net.URI$Parser.fail(URI.java:2848)
> [info]   at java.net.URI$Parser.parseAuthority(URI.java:3186)
> ...
> [info] - resource from parent *** FAILED *** (0 milliseconds)
> [info]   java.net.URISyntaxException: Illegal character in authority at index 
> 7: 
> file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b
> [info]   at java.net.URI$Parser.fail(URI.java:2848)
> [info]   at java.net.URI$Parser.parseAuthority(URI.java:3186)
> ...
> [info] - resources from parent *** FAILED *** (0 milliseconds)
> [info]   java.net.URISyntaxException: Illegal character in authority at index 
> 7: 
> file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b
> [info]   at java.net.URI$Parser.fail(URI.java:2848)
> [info]   at java.net.URI$Parser.parseAuthority(URI.java:3186)
> {code}
> {code}
> [info] Exception encountered when attempting to run a suite with class name: 
> org.apache.spark.repl.ExecutorClassLoaderSuite *** ABORTED *** (7 seconds, 
> 333 milliseconds)
> [info]   java.io.IOException: Failed to delete: 
> C:\projects\spark\target\tmp\spark-77b2f37b-6405-47c4-af1c-4a6a206511f2
> [info]   at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:1010)
> [info]   at 
> org.apache.spark.repl.ExecutorClassLoaderSuite.afterAll(ExecutorClassLoaderSuite.scala:76)
> [info]   at 
> org.scalatest.BeforeAndAfterAll$class.afterAll(BeforeAndAfterAll.scala:213)
> ...
> {code}






[jira] [Created] (SPARK-18685) Fix all tests in ExecutorClassLoaderSuite to pass on Windows

2016-12-01 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-18685:


 Summary: Fix all tests in ExecutorClassLoaderSuite to pass on 
Windows
 Key: SPARK-18685
 URL: https://issues.apache.org/jira/browse/SPARK-18685
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Shell, Tests
Reporter: Hyukjin Kwon
Priority: Minor


There are two problems, as shown below:

The tests in {{ExecutorClassLoaderSuite}} should build the URI correctly, and 
the {{BufferedSource}} returned by {{Source.fromInputStream}} should be closed 
after it is opened. Currently, these issues lead to test failures on Windows.


{code}
ExecutorClassLoaderSuite:
[info] - child first *** FAILED *** (78 milliseconds)
[info]   java.net.URISyntaxException: Illegal character in authority at index 
7: 
file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b
[info]   at java.net.URI$Parser.fail(URI.java:2848)
[info]   at java.net.URI$Parser.parseAuthority(URI.java:3186)
...
[info] - parent first *** FAILED *** (15 milliseconds)
[info]   java.net.URISyntaxException: Illegal character in authority at index 
7: 
file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b
[info]   at java.net.URI$Parser.fail(URI.java:2848)
[info]   at java.net.URI$Parser.parseAuthority(URI.java:3186)
...
[info] - child first can fall back *** FAILED *** (0 milliseconds)
[info]   java.net.URISyntaxException: Illegal character in authority at index 
7: 
file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b
[info]   at java.net.URI$Parser.fail(URI.java:2848)
[info]   at java.net.URI$Parser.parseAuthority(URI.java:3186)
...
[info] - child first can fail *** FAILED *** (0 milliseconds)
[info]   java.net.URISyntaxException: Illegal character in authority at index 
7: 
file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b
[info]   at java.net.URI$Parser.fail(URI.java:2848)
[info]   at java.net.URI$Parser.parseAuthority(URI.java:3186)
...
[info] - resource from parent *** FAILED *** (0 milliseconds)
[info]   java.net.URISyntaxException: Illegal character in authority at index 
7: 
file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b
[info]   at java.net.URI$Parser.fail(URI.java:2848)
[info]   at java.net.URI$Parser.parseAuthority(URI.java:3186)
...
[info] - resources from parent *** FAILED *** (0 milliseconds)
[info]   java.net.URISyntaxException: Illegal character in authority at index 
7: 
file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b
[info]   at java.net.URI$Parser.fail(URI.java:2848)
[info]   at java.net.URI$Parser.parseAuthority(URI.java:3186)
{code}


{code}
[info] Exception encountered when attempting to run a suite with class name: 
org.apache.spark.repl.ExecutorClassLoaderSuite *** ABORTED *** (7 seconds, 333 
milliseconds)
[info]   java.io.IOException: Failed to delete: 
C:\projects\spark\target\tmp\spark-77b2f37b-6405-47c4-af1c-4a6a206511f2
[info]   at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:1010)
[info]   at 
org.apache.spark.repl.ExecutorClassLoaderSuite.afterAll(ExecutorClassLoaderSuite.scala:76)
[info]   at 
org.scalatest.BeforeAndAfterAll$class.afterAll(BeforeAndAfterAll.scala:213)
...
{code}
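
A minimal sketch of the kind of fix described above, assuming the tests 
currently build the URI by string concatenation and never close the 
{{BufferedSource}} (the file names and paths here are illustrative):

{code}
import java.io.File
import scala.io.Source

val tempDir = new File(System.getProperty("java.io.tmpdir"), "spark-uri-demo")
tempDir.mkdirs()
val file = new File(tempDir, "hello.txt")
java.nio.file.Files.write(file.toPath, "hello".getBytes("UTF-8"))

// file.toURI yields a properly escaped URI (file:/C:/... on Windows), unlike
// naive "file://" + path concatenation, which the URI parser rejects.
val url = file.toURI.toURL

// Close the BufferedSource returned by Source.fromInputStream so the underlying
// stream (and file handle) is released; otherwise deleting the temp directory
// fails on Windows.
val source = Source.fromInputStream(url.openStream())
try {
  println(source.mkString)
} finally {
  source.close()
}
{code}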







[jira] [Commented] (SPARK-18165) Kinesis support in Structured Streaming

2016-12-01 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15714327#comment-15714327
 ] 

Takeshi Yamamuro commented on SPARK-18165:
--

Thanks for the reference! I'd like to discuss the Kinesis integration for 
structured streaming in the future, after the component becomes stable in 2.1 
(or 2.2?). So, this is my prototype to check the feasibility of implementing 
the Kinesis integration on top of the current structured streaming APIs.

> Kinesis support in Structured Streaming
> ---
>
> Key: SPARK-18165
> URL: https://issues.apache.org/jira/browse/SPARK-18165
> Project: Spark
>  Issue Type: New Feature
>  Components: DStreams
>Reporter: Lauren Moos
>
> Implement Kinesis based sources and sinks for Structured Streaming






[jira] [Closed] (SPARK-17909) we should create table before writing out the data in CTAS

2016-12-01 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan closed SPARK-17909.
---
Resolution: Invalid

> we should create table before writing out the data in CTAS
> --
>
> Key: SPARK-17909
> URL: https://issues.apache.org/jira/browse/SPARK-17909
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>







[jira] [Commented] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB

2016-12-01 Thread Aral Can Kaymaz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15714209#comment-15714209
 ] 

Aral Can Kaymaz commented on SPARK-16845:
-

I am currently out of office, and will be back on Monday, 5th of December, 2016 
(05.12.2016). I will have rare access to e-mails during this time, and will 
reply to requests, but expect delays for replies.

Kind regards,
Aral Can Kaymaz



> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
> -
>
> Key: SPARK-16845
> URL: https://issues.apache.org/jira/browse/SPARK-16845
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: hejie
> Attachments: error.txt.zip
>
>
> I have a wide table (400 columns); when I try fitting the training data on all 
> columns, the following fatal error occurs:
>   ... 46 more
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
>   at org.codehaus.janino.CodeContext.write(CodeContext.java:854)






[jira] [Updated] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB

2016-12-01 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-16845:

Component/s: (was: Java API)

> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
> -
>
> Key: SPARK-16845
> URL: https://issues.apache.org/jira/browse/SPARK-16845
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: hejie
> Attachments: error.txt.zip
>
>
> I have a wide table (400 columns); when I try fitting the training data on all 
> columns, the following fatal error occurs:
>   ... 46 more
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
>   at org.codehaus.janino.CodeContext.write(CodeContext.java:854)
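
A hedged sketch of the kind of workload that can hit this limit in affected 
versions (the column names and sizes are made up, and {{spark}} is an assumed 
SparkSession): ordering on every column of a very wide DataFrame makes the 
generated SpecificOrdering compare hundreds of columns in a single method, 
which can exceed the JVM's 64 KB method-size limit.

{code}
import org.apache.spark.sql.functions.col

// Build a DataFrame with ~400 columns and sort on all of them.
val base = spark.range(100).toDF("c0")
val wide = (1 to 400).foldLeft(base) { (df, i) => df.withColumn(s"c$i", col("c0") + i) }
wide.orderBy(wide.columns.map(col): _*).count()
{code}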






[jira] [Updated] (SPARK-18661) Creating a partitioned datasource table should not scan all files for table

2016-12-01 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18661:

Issue Type: Sub-task  (was: Bug)
Parent: SPARK-17861

> Creating a partitioned datasource table should not scan all files for table
> ---
>
> Key: SPARK-18661
> URL: https://issues.apache.org/jira/browse/SPARK-18661
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Priority: Blocker
>
> Even though in 2.1 creating a partitioned datasource table will not populate 
> the partition data by default (until the user issues MSCK REPAIR TABLE), it 
> seems we still scan the filesystem for no good reason.
> We should avoid doing this when the user specifies a schema.
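
For illustration, a sketch of the case in question (the table name and path are 
hypothetical, using the usual {{CREATE TABLE ... USING}} syntax): when the 
schema and partition columns are given explicitly, nothing at CREATE time 
requires listing the files under the location.

{code}
// Should not trigger a recursive listing of /data/events at CREATE time,
// since the schema and partition columns are fully specified.
spark.sql("""
  CREATE TABLE events (id BIGINT, ts TIMESTAMP, day STRING)
  USING parquet
  OPTIONS (path '/data/events')
  PARTITIONED BY (day)
""")

// Partitions are only discovered when the user explicitly asks for them.
spark.sql("MSCK REPAIR TABLE events")
{code}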






[jira] [Updated] (SPARK-18679) Regression in file listing performance

2016-12-01 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18679:

Issue Type: Sub-task  (was: Bug)
Parent: SPARK-17861

> Regression in file listing performance
> --
>
> Key: SPARK-18679
> URL: https://issues.apache.org/jira/browse/SPARK-18679
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Priority: Blocker
>
> In Spark 2.1 ListingFileCatalog was significantly refactored (and renamed to 
> InMemoryFileIndex).
> It seems there is a performance regression here: we no longer perform the 
> listing in parallel for non-root directories. This forces file listing to be 
> completely serial when resolving datasource tables that are not backed by an 
> external catalog.
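
A rough sketch of the idea (this is not the actual InMemoryFileIndex code, and 
{{spark}} is an assumed SparkSession): distribute the per-directory listing as 
a Spark job so that non-root directories are listed in parallel instead of 
serially on the driver.

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Returns (file path, length) pairs; the FileStatus objects stay executor-side
// since they are not conveniently serializable here.
def listFilesInParallel(dirs: Seq[String]): Seq[(String, Long)] = {
  spark.sparkContext
    .parallelize(dirs, math.min(math.max(dirs.size, 1), 100))
    .flatMap { dir =>
      val path = new Path(dir)
      // Re-create the Hadoop configuration on the executor side.
      val fs = path.getFileSystem(new Configuration())
      fs.listStatus(path).map(s => (s.getPath.toString, s.getLen))
    }
    .collect()
    .toSeq
}
{code}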






[jira] [Updated] (SPARK-18659) Incorrect behaviors in overwrite table for datasource tables

2016-12-01 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18659:

Issue Type: Sub-task  (was: Bug)
Parent: SPARK-17861

> Incorrect behaviors in overwrite table for datasource tables
> 
>
> Key: SPARK-18659
> URL: https://issues.apache.org/jira/browse/SPARK-18659
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Priority: Blocker
>
> The first three test cases fail due to a crash in hive client when dropping 
> partitions that don't contain files. The last one deletes too many files due 
> to a partition case resolution failure.
> {code}
>   test("foo") {
> withTable("test") {
>   spark.range(10)
> .selectExpr("id", "id as A", "'x' as B")
> .write.partitionBy("A", "B").mode("overwrite")
> .saveAsTable("test")
>   spark.sql("insert overwrite table test select id, id, 'x' from 
> range(1)")
>   assert(spark.sql("select * from test").count() == 1)
> }
>   }
>   test("bar") {
> withTable("test") {
>   spark.range(10)
> .selectExpr("id", "id as A", "'x' as B")
> .write.partitionBy("A", "B").mode("overwrite")
> .saveAsTable("test")
>   spark.sql("insert overwrite table test partition (a, b) select id, id, 
> 'x' from range(1)")
>   assert(spark.sql("select * from test").count() == 1)
> }
>   }
>   test("baz") {
> withTable("test") {
>   spark.range(10)
> .selectExpr("id", "id as A", "'x' as B")
> .write.partitionBy("A", "B").mode("overwrite")
> .saveAsTable("test")
>   spark.sql("insert overwrite table test partition (A, B) select id, id, 
> 'x' from range(1)")
>   assert(spark.sql("select * from test").count() == 1)
> }
>   }
>   test("qux") {
> withTable("test") {
>   spark.range(10)
> .selectExpr("id", "id as A", "'x' as B")
> .write.partitionBy("A", "B").mode("overwrite")
> .saveAsTable("test")
>   spark.sql("insert overwrite table test partition (a=1, b) select id, 
> 'x' from range(1)")
>   assert(spark.sql("select * from test").count() == 10)
> }
>   }
> {code}






[jira] [Assigned] (SPARK-18667) input_file_name function does not work with UDF

2016-12-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18667:


Assignee: Apache Spark

> input_file_name function does not work with UDF
> ---
>
> Key: SPARK-18667
> URL: https://issues.apache.org/jira/browse/SPARK-18667
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>
> {{input_file_name()}} returns an empty string instead of the file name when it 
> is used as input for a UDF in PySpark, as shown below.
> With the data below:
> {code}
> {"a": 1}
> {code}
> and the code below:
> {code}
> from pyspark.sql.functions import *
> from pyspark.sql.types import *
> def filename(path):
> return path
> sourceFile = udf(filename, StringType())
> spark.read.json("tmp.json").select(sourceFile(input_file_name())).show()
> {code}
> prints as below:
> {code}
> +---+
> |filename(input_file_name())|
> +---+
> |   |
> +---+
> {code}
> but the code below:
> {code}
> spark.read.json("tmp.json").select(input_file_name()).show()
> {code}
> prints correctly as below:
> {code}
> ++
> |   input_file_name()|
> ++
> |file:///Users/hyu...|
> ++
> {code}
> This seems to be a PySpark-specific issue.






[jira] [Assigned] (SPARK-18667) input_file_name function does not work with UDF

2016-12-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18667:


Assignee: (was: Apache Spark)

> input_file_name function does not work with UDF
> ---
>
> Key: SPARK-18667
> URL: https://issues.apache.org/jira/browse/SPARK-18667
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: Hyukjin Kwon
>
> {{input_file_name()}} returns an empty string instead of the file name when it 
> is used as input for a UDF in PySpark, as shown below.
> With the data below:
> {code}
> {"a": 1}
> {code}
> and the code below:
> {code}
> from pyspark.sql.functions import *
> from pyspark.sql.types import *
> def filename(path):
> return path
> sourceFile = udf(filename, StringType())
> spark.read.json("tmp.json").select(sourceFile(input_file_name())).show()
> {code}
> prints as below:
> {code}
> +---+
> |filename(input_file_name())|
> +---+
> |   |
> +---+
> {code}
> but the code below:
> {code}
> spark.read.json("tmp.json").select(input_file_name()).show()
> {code}
> prints correctly as below:
> {code}
> ++
> |   input_file_name()|
> ++
> |file:///Users/hyu...|
> ++
> {code}
> This seems to be a PySpark-specific issue.






[jira] [Commented] (SPARK-18667) input_file_name function does not work with UDF

2016-12-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15714200#comment-15714200
 ] 

Apache Spark commented on SPARK-18667:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/16115

> input_file_name function does not work with UDF
> ---
>
> Key: SPARK-18667
> URL: https://issues.apache.org/jira/browse/SPARK-18667
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: Hyukjin Kwon
>
> {{input_file_name()}} returns an empty string instead of the file name when it 
> is used as input for a UDF in PySpark, as shown below.
> With the data below:
> {code}
> {"a": 1}
> {code}
> and the code below:
> {code}
> from pyspark.sql.functions import *
> from pyspark.sql.types import *
> def filename(path):
> return path
> sourceFile = udf(filename, StringType())
> spark.read.json("tmp.json").select(sourceFile(input_file_name())).show()
> {code}
> prints as below:
> {code}
> +---+
> |filename(input_file_name())|
> +---+
> |   |
> +---+
> {code}
> but the code below:
> {code}
> spark.read.json("tmp.json").select(input_file_name()).show()
> {code}
> prints correctly as below:
> {code}
> ++
> |   input_file_name()|
> ++
> |file:///Users/hyu...|
> ++
> {code}
> This seems to be a PySpark-specific issue.






[jira] [Resolved] (SPARK-18640) Fix minor synchronization issue in TaskSchedulerImpl.runningTasksByExecutors

2016-12-01 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-18640.
-
      Resolution: Fixed
   Fix Version/s: 2.0.3, 2.1.0
Target Version/s:   (was: 1.6.4, 2.0.3, 2.1.0, 2.2.0)

> Fix minor synchronization issue in TaskSchedulerImpl.runningTasksByExecutors
> 
>
> Key: SPARK-18640
> URL: https://issues.apache.org/jira/browse/SPARK-18640
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Minor
> Fix For: 2.0.3, 2.1.0
>
>
> The method TaskSchedulerImpl.runningTasksByExecutors() accesses the mutable 
> executorIdToRunningTaskIds map without proper synchronization. We should fix 
> this.
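
A minimal sketch of the kind of fix implied here (simplified; the real scheduler 
holds more state and the class name below is made up): take an immutable 
snapshot of the mutable map while holding the same lock that guards the writes, 
instead of exposing the map directly.

{code}
import scala.collection.mutable

class TaskSchedulerLike {
  // Guarded by this object's lock, as in TaskSchedulerImpl.
  private val executorIdToRunningTaskIds = new mutable.HashMap[String, mutable.HashSet[Long]]

  def runningTasksByExecutors(): Map[String, Int] = synchronized {
    // Copy to an immutable snapshot before returning, so callers never observe
    // the map while it is being mutated.
    executorIdToRunningTaskIds.toMap.mapValues(_.size)
  }
}
{code}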






[jira] [Commented] (SPARK-18640) Fix minor synchronization issue in TaskSchedulerImpl.runningTasksByExecutors

2016-12-01 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15714189#comment-15714189
 ] 

Reynold Xin commented on SPARK-18640:
-

[~andrewor14] how come you didn't close the ticket?


> Fix minor synchronization issue in TaskSchedulerImpl.runningTasksByExecutors
> 
>
> Key: SPARK-18640
> URL: https://issues.apache.org/jira/browse/SPARK-18640
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Minor
> Fix For: 2.0.3, 2.1.0
>
>
> The method TaskSchedulerImpl.runningTasksByExecutors() accesses the mutable 
> executorIdToRunningTaskIds map without proper synchronization. We should fix 
> this.






[jira] [Resolved] (SPARK-17213) Parquet String Pushdown for Non-Eq Comparisons Broken

2016-12-01 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-17213.
-
   Resolution: Fixed
Fix Version/s: 2.1.0

> Parquet String Pushdown for Non-Eq Comparisons Broken
> -
>
> Key: SPARK-17213
> URL: https://issues.apache.org/jira/browse/SPARK-17213
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.1.0
>Reporter: Andrew Duffy
>Assignee: Cheng Lian
> Fix For: 2.1.0
>
>
> Spark defines ordering over strings based on comparison of UTF8 byte arrays, 
> which compares bytes as unsigned integers. However, Parquet does not currently 
> respect this ordering. This is in the process of being fixed in Parquet (JIRA 
> and PR links below), but for now all pushed-down filters over strings are 
> broken, and there is an actual correctness issue for {{>}} and {{<}}.
> *Repro:*
> Querying directly from in-memory DataFrame:
> {code}
> > Seq("a", "é").toDF("name").where("name > 'a'").count
> 1
> {code}
> Querying from a parquet dataset:
> {code}
> > Seq("a", "é").toDF("name").write.parquet("/tmp/bad")
> > spark.read.parquet("/tmp/bad").where("name > 'a'").count
> 0
> {code}
> This happens because Spark sorts the rows to be {{[a, é]}}, but Parquet's 
> implementation of comparison of strings is based on signed byte array 
> comparison, so it will actually create 1 row group with statistics 
> {{min=é,max=a}}, and so the row group will be dropped by the query.
> Based on the way Parquet pushes down Eq, it will not affect correctness, but 
> it will force you to read row groups you should be able to skip.
> Link to PARQUET issue: https://issues.apache.org/jira/browse/PARQUET-686
> Link to PR: https://github.com/apache/parquet-mr/pull/362
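
The signed vs. unsigned distinction is the crux: in UTF-8, {{é}} encodes to the 
bytes {{0xC3 0xA9}}, which are negative as signed Java bytes but larger than 
{{'a'}} (0x61) when compared as unsigned values. A small standalone sketch:

{code}
import java.nio.charset.StandardCharsets

val a = "a".getBytes(StandardCharsets.UTF_8)   // Array(0x61)
val e = "é".getBytes(StandardCharsets.UTF_8)   // Array(0xC3, 0xA9)

// Signed byte comparison: 0xC3 is -61, so "é" appears to sort before "a".
println(e(0) < a(0))                    // true

// Unsigned comparison (Spark's UTF8String ordering): 0xC3 is 195, so "é" sorts
// after "a".
println((e(0) & 0xFF) < (a(0) & 0xFF))  // false
{code}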






[jira] [Resolved] (SPARK-18658) Writing to a text DataSource buffers one or more lines in memory

2016-12-01 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-18658.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

> Writing to a text DataSource buffers one or more lines in memory
> 
>
> Key: SPARK-18658
> URL: https://issues.apache.org/jira/browse/SPARK-18658
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Nathan Howell
>Assignee: Nathan Howell
>Priority: Minor
> Fix For: 2.2.0
>
>
> The JSON and CSV writing paths buffer entire lines (or multiple lines) in 
> memory prior to writing to disk. For large rows this is inefficient. It may 
> make sense to skip the {{TextOutputFormat}} record writer and go directly to 
> the underlying {{FSDataOutputStream}}, allowing the writers to append 
> arbitrary byte arrays (fractions of a row) instead of a full row.
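
A hedged sketch of the direction described above (the path and the row fragments 
are illustrative, and this is not the actual writer code): append partial-row 
byte chunks straight to an {{FSDataOutputStream}} instead of materializing each 
full line first.

{code}
import java.nio.charset.StandardCharsets
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

val path = new Path("/tmp/out/part-00000.json")   // hypothetical output file
val fs = path.getFileSystem(new Configuration())
val out = fs.create(path)
try {
  // Each fragment of a (potentially very large) row is appended independently,
  // so no full line ever needs to be buffered in memory.
  Seq("{\"id\":", "1", ",\"payload\":\"...\"}", "\n").foreach { fragment =>
    out.write(fragment.getBytes(StandardCharsets.UTF_8))
  }
} finally {
  out.close()
}
{code}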






[jira] [Resolved] (SPARK-18663) Simplify CountMinSketch aggregate implementation

2016-12-01 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-18663.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

> Simplify CountMinSketch aggregate implementation
> 
>
> Key: SPARK-18663
> URL: https://issues.apache.org/jira/browse/SPARK-18663
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.2.0
>
>
> SPARK-18429 introduced count-min sketch aggregate function for SQL, but the 
> implementation and testing is more complicated than needed. This simplifies 
> the test cases and removes support for data types that don't have clear 
> equality semantics.






[jira] [Assigned] (SPARK-18620) Spark Streaming + Kinesis : Receiver MaxRate is violated

2016-12-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18620:


Assignee: Apache Spark

> Spark Streaming + Kinesis : Receiver MaxRate is violated
> 
>
> Key: SPARK-18620
> URL: https://issues.apache.org/jira/browse/SPARK-18620
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.2
>Reporter: david przybill
>Assignee: Apache Spark
>Priority: Minor
>  Labels: kinesis
> Attachments: Apply_limit in_spark_with_my_patch.png, Apply_limit 
> in_vanilla_spark.png, Apply_no_limit.png
>
>
> I am calling spark-submit passing maxRate; I have a single Kinesis receiver 
> and batches of 1s:
> spark-submit --conf spark.streaming.receiver.maxRate=10
> However, a single batch can greatly exceed the established maxRate, i.e. I am 
> getting 300 records.
> It looks like Kinesis is completely ignoring the 
> spark.streaming.receiver.maxRate configuration.
> If you look inside KinesisReceiver.onStart, you see:
> val kinesisClientLibConfiguration =
>   new KinesisClientLibConfiguration(checkpointAppName, streamName, 
> awsCredProvider, workerId)
>   .withKinesisEndpoint(endpointUrl)
>   .withInitialPositionInStream(initialPositionInStream)
>   .withTaskBackoffTimeMillis(500)
>   .withRegionName(regionName)
> This constructor ends up calling another constructor which has a lot of 
> default values for the configuration. One of those values is 
> DEFAULT_MAX_RECORDS which is constantly set to 10,000 records.
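
A hedged sketch of the kind of change being suggested, assuming the KCL builder 
exposes {{withMaxRecords}} (which caps records per GetRecords call) and reusing 
the variable names from the snippet above; the mapping from the Spark conf is 
illustrative only:

{code}
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisClientLibConfiguration

// Cap the per-fetch record count with the configured receiver rate instead of
// silently falling back to the KCL default of 10,000 records.
val maxRate = sparkConf.getInt("spark.streaming.receiver.maxRate", 0)   // records/sec
val maxRecordsPerFetch =
  if (maxRate > 0) maxRate else KinesisClientLibConfiguration.DEFAULT_MAX_RECORDS

val kinesisClientLibConfiguration =
  new KinesisClientLibConfiguration(checkpointAppName, streamName, awsCredProvider, workerId)
    .withKinesisEndpoint(endpointUrl)
    .withInitialPositionInStream(initialPositionInStream)
    .withTaskBackoffTimeMillis(500)
    .withRegionName(regionName)
    .withMaxRecords(maxRecordsPerFetch)
{code}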






[jira] [Assigned] (SPARK-18620) Spark Streaming + Kinesis : Receiver MaxRate is violated

2016-12-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18620:


Assignee: (was: Apache Spark)

> Spark Streaming + Kinesis : Receiver MaxRate is violated
> 
>
> Key: SPARK-18620
> URL: https://issues.apache.org/jira/browse/SPARK-18620
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.2
>Reporter: david przybill
>Priority: Minor
>  Labels: kinesis
> Attachments: Apply_limit in_spark_with_my_patch.png, Apply_limit 
> in_vanilla_spark.png, Apply_no_limit.png
>
>
> I am calling spark-submit passing maxRate; I have a single Kinesis receiver 
> and batches of 1s:
> spark-submit --conf spark.streaming.receiver.maxRate=10
> However, a single batch can greatly exceed the established maxRate, i.e. I am 
> getting 300 records.
> It looks like Kinesis is completely ignoring the 
> spark.streaming.receiver.maxRate configuration.
> If you look inside KinesisReceiver.onStart, you see:
> val kinesisClientLibConfiguration =
>   new KinesisClientLibConfiguration(checkpointAppName, streamName, 
> awsCredProvider, workerId)
>   .withKinesisEndpoint(endpointUrl)
>   .withInitialPositionInStream(initialPositionInStream)
>   .withTaskBackoffTimeMillis(500)
>   .withRegionName(regionName)
> This constructor ends up calling another constructor which has a lot of 
> default values for the configuration. One of those values is 
> DEFAULT_MAX_RECORDS which is constantly set to 10,000 records.






[jira] [Commented] (SPARK-18620) Spark Streaming + Kinesis : Receiver MaxRate is violated

2016-12-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15714116#comment-15714116
 ] 

Apache Spark commented on SPARK-18620:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/16114

> Spark Streaming + Kinesis : Receiver MaxRate is violated
> 
>
> Key: SPARK-18620
> URL: https://issues.apache.org/jira/browse/SPARK-18620
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.2
>Reporter: david przybill
>Priority: Minor
>  Labels: kinesis
> Attachments: Apply_limit in_spark_with_my_patch.png, Apply_limit 
> in_vanilla_spark.png, Apply_no_limit.png
>
>
> I am calling spark-submit passing maxRate; I have a single Kinesis receiver 
> and batches of 1s:
> spark-submit --conf spark.streaming.receiver.maxRate=10
> However, a single batch can greatly exceed the established maxRate, i.e. I am 
> getting 300 records.
> It looks like Kinesis is completely ignoring the 
> spark.streaming.receiver.maxRate configuration.
> If you look inside KinesisReceiver.onStart, you see:
> val kinesisClientLibConfiguration =
>   new KinesisClientLibConfiguration(checkpointAppName, streamName, 
> awsCredProvider, workerId)
>   .withKinesisEndpoint(endpointUrl)
>   .withInitialPositionInStream(initialPositionInStream)
>   .withTaskBackoffTimeMillis(500)
>   .withRegionName(regionName)
> This constructor ends up calling another constructor which has a lot of 
> default values for the configuration. One of those values is 
> DEFAULT_MAX_RECORDS which is constantly set to 10,000 records.






[jira] [Resolved] (SPARK-18647) do not put provider in table properties for Hive serde table

2016-12-01 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-18647.
-
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 16080
[https://github.com/apache/spark/pull/16080]

> do not put provider in table properties for Hive serde table
> 
>
> Key: SPARK-18647
> URL: https://issues.apache.org/jira/browse/SPARK-18647
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.1.0
>
>







[jira] [Updated] (SPARK-18284) Schema of DataFrame generated from RDD is different between master and 2.0

2016-12-01 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-18284:

Assignee: Kazuaki Ishizaki

> Schema of DataFrame generated from RDD is different between master and 2.0
> -
>
> Key: SPARK-18284
> URL: https://issues.apache.org/jira/browse/SPARK-18284
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
> Fix For: 2.1.0
>
>
> When the following program is executed, the schema of the DataFrame is 
> different among master, branch-2.0, and branch-2.1. The nullable flag should 
> be false.
> {code:java}
> val df = sparkContext.parallelize(1 to 8, 1).toDF()
> df.printSchema
> df.filter("value > 4").count
> === master ===
> root
>  |-- value: integer (nullable = true)
> === branch 2.1 ===
> root
>  |-- value: integer (nullable = true)
> === branch 2.0 ===
> root
>  |-- value: integer (nullable = false)
> {code}






[jira] [Resolved] (SPARK-18284) Schema of DataFrame generated from RDD is different between master and 2.0

2016-12-01 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-18284.
-
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 15780
[https://github.com/apache/spark/pull/15780]

> Schema of DataFrame generated from RDD is different between master and 2.0
> -
>
> Key: SPARK-18284
> URL: https://issues.apache.org/jira/browse/SPARK-18284
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Kazuaki Ishizaki
> Fix For: 2.1.0
>
>
> When the following program is executed, the schema of the DataFrame is 
> different among master, branch-2.0, and branch-2.1. The nullable flag should 
> be false.
> {code:java}
> val df = sparkContext.parallelize(1 to 8, 1).toDF()
> df.printSchema
> df.filter("value > 4").count
> === master ===
> root
>  |-- value: integer (nullable = true)
> === branch 2.1 ===
> root
>  |-- value: integer (nullable = true)
> === branch 2.0 ===
> root
>  |-- value: integer (nullable = false)
> {code}






[jira] [Comment Edited] (SPARK-12216) Spark failed to delete temp directory

2016-12-01 Thread Brian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15713978#comment-15713978
 ] 

Brian edited comment on SPARK-12216 at 12/2/16 4:24 AM:


Theory or not about what caused it, it's a bug in Spark. Other programs and 
libraries I run on Windows do not have this problem... Just because you don't 
know how to fix a bug doesn't mean it doesn't exist; I really don't understand 
that logic.


was (Author: brian44):
Theory or no for what caused it, it's a bug in spark.  Other programs and 
libraries I run on windows do not have this problem... Just because you don't 
knwo how to fix a bug doesn't mean it doesn't exist, I really don't understand 
that logic.

> Spark failed to delete temp directory 
> --
>
> Key: SPARK-12216
> URL: https://issues.apache.org/jira/browse/SPARK-12216
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
> Environment: windows 7 64 bit
> Spark 1.52
> Java 1.8.0.65
> PATH includes:
> C:\Users\Stefan\spark-1.5.2-bin-hadoop2.6\bin
> C:\ProgramData\Oracle\Java\javapath
> C:\Users\Stefan\scala\bin
> SYSTEM variables set are:
> JAVA_HOME=C:\Program Files\Java\jre1.8.0_65
> HADOOP_HOME=C:\Users\Stefan\hadoop-2.6.0\bin
> (where the bin\winutils resides)
> both \tmp and \tmp\hive have permissions
> drwxrwxrwx as detected by winutils ls
>Reporter: stefan
>Priority: Minor
>
> The mailing list archives have no obvious solution to this:
> scala> :q
> Stopping spark context.
> 15/12/08 16:24:22 ERROR ShutdownHookManager: Exception while deleting Spark 
> temp dir: 
> C:\Users\Stefan\AppData\Local\Temp\spark-18f2a418-e02f-458b-8325-60642868fdff
> java.io.IOException: Failed to delete: 
> C:\Users\Stefan\AppData\Local\Temp\spark-18f2a418-e02f-458b-8325-60642868fdff
> at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:884)
> at 
> org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:63)
> at 
> org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:60)
> at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
> at 
> org.apache.spark.util.ShutdownHookManager$$anonfun$1.apply$mcV$sp(ShutdownHookManager.scala:60)
> at 
> org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:264)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:234)
> at scala.util.Try$.apply(Try.scala:161)
> at 
> org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:216)
> at 
> org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)






[jira] [Commented] (SPARK-12216) Spark failed to delete temp directory

2016-12-01 Thread Brian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15713978#comment-15713978
 ] 

Brian commented on SPARK-12216:
---

Theory or not about what caused it, it's a bug in Spark. Other programs and 
libraries I run on Windows do not have this problem... Just because you don't 
know how to fix a bug doesn't mean it doesn't exist; I really don't understand 
that logic.

> Spark failed to delete temp directory 
> --
>
> Key: SPARK-12216
> URL: https://issues.apache.org/jira/browse/SPARK-12216
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
> Environment: windows 7 64 bit
> Spark 1.52
> Java 1.8.0.65
> PATH includes:
> C:\Users\Stefan\spark-1.5.2-bin-hadoop2.6\bin
> C:\ProgramData\Oracle\Java\javapath
> C:\Users\Stefan\scala\bin
> SYSTEM variables set are:
> JAVA_HOME=C:\Program Files\Java\jre1.8.0_65
> HADOOP_HOME=C:\Users\Stefan\hadoop-2.6.0\bin
> (where the bin\winutils resides)
> both \tmp and \tmp\hive have permissions
> drwxrwxrwx as detected by winutils ls
>Reporter: stefan
>Priority: Minor
>
> The mailing list archives have no obvious solution to this:
> scala> :q
> Stopping spark context.
> 15/12/08 16:24:22 ERROR ShutdownHookManager: Exception while deleting Spark 
> temp dir: 
> C:\Users\Stefan\AppData\Local\Temp\spark-18f2a418-e02f-458b-8325-60642868fdff
> java.io.IOException: Failed to delete: 
> C:\Users\Stefan\AppData\Local\Temp\spark-18f2a418-e02f-458b-8325-60642868fdff
> at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:884)
> at 
> org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:63)
> at 
> org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:60)
> at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
> at 
> org.apache.spark.util.ShutdownHookManager$$anonfun$1.apply$mcV$sp(ShutdownHookManager.scala:60)
> at 
> org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:264)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:234)
> at scala.util.Try$.apply(Try.scala:161)
> at 
> org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:216)
> at 
> org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)






[jira] [Commented] (SPARK-12216) Spark failed to delete temp directory

2016-12-01 Thread Brian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15713974#comment-15713974
 ] 

Brian commented on SPARK-12216:
---

Why is this closed / marked as resolved? It is not resolved at all - this is a 
valid issue. This happens to me every time I run a job locally (local mode) or 
use spark-shell on my machine. I use this a lot, since part of my development 
process before scaling up to cluster runs is to test and develop thoroughly 
locally. Saying "it won't happen if you use Linux" is not a solution, as Spark 
is intended to work on Windows as well.

As others have said, it's not a permission issue, as it happens when running in 
administrator mode.

This is not ideal, as the generated temporary files do not get deleted and sit 
there taking up disk space. Additionally, it clogs up my logs with exceptions 
all the time, which is annoying if I am checking for exceptions and have to 
ignore these.

How can we re-open it? If we can't, I will submit new issues.



> Spark failed to delete temp directory 
> --
>
> Key: SPARK-12216
> URL: https://issues.apache.org/jira/browse/SPARK-12216
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
> Environment: windows 7 64 bit
> Spark 1.52
> Java 1.8.0.65
> PATH includes:
> C:\Users\Stefan\spark-1.5.2-bin-hadoop2.6\bin
> C:\ProgramData\Oracle\Java\javapath
> C:\Users\Stefan\scala\bin
> SYSTEM variables set are:
> JAVA_HOME=C:\Program Files\Java\jre1.8.0_65
> HADOOP_HOME=C:\Users\Stefan\hadoop-2.6.0\bin
> (where the bin\winutils resides)
> both \tmp and \tmp\hive have permissions
> drwxrwxrwx as detected by winutils ls
>Reporter: stefan
>Priority: Minor
>
> The mailing list archives have no obvious solution to this:
> scala> :q
> Stopping spark context.
> 15/12/08 16:24:22 ERROR ShutdownHookManager: Exception while deleting Spark 
> temp dir: 
> C:\Users\Stefan\AppData\Local\Temp\spark-18f2a418-e02f-458b-8325-60642868fdff
> java.io.IOException: Failed to delete: 
> C:\Users\Stefan\AppData\Local\Temp\spark-18f2a418-e02f-458b-8325-60642868fdff
> at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:884)
> at 
> org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:63)
> at 
> org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:60)
> at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
> at 
> org.apache.spark.util.ShutdownHookManager$$anonfun$1.apply$mcV$sp(ShutdownHookManager.scala:60)
> at 
> org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:264)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:234)
> at scala.util.Try$.apply(Try.scala:161)
> at 
> org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:216)
> at 
> org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
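
For readers hitting this on Windows, a minimal Scala sketch of a retrying recursive delete of the kind the shutdown hook is attempting; the directory name below is hypothetical and this is not Spark's actual implementation:

{code}
import java.io.File

// Minimal sketch, assuming the temp directory is known: delete recursively and
// retry once, since on Windows a file handle may still be open when the
// shutdown hook runs. Not the Spark code path shown in the trace above.
def deleteRecursively(f: File): Boolean = {
  if (f.isDirectory) {
    Option(f.listFiles()).getOrElse(Array.empty[File]).foreach(deleteRecursively)
  }
  f.delete() || { Thread.sleep(500); f.delete() }
}

deleteRecursively(new File(sys.props("java.io.tmpdir"), "spark-temp-example"))
{code}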



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18668) Do not auto-generate query name

2016-12-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18668:


Assignee: Tathagata Das  (was: Apache Spark)

> Do not auto-generate query name
> ---
>
> Key: SPARK-18668
> URL: https://issues.apache.org/jira/browse/SPARK-18668
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Critical
>
> With SPARK-18657 making the StreamingQuery.id persistent and truly unique, it 
> does not make sense to use an auto-generated name. Rather, the name should be 
> a purely optional pretty identifier set by the user, or remain null.
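
For reference, a minimal sketch of setting the name explicitly on a streaming query; everything except {{queryName}} (the socket source, host, port, and sink) is a placeholder, and the proposal above is simply that omitting this call leaves the name null rather than auto-generated:

{code}
// Sketch: `spark` is the active SparkSession (e.g. in spark-shell); the socket
// source and console sink are placeholders — queryName() is the point here.
val query = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", "9999")
  .load()
  .writeStream
  .queryName("myQuery")      // purely optional, user-chosen pretty identifier
  .format("console")
  .outputMode("append")
  .start()
{code}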



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18657) Persist UUID across query restart

2016-12-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18657:


Assignee: (was: Apache Spark)

> Persist UUID across query restart
> -
>
> Key: SPARK-18657
> URL: https://issues.apache.org/jira/browse/SPARK-18657
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Reporter: Michael Armbrust
>Priority: Critical
>
> We probably also want to add an instance Id or something that changes when 
> the query restarts



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18668) Do not auto-generate query name

2016-12-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18668:


Assignee: Apache Spark  (was: Tathagata Das)

> Do not auto-generate query name
> ---
>
> Key: SPARK-18668
> URL: https://issues.apache.org/jira/browse/SPARK-18668
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Reporter: Tathagata Das
>Assignee: Apache Spark
>Priority: Critical
>
> With SPARK-18657 making the StreamingQuery.id persistent and truly unique, it 
> does not make sense to use an auto-generated name. Rather, the name should be 
> a purely optional pretty identifier set by the user, or remain null.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18657) Persist UUID across query restart

2016-12-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18657:


Assignee: Apache Spark

> Persist UUID across query restart
> -
>
> Key: SPARK-18657
> URL: https://issues.apache.org/jira/browse/SPARK-18657
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Reporter: Michael Armbrust
>Assignee: Apache Spark
>Priority: Critical
>
> We probably also want to add an instance Id or something that changes when 
> the query restarts



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18668) Do not auto-generate query name

2016-12-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15713943#comment-15713943
 ] 

Apache Spark commented on SPARK-18668:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/16113

> Do not auto-generate query name
> ---
>
> Key: SPARK-18668
> URL: https://issues.apache.org/jira/browse/SPARK-18668
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Critical
>
> With SPARK-18657 making the StreamingQuery.id persistent and truly unique, it 
> does not make sense to use an auto-generated name. Rather, the name should be 
> a purely optional pretty identifier set by the user, or remain null.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18657) Persist UUID across query restart

2016-12-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15713942#comment-15713942
 ] 

Apache Spark commented on SPARK-18657:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/16113

> Persist UUID across query restart
> -
>
> Key: SPARK-18657
> URL: https://issues.apache.org/jira/browse/SPARK-18657
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Reporter: Michael Armbrust
>Priority: Critical
>
> We probably also want to add an instance Id or something that changes when 
> the query restarts



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18620) Spark Streaming + Kinesis : Receiver MaxRate is violated

2016-12-01 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15713894#comment-15713894
 ] 

Takeshi Yamamuro commented on SPARK-18620:
--

yea, I'll make a pr in a day

> Spark Streaming + Kinesis : Receiver MaxRate is violated
> 
>
> Key: SPARK-18620
> URL: https://issues.apache.org/jira/browse/SPARK-18620
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.2
>Reporter: david przybill
>Priority: Minor
>  Labels: kinesis
> Attachments: Apply_limit in_spark_with_my_patch.png, Apply_limit 
> in_vanilla_spark.png, Apply_no_limit.png
>
>
> I am calling spark-submit passing maxRate, with a single Kinesis receiver and 
> batches of 1s:
> spark-submit --conf spark.streaming.receiver.maxRate=10
> However, a single batch can greatly exceed the established maxRate; for 
> example, I'm getting 300 records.
> It looks like Kinesis is completely ignoring the 
> spark.streaming.receiver.maxRate configuration.
> If you look inside KinesisReceiver.onStart, you see:
> val kinesisClientLibConfiguration =
>   new KinesisClientLibConfiguration(checkpointAppName, streamName, 
> awsCredProvider, workerId)
>   .withKinesisEndpoint(endpointUrl)
>   .withInitialPositionInStream(initialPositionInStream)
>   .withTaskBackoffTimeMillis(500)
>   .withRegionName(regionName)
> This constructor ends up calling another constructor that fills in many 
> default configuration values. One of those is DEFAULT_MAX_RECORDS, a constant 
> set to 10,000 records.
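
A sketch of the kind of change being suggested, reusing the identifiers from the snippet above and assuming the receiver can read its SparkConf via SparkEnv; the actual patch (see the attached screenshots) may differ:

{code}
// Sketch only: cap the number of records returned by each GetRecords call so
// the KCL does not fall back to DEFAULT_MAX_RECORDS (10,000).
val maxRecords = SparkEnv.get.conf.getInt("spark.streaming.receiver.maxRate", 10000)

val kinesisClientLibConfiguration =
  new KinesisClientLibConfiguration(checkpointAppName, streamName,
      awsCredProvider, workerId)
    .withKinesisEndpoint(endpointUrl)
    .withInitialPositionInStream(initialPositionInStream)
    .withTaskBackoffTimeMillis(500)
    .withMaxRecords(maxRecords)
    .withRegionName(regionName)
{code}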



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18620) Spark Streaming + Kinesis : Receiver MaxRate is violated

2016-12-01 Thread david przybill (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15713889#comment-15713889
 ] 

david przybill commented on SPARK-18620:


Looks good to me.
Thanks for the prompt answer

> Spark Streaming + Kinesis : Receiver MaxRate is violated
> 
>
> Key: SPARK-18620
> URL: https://issues.apache.org/jira/browse/SPARK-18620
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.2
>Reporter: david przybill
>Priority: Minor
>  Labels: kinesis
> Attachments: Apply_limit in_spark_with_my_patch.png, Apply_limit 
> in_vanilla_spark.png, Apply_no_limit.png
>
>
> I am calling spark-submit passing maxRate, with a single Kinesis receiver and 
> batches of 1s:
> spark-submit --conf spark.streaming.receiver.maxRate=10
> However, a single batch can greatly exceed the established maxRate; for 
> example, I'm getting 300 records.
> It looks like Kinesis is completely ignoring the 
> spark.streaming.receiver.maxRate configuration.
> If you look inside KinesisReceiver.onStart, you see:
> val kinesisClientLibConfiguration =
>   new KinesisClientLibConfiguration(checkpointAppName, streamName, 
> awsCredProvider, workerId)
>   .withKinesisEndpoint(endpointUrl)
>   .withInitialPositionInStream(initialPositionInStream)
>   .withTaskBackoffTimeMillis(500)
>   .withRegionName(regionName)
> This constructor ends up calling another constructor that fills in many 
> default configuration values. One of those is DEFAULT_MAX_RECORDS, a constant 
> set to 10,000 records.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-18680) Throw Filtering is supported only on partition keys of type string exception

2016-12-01 Thread Yuming Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang closed SPARK-18680.
---
Resolution: Duplicate

> Throw Filtering is supported only on partition keys of type string exception
> 
>
> Key: SPARK-18680
> URL: https://issues.apache.org/jira/browse/SPARK-18680
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Yuming Wang
>
> I'm using MySQL for the Hive Metastore, with 
> {{hive.metastore.try.direct.sql=true}} and 
> {{hive.metastore.integral.jdo.pushdown=false}}. It throws the following 
> exception when filtering on an *int* partition column after 
> [SPARK-17992|https://issues.apache.org/jira/browse/SPARK-17992].
> {noformat}
> spark-sql> CREATE TABLE test (value INT) PARTITIONED BY (part INT);
> Time taken: 0.221 seconds
> spark-sql> select * from test where part=1 limit 10;
> 16/12/02 08:33:45 ERROR thriftserver.SparkSQLDriver: Failed in [select * from 
> test where part=1 limit 10]
> java.lang.RuntimeException: Caught Hive MetaException attempting to get 
> partition metadata by filter from Hive. You can set the Spark configuration 
> setting spark.sql.hive.manageFilesourcePartitions to false to work around 
> this problem, however this will result in degraded performance. Please report 
> a bug: https://issues.apache.org/jira/browse/SPARK
>   at 
> org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:610)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:549)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:547)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:282)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:229)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:228)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:271)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:547)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:954)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:938)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:91)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:938)
>   at 
> org.apache.spark.sql.hive.MetastoreRelation.getHiveQlPartitions(MetastoreRelation.scala:156)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$10.apply(HiveTableScanExec.scala:151)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$10.apply(HiveTableScanExec.scala:150)
>   at org.apache.spark.util.Utils$.withDummyCallSite(Utils.scala:2435)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec.doExecute(HiveTableScanExec.scala:149)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
>   at 
> org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:225)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:308)
>   at 
> org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:295)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$hiveResultString$4.apply(QueryExecution.scala:134)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$hiveResultString$4.apply(QueryExecution.scala:133)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
>   at 
> org.apache.spark.sql.execution.QueryExecution.hiveResultString(QueryExecution.scala:133)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:63)
>   at 
> 
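
Until the underlying filter pushdown is fixed, the workaround named in the error text can be applied as below (a sketch; as the message warns, disabling it degrades performance, and the setting may need to be passed at launch via {{--conf}} rather than changed mid-session):

{code}
// Sketch of the workaround suggested by the error message: stop Spark from
// pushing the int-typed partition filter down to the Hive metastore.
spark.conf.set("spark.sql.hive.manageFilesourcePartitions", "false")

spark.sql("SELECT * FROM test WHERE part = 1 LIMIT 10").show()
{code}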

[jira] [Resolved] (SPARK-18538) Concurrent Fetching DataFrameReader JDBC APIs Do Not Work

2016-12-01 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-18538.
-
   Resolution: Fixed
Fix Version/s: 2.1.0

> Concurrent Fetching DataFrameReader JDBC APIs Do Not Work
> -
>
> Key: SPARK-18538
> URL: https://issues.apache.org/jira/browse/SPARK-18538
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Blocker
> Fix For: 2.1.0
>
>
> {code}
>   def jdbc(
>   url: String,
>   table: String,
>   columnName: String,
>   lowerBound: Long,
>   upperBound: Long,
>   numPartitions: Int,
>   connectionProperties: Properties): DataFrame
> {code}
> {code}
>   def jdbc(
>   url: String,
>   table: String,
>   predicates: Array[String],
>   connectionProperties: Properties): DataFrame
> {code}
> The above two DataFrameReader JDBC APIs ignore the user-specified 
> degree-of-parallelism parameters.
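
For context, a minimal sketch of how the first (partition-column) variant is typically called; the connection details and table are placeholders, and the report is that the resulting read does not actually honour {{numPartitions}}:

{code}
import java.util.Properties

// Hypothetical connection details; the parallelism arguments are the point.
val props = new Properties()
props.setProperty("user", "dbuser")
props.setProperty("password", "secret")

val df = spark.read.jdbc(
  url = "jdbc:postgresql://dbhost:5432/testdb",
  table = "orders",
  columnName = "order_id",   // numeric column used to split the table
  lowerBound = 1L,
  upperBound = 1000000L,
  numPartitions = 8,         // expected: 8 concurrent fetches
  connectionProperties = props)

// With the bug present, this may report a single partition instead of 8.
println(df.rdd.getNumPartitions)
{code}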



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18141) jdbc datasource read fails when quoted columns (eg:mixed case, reserved words) in source table are used in the filter.

2016-12-01 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-18141:

Assignee: Suresh Thalamati

> jdbc datasource read fails when  quoted  columns (eg:mixed case, reserved 
> words) in source table are used  in the filter.
> -
>
> Key: SPARK-18141
> URL: https://issues.apache.org/jira/browse/SPARK-18141
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Suresh Thalamati
>Assignee: Suresh Thalamati
> Fix For: 2.1.0
>
>
> create table t1("Name" text, "Id" integer)
> insert into t1 values('Mike', 1)
> val df = sqlContext.read.jdbc(jdbcUrl, "t1", new Properties)

> df.filter("Id = 1").show()

> df.filter("`Id` = 1").show()
> Error :
> Cause: org.postgresql.util.PSQLException: ERROR: column "id" does not exist
>   Position: 35
>   at 
> org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2182)
>   at 
> org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1911)
>   at 
> org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:173)
>   at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:622)
>   at org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:472)
>   at org.postgresql.jdbc.PgStatement.executeQuery(PgStatement.java:386)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:295)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> I am working on a fix for this issue and will submit a PR soon.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18681) Throw Filtering is supported only on partition keys of type string exception

2016-12-01 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15713871#comment-15713871
 ] 

Liang-Chi Hsieh commented on SPARK-18681:
-

Looks like you created two JIRAs (SPARK-18680, SPARK-18681) for the same issue. 
Mind closing one of them?

> Throw Filtering is supported only on partition keys of type string exception
> 
>
> Key: SPARK-18681
> URL: https://issues.apache.org/jira/browse/SPARK-18681
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Yuming Wang
>
> I'm using MySQL for the Hive Metastore, with 
> {{hive.metastore.try.direct.sql=true}} and 
> {{hive.metastore.integral.jdo.pushdown=false}}. It throws the following 
> exception when filtering on an *int* partition column after 
> [SPARK-17992|https://issues.apache.org/jira/browse/SPARK-17992].
> {noformat}
> spark-sql> CREATE TABLE test (value INT) PARTITIONED BY (part INT);
> Time taken: 0.221 seconds
> spark-sql> select * from test where part=1 limit 10;
> 16/12/02 08:33:45 ERROR thriftserver.SparkSQLDriver: Failed in [select * from 
> test where part=1 limit 10]
> java.lang.RuntimeException: Caught Hive MetaException attempting to get 
> partition metadata by filter from Hive. You can set the Spark configuration 
> setting spark.sql.hive.manageFilesourcePartitions to false to work around 
> this problem, however this will result in degraded performance. Please report 
> a bug: https://issues.apache.org/jira/browse/SPARK
>   at 
> org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:610)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:549)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:547)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:282)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:229)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:228)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:271)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:547)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:954)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:938)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:91)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:938)
>   at 
> org.apache.spark.sql.hive.MetastoreRelation.getHiveQlPartitions(MetastoreRelation.scala:156)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$10.apply(HiveTableScanExec.scala:151)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$10.apply(HiveTableScanExec.scala:150)
>   at org.apache.spark.util.Utils$.withDummyCallSite(Utils.scala:2435)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec.doExecute(HiveTableScanExec.scala:149)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
>   at 
> org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:225)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:308)
>   at 
> org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:295)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$hiveResultString$4.apply(QueryExecution.scala:134)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$hiveResultString$4.apply(QueryExecution.scala:133)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
>   at 
> org.apache.spark.sql.execution.QueryExecution.hiveResultString(QueryExecution.scala:133)
>   at 
> 

[jira] [Resolved] (SPARK-18141) jdbc datasource read fails when quoted columns (eg:mixed case, reserved words) in source table are used in the filter.

2016-12-01 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-18141.
-
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 15662
[https://github.com/apache/spark/pull/15662]

> jdbc datasource read fails when  quoted  columns (eg:mixed case, reserved 
> words) in source table are used  in the filter.
> -
>
> Key: SPARK-18141
> URL: https://issues.apache.org/jira/browse/SPARK-18141
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Suresh Thalamati
> Fix For: 2.1.0
>
>
> create table t1("Name" text, "Id" integer)
> insert into t1 values('Mike', 1)
> val df = sqlContext.read.jdbc(jdbcUrl, "t1", new Properties)

> df.filter("Id = 1").show()

> df.filter("`Id` = 1").show()
> Error :
> Cause: org.postgresql.util.PSQLException: ERROR: column "id" does not exist
>   Position: 35
>   at 
> org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2182)
>   at 
> org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1911)
>   at 
> org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:173)
>   at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:622)
>   at org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:472)
>   at org.postgresql.jdbc.PgStatement.executeQuery(PgStatement.java:386)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:295)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> I am working on a fix for this issue and will submit a PR soon.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18684) Spark Executors off-heap memory usage keeps increasing while running spark streaming

2016-12-01 Thread Krishna Gandra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15713855#comment-15713855
 ] 

Krishna Gandra edited comment on SPARK-18684 at 12/2/16 3:02 AM:
-

Executor off-heap size keeps increasing and eventually YARN kills the executor.
Using Spark 2.0.2, reading from a Kinesis stream and writing as Parquet on HDFS.


was (Author: krishnagkr):
Executor off-heap size is keep increasing and eventually yarn killing the 
executor.
Using spark 2.0.2 , reading it from kinesis stream and wring as parquet on hdfs.

> Spark Executors off-heap memory usage keeps increasing while running spark 
> streaming
> 
>
> Key: SPARK-18684
> URL: https://issues.apache.org/jira/browse/SPARK-18684
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.6.2, 2.0.2
>Reporter: Krishna Gandra
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18684) Spark Executors off-heap memory usage keeps increasing while running spark streaming

2016-12-01 Thread Krishna Gandra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15713855#comment-15713855
 ] 

Krishna Gandra commented on SPARK-18684:


Executor off-heap size keeps increasing and eventually YARN kills the executor.
Using Spark 2.0.2, reading from a Kinesis stream and writing as Parquet on HDFS.

> Spark Executors off-heap memory usage keeps increasing while running spark 
> streaming
> 
>
> Key: SPARK-18684
> URL: https://issues.apache.org/jira/browse/SPARK-18684
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.6.2, 2.0.2
>Reporter: Krishna Gandra
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18684) Spark Executors off-heap memory usage keeps increasing while running spark streaming

2016-12-01 Thread Krishna Gandra (JIRA)
Krishna Gandra created SPARK-18684:
--

 Summary: Spark Executors off-heap memory usage keeps increasing 
while running spark streaming
 Key: SPARK-18684
 URL: https://issues.apache.org/jira/browse/SPARK-18684
 Project: Spark
  Issue Type: Bug
  Components: Spark Submit
Affects Versions: 2.0.2, 1.6.2
Reporter: Krishna Gandra






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18665) Spark ThriftServer jobs where are canceled are still “STARTED”

2016-12-01 Thread cen yuhai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

cen yuhai updated SPARK-18665:
--
Description: 
I find that some jobs are canceled, but their state is still "STARTED". I think 
this bug was introduced by SPARK-6964


I find some logs:
{code}

16/12/01 11:43:34 ERROR SparkExecuteStatementOperation: Error running hive 
query: 
org.apache.hive.service.cli.HiveSQLException: Illegal Operation state 
transition from CLOSED to ERROR
at 
org.apache.hive.service.cli.OperationState.validateTransition(OperationState.java:91)
at 
org.apache.hive.service.cli.OperationState.validateTransition(OperationState.java:97)
at 
org.apache.hive.service.cli.operation.Operation.setState(Operation.java:126)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:259)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:166)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1708)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:176)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}

{code}
org.apache.hive.service.cli.HiveSQLException: Illegal Operation state 
transition from CANCELED to ERROR
at 
org.apache.hive.service.cli.OperationState.validateTransition(OperationState.java:91)
at 
org.apache.hive.service.cli.OperationState.validateTransition(OperationState.java:97)
at 
org.apache.hive.service.cli.operation.Operation.setState(Operation.java:126)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:259)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:166)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1708)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:176)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
{code}

  was:
I find that some jobs are canceled, but their state is still "STARTED". I think 
this bug was introduced by SPARK-6964


I find some logs:
{code}

16/12/01 11:43:34 ERROR SparkExecuteStatementOperation: Error running hive 
query: 
org.apache.hive.service.cli.HiveSQLException: Illegal Operation state 
transition from CLOSED to ERROR
at 
org.apache.hive.service.cli.OperationState.validateTransition(OperationState.java:91)
at 
org.apache.hive.service.cli.OperationState.validateTransition(OperationState.java:97)
at 
org.apache.hive.service.cli.operation.Operation.setState(Operation.java:126)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:259)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:166)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 

[jira] [Updated] (SPARK-18665) Spark ThriftServer jobs where are canceled are still “STARTED”

2016-12-01 Thread cen yuhai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

cen yuhai updated SPARK-18665:
--
Description: 
I find that some jobs are canceled, but their state is still "STARTED". I think 
this bug was introduced by SPARK-6964


I find some logs:

16/12/01 11:43:34 ERROR SparkExecuteStatementOperation: Error running hive 
query: 
org.apache.hive.service.cli.HiveSQLException: Illegal Operation state 
transition from CLOSED to ERROR
at 
org.apache.hive.service.cli.OperationState.validateTransition(OperationState.java:91)
at 
org.apache.hive.service.cli.OperationState.validateTransition(OperationState.java:97)
at 
org.apache.hive.service.cli.operation.Operation.setState(Operation.java:126)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:259)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:166)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1708)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:176)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

org.apache.hive.service.cli.HiveSQLException: Illegal Operation state 
transition from CANCELED to ERROR
at 
org.apache.hive.service.cli.OperationState.validateTransition(OperationState.java:91)
at 
org.apache.hive.service.cli.OperationState.validateTransition(OperationState.java:97)
at 
org.apache.hive.service.cli.operation.Operation.setState(Operation.java:126)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:259)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:166)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1708)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:176)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

  was:I find that some jobs are canceled, but their state is still "STARTED". I 
think this bug was introduced by SPARK-6964


> Spark ThriftServer jobs where are canceled are still “STARTED”
> --
>
> Key: SPARK-18665
> URL: https://issues.apache.org/jira/browse/SPARK-18665
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3
>Reporter: cen yuhai
> Attachments: 1179ACF7-3E62-44C5-B01D-CA71C876ECCE.png, 
> 83C5E8AD-59DE-4A85-A483-2BE3FB83F378.png
>
>
> I find that some jobs are canceled, but their state is still "STARTED". I 
> think this bug was introduced by SPARK-6964
> I find some logs:
> 16/12/01 11:43:34 ERROR SparkExecuteStatementOperation: Error running hive 
> query: 
> org.apache.hive.service.cli.HiveSQLException: Illegal Operation state 
> transition from CLOSED to ERROR
>   at 
> org.apache.hive.service.cli.OperationState.validateTransition(OperationState.java:91)
>   at 
> org.apache.hive.service.cli.OperationState.validateTransition(OperationState.java:97)
>   at 
> org.apache.hive.service.cli.operation.Operation.setState(Operation.java:126)
>   at 
> 

[jira] [Updated] (SPARK-18665) Spark ThriftServer jobs where are canceled are still “STARTED”

2016-12-01 Thread cen yuhai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

cen yuhai updated SPARK-18665:
--
Description: 
I find that some jobs are canceled, but their state is still "STARTED". I think 
this bug was introduced by SPARK-6964


I find some logs:
{code}

16/12/01 11:43:34 ERROR SparkExecuteStatementOperation: Error running hive 
query: 
org.apache.hive.service.cli.HiveSQLException: Illegal Operation state 
transition from CLOSED to ERROR
at 
org.apache.hive.service.cli.OperationState.validateTransition(OperationState.java:91)
at 
org.apache.hive.service.cli.OperationState.validateTransition(OperationState.java:97)
at 
org.apache.hive.service.cli.operation.Operation.setState(Operation.java:126)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:259)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:166)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1708)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:176)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

org.apache.hive.service.cli.HiveSQLException: Illegal Operation state 
transition from CANCELED to ERROR
at 
org.apache.hive.service.cli.OperationState.validateTransition(OperationState.java:91)
at 
org.apache.hive.service.cli.OperationState.validateTransition(OperationState.java:97)
at 
org.apache.hive.service.cli.operation.Operation.setState(Operation.java:126)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:259)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:166)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1708)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:176)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
{code}

  was:
I find that some jobs are canceled, but their state is still "STARTED". I think 
this bug was introduced by SPARK-6964


I find some logs:

16/12/01 11:43:34 ERROR SparkExecuteStatementOperation: Error running hive 
query: 
org.apache.hive.service.cli.HiveSQLException: Illegal Operation state 
transition from CLOSED to ERROR
at 
org.apache.hive.service.cli.OperationState.validateTransition(OperationState.java:91)
at 
org.apache.hive.service.cli.OperationState.validateTransition(OperationState.java:97)
at 
org.apache.hive.service.cli.operation.Operation.setState(Operation.java:126)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:259)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:166)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1708)

[jira] [Assigned] (SPARK-18679) Regression in file listing performance

2016-12-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18679:


Assignee: Apache Spark

> Regression in file listing performance
> --
>
> Key: SPARK-18679
> URL: https://issues.apache.org/jira/browse/SPARK-18679
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Assignee: Apache Spark
>Priority: Blocker
>
> In Spark 2.1 ListingFileCatalog was significantly refactored (and renamed to 
> InMemoryFileIndex).
> It seems there is a performance regression here where we no longer perform 
> listing in parallel for non-root directories. This forces file listing to be 
> completely serial when resolving datasource tables that are not backed by an 
> external catalog.
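
As a rough illustration of the missing behaviour, a driver-side sketch of listing many non-root directories concurrently; the thread count and helper name are arbitrary, and the real fix belongs in InMemoryFileIndex:

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, Path}
import scala.collection.parallel.ForkJoinTaskSupport
import scala.concurrent.forkjoin.ForkJoinPool

// Sketch: list partition directories in parallel instead of one at a time.
def listLeafFilesInParallel(dirs: Seq[Path], conf: Configuration): Seq[FileStatus] = {
  val parDirs = dirs.par
  parDirs.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(8))
  parDirs.flatMap { dir =>
    val fs = dir.getFileSystem(conf)
    fs.listStatus(dir).toSeq
  }.seq
}
{code}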



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18679) Regression in file listing performance

2016-12-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18679:


Assignee: (was: Apache Spark)

> Regression in file listing performance
> --
>
> Key: SPARK-18679
> URL: https://issues.apache.org/jira/browse/SPARK-18679
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Priority: Blocker
>
> In Spark 2.1 ListingFileCatalog was significantly refactored (and renamed to 
> InMemoryFileIndex).
> It seems there is a performance regression here where we no longer perform 
> listing in parallel for non-root directories. This forces file listing to be 
> completely serial when resolving datasource tables that are not backed by an 
> external catalog.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18679) Regression in file listing performance

2016-12-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15713826#comment-15713826
 ] 

Apache Spark commented on SPARK-18679:
--

User 'ericl' has created a pull request for this issue:
https://github.com/apache/spark/pull/16112

> Regression in file listing performance
> --
>
> Key: SPARK-18679
> URL: https://issues.apache.org/jira/browse/SPARK-18679
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Priority: Blocker
>
> In Spark 2.1 ListingFileCatalog was significantly refactored (and renamed to 
> InMemoryFileIndex).
> It seems there is a performance regression here where we no longer perform 
> listing in parallel for non-root directories. This forces file listing to be 
> completely serial when resolving datasource tables that are not backed by an 
> external catalog.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13287) Standalone REST API throttling?

2016-12-01 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15713742#comment-15713742
 ] 

Shixiong Zhu commented on SPARK-13287:
--

Created SPARK-18683. But I don't have time to work on it now. I can help review 
PRs if someone picks it up.

> Standalone REST API throttling?
> ---
>
> Key: SPARK-13287
> URL: https://issues.apache.org/jira/browse/SPARK-13287
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.6.0
>Reporter: Rares Vernica
>Priority: Minor
>
> I am using the REST API provided by Spark Standalone mode to check on jobs. 
> It turns out that if I don't pause between requests the server will redirect 
> me to the server homepage instead of offering the requested information.
> Here is a simple test to prove this:
> {code:JavaScript}
> $ curl --silent 
> http://localhost:8080/api/v1/applications/app-20160211003526-0037/jobs | head 
> -2 ; curl --silent 
> http://localhost:8080/api/v1/applications/app-20160211003526-0037/jobs | head 
> -2
> [ {
>   "jobId" : 0,
> 
>   
> {code}
> I am requesting the same information about one application twice using 
> {{curl}}. I print the first two lines from each response. The requests are 
> made immediately one after another. The first two lines are from the first 
> request, the last two lines are from the second request. Again, the request 
> URLs are identical. The response from the second request is identical with 
> the homepage you get from http://localhost:8080/
> If I insert a {{sleep 1}} between the two {{curl}} commands, both work fine. 
> For smaller time outs, like {{sleep .8}} it does not work correctly.
> I am not sure if this is intentional or a bug.
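
A minimal Scala sketch of the same two-request probe with the pause that makes it behave (URL and application id are the ones from the report above):

{code}
import scala.io.Source

// Poll the endpoint twice, pausing between requests; without the pause the
// second response comes back as the homepage, as described above.
val url = "http://localhost:8080/api/v1/applications/app-20160211003526-0037/jobs"
for (_ <- 1 to 2) {
  val src = Source.fromURL(url)
  try println(src.getLines().take(2).mkString("\n")) finally src.close()
  Thread.sleep(1000)
}
{code}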



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18683) REST APIs for standalone Master and Workers

2016-12-01 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-18683:


 Summary: REST APIs for standalone Master and Workers
 Key: SPARK-18683
 URL: https://issues.apache.org/jira/browse/SPARK-18683
 Project: Spark
  Issue Type: Improvement
Reporter: Shixiong Zhu


It would be great to have some REST APIs to access Master and Worker 
information. Right now the only way to get it is through the Web UI.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18639) Build only a single pip package

2016-12-01 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-18639.
-
   Resolution: Fixed
Fix Version/s: 2.1.0

> Build only a single pip package
> ---
>
> Key: SPARK-18639
> URL: https://issues.apache.org/jira/browse/SPARK-18639
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.1.0
>
>
> We currently build 5 separate pip binary tarballs, doubling the release script 
> runtime. It'd be better to build one, especially for use cases that just use 
> Spark locally.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13287) Standalone REST API throttling?

2016-12-01 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15713735#comment-15713735
 ] 

Shixiong Zhu commented on SPARK-13287:
--

Right now there is no REST API for the Master. You are using the REST APIs for 
the driver to access the Master HTTP server...

> Standalone REST API throttling?
> ---
>
> Key: SPARK-13287
> URL: https://issues.apache.org/jira/browse/SPARK-13287
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.6.0
>Reporter: Rares Vernica
>Priority: Minor
>
> I am using the REST API provided by Spark Standalone mode to check on jobs. 
> It turns out that if I don't pause between requests the server will redirect 
> me to the server homepage instead of offering the requested information.
> Here is a simple test to prove this:
> {code:JavaScript}
> $ curl --silent 
> http://localhost:8080/api/v1/applications/app-20160211003526-0037/jobs | head 
> -2 ; curl --silent 
> http://localhost:8080/api/v1/applications/app-20160211003526-0037/jobs | head 
> -2
> [ {
>   "jobId" : 0,
> 
>   
> {code}
> I am requesting the same information about one application twice using 
> {{curl}}. I print the first two lines from each response. The requests are 
> made immediately one after another. The first two lines are from the first 
> request, the last two lines are from the second request. Again, the request 
> URLs are identical. The response from the second request is identical with 
> the homepage you get from http://localhost:8080/
> If I insert a {{sleep 1}} between the two {{curl}} commands, both work fine. 
> For smaller time outs, like {{sleep .8}} it does not work correctly.
> I am not sure if this is intentional or a bug.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13287) Standalone REST API throttling?

2016-12-01 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-13287.
--
Resolution: Not A Bug

> Standalone REST API throttling?
> ---
>
> Key: SPARK-13287
> URL: https://issues.apache.org/jira/browse/SPARK-13287
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.6.0
>Reporter: Rares Vernica
>Priority: Minor
>
> I am using the REST API provided by Spark Standalone mode to check on jobs. 
> It turns out that if I don't pause between requests the server will redirect 
> me to the server homepage instead of offering the requested information.
> Here is a simple test to prove this:
> {code:JavaScript}
> $ curl --silent 
> http://localhost:8080/api/v1/applications/app-20160211003526-0037/jobs | head 
> -2 ; curl --silent 
> http://localhost:8080/api/v1/applications/app-20160211003526-0037/jobs | head 
> -2
> [ {
>   "jobId" : 0,
> 
>   
> {code}
> I am requesting the same information about one application twice using 
> {{curl}}. I print the first two lines from each response. The requests are 
> made immediately one after another. The first two lines are from the first 
> request, the last two lines are from the second request. Again, the request 
> URLs are identical. The response from the second request is identical with 
> the homepage you get from http://localhost:8080/
> If I insert a {{sleep 1}} between the two {{curl}} commands, both work fine. 
> For smaller time outs, like {{sleep .8}} it does not work correctly.
> I am not sure if this is intentional or a bug.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18323) Update MLlib, GraphX websites for 2.1

2016-12-01 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15713738#comment-15713738
 ] 

Joseph K. Bradley commented on SPARK-18323:
---

Recommendations:
* Update "Calling MLlib in Python" example to use DataFrame-based API
* Better organization for "Algorithms" list.  Divide into 3 sections: ML 
algorithms (basic ML algs), ML workflows (featurization, Pipelines, tuning), ML 
utilities (persistence, linalg, stats)

> Update MLlib, GraphX websites for 2.1
> -
>
> Key: SPARK-18323
> URL: https://issues.apache.org/jira/browse/SPARK-18323
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> Update the sub-projects' websites to include new features in this release.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18234) Update mode in structured streaming

2016-12-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-18234:
-
Target Version/s: 2.2.0

> Update mode in structured streaming
> ---
>
> Key: SPARK-18234
> URL: https://issues.apache.org/jira/browse/SPARK-18234
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Reporter: Michael Armbrust
>Priority: Critical
>
> We have this internally, but we should nail down the semantics and expose it to 
> users.  The idea of update mode is that any tuple that changes will be 
> emitted.  Open questions:
>  - do we need to reason about the {{keys}} for a given stream?  For things 
> like the {{foreach}} sink it's up to the user (see the sketch below).  However, 
> for more end-to-end use cases such as a JDBC sink, we need to know which row 
> downstream is being updated.
>  - okay to not support files?
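
A sketch of what the {{foreach}} case could look like from the user side if update mode is exposed as an output mode; the source, aggregation, and writer bodies are placeholders, and {{outputMode("update")}} is the proposal under discussion, not an existing guarantee:

{code}
import org.apache.spark.sql.{ForeachWriter, Row}

// Sketch: `spark` is the active SparkSession; the socket source is a placeholder.
val counts = spark.readStream
  .format("socket").option("host", "localhost").option("port", "9999")
  .load()
  .groupBy("value")
  .count()

val query = counts.writeStream
  .outputMode("update")                 // proposed: emit only changed tuples
  .foreach(new ForeachWriter[Row] {
    def open(partitionId: Long, version: Long): Boolean = true
    def process(row: Row): Unit = println(row)   // user decides how to upsert
    def close(errorOrNull: Throwable): Unit = ()
  })
  .start()
{code}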



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18682) Batch Source for Kafka

2016-12-01 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-18682:


 Summary: Batch Source for Kafka
 Key: SPARK-18682
 URL: https://issues.apache.org/jira/browse/SPARK-18682
 Project: Spark
  Issue Type: New Feature
  Components: SQL, Structured Streaming
Reporter: Michael Armbrust


Today, you can start a stream that reads from Kafka.  However, given Kafka's 
configurable retention period, it seems like sometimes you might just want to 
read all of the data that is available now.  As such, we should add a version 
that works with {{spark.read}} as well.

The options should be the same as the streaming Kafka source, with the 
following differences:
 - {{startingOffsets}} should default to earliest, and should not allow 
{{latest}} (which would always be empty).
 - {{endingOffsets}} should also be allowed and should default to {{latest}}. 
The same assign JSON format as {{startingOffsets}} should also be accepted.

It would be really good if things like {{.limit\(n\)}} were enough to prevent 
all the data from being read (this might just work).
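
For illustration only, here is a rough sketch of how such a batch read might look, assuming the option names mirror the streaming Kafka source; the broker address and topic below are made-up placeholders:

{code}
// Illustrative sketch of the proposed batch read; option names are assumed to
// mirror the streaming Kafka source, and the broker/topic values are placeholders.
val df = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "events")
  .option("startingOffsets", "earliest")   // proposed default
  .option("endingOffsets", "latest")       // proposed default; assign-style JSON also accepted
  .load()

// Key and value would arrive as binary, as in the streaming source.
val events = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
{code}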



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17822) JVMObjectTracker.objMap may leak JVM objects

2016-12-01 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-17822:
--
Target Version/s: 2.0.3, 2.1.1, 2.2.0  (was: 2.0.3, 2.1.0)

> JVMObjectTracker.objMap may leak JVM objects
> 
>
> Key: SPARK-17822
> URL: https://issues.apache.org/jira/browse/SPARK-17822
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Yin Huai
> Attachments: screenshot-1.png
>
>
> JVMObjectTracker.objMap is used to track JVM objects for SparkR. However, we 
> observed that JVM objects that are not used anymore are still trapped in this 
> map, which prevents those objects from being GCed. 
> It seems to make sense to use weak references (like persistentRdds in 
> SparkContext). 
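
As a rough, hypothetical sketch of the weak-reference idea (not the actual JVMObjectTracker code), the tracker could hold values through scala.ref.WeakReference and prune cleared entries periodically:

{code}
// Minimal sketch of the weak-reference idea, not the actual JVMObjectTracker code.
// Values are held through WeakReference so the tracker alone does not keep them
// alive; cleared entries still need to be pruned periodically.
import scala.collection.mutable
import scala.ref.WeakReference

object WeakObjectTracker {
  private val objMap = mutable.HashMap.empty[String, WeakReference[AnyRef]]

  def put(id: String, obj: AnyRef): Unit = synchronized {
    objMap(id) = WeakReference(obj)
  }

  def get(id: String): Option[AnyRef] = synchronized {
    objMap.get(id).flatMap(_.get)   // None once the referent has been GCed
  }

  def prune(): Unit = synchronized {
    val dead = objMap.collect { case (id, ref) if ref.get.isEmpty => id }.toList
    dead.foreach(objMap.remove)
  }
}
{code}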



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17822) JVMObjectTracker.objMap may leak JVM objects

2016-12-01 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15713646#comment-15713646
 ] 

Joseph K. Bradley commented on SPARK-17822:
---

Since 2.1 is underway and this is not a regression, I'll shift the target.

> JVMObjectTracker.objMap may leak JVM objects
> 
>
> Key: SPARK-17822
> URL: https://issues.apache.org/jira/browse/SPARK-17822
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Yin Huai
> Attachments: screenshot-1.png
>
>
> JVMObjectTracker.objMap is used to track JVM objects for SparkR. However, we 
> observed that JVM objects that are not used anymore are still trapped in this 
> map, which prevents those objects from being GCed. 
> It seems to make sense to use weak references (like persistentRdds in 
> SparkContext). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17823) Make JVMObjectTracker.objMap thread-safe

2016-12-01 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-17823:
--
Target Version/s: 2.0.3, 2.1.1, 2.2.0  (was: 2.0.3, 2.1.0)

> Make JVMObjectTracker.objMap thread-safe
> 
>
> Key: SPARK-17823
> URL: https://issues.apache.org/jira/browse/SPARK-17823
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Yin Huai
>
> Since JVMObjectTracker.objMap is a global map, it makes sense to make it 
> thread-safe.
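
One possible direction, sketched hypothetically rather than as the actual fix, is to back the tracker with a java.util.concurrent.ConcurrentHashMap and an AtomicLong id counter so concurrent calls from the R side are safe:

{code}
// Hypothetical sketch only: a thread-safe tracker built on ConcurrentHashMap.
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.atomic.AtomicLong

object ThreadSafeObjectTracker {
  private val objMap = new ConcurrentHashMap[String, AnyRef]()
  private val objCounter = new AtomicLong(0L)

  // Registers an object and returns its handle id.
  def put(obj: AnyRef): String = {
    val id = objCounter.getAndIncrement().toString
    objMap.put(id, obj)
    id
  }

  def get(id: String): Option[AnyRef] = Option(objMap.get(id))

  def remove(id: String): Option[AnyRef] = Option(objMap.remove(id))
}
{code}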



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17823) Make JVMObjectTracker.objMap thread-safe

2016-12-01 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15713644#comment-15713644
 ] 

Joseph K. Bradley commented on SPARK-17823:
---

Since 2.1 is underway and this is not a regression, I'll shift the target.

> Make JVMObjectTracker.objMap thread-safe
> 
>
> Key: SPARK-17823
> URL: https://issues.apache.org/jira/browse/SPARK-17823
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Yin Huai
>
> Since JVMObjectTracker.objMap is a global map, it makes sense to make it 
> thread-safe.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2016-12-01 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15713636#comment-15713636
 ] 

Nicholas Chammas commented on SPARK-13587:
--

[~tsp]:

{quote}
Previously, I have had reasonable success with zipping the contents of my conda 
environment in the gateway/driver node and submitting the zip file as an 
argument to --archives in the spark-submit command line. This approach works 
perfectly because it uses the existing Spark infrastructure to distribute 
dependencies through to the workers. You actually don't even need Anaconda 
installed on the workers since the zip can package the entire Python 
installation within it. The downside is that conda zip files can bloat 
up quickly in a production Spark application.
{quote}

Can you elaborate on how you did this? I'm willing to jump through some hoops 
to create a hackish way of distributing dependencies while this JIRA task gets 
worked out.

What I'm trying is:
# Create a virtual environment and activate it.
# Pip install my requirements into that environment, as one would in a regular 
Python project.
# Zip up the venv/ folder and ship it with my application using {{--py-files}}.

I'm struggling to get the workers to pick up Python dependencies from the 
packaged venv over what's in the system site-packages. All I want is to be able 
to ship out the dependencies with the application from a virtual environment 
all at once (i.e. without having to enumerate each dependency).

Has anyone been able to do this today? It would be good to document it as a 
workaround for people until this issue is resolved.

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Reporter: Jeff Zhang
>
> Currently, it's not easy for users to add third-party Python packages in 
> PySpark.
> * One way is to use --py-files (suitable for simple dependencies, but not 
> for complicated ones, especially with transitive dependencies).
> * Another way is to install packages manually on each node (time-consuming, and 
> not easy to switch to a different environment).
> Python now has 2 different virtualenv implementations: one is native 
> virtualenv, the other is conda. This JIRA is about bringing these 2 
> tools to the distributed environment.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18681) Throw Filtering is supported only on partition keys of type string exception

2016-12-01 Thread Yuming Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15713624#comment-15713624
 ] 

Yuming Wang commented on SPARK-18681:
-

I will pull request for this issue later.

> Throw Filtering is supported only on partition keys of type string exception
> 
>
> Key: SPARK-18681
> URL: https://issues.apache.org/jira/browse/SPARK-18681
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Yuming Wang
>
> I'm using MySQL for the Hive Metastore. 
> {{hive.metastore.try.direct.sql=true}} and 
> {{hive.metastore.integral.jdo.pushdown=false}}. It will throw the following 
> exception when filtering on an *int* partition column after 
> [SPARK-17992|https://issues.apache.org/jira/browse/SPARK-17992].
> {noformat}
> spark-sql> CREATE TABLE test (value INT) PARTITIONED BY (part INT);
> Time taken: 0.221 seconds
> spark-sql> select * from test where part=1 limit 10;
> 16/12/02 08:33:45 ERROR thriftserver.SparkSQLDriver: Failed in [select * from 
> test where part=1 limit 10]
> java.lang.RuntimeException: Caught Hive MetaException attempting to get 
> partition metadata by filter from Hive. You can set the Spark configuration 
> setting spark.sql.hive.manageFilesourcePartitions to false to work around 
> this problem, however this will result in degraded performance. Please report 
> a bug: https://issues.apache.org/jira/browse/SPARK
>   at 
> org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:610)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:549)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:547)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:282)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:229)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:228)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:271)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:547)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:954)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:938)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:91)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:938)
>   at 
> org.apache.spark.sql.hive.MetastoreRelation.getHiveQlPartitions(MetastoreRelation.scala:156)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$10.apply(HiveTableScanExec.scala:151)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$10.apply(HiveTableScanExec.scala:150)
>   at org.apache.spark.util.Utils$.withDummyCallSite(Utils.scala:2435)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec.doExecute(HiveTableScanExec.scala:149)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
>   at 
> org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:225)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:308)
>   at 
> org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:295)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$hiveResultString$4.apply(QueryExecution.scala:134)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$hiveResultString$4.apply(QueryExecution.scala:133)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
>   at 
> org.apache.spark.sql.execution.QueryExecution.hiveResultString(QueryExecution.scala:133)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:63)

[jira] [Created] (SPARK-18680) Throw Filtering is supported only on partition keys of type string exception

2016-12-01 Thread Yuming Wang (JIRA)
Yuming Wang created SPARK-18680:
---

 Summary: Throw Filtering is supported only on partition keys of 
type string exception
 Key: SPARK-18680
 URL: https://issues.apache.org/jira/browse/SPARK-18680
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Yuming Wang


I'm using MySQL for the Hive Metastore. {{hive.metastore.try.direct.sql=true}} 
and {{hive.metastore.integral.jdo.pushdown=false}}. It will throw the following 
exception when filtering on an *int* partition column after 
[SPARK-17992|https://issues.apache.org/jira/browse/SPARK-17992].
{noformat}
spark-sql> CREATE TABLE test (value INT) PARTITIONED BY (part INT);
Time taken: 0.221 seconds
spark-sql> select * from test where part=1 limit 10;
16/12/02 08:33:45 ERROR thriftserver.SparkSQLDriver: Failed in [select * from 
test where part=1 limit 10]
java.lang.RuntimeException: Caught Hive MetaException attempting to get 
partition metadata by filter from Hive. You can set the Spark configuration 
setting spark.sql.hive.manageFilesourcePartitions to false to work around this 
problem, however this will result in degraded performance. Please report a bug: 
https://issues.apache.org/jira/browse/SPARK
at 
org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:610)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:549)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:547)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:282)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:229)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:228)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:271)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:547)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:954)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:938)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:91)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:938)
at 
org.apache.spark.sql.hive.MetastoreRelation.getHiveQlPartitions(MetastoreRelation.scala:156)
at 
org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$10.apply(HiveTableScanExec.scala:151)
at 
org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$10.apply(HiveTableScanExec.scala:150)
at org.apache.spark.util.Utils$.withDummyCallSite(Utils.scala:2435)
at 
org.apache.spark.sql.hive.execution.HiveTableScanExec.doExecute(HiveTableScanExec.scala:149)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
at 
org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:225)
at 
org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:308)
at 
org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
at 
org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:295)
at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$hiveResultString$4.apply(QueryExecution.scala:134)
at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$hiveResultString$4.apply(QueryExecution.scala:133)
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
at 
org.apache.spark.sql.execution.QueryExecution.hiveResultString(QueryExecution.scala:133)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:63)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:335)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:247)
at 

[jira] [Created] (SPARK-18681) Throw Filtering is supported only on partition keys of type string exception

2016-12-01 Thread Yuming Wang (JIRA)
Yuming Wang created SPARK-18681:
---

 Summary: Throw Filtering is supported only on partition keys of 
type string exception
 Key: SPARK-18681
 URL: https://issues.apache.org/jira/browse/SPARK-18681
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Yuming Wang


I'm using MySQL for the Hive Metastore. {{hive.metastore.try.direct.sql=true}} 
and {{hive.metastore.integral.jdo.pushdown=false}}. It will throw the following 
exception when filtering on an *int* partition column after 
[SPARK-17992|https://issues.apache.org/jira/browse/SPARK-17992].
{noformat}
spark-sql> CREATE TABLE test (value INT) PARTITIONED BY (part INT);
Time taken: 0.221 seconds
spark-sql> select * from test where part=1 limit 10;
16/12/02 08:33:45 ERROR thriftserver.SparkSQLDriver: Failed in [select * from 
test where part=1 limit 10]
java.lang.RuntimeException: Caught Hive MetaException attempting to get 
partition metadata by filter from Hive. You can set the Spark configuration 
setting spark.sql.hive.manageFilesourcePartitions to false to work around this 
problem, however this will result in degraded performance. Please report a bug: 
https://issues.apache.org/jira/browse/SPARK
at 
org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:610)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:549)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:547)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:282)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:229)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:228)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:271)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:547)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:954)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:938)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:91)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:938)
at 
org.apache.spark.sql.hive.MetastoreRelation.getHiveQlPartitions(MetastoreRelation.scala:156)
at 
org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$10.apply(HiveTableScanExec.scala:151)
at 
org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$10.apply(HiveTableScanExec.scala:150)
at org.apache.spark.util.Utils$.withDummyCallSite(Utils.scala:2435)
at 
org.apache.spark.sql.hive.execution.HiveTableScanExec.doExecute(HiveTableScanExec.scala:149)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
at 
org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:225)
at 
org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:308)
at 
org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
at 
org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:295)
at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$hiveResultString$4.apply(QueryExecution.scala:134)
at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$hiveResultString$4.apply(QueryExecution.scala:133)
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
at 
org.apache.spark.sql.execution.QueryExecution.hiveResultString(QueryExecution.scala:133)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:63)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:335)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:247)
at 

[jira] [Commented] (SPARK-17822) JVMObjectTracker.objMap may leak JVM objects

2016-12-01 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15713604#comment-15713604
 ] 

Joseph K. Bradley commented on SPARK-17822:
---

I've been able to observe something like this bug by creating a DataFrame in 
SparkR and running SQL queries on it repeatedly.  Java objects from these 
duplicate queries accumulate in JVMObjectTracker.  But those Java objects 
do get GCed periodically.  And calling gc() in R completely cleans them up.

The periodic GC I saw only occurred when I ran R commands, so perhaps it is not 
triggered as frequently as we’d like.  I'm not that familiar with SparkR 
internals, but is there a good way to trigger that GC more frequently?

> JVMObjectTracker.objMap may leak JVM objects
> 
>
> Key: SPARK-17822
> URL: https://issues.apache.org/jira/browse/SPARK-17822
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Yin Huai
> Attachments: screenshot-1.png
>
>
> JVMObjectTracker.objMap is used to track JVM objects for SparkR. However, we 
> observed that JVM objects that are not used anymore are still trapped in this 
> map, which prevents those objects from being GCed. 
> It seems to make sense to use weak references (like persistentRdds in 
> SparkContext). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18506) kafka 0.10 with Spark 2.02 auto.offset.reset=earliest will only read from a single partition on a multi partition topic when kafka-clients 0.10.0.1 is used

2016-12-01 Thread Heji Kim (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Heji Kim resolved SPARK-18506.
--
Resolution: Not A Problem

Just another library incompatibility issue. We just downgraded kafka-clients 
from 0.10.1.0 to 0.10.0.1.

> kafka 0.10 with Spark 2.02 auto.offset.reset=earliest will only read from a 
> single partition on a multi partition topic when kafka-clients 0.10.0.1 is 
> used
> ---
>
> Key: SPARK-18506
> URL: https://issues.apache.org/jira/browse/SPARK-18506
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.2
> Environment: Problem occurs both in Hadoop/YARN 2.7.3 and Spark 
> standalone mode 2.0.2 
> with Kafka 0.10.1.0.   
>Reporter: Heji Kim
>
> Our team is trying to upgrade to Spark 2.0.2/Kafka 
> 0.10.1.0/spark-streaming-kafka-0-10_2.11 (v 2.0.2) and we cannot get our 
> drivers to read all partitions of a single stream when kafka 
> auto.offset.reset=earliest running on a real cluster (separate VM nodes). 
> When we run our drivers with auto.offset.reset=latest ingesting from a single 
> kafka topic with multiple partitions (usually 10 but problem shows up  with 
> only 3 partitions), the driver reads correctly from all partitions.  
> Unfortunately, we need "earliest" for exactly once semantics.
> In the same kafka 0.10.1.0/spark 2.x setup, our legacy driver using 
> spark-streaming-kafka-0-8_2.11 with the prior setting 
> auto.offset.reset=smallest runs correctly.
> We have tried the following configurations in trying to isolate our problem 
> but it is only auto.offset.reset=earliest on a "real multi-machine cluster" 
> which causes this problem.
> 1. Ran with spark standalone cluster(4 Debian nodes, 8vCPU/30GB each)  
> instead of YARN 2.7.3. Single partition read problem persists both cases. 
> Please note this problem occurs on an actual cluster of separate VM nodes 
> (but not when our engineer runs it as a cluster on his own Mac).
> 2. Ran with spark 2.1 nightly build for the last 10 days. Problem persists.
> 3. Turned off checkpointing. Problem persists with or without checkpointing.
> 4. Turned off backpressure. Problem persists with or without backpressure.
> 5. Tried both partition.assignment.strategy RangeAssignor and 
> RoundRobinAssignor. Broken with both.
> 6. Tried both LocationStrategies (PreferConsistent/PreferFixed). Broken with 
> both.
> 7. Tried the simplest scala driver that only logs.  (Our team uses java.) 
> Broken with both.
> 8. Tried increasing GCE capacity for cluster but already we were highly 
> overprovisioned for cores and memory. Also tried ramping up executors and 
> cores.  Since driver works with auto.offset.reset=latest, we have ruled out 
> GCP cloud infrastructure issues.
> When we turn on the debug logs, we sometimes see partitions being set to 
> different offset configuration even though the consumer config correctly 
> indicates auto.offset.reset=earliest. 
> {noformat}
> 8 DEBUG Resetting offset for partition simple_test-8 to earliest offset. 
> (org.apache.kafka.clients.consumer.internals.Fetcher)
> 9 DEBUG Resetting offset for partition simple_test-9 to latest offset. 
> (org.apache.kafka.clients.consumer.internals.Fetcher)
> 8 TRACE Sending ListOffsetRequest 
> {replica_id=-1,topics=[{topic=simple_test,partitions=[{partition=8,timestamp=-2}]}]}
>  to broker 10.102.20.12:9092 (id: 12 rack: null) 
> (org.apache.kafka.clients.consumer.internals.Fetcher)
> 9 TRACE Sending ListOffsetRequest 
> {replica_id=-1,topics=[{topic=simple_test,partitions=[{partition=9,timestamp=-1}]}]}
>  to broker 10.102.20.13:9092 (id: 13 rack: null) 
> (org.apache.kafka.clients.consumer.internals.Fetcher)
> 8 TRACE Received ListOffsetResponse 
> {responses=[{topic=simple_test,partition_responses=[{partition=8,error_code=0,timestamp=-1,offset=0}]}]}
>  from broker 10.102.20.12:9092 (id: 12 rack: null) 
> (org.apache.kafka.clients.consumer.internals.Fetcher)
> 9 TRACE Received ListOffsetResponse 
> {responses=[{topic=simple_test,partition_responses=[{partition=9,error_code=0,timestamp=-1,offset=66724}]}]}
>  from broker 10.102.20.13:9092 (id: 13 rack: null) 
> (org.apache.kafka.clients.consumer.internals.Fetcher)
> 8 DEBUG Fetched {timestamp=-1, offset=0} for partition simple_test-8 
> (org.apache.kafka.clients.consumer.internals.Fetcher)
> 9 DEBUG Fetched {timestamp=-1, offset=66724} for partition simple_test-9 
> (org.apache.kafka.clients.consumer.internals.Fetcher)
> {noformat}
> I've enclosed below the completely stripped down trivial test driver that 
> shows this behavior.  After spending 2 weeks trying all combinations with a 
> really stripped down driver, we think either there might be a bug in 

[jira] [Updated] (SPARK-18506) kafka 0.10 with Spark 2.02 auto.offset.reset=earliest will only read from a single partition on a multi partition topic when kafka-clients 0.10.1.0 is used

2016-12-01 Thread Heji Kim (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Heji Kim updated SPARK-18506:
-
Summary: kafka 0.10 with Spark 2.02 auto.offset.reset=earliest will only 
read from a single partition on a multi partition topic when kafka-clients 
0.10.1.0 is used  (was: kafka 0.10 with Spark 2.02 auto.offset.reset=earliest 
will only read from a single partition on a multi partition topic when 
kafka-clients 0.10.0.1 is used)

> kafka 0.10 with Spark 2.02 auto.offset.reset=earliest will only read from a 
> single partition on a multi partition topic when kafka-clients 0.10.1.0 is 
> used
> ---
>
> Key: SPARK-18506
> URL: https://issues.apache.org/jira/browse/SPARK-18506
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.2
> Environment: Problem occurs both in Hadoop/YARN 2.7.3 and Spark 
> standalone mode 2.0.2 
> with Kafka 0.10.1.0.   
>Reporter: Heji Kim
>
> Our team is trying to upgrade to Spark 2.0.2/Kafka 
> 0.10.1.0/spark-streaming-kafka-0-10_2.11 (v 2.0.2) and we cannot get our 
> drivers to read all partitions of a single stream when kafka 
> auto.offset.reset=earliest running on a real cluster (separate VM nodes). 
> When we run our drivers with auto.offset.reset=latest ingesting from a single 
> kafka topic with multiple partitions (usually 10 but problem shows up  with 
> only 3 partitions), the driver reads correctly from all partitions.  
> Unfortunately, we need "earliest" for exactly once semantics.
> In the same kafka 0.10.1.0/spark 2.x setup, our legacy driver using 
> spark-streaming-kafka-0-8_2.11 with the prior setting 
> auto.offset.reset=smallest runs correctly.
> We have tried the following configurations in trying to isolate our problem 
> but it is only auto.offset.reset=earliest on a "real multi-machine cluster" 
> which causes this problem.
> 1. Ran with spark standalone cluster(4 Debian nodes, 8vCPU/30GB each)  
> instead of YARN 2.7.3. Single partition read problem persists both cases. 
> Please note this problem occurs on an actual cluster of separate VM nodes 
> (but not when our engineer runs it as a cluster on his own Mac).
> 2. Ran with spark 2.1 nightly build for the last 10 days. Problem persists.
> 3. Turned off checkpointing. Problem persists with or without checkpointing.
> 4. Turned off backpressure. Problem persists with or without backpressure.
> 5. Tried both partition.assignment.strategy RangeAssignor and 
> RoundRobinAssignor. Broken with both.
> 6. Tried both LocationStrategies (PreferConsistent/PreferFixed). Broken with 
> both.
> 7. Tried the simplest scala driver that only logs.  (Our team uses java.) 
> Broken with both.
> 8. Tried increasing GCE capacity for cluster but already we were highly 
> overprovisioned for cores and memory. Also tried ramping up executors and 
> cores.  Since driver works with auto.offset.reset=latest, we have ruled out 
> GCP cloud infrastructure issues.
> When we turn on the debug logs, we sometimes see partitions being set to 
> different offset configuration even though the consumer config correctly 
> indicates auto.offset.reset=earliest. 
> {noformat}
> 8 DEBUG Resetting offset for partition simple_test-8 to earliest offset. 
> (org.apache.kafka.clients.consumer.internals.Fetcher)
> 9 DEBUG Resetting offset for partition simple_test-9 to latest offset. 
> (org.apache.kafka.clients.consumer.internals.Fetcher)
> 8 TRACE Sending ListOffsetRequest 
> {replica_id=-1,topics=[{topic=simple_test,partitions=[{partition=8,timestamp=-2}]}]}
>  to broker 10.102.20.12:9092 (id: 12 rack: null) 
> (org.apache.kafka.clients.consumer.internals.Fetcher)
> 9 TRACE Sending ListOffsetRequest 
> {replica_id=-1,topics=[{topic=simple_test,partitions=[{partition=9,timestamp=-1}]}]}
>  to broker 10.102.20.13:9092 (id: 13 rack: null) 
> (org.apache.kafka.clients.consumer.internals.Fetcher)
> 8 TRACE Received ListOffsetResponse 
> {responses=[{topic=simple_test,partition_responses=[{partition=8,error_code=0,timestamp=-1,offset=0}]}]}
>  from broker 10.102.20.12:9092 (id: 12 rack: null) 
> (org.apache.kafka.clients.consumer.internals.Fetcher)
> 9 TRACE Received ListOffsetResponse 
> {responses=[{topic=simple_test,partition_responses=[{partition=9,error_code=0,timestamp=-1,offset=66724}]}]}
>  from broker 10.102.20.13:9092 (id: 13 rack: null) 
> (org.apache.kafka.clients.consumer.internals.Fetcher)
> 8 DEBUG Fetched {timestamp=-1, offset=0} for partition simple_test-8 
> (org.apache.kafka.clients.consumer.internals.Fetcher)
> 9 DEBUG Fetched {timestamp=-1, offset=66724} for partition simple_test-9 
> (org.apache.kafka.clients.consumer.internals.Fetcher)
> {noformat}
> I've enclosed below the 

[jira] [Commented] (SPARK-18476) SparkR Logistic Regression should should support output original label.

2016-12-01 Thread Miao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15713515#comment-15713515
 ] 

Miao Wang commented on SPARK-18476:
---

spark.logit predict should output the original label instead of a numerical value 
as the prediction column. Example:

> training <- suppressWarnings(createDataFrame(iris))
> binomial_training <- training[training$Species %in% c("versicolor", "virginica"), ]
> binomial_model <- spark.logit(binomial_training, Species ~ Sepal_Length + Sepal_Width)
> prediction <- predict(binomial_model, binomial_training)
> showDF(prediction)

Output:

+------------+-----------+------------+-----------+----------+--------------------+--------------------+----------+
|Sepal_Length|Sepal_Width|Petal_Length|Petal_Width|   Species|       rawPrediction|         probability|prediction|
+------------+-----------+------------+-----------+----------+--------------------+--------------------+----------+
|         7.0|        3.2|         4.7|        1.4|versicolor|[-1.5655042626435...|[0.17285823940230...| virginica|
|         6.4|        3.2|         4.5|        1.5|versicolor|[-0.4240802660720...|[0.39554079174312...| virginica|
|         6.9|        3.1|         4.9|        1.5|versicolor|[-1.3348014339322...|[0.20836626079909...| virginica|
|         5.5|        2.3|         4.0|        1.3|versicolor|[1.65224519232947...|[0.83919426374389...|versicolor|
|         6.5|        2.8|         4.6|        1.5|versicolor|[-0.4524556150364...|[0.38877708044707...| virginica|
|         5.7|        2.8|         4.5|        1.3|versicolor|[1.06944304705877...|[0.74449098435029...|versicolor|
|         6.3|        3.3|         4.7|        1.6|versicolor|[-0.2743084292595...|[0.43184968922729...| virginica|
|         4.9|        2.4|         3.3|        1.0|versicolor|[2.75320369295153...|[0.94009402758065...|versicolor|
|         6.6|        2.9|         4.6|        1.3|versicolor|[-0.6831584437477...|[0.33555673563505...| virginica|
|         5.2|        2.7|         3.9|        1.4|versicolor|[2.06109520681768...|[0.88706393592062...|versicolor|
|         5.0|        2.0|         3.5|        1.0|versicolor|[2.72482834398713...|[0.93847590782569...|versicolor|
|         5.9|        3.0|         4.2|        1.5|versicolor|[0.60803738963620...|[0.64749297424084...|versicolor|
|         6.0|        2.2|         4.0|        1.0|versicolor|[0.74152402446931...|[0.67732902849243...|versicolor|
|         6.1|        2.9|         4.7|        1.4|versicolor|[0.26802822006176...|[0.56660877197498...|versicolor|
|         5.6|        2.9|         3.6|        1.3|versicolor|[1.21921488387130...|[0.77192535405997...|versicolor|
|         6.7|        3.1|         4.4|        1.4|versicolor|[-0.9543267684084...|[0.27801550694056...| virginica|
|         5.6|        3.0|         4.5|        1.5|versicolor|[1.17874938792192...|[0.76472286588073...|versicolor|
|         5.8|        2.7|         4.1|        1.0|versicolor|[0.91967121024624...|[0.71497510778599...|versicolor|
|         6.2|        2.2|         4.5|        1.5|versicolor|[0.36104935894550...|[0.58929443051304...|versicolor|
|         5.6|        2.5|         3.9|        1.1|versicolor|[1.38107686766881...|[0.79916389423163...|versicolor|
+------------+-----------+------------+-----------+----------+--------------------+--------------------+----------+

The `prediction` column should be the original label as shown above.

> SparkR Logistic Regression should should support output original label.
> ---
>
> Key: SPARK-18476
> URL: https://issues.apache.org/jira/browse/SPARK-18476
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Miao Wang
>Assignee: Miao Wang
> Fix For: 2.1.0
>
>
> Similar to [SPARK-18401], as a classification algorithm, logistic regression 
> should support output original label instead of supporting index label.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18506) kafka 0.10 with Spark 2.02 auto.offset.reset=earliest will only read from a single partition on a multi partition topic

2016-12-01 Thread Heji Kim (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15713476#comment-15713476
 ] 

Heji Kim commented on SPARK-18506:
--

Breaking news: I finally found the source of the problem.  Our driver jars 
have a lot of dependencies and we also include
the kafka-clients jar along with spark-streaming_2.11 (2.0.2). Our data 
architect says our code uses it:

<dependency>
  <groupId>org.apache.kafka</groupId>
  <artifactId>kafka-clients</artifactId>
  <version>0.10.1.0</version>
</dependency>

If I downgrade kafka-clients to 0.10.0.1, "earliest" works exactly as expected. 

(I'll update the issue name with this jar name...)





> kafka 0.10 with Spark 2.02 auto.offset.reset=earliest will only read from a 
> single partition on a multi partition topic
> ---
>
> Key: SPARK-18506
> URL: https://issues.apache.org/jira/browse/SPARK-18506
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.2
> Environment: Problem occurs both in Hadoop/YARN 2.7.3 and Spark 
> standalone mode 2.0.2 
> with Kafka 0.10.1.0.   
>Reporter: Heji Kim
>
> Our team is trying to upgrade to Spark 2.0.2/Kafka 
> 0.10.1.0/spark-streaming-kafka-0-10_2.11 (v 2.0.2) and we cannot get our 
> drivers to read all partitions of a single stream when kafka 
> auto.offset.reset=earliest running on a real cluster (separate VM nodes). 
> When we run our drivers with auto.offset.reset=latest ingesting from a single 
> kafka topic with multiple partitions (usually 10 but problem shows up  with 
> only 3 partitions), the driver reads correctly from all partitions.  
> Unfortunately, we need "earliest" for exactly once semantics.
> In the same kafka 0.10.1.0/spark 2.x setup, our legacy driver using 
> spark-streaming-kafka-0-8_2.11 with the prior setting 
> auto.offset.reset=smallest runs correctly.
> We have tried the following configurations in trying to isolate our problem 
> but it is only auto.offset.reset=earliest on a "real multi-machine cluster" 
> which causes this problem.
> 1. Ran with spark standalone cluster(4 Debian nodes, 8vCPU/30GB each)  
> instead of YARN 2.7.3. Single partition read problem persists both cases. 
> Please note this problem occurs on an actual cluster of separate VM nodes 
> (but not when our engineer runs it as a cluster on his own Mac).
> 2. Ran with spark 2.1 nightly build for the last 10 days. Problem persists.
> 3. Turned off checkpointing. Problem persists with or without checkpointing.
> 4. Turned off backpressure. Problem persists with or without backpressure.
> 5. Tried both partition.assignment.strategy RangeAssignor and 
> RoundRobinAssignor. Broken with both.
> 6. Tried both LocationStrategies (PreferConsistent/PreferFixed). Broken with 
> both.
> 7. Tried the simplest scala driver that only logs.  (Our team uses java.) 
> Broken with both.
> 8. Tried increasing GCE capacity for cluster but already we were highly 
> overprovisioned for cores and memory. Also tried ramping up executors and 
> cores.  Since driver works with auto.offset.reset=latest, we have ruled out 
> GCP cloud infrastructure issues.
> When we turn on the debug logs, we sometimes see partitions being set to 
> different offset configuration even though the consumer config correctly 
> indicates auto.offset.reset=earliest. 
> {noformat}
> 8 DEBUG Resetting offset for partition simple_test-8 to earliest offset. 
> (org.apache.kafka.clients.consumer.internals.Fetcher)
> 9 DEBUG Resetting offset for partition simple_test-9 to latest offset. 
> (org.apache.kafka.clients.consumer.internals.Fetcher)
> 8 TRACE Sending ListOffsetRequest 
> {replica_id=-1,topics=[{topic=simple_test,partitions=[{partition=8,timestamp=-2}]}]}
>  to broker 10.102.20.12:9092 (id: 12 rack: null) 
> (org.apache.kafka.clients.consumer.internals.Fetcher)
> 9 TRACE Sending ListOffsetRequest 
> {replica_id=-1,topics=[{topic=simple_test,partitions=[{partition=9,timestamp=-1}]}]}
>  to broker 10.102.20.13:9092 (id: 13 rack: null) 
> (org.apache.kafka.clients.consumer.internals.Fetcher)
> 8 TRACE Received ListOffsetResponse 
> {responses=[{topic=simple_test,partition_responses=[{partition=8,error_code=0,timestamp=-1,offset=0}]}]}
>  from broker 10.102.20.12:9092 (id: 12 rack: null) 
> (org.apache.kafka.clients.consumer.internals.Fetcher)
> 9 TRACE Received ListOffsetResponse 
> {responses=[{topic=simple_test,partition_responses=[{partition=9,error_code=0,timestamp=-1,offset=66724}]}]}
>  from broker 10.102.20.13:9092 (id: 13 rack: null) 
> (org.apache.kafka.clients.consumer.internals.Fetcher)
> 8 DEBUG Fetched {timestamp=-1, offset=0} for partition simple_test-8 
> (org.apache.kafka.clients.consumer.internals.Fetcher)
> 9 DEBUG Fetched {timestamp=-1, offset=66724} for partition simple_test-9 
> 

[jira] [Updated] (SPARK-18506) kafka 0.10 with Spark 2.02 auto.offset.reset=earliest will only read from a single partition on a multi partition topic when kafka-clients 0.10.0.1 is used

2016-12-01 Thread Heji Kim (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Heji Kim updated SPARK-18506:
-
Summary: kafka 0.10 with Spark 2.02 auto.offset.reset=earliest will only 
read from a single partition on a multi partition topic when kafka-clients 
0.10.0.1 is used  (was: kafka 0.10 with Spark 2.02 auto.offset.reset=earliest 
will only read from a single partition on a multi partition topic)

> kafka 0.10 with Spark 2.02 auto.offset.reset=earliest will only read from a 
> single partition on a multi partition topic when kafka-clients 0.10.0.1 is 
> used
> ---
>
> Key: SPARK-18506
> URL: https://issues.apache.org/jira/browse/SPARK-18506
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.2
> Environment: Problem occurs both in Hadoop/YARN 2.7.3 and Spark 
> standalone mode 2.0.2 
> with Kafka 0.10.1.0.   
>Reporter: Heji Kim
>
> Our team is trying to upgrade to Spark 2.0.2/Kafka 
> 0.10.1.0/spark-streaming-kafka-0-10_2.11 (v 2.0.2) and we cannot get our 
> drivers to read all partitions of a single stream when kafka 
> auto.offset.reset=earliest running on a real cluster (separate VM nodes). 
> When we run our drivers with auto.offset.reset=latest ingesting from a single 
> kafka topic with multiple partitions (usually 10 but problem shows up  with 
> only 3 partitions), the driver reads correctly from all partitions.  
> Unfortunately, we need "earliest" for exactly once semantics.
> In the same kafka 0.10.1.0/spark 2.x setup, our legacy driver using 
> spark-streaming-kafka-0-8_2.11 with the prior setting 
> auto.offset.reset=smallest runs correctly.
> We have tried the following configurations in trying to isolate our problem 
> but it is only auto.offset.reset=earliest on a "real multi-machine cluster" 
> which causes this problem.
> 1. Ran with spark standalone cluster(4 Debian nodes, 8vCPU/30GB each)  
> instead of YARN 2.7.3. Single partition read problem persists both cases. 
> Please note this problem occurs on an actual cluster of separate VM nodes 
> (but not when our engineer runs it as a cluster on his own Mac).
> 2. Ran with spark 2.1 nightly build for the last 10 days. Problem persists.
> 3. Turned off checkpointing. Problem persists with or without checkpointing.
> 4. Turned off backpressure. Problem persists with or without backpressure.
> 5. Tried both partition.assignment.strategy RangeAssignor and 
> RoundRobinAssignor. Broken with both.
> 6. Tried both LocationStrategies (PreferConsistent/PreferFixed). Broken with 
> both.
> 7. Tried the simplest scala driver that only logs.  (Our team uses java.) 
> Broken with both.
> 8. Tried increasing GCE capacity for cluster but already we were highly 
> overprovisioned for cores and memory. Also tried ramping up executors and 
> cores.  Since driver works with auto.offset.reset=latest, we have ruled out 
> GCP cloud infrastructure issues.
> When we turn on the debug logs, we sometimes see partitions being set to 
> different offset configuration even though the consumer config correctly 
> indicates auto.offset.reset=earliest. 
> {noformat}
> 8 DEBUG Resetting offset for partition simple_test-8 to earliest offset. 
> (org.apache.kafka.clients.consumer.internals.Fetcher)
> 9 DEBUG Resetting offset for partition simple_test-9 to latest offset. 
> (org.apache.kafka.clients.consumer.internals.Fetcher)
> 8 TRACE Sending ListOffsetRequest 
> {replica_id=-1,topics=[{topic=simple_test,partitions=[{partition=8,timestamp=-2}]}]}
>  to broker 10.102.20.12:9092 (id: 12 rack: null) 
> (org.apache.kafka.clients.consumer.internals.Fetcher)
> 9 TRACE Sending ListOffsetRequest 
> {replica_id=-1,topics=[{topic=simple_test,partitions=[{partition=9,timestamp=-1}]}]}
>  to broker 10.102.20.13:9092 (id: 13 rack: null) 
> (org.apache.kafka.clients.consumer.internals.Fetcher)
> 8 TRACE Received ListOffsetResponse 
> {responses=[{topic=simple_test,partition_responses=[{partition=8,error_code=0,timestamp=-1,offset=0}]}]}
>  from broker 10.102.20.12:9092 (id: 12 rack: null) 
> (org.apache.kafka.clients.consumer.internals.Fetcher)
> 9 TRACE Received ListOffsetResponse 
> {responses=[{topic=simple_test,partition_responses=[{partition=9,error_code=0,timestamp=-1,offset=66724}]}]}
>  from broker 10.102.20.13:9092 (id: 13 rack: null) 
> (org.apache.kafka.clients.consumer.internals.Fetcher)
> 8 DEBUG Fetched {timestamp=-1, offset=0} for partition simple_test-8 
> (org.apache.kafka.clients.consumer.internals.Fetcher)
> 9 DEBUG Fetched {timestamp=-1, offset=66724} for partition simple_test-9 
> (org.apache.kafka.clients.consumer.internals.Fetcher)
> {noformat}
> I've enclosed below the completely stripped down trivial test 

[jira] [Commented] (SPARK-18538) Concurrent Fetching DataFrameReader JDBC APIs Do Not Work

2016-12-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15713459#comment-15713459
 ] 

Apache Spark commented on SPARK-18538:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/16111

> Concurrent Fetching DataFrameReader JDBC APIs Do Not Work
> -
>
> Key: SPARK-18538
> URL: https://issues.apache.org/jira/browse/SPARK-18538
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Blocker
>
> {code}
>   def jdbc(
>   url: String,
>   table: String,
>   columnName: String,
>   lowerBound: Long,
>   upperBound: Long,
>   numPartitions: Int,
>   connectionProperties: Properties): DataFrame
> {code}
> {code}
>   def jdbc(
>   url: String,
>   table: String,
>   predicates: Array[String],
>   connectionProperties: Properties): DataFrame
> {code}
> The above two DataFrameReader JDBC APIs ignore the user-specified parallelism 
> parameters ({{columnName}}/{{lowerBound}}/{{upperBound}}/{{numPartitions}} and 
> {{predicates}}), so the fetch is not actually performed concurrently.
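
For reference, a usage sketch of the first overload (the JDBC URL, table, and column names below are placeholders); with these arguments one would expect the read to be split into 8 concurrent partitions, which is what this issue reports as not happening:

{code}
// Usage sketch only; the JDBC URL, table, and column names are placeholders.
import java.util.Properties

val props = new Properties()
props.setProperty("user", "test")
props.setProperty("password", "test")

val df = spark.read.jdbc(
  "jdbc:postgresql://dbhost:5432/testdb",
  "orders",
  "order_id",     // columnName used to compute the partition strides
  1L,             // lowerBound
  1000000L,       // upperBound
  8,              // numPartitions
  props)

println(df.rdd.getNumPartitions)  // expected: 8
{code}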



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18131) Support returning Vector/Dense Vector from backend

2016-12-01 Thread Miao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15713430#comment-15713430
 ] 

Miao Wang commented on SPARK-18131:
---

I can try to follow up on this discussion with an initial PR.

> Support returning Vector/Dense Vector from backend
> --
>
> Key: SPARK-18131
> URL: https://issues.apache.org/jira/browse/SPARK-18131
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Miao Wang
>
> For `spark.logit`, there is a `probabilityCol`, which is a vector in the 
> backend (scala side). When we do collect(select(df, "probabilityCol")), 
> the backend returns the Java object handle (memory address). We need to implement 
> a method to convert a Vector/DenseVector column to an R vector, which can be 
> read in SparkR. It is a follow-up JIRA to adding `spark.logit`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18679) Regression in file listing performance

2016-12-01 Thread Eric Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Liang updated SPARK-18679:
---
Component/s: SQL

> Regression in file listing performance
> --
>
> Key: SPARK-18679
> URL: https://issues.apache.org/jira/browse/SPARK-18679
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Priority: Blocker
>
> In Spark 2.1 ListingFileCatalog was significantly refactored (and renamed to 
> InMemoryFileIndex).
> It seems there is a performance regression here where we no longer 
> perform listing in parallel for non-root directories. This forces file 
> listing to be completely serial when resolving datasource tables that are not 
> backed by an external catalog.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18679) Regression in file listing performance

2016-12-01 Thread Eric Liang (JIRA)
Eric Liang created SPARK-18679:
--

 Summary: Regression in file listing performance
 Key: SPARK-18679
 URL: https://issues.apache.org/jira/browse/SPARK-18679
 Project: Spark
  Issue Type: Bug
Reporter: Eric Liang
Priority: Blocker


In Spark 2.1 ListingFileCatalog was significantly refactored (and renamed to 
InMemoryFileIndex).

It seems there is a performance regression here where we no longer perform 
listing in parallel for non-root directories. This forces file listing to be 
completely serial when resolving datasource tables that are not backed by an 
external catalog.
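
As an illustrative sketch only (not the InMemoryFileIndex code), listing several non-root directories in parallel on the driver could look roughly like this:

{code}
// Illustrative sketch: list several non-root directories in parallel on the
// driver instead of one by one. Not the actual InMemoryFileIndex implementation.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, Path}

def listLeafFilesInParallel(dirs: Seq[Path], conf: Configuration): Seq[FileStatus] = {
  dirs.par.flatMap { dir =>
    val fs = dir.getFileSystem(conf)   // FileSystem instances are cached per scheme
    fs.listStatus(dir).toSeq
  }.seq
}
{code}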



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18679) Regression in file listing performance

2016-12-01 Thread Eric Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Liang updated SPARK-18679:
---
Affects Version/s: 2.1.0

> Regression in file listing performance
> --
>
> Key: SPARK-18679
> URL: https://issues.apache.org/jira/browse/SPARK-18679
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Priority: Blocker
>
> In Spark 2.1 ListingFileCatalog was significantly refactored (and renamed to 
> InMemoryFileIndex).
> It seems there is a performance regression here where we no longer 
> perform listing in parallel for non-root directories. This forces file 
> listing to be completely serial when resolving datasource tables that are not 
> backed by an external catalog.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18618) SparkR model predict should support type as a argument

2016-12-01 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15713358#comment-15713358
 ] 

Joseph K. Bradley commented on SPARK-18618:
---

[~yanboliang] Shall we get this into 2.1 as a fix for the API change in 
[SPARK-18291]?  Otherwise, we may have to revert [SPARK-18291] and get both 
into 2.2.

> SparkR model predict should support type as a argument
> --
>
> Key: SPARK-18618
> URL: https://issues.apache.org/jira/browse/SPARK-18618
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Reporter: Yanbo Liang
>
> SparkR model {{predict}} should support {{type}} as an argument. This will make it 
> consistent with native R predict, such as 
> https://stat.ethz.ch/R-manual/R-devel/library/stats/html/predict.glm.html .



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18291) SparkR glm predict should output original label when family = "binomial"

2016-12-01 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15713341#comment-15713341
 ] 

Joseph K. Bradley commented on SPARK-18291:
---

I just saw the comment at the end of the PR and [SPARK-18618].  It sounds like 
we reached similar conclusions.

> SparkR glm predict should output original label when family = "binomial"
> 
>
> Key: SPARK-18291
> URL: https://issues.apache.org/jira/browse/SPARK-18291
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SparkR
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
> Fix For: 2.1.0
>
> Attachments: SparkR2.1decisionoutputschemaforGLMs.pdf
>
>
> SparkR spark.glm predict should output the original label when family = 
> "binomial".
> For example, we can run the following code in the SparkR shell:
> {code}
> training <- suppressWarnings(createDataFrame(iris))
> training <- training[training$Species %in% c("versicolor", "virginica"), ]
> model <- spark.glm(training, Species ~ Sepal_Length + Sepal_Width,family = 
> binomial(link = "logit"))
> showDF(predict(model, training))
> {code}
> The prediction column is a double value, which makes no sense to users.
> {code}
> +------------+-----------+------------+-----------+----------+-----+-------------------+
> |Sepal_Length|Sepal_Width|Petal_Length|Petal_Width|   Species|label|         prediction|
> +------------+-----------+------------+-----------+----------+-----+-------------------+
> |         7.0|        3.2|         4.7|        1.4|versicolor|  0.0| 0.8271421517601544|
> |         6.4|        3.2|         4.5|        1.5|versicolor|  0.0| 0.6044595910413112|
> |         6.9|        3.1|         4.9|        1.5|versicolor|  0.0| 0.7916340858281998|
> |         5.5|        2.3|         4.0|        1.3|versicolor|  0.0|0.16080518180591158|
> |         6.5|        2.8|         4.6|        1.5|versicolor|  0.0| 0.6112229217050189|
> |         5.7|        2.8|         4.5|        1.3|versicolor|  0.0| 0.2555087295500885|
> |         6.3|        3.3|         4.7|        1.6|versicolor|  0.0| 0.5681507664364834|
> |         4.9|        2.4|         3.3|        1.0|versicolor|  0.0|0.05990570219972002|
> |         6.6|        2.9|         4.6|        1.3|versicolor|  0.0| 0.6644434078306246|
> |         5.2|        2.7|         3.9|        1.4|versicolor|  0.0|0.11293577405862379|
> |         5.0|        2.0|         3.5|        1.0|versicolor|  0.0|0.06152372321585971|
> |         5.9|        3.0|         4.2|        1.5|versicolor|  0.0|0.35250697207602555|
> |         6.0|        2.2|         4.0|        1.0|versicolor|  0.0|0.32267018290814303|
> |         6.1|        2.9|         4.7|        1.4|versicolor|  0.0|  0.433391153814592|
> |         5.6|        2.9|         3.6|        1.3|versicolor|  0.0| 0.2280744262436993|
> |         6.7|        3.1|         4.4|        1.4|versicolor|  0.0| 0.7219848389339459|
> |         5.6|        3.0|         4.5|        1.5|versicolor|  0.0|0.23527698971404695|
> |         5.8|        2.7|         4.1|        1.0|versicolor|  0.0|  0.285024533520016|
> |         6.2|        2.2|         4.5|        1.5|versicolor|  0.0| 0.4107047877447493|
> |         5.6|        2.5|         3.9|        1.1|versicolor|  0.0|0.20083561961645083|
> +------------+-----------+------------+-----------+----------+-----+-------------------+
> {code}
> The prediction value should be the original label like:
> {code}
> +------------+-----------+------------+-----------+----------+-----+----------+
> |Sepal_Length|Sepal_Width|Petal_Length|Petal_Width|   Species|label|prediction|
> +------------+-----------+------------+-----------+----------+-----+----------+
> |         7.0|        3.2|         4.7|        1.4|versicolor|  0.0| virginica|
> |         6.4|        3.2|         4.5|        1.5|versicolor|  0.0| virginica|
> |         6.9|        3.1|         4.9|        1.5|versicolor|  0.0| virginica|
> |         5.5|        2.3|         4.0|        1.3|versicolor|  0.0|versicolor|
> |         6.5|        2.8|         4.6|        1.5|versicolor|  0.0| virginica|
> |         5.7|        2.8|         4.5|        1.3|versicolor|  0.0|versicolor|
> |         6.3|        3.3|         4.7|        1.6|versicolor|  0.0| virginica|
> |         4.9|        2.4|         3.3|        1.0|versicolor|  0.0|versicolor|
> |         6.6|        2.9|         4.6|        1.3|versicolor|  0.0| virginica|
> |         5.2|        2.7|         3.9|        1.4|versicolor|  0.0|versicolor|
> |         5.0|        2.0|         3.5|        1.0|versicolor|  0.0|versicolor|
> |         5.9|        3.0|         4.2|        1.5|versicolor|  0.0|versicolor|
> |   

[jira] [Commented] (SPARK-13534) Implement Apache Arrow serializer for Spark DataFrame for use in DataFrame.toPandas

2016-12-01 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15713340#comment-15713340
 ] 

Bryan Cutler commented on SPARK-13534:
--

Hi [~icexelloss], that sounds great!  We could definitely use some help with 
validation testing.

> Implement Apache Arrow serializer for Spark DataFrame for use in 
> DataFrame.toPandas
> ---
>
> Key: SPARK-13534
> URL: https://issues.apache.org/jira/browse/SPARK-13534
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Reporter: Wes McKinney
>
> The current code path for accessing Spark DataFrame data in Python using 
> PySpark passes through an inefficient serialization-deserialization process 
> that I've examined at a high level here: 
> https://gist.github.com/wesm/0cb5531b1c2e346a0007. Currently, RDD[Row] 
> objects are being deserialized in pure Python as a list of tuples, which are 
> then converted to pandas.DataFrame using its {{from_records}} alternate 
> constructor. This also uses a large amount of memory.
> For flat (no nested types) schemas, the Apache Arrow memory layout 
> (https://github.com/apache/arrow/tree/master/format) can be deserialized to 
> {{pandas.DataFrame}} objects with comparatively small overhead compared with 
> memcpy / system memory bandwidth -- Arrow's bitmasks must be examined, 
> replacing the corresponding null values with pandas's sentinel values (None 
> or NaN as appropriate).
> I will be contributing patches to Arrow in the coming weeks for converting 
> between Arrow and pandas in the general case, so if Spark can send Arrow 
> memory to PySpark, we will hopefully be able to increase the Python data 
> access throughput by an order of magnitude or more. I propose to add a new 
> serializer for Spark DataFrame and a new method that can be invoked from 
> PySpark to request an Arrow memory-layout byte stream, prefixed by a data 
> header indicating array buffer offsets and sizes.
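As a rough illustration (not from this thread) of the code path criticized above, here is a minimal PySpark sketch of what {{DataFrame.toPandas}} effectively does today; the toy DataFrame is hypothetical.
{code}
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(5).selectExpr("id", "id * 2 AS doubled")

# Every Row is deserialized into a Python object before pandas ever sees the data.
rows = df.collect()
pdf = pd.DataFrame.from_records(rows, columns=df.columns)
print(pdf)
{code}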



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-18476) SparkR Logistic Regression should support output original label.

2016-12-01 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-18476:
--
Comment: was deleted

(was: [~wangmiao1981]
This changes the output schema and is an API-breaking change.  To help with 
discussion, can you please add examples of the change in output schema to this 
JIRA's description, similar to the example in [SPARK-18291]?  Thanks!)

> SparkR Logistic Regression should support output original label.
> ---
>
> Key: SPARK-18476
> URL: https://issues.apache.org/jira/browse/SPARK-18476
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Miao Wang
>Assignee: Miao Wang
> Fix For: 2.1.0
>
>
> Similar to [SPARK-18401], as a classification algorithm, logistic regression 
> should support outputting the original label instead of the index label.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18674) improve the error message of using join

2016-12-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15713321#comment-15713321
 ] 

Apache Spark commented on SPARK-18674:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/16110

> improve the error message of using join
> ---
>
> Key: SPARK-18674
> URL: https://issues.apache.org/jira/browse/SPARK-18674
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.0.3, 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18476) SparkR Logistic Regression should support output original label.

2016-12-01 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15713296#comment-15713296
 ] 

Joseph K. Bradley commented on SPARK-18476:
---

[~wangmiao1981]
This changes the output schema and is an API-breaking change.  To help with 
discussion, can you please add examples of the change in output schema to this 
JIRA's description, similar to the example in [SPARK-18291]?  Thanks!

> SparkR Logistic Regression should support output original label.
> ---
>
> Key: SPARK-18476
> URL: https://issues.apache.org/jira/browse/SPARK-18476
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Miao Wang
>Assignee: Miao Wang
> Fix For: 2.1.0
>
>
> Similar to [SPARK-18401], as a classification algorithm, logistic regression 
> should support outputting the original label instead of the index label.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18291) SparkR glm predict should output original label when family = "binomial"

2016-12-01 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-18291:
--
Attachment: SparkR2.1decisionoutputschemaforGLMs.pdf

I'm adding a little summary of the API issue.  [~yanboliang] [~felixcheung] 
[~shivaram] [~mengxr] what do you think?

I'd vote for Option 3 to avoid breaking APIs.  Given a re-do of SparkR, I'd 
choose Option 2.
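For illustration only (PySpark rather than SparkR, and not part of the original comment): a minimal sketch of what returning the original label means for a binary GLM prediction column today. The 0.5 threshold and the two label strings follow the iris example in the description and are assumptions, not the proposed API.
{code}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical GLM output: one probability column named "prediction".
preds = spark.createDataFrame([(0.83,), (0.16,)], ["prediction"])

labeled = preds.withColumn(
    "predicted_label",
    F.when(F.col("prediction") > 0.5, "virginica").otherwise("versicolor"))
labeled.show()
{code}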

> SparkR glm predict should output original label when family = "binomial"
> 
>
> Key: SPARK-18291
> URL: https://issues.apache.org/jira/browse/SPARK-18291
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SparkR
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
> Fix For: 2.1.0
>
> Attachments: SparkR2.1decisionoutputschemaforGLMs.pdf
>
>
> SparkR spark.glm predict should output original label when family = 
> "binomial".
> For example, we can run the following code in sparkr shell:
> {code}
> training <- suppressWarnings(createDataFrame(iris))
> training <- training[training$Species %in% c("versicolor", "virginica"), ]
> model <- spark.glm(training, Species ~ Sepal_Length + Sepal_Width, family = binomial(link = "logit"))
> showDF(predict(model, training))
> {code}
> The prediction column is a double value, which makes no sense to users.
> {code}
> +------------+-----------+------------+-----------+----------+-----+-------------------+
> |Sepal_Length|Sepal_Width|Petal_Length|Petal_Width|   Species|label|         prediction|
> +------------+-----------+------------+-----------+----------+-----+-------------------+
> |         7.0|        3.2|         4.7|        1.4|versicolor|  0.0| 0.8271421517601544|
> |         6.4|        3.2|         4.5|        1.5|versicolor|  0.0| 0.6044595910413112|
> |         6.9|        3.1|         4.9|        1.5|versicolor|  0.0| 0.7916340858281998|
> |         5.5|        2.3|         4.0|        1.3|versicolor|  0.0|0.16080518180591158|
> |         6.5|        2.8|         4.6|        1.5|versicolor|  0.0| 0.6112229217050189|
> |         5.7|        2.8|         4.5|        1.3|versicolor|  0.0| 0.2555087295500885|
> |         6.3|        3.3|         4.7|        1.6|versicolor|  0.0| 0.5681507664364834|
> |         4.9|        2.4|         3.3|        1.0|versicolor|  0.0|0.05990570219972002|
> |         6.6|        2.9|         4.6|        1.3|versicolor|  0.0| 0.6644434078306246|
> |         5.2|        2.7|         3.9|        1.4|versicolor|  0.0|0.11293577405862379|
> |         5.0|        2.0|         3.5|        1.0|versicolor|  0.0|0.06152372321585971|
> |         5.9|        3.0|         4.2|        1.5|versicolor|  0.0|0.35250697207602555|
> |         6.0|        2.2|         4.0|        1.0|versicolor|  0.0|0.32267018290814303|
> |         6.1|        2.9|         4.7|        1.4|versicolor|  0.0|  0.433391153814592|
> |         5.6|        2.9|         3.6|        1.3|versicolor|  0.0| 0.2280744262436993|
> |         6.7|        3.1|         4.4|        1.4|versicolor|  0.0| 0.7219848389339459|
> |         5.6|        3.0|         4.5|        1.5|versicolor|  0.0|0.23527698971404695|
> |         5.8|        2.7|         4.1|        1.0|versicolor|  0.0|  0.285024533520016|
> |         6.2|        2.2|         4.5|        1.5|versicolor|  0.0| 0.4107047877447493|
> |         5.6|        2.5|         3.9|        1.1|versicolor|  0.0|0.20083561961645083|
> +------------+-----------+------------+-----------+----------+-----+-------------------+
> {code}
> The prediction value should be the original label like:
> {code}
> +------------+-----------+------------+-----------+----------+-----+----------+
> |Sepal_Length|Sepal_Width|Petal_Length|Petal_Width|   Species|label|prediction|
> +------------+-----------+------------+-----------+----------+-----+----------+
> |         7.0|        3.2|         4.7|        1.4|versicolor|  0.0| virginica|
> |         6.4|        3.2|         4.5|        1.5|versicolor|  0.0| virginica|
> |         6.9|        3.1|         4.9|        1.5|versicolor|  0.0| virginica|
> |         5.5|        2.3|         4.0|        1.3|versicolor|  0.0|versicolor|
> |         6.5|        2.8|         4.6|        1.5|versicolor|  0.0| virginica|
> |         5.7|        2.8|         4.5|        1.3|versicolor|  0.0|versicolor|
> |         6.3|        3.3|         4.7|        1.6|versicolor|  0.0| virginica|
> |         4.9|        2.4|         3.3|        1.0|versicolor|  0.0|versicolor|
> |         6.6|        2.9|         4.6|        1.3|versicolor|  0.0| virginica|
> |         5.2|        2.7|         3.9|        1.4|versicolor|  0.0|versicolor|
> | 5.0|2.0| 3.5|

[jira] [Commented] (SPARK-18588) KafkaSourceStressForDontFailOnDataLossSuite is flaky

2016-12-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15713179#comment-15713179
 ] 

Apache Spark commented on SPARK-18588:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/16109

> KafkaSourceStressForDontFailOnDataLossSuite is flaky
> 
>
> Key: SPARK-18588
> URL: https://issues.apache.org/jira/browse/SPARK-18588
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Reporter: Herman van Hovell
>Assignee: Shixiong Zhu
>
> https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.kafka010.KafkaSourceStressForDontFailOnDataLossSuite_name=stress+test+for+failOnDataLoss%3Dfalse



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18670) Limit the number of StreamingQueryListener.StreamProgressEvent when there is no data

2016-12-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18670:


Assignee: Shixiong Zhu  (was: Apache Spark)

> Limit the number of StreamingQueryListener.StreamProgressEvent when there is 
> no data
> 
>
> Key: SPARK-18670
> URL: https://issues.apache.org/jira/browse/SPARK-18670
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Reporter: Tathagata Das
>Assignee: Shixiong Zhu
>Priority: Critical
>
> When a StreamingQuery is not receiving any data, and no processing trigger is 
> set, it attempts to run a trigger every 10 ms. This would generate 
> StreamingQueryListener events at a very high rate. This should be limited 
> such that as long as data is not being received, it should generate events 
> once every X seconds.
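A hedged sketch (not the Spark implementation) of the throttling rule described above: always report progress for a trigger that processed data, but report at most once per interval while triggers stay empty. The class name and the 10-second interval are assumptions.
{code}
import time

class ProgressThrottler(object):
    """Emit an event for every non-empty trigger, and at most one event per
    no_data_interval seconds while triggers remain empty."""

    def __init__(self, no_data_interval=10.0):
        self.no_data_interval = no_data_interval
        self._last_emit = float("-inf")

    def should_emit(self, num_input_rows):
        now = time.monotonic()
        if num_input_rows > 0 or now - self._last_emit >= self.no_data_interval:
            self._last_emit = now
            return True
        return False

throttler = ProgressThrottler(no_data_interval=10.0)
for rows in [0, 0, 7, 0]:   # hypothetical per-trigger input row counts
    print(rows, throttler.should_emit(rows))
{code}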



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18670) Limit the number of StreamingQueryListener.StreamProgressEvent when there is no data

2016-12-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15713142#comment-15713142
 ] 

Apache Spark commented on SPARK-18670:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/16108

> Limit the number of StreamingQueryListener.StreamProgressEvent when there is 
> no data
> 
>
> Key: SPARK-18670
> URL: https://issues.apache.org/jira/browse/SPARK-18670
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Reporter: Tathagata Das
>Assignee: Shixiong Zhu
>Priority: Critical
>
> When a StreamingQuery is not receiving any data, and no processing trigger is 
> set, it attempts to run a trigger every 10 ms. This would generate 
> StreamingQueryListener events at a very high rate. This should be limited 
> such that as long as data is not being received, it should generate events 
> once every X seconds.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18670) Limit the number of StreamingQueryListener.StreamProgressEvent when there is no data

2016-12-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18670:


Assignee: Apache Spark  (was: Shixiong Zhu)

> Limit the number of StreamingQueryListener.StreamProgressEvent when there is 
> no data
> 
>
> Key: SPARK-18670
> URL: https://issues.apache.org/jira/browse/SPARK-18670
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Reporter: Tathagata Das
>Assignee: Apache Spark
>Priority: Critical
>
> When a StreamingQuery is not receiving any data, and no processing trigger is 
> set, it attempts to run a trigger every 10 ms. This would generate 
> StreamingQueryListener events at a very high rate. This should be limited 
> such that as long as data is not being received, it should generate events 
> once every X seconds.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18274) Memory leak in PySpark StringIndexer

2016-12-01 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-18274:
--
Target Version/s: 2.0.3, 2.1.0  (was: 2.0.3, 2.1.1, 2.2.0)

> Memory leak in PySpark StringIndexer
> 
>
> Key: SPARK-18274
> URL: https://issues.apache.org/jira/browse/SPARK-18274
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 1.5.2, 1.6.3, 2.0.1, 2.0.2, 2.1.0
>Reporter: Jonas Amrich
>Assignee: Sandeep Singh
>Priority: Critical
> Fix For: 2.0.3, 2.1.1, 2.2.0
>
>
> StringIndexerModel won't get collected by GC in Java even when deleted in 
> Python. It can be reproduced by this code, which fails after a couple of 
> iterations (around 7 if you set driver memory to 600MB):
> {code}
> import random, string
> from pyspark.ml.feature import StringIndexer
> l = [(''.join(random.choice(string.ascii_uppercase) for _ in range(10)), ) 
> for _ in range(int(7e5))]  # 700000 random strings of 10 characters
> df = spark.createDataFrame(l, ['string'])
> for i in range(50):
>     indexer = StringIndexer(inputCol='string', outputCol='index')
>     indexer.fit(df)
> {code}
> Explicit call to Python GC fixes the issue - following code runs fine:
> {code}
> import gc
> for i in range(50):
>     indexer = StringIndexer(inputCol='string', outputCol='index')
>     indexer.fit(df)
>     gc.collect()
> {code}
> The issue is similar to SPARK-6194 and can probably be fixed by calling JVM 
> detach in the model's destructor. This is implemented in 
> pyspark.mllib.common.JavaModelWrapper but missing in 
> pyspark.ml.wrapper.JavaWrapper. Other models in the ml package may also be 
> affected by this memory leak.
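A hedged sketch (not necessarily the actual Spark patch) of the cleanup suggested above for pyspark.ml.wrapper.JavaWrapper, mirroring what pyspark.mllib.common.JavaModelWrapper already does in its destructor; the class name here is hypothetical.
{code}
from pyspark import SparkContext

class DetachingJavaWrapper(object):
    def __init__(self, java_obj=None):
        self._java_obj = java_obj

    def __del__(self):
        # Tell the Py4J gateway to forget the wrapped object so the JVM side can be GC'd.
        sc = SparkContext._active_spark_context
        if sc is not None and self._java_obj is not None:
            sc._gateway.detach(self._java_obj)
{code}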



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18274) Memory leak in PySpark StringIndexer

2016-12-01 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-18274.
---
   Resolution: Fixed
Fix Version/s: 2.2.0
   2.0.3
   2.1.1

Issue resolved by pull request 15843
[https://github.com/apache/spark/pull/15843]

> Memory leak in PySpark StringIndexer
> 
>
> Key: SPARK-18274
> URL: https://issues.apache.org/jira/browse/SPARK-18274
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 1.5.2, 1.6.3, 2.0.1, 2.0.2, 2.1.0
>Reporter: Jonas Amrich
>Assignee: Sandeep Singh
>Priority: Critical
> Fix For: 2.1.1, 2.0.3, 2.2.0
>
>
> StringIndexerModel won't get collected by GC in Java even when deleted in 
> Python. It can be reproduced by this code, which fails after a couple of 
> iterations (around 7 if you set driver memory to 600MB):
> {code}
> import random, string
> from pyspark.ml.feature import StringIndexer
> l = [(''.join(random.choice(string.ascii_uppercase) for _ in range(10)), ) 
> for _ in range(int(7e5))]  # 700000 random strings of 10 characters
> df = spark.createDataFrame(l, ['string'])
> for i in range(50):
>     indexer = StringIndexer(inputCol='string', outputCol='index')
>     indexer.fit(df)
> {code}
> Explicit call to Python GC fixes the issue - following code runs fine:
> {code}
> import gc
> for i in range(50):
>     indexer = StringIndexer(inputCol='string', outputCol='index')
>     indexer.fit(df)
>     gc.collect()
> {code}
> The issue is similar to SPARK-6194 and can probably be fixed by calling JVM 
> detach in the model's destructor. This is implemented in 
> pyspark.mllib.common.JavaModelWrapper but missing in 
> pyspark.ml.wrapper.JavaWrapper. Other models in the ml package may also be 
> affected by this memory leak.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18642) Spark SQL: Catalyst is scanning undesired columns

2016-12-01 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-18642:
--
Affects Version/s: 1.6.3

> Spark SQL: Catalyst is scanning undesired columns
> -
>
> Key: SPARK-18642
> URL: https://issues.apache.org/jira/browse/SPARK-18642
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2, 1.6.3
> Environment: Ubuntu 14.04
> Spark: Local Mode
>Reporter: Mohit
>  Labels: performance
>
> When doing a left-join between two tables, say A and B,  Catalyst has 
> information about the projection required for table B. Only the required 
> columns should be scanned.
> Code snippet below explains the scenario:
> scala> val dfA = sqlContext.read.parquet("/home/mohit/ruleA")
> dfA: org.apache.spark.sql.DataFrame = [aid: int, aVal: string]
> scala> val dfB = sqlContext.read.parquet("/home/mohit/ruleB")
> dfB: org.apache.spark.sql.DataFrame = [bid: int, bVal: string]
> scala> dfA.registerTempTable("A")
> scala> dfB.registerTempTable("B")
> scala> sqlContext.sql("select A.aid, B.bid from A left join B on A.aid=B.bid where B.bid<2").explain
> == Physical Plan ==
> Project [aid#15,bid#17]
> +- Filter (bid#17 < 2)
>    +- BroadcastHashOuterJoin [aid#15], [bid#17], LeftOuter, None
>       :- Scan ParquetRelation[aid#15,aVal#16] InputPaths: file:/home/mohit/ruleA
>       +- Scan ParquetRelation[bid#17,bVal#18] InputPaths: file:/home/mohit/ruleB
> This is a watered-down example from a production issue which has a huge 
> performance impact.
> External reference: 
> http://stackoverflow.com/questions/40783675/spark-sql-catalyst-is-scanning-undesired-columns



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-18641) Show databases NullPointerException while Sentry turned on

2016-12-01 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-18641.
-
Resolution: Invalid

I am closing this issue because the reported error message comes from Sentry code. 
If this turns out to be a Spark code issue, please reopen it with the Spark-side exception.

> Show databases NullPointerException while Sentry turned on
> --
>
> Key: SPARK-18641
> URL: https://issues.apache.org/jira/browse/SPARK-18641
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: CentOS 6.5 / Hive 1.1.0 / Sentry 1.5.1
>Reporter: zhangqw
>
> I've traced into the source code, and it seems that some required setting of 
> Sentry is not set when Spark SQL starts a session. This operation should be 
> done in org.apache.sentry.binding.hive.HiveAuthzBindingSessionHook, which is 
> not called in Spark SQL.
> Edit: I copied hive-site.xml (which turns on Sentry) and all Sentry jars into 
> Spark's classpath.
> Here is the stack:
> ===
> 16/11/30 10:54:50 WARN SentryMetaStoreFilterHook: Error getting DB list
> java.lang.NullPointerException
> at 
> java.util.concurrent.ConcurrentHashMap.hash(ConcurrentHashMap.java:333)
> at 
> java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:988)
> at org.apache.hadoop.security.Groups.getGroups(Groups.java:162)
> at 
> org.apache.sentry.provider.common.HadoopGroupMappingService.getGroups(HadoopGroupMappingService.java:60)
> at 
> org.apache.sentry.binding.hive.HiveAuthzBindingHook.getHiveBindingWithPrivilegeCache(HiveAuthzBindingHook.java:956)
> at 
> org.apache.sentry.binding.hive.HiveAuthzBindingHook.filterShowDatabases(HiveAuthzBindingHook.java:826)
> at 
> org.apache.sentry.binding.metastore.SentryMetaStoreFilterHook.filterDb(SentryMetaStoreFilterHook.java:131)
> at 
> org.apache.sentry.binding.metastore.SentryMetaStoreFilterHook.filterDatabases(SentryMetaStoreFilterHook.java:59)
> at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getAllDatabases(HiveMetaStoreClient.java:1031)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:156)
> at com.sun.proxy.$Proxy38.getAllDatabases(Unknown Source)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1234)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:174)
> at org.apache.hadoop.hive.ql.metadata.Hive.<init>(Hive.java:166)
> at 
> org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:170)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
> at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:258)
> at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:359)
> at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:263)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
> at 
> org.apache.spark.sql.hive.HiveSessionState.metadataHive$lzycompute(HiveSessionState.scala:43)
> at 
> org.apache.spark.sql.hive.HiveSessionState.metadataHive(HiveSessionState.scala:43)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:62)
> at 
> org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.main(HiveThriftServer2.scala:84)
> at 
> org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.main(HiveThriftServer2.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at 

[jira] [Commented] (SPARK-18641) Show databases NullPointerException while Sentry turned on

2016-12-01 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15713103#comment-15713103
 ] 

Dongjoon Hyun commented on SPARK-18641:
---

Thank you for reporting, [~zhangqw].
But I'm wondering whether this is really a Spark issue; it looks to me like an Apache Sentry issue.
{code}
16/11/30 10:54:50 WARN SentryMetaStoreFilterHook: Error getting DB list
java.lang.NullPointerException
at java.util.concurrent.ConcurrentHashMap.hash(ConcurrentHashMap.java:333)
at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:988)
at org.apache.hadoop.security.Groups.getGroups(Groups.java:162)
{code}

> Show databases NullPointerException while Sentry turned on
> --
>
> Key: SPARK-18641
> URL: https://issues.apache.org/jira/browse/SPARK-18641
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: CentOS 6.5 / Hive 1.1.0 / Sentry 1.5.1
>Reporter: zhangqw
>
> I've traced into the source code, and it seems that some required setting of 
> Sentry is not set when Spark SQL starts a session. This operation should be 
> done in org.apache.sentry.binding.hive.HiveAuthzBindingSessionHook, which is 
> not called in Spark SQL.
> Edit: I copied hive-site.xml (which turns on Sentry) and all Sentry jars into 
> Spark's classpath.
> Here is the stack:
> ===
> 16/11/30 10:54:50 WARN SentryMetaStoreFilterHook: Error getting DB list
> java.lang.NullPointerException
> at 
> java.util.concurrent.ConcurrentHashMap.hash(ConcurrentHashMap.java:333)
> at 
> java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:988)
> at org.apache.hadoop.security.Groups.getGroups(Groups.java:162)
> at 
> org.apache.sentry.provider.common.HadoopGroupMappingService.getGroups(HadoopGroupMappingService.java:60)
> at 
> org.apache.sentry.binding.hive.HiveAuthzBindingHook.getHiveBindingWithPrivilegeCache(HiveAuthzBindingHook.java:956)
> at 
> org.apache.sentry.binding.hive.HiveAuthzBindingHook.filterShowDatabases(HiveAuthzBindingHook.java:826)
> at 
> org.apache.sentry.binding.metastore.SentryMetaStoreFilterHook.filterDb(SentryMetaStoreFilterHook.java:131)
> at 
> org.apache.sentry.binding.metastore.SentryMetaStoreFilterHook.filterDatabases(SentryMetaStoreFilterHook.java:59)
> at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getAllDatabases(HiveMetaStoreClient.java:1031)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:156)
> at com.sun.proxy.$Proxy38.getAllDatabases(Unknown Source)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1234)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:174)
> at org.apache.hadoop.hive.ql.metadata.Hive.<init>(Hive.java:166)
> at 
> org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:170)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
> at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:258)
> at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:359)
> at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:263)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
> at 
> org.apache.spark.sql.hive.HiveSessionState.metadataHive$lzycompute(HiveSessionState.scala:43)
> at 
> org.apache.spark.sql.hive.HiveSessionState.metadataHive(HiveSessionState.scala:43)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:62)
> at 
> org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.main(HiveThriftServer2.scala:84)
> at 
> org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.main(HiveThriftServer2.scala)
>   

[jira] [Updated] (SPARK-18274) Memory leak in PySpark StringIndexer

2016-12-01 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-18274:
--
Shepherd: Joseph K. Bradley

> Memory leak in PySpark StringIndexer
> 
>
> Key: SPARK-18274
> URL: https://issues.apache.org/jira/browse/SPARK-18274
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 1.5.2, 1.6.3, 2.0.1, 2.0.2, 2.1.0
>Reporter: Jonas Amrich
>Assignee: Sandeep Singh
>Priority: Critical
>
> StringIndexerModel won't get collected by GC in Java even when deleted in 
> Python. It can be reproduced by this code, which fails after a couple of 
> iterations (around 7 if you set driver memory to 600MB):
> {code}
> import random, string
> from pyspark.ml.feature import StringIndexer
> l = [(''.join(random.choice(string.ascii_uppercase) for _ in range(10)), ) 
> for _ in range(int(7e5))]  # 700000 random strings of 10 characters
> df = spark.createDataFrame(l, ['string'])
> for i in range(50):
>     indexer = StringIndexer(inputCol='string', outputCol='index')
>     indexer.fit(df)
> {code}
> Explicit call to Python GC fixes the issue - following code runs fine:
> {code}
> import gc
> for i in range(50):
>     indexer = StringIndexer(inputCol='string', outputCol='index')
>     indexer.fit(df)
>     gc.collect()
> {code}
> The issue is similar to SPARK-6194 and can probably be fixed by calling JVM 
> detach in the model's destructor. This is implemented in 
> pyspark.mllib.common.JavaModelWrapper but missing in 
> pyspark.ml.wrapper.JavaWrapper. Other models in the ml package may also be 
> affected by this memory leak.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18274) Memory leak in PySpark StringIndexer

2016-12-01 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-18274:
--
Assignee: Sandeep Singh

> Memory leak in PySpark StringIndexer
> 
>
> Key: SPARK-18274
> URL: https://issues.apache.org/jira/browse/SPARK-18274
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 1.5.2, 1.6.3, 2.0.1, 2.0.2, 2.1.0
>Reporter: Jonas Amrich
>Assignee: Sandeep Singh
>Priority: Critical
>
> StringIndexerModel won't get collected by GC in Java even when deleted in 
> Python. It can be reproduced by this code, which fails after a couple of 
> iterations (around 7 if you set driver memory to 600MB):
> {code}
> import random, string
> from pyspark.ml.feature import StringIndexer
> l = [(''.join(random.choice(string.ascii_uppercase) for _ in range(10)), ) 
> for _ in range(int(7e5))]  # 700000 random strings of 10 characters
> df = spark.createDataFrame(l, ['string'])
> for i in range(50):
>     indexer = StringIndexer(inputCol='string', outputCol='index')
>     indexer.fit(df)
> {code}
> Explicit call to Python GC fixes the issue - following code runs fine:
> {code}
> import gc
> for i in range(50):
>     indexer = StringIndexer(inputCol='string', outputCol='index')
>     indexer.fit(df)
>     gc.collect()
> {code}
> The issue is similar to SPARK-6194 and can probably be fixed by calling JVM 
> detach in the model's destructor. This is implemented in 
> pyspark.mllib.common.JavaModelWrapper but missing in 
> pyspark.ml.wrapper.JavaWrapper. Other models in the ml package may also be 
> affected by this memory leak.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18642) Spark SQL: Catalyst is scanning undesired columns

2016-12-01 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15713081#comment-15713081
 ] 

Dongjoon Hyun commented on SPARK-18642:
---

Thank you for reporting, [~mohitgargk].
The behavior seems to be the same in Apache Spark 1.6.3 and to have been resolved since Apache Spark 2.0.0.
{code}
scala> val dfA = spark.read.parquet("/tmp/a")
scala> val dfB = spark.read.parquet("/tmp/b")
scala> dfA.createOrReplaceTempView("A")
scala> dfB.createOrReplaceTempView("B")
scala> sql("select A.*, B.* from A left join B on A.id = B.id where B.id<2").explain
== Physical Plan ==
*BroadcastHashJoin [id#0L], [id#3L], Inner, BuildRight
:- *Project [id#0L]
:  +- *Filter (isnotnull(id#0L) && (id#0L < 2))
:     +- *BatchedScan parquet [id#0L] Format: ParquetFormat, InputPaths: file:/tmp/a, PushedFilters: [IsNotNull(id), LessThan(id,2)], ReadSchema: struct<id:bigint>
+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, true]))
   +- *Project [id#3L]
      +- *Filter (isnotnull(id#3L) && (id#3L < 2))
         +- *BatchedScan parquet [id#3L] Format: ParquetFormat, InputPaths: file:/tmp/b, PushedFilters: [IsNotNull(id), LessThan(id,2)], ReadSchema: struct<id:bigint>
{code}

IMO, this will not be included in Spark 1.6.4 (if that release exists), since it is a performance issue.
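For completeness, a hedged PySpark rendition of the same check (the /tmp paths and tiny tables are assumptions, not from the report): with column pruning working, the physical plan's scan of table B should read only {{bid}}, not {{bVal}}.
{code}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Write two tiny parquet tables, then run the join from the report and inspect the plan.
spark.createDataFrame([(1, "a")], ["aid", "aVal"]).write.mode("overwrite").parquet("/tmp/ruleA")
spark.createDataFrame([(1, "b")], ["bid", "bVal"]).write.mode("overwrite").parquet("/tmp/ruleB")

spark.read.parquet("/tmp/ruleA").createOrReplaceTempView("A")
spark.read.parquet("/tmp/ruleB").createOrReplaceTempView("B")
spark.sql(
    "select A.aid, B.bid from A left join B on A.aid = B.bid where B.bid < 2"
).explain()
{code}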

> Spark SQL: Catalyst is scanning undesired columns
> -
>
> Key: SPARK-18642
> URL: https://issues.apache.org/jira/browse/SPARK-18642
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2
> Environment: Ubuntu 14.04
> Spark: Local Mode
>Reporter: Mohit
>  Labels: performance
>
> When doing a left-join between two tables, say A and B,  Catalyst has 
> information about the projection required for table B. Only the required 
> columns should be scanned.
> Code snippet below explains the scenario:
> scala> val dfA = sqlContext.read.parquet("/home/mohit/ruleA")
> dfA: org.apache.spark.sql.DataFrame = [aid: int, aVal: string]
> scala> val dfB = sqlContext.read.parquet("/home/mohit/ruleB")
> dfB: org.apache.spark.sql.DataFrame = [bid: int, bVal: string]
> scala> dfA.registerTempTable("A")
> scala> dfB.registerTempTable("B")
> scala> sqlContext.sql("select A.aid, B.bid from A left join B on A.aid=B.bid where B.bid<2").explain
> == Physical Plan ==
> Project [aid#15,bid#17]
> +- Filter (bid#17 < 2)
>    +- BroadcastHashOuterJoin [aid#15], [bid#17], LeftOuter, None
>       :- Scan ParquetRelation[aid#15,aVal#16] InputPaths: file:/home/mohit/ruleA
>       +- Scan ParquetRelation[bid#17,bVal#18] InputPaths: file:/home/mohit/ruleB
> This is a watered-down example from a production issue which has a huge 
> performance impact.
> External reference: 
> http://stackoverflow.com/questions/40783675/spark-sql-catalyst-is-scanning-undesired-columns



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18678) Skewed feature subsampling in Random forest

2016-12-01 Thread Bjoern Toldbod (JIRA)
Bjoern Toldbod created SPARK-18678:
--

 Summary: Skewed feature subsampling in Random forest
 Key: SPARK-18678
 URL: https://issues.apache.org/jira/browse/SPARK-18678
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 2.0.2
Reporter: Bjoern Toldbod


The feature subsampling in the RandomForest implementation 
(org.apache.spark.ml.tree.impl.RandomForest) is performed using 
SamplingUtils.reservoirSampleAndCount.

The sampling implementation skews feature selection in favor of features with a 
higher index. The skew is small for a large number of features, but it 
completely dominates the feature selection for a small number of features. The 
extreme case is when the number of features is 2 and the number of features to 
select is 1: the feature sampling then always picks feature 1 and ignores 
feature 0.

Of course this produces low-quality models when subsampling over few features.
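For reference, a minimal sketch of plain (unbiased) reservoir sampling, not Spark's SamplingUtils, that can serve as a baseline when checking the reported skew: with 2 features and 1 to select, an unbiased sampler picks each feature index about half the time, which is exactly the property the report says the current implementation violates.
{code}
import random
from collections import Counter

def reservoir_sample(items, k, rng):
    """Standard Algorithm R: an unbiased reservoir sample of k items."""
    reservoir = list(items[:k])
    for i in range(k, len(items)):
        j = rng.randint(0, i)        # uniform over 0..i inclusive
        if j < k:
            reservoir[j] = items[i]
    return reservoir

rng = random.Random(42)
counts = Counter(reservoir_sample([0, 1], k=1, rng=rng)[0] for _ in range(10000))
print(counts)   # expected: roughly 5000 picks of each feature index
{code}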



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18677) Json path implementation fails to parse ['key']

2016-12-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18677:


Assignee: (was: Apache Spark)

> Json path implementation fails to parse ['key']
> ---
>
> Key: SPARK-18677
> URL: https://issues.apache.org/jira/browse/SPARK-18677
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Ryan Blue
>
> The current JSON path parser fails to parse expressions like ['key'], the 
> quoted bracket notation that is used for names containing spaces.
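For illustration only (not from the ticket), a minimal PySpark example of the bracket-notation path in question; the JSON document and field name are hypothetical. On affected versions, the description implies the path is rejected, so the result would be null rather than 1.
{code}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([('{"a key": 1}',)], ["json"])

# Bracket notation is the only way to address a field whose name contains a space.
df.select(F.get_json_object("json", "$['a key']").alias("value")).show()
{code}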



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


