[jira] [Assigned] (SPARK-19751) Create Data frame API fails with a self referencing bean
[ https://issues.apache.org/jira/browse/SPARK-19751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19751:

    Assignee: (was: Apache Spark)

> Create Data frame API fails with a self referencing bean
> ---------------------------------------------------------
>
>                 Key: SPARK-19751
>                 URL: https://issues.apache.org/jira/browse/SPARK-19751
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Avinash Venkateshaiah
>            Priority: Minor
>
> The createDataset API throws a stack overflow exception when we try to create a Dataset using a bean encoder and the bean is self-referencing.
>
> BEAN:
> public class HierObj implements Serializable {
>     String name;
>     List<HierObj> children;
>     public String getName() {
>         return name;
>     }
>     public void setName(String name) {
>         this.name = name;
>     }
>     public List<HierObj> getChildren() {
>         return children;
>     }
>     public void setChildren(List<HierObj> children) {
>         this.children = children;
>     }
> }
>
> // create an object
> HierObj hierObj = new HierObj();
> hierObj.setName("parent");
> List<HierObj> children = new ArrayList<>();
> HierObj child1 = new HierObj();
> child1.setName("child1");
> HierObj child2 = new HierObj();
> child2.setName("child2");
> children.add(child1);
> children.add(child2);
> hierObj.setChildren(children);
>
> // create a dataset
> Dataset<HierObj> ds = sparkSession().createDataset(Arrays.asList(hierObj),
>     Encoders.bean(HierObj.class));

-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19751) Create Data frame API fails with a self referencing bean
[ https://issues.apache.org/jira/browse/SPARK-19751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19751: Assignee: Apache Spark > Create Data frame API fails with a self referencing bean > > > Key: SPARK-19751 > URL: https://issues.apache.org/jira/browse/SPARK-19751 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Avinash Venkateshaiah >Assignee: Apache Spark >Priority: Minor > > createDataset API throws a stack overflow exception when we try creating a > Dataset using a bean encoder. The bean is self referencing > BEAN: > public class HierObj implements Serializable { > String name; > List children; > public String getName() { > return name; > } > public void setName(String name) { > this.name = name; > } > public List getChildren() { > return children; > } > public void setChildren(List children) { > this.children = children; > } > } > // create an object > HierObj hierObj = new HierObj(); > hierObj.setName("parent"); > List children = new ArrayList(); > HierObj child1 = new HierObj(); > child1.setName("child1"); > HierObj child2 = new HierObj(); > child2.setName("child2"); > children.add(child1); > children.add(child2); > hierObj.setChildren(children); > // create a dataset > Dataset ds = sparkSession().createDataset(Arrays.asList(hierObj), > Encoders.bean(HierObj.class)); -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19751) Create Data frame API fails with a self referencing bean
[ https://issues.apache.org/jira/browse/SPARK-19751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15898967#comment-15898967 ] Apache Spark commented on SPARK-19751: -- User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/17188 > Create Data frame API fails with a self referencing bean > > > Key: SPARK-19751 > URL: https://issues.apache.org/jira/browse/SPARK-19751 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Avinash Venkateshaiah >Priority: Minor > > createDataset API throws a stack overflow exception when we try creating a > Dataset using a bean encoder. The bean is self referencing > BEAN: > public class HierObj implements Serializable { > String name; > List children; > public String getName() { > return name; > } > public void setName(String name) { > this.name = name; > } > public List getChildren() { > return children; > } > public void setChildren(List children) { > this.children = children; > } > } > // create an object > HierObj hierObj = new HierObj(); > hierObj.setName("parent"); > List children = new ArrayList(); > HierObj child1 = new HierObj(); > child1.setName("child1"); > HierObj child2 = new HierObj(); > child2.setName("child2"); > children.add(child1); > children.add(child2); > hierObj.setChildren(children); > // create a dataset > Dataset ds = sparkSession().createDataset(Arrays.asList(hierObj), > Encoders.bean(HierObj.class)); -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19848) Regex Support in StopWordsRemover
Mohd Suaib Danish created SPARK-19848: - Summary: Regex Support in StopWordsRemover Key: SPARK-19848 URL: https://issues.apache.org/jira/browse/SPARK-19848 Project: Spark Issue Type: Wish Components: ML Affects Versions: 2.1.0 Reporter: Mohd Suaib Danish Can we have regex feature in StopWordsRemover in addition to the provided list of stop words? Use cases can be following: 1. Remove all single or double letter words- [a-zA-Z]{1,2} 2. Remove anything starting with abc - ^abc -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19848) Regex Support in StopWordsRemover
[ https://issues.apache.org/jira/browse/SPARK-19848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mohd Suaib Danish updated SPARK-19848: -- Description: Can we have regex feature in StopWordsRemover in addition to the provided list of stop words? Use cases can be following: 1. Remove all single or double letter words- [a-zA-Z]{1,2} 2. Remove anything starting with abc - ^abc was: Can we have regex feature in StopWordsRemover in addition to the provided list of stop words? Use cases can be following: 1. Remove all single or double letter words- [a-zA-Z]{1,2} 2. Remove anything starting with abc - ^abc > Regex Support in StopWordsRemover > - > > Key: SPARK-19848 > URL: https://issues.apache.org/jira/browse/SPARK-19848 > Project: Spark > Issue Type: Wish > Components: ML >Affects Versions: 2.1.0 >Reporter: Mohd Suaib Danish > > Can we have regex feature in StopWordsRemover in addition to the provided > list of stop words? > Use cases can be following: > 1. Remove all single or double letter words- [a-zA-Z]{1,2} > 2. Remove anything starting with abc - ^abc -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19829) The log about driver should support rolling like executor
[ https://issues.apache.org/jira/browse/SPARK-19829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15899228#comment-15899228 ] Sean Owen commented on SPARK-19829:
-----------------------------------

Why wouldn't log4j be a good solution here? Its purpose is managing logs, and it can define rolling appenders.

> The log about driver should support rolling like executor
> ----------------------------------------------------------
>
>                 Key: SPARK-19829
>                 URL: https://issues.apache.org/jira/browse/SPARK-19829
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.0.0
>            Reporter: hustfxj
>            Priority: Minor
>
> We should roll the driver's log; otherwise the log may grow very large.
> {code:title=DriverRunner.scala|borderStyle=solid}
> // modify runDriver
> private def runDriver(builder: ProcessBuilder, baseDir: File, supervise: Boolean): Int = {
>   builder.directory(baseDir)
>   def initialize(process: Process): Unit = {
>     // Redirect stdout and stderr to files -- the old code
>     // val stdout = new File(baseDir, "stdout")
>     // CommandUtils.redirectStream(process.getInputStream, stdout)
>     //
>     // val stderr = new File(baseDir, "stderr")
>     // val formattedCommand = builder.command.asScala.mkString("\"", "\" \"", "\"")
>     // val header = "Launch Command: %s\n%s\n\n".format(formattedCommand, "=" * 40)
>     // Files.append(header, stderr, StandardCharsets.UTF_8)
>     // CommandUtils.redirectStream(process.getErrorStream, stderr)
>
>     // Redirect its stdout and stderr to files -- support rolling
>     val stdout = new File(baseDir, "stdout")
>     stdoutAppender = FileAppender(process.getInputStream, stdout, conf)
>
>     val stderr = new File(baseDir, "stderr")
>     val formattedCommand = builder.command.asScala.mkString("\"", "\" \"", "\"")
>     val header = "Launch Command: %s\n%s\n\n".format(formattedCommand, "=" * 40)
>     Files.append(header, stderr, StandardCharsets.UTF_8)
>     stderrAppender = FileAppender(process.getErrorStream, stderr, conf)
>   }
>   runCommandWithRetry(ProcessBuilderLike(builder), initialize, supervise)
> }
> {code}

-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
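[Editor's note] For reference, Sean's suggestion amounts to pointing the driver at a log4j configuration that uses a rolling appender. A minimal sketch of such a conf/log4j.properties, assuming the log4j 1.x bundled with Spark; the file path, size limit and backup count are placeholders:

{code}
# Roll the driver log by size instead of letting it grow unbounded
log4j.rootCategory=INFO, rolling
log4j.appender.rolling=org.apache.log4j.RollingFileAppender
log4j.appender.rolling.File=/var/log/spark/driver.log
log4j.appender.rolling.MaxFileSize=50MB
log4j.appender.rolling.MaxBackupIndex=5
log4j.appender.rolling.layout=org.apache.log4j.PatternLayout
log4j.appender.rolling.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
{code}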
[jira] [Updated] (SPARK-19810) Remove support for Scala 2.10
[ https://issues.apache.org/jira/browse/SPARK-19810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-19810: -- Target Version/s: 2.3.0 > Remove support for Scala 2.10 > - > > Key: SPARK-19810 > URL: https://issues.apache.org/jira/browse/SPARK-19810 > Project: Spark > Issue Type: Task > Components: ML, Spark Core, SQL >Affects Versions: 2.1.0 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Critical > > This tracks the removal of Scala 2.10 support, as discussed in > http://apache-spark-developers-list.1001551.n3.nabble.com/Straw-poll-dropping-support-for-things-like-Scala-2-10-td19553.html > and other lists. > The primary motivations are to simplify the code and build, and to enable > Scala 2.12 support later. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14220) Build and test Spark against Scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-14220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-14220: -- Affects Version/s: 2.1.0 Target Version/s: 2.3.0 > Build and test Spark against Scala 2.12 > --- > > Key: SPARK-14220 > URL: https://issues.apache.org/jira/browse/SPARK-14220 > Project: Spark > Issue Type: Umbrella > Components: Build, Project Infra >Affects Versions: 2.1.0 >Reporter: Josh Rosen >Priority: Blocker > > This umbrella JIRA tracks the requirements for building and testing Spark > against the current Scala 2.12 milestone. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19848) Regex Support in StopWordsRemover
[ https://issues.apache.org/jira/browse/SPARK-19848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-19848: -- Priority: Minor (was: Major) Issue Type: Improvement (was: Wish) If it's more of a question, ask on the mailing list. It's not unreasonable, but what's the use case? stopwords are, well, words and not classes of words. You can always filter text in your own code as you like, and this is not a general-purpose tool for removing strings. > Regex Support in StopWordsRemover > - > > Key: SPARK-19848 > URL: https://issues.apache.org/jira/browse/SPARK-19848 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.1.0 >Reporter: Mohd Suaib Danish >Priority: Minor > > Can we have regex feature in StopWordsRemover in addition to the provided > list of stop words? > Use cases can be following: > 1. Remove all single or double letter words- [a-zA-Z]{1,2} > 2. Remove anything starting with abc - ^abc -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
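[Editor's note] To make the "filter text in your own code" route concrete, a minimal Scala sketch using a UDF over a tokenized column; the sample data, column name and patterns are hypothetical, and an active SparkSession named {{spark}} (as in spark-shell) is assumed:

{code}
import org.apache.spark.sql.functions.{col, udf}
import spark.implicits._  // assumes an active SparkSession named `spark`

// A toy tokenized DataFrame standing in for the output of Tokenizer / StopWordsRemover.
val tokenized = Seq(Seq("an", "abcdef", "example", "of", "token", "filtering")).toDF("tokens")

// Drop tokens matching any unwanted pattern, e.g. one- or two-letter words
// and anything starting with "abc".
val patterns = Seq("^[a-zA-Z]{1,2}$", "^abc.*")
val dropByRegex = udf { tokens: Seq[String] =>
  tokens.filterNot(t => patterns.exists(p => t.matches(p)))
}

tokenized.withColumn("tokens", dropByRegex(col("tokens"))).show(false)
// expected tokens: example, token, filtering
{code}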
[jira] [Comment Edited] (SPARK-19848) Regex Support in StopWordsRemover
[ https://issues.apache.org/jira/browse/SPARK-19848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15899250#comment-15899250 ] Nick Pentreath edited comment on SPARK-19848 at 3/7/17 11:06 AM: - This seems more like a "token filter" in a sense. Perhaps the ML pipeline components mentioned [here|https://lucidworks.com/2016/04/13/spark-solr-lucenetextanalyzer/] may be of use? was (Author: mlnick): Perhaps the ML pipeline components mentioned [here|https://lucidworks.com/2016/04/13/spark-solr-lucenetextanalyzer/] may be of use? > Regex Support in StopWordsRemover > - > > Key: SPARK-19848 > URL: https://issues.apache.org/jira/browse/SPARK-19848 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.1.0 >Reporter: Mohd Suaib Danish >Priority: Minor > > Can we have regex feature in StopWordsRemover in addition to the provided > list of stop words? > Use cases can be following: > 1. Remove all single or double letter words- [a-zA-Z]{1,2} > 2. Remove anything starting with abc - ^abc -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19848) Regex Support in StopWordsRemover
[ https://issues.apache.org/jira/browse/SPARK-19848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15899250#comment-15899250 ] Nick Pentreath commented on SPARK-19848: Perhaps the ML pipeline components mentioned [here|https://lucidworks.com/2016/04/13/spark-solr-lucenetextanalyzer/] may be of use? > Regex Support in StopWordsRemover > - > > Key: SPARK-19848 > URL: https://issues.apache.org/jira/browse/SPARK-19848 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.1.0 >Reporter: Mohd Suaib Danish >Priority: Minor > > Can we have regex feature in StopWordsRemover in addition to the provided > list of stop words? > Use cases can be following: > 1. Remove all single or double letter words- [a-zA-Z]{1,2} > 2. Remove anything starting with abc - ^abc -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19836) Customizable remote repository url for hive versions unit test
[ https://issues.apache.org/jira/browse/SPARK-19836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15899292#comment-15899292 ] Song Jun commented on SPARK-19836:
----------------------------------

I have done something similar here: https://github.com/apache/spark/pull/16803

> Customizable remote repository url for hive versions unit test
> ---------------------------------------------------------------
>
>                 Key: SPARK-19836
>                 URL: https://issues.apache.org/jira/browse/SPARK-19836
>             Project: Spark
>          Issue Type: Bug
>          Components: Tests
>    Affects Versions: 2.1.0
>            Reporter: Elek, Marton
>              Labels: ivy, unittest
>
> When the VersionSuite test runs from sql/hive it downloads different versions of Hive.
> Unfortunately the IsolatedClientClassloader (which is used by the VersionSuite) uses hardcoded, fixed repositories:
> {code}
> val classpath = quietly {
>   SparkSubmitUtils.resolveMavenCoordinates(
>     hiveArtifacts.mkString(","),
>     SparkSubmitUtils.buildIvySettings(
>       Some("http://www.datanucleus.org/downloads/maven2"),
>       ivyPath),
>     exclusions = version.exclusions)
> }
> {code}
> The problem is with the hard-coded repositories:
> 1. it's hard to run unit tests in an environment where only one internal maven repository is available (and central/datanucleus is not)
> 2. it's impossible to run unit tests against custom built hive/hadoop artifacts (which are not available from the central repository)
> VersionSuite already has a specific SPARK_VERSIONS_SUITE_IVY_PATH environment variable to define a custom local repository as the ivy cache.
> I suggest adding an additional environment variable (SPARK_VERSIONS_SUITE_IVY_REPOSITORIES to HiveClientBuilder.scala) to make it possible to add new remote repositories for testing the different hive versions.

-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19836) Customizable remote repository url for hive versions unit test
[ https://issues.apache.org/jira/browse/SPARK-19836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-19836. --- Resolution: Duplicate > Customizable remote repository url for hive versions unit test > -- > > Key: SPARK-19836 > URL: https://issues.apache.org/jira/browse/SPARK-19836 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.1.0 >Reporter: Elek, Marton > Labels: ivy, unittest > > When the VersionSuite test runs from sql/hive it downloads different versions > from hive. > Unfortunately the IsolatedClientClassloader (which is used by the > VersionSuite) uses hardcoded fix repositories: > {code} > val classpath = quietly { > SparkSubmitUtils.resolveMavenCoordinates( > hiveArtifacts.mkString(","), > SparkSubmitUtils.buildIvySettings( > Some("http://www.datanucleus.org/downloads/maven2";), > ivyPath), > exclusions = version.exclusions) > } > {code} > The problem is with the hard-coded repositories: > 1. it's hard to run unit tests in an environment where only one internal > maven repository is available (and central/datanucleus is not) > 2. it's impossible to run unit tests against custom built hive/hadoop > artifacts (which are not available from the central repository) > VersionSuite has already a specific SPARK_VERSIONS_SUITE_IVY_PATH environment > variable to define a custom local repository as ivy cache. > I suggest to add an additional environment variable > (SPARK_VERSIONS_SUITE_IVY_REPOSITORIES to the HiveClientBuilder.scala), to > make it possible adding new remote repositories for testing the different > hive versions. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19831) Sending the heartbeat master from worker maybe blocked by other rpc messages
[ https://issues.apache.org/jira/browse/SPARK-19831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19831:

    Assignee: Apache Spark

> Sending the heartbeat master from worker maybe blocked by other rpc messages
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-19831
>                 URL: https://issues.apache.org/jira/browse/SPARK-19831
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.2.0
>            Reporter: hustfxj
>            Assignee: Apache Spark
>            Priority: Minor
>
> Cleaning up an application may take a long time on the worker, and it blocks the worker from sending heartbeats to the master because the worker extends *ThreadSafeRpcEndpoint*. If the heartbeat from a worker is blocked by the *ApplicationFinished* message, the master will think the worker is dead, and if the worker has a driver, the driver will be scheduled by the master again. So I think this is a bug in Spark. The problem could be solved in the following ways:
> 1. It would be better to do the application cleanup in a separate asynchronous thread like 'cleanupThreadExecutor', so that it does not block other rpc messages such as *SendHeartbeat*.
> 2. It would be better not to trigger the heartbeat to the master from the *receive* method, because any other rpc message may block *receive* and the worker may then fail to send the heartbeat in time. Instead, the heartbeat to the master should be sent from an asynchronous timer thread.

-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
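[Editor's note] A self-contained Scala sketch of suggestion 2 above: drive the heartbeat from a dedicated scheduler thread instead of the endpoint's single-threaded message loop. The {{sendHeartbeat}} callback and the names in the usage comment are stand-ins for the Worker's actual internals, not the real implementation:

{code}
import java.util.concurrent.{Executors, TimeUnit}

object HeartbeatSketch {
  // One dedicated thread, independent of the RPC message loop, so a slow
  // handler (e.g. ApplicationFinished cleanup) cannot delay heartbeats.
  private val heartbeater = Executors.newSingleThreadScheduledExecutor()

  def start(intervalMs: Long)(sendHeartbeat: () => Unit): Unit = {
    heartbeater.scheduleAtFixedRate(new Runnable {
      override def run(): Unit = sendHeartbeat()
    }, 0L, intervalMs, TimeUnit.MILLISECONDS)
  }
}

// Hypothetical usage inside the Worker:
// HeartbeatSketch.start(heartbeatMillis)(() => master.send(Heartbeat(workerId, self)))
{code}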
[jira] [Assigned] (SPARK-19831) Sending the heartbeat master from worker maybe blocked by other rpc messages
[ https://issues.apache.org/jira/browse/SPARK-19831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19831: Assignee: (was: Apache Spark) > Sending the heartbeat master from worker maybe blocked by other rpc messages > -- > > Key: SPARK-19831 > URL: https://issues.apache.org/jira/browse/SPARK-19831 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: hustfxj >Priority: Minor > > Cleaning the application may cost much time at worker, then it will block > that the worker send heartbeats master because the worker is extend > *ThreadSafeRpcEndpoint*. If the heartbeat from a worker is blocked by the > message *ApplicationFinished*, master will think the worker is dead. If the > worker has a driver, the driver will be scheduled by master again. So I think > it is the bug on spark. It may solve this problem by the followed suggests: > 1. It had better put the cleaning the application in a single asynchronous > thread like 'cleanupThreadExecutor'. Thus it won't block other rpc messages > like *SendHeartbeat*; > 2. It had better not receive the heartbeat master by *receive* method. > Because any other rpc message may block the *receive* method. Then worker > won't receive the heartbeat message timely. So it had better send the > heartbeat master at an asynchronous timing thread . -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19831) Sending the heartbeat master from worker maybe blocked by other rpc messages
[ https://issues.apache.org/jira/browse/SPARK-19831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15899334#comment-15899334 ] Apache Spark commented on SPARK-19831: -- User 'hustfxj' has created a pull request for this issue: https://github.com/apache/spark/pull/17189 > Sending the heartbeat master from worker maybe blocked by other rpc messages > -- > > Key: SPARK-19831 > URL: https://issues.apache.org/jira/browse/SPARK-19831 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: hustfxj >Priority: Minor > > Cleaning the application may cost much time at worker, then it will block > that the worker send heartbeats master because the worker is extend > *ThreadSafeRpcEndpoint*. If the heartbeat from a worker is blocked by the > message *ApplicationFinished*, master will think the worker is dead. If the > worker has a driver, the driver will be scheduled by master again. So I think > it is the bug on spark. It may solve this problem by the followed suggests: > 1. It had better put the cleaning the application in a single asynchronous > thread like 'cleanupThreadExecutor'. Thus it won't block other rpc messages > like *SendHeartbeat*; > 2. It had better not receive the heartbeat master by *receive* method. > Because any other rpc message may block the *receive* method. Then worker > won't receive the heartbeat message timely. So it had better send the > heartbeat master at an asynchronous timing thread . -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19478) JDBC Sink
[ https://issues.apache.org/jira/browse/SPARK-19478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19478:

    Assignee: (was: Apache Spark)

> JDBC Sink
> ---------
>
>                 Key: SPARK-19478
>                 URL: https://issues.apache.org/jira/browse/SPARK-19478
>             Project: Spark
>          Issue Type: New Feature
>          Components: Structured Streaming
>    Affects Versions: 2.0.0
>            Reporter: Michael Armbrust
>
> A sink that transactionally commits data into a database using JDBC.

-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19478) JDBC Sink
[ https://issues.apache.org/jira/browse/SPARK-19478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19478: Assignee: Apache Spark > JDBC Sink > - > > Key: SPARK-19478 > URL: https://issues.apache.org/jira/browse/SPARK-19478 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 2.0.0 >Reporter: Michael Armbrust >Assignee: Apache Spark > > A sink that transactionally commits data into a database use JDBC. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19478) JDBC Sink
[ https://issues.apache.org/jira/browse/SPARK-19478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15899458#comment-15899458 ] Apache Spark commented on SPARK-19478: -- User 'GaalDornick' has created a pull request for this issue: https://github.com/apache/spark/pull/17190 > JDBC Sink > - > > Key: SPARK-19478 > URL: https://issues.apache.org/jira/browse/SPARK-19478 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 2.0.0 >Reporter: Michael Armbrust > > A sink that transactionally commits data into a database use JDBC. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
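[Editor's note] Until a built-in sink lands, one common workaround is a {{ForeachWriter}} that writes each row over JDBC. A rough, non-transactional Scala sketch (so it does not satisfy the exactly-once goal of this ticket); the URL, table name and two-column INSERT are placeholders:

{code}
import java.sql.{Connection, DriverManager, PreparedStatement}

import org.apache.spark.sql.{ForeachWriter, Row}

class JdbcSinkWriter(url: String, table: String) extends ForeachWriter[Row] {
  private var conn: Connection = _
  private var stmt: PreparedStatement = _

  override def open(partitionId: Long, version: Long): Boolean = {
    conn = DriverManager.getConnection(url)
    stmt = conn.prepareStatement(s"INSERT INTO $table VALUES (?, ?)")
    true  // process this partition for this epoch
  }

  override def process(row: Row): Unit = {
    stmt.setObject(1, row.get(0))
    stmt.setObject(2, row.get(1))
    stmt.executeUpdate()
  }

  override def close(errorOrNull: Throwable): Unit = {
    if (stmt != null) stmt.close()
    if (conn != null) conn.close()
  }
}

// Hypothetical usage:
// streamingDF.writeStream.foreach(new JdbcSinkWriter(jdbcUrl, "events")).start()
{code}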
[jira] [Created] (SPARK-19849) Support ArrayType in to_json function/expression
Hyukjin Kwon created SPARK-19849:
------------------------------------

             Summary: Support ArrayType in to_json function/expression
                 Key: SPARK-19849
                 URL: https://issues.apache.org/jira/browse/SPARK-19849
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.2.0
            Reporter: Hyukjin Kwon

After SPARK-19595, we could

{code}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val schema = ArrayType(StructType(StructField("a", IntegerType) :: Nil))
Seq(("""[{"a": 1}, {"a": 2}]""")).toDF("array").select(from_json(col("array"), schema)).show()
{code}

Maybe, it'd be better if we provide a way to read it back via {{to_json}} as below:

{code}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val schema = ArrayType(StructType(StructField("a", IntegerType) :: Nil))
val df = Seq("""[{"a":1}, {"a":2}]""").toDF("json").select(from_json($"json", schema).as("array"))
df.show()

// Read back.
df.select(to_json($"array").as("json")).show()
{code}

{code}
+----------+
|     array|
+----------+
|[[1], [2]]|
+----------+

+-----------------+
|             json|
+-----------------+
|[{"a":1},{"a":2}]|
+-----------------+
{code}

Currently, it throws an exception as below:

{code}
org.apache.spark.sql.AnalysisException: cannot resolve 'structtojson(`array`)' due to data type mismatch: structtojson requires that the expression is a struct expression.;;
'Project [structtojson(array#30, Some(Asia/Seoul)) AS structtojson(array)#45]
+- Project [jsontostruct(ArrayType(StructType(StructField(a,IntegerType,true)),true), array#27, Some(Asia/Seoul)) AS array#30]
   +- Project [value#25 AS array#27]
      +- LocalRelation [value#25]

  at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:80)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:72)
{code}

-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19850) Support aliased expressions in function parameters
Herman van Hovell created SPARK-19850:
--------------------------------------

             Summary: Support aliased expressions in function parameters
                 Key: SPARK-19850
                 URL: https://issues.apache.org/jira/browse/SPARK-19850
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.1.0
            Reporter: Herman van Hovell
            Assignee: Herman van Hovell

The SQL parser currently does not allow a user to pass an aliased expression as a function parameter. This can be useful if we want to create a struct. For example, {{select struct(a + 1 as c, b + 4 as d) from tbl_x}} would create a struct with columns {{c}} and {{d}} instead of {{col1}} and {{col2}}.

-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
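[Editor's note] For context, the DataFrame API can already produce this result, since {{struct}} picks up Column aliases as field names. A minimal sketch, assuming an active SparkSession named {{spark}}; the sample data is made up:

{code}
import org.apache.spark.sql.functions.struct
import spark.implicits._  // assumes an active SparkSession named `spark`

val tbl_x = Seq((1, 2), (3, 4)).toDF("a", "b")

// Aliased expressions inside struct() become the field names,
// yielding struct<c:int, d:int> rather than col1/col2.
val withStruct = tbl_x.select(struct(($"a" + 1).as("c"), ($"b" + 4).as("d")).as("s"))
withStruct.printSchema()
{code}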
[jira] [Updated] (SPARK-19850) Support aliased expressions in function parameters
[ https://issues.apache.org/jira/browse/SPARK-19850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell updated SPARK-19850: -- Issue Type: Improvement (was: Bug) > Support aliased expressions in function parameters > -- > > Key: SPARK-19850 > URL: https://issues.apache.org/jira/browse/SPARK-19850 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Herman van Hovell >Assignee: Herman van Hovell > > The SQL parser currently does not allow a user to pass an aliased expression > as function parameter. This can be useful if we want to create a struct. For > example {{select struct(a + 1 as c, b + 4 as d) from tbl_x}} would create a > struct with columns {{c}} and {{d}}, instead {{col1}} and {{col2}}. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14471) The alias created in SELECT could be used in GROUP BY and followed expressions
[ https://issues.apache.org/jira/browse/SPARK-14471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15899584#comment-15899584 ] Apache Spark commented on SPARK-14471: -- User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/17191 > The alias created in SELECT could be used in GROUP BY and followed expressions > -- > > Key: SPARK-14471 > URL: https://issues.apache.org/jira/browse/SPARK-14471 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Davies Liu > > This query should be able to run: > {code} > select a a1, a1 + 1 as b, count(1) from t group by a1 > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14471) The alias created in SELECT could be used in GROUP BY and followed expressions
[ https://issues.apache.org/jira/browse/SPARK-14471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14471: Assignee: (was: Apache Spark) > The alias created in SELECT could be used in GROUP BY and followed expressions > -- > > Key: SPARK-14471 > URL: https://issues.apache.org/jira/browse/SPARK-14471 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Davies Liu > > This query should be able to run: > {code} > select a a1, a1 + 1 as b, count(1) from t group by a1 > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14471) The alias created in SELECT could be used in GROUP BY and followed expressions
[ https://issues.apache.org/jira/browse/SPARK-14471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14471: Assignee: Apache Spark > The alias created in SELECT could be used in GROUP BY and followed expressions > -- > > Key: SPARK-14471 > URL: https://issues.apache.org/jira/browse/SPARK-14471 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Davies Liu >Assignee: Apache Spark > > This query should be able to run: > {code} > select a a1, a1 + 1 as b, count(1) from t group by a1 > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15463) Support for creating a dataframe from CSV in Dataset[String]
[ https://issues.apache.org/jira/browse/SPARK-15463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15899665#comment-15899665 ] Jayesh lalwani commented on SPARK-15463:
----------------------------------------

Does it make sense to have a to_csv and a from_csv function that is modeled after to_json and from_json?

The applications that we are supporting need inputs from a combination of sources and formats, and write to a combination of sinks and formats. For example, we might need
a) Files with CSV content
b) Files with JSON content
c) Kafka with CSV content
d) Kafka with JSON content
e) Parquet

Also, if the input has a nested structure (JSON/Parquet), sometimes we prefer keeping the data in a StructType object, and sometimes we prefer to flatten the StructType object into a dataframe. For example, if we are getting data from Kafka as JSON, massaging it, and writing JSON back to Kafka, we would prefer to be able to transform a StructType object and not have to flatten it into a dataframe. Another example is that we get data from JSON that needs to be stored in an RDBMS database, which requires us to flatten the data into a dataframe before storing it into the table.

So, this is what I was thinking. We should have the following functions:
1) from_json - Convert a Dataframe of String to a DataFrame of StructType
2) to_json - Convert a Dataframe of StructType to a Dataframe of String
3) from_csv - Convert a Dataframe of String to a DataFrame of StructType
4) to_csv - Convert a DataFrame of StructType to a DataFrame of String
5) flatten - Convert a DataFrame of StructType into a DataFrame that has the same fields as the StructType

Essentially, the request in the Change Request can be done by calling *flatten(from_csv())*.

> Support for creating a dataframe from CSV in Dataset[String]
> -------------------------------------------------------------
>
>                 Key: SPARK-15463
>                 URL: https://issues.apache.org/jira/browse/SPARK-15463
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: PJ Fanning
>
> I currently use Databricks' spark-csv lib but some features don't work with Apache Spark 2.0.0-SNAPSHOT. I understand that with the addition of CSV support into spark-sql directly, spark-csv won't be modified.
> I currently read some CSV data that has been pre-processed and is in RDD[String] format.
> There is sqlContext.read.json(rdd: RDD[String]) but other formats don't appear to support the creation of DataFrames based on loading from RDD[String].

-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
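[Editor's note] As a side note on item 5 above, the "flatten" step can already be approximated once the data sits in a struct column, via star expansion. A small Scala sketch assuming an active SparkSession named {{spark}}; the JSON sample and column names are made up:

{code}
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import spark.implicits._  // assumes an active SparkSession named `spark`

val raw = Seq("""{"a":1,"b":"x"}""", """{"a":2,"b":"y"}""").toDF("json")
val schema = StructType(StructField("a", IntegerType) :: StructField("b", StringType) :: Nil)

// Parse each JSON string into a single struct column...
val parsed = raw.select(from_json($"json", schema).as("parsed"))

// ...then "flatten" it: star expansion promotes the struct's fields
// to top-level columns `a` and `b`.
val flat = parsed.select($"parsed.*")
flat.printSchema()
{code}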
[jira] [Assigned] (SPARK-19849) Support ArrayType in to_json function/expression
[ https://issues.apache.org/jira/browse/SPARK-19849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19849: Assignee: (was: Apache Spark) > Support ArrayType in to_json function/expression > > > Key: SPARK-19849 > URL: https://issues.apache.org/jira/browse/SPARK-19849 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Hyukjin Kwon > > After SPARK-19595, we could > {code} > import org.apache.spark.sql.functions._ > import org.apache.spark.sql.types._ > val schema = ArrayType(StructType(StructField("a", IntegerType) :: Nil)) > Seq(("""[{"a": 1}, {"a": > 2}]""")).toDF("array").select(from_json(col("array"), schema)).show() > {code} > Maybe, it'd be better if we provide a way to read it back via {{to_json}} as > below: > {code} > import org.apache.spark.sql.functions._ > import org.apache.spark.sql.types._ > val schema = ArrayType(StructType(StructField("a", IntegerType) :: Nil)) > val df = Seq("""[{"a":1}, {"a":2}]""").toDF("json").select(from_json($"json", > schema).as("array")) > df.show() > // Read back. > df.select(to_json($"array").as("json")).show() > {code} > {code} > +--+ > | array| > +--+ > |[[1], [2]]| > +--+ > +-+ > | json| > +-+ > |[{"a":1},{"a":2}]| > +-+ > {code} > Currently, it throws an exception as below: > {code} > org.apache.spark.sql.AnalysisException: cannot resolve > 'structtojson(`array`)' due to data type mismatch: structtojson requires that > the expression is a struct expression.;; > 'Project [structtojson(array#30, Some(Asia/Seoul)) AS structtojson(array)#45] > +- Project > [jsontostruct(ArrayType(StructType(StructField(a,IntegerType,true)),true), > array#27, Some(Asia/Seoul)) AS array#30] >+- Project [value#25 AS array#27] > +- LocalRelation [value#25] > at > org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:80) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:72) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19849) Support ArrayType in to_json function/expression
[ https://issues.apache.org/jira/browse/SPARK-19849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15899705#comment-15899705 ] Apache Spark commented on SPARK-19849: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/17192 > Support ArrayType in to_json function/expression > > > Key: SPARK-19849 > URL: https://issues.apache.org/jira/browse/SPARK-19849 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Hyukjin Kwon > > After SPARK-19595, we could > {code} > import org.apache.spark.sql.functions._ > import org.apache.spark.sql.types._ > val schema = ArrayType(StructType(StructField("a", IntegerType) :: Nil)) > Seq(("""[{"a": 1}, {"a": > 2}]""")).toDF("array").select(from_json(col("array"), schema)).show() > {code} > Maybe, it'd be better if we provide a way to read it back via {{to_json}} as > below: > {code} > import org.apache.spark.sql.functions._ > import org.apache.spark.sql.types._ > val schema = ArrayType(StructType(StructField("a", IntegerType) :: Nil)) > val df = Seq("""[{"a":1}, {"a":2}]""").toDF("json").select(from_json($"json", > schema).as("array")) > df.show() > // Read back. > df.select(to_json($"array").as("json")).show() > {code} > {code} > +--+ > | array| > +--+ > |[[1], [2]]| > +--+ > +-+ > | json| > +-+ > |[{"a":1},{"a":2}]| > +-+ > {code} > Currently, it throws an exception as below: > {code} > org.apache.spark.sql.AnalysisException: cannot resolve > 'structtojson(`array`)' due to data type mismatch: structtojson requires that > the expression is a struct expression.;; > 'Project [structtojson(array#30, Some(Asia/Seoul)) AS structtojson(array)#45] > +- Project > [jsontostruct(ArrayType(StructType(StructField(a,IntegerType,true)),true), > array#27, Some(Asia/Seoul)) AS array#30] >+- Project [value#25 AS array#27] > +- LocalRelation [value#25] > at > org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:80) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:72) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19849) Support ArrayType in to_json function/expression
[ https://issues.apache.org/jira/browse/SPARK-19849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19849: Assignee: Apache Spark > Support ArrayType in to_json function/expression > > > Key: SPARK-19849 > URL: https://issues.apache.org/jira/browse/SPARK-19849 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark > > After SPARK-19595, we could > {code} > import org.apache.spark.sql.functions._ > import org.apache.spark.sql.types._ > val schema = ArrayType(StructType(StructField("a", IntegerType) :: Nil)) > Seq(("""[{"a": 1}, {"a": > 2}]""")).toDF("array").select(from_json(col("array"), schema)).show() > {code} > Maybe, it'd be better if we provide a way to read it back via {{to_json}} as > below: > {code} > import org.apache.spark.sql.functions._ > import org.apache.spark.sql.types._ > val schema = ArrayType(StructType(StructField("a", IntegerType) :: Nil)) > val df = Seq("""[{"a":1}, {"a":2}]""").toDF("json").select(from_json($"json", > schema).as("array")) > df.show() > // Read back. > df.select(to_json($"array").as("json")).show() > {code} > {code} > +--+ > | array| > +--+ > |[[1], [2]]| > +--+ > +-+ > | json| > +-+ > |[{"a":1},{"a":2}]| > +-+ > {code} > Currently, it throws an exception as below: > {code} > org.apache.spark.sql.AnalysisException: cannot resolve > 'structtojson(`array`)' due to data type mismatch: structtojson requires that > the expression is a struct expression.;; > 'Project [structtojson(array#30, Some(Asia/Seoul)) AS structtojson(array)#45] > +- Project > [jsontostruct(ArrayType(StructType(StructField(a,IntegerType,true)),true), > array#27, Some(Asia/Seoul)) AS array#30] >+- Project [value#25 AS array#27] > +- LocalRelation [value#25] > at > org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:80) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:72) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19637) add to_json APIs to SQL
[ https://issues.apache.org/jira/browse/SPARK-19637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-19637. - Resolution: Fixed Assignee: Takeshi Yamamuro Fix Version/s: 2.2.0 > add to_json APIs to SQL > --- > > Key: SPARK-19637 > URL: https://issues.apache.org/jira/browse/SPARK-19637 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.1.0 >Reporter: Burak Yavuz >Assignee: Takeshi Yamamuro > Fix For: 2.2.0 > > > The method "to_json" is a useful method in turning a struct into a json > string. It currently doesn't work in SQL, but adding it should be trivial. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19840) Disallow creating permanent functions with invalid class names
[ https://issues.apache.org/jira/browse/SPARK-19840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-19840: -- Description: Currently, Spark raises exceptions on creating invalid **temporary** functions, but doesn't for **permanent** functions. This issue aims to disallow creating permanent functions with invalid class names. *BEFORE* {code} scala> sql("CREATE TEMPORARY FUNCTION function_with_invalid_classname AS 'org.invalid'").show java.lang.ClassNotFoundException: org.invalid at ... scala> sql("CREATE FUNCTION function_with_invalid_classname AS 'org.invalid'").show ++ || ++ ++ scala> sql("show functions like 'function_*'").show(false) +---+ |function | +---+ |default.function_with_invalid_classname| +---+ scala> sql("select function_with_invalid_classname()").show org.apache.spark.sql.AnalysisException: Undefined function: 'function_with_invalid_classname'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 7 {code} *AFTER* {code} scala> sql("CREATE FUNCTION function_with_invalid_classname AS 'org.invalid'").show java.lang.ClassNotFoundException: org.invalid {code} was: Currently, Spark raises exceptions on creating invalid **temporary** functions, but doesn't for **permanent** functions. This issue aims to disallow creating permanent functions with invalid class names. **BEFORE** {code} scala> sql("CREATE TEMPORARY FUNCTION function_with_invalid_classname AS 'org.invalid'").show java.lang.ClassNotFoundException: org.invalid at ... scala> sql("CREATE FUNCTION function_with_invalid_classname AS 'org.invalid'").show ++ || ++ ++ {code} **AFTER** {code} scala> sql("CREATE FUNCTION function_with_invalid_classname AS 'org.invalid'").show java.lang.ClassNotFoundException: org.invalid {code} > Disallow creating permanent functions with invalid class names > -- > > Key: SPARK-19840 > URL: https://issues.apache.org/jira/browse/SPARK-19840 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Dongjoon Hyun > > Currently, Spark raises exceptions on creating invalid **temporary** > functions, but doesn't for **permanent** functions. This issue aims to > disallow creating permanent functions with invalid class names. > *BEFORE* > {code} > scala> sql("CREATE TEMPORARY FUNCTION function_with_invalid_classname AS > 'org.invalid'").show > java.lang.ClassNotFoundException: org.invalid at > ... > scala> sql("CREATE FUNCTION function_with_invalid_classname AS > 'org.invalid'").show > ++ > || > ++ > ++ > scala> sql("show functions like 'function_*'").show(false) > +---+ > |function | > +---+ > |default.function_with_invalid_classname| > +---+ > scala> sql("select function_with_invalid_classname()").show > org.apache.spark.sql.AnalysisException: Undefined function: > 'function_with_invalid_classname'. This function is neither a registered > temporary function nor a permanent function registered in the database > 'default'.; line 1 pos 7 > {code} > *AFTER* > {code} > scala> sql("CREATE FUNCTION function_with_invalid_classname AS > 'org.invalid'").show > java.lang.ClassNotFoundException: org.invalid > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19765) UNCACHE TABLE should also un-cache all cached plans that refer to this table
[ https://issues.apache.org/jira/browse/SPARK-19765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-19765: Labels: release_notes (was: ) > UNCACHE TABLE should also un-cache all cached plans that refer to this table > > > Key: SPARK-19765 > URL: https://issues.apache.org/jira/browse/SPARK-19765 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Labels: release_notes > > DropTableCommand, TruncateTableCommand, AlterTableRenameCommand, > UncacheTableCommand, RefreshTable and InsertIntoHiveTable will un-cache all > the cached plans that refer to this table -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19765) UNCACHE TABLE should also un-cache all cached plans that refer to this table
[ https://issues.apache.org/jira/browse/SPARK-19765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-19765: Description: DropTableCommand, TruncateTableCommand, AlterTableRenameCommand, UncacheTableCommand, RefreshTable and InsertIntoHiveTable will un-cache all the cached plans that refer to this table > UNCACHE TABLE should also un-cache all cached plans that refer to this table > > > Key: SPARK-19765 > URL: https://issues.apache.org/jira/browse/SPARK-19765 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Labels: release_notes > > DropTableCommand, TruncateTableCommand, AlterTableRenameCommand, > UncacheTableCommand, RefreshTable and InsertIntoHiveTable will un-cache all > the cached plans that refer to this table -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19765) UNCACHE TABLE should also un-cache all cached plans that refer to this table
[ https://issues.apache.org/jira/browse/SPARK-19765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-19765. - Resolution: Fixed Fix Version/s: 2.2.0 > UNCACHE TABLE should also un-cache all cached plans that refer to this table > > > Key: SPARK-19765 > URL: https://issues.apache.org/jira/browse/SPARK-19765 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Labels: release_notes > Fix For: 2.2.0 > > > DropTableCommand, TruncateTableCommand, AlterTableRenameCommand, > UncacheTableCommand, RefreshTable and InsertIntoHiveTable will un-cache all > the cached plans that refer to this table -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18549) Failed to Uncache a View that References a Dropped Table.
[ https://issues.apache.org/jira/browse/SPARK-18549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-18549. - Resolution: Fixed Assignee: Wenchen Fan Fix Version/s: 2.2.0 > Failed to Uncache a View that References a Dropped Table. > - > > Key: SPARK-18549 > URL: https://issues.apache.org/jira/browse/SPARK-18549 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 >Reporter: Xiao Li >Assignee: Wenchen Fan >Priority: Critical > Fix For: 2.2.0 > > > {code} > spark.range(1, 10).toDF("id1").write.format("json").saveAsTable("jt1") > spark.range(1, 10).toDF("id2").write.format("json").saveAsTable("jt2") > sql("CREATE VIEW testView AS SELECT * FROM jt1 JOIN jt2 ON id1 == id2") > // Cache is empty at the beginning > assert(spark.sharedState.cacheManager.isEmpty) > sql("CACHE TABLE testView") > assert(spark.catalog.isCached("testView")) > // Cache is not empty > assert(!spark.sharedState.cacheManager.isEmpty) > {code} > {code} > // drop a table referenced by a cached view > sql("DROP TABLE jt1") > -- So far everything is fine > // Failed to unache the view > val e = intercept[AnalysisException] { > sql("UNCACHE TABLE testView") > }.getMessage > assert(e.contains("Table or view not found: `default`.`jt1`")) > // We are unable to drop it from the cache > assert(!spark.sharedState.cacheManager.isEmpty) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19764) Executors hang with supposedly running task that are really finished.
[ https://issues.apache.org/jira/browse/SPARK-19764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15899964#comment-15899964 ] Ari Gesher commented on SPARK-19764: We narrowed this down to driver OOM that wasn't being properly propagated into our Jupyter Notebook. > Executors hang with supposedly running task that are really finished. > - > > Key: SPARK-19764 > URL: https://issues.apache.org/jira/browse/SPARK-19764 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 2.0.2 > Environment: Ubuntu 16.04 LTS > OpenJDK Runtime Environment (build 1.8.0_121-8u121-b13-0ubuntu1.16.04.2-b13) > Spark 2.0.2 - Spark Cluster Manager >Reporter: Ari Gesher > Attachments: driver-log-stderr.log, executor-2.log, netty-6153.jpg, > SPARK-19764.tgz > > > We've come across a job that won't finish. Running on a six-node cluster, > each of the executors end up with 5-7 tasks that are never marked as > completed. > Here's an excerpt from the web UI: > ||Index ▴||ID||Attempt||Status||Locality Level||Executor ID / Host||Launch > Time||Duration||Scheduler Delay||Task Deserialization Time||GC Time||Result > Serialization Time||Getting Result Time||Peak Execution Memory||Shuffle Read > Size / Records||Errors|| > |105 | 1131 | 0 | SUCCESS |PROCESS_LOCAL |4 / 172.31.24.171 | > 2017/02/27 22:51:36 | 1.9 min | 9 ms | 4 ms | 0.7 s | 2 ms| 6 ms| > 384.1 MB| 90.3 MB / 572 | | > |106| 1168| 0| RUNNING |ANY| 2 / 172.31.16.112| 2017/02/27 > 22:53:25|6.5 h |0 ms| 0 ms| 1 s |0 ms| 0 ms| |384.1 MB > |98.7 MB / 624 | | > However, the Executor reports the task as finished: > {noformat} > 17/02/27 22:53:25 INFO Executor: Running task 106.0 in stage 5.0 (TID 1168) > 17/02/27 22:55:29 INFO Executor: Finished task 106.0 in stage 5.0 (TID 1168). > 2633558 bytes result sent via BlockManager) > {noformat} > As does the driver log: > {noformat} > 17/02/27 22:53:25 INFO Executor: Running task 106.0 in stage 5.0 (TID 1168) > 17/02/27 22:55:29 INFO Executor: Finished task 106.0 in stage 5.0 (TID 1168). > 2633558 bytes result sent via BlockManager) > {noformat} > Full log from this executor and the {{stderr}} from > {{app-20170227223614-0001/2/stderr}} attached. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19764) Executors hang with supposedly running task that are really finished.
[ https://issues.apache.org/jira/browse/SPARK-19764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ari Gesher resolved SPARK-19764. Resolution: Not A Bug > Executors hang with supposedly running task that are really finished. > - > > Key: SPARK-19764 > URL: https://issues.apache.org/jira/browse/SPARK-19764 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 2.0.2 > Environment: Ubuntu 16.04 LTS > OpenJDK Runtime Environment (build 1.8.0_121-8u121-b13-0ubuntu1.16.04.2-b13) > Spark 2.0.2 - Spark Cluster Manager >Reporter: Ari Gesher > Attachments: driver-log-stderr.log, executor-2.log, netty-6153.jpg, > SPARK-19764.tgz > > > We've come across a job that won't finish. Running on a six-node cluster, > each of the executors end up with 5-7 tasks that are never marked as > completed. > Here's an excerpt from the web UI: > ||Index ▴||ID||Attempt||Status||Locality Level||Executor ID / Host||Launch > Time||Duration||Scheduler Delay||Task Deserialization Time||GC Time||Result > Serialization Time||Getting Result Time||Peak Execution Memory||Shuffle Read > Size / Records||Errors|| > |105 | 1131 | 0 | SUCCESS |PROCESS_LOCAL |4 / 172.31.24.171 | > 2017/02/27 22:51:36 | 1.9 min | 9 ms | 4 ms | 0.7 s | 2 ms| 6 ms| > 384.1 MB| 90.3 MB / 572 | | > |106| 1168| 0| RUNNING |ANY| 2 / 172.31.16.112| 2017/02/27 > 22:53:25|6.5 h |0 ms| 0 ms| 1 s |0 ms| 0 ms| |384.1 MB > |98.7 MB / 624 | | > However, the Executor reports the task as finished: > {noformat} > 17/02/27 22:53:25 INFO Executor: Running task 106.0 in stage 5.0 (TID 1168) > 17/02/27 22:55:29 INFO Executor: Finished task 106.0 in stage 5.0 (TID 1168). > 2633558 bytes result sent via BlockManager) > {noformat} > As does the driver log: > {noformat} > 17/02/27 22:53:25 INFO Executor: Running task 106.0 in stage 5.0 (TID 1168) > 17/02/27 22:55:29 INFO Executor: Finished task 106.0 in stage 5.0 (TID 1168). > 2633558 bytes result sent via BlockManager) > {noformat} > Full log from this executor and the {{stderr}} from > {{app-20170227223614-0001/2/stderr}} attached. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19851) Add support for EVERY and ANY (SOME) aggregates
Michael Styles created SPARK-19851: -- Summary: Add support for EVERY and ANY (SOME) aggregates Key: SPARK-19851 URL: https://issues.apache.org/jira/browse/SPARK-19851 Project: Spark Issue Type: Improvement Components: PySpark, Spark Core, SQL Affects Versions: 2.1.0 Reporter: Michael Styles -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19348) pyspark.ml.Pipeline gets corrupted under multi threaded use
[ https://issues.apache.org/jira/browse/SPARK-19348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15899979#comment-15899979 ] Apache Spark commented on SPARK-19348: -- User 'BryanCutler' has created a pull request for this issue: https://github.com/apache/spark/pull/17193 > pyspark.ml.Pipeline gets corrupted under multi threaded use > --- > > Key: SPARK-19348 > URL: https://issues.apache.org/jira/browse/SPARK-19348 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 1.6.0, 2.0.0, 2.1.0, 2.2.0 >Reporter: Vinayak Joshi >Assignee: Bryan Cutler > Fix For: 2.2.0 > > Attachments: pyspark_pipeline_threads.py > > > When pyspark.ml.Pipeline objects are constructed concurrently in separate > python threads, it is observed that the stages used to construct a pipeline > object get corrupted i.e the stages supplied to a Pipeline object in one > thread appear inside a different Pipeline object constructed in a different > thread. > Things work fine if construction of pyspark.ml.Pipeline objects is > serialized, so this looks like a thread safety problem with > pyspark.ml.Pipeline object construction. > Confirmed that the problem exists with Spark 1.6.x as well as 2.x. > While the corruption of the Pipeline stages is easily caught, we need to know > if performing other pipeline operations, such as pyspark.ml.pipeline.fit( ) > are also affected by the underlying cause of this problem. That is, whether > other pipeline operations like pyspark.ml.pipeline.fit( ) may be performed > in separate threads (on distinct pipeline objects) concurrently without any > cross contamination between them. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19764) Executors hang with supposedly running task that are really finished.
[ https://issues.apache.org/jira/browse/SPARK-19764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15899980#comment-15899980 ] Shixiong Zhu commented on SPARK-19764: -- [~agesher] Do you have the OOM stack trace? So that we can fix it. > Executors hang with supposedly running task that are really finished. > - > > Key: SPARK-19764 > URL: https://issues.apache.org/jira/browse/SPARK-19764 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 2.0.2 > Environment: Ubuntu 16.04 LTS > OpenJDK Runtime Environment (build 1.8.0_121-8u121-b13-0ubuntu1.16.04.2-b13) > Spark 2.0.2 - Spark Cluster Manager >Reporter: Ari Gesher > Attachments: driver-log-stderr.log, executor-2.log, netty-6153.jpg, > SPARK-19764.tgz > > > We've come across a job that won't finish. Running on a six-node cluster, > each of the executors end up with 5-7 tasks that are never marked as > completed. > Here's an excerpt from the web UI: > ||Index ▴||ID||Attempt||Status||Locality Level||Executor ID / Host||Launch > Time||Duration||Scheduler Delay||Task Deserialization Time||GC Time||Result > Serialization Time||Getting Result Time||Peak Execution Memory||Shuffle Read > Size / Records||Errors|| > |105 | 1131 | 0 | SUCCESS |PROCESS_LOCAL |4 / 172.31.24.171 | > 2017/02/27 22:51:36 | 1.9 min | 9 ms | 4 ms | 0.7 s | 2 ms| 6 ms| > 384.1 MB| 90.3 MB / 572 | | > |106| 1168| 0| RUNNING |ANY| 2 / 172.31.16.112| 2017/02/27 > 22:53:25|6.5 h |0 ms| 0 ms| 1 s |0 ms| 0 ms| |384.1 MB > |98.7 MB / 624 | | > However, the Executor reports the task as finished: > {noformat} > 17/02/27 22:53:25 INFO Executor: Running task 106.0 in stage 5.0 (TID 1168) > 17/02/27 22:55:29 INFO Executor: Finished task 106.0 in stage 5.0 (TID 1168). > 2633558 bytes result sent via BlockManager) > {noformat} > As does the driver log: > {noformat} > 17/02/27 22:53:25 INFO Executor: Running task 106.0 in stage 5.0 (TID 1168) > 17/02/27 22:55:29 INFO Executor: Finished task 106.0 in stage 5.0 (TID 1168). > 2633558 bytes result sent via BlockManager) > {noformat} > Full log from this executor and the {{stderr}} from > {{app-20170227223614-0001/2/stderr}} attached. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19851) Add support for EVERY and ANY (SOME) aggregates
[ https://issues.apache.org/jira/browse/SPARK-19851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15899988#comment-15899988 ] Michael Styles commented on SPARK-19851: https://github.com/apache/spark/pull/17194 > Add support for EVERY and ANY (SOME) aggregates > --- > > Key: SPARK-19851 > URL: https://issues.apache.org/jira/browse/SPARK-19851 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core, SQL >Affects Versions: 2.1.0 >Reporter: Michael Styles > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17498) StringIndexer.setHandleInvalid should have another option 'new'
[ https://issues.apache.org/jira/browse/SPARK-17498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-17498. --- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 16883 [https://github.com/apache/spark/pull/16883] > StringIndexer.setHandleInvalid should have another option 'new' > --- > > Key: SPARK-17498 > URL: https://issues.apache.org/jira/browse/SPARK-17498 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Miroslav Balaz >Assignee: Vincent >Priority: Minor > Fix For: 2.2.0 > > > That will map unseen label to maximum known label +1, IndexToString would map > that back to "" or NA if there is something like that in spark, -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
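For illustration only, a minimal sketch of how the proposed option might be used from the DataFrame-based API. The option name "new" is the proposal in this ticket (the shipped name could differ), and the column names are made up for the example.
{code}
import org.apache.spark.ml.feature.StringIndexer

val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .setHandleInvalid("new")  // proposed: map labels unseen during fit to (max known index) + 1
{code}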
[jira] [Updated] (SPARK-19851) Add support for EVERY and ANY (SOME) aggregates
[ https://issues.apache.org/jira/browse/SPARK-19851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Styles updated SPARK-19851: --- Description: Add support for EVERY and ANY (SOME) aggregates. - EVERY returns true if all input values are true. - ANY returns true if at least one input value is true. - SOME is equivalent to ANY. Both aggregates are part of the SQL standard. > Add support for EVERY and ANY (SOME) aggregates > --- > > Key: SPARK-19851 > URL: https://issues.apache.org/jira/browse/SPARK-19851 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core, SQL >Affects Versions: 2.1.0 >Reporter: Michael Styles > > Add support for EVERY and ANY (SOME) aggregates. > - EVERY returns true if all input values are true. > - ANY returns true if at least one input value is true. > - SOME is equivalent to ANY. > Both aggregates are part of the SQL standard. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
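Until such aggregates exist, the described semantics can be emulated today; a minimal sketch, assuming a SparkSession named {{spark}} and a boolean column {{b}} (both names are illustrative):
{code}
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq((1, true), (1, true), (2, true), (2, false)).toDF("k", "b")
df.groupBy($"k").agg(
  (min($"b".cast("int")) === 1).as("every_b"),  // EVERY: true only if all values are true
  (max($"b".cast("int")) === 1).as("any_b")     // ANY / SOME: true if at least one value is true
).show()
{code}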
[jira] [Created] (SPARK-19852) StringIndexer.setHandleInvalid should have another option 'new': Python API and docs
Joseph K. Bradley created SPARK-19852: - Summary: StringIndexer.setHandleInvalid should have another option 'new': Python API and docs Key: SPARK-19852 URL: https://issues.apache.org/jira/browse/SPARK-19852 Project: Spark Issue Type: Improvement Components: ML, PySpark Affects Versions: 2.2.0 Reporter: Joseph K. Bradley Priority: Minor Update Python API for StringIndexer so setHandleInvalid doc is correct. This will probably require: * putting HandleInvalid within StringIndexer to update its built-in doc (See Bucketizer for an example.) * updating API docs and maybe the guide -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19764) Executors hang with supposedly running task that are really finished.
[ https://issues.apache.org/jira/browse/SPARK-19764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1599#comment-1599 ] Ari Gesher commented on SPARK-19764: We were collecting more data than we had heap for. Still useful? > Executors hang with supposedly running task that are really finished. > - > > Key: SPARK-19764 > URL: https://issues.apache.org/jira/browse/SPARK-19764 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 2.0.2 > Environment: Ubuntu 16.04 LTS > OpenJDK Runtime Environment (build 1.8.0_121-8u121-b13-0ubuntu1.16.04.2-b13) > Spark 2.0.2 - Spark Cluster Manager >Reporter: Ari Gesher > Attachments: driver-log-stderr.log, executor-2.log, netty-6153.jpg, > SPARK-19764.tgz > > > We've come across a job that won't finish. Running on a six-node cluster, > each of the executors end up with 5-7 tasks that are never marked as > completed. > Here's an excerpt from the web UI: > ||Index ▴||ID||Attempt||Status||Locality Level||Executor ID / Host||Launch > Time||Duration||Scheduler Delay||Task Deserialization Time||GC Time||Result > Serialization Time||Getting Result Time||Peak Execution Memory||Shuffle Read > Size / Records||Errors|| > |105 | 1131 | 0 | SUCCESS |PROCESS_LOCAL |4 / 172.31.24.171 | > 2017/02/27 22:51:36 | 1.9 min | 9 ms | 4 ms | 0.7 s | 2 ms| 6 ms| > 384.1 MB| 90.3 MB / 572 | | > |106| 1168| 0| RUNNING |ANY| 2 / 172.31.16.112| 2017/02/27 > 22:53:25|6.5 h |0 ms| 0 ms| 1 s |0 ms| 0 ms| |384.1 MB > |98.7 MB / 624 | | > However, the Executor reports the task as finished: > {noformat} > 17/02/27 22:53:25 INFO Executor: Running task 106.0 in stage 5.0 (TID 1168) > 17/02/27 22:55:29 INFO Executor: Finished task 106.0 in stage 5.0 (TID 1168). > 2633558 bytes result sent via BlockManager) > {noformat} > As does the driver log: > {noformat} > 17/02/27 22:53:25 INFO Executor: Running task 106.0 in stage 5.0 (TID 1168) > 17/02/27 22:55:29 INFO Executor: Finished task 106.0 in stage 5.0 (TID 1168). > 2633558 bytes result sent via BlockManager) > {noformat} > Full log from this executor and the {{stderr}} from > {{app-20170227223614-0001/2/stderr}} attached. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19516) update public doc to use SparkSession instead of SparkContext
[ https://issues.apache.org/jira/browse/SPARK-19516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-19516. - Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 16856 [https://github.com/apache/spark/pull/16856] > update public doc to use SparkSession instead of SparkContext > - > > Key: SPARK-19516 > URL: https://issues.apache.org/jira/browse/SPARK-19516 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 2.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.2.0 > > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19803) Flaky BlockManagerProactiveReplicationSuite tests
[ https://issues.apache.org/jira/browse/SPARK-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout resolved SPARK-19803. Resolution: Fixed Assignee: Genmao Yu Fix Version/s: 2.2.0 Thanks for fixing this [~uncleGen] and for reporting it [~sitalke...@gmail.com] > Flaky BlockManagerProactiveReplicationSuite tests > - > > Key: SPARK-19803 > URL: https://issues.apache.org/jira/browse/SPARK-19803 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Sital Kedia >Assignee: Genmao Yu > Fix For: 2.2.0 > > > The tests added for BlockManagerProactiveReplicationSuite has made the > jenkins build flaky. Please refer to the build for more details - > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73640/testReport/ -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19803) Flaky BlockManagerProactiveReplicationSuite tests
[ https://issues.apache.org/jira/browse/SPARK-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout updated SPARK-19803: --- Component/s: Tests > Flaky BlockManagerProactiveReplicationSuite tests > - > > Key: SPARK-19803 > URL: https://issues.apache.org/jira/browse/SPARK-19803 > Project: Spark > Issue Type: Bug > Components: Spark Core, Tests >Affects Versions: 2.2.0 >Reporter: Sital Kedia >Assignee: Genmao Yu > Labels: flaky-test > Fix For: 2.2.0 > > > The tests added for BlockManagerProactiveReplicationSuite has made the > jenkins build flaky. Please refer to the build for more details - > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73640/testReport/ -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19803) Flaky BlockManagerProactiveReplicationSuite tests
[ https://issues.apache.org/jira/browse/SPARK-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout updated SPARK-19803: --- Affects Version/s: (was: 2.3.0) 2.2.0 > Flaky BlockManagerProactiveReplicationSuite tests > - > > Key: SPARK-19803 > URL: https://issues.apache.org/jira/browse/SPARK-19803 > Project: Spark > Issue Type: Bug > Components: Spark Core, Tests >Affects Versions: 2.2.0 >Reporter: Sital Kedia >Assignee: Genmao Yu > Labels: flaky-test > Fix For: 2.2.0 > > > The tests added for BlockManagerProactiveReplicationSuite has made the > jenkins build flaky. Please refer to the build for more details - > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73640/testReport/ -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19803) Flaky BlockManagerProactiveReplicationSuite tests
[ https://issues.apache.org/jira/browse/SPARK-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout updated SPARK-19803: --- Labels: flaky-test (was: ) > Flaky BlockManagerProactiveReplicationSuite tests > - > > Key: SPARK-19803 > URL: https://issues.apache.org/jira/browse/SPARK-19803 > Project: Spark > Issue Type: Bug > Components: Spark Core, Tests >Affects Versions: 2.2.0 >Reporter: Sital Kedia >Assignee: Genmao Yu > Labels: flaky-test > Fix For: 2.2.0 > > > The tests added for BlockManagerProactiveReplicationSuite has made the > jenkins build flaky. Please refer to the build for more details - > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73640/testReport/ -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19851) Add support for EVERY and ANY (SOME) aggregates
[ https://issues.apache.org/jira/browse/SPARK-19851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-19851: - Component/s: (was: Spark Core) > Add support for EVERY and ANY (SOME) aggregates > --- > > Key: SPARK-19851 > URL: https://issues.apache.org/jira/browse/SPARK-19851 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 2.1.0 >Reporter: Michael Styles > > Add support for EVERY and ANY (SOME) aggregates. > - EVERY returns true if all input values are true. > - ANY returns true if at least one input value is true. > - SOME is equivalent to ANY. > Both aggregates are part of the SQL standard. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19764) Executors hang with supposedly running task that are really finished.
[ https://issues.apache.org/jira/browse/SPARK-19764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900127#comment-15900127 ] Shixiong Zhu commented on SPARK-19764: -- So you don't set an UncaughtExceptionHandler and this OOM happened on the driver side? If so, then it's not a Spark bug. > Executors hang with supposedly running task that are really finished. > - > > Key: SPARK-19764 > URL: https://issues.apache.org/jira/browse/SPARK-19764 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 2.0.2 > Environment: Ubuntu 16.04 LTS > OpenJDK Runtime Environment (build 1.8.0_121-8u121-b13-0ubuntu1.16.04.2-b13) > Spark 2.0.2 - Spark Cluster Manager >Reporter: Ari Gesher > Attachments: driver-log-stderr.log, executor-2.log, netty-6153.jpg, > SPARK-19764.tgz > > > We've come across a job that won't finish. Running on a six-node cluster, > each of the executors end up with 5-7 tasks that are never marked as > completed. > Here's an excerpt from the web UI: > ||Index ▴||ID||Attempt||Status||Locality Level||Executor ID / Host||Launch > Time||Duration||Scheduler Delay||Task Deserialization Time||GC Time||Result > Serialization Time||Getting Result Time||Peak Execution Memory||Shuffle Read > Size / Records||Errors|| > |105 | 1131 | 0 | SUCCESS |PROCESS_LOCAL |4 / 172.31.24.171 | > 2017/02/27 22:51:36 | 1.9 min | 9 ms | 4 ms | 0.7 s | 2 ms| 6 ms| > 384.1 MB| 90.3 MB / 572 | | > |106| 1168| 0| RUNNING |ANY| 2 / 172.31.16.112| 2017/02/27 > 22:53:25|6.5 h |0 ms| 0 ms| 1 s |0 ms| 0 ms| |384.1 MB > |98.7 MB / 624 | | > However, the Executor reports the task as finished: > {noformat} > 17/02/27 22:53:25 INFO Executor: Running task 106.0 in stage 5.0 (TID 1168) > 17/02/27 22:55:29 INFO Executor: Finished task 106.0 in stage 5.0 (TID 1168). > 2633558 bytes result sent via BlockManager) > {noformat} > As does the driver log: > {noformat} > 17/02/27 22:53:25 INFO Executor: Running task 106.0 in stage 5.0 (TID 1168) > 17/02/27 22:55:29 INFO Executor: Finished task 106.0 in stage 5.0 (TID 1168). > 2633558 bytes result sent via BlockManager) > {noformat} > Full log from this executor and the {{stderr}} from > {{app-20170227223614-0001/2/stderr}} attached. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
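For context on the question above, a minimal driver-side sketch (an assumption about the user's setup, not Spark code): installing a JVM-wide default handler so an OOM thrown on a background thread is at least logged rather than lost.
{code}
Thread.setDefaultUncaughtExceptionHandler(new Thread.UncaughtExceptionHandler {
  override def uncaughtException(t: Thread, e: Throwable): Unit = {
    // Log and surface the failure instead of letting the thread die silently.
    System.err.println(s"Uncaught exception on thread ${t.getName}: $e")
  }
})
{code}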
[jira] [Commented] (SPARK-16207) order guarantees for DataFrames
[ https://issues.apache.org/jira/browse/SPARK-16207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900152#comment-15900152 ] Chris Rogers commented on SPARK-16207: -- The lack of documentation on this is immensely confusing. > order guarantees for DataFrames > --- > > Key: SPARK-16207 > URL: https://issues.apache.org/jira/browse/SPARK-16207 > Project: Spark > Issue Type: Documentation > Components: Spark Core >Affects Versions: 1.6.1 >Reporter: Max Moroz >Priority: Minor > > There's no clear explanation in the documentation about what guarantees are > available for the preservation of order in DataFrames. Different blogs, SO > answers, and posts on course websites suggest different things. It would be > good to provide clarity on this. > Examples of questions on which I could not find clarification: > 1) Does groupby() preserve order? > 2) Does take() preserve order? > 3) Is DataFrame guaranteed to have the same order of lines as the text file > it was read from? (Or as the json file, etc.) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16207) order guarantees for DataFrames
[ https://issues.apache.org/jira/browse/SPARK-16207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900162#comment-15900162 ] Sean Owen commented on SPARK-16207: --- [~rcrogers] where would you document this? we could add a document in a place that would have helped you. However as I say, lots of things don't preserve order, so I don't know if it's sensible to write that everywhere. > order guarantees for DataFrames > --- > > Key: SPARK-16207 > URL: https://issues.apache.org/jira/browse/SPARK-16207 > Project: Spark > Issue Type: Documentation > Components: Spark Core >Affects Versions: 1.6.1 >Reporter: Max Moroz >Priority: Minor > > There's no clear explanation in the documentation about what guarantees are > available for the preservation of order in DataFrames. Different blogs, SO > answers, and posts on course websites suggest different things. It would be > good to provide clarity on this. > Examples of questions on which I could not find clarification: > 1) Does groupby() preserve order? > 2) Does take() preserve order? > 3) Is DataFrame guaranteed to have the same order of lines as the text file > it was read from? (Or as the json file, etc.) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19561) Pyspark Dataframes don't allow timestamps near epoch
[ https://issues.apache.org/jira/browse/SPARK-19561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-19561. Resolution: Fixed Assignee: Jason White Fix Version/s: 2.2.0 2.1.1 > Pyspark Dataframes don't allow timestamps near epoch > > > Key: SPARK-19561 > URL: https://issues.apache.org/jira/browse/SPARK-19561 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.0.1, 2.1.0 >Reporter: Jason White >Assignee: Jason White > Fix For: 2.1.1, 2.2.0 > > > Pyspark does not allow timestamps at or near the epoch to be created in a > DataFrame. Related issue: https://issues.apache.org/jira/browse/SPARK-19299 > TimestampType.toInternal converts a datetime object to a number representing > microseconds since the epoch. For all times more than 2148 seconds before or > after 1970-01-01T00:00:00+, this number is greater than 2^31 and Py4J > automatically serializes it as a long. > However, for times within this range (~35 minutes before or after the epoch), > Py4J serializes it as an int. When creating the object on the Scala side, > ints are not recognized and the value goes to null. This leads to null values > in non-nullable fields, and corrupted Parquet files. > The solution is trivial - force TimestampType.toInternal to always return a > long. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
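A standalone sketch of the failure mode described above (illustrative only, not Spark's actual converter code): a value small enough to fit in 32 bits arrives from Py4J as a boxed Integer, so code matching only on Long misses it and produces null.
{code}
def asTimestampMicros(v: Any): Any = v match {
  case l: java.lang.Long => l
  case _                 => null  // a java.lang.Integer sent by Py4J for small values lands here
}

println(asTimestampMicros(java.lang.Long.valueOf(4000000000L)))   // 4000000000 (about 67 minutes after epoch)
println(asTimestampMicros(java.lang.Integer.valueOf(120000000)))  // null (2 minutes after epoch)
{code}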
[jira] [Commented] (SPARK-19767) API Doc pages for Streaming with Kafka 0.10 not current
[ https://issues.apache.org/jira/browse/SPARK-19767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900184#comment-15900184 ] Nick Afshartous commented on SPARK-19767: - Yes, I completed the steps in the Prerequisites section of https://github.com/apache/spark/blob/master/docs/README.md and got the same error about unknown tag {{include_example}} on two different computers (Linux and OSX). Looks like {{include_example}} is local in {{./docs/_plugins/include_example.rb}}, so maybe this is some kind of path issue where its not finding the local file ? > API Doc pages for Streaming with Kafka 0.10 not current > --- > > Key: SPARK-19767 > URL: https://issues.apache.org/jira/browse/SPARK-19767 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Nick Afshartous >Priority: Minor > > The API docs linked from the Spark Kafka 0.10 Integration page are not > current. For instance, on the page >https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html > the code examples show the new API (i.e. class ConsumerStrategies). However, > following the links > API Docs --> (Scala | Java) > lead to API pages that do not have class ConsumerStrategies) . The API doc > package names also have {code}streaming.kafka{code} as opposed to > {code}streaming.kafka10{code} > as in the code examples on streaming-kafka-0-10-integration.html. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19767) API Doc pages for Streaming with Kafka 0.10 not current
[ https://issues.apache.org/jira/browse/SPARK-19767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900207#comment-15900207 ] Sean Owen commented on SPARK-19767: --- Oh, are you not running from the {{docs/}} directory? > API Doc pages for Streaming with Kafka 0.10 not current > --- > > Key: SPARK-19767 > URL: https://issues.apache.org/jira/browse/SPARK-19767 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Nick Afshartous >Priority: Minor > > The API docs linked from the Spark Kafka 0.10 Integration page are not > current. For instance, on the page >https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html > the code examples show the new API (i.e. class ConsumerStrategies). However, > following the links > API Docs --> (Scala | Java) > lead to API pages that do not have class ConsumerStrategies) . The API doc > package names also have {code}streaming.kafka{code} as opposed to > {code}streaming.kafka10{code} > as in the code examples on streaming-kafka-0-10-integration.html. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19702) Increase refuse_seconds timeout in the Mesos Spark Dispatcher
[ https://issues.apache.org/jira/browse/SPARK-19702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-19702: - Assignee: Michael Gummelt Priority: Minor (was: Major) Fix Version/s: 2.2.0 Resolved by https://github.com/apache/spark/pull/17031 > Increase refuse_seconds timeout in the Mesos Spark Dispatcher > -- > > Key: SPARK-19702 > URL: https://issues.apache.org/jira/browse/SPARK-19702 > Project: Spark > Issue Type: New Feature > Components: Mesos >Affects Versions: 2.1.0 >Reporter: Michael Gummelt >Assignee: Michael Gummelt >Priority: Minor > Fix For: 2.2.0 > > > Due to the problem described here: > https://issues.apache.org/jira/browse/MESOS-6112, Running > 5 Mesos > frameworks concurrently can result in starvation. For example, running 10 > dispatchers could result in 5 of them getting all the offers, even if they > have no jobs to launch. We must increase the refuse_seconds > timeout to solve this problem. Another option would have been to implement > suppress/revive, but that can cause starvation due to the unreliability of > Mesos RPC calls. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19767) API Doc pages for Streaming with Kafka 0.10 not current
[ https://issues.apache.org/jira/browse/SPARK-19767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900216#comment-15900216 ] Nick Afshartous commented on SPARK-19767: - Missed that one, thanks. > API Doc pages for Streaming with Kafka 0.10 not current > --- > > Key: SPARK-19767 > URL: https://issues.apache.org/jira/browse/SPARK-19767 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Nick Afshartous >Priority: Minor > > The API docs linked from the Spark Kafka 0.10 Integration page are not > current. For instance, on the page >https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html > the code examples show the new API (i.e. class ConsumerStrategies). However, > following the links > API Docs --> (Scala | Java) > lead to API pages that do not have class ConsumerStrategies) . The API doc > package names also have {code}streaming.kafka{code} as opposed to > {code}streaming.kafka10{code} > as in the code examples on streaming-kafka-0-10-integration.html. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16207) order guarantees for DataFrames
[ https://issues.apache.org/jira/browse/SPARK-16207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900220#comment-15900220 ] Chris Rogers commented on SPARK-16207: -- [~srowen] since there is no documentation yet, I don't know whether a clear, coherent generalization can be made. I would be happy with "most of the methods DO NOT preserve order, with these specific exceptions", or "most of the methods DO preserve order, with these specific exceptions". Failing a generalization, I'd also be happy with method-by-method documentation of ordering semantics, which seems like a very minimal amount of copy-pasting ("Preserves ordering: yes", "Preserves ordering: no"). Maybe that's a good place to start, since there seems to be some confusion about what the generalization would be. > order guarantees for DataFrames > --- > > Key: SPARK-16207 > URL: https://issues.apache.org/jira/browse/SPARK-16207 > Project: Spark > Issue Type: Documentation > Components: Spark Core >Affects Versions: 1.6.1 >Reporter: Max Moroz >Priority: Minor > > There's no clear explanation in the documentation about what guarantees are > available for the preservation of order in DataFrames. Different blogs, SO > answers, and posts on course websites suggest different things. It would be > good to provide clarity on this. > Examples of questions on which I could not find clarification: > 1) Does groupby() preserve order? > 2) Does take() preserve order? > 3) Is DataFrame guaranteed to have the same order of lines as the text file > it was read from? (Or as the json file, etc.) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-16207) order guarantees for DataFrames
[ https://issues.apache.org/jira/browse/SPARK-16207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900220#comment-15900220 ] Chris Rogers edited comment on SPARK-16207 at 3/7/17 9:52 PM: -- [~srowen] since there is no documentation yet, I don't know whether a clear, coherent generalization can be made. I would be happy with "most of the methods DO NOT preserve order, with these specific exceptions", or "most of the methods DO preserve order, with these specific exceptions". Failing a generalization, I'd also be happy with method-by-method documentation of ordering semantics, which seems like a very minimal amount of copy-pasting ("Preserves ordering: yes", "Preserves ordering: no"). Maybe that's a good place to start, since there seems to be some confusion about what the generalization would be. I'm new to Scala so not sure if this is practical, but maybe the appropriate methods could be moved to an `RDDPreservesOrdering` class with an implicit conversion, akin to `PairRDDFunctions`? was (Author: rcrogers): [~srowen] since there is no documentation yet, I don't know whether a clear, coherent generalization can be made. I would be happy with "most of the methods DO NOT preserve order, with these specific exceptions", or "most of the methods DO preserve order, with these specific exceptions". Failing a generalization, I'd also be happy with method-by-method documentation of ordering semantics, which seems like a very minimal amount of copy-pasting ("Preserves ordering: yes", "Preserves ordering: no"). Maybe that's a good place to start, since there seems to be some confusion about what the generalization would be. > order guarantees for DataFrames > --- > > Key: SPARK-16207 > URL: https://issues.apache.org/jira/browse/SPARK-16207 > Project: Spark > Issue Type: Documentation > Components: Spark Core >Affects Versions: 1.6.1 >Reporter: Max Moroz >Priority: Minor > > There's no clear explanation in the documentation about what guarantees are > available for the preservation of order in DataFrames. Different blogs, SO > answers, and posts on course websites suggest different things. It would be > good to provide clarity on this. > Examples of questions on which I could not find clarification: > 1) Does groupby() preserve order? > 2) Does take() preserve order? > 3) Is DataFrame guaranteed to have the same order of lines as the text file > it was read from? (Or as the json file, etc.) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
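For what it's worth, a minimal sketch of the practical advice that usually accompanies this question (assumes a SparkSession named {{spark}}; illustrative, not a statement of documented guarantees): when downstream order matters, impose it explicitly rather than relying on a transformation to preserve it.
{code}
import spark.implicits._

val df = spark.range(0, 1000).toDF("id")
val counts = df.groupBy(($"id" % 10).as("bucket")).count()
counts.orderBy($"bucket").show()  // ordering of the output is guaranteed only because of orderBy
{code}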
[jira] [Created] (SPARK-19853) Uppercase Kafka topics fail when startingOffsets are SpecificOffsets
Chris Bowden created SPARK-19853: Summary: Uppercase Kafka topics fail when startingOffsets are SpecificOffsets Key: SPARK-19853 URL: https://issues.apache.org/jira/browse/SPARK-19853 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.1.0 Reporter: Chris Bowden Priority: Trivial When using the KafkaSource with Structured Streaming, consumer assignments are not what the user expects if startingOffsets is set to an explicit set of topics/partitions in JSON where the topic(s) happen to have uppercase characters. When StartingOffsets is constructed, the original string value from options is transformed toLowerCase to make matching on "earliest" and "latest" case insensitive. However, the toLowerCase json is passed to SpecificOffsets for the terminal condition, so topic names may not be what the user intended by the time assignments are made with the underlying KafkaConsumer. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19853) Uppercase Kafka topics fail when startingOffsets are SpecificOffsets
[ https://issues.apache.org/jira/browse/SPARK-19853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Bowden updated SPARK-19853: - Description: When using the KafkaSource with Structured Streaming, consumer assignments are not what the user expects if startingOffsets is set to an explicit set of topics/partitions in JSON where the topic(s) happen to have uppercase characters. When StartingOffsets is constructed, the original string value from options is transformed toLowerCase to make matching on "earliest" and "latest" case insensitive. However, the toLowerCase json is passed to SpecificOffsets for the terminal condition, so topic names may not be what the user intended by the time assignments are made with the underlying KafkaConsumer. >From KafkaSourceProvider: {code} val startingOffsets = caseInsensitiveParams.get(STARTING_OFFSETS_OPTION_KEY).map(_.trim.toLowerCase) match { case Some("latest") => LatestOffsets case Some("earliest") => EarliestOffsets case Some(json) => SpecificOffsets(JsonUtils.partitionOffsets(json)) case None => LatestOffsets } {code} was:When using the KafkaSource with Structured Streaming, consumer assignments are not what the user expects if startingOffsets is set to an explicit set of topics/partitions in JSON where the topic(s) happen to have uppercase characters. When StartingOffsets is constructed, the original string value from options is transformed toLowerCase to make matching on "earliest" and "latest" case insensitive. However, the toLowerCase json is passed to SpecificOffsets for the terminal condition, so topic names may not be what the user intended by the time assignments are made with the underlying KafkaConsumer. > Uppercase Kafka topics fail when startingOffsets are SpecificOffsets > > > Key: SPARK-19853 > URL: https://issues.apache.org/jira/browse/SPARK-19853 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Chris Bowden >Priority: Trivial > > When using the KafkaSource with Structured Streaming, consumer assignments > are not what the user expects if startingOffsets is set to an explicit set of > topics/partitions in JSON where the topic(s) happen to have uppercase > characters. When StartingOffsets is constructed, the original string value > from options is transformed toLowerCase to make matching on "earliest" and > "latest" case insensitive. However, the toLowerCase json is passed to > SpecificOffsets for the terminal condition, so topic names may not be what > the user intended by the time assignments are made with the underlying > KafkaConsumer. > From KafkaSourceProvider: > {code} > val startingOffsets = > caseInsensitiveParams.get(STARTING_OFFSETS_OPTION_KEY).map(_.trim.toLowerCase) > match { > case Some("latest") => LatestOffsets > case Some("earliest") => EarliestOffsets > case Some(json) => SpecificOffsets(JsonUtils.partitionOffsets(json)) > case None => LatestOffsets > } > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
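One possible shape of a fix, sketched directly from the snippet above (not necessarily the patch that ends up merged): lower-case the value only when matching the "earliest"/"latest" keywords, and pass the original, case-preserving JSON to SpecificOffsets so topic names reach the KafkaConsumer untouched.
{code}
val startingOffsets =
  caseInsensitiveParams.get(STARTING_OFFSETS_OPTION_KEY).map(_.trim) match {
    case Some(s) if s.toLowerCase == "latest"   => LatestOffsets
    case Some(s) if s.toLowerCase == "earliest" => EarliestOffsets
    case Some(json)                             => SpecificOffsets(JsonUtils.partitionOffsets(json))
    case None                                   => LatestOffsets
  }
{code}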
[jira] [Updated] (SPARK-19853) Uppercase Kafka topics fail when startingOffsets are SpecificOffsets
[ https://issues.apache.org/jira/browse/SPARK-19853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-19853: - Target Version/s: 2.2.0 > Uppercase Kafka topics fail when startingOffsets are SpecificOffsets > > > Key: SPARK-19853 > URL: https://issues.apache.org/jira/browse/SPARK-19853 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Chris Bowden >Priority: Trivial > > When using the KafkaSource with Structured Streaming, consumer assignments > are not what the user expects if startingOffsets is set to an explicit set of > topics/partitions in JSON where the topic(s) happen to have uppercase > characters. When StartingOffsets is constructed, the original string value > from options is transformed toLowerCase to make matching on "earliest" and > "latest" case insensitive. However, the toLowerCase json is passed to > SpecificOffsets for the terminal condition, so topic names may not be what > the user intended by the time assignments are made with the underlying > KafkaConsumer. > From KafkaSourceProvider: > {code} > val startingOffsets = > caseInsensitiveParams.get(STARTING_OFFSETS_OPTION_KEY).map(_.trim.toLowerCase) > match { > case Some("latest") => LatestOffsets > case Some("earliest") => EarliestOffsets > case Some(json) => SpecificOffsets(JsonUtils.partitionOffsets(json)) > case None => LatestOffsets > } > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19764) Executors hang with supposedly running task that are really finished.
[ https://issues.apache.org/jira/browse/SPARK-19764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ari Gesher updated SPARK-19764: --- We're driving everything from Python. It may be a bug that we're not getting the error to propagate up to the notebook - generally, we see exceptions. When we ran the same job from the PySpark shell, we saw the stacktrace, so I'm inclined to point at something in the notebook stop that made it not propagate. We're happy to investigate if you think it's useful. > Executors hang with supposedly running task that are really finished. > - > > Key: SPARK-19764 > URL: https://issues.apache.org/jira/browse/SPARK-19764 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 2.0.2 > Environment: Ubuntu 16.04 LTS > OpenJDK Runtime Environment (build 1.8.0_121-8u121-b13-0ubuntu1.16.04.2-b13) > Spark 2.0.2 - Spark Cluster Manager >Reporter: Ari Gesher > Attachments: driver-log-stderr.log, executor-2.log, netty-6153.jpg, > SPARK-19764.tgz > > > We've come across a job that won't finish. Running on a six-node cluster, > each of the executors end up with 5-7 tasks that are never marked as > completed. > Here's an excerpt from the web UI: > ||Index ▴||ID||Attempt||Status||Locality Level||Executor ID / Host||Launch > Time||Duration||Scheduler Delay||Task Deserialization Time||GC Time||Result > Serialization Time||Getting Result Time||Peak Execution Memory||Shuffle Read > Size / Records||Errors|| > |105 | 1131 | 0 | SUCCESS |PROCESS_LOCAL |4 / 172.31.24.171 | > 2017/02/27 22:51:36 | 1.9 min | 9 ms | 4 ms | 0.7 s | 2 ms| 6 ms| > 384.1 MB| 90.3 MB / 572 | | > |106| 1168| 0| RUNNING |ANY| 2 / 172.31.16.112| 2017/02/27 > 22:53:25|6.5 h |0 ms| 0 ms| 1 s |0 ms| 0 ms| |384.1 MB > |98.7 MB / 624 | | > However, the Executor reports the task as finished: > {noformat} > 17/02/27 22:53:25 INFO Executor: Running task 106.0 in stage 5.0 (TID 1168) > 17/02/27 22:55:29 INFO Executor: Finished task 106.0 in stage 5.0 (TID 1168). > 2633558 bytes result sent via BlockManager) > {noformat} > As does the driver log: > {noformat} > 17/02/27 22:53:25 INFO Executor: Running task 106.0 in stage 5.0 (TID 1168) > 17/02/27 22:55:29 INFO Executor: Finished task 106.0 in stage 5.0 (TID 1168). > 2633558 bytes result sent via BlockManager) > {noformat} > Full log from this executor and the {{stderr}} from > {{app-20170227223614-0001/2/stderr}} attached. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18138) More officially deprecate support for Python 2.6, Java 7, and Scala 2.10
[ https://issues.apache.org/jira/browse/SPARK-18138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-18138: Labels: releasenotes (was: ) > More officially deprecate support for Python 2.6, Java 7, and Scala 2.10 > > > Key: SPARK-18138 > URL: https://issues.apache.org/jira/browse/SPARK-18138 > Project: Spark > Issue Type: Task >Reporter: Reynold Xin >Assignee: Sean Owen >Priority: Blocker > Labels: releasenotes > Fix For: 2.1.0 > > > Plan: > - Mark it very explicit in Spark 2.1.0 that support for the aforementioned > environments are deprecated. > - Remove support it Spark 2.2.0 > Also see mailing list discussion: > http://apache-spark-developers-list.1001551.n3.nabble.com/Straw-poll-dropping-support-for-things-like-Scala-2-10-tp19553p19577.html -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19854) Refactor file partitioning strategy to make it easier to extend / unit test
Reynold Xin created SPARK-19854: --- Summary: Refactor file partitioning strategy to make it easier to extend / unit test Key: SPARK-19854 URL: https://issues.apache.org/jira/browse/SPARK-19854 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 2.1.0 Reporter: Reynold Xin Assignee: Reynold Xin The file partitioning strategy is currently hard-coded in FileSourceScanExec. This is not ideal for two reasons: 1. It is difficult to unit test the default strategy; testing it today requires almost end-to-end tests that create actual files on the file system. 2. It is difficult to experiment with different partitioning strategies without adding a lot of if branches. The goal of this story is to create an internal interface so that the strategy becomes pluggable, enabling both better testing and experimentation. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
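Purely as a sketch of the kind of internal, pluggable interface the ticket describes; every name below (FilePartitionStrategy, FileSlice, FileBucket, GreedyBySize) is an assumption for illustration, not Spark's actual API.
{code}
case class FileSlice(path: String, start: Long, length: Long)  // stand-in for one file split
case class FileBucket(slices: Seq[FileSlice])                  // stand-in for one scan partition

trait FilePartitionStrategy {
  // Decide how file slices are grouped into the partitions a scan will read.
  def partition(slices: Seq[FileSlice], maxPartitionBytes: Long): Seq[FileBucket]
}

// A trivial strategy that is easy to unit test without touching a file system:
// greedy packing of slices into buckets of at most maxPartitionBytes.
object GreedyBySize extends FilePartitionStrategy {
  override def partition(slices: Seq[FileSlice], maxPartitionBytes: Long): Seq[FileBucket] = {
    val buckets = scala.collection.mutable.ArrayBuffer.empty[Vector[FileSlice]]
    var current = Vector.empty[FileSlice]
    var bytes = 0L
    for (s <- slices) {
      if (current.nonEmpty && bytes + s.length > maxPartitionBytes) {
        buckets += current
        current = Vector.empty
        bytes = 0L
      }
      current = current :+ s
      bytes += s.length
    }
    if (current.nonEmpty) buckets += current
    buckets.map(FileBucket(_)).toSeq
  }
}
{code}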
[jira] [Created] (SPARK-19855) Create an internal FilePartitionStrategy interface
Reynold Xin created SPARK-19855: --- Summary: Create an internal FilePartitionStrategy interface Key: SPARK-19855 URL: https://issues.apache.org/jira/browse/SPARK-19855 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.1.0 Reporter: Reynold Xin Assignee: Reynold Xin -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19856) Turn partitioning related test cases in FileSourceStrategySuite into unit tests
Reynold Xin created SPARK-19856: --- Summary: Turn partitioning related test cases in FileSourceStrategySuite into unit tests Key: SPARK-19856 URL: https://issues.apache.org/jira/browse/SPARK-19856 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.1.0 Reporter: Reynold Xin -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19856) Turn partitioning related test cases in FileSourceStrategySuite from integration tests into unit tests
[ https://issues.apache.org/jira/browse/SPARK-19856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-19856: Summary: Turn partitioning related test cases in FileSourceStrategySuite from integration tests into unit tests (was: Turn partitioning related test cases in FileSourceStrategySuite into unit tests) > Turn partitioning related test cases in FileSourceStrategySuite from > integration tests into unit tests > -- > > Key: SPARK-19856 > URL: https://issues.apache.org/jira/browse/SPARK-19856 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.0 >Reporter: Reynold Xin > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19855) Create an internal FilePartitionStrategy interface
[ https://issues.apache.org/jira/browse/SPARK-19855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19855: Assignee: Apache Spark (was: Reynold Xin) > Create an internal FilePartitionStrategy interface > -- > > Key: SPARK-19855 > URL: https://issues.apache.org/jira/browse/SPARK-19855 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.0 >Reporter: Reynold Xin >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19855) Create an internal FilePartitionStrategy interface
[ https://issues.apache.org/jira/browse/SPARK-19855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900315#comment-15900315 ] Apache Spark commented on SPARK-19855: -- User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/17196 > Create an internal FilePartitionStrategy interface > -- > > Key: SPARK-19855 > URL: https://issues.apache.org/jira/browse/SPARK-19855 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.0 >Reporter: Reynold Xin >Assignee: Reynold Xin > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19855) Create an internal FilePartitionStrategy interface
[ https://issues.apache.org/jira/browse/SPARK-19855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19855: Assignee: Reynold Xin (was: Apache Spark) > Create an internal FilePartitionStrategy interface > -- > > Key: SPARK-19855 > URL: https://issues.apache.org/jira/browse/SPARK-19855 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.0 >Reporter: Reynold Xin >Assignee: Reynold Xin > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19857) CredentialUpdater calculates the wrong time for next update
Marcelo Vanzin created SPARK-19857: -- Summary: CredentialUpdater calculates the wrong time for next update Key: SPARK-19857 URL: https://issues.apache.org/jira/browse/SPARK-19857 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 2.1.0 Reporter: Marcelo Vanzin This is the code: {code} val remainingTime = getTimeOfNextUpdateFromFileName(credentialsStatus.getPath) - System.currentTimeMillis() {code} If you spot the problem, you get a virtual cookie. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
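A guess at the problem being hinted at (an assumption, not a statement of the eventual fix): with the subtraction on its own line, Scala's semicolon inference ends the assignment after the first expression and treats {{- System.currentTimeMillis()}} as a separate, discarded statement, so the value is an absolute timestamp rather than a remaining delta. A standalone toy reproduction:
{code}
def demo(): Unit = {
  val nextUpdateAt = System.currentTimeMillis() + 60000L  // stand-in for getTimeOfNextUpdateFromFileName(...)

  val broken = nextUpdateAt
  - System.currentTimeMillis()     // parsed as its own statement and discarded

  val fixed = (nextUpdateAt
    - System.currentTimeMillis())  // parentheses keep it one expression

  println(broken)  // a huge epoch-millis value, not a delay
  println(fixed)   // roughly 60000
}
{code}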
[jira] [Assigned] (SPARK-19857) CredentialUpdater calculates the wrong time for next update
[ https://issues.apache.org/jira/browse/SPARK-19857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19857: Assignee: Apache Spark > CredentialUpdater calculates the wrong time for next update > --- > > Key: SPARK-19857 > URL: https://issues.apache.org/jira/browse/SPARK-19857 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.1.0 >Reporter: Marcelo Vanzin >Assignee: Apache Spark > > This is the code: > {code} > val remainingTime = > getTimeOfNextUpdateFromFileName(credentialsStatus.getPath) > - System.currentTimeMillis() > {code} > If you spot the problem, you get a virtual cookie. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19857) CredentialUpdater calculates the wrong time for next update
[ https://issues.apache.org/jira/browse/SPARK-19857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19857: Assignee: (was: Apache Spark) > CredentialUpdater calculates the wrong time for next update > --- > > Key: SPARK-19857 > URL: https://issues.apache.org/jira/browse/SPARK-19857 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.1.0 >Reporter: Marcelo Vanzin > > This is the code: > {code} > val remainingTime = > getTimeOfNextUpdateFromFileName(credentialsStatus.getPath) > - System.currentTimeMillis() > {code} > If you spot the problem, you get a virtual cookie. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19857) CredentialUpdater calculates the wrong time for next update
[ https://issues.apache.org/jira/browse/SPARK-19857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900363#comment-15900363 ] Apache Spark commented on SPARK-19857: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/17198 > CredentialUpdater calculates the wrong time for next update > --- > > Key: SPARK-19857 > URL: https://issues.apache.org/jira/browse/SPARK-19857 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.1.0 >Reporter: Marcelo Vanzin > > This is the code: > {code} > val remainingTime = > getTimeOfNextUpdateFromFileName(credentialsStatus.getPath) > - System.currentTimeMillis() > {code} > If you spot the problem, you get a virtual cookie. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19858) Add output mode to flatMapGroupsWithState and disallow invalid cases
Shixiong Zhu created SPARK-19858: Summary: Add output mode to flatMapGroupsWithState and disallow invalid cases Key: SPARK-19858 URL: https://issues.apache.org/jira/browse/SPARK-19858 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 2.2.0 Reporter: Shixiong Zhu Assignee: Shixiong Zhu -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19858) Add output mode to flatMapGroupsWithState and disallow invalid cases
[ https://issues.apache.org/jira/browse/SPARK-19858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19858: Assignee: Apache Spark (was: Shixiong Zhu) > Add output mode to flatMapGroupsWithState and disallow invalid cases > > > Key: SPARK-19858 > URL: https://issues.apache.org/jira/browse/SPARK-19858 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Shixiong Zhu >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19858) Add output mode to flatMapGroupsWithState and disallow invalid cases
[ https://issues.apache.org/jira/browse/SPARK-19858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900377#comment-15900377 ] Apache Spark commented on SPARK-19858: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/17197 > Add output mode to flatMapGroupsWithState and disallow invalid cases > > > Key: SPARK-19858 > URL: https://issues.apache.org/jira/browse/SPARK-19858 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19858) Add output mode to flatMapGroupsWithState and disallow invalid cases
[ https://issues.apache.org/jira/browse/SPARK-19858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19858: Assignee: Shixiong Zhu (was: Apache Spark) > Add output mode to flatMapGroupsWithState and disallow invalid cases > > > Key: SPARK-19858 > URL: https://issues.apache.org/jira/browse/SPARK-19858 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19859) The new watermark should override the old one
Shixiong Zhu created SPARK-19859: Summary: The new watermark should override the old one Key: SPARK-19859 URL: https://issues.apache.org/jira/browse/SPARK-19859 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.1.0 Reporter: Shixiong Zhu Assignee: Shixiong Zhu The new watermark should override the old one. Otherwise, we just pick up the first column that has a watermark, which may be unexpected. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
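A minimal sketch of the intended behavior (assumes a streaming Dataset named {{events}} with an {{eventTime}} column; both names are illustrative): the watermark set later should be the one in effect.
{code}
val withWm = events
  .withWatermark("eventTime", "10 minutes")
  .withWatermark("eventTime", "1 hour")  // should override the 10-minute watermark above
{code}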
[jira] [Assigned] (SPARK-19859) The new watermark should override the old one
[ https://issues.apache.org/jira/browse/SPARK-19859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19859: Assignee: Apache Spark (was: Shixiong Zhu) > The new watermark should override the old one > - > > Key: SPARK-19859 > URL: https://issues.apache.org/jira/browse/SPARK-19859 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Shixiong Zhu >Assignee: Apache Spark > > The new watermark should override the old one. Otherwise, we just pick up the > first column which has a watermark, it may be unexpected. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19859) The new watermark should override the old one
[ https://issues.apache.org/jira/browse/SPARK-19859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900410#comment-15900410 ] Apache Spark commented on SPARK-19859: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/17199 > The new watermark should override the old one > - > > Key: SPARK-19859 > URL: https://issues.apache.org/jira/browse/SPARK-19859 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > The new watermark should override the old one. Otherwise, we just pick up the > first column which has a watermark, it may be unexpected. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19859) The new watermark should override the old one
[ https://issues.apache.org/jira/browse/SPARK-19859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19859: Assignee: Shixiong Zhu (was: Apache Spark) > The new watermark should override the old one > - > > Key: SPARK-19859 > URL: https://issues.apache.org/jira/browse/SPARK-19859 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > The new watermark should override the old one. Otherwise, we just pick up the > first column which has a watermark, it may be unexpected. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19857) CredentialUpdater calculates the wrong time for next update
[ https://issues.apache.org/jira/browse/SPARK-19857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-19857. Resolution: Fixed Assignee: Marcelo Vanzin Fix Version/s: 2.2.0 2.1.1 > CredentialUpdater calculates the wrong time for next update > --- > > Key: SPARK-19857 > URL: https://issues.apache.org/jira/browse/SPARK-19857 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.1.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin > Fix For: 2.1.1, 2.2.0 > > > This is the code: > {code} > val remainingTime = > getTimeOfNextUpdateFromFileName(credentialsStatus.getPath) > - System.currentTimeMillis() > {code} > If you spot the problem, you get a virtual cookie. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
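A guess at the virtual cookie, based only on the snippet above (the original code breaks the line before the minus sign): Scala's semicolon inference ends the first statement at the end of the line, so the leading "-" on the next line becomes a separate unary-minus expression whose value is discarded, and remainingTime ends up holding an absolute timestamp instead of a duration. A self-contained illustration of the pitfall and the obvious repair:
{code}
object LeadingMinusPitfall {
  // Stand-in for the real helper: "one minute from now" as epoch milliseconds.
  def timeOfNextUpdate: Long = System.currentTimeMillis() + 60000L

  def main(args: Array[String]): Unit = {
    // The leading "-" on the second line parses as its own unary-minus
    // statement, so the val is assigned the absolute time, not a delta.
    val remainingTimeBuggy = timeOfNextUpdate
    - System.currentTimeMillis()

    // Keeping the operator on the same line (or parenthesising the whole
    // expression) yields the intended duration.
    val remainingTimeFixed = timeOfNextUpdate - System.currentTimeMillis()

    println(s"buggy: $remainingTimeBuggy ms, fixed: $remainingTimeFixed ms")
  }
}
{code}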
[jira] [Commented] (SPARK-19852) StringIndexer.setHandleInvalid should have another option 'new': Python API and docs
[ https://issues.apache.org/jira/browse/SPARK-19852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900463#comment-15900463 ] Vincent commented on SPARK-19852: - I can work on this issue, since it is related to SPARK-17498 > StringIndexer.setHandleInvalid should have another option 'new': Python API > and docs > > > Key: SPARK-19852 > URL: https://issues.apache.org/jira/browse/SPARK-19852 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.2.0 >Reporter: Joseph K. Bradley >Priority: Minor > > Update Python API for StringIndexer so setHandleInvalid doc is correct. This > will probably require: > * putting HandleInvalid within StringIndexer to update its built-in doc (See > Bucketizer for an example.) > * updating API docs and maybe the guide -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
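The ticket is about the Python API and docs, but for context, this is the parameter in question as it behaves today in the Scala API (data and column names are made up; the name of the proposed third option is not settled here):
{code}
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.SparkSession

object HandleInvalidSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("handle-invalid-sketch").getOrCreate()
    import spark.implicits._

    val train = Seq("a", "b", "c").toDF("category")
    val test  = Seq("a", "d").toDF("category") // "d" was never seen while fitting

    val model = new StringIndexer()
      .setInputCol("category")
      .setOutputCol("categoryIndex")
      .setHandleInvalid("skip") // today: "error" (default) or "skip"
      .fit(train)

    // With "skip" the unseen label is dropped; the proposed third option would
    // instead keep such rows under a dedicated index.
    model.transform(test).show()
  }
}
{code}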
[jira] [Commented] (SPARK-19561) Pyspark Dataframes don't allow timestamps near epoch
[ https://issues.apache.org/jira/browse/SPARK-19561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900524#comment-15900524 ] Apache Spark commented on SPARK-19561: -- User 'JasonMWhite' has created a pull request for this issue: https://github.com/apache/spark/pull/17200 > Pyspark Dataframes don't allow timestamps near epoch > > > Key: SPARK-19561 > URL: https://issues.apache.org/jira/browse/SPARK-19561 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.0.1, 2.1.0 >Reporter: Jason White >Assignee: Jason White > Fix For: 2.1.1, 2.2.0 > > > Pyspark does not allow timestamps at or near the epoch to be created in a > DataFrame. Related issue: https://issues.apache.org/jira/browse/SPARK-19299 > TimestampType.toInternal converts a datetime object to a number representing > microseconds since the epoch. For all times more than 2148 seconds before or > after 1970-01-01T00:00:00+, this number is greater than 2^31 and Py4J > automatically serializes it as a long. > However, for times within this range (~35 minutes before or after the epoch), > Py4J serializes it as an int. When creating the object on the Scala side, > ints are not recognized and the value goes to null. This leads to null values > in non-nullable fields, and corrupted Parquet files. > The solution is trivial - force TimestampType.toInternal to always return a > long. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
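The "~35 minutes" window in the description follows directly from where a microsecond count stops fitting in a signed 32-bit int; a quick arithmetic check (independent of the PySpark fix itself):
{code}
object EpochWindowArithmetic {
  def main(args: Array[String]): Unit = {
    // Timestamps are serialized as microseconds since the epoch; values that fit
    // in a signed 32-bit int are sent by Py4J as an int, larger ones as a long.
    val boundaryMicros = Int.MaxValue.toLong + 1 // 2^31 = 2147483648 microseconds
    val seconds = boundaryMicros / 1e6           // ~2147.5 s, the "2148 seconds" above
    val minutes = seconds / 60                   // ~35.8 min
    println(f"int-sized range ends $seconds%.0f s (about $minutes%.1f min) either side of the epoch")
  }
}
{code}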
[jira] [Commented] (SPARK-16333) Excessive Spark history event/json data size (5GB each)
[ https://issues.apache.org/jira/browse/SPARK-16333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900533#comment-15900533 ] Jim Kleckner commented on SPARK-16333: -- I ended up here when looking into why an upgrade of our streaming computation to 2.1.0 was pegging the network at a gigabit/second. Setting spark.eventLog.enabled to false confirmed that this logging from slave port 50010 was the culprit. How can anyone with seriously large numbers of tasks use the Spark history server with this amount of load? > Excessive Spark history event/json data size (5GB each) > --- > > Key: SPARK-16333 > URL: https://issues.apache.org/jira/browse/SPARK-16333 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 > Environment: this is seen on both x86 (Intel(R) Xeon(R), E5-2699 ) > and ppc platform (Habanero, Model: 8348-21C), Red Hat Enterprise Linux Server > release 7.2 (Maipo)., Spark2.0.0-preview (May-24, 2016 build) >Reporter: Peter Liu > Labels: performance, spark2.0.0 > > With Spark2.0.0-preview (May-24 build), the history event data (the json > file), that is generated for each Spark application run (see below), can be > as big as 5GB (instead of 14 MB for exactly the same application run and the > same input data of 1TB under Spark1.6.1) > -rwxrwx--- 1 root root 5.3G Jun 30 09:39 app-20160630091959- > -rwxrwx--- 1 root root 5.3G Jun 30 09:56 app-20160630094213- > -rwxrwx--- 1 root root 5.3G Jun 30 10:13 app-20160630095856- > -rwxrwx--- 1 root root 5.3G Jun 30 10:30 app-20160630101556- > The test is done with Sparkbench V2, SQL RDD (see github: > https://github.com/SparkTC/spark-bench) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
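For anyone isolating the same overhead, the switch referred to above is the standard spark.eventLog.enabled property; a minimal way to turn it off for a single application (the directory value is shown only as an example):
{code}
import org.apache.spark.sql.SparkSession

object EventLogOffSketch {
  def main(args: Array[String]): Unit = {
    // Disable the event log for this application only, to rule it out as the
    // source of the network load described in the comment.
    val spark = SparkSession.builder()
      .appName("event-log-off")
      .config("spark.eventLog.enabled", "false")
      // .config("spark.eventLog.dir", "hdfs:///spark-history") // only relevant when enabled
      .getOrCreate()

    spark.range(10).count() // placeholder workload
    spark.stop()
  }
}
{code}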
[jira] [Assigned] (SPARK-18055) Dataset.flatMap can't work with types from customized jar
[ https://issues.apache.org/jira/browse/SPARK-18055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust reassigned SPARK-18055: Assignee: Michael Armbrust > Dataset.flatMap can't work with types from customized jar > - > > Key: SPARK-18055 > URL: https://issues.apache.org/jira/browse/SPARK-18055 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Davies Liu >Assignee: Michael Armbrust > Attachments: test-jar_2.11-1.0.jar > > > Try to apply flatMap() on Dataset column which of of type > com.A.B > Here's a schema of a dataset: > {code} > root > |-- id: string (nullable = true) > |-- outputs: array (nullable = true) > ||-- element: string > {code} > flatMap works on RDD > {code} > ds.rdd.flatMap(_.outputs) > {code} > flatMap doesnt work on dataset and gives the following error > {code} > ds.flatMap(_.outputs) > {code} > The exception: > {code} > scala.ScalaReflectionException: class com.A.B in JavaMirror … not found > at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:123) > at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:22) > at > line189424fbb8cd47b3b62dc41e417841c159.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$typecreator3$1.apply(:51) > at > scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232) > at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232) > at > org.apache.spark.sql.SQLImplicits$$typecreator9$1.apply(SQLImplicits.scala:125) > at > scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232) > at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:49) > at > org.apache.spark.sql.SQLImplicits.newProductSeqEncoder(SQLImplicits.scala:125) > {code} > Spoke to Michael Armbrust and he confirmed it as a Dataset bug. > There is a workaround using explode() > {code} > ds.select(explode(col("outputs"))) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19810) Remove support for Scala 2.10
[ https://issues.apache.org/jira/browse/SPARK-19810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900548#comment-15900548 ] Min Shen commented on SPARK-19810: -- [~srowen], Want to get an idea regarding the timeline for removing Scala 2.10. We have heavy usage of Spark at LinkedIn, and we are right now still deploying Spark built with Scala 2.10 due to various dependencies on other systems we have which still rely on Scala 2.10. While we also have plans to upgrade our various internal systems to start using Scala 2.11, it will take a while for that to happen. In the mean time, if support for Scala 2.10 is removed in Spark 2.2, this is going to potentially block us from upgrading to Spark 2.2+ while we haven't fully moved off Scala 2.10 yet. Want to raise this concern here and also to understand the timeline for removing Scala 2.10 in Spark. > Remove support for Scala 2.10 > - > > Key: SPARK-19810 > URL: https://issues.apache.org/jira/browse/SPARK-19810 > Project: Spark > Issue Type: Task > Components: ML, Spark Core, SQL >Affects Versions: 2.1.0 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Critical > > This tracks the removal of Scala 2.10 support, as discussed in > http://apache-spark-developers-list.1001551.n3.nabble.com/Straw-poll-dropping-support-for-things-like-Scala-2-10-td19553.html > and other lists. > The primary motivations are to simplify the code and build, and to enable > Scala 2.12 support later. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18055) Dataset.flatMap can't work with types from customized jar
[ https://issues.apache.org/jira/browse/SPARK-18055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-18055: - Target Version/s: 2.2.0 > Dataset.flatMap can't work with types from customized jar > - > > Key: SPARK-18055 > URL: https://issues.apache.org/jira/browse/SPARK-18055 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Davies Liu >Assignee: Michael Armbrust > Attachments: test-jar_2.11-1.0.jar > > > Try to apply flatMap() on Dataset column which of of type > com.A.B > Here's a schema of a dataset: > {code} > root > |-- id: string (nullable = true) > |-- outputs: array (nullable = true) > ||-- element: string > {code} > flatMap works on RDD > {code} > ds.rdd.flatMap(_.outputs) > {code} > flatMap doesnt work on dataset and gives the following error > {code} > ds.flatMap(_.outputs) > {code} > The exception: > {code} > scala.ScalaReflectionException: class com.A.B in JavaMirror … not found > at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:123) > at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:22) > at > line189424fbb8cd47b3b62dc41e417841c159.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$typecreator3$1.apply(:51) > at > scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232) > at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232) > at > org.apache.spark.sql.SQLImplicits$$typecreator9$1.apply(SQLImplicits.scala:125) > at > scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232) > at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:49) > at > org.apache.spark.sql.SQLImplicits.newProductSeqEncoder(SQLImplicits.scala:125) > {code} > Spoke to Michael Armbrust and he confirmed it as a Dataset bug. > There is a workaround using explode() > {code} > ds.select(explode(col("outputs"))) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18055) Dataset.flatMap can't work with types from customized jar
[ https://issues.apache.org/jira/browse/SPARK-18055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18055: Assignee: Apache Spark (was: Michael Armbrust) > Dataset.flatMap can't work with types from customized jar > - > > Key: SPARK-18055 > URL: https://issues.apache.org/jira/browse/SPARK-18055 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Davies Liu >Assignee: Apache Spark > Attachments: test-jar_2.11-1.0.jar > > > Try to apply flatMap() on Dataset column which of of type > com.A.B > Here's a schema of a dataset: > {code} > root > |-- id: string (nullable = true) > |-- outputs: array (nullable = true) > ||-- element: string > {code} > flatMap works on RDD > {code} > ds.rdd.flatMap(_.outputs) > {code} > flatMap doesnt work on dataset and gives the following error > {code} > ds.flatMap(_.outputs) > {code} > The exception: > {code} > scala.ScalaReflectionException: class com.A.B in JavaMirror … not found > at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:123) > at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:22) > at > line189424fbb8cd47b3b62dc41e417841c159.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$typecreator3$1.apply(:51) > at > scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232) > at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232) > at > org.apache.spark.sql.SQLImplicits$$typecreator9$1.apply(SQLImplicits.scala:125) > at > scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232) > at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:49) > at > org.apache.spark.sql.SQLImplicits.newProductSeqEncoder(SQLImplicits.scala:125) > {code} > Spoke to Michael Armbrust and he confirmed it as a Dataset bug. > There is a workaround using explode() > {code} > ds.select(explode(col("outputs"))) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18055) Dataset.flatMap can't work with types from customized jar
[ https://issues.apache.org/jira/browse/SPARK-18055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18055: Assignee: Michael Armbrust (was: Apache Spark) > Dataset.flatMap can't work with types from customized jar > - > > Key: SPARK-18055 > URL: https://issues.apache.org/jira/browse/SPARK-18055 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Davies Liu >Assignee: Michael Armbrust > Attachments: test-jar_2.11-1.0.jar > > > Try to apply flatMap() on Dataset column which of of type > com.A.B > Here's a schema of a dataset: > {code} > root > |-- id: string (nullable = true) > |-- outputs: array (nullable = true) > ||-- element: string > {code} > flatMap works on RDD > {code} > ds.rdd.flatMap(_.outputs) > {code} > flatMap doesnt work on dataset and gives the following error > {code} > ds.flatMap(_.outputs) > {code} > The exception: > {code} > scala.ScalaReflectionException: class com.A.B in JavaMirror … not found > at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:123) > at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:22) > at > line189424fbb8cd47b3b62dc41e417841c159.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$typecreator3$1.apply(:51) > at > scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232) > at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232) > at > org.apache.spark.sql.SQLImplicits$$typecreator9$1.apply(SQLImplicits.scala:125) > at > scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232) > at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:49) > at > org.apache.spark.sql.SQLImplicits.newProductSeqEncoder(SQLImplicits.scala:125) > {code} > Spoke to Michael Armbrust and he confirmed it as a Dataset bug. > There is a workaround using explode() > {code} > ds.select(explode(col("outputs"))) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18055) Dataset.flatMap can't work with types from customized jar
[ https://issues.apache.org/jira/browse/SPARK-18055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900568#comment-15900568 ] Apache Spark commented on SPARK-18055: -- User 'marmbrus' has created a pull request for this issue: https://github.com/apache/spark/pull/17201 > Dataset.flatMap can't work with types from customized jar > - > > Key: SPARK-18055 > URL: https://issues.apache.org/jira/browse/SPARK-18055 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Davies Liu >Assignee: Michael Armbrust > Attachments: test-jar_2.11-1.0.jar > > > Try to apply flatMap() on Dataset column which of of type > com.A.B > Here's a schema of a dataset: > {code} > root > |-- id: string (nullable = true) > |-- outputs: array (nullable = true) > ||-- element: string > {code} > flatMap works on RDD > {code} > ds.rdd.flatMap(_.outputs) > {code} > flatMap doesnt work on dataset and gives the following error > {code} > ds.flatMap(_.outputs) > {code} > The exception: > {code} > scala.ScalaReflectionException: class com.A.B in JavaMirror … not found > at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:123) > at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:22) > at > line189424fbb8cd47b3b62dc41e417841c159.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$typecreator3$1.apply(:51) > at > scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232) > at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232) > at > org.apache.spark.sql.SQLImplicits$$typecreator9$1.apply(SQLImplicits.scala:125) > at > scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232) > at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:49) > at > org.apache.spark.sql.SQLImplicits.newProductSeqEncoder(SQLImplicits.scala:125) > {code} > Spoke to Michael Armbrust and he confirmed it as a Dataset bug. > There is a workaround using explode() > {code} > ds.select(explode(col("outputs"))) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19860) DataFrame join get conflict error if two frames has a same name column.
wuchang created SPARK-19860: --- Summary: DataFrame join get conflict error if two frames has a same name column. Key: SPARK-19860 URL: https://issues.apache.org/jira/browse/SPARK-19860 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.1.0 Reporter: wuchang >>> print df1.collect() [Row(fdate=u'20170223', in_amount1=7758588), Row(fdate=u'20170302', in_amount1=7656414), Row(fdate=u'20170207', in_amount1=7836305), Row(fdate=u'20170208', in_amount1=14887432), Row(fdate=u'20170224', in_amount1=16506043), Row(fdate=u'20170201', in_amount1=7339381), Row(fdate=u'20170221', in_amount1=7490447), Row(fdate=u'20170303', in_amount1=11142114), Row(fdate=u'20170202', in_amount1=7882746), Row(fdate=u'20170306', in_amount1=12977822), Row(fdate=u'20170227', in_amount1=15480688), Row(fdate=u'20170206', in_amount1=11370812), Row(fdate=u'20170217', in_amount1=8208985), Row(fdate=u'20170203', in_amount1=8175477), Row(fdate=u'20170222', in_amount1=11032303), Row(fdate=u'20170216', in_amount1=11986702), Row(fdate=u'20170209', in_amount1=9082380), Row(fdate=u'20170214', in_amount1=8142569), Row(fdate=u'20170307', in_amount1=11092829), Row(fdate=u'20170213', in_amount1=12341887), Row(fdate=u'20170228', in_amount1=13966203), Row(fdate=u'20170220', in_amount1=9397558), Row(fdate=u'20170210', in_amount1=8205431), Row(fdate=u'20170215', in_amount1=7070829), Row(fdate=u'20170301', in_amount1=10159653)] >>> print df2.collect() [Row(fdate=u'20170223', in_amount2=7072120), Row(fdate=u'20170302', in_amount2=5548515), Row(fdate=u'20170207', in_amount2=5451110), Row(fdate=u'20170208', in_amount2=4483131), Row(fdate=u'20170224', in_amount2=9674888), Row(fdate=u'20170201', in_amount2=3227502), Row(fdate=u'20170221', in_amount2=5084800), Row(fdate=u'20170303', in_amount2=20577801), Row(fdate=u'20170202', in_amount2=4024218), Row(fdate=u'20170306', in_amount2=8581773), Row(fdate=u'20170227', in_amount2=5748035), Row(fdate=u'20170206', in_amount2=7330154), Row(fdate=u'20170217', in_amount2=6838105), Row(fdate=u'20170203', in_amount2=9390262), Row(fdate=u'20170222', in_amount2=3800662), Row(fdate=u'20170216', in_amount2=4338891), Row(fdate=u'20170209', in_amount2=4024611), Row(fdate=u'20170214', in_amount2=4030389), Row(fdate=u'20170307', in_amount2=5504936), Row(fdate=u'20170213', in_amount2=7142428), Row(fdate=u'20170228', in_amount2=8618951), Row(fdate=u'20170220', in_amount2=8172290), Row(fdate=u'20170210', in_amount2=8411312), Row(fdate=u'20170215', in_amount2=5302422), Row(fdate=u'20170301', in_amount2=9475418)] >>> ht_net_in_df = df1.join(df2,df1.fdate == df2.fdate,'inner') 2017-03-08 10:27:34,357 WARN [Thread-2] sql.Column: Constructing trivially true equals predicate, 'fdate#42 = fdate#42'. Perhaps you need to use aliases. 
Traceback (most recent call last): File "", line 1, in File "/home/spark/python/pyspark/sql/dataframe.py", line 652, in join jdf = self._jdf.join(other._jdf, on._jc, how) File "/home/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__ File "/home/spark/python/pyspark/sql/utils.py", line 69, in deco raise AnalysisException(s.split(': ', 1)[1], stackTrace) pyspark.sql.utils.AnalysisException: u" Failure when resolving conflicting references in Join: 'Join Inner, (fdate#42 = fdate#42) :- Aggregate [fdate#42], [fdate#42, cast(sum(cast(inoutmoney#47 as double)) as int) AS in_amount1#97] : +- Filter (inorout#44 = A) : +- Project [firm_id#40, partnerid#45, inorout#44, inoutmoney#47, fdate#42] :+- Filter (((partnerid#45 = pmec) && NOT (firm_id#40 = NULL)) && (NOT (firm_id#40 = -1) && (fdate#42 >= 20170201))) : +- SubqueryAlias history_transfer_v : +- Project [md5(cast(firmid#41 as binary)) AS FIRM_ID#40, fdate#42, ftime#43, inorout#44, partnerid#45, realdate#46, inoutmoney#47, bankwaterid#48, waterid#49, waterstate#50, source#51] : +- SubqueryAlias history_transfer :+- Relation[firmid#41,fdate#42,ftime#43,inorout#44,partnerid#45,realdate#46,inoutmoney#47,bankwaterid#48,waterid#49,waterstate#50,source#51] parquet +- Aggregate [fdate#42], [fdate#42, cast(sum(cast(inoutmoney#47 as double)) as int)
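The report above is against PySpark, but the same conflicting-references failure appears in the Scala API whenever both join inputs come from the same lineage. A sketch with made-up data of the two usual workarounds: joining on the column name (which also de-duplicates it) or aliasing both sides.
{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

object SameColumnJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("same-column-join-sketch").getOrCreate()
    import spark.implicits._

    val base = Seq(("20170223", 10, "A"), ("20170223", 5, "B"))
      .toDF("fdate", "amount", "inorout")
    val df1 = base.filter($"inorout" === "A").groupBy("fdate").agg(sum("amount").as("in_amount1"))
    val df2 = base.filter($"inorout" === "B").groupBy("fdate").agg(sum("amount").as("in_amount2"))

    // Ambiguous, as in the report: both fdate references resolve to the same attribute.
    // df1.join(df2, df1("fdate") === df2("fdate"), "inner")

    // Workaround 1: join on the column name.
    val joined1 = df1.join(df2, Seq("fdate"), "inner")

    // Workaround 2: alias both sides and qualify the join columns.
    val a = df1.alias("a")
    val b = df2.alias("b")
    val joined2 = a.join(b, $"a.fdate" === $"b.fdate", "inner")

    joined1.show()
    joined2.show()
  }
}
{code}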
[jira] [Commented] (SPARK-18359) Let user specify locale in CSV parsing
[ https://issues.apache.org/jira/browse/SPARK-18359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900608#comment-15900608 ] Takeshi Yamamuro commented on SPARK-18359: -- Since JDK9 uses CLDR as its default locale data, it seems good to handle this explicitly: http://openjdk.java.net/jeps/252 > Let user specify locale in CSV parsing > -- > > Key: SPARK-18359 > URL: https://issues.apache.org/jira/browse/SPARK-18359 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0, 2.0.1 >Reporter: yannick Radji > > On the DataFrameReader object there is no CSV-specific option to set the decimal > delimiter to a comma instead of a dot, as is customary in France and elsewhere in Europe. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
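Until a locale (or decimal-separator) option exists, a workaround that is often used is to read the affected column as a string and normalise it; the file path, column name, and ';' field separator below are assumptions for the example, not part of any proposed API.
{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.regexp_replace

object CommaDecimalWorkaround {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("csv-locale-workaround").getOrCreate()
    import spark.implicits._

    val raw = spark.read
      .option("header", "true")
      .option("sep", ";")          // many European CSVs use ';' as the field separator
      .csv("/tmp/prices_fr.csv")   // hypothetical input file

    // "1 234,56" -> 1234.56 (plain-space thousands separators only; a sketch, not a full parser)
    val parsed = raw.withColumn(
      "price",
      regexp_replace(regexp_replace($"price", "\\s", ""), ",", ".").cast("double"))

    parsed.printSchema()
  }
}
{code}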
[jira] [Created] (SPARK-19861) watermark should not be a negative time.
Genmao Yu created SPARK-19861: - Summary: watermark should not be a negative time. Key: SPARK-19861 URL: https://issues.apache.org/jira/browse/SPARK-19861 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.1.0, 2.0.2 Reporter: Genmao Yu Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
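A standalone illustration of the validation this issue asks for, not the actual patch: reject a negative delay before it reaches the streaming planner. The tiny parser below only understands "<n> seconds/minutes" and exists purely to keep the example self-contained.
{code}
object NegativeWatermarkCheck {
  def delayToMillis(delay: String): Long = delay.trim.split("\\s+") match {
    case Array(n, "seconds") => n.toLong * 1000
    case Array(n, "minutes") => n.toLong * 60 * 1000
    case _ => throw new IllegalArgumentException(s"unsupported delay: $delay")
  }

  def validatedDelay(delay: String): Long = {
    val ms = delayToMillis(delay)
    require(ms >= 0, s"delay threshold ($delay) should not be negative")
    ms
  }

  def main(args: Array[String]): Unit = {
    println(validatedDelay("10 minutes"))  // prints 600000
    println(validatedDelay("-10 minutes")) // throws IllegalArgumentException
  }
}
{code}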
[jira] [Assigned] (SPARK-19861) watermark should not be a negative time.
[ https://issues.apache.org/jira/browse/SPARK-19861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19861: Assignee: Apache Spark > watermark should not be a negative time. > > > Key: SPARK-19861 > URL: https://issues.apache.org/jira/browse/SPARK-19861 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.0.2, 2.1.0 >Reporter: Genmao Yu >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19861) watermark should not be a negative time.
[ https://issues.apache.org/jira/browse/SPARK-19861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19861: Assignee: (was: Apache Spark) > watermark should not be a negative time. > > > Key: SPARK-19861 > URL: https://issues.apache.org/jira/browse/SPARK-19861 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.0.2, 2.1.0 >Reporter: Genmao Yu >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org