[jira] [Resolved] (SPARK-14879) Move CreateMetastoreDataSource and CreateMetastoreDataSourceAsSelect to sql/core

2016-04-23 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-14879.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12645
[https://github.com/apache/spark/pull/12645]

> Move CreateMetastoreDataSource and CreateMetastoreDataSourceAsSelect to 
> sql/core
> 
>
> Key: SPARK-14879
> URL: https://issues.apache.org/jira/browse/SPARK-14879
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
> Fix For: 2.0.0
>
>







[jira] [Resolved] (SPARK-14833) Refactor StreamTests to test for source fault-tolerance correctly.

2016-04-23 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-14833.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

> Refactor StreamTests to test for source fault-tolerance correctly.
> --
>
> Key: SPARK-14833
> URL: https://issues.apache.org/jira/browse/SPARK-14833
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
> Fix For: 2.0.0
>
>
> The current StreamTest allows testing of a streaming Dataset generated by 
> explicitly wrapping a source. This is different from the actual production 
> code path, where the source object is dynamically created through a 
> DataSource object every time a query is started. So the fault-tolerance tests 
> in FileSourceSuite and FileSourceStressSuite are not really exercising the 
> actual code path, since they just reuse the FileStreamSource object. 
> Instead of maintaining a mapping of source --> expected offset in StreamTest 
> (which requires reuse of the source object), it should maintain a mapping of 
> source index --> offset, so that it is independent of the source object. 
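As an illustration of the bookkeeping change described above (a sketch only, not the actual patch), keying expected offsets by source index instead of by source instance could look like this; the Offset case class is a hypothetical stand-in for whatever offset type a source reports:

{code}
// Hypothetical stand-in for a source's offset type.
case class Offset(value: Long)

object OffsetsByIndex {
  def main(args: Array[String]): Unit = {
    // Keyed by the source's position in the query, so the test harness never
    // needs to hold on to the source object that produced the data.
    val expected = scala.collection.mutable.Map.empty[Int, Offset]
    expected(0) = Offset(10)  // first source in the query
    expected(1) = Offset(3)   // second source in the query
    println(expected)
  }
}
{code}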






[jira] [Created] (SPARK-14882) Programming Guide Improvements

2016-04-23 Thread Ben McCann (JIRA)
Ben McCann created SPARK-14882:
--

 Summary: Programming Guide Improvements
 Key: SPARK-14882
 URL: https://issues.apache.org/jira/browse/SPARK-14882
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Reporter: Ben McCann


I'm reading http://spark.apache.org/docs/latest/programming-guide.html

It says "Spark 1.6.1 uses Scala 2.10. To write applications in Scala, you will 
need to use a compatible Scala version (e.g. 2.10.X)." However, it doesn't seem 
to me that Scala 2.10 is required because I see versions compiled for both 2.10 
and 2.11 in Maven Central.

There are a few references to Tachyon that look like they should be changed to 
Alluxio.






[jira] [Updated] (SPARK-14838) Implement statistics in SerializeFromObject to avoid failure when estimating sizeInBytes for ObjectType

2016-04-23 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-14838:
---
Assignee: Liang-Chi Hsieh

> Implement statistics in SerializeFromObject to avoid failure when estimating 
> sizeInBytes for ObjectType
> ---
>
> Key: SPARK-14838
> URL: https://issues.apache.org/jira/browse/SPARK-14838
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
> Fix For: 2.0.0
>
>
> When doing a join, Spark determines the plan size to decide whether to 
> broadcast it automatically. As it can't estimate the size of an ObjectType, 
> this mechanism throws a failure, as shown in 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56533/consoleFull.
> We should fix it.
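For context on the threshold-based mechanism the description refers to, here is a hedged one-liner (for the Spark 2.0 spark-shell, where a SparkSession named spark is predefined) that disables automatic broadcast joins entirely; this is only a workaround for illustration, not the fix in the pull request:

{code}
// -1 turns off automatic broadcast joins, so the sizeInBytes estimate is no
// longer consulted when planning the join.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
{code}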






[jira] [Resolved] (SPARK-14838) Implement statistics in SerializeFromObject to avoid failure when estimating sizeInBytes for ObjectType

2016-04-23 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-14838.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12599
[https://github.com/apache/spark/pull/12599]

> Implement statistics in SerializeFromObject to avoid failure when estimating 
> sizeInBytes for ObjectType
> ---
>
> Key: SPARK-14838
> URL: https://issues.apache.org/jira/browse/SPARK-14838
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Liang-Chi Hsieh
> Fix For: 2.0.0
>
>
> When doing a join, Spark determines the plan size to decide whether to 
> broadcast it automatically. As it can't estimate the size of an ObjectType, 
> this mechanism throws a failure, as shown in 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56533/consoleFull.
> We should fix it.






[jira] [Updated] (SPARK-14874) Remove the obsolete Batch representation

2016-04-23 Thread Liwei Lin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liwei Lin updated SPARK-14874:
--
Summary: Remove the obsolete Batch representation  (was: Cleanup the 
useless Batch class)

> Remove the obsolete Batch representation
> 
>
> Key: SPARK-14874
> URL: https://issues.apache.org/jira/browse/SPARK-14874
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Liwei Lin
>Priority: Minor
>
> The Batch class, which had been used to indicate progress in a stream, was 
> abandoned by SPARK-13985 and has since become useless.
> Let's:
> - remove the Batch class
> - rename getBatch(...) to getData(...) for Source
> - rename addBatch(...) to addData(...) for Sink






[jira] [Updated] (SPARK-14591) Remove org.apache.spark.sql.catalyst.parser.DataTypeParser

2016-04-23 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-14591:
-
Description: 
Since our parser defined based on antlr 4 can parse data type. We can remove 
org.apache.spark.sql.catalyst.parser.DataTypeParser. Let's make sure the new 
parser's functionality is a super set of DataTypeParser. Then, we can remove 
DataTypeParser. For the object DataTypeParser, we can keep it and let it just 
call the parserDataType method of CatalystSqlParser.

*The original description is shown below*
Right now, our DDLParser does not support {{decimal(precision)}} (the scale 
will be set to 0). We should support it.

  was:
Since our parser defined based on antlr 4 can parse data type. We can remove 
org.apache.spark.sql.catalyst.parser.DataTypeParser. Let's make sure the new 
parser's functionality is a super set of DataTypeParser. Then, we can remove 
DataTypeParser.

*The original description is shown below*
Right now, our DDLParser does not support {{decimal(precision)}} (the scale 
will be set to 0). We should support it.


> Remove org.apache.spark.sql.catalyst.parser.DataTypeParser
> --
>
> Key: SPARK-14591
> URL: https://issues.apache.org/jira/browse/SPARK-14591
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>
> Since our parser defined based on antlr 4 can parse data type. We can remove 
> org.apache.spark.sql.catalyst.parser.DataTypeParser. Let's make sure the new 
> parser's functionality is a super set of DataTypeParser. Then, we can remove 
> DataTypeParser. For the object DataTypeParser, we can keep it and let it just 
> call the parserDataType method of CatalystSqlParser.
> *The original description is shown below*
> Right now, our DDLParser does not support {{decimal(precision)}} (the scale 
> will be set to 0). We should support it.
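A minimal sketch of the delegation idea from the updated description above, assuming the method referred to is CatalystSqlParser.parseDataType; this is an illustration, not the actual change:

{code}
import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
import org.apache.spark.sql.types.DataType

// Keep the old entry point, but make it a thin wrapper over the antlr 4 based parser.
object DataTypeParser {
  def parse(dataTypeString: String): DataType =
    CatalystSqlParser.parseDataType(dataTypeString)
}
{code}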






[jira] [Updated] (SPARK-14591) Remove org.apache.spark.sql.catalyst.parser.DataTypeParser

2016-04-23 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-14591:
-
Description: 
Since our parser defined based on antlr 4 can parse data type (see 
CatalystSqlParser), we can remove 
org.apache.spark.sql.catalyst.parser.DataTypeParser. Let's make sure the new 
parser's functionality is a super set of DataTypeParser. Then, we can remove 
DataTypeParser. For the object DataTypeParser, we can keep it and let it just 
call the parserDataType method of CatalystSqlParser.

*The original description is shown below*
Right now, our DDLParser does not support {{decimal(precision)}} (the scale 
will be set to 0). We should support it.

  was:
Since our parser defined based on antlr 4 can parse data type. We can remove 
org.apache.spark.sql.catalyst.parser.DataTypeParser. Let's make sure the new 
parser's functionality is a super set of DataTypeParser. Then, we can remove 
DataTypeParser. For the object DataTypeParser, we can keep it and let it just 
call the parserDataType method of CatalystSqlParser.

*The original description is shown below*
Right now, our DDLParser does not support {{decimal(precision)}} (the scale 
will be set to 0). We should support it.


> Remove org.apache.spark.sql.catalyst.parser.DataTypeParser
> --
>
> Key: SPARK-14591
> URL: https://issues.apache.org/jira/browse/SPARK-14591
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>
> Since our parser defined based on antlr 4 can parse data type (see 
> CatalystSqlParser), we can remove 
> org.apache.spark.sql.catalyst.parser.DataTypeParser. Let's make sure the new 
> parser's functionality is a super set of DataTypeParser. Then, we can remove 
> DataTypeParser. For the object DataTypeParser, we can keep it and let it just 
> call the parserDataType method of CatalystSqlParser.
> *The original description is shown below*
> Right now, our DDLParser does not support {{decimal(precision)}} (the scale 
> will be set to 0). We should support it.






[jira] [Updated] (SPARK-14591) Remove org.apache.spark.sql.catalyst.parser.DataTypeParser

2016-04-23 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-14591:
-
Priority: Major  (was: Blocker)

> Remove org.apache.spark.sql.catalyst.parser.DataTypeParser
> --
>
> Key: SPARK-14591
> URL: https://issues.apache.org/jira/browse/SPARK-14591
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>
> Since our parser defined based on antlr 4 can parse data type. We can remove 
> org.apache.spark.sql.catalyst.parser.DataTypeParser. Let's make sure the new 
> parser's functionality is a super set of DataTypeParser. Then, we can remove 
> DataTypeParser.
> *The original description is shown below*
> Right now, our DDLParser does not support {{decimal(precision)}} (the scale 
> will be set to 0). We should support it.






[jira] [Updated] (SPARK-14591) Remove org.apache.spark.sql.catalyst.parser.DataTypeParser

2016-04-23 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-14591:
-
Issue Type: Sub-task  (was: Bug)
Parent: SPARK-14776

> Remove org.apache.spark.sql.catalyst.parser.DataTypeParser
> --
>
> Key: SPARK-14591
> URL: https://issues.apache.org/jira/browse/SPARK-14591
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Priority: Blocker
>
> Since our parser defined based on antlr 4 can parse data type. We can remove 
> org.apache.spark.sql.catalyst.parser.DataTypeParser. Let's make sure the new 
> parser's functionality is a super set of DataTypeParser. Then, we can remove 
> DataTypeParser.
> *The original description is shown below*
> Right now, our DDLParser does not support {{decimal(precision)}} (the scale 
> will be set to 0). We should support it.






[jira] [Updated] (SPARK-14591) Remove org.apache.spark.sql.catalyst.parser.DataTypeParser

2016-04-23 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-14591:
-
Description: 
Since our parser defined based on antlr 4 can parse data type. We can remove 
org.apache.spark.sql.catalyst.parser.DataTypeParser. Let's make sure the new 
parser's functionality is a super set of DataTypeParser. Then, we can remove 
DataTypeParser.

*The original description is shown below*
Right now, our DDLParser does not support {{decimal(precision)}} (the scale 
will be set to 0). We should support it.

  was:
Since our parser defined based on antlr 4 can parse data type. We will not need 
to have  

Right now, our DDLParser does not support {{decimal(precision)}} (the scale 
will be set to 0). We should support it.


> Remove org.apache.spark.sql.catalyst.parser.DataTypeParser
> --
>
> Key: SPARK-14591
> URL: https://issues.apache.org/jira/browse/SPARK-14591
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Priority: Blocker
>
> Since our parser defined based on antlr 4 can parse data type. We can remove 
> org.apache.spark.sql.catalyst.parser.DataTypeParser. Let's make sure the new 
> parser's functionality is a super set of DataTypeParser. Then, we can remove 
> DataTypeParser.
> *The original description is shown below*
> Right now, our DDLParser does not support {{decimal(precision)}} (the scale 
> will be set to 0). We should support it.






[jira] [Updated] (SPARK-14591) Remove org.apache.spark.sql.catalyst.parser.DataTypeParser

2016-04-23 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-14591:
-
Summary: Remove org.apache.spark.sql.catalyst.parser.DataTypeParser  (was: 
DDLParser should accept decimal(precision))

> Remove org.apache.spark.sql.catalyst.parser.DataTypeParser
> --
>
> Key: SPARK-14591
> URL: https://issues.apache.org/jira/browse/SPARK-14591
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Priority: Blocker
>
> Right now, our DDLParser does not support {{decimal(precision)}} (the scale 
> will be set to 0). We should support it.






[jira] [Updated] (SPARK-14591) Remove org.apache.spark.sql.catalyst.parser.DataTypeParser

2016-04-23 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-14591:
-
Description: 
Since our parser defined based on antlr 4 can parse data type. We will not need 
to have  

Right now, our DDLParser does not support {{decimal(precision)}} (the scale 
will be set to 0). We should support it.

  was:
Since our parser defined based on antlr 4 can parse data type. We will not need 
to hav e 

Right now, our DDLParser does not support {{decimal(precision)}} (the scale 
will be set to 0). We should support it.


> Remove org.apache.spark.sql.catalyst.parser.DataTypeParser
> --
>
> Key: SPARK-14591
> URL: https://issues.apache.org/jira/browse/SPARK-14591
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Priority: Blocker
>
> Since our parser defined based on antlr 4 can parse data type. We will not 
> need to have  
> Right now, our DDLParser does not support {{decimal(precision)}} (the scale 
> will be set to 0). We should support it.






[jira] [Updated] (SPARK-14591) Remove org.apache.spark.sql.catalyst.parser.DataTypeParser

2016-04-23 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-14591:
-
Description: 
Since our parser defined based on antlr 4 can parse data type. We will not need 
to hav e 

Right now, our DDLParser does not support {{decimal(precision)}} (the scale 
will be set to 0). We should support it.

  was:Right now, our DDLParser does not support {{decimal(precision)}} (the 
scale will be set to 0). We should support it.


> Remove org.apache.spark.sql.catalyst.parser.DataTypeParser
> --
>
> Key: SPARK-14591
> URL: https://issues.apache.org/jira/browse/SPARK-14591
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Priority: Blocker
>
> Since our parser defined based on antlr 4 can parse data type. We will not 
> need to hav e 
> Right now, our DDLParser does not support {{decimal(precision)}} (the scale 
> will be set to 0). We should support it.






[jira] [Updated] (SPARK-14591) DDLParser should accept decimal(precision)

2016-04-23 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-14591:
-
Priority: Blocker  (was: Major)

> DDLParser should accept decimal(precision)
> --
>
> Key: SPARK-14591
> URL: https://issues.apache.org/jira/browse/SPARK-14591
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Priority: Blocker
>
> Right now, our DDLParser does not support {{decimal(precision)}} (the scale 
> will be set to 0). We should support it.






[jira] [Commented] (SPARK-14591) DDLParser should accept decimal(precision)

2016-04-23 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255430#comment-15255430
 ] 

Yin Huai commented on SPARK-14591:
--

[~hvanhovell] Where do we define those reserved keywords? 

> DDLParser should accept decimal(precision)
> --
>
> Key: SPARK-14591
> URL: https://issues.apache.org/jira/browse/SPARK-14591
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>
> Right now, our DDLParser does not support {{decimal(precision)}} (the scale 
> will be set to 0). We should support it.






[jira] [Commented] (SPARK-4298) The spark-submit cannot read Main-Class from Manifest.

2016-04-23 Thread Jimit Kamlesh Raithatha (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255429#comment-15255429
 ] 

Jimit Kamlesh Raithatha commented on SPARK-4298:


Brennon / Friends,

Any chance this issue is open in Spark 1.5.2 (for Hadoop 2.4)?

I still had to launch my program as follows:
./spark-submit --class problem1 /a/b/c/def.jar

My MANIFEST is a simple 2-liner:

Manifest-Version: 1.0
Main-Class: problem1

> The spark-submit cannot read Main-Class from Manifest.
> --
>
> Key: SPARK-4298
> URL: https://issues.apache.org/jira/browse/SPARK-4298
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
> Environment: Linux
> spark-1.1.0-bin-hadoop2.4.tgz
> java version "1.7.0_72"
> Java(TM) SE Runtime Environment (build 1.7.0_72-b14)
> Java HotSpot(TM) 64-Bit Server VM (build 24.72-b04, mixed mode)
>Reporter: Milan Straka
>Assignee: Brennon York
> Fix For: 1.0.3, 1.1.2, 1.2.1, 1.3.0
>
>
> Consider trivial {{test.scala}}:
> {code:title=test.scala|borderStyle=solid}
> import org.apache.spark.SparkContext
> import org.apache.spark.SparkContext._
> object Main {
>   def main(args: Array[String]) {
> val sc = new SparkContext()
> sc.stop()
>   }
> }
> {code}
> When built with {{sbt}} and executed using {{spark-submit 
> target/scala-2.10/test_2.10-1.0.jar}}, I get the following error:
> {code}
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath
> Error: Cannot load main class from JAR: 
> file:/ha/home/straka/s/target/scala-2.10/test_2.10-1.0.jar
> Run with --help for usage help or --verbose for debug output
> {code}
> When executed using {{spark-submit --class Main 
> target/scala-2.10/test_2.10-1.0.jar}}, it works.
> The jar file has correct MANIFEST.MF:
> {code:title=MANIFEST.MF|borderStyle=solid}
> Manifest-Version: 1.0
> Implementation-Vendor: test
> Implementation-Title: test
> Implementation-Version: 1.0
> Implementation-Vendor-Id: test
> Specification-Vendor: test
> Specification-Title: test
> Specification-Version: 1.0
> Main-Class: Main
> {code}
> The problem is that in {{org.apache.spark.deploy.SparkSubmitArguments}}, line 
> 127:
> {code}
>   val jar = new JarFile(primaryResource)
> {code}
> the primaryResource has the String value 
> {{"file:/ha/home/straka/s/target/scala-2.10/test_2.10-1.0.jar"}}, which is a 
> URI, but JarFile accepts only a filesystem path. One way to fix this would be using
> {code}
>   val uri = new URI(primaryResource)
>   val jar = new JarFile(uri.getPath)
> {code}
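A self-contained sketch of the URI-to-path approach proposed above, reading Main-Class straight from a jar's manifest; the jar path is hypothetical and the snippet assumes the jar actually carries a manifest:

{code}
import java.net.URI
import java.util.jar.JarFile

object MainClassFromManifest {
  def main(args: Array[String]): Unit = {
    // A file: URI like the one spark-submit receives as primaryResource.
    val uri = new URI("file:/tmp/test_2.10-1.0.jar")
    // JarFile wants a filesystem path, so strip the scheme via getPath.
    val jar = new JarFile(uri.getPath)
    val mainClass = jar.getManifest.getMainAttributes.getValue("Main-Class")
    jar.close()
    println(s"Main-Class: $mainClass")
  }
}
{code}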






[jira] [Assigned] (SPARK-14881) pyspark and sparkR shell default log level should match spark-shell/Scala

2016-04-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14881:


Assignee: Apache Spark

> pyspark and sparkR shell default log level should match spark-shell/Scala
> -
>
> Key: SPARK-14881
> URL: https://issues.apache.org/jira/browse/SPARK-14881
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Shell, SparkR
>Affects Versions: 2.0.0
>Reporter: Felix Cheung
>Assignee: Apache Spark
>Priority: Minor
>
> Scala spark-shell defaults to log level WARN. pyspark and sparkR should match 
> that by default (user can change it later)
> # ./bin/spark-shell
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel).






[jira] [Commented] (SPARK-14881) pyspark and sparkR shell default log level should match spark-shell/Scala

2016-04-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255427#comment-15255427
 ] 

Apache Spark commented on SPARK-14881:
--

User 'felixcheung' has created a pull request for this issue:
https://github.com/apache/spark/pull/12648

> pyspark and sparkR shell default log level should match spark-shell/Scala
> -
>
> Key: SPARK-14881
> URL: https://issues.apache.org/jira/browse/SPARK-14881
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Shell, SparkR
>Affects Versions: 2.0.0
>Reporter: Felix Cheung
>Priority: Minor
>
> Scala spark-shell defaults to log level WARN. pyspark and sparkR should match 
> that by default (user can change it later)
> # ./bin/spark-shell
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel).






[jira] [Assigned] (SPARK-14881) pyspark and sparkR shell default log level should match spark-shell/Scala

2016-04-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14881:


Assignee: (was: Apache Spark)

> pyspark and sparkR shell default log level should match spark-shell/Scala
> -
>
> Key: SPARK-14881
> URL: https://issues.apache.org/jira/browse/SPARK-14881
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Shell, SparkR
>Affects Versions: 2.0.0
>Reporter: Felix Cheung
>Priority: Minor
>
> Scala spark-shell defaults to log level WARN. pyspark and sparkR should match 
> that by default (user can change it later)
> # ./bin/spark-shell
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel).






[jira] [Updated] (SPARK-14881) pyspark and sparkR shell default log level should match spark-shell/Scala

2016-04-23 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-14881:
-
Summary: pyspark and sparkR shell default log level should match 
spark-shell/Scala  (was: PySpark and sparkR shell default log level should 
match spark-shell/Scala)

> pyspark and sparkR shell default log level should match spark-shell/Scala
> -
>
> Key: SPARK-14881
> URL: https://issues.apache.org/jira/browse/SPARK-14881
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Shell, SparkR
>Affects Versions: 2.0.0
>Reporter: Felix Cheung
>Priority: Minor
>
> Scala spark-shell defaults to log level WARN. pyspark and sparkR should match 
> that by default (user can change it later)
> # ./bin/spark-shell
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel).






[jira] [Updated] (SPARK-14881) PySpark and sparkR shell default log level should match spark-shell/Scala

2016-04-23 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-14881:
-
Description: 
Scala spark-shell defaults to log level WARN. pyspark and sparkR should match 
that by default (user can change it later)

# ./bin/spark-shell
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).


> PySpark and sparkR shell default log level should match spark-shell/Scala
> -
>
> Key: SPARK-14881
> URL: https://issues.apache.org/jira/browse/SPARK-14881
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Shell, SparkR
>Affects Versions: 2.0.0
>Reporter: Felix Cheung
>Priority: Minor
>
> Scala spark-shell defaults to log level WARN. pyspark and sparkR should match 
> that by default (user can change it later)
> # ./bin/spark-shell
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel).
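As a hedged sketch of how a shell init could impose the same default, assuming log4j 1.x on the classpath as in Spark's default log4j profile; this is an illustration, not the actual change:

{code}
import org.apache.log4j.{Level, Logger}

object ShellLogLevelDefault {
  def main(args: Array[String]): Unit = {
    // Match spark-shell: only WARN and above by default; a user can still
    // lower the threshold later, e.g. via sc.setLogLevel("INFO").
    Logger.getRootLogger.setLevel(Level.WARN)
    println("Setting default log level to \"WARN\".")
  }
}
{code}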






[jira] [Created] (SPARK-14881) PySpark and sparkR shell default log level should match spark-shell/Scala

2016-04-23 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-14881:


 Summary: PySpark and sparkR shell default log level should match 
spark-shell/Scala
 Key: SPARK-14881
 URL: https://issues.apache.org/jira/browse/SPARK-14881
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Spark Shell, SparkR
Affects Versions: 2.0.0
Reporter: Felix Cheung
Priority: Minor









[jira] [Commented] (SPARK-13831) TPC-DS Query 35 fails with the following compile error

2016-04-23 Thread Roy Cecil (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255426#comment-15255426
 ] 

Roy Cecil commented on SPARK-13831:
---

@Herman, thanks. I am validating the fix.

> TPC-DS Query 35 fails with the following compile error
> --
>
> Key: SPARK-13831
> URL: https://issues.apache.org/jira/browse/SPARK-13831
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Roy Cecil
>Assignee: Herman van Hovell
> Fix For: 2.0.0
>
>
> TPC-DS Query 35 fails with the following compile error.
> Scala.NotImplementedError: 
> scala.NotImplementedError: No parse rules for ASTNode type: 864, text: 
> TOK_SUBQUERY_EXPR :
> TOK_SUBQUERY_EXPR 1, 439,797, 1370
>   TOK_SUBQUERY_OP 1, 439,439, 1370
> exists 1, 439,439, 1370
>   TOK_QUERY 1, 441,797, 1508
> Pasting Query 35 for easy reference.
> select
>   ca_state,
>   cd_gender,
>   cd_marital_status,
>   cd_dep_count,
>   count(*) cnt1,
>   min(cd_dep_count) cd_dep_count1,
>   max(cd_dep_count) cd_dep_count2,
>   avg(cd_dep_count) cd_dep_count3,
>   cd_dep_employed_count,
>   count(*) cnt2,
>   min(cd_dep_employed_count) cd_dep_employed_count1,
>   max(cd_dep_employed_count) cd_dep_employed_count2,
>   avg(cd_dep_employed_count) cd_dep_employed_count3,
>   cd_dep_college_count,
>   count(*) cnt3,
>   min(cd_dep_college_count) cd_dep_college_count1,
>   max(cd_dep_college_count) cd_dep_college_count2,
>   avg(cd_dep_college_count) cd_dep_college_count3
>  from
>   customer c
>   JOIN customer_address ca ON c.c_current_addr_sk = ca.ca_address_sk
>   JOIN customer_demographics ON cd_demo_sk = c.c_current_cdemo_sk
>   LEFT SEMI JOIN
>   (select ss_customer_sk
>   from store_sales
>JOIN date_dim ON ss_sold_date_sk = d_date_sk
>   where
> d_year = 2002 and
> d_qoy < 4) ss_wh1
>   ON c.c_customer_sk = ss_wh1.ss_customer_sk
>  where
>exists (
> select tmp.customer_sk from (
> select ws_bill_customer_sk  as customer_sk
> from web_sales,date_dim
> where
>   ws_sold_date_sk = d_date_sk and
>   d_year = 2002 and
>   d_qoy < 4
>UNION ALL
> select cs_ship_customer_sk  as customer_sk
> from catalog_sales,date_dim
> where
>   cs_sold_date_sk = d_date_sk and
>   d_year = 2002 and
>   d_qoy < 4
>   ) tmp where c.c_customer_sk = tmp.customer_sk
> )
>  group by ca_state,
>   cd_gender,
>   cd_marital_status,
>   cd_dep_count,
>   cd_dep_employed_count,
>   cd_dep_college_count
>  order by ca_state,
>   cd_gender,
>   cd_marital_status,
>   cd_dep_count,
>   cd_dep_employed_count,
>   cd_dep_college_count
>  limit 100;






[jira] [Commented] (SPARK-12148) SparkR: rename DataFrame to SparkDataFrame

2016-04-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255424#comment-15255424
 ] 

Apache Spark commented on SPARK-12148:
--

User 'felixcheung' has created a pull request for this issue:
https://github.com/apache/spark/pull/12647

> SparkR: rename DataFrame to SparkDataFrame
> --
>
> Key: SPARK-12148
> URL: https://issues.apache.org/jira/browse/SPARK-12148
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Michael Lawrence
>Assignee: Felix Cheung
> Fix For: 2.0.0
>
>
> The SparkR package represents a Spark DataFrame with the class "DataFrame". 
> That conflicts with the more general DataFrame class defined in the S4Vectors 
> package. Would it not be more appropriate to use the name "SparkDataFrame" 
> instead?






[jira] [Updated] (SPARK-14880) Parallel Gradient Descent with less map-reduce shuffle overhead

2016-04-23 Thread Ahmed Mahran (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ahmed Mahran updated SPARK-14880:
-
Description: 
The current implementation of (Stochastic) Gradient Descent performs one 
map-reduce shuffle per iteration. Moreover, when the sampling fraction gets 
smaller, the algorithm becomes shuffle-bound instead of CPU-bound.

{code}
(1 to numIterations or convergence) {
 rdd
  .sample(fraction)
  .map(Gradient)
  .reduce(Update)
}
{code}

A more performant variation requires only one map-reduce regardless from the 
number of iterations. A local mini-batch SGD could be run on each partition, 
then the results could be averaged. This is based on (Zinkevich, Martin, Markus 
Weimer, Lihong Li, and Alex J. Smola. "Parallelized stochastic gradient 
descent." In Advances in neural information processing systems, 2010, 
http://www.research.rutgers.edu/~lihong/pub/Zinkevich11Parallelized.pdf).

{code}
rdd
 .shuffle()
 .mapPartitions((1 to numIterations or convergence) {
   iter.sample(fraction).map(Gradient).reduce(Update)
 })
 .reduce(Average)
{code}

A higher level iteration could enclose the above variation; shuffling the data 
before the local mini-batches and feeding back the average weights from the 
last iteration. This allows more variability in the sampling of the 
mini-batches with the possibility to cover the whole dataset. Here is a Spark 
based implementation 
https://github.com/mashin-io/rich-spark/blob/master/src/main/scala/org/apache/spark/mllib/optimization/ParallelSGD.scala

{code}
(1 to numIterations1 or convergence) {
 rdd
  .shuffle()
  .mapPartitions((1 to numIterations2 or convergence) {
iter.sample(fraction).map(Gradient).reduce(Update)
  })
  .reduce(Average)
}
{code}

  was:
The current implementation of (Stochastic) Gradient Descent performs one 
map-reduce shuffle per iteration. Moreover, when the sampling fraction gets 
smaller, the algorithm becomes shuffle-bound instead of CPU-bound.

(1 to numIterations or convergence) {
 rdd
  .sample(fraction)
  .map(Gradient)
  .reduce(Update)
}

A more performant variation requires only one map-reduce regardless from the 
number of iterations. A local mini-batch SGD could be run on each partition, 
then the results could be averaged. This is based on (Zinkevich, Martin, Markus 
Weimer, Lihong Li, and Alex J. Smola. "Parallelized stochastic gradient 
descent." In Advances in neural information processing systems, 2010, 
http://www.research.rutgers.edu/~lihong/pub/Zinkevich11Parallelized.pdf).

rdd
 .shuffle()
 .mapPartitions((1 to numIterations or convergence) {
   iter.sample(fraction).map(Gradient).reduce(Update)
 })
 .reduce(Average)

A higher level iteration could enclose the above variation; shuffling the data 
before the local mini-batches and feeding back the average weights from the 
last iteration. This allows more variability in the sampling of the 
mini-batches with the possibility to cover the whole dataset. Here is a Spark 
based implementation 
https://github.com/mashin-io/rich-spark/blob/master/src/main/scala/org/apache/spark/mllib/optimization/ParallelSGD.scala

(1 to numIterations1 or convergence) {
 rdd
  .shuffle()
  .mapPartitions((1 to numIterations2 or convergence) {
iter.sample(fraction).map(Gradient).reduce(Update)
  })
  .reduce(Average)
}


> Parallel Gradient Descent with less map-reduce shuffle overhead
> ---
>
> Key: SPARK-14880
> URL: https://issues.apache.org/jira/browse/SPARK-14880
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Ahmed Mahran
>  Labels: performance
>
> The current implementation of (Stochastic) Gradient Descent performs one 
> map-reduce shuffle per iteration. Moreover, when the sampling fraction gets 
> smaller, the algorithm becomes shuffle-bound instead of CPU-bound.
> {code}
> (1 to numIterations or convergence) {
>  rdd
>   .sample(fraction)
>   .map(Gradient)
>   .reduce(Update)
> }
> {code}
> A more performant variation requires only one map-reduce regardless from the 
> number of iterations. A local mini-batch SGD could be run on each partition, 
> then the results could be averaged. This is based on (Zinkevich, Martin, 
> Markus Weimer, Lihong Li, and Alex J. Smola. "Parallelized stochastic 
> gradient descent." In Advances in neural information processing systems, 
> 2010, 
> http://www.research.rutgers.edu/~lihong/pub/Zinkevich11Parallelized.pdf).
> {code}
> rdd
>  .shuffle()
>  .mapPartitions((1 to numIterations or convergence) {
>iter.sample(fraction).map(Gradient).reduce(Update)
>  })
>  .reduce(Average)
> {code}
> A higher level iteration could enclose the above variation; shuffling the 
> data before the local mini-batches and feeding back the average weights from 
> the last iteration. This allows more variability in the sampling of the 
> mini-batches with the possibility to cover the whole dataset.

[jira] [Created] (SPARK-14880) Parallel Gradient Descent with less map-reduce shuffle overhead

2016-04-23 Thread Ahmed Mahran (JIRA)
Ahmed Mahran created SPARK-14880:


 Summary: Parallel Gradient Descent with less map-reduce shuffle 
overhead
 Key: SPARK-14880
 URL: https://issues.apache.org/jira/browse/SPARK-14880
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Ahmed Mahran


The current implementation of (Stochastic) Gradient Descent performs one 
map-reduce shuffle per iteration. Moreover, when the sampling fraction gets 
smaller, the algorithm becomes shuffle-bound instead of CPU-bound.

(1 to numIterations or convergence) {
 rdd
  .sample(fraction)
  .map(Gradient)
  .reduce(Update)
}

A more performant variation requires only one map-reduce regardless from the 
number of iterations. A local mini-batch SGD could be run on each partition, 
then the results could be averaged. This is based on (Zinkevich, Martin, Markus 
Weimer, Lihong Li, and Alex J. Smola. "Parallelized stochastic gradient 
descent." In Advances in neural information processing systems, 2010, 
http://www.research.rutgers.edu/~lihong/pub/Zinkevich11Parallelized.pdf).

rdd
 .shuffle()
 .mapPartitions((1 to numIterations or convergence) {
   iter.sample(fraction).map(Gradient).reduce(Update)
 })
 .reduce(Average)

A higher level iteration could enclose the above variation; shuffling the data 
before the local mini-batches and feeding back the average weights from the 
last iteration. This allows more variability in the sampling of the 
mini-batches with the possibility to cover the whole dataset. Here is a Spark 
based implementation 
https://github.com/mashin-io/rich-spark/blob/master/src/main/scala/org/apache/spark/mllib/optimization/ParallelSGD.scala

(1 to numIterations1 or convergence) {
 rdd
  .shuffle()
  .mapPartitions((1 to numIterations2 or convergence) {
iter.sample(fraction).map(Gradient).reduce(Update)
  })
  .reduce(Average)
}
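To make the pseudocode above concrete, here is a hedged, runnable sketch of the single-shuffle variant for a least-squares objective with a scalar weight; the helper names and sampling scheme are illustrative and not taken from the linked implementation:

{code}
import org.apache.spark.rdd.RDD
import scala.util.Random

object LocalMiniBatchSGD {
  // data: (feature, label) pairs; returns the average of the per-partition weights.
  def run(data: RDD[(Double, Double)], iterations: Int,
          fraction: Double, step: Double): Double = {
    val perPartition = data
      .repartition(data.getNumPartitions)      // the single shuffle
      .mapPartitions { iter =>
        val points = iter.toArray
        val rnd = new Random(42)
        var w = 0.0
        for (_ <- 1 to iterations) {
          // sample a local mini-batch and take one gradient step on it
          val batch = points.filter(_ => rnd.nextDouble() < fraction)
          if (batch.nonEmpty) {
            val grad = batch.map { case (x, y) => (w * x - y) * x }.sum / batch.length
            w -= step * grad
          }
        }
        Iterator(w)                            // the locally trained model
      }
      .collect()
    perPartition.sum / perPartition.length     // average the local models
  }
}
{code}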






[jira] [Assigned] (SPARK-14878) Support Trim characters in the string trim function

2016-04-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14878:


Assignee: Apache Spark

> Support Trim characters in the string trim function
> ---
>
> Key: SPARK-14878
> URL: https://issues.apache.org/jira/browse/SPARK-14878
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: kevin yu
>Assignee: Apache Spark
>
> The current Spark SQL does not support trim characters in the string trim 
> function, which are part of the ANSI SQL:2003 standard. For example, IBM DB2 
> fully supports them, as shown in 
> https://www.ibm.com/support/knowledgecenter/SS6NHC/com.ibm.swg.im.dashdb.sql.ref.doc/doc/r0023198.html.
> We propose to implement it in this JIRA.
> The ANSI SQL:2003 trim syntax:
> <trim function> ::= TRIM <left paren> <trim operands> <right paren>
> <trim operands> ::= [ [ <trim specification> ] [ <trim character> ] FROM ] <trim source>
> <trim source> ::= <character value expression>
> <trim specification> ::= LEADING | TRAILING | BOTH
> <trim character> ::= <character value expression>






[jira] [Commented] (SPARK-14878) Support Trim characters in the string trim function

2016-04-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255420#comment-15255420
 ] 

Apache Spark commented on SPARK-14878:
--

User 'kevinyu98' has created a pull request for this issue:
https://github.com/apache/spark/pull/12646

> Support Trim characters in the string trim function
> ---
>
> Key: SPARK-14878
> URL: https://issues.apache.org/jira/browse/SPARK-14878
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: kevin yu
>
> The current Spark SQL does not support trim characters in the string trim 
> function, which are part of the ANSI SQL:2003 standard. For example, IBM DB2 
> fully supports them, as shown in 
> https://www.ibm.com/support/knowledgecenter/SS6NHC/com.ibm.swg.im.dashdb.sql.ref.doc/doc/r0023198.html.
> We propose to implement it in this JIRA.
> The ANSI SQL:2003 trim syntax:
> <trim function> ::= TRIM <left paren> <trim operands> <right paren>
> <trim operands> ::= [ [ <trim specification> ] [ <trim character> ] FROM ] <trim source>
> <trim source> ::= <character value expression>
> <trim specification> ::= LEADING | TRAILING | BOTH
> <trim character> ::= <character value expression>






[jira] [Assigned] (SPARK-14878) Support Trim characters in the string trim function

2016-04-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14878:


Assignee: (was: Apache Spark)

> Support Trim characters in the string trim function
> ---
>
> Key: SPARK-14878
> URL: https://issues.apache.org/jira/browse/SPARK-14878
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: kevin yu
>
> The current Spark SQL does not support trim characters in the string trim 
> function, which are part of the ANSI SQL:2003 standard. For example, IBM DB2 
> fully supports them, as shown in 
> https://www.ibm.com/support/knowledgecenter/SS6NHC/com.ibm.swg.im.dashdb.sql.ref.doc/doc/r0023198.html.
> We propose to implement it in this JIRA.
> The ANSI SQL:2003 trim syntax:
> <trim function> ::= TRIM <left paren> <trim operands> <right paren>
> <trim operands> ::= [ [ <trim specification> ] [ <trim character> ] FROM ] <trim source>
> <trim source> ::= <character value expression>
> <trim specification> ::= LEADING | TRAILING | BOTH
> <trim character> ::= <character value expression>






[jira] [Resolved] (SPARK-14877) Remove HiveMetastoreTypes class

2016-04-23 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-14877.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12644
[https://github.com/apache/spark/pull/12644]

> Remove HiveMetastoreTypes class
> ---
>
> Key: SPARK-14877
> URL: https://issues.apache.org/jira/browse/SPARK-14877
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>
> It is unnecessary as DataType.catalogString largely replaces the need for 
> this class.






[jira] [Assigned] (SPARK-14879) Move CreateMetastoreDataSource and CreateMetastoreDataSourceAsSelect to sql/core

2016-04-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14879:


Assignee: Yin Huai  (was: Apache Spark)

> Move CreateMetastoreDataSource and CreateMetastoreDataSourceAsSelect to 
> sql/core
> 
>
> Key: SPARK-14879
> URL: https://issues.apache.org/jira/browse/SPARK-14879
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
>







[jira] [Assigned] (SPARK-14879) Move CreateMetastoreDataSource and CreateMetastoreDataSourceAsSelect to sql/core

2016-04-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14879:


Assignee: Apache Spark  (was: Yin Huai)

> Move CreateMetastoreDataSource and CreateMetastoreDataSourceAsSelect to 
> sql/core
> 
>
> Key: SPARK-14879
> URL: https://issues.apache.org/jira/browse/SPARK-14879
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Apache Spark
>







[jira] [Commented] (SPARK-14879) Move CreateMetastoreDataSource and CreateMetastoreDataSourceAsSelect to sql/core

2016-04-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255403#comment-15255403
 ] 

Apache Spark commented on SPARK-14879:
--

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/12645

> Move CreateMetastoreDataSource and CreateMetastoreDataSourceAsSelect to 
> sql/core
> 
>
> Key: SPARK-14879
> URL: https://issues.apache.org/jira/browse/SPARK-14879
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
>







[jira] [Created] (SPARK-14879) Move CreateMetastoreDataSource and CreateMetastoreDataSourceAsSelect to sql/core

2016-04-23 Thread Yin Huai (JIRA)
Yin Huai created SPARK-14879:


 Summary: Move CreateMetastoreDataSource and 
CreateMetastoreDataSourceAsSelect to sql/core
 Key: SPARK-14879
 URL: https://issues.apache.org/jira/browse/SPARK-14879
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Yin Huai
Assignee: Yin Huai









[jira] [Created] (SPARK-14878) Support Trim characters in the string trim function

2016-04-23 Thread kevin yu (JIRA)
kevin yu created SPARK-14878:


 Summary: Support Trim characters in the string trim function
 Key: SPARK-14878
 URL: https://issues.apache.org/jira/browse/SPARK-14878
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.0.0
Reporter: kevin yu


The current Spark SQL does not support trim characters in the string trim 
function, which are part of the ANSI SQL:2003 standard. For example, IBM DB2 fully 
supports them, as shown in 
https://www.ibm.com/support/knowledgecenter/SS6NHC/com.ibm.swg.im.dashdb.sql.ref.doc/doc/r0023198.html.
We propose to implement it in this JIRA.
The ANSI SQL:2003 trim syntax:

<trim function> ::= TRIM <left paren> <trim operands> <right paren>
<trim operands> ::= [ [ <trim specification> ] [ <trim character> ] FROM ] <trim source>
<trim source> ::= <character value expression>
<trim specification> ::= LEADING | TRAILING | BOTH
<trim character> ::= <character value expression>







[jira] [Assigned] (SPARK-14877) Remove HiveMetastoreTypes class

2016-04-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14877:


Assignee: Reynold Xin  (was: Apache Spark)

> Remove HiveMetastoreTypes class
> ---
>
> Key: SPARK-14877
> URL: https://issues.apache.org/jira/browse/SPARK-14877
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> It is unnecessary as DataType.catalogString largely replaces the need for 
> this class.






[jira] [Commented] (SPARK-14877) Remove HiveMetastoreTypes class

2016-04-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255388#comment-15255388
 ] 

Apache Spark commented on SPARK-14877:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/12644

> Remove HiveMetastoreTypes class
> ---
>
> Key: SPARK-14877
> URL: https://issues.apache.org/jira/browse/SPARK-14877
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> It is unnecessary as DataType.catalogString largely replaces the need for 
> this class.






[jira] [Assigned] (SPARK-14877) Remove HiveMetastoreTypes class

2016-04-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14877:


Assignee: Apache Spark  (was: Reynold Xin)

> Remove HiveMetastoreTypes class
> ---
>
> Key: SPARK-14877
> URL: https://issues.apache.org/jira/browse/SPARK-14877
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> It is unnecessary as DataType.catalogString largely replaces the need for 
> this class.






[jira] [Created] (SPARK-14877) Remove HiveMetastoreTypes class

2016-04-23 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-14877:
---

 Summary: Remove HiveMetastoreTypes class
 Key: SPARK-14877
 URL: https://issues.apache.org/jira/browse/SPARK-14877
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin


It is unnecessary as DataType.catalogString largely replaces the need for this 
class.







[jira] [Assigned] (SPARK-14876) SparkSession should be case insensitive by default

2016-04-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14876:


Assignee: Apache Spark  (was: Reynold Xin)

> SparkSession should be case insensitive by default
> --
>
> Key: SPARK-14876
> URL: https://issues.apache.org/jira/browse/SPARK-14876
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> This would match most database systems.






[jira] [Commented] (SPARK-14876) SparkSession should be case insensitive by default

2016-04-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255382#comment-15255382
 ] 

Apache Spark commented on SPARK-14876:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/12643

> SparkSession should be case insensitive by default
> --
>
> Key: SPARK-14876
> URL: https://issues.apache.org/jira/browse/SPARK-14876
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> This would match most database systems.






[jira] [Assigned] (SPARK-14876) SparkSession should be case insensitive by default

2016-04-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14876:


Assignee: Reynold Xin  (was: Apache Spark)

> SparkSession should be case insensitive by default
> --
>
> Key: SPARK-14876
> URL: https://issues.apache.org/jira/browse/SPARK-14876
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> This would match most database systems.






[jira] [Created] (SPARK-14876) SparkSession should be case insensitive by default

2016-04-23 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-14876:
---

 Summary: SparkSession should be case insensitive by default
 Key: SPARK-14876
 URL: https://issues.apache.org/jira/browse/SPARK-14876
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin


This would match most database systems.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14846) Driver process fails to terminate when graceful shutdown is used

2016-04-23 Thread Mattias Aspholm (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255375#comment-15255375
 ] 

Mattias Aspholm edited comment on SPARK-14846 at 4/23/16 8:23 PM:
--

You're right of course, sorry about that. I'm still having problems with the 
driver not shutting down in graceful mode (even though there's no work left), but 
I realise now that my initial conclusion was wrong: the reason it hangs in 
awaitTermination is that the termination condition is not signaled. I need to 
find out why that happens.

It's fine with me to close this bug as invalid. I'll file another one if it 
turns out to be a real bug after all.



was (Author: masph...@gmail.com):
Yes, you're right of course, sorry about that. I'm still having problems with 
the driver not shutting down in graceful mode (even though there's no work left), 
but I realise now that my initial conclusion was wrong: the reason it hangs in 
awaitTermination is that the termination condition is not signaled. I need to 
find out why that happens.

It's fine with me to close this bug as invalid. I'll file another one if it 
turns out to be a real bug after all.


> Driver process fails to terminate when graceful shutdown is used
> 
>
> Key: SPARK-14846
> URL: https://issues.apache.org/jira/browse/SPARK-14846
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.6.1
>Reporter: Mattias Aspholm
>
> During shutdown, the job scheduler in Streaming (JobScheduler.stop) spends 
> some time waiting for all queued work to complete. If graceful shutdown is 
> used, the time is 1 hour; for non-graceful shutdown it's 2 seconds.
> The wait is implemented using the ThreadPoolExecutor.awaitTermination method 
> in java.util.concurrent. The problem is that instead of looping over the 
> method for the desired period of time, the wait period is passed in as the 
> timeout parameter to awaitTermination. 
> The result is that if the termination condition is false the first time, the 
> method will sleep for the timeout period before trying again. In the case of 
> graceful shutdown this means at least an hour's wait before the condition is 
> checked again, even though all work is completed in just a few seconds. The 
> driver process will continue to live during this time.
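For reference, a minimal JDK-only sketch of the call in question (not Spark's JobScheduler code): awaitTermination blocks until the pool terminates or the timeout elapses, whichever comes first, and returns whether termination happened.
{code}
import java.util.concurrent.{Executors, TimeUnit}

// Plain JDK illustration, unrelated to Spark's actual JobScheduler internals.
val pool = Executors.newFixedThreadPool(1)
pool.submit(new Runnable { override def run(): Unit = Thread.sleep(2000) })
pool.shutdown()
// Blocks until the pool terminates or one hour passes, whichever is first;
// in this toy example it returns `true` after roughly two seconds.
val terminated = pool.awaitTermination(1, TimeUnit.HOURS)
println(s"terminated before timeout: $terminated")
{code}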



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14865) When creating a view, we should verify both the input SQL and the generated SQL

2016-04-23 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-14865.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12633
[https://github.com/apache/spark/pull/12633]

> When creating a view, we should verify both the input SQL and the generated 
> SQL
> ---
>
> Key: SPARK-14865
> URL: https://issues.apache.org/jira/browse/SPARK-14865
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Reynold Xin
>Priority: Critical
> Fix For: 2.0.0
>
>
> Before we generate the SQL, we should make sure it is valid.
> After we generate the SQL string for a create view command, we should verify 
> the string before putting it into the metastore.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14846) Driver process fails to terminate when graceful shutdown is used

2016-04-23 Thread Mattias Aspholm (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255375#comment-15255375
 ] 

Mattias Aspholm commented on SPARK-14846:
-

Yes, you're right of course, sorry about that. I'm still having problems with 
the driver not shutting down in graceful mode (even though there's no work left), 
but I realise now that my initial conclusion was wrong: the reason it hangs in 
awaitTermination is that the termination condition is not signaled. I need to 
find out why that happens.

It's fine with me to close this bug as invalid. I'll file another one if it 
turns out to be a real bug after all.


> Driver process fails to terminate when graceful shutdown is used
> 
>
> Key: SPARK-14846
> URL: https://issues.apache.org/jira/browse/SPARK-14846
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.6.1
>Reporter: Mattias Aspholm
>
> During shutdown, the job scheduler in Streaming (JobScheduler.stop) spends 
> some time waiting for all queued work to complete. If graceful shutdown is 
> used, the time is 1 hour; for non-graceful shutdown it's 2 seconds.
> The wait is implemented using the ThreadPoolExecutor.awaitTermination method 
> in java.util.concurrent. The problem is that instead of looping over the 
> method for the desired period of time, the wait period is passed in as the 
> timeout parameter to awaitTermination. 
> The result is that if the termination condition is false the first time, the 
> method will sleep for the timeout period before trying again. In the case of 
> graceful shutdown this means at least an hour's wait before the condition is 
> checked again, even though all work is completed in just a few seconds. The 
> driver process will continue to live during this time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14654) New accumulator API

2016-04-23 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255363#comment-15255363
 ] 

Reynold Xin commented on SPARK-14654:
-

I see. The merge function isn't supposed to be called by the end user. 

Adding type parameters is not free -- in fact, nothing we add is free. 
We need to consider how much gain it brings. In this case I think the gain is 
minimal, if any.


> New accumulator API
> ---
>
> Key: SPARK-14654
> URL: https://issues.apache.org/jira/browse/SPARK-14654
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> The current accumulator API has a few problems:
> 1. Its type hierarchy is very complicated, with Accumulator, Accumulable, 
> AccumulatorParam, AccumulableParam, etc.
> 2. The intermediate buffer type must be the same as the output type, so there 
> is no way to define an accumulator that computes averages.
> 3. It is very difficult to specialize the methods, leading to excessive 
> boxing and making accumulators bad for metrics that change for each record.
> 4. There is not a single coherent API that works for both Java and Scala.
> This is a proposed new API that addresses all of the above. In this new API:
> 1. There is only a single class (Accumulator) that is user facing
> 2. The intermediate value is stored in the accumulator itself and can be 
> different from the output type.
> 3. Concrete implementations can provide its own specialized methods.
> 4. Designed to work for both Java and Scala.
> {code}
> abstract class Accumulator[IN, OUT] extends Serializable {
>   def isRegistered: Boolean = ...
>   def register(metadata: AccumulatorMetadata): Unit = ...
>   def metadata: AccumulatorMetadata = ...
>   def reset(): Unit
>   def add(v: IN): Unit
>   def merge(other: Accumulator[IN, OUT]): Unit
>   def value: OUT
>   def localValue: OUT = value
>   final def registerAccumulatorOnExecutor(): Unit = {
> // Automatically register the accumulator when it is deserialized with 
> the task closure.
> // This is for external accumulators and internal ones that do not 
> represent task level
> // metrics, e.g. internal SQL metrics, which are per-operator.
> val taskContext = TaskContext.get()
> if (taskContext != null) {
>   taskContext.registerAccumulator(this)
> }
>   }
>   // Called by Java when deserializing an object
>   private def readObject(in: ObjectInputStream): Unit = 
> Utils.tryOrIOException {
> in.defaultReadObject()
> registerAccumulator()
>   }
> }
> {code}
> Metadata, provided by Spark after registration:
> {code}
> class AccumulatorMetadata(
>   val id: Long,
>   val name: Option[String],
>   val countFailedValues: Boolean
> ) extends Serializable
> {code}
> and an implementation that also offers specialized getters and setters
> {code}
> class LongAccumulator extends Accumulator[jl.Long, jl.Long] {
>   private[this] var _sum = 0L
>   override def reset(): Unit = _sum = 0L
>   override def add(v: jl.Long): Unit = {
> _sum += v
>   }
>   override def merge(other: Accumulator[jl.Long, jl.Long]): Unit = other 
> match {
> case o: LongAccumulator => _sum += o.sum
> case _ => throw new UnsupportedOperationException(
>   s"Cannot merge ${this.getClass.getName} with ${other.getClass.getName}")
>   }
>   override def value: jl.Long = _sum
>   def sum: Long = _sum
> }
> {code}
> and SparkContext...
> {code}
> class SparkContext {
>   ...
>   def newLongAccumulator(): LongAccumulator
>   def newLongAccumulator(name: Long): LongAccumulator
>   def newLongAccumulator(name: Long, dedup: Boolean): LongAccumulator
>   def registerAccumulator[IN, OUT](acc: Accumulator[IN, OUT]): 
> Accumulator[IN, OUT]
>   ...
> }
> {code}
> To use it ...
> {code}
> val acc = sc.newLongAccumulator()
> sc.parallelize(1 to 1000).map { i =>
>   acc.add(1)
>   i
> }
> {code}
> A work-in-progress prototype here: 
> https://github.com/rxin/spark/tree/accumulator-refactor



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14869) Don't mask exceptions in ResolveRelations

2016-04-23 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-14869.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

> Don't mask exceptions in ResolveRelations
> -
>
> Key: SPARK-14869
> URL: https://issues.apache.org/jira/browse/SPARK-14869
> Project: Spark
>  Issue Type: Improvement
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>
> In order to support SPARK-11197 (run SQL directly on files), we added some 
> code in ResolveRelations to catch the exception thrown by 
> catalog.lookupRelation and ignore it. This unfortunately masks all the 
> exceptions. It should have been sufficient to simply check whether the table 
> exists.
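A hypothetical sketch of the contrast described above (the function names are illustrative, not the actual ResolveRelations code):
{code}
// Pattern being criticized: swallow every exception from the lookup, so
// unrelated failures (corrupt metadata, connectivity errors, ...) are hidden.
def resolveSwallowing(lookup: String => AnyRef)(table: String): Option[AnyRef] =
  try Some(lookup(table)) catch { case _: Exception => None }

// Suggested alternative: check for existence explicitly, so only the
// "missing table" case falls through and every other exception still surfaces.
def resolveChecked(exists: String => Boolean, lookup: String => AnyRef)(table: String): Option[AnyRef] =
  if (exists(table)) Some(lookup(table)) else None
{code}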



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14872) Restructure commands.scala

2016-04-23 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-14872.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12636
[https://github.com/apache/spark/pull/12636]

> Restructure commands.scala
> --
>
> Key: SPARK-14872
> URL: https://issues.apache.org/jira/browse/SPARK-14872
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14871) Disable StatsReportListener to declutter output

2016-04-23 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-14871.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12635
[https://github.com/apache/spark/pull/12635]

> Disable StatsReportListener to declutter output
> ---
>
> Key: SPARK-14871
> URL: https://issues.apache.org/jira/browse/SPARK-14871
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>
> Spark SQL inherited the use of StatsReportListener from Shark. Unfortunately, 
> this clutters the spark-sql CLI output and makes it very difficult to read 
> the actual query results.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14654) New accumulator API

2016-04-23 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255359#comment-15255359
 ] 

holdenk commented on SPARK-14654:
-

So ACC isn't the internal buffer type; rather, it's the type of the Accumulator 
itself. This just replaces the runtime exception from someone trying to merge two 
incompatible Accumulators with a compile-time check. 

> New accumulator API
> ---
>
> Key: SPARK-14654
> URL: https://issues.apache.org/jira/browse/SPARK-14654
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> The current accumulator API has a few problems:
> 1. Its type hierarchy is very complicated, with Accumulator, Accumulable, 
> AccumulatorParam, AccumulableParam, etc.
> 2. The intermediate buffer type must be the same as the output type, so there 
> is no way to define an accumulator that computes averages.
> 3. It is very difficult to specialize the methods, leading to excessive 
> boxing and making accumulators bad for metrics that change for each record.
> 4. There is not a single coherent API that works for both Java and Scala.
> This is a proposed new API that addresses all of the above. In this new API:
> 1. There is only a single class (Accumulator) that is user facing
> 2. The intermediate value is stored in the accumulator itself and can be 
> different from the output type.
> 3. Concrete implementations can provide its own specialized methods.
> 4. Designed to work for both Java and Scala.
> {code}
> abstract class Accumulator[IN, OUT] extends Serializable {
>   def isRegistered: Boolean = ...
>   def register(metadata: AccumulatorMetadata): Unit = ...
>   def metadata: AccumulatorMetadata = ...
>   def reset(): Unit
>   def add(v: IN): Unit
>   def merge(other: Accumulator[IN, OUT]): Unit
>   def value: OUT
>   def localValue: OUT = value
>   final def registerAccumulatorOnExecutor(): Unit = {
> // Automatically register the accumulator when it is deserialized with 
> the task closure.
> // This is for external accumulators and internal ones that do not 
> represent task level
> // metrics, e.g. internal SQL metrics, which are per-operator.
> val taskContext = TaskContext.get()
> if (taskContext != null) {
>   taskContext.registerAccumulator(this)
> }
>   }
>   // Called by Java when deserializing an object
>   private def readObject(in: ObjectInputStream): Unit = 
> Utils.tryOrIOException {
> in.defaultReadObject()
> registerAccumulator()
>   }
> }
> {code}
> Metadata, provided by Spark after registration:
> {code}
> class AccumulatorMetadata(
>   val id: Long,
>   val name: Option[String],
>   val countFailedValues: Boolean
> ) extends Serializable
> {code}
> and an implementation that also offers specialized getters and setters
> {code}
> class LongAccumulator extends Accumulator[jl.Long, jl.Long] {
>   private[this] var _sum = 0L
>   override def reset(): Unit = _sum = 0L
>   override def add(v: jl.Long): Unit = {
> _sum += v
>   }
>   override def merge(other: Accumulator[jl.Long, jl.Long]): Unit = other 
> match {
> case o: LongAccumulator => _sum += o.sum
> case _ => throw new UnsupportedOperationException(
>   s"Cannot merge ${this.getClass.getName} with ${other.getClass.getName}")
>   }
>   override def value: jl.Long = _sum
>   def sum: Long = _sum
> }
> {code}
> and SparkContext...
> {code}
> class SparkContext {
>   ...
>   def newLongAccumulator(): LongAccumulator
>   def newLongAccumulator(name: Long): LongAccumulator
>   def newLongAccumulator(name: Long, dedup: Boolean): LongAccumulator
>   def registerAccumulator[IN, OUT](acc: Accumulator[IN, OUT]): 
> Accumulator[IN, OUT]
>   ...
> }
> {code}
> To use it ...
> {code}
> val acc = sc.newLongAccumulator()
> sc.parallelize(1 to 1000).map { i =>
>   acc.add(1)
>   i
> }
> {code}
> A work-in-progress prototype here: 
> https://github.com/rxin/spark/tree/accumulator-refactor



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14654) New accumulator API

2016-04-23 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255357#comment-15255357
 ] 

Reynold Xin commented on SPARK-14654:
-

No, it can't be private[spark] because it needs to be implemented. I also don't 
see why we'd need to expose the internal buffer type, since it is strictly an 
implementation detail of the accumulators.

> New accumulator API
> ---
>
> Key: SPARK-14654
> URL: https://issues.apache.org/jira/browse/SPARK-14654
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> The current accumulator API has a few problems:
> 1. Its type hierarchy is very complicated, with Accumulator, Accumulable, 
> AccumulatorParam, AccumulableParam, etc.
> 2. The intermediate buffer type must be the same as the output type, so there 
> is no way to define an accumulator that computes averages.
> 3. It is very difficult to specialize the methods, leading to excessive 
> boxing and making accumulators bad for metrics that change for each record.
> 4. There is not a single coherent API that works for both Java and Scala.
> This is a proposed new API that addresses all of the above. In this new API:
> 1. There is only a single class (Accumulator) that is user facing
> 2. The intermediate value is stored in the accumulator itself and can be 
> different from the output type.
> 3. Concrete implementations can provide its own specialized methods.
> 4. Designed to work for both Java and Scala.
> {code}
> abstract class Accumulator[IN, OUT] extends Serializable {
>   def isRegistered: Boolean = ...
>   def register(metadata: AccumulatorMetadata): Unit = ...
>   def metadata: AccumulatorMetadata = ...
>   def reset(): Unit
>   def add(v: IN): Unit
>   def merge(other: Accumulator[IN, OUT]): Unit
>   def value: OUT
>   def localValue: OUT = value
>   final def registerAccumulatorOnExecutor(): Unit = {
> // Automatically register the accumulator when it is deserialized with 
> the task closure.
> // This is for external accumulators and internal ones that do not 
> represent task level
> // metrics, e.g. internal SQL metrics, which are per-operator.
> val taskContext = TaskContext.get()
> if (taskContext != null) {
>   taskContext.registerAccumulator(this)
> }
>   }
>   // Called by Java when deserializing an object
>   private def readObject(in: ObjectInputStream): Unit = 
> Utils.tryOrIOException {
> in.defaultReadObject()
> registerAccumulator()
>   }
> }
> {code}
> Metadata, provided by Spark after registration:
> {code}
> class AccumulatorMetadata(
>   val id: Long,
>   val name: Option[String],
>   val countFailedValues: Boolean
> ) extends Serializable
> {code}
> and an implementation that also offers specialized getters and setters
> {code}
> class LongAccumulator extends Accumulator[jl.Long, jl.Long] {
>   private[this] var _sum = 0L
>   override def reset(): Unit = _sum = 0L
>   override def add(v: jl.Long): Unit = {
> _sum += v
>   }
>   override def merge(other: Accumulator[jl.Long, jl.Long]): Unit = other 
> match {
> case o: LongAccumulator => _sum += o.sum
> case _ => throw new UnsupportedOperationException(
>   s"Cannot merge ${this.getClass.getName} with ${other.getClass.getName}")
>   }
>   override def value: jl.Long = _sum
>   def sum: Long = _sum
> }
> {code}
> and SparkContext...
> {code}
> class SparkContext {
>   ...
>   def newLongAccumulator(): LongAccumulator
>   def newLongAccumulator(name: Long): LongAccumulator
>   def newLongAccumulator(name: Long, dedup: Boolean): LongAccumulator
>   def registerAccumulator[IN, OUT](acc: Accumulator[IN, OUT]): 
> Accumulator[IN, OUT]
>   ...
> }
> {code}
> To use it ...
> {code}
> val acc = sc.newLongAccumulator()
> sc.parallelize(1 to 1000).map { i =>
>   acc.add(1)
>   i
> }
> {code}
> A work-in-progress prototype here: 
> https://github.com/rxin/spark/tree/accumulator-refactor



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14654) New accumulator API

2016-04-23 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255356#comment-15255356
 ] 

holdenk commented on SPARK-14654:
-

Since we're talking about average accumulators anyway - what would the return 
type of a Long average accumulator's value function be?

Also, would it maybe make sense to make the merge function 
{code}private[spark]{code}? And/or maybe change {code}abstract class 
Accumulator[IN, OUT] extends Serializable {{code} to {code}abstract class 
Accumulator[IN, ACC, OUT] extends Serializable {{code}, have merge take ACC, 
e.g. {code}def merge(other: ACC): Unit{code}, and then do 
{code}class LongAccumulator extends Accumulator[jl.Long, LongAccumulator, 
jl.Long] {{code}?
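A minimal, self-contained sketch of that three-parameter variant (the names Acc3 and LongAcc are made up for illustration; the real proposal has additional members such as metadata and register):
{code}
abstract class Acc3[IN, ACC, OUT] extends Serializable {
  def add(v: IN): Unit
  def merge(other: ACC): Unit   // takes the concrete accumulator type
  def value: OUT
}

class LongAcc extends Acc3[java.lang.Long, LongAcc, java.lang.Long] {
  private var sum = 0L
  override def add(v: java.lang.Long): Unit = { sum += v }
  // Merging an incompatible accumulator is now a compile-time error,
  // so no runtime pattern match is needed.
  override def merge(other: LongAcc): Unit = { sum += other.sum }
  override def value: java.lang.Long = sum
}
{code}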

> New accumulator API
> ---
>
> Key: SPARK-14654
> URL: https://issues.apache.org/jira/browse/SPARK-14654
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> The current accumulator API has a few problems:
> 1. Its type hierarchy is very complicated, with Accumulator, Accumulable, 
> AccumulatorParam, AccumulableParam, etc.
> 2. The intermediate buffer type must be the same as the output type, so there 
> is no way to define an accumulator that computes averages.
> 3. It is very difficult to specialize the methods, leading to excessive 
> boxing and making accumulators bad for metrics that change for each record.
> 4. There is not a single coherent API that works for both Java and Scala.
> This is a proposed new API that addresses all of the above. In this new API:
> 1. There is only a single class (Accumulator) that is user facing
> 2. The intermediate value is stored in the accumulator itself and can be 
> different from the output type.
> 3. Concrete implementations can provide its own specialized methods.
> 4. Designed to work for both Java and Scala.
> {code}
> abstract class Accumulator[IN, OUT] extends Serializable {
>   def isRegistered: Boolean = ...
>   def register(metadata: AccumulatorMetadata): Unit = ...
>   def metadata: AccumulatorMetadata = ...
>   def reset(): Unit
>   def add(v: IN): Unit
>   def merge(other: Accumulator[IN, OUT]): Unit
>   def value: OUT
>   def localValue: OUT = value
>   final def registerAccumulatorOnExecutor(): Unit = {
> // Automatically register the accumulator when it is deserialized with 
> the task closure.
> // This is for external accumulators and internal ones that do not 
> represent task level
> // metrics, e.g. internal SQL metrics, which are per-operator.
> val taskContext = TaskContext.get()
> if (taskContext != null) {
>   taskContext.registerAccumulator(this)
> }
>   }
>   // Called by Java when deserializing an object
>   private def readObject(in: ObjectInputStream): Unit = 
> Utils.tryOrIOException {
> in.defaultReadObject()
> registerAccumulator()
>   }
> }
> {code}
> Metadata, provided by Spark after registration:
> {code}
> class AccumulatorMetadata(
>   val id: Long,
>   val name: Option[String],
>   val countFailedValues: Boolean
> ) extends Serializable
> {code}
> and an implementation that also offers specialized getters and setters
> {code}
> class LongAccumulator extends Accumulator[jl.Long, jl.Long] {
>   private[this] var _sum = 0L
>   override def reset(): Unit = _sum = 0L
>   override def add(v: jl.Long): Unit = {
> _sum += v
>   }
>   override def merge(other: Accumulator[jl.Long, jl.Long]): Unit = other 
> match {
> case o: LongAccumulator => _sum += o.sum
> case _ => throw new UnsupportedOperationException(
>   s"Cannot merge ${this.getClass.getName} with ${other.getClass.getName}")
>   }
>   override def value: jl.Long = _sum
>   def sum: Long = _sum
> }
> {code}
> and SparkContext...
> {code}
> class SparkContext {
>   ...
>   def newLongAccumulator(): LongAccumulator
>   def newLongAccumulator(name: Long): LongAccumulator
>   def newLongAccumulator(name: Long, dedup: Boolean): LongAccumulator
>   def registerAccumulator[IN, OUT](acc: Accumulator[IN, OUT]): 
> Accumulator[IN, OUT]
>   ...
> }
> {code}
> To use it ...
> {code}
> val acc = sc.newLongAccumulator()
> sc.parallelize(1 to 1000).map { i =>
>   acc.add(1)
>   i
> }
> {code}
> A work-in-progress prototype here: 
> https://github.com/rxin/spark/tree/accumulator-refactor



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14654) New accumulator API

2016-04-23 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255355#comment-15255355
 ] 

holdenk commented on SPARK-14654:
-

If we're only going to do Long and Double for the easy creation on the 
SparkContext, then I can certainly see why it wouldn't be worth the headaches of 
using reflection to avoid the duplicated boilerplate code between types. I 
didn't intend to suggest that the only way to create the accumulators would be 
through the reflection-based API, just that it could replace the individual 
convenience functions on the SparkContext (we would still have the ability to 
construct custom Accumulators and register them with registerAccumulator).

> New accumulator API
> ---
>
> Key: SPARK-14654
> URL: https://issues.apache.org/jira/browse/SPARK-14654
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> The current accumulator API has a few problems:
> 1. Its type hierarchy is very complicated, with Accumulator, Accumulable, 
> AccumulatorParam, AccumulableParam, etc.
> 2. The intermediate buffer type must be the same as the output type, so there 
> is no way to define an accumulator that computes averages.
> 3. It is very difficult to specialize the methods, leading to excessive 
> boxing and making accumulators bad for metrics that change for each record.
> 4. There is not a single coherent API that works for both Java and Scala.
> This is a proposed new API that addresses all of the above. In this new API:
> 1. There is only a single class (Accumulator) that is user facing
> 2. The intermediate value is stored in the accumulator itself and can be 
> different from the output type.
> 3. Concrete implementations can provide its own specialized methods.
> 4. Designed to work for both Java and Scala.
> {code}
> abstract class Accumulator[IN, OUT] extends Serializable {
>   def isRegistered: Boolean = ...
>   def register(metadata: AccumulatorMetadata): Unit = ...
>   def metadata: AccumulatorMetadata = ...
>   def reset(): Unit
>   def add(v: IN): Unit
>   def merge(other: Accumulator[IN, OUT]): Unit
>   def value: OUT
>   def localValue: OUT = value
>   final def registerAccumulatorOnExecutor(): Unit = {
> // Automatically register the accumulator when it is deserialized with 
> the task closure.
> // This is for external accumulators and internal ones that do not 
> represent task level
> // metrics, e.g. internal SQL metrics, which are per-operator.
> val taskContext = TaskContext.get()
> if (taskContext != null) {
>   taskContext.registerAccumulator(this)
> }
>   }
>   // Called by Java when deserializing an object
>   private def readObject(in: ObjectInputStream): Unit = 
> Utils.tryOrIOException {
> in.defaultReadObject()
> registerAccumulator()
>   }
> }
> {code}
> Metadata, provided by Spark after registration:
> {code}
> class AccumulatorMetadata(
>   val id: Long,
>   val name: Option[String],
>   val countFailedValues: Boolean
> ) extends Serializable
> {code}
> and an implementation that also offers specialized getters and setters
> {code}
> class LongAccumulator extends Accumulator[jl.Long, jl.Long] {
>   private[this] var _sum = 0L
>   override def reset(): Unit = _sum = 0L
>   override def add(v: jl.Long): Unit = {
> _sum += v
>   }
>   override def merge(other: Accumulator[jl.Long, jl.Long]): Unit = other 
> match {
> case o: LongAccumulator => _sum += o.sum
> case _ => throw new UnsupportedOperationException(
>   s"Cannot merge ${this.getClass.getName} with ${other.getClass.getName}")
>   }
>   override def value: jl.Long = _sum
>   def sum: Long = _sum
> }
> {code}
> and SparkContext...
> {code}
> class SparkContext {
>   ...
>   def newLongAccumulator(): LongAccumulator
>   def newLongAccumulator(name: Long): LongAccumulator
>   def newLongAccumulator(name: Long, dedup: Boolean): LongAccumulator
>   def registerAccumulator[IN, OUT](acc: Accumulator[IN, OUT]): 
> Accumulator[IN, OUT]
>   ...
> }
> {code}
> To use it ...
> {code}
> val acc = sc.newLongAccumulator()
> sc.parallelize(1 to 1000).map { i =>
>   acc.add(1)
>   i
> }
> {code}
> A work-in-progress prototype here: 
> https://github.com/rxin/spark/tree/accumulator-refactor



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14867) Remove `--force` option in `build/mvn`.

2016-04-23 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-14867:
--
   Priority: Major  (was: Trivial)
Description: 
Currently, `build/mvn` provides a convenient option, `--force`, to use the 
recommended version of Maven without changing the PATH environment variable.

However, there are two problems:
- `dev/lint-java` does not use the newly installed Maven.
- It is inconvenient to type the `--force` option every time.

Once `--force` has been used, we should prefer the Maven version Spark 
recommends.

This issue makes `build/mvn` check first for a Maven installed via the 
`--force` option.


According to [~srowen]'s comment, this issue now aims to remove the `--force` 
option by auto-detecting the Maven version.

  was:
Currently, `build/mvn` provides a convenient option, `--force`, to use the 
recommended version of Maven without changing the PATH environment variable.

However, there are two problems:
- `dev/lint-java` does not use the newly installed Maven.
- It is inconvenient to type the `--force` option every time.

Once `--force` has been used, we should prefer the Maven version Spark 
recommends.

This issue makes `build/mvn` check first for a Maven installed via the 
`--force` option.

Summary: Remove `--force` option in `build/mvn`.  (was: Make 
`build/mvn` to use the downloaded maven if it exist.)

> Remove `--force` option in `build/mvn`.
> ---
>
> Key: SPARK-14867
> URL: https://issues.apache.org/jira/browse/SPARK-14867
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Dongjoon Hyun
>
> Currently, `build/mvn` provides a convenient option, `--force`, to use the 
> recommended version of Maven without changing the PATH environment variable.
> However, there are two problems:
> - `dev/lint-java` does not use the newly installed Maven.
> - It is inconvenient to type the `--force` option every time.
> Once `--force` has been used, we should prefer the Maven version Spark 
> recommends.
> This issue makes `build/mvn` check first for a Maven installed via the 
> `--force` option.
> 
> According to [~srowen]'s comment, this issue now aims to remove the `--force` 
> option by auto-detecting the Maven version.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14729) Implement an existing cluster manager with New ExternalClusterManager interface

2016-04-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14729:


Assignee: Apache Spark

> Implement an existing cluster manager with New ExternalClusterManager 
> interface
> ---
>
> Key: SPARK-14729
> URL: https://issues.apache.org/jira/browse/SPARK-14729
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Reporter: Hemant Bhanawat
>Assignee: Apache Spark
>Priority: Minor
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> SPARK-13904 adds an ExternalClusterManager interface to Spark to allow 
> external cluster managers to spawn Spark components. 
> This JIRA tracks the following suggestion from [~rxin]: 
> 'One thing - can you guys try to see if you can implement one of the existing 
> cluster managers with this, and then we can make sure this is a proper API? 
> Otherwise it is really easy to get removed because it is currently unused by 
> anything in Spark.' 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14729) Implement an existing cluster manager with New ExternalClusterManager interface

2016-04-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14729:


Assignee: (was: Apache Spark)

> Implement an existing cluster manager with New ExternalClusterManager 
> interface
> ---
>
> Key: SPARK-14729
> URL: https://issues.apache.org/jira/browse/SPARK-14729
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Reporter: Hemant Bhanawat
>Priority: Minor
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> SPARK-13904 adds an ExternalClusterManager interface to Spark to allow 
> external cluster managers to spawn Spark components. 
> This JIRA tracks the following suggestion from [~rxin]: 
> 'One thing - can you guys try to see if you can implement one of the existing 
> cluster managers with this, and then we can make sure this is a proper API? 
> Otherwise it is really easy to get removed because it is currently unused by 
> anything in Spark.' 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14729) Implement an existing cluster manager with New ExternalClusterManager interface

2016-04-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255340#comment-15255340
 ] 

Apache Spark commented on SPARK-14729:
--

User 'hbhanawat' has created a pull request for this issue:
https://github.com/apache/spark/pull/12641

> Implement an existing cluster manager with New ExternalClusterManager 
> interface
> ---
>
> Key: SPARK-14729
> URL: https://issues.apache.org/jira/browse/SPARK-14729
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Reporter: Hemant Bhanawat
>Priority: Minor
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> SPARK-13904 adds an ExternalClusterManager interface to Spark to allow 
> external cluster managers to spawn Spark components. 
> This JIRA tracks the following suggestion from [~rxin]: 
> 'One thing - can you guys try to see if you can implement one of the existing 
> cluster managers with this, and then we can make sure this is a proper API? 
> Otherwise it is really easy to get removed because it is currently unused by 
> anything in Spark.' 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14694) Thrift Server + Hive Metastore + Kerberos doesn't work

2016-04-23 Thread zhangguancheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255036#comment-15255036
 ] 

zhangguancheng edited comment on SPARK-14694 at 4/23/16 6:50 PM:
-

Content of hive-site.xml:
{quote}
<configuration>
  <property>
    <name>hive.server2.thrift.port</name>
    <value>1</value>
  </property>
  <property>
    <name>hive.metastore.sasl.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.metastore.kerberos.keytab.file</name>
    <value>/opt/hive/apache-hive-1.1.1-bin/conf/hive.keytab</value>
  </property>
  <property>
    <name>hive.metastore.kerberos.principal</name>
    <value>hive/c1@C1</value>
  </property>
  <property>
    <name>hive.server2.authentication</name>
    <value>KERBEROS</value>
  </property>
  <property>
    <name>hive.server2.authentication.kerberos.principal</name>
    <value>hive/c1@C1</value>
  </property>
  <property>
    <name>hive.server2.authentication.kerberos.keytab</name>
    <value>/opt/hive/apache-hive-1.1.1-bin/conf/hive.keytab</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost/test</value>
    <description>the URL of the MySQL database</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>xxx</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>x</value>
  </property>
  <property>
    <name>datanucleus.autoCreateSchema</name>
    <value>false</value>
  </property>
  <property>
    <name>datanucleus.fixedDatastore</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://localhost:9083</value>
    <description>IP address (or fully-qualified domain name) and port of the metastore host</description>
  </property>
</configuration>
{quote}

And when I set hive.server2.enable.impersonation and hive.server2.enable.doAs 
to false, the error went away: 
{quote}
<property>
  <name>hive.server2.enable.impersonation</name>
  <value>false</value>
</property>
<property>
  <name>hive.server2.enable.doAs</name>
  <value>false</value>
</property>

<property>
  <name>hive.execution.engine</name>
  <value>spark</value>
</property>
{quote}


was (Author: zhangguancheng):
Content of hive-site.xml:
{quote}
<configuration>
  <property>
    <name>hive.server2.thrift.port</name>
    <value>1</value>
  </property>
  <property>
    <name>hive.metastore.sasl.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.metastore.kerberos.keytab.file</name>
    <value>/opt/hive/apache-hive-1.1.1-bin/conf/hive.keytab</value>
  </property>
  <property>
    <name>hive.metastore.kerberos.principal</name>
    <value>hive/c1@C1</value>
  </property>
  <property>
    <name>hive.server2.authentication</name>
    <value>KERBEROS</value>
  </property>
  <property>
    <name>hive.server2.authentication.kerberos.principal</name>
    <value>hive/c1@C1</value>
  </property>
  <property>
    <name>hive.server2.authentication.kerberos.keytab</name>
    <value>/opt/hive/apache-hive-1.1.1-bin/conf/hive.keytab</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost/test</value>
    <description>the URL of the MySQL database</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>xxx</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>x</value>
  </property>
  <property>
    <name>datanucleus.autoCreateSchema</name>
    <value>false</value>
  </property>
  <property>
    <name>datanucleus.fixedDatastore</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://localhost:9083</value>
    <description>IP address (or fully-qualified domain name) and port of the metastore host</description>
  </property>
</configuration>
{quote}


> Thrift Server + Hive Metastore + Kerberos doesn't work
> --
>
> Key: SPARK-14694
> URL: https://issues.apache.org/jira/browse/SPARK-14694
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 1.6.1
> Environment: Spark 1.6.1. compiled with hadoop 2.6.0, yarn, hive
> Hadoop 2.6.4 
> Hive 1.1.1 
> Kerberos
>Reporter: zhangguancheng
>  Labels: security
>
> My Hive Metastore is MySQL based. I started a Spark Thrift Server on the same 
> node as the Hive Metastore. I can open beeline and run select statements, but 
> for some commands like "show databases", I get an error:
> {quote}
> ERROR pool-24-thread-1 org.apache.thrift.transport.TSaslTransport:315 SASL 
> negotiation failure
> javax.security.sasl.SaslException: GSS initiate failed [Caused by 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)]
> at 
> com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
> at 
> org.apache.thrift.transport.TSaslClientTransport.handleSaslStartMessage(TSaslClientTransport.java:94)
> at 
> org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:271)
> at 
> org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37)
> at 
> org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:52)
> at 
> org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:49)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
> at 
> org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport.open(TUGIAssumingTransport.java:49)
> at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:420)
> at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:236)
> at 
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.(SessionHiveMetaStoreClient.java:74)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at 

[jira] [Updated] (SPARK-14594) Improve error messages for RDD API

2016-04-23 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-14594:
--
Assignee: Felix Cheung

> Improve error messages for RDD API
> --
>
> Key: SPARK-14594
> URL: https://issues.apache.org/jira/browse/SPARK-14594
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 1.5.2
>Reporter: Marco Gaido
>Assignee: Felix Cheung
> Fix For: 2.0.0
>
>
> When you have an error in your R code using the RDD API, you always get the 
> error message:
> Error in if (returnStatus != 0) { : argument is of length zero
> This is not very useful, and I think it would be better to catch the R 
> exception and show it instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14594) Improve error messages for RDD API

2016-04-23 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-14594.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12622
[https://github.com/apache/spark/pull/12622]

> Improve error messages for RDD API
> --
>
> Key: SPARK-14594
> URL: https://issues.apache.org/jira/browse/SPARK-14594
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 1.5.2
>Reporter: Marco Gaido
> Fix For: 2.0.0
>
>
> When you have an error in your R code using the RDD API, you always get the 
> error message:
> Error in if (returnStatus != 0) { : argument is of length zero
> This is not very useful, and I think it would be better to catch the R 
> exception and show it instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14654) New accumulator API

2016-04-23 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255312#comment-15255312
 ] 

Reynold Xin edited comment on SPARK-14654 at 4/23/16 6:01 PM:
--

I don't get what you are trying to accomplish. It seems like you enjoy the 
cuteness of reflection. With your proposal:

1. Specialization won't work, which is a big part of this new API.

2. It is less obvious what the return types should be.

3. It is strictly less type safe, and app developers won't know what the 
accepted input types are.

4. It is unclear what the semantics are when "1" is passed in as the initial 
value rather than "0".

5. We would need to implement all the primitive types, which I don't think makes 
sense. In my thing only double and long are implemented. I don't see why we 
should implement all the primitive types. Why have a "byte" accumulator when 
the long one captures almost all the use cases? How often would having a 
"Boolean" accumulator make sense?

You are keeping almost all the issues with the existing API. And you would know 
if you want an avg or a long in the new one, because they have different 
functions.
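For reference, a hedged sketch of what an average-style accumulator could look like under the proposed API, where the internal buffer (sum and count) differs from the output (the mean). The base class below is a trimmed-down, illustrative stand-in for the proposal in the description, not shipped Spark code:
{code}
abstract class SimpleAcc[IN, OUT] extends Serializable {
  def reset(): Unit
  def add(v: IN): Unit
  def merge(other: SimpleAcc[IN, OUT]): Unit
  def value: OUT
}

// Long inputs, Double output, (sum, count) buffer -- the case the old
// Accumulable API could not express cleanly.
class AvgAcc extends SimpleAcc[java.lang.Long, java.lang.Double] {
  private var sum = 0L
  private var count = 0L
  override def reset(): Unit = { sum = 0L; count = 0L }
  override def add(v: java.lang.Long): Unit = { sum += v; count += 1 }
  override def merge(other: SimpleAcc[java.lang.Long, java.lang.Double]): Unit = other match {
    case o: AvgAcc => sum += o.sum; count += o.count
    case _ => throw new UnsupportedOperationException("cannot merge incompatible accumulators")
  }
  override def value: java.lang.Double = if (count == 0) 0.0 else sum.toDouble / count
}
{code}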




was (Author: rxin):
I don't get what you are trying to accomplish. It seems like you enjoy the 
cuteness of reflection. With your proposal:

1. Specialization won't work, which is a big part of this new API.

2. It is less obvious what the return types should be.

3. It is strictly less type safe, and app developers won't know what the 
accepted input types are.

4. It is unclear what the semantics are when "1" is passed in as the initial 
value rather than "0".

5. We would need to implement all the primitive types, which I don't think makes 
sense. In my thing only double and long are implemented. I don't see why we 
should implement all the primitive types. Why have a "byte" accumulator when 
the long one captures almost all the use cases? How often would having a 
"Boolean" accumulator make sense?

You are keeping almost all the issues with the existing API.


And of course you would know if you want an avg or a long in the new one, 
because they have different functions.



> New accumulator API
> ---
>
> Key: SPARK-14654
> URL: https://issues.apache.org/jira/browse/SPARK-14654
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> The current accumulator API has a few problems:
> 1. Its type hierarchy is very complicated, with Accumulator, Accumulable, 
> AccumulatorParam, AccumulableParam, etc.
> 2. The intermediate buffer type must be the same as the output type, so there 
> is no way to define an accumulator that computes averages.
> 3. It is very difficult to specialize the methods, leading to excessive 
> boxing and making accumulators bad for metrics that change for each record.
> 4. There is not a single coherent API that works for both Java and Scala.
> This is a proposed new API that addresses all of the above. In this new API:
> 1. There is only a single class (Accumulator) that is user facing
> 2. The intermediate value is stored in the accumulator itself and can be 
> different from the output type.
> 3. Concrete implementations can provide its own specialized methods.
> 4. Designed to work for both Java and Scala.
> {code}
> abstract class Accumulator[IN, OUT] extends Serializable {
>   def isRegistered: Boolean = ...
>   def register(metadata: AccumulatorMetadata): Unit = ...
>   def metadata: AccumulatorMetadata = ...
>   def reset(): Unit
>   def add(v: IN): Unit
>   def merge(other: Accumulator[IN, OUT]): Unit
>   def value: OUT
>   def localValue: OUT = value
>   final def registerAccumulatorOnExecutor(): Unit = {
> // Automatically register the accumulator when it is deserialized with 
> the task closure.
> // This is for external accumulators and internal ones that do not 
> represent task level
> // metrics, e.g. internal SQL metrics, which are per-operator.
> val taskContext = TaskContext.get()
> if (taskContext != null) {
>   taskContext.registerAccumulator(this)
> }
>   }
>   // Called by Java when deserializing an object
>   private def readObject(in: ObjectInputStream): Unit = 
> Utils.tryOrIOException {
> in.defaultReadObject()
> registerAccumulator()
>   }
> }
> {code}
> Metadata, provided by Spark after registration:
> {code}
> class AccumulatorMetadata(
>   val id: Long,
>   val name: Option[String],
>   val countFailedValues: Boolean
> ) extends Serializable
> {code}
> and an implementation that also offers specialized getters and setters
> {code}
> class LongAccumulator extends Accumulator[jl.Long, jl.Long] {
>   private[this] var _sum = 0L
>   override def reset(): Unit = _sum = 0L
>   override def add(v: jl.Long): Unit = {
> _sum += v
>   }
>   override def merge(other: Accumulator[jl.Long, 

[jira] [Comment Edited] (SPARK-14654) New accumulator API

2016-04-23 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255312#comment-15255312
 ] 

Reynold Xin edited comment on SPARK-14654 at 4/23/16 6:02 PM:
--

I don't get what you are trying to accomplish. It seems like you enjoy the 
cuteness of reflection. With your proposal:

1. Specialization won't work, which is a big part of this new API.

2. It is less obvious what the return types should be.

3. It is strictly less type safe, and app developers won't know what the 
accepted input types are.

4. It is unclear what the semantics are when "1" is passed in as the initial 
value rather than "0".

5. We would need to implement all the primitive types, which I don't think makes 
sense. In my thing only double and long are implemented. I don't see why we 
should implement all the primitive types. Why have a "byte" accumulator when 
the long one captures almost all the use cases? How often would having a 
"Boolean" accumulator make sense?

You are keeping almost all the issues with the existing API. And users would 
know if they want an avg or a long in the new one, because they have different 
functions.




was (Author: rxin):
I don't get what you are trying to accomplish. It seems like you enjoy the 
cuteness of reflection. With your proposal:

1. Specialization won't work, which is a big part of this new API.

2. It is less obvious what the return types should be.

3. It is strictly less type safe, and app developers won't know what the 
accepted input types are.

4. It is unclear what the semantics are when "1" is passed in as the initial 
value rather than "0".

5. We would need to implement all the primitive types, which I don't think makes 
sense. In my thing only double and long are implemented. I don't see why we 
should implement all the primitive types. Why have a "byte" accumulator when 
the long one captures almost all the use cases? How often would having a 
"Boolean" accumulator make sense?

You are keeping almost all the issues with the existing API. And you would know 
if you want an avg or a long in the new one, because they have different 
functions.



> New accumulator API
> ---
>
> Key: SPARK-14654
> URL: https://issues.apache.org/jira/browse/SPARK-14654
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> The current accumulator API has a few problems:
> 1. Its type hierarchy is very complicated, with Accumulator, Accumulable, 
> AccumulatorParam, AccumulableParam, etc.
> 2. The intermediate buffer type must be the same as the output type, so there 
> is no way to define an accumulator that computes averages.
> 3. It is very difficult to specialize the methods, leading to excessive 
> boxing and making accumulators bad for metrics that change for each record.
> 4. There is not a single coherent API that works for both Java and Scala.
> This is a proposed new API that addresses all of the above. In this new API:
> 1. There is only a single class (Accumulator) that is user facing
> 2. The intermediate value is stored in the accumulator itself and can be 
> different from the output type.
> 3. Concrete implementations can provide its own specialized methods.
> 4. Designed to work for both Java and Scala.
> {code}
> abstract class Accumulator[IN, OUT] extends Serializable {
>   def isRegistered: Boolean = ...
>   def register(metadata: AccumulatorMetadata): Unit = ...
>   def metadata: AccumulatorMetadata = ...
>   def reset(): Unit
>   def add(v: IN): Unit
>   def merge(other: Accumulator[IN, OUT]): Unit
>   def value: OUT
>   def localValue: OUT = value
>   final def registerAccumulatorOnExecutor(): Unit = {
> // Automatically register the accumulator when it is deserialized with 
> the task closure.
> // This is for external accumulators and internal ones that do not 
> represent task level
> // metrics, e.g. internal SQL metrics, which are per-operator.
> val taskContext = TaskContext.get()
> if (taskContext != null) {
>   taskContext.registerAccumulator(this)
> }
>   }
>   // Called by Java when deserializing an object
>   private def readObject(in: ObjectInputStream): Unit = 
> Utils.tryOrIOException {
> in.defaultReadObject()
> registerAccumulator()
>   }
> }
> {code}
> Metadata, provided by Spark after registration:
> {code}
> class AccumulatorMetadata(
>   val id: Long,
>   val name: Option[String],
>   val countFailedValues: Boolean
> ) extends Serializable
> {code}
> and an implementation that also offers specialized getters and setters
> {code}
> class LongAccumulator extends Accumulator[jl.Long, jl.Long] {
>   private[this] var _sum = 0L
>   override def reset(): Unit = _sum = 0L
>   override def add(v: jl.Long): Unit = {
> _sum += v
>   }
>   override def merge(other: Accumulator[jl.Long, jl.Long]): 

[jira] [Comment Edited] (SPARK-14654) New accumulator API

2016-04-23 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255312#comment-15255312
 ] 

Reynold Xin edited comment on SPARK-14654 at 4/23/16 5:54 PM:
--

I don't get what you are trying to accomplish. It seems like you enjoy the 
cuteness of reflection. With your proposal:

1. Specialization won't work, which is a big part of this new API.

2. It is less obvious what the return types should be.

3. It is strictly less type safe, and app developers won't know what the 
accepted input types are.

4. It is unclear what the semantics are when "1" is passed in as the initial 
value rather than "0".

5. We would need to implement all the primitive types, which I don't think makes 
sense. In my thing only double and long are implemented. I don't see why we 
should implement all the primitive types. Why have a "byte" accumulator when 
the long one captures almost all the use cases? How often would having a 
"Boolean" accumulator make sense?

You are keeping almost all the issues with the existing API.


And of course you would know if you want an avg or a long in the new one, 
because they have different functions.




was (Author: rxin):
I don't get what you are trying to accomplish. It seems like you enjoy the 
cuteness of reflection. With your proposal:

1. Specialization won't work, which is a big part of this new API.

2. It is less obvious what the return types should be.

3. It is strictly less type safe, and app developers won't know what the 
accepted input types are.

4. It is unclear what the semantics are when "1" is passed in as the initial 
value rather than "0".

5. We would need to implement all the primitive types, which I don't think makes 
sense. In my thing only double and long are implemented. I don't see why we 
should implement all the primitive types. Why have a "byte" accumulator when 
the long one captures almost all the use cases? How often would having a 
"Boolean" accumulator make sense?

You are keeping almost all the issues with the existing API.



> New accumulator API
> ---
>
> Key: SPARK-14654
> URL: https://issues.apache.org/jira/browse/SPARK-14654
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> The current accumulator API has a few problems:
> 1. Its type hierarchy is very complicated, with Accumulator, Accumulable, 
> AccumulatorParam, AccumulableParam, etc.
> 2. The intermediate buffer type must be the same as the output type, so there 
> is no way to define an accumulator that computes averages.
> 3. It is very difficult to specialize the methods, leading to excessive 
> boxing and making accumulators bad for metrics that change for each record.
> 4. There is not a single coherent API that works for both Java and Scala.
> This is a proposed new API that addresses all of the above. In this new API:
> 1. There is only a single class (Accumulator) that is user facing
> 2. The intermediate value is stored in the accumulator itself and can be 
> different from the output type.
> 3. Concrete implementations can provide its own specialized methods.
> 4. Designed to work for both Java and Scala.
> {code}
> abstract class Accumulator[IN, OUT] extends Serializable {
>   def isRegistered: Boolean = ...
>   def register(metadata: AccumulatorMetadata): Unit = ...
>   def metadata: AccumulatorMetadata = ...
>   def reset(): Unit
>   def add(v: IN): Unit
>   def merge(other: Accumulator[IN, OUT]): Unit
>   def value: OUT
>   def localValue: OUT = value
>   final def registerAccumulatorOnExecutor(): Unit = {
> // Automatically register the accumulator when it is deserialized with 
> the task closure.
> // This is for external accumulators and internal ones that do not 
> represent task level
> // metrics, e.g. internal SQL metrics, which are per-operator.
> val taskContext = TaskContext.get()
> if (taskContext != null) {
>   taskContext.registerAccumulator(this)
> }
>   }
>   // Called by Java when deserializing an object
>   private def readObject(in: ObjectInputStream): Unit = 
> Utils.tryOrIOException {
> in.defaultReadObject()
> registerAccumulator()
>   }
> }
> {code}
> Metadata, provided by Spark after registration:
> {code}
> class AccumulatorMetadata(
>   val id: Long,
>   val name: Option[String],
>   val countFailedValues: Boolean
> ) extends Serializable
> {code}
> and an implementation that also offers specialized getters and setters
> {code}
> class LongAccumulator extends Accumulator[jl.Long, jl.Long] {
>   private[this] var _sum = 0L
>   override def reset(): Unit = _sum = 0L
>   override def add(v: jl.Long): Unit = {
> _sum += v
>   }
>   override def merge(other: Accumulator[jl.Long, jl.Long]): Unit = other 
> match {
> case o: LongAccumulator => _sum += o.sum
> case _ => throw new 

[jira] [Commented] (SPARK-14654) New accumulator API

2016-04-23 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255312#comment-15255312
 ] 

Reynold Xin commented on SPARK-14654:
-

I don't get what you are trying to accomplish. It seems like you enjoy the 
cuteness of reflection. With your proposal:

1. Specialization won't work, which is a big part of this new API.

2. It is less obvious what the return types should be.

3. It is strictly less type safe, and app developers won't know what the 
accepted input types are.

4. It is unclear what the semantics is when "1" is passed in as initial value 
rather than "0".

5. We would need to implement all the primitive types, which I don't think make 
sense. In my thing only double and long are implemented. I don't see why we 
should implement all the primitive types. Why have a "byte" accumulator when 
the long one captures almost all the use cases? How often would having a 
"Boolean" accumulator make sense?

You are keeping almost all the issues with the existing API.



> New accumulator API
> ---
>
> Key: SPARK-14654
> URL: https://issues.apache.org/jira/browse/SPARK-14654
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> The current accumulator API has a few problems:
> 1. Its type hierarchy is very complicated, with Accumulator, Accumulable, 
> AccumulatorParam, AccumulableParam, etc.
> 2. The intermediate buffer type must be the same as the output type, so there 
> is no way to define an accumulator that computes averages.
> 3. It is very difficult to specialize the methods, leading to excessive 
> boxing and making accumulators bad for metrics that change for each record.
> 4. There is not a single coherent API that works for both Java and Scala.
> This is a proposed new API that addresses all of the above. In this new API:
> 1. There is only a single class (Accumulator) that is user facing
> 2. The intermediate value is stored in the accumulator itself and can be 
> different from the output type.
> 3. Concrete implementations can provide its own specialized methods.
> 4. Designed to work for both Java and Scala.
> {code}
> abstract class Accumulator[IN, OUT] extends Serializable {
>   def isRegistered: Boolean = ...
>   def register(metadata: AccumulatorMetadata): Unit = ...
>   def metadata: AccumulatorMetadata = ...
>   def reset(): Unit
>   def add(v: IN): Unit
>   def merge(other: Accumulator[IN, OUT]): Unit
>   def value: OUT
>   def localValue: OUT = value
>   final def registerAccumulatorOnExecutor(): Unit = {
> // Automatically register the accumulator when it is deserialized with 
> the task closure.
> // This is for external accumulators and internal ones that do not 
> represent task level
> // metrics, e.g. internal SQL metrics, which are per-operator.
> val taskContext = TaskContext.get()
> if (taskContext != null) {
>   taskContext.registerAccumulator(this)
> }
>   }
>   // Called by Java when deserializing an object
>   private def readObject(in: ObjectInputStream): Unit = 
> Utils.tryOrIOException {
> in.defaultReadObject()
> registerAccumulator()
>   }
> }
> {code}
> Metadata, provided by Spark after registration:
> {code}
> class AccumulatorMetadata(
>   val id: Long,
>   val name: Option[String],
>   val countFailedValues: Boolean
> ) extends Serializable
> {code}
> and an implementation that also offers specialized getters and setters
> {code}
> class LongAccumulator extends Accumulator[jl.Long, jl.Long] {
>   private[this] var _sum = 0L
>   override def reset(): Unit = _sum = 0L
>   override def add(v: jl.Long): Unit = {
> _sum += v
>   }
>   override def merge(other: Accumulator[jl.Long, jl.Long]): Unit = other 
> match {
> case o: LongAccumulator => _sum += o.sum
> case _ => throw new UnsupportedOperationException(
>   s"Cannot merge ${this.getClass.getName} with ${other.getClass.getName}")
>   }
>   override def value: jl.Long = _sum
>   def sum: Long = _sum
> }
> {code}
> and SparkContext...
> {code}
> class SparkContext {
>   ...
>   def newLongAccumulator(): LongAccumulator
>   def newLongAccumulator(name: String): LongAccumulator
>   def newLongAccumulator(name: String, dedup: Boolean): LongAccumulator
>   def registerAccumulator[IN, OUT](acc: Accumulator[IN, OUT]): 
> Accumulator[IN, OUT]
>   ...
> }
> {code}
> To use it ...
> {code}
> val acc = sc.newLongAccumulator()
> sc.parallelize(1 to 1000).map { i =>
>   acc.add(1)
>   i
> }
> {code}
> A work-in-progress prototype here: 
> https://github.com/rxin/spark/tree/accumulator-refactor



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14873) Java sampleByKey methods take ju.Map but with Scala Double values; results in type Object

2016-04-23 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-14873.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

> Java sampleByKey methods take ju.Map but with Scala Double values; results in 
> type Object
> -
>
> Key: SPARK-14873
> URL: https://issues.apache.org/jira/browse/SPARK-14873
> Project: Spark
>  Issue Type: Sub-task
>  Components: Java API, Spark Core
>Affects Versions: 1.6.1
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 2.0.0
>
>
> There's this odd bit of code in {{JavaStratifiedSamplingExample}}:
> {code}
> // specify the exact fraction desired from each key Map
> ImmutableMap fractions =
>   ImmutableMap.of(1, (Object)0.1, 2, (Object) 0.6, 3, (Object) 0.3);
> // Get an approximate sample from each stratum
> JavaPairRDD approxSample = data.sampleByKey(false, 
> fractions);
> {code}
> It highlights a problem like that in 
> https://issues.apache.org/jira/browse/SPARK-12604 where Scala primitive types 
> are used where Java requires an object, and the result is that a signature 
> that logically takes Double (objects) takes an Object in the Java API. It's 
> an easy, similar fix.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14654) New accumulator API

2016-04-23 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255291#comment-15255291
 ] 

holdenk commented on SPARK-14654:
-

You wouldn't know if you want a counter or an average, but the same applies to the 
function `newLongAccumulator`: you give them a counter, and if they want a 
different combiner function they subclass `Accumulator` to implement it (or, if we 
wanted to offer averages easily, we could add a flag or a separate 
newAverageAccumulator call and provide standard average implementations for the 
built-in classes). If the user passes a 1 it means the accumulator starts 
with a value of 1. To me it just feels a little clunky to have newXAccumulator 
\forall X in {default supported types} - but it is clearer at compile time so I 
can see why it might be a better fit.

If we do end up adding a lot of newXAccumulator to the API I think we should 
consider either grouping them in the API docs or moving them to a separate 
class.
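
To make the subclassing route concrete, here is a minimal, purely illustrative sketch of an 
average-style accumulator written against the proposed Accumulator[IN, OUT] base class quoted 
below (it mirrors the LongAccumulator sketch in the description; AverageAccumulator is not an 
existing Spark class):

{code}
// Illustrative only: subclasses the proposed Accumulator[IN, OUT] base class
// from the issue description; not an actual Spark class.
class AverageAccumulator extends Accumulator[java.lang.Double, java.lang.Double] {
  private[this] var _sum = 0.0
  private[this] var _count = 0L

  def sum: Double = _sum
  def count: Long = _count

  override def reset(): Unit = { _sum = 0.0; _count = 0L }

  override def add(v: java.lang.Double): Unit = { _sum += v; _count += 1 }

  override def merge(other: Accumulator[java.lang.Double, java.lang.Double]): Unit = other match {
    case o: AverageAccumulator =>
      _sum += o.sum
      _count += o.count
    case _ => throw new UnsupportedOperationException(
      s"Cannot merge ${this.getClass.getName} with ${other.getClass.getName}")
  }

  // The intermediate state (sum, count) differs from the output type (the mean),
  // which is what point 2 of the issue description allows for.
  override def value: java.lang.Double = if (_count == 0L) 0.0 else _sum / _count
}
{code}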

> New accumulator API
> ---
>
> Key: SPARK-14654
> URL: https://issues.apache.org/jira/browse/SPARK-14654
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> The current accumulator API has a few problems:
> 1. Its type hierarchy is very complicated, with Accumulator, Accumulable, 
> AccumulatorParam, AccumulableParam, etc.
> 2. The intermediate buffer type must be the same as the output type, so there 
> is no way to define an accumulator that computes averages.
> 3. It is very difficult to specialize the methods, leading to excessive 
> boxing and making accumulators bad for metrics that change for each record.
> 4. There is not a single coherent API that works for both Java and Scala.
> This is a proposed new API that addresses all of the above. In this new API:
> 1. There is only a single class (Accumulator) that is user facing
> 2. The intermediate value is stored in the accumulator itself and can be 
> different from the output type.
> 3. Concrete implementations can provide its own specialized methods.
> 4. Designed to work for both Java and Scala.
> {code}
> abstract class Accumulator[IN, OUT] extends Serializable {
>   def isRegistered: Boolean = ...
>   def register(metadata: AccumulatorMetadata): Unit = ...
>   def metadata: AccumulatorMetadata = ...
>   def reset(): Unit
>   def add(v: IN): Unit
>   def merge(other: Accumulator[IN, OUT]): Unit
>   def value: OUT
>   def localValue: OUT = value
>   final def registerAccumulatorOnExecutor(): Unit = {
> // Automatically register the accumulator when it is deserialized with 
> the task closure.
> // This is for external accumulators and internal ones that do not 
> represent task level
> // metrics, e.g. internal SQL metrics, which are per-operator.
> val taskContext = TaskContext.get()
> if (taskContext != null) {
>   taskContext.registerAccumulator(this)
> }
>   }
>   // Called by Java when deserializing an object
>   private def readObject(in: ObjectInputStream): Unit = 
> Utils.tryOrIOException {
> in.defaultReadObject()
> registerAccumulator()
>   }
> }
> {code}
> Metadata, provided by Spark after registration:
> {code}
> class AccumulatorMetadata(
>   val id: Long,
>   val name: Option[String],
>   val countFailedValues: Boolean
> ) extends Serializable
> {code}
> and an implementation that also offers specialized getters and setters
> {code}
> class LongAccumulator extends Accumulator[jl.Long, jl.Long] {
>   private[this] var _sum = 0L
>   override def reset(): Unit = _sum = 0L
>   override def add(v: jl.Long): Unit = {
> _sum += v
>   }
>   override def merge(other: Accumulator[jl.Long, jl.Long]): Unit = other 
> match {
> case o: LongAccumulator => _sum += o.sum
> case _ => throw new UnsupportedOperationException(
>   s"Cannot merge ${this.getClass.getName} with ${other.getClass.getName}")
>   }
>   override def value: jl.Long = _sum
>   def sum: Long = _sum
> }
> {code}
> and SparkContext...
> {code}
> class SparkContext {
>   ...
>   def newLongAccumulator(): LongAccumulator
>   def newLongAccumulator(name: String): LongAccumulator
>   def newLongAccumulator(name: String, dedup: Boolean): LongAccumulator
>   def registerAccumulator[IN, OUT](acc: Accumulator[IN, OUT]): 
> Accumulator[IN, OUT]
>   ...
> }
> {code}
> To use it ...
> {code}
> val acc = sc.newLongAccumulator()
> sc.parallelize(1 to 1000).map { i =>
>   acc.add(1)
>   i
> }
> {code}
> A work-in-progress prototype here: 
> https://github.com/rxin/spark/tree/accumulator-refactor



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14850) VectorUDT/MatrixUDT should take primitive arrays without boxing

2016-04-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14850:


Assignee: Apache Spark

> VectorUDT/MatrixUDT should take primitive arrays without boxing
> ---
>
> Key: SPARK-14850
> URL: https://issues.apache.org/jira/browse/SPARK-14850
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SQL
>Affects Versions: 1.5.2, 1.6.1, 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Apache Spark
>Priority: Blocker
>
> In SPARK-9390, we switched to use GenericArrayData to store indices and 
> values in vector/matrix UDTs. However, GenericArrayData is not specialized 
> for primitive types. This might hurt MLlib performance badly. We should 
> consider either specialize GenericArrayData or use a different container.
> cc: [~cloud_fan] [~yhuai]
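
As a rough illustration of the overhead being described, a standalone sketch (neither class below 
is Spark code): a container backed by Array[Any] holds one boxed java.lang.Double per element and 
unboxes on every read, while a container specialized to Array[Double] stores flat primitives.

{code}
// Standalone sketch only -- not GenericArrayData or any other Spark class.
object BoxingSketch {
  // Generic storage: every Double is boxed to java.lang.Double, and each read unboxes.
  final class BoxedArrayData(values: Array[Any]) {
    def getDouble(i: Int): Double = values(i).asInstanceOf[Double]
  }

  // Specialized storage: one flat double[] on the JVM, no per-element objects.
  final class PrimitiveDoubleArrayData(values: Array[Double]) {
    def getDouble(i: Int): Double = values(i)
  }

  def main(args: Array[String]): Unit = {
    val raw = Array.tabulate(1000000)(_.toDouble)
    val boxed = new BoxedArrayData(raw.map(x => x: Any)) // allocates ~1M java.lang.Double objects
    val primitive = new PrimitiveDoubleArrayData(raw)    // reuses the primitive array as-is
    println(boxed.getDouble(42) + primitive.getDouble(42))
  }
}
{code}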



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14850) VectorUDT/MatrixUDT should take primitive arrays without boxing

2016-04-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255282#comment-15255282
 ] 

Apache Spark commented on SPARK-14850:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/12640

> VectorUDT/MatrixUDT should take primitive arrays without boxing
> ---
>
> Key: SPARK-14850
> URL: https://issues.apache.org/jira/browse/SPARK-14850
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SQL
>Affects Versions: 1.5.2, 1.6.1, 2.0.0
>Reporter: Xiangrui Meng
>Priority: Blocker
>
> In SPARK-9390, we switched to use GenericArrayData to store indices and 
> values in vector/matrix UDTs. However, GenericArrayData is not specialized 
> for primitive types. This might hurt MLlib performance badly. We should 
> consider either specialize GenericArrayData or use a different container.
> cc: [~cloud_fan] [~yhuai]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14850) VectorUDT/MatrixUDT should take primitive arrays without boxing

2016-04-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14850:


Assignee: (was: Apache Spark)

> VectorUDT/MatrixUDT should take primitive arrays without boxing
> ---
>
> Key: SPARK-14850
> URL: https://issues.apache.org/jira/browse/SPARK-14850
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SQL
>Affects Versions: 1.5.2, 1.6.1, 2.0.0
>Reporter: Xiangrui Meng
>Priority: Blocker
>
> In SPARK-9390, we switched to use GenericArrayData to store indices and 
> values in vector/matrix UDTs. However, GenericArrayData is not specialized 
> for primitive types. This might hurt MLlib performance badly. We should 
> consider either specialize GenericArrayData or use a different container.
> cc: [~cloud_fan] [~yhuai]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14864) [MLLIB] Implement Doc2Vec

2016-04-23 Thread Peter Mountanos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255275#comment-15255275
 ] 

Peter Mountanos commented on SPARK-14864:
-

[~prudenko] [~cqnguyen] I noticed previous discussion of possibly implementing 
Doc2Vec in issue SPARK-4101. Has there been any headway on this?

> [MLLIB] Implement Doc2Vec
> -
>
> Key: SPARK-14864
> URL: https://issues.apache.org/jira/browse/SPARK-14864
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Peter Mountanos
>Priority: Minor
>
> It would be useful to implement Doc2Vec, as described in the paper 
> [Distributed Representations of Sentences and 
> Documents|https://cs.stanford.edu/~quocle/paragraph_vector.pdf]. Gensim has 
> an implementation [Deep learning with 
> paragraph2vec|https://radimrehurek.com/gensim/models/doc2vec.html]. 
> Le & Mikolov show that when aggregating Word2Vec vector representations for a 
> paragraph/document, it does not perform well for prediction tasks. Instead, 
> they propose the Paragraph Vector implementation, which provides 
> state-of-the-art results on several text classification and sentiment 
> analysis tasks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14856) Returning batch unexpected from wide table

2016-04-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255261#comment-15255261
 ] 

Apache Spark commented on SPARK-14856:
--

User 'tedyu' has created a pull request for this issue:
https://github.com/apache/spark/pull/12639

> Returning batch unexpected from wide table
> --
>
> Key: SPARK-14856
> URL: https://issues.apache.org/jira/browse/SPARK-14856
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 2.0.0
>
>
> When the required schema supports batched reads but the full schema does not, the 
> Parquet reader may return batches unexpectedly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14875) OutputWriterFactory.newInstance shouldn't be private[sql]

2016-04-23 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255249#comment-15255249
 ] 

Cheng Lian commented on SPARK-14875:


Checked with [~cloud_fan]; it was accidentally made private while adding the 
bucketing feature. I'm removing this qualifier.

> OutputWriterFactory.newInstance shouldn't be private[sql]
> -
>
> Key: SPARK-14875
> URL: https://issues.apache.org/jira/browse/SPARK-14875
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> Existing packages like spark-avro need to access 
> {{OutputWriterFactory.newInstance}}, but it's marked as {{private\[sql\]}} in 
> Spark 2.0. Should make it public again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14875) OutputWriterFactory.newInstance shouldn't be private[sql]

2016-04-23 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255248#comment-15255248
 ] 

Cheng Lian commented on SPARK-14875:


[~marmbrus] Is there any reason why we made it private in Spark 2.0?

> OutputWriterFactory.newInstance shouldn't be private[sql]
> -
>
> Key: SPARK-14875
> URL: https://issues.apache.org/jira/browse/SPARK-14875
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> Existing packages like spark-avro need to access 
> {{OutputWriterFactory.newInstance}}, but it's marked as {{private\[sql\]}} in 
> Spark 2.0. Should make it public again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14875) OutputWriterFactory.newInstance shouldn't be private[sql]

2016-04-23 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-14875:
--

 Summary: OutputWriterFactory.newInstance shouldn't be private[sql]
 Key: SPARK-14875
 URL: https://issues.apache.org/jira/browse/SPARK-14875
 Project: Spark
  Issue Type: Bug
Affects Versions: 2.0.0
Reporter: Cheng Lian
Assignee: Cheng Lian


Existing packages like spark-avro need to access 
{{OutputWriterFactory.newInstance}}, but it's marked as {{private\[sql\]}} in 
Spark 2.0. Should make it public again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14874) Cleanup the useless Batch class

2016-04-23 Thread Liwei Lin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liwei Lin updated SPARK-14874:
--
Summary: Cleanup the useless Batch class  (was: Remove the useless Batch 
class)

> Cleanup the useless Batch class
> ---
>
> Key: SPARK-14874
> URL: https://issues.apache.org/jira/browse/SPARK-14874
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Liwei Lin
>Priority: Minor
>
> The Batch class, which had been used to indicate progress in a stream, was 
> abandoned by SPARK-13985 and then became useless.
> Let's:
> - remove the Batch class
> - rename getBatch(...) to getData(...) for Source
> - rename addBatch(...) to addData(...) for Sink



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14874) Remove the useless Batch class

2016-04-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14874:


Assignee: Apache Spark

> Remove the useless Batch class
> --
>
> Key: SPARK-14874
> URL: https://issues.apache.org/jira/browse/SPARK-14874
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Liwei Lin
>Assignee: Apache Spark
>Priority: Minor
>
> The Batch class, which had been used to indicate progress in a stream, was 
> abandoned by SPARK-13985 and then became useless.
> Let's:
> - remove the Batch class
> - rename getBatch(...) to getData(...) for Source
> - rename addBatch(...) to addData(...) for Sink



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14874) Remove the useless Batch class

2016-04-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255236#comment-15255236
 ] 

Apache Spark commented on SPARK-14874:
--

User 'lw-lin' has created a pull request for this issue:
https://github.com/apache/spark/pull/12638

> Remove the useless Batch class
> --
>
> Key: SPARK-14874
> URL: https://issues.apache.org/jira/browse/SPARK-14874
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Liwei Lin
>Priority: Minor
>
> The Batch class, which had been used to indicate progress in a stream, was 
> abandoned by SPARK-13985 and then became useless.
> Let's:
> - remove the Batch class
> - rename getBatch(...) to getData(...) for Source
> - rename addBatch(...) to addData(...) for Sink



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14874) Remove the useless Batch class

2016-04-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14874:


Assignee: (was: Apache Spark)

> Remove the useless Batch class
> --
>
> Key: SPARK-14874
> URL: https://issues.apache.org/jira/browse/SPARK-14874
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Liwei Lin
>Priority: Minor
>
> The Batch class, which had been used to indicate progress in a stream, was 
> abandoned by SPARK-13985 and then became useless.
> Let's:
> - remove the Batch class
> - rename getBatch(...) to getData(...) for Source
> - rename addBatch(...) to addData(...) for Sink



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14874) Remove the useless Batch class

2016-04-23 Thread Liwei Lin (JIRA)
Liwei Lin created SPARK-14874:
-

 Summary: Remove the useless Batch class
 Key: SPARK-14874
 URL: https://issues.apache.org/jira/browse/SPARK-14874
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.0.0
Reporter: Liwei Lin
Priority: Minor


The Batch class, which had been used to indicate progress in a stream, was 
abandoned by SPARK-13985 and then became useless.

Let's:
- remove the Batch class
- rename getBatch(...) to getData(...) for Source
- rename addBatch(...) to addData(...) for Sink (a simplified sketch follows below)
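
A simplified, purely illustrative sketch of the rename (placeholder types stand in for the real 
Offset and DataFrame; this is not the actual streaming API):

{code}
// Placeholder types only; the real interfaces live in org.apache.spark.sql.execution.streaming.
object RenameSketch {
  type Offset = Long      // stand-in for the streaming Offset class
  type DataFrame = AnyRef // stand-in for org.apache.spark.sql.DataFrame

  trait Source {
    // was: getBatch(start: Option[Offset], end: Offset): DataFrame
    def getData(start: Option[Offset], end: Offset): DataFrame
  }

  trait Sink {
    // was: addBatch(batchId: Long, data: DataFrame): Unit
    def addData(batchId: Long, data: DataFrame): Unit
  }
}
{code}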



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14846) Driver process fails to terminate when graceful shutdown is used

2016-04-23 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255223#comment-15255223
 ] 

Sean Owen commented on SPARK-14846:
---

No, that's not what methods like awaitNanos do in the JDK classes. It waits for 
up to that time, but the normal mechanism is that the Condition is signaled 
before the timeout occurs. This is not a sleep-and-poll.
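
The distinction is easy to demonstrate with plain JDK classes (no Spark involved); a 
self-contained sketch:

{code}
import java.util.concurrent.{Executors, TimeUnit}

// awaitTermination blocks until the pool terminates OR the timeout elapses,
// whichever comes first -- it does not sleep for the full timeout.
object AwaitTerminationSketch {
  def main(args: Array[String]): Unit = {
    val pool = Executors.newFixedThreadPool(1)
    pool.submit(new Runnable { def run(): Unit = Thread.sleep(2000) })
    pool.shutdown()

    val start = System.nanoTime()
    val terminated = pool.awaitTermination(1, TimeUnit.HOURS) // returns as soon as the task finishes
    val elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start)
    println(s"terminated=$terminated after ~$elapsedMs ms")   // ~2000 ms, not one hour
  }
}
{code}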

> Driver process fails to terminate when graceful shutdown is used
> 
>
> Key: SPARK-14846
> URL: https://issues.apache.org/jira/browse/SPARK-14846
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.6.1
>Reporter: Mattias Aspholm
>
> During shutdown, the job scheduler in Streaming (JobScheduler.stop) spends 
> some time waiting for all queued work to complete. If graceful shutdown is 
> used, the time is 1 hour, for non-graceful shutdown it's 2 seconds.
> The wait is implemented using the ThreadPoolExecutor.awaitTermination method 
> in java.util.concurrent. The problem is that instead of looping over the 
> method for the desired period of time, the wait period is passed in as the 
> timeout parameter to awaitTermination. 
> The result is that if the termination condition is false the first time, the 
> method will sleep for the timeout period before trying again. In the case of 
> graceful shutdown this means at least an hour's wait before the condition is 
> checked again, even though all work is completed in just a few seconds. The 
> driver process will continue to live during this time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14873) Java sampleByKey methods take ju.Map but with Scala Double values; results in type Object

2016-04-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14873:


Assignee: Sean Owen  (was: Apache Spark)

> Java sampleByKey methods take ju.Map but with Scala Double values; results in 
> type Object
> -
>
> Key: SPARK-14873
> URL: https://issues.apache.org/jira/browse/SPARK-14873
> Project: Spark
>  Issue Type: Sub-task
>  Components: Java API, Spark Core
>Affects Versions: 1.6.1
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
>
> There's this odd bit of code in {{JavaStratifiedSamplingExample}}:
> {code}
> // specify the exact fraction desired from each key Map
> ImmutableMap fractions =
>   ImmutableMap.of(1, (Object)0.1, 2, (Object) 0.6, 3, (Object) 0.3);
> // Get an approximate sample from each stratum
> JavaPairRDD approxSample = data.sampleByKey(false, 
> fractions);
> {code}
> It highlights a problem like that in 
> https://issues.apache.org/jira/browse/SPARK-12604 where Scala primitive types 
> are used where Java requires an object, and the result is that a signature 
> that logically takes Double (objects) takes an Object in the Java API. It's 
> an easy, similar fix.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14873) Java sampleByKey methods take ju.Map but with Scala Double values; results in type Object

2016-04-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255221#comment-15255221
 ] 

Apache Spark commented on SPARK-14873:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/12637

> Java sampleByKey methods take ju.Map but with Scala Double values; results in 
> type Object
> -
>
> Key: SPARK-14873
> URL: https://issues.apache.org/jira/browse/SPARK-14873
> Project: Spark
>  Issue Type: Sub-task
>  Components: Java API, Spark Core
>Affects Versions: 1.6.1
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
>
> There's this odd bit of code in {{JavaStratifiedSamplingExample}}:
> {code}
> // specify the exact fraction desired from each key Map
> ImmutableMap fractions =
>   ImmutableMap.of(1, (Object)0.1, 2, (Object) 0.6, 3, (Object) 0.3);
> // Get an approximate sample from each stratum
> JavaPairRDD approxSample = data.sampleByKey(false, 
> fractions);
> {code}
> It highlights a problem like that in 
> https://issues.apache.org/jira/browse/SPARK-12604 where Scala primitive types 
> are used where Java requires an object, and the result is that a signature 
> that logically takes Double (objects) takes an Object in the Java API. It's 
> an easy, similar fix.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14873) Java sampleByKey methods take ju.Map but with Scala Double values; results in type Object

2016-04-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14873:


Assignee: Apache Spark  (was: Sean Owen)

> Java sampleByKey methods take ju.Map but with Scala Double values; results in 
> type Object
> -
>
> Key: SPARK-14873
> URL: https://issues.apache.org/jira/browse/SPARK-14873
> Project: Spark
>  Issue Type: Sub-task
>  Components: Java API, Spark Core
>Affects Versions: 1.6.1
>Reporter: Sean Owen
>Assignee: Apache Spark
>Priority: Minor
>
> There's this odd bit of code in {{JavaStratifiedSamplingExample}}:
> {code}
> // specify the exact fraction desired from each key Map
> ImmutableMap fractions =
>   ImmutableMap.of(1, (Object)0.1, 2, (Object) 0.6, 3, (Object) 0.3);
> // Get an approximate sample from each stratum
> JavaPairRDD approxSample = data.sampleByKey(false, 
> fractions);
> {code}
> It highlights a problem like that in 
> https://issues.apache.org/jira/browse/SPARK-12604 where Scala primitive types 
> are used where Java requires an object, and the result is that a signature 
> that logically takes Double (objects) takes an Object in the Java API. It's 
> an easy, similar fix.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14873) Java sampleByKey methods take ju.Map but with Scala Double values; results in type Object

2016-04-23 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-14873:
--
Issue Type: Sub-task  (was: Bug)
Parent: SPARK-11806

> Java sampleByKey methods take ju.Map but with Scala Double values; results in 
> type Object
> -
>
> Key: SPARK-14873
> URL: https://issues.apache.org/jira/browse/SPARK-14873
> Project: Spark
>  Issue Type: Sub-task
>  Components: Java API, Spark Core
>Affects Versions: 1.6.1
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
>
> There's this odd bit of code in {{JavaStratifiedSamplingExample}}:
> {code}
> // specify the exact fraction desired from each key Map
> ImmutableMap fractions =
>   ImmutableMap.of(1, (Object)0.1, 2, (Object) 0.6, 3, (Object) 0.3);
> // Get an approximate sample from each stratum
> JavaPairRDD approxSample = data.sampleByKey(false, 
> fractions);
> {code}
> It highlights a problem like that in 
> https://issues.apache.org/jira/browse/SPARK-12604 where Scala primitive types 
> are used where Java requires an object, and the result is that a signature 
> that logically takes Double (objects) takes an Object in the Java API. It's 
> an easy, similar fix.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14873) Java sampleByKey methods take ju.Map but with Scala Double values; results in type Object

2016-04-23 Thread Sean Owen (JIRA)
Sean Owen created SPARK-14873:
-

 Summary: Java sampleByKey methods take ju.Map but with Scala 
Double values; results in type Object
 Key: SPARK-14873
 URL: https://issues.apache.org/jira/browse/SPARK-14873
 Project: Spark
  Issue Type: Bug
  Components: Java API, Spark Core
Affects Versions: 1.6.1
Reporter: Sean Owen
Assignee: Sean Owen
Priority: Minor


There's this odd bit of code in {{JavaStratifiedSamplingExample}}:

{code}
// specify the exact fraction desired from each key Map
ImmutableMap fractions =
  ImmutableMap.of(1, (Object)0.1, 2, (Object) 0.6, 3, (Object) 0.3);

// Get an approximate sample from each stratum
JavaPairRDD approxSample = data.sampleByKey(false, 
fractions);
{code}

It highlights a problem like that in 
https://issues.apache.org/jira/browse/SPARK-12604 where Scala primitive types 
are used where Java requires an object, and the result is that a signature that 
logically takes Double (objects) takes an Object in the Java API. It's an easy, 
similar fix.
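
For context, a simplified sketch of the signature issue (not the actual JavaPairRDD source): when 
the Java-facing wrapper declares the map's value type as Scala's Double, the generic signature 
Java callers see degrades to Object, which is what forces the casts above; declaring 
java.lang.Double keeps the intended type visible.

{code}
import java.{lang => jl, util => ju}

// Simplified stand-in for the Java API wrapper, to show the two signatures side by side.
class SampleByKeySignatureSketch[K] {
  // Java callers see: sampleByKey(boolean, java.util.Map<K, Object>)
  def sampleByKeyWithScalaDouble(withReplacement: Boolean, fractions: ju.Map[K, Double]): Unit = ()

  // Java callers see: sampleByKey(boolean, java.util.Map<K, java.lang.Double>)
  def sampleByKeyWithBoxedDouble(withReplacement: Boolean, fractions: ju.Map[K, jl.Double]): Unit = ()
}
{code}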



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14872) Restructure commands.scala

2016-04-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14872:


Assignee: Reynold Xin  (was: Apache Spark)

> Restructure commands.scala
> --
>
> Key: SPARK-14872
> URL: https://issues.apache.org/jira/browse/SPARK-14872
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14872) Restructure commands.scala

2016-04-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255194#comment-15255194
 ] 

Apache Spark commented on SPARK-14872:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/12636

> Restructure commands.scala
> --
>
> Key: SPARK-14872
> URL: https://issues.apache.org/jira/browse/SPARK-14872
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14872) Restructure commands.scala

2016-04-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14872:


Assignee: Apache Spark  (was: Reynold Xin)

> Restructure commands.scala
> --
>
> Key: SPARK-14872
> URL: https://issues.apache.org/jira/browse/SPARK-14872
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14872) Restructure commands.scala

2016-04-23 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-14872:
---

 Summary: Restructure commands.scala
 Key: SPARK-14872
 URL: https://issues.apache.org/jira/browse/SPARK-14872
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14871) Disable StatsReportListener to declutter output

2016-04-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14871:


Assignee: Apache Spark  (was: Reynold Xin)

> Disable StatsReportListener to declutter output
> ---
>
> Key: SPARK-14871
> URL: https://issues.apache.org/jira/browse/SPARK-14871
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> Spark SQL inherited the use of StatsReportListener from Shark. Unfortunately 
> this clutters the spark-sql CLI output and makes it very difficult to read 
> the actual query results.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14871) Disable StatsReportListener to declutter output

2016-04-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14871:


Assignee: Reynold Xin  (was: Apache Spark)

> Disable StatsReportListener to declutter output
> ---
>
> Key: SPARK-14871
> URL: https://issues.apache.org/jira/browse/SPARK-14871
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Spark SQL inherited the use of StatsReportListener from Shark. Unfortunately 
> this clutters the spark-sql CLI output and makes it very difficult to read 
> the actual query results.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14871) Disable StatsReportListener to declutter output

2016-04-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255187#comment-15255187
 ] 

Apache Spark commented on SPARK-14871:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/12635

> Disable StatsReportListener to declutter output
> ---
>
> Key: SPARK-14871
> URL: https://issues.apache.org/jira/browse/SPARK-14871
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Spark SQL inherited the use of StatsReportListener from Shark. Unfortunately 
> this clutters the spark-sql CLI output and makes it very difficult to read 
> the actual query results.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14871) Disable StatsReportListener to declutter output

2016-04-23 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-14871:
---

 Summary: Disable StatsReportListener to declutter output
 Key: SPARK-14871
 URL: https://issues.apache.org/jira/browse/SPARK-14871
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin


Spark SQL inherited the use of StatsReportListener from Shark. Unfortunately this 
clutters the spark-sql CLI output and makes it very difficult to read the 
actual query results.
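
If the default registration goes away, users who still want the per-stage summaries should be 
able to opt back in themselves, for example by registering the listener explicitly (or via the 
spark.extraListeners configuration); a minimal sketch:

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.scheduler.StatsReportListener

// Minimal sketch: re-enable the stage statistics output for a single application.
object StatsListenerOptIn {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("stats-opt-in").setMaster("local[*]"))
    sc.addSparkListener(new StatsReportListener) // prints runtime percentiles per completed stage
    sc.parallelize(1 to 1000).count()
    sc.stop()
  }
}
{code}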




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14594) Improve error messages for RDD API

2016-04-23 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255186#comment-15255186
 ] 

Marco Gaido commented on SPARK-14594:
-

Yes, I do believe that this is what is happening

> Improve error messages for RDD API
> --
>
> Key: SPARK-14594
> URL: https://issues.apache.org/jira/browse/SPARK-14594
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 1.5.2
>Reporter: Marco Gaido
>
> When you have an error in your R code using the RDD API, you always get as 
> error message:
> Error in if (returnStatus != 0) { : argument is of length zero
> This is not very useful and I think it might be better to catch the R 
> exception and show it instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12148) SparkR: rename DataFrame to SparkDataFrame

2016-04-23 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-12148.
-
   Resolution: Fixed
 Assignee: Felix Cheung
Fix Version/s: 2.0.0

> SparkR: rename DataFrame to SparkDataFrame
> --
>
> Key: SPARK-12148
> URL: https://issues.apache.org/jira/browse/SPARK-12148
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Michael Lawrence
>Assignee: Felix Cheung
> Fix For: 2.0.0
>
>
> The SparkR package represents a Spark DataFrame with the class "DataFrame". 
> That conflicts with the more general DataFrame class defined in the S4Vectors 
> package. Would it not be more appropriate to use the name "SparkDataFrame" 
> instead?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14870) NPE in generate aggregate

2016-04-23 Thread Davies Liu (JIRA)
Davies Liu created SPARK-14870:
--

 Summary: NPE in generate aggregate
 Key: SPARK-14870
 URL: https://issues.apache.org/jira/browse/SPARK-14870
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Davies Liu
Assignee: Sameer Agarwal



When running TPC-DS Q14a:
{code}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 126.0 failed 1 times, most recent failure: Lost task 0.0 in stage 126.0 
(TID 234, localhost): java.lang.NullPointerException
at 
org.apache.spark.sql.execution.vectorized.ColumnVector.putDecimal(ColumnVector.java:576)
at 
org.apache.spark.sql.execution.vectorized.ColumnarBatch$Row.setDecimal(ColumnarBatch.java:325)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$7$$anon$1.hasNext(WholeStageCodegenExec.scala:361)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:254)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1450)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1438)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1437)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1437)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:809)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:809)
at scala.Option.foreach(Option.scala:257)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:809)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1659)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1618)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1607)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at 
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1780)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1793)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1806)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1820)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:880)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:357)
at org.apache.spark.rdd.RDD.collect(RDD.scala:879)
at 
org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:453)
at 
org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply$mcI$sp(Dataset.scala:2367)
at 
org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:2367)
at 
org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:2367)
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:53)
at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2386)
at org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:2366)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 

[jira] [Assigned] (SPARK-14869) Don't mask exceptions in ResolveRelations

2016-04-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14869:


Assignee: Apache Spark  (was: Reynold Xin)

> Don't mask exceptions in ResolveRelations
> -
>
> Key: SPARK-14869
> URL: https://issues.apache.org/jira/browse/SPARK-14869
> Project: Spark
>  Issue Type: Improvement
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> In order to support SPARK-11197 (run SQL directly on files), we added some 
> code in ResolveRelations to catch the exception thrown by 
> catalog.lookupRelation and ignore it. This unfortunately masks all the 
> exceptions. It should've been sufficient to simply test the table does not 
> exist.
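
A hypothetical sketch of the idea (the catalog trait and method names are stand-ins, not the 
actual Spark patch): gate the file-path fallback on an explicit existence check so that real 
failures from lookupRelation propagate.

{code}
// Stand-in types; not the real SessionCatalog/LogicalPlan API.
trait CatalogLike {
  def tableExists(name: String): Boolean
  def lookupRelation(name: String): AnyRef
}

object ResolveRelationsSketch {
  def resolve(catalog: CatalogLike, name: String, resolveAsFile: String => AnyRef): AnyRef =
    if (catalog.tableExists(name)) {
      // Any exception thrown here is a genuine error and is no longer swallowed.
      catalog.lookupRelation(name)
    } else {
      // Fall back to running SQL directly on a file path (SPARK-11197).
      resolveAsFile(name)
    }
}
{code}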



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14869) Don't mask exceptions in ResolveRelations

2016-04-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14869:


Assignee: Reynold Xin  (was: Apache Spark)

> Don't mask exceptions in ResolveRelations
> -
>
> Key: SPARK-14869
> URL: https://issues.apache.org/jira/browse/SPARK-14869
> Project: Spark
>  Issue Type: Improvement
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> In order to support SPARK-11197 (run SQL directly on files), we added some 
> code in ResolveRelations to catch the exception thrown by 
> catalog.lookupRelation and ignore it. This unfortunately masks all the 
> exceptions. It should've been sufficient to simply test the table does not 
> exist.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


