[jira] [Resolved] (SPARK-14879) Move CreateMetastoreDataSource and CreateMetastoreDataSourceAsSelect to sql/core
[ https://issues.apache.org/jira/browse/SPARK-14879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-14879. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12645 [https://github.com/apache/spark/pull/12645] > Move CreateMetastoreDataSource and CreateMetastoreDataSourceAsSelect to > sql/core > > > Key: SPARK-14879 > URL: https://issues.apache.org/jira/browse/SPARK-14879 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Yin Huai > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14833) Refactor StreamTests to test for source fault-tolerance correctly.
[ https://issues.apache.org/jira/browse/SPARK-14833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-14833. -- Resolution: Fixed Fix Version/s: 2.0.0 > Refactor StreamTests to test for source fault-tolerance correctly. > -- > > Key: SPARK-14833 > URL: https://issues.apache.org/jira/browse/SPARK-14833 > Project: Spark > Issue Type: Sub-task > Components: SQL, Streaming >Reporter: Tathagata Das >Assignee: Tathagata Das > Fix For: 2.0.0 > > > The current StreamTest allows testing of a streaming Dataset generated by explicitly > wrapping a source. This is different from the actual production code path, where > the source object is dynamically created through a DataSource object every > time a query is started. So all the fault-tolerance testing in > FileSourceSuite and FileSourceStressSuite is not really testing the actual > code path, as they are just reusing the FileStreamSource object. > Instead of maintaining a mapping of source --> expected offset in StreamTest > (which requires reuse of the source object), it should maintain a mapping of > source index --> offset, so that it is independent of the source object.
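The bookkeeping change the issue proposes — keying expected offsets by source *index* rather than by source object — can be sketched in a few lines. This is an illustrative Python analogy with hypothetical names, not the actual Scala StreamTest code:

```python
# Sketch: track expected offsets by source index instead of by source
# object, so the check survives the source being re-created on restart.
class OffsetTracker:
    def __init__(self):
        self.expected = {}  # source index -> expected offset

    def expect(self, source_index, offset):
        self.expected[source_index] = offset

    def check(self, source_index, actual_offset):
        return self.expected.get(source_index) == actual_offset

tracker = OffsetTracker()
tracker.expect(0, 5)        # source #0 should reach offset 5
# ...query restarts: a brand-new source object now backs index 0...
print(tracker.check(0, 5))  # True - the check ignores object identity
```

Because the mapping never holds a reference to the source itself, the test harness can let the query re-create sources through DataSource on every start, exercising the real code path.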
[jira] [Created] (SPARK-14882) Programming Guide Improvements
Ben McCann created SPARK-14882: -- Summary: Programming Guide Improvements Key: SPARK-14882 URL: https://issues.apache.org/jira/browse/SPARK-14882 Project: Spark Issue Type: Improvement Components: Documentation Reporter: Ben McCann I'm reading http://spark.apache.org/docs/latest/programming-guide.html It says "Spark 1.6.1 uses Scala 2.10. To write applications in Scala, you will need to use a compatible Scala version (e.g. 2.10.X)." However, it doesn't seem to me that Scala 2.10 is required, because I see versions compiled for both 2.10 and 2.11 in Maven Central. There are also a few references to Tachyon that look like they should be changed to Alluxio.
[jira] [Updated] (SPARK-14838) Implement statistics in SerializeFromObject to avoid failure when estimating sizeInBytes for ObjectType
[ https://issues.apache.org/jira/browse/SPARK-14838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-14838: --- Assignee: Liang-Chi Hsieh > Implement statistics in SerializeFromObject to avoid failure when estimating > sizeInBytes for ObjectType > --- > > Key: SPARK-14838 > URL: https://issues.apache.org/jira/browse/SPARK-14838 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh > Fix For: 2.0.0 > > > Spark will determine the plan size to decide whether or not to automatically broadcast it when > doing a join. As it can't estimate an object type's size, this mechanism will fail, as shown in > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56533/consoleFull. > We should fix it.
[jira] [Resolved] (SPARK-14838) Implement statistics in SerializeFromObject to avoid failure when estimating sizeInBytes for ObjectType
[ https://issues.apache.org/jira/browse/SPARK-14838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-14838. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12599 [https://github.com/apache/spark/pull/12599] > Implement statistics in SerializeFromObject to avoid failure when estimating > sizeInBytes for ObjectType > --- > > Key: SPARK-14838 > URL: https://issues.apache.org/jira/browse/SPARK-14838 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Liang-Chi Hsieh > Fix For: 2.0.0 > > > Spark will determine the plan size to decide whether or not to automatically broadcast it when > doing a join. As it can't estimate an object type's size, this mechanism will fail, as shown in > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56533/consoleFull. > We should fix it.
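The failure mode described above is the classic "estimator must be total" problem: a size estimator that throws on types it cannot handle breaks the join-planning machinery that calls it. A defensive sketch of the fix idea in plain Python (hypothetical names and a made-up fallback constant, not Spark's actual statistics code):

```python
import sys

DEFAULT_OBJECT_SIZE = 8  # assumed conservative fallback (illustrative)

def raw_estimate(value):
    """Raises for types it cannot size - analogous to the failing estimator."""
    if isinstance(value, (int, float)):
        return sys.getsizeof(value)
    if isinstance(value, str):
        return len(value.encode("utf-8"))
    raise TypeError(f"cannot estimate size of {type(value).__name__}")

def size_in_bytes(value):
    """Total estimator: never raises, falls back to a default for objects."""
    try:
        return raw_estimate(value)
    except TypeError:
        return DEFAULT_OBJECT_SIZE

print(size_in_bytes("spark"))   # 5
print(size_in_bytes(object()))  # 8 (fallback instead of an exception)
```

The point of the fix is the second function: planning proceeds with a conservative estimate for ObjectType instead of aborting the query.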
[jira] [Updated] (SPARK-14874) Remove the obsolete Batch representation
[ https://issues.apache.org/jira/browse/SPARK-14874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liwei Lin updated SPARK-14874: -- Summary: Remove the obsolete Batch representation (was: Cleanup the useless Batch class) > Remove the obsolete Batch representation > > > Key: SPARK-14874 > URL: https://issues.apache.org/jira/browse/SPARK-14874 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Liwei Lin >Priority: Minor > > The Batch class, which had been used to indicate progress in a stream, was > abandoned by SPARK-13985 and then became useless. > Let's: > - remove the Batch class > - rename getBatch(...) to getData(...) for Source > - rename addBatch(...) to addData(...) for Sink
[jira] [Updated] (SPARK-14591) Remove org.apache.spark.sql.catalyst.parser.DataTypeParser
[ https://issues.apache.org/jira/browse/SPARK-14591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-14591: - Description: Since our parser defined based on antlr 4 can parse data type. We can remove org.apache.spark.sql.catalyst.parser.DataTypeParser. Let's make sure the new parser's functionality is a super set of DataTypeParser. Then, we can remove DataTypeParser. For the object DataTypeParser, we can keep it and let it just call the parserDataType method of CatalystSqlParser. *The original description is shown below* Right now, our DDLParser does not support {{decimal(precision)}} (the scale will be set to 0). We should support it. was: Since our parser defined based on antlr 4 can parse data type. We can remove org.apache.spark.sql.catalyst.parser.DataTypeParser. Let's make sure the new parser's functionality is a super set of DataTypeParser. Then, we can remove DataTypeParser. *The original description is shown below* Right now, our DDLParser does not support {{decimal(precision)}} (the scale will be set to 0). We should support it. > Remove org.apache.spark.sql.catalyst.parser.DataTypeParser > -- > > Key: SPARK-14591 > URL: https://issues.apache.org/jira/browse/SPARK-14591 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai > > Since our parser defined based on antlr 4 can parse data type. We can remove > org.apache.spark.sql.catalyst.parser.DataTypeParser. Let's make sure the new > parser's functionality is a super set of DataTypeParser. Then, we can remove > DataTypeParser. For the object DataTypeParser, we can keep it and let it just > call the parserDataType method of CatalystSqlParser. > *The original description is shown below* > Right now, our DDLParser does not support {{decimal(precision)}} (the scale > will be set to 0). We should support it. 
[jira] [Updated] (SPARK-14591) Remove org.apache.spark.sql.catalyst.parser.DataTypeParser
[ https://issues.apache.org/jira/browse/SPARK-14591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-14591: - Description: Since our parser defined based on antlr 4 can parse data type (see CatalystSqlParser), we can remove org.apache.spark.sql.catalyst.parser.DataTypeParser. Let's make sure the new parser's functionality is a super set of DataTypeParser. Then, we can remove DataTypeParser. For the object DataTypeParser, we can keep it and let it just call the parserDataType method of CatalystSqlParser. *The original description is shown below* Right now, our DDLParser does not support {{decimal(precision)}} (the scale will be set to 0). We should support it. was: Since our parser defined based on antlr 4 can parse data type. We can remove org.apache.spark.sql.catalyst.parser.DataTypeParser. Let's make sure the new parser's functionality is a super set of DataTypeParser. Then, we can remove DataTypeParser. For the object DataTypeParser, we can keep it and let it just call the parserDataType method of CatalystSqlParser. *The original description is shown below* Right now, our DDLParser does not support {{decimal(precision)}} (the scale will be set to 0). We should support it. > Remove org.apache.spark.sql.catalyst.parser.DataTypeParser > -- > > Key: SPARK-14591 > URL: https://issues.apache.org/jira/browse/SPARK-14591 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai > > Since our parser defined based on antlr 4 can parse data type (see > CatalystSqlParser), we can remove > org.apache.spark.sql.catalyst.parser.DataTypeParser. Let's make sure the new > parser's functionality is a super set of DataTypeParser. Then, we can remove > DataTypeParser. For the object DataTypeParser, we can keep it and let it just > call the parserDataType method of CatalystSqlParser. 
> *The original description is shown below* > Right now, our DDLParser does not support {{decimal(precision)}} (the scale > will be set to 0). We should support it.
[jira] [Updated] (SPARK-14591) Remove org.apache.spark.sql.catalyst.parser.DataTypeParser
[ https://issues.apache.org/jira/browse/SPARK-14591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-14591: - Priority: Major (was: Blocker) > Remove org.apache.spark.sql.catalyst.parser.DataTypeParser > -- > > Key: SPARK-14591 > URL: https://issues.apache.org/jira/browse/SPARK-14591 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai > > Since our parser defined based on antlr 4 can parse data type. We can remove > org.apache.spark.sql.catalyst.parser.DataTypeParser. Let's make sure the new > parser's functionality is a super set of DataTypeParser. Then, we can remove > DataTypeParser. > *The original description is shown below* > Right now, our DDLParser does not support {{decimal(precision)}} (the scale > will be set to 0). We should support it.
[jira] [Updated] (SPARK-14591) Remove org.apache.spark.sql.catalyst.parser.DataTypeParser
[ https://issues.apache.org/jira/browse/SPARK-14591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-14591: - Issue Type: Sub-task (was: Bug) Parent: SPARK-14776 > Remove org.apache.spark.sql.catalyst.parser.DataTypeParser > -- > > Key: SPARK-14591 > URL: https://issues.apache.org/jira/browse/SPARK-14591 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Priority: Blocker > > Since our parser defined based on antlr 4 can parse data type. We can remove > org.apache.spark.sql.catalyst.parser.DataTypeParser. Let's make sure the new > parser's functionality is a super set of DataTypeParser. Then, we can remove > DataTypeParser. > *The original description is shown below* > Right now, our DDLParser does not support {{decimal(precision)}} (the scale > will be set to 0). We should support it.
[jira] [Updated] (SPARK-14591) Remove org.apache.spark.sql.catalyst.parser.DataTypeParser
[ https://issues.apache.org/jira/browse/SPARK-14591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-14591: - Description: Since our parser defined based on antlr 4 can parse data type. We can remove org.apache.spark.sql.catalyst.parser.DataTypeParser. Let's make sure the new parser's functionality is a super set of DataTypeParser. Then, we can remove DataTypeParser. *The original description is shown below* Right now, our DDLParser does not support {{decimal(precision)}} (the scale will be set to 0). We should support it. was: Since our parser defined based on antlr 4 can parse data type. We will not need to have Right now, our DDLParser does not support {{decimal(precision)}} (the scale will be set to 0). We should support it. > Remove org.apache.spark.sql.catalyst.parser.DataTypeParser > -- > > Key: SPARK-14591 > URL: https://issues.apache.org/jira/browse/SPARK-14591 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Priority: Blocker > > Since our parser defined based on antlr 4 can parse data type. We can remove > org.apache.spark.sql.catalyst.parser.DataTypeParser. Let's make sure the new > parser's functionality is a super set of DataTypeParser. Then, we can remove > DataTypeParser. > *The original description is shown below* > Right now, our DDLParser does not support {{decimal(precision)}} (the scale > will be set to 0). We should support it.
[jira] [Updated] (SPARK-14591) Remove org.apache.spark.sql.catalyst.parser.DataTypeParser
[ https://issues.apache.org/jira/browse/SPARK-14591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-14591: - Summary: Remove org.apache.spark.sql.catalyst.parser.DataTypeParser (was: DDLParser should accept decimal(precision)) > Remove org.apache.spark.sql.catalyst.parser.DataTypeParser > -- > > Key: SPARK-14591 > URL: https://issues.apache.org/jira/browse/SPARK-14591 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Priority: Blocker > > Right now, our DDLParser does not support {{decimal(precision)}} (the scale > will be set to 0). We should support it.
[jira] [Updated] (SPARK-14591) Remove org.apache.spark.sql.catalyst.parser.DataTypeParser
[ https://issues.apache.org/jira/browse/SPARK-14591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-14591: - Description: Since our parser defined based on antlr 4 can parse data type. We will not need to have Right now, our DDLParser does not support {{decimal(precision)}} (the scale will be set to 0). We should support it. was: Since our parser defined based on antlr 4 can parse data type. We will not need to hav e Right now, our DDLParser does not support {{decimal(precision)}} (the scale will be set to 0). We should support it. > Remove org.apache.spark.sql.catalyst.parser.DataTypeParser > -- > > Key: SPARK-14591 > URL: https://issues.apache.org/jira/browse/SPARK-14591 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Priority: Blocker > > Since our parser defined based on antlr 4 can parse data type. We will not > need to have > Right now, our DDLParser does not support {{decimal(precision)}} (the scale > will be set to 0). We should support it.
[jira] [Updated] (SPARK-14591) Remove org.apache.spark.sql.catalyst.parser.DataTypeParser
[ https://issues.apache.org/jira/browse/SPARK-14591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-14591: - Description: Since our parser defined based on antlr 4 can parse data type. We will not need to hav e Right now, our DDLParser does not support {{decimal(precision)}} (the scale will be set to 0). We should support it. was:Right now, our DDLParser does not support {{decimal(precision)}} (the scale will be set to 0). We should support it. > Remove org.apache.spark.sql.catalyst.parser.DataTypeParser > -- > > Key: SPARK-14591 > URL: https://issues.apache.org/jira/browse/SPARK-14591 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Priority: Blocker > > Since our parser defined based on antlr 4 can parse data type. We will not > need to hav e > Right now, our DDLParser does not support {{decimal(precision)}} (the scale > will be set to 0). We should support it.
[jira] [Updated] (SPARK-14591) DDLParser should accept decimal(precision)
[ https://issues.apache.org/jira/browse/SPARK-14591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-14591: - Priority: Blocker (was: Major) > DDLParser should accept decimal(precision) > -- > > Key: SPARK-14591 > URL: https://issues.apache.org/jira/browse/SPARK-14591 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Priority: Blocker > > Right now, our DDLParser does not support {{decimal(precision)}} (the scale > will be set to 0). We should support it.
[jira] [Commented] (SPARK-14591) DDLParser should accept decimal(precision)
[ https://issues.apache.org/jira/browse/SPARK-14591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255430#comment-15255430 ] Yin Huai commented on SPARK-14591: -- [~hvanhovell] Where do we define those reserved keywords? > DDLParser should accept decimal(precision) > -- > > Key: SPARK-14591 > URL: https://issues.apache.org/jira/browse/SPARK-14591 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai > > Right now, our DDLParser does not support {{decimal(precision)}} (the scale > will be set to 0). We should support it.
[jira] [Commented] (SPARK-4298) The spark-submit cannot read Main-Class from Manifest.
[ https://issues.apache.org/jira/browse/SPARK-4298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255429#comment-15255429 ] Jimit Kamlesh Raithatha commented on SPARK-4298: Brennon / Friends, Any chance this issue is open in Spark 1.5.2 (for Hadoop 2.4)? I still had to launch my program as follows: ./spark-submit --class problem1 /a/b/c/def.jar My MANIFEST is a simple 2-liner: Manifest-Version: 1.0 Main-Class: problem1 > The spark-submit cannot read Main-Class from Manifest. > -- > > Key: SPARK-4298 > URL: https://issues.apache.org/jira/browse/SPARK-4298 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 > Environment: Linux > spark-1.1.0-bin-hadoop2.4.tgz > java version "1.7.0_72" > Java(TM) SE Runtime Environment (build 1.7.0_72-b14) > Java HotSpot(TM) 64-Bit Server VM (build 24.72-b04, mixed mode) >Reporter: Milan Straka >Assignee: Brennon York > Fix For: 1.0.3, 1.1.2, 1.2.1, 1.3.0 > > > Consider trivial {{test.scala}}: > {code:title=test.scala|borderStyle=solid} > import org.apache.spark.SparkContext > import org.apache.spark.SparkContext._ > object Main { > def main(args: Array[String]) { > val sc = new SparkContext() > sc.stop() > } > } > {code} > When built with {{sbt}} and executed using {{spark-submit > target/scala-2.10/test_2.10-1.0.jar}}, I get the following error: > {code} > Spark assembly has been built with Hive, including Datanucleus jars on > classpath > Error: Cannot load main class from JAR: > file:/ha/home/straka/s/target/scala-2.10/test_2.10-1.0.jar > Run with --help for usage help or --verbose for debug output > {code} > When executed using {{spark-submit --class Main > target/scala-2.10/test_2.10-1.0.jar}}, it works. 
> The jar file has correct MANIFEST.MF: > {code:title=MANIFEST.MF|borderStyle=solid} > Manifest-Version: 1.0 > Implementation-Vendor: test > Implementation-Title: test > Implementation-Version: 1.0 > Implementation-Vendor-Id: test > Specification-Vendor: test > Specification-Title: test > Specification-Version: 1.0 > Main-Class: Main > {code} > The problem is that in {{org.apache.spark.deploy.SparkSubmitArguments}}, line > 127: > {code} > val jar = new JarFile(primaryResource) > {code} > the primaryResource has String value > {{"file:/ha/home/straka/s/target/scala-2.10/test_2.10-1.0.jar"}}, which is > URI, but JarFile can use only Path. One way to fix this would be using > {code} > val uri = new URI(primaryResource) > val jar = new JarFile(uri.getPath) > {code}
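The root cause quoted above — passing a {{file:}} URI string where a plain filesystem path is expected — is easy to reproduce outside Spark. In Python, the same distinction the suggested {{new URI(primaryResource).getPath}} fix relies on looks like this (an illustrative analogy, not the Spark code):

```python
from urllib.parse import urlparse

primary_resource = "file:/ha/home/straka/s/target/scala-2.10/test_2.10-1.0.jar"

# Opening the raw URI string as a file fails: "file:" is not a directory.
# The fix mirrors the suggested Java change: extract the path component first.
path = urlparse(primary_resource).path
print(path)  # /ha/home/straka/s/target/scala-2.10/test_2.10-1.0.jar
```

Just as Java's JarFile wants a path rather than a URI, Python's open() would choke on the raw string; parsing the URI first recovers the path a file API can use.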
[jira] [Assigned] (SPARK-14881) pyspark and sparkR shell default log level should match spark-shell/Scala
[ https://issues.apache.org/jira/browse/SPARK-14881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14881: Assignee: Apache Spark > pyspark and sparkR shell default log level should match spark-shell/Scala > - > > Key: SPARK-14881 > URL: https://issues.apache.org/jira/browse/SPARK-14881 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Shell, SparkR >Affects Versions: 2.0.0 >Reporter: Felix Cheung >Assignee: Apache Spark >Priority: Minor > > Scala spark-shell defaults to log level WARN. pyspark and sparkR should match > that by default (user can change it later) > # ./bin/spark-shell > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel).
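The behavior requested here — default to WARN but let the user override it later — is a common shell-logging pattern. A rough analogy in plain Python's logging module (not Spark's log4j machinery; the function name is hypothetical):

```python
import logging

def init_shell_logger(user_level=None):
    """Default the shell logger to WARN, matching the Scala spark-shell;
    the user may still change it later, as sc.setLogLevel(newLevel) allows."""
    logger = logging.getLogger("shell")
    logger.setLevel(user_level or logging.WARNING)
    return logger

log = init_shell_logger()
print(log.isEnabledFor(logging.INFO))     # False - INFO is suppressed by default
print(log.isEnabledFor(logging.WARNING))  # True
log.setLevel(logging.INFO)                # user override, like sc.setLogLevel("INFO")
print(log.isEnabledFor(logging.INFO))     # True
```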
[jira] [Commented] (SPARK-14881) pyspark and sparkR shell default log level should match spark-shell/Scala
[ https://issues.apache.org/jira/browse/SPARK-14881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255427#comment-15255427 ] Apache Spark commented on SPARK-14881: -- User 'felixcheung' has created a pull request for this issue: https://github.com/apache/spark/pull/12648 > pyspark and sparkR shell default log level should match spark-shell/Scala > - > > Key: SPARK-14881 > URL: https://issues.apache.org/jira/browse/SPARK-14881 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Shell, SparkR >Affects Versions: 2.0.0 >Reporter: Felix Cheung >Priority: Minor > > Scala spark-shell defaults to log level WARN. pyspark and sparkR should match > that by default (user can change it later) > # ./bin/spark-shell > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel).
[jira] [Assigned] (SPARK-14881) pyspark and sparkR shell default log level should match spark-shell/Scala
[ https://issues.apache.org/jira/browse/SPARK-14881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14881: Assignee: (was: Apache Spark) > pyspark and sparkR shell default log level should match spark-shell/Scala > - > > Key: SPARK-14881 > URL: https://issues.apache.org/jira/browse/SPARK-14881 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Shell, SparkR >Affects Versions: 2.0.0 >Reporter: Felix Cheung >Priority: Minor > > Scala spark-shell defaults to log level WARN. pyspark and sparkR should match > that by default (user can change it later) > # ./bin/spark-shell > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel).
[jira] [Updated] (SPARK-14881) pyspark and sparkR shell default log level should match spark-shell/Scala
[ https://issues.apache.org/jira/browse/SPARK-14881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-14881: - Summary: pyspark and sparkR shell default log level should match spark-shell/Scala (was: PySpark and sparkR shell default log level should match spark-shell/Scala) > pyspark and sparkR shell default log level should match spark-shell/Scala > - > > Key: SPARK-14881 > URL: https://issues.apache.org/jira/browse/SPARK-14881 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Shell, SparkR >Affects Versions: 2.0.0 >Reporter: Felix Cheung >Priority: Minor > > Scala spark-shell defaults to log level WARN. pyspark and sparkR should match > that by default (user can change it later) > # ./bin/spark-shell > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel).
[jira] [Updated] (SPARK-14881) PySpark and sparkR shell default log level should match spark-shell/Scala
[ https://issues.apache.org/jira/browse/SPARK-14881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-14881: - Description: Scala spark-shell defaults to log level WARN. pyspark and sparkR should match that by default (user can change it later) # ./bin/spark-shell Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). > PySpark and sparkR shell default log level should match spark-shell/Scala > - > > Key: SPARK-14881 > URL: https://issues.apache.org/jira/browse/SPARK-14881 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Shell, SparkR >Affects Versions: 2.0.0 >Reporter: Felix Cheung >Priority: Minor > > Scala spark-shell defaults to log level WARN. pyspark and sparkR should match > that by default (user can change it later) > # ./bin/spark-shell > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel).
[jira] [Created] (SPARK-14881) PySpark and sparkR shell default log level should match spark-shell/Scala
Felix Cheung created SPARK-14881: Summary: PySpark and sparkR shell default log level should match spark-shell/Scala Key: SPARK-14881 URL: https://issues.apache.org/jira/browse/SPARK-14881 Project: Spark Issue Type: Bug Components: PySpark, Spark Shell, SparkR Affects Versions: 2.0.0 Reporter: Felix Cheung Priority: Minor
[jira] [Commented] (SPARK-13831) TPC-DS Query 35 fails with the following compile error
[ https://issues.apache.org/jira/browse/SPARK-13831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255426#comment-15255426 ] Roy Cecil commented on SPARK-13831: --- @Herman, thanks. I am validating the fix. > TPC-DS Query 35 fails with the following compile error > -- > > Key: SPARK-13831 > URL: https://issues.apache.org/jira/browse/SPARK-13831 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Roy Cecil >Assignee: Herman van Hovell > Fix For: 2.0.0 > > > TPC-DS Query 35 fails with the following compile error. > Scala.NotImplementedError: > scala.NotImplementedError: No parse rules for ASTNode type: 864, text: > TOK_SUBQUERY_EXPR : > TOK_SUBQUERY_EXPR 1, 439,797, 1370 > TOK_SUBQUERY_OP 1, 439,439, 1370 > exists 1, 439,439, 1370 > TOK_QUERY 1, 441,797, 1508 > Pasting Query 35 for easy reference. > select > ca_state, > cd_gender, > cd_marital_status, > cd_dep_count, > count(*) cnt1, > min(cd_dep_count) cd_dep_count1, > max(cd_dep_count) cd_dep_count2, > avg(cd_dep_count) cd_dep_count3, > cd_dep_employed_count, > count(*) cnt2, > min(cd_dep_employed_count) cd_dep_employed_count1, > max(cd_dep_employed_count) cd_dep_employed_count2, > avg(cd_dep_employed_count) cd_dep_employed_count3, > cd_dep_college_count, > count(*) cnt3, > min(cd_dep_college_count) cd_dep_college_count1, > max(cd_dep_college_count) cd_dep_college_count2, > avg(cd_dep_college_count) cd_dep_college_count3 > from > customer c > JOIN customer_address ca ON c.c_current_addr_sk = ca.ca_address_sk > JOIN customer_demographics ON cd_demo_sk = c.c_current_cdemo_sk > LEFT SEMI JOIN > (select ss_customer_sk > from store_sales >JOIN date_dim ON ss_sold_date_sk = d_date_sk > where > d_year = 2002 and > d_qoy < 4) ss_wh1 > ON c.c_customer_sk = ss_wh1.ss_customer_sk > where >exists ( > select tmp.customer_sk from ( > select ws_bill_customer_sk as customer_sk > from web_sales,date_dim > where > ws_sold_date_sk = d_date_sk and > d_year = 2002 and > d_qoy < 4 >UNION ALL > select 
cs_ship_customer_sk as customer_sk > from catalog_sales,date_dim > where > cs_sold_date_sk = d_date_sk and > d_year = 2002 and > d_qoy < 4 > ) tmp where c.c_customer_sk = tmp.customer_sk > ) > group by ca_state, > cd_gender, > cd_marital_status, > cd_dep_count, > cd_dep_employed_count, > cd_dep_college_count > order by ca_state, > cd_gender, > cd_marital_status, > cd_dep_count, > cd_dep_employed_count, > cd_dep_college_count > limit 100;
[jira] [Commented] (SPARK-12148) SparkR: rename DataFrame to SparkDataFrame
[ https://issues.apache.org/jira/browse/SPARK-12148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255424#comment-15255424 ] Apache Spark commented on SPARK-12148: -- User 'felixcheung' has created a pull request for this issue: https://github.com/apache/spark/pull/12647 > SparkR: rename DataFrame to SparkDataFrame > -- > > Key: SPARK-12148 > URL: https://issues.apache.org/jira/browse/SPARK-12148 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Michael Lawrence >Assignee: Felix Cheung > Fix For: 2.0.0 > > > The SparkR package represents a Spark DataFrame with the class "DataFrame". > That conflicts with the more general DataFrame class defined in the S4Vectors > package. Would it not be more appropriate to use the name "SparkDataFrame" > instead? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14880) Parallel Gradient Descent with less map-reduce shuffle overhead
[ https://issues.apache.org/jira/browse/SPARK-14880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ahmed Mahran updated SPARK-14880: - Description: The current implementation of (Stochastic) Gradient Descent performs one map-reduce shuffle per iteration. Moreover, when the sampling fraction gets smaller, the algorithm becomes shuffle-bound instead of CPU-bound. {code} (1 to numIterations or convergence) { rdd .sample(fraction) .map(Gradient) .reduce(Update) } {code} A more performant variation requires only one map-reduce regardless from the number of iterations. A local mini-batch SGD could be run on each partition, then the results could be averaged. This is based on (Zinkevich, Martin, Markus Weimer, Lihong Li, and Alex J. Smola. "Parallelized stochastic gradient descent." In Advances in neural information processing systems, 2010, http://www.research.rutgers.edu/~lihong/pub/Zinkevich11Parallelized.pdf). {code} rdd .shuffle() .mapPartitions((1 to numIterations or convergence) { iter.sample(fraction).map(Gradient).reduce(Update) }) .reduce(Average) {code} A higher level iteration could enclose the above variation; shuffling the data before the local mini-batches and feeding back the average weights from the last iteration. This allows more variability in the sampling of the mini-batches with the possibility to cover the whole dataset. Here is a Spark based implementation https://github.com/mashin-io/rich-spark/blob/master/src/main/scala/org/apache/spark/mllib/optimization/ParallelSGD.scala {code} (1 to numIterations1 or convergence) { rdd .shuffle() .mapPartitions((1 to numIterations2 or convergence) { iter.sample(fraction).map(Gradient).reduce(Update) }) .reduce(Average) } {code} was: The current implementation of (Stochastic) Gradient Descent performs one map-reduce shuffle per iteration. Moreover, when the sampling fraction gets smaller, the algorithm becomes shuffle-bound instead of CPU-bound. 
(1 to numIterations or convergence) { rdd .sample(fraction) .map(Gradient) .reduce(Update) } A more performant variation requires only one map-reduce regardless from the number of iterations. A local mini-batch SGD could be run on each partition, then the results could be averaged. This is based on (Zinkevich, Martin, Markus Weimer, Lihong Li, and Alex J. Smola. "Parallelized stochastic gradient descent." In Advances in neural information processing systems, 2010, http://www.research.rutgers.edu/~lihong/pub/Zinkevich11Parallelized.pdf). rdd .shuffle() .mapPartitions((1 to numIterations or convergence) { iter.sample(fraction).map(Gradient).reduce(Update) }) .reduce(Average) A higher level iteration could enclose the above variation; shuffling the data before the local mini-batches and feeding back the average weights from the last iteration. This allows more variability in the sampling of the mini-batches with the possibility to cover the whole dataset. Here is a Spark based implementation https://github.com/mashin-io/rich-spark/blob/master/src/main/scala/org/apache/spark/mllib/optimization/ParallelSGD.scala (1 to numIterations1 or convergence) { rdd .shuffle() .mapPartitions((1 to numIterations2 or convergence) { iter.sample(fraction).map(Gradient).reduce(Update) }) .reduce(Average) } > Parallel Gradient Descent with less map-reduce shuffle overhead > --- > > Key: SPARK-14880 > URL: https://issues.apache.org/jira/browse/SPARK-14880 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Ahmed Mahran > Labels: performance > > The current implementation of (Stochastic) Gradient Descent performs one > map-reduce shuffle per iteration. Moreover, when the sampling fraction gets > smaller, the algorithm becomes shuffle-bound instead of CPU-bound. 
> {code} > (1 to numIterations or convergence) { > rdd > .sample(fraction) > .map(Gradient) > .reduce(Update) > } > {code} > A more performant variation requires only one map-reduce regardless from the > number of iterations. A local mini-batch SGD could be run on each partition, > then the results could be averaged. This is based on (Zinkevich, Martin, > Markus Weimer, Lihong Li, and Alex J. Smola. "Parallelized stochastic > gradient descent." In Advances in neural information processing systems, > 2010, > http://www.research.rutgers.edu/~lihong/pub/Zinkevich11Parallelized.pdf). > {code} > rdd > .shuffle() > .mapPartitions((1 to numIterations or convergence) { >iter.sample(fraction).map(Gradient).reduce(Update) > }) > .reduce(Average) > {code} > A higher level iteration could enclose the above variation; shuffling the > data before the local mini-batches and feeding back the average weights from > the last iteration. This allows
> more variability in the sampling of the mini-batches with the possibility to cover the whole dataset.
[jira] [Created] (SPARK-14880) Parallel Gradient Descent with less map-reduce shuffle overhead
Ahmed Mahran created SPARK-14880: Summary: Parallel Gradient Descent with less map-reduce shuffle overhead Key: SPARK-14880 URL: https://issues.apache.org/jira/browse/SPARK-14880 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Ahmed Mahran The current implementation of (Stochastic) Gradient Descent performs one map-reduce shuffle per iteration. Moreover, when the sampling fraction gets smaller, the algorithm becomes shuffle-bound instead of CPU-bound. (1 to numIterations or convergence) { rdd .sample(fraction) .map(Gradient) .reduce(Update) } A more performant variation requires only one map-reduce regardless from the number of iterations. A local mini-batch SGD could be run on each partition, then the results could be averaged. This is based on (Zinkevich, Martin, Markus Weimer, Lihong Li, and Alex J. Smola. "Parallelized stochastic gradient descent." In Advances in neural information processing systems, 2010, http://www.research.rutgers.edu/~lihong/pub/Zinkevich11Parallelized.pdf). rdd .shuffle() .mapPartitions((1 to numIterations or convergence) { iter.sample(fraction).map(Gradient).reduce(Update) }) .reduce(Average) A higher level iteration could enclose the above variation; shuffling the data before the local mini-batches and feeding back the average weights from the last iteration. This allows more variability in the sampling of the mini-batches with the possibility to cover the whole dataset. Here is a Spark based implementation https://github.com/mashin-io/rich-spark/blob/master/src/main/scala/org/apache/spark/mllib/optimization/ParallelSGD.scala (1 to numIterations1 or convergence) { rdd .shuffle() .mapPartitions((1 to numIterations2 or convergence) { iter.sample(fraction).map(Gradient).reduce(Update) }) .reduce(Average) } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
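The proposed scheme — one shuffle, local mini-batch SGD inside each partition, then model averaging, per Zinkevich et al. — can be sketched outside Spark with plain Python lists standing in for partitions. The gradient, learning rate, and data below are illustrative choices, not taken from the linked implementation:

```python
import random

random.seed(0)

def gradient(w, point):
    # gradient of squared loss for a 1-D linear model y ~ w * x
    x, y = point
    return 2.0 * (w * x - y) * x

def local_sgd(w, partition, iterations, fraction, lr=0.01):
    # mini-batch SGD run entirely inside one partition: no shuffle per step
    for _ in range(iterations):
        for p in (q for q in partition if random.random() < fraction):
            w -= lr * gradient(w, p)
    return w

def parallel_sgd(data, num_partitions=4, iterations=50, fraction=0.5, w0=0.0):
    random.shuffle(data)                    # one shuffle, not one per iteration
    size = -(-len(data) // num_partitions)  # ceiling division
    partitions = [data[i:i + size] for i in range(0, len(data), size)]
    # train independently per partition, then average the resulting models
    weights = [local_sgd(w0, part, iterations, fraction) for part in partitions]
    return sum(weights) / len(weights)

# points drawn from y = 3x; the averaged model should recover w close to 3
data = [(0.1 * i, 0.3 * i) for i in range(1, 41)]
w = parallel_sgd(data)
print(round(w, 2))
```

This makes the trade-off concrete: the per-iteration `.sample().map().reduce()` round trip is replaced by a single partitioning step plus one final average, at the cost of each partition only ever seeing its own slice of the data — which is exactly what the outer re-shuffling iteration in the proposal is meant to compensate for.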
[jira] [Assigned] (SPARK-14878) Support Trim characters in the string trim function
[ https://issues.apache.org/jira/browse/SPARK-14878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14878: Assignee: Apache Spark > Support Trim characters in the string trim function > --- > > Key: SPARK-14878 > URL: https://issues.apache.org/jira/browse/SPARK-14878 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: kevin yu >Assignee: Apache Spark > > The current Spark SQL does not support the trim characters in the string trim > function, which is part of ANSI SQL2003’s standard. For example, IBM DB2 > fully supports it as shown in the > https://www.ibm.com/support/knowledgecenter/SS6NHC/com.ibm.swg.im.dashdb.sql.ref.doc/doc/r0023198.html. > We propose to implement it in this JIRA.. > The ANSI SQL2003's trim Syntax: > SQL > ::= TRIM > ::= [ [ ] [ ] FROM ] > > ::= > ::= > LEADING > | TRAILING > | BOTH > ::= -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14878) Support Trim characters in the string trim function
[ https://issues.apache.org/jira/browse/SPARK-14878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255420#comment-15255420 ] Apache Spark commented on SPARK-14878: -- User 'kevinyu98' has created a pull request for this issue: https://github.com/apache/spark/pull/12646 > Support Trim characters in the string trim function > --- > > Key: SPARK-14878 > URL: https://issues.apache.org/jira/browse/SPARK-14878 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: kevin yu > > The current Spark SQL does not support the trim characters in the string trim > function, which is part of ANSI SQL2003’s standard. For example, IBM DB2 > fully supports it as shown in the > https://www.ibm.com/support/knowledgecenter/SS6NHC/com.ibm.swg.im.dashdb.sql.ref.doc/doc/r0023198.html. > We propose to implement it in this JIRA.. > The ANSI SQL2003's trim Syntax: > SQL > ::= TRIM > ::= [ [ ] [ ] FROM ] > > ::= > ::= > LEADING > | TRAILING > | BOTH > ::= -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14878) Support Trim characters in the string trim function
[ https://issues.apache.org/jira/browse/SPARK-14878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14878: Assignee: (was: Apache Spark) > Support Trim characters in the string trim function > --- > > Key: SPARK-14878 > URL: https://issues.apache.org/jira/browse/SPARK-14878 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: kevin yu > > The current Spark SQL does not support the trim characters in the string trim > function, which is part of ANSI SQL2003’s standard. For example, IBM DB2 > fully supports it as shown in the > https://www.ibm.com/support/knowledgecenter/SS6NHC/com.ibm.swg.im.dashdb.sql.ref.doc/doc/r0023198.html. > We propose to implement it in this JIRA.. > The ANSI SQL2003's trim Syntax: > SQL > ::= TRIM > ::= [ [ ] [ ] FROM ] > > ::= > ::= > LEADING > | TRAILING > | BOTH > ::= -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14877) Remove HiveMetastoreTypes class
[ https://issues.apache.org/jira/browse/SPARK-14877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-14877. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12644 [https://github.com/apache/spark/pull/12644] > Remove HiveMetastoreTypes class > --- > > Key: SPARK-14877 > URL: https://issues.apache.org/jira/browse/SPARK-14877 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.0.0 > > > It is unnecessary as DataType.catalogString largely replaces the need for > this class. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14879) Move CreateMetastoreDataSource and CreateMetastoreDataSourceAsSelect to sql/core
[ https://issues.apache.org/jira/browse/SPARK-14879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14879: Assignee: Yin Huai (was: Apache Spark) > Move CreateMetastoreDataSource and CreateMetastoreDataSourceAsSelect to > sql/core > > > Key: SPARK-14879 > URL: https://issues.apache.org/jira/browse/SPARK-14879 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Yin Huai > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14879) Move CreateMetastoreDataSource and CreateMetastoreDataSourceAsSelect to sql/core
[ https://issues.apache.org/jira/browse/SPARK-14879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14879: Assignee: Apache Spark (was: Yin Huai) > Move CreateMetastoreDataSource and CreateMetastoreDataSourceAsSelect to > sql/core > > > Key: SPARK-14879 > URL: https://issues.apache.org/jira/browse/SPARK-14879 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14879) Move CreateMetastoreDataSource and CreateMetastoreDataSourceAsSelect to sql/core
[ https://issues.apache.org/jira/browse/SPARK-14879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255403#comment-15255403 ] Apache Spark commented on SPARK-14879: -- User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/12645 > Move CreateMetastoreDataSource and CreateMetastoreDataSourceAsSelect to > sql/core > > > Key: SPARK-14879 > URL: https://issues.apache.org/jira/browse/SPARK-14879 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Yin Huai > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14879) Move CreateMetastoreDataSource and CreateMetastoreDataSourceAsSelect to sql/core
Yin Huai created SPARK-14879: Summary: Move CreateMetastoreDataSource and CreateMetastoreDataSourceAsSelect to sql/core Key: SPARK-14879 URL: https://issues.apache.org/jira/browse/SPARK-14879 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai Assignee: Yin Huai -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14878) Support Trim characters in the string trim function
kevin yu created SPARK-14878: Summary: Support Trim characters in the string trim function Key: SPARK-14878 URL: https://issues.apache.org/jira/browse/SPARK-14878 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.0.0 Reporter: kevin yu The current Spark SQL does not support the trim characters in the string trim function, which is part of ANSI SQL2003’s standard. For example, IBM DB2 fully supports it as shown in the https://www.ibm.com/support/knowledgecenter/SS6NHC/com.ibm.swg.im.dashdb.sql.ref.doc/doc/r0023198.html. We propose to implement it in this JIRA.. The ANSI SQL2003's trim Syntax: SQL ::= TRIM ::= [ [ ] [ ] FROM ] ::= ::= LEADING | TRAILING | BOTH ::= -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
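The requested semantics are easy to state by analogy: ANSI `TRIM([LEADING | TRAILING | BOTH] <char> FROM <string>)` behaves like Python's `str.strip` family when given a trim character. A quick illustration of the three modes (Python treats the argument as a character set, but for a single trim character the behavior coincides; this is a semantic sketch, not Spark code):

```python
s = "xxSpark SQLxx"

# TRIM(BOTH 'x' FROM s), TRIM(LEADING 'x' FROM s), TRIM(TRAILING 'x' FROM s)
both = s.strip("x")
leading = s.lstrip("x")
trailing = s.rstrip("x")

print(both, "|", leading, "|", trailing)
```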
[jira] [Assigned] (SPARK-14877) Remove HiveMetastoreTypes class
[ https://issues.apache.org/jira/browse/SPARK-14877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14877: Assignee: Reynold Xin (was: Apache Spark) > Remove HiveMetastoreTypes class > --- > > Key: SPARK-14877 > URL: https://issues.apache.org/jira/browse/SPARK-14877 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > It is unnecessary as DataType.catalogString largely replaces the need for > this class. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14877) Remove HiveMetastoreTypes class
[ https://issues.apache.org/jira/browse/SPARK-14877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255388#comment-15255388 ] Apache Spark commented on SPARK-14877: -- User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/12644 > Remove HiveMetastoreTypes class > --- > > Key: SPARK-14877 > URL: https://issues.apache.org/jira/browse/SPARK-14877 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > It is unnecessary as DataType.catalogString largely replaces the need for > this class. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14877) Remove HiveMetastoreTypes class
[ https://issues.apache.org/jira/browse/SPARK-14877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14877: Assignee: Apache Spark (was: Reynold Xin) > Remove HiveMetastoreTypes class > --- > > Key: SPARK-14877 > URL: https://issues.apache.org/jira/browse/SPARK-14877 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > > It is unnecessary as DataType.catalogString largely replaces the need for > this class. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14877) Remove HiveMetastoreTypes class
Reynold Xin created SPARK-14877: --- Summary: Remove HiveMetastoreTypes class Key: SPARK-14877 URL: https://issues.apache.org/jira/browse/SPARK-14877 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin It is unnecessary as DataType.catalogString largely replaces the need for this class. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14876) SparkSession should be case insensitive by default
[ https://issues.apache.org/jira/browse/SPARK-14876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14876: Assignee: Apache Spark (was: Reynold Xin) > SparkSession should be case insensitive by default > -- > > Key: SPARK-14876 > URL: https://issues.apache.org/jira/browse/SPARK-14876 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > > This would match most database systems. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14876) SparkSession should be case insensitive by default
[ https://issues.apache.org/jira/browse/SPARK-14876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255382#comment-15255382 ] Apache Spark commented on SPARK-14876: -- User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/12643 > SparkSession should be case insensitive by default > -- > > Key: SPARK-14876 > URL: https://issues.apache.org/jira/browse/SPARK-14876 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > This would match most database systems. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14876) SparkSession should be case insensitive by default
[ https://issues.apache.org/jira/browse/SPARK-14876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14876: Assignee: Reynold Xin (was: Apache Spark) > SparkSession should be case insensitive by default > -- > > Key: SPARK-14876 > URL: https://issues.apache.org/jira/browse/SPARK-14876 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > This would match most database systems. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14876) SparkSession should be case insensitive by default
Reynold Xin created SPARK-14876: --- Summary: SparkSession should be case insensitive by default Key: SPARK-14876 URL: https://issues.apache.org/jira/browse/SPARK-14876 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin This would match most database systems. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-14846) Driver process fails to terminate when graceful shutdown is used
[ https://issues.apache.org/jira/browse/SPARK-14846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255375#comment-15255375 ] Mattias Aspholm edited comment on SPARK-14846 at 4/23/16 8:23 PM: -- You're right of course. Sorry about that. I'm still having problems with the driver not closing down in graceful (even though there's no work left), but I realise now my initial conclusions was bad, the reason why it hangs in awaitTermination is that the termination condition is not signaled. I need to find out why that happens. Ok for me to close this bug as invalid. I'll file another one if it turns out to be some bug after all. was (Author: masph...@gmail.com): Yes, you're right of course. Sorry about that. I'm still having problems with the driver not closing down in graceful (even though there's no work left), but I realise now my initial conclusions was bad, the reason why it hangs in awaitTermination is that the termination condition is not signaled. I need to find out why that happens. Ok for me to close this bug as invalid. I'll file another one if it turns out to be some bug after all. > Driver process fails to terminate when graceful shutdown is used > > > Key: SPARK-14846 > URL: https://issues.apache.org/jira/browse/SPARK-14846 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.6.1 >Reporter: Mattias Aspholm > > During shutdown, the job scheduler in Streaming (JobScheduler.stop) spends > some time waiting for all queued work to complete. If graceful shutdown is > used, the time is 1 hour, for non-graceful shutdown it's 2 seconds. > The wait is implemented using the ThreadPoolExecutor.awaitTermination method > in java.util.concurrent. The problem is that instead of looping over the > method for the desired period of time, the wait period is passed in as the > timeout parameter to awaitTermination. 
> The result is that if the termination condition is false the first time, the > method will sleep for the timeout period before trying again. In the case of > graceful shutdown this means at least an hour's wait before the condition is > checked again, even though all work is completed in just a few seconds. The > driver process will continue to live during this time. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14865) When creating a view, we should verify both the input SQL and the generated SQL
[ https://issues.apache.org/jira/browse/SPARK-14865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-14865. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12633 [https://github.com/apache/spark/pull/12633] > When creating a view, we should verify both the input SQL and the generated > SQL > --- > > Key: SPARK-14865 > URL: https://issues.apache.org/jira/browse/SPARK-14865 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Reynold Xin >Priority: Critical > Fix For: 2.0.0 > > > Before the generate the SQL, we should make sure it is valid. > After we generate the SQL string for a create view command, we should verify > the string before putting it into metastore. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14846) Driver process fails to terminate when graceful shutdown is used
[ https://issues.apache.org/jira/browse/SPARK-14846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255375#comment-15255375 ] Mattias Aspholm commented on SPARK-14846: - Yes, you're right of course. Sorry about that. I'm still having problems with the driver not closing down in graceful (even though there's no work left), but I realise now my initial conclusions was bad, the reason why it hangs in awaitTermination is that the termination condition is not signaled. I need to find out why that happens. Ok for me to close this bug as invalid. I'll file another one if it turns out to be some bug after all. > Driver process fails to terminate when graceful shutdown is used > > > Key: SPARK-14846 > URL: https://issues.apache.org/jira/browse/SPARK-14846 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.6.1 >Reporter: Mattias Aspholm > > During shutdown, the job scheduler in Streaming (JobScheduler.stop) spends > some time waiting for all queued work to complete. If graceful shutdown is > used, the time is 1 hour, for non-graceful shutdown it's 2 seconds. > The wait is implemented using the ThreadPoolExecutor.awaitTermination method > in java.util.concurrent. The problem is that instead of looping over the > method for the desired period of time, the wait period is passed in as the > timeout parameter to awaitTermination. > The result is that if the termination condition is false the first time, the > method will sleep for the timeout period before trying again. In the case of > graceful shutdown this means at least an hour's wait before the condition is > checked again, even though all work is completed in just a few seconds. The > driver process will continue to live during this time. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
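The looping alternative the report has in mind — re-checking the termination condition in short slices instead of one long blocking wait — can be sketched generically. This is a Python stand-in, not the Spark code, and note the follow-up comment above: the reporter concluded the real problem was an unsignaled termination condition, since `awaitTermination` does return early once termination is signaled.

```python
import threading
import time

def await_termination(terminated: threading.Event, timeout_s: float,
                      slice_s: float = 0.05) -> bool:
    # Wait in short slices so the condition is re-checked many times within
    # the budget, rather than blocking once for the entire (possibly
    # one-hour) timeout.
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if terminated.wait(timeout=slice_s):
            return True
    return terminated.is_set()

terminated = threading.Event()
threading.Timer(0.1, terminated.set).start()  # work finishes after ~100 ms

start = time.monotonic()
ok = await_termination(terminated, timeout_s=3600.0)
elapsed = time.monotonic() - start
print(ok, elapsed < 2.0)  # returns shortly after the work completes
```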
[jira] [Commented] (SPARK-14654) New accumulator API
[ https://issues.apache.org/jira/browse/SPARK-14654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255363#comment-15255363 ] Reynold Xin commented on SPARK-14654: - I see. The merge function isn't supposed to be called by the end user. Adding type parameters are not free -- actually everything we add is not free. We need to consider how much gain it brings. In this case I think the gain is minimal, if any. > New accumulator API > --- > > Key: SPARK-14654 > URL: https://issues.apache.org/jira/browse/SPARK-14654 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > > The current accumulator API has a few problems: > 1. Its type hierarchy is very complicated, with Accumulator, Accumulable, > AccumulatorParam, AccumulableParam, etc. > 2. The intermediate buffer type must be the same as the output type, so there > is no way to define an accumulator that computes averages. > 3. It is very difficult to specialize the methods, leading to excessive > boxing and making accumulators bad for metrics that change for each record. > 4. There is not a single coherent API that works for both Java and Scala. > This is a proposed new API that addresses all of the above. In this new API: > 1. There is only a single class (Accumulator) that is user facing > 2. The intermediate value is stored in the accumulator itself and can be > different from the output type. > 3. Concrete implementations can provide its own specialized methods. > 4. Designed to work for both Java and Scala. > {code} > abstract class Accumulator[IN, OUT] extends Serializable { > def isRegistered: Boolean = ... > def register(metadata: AccumulatorMetadata): Unit = ... > def metadata: AccumulatorMetadata = ... 
> def reset(): Unit > def add(v: IN): Unit > def merge(other: Accumulator[IN, OUT]): Unit > def value: OUT > def localValue: OUT = value > final def registerAccumulatorOnExecutor(): Unit = { > // Automatically register the accumulator when it is deserialized with > the task closure. > // This is for external accumulators and internal ones that do not > represent task level > // metrics, e.g. internal SQL metrics, which are per-operator. > val taskContext = TaskContext.get() > if (taskContext != null) { > taskContext.registerAccumulator(this) > } > } > // Called by Java when deserializing an object > private def readObject(in: ObjectInputStream): Unit = > Utils.tryOrIOException { > in.defaultReadObject() > registerAccumulator() > } > } > {code} > Metadata, provided by Spark after registration: > {code} > class AccumulatorMetadata( > val id: Long, > val name: Option[String], > val countFailedValues: Boolean > ) extends Serializable > {code} > and an implementation that also offers specialized getters and setters > {code} > class LongAccumulator extends Accumulator[jl.Long, jl.Long] { > private[this] var _sum = 0L > override def reset(): Unit = _sum = 0L > override def add(v: jl.Long): Unit = { > _sum += v > } > override def merge(other: Accumulator[jl.Long, jl.Long]): Unit = other > match { > case o: LongAccumulator => _sum += o.sum > case _ => throw new UnsupportedOperationException( > s"Cannot merge ${this.getClass.getName} with ${other.getClass.getName}") > } > override def value: jl.Long = _sum > def sum: Long = _sum > } > {code} > and SparkContext... > {code} > class SparkContext { > ... > def newLongAccumulator(): LongAccumulator > def newLongAccumulator(name: Long): LongAccumulator > def newLongAccumulator(name: Long, dedup: Boolean): LongAccumulator > def registerAccumulator[IN, OUT](acc: Accumulator[IN, OUT]): > Accumulator[IN, OUT] > ... > } > {code} > To use it ... 
> {code} > val acc = sc.newLongAccumulator() > sc.parallelize(1 to 1000).map { i => > acc.add(1) > i > } > {code} > A work-in-progress prototype here: > https://github.com/rxin/spark/tree/accumulator-refactor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
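The central design point — intermediate state decoupled from both the `IN` and `OUT` types — is easy to illustrate outside Spark. Below is a hypothetical Python analogue (the names mirror the proposal but are not the actual API): an average accumulator buffers `(sum, count)` internally, which the old same-type `Accumulable` API could not express, and `merge` rejects incompatible accumulators at runtime:

```python
class Accumulator:
    # Minimal analogue of the proposed interface: add() takes IN values,
    # value produces OUT, merge() combines two accumulators of the same kind.
    def reset(self): raise NotImplementedError
    def add(self, v): raise NotImplementedError
    def merge(self, other): raise NotImplementedError
    @property
    def value(self): raise NotImplementedError

class AverageAccumulator(Accumulator):
    # IN = float, OUT = float, but the buffer is (sum, count) -- different
    # from both, which is point 2 of the proposal.
    def __init__(self):
        self._sum, self._count = 0.0, 0

    def reset(self):
        self._sum, self._count = 0.0, 0

    def add(self, v):
        self._sum += v
        self._count += 1

    def merge(self, other):
        if not isinstance(other, AverageAccumulator):
            raise TypeError("cannot merge incompatible accumulators")
        self._sum += other._sum
        self._count += other._count

    @property
    def value(self):
        return self._sum / self._count if self._count else 0.0

# one accumulator per "task", merged on the "driver"
a, b = AverageAccumulator(), AverageAccumulator()
for v in (1.0, 2.0):
    a.add(v)
for v in (3.0, 4.0):
    b.add(v)
a.merge(b)
print(a.value)  # 2.5
```

The runtime `isinstance` check in `merge` is the spot the comment thread is debating: an extra type parameter (e.g. a self-type `ACC`) would move that failure to compile time, at the cost of a more complex signature.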
[jira] [Resolved] (SPARK-14869) Don't mask exceptions in ResolveRelations
[ https://issues.apache.org/jira/browse/SPARK-14869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-14869. - Resolution: Fixed Fix Version/s: 2.0.0 > Don't mask exceptions in ResolveRelations > - > > Key: SPARK-14869 > URL: https://issues.apache.org/jira/browse/SPARK-14869 > Project: Spark > Issue Type: Improvement >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.0.0 > > > In order to support SPARK-11197 (run SQL directly on files), we added some > code in ResolveRelations to catch the exception thrown by > catalog.lookupRelation and ignore it. This unfortunately masks all the > exceptions. It should've been sufficient to simply test the table does not > exist. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14872) Restructure commands.scala
[ https://issues.apache.org/jira/browse/SPARK-14872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-14872. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12636 [https://github.com/apache/spark/pull/12636] > Restructure commands.scala > -- > > Key: SPARK-14872 > URL: https://issues.apache.org/jira/browse/SPARK-14872 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14871) Disable StatsReportListener to declutter output
[ https://issues.apache.org/jira/browse/SPARK-14871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-14871. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12635 [https://github.com/apache/spark/pull/12635] > Disable StatsReportListener to declutter output > --- > > Key: SPARK-14871 > URL: https://issues.apache.org/jira/browse/SPARK-14871 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.0.0 > > > Spark SQL inherited the use of StatsReportListener from Shark. Unfortunately > this clutters the spark-sql CLI output and makes it very difficult to read > the actual query results.
[jira] [Commented] (SPARK-14654) New accumulator API
[ https://issues.apache.org/jira/browse/SPARK-14654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255359#comment-15255359 ] holdenk commented on SPARK-14654: - So ACC isn't the internal buffer type; rather, it's the type of the Accumulator. This just replaces the runtime exception thrown when someone tries to merge two incompatible Accumulators with a compile-time check.
[jira] [Commented] (SPARK-14654) New accumulator API
[ https://issues.apache.org/jira/browse/SPARK-14654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255357#comment-15255357 ] Reynold Xin commented on SPARK-14654: - No, it can't be private[spark], because it needs to be implemented. I also don't see why we'd need to expose the internal buffer type, since it is strictly an implementation detail of the accumulators.
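Keeping the buffer private to the implementation, as argued in this thread, still lets a concrete subclass expose specialized unboxed methods so that per-record updates avoid boxing. A hedged Java sketch (names are assumptions for illustration, not Spark's actual API):

```java
// Illustrative analog of the proposed LongAccumulator: the boxed add(Long)
// satisfies a generic contract, while the overloaded add(long) lets hot
// per-record paths skip boxing entirely. The internal long field is an
// implementation detail; only the specialized getter exposes it.
class LongAcc implements java.io.Serializable {
    private long sum = 0L;

    void reset() { sum = 0L; }
    void add(Long v) { sum += v; }   // generic, boxed entry point
    void add(long v) { sum += v; }   // specialized, no boxing per record
    void merge(LongAcc other) { sum += other.sum; }
    long sum() { return sum; }       // specialized getter, returns a primitive
}
```

Calling `add(1L)` resolves to the primitive overload at compile time, so a tight loop never allocates `Long` objects.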
[jira] [Commented] (SPARK-14654) New accumulator API
[ https://issues.apache.org/jira/browse/SPARK-14654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255356#comment-15255356 ] holdenk commented on SPARK-14654: - Since we're talking about average accumulators anyway: what would the return type of a Long average accumulator's value function be? Also, would it make sense to make the merge function {code}private[spark]{code}? And/or perhaps change {code}abstract class Accumulator[IN, OUT] extends Serializable {{code} to {code}abstract class Accumulator[IN, ACC, OUT] extends Serializable {{code}, have merge take ACC, e.g. {code}def merge(other: ACC): Unit{code}, and then write {code}class LongAccumulator extends Accumulator[jl.Long, LongAccumulator, jl.Long] {{code}?
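The three-parameter `Accumulator[IN, ACC, OUT]` idea suggested above can be approximated in Java with a recursive type bound, so that merging two incompatible accumulators fails at compile time instead of throwing at runtime. A sketch under assumed names (this is not Spark's API, just an illustration of the type trick):

```java
// Illustrative: ACC is bound to the concrete subclass, so merge() only
// accepts accumulators of the same concrete type. Names are assumptions.
abstract class Acc3<IN, ACC extends Acc3<IN, ACC, OUT>, OUT> {
    abstract void add(IN v);
    abstract void merge(ACC other);  // compile-time check, no pattern match
    abstract OUT value();
}

class SumAcc extends Acc3<Long, SumAcc, Long> {
    private long sum = 0L;
    @Override void add(Long v) { sum += v; }
    @Override void merge(SumAcc other) { sum += other.sum; }
    @Override Long value() { return sum; }
}
// A hypothetical maxAcc.merge(sumAcc) would now be rejected by the compiler
// rather than throwing UnsupportedOperationException at runtime.
```

This is the same self-referential bound Java uses for `Enum<E extends Enum<E>>`; the trade-off raised later in the thread is the extra type parameter every user-facing signature must carry.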
[jira] [Commented] (SPARK-14654) New accumulator API
[ https://issues.apache.org/jira/browse/SPARK-14654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255355#comment-15255355 ] holdenk commented on SPARK-14654: - If we're only going to do Long and Double for the easy creation on the SparkContext, then I can certainly see why it wouldn't be worth the headache of using reflection to avoid the duplicate boilerplate code between types. I didn't intend to suggest that the reflection-based API would be the only way to create accumulators, just that it would replace the individual convenience functions on the SparkContext (users would still have the ability to construct custom Accumulators and register them with registerAccumulator).
[jira] [Updated] (SPARK-14867) Remove `--force` option in `build/mvn`.
[ https://issues.apache.org/jira/browse/SPARK-14867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-14867: -- Priority: Major (was: Trivial) Description: Currently, `build/mvn` provides a convenient option, `--force`, in order to use the recommended version of maven without changing the PATH environment variable. However, there were two problems. - `dev/lint-java` does not use the newly installed maven. - It's not easy to always type the `--force` option. If we use the '--force' option once, we had better prefer the Spark-recommended maven. This issue makes `build/mvn` check for the existence of the maven installed by the `--force` option first. According to [~srowen]'s comment, this issue now aims to remove the `--force` option via auto-detection of the maven version. was: Currently, `build/mvn` provides a convenient option, `--force`, in order to use the recommended version of maven without changing the PATH environment variable. However, there were two problems. - `dev/lint-java` does not use the newly installed maven. - It's not easy to always type the `--force` option. If we use the '--force' option once, we had better prefer the Spark-recommended maven. This issue makes `build/mvn` check for the existence of the maven installed by the `--force` option first. Summary: Remove `--force` option in `build/mvn`. (was: Make `build/mvn` use the downloaded maven if it exists.) > Remove `--force` option in `build/mvn`. > --- > > Key: SPARK-14867 > URL: https://issues.apache.org/jira/browse/SPARK-14867 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Dongjoon Hyun
[jira] [Assigned] (SPARK-14729) Implement an existing cluster manager with New ExternalClusterManager interface
[ https://issues.apache.org/jira/browse/SPARK-14729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14729: Assignee: Apache Spark > Implement an existing cluster manager with New ExternalClusterManager > interface > --- > > Key: SPARK-14729 > URL: https://issues.apache.org/jira/browse/SPARK-14729 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Reporter: Hemant Bhanawat >Assignee: Apache Spark >Priority: Minor > Original Estimate: 336h > Remaining Estimate: 336h > > SPARK-13904 adds an ExternalClusterManager interface to Spark to allow > external cluster managers to spawn Spark components. > This JIRA tracks following suggestion from [~rxin]: > 'One thing - can you guys try to see if you can implement one of the existing > cluster managers with this, and then we can make sure this is a proper API? > Otherwise it is really easy to get removed because it is currently unused by > anything in Spark.' -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14729) Implement an existing cluster manager with New ExternalClusterManager interface
[ https://issues.apache.org/jira/browse/SPARK-14729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14729: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-14729) Implement an existing cluster manager with New ExternalClusterManager interface
[ https://issues.apache.org/jira/browse/SPARK-14729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255340#comment-15255340 ] Apache Spark commented on SPARK-14729: -- User 'hbhanawat' has created a pull request for this issue: https://github.com/apache/spark/pull/12641
[jira] [Comment Edited] (SPARK-14694) Thrift Server + Hive Metastore + Kerberos doesn't work
[ https://issues.apache.org/jira/browse/SPARK-14694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255036#comment-15255036 ] zhangguancheng edited comment on SPARK-14694 at 4/23/16 6:50 PM: - Content of hive-site.xml:
{quote}
hive.server2.thrift.port = 1
hive.metastore.sasl.enabled = true
hive.metastore.kerberos.keytab.file = /opt/hive/apache-hive-1.1.1-bin/conf/hive.keytab
hive.metastore.kerberos.principal = hive/c1@C1
hive.server2.authentication = KERBEROS
hive.server2.authentication.kerberos.principal = hive/c1@C1
hive.server2.authentication.kerberos.keytab = /opt/hive/apache-hive-1.1.1-bin/conf/hive.keytab
javax.jdo.option.ConnectionURL = jdbc:mysql://localhost/test (the URL of the MySQL database)
javax.jdo.option.ConnectionDriverName = com.mysql.jdbc.Driver
javax.jdo.option.ConnectionUserName = xxx
javax.jdo.option.ConnectionPassword = x
datanucleus.autoCreateSchema = false
datanucleus.fixedDatastore = true
hive.metastore.uris = thrift://localhost:9083 (IP address or fully-qualified domain name and port of the metastore host)
{quote}
When I set hive.server2.enable.impersonation and hive.server2.enable.doAs to false, the error went away:
{quote}
hive.server2.enable.impersonation = false
hive.server2.enable.doAs = false
hive.execution.engine = spark
{quote}
> Thrift Server + Hive Metastore + Kerberos doesn't work
> --
>
> Key: SPARK-14694
> URL: https://issues.apache.org/jira/browse/SPARK-14694
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.6.0, 1.6.1
> Environment: Spark 1.6.1 compiled with hadoop 2.6.0, yarn, hive
> Hadoop 2.6.4
> Hive 1.1.1
> Kerberos
> Reporter: zhangguancheng
> Labels: security
>
> My Hive Metastore is MySQL based. I started a spark thrift server on the same node as the Hive Metastore. I can open beeline and run select statements, but for some commands like "show databases" I get an error:
> {quote}
> ERROR pool-24-thread-1 org.apache.thrift.transport.TSaslTransport:315 SASL negotiation failure
> javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
> at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
> at org.apache.thrift.transport.TSaslClientTransport.handleSaslStartMessage(TSaslClientTransport.java:94)
> at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:271)
> at org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37)
> at org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:52)
> at org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:49)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
> at org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport.open(TUGIAssumingTransport.java:49)
> at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:420)
> at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:236)
> at org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.<init>(SessionHiveMetaStoreClient.java:74)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at
[jira] [Updated] (SPARK-14594) Improve error messages for RDD API
[ https://issues.apache.org/jira/browse/SPARK-14594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-14594: -- Assignee: Felix Cheung > Improve error messages for RDD API > -- > > Key: SPARK-14594 > URL: https://issues.apache.org/jira/browse/SPARK-14594 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 1.5.2 >Reporter: Marco Gaido >Assignee: Felix Cheung > Fix For: 2.0.0 > > > When you have an error in your R code using the RDD API, you always get as > error message: > Error in if (returnStatus != 0) { : argument is of length zero > This is not very useful and I think it might be better to catch the R > exception and show it instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14594) Improve error messages for RDD API
[ https://issues.apache.org/jira/browse/SPARK-14594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-14594. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12622 [https://github.com/apache/spark/pull/12622] > Improve error messages for RDD API > -- > > Key: SPARK-14594 > URL: https://issues.apache.org/jira/browse/SPARK-14594 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 1.5.2 >Reporter: Marco Gaido > Fix For: 2.0.0 > > > When you have an error in your R code using the RDD API, you always get as > error message: > Error in if (returnStatus != 0) { : argument is of length zero > This is not very useful and I think it might be better to catch the R > exception and show it instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-14654) New accumulator API
[ https://issues.apache.org/jira/browse/SPARK-14654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255312#comment-15255312 ] Reynold Xin edited comment on SPARK-14654 at 4/23/16 6:01 PM: -- I don't get what you are trying to accomplish. It seems like you enjoy the cuteness of reflection. With your proposal: 1. Specialization won't work, which is a big part of this new API. 2. It is less obvious what the return types should be. 3. It is strictly less type safe, and app developers won't know what the accepted input types are. 4. It is unclear what the semantics are when "1" is passed in as the initial value rather than "0". 5. We would need to implement all the primitive types, which I don't think makes sense. In my version only double and long are implemented. I don't see why we should implement all the primitive types. Why have a "byte" accumulator when the long one captures almost all the use cases? How often would having a "Boolean" accumulator make sense? You are keeping almost all the issues with the existing API. And you would know if you want an avg or a long in the new one, because they have different functions. was (Author: rxin): I don't get what you are trying to accomplish. It seems like you enjoy the cuteness of reflection. With your proposal: 1. Specialization won't work, which is a big part of this new API. 2. It is less obvious what the return types should be. 3. It is strictly less type safe, and app developers won't know what the accepted input types are. 4. It is unclear what the semantics are when "1" is passed in as the initial value rather than "0". 5. We would need to implement all the primitive types, which I don't think makes sense. In my version only double and long are implemented. I don't see why we should implement all the primitive types. Why have a "byte" accumulator when the long one captures almost all the use cases? How often would having a "Boolean" accumulator make sense?
You are keeping almost all the issues with the existing API. And of course you would know if you want an avg or a long in the new one, because they have different functions. > New accumulator API > --- > > Key: SPARK-14654 > URL: https://issues.apache.org/jira/browse/SPARK-14654 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > > The current accumulator API has a few problems: > 1. Its type hierarchy is very complicated, with Accumulator, Accumulable, > AccumulatorParam, AccumulableParam, etc. > 2. The intermediate buffer type must be the same as the output type, so there > is no way to define an accumulator that computes averages. > 3. It is very difficult to specialize the methods, leading to excessive > boxing and making accumulators bad for metrics that change for each record. > 4. There is not a single coherent API that works for both Java and Scala. > This is a proposed new API that addresses all of the above. In this new API: > 1. There is only a single class (Accumulator) that is user facing > 2. The intermediate value is stored in the accumulator itself and can be > different from the output type. > 3. Concrete implementations can provide its own specialized methods. > 4. Designed to work for both Java and Scala. > {code} > abstract class Accumulator[IN, OUT] extends Serializable { > def isRegistered: Boolean = ... > def register(metadata: AccumulatorMetadata): Unit = ... > def metadata: AccumulatorMetadata = ... > def reset(): Unit > def add(v: IN): Unit > def merge(other: Accumulator[IN, OUT]): Unit > def value: OUT > def localValue: OUT = value > final def registerAccumulatorOnExecutor(): Unit = { > // Automatically register the accumulator when it is deserialized with > the task closure. > // This is for external accumulators and internal ones that do not > represent task level > // metrics, e.g. internal SQL metrics, which are per-operator. 
> val taskContext = TaskContext.get() > if (taskContext != null) { > taskContext.registerAccumulator(this) > } > } > // Called by Java when deserializing an object > private def readObject(in: ObjectInputStream): Unit = > Utils.tryOrIOException { > in.defaultReadObject() > registerAccumulator() > } > } > {code} > Metadata, provided by Spark after registration: > {code} > class AccumulatorMetadata( > val id: Long, > val name: Option[String], > val countFailedValues: Boolean > ) extends Serializable > {code} > and an implementation that also offers specialized getters and setters > {code} > class LongAccumulator extends Accumulator[jl.Long, jl.Long] { > private[this] var _sum = 0L > override def reset(): Unit = _sum = 0L > override def add(v: jl.Long): Unit = { > _sum += v > } > override def merge(other: Accumulator[jl.Long,
[jira] [Commented] (SPARK-14654) New accumulator API
[ https://issues.apache.org/jira/browse/SPARK-14654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255312#comment-15255312 ] Reynold Xin commented on SPARK-14654: - I don't get what you are trying to accomplish. It seems like you enjoy the cuteness of reflection. With your proposal: 1. Specialization won't work, which is a big part of this new API. 2. It is less obvious what the return types should be. 3. It is strictly less type safe, and app developers won't know what the accepted input types are. 4. It is unclear what the semantics are when "1" is passed in as the initial value rather than "0". 5. We would need to implement all the primitive types, which I don't think makes sense. In my thing only double and long are implemented. I don't see why we should implement all the primitive types. Why have a "byte" accumulator when the long one captures almost all the use cases? How often would having a "Boolean" accumulator make sense? You are keeping almost all the issues with the existing API. > New accumulator API > --- > > Key: SPARK-14654 > URL: https://issues.apache.org/jira/browse/SPARK-14654 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > > The current accumulator API has a few problems: > 1. Its type hierarchy is very complicated, with Accumulator, Accumulable, > AccumulatorParam, AccumulableParam, etc. > 2. The intermediate buffer type must be the same as the output type, so there > is no way to define an accumulator that computes averages. > 3. It is very difficult to specialize the methods, leading to excessive > boxing and making accumulators bad for metrics that change for each record. > 4. There is not a single coherent API that works for both Java and Scala. > This is a proposed new API that addresses all of the above. In this new API: > 1. There is only a single class (Accumulator) that is user facing > 2. 
The intermediate value is stored in the accumulator itself and can be > different from the output type. > 3. Concrete implementations can provide its own specialized methods. > 4. Designed to work for both Java and Scala. > {code} > abstract class Accumulator[IN, OUT] extends Serializable { > def isRegistered: Boolean = ... > def register(metadata: AccumulatorMetadata): Unit = ... > def metadata: AccumulatorMetadata = ... > def reset(): Unit > def add(v: IN): Unit > def merge(other: Accumulator[IN, OUT]): Unit > def value: OUT > def localValue: OUT = value > final def registerAccumulatorOnExecutor(): Unit = { > // Automatically register the accumulator when it is deserialized with > the task closure. > // This is for external accumulators and internal ones that do not > represent task level > // metrics, e.g. internal SQL metrics, which are per-operator. > val taskContext = TaskContext.get() > if (taskContext != null) { > taskContext.registerAccumulator(this) > } > } > // Called by Java when deserializing an object > private def readObject(in: ObjectInputStream): Unit = > Utils.tryOrIOException { > in.defaultReadObject() > registerAccumulator() > } > } > {code} > Metadata, provided by Spark after registration: > {code} > class AccumulatorMetadata( > val id: Long, > val name: Option[String], > val countFailedValues: Boolean > ) extends Serializable > {code} > and an implementation that also offers specialized getters and setters > {code} > class LongAccumulator extends Accumulator[jl.Long, jl.Long] { > private[this] var _sum = 0L > override def reset(): Unit = _sum = 0L > override def add(v: jl.Long): Unit = { > _sum += v > } > override def merge(other: Accumulator[jl.Long, jl.Long]): Unit = other > match { > case o: LongAccumulator => _sum += o.sum > case _ => throw new UnsupportedOperationException( > s"Cannot merge ${this.getClass.getName} with ${other.getClass.getName}") > } > override def value: jl.Long = _sum > def sum: Long = _sum > } > {code} > and 
SparkContext... > {code} > class SparkContext { > ... > def newLongAccumulator(): LongAccumulator > def newLongAccumulator(name: String): LongAccumulator > def newLongAccumulator(name: String, dedup: Boolean): LongAccumulator > def registerAccumulator[IN, OUT](acc: Accumulator[IN, OUT]): > Accumulator[IN, OUT] > ... > } > {code} > To use it ... > {code} > val acc = sc.newLongAccumulator() > sc.parallelize(1 to 1000).map { i => > acc.add(1) > i > } > {code} > A work-in-progress prototype here: > https://github.com/rxin/spark/tree/accumulator-refactor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
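Point 2 of the proposal, that the intermediate buffer can differ from the output type, is easiest to see with an average. The sketch below is illustrative only: it assumes the `Accumulator[IN, OUT]` shape quoted above (reduced to its core methods, with the registration plumbing omitted), and `AverageAccumulator` is a hypothetical name, not part of the proposal or the prototype.

```scala
// Minimal stand-in for the proposed abstract class (registration omitted).
abstract class Accumulator[IN, OUT] extends Serializable {
  def reset(): Unit
  def add(v: IN): Unit
  def merge(other: Accumulator[IN, OUT]): Unit
  def value: OUT
}

// Hypothetical accumulator whose internal state (sum + count) differs from
// its output (the mean) -- impossible under the old API, where the buffer
// type and the output type had to match.
class AverageAccumulator extends Accumulator[java.lang.Double, java.lang.Double] {
  private[this] var _sum = 0.0
  private[this] var _count = 0L

  override def reset(): Unit = { _sum = 0.0; _count = 0L }

  override def add(v: java.lang.Double): Unit = { _sum += v; _count += 1 }

  override def merge(other: Accumulator[java.lang.Double, java.lang.Double]): Unit =
    other match {
      case o: AverageAccumulator => _sum += o.sum; _count += o.count
      case _ => throw new UnsupportedOperationException(
        s"Cannot merge ${this.getClass.getName} with ${other.getClass.getName}")
    }

  // Only the mean is exposed; the sum/count buffer stays internal.
  override def value: java.lang.Double = if (_count == 0) 0.0 else _sum / _count

  def sum: Double = _sum
  def count: Long = _count
}

object AverageAccumulatorDemo extends App {
  val a = new AverageAccumulator
  Seq(1.0, 2.0, 3.0).foreach(v => a.add(v))
  val b = new AverageAccumulator // e.g. the copy deserialized on another task
  b.add(6.0)
  a.merge(b)
  println(a.value) // prints 3.0 (mean of 1, 2, 3, 6)
}
```

This mirrors the `LongAccumulator` pattern in the description (reset/add/merge plus a type-checked merge), just with a two-field buffer.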
[jira] [Resolved] (SPARK-14873) Java sampleByKey methods take ju.Map but with Scala Double values; results in type Object
[ https://issues.apache.org/jira/browse/SPARK-14873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-14873. - Resolution: Fixed Fix Version/s: 2.0.0 > Java sampleByKey methods take ju.Map but with Scala Double values; results in > type Object > - > > Key: SPARK-14873 > URL: https://issues.apache.org/jira/browse/SPARK-14873 > Project: Spark > Issue Type: Sub-task > Components: Java API, Spark Core >Affects Versions: 1.6.1 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Minor > Fix For: 2.0.0 > > > There's this odd bit of code in {{JavaStratifiedSamplingExample}}: > {code} > // specify the exact fraction desired from each key > ImmutableMap<Integer, Object> fractions = > ImmutableMap.of(1, (Object)0.1, 2, (Object) 0.6, 3, (Object) 0.3); > // Get an approximate sample from each stratum > JavaPairRDD approxSample = data.sampleByKey(false, > fractions); > {code} > It highlights a problem like that in > https://issues.apache.org/jira/browse/SPARK-12604 where Scala primitive types > are used where Java requires an object, and the result is that a signature > that logically takes Double (objects) takes an Object in the Java API. It's > an easy, similar fix.
[jira] [Commented] (SPARK-14654) New accumulator API
[ https://issues.apache.org/jira/browse/SPARK-14654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255291#comment-15255291 ] holdenk commented on SPARK-14654: - You wouldn't know if you want a counter or average, but the same applies to the function `newLongAccumulator`, so you give them a counter, and if they want a different combiner function they subclass the `Accumulator` to implement that (or if we wanted to offer average easily you could add a flag or a different call of newAverageAccumulator and have standard implementations for average for the built in classes). If the user passes a 1 it means the accumulator starts with a value of 1. To me it just feels a little clunky to have newXAccumulator \forall X in {default supported types} - but it is clearer at compile time so I can see why it might be a better fit. If we do end up adding a lot of newXAccumulator to the API I think we should consider either grouping them in the API docs or moving them to a separate class. > New accumulator API > --- > > Key: SPARK-14654 > URL: https://issues.apache.org/jira/browse/SPARK-14654 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > > The current accumulator API has a few problems: > 1. Its type hierarchy is very complicated, with Accumulator, Accumulable, > AccumulatorParam, AccumulableParam, etc. > 2. The intermediate buffer type must be the same as the output type, so there > is no way to define an accumulator that computes averages. > 3. It is very difficult to specialize the methods, leading to excessive > boxing and making accumulators bad for metrics that change for each record. > 4. There is not a single coherent API that works for both Java and Scala. > This is a proposed new API that addresses all of the above. In this new API: > 1. There is only a single class (Accumulator) that is user facing > 2. The intermediate value is stored in the accumulator itself and can be > different from the output type. > 3. 
Concrete implementations can provide its own specialized methods. > 4. Designed to work for both Java and Scala. > {code} > abstract class Accumulator[IN, OUT] extends Serializable { > def isRegistered: Boolean = ... > def register(metadata: AccumulatorMetadata): Unit = ... > def metadata: AccumulatorMetadata = ... > def reset(): Unit > def add(v: IN): Unit > def merge(other: Accumulator[IN, OUT]): Unit > def value: OUT > def localValue: OUT = value > final def registerAccumulatorOnExecutor(): Unit = { > // Automatically register the accumulator when it is deserialized with > the task closure. > // This is for external accumulators and internal ones that do not > represent task level > // metrics, e.g. internal SQL metrics, which are per-operator. > val taskContext = TaskContext.get() > if (taskContext != null) { > taskContext.registerAccumulator(this) > } > } > // Called by Java when deserializing an object > private def readObject(in: ObjectInputStream): Unit = > Utils.tryOrIOException { > in.defaultReadObject() > registerAccumulator() > } > } > {code} > Metadata, provided by Spark after registration: > {code} > class AccumulatorMetadata( > val id: Long, > val name: Option[String], > val countFailedValues: Boolean > ) extends Serializable > {code} > and an implementation that also offers specialized getters and setters > {code} > class LongAccumulator extends Accumulator[jl.Long, jl.Long] { > private[this] var _sum = 0L > override def reset(): Unit = _sum = 0L > override def add(v: jl.Long): Unit = { > _sum += v > } > override def merge(other: Accumulator[jl.Long, jl.Long]): Unit = other > match { > case o: LongAccumulator => _sum += o.sum > case _ => throw new UnsupportedOperationException( > s"Cannot merge ${this.getClass.getName} with ${other.getClass.getName}") > } > override def value: jl.Long = _sum > def sum: Long = _sum > } > {code} > and SparkContext... > {code} > class SparkContext { > ... 
> def newLongAccumulator(): LongAccumulator > def newLongAccumulator(name: String): LongAccumulator > def newLongAccumulator(name: String, dedup: Boolean): LongAccumulator > def registerAccumulator[IN, OUT](acc: Accumulator[IN, OUT]): > Accumulator[IN, OUT] > ... > } > {code} > To use it ... > {code} > val acc = sc.newLongAccumulator() > sc.parallelize(1 to 1000).map { i => > acc.add(1) > i > } > {code} > A work-in-progress prototype here: > https://github.com/rxin/spark/tree/accumulator-refactor
[jira] [Assigned] (SPARK-14850) VectorUDT/MatrixUDT should take primitive arrays without boxing
[ https://issues.apache.org/jira/browse/SPARK-14850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14850: Assignee: Apache Spark > VectorUDT/MatrixUDT should take primitive arrays without boxing > --- > > Key: SPARK-14850 > URL: https://issues.apache.org/jira/browse/SPARK-14850 > Project: Spark > Issue Type: Improvement > Components: ML, SQL >Affects Versions: 1.5.2, 1.6.1, 2.0.0 >Reporter: Xiangrui Meng >Assignee: Apache Spark >Priority: Blocker > > In SPARK-9390, we switched to using GenericArrayData to store indices and > values in vector/matrix UDTs. However, GenericArrayData is not specialized > for primitive types. This might hurt MLlib performance badly. We should > consider either specializing GenericArrayData or using a different container. > cc: [~cloud_fan] [~yhuai]
[jira] [Commented] (SPARK-14850) VectorUDT/MatrixUDT should take primitive arrays without boxing
[ https://issues.apache.org/jira/browse/SPARK-14850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255282#comment-15255282 ] Apache Spark commented on SPARK-14850: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/12640 > VectorUDT/MatrixUDT should take primitive arrays without boxing > --- > > Key: SPARK-14850 > URL: https://issues.apache.org/jira/browse/SPARK-14850 > Project: Spark > Issue Type: Improvement > Components: ML, SQL >Affects Versions: 1.5.2, 1.6.1, 2.0.0 >Reporter: Xiangrui Meng >Priority: Blocker > > In SPARK-9390, we switched to using GenericArrayData to store indices and > values in vector/matrix UDTs. However, GenericArrayData is not specialized > for primitive types. This might hurt MLlib performance badly. We should > consider either specializing GenericArrayData or using a different container. > cc: [~cloud_fan] [~yhuai]
[jira] [Assigned] (SPARK-14850) VectorUDT/MatrixUDT should take primitive arrays without boxing
[ https://issues.apache.org/jira/browse/SPARK-14850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14850: Assignee: (was: Apache Spark) > VectorUDT/MatrixUDT should take primitive arrays without boxing > --- > > Key: SPARK-14850 > URL: https://issues.apache.org/jira/browse/SPARK-14850 > Project: Spark > Issue Type: Improvement > Components: ML, SQL >Affects Versions: 1.5.2, 1.6.1, 2.0.0 >Reporter: Xiangrui Meng >Priority: Blocker > > In SPARK-9390, we switched to using GenericArrayData to store indices and > values in vector/matrix UDTs. However, GenericArrayData is not specialized > for primitive types. This might hurt MLlib performance badly. We should > consider either specializing GenericArrayData or using a different container. > cc: [~cloud_fan] [~yhuai]
[jira] [Commented] (SPARK-14864) [MLLIB] Implement Doc2Vec
[ https://issues.apache.org/jira/browse/SPARK-14864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255275#comment-15255275 ] Peter Mountanos commented on SPARK-14864: - [~prudenko] [~cqnguyen] I noticed previous discussion of possibly implementing Doc2Vec in issue SPARK-4101. Has there been any headway on this? > [MLLIB] Implement Doc2Vec > - > > Key: SPARK-14864 > URL: https://issues.apache.org/jira/browse/SPARK-14864 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Peter Mountanos >Priority: Minor > > It would be useful to implement Doc2Vec, as described in the paper > [Distributed Representations of Sentences and > Documents|https://cs.stanford.edu/~quocle/paragraph_vector.pdf]. Gensim has > an implementation [Deep learning with > paragraph2vec|https://radimrehurek.com/gensim/models/doc2vec.html]. > Le & Mikolov show that aggregating Word2Vec vector representations for a > paragraph/document does not perform well for prediction tasks. Instead, > they propose the Paragraph Vector implementation, which provides > state-of-the-art results on several text classification and sentiment > analysis tasks.
[jira] [Commented] (SPARK-14856) Returning batch unexpected from wide table
[ https://issues.apache.org/jira/browse/SPARK-14856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255261#comment-15255261 ] Apache Spark commented on SPARK-14856: -- User 'tedyu' has created a pull request for this issue: https://github.com/apache/spark/pull/12639 > Returning batch unexpected from wide table > -- > > Key: SPARK-14856 > URL: https://issues.apache.org/jira/browse/SPARK-14856 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > Fix For: 2.0.0 > > > When the required schema supports batch reads but the full schema does not, > the Parquet reader may return batches unexpectedly.
[jira] [Commented] (SPARK-14875) OutputWriterFactory.newInstance shouldn't be private[sql]
[ https://issues.apache.org/jira/browse/SPARK-14875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255249#comment-15255249 ] Cheng Lian commented on SPARK-14875: Checked with [~cloud_fan], it was accidentally made private while adding the bucketing feature. I'm removing this qualifier. > OutputWriterFactory.newInstance shouldn't be private[sql] > - > > Key: SPARK-14875 > URL: https://issues.apache.org/jira/browse/SPARK-14875 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Cheng Lian > > Existing packages like spark-avro need to access > {{OutputWriterFactory.newInstance}}, but it's marked as {{private\[sql\]}} in > Spark 2.0. Should make it public again.
[jira] [Commented] (SPARK-14875) OutputWriterFactory.newInstance shouldn't be private[sql]
[ https://issues.apache.org/jira/browse/SPARK-14875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255248#comment-15255248 ] Cheng Lian commented on SPARK-14875: [~marmbrus] Is there any reason why we made it private in Spark 2.0? > OutputWriterFactory.newInstance shouldn't be private[sql] > - > > Key: SPARK-14875 > URL: https://issues.apache.org/jira/browse/SPARK-14875 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Cheng Lian > > Existing packages like spark-avro need to access > {{OutputWriterFactory.newInstance}}, but it's marked as {{private\[sql\]}} in > Spark 2.0. Should make it public again.
[jira] [Created] (SPARK-14875) OutputWriterFactory.newInstance shouldn't be private[sql]
Cheng Lian created SPARK-14875: -- Summary: OutputWriterFactory.newInstance shouldn't be private[sql] Key: SPARK-14875 URL: https://issues.apache.org/jira/browse/SPARK-14875 Project: Spark Issue Type: Bug Affects Versions: 2.0.0 Reporter: Cheng Lian Assignee: Cheng Lian Existing packages like spark-avro need to access {{OutputWriterFactory.newInstance}}, but it's marked as {{private\[sql\]}} in Spark 2.0. Should make it public again.
[jira] [Updated] (SPARK-14874) Cleanup the useless Batch class
[ https://issues.apache.org/jira/browse/SPARK-14874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liwei Lin updated SPARK-14874: -- Summary: Cleanup the useless Batch class (was: Remove the useless Batch class) > Cleanup the useless Batch class > --- > > Key: SPARK-14874 > URL: https://issues.apache.org/jira/browse/SPARK-14874 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Liwei Lin >Priority: Minor > > The Batch class, which had been used to indicate progress in a stream, was > abandoned by SPARK-13985 and then became useless. > Let's: > - remove the Batch class > - rename getBatch(...) to getData(...) for Source > - rename addBatch(...) to addData(...) for Sink
[jira] [Assigned] (SPARK-14874) Remove the useless Batch class
[ https://issues.apache.org/jira/browse/SPARK-14874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14874: Assignee: Apache Spark > Remove the useless Batch class > -- > > Key: SPARK-14874 > URL: https://issues.apache.org/jira/browse/SPARK-14874 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Liwei Lin >Assignee: Apache Spark >Priority: Minor > > The Batch class, which had been used to indicate progress in a stream, was > abandoned by SPARK-13985 and is now unused. > Let's: > - remove the Batch class > - rename getBatch(...) to getData(...) for Source > - rename addBatch(...) to addData(...) for Sink
[jira] [Commented] (SPARK-14874) Remove the useless Batch class
[ https://issues.apache.org/jira/browse/SPARK-14874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255236#comment-15255236 ] Apache Spark commented on SPARK-14874: -- User 'lw-lin' has created a pull request for this issue: https://github.com/apache/spark/pull/12638 > Remove the useless Batch class > -- > > Key: SPARK-14874 > URL: https://issues.apache.org/jira/browse/SPARK-14874 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Liwei Lin >Priority: Minor > > The Batch class, which had been used to indicate progress in a stream, was > abandoned by SPARK-13985 and is now unused. > Let's: > - remove the Batch class > - rename getBatch(...) to getData(...) for Source > - rename addBatch(...) to addData(...) for Sink
[jira] [Assigned] (SPARK-14874) Remove the useless Batch class
[ https://issues.apache.org/jira/browse/SPARK-14874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14874: Assignee: (was: Apache Spark) > Remove the useless Batch class > -- > > Key: SPARK-14874 > URL: https://issues.apache.org/jira/browse/SPARK-14874 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Liwei Lin >Priority: Minor > > The Batch class, which had been used to indicate progress in a stream, was > abandoned by SPARK-13985 and is now unused. > Let's: > - remove the Batch class > - rename getBatch(...) to getData(...) for Source > - rename addBatch(...) to addData(...) for Sink
[jira] [Created] (SPARK-14874) Remove the useless Batch class
Liwei Lin created SPARK-14874: - Summary: Remove the useless Batch class Key: SPARK-14874 URL: https://issues.apache.org/jira/browse/SPARK-14874 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.0.0 Reporter: Liwei Lin Priority: Minor The Batch class, which had been used to indicate progress in a stream, was abandoned by SPARK-13985 and is now unused. Let's: - remove the Batch class - rename getBatch(...) to getData(...) for Source - rename addBatch(...) to addData(...) for Sink
[jira] [Commented] (SPARK-14846) Driver process fails to terminate when graceful shutdown is used
[ https://issues.apache.org/jira/browse/SPARK-14846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255223#comment-15255223 ] Sean Owen commented on SPARK-14846: --- No, that's not what methods like awaitNanos do in the JDK classes. It waits for up to that time, but the normal mechanism is that the Condition is signaled before the timeout occurs. This is not a sleep-and-poll. > Driver process fails to terminate when graceful shutdown is used > > > Key: SPARK-14846 > URL: https://issues.apache.org/jira/browse/SPARK-14846 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.6.1 >Reporter: Mattias Aspholm > > During shutdown, the job scheduler in Streaming (JobScheduler.stop) spends > some time waiting for all queued work to complete. If graceful shutdown is > used, the time is 1 hour, for non-graceful shutdown it's 2 seconds. > The wait is implemented using the ThreadPoolExecutor.awaitTermination method > in java.util.concurrent. The problem is that instead of looping over the > method for the desired period of time, the wait period is passed in as the > timeout parameter to awaitTermination. > The result is that if the termination condition is false the first time, the > method will sleep for the timeout period before trying again. In the case of > graceful shutdown this means at least an hour's wait before the condition is > checked again, even though all work is completed in just a few seconds. The > driver process will continue to live during this time.
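Sean Owen's point can be checked with a few lines of plain JDK code. This is a standalone sketch, not Spark's actual JobScheduler code; the pool and the empty task are invented for illustration. It shows that awaitTermination returns as soon as the executor terminates, long before a one-hour timeout would expire:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class AwaitTerminationDemo {
    // Waits up to one hour, but returns as soon as the pool terminates.
    public static boolean awaitAfterShortTask() throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        pool.submit(() -> { });  // a task that finishes almost immediately
        pool.shutdown();
        long start = System.nanoTime();
        // The timeout is only an upper bound: the underlying Condition is
        // signalled when the last task completes, so this is not a
        // sleep-and-poll and does not block for the full hour.
        boolean done = pool.awaitTermination(1, TimeUnit.HOURS);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("terminated=" + done + " after ~" + elapsedMs + " ms");
        return done;
    }

    public static void main(String[] args) throws InterruptedException {
        awaitAfterShortTask();
    }
}
```

Running this prints an elapsed time of milliseconds, consistent with the comment that the reported hang must have another cause.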
[jira] [Assigned] (SPARK-14873) Java sampleByKey methods take ju.Map but with Scala Double values; results in type Object
[ https://issues.apache.org/jira/browse/SPARK-14873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14873: Assignee: Sean Owen (was: Apache Spark) > Java sampleByKey methods take ju.Map but with Scala Double values; results in > type Object > - > > Key: SPARK-14873 > URL: https://issues.apache.org/jira/browse/SPARK-14873 > Project: Spark > Issue Type: Sub-task > Components: Java API, Spark Core >Affects Versions: 1.6.1 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Minor > > There's this odd bit of code in {{JavaStratifiedSamplingExample}}: > {code} > // specify the exact fraction desired from each key > ImmutableMap<Integer, Object> fractions = > ImmutableMap.of(1, (Object) 0.1, 2, (Object) 0.6, 3, (Object) 0.3); > // Get an approximate sample from each stratum > JavaPairRDD approxSample = data.sampleByKey(false, fractions); > {code} > It highlights a problem like that in > https://issues.apache.org/jira/browse/SPARK-12604 where Scala primitive types > are used where Java requires an object, and the result is that a signature > that logically takes Double (objects) takes an Object in the Java API. It's > an easy, similar fix.
[jira] [Commented] (SPARK-14873) Java sampleByKey methods take ju.Map but with Scala Double values; results in type Object
[ https://issues.apache.org/jira/browse/SPARK-14873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255221#comment-15255221 ] Apache Spark commented on SPARK-14873: -- User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/12637 > Java sampleByKey methods take ju.Map but with Scala Double values; results in > type Object > - > > Key: SPARK-14873 > URL: https://issues.apache.org/jira/browse/SPARK-14873 > Project: Spark > Issue Type: Sub-task > Components: Java API, Spark Core >Affects Versions: 1.6.1 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Minor > > There's this odd bit of code in {{JavaStratifiedSamplingExample}}: > {code} > // specify the exact fraction desired from each key > ImmutableMap<Integer, Object> fractions = > ImmutableMap.of(1, (Object) 0.1, 2, (Object) 0.6, 3, (Object) 0.3); > // Get an approximate sample from each stratum > JavaPairRDD approxSample = data.sampleByKey(false, fractions); > {code} > It highlights a problem like that in > https://issues.apache.org/jira/browse/SPARK-12604 where Scala primitive types > are used where Java requires an object, and the result is that a signature > that logically takes Double (objects) takes an Object in the Java API. It's > an easy, similar fix.
[jira] [Assigned] (SPARK-14873) Java sampleByKey methods take ju.Map but with Scala Double values; results in type Object
[ https://issues.apache.org/jira/browse/SPARK-14873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14873: Assignee: Apache Spark (was: Sean Owen) > Java sampleByKey methods take ju.Map but with Scala Double values; results in > type Object > - > > Key: SPARK-14873 > URL: https://issues.apache.org/jira/browse/SPARK-14873 > Project: Spark > Issue Type: Sub-task > Components: Java API, Spark Core >Affects Versions: 1.6.1 >Reporter: Sean Owen >Assignee: Apache Spark >Priority: Minor > > There's this odd bit of code in {{JavaStratifiedSamplingExample}}: > {code} > // specify the exact fraction desired from each key > ImmutableMap<Integer, Object> fractions = > ImmutableMap.of(1, (Object) 0.1, 2, (Object) 0.6, 3, (Object) 0.3); > // Get an approximate sample from each stratum > JavaPairRDD approxSample = data.sampleByKey(false, fractions); > {code} > It highlights a problem like that in > https://issues.apache.org/jira/browse/SPARK-12604 where Scala primitive types > are used where Java requires an object, and the result is that a signature > that logically takes Double (objects) takes an Object in the Java API. It's > an easy, similar fix.
[jira] [Updated] (SPARK-14873) Java sampleByKey methods take ju.Map but with Scala Double values; results in type Object
[ https://issues.apache.org/jira/browse/SPARK-14873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-14873: -- Issue Type: Sub-task (was: Bug) Parent: SPARK-11806 > Java sampleByKey methods take ju.Map but with Scala Double values; results in > type Object > - > > Key: SPARK-14873 > URL: https://issues.apache.org/jira/browse/SPARK-14873 > Project: Spark > Issue Type: Sub-task > Components: Java API, Spark Core >Affects Versions: 1.6.1 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Minor > > There's this odd bit of code in {{JavaStratifiedSamplingExample}}: > {code} > // specify the exact fraction desired from each key > ImmutableMap<Integer, Object> fractions = > ImmutableMap.of(1, (Object) 0.1, 2, (Object) 0.6, 3, (Object) 0.3); > // Get an approximate sample from each stratum > JavaPairRDD approxSample = data.sampleByKey(false, fractions); > {code} > It highlights a problem like that in > https://issues.apache.org/jira/browse/SPARK-12604 where Scala primitive types > are used where Java requires an object, and the result is that a signature > that logically takes Double (objects) takes an Object in the Java API. It's > an easy, similar fix.
[jira] [Created] (SPARK-14873) Java sampleByKey methods take ju.Map but with Scala Double values; results in type Object
Sean Owen created SPARK-14873: - Summary: Java sampleByKey methods take ju.Map but with Scala Double values; results in type Object Key: SPARK-14873 URL: https://issues.apache.org/jira/browse/SPARK-14873 Project: Spark Issue Type: Bug Components: Java API, Spark Core Affects Versions: 1.6.1 Reporter: Sean Owen Assignee: Sean Owen Priority: Minor There's this odd bit of code in {{JavaStratifiedSamplingExample}}: {code} // specify the exact fraction desired from each key ImmutableMap<Integer, Object> fractions = ImmutableMap.of(1, (Object) 0.1, 2, (Object) 0.6, 3, (Object) 0.3); // Get an approximate sample from each stratum JavaPairRDD approxSample = data.sampleByKey(false, fractions); {code} It highlights a problem like that in https://issues.apache.org/jira/browse/SPARK-12604 where Scala primitive types are used where Java requires an object, and the result is that a signature that logically takes Double (objects) takes an Object in the Java API. It's an easy, similar fix.
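The difference between the two signatures can be sketched in plain Java. The methods below are hypothetical stand-ins, not Spark's actual sampleByKey, and java.util.Map.of stands in for Guava's ImmutableMap.of; the point is only the cast the Object-typed signature forces on callers:

```java
import java.util.Map;

public class SampleByKeyShim {
    // Hypothetical stand-in for the current Java signature: values typed as
    // Object, because the Scala side's scala.Double erases to Object.
    static int countStrata(Map<Integer, Object> fractions) {
        return fractions.size();
    }

    // The shape the issue argues for: values typed as java.lang.Double.
    static int countStrataTyped(Map<Integer, Double> fractions) {
        return fractions.size();
    }

    public static void main(String[] args) {
        // Current shape: every fraction is cast to Object, mirroring the
        // quoted example.
        int a = countStrata(Map.of(1, (Object) 0.1, 2, (Object) 0.6, 3, (Object) 0.3));
        // Proposed shape: double literals autobox straight to Double.
        int b = countStrataTyped(Map.of(1, 0.1, 2, 0.6, 3, 0.3));
        System.out.println(a + " strata (Object values), " + b + " strata (Double values)");
    }
}
```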
[jira] [Assigned] (SPARK-14872) Restructure commands.scala
[ https://issues.apache.org/jira/browse/SPARK-14872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14872: Assignee: Reynold Xin (was: Apache Spark) > Restructure commands.scala > -- > > Key: SPARK-14872 > URL: https://issues.apache.org/jira/browse/SPARK-14872 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin >
[jira] [Commented] (SPARK-14872) Restructure commands.scala
[ https://issues.apache.org/jira/browse/SPARK-14872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255194#comment-15255194 ] Apache Spark commented on SPARK-14872: -- User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/12636 > Restructure commands.scala > -- > > Key: SPARK-14872 > URL: https://issues.apache.org/jira/browse/SPARK-14872 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin >
[jira] [Assigned] (SPARK-14872) Restructure commands.scala
[ https://issues.apache.org/jira/browse/SPARK-14872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14872: Assignee: Apache Spark (was: Reynold Xin) > Restructure commands.scala > -- > > Key: SPARK-14872 > URL: https://issues.apache.org/jira/browse/SPARK-14872 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark >
[jira] [Created] (SPARK-14872) Restructure commands.scala
Reynold Xin created SPARK-14872: --- Summary: Restructure commands.scala Key: SPARK-14872 URL: https://issues.apache.org/jira/browse/SPARK-14872 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin
[jira] [Assigned] (SPARK-14871) Disable StatsReportListener to declutter output
[ https://issues.apache.org/jira/browse/SPARK-14871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14871: Assignee: Apache Spark (was: Reynold Xin) > Disable StatsReportListener to declutter output > --- > > Key: SPARK-14871 > URL: https://issues.apache.org/jira/browse/SPARK-14871 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > > Spark SQL inherited the use of StatsReportListener from Shark. Unfortunately > this clutters the spark-sql CLI output and makes it very difficult to read > the actual query results.
[jira] [Assigned] (SPARK-14871) Disable StatsReportListener to declutter output
[ https://issues.apache.org/jira/browse/SPARK-14871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14871: Assignee: Reynold Xin (was: Apache Spark) > Disable StatsReportListener to declutter output > --- > > Key: SPARK-14871 > URL: https://issues.apache.org/jira/browse/SPARK-14871 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > Spark SQL inherited the use of StatsReportListener from Shark. Unfortunately > this clutters the spark-sql CLI output and makes it very difficult to read > the actual query results.
[jira] [Commented] (SPARK-14871) Disable StatsReportListener to declutter output
[ https://issues.apache.org/jira/browse/SPARK-14871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255187#comment-15255187 ] Apache Spark commented on SPARK-14871: -- User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/12635 > Disable StatsReportListener to declutter output > --- > > Key: SPARK-14871 > URL: https://issues.apache.org/jira/browse/SPARK-14871 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > Spark SQL inherited the use of StatsReportListener from Shark. Unfortunately > this clutters the spark-sql CLI output and makes it very difficult to read > the actual query results.
[jira] [Created] (SPARK-14871) Disable StatsReportListener to declutter output
Reynold Xin created SPARK-14871: --- Summary: Disable StatsReportListener to declutter output Key: SPARK-14871 URL: https://issues.apache.org/jira/browse/SPARK-14871 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Spark SQL inherited the use of StatsReportListener from Shark. Unfortunately this clutters the spark-sql CLI output and makes it very difficult to read the actual query results.
[jira] [Commented] (SPARK-14594) Improve error messages for RDD API
[ https://issues.apache.org/jira/browse/SPARK-14594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255186#comment-15255186 ] Marco Gaido commented on SPARK-14594: - Yes, I do believe that this is what is happening. > Improve error messages for RDD API > -- > > Key: SPARK-14594 > URL: https://issues.apache.org/jira/browse/SPARK-14594 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 1.5.2 >Reporter: Marco Gaido > > When you have an error in your R code using the RDD API, you always get the > error message: > Error in if (returnStatus != 0) { : argument is of length zero > This is not very useful, and I think it might be better to catch the R > exception and show it instead.
[jira] [Resolved] (SPARK-12148) SparkR: rename DataFrame to SparkDataFrame
[ https://issues.apache.org/jira/browse/SPARK-12148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-12148. - Resolution: Fixed Assignee: Felix Cheung Fix Version/s: 2.0.0 > SparkR: rename DataFrame to SparkDataFrame > -- > > Key: SPARK-12148 > URL: https://issues.apache.org/jira/browse/SPARK-12148 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Michael Lawrence >Assignee: Felix Cheung > Fix For: 2.0.0 > > > The SparkR package represents a Spark DataFrame with the class "DataFrame". > That conflicts with the more general DataFrame class defined in the S4Vectors > package. Would it not be more appropriate to use the name "SparkDataFrame" > instead?
[jira] [Created] (SPARK-14870) NPE in generate aggregate
Davies Liu created SPARK-14870: -- Summary: NPE in generate aggregate Key: SPARK-14870 URL: https://issues.apache.org/jira/browse/SPARK-14870 Project: Spark Issue Type: Bug Components: SQL Reporter: Davies Liu Assignee: Sameer Agarwal When running TPC-DS Q14a: {code} org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 126.0 failed 1 times, most recent failure: Lost task 0.0 in stage 126.0 (TID 234, localhost): java.lang.NullPointerException at org.apache.spark.sql.execution.vectorized.ColumnVector.putDecimal(ColumnVector.java:576) at org.apache.spark.sql.execution.vectorized.ColumnarBatch$Row.setDecimal(ColumnarBatch.java:325) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$7$$anon$1.hasNext(WholeStageCodegenExec.scala:361) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) at org.apache.spark.scheduler.Task.run(Task.scala:85) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:254) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1450) at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1438) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1437) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1437) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:809) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:809) at scala.Option.foreach(Option.scala:257) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:809) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1659) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1618) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1607) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1780) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1793) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1806) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1820) at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:880) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:357) at org.apache.spark.rdd.RDD.collect(RDD.scala:879) at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:453) at org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply$mcI$sp(Dataset.scala:2367) at 
org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:2367) at org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:2367) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:53) at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2386) at org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:2366) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at
[jira] [Assigned] (SPARK-14869) Don't mask exceptions in ResolveRelations
[ https://issues.apache.org/jira/browse/SPARK-14869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14869: Assignee: Apache Spark (was: Reynold Xin) > Don't mask exceptions in ResolveRelations > - > > Key: SPARK-14869 > URL: https://issues.apache.org/jira/browse/SPARK-14869 > Project: Spark > Issue Type: Improvement >Reporter: Reynold Xin >Assignee: Apache Spark > > In order to support SPARK-11197 (run SQL directly on files), we added some > code in ResolveRelations to catch the exception thrown by > catalog.lookupRelation and ignore it. This unfortunately masks all > exceptions. It should have been sufficient to simply test whether the table > exists.
[jira] [Assigned] (SPARK-14869) Don't mask exceptions in ResolveRelations
[ https://issues.apache.org/jira/browse/SPARK-14869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14869: Assignee: Reynold Xin (was: Apache Spark) > Don't mask exceptions in ResolveRelations > - > > Key: SPARK-14869 > URL: https://issues.apache.org/jira/browse/SPARK-14869 > Project: Spark > Issue Type: Improvement >Reporter: Reynold Xin >Assignee: Reynold Xin > > In order to support SPARK-11197 (run SQL directly on files), we added some > code in ResolveRelations to catch the exception thrown by > catalog.lookupRelation and ignore it. This unfortunately masks all > exceptions. It should have been sufficient to simply test the table does not > exist.
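The distinction the issue draws can be sketched with hypothetical helpers (this is illustrative plain Java, not Spark's actual ResolveRelations code): catching every exception turns an unrelated catalog failure into a silent fallback to the file path, while testing for existence up front lets real errors propagate.

```java
public class ResolveRelationsSketch {
    static class TableNotFoundException extends RuntimeException {}

    // Hypothetical catalog lookup: it can fail because the table is missing,
    // but also for unrelated reasons (e.g. the metastore is unreachable).
    static String lookupRelation(String table, boolean catalogHealthy) {
        if (!catalogHealthy) throw new IllegalStateException("catalog unreachable");
        if (!table.equals("t")) throw new TableNotFoundException();
        return "relation:" + table;
    }

    // The pattern the issue objects to: catch everything, so even a broken
    // catalog silently falls through to the run-SQL-on-files path.
    static String resolveMasking(String table, boolean catalogHealthy) {
        try {
            return lookupRelation(table, catalogHealthy);
        } catch (Exception e) {
            return "file:" + table;  // also swallows "catalog unreachable"
        }
    }

    static boolean tableExists(String table) {
        return table.equals("t");
    }

    // The suggested alternative: test for existence first; only a genuinely
    // missing table falls back, and every other failure propagates.
    static String resolveChecked(String table, boolean catalogHealthy) {
        if (!tableExists(table)) {
            return "file:" + table;
        }
        return lookupRelation(table, catalogHealthy);
    }

    public static void main(String[] args) {
        System.out.println(resolveMasking("missing", false));  // hides the real error
        System.out.println(resolveChecked("missing", true));
    }
}
```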