[jira] [Comment Edited] (SPARK-8072) Better AnalysisException for writing DataFrame with identically named columns
[ https://issues.apache.org/jira/browse/SPARK-8072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598890#comment-14598890 ] Animesh Baranawal edited comment on SPARK-8072 at 6/24/15 6:54 AM: --- [~rxin] If we want the rule to apply only on some save/output action, would it not be better to check the rule before calling the write function instead of adding the rule in CheckAnalysis.scala was (Author: animeshbaranawal): [~rxin] If we want the rule to apply only on some save/output action, would it not be more intuitive to check the rule before calling the write function instead of adding the rule in CheckAnalysis.scala > Better AnalysisException for writing DataFrame with identically named columns > - > > Key: SPARK-8072 > URL: https://issues.apache.org/jira/browse/SPARK-8072 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Priority: Blocker > > We should check if there are duplicate columns, and if yes, throw an explicit > error message saying there are duplicate columns. See current error message > below. 
> {code} > In [3]: df.withColumn('age', df.age) > Out[3]: DataFrame[age: bigint, name: string, age: bigint] > In [4]: df.withColumn('age', df.age).write.parquet('test-parquet.out') > --- > Py4JJavaError Traceback (most recent call last) > in () > > 1 df.withColumn('age', df.age).write.parquet('test-parquet.out') > /scratch/rxin/spark/python/pyspark/sql/readwriter.py in parquet(self, path, > mode) > 350 >>> df.write.parquet(os.path.join(tempfile.mkdtemp(), 'data')) > 351 """ > --> 352 self._jwrite.mode(mode).parquet(path) > 353 > 354 @since(1.4) > /Users/rxin/anaconda/lib/python2.7/site-packages/py4j-0.8.1-py2.7.egg/py4j/java_gateway.pyc > in __call__(self, *args) > 535 answer = self.gateway_client.send_command(command) > 536 return_value = get_return_value(answer, self.gateway_client, > --> 537 self.target_id, self.name) > 538 > 539 for temp_arg in temp_args: > /Users/rxin/anaconda/lib/python2.7/site-packages/py4j-0.8.1-py2.7.egg/py4j/protocol.pyc > in get_return_value(answer, gateway_client, target_id, name) > 298 raise Py4JJavaError( > 299 'An error occurred while calling {0}{1}{2}.\n'. > --> 300 format(target_id, '.', name), value) > 301 else: > 302 raise Py4JError( > Py4JJavaError: An error occurred while calling o35.parquet. 
> : org.apache.spark.sql.AnalysisException: Reference 'age' is ambiguous, could > be: age#0L, age#3L.; > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:279) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:116) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$16.apply(Analyzer.scala:350) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$16.apply(Analyzer.scala:350) > at > org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:350) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:341) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:285) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionUp$1(QueryPlan.scala:108) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2$$anonfun$apply$2.apply(QueryPlan.scala:123) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scal
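The check the ticket asks for is simple to picture: before handing the plan to the writer, scan the output column names for repeats and fail with an explicit message instead of the ambiguous-reference error above. A minimal pure-Python sketch of such a pre-write guard (hypothetical helper and message text, not Spark's actual CheckAnalysis or DataFrameWriter code):

```python
from collections import Counter

def assert_no_duplicate_columns(column_names):
    """Raise an explicit error when the output schema repeats a column name.

    Hypothetical pre-write guard illustrating the proposal; not Spark code.
    """
    dupes = sorted(name for name, count in Counter(column_names).items() if count > 1)
    if dupes:
        raise ValueError("Found duplicate column(s) when writing: %s" % ", ".join(dupes))

# The schema from the reproduction above: df.withColumn('age', df.age)
try:
    assert_no_duplicate_columns(["age", "name", "age"])
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```

With a guard like this the user sees which column is duplicated up front, rather than a resolver error deep inside the analyzer.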
[jira] [Commented] (SPARK-8214) math function: hex
[ https://issues.apache.org/jira/browse/SPARK-8214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598961#comment-14598961 ] Apache Spark commented on SPARK-8214: - User 'zhichao-li' has created a pull request for this issue: https://github.com/apache/spark/pull/6976 > math function: hex > -- > > Key: SPARK-8214 > URL: https://issues.apache.org/jira/browse/SPARK-8214 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: zhichao-li > > hex(BIGINT a): string > hex(STRING a): string > hex(BINARY a): string > If the argument is an INT or binary, hex returns the number as a STRING in > hexadecimal format. Otherwise if the number is a STRING, it converts each > character into its hexadecimal representation and returns the resulting > STRING. (See > http://dev.mysql.com/doc/refman/5.0/en/string-functions.html#function_hex, > BINARY version as of Hive 0.12.0.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
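The three overloads quoted above can be made concrete in plain Python. `hive_hex` below is a hypothetical illustration of the described Hive semantics (non-negative integers only; strings encoded as UTF-8), not Spark's or Hive's implementation:

```python
def hive_hex(value):
    """Sketch of hex() per the Hive docs quoted above (assumed semantics;
    Hive's two's-complement handling of negative BIGINTs is not modeled)."""
    if isinstance(value, int):
        return format(value, "X")                    # number -> hexadecimal string
    if isinstance(value, bytes):
        return value.hex().upper()                   # each byte -> two hex digits
    if isinstance(value, str):
        return value.encode("utf-8").hex().upper()   # each character -> its hex code
    raise TypeError("hex() expects an int, str, or bytes, got %r" % type(value))
```

For example, `hive_hex(255)` yields `"FF"` and `hive_hex("ABC")` yields `"414243"`, matching the INT-versus-STRING distinction in the description.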
[jira] [Assigned] (SPARK-8214) math function: hex
[ https://issues.apache.org/jira/browse/SPARK-8214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8214: --- Assignee: zhichao-li (was: Apache Spark)
[jira] [Assigned] (SPARK-8214) math function: hex
[ https://issues.apache.org/jira/browse/SPARK-8214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8214: --- Assignee: Apache Spark (was: zhichao-li)
[jira] [Updated] (SPARK-8533) Bump Flume version to 1.6.0
[ https://issues.apache.org/jira/browse/SPARK-8533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8533: - Component/s: Streaming Priority: Minor (was: Major) Issue Type: Task (was: Bug) (Let's set component / type / priority) > Bump Flume version to 1.6.0 > --- > > Key: SPARK-8533 > URL: https://issues.apache.org/jira/browse/SPARK-8533 > Project: Spark > Issue Type: Task > Components: Streaming >Reporter: Hari Shreedharan >Priority: Minor >
[jira] [Updated] (SPARK-8587) Return cost and cluster index KMeansModel.predict
[ https://issues.apache.org/jira/browse/SPARK-8587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8587: - Component/s: MLlib > Return cost and cluster index KMeansModel.predict > - > > Key: SPARK-8587 > URL: https://issues.apache.org/jira/browse/SPARK-8587 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Sam Stoelinga >Priority: Minor > > Looking at the PySpark implementation of KMeansModel.predict > https://github.com/apache/spark/blob/master/python/pyspark/mllib/clustering.py#L102 : > Currently: > it calculates the cost of the closest cluster but returns only the index. > My expectation: > an easy way for the same function, or a new one, to return the cost along with the index.
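The requested behavior is easy to sketch: compute the squared distance to every center and return both the argmin and its distance, instead of discarding the distance. A small standalone illustration (hypothetical `predict_with_cost`, not PySpark's actual API):

```python
def predict_with_cost(point, centers):
    """Hypothetical variant of KMeansModel.predict returning (index, cost)."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    # The existing predict already computes these distances; the change is
    # simply to return the winning distance alongside the winning index.
    costs = [squared_distance(point, center) for center in centers]
    best = min(range(len(centers)), key=costs.__getitem__)
    return best, costs[best]

index, cost = predict_with_cost((1.0, 1.0), [(0.0, 0.0), (10.0, 10.0)])
```

Here the point (1, 1) maps to cluster 0 with a squared-distance cost of 2.0.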
[jira] [Updated] (SPARK-8551) Python example code for elastic net
[ https://issues.apache.org/jira/browse/SPARK-8551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8551: - Component/s: PySpark Priority: Minor (was: Major) > Python example code for elastic net > --- > > Key: SPARK-8551 > URL: https://issues.apache.org/jira/browse/SPARK-8551 > Project: Spark > Issue Type: New Feature > Components: PySpark >Reporter: Shuo Xiang >Priority: Minor >
[jira] [Updated] (SPARK-8585) Support LATERAL VIEW in Spark SQL parser
[ https://issues.apache.org/jira/browse/SPARK-8585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8585: - Component/s: SQL Priority: Minor (was: Major) (Components et al please: https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark) > Support LATERAL VIEW in Spark SQL parser > > > Key: SPARK-8585 > URL: https://issues.apache.org/jira/browse/SPARK-8585 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Konstantin Shaposhnikov >Priority: Minor > > It would be good to support LATERAL VIEW SQL syntax without the need to create a > HiveContext. > Docs: > https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView
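For readers unfamiliar with the syntax, `LATERAL VIEW explode(...)` joins each input row against the elements of one of its array columns, emitting one output row per element while repeating the other columns. The effect, sketched in plain Python on a toy dataset (illustration only; the actual feature is SQL parser support):

```python
# Each input row carries a scalar column and an array column, as in
# SELECT key, value FROM t LATERAL VIEW explode(values) v AS value.
rows = [("a", [1, 2]), ("b", [3])]

# explode(): one output row per array element, scalar columns repeated.
exploded = [(key, value) for key, values in rows for value in values]
```

`exploded` ends up as `[("a", 1), ("a", 2), ("b", 3)]`, the flattened relation the SQL would return.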
[jira] [Updated] (SPARK-8561) Drop table can only drop the tables under database "default"
[ https://issues.apache.org/jira/browse/SPARK-8561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8561: - Component/s: SQL > Drop table can only drop the tables under database "default" > > > Key: SPARK-8561 > URL: https://issues.apache.org/jira/browse/SPARK-8561 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: baishuo >
[jira] [Commented] (SPARK-8111) SparkR shell should display Spark logo and version banner on startup
[ https://issues.apache.org/jira/browse/SPARK-8111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598937#comment-14598937 ] Sean Owen commented on SPARK-8111: -- [~shivaram] done, though I also just made you a JIRA admin, so that you can add Contributors at https://issues.apache.org/jira/plugins/servlet/project-config/SPARK/roles (Just be aware you can now edit lots of things in JIRA, so be careful what you click!) > SparkR shell should display Spark logo and version banner on startup > > > Key: SPARK-8111 > URL: https://issues.apache.org/jira/browse/SPARK-8111 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 1.4.0 >Reporter: Matei Zaharia >Assignee: Alok Singh >Priority: Trivial > Labels: Starter >
[jira] [Resolved] (SPARK-8371) improve unit test for MaxOf and MinOf
[ https://issues.apache.org/jira/browse/SPARK-8371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-8371. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 6825 [https://github.com/apache/spark/pull/6825] > improve unit test for MaxOf and MinOf > - > > Key: SPARK-8371 > URL: https://issues.apache.org/jira/browse/SPARK-8371 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 1.5.0 > >
[jira] [Updated] (SPARK-8111) SparkR shell should display Spark logo and version banner on startup
[ https://issues.apache.org/jira/browse/SPARK-8111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8111: - Assignee: Alok Singh > SparkR shell should display Spark logo and version banner on startup > > > Key: SPARK-8111 > URL: https://issues.apache.org/jira/browse/SPARK-8111 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 1.4.0 >Reporter: Matei Zaharia >Assignee: Alok Singh >Priority: Trivial > Labels: Starter >
[jira] [Comment Edited] (SPARK-8072) Better AnalysisException for writing DataFrame with identically named columns
[ https://issues.apache.org/jira/browse/SPARK-8072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598890#comment-14598890 ] Animesh Baranawal edited comment on SPARK-8072 at 6/24/15 5:34 AM: --- [~rxin] If we want the rule to apply only on some save/output action, would it not be more intuitive to check the rule before calling the write function instead of adding the rule in CheckAnalysis.scala was (Author: animeshbaranawal): [~rxin] If we want the rule to apply only on some save/output action, would it not be more intuitive to add the rule in DataFrameWriter.scala instead of in CheckAnalysis.scala
[jira] [Comment Edited] (SPARK-8072) Better AnalysisException for writing DataFrame with identically named columns
[ https://issues.apache.org/jira/browse/SPARK-8072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598890#comment-14598890 ] Animesh Baranawal edited comment on SPARK-8072 at 6/24/15 5:29 AM: --- [~rxin] If we want the rule to apply only on some save/output action, would it not be more intuitive to add the rule in DataFrameWriter.scala instead of in CheckAnalysis.scala was (Author: animeshbaranawal): [rxin] If we want the rule to apply only on some save/output action, would it not be more intuitive to add the rule in DataFrameWriter.scala instead of in CheckAnalysis.scala
[jira] [Comment Edited] (SPARK-8072) Better AnalysisException for writing DataFrame with identically named columns
[ https://issues.apache.org/jira/browse/SPARK-8072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598890#comment-14598890 ] Animesh Baranawal edited comment on SPARK-8072 at 6/24/15 5:29 AM: --- [rxin] If we want the rule to apply only on some save/output action, would it not be more intuitive to add the rule in DataFrameWriter.scala instead of in CheckAnalysis.scala was (Author: animeshbaranawal): If we want the rule to apply only on some save/output action, would it not be more intuitive to add the rule in DataFrameWriter.scala instead of in CheckAnalysis.scala
[jira] [Commented] (SPARK-8072) Better AnalysisException for writing DataFrame with identically named columns
[ https://issues.apache.org/jira/browse/SPARK-8072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598890#comment-14598890 ] Animesh Baranawal commented on SPARK-8072: -- If we want the rule to apply only on some save/output action, would it not be more intuitive to add the rule in DataFrameWriter.scala instead of in CheckAnalysis.scala
[jira] [Updated] (SPARK-6749) Make metastore client robust to underlying socket connection loss
[ https://issues.apache.org/jira/browse/SPARK-6749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-6749: Assignee: Eric Liang > Make metastore client robust to underlying socket connection loss > - > > Key: SPARK-6749 > URL: https://issues.apache.org/jira/browse/SPARK-6749 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Yin Huai >Assignee: Eric Liang >Priority: Critical > Fix For: 1.5.0 > > > Right now, if metastore get restarted, we have to restart the driver to get a > new connection to the metastore client because the underlying socket > connection is gone. We should make metastore client robust to it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6749) Make metastore client robust to underlying socket connection loss
[ https://issues.apache.org/jira/browse/SPARK-6749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-6749. - Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 6912 [https://github.com/apache/spark/pull/6912] > Make metastore client robust to underlying socket connection loss > - > > Key: SPARK-6749 > URL: https://issues.apache.org/jira/browse/SPARK-6749 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Yin Huai >Priority: Critical > Fix For: 1.5.0 > > > Right now, if metastore get restarted, we have to restart the driver to get a > new connection to the metastore client because the underlying socket > connection is gone. We should make metastore client robust to it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
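Robustness to a dropped metastore socket is commonly achieved by wrapping each client call in reconnect-and-retry logic. A hedged, language-agnostic sketch in plain Python (the `call`/`reconnect` hooks are placeholders, not the actual Hive metastore client API):

```python
import time

def with_retry(call, reconnect, attempts=3, delay_secs=1.0):
    """Invoke call(); on a connection failure, run reconnect() and retry.

    Re-raises the last connection error once all attempts are exhausted.
    """
    for attempt in range(attempts):
        try:
            return call()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts; surface the failure
            reconnect()            # re-establish the underlying socket
            time.sleep(delay_secs)  # small backoff before retrying
```

The fix merged for this issue works along these lines: rather than forcing a driver restart, the client re-creates its connection when the old socket is gone.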
[jira] [Created] (SPARK-8587) Return cost and cluster index KMeansModel.predict
Sam Stoelinga created SPARK-8587: Summary: Return cost and cluster index KMeansModel.predict Key: SPARK-8587 URL: https://issues.apache.org/jira/browse/SPARK-8587 Project: Spark Issue Type: Improvement Reporter: Sam Stoelinga Priority: Minor Looking at the PySpark implementation of KMeansModel.predict: Currently: it calculates the cost of the closest cluster and returns the index only. My expectation: an easy way for the same function, or a new one, to return the cost along with the index. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8587) Return cost and cluster index KMeansModel.predict
[ https://issues.apache.org/jira/browse/SPARK-8587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sam Stoelinga updated SPARK-8587: - Description: Looking at PySpark the implementation of KMeansModel.predict https://github.com/apache/spark/blob/master/python/pyspark/mllib/clustering.py#L102: Currently: it calculates the cost of the closest cluster and returns the index only. My expectation: Easy way to let the same function or a new function to return the cost with the index. was: Looking at PySpark the implementation of KMeansModel.predict: Currently: it calculates the cost of the closest cluster and returns the index only. My expectation: Easy way to let the same function or a new function to return the cost with the index. > Return cost and cluster index KMeansModel.predict > - > > Key: SPARK-8587 > URL: https://issues.apache.org/jira/browse/SPARK-8587 > Project: Spark > Issue Type: Improvement >Reporter: Sam Stoelinga >Priority: Minor > > Looking at PySpark the implementation of KMeansModel.predict > https://github.com/apache/spark/blob/master/python/pyspark/mllib/clustering.py#L102: > > Currently: > it calculates the cost of the closest cluster and returns the index only. > My expectation: > Easy way to let the same function or a new function to return the cost with > the index. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8587) Return cost and cluster index KMeansModel.predict
[ https://issues.apache.org/jira/browse/SPARK-8587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sam Stoelinga updated SPARK-8587: - Description: Looking at PySpark the implementation of KMeansModel.predict https://github.com/apache/spark/blob/master/python/pyspark/mllib/clustering.py#L102 : Currently: it calculates the cost of the closest cluster and returns the index only. My expectation: Easy way to let the same function or a new function to return the cost with the index. was: Looking at PySpark the implementation of KMeansModel.predict https://github.com/apache/spark/blob/master/python/pyspark/mllib/clustering.py#L102: Currently: it calculates the cost of the closest cluster and returns the index only. My expectation: Easy way to let the same function or a new function to return the cost with the index. > Return cost and cluster index KMeansModel.predict > - > > Key: SPARK-8587 > URL: https://issues.apache.org/jira/browse/SPARK-8587 > Project: Spark > Issue Type: Improvement >Reporter: Sam Stoelinga >Priority: Minor > > Looking at PySpark the implementation of KMeansModel.predict > https://github.com/apache/spark/blob/master/python/pyspark/mllib/clustering.py#L102 > : > Currently: > it calculates the cost of the closest cluster and returns the index only. > My expectation: > Easy way to let the same function or a new function to return the cost with > the index. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
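The requested behavior amounts to returning the argmin together with its distance. A plain-Python sketch of what such a `predict` variant could compute (illustrative only; the function name and return shape are not part of the MLlib API):

```python
def predict_with_cost(point, centers):
    """Return (index of the closest center, squared distance to it).

    `point` is a sequence of floats; `centers` is a list of such sequences.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    costs = [sq_dist(point, c) for c in centers]
    best = min(range(len(centers)), key=costs.__getitem__)
    return best, costs[best]
```

The existing predict already computes the per-center distances to pick the index, so exposing the winning distance as well would add essentially no extra work.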
[jira] [Issue Comment Deleted] (SPARK-7137) Add checkInputColumn back to Params and print more info
[ https://issues.apache.org/jira/browse/SPARK-7137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rekha Joshi updated SPARK-7137: --- Comment: was deleted (was: Sorry [~gweidner] , [~josephkb] just saw it was unassigned when i created the patch.thanks) > Add checkInputColumn back to Params and print more info > --- > > Key: SPARK-7137 > URL: https://issues.apache.org/jira/browse/SPARK-7137 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.4.0 >Reporter: Joseph K. Bradley >Priority: Trivial > > In the PR for [https://issues.apache.org/jira/browse/SPARK-5957], > Params.checkInputColumn was moved to SchemaUtils and renamed to > checkColumnType. The downside is that it no longer has access to the > parameter info, so it cannot state which input column parameter was incorrect. > We should keep checkColumnType but also add checkInputColumn back to Params. > It should print out the parameter name and description. Internally, it may > call checkColumnType. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2645) Spark driver calls System.exit(50) after calling SparkContext.stop() the second time
[ https://issues.apache.org/jira/browse/SPARK-2645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598862#comment-14598862 ] Rekha Joshi commented on SPARK-2645: [~sowen] [~vokom] Hi. On a quick look, with my setup on the latest 1.5.0-SNAPSHOT, I believe this can still happen. If an Executor hits an unhandled exception, the system will exit with code 50 (SparkExitCode.UNCAUGHT_EXCEPTION) {code} //SparkUncaughtExceptionHandler// if (!Utils.inShutdown()) { if (exception.isInstanceOf[OutOfMemoryError]) { System.exit(SparkExitCode.OOM) } else { System.exit(SparkExitCode.UNCAUGHT_EXCEPTION) } } .. . private[spark] object SparkExitCode { /** The default uncaught exception handler was reached. */ val UNCAUGHT_EXCEPTION = 50 {code} The git patch defensively avoids a stop if one has already been done and/or handles the exception at SparkEnv. Please review. Thanks. > Spark driver calls System.exit(50) after calling SparkContext.stop() the > second time > - > > Key: SPARK-2645 > URL: https://issues.apache.org/jira/browse/SPARK-2645 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Vlad Komarov > > In some cases my application calls SparkContext.stop() after it has already > stopped and this leads to stopping JVM that runs spark driver. > E.g > This program should run forever > {code} > JavaSparkContext context = new JavaSparkContext("spark://12.34.21.44:7077", > "DummyApp"); > try { > JavaRDD rdd = context.parallelize(Arrays.asList(1, 2, > 3)); > rdd.count(); > } catch (Throwable e) { > e.printStackTrace(); > } > try { > context.cancelAllJobs(); > context.stop(); > //call stop second time > context.stop(); > } catch (Throwable e) { > e.printStackTrace(); > } > Thread.currentThread().join(); > {code} > but it finishes with exit code 50 after calling SparkContext.stop() the > second time. 
> Also it throws an exception like this > {code} > org.apache.spark.ServerStateException: Server is already stopped > at org.apache.spark.HttpServer.stop(HttpServer.scala:122) > ~[spark-core_2.10-1.0.0.jar:1.0.0] > at org.apache.spark.HttpFileServer.stop(HttpFileServer.scala:48) > ~[spark-core_2.10-1.0.0.jar:1.0.0] > at org.apache.spark.SparkEnv.stop(SparkEnv.scala:81) > ~[spark-core_2.10-1.0.0.jar:1.0.0] > at org.apache.spark.SparkContext.stop(SparkContext.scala:984) > ~[spark-core_2.10-1.0.0.jar:1.0.0] > at > org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend.dead(SparkDeploySchedulerBackend.scala:92) > ~[spark-core_2.10-1.0.0.jar:1.0.0] > at > org.apache.spark.deploy.client.AppClient$ClientActor.markDead(AppClient.scala:178) > ~[spark-core_2.10-1.0.0.jar:1.0.0] > at > org.apache.spark.deploy.client.AppClient$ClientActor$$anonfun$registerWithMaster$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AppClient.scala:96) > ~[spark-core_2.10-1.0.0.jar:1.0.0] > at org.apache.spark.util.Utils$.tryOrExit(Utils.scala:790) > ~[spark-core_2.10-1.0.0.jar:1.0.0] > at > org.apache.spark.deploy.client.AppClient$ClientActor$$anonfun$registerWithMaster$1.apply$mcV$sp(AppClient.scala:91) > [spark-core_2.10-1.0.0.jar:1.0.0] > at akka.actor.Scheduler$$anon$9.run(Scheduler.scala:80) > [akka-actor_2.10-2.2.3-shaded-protobuf.jar:na] > at > akka.actor.LightArrayRevolverScheduler$$anon$3$$anon$2.run(Scheduler.scala:241) > [akka-actor_2.10-2.2.3-shaded-protobuf.jar:na] > at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:42) > [akka-actor_2.10-2.2.3-shaded-protobuf.jar:na] > at > akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) > [akka-actor_2.10-2.2.3-shaded-protobuf.jar:na] > at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > [scala-library-2.10.4.jar:na] > at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > [scala-library-2.10.4.jar:na] > at > 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > [scala-library-2.10.4.jar:na] > at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > [scala-library-2.10.4.jar:na] > {code} > One remark is that this behavior is only reproducible when I call > SparkContext.cancellAllJobs() before calling SparkContext.stop() -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
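The defensive fix discussed above amounts to making stop() idempotent: remember whether shutdown already ran and turn a second call into a no-op instead of an exception that trips the uncaught-exception handler. A toy sketch in plain Python (not Spark's actual SparkContext code):

```python
import threading

class StoppableContext:
    """Toy context whose stop() is safe to call more than once."""

    def __init__(self):
        self._stopped = False
        self._lock = threading.Lock()

    def stop(self):
        """Shut down once; return True if this call did the shutdown."""
        with self._lock:
            if self._stopped:
                return False  # already stopped: harmless no-op, no exception
            self._stopped = True
            # ... release servers, schedulers, etc. exactly once here ...
            return True
```

With this guard, the second context.stop() in the reproduction above would simply return instead of raising ServerStateException and ultimately exiting with code 50.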
[jira] [Assigned] (SPARK-2645) Spark driver calls System.exit(50) after calling SparkContext.stop() the second time
[ https://issues.apache.org/jira/browse/SPARK-2645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-2645: --- Assignee: Apache Spark > Spark driver calls System.exit(50) after calling SparkContext.stop() the > second time > - > > Key: SPARK-2645 > URL: https://issues.apache.org/jira/browse/SPARK-2645 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Vlad Komarov >Assignee: Apache Spark > > In some cases my application calls SparkContext.stop() after it has already > stopped and this leads to stopping JVM that runs spark driver. > E.g > This program should run forever > {code} > JavaSparkContext context = new JavaSparkContext("spark://12.34.21.44:7077", > "DummyApp"); > try { > JavaRDD rdd = context.parallelize(Arrays.asList(1, 2, > 3)); > rdd.count(); > } catch (Throwable e) { > e.printStackTrace(); > } > try { > context.cancelAllJobs(); > context.stop(); > //call stop second time > context.stop(); > } catch (Throwable e) { > e.printStackTrace(); > } > Thread.currentThread().join(); > {code} > but it finishes with exit code 50 after calling SparkContext.stop() the > second time. 
> Also it throws an exception like this > {code} > org.apache.spark.ServerStateException: Server is already stopped > at org.apache.spark.HttpServer.stop(HttpServer.scala:122) > ~[spark-core_2.10-1.0.0.jar:1.0.0] > at org.apache.spark.HttpFileServer.stop(HttpFileServer.scala:48) > ~[spark-core_2.10-1.0.0.jar:1.0.0] > at org.apache.spark.SparkEnv.stop(SparkEnv.scala:81) > ~[spark-core_2.10-1.0.0.jar:1.0.0] > at org.apache.spark.SparkContext.stop(SparkContext.scala:984) > ~[spark-core_2.10-1.0.0.jar:1.0.0] > at > org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend.dead(SparkDeploySchedulerBackend.scala:92) > ~[spark-core_2.10-1.0.0.jar:1.0.0] > at > org.apache.spark.deploy.client.AppClient$ClientActor.markDead(AppClient.scala:178) > ~[spark-core_2.10-1.0.0.jar:1.0.0] > at > org.apache.spark.deploy.client.AppClient$ClientActor$$anonfun$registerWithMaster$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AppClient.scala:96) > ~[spark-core_2.10-1.0.0.jar:1.0.0] > at org.apache.spark.util.Utils$.tryOrExit(Utils.scala:790) > ~[spark-core_2.10-1.0.0.jar:1.0.0] > at > org.apache.spark.deploy.client.AppClient$ClientActor$$anonfun$registerWithMaster$1.apply$mcV$sp(AppClient.scala:91) > [spark-core_2.10-1.0.0.jar:1.0.0] > at akka.actor.Scheduler$$anon$9.run(Scheduler.scala:80) > [akka-actor_2.10-2.2.3-shaded-protobuf.jar:na] > at > akka.actor.LightArrayRevolverScheduler$$anon$3$$anon$2.run(Scheduler.scala:241) > [akka-actor_2.10-2.2.3-shaded-protobuf.jar:na] > at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:42) > [akka-actor_2.10-2.2.3-shaded-protobuf.jar:na] > at > akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) > [akka-actor_2.10-2.2.3-shaded-protobuf.jar:na] > at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > [scala-library-2.10.4.jar:na] > at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > [scala-library-2.10.4.jar:na] > at > 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > [scala-library-2.10.4.jar:na] > at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > [scala-library-2.10.4.jar:na] > {code} > One remark is that this behavior is only reproducible when I call > SparkContext.cancellAllJobs() before calling SparkContext.stop() -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2645) Spark driver calls System.exit(50) after calling SparkContext.stop() the second time
[ https://issues.apache.org/jira/browse/SPARK-2645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598859#comment-14598859 ] Apache Spark commented on SPARK-2645: - User 'rekhajoshm' has created a pull request for this issue: https://github.com/apache/spark/pull/6973 > Spark driver calls System.exit(50) after calling SparkContext.stop() the > second time > - > > Key: SPARK-2645 > URL: https://issues.apache.org/jira/browse/SPARK-2645 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Vlad Komarov > > In some cases my application calls SparkContext.stop() after it has already > stopped and this leads to stopping JVM that runs spark driver. > E.g > This program should run forever > {code} > JavaSparkContext context = new JavaSparkContext("spark://12.34.21.44:7077", > "DummyApp"); > try { > JavaRDD rdd = context.parallelize(Arrays.asList(1, 2, > 3)); > rdd.count(); > } catch (Throwable e) { > e.printStackTrace(); > } > try { > context.cancelAllJobs(); > context.stop(); > //call stop second time > context.stop(); > } catch (Throwable e) { > e.printStackTrace(); > } > Thread.currentThread().join(); > {code} > but it finishes with exit code 50 after calling SparkContext.stop() the > second time. 
> Also it throws an exception like this > {code} > org.apache.spark.ServerStateException: Server is already stopped > at org.apache.spark.HttpServer.stop(HttpServer.scala:122) > ~[spark-core_2.10-1.0.0.jar:1.0.0] > at org.apache.spark.HttpFileServer.stop(HttpFileServer.scala:48) > ~[spark-core_2.10-1.0.0.jar:1.0.0] > at org.apache.spark.SparkEnv.stop(SparkEnv.scala:81) > ~[spark-core_2.10-1.0.0.jar:1.0.0] > at org.apache.spark.SparkContext.stop(SparkContext.scala:984) > ~[spark-core_2.10-1.0.0.jar:1.0.0] > at > org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend.dead(SparkDeploySchedulerBackend.scala:92) > ~[spark-core_2.10-1.0.0.jar:1.0.0] > at > org.apache.spark.deploy.client.AppClient$ClientActor.markDead(AppClient.scala:178) > ~[spark-core_2.10-1.0.0.jar:1.0.0] > at > org.apache.spark.deploy.client.AppClient$ClientActor$$anonfun$registerWithMaster$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AppClient.scala:96) > ~[spark-core_2.10-1.0.0.jar:1.0.0] > at org.apache.spark.util.Utils$.tryOrExit(Utils.scala:790) > ~[spark-core_2.10-1.0.0.jar:1.0.0] > at > org.apache.spark.deploy.client.AppClient$ClientActor$$anonfun$registerWithMaster$1.apply$mcV$sp(AppClient.scala:91) > [spark-core_2.10-1.0.0.jar:1.0.0] > at akka.actor.Scheduler$$anon$9.run(Scheduler.scala:80) > [akka-actor_2.10-2.2.3-shaded-protobuf.jar:na] > at > akka.actor.LightArrayRevolverScheduler$$anon$3$$anon$2.run(Scheduler.scala:241) > [akka-actor_2.10-2.2.3-shaded-protobuf.jar:na] > at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:42) > [akka-actor_2.10-2.2.3-shaded-protobuf.jar:na] > at > akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) > [akka-actor_2.10-2.2.3-shaded-protobuf.jar:na] > at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > [scala-library-2.10.4.jar:na] > at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > [scala-library-2.10.4.jar:na] > at > 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > [scala-library-2.10.4.jar:na] > at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > [scala-library-2.10.4.jar:na] > {code} > One remark is that this behavior is only reproducible when I call > SparkContext.cancellAllJobs() before calling SparkContext.stop() -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-2645) Spark driver calls System.exit(50) after calling SparkContext.stop() the second time
[ https://issues.apache.org/jira/browse/SPARK-2645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-2645: --- Assignee: (was: Apache Spark) > Spark driver calls System.exit(50) after calling SparkContext.stop() the > second time > - > > Key: SPARK-2645 > URL: https://issues.apache.org/jira/browse/SPARK-2645 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Vlad Komarov > > In some cases my application calls SparkContext.stop() after it has already > stopped and this leads to stopping JVM that runs spark driver. > E.g > This program should run forever > {code} > JavaSparkContext context = new JavaSparkContext("spark://12.34.21.44:7077", > "DummyApp"); > try { > JavaRDD rdd = context.parallelize(Arrays.asList(1, 2, > 3)); > rdd.count(); > } catch (Throwable e) { > e.printStackTrace(); > } > try { > context.cancelAllJobs(); > context.stop(); > //call stop second time > context.stop(); > } catch (Throwable e) { > e.printStackTrace(); > } > Thread.currentThread().join(); > {code} > but it finishes with exit code 50 after calling SparkContext.stop() the > second time. 
> Also it throws an exception like this > {code} > org.apache.spark.ServerStateException: Server is already stopped > at org.apache.spark.HttpServer.stop(HttpServer.scala:122) > ~[spark-core_2.10-1.0.0.jar:1.0.0] > at org.apache.spark.HttpFileServer.stop(HttpFileServer.scala:48) > ~[spark-core_2.10-1.0.0.jar:1.0.0] > at org.apache.spark.SparkEnv.stop(SparkEnv.scala:81) > ~[spark-core_2.10-1.0.0.jar:1.0.0] > at org.apache.spark.SparkContext.stop(SparkContext.scala:984) > ~[spark-core_2.10-1.0.0.jar:1.0.0] > at > org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend.dead(SparkDeploySchedulerBackend.scala:92) > ~[spark-core_2.10-1.0.0.jar:1.0.0] > at > org.apache.spark.deploy.client.AppClient$ClientActor.markDead(AppClient.scala:178) > ~[spark-core_2.10-1.0.0.jar:1.0.0] > at > org.apache.spark.deploy.client.AppClient$ClientActor$$anonfun$registerWithMaster$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AppClient.scala:96) > ~[spark-core_2.10-1.0.0.jar:1.0.0] > at org.apache.spark.util.Utils$.tryOrExit(Utils.scala:790) > ~[spark-core_2.10-1.0.0.jar:1.0.0] > at > org.apache.spark.deploy.client.AppClient$ClientActor$$anonfun$registerWithMaster$1.apply$mcV$sp(AppClient.scala:91) > [spark-core_2.10-1.0.0.jar:1.0.0] > at akka.actor.Scheduler$$anon$9.run(Scheduler.scala:80) > [akka-actor_2.10-2.2.3-shaded-protobuf.jar:na] > at > akka.actor.LightArrayRevolverScheduler$$anon$3$$anon$2.run(Scheduler.scala:241) > [akka-actor_2.10-2.2.3-shaded-protobuf.jar:na] > at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:42) > [akka-actor_2.10-2.2.3-shaded-protobuf.jar:na] > at > akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) > [akka-actor_2.10-2.2.3-shaded-protobuf.jar:na] > at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > [scala-library-2.10.4.jar:na] > at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > [scala-library-2.10.4.jar:na] > at > 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > [scala-library-2.10.4.jar:na] > at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > [scala-library-2.10.4.jar:na] > {code} > One remark is that this behavior is only reproducible when I call > SparkContext.cancellAllJobs() before calling SparkContext.stop() -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8586) SQL add jar command does not work well with Scala REPL
Yin Huai created SPARK-8586: --- Summary: SQL add jar command does not work well with Scala REPL Key: SPARK-8586 URL: https://issues.apache.org/jira/browse/SPARK-8586 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Yin Huai Assignee: Yin Huai Priority: Critical Seems SparkIMain always resets the context class loader in {{loadAndRunReq}}. So, SerDe added through add jar command may not be loaded in the context class loader when we lookup the table. For example, the following code will fail when we try to show the table. {code} hive.sql("add jar sql/hive/src/test/resources/hive-hcatalog-core-0.13.1.jar") hive.sql("drop table if exists jsonTable") hive.sql("CREATE TABLE jsonTable(key int, val string) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'") hive.createDataFrame((1 to 100).map(i => (i, s"str$i"))).toDF("key", "val").insertInto("jsonTable") hive.table("jsonTable").show {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8585) Support LATERAL VIEW in Spark SQL parser
Konstantin Shaposhnikov created SPARK-8585: -- Summary: Support LATERAL VIEW in Spark SQL parser Key: SPARK-8585 URL: https://issues.apache.org/jira/browse/SPARK-8585 Project: Spark Issue Type: Improvement Reporter: Konstantin Shaposhnikov It would be good to support LATERAL VIEW SQL syntax without need to create HiveContext. Docs: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-5768) Spark UI Shows incorrect memory under Yarn
[ https://issues.apache.org/jira/browse/SPARK-5768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-5768: --- Assignee: Apache Spark > Spark UI Shows incorrect memory under Yarn > -- > > Key: SPARK-5768 > URL: https://issues.apache.org/jira/browse/SPARK-5768 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 1.2.0, 1.2.1 > Environment: Centos 6 >Reporter: Al M >Assignee: Apache Spark >Priority: Trivial > > I am running Spark on Yarn with 2 executors. The executors are running on > separate physical machines. > I have spark.executor.memory set to '40g'. This is because I want to have > 40g of memory used on each machine. I have one executor per machine. > When I run my application I see from 'top' that both my executors are using > the full 40g of memory I allocated to them. > The 'Executors' tab in the Spark UI shows something different. It shows the > memory used as a total of 20GB per executor e.g. x / 20.3GB. This makes it > look like I only have 20GB available per executor when really I have 40GB > available. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5768) Spark UI Shows incorrect memory under Yarn
[ https://issues.apache.org/jira/browse/SPARK-5768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598815#comment-14598815 ] Apache Spark commented on SPARK-5768: - User 'rekhajoshm' has created a pull request for this issue: https://github.com/apache/spark/pull/6972 > Spark UI Shows incorrect memory under Yarn > -- > > Key: SPARK-5768 > URL: https://issues.apache.org/jira/browse/SPARK-5768 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 1.2.0, 1.2.1 > Environment: Centos 6 >Reporter: Al M >Priority: Trivial > > I am running Spark on Yarn with 2 executors. The executors are running on > separate physical machines. > I have spark.executor.memory set to '40g'. This is because I want to have > 40g of memory used on each machine. I have one executor per machine. > When I run my application I see from 'top' that both my executors are using > the full 40g of memory I allocated to them. > The 'Executors' tab in the Spark UI shows something different. It shows the > memory used as a total of 20GB per executor e.g. x / 20.3GB. This makes it > look like I only have 20GB available per executor when really I have 40GB > available. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-5768) Spark UI Shows incorrect memory under Yarn
[ https://issues.apache.org/jira/browse/SPARK-5768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-5768: --- Assignee: (was: Apache Spark) > Spark UI Shows incorrect memory under Yarn > -- > > Key: SPARK-5768 > URL: https://issues.apache.org/jira/browse/SPARK-5768 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 1.2.0, 1.2.1 > Environment: Centos 6 >Reporter: Al M >Priority: Trivial > > I am running Spark on Yarn with 2 executors. The executors are running on > separate physical machines. > I have spark.executor.memory set to '40g'. This is because I want to have > 40g of memory used on each machine. I have one executor per machine. > When I run my application I see from 'top' that both my executors are using > the full 40g of memory I allocated to them. > The 'Executors' tab in the Spark UI shows something different. It shows the > memory used as a total of 20GB per executor e.g. x / 20.3GB. This makes it > look like I only have 20GB available per executor when really I have 40GB > available. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
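The "x / 20.3GB" figure is consistent with the UI reporting only the storage portion of the heap rather than the full 40g. Assuming the 1.2.x defaults of spark.storage.memoryFraction=0.6 and a safety fraction of 0.9 (an assumption here; check the configuration docs for your version), the arithmetic looks roughly like:

```python
def storage_memory_gb(executor_memory_gb,
                      memory_fraction=0.6, safety_fraction=0.9):
    """Approximate storage memory a 1.2.x-era UI reports per executor.

    The UI shows heap * memoryFraction * safetyFraction, not the full heap.
    """
    return executor_memory_gb * memory_fraction * safety_fraction
```

40 GB of executor memory then yields about 21.6 GB of storage memory; Runtime.maxMemory() itself reports somewhat less than -Xmx, which plausibly accounts for the remaining gap down to the observed 20.3 GB.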
[jira] [Commented] (SPARK-8233) misc function: hash
[ https://issues.apache.org/jira/browse/SPARK-8233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598797#comment-14598797 ] Apache Spark commented on SPARK-8233: - User 'qiansl127' has created a pull request for this issue: https://github.com/apache/spark/pull/6971 > misc function: hash > --- > > Key: SPARK-8233 > URL: https://issues.apache.org/jira/browse/SPARK-8233 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > > hash(a1[, a2...]): int > Returns a hash value of the arguments. See Hive's implementation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8233) misc function: hash
[ https://issues.apache.org/jira/browse/SPARK-8233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8233: --- Assignee: (was: Apache Spark) > misc function: hash > --- > > Key: SPARK-8233 > URL: https://issues.apache.org/jira/browse/SPARK-8233 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > > hash(a1[, a2...]): int > Returns a hash value of the arguments. See Hive's implementation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8233) misc function: hash
[ https://issues.apache.org/jira/browse/SPARK-8233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8233: --- Assignee: Apache Spark > misc function: hash > --- > > Key: SPARK-8233 > URL: https://issues.apache.org/jira/browse/SPARK-8233 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > > hash(a1[, a2...]): int > Returns a hash value of the arguments. See Hive's implementation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
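Hive combines the hashes of multiple arguments with the familiar 31-multiplier recurrence (as in java.util.Arrays.hashCode). A plain-Python sketch of that recurrence with Java int wrap-around semantics (illustrative only; Hive's per-element hashing of complex types is more involved):

```python
def hive_style_hash(*args):
    """Fold element hashes as result = result * 31 + h, as a signed 32-bit int."""
    result = 0
    for a in args:
        result = (result * 31 + hash(a)) & 0xFFFFFFFF  # wrap like a Java int
    # reinterpret the unsigned 32-bit value as signed, like Java's int
    return result - 0x100000000 if result >= 0x80000000 else result
```

For small integers Python's hash(n) == n, so hive_style_hash(1, 2) folds to 1*31 + 2.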
[jira] [Closed] (SPARK-8031) Version number written to Hive metastore is "0.13.1aa" instead of "0.13.1a"
[ https://issues.apache.org/jira/browse/SPARK-8031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rekha Joshi closed SPARK-8031. -- Resolution: Implemented > Version number written to Hive metastore is "0.13.1aa" instead of "0.13.1a" > --- > > Key: SPARK-8031 > URL: https://issues.apache.org/jira/browse/SPARK-8031 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.0, 1.2.1, 1.2.2, 1.3.0, 1.3.1, 1.4.0 >Reporter: Cheng Lian >Priority: Trivial > Fix For: 1.5.0 > > > While debugging {{CliSuite}} for 1.4.0-SNAPSHOT, noticed the following WARN > log line: > {noformat} > 15/06/02 13:40:29 WARN ObjectStore: Version information not found in > metastore. hive.metastore.schema.verification is not enabled so recording the > schema version 0.13.1aa > {noformat} > The problem is that, the version of Hive dependencies 1.4.0-SNAPSHOT uses is > {{0.13.1a}} (the one shaded by [~pwendell]), but the version showed in this > line is {{0.13.1aa}} (one more {{a}}). The WARN log itself is OK since > {{CliSuite}} initializes a brand new temporary Derby metastore. > While initializing Hive metastore, Hive calls {{ObjectStore.checkSchema()}} > and may write the "short" version string to metastore. This short version > string is defined by {{hive.version.shortname}} in the POM. However, [it was > defined as > {{0.13.1aa}}|https://github.com/pwendell/hive/commit/32e515907f0005c7a28ee388eadd1c94cf99b2d4#diff-600376dffeb79835ede4a0b285078036R62]. > Confirmed with [~pwendell] that it should be a typo. > This doesn't cause any trouble for now, but we probably want to fix this in > the future if we ever need to release another shaded version of Hive 0.13.1. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8031) Version number written to Hive metastore is "0.13.1aa" instead of "0.13.1a"
[ https://issues.apache.org/jira/browse/SPARK-8031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rekha Joshi updated SPARK-8031: --- Fix Version/s: 1.5.0 Hi. This issue is not present in 1.5.0-SNAPSHOT, where hive.version is correctly set to 0.13.1a and hive.version.short to 0.13.1. Thanks > Version number written to Hive metastore is "0.13.1aa" instead of "0.13.1a" > --- > > Key: SPARK-8031 > URL: https://issues.apache.org/jira/browse/SPARK-8031 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.0, 1.2.1, 1.2.2, 1.3.0, 1.3.1, 1.4.0 >Reporter: Cheng Lian >Priority: Trivial > Fix For: 1.5.0 > > > While debugging {{CliSuite}} for 1.4.0-SNAPSHOT, I noticed the following WARN > log line: > {noformat} > 15/06/02 13:40:29 WARN ObjectStore: Version information not found in > metastore. hive.metastore.schema.verification is not enabled so recording the > schema version 0.13.1aa > {noformat} > The problem is that the version of the Hive dependencies used by 1.4.0-SNAPSHOT is > {{0.13.1a}} (the one shaded by [~pwendell]), but the version shown in this > line is {{0.13.1aa}} (one more {{a}}). The WARN log itself is OK since > {{CliSuite}} initializes a brand new temporary Derby metastore. > While initializing the Hive metastore, Hive calls {{ObjectStore.checkSchema()}} > and may write the "short" version string to the metastore. This short version > string is defined by {{hive.version.shortname}} in the POM. However, [it was > defined as > {{0.13.1aa}}|https://github.com/pwendell/hive/commit/32e515907f0005c7a28ee388eadd1c94cf99b2d4#diff-600376dffeb79835ede4a0b285078036R62]. > Confirmed with [~pwendell] that it should be a typo. > This doesn't cause any trouble for now, but we probably want to fix this in > the future if we ever need to release another shaded version of Hive 0.13.1. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8236) misc function: crc32
[ https://issues.apache.org/jira/browse/SPARK-8236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598780#comment-14598780 ] Apache Spark commented on SPARK-8236: - User 'qiansl127' has created a pull request for this issue: https://github.com/apache/spark/pull/6970 > misc function: crc32 > > > Key: SPARK-8236 > URL: https://issues.apache.org/jira/browse/SPARK-8236 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > > crc32(string/binary): bigint > Computes a cyclic redundancy check value for string or binary argument and > returns bigint value (as of Hive 1.3.0). Example: crc32('ABC') = 2743272264. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
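Since CRC-32 is standardized, the documented semantics can be checked against any stock implementation. A minimal sketch using Python's stdlib zlib (the helper name is ours; Spark's actual expression code is Scala):

```python
import zlib

def crc32_sql(data):
    """crc32(string/binary): bigint. The mask keeps the result an unsigned
    32-bit value regardless of platform or Python version, matching the
    non-negative bigint the Hive docs describe."""
    if isinstance(data, str):
        data = data.encode("utf-8")
    return zlib.crc32(data) & 0xFFFFFFFF
```

This reproduces the documented example: crc32('ABC') = 2743272264.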
[jira] [Assigned] (SPARK-8236) misc function: crc32
[ https://issues.apache.org/jira/browse/SPARK-8236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8236: --- Assignee: (was: Apache Spark) > misc function: crc32 > > > Key: SPARK-8236 > URL: https://issues.apache.org/jira/browse/SPARK-8236 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > > crc32(string/binary): bigint > Computes a cyclic redundancy check value for string or binary argument and > returns bigint value (as of Hive 1.3.0). Example: crc32('ABC') = 2743272264. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8235) misc function: sha1 / sha
[ https://issues.apache.org/jira/browse/SPARK-8235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598779#comment-14598779 ] Apache Spark commented on SPARK-8235: - User 'qiansl127' has created a pull request for this issue: https://github.com/apache/spark/pull/6970 > misc function: sha1 / sha > - > > Key: SPARK-8235 > URL: https://issues.apache.org/jira/browse/SPARK-8235 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > > sha1(string/binary): string > sha(string/binary): string > Calculates the SHA-1 digest for string or binary and returns the value as a > hex string (as of Hive 1.3.0). Example: sha1('ABC') = > '3c01bdbb26f358bab27f267924aa2c9a03fcfdb8'. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
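The semantics above map directly onto a standard SHA-1 digest rendered as hex. A quick sketch with Python's hashlib (helper name is ours):

```python
import hashlib

def sha1_sql(data):
    """sha1(string/binary): string -- the SHA-1 digest as a 40-character
    lowercase hex string, as in the Hive example."""
    if isinstance(data, str):
        data = data.encode("utf-8")
    return hashlib.sha1(data).hexdigest()
```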
[jira] [Assigned] (SPARK-8236) misc function: crc32
[ https://issues.apache.org/jira/browse/SPARK-8236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8236: --- Assignee: Apache Spark > misc function: crc32 > > > Key: SPARK-8236 > URL: https://issues.apache.org/jira/browse/SPARK-8236 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > > crc32(string/binary): bigint > Computes a cyclic redundancy check value for string or binary argument and > returns bigint value (as of Hive 1.3.0). Example: crc32('ABC') = 2743272264. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8578) Should ignore user defined output committer when appending data
[ https://issues.apache.org/jira/browse/SPARK-8578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598774#comment-14598774 ] Apache Spark commented on SPARK-8578: - User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/6966 > Should ignore user defined output committer when appending data > --- > > Key: SPARK-8578 > URL: https://issues.apache.org/jira/browse/SPARK-8578 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.0 >Reporter: Cheng Lian >Assignee: Yin Huai > > When appending data to a file system via the Hadoop API, it's safer to ignore > user-defined output committer classes like {{DirectParquetOutputCommitter}}, > because it's relatively hard to handle task failures in this case. For > example, {{DirectParquetOutputCommitter}} writes directly to the output > directory to boost write performance when working with S3. However, there's > no general way to determine the output file path of a specific task in the > Hadoop API, thus we don't know how to revert a failed append job. (When doing > an overwrite, we can just remove the whole output directory.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
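The rationale above boils down to a mode check when picking the committer class. A toy sketch of that decision (names are illustrative, not Spark's actual API):

```python
def select_committer(is_append, user_committer, default_committer):
    """On append, ignore any user-configured (possibly direct) committer and
    fall back to the default: a direct committer writes straight into the
    destination directory, so a failed append cannot be rolled back, whereas
    a failed overwrite can always be reverted by deleting the output dir."""
    if is_append:
        return default_committer
    return user_committer if user_committer is not None else default_committer
```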
[jira] [Assigned] (SPARK-8578) Should ignore user defined output committer when appending data
[ https://issues.apache.org/jira/browse/SPARK-8578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8578: --- Assignee: Yin Huai (was: Apache Spark) > Should ignore user defined output committer when appending data > --- > > Key: SPARK-8578 > URL: https://issues.apache.org/jira/browse/SPARK-8578 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.0 >Reporter: Cheng Lian >Assignee: Yin Huai > > When appending data to a file system via the Hadoop API, it's safer to ignore > user-defined output committer classes like {{DirectParquetOutputCommitter}}, > because it's relatively hard to handle task failures in this case. For > example, {{DirectParquetOutputCommitter}} writes directly to the output > directory to boost write performance when working with S3. However, there's > no general way to determine the output file path of a specific task in the > Hadoop API, thus we don't know how to revert a failed append job. (When doing > an overwrite, we can just remove the whole output directory.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8578) Should ignore user defined output committer when appending data
[ https://issues.apache.org/jira/browse/SPARK-8578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598773#comment-14598773 ] Apache Spark commented on SPARK-8578: - User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/6964 > Should ignore user defined output committer when appending data > --- > > Key: SPARK-8578 > URL: https://issues.apache.org/jira/browse/SPARK-8578 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.0 >Reporter: Cheng Lian >Assignee: Yin Huai > > When appending data to a file system via the Hadoop API, it's safer to ignore > user-defined output committer classes like {{DirectParquetOutputCommitter}}, > because it's relatively hard to handle task failures in this case. For > example, {{DirectParquetOutputCommitter}} writes directly to the output > directory to boost write performance when working with S3. However, there's > no general way to determine the output file path of a specific task in the > Hadoop API, thus we don't know how to revert a failed append job. (When doing > an overwrite, we can just remove the whole output directory.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8578) Should ignore user defined output committer when appending data
[ https://issues.apache.org/jira/browse/SPARK-8578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8578: --- Assignee: Apache Spark (was: Yin Huai) > Should ignore user defined output committer when appending data > --- > > Key: SPARK-8578 > URL: https://issues.apache.org/jira/browse/SPARK-8578 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.0 >Reporter: Cheng Lian >Assignee: Apache Spark > > When appending data to a file system via the Hadoop API, it's safer to ignore > user-defined output committer classes like {{DirectParquetOutputCommitter}}, > because it's relatively hard to handle task failures in this case. For > example, {{DirectParquetOutputCommitter}} writes directly to the output > directory to boost write performance when working with S3. However, there's > no general way to determine the output file path of a specific task in the > Hadoop API, thus we don't know how to revert a failed append job. (When doing > an overwrite, we can just remove the whole output directory.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8393) JavaStreamingContext#awaitTermination() throws non-declared InterruptedException
[ https://issues.apache.org/jira/browse/SPARK-8393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8393: --- Assignee: Apache Spark > JavaStreamingContext#awaitTermination() throws non-declared > InterruptedException > > > Key: SPARK-8393 > URL: https://issues.apache.org/jira/browse/SPARK-8393 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.3.1 >Reporter: Jaromir Vanek >Assignee: Apache Spark >Priority: Trivial > > A call to {{JavaStreamingContext#awaitTermination()}} can throw > {{InterruptedException}}, which cannot be caught easily in Java because it is > not declared with a {{@throws(classOf[InterruptedException])}} annotation. > This {{InterruptedException}} comes originally from {{ContextWaiter}}, where > a Java {{ReentrantLock}} is used. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8393) JavaStreamingContext#awaitTermination() throws non-declared InterruptedException
[ https://issues.apache.org/jira/browse/SPARK-8393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8393: --- Assignee: (was: Apache Spark) > JavaStreamingContext#awaitTermination() throws non-declared > InterruptedException > > > Key: SPARK-8393 > URL: https://issues.apache.org/jira/browse/SPARK-8393 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.3.1 >Reporter: Jaromir Vanek >Priority: Trivial > > A call to {{JavaStreamingContext#awaitTermination()}} can throw > {{InterruptedException}}, which cannot be caught easily in Java because it is > not declared with a {{@throws(classOf[InterruptedException])}} annotation. > This {{InterruptedException}} comes originally from {{ContextWaiter}}, where > a Java {{ReentrantLock}} is used. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8393) JavaStreamingContext#awaitTermination() throws non-declared InterruptedException
[ https://issues.apache.org/jira/browse/SPARK-8393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598769#comment-14598769 ] Apache Spark commented on SPARK-8393: - User 'rekhajoshm' has created a pull request for this issue: https://github.com/apache/spark/pull/6969 > JavaStreamingContext#awaitTermination() throws non-declared > InterruptedException > > > Key: SPARK-8393 > URL: https://issues.apache.org/jira/browse/SPARK-8393 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.3.1 >Reporter: Jaromir Vanek >Priority: Trivial > > A call to {{JavaStreamingContext#awaitTermination()}} can throw > {{InterruptedException}}, which cannot be caught easily in Java because it is > not declared with a {{@throws(classOf[InterruptedException])}} annotation. > This {{InterruptedException}} comes originally from {{ContextWaiter}}, where > a Java {{ReentrantLock}} is used. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-8553) Resuming Checkpointed QueueStream Fails
[ https://issues.apache.org/jira/browse/SPARK-8553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das closed SPARK-8553. Resolution: Won't Fix > Resuming Checkpointed QueueStream Fails > --- > > Key: SPARK-8553 > URL: https://issues.apache.org/jira/browse/SPARK-8553 > Project: Spark > Issue Type: Bug > Components: PySpark, Streaming >Affects Versions: 1.4.0 >Reporter: Shaanan Cohney > > After using a QueueStream within a checkpointed StreamingContext, when the > context is resumed the following error is triggered: > {code} > 15/06/23 02:33:09 WARN QueueInputDStream: isTimeValid called with > 1434987594000 ms where as last valid time is 1434987678000 ms > 15/06/23 02:33:09 ERROR StreamingContext: Error starting the context, marking > it as stopped > org.apache.spark.SparkException: RDD transformations and actions can only be > invoked by the driver, not inside of other transformations; for example, > rdd1.map(x => rdd2.values.count() * x) is invalid because the values > transformation and count action cannot be performed inside of the rdd1.map > transformation. For more information, see SPARK-5063. 
> at org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$sc(RDD.scala:87) > at org.apache.spark.rdd.RDD.persist(RDD.scala:162) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$apply$8.apply(DStream.scala:357) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$apply$8.apply(DStream.scala:354) > at scala.Option.foreach(Option.scala:236) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:354) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:342) > at scala.Option.orElse(Option.scala:257) > at > org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:339) > at > org.apache.spark.streaming.api.python.PythonTransformedDStream.compute(PythonDStream.scala:195) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:350) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:350) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:349) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:349) > at > org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:399) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:344) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:342) > at scala.Option.orElse(Option.scala:257) > at > org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:339) > at > org.apache.spark.streaming.api.python.PythonStateDStream.compute(PythonDStream.scala:242) > at > 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:350) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:350) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:349) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:349) > at > org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:399) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:344) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:342) > at scala.Option.orElse(Option.scala:257) > at > org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:339) > at > org.apache.spark.streaming.api.python.PythonStateDStream.compute(PythonDStream.scala:241) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:350) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:350) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:349) > at > org.apache.spark.streaming.dstream.DStream$$anon
[jira] [Commented] (SPARK-8553) Resuming Checkpointed QueueStream Fails
[ https://issues.apache.org/jira/browse/SPARK-8553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598725#comment-14598725 ] Tathagata Das commented on SPARK-8553: -- Yes, this is not a supported feature, and it's pretty hard to support. Recovering a streaming context requires recovering all the data the context needs in order to recover. Since arbitrary RDDs get added to a queueStream, there is no way to recover the data of those RDDs, so this is not a feature that we will support. Yes, we should document this for queueStream. I am marking this JIRA as Won't Fix > Resuming Checkpointed QueueStream Fails > --- > > Key: SPARK-8553 > URL: https://issues.apache.org/jira/browse/SPARK-8553 > Project: Spark > Issue Type: Bug > Components: PySpark, Streaming >Affects Versions: 1.4.0 >Reporter: Shaanan Cohney > > After using a QueueStream within a checkpointed StreamingContext, when the > context is resumed the following error is triggered: > {code} > 15/06/23 02:33:09 WARN QueueInputDStream: isTimeValid called with > 1434987594000 ms where as last valid time is 1434987678000 ms > 15/06/23 02:33:09 ERROR StreamingContext: Error starting the context, marking > it as stopped > org.apache.spark.SparkException: RDD transformations and actions can only be > invoked by the driver, not inside of other transformations; for example, > rdd1.map(x => rdd2.values.count() * x) is invalid because the values > transformation and count action cannot be performed inside of the rdd1.map > transformation. For more information, see SPARK-5063. 
> at org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$sc(RDD.scala:87) > at org.apache.spark.rdd.RDD.persist(RDD.scala:162) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$apply$8.apply(DStream.scala:357) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$apply$8.apply(DStream.scala:354) > at scala.Option.foreach(Option.scala:236) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:354) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:342) > at scala.Option.orElse(Option.scala:257) > at > org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:339) > at > org.apache.spark.streaming.api.python.PythonTransformedDStream.compute(PythonDStream.scala:195) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:350) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:350) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:349) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:349) > at > org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:399) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:344) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:342) > at scala.Option.orElse(Option.scala:257) > at > org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:339) > at > org.apache.spark.streaming.api.python.PythonStateDStream.compute(PythonDStream.scala:242) > at > 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:350) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:350) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:349) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:349) > at > org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:399) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:344) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:342) > at scala.Option.orElse(Option.scala:257) > at > org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:339) > at > org.apache.spark.streaming.api.python.PythonStateDStream.compute(PythonDStream.scala:241) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1
[jira] [Comment Edited] (SPARK-1503) Implement Nesterov's accelerated first-order method
[ https://issues.apache.org/jira/browse/SPARK-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598709#comment-14598709 ] Aaron Staple edited comment on SPARK-1503 at 6/24/15 1:46 AM: -- I believe this stopping criterion was added after the paper was written. It is documented on page 8 of the userguide (https://github.com/cvxr/TFOCS/raw/master/userguide.pdf) but unfortunately no explanation is provided. (The userguide also documents this as a <= test, while the current code uses <.) And unfortunately I couldn’t find an explanation in the code or git history. I think the switch to absolute tolerance may be because a relative difference measurement could be less useful when the weights are extremely small, and 1 is a convenient cutoff point. (Using 1, the equation is simple and the interpretation is clear.) I believe [~mengxr] alluded to switching to an absolute tolerance at 1 already (https://github.com/apache/spark/pull/3636#discussion_r22078041) so he might be able to provide more information. With regard to using the new weight norms as the basis for measuring relative weight difference, I think that if the convergence test passes using either the old or new weight norms, then the old and new norms are going to be very similar. It may not make a significant difference which test is used. (It may also be worth pointing out that in cases where the tolerance tests with respect to different old/new weights return different results, if the tolerance wrt new weights is met (and wrt old weights is not) then the weight norm increased slightly; if the tolerance wrt the old weights is met (and wrt new weights not) then the weight norm decreased slightly.) Finally, TFOCS adopts a policy of skipping the convergence test on the first iteration if the weights are unchanged. I believe this condition is based on implementation-specific behavior and does not need to be adopted generally. 
was (Author: staple): I believe this stopping criteria was added after the paper was written. It is documented on page 8 of the userguide (https://github.com/cvxr/TFOCS/raw/master/userguide.pdf) but unfortunately no explanation is provided. (The userguide also documents this as a <= test, while the current code uses <.) And unfortunately I couldn’t find an explanation in the code or git history. I think the switch to absolute tolerance may be because a relative difference measurement could be less useful when the weights are extremely small, and 1 is a convenient cutoff point. (Using 1, the equation is simple and the interpretation is clear.) I believe [~mengxr] alluded to switching to an absolute tolerance at 1 already (https://github.com/apache/spark/pull/3636#discussion_r22078041) so he might be able to provide more information. With regard to using the new weight norms as the basis for measuring relative weight difference, I think that if the convergence test passes using either the old or new weight norms, then the old and new norms are going to be very similar. It may not make a significant difference which test is used. (It may also be worth pointing out that in cases where the tolerance tests with respect to different old/new weights return different results, if the tolerance wrt new weights is met (and wrt old weights is not) then the weight norm increased slightly; if the tolerance wrt the old weights is met (and wrt new weights not) then we weight norm decreased slightly.) Finally, TFOCS adopts a policy of skipping the convergence test after the first iteration if the weights are unchanged. I believe this condition is based on implementation specific behavior and does not need to be adopted generally. 
> Implement Nesterov's accelerated first-order method > --- > > Key: SPARK-1503 > URL: https://issues.apache.org/jira/browse/SPARK-1503 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Xiangrui Meng >Assignee: Aaron Staple > Attachments: linear.png, linear_l1.png, logistic.png, logistic_l2.png > > > Nesterov's accelerated first-order method is a drop-in replacement for > steepest descent but it converges much faster. We should implement this > method and compare its performance with existing algorithms, including SGD > and L-BFGS. > TFOCS (http://cvxr.com/tfocs/) is a reference implementation of Nesterov's > method and its variants on composite objectives. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1503) Implement Nesterov's accelerated first-order method
[ https://issues.apache.org/jira/browse/SPARK-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598709#comment-14598709 ] Aaron Staple commented on SPARK-1503: - I believe this stopping criteria was added after the paper was written. It is documented on page 8 of the userguide (https://github.com/cvxr/TFOCS/raw/master/userguide.pdf) but unfortunately no explanation is provided. (The userguide also documents this as a <= test, while the current code uses <.) And unfortunately I couldn’t find an explanation in the code or git history. I think the switch to absolute tolerance may be because a relative difference measurement could be less useful when the weights are extremely small, and 1 is a convenient cutoff point. (Using 1, the equation is simple and the interpretation is clear.) I believe [~mengxr] alluded to switching to an absolute tolerance at 1 already (https://github.com/apache/spark/pull/3636#discussion_r22078041) so he might be able to provide more information. With regard to using the new weight norms as the basis for measuring relative weight difference, I think that if the convergence test passes using either the old or new weight norms, then the old and new norms are going to be very similar. It may not make a significant difference which test is used. (It may also be worth pointing out that in cases where the tolerance tests with respect to different old/new weights return different results, if the tolerance wrt new weights is met (and wrt old weights is not) then the weight norm increased slightly; if the tolerance wrt the old weights is met (and wrt new weights not) then we weight norm decreased slightly.) Finally, TFOCS adopts a policy of skipping the convergence test after the first iteration if the weights are unchanged. I believe this condition is based on implementation specific behavior and does not need to be adopted generally. 
> Implement Nesterov's accelerated first-order method > --- > > Key: SPARK-1503 > URL: https://issues.apache.org/jira/browse/SPARK-1503 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Xiangrui Meng >Assignee: Aaron Staple > Attachments: linear.png, linear_l1.png, logistic.png, logistic_l2.png > > > Nesterov's accelerated first-order method is a drop-in replacement for > steepest descent but it converges much faster. We should implement this > method and compare its performance with existing algorithms, including SGD > and L-BFGS. > TFOCS (http://cvxr.com/tfocs/) is a reference implementation of Nesterov's > method and its variants on composite objectives. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
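The stopping test discussed in the comments above is easy to state concretely. Below is a sketch of the TFOCS-style criterion as described there (relative difference against the previous iterate's norm, switching to an absolute tolerance once that norm is at most 1, with the strict "<" the current code uses); function names are ours, not TFOCS's or MLlib's:

```python
import math

def l2(v):
    """Euclidean norm of a vector (any iterable of floats)."""
    return math.sqrt(sum(x * x for x in v))

def converged(w_old, w_new, tol):
    """norm(w_new - w_old) < tol * max(1, norm(w_old)): a relative tolerance
    for large weight vectors, degrading to an absolute tolerance (cutoff
    at norm 1) when the weights are small."""
    diff = l2(a - b for a, b in zip(w_new, w_old))
    return diff < tol * max(1.0, l2(w_old))
```

Using the new weights' norm instead of the old one in the max() would, as the comment argues, rarely change the outcome, since whenever either test passes the two norms are nearly equal.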
[jira] [Assigned] (SPARK-8581) Simplify and clean up the checkpointing code
[ https://issues.apache.org/jira/browse/SPARK-8581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8581: --- Assignee: Andrew Or (was: Apache Spark) > Simplify and clean up the checkpointing code > > > Key: SPARK-8581 > URL: https://issues.apache.org/jira/browse/SPARK-8581 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Minor > > It is an old piece of code and a little overly complex at the moment. We can > rewrite this to improve the readability and preserve exactly the same > semantics. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8581) Simplify and clean up the checkpointing code
[ https://issues.apache.org/jira/browse/SPARK-8581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598706#comment-14598706 ] Apache Spark commented on SPARK-8581: - User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/6968 > Simplify and clean up the checkpointing code > > > Key: SPARK-8581 > URL: https://issues.apache.org/jira/browse/SPARK-8581 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Minor > > It is an old piece of code and a little overly complex at the moment. We can > rewrite this to improve the readability and preserve exactly the same > semantics.
[jira] [Assigned] (SPARK-8584) Better exception message if invalid checkpoint dir is specified
[ https://issues.apache.org/jira/browse/SPARK-8584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8584: --- Assignee: Apache Spark (was: Andrew Or) > Better exception message if invalid checkpoint dir is specified > --- > > Key: SPARK-8584 > URL: https://issues.apache.org/jira/browse/SPARK-8584 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Andrew Or >Assignee: Apache Spark > > If we're running Spark on a cluster, the checkpoint dir must be a non-local > path. Otherwise, the attempt to read from a checkpoint will fail because the > checkpoint files are written on the executors, not on the driver. > Currently, the error message that you get looks something like the following, > which is not super intuitive: > {code} > Checkpoint RDD 3 (0) has different number of partitions than original RDD 2 > (100) > {code}
[jira] [Assigned] (SPARK-8584) Better exception message if invalid checkpoint dir is specified
[ https://issues.apache.org/jira/browse/SPARK-8584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8584: --- Assignee: Andrew Or (was: Apache Spark) > Better exception message if invalid checkpoint dir is specified > --- > > Key: SPARK-8584 > URL: https://issues.apache.org/jira/browse/SPARK-8584 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Andrew Or >Assignee: Andrew Or > > If we're running Spark on a cluster, the checkpoint dir must be a non-local > path. Otherwise, the attempt to read from a checkpoint will fail because the > checkpoint files are written on the executors, not on the driver. > Currently, the error message that you get looks something like the following, > which is not super intuitive: > {code} > Checkpoint RDD 3 (0) has different number of partitions than original RDD 2 > (100) > {code}
[jira] [Assigned] (SPARK-8581) Simplify and clean up the checkpointing code
[ https://issues.apache.org/jira/browse/SPARK-8581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8581: --- Assignee: Apache Spark (was: Andrew Or) > Simplify and clean up the checkpointing code > > > Key: SPARK-8581 > URL: https://issues.apache.org/jira/browse/SPARK-8581 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Andrew Or >Assignee: Apache Spark >Priority: Minor > > It is an old piece of code and a little overly complex at the moment. We can > rewrite this to improve the readability and preserve exactly the same > semantics.
[jira] [Commented] (SPARK-8584) Better exception message if invalid checkpoint dir is specified
[ https://issues.apache.org/jira/browse/SPARK-8584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598707#comment-14598707 ] Apache Spark commented on SPARK-8584: - User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/6968 > Better exception message if invalid checkpoint dir is specified > --- > > Key: SPARK-8584 > URL: https://issues.apache.org/jira/browse/SPARK-8584 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Andrew Or >Assignee: Andrew Or > > If we're running Spark on a cluster, the checkpoint dir must be a non-local > path. Otherwise, the attempt to read from a checkpoint will fail because the > checkpoint files are written on the executors, not on the driver. > Currently, the error message that you get looks something like the following, > which is not super intuitive: > {code} > Checkpoint RDD 3 (0) has different number of partitions than original RDD 2 > (100) > {code}
[jira] [Commented] (SPARK-8583) Refactor python/run-tests to integrate with dev/run-test's module system
[ https://issues.apache.org/jira/browse/SPARK-8583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598702#comment-14598702 ] Apache Spark commented on SPARK-8583: - User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/6967 > Refactor python/run-tests to integrate with dev/run-test's module system > > > Key: SPARK-8583 > URL: https://issues.apache.org/jira/browse/SPARK-8583 > Project: Spark > Issue Type: Improvement > Components: Build, Project Infra, PySpark >Reporter: Josh Rosen >Assignee: Josh Rosen > > We should refactor the {{python/run-tests}} script to be written in Python > and integrate with the recent {{dev/run-tests}} module system so that we can > more granularly skip Python tests in the pull request builder.
[jira] [Assigned] (SPARK-8583) Refactor python/run-tests to integrate with dev/run-test's module system
[ https://issues.apache.org/jira/browse/SPARK-8583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8583: --- Assignee: Josh Rosen (was: Apache Spark) > Refactor python/run-tests to integrate with dev/run-test's module system > > > Key: SPARK-8583 > URL: https://issues.apache.org/jira/browse/SPARK-8583 > Project: Spark > Issue Type: Improvement > Components: Build, Project Infra, PySpark >Reporter: Josh Rosen >Assignee: Josh Rosen > > We should refactor the {{python/run-tests}} script to be written in Python > and integrate with the recent {{dev/run-tests}} module system so that we can > more granularly skip Python tests in the pull request builder.
[jira] [Assigned] (SPARK-8583) Refactor python/run-tests to integrate with dev/run-test's module system
[ https://issues.apache.org/jira/browse/SPARK-8583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8583: --- Assignee: Apache Spark (was: Josh Rosen) > Refactor python/run-tests to integrate with dev/run-test's module system > > > Key: SPARK-8583 > URL: https://issues.apache.org/jira/browse/SPARK-8583 > Project: Spark > Issue Type: Improvement > Components: Build, Project Infra, PySpark >Reporter: Josh Rosen >Assignee: Apache Spark > > We should refactor the {{python/run-tests}} script to be written in Python > and integrate with the recent {{dev/run-tests}} module system so that we can > more granularly skip Python tests in the pull request builder.
[jira] [Created] (SPARK-8584) Better exception message if invalid checkpoint dir is specified
Andrew Or created SPARK-8584: Summary: Better exception message if invalid checkpoint dir is specified Key: SPARK-8584 URL: https://issues.apache.org/jira/browse/SPARK-8584 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Andrew Or Assignee: Andrew Or If we're running Spark on a cluster, the checkpoint dir must be a non-local path. Otherwise, the attempt to read from a checkpoint will fail because the checkpoint files are written on the executors, not on the driver. Currently, the error message that you get looks something like the following, which is not super intuitive: {code} Checkpoint RDD 3 (0) has different number of partitions than original RDD 2 (100) {code}
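One way to produce the clearer message this issue asks for is to validate the checkpoint directory up front, before any job runs. The sketch below is a hypothetical illustration in Python (the real check would live in Spark core, in Scala); `validate_checkpoint_dir` and its arguments are invented names.

```python
from urllib.parse import urlparse

def validate_checkpoint_dir(checkpoint_dir, master):
    """Fail fast with an explicit message instead of the confusing
    partition-count mismatch raised later at checkpoint-read time."""
    # A path with no scheme (or a file:// scheme) lives on the local filesystem.
    scheme = urlparse(checkpoint_dir).scheme
    is_local_dir = scheme in ("", "file")
    # "local", "local[4]", etc. run everything in one JVM, so a local dir is fine.
    is_cluster = not master.startswith("local")
    if is_cluster and is_local_dir:
        raise ValueError(
            "Checkpoint directory %r is on the local filesystem, but this "
            "application runs on a cluster (master=%r). Checkpoint files are "
            "written by the executors, not the driver, so the directory must "
            "be on a shared filesystem such as HDFS." % (checkpoint_dir, master))
```

The point of the design is that the error names the actual mistake (local dir on a cluster) at the moment the user makes it, rather than surfacing as a partition-count mismatch much later.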
[jira] [Created] (SPARK-8583) Refactor python/run-tests to integrate with dev/run-test's module system
Josh Rosen created SPARK-8583: - Summary: Refactor python/run-tests to integrate with dev/run-test's module system Key: SPARK-8583 URL: https://issues.apache.org/jira/browse/SPARK-8583 Project: Spark Issue Type: Improvement Components: Build, Project Infra, PySpark Reporter: Josh Rosen Assignee: Josh Rosen We should refactor the {{python/run-tests}} script to be written in Python and integrate with the recent {{dev/run-tests}} module system so that we can more granularly skip Python tests in the pull request builder.
[jira] [Updated] (SPARK-8582) Optimize checkpointing to avoid computing an RDD twice
[ https://issues.apache.org/jira/browse/SPARK-8582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-8582: - Description: In Spark, checkpointing allows the user to truncate the lineage of his RDD and save the intermediate contents to HDFS for fault tolerance. However, this is not currently implemented super efficiently: Every time we checkpoint an RDD, we actually compute it twice: once during the action that triggered the checkpointing in the first place, and once while we checkpoint (we iterate through an RDD's partitions and write them to disk). See this line for more detail: https://github.com/apache/spark/blob/0401cbaa8ee51c71f43604f338b65022a479da0a/core/src/main/scala/org/apache/spark/rdd/RDDCheckpointData.scala#L102. Instead, we should have a `CheckpointingIterator` that writes checkpoint data to HDFS while we run the action. This will speed up many usages of `RDD#checkpoint` by 2X. (Alternatively, the user can just cache the RDD before checkpointing it, but this is not always viable for very large input data. It's also not a great API to use in general.) was: In Spark, checkpointing allows the user to truncate the lineage of his RDD and save the intermediate contents to HDFS for fault tolerance. However, this is not currently implemented super efficiently: Every time we checkpoint an RDD, we actually compute it twice: once during the action that triggered the checkpointing in the first place, and once while we checkpoint (we iterate through an RDD's partitions and write them to disk). Instead, we should have a `CheckpointingIterator` that writes checkpoint data to HDFS while we run the action. This will speed up many usages of `RDD#checkpoint` by 2X. (Alternatively, the user can just cache the RDD before checkpointing it, but this is not always viable for very large input data. It's also not a great API to use in general.) 
> Optimize checkpointing to avoid computing an RDD twice > -- > > Key: SPARK-8582 > URL: https://issues.apache.org/jira/browse/SPARK-8582 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Andrew Or > > In Spark, checkpointing allows the user to truncate the lineage of his RDD > and save the intermediate contents to HDFS for fault tolerance. However, this > is not currently implemented super efficiently: > Every time we checkpoint an RDD, we actually compute it twice: once during > the action that triggered the checkpointing in the first place, and once > while we checkpoint (we iterate through an RDD's partitions and write them to > disk). See this line for more detail: > https://github.com/apache/spark/blob/0401cbaa8ee51c71f43604f338b65022a479da0a/core/src/main/scala/org/apache/spark/rdd/RDDCheckpointData.scala#L102. > Instead, we should have a `CheckpointingIterator` that writes checkpoint > data to HDFS while we run the action. This will speed up many usages of > `RDD#checkpoint` by 2X. > (Alternatively, the user can just cache the RDD before checkpointing it, but > this is not always viable for very large input data. It's also not a great > API to use in general.)
[jira] [Commented] (SPARK-8582) Optimize checkpointing to avoid computing an RDD twice
[ https://issues.apache.org/jira/browse/SPARK-8582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598692#comment-14598692 ] Andrew Or commented on SPARK-8582: -- [~tdas] also wants this. > Optimize checkpointing to avoid computing an RDD twice > -- > > Key: SPARK-8582 > URL: https://issues.apache.org/jira/browse/SPARK-8582 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Andrew Or > > In Spark, checkpointing allows the user to truncate the lineage of his RDD > and save the intermediate contents to HDFS for fault tolerance. However, this > is not currently implemented super efficiently: > Every time we checkpoint an RDD, we actually compute it twice: once during > the action that triggered the checkpointing in the first place, and once > while we checkpoint (we iterate through an RDD's partitions and write them to > disk). See this line for more detail: > https://github.com/apache/spark/blob/0401cbaa8ee51c71f43604f338b65022a479da0a/core/src/main/scala/org/apache/spark/rdd/RDDCheckpointData.scala#L102. > Instead, we should have a `CheckpointingIterator` that writes checkpoint > data to HDFS while we run the action. This will speed up many usages of > `RDD#checkpoint` by 2X. > (Alternatively, the user can just cache the RDD before checkpointing it, but > this is not always viable for very large input data. It's also not a great > API to use in general.)
[jira] [Created] (SPARK-8582) Optimize checkpointing to avoid computing an RDD twice
Andrew Or created SPARK-8582: Summary: Optimize checkpointing to avoid computing an RDD twice Key: SPARK-8582 URL: https://issues.apache.org/jira/browse/SPARK-8582 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Andrew Or In Spark, checkpointing allows the user to truncate the lineage of his RDD and save the intermediate contents to HDFS for fault tolerance. However, this is not currently implemented super efficiently: Every time we checkpoint an RDD, we actually compute it twice: once during the action that triggered the checkpointing in the first place, and once while we checkpoint (we iterate through an RDD's partitions and write them to disk). Instead, we should have a `CheckpointingIterator` that writes checkpoint data to HDFS while we run the action. This will speed up many usages of `RDD#checkpoint` by 2X. (Alternatively, the user can just cache the RDD before checkpointing it, but this is not always viable for very large input data. It's also not a great API to use in general.)
[jira] [Commented] (SPARK-8337) KafkaUtils.createDirectStream for python is lacking API/feature parity with the Scala/Java version
[ https://issues.apache.org/jira/browse/SPARK-8337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598681#comment-14598681 ] Saisai Shao commented on SPARK-8337: Hi [~juanrh], will you also address the {{OffsetRange}} problem described in SPARK-8389? > KafkaUtils.createDirectStream for python is lacking API/feature parity with > the Scala/Java version > -- > > Key: SPARK-8337 > URL: https://issues.apache.org/jira/browse/SPARK-8337 > Project: Spark > Issue Type: Bug > Components: PySpark, Streaming >Affects Versions: 1.4.0 >Reporter: Amit Ramesh >Priority: Critical > > See the following thread for context. > http://apache-spark-developers-list.1001551.n3.nabble.com/Re-Spark-1-4-Python-API-for-getting-Kafka-offsets-in-direct-mode-tt12714.html
[jira] [Updated] (SPARK-8187) date/time function: date_sub
[ https://issues.apache.org/jira/browse/SPARK-8187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-8187: -- Shepherd: Davies Liu > date/time function: date_sub > > > Key: SPARK-8187 > URL: https://issues.apache.org/jira/browse/SPARK-8187 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Adrian Wang > > date_sub(string startdate, int days): string > date_sub(date startdate, int days): date > Subtracts a number of days from startdate: date_sub('2008-12-31', 1) = > '2008-12-30'.
[jira] [Updated] (SPARK-8186) date/time function: date_add
[ https://issues.apache.org/jira/browse/SPARK-8186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-8186: -- Shepherd: Davies Liu > date/time function: date_add > > > Key: SPARK-8186 > URL: https://issues.apache.org/jira/browse/SPARK-8186 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Adrian Wang > > date_add(string startdate, int days): string > date_add(date startdate, int days): date > Adds a number of days to startdate: date_add('2008-12-31', 1) = '2009-01-01'.
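The date_add/date_sub semantics quoted in these two sub-tasks (both string and date overloads) can be illustrated with the standard library. This is only a reference sketch of the documented behavior, not the Spark SQL implementation.

```python
from datetime import date, timedelta

def date_add(startdate, days):
    """date_add('2008-12-31', 1) -> '2009-01-01'; a date input returns a date."""
    d = date.fromisoformat(startdate) if isinstance(startdate, str) else startdate
    result = d + timedelta(days=days)
    # Mirror the overloads above: string in, string out; date in, date out.
    return result.isoformat() if isinstance(startdate, str) else result

def date_sub(startdate, days):
    """date_sub('2008-12-31', 1) -> '2008-12-30'."""
    return date_add(startdate, -days)
```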
[jira] [Updated] (SPARK-8075) apply type checking interface to more expressions
[ https://issues.apache.org/jira/browse/SPARK-8075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-8075: --- Shepherd: Michael Armbrust > apply type checking interface to more expressions > - > > Key: SPARK-8075 > URL: https://issues.apache.org/jira/browse/SPARK-8075 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > > As https://github.com/apache/spark/pull/6405 has been merged, we need to > apply the type checking interface to more expressions, and finally remove the > default implementation of it in Expression.
[jira] [Created] (SPARK-8581) Simplify and clean up the checkpointing code
Andrew Or created SPARK-8581: Summary: Simplify and clean up the checkpointing code Key: SPARK-8581 URL: https://issues.apache.org/jira/browse/SPARK-8581 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Minor It is an old piece of code and a little overly complex at the moment. We can rewrite this to improve the readability and preserve exactly the same semantics.
[jira] [Commented] (SPARK-7157) Add approximate stratified sampling to DataFrame
[ https://issues.apache.org/jira/browse/SPARK-7157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598658#comment-14598658 ] Reynold Xin commented on SPARK-7157: I'm keeping this open still for the Java API. > Add approximate stratified sampling to DataFrame > > > Key: SPARK-7157 > URL: https://issues.apache.org/jira/browse/SPARK-7157 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Joseph K. Bradley >Assignee: Xiangrui Meng >Priority: Minor >
[jira] [Updated] (SPARK-7157) Add approximate stratified sampling to DataFrame
[ https://issues.apache.org/jira/browse/SPARK-7157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-7157: --- Description: (was: def sampleBy(c) > Add approximate stratified sampling to DataFrame > > > Key: SPARK-7157 > URL: https://issues.apache.org/jira/browse/SPARK-7157 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Joseph K. Bradley >Assignee: Xiangrui Meng >Priority: Minor >
[jira] [Commented] (SPARK-6666) org.apache.spark.sql.jdbc.JDBCRDD does not escape/quote column names
[ https://issues.apache.org/jira/browse/SPARK-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598655#comment-14598655 ] Justin McCarthy commented on SPARK-: Lack of quoting is also leading to errors where a column name collides with a SQL reserved word. There are some very common and useful words that are frequently used as column identifiers: http://www.postgresql.org/docs/9.0/static/sql-keywords-appendix.html Here's SQL-99's take on quoted identifiers: http://savage.net.au/SQL/sql-99.bnf.html#delimited%20identifier The fix could be as simple as: {code:title=JdbcRDD.scala} private val columnList: String = if (columns.length == 0) "1" else "\"" + columns.mkString("\",\"") + "\"" {code} > org.apache.spark.sql.jdbc.JDBCRDD does not escape/quote column names > - > > Key: SPARK- > URL: https://issues.apache.org/jira/browse/SPARK- > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0 > Environment: >Reporter: John Ferguson >Priority: Critical > > Is there a way to have JDBC DataFrames use quoted/escaped column names? > Right now, it looks like it "sees" the names correctly in the schema created > but does not escape them in the SQL it creates when they are not compliant: > org.apache.spark.sql.jdbc.JDBCRDD > > private val columnList: String = { > val sb = new StringBuilder() > columns.foreach(x => sb.append(",").append(x)) > if (sb.length == 0) "1" else sb.substring(1) > } > If you see value in this, I would take a shot at adding the quoting > (escaping) of column names here. If you don't do it, some drivers... like > postgresql's will simply fold all names to lower case when parsing the query. As you > can see in the TL;DR below that means they won't match the schema I am given. 
> TL;DR: > > I am able to connect to a Postgres database in the shell (with driver > referenced): >val jdbcDf = > sqlContext.jdbc("jdbc:postgresql://localhost/sparkdemo?user=dbuser", "sp500") > In fact when I run: >jdbcDf.registerTempTable("sp500") >val avgEPSNamed = sqlContext.sql("SELECT AVG(`Earnings/Share`) as AvgCPI > FROM sp500") > and >val avgEPSProg = jsonDf.agg(avg(jsonDf.col("Earnings/Share"))) > The values come back as expected. However, if I try: >jdbcDf.show > Or if I try > >val all = sqlContext.sql("SELECT * FROM sp500") >all.show > I get errors about column names not being found. In fact the error includes > a mention of column names all lower cased. For now I will change my schema > to be more restrictive. Right now it is, per a Stack Overflow poster, not > ANSI compliant by doing things that are allowed by ""'s in pgsql, MySQL and > SQLServer. BTW, our users are giving us tables like this... because various > tools they already use support non-compliant names. In fact, this is mild > compared to what we've had to support. > Currently the schema in question uses mixed case, quoted names with special > characters and spaces: > CREATE TABLE sp500 > ( > "Symbol" text, > "Name" text, > "Sector" text, > "Price" double precision, > "Dividend Yield" double precision, > "Price/Earnings" double precision, > "Earnings/Share" double precision, > "Book Value" double precision, > "52 week low" double precision, > "52 week high" double precision, > "Market Cap" double precision, > "EBITDA" double precision, > "Price/Sales" double precision, > "Price/Book" double precision, > "SEC Filings" text > )
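The fix suggested in the comment above amounts to SQL-99 delimited-identifier quoting. Here is a sketch of the idea in Python (the actual JDBCRDD code is Scala); it also doubles embedded quotes, which the one-line Scala fix does not handle but which the delimited-identifier rules require. The function names are illustrative, not Spark API.

```python
def quote_identifier(name):
    """Wrap a column name in double quotes, doubling any embedded double
    quotes per the SQL-99 delimited-identifier rules. Quoting also stops
    drivers such as PostgreSQL's from case-folding the name."""
    return '"' + name.replace('"', '""') + '"'

def column_list(columns):
    """Build the SELECT list; '1' when no columns are requested, matching
    the existing JDBCRDD behavior."""
    if not columns:
        return "1"
    return ",".join(quote_identifier(c) for c in columns)
```

With the sp500 schema below, this yields a SELECT list like `"Symbol","52 week low"` instead of the bare names that PostgreSQL would fold to lower case.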
[jira] [Updated] (SPARK-8580) Add Parquet files generated by different systems to test interoperability and compatibility
[ https://issues.apache.org/jira/browse/SPARK-8580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-8580: -- Issue Type: Sub-task (was: Test) Parent: SPARK-5463 > Add Parquet files generated by different systems to test interoperability and > compatibility > --- > > Key: SPARK-8580 > URL: https://issues.apache.org/jira/browse/SPARK-8580 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.5.0 >Reporter: Cheng Lian >Assignee: Cheng Lian > > As we are implementing Parquet backwards-compatibility rules for Spark 1.5.0 > to improve interoperability with other systems (reading non-standard Parquet > files they generate, and generating standard Parquet files), it would be good > to have a set of standard test Parquet files generated by various > systems/tools (parquet-thrift, parquet-avro, parquet-hive, Impala, and old > versions of Spark SQL) to ensure compatibility.
[jira] [Created] (SPARK-8580) Add Parquet files generated by different systems to test interoperability and compatibility
Cheng Lian created SPARK-8580: - Summary: Add Parquet files generated by different systems to test interoperability and compatibility Key: SPARK-8580 URL: https://issues.apache.org/jira/browse/SPARK-8580 Project: Spark Issue Type: Test Components: SQL Affects Versions: 1.5.0 Reporter: Cheng Lian Assignee: Cheng Lian As we are implementing Parquet backwards-compatibility rules for Spark 1.5.0 to improve interoperability with other systems (reading non-standard Parquet files they generate, and generating standard Parquet files), it would be good to have a set of standard test Parquet files generated by various systems/tools (parquet-thrift, parquet-avro, parquet-hive, Impala, and old versions of Spark SQL) to ensure compatibility.
[jira] [Resolved] (SPARK-8139) Documents data sources and Parquet output committer related options
[ https://issues.apache.org/jira/browse/SPARK-8139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-8139. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 6683 [https://github.com/apache/spark/pull/6683] > Documents data sources and Parquet output committer related options > --- > > Key: SPARK-8139 > URL: https://issues.apache.org/jira/browse/SPARK-8139 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.3.0, 1.3.1, 1.4.0 >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Minor > Fix For: 1.5.0 > > > Should document the following two options: > - {{spark.sql.sources.outputCommitterClass}} > - {{spark.sql.parquet.output.committer.class}}
[jira] [Commented] (SPARK-7810) rdd.py "_load_from_socket" cannot load data from jvm socket if ipv6 is used
[ https://issues.apache.org/jira/browse/SPARK-7810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598624#comment-14598624 ] Ai He commented on SPARK-7810: -- Hi, I encountered this problem one month ago and no longer have the stack trace. At the time, I looked at the port the JVM was listening on and found that only the IPv6 protocol was supported. That's why I'd like to make this improvement. For the last question, I don't quite understand what "the tree" refers to. > rdd.py "_load_from_socket" cannot load data from jvm socket if ipv6 is used > --- > > Key: SPARK-7810 > URL: https://issues.apache.org/jira/browse/SPARK-7810 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 1.3.1 >Reporter: Ai He > > Method "_load_from_socket" in rdd.py cannot load data from jvm socket if ipv6 > is used. The current method only works well with ipv4. New modification > should work around both two protocols. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (SPARK-7810) rdd.py "_load_from_socket" cannot load data from jvm socket if ipv6 is used
[ https://issues.apache.org/jira/browse/SPARK-7810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598610#comment-14598610 ] Davies Liu commented on SPARK-7810: --- What does the stack trace look like? Does the host only have IPv6? There are multiple places that don't take IPv6 into account; you can grep for `127.0.0.1` or `localhost` in the tree. Could you also fix them together? > rdd.py "_load_from_socket" cannot load data from jvm socket if ipv6 is used > --- > > Key: SPARK-7810 > URL: https://issues.apache.org/jira/browse/SPARK-7810 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 1.3.1 >Reporter: Ai He > > Method "_load_from_socket" in rdd.py cannot load data from jvm socket if ipv6 > is used. The current method only works well with ipv4. New modification > should work around both two protocols.
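On the client side, one protocol-agnostic option is `socket.create_connection`, which walks every `getaddrinfo` result in turn, so it connects whether the JVM bound an IPv4 or an IPv6 loopback socket. A hedged sketch of how `_load_from_socket` could open its connection (the helper name is invented and serializer handling is omitted):

```python
import socket

def connect_to_local_port(port, timeout=3.0):
    """Connect to a JVM-owned port on the loopback interface.

    create_connection() iterates over every address getaddrinfo() returns
    for 'localhost' (typically 127.0.0.1 and ::1), trying each family until
    one succeeds, so IPv4-only and IPv6-only JVM sockets both work.
    """
    return socket.create_connection(("localhost", port), timeout=timeout)
```

Compared with constructing a `socket.socket(socket.AF_INET, ...)` directly, this removes the hard-coded address family that breaks on IPv6-only hosts.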
[jira] [Updated] (SPARK-8445) MLlib 1.5 Roadmap
[ https://issues.apache.org/jira/browse/SPARK-8445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-8445: - Description: We expect to see many MLlib contributors for the 1.5 release. To scale out the development, we created this master list for MLlib features we plan to have in Spark 1.5. Due to limited review bandwidth, features appearing on this list will get higher priority for code review. But feel free to suggest new items to the list in comments. We are experimenting with this process. Your feedback would be greatly appreciated. h1. Instructions h2. For contributors: * Please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark carefully. Code style, documentation, and unit tests are important. * If you are a first-time Spark contributor, please always start with a starter task (TODO: add a link) rather than a medium/big feature. Based on our experience, mixing the development process with a big feature usually causes long delays in code review. * Never work silently. Let everyone know on the corresponding JIRA page when you start working on some features. This is to avoid duplicate work. For small features, you don't need to wait to get the JIRA assigned. * For medium/big features or features with dependencies, please get assigned first before coding and keep the ETA updated on the JIRA. If there is no activity on the JIRA page for a certain amount of time, the JIRA should be released for other contributors. * Do not claim multiple (>3) JIRAs at the same time. Try to finish them one after another. * Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code review greatly helps improve others' code as well as yours. h2. For committers: * Try to break down big features into small and specific JIRA tasks and link them properly. * Add "starter" label to starter tasks. * Put a rough estimate for medium/big features and track the progress. h1. Roadmap (WIP) h2. 
Algorithms and performance * LDA improvements (SPARK-5572) * Log-linear model for survival analysis (SPARK-8518) * Improve GLM's scalability on number of features (SPARK-8520) * Tree and ensembles: Move + cleanup code (SPARK-7131), provide class probabilities (SPARK-3727) * Improve GMM scalability and stability (SPARK-7206) * Frequent itemsets improvements (SPARK-7211) h2. Pipeline API * more feature transformers (SPARK-8521) * k-means (SPARK-7898) * naive Bayes h2. Model persistence * more PMML export (SPARK-8545) * model save/load (SPARK-4587) * pipeline persistence (SPARK-6725) h2. Python API for ML * List of issues identified during Spark 1.4 QA: (SPARK-7536) h2. SparkR API for ML h2. Documentation * [Search for documentation improvements | https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(Documentation)%20AND%20component%20in%20(ML%2C%20MLlib)] was: We expect to see many MLlib contributors for the 1.5 release. To scale out the development, we created this master list for MLlib features we plan to have in Spark 1.5. Due to limited review bandwidth, features appearing on this list will get higher priority for code review. But feel free to suggest new items to the list in comments. We are experimenting with this process. Your feedback would be greatly appreciated. h1. Instructions h2. For contributors: * Please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark carefully. Code style, documentation, and unit tests are important. * If you are a first-time Spark contributor, please always start with a starter task (TODO: add a link) rather than a medium/big feature. Based on our experience, mixing the development process with a big feature usually causes long delay in code review. * Never work silently. Let everyone know on the corresponding JIRA page when you start working on some features. This is to avoid duplicate work. 
For small features, you don't need to wait to get JIRA assigned. * For medium/big features or features with dependencies, please get assigned first before coding and keep the ETA updated on the JIRA. If there exist no activity on the JIRA page for a certain amount of time, the JIRA should be released for other contributors. * Do not claim multiple (>3) JIRAs at the same time. Try to finish them one after another. * Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code review greatly helps improve others' code as well as yours. h2. For committers: * Try to break down big features into small and specific JIRA tasks and link them properly. * Add "starter" label to starter tasks. * Put a rough estimate for medium/big features and track the progress. h1. Roadmap (WIP) h2. Algorithms and performance * LDA improvements (SPARK-5572) * Log-linear model for survival analysis (SPARK-8518) * Im
[jira] [Updated] (SPARK-8445) MLlib 1.5 Roadmap
[ https://issues.apache.org/jira/browse/SPARK-8445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-8445: - Description: We expect to see many MLlib contributors for the 1.5 release. To scale out the development, we created this master list for MLlib features we plan to have in Spark 1.5. Due to limited review bandwidth, features appearing on this list will get higher priority for code review. But feel free to suggest new items to the list in comments. We are experimenting with this process. Your feedback would be greatly appreciated. h1. Instructions h2. For contributors: * Please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark carefully. Code style, documentation, and unit tests are important. * If you are a first-time Spark contributor, please always start with a starter task (TODO: add a link) rather than a medium/big feature. Based on our experience, mixing the development process with a big feature usually causes long delay in code review. * Never work silently. Let everyone know on the corresponding JIRA page when you start working on some features. This is to avoid duplicate work. For small features, you don't need to wait to get JIRA assigned. * For medium/big features or features with dependencies, please get assigned first before coding and keep the ETA updated on the JIRA. If there exist no activity on the JIRA page for a certain amount of time, the JIRA should be released for other contributors. * Do not claim multiple (>3) JIRAs at the same time. Try to finish them one after another. * Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code review greatly helps improve others' code as well as yours. h2. For committers: * Try to break down big features into small and specific JIRA tasks and link them properly. * Add "starter" label to starter tasks. * Put a rough estimate for medium/big features and track the progress. h1. Roadmap (WIP) h2. 
Algorithms and performance * LDA improvements (SPARK-5572) * Log-linear model for survival analysis (SPARK-8518) * Improve GLM's scalability on number of features (SPARK-8520) * Tree and ensembles: Move + cleanup code (SPARK-7131), provide class probabilities (SPARK-3727) * Improve GMM scalability and stability (SPARK-7206) * Frequent itemsets improvements (SPARK-7211) h2. Pipeline API * more feature transformers (SPARK-8521) * k-means (SPARK-7898) * naive Bayes h2. Model persistence * more PMML export (SPARK-8545) * model save/load (SPARK-4587) * pipeline persistence (SPARK-6725) h2. Python API for ML h2. SparkR API for ML h2. Documentation * [Search for documentation improvements | https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(Documentation)%20AND%20component%20in%20(ML%2C%20MLlib)] was: We expect to see many MLlib contributors for the 1.5 release. To scale out the development, we created this master list for MLlib features we plan to have in Spark 1.5. Due to limited review bandwidth, features appearing on this list will get higher priority for code review. But feel free to suggest new items to the list in comments. We are experimenting with this process. Your feedback would be greatly appreciated. h1. Instructions h2. For contributors: * Please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark carefully. Code style, documentation, and unit tests are important. * If you are a first-time Spark contributor, please always start with a starter task (TODO: add a link) rather than a medium/big feature. Based on our experience, mixing the development process with a big feature usually causes long delay in code review. * Never work silently. Let everyone know on the corresponding JIRA page when you start working on some features. This is to avoid duplicate work. For small features, you don't need to wait to get JIRA assigned. 
* For medium/big features or features with dependencies, please get assigned first before coding and keep the ETA updated on the JIRA. If there exist no activity on the JIRA page for a certain amount of time, the JIRA should be released for other contributors. * Do not claim multiple (>3) JIRAs at the same time. Try to finish them one after another. * Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code review greatly helps improve others' code as well as yours. h2. For committers: * Try to break down big features into small and specific JIRA tasks and link them properly. * Add "starter" label to starter tasks. * Put a rough estimate for medium/big features and track the progress. h1. Roadmap (WIP) h2. Algorithms and performance * LDA improvements (SPARK-5572) * Log-linear model for survival analysis (SPARK-8518) * Improve GLM's scalability on number of features (SPARK-8520) * Tr
[jira] [Updated] (SPARK-8445) MLlib 1.5 Roadmap
[ https://issues.apache.org/jira/browse/SPARK-8445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-8445: - Description: We expect to see many MLlib contributors for the 1.5 release. To scale out the development, we created this master list for MLlib features we plan to have in Spark 1.5. Due to limited review bandwidth, features appearing on this list will get higher priority for code review. But feel free to suggest new items to the list in comments. We are experimenting with this process. Your feedback would be greatly appreciated. h1. Instructions h2. For contributors: * Please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark carefully. Code style, documentation, and unit tests are important. * If you are a first-time Spark contributor, please always start with a starter task (TODO: add a link) rather than a medium/big feature. Based on our experience, mixing the development process with a big feature usually causes long delay in code review. * Never work silently. Let everyone know on the corresponding JIRA page when you start working on some features. This is to avoid duplicate work. For small features, you don't need to wait to get JIRA assigned. * For medium/big features or features with dependencies, please get assigned first before coding and keep the ETA updated on the JIRA. If there exist no activity on the JIRA page for a certain amount of time, the JIRA should be released for other contributors. * Do not claim multiple (>3) JIRAs at the same time. Try to finish them one after another. * Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code review greatly helps improve others' code as well as yours. h2. For committers: * Try to break down big features into small and specific JIRA tasks and link them properly. * Add "starter" label to starter tasks. * Put a rough estimate for medium/big features and track the progress. h1. Roadmap (WIP) h2. 
Algorithms and performance * LDA improvements (SPARK-5572) * Log-linear model for survival analysis (SPARK-8518) * Improve GLM's scalability on number of features (SPARK-8520) * Tree and ensembles: Move + cleanup code (SPARK-7131), provide class probabilities (SPARK-3727) * Improve GMM scalability and stability (SPARK-7206) * Frequent itemsets improvements (SPARK-7211) h2. Pipeline API * more feature transformers (SPARK-8521) * k-means (SPARK-7898) * naive Bayes h2. Model persistence * more PMML export (SPARK-8545) * model save/load (SPARK-4587) * pipeline persistence (SPARK-6725) h2. Python API for ML h2. SparkR API for ML h2. Documentation was: We expect to see many MLlib contributors for the 1.5 release. To scale out the development, we created this master list for MLlib features we plan to have in Spark 1.5. Due to limited review bandwidth, features appearing on this list will get higher priority for code review. But feel free to suggest new items to the list in comments. We are experimenting with this process. Your feedback would be greatly appreciated. h1. Instructions h2. For contributors: * Please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark carefully. Code style, documentation, and unit tests are important. * If you are a first-time Spark contributor, please always start with a starter task (TODO: add a link) rather than a medium/big feature. Based on our experience, mixing the development process with a big feature usually causes long delay in code review. * Never work silently. Let everyone know on the corresponding JIRA page when you start working on some features. This is to avoid duplicate work. For small features, you don't need to wait to get JIRA assigned. * For medium/big features or features with dependencies, please get assigned first before coding and keep the ETA updated on the JIRA. If there exist no activity on the JIRA page for a certain amount of time, the JIRA should be released for other contributors. 
* Do not claim multiple (>3) JIRAs at the same time. Try to finish them one after another. * Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code review greatly helps improve others' code as well as yours. h2. For committers: * Try to break down big features into small and specific JIRA tasks and link them properly. * Add "starter" label to starter tasks. * Put a rough estimate for medium/big features and track the progress. h1. Roadmap (WIP) h2. Algorithms and performance * LDA improvements (SPARK-5572) * Log-linear model for survival analysis (SPARK-8518) * Improve GLM's scalability on number of features (SPARK-8520) h2. Pipeline API * more feature transformers (SPARK-8521) * k-means (SPARK-7898) * naive Bayes h2. Model persistence * more PMML export (SPARK-8545) * model save/load (SPARK-4587) * pipeline persistence (SPARK-6725) h2. Python API for ML h2. SparkR API for M
[jira] [Commented] (SPARK-7131) Move tree,forest implementation from spark.mllib to spark.ml
[ https://issues.apache.org/jira/browse/SPARK-7131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598579#comment-14598579 ] Joseph K. Bradley commented on SPARK-7131: -- Busy this week, but I expect to begin work sometime next week. > Move tree,forest implementation from spark.mllib to spark.ml > > > Key: SPARK-7131 > URL: https://issues.apache.org/jira/browse/SPARK-7131 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 1.4.0 >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley > Original Estimate: 168h > Remaining Estimate: 168h > > We want to change and improve the spark.ml API for trees and ensembles, but > we cannot change the old API in spark.mllib. To support the changes we want > to make, we should move the implementation from spark.mllib to spark.ml. We > will generalize and modify it, but will also ensure that we do not change the > behavior of the old API. > This JIRA should be done in several PRs, in this order: > 1. Copy the implementation over to spark.ml and change the spark.ml classes > to use that implementation, rather than calling the spark.mllib > implementation. The current spark.ml tests will ensure that the 2 > implementations learn exactly the same models. Note: This should include > performance testing to make sure the updated code does not have any > regressions. > 2. Remove the spark.mllib implementation, and make the spark.mllib APIs > wrappers around the spark.ml implementation. The spark.ml tests will again > ensure that we do not change any behavior. > 3. Move the unit tests to spark.ml, and change the spark.mllib unit tests to > verify model equivalence. > After these updates, we can more safely generalize and improve the spark.ml > implementation.
[jira] [Updated] (SPARK-7131) Move tree,forest implementation from spark.mllib to spark.ml
[ https://issues.apache.org/jira/browse/SPARK-7131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-7131: - Remaining Estimate: 168h Original Estimate: 168h > Move tree,forest implementation from spark.mllib to spark.ml > > > Key: SPARK-7131 > URL: https://issues.apache.org/jira/browse/SPARK-7131 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 1.4.0 >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley > Original Estimate: 168h > Remaining Estimate: 168h > > We want to change and improve the spark.ml API for trees and ensembles, but > we cannot change the old API in spark.mllib. To support the changes we want > to make, we should move the implementation from spark.mllib to spark.ml. We > will generalize and modify it, but will also ensure that we do not change the > behavior of the old API. > This JIRA should be done in several PRs, in this order: > 1. Copy the implementation over to spark.ml and change the spark.ml classes > to use that implementation, rather than calling the spark.mllib > implementation. The current spark.ml tests will ensure that the 2 > implementations learn exactly the same models. Note: This should include > performance testing to make sure the updated code does not have any > regressions. > 2. Remove the spark.mllib implementation, and make the spark.mllib APIs > wrappers around the spark.ml implementation. The spark.ml tests will again > ensure that we do not change any behavior. > 3. Move the unit tests to spark.ml, and change the spark.mllib unit tests to > verify model equivalence. > After these updates, we can more safely generalize and improve the spark.ml > implementation.
[jira] [Commented] (SPARK-8111) SparkR shell should display Spark logo and version banner on startup
[ https://issues.apache.org/jira/browse/SPARK-8111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598517#comment-14598517 ] Shivaram Venkataraman commented on SPARK-8111: -- [~srowen] Could you help add [~aloknsingh] as a developer and assign this issue? > SparkR shell should display Spark logo and version banner on startup > > > Key: SPARK-8111 > URL: https://issues.apache.org/jira/browse/SPARK-8111 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 1.4.0 >Reporter: Matei Zaharia >Priority: Trivial > Labels: Starter >
[jira] [Resolved] (SPARK-8111) SparkR shell should display Spark logo and version banner on startup
[ https://issues.apache.org/jira/browse/SPARK-8111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-8111. -- Resolution: Fixed > SparkR shell should display Spark logo and version banner on startup > > > Key: SPARK-8111 > URL: https://issues.apache.org/jira/browse/SPARK-8111 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 1.4.0 >Reporter: Matei Zaharia >Priority: Trivial > Labels: Starter >
[jira] [Commented] (SPARK-8111) SparkR shell should display Spark logo and version banner on startup
[ https://issues.apache.org/jira/browse/SPARK-8111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598515#comment-14598515 ] Shivaram Venkataraman commented on SPARK-8111: -- Issue resolved by https://github.com/apache/spark/pull/6944 > SparkR shell should display Spark logo and version banner on startup > > > Key: SPARK-8111 > URL: https://issues.apache.org/jira/browse/SPARK-8111 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 1.4.0 >Reporter: Matei Zaharia >Priority: Trivial > Labels: Starter >
[jira] [Updated] (SPARK-8449) HDF5 read/write support for Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-8449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-8449: - Fix Version/s: (was: 1.4.1) > HDF5 read/write support for Spark MLlib > --- > > Key: SPARK-8449 > URL: https://issues.apache.org/jira/browse/SPARK-8449 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.4.0 >Reporter: Alexander Ulanov > Original Estimate: 96h > Remaining Estimate: 96h > > Add support for reading and writing HDF5 file format to/from LabeledPoint. > HDFS and local file system have to be supported. Other Spark formats to be > discussed. > Interface proposal: > /* path - directory path in any Hadoop-supported file system URI */ > MLUtils.saveAsHDF5(sc: SparkContext, path: String, RDD[LabeledPoint]): Unit > /* path - file or directory path in any Hadoop-supported file system URI */ > MLUtils.loadHDF5(sc: SparkContext, path: String): RDD[LabeledPoint]
[jira] [Updated] (SPARK-8578) Should ignore user defined output committer when appending data
[ https://issues.apache.org/jira/browse/SPARK-8578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-8578: -- Target Version/s: 1.4.1, 1.5.0 (was: 1.5.0) > Should ignore user defined output committer when appending data > --- > > Key: SPARK-8578 > URL: https://issues.apache.org/jira/browse/SPARK-8578 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.0 >Reporter: Cheng Lian >Assignee: Yin Huai > > When appending data to a file system via the Hadoop API, it's safer to ignore > user-defined output committer classes like {{DirectParquetOutputCommitter}}, > because it's relatively hard to handle task failures in this case. For > example, {{DirectParquetOutputCommitter}} writes directly to the output > directory to boost write performance when working with S3. However, there's > no general way to determine the output file path of a specific task in the > Hadoop API, so we don't know how to revert a failed append job. (When doing > an overwrite, we can just remove the whole output directory.)
[jira] [Assigned] (SPARK-8235) misc function: sha1 / sha
[ https://issues.apache.org/jira/browse/SPARK-8235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8235: --- Assignee: Apache Spark > misc function: sha1 / sha > - > > Key: SPARK-8235 > URL: https://issues.apache.org/jira/browse/SPARK-8235 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > > sha1(string/binary): string > sha(string/binary): string > Calculates the SHA-1 digest for string or binary and returns the value as a > hex string (as of Hive 1.3.0). Example: sha1('ABC') = > '3c01bdbb26f358bab27f267924aa2c9a03fcfdb8'.
[jira] [Commented] (SPARK-8235) misc function: sha1 / sha
[ https://issues.apache.org/jira/browse/SPARK-8235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598493#comment-14598493 ] Apache Spark commented on SPARK-8235: - User 'tarekauel' has created a pull request for this issue: https://github.com/apache/spark/pull/6963 > misc function: sha1 / sha > - > > Key: SPARK-8235 > URL: https://issues.apache.org/jira/browse/SPARK-8235 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > > sha1(string/binary): string > sha(string/binary): string > Calculates the SHA-1 digest for string or binary and returns the value as a > hex string (as of Hive 1.3.0). Example: sha1('ABC') = > '3c01bdbb26f358bab27f267924aa2c9a03fcfdb8'.
[jira] [Assigned] (SPARK-8235) misc function: sha1 / sha
[ https://issues.apache.org/jira/browse/SPARK-8235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8235: --- Assignee: (was: Apache Spark) > misc function: sha1 / sha > - > > Key: SPARK-8235 > URL: https://issues.apache.org/jira/browse/SPARK-8235 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > > sha1(string/binary): string > sha(string/binary): string > Calculates the SHA-1 digest for string or binary and returns the value as a > hex string (as of Hive 1.3.0). Example: sha1('ABC') = > '3c01bdbb26f358bab27f267924aa2c9a03fcfdb8'.
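The expected semantics of `sha1`/`sha` can be checked against Python's standard `hashlib`, which reproduces the Hive example quoted in the issue description. This is only a reference for the digest value, not Spark's or Hive's implementation:

```python
import hashlib

def sha1_hex(data: bytes) -> str:
    # Hex-encoded SHA-1 digest, matching the sha1()/sha() semantics
    # described in the issue: sha1('ABC') yields the hex string below.
    return hashlib.sha1(data).hexdigest()

print(sha1_hex(b"ABC"))  # 3c01bdbb26f358bab27f267924aa2c9a03fcfdb8
```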
[jira] [Commented] (SPARK-8579) Support arbitrary object in UnsafeRow
[ https://issues.apache.org/jira/browse/SPARK-8579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598481#comment-14598481 ] Apache Spark commented on SPARK-8579: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/6959 > Support arbitrary object in UnsafeRow > - > > Key: SPARK-8579 > URL: https://issues.apache.org/jira/browse/SPARK-8579 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > > It's common to run count(distinct xxx) in SQL; the data type will be a UDT of > OpenHashSet, and it would be good to use UnsafeRow to reduce the memory > usage during aggregation. > Also for DecimalType, which could be used inside the grouping key.
[jira] [Assigned] (SPARK-8579) Support arbitrary object in UnsafeRow
[ https://issues.apache.org/jira/browse/SPARK-8579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8579: --- Assignee: Davies Liu (was: Apache Spark) > Support arbitrary object in UnsafeRow > - > > Key: SPARK-8579 > URL: https://issues.apache.org/jira/browse/SPARK-8579 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > > It's common to run count(distinct xxx) in SQL; the data type will be a UDT of > OpenHashSet, and it would be good to use UnsafeRow to reduce the memory > usage during aggregation. > Also for DecimalType, which could be used inside the grouping key.
[jira] [Assigned] (SPARK-8579) Support arbitrary object in UnsafeRow
[ https://issues.apache.org/jira/browse/SPARK-8579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8579: --- Assignee: Apache Spark (was: Davies Liu) > Support arbitrary object in UnsafeRow > - > > Key: SPARK-8579 > URL: https://issues.apache.org/jira/browse/SPARK-8579 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Davies Liu >Assignee: Apache Spark > > It's common to run count(distinct xxx) in SQL; the data type will be a UDT of > OpenHashSet, and it would be good to use UnsafeRow to reduce the memory > usage during aggregation. > Also for DecimalType, which could be used inside the grouping key.
[jira] [Created] (SPARK-8579) Support arbitrary object in UnsafeRow
Davies Liu created SPARK-8579: - Summary: Support arbitrary object in UnsafeRow Key: SPARK-8579 URL: https://issues.apache.org/jira/browse/SPARK-8579 Project: Spark Issue Type: New Feature Components: SQL Reporter: Davies Liu Assignee: Davies Liu It's common to run count(distinct xxx) in SQL; the data type will be a UDT of OpenHashSet, and it would be good to use UnsafeRow to reduce the memory usage during aggregation. Also for DecimalType, which could be used inside the grouping key.
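The motivation here — a hash set as the per-group aggregation buffer for count(distinct) — can be illustrated with a plain Python sketch. The function and field names are hypothetical and unrelated to the actual UnsafeRow/OpenHashSet code:

```python
def count_distinct(rows, key):
    # Per-group aggregation buffer is a set of values seen so far;
    # count(distinct key) per group is the final size of that set.
    buffers = {}
    for row in rows:
        buffers.setdefault(row["group"], set()).add(row[key])
    return {g: len(vals) for g, vals in buffers.items()}

rows = [
    {"group": "a", "x": 1},
    {"group": "a", "x": 1},
    {"group": "a", "x": 2},
    {"group": "b", "x": 5},
]
print(count_distinct(rows, "x"))  # {'a': 2, 'b': 1}
```

Because each group carries a whole set object as its buffer, storing such buffers compactly (e.g. inside UnsafeRow) matters for memory usage during aggregation.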
[jira] [Resolved] (SPARK-8190) ExpressionEvalHelper.checkEvaluation should also run the optimizer version
[ https://issues.apache.org/jira/browse/SPARK-8190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-8190. --- Resolution: Fixed > ExpressionEvalHelper.checkEvaluation should also run the optimizer version > -- > > Key: SPARK-8190 > URL: https://issues.apache.org/jira/browse/SPARK-8190 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Davies Liu > > We should remove the existing ExpressionOptimizationSuite, and update > checkEvaluation to also run the optimizer version.
[jira] [Created] (SPARK-8578) Should ignore user defined output committer when appending data
Cheng Lian created SPARK-8578: - Summary: Should ignore user defined output committer when appending data Key: SPARK-8578 URL: https://issues.apache.org/jira/browse/SPARK-8578 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Cheng Lian Assignee: Yin Huai When appending data to a file system via the Hadoop API, it's safer to ignore user-defined output committer classes like {{DirectParquetOutputCommitter}}, because it's relatively hard to handle task failures in this case. For example, {{DirectParquetOutputCommitter}} writes directly to the output directory to boost write performance when working with S3. However, there's no general way to determine the output file path of a specific task in the Hadoop API, so we don't know how to revert a failed append job. (When doing an overwrite, we can just remove the whole output directory.)
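The safety rule described above — fall back to the default committer when appending, regardless of any user-configured class — can be sketched as follows. The names `choose_committer` and `DEFAULT_COMMITTER` are illustrative, not Spark's actual API; only the decision logic mirrors the issue description:

```python
from typing import Optional

# Hadoop's stock committer, which stages task output in a temporary
# location and only moves it on commit, so failed tasks can be reverted.
DEFAULT_COMMITTER = "org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter"

def choose_committer(save_mode: str, user_committer: Optional[str]) -> str:
    # When appending, a failed task cannot be cleanly reverted if the
    # committer (e.g. DirectParquetOutputCommitter) wrote straight into
    # the output directory, so ignore any user-defined committer class.
    if save_mode == "append":
        return DEFAULT_COMMITTER
    return user_committer or DEFAULT_COMMITTER
```

For overwrite the user's committer can still be honored, since a failed job can be recovered by removing the whole output directory.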
[jira] [Updated] (SPARK-8445) MLlib 1.5 Roadmap
[ https://issues.apache.org/jira/browse/SPARK-8445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-8445: - Description: We expect to see many MLlib contributors for the 1.5 release. To scale out the development, we created this master list for MLlib features we plan to have in Spark 1.5. Due to limited review bandwidth, features appearing on this list will get higher priority for code review. But feel free to suggest new items to the list in comments. We are experimenting with this process. Your feedback would be greatly appreciated. h1. Instructions h2. For contributors: * Please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark carefully. Code style, documentation, and unit tests are important. * If you are a first-time Spark contributor, please always start with a starter task (TODO: add a link) rather than a medium/big feature. Based on our experience, mixing the development process with a big feature usually causes long delays in code review. * Never work silently. Let everyone know on the corresponding JIRA page when you start working on some features. This is to avoid duplicate work. For small features, you don't need to wait to get the JIRA assigned. * For medium/big features or features with dependencies, please get assigned first before coding and keep the ETA updated on the JIRA. If there is no activity on the JIRA page for a certain amount of time, the JIRA should be released for other contributors. * Do not claim multiple (>3) JIRAs at the same time. Try to finish them one after another. * Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code review greatly helps improve others' code as well as yours. h2. For committers: * Try to break down big features into small and specific JIRA tasks and link them properly. * Add "starter" label to starter tasks. * Put a rough estimate for medium/big features and track the progress. h1. Roadmap (WIP) h2. 
Algorithms and performance * LDA improvements (SPARK-5572) * Log-linear model for survival analysis (SPARK-8518) * Improve GLM's scalability on number of features (SPARK-8520) h2. Pipeline API * more feature transformers (SPARK-8521) * k-means (SPARK-7898) * naive Bayes h2. Model persistence * more PMML export (SPARK-8545) * model save/load (SPARK-4587) * pipeline persistence (SPARK-6725) h2. Python API for ML h2. SparkR API for ML h2. Documentation was: We expect to see many MLlib contributors for the 1.5 release. To scale out the development, we created this master list for MLlib features we plan to have in Spark 1.5. Due to limited review bandwidth, features appearing on this list will get higher priority for code review. But feel free to suggest new items to the list in comments. We are experimenting with this process. Your feedback would be greatly appreciated. h1. Instructions h2. For contributors: * Please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark carefully. Code style, documentation, and unit tests are important. * If you are a first-time Spark contributor, please always start with a starter task (TODO: add a link) rather than a medium/big feature. Based on our experience, mixing the development process with a big feature usually causes long delay in code review. * Never work silently. Let everyone know on the corresponding JIRA page when you start working on some features. This is to avoid duplicate work. For small features, you don't need to wait to get JIRA assigned. * For medium/big features or features with dependencies, please get assigned first before coding and keep the ETA updated on the JIRA. If there exist no activity on the JIRA page for a certain amount of time, the JIRA should be released for other contributors. * Do not claim multiple (>3) JIRAs at the same time. Try to finish them one after another. * Please review others' PRs (https://spark-prs.appspot.com/#mllib). 
Code review greatly helps improve others' code as well as yours. h2. For committers: * Try to break down big features into small and specific JIRA tasks and link them properly. * Add "starter" label to starter tasks. * Put a rough estimate for medium/big features and track the progress. h1. Roadmap h2. Algorithms h2. Pipeline API h2. Model persistence h2. Python API for ML h2. SparkR API for ML h2. Documentation > MLlib 1.5 Roadmap > - > > Key: SPARK-8445 > URL: https://issues.apache.org/jira/browse/SPARK-8445 > Project: Spark > Issue Type: Umbrella > Components: ML, MLlib >Affects Versions: 1.5.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Critical > > We expect to see many MLlib contributors for the 1.5 release. To scale out > the development, we created this master list for MLlib features we plan t
[jira] [Created] (SPARK-8577) ScalaReflectionLock.synchronized can cause deadlock
koert kuipers created SPARK-8577: Summary: ScalaReflectionLock.synchronized can cause deadlock Key: SPARK-8577 URL: https://issues.apache.org/jira/browse/SPARK-8577 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: koert kuipers Priority: Minor Just a heads up: I was doing some basic coding using DataFrame, Row, StructType, etc. in my own project, and I ended up with deadlocks in my sbt tests due to the use of ScalaReflectionLock.synchronized in the Spark SQL code. The issue went away when I changed my build to have {{parallelExecution in Test := false}} so that the tests run consecutively...
[jira] [Commented] (SPARK-8072) Better AnalysisException for writing DataFrame with identically named columns
[ https://issues.apache.org/jira/browse/SPARK-8072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598444#comment-14598444 ] Reynold Xin commented on SPARK-8072: [~animeshbaranawal] I think we should only apply that rule when there is some save/output action. > Better AnalysisException for writing DataFrame with identically named columns > - > > Key: SPARK-8072 > URL: https://issues.apache.org/jira/browse/SPARK-8072 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Priority: Blocker > > We should check if there are duplicate columns, and if yes, throw an explicit > error message saying there are duplicate columns. See current error message > below. > {code} > In [3]: df.withColumn('age', df.age) > Out[3]: DataFrame[age: bigint, name: string, age: bigint] > In [4]: df.withColumn('age', df.age).write.parquet('test-parquet.out') > --- > Py4JJavaError Traceback (most recent call last) > in () > > 1 df.withColumn('age', df.age).write.parquet('test-parquet.out') > /scratch/rxin/spark/python/pyspark/sql/readwriter.py in parquet(self, path, > mode) > 350 >>> df.write.parquet(os.path.join(tempfile.mkdtemp(), 'data')) > 351 """ > --> 352 self._jwrite.mode(mode).parquet(path) > 353 > 354 @since(1.4) > /Users/rxin/anaconda/lib/python2.7/site-packages/py4j-0.8.1-py2.7.egg/py4j/java_gateway.pyc > in __call__(self, *args) > 535 answer = self.gateway_client.send_command(command) > 536 return_value = get_return_value(answer, self.gateway_client, > --> 537 self.target_id, self.name) > 538 > 539 for temp_arg in temp_args: > /Users/rxin/anaconda/lib/python2.7/site-packages/py4j-0.8.1-py2.7.egg/py4j/protocol.pyc > in get_return_value(answer, gateway_client, target_id, name) > 298 raise Py4JJavaError( > 299 'An error occurred while calling {0}{1}{2}.\n'. > --> 300 format(target_id, '.', name), value) > 301 else: > 302 raise Py4JError( > Py4JJavaError: An error occurred while calling o35.parquet. 
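The requested check amounts to scanning the output schema's column names for duplicates before the write starts. A minimal sketch of that idea in plain Python (hypothetical; the function name and error wording are illustrative, not Spark's actual check):

```python
# Hypothetical sketch of the duplicate-column check proposed above: fail
# fast with an explicit message listing the duplicated names, instead of
# the opaque Py4J "Reference 'age' is ambiguous" traceback.

from collections import Counter

def check_duplicate_columns(column_names):
    duplicates = [name for name, n in Counter(column_names).items() if n > 1]
    if duplicates:
        raise ValueError(
            "Duplicate column(s) found: %s. Rename or drop them before writing."
            % ", ".join(sorted(duplicates)))

check_duplicate_columns(["age", "name"])           # OK, no duplicates
# check_duplicate_columns(["age", "name", "age"])  # would raise ValueError
```

Running such a check only at save/output time (rather than in every analysis pass) matches the discussion above, since a DataFrame with duplicate names is still usable in memory.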
> : org.apache.spark.sql.AnalysisException: Reference 'age' is ambiguous, could > be: age#0L, age#3L.; > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:279) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:116) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$16.apply(Analyzer.scala:350) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$16.apply(Analyzer.scala:350) > at > org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:350) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:341) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:285) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionUp$1(QueryPlan.scala:108) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2$$anonfun$apply$2.apply(QueryPlan.scala:123) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:122) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Gro