[jira] [Comment Edited] (SPARK-8072) Better AnalysisException for writing DataFrame with identically named columns
[ https://issues.apache.org/jira/browse/SPARK-8072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598890#comment-14598890 ] Animesh Baranawal edited comment on SPARK-8072 at 6/24/15 6:54 AM: --- [~rxin] If we want the rule to apply only on some save/output action, would it not be better to check the rule before calling the write function instead of adding the rule in CheckAnalysis.scala was (Author: animeshbaranawal): [~rxin] If we want the rule to apply only on some save/output action, would it not be more intuitive to check the rule before calling the write function instead of adding the rule in CheckAnalysis.scala > Better AnalysisException for writing DataFrame with identically named columns > - > > Key: SPARK-8072 > URL: https://issues.apache.org/jira/browse/SPARK-8072 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Priority: Blocker > > We should check if there are duplicate columns, and if yes, throw an explicit > error message saying there are duplicate columns. See current error message > below. 
> {code} > In [3]: df.withColumn('age', df.age) > Out[3]: DataFrame[age: bigint, name: string, age: bigint] > In [4]: df.withColumn('age', df.age).write.parquet('test-parquet.out') > --- > Py4JJavaError Traceback (most recent call last) > in () > > 1 df.withColumn('age', df.age).write.parquet('test-parquet.out') > /scratch/rxin/spark/python/pyspark/sql/readwriter.py in parquet(self, path, > mode) > 350 >>> df.write.parquet(os.path.join(tempfile.mkdtemp(), 'data')) > 351 """ > --> 352 self._jwrite.mode(mode).parquet(path) > 353 > 354 @since(1.4) > /Users/rxin/anaconda/lib/python2.7/site-packages/py4j-0.8.1-py2.7.egg/py4j/java_gateway.pyc > in __call__(self, *args) > 535 answer = self.gateway_client.send_command(command) > 536 return_value = get_return_value(answer, self.gateway_client, > --> 537 self.target_id, self.name) > 538 > 539 for temp_arg in temp_args: > /Users/rxin/anaconda/lib/python2.7/site-packages/py4j-0.8.1-py2.7.egg/py4j/protocol.pyc > in get_return_value(answer, gateway_client, target_id, name) > 298 raise Py4JJavaError( > 299 'An error occurred while calling {0}{1}{2}.\n'. > --> 300 format(target_id, '.', name), value) > 301 else: > 302 raise Py4JError( > Py4JJavaError: An error occurred while calling o35.parquet. 
> : org.apache.spark.sql.AnalysisException: Reference 'age' is ambiguous, could > be: age#0L, age#3L.; > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:279) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:116) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$16.apply(Analyzer.scala:350) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$16.apply(Analyzer.scala:350) > at > org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:350) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:341) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:285) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionUp$1(QueryPlan.scala:108) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2$$anonfun$apply$2.apply(QueryPlan.scala:123) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scal
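The check the ticket asks for is simple to picture: before handing the plan to the writer, scan the output column names for repeats and fail with an explicit message instead of the ambiguous-reference error above. A minimal pure-Python sketch of such a pre-write guard (hypothetical helper and message text, not Spark's actual CheckAnalysis or DataFrameWriter code):

```python
from collections import Counter

def assert_no_duplicate_columns(column_names):
    """Raise an explicit error when the output schema repeats a column name.

    Hypothetical pre-write guard illustrating the proposal; not Spark code.
    """
    dupes = sorted(name for name, count in Counter(column_names).items() if count > 1)
    if dupes:
        raise ValueError("Found duplicate column(s) when writing: %s" % ", ".join(dupes))

# The schema from the reproduction above: df.withColumn('age', df.age)
try:
    assert_no_duplicate_columns(["age", "name", "age"])
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```

With a guard like this the user sees which column is duplicated up front, rather than a resolver error deep inside the analyzer.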
[jira] [Commented] (SPARK-8214) math function: hex
[ https://issues.apache.org/jira/browse/SPARK-8214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598961#comment-14598961 ] Apache Spark commented on SPARK-8214: - User 'zhichao-li' has created a pull request for this issue: https://github.com/apache/spark/pull/6976 > math function: hex > -- > > Key: SPARK-8214 > URL: https://issues.apache.org/jira/browse/SPARK-8214 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: zhichao-li > > hex(BIGINT a): string > hex(STRING a): string > hex(BINARY a): string > If the argument is an INT or binary, hex returns the number as a STRING in > hexadecimal format. Otherwise if the number is a STRING, it converts each > character into its hexadecimal representation and returns the resulting > STRING. (See > http://dev.mysql.com/doc/refman/5.0/en/string-functions.html#function_hex, > BINARY version as of Hive 0.12.0.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
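The three overloads quoted above can be made concrete in plain Python. `hive_hex` below is a hypothetical illustration of the described Hive semantics (non-negative integers only; strings encoded as UTF-8), not Spark's or Hive's implementation:

```python
def hive_hex(value):
    """Sketch of hex() per the Hive docs quoted above (assumed semantics;
    Hive's two's-complement handling of negative BIGINTs is not modeled)."""
    if isinstance(value, int):
        return format(value, "X")                    # number -> hexadecimal string
    if isinstance(value, bytes):
        return value.hex().upper()                   # each byte -> two hex digits
    if isinstance(value, str):
        return value.encode("utf-8").hex().upper()   # each character -> its hex code
    raise TypeError("hex() expects an int, str, or bytes, got %r" % type(value))
```

For example, `hive_hex(255)` yields `"FF"` and `hive_hex("ABC")` yields `"414243"`, matching the INT-versus-STRING distinction in the description.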
[jira] [Assigned] (SPARK-8214) math function: hex
[ https://issues.apache.org/jira/browse/SPARK-8214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8214: --- Assignee: zhichao-li (was: Apache Spark)
[jira] [Assigned] (SPARK-8214) math function: hex
[ https://issues.apache.org/jira/browse/SPARK-8214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8214: --- Assignee: Apache Spark (was: zhichao-li)
[jira] [Updated] (SPARK-8533) Bump Flume version to 1.6.0
[ https://issues.apache.org/jira/browse/SPARK-8533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8533: - Component/s: Streaming Priority: Minor (was: Major) Issue Type: Task (was: Bug) (Let's set component / type / priority) > Bump Flume version to 1.6.0 > --- > > Key: SPARK-8533 > URL: https://issues.apache.org/jira/browse/SPARK-8533 > Project: Spark > Issue Type: Task > Components: Streaming >Reporter: Hari Shreedharan >Priority: Minor >
[jira] [Updated] (SPARK-8587) Return cost and cluster index KMeansModel.predict
[ https://issues.apache.org/jira/browse/SPARK-8587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8587: - Component/s: MLlib > Return cost and cluster index KMeansModel.predict > - > > Key: SPARK-8587 > URL: https://issues.apache.org/jira/browse/SPARK-8587 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Sam Stoelinga >Priority: Minor > > Looking at the PySpark implementation of KMeansModel.predict > https://github.com/apache/spark/blob/master/python/pyspark/mllib/clustering.py#L102 : > Currently: > it calculates the cost of the closest cluster but returns only the index. > My expectation: > an easy way for the same function, or a new one, to return the cost along with the index.
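The requested behavior is easy to sketch: compute the squared distance to every center and return both the argmin and its distance, instead of discarding the distance. A small standalone illustration (hypothetical `predict_with_cost`, not PySpark's actual API):

```python
def predict_with_cost(point, centers):
    """Hypothetical variant of KMeansModel.predict returning (index, cost)."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    # The existing predict already computes these distances; the change is
    # simply to return the winning distance alongside the winning index.
    costs = [squared_distance(point, center) for center in centers]
    best = min(range(len(centers)), key=costs.__getitem__)
    return best, costs[best]

index, cost = predict_with_cost((1.0, 1.0), [(0.0, 0.0), (10.0, 10.0)])
```

Here the point (1, 1) maps to cluster 0 with a squared-distance cost of 2.0.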
[jira] [Updated] (SPARK-8551) Python example code for elastic net
[ https://issues.apache.org/jira/browse/SPARK-8551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8551: - Component/s: PySpark Priority: Minor (was: Major) > Python example code for elastic net > --- > > Key: SPARK-8551 > URL: https://issues.apache.org/jira/browse/SPARK-8551 > Project: Spark > Issue Type: New Feature > Components: PySpark >Reporter: Shuo Xiang >Priority: Minor >
[jira] [Updated] (SPARK-8585) Support LATERAL VIEW in Spark SQL parser
[ https://issues.apache.org/jira/browse/SPARK-8585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8585: - Component/s: SQL Priority: Minor (was: Major) (Components et al please: https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark) > Support LATERAL VIEW in Spark SQL parser > > > Key: SPARK-8585 > URL: https://issues.apache.org/jira/browse/SPARK-8585 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Konstantin Shaposhnikov >Priority: Minor > > It would be good to support LATERAL VIEW SQL syntax without the need to create a > HiveContext. > Docs: > https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView
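For readers unfamiliar with the syntax, `LATERAL VIEW explode(...)` joins each input row against the elements of one of its array columns, emitting one output row per element while repeating the other columns. The effect, sketched in plain Python on a toy dataset (illustration only; the actual feature is SQL parser support):

```python
# Each input row carries a scalar column and an array column, as in
# SELECT key, value FROM t LATERAL VIEW explode(values) v AS value.
rows = [("a", [1, 2]), ("b", [3])]

# explode(): one output row per array element, scalar columns repeated.
exploded = [(key, value) for key, values in rows for value in values]
```

`exploded` ends up as `[("a", 1), ("a", 2), ("b", 3)]`, the flattened relation the SQL would return.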
[jira] [Updated] (SPARK-8561) Drop table can only drop the tables under database "default"
[ https://issues.apache.org/jira/browse/SPARK-8561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8561: - Component/s: SQL > Drop table can only drop the tables under database "default" > > > Key: SPARK-8561 > URL: https://issues.apache.org/jira/browse/SPARK-8561 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: baishuo >
[jira] [Commented] (SPARK-8111) SparkR shell should display Spark logo and version banner on startup
[ https://issues.apache.org/jira/browse/SPARK-8111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598937#comment-14598937 ] Sean Owen commented on SPARK-8111: -- [~shivaram] done, though I also just made you a JIRA admin, so that you can add Contributors at https://issues.apache.org/jira/plugins/servlet/project-config/SPARK/roles (Just be aware you can now edit lots of things in JIRA, so be careful what you click!) > SparkR shell should display Spark logo and version banner on startup > > > Key: SPARK-8111 > URL: https://issues.apache.org/jira/browse/SPARK-8111 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 1.4.0 >Reporter: Matei Zaharia >Assignee: Alok Singh >Priority: Trivial > Labels: Starter >
[jira] [Resolved] (SPARK-8371) improve unit test for MaxOf and MinOf
[ https://issues.apache.org/jira/browse/SPARK-8371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-8371. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 6825 [https://github.com/apache/spark/pull/6825] > improve unit test for MaxOf and MinOf > - > > Key: SPARK-8371 > URL: https://issues.apache.org/jira/browse/SPARK-8371 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 1.5.0 > >
[jira] [Updated] (SPARK-8111) SparkR shell should display Spark logo and version banner on startup
[ https://issues.apache.org/jira/browse/SPARK-8111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8111: - Assignee: Alok Singh > SparkR shell should display Spark logo and version banner on startup > > > Key: SPARK-8111 > URL: https://issues.apache.org/jira/browse/SPARK-8111 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 1.4.0 >Reporter: Matei Zaharia >Assignee: Alok Singh >Priority: Trivial > Labels: Starter >
[jira] [Comment Edited] (SPARK-8072) Better AnalysisException for writing DataFrame with identically named columns
[ https://issues.apache.org/jira/browse/SPARK-8072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598890#comment-14598890 ] Animesh Baranawal edited comment on SPARK-8072 at 6/24/15 5:34 AM: --- [~rxin] If we want the rule to apply only on some save/output action, would it not be more intuitive to check the rule before calling the write function instead of adding the rule in CheckAnalysis.scala was (Author: animeshbaranawal): [~rxin] If we want the rule to apply only on some save/output action, would it not be more intuitive to add the rule in DataFrameWriter.scala instead of in CheckAnalysis.scala
[jira] [Comment Edited] (SPARK-8072) Better AnalysisException for writing DataFrame with identically named columns
[ https://issues.apache.org/jira/browse/SPARK-8072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598890#comment-14598890 ] Animesh Baranawal edited comment on SPARK-8072 at 6/24/15 5:29 AM: --- [~rxin] If we want the rule to apply only on some save/output action, would it not be more intuitive to add the rule in DataFrameWriter.scala instead of in CheckAnalysis.scala was (Author: animeshbaranawal): [rxin] If we want the rule to apply only on some save/output action, would it not be more intuitive to add the rule in DataFrameWriter.scala instead of in CheckAnalysis.scala
[jira] [Comment Edited] (SPARK-8072) Better AnalysisException for writing DataFrame with identically named columns
[ https://issues.apache.org/jira/browse/SPARK-8072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598890#comment-14598890 ] Animesh Baranawal edited comment on SPARK-8072 at 6/24/15 5:29 AM: --- [rxin] If we want the rule to apply only on some save/output action, would it not be more intuitive to add the rule in DataFrameWriter.scala instead of in CheckAnalysis.scala was (Author: animeshbaranawal): If we want the rule to apply only on some save/output action, would it not be more intuitive to add the rule in DataFrameWriter.scala instead of in CheckAnalysis.scala
[jira] [Commented] (SPARK-8072) Better AnalysisException for writing DataFrame with identically named columns
[ https://issues.apache.org/jira/browse/SPARK-8072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598890#comment-14598890 ] Animesh Baranawal commented on SPARK-8072: -- If we want the rule to apply only on some save/output action, would it not be more intuitive to add the rule in DataFrameWriter.scala instead of in CheckAnalysis.scala
[jira] [Updated] (SPARK-6749) Make metastore client robust to underlying socket connection loss
[ https://issues.apache.org/jira/browse/SPARK-6749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-6749: Assignee: Eric Liang > Make metastore client robust to underlying socket connection loss > - > > Key: SPARK-6749 > URL: https://issues.apache.org/jira/browse/SPARK-6749 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Yin Huai >Assignee: Eric Liang >Priority: Critical > Fix For: 1.5.0 > > > Right now, if metastore get restarted, we have to restart the driver to get a > new connection to the metastore client because the underlying socket > connection is gone. We should make metastore client robust to it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6749) Make metastore client robust to underlying socket connection loss
[ https://issues.apache.org/jira/browse/SPARK-6749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-6749. - Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 6912 [https://github.com/apache/spark/pull/6912] > Make metastore client robust to underlying socket connection loss > - > > Key: SPARK-6749 > URL: https://issues.apache.org/jira/browse/SPARK-6749 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Yin Huai >Priority: Critical > Fix For: 1.5.0 > > > Right now, if metastore get restarted, we have to restart the driver to get a > new connection to the metastore client because the underlying socket > connection is gone. We should make metastore client robust to it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
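Robustness to a dropped metastore socket is commonly achieved by wrapping each client call in reconnect-and-retry logic. A hedged, language-agnostic sketch in plain Python (the `call`/`reconnect` hooks are placeholders, not the actual Hive metastore client API):

```python
import time

def with_retry(call, reconnect, attempts=3, delay_secs=1.0):
    """Invoke call(); on a connection failure, run reconnect() and retry.

    Re-raises the last connection error once all attempts are exhausted.
    """
    for attempt in range(attempts):
        try:
            return call()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts; surface the failure
            reconnect()            # re-establish the underlying socket
            time.sleep(delay_secs)  # small backoff before retrying
```

The fix merged for this issue works along these lines: rather than forcing a driver restart, the client re-creates its connection when the old socket is gone.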
[jira] [Created] (SPARK-8587) Return cost and cluster index KMeansModel.predict
Sam Stoelinga created SPARK-8587: Summary: Return cost and cluster index KMeansModel.predict Key: SPARK-8587 URL: https://issues.apache.org/jira/browse/SPARK-8587 Project: Spark Issue Type: Improvement Reporter: Sam Stoelinga Priority: Minor Looking at the PySpark implementation of KMeansModel.predict: Currently: it calculates the cost of the closest cluster and returns the index only. My expectation: an easy way for the same function, or a new one, to return the cost along with the index. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8587) Return cost and cluster index KMeansModel.predict
[ https://issues.apache.org/jira/browse/SPARK-8587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sam Stoelinga updated SPARK-8587: - Description: Looking at PySpark the implementation of KMeansModel.predict https://github.com/apache/spark/blob/master/python/pyspark/mllib/clustering.py#L102: Currently: it calculates the cost of the closest cluster and returns the index only. My expectation: Easy way to let the same function or a new function to return the cost with the index. was: Looking at PySpark the implementation of KMeansModel.predict: Currently: it calculates the cost of the closest cluster and returns the index only. My expectation: Easy way to let the same function or a new function to return the cost with the index. > Return cost and cluster index KMeansModel.predict > - > > Key: SPARK-8587 > URL: https://issues.apache.org/jira/browse/SPARK-8587 > Project: Spark > Issue Type: Improvement >Reporter: Sam Stoelinga >Priority: Minor > > Looking at PySpark the implementation of KMeansModel.predict > https://github.com/apache/spark/blob/master/python/pyspark/mllib/clustering.py#L102: > > Currently: > it calculates the cost of the closest cluster and returns the index only. > My expectation: > Easy way to let the same function or a new function to return the cost with > the index. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8587) Return cost and cluster index KMeansModel.predict
[ https://issues.apache.org/jira/browse/SPARK-8587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sam Stoelinga updated SPARK-8587: - Description: Looking at PySpark the implementation of KMeansModel.predict https://github.com/apache/spark/blob/master/python/pyspark/mllib/clustering.py#L102 : Currently: it calculates the cost of the closest cluster and returns the index only. My expectation: Easy way to let the same function or a new function to return the cost with the index. was: Looking at PySpark the implementation of KMeansModel.predict https://github.com/apache/spark/blob/master/python/pyspark/mllib/clustering.py#L102: Currently: it calculates the cost of the closest cluster and returns the index only. My expectation: Easy way to let the same function or a new function to return the cost with the index. > Return cost and cluster index KMeansModel.predict > - > > Key: SPARK-8587 > URL: https://issues.apache.org/jira/browse/SPARK-8587 > Project: Spark > Issue Type: Improvement >Reporter: Sam Stoelinga >Priority: Minor > > Looking at PySpark the implementation of KMeansModel.predict > https://github.com/apache/spark/blob/master/python/pyspark/mllib/clustering.py#L102 > : > Currently: > it calculates the cost of the closest cluster and returns the index only. > My expectation: > Easy way to let the same function or a new function to return the cost with > the index. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
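The requested behavior amounts to returning the argmin together with its distance. A plain-Python sketch of what such a `predict` variant could compute (illustrative only; the function name and return shape are not part of the MLlib API):

```python
def predict_with_cost(point, centers):
    """Return (index of the closest center, squared distance to it).

    `point` is a sequence of floats; `centers` is a list of such sequences.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    costs = [sq_dist(point, c) for c in centers]
    best = min(range(len(centers)), key=costs.__getitem__)
    return best, costs[best]
```

The existing predict already computes the per-center distances to pick the index, so exposing the winning distance as well would add essentially no extra work.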
[jira] [Issue Comment Deleted] (SPARK-7137) Add checkInputColumn back to Params and print more info
[ https://issues.apache.org/jira/browse/SPARK-7137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rekha Joshi updated SPARK-7137: --- Comment: was deleted (was: Sorry [~gweidner] , [~josephkb] just saw it was unassigned when i created the patch.thanks) > Add checkInputColumn back to Params and print more info > --- > > Key: SPARK-7137 > URL: https://issues.apache.org/jira/browse/SPARK-7137 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.4.0 >Reporter: Joseph K. Bradley >Priority: Trivial > > In the PR for [https://issues.apache.org/jira/browse/SPARK-5957], > Params.checkInputColumn was moved to SchemaUtils and renamed to > checkColumnType. The downside is that it no longer has access to the > parameter info, so it cannot state which input column parameter was incorrect. > We should keep checkColumnType but also add checkInputColumn back to Params. > It should print out the parameter name and description. Internally, it may > call checkColumnType. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2645) Spark driver calls System.exit(50) after calling SparkContext.stop() the second time
[ https://issues.apache.org/jira/browse/SPARK-2645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598862#comment-14598862 ] Rekha Joshi commented on SPARK-2645: [~sowen] [~vokom] Hi. On a quick look, with my setup on the latest 1.5.0-SNAPSHOT, I believe this can still happen. If an Executor hits an unhandled exception, the system will exit with code 50 (SparkExitCode.UNCAUGHT_EXCEPTION) {code} //SparkUncaughtExceptionHandler// if (!Utils.inShutdown()) { if (exception.isInstanceOf[OutOfMemoryError]) { System.exit(SparkExitCode.OOM) } else { System.exit(SparkExitCode.UNCAUGHT_EXCEPTION) } } .. . private[spark] object SparkExitCode { /** The default uncaught exception handler was reached. */ val UNCAUGHT_EXCEPTION = 50 {code} The git patch defensively avoids a stop if one has already been done and/or handles the exception at SparkEnv. Please review. Thanks. > Spark driver calls System.exit(50) after calling SparkContext.stop() the > second time > - > > Key: SPARK-2645 > URL: https://issues.apache.org/jira/browse/SPARK-2645 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Vlad Komarov > > In some cases my application calls SparkContext.stop() after it has already > stopped and this leads to stopping JVM that runs spark driver. > E.g > This program should run forever > {code} > JavaSparkContext context = new JavaSparkContext("spark://12.34.21.44:7077", > "DummyApp"); > try { > JavaRDD rdd = context.parallelize(Arrays.asList(1, 2, > 3)); > rdd.count(); > } catch (Throwable e) { > e.printStackTrace(); > } > try { > context.cancelAllJobs(); > context.stop(); > //call stop second time > context.stop(); > } catch (Throwable e) { > e.printStackTrace(); > } > Thread.currentThread().join(); > {code} > but it finishes with exit code 50 after calling SparkContext.stop() the > second time. 
> Also it throws an exception like this > {code} > org.apache.spark.ServerStateException: Server is already stopped > at org.apache.spark.HttpServer.stop(HttpServer.scala:122) > ~[spark-core_2.10-1.0.0.jar:1.0.0] > at org.apache.spark.HttpFileServer.stop(HttpFileServer.scala:48) > ~[spark-core_2.10-1.0.0.jar:1.0.0] > at org.apache.spark.SparkEnv.stop(SparkEnv.scala:81) > ~[spark-core_2.10-1.0.0.jar:1.0.0] > at org.apache.spark.SparkContext.stop(SparkContext.scala:984) > ~[spark-core_2.10-1.0.0.jar:1.0.0] > at > org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend.dead(SparkDeploySchedulerBackend.scala:92) > ~[spark-core_2.10-1.0.0.jar:1.0.0] > at > org.apache.spark.deploy.client.AppClient$ClientActor.markDead(AppClient.scala:178) > ~[spark-core_2.10-1.0.0.jar:1.0.0] > at > org.apache.spark.deploy.client.AppClient$ClientActor$$anonfun$registerWithMaster$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AppClient.scala:96) > ~[spark-core_2.10-1.0.0.jar:1.0.0] > at org.apache.spark.util.Utils$.tryOrExit(Utils.scala:790) > ~[spark-core_2.10-1.0.0.jar:1.0.0] > at > org.apache.spark.deploy.client.AppClient$ClientActor$$anonfun$registerWithMaster$1.apply$mcV$sp(AppClient.scala:91) > [spark-core_2.10-1.0.0.jar:1.0.0] > at akka.actor.Scheduler$$anon$9.run(Scheduler.scala:80) > [akka-actor_2.10-2.2.3-shaded-protobuf.jar:na] > at > akka.actor.LightArrayRevolverScheduler$$anon$3$$anon$2.run(Scheduler.scala:241) > [akka-actor_2.10-2.2.3-shaded-protobuf.jar:na] > at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:42) > [akka-actor_2.10-2.2.3-shaded-protobuf.jar:na] > at > akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) > [akka-actor_2.10-2.2.3-shaded-protobuf.jar:na] > at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > [scala-library-2.10.4.jar:na] > at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > [scala-library-2.10.4.jar:na] > at > 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > [scala-library-2.10.4.jar:na] > at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > [scala-library-2.10.4.jar:na] > {code} > One remark is that this behavior is only reproducible when I call > SparkContext.cancellAllJobs() before calling SparkContext.stop() -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
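The defensive fix discussed above amounts to making stop() idempotent: remember whether shutdown already ran and turn a second call into a no-op instead of an exception that trips the uncaught-exception handler. A toy sketch in plain Python (not Spark's actual SparkContext code):

```python
import threading

class StoppableContext:
    """Toy context whose stop() is safe to call more than once."""

    def __init__(self):
        self._stopped = False
        self._lock = threading.Lock()

    def stop(self):
        """Shut down once; return True if this call did the shutdown."""
        with self._lock:
            if self._stopped:
                return False  # already stopped: harmless no-op, no exception
            self._stopped = True
            # ... release servers, schedulers, etc. exactly once here ...
            return True
```

With this guard, the second context.stop() in the reproduction above would simply return instead of raising ServerStateException and ultimately exiting with code 50.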
[jira] [Assigned] (SPARK-2645) Spark driver calls System.exit(50) after calling SparkContext.stop() the second time
[ https://issues.apache.org/jira/browse/SPARK-2645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-2645: --- Assignee: Apache Spark > Spark driver calls System.exit(50) after calling SparkContext.stop() the > second time > - > > Key: SPARK-2645 > URL: https://issues.apache.org/jira/browse/SPARK-2645 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Vlad Komarov >Assignee: Apache Spark > > In some cases my application calls SparkContext.stop() after it has already > stopped and this leads to stopping JVM that runs spark driver. > E.g > This program should run forever > {code} > JavaSparkContext context = new JavaSparkContext("spark://12.34.21.44:7077", > "DummyApp"); > try { > JavaRDD rdd = context.parallelize(Arrays.asList(1, 2, > 3)); > rdd.count(); > } catch (Throwable e) { > e.printStackTrace(); > } > try { > context.cancelAllJobs(); > context.stop(); > //call stop second time > context.stop(); > } catch (Throwable e) { > e.printStackTrace(); > } > Thread.currentThread().join(); > {code} > but it finishes with exit code 50 after calling SparkContext.stop() the > second time. 
> Also it throws an exception like this > {code} > org.apache.spark.ServerStateException: Server is already stopped > at org.apache.spark.HttpServer.stop(HttpServer.scala:122) > ~[spark-core_2.10-1.0.0.jar:1.0.0] > at org.apache.spark.HttpFileServer.stop(HttpFileServer.scala:48) > ~[spark-core_2.10-1.0.0.jar:1.0.0] > at org.apache.spark.SparkEnv.stop(SparkEnv.scala:81) > ~[spark-core_2.10-1.0.0.jar:1.0.0] > at org.apache.spark.SparkContext.stop(SparkContext.scala:984) > ~[spark-core_2.10-1.0.0.jar:1.0.0] > at > org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend.dead(SparkDeploySchedulerBackend.scala:92) > ~[spark-core_2.10-1.0.0.jar:1.0.0] > at > org.apache.spark.deploy.client.AppClient$ClientActor.markDead(AppClient.scala:178) > ~[spark-core_2.10-1.0.0.jar:1.0.0] > at > org.apache.spark.deploy.client.AppClient$ClientActor$$anonfun$registerWithMaster$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AppClient.scala:96) > ~[spark-core_2.10-1.0.0.jar:1.0.0] > at org.apache.spark.util.Utils$.tryOrExit(Utils.scala:790) > ~[spark-core_2.10-1.0.0.jar:1.0.0] > at > org.apache.spark.deploy.client.AppClient$ClientActor$$anonfun$registerWithMaster$1.apply$mcV$sp(AppClient.scala:91) > [spark-core_2.10-1.0.0.jar:1.0.0] > at akka.actor.Scheduler$$anon$9.run(Scheduler.scala:80) > [akka-actor_2.10-2.2.3-shaded-protobuf.jar:na] > at > akka.actor.LightArrayRevolverScheduler$$anon$3$$anon$2.run(Scheduler.scala:241) > [akka-actor_2.10-2.2.3-shaded-protobuf.jar:na] > at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:42) > [akka-actor_2.10-2.2.3-shaded-protobuf.jar:na] > at > akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) > [akka-actor_2.10-2.2.3-shaded-protobuf.jar:na] > at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > [scala-library-2.10.4.jar:na] > at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > [scala-library-2.10.4.jar:na] > at > 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > [scala-library-2.10.4.jar:na] > at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > [scala-library-2.10.4.jar:na] > {code} > One remark is that this behavior is only reproducible when I call > SparkContext.cancellAllJobs() before calling SparkContext.stop() -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2645) Spark driver calls System.exit(50) after calling SparkContext.stop() the second time
[ https://issues.apache.org/jira/browse/SPARK-2645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598859#comment-14598859 ] Apache Spark commented on SPARK-2645: - User 'rekhajoshm' has created a pull request for this issue: https://github.com/apache/spark/pull/6973 > Spark driver calls System.exit(50) after calling SparkContext.stop() the > second time > - > > Key: SPARK-2645 > URL: https://issues.apache.org/jira/browse/SPARK-2645 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Vlad Komarov > > In some cases my application calls SparkContext.stop() after it has already > stopped and this leads to stopping JVM that runs spark driver. > E.g > This program should run forever > {code} > JavaSparkContext context = new JavaSparkContext("spark://12.34.21.44:7077", > "DummyApp"); > try { > JavaRDD rdd = context.parallelize(Arrays.asList(1, 2, > 3)); > rdd.count(); > } catch (Throwable e) { > e.printStackTrace(); > } > try { > context.cancelAllJobs(); > context.stop(); > //call stop second time > context.stop(); > } catch (Throwable e) { > e.printStackTrace(); > } > Thread.currentThread().join(); > {code} > but it finishes with exit code 50 after calling SparkContext.stop() the > second time. 
> Also it throws an exception like this > {code} > org.apache.spark.ServerStateException: Server is already stopped > at org.apache.spark.HttpServer.stop(HttpServer.scala:122) > ~[spark-core_2.10-1.0.0.jar:1.0.0] > at org.apache.spark.HttpFileServer.stop(HttpFileServer.scala:48) > ~[spark-core_2.10-1.0.0.jar:1.0.0] > at org.apache.spark.SparkEnv.stop(SparkEnv.scala:81) > ~[spark-core_2.10-1.0.0.jar:1.0.0] > at org.apache.spark.SparkContext.stop(SparkContext.scala:984) > ~[spark-core_2.10-1.0.0.jar:1.0.0] > at > org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend.dead(SparkDeploySchedulerBackend.scala:92) > ~[spark-core_2.10-1.0.0.jar:1.0.0] > at > org.apache.spark.deploy.client.AppClient$ClientActor.markDead(AppClient.scala:178) > ~[spark-core_2.10-1.0.0.jar:1.0.0] > at > org.apache.spark.deploy.client.AppClient$ClientActor$$anonfun$registerWithMaster$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AppClient.scala:96) > ~[spark-core_2.10-1.0.0.jar:1.0.0] > at org.apache.spark.util.Utils$.tryOrExit(Utils.scala:790) > ~[spark-core_2.10-1.0.0.jar:1.0.0] > at > org.apache.spark.deploy.client.AppClient$ClientActor$$anonfun$registerWithMaster$1.apply$mcV$sp(AppClient.scala:91) > [spark-core_2.10-1.0.0.jar:1.0.0] > at akka.actor.Scheduler$$anon$9.run(Scheduler.scala:80) > [akka-actor_2.10-2.2.3-shaded-protobuf.jar:na] > at > akka.actor.LightArrayRevolverScheduler$$anon$3$$anon$2.run(Scheduler.scala:241) > [akka-actor_2.10-2.2.3-shaded-protobuf.jar:na] > at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:42) > [akka-actor_2.10-2.2.3-shaded-protobuf.jar:na] > at > akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) > [akka-actor_2.10-2.2.3-shaded-protobuf.jar:na] > at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > [scala-library-2.10.4.jar:na] > at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > [scala-library-2.10.4.jar:na] > at > 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > [scala-library-2.10.4.jar:na] > at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > [scala-library-2.10.4.jar:na] > {code} > One remark is that this behavior is only reproducible when I call > SparkContext.cancellAllJobs() before calling SparkContext.stop() -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-2645) Spark driver calls System.exit(50) after calling SparkContext.stop() the second time
[ https://issues.apache.org/jira/browse/SPARK-2645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-2645: --- Assignee: (was: Apache Spark) > Spark driver calls System.exit(50) after calling SparkContext.stop() the > second time > - > > Key: SPARK-2645 > URL: https://issues.apache.org/jira/browse/SPARK-2645 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Vlad Komarov > > In some cases my application calls SparkContext.stop() after it has already > stopped and this leads to stopping JVM that runs spark driver. > E.g > This program should run forever > {code} > JavaSparkContext context = new JavaSparkContext("spark://12.34.21.44:7077", > "DummyApp"); > try { > JavaRDD rdd = context.parallelize(Arrays.asList(1, 2, > 3)); > rdd.count(); > } catch (Throwable e) { > e.printStackTrace(); > } > try { > context.cancelAllJobs(); > context.stop(); > //call stop second time > context.stop(); > } catch (Throwable e) { > e.printStackTrace(); > } > Thread.currentThread().join(); > {code} > but it finishes with exit code 50 after calling SparkContext.stop() the > second time. 
> Also it throws an exception like this > {code} > org.apache.spark.ServerStateException: Server is already stopped > at org.apache.spark.HttpServer.stop(HttpServer.scala:122) > ~[spark-core_2.10-1.0.0.jar:1.0.0] > at org.apache.spark.HttpFileServer.stop(HttpFileServer.scala:48) > ~[spark-core_2.10-1.0.0.jar:1.0.0] > at org.apache.spark.SparkEnv.stop(SparkEnv.scala:81) > ~[spark-core_2.10-1.0.0.jar:1.0.0] > at org.apache.spark.SparkContext.stop(SparkContext.scala:984) > ~[spark-core_2.10-1.0.0.jar:1.0.0] > at > org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend.dead(SparkDeploySchedulerBackend.scala:92) > ~[spark-core_2.10-1.0.0.jar:1.0.0] > at > org.apache.spark.deploy.client.AppClient$ClientActor.markDead(AppClient.scala:178) > ~[spark-core_2.10-1.0.0.jar:1.0.0] > at > org.apache.spark.deploy.client.AppClient$ClientActor$$anonfun$registerWithMaster$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AppClient.scala:96) > ~[spark-core_2.10-1.0.0.jar:1.0.0] > at org.apache.spark.util.Utils$.tryOrExit(Utils.scala:790) > ~[spark-core_2.10-1.0.0.jar:1.0.0] > at > org.apache.spark.deploy.client.AppClient$ClientActor$$anonfun$registerWithMaster$1.apply$mcV$sp(AppClient.scala:91) > [spark-core_2.10-1.0.0.jar:1.0.0] > at akka.actor.Scheduler$$anon$9.run(Scheduler.scala:80) > [akka-actor_2.10-2.2.3-shaded-protobuf.jar:na] > at > akka.actor.LightArrayRevolverScheduler$$anon$3$$anon$2.run(Scheduler.scala:241) > [akka-actor_2.10-2.2.3-shaded-protobuf.jar:na] > at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:42) > [akka-actor_2.10-2.2.3-shaded-protobuf.jar:na] > at > akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) > [akka-actor_2.10-2.2.3-shaded-protobuf.jar:na] > at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > [scala-library-2.10.4.jar:na] > at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > [scala-library-2.10.4.jar:na] > at > 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > [scala-library-2.10.4.jar:na] > at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > [scala-library-2.10.4.jar:na] > {code} > One remark is that this behavior is only reproducible when I call > SparkContext.cancellAllJobs() before calling SparkContext.stop() -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8586) SQL add jar command does not work well with Scala REPL
Yin Huai created SPARK-8586: --- Summary: SQL add jar command does not work well with Scala REPL Key: SPARK-8586 URL: https://issues.apache.org/jira/browse/SPARK-8586 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Yin Huai Assignee: Yin Huai Priority: Critical Seems SparkIMain always resets the context class loader in {{loadAndRunReq}}. So, SerDe added through add jar command may not be loaded in the context class loader when we lookup the table. For example, the following code will fail when we try to show the table. {code} hive.sql("add jar sql/hive/src/test/resources/hive-hcatalog-core-0.13.1.jar") hive.sql("drop table if exists jsonTable") hive.sql("CREATE TABLE jsonTable(key int, val string) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'") hive.createDataFrame((1 to 100).map(i => (i, s"str$i"))).toDF("key", "val").insertInto("jsonTable") hive.table("jsonTable").show {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8585) Support LATERAL VIEW in Spark SQL parser
Konstantin Shaposhnikov created SPARK-8585: -- Summary: Support LATERAL VIEW in Spark SQL parser Key: SPARK-8585 URL: https://issues.apache.org/jira/browse/SPARK-8585 Project: Spark Issue Type: Improvement Reporter: Konstantin Shaposhnikov It would be good to support LATERAL VIEW SQL syntax without need to create HiveContext. Docs: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-5768) Spark UI Shows incorrect memory under Yarn
[ https://issues.apache.org/jira/browse/SPARK-5768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-5768: --- Assignee: Apache Spark > Spark UI Shows incorrect memory under Yarn > -- > > Key: SPARK-5768 > URL: https://issues.apache.org/jira/browse/SPARK-5768 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 1.2.0, 1.2.1 > Environment: Centos 6 >Reporter: Al M >Assignee: Apache Spark >Priority: Trivial > > I am running Spark on Yarn with 2 executors. The executors are running on > separate physical machines. > I have spark.executor.memory set to '40g'. This is because I want to have > 40g of memory used on each machine. I have one executor per machine. > When I run my application I see from 'top' that both my executors are using > the full 40g of memory I allocated to them. > The 'Executors' tab in the Spark UI shows something different. It shows the > memory used as a total of 20GB per executor e.g. x / 20.3GB. This makes it > look like I only have 20GB available per executor when really I have 40GB > available. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5768) Spark UI Shows incorrect memory under Yarn
[ https://issues.apache.org/jira/browse/SPARK-5768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598815#comment-14598815 ] Apache Spark commented on SPARK-5768: - User 'rekhajoshm' has created a pull request for this issue: https://github.com/apache/spark/pull/6972 > Spark UI Shows incorrect memory under Yarn > -- > > Key: SPARK-5768 > URL: https://issues.apache.org/jira/browse/SPARK-5768 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 1.2.0, 1.2.1 > Environment: Centos 6 >Reporter: Al M >Priority: Trivial > > I am running Spark on Yarn with 2 executors. The executors are running on > separate physical machines. > I have spark.executor.memory set to '40g'. This is because I want to have > 40g of memory used on each machine. I have one executor per machine. > When I run my application I see from 'top' that both my executors are using > the full 40g of memory I allocated to them. > The 'Executors' tab in the Spark UI shows something different. It shows the > memory used as a total of 20GB per executor e.g. x / 20.3GB. This makes it > look like I only have 20GB available per executor when really I have 40GB > available. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-5768) Spark UI Shows incorrect memory under Yarn
[ https://issues.apache.org/jira/browse/SPARK-5768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-5768: --- Assignee: (was: Apache Spark) > Spark UI Shows incorrect memory under Yarn > -- > > Key: SPARK-5768 > URL: https://issues.apache.org/jira/browse/SPARK-5768 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 1.2.0, 1.2.1 > Environment: Centos 6 >Reporter: Al M >Priority: Trivial > > I am running Spark on Yarn with 2 executors. The executors are running on > separate physical machines. > I have spark.executor.memory set to '40g'. This is because I want to have > 40g of memory used on each machine. I have one executor per machine. > When I run my application I see from 'top' that both my executors are using > the full 40g of memory I allocated to them. > The 'Executors' tab in the Spark UI shows something different. It shows the > memory used as a total of 20GB per executor e.g. x / 20.3GB. This makes it > look like I only have 20GB available per executor when really I have 40GB > available. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
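The "x / 20.3GB" figure is consistent with the UI reporting only the storage portion of the heap rather than the full 40g. Assuming the 1.2.x defaults of spark.storage.memoryFraction=0.6 and a safety fraction of 0.9 (an assumption here; check the configuration docs for your version), the arithmetic looks roughly like:

```python
def storage_memory_gb(executor_memory_gb,
                      memory_fraction=0.6, safety_fraction=0.9):
    """Approximate storage memory a 1.2.x-era UI reports per executor.

    The UI shows heap * memoryFraction * safetyFraction, not the full heap.
    """
    return executor_memory_gb * memory_fraction * safety_fraction
```

40 GB of executor memory then yields about 21.6 GB of storage memory; Runtime.maxMemory() itself reports somewhat less than -Xmx, which plausibly accounts for the remaining gap down to the observed 20.3 GB.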
[jira] [Commented] (SPARK-8233) misc function: hash
[ https://issues.apache.org/jira/browse/SPARK-8233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598797#comment-14598797 ] Apache Spark commented on SPARK-8233: - User 'qiansl127' has created a pull request for this issue: https://github.com/apache/spark/pull/6971 > misc function: hash > --- > > Key: SPARK-8233 > URL: https://issues.apache.org/jira/browse/SPARK-8233 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > > hash(a1[, a2...]): int > Returns a hash value of the arguments. See Hive's implementation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8233) misc function: hash
[ https://issues.apache.org/jira/browse/SPARK-8233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8233: --- Assignee: (was: Apache Spark) > misc function: hash > --- > > Key: SPARK-8233 > URL: https://issues.apache.org/jira/browse/SPARK-8233 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > > hash(a1[, a2...]): int > Returns a hash value of the arguments. See Hive's implementation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8233) misc function: hash
[ https://issues.apache.org/jira/browse/SPARK-8233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8233: --- Assignee: Apache Spark > misc function: hash > --- > > Key: SPARK-8233 > URL: https://issues.apache.org/jira/browse/SPARK-8233 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > > hash(a1[, a2...]): int > Returns a hash value of the arguments. See Hive's implementation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
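Hive combines the hashes of multiple arguments with the familiar 31-multiplier recurrence (as in java.util.Arrays.hashCode). A plain-Python sketch of that recurrence with Java int wrap-around semantics (illustrative only; Hive's per-element hashing of complex types is more involved):

```python
def hive_style_hash(*args):
    """Fold element hashes as result = result * 31 + h, as a signed 32-bit int."""
    result = 0
    for a in args:
        result = (result * 31 + hash(a)) & 0xFFFFFFFF  # wrap like a Java int
    # reinterpret the unsigned 32-bit value as signed, like Java's int
    return result - 0x100000000 if result >= 0x80000000 else result
```

For small integers Python's hash(n) == n, so hive_style_hash(1, 2) folds to 1*31 + 2.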
[jira] [Closed] (SPARK-8031) Version number written to Hive metastore is "0.13.1aa" instead of "0.13.1a"
[ https://issues.apache.org/jira/browse/SPARK-8031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rekha Joshi closed SPARK-8031. -- Resolution: Implemented > Version number written to Hive metastore is "0.13.1aa" instead of "0.13.1a" > --- > > Key: SPARK-8031 > URL: https://issues.apache.org/jira/browse/SPARK-8031 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.0, 1.2.1, 1.2.2, 1.3.0, 1.3.1, 1.4.0 >Reporter: Cheng Lian >Priority: Trivial > Fix For: 1.5.0 > > > While debugging {{CliSuite}} for 1.4.0-SNAPSHOT, noticed the following WARN > log line: > {noformat} > 15/06/02 13:40:29 WARN ObjectStore: Version information not found in > metastore. hive.metastore.schema.verification is not enabled so recording the > schema version 0.13.1aa > {noformat} > The problem is that, the version of Hive dependencies 1.4.0-SNAPSHOT uses is > {{0.13.1a}} (the one shaded by [~pwendell]), but the version showed in this > line is {{0.13.1aa}} (one more {{a}}). The WARN log itself is OK since > {{CliSuite}} initializes a brand new temporary Derby metastore. > While initializing Hive metastore, Hive calls {{ObjectStore.checkSchema()}} > and may write the "short" version string to metastore. This short version > string is defined by {{hive.version.shortname}} in the POM. However, [it was > defined as > {{0.13.1aa}}|https://github.com/pwendell/hive/commit/32e515907f0005c7a28ee388eadd1c94cf99b2d4#diff-600376dffeb79835ede4a0b285078036R62]. > Confirmed with [~pwendell] that it should be a typo. > This doesn't cause any trouble for now, but we probably want to fix this in > the future if we ever need to release another shaded version of Hive 0.13.1. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8031) Version number written to Hive metastore is "0.13.1aa" instead of "0.13.1a"
[ https://issues.apache.org/jira/browse/SPARK-8031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rekha Joshi updated SPARK-8031: --- Fix Version/s: 1.5.0 Hi. This issue is not present in 1.5.0-SNAPSHOT, where hive.version is correctly set to 0.13.1a and hive.version.short to 0.13.1. Thanks > Version number written to Hive metastore is "0.13.1aa" instead of "0.13.1a" > --- > > Key: SPARK-8031 > URL: https://issues.apache.org/jira/browse/SPARK-8031 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.0, 1.2.1, 1.2.2, 1.3.0, 1.3.1, 1.4.0 >Reporter: Cheng Lian >Priority: Trivial > Fix For: 1.5.0 > > > While debugging {{CliSuite}} for 1.4.0-SNAPSHOT, I noticed the following WARN > log line: > {noformat} > 15/06/02 13:40:29 WARN ObjectStore: Version information not found in > metastore. hive.metastore.schema.verification is not enabled so recording the > schema version 0.13.1aa > {noformat} > The problem is that the version of the Hive dependencies used by 1.4.0-SNAPSHOT is > {{0.13.1a}} (the one shaded by [~pwendell]), but the version shown in this > line is {{0.13.1aa}} (one more {{a}}). The WARN log itself is OK since > {{CliSuite}} initializes a brand new temporary Derby metastore. > While initializing the Hive metastore, Hive calls {{ObjectStore.checkSchema()}} > and may write the "short" version string to the metastore. This short version > string is defined by {{hive.version.shortname}} in the POM. However, [it was > defined as > {{0.13.1aa}}|https://github.com/pwendell/hive/commit/32e515907f0005c7a28ee388eadd1c94cf99b2d4#diff-600376dffeb79835ede4a0b285078036R62]. > Confirmed with [~pwendell] that it should be a typo. > This doesn't cause any trouble for now, but we probably want to fix this in > the future if we ever need to release another shaded version of Hive 0.13.1. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8236) misc function: crc32
[ https://issues.apache.org/jira/browse/SPARK-8236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598780#comment-14598780 ] Apache Spark commented on SPARK-8236: - User 'qiansl127' has created a pull request for this issue: https://github.com/apache/spark/pull/6970 > misc function: crc32 > > > Key: SPARK-8236 > URL: https://issues.apache.org/jira/browse/SPARK-8236 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > > crc32(string/binary): bigint > Computes a cyclic redundancy check value for string or binary argument and > returns bigint value (as of Hive 1.3.0). Example: crc32('ABC') = 2743272264. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
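Since CRC-32 is standardized, the documented semantics can be checked against any stock implementation. A minimal sketch using Python's stdlib zlib (the helper name is ours; Spark's actual expression code is Scala):

```python
import zlib

def crc32_sql(data):
    """crc32(string/binary): bigint. The mask keeps the result an unsigned
    32-bit value regardless of platform or Python version, matching the
    non-negative bigint the Hive docs describe."""
    if isinstance(data, str):
        data = data.encode("utf-8")
    return zlib.crc32(data) & 0xFFFFFFFF
```

This reproduces the documented example: crc32('ABC') = 2743272264.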
[jira] [Assigned] (SPARK-8236) misc function: crc32
[ https://issues.apache.org/jira/browse/SPARK-8236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8236: --- Assignee: (was: Apache Spark) > misc function: crc32 > > > Key: SPARK-8236 > URL: https://issues.apache.org/jira/browse/SPARK-8236 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > > crc32(string/binary): bigint > Computes a cyclic redundancy check value for string or binary argument and > returns bigint value (as of Hive 1.3.0). Example: crc32('ABC') = 2743272264. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8235) misc function: sha1 / sha
[ https://issues.apache.org/jira/browse/SPARK-8235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598779#comment-14598779 ] Apache Spark commented on SPARK-8235: - User 'qiansl127' has created a pull request for this issue: https://github.com/apache/spark/pull/6970 > misc function: sha1 / sha > - > > Key: SPARK-8235 > URL: https://issues.apache.org/jira/browse/SPARK-8235 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > > sha1(string/binary): string > sha(string/binary): string > Calculates the SHA-1 digest for string or binary and returns the value as a > hex string (as of Hive 1.3.0). Example: sha1('ABC') = > '3c01bdbb26f358bab27f267924aa2c9a03fcfdb8'. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
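The semantics above map directly onto a standard SHA-1 digest rendered as hex. A quick sketch with Python's hashlib (helper name is ours):

```python
import hashlib

def sha1_sql(data):
    """sha1(string/binary): string -- the SHA-1 digest as a 40-character
    lowercase hex string, as in the Hive example."""
    if isinstance(data, str):
        data = data.encode("utf-8")
    return hashlib.sha1(data).hexdigest()
```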
[jira] [Assigned] (SPARK-8236) misc function: crc32
[ https://issues.apache.org/jira/browse/SPARK-8236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8236: --- Assignee: Apache Spark > misc function: crc32 > > > Key: SPARK-8236 > URL: https://issues.apache.org/jira/browse/SPARK-8236 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > > crc32(string/binary): bigint > Computes a cyclic redundancy check value for string or binary argument and > returns bigint value (as of Hive 1.3.0). Example: crc32('ABC') = 2743272264. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8578) Should ignore user defined output committer when appending data
[ https://issues.apache.org/jira/browse/SPARK-8578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598774#comment-14598774 ] Apache Spark commented on SPARK-8578: - User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/6966 > Should ignore user defined output committer when appending data > --- > > Key: SPARK-8578 > URL: https://issues.apache.org/jira/browse/SPARK-8578 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.0 >Reporter: Cheng Lian >Assignee: Yin Huai > > When appending data to a file system via the Hadoop API, it's safer to ignore > user-defined output committer classes like {{DirectParquetOutputCommitter}}, > because it's relatively hard to handle task failures in this case. For > example, {{DirectParquetOutputCommitter}} writes directly to the output > directory to boost write performance when working with S3. However, there's > no general way to determine the output file path of a specific task in the > Hadoop API, thus we don't know how to revert a failed append job. (When doing > an overwrite, we can just remove the whole output directory.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
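The rationale above boils down to a mode check when picking the committer class. A toy sketch of that decision (names are illustrative, not Spark's actual API):

```python
def select_committer(is_append, user_committer, default_committer):
    """On append, ignore any user-configured (possibly direct) committer and
    fall back to the default: a direct committer writes straight into the
    destination directory, so a failed append cannot be rolled back, whereas
    a failed overwrite can always be reverted by deleting the output dir."""
    if is_append:
        return default_committer
    return user_committer if user_committer is not None else default_committer
```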
[jira] [Assigned] (SPARK-8578) Should ignore user defined output committer when appending data
[ https://issues.apache.org/jira/browse/SPARK-8578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8578: --- Assignee: Yin Huai (was: Apache Spark) > Should ignore user defined output committer when appending data > --- > > Key: SPARK-8578 > URL: https://issues.apache.org/jira/browse/SPARK-8578 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.0 >Reporter: Cheng Lian >Assignee: Yin Huai > > When appending data to a file system via the Hadoop API, it's safer to ignore > user-defined output committer classes like {{DirectParquetOutputCommitter}}, > because it's relatively hard to handle task failures in this case. For > example, {{DirectParquetOutputCommitter}} writes directly to the output > directory to boost write performance when working with S3. However, there's > no general way to determine the output file path of a specific task in the > Hadoop API, thus we don't know how to revert a failed append job. (When doing > an overwrite, we can just remove the whole output directory.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8578) Should ignore user defined output committer when appending data
[ https://issues.apache.org/jira/browse/SPARK-8578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598773#comment-14598773 ] Apache Spark commented on SPARK-8578: - User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/6964 > Should ignore user defined output committer when appending data > --- > > Key: SPARK-8578 > URL: https://issues.apache.org/jira/browse/SPARK-8578 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.0 >Reporter: Cheng Lian >Assignee: Yin Huai > > When appending data to a file system via the Hadoop API, it's safer to ignore > user-defined output committer classes like {{DirectParquetOutputCommitter}}, > because it's relatively hard to handle task failures in this case. For > example, {{DirectParquetOutputCommitter}} writes directly to the output > directory to boost write performance when working with S3. However, there's > no general way to determine the output file path of a specific task in the > Hadoop API, thus we don't know how to revert a failed append job. (When doing > an overwrite, we can just remove the whole output directory.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8578) Should ignore user defined output committer when appending data
[ https://issues.apache.org/jira/browse/SPARK-8578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8578: --- Assignee: Apache Spark (was: Yin Huai) > Should ignore user defined output committer when appending data > --- > > Key: SPARK-8578 > URL: https://issues.apache.org/jira/browse/SPARK-8578 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.0 >Reporter: Cheng Lian >Assignee: Apache Spark > > When appending data to a file system via the Hadoop API, it's safer to ignore > user-defined output committer classes like {{DirectParquetOutputCommitter}}, > because it's relatively hard to handle task failures in this case. For > example, {{DirectParquetOutputCommitter}} writes directly to the output > directory to boost write performance when working with S3. However, there's > no general way to determine the output file path of a specific task in the > Hadoop API, thus we don't know how to revert a failed append job. (When doing > an overwrite, we can just remove the whole output directory.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8393) JavaStreamingContext#awaitTermination() throws non-declared InterruptedException
[ https://issues.apache.org/jira/browse/SPARK-8393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8393: --- Assignee: Apache Spark > JavaStreamingContext#awaitTermination() throws non-declared > InterruptedException > > > Key: SPARK-8393 > URL: https://issues.apache.org/jira/browse/SPARK-8393 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.3.1 >Reporter: Jaromir Vanek >Assignee: Apache Spark >Priority: Trivial > > A call to {{JavaStreamingContext#awaitTermination()}} can throw > {{InterruptedException}}, which cannot be caught easily in Java because it is > not declared with a {{@throws(classOf[InterruptedException])}} annotation. > This {{InterruptedException}} comes originally from {{ContextWaiter}}, where > a Java {{ReentrantLock}} is used. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8393) JavaStreamingContext#awaitTermination() throws non-declared InterruptedException
[ https://issues.apache.org/jira/browse/SPARK-8393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8393: --- Assignee: (was: Apache Spark) > JavaStreamingContext#awaitTermination() throws non-declared > InterruptedException > > > Key: SPARK-8393 > URL: https://issues.apache.org/jira/browse/SPARK-8393 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.3.1 >Reporter: Jaromir Vanek >Priority: Trivial > > A call to {{JavaStreamingContext#awaitTermination()}} can throw > {{InterruptedException}}, which cannot be caught easily in Java because it is > not declared with a {{@throws(classOf[InterruptedException])}} annotation. > This {{InterruptedException}} comes originally from {{ContextWaiter}}, where > a Java {{ReentrantLock}} is used. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8393) JavaStreamingContext#awaitTermination() throws non-declared InterruptedException
[ https://issues.apache.org/jira/browse/SPARK-8393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598769#comment-14598769 ] Apache Spark commented on SPARK-8393: - User 'rekhajoshm' has created a pull request for this issue: https://github.com/apache/spark/pull/6969 > JavaStreamingContext#awaitTermination() throws non-declared > InterruptedException > > > Key: SPARK-8393 > URL: https://issues.apache.org/jira/browse/SPARK-8393 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.3.1 >Reporter: Jaromir Vanek >Priority: Trivial > > A call to {{JavaStreamingContext#awaitTermination()}} can throw > {{InterruptedException}}, which cannot be caught easily in Java because it is > not declared with a {{@throws(classOf[InterruptedException])}} annotation. > This {{InterruptedException}} comes originally from {{ContextWaiter}}, where > a Java {{ReentrantLock}} is used. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-8553) Resuming Checkpointed QueueStream Fails
[ https://issues.apache.org/jira/browse/SPARK-8553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das closed SPARK-8553. Resolution: Won't Fix > Resuming Checkpointed QueueStream Fails > --- > > Key: SPARK-8553 > URL: https://issues.apache.org/jira/browse/SPARK-8553 > Project: Spark > Issue Type: Bug > Components: PySpark, Streaming >Affects Versions: 1.4.0 >Reporter: Shaanan Cohney > > After using a QueueStream within a checkpointed StreamingContext, when the > context is resumed the following error is triggered: > {code} > 15/06/23 02:33:09 WARN QueueInputDStream: isTimeValid called with > 1434987594000 ms where as last valid time is 1434987678000 ms > 15/06/23 02:33:09 ERROR StreamingContext: Error starting the context, marking > it as stopped > org.apache.spark.SparkException: RDD transformations and actions can only be > invoked by the driver, not inside of other transformations; for example, > rdd1.map(x => rdd2.values.count() * x) is invalid because the values > transformation and count action cannot be performed inside of the rdd1.map > transformation. For more information, see SPARK-5063. 
> at org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$sc(RDD.scala:87) > at org.apache.spark.rdd.RDD.persist(RDD.scala:162) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$apply$8.apply(DStream.scala:357) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$apply$8.apply(DStream.scala:354) > at scala.Option.foreach(Option.scala:236) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:354) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:342) > at scala.Option.orElse(Option.scala:257) > at > org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:339) > at > org.apache.spark.streaming.api.python.PythonTransformedDStream.compute(PythonDStream.scala:195) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:350) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:350) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:349) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:349) > at > org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:399) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:344) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:342) > at scala.Option.orElse(Option.scala:257) > at > org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:339) > at > org.apache.spark.streaming.api.python.PythonStateDStream.compute(PythonDStream.scala:242) > at > 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:350) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:350) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:349) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:349) > at > org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:399) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:344) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:342) > at scala.Option.orElse(Option.scala:257) > at > org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:339) > at > org.apache.spark.streaming.api.python.PythonStateDStream.compute(PythonDStream.scala:241) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:350) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:350) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:349) > at > org.apache.spark.streaming.dstream.DStream$$anon
[jira] [Commented] (SPARK-8553) Resuming Checkpointed QueueStream Fails
[ https://issues.apache.org/jira/browse/SPARK-8553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598725#comment-14598725 ] Tathagata Das commented on SPARK-8553: -- Yes, this is not a supported feature, and it's pretty hard to support. Recovering a streaming context requires recovering all the data the context needs in order to recover. Since arbitrary RDDs get added to a queueStream, there is no way to recover the data of those RDDs, so this is not a feature that we will support. Yes, we should document this for queueStream. I am marking this JIRA as Won't Fix > Resuming Checkpointed QueueStream Fails > --- > > Key: SPARK-8553 > URL: https://issues.apache.org/jira/browse/SPARK-8553 > Project: Spark > Issue Type: Bug > Components: PySpark, Streaming >Affects Versions: 1.4.0 >Reporter: Shaanan Cohney > > After using a QueueStream within a checkpointed StreamingContext, when the > context is resumed the following error is triggered: > {code} > 15/06/23 02:33:09 WARN QueueInputDStream: isTimeValid called with > 1434987594000 ms where as last valid time is 1434987678000 ms > 15/06/23 02:33:09 ERROR StreamingContext: Error starting the context, marking > it as stopped > org.apache.spark.SparkException: RDD transformations and actions can only be > invoked by the driver, not inside of other transformations; for example, > rdd1.map(x => rdd2.values.count() * x) is invalid because the values > transformation and count action cannot be performed inside of the rdd1.map > transformation. For more information, see SPARK-5063. 
> at org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$sc(RDD.scala:87) > at org.apache.spark.rdd.RDD.persist(RDD.scala:162) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$apply$8.apply(DStream.scala:357) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$apply$8.apply(DStream.scala:354) > at scala.Option.foreach(Option.scala:236) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:354) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:342) > at scala.Option.orElse(Option.scala:257) > at > org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:339) > at > org.apache.spark.streaming.api.python.PythonTransformedDStream.compute(PythonDStream.scala:195) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:350) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:350) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:349) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:349) > at > org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:399) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:344) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:342) > at scala.Option.orElse(Option.scala:257) > at > org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:339) > at > org.apache.spark.streaming.api.python.PythonStateDStream.compute(PythonDStream.scala:242) > at > 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:350) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:350) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:349) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:349) > at > org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:399) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:344) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:342) > at scala.Option.orElse(Option.scala:257) > at > org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:339) > at > org.apache.spark.streaming.api.python.PythonStateDStream.compute(PythonDStream.scala:241) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1
[jira] [Comment Edited] (SPARK-1503) Implement Nesterov's accelerated first-order method
[ https://issues.apache.org/jira/browse/SPARK-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598709#comment-14598709 ] Aaron Staple edited comment on SPARK-1503 at 6/24/15 1:46 AM: -- I believe this stopping criterion was added after the paper was written. It is documented on page 8 of the userguide (https://github.com/cvxr/TFOCS/raw/master/userguide.pdf) but unfortunately no explanation is provided. (The userguide also documents this as a <= test, while the current code uses <.) And unfortunately I couldn’t find an explanation in the code or git history. I think the switch to absolute tolerance may be because a relative difference measurement could be less useful when the weights are extremely small, and 1 is a convenient cutoff point. (Using 1, the equation is simple and the interpretation is clear.) I believe [~mengxr] alluded to switching to an absolute tolerance at 1 already (https://github.com/apache/spark/pull/3636#discussion_r22078041) so he might be able to provide more information. With regard to using the new weight norms as the basis for measuring relative weight difference, I think that if the convergence test passes using either the old or new weight norms, then the old and new norms are going to be very similar. It may not make a significant difference which test is used. (It may also be worth pointing out that in cases where the tolerance tests with respect to different old/new weights return different results, if the tolerance wrt new weights is met (and wrt old weights is not) then the weight norm increased slightly; if the tolerance wrt the old weights is met (and wrt new weights not) then the weight norm decreased slightly.) Finally, TFOCS adopts a policy of skipping the convergence test on the first iteration if the weights are unchanged. I believe this condition is based on implementation-specific behavior and does not need to be adopted generally. 
was (Author: staple): I believe this stopping criteria was added after the paper was written. It is documented on page 8 of the userguide (https://github.com/cvxr/TFOCS/raw/master/userguide.pdf) but unfortunately no explanation is provided. (The userguide also documents this as a <= test, while the current code uses <.) And unfortunately I couldn’t find an explanation in the code or git history. I think the switch to absolute tolerance may be because a relative difference measurement could be less useful when the weights are extremely small, and 1 is a convenient cutoff point. (Using 1, the equation is simple and the interpretation is clear.) I believe [~mengxr] alluded to switching to an absolute tolerance at 1 already (https://github.com/apache/spark/pull/3636#discussion_r22078041) so he might be able to provide more information. With regard to using the new weight norms as the basis for measuring relative weight difference, I think that if the convergence test passes using either the old or new weight norms, then the old and new norms are going to be very similar. It may not make a significant difference which test is used. (It may also be worth pointing out that in cases where the tolerance tests with respect to different old/new weights return different results, if the tolerance wrt new weights is met (and wrt old weights is not) then the weight norm increased slightly; if the tolerance wrt the old weights is met (and wrt new weights not) then we weight norm decreased slightly.) Finally, TFOCS adopts a policy of skipping the convergence test after the first iteration if the weights are unchanged. I believe this condition is based on implementation specific behavior and does not need to be adopted generally. 
> Implement Nesterov's accelerated first-order method > --- > > Key: SPARK-1503 > URL: https://issues.apache.org/jira/browse/SPARK-1503 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Xiangrui Meng >Assignee: Aaron Staple > Attachments: linear.png, linear_l1.png, logistic.png, logistic_l2.png > > > Nesterov's accelerated first-order method is a drop-in replacement for > steepest descent but it converges much faster. We should implement this > method and compare its performance with existing algorithms, including SGD > and L-BFGS. > TFOCS (http://cvxr.com/tfocs/) is a reference implementation of Nesterov's > method and its variants on composite objectives. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1503) Implement Nesterov's accelerated first-order method
[ https://issues.apache.org/jira/browse/SPARK-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598709#comment-14598709 ] Aaron Staple commented on SPARK-1503: - I believe this stopping criteria was added after the paper was written. It is documented on page 8 of the userguide (https://github.com/cvxr/TFOCS/raw/master/userguide.pdf) but unfortunately no explanation is provided. (The userguide also documents this as a <= test, while the current code uses <.) And unfortunately I couldn’t find an explanation in the code or git history. I think the switch to absolute tolerance may be because a relative difference measurement could be less useful when the weights are extremely small, and 1 is a convenient cutoff point. (Using 1, the equation is simple and the interpretation is clear.) I believe [~mengxr] alluded to switching to an absolute tolerance at 1 already (https://github.com/apache/spark/pull/3636#discussion_r22078041) so he might be able to provide more information. With regard to using the new weight norms as the basis for measuring relative weight difference, I think that if the convergence test passes using either the old or new weight norms, then the old and new norms are going to be very similar. It may not make a significant difference which test is used. (It may also be worth pointing out that in cases where the tolerance tests with respect to different old/new weights return different results, if the tolerance wrt new weights is met (and wrt old weights is not) then the weight norm increased slightly; if the tolerance wrt the old weights is met (and wrt new weights not) then we weight norm decreased slightly.) Finally, TFOCS adopts a policy of skipping the convergence test after the first iteration if the weights are unchanged. I believe this condition is based on implementation specific behavior and does not need to be adopted generally. 
> Implement Nesterov's accelerated first-order method > --- > > Key: SPARK-1503 > URL: https://issues.apache.org/jira/browse/SPARK-1503 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Xiangrui Meng >Assignee: Aaron Staple > Attachments: linear.png, linear_l1.png, logistic.png, logistic_l2.png > > > Nesterov's accelerated first-order method is a drop-in replacement for > steepest descent but it converges much faster. We should implement this > method and compare its performance with existing algorithms, including SGD > and L-BFGS. > TFOCS (http://cvxr.com/tfocs/) is a reference implementation of Nesterov's > method and its variants on composite objectives. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
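The stopping test discussed in the comments above is easy to state concretely. Below is a sketch of the TFOCS-style criterion as described there (relative difference against the previous iterate's norm, switching to an absolute tolerance once that norm is at most 1, with the strict "<" the current code uses); function names are ours, not TFOCS's or MLlib's:

```python
import math

def l2(v):
    """Euclidean norm of a vector (any iterable of floats)."""
    return math.sqrt(sum(x * x for x in v))

def converged(w_old, w_new, tol):
    """norm(w_new - w_old) < tol * max(1, norm(w_old)): a relative tolerance
    for large weight vectors, degrading to an absolute tolerance (cutoff
    at norm 1) when the weights are small."""
    diff = l2(a - b for a, b in zip(w_new, w_old))
    return diff < tol * max(1.0, l2(w_old))
```

Using the new weights' norm instead of the old one in the max() would, as the comment argues, rarely change the outcome, since whenever either test passes the two norms are nearly equal.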
[jira] [Assigned] (SPARK-8581) Simplify and clean up the checkpointing code
[ https://issues.apache.org/jira/browse/SPARK-8581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8581: --- Assignee: Andrew Or (was: Apache Spark) > Simplify and clean up the checkpointing code > > > Key: SPARK-8581 > URL: https://issues.apache.org/jira/browse/SPARK-8581 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Minor > > It is an old piece of code and a little overly complex at the moment. We can > rewrite this to improve the readability and preserve exactly the same > semantics. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8581) Simplify and clean up the checkpointing code
[ https://issues.apache.org/jira/browse/SPARK-8581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598706#comment-14598706 ] Apache Spark commented on SPARK-8581: - User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/6968 > Simplify and clean up the checkpointing code > > > Key: SPARK-8581 > URL: https://issues.apache.org/jira/browse/SPARK-8581 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Minor > > It is an old piece of code and a little overly complex at the moment. We can > rewrite this to improve the readability and preserve exactly the same > semantics.
[jira] [Assigned] (SPARK-8584) Better exception message if invalid checkpoint dir is specified
[ https://issues.apache.org/jira/browse/SPARK-8584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8584: --- Assignee: Apache Spark (was: Andrew Or) > Better exception message if invalid checkpoint dir is specified > --- > > Key: SPARK-8584 > URL: https://issues.apache.org/jira/browse/SPARK-8584 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Andrew Or >Assignee: Apache Spark > > If we're running Spark on a cluster, the checkpoint dir must be a non-local > path. Otherwise, the attempt to read from a checkpoint will fail because the > checkpoint files are written on the executors, not on the driver. > Currently, the error message that you get looks something like the following, > which is not super intuitive: > {code} > Checkpoint RDD 3 (0) has different number of partitions than original RDD 2 > (100) > {code}
[jira] [Assigned] (SPARK-8584) Better exception message if invalid checkpoint dir is specified
[ https://issues.apache.org/jira/browse/SPARK-8584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8584: --- Assignee: Andrew Or (was: Apache Spark) > Better exception message if invalid checkpoint dir is specified > --- > > Key: SPARK-8584 > URL: https://issues.apache.org/jira/browse/SPARK-8584 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Andrew Or >Assignee: Andrew Or > > If we're running Spark on a cluster, the checkpoint dir must be a non-local > path. Otherwise, the attempt to read from a checkpoint will fail because the > checkpoint files are written on the executors, not on the driver. > Currently, the error message that you get looks something like the following, > which is not super intuitive: > {code} > Checkpoint RDD 3 (0) has different number of partitions than original RDD 2 > (100) > {code}
[jira] [Assigned] (SPARK-8581) Simplify and clean up the checkpointing code
[ https://issues.apache.org/jira/browse/SPARK-8581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8581: --- Assignee: Apache Spark (was: Andrew Or) > Simplify and clean up the checkpointing code > > > Key: SPARK-8581 > URL: https://issues.apache.org/jira/browse/SPARK-8581 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Andrew Or >Assignee: Apache Spark >Priority: Minor > > It is an old piece of code and a little overly complex at the moment. We can > rewrite this to improve the readability and preserve exactly the same > semantics.
[jira] [Commented] (SPARK-8584) Better exception message if invalid checkpoint dir is specified
[ https://issues.apache.org/jira/browse/SPARK-8584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598707#comment-14598707 ] Apache Spark commented on SPARK-8584: - User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/6968 > Better exception message if invalid checkpoint dir is specified > --- > > Key: SPARK-8584 > URL: https://issues.apache.org/jira/browse/SPARK-8584 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Andrew Or >Assignee: Andrew Or > > If we're running Spark on a cluster, the checkpoint dir must be a non-local > path. Otherwise, the attempt to read from a checkpoint will fail because the > checkpoint files are written on the executors, not on the driver. > Currently, the error message that you get looks something like the following, > which is not super intuitive: > {code} > Checkpoint RDD 3 (0) has different number of partitions than original RDD 2 > (100) > {code}
[jira] [Commented] (SPARK-8583) Refactor python/run-tests to integrate with dev/run-test's module system
[ https://issues.apache.org/jira/browse/SPARK-8583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598702#comment-14598702 ] Apache Spark commented on SPARK-8583: - User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/6967 > Refactor python/run-tests to integrate with dev/run-test's module system > > > Key: SPARK-8583 > URL: https://issues.apache.org/jira/browse/SPARK-8583 > Project: Spark > Issue Type: Improvement > Components: Build, Project Infra, PySpark >Reporter: Josh Rosen >Assignee: Josh Rosen > > We should refactor the {{python/run-tests}} script to be written in Python > and integrate with the recent {{dev/run-tests}} module system so that we can > more granularly skip Python tests in the pull request builder.
[jira] [Assigned] (SPARK-8583) Refactor python/run-tests to integrate with dev/run-test's module system
[ https://issues.apache.org/jira/browse/SPARK-8583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8583: --- Assignee: Josh Rosen (was: Apache Spark) > Refactor python/run-tests to integrate with dev/run-test's module system > > > Key: SPARK-8583 > URL: https://issues.apache.org/jira/browse/SPARK-8583 > Project: Spark > Issue Type: Improvement > Components: Build, Project Infra, PySpark >Reporter: Josh Rosen >Assignee: Josh Rosen > > We should refactor the {{python/run-tests}} script to be written in Python > and integrate with the recent {{dev/run-tests}} module system so that we can > more granularly skip Python tests in the pull request builder.
[jira] [Assigned] (SPARK-8583) Refactor python/run-tests to integrate with dev/run-test's module system
[ https://issues.apache.org/jira/browse/SPARK-8583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8583: --- Assignee: Apache Spark (was: Josh Rosen) > Refactor python/run-tests to integrate with dev/run-test's module system > > > Key: SPARK-8583 > URL: https://issues.apache.org/jira/browse/SPARK-8583 > Project: Spark > Issue Type: Improvement > Components: Build, Project Infra, PySpark >Reporter: Josh Rosen >Assignee: Apache Spark > > We should refactor the {{python/run-tests}} script to be written in Python > and integrate with the recent {{dev/run-tests}} module system so that we can > more granularly skip Python tests in the pull request builder.
[jira] [Created] (SPARK-8584) Better exception message if invalid checkpoint dir is specified
Andrew Or created SPARK-8584: Summary: Better exception message if invalid checkpoint dir is specified Key: SPARK-8584 URL: https://issues.apache.org/jira/browse/SPARK-8584 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Andrew Or Assignee: Andrew Or If we're running Spark on a cluster, the checkpoint dir must be a non-local path. Otherwise, the attempt to read from a checkpoint will fail because the checkpoint files are written on the executors, not on the driver. Currently, the error message that you get looks something like the following, which is not super intuitive: {code} Checkpoint RDD 3 (0) has different number of partitions than original RDD 2 (100) {code}
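One way to produce the clearer message this issue asks for is to validate the checkpoint directory up front, before any job runs. The sketch below is a hypothetical illustration in Python (the real check would live in Spark core, in Scala); `validate_checkpoint_dir` and its arguments are invented names.

```python
from urllib.parse import urlparse

def validate_checkpoint_dir(checkpoint_dir, master):
    """Fail fast with an explicit message instead of the confusing
    partition-count mismatch raised later at checkpoint-read time."""
    # A path with no scheme (or a file:// scheme) lives on the local filesystem.
    scheme = urlparse(checkpoint_dir).scheme
    is_local_dir = scheme in ("", "file")
    # "local", "local[4]", etc. run everything in one JVM, so a local dir is fine.
    is_cluster = not master.startswith("local")
    if is_cluster and is_local_dir:
        raise ValueError(
            "Checkpoint directory %r is on the local filesystem, but this "
            "application runs on a cluster (master=%r). Checkpoint files are "
            "written by the executors, not the driver, so the directory must "
            "be on a shared filesystem such as HDFS." % (checkpoint_dir, master))
```

The point of the design is that the error names the actual mistake (local dir on a cluster) at the moment the user makes it, rather than surfacing as a partition-count mismatch much later.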
[jira] [Created] (SPARK-8583) Refactor python/run-tests to integrate with dev/run-test's module system
Josh Rosen created SPARK-8583: - Summary: Refactor python/run-tests to integrate with dev/run-test's module system Key: SPARK-8583 URL: https://issues.apache.org/jira/browse/SPARK-8583 Project: Spark Issue Type: Improvement Components: Build, Project Infra, PySpark Reporter: Josh Rosen Assignee: Josh Rosen We should refactor the {{python/run-tests}} script to be written in Python and integrate with the recent {{dev/run-tests}} module system so that we can more granularly skip Python tests in the pull request builder.
[jira] [Updated] (SPARK-8582) Optimize checkpointing to avoid computing an RDD twice
[ https://issues.apache.org/jira/browse/SPARK-8582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-8582: - Description: In Spark, checkpointing allows the user to truncate the lineage of his RDD and save the intermediate contents to HDFS for fault tolerance. However, this is not currently implemented super efficiently: Every time we checkpoint an RDD, we actually compute it twice: once during the action that triggered the checkpointing in the first place, and once while we checkpoint (we iterate through an RDD's partitions and write them to disk). See this line for more detail: https://github.com/apache/spark/blob/0401cbaa8ee51c71f43604f338b65022a479da0a/core/src/main/scala/org/apache/spark/rdd/RDDCheckpointData.scala#L102. Instead, we should have a `CheckpointingIterator` that writes checkpoint data to HDFS while we run the action. This will speed up many usages of `RDD#checkpoint` by 2X. (Alternatively, the user can just cache the RDD before checkpointing it, but this is not always viable for very large input data. It's also not a great API to use in general.) was: In Spark, checkpointing allows the user to truncate the lineage of his RDD and save the intermediate contents to HDFS for fault tolerance. However, this is not currently implemented super efficiently: Every time we checkpoint an RDD, we actually compute it twice: once during the action that triggered the checkpointing in the first place, and once while we checkpoint (we iterate through an RDD's partitions and write them to disk). Instead, we should have a `CheckpointingIterator` that writes checkpoint data to HDFS while we run the action. This will speed up many usages of `RDD#checkpoint` by 2X. (Alternatively, the user can just cache the RDD before checkpointing it, but this is not always viable for very large input data. It's also not a great API to use in general.) 
> Optimize checkpointing to avoid computing an RDD twice > -- > > Key: SPARK-8582 > URL: https://issues.apache.org/jira/browse/SPARK-8582 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Andrew Or > > In Spark, checkpointing allows the user to truncate the lineage of his RDD > and save the intermediate contents to HDFS for fault tolerance. However, this > is not currently implemented super efficiently: > Every time we checkpoint an RDD, we actually compute it twice: once during > the action that triggered the checkpointing in the first place, and once > while we checkpoint (we iterate through an RDD's partitions and write them to > disk). See this line for more detail: > https://github.com/apache/spark/blob/0401cbaa8ee51c71f43604f338b65022a479da0a/core/src/main/scala/org/apache/spark/rdd/RDDCheckpointData.scala#L102. > Instead, we should have a `CheckpointingIterator` that writes checkpoint > data to HDFS while we run the action. This will speed up many usages of > `RDD#checkpoint` by 2X. > (Alternatively, the user can just cache the RDD before checkpointing it, but > this is not always viable for very large input data. It's also not a great > API to use in general.)
[jira] [Commented] (SPARK-8582) Optimize checkpointing to avoid computing an RDD twice
[ https://issues.apache.org/jira/browse/SPARK-8582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598692#comment-14598692 ] Andrew Or commented on SPARK-8582: -- [~tdas] also wants this. > Optimize checkpointing to avoid computing an RDD twice > -- > > Key: SPARK-8582 > URL: https://issues.apache.org/jira/browse/SPARK-8582 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Andrew Or > > In Spark, checkpointing allows the user to truncate the lineage of his RDD > and save the intermediate contents to HDFS for fault tolerance. However, this > is not currently implemented super efficiently: > Every time we checkpoint an RDD, we actually compute it twice: once during > the action that triggered the checkpointing in the first place, and once > while we checkpoint (we iterate through an RDD's partitions and write them to > disk). See this line for more detail: > https://github.com/apache/spark/blob/0401cbaa8ee51c71f43604f338b65022a479da0a/core/src/main/scala/org/apache/spark/rdd/RDDCheckpointData.scala#L102. > Instead, we should have a `CheckpointingIterator` that writes checkpoint > data to HDFS while we run the action. This will speed up many usages of > `RDD#checkpoint` by 2X. > (Alternatively, the user can just cache the RDD before checkpointing it, but > this is not always viable for very large input data. It's also not a great > API to use in general.)
[jira] [Created] (SPARK-8582) Optimize checkpointing to avoid computing an RDD twice
Andrew Or created SPARK-8582: Summary: Optimize checkpointing to avoid computing an RDD twice Key: SPARK-8582 URL: https://issues.apache.org/jira/browse/SPARK-8582 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Andrew Or In Spark, checkpointing allows the user to truncate the lineage of his RDD and save the intermediate contents to HDFS for fault tolerance. However, this is not currently implemented super efficiently: Every time we checkpoint an RDD, we actually compute it twice: once during the action that triggered the checkpointing in the first place, and once while we checkpoint (we iterate through an RDD's partitions and write them to disk). Instead, we should have a `CheckpointingIterator` that writes checkpoint data to HDFS while we run the action. This will speed up many usages of `RDD#checkpoint` by 2X. (Alternatively, the user can just cache the RDD before checkpointing it, but this is not always viable for very large input data. It's also not a great API to use in general.)
[jira] [Commented] (SPARK-8337) KafkaUtils.createDirectStream for python is lacking API/feature parity with the Scala/Java version
[ https://issues.apache.org/jira/browse/SPARK-8337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598681#comment-14598681 ] Saisai Shao commented on SPARK-8337: Hi [~juanrh], will you also address the {{OffsetRange}} problem described in SPARK-8389? > KafkaUtils.createDirectStream for python is lacking API/feature parity with > the Scala/Java version > -- > > Key: SPARK-8337 > URL: https://issues.apache.org/jira/browse/SPARK-8337 > Project: Spark > Issue Type: Bug > Components: PySpark, Streaming >Affects Versions: 1.4.0 >Reporter: Amit Ramesh >Priority: Critical > > See the following thread for context. > http://apache-spark-developers-list.1001551.n3.nabble.com/Re-Spark-1-4-Python-API-for-getting-Kafka-offsets-in-direct-mode-tt12714.html
[jira] [Updated] (SPARK-8187) date/time function: date_sub
[ https://issues.apache.org/jira/browse/SPARK-8187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-8187: -- Shepherd: Davies Liu > date/time function: date_sub > > > Key: SPARK-8187 > URL: https://issues.apache.org/jira/browse/SPARK-8187 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Adrian Wang > > date_sub(string startdate, int days): string > date_sub(date startdate, int days): date > Subtracts a number of days from startdate: date_sub('2008-12-31', 1) = > '2008-12-30'.
[jira] [Updated] (SPARK-8186) date/time function: date_add
[ https://issues.apache.org/jira/browse/SPARK-8186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-8186: -- Shepherd: Davies Liu > date/time function: date_add > > > Key: SPARK-8186 > URL: https://issues.apache.org/jira/browse/SPARK-8186 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Adrian Wang > > date_add(string startdate, int days): string > date_add(date startdate, int days): date > Adds a number of days to startdate: date_add('2008-12-31', 1) = '2009-01-01'.
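The date_add/date_sub semantics quoted in these two sub-tasks (both string and date overloads) can be illustrated with the standard library. This is only a reference sketch of the documented behavior, not the Spark SQL implementation.

```python
from datetime import date, timedelta

def date_add(startdate, days):
    """date_add('2008-12-31', 1) -> '2009-01-01'; a date input returns a date."""
    d = date.fromisoformat(startdate) if isinstance(startdate, str) else startdate
    result = d + timedelta(days=days)
    # Mirror the overloads above: string in, string out; date in, date out.
    return result.isoformat() if isinstance(startdate, str) else result

def date_sub(startdate, days):
    """date_sub('2008-12-31', 1) -> '2008-12-30'."""
    return date_add(startdate, -days)
```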
[jira] [Updated] (SPARK-8075) apply type checking interface to more expressions
[ https://issues.apache.org/jira/browse/SPARK-8075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-8075: --- Shepherd: Michael Armbrust > apply type checking interface to more expressions > - > > Key: SPARK-8075 > URL: https://issues.apache.org/jira/browse/SPARK-8075 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > > As https://github.com/apache/spark/pull/6405 has been merged, we need to > apply the type checking interface to more expressions, and finally remove the > default implementation of it in Expression.
[jira] [Created] (SPARK-8581) Simplify and clean up the checkpointing code
Andrew Or created SPARK-8581: Summary: Simplify and clean up the checkpointing code Key: SPARK-8581 URL: https://issues.apache.org/jira/browse/SPARK-8581 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Minor It is an old piece of code and a little overly complex at the moment. We can rewrite this to improve the readability and preserve exactly the same semantics.
[jira] [Commented] (SPARK-7157) Add approximate stratified sampling to DataFrame
[ https://issues.apache.org/jira/browse/SPARK-7157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598658#comment-14598658 ] Reynold Xin commented on SPARK-7157: I'm keeping this open still for the Java API. > Add approximate stratified sampling to DataFrame > > > Key: SPARK-7157 > URL: https://issues.apache.org/jira/browse/SPARK-7157 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Joseph K. Bradley >Assignee: Xiangrui Meng >Priority: Minor >
[jira] [Updated] (SPARK-7157) Add approximate stratified sampling to DataFrame
[ https://issues.apache.org/jira/browse/SPARK-7157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-7157: --- Description: (was: def sampleBy(c) > Add approximate stratified sampling to DataFrame > > > Key: SPARK-7157 > URL: https://issues.apache.org/jira/browse/SPARK-7157 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Joseph K. Bradley >Assignee: Xiangrui Meng >Priority: Minor >
[jira] [Commented] (SPARK-6666) org.apache.spark.sql.jdbc.JDBCRDD does not escape/quote column names
[ https://issues.apache.org/jira/browse/SPARK-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598655#comment-14598655 ] Justin McCarthy commented on SPARK-: Lack of quoting is also leading to errors where a column name collides with a SQL reserved word. There are some very common and useful words that are frequently used as column identifiers: http://www.postgresql.org/docs/9.0/static/sql-keywords-appendix.html Here's SQL-99's take on quoted identifiers: http://savage.net.au/SQL/sql-99.bnf.html#delimited%20identifier The fix could be as simple as: {code:title=JdbcRDD.scala} private val columnList: String = if (columns.length == 0) "1" else "\"" + columns.mkString("\",\"") + "\"" {code} > org.apache.spark.sql.jdbc.JDBCRDD does not escape/quote column names > - > > Key: SPARK- > URL: https://issues.apache.org/jira/browse/SPARK- > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0 > Environment: >Reporter: John Ferguson >Priority: Critical > > Is there a way to have JDBC DataFrames use quoted/escaped column names? > Right now, it looks like it "sees" the names correctly in the schema created > but does not escape them in the SQL it creates when they are not compliant: > org.apache.spark.sql.jdbc.JDBCRDD > > private val columnList: String = { > val sb = new StringBuilder() > columns.foreach(x => sb.append(",").append(x)) > if (sb.length == 0) "1" else sb.substring(1) > } > If you see value in this, I would take a shot at adding the quoting > (escaping) of column names here. If you don't do it, some drivers... like > postgresql's will simply fold all names to lower case when parsing the query. As you > can see in the TL;DR below that means they won't match the schema I am given. 
> TL;DR: > > I am able to connect to a Postgres database in the shell (with driver > referenced): >val jdbcDf = > sqlContext.jdbc("jdbc:postgresql://localhost/sparkdemo?user=dbuser", "sp500") > In fact when I run: >jdbcDf.registerTempTable("sp500") >val avgEPSNamed = sqlContext.sql("SELECT AVG(`Earnings/Share`) as AvgCPI > FROM sp500") > and >val avgEPSProg = jsonDf.agg(avg(jsonDf.col("Earnings/Share"))) > The values come back as expected. However, if I try: >jdbcDf.show > Or if I try > >val all = sqlContext.sql("SELECT * FROM sp500") >all.show > I get errors about column names not being found. In fact the error includes > a mention of column names all lower cased. For now I will change my schema > to be more restrictive. Right now it is, per a Stack Overflow poster, not > ANSI compliant by doing things that are allowed by ""'s in pgsql, MySQL and > SQLServer. BTW, our users are giving us tables like this... because various > tools they already use support non-compliant names. In fact, this is mild > compared to what we've had to support. > Currently the schema in question uses mixed case, quoted names with special > characters and spaces: > CREATE TABLE sp500 > ( > "Symbol" text, > "Name" text, > "Sector" text, > "Price" double precision, > "Dividend Yield" double precision, > "Price/Earnings" double precision, > "Earnings/Share" double precision, > "Book Value" double precision, > "52 week low" double precision, > "52 week high" double precision, > "Market Cap" double precision, > "EBITDA" double precision, > "Price/Sales" double precision, > "Price/Book" double precision, > "SEC Filings" text > )
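The fix suggested in the comment above amounts to SQL-99 delimited-identifier quoting. Here is a sketch of the idea in Python (the actual JDBCRDD code is Scala); it also doubles embedded quotes, which the one-line Scala fix does not handle but which the delimited-identifier rules require. The function names are illustrative, not Spark API.

```python
def quote_identifier(name):
    """Wrap a column name in double quotes, doubling any embedded double
    quotes per the SQL-99 delimited-identifier rules. Quoting also stops
    drivers such as PostgreSQL's from case-folding the name."""
    return '"' + name.replace('"', '""') + '"'

def column_list(columns):
    """Build the SELECT list; '1' when no columns are requested, matching
    the existing JDBCRDD behavior."""
    if not columns:
        return "1"
    return ",".join(quote_identifier(c) for c in columns)
```

With the sp500 schema below, this yields a SELECT list like `"Symbol","52 week low"` instead of the bare names that PostgreSQL would fold to lower case.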
[jira] [Updated] (SPARK-8580) Add Parquet files generated by different systems to test interoperability and compatibility
[ https://issues.apache.org/jira/browse/SPARK-8580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-8580: -- Issue Type: Sub-task (was: Test) Parent: SPARK-5463 > Add Parquet files generated by different systems to test interoperability and > compatibility > --- > > Key: SPARK-8580 > URL: https://issues.apache.org/jira/browse/SPARK-8580 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.5.0 >Reporter: Cheng Lian >Assignee: Cheng Lian > > As we are implementing Parquet backwards-compatibility rules for Spark 1.5.0 > to improve interoperability with other systems (reading non-standard Parquet > files they generate, and generating standard Parquet files), it would be good > to have a set of standard test Parquet files generated by various > systems/tools (parquet-thrift, parquet-avro, parquet-hive, Impala, and old > versions of Spark SQL) to ensure compatibility.
[jira] [Created] (SPARK-8580) Add Parquet files generated by different systems to test interoperability and compatibility
Cheng Lian created SPARK-8580: - Summary: Add Parquet files generated by different systems to test interoperability and compatibility Key: SPARK-8580 URL: https://issues.apache.org/jira/browse/SPARK-8580 Project: Spark Issue Type: Test Components: SQL Affects Versions: 1.5.0 Reporter: Cheng Lian Assignee: Cheng Lian As we are implementing Parquet backwards-compatibility rules for Spark 1.5.0 to improve interoperability with other systems (reading non-standard Parquet files they generate, and generating standard Parquet files), it would be good to have a set of standard test Parquet files generated by various systems/tools (parquet-thrift, parquet-avro, parquet-hive, Impala, and old versions of Spark SQL) to ensure compatibility.
[jira] [Resolved] (SPARK-8139) Documents data sources and Parquet output committer related options
[ https://issues.apache.org/jira/browse/SPARK-8139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-8139. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 6683 [https://github.com/apache/spark/pull/6683] > Documents data sources and Parquet output committer related options > --- > > Key: SPARK-8139 > URL: https://issues.apache.org/jira/browse/SPARK-8139 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.3.0, 1.3.1, 1.4.0 >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Minor > Fix For: 1.5.0 > > > Should document the following two options: > - {{spark.sql.sources.outputCommitterClass}} > - {{spark.sql.parquet.output.committer.class}}
[jira] [Commented] (SPARK-7810) rdd.py "_load_from_socket" cannot load data from jvm socket if ipv6 is used
[ https://issues.apache.org/jira/browse/SPARK-7810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598624#comment-14598624 ] Ai He commented on SPARK-7810: -- Hi, I encountered this problem one month ago and no longer have the stack trace. At the time, I looked at the port the JVM was listening on and found that only the IPv6 protocol was supported. That's why I'd like to make this improvement. For the last question, I don't quite understand what "the tree" refers to. > rdd.py "_load_from_socket" cannot load data from jvm socket if ipv6 is used > --- > > Key: SPARK-7810 > URL: https://issues.apache.org/jira/browse/SPARK-7810 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 1.3.1 >Reporter: Ai He > > Method "_load_from_socket" in rdd.py cannot load data from jvm socket if ipv6 > is used. The current method only works well with ipv4. New modification > should work around both two protocols. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (SPARK-7810) rdd.py "_load_from_socket" cannot load data from jvm socket if ipv6 is used
[ https://issues.apache.org/jira/browse/SPARK-7810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598610#comment-14598610 ] Davies Liu commented on SPARK-7810: --- What does the stack trace look like? Does the host only have IPv6? There are multiple places that don't take IPv6 into account; you can grep for `127.0.0.1` or `localhost` in the tree. Could you also fix them together? > rdd.py "_load_from_socket" cannot load data from jvm socket if ipv6 is used > --- > > Key: SPARK-7810 > URL: https://issues.apache.org/jira/browse/SPARK-7810 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 1.3.1 >Reporter: Ai He > > Method "_load_from_socket" in rdd.py cannot load data from jvm socket if ipv6 > is used. The current method only works well with ipv4. New modification > should work around both two protocols.
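On the client side, one protocol-agnostic option is `socket.create_connection`, which walks every `getaddrinfo` result in turn, so it connects whether the JVM bound an IPv4 or an IPv6 loopback socket. A hedged sketch of how `_load_from_socket` could open its connection (the helper name is invented and serializer handling is omitted):

```python
import socket

def connect_to_local_port(port, timeout=3.0):
    """Connect to a JVM-owned port on the loopback interface.

    create_connection() iterates over every address getaddrinfo() returns
    for 'localhost' (typically 127.0.0.1 and ::1), trying each family until
    one succeeds, so IPv4-only and IPv6-only JVM sockets both work.
    """
    return socket.create_connection(("localhost", port), timeout=timeout)
```

Compared with constructing a `socket.socket(socket.AF_INET, ...)` directly, this removes the hard-coded address family that breaks on IPv6-only hosts.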
[jira] [Updated] (SPARK-8445) MLlib 1.5 Roadmap
[ https://issues.apache.org/jira/browse/SPARK-8445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-8445: - Description: We expect to see many MLlib contributors for the 1.5 release. To scale out the development, we created this master list for MLlib features we plan to have in Spark 1.5. Due to limited review bandwidth, features appearing on this list will get higher priority for code review. But feel free to suggest new items to the list in comments. We are experimenting with this process. Your feedback would be greatly appreciated. h1. Instructions h2. For contributors: * Please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark carefully. Code style, documentation, and unit tests are important. * If you are a first-time Spark contributor, please always start with a starter task (TODO: add a link) rather than a medium/big feature. Based on our experience, mixing the development process with a big feature usually causes long delays in code review. * Never work silently. Let everyone know on the corresponding JIRA page when you start working on some features. This is to avoid duplicate work. For small features, you don't need to wait to get the JIRA assigned. * For medium/big features or features with dependencies, please get assigned first before coding and keep the ETA updated on the JIRA. If there is no activity on the JIRA page for a certain amount of time, the JIRA should be released for other contributors. * Do not claim multiple (>3) JIRAs at the same time. Try to finish them one after another. * Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code review greatly helps improve others' code as well as yours. h2. For committers: * Try to break down big features into small and specific JIRA tasks and link them properly. * Add "starter" label to starter tasks. * Put a rough estimate for medium/big features and track the progress. h1. Roadmap (WIP) h2. 
Algorithms and performance * LDA improvements (SPARK-5572) * Log-linear model for survival analysis (SPARK-8518) * Improve GLM's scalability on number of features (SPARK-8520) * Tree and ensembles: Move + cleanup code (SPARK-7131), provide class probabilities (SPARK-3727) * Improve GMM scalability and stability (SPARK-7206) * Frequent itemsets improvements (SPARK-7211) h2. Pipeline API * more feature transformers (SPARK-8521) * k-means (SPARK-7898) * naive Bayes h2. Model persistence * more PMML export (SPARK-8545) * model save/load (SPARK-4587) * pipeline persistence (SPARK-6725) h2. Python API for ML * List of issues identified during Spark 1.4 QA: (SPARK-7536) h2. SparkR API for ML h2. Documentation * [Search for documentation improvements | https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(Documentation)%20AND%20component%20in%20(ML%2C%20MLlib)] was: We expect to see many MLlib contributors for the 1.5 release. To scale out the development, we created this master list for MLlib features we plan to have in Spark 1.5. Due to limited review bandwidth, features appearing on this list will get higher priority for code review. But feel free to suggest new items to the list in comments. We are experimenting with this process. Your feedback would be greatly appreciated. h1. Instructions h2. For contributors: * Please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark carefully. Code style, documentation, and unit tests are important. * If you are a first-time Spark contributor, please always start with a starter task (TODO: add a link) rather than a medium/big feature. Based on our experience, mixing the development process with a big feature usually causes long delay in code review. * Never work silently. Let everyone know on the corresponding JIRA page when you start working on some features. This is to avoid duplicate work. 
For small features, you don't need to wait to get JIRA assigned. * For medium/big features or features with dependencies, please get assigned first before coding and keep the ETA updated on the JIRA. If there exist no activity on the JIRA page for a certain amount of time, the JIRA should be released for other contributors. * Do not claim multiple (>3) JIRAs at the same time. Try to finish them one after another. * Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code review greatly helps improve others' code as well as yours. h2. For committers: * Try to break down big features into small and specific JIRA tasks and link them properly. * Add "starter" label to starter tasks. * Put a rough estimate for medium/big features and track the progress. h1. Roadmap (WIP) h2. Algorithms and performance * LDA improvements (SPARK-5572) * Log-linear model for survival analysis (SPARK-8518) * Im
[jira] [Updated] (SPARK-8445) MLlib 1.5 Roadmap
[ https://issues.apache.org/jira/browse/SPARK-8445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-8445: - Description: We expect to see many MLlib contributors for the 1.5 release. To scale out the development, we created this master list for MLlib features we plan to have in Spark 1.5. Due to limited review bandwidth, features appearing on this list will get higher priority for code review. But feel free to suggest new items to the list in comments. We are experimenting with this process. Your feedback would be greatly appreciated. h1. Instructions h2. For contributors: * Please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark carefully. Code style, documentation, and unit tests are important. * If you are a first-time Spark contributor, please always start with a starter task (TODO: add a link) rather than a medium/big feature. Based on our experience, mixing the development process with a big feature usually causes long delay in code review. * Never work silently. Let everyone know on the corresponding JIRA page when you start working on some features. This is to avoid duplicate work. For small features, you don't need to wait to get JIRA assigned. * For medium/big features or features with dependencies, please get assigned first before coding and keep the ETA updated on the JIRA. If there exist no activity on the JIRA page for a certain amount of time, the JIRA should be released for other contributors. * Do not claim multiple (>3) JIRAs at the same time. Try to finish them one after another. * Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code review greatly helps improve others' code as well as yours. h2. For committers: * Try to break down big features into small and specific JIRA tasks and link them properly. * Add "starter" label to starter tasks. * Put a rough estimate for medium/big features and track the progress. h1. Roadmap (WIP) h2. 
Algorithms and performance * LDA improvements (SPARK-5572) * Log-linear model for survival analysis (SPARK-8518) * Improve GLM's scalability on number of features (SPARK-8520) * Tree and ensembles: Move + cleanup code (SPARK-7131), provide class probabilities (SPARK-3727) * Improve GMM scalability and stability (SPARK-7206) * Frequent itemsets improvements (SPARK-7211) h2. Pipeline API * more feature transformers (SPARK-8521) * k-means (SPARK-7898) * naive Bayes h2. Model persistence * more PMML export (SPARK-8545) * model save/load (SPARK-4587) * pipeline persistence (SPARK-6725) h2. Python API for ML h2. SparkR API for ML h2. Documentation * [Search for documentation improvements | https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(Documentation)%20AND%20component%20in%20(ML%2C%20MLlib)] was: We expect to see many MLlib contributors for the 1.5 release. To scale out the development, we created this master list for MLlib features we plan to have in Spark 1.5. Due to limited review bandwidth, features appearing on this list will get higher priority for code review. But feel free to suggest new items to the list in comments. We are experimenting with this process. Your feedback would be greatly appreciated. h1. Instructions h2. For contributors: * Please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark carefully. Code style, documentation, and unit tests are important. * If you are a first-time Spark contributor, please always start with a starter task (TODO: add a link) rather than a medium/big feature. Based on our experience, mixing the development process with a big feature usually causes long delay in code review. * Never work silently. Let everyone know on the corresponding JIRA page when you start working on some features. This is to avoid duplicate work. For small features, you don't need to wait to get JIRA assigned. 
* For medium/big features or features with dependencies, please get assigned first before coding and keep the ETA updated on the JIRA. If there exist no activity on the JIRA page for a certain amount of time, the JIRA should be released for other contributors. * Do not claim multiple (>3) JIRAs at the same time. Try to finish them one after another. * Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code review greatly helps improve others' code as well as yours. h2. For committers: * Try to break down big features into small and specific JIRA tasks and link them properly. * Add "starter" label to starter tasks. * Put a rough estimate for medium/big features and track the progress. h1. Roadmap (WIP) h2. Algorithms and performance * LDA improvements (SPARK-5572) * Log-linear model for survival analysis (SPARK-8518) * Improve GLM's scalability on number of features (SPARK-8520) * Tr
[jira] [Updated] (SPARK-8445) MLlib 1.5 Roadmap
[ https://issues.apache.org/jira/browse/SPARK-8445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-8445: - Description: We expect to see many MLlib contributors for the 1.5 release. To scale out the development, we created this master list for MLlib features we plan to have in Spark 1.5. Due to limited review bandwidth, features appearing on this list will get higher priority for code review. But feel free to suggest new items to the list in comments. We are experimenting with this process. Your feedback would be greatly appreciated. h1. Instructions h2. For contributors: * Please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark carefully. Code style, documentation, and unit tests are important. * If you are a first-time Spark contributor, please always start with a starter task (TODO: add a link) rather than a medium/big feature. Based on our experience, mixing the development process with a big feature usually causes long delay in code review. * Never work silently. Let everyone know on the corresponding JIRA page when you start working on some features. This is to avoid duplicate work. For small features, you don't need to wait to get JIRA assigned. * For medium/big features or features with dependencies, please get assigned first before coding and keep the ETA updated on the JIRA. If there exist no activity on the JIRA page for a certain amount of time, the JIRA should be released for other contributors. * Do not claim multiple (>3) JIRAs at the same time. Try to finish them one after another. * Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code review greatly helps improve others' code as well as yours. h2. For committers: * Try to break down big features into small and specific JIRA tasks and link them properly. * Add "starter" label to starter tasks. * Put a rough estimate for medium/big features and track the progress. h1. Roadmap (WIP) h2. 
Algorithms and performance * LDA improvements (SPARK-5572) * Log-linear model for survival analysis (SPARK-8518) * Improve GLM's scalability on number of features (SPARK-8520) * Tree and ensembles: Move + cleanup code (SPARK-7131), provide class probabilities (SPARK-3727) * Improve GMM scalability and stability (SPARK-7206) * Frequent itemsets improvements (SPARK-7211) h2. Pipeline API * more feature transformers (SPARK-8521) * k-means (SPARK-7898) * naive Bayes h2. Model persistence * more PMML export (SPARK-8545) * model save/load (SPARK-4587) * pipeline persistence (SPARK-6725) h2. Python API for ML h2. SparkR API for ML h2. Documentation was: We expect to see many MLlib contributors for the 1.5 release. To scale out the development, we created this master list for MLlib features we plan to have in Spark 1.5. Due to limited review bandwidth, features appearing on this list will get higher priority for code review. But feel free to suggest new items to the list in comments. We are experimenting with this process. Your feedback would be greatly appreciated. h1. Instructions h2. For contributors: * Please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark carefully. Code style, documentation, and unit tests are important. * If you are a first-time Spark contributor, please always start with a starter task (TODO: add a link) rather than a medium/big feature. Based on our experience, mixing the development process with a big feature usually causes long delay in code review. * Never work silently. Let everyone know on the corresponding JIRA page when you start working on some features. This is to avoid duplicate work. For small features, you don't need to wait to get JIRA assigned. * For medium/big features or features with dependencies, please get assigned first before coding and keep the ETA updated on the JIRA. If there exist no activity on the JIRA page for a certain amount of time, the JIRA should be released for other contributors. 
* Do not claim multiple (>3) JIRAs at the same time. Try to finish them one after another. * Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code review greatly helps improve others' code as well as yours. h2. For committers: * Try to break down big features into small and specific JIRA tasks and link them properly. * Add "starter" label to starter tasks. * Put a rough estimate for medium/big features and track the progress. h1. Roadmap (WIP) h2. Algorithms and performance * LDA improvements (SPARK-5572) * Log-linear model for survival analysis (SPARK-8518) * Improve GLM's scalability on number of features (SPARK-8520) h2. Pipeline API * more feature transformers (SPARK-8521) * k-means (SPARK-7898) * naive Bayes h2. Model persistence * more PMML export (SPARK-8545) * model save/load (SPARK-4587) * pipeline persistence (SPARK-6725) h2. Python API for ML h2. SparkR API for M
[jira] [Commented] (SPARK-7131) Move tree,forest implementation from spark.mllib to spark.ml
[ https://issues.apache.org/jira/browse/SPARK-7131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598579#comment-14598579 ] Joseph K. Bradley commented on SPARK-7131: -- Busy this week, but I expect to begin work sometime next week. > Move tree,forest implementation from spark.mllib to spark.ml > > > Key: SPARK-7131 > URL: https://issues.apache.org/jira/browse/SPARK-7131 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 1.4.0 >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley > Original Estimate: 168h > Remaining Estimate: 168h > > We want to change and improve the spark.ml API for trees and ensembles, but > we cannot change the old API in spark.mllib. To support the changes we want > to make, we should move the implementation from spark.mllib to spark.ml. We > will generalize and modify it, but will also ensure that we do not change the > behavior of the old API. > This JIRA should be done in several PRs, in this order: > 1. Copy the implementation over to spark.ml and change the spark.ml classes > to use that implementation, rather than calling the spark.mllib > implementation. The current spark.ml tests will ensure that the 2 > implementations learn exactly the same models. Note: This should include > performance testing to make sure the updated code does not have any > regressions. > 2. Remove the spark.mllib implementation, and make the spark.mllib APIs > wrappers around the spark.ml implementation. The spark.ml tests will again > ensure that we do not change any behavior. > 3. Move the unit tests to spark.ml, and change the spark.mllib unit tests to > verify model equivalence. > After these updates, we can more safely generalize and improve the spark.ml > implementation.
[jira] [Updated] (SPARK-7131) Move tree,forest implementation from spark.mllib to spark.ml
[ https://issues.apache.org/jira/browse/SPARK-7131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-7131: - Remaining Estimate: 168h Original Estimate: 168h > Move tree,forest implementation from spark.mllib to spark.ml > > > Key: SPARK-7131 > URL: https://issues.apache.org/jira/browse/SPARK-7131 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 1.4.0 >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley > Original Estimate: 168h > Remaining Estimate: 168h > > We want to change and improve the spark.ml API for trees and ensembles, but > we cannot change the old API in spark.mllib. To support the changes we want > to make, we should move the implementation from spark.mllib to spark.ml. We > will generalize and modify it, but will also ensure that we do not change the > behavior of the old API. > This JIRA should be done in several PRs, in this order: > 1. Copy the implementation over to spark.ml and change the spark.ml classes > to use that implementation, rather than calling the spark.mllib > implementation. The current spark.ml tests will ensure that the 2 > implementations learn exactly the same models. Note: This should include > performance testing to make sure the updated code does not have any > regressions. > 2. Remove the spark.mllib implementation, and make the spark.mllib APIs > wrappers around the spark.ml implementation. The spark.ml tests will again > ensure that we do not change any behavior. > 3. Move the unit tests to spark.ml, and change the spark.mllib unit tests to > verify model equivalence. > After these updates, we can more safely generalize and improve the spark.ml > implementation.
[jira] [Commented] (SPARK-8111) SparkR shell should display Spark logo and version banner on startup
[ https://issues.apache.org/jira/browse/SPARK-8111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598517#comment-14598517 ] Shivaram Venkataraman commented on SPARK-8111: -- [~srowen] Could you help add [~aloknsingh] as a developer and assign this issue? > SparkR shell should display Spark logo and version banner on startup > > > Key: SPARK-8111 > URL: https://issues.apache.org/jira/browse/SPARK-8111 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 1.4.0 >Reporter: Matei Zaharia >Priority: Trivial > Labels: Starter >
[jira] [Resolved] (SPARK-8111) SparkR shell should display Spark logo and version banner on startup
[ https://issues.apache.org/jira/browse/SPARK-8111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-8111. -- Resolution: Fixed > SparkR shell should display Spark logo and version banner on startup > > > Key: SPARK-8111 > URL: https://issues.apache.org/jira/browse/SPARK-8111 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 1.4.0 >Reporter: Matei Zaharia >Priority: Trivial > Labels: Starter >
[jira] [Commented] (SPARK-8111) SparkR shell should display Spark logo and version banner on startup
[ https://issues.apache.org/jira/browse/SPARK-8111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598515#comment-14598515 ] Shivaram Venkataraman commented on SPARK-8111: -- Issue resolved by https://github.com/apache/spark/pull/6944 > SparkR shell should display Spark logo and version banner on startup > > > Key: SPARK-8111 > URL: https://issues.apache.org/jira/browse/SPARK-8111 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 1.4.0 >Reporter: Matei Zaharia >Priority: Trivial > Labels: Starter >
[jira] [Updated] (SPARK-8449) HDF5 read/write support for Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-8449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-8449: - Fix Version/s: (was: 1.4.1) > HDF5 read/write support for Spark MLlib > --- > > Key: SPARK-8449 > URL: https://issues.apache.org/jira/browse/SPARK-8449 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.4.0 >Reporter: Alexander Ulanov > Original Estimate: 96h > Remaining Estimate: 96h > > Add support for reading and writing HDF5 file format to/from LabeledPoint. > HDFS and local file system have to be supported. Other Spark formats to be > discussed. > Interface proposal: > /* path - directory path in any Hadoop-supported file system URI */ > MLUtils.saveAsHDF5(sc: SparkContext, path: String, RDD[LabeledPoint]): Unit > /* path - file or directory path in any Hadoop-supported file system URI */ > MLUtils.loadHDF5(sc: SparkContext, path: String): RDD[LabeledPoint]
[jira] [Updated] (SPARK-8578) Should ignore user defined output committer when appending data
[ https://issues.apache.org/jira/browse/SPARK-8578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-8578: -- Target Version/s: 1.4.1, 1.5.0 (was: 1.5.0) > Should ignore user defined output committer when appending data > --- > > Key: SPARK-8578 > URL: https://issues.apache.org/jira/browse/SPARK-8578 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.0 >Reporter: Cheng Lian >Assignee: Yin Huai > > When appending data to a file system via the Hadoop API, it's safer to ignore > user-defined output committer classes like {{DirectParquetOutputCommitter}}, > because it's relatively hard to handle task failures in this case. For > example, {{DirectParquetOutputCommitter}} writes directly to the output > directory to boost write performance when working with S3. However, there's > no general way to determine the output file path of a specific task in the > Hadoop API, so we don't know how to revert a failed append job. (When doing > an overwrite, we can just remove the whole output directory.)
[jira] [Assigned] (SPARK-8235) misc function: sha1 / sha
[ https://issues.apache.org/jira/browse/SPARK-8235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8235: --- Assignee: Apache Spark > misc function: sha1 / sha > - > > Key: SPARK-8235 > URL: https://issues.apache.org/jira/browse/SPARK-8235 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > > sha1(string/binary): string > sha(string/binary): string > Calculates the SHA-1 digest for string or binary and returns the value as a > hex string (as of Hive 1.3.0). Example: sha1('ABC') = > '3c01bdbb26f358bab27f267924aa2c9a03fcfdb8'.
[jira] [Commented] (SPARK-8235) misc function: sha1 / sha
[ https://issues.apache.org/jira/browse/SPARK-8235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598493#comment-14598493 ] Apache Spark commented on SPARK-8235: - User 'tarekauel' has created a pull request for this issue: https://github.com/apache/spark/pull/6963 > misc function: sha1 / sha > - > > Key: SPARK-8235 > URL: https://issues.apache.org/jira/browse/SPARK-8235 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > > sha1(string/binary): string > sha(string/binary): string > Calculates the SHA-1 digest for string or binary and returns the value as a > hex string (as of Hive 1.3.0). Example: sha1('ABC') = > '3c01bdbb26f358bab27f267924aa2c9a03fcfdb8'.
[jira] [Assigned] (SPARK-8235) misc function: sha1 / sha
[ https://issues.apache.org/jira/browse/SPARK-8235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8235: --- Assignee: (was: Apache Spark) > misc function: sha1 / sha > - > > Key: SPARK-8235 > URL: https://issues.apache.org/jira/browse/SPARK-8235 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > > sha1(string/binary): string > sha(string/binary): string > Calculates the SHA-1 digest for string or binary and returns the value as a > hex string (as of Hive 1.3.0). Example: sha1('ABC') = > '3c01bdbb26f358bab27f267924aa2c9a03fcfdb8'.
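The expected semantics of `sha1`/`sha` can be checked against Python's standard `hashlib`, which reproduces the Hive example quoted in the issue description. This is only a reference for the digest value, not Spark's or Hive's implementation:

```python
import hashlib

def sha1_hex(data: bytes) -> str:
    # Hex-encoded SHA-1 digest, matching the sha1()/sha() semantics
    # described in the issue: sha1('ABC') yields the hex string below.
    return hashlib.sha1(data).hexdigest()

print(sha1_hex(b"ABC"))  # 3c01bdbb26f358bab27f267924aa2c9a03fcfdb8
```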
[jira] [Commented] (SPARK-8579) Support arbitrary object in UnsafeRow
[ https://issues.apache.org/jira/browse/SPARK-8579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598481#comment-14598481 ] Apache Spark commented on SPARK-8579: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/6959 > Support arbitrary object in UnsafeRow > - > > Key: SPARK-8579 > URL: https://issues.apache.org/jira/browse/SPARK-8579 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > > It's common to run count(distinct xxx) in SQL; the data type will be a UDT of > OpenHashSet, and it would be good to use UnsafeRow to reduce the memory > usage during aggregation. > Also for DecimalType, which could be used inside the grouping key.
[jira] [Assigned] (SPARK-8579) Support arbitrary object in UnsafeRow
[ https://issues.apache.org/jira/browse/SPARK-8579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8579: --- Assignee: Davies Liu (was: Apache Spark) > Support arbitrary object in UnsafeRow > - > > Key: SPARK-8579 > URL: https://issues.apache.org/jira/browse/SPARK-8579 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > > It's common to run count(distinct xxx) in SQL; the data type will be a UDT of > OpenHashSet, and it would be good to use UnsafeRow to reduce the memory > usage during aggregation. > Also for DecimalType, which could be used inside the grouping key.
[jira] [Assigned] (SPARK-8579) Support arbitrary object in UnsafeRow
[ https://issues.apache.org/jira/browse/SPARK-8579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8579: --- Assignee: Apache Spark (was: Davies Liu) > Support arbitrary object in UnsafeRow > - > > Key: SPARK-8579 > URL: https://issues.apache.org/jira/browse/SPARK-8579 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Davies Liu >Assignee: Apache Spark > > It's common to run count(distinct xxx) in SQL; the data type will be a UDT of > OpenHashSet, and it would be good to use UnsafeRow to reduce the memory > usage during aggregation. > Also for DecimalType, which could be used inside the grouping key.
[jira] [Created] (SPARK-8579) Support arbitrary object in UnsafeRow
Davies Liu created SPARK-8579: - Summary: Support arbitrary object in UnsafeRow Key: SPARK-8579 URL: https://issues.apache.org/jira/browse/SPARK-8579 Project: Spark Issue Type: New Feature Components: SQL Reporter: Davies Liu Assignee: Davies Liu It's common to run count(distinct xxx) in SQL; the data type will be a UDT of OpenHashSet, and it would be good to use UnsafeRow to reduce the memory usage during aggregation. Also for DecimalType, which could be used inside the grouping key.
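The motivation here — a hash set as the per-group aggregation buffer for count(distinct) — can be illustrated with a plain Python sketch. The function and field names are hypothetical and unrelated to the actual UnsafeRow/OpenHashSet code:

```python
def count_distinct(rows, key):
    # Per-group aggregation buffer is a set of values seen so far;
    # count(distinct key) per group is the final size of that set.
    buffers = {}
    for row in rows:
        buffers.setdefault(row["group"], set()).add(row[key])
    return {g: len(vals) for g, vals in buffers.items()}

rows = [
    {"group": "a", "x": 1},
    {"group": "a", "x": 1},
    {"group": "a", "x": 2},
    {"group": "b", "x": 5},
]
print(count_distinct(rows, "x"))  # {'a': 2, 'b': 1}
```

Because each group carries a whole set object as its buffer, storing such buffers compactly (e.g. inside UnsafeRow) matters for memory usage during aggregation.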
[jira] [Resolved] (SPARK-8190) ExpressionEvalHelper.checkEvaluation should also run the optimizer version
[ https://issues.apache.org/jira/browse/SPARK-8190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-8190. --- Resolution: Fixed > ExpressionEvalHelper.checkEvaluation should also run the optimizer version > -- > > Key: SPARK-8190 > URL: https://issues.apache.org/jira/browse/SPARK-8190 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Davies Liu > > We should remove the existing ExpressionOptimizationSuite, and update > checkEvaluation to also run the optimizer version.
[jira] [Created] (SPARK-8578) Should ignore user defined output committer when appending data
Cheng Lian created SPARK-8578: - Summary: Should ignore user defined output committer when appending data Key: SPARK-8578 URL: https://issues.apache.org/jira/browse/SPARK-8578 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Cheng Lian Assignee: Yin Huai When appending data to a file system via the Hadoop API, it's safer to ignore user-defined output committer classes like {{DirectParquetOutputCommitter}}, because it's relatively hard to handle task failures in this case. For example, {{DirectParquetOutputCommitter}} writes directly to the output directory to boost write performance when working with S3. However, there's no general way to determine the output file path of a specific task in the Hadoop API, so we don't know how to revert a failed append job. (When doing an overwrite, we can just remove the whole output directory.)
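The safety rule described above — fall back to the default committer when appending, regardless of any user-configured class — can be sketched as follows. The names `choose_committer` and `DEFAULT_COMMITTER` are illustrative, not Spark's actual API; only the decision logic mirrors the issue description:

```python
from typing import Optional

# Hadoop's stock committer, which stages task output in a temporary
# location and only moves it on commit, so failed tasks can be reverted.
DEFAULT_COMMITTER = "org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter"

def choose_committer(save_mode: str, user_committer: Optional[str]) -> str:
    # When appending, a failed task cannot be cleanly reverted if the
    # committer (e.g. DirectParquetOutputCommitter) wrote straight into
    # the output directory, so ignore any user-defined committer class.
    if save_mode == "append":
        return DEFAULT_COMMITTER
    return user_committer or DEFAULT_COMMITTER
```

For overwrite the user's committer can still be honored, since a failed job can be recovered by removing the whole output directory.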
[jira] [Updated] (SPARK-8445) MLlib 1.5 Roadmap
[ https://issues.apache.org/jira/browse/SPARK-8445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-8445: - Description: We expect to see many MLlib contributors for the 1.5 release. To scale out the development, we created this master list for MLlib features we plan to have in Spark 1.5. Due to limited review bandwidth, features appearing on this list will get higher priority for code review. But feel free to suggest new items to the list in comments. We are experimenting with this process. Your feedback would be greatly appreciated. h1. Instructions h2. For contributors: * Please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark carefully. Code style, documentation, and unit tests are important. * If you are a first-time Spark contributor, please always start with a starter task (TODO: add a link) rather than a medium/big feature. Based on our experience, mixing the development process with a big feature usually causes long delays in code review. * Never work silently. Let everyone know on the corresponding JIRA page when you start working on some features. This is to avoid duplicate work. For small features, you don't need to wait to get the JIRA assigned. * For medium/big features or features with dependencies, please get assigned first before coding and keep the ETA updated on the JIRA. If there is no activity on the JIRA page for a certain amount of time, the JIRA should be released for other contributors. * Do not claim multiple (>3) JIRAs at the same time. Try to finish them one after another. * Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code review greatly helps improve others' code as well as yours. h2. For committers: * Try to break down big features into small and specific JIRA tasks and link them properly. * Add "starter" label to starter tasks. * Put a rough estimate for medium/big features and track the progress. h1. Roadmap (WIP) h2. 
Algorithms and performance * LDA improvements (SPARK-5572) * Log-linear model for survival analysis (SPARK-8518) * Improve GLM's scalability on number of features (SPARK-8520) h2. Pipeline API * more feature transformers (SPARK-8521) * k-means (SPARK-7898) * naive Bayes h2. Model persistence * more PMML export (SPARK-8545) * model save/load (SPARK-4587) * pipeline persistence (SPARK-6725) h2. Python API for ML h2. SparkR API for ML h2. Documentation was: We expect to see many MLlib contributors for the 1.5 release. To scale out the development, we created this master list for MLlib features we plan to have in Spark 1.5. Due to limited review bandwidth, features appearing on this list will get higher priority for code review. But feel free to suggest new items to the list in comments. We are experimenting with this process. Your feedback would be greatly appreciated. h1. Instructions h2. For contributors: * Please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark carefully. Code style, documentation, and unit tests are important. * If you are a first-time Spark contributor, please always start with a starter task (TODO: add a link) rather than a medium/big feature. Based on our experience, mixing the development process with a big feature usually causes long delay in code review. * Never work silently. Let everyone know on the corresponding JIRA page when you start working on some features. This is to avoid duplicate work. For small features, you don't need to wait to get JIRA assigned. * For medium/big features or features with dependencies, please get assigned first before coding and keep the ETA updated on the JIRA. If there exist no activity on the JIRA page for a certain amount of time, the JIRA should be released for other contributors. * Do not claim multiple (>3) JIRAs at the same time. Try to finish them one after another. * Please review others' PRs (https://spark-prs.appspot.com/#mllib). 
Code review greatly helps improve others' code as well as yours. h2. For committers: * Try to break down big features into small and specific JIRA tasks and link them properly. * Add "starter" label to starter tasks. * Put a rough estimate for medium/big features and track the progress. h1. Roadmap h2. Algorithms h2. Pipeline API h2. Model persistence h2. Python API for ML h2. SparkR API for ML h2. Documentation > MLlib 1.5 Roadmap > - > > Key: SPARK-8445 > URL: https://issues.apache.org/jira/browse/SPARK-8445 > Project: Spark > Issue Type: Umbrella > Components: ML, MLlib >Affects Versions: 1.5.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Critical > > We expect to see many MLlib contributors for the 1.5 release. To scale out > the development, we created this master list for MLlib features we plan t
[jira] [Created] (SPARK-8577) ScalaReflectionLock.synchronized can cause deadlock
koert kuipers created SPARK-8577: Summary: ScalaReflectionLock.synchronized can cause deadlock Key: SPARK-8577 URL: https://issues.apache.org/jira/browse/SPARK-8577 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: koert kuipers Priority: Minor Just a heads up: I was doing some basic coding using DataFrame, Row, StructType, etc. in my own project, and I ended up with deadlocks in my sbt tests due to the use of ScalaReflectionLock.synchronized in the Spark SQL code. The issue went away when I changed my build to have {{parallelExecution in Test := false}} so that the tests run consecutively...
[jira] [Commented] (SPARK-8072) Better AnalysisException for writing DataFrame with identically named columns
[ https://issues.apache.org/jira/browse/SPARK-8072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598444#comment-14598444 ] Reynold Xin commented on SPARK-8072: [~animeshbaranawal] I think we should only apply that rule when there is some save/output action. > Better AnalysisException for writing DataFrame with identically named columns > - > > Key: SPARK-8072 > URL: https://issues.apache.org/jira/browse/SPARK-8072 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Priority: Blocker > > We should check if there are duplicate columns, and if yes, throw an explicit > error message saying there are duplicate columns. See current error message > below. > {code} > In [3]: df.withColumn('age', df.age) > Out[3]: DataFrame[age: bigint, name: string, age: bigint] > In [4]: df.withColumn('age', df.age).write.parquet('test-parquet.out') > --- > Py4JJavaError Traceback (most recent call last) > in () > > 1 df.withColumn('age', df.age).write.parquet('test-parquet.out') > /scratch/rxin/spark/python/pyspark/sql/readwriter.py in parquet(self, path, > mode) > 350 >>> df.write.parquet(os.path.join(tempfile.mkdtemp(), 'data')) > 351 """ > --> 352 self._jwrite.mode(mode).parquet(path) > 353 > 354 @since(1.4) > /Users/rxin/anaconda/lib/python2.7/site-packages/py4j-0.8.1-py2.7.egg/py4j/java_gateway.pyc > in __call__(self, *args) > 535 answer = self.gateway_client.send_command(command) > 536 return_value = get_return_value(answer, self.gateway_client, > --> 537 self.target_id, self.name) > 538 > 539 for temp_arg in temp_args: > /Users/rxin/anaconda/lib/python2.7/site-packages/py4j-0.8.1-py2.7.egg/py4j/protocol.pyc > in get_return_value(answer, gateway_client, target_id, name) > 298 raise Py4JJavaError( > 299 'An error occurred while calling {0}{1}{2}.\n'. > --> 300 format(target_id, '.', name), value) > 301 else: > 302 raise Py4JError( > Py4JJavaError: An error occurred while calling o35.parquet. 
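The requested check amounts to scanning the output schema's column names for duplicates before the write starts. A minimal sketch of that idea in plain Python (hypothetical; the function name and error wording are illustrative, not Spark's actual check):

```python
# Hypothetical sketch of the duplicate-column check proposed above: fail
# fast with an explicit message listing the duplicated names, instead of
# the opaque Py4J "Reference 'age' is ambiguous" traceback.

from collections import Counter

def check_duplicate_columns(column_names):
    duplicates = [name for name, n in Counter(column_names).items() if n > 1]
    if duplicates:
        raise ValueError(
            "Duplicate column(s) found: %s. Rename or drop them before writing."
            % ", ".join(sorted(duplicates)))

check_duplicate_columns(["age", "name"])           # OK, no duplicates
# check_duplicate_columns(["age", "name", "age"])  # would raise ValueError
```

Running such a check only at save/output time (rather than in every analysis pass) matches the discussion above, since a DataFrame with duplicate names is still usable in memory.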
> : org.apache.spark.sql.AnalysisException: Reference 'age' is ambiguous, could > be: age#0L, age#3L.; > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:279) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:116) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$16.apply(Analyzer.scala:350) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$16.apply(Analyzer.scala:350) > at > org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:350) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:341) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:285) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionUp$1(QueryPlan.scala:108) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2$$anonfun$apply$2.apply(QueryPlan.scala:123) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:122) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Gro