[jira] [Updated] (SPARK-4445) Don't display storage level in toDebugString unless RDD is persisted
[ https://issues.apache.org/jira/browse/SPARK-4445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4445: --- Issue Type: Bug (was: Improvement) > Don't display storage level in toDebugString unless RDD is persisted > > > Key: SPARK-4445 > URL: https://issues.apache.org/jira/browse/SPARK-4445 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Patrick Wendell >Assignee: Prashant Sharma >Priority: Blocker > > The current approach lists the storage level all the time, even if the RDD is > not persisted. The storage level should only be listed if the RDD is > persisted. We just need to guard it with a check. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4445) Don't display storage level in toDebugString unless RDD is persisted
Patrick Wendell created SPARK-4445: -- Summary: Don't display storage level in toDebugString unless RDD is persisted Key: SPARK-4445 URL: https://issues.apache.org/jira/browse/SPARK-4445 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Patrick Wendell Assignee: Prashant Sharma Priority: Blocker The current approach lists the storage level all the time, even if the RDD is not persisted. The storage level should only be listed if the RDD is persisted. We just need to guard it with a check.
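The requested guard is a one-line change. A minimal sketch of the idea (FakeRdd and debugLine are illustrative stand-ins, not Spark's actual RDD API):

```scala
// Illustrative sketch: only append the storage level to a debug line when
// the RDD has actually been persisted. Not the real RDD.toDebugString code.
case class FakeRdd(name: String, storageLevel: String, persisted: Boolean)

def debugLine(rdd: FakeRdd): String =
  if (rdd.persisted) s"${rdd.name} [${rdd.storageLevel}]"
  else rdd.name
```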
[jira] [Commented] (SPARK-4435) Add setThreshold in Python LogisticRegressionModel and SVMModel
[ https://issues.apache.org/jira/browse/SPARK-4435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214382#comment-14214382 ] Apache Spark commented on SPARK-4435: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/3305 > Add setThreshold in Python LogisticRegressionModel and SVMModel > --- > > Key: SPARK-4435 > URL: https://issues.apache.org/jira/browse/SPARK-4435 > Project: Spark > Issue Type: New Feature > Components: MLlib, PySpark >Reporter: Matei Zaharia >
[jira] [Commented] (SPARK-4444) Drop VD type parameter from EdgeRDD
[ https://issues.apache.org/jira/browse/SPARK-4444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214369#comment-14214369 ] Apache Spark commented on SPARK-4444: - User 'ankurdave' has created a pull request for this issue: https://github.com/apache/spark/pull/3303 > Drop VD type parameter from EdgeRDD > --- > > Key: SPARK-4444 > URL: https://issues.apache.org/jira/browse/SPARK-4444 > Project: Spark > Issue Type: Improvement > Components: GraphX >Reporter: Ankur Dave >Assignee: Ankur Dave >Priority: Blocker > > Due to vertex attribute caching, EdgeRDD previously took two type parameters: > ED and VD. However, this is an implementation detail that should not be > exposed in the interface, so this PR drops the VD type parameter. > This requires removing the filter method from the EdgeRDD interface, because > it depends on vertex attribute caching.
[jira] [Created] (SPARK-4444) Drop VD type parameter from EdgeRDD
Ankur Dave created SPARK-4444: - Summary: Drop VD type parameter from EdgeRDD Key: SPARK-4444 URL: https://issues.apache.org/jira/browse/SPARK-4444 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Ankur Dave Assignee: Ankur Dave Priority: Blocker Due to vertex attribute caching, EdgeRDD previously took two type parameters: ED and VD. However, this is an implementation detail that should not be exposed in the interface, so this PR drops the VD type parameter. This requires removing the filter method from the EdgeRDD interface, because it depends on vertex attribute caching.
[jira] [Commented] (SPARK-4443) Statistics bug for external table in spark sql hive
[ https://issues.apache.org/jira/browse/SPARK-4443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214351#comment-14214351 ] Apache Spark commented on SPARK-4443: - User 'scwf' has created a pull request for this issue: https://github.com/apache/spark/pull/3304 > Statistics bug for external table in spark sql hive > --- > > Key: SPARK-4443 > URL: https://issues.apache.org/jira/browse/SPARK-4443 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0 >Reporter: wangfei > Fix For: 1.2.0 > > > When table is external, the `totalSize` is always zero, which will influence > join strategy(always use broadcast join for external table)
[jira] [Updated] (SPARK-4443) Statistics bug for external table in spark sql hive
[ https://issues.apache.org/jira/browse/SPARK-4443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wangfei updated SPARK-4443: --- Description: When table is external, `totalSize` is always zero, which will influence join strategy(always use broadcast join for external table) Target Version/s: 1.2.0 Affects Version/s: 1.1.0 Fix Version/s: 1.2.0 > Statistics bug for external table in spark sql hive > --- > > Key: SPARK-4443 > URL: https://issues.apache.org/jira/browse/SPARK-4443 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0 >Reporter: wangfei > Fix For: 1.2.0 > > > When table is external, `totalSize` is always zero, which will influence join > strategy(always use broadcast join for external table)
[jira] [Updated] (SPARK-4443) Statistics bug for external table in spark sql hive
[ https://issues.apache.org/jira/browse/SPARK-4443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wangfei updated SPARK-4443: --- Description: When table is external, the `totalSize` is always zero, which will influence join strategy(always use broadcast join for external table) (was: When table is external, `totalSize` is always zero, which will influence join strategy(always use broadcast join for external table)) > Statistics bug for external table in spark sql hive > --- > > Key: SPARK-4443 > URL: https://issues.apache.org/jira/browse/SPARK-4443 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0 >Reporter: wangfei > Fix For: 1.2.0 > > > When table is external, the `totalSize` is always zero, which will influence > join strategy(always use broadcast join for external table)
[jira] [Created] (SPARK-4443) Statistics bug for external table in spark sql hive
wangfei created SPARK-4443: -- Summary: Statistics bug for external table in spark sql hive Key: SPARK-4443 URL: https://issues.apache.org/jira/browse/SPARK-4443 Project: Spark Issue Type: Bug Components: SQL Reporter: wangfei
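One way to see why a zero `totalSize` matters: a size-based broadcast decision treats zero as "smaller than any threshold". A hedged sketch of the decision logic (the threshold value and the function name are hypothetical, not Spark SQL's actual planner code):

```scala
// Hypothetical broadcast decision: a reported size of zero (as happens for
// external Hive tables) would always pass a naive "size <= threshold" test,
// so unknown/zero sizes must be excluded explicitly.
val autoBroadcastThreshold: Long = 10L * 1024 * 1024 // 10 MB, illustrative

def shouldBroadcast(totalSize: Long): Boolean =
  totalSize > 0 && totalSize <= autoBroadcastThreshold
```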
[jira] [Commented] (SPARK-4437) Docs for difference between WholeTextFileRecordReader and WholeCombineFileRecordReader
[ https://issues.apache.org/jira/browse/SPARK-4437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214335#comment-14214335 ] Apache Spark commented on SPARK-4437: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/3301 > Docs for difference between WholeTextFileRecordReader and > WholeCombineFileRecordReader > -- > > Key: SPARK-4437 > URL: https://issues.apache.org/jira/browse/SPARK-4437 > Project: Spark > Issue Type: Documentation > Components: Documentation >Reporter: Andrew Ash >Assignee: Davies Liu > > Tracking per this dev@ thread: > {quote} > On Sun, Nov 16, 2014 at 4:49 PM, Reynold Xin wrote: > I don't think the code is immediately obvious. > Davies - I think you added the code, and Josh reviewed it. Can you guys > explain and maybe submit a patch to add more documentation on the whole > thing? > Thanks. > On Sun, Nov 16, 2014 at 3:22 AM, Vibhanshu Prasad > wrote: > > Hello Everyone, > > > > I am going through the source code of rdd and Record readers > > There are found 2 classes > > > > 1. WholeTextFileRecordReader > > 2. WholeCombineFileRecordReader ( extends CombineFileRecordReader ) > > > > The description of both the classes is perfectly similar. > > > > I am not able to understand why we have 2 classes. Is > > CombineFileRecordReader providing some extra advantage? > > > > Regards > > Vibhanshu > {quote}
[jira] [Commented] (SPARK-4402) Output path validation of an action statement resulting in runtime exception
[ https://issues.apache.org/jira/browse/SPARK-4402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214320#comment-14214320 ] Vijay commented on SPARK-4402: -- Yes, the output path is being validated in PairRDDFunctions.saveAsHadoopDataset. Please find the exception details below. So the output path is validated only during the execution of saveAsHadoopDataset, after all the preceding statements have completed. My question is: is it possible to perform this validation up front, when the program execution starts? Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory file:/home/HadoopUser/eclipse-scala/test/output1 already exists at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:132) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:968) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:878) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:792) at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1159) at test.OutputTest$.main(OutputTest.scala:19) at test.OutputTest.main(OutputTest.scala) > Output path validation of an action statement resulting in runtime exception > > > Key: SPARK-4402 > URL: https://issues.apache.org/jira/browse/SPARK-4402 > Project: Spark > Issue Type: Wish >Reporter: Vijay >Priority: Minor > > Output path validation happens at the time of statement execution, as a > part of lazy evaluation of the action statement. But if the path already exists > then it throws a runtime exception. Hence all the processing completed till > that point is lost, which results in resource wastage (processing time and CPU > usage). > If this I/O-related validation is done before the RDD action operations, then > this runtime exception can be avoided. > I believe a similar validation/feature is implemented in Hadoop also.
> Example: > SchemaRDD.saveAsTextFile() evaluates the path only during runtime
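The wish amounts to validating the output path eagerly, before any transformations run, rather than letting the Hadoop layer throw FileAlreadyExistsException after the work is done. An illustrative sketch of such an up-front check (plain JVM I/O, not Spark's internals; validateOutputPath is a hypothetical helper):

```scala
import java.io.File

// Illustrative eager validation: fail fast if the output directory exists,
// before any expensive computation is started.
def validateOutputPath(path: String): Unit =
  if (new File(path).exists())
    throw new IllegalArgumentException(s"Output directory $path already exists")
```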
[jira] [Updated] (SPARK-4422) In some cases, Vectors.fromBreeze get wrong results.
[ https://issues.apache.org/jira/browse/SPARK-4422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-4422: - Affects Version/s: 1.2.0 > In some cases, Vectors.fromBreeze get wrong results. > > > Key: SPARK-4422 > URL: https://issues.apache.org/jira/browse/SPARK-4422 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.0.2, 1.1.1, 1.2.0 >Reporter: Guoqiang Li >Assignee: Guoqiang Li >Priority: Minor > Fix For: 1.2.0 > > > {noformat} > import breeze.linalg.{DenseVector => BDV, DenseMatrix => BDM, sum => brzSum} > var x = BDM.zeros[Double](10, 10) > val v = Vectors.fromBreeze(x(::, 0)) > assert(v.size == x.rows) > {noformat}
[jira] [Reopened] (SPARK-4422) In some cases, Vectors.fromBreeze get wrong results.
[ https://issues.apache.org/jira/browse/SPARK-4422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reopened SPARK-4422: -- Reopened for branch-1.0 and branch-1.1. Changed the priority to minor because `fromBreeze` is private and we don't use breeze matrix slicing in MLlib. > In some cases, Vectors.fromBreeze get wrong results. > > > Key: SPARK-4422 > URL: https://issues.apache.org/jira/browse/SPARK-4422 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.0.2, 1.1.1 >Reporter: Guoqiang Li >Assignee: Guoqiang Li >Priority: Minor > Fix For: 1.2.0 > > > {noformat} > import breeze.linalg.{DenseVector => BDV, DenseMatrix => BDM, sum => brzSum} > var x = BDM.zeros[Double](10, 10) > val v = Vectors.fromBreeze(x(::, 0)) > assert(v.size == x.rows) > {noformat}
[jira] [Updated] (SPARK-4422) In some cases, Vectors.fromBreeze get wrong results.
[ https://issues.apache.org/jira/browse/SPARK-4422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-4422: - Affects Version/s: (was: 1.3.0) (was: 1.2.0) (was: 1.1.0) 1.1.1 1.0.2 > In some cases, Vectors.fromBreeze get wrong results. > > > Key: SPARK-4422 > URL: https://issues.apache.org/jira/browse/SPARK-4422 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.0.2, 1.1.1 >Reporter: Guoqiang Li >Assignee: Guoqiang Li >Priority: Critical > Fix For: 1.2.0 > > > {noformat} > import breeze.linalg.{DenseVector => BDV, DenseMatrix => BDM, sum => brzSum} > var x = BDM.zeros[Double](10, 10) > val v = Vectors.fromBreeze(x(::, 0)) > assert(v.size == x.rows) > {noformat}
[jira] [Updated] (SPARK-4422) In some cases, Vectors.fromBreeze get wrong results.
[ https://issues.apache.org/jira/browse/SPARK-4422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-4422: - Target Version/s: 1.2.0, 1.0.3, 1.1.2 (was: 1.1.0, 1.2.0, 1.3.0) > In some cases, Vectors.fromBreeze get wrong results. > > > Key: SPARK-4422 > URL: https://issues.apache.org/jira/browse/SPARK-4422 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.0.2, 1.1.1 >Reporter: Guoqiang Li >Assignee: Guoqiang Li >Priority: Critical > Fix For: 1.2.0 > > > {noformat} > import breeze.linalg.{DenseVector => BDV, DenseMatrix => BDM, sum => brzSum} > var x = BDM.zeros[Double](10, 10) > val v = Vectors.fromBreeze(x(::, 0)) > assert(v.size == x.rows) > {noformat}
[jira] [Updated] (SPARK-4422) In some cases, Vectors.fromBreeze get wrong results.
[ https://issues.apache.org/jira/browse/SPARK-4422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-4422: - Priority: Minor (was: Critical) > In some cases, Vectors.fromBreeze get wrong results. > > > Key: SPARK-4422 > URL: https://issues.apache.org/jira/browse/SPARK-4422 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.0.2, 1.1.1 >Reporter: Guoqiang Li >Assignee: Guoqiang Li >Priority: Minor > Fix For: 1.2.0 > > > {noformat} > import breeze.linalg.{DenseVector => BDV, DenseMatrix => BDM, sum => brzSum} > var x = BDM.zeros[Double](10, 10) > val v = Vectors.fromBreeze(x(::, 0)) > assert(v.size == x.rows) > {noformat}
[jira] [Resolved] (SPARK-4422) In some cases, Vectors.fromBreeze get wrong results.
[ https://issues.apache.org/jira/browse/SPARK-4422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-4422. -- Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 3281 [https://github.com/apache/spark/pull/3281] > In some cases, Vectors.fromBreeze get wrong results. > > > Key: SPARK-4422 > URL: https://issues.apache.org/jira/browse/SPARK-4422 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.1.0, 1.2.0, 1.3.0 >Reporter: Guoqiang Li >Priority: Critical > Fix For: 1.2.0 > > > {noformat} > import breeze.linalg.{DenseVector => BDV, DenseMatrix => BDM, sum => brzSum} > var x = BDM.zeros[Double](10, 10) > val v = Vectors.fromBreeze(x(::, 0)) > assert(v.size == x.rows) > {noformat}
[jira] [Updated] (SPARK-4422) In some cases, Vectors.fromBreeze get wrong results.
[ https://issues.apache.org/jira/browse/SPARK-4422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-4422: - Assignee: Guoqiang Li > In some cases, Vectors.fromBreeze get wrong results. > > > Key: SPARK-4422 > URL: https://issues.apache.org/jira/browse/SPARK-4422 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.1.0, 1.2.0, 1.3.0 >Reporter: Guoqiang Li >Assignee: Guoqiang Li >Priority: Critical > Fix For: 1.2.0 > > > {noformat} > import breeze.linalg.{DenseVector => BDV, DenseMatrix => BDM, sum => brzSum} > var x = BDM.zeros[Double](10, 10) > val v = Vectors.fromBreeze(x(::, 0)) > assert(v.size == x.rows) > {noformat}
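The likely source of the wrong results in the snippet above: a breeze matrix slice such as x(::, 0) is a view over the matrix's flat column-major array, carrying an offset, a stride, and a length, so wrapping the raw backing array directly yields a vector containing the whole matrix's data. An illustrative model of the correct copy (VectorView is a stand-in, not breeze's actual class):

```scala
// Stand-in for a breeze vector view: flat backing array plus offset/stride/length.
case class VectorView(data: Array[Double], offset: Int, stride: Int, length: Int)

// Correct conversion copies exactly `length` elements through offset/stride,
// instead of wrapping `data` wholesale.
def toDenseArray(v: VectorView): Array[Double] =
  Array.tabulate(v.length)(i => v.data(v.offset + i * v.stride))
```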
[jira] [Updated] (SPARK-4056) Upgrade snappy-java to 1.1.1.5
[ https://issues.apache.org/jira/browse/SPARK-4056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4056: -- Fix Version/s: (was: 1.2.0) (was: 1.1.1) > Upgrade snappy-java to 1.1.1.5 > -- > > Key: SPARK-4056 > URL: https://issues.apache.org/jira/browse/SPARK-4056 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Josh Rosen >Assignee: Josh Rosen > > We should upgrade snappy-java to 1.1.1.5 across all of our maintenance > branches. This release improves error messages when attempting to > deserialize empty inputs using SnappyInputStream (this operation is always an > error, but the old error messages made it hard to distinguish failures due to > empty streams from ones due to reading invalid / corrupted streams); see > https://github.com/xerial/snappy-java/issues/89 for more context. > This should be a major help in the Snappy debugging work that I've been doing.
[jira] [Commented] (SPARK-4056) Upgrade snappy-java to 1.1.1.5
[ https://issues.apache.org/jira/browse/SPARK-4056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214278#comment-14214278 ] Josh Rosen commented on SPARK-4056: --- I've removed the "Fixed Versions" here, since we reverted this particular commit. > Upgrade snappy-java to 1.1.1.5 > -- > > Key: SPARK-4056 > URL: https://issues.apache.org/jira/browse/SPARK-4056 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Josh Rosen >Assignee: Josh Rosen > > We should upgrade snappy-java to 1.1.1.5 across all of our maintenance > branches. This release improves error messages when attempting to > deserialize empty inputs using SnappyInputStream (this operation is always an > error, but the old error messages made it hard to distinguish failures due to > empty streams from ones due to reading invalid / corrupted streams); see > https://github.com/xerial/snappy-java/issues/89 for more context. > This should be a major help in the Snappy debugging work that I've been doing.
[jira] [Commented] (SPARK-4441) Close Tachyon client when TachyonBlockManager is shut down
[ https://issues.apache.org/jira/browse/SPARK-4441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214274#comment-14214274 ] Apache Spark commented on SPARK-4441: - User 'shimingfei' has created a pull request for this issue: https://github.com/apache/spark/pull/3299 > Close Tachyon client when TachyonBlockManager is shut down > -- > > Key: SPARK-4441 > URL: https://issues.apache.org/jira/browse/SPARK-4441 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 1.1.0 >Reporter: shimingfei > > Currently the Tachyon client is not shut down when TachyonBlockManager is shut > down, which causes some resources in Tachyon not to be reclaimed.
[jira] [Created] (SPARK-4441) Close Tachyon client when TachyonBlockManager is shut down
shimingfei created SPARK-4441: - Summary: Close Tachyon client when TachyonBlockManager is shut down Key: SPARK-4441 URL: https://issues.apache.org/jira/browse/SPARK-4441 Project: Spark Issue Type: Bug Components: Block Manager Affects Versions: 1.1.0 Reporter: shimingfei Currently the Tachyon client is not shut down when TachyonBlockManager is shut down, which causes some resources in Tachyon not to be reclaimed.
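The fix pattern is simple: release the client when the block manager stops. A minimal generic sketch (the trait and class names are illustrative; the real change would call close() on the Tachyon client in TachyonBlockManager's shutdown path):

```scala
// Illustrative shutdown wiring: the manager owns a closable client and
// releases it when stopped, so the remote side can reclaim resources.
trait CloseableClient { def close(): Unit }

class Manager(client: CloseableClient) {
  def stop(): Unit = client.close() // the previously missing step
}
```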
[jira] [Created] (SPARK-4442) Move common unit test utilities into their own package / module
Josh Rosen created SPARK-4442: - Summary: Move common unit test utilities into their own package / module Key: SPARK-4442 URL: https://issues.apache.org/jira/browse/SPARK-4442 Project: Spark Issue Type: Improvement Reporter: Josh Rosen Priority: Minor We should move generally-useful unit test fixtures / utility methods to their own test-utilities package / module to make them easier to find / use. See https://github.com/apache/spark/pull/3121#discussion-diff-20413659 for one example of this.
[jira] [Commented] (SPARK-4147) Remove log4j dependency
[ https://issues.apache.org/jira/browse/SPARK-4147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214261#comment-14214261 ] Nathan M commented on SPARK-4147: - This code just forces log4j on the end user, which is less than ideal. SLF4J should avoid this; it seems like something wrong is being done in trying to set the log level like this... > Remove log4j dependency > --- > > Key: SPARK-4147 > URL: https://issues.apache.org/jira/browse/SPARK-4147 > Project: Spark > Issue Type: Wish > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Tobias Pfeiffer > > spark-core has a hard dependency on log4j, which shouldn't be necessary since > slf4j is used. I tried to exclude slf4j-log4j12 and log4j dependencies in my > sbt file. > Excluding org.slf4j.slf4j-log4j12 works fine if logback is on the classpath. > However, removing the log4j dependency fails because in > https://github.com/apache/spark/blob/v1.1.0/core/src/main/scala/org/apache/spark/Logging.scala#L121 > a static method of org.apache.log4j.LogManager is accessed *even if* log4j > is not in use. > I guess removing all dependencies on log4j may be a bigger task, but it would > be a great help if the access to LogManager would be done only if log4j use > was detected before. (This is a 2-line change.)
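The two-line change the reporter suggests can be sketched as a classpath probe: only touch org.apache.log4j.LogManager if log4j is actually present. An illustrative helper (classPresent is hypothetical, not Spark's actual Logging code):

```scala
// Probe the classpath reflectively so no hard reference to log4j classes is
// needed; callers guard LogManager access behind this check, e.g.
//   if (classPresent("org.apache.log4j.LogManager")) { /* set log levels */ }
def classPresent(name: String): Boolean =
  try { Class.forName(name); true }
  catch { case _: ClassNotFoundException => false }
```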
[jira] [Commented] (SPARK-4325) Improve spark-ec2 cluster launch times
[ https://issues.apache.org/jira/browse/SPARK-4325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214239#comment-14214239 ] Nicholas Chammas commented on SPARK-4325: - {quote} Replace instances of download; rsync to rest of cluster with parallel downloads on all nodes of the cluster. {quote} Actually, the current way may be better. If you are launching a 100+ node cluster, for example, it probably isn't a good idea to have all of them hit a resource (e.g. a file at an Apache mirror) at once without some thought. I'd bet it's safer and more reliable for now to have a single node download and then broadcast to the rest of the cluster. This is the current behavior. > Improve spark-ec2 cluster launch times > -- > > Key: SPARK-4325 > URL: https://issues.apache.org/jira/browse/SPARK-4325 > Project: Spark > Issue Type: Improvement > Components: EC2 >Reporter: Nicholas Chammas >Priority: Minor > > There are several optimizations we know we can make to [{{setup.sh}} | > https://github.com/mesos/spark-ec2/blob/v4/setup.sh] to make cluster launches > faster. > There are also some improvements to the AMIs that will help a lot. > Potential improvements: > * Upgrade the Spark AMIs and pre-install tools like Ganglia on them. This > will reduce or eliminate SSH wait time and Ganglia init time. > * Replace instances of {{download; rsync to rest of cluster}} with parallel > downloads on all nodes of the cluster. > * Replace instances of > {code} > for node in $NODES; do > command > sleep 0.3 > done > wait{code} > with simpler calls to {{pssh}}. > * Remove the [linear backoff | > https://github.com/apache/spark/blob/b32734e12d5197bad26c080e529edd875604c6fb/ec2/spark_ec2.py#L665] > when we wait for SSH availability now that we are already waiting for EC2 > status checks to clear before testing SSH. 
[jira] [Commented] (SPARK-2321) Design a proper progress reporting & event listener API
[ https://issues.apache.org/jira/browse/SPARK-2321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214223#comment-14214223 ] Rui Li commented on SPARK-2321: --- Hey [~joshrosen], Thanks a lot for the update! I created SPARK-4440 for the enhancement. > Design a proper progress reporting & event listener API > --- > > Key: SPARK-2321 > URL: https://issues.apache.org/jira/browse/SPARK-2321 > Project: Spark > Issue Type: Improvement > Components: Java API, Spark Core >Affects Versions: 1.0.0 >Reporter: Reynold Xin >Assignee: Josh Rosen >Priority: Critical > Fix For: 1.2.0 > > > This is a ticket to track progress on redesigning the SparkListener and > JobProgressListener API. > There are multiple problems with the current design, including: > 0. I'm not sure if the API is usable in Java (there are at least some enums > we used in Scala and a bunch of case classes that might complicate things). > 1. The whole API is marked as DeveloperApi, because we haven't paid a lot of > attention to it yet. Something as important as progress reporting deserves a > more stable API. > 2. There is no easy way to connect jobs with stages. Similarly, there is no > easy way to connect job groups with jobs / stages. > 3. JobProgressListener itself has no encapsulation at all. States can be > arbitrarily mutated by external programs. Variable names are sort of randomly > decided and inconsistent. > We should just revisit these and propose a new, concrete design.
[jira] [Created] (SPARK-4440) Enhance the job progress API to expose more information
Rui Li created SPARK-4440: - Summary: Enhance the job progress API to expose more information Key: SPARK-4440 URL: https://issues.apache.org/jira/browse/SPARK-4440 Project: Spark Issue Type: Improvement Reporter: Rui Li The progress API introduced in SPARK-2321 provides a new way for users to monitor job progress. However, the information exposed in the API is relatively limited. It'll be much more useful if we can enhance the API to expose more data. Improvements may include, but are not limited to: 1. Stage submission and completion time. 2. Task metrics. The requirement was initially identified for the Hive on Spark project (HIVE-7292); other applications should benefit as well.
[jira] [Commented] (SPARK-4309) Date type support missing in HiveThriftServer2
[ https://issues.apache.org/jira/browse/SPARK-4309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214202#comment-14214202 ] Apache Spark commented on SPARK-4309: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/3298 > Date type support missing in HiveThriftServer2 > -- > > Key: SPARK-4309 > URL: https://issues.apache.org/jira/browse/SPARK-4309 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.1 >Reporter: Cheng Lian > Fix For: 1.2.0 > > > Date type is not supported while retrieving result set in HiveThriftServer2.
[jira] [Commented] (SPARK-4407) Thrift server for 0.13.1 doesn't deserialize complex types properly
[ https://issues.apache.org/jira/browse/SPARK-4407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214203#comment-14214203 ]

Apache Spark commented on SPARK-4407:
-------------------------------------

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/3298

> Thrift server for 0.13.1 doesn't deserialize complex types properly
> -------------------------------------------------------------------
>
>          Key: SPARK-4407
>          URL: https://issues.apache.org/jira/browse/SPARK-4407
>      Project: Spark
>   Issue Type: Bug
>   Components: SQL
> Affects Versions: 1.1.2
>     Reporter: Cheng Lian
>     Priority: Blocker
>      Fix For: 1.2.0
>
> The following snippet can reproduce this issue:
> {code}
> CREATE TABLE t0(m MAP);
> INSERT OVERWRITE TABLE t0 SELECT MAP(key, value) FROM src LIMIT 10;
> SELECT * FROM t0;
> {code}
> Exception thrown:
> {code}
> java.lang.RuntimeException: java.lang.ClassCastException: scala.collection.immutable.Map$Map1 cannot be cast to java.lang.String
>     at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:84)
>     at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37)
>     at org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:415)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>     at org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493)
>     at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:60)
>     at com.sun.proxy.$Proxy21.fetchResults(Unknown Source)
>     at org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:405)
>     at org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:530)
>     at org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1553)
>     at org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1538)
>     at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
>     at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
>     at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:55)
>     at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ClassCastException: scala.collection.immutable.Map$Map1 cannot be cast to java.lang.String
>     at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.addNonNullColumnValue(Shim13.scala:142)
>     at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.getNextRowSet(Shim13.scala:165)
>     at org.apache.hive.service.cli.operation.OperationManager.getOperationNextRowSet(OperationManager.java:192)
>     at org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:471)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:606)
>     at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79)
>     ... 19 more
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
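The `Caused by` frame points at the shim handing a Scala `Map` value straight to a string-typed Thrift column. The actual fix lives in Scala (Shim13.scala); purely as an illustration, here is a hedged Python sketch of the underlying mistake: the helper names `add_column_value_naive` and `add_column_value_fixed` are hypothetical, and JSON stands in for whatever string encoding the real patch chose.

```python
import json

# A result row whose second column is a complex (map) value, as produced
# by "SELECT * FROM t0" on a table with a MAP column.
row = (238, {"238": "val_238"})

def add_column_value_naive(value):
    # The buggy path effectively assumed every column value was already
    # a string; for a map this is the moral equivalent of a bad cast.
    if not isinstance(value, str):
        raise TypeError(f"{type(value).__name__} cannot be cast to str")
    return value

def add_column_value_fixed(value):
    # Serialize complex values explicitly before handing them to a
    # string-typed column, instead of casting blindly.
    if isinstance(value, (dict, list)):
        return json.dumps(value)
    return str(value)

try:
    add_column_value_naive(row[1])
except TypeError:
    pass  # mirrors the ClassCastException in the report above

assert add_column_value_fixed(row[1]) == '{"238": "val_238"}'
```

The point is only that complex types need an explicit serialization step; a cast that happens to work for primitive columns fails as soon as a MAP, ARRAY, or STRUCT column appears.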
[jira] [Updated] (SPARK-4439) Expose RandomForest in Python
[ https://issues.apache.org/jira/browse/SPARK-4439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matei Zaharia updated SPARK-4439:
---------------------------------
    Summary: Expose RandomForest in Python  (was: Export RandomForest in Python)

> Expose RandomForest in Python
> -----------------------------
>
>          Key: SPARK-4439
>          URL: https://issues.apache.org/jira/browse/SPARK-4439
>      Project: Spark
>   Issue Type: New Feature
>   Components: MLlib, PySpark
>     Reporter: Matei Zaharia
[jira] [Created] (SPARK-4439) Export RandomForest in Python
Matei Zaharia created SPARK-4439:
---------------------------------

     Summary: Export RandomForest in Python
         Key: SPARK-4439
         URL: https://issues.apache.org/jira/browse/SPARK-4439
     Project: Spark
  Issue Type: New Feature
  Components: MLlib, PySpark
    Reporter: Matei Zaharia
[jira] [Comment Edited] (SPARK-4038) Outlier Detection Algorithm for MLlib
[ https://issues.apache.org/jira/browse/SPARK-4038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14207925#comment-14207925 ]

Ashutosh Trivedi edited comment on SPARK-4038 at 11/17/14 2:18 AM:
-------------------------------------------------------------------

I think I am following the procedure. I opened a discussion on the dev mailing list and [~mengxr] asked me to open this JIRA. If you read the description, this JIRA is to discuss various outlier/anomaly detection algorithms. I don't just 'care to code' in Spark. Since I am using Spark for my projects, I found that there are no algorithms for outliers; I think it should have them and I can contribute. I am aware of one algorithm, AVF (link attached). The questions raised are valid and we want the community to discuss them. This algorithm deals with categorical data; it uses the simplest approach, calculating the frequency of each attribute value in the data set. Some people in the community are already reviewing it and I am working on it. I did not find any other algorithm that works on categorical data to find outliers. If you are aware of any other well-known algorithm, please share it with us.

was (Author: rusty):
I think I am following the procedure. I opened a discussion on dev mailing list and Xiangrui asked me to open this JIRA. If you read the description this JIRA is to discuss about various Outlier/anomaly detection algorithms. I don't just 'care to code' in Spark. Since I am using spark for my projects, I found that there are no algorithms on Outliers and I think it should have algorithms for it. I am aware of one algorithm AVF (link attached). The questions raised are valid and we want community to discuss it. This algorithm deals with categorical data, It uses the simplest approach by calculating frequency of each attribute in the data set. Some of the people in community are already doing the review and I am working on it. I did not find any other algorithm which work on categorical data to find outliers. If you are aware of any other algorithm which is well known please share with us.

> Outlier Detection Algorithm for MLlib
> -------------------------------------
>
>          Key: SPARK-4038
>          URL: https://issues.apache.org/jira/browse/SPARK-4038
>      Project: Spark
>   Issue Type: New Feature
>   Components: MLlib
>     Reporter: Ashutosh Trivedi
>     Priority: Minor
>
> The aim of this JIRA is to discuss which parallel outlier detection algorithms can be included in MLlib.
> The one I am familiar with is Attribute Value Frequency (AVF). It scales linearly with the number of data points and attributes, and relies on a single data scan. It is not distance based and is well suited for categorical data. The original paper also gives a parallel version, which is not complicated to implement. I am working on the implementation and will soon submit the initial code for review.
> Here is the link to the paper:
> http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=4410382
> As pointed out by Xiangrui in the discussion
> http://apache-spark-developers-list.1001551.n3.nabble.com/MLlib-Contributing-Algorithm-for-Outlier-Detection-td8880.html
> there are other algorithms as well. Let's discuss which will be most general and most easily parallelized.
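For readers unfamiliar with the AVF approach discussed above, here is a minimal single-machine Python sketch of the scoring idea (not the parallel MLlib implementation under review; the function name `avf_scores` is hypothetical): each row is scored by the mean frequency of its attribute values across the data set, so rows built from rare values score lowest and are flagged as outliers.

```python
from collections import Counter

def avf_scores(rows):
    """Score categorical rows; a lower score means a likelier outlier.

    The AVF score of a row is the mean frequency of its attribute
    values over the whole data set, computable in a single scan.
    """
    n_cols = len(rows[0])
    # Frequency of each attribute value, computed per column.
    freqs = [Counter(row[i] for row in rows) for i in range(n_cols)]
    return [sum(freqs[i][row[i]] for i in range(n_cols)) / n_cols
            for row in rows]

data = [
    ("red", "small"),
    ("red", "small"),
    ("red", "small"),
    ("blue", "large"),   # rare in both columns
]
scores = avf_scores(data)
# The last row is built entirely from rare values, so it gets the
# lowest AVF score and is the strongest outlier candidate.
assert min(range(len(data)), key=scores.__getitem__) == 3
```

Because each column's frequencies can be computed independently and the per-row score is a simple sum, the frequency-counting step maps naturally onto a distributed reduce, which is why the paper's parallel version is straightforward.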
[jira] [Commented] (SPARK-4434) spark-submit cluster deploy mode JAR URLs are broken in 1.1.1
[ https://issues.apache.org/jira/browse/SPARK-4434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214199#comment-14214199 ]

Josh Rosen commented on SPARK-4434:
-----------------------------------

As a regression test, we should probably add a triple-slash test case to ClientSuite.

> spark-submit cluster deploy mode JAR URLs are broken in 1.1.1
> -------------------------------------------------------------
>
>          Key: SPARK-4434
>          URL: https://issues.apache.org/jira/browse/SPARK-4434
>      Project: Spark
>   Issue Type: Bug
>   Components: Deploy, Spark Core
> Affects Versions: 1.1.1, 1.2.0
>     Reporter: Josh Rosen
>     Assignee: Andrew Or
>     Priority: Blocker
>
> When submitting a driver using {{spark-submit}} in cluster mode, Spark 1.1.0 allowed you to omit the {{file://}} or {{hdfs://}} prefix from the application JAR URL, e.g.
> {code}
> ./bin/spark-submit --deploy-mode cluster --master spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi /Users/joshrosen/Documents/old-spark-releases/spark-1.1.0-bin-hadoop1/lib/spark-examples-1.1.0-hadoop1.0.4.jar
> {code}
> In Spark 1.1.1 and 1.2.0, this same command now fails with an error:
> {code}
> ./bin/spark-submit --deploy-mode cluster --master spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi /Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar
> Jar url 'file:/Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar' is not in valid format.
> Must be a jar file path in URL format (e.g. hdfs://XX.jar, file://XX.jar)
> Usage: DriverClient [options] launch [driver options]
> Usage: DriverClient kill
> {code}
> I tried changing my URL to conform to the new format, but this either resulted in an error or a job that failed:
> {code}
> ./bin/spark-submit --deploy-mode cluster --master spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi file:///Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar
> Jar url 'file:///Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar' is not in valid format.
> Must be a jar file path in URL format (e.g. hdfs://XX.jar, file://XX.jar)
> {code}
> If I omit the extra slash:
> {code}
> ./bin/spark-submit --deploy-mode cluster --master spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi file://Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar
> Sending launch command to spark://joshs-mbp.att.net:7077
> Driver successfully submitted as driver-20141116143235-0002
> ... waiting before polling master for driver state
> ... polling master for driver state
> State of driver-20141116143235-0002 is ERROR
> Exception from cluster was: java.lang.IllegalArgumentException: Wrong FS: file://Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar, expected: file:///
> java.lang.IllegalArgumentException: Wrong FS: file://Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar, expected: file:///
>     at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:381)
>     at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:55)
>     at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:393)
>     at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251)
>     at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:329)
>     at org.apache.spark.deploy.worker.DriverRunner.org$apache$spark$deploy$worker$DriverRunner$$downloadUserJar(DriverRunner.scala:157)
>     at org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:74)
> {code}
> This bug effectively prevents users from using {{spark-submit}} in cluster mode to run drivers whose JARs are stored on shared cluster filesystems.
[jira] [Comment Edited] (SPARK-4038) Outlier Detection Algorithm for MLlib
[ https://issues.apache.org/jira/browse/SPARK-4038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14207925#comment-14207925 ]

Ashutosh Trivedi edited comment on SPARK-4038 at 11/17/14 2:15 AM:
-------------------------------------------------------------------

I think I am following the procedure. I opened a discussion on dev mailing list and Xiangrui asked me to open this JIRA. If you read the description this JIRA is to discuss about various Outlier/anomaly detection algorithms. I don't just 'care to code' in Spark. Since I am using spark for my projects, I found that there are no algorithms on Outliers and I think it should have algorithms for it. I am aware of one algorithm AVF (link attached). The questions raised are valid and we want community to discuss it. This algorithm deals with categorical data, It uses the simplest approach by calculating frequency of each attribute in the data set. Some of the people in community are already doing the review and I am working on it. I did not find any other algorithm which work on categorical data to find outliers. If you are aware of any other algorithm which is well known please share with us.

was (Author: rusty):
The questions raised are valid and we want community to discuss it. This algorithm deals with categorical data, It uses the simplest approach by calculating frequency of each attribute in the data set. Some of the people in community are already doing the review and I am working on it. I did not find any other algorithm which work on categorical data to find outliers. If you are aware of any other algorithm which is well known please share with us.

> Outlier Detection Algorithm for MLlib
> -------------------------------------
>
>          Key: SPARK-4038
>          URL: https://issues.apache.org/jira/browse/SPARK-4038
>      Project: Spark
>   Issue Type: New Feature
>   Components: MLlib
>     Reporter: Ashutosh Trivedi
>     Priority: Minor
[jira] [Commented] (SPARK-4434) spark-submit cluster deploy mode JAR URLs are broken in 1.1.1
[ https://issues.apache.org/jira/browse/SPARK-4434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214196#comment-14214196 ]

Josh Rosen commented on SPARK-4434:
-----------------------------------

Fellow Databricks folks: I've added a regression test for this in https://github.com/databricks/spark-integration-tests/commit/f121f45aecbeafcec21d3bb670737fc9f7d6da0b (I'm sharing this link here so that it's easy to find this test once we open-source that repository). The test is essentially a scripted / automated version of the commands that I've listed in this JIRA. These tests confirm that reverting that earlier PR fixes this issue.

[~adav], do you want to open a separate JIRA to fix the "file://" error message?

> spark-submit cluster deploy mode JAR URLs are broken in 1.1.1
> -------------------------------------------------------------
>
>          Key: SPARK-4434
>          URL: https://issues.apache.org/jira/browse/SPARK-4434
>      Project: Spark
>   Issue Type: Bug
>   Components: Deploy, Spark Core
> Affects Versions: 1.1.1, 1.2.0
>     Reporter: Josh Rosen
>     Assignee: Andrew Or
>     Priority: Blocker
[jira] [Updated] (SPARK-4438) Add HistoryServer RESTful API
[ https://issues.apache.org/jira/browse/SPARK-4438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gankun Luo updated SPARK-4438:
------------------------------
    Attachment: HistoryServer RESTful API Design Doc.pdf

> Add HistoryServer RESTful API
> -----------------------------
>
>          Key: SPARK-4438
>          URL: https://issues.apache.org/jira/browse/SPARK-4438
>      Project: Spark
>   Issue Type: Improvement
>   Components: Deploy
>     Reporter: Gankun Luo
>  Attachments: HistoryServer RESTful API Design Doc.pdf
>
> Spark HistoryServer currently only supports tracking completed applications through the web UI; it does not provide a RESTful API for external systems to query completed application information.
[jira] [Updated] (SPARK-4436) Debian packaging misses datanucleus jars
[ https://issues.apache.org/jira/browse/SPARK-4436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Hamstra updated SPARK-4436:
--------------------------------
    Description:
If Spark is built with Hive support (i.e. -Phive), then the necessary datanucleus jars end up in lib_managed, not as part of the uber-jar. The debian packaging isn't including anything from lib_managed. As a consequence, HiveContext et al. will fail with the packaged Spark even though it was built with -Phive.

see comment in bin/compute-classpath.sh

Packaging everything from lib_managed/jars into /lib is an adequate solution.

  was:
If Spark is built with HIve support (i.e. -Phive), then the necessary datanucleus jars end up in lib_managed, not as part of the uber-jar. The debian packaging isn't including anything from lib_managed. As a consequence, HiveContext et al. will fail with the packaged Spark even though it was built with -Phive.

see comment in bin/compute-classpath.sh

Packaging everything from lib_managed/jars into /lib is an adequate solution.

> Debian packaging misses datanucleus jars
> ----------------------------------------
>
>          Key: SPARK-4436
>          URL: https://issues.apache.org/jira/browse/SPARK-4436
>      Project: Spark
>   Issue Type: Bug
>   Components: Build
> Affects Versions: 1.0.0, 1.0.1, 1.1.0
>     Reporter: Mark Hamstra
>     Assignee: Mark Hamstra
>     Priority: Minor
>
> If Spark is built with Hive support (i.e. -Phive), then the necessary datanucleus jars end up in lib_managed, not as part of the uber-jar. The debian packaging isn't including anything from lib_managed. As a consequence, HiveContext et al. will fail with the packaged Spark even though it was built with -Phive.
> see comment in bin/compute-classpath.sh
> Packaging everything from lib_managed/jars into /lib is an adequate solution.
[jira] [Created] (SPARK-4438) Add HistoryServer RESTful API
Gankun Luo created SPARK-4438:
------------------------------

     Summary: Add HistoryServer RESTful API
         Key: SPARK-4438
         URL: https://issues.apache.org/jira/browse/SPARK-4438
     Project: Spark
  Issue Type: Improvement
  Components: Deploy
    Reporter: Gankun Luo

Spark HistoryServer currently only supports tracking completed applications through the web UI; it does not provide a RESTful API for external systems to query completed application information.
[jira] [Commented] (SPARK-4436) Debian packaging misses datanucleus jars
[ https://issues.apache.org/jira/browse/SPARK-4436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214182#comment-14214182 ]

Apache Spark commented on SPARK-4436:
-------------------------------------

User 'markhamstra' has created a pull request for this issue:
https://github.com/apache/spark/pull/3297

> Debian packaging misses datanucleus jars
> ----------------------------------------
>
>          Key: SPARK-4436
>          URL: https://issues.apache.org/jira/browse/SPARK-4436
>      Project: Spark
>   Issue Type: Bug
>   Components: Build
> Affects Versions: 1.0.0, 1.0.1, 1.1.0
>     Reporter: Mark Hamstra
>     Assignee: Mark Hamstra
>     Priority: Minor
[jira] [Commented] (SPARK-3624) "Failed to find Spark assembly in /usr/share/spark/lib" for RELEASED debian packages
[ https://issues.apache.org/jira/browse/SPARK-3624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214183#comment-14214183 ]

Apache Spark commented on SPARK-3624:
-------------------------------------

User 'markhamstra' has created a pull request for this issue:
https://github.com/apache/spark/pull/3297

> "Failed to find Spark assembly in /usr/share/spark/lib" for RELEASED debian packages
> -------------------------------------------------------------------------------------
>
>          Key: SPARK-3624
>          URL: https://issues.apache.org/jira/browse/SPARK-3624
>      Project: Spark
>   Issue Type: Bug
>   Components: Build, Deploy
> Affects Versions: 1.1.0
>     Reporter: Christian Tzolov
>     Priority: Minor
>
> compute-classpath.sh requires that, for a 'RELEASED' package, the Spark assembly jar is accessible from a /lib folder. Currently the jdeb packaging (assembly module) bundles the assembly jar into a folder called 'jars'. The result is:
> {code}
> /usr/share/spark/bin/spark-submit --num-executors 10 --master yarn-cluster --class org.apache.spark.examples.SparkPi /usr/share/spark/jars/spark-examples-1.1.0-hadoop2.2.0-gphd-3.0.1.0.jar 10
> ls: cannot access /usr/share/spark/lib: No such file or directory
> Failed to find Spark assembly in /usr/share/spark/lib
> You need to build Spark before running this program.
> {code}
> A trivial solution is to rename '${deb.install.path}/jars' inside assembly/pom.xml to ${deb.install.path}/lib. Another, less impactful solution (considering backward compatibility) is to define a lib->jars symlink in the assembly/pom.xml.
[jira] [Commented] (SPARK-4399) Support multiple cloud providers
[ https://issues.apache.org/jira/browse/SPARK-4399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214167#comment-14214167 ]

Andrew Ash commented on SPARK-4399:
-----------------------------------

Agreed that it does seem out of scope for what an open source project would normally focus on. In my mind the Apache Spark team's responsibility ends at producing the source and binary tarball releases. If distributors or others want to make deploying those releases on particular cloud providers easier, they are free to do so, but that is not the Spark team's responsibility.

Have you observed much demand for non-EC2 cloud providers from the users list?

> Support multiple cloud providers
> --------------------------------
>
>          Key: SPARK-4399
>          URL: https://issues.apache.org/jira/browse/SPARK-4399
>      Project: Spark
>   Issue Type: New Feature
>   Components: EC2
> Affects Versions: 1.2.0
>     Reporter: Andrew Ash
>
> We currently have Spark startup scripts for Amazon EC2 but not for various other cloud providers. This ticket is an umbrella to support multiple cloud providers in the bundled scripts, not just Amazon.
[jira] [Commented] (SPARK-4434) spark-submit cluster deploy mode JAR URLs are broken in 1.1.1
[ https://issues.apache.org/jira/browse/SPARK-4434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214166#comment-14214166 ]

Aaron Davidson commented on SPARK-4434:
---------------------------------------

Side note: the error message about "file://", which was not introduced in the patch you reverted, is incorrect. A "file://XX.jar" URI is never valid. One or three slashes must be used; two slashes indicates that a hostname follows.

> spark-submit cluster deploy mode JAR URLs are broken in 1.1.1
> -------------------------------------------------------------
>
>          Key: SPARK-4434
>          URL: https://issues.apache.org/jira/browse/SPARK-4434
>      Project: Spark
>   Issue Type: Bug
>   Components: Deploy, Spark Core
> Affects Versions: 1.1.1, 1.2.0
>     Reporter: Josh Rosen
>     Assignee: Andrew Or
>     Priority: Blocker
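Aaron's slash-counting rule can be checked with any RFC 3986 parser; for instance (illustrative only, using Python's standard-library `urllib.parse` rather than the Hadoop `Path`/`FileSystem` classes that actually raised the "Wrong FS" error):

```python
from urllib.parse import urlparse

# With two slashes, everything up to the next '/' is parsed as the URI
# authority (a hostname), so "Users" is treated as a machine name and
# disappears from the path -- exactly the "Wrong FS" failure mode above.
two = urlparse("file://Users/joshrosen/example.jar")
assert two.netloc == "Users"
assert two.path == "/joshrosen/example.jar"

# With three slashes the authority is empty and the full local path
# survives intact.
three = urlparse("file:///Users/joshrosen/example.jar")
assert three.netloc == ""
assert three.path == "/Users/joshrosen/example.jar"

# A single slash ("file:/...") also keeps the whole path, which is why
# it is an equally valid spelling for local files.
one = urlparse("file:/Users/joshrosen/example.jar")
assert one.netloc == ""
assert one.path == "/Users/joshrosen/example.jar"
```

This is why the error message's `file://XX.jar` suggestion can never be followed literally: the `XX` segment would always be swallowed as a hostname.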
[jira] [Commented] (SPARK-4434) spark-submit cluster deploy mode JAR URLs are broken in 1.1.1
[ https://issues.apache.org/jira/browse/SPARK-4434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214165#comment-14214165 ] Apache Spark commented on SPARK-4434: - User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/3296 > spark-submit cluster deploy mode JAR URLs are broken in 1.1.1 > - > > Key: SPARK-4434 > URL: https://issues.apache.org/jira/browse/SPARK-4434 > Project: Spark > Issue Type: Bug > Components: Deploy, Spark Core >Affects Versions: 1.1.1, 1.2.0 >Reporter: Josh Rosen >Assignee: Andrew Or >Priority: Blocker > > When submitting a driver using {{spark-submit}} in cluster mode, Spark 1.1.0 > allowed you to omit the {{file://}} or {{hdfs://}} prefix from the > application JAR URL, e.g. > {code} > ./bin/spark-submit --deploy-mode cluster --master > spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi > /Users/joshrosen/Documents/old-spark-releases/spark-1.1.0-bin-hadoop1/lib/spark-examples-1.1.0-hadoop1.0.4.jar > {code} > In Spark 1.1.1 and 1.2.0, this same command now fails with an error: > {code} > ./bin/spark-submit --deploy-mode cluster --master > spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi > /Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar > Jar url > 'file:/Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar' > is not in valid format. > Must be a jar file path in URL format (e.g. 
hdfs://XX.jar, file://XX.jar) > Usage: DriverClient [options] launch > [driver options] > Usage: DriverClient kill > {code} > I tried changing my URL to conform to the new format, but this either > resulted in an error or a job that failed: > {code} > ./bin/spark-submit --deploy-mode cluster --master > spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi > file:///Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar > Jar url > 'file:///Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar' > is not in valid format. > Must be a jar file path in URL format (e.g. hdfs://XX.jar, file://XX.jar) > {code} > If I omit the extra slash: > {code} > ./bin/spark-submit --deploy-mode cluster --master > spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi > file://Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar > Sending launch command to spark://joshs-mbp.att.net:7077 > Driver successfully submitted as driver-20141116143235-0002 > ... waiting before polling master for driver state > ... 
polling master for driver state > State of driver-20141116143235-0002 is ERROR > Exception from cluster was: java.lang.IllegalArgumentException: Wrong FS: > file://Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar, > expected: file:/// > java.lang.IllegalArgumentException: Wrong FS: > file://Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar, > expected: file:/// > at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:381) > at > org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:55) > at > org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:393) > at > org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251) > at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:329) > at > org.apache.spark.deploy.worker.DriverRunner.org$apache$spark$deploy$worker$DriverRunner$$downloadUserJar(DriverRunner.scala:157) > at > org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:74) > {code} > This bug effectively prevents users from using {{spark-submit}} in cluster > mode to run drivers whose JARs are stored on shared cluster filesystems. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
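The "Wrong FS ... expected: file:///" failure above comes down to how {{file://}} URIs parse: with only two slashes, the first path segment ("Users") becomes the URI authority (a host name), which Hadoop's local filesystem then rejects. A minimal JVM sketch (plain {{java.net.URI}}, not Spark code) shows the difference:

```java
import java.net.URI;

// Sketch: why "file://Users/..." is parsed with "Users" as a host,
// while "file:///Users/..." keeps the whole path intact.
public class JarUrlParsing {
    public static void main(String[] args) {
        URI twoSlashes = URI.create("file://Users/joshrosen/app.jar");
        URI threeSlashes = URI.create("file:///Users/joshrosen/app.jar");

        System.out.println(twoSlashes.getAuthority());   // Users  <- treated as a host
        System.out.println(twoSlashes.getPath());        // /joshrosen/app.jar
        System.out.println(threeSlashes.getAuthority()); // null   <- empty authority
        System.out.println(threeSlashes.getPath());      // /Users/joshrosen/app.jar
    }
}
```

So the "omit the extra slash" workaround in the report silently moves the first directory into the host field, which is exactly what the RawLocalFileSystem stack trace complains about.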
[jira] [Created] (SPARK-4437) Docs for difference between WholeTextFileRecordReader and WholeCombineFileRecordReader
Andrew Ash created SPARK-4437: - Summary: Docs for difference between WholeTextFileRecordReader and WholeCombineFileRecordReader Key: SPARK-4437 URL: https://issues.apache.org/jira/browse/SPARK-4437 Project: Spark Issue Type: Documentation Components: Documentation Reporter: Andrew Ash Assignee: Davies Liu Tracking per this dev@ thread: {quote} On Sun, Nov 16, 2014 at 4:49 PM, Reynold Xin wrote: I don't think the code is immediately obvious. Davies - I think you added the code, and Josh reviewed it. Can you guys explain and maybe submit a patch to add more documentation on the whole thing? Thanks. On Sun, Nov 16, 2014 at 3:22 AM, Vibhanshu Prasad wrote: > Hello Everyone, > > I am going through the source code of rdd and Record readers > There are found 2 classes > > 1. WholeTextFileRecordReader > 2. WholeCombineFileRecordReader ( extends CombineFileRecordReader ) > > The description of both the classes is perfectly similar. > > I am not able to understand why we have 2 classes. Is > CombineFileRecordReader providing some extra advantage? > > Regards > Vibhanshu {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4434) spark-submit cluster deploy mode JAR URLs are broken in 1.1.1
[ https://issues.apache.org/jira/browse/SPARK-4434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214164#comment-14214164 ] Apache Spark commented on SPARK-4434: - User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/3295 > spark-submit cluster deploy mode JAR URLs are broken in 1.1.1 > - > > Key: SPARK-4434 > URL: https://issues.apache.org/jira/browse/SPARK-4434 > Project: Spark > Issue Type: Bug > Components: Deploy, Spark Core >Affects Versions: 1.1.1, 1.2.0 >Reporter: Josh Rosen >Assignee: Andrew Or >Priority: Blocker > > When submitting a driver using {{spark-submit}} in cluster mode, Spark 1.1.0 > allowed you to omit the {{file://}} or {{hdfs://}} prefix from the > application JAR URL, e.g. > {code} > ./bin/spark-submit --deploy-mode cluster --master > spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi > /Users/joshrosen/Documents/old-spark-releases/spark-1.1.0-bin-hadoop1/lib/spark-examples-1.1.0-hadoop1.0.4.jar > {code} > In Spark 1.1.1 and 1.2.0, this same command now fails with an error: > {code} > ./bin/spark-submit --deploy-mode cluster --master > spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi > /Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar > Jar url > 'file:/Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar' > is not in valid format. > Must be a jar file path in URL format (e.g. 
hdfs://XX.jar, file://XX.jar) > Usage: DriverClient [options] launch > [driver options] > Usage: DriverClient kill > {code} > I tried changing my URL to conform to the new format, but this either > resulted in an error or a job that failed: > {code} > ./bin/spark-submit --deploy-mode cluster --master > spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi > file:///Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar > Jar url > 'file:///Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar' > is not in valid format. > Must be a jar file path in URL format (e.g. hdfs://XX.jar, file://XX.jar) > {code} > If I omit the extra slash: > {code} > ./bin/spark-submit --deploy-mode cluster --master > spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi > file://Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar > Sending launch command to spark://joshs-mbp.att.net:7077 > Driver successfully submitted as driver-20141116143235-0002 > ... waiting before polling master for driver state > ... 
polling master for driver state > State of driver-20141116143235-0002 is ERROR > Exception from cluster was: java.lang.IllegalArgumentException: Wrong FS: > file://Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar, > expected: file:/// > java.lang.IllegalArgumentException: Wrong FS: > file://Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar, > expected: file:/// > at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:381) > at > org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:55) > at > org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:393) > at > org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251) > at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:329) > at > org.apache.spark.deploy.worker.DriverRunner.org$apache$spark$deploy$worker$DriverRunner$$downloadUserJar(DriverRunner.scala:157) > at > org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:74) > {code} > This bug effectively prevents users from using {{spark-submit}} in cluster > mode to run drivers whose JARs are stored on shared cluster filesystems. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4436) Debian packaging misses datanucleus jars
Mark Hamstra created SPARK-4436: --- Summary: Debian packaging misses datanucleus jars Key: SPARK-4436 URL: https://issues.apache.org/jira/browse/SPARK-4436 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.1.0, 1.0.1, 1.0.0 Reporter: Mark Hamstra Assignee: Mark Hamstra Priority: Minor If Spark is built with Hive support (i.e. -Phive), then the necessary datanucleus jars end up in lib_managed, not as part of the uber-jar. The Debian packaging isn't including anything from lib_managed. As a consequence, HiveContext et al. will fail with the packaged Spark even though it was built with -Phive. See the comment in bin/compute-classpath.sh. Packaging everything from lib_managed/jars into /lib is an adequate solution.
[jira] [Commented] (SPARK-4434) spark-submit cluster deploy mode JAR URLs are broken in 1.1.1
[ https://issues.apache.org/jira/browse/SPARK-4434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214155#comment-14214155 ] Matei Zaharia commented on SPARK-4434: -- [~joshrosen] make sure to revert this on 1.2 and master as well. > spark-submit cluster deploy mode JAR URLs are broken in 1.1.1 > - > > Key: SPARK-4434 > URL: https://issues.apache.org/jira/browse/SPARK-4434 > Project: Spark > Issue Type: Bug > Components: Deploy, Spark Core >Affects Versions: 1.1.1, 1.2.0 >Reporter: Josh Rosen >Assignee: Andrew Or >Priority: Blocker > > When submitting a driver using {{spark-submit}} in cluster mode, Spark 1.1.0 > allowed you to omit the {{file://}} or {{hdfs://}} prefix from the > application JAR URL, e.g. > {code} > ./bin/spark-submit --deploy-mode cluster --master > spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi > /Users/joshrosen/Documents/old-spark-releases/spark-1.1.0-bin-hadoop1/lib/spark-examples-1.1.0-hadoop1.0.4.jar > {code} > In Spark 1.1.1 and 1.2.0, this same command now fails with an error: > {code} > ./bin/spark-submit --deploy-mode cluster --master > spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi > /Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar > Jar url > 'file:/Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar' > is not in valid format. > Must be a jar file path in URL format (e.g. 
hdfs://XX.jar, file://XX.jar) > Usage: DriverClient [options] launch > [driver options] > Usage: DriverClient kill > {code} > I tried changing my URL to conform to the new format, but this either > resulted in an error or a job that failed: > {code} > ./bin/spark-submit --deploy-mode cluster --master > spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi > file:///Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar > Jar url > 'file:///Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar' > is not in valid format. > Must be a jar file path in URL format (e.g. hdfs://XX.jar, file://XX.jar) > {code} > If I omit the extra slash: > {code} > ./bin/spark-submit --deploy-mode cluster --master > spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi > file://Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar > Sending launch command to spark://joshs-mbp.att.net:7077 > Driver successfully submitted as driver-20141116143235-0002 > ... waiting before polling master for driver state > ... 
polling master for driver state > State of driver-20141116143235-0002 is ERROR > Exception from cluster was: java.lang.IllegalArgumentException: Wrong FS: > file://Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar, > expected: file:/// > java.lang.IllegalArgumentException: Wrong FS: > file://Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar, > expected: file:/// > at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:381) > at > org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:55) > at > org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:393) > at > org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251) > at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:329) > at > org.apache.spark.deploy.worker.DriverRunner.org$apache$spark$deploy$worker$DriverRunner$$downloadUserJar(DriverRunner.scala:157) > at > org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:74) > {code} > This bug effectively prevents users from using {{spark-submit}} in cluster > mode to run drivers whose JARs are stored on shared cluster filesystems. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4434) spark-submit cluster deploy mode JAR URLs are broken in 1.1.1
[ https://issues.apache.org/jira/browse/SPARK-4434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214152#comment-14214152 ] Josh Rosen commented on SPARK-4434: --- It looks like this was caused by https://github.com/apache/spark/pull/2925 (SPARK-4075), since reverting that fixes this issue. I'll work on committing my test code to our internal tests repository and open a PR to investigate / revert that commit. > spark-submit cluster deploy mode JAR URLs are broken in 1.1.1 > - > > Key: SPARK-4434 > URL: https://issues.apache.org/jira/browse/SPARK-4434 > Project: Spark > Issue Type: Bug > Components: Deploy, Spark Core >Affects Versions: 1.1.1, 1.2.0 >Reporter: Josh Rosen >Assignee: Andrew Or >Priority: Blocker > > When submitting a driver using {{spark-submit}} in cluster mode, Spark 1.1.0 > allowed you to omit the {{file://}} or {{hdfs://}} prefix from the > application JAR URL, e.g. > {code} > ./bin/spark-submit --deploy-mode cluster --master > spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi > /Users/joshrosen/Documents/old-spark-releases/spark-1.1.0-bin-hadoop1/lib/spark-examples-1.1.0-hadoop1.0.4.jar > {code} > In Spark 1.1.1 and 1.2.0, this same command now fails with an error: > {code} > ./bin/spark-submit --deploy-mode cluster --master > spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi > /Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar > Jar url > 'file:/Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar' > is not in valid format. > Must be a jar file path in URL format (e.g. 
hdfs://XX.jar, file://XX.jar) > Usage: DriverClient [options] launch > [driver options] > Usage: DriverClient kill > {code} > I tried changing my URL to conform to the new format, but this either > resulted in an error or a job that failed: > {code} > ./bin/spark-submit --deploy-mode cluster --master > spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi > file:///Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar > Jar url > 'file:///Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar' > is not in valid format. > Must be a jar file path in URL format (e.g. hdfs://XX.jar, file://XX.jar) > {code} > If I omit the extra slash: > {code} > ./bin/spark-submit --deploy-mode cluster --master > spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi > file://Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar > Sending launch command to spark://joshs-mbp.att.net:7077 > Driver successfully submitted as driver-20141116143235-0002 > ... waiting before polling master for driver state > ... 
polling master for driver state > State of driver-20141116143235-0002 is ERROR > Exception from cluster was: java.lang.IllegalArgumentException: Wrong FS: > file://Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar, > expected: file:/// > java.lang.IllegalArgumentException: Wrong FS: > file://Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar, > expected: file:/// > at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:381) > at > org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:55) > at > org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:393) > at > org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251) > at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:329) > at > org.apache.spark.deploy.worker.DriverRunner.org$apache$spark$deploy$worker$DriverRunner$$downloadUserJar(DriverRunner.scala:157) > at > org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:74) > {code} > This bug effectively prevents users from using {{spark-submit}} in cluster > mode to run drivers whose JARs are stored on shared cluster filesystems. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
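One way to see the shape of a fix for the regression described above: accept both bare local paths (as 1.1.0 did) and scheme-qualified URLs by normalizing the user-supplied string to a proper URI before validation. This is only an illustrative sketch under that assumption; {{JarUrlNormalizer}} and {{normalize}} are hypothetical names, not Spark's actual code, and the fix actually pursued in this ticket is reverting the offending commit.

```java
import java.net.URI;
import java.nio.file.Paths;

// Hypothetical sketch: turn a bare local path into a well-formed file:/// URI
// and pass scheme-qualified URLs (hdfs://..., file:///...) through untouched.
public class JarUrlNormalizer {
    static String normalize(String jar) {
        URI uri = URI.create(jar);
        if (uri.getScheme() == null) {
            // No scheme: treat the argument as a local filesystem path.
            return Paths.get(jar).toAbsolutePath().toUri().toString();
        }
        return jar;
    }

    public static void main(String[] args) {
        // On a POSIX system this prints file:///Users/joshrosen/app.jar
        System.out.println(normalize("/Users/joshrosen/app.jar"));
        System.out.println(normalize("hdfs://nn:8020/jars/app.jar"));
    }
}
```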
[jira] [Commented] (SPARK-4399) Support multiple cloud providers
[ https://issues.apache.org/jira/browse/SPARK-4399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214153#comment-14214153 ] Patrick Wendell commented on SPARK-4399: I think this might actually be "out of scope" and something for which it would be nice to have a community library or project outside of Spark. The Spark ec2 scripts were designed to be a way for someone to play around with a Spark cluster quickly, and there is definitely user interest in something richer for launching production Spark clusters across different environments, etc. > Support multiple cloud providers > > > Key: SPARK-4399 > URL: https://issues.apache.org/jira/browse/SPARK-4399 > Project: Spark > Issue Type: New Feature > Components: EC2 >Affects Versions: 1.2.0 >Reporter: Andrew Ash > > We currently have Spark startup scripts for Amazon EC2 but not for various > other cloud providers. This ticket is an umbrella to support multiple cloud > providers in the bundled scripts, not just Amazon. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2208) local metrics tests can fail on fast machines
[ https://issues.apache.org/jira/browse/SPARK-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214146#comment-14214146 ] Apache Spark commented on SPARK-2208: - User 'XuefengWu' has created a pull request for this issue: https://github.com/apache/spark/pull/3294 > local metrics tests can fail on fast machines > - > > Key: SPARK-2208 > URL: https://issues.apache.org/jira/browse/SPARK-2208 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Patrick Wendell > Labels: starter > > I'm temporarily disabling this check. I think the issue is that on fast > machines the fetch wait time can actually be zero, even across all tasks. > We should see if we can write this in a different way to make sure there is a > delay. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-603) add simple Counter API
[ https://issues.apache.org/jira/browse/SPARK-603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214142#comment-14214142 ] Imran Rashid commented on SPARK-603: Hey, this was originally reported by me too (probably I messed up when creating it on the old Jira, not sure if there is a way to change the reporter now?) I think perhaps the original issue was a little unclear, I'll try to clarify a little bit: I do *not* think we need to support something at the "operation" level -- having it work at the stage level (or even job level) is fine. I'm not even sure what it would mean to work at the operation level, since individual records are pushed through all the operations of a stage in one go. But the operation level is still a useful abstraction for the *developer*. It's nice for them to be able to write methods which are eg., just a {{filter}}. For normal RDD operations, this works just fine of course -- you can have a bunch of util methods that take in an RDD and output an RDD, maybe some {{filter}}, some {{map}}, etc., they can get combined however you like, everything remains lazy until there is some action. All wonderful. Things get messy as soon as you start to include accumulators, however -- you've got to include them in your return values and then the outside logic has to know when they actually contain valid data. Rather than trying to solve this problem in general, I'm proposing that we do something dead-simple for basic counters, which might even live outside of accumulators completely. Putting accumulator values in the web UI is not bad for just this purpose, but overall I don't think it's the right solution: 1. It limits what we can do with accumulators (see my comments on SPARK-664) 2. The API is more complicated than it needs to be. If the only point of accumulators is counters, then we can get away with something as simple as: {code} rdd.map{x => if (isFoobar(x)) { Counters("foobar") += 1 } ... 
} {code} (eg., no need to even declare the counter up front.) 3. Having the value in the UI is nice, but it's not the same as programmatic access. eg. it can be useful to have them in the job logs, the actual values might be used in other computation (eg., gives the size of a data structure for a later step), etc. Even with the simpler counter API, this is tricky b/c of lazy evaluation. But maybe that is a reason you create a call-back up front: {code} Counters.addCallback("foobar"){counts => ...} rdd.map{x => if (isFoobar(x)) { Counters("foobar") += 1 } ... } {code} 4. If you have long-running tasks, it might be nice to get incremental feedback from counters *during* the task. (There was a real need for long-running tasks before sort-based shuffle, when you couldn't have too many tasks in a shuffle ... perhaps it's not anymore, I'm not sure.) We can get a little further with accumulators, eg. a SparkListener could do something with accumulator values when the stages finish. But I think we're stuck on the other points. I feel like right now accumulators are trapped between just being counters, and being a more general method of computation, and not quite doing either one very well. > add simple Counter API > -- > > Key: SPARK-603 > URL: https://issues.apache.org/jira/browse/SPARK-603 > Project: Spark > Issue Type: New Feature >Priority: Minor > > Users need a very simple way to create counters in their jobs. 
Accumulators > provide a way to do this, but are a little clunky, for two reasons: > 1) the setup is a nuisance > 2) w/ delayed evaluation, you don't know when it will actually run, so its > hard to look at the values > consider this code: > {code} > def filterBogus(rdd:RDD[MyCustomClass], sc: SparkContext) = { > val filterCount = sc.accumulator(0) > val filtered = rdd.filter{r => > if (isOK(r)) true else {filterCount += 1; false} > } > println("removed " + filterCount.value + " records) > filtered > } > {code} > The println will always say 0 records were filtered, because its printed > before anything has actually run. I could print out the value later on, but > note that it would destroy the modularity of the method -- kinda ugly to > return the accumulator just so that it can get printed later on. (and of > course, the caller in turn might not know when the filter is going to get > applied, and would have to pass the accumulator up even further ...) > I'd like to have Counters which just automatically get printed out whenever a > stage has been run, and also with some api to get them back. I realize this > is tricky b/c a stage can get re-computed, so maybe you should only increment > the counters once. > Maybe a more general way to do this is to provide some callback for whenever > an RDD is computed -- by default, you would just print
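The "dead-simple" counter registry sketched in the comment above could look roughly like the following. This is a hypothetical illustration of the proposal, not Spark code: {{Counters}}, {{add}}, {{addCallback}}, and {{stageFinished}} are invented names. It shows the two properties argued for: no up-front declaration, and a callback that observes the value when a stage (here, simulated by a plain loop) finishes.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.BiConsumer;

// Hypothetical sketch of the proposed simple counter API.
public class Counters {
    private static final Map<String, LongAdder> counts = new ConcurrentHashMap<>();
    private static final Map<String, BiConsumer<String, Long>> callbacks = new ConcurrentHashMap<>();

    // Increment a counter, creating it on first use (no declaration needed).
    public static void add(String name, long delta) {
        counts.computeIfAbsent(name, k -> new LongAdder()).add(delta);
    }

    public static long get(String name) {
        LongAdder a = counts.get(name);
        return a == null ? 0L : a.sum();
    }

    // Register a callback invoked with the counter's value at stage end,
    // sidestepping the lazy-evaluation problem described in the issue.
    public static void addCallback(String name, BiConsumer<String, Long> cb) {
        callbacks.put(name, cb);
    }

    // A driver/listener would call this when a stage completes.
    public static void stageFinished() {
        callbacks.forEach((name, cb) -> cb.accept(name, get(name)));
    }

    public static void main(String[] args) {
        Counters.addCallback("foobar", (name, v) -> System.out.println(name + " = " + v));
        for (int x = 0; x < 10; x++) {
            if (x % 2 == 0) Counters.add("foobar", 1);  // like the filter in the example
        }
        Counters.stageFinished();  // prints: foobar = 5
    }
}
```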
[jira] [Created] (SPARK-4435) Add setThreshold in Python LogisticRegressionModel and SVMModel
Matei Zaharia created SPARK-4435: Summary: Add setThreshold in Python LogisticRegressionModel and SVMModel Key: SPARK-4435 URL: https://issues.apache.org/jira/browse/SPARK-4435 Project: Spark Issue Type: New Feature Components: MLlib, PySpark Reporter: Matei Zaharia
[jira] [Commented] (SPARK-4306) LogisticRegressionWithLBFGS support for PySpark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214134#comment-14214134 ] Matei Zaharia commented on SPARK-4306: -- [~srinathsmn] I've assigned it to you. When do you think you'll get this done? It would be great to include in 1.2 but for that we'd need it quite soon (say this week). If you don't have time, I can also assign it to someone else. > LogisticRegressionWithLBFGS support for PySpark MLlib > -- > > Key: SPARK-4306 > URL: https://issues.apache.org/jira/browse/SPARK-4306 > Project: Spark > Issue Type: New Feature > Components: MLlib, PySpark >Reporter: Varadharajan > Labels: newbie > Original Estimate: 48h > Remaining Estimate: 48h > > Currently we are supporting LogisticRegressionWithSGD in the PySpark MLlib > interface. This task is to add support for LogisticRegressionWithLBFGS > algorithm.
[jira] [Updated] (SPARK-4306) LogisticRegressionWithLBFGS support for PySpark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-4306: - Assignee: Varadharajan > LogisticRegressionWithLBFGS support for PySpark MLlib > -- > > Key: SPARK-4306 > URL: https://issues.apache.org/jira/browse/SPARK-4306 > Project: Spark > Issue Type: New Feature > Components: MLlib, PySpark >Reporter: Varadharajan >Assignee: Varadharajan > Labels: newbie > Original Estimate: 48h > Remaining Estimate: 48h > > Currently we are supporting LogisticRegressionWithSGD in the PySpark MLlib > interface. This task is to add support for LogisticRegressionWithLBFGS > algorithm.
[jira] [Updated] (SPARK-4306) LogisticRegressionWithLBFGS support for PySpark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-4306: - Target Version/s: 1.2.0 > LogisticRegressionWithLBFGS support for PySpark MLlib > -- > > Key: SPARK-4306 > URL: https://issues.apache.org/jira/browse/SPARK-4306 > Project: Spark > Issue Type: New Feature > Components: MLlib, PySpark >Reporter: Varadharajan > Labels: newbie > Original Estimate: 48h > Remaining Estimate: 48h > > Currently we are supporting LogisticRegressionWithSGD in the PySpark MLlib > interface. This task is to add support for LogisticRegressionWithLBFGS > algorithm.
[jira] [Commented] (SPARK-4407) Thrift server for 0.13.1 doesn't deserialize complex types properly
[ https://issues.apache.org/jira/browse/SPARK-4407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214127#comment-14214127 ] Apache Spark commented on SPARK-4407: - User 'marmbrus' has created a pull request for this issue: https://github.com/apache/spark/pull/3292 > Thrift server for 0.13.1 doesn't deserialize complex types properly > --- > > Key: SPARK-4407 > URL: https://issues.apache.org/jira/browse/SPARK-4407 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.2 >Reporter: Cheng Lian >Priority: Blocker > Fix For: 1.2.0 > > > The following snippet can reproduce this issue: > {code} > CREATE TABLE t0(m MAP); > INSERT OVERWRITE TABLE t0 SELECT MAP(key, value) FROM src LIMIT 10; > SELECT * FROM t0; > {code} > Exception throw: > {code} > java.lang.RuntimeException: java.lang.ClassCastException: > scala.collection.immutable.Map$Map1 cannot be cast to java.lang.String > at > org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:84) > at > org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37) > at > org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548) > at > org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493) > at > org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:60) > at com.sun.proxy.$Proxy21.fetchResults(Unknown Source) > at > org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:405) > at > org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:530) > at > org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1553) > at > 
org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1538) > at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) > at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) > at > org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:55) > at > org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.ClassCastException: scala.collection.immutable.Map$Map1 > cannot be cast to java.lang.String > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.addNonNullColumnValue(Shim13.scala:142) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.getNextRowSet(Shim13.scala:165) > at > org.apache.hive.service.cli.operation.OperationManager.getOperationNextRowSet(OperationManager.java:192) > at > org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:471) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79) > ... 19 more > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
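The ClassCastException in the trace above arises because a complex value (here a Scala Map) is cast straight to String while building the Thrift row set. A hedged sketch of the kind of conversion such a fix needs: render complex values (maps, lists) to Hive-style strings before transport instead of casting. {{toHiveString}} is an illustrative name, not the actual Spark method, and the exact output format here is an assumption.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch: recursively render complex values to a Hive-like string form
// rather than casting them to String directly.
public class ComplexTypeRendering {
    static String toHiveString(Object v) {
        if (v instanceof Map) {
            Map<?, ?> m = (Map<?, ?>) v;
            return m.entrySet().stream()
                    .map(e -> toHiveString(e.getKey()) + ":" + toHiveString(e.getValue()))
                    .collect(Collectors.joining(",", "{", "}"));
        }
        if (v instanceof List) {
            List<?> l = (List<?>) v;
            return l.stream().map(ComplexTypeRendering::toHiveString)
                    .collect(Collectors.joining(",", "[", "]"));
        }
        return String.valueOf(v);
    }

    public static void main(String[] args) {
        Object row = Map.of("238", "val_238");           // like MAP(key, value) in the repro
        System.out.println(toHiveString(row));           // {238:val_238}
        System.out.println(toHiveString(List.of(1, 2))); // [1,2]
    }
}
```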
[jira] [Commented] (SPARK-4309) Date type support missing in HiveThriftServer2
[ https://issues.apache.org/jira/browse/SPARK-4309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214126#comment-14214126 ] Apache Spark commented on SPARK-4309: - User 'marmbrus' has created a pull request for this issue: https://github.com/apache/spark/pull/3292 > Date type support missing in HiveThriftServer2 > -- > > Key: SPARK-4309 > URL: https://issues.apache.org/jira/browse/SPARK-4309 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.1 >Reporter: Cheng Lian > Fix For: 1.2.0 > > > Date type is not supported while retrieving result set in HiveThriftServer2. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4434) spark-submit cluster deploy mode JAR URLs are broken in 1.1.1
[ https://issues.apache.org/jira/browse/SPARK-4434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214123#comment-14214123 ] Josh Rosen commented on SPARK-4434: --- I think that there are only a small number of patches in branch-1.1 that are related to this, so I'm going to see if I can narrow it down to a specific commit. https://github.com/apache/spark/pull/2925 is one potential culprit, but there may be others. I'm not sure whether this affects HDFS URLs; I haven't tried it yet since I don't have a Docker-ized HDFS set up in my integration tests project. > spark-submit cluster deploy mode JAR URLs are broken in 1.1.1 > - > > Key: SPARK-4434 > URL: https://issues.apache.org/jira/browse/SPARK-4434 > Project: Spark > Issue Type: Bug > Components: Deploy, Spark Core >Affects Versions: 1.1.1, 1.2.0 >Reporter: Josh Rosen >Assignee: Andrew Or >Priority: Blocker > > When submitting a driver using {{spark-submit}} in cluster mode, Spark 1.1.0 > allowed you to omit the {{file://}} or {{hdfs://}} prefix from the > application JAR URL, e.g. > {code} > ./bin/spark-submit --deploy-mode cluster --master > spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi > /Users/joshrosen/Documents/old-spark-releases/spark-1.1.0-bin-hadoop1/lib/spark-examples-1.1.0-hadoop1.0.4.jar > {code} > In Spark 1.1.1 and 1.2.0, this same command now fails with an error: > {code} > ./bin/spark-submit --deploy-mode cluster --master > spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi > /Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar > Jar url > 'file:/Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar' > is not in valid format. > Must be a jar file path in URL format (e.g. 
hdfs://XX.jar, file://XX.jar) > Usage: DriverClient [options] launch > [driver options] > Usage: DriverClient kill > {code} > I tried changing my URL to conform to the new format, but this either > resulted in an error or a job that failed: > {code} > ./bin/spark-submit --deploy-mode cluster --master > spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi > file:///Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar > Jar url > 'file:///Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar' > is not in valid format. > Must be a jar file path in URL format (e.g. hdfs://XX.jar, file://XX.jar) > {code} > If I omit the extra slash: > {code} > ./bin/spark-submit --deploy-mode cluster --master > spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi > file://Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar > Sending launch command to spark://joshs-mbp.att.net:7077 > Driver successfully submitted as driver-20141116143235-0002 > ... waiting before polling master for driver state > ... 
polling master for driver state > State of driver-20141116143235-0002 is ERROR > Exception from cluster was: java.lang.IllegalArgumentException: Wrong FS: > file://Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar, > expected: file:/// > java.lang.IllegalArgumentException: Wrong FS: > file://Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar, > expected: file:/// > at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:381) > at > org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:55) > at > org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:393) > at > org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251) > at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:329) > at > org.apache.spark.deploy.worker.DriverRunner.org$apache$spark$deploy$worker$DriverRunner$$downloadUserJar(DriverRunner.scala:157) > at > org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:74) > {code} > This bug effectively prevents users from using {{spark-submit}} in cluster > mode to run drivers whose JARs are stored on shared cluster filesystems. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
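The regression boils down to stricter jar-URL validation in the driver client: bare local paths that Spark 1.1.0 accepted no longer pass. One permissive approach (a hypothetical sketch, not the actual Spark patch) is to default scheme-less paths to the `file:` scheme instead of rejecting them:

```python
from urllib.parse import urlparse

# Hypothetical sketch of more permissive jar-URL handling (not the actual
# patch): accept bare local paths by defaulting to the file: scheme, which
# restores the Spark 1.1.0 behaviour described above.
def normalize_jar_url(jar_url: str) -> str:
    parsed = urlparse(jar_url)
    if parsed.scheme == "":
        # No scheme given: treat it as a local filesystem path.
        return "file:" + jar_url
    return jar_url
```

URLs that already carry a scheme (`hdfs://...`, `file://...`) pass through untouched, so shared-filesystem jars keep working.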
[jira] [Assigned] (SPARK-4434) spark-submit cluster deploy mode JAR URLs are broken in 1.1.1
[ https://issues.apache.org/jira/browse/SPARK-4434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen reassigned SPARK-4434: - Assignee: Josh Rosen > spark-submit cluster deploy mode JAR URLs are broken in 1.1.1 > - > > Key: SPARK-4434 > URL: https://issues.apache.org/jira/browse/SPARK-4434 > Project: Spark > Issue Type: Bug > Components: Deploy, Spark Core >Affects Versions: 1.1.1, 1.2.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Blocker > > When submitting a driver using {{spark-submit}} in cluster mode, Spark 1.1.0 > allowed you to omit the {{file://}} or {{hdfs://}} prefix from the > application JAR URL, e.g. > {code} > ./bin/spark-submit --deploy-mode cluster --master > spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi > /Users/joshrosen/Documents/old-spark-releases/spark-1.1.0-bin-hadoop1/lib/spark-examples-1.1.0-hadoop1.0.4.jar > {code} > In Spark 1.1.1 and 1.2.0, this same command now fails with an error: > {code} > ./bin/spark-submit --deploy-mode cluster --master > spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi > /Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar > Jar url > 'file:/Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar' > is not in valid format. > Must be a jar file path in URL format (e.g. 
hdfs://XX.jar, file://XX.jar) > Usage: DriverClient [options] launch > [driver options] > Usage: DriverClient kill > {code} > I tried changing my URL to conform to the new format, but this either > resulted in an error or a job that failed: > {code} > ./bin/spark-submit --deploy-mode cluster --master > spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi > file:///Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar > Jar url > 'file:///Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar' > is not in valid format. > Must be a jar file path in URL format (e.g. hdfs://XX.jar, file://XX.jar) > {code} > If I omit the extra slash: > {code} > ./bin/spark-submit --deploy-mode cluster --master > spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi > file://Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar > Sending launch command to spark://joshs-mbp.att.net:7077 > Driver successfully submitted as driver-20141116143235-0002 > ... waiting before polling master for driver state > ... 
polling master for driver state > State of driver-20141116143235-0002 is ERROR > Exception from cluster was: java.lang.IllegalArgumentException: Wrong FS: > file://Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar, > expected: file:/// > java.lang.IllegalArgumentException: Wrong FS: > file://Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar, > expected: file:/// > at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:381) > at > org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:55) > at > org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:393) > at > org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251) > at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:329) > at > org.apache.spark.deploy.worker.DriverRunner.org$apache$spark$deploy$worker$DriverRunner$$downloadUserJar(DriverRunner.scala:157) > at > org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:74) > {code} > This bug effectively prevents users from using {{spark-submit}} in cluster > mode to run drivers whose JARs are stored on shared cluster filesystems. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4434) spark-submit cluster deploy mode JAR URLs are broken in 1.1.1
[ https://issues.apache.org/jira/browse/SPARK-4434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4434: -- Assignee: Andrew Or (was: Josh Rosen) > spark-submit cluster deploy mode JAR URLs are broken in 1.1.1 > - > > Key: SPARK-4434 > URL: https://issues.apache.org/jira/browse/SPARK-4434 > Project: Spark > Issue Type: Bug > Components: Deploy, Spark Core >Affects Versions: 1.1.1, 1.2.0 >Reporter: Josh Rosen >Assignee: Andrew Or >Priority: Blocker > > When submitting a driver using {{spark-submit}} in cluster mode, Spark 1.1.0 > allowed you to omit the {{file://}} or {{hdfs://}} prefix from the > application JAR URL, e.g. > {code} > ./bin/spark-submit --deploy-mode cluster --master > spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi > /Users/joshrosen/Documents/old-spark-releases/spark-1.1.0-bin-hadoop1/lib/spark-examples-1.1.0-hadoop1.0.4.jar > {code} > In Spark 1.1.1 and 1.2.0, this same command now fails with an error: > {code} > ./bin/spark-submit --deploy-mode cluster --master > spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi > /Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar > Jar url > 'file:/Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar' > is not in valid format. > Must be a jar file path in URL format (e.g. 
hdfs://XX.jar, file://XX.jar) > Usage: DriverClient [options] launch > [driver options] > Usage: DriverClient kill > {code} > I tried changing my URL to conform to the new format, but this either > resulted in an error or a job that failed: > {code} > ./bin/spark-submit --deploy-mode cluster --master > spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi > file:///Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar > Jar url > 'file:///Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar' > is not in valid format. > Must be a jar file path in URL format (e.g. hdfs://XX.jar, file://XX.jar) > {code} > If I omit the extra slash: > {code} > ./bin/spark-submit --deploy-mode cluster --master > spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi > file://Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar > Sending launch command to spark://joshs-mbp.att.net:7077 > Driver successfully submitted as driver-20141116143235-0002 > ... waiting before polling master for driver state > ... 
polling master for driver state > State of driver-20141116143235-0002 is ERROR > Exception from cluster was: java.lang.IllegalArgumentException: Wrong FS: > file://Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar, > expected: file:/// > java.lang.IllegalArgumentException: Wrong FS: > file://Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar, > expected: file:/// > at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:381) > at > org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:55) > at > org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:393) > at > org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251) > at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:329) > at > org.apache.spark.deploy.worker.DriverRunner.org$apache$spark$deploy$worker$DriverRunner$$downloadUserJar(DriverRunner.scala:157) > at > org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:74) > {code} > This bug effectively prevents users from using {{spark-submit}} in cluster > mode to run drivers whose JARs are stored on shared cluster filesystems. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4434) spark-submit cluster deploy mode JAR URLs are broken in 1.1.1
Josh Rosen created SPARK-4434: - Summary: spark-submit cluster deploy mode JAR URLs are broken in 1.1.1 Key: SPARK-4434 URL: https://issues.apache.org/jira/browse/SPARK-4434 Project: Spark Issue Type: Bug Components: Deploy, Spark Core Affects Versions: 1.1.1, 1.2.0 Reporter: Josh Rosen Priority: Blocker When submitting a driver using {{spark-submit}} in cluster mode, Spark 1.1.0 allowed you to omit the {{file://}} or {{hdfs://}} prefix from the application JAR URL, e.g. {code} ./bin/spark-submit --deploy-mode cluster --master spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi /Users/joshrosen/Documents/old-spark-releases/spark-1.1.0-bin-hadoop1/lib/spark-examples-1.1.0-hadoop1.0.4.jar {code} In Spark 1.1.1 and 1.2.0, this same command now fails with an error: {code} ./bin/spark-submit --deploy-mode cluster --master spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi /Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar Jar url 'file:/Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar' is not in valid format. Must be a jar file path in URL format (e.g. hdfs://XX.jar, file://XX.jar) Usage: DriverClient [options] launch [driver options] Usage: DriverClient kill {code} I tried changing my URL to conform to the new format, but this either resulted in an error or a job that failed: {code} ./bin/spark-submit --deploy-mode cluster --master spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi file:///Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar Jar url 'file:///Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar' is not in valid format. Must be a jar file path in URL format (e.g. 
hdfs://XX.jar, file://XX.jar) {code} If I omit the extra slash: {code} ./bin/spark-submit --deploy-mode cluster --master spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi file://Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar Sending launch command to spark://joshs-mbp.att.net:7077 Driver successfully submitted as driver-20141116143235-0002 ... waiting before polling master for driver state ... polling master for driver state State of driver-20141116143235-0002 is ERROR Exception from cluster was: java.lang.IllegalArgumentException: Wrong FS: file://Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar, expected: file:/// java.lang.IllegalArgumentException: Wrong FS: file://Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar, expected: file:/// at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:381) at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:55) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:393) at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251) at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:329) at org.apache.spark.deploy.worker.DriverRunner.org$apache$spark$deploy$worker$DriverRunner$$downloadUserJar(DriverRunner.scala:157) at org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:74) {code} This bug effectively prevents users from using {{spark-submit}} in cluster mode to run drivers whose JARs are stored on shared cluster filesystems. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4407) Thrift server for 0.13.1 doesn't deserialize complex types properly
[ https://issues.apache.org/jira/browse/SPARK-4407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4407. - Resolution: Fixed Fix Version/s: 1.2.0 > Thrift server for 0.13.1 doesn't deserialize complex types properly > --- > > Key: SPARK-4407 > URL: https://issues.apache.org/jira/browse/SPARK-4407 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.2 >Reporter: Cheng Lian >Priority: Blocker > Fix For: 1.2.0 > > > The following snippet can reproduce this issue: > {code} > CREATE TABLE t0(m MAP); > INSERT OVERWRITE TABLE t0 SELECT MAP(key, value) FROM src LIMIT 10; > SELECT * FROM t0; > {code} > Exception throw: > {code} > java.lang.RuntimeException: java.lang.ClassCastException: > scala.collection.immutable.Map$Map1 cannot be cast to java.lang.String > at > org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:84) > at > org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37) > at > org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548) > at > org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493) > at > org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:60) > at com.sun.proxy.$Proxy21.fetchResults(Unknown Source) > at > org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:405) > at > org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:530) > at > org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1553) > at > org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1538) > at 
org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) > at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) > at > org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:55) > at > org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.ClassCastException: scala.collection.immutable.Map$Map1 > cannot be cast to java.lang.String > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.addNonNullColumnValue(Shim13.scala:142) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.getNextRowSet(Shim13.scala:165) > at > org.apache.hive.service.cli.operation.OperationManager.getOperationNextRowSet(OperationManager.java:192) > at > org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:471) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79) > ... 19 more > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4309) Date type support missing in HiveThriftServer2
[ https://issues.apache.org/jira/browse/SPARK-4309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4309. - Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 3178 [https://github.com/apache/spark/pull/3178] > Date type support missing in HiveThriftServer2 > -- > > Key: SPARK-4309 > URL: https://issues.apache.org/jira/browse/SPARK-4309 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.1 >Reporter: Cheng Lian > Fix For: 1.2.0 > > > Date type is not supported while retrieving result set in HiveThriftServer2. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-664) Accumulator updates should get locally merged before sent to the driver
[ https://issues.apache.org/jira/browse/SPARK-664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214038#comment-14214038 ] Imran Rashid commented on SPARK-664: Hi [~aash], thanks for taking another look at this -- sorry I have been aloof for a little while. I didn't know about SPARK-2380; obviously this was created long before that. Honestly, I'm not a big fan of SPARK-2380; it seems to really limit what we can do with accumulators. We could really use them to expose a completely different model of computation. Let me give an example use case. Accumulators are in principle general enough that they let you compute lots of different things in one pass. E.g., by using accumulators, you could: * create a bloom filter of records that meet some criteria * assign records to different buckets, and count how many are in each bucket, even up to 100K buckets (e.g., by having an accumulator of {{Array}}) * use hyperloglog to count how many distinct ids you have * filter down to only those records with some parsing error, for a closer look (just by using plain old {{rdd.filter()}}) You could do all that in one pass, if the first three were done with accumulators. When I started using Spark, I actually wrote a bunch of code to do exactly that kind of thing. But it performed really poorly -- after some profiling and investigating how accumulators work, I saw why. Those big accumulators I was creating just put a lot of work on the driver. Accumulators provide the right API to do that kind of thing, but the implementation would have to change. I definitely agree that if the results get merged on the executor before getting sent to the driver, it increases the latency of the *per-task* results, but does that matter? I would prefer that we have something that supports the more general computation model, and the important thing is only the latency of the *overall* result.
It feels like we're moving to accumulators being treated just like counters (but with an awkward api). > Accumulator updates should get locally merged before sent to the driver > --- > > Key: SPARK-664 > URL: https://issues.apache.org/jira/browse/SPARK-664 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Imran Rashid >Priority: Minor > > Whenever a task finishes, the accumulator updates from that task are > immediately sent back to the driver. When the accumulator updates are big, > this is inefficient because (a) a lot more data has to be sent to the driver > and (b) the driver has to do all the work of merging the updates together. > Probably doesn't matter for small accumulators / low number of tasks, but if > both are big, this could be a big bottleneck. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
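The merge-before-send idea in this thread can be sketched in a few lines (a hypothetical illustration, not Spark code). The merge operation shown is element-wise addition for the bucket-count accumulator from Imran's example:

```python
# Hypothetical illustration of local merging (not Spark code): each executor
# folds its tasks' accumulator updates together and ships one merged value,
# so the driver performs one merge per executor rather than one per task.
def merge_bucket_counts(a, b):
    """Element-wise addition: the merge op for a bucket-count accumulator."""
    return [x + y for x, y in zip(a, b)]

def merge_locally(task_updates, num_buckets):
    """Combine all task-level updates on the executor before sending."""
    merged = [0] * num_buckets
    for update in task_updates:
        merged = merge_bucket_counts(merged, update)
    return merged
```

For a large accumulator (say, 100K buckets) and many tasks per executor, this moves both the network traffic and the merge work off the driver, at the cost of delaying per-task updates until the executor flushes.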
[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans
[ https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14213945#comment-14213945 ] RJ Nowling commented on SPARK-2429: --- Hi Yu, I'm having trouble finding the function to cut a dendrogram -- I see the tests but not the implementation. I feel that you should be able to assign values in O(log N) time with the hierarchical method vs O(N) with the standard kmeans. So, say you train a model (this may be slower than kmeans) and then assign additional points to clusters after training. If clusters at the same level in the hierarchy do not overlap, you should be able to choose the closest cluster at each level until you find a leaf. I'm assuming that the children of a given cluster are contained within that cluster (spatially) -- can you show this or find a reference for it? If so, then assignment should be faster for a larger number of clusters, as Jun was saying above. Do you agree with this? Or is there something I am misunderstanding? Thanks! > Hierarchical Implementation of KMeans > - > > Key: SPARK-2429 > URL: https://issues.apache.org/jira/browse/SPARK-2429 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: RJ Nowling >Assignee: Yu Ishikawa >Priority: Minor > Attachments: 2014-10-20_divisive-hierarchical-clustering.pdf, The > Result of Benchmarking a Hierarchical Clustering.pdf, > benchmark-result.2014-10-29.html, benchmark2.html > > > Hierarchical clustering algorithms are widely used and would make a nice > addition to MLlib. Clustering algorithms are useful for determining > relationships between clusters as well as offering faster assignment.
> Discussion on the dev list suggested the following possible approaches: > * Top down, recursive application of KMeans > * Reuse DecisionTree implementation with different objective function > * Hierarchical SVD > It was also suggested that support for distance metrics other than Euclidean > such as negative dot or cosine are necessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
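The O(log N) assignment discussed above amounts to a descent of the dendrogram, choosing the child whose center is closer at each level. A minimal sketch, assuming children are spatially contained within their parents (this is NOT the actual MLlib implementation, and the node types are invented for illustration):

```python
from dataclasses import dataclass

# Hypothetical dendrogram node types for illustration only.
@dataclass
class Leaf:
    cluster_id: int
    center: list

@dataclass
class Branch:
    center: list
    left: object   # Leaf or Branch
    right: object  # Leaf or Branch

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(node, point):
    """Walk the tree: O(depth) distance checks instead of O(N) over leaves."""
    while isinstance(node, Branch):
        if sq_dist(point, node.left.center) <= sq_dist(point, node.right.center):
            node = node.left
        else:
            node = node.right
    return node.cluster_id
```

If the spatial-containment assumption fails (children overlap across branches), this greedy descent can assign a point to a non-closest leaf, which is exactly why RJ asks for a proof or reference.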
[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans
[ https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14213942#comment-14213942 ] Jun Yang commented on SPARK-2429: - Hi Yu Ishikawa, Thanks for your wonderful hierarchical implementation of KMeans, which meets one of our project requirements :) In our project, we initially used an MPI-based HAC implementation to do agglomerative (bottom-up) hierarchical clustering, and since we want to migrate the entire back-end pipeline to Spark, we are looking for a similar hierarchical clustering implementation on Spark; otherwise we would need to write one ourselves. From a functionality perspective, your implementation looks pretty good (I have already read through your code), but I still have several questions regarding performance and scalability: 1. In your implementation, each divisive step performs a "copy" operation to distribute the data points from the parent cluster tree to the split children cluster trees. When the document set is large, I think this copy cost is non-negligible, right? A potential optimization is to keep the entire document data cached, and in each divisive step just record the indices of the documents in the ClusterTree object, so the cost could be lowered quite a lot. Does this idea make sense? 2. In your test code, the cluster size is not very large (only about 100). Have you ever tested with a big cluster size and a big document corpus, e.g., 1 clusters with 200 documents? What is the performance behavior for this kind of use case? In production environments, this use case is quite typical. Look forward to your reply.
Thanks > Hierarchical Implementation of KMeans > - > > Key: SPARK-2429 > URL: https://issues.apache.org/jira/browse/SPARK-2429 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: RJ Nowling >Assignee: Yu Ishikawa >Priority: Minor > Attachments: 2014-10-20_divisive-hierarchical-clustering.pdf, The > Result of Benchmarking a Hierarchical Clustering.pdf, > benchmark-result.2014-10-29.html, benchmark2.html > > > Hierarchical clustering algorithms are widely used and would make a nice > addition to MLlib. Clustering algorithms are useful for determining > relationships between clusters as well as offering faster assignment. > Discussion on the dev list suggested the following possible approaches: > * Top down, recursive application of KMeans > * Reuse DecisionTree implementation with different objective function > * Hierarchical SVD > It was also suggested that support for distance metrics other than Euclidean > such as negative dot or cosine are necessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
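The index-caching optimization raised in question 1 above can be sketched as follows (hypothetical, not the actual implementation): a divisive step partitions integer indices over one shared data array instead of copying vectors into the child clusters.

```python
# Hypothetical sketch of the index-based divisive step suggested above (not
# the actual implementation): the full vector data stays cached once; each
# cluster tree node stores only the indices of its member documents, so a
# split moves integers rather than copying vectors.
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def split_by_index(data, member_indices, center0, center1):
    """Partition a cluster's member indices between two child centers."""
    left, right = [], []
    for i in member_indices:
        if sq_dist(data[i], center0) <= sq_dist(data[i], center1):
            left.append(i)
        else:
            right.append(i)
    return left, right
```

Per split this touches only the member indices of one cluster, so the cumulative cost over the whole tree depends on tree depth times N index moves, not on repeatedly copying full vectors.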
[jira] [Resolved] (SPARK-4393) Memory leak in connection manager timeout thread
[ https://issues.apache.org/jira/browse/SPARK-4393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-4393. Resolution: Fixed Fix Version/s: 1.2.0 > Memory leak in connection manager timeout thread > > > Key: SPARK-4393 > URL: https://issues.apache.org/jira/browse/SPARK-4393 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Blocker > Fix For: 1.2.0 > > > This JIRA tracks a fix for a memory leak in ConnectionManager's TimerTasks, > originally reported in [a > comment|https://issues.apache.org/jira/browse/SPARK-3633?focusedCommentId=14208318&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14208318] > on SPARK-3633. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org