[jira] [Updated] (SPARK-6727) Model export/import for spark.ml: HashingTF
[ https://issues.apache.org/jira/browse/SPARK-6727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-6727: - Target Version/s: (was: 1.4.0) Model export/import for spark.ml: HashingTF --- Key: SPARK-6727 URL: https://issues.apache.org/jira/browse/SPARK-6727 Project: Spark Issue Type: Sub-task Components: ML Affects Versions: 1.3.0 Reporter: Joseph K. Bradley -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6788) Model export/import for spark.ml: Tokenizer
[ https://issues.apache.org/jira/browse/SPARK-6788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-6788: - Target Version/s: (was: 1.4.0) Model export/import for spark.ml: Tokenizer --- Key: SPARK-6788 URL: https://issues.apache.org/jira/browse/SPARK-6788 Project: Spark Issue Type: Sub-task Components: ML Affects Versions: 1.3.0 Reporter: Joseph K. Bradley -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6790) Model export/import for spark.ml: LinearRegression
[ https://issues.apache.org/jira/browse/SPARK-6790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-6790: - Target Version/s: (was: 1.4.0) Model export/import for spark.ml: LinearRegression -- Key: SPARK-6790 URL: https://issues.apache.org/jira/browse/SPARK-6790 Project: Spark Issue Type: Sub-task Components: ML Affects Versions: 1.3.0 Reporter: Joseph K. Bradley -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6789) Model export/import for spark.ml: ALS
[ https://issues.apache.org/jira/browse/SPARK-6789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-6789: - Target Version/s: (was: 1.4.0) Model export/import for spark.ml: ALS - Key: SPARK-6789 URL: https://issues.apache.org/jira/browse/SPARK-6789 Project: Spark Issue Type: Sub-task Components: ML Affects Versions: 1.3.0 Reporter: Joseph K. Bradley -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6791) Model export/import for spark.ml: meta-algorithms
[ https://issues.apache.org/jira/browse/SPARK-6791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-6791: - Target Version/s: (was: 1.4.0) Model export/import for spark.ml: meta-algorithms - Key: SPARK-6791 URL: https://issues.apache.org/jira/browse/SPARK-6791 Project: Spark Issue Type: Sub-task Components: ML Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Algorithms: Pipeline, CrossValidator (and associated models) This task will block on all other subtasks for [SPARK-6725]. This task will also include adding export/import as a required part of the PipelineStage interface since meta-algorithms will depend on sub-algorithms supporting save/load. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6725) Model export/import for Pipeline API
[ https://issues.apache.org/jira/browse/SPARK-6725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-6725: - Description: This is an umbrella JIRA for adding model export/import to the spark.ml API. This JIRA is for adding the internal Saveable/Loadable API and Parquet-based format, not for other formats like PMML. This will require the following steps: * Add export/import for all PipelineStages supported by spark.ml ** This will include some Transformers which are not Models. ** These can use almost the same format as the spark.mllib model save/load functions, but the model metadata must store a different class name (marking the class as a spark.ml class). * After all PipelineStages support save/load, add an interface which forces future additions to support save/load. *UPDATE*: In spark.ml, we could save feature metadata using DataFrames. Other libraries and formats can support this, and it would be great if we could too. We could do either of the following: * save() optionally takes a dataset (or schema), and load will return a (model, schema) pair. * Models themselves save the input schema. Both options would mean inheriting from new Saveable, Loadable types. was: This is an umbrella JIRA for adding model export/import to the spark.ml API. This JIRA is for adding the internal Saveable/Loadable API and Parquet-based format, not for other formats like PMML. This will require the following steps: * Add export/import for all PipelineStages supported by spark.ml ** This will include some Transformers which are not Models. ** These can use almost the same format as the spark.mllib model save/load functions, but the model metadata must store a different class name (marking the class as a spark.ml class). * After all PipelineStages support save/load, add an interface which forces future additions to support save/load. Model export/import for Pipeline API Key: SPARK-6725 URL: https://issues.apache.org/jira/browse/SPARK-6725 Project: Spark Issue Type: New Feature Components: ML Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Critical This is an umbrella JIRA for adding model export/import to the spark.ml API. This JIRA is for adding the internal Saveable/Loadable API and Parquet-based format, not for other formats like PMML. This will require the following steps: * Add export/import for all PipelineStages supported by spark.ml ** This will include some Transformers which are not Models. ** These can use almost the same format as the spark.mllib model save/load functions, but the model metadata must store a different class name (marking the class as a spark.ml class). * After all PipelineStages support save/load, add an interface which forces future additions to support save/load. *UPDATE*: In spark.ml, we could save feature metadata using DataFrames. Other libraries and formats can support this, and it would be great if we could too. We could do either of the following: * save() optionally takes a dataset (or schema), and load will return a (model, schema) pair. * Models themselves save the input schema. Both options would mean inheriting from new Saveable, Loadable types. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
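To make the two schema-saving options above concrete, here is a rough sketch of what the traits could look like; the names (MLSaveable, MLLoadable, the Option[StructType] parameter) are illustrative assumptions, not the API proposed in this JIRA:
{code}
import org.apache.spark.sql.types.StructType

// Option 1: save() optionally takes the input schema; load returns it back.
trait MLSaveable {
  def save(path: String, inputSchema: Option[StructType] = None): Unit
}

trait MLLoadable[T] {
  // Returns the loaded stage plus the input schema it was saved with, if any.
  def load(path: String): (T, Option[StructType])
}

// Option 2: the model itself carries the schema it was fitted on.
trait ModelWithSchema {
  def inputSchema: StructType
}
{code}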
[jira] [Updated] (SPARK-6725) Model export/import for Pipeline API
[ https://issues.apache.org/jira/browse/SPARK-6725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-6725: - Target Version/s: (was: 1.4.0) Model export/import for Pipeline API Key: SPARK-6725 URL: https://issues.apache.org/jira/browse/SPARK-6725 Project: Spark Issue Type: New Feature Components: ML Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Critical This is an umbrella JIRA for adding model export/import to the spark.ml API. This JIRA is for adding the internal Saveable/Loadable API and Parquet-based format, not for other formats like PMML. This will require the following steps: * Add export/import for all PipelineStages supported by spark.ml ** This will include some Transformers which are not Models. ** These can use almost the same format as the spark.mllib model save/load functions, but the model metadata must store a different class name (marking the class as a spark.ml class). * After all PipelineStages support save/load, add an interface which forces future additions to support save/load. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-7002) Persist on RDD fails the second time if the action is called on a child RDD without showing a FAILED message
[ https://issues.apache.org/jira/browse/SPARK-7002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14503730#comment-14503730 ] Tom Hubregtsen edited comment on SPARK-7002 at 4/20/15 9:46 PM: Your speculation was correct. After the above computation, I performed the following extra steps: I first tried to remove the data from rdd3 by unpersisting it {code} scala> rdd3.unpersist() scala> rdd3.collect() {code} --> This did not work; rdd2 was still not on the disk. I then looked in the file system and found shuffle data. I removed these files manually (shuffle_0_0_0.data and shuffle_0_0_0.index), after which I invoked the action on the child {code} scala> rdd3.collect() {code} --> This worked; rdd2 appeared on disk. Next to this, I also checked whether a different action that could not rely on these shuffle files would invoke computation of rdd2 (as per your suggestion; FYI, I performed these two experiments separately from each other so that they don't influence each other): {code} scala> val rdd4 = rdd2.reduceByKey( (x,y) => x*y) scala> rdd4.collect() {code} --> This worked too; rdd2 appeared on disk again. Conclusion: rdd2 was actually not recomputed, as rdd3 was using the shuffle data that was stored on disk. Action: Should we still do something about the message in .toDebugString? It currently mentions when data is persisted on either disk or memory, but does not mention that the shuffle data is saved. I do believe this is something you want to know: I called this method with the intention of learning where in my DAG data is actually present, and was led to believe data was not present, while in fact it was. was (Author: thubregtsen): Your speculation was correct. After the above computation, I performed the following extra steps: I first tried to remove the data from rdd3 by unpersisting it scala> rdd3.unpersist() scala> rdd3.collect() --> This did not work; rdd2 was still not on the disk. I then looked in the file system and found shuffle data. I removed these files manually (shuffle_0_0_0.data and shuffle_0_0_0.index), after which I invoked the action on the child scala> rdd3.collect() --> This worked; rdd2 appeared on disk. Next to this, I also checked whether a different action that could not rely on these shuffle files would invoke computation of rdd2 (as per your suggestion; FYI, I performed these two experiments separately from each other so that they don't influence each other): scala> val rdd4 = rdd2.reduceByKey( (x,y) => x*y) scala> rdd4.collect() --> This worked too; rdd2 appeared on disk again. Conclusion: rdd2 was actually not recomputed, as rdd3 was using the shuffle data that was stored on disk. Action: Should we still do something about the message in .toDebugString? It currently mentions when data is persisted on either disk or memory, but does not mention that the shuffle data is saved. I do believe this is something you want to know: I called this method with the intention of learning where in my DAG data is actually present, and was led to believe data was not present, while in fact it was.
Persist on RDD fails the second time if the action is called on a child RDD without showing a FAILED message Key: SPARK-7002 URL: https://issues.apache.org/jira/browse/SPARK-7002 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0 Environment: Platform: Power8 OS: Ubuntu 14.10 Java: java-8-openjdk-ppc64el Reporter: Tom Hubregtsen Priority: Minor Labels: disk, persist, unpersist The major issue is: Persist on RDD fails the second time if the action is called on a child RDD without showing a FAILED message. This is pointed out at 2). Next to this: toDebugString on a child RDD does not show that the parent RDD is [Disk Serialized 1x Replicated]. This is pointed out at 1). Note: I am persisting to disk (DISK_ONLY) to validate that the RDD is or is not physically stored, as I did not want to rely solely on a missing line in .toDebugString (see comments in trace) {code} scala> val rdd1 = sc.parallelize(List(1,2,3)) scala> val rdd2 = rdd1.map(x => (x,x+1)) scala> val rdd3 = rdd2.reduceByKey( (x,y) => x+y) scala> import org.apache.spark.storage.StorageLevel scala> rdd2.persist(StorageLevel.DISK_ONLY) scala> rdd3.collect() scala> rdd2.toDebugString res4: String = (100) MapPartitionsRDD[1] at map at <console>:23 [Disk Serialized 1x Replicated] | CachedPartitions: 100; MemorySize: 0.0 B; TachyonSize: 0.0 B; DiskSize: 802.0 B | ParallelCollectionRDD[0] at parallelize at <console>:21 [Disk Serialized 1x Replicated] scala> rdd3.toDebugString res5: String = (100) ShuffledRDD[2] at reduceByKey at <console>:25 [] +-(100)
[jira] [Commented] (SPARK-7002) Persist on RDD fails the second time if the action is called on a child RDD without showing a FAILED message
[ https://issues.apache.org/jira/browse/SPARK-7002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14503730#comment-14503730 ] Tom Hubregtsen commented on SPARK-7002: --- Your speculation was correct. After the above computation, I performed the following extra steps: I first tried to remove the data from rdd3 by unpersisting it scala> rdd3.unpersist() scala> rdd3.collect() --> This did not work; rdd2 was still not on the disk. I then looked in the file system and found shuffle data. I removed these files manually (shuffle_0_0_0.data and shuffle_0_0_0.index), after which I invoked the action on the child scala> rdd3.collect() --> This worked; rdd2 appeared on disk. Next to this, I also checked whether a different action that could not rely on these shuffle files would invoke computation of rdd2 (as per your suggestion; FYI, I performed these two experiments separately from each other so that they don't influence each other): scala> val rdd4 = rdd2.reduceByKey( (x,y) => x*y) scala> rdd4.collect() --> This worked too; rdd2 appeared on disk again. Conclusion: rdd2 was actually not recomputed, as rdd3 was using the shuffle data that was stored on disk. Action: Should we still do something about the message in .toDebugString? It currently mentions when data is persisted on either disk or memory, but does not mention that the shuffle data is saved. I do believe this is something you want to know: I called this method with the intention of learning where in my DAG data is actually present, and was led to believe data was not present, while in fact it was. Persist on RDD fails the second time if the action is called on a child RDD without showing a FAILED message Key: SPARK-7002 URL: https://issues.apache.org/jira/browse/SPARK-7002 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0 Environment: Platform: Power8 OS: Ubuntu 14.10 Java: java-8-openjdk-ppc64el Reporter: Tom Hubregtsen Priority: Minor Labels: disk, persist, unpersist The major issue is: Persist on RDD fails the second time if the action is called on a child RDD without showing a FAILED message. This is pointed out at 2). Next to this: toDebugString on a child RDD does not show that the parent RDD is [Disk Serialized 1x Replicated]. This is pointed out at 1). Note: I am persisting to disk (DISK_ONLY) to validate that the RDD is or is not physically stored, as I did not want to rely solely on a missing line in .toDebugString (see comments in trace) {code} scala> val rdd1 = sc.parallelize(List(1,2,3)) scala> val rdd2 = rdd1.map(x => (x,x+1)) scala> val rdd3 = rdd2.reduceByKey( (x,y) => x+y) scala> import org.apache.spark.storage.StorageLevel scala> rdd2.persist(StorageLevel.DISK_ONLY) scala> rdd3.collect() scala> rdd2.toDebugString res4: String = (100) MapPartitionsRDD[1] at map at <console>:23 [Disk Serialized 1x Replicated] | CachedPartitions: 100; MemorySize: 0.0 B; TachyonSize: 0.0 B; DiskSize: 802.0 B | ParallelCollectionRDD[0] at parallelize at <console>:21 [Disk Serialized 1x Replicated] scala> rdd3.toDebugString res5: String = (100) ShuffledRDD[2] at reduceByKey at <console>:25 [] +-(100) MapPartitionsRDD[1] at map at <console>:23 [] | CachedPartitions: 100; MemorySize: 0.0 B; TachyonSize: 0.0 B; DiskSize: 802.0 B | ParallelCollectionRDD[0] at parallelize at <console>:21 [] // 1) rdd3 does not show that the other RDDs are [Disk Serialized 1x Replicated], but the data is on disk. This is verified by // a) The line starting with CachedPartitions // b) a find in spark_local_dir: find . -name \* | grep rdd returns ./spark-b39bcf9b-e7d7-4284-bdd2-1be7ac3cacef/blockmgr-4f4c0b1c-b47a-4972-b364-7179ea6e0873/1f/rdd_4_*, where * are the partition numbers scala> rdd2.unpersist() scala> rdd2.toDebugString res8: String = (100) MapPartitionsRDD[1] at map at <console>:23 [] | ParallelCollectionRDD[0] at parallelize at <console>:21 [] scala> rdd3.toDebugString res9: String = (100) ShuffledRDD[2] at reduceByKey at <console>:25 [] +-(100) MapPartitionsRDD[1] at map at <console>:23 [] | ParallelCollectionRDD[0] at parallelize at <console>:21 [] // successfully unpersisted, also not visible on disk scala> rdd2.persist(StorageLevel.DISK_ONLY) scala> rdd3.collect() scala> rdd2.toDebugString res18: String = (100) MapPartitionsRDD[1] at map at <console>:23 [Disk Serialized 1x Replicated] | ParallelCollectionRDD[0] at parallelize at <console>:21 [Disk Serialized 1x Replicated] scala> rdd3.toDebugString res19: String = (100) ShuffledRDD[2] at reduceByKey at <console>:25 [] +-(100) MapPartitionsRDD[1] at map at <console>:23 [] |
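As an aside for readers reproducing the experiment above: persistence can also be checked programmatically instead of grepping spark_local_dir, using the existing SparkContext.getRDDStorageInfo API. A minimal sketch (the RDD below is illustrative, not the one from the issue):
{code}
import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 100).map(x => (x % 10, x))
rdd.persist(StorageLevel.DISK_ONLY)
rdd.count()  // an action on the RDD itself materializes the cached blocks

// One RDDInfo entry per persisted RDD, with in-memory and on-disk sizes.
sc.getRDDStorageInfo.foreach { info =>
  println(s"RDD ${info.id}: ${info.numCachedPartitions} cached partitions, " +
    s"${info.diskSize} bytes on disk")
}
{code}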
[jira] [Commented] (SPARK-6921) Spark SQL API saveAsParquetFile will output tachyon file with different block size
[ https://issues.apache.org/jira/browse/SPARK-6921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14503921#comment-14503921 ] Sebastian YEPES FERNANDEZ commented on SPARK-6921: -- I can also validate this with v1.3.1 Spark SQL API saveAsParquetFile will output tachyon file with different block size Key: SPARK-6921 URL: https://issues.apache.org/jira/browse/SPARK-6921 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: zhangxiongfei Priority: Blocker I ran the code below in the Spark shell to access parquet files in Tachyon. 1. First, created a DataFrame by loading a bunch of Parquet files in Tachyon: val ta3 = sqlContext.parquetFile("tachyon://tachyonserver:19998/apps/tachyon/zhangxf/parquetAdClick-6p-256m") 2. Second, set fs.local.block.size to 256M to make sure that the block size of the output files in Tachyon is 256M: sc.hadoopConfiguration.setLong("fs.local.block.size", 268435456) 3. Third, saved the above DataFrame into Parquet files stored in Tachyon: ta3.saveAsParquetFile("tachyon://tachyonserver:19998/apps/tachyon/zhangxf/parquetAdClick-6p-256m-test") After this code ran successfully, the output parquet files were stored in Tachyon, but those files have different block sizes. Below is the information for those files in the path tachyon://tachyonserver:19998/apps/tachyon/zhangxf/parquetAdClick-6p-256m-test (File Name / Size / Block Size / In-Memory / Pin / Creation Time):
_SUCCESS 0.00 B 256.00 MB 100% NO 04-13-2015 17:48:23:519
_common_metadata 1088.00 B 256.00 MB 100% NO 04-13-2015 17:48:23:741
_metadata 22.71 KB 256.00 MB 100% NO 04-13-2015 17:48:23:646
part-r-1.parquet 177.19 MB 32.00 MB 100% NO 04-13-2015 17:46:44:626
part-r-2.parquet 177.21 MB 32.00 MB 100% NO 04-13-2015 17:46:44:636
part-r-3.parquet 177.02 MB 32.00 MB 100% NO 04-13-2015 17:46:45:439
part-r-4.parquet 177.21 MB 32.00 MB 100% NO 04-13-2015 17:46:44:845
part-r-5.parquet 177.40 MB 32.00 MB 100% NO 04-13-2015 17:46:44:638
part-r-6.parquet 177.33 MB 32.00 MB 100% NO 04-13-2015 17:46:44:648
It seems that the API saveAsParquetFile does not distribute/broadcast the hadoopConfiguration to executors the way other APIs such as saveAsTextFile do. The configuration fs.local.block.size only takes effect on the driver. If I set that configuration before loading the parquet files, the problem is gone. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
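Based on the reporter's last observation, a workaround sketch: set the block size before the first Parquet read so the setting reaches the executors (the paths are the reporter's example URIs, not tested here):
{code}
// Set before loading any Parquet data, not after.
sc.hadoopConfiguration.setLong("fs.local.block.size", 268435456) // 256 MB

val ta3 = sqlContext.parquetFile("tachyon://tachyonserver:19998/apps/tachyon/zhangxf/parquetAdClick-6p-256m")
ta3.saveAsParquetFile("tachyon://tachyonserver:19998/apps/tachyon/zhangxf/parquetAdClick-6p-256m-test")
{code}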
[jira] [Commented] (SPARK-7022) PySpark is missing ParamGridBuilder
[ https://issues.apache.org/jira/browse/SPARK-7022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14503957#comment-14503957 ] Apache Spark commented on SPARK-7022: - User 'oefirouz' has created a pull request for this issue: https://github.com/apache/spark/pull/5601 PySpark is missing ParamGridBuilder --- Key: SPARK-7022 URL: https://issues.apache.org/jira/browse/SPARK-7022 Project: Spark Issue Type: Improvement Components: ML, MLlib, PySpark Affects Versions: 1.3.0 Reporter: Omede Firouz PySpark is missing the entirety of ML.Tuning (see: https://issues.apache.org/jira/browse/SPARK-6940) This is a subticket specifically to track the ParamGridBuilder. The CrossValidator will be dealt with in a followup. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
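For reference, the existing Scala API that the PySpark port would mirror (spark.ml in Spark 1.3; the estimator and parameter values below are only an example):
{code}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.tuning.ParamGridBuilder

val lr = new LogisticRegression()

// Builds one ParamMap per combination: 2 x 2 = 4 candidate settings.
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.01, 0.1))
  .addGrid(lr.maxIter, Array(10, 100))
  .build()
{code}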
[jira] [Assigned] (SPARK-7022) PySpark is missing ParamGridBuilder
[ https://issues.apache.org/jira/browse/SPARK-7022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7022: --- Assignee: Apache Spark PySpark is missing ParamGridBuilder --- Key: SPARK-7022 URL: https://issues.apache.org/jira/browse/SPARK-7022 Project: Spark Issue Type: Improvement Components: ML, MLlib, PySpark Affects Versions: 1.3.0 Reporter: Omede Firouz Assignee: Apache Spark PySpark is missing the entirety of ML.Tuning (see: https://issues.apache.org/jira/browse/SPARK-6940) This is a subticket specifically to track the ParamGridBuilder. The CrossValidator will be dealt with in a followup. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7022) PySpark is missing ParamGridBuilder
[ https://issues.apache.org/jira/browse/SPARK-7022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7022: --- Assignee: (was: Apache Spark) PySpark is missing ParamGridBuilder --- Key: SPARK-7022 URL: https://issues.apache.org/jira/browse/SPARK-7022 Project: Spark Issue Type: Improvement Components: ML, MLlib, PySpark Affects Versions: 1.3.0 Reporter: Omede Firouz PySpark is missing the entirety of ML.Tuning (see: https://issues.apache.org/jira/browse/SPARK-6940) This is a subticket specifically to track the ParamGridBuilder. The CrossValidator will be dealt with in a followup. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6635) DataFrame.withColumn can create columns with identical names
[ https://issues.apache.org/jira/browse/SPARK-6635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14504084#comment-14504084 ] Joseph K. Bradley commented on SPARK-6635: -- Just to clarify, does that mean {{withColumn}} does *not* replace columns, but {{withName}} does? (I'm not sure what {{withName}} is.) DataFrame.withColumn can create columns with identical names Key: SPARK-6635 URL: https://issues.apache.org/jira/browse/SPARK-6635 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Joseph K. Bradley DataFrame lets you create multiple columns with the same name, which causes problems when you try to refer to columns by name. Proposal: If a column is added to a DataFrame with a column of the same name, then the new column should replace the old column. {code} scala> val df = sc.parallelize(Array(1,2,3)).toDF("x") df: org.apache.spark.sql.DataFrame = [x: int] scala> val df3 = df.withColumn("x", df("x") + 1) df3: org.apache.spark.sql.DataFrame = [x: int, x: int] scala> df3.collect() res1: Array[org.apache.spark.sql.Row] = Array([1,2], [2,3], [3,4]) scala> df3("x") org.apache.spark.sql.AnalysisException: Reference 'x' is ambiguous, could be: x, x.; at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:216) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:121) at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:161) at org.apache.spark.sql.DataFrame.col(DataFrame.scala:436) at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:426) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:26) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:31) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:33) at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:35) at $iwC$$iwC$$iwC$$iwC.<init>(<console>:37) at $iwC$$iwC$$iwC.<init>(<console>:39) at $iwC$$iwC.<init>(<console>:41) at $iwC.<init>(<console>:43) at <init>(<console>:45) at .<init>(<console>:49) at .<clinit>(<console>) at .<init>(<console>:7) at .<clinit>(<console>) at $print(<console>) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819) at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:856) at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:901) at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:813) at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:656) at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:664) at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:669) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:996) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944) at
scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:944) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1058) at org.apache.spark.repl.Main$.main(Main.scala:31) at org.apache.spark.repl.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166) at
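Until the proposed replace-on-add behavior exists, one workaround sketch for the ambiguity shown above is to avoid withColumn for same-named columns and rebuild the projection with select and an alias:
{code}
val df = sc.parallelize(Array(1, 2, 3)).toDF("x")

// Alias the new expression back to "x" instead of adding a second "x" column.
val df4 = df.select((df("x") + 1).as("x"))
df4.collect()  // Array([2], [3], [4]); df4("x") is now unambiguous
{code}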
[jira] [Commented] (SPARK-7002) Persist on RDD fails the second time if the action is called on a child RDD without showing a FAILED message
[ https://issues.apache.org/jira/browse/SPARK-7002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14503761#comment-14503761 ] Sean Owen commented on SPARK-7002: -- The shuffle data is a sort of hidden, second type of caching that goes on. I don't know how much it's supposed to be exposed. My hunch is that if there's an easy API already to access this info, go ahead and propose adding it to the debug string, but if it's not otherwise easily accounted for, it may not be worth adding. It's good to know that there is a logic to what is happening, at least, rather than a bug. Persist on RDD fails the second time if the action is called on a child RDD without showing a FAILED message Key: SPARK-7002 URL: https://issues.apache.org/jira/browse/SPARK-7002 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0 Environment: Platform: Power8 OS: Ubuntu 14.10 Java: java-8-openjdk-ppc64el Reporter: Tom Hubregtsen Priority: Minor Labels: disk, persist, unpersist The major issue is: Persist on RDD fails the second time if the action is called on a child RDD without showing a FAILED message. This is pointed out at 2). Next to this: toDebugString on a child RDD does not show that the parent RDD is [Disk Serialized 1x Replicated]. This is pointed out at 1). Note: I am persisting to disk (DISK_ONLY) to validate that the RDD is or is not physically stored, as I did not want to rely solely on a missing line in .toDebugString (see comments in trace) {code} scala> val rdd1 = sc.parallelize(List(1,2,3)) scala> val rdd2 = rdd1.map(x => (x,x+1)) scala> val rdd3 = rdd2.reduceByKey( (x,y) => x+y) scala> import org.apache.spark.storage.StorageLevel scala> rdd2.persist(StorageLevel.DISK_ONLY) scala> rdd3.collect() scala> rdd2.toDebugString res4: String = (100) MapPartitionsRDD[1] at map at <console>:23 [Disk Serialized 1x Replicated] | CachedPartitions: 100; MemorySize: 0.0 B; TachyonSize: 0.0 B; DiskSize: 802.0 B | ParallelCollectionRDD[0] at parallelize at <console>:21 [Disk Serialized 1x Replicated] scala> rdd3.toDebugString res5: String = (100) ShuffledRDD[2] at reduceByKey at <console>:25 [] +-(100) MapPartitionsRDD[1] at map at <console>:23 [] | CachedPartitions: 100; MemorySize: 0.0 B; TachyonSize: 0.0 B; DiskSize: 802.0 B | ParallelCollectionRDD[0] at parallelize at <console>:21 [] // 1) rdd3 does not show that the other RDDs are [Disk Serialized 1x Replicated], but the data is on disk. This is verified by // a) The line starting with CachedPartitions // b) a find in spark_local_dir: find . -name \* | grep rdd returns ./spark-b39bcf9b-e7d7-4284-bdd2-1be7ac3cacef/blockmgr-4f4c0b1c-b47a-4972-b364-7179ea6e0873/1f/rdd_4_*, where * are the partition numbers scala> rdd2.unpersist() scala> rdd2.toDebugString res8: String = (100) MapPartitionsRDD[1] at map at <console>:23 [] | ParallelCollectionRDD[0] at parallelize at <console>:21 [] scala> rdd3.toDebugString res9: String = (100) ShuffledRDD[2] at reduceByKey at <console>:25 [] +-(100) MapPartitionsRDD[1] at map at <console>:23 [] | ParallelCollectionRDD[0] at parallelize at <console>:21 [] // successfully unpersisted, also not visible on disk scala> rdd2.persist(StorageLevel.DISK_ONLY) scala> rdd3.collect() scala> rdd2.toDebugString res18: String = (100) MapPartitionsRDD[1] at map at <console>:23 [Disk Serialized 1x Replicated] | ParallelCollectionRDD[0] at parallelize at <console>:21 [Disk Serialized 1x Replicated] scala> rdd3.toDebugString res19: String = (100) ShuffledRDD[2] at reduceByKey at <console>:25 [] +-(100) MapPartitionsRDD[1] at map at <console>:23 [] | ParallelCollectionRDD[0] at parallelize at <console>:21 [] // 2) The data is not visible on disk through the find command previously mentioned, and is also not mentioned in the toDebugString (no line starting with CachedPartitions, even though [Disk Serialized 1x Replicated] is mentioned). It does work when you call the action on the actual RDD: scala> rdd2.collect() scala> rdd2.toDebugString res21: String = (100) MapPartitionsRDD[1] at map at <console>:23 [Disk Serialized 1x Replicated] | CachedPartitions: 100; MemorySize: 0.0 B; TachyonSize: 0.0 B; DiskSize: 802.0 B | ParallelCollectionRDD[0] at parallelize at <console>:21 [Disk Serialized 1x Replicated] scala> rdd3.toDebugString res22: String = (100) ShuffledRDD[2] at reduceByKey at <console>:25 [] +-(100) MapPartitionsRDD[1] at map at <console>:23 [] | CachedPartitions: 100; MemorySize: 0.0 B; TachyonSize: 0.0 B; DiskSize: 802.0 B | ParallelCollectionRDD[0] at
[jira] [Commented] (SPARK-7002) Persist on RDD fails the second time if the action is called on a child RDD without showing a FAILED message
[ https://issues.apache.org/jira/browse/SPARK-7002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14503783#comment-14503783 ] Tom Hubregtsen commented on SPARK-7002: --- Great, thanks for your help :) I will be happy to propose this. What is the proper way to do this? Do I close this issue and start a new issue with type "New feature" or "Wish", in which I explain what I believe is missing from .toDebugString and why? Anything else I should add? Thanks, Tom Hubregtsen Persist on RDD fails the second time if the action is called on a child RDD without showing a FAILED message Key: SPARK-7002 URL: https://issues.apache.org/jira/browse/SPARK-7002 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0 Environment: Platform: Power8 OS: Ubuntu 14.10 Java: java-8-openjdk-ppc64el Reporter: Tom Hubregtsen Priority: Minor Labels: disk, persist, unpersist The major issue is: Persist on RDD fails the second time if the action is called on a child RDD without showing a FAILED message. This is pointed out at 2). Next to this: toDebugString on a child RDD does not show that the parent RDD is [Disk Serialized 1x Replicated]. This is pointed out at 1). Note: I am persisting to disk (DISK_ONLY) to validate that the RDD is or is not physically stored, as I did not want to rely solely on a missing line in .toDebugString (see comments in trace) {code} scala> val rdd1 = sc.parallelize(List(1,2,3)) scala> val rdd2 = rdd1.map(x => (x,x+1)) scala> val rdd3 = rdd2.reduceByKey( (x,y) => x+y) scala> import org.apache.spark.storage.StorageLevel scala> rdd2.persist(StorageLevel.DISK_ONLY) scala> rdd3.collect() scala> rdd2.toDebugString res4: String = (100) MapPartitionsRDD[1] at map at <console>:23 [Disk Serialized 1x Replicated] | CachedPartitions: 100; MemorySize: 0.0 B; TachyonSize: 0.0 B; DiskSize: 802.0 B | ParallelCollectionRDD[0] at parallelize at <console>:21 [Disk Serialized 1x Replicated] scala> rdd3.toDebugString res5: String = (100) ShuffledRDD[2] at reduceByKey at <console>:25 [] +-(100) MapPartitionsRDD[1] at map at <console>:23 [] | CachedPartitions: 100; MemorySize: 0.0 B; TachyonSize: 0.0 B; DiskSize: 802.0 B | ParallelCollectionRDD[0] at parallelize at <console>:21 [] // 1) rdd3 does not show that the other RDDs are [Disk Serialized 1x Replicated], but the data is on disk. This is verified by // a) The line starting with CachedPartitions // b) a find in spark_local_dir: find . -name \* | grep rdd returns ./spark-b39bcf9b-e7d7-4284-bdd2-1be7ac3cacef/blockmgr-4f4c0b1c-b47a-4972-b364-7179ea6e0873/1f/rdd_4_*, where * are the partition numbers scala> rdd2.unpersist() scala> rdd2.toDebugString res8: String = (100) MapPartitionsRDD[1] at map at <console>:23 [] | ParallelCollectionRDD[0] at parallelize at <console>:21 [] scala> rdd3.toDebugString res9: String = (100) ShuffledRDD[2] at reduceByKey at <console>:25 [] +-(100) MapPartitionsRDD[1] at map at <console>:23 [] | ParallelCollectionRDD[0] at parallelize at <console>:21 [] // successfully unpersisted, also not visible on disk scala> rdd2.persist(StorageLevel.DISK_ONLY) scala> rdd3.collect() scala> rdd2.toDebugString res18: String = (100) MapPartitionsRDD[1] at map at <console>:23 [Disk Serialized 1x Replicated] | ParallelCollectionRDD[0] at parallelize at <console>:21 [Disk Serialized 1x Replicated] scala> rdd3.toDebugString res19: String = (100) ShuffledRDD[2] at reduceByKey at <console>:25 [] +-(100) MapPartitionsRDD[1] at map at <console>:23 [] | ParallelCollectionRDD[0] at parallelize at <console>:21 [] // 2) The data is not visible on disk through the find command previously mentioned, and is also not mentioned in the toDebugString (no line starting with CachedPartitions, even though [Disk Serialized 1x Replicated] is mentioned). It does work when you call the action on the actual RDD: scala> rdd2.collect() scala> rdd2.toDebugString res21: String = (100) MapPartitionsRDD[1] at map at <console>:23 [Disk Serialized 1x Replicated] | CachedPartitions: 100; MemorySize: 0.0 B; TachyonSize: 0.0 B; DiskSize: 802.0 B | ParallelCollectionRDD[0] at parallelize at <console>:21 [Disk Serialized 1x Replicated] scala> rdd3.toDebugString res22: String = (100) ShuffledRDD[2] at reduceByKey at <console>:25 [] +-(100) MapPartitionsRDD[1] at map at <console>:23 [] | CachedPartitions: 100; MemorySize: 0.0 B; TachyonSize: 0.0 B; DiskSize: 802.0 B | ParallelCollectionRDD[0] at parallelize at <console>:21 [] // Data appears on disk again (using find command previously
[jira] [Created] (SPARK-7019) Build docs on doc changes
Brennon York created SPARK-7019: --- Summary: Build docs on doc changes Key: SPARK-7019 URL: https://issues.apache.org/jira/browse/SPARK-7019 Project: Spark Issue Type: New Feature Components: Build Reporter: Brennon York Currently when a pull request changes the {{docs/}} directory, the docs aren't actually built. When a PR is submitted the {{git}} history should be checked to see if any doc changes were made and, if so, properly build the docs and report any issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6635) DataFrame.withColumn can create columns with identical names
[ https://issues.apache.org/jira/browse/SPARK-6635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14504092#comment-14504092 ] Michael Armbrust commented on SPARK-6635: - Sorry, updated. I meant {{withColumn}}. DataFrame.withColumn can create columns with identical names Key: SPARK-6635 URL: https://issues.apache.org/jira/browse/SPARK-6635 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Joseph K. Bradley DataFrame lets you create multiple columns with the same name, which causes problems when you try to refer to columns by name. Proposal: If a column is added to a DataFrame with a column of the same name, then the new column should replace the old column. {code} scala> val df = sc.parallelize(Array(1,2,3)).toDF("x") df: org.apache.spark.sql.DataFrame = [x: int] scala> val df3 = df.withColumn("x", df("x") + 1) df3: org.apache.spark.sql.DataFrame = [x: int, x: int] scala> df3.collect() res1: Array[org.apache.spark.sql.Row] = Array([1,2], [2,3], [3,4]) scala> df3("x") org.apache.spark.sql.AnalysisException: Reference 'x' is ambiguous, could be: x, x.; at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:216) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:121) at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:161) at org.apache.spark.sql.DataFrame.col(DataFrame.scala:436) at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:426) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:26) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:31) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:33) at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:35) at $iwC$$iwC$$iwC$$iwC.<init>(<console>:37) at $iwC$$iwC$$iwC.<init>(<console>:39) at $iwC$$iwC.<init>(<console>:41) at $iwC.<init>(<console>:43) at <init>(<console>:45) at .<init>(<console>:49) at .<clinit>(<console>) at .<init>(<console>:7) at .<clinit>(<console>) at $print(<console>) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819) at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:856) at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:901) at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:813) at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:656) at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:664) at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:669) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:996) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944) at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) at
org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:944) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1058) at org.apache.spark.repl.Main$.main(Main.scala:31) at org.apache.spark.repl.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110) at
[jira] [Commented] (SPARK-7009) Build assembly JAR via ant to avoid zip64 problems
[ https://issues.apache.org/jira/browse/SPARK-7009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14503754#comment-14503754 ] Sean Owen commented on SPARK-7009: -- Or warnings, yes. These add to the case that updating to Java 7 would resolve gotchas that are currently merely documented or warned against. Build assembly JAR via ant to avoid zip64 problems -- Key: SPARK-7009 URL: https://issues.apache.org/jira/browse/SPARK-7009 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 1.3.0 Environment: Java 7+ Reporter: Steve Loughran Original Estimate: 2h Remaining Estimate: 2h SPARK-1911 shows the problem that JDK7+ is using zip64 to build large JARs, a format incompatible with Java and pyspark. Provided the total number of .class files + resources is < 64K, ant can be used to make the final JAR instead, perhaps by unzipping the maven-generated JAR and then rezipping it with zip64=never, before publishing the artifact via maven. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6726) Model export/import for spark.ml: LogisticRegression
[ https://issues.apache.org/jira/browse/SPARK-6726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-6726: - Target Version/s: (was: 1.4.0) Model export/import for spark.ml: LogisticRegression Key: SPARK-6726 URL: https://issues.apache.org/jira/browse/SPARK-6726 Project: Spark Issue Type: Sub-task Components: ML Affects Versions: 1.3.0 Reporter: Joseph K. Bradley -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6786) Model export/import for spark.ml: Normalizer
[ https://issues.apache.org/jira/browse/SPARK-6786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-6786: - Target Version/s: (was: 1.4.0) Model export/import for spark.ml: Normalizer Key: SPARK-6786 URL: https://issues.apache.org/jira/browse/SPARK-6786 Project: Spark Issue Type: Sub-task Components: ML Affects Versions: 1.3.0 Reporter: Joseph K. Bradley -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6787) Model export/import for spark.ml: StandardScaler
[ https://issues.apache.org/jira/browse/SPARK-6787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-6787: - Target Version/s: (was: 1.4.0) Model export/import for spark.ml: StandardScaler Key: SPARK-6787 URL: https://issues.apache.org/jira/browse/SPARK-6787 Project: Spark Issue Type: Sub-task Components: ML Affects Versions: 1.3.0 Reporter: Joseph K. Bradley -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6635) DataFrame.withColumn can create columns with identical names
[ https://issues.apache.org/jira/browse/SPARK-6635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14504030#comment-14504030 ] Michael Armbrust commented on SPARK-6635: - +1 to {{withName}} overwriting existing columns. DataFrame.withColumn can create columns with identical names Key: SPARK-6635 URL: https://issues.apache.org/jira/browse/SPARK-6635 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Joseph K. Bradley DataFrame lets you create multiple columns with the same name, which causes problems when you try to refer to columns by name. Proposal: If a column is added to a DataFrame with a column of the same name, then the new column should replace the old column. {code} scala> val df = sc.parallelize(Array(1,2,3)).toDF("x") df: org.apache.spark.sql.DataFrame = [x: int] scala> val df3 = df.withColumn("x", df("x") + 1) df3: org.apache.spark.sql.DataFrame = [x: int, x: int] scala> df3.collect() res1: Array[org.apache.spark.sql.Row] = Array([1,2], [2,3], [3,4]) scala> df3("x") org.apache.spark.sql.AnalysisException: Reference 'x' is ambiguous, could be: x, x.; at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:216) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:121) at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:161) at org.apache.spark.sql.DataFrame.col(DataFrame.scala:436) at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:426) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:26) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:31) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:33) at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:35) at $iwC$$iwC$$iwC$$iwC.<init>(<console>:37) at $iwC$$iwC$$iwC.<init>(<console>:39) at $iwC$$iwC.<init>(<console>:41) at $iwC.<init>(<console>:43) at <init>(<console>:45) at .<init>(<console>:49) at .<clinit>(<console>) at .<init>(<console>:7) at .<clinit>(<console>) at $print(<console>) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819) at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:856) at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:901) at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:813) at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:656) at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:664) at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:669) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:996) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944) at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) at
org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:944) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1058) at org.apache.spark.repl.Main$.main(Main.scala:31) at org.apache.spark.repl.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
[jira] [Commented] (SPARK-7008) An Implement of Factorization Machine (LibFM)
[ https://issues.apache.org/jira/browse/SPARK-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14504114#comment-14504114 ] zhengruifeng commented on SPARK-7008: - thanks for this information! An Implement of Factorization Machine (LibFM) - Key: SPARK-7008 URL: https://issues.apache.org/jira/browse/SPARK-7008 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.3.0, 1.3.1, 1.3.2 Reporter: zhengruifeng Labels: features, patch An implementation of Factorization Machines based on Scala and Spark MLlib. Factorization Machines are a class of machine learning models for multi-linear regression that are widely used for recommendation, and they have performed well in recent years' recommendation competitions. Ref: http://libfm.org/ http://doi.acm.org/10.1145/2168752.2168771 http://www.inf.uni-konstanz.de/~rendle/pdf/Rendle2010FM.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
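For readers unfamiliar with the model, a direct (non-optimized) sketch of the second-order FM prediction from Rendle's paper linked above; a real MLlib implementation would use the O(k·n) reformulation of the pairwise term and sparse vectors:
{code}
// y(x) = w0 + sum_i w_i * x_i + sum_{i<j} <v_i, v_j> * x_i * x_j
def fmPredict(x: Array[Double], w0: Double, w: Array[Double],
              v: Array[Array[Double]]): Double = {
  val linear = x.indices.map(i => w(i) * x(i)).sum
  var pairwise = 0.0
  for (i <- x.indices; j <- (i + 1) until x.length) {
    val dot = v(i).zip(v(j)).map { case (a, b) => a * b }.sum // <v_i, v_j>
    pairwise += dot * x(i) * x(j)
  }
  w0 + linear + pairwise
}
{code}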
[jira] [Commented] (SPARK-7009) Build assembly JAR via ant to avoid zip64 problems
[ https://issues.apache.org/jira/browse/SPARK-7009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14503752#comment-14503752 ] Steve Loughran commented on SPARK-7009: --- most of the others seemed fixed by documentation patches... Build assembly JAR via ant to avoid zip64 problems -- Key: SPARK-7009 URL: https://issues.apache.org/jira/browse/SPARK-7009 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 1.3.0 Environment: Java 7+ Reporter: Steve Loughran Original Estimate: 2h Remaining Estimate: 2h SPARK-1911 shows the problem that JDK7+ is using zip64 to build large JARs, a format incompatible with Java and pyspark. Provided the total number of .class files + resources is < 64K, ant can be used to make the final JAR instead, perhaps by unzipping the maven-generated JAR and then rezipping it with zip64=never, before publishing the artifact via maven. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7016) Refactor dev/run-tests(-jenkins) from Bash to Python
[ https://issues.apache.org/jira/browse/SPARK-7016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brennon York updated SPARK-7016: Summary: Refactor dev/run-tests(-jenkins) from Bash to Python (was: Refactor {{dev/run-tests(-jenkins)}} from Bash to Python) Refactor dev/run-tests(-jenkins) from Bash to Python Key: SPARK-7016 URL: https://issues.apache.org/jira/browse/SPARK-7016 Project: Spark Issue Type: Improvement Components: Build, Project Infra Reporter: Brennon York Currently the {dev/run-tests} and {dev/run-tests-jenkins} scripts are written in Bash and are becoming quite unwieldy to manage, both in their current state and for future contributions. This proposal is to refactor both scripts into Python to allow for better manageability by the community, easier capability to add features, and a simpler approach to calling / running the various test suites. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7016) Refactor {{dev/run-tests(-jenkins)}} from Bash to Python
[ https://issues.apache.org/jira/browse/SPARK-7016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brennon York updated SPARK-7016: Summary: Refactor {{dev/run-tests(-jenkins)}} from Bash to Python (was: Refactor {dev/run-tests(-jenkins)} from Bash to Python) Refactor {{dev/run-tests(-jenkins)}} from Bash to Python Key: SPARK-7016 URL: https://issues.apache.org/jira/browse/SPARK-7016 Project: Spark Issue Type: Improvement Components: Build, Project Infra Reporter: Brennon York Currently the {dev/run-tests} and {dev/run-tests-jenkins} scripts are written in Bash and are becoming quite unwieldy to manage, both in their current state and for future contributions. This proposal is to refactor both scripts into Python to allow for better manageability by the community, easier capability to add features, and a simpler approach to calling / running the various test suites. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7020) Restrict module testing based on commit contents
[ https://issues.apache.org/jira/browse/SPARK-7020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brennon York updated SPARK-7020: Description: Currently all builds trigger all tests. This does not need to happen and, to minimize the test window, the {{git}} commit contents should be checked to determine which modules were affected and, for each, only run those tests. Restrict module testing based on commit contents Key: SPARK-7020 URL: https://issues.apache.org/jira/browse/SPARK-7020 Project: Spark Issue Type: Improvement Components: Build, Project Infra Reporter: Brennon York Currently all builds trigger all tests. This does not need to happen and, to minimize the test window, the {{git}} commit contents should be checked to determine which modules were affected and, for each, only run those tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6917) Broken data returned to PySpark dataframe if any large numbers used in Scala land
[ https://issues.apache.org/jira/browse/SPARK-6917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14503915#comment-14503915 ] Harry Brundage commented on SPARK-6917: --- [~davies] or [~joshrosen] any idea why this might be happening? I can dig in if you give me some pointers but I don't really know where to start! Broken data returned to PySpark dataframe if any large numbers used in Scala land - Key: SPARK-6917 URL: https://issues.apache.org/jira/browse/SPARK-6917 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.3.0 Environment: Spark 1.3, Python 2.7.6, Scala 2.10 Reporter: Harry Brundage Attachments: part-r-1.parquet When trying to access data stored in a Parquet file with an INT96 column (read: TimestampType() encoded for Impala), if the INT96 column is included in the fetched data, other, smaller numeric types come back broken. {code} In [1]: sql.parquetFile("/Users/hornairs/Downloads/part-r-1.parquet").select('int_col', 'long_col').first() Out[1]: Row(int_col=Decimal('1'), long_col=Decimal('10')) In [2]: sql.parquetFile("/Users/hornairs/Downloads/part-r-1.parquet").first() Out[2]: Row(long_col={u'__class__': u'scala.runtime.BoxedUnit'}, str_col=u'Hello!', int_col={u'__class__': u'scala.runtime.BoxedUnit'}, date_col=datetime.datetime(1, 12, 31, 19, 0, tzinfo=<DstTzInfo 'America/Toronto' EDT-1 day, 19:00:00 DST>)) {code} Note the {u'__class__': u'scala.runtime.BoxedUnit'} values being returned for the {{int_col}} and {{long_col}} columns in the second call above. This only happens if I select the {{date_col}}, which is stored as {{INT96}}. I don't know much about Scala boxing, but I assume that somehow, by including numeric columns that are bigger than a machine word, I trigger some different, slower execution path somewhere that boxes stuff and causes this problem. If anyone could give me any pointers on where to get started fixing this I'd be happy to dive in! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5995) Make ML Prediction Developer APIs public
[ https://issues.apache.org/jira/browse/SPARK-5995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-5995: - Description: Previously, some Developer APIs were added to spark.ml for classification and regression to make it easier to add new algorithms and models: [SPARK-4789] There are ongoing discussions about the best design of the API. This JIRA is to continue that discussion and try to finalize those Developer APIs so that they can be made public. Please see [this design doc from SPARK-4789 | https://docs.google.com/document/d/1BH9el33kBX8JiDdgUJXdLW14CA2qhTCWIG46eXZVoJs] for details on the original API design. Some issues under debate: * Should there be strongly typed APIs for fit()? ** Proposal: No * Should the strongly typed API for transform() be public (vs. protected)? ** Proposal: Protected for now * What transformation methods should the API make developers implement for classification? ** Proposal: See design doc * Should there be a way to transform a single Row (instead of only DataFrames)? ** Proposal: Not for now was: Previously, some Developer APIs were added to spark.ml for classification and regression to make it easier to add new algorithms and models: [SPARK-4789] There are ongoing discussions about the best design of the API. This JIRA is to continue that discussion and try to finalize those Developer APIs so that they can be made public. Please see [this design doc from SPARK-4789 | https://docs.google.com/document/d/1BH9el33kBX8JiDdgUJXdLW14CA2qhTCWIG46eXZVoJs] for details on the original API design. Some issues under debate: * Should there be strongly typed APIs for fit()? * Should the strongly typed API for transform() be public (vs. protected)? * What transformation methods should the API make developers implement for classification? (See details below.) * Should there be a way to transform a single Row (instead of only DataFrames)? More on What transformation methods should the API make developers implement for classification?: * Goals: ** Optimize transform: Make it fast, and make it output only the desired columns. ** Easy development ** Support Classifier, Regressor, and ProbabilisticClassifier * (currently) Developers implement predictX methods for each output column X. They may override transform() to optimize speed. ** Pros: predictX is easy to understand. ** Cons: An optimized transform() is annoying to write. * Developers implement more basic transformation methods, such as features2raw, raw2pred, raw2prob. ** Pros: Abstract classes may implement optimized transform(). ** Cons: Different types of predictors require different methods: *** Predictor and Regressor: features2pred *** Classifier: features2raw, raw2pred *** ProbabilisticClassifier: raw2prob * Developers implement a single predict() method which takes parameters for what columns to output (returning tuple or some type with None for missing values). Abstract classes take the outputs they want and put them into columns. ** Pros: Developers only write 1 method and can optimize it as much as they want. It could be more optimized than the previous 2 options; e.g., if LogisticRegressionModel only wants the prediction, then it never has to construct intermediate results such as the vector of raw predictions. ** Cons: predict() will have a different signature for different abstractions, based on the possible output columns. 
Make ML Prediction Developer APIs public Key: SPARK-5995 URL: https://issues.apache.org/jira/browse/SPARK-5995 Project: Spark Issue Type: Sub-task Components: ML Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Previously, some Developer APIs were added to spark.ml for classification and regression to make it easier to add new algorithms and models: [SPARK-4789] There are ongoing discussions about the best design of the API. This JIRA is to continue that discussion and try to finalize those Developer APIs so that they can be made public. Please see [this design doc from SPARK-4789 | https://docs.google.com/document/d/1BH9el33kBX8JiDdgUJXdLW14CA2qhTCWIG46eXZVoJs] for details on the original API design. Some issues under debate: * Should there be strongly typed APIs for fit()? ** Proposal: No * Should the strongly typed API for transform() be public (vs. protected)? ** Proposal: Protected for now * What transformation methods should the API make developers implement for classification? ** Proposal: See design doc * Should there be a way to transform a single Row (instead of only DataFrames)? ** Proposal: Not for now -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
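For concreteness, a minimal sketch of the "basic transformation methods" option discussed above (simplified signatures assumed for illustration; this is not the actual spark.ml developer API): developers implement small per-stage methods, and the abstraction composes them, so an optimized transform() can be written once in the abstract class.
{code}
import org.apache.spark.mllib.linalg.Vector

trait ClassifierModelSketch {
  // Developer implements the primitive transformations...
  protected def features2raw(features: Vector): Vector  // raw score per class
  protected def raw2pred(raw: Vector): Double           // raw scores -> predicted label

  // ...and the abstraction derives the full prediction from them.
  def predict(features: Vector): Double = raw2pred(features2raw(features))
}

trait ProbabilisticClassifierModelSketch extends ClassifierModelSketch {
  protected def raw2prob(raw: Vector): Vector           // raw scores -> class probabilities
}
{code}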
[jira] [Issue Comment Deleted] (SPARK-3530) Pipeline and Parameters
[ https://issues.apache.org/jira/browse/SPARK-3530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fan Jiang updated SPARK-3530: - Comment: was deleted (was: Hi Xiangrui, Which part of this pipeline project would you like us to work on? Thanks! ) Pipeline and Parameters --- Key: SPARK-3530 URL: https://issues.apache.org/jira/browse/SPARK-3530 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Critical Fix For: 1.2.0 This part of the design doc is for pipelines and parameters. I put the design doc at https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit?usp=sharing I will copy the proposed interfaces to this JIRA later. Some sample code can be viewed at: https://github.com/mengxr/spark-ml/ Please help review the design and post your comments here. Thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7017) Refactor dev/run-tests into Python
Brennon York created SPARK-7017: --- Summary: Refactor dev/run-tests into Python Key: SPARK-7017 URL: https://issues.apache.org/jira/browse/SPARK-7017 Project: Spark Issue Type: Sub-task Reporter: Brennon York This issue is to specifically track the progress of refactoring the {{dev/run-tests}} script into Python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7018) Refactor dev/run-tests-jenkins into Python
Brennon York created SPARK-7018: --- Summary: Refactor dev/run-tests-jenkins into Python Key: SPARK-7018 URL: https://issues.apache.org/jira/browse/SPARK-7018 Project: Spark Issue Type: Sub-task Components: Build, Project Infra Reporter: Brennon York This issue is to specifically track the progress of refactoring the {{dev/run-tests-jenkins}} script into Python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7022) PySpark is missing ParamGridBuilder
Omede Firouz created SPARK-7022: --- Summary: PySpark is missing ParamGridBuilder Key: SPARK-7022 URL: https://issues.apache.org/jira/browse/SPARK-7022 Project: Spark Issue Type: Improvement Components: ML, MLlib, PySpark Affects Versions: 1.3.0 Reporter: Omede Firouz PySpark is missing the entirety of ML.Tuning (see: vhttps://issues.apache.org/jira/browse/SPARK-6940) This is a subticket specifically to track the ParamGridBuilder. The CrossValidator will be dealt with in a followup. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
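For reference, a usage sketch of the existing Scala ParamGridBuilder that the PySpark version would mirror (the estimator and the parameter values here are illustrative, not taken from the ticket):
{code}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.tuning.ParamGridBuilder

val lr = new LogisticRegression()

// Cartesian product of the supplied values: 2 x 2 = 4 ParamMaps to search over.
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .addGrid(lr.maxIter, Array(10, 100))
  .build()
{code}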
[jira] [Updated] (SPARK-7022) PySpark is missing ParamGridBuilder
[ https://issues.apache.org/jira/browse/SPARK-7022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omede Firouz updated SPARK-7022: Description: PySpark is missing the entirety of ML.Tuning (see: https://issues.apache.org/jira/browse/SPARK-6940) This is a subticket specifically to track the ParamGridBuilder. The CrossValidator will be dealt with in a followup. was: PySpark is missing the entirety of ML.Tuning (see: vhttps://issues.apache.org/jira/browse/SPARK-6940) This is a subticket specifically to track the ParamGridBuilder. The CrossValidator will be dealt with in a followup. PySpark is missing ParamGridBuilder --- Key: SPARK-7022 URL: https://issues.apache.org/jira/browse/SPARK-7022 Project: Spark Issue Type: Improvement Components: ML, MLlib, PySpark Affects Versions: 1.3.0 Reporter: Omede Firouz PySpark is missing the entirety of ML.Tuning (see: https://issues.apache.org/jira/browse/SPARK-6940) This is a subticket specifically to track the ParamGridBuilder. The CrossValidator will be dealt with in a followup. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6954) Dynamic allocation: numExecutorsPending in ExecutorAllocationManager should never become negative
[ https://issues.apache.org/jira/browse/SPARK-6954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-6954: - Priority: Major (was: Minor) Dynamic allocation: numExecutorsPending in ExecutorAllocationManager should never become negative - Key: SPARK-6954 URL: https://issues.apache.org/jira/browse/SPARK-6954 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.3.1 Reporter: Cheolsoo Park Assignee: Cheolsoo Park Labels: yarn I have a simple test case for dynamic allocation on YARN that fails with the following stack trace- {code} 15/04/16 00:52:14 ERROR Utils: Uncaught exception in thread spark-dynamic-executor-allocation-0 java.lang.IllegalArgumentException: Attempted to request a negative number of executor(s) -21 from the cluster manager. Please specify a positive number! at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:338) at org.apache.spark.SparkContext.requestTotalExecutors(SparkContext.scala:1137) at org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:294) at org.apache.spark.ExecutorAllocationManager.addOrCancelExecutorRequests(ExecutorAllocationManager.scala:263) at org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:230) at org.apache.spark.ExecutorAllocationManager$$anon$1$$anonfun$run$1.apply$mcV$sp(ExecutorAllocationManager.scala:189) at org.apache.spark.ExecutorAllocationManager$$anon$1$$anonfun$run$1.apply(ExecutorAllocationManager.scala:189) at org.apache.spark.ExecutorAllocationManager$$anon$1$$anonfun$run$1.apply(ExecutorAllocationManager.scala:189) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1618) at org.apache.spark.ExecutorAllocationManager$$anon$1.run(ExecutorAllocationManager.scala:189) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code} My test is as follows- # Start spark-shell with a single executor. # Run a {{select count(\*)}} query. The number of executors rises as input size is non-trivial. # After the job finishes, the number of executors falls as most of them become idle. # Rerun the same query again, and the request to add executors fails with the above error. In fact, the job itself continues to run with whatever executors it already has, but it never gets more executors unless the shell is closed and restarted. Note that this error only happens when I configure {{executorIdleTimeout}} to be very small. E.g., I can reproduce it with the following configs- {code} spark.dynamicAllocation.executorIdleTimeout 5 spark.dynamicAllocation.schedulerBacklogTimeout 5 {code} Although I can simply increase {{executorIdleTimeout}} to something like 60 secs to avoid the error, I think this is still a bug to be fixed. The root cause seems to be that {{numExecutorsPending}} accidentally becomes negative if executors are killed too aggressively (i.e. {{executorIdleTimeout}} is too small) because under that circumstance, the new target # of executors can be smaller than the current # of executors. When that happens, {{ExecutorAllocationManager}} ends up trying to add a negative number of executors, which throws an exception. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
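A minimal sketch of the arithmetic involved (the names are assumed for illustration; this is not the actual ExecutorAllocationManager code): when executors are reclaimed faster than pending requests are fulfilled, the raw delta can go negative, and clamping it at zero would avoid the IllegalArgumentException above.
{code}
// target = pending + existing - pendingToRemove; aggressive idle kills can push
// the desired delta below zero, so never request a negative number of executors.
def executorsToRequest(
    numExecutorsPending: Int,
    numExistingExecutors: Int,
    numPendingToRemove: Int,
    desiredTotal: Int): Int = {
  val currentTarget = numExecutorsPending + numExistingExecutors - numPendingToRemove
  math.max(0, desiredTotal - currentTarget)
}
{code}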
[jira] [Created] (SPARK-7021) JUnit output for Python tests
Brennon York created SPARK-7021: --- Summary: JUnit output for Python tests Key: SPARK-7021 URL: https://issues.apache.org/jira/browse/SPARK-7021 Project: Spark Issue Type: Improvement Components: Build, Project Infra Reporter: Brennon York Priority: Minor Currently Python returns its test output in its own format. It would be preferable if the Python test runner could output its test results in JUnit format to better match the rest of the Jenkins test output. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7016) Refactor dev/run-tests(-jenkins) from Bash to Python
[ https://issues.apache.org/jira/browse/SPARK-7016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brennon York updated SPARK-7016: Description: Currently the {{dev/run-tests}} and {{dev/run-tests-jenkins}} scripts are written in Bash and are becoming quite unwieldy to manage, both in their current state and for future contributions. This proposal is to refactor both scripts into Python to allow for better maintainability by the community, make it easier to add features, and provide a simpler approach to calling / running the various test suites. was: Currently the {dev/run-tests} and {dev/run-tests-jenkins} scripts are written in Bash and are becoming quite unwieldy to manage, both in their current state and for future contributions. This proposal is to refactor both scripts into Python to allow for better maintainability by the community, make it easier to add features, and provide a simpler approach to calling / running the various test suites. Refactor dev/run-tests(-jenkins) from Bash to Python Key: SPARK-7016 URL: https://issues.apache.org/jira/browse/SPARK-7016 Project: Spark Issue Type: Improvement Components: Build, Project Infra Reporter: Brennon York Currently the {{dev/run-tests}} and {{dev/run-tests-jenkins}} scripts are written in Bash and are becoming quite unwieldy to manage, both in their current state and for future contributions. This proposal is to refactor both scripts into Python to allow for better maintainability by the community, make it easier to add features, and provide a simpler approach to calling / running the various test suites. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7016) Refactor {dev/run-tests(-jenkins)} from Bash to Python
Brennon York created SPARK-7016: --- Summary: Refactor {dev/run-tests(-jenkins)} from Bash to Python Key: SPARK-7016 URL: https://issues.apache.org/jira/browse/SPARK-7016 Project: Spark Issue Type: Improvement Components: Build, Project Infra Reporter: Brennon York Currently the {dev/run-tests} and {dev/run-tests-jenkins} scripts are written in Bash and are becoming quite unwieldy to manage, both in their current state and for future contributions. This proposal is to refactor both scripts into Python to allow for better maintainability by the community, make it easier to add features, and provide a simpler approach to calling / running the various test suites. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6635) DataFrame.withColumn can create columns with identical names
[ https://issues.apache.org/jira/browse/SPARK-6635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14504030#comment-14504030 ] Michael Armbrust edited comment on SPARK-6635 at 4/21/15 1:07 AM: -- +1 to {{withColumn}} overwriting existing columns. was (Author: marmbrus): +1 to {{withName}} overwriting existing columns. DataFrame.withColumn can create columns with identical names Key: SPARK-6635 URL: https://issues.apache.org/jira/browse/SPARK-6635 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Joseph K. Bradley DataFrame lets you create multiple columns with the same name, which causes problems when you try to refer to columns by name. Proposal: If a column is added to a DataFrame with a column of the same name, then the new column should replace the old column. {code} scala> val df = sc.parallelize(Array(1,2,3)).toDF("x") df: org.apache.spark.sql.DataFrame = [x: int] scala> val df3 = df.withColumn("x", df("x") + 1) df3: org.apache.spark.sql.DataFrame = [x: int, x: int] scala> df3.collect() res1: Array[org.apache.spark.sql.Row] = Array([1,2], [2,3], [3,4]) scala> df3("x") org.apache.spark.sql.AnalysisException: Reference 'x' is ambiguous, could be: x, x.; at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:216) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:121) at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:161) at org.apache.spark.sql.DataFrame.col(DataFrame.scala:436) at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:426) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:26) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:31) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:33) at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:35) at $iwC$$iwC$$iwC$$iwC.<init>(<console>:37) at $iwC$$iwC$$iwC.<init>(<console>:39) at $iwC$$iwC.<init>(<console>:41) at $iwC.<init>(<console>:43) at <init>(<console>:45) at .<init>(<console>:49) at .<clinit>(<console>) at .<init>(<console>:7) at .<clinit>(<console>) at $print(<console>) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819) at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:856) at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:901) at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:813) at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:656) at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:664) at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:669) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:996) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944) at 
scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:944) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1058) at org.apache.spark.repl.Main$.main(Main.scala:31) at org.apache.spark.repl.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166) at
[jira] [Commented] (SPARK-5995) Make ML Prediction Developer APIs public
[ https://issues.apache.org/jira/browse/SPARK-5995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14504090#comment-14504090 ] Joseph K. Bradley commented on SPARK-5995: -- I just updated the design doc linked above with a new section Post-Part 1 Assessment detailing a few issues. Make ML Prediction Developer APIs public Key: SPARK-5995 URL: https://issues.apache.org/jira/browse/SPARK-5995 Project: Spark Issue Type: Sub-task Components: ML Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Previously, some Developer APIs were added to spark.ml for classification and regression to make it easier to add new algorithms and models: [SPARK-4789] There are ongoing discussions about the best design of the API. This JIRA is to continue that discussion and try to finalize those Developer APIs so that they can be made public. Please see [this design doc from SPARK-4789 | https://docs.google.com/document/d/1BH9el33kBX8JiDdgUJXdLW14CA2qhTCWIG46eXZVoJs] for details on the original API design. Some issues under debate: * Should there be strongly typed APIs for fit()? * Should the strongly typed API for transform() be public (vs. protected)? * What transformation methods should the API make developers implement for classification? (See details below.) * Should there be a way to transform a single Row (instead of only DataFrames)? More on What transformation methods should the API make developers implement for classification?: * Goals: ** Optimize transform: Make it fast, and make it output only the desired columns. ** Easy development ** Support Classifier, Regressor, and ProbabilisticClassifier * (currently) Developers implement predictX methods for each output column X. They may override transform() to optimize speed. ** Pros: predictX is easy to understand. ** Cons: An optimized transform() is annoying to write. * Developers implement more basic transformation methods, such as features2raw, raw2pred, raw2prob. ** Pros: Abstract classes may implement optimized transform(). ** Cons: Different types of predictors require different methods: *** Predictor and Regressor: features2pred *** Classifier: features2raw, raw2pred *** ProbabilisticClassifier: raw2prob * Developers implement a single predict() method which takes parameters for what columns to output (returning tuple or some type with None for missing values). Abstract classes take the outputs they want and put them into columns. ** Pros: Developers only write 1 method and can optimize it as much as they want. It could be more optimized than the previous 2 options; e.g., if LogisticRegressionModel only wants the prediction, then it never has to construct intermediate results such as the vector of raw predictions. ** Cons: predict() will have a different signature for different abstractions, based on the possible output columns. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7025) Create a Java-friendly input source API
Reynold Xin created SPARK-7025: -- Summary: Create a Java-friendly input source API Key: SPARK-7025 URL: https://issues.apache.org/jira/browse/SPARK-7025 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Reynold Xin Assignee: Reynold Xin The goal of this ticket is to create a simple input source API that we can maintain and support long term. Spark currently has two de facto input source API: 1. RDD API 2. Hadoop MapReduce InputFormat API Neither of the above is ideal: 1. RDD: It is hard for Java developers to implement RDD, given the implicit class tags. In addition, the RDD API depends on Scala's runtime library, which does not preserve binary compatibility across Scala versions. If a developer chooses Java to implement an input source, it would be great if that input source can be binary compatible in years to come. 2. Hadoop InputFormat: The Hadoop InputFormat API is overly restrictive. For example, it forces key-value semantics, and does not support running arbitrary code on the driver side (an example of why this is useful is broadcast). In addition, it is somewhat awkward to tell developers that in order to implement an input source for Spark, they should learn the Hadoop MapReduce API first. So here's the proposal: An InputSource is described by: * an array of InputPartition that specifies the data partitioning * a RecordReader that specifies how data on each partition can be read This interface would be similar to Hadoop's InputFormat, except that there is no explicit key/value separation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
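A sketch of the proposed shape (the type names are taken from the ticket, but the signatures are assumed for illustration; this is not an existing Spark API): partitioning and record reading are separated, and no key/value split is imposed.
{code}
// A partition descriptor: created on the driver, shipped to executors.
trait InputPartition extends java.io.Serializable

// Reads the records of one partition; records are plain values, not key/value pairs.
trait RecordReader[T] extends java.io.Closeable {
  def initialize(partition: InputPartition): Unit
  def hasNext: Boolean
  def next(): T
}

// The input source itself: the data partitioning plus a reader per partition.
trait InputSource[T] extends java.io.Serializable {
  def getPartitions(): Array[InputPartition]
  def createRecordReader(partition: InputPartition): RecordReader[T]
}
{code}
Traits containing only abstract methods compile to plain Java interfaces, which keeps the API implementable from Java without Scala class tags or runtime-library dependencies.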
[jira] [Commented] (SPARK-6529) Word2Vec transformer
[ https://issues.apache.org/jira/browse/SPARK-6529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14504180#comment-14504180 ] Joseph K. Bradley commented on SPARK-6529: -- [~yinxusen] brings up a good point (in the PR) that Word2Vec and Word2VecModel take input columns of different types. This is a problem with current Estimator-Model approaches since they always share the same {{inputCol}} param. Thinking about this, I believe the Estimator and Model {{inputCol}} params must be different. In a Pipeline, we need to be able to specify both input columns before fitting, and we will not always have the chance to reset the input column before testing. CC: [~mengxr] since you'll be interested Word2Vec transformer Key: SPARK-6529 URL: https://issues.apache.org/jira/browse/SPARK-6529 Project: Spark Issue Type: Sub-task Components: ML Reporter: Xusen Yin Assignee: Xusen Yin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7022) PySpark is missing ParamGridBuilder
[ https://issues.apache.org/jira/browse/SPARK-7022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-7022: - Assignee: Omede Firouz PySpark is missing ParamGridBuilder --- Key: SPARK-7022 URL: https://issues.apache.org/jira/browse/SPARK-7022 Project: Spark Issue Type: Improvement Components: ML, MLlib, PySpark Affects Versions: 1.3.0 Reporter: Omede Firouz Assignee: Omede Firouz PySpark is missing the entirety of ML.Tuning (see: https://issues.apache.org/jira/browse/SPARK-6940) This is a subticket specifically to track the ParamGridBuilder. The CrossValidator will be dealt with in a followup. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7022) PySpark is missing ParamGridBuilder
[ https://issues.apache.org/jira/browse/SPARK-7022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-7022: - Target Version/s: 1.4.0 PySpark is missing ParamGridBuilder --- Key: SPARK-7022 URL: https://issues.apache.org/jira/browse/SPARK-7022 Project: Spark Issue Type: Improvement Components: ML, MLlib, PySpark Affects Versions: 1.3.0 Reporter: Omede Firouz Assignee: Omede Firouz PySpark is missing the entirety of ML.Tuning (see: https://issues.apache.org/jira/browse/SPARK-6940) This is a subticket specifically to track the ParamGridBuilder. The CrossValidator will be dealt with in a followup. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4521) Parquet fails to read columns with spaces in the name
[ https://issues.apache.org/jira/browse/SPARK-4521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-4521. --- Resolution: Done This ticket is covered by SPARK-6607. Parquet fails to read columns with spaces in the name - Key: SPARK-4521 URL: https://issues.apache.org/jira/browse/SPARK-4521 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Michael Armbrust I think this is actually a bug in parquet, but it would be good to track it here as well. To reproduce: {code} jsonRDD(sparkContext.parallelize("""{"number of clusters": 1}""" :: Nil)).saveAsParquetFile("test") parquetFile("test").collect() {code} {code} org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 8.0 failed 1 times, most recent failure: Lost task 0.0 in stage 8.0 (TID 13, localhost): java.lang.IllegalArgumentException: field ended by ';': expected ';' but got 'of' at line 1: optional int32 number of at parquet.schema.MessageTypeParser.check(MessageTypeParser.java:209) at parquet.schema.MessageTypeParser.addPrimitiveType(MessageTypeParser.java:182) at parquet.schema.MessageTypeParser.addType(MessageTypeParser.java:108) at parquet.schema.MessageTypeParser.addGroupTypeFields(MessageTypeParser.java:96) at parquet.schema.MessageTypeParser.parse(MessageTypeParser.java:89) at parquet.schema.MessageTypeParser.parseMessageType(MessageTypeParser.java:79) at parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:189) at parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:138) at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:135) at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:107) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6635) DataFrame.withColumn can create columns with identical names
[ https://issues.apache.org/jira/browse/SPARK-6635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-6635. - Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5541 [https://github.com/apache/spark/pull/5541] DataFrame.withColumn can create columns with identical names Key: SPARK-6635 URL: https://issues.apache.org/jira/browse/SPARK-6635 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Fix For: 1.4.0 DataFrame lets you create multiple columns with the same name, which causes problems when you try to refer to columns by name. Proposal: If a column is added to a DataFrame with a column of the same name, then the new column should replace the old column. {code} scala> val df = sc.parallelize(Array(1,2,3)).toDF("x") df: org.apache.spark.sql.DataFrame = [x: int] scala> val df3 = df.withColumn("x", df("x") + 1) df3: org.apache.spark.sql.DataFrame = [x: int, x: int] scala> df3.collect() res1: Array[org.apache.spark.sql.Row] = Array([1,2], [2,3], [3,4]) scala> df3("x") org.apache.spark.sql.AnalysisException: Reference 'x' is ambiguous, could be: x, x.; at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:216) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:121) at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:161) at org.apache.spark.sql.DataFrame.col(DataFrame.scala:436) at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:426) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:26) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:31) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:33) at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:35) at $iwC$$iwC$$iwC$$iwC.<init>(<console>:37) at $iwC$$iwC$$iwC.<init>(<console>:39) at $iwC$$iwC.<init>(<console>:41) at $iwC.<init>(<console>:43) at <init>(<console>:45) at .<init>(<console>:49) at .<clinit>(<console>) at .<init>(<console>:7) at .<clinit>(<console>) at $print(<console>) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819) at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:856) at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:901) at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:813) at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:656) at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:664) at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:669) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:996) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944) at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) at 
org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:944) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1058) at org.apache.spark.repl.Main$.main(Main.scala:31) at org.apache.spark.repl.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189) at
[jira] [Updated] (SPARK-6738) EstimateSize is difference with spill file size
[ https://issues.apache.org/jira/browse/SPARK-6738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Shen updated SPARK-6738: - Description: ExternalAppendOnlyMap spills 2.2 GB of data to disk: {code} 15/04/07 20:27:37 INFO collection.ExternalAppendOnlyMap: Thread 54 spilling in-memory map of 2.2 GB to disk (61 times so far) 15/04/07 20:27:37 INFO collection.ExternalAppendOnlyMap: /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40455651/spark-local-20150407202613-4e80/11/temp_local_fdb4a583-5d13-4394-bccb-e1217d5db812 {code} But the file size is only 2.2M. {code} ll -h /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40455651/spark-local-20150407202613-4e80/11/ total 2.2M -rw-r- 1 spark users 2.2M Apr 7 20:27 temp_local_fdb4a583-5d13-4394-bccb-e1217d5db812 {code} The GC log shows that the JVM memory is less than 1 GB. {code} 2015-04-07T20:27:08.023+0800: [GC 981981K->55363K(3961344K), 0.0341720 secs] 2015-04-07T20:27:14.483+0800: [GC 987523K->53737K(3961344K), 0.0252660 secs] 2015-04-07T20:27:20.793+0800: [GC 985897K->56370K(3961344K), 0.0606460 secs] 2015-04-07T20:27:27.553+0800: [GC 988530K->59089K(3961344K), 0.0651840 secs] 2015-04-07T20:27:34.067+0800: [GC 991249K->62153K(3961344K), 0.0288460 secs] 2015-04-07T20:27:40.180+0800: [GC 994313K->61344K(3961344K), 0.0388970 secs] 2015-04-07T20:27:46.490+0800: [GC 993504K->59915K(3961344K), 0.0235150 secs] {code} The estimateSize result differs hugely from the spill file size, there is a bug in was: ExternalAppendOnlyMap spills 2.2 GB of data to disk: {code} 15/04/07 20:27:37 INFO collection.ExternalAppendOnlyMap: Thread 54 spilling in-memory map of 2.2 GB to disk (61 times so far) 15/04/07 20:27:37 INFO collection.ExternalAppendOnlyMap: /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40455651/spark-local-20150407202613-4e80/11/temp_local_fdb4a583-5d13-4394-bccb-e1217d5db812 {code} But the file size is only 2.2M. {code} ll -h /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40455651/spark-local-20150407202613-4e80/11/ total 2.2M -rw-r- 1 spark users 2.2M Apr 7 20:27 temp_local_fdb4a583-5d13-4394-bccb-e1217d5db812 {code} The GC log shows that the JVM memory is less than 1 GB. {code} 2015-04-07T20:27:08.023+0800: [GC 981981K->55363K(3961344K), 0.0341720 secs] 2015-04-07T20:27:14.483+0800: [GC 987523K->53737K(3961344K), 0.0252660 secs] 2015-04-07T20:27:20.793+0800: [GC 985897K->56370K(3961344K), 0.0606460 secs] 2015-04-07T20:27:27.553+0800: [GC 988530K->59089K(3961344K), 0.0651840 secs] 2015-04-07T20:27:34.067+0800: [GC 991249K->62153K(3961344K), 0.0288460 secs] 2015-04-07T20:27:40.180+0800: [GC 994313K->61344K(3961344K), 0.0388970 secs] 2015-04-07T20:27:46.490+0800: [GC 993504K->59915K(3961344K), 0.0235150 secs] {code} The estimateSize result differs hugely from the spill file size EstimateSize is difference with spill file size Key: SPARK-6738 URL: https://issues.apache.org/jira/browse/SPARK-6738 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Hong Shen ExternalAppendOnlyMap spills 2.2 GB of data to disk: {code} 15/04/07 20:27:37 INFO collection.ExternalAppendOnlyMap: Thread 54 spilling in-memory map of 2.2 GB to disk (61 times so far) 15/04/07 20:27:37 INFO collection.ExternalAppendOnlyMap: /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40455651/spark-local-20150407202613-4e80/11/temp_local_fdb4a583-5d13-4394-bccb-e1217d5db812 {code} But the file size is only 2.2M. 
{code} ll -h /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40455651/spark-local-20150407202613-4e80/11/ total 2.2M -rw-r- 1 spark users 2.2M Apr 7 20:27 temp_local_fdb4a583-5d13-4394-bccb-e1217d5db812 {code} The GC log shows that the JVM memory is less than 1 GB. {code} 2015-04-07T20:27:08.023+0800: [GC 981981K->55363K(3961344K), 0.0341720 secs] 2015-04-07T20:27:14.483+0800: [GC 987523K->53737K(3961344K), 0.0252660 secs] 2015-04-07T20:27:20.793+0800: [GC 985897K->56370K(3961344K), 0.0606460 secs] 2015-04-07T20:27:27.553+0800: [GC 988530K->59089K(3961344K), 0.0651840 secs] 2015-04-07T20:27:34.067+0800: [GC 991249K->62153K(3961344K), 0.0288460 secs] 2015-04-07T20:27:40.180+0800: [GC 994313K->61344K(3961344K), 0.0388970 secs] 2015-04-07T20:27:46.490+0800: [GC 993504K->59915K(3961344K), 0.0235150 secs] {code} The estimateSize result differs hugely from the spill file size, there is a bug in -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-6738) EstimateSize is difference with spill file size
[ https://issues.apache.org/jira/browse/SPARK-6738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Shen reopened SPARK-6738: -- There is a in SizeEstimator EstimateSize is difference with spill file size Key: SPARK-6738 URL: https://issues.apache.org/jira/browse/SPARK-6738 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Hong Shen ExternalAppendOnlyMap spills 2.2 GB of data to disk: {code} 15/04/07 20:27:37 INFO collection.ExternalAppendOnlyMap: Thread 54 spilling in-memory map of 2.2 GB to disk (61 times so far) 15/04/07 20:27:37 INFO collection.ExternalAppendOnlyMap: /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40455651/spark-local-20150407202613-4e80/11/temp_local_fdb4a583-5d13-4394-bccb-e1217d5db812 {code} But the file size is only 2.2M. {code} ll -h /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40455651/spark-local-20150407202613-4e80/11/ total 2.2M -rw-r- 1 spark users 2.2M Apr 7 20:27 temp_local_fdb4a583-5d13-4394-bccb-e1217d5db812 {code} The GC log shows that the JVM memory is less than 1 GB. {code} 2015-04-07T20:27:08.023+0800: [GC 981981K->55363K(3961344K), 0.0341720 secs] 2015-04-07T20:27:14.483+0800: [GC 987523K->53737K(3961344K), 0.0252660 secs] 2015-04-07T20:27:20.793+0800: [GC 985897K->56370K(3961344K), 0.0606460 secs] 2015-04-07T20:27:27.553+0800: [GC 988530K->59089K(3961344K), 0.0651840 secs] 2015-04-07T20:27:34.067+0800: [GC 991249K->62153K(3961344K), 0.0288460 secs] 2015-04-07T20:27:40.180+0800: [GC 994313K->61344K(3961344K), 0.0388970 secs] 2015-04-07T20:27:46.490+0800: [GC 993504K->59915K(3961344K), 0.0235150 secs] {code} The estimateSize result differs hugely from the spill file size -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6738) EstimateSize is difference with spill file size
[ https://issues.apache.org/jira/browse/SPARK-6738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Shen updated SPARK-6738: - Description: ExternalAppendOnlyMap spills 2.2 GB of data to disk: {code} 15/04/07 20:27:37 INFO collection.ExternalAppendOnlyMap: Thread 54 spilling in-memory map of 2.2 GB to disk (61 times so far) 15/04/07 20:27:37 INFO collection.ExternalAppendOnlyMap: /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40455651/spark-local-20150407202613-4e80/11/temp_local_fdb4a583-5d13-4394-bccb-e1217d5db812 {code} But the file size is only 2.2M. {code} ll -h /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40455651/spark-local-20150407202613-4e80/11/ total 2.2M -rw-r- 1 spark users 2.2M Apr 7 20:27 temp_local_fdb4a583-5d13-4394-bccb-e1217d5db812 {code} The GC log shows that the JVM memory is less than 1 GB. {code} 2015-04-07T20:27:08.023+0800: [GC 981981K->55363K(3961344K), 0.0341720 secs] 2015-04-07T20:27:14.483+0800: [GC 987523K->53737K(3961344K), 0.0252660 secs] 2015-04-07T20:27:20.793+0800: [GC 985897K->56370K(3961344K), 0.0606460 secs] 2015-04-07T20:27:27.553+0800: [GC 988530K->59089K(3961344K), 0.0651840 secs] 2015-04-07T20:27:34.067+0800: [GC 991249K->62153K(3961344K), 0.0288460 secs] 2015-04-07T20:27:40.180+0800: [GC 994313K->61344K(3961344K), 0.0388970 secs] 2015-04-07T20:27:46.490+0800: [GC 993504K->59915K(3961344K), 0.0235150 secs] {code} The estimateSize result differs hugely from the spill file size, there is a bug in SizeEstimator.visitArray. was: ExternalAppendOnlyMap spills 2.2 GB of data to disk: {code} 15/04/07 20:27:37 INFO collection.ExternalAppendOnlyMap: Thread 54 spilling in-memory map of 2.2 GB to disk (61 times so far) 15/04/07 20:27:37 INFO collection.ExternalAppendOnlyMap: /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40455651/spark-local-20150407202613-4e80/11/temp_local_fdb4a583-5d13-4394-bccb-e1217d5db812 {code} But the file size is only 2.2M. {code} ll -h /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40455651/spark-local-20150407202613-4e80/11/ total 2.2M -rw-r- 1 spark users 2.2M Apr 7 20:27 temp_local_fdb4a583-5d13-4394-bccb-e1217d5db812 {code} The GC log shows that the JVM memory is less than 1 GB. 
{code} 2015-04-07T20:27:08.023+0800: [GC 981981K->55363K(3961344K), 0.0341720 secs] 2015-04-07T20:27:14.483+0800: [GC 987523K->53737K(3961344K), 0.0252660 secs] 2015-04-07T20:27:20.793+0800: [GC 985897K->56370K(3961344K), 0.0606460 secs] 2015-04-07T20:27:27.553+0800: [GC 988530K->59089K(3961344K), 0.0651840 secs] 2015-04-07T20:27:34.067+0800: [GC 991249K->62153K(3961344K), 0.0288460 secs] 2015-04-07T20:27:40.180+0800: [GC 994313K->61344K(3961344K), 0.0388970 secs] 2015-04-07T20:27:46.490+0800: [GC 993504K->59915K(3961344K), 0.0235150 secs] {code} The estimateSize result differs hugely from the spill file size, there is a bug in EstimateSize is difference with spill file size Key: SPARK-6738 URL: https://issues.apache.org/jira/browse/SPARK-6738 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Hong Shen ExternalAppendOnlyMap spills 2.2 GB of data to disk: {code} 15/04/07 20:27:37 INFO collection.ExternalAppendOnlyMap: Thread 54 spilling in-memory map of 2.2 GB to disk (61 times so far) 15/04/07 20:27:37 INFO collection.ExternalAppendOnlyMap: /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40455651/spark-local-20150407202613-4e80/11/temp_local_fdb4a583-5d13-4394-bccb-e1217d5db812 {code} But the file size is only 2.2M. {code} ll -h /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40455651/spark-local-20150407202613-4e80/11/ total 2.2M -rw-r- 1 spark users 2.2M Apr 7 20:27 temp_local_fdb4a583-5d13-4394-bccb-e1217d5db812 {code} The GC log shows that the JVM memory is less than 1 GB. {code} 2015-04-07T20:27:08.023+0800: [GC 981981K->55363K(3961344K), 0.0341720 secs] 2015-04-07T20:27:14.483+0800: [GC 987523K->53737K(3961344K), 0.0252660 secs] 2015-04-07T20:27:20.793+0800: [GC 985897K->56370K(3961344K), 0.0606460 secs] 2015-04-07T20:27:27.553+0800: [GC 988530K->59089K(3961344K), 0.0651840 secs] 2015-04-07T20:27:34.067+0800: [GC 991249K->62153K(3961344K), 0.0288460 secs] 2015-04-07T20:27:40.180+0800: [GC 994313K->61344K(3961344K), 0.0388970 secs] 2015-04-07T20:27:46.490+0800: [GC 993504K->59915K(3961344K), 0.0235150 secs] {code} The estimateSize result differs hugely from the spill file size, there is a bug in SizeEstimator.visitArray. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6738) EstimateSize is difference with spill file size
[ https://issues.apache.org/jira/browse/SPARK-6738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14504202#comment-14504202 ] Hong Shen edited comment on SPARK-6738 at 4/21/15 2:54 AM: --- There is a bug in SizeEstimator was (Author: shenhong): There is a in SizeEstimator EstimateSize is difference with spill file size Key: SPARK-6738 URL: https://issues.apache.org/jira/browse/SPARK-6738 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Hong Shen ExternalAppendOnlyMap spills 2.2 GB of data to disk: {code} 15/04/07 20:27:37 INFO collection.ExternalAppendOnlyMap: Thread 54 spilling in-memory map of 2.2 GB to disk (61 times so far) 15/04/07 20:27:37 INFO collection.ExternalAppendOnlyMap: /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40455651/spark-local-20150407202613-4e80/11/temp_local_fdb4a583-5d13-4394-bccb-e1217d5db812 {code} But the file size is only 2.2M. {code} ll -h /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40455651/spark-local-20150407202613-4e80/11/ total 2.2M -rw-r- 1 spark users 2.2M Apr 7 20:27 temp_local_fdb4a583-5d13-4394-bccb-e1217d5db812 {code} The GC log shows that the JVM memory is less than 1 GB. {code} 2015-04-07T20:27:08.023+0800: [GC 981981K->55363K(3961344K), 0.0341720 secs] 2015-04-07T20:27:14.483+0800: [GC 987523K->53737K(3961344K), 0.0252660 secs] 2015-04-07T20:27:20.793+0800: [GC 985897K->56370K(3961344K), 0.0606460 secs] 2015-04-07T20:27:27.553+0800: [GC 988530K->59089K(3961344K), 0.0651840 secs] 2015-04-07T20:27:34.067+0800: [GC 991249K->62153K(3961344K), 0.0288460 secs] 2015-04-07T20:27:40.180+0800: [GC 994313K->61344K(3961344K), 0.0388970 secs] 2015-04-07T20:27:46.490+0800: [GC 993504K->59915K(3961344K), 0.0235150 secs] {code} The estimateSize result differs hugely from the spill file size -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
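To illustrate how a sampled estimate can drift far from the bytes actually spilled, here is a simplified sketch of sample-and-extrapolate array estimation (assumed logic for illustration, not the actual SizeEstimator.visitArray code): scaling the average sampled element size by the array length charges shared or duplicated references once per slot, while a serialized spill may store far less.
{code}
import scala.util.Random

// Estimate an array's size by sampling a few elements and extrapolating.
def estimateArraySize[T <: AnyRef](arr: Array[T], sampleSize: Int)(sizeOf: T => Long): Long = {
  val rnd = new Random(42)
  val n = math.min(sampleSize, arr.length)
  val sampledBytes = (0 until n).map { _ =>
    val elem = arr(rnd.nextInt(arr.length))
    if (elem == null) 0L else sizeOf(elem)
  }.sum
  // The extrapolation step: every slot is charged the average element size,
  // even when many slots point at the same underlying object.
  if (n == 0) 0L else (sampledBytes / n) * arr.length
}
{code}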
[jira] [Assigned] (SPARK-4131) Support Writing data into the filesystem from queries
[ https://issues.apache.org/jira/browse/SPARK-4131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-4131: --- Assignee: Fei Wang (was: Apache Spark) Support Writing data into the filesystem from queries --- Key: SPARK-4131 URL: https://issues.apache.org/jira/browse/SPARK-4131 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 1.1.0 Reporter: XiaoJing wang Assignee: Fei Wang Priority: Critical Original Estimate: 0.05h Remaining Estimate: 0.05h Spark SQL does not support writing data into the filesystem from queries, e.g.: {code}insert overwrite LOCAL DIRECTORY '/data1/wangxj/sql_spark' select * from page_views; {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-4131) Support Writing data into the filesystem from queries
[ https://issues.apache.org/jira/browse/SPARK-4131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-4131: --- Assignee: Apache Spark (was: Fei Wang) Support Writing data into the filesystem from queries --- Key: SPARK-4131 URL: https://issues.apache.org/jira/browse/SPARK-4131 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 1.1.0 Reporter: XiaoJing wang Assignee: Apache Spark Priority: Critical Original Estimate: 0.05h Remaining Estimate: 0.05h Spark SQL does not support writing data into the filesystem from queries, e.g.: {code}insert overwrite LOCAL DIRECTORY '/data1/wangxj/sql_spark' select * from page_views; {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7025) Create a Java-friendly input source API
[ https://issues.apache.org/jira/browse/SPARK-7025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-7025: --- Description: The goal of this ticket is to create a simple input source API that we can maintain and support long term. Spark currently has two de facto input source API: 1. RDD 2. Hadoop MapReduce InputFormat Neither of the above is ideal: 1. RDD: It is hard for Java developers to implement RDD, given the implicit class tags. In addition, the RDD API depends on Scala's runtime library, which does not preserve binary compatibility across Scala versions. If a developer chooses Java to implement an input source, it would be great if that input source can be binary compatible in years to come. 2. Hadoop InputFormat: The Hadoop InputFormat API is overly restrictive. For example, it forces key-value semantics, and does not support running arbitrary code on the driver side (an example of why this is useful is broadcast). In addition, it is somewhat awkward to tell developers that in order to implement an input source for Spark, they should learn the Hadoop MapReduce API first. So here's the proposal: an InputSource is described by: * an array of InputPartition that specifies the data partitioning * a RecordReader that specifies how data on each partition can be read This interface would be similar to Hadoop's InputFormat, except that there is no explicit key/value separation. was: The goal of this ticket is to create a simple input source API that we can maintain and support long term. Spark currently has two de facto input source API: 1. RDD API 2. Hadoop MapReduce InputFormat API Neither of the above is ideal: 1. RDD: It is hard for Java developers to implement RDD, given the implicit class tags. In addition, the RDD API depends on Scala's runtime library, which does not preserve binary compatibility across Scala versions. If a developer chooses Java to implement an input source, it would be great if that input source can be binary compatible in years to come. 2. Hadoop InputFormat: The Hadoop InputFormat API is overly restrictive. For example, it forces key-value semantics, and does not support running arbitrary code on the driver side (an example of why this is useful is broadcast). In addition, it is somewhat awkward to tell developers that in order to implement an input source for Spark, they should learn the Hadoop MapReduce API first. So here's the proposal: an InputSource is described by: * an array of InputPartition that specifies the data partitioning * a RecordReader that specifies how data on each partition can be read This interface would be similar to Hadoop's InputFormat, except that there is no explicit key/value separation. Create a Java-friendly input source API --- Key: SPARK-7025 URL: https://issues.apache.org/jira/browse/SPARK-7025 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Reynold Xin Assignee: Reynold Xin The goal of this ticket is to create a simple input source API that we can maintain and support long term. Spark currently has two de facto input source API: 1. RDD 2. Hadoop MapReduce InputFormat Neither of the above is ideal: 1. RDD: It is hard for Java developers to implement RDD, given the implicit class tags. In addition, the RDD API depends on Scala's runtime library, which does not preserve binary compatibility across Scala versions. If a developer chooses Java to implement an input source, it would be great if that input source can be binary compatible in years to come. 2. 
Hadoop InputFormat: The Hadoop InputFormat API is overly restrictive. For example, it forces key-value semantics, and does not support running arbitrary code on the driver side (an example of why this is useful is broadcast). In addition, it is somewhat awkward to tell developers that in order to implement an input source for Spark, they should learn the Hadoop MapReduce API first. So here's the proposal: an InputSource is described by: * an array of InputPartition that specifies the data partitioning * a RecordReader that specifies how data on each partition can be read This interface would be similar to Hadoop's InputFormat, except that there is no explicit key/value separation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7015) Multiclass to Binary Reduction
[ https://issues.apache.org/jira/browse/SPARK-7015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14504197#comment-14504197 ] Ram Sriharsha commented on SPARK-7015: -- Sounds good. Let me know what reference you had in mind. I am familiar with Beygelzimer and Langford's error-correcting tournaments (http://hunch.net/~beygel/tournament.pdf), but if you have a better reference in mind, let me know and I can use that as the starting point. Multiclass to Binary Reduction -- Key: SPARK-7015 URL: https://issues.apache.org/jira/browse/SPARK-7015 Project: Spark Issue Type: Improvement Components: ML Reporter: Ram Sriharsha Assignee: Ram Sriharsha Original Estimate: 336h Remaining Estimate: 336h With the new Pipeline API, it is possible to seamlessly support machine learning reductions as meta-algorithms. GBDT and SVM today are binary classifiers, and we can implement multiclass classification as One-vs-All or All-vs-All (or an even more sophisticated reduction) using binary classifiers as primitives. This JIRA is to track the creation of a reduction API for multiclass classification. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
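As a concrete starting point, a minimal sketch of the simplest such reduction, One-vs-All, over an assumed binary-learner interface (simplified types for illustration; not the spark.ml API): train one binary model per class and predict the class whose model scores highest.
{code}
trait BinaryModel { def score(features: Array[Double]): Double }
trait BinaryLearner { def fit(data: Seq[(Array[Double], Double)]): BinaryModel }

class OneVsAll(base: BinaryLearner) {
  def fit(data: Seq[(Array[Double], Int)], numClasses: Int): Array[Double] => Int = {
    val models = (0 until numClasses).map { k =>
      // Relabel: 1.0 for "belongs to class k", 0.0 otherwise.
      base.fit(data.map { case (x, y) => (x, if (y == k) 1.0 else 0.0) })
    }
    // Predict the index of the highest-scoring binary model.
    (x: Array[Double]) => models.map(_.score(x)).zipWithIndex.maxBy(_._1)._2
  }
}
{code}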
[jira] [Updated] (SPARK-6954) Dynamic allocation: numExecutorsPending in ExecutorAllocationManager should never become negative
[ https://issues.apache.org/jira/browse/SPARK-6954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheolsoo Park updated SPARK-6954: - Attachment: without_fix.png with_fix.png I am uploading two diagrams that shows how the following variables move over time w/ and w/o my patch- * numExecutorsPending * executorIds.size * executorsPendingToRemove.size * targetNumExecutors # The {{with_fix.png}} shows 4 consecutive runs of my query. As can be seen, {{targetNumExecutors}} and {{numExecutorsPending}} stays above zero. # The {{without_fix.png}} shows a single run of my query. As can be seen, {{targetNumExecutors}} and {{numExecutorsPending}} goes negative after the 1st run. Here is how I collected data in the source code- {code} private def targetNumExecutors(): Int = { logInfo(ZZZ + numExecutorsPending + , + executorIds.size + , + executorsPendingToRemove.size + , + (numExecutorsPending + executorIds.size - executorsPendingToRemove.size)) numExecutorsPending + executorIds.size - executorsPendingToRemove.size } {code} Dynamic allocation: numExecutorsPending in ExecutorAllocationManager should never become negative - Key: SPARK-6954 URL: https://issues.apache.org/jira/browse/SPARK-6954 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.3.1 Reporter: Cheolsoo Park Assignee: Cheolsoo Park Labels: yarn Attachments: with_fix.png, without_fix.png I have a simple test case for dynamic allocation on YARN that fails with the following stack trace- {code} 15/04/16 00:52:14 ERROR Utils: Uncaught exception in thread spark-dynamic-executor-allocation-0 java.lang.IllegalArgumentException: Attempted to request a negative number of executor(s) -21 from the cluster manager. Please specify a positive number! at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:338) at org.apache.spark.SparkContext.requestTotalExecutors(SparkContext.scala:1137) at org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:294) at org.apache.spark.ExecutorAllocationManager.addOrCancelExecutorRequests(ExecutorAllocationManager.scala:263) at org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:230) at org.apache.spark.ExecutorAllocationManager$$anon$1$$anonfun$run$1.apply$mcV$sp(ExecutorAllocationManager.scala:189) at org.apache.spark.ExecutorAllocationManager$$anon$1$$anonfun$run$1.apply(ExecutorAllocationManager.scala:189) at org.apache.spark.ExecutorAllocationManager$$anon$1$$anonfun$run$1.apply(ExecutorAllocationManager.scala:189) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1618) at org.apache.spark.ExecutorAllocationManager$$anon$1.run(ExecutorAllocationManager.scala:189) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code} My test is as follows- # Start spark-shell with a single executor. # Run a {{select count(\*)}} query. 
The number of executors rises since the input size is non-trivial.
# After the job finishes, the number of executors falls as most of them become idle.
# Rerun the same query, and the request to add executors fails with the above error.
The job itself continues to run with whatever executors it already has, but it never gets more executors unless the shell is closed and restarted. Note that this error only happens when I configure {{executorIdleTimeout}} very small. For example, I can reproduce it with the following configs:
{code}
spark.dynamicAllocation.executorIdleTimeout 5
spark.dynamicAllocation.schedulerBacklogTimeout 5
{code}
Although I can simply increase {{executorIdleTimeout}} to something like 60 secs to avoid the error, I think this is still a bug to be fixed. The root cause seems to be that {{numExecutorsPending}} accidentally becomes negative if executors are killed too aggressively (i.e. {{executorIdleTimeout}} is too small) because under that circumstance,
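Whatever the full explanation, the issue title states the invariant the fix must enforce. A minimal sketch of that kind of guard (not the actual patch; it simply reuses the names from the logging snippet above):
{code}
private def targetNumExecutors(): Int = {
  val target = numExecutorsPending + executorIds.size - executorsPendingToRemove.size
  // If executors are killed faster than pending requests are fulfilled, the raw
  // value can go negative; never ask the cluster manager for fewer than zero.
  math.max(0, target)
}
{code}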
[jira] [Updated] (SPARK-7025) Create a Java-friendly input source API
[ https://issues.apache.org/jira/browse/SPARK-7025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-7025: --- Description: The goal of this ticket is to create a simple input source API that we can maintain and support long term. Spark currently has two de facto input source APIs:
1. RDD API
2. Hadoop MapReduce InputFormat API
Neither of the above is ideal:
1. RDD: It is hard for Java developers to implement RDD, given the implicit class tags. In addition, the RDD API depends on Scala's runtime library, which does not preserve binary compatibility across Scala versions. If a developer chooses Java to implement an input source, it would be great if that input source could remain binary compatible for years to come.
2. Hadoop InputFormat: The Hadoop InputFormat API is overly restrictive. For example, it forces key-value semantics, and does not support running arbitrary code on the driver side (an example of why this is useful is broadcast). In addition, it is somewhat awkward to tell developers that in order to implement an input source for Spark, they should learn the Hadoop MapReduce API first.
So here's the proposal: an InputSource is described by:
* an array of InputPartition that specifies the data partitioning
* a RecordReader that specifies how data on each partition can be read
This interface would be similar to Hadoop's InputFormat, except that there is no explicit key/value separation.
was: The goal of this ticket is to create a simple input source API that we can maintain and support long term. Spark currently has two de facto input source APIs:
1. RDD API
2. Hadoop MapReduce InputFormat API
Neither of the above is ideal:
1. RDD: It is hard for Java developers to implement RDD, given the implicit class tags. In addition, the RDD API depends on Scala's runtime library, which does not preserve binary compatibility across Scala versions. If a developer chooses Java to implement an input source, it would be great if that input source could remain binary compatible for years to come.
2. Hadoop InputFormat: The Hadoop InputFormat API is overly restrictive. For example, it forces key-value semantics, and does not support running arbitrary code on the driver side (an example of why this is useful is broadcast). In addition, it is somewhat awkward to tell developers that in order to implement an input source for Spark, they should learn the Hadoop MapReduce API first.
So here's the proposal: An InputSource is described by:
* an array of InputPartition that specifies the data partitioning
* a RecordReader that specifies how data on each partition can be read
This interface would be similar to Hadoop's InputFormat, except that there is no explicit key/value separation.
Create a Java-friendly input source API --- Key: SPARK-7025 URL: https://issues.apache.org/jira/browse/SPARK-7025 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Reynold Xin Assignee: Reynold Xin The goal of this ticket is to create a simple input source API that we can maintain and support long term. Spark currently has two de facto input source APIs:
1. RDD API
2. Hadoop MapReduce InputFormat API
Neither of the above is ideal:
1. RDD: It is hard for Java developers to implement RDD, given the implicit class tags. In addition, the RDD API depends on Scala's runtime library, which does not preserve binary compatibility across Scala versions. If a developer chooses Java to implement an input source, it would be great if that input source could remain binary compatible for years to come.
2. Hadoop InputFormat: The Hadoop InputFormat API is overly restrictive. For example, it forces key-value semantics, and does not support running arbitrary code on the driver side (an example of why this is useful is broadcast). In addition, it is somewhat awkward to tell developers that in order to implement an input source for Spark, they should learn the Hadoop MapReduce API first.
So here's the proposal: an InputSource is described by:
* an array of InputPartition that specifies the data partitioning
* a RecordReader that specifies how data on each partition can be read
This interface would be similar to Hadoop's InputFormat, except that there is no explicit key/value separation.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
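Since the proposal above is only prose, here is a hypothetical sketch of the shape being described. It is written in Scala but avoids implicits and class tags so it would stay implementable from Java; none of these names or signatures are final, they simply mirror the two bullets in the proposal:
{code}
trait InputPartition extends java.io.Serializable

trait RecordReader[T] extends java.io.Closeable {
  /** Advance to the next record, returning false when the partition is exhausted. */
  def next(): Boolean
  /** The current record; only valid after next() has returned true. */
  def get(): T
}

trait InputSource[T] extends java.io.Serializable {
  /** Runs on the driver, so driver-side setup (e.g. preparing a broadcast) is possible. */
  def getPartitions(): Array[InputPartition]
  /** Runs on the executors to read a single partition. */
  def createRecordReader(partition: InputPartition): RecordReader[T]
}
{code}
Unlike Hadoop's InputFormat, nothing here forces a key/value pair; T can be any record type.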
[jira] [Comment Edited] (SPARK-7015) Multiclass to Binary Reduction
[ https://issues.apache.org/jira/browse/SPARK-7015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14504189#comment-14504189 ] Joseph K. Bradley edited comment on SPARK-7015 at 4/21/15 2:43 AM: --- +1 I'd strongly vote for supporting error-correcting output codes from early on. It's not that much harder to implement, and it can perform much better in practice (and in theory). I can provide some references if it'd be helpful. was (Author: josephkb): +1 I'd strongly vote for supporting error-correcting output codes from the beginning. It's not that much harder to implement, and it can perform much better in practice (and in theory). I can provide some references if it'd be helpful. Multiclass to Binary Reduction -- Key: SPARK-7015 URL: https://issues.apache.org/jira/browse/SPARK-7015 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Ram Sriharsha Assignee: Ram Sriharsha Original Estimate: 336h Remaining Estimate: 336h With the new Pipeline API, it is possible to seamlessly support machine learning reductions as meta algorithms. GBDT and SVM today are binary classifiers and we can implement multi class classification as a One vs All, or All vs All (or even more sophisticated reduction) using binary classifiers as primitives. This JIRA is to track the creation of a reduction API for multi class classification. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7015) Multiclass to Binary Reduction
[ https://issues.apache.org/jira/browse/SPARK-7015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14504189#comment-14504189 ] Joseph K. Bradley commented on SPARK-7015: -- +1 I'd strongly vote for supporting error-correcting output codes from the beginning. It's not that much harder to implement, and it can perform much better in practice (and in theory). I can provide some references if it'd be helpful. Multiclass to Binary Reduction -- Key: SPARK-7015 URL: https://issues.apache.org/jira/browse/SPARK-7015 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Ram Sriharsha Assignee: Ram Sriharsha Original Estimate: 336h Remaining Estimate: 336h With the new Pipeline API, it is possible to seamlessly support machine learning reductions as meta algorithms. GBDT and SVM today are binary classifiers and we can implement multi class classification as a One vs All, or All vs All (or even more sophisticated reduction) using binary classifiers as primitives. This JIRA is to track the creation of a reduction API for multi class classification. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
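To make the ECOC suggestion concrete, here is a tiny illustrative sketch of the reduction itself (in the style of Dietterich and Bakiri), independent of any Spark API; {{BinaryClassifier}} and the code matrix are hypothetical stand-ins:
{code}
trait BinaryClassifier {
  def predict(features: Array[Double]): Int // returns +1 or -1
}

// codeMatrix has one row per class and one entry (+1/-1) per binary problem.
class ECOCModel(codeMatrix: Array[Array[Int]], classifiers: Array[BinaryClassifier]) {
  /** Predict the class whose code word is closest in Hamming distance. */
  def predict(features: Array[Double]): Int = {
    val bits = classifiers.map(_.predict(features))
    val distances = codeMatrix.map { row =>
      row.zip(bits).count { case (expected, got) => expected != got }
    }
    distances.zipWithIndex.minBy(_._1)._2
  }
}
{code}
One-vs-all corresponds to one particular code matrix, so this structure subsumes it.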
[jira] [Created] (SPARK-7023) [Spark SQL] Can't populate table size information into Hive metastore when create table or insert into table
Yi Zhou created SPARK-7023: -- Summary: [Spark SQL] Can't populate table size information into Hive metastore when create table or insert into table Key: SPARK-7023 URL: https://issues.apache.org/jira/browse/SPARK-7023 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Yi Zhou After running the CREATE TABLE statement below in Spark SQL, there are no 'totalSize', 'numRows', or 'rawDataSize' properties in the 'parameters' field.
{code}
CREATE TABLE IF NOT EXISTS customer STORED AS PARQUET AS SELECT * FROM customer_temp;
{code}
{code}
hive> describe extended customer;
OK
c_customer_sk            bigint
c_customer_id            string
c_current_cdemo_sk       bigint
c_current_hdemo_sk       bigint
c_current_addr_sk        bigint
c_first_shipto_date_sk   bigint
c_first_sales_date_sk    bigint
c_salutation             string
c_first_name             string
c_last_name              string
c_preferred_cust_flag    string
c_birth_day              int
c_birth_month            int
c_birth_year             int
c_birth_country          string
c_login                  string
c_email_address          string
c_last_review_date       string
Detailed Table Information ... parameters:{transient_lastDdlTime=1429582149} ...
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
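If the immediate goal is just to get the statistics populated, one possible workaround (a hedged sketch, not a fix for this bug) is to ask Hive to compute them after the fact; this assumes a {{HiveContext}} named {{hiveContext}} and that the Hive ANALYZE TABLE statement is passed through:
{code}
// The noscan variant updates size statistics such as totalSize without a full
// table scan; row counts would still require a scan in Hive itself.
hiveContext.sql("ANALYZE TABLE customer COMPUTE STATISTICS noscan")
{code}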
[jira] [Commented] (SPARK-5100) Spark Thrift server monitor page
[ https://issues.apache.org/jira/browse/SPARK-5100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14504132#comment-14504132 ] Cheng Lian commented on SPARK-5100: --- Had an offline discussion with [~tianyi]; he's rebasing PR #3946. I'll revisit it once he finishes rebasing. Spark Thrift server monitor page Key: SPARK-5100 URL: https://issues.apache.org/jira/browse/SPARK-5100 Project: Spark Issue Type: New Feature Components: SQL, Web UI Reporter: Yi Tian Priority: Critical Attachments: Spark Thrift-server monitor page.pdf, prototype-screenshot.png In the latest Spark release, there is a Spark Streaming tab on the driver web UI, which shows information about the running streaming application. It would be helpful to provide a similar monitor page for the Thrift server, because both streaming and the Thrift server are long-running applications, and their details do not show on the stage page or job page. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6368) Build a specialized serializer for Exchange operator.
[ https://issues.apache.org/jira/browse/SPARK-6368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-6368. - Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5497 [https://github.com/apache/spark/pull/5497] Build a specialized serializer for Exchange operator. -- Key: SPARK-6368 URL: https://issues.apache.org/jira/browse/SPARK-6368 Project: Spark Issue Type: Improvement Components: SQL Reporter: Yin Huai Assignee: Yin Huai Priority: Critical Fix For: 1.4.0 Attachments: Kryo.nps, SchemaBased.nps Kryo is still pretty slow because it works on individual objects and is relatively expensive to allocate. For the Exchange operator, because the schemas of the key and value are already defined, we can create a specialized serializer to handle those specific schemas. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
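To illustrate why a schema-specialized serializer can beat a generic one, here is a standalone sketch (plain Java I/O, not Spark's Serializer API): once the key and value schemas are fixed, each row can be written field by field with no per-object type metadata at all. The (Int, String, Double) schema is an assumption for illustration only:
{code}
import java.io.{DataInputStream, DataOutputStream}

// Assumed row schema for illustration: key = Int, value = (String, Double).
def writeRecord(out: DataOutputStream, key: Int, name: String, score: Double): Unit = {
  out.writeInt(key)      // fixed layout: no class names or field tags are needed,
  out.writeUTF(name)     // because both sides already agree on the schema
  out.writeDouble(score)
}

def readRecord(in: DataInputStream): (Int, String, Double) =
  (in.readInt(), in.readUTF(), in.readDouble())
{code}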
[jira] [Updated] (SPARK-7015) Multiclass to Binary Reduction
[ https://issues.apache.org/jira/browse/SPARK-7015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-7015: - Component/s: (was: MLlib) ML Multiclass to Binary Reduction -- Key: SPARK-7015 URL: https://issues.apache.org/jira/browse/SPARK-7015 Project: Spark Issue Type: Improvement Components: ML Reporter: Ram Sriharsha Assignee: Ram Sriharsha Original Estimate: 336h Remaining Estimate: 336h With the new Pipeline API, it is possible to seamlessly support machine learning reductions as meta algorithms. GBDT and SVM today are binary classifiers and we can implement multi class classification as a One vs All, or All vs All (or even more sophisticated reduction) using binary classifiers as primitives. This JIRA is to track the creation of a reduction API for multi class classification. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4521) Parquet fails to read columns with spaces in the name
[ https://issues.apache.org/jira/browse/SPARK-4521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14504127#comment-14504127 ] Cheng Lian commented on SPARK-4521: --- Yes, I'm resolving this one. Parquet fails to read columns with spaces in the name - Key: SPARK-4521 URL: https://issues.apache.org/jira/browse/SPARK-4521 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Michael Armbrust I think this is actually a bug in parquet, but it would be good to track it here as well. To reproduce:
{code}
jsonRDD(sparkContext.parallelize("""{"number of clusters": 1}""" :: Nil)).saveAsParquetFile("test")
parquetFile("test").collect()
{code}
{code}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 8.0 failed 1 times, most recent failure: Lost task 0.0 in stage 8.0 (TID 13, localhost): java.lang.IllegalArgumentException: field ended by ';': expected ';' but got 'of' at line 1: optional int32 number of
at parquet.schema.MessageTypeParser.check(MessageTypeParser.java:209)
at parquet.schema.MessageTypeParser.addPrimitiveType(MessageTypeParser.java:182)
at parquet.schema.MessageTypeParser.addType(MessageTypeParser.java:108)
at parquet.schema.MessageTypeParser.addGroupTypeFields(MessageTypeParser.java:96)
at parquet.schema.MessageTypeParser.parse(MessageTypeParser.java:89)
at parquet.schema.MessageTypeParser.parseMessageType(MessageTypeParser.java:79)
at parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:189)
at parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:138)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:135)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:107)
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4766) ML Estimator Params should be distinct from Transformer Params
[ https://issues.apache.org/jira/browse/SPARK-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-4766: - Description: Currently, in spark.ml, both Transformers and Estimators extend the same Params classes. There should be one Params class for the Transformer and one for the Estimator. These could sometimes be the same, but for other models, we may need either (a) to make them distinct or (b) to have the Estimator params class extend the Transformer one. E.g., it is weird to be able to do: {code} val model: LogisticRegressionModel = ... model.getMaxIter() {code} It's also weird to be able to: * Wrap LogisticRegressionModel (a Transformer) with CrossValidator * Pass a set of ParamMaps to CrossValidator which includes parameter LogisticRegressionModel.maxIter * (CrossValidator would try to set that parameter.) * I'm not sure if this would cause a failure or just be a noop. See the comment below about Word2Vec as well. was: Currently, in spark.ml, both Transformers and Estimators extend the same Params classes. There should be one Params class for the Transformer and one for the Estimator, where the Estimator params class extends the Transformer one. E.g., it is weird to be able to do: {code} val model: LogisticRegressionModel = ... model.getMaxIter() {code} It's also weird to be able to: * Wrap LogisticRegressionModel (a Transformer) with CrossValidator * Pass a set of ParamMaps to CrossValidator which includes parameter LogisticRegressionModel.maxIter * (CrossValidator would try to set that parameter.) * I'm not sure if this would cause a failure or just be a noop. ML Estimator Params should be distinct from Transformer Params -- Key: SPARK-4766 URL: https://issues.apache.org/jira/browse/SPARK-4766 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Currently, in spark.ml, both Transformers and Estimators extend the same Params classes. There should be one Params class for the Transformer and one for the Estimator. These could sometimes be the same, but for other models, we may need either (a) to make them distinct or (b) to have the Estimator params class extend the Transformer one. E.g., it is weird to be able to do: {code} val model: LogisticRegressionModel = ... model.getMaxIter() {code} It's also weird to be able to: * Wrap LogisticRegressionModel (a Transformer) with CrossValidator * Pass a set of ParamMaps to CrossValidator which includes parameter LogisticRegressionModel.maxIter * (CrossValidator would try to set that parameter.) * I'm not sure if this would cause a failure or just be a noop. See the comment below about Word2Vec as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
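A hypothetical Scala sketch of option (b) from the description, with made-up trait names, just to show the intended shape (training-only parameters live on the Estimator side and never appear on the Model):
{code}
// Parameters that still make sense at transform time stay on the Model's trait.
trait LogisticRegressionModelParams {
  def getThreshold: Double
}

// The Estimator's trait extends it and adds training-only parameters, so
// model.getMaxIter simply does not compile.
trait LogisticRegressionParams extends LogisticRegressionModelParams {
  def getMaxIter: Int
}
{code}
Under this shape, a ParamMap that tries to set maxIter on a bare Model would be rejected up front, since the parameter would not exist there.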
[jira] [Commented] (SPARK-4766) ML Estimator Params should subclass Transformer Params
[ https://issues.apache.org/jira/browse/SPARK-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14504181#comment-14504181 ] Joseph K. Bradley commented on SPARK-4766: -- *Update*: A new issue was brought up by the PR for Word2Vec for this JIRA: [https://issues.apache.org/jira/browse/SPARK-6529] Basically, the Estimator and Model take different input column types, so they should (probably) use different input column parameters. See that JIRA for the discussion. ML Estimator Params should subclass Transformer Params -- Key: SPARK-4766 URL: https://issues.apache.org/jira/browse/SPARK-4766 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Currently, in spark.ml, both Transformers and Estimators extend the same Params classes. There should be one Params class for the Transformer and one for the Estimator, where the Estimator params class extends the Transformer one. E.g., it is weird to be able to do: {code} val model: LogisticRegressionModel = ... model.getMaxIter() {code} It's also weird to be able to: * Wrap LogisticRegressionModel (a Transformer) with CrossValidator * Pass a set of ParamMaps to CrossValidator which includes parameter LogisticRegressionModel.maxIter * (CrossValidator would try to set that parameter.) * I'm not sure if this would cause a failure or just be a noop. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4766) ML Estimator Params should be distinct from Transformer Params
[ https://issues.apache.org/jira/browse/SPARK-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-4766: - Summary: ML Estimator Params should be distinct from Transformer Params (was: ML Estimator Params should subclass Transformer Params) ML Estimator Params should be distinct from Transformer Params -- Key: SPARK-4766 URL: https://issues.apache.org/jira/browse/SPARK-4766 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Currently, in spark.ml, both Transformers and Estimators extend the same Params classes. There should be one Params class for the Transformer and one for the Estimator, where the Estimator params class extends the Transformer one. E.g., it is weird to be able to do: {code} val model: LogisticRegressionModel = ... model.getMaxIter() {code} It's also weird to be able to: * Wrap LogisticRegressionModel (a Transformer) with CrossValidator * Pass a set of ParamMaps to CrossValidator which includes parameter LogisticRegressionModel.maxIter * (CrossValidator would try to set that parameter.) * I'm not sure if this would cause a failure or just be a noop. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6900) spark ec2 script enters infinite loop when run-instance fails
[ https://issues.apache.org/jira/browse/SPARK-6900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14504236#comment-14504236 ] Guodong Wang commented on SPARK-6900: - Hi Nick, sorry for my late reply. I marked this as a major issue because I am using the spark-ec2 script to launch/setup/destroy Spark clusters automatically in AWS. This is integrated with our computation platform service; we don't expect any manual operations when launching a cluster. I agree with you that it would not be such a major issue if I were using spark-ec2 manually. But in my case, I use the script as an automation tool, so I think it would be nice if the script could handle this case. Although this AWS failure is a rare case, we are using AWS heavily now (launching/destroying a bunch of separate Spark clusters each day), so it would be nice if the spark-ec2 script could handle such AWS failures. Here is more information about my case. In my environment, the spark-ec2 script just waited forever for all the instances to become 'ssh-ready'. It would not try to SSH to any instance before exiting the loop; I had to kill the script process in that scenario. I went through the spark-ec2 script, and I think SSHing to the instance hosts only happens after all the instances enter the running state. Because one of the instances was terminated as soon as it was launched, it never entered the running state. Then is_cluster_ssh_available is short-circuited, because not all the instances are running. Here is the code:
{code}
if all(i.state == 'running' for i in cluster_instances) and \
   all(s.system_status.status == 'ok' for s in statuses) and \
   all(s.instance_status.status == 'ok' for s in statuses) and \
   is_cluster_ssh_available(cluster_instances, opts):
{code}
So the script enters the infinite loop and never prints any SSH failure message. If I made some mistakes in the above analysis, please tell me. spark ec2 script enters infinite loop when run-instance fails - Key: SPARK-6900 URL: https://issues.apache.org/jira/browse/SPARK-6900 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.3.0 Reporter: Guodong Wang I am using the spark-ec2 scripts to launch Spark clusters in AWS. Recently, in our AWS region, there were some tech issues with the AWS EC2 service. When spark-ec2 sent the run-instance requests to EC2, not all the requested instances were launched; some instances were terminated by the EC2 service before they were up. But the spark-ec2 script waits for all the instances to reach 'ssh-ready' status, so the script enters an infinite loop, because the terminated instances will never be 'ssh-ready'. In my opinion, it should be OK if some of the slave instances were terminated. As long as the master node is running, the terminated slaves should be filtered out and the cluster should be set up. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7024) Improve performance of function containsStar
Yadong Qi created SPARK-7024: Summary: Improve performance of function containsStar Key: SPARK-7024 URL: https://issues.apache.org/jira/browse/SPARK-7024 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.1 Reporter: Yadong Qi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7024) Improve performance of function containsStar
[ https://issues.apache.org/jira/browse/SPARK-7024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7024: --- Assignee: (was: Apache Spark) Improve performance of function containsStar Key: SPARK-7024 URL: https://issues.apache.org/jira/browse/SPARK-7024 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.1 Reporter: Yadong Qi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7024) Improve performance of function containsStar
[ https://issues.apache.org/jira/browse/SPARK-7024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14504255#comment-14504255 ] Apache Spark commented on SPARK-7024: - User 'watermen' has created a pull request for this issue: https://github.com/apache/spark/pull/5602 Improve performance of function containsStar Key: SPARK-7024 URL: https://issues.apache.org/jira/browse/SPARK-7024 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.1 Reporter: Yadong Qi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6900) spark ec2 script enters infinite loop when run-instance fails
[ https://issues.apache.org/jira/browse/SPARK-6900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14504253#comment-14504253 ] Guodong Wang commented on SPARK-6900: - In my opinion, it would not cost us much to fix this issue. Currently, I propose 2 ways to fix it:
1. The first one is a simple fix: *adding a timeout to wait_for_cluster_state*. If wait_for_cluster_state times out, just exit the script with a non-zero code. Then we can add the --resume option when retrying the cluster launch next time.
2. The second one is more robust: *filtering out terminated instances when wait_for_cluster_state waits for ssh-ready*. If all the non-terminated instances are ssh-ready, return from the function. Then, if the master is terminated, cluster setup will fail; otherwise, the cluster comes up even though some slave instances are down.
What is your opinion, [~nchammas]? I would be happy to discuss the fix with you and provide a patch. Thanks. spark ec2 script enters infinite loop when run-instance fails - Key: SPARK-6900 URL: https://issues.apache.org/jira/browse/SPARK-6900 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.3.0 Reporter: Guodong Wang I am using the spark-ec2 scripts to launch Spark clusters in AWS. Recently, in our AWS region, there were some tech issues with the AWS EC2 service. When spark-ec2 sent the run-instance requests to EC2, not all the requested instances were launched; some instances were terminated by the EC2 service before they were up. But the spark-ec2 script waits for all the instances to reach 'ssh-ready' status, so the script enters an infinite loop, because the terminated instances will never be 'ssh-ready'. In my opinion, it should be OK if some of the slave instances were terminated. As long as the master node is running, the terminated slaves should be filtered out and the cluster should be set up. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7024) Improve performance of function containsStar
[ https://issues.apache.org/jira/browse/SPARK-7024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7024: --- Assignee: Apache Spark Improve performance of function containsStar Key: SPARK-7024 URL: https://issues.apache.org/jira/browse/SPARK-7024 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.1 Reporter: Yadong Qi Assignee: Apache Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-7015) Multiclass to Binary Reduction
[ https://issues.apache.org/jira/browse/SPARK-7015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14504335#comment-14504335 ] Joseph K. Bradley edited comment on SPARK-7015 at 4/21/15 5:21 AM: --- Your reference looks newer than the ones I've used before. After a quick glance, it looks like it examines generalizations of methods I've seen. These are the ones I've used:
* Dietterich Bakiri. Solving Multiclass Learning Problems via Error-Correcting Output Codes. 1995.
** [https://www.jair.org/media/105/live-105-1426-jair.pdf]
* Allwein et al. Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers. 2000.
** [http://www.jmlr.org/papers/volume1/allwein00a/allwein00a.pdf]
Thinking about it, I'm fine if we start by supporting one-vs-all or something simple which everyone has heard of and will expect to find, and then add better approaches later (after I've had time to refresh myself on that literature!). was (Author: josephkb): Your reference looks newer than the ones I've used before. After a quick glance, it looks like it examines generalizations of methods I've seen. This is the one I've used:
* Dietterich Bakiri. Solving Multiclass Learning Problems via Error-Correcting Output Codes. 1995.
** [https://www.jair.org/media/105/live-105-1426-jair.pdf]
Thinking about it, I'm fine if we start by supporting one-vs-all or something simple which everyone has heard of and will expect to find, and then add better approaches later (after I've had time to refresh myself on that literature!). Multiclass to Binary Reduction -- Key: SPARK-7015 URL: https://issues.apache.org/jira/browse/SPARK-7015 Project: Spark Issue Type: Improvement Components: ML Reporter: Ram Sriharsha Assignee: Ram Sriharsha Original Estimate: 336h Remaining Estimate: 336h With the new Pipeline API, it is possible to seamlessly support machine learning reductions as meta algorithms. GBDT and SVM today are binary classifiers and we can implement multi class classification as a One vs All, or All vs All (or even more sophisticated reduction) using binary classifiers as primitives. This JIRA is to track the creation of a reduction API for multi class classification. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7015) Multiclass to Binary Reduction
[ https://issues.apache.org/jira/browse/SPARK-7015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14504335#comment-14504335 ] Joseph K. Bradley commented on SPARK-7015: -- Your reference looks newer than the ones I've used before. After a quick glance, it looks like it examines generalizations of methods I've seen. This is the one I've used:
* Dietterich Bakiri. Solving Multiclass Learning Problems via Error-Correcting Output Codes. 1995.
** [https://www.jair.org/media/105/live-105-1426-jair.pdf]
Thinking about it, I'm fine if we start by supporting one-vs-all or something simple which everyone has heard of and will expect to find, and then add better approaches later (after I've had time to refresh myself on that literature!). Multiclass to Binary Reduction -- Key: SPARK-7015 URL: https://issues.apache.org/jira/browse/SPARK-7015 Project: Spark Issue Type: Improvement Components: ML Reporter: Ram Sriharsha Assignee: Ram Sriharsha Original Estimate: 336h Remaining Estimate: 336h With the new Pipeline API, it is possible to seamlessly support machine learning reductions as meta algorithms. GBDT and SVM today are binary classifiers and we can implement multi class classification as a One vs All, or All vs All (or even more sophisticated reduction) using binary classifiers as primitives. This JIRA is to track the creation of a reduction API for multi class classification. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
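Since one-vs-all is the simple starting point being discussed, here is a minimal generic sketch of that reduction; {{trainBinary}} is a hypothetical stand-in for any base binary learner that returns a confidence scorer:
{code}
// Train one binary problem per class (class k vs. rest) and predict the class
// whose scorer is most confident on the given features.
def trainOneVsAll[D](numClasses: Int,
                     data: Seq[(Int, D)],
                     trainBinary: Seq[(Int, D)] => (D => Double)): D => Int = {
  val scorers = (0 until numClasses).map { k =>
    trainBinary(data.map { case (label, features) =>
      (if (label == k) 1 else -1, features)
    })
  }
  (features: D) => scorers.map(_(features)).zipWithIndex.maxBy(_._1)._2
}
{code}
All-vs-all and ECOC differ only in how the binary problems are generated and how the votes are combined, which is what makes a single reduction API plausible.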
[jira] [Commented] (SPARK-7025) Create a Java-friendly input source API
[ https://issues.apache.org/jira/browse/SPARK-7025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14504368#comment-14504368 ] Apache Spark commented on SPARK-7025: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/5603 Create a Java-friendly input source API --- Key: SPARK-7025 URL: https://issues.apache.org/jira/browse/SPARK-7025 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Reynold Xin Assignee: Reynold Xin The goal of this ticket is to create a simple input source API that we can maintain and support long term. Spark currently has two de facto input source APIs:
1. RDD API
2. Hadoop MapReduce InputFormat API
Neither of the above is ideal:
1. RDD: It is hard for Java developers to implement RDD, given the implicit class tags. In addition, the RDD API depends on Scala's runtime library, which does not preserve binary compatibility across Scala versions. If a developer chooses Java to implement an input source, it would be great if that input source could remain binary compatible for years to come.
2. Hadoop InputFormat: The Hadoop InputFormat API is overly restrictive. For example, it forces key-value semantics, and does not support running arbitrary code on the driver side (an example of why this is useful is broadcast). In addition, it is somewhat awkward to tell developers that in order to implement an input source for Spark, they should learn the Hadoop MapReduce API first.
So here's the proposal: an InputSource is described by:
* an array of InputPartition that specifies the data partitioning
* a RecordReader that specifies how data on each partition can be read
This interface would be similar to Hadoop's InputFormat, except that there is no explicit key/value separation.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7025) Create a Java-friendly input source API
[ https://issues.apache.org/jira/browse/SPARK-7025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7025: --- Assignee: Reynold Xin (was: Apache Spark) Create a Java-friendly input source API --- Key: SPARK-7025 URL: https://issues.apache.org/jira/browse/SPARK-7025 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Reynold Xin Assignee: Reynold Xin The goal of this ticket is to create a simple input source API that we can maintain and support long term. Spark currently has two de facto input source APIs:
1. RDD API
2. Hadoop MapReduce InputFormat API
Neither of the above is ideal:
1. RDD: It is hard for Java developers to implement RDD, given the implicit class tags. In addition, the RDD API depends on Scala's runtime library, which does not preserve binary compatibility across Scala versions. If a developer chooses Java to implement an input source, it would be great if that input source could remain binary compatible for years to come.
2. Hadoop InputFormat: The Hadoop InputFormat API is overly restrictive. For example, it forces key-value semantics, and does not support running arbitrary code on the driver side (an example of why this is useful is broadcast). In addition, it is somewhat awkward to tell developers that in order to implement an input source for Spark, they should learn the Hadoop MapReduce API first.
So here's the proposal: an InputSource is described by:
* an array of InputPartition that specifies the data partitioning
* a RecordReader that specifies how data on each partition can be read
This interface would be similar to Hadoop's InputFormat, except that there is no explicit key/value separation.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7008) An Implementation of Factorization Machine (LibFM)
[ https://issues.apache.org/jira/browse/SPARK-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14504369#comment-14504369 ] Xiangrui Meng commented on SPARK-7008: -- [~podongfeng] Your implementation assumes that the model can be stored locally, which is not true for big models. [~gq]'s GraphX-based implementation should have better scalability, but is slower on small datasets. We need more time to understand the algorithm and decide whether to include it in MLlib. As Sean suggested, it would be nice if you could submit both packages to spark-packages.org. [~podongfeng] and [~gq], I like the simplicity and the expressiveness of FM. I have a few questions to understand FM better. FM uses SGD on a non-convex objective. What convergence rate have you observed for FM in practice? Is it sensitive to local minima (run FM multiple times and see whether there is large variance in the objective values)? Is it sensitive to the learning rate? An Implementation of Factorization Machine (LibFM) - Key: SPARK-7008 URL: https://issues.apache.org/jira/browse/SPARK-7008 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.3.0, 1.3.1, 1.3.2 Reporter: zhengruifeng Labels: features, patch An implementation of Factorization Machines based on Scala and Spark MLlib. Factorization Machine is a kind of machine learning algorithm for multi-linear regression, and is widely used for recommendation. Factorization Machines have worked well in recent years' recommendation competitions. Ref: http://libfm.org/ http://doi.acm.org/10.1145/2168752.2168771 http://www.inf.uni-konstanz.de/~rendle/pdf/Rendle2010FM.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
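For readers new to the model, the second-order FM from the Rendle (2010) paper cited above predicts, in plain notation:
{code}
y(x) = w0 + sum_i w_i * x_i + sum_{i<j} <v_i, v_j> * x_i * x_j
{code}
The pairwise interaction weights are factorized as inner products of k-dimensional latent vectors v_i, which keeps the parameter count linear in the number of features; that factorization is also the source of the non-convexity behind the convergence questions above.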
[jira] [Assigned] (SPARK-7025) Create a Java-friendly input source API
[ https://issues.apache.org/jira/browse/SPARK-7025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7025: --- Assignee: Apache Spark (was: Reynold Xin) Create a Java-friendly input source API --- Key: SPARK-7025 URL: https://issues.apache.org/jira/browse/SPARK-7025 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Reynold Xin Assignee: Apache Spark The goal of this ticket is to create a simple input source API that we can maintain and support long term. Spark currently has two de facto input source APIs:
1. RDD API
2. Hadoop MapReduce InputFormat API
Neither of the above is ideal:
1. RDD: It is hard for Java developers to implement RDD, given the implicit class tags. In addition, the RDD API depends on Scala's runtime library, which does not preserve binary compatibility across Scala versions. If a developer chooses Java to implement an input source, it would be great if that input source could remain binary compatible for years to come.
2. Hadoop InputFormat: The Hadoop InputFormat API is overly restrictive. For example, it forces key-value semantics, and does not support running arbitrary code on the driver side (an example of why this is useful is broadcast). In addition, it is somewhat awkward to tell developers that in order to implement an input source for Spark, they should learn the Hadoop MapReduce API first.
So here's the proposal: an InputSource is described by:
* an array of InputPartition that specifies the data partitioning
* a RecordReader that specifies how data on each partition can be read
This interface would be similar to Hadoop's InputFormat, except that there is no explicit key/value separation.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7009) Build assembly JAR via ant to avoid zip64 problems
[ https://issues.apache.org/jira/browse/SPARK-7009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14502695#comment-14502695 ] Sean Owen commented on SPARK-7009: -- Let's see if I remember this correctly: Java 7 supports zip64, so there's no problem if building/running with Java 7+ only. Some (early) Java 6 won't read zip64 correctly though. I think the implicit workaround there was to update to a later Java 6, since it doesn't affect most releases. Java 6 has some *different* hacky extension to zip that lets it read/write more than 65K files though, which means that weirdly Java 6-built assemblies might work on old Java 6 after all. I think we only officially support the zip64 version. Implicitly, actually, early Java 6 doesn't necessarily work with Spark. So... does this end up helping this weird situation if Ant is only making zip64 archives? (Nice that this doesn't actually involve adding an Ant script) Build assembly JAR via ant to avoid zip64 problems -- Key: SPARK-7009 URL: https://issues.apache.org/jira/browse/SPARK-7009 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 1.3.0 Environment: Java 7+ Reporter: Steve Loughran Original Estimate: 2h Remaining Estimate: 2h SPARK-1911 shows the problem that JDK7+ is using zip64 to build large JARs; a format incompatible with Java 6 and pyspark. Provided the total number of .class files+resources is under 64K, ant can be used to make the final JAR instead, perhaps by unzipping the maven-generated JAR then rezipping it with zip64=never, before publishing the artifact via maven. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7011) Build fails with scala 2.11 option, because a protected[sql] type is accessed in ml package.
[ https://issues.apache.org/jira/browse/SPARK-7011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Sharma reassigned SPARK-7011: -- Assignee: Prashant Sharma Build fails with scala 2.11 option, because a protected[sql] type is accessed in ml package. Key: SPARK-7011 URL: https://issues.apache.org/jira/browse/SPARK-7011 Project: Spark Issue Type: Bug Reporter: Prashant Sharma Assignee: Prashant Sharma I am not sure why this does not fail while building with Scala 2.10; looks like a Scala bug? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3276) Provide an API to specify MIN_REMEMBER_DURATION for files to consider as input in streaming
[ https://issues.apache.org/jira/browse/SPARK-3276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-3276: - Assignee: Emre Sevinç Provide an API to specify MIN_REMEMBER_DURATION for files to consider as input in streaming -- Key: SPARK-3276 URL: https://issues.apache.org/jira/browse/SPARK-3276 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.2.0 Reporter: Jack Hu Assignee: Emre Sevinç Priority: Minor Currently, there is only one API, textFileStream in StreamingContext, for creating a text file DStream, and it always ignores old files. Sometimes, the old files are still useful. We need an API to let the user choose whether old files should be ignored or not. The API currently in StreamingContext:
{code}
def textFileStream(directory: String): DStream[String] = {
  fileStream[LongWritable, Text, TextInputFormat](directory).map(_._2.toString)
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
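For context, the underlying {{fileStream}} already exposes a {{newFilesOnly}} flag, so old files can be picked up today; what this issue adds is control over how far back "old" reaches. A hedged usage sketch (assumes an active StreamingContext named {{ssc}}; the directory path is illustrative):
{code}
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Like textFileStream, but also processes files that already existed in the
// directory when the stream started (subject to the remember window that this
// issue would make configurable).
val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](
  "hdfs:///incoming", (path: Path) => true, newFilesOnly = false
).map(_._2.toString)
{code}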
[jira] [Commented] (SPARK-7009) Build assembly JAR via ant to avoid zip64 problems
[ https://issues.apache.org/jira/browse/SPARK-7009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14502657#comment-14502657 ] Steve Loughran commented on SPARK-7009: --- It's only 30 lines of diff including the antrun plugin config; trivial compared to the shade plugin itself. As you note though, it's not enough: there are more than 64K .class files. Which means that the "use Java 6 to compile" warning note of SPARK-1911 probably isn't going to work either, unless a Java 6 build includes fewer classes in the shaded jar. Build assembly JAR via ant to avoid zip64 problems -- Key: SPARK-7009 URL: https://issues.apache.org/jira/browse/SPARK-7009 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 1.3.0 Environment: Java 7+ Reporter: Steve Loughran Original Estimate: 2h Remaining Estimate: 2h SPARK-1911 shows the problem that JDK7+ is using zip64 to build large JARs; a format incompatible with Java 6 and pyspark. Provided the total number of .class files+resources is under 64K, ant can be used to make the final JAR instead, perhaps by unzipping the maven-generated JAR then rezipping it with zip64=never, before publishing the artifact via maven. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7011) Build fails with scala 2.11 option, because a protected[sql] type is accessed in ml package.
[ https://issues.apache.org/jira/browse/SPARK-7011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7011: --- Assignee: Apache Spark (was: Prashant Sharma) Build fails with scala 2.11 option, because a protected[sql] type is accessed in ml package. Key: SPARK-7011 URL: https://issues.apache.org/jira/browse/SPARK-7011 Project: Spark Issue Type: Bug Reporter: Prashant Sharma Assignee: Apache Spark I am not sure why this does not fail while building with Scala 2.10; looks like a Scala bug? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3276) Provide an API to specify MIN_REMEMBER_DURATION for files to consider as input in streaming
[ https://issues.apache.org/jira/browse/SPARK-3276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14502702#comment-14502702 ] Emre Sevinç commented on SPARK-3276: Can someone with enough access rights assign this issue to me (currently it is not assigned to anyone)? I've already discussed it with the Spark developers and prepared a pull request on GitHub. Provide an API to specify MIN_REMEMBER_DURATION for files to consider as input in streaming -- Key: SPARK-3276 URL: https://issues.apache.org/jira/browse/SPARK-3276 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.2.0 Reporter: Jack Hu Priority: Minor Currently, there is only one API, textFileStream in StreamingContext, for creating a text file DStream, and it always ignores old files. Sometimes, the old files are still useful. We need an API to let the user choose whether old files should be ignored or not. The API currently in StreamingContext:
{code}
def textFileStream(directory: String): DStream[String] = {
  fileStream[LongWritable, Text, TextInputFormat](directory).map(_._2.toString)
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7011) Build fails with scala 2.11 option, because a protected[sql] type is accessed in ml package.
[ https://issues.apache.org/jira/browse/SPARK-7011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7011: --- Assignee: Prashant Sharma (was: Apache Spark) Build fails with scala 2.11 option, because a protected[sql] type is accessed in ml package. Key: SPARK-7011 URL: https://issues.apache.org/jira/browse/SPARK-7011 Project: Spark Issue Type: Bug Reporter: Prashant Sharma Assignee: Prashant Sharma I am not sure why this does not fail while building with Scala 2.10; looks like a Scala bug? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7011) Build fails with scala 2.11 option, because a protected[sql] type is accessed in ml package.
[ https://issues.apache.org/jira/browse/SPARK-7011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14502703#comment-14502703 ] Apache Spark commented on SPARK-7011: - User 'ScrapCodes' has created a pull request for this issue: https://github.com/apache/spark/pull/5593 Build fails with scala 2.11 option, because a protected[sql] type is accessed in ml package. Key: SPARK-7011 URL: https://issues.apache.org/jira/browse/SPARK-7011 Project: Spark Issue Type: Bug Reporter: Prashant Sharma Assignee: Prashant Sharma I am not sure why this does not fail while building with Scala 2.10; looks like a Scala bug? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7009) Build assembly JAR via ant to avoid zip64 problems
[ https://issues.apache.org/jira/browse/SPARK-7009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14502675#comment-14502675 ] Steve Loughran commented on SPARK-7009: --- Looking at the [openJDK issue|https://bugs.openjdk.java.net/browse/JDK-4828461], Java 6 appears to be generating a header/footer that stops at 64K, and doesn't bother reading that header when enumerating the zip file. Java 7 (presumably) handles reads the same way, but uses zip64 to generate the artifacts. Ant can be told not to generate zip64 files, but it does zip16 properly, rejecting source filesets that are too large. There isn't an obvious/immediate solution for this on Java 7+, except to extend Ant to generate the same hacked zip files, then wait for that to trickle into the maven ant-run plugin, which would be about 3+ months after Ant 1.9.x ships. That's a long-term project, though something to consider starting now, to get the feature later in 2015. Build assembly JAR via ant to avoid zip64 problems -- Key: SPARK-7009 URL: https://issues.apache.org/jira/browse/SPARK-7009 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 1.3.0 Environment: Java 7+ Reporter: Steve Loughran Original Estimate: 2h Remaining Estimate: 2h SPARK-1911 shows the problem that JDK7+ is using zip64 to build large JARs; a format incompatible with Java 6 and pyspark. Provided the total number of .class files+resources is under 64K, ant can be used to make the final JAR instead, perhaps by unzipping the maven-generated JAR then rezipping it with zip64=never, before publishing the artifact via maven. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7007) Add metrics source for ExecutorAllocationManager to expose internal status
Saisai Shao created SPARK-7007: -- Summary: Add metrics source for ExecutorAllocationManager to expose internal status Key: SPARK-7007 URL: https://issues.apache.org/jira/browse/SPARK-7007 Project: Spark Issue Type: Improvement Components: Spark Core, YARN Affects Versions: 1.3.0 Reporter: Saisai Shao Priority: Minor Add a metrics source to expose the internal status of ExecutorAllocationManager, to allow better monitoring of executor allocation when running on YARN. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
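A rough sketch of the shape such a source could take (the merged patch may differ; the gauge names and constructor here are made up), using the Codahale metrics API that Spark's metrics system is built on:
{code}
// Assumes this lives in the org.apache.spark package, since Source is private[spark].
import com.codahale.metrics.{Gauge, MetricRegistry}
import org.apache.spark.metrics.source.Source

// Hypothetical: readers are passed in as functions so the source holds no
// mutable state of its own and simply reflects the manager's internals.
private[spark] class ExecutorAllocationManagerSource(
    numExecutorsPending: () => Int,
    numExistingExecutors: () => Int) extends Source {

  override val sourceName = "ExecutorAllocationManager"
  override val metricRegistry = new MetricRegistry()

  private def gauge[T](name: String, value: () => T): Unit =
    metricRegistry.register(MetricRegistry.name(name), new Gauge[T] {
      override def getValue: T = value()
    })

  gauge("numberExecutorsPending", numExecutorsPending)
  gauge("numberAllExecutors", numExistingExecutors)
}
{code}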
[jira] [Resolved] (SPARK-7010) How can I customize external initialization when starting the Spark cluster
[ https://issues.apache.org/jira/browse/SPARK-7010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-7010. -- Resolution: Invalid (Ask questions at u...@spark.apache.org) How can I customize external initialization when starting the Spark cluster - Key: SPARK-7010 URL: https://issues.apache.org/jira/browse/SPARK-7010 Project: Spark Issue Type: Question Components: SQL Affects Versions: 1.3.0 Reporter: Jacky19820629 How can I configure custom initialization when starting Spark, like caching a table, creating a temporary table, etc.? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7006) Inconsistent behavior for ctrl-c in Spark shells
[ https://issues.apache.org/jira/browse/SPARK-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14502612#comment-14502612 ] Cheolsoo Park commented on SPARK-7006: -- Thanks for asking about Ctrl-D. While a job is running, Ctrl-D doesn't seem to have any effect (i.e. no response), but after the job is finished, it terminates the shell. Actually, FB Presto uses Ctrl-D to exit the shell and Ctrl-C to cancel the running job. A lot of users find this quite convenient. Inconsistent behavior for ctrl-c in Spark shells Key: SPARK-7006 URL: https://issues.apache.org/jira/browse/SPARK-7006 Project: Spark Issue Type: Wish Components: Spark Shell, YARN Affects Versions: 1.3.1 Environment: YARN Reporter: Cheolsoo Park Priority: Minor Labels: shell, yarn When Ctrl-C is pressed in a shell, the behavior is not consistent across spark-sql, spark-shell, and pyspark, resulting in confusion for users. Here is the summary:
||shell||after ctrl-c||
|spark-sql|cancels the running job|
|spark-shell|exits the shell|
|pyspark|throws error \[1\] and doesn't cancel the job|
In particular, pyspark is the worst because it gives the wrong impression that the job is cancelled although it is not. Ideally, every shell should act like {{spark-sql}}, because it allows users to cancel the running job while staying in the shell. (Pressing Ctrl-C twice exits the shell.)
\[1\] pyspark error for Ctrl-C
{code}
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/cheolsoop/spark/jars/spark-1.3.1/python/pyspark/sql/dataframe.py", line 284, in count
    return self._jdf.count()
  File "/home/cheolsoop/spark/jars/spark-1.3.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 536, in __call__
  File "/home/cheolsoop/spark/jars/spark-1.3.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 364, in send_command
  File "/home/cheolsoop/spark/jars/spark-1.3.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 473, in send_command
  File "/usr/lib/python2.7/socket.py", line 430, in readline
    data = recv(1)
KeyboardInterrupt
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
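A sketch of the kind of handler that would give every shell the {{spark-sql}} behavior (first Ctrl-C cancels running jobs, second Ctrl-C exits). It leans on the JVM-internal sun.misc.Signal API and assumes an active SparkContext named {{sc}}, so it is an illustration rather than the actual shell code:
{code}
import sun.misc.{Signal, SignalHandler}

@volatile var interrupted = false
Signal.handle(new Signal("INT"), new SignalHandler {
  override def handle(sig: Signal): Unit = {
    if (interrupted) sys.exit(130)  // second Ctrl-C: leave the shell
    interrupted = true              // a real shell would reset this once jobs finish
    println("Cancelling all running jobs; press Ctrl-C again to exit.")
    sc.cancelAllJobs()              // first Ctrl-C: cancel jobs, keep the shell alive
  }
})
{code}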
[jira] [Comment Edited] (SPARK-7008) Implementation of Factorization Machine (LibFM)
[ https://issues.apache.org/jira/browse/SPARK-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14502616#comment-14502616 ] Guoqiang Li edited comment on SPARK-7008 at 4/20/15 10:34 AM: -- Here's a GraphX-based implementation (WIP): https://github.com/witgo/zen/tree/FactorizationMachine was (Author: gq): Here's a GraphX-based implementation: https://github.com/witgo/zen/tree/FactorizationMachine Implementation of Factorization Machine (LibFM) -- Key: SPARK-7008 URL: https://issues.apache.org/jira/browse/SPARK-7008 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.3.0, 1.3.1, 1.3.2 Reporter: zhengruifeng Labels: features, patch An implementation of Factorization Machines based on Scala and Spark MLlib. Factorization Machine is a kind of machine learning algorithm for multi-linear regression, and is widely used for recommendation. Factorization Machines have worked well in recent years' recommendation competitions. Ref: http://libfm.org/ http://doi.acm.org/10.1145/2168752.2168771 http://www.inf.uni-konstanz.de/~rendle/pdf/Rendle2010FM.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7011) Build fails with scala 2.11 option, because a protected[sql] type is accessed in ml package.
Prashant Sharma created SPARK-7011: -- Summary: Build fails with scala 2.11 option, because a protected[sql] type is accessed in ml package. Key: SPARK-7011 URL: https://issues.apache.org/jira/browse/SPARK-7011 Project: Spark Issue Type: Bug Reporter: Prashant Sharma I am not sure why this does not fail when building with Scala 2.10; it looks like a Scala bug? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-7005) resetProb error in pagerank
[ https://issues.apache.org/jira/browse/SPARK-7005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lisendong updated SPARK-7005: - Comment: was deleted (was: oh...you are right... I'm so sorry, the result is exactly being scaled by N... ) resetProb error in pagerank --- Key: SPARK-7005 URL: https://issues.apache.org/jira/browse/SPARK-7005 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.3.0 Reporter: lisendong Labels: easyfix Original Estimate: 24h Remaining Estimate: 24h In the PageRank code, resetProb should be divided by the number of vertices N, according to Wikipedia: http://en.wikipedia.org/wiki/PageRank that is: PR[i] = alpha / N + (1 - alpha) * inNbrs[i].map(j => oldPR[j] / outDeg[j]).sum but the code (org.apache.spark.graphx.lib.PageRank) computes PR[i] = alpha + (1 - alpha) * inNbrs[i].map(j => oldPR[j] / outDeg[j]).sum -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
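As the deleted comment concedes, the two recurrences differ only by a constant factor: the ranks produced by the GraphX code are the Wikipedia ranks scaled by the vertex count N. A quick check, writing d_j for the out-degree of vertex j:
{code}
Wikipedia:  P_i = \alpha / N + (1 - \alpha) \sum_{j \in \mathrm{in}(i)} P_j / d_j
GraphX:     Q_i = \alpha + (1 - \alpha) \sum_{j \in \mathrm{in}(i)} Q_j / d_j
{code}
Substituting Q_i = N P_i into the GraphX recurrence and dividing both sides by N recovers the Wikipedia recurrence, so the two fixed points agree up to the factor N and the relative ordering of vertices is identical.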
[jira] [Commented] (SPARK-7007) Add metrics source for ExecutorAllocationManager to expose internal status
[ https://issues.apache.org/jira/browse/SPARK-7007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14502542#comment-14502542 ] Apache Spark commented on SPARK-7007: - User 'jerryshao' has created a pull request for this issue: https://github.com/apache/spark/pull/5589 Add metrics source for ExecutorAllocationManager to expose internal status -- Key: SPARK-7007 URL: https://issues.apache.org/jira/browse/SPARK-7007 Project: Spark Issue Type: Improvement Components: Spark Core, YARN Affects Versions: 1.3.0 Reporter: Saisai Shao Priority: Minor Add a metrics source to expose the internal status of ExecutorAllocationManager, to better monitor executor allocation when running on YARN. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7007) Add metrics source for ExecutorAllocationManager to expose internal status
[ https://issues.apache.org/jira/browse/SPARK-7007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7007: --- Assignee: (was: Apache Spark) Add metrics source for ExecutorAllocationManager to expose internal status -- Key: SPARK-7007 URL: https://issues.apache.org/jira/browse/SPARK-7007 Project: Spark Issue Type: Improvement Components: Spark Core, YARN Affects Versions: 1.3.0 Reporter: Saisai Shao Priority: Minor Add a metrics source to expose the internal status of ExecutorAllocationManager, to better monitor executor allocation when running on YARN. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1911) Warn users if their assembly jars are not built with Java 6
[ https://issues.apache.org/jira/browse/SPARK-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14502589#comment-14502589 ] Steve Loughran commented on SPARK-1911: --- This doesn't fix the problem, merely documents it. It should be doable by using Ant's zip task, which doesn't use the JDK zip routines. The assembly would be unzipped first, then re-zipped with the zip64Mode option set to never; see [https://ant.apache.org/manual/Tasks/zip.html] Warn users if their assembly jars are not built with Java 6 --- Key: SPARK-1911 URL: https://issues.apache.org/jira/browse/SPARK-1911 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.1.0 Reporter: Andrew Or Assignee: Sean Owen Fix For: 1.2.2, 1.3.0 The root cause of the problem is detailed in: https://issues.apache.org/jira/browse/SPARK-1520. In short, an assembly jar built with Java 7+ is not always accessible by Python or other versions of Java (especially Java 6). If the assembly jar is not built on the cluster itself, this problem may manifest itself in strange exceptions that are not trivial to debug. This is an issue especially for PySpark on YARN, which relies on the python files included within the assembly jar. Currently we warn users only in make-distribution.sh, but most users build the jars directly. At the very least we need to emphasize this in the docs (currently missing entirely). The next step is to add a warning prompt in the mvn scripts whenever Java 7+ is detected. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
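To make the suggestion concrete, a hypothetical Ant fragment for that repack step might look like the following (target and file names are made up for illustration; zip64Mode="never" is the documented attribute that forces Ant to emit a plain zip, which Java 6 can read as long as the archive stays under the 65535-entry limit):
{code}
<!-- Hypothetical repack step, not part of the actual Spark build. -->
<target name="repack-assembly">
  <!-- Unpack the Java 7-built assembly first (paths are illustrative). -->
  <unzip src="spark-assembly.jar" dest="build/assembly-exploded"/>
  <!-- Re-zip with Ant's own zip implementation, never emitting zip64 records. -->
  <zip destfile="spark-assembly-java6.jar"
       basedir="build/assembly-exploded"
       zip64Mode="never"/>
</target>
{code}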
[jira] [Updated] (SPARK-7008) An Implement of Factorization Machine (LibFM)
[ https://issues.apache.org/jira/browse/SPARK-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-7008: Description: An implement of Factorization Machines based on Scala and Spark MLlib. Factorization Machine is a kind of machine learning algorithm for multi-linear regression, and is widely used for recommendation. Factorization Machines works well in recent years' recommendation competitions. Ref: http://libfm.org/ http://doi.acm.org/10.1145/2168752.2168771 http://www.inf.uni-konstanz.de/~rendle/pdf/Rendle2010FM.pdf was: An implementation of Factorization Machines based on Scala and Spark MLlib. Factorization Machine is a kind of machine learning algorithm for multi-linear regression, and is widely used for recommendation. Factorization Machines works well in recent years' recommendation competitions. Ref: http://libfm.org/ http://doi.acm.org/10.1145/2168752.2168771 http://www.inf.uni-konstanz.de/~rendle/pdf/Rendle2010FM.pdf Summary: An Implement of Factorization Machine (LibFM) (was: Implement of Factorization Machine (LibFM)) An Implement of Factorization Machine (LibFM) - Key: SPARK-7008 URL: https://issues.apache.org/jira/browse/SPARK-7008 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.3.0, 1.3.1, 1.3.2 Reporter: zhengruifeng Labels: features, patch An implement of Factorization Machines based on Scala and Spark MLlib. Factorization Machine is a kind of machine learning algorithm for multi-linear regression, and is widely used for recommendation. Factorization Machines works well in recent years' recommendation competitions. Ref: http://libfm.org/ http://doi.acm.org/10.1145/2168752.2168771 http://www.inf.uni-konstanz.de/~rendle/pdf/Rendle2010FM.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7008) Implement of Factorization Machine (LibFM)
[ https://issues.apache.org/jira/browse/SPARK-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7008: --- Assignee: Apache Spark Implement of Factorization Machine (LibFM) -- Key: SPARK-7008 URL: https://issues.apache.org/jira/browse/SPARK-7008 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.3.0, 1.3.1, 1.3.2 Reporter: zhengruifeng Assignee: Apache Spark Labels: features, patch An implementation of Factorization Machines based on Scala and Spark MLlib. Factorization Machine is a kind of machine learning algorithm for multi-linear regression, and is widely used for recommendation. Factorization Machines works well in recent years' recommendation competitions. Ref: http://libfm.org/ http://doi.acm.org/10.1145/2168752.2168771 http://www.inf.uni-konstanz.de/~rendle/pdf/Rendle2010FM.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7008) Implement of Factorization Machine (LibFM)
[ https://issues.apache.org/jira/browse/SPARK-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14502644#comment-14502644 ] Apache Spark commented on SPARK-7008: - User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/5591 Implement of Factorization Machine (LibFM) -- Key: SPARK-7008 URL: https://issues.apache.org/jira/browse/SPARK-7008 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.3.0, 1.3.1, 1.3.2 Reporter: zhengruifeng Labels: features, patch An implementation of Factorization Machines based on Scala and Spark MLlib. Factorization Machine is a kind of machine learning algorithm for multi-linear regression, and is widely used for recommendation. Factorization Machines works well in recent years' recommendation competitions. Ref: http://libfm.org/ http://doi.acm.org/10.1145/2168752.2168771 http://www.inf.uni-konstanz.de/~rendle/pdf/Rendle2010FM.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org