[jira] [Updated] (SPARK-10329) Cost RDD in k-means|| initialization is not storage-efficient
[ https://issues.apache.org/jira/browse/SPARK-10329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-10329: -- Assignee: hujiayin Cost RDD in k-means|| initialization is not storage-efficient - Key: SPARK-10329 URL: https://issues.apache.org/jira/browse/SPARK-10329 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.1, 1.4.1, 1.5.0 Reporter: Xiangrui Meng Assignee: hujiayin Labels: clustering Currently we use `RDD[Vector]` to store point cost during k-means|| initialization, where each `Vector` has size `runs`. This is not storage-efficient because `runs` is usually 1 and then each record is a Vector of size 1. What we need is just the 8 bytes to store the cost, but we introduce two objects (DenseVector and its values array), which could cost 16 bytes. That is 200% overhead. Thanks [~Grace Huang] and Jiayin Hu from Intel for reporting this issue! There are several solutions: 1. Use `RDD[Array[Double]]` instead of `RDD[Vector]`, which saves 8 bytes per record. 2. Use `RDD[Array[Double]]` but batch the values for storage, e.g. each `Array[Double]` object covers 1024 instances, which could remove most of the overhead. Besides, using MEMORY_AND_DISK instead of MEMORY_ONLY could prevent cost RDDs kicking out the training dataset from memory. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
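A minimal sketch of the second proposal above, using a hypothetical helper name (`batchedCosts`) rather than the actual MLlib code path: per-point costs are packed into fixed-size Array[Double] chunks so there is no per-record object pair, and the result is persisted with MEMORY_AND_DISK so it cannot evict the training data.
{code}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Hypothetical helper, not MLlib's implementation: pack per-point costs into
// fixed-size arrays and persist them to memory-and-disk.
def batchedCosts(costs: RDD[Double], batchSize: Int = 1024): RDD[Array[Double]] = {
  costs
    .mapPartitions(_.grouped(batchSize).map(_.toArray))
    .persist(StorageLevel.MEMORY_AND_DISK)
}
{code}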
[jira] [Assigned] (SPARK-10290) Spark can register temp table and hive table with the same table name
[ https://issues.apache.org/jira/browse/SPARK-10290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10290: Assignee: Apache Spark Spark can register temp table and hive table with the same table name - Key: SPARK-10290 URL: https://issues.apache.org/jira/browse/SPARK-10290 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.1 Reporter: pin_zhang Assignee: Apache Spark Spark SQL allows creating a Hive table and registering a temp table with the same name; there is then no way to run a query against the Hive table. Reproduced with the following code: // register hive table DataFrame df = hctx_.read().json("test.json"); df.write().mode(SaveMode.Overwrite).saveAsTable("test"); // register temp table hctx_.registerDataFrameAsTable(hctx_.sql("select id from test"), "test"); -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10290) Spark can register temp table and hive table with the same table name
[ https://issues.apache.org/jira/browse/SPARK-10290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14721493#comment-14721493 ] Apache Spark commented on SPARK-10290: -- User 'mzorro' has created a pull request for this issue: https://github.com/apache/spark/pull/8529 Spark can register temp table and hive table with the same table name - Key: SPARK-10290 URL: https://issues.apache.org/jira/browse/SPARK-10290 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.1 Reporter: pin_zhang Spark SQL allows creating a Hive table and registering a temp table with the same name; there is then no way to run a query against the Hive table. Reproduced with the following code: // register hive table DataFrame df = hctx_.read().json("test.json"); df.write().mode(SaveMode.Overwrite).saveAsTable("test"); // register temp table hctx_.registerDataFrameAsTable(hctx_.sql("select id from test"), "test"); -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10290) Spark can register temp table and hive table with the same table name
[ https://issues.apache.org/jira/browse/SPARK-10290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10290: Assignee: (was: Apache Spark) Spark can register temp table and hive table with the same table name - Key: SPARK-10290 URL: https://issues.apache.org/jira/browse/SPARK-10290 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.1 Reporter: pin_zhang Spark SQL allows creating a Hive table and registering a temp table with the same name; there is then no way to run a query against the Hive table. Reproduced with the following code: // register hive table DataFrame df = hctx_.read().json("test.json"); df.write().mode(SaveMode.Overwrite).saveAsTable("test"); // register temp table hctx_.registerDataFrameAsTable(hctx_.sql("select id from test"), "test"); -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-10329) Cost RDD in k-means|| initialization is not storage-efficient
[ https://issues.apache.org/jira/browse/SPARK-10329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14721419#comment-14721419 ] Xiangrui Meng edited comment on SPARK-10329 at 8/30/15 6:53 AM: Assigned. I will send a small PR to fix apparent issues (SPARK-10354), hopefully in 1.5. was (Author: mengxr): Assigned. I will send a small PR to fix apparent issues (SPARK-10354). Cost RDD in k-means|| initialization is not storage-efficient - Key: SPARK-10329 URL: https://issues.apache.org/jira/browse/SPARK-10329 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.1, 1.4.1, 1.5.0 Reporter: Xiangrui Meng Assignee: hujiayin Labels: clustering Currently we use `RDD[Vector]` to store point cost during k-means|| initialization, where each `Vector` has size `runs`. This is not storage-efficient because `runs` is usually 1 and then each record is a Vector of size 1. What we need is just the 8 bytes to store the cost, but we introduce two objects (DenseVector and its values array), which could cost 16 bytes. That is 200% overhead. Thanks [~Grace Huang] and Jiayin Hu from Intel for reporting this issue! There are several solutions: 1. Use `RDD[Array[Double]]` instead of `RDD[Vector]`, which saves 8 bytes per record. 2. Use `RDD[Array[Double]]` but batch the values for storage, e.g. each `Array[Double]` object covers 1024 instances, which could remove most of the overhead. Besides, using MEMORY_AND_DISK instead of MEMORY_ONLY could prevent cost RDDs kicking out the training dataset from memory. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10184) Optimization for bounds determination in RangePartitioner
[ https://issues.apache.org/jira/browse/SPARK-10184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-10184: -- Assignee: Jigao Fu Optimization for bounds determination in RangePartitioner - Key: SPARK-10184 URL: https://issues.apache.org/jira/browse/SPARK-10184 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Jigao Fu Assignee: Jigao Fu Priority: Minor Fix For: 1.6.0 Original Estimate: 10m Remaining Estimate: 10m Change {{cumWeight > target}} to {{cumWeight >= target}} in the {{RangePartitioner.determineBounds}} method to make the output partitions more balanced. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
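For context, a simplified sketch of the weight-accumulation idea behind {{determineBounds}} (this is not the actual Spark implementation, which also handles duplicate candidate keys): a candidate becomes a partition bound once the accumulated weight reaches the per-partition target, which is where the {{>=}} comparison matters.
{code}
import scala.collection.mutable.ArrayBuffer

// Simplified sketch only; the real RangePartitioner.determineBounds differs.
def determineBoundsSketch[K: Ordering](
    candidates: Seq[(K, Float)],   // (sampled key, weight) pairs
    partitions: Int): Seq[K] = {
  val sorted = candidates.sortBy(_._1)
  val step = sorted.map(_._2.toDouble).sum / partitions  // target weight per partition
  var cumWeight = 0.0
  var target = step
  val bounds = ArrayBuffer.empty[K]
  for ((key, weight) <- sorted if bounds.length < partitions - 1) {
    cumWeight += weight
    if (cumWeight >= target) {   // the comparison changed by this ticket
      bounds += key
      target += step
    }
  }
  bounds
}
{code}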
[jira] [Commented] (SPARK-10199) Avoid using reflections for parquet model save
[ https://issues.apache.org/jira/browse/SPARK-10199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14721491#comment-14721491 ] Vinod KC commented on SPARK-10199: -- [~fliang], as you suggested, 1) I've made micro-benchmarks by surrounding createDataFrame in model save methods and Loader.checkSchema in load methods with the following timing code: def time[R](block: => R): R = { val t0 = System.nanoTime(); val result = block; val t1 = System.nanoTime(); println("Elapsed time: " + (t1 - t0) + " ns"); result } 2) Then I ran the MLlib test suites on the code before and after the change. Please see the measurements and performance gain % in this Google Docs spreadsheet: https://docs.google.com/spreadsheets/d/1TPUctB62xAHx0IaJttyx98MjRo4zVmO4neTkdi7uVDs/edit?usp=sharing There is a good performance improvement without reflection. Avoid using reflections for parquet model save -- Key: SPARK-10199 URL: https://issues.apache.org/jira/browse/SPARK-10199 Project: Spark Issue Type: Improvement Components: ML, MLlib Reporter: Feynman Liang Priority: Minor These items are not high priority since the overhead of writing to Parquet is much greater than that of runtime reflection. Multiple model save/load implementations in MLlib use case classes to infer a schema for the data frame saved to Parquet. However, inferring a schema from case classes or tuples uses [runtime reflection|https://github.com/apache/spark/blob/d7b4c095271c36fcc7f9ded267ecf5ec66fac803/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L361] which is unnecessary since the types are already known at the time `save` is called. It would be better to just specify the schema for the data frame directly using {{sqlContext.createDataFrame(dataRDD, schema)}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
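A minimal sketch of the approach the ticket asks for, with hypothetical field names (`weights`, `intercept`) standing in for a real model's columns: the schema is spelled out once and passed to {{createDataFrame}}, so no reflection over case classes is needed at save time.
{code}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Hypothetical model columns; a real save() would use its own fields.
val schema = StructType(Seq(
  StructField("weights", ArrayType(DoubleType, containsNull = false), nullable = false),
  StructField("intercept", DoubleType, nullable = false)))

// dataRDD: RDD[Row] whose rows match `schema`; sqlContext assumed in scope.
// val df = sqlContext.createDataFrame(dataRDD, schema)
// df.write.parquet(path)
{code}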
[jira] [Created] (SPARK-10356) MLlib: Normalization should use absolute values
Carsten Schnober created SPARK-10356: Summary: MLlib: Normalization should use absolute values Key: SPARK-10356 URL: https://issues.apache.org/jira/browse/SPARK-10356 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.4.1 Reporter: Carsten Schnober The normalizer does not handle vectors with negative values properly. It can be tested with the following code: {{ val normalized = new Normalizer(1.0).transform(v: Vector); normalized.toArray.sum == 1.0 }} This yields true if all values in Vector v are positive, but false when v contains one or more negative values. This is because the values in v are taken immediately without applying {{abs()}}. This (probably) does not occur for {{p=2.0}} because the values are squared and hence positive anyway. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10356) MLlib: Normalization should use absolute values
[ https://issues.apache.org/jira/browse/SPARK-10356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Carsten Schnober updated SPARK-10356: - Description: The normalizer does not handle vectors with negative values properly. It can be tested with the following code {code} val normalized = new Normalizer(1.0).transform(v: Vector) normalizer.toArray.sum == 1.0 {code} This yields true if all values in Vector v are positive, but false when v contains one or more negative values. This is because the values in v are taken immediately without applying {{abs()}}, This (probably) does not occur for {{p=2.0}} because the values are squared and hence positive anyway. was: The normalizer does not handle vectors with negative values properly. It can be tested with the following code {{val normalized = new Normalizer(1.0).transform(v: Vector) normalizer.toArray.sum == 1.0}} This yields true if all values in Vector v are positive, but false when v contains one or more negative values. This is because the values in v are taken immediately without applying {{abs()}}, This (probably) does not occur for {{p=2.0}} because the values are squared and hence positive anyway. MLlib: Normalization should use absolute values --- Key: SPARK-10356 URL: https://issues.apache.org/jira/browse/SPARK-10356 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.4.1 Reporter: Carsten Schnober Labels: easyfix Original Estimate: 2h Remaining Estimate: 2h The normalizer does not handle vectors with negative values properly. It can be tested with the following code {code} val normalized = new Normalizer(1.0).transform(v: Vector) normalizer.toArray.sum == 1.0 {code} This yields true if all values in Vector v are positive, but false when v contains one or more negative values. This is because the values in v are taken immediately without applying {{abs()}}, This (probably) does not occur for {{p=2.0}} because the values are squared and hence positive anyway. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10356) MLlib: Normalization should use absolute values
[ https://issues.apache.org/jira/browse/SPARK-10356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Carsten Schnober updated SPARK-10356: - Description: The normalizer does not handle vectors with negative values properly. It can be tested with the following code {{val normalized = new Normalizer(1.0).transform(v: Vector) normalizer.toArray.sum == 1.0}} This yields true if all values in Vector v are positive, but false when v contains one or more negative values. This is because the values in v are taken immediately without applying {{abs()}}, This (probably) does not occur for {{p=2.0}} because the values are squared and hence positive anyway. was: The normalizer does not handle vectors with negative values properly. It can be tested with the following code {{ val normalized = new Normalizer(1.0).transform(v: Vector) normalizer.toArray.sum == 1.0 }} This yields true if all values in Vector v are positive, but false when v contains one or more negative values. This is because the values in v are taken immediately without applying {{abs()}}, This (probably) does not occur for {{p=2.0}} because the values are squared and hence positive anyway. MLlib: Normalization should use absolute values --- Key: SPARK-10356 URL: https://issues.apache.org/jira/browse/SPARK-10356 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.4.1 Reporter: Carsten Schnober Labels: easyfix Original Estimate: 2h Remaining Estimate: 2h The normalizer does not handle vectors with negative values properly. It can be tested with the following code {{val normalized = new Normalizer(1.0).transform(v: Vector) normalizer.toArray.sum == 1.0}} This yields true if all values in Vector v are positive, but false when v contains one or more negative values. This is because the values in v are taken immediately without applying {{abs()}}, This (probably) does not occur for {{p=2.0}} because the values are squared and hence positive anyway. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10184) Optimization for bounds determination in RangePartitioner
[ https://issues.apache.org/jira/browse/SPARK-10184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-10184. --- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 8397 [https://github.com/apache/spark/pull/8397] Optimization for bounds determination in RangePartitioner - Key: SPARK-10184 URL: https://issues.apache.org/jira/browse/SPARK-10184 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Jigao Fu Priority: Minor Fix For: 1.6.0 Original Estimate: 10m Remaining Estimate: 10m Change {{cumWeight > target}} to {{cumWeight >= target}} in the {{RangePartitioner.determineBounds}} method to make the output partitions more balanced. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10226) Error occured in SparkSQL when using !=
[ https://issues.apache.org/jira/browse/SPARK-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-10226: -- Assignee: wangwei Error occured in SparkSQL when using != Key: SPARK-10226 URL: https://issues.apache.org/jira/browse/SPARK-10226 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: wangwei Assignee: wangwei Fix For: 1.5.0 DataSource: src/main/resources/kv1.txt SQL: 1. create table src(id string, name string); 2. load data local inpath '${SparkHome}/examples/src/main/resources/kv1.txt' into table src; 3. select count( * ) from src where id != '0'; [ERROR] Could not expand event java.lang.IllegalArgumentException: != 0;: event not found at jline.console.ConsoleReader.expandEvents(ConsoleReader.java:779) at jline.console.ConsoleReader.finishBuffer(ConsoleReader.java:631) at jline.console.ConsoleReader.accept(ConsoleReader.java:2019) at jline.console.ConsoleReader.readLine(ConsoleReader.java:2666) at jline.console.ConsoleReader.readLine(ConsoleReader.java:2269) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:231) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:601) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:666) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:178) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:203) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:118) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
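A note on the trace above: the exception comes from jline's history ("event") expansion of the {{!}} character in the interactive CLI, not from the SQL parser. As a hedged illustration (not the eventual fix), the same predicate runs fine when it never passes through the console reader, and the standard {{<>}} operator avoids the character in the CLI:
{code}
// Outside the interactive console reader, `!=` parses fine ("sc" assumed in scope).
val hiveCtx = new org.apache.spark.sql.hive.HiveContext(sc)
hiveCtx.sql("select count(*) from src where id != '0'").show()
// Inside the CLI, `select count(*) from src where id <> '0';` sidesteps the `!`.
{code}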
[jira] [Updated] (SPARK-10350) Fix SQL Programming Guide
[ https://issues.apache.org/jira/browse/SPARK-10350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-10350: -- Assignee: Guoqiang Li Fix SQL Programming Guide - Key: SPARK-10350 URL: https://issues.apache.org/jira/browse/SPARK-10350 Project: Spark Issue Type: Bug Components: Documentation, SQL Affects Versions: 1.5.0 Reporter: Guoqiang Li Assignee: Guoqiang Li Priority: Minor Fix For: 1.5.0 [b93d99a|https://github.com/apache/spark/commit/b93d99ae21b8b3af1dd55775f77e5a9ddea48f95#diff-d8aa7a37d17a1227cba38c99f9f22511R1383] contains duplicate content: {{spark.sql.parquet.mergeSchema}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10355) Add Python API for SQLTransformer
Yanbo Liang created SPARK-10355: --- Summary: Add Python API for SQLTransformer Key: SPARK-10355 URL: https://issues.apache.org/jira/browse/SPARK-10355 Project: Spark Issue Type: Improvement Components: ML, PySpark Reporter: Yanbo Liang Priority: Minor Add Python API for SQLTransformer -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
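For reference, a short sketch of the existing Scala {{SQLTransformer}} usage that the proposed Python wrapper would mirror; the column names {{v1}}, {{v2}}, {{v3}} are made up for illustration.
{code}
import org.apache.spark.ml.feature.SQLTransformer

val sqlTrans = new SQLTransformer()
  .setStatement("SELECT *, (v1 + v2) AS v3 FROM __THIS__")  // __THIS__ stands for the input DataFrame
// sqlTrans.transform(df) returns df with the derived column v3 appended.
{code}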
[jira] [Updated] (SPARK-10301) For struct type, if parquet's global schema has less fields than a file's schema, data reading will fail
[ https://issues.apache.org/jira/browse/SPARK-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-10301: --- Description: We hit this issue when reading a complex Parquet dateset without turning on schema merging. The data set consists of Parquet files with different but compatible schemas. In this way, the schema of the dataset is defined by either a summary file or a random physical Parquet file if no summary files are available. Apparently, this schema may not containing all fields appeared in all physicla files. Parquet was designed with schema evolution and column pruning in mind, so it should be legal for a user to use a tailored schema to read the dataset to save disk IO. For example, say we have a Parquet dataset consisting of two physical Parquet files with the following two schemas: {noformat} message m0 { optional group f0 { optional int64 f00; optional int64 f01; } } message m1 { optional group f0 { optional int64 f01; optional int64 f01; optional int64 f02; } optional double f1; } {noformat} Users should be allowed to read the dataset with the following schema: {noformat} message m1 { optional group f0 { optional int64 f01; optional int64 f02; } } {noformat} so that {{f0.f00}} and {{f1}} are never touched. The above case can be expressed by the following {{spark-shell}} snippet: {noformat} import sqlContext._ import sqlContext.implicits._ import org.apache.spark.sql.types.{LongType, StructType} val path = /tmp/spark/parquet range(3).selectExpr(NAMED_STRUCT('f00', id, 'f01', id) AS f0).coalesce(1) .write.mode(overwrite).parquet(path) range(3).selectExpr(NAMED_STRUCT('f00', id, 'f01', id, 'f02', id) AS f0, CAST(id AS DOUBLE) AS f1).coalesce(1) .write.mode(append).parquet(path) val tailoredSchema = new StructType() .add( f0, new StructType() .add(f01, LongType, nullable = true) .add(f02, LongType, nullable = true), nullable = true) read.schema(tailoredSchema).parquet(path).show() {noformat} Expected output should be: {noformat} ++ | f0| ++ |[0,null]| |[1,null]| |[2,null]| | [0,0]| | [1,1]| | [2,2]| ++ {noformat} However, current 1.5-SNAPSHOT version throws the following exception: {noformat} org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file hdfs://localhost:9000/tmp/spark/parquet/part-r-0-56c4604e-c546-4f97-a316-05da8ab1a0bf.gz.parquet at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228) at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201) at org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:168) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at 
scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215) at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1844) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1844) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.ArrayIndexOutOfBoundsException: 2 at org.apache.spark.sql.execution.datasources.parquet.CatalystRowConverter.getConverter(CatalystRowConverter.scala:206) at
[jira] [Updated] (SPARK-10023) Unified DecisionTreeParams checkpointInterval between Scala and Python API.
[ https://issues.apache.org/jira/browse/SPARK-10023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-10023: Description: checkpointInterval is member of DecisionTreeParams in Scala API which is inconsistency with Python API, we should unified them. * checkpointInterval ** member of DecisionTreeParams - Scala API ** shared param used for all ML Transformer/Estimator - Python API Proposal: checkpointInterval is also used at ALS but the meaning for that is different from here. So we make checkpointInterval member of DecisionTreeParams for Python API. Because it only validate when cacheNodeIds is true and the checkpoint directory is set in the SparkContext, it not a common shared param. was: checkpointInterval is member of DecisionTreeParams in Scala API which is inconsistency with Python API, we should unified them. * checkpointInterval ** member of DecisionTreeParams - Scala API ** shared param used for all ML Transformer/Estimator - Python API Proposal: checkpointInterval also used at ALS but the meaning for that is different from here. So we make checkpointInterval member of DecisionTreeParams for Python API, it only validate when cacheNodeIds is true and the checkpoint directory is set in the SparkContext. Unified DecisionTreeParams checkpointInterval between Scala and Python API. - Key: SPARK-10023 URL: https://issues.apache.org/jira/browse/SPARK-10023 Project: Spark Issue Type: Sub-task Components: ML, PySpark Reporter: Yanbo Liang checkpointInterval is member of DecisionTreeParams in Scala API which is inconsistency with Python API, we should unified them. * checkpointInterval ** member of DecisionTreeParams - Scala API ** shared param used for all ML Transformer/Estimator - Python API Proposal: checkpointInterval is also used at ALS but the meaning for that is different from here. So we make checkpointInterval member of DecisionTreeParams for Python API. Because it only validate when cacheNodeIds is true and the checkpoint directory is set in the SparkContext, it not a common shared param. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10023) Unified DecisionTreeParams checkpointInterval between Scala and Python API.
[ https://issues.apache.org/jira/browse/SPARK-10023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14721477#comment-14721477 ] Apache Spark commented on SPARK-10023: -- User 'yanboliang' has created a pull request for this issue: https://github.com/apache/spark/pull/8528 Unified DecisionTreeParams checkpointInterval between Scala and Python API. - Key: SPARK-10023 URL: https://issues.apache.org/jira/browse/SPARK-10023 Project: Spark Issue Type: Sub-task Components: ML, PySpark Reporter: Yanbo Liang checkpointInterval is member of DecisionTreeParams in Scala API which is inconsistency with Python API, we should unified them. * checkpointInterval ** member of DecisionTreeParams - Scala API ** shared param used for all ML Transformer/Estimator - Python API Proposal: checkpointInterval also used at ALS but the meaning for that is different from here. So we make checkpointInterval member of DecisionTreeParams for Python API, it only validate when cacheNodeIds is true and the checkpoint directory is set in the SparkContext. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10023) Unified DecisionTreeParams checkpointInterval between Scala and Python API.
[ https://issues.apache.org/jira/browse/SPARK-10023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10023: Assignee: (was: Apache Spark) Unified DecisionTreeParams checkpointInterval between Scala and Python API. - Key: SPARK-10023 URL: https://issues.apache.org/jira/browse/SPARK-10023 Project: Spark Issue Type: Sub-task Components: ML, PySpark Reporter: Yanbo Liang checkpointInterval is member of DecisionTreeParams in Scala API which is inconsistency with Python API, we should unified them. * checkpointInterval ** member of DecisionTreeParams - Scala API ** shared param used for all ML Transformer/Estimator - Python API Proposal: checkpointInterval also used at ALS but the meaning for that is different from here. So we make checkpointInterval member of DecisionTreeParams for Python API, it only validate when cacheNodeIds is true and the checkpoint directory is set in the SparkContext. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9926) Parallelize file listing for partitioned Hive table
[ https://issues.apache.org/jira/browse/SPARK-9926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14721483#comment-14721483 ] Apache Spark commented on SPARK-9926: - User 'piaozhexiu' has created a pull request for this issue: https://github.com/apache/spark/pull/8512 Parallelize file listing for partitioned Hive table --- Key: SPARK-9926 URL: https://issues.apache.org/jira/browse/SPARK-9926 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.4.1, 1.5.0 Reporter: Cheolsoo Park Assignee: Cheolsoo Park In Spark SQL, short queries like {{select * from table limit 10}} run very slowly against partitioned Hive tables because of file listing. In particular, if a large number of partitions are scanned on storage like S3, the queries run extremely slowly. Here are some example benchmarks in my environment- * Parquet-backed Hive table * Partitioned by dateint and hour * Stored on S3 ||\# of partitions||\# of files||runtime||query|| |1|972|30 secs|select * from nccp_log where dateint=20150601 and hour=0 limit 10;| |24|13646|6 mins|select * from nccp_log where dateint=20150601 limit 10;| |240|136222|1 hour|select * from nccp_log where dateint=20150601 and dateint=20150610 limit 10;| The problem is that {{TableReader}} constructs a separate HadoopRDD per Hive partition path and group them into a UnionRDD. Then, all the input files are listed sequentially. In other tools such as Hive and Pig, this can be solved by setting [mapreduce.input.fileinputformat.list-status.num-threads|https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml] high. But in Spark, since each HadoopRDD lists only one partition path, setting this property doesn't help. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
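A minimal sketch of the idea (not the actual Spark change): list the files under many partition paths in parallel from the driver, instead of issuing one sequential {{listStatus}} call per Hive partition.
{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, Path}

// Sketch only: parallel directory listing over the partition paths.
def listPartitionFiles(partitionPaths: Seq[String], conf: Configuration): Seq[FileStatus] = {
  partitionPaths.par.flatMap { dir =>
    val path = new Path(dir)
    path.getFileSystem(conf).listStatus(path).toSeq
  }.seq
}
{code}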
[jira] [Comment Edited] (SPARK-10356) MLlib: Normalization should use absolute values
[ https://issues.apache.org/jira/browse/SPARK-10356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14721502#comment-14721502 ] Carsten Schnober edited comment on SPARK-10356 at 8/30/15 12:00 PM: According to https://en.wikipedia.org/wiki/Norm_%28mathematics%29#p-norm, each value's absolute value should be used to compute the norm: {code} ||x||_p := (sum(|x|^p)^1/p {code} For p = 1, this results in: {code} ||x||_1 := sum(|x|) {code} I suppose the issue is thus actually located in the norm() method. was (Author: carschno): According to [[Wikipedia][https://en.wikipedia.org/wiki/Norm_%28mathematics%29#p-norm], each value's absolute value should be used to compute the norm: {||x||_p := (sum(|x|^p)^1/p} For p = 1, this results in: {||x||_1 := sum(|x|)} I suppose the issue is thus actually located in the {norm()} method. MLlib: Normalization should use absolute values --- Key: SPARK-10356 URL: https://issues.apache.org/jira/browse/SPARK-10356 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.4.1 Reporter: Carsten Schnober Labels: easyfix Original Estimate: 2h Remaining Estimate: 2h The normalizer does not handle vectors with negative values properly. It can be tested with the following code {code} val normalized = new Normalizer(1.0).transform(v: Vector) normalizer.toArray.sum == 1.0 {code} This yields true if all values in Vector v are positive, but false when v contains one or more negative values. This is because the values in v are taken immediately without applying {{abs()}}, This (probably) does not occur for {{p=2.0}} because the values are squared and hence positive anyway. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
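A small worked example of the definition above for {{p = 1}} and {{v = (-1.0, 2.0)}}: the norm sums absolute values, so the normalized vector's absolute values sum to 1 while its signed element sum does not.
{code}
val v = Array(-1.0, 2.0)
val norm1 = v.map(x => math.abs(x)).sum            // |-1| + |2| = 3.0
val normalized = v.map(_ / norm1)                  // Array(-0.333..., 0.666...)
val signedSum = normalized.sum                     // 0.333..., not 1.0
val absSum = normalized.map(x => math.abs(x)).sum  // 1.0
{code}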
[jira] [Resolved] (SPARK-10331) Update user guide to address minor comments during code review
[ https://issues.apache.org/jira/browse/SPARK-10331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-10331. --- Resolution: Fixed Fix Version/s: 1.5.1 Update user guide to address minor comments during code review -- Key: SPARK-10331 URL: https://issues.apache.org/jira/browse/SPARK-10331 Project: Spark Issue Type: Improvement Components: Documentation, ML, MLlib Affects Versions: 1.5.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.5.1 Clean-up user guides to address some minor comments in: https://github.com/apache/spark/pull/8304 https://github.com/apache/spark/pull/8487 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10346) SparkR mutate and transform should replace column with same name to match R data.frame behavior
[ https://issues.apache.org/jira/browse/SPARK-10346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-10346: - Component/s: SparkR SparkR mutate and transform should replace column with same name to match R data.frame behavior --- Key: SPARK-10346 URL: https://issues.apache.org/jira/browse/SPARK-10346 Project: Spark Issue Type: Bug Components: R, SparkR Affects Versions: 1.5.0 Reporter: Felix Cheung Spark doesn't seem to replace existing column with the name in mutate (ie. mutate(df, age = df$age + 2) - returned DataFrame has 2 columns with the same name 'age'), so therefore not doing that for now in transform. Though it is clearly stated it should replace column with matching name: https://stat.ethz.ch/R-manual/R-devel/library/base/html/transform.html The tags are matched against names(_data), and for those that match, the value replace the corresponding variable in _data, and the others are appended to _data. Also the resulting DataFrame might be hard to work with if one is to use select with column names, or to register the table to SQL, and so on, since then 2 columns have the same name. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10301) For struct type, if parquet's global schema has less fields than a file's schema, data reading will fail
[ https://issues.apache.org/jira/browse/SPARK-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-10301: --- Description: We hit this issue when reading a complex Parquet dateset without turning on schema merging. The data set consists of Parquet files with different but compatible schemas. In this way, the schema of the dataset is defined by either a summary file or a random physical Parquet file if no summary files are available. Apparently, this schema may not containing all fields appeared in all physicla files. Parquet was designed with schema evolution and column pruning in mind, so it should be legal for a user to use a tailored schema to read the dataset to save disk IO. For example, say we have a Parquet dataset consisting of two physical Parquet files with the following two schemas: {noformat} message m0 { optional group f0 { optional int64 f00; optional int64 f01; } } message m1 { optional group f0 { optional int64 f01; optional int64 f01; optional int64 f02; } optional double f1; } {noformat} Users should be allowed to read the dataset with the following schema: {noformat} message m1 { optional group f0 { optional int64 f01; optional int64 f02; } } {noformat} so that {{f0.f00}} and {{f1}} are never touched. The above case can be expressed by the following {{spark-shell}} snippet: {noformat} import sqlContext._ import sqlContext.implicits._ import org.apache.spark.sql.types.{LongType, StructType} val path = /tmp/spark/parquet range(3).selectExpr(NAMED_STRUCT('f00', id, 'f01', id) AS f0).coalesce(1) .write.mode(overwrite).parquet(path) range(3).selectExpr(NAMED_STRUCT('f00', id, 'f01', id, 'f02', id) AS f0, CAST(id AS DOUBLE) AS f1).coalesce(1) .write.mode(append).parquet(path) val tailoredSchema = new StructType() .add( f0, new StructType() .add(f01, LongType, nullable = true) .add(f02, LongType, nullable = true), nullable = true) read.schema(tailoredSchema).parquet(path).show() {noformat} Expected output should be: {noformat} ++ | f0| ++ |[0,null]| |[1,null]| |[2,null]| | [0,0]| | [1,1]| | [2,2]| ++ {noformat} However, current 1.5-SNAPSHOT version throws the following exception: {noformat} org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file hdfs://localhost:9000/tmp/spark/parquet/part-r-0-56c4604e-c546-4f97-a316-05da8ab1a0bf.gz.parquet at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228) at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201) at org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:168) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at 
scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215) at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1844) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1844) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.ArrayIndexOutOfBoundsException: 2 at org.apache.spark.sql.execution.datasources.parquet.CatalystRowConverter.getConverter(CatalystRowConverter.scala:206) at
[jira] [Created] (SPARK-10354) First cost RDD shouldn't be cached in k-means|| and the following cost RDD should use MEMORY_AND_DISK
Xiangrui Meng created SPARK-10354: - Summary: First cost RDD shouldn't be cached in k-means|| and the following cost RDD should use MEMORY_AND_DISK Key: SPARK-10354 URL: https://issues.apache.org/jira/browse/SPARK-10354 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.5.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Minor The first RDD doesn't need to be cached, other cost RDDs should use MEMORY_AND_DISK to avoid recomputing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10329) Cost RDD in k-means|| initialization is not storage-efficient
[ https://issues.apache.org/jira/browse/SPARK-10329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14721419#comment-14721419 ] Xiangrui Meng commented on SPARK-10329: --- Assigned. I will send a small PR to fix apparent issues (SPARK-10534). Cost RDD in k-means|| initialization is not storage-efficient - Key: SPARK-10329 URL: https://issues.apache.org/jira/browse/SPARK-10329 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.1, 1.4.1, 1.5.0 Reporter: Xiangrui Meng Assignee: hujiayin Labels: clustering Currently we use `RDD[Vector]` to store point cost during k-means|| initialization, where each `Vector` has size `runs`. This is not storage-efficient because `runs` is usually 1 and then each record is a Vector of size 1. What we need is just the 8 bytes to store the cost, but we introduce two objects (DenseVector and its values array), which could cost 16 bytes. That is 200% overhead. Thanks [~Grace Huang] and Jiayin Hu from Intel for reporting this issue! There are several solutions: 1. Use `RDD[Array[Double]]` instead of `RDD[Vector]`, which saves 8 bytes per record. 2. Use `RDD[Array[Double]]` but batch the values for storage, e.g. each `Array[Double]` object covers 1024 instances, which could remove most of the overhead. Besides, using MEMORY_AND_DISK instead of MEMORY_ONLY could prevent cost RDDs kicking out the training dataset from memory. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-10329) Cost RDD in k-means|| initialization is not storage-efficient
[ https://issues.apache.org/jira/browse/SPARK-10329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14721419#comment-14721419 ] Xiangrui Meng edited comment on SPARK-10329 at 8/30/15 6:36 AM: Assigned. I will send a small PR to fix apparent issues (SPARK-10354). was (Author: mengxr): Assigned. I will send a small PR to fix apparent issues (SPARK-10534). Cost RDD in k-means|| initialization is not storage-efficient - Key: SPARK-10329 URL: https://issues.apache.org/jira/browse/SPARK-10329 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.1, 1.4.1, 1.5.0 Reporter: Xiangrui Meng Assignee: hujiayin Labels: clustering Currently we use `RDD[Vector]` to store point cost during k-means|| initialization, where each `Vector` has size `runs`. This is not storage-efficient because `runs` is usually 1 and then each record is a Vector of size 1. What we need is just the 8 bytes to store the cost, but we introduce two objects (DenseVector and its values array), which could cost 16 bytes. That is 200% overhead. Thanks [~Grace Huang] and Jiayin Hu from Intel for reporting this issue! There are several solutions: 1. Use `RDD[Array[Double]]` instead of `RDD[Vector]`, which saves 8 bytes per record. 2. Use `RDD[Array[Double]]` but batch the values for storage, e.g. each `Array[Double]` object covers 1024 instances, which could remove most of the overhead. Besides, using MEMORY_AND_DISK instead of MEMORY_ONLY could prevent cost RDDs kicking out the training dataset from memory. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10301) For struct type, if parquet's global schema has less fields than a file's schema, data reading will fail
[ https://issues.apache.org/jira/browse/SPARK-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14721447#comment-14721447 ] Cheng Lian commented on SPARK-10301: Updated ticket description to provide a more general view of this issue. Would also be helpful for reviewing https://github.com/apache/spark/pull/8509 For struct type, if parquet's global schema has less fields than a file's schema, data reading will fail Key: SPARK-10301 URL: https://issues.apache.org/jira/browse/SPARK-10301 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Yin Huai Assignee: Cheng Lian Priority: Critical We hit this issue when reading a complex Parquet dateset without turning on schema merging. The data set consists of Parquet files with different but compatible schemas. In this way, the schema of the dataset is defined by either a summary file or a random physical Parquet file if no summary files are available. Apparently, this schema may not containing all fields appeared in all physicla files. Parquet was designed with schema evolution and column pruning in mind, so it should be legal for a user to use a tailored schema to read the dataset to save disk IO. For example, say we have a Parquet dataset consisting of two physical Parquet files with the following two schemas: {noformat} message m0 { optional group f0 { optional int64 f00; optional int64 f01; } } message m1 { optional group f0 { optional int64 f01; optional int64 f01; optional int64 f02; } optional double f1; } {noformat} Users should be allowed to read the dataset with the following schema: {noformat} message m1 { optional group f0 { optional int64 f01; optional int64 f02; } } {noformat} so that {{f0.f00}} and {{f1}} are never touched. The above case can be expressed by the following {{spark-shell}} snippet: {noformat} import sqlContext._ import sqlContext.implicits._ import org.apache.spark.sql.types.{LongType, StructType} val path = /tmp/spark/parquet range(3).selectExpr(NAMED_STRUCT('f00', id, 'f01', id) AS f0).coalesce(1) .write.mode(overwrite).parquet(path) range(3).selectExpr(NAMED_STRUCT('f00', id, 'f01', id, 'f02', id) AS f0, CAST(id AS DOUBLE) AS f1).coalesce(1) .write.mode(append).parquet(path) val tailoredSchema = new StructType() .add( f0, new StructType() .add(f01, LongType, nullable = true) .add(f02, LongType, nullable = true), nullable = true) read.schema(tailoredSchema).parquet(path).show() {noformat} Expected output should be: {noformat} ++ | f0| ++ |[0,null]| |[1,null]| |[2,null]| | [0,0]| | [1,1]| | [2,2]| ++ {noformat} However, current 1.5-SNAPSHOT version throws the following exception: {noformat} org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file hdfs://localhost:9000/tmp/spark/parquet/part-r-0-56c4604e-c546-4f97-a316-05da8ab1a0bf.gz.parquet at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228) at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201) at org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:168) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215) at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1844)
[jira] [Resolved] (SPARK-10348) Improve Spark ML user guide
[ https://issues.apache.org/jira/browse/SPARK-10348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-10348. --- Resolution: Fixed Fix Version/s: 1.5.1 Issue resolved by pull request 8517 [https://github.com/apache/spark/pull/8517] Improve Spark ML user guide --- Key: SPARK-10348 URL: https://issues.apache.org/jira/browse/SPARK-10348 Project: Spark Issue Type: Improvement Components: Documentation, ML Affects Versions: 1.5.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.5.1 improve ml-guide: * replace `ML Dataset` by `DataFrame` to simplify the abstraction * remove links to Scala API doc in the main guide * change ML algorithms to pipeline components -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10331) Update user guide to address minor comments during code review
[ https://issues.apache.org/jira/browse/SPARK-10331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-10331: -- Description: Clean-up user guides to address some minor comments in: https://github.com/apache/spark/pull/8304 https://github.com/apache/spark/pull/8487 Some code examples were introduced in 1.2 before `createDataFrame`. We should switch to that. was: Clean-up user guides to address some minor comments in: https://github.com/apache/spark/pull/8304 https://github.com/apache/spark/pull/8487 Update user guide to address minor comments during code review -- Key: SPARK-10331 URL: https://issues.apache.org/jira/browse/SPARK-10331 Project: Spark Issue Type: Improvement Components: Documentation, ML, MLlib Affects Versions: 1.5.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.5.1 Clean-up user guides to address some minor comments in: https://github.com/apache/spark/pull/8304 https://github.com/apache/spark/pull/8487 Some code examples were introduced in 1.2 before `createDataFrame`. We should switch to that. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10023) Unified DecisionTreeParams checkpointInterval between Scala and Python API.
[ https://issues.apache.org/jira/browse/SPARK-10023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-10023: Description: checkpointInterval is member of DecisionTreeParams in Scala API which is inconsistency with Python API, we should unified them. * checkpointInterval ** member of DecisionTreeParams - Scala API ** shared param used for all ML Transformer/Estimator - Python API Proposal: checkpointInterval also used at ALS but the meaning for that is different from here. So we make checkpointInterval member of DecisionTreeParams for Python API, because of it only validate when cacheNodeIds is true and the checkpoint directory is set in the SparkContext. was: checkpointInterval is member of DecisionTreeParams in Scala API which is inconsistency with Python API, we should unified them. * checkpointInterval ** member of DecisionTreeParams - Scala API ** shared param used for all ML Transformer/Estimator - Python API Proposal: Make checkpointInterval member of DecisionTreeParams for Python API, because of it only validate when cacheNodeIds is true and the checkpoint directory is set in the SparkContext. Unified DecisionTreeParams checkpointInterval between Scala and Python API. - Key: SPARK-10023 URL: https://issues.apache.org/jira/browse/SPARK-10023 Project: Spark Issue Type: Sub-task Components: ML, PySpark Reporter: Yanbo Liang checkpointInterval is member of DecisionTreeParams in Scala API which is inconsistency with Python API, we should unified them. * checkpointInterval ** member of DecisionTreeParams - Scala API ** shared param used for all ML Transformer/Estimator - Python API Proposal: checkpointInterval also used at ALS but the meaning for that is different from here. So we make checkpointInterval member of DecisionTreeParams for Python API, because of it only validate when cacheNodeIds is true and the checkpoint directory is set in the SparkContext. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10356) MLlib: Normalization should use absolute values
[ https://issues.apache.org/jira/browse/SPARK-10356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14721494#comment-14721494 ] Sean Owen commented on SPARK-10356: --- It's not true that the sum of the elements will be 1 after this normalization. It's true that the sum of their absolute values will be. {code} scala val v = Vectors.dense(-1.0, 2.0) v: org.apache.spark.mllib.linalg.Vector = [-1.0,2.0] scala new Normalizer(1.0).transform(v) res2: org.apache.spark.mllib.linalg.Vector = [-0.,0.] {code} That looks correct. You're not expecting the result to have all positive entries right? MLlib: Normalization should use absolute values --- Key: SPARK-10356 URL: https://issues.apache.org/jira/browse/SPARK-10356 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.4.1 Reporter: Carsten Schnober Labels: easyfix Original Estimate: 2h Remaining Estimate: 2h The normalizer does not handle vectors with negative values properly. It can be tested with the following code {code} val normalized = new Normalizer(1.0).transform(v: Vector) normalizer.toArray.sum == 1.0 {code} This yields true if all values in Vector v are positive, but false when v contains one or more negative values. This is because the values in v are taken immediately without applying {{abs()}}, This (probably) does not occur for {{p=2.0}} because the values are squared and hence positive anyway. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10264) Add @Since annotation to ml.recoomendation
[ https://issues.apache.org/jira/browse/SPARK-10264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14721682#comment-14721682 ] Tijo Thomas commented on SPARK-10264: - I am working on this. Thanks Add @Since annotation to ml.recoomendation -- Key: SPARK-10264 URL: https://issues.apache.org/jira/browse/SPARK-10264 Project: Spark Issue Type: Sub-task Components: Documentation, ML Reporter: Xiangrui Meng Priority: Minor Labels: starter -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10353) MLlib BLAS gemm outputs wrong result when beta = 0.0 for transpose transpose matrix multiplication
[ https://issues.apache.org/jira/browse/SPARK-10353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-10353: -- Affects Version/s: 1.3.1 1.4.1 MLlib BLAS gemm outputs wrong result when beta = 0.0 for transpose transpose matrix multiplication -- Key: SPARK-10353 URL: https://issues.apache.org/jira/browse/SPARK-10353 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.3.1, 1.4.1, 1.5.0 Reporter: Burak Yavuz Fix For: 1.4.2, 1.5.1 Basically {code} if (beta != 0.0) { f2jBLAS.dscal(C.values.length, beta, C.values, 1) } {code} should be {code} if (beta != 1.0) { f2jBLAS.dscal(C.values.length, beta, C.values, 1) } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
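A brief note on why the guard matters (reasoning inferred from the snippet above, not from the patch discussion): gemm computes C := alpha * A * B + beta * C, and the dscal call pre-scales C by beta. Guarding it with beta != 0.0 skips the scaling exactly when beta == 0.0, so whatever stale values C already holds get accumulated into the product instead of being zeroed out; guarding with beta != 1.0 scales (or zeroes) C in every case where it changes the result and only skips the case where scaling is a no-op.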
[jira] [Updated] (SPARK-10353) MLlib BLAS gemm outputs wrong result when beta = 0.0 for transpose transpose matrix multiplication
[ https://issues.apache.org/jira/browse/SPARK-10353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-10353: -- Fix Version/s: 1.5.1 1.4.2 MLlib BLAS gemm outputs wrong result when beta = 0.0 for transpose transpose matrix multiplication -- Key: SPARK-10353 URL: https://issues.apache.org/jira/browse/SPARK-10353 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.3.1, 1.4.1, 1.5.0 Reporter: Burak Yavuz Fix For: 1.4.2, 1.5.1 Basically {code} if (beta != 0.0) { f2jBLAS.dscal(C.values.length, beta, C.values, 1) } {code} should be {code} if (beta != 1.0) { f2jBLAS.dscal(C.values.length, beta, C.values, 1) } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10353) MLlib BLAS gemm outputs wrong result when beta = 0.0 for transpose transpose matrix multiplication
[ https://issues.apache.org/jira/browse/SPARK-10353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-10353: -- Assignee: Burak Yavuz MLlib BLAS gemm outputs wrong result when beta = 0.0 for transpose transpose matrix multiplication -- Key: SPARK-10353 URL: https://issues.apache.org/jira/browse/SPARK-10353 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.3.1, 1.4.1, 1.5.0 Reporter: Burak Yavuz Assignee: Burak Yavuz Fix For: 1.4.2, 1.5.1 Basically {code} if (beta != 0.0) { f2jBLAS.dscal(C.values.length, beta, C.values, 1) } {code} should be {code} if (beta != 1.0) { f2jBLAS.dscal(C.values.length, beta, C.values, 1) } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10353) MLlib BLAS gemm outputs wrong result when beta = 0.0 for transpose transpose matrix multiplication
[ https://issues.apache.org/jira/browse/SPARK-10353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14721689#comment-14721689 ] Xiangrui Meng commented on SPARK-10353: --- Leave the JIRA open for 1.3. MLlib BLAS gemm outputs wrong result when beta = 0.0 for transpose transpose matrix multiplication -- Key: SPARK-10353 URL: https://issues.apache.org/jira/browse/SPARK-10353 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.3.1, 1.4.1, 1.5.0 Reporter: Burak Yavuz Assignee: Burak Yavuz Fix For: 1.4.2, 1.5.1 Basically {code} if (beta != 0.0) { f2jBLAS.dscal(C.values.length, beta, C.values, 1) } {code} should be {code} if (beta != 1.0) { f2jBLAS.dscal(C.values.length, beta, C.values, 1) } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8684) Update R version in Spark EC2 AMI
[ https://issues.apache.org/jira/browse/SPARK-8684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8684: - Fix Version/s: (was: 1.5.0) Update R version in Spark EC2 AMI - Key: SPARK-8684 URL: https://issues.apache.org/jira/browse/SPARK-8684 Project: Spark Issue Type: Improvement Components: EC2, SparkR Reporter: Shivaram Venkataraman Priority: Minor Right now the R version in the AMI is 3.1 -- However a number of R libraries need R version 3.2 and it will be good to update the R version on the AMI while launching a EC2 cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10189) python rdd socket connection problem
[ https://issues.apache.org/jira/browse/SPARK-10189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ABHISHEK CHOUDHARY updated SPARK-10189: --- Description: I am trying to use wholeTextFiles with pyspark , and now I am getting the same error - ``` textFiles = sc.wholeTextFiles('/file/content') textFiles.take(1) ``` ``` Traceback (most recent call last): File stdin, line 1, in module File /Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/rdd.py, line 1277, in take res = self.context.runJob(self, takeUpToNumLeft, p, True) File /Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/context.py, line 898, in runJob return list(_load_from_socket(port, mappedRDD._jrdd_deserializer)) File /Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/rdd.py, line 138, in _load_from_socket raise Exception(could not open socket) Exception: could not open socket 15/08/24 20:09:27 ERROR PythonRDD: Error while sending iterator java.net.SocketTimeoutException: Accept timed out at java.net.PlainSocketImpl.socketAccept(Native Method) at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:404) at java.net.ServerSocket.implAccept(ServerSocket.java:545) at java.net.ServerSocket.accept(ServerSocket.java:513) at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:623) ``` Current piece of code in rdd.py- ``` def _load_from_socket(port, serializer): sock = None # Support for both IPv4 and IPv6. # On most of IPv6-ready systems, IPv6 will take precedence. for res in socket.getaddrinfo(localhost, port, socket.AF_UNSPEC, socket.SOCK_STREAM): af, socktype, proto, canonname, sa = res try: sock = socket.socket(af, socktype, proto) sock.settimeout(3) sock.connect(sa) except socket.error: sock = None continue break if not sock: raise Exception(could not open socket) try: rf = sock.makefile(rb, 65536) for item in serializer.load_stream(rf): yield item finally: sock.close() ``` On further investigate the issue , i realized that in context.py , runJob is not actually triggering the server and so there is nothing to connect - ``` port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions) ``` was: I am trying to use wholeTextFiles with pyspark , and now I am getting the same error - ``` textFiles = sc.wholeTextFiles('/file/content') textFiles.take(1) ``` ``` Traceback (most recent call last): File stdin, line 1, in module File /Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/rdd.py, line 1277, in take res = self.context.runJob(self, takeUpToNumLeft, p, True) File /Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/context.py, line 898, in runJob return list(_load_from_socket(port, mappedRDD._jrdd_deserializer)) File /Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/rdd.py, line 138, in _load_from_socket raise Exception(could not open socket) Exception: could not open socket 15/08/24 20:09:27 ERROR PythonRDD: Error while sending iterator java.net.SocketTimeoutException: Accept timed out at java.net.PlainSocketImpl.socketAccept(Native Method) at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:404) at java.net.ServerSocket.implAccept(ServerSocket.java:545) at java.net.ServerSocket.accept(ServerSocket.java:513) at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:623) ``` Current piece of code in rdd.py- ``` def _load_from_socket(port, serializer): sock = None # Support for both IPv4 and IPv6. 
# On most of IPv6-ready systems, IPv6 will take precedence. for res in socket.getaddrinfo(localhost, port, socket.AF_UNSPEC, socket.SOCK_STREAM): af, socktype, proto, canonname, sa = res try: sock = socket.socket(af, socktype, proto) sock.settimeout(3) sock.connect(sa) except socket.error: sock = None continue break if not sock: raise Exception(could not open socket) try: rf = sock.makefile(rb, 65536) for item in serializer.load_stream(rf): yield item finally: sock.close() ``` python rdd socket connection problem Key: SPARK-10189 URL: https://issues.apache.org/jira/browse/SPARK-10189 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.4.1 Reporter: ABHISHEK CHOUDHARY Labels: pyspark, socket I am trying to use wholeTextFiles with pyspark , and now I am getting the same error - ``` textFiles =
[jira] [Updated] (SPARK-9663) ML Python API coverage issues found during 1.5 QA
[ https://issues.apache.org/jira/browse/SPARK-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-9663: --- Description: This umbrella is for a list of Python API coverage issues which we should fix for the 1.6 release cycle. This list is to be generated from issues found in [SPARK-9662] and from remaining issues from 1.4: [SPARK-7536]. Here we check and compare the Python and Scala API of MLlib/ML, add missing classes/methods/parameters for PySpark. * Missing classes for ML: ** attribute SPARK-10025 ** feature *** CountVectorizerModel SPARK-9769 *** DCT SPARK-8472 *** ElementwiseProduct SPARK-9768 *** MinMaxScaler SPARK-8530 *** SQLTransformer SPARK-10355 *** StopWordsRemover SPARK-9679 *** VectorSlicer SPARK-9772 *** IndexToString SPARK-10021 ** classification *** OneVsRest SPARK-7861 *** MultilayerPerceptronClassifier SPARK-9773 ** regression *** IsotonicRegression SPARK-9774 * Missing classes for MLlib: ** fpm *** PrefixSpan SPARK-10028 * Missing User Guide documents for PySpark SPARK-8757 * Scala-Python method/parameter inconsistency check for ML MLlib SPARK-10022 was: This umbrella is for a list of Python API coverage issues which we should fix for the 1.6 release cycle. This list is to be generated from issues found in [SPARK-9662] and from remaining issues from 1.4: [SPARK-7536]. Here we check and compare the Python and Scala API of MLlib/ML, add missing classes/methods/parameters for PySpark. * Missing classes for ML: ** attribute SPARK-10025 ** feature *** CountVectorizerModel SPARK-9769 *** DCT SPARK-8472 *** ElementwiseProduct SPARK-9768 *** MinMaxScaler SPARK-8530 *** StopWordsRemover SPARK-9679 *** VectorSlicer SPARK-9772 *** IndexToString SPARK-10021 ** classification *** OneVsRest SPARK-7861 *** MultilayerPerceptronClassifier SPARK-9773 ** regression *** IsotonicRegression SPARK-9774 * Missing classes for MLlib: ** fpm *** PrefixSpan SPARK-10028 * Missing User Guide documents for PySpark SPARK-8757 * Scala-Python method/parameter inconsistency check for ML MLlib SPARK-10022 ML Python API coverage issues found during 1.5 QA - Key: SPARK-9663 URL: https://issues.apache.org/jira/browse/SPARK-9663 Project: Spark Issue Type: Umbrella Components: ML, MLlib, PySpark Reporter: Joseph K. Bradley This umbrella is for a list of Python API coverage issues which we should fix for the 1.6 release cycle. This list is to be generated from issues found in [SPARK-9662] and from remaining issues from 1.4: [SPARK-7536]. Here we check and compare the Python and Scala API of MLlib/ML, add missing classes/methods/parameters for PySpark. * Missing classes for ML: ** attribute SPARK-10025 ** feature *** CountVectorizerModel SPARK-9769 *** DCT SPARK-8472 *** ElementwiseProduct SPARK-9768 *** MinMaxScaler SPARK-8530 *** SQLTransformer SPARK-10355 *** StopWordsRemover SPARK-9679 *** VectorSlicer SPARK-9772 *** IndexToString SPARK-10021 ** classification *** OneVsRest SPARK-7861 *** MultilayerPerceptronClassifier SPARK-9773 ** regression *** IsotonicRegression SPARK-9774 * Missing classes for MLlib: ** fpm *** PrefixSpan SPARK-10028 * Missing User Guide documents for PySpark SPARK-8757 * Scala-Python method/parameter inconsistency check for ML MLlib SPARK-10022 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10355) Add Python API for SQLTransformer
[ https://issues.apache.org/jira/browse/SPARK-10355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10355: Assignee: (was: Apache Spark) Add Python API for SQLTransformer - Key: SPARK-10355 URL: https://issues.apache.org/jira/browse/SPARK-10355 Project: Spark Issue Type: Improvement Components: ML, PySpark Reporter: Yanbo Liang Priority: Minor Add Python API for SQLTransformer -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10355) Add Python API for SQLTransformer
[ https://issues.apache.org/jira/browse/SPARK-10355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14721467#comment-14721467 ] Apache Spark commented on SPARK-10355: -- User 'yanboliang' has created a pull request for this issue: https://github.com/apache/spark/pull/8527 Add Python API for SQLTransformer - Key: SPARK-10355 URL: https://issues.apache.org/jira/browse/SPARK-10355 Project: Spark Issue Type: Improvement Components: ML, PySpark Reporter: Yanbo Liang Priority: Minor Add Python API for SQLTransformer -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10355) Add Python API for SQLTransformer
[ https://issues.apache.org/jira/browse/SPARK-10355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10355: Assignee: Apache Spark Add Python API for SQLTransformer - Key: SPARK-10355 URL: https://issues.apache.org/jira/browse/SPARK-10355 Project: Spark Issue Type: Improvement Components: ML, PySpark Reporter: Yanbo Liang Assignee: Apache Spark Priority: Minor Add Python API for SQLTransformer -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10023) Unified DecisionTreeParams checkpointInterval between Scala and Python API.
[ https://issues.apache.org/jira/browse/SPARK-10023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-10023: Description: checkpointInterval is member of DecisionTreeParams in Scala API which is inconsistency with Python API, we should unified them. * checkpointInterval ** member of DecisionTreeParams - Scala API ** shared param used for all ML Transformer/Estimator - Python API Proposal: Make checkpointInterval member of DecisionTreeParams for Python API, because of it only validate when cacheNodeIds is true and the checkpoint directory is set in the SparkContext. was: checkpointInterval is member of DecisionTreeParams in Scala API which is inconsistency with Python API, we should unified them. * checkpointInterval ** member of DecisionTreeParams - Scala API ** shared param used for all ML Transformer/Estimator - Python API Proposal: Make checkpointInterval shared param for Scala API Unified DecisionTreeParams checkpointInterval between Scala and Python API. - Key: SPARK-10023 URL: https://issues.apache.org/jira/browse/SPARK-10023 Project: Spark Issue Type: Sub-task Components: ML, PySpark Reporter: Yanbo Liang checkpointInterval is member of DecisionTreeParams in Scala API which is inconsistency with Python API, we should unified them. * checkpointInterval ** member of DecisionTreeParams - Scala API ** shared param used for all ML Transformer/Estimator - Python API Proposal: Make checkpointInterval member of DecisionTreeParams for Python API, because of it only validate when cacheNodeIds is true and the checkpoint directory is set in the SparkContext. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10023) Unified DecisionTreeParams checkpointInterval between Scala and Python API.
[ https://issues.apache.org/jira/browse/SPARK-10023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-10023: Description: checkpointInterval is member of DecisionTreeParams in Scala API which is inconsistency with Python API, we should unified them. * checkpointInterval ** member of DecisionTreeParams - Scala API ** shared param used for all ML Transformer/Estimator - Python API Proposal: checkpointInterval also used at ALS but the meaning for that is different from here. So we make checkpointInterval member of DecisionTreeParams for Python API, it only validate when cacheNodeIds is true and the checkpoint directory is set in the SparkContext. was: checkpointInterval is member of DecisionTreeParams in Scala API which is inconsistency with Python API, we should unified them. * checkpointInterval ** member of DecisionTreeParams - Scala API ** shared param used for all ML Transformer/Estimator - Python API Proposal: checkpointInterval also used at ALS but the meaning for that is different from here. So we make checkpointInterval member of DecisionTreeParams for Python API, because of it only validate when cacheNodeIds is true and the checkpoint directory is set in the SparkContext. Unified DecisionTreeParams checkpointInterval between Scala and Python API. - Key: SPARK-10023 URL: https://issues.apache.org/jira/browse/SPARK-10023 Project: Spark Issue Type: Sub-task Components: ML, PySpark Reporter: Yanbo Liang checkpointInterval is member of DecisionTreeParams in Scala API which is inconsistency with Python API, we should unified them. * checkpointInterval ** member of DecisionTreeParams - Scala API ** shared param used for all ML Transformer/Estimator - Python API Proposal: checkpointInterval also used at ALS but the meaning for that is different from here. So we make checkpointInterval member of DecisionTreeParams for Python API, it only validate when cacheNodeIds is true and the checkpoint directory is set in the SparkContext. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10023) Unified DecisionTreeParams checkpointInterval between Scala and Python API.
[ https://issues.apache.org/jira/browse/SPARK-10023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10023: Assignee: Apache Spark Unified DecisionTreeParams checkpointInterval between Scala and Python API. - Key: SPARK-10023 URL: https://issues.apache.org/jira/browse/SPARK-10023 Project: Spark Issue Type: Sub-task Components: ML, PySpark Reporter: Yanbo Liang Assignee: Apache Spark checkpointInterval is member of DecisionTreeParams in Scala API which is inconsistency with Python API, we should unified them. * checkpointInterval ** member of DecisionTreeParams - Scala API ** shared param used for all ML Transformer/Estimator - Python API Proposal: checkpointInterval also used at ALS but the meaning for that is different from here. So we make checkpointInterval member of DecisionTreeParams for Python API, it only validate when cacheNodeIds is true and the checkpoint directory is set in the SparkContext. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
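For context, a minimal Scala-side sketch of how checkpointInterval interacts with the other settings mentioned in the proposal ({{sc}}, {{training}} and the checkpoint directory are placeholders; the setters shown are the usual ml.tree ones): the interval only takes effect when cacheNodeIds is true and a checkpoint directory has been set on the SparkContext.
{code}
import org.apache.spark.ml.classification.DecisionTreeClassifier

// sc is an existing SparkContext; training is a DataFrame with "label"/"features".
sc.setCheckpointDir("/tmp/spark-checkpoints")   // hypothetical directory

val dt = new DecisionTreeClassifier()
  .setCacheNodeIds(true)        // cache per-instance node IDs, enabling checkpointing
  .setCheckpointInterval(10)    // checkpoint the cached node-ID RDD every 10 iterations

val model = dt.fit(training)
{code}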
[jira] [Commented] (SPARK-10356) MLlib: Normalization should use absolute values
[ https://issues.apache.org/jira/browse/SPARK-10356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14721502#comment-14721502 ] Carsten Schnober commented on SPARK-10356: -- According to [Wikipedia|https://en.wikipedia.org/wiki/Norm_%28mathematics%29#p-norm], each value's absolute value should be used to compute the norm: {{||x||_p := (sum(|x|^p))^(1/p)}} For p = 1, this results in: {{||x||_1 := sum(|x|)}} I suppose the issue is thus actually located in the {{norm()}} method. MLlib: Normalization should use absolute values --- Key: SPARK-10356 URL: https://issues.apache.org/jira/browse/SPARK-10356 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.4.1 Reporter: Carsten Schnober Labels: easyfix Original Estimate: 2h Remaining Estimate: 2h The normalizer does not handle vectors with negative values properly. It can be tested with the following code {code} val normalized = new Normalizer(1.0).transform(v: Vector) normalizer.toArray.sum == 1.0 {code} This yields true if all values in Vector v are positive, but false when v contains one or more negative values. This is because the values in v are taken immediately without applying {{abs()}}, This (probably) does not occur for {{p=2.0}} because the values are squared and hence positive anyway. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10189) python rdd socket connection problem
[ https://issues.apache.org/jira/browse/SPARK-10189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ABHISHEK CHOUDHARY updated SPARK-10189: --- Description: I am trying to use wholeTextFiles with pyspark , and now I am getting the same error - {code} textFiles = sc.wholeTextFiles('/file/content') textFiles.take(1) Traceback (most recent call last): File stdin, line 1, in module File /Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/rdd.py, line 1277, in take res = self.context.runJob(self, takeUpToNumLeft, p, True) File /Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/context.py, line 898, in runJob return list(_load_from_socket(port, mappedRDD._jrdd_deserializer)) File /Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/rdd.py, line 138, in _load_from_socket raise Exception(could not open socket) Exception: could not open socket 15/08/24 20:09:27 ERROR PythonRDD: Error while sending iterator java.net.SocketTimeoutException: Accept timed out at java.net.PlainSocketImpl.socketAccept(Native Method) at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:404) at java.net.ServerSocket.implAccept(ServerSocket.java:545) at java.net.ServerSocket.accept(ServerSocket.java:513) at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:623) {code} Current piece of code in rdd.py- {code:title=rdd.py|borderStyle=solid} def _load_from_socket(port, serializer): sock = None # Support for both IPv4 and IPv6. # On most of IPv6-ready systems, IPv6 will take precedence. for res in socket.getaddrinfo(localhost, port, socket.AF_UNSPEC, socket.SOCK_STREAM): af, socktype, proto, canonname, sa = res try: sock = socket.socket(af, socktype, proto) sock.settimeout(3) sock.connect(sa) except socket.error: sock = None continue break if not sock: raise Exception(could not open socket) try: rf = sock.makefile(rb, 65536) for item in serializer.load_stream(rf): yield item finally: sock.close() {code} On further investigate the issue , i realized that in context.py , runJob is not actually triggering the server and so there is nothing to connect - {code:title=context.py|borderStyle=solid} port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions) {code} was: I am trying to use wholeTextFiles with pyspark , and now I am getting the same error - {code} textFiles = sc.wholeTextFiles('/file/content') textFiles.take(1) Traceback (most recent call last): File stdin, line 1, in module File /Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/rdd.py, line 1277, in take res = self.context.runJob(self, takeUpToNumLeft, p, True) File /Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/context.py, line 898, in runJob return list(_load_from_socket(port, mappedRDD._jrdd_deserializer)) File /Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/rdd.py, line 138, in _load_from_socket raise Exception(could not open socket) Exception: could not open socket 15/08/24 20:09:27 ERROR PythonRDD: Error while sending iterator java.net.SocketTimeoutException: Accept timed out at java.net.PlainSocketImpl.socketAccept(Native Method) at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:404) at java.net.ServerSocket.implAccept(ServerSocket.java:545) at java.net.ServerSocket.accept(ServerSocket.java:513) at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:623) {code} Current piece of code in rdd.py- {code:title=rdd.py|borderStyle=solid} def 
_load_from_socket(port, serializer): sock = None # Support for both IPv4 and IPv6. # On most of IPv6-ready systems, IPv6 will take precedence. for res in socket.getaddrinfo(localhost, port, socket.AF_UNSPEC, socket.SOCK_STREAM): af, socktype, proto, canonname, sa = res try: sock = socket.socket(af, socktype, proto) sock.settimeout(3) sock.connect(sa) except socket.error: sock = None continue break if not sock: raise Exception(could not open socket) try: rf = sock.makefile(rb, 65536) for item in serializer.load_stream(rf): yield item finally: sock.close() {code} On further investigate the issue , i realized that in context.py , runJob is not actually triggering the server and so there is nothing to connect - ``` port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions) ``` python rdd socket connection problem Key: SPARK-10189 URL:
[jira] [Updated] (SPARK-10189) python rdd socket connection problem
[ https://issues.apache.org/jira/browse/SPARK-10189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ABHISHEK CHOUDHARY updated SPARK-10189: --- Description: I am trying to use wholeTextFiles with pyspark , and now I am getting the same error - {code} textFiles = sc.wholeTextFiles('/file/content') textFiles.take(1) Traceback (most recent call last): File stdin, line 1, in module File /Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/rdd.py, line 1277, in take res = self.context.runJob(self, takeUpToNumLeft, p, True) File /Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/context.py, line 898, in runJob return list(_load_from_socket(port, mappedRDD._jrdd_deserializer)) File /Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/rdd.py, line 138, in _load_from_socket raise Exception(could not open socket) Exception: could not open socket 15/08/24 20:09:27 ERROR PythonRDD: Error while sending iterator java.net.SocketTimeoutException: Accept timed out at java.net.PlainSocketImpl.socketAccept(Native Method) at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:404) at java.net.ServerSocket.implAccept(ServerSocket.java:545) at java.net.ServerSocket.accept(ServerSocket.java:513) at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:623) {code} Current piece of code in rdd.py- {code:title=rdd.py|borderStyle=solid} def _load_from_socket(port, serializer): sock = None # Support for both IPv4 and IPv6. # On most of IPv6-ready systems, IPv6 will take precedence. for res in socket.getaddrinfo(localhost, port, socket.AF_UNSPEC, socket.SOCK_STREAM): af, socktype, proto, canonname, sa = res try: sock = socket.socket(af, socktype, proto) sock.settimeout(3) sock.connect(sa) except socket.error: sock = None continue break if not sock: raise Exception(could not open socket) try: rf = sock.makefile(rb, 65536) for item in serializer.load_stream(rf): yield item finally: sock.close() {code} On further investigate the issue , i realized that in context.py , runJob is not actually triggering the server and so there is nothing to connect - ``` port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions) ``` was: I am trying to use wholeTextFiles with pyspark , and now I am getting the same error - ``` textFiles = sc.wholeTextFiles('/file/content') textFiles.take(1) ``` ``` Traceback (most recent call last): File stdin, line 1, in module File /Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/rdd.py, line 1277, in take res = self.context.runJob(self, takeUpToNumLeft, p, True) File /Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/context.py, line 898, in runJob return list(_load_from_socket(port, mappedRDD._jrdd_deserializer)) File /Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/rdd.py, line 138, in _load_from_socket raise Exception(could not open socket) Exception: could not open socket 15/08/24 20:09:27 ERROR PythonRDD: Error while sending iterator java.net.SocketTimeoutException: Accept timed out at java.net.PlainSocketImpl.socketAccept(Native Method) at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:404) at java.net.ServerSocket.implAccept(ServerSocket.java:545) at java.net.ServerSocket.accept(ServerSocket.java:513) at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:623) ``` Current piece of code in rdd.py- ``` def _load_from_socket(port, serializer): sock = None # Support for both IPv4 and 
IPv6. # On most of IPv6-ready systems, IPv6 will take precedence. for res in socket.getaddrinfo(localhost, port, socket.AF_UNSPEC, socket.SOCK_STREAM): af, socktype, proto, canonname, sa = res try: sock = socket.socket(af, socktype, proto) sock.settimeout(3) sock.connect(sa) except socket.error: sock = None continue break if not sock: raise Exception(could not open socket) try: rf = sock.makefile(rb, 65536) for item in serializer.load_stream(rf): yield item finally: sock.close() ``` On further investigate the issue , i realized that in context.py , runJob is not actually triggering the server and so there is nothing to connect - ``` port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions) ``` python rdd socket connection problem Key: SPARK-10189 URL: https://issues.apache.org/jira/browse/SPARK-10189 Project: Spark
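The stack traces above show the JVM side timing out in {{ServerSocket.accept}} while the Python side gives up after its 3-second connect loop. The following standalone sketch (plain {{java.net}}, not Spark code) reproduces that accept-timeout behaviour, which is what surfaces as "Accept timed out" when no client connects in time.
{code}
import java.net.{ServerSocket, SocketTimeoutException}

object AcceptTimeoutDemo {
  def main(args: Array[String]): Unit = {
    val server = new ServerSocket(0)   // bind an ephemeral local port
    server.setSoTimeout(3000)          // accept() gives up after 3 seconds
    println(s"listening on port ${server.getLocalPort}")
    try {
      val client = server.accept()     // no client connects in this demo ...
      client.close()
    } catch {
      case _: SocketTimeoutException =>
        // ... so we end up on the "Accept timed out" path from the report
        println("Accept timed out")
    } finally {
      server.close()
    }
  }
}
{code}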
[jira] [Comment Edited] (SPARK-8292) ShortestPaths run with error result
[ https://issues.apache.org/jira/browse/SPARK-8292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14721554#comment-14721554 ] Anita Tailor edited comment on SPARK-8292 at 8/30/15 3:13 PM: -- https://issues.apache.org/jira/browse/SPARK-5343 explains existing behaviour is correct. SPF implementation here gives shortest paths to the given set of landmark vertices was (Author: atailor22): https://issues.apache.org/jira/browse/SPARK-5343 explains existing behaviour is correct. ShortestPaths run with error result --- Key: SPARK-8292 URL: https://issues.apache.org/jira/browse/SPARK-8292 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.3.1 Environment: Ubuntu 64bit Reporter: Bruce Chen Priority: Minor Labels: patch Attachments: ShortestPaths.patch In graphx/lib/ShortestPaths, i run an example with input data: 0\t2 0\t4 2\t3 3\t6 4\t2 4\t5 5\t3 5\t6 then i write a function and set point '0' as the source point, and calculate the shortest path from point 0 to the others points, the code like this: val source: Seq[VertexId] = Seq(0) val ss = ShortestPaths.run(graph, source) then, i get the run result of all the vertex's shortest path value: (4,Map()) (0,Map(0 - 0)) (6,Map()) (3,Map()) (5,Map()) (2,Map()) but the right result should be: (4,Map(0 - 1)) (0,Map(0 - 0)) (6,Map(0 - 3)) (3,Map(0 - 2)) (5,Map(0 - 2)) (2,Map(0 - 1)) so, i check the source code of spark/graphx/src/main/scala/org/apache/spark/graphx/lib/ShortestPaths.scala and find a bug. The patch list in the following. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
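As a concrete illustration of the comment above (a sketch against the public GraphX API, using the reporter's edge list; {{sc}} is an existing SparkContext): ShortestPaths.run returns, for every vertex, its distance to each landmark, so with landmark 0 only vertex 0 gets an entry. One way to obtain distances from a single source instead is to reverse the edges and use the source as the landmark.
{code}
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.graphx.lib.ShortestPaths

// Edge list from the report; sc is an existing SparkContext.
val edges = sc.parallelize(Seq(
  Edge(0L, 2L, 1), Edge(0L, 4L, 1), Edge(2L, 3L, 1), Edge(3L, 6L, 1),
  Edge(4L, 2L, 1), Edge(4L, 5L, 1), Edge(5L, 3L, 1), Edge(5L, 6L, 1)))
val graph = Graph.fromEdges(edges, defaultValue = 0)

// Distances to landmark 0: only vertex 0 can reach it, matching the reported output.
val toLandmark = ShortestPaths.run(graph, Seq(0L))

// Distances from vertex 0: reverse the edges, then use 0 as the landmark.
val fromSource = ShortestPaths.run(graph.reverse, Seq(0L))
fromSource.vertices.collect().foreach { case (id, dists) =>
  println(s"$id: ${dists.get(0L).map(_.toString).getOrElse("unreachable")}")
}
{code}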
[jira] [Commented] (SPARK-8684) Update R version in Spark EC2 AMI
[ https://issues.apache.org/jira/browse/SPARK-8684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14721566#comment-14721566 ] Vincent Warmerdam commented on SPARK-8684: -- upgrading the spark ec2 ami would take significant work and might not be something for the upcoming spark release. this is (to my knowledge) still something that is unresolved. Update R version in Spark EC2 AMI - Key: SPARK-8684 URL: https://issues.apache.org/jira/browse/SPARK-8684 Project: Spark Issue Type: Improvement Components: EC2, SparkR Reporter: Shivaram Venkataraman Priority: Minor Fix For: 1.5.0 Right now the R version in the AMI is 3.1 -- However a number of R libraries need R version 3.2 and it will be good to update the R version on the AMI while launching a EC2 cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8292) ShortestPaths run with error result
[ https://issues.apache.org/jira/browse/SPARK-8292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-8292. -- Resolution: Not A Problem ShortestPaths run with error result --- Key: SPARK-8292 URL: https://issues.apache.org/jira/browse/SPARK-8292 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.3.1 Environment: Ubuntu 64bit Reporter: Bruce Chen Priority: Minor Labels: patch Attachments: ShortestPaths.patch In graphx/lib/ShortestPaths, i run an example with input data: 0\t2 0\t4 2\t3 3\t6 4\t2 4\t5 5\t3 5\t6 then i write a function and set point '0' as the source point, and calculate the shortest path from point 0 to the others points, the code like this: val source: Seq[VertexId] = Seq(0) val ss = ShortestPaths.run(graph, source) then, i get the run result of all the vertex's shortest path value: (4,Map()) (0,Map(0 - 0)) (6,Map()) (3,Map()) (5,Map()) (2,Map()) but the right result should be: (4,Map(0 - 1)) (0,Map(0 - 0)) (6,Map(0 - 3)) (3,Map(0 - 2)) (5,Map(0 - 2)) (2,Map(0 - 1)) so, i check the source code of spark/graphx/src/main/scala/org/apache/spark/graphx/lib/ShortestPaths.scala and find a bug. The patch list in the following. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10356) MLlib: Normalization should use absolute values
[ https://issues.apache.org/jira/browse/SPARK-10356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-10356. --- Resolution: Not A Problem MLlib: Normalization should use absolute values --- Key: SPARK-10356 URL: https://issues.apache.org/jira/browse/SPARK-10356 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.4.1 Reporter: Carsten Schnober Labels: easyfix Original Estimate: 2h Remaining Estimate: 2h The normalizer does not handle vectors with negative values properly. It can be tested with the following code {code} val normalized = new Normalizer(1.0).transform(v: Vector) normalizer.toArray.sum == 1.0 {code} This yields true if all values in Vector v are positive, but false when v contains one or more negative values. This is because the values in v are taken immediately without applying {{abs()}}, This (probably) does not occur for {{p=2.0}} because the values are squared and hence positive anyway. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10356) MLlib: Normalization should use absolute values
[ https://issues.apache.org/jira/browse/SPARK-10356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14721511#comment-14721511 ] Sean Owen commented on SPARK-10356: --- Exactly. Your code does not compute a 1 norm because you are summing elements and not their absolute values. The normalization in Spark is correct. To be clear normalization makes the norm 1; you are testing some other condition that is not true. MLlib: Normalization should use absolute values --- Key: SPARK-10356 URL: https://issues.apache.org/jira/browse/SPARK-10356 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.4.1 Reporter: Carsten Schnober Labels: easyfix Original Estimate: 2h Remaining Estimate: 2h The normalizer does not handle vectors with negative values properly. It can be tested with the following code {code} val normalized = new Normalizer(1.0).transform(v: Vector) normalizer.toArray.sum == 1.0 {code} This yields true if all values in Vector v are positive, but false when v contains one or more negative values. This is because the values in v are taken immediately without applying {{abs()}}, This (probably) does not occur for {{p=2.0}} because the values are squared and hence positive anyway. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-8684) Update R version in Spark EC2 AMI
[ https://issues.apache.org/jira/browse/SPARK-8684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vincent Warmerdam updated SPARK-8684: - Comment: was deleted (was: closed due to github confusion. reopened due to sanity.) Update R version in Spark EC2 AMI - Key: SPARK-8684 URL: https://issues.apache.org/jira/browse/SPARK-8684 Project: Spark Issue Type: Improvement Components: EC2, SparkR Reporter: Shivaram Venkataraman Priority: Minor Fix For: 1.5.0 Right now the R version in the AMI is 3.1 -- However a number of R libraries need R version 3.2 and it will be good to update the R version on the AMI while launching a EC2 cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8292) ShortestPaths run with error result
[ https://issues.apache.org/jira/browse/SPARK-8292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14721554#comment-14721554 ] Anita Tailor commented on SPARK-8292: - https://issues.apache.org/jira/browse/SPARK-5343 explains existing behaviour is correct. ShortestPaths run with error result --- Key: SPARK-8292 URL: https://issues.apache.org/jira/browse/SPARK-8292 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.3.1 Environment: Ubuntu 64bit Reporter: Bruce Chen Priority: Minor Labels: patch Attachments: ShortestPaths.patch In graphx/lib/ShortestPaths, i run an example with input data: 0\t2 0\t4 2\t3 3\t6 4\t2 4\t5 5\t3 5\t6 then i write a function and set point '0' as the source point, and calculate the shortest path from point 0 to the others points, the code like this: val source: Seq[VertexId] = Seq(0) val ss = ShortestPaths.run(graph, source) then, i get the run result of all the vertex's shortest path value: (4,Map()) (0,Map(0 - 0)) (6,Map()) (3,Map()) (5,Map()) (2,Map()) but the right result should be: (4,Map(0 - 1)) (0,Map(0 - 0)) (6,Map(0 - 3)) (3,Map(0 - 2)) (5,Map(0 - 2)) (2,Map(0 - 1)) so, i check the source code of spark/graphx/src/main/scala/org/apache/spark/graphx/lib/ShortestPaths.scala and find a bug. The patch list in the following. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8684) Update R version in Spark EC2 AMI
[ https://issues.apache.org/jira/browse/SPARK-8684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14721585#comment-14721585 ] Yin Huai commented on SPARK-8684: - ah, I see. I thought the jira of this task had been all done because the fix version is set. Update R version in Spark EC2 AMI - Key: SPARK-8684 URL: https://issues.apache.org/jira/browse/SPARK-8684 Project: Spark Issue Type: Improvement Components: EC2, SparkR Reporter: Shivaram Venkataraman Priority: Minor Fix For: 1.5.0 Right now the R version in the AMI is 3.1 -- However a number of R libraries need R version 3.2 and it will be good to update the R version on the AMI while launching a EC2 cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9642) LinearRegression should supported weighted data
[ https://issues.apache.org/jira/browse/SPARK-9642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14721801#comment-14721801 ] Meihua Wu commented on SPARK-9642: -- [~sethah] Thank you for your help. I worked on this and have a draft version, which makes use of a few components in the PR for a similar issue of logistic regression (https://issues.apache.org/jira/browse/SPARK-7685). I am planning to send a PR after the issue 7685 is resolved. LinearRegression should supported weighted data --- Key: SPARK-9642 URL: https://issues.apache.org/jira/browse/SPARK-9642 Project: Spark Issue Type: New Feature Components: ML Reporter: Meihua Wu Labels: 1.6 In many modeling application, data points are not necessarily sampled with equal probabilities. Linear regression should support weighting which account the over or under sampling. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
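For readers unfamiliar with the request, a plain-Scala sketch of the weighted squared-error objective being asked for (an illustration of the math only, not the eventual Spark implementation): each point contributes w_i * (y_i - x_i . beta)^2, so over- or under-sampled points can be weighted down or up.
{code}
// Weighted least-squares loss: sum_i w_i * (y_i - x_i . beta)^2
case class WeightedPoint(weight: Double, label: Double, features: Array[Double])

def weightedSquaredError(data: Seq[WeightedPoint], beta: Array[Double]): Double =
  data.map { p =>
    val prediction = (p.features, beta).zipped.map(_ * _).sum
    p.weight * math.pow(p.label - prediction, 2)
  }.sum

// Example: the second point was under-sampled, so it counts twice as much.
val data = Seq(
  WeightedPoint(1.0, 2.0, Array(1.0, 0.5)),
  WeightedPoint(2.0, 3.0, Array(1.0, 1.5)))
println(weightedSquaredError(data, Array(1.0, 1.0)))
{code}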
[jira] [Issue Comment Deleted] (SPARK-10329) Cost RDD in k-means|| initialization is not storage-efficient
[ https://issues.apache.org/jira/browse/SPARK-10329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hujiayin updated SPARK-10329: - Comment: was deleted (was: ok, I will try to fix it today) Cost RDD in k-means|| initialization is not storage-efficient - Key: SPARK-10329 URL: https://issues.apache.org/jira/browse/SPARK-10329 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.1, 1.4.1, 1.5.0 Reporter: Xiangrui Meng Assignee: hujiayin Labels: clustering Currently we use `RDD[Vector]` to store point cost during k-means|| initialization, where each `Vector` has size `runs`. This is not storage-efficient because `runs` is usually 1 and then each record is a Vector of size 1. What we need is just the 8 bytes to store the cost, but we introduce two objects (DenseVector and its values array), which could cost 16 bytes. That is 200% overhead. Thanks [~Grace Huang] and Jiayin Hu from Intel for reporting this issue! There are several solutions: 1. Use `RDD[Array[Double]]` instead of `RDD[Vector]`, which saves 8 bytes per record. 2. Use `RDD[Array[Double]]` but batch the values for storage, e.g. each `Array[Double]` object covers 1024 instances, which could remove most of the overhead. Besides, using MEMORY_AND_DISK instead of MEMORY_ONLY could prevent cost RDDs kicking out the training dataset from memory. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10329) Cost RDD in k-means|| initialization is not storage-efficient
[ https://issues.apache.org/jira/browse/SPARK-10329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14721799#comment-14721799 ] hujiayin commented on SPARK-10329: -- ok, I will try to fix it today Cost RDD in k-means|| initialization is not storage-efficient - Key: SPARK-10329 URL: https://issues.apache.org/jira/browse/SPARK-10329 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.1, 1.4.1, 1.5.0 Reporter: Xiangrui Meng Assignee: hujiayin Labels: clustering Currently we use `RDD[Vector]` to store point cost during k-means|| initialization, where each `Vector` has size `runs`. This is not storage-efficient because `runs` is usually 1 and then each record is a Vector of size 1. What we need is just the 8 bytes to store the cost, but we introduce two objects (DenseVector and its values array), which could cost 16 bytes. That is 200% overhead. Thanks [~Grace Huang] and Jiayin Hu from Intel for reporting this issue! There are several solutions: 1. Use `RDD[Array[Double]]` instead of `RDD[Vector]`, which saves 8 bytes per record. 2. Use `RDD[Array[Double]]` but batch the values for storage, e.g. each `Array[Double]` object covers 1024 instances, which could remove most of the overhead. Besides, using MEMORY_AND_DISK instead of MEMORY_ONLY could prevent cost RDDs kicking out the training dataset from memory. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
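A small sketch of option 2 from the issue description (an illustration only, not the eventual patch): pack the per-point costs into Array[Double] chunks so the per-record object overhead is amortized, and persist with MEMORY_AND_DISK so the cost RDD spills to disk instead of evicting the training data.
{code}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Batch per-point costs into chunks of ~1024 doubles per record.
def batchCosts(costs: RDD[Double], batchSize: Int = 1024): RDD[Array[Double]] =
  costs
    .mapPartitions(_.grouped(batchSize).map(_.toArray))
    .persist(StorageLevel.MEMORY_AND_DISK)
{code}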
[jira] [Created] (SPARK-10357) DataFrames unable to drop unwanted columns
Randy Gelhausen created SPARK-10357: --- Summary: DataFrames unable to drop unwanted columns Key: SPARK-10357 URL: https://issues.apache.org/jira/browse/SPARK-10357 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.1 Reporter: Randy Gelhausen spark-csv seems to be exposing an issue with DataFrame's inability to drop unwanted columns. Related GitHub issue: https://github.com/databricks/spark-csv/issues/61 My data (with header) looks like: MI_PRINX,offense_id,rpt_date,occur_date,occur_time,poss_date,poss_time,beat,apt_office_prefix,apt_office_num,location,MinOfucr,MinOfibr_code,dispo_code,MaxOfnum_victims,Shift,Avg Day,loc_type,UC2 Literal,neighborhood,npu,x,y,,, 934782,90360664,2/5/2009,2/3/2009,13:50:00,2/3/2009,15:00:00,305,NULL,NULL,55 MCDONOUGH BLVD SW,670,2308,NULL,1,Day,Tue,35,LARCENY-NON VEHICLE,South Atlanta,Y,-84.38654,33.72024,,, 934783,90370891,2/6/2009,2/6/2009,8:50:00,2/6/2009,10:45:00,502,NULL,NULL,464 ANSLEY WALK TER NW,640,2305,NULL,1,Day,Fri,18,LARCENY-FROM VEHICLE,Ansley Park,E,-84.37276,33.79685,,, Despite using sqlContext (also tried with the programmatic raw.select, same result) to remove columns from the dataframe, attempts to operate on it cause failures. Snippet: // Read CSV file, clean field names val raw = sqlContext.read.format(com.databricks.spark.csv).option(header, true).option(DROPMALFORMED, true).load(input) val columns = raw.columns.map(x = x.replaceAll( , _)) raw.toDF(columns:_*).registerTempTable(table) val clean = sqlContext.sql(select + columns.filter(x = x.length() 0 x != ).mkString(, ) + from + table) System.err.println(clean.schema) System.err.println(clean.columns.mkString(,)) System.err.println(clean.take(1).mkString(|)) StackTrace: 15/08/30 18:23:13 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, docker.dev, NODE_LOCAL, 1482 bytes) 15/08/30 18:23:14 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on docker.dev:58272 (size: 1811.0 B, free: 530.0 MB) 15/08/30 18:23:14 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on docker.dev:58272 (size: 21.9 KB, free: 530.0 MB) 15/08/30 18:23:15 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 1350 ms on docker.dev (1/1) 15/08/30 18:23:15 INFO scheduler.DAGScheduler: ResultStage 0 (take at CsvRelation.scala:174) finished in 1.354 s 15/08/30 18:23:15 INFO cluster.YarnScheduler: Removed TaskSet 0.0, whose tasks have all completed, from pool 15/08/30 18:23:15 INFO scheduler.DAGScheduler: Job 0 finished: take at CsvRelation.scala:174, took 1.413674 s StructType(StructField(MI_PRINX,StringType,true), StructField(offense_id,StringType,true), StructField(rpt_date,StringType,true), StructField(occur_date,StringType,true), StructField(occur_time,StringType,true), StructField(poss_date,StringType,true), StructField(poss_time,StringType,true), StructField(beat,StringType,true), StructField(apt_office_prefix,StringType,true), StructField(apt_office_num,StringType,true), StructField(location,StringType,true), StructField(MinOfucr,StringType,true), StructField(MinOfibr_code,StringType,true), StructField(dispo_code,StringType,true), StructField(MaxOfnum_victims,StringType,true), StructField(Shift,StringType,true), StructField(Avg_Day,StringType,true), StructField(loc_type,StringType,true), StructField(UC2_Literal,StringType,true), StructField(neighborhood,StringType,true), StructField(npu,StringType,true), StructField(x,StringType,true), StructField(y,StringType,true)) 
MI_PRINX,offense_id,rpt_date,occur_date,occur_time,poss_date,poss_time,beat,apt_office_prefix,apt_office_num,location,MinOfucr,MinOfibr_code,dispo_code,MaxOfnum_victims,Shift,Avg_Day,loc_type,UC2_Literal,neighborhood,npu,x,y 15/08/30 18:23:16 INFO storage.MemoryStore: ensureFreeSpace(232400) called with curMem=259660, maxMem=278019440 15/08/30 18:23:16 INFO storage.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 227.0 KB, free 264.7 MB) 15/08/30 18:23:16 INFO storage.MemoryStore: ensureFreeSpace(22377) called with curMem=492060, maxMem=278019440 15/08/30 18:23:16 INFO storage.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 21.9 KB, free 264.6 MB) 15/08/30 18:23:16 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on 172.17.0.19:41088 (size: 21.9 KB, free: 265.1 MB) 15/08/30 18:23:16 INFO spark.SparkContext: Created broadcast 2 from textFile at TextFile.scala:30 Exception in thread main java.lang.IllegalArgumentException: The header contains a duplicate entry: '' in [MI_PRINX, offense_id, rpt_date, occur_date, occur_time, poss_date, poss_time, beat, apt_office_prefix, apt_office_num, location, MinOfucr, MinOfibr_code, dispo_code, MaxOfnum_victims, Shift, Avg Day, loc_type, UC2 Literal,
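For context, a generic column-pruning sketch ({{df}} and the column list are placeholders; this shows how unwanted columns are normally removed and does not by itself address the duplicate-empty-header exception thrown by spark-csv above):
{code}
import org.apache.spark.sql.DataFrame

// Keep only the columns we want; equivalent to chaining df.drop(...) per column.
def pruneColumns(df: DataFrame, unwanted: Seq[String]): DataFrame = {
  val kept = df.columns.filterNot(unwanted.contains)
  df.select(kept.map(df(_)): _*)
}
{code}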
[jira] [Created] (SPARK-10358) Spark-sql throws IOException on exit when using HDFS to store event log.
Sioa Song created SPARK-10358: - Summary: Spark-sql throws IOException on exit when using HDFS to store event log. Key: SPARK-10358 URL: https://issues.apache.org/jira/browse/SPARK-10358 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.1 Environment: * spark-1.3.1-bin-hadoop2.6 * hadoop-2.6.0 * Red hat 2.6.32-504.el6.x86_64 Reporter: Sioa Song Priority: Minor Fix For: 1.3.1 h2. Summary In Spark 1.3.1, if using HDFS to store event log, spark-sql will throw an java.io.IOException: Filesystem closed when exit. h2. How to reproduce 1. Enable event log mechanism, and configure the file location to HDFS. You can do this by setting these two properties in spark-defaults.conf: spark.eventLog.enabled true spark.eventLog.dir hdfs://x:x/spark-events 2. start spark-sql, and type exit once it starts. {noformat} spark-sql exit; 15/08/14 06:29:20 ERROR scheduler.LiveListenerBus: Listener EventLoggingListener threw an exception at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$3.apply(EventLoggingListener.scala:144) at org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$3.apply(EventLoggingListener.scala:144) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:144) at org.apache.spark.scheduler.EventLoggingListener.onApplicationEnd(EventLoggingListener.scala:181) at org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:54) at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:53) at org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:36) at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:76) at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply(AsynchronousListenerBus.scala:61) at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply(AsynchronousListenerBus.scala:61) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1678) at org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:60) Caused by: java.io.IOException: Filesystem closed at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:795) at org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(DFSOutputStream.java:1985) at org.apache.hadoop.hdfs.DFSOutputStream.hflush(DFSOutputStream.java:1946) at org.apache.hadoop.fs.FSDataOutputStream.hflush(FSDataOutputStream.java:130) ... 
19 more 15/08/14 06:29:20 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/metrics/json,null} 15/08/14 06:29:20 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/kill,null} 15/08/14 06:29:20 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/,null} 15/08/14 06:29:20 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/static,null} 15/08/14 06:29:20 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump/json,null} 15/08/14 06:29:20 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump,null} 15/08/14 06:29:20 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/json,null} 15/08/14 06:29:20 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors,null} 15/08/14 06:29:20 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment/json,null} 15/08/14 06:29:20 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment,null} 15/08/14 06:29:20 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd/json,null} 15/08/14 06:29:20 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd,null} 15/08/14 06:29:20 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/json,null} 15/08/14 06:29:20 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage,null} 15/08/14 06:29:20 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool/json,null} 15/08/14 06:29:20 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool,null} 15/08/14 06:29:20 INFO
[jira] [Commented] (SPARK-9666) ML 1.5 QA: model save/load audit
[ https://issues.apache.org/jira/browse/SPARK-9666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14721906#comment-14721906 ] yuhao yang commented on SPARK-9666: --- models have no change in 1.5: LogisticRegressionModel LassoModel SVMModel RidgeRegressionModel NaiveBayesModel LinearRegressionModel DecisionTreeModel MatrixFactorizationModel RandomForestModel GradientBoostedTreesModel ML 1.5 QA: model save/load audit Key: SPARK-9666 URL: https://issues.apache.org/jira/browse/SPARK-9666 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Joseph K. Bradley Assignee: yuhao yang We should check to make sure no changes broke model import/export in spark.mllib. * If a model's name, data members, or constructors have changed _at all_, then we likely need to support a new save/load format version. Different versions must be tested in unit tests to ensure backwards compatibility (i.e., verify we can load old model formats). * Examples in the programming guide should include save/load when available. It's important to try running each example in the guide whenever it is modified (since there are no automated tests). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10199) Avoid using reflections for parquet model save
[ https://issues.apache.org/jira/browse/SPARK-10199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14721907#comment-14721907 ] Feynman Liang commented on SPARK-10199: --- [~vinodkc] Thanks! I think these results are convincing. Let's see what others think but FWIW I'm all for these changes, particularly because it sets a precedent for future model save/load to explicitly specify the schema. Avoid using reflections for parquet model save -- Key: SPARK-10199 URL: https://issues.apache.org/jira/browse/SPARK-10199 Project: Spark Issue Type: Improvement Components: ML, MLlib Reporter: Feynman Liang Priority: Minor These items are not high priority since the overhead of writing to Parquet is much greater than for runtime reflection. Multiple model save/load in MLlib use case classes to infer a schema for the data frame saved to Parquet. However, inferring a schema from case classes or tuples uses [runtime reflection|https://github.com/apache/spark/blob/d7b4c095271c36fcc7f9ded267ecf5ec66fac803/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L361] which is unnecessary since the types are already known at the time `save` is called. It would be better to just specify the schema for the data frame directly using {{sqlContext.createDataFrame(dataRDD, schema)}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
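A minimal sketch of the change being discussed (field names and the output path are illustrative, not the actual model save format; {{sc}} and {{sqlContext}} are assumed to exist): build the Row RDD and pass an explicit StructType to createDataFrame instead of letting toDF infer the schema from a case class via runtime reflection.
{code}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

val schema = StructType(Seq(
  StructField("weight", DoubleType, nullable = false),
  StructField("intercept", DoubleType, nullable = false)))

val rows = sc.parallelize(Seq(Row(0.5, 1.2), Row(-0.3, 0.0)))
val df = sqlContext.createDataFrame(rows, schema)
df.write.parquet("/tmp/model-data")   // hypothetical output path
{code}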
[jira] [Commented] (SPARK-9642) LinearRegression should support weighted data
[ https://issues.apache.org/jira/browse/SPARK-9642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14721791#comment-14721791 ] Seth Hendrickson commented on SPARK-9642: - I'd like to take this one if no one else is working on it. LinearRegression should support weighted data --- Key: SPARK-9642 URL: https://issues.apache.org/jira/browse/SPARK-9642 Project: Spark Issue Type: New Feature Components: ML Reporter: Meihua Wu Labels: 1.6 In many modeling applications, data points are not necessarily sampled with equal probability. Linear regression should support instance weights that account for over- or under-sampling. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
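For reference, the objective such support would optimize is weighted least squares, where each instance carries a sampling weight w_i and ordinary least squares is the special case w_i = 1:

\min_{\beta} \sum_{i=1}^{n} w_i \, (y_i - x_i^{\top} \beta)^2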
[jira] [Commented] (SPARK-10264) Add @Since annotation to ml.recommendation
[ https://issues.apache.org/jira/browse/SPARK-10264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14722969#comment-14722969 ] Apache Spark commented on SPARK-10264: -- User 'tijoparacka' has created a pull request for this issue: https://github.com/apache/spark/pull/8532 > Add @Since annotation to ml.recommendation > -- > > Key: SPARK-10264 > URL: https://issues.apache.org/jira/browse/SPARK-10264 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML >Reporter: Xiangrui Meng >Priority: Minor > Labels: starter > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
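As a rough illustration of the convention such a pull request applies inside Spark's own source tree (the class, members, and version strings below are made up, not actual ml.recommendation code):

{code:scala}
import org.apache.spark.annotation.Since

// Each public class, constructor, constructor parameter, and method carries
// the release in which it first appeared.
@Since("1.3.0")
class ExampleEstimator @Since("1.3.0") (@Since("1.4.0") val rank: Int) {

  @Since("1.5.0")
  def describe: String = s"rank=$rank"
}
{code}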
[jira] [Assigned] (SPARK-10264) Add @Since annotation to ml.recommendation
[ https://issues.apache.org/jira/browse/SPARK-10264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10264: Assignee: Apache Spark > Add @Since annotation to ml.recommendation > -- > > Key: SPARK-10264 > URL: https://issues.apache.org/jira/browse/SPARK-10264 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML >Reporter: Xiangrui Meng >Assignee: Apache Spark >Priority: Minor > Labels: starter > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10264) Add @Since annotation to ml.recommendation
[ https://issues.apache.org/jira/browse/SPARK-10264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10264: Assignee: (was: Apache Spark) > Add @Since annotation to ml.recommendation > -- > > Key: SPARK-10264 > URL: https://issues.apache.org/jira/browse/SPARK-10264 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML >Reporter: Xiangrui Meng >Priority: Minor > Labels: starter > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6724) Model import/export for FPGrowth
[ https://issues.apache.org/jira/browse/SPARK-6724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14722999#comment-14722999 ] Meethu Mathew commented on SPARK-6724: -- [~josephkb] Could you please give your opinion on this? > Model import/export for FPGrowth > > > Key: SPARK-6724 > URL: https://issues.apache.org/jira/browse/SPARK-6724 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley >Priority: Minor > > Note: experimental model API -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9545) Run Maven tests in pull request builder if title has [test-maven] in it
[ https://issues.apache.org/jira/browse/SPARK-9545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-9545: --- Summary: Run Maven tests in pull request builder if title has [test-maven] in it (was: Run Maven tests in pull request builder if title has [maven-test] in it) Run Maven tests in pull request builder if title has [test-maven] in it - Key: SPARK-9545 URL: https://issues.apache.org/jira/browse/SPARK-9545 Project: Spark Issue Type: Improvement Components: Build Reporter: Patrick Wendell Assignee: Patrick Wendell Fix For: 1.6.0 We have infrastructure now in the build tooling for running maven tests, but it's not actually used anywhere. With a very minor change we can support running maven tests if the pull request title has maven-test in it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9547) Allow testing pull requests with different Hadoop versions
[ https://issues.apache.org/jira/browse/SPARK-9547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-9547. Resolution: Fixed Fix Version/s: 1.6.0 Allow testing pull requests with different Hadoop versions -- Key: SPARK-9547 URL: https://issues.apache.org/jira/browse/SPARK-9547 Project: Spark Issue Type: Improvement Components: Build Reporter: Patrick Wendell Assignee: Patrick Wendell Fix For: 1.6.0 Similar to SPARK-9545 we should allow testing different Hadoop profiles in the PRB. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9545) Run Maven tests in pull request builder if title has [maven-test] in it
[ https://issues.apache.org/jira/browse/SPARK-9545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-9545. Resolution: Fixed Fix Version/s: 1.6.0 Run Maven tests in pull request builder if title has [maven-test] in it - Key: SPARK-9545 URL: https://issues.apache.org/jira/browse/SPARK-9545 Project: Spark Issue Type: Improvement Components: Build Reporter: Patrick Wendell Assignee: Patrick Wendell Fix For: 1.6.0 We have infrastructure now in the build tooling for running maven tests, but it's not actually used anywhere. With a very minor change we can support running maven tests if the pull request title has maven-test in it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10359) Enumerate Spark's dependencies in a file and diff against it for new pull requests
[ https://issues.apache.org/jira/browse/SPARK-10359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10359: Assignee: Apache Spark (was: Patrick Wendell) Enumerate Spark's dependencies in a file and diff against it for new pull requests --- Key: SPARK-10359 URL: https://issues.apache.org/jira/browse/SPARK-10359 Project: Spark Issue Type: New Feature Components: Build Reporter: Patrick Wendell Assignee: Apache Spark Sometimes when we have dependency changes it can be pretty unclear which transitive dependencies are changing. If we enumerate all of the dependencies and put them in a source file in the repo, it becomes explicit exactly what is changing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10359) Enumerate Spark's dependencies in a file and diff against it for new pull requests
[ https://issues.apache.org/jira/browse/SPARK-10359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10359: Assignee: Patrick Wendell (was: Apache Spark) Enumerate Spark's dependencies in a file and diff against it for new pull requests --- Key: SPARK-10359 URL: https://issues.apache.org/jira/browse/SPARK-10359 Project: Spark Issue Type: New Feature Components: Build Reporter: Patrick Wendell Assignee: Patrick Wendell Sometimes when we have dependency changes it can be pretty unclear which transitive dependencies are changing. If we enumerate all of the dependencies and put them in a source file in the repo, it becomes explicit exactly what is changing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10359) Enumerate Spark's dependencies in a file and diff against it for new pull requests
[ https://issues.apache.org/jira/browse/SPARK-10359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14722716#comment-14722716 ] Apache Spark commented on SPARK-10359: -- User 'pwendell' has created a pull request for this issue: https://github.com/apache/spark/pull/8531 Enumerate Spark's dependencies in a file and diff against it for new pull requests --- Key: SPARK-10359 URL: https://issues.apache.org/jira/browse/SPARK-10359 Project: Spark Issue Type: New Feature Components: Build Reporter: Patrick Wendell Assignee: Patrick Wendell Sometimes when we have dependency changes it can be pretty unclear which transitive dependencies are changing. If we enumerate all of the dependencies and put them in a source file in the repo, it becomes explicit exactly what is changing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10359) Enumerate Spark's dependencies in a file and diff against it for new pull requests
Patrick Wendell created SPARK-10359: --- Summary: Enumerate Spark's dependencies in a file and diff against it for new pull requests Key: SPARK-10359 URL: https://issues.apache.org/jira/browse/SPARK-10359 Project: Spark Issue Type: New Feature Components: Build Reporter: Patrick Wendell Assignee: Patrick Wendell Sometimes when we have dependency changes it can be pretty unclear which transitive dependencies are changing. If we enumerate all of the dependencies and put them in a source file in the repo, it becomes explicit exactly what is changing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
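A minimal sketch of the diff step, assuming the checked-in manifest and the freshly generated dependency list are plain-text files with one coordinate per line (file names and format are illustrative, not the actual scripts added for this issue):

{code:scala}
import scala.io.Source

object DependencyDiff {
  def main(args: Array[String]): Unit = {
    // e.g. args = Array("dev/deps/spark-deps", "/tmp/current-deps")
    val Array(manifestPath, generatedPath) = args

    def load(path: String): Set[String] =
      Source.fromFile(path).getLines().map(_.trim).filter(_.nonEmpty).toSet

    val manifest = load(manifestPath)
    val generated = load(generatedPath)

    // Print every unreviewed addition or removal and fail the PR build if any exist.
    val added = (generated -- manifest).toSeq.sorted.map("+ " + _)
    val removed = (manifest -- generated).toSeq.sorted.map("- " + _)
    (added ++ removed).foreach(println)
    if (added.nonEmpty || removed.nonEmpty) sys.exit(1)
  }
}
{code}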