[jira] [Updated] (SPARK-10329) Cost RDD in k-means|| initialization is not storage-efficient

2015-08-30 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10329:
--
Assignee: hujiayin

 Cost RDD in k-means|| initialization is not storage-efficient
 -

 Key: SPARK-10329
 URL: https://issues.apache.org/jira/browse/SPARK-10329
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.1, 1.4.1, 1.5.0
Reporter: Xiangrui Meng
Assignee: hujiayin
  Labels: clustering

 Currently we use `RDD[Vector]` to store point cost during k-means|| 
 initialization, where each `Vector` has size `runs`. This is not 
 storage-efficient because `runs` is usually 1 and then each record is a 
 Vector of size 1. What we need is just the 8 bytes to store the cost, but we 
 introduce two objects (DenseVector and its values array), which could cost 16 
 bytes. That is 200% overhead. Thanks [~Grace Huang] and Jiayin Hu from Intel 
 for reporting this issue!
 There are several solutions:
 1. Use `RDD[Array[Double]]` instead of `RDD[Vector]`, which saves 8 bytes per 
 record.
 2. Use `RDD[Array[Double]]` but batch the values for storage, e.g. each 
 `Array[Double]` object covers 1024 instances, which could remove most of the 
 overhead.
 Besides, using MEMORY_AND_DISK instead of MEMORY_ONLY could prevent cost RDDs 
 kicking out the training dataset from memory.
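For illustration, option 2 could look roughly like the following sketch (the helper name batchCosts is made up here, not the actual MLlib code), assuming `runs == 1` so each `Vector` holds a single cost:

{code}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Batch single-run point costs into arrays of up to 1024 doubles per record,
// and persist with MEMORY_AND_DISK so the cost RDD spills to disk instead of
// evicting the cached training data.
def batchCosts(costs: RDD[Vector], batchSize: Int = 1024): RDD[Array[Double]] = {
  costs.mapPartitions { iter =>
    iter.map(v => v(0)).grouped(batchSize).map(_.toArray)
  }.persist(StorageLevel.MEMORY_AND_DISK)
}
{code}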






[jira] [Assigned] (SPARK-10290) Spark can register temp table and hive table with the same table name

2015-08-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10290:


Assignee: Apache Spark

 Spark can register temp table and hive table with the same table name
 -

 Key: SPARK-10290
 URL: https://issues.apache.org/jira/browse/SPARK-10290
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.1
Reporter: pin_zhang
Assignee: Apache Spark

 Spark SQL allows creating a Hive table and registering a temp table with the 
 same name; there is then no way to run a query against the Hive table with the 
 following code:
 // register hive table
 DataFrame df = hctx_.read().json("test.json");
 df.write().mode(SaveMode.Overwrite).saveAsTable("test");
 // register temp table
 hctx_.registerDataFrameAsTable(hctx_.sql("select id from test"), "test");






[jira] [Commented] (SPARK-10290) Spark can register temp table and hive table with the same table name

2015-08-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14721493#comment-14721493
 ] 

Apache Spark commented on SPARK-10290:
--

User 'mzorro' has created a pull request for this issue:
https://github.com/apache/spark/pull/8529

 Spark can register temp table and hive table with the same table name
 -

 Key: SPARK-10290
 URL: https://issues.apache.org/jira/browse/SPARK-10290
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.1
Reporter: pin_zhang

 Spark SQL allows creating a Hive table and registering a temp table with the 
 same name; there is then no way to run a query against the Hive table with the 
 following code:
 // register hive table
 DataFrame df = hctx_.read().json("test.json");
 df.write().mode(SaveMode.Overwrite).saveAsTable("test");
 // register temp table
 hctx_.registerDataFrameAsTable(hctx_.sql("select id from test"), "test");






[jira] [Assigned] (SPARK-10290) Spark can register temp table and hive table with the same table name

2015-08-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10290:


Assignee: (was: Apache Spark)

 Spark can register temp table and hive table with the same table name
 -

 Key: SPARK-10290
 URL: https://issues.apache.org/jira/browse/SPARK-10290
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.1
Reporter: pin_zhang

 Spark SQL allows creating a Hive table and registering a temp table with the 
 same name; there is then no way to run a query against the Hive table with the 
 following code:
 // register hive table
 DataFrame df = hctx_.read().json("test.json");
 df.write().mode(SaveMode.Overwrite).saveAsTable("test");
 // register temp table
 hctx_.registerDataFrameAsTable(hctx_.sql("select id from test"), "test");






[jira] [Comment Edited] (SPARK-10329) Cost RDD in k-means|| initialization is not storage-efficient

2015-08-30 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14721419#comment-14721419
 ] 

Xiangrui Meng edited comment on SPARK-10329 at 8/30/15 6:53 AM:


Assigned. I will send a small PR to fix apparent issues (SPARK-10354), 
hopefully in 1.5.


was (Author: mengxr):
Assigned. I will send a small PR to fix apparent issues (SPARK-10354).

 Cost RDD in k-means|| initialization is not storage-efficient
 -

 Key: SPARK-10329
 URL: https://issues.apache.org/jira/browse/SPARK-10329
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.1, 1.4.1, 1.5.0
Reporter: Xiangrui Meng
Assignee: hujiayin
  Labels: clustering

 Currently we use `RDD[Vector]` to store point cost during k-means|| 
 initialization, where each `Vector` has size `runs`. This is not 
 storage-efficient because `runs` is usually 1 and then each record is a 
 Vector of size 1. What we need is just the 8 bytes to store the cost, but we 
 introduce two objects (DenseVector and its values array), which could cost 16 
 bytes. That is 200% overhead. Thanks [~Grace Huang] and Jiayin Hu from Intel 
 for reporting this issue!
 There are several solutions:
 1. Use `RDD[Array[Double]]` instead of `RDD[Vector]`, which saves 8 bytes per 
 record.
 2. Use `RDD[Array[Double]]` but batch the values for storage, e.g. each 
 `Array[Double]` object covers 1024 instances, which could remove most of the 
 overhead.
 Besides, using MEMORY_AND_DISK instead of MEMORY_ONLY could prevent cost RDDs 
 kicking out the training dataset from memory.






[jira] [Updated] (SPARK-10184) Optimization for bounds determination in RangePartitioner

2015-08-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10184:
--
Assignee: Jigao Fu

 Optimization for bounds determination in RangePartitioner
 -

 Key: SPARK-10184
 URL: https://issues.apache.org/jira/browse/SPARK-10184
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Jigao Fu
Assignee: Jigao Fu
Priority: Minor
 Fix For: 1.6.0

   Original Estimate: 10m
  Remaining Estimate: 10m

 Change {{cumWeight > target}} to {{cumWeight >= target}} in the 
 {{RangePartitioner.determineBounds}} method to make the output partitions 
 more balanced.
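As a rough sketch of the idea (not the actual Spark implementation), the bounds loop walks weighted, sorted candidates and emits a bound each time the cumulative weight reaches the per-partition target; using >= closes a partition when the target is hit exactly:

{code}
// Simplified sketch of bounds determination over (key, weight) candidates,
// assuming they are already sorted by key. Emitting a bound on >= (rather than >)
// also fires when the target is reached exactly, keeping partitions more balanced.
def determineBoundsSketch(ordered: Array[(Double, Float)], partitions: Int): Array[Double] = {
  val totalWeight = ordered.map(_._2.toDouble).sum
  val step = totalWeight / partitions
  val bounds = scala.collection.mutable.ArrayBuffer.empty[Double]
  var cumWeight = 0.0
  var target = step
  var i = 0
  while (i < ordered.length && bounds.length < partitions - 1) {
    val (key, weight) = ordered(i)
    cumWeight += weight
    if (cumWeight >= target) {  // the change proposed here: > became >=
      bounds += key
      target += step
    }
    i += 1
  }
  bounds.toArray
}
{code}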






[jira] [Commented] (SPARK-10199) Avoid using reflections for parquet model save

2015-08-30 Thread Vinod KC (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14721491#comment-14721491
 ] 

Vinod KC commented on SPARK-10199:
--

[~fliang], as you suggested:

1) I've made micro-benchmarks by surrounding createDataFrame in model save 
methods and Loader.checkSchema in load methods with the timing code below:

def time[R](block: => R): R = {
  val t0 = System.nanoTime()
  val result = block
  val t1 = System.nanoTime()
  println("Elapsed time: " + (t1 - t0) + "ns")
  result
}
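For example, a hypothetical use of this helper around any expression:

// Hypothetical usage of the time() helper above; any block works.
val total = time {
  (1 to 1000000).map(math.sqrt(_)).sum
}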

2) Then I ran mllib test suites on code before and after the change.

Please see the measurements and the performance gain (%) in this Google Sheet:

https://docs.google.com/spreadsheets/d/1TPUctB62xAHx0IaJttyx98MjRo4zVmO4neTkdi7uVDs/edit?usp=sharing

There is a good performance improvement without reflection.


 Avoid using reflections for parquet model save
 --

 Key: SPARK-10199
 URL: https://issues.apache.org/jira/browse/SPARK-10199
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib
Reporter: Feynman Liang
Priority: Minor

 These items are not high priority since the overhead of writing to Parquet is 
 much greater than that of runtime reflection.
 Multiple model save/load in MLlib use case classes to infer a schema for the 
 data frame saved to Parquet. However, inferring a schema from case classes or 
 tuples uses [runtime 
 reflection|https://github.com/apache/spark/blob/d7b4c095271c36fcc7f9ded267ecf5ec66fac803/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L361]
  which is unnecessary since the types are already known at the time `save` is 
 called.
 It would be better to just specify the schema for the data frame directly 
 using {{sqlContext.createDataFrame(dataRDD, schema)}}.
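A minimal sketch of that suggestion (illustrative field names and output path, not the actual MLlib save code), in spark-shell style with an existing {{sc}} and {{sqlContext}}:

{code}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

// Build the schema explicitly instead of inferring it from a case class
// through runtime reflection.
val schema = StructType(Seq(
  StructField("weight", DoubleType, nullable = false),
  StructField("intercept", DoubleType, nullable = false)))

val dataRDD = sc.parallelize(Seq(Row(1.0, 0.5), Row(2.0, -0.5)))
val modelDF = sqlContext.createDataFrame(dataRDD, schema)
modelDF.write.parquet("/tmp/model/data")  // illustrative output path
{code}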






[jira] [Created] (SPARK-10356) MLlib: Normalization should use absolute values

2015-08-30 Thread Carsten Schnober (JIRA)
Carsten Schnober created SPARK-10356:


 Summary: MLlib: Normalization should use absolute values
 Key: SPARK-10356
 URL: https://issues.apache.org/jira/browse/SPARK-10356
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.4.1
Reporter: Carsten Schnober


The normalizer does not handle vectors with negative values properly. It can be 
tested with the following code:

{{
val normalized = new Normalizer(1.0).transform(v: Vector)
normalized.toArray.sum == 1.0
}}

This yields true if all values in Vector v are positive, but false when v 
contains one or more negative values. This is because the values in v are taken 
directly, without applying {{abs()}}.

This (probably) does not occur for {{p=2.0}} because the values are squared and 
hence positive anyway.
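A quick spark-shell style check of the reported behaviour (this reproduces the example discussed later in this thread):

{code}
import org.apache.spark.mllib.feature.Normalizer
import org.apache.spark.mllib.linalg.Vectors

// With p = 1, the absolute values of the result sum to 1; the raw values do not.
val v = Vectors.dense(-1.0, 2.0)
val normalized = new Normalizer(1.0).transform(v)
normalized.toArray.sum               // 0.333... (not 1.0)
normalized.toArray.map(math.abs).sum // 1.0
{code}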






[jira] [Updated] (SPARK-10356) MLlib: Normalization should use absolute values

2015-08-30 Thread Carsten Schnober (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Carsten Schnober updated SPARK-10356:
-
Description: 
The normalizer does not handle vectors with negative values properly. It can be 
tested with the following code:

{code}
val normalized = new Normalizer(1.0).transform(v: Vector)
normalized.toArray.sum == 1.0
{code}

This yields true if all values in Vector v are positive, but false when v 
contains one or more negative values. This is because the values in v are taken 
directly, without applying {{abs()}}.

This (probably) does not occur for {{p=2.0}} because the values are squared and 
hence positive anyway.

  was:
The normalizer does not handle vectors with negative values properly. It can be 
tested with the following code

{{val normalized = new Normalizer(1.0).transform(v: Vector)
normalizer.toArray.sum == 1.0}}

This yields true if all values in Vector v are positive, but false when v 
contains one or more negative values. This is because the values in v are taken 
immediately without applying {{abs()}},

This (probably) does not occur for {{p=2.0}} because the values are squared and 
hence positive anyway.


 MLlib: Normalization should use absolute values
 ---

 Key: SPARK-10356
 URL: https://issues.apache.org/jira/browse/SPARK-10356
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.4.1
Reporter: Carsten Schnober
  Labels: easyfix
   Original Estimate: 2h
  Remaining Estimate: 2h

 The normalizer does not handle vectors with negative values properly. It can 
 be tested with the following code:
 {code}
 val normalized = new Normalizer(1.0).transform(v: Vector)
 normalized.toArray.sum == 1.0
 {code}
 This yields true if all values in Vector v are positive, but false when v 
 contains one or more negative values. This is because the values in v are 
 taken directly, without applying {{abs()}}.
 This (probably) does not occur for {{p=2.0}} because the values are squared 
 and hence positive anyway.






[jira] [Updated] (SPARK-10356) MLlib: Normalization should use absolute values

2015-08-30 Thread Carsten Schnober (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Carsten Schnober updated SPARK-10356:
-
Description: 
The normalizer does not handle vectors with negative values properly. It can be 
tested with the following code:

{{val normalized = new Normalizer(1.0).transform(v: Vector)
normalized.toArray.sum == 1.0}}

This yields true if all values in Vector v are positive, but false when v 
contains one or more negative values. This is because the values in v are taken 
directly, without applying {{abs()}}.

This (probably) does not occur for {{p=2.0}} because the values are squared and 
hence positive anyway.

  was:
The normalizer does not handle vectors with negative values properly. It can be 
tested with the following code

{{
val normalized = new Normalizer(1.0).transform(v: Vector)
normalizer.toArray.sum == 1.0
}}

This yields true if all values in Vector v are positive, but false when v 
contains one or more negative values. This is because the values in v are taken 
immediately without applying {{abs()}},

This (probably) does not occur for {{p=2.0}} because the values are squared and 
hence positive anyway.


 MLlib: Normalization should use absolute values
 ---

 Key: SPARK-10356
 URL: https://issues.apache.org/jira/browse/SPARK-10356
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.4.1
Reporter: Carsten Schnober
  Labels: easyfix
   Original Estimate: 2h
  Remaining Estimate: 2h

 The normalizer does not handle vectors with negative values properly. It can 
 be tested with the following code:
 {{val normalized = new Normalizer(1.0).transform(v: Vector)
 normalized.toArray.sum == 1.0}}
 This yields true if all values in Vector v are positive, but false when v 
 contains one or more negative values. This is because the values in v are 
 taken directly, without applying {{abs()}}.
 This (probably) does not occur for {{p=2.0}} because the values are squared 
 and hence positive anyway.






[jira] [Resolved] (SPARK-10184) Optimization for bounds determination in RangePartitioner

2015-08-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-10184.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8397
[https://github.com/apache/spark/pull/8397]

 Optimization for bounds determination in RangePartitioner
 -

 Key: SPARK-10184
 URL: https://issues.apache.org/jira/browse/SPARK-10184
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Jigao Fu
Priority: Minor
 Fix For: 1.6.0

   Original Estimate: 10m
  Remaining Estimate: 10m

 Change {{cumWeight > target}} to {{cumWeight >= target}} in the 
 {{RangePartitioner.determineBounds}} method to make the output partitions 
 more balanced.






[jira] [Updated] (SPARK-10226) Error occurred in SparkSQL when using !=

2015-08-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10226:
--
Assignee: wangwei

 Error occurred in SparkSQL when using !=
 

 Key: SPARK-10226
 URL: https://issues.apache.org/jira/browse/SPARK-10226
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: wangwei
Assignee: wangwei
 Fix For: 1.5.0


 DataSource:  
 src/main/resources/kv1.txt
 SQL: 
   1. create table src(id string, name string);
   2. load data local inpath 
 '${SparkHome}/examples/src/main/resources/kv1.txt' into table src;
   3. select count( * ) from src where id != '0';
 [ERROR] Could not expand event
 java.lang.IllegalArgumentException: != 0;: event not found
   at jline.console.ConsoleReader.expandEvents(ConsoleReader.java:779)
   at jline.console.ConsoleReader.finishBuffer(ConsoleReader.java:631)
   at jline.console.ConsoleReader.accept(ConsoleReader.java:2019)
   at jline.console.ConsoleReader.readLine(ConsoleReader.java:2666)
   at jline.console.ConsoleReader.readLine(ConsoleReader.java:2269)
   at 
 org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:231)
   at 
 org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:601)
   at 
 org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:666)
   at 
 org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:178)
   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:203)
   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:118)
   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)






[jira] [Updated] (SPARK-10350) Fix SQL Programming Guide

2015-08-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10350:
--
Assignee: Guoqiang Li

 Fix SQL Programming Guide
 -

 Key: SPARK-10350
 URL: https://issues.apache.org/jira/browse/SPARK-10350
 Project: Spark
  Issue Type: Bug
  Components: Documentation, SQL
Affects Versions: 1.5.0
Reporter: Guoqiang Li
Assignee: Guoqiang Li
Priority: Minor
 Fix For: 1.5.0


 [b93d99a|https://github.com/apache/spark/commit/b93d99ae21b8b3af1dd55775f77e5a9ddea48f95#diff-d8aa7a37d17a1227cba38c99f9f22511R1383]
  contains duplicate content:  {{spark.sql.parquet.mergeSchema}}






[jira] [Created] (SPARK-10355) Add Python API for SQLTransformer

2015-08-30 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-10355:
---

 Summary: Add Python API for SQLTransformer
 Key: SPARK-10355
 URL: https://issues.apache.org/jira/browse/SPARK-10355
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Reporter: Yanbo Liang
Priority: Minor


Add Python API for SQLTransformer






[jira] [Updated] (SPARK-10301) For struct type, if parquet's global schema has less fields than a file's schema, data reading will fail

2015-08-30 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-10301:
---
Description: 
We hit this issue when reading a complex Parquet dataset without turning on 
schema merging.  The data set consists of Parquet files with different but 
compatible schemas.  In this case, the schema of the dataset is defined by 
either a summary file or a random physical Parquet file if no summary files are 
available.  Apparently, this schema may not contain all fields that appear in 
all physical files.

Parquet was designed with schema evolution and column pruning in mind, so it 
should be legal for a user to use a tailored schema to read the dataset to save 
disk IO.  For example, say we have a Parquet dataset consisting of two physical 
Parquet files with the following two schemas:
{noformat}
message m0 {
  optional group f0 {
    optional int64 f00;
    optional int64 f01;
  }
}

message m1 {
  optional group f0 {
    optional int64 f00;
    optional int64 f01;
    optional int64 f02;
  }

  optional double f1;
}
{noformat}
Users should be allowed to read the dataset with the following schema:
{noformat}
message m1 {
  optional group f0 {
    optional int64 f01;
    optional int64 f02;
  }
}
{noformat}
so that {{f0.f00}} and {{f1}} are never touched.  The above case can be 
expressed by the following {{spark-shell}} snippet:
{noformat}
import sqlContext._
import sqlContext.implicits._
import org.apache.spark.sql.types.{LongType, StructType}

val path = "/tmp/spark/parquet"
range(3).selectExpr("NAMED_STRUCT('f00', id, 'f01', id) AS f0").coalesce(1)
  .write.mode("overwrite").parquet(path)

range(3).selectExpr("NAMED_STRUCT('f00', id, 'f01', id, 'f02', id) AS f0",
    "CAST(id AS DOUBLE) AS f1").coalesce(1)
  .write.mode("append").parquet(path)

val tailoredSchema =
  new StructType()
    .add(
      "f0",
      new StructType()
        .add("f01", LongType, nullable = true)
        .add("f02", LongType, nullable = true),
      nullable = true)

read.schema(tailoredSchema).parquet(path).show()
{noformat}
Expected output should be:
{noformat}
++
|  f0|
++
|[0,null]|
|[1,null]|
|[2,null]|
|   [0,0]|
|   [1,1]|
|   [2,2]|
++
{noformat}
However, current 1.5-SNAPSHOT version throws the following exception:
{noformat}
org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in 
block -1 in file 
hdfs://localhost:9000/tmp/spark/parquet/part-r-0-56c4604e-c546-4f97-a316-05da8ab1a0bf.gz.parquet
at 
org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228)
at 
org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
at 
org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:168)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1844)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1844)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 2
at 
org.apache.spark.sql.execution.datasources.parquet.CatalystRowConverter.getConverter(CatalystRowConverter.scala:206)
at 

[jira] [Updated] (SPARK-10023) Unified DecisionTreeParams checkpointInterval between Scala and Python API.

2015-08-30 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-10023:

Description: 
checkpointInterval is a member of DecisionTreeParams in the Scala API, which is 
inconsistent with the Python API; we should unify them.

* checkpointInterval
** member of DecisionTreeParams - Scala API
** shared param used for all ML Transformer/Estimator - Python API

Proposal:
checkpointInterval is also used by ALS, but its meaning there is different 
from here.
So we make checkpointInterval a member of DecisionTreeParams for the Python API. 
Because it is only validated when cacheNodeIds is true and the checkpoint 
directory is set in the SparkContext, it is not a common shared param.

  was:
checkpointInterval is member of DecisionTreeParams in Scala API which is 
inconsistency with Python API, we should unified them.

* checkpointInterval
** member of DecisionTreeParams - Scala API
** shared param used for all ML Transformer/Estimator - Python API

Proposal:
 checkpointInterval also used at ALS but the meaning for that is different 
from here.
So we make checkpointInterval member of DecisionTreeParams for Python API, it 
only validate when cacheNodeIds is true and the checkpoint directory is set in 
the SparkContext.


 Unified DecisionTreeParams checkpointInterval between Scala and Python API.
 -

 Key: SPARK-10023
 URL: https://issues.apache.org/jira/browse/SPARK-10023
 Project: Spark
  Issue Type: Sub-task
  Components: ML, PySpark
Reporter: Yanbo Liang

 checkpointInterval is a member of DecisionTreeParams in the Scala API, which is 
 inconsistent with the Python API; we should unify them.
 * checkpointInterval
 ** member of DecisionTreeParams - Scala API
 ** shared param used for all ML Transformer/Estimator - Python API
 Proposal:
 checkpointInterval is also used by ALS, but its meaning there is 
 different from here.
 So we make checkpointInterval a member of DecisionTreeParams for the Python API. 
 Because it is only validated when cacheNodeIds is true and the checkpoint 
 directory is set in the SparkContext, it is not a common shared param.






[jira] [Commented] (SPARK-10023) Unified DecisionTreeParams checkpointInterval between Scala and Python API.

2015-08-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14721477#comment-14721477
 ] 

Apache Spark commented on SPARK-10023:
--

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/8528

 Unified DecisionTreeParams checkpointInterval between Scala and Python API.
 -

 Key: SPARK-10023
 URL: https://issues.apache.org/jira/browse/SPARK-10023
 Project: Spark
  Issue Type: Sub-task
  Components: ML, PySpark
Reporter: Yanbo Liang

 checkpointInterval is a member of DecisionTreeParams in the Scala API, which is 
 inconsistent with the Python API; we should unify them.
 * checkpointInterval
 ** member of DecisionTreeParams - Scala API
 ** shared param used for all ML Transformer/Estimator - Python API
 Proposal:
 checkpointInterval is also used by ALS, but its meaning there is different 
 from here.
 So we make checkpointInterval a member of DecisionTreeParams for the Python API; 
 it is only validated when cacheNodeIds is true and the checkpoint directory is 
 set in the SparkContext.






[jira] [Assigned] (SPARK-10023) Unified DecisionTreeParams checkpointInterval between Scala and Python API.

2015-08-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10023:


Assignee: (was: Apache Spark)

 Unified DecisionTreeParams checkpointInterval between Scala and Python API.
 -

 Key: SPARK-10023
 URL: https://issues.apache.org/jira/browse/SPARK-10023
 Project: Spark
  Issue Type: Sub-task
  Components: ML, PySpark
Reporter: Yanbo Liang

 checkpointInterval is a member of DecisionTreeParams in the Scala API, which is 
 inconsistent with the Python API; we should unify them.
 * checkpointInterval
 ** member of DecisionTreeParams - Scala API
 ** shared param used for all ML Transformer/Estimator - Python API
 Proposal:
 checkpointInterval is also used by ALS, but its meaning there is different 
 from here.
 So we make checkpointInterval a member of DecisionTreeParams for the Python API; 
 it is only validated when cacheNodeIds is true and the checkpoint directory is 
 set in the SparkContext.






[jira] [Commented] (SPARK-9926) Parallelize file listing for partitioned Hive table

2015-08-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14721483#comment-14721483
 ] 

Apache Spark commented on SPARK-9926:
-

User 'piaozhexiu' has created a pull request for this issue:
https://github.com/apache/spark/pull/8512

 Parallelize file listing for partitioned Hive table
 ---

 Key: SPARK-9926
 URL: https://issues.apache.org/jira/browse/SPARK-9926
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.4.1, 1.5.0
Reporter: Cheolsoo Park
Assignee: Cheolsoo Park

 In Spark SQL, short queries like {{select * from table limit 10}} run very 
 slowly against partitioned Hive tables because of file listing. In 
 particular, if a large number of partitions are scanned on storage like S3, 
 the queries run extremely slowly. Here are some example benchmarks in my 
 environment-
 * Parquet-backed Hive table
 * Partitioned by dateint and hour
 * Stored on S3
 ||\# of partitions||\# of files||runtime||query||
 |1|972|30 secs|select * from nccp_log where dateint=20150601 and hour=0 limit 
 10;|
 |24|13646|6 mins|select * from nccp_log where dateint=20150601 limit 10;|
 |240|136222|1 hour|select * from nccp_log where dateint>=20150601 and 
 dateint<=20150610 limit 10;|
 The problem is that {{TableReader}} constructs a separate HadoopRDD per Hive 
 partition path and groups them into a UnionRDD. Then, all the input files are 
 listed sequentially. In other tools such as Hive and Pig, this can be solved 
 by setting 
 [mapreduce.input.fileinputformat.list-status.num-threads|https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml]
  high. But in Spark, since each HadoopRDD lists only one partition path, 
 setting this property doesn't help.
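For reference, this is how that Hadoop listing property would typically be set from a Spark job (the thread count 20 is just an illustrative value); per the description above it does not help here, because each HadoopRDD lists only a single partition path:

{code}
// Raise the number of threads Hadoop's FileInputFormat uses to list input paths.
// (Ineffective in this case, since every HadoopRDD covers just one path.)
sc.hadoopConfiguration.set(
  "mapreduce.input.fileinputformat.list-status.num-threads", "20")
{code}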






[jira] [Comment Edited] (SPARK-10356) MLlib: Normalization should use absolute values

2015-08-30 Thread Carsten Schnober (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14721502#comment-14721502
 ] 

Carsten Schnober edited comment on SPARK-10356 at 8/30/15 12:00 PM:


According to https://en.wikipedia.org/wiki/Norm_%28mathematics%29#p-norm, each 
value's absolute value should be used to compute the norm:

{code}
||x||_p := (sum(|x|^p))^(1/p)
{code}

For p = 1, this results in:

{code}
||x||_1 := sum(|x|)
{code}

I suppose the issue is thus actually located in the norm() method.



was (Author: carschno):
According to 
[[Wikipedia][https://en.wikipedia.org/wiki/Norm_%28mathematics%29#p-norm], each 
value's absolute value should be used to compute the norm:

{||x||_p := (sum(|x|^p)^1/p}

For p = 1, this results in:

{||x||_1 := sum(|x|)}

I suppose the issue is thus actually located in the {norm()} method.


 MLlib: Normalization should use absolute values
 ---

 Key: SPARK-10356
 URL: https://issues.apache.org/jira/browse/SPARK-10356
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.4.1
Reporter: Carsten Schnober
  Labels: easyfix
   Original Estimate: 2h
  Remaining Estimate: 2h

 The normalizer does not handle vectors with negative values properly. It can 
 be tested with the following code:
 {code}
 val normalized = new Normalizer(1.0).transform(v: Vector)
 normalized.toArray.sum == 1.0
 {code}
 This yields true if all values in Vector v are positive, but false when v 
 contains one or more negative values. This is because the values in v are 
 taken directly, without applying {{abs()}}.
 This (probably) does not occur for {{p=2.0}} because the values are squared 
 and hence positive anyway.






[jira] [Resolved] (SPARK-10331) Update user guide to address minor comments during code review

2015-08-30 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-10331.
---
   Resolution: Fixed
Fix Version/s: 1.5.1

 Update user guide to address minor comments during code review
 --

 Key: SPARK-10331
 URL: https://issues.apache.org/jira/browse/SPARK-10331
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, ML, MLlib
Affects Versions: 1.5.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
 Fix For: 1.5.1


 Clean-up user guides to address some minor comments in:
 https://github.com/apache/spark/pull/8304
 https://github.com/apache/spark/pull/8487






[jira] [Updated] (SPARK-10346) SparkR mutate and transform should replace column with same name to match R data.frame behavior

2015-08-30 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-10346:
-
Component/s: SparkR

 SparkR mutate and transform should replace column with same name to match R 
 data.frame behavior
 ---

 Key: SPARK-10346
 URL: https://issues.apache.org/jira/browse/SPARK-10346
 Project: Spark
  Issue Type: Bug
  Components: R, SparkR
Affects Versions: 1.5.0
Reporter: Felix Cheung

 Spark doesn't seem to replace an existing column with the same name in mutate 
 (i.e. mutate(df, age = df$age + 2) returns a DataFrame with 2 columns named 
 'age'), so we are not doing that in transform for now either.
 Though it is clearly stated that it should replace the column with the 
 matching name:
 https://stat.ethz.ch/R-manual/R-devel/library/base/html/transform.html
 "The tags are matched against names(_data), and for those that match, the 
 values replace the corresponding variable in _data, and the others are 
 appended to _data."
 Also the resulting DataFrame might be hard to work with if one is to use 
 select with column names, or to register the table to SQL, and so on, since 
 then 2 columns have the same name.






[jira] [Created] (SPARK-10354) First cost RDD shouldn't be cached in k-means|| and the following cost RDD should use MEMORY_AND_DISK

2015-08-30 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-10354:
-

 Summary: First cost RDD shouldn't be cached in k-means|| and the 
following cost RDD should use MEMORY_AND_DISK
 Key: SPARK-10354
 URL: https://issues.apache.org/jira/browse/SPARK-10354
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.5.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
Priority: Minor


The first cost RDD doesn't need to be cached; the other cost RDDs should use 
MEMORY_AND_DISK to avoid recomputation.






[jira] [Commented] (SPARK-10329) Cost RDD in k-means|| initialization is not storage-efficient

2015-08-30 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14721419#comment-14721419
 ] 

Xiangrui Meng commented on SPARK-10329:
---

Assigned. I will send a small PR to fix apparent issues (SPARK-10534).

 Cost RDD in k-means|| initialization is not storage-efficient
 -

 Key: SPARK-10329
 URL: https://issues.apache.org/jira/browse/SPARK-10329
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.1, 1.4.1, 1.5.0
Reporter: Xiangrui Meng
Assignee: hujiayin
  Labels: clustering

 Currently we use `RDD[Vector]` to store point cost during k-means|| 
 initialization, where each `Vector` has size `runs`. This is not 
 storage-efficient because `runs` is usually 1 and then each record is a 
 Vector of size 1. What we need is just the 8 bytes to store the cost, but we 
 introduce two objects (DenseVector and its values array), which could cost 16 
 bytes. That is 200% overhead. Thanks [~Grace Huang] and Jiayin Hu from Intel 
 for reporting this issue!
 There are several solutions:
 1. Use `RDD[Array[Double]]` instead of `RDD[Vector]`, which saves 8 bytes per 
 record.
 2. Use `RDD[Array[Double]]` but batch the values for storage, e.g. each 
 `Array[Double]` object covers 1024 instances, which could remove most of the 
 overhead.
 Besides, using MEMORY_AND_DISK instead of MEMORY_ONLY could prevent cost RDDs 
 kicking out the training dataset from memory.






[jira] [Comment Edited] (SPARK-10329) Cost RDD in k-means|| initialization is not storage-efficient

2015-08-30 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14721419#comment-14721419
 ] 

Xiangrui Meng edited comment on SPARK-10329 at 8/30/15 6:36 AM:


Assigned. I will send a small PR to fix apparent issues (SPARK-10354).


was (Author: mengxr):
Assigned. I will send a small PR to fix apparent issues (SPARK-10534).

 Cost RDD in k-means|| initialization is not storage-efficient
 -

 Key: SPARK-10329
 URL: https://issues.apache.org/jira/browse/SPARK-10329
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.1, 1.4.1, 1.5.0
Reporter: Xiangrui Meng
Assignee: hujiayin
  Labels: clustering

 Currently we use `RDD[Vector]` to store point cost during k-means|| 
 initialization, where each `Vector` has size `runs`. This is not 
 storage-efficient because `runs` is usually 1 and then each record is a 
 Vector of size 1. What we need is just the 8 bytes to store the cost, but we 
 introduce two objects (DenseVector and its values array), which could cost 16 
 bytes. That is 200% overhead. Thanks [~Grace Huang] and Jiayin Hu from Intel 
 for reporting this issue!
 There are several solutions:
 1. Use `RDD[Array[Double]]` instead of `RDD[Vector]`, which saves 8 bytes per 
 record.
 2. Use `RDD[Array[Double]]` but batch the values for storage, e.g. each 
 `Array[Double]` object covers 1024 instances, which could remove most of the 
 overhead.
 Besides, using MEMORY_AND_DISK instead of MEMORY_ONLY could prevent cost RDDs 
 kicking out the training dataset from memory.






[jira] [Commented] (SPARK-10301) For struct type, if parquet's global schema has less fields than a file's schema, data reading will fail

2015-08-30 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14721447#comment-14721447
 ] 

Cheng Lian commented on SPARK-10301:


Updated ticket description to provide a more general view of this issue. Would 
also be helpful for reviewing https://github.com/apache/spark/pull/8509

 For struct type, if parquet's global schema has less fields than a file's 
 schema, data reading will fail
 

 Key: SPARK-10301
 URL: https://issues.apache.org/jira/browse/SPARK-10301
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: Yin Huai
Assignee: Cheng Lian
Priority: Critical

 We hit this issue when reading a complex Parquet dataset without turning on 
 schema merging.  The data set consists of Parquet files with different but 
 compatible schemas.  In this case, the schema of the dataset is defined by 
 either a summary file or a random physical Parquet file if no summary files 
 are available.  Apparently, this schema may not contain all fields that 
 appear in all physical files.
 Parquet was designed with schema evolution and column pruning in mind, so it 
 should be legal for a user to use a tailored schema to read the dataset to 
 save disk IO.  For example, say we have a Parquet dataset consisting of two 
 physical Parquet files with the following two schemas:
 {noformat}
 message m0 {
   optional group f0 {
     optional int64 f00;
     optional int64 f01;
   }
 }
 message m1 {
   optional group f0 {
     optional int64 f00;
     optional int64 f01;
     optional int64 f02;
   }
   optional double f1;
 }
 {noformat}
 Users should be allowed to read the dataset with the following schema:
 {noformat}
 message m1 {
   optional group f0 {
     optional int64 f01;
     optional int64 f02;
   }
 }
 {noformat}
 so that {{f0.f00}} and {{f1}} are never touched.  The above case can be 
 expressed by the following {{spark-shell}} snippet:
 {noformat}
 import sqlContext._
 import sqlContext.implicits._
 import org.apache.spark.sql.types.{LongType, StructType}
 val path = "/tmp/spark/parquet"
 range(3).selectExpr("NAMED_STRUCT('f00', id, 'f01', id) AS f0").coalesce(1)
   .write.mode("overwrite").parquet(path)
 range(3).selectExpr("NAMED_STRUCT('f00', id, 'f01', id, 'f02', id) AS f0",
     "CAST(id AS DOUBLE) AS f1").coalesce(1)
   .write.mode("append").parquet(path)
 val tailoredSchema =
   new StructType()
     .add(
       "f0",
       new StructType()
         .add("f01", LongType, nullable = true)
         .add("f02", LongType, nullable = true),
       nullable = true)
 read.schema(tailoredSchema).parquet(path).show()
 {noformat}
 Expected output should be:
 {noformat}
 ++
 |  f0|
 ++
 |[0,null]|
 |[1,null]|
 |[2,null]|
 |   [0,0]|
 |   [1,1]|
 |   [2,2]|
 ++
 {noformat}
 However, current 1.5-SNAPSHOT version throws the following exception:
 {noformat}
 org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in 
 block -1 in file 
 hdfs://localhost:9000/tmp/spark/parquet/part-r-0-56c4604e-c546-4f97-a316-05da8ab1a0bf.gz.parquet
 at 
 org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228)
 at 
 org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
 at 
 org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:168)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
 at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
 at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 at 
 scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
 at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
 at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
 at 
 scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
 at scala.collection.AbstractIterator.to(Iterator.scala:1157)
 at 
 scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
 at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
 at 
 scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
 at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
 at 
 org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
 at 
 org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
 at 
 org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1844)
 

[jira] [Resolved] (SPARK-10348) Improve Spark ML user guide

2015-08-30 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-10348.
---
   Resolution: Fixed
Fix Version/s: 1.5.1

Issue resolved by pull request 8517
[https://github.com/apache/spark/pull/8517]

 Improve Spark ML user guide
 ---

 Key: SPARK-10348
 URL: https://issues.apache.org/jira/browse/SPARK-10348
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, ML
Affects Versions: 1.5.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
 Fix For: 1.5.1


 improve ml-guide:
 * replace `ML Dataset` by `DataFrame` to simplify the abstraction
 * remove links to Scala API doc in the main guide
 * change ML algorithms to pipeline components






[jira] [Updated] (SPARK-10331) Update user guide to address minor comments during code review

2015-08-30 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10331:
--
Description: 
Clean-up user guides to address some minor comments in:

https://github.com/apache/spark/pull/8304
https://github.com/apache/spark/pull/8487

Some code examples were introduced in 1.2 before `createDataFrame`. We should 
switch to that.


  was:
Clean-up user guides to address some minor comments in:

https://github.com/apache/spark/pull/8304
https://github.com/apache/spark/pull/8487





 Update user guide to address minor comments during code review
 --

 Key: SPARK-10331
 URL: https://issues.apache.org/jira/browse/SPARK-10331
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, ML, MLlib
Affects Versions: 1.5.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
 Fix For: 1.5.1


 Clean-up user guides to address some minor comments in:
 https://github.com/apache/spark/pull/8304
 https://github.com/apache/spark/pull/8487
 Some code examples were introduced in 1.2 before `createDataFrame`. We should 
 switch to that.






[jira] [Updated] (SPARK-10023) Unified DecisionTreeParams checkpointInterval between Scala and Python API.

2015-08-30 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-10023:

Description: 
checkpointInterval is a member of DecisionTreeParams in the Scala API, which is 
inconsistent with the Python API; we should unify them.

* checkpointInterval
** member of DecisionTreeParams - Scala API
** shared param used for all ML Transformer/Estimator - Python API

Proposal:
checkpointInterval is also used by ALS, but its meaning there is different 
from here.
So we make checkpointInterval a member of DecisionTreeParams for the Python API, 
because it is only validated when cacheNodeIds is true and the checkpoint 
directory is set in the SparkContext.

  was:
checkpointInterval is member of DecisionTreeParams in Scala API which is 
inconsistency with Python API, we should unified them.

* checkpointInterval
** member of DecisionTreeParams - Scala API
** shared param used for all ML Transformer/Estimator - Python API

Proposal: Make checkpointInterval member of DecisionTreeParams for Python 
API, because of it only validate when cacheNodeIds is true and the checkpoint 
directory is set in the SparkContext.


 Unified DecisionTreeParams checkpointInterval between Scala and Python API.
 -

 Key: SPARK-10023
 URL: https://issues.apache.org/jira/browse/SPARK-10023
 Project: Spark
  Issue Type: Sub-task
  Components: ML, PySpark
Reporter: Yanbo Liang

 checkpointInterval is a member of DecisionTreeParams in the Scala API, which is 
 inconsistent with the Python API; we should unify them.
 * checkpointInterval
 ** member of DecisionTreeParams - Scala API
 ** shared param used for all ML Transformer/Estimator - Python API
 Proposal:
 checkpointInterval is also used by ALS, but its meaning there is different 
 from here.
 So we make checkpointInterval a member of DecisionTreeParams for the Python API, 
 because it is only validated when cacheNodeIds is true and the checkpoint 
 directory is set in the SparkContext.






[jira] [Commented] (SPARK-10356) MLlib: Normalization should use absolute values

2015-08-30 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14721494#comment-14721494
 ] 

Sean Owen commented on SPARK-10356:
---

It's not true that the sum of the elements will be 1 after this normalization. 
It's true that the sum of their absolute values will be.

{code}
scala> val v = Vectors.dense(-1.0, 2.0)
v: org.apache.spark.mllib.linalg.Vector = [-1.0,2.0]
scala> new Normalizer(1.0).transform(v)
res2: org.apache.spark.mllib.linalg.Vector = 
[-0.3333333333333333,0.6666666666666666]
{code}

That looks correct. You're not expecting the result to have all positive 
entries, right?

 MLlib: Normalization should use absolute values
 ---

 Key: SPARK-10356
 URL: https://issues.apache.org/jira/browse/SPARK-10356
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.4.1
Reporter: Carsten Schnober
  Labels: easyfix
   Original Estimate: 2h
  Remaining Estimate: 2h

 The normalizer does not handle vectors with negative values properly. It can 
 be tested with the following code
 {code}
 val normalized = new Normalizer(1.0).transform(v: Vector)
 normalized.toArray.sum == 1.0
 {code}
 This yields true if all values in Vector v are positive, but false when v 
 contains one or more negative values. This is because the values in v are 
 taken directly, without applying {{abs()}}.
 This (probably) does not occur for {{p=2.0}} because the values are squared 
 and hence positive anyway.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10264) Add @Since annotation to ml.recommendation

2015-08-30 Thread Tijo Thomas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14721682#comment-14721682
 ] 

Tijo Thomas commented on SPARK-10264:
-

I am working on this.
Thanks 

 Add @Since annotation to ml.recommendation
 --

 Key: SPARK-10264
 URL: https://issues.apache.org/jira/browse/SPARK-10264
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, ML
Reporter: Xiangrui Meng
Priority: Minor
  Labels: starter





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10353) MLlib BLAS gemm outputs wrong result when beta = 0.0 for transpose transpose matrix multiplication

2015-08-30 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10353:
--
Affects Version/s: 1.3.1
   1.4.1

 MLlib BLAS gemm outputs wrong result when beta = 0.0 for transpose transpose 
 matrix multiplication
 --

 Key: SPARK-10353
 URL: https://issues.apache.org/jira/browse/SPARK-10353
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.3.1, 1.4.1, 1.5.0
Reporter: Burak Yavuz
 Fix For: 1.4.2, 1.5.1


 Basically 
 {code}
 if (beta != 0.0) {
   f2jBLAS.dscal(C.values.length, beta, C.values, 1)
 }
 {code}
 should be
 {code}
 if (beta != 1.0) {
   f2jBLAS.dscal(C.values.length, beta, C.values, 1)
 }
 {code}
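 
 A minimal standalone sketch (plain Scala with made-up values, not Spark's BLAS 
 code) of why the guard must be {{beta != 1.0}}: gemm computes C := alpha*A*B + 
 beta*C, so for beta = 0.0 the old contents of C must be zeroed out rather than 
 left in place.
 {code}
 val contribution = Array(1.0, 2.0)   // stands in for alpha * A * B
 val beta = 0.0
 
 // Buggy guard: skips the scaling when beta == 0.0, so stale values leak in.
 val c1 = Array(5.0, 5.0)             // stale values already in the output buffer
 if (beta != 0.0) { for (i <- c1.indices) c1(i) *= beta }
 val buggy = c1.zip(contribution).map { case (ci, x) => ci + x }  // Array(6.0, 7.0)
 
 // Correct guard: scale (here, zero out) whenever beta != 1.0.
 val c2 = Array(5.0, 5.0)
 if (beta != 1.0) { for (i <- c2.indices) c2(i) *= beta }
 val fixed = c2.zip(contribution).map { case (ci, x) => ci + x }  // Array(1.0, 2.0)
 {code}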



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10353) MLlib BLAS gemm outputs wrong result when beta = 0.0 for transpose transpose matrix multiplication

2015-08-30 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10353:
--
Fix Version/s: 1.5.1
   1.4.2

 MLlib BLAS gemm outputs wrong result when beta = 0.0 for transpose transpose 
 matrix multiplication
 --

 Key: SPARK-10353
 URL: https://issues.apache.org/jira/browse/SPARK-10353
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.3.1, 1.4.1, 1.5.0
Reporter: Burak Yavuz
 Fix For: 1.4.2, 1.5.1


 Basically 
 {code}
 if (beta != 0.0) {
   f2jBLAS.dscal(C.values.length, beta, C.values, 1)
 }
 {code}
 should be
 {code}
 if (beta != 1.0) {
   f2jBLAS.dscal(C.values.length, beta, C.values, 1)
 }
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10353) MLlib BLAS gemm outputs wrong result when beta = 0.0 for transpose transpose matrix multiplication

2015-08-30 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10353:
--
Assignee: Burak Yavuz

 MLlib BLAS gemm outputs wrong result when beta = 0.0 for transpose transpose 
 matrix multiplication
 --

 Key: SPARK-10353
 URL: https://issues.apache.org/jira/browse/SPARK-10353
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.3.1, 1.4.1, 1.5.0
Reporter: Burak Yavuz
Assignee: Burak Yavuz
 Fix For: 1.4.2, 1.5.1


 Basically 
 {code}
 if (beta != 0.0) {
   f2jBLAS.dscal(C.values.length, beta, C.values, 1)
 }
 {code}
 should be
 {code}
 if (beta != 1.0) {
   f2jBLAS.dscal(C.values.length, beta, C.values, 1)
 }
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10353) MLlib BLAS gemm outputs wrong result when beta = 0.0 for transpose transpose matrix multiplication

2015-08-30 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14721689#comment-14721689
 ] 

Xiangrui Meng commented on SPARK-10353:
---

Leave the JIRA open for 1.3.

 MLlib BLAS gemm outputs wrong result when beta = 0.0 for transpose transpose 
 matrix multiplication
 --

 Key: SPARK-10353
 URL: https://issues.apache.org/jira/browse/SPARK-10353
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.3.1, 1.4.1, 1.5.0
Reporter: Burak Yavuz
Assignee: Burak Yavuz
 Fix For: 1.4.2, 1.5.1


 Basically 
 {code}
 if (beta != 0.0) {
   f2jBLAS.dscal(C.values.length, beta, C.values, 1)
 }
 {code}
 should be
 {code}
 if (beta != 1.0) {
   f2jBLAS.dscal(C.values.length, beta, C.values, 1)
 }
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8684) Update R version in Spark EC2 AMI

2015-08-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-8684:
-
Fix Version/s: (was: 1.5.0)

 Update R version in Spark EC2 AMI
 -

 Key: SPARK-8684
 URL: https://issues.apache.org/jira/browse/SPARK-8684
 Project: Spark
  Issue Type: Improvement
  Components: EC2, SparkR
Reporter: Shivaram Venkataraman
Priority: Minor

 Right now the R version in the AMI is 3.1. However, a number of R libraries 
 need R version 3.2, so it would be good to update the R version on the AMI 
 when launching an EC2 cluster.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10189) python rdd socket connection problem

2015-08-30 Thread ABHISHEK CHOUDHARY (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ABHISHEK CHOUDHARY updated SPARK-10189:
---
Description: 
I am trying to use wholeTextFiles with pyspark , and now I am getting the same 
error -

```
textFiles = sc.wholeTextFiles('/file/content')
textFiles.take(1)
```

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File 
/Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/rdd.py, 
line 1277, in take
res = self.context.runJob(self, takeUpToNumLeft, p, True)
  File 
/Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/context.py,
 line 898, in runJob
return list(_load_from_socket(port, mappedRDD._jrdd_deserializer))
  File 
/Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/rdd.py, 
line 138, in _load_from_socket
raise Exception("could not open socket")
Exception: could not open socket
 15/08/24 20:09:27 ERROR PythonRDD: Error while sending iterator
java.net.SocketTimeoutException: Accept timed out
at java.net.PlainSocketImpl.socketAccept(Native Method)
at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:404)
at java.net.ServerSocket.implAccept(ServerSocket.java:545)
at java.net.ServerSocket.accept(ServerSocket.java:513)
at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:623)
```

Current piece of code in rdd.py-

```
def _load_from_socket(port, serializer):
sock = None
# Support for both IPv4 and IPv6.
# On most of IPv6-ready systems, IPv6 will take precedence.
for res in socket.getaddrinfo("localhost", port, socket.AF_UNSPEC, 
socket.SOCK_STREAM):
af, socktype, proto, canonname, sa = res
try:
sock = socket.socket(af, socktype, proto)
sock.settimeout(3)
sock.connect(sa)
except socket.error:
sock = None
continue
break
if not sock:
raise Exception("could not open socket")
try:
rf = sock.makefile("rb", 65536)
for item in serializer.load_stream(rf):
yield item
finally:
sock.close()
```


On further investigating the issue, I realized that in context.py, runJob is 
not actually triggering the server, so there is nothing to connect to -
```
port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
```

  was:
I am trying to use wholeTextFiles with pyspark , and now I am getting the same 
error -

```
textFiles = sc.wholeTextFiles('/file/content')
textFiles.take(1)
```

```
Traceback (most recent call last):
  File stdin, line 1, in module
  File 
/Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/rdd.py, 
line 1277, in take
res = self.context.runJob(self, takeUpToNumLeft, p, True)
  File 
/Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/context.py,
 line 898, in runJob
return list(_load_from_socket(port, mappedRDD._jrdd_deserializer))
  File 
/Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/rdd.py, 
line 138, in _load_from_socket
raise Exception(could not open socket)
Exception: could not open socket
 15/08/24 20:09:27 ERROR PythonRDD: Error while sending iterator
java.net.SocketTimeoutException: Accept timed out
at java.net.PlainSocketImpl.socketAccept(Native Method)
at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:404)
at java.net.ServerSocket.implAccept(ServerSocket.java:545)
at java.net.ServerSocket.accept(ServerSocket.java:513)
at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:623)
```

Current piece of code in rdd.py-

```
def _load_from_socket(port, serializer):
sock = None
# Support for both IPv4 and IPv6.
# On most of IPv6-ready systems, IPv6 will take precedence.
for res in socket.getaddrinfo(localhost, port, socket.AF_UNSPEC, 
socket.SOCK_STREAM):
af, socktype, proto, canonname, sa = res
try:
sock = socket.socket(af, socktype, proto)
sock.settimeout(3)
sock.connect(sa)
except socket.error:
sock = None
continue
break
if not sock:
raise Exception(could not open socket)
try:
rf = sock.makefile(rb, 65536)
for item in serializer.load_stream(rf):
yield item
finally:
sock.close()
```


 python rdd socket connection problem
 

 Key: SPARK-10189
 URL: https://issues.apache.org/jira/browse/SPARK-10189
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.4.1
Reporter: ABHISHEK CHOUDHARY
  Labels: pyspark, socket

 I am trying to use wholeTextFiles with pyspark , and now I am getting the 
 same error -
 ```
 textFiles = 

[jira] [Updated] (SPARK-9663) ML Python API coverage issues found during 1.5 QA

2015-08-30 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-9663:
---
Description: 
This umbrella is for a list of Python API coverage issues which we should fix 
for the 1.6 release cycle.  This list is to be generated from issues found in 
[SPARK-9662] and from remaining issues from 1.4: [SPARK-7536].

Here we check and compare the Python and Scala API of MLlib/ML,
add missing classes/methods/parameters for PySpark. 
* Missing classes for ML:
** attribute SPARK-10025
** feature
*** CountVectorizerModel SPARK-9769
*** DCT SPARK-8472
*** ElementwiseProduct SPARK-9768
*** MinMaxScaler SPARK-8530
*** SQLTransformer SPARK-10355
*** StopWordsRemover SPARK-9679
*** VectorSlicer SPARK-9772
*** IndexToString SPARK-10021
** classification
*** OneVsRest SPARK-7861
*** MultilayerPerceptronClassifier SPARK-9773
** regression
*** IsotonicRegression SPARK-9774
* Missing classes for MLlib:
** fpm
*** PrefixSpan SPARK-10028
* Missing User Guide documents for PySpark SPARK-8757
* Scala-Python method/parameter inconsistency check for ML & MLlib SPARK-10022

  was:
This umbrella is for a list of Python API coverage issues which we should fix 
for the 1.6 release cycle.  This list is to be generated from issues found in 
[SPARK-9662] and from remaining issues from 1.4: [SPARK-7536].

Here we check and compare the Python and Scala API of MLlib/ML,
add missing classes/methods/parameters for PySpark. 
* Missing classes for ML:
** attribute SPARK-10025
** feature
*** CountVectorizerModel SPARK-9769
*** DCT SPARK-8472
*** ElementwiseProduct SPARK-9768
*** MinMaxScaler SPARK-8530
*** StopWordsRemover SPARK-9679
*** VectorSlicer SPARK-9772
*** IndexToString SPARK-10021
** classification
*** OneVsRest SPARK-7861
*** MultilayerPerceptronClassifier SPARK-9773
** regression
*** IsotonicRegression SPARK-9774
* Missing classes for MLlib:
** fpm
*** PrefixSpan SPARK-10028
* Missing User Guide documents for PySpark SPARK-8757
* Scala-Python method/parameter inconsistency check for ML & MLlib SPARK-10022


 ML Python API coverage issues found during 1.5 QA
 -

 Key: SPARK-9663
 URL: https://issues.apache.org/jira/browse/SPARK-9663
 Project: Spark
  Issue Type: Umbrella
  Components: ML, MLlib, PySpark
Reporter: Joseph K. Bradley

 This umbrella is for a list of Python API coverage issues which we should fix 
 for the 1.6 release cycle.  This list is to be generated from issues found in 
 [SPARK-9662] and from remaining issues from 1.4: [SPARK-7536].
 Here we check and compare the Python and Scala API of MLlib/ML,
 add missing classes/methods/parameters for PySpark. 
 * Missing classes for ML:
 ** attribute SPARK-10025
 ** feature
 *** CountVectorizerModel SPARK-9769
 *** DCT SPARK-8472
 *** ElementwiseProduct SPARK-9768
 *** MinMaxScaler SPARK-8530
 *** SQLTransformer SPARK-10355
 *** StopWordsRemover SPARK-9679
 *** VectorSlicer SPARK-9772
 *** IndexToString SPARK-10021
 ** classification
 *** OneVsRest SPARK-7861
 *** MultilayerPerceptronClassifier SPARK-9773
 ** regression
 *** IsotonicRegression SPARK-9774
 * Missing classes for MLlib:
 ** fpm
 *** PrefixSpan SPARK-10028
 * Missing User Guide documents for PySpark SPARK-8757
 * Scala-Python method/parameter inconsistency check for ML & MLlib SPARK-10022



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10355) Add Python API for SQLTransformer

2015-08-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10355:


Assignee: (was: Apache Spark)

 Add Python API for SQLTransformer
 -

 Key: SPARK-10355
 URL: https://issues.apache.org/jira/browse/SPARK-10355
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Reporter: Yanbo Liang
Priority: Minor

 Add Python API for SQLTransformer



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10355) Add Python API for SQLTransformer

2015-08-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14721467#comment-14721467
 ] 

Apache Spark commented on SPARK-10355:
--

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/8527

 Add Python API for SQLTransformer
 -

 Key: SPARK-10355
 URL: https://issues.apache.org/jira/browse/SPARK-10355
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Reporter: Yanbo Liang
Priority: Minor

 Add Python API for SQLTransformer



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10355) Add Python API for SQLTransformer

2015-08-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10355:


Assignee: Apache Spark

 Add Python API for SQLTransformer
 -

 Key: SPARK-10355
 URL: https://issues.apache.org/jira/browse/SPARK-10355
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Reporter: Yanbo Liang
Assignee: Apache Spark
Priority: Minor

 Add Python API for SQLTransformer



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10023) Unified DecisionTreeParams checkpointInterval between Scala and Python API.

2015-08-30 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-10023:

Description: 
checkpointInterval is a member of DecisionTreeParams in the Scala API, which is 
inconsistent with the Python API; we should unify them.

* checkpointInterval
** member of DecisionTreeParams - Scala API
** shared param used for all ML Transformer/Estimator - Python API

Proposal: Make checkpointInterval a member of DecisionTreeParams for the Python 
API, because it is only validated when cacheNodeIds is true and the checkpoint 
directory is set in the SparkContext.

  was:
checkpointInterval is member of DecisionTreeParams in Scala API which is 
inconsistency with Python API, we should unified them.

* checkpointInterval
** member of DecisionTreeParams - Scala API
** shared param used for all ML Transformer/Estimator - Python API

Proposal: Make checkpointInterval shared param for Scala API


 Unified DecisionTreeParams checkpointInterval between Scala and Python API.
 -

 Key: SPARK-10023
 URL: https://issues.apache.org/jira/browse/SPARK-10023
 Project: Spark
  Issue Type: Sub-task
  Components: ML, PySpark
Reporter: Yanbo Liang

 checkpointInterval is member of DecisionTreeParams in Scala API which is 
 inconsistency with Python API, we should unified them.
 * checkpointInterval
 ** member of DecisionTreeParams - Scala API
 ** shared param used for all ML Transformer/Estimator - Python API
 Proposal: Make checkpointInterval member of DecisionTreeParams for Python 
 API, because of it only validate when cacheNodeIds is true and the checkpoint 
 directory is set in the SparkContext.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10023) Unified DecisionTreeParams checkpointInterval between Scala and Python API.

2015-08-30 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-10023:

Description: 
checkpointInterval is a member of DecisionTreeParams in the Scala API, which is 
inconsistent with the Python API; we should unify them.

* checkpointInterval
** member of DecisionTreeParams - Scala API
** shared param used for all ML Transformer/Estimator - Python API

Proposal:
checkpointInterval is also used by ALS, but its meaning there is different 
from here.
So we make checkpointInterval a member of DecisionTreeParams for the Python API; 
it is only validated when cacheNodeIds is true and the checkpoint directory is 
set in the SparkContext.

  was:
checkpointInterval is member of DecisionTreeParams in Scala API which is 
inconsistency with Python API, we should unified them.

* checkpointInterval
** member of DecisionTreeParams - Scala API
** shared param used for all ML Transformer/Estimator - Python API

Proposal:
 checkpointInterval also used at ALS but the meaning for that is different 
from here.
So we make checkpointInterval member of DecisionTreeParams for Python API, 
because of it only validate when cacheNodeIds is true and the checkpoint 
directory is set in the SparkContext.


 Unified DecisionTreeParams checkpointInterval between Scala and Python API.
 -

 Key: SPARK-10023
 URL: https://issues.apache.org/jira/browse/SPARK-10023
 Project: Spark
  Issue Type: Sub-task
  Components: ML, PySpark
Reporter: Yanbo Liang

 checkpointInterval is member of DecisionTreeParams in Scala API which is 
 inconsistency with Python API, we should unified them.
 * checkpointInterval
 ** member of DecisionTreeParams - Scala API
 ** shared param used for all ML Transformer/Estimator - Python API
 Proposal:
  checkpointInterval also used at ALS but the meaning for that is different 
 from here.
 So we make checkpointInterval member of DecisionTreeParams for Python API, 
 it only validate when cacheNodeIds is true and the checkpoint directory is 
 set in the SparkContext.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10023) Unified DecisionTreeParams checkpointInterval between Scala and Python API.

2015-08-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10023:


Assignee: Apache Spark

 Unified DecisionTreeParams checkpointInterval between Scala and Python API.
 -

 Key: SPARK-10023
 URL: https://issues.apache.org/jira/browse/SPARK-10023
 Project: Spark
  Issue Type: Sub-task
  Components: ML, PySpark
Reporter: Yanbo Liang
Assignee: Apache Spark

 checkpointInterval is member of DecisionTreeParams in Scala API which is 
 inconsistency with Python API, we should unified them.
 * checkpointInterval
 ** member of DecisionTreeParams - Scala API
 ** shared param used for all ML Transformer/Estimator - Python API
 Proposal:
  checkpointInterval also used at ALS but the meaning for that is different 
 from here.
 So we make checkpointInterval member of DecisionTreeParams for Python API, 
 it only validate when cacheNodeIds is true and the checkpoint directory is 
 set in the SparkContext.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10356) MLlib: Normalization should use absolute values

2015-08-30 Thread Carsten Schnober (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14721502#comment-14721502
 ] 

Carsten Schnober commented on SPARK-10356:
--

According to 
[Wikipedia|https://en.wikipedia.org/wiki/Norm_%28mathematics%29#p-norm], each 
value's absolute value should be used to compute the norm:

{{||x||_p := (sum_i |x_i|^p)^(1/p)}}

For p = 1, this results in:

{{||x||_1 := sum_i |x_i|}}

I suppose the issue is thus actually located in the {{norm()}} method.
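
A standalone numeric check (plain Scala, independent of MLlib) of the point above, 
using the same example vector as the earlier comment:
{code}
val v = Array(-1.0, 2.0)
val l1 = v.map(math.abs).sum                  // 3.0: the 1-norm sums absolute values
val normalized = v.map(_ / l1)                // Array(-1.0/3, 2.0/3)
val sumOfValues = normalized.sum              // 0.333..., so sum == 1.0 is the wrong test
val sumOfAbs = normalized.map(math.abs).sum   // 1.0, the invariant normalization guarantees
{code}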


 MLlib: Normalization should use absolute values
 ---

 Key: SPARK-10356
 URL: https://issues.apache.org/jira/browse/SPARK-10356
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.4.1
Reporter: Carsten Schnober
  Labels: easyfix
   Original Estimate: 2h
  Remaining Estimate: 2h

 The normalizer does not handle vectors with negative values properly. It can 
 be tested with the following code
 {code}
 val normalized = new Normalizer(1.0).transform(v: Vector)
 normalized.toArray.sum == 1.0
 {code}
 This yields true if all values in Vector v are positive, but false when v 
 contains one or more negative values. This is because the values in v are 
 taken directly, without applying {{abs()}}.
 This (probably) does not occur for {{p=2.0}} because the values are squared 
 and hence positive anyway.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10189) python rdd socket connection problem

2015-08-30 Thread ABHISHEK CHOUDHARY (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ABHISHEK CHOUDHARY updated SPARK-10189:
---
Description: 
I am trying to use wholeTextFiles with pyspark , and now I am getting the same 
error -

{code}
textFiles = sc.wholeTextFiles('/file/content')
textFiles.take(1)



Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File 
/Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/rdd.py, 
line 1277, in take
res = self.context.runJob(self, takeUpToNumLeft, p, True)
  File 
/Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/context.py,
 line 898, in runJob
return list(_load_from_socket(port, mappedRDD._jrdd_deserializer))
  File 
/Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/rdd.py, 
line 138, in _load_from_socket
raise Exception("could not open socket")
Exception: could not open socket
 15/08/24 20:09:27 ERROR PythonRDD: Error while sending iterator
java.net.SocketTimeoutException: Accept timed out
at java.net.PlainSocketImpl.socketAccept(Native Method)
at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:404)
at java.net.ServerSocket.implAccept(ServerSocket.java:545)
at java.net.ServerSocket.accept(ServerSocket.java:513)
at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:623)
{code}

Current piece of code in rdd.py-

{code:title=rdd.py|borderStyle=solid}
def _load_from_socket(port, serializer):
sock = None
# Support for both IPv4 and IPv6.
# On most of IPv6-ready systems, IPv6 will take precedence.
for res in socket.getaddrinfo("localhost", port, socket.AF_UNSPEC, 
socket.SOCK_STREAM):
af, socktype, proto, canonname, sa = res
try:
sock = socket.socket(af, socktype, proto)
sock.settimeout(3)
sock.connect(sa)
except socket.error:
sock = None
continue
break
if not sock:
raise Exception("could not open socket")
try:
rf = sock.makefile("rb", 65536)
for item in serializer.load_stream(rf):
yield item
finally:
sock.close()
{code}


On further investigating the issue, I realized that in context.py, runJob is 
not actually triggering the server, so there is nothing to connect to -
{code:title=context.py|borderStyle=solid}
port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
{code}

  was:
I am trying to use wholeTextFiles with pyspark , and now I am getting the same 
error -

{code}
textFiles = sc.wholeTextFiles('/file/content')
textFiles.take(1)



Traceback (most recent call last):
  File stdin, line 1, in module
  File 
/Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/rdd.py, 
line 1277, in take
res = self.context.runJob(self, takeUpToNumLeft, p, True)
  File 
/Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/context.py,
 line 898, in runJob
return list(_load_from_socket(port, mappedRDD._jrdd_deserializer))
  File 
/Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/rdd.py, 
line 138, in _load_from_socket
raise Exception(could not open socket)
Exception: could not open socket
 15/08/24 20:09:27 ERROR PythonRDD: Error while sending iterator
java.net.SocketTimeoutException: Accept timed out
at java.net.PlainSocketImpl.socketAccept(Native Method)
at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:404)
at java.net.ServerSocket.implAccept(ServerSocket.java:545)
at java.net.ServerSocket.accept(ServerSocket.java:513)
at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:623)
{code}

Current piece of code in rdd.py-

{code:title=rdd.py|borderStyle=solid}
def _load_from_socket(port, serializer):
sock = None
# Support for both IPv4 and IPv6.
# On most of IPv6-ready systems, IPv6 will take precedence.
for res in socket.getaddrinfo(localhost, port, socket.AF_UNSPEC, 
socket.SOCK_STREAM):
af, socktype, proto, canonname, sa = res
try:
sock = socket.socket(af, socktype, proto)
sock.settimeout(3)
sock.connect(sa)
except socket.error:
sock = None
continue
break
if not sock:
raise Exception(could not open socket)
try:
rf = sock.makefile(rb, 65536)
for item in serializer.load_stream(rf):
yield item
finally:
sock.close()
{code}


On further investigate the issue , i realized that in context.py , runJob is 
not actually triggering the server and so there is nothing to connect -
```
port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
```


 python rdd socket connection problem
 

 Key: SPARK-10189
 URL: 

[jira] [Updated] (SPARK-10189) python rdd socket connection problem

2015-08-30 Thread ABHISHEK CHOUDHARY (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ABHISHEK CHOUDHARY updated SPARK-10189:
---
Description: 
I am trying to use wholeTextFiles with pyspark , and now I am getting the same 
error -

{code}
textFiles = sc.wholeTextFiles('/file/content')
textFiles.take(1)



Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File 
/Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/rdd.py, 
line 1277, in take
res = self.context.runJob(self, takeUpToNumLeft, p, True)
  File 
/Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/context.py,
 line 898, in runJob
return list(_load_from_socket(port, mappedRDD._jrdd_deserializer))
  File 
/Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/rdd.py, 
line 138, in _load_from_socket
raise Exception("could not open socket")
Exception: could not open socket
 15/08/24 20:09:27 ERROR PythonRDD: Error while sending iterator
java.net.SocketTimeoutException: Accept timed out
at java.net.PlainSocketImpl.socketAccept(Native Method)
at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:404)
at java.net.ServerSocket.implAccept(ServerSocket.java:545)
at java.net.ServerSocket.accept(ServerSocket.java:513)
at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:623)
{code}

Current piece of code in rdd.py-

{code:title=rdd.py|borderStyle=solid}
def _load_from_socket(port, serializer):
sock = None
# Support for both IPv4 and IPv6.
# On most of IPv6-ready systems, IPv6 will take precedence.
for res in socket.getaddrinfo("localhost", port, socket.AF_UNSPEC, 
socket.SOCK_STREAM):
af, socktype, proto, canonname, sa = res
try:
sock = socket.socket(af, socktype, proto)
sock.settimeout(3)
sock.connect(sa)
except socket.error:
sock = None
continue
break
if not sock:
raise Exception("could not open socket")
try:
rf = sock.makefile("rb", 65536)
for item in serializer.load_stream(rf):
yield item
finally:
sock.close()
{code}


On further investigating the issue, I realized that in context.py, runJob is 
not actually triggering the server, so there is nothing to connect to -
```
port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
```

  was:
I am trying to use wholeTextFiles with pyspark , and now I am getting the same 
error -

```
textFiles = sc.wholeTextFiles('/file/content')
textFiles.take(1)
```

```
Traceback (most recent call last):
  File stdin, line 1, in module
  File 
/Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/rdd.py, 
line 1277, in take
res = self.context.runJob(self, takeUpToNumLeft, p, True)
  File 
/Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/context.py,
 line 898, in runJob
return list(_load_from_socket(port, mappedRDD._jrdd_deserializer))
  File 
/Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/rdd.py, 
line 138, in _load_from_socket
raise Exception(could not open socket)
Exception: could not open socket
 15/08/24 20:09:27 ERROR PythonRDD: Error while sending iterator
java.net.SocketTimeoutException: Accept timed out
at java.net.PlainSocketImpl.socketAccept(Native Method)
at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:404)
at java.net.ServerSocket.implAccept(ServerSocket.java:545)
at java.net.ServerSocket.accept(ServerSocket.java:513)
at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:623)
```

Current piece of code in rdd.py-

```
def _load_from_socket(port, serializer):
sock = None
# Support for both IPv4 and IPv6.
# On most of IPv6-ready systems, IPv6 will take precedence.
for res in socket.getaddrinfo(localhost, port, socket.AF_UNSPEC, 
socket.SOCK_STREAM):
af, socktype, proto, canonname, sa = res
try:
sock = socket.socket(af, socktype, proto)
sock.settimeout(3)
sock.connect(sa)
except socket.error:
sock = None
continue
break
if not sock:
raise Exception(could not open socket)
try:
rf = sock.makefile(rb, 65536)
for item in serializer.load_stream(rf):
yield item
finally:
sock.close()
```


On further investigate the issue , i realized that in context.py , runJob is 
not actually triggering the server and so there is nothing to connect -
```
port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
```


 python rdd socket connection problem
 

 Key: SPARK-10189
 URL: https://issues.apache.org/jira/browse/SPARK-10189
 Project: Spark
  

[jira] [Comment Edited] (SPARK-8292) ShortestPaths run with error result

2015-08-30 Thread Anita Tailor (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14721554#comment-14721554
 ] 

Anita Tailor edited comment on SPARK-8292 at 8/30/15 3:13 PM:
--

https://issues.apache.org/jira/browse/SPARK-5343 explains that the existing 
behaviour is correct. 
The ShortestPaths implementation here gives shortest paths to the given set of 
landmark vertices.
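
A minimal GraphX sketch (hypothetical edges, assuming a SparkContext `sc`) of that 
behaviour: distances are computed towards the landmarks along edge direction, so 
vertices with no path to a landmark end up with an empty map.
{code}
import org.apache.spark.graphx._
import org.apache.spark.graphx.lib.ShortestPaths

val edges = sc.parallelize(Seq(Edge(0L, 2L, 1), Edge(0L, 4L, 1), Edge(2L, 3L, 1)))
val graph = Graph.fromEdges(edges, defaultValue = 0)

// Landmark 0 is only reachable from itself when every edge points away from it,
// hence (0,Map(0 -> 0)) and empty maps elsewhere, as in the report below.
ShortestPaths.run(graph, Seq(0L)).vertices.collect.foreach(println)
{code}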


was (Author: atailor22):
https://issues.apache.org/jira/browse/SPARK-5343 explains existing behaviour is 
correct. 

 ShortestPaths run with error result
 ---

 Key: SPARK-8292
 URL: https://issues.apache.org/jira/browse/SPARK-8292
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 1.3.1
 Environment: Ubuntu 64bit 
Reporter: Bruce Chen
Priority: Minor
  Labels: patch
 Attachments: ShortestPaths.patch


 In graphx/lib/ShortestPaths, I ran an example with input data:
 0\t2
 0\t4
 2\t3
 3\t6
 4\t2
 4\t5
 5\t3
 5\t6
 then I wrote a function, set point '0' as the source point, and calculated 
 the shortest path from point 0 to the other points, with code like this:
 val source: Seq[VertexId] = Seq(0)
 val ss = ShortestPaths.run(graph, source)
 then I get the shortest path value for every vertex:
 (4,Map())
 (0,Map(0 -> 0))
 (6,Map())
 (3,Map())
 (5,Map())
 (2,Map())
 but the right result should be:
 (4,Map(0 -> 1))
 (0,Map(0 -> 0))
 (6,Map(0 -> 3))
 (3,Map(0 -> 2))
 (5,Map(0 -> 2))
 (2,Map(0 -> 1))
 so I checked the source code of 
 spark/graphx/src/main/scala/org/apache/spark/graphx/lib/ShortestPaths.scala 
 and found a bug.
 The patch is attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8684) Update R version in Spark EC2 AMI

2015-08-30 Thread Vincent Warmerdam (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14721566#comment-14721566
 ] 

Vincent Warmerdam commented on SPARK-8684:
--

Upgrading the Spark EC2 AMI would take significant work and might not be 
something for the upcoming Spark release. This is (to my knowledge) still 
unresolved. 

 Update R version in Spark EC2 AMI
 -

 Key: SPARK-8684
 URL: https://issues.apache.org/jira/browse/SPARK-8684
 Project: Spark
  Issue Type: Improvement
  Components: EC2, SparkR
Reporter: Shivaram Venkataraman
Priority: Minor
 Fix For: 1.5.0


 Right now the R version in the AMI is 3.1. However, a number of R libraries 
 need R version 3.2, so it would be good to update the R version on the AMI 
 when launching an EC2 cluster.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8292) ShortestPaths run with error result

2015-08-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-8292.
--
Resolution: Not A Problem

 ShortestPaths run with error result
 ---

 Key: SPARK-8292
 URL: https://issues.apache.org/jira/browse/SPARK-8292
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 1.3.1
 Environment: Ubuntu 64bit 
Reporter: Bruce Chen
Priority: Minor
  Labels: patch
 Attachments: ShortestPaths.patch


 In graphx/lib/ShortestPaths, i run an example with input data:
 0\t2
 0\t4
 2\t3
 3\t6
 4\t2
 4\t5
 5\t3
 5\t6
 then i write a function and set point '0' as the source point, and calculate 
 the shortest path from point 0 to the others points, the code like this:
 val source: Seq[VertexId] = Seq(0)
 val ss = ShortestPaths.run(graph, source)
 then, i get the run result of all the vertex's shortest path value:
 (4,Map())
 (0,Map(0 -> 0))
 (6,Map())
 (3,Map())
 (5,Map())
 (2,Map())
 but the right result should be:
 (4,Map(0 -> 1))
 (0,Map(0 -> 0))
 (6,Map(0 -> 3))
 (3,Map(0 -> 2))
 (5,Map(0 -> 2))
 (2,Map(0 -> 1))
 so, i check the source code of 
 spark/graphx/src/main/scala/org/apache/spark/graphx/lib/ShortestPaths.scala 
 and find a bug.
 The patch list in the following.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10356) MLlib: Normalization should use absolute values

2015-08-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-10356.
---
Resolution: Not A Problem

 MLlib: Normalization should use absolute values
 ---

 Key: SPARK-10356
 URL: https://issues.apache.org/jira/browse/SPARK-10356
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.4.1
Reporter: Carsten Schnober
  Labels: easyfix
   Original Estimate: 2h
  Remaining Estimate: 2h

 The normalizer does not handle vectors with negative values properly. It can 
 be tested with the following code
 {code}
 val normalized = new Normalizer(1.0).transform(v: Vector)
 normalized.toArray.sum == 1.0
 {code}
 This yields true if all values in Vector v are positive, but false when v 
 contains one or more negative values. This is because the values in v are 
 taken directly, without applying {{abs()}}.
 This (probably) does not occur for {{p=2.0}} because the values are squared 
 and hence positive anyway.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10356) MLlib: Normalization should use absolute values

2015-08-30 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14721511#comment-14721511
 ] 

Sean Owen commented on SPARK-10356:
---

Exactly. Your code does not compute a 1 norm because you are summing elements 
and not their absolute values. The normalization in Spark is correct. To be 
clear normalization makes the norm 1; you are testing some other condition that 
is not true. 

 MLlib: Normalization should use absolute values
 ---

 Key: SPARK-10356
 URL: https://issues.apache.org/jira/browse/SPARK-10356
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.4.1
Reporter: Carsten Schnober
  Labels: easyfix
   Original Estimate: 2h
  Remaining Estimate: 2h

 The normalizer does not handle vectors with negative values properly. It can 
 be tested with the following code
 {code}
 val normalized = new Normalizer(1.0).transform(v: Vector)
 normalized.toArray.sum == 1.0
 {code}
 This yields true if all values in Vector v are positive, but false when v 
 contains one or more negative values. This is because the values in v are 
 taken directly, without applying {{abs()}}.
 This (probably) does not occur for {{p=2.0}} because the values are squared 
 and hence positive anyway.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-8684) Update R version in Spark EC2 AMI

2015-08-30 Thread Vincent Warmerdam (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vincent Warmerdam updated SPARK-8684:
-
Comment: was deleted

(was: closed due to github confusion. reopened due to sanity.)

 Update R version in Spark EC2 AMI
 -

 Key: SPARK-8684
 URL: https://issues.apache.org/jira/browse/SPARK-8684
 Project: Spark
  Issue Type: Improvement
  Components: EC2, SparkR
Reporter: Shivaram Venkataraman
Priority: Minor
 Fix For: 1.5.0


 Right now the R version in the AMI is 3.1. However, a number of R libraries 
 need R version 3.2, so it would be good to update the R version on the AMI 
 when launching an EC2 cluster.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8292) ShortestPaths run with error result

2015-08-30 Thread Anita Tailor (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14721554#comment-14721554
 ] 

Anita Tailor commented on SPARK-8292:
-

https://issues.apache.org/jira/browse/SPARK-5343 explains existing behaviour is 
correct. 

 ShortestPaths run with error result
 ---

 Key: SPARK-8292
 URL: https://issues.apache.org/jira/browse/SPARK-8292
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 1.3.1
 Environment: Ubuntu 64bit 
Reporter: Bruce Chen
Priority: Minor
  Labels: patch
 Attachments: ShortestPaths.patch


 In graphx/lib/ShortestPaths, i run an example with input data:
 0\t2
 0\t4
 2\t3
 3\t6
 4\t2
 4\t5
 5\t3
 5\t6
 then i write a function and set point '0' as the source point, and calculate 
 the shortest path from point 0 to the others points, the code like this:
 val source: Seq[VertexId] = Seq(0)
 val ss = ShortestPaths.run(graph, source)
 then, i get the run result of all the vertex's shortest path value:
 (4,Map())
 (0,Map(0 -> 0))
 (6,Map())
 (3,Map())
 (5,Map())
 (2,Map())
 but the right result should be:
 (4,Map(0 -> 1))
 (0,Map(0 -> 0))
 (6,Map(0 -> 3))
 (3,Map(0 -> 2))
 (5,Map(0 -> 2))
 (2,Map(0 -> 1))
 so, i check the source code of 
 spark/graphx/src/main/scala/org/apache/spark/graphx/lib/ShortestPaths.scala 
 and find a bug.
 The patch list in the following.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8684) Update R version in Spark EC2 AMI

2015-08-30 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14721585#comment-14721585
 ] 

Yin Huai commented on SPARK-8684:
-

Ah, I see. I thought the work for this JIRA had all been done because the fix 
version was set. 

 Update R version in Spark EC2 AMI
 -

 Key: SPARK-8684
 URL: https://issues.apache.org/jira/browse/SPARK-8684
 Project: Spark
  Issue Type: Improvement
  Components: EC2, SparkR
Reporter: Shivaram Venkataraman
Priority: Minor
 Fix For: 1.5.0


 Right now the R version in the AMI is 3.1. However, a number of R libraries 
 need R version 3.2, so it would be good to update the R version on the AMI 
 when launching an EC2 cluster.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9642) LinearRegression should supported weighted data

2015-08-30 Thread Meihua Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14721801#comment-14721801
 ] 

Meihua Wu commented on SPARK-9642:
--

[~sethah] Thank you for your help. I worked on this and have a draft version, 
which makes use of a few components in the PR for a similar issue for logistic 
regression (https://issues.apache.org/jira/browse/SPARK-7685). I am planning to 
send a PR after SPARK-7685 is resolved. 

 LinearRegression should supported weighted data
 ---

 Key: SPARK-9642
 URL: https://issues.apache.org/jira/browse/SPARK-9642
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: Meihua Wu
  Labels: 1.6

 In many modeling applications, data points are not necessarily sampled with 
 equal probabilities. Linear regression should support weighting, which accounts 
 for the over- or under-sampling. 
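 
 A hedged sketch (plain Scala, not a proposed implementation) of the weighted 
 least-squares objective this asks for, where each point carries a weight w_i that 
 can correct for over- or under-sampling:
 {code}
 // loss(beta) = sum_i w_i * (y_i - x_i . beta)^2
 def weightedLoss(x: Array[Array[Double]], y: Array[Double],
                  w: Array[Double], beta: Array[Double]): Double =
   x.indices.map { i =>
     val prediction = x(i).zip(beta).map { case (xij, bj) => xij * bj }.sum
     w(i) * math.pow(y(i) - prediction, 2)
   }.sum
 {code}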



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-10329) Cost RDD in k-means|| initialization is not storage-efficient

2015-08-30 Thread hujiayin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hujiayin updated SPARK-10329:
-
Comment: was deleted

(was: ok, I will try to fix it today)

 Cost RDD in k-means|| initialization is not storage-efficient
 -

 Key: SPARK-10329
 URL: https://issues.apache.org/jira/browse/SPARK-10329
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.1, 1.4.1, 1.5.0
Reporter: Xiangrui Meng
Assignee: hujiayin
  Labels: clustering

 Currently we use `RDD[Vector]` to store point cost during k-means|| 
 initialization, where each `Vector` has size `runs`. This is not 
 storage-efficient because `runs` is usually 1 and then each record is a 
 Vector of size 1. What we need is just the 8 bytes to store the cost, but we 
 introduce two objects (DenseVector and its values array), which could cost 16 
 bytes. That is 200% overhead. Thanks [~Grace Huang] and Jiayin Hu from Intel 
 for reporting this issue!
 There are several solutions:
 1. Use `RDD[Array[Double]]` instead of `RDD[Vector]`, which saves 8 bytes per 
 record.
 2. Use `RDD[Array[Double]]` but batch the values for storage, e.g. each 
 `Array[Double]` object covers 1024 instances, which could remove most of the 
 overhead.
 Besides, using MEMORY_AND_DISK instead of MEMORY_ONLY could prevent cost RDDs 
 kicking out the training dataset from memory.
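 
 A hedged sketch of proposal 2 (not the actual k-means|| code; `costs: RDD[Double]` 
 and the batch size are assumptions), batching per-point costs into large arrays 
 and persisting with MEMORY_AND_DISK:
 {code}
 import org.apache.spark.storage.StorageLevel
 
 val batchSize = 1024
 val batchedCosts = costs.mapPartitions { iter =>
   // One Array[Double] per 1024 points instead of one object per point,
   // removing most of the per-record overhead.
   iter.grouped(batchSize).map(_.toArray)
 }.persist(StorageLevel.MEMORY_AND_DISK)
 {code}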



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10329) Cost RDD in k-means|| initialization is not storage-efficient

2015-08-30 Thread hujiayin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14721799#comment-14721799
 ] 

hujiayin commented on SPARK-10329:
--

ok, I will try to fix it today

 Cost RDD in k-means|| initialization is not storage-efficient
 -

 Key: SPARK-10329
 URL: https://issues.apache.org/jira/browse/SPARK-10329
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.1, 1.4.1, 1.5.0
Reporter: Xiangrui Meng
Assignee: hujiayin
  Labels: clustering

 Currently we use `RDD[Vector]` to store point cost during k-means|| 
 initialization, where each `Vector` has size `runs`. This is not 
 storage-efficient because `runs` is usually 1 and then each record is a 
 Vector of size 1. What we need is just the 8 bytes to store the cost, but we 
 introduce two objects (DenseVector and its values array), which could cost 16 
 bytes. That is 200% overhead. Thanks [~Grace Huang] and Jiayin Hu from Intel 
 for reporting this issue!
 There are several solutions:
 1. Use `RDD[Array[Double]]` instead of `RDD[Vector]`, which saves 8 bytes per 
 record.
 2. Use `RDD[Array[Double]]` but batch the values for storage, e.g. each 
 `Array[Double]` object covers 1024 instances, which could remove most of the 
 overhead.
 Besides, using MEMORY_AND_DISK instead of MEMORY_ONLY could prevent cost RDDs 
 kicking out the training dataset from memory.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10357) DataFrames unable to drop unwanted columns

2015-08-30 Thread Randy Gelhausen (JIRA)
Randy Gelhausen created SPARK-10357:
---

 Summary: DataFrames unable to drop unwanted columns
 Key: SPARK-10357
 URL: https://issues.apache.org/jira/browse/SPARK-10357
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.1
Reporter: Randy Gelhausen


spark-csv seems to be exposing an issue with DataFrame's inability to drop 
unwanted columns.

Related GitHub issue: https://github.com/databricks/spark-csv/issues/61

My data (with header) looks like:
MI_PRINX,offense_id,rpt_date,occur_date,occur_time,poss_date,poss_time,beat,apt_office_prefix,apt_office_num,location,MinOfucr,MinOfibr_code,dispo_code,MaxOfnum_victims,Shift,Avg
 Day,loc_type,UC2 Literal,neighborhood,npu,x,y,,,
934782,90360664,2/5/2009,2/3/2009,13:50:00,2/3/2009,15:00:00,305,NULL,NULL,55 
MCDONOUGH BLVD SW,670,2308,NULL,1,Day,Tue,35,LARCENY-NON VEHICLE,South 
Atlanta,Y,-84.38654,33.72024,,,
934783,90370891,2/6/2009,2/6/2009,8:50:00,2/6/2009,10:45:00,502,NULL,NULL,464 
ANSLEY WALK TER NW,640,2305,NULL,1,Day,Fri,18,LARCENY-FROM VEHICLE,Ansley 
Park,E,-84.37276,33.79685,,,

Despite using sqlContext (I also tried the programmatic raw.select, with the same 
result) to remove columns from the DataFrame, attempts to operate on it cause 
failures.

Snippet:
// Read CSV file, clean field names
val raw = 
sqlContext.read.format("com.databricks.spark.csv").option("header", 
"true").option("DROPMALFORMED", "true").load(input)
val columns = raw.columns.map(x => x.replaceAll(" ", "_"))
raw.toDF(columns:_*).registerTempTable(table)
val clean = sqlContext.sql("select " + columns.filter(x => x.length() > 0 
&& x != " ").mkString(", ") + " from " + table)

System.err.println(clean.schema)
System.err.println(clean.columns.mkString(","))
System.err.println(clean.take(1).mkString("|"))
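
A hedged workaround sketch (assuming `raw` is the DataFrame loaded above): select 
only the columns whose names are non-empty, so the duplicate '' entries produced by 
the trailing commas in the header never need to be resolved.
{code}
import org.apache.spark.sql.functions.col

val keep = raw.columns.filter(_.trim.nonEmpty).map(col)
val cleaned = raw.select(keep: _*)
{code}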

StackTrace:
15/08/30 18:23:13 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 
(TID 0, docker.dev, NODE_LOCAL, 1482 bytes)
15/08/30 18:23:14 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in 
memory on docker.dev:58272 (size: 1811.0 B, free: 530.0 MB)
15/08/30 18:23:14 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in 
memory on docker.dev:58272 (size: 21.9 KB, free: 530.0 MB)
15/08/30 18:23:15 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 
(TID 0) in 1350 ms on docker.dev (1/1)
15/08/30 18:23:15 INFO scheduler.DAGScheduler: ResultStage 0 (take at 
CsvRelation.scala:174) finished in 1.354 s
15/08/30 18:23:15 INFO cluster.YarnScheduler: Removed TaskSet 0.0, whose tasks 
have all completed, from pool 
15/08/30 18:23:15 INFO scheduler.DAGScheduler: Job 0 finished: take at 
CsvRelation.scala:174, took 1.413674 s
StructType(StructField(MI_PRINX,StringType,true), 
StructField(offense_id,StringType,true), StructField(rpt_date,StringType,true), 
StructField(occur_date,StringType,true), 
StructField(occur_time,StringType,true), 
StructField(poss_date,StringType,true), StructField(poss_time,StringType,true), 
StructField(beat,StringType,true), 
StructField(apt_office_prefix,StringType,true), 
StructField(apt_office_num,StringType,true), 
StructField(location,StringType,true), StructField(MinOfucr,StringType,true), 
StructField(MinOfibr_code,StringType,true), 
StructField(dispo_code,StringType,true), 
StructField(MaxOfnum_victims,StringType,true), 
StructField(Shift,StringType,true), StructField(Avg_Day,StringType,true), 
StructField(loc_type,StringType,true), 
StructField(UC2_Literal,StringType,true), 
StructField(neighborhood,StringType,true), StructField(npu,StringType,true), 
StructField(x,StringType,true), StructField(y,StringType,true))
MI_PRINX,offense_id,rpt_date,occur_date,occur_time,poss_date,poss_time,beat,apt_office_prefix,apt_office_num,location,MinOfucr,MinOfibr_code,dispo_code,MaxOfnum_victims,Shift,Avg_Day,loc_type,UC2_Literal,neighborhood,npu,x,y
15/08/30 18:23:16 INFO storage.MemoryStore: ensureFreeSpace(232400) called with 
curMem=259660, maxMem=278019440
15/08/30 18:23:16 INFO storage.MemoryStore: Block broadcast_2 stored as values 
in memory (estimated size 227.0 KB, free 264.7 MB)
15/08/30 18:23:16 INFO storage.MemoryStore: ensureFreeSpace(22377) called with 
curMem=492060, maxMem=278019440
15/08/30 18:23:16 INFO storage.MemoryStore: Block broadcast_2_piece0 stored as 
bytes in memory (estimated size 21.9 KB, free 264.6 MB)
15/08/30 18:23:16 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in 
memory on 172.17.0.19:41088 (size: 21.9 KB, free: 265.1 MB)
15/08/30 18:23:16 INFO spark.SparkContext: Created broadcast 2 from textFile at 
TextFile.scala:30
Exception in thread main java.lang.IllegalArgumentException: The header 
contains a duplicate entry: '' in [MI_PRINX, offense_id, rpt_date, occur_date, 
occur_time, poss_date, poss_time, beat, apt_office_prefix, apt_office_num, 
location, MinOfucr, MinOfibr_code, dispo_code, MaxOfnum_victims, Shift, Avg 
Day, loc_type, UC2 Literal, 

[jira] [Created] (SPARK-10358) Spark-sql throws IOException on exit when using HDFS to store event log.

2015-08-30 Thread Sioa Song (JIRA)
Sioa Song created SPARK-10358:
-

 Summary: Spark-sql throws IOException on exit when using HDFS to 
store event log.
 Key: SPARK-10358
 URL: https://issues.apache.org/jira/browse/SPARK-10358
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.1
 Environment: * spark-1.3.1-bin-hadoop2.6
* hadoop-2.6.0
* Red hat 2.6.32-504.el6.x86_64
Reporter: Sioa Song
Priority: Minor
 Fix For: 1.3.1


h2. Summary 
In Spark 1.3.1, if HDFS is used to store the event log, spark-sql will throw a 
java.io.IOException: Filesystem closed on exit. 
h2. How to reproduce 
1. Enable event log mechanism, and configure the file location to HDFS. 
   You can do this by setting these two properties in spark-defaults.conf: 
spark.eventLog.enabled  true 
spark.eventLog.dir  hdfs://x:x/spark-events 
2. Start spark-sql, and type exit once it starts. 
{noformat} 
spark-sql> exit; 
15/08/14 06:29:20 ERROR scheduler.LiveListenerBus: Listener 
EventLoggingListener threw an exception 
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) 
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 
at java.lang.reflect.Method.invoke(Method.java:597) 
at 
org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$3.apply(EventLoggingListener.scala:144)
 
at 
org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$3.apply(EventLoggingListener.scala:144)
 
at scala.Option.foreach(Option.scala:236) 
at 
org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:144)
 
at 
org.apache.spark.scheduler.EventLoggingListener.onApplicationEnd(EventLoggingListener.scala:181)
 
at 
org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:54)
 
at 
org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
 
at 
org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
 
at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:53) 
at 
org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:36)
 
at 
org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:76)
 
at 
org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply(AsynchronousListenerBus.scala:61)
 
at 
org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply(AsynchronousListenerBus.scala:61)
 
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1678) 
at 
org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:60)
 
Caused by: java.io.IOException: Filesystem closed 
at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:795) 
at 
org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(DFSOutputStream.java:1985) 
at org.apache.hadoop.hdfs.DFSOutputStream.hflush(DFSOutputStream.java:1946) 
at org.apache.hadoop.fs.FSDataOutputStream.hflush(FSDataOutputStream.java:130) 
... 19 more 
15/08/14 06:29:20 INFO handler.ContextHandler: stopped 
o.s.j.s.ServletContextHandler{/metrics/json,null} 
15/08/14 06:29:20 INFO handler.ContextHandler: stopped 
o.s.j.s.ServletContextHandler{/stages/stage/kill,null} 
15/08/14 06:29:20 INFO handler.ContextHandler: stopped 
o.s.j.s.ServletContextHandler{/,null} 
15/08/14 06:29:20 INFO handler.ContextHandler: stopped 
o.s.j.s.ServletContextHandler{/static,null} 
15/08/14 06:29:20 INFO handler.ContextHandler: stopped 
o.s.j.s.ServletContextHandler{/executors/threadDump/json,null} 
15/08/14 06:29:20 INFO handler.ContextHandler: stopped 
o.s.j.s.ServletContextHandler{/executors/threadDump,null} 
15/08/14 06:29:20 INFO handler.ContextHandler: stopped 
o.s.j.s.ServletContextHandler{/executors/json,null} 
15/08/14 06:29:20 INFO handler.ContextHandler: stopped 
o.s.j.s.ServletContextHandler{/executors,null} 
15/08/14 06:29:20 INFO handler.ContextHandler: stopped 
o.s.j.s.ServletContextHandler{/environment/json,null} 
15/08/14 06:29:20 INFO handler.ContextHandler: stopped 
o.s.j.s.ServletContextHandler{/environment,null} 
15/08/14 06:29:20 INFO handler.ContextHandler: stopped 
o.s.j.s.ServletContextHandler{/storage/rdd/json,null} 
15/08/14 06:29:20 INFO handler.ContextHandler: stopped 
o.s.j.s.ServletContextHandler{/storage/rdd,null} 
15/08/14 06:29:20 INFO handler.ContextHandler: stopped 
o.s.j.s.ServletContextHandler{/storage/json,null} 
15/08/14 06:29:20 INFO handler.ContextHandler: stopped 
o.s.j.s.ServletContextHandler{/storage,null} 
15/08/14 06:29:20 INFO handler.ContextHandler: stopped 
o.s.j.s.ServletContextHandler{/stages/pool/json,null} 
15/08/14 06:29:20 INFO handler.ContextHandler: stopped 
o.s.j.s.ServletContextHandler{/stages/pool,null} 
15/08/14 06:29:20 INFO 

[jira] [Commented] (SPARK-9666) ML 1.5 QA: model save/load audit

2015-08-30 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14721906#comment-14721906
 ] 

yuhao yang commented on SPARK-9666:
---

The following models have no changes in 1.5:
LogisticRegressionModel
LassoModel
SVMModel
RidgeRegressionModel
NaiveBayesModel
LinearRegressionModel
DecisionTreeModel
MatrixFactorizationModel
RandomForestModel
GradientBoostedTreesModel
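For reference, a minimal example of the save/load round trip this audit exercises (a sketch, assuming a spark-shell session with an active SparkContext {{sc}} and a writable placeholder path):
{noformat}
// Sketch only: save/load round trip for one spark.mllib model.
import org.apache.spark.mllib.classification.LogisticRegressionModel
import org.apache.spark.mllib.linalg.Vectors

val model = new LogisticRegressionModel(Vectors.dense(0.5, -0.5), 0.1)
model.save(sc, "/tmp/lr-model")                        // placeholder path
val loaded = LogisticRegressionModel.load(sc, "/tmp/lr-model")
assert(loaded.weights == model.weights && loaded.intercept == model.intercept)
{noformat}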

 ML 1.5 QA: model save/load audit
 

 Key: SPARK-9666
 URL: https://issues.apache.org/jira/browse/SPARK-9666
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Joseph K. Bradley
Assignee: yuhao yang

 We should check to make sure no changes broke model import/export in 
 spark.mllib.
 * If a model's name, data members, or constructors have changed _at all_, 
 then we likely need to support a new save/load format version.  Different 
 versions must be tested in unit tests to ensure backwards compatibility 
 (i.e., verify we can load old model formats).
 * Examples in the programming guide should include save/load when available.  
 It's important to try running each example in the guide whenever it is 
 modified (since there are no automated tests).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10199) Avoid using reflections for parquet model save

2015-08-30 Thread Feynman Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14721907#comment-14721907
 ] 

Feynman Liang commented on SPARK-10199:
---

[~vinodkc] Thanks! I think these results are convincing. Let's see what others 
think, but FWIW I'm all for these changes, particularly because it sets a 
precedent for future model save/load code to specify the schema explicitly.

 Avoid using reflections for parquet model save
 --

 Key: SPARK-10199
 URL: https://issues.apache.org/jira/browse/SPARK-10199
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib
Reporter: Feynman Liang
Priority: Minor

 These items are not high priority, since the overhead of writing to Parquet is 
 much greater than that of runtime reflection.
 Several model save/load implementations in MLlib use case classes to infer a 
 schema for the data frame saved to Parquet. However, inferring a schema from case classes or 
 tuples uses [runtime 
 reflection|https://github.com/apache/spark/blob/d7b4c095271c36fcc7f9ded267ecf5ec66fac803/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L361]
  which is unnecessary since the types are already known at the time `save` is 
 called.
 It would be better to just specify the schema for the data frame directly 
 using {{sqlContext.createDataFrame(dataRDD, schema)}}
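 A minimal sketch of that approach (an added illustration, not from the ticket; 
 assumes an existing SparkContext {{sc}}, a SQLContext {{sqlContext}}, the Spark 
 1.4+ writer API, and a placeholder output path):
{noformat}
// Sketch only: build the schema explicitly instead of inferring it from a
// case class via runtime reflection, then save to Parquet.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

val schema = StructType(Seq(
  StructField("weight", DoubleType, nullable = false),
  StructField("intercept", DoubleType, nullable = false)))
val rows = sc.parallelize(Seq(Row(0.5, 0.1)))
// The schema is known when save is called, so no reflection is involved.
sqlContext.createDataFrame(rows, schema).write.parquet("/tmp/model-data")
{noformat}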



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9642) LinearRegression should support weighted data

2015-08-30 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14721791#comment-14721791
 ] 

Seth Hendrickson commented on SPARK-9642:
-

I'd like to take this one if no one else is working on it.

 LinearRegression should support weighted data
 ---

 Key: SPARK-9642
 URL: https://issues.apache.org/jira/browse/SPARK-9642
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: Meihua Wu
  Labels: 1.6

 In many modeling applications, data points are not necessarily sampled with 
 equal probability. Linear regression should support instance weights to account 
 for over- or under-sampling. 
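 A small sketch (added for illustration, plain Scala, not from the ticket) of the 
 weighted least-squares objective this implies, sum_i w_i * (y_i - x_i . beta)^2:
{noformat}
// Sketch only: weighted squared loss over plain Scala arrays.
def weightedLoss(x: Array[Array[Double]], y: Array[Double],
                 w: Array[Double], beta: Array[Double]): Double =
  x.indices.map { i =>
    val pred = x(i).zip(beta).map { case (xi, b) => xi * b }.sum  // x_i . beta
    w(i) * math.pow(y(i) - pred, 2)                               // w_i * residual^2
  }.sum
{noformat}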



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10264) Add @Since annotation to ml.recommendation

2015-08-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14722969#comment-14722969
 ] 

Apache Spark commented on SPARK-10264:
--

User 'tijoparacka' has created a pull request for this issue:
https://github.com/apache/spark/pull/8532

> Add @Since annotation to ml.recommendation
> --
>
> Key: SPARK-10264
> URL: https://issues.apache.org/jira/browse/SPARK-10264
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML
>Reporter: Xiangrui Meng
>Priority: Minor
>  Labels: starter
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10264) Add @Since annotation to ml.recommendation

2015-08-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10264:


Assignee: Apache Spark

> Add @Since annotation to ml.recommendation
> --
>
> Key: SPARK-10264
> URL: https://issues.apache.org/jira/browse/SPARK-10264
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML
>Reporter: Xiangrui Meng
>Assignee: Apache Spark
>Priority: Minor
>  Labels: starter
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10264) Add @Since annotation to ml.recommendation

2015-08-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10264:


Assignee: (was: Apache Spark)

> Add @Since annotation to ml.recommendation
> --
>
> Key: SPARK-10264
> URL: https://issues.apache.org/jira/browse/SPARK-10264
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML
>Reporter: Xiangrui Meng
>Priority: Minor
>  Labels: starter
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6724) Model import/export for FPGrowth

2015-08-30 Thread Meethu Mathew (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14722999#comment-14722999
 ] 

Meethu Mathew commented on SPARK-6724:
--

[~josephkb] Could you please give your opinion on this?

> Model import/export for FPGrowth
> 
>
> Key: SPARK-6724
> URL: https://issues.apache.org/jira/browse/SPARK-6724
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Note: experimental model API



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9545) Run Maven tests in pull request builder if title has [test-maven] in it

2015-08-30 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-9545:
---
Summary: Run Maven tests in pull request builder if title has 
[test-maven] in it  (was: Run Maven tests in pull request builder if title 
has [maven-test] in it)

 Run Maven tests in pull request builder if title has [test-maven] in it
 -

 Key: SPARK-9545
 URL: https://issues.apache.org/jira/browse/SPARK-9545
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: Patrick Wendell
Assignee: Patrick Wendell
 Fix For: 1.6.0


 We now have infrastructure in the build tooling for running Maven tests, but 
 it is not actually used anywhere. With a very minor change we can support 
 running the Maven tests if the pull request title has maven-test in it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9547) Allow testing pull requests with different Hadoop versions

2015-08-30 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-9547.

   Resolution: Fixed
Fix Version/s: 1.6.0

 Allow testing pull requests with different Hadoop versions
 --

 Key: SPARK-9547
 URL: https://issues.apache.org/jira/browse/SPARK-9547
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: Patrick Wendell
Assignee: Patrick Wendell
 Fix For: 1.6.0


 Similar to SPARK-9545, we should allow testing different Hadoop profiles in 
 the PRB.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9545) Run Maven tests in pull request builder if title has [maven-test] in it

2015-08-30 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-9545.

   Resolution: Fixed
Fix Version/s: 1.6.0

 Run Maven tests in pull request builder if title has [maven-test] in it
 -

 Key: SPARK-9545
 URL: https://issues.apache.org/jira/browse/SPARK-9545
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: Patrick Wendell
Assignee: Patrick Wendell
 Fix For: 1.6.0


 We now have infrastructure in the build tooling for running Maven tests, but 
 it is not actually used anywhere. With a very minor change we can support 
 running the Maven tests if the pull request title has maven-test in it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10359) Enumerate Spark's dependencies in a file and diff against it for new pull requests

2015-08-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10359:


Assignee: Apache Spark  (was: Patrick Wendell)

 Enumerate Spark's dependencies in a file and diff against it for new pull 
 requests 
 ---

 Key: SPARK-10359
 URL: https://issues.apache.org/jira/browse/SPARK-10359
 Project: Spark
  Issue Type: New Feature
  Components: Build
Reporter: Patrick Wendell
Assignee: Apache Spark

 Sometimes when we change dependencies it can be pretty unclear which 
 transitive dependencies change as a result. If we enumerate all of the 
 dependencies and put them in a source file in the repo, it becomes very 
 explicit what is changing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10359) Enumerate Spark's dependencies in a file and diff against it for new pull requests

2015-08-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10359:


Assignee: Patrick Wendell  (was: Apache Spark)

 Enumerate Spark's dependencies in a file and diff against it for new pull 
 requests 
 ---

 Key: SPARK-10359
 URL: https://issues.apache.org/jira/browse/SPARK-10359
 Project: Spark
  Issue Type: New Feature
  Components: Build
Reporter: Patrick Wendell
Assignee: Patrick Wendell

 Sometimes when we change dependencies it can be pretty unclear which 
 transitive dependencies change as a result. If we enumerate all of the 
 dependencies and put them in a source file in the repo, it becomes very 
 explicit what is changing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10359) Enumerate Spark's dependencies in a file and diff against it for new pull requests

2015-08-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14722716#comment-14722716
 ] 

Apache Spark commented on SPARK-10359:
--

User 'pwendell' has created a pull request for this issue:
https://github.com/apache/spark/pull/8531

 Enumerate Spark's dependencies in a file and diff against it for new pull 
 requests 
 ---

 Key: SPARK-10359
 URL: https://issues.apache.org/jira/browse/SPARK-10359
 Project: Spark
  Issue Type: New Feature
  Components: Build
Reporter: Patrick Wendell
Assignee: Patrick Wendell

 Sometimes when we change dependencies it can be pretty unclear which 
 transitive dependencies change as a result. If we enumerate all of the 
 dependencies and put them in a source file in the repo, it becomes very 
 explicit what is changing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10359) Enumerate Spark's dependencies in a file and diff against it for new pull requests

2015-08-30 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-10359:
---

 Summary: Enumerate Spark's dependencies in a file and diff against 
it for new pull requests 
 Key: SPARK-10359
 URL: https://issues.apache.org/jira/browse/SPARK-10359
 Project: Spark
  Issue Type: New Feature
  Components: Build
Reporter: Patrick Wendell
Assignee: Patrick Wendell


Sometimes when we change dependencies it can be pretty unclear which transitive 
dependencies change as a result. If we enumerate all of the dependencies and put 
them in a source file in the repo, it becomes very explicit what is changing.
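A rough sketch of the diff step this implies (an added illustration, not the actual build tooling; file paths are placeholders):
{noformat}
// Sketch only: compare a freshly generated dependency list against the
// manifest checked into the repo and print what was added or removed.
import scala.io.Source

object DepDiff {
  private def load(path: String): Set[String] = {
    val src = Source.fromFile(path)
    try src.getLines().map(_.trim).filter(_.nonEmpty).toSet
    finally src.close()
  }

  def main(args: Array[String]): Unit = {
    val expected = load("dev/deps-manifest.txt")      // placeholder paths
    val actual   = load("target/deps-generated.txt")
    (actual -- expected).toSeq.sorted.foreach(d => println(s"added:   $d"))
    (expected -- actual).toSeq.sorted.foreach(d => println(s"removed: $d"))
  }
}
{noformat}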



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org