[jira] [Created] (SPARK-4849) Pass partitioning information (distribute by) to In-memory caching

2014-12-15 Thread Nitin Goyal (JIRA)
Nitin Goyal created SPARK-4849:
--

 Summary: Pass partitioning information (distribute by) to 
In-memory caching
 Key: SPARK-4849
 URL: https://issues.apache.org/jira/browse/SPARK-4849
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.2.0
Reporter: Nitin Goyal
Priority: Minor









[jira] [Updated] (SPARK-4849) Pass partitioning information (distribute by) to In-memory caching

2014-12-15 Thread Nitin Goyal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nitin Goyal updated SPARK-4849:
---
Description: 
HQL's DISTRIBUTE BY column_name partitions data based on the values of the 
specified column. We can pass this information to in-memory caching for further 
performance improvements, e.g. in joins, where an extra partitioning step can be 
skipped based on this information.

Refer - 
http://apache-spark-user-list.1001560.n3.nabble.com/SchemaRDD-partition-on-specific-column-values-td20350.html
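
For illustration, a minimal sketch of the kind of usage this would help (the table and column names here are hypothetical, and a HiveContext is assumed so that DISTRIBUTE BY is available):

```
// Hypothetical sketch, not part of the proposal itself: cache a table that was
// repartitioned with DISTRIBUTE BY, hoping the planner could reuse that partitioning.
val hc = new org.apache.spark.sql.hive.HiveContext(sc)
hc.sql("SELECT * FROM src DISTRIBUTE BY key").registerTempTable("src_by_key")
hc.cacheTable("src_by_key")
// A later join on `key` could, in principle, skip an extra exchange step
// if the in-memory relation reported its partitioning to the planner.
hc.sql("SELECT a.key, b.value FROM src_by_key a JOIN other b ON a.key = b.key")
```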

 Pass partitioning information (distribute by) to In-memory caching
 --

 Key: SPARK-4849
 URL: https://issues.apache.org/jira/browse/SPARK-4849
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.2.0
Reporter: Nitin Goyal
Priority: Minor

 HQL's DISTRIBUTE BY column_name partitions data based on the values of the 
 specified column. We can pass this information to in-memory caching for further 
 performance improvements, e.g. in joins, where an extra partitioning step can be 
 skipped based on this information.
 Refer - 
 http://apache-spark-user-list.1001560.n3.nabble.com/SchemaRDD-partition-on-specific-column-values-td20350.html






[jira] [Created] (SPARK-4850) GROUP BY can't work if the schema of SchemaRDD contains struct or array type

2014-12-15 Thread Chaozhong Yang (JIRA)
Chaozhong Yang created SPARK-4850:
-

 Summary: GROUP BY can't work if the schema of SchemaRDD contains 
struct or array type
 Key: SPARK-4850
 URL: https://issues.apache.org/jira/browse/SPARK-4850
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0, 1.1.1, 1.2.0, 1.1.2
Reporter: Chaozhong Yang


In the Spark shell:

```
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val path = "path/to/json"
sqlContext.jsonFile(path).registerTempTable("Table")
val t = sqlContext.sql("select * from Table group by a")
t.collect()
```
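
For contrast, a query that only references the grouped column (or aggregates over the other columns) passes analysis; this is a minimal hypothetical sketch, not from the original report:

```
// Sketch only: referencing just the grouped column and an aggregate works.
val ok = sqlContext.sql("SELECT a, COUNT(*) FROM Table GROUP BY a")
ok.collect()
```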

Let's look into the schema of `Table`:

root
 |-- a: integer (nullable = true)
 |-- arr: array (nullable = true)
 |    |-- element: integer (containsNull = false)
 |-- createdAt: string (nullable = true)
 |-- f: struct (nullable = true)
 |    |-- __type: string (nullable = true)
 |    |-- className: string (nullable = true)
 |    |-- objectId: string (nullable = true)
 |-- objectId: string (nullable = true)
 |-- s: string (nullable = true)
 |-- updatedAt: string (nullable = true)

An exception will be thrown:

```

org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Expression not 
in GROUP BY: arr#9, tree:
Aggregate [a#8], [a#8,arr#9,createdAt#10,f#11,objectId#12,s#13,updatedAt#14]
 Subquery TestImport
  LogicalRDD [a#8,arr#9,createdAt#10,f#11,objectId#12,s#13,updatedAt#14], 
MappedRDD[18] at map at JsonRDD.scala:47

at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3$$anonfun$applyOrElse$7.apply(Analyzer.scala:126)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3$$anonfun$applyOrElse$7.apply(Analyzer.scala:125)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3.applyOrElse(Analyzer.scala:125)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3.applyOrElse(Analyzer.scala:108)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$.apply(Analyzer.scala:108)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$.apply(Analyzer.scala:106)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59)
at 
scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
at 
scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51)
at scala.collection.immutable.List.foreach(List.scala:318)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51)
at 
org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:411)
at 
org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:411)
at 
org.apache.spark.sql.SQLContext$QueryExecution.withCachedData$lzycompute(SQLContext.scala:412)
at 
org.apache.spark.sql.SQLContext$QueryExecution.withCachedData(SQLContext.scala:412)
at 
org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan$lzycompute(SQLContext.scala:413)
at 
org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan(SQLContext.scala:413)
at 
org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:418)
at 
org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:416)
at 
org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:422)
at 
org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:422)
at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:444)
at $iwC$$iwC$$iwC$$iwC.init(console:17)
at $iwC$$iwC$$iwC.init(console:22)
at $iwC$$iwC.init(console:24)
at $iwC.init(console:26)
at init(console:28)
at .init(console:32)
at .clinit(console)
at .init(console:7)
at .clinit(console)
at $print(console)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 

[jira] [Updated] (SPARK-4850) GROUP BY can't work if the schema of SchemaRDD contains struct or array type

2014-12-15 Thread Chaozhong Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chaozhong Yang updated SPARK-4850:
--
Description: 
Code in the Spark shell:

```
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val path = "path/to/json"
sqlContext.jsonFile(path).registerTempTable("Table")
val t = sqlContext.sql("select * from Table group by a")
t.collect()
```

Let's look into the schema of `Table`:

root
 |-- a: integer (nullable = true)
 |-- arr: array (nullable = true)
 |    |-- element: integer (containsNull = false)
 |-- createdAt: string (nullable = true)
 |-- f: struct (nullable = true)
 |    |-- __type: string (nullable = true)
 |    |-- className: string (nullable = true)
 |    |-- objectId: string (nullable = true)
 |-- objectId: string (nullable = true)
 |-- s: string (nullable = true)
 |-- updatedAt: string (nullable = true)

An exception will be thrown:

```

org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Expression not 
in GROUP BY: arr#9, tree:
Aggregate [a#8], [a#8,arr#9,createdAt#10,f#11,objectId#12,s#13,updatedAt#14]
 Subquery TestImport
  LogicalRDD [a#8,arr#9,createdAt#10,f#11,objectId#12,s#13,updatedAt#14], 
MappedRDD[18] at map at JsonRDD.scala:47

at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3$$anonfun$applyOrElse$7.apply(Analyzer.scala:126)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3$$anonfun$applyOrElse$7.apply(Analyzer.scala:125)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3.applyOrElse(Analyzer.scala:125)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3.applyOrElse(Analyzer.scala:108)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$.apply(Analyzer.scala:108)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$.apply(Analyzer.scala:106)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59)
at 
scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
at 
scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51)
at scala.collection.immutable.List.foreach(List.scala:318)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51)
at 
org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:411)
at 
org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:411)
at 
org.apache.spark.sql.SQLContext$QueryExecution.withCachedData$lzycompute(SQLContext.scala:412)
at 
org.apache.spark.sql.SQLContext$QueryExecution.withCachedData(SQLContext.scala:412)
at 
org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan$lzycompute(SQLContext.scala:413)
at 
org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan(SQLContext.scala:413)
at 
org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:418)
at 
org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:416)
at 
org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:422)
at 
org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:422)
at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:444)
at $iwC$$iwC$$iwC$$iwC.init(console:17)
at $iwC$$iwC$$iwC.init(console:22)
at $iwC$$iwC.init(console:24)
at $iwC.init(console:26)
at init(console:28)
at .init(console:32)
at .clinit(console)
at .init(console:7)
at .clinit(console)
at $print(console)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at 
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:852)
at 

[jira] [Updated] (SPARK-4740) Netty's network throughput is about 1/2 of NIO's in spark-perf sortByKey

2014-12-15 Thread Zhang, Liye (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhang, Liye updated SPARK-4740:
---
Attachment: repartition test.7z

Hi [~rxin], [~adav], I ran several tests on HDDs and on ramdisk with 
*repartition(192)* to compare shuffle performance between NIO and Netty, using the 
same dataset as before (400GB). I uploaded the archive *repartition test.7z*, which 
contains the results of 6 tests:
1. NIO on ramdisk
2. NIO on HDDs
3. Netty on ramdisk with connectionPerPeer set to 1
4. Netty on ramdisk with connectionPerPeer set to 8
5. Netty on HDDs with connectionPerPeer set to 1
6. Netty on HDDs with connectionPerPeer set to 8
P.S. In the attached HTML files, the unit of IO throughput is requests rather than bytes.

From the 6 tests, it is clear that reduce performance improves a lot when 
*connectionPerPeer* is raised from 1 to 8, both on ramdisk and on HDDs.

For HDDs, the reduce time of Netty with *connectionPerPeer=8* is about the same as 
NIO's (about 6.7 mins).

For ramdisk, Netty outperforms NIO even with *connectionPerPeer=1*. That is because 
NIO is memory-bandwidth bound, which I have confirmed with other tools; that is also 
why the CPU utilization of NIO in the reduce phase is only about 50%. Netty can still 
gain some performance by increasing *connectionPerPeer*. This is expected because NIO 
needs extra memory copies compared to Netty.

Before these 6 tests, I monitored the IO with *iostat* for the HDD case. With 
*connectionPerPeer* kept at its default (1), Netty's read request queue size, read 
requests, await and %util are all smaller than NIO's, which means Netty's read 
parallelism is not being fully exploited.

Till now, we can confirm that Netty does not get good read concurrency on a small 
cluster with many disks (if *connectionPerPeer* is not set), but we still cannot 
conclude that Netty can run faster than NIO on HDDs.
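
For reference, a minimal sketch of how such a Netty run can be configured; this is only an assumption about how the tests were set up, with the connection count matching the new option mentioned later in this digest:

```
// Sketch only: select the Netty transfer service and raise the per-peer connection count.
val conf = new org.apache.spark.SparkConf()
  .set("spark.shuffle.blockTransferService", "netty")
  .set("spark.shuffle.io.numConnectionsPerPeer", "8")
```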

 Netty's network throughput is about 1/2 of NIO's in spark-perf sortByKey
 

 Key: SPARK-4740
 URL: https://issues.apache.org/jira/browse/SPARK-4740
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle, Spark Core
Affects Versions: 1.2.0
Reporter: Zhang, Liye
Assignee: Reynold Xin
 Attachments: (rxin patch better executor)TestRunner  sort-by-key - 
 Thread dump for executor 3_files.zip, (rxin patch normal executor)TestRunner  
 sort-by-key - Thread dump for executor 0 _files.zip, Spark-perf Test Report 
 16 Cores per Executor.pdf, Spark-perf Test Report.pdf, TestRunner  
 sort-by-key - Thread dump for executor 1_files (Netty-48 Cores per node).zip, 
 TestRunner  sort-by-key - Thread dump for executor 1_files (Nio-48 cores per 
 node).zip, repartition test.7z, 
 rxin_patch-on_4_node_cluster_48CoresPerNode(Unbalance).7z


 When testing the current Spark master (1.3.0-snapshot) with spark-perf 
 (sort-by-key, aggregate-by-key, etc.), the Netty-based shuffle transfer service 
 takes much longer than the NIO-based one. The network throughput of Netty is only 
 about half that of NIO. 
 We tested in standalone mode; the data set used for the test is 20 billion 
 records, about 400GB in total. The spark-perf test runs on a 4-node cluster with 
 10G NICs, 48 CPU cores per node and 64GB of memory per executor. The number of 
 reduce tasks is set to 1000. 






[jira] [Commented] (SPARK-4845) Adding a parallelismRatio to control the partitions num of shuffledRDD

2014-12-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14246655#comment-14246655
 ] 

Sean Owen commented on SPARK-4845:
--

I'm interested in the motivation for this. Mapping down to fewer partitions 
will increase the amount of shuffling, right? When is this preferable to 
simply repartitioning directly?

 Adding a parallelismRatio to control the partitions num of shuffledRDD
 --

 Key: SPARK-4845
 URL: https://issues.apache.org/jira/browse/SPARK-4845
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: wangfei
 Fix For: 1.3.0


 Add a parallelismRatio to control the number of partitions of a ShuffledRDD. The 
 rule is:
  Math.max(1, parallelismRatio * number of partitions of the largest upstream 
 RDD)
 The ratio defaults to 1.0 to stay compatible with the old behavior. 
 Once we have more experience with it, we can change the default.
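
A minimal sketch of the rule quoted above (the helper name is hypothetical, not the actual patch):

{code}
// Sketch only: number of shuffle partitions derived from the largest upstream RDD.
def defaultShufflePartitions(parallelismRatio: Double,
                             upstream: Seq[org.apache.spark.rdd.RDD[_]]): Int =
  math.max(1, (parallelismRatio * upstream.map(_.partitions.length).max).toInt)
{code}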






[jira] [Commented] (SPARK-4844) SGD should support custom sampling.

2014-12-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14246657#comment-14246657
 ] 

Sean Owen commented on SPARK-4844:
--

Hm, in what case would you want to not sample the minibatch uniformly at random 
in SGD?

 SGD should support custom sampling.
 ---

 Key: SPARK-4844
 URL: https://issues.apache.org/jira/browse/SPARK-4844
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Guoqiang Li
 Fix For: 1.3.0









[jira] [Commented] (SPARK-4846) When the vocabulary size is large, Word2Vec may yield OutOfMemoryError: Requested array size exceeds VM limit

2014-12-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14246665#comment-14246665
 ] 

Sean Owen commented on SPARK-4846:
--

I think you're just running out of memory on your driver. It does not have 
enough memory to copy and serialize two data structures, syn0Global and 
syn1Global, each of which contains (vocab size * vector length) floats. With the 
default vector length of 100 and a 10M-word vocabulary, that's at least 8GB of 
RAM, and the default driver memory isn't nearly that big.

I think this is just a matter of increasing your driver memory; I imagine you 
will need 16GB+.
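
Back-of-the-envelope version of that estimate (assumptions: ~10M-word vocabulary, vector length 100, 4 bytes per Float):

```
// Rough sketch of the size estimate above, not code from Word2Vec itself:
val bytesPerArray = 10000000L * 100 * 4   // one of syn0Global / syn1Global: ~4 GB
val totalBytes    = 2 * bytesPerArray     // both arrays: ~8 GB before serialization overhead
```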

 When the vocabulary size is large, Word2Vec may yield OutOfMemoryError: 
 Requested array size exceeds VM limit
 ---

 Key: SPARK-4846
 URL: https://issues.apache.org/jira/browse/SPARK-4846
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.1.0
 Environment: Using Word2Vec to process a corpus (about 3.5G in size) with one 
 partition.
 The corpus contains about 300 million words and its vocabulary size is about 
 10 million.
Reporter: Joseph Tang
Priority: Critical

 Exception in thread Driver java.lang.reflect.InvocationTargetException
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at 
 org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162)
 Caused by: java.lang.OutOfMemoryError: Requested array size exceeds VM limit 
 at java.util.Arrays.copyOf(Arrays.java:2271)
 at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
 at 
 java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
 at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
 at 
 java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1870)
 at 
 java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1779)
 at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1186)
 at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
 at 
 org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
 at 
 org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73)
 at 
 org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164)
 at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
 at org.apache.spark.SparkContext.clean(SparkContext.scala:1242)
 at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:610)
 at 
 org.apache.spark.mllib.feature.Word2Vec$$anonfun$fit$1.apply$mcVI$sp(Word2Vec.scala:291)
 at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
 at org.apache.spark.mllib.feature.Word2Vec.fit(Word2Vec.scala:290)






[jira] [Commented] (SPARK-4846) When the vocabulary size is large, Word2Vec may yield OutOfMemoryError: Requested array size exceeds VM limit

2014-12-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14246732#comment-14246732
 ] 

Sean Owen commented on SPARK-4846:
--

But being lazy doesn't really change whether it is serialized, right? One way 
or the other, the recipients of the higher-order function have to get the same 
data. The function does use the data structures; it's not a question of simply 
keeping something out of the closure that shouldn't be there.

Is the problem that only part of this large data structure should go to each 
partition?
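
One direction this discussion points at, purely as a hypothetical sketch and not the current Word2Vec code: ship the arrays as broadcast variables rather than inside the task closure, so each executor fetches chunked copies instead of the driver serializing one huge closure (syn0Global and syn1Global are the structures named above):

```
// Hypothetical sketch only: broadcast the big arrays instead of closure-capturing them.
val syn0Bc = sc.broadcast(syn0Global)
val syn1Bc = sc.broadcast(syn1Global)
// tasks would then read syn0Bc.value / syn1Bc.value
```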

 When the vocabulary size is large, Word2Vec may yield OutOfMemoryError: 
 Requested array size exceeds VM limit
 ---

 Key: SPARK-4846
 URL: https://issues.apache.org/jira/browse/SPARK-4846
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.1.0
 Environment: Using Word2Vec to process a corpus (about 3.5G in size) with one 
 partition.
 The corpus contains about 300 million words and its vocabulary size is about 
 10 million.
Reporter: Joseph Tang
Priority: Critical

 Exception in thread Driver java.lang.reflect.InvocationTargetException
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at 
 org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162)
 Caused by: java.lang.OutOfMemoryError: Requested array size exceeds VM limit 
 at java.util.Arrays.copyOf(Arrays.java:2271)
 at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
 at 
 java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
 at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
 at 
 java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1870)
 at 
 java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1779)
 at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1186)
 at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
 at 
 org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
 at 
 org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73)
 at 
 org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164)
 at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
 at org.apache.spark.SparkContext.clean(SparkContext.scala:1242)
 at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:610)
 at 
 org.apache.spark.mllib.feature.Word2Vec$$anonfun$fit$1.apply$mcVI$sp(Word2Vec.scala:291)
 at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
 at org.apache.spark.mllib.feature.Word2Vec.fit(Word2Vec.scala:290)






[jira] [Commented] (SPARK-4844) SGD should support custom sampling.

2014-12-15 Thread Guoqiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14246738#comment-14246738
 ] 

Guoqiang Li commented on SPARK-4844:


The main reason is that {{RDD.sample}} is not efficient: it loads all the data 
into memory.
See https://github.com/witgo/spark/compare/SPARK-4844 

 SGD should support custom sampling.
 ---

 Key: SPARK-4844
 URL: https://issues.apache.org/jira/browse/SPARK-4844
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Guoqiang Li
 Fix For: 1.3.0









[jira] [Commented] (SPARK-4844) SGD should support custom sampling.

2014-12-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14246749#comment-14246749
 ] 

Sean Owen commented on SPARK-4844:
--

No, it definitely does not. See {{PartitionwiseSampledRDD}}, and how it uses 
{{BernoulliSampler}} and {{PoissonSampler}}, which are already pluggable if you 
want. They use gap-sampling iterators. If that's the only change, I would close 
this. The PR reinvents some of the classes above.
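
A minimal sketch of the point being made, using only the standard API: {{RDD.sample}} is lazy and samples per partition via {{PartitionwiseSampledRDD}}, so no full in-memory materialization is needed:

```
// Sketch only: drawing a minibatch lazily, per partition, with the built-in sampler.
val data = sc.parallelize(1 to 1000000)
val miniBatch = data.sample(withReplacement = false, fraction = 0.01, seed = 42L)
miniBatch.count()
```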

 SGD should support custom sampling.
 ---

 Key: SPARK-4844
 URL: https://issues.apache.org/jira/browse/SPARK-4844
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Guoqiang Li
 Fix For: 1.3.0









[jira] [Commented] (SPARK-4547) OOM when making bins in BinaryClassificationMetrics

2014-12-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14246794#comment-14246794
 ] 

Apache Spark commented on SPARK-4547:
-

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/3702

 OOM when making bins in BinaryClassificationMetrics
 ---

 Key: SPARK-4547
 URL: https://issues.apache.org/jira/browse/SPARK-4547
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.1.0
Reporter: Sean Owen
Assignee: Sean Owen
Priority: Minor

 Also following up on 
 http://mail-archives.apache.org/mod_mbox/spark-dev/201411.mbox/%3CCAMAsSdK4s4TNkf3_ecLC6yD-pLpys_PpT3WB7Tp6=yoxuxf...@mail.gmail.com%3E
  -- this one I intend to make a PR for a bit later. The conversation was 
 basically:
 {quote}
 Recently I was using BinaryClassificationMetrics to build an AUC curve for a 
 classifier over a reasonably large number of points (~12M). The scores were 
 all probabilities, so tended to be almost entirely unique.
 The computation does some operations by key, and this ran out of memory. It's 
 something you can solve with more than the default amount of memory, but in 
 this case it did not seem useful to create an AUC curve with such fine-grained 
 resolution.
 I ended up just binning the scores so there were ~1000 unique values
 and then it was fine.
 {quote}
 and:
 {quote}
 Yes, if there are many distinct values, we need binning to compute the AUC 
 curve. Usually the scores are not evenly distributed, so we cannot simply 
 truncate the digits. Estimating the quantiles for binning is necessary, 
 similar to RangePartitioner:
 https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/Partitioner.scala#L104
 Limiting the number of bins is definitely useful.
 {quote}
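
A minimal sketch of the binning idea in the quotes above, assuming a hypothetical `scoreAndLabels` RDD of (score, label) pairs; this is not the implementation in the linked PR:

{code}
// Sketch only: coarsen scores to ~1000 distinct values before building the curve.
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
val binned = scoreAndLabels.map { case (score, label) => (math.rint(score * 1000) / 1000, label) }
val metrics = new BinaryClassificationMetrics(binned)
val auc = metrics.areaUnderROC()
{code}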






[jira] [Updated] (SPARK-4852) Hive query plan deserialization failure caused by shaded hive-exec jar file when generating golden answers

2014-12-15 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-4852:
--
Priority: Minor  (was: Major)

 Hive query plan deserialization failure caused by shaded hive-exec jar file 
 when generating golden answers
 --

 Key: SPARK-4852
 URL: https://issues.apache.org/jira/browse/SPARK-4852
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Cheng Lian
Priority: Minor

 When adding Hive 0.13.1 support for the Spark SQL Thrift server in PR 
 [2685|https://github.com/apache/spark/pull/2685], the Kryo 2.22 used by the 
 original hive-exec-0.13.1.jar was shaded by the Kryo 2.21 used by Spark SQL 
 because of dependency hell. Unfortunately, Kryo 2.21 has a known bug that may 
 cause Hive query plan deserialization failures. This bug was fixed in Kryo 2.22.
 Normally this issue doesn't affect Spark SQL, because we don't even generate 
 Hive query plans. But when running Hive test suites like 
 {{HiveCompatibilitySuite}}, golden answer files must be generated by Hive, 
 which triggers this issue. A workaround is to replace 
 {{hive-exec-0.13.1.jar}} under {{$HIVE_HOME/lib}} with Spark's 
 {{hive-exec-0.13.1a.jar}} and {{kryo-2.21.jar}} under 
 {{$SPARK_DEV_HOME/lib_managed/jars}}, and then add {{$HIVE_HOME/lib}} to 
 {{$HADOOP_CLASSPATH}}.
 Upgrading to a newer version of Kryo that is binary compatible with Kryo 
 2.22 (if there is one) may fix this issue.






[jira] [Commented] (SPARK-4852) Hive query plan deserialization failure caused by shaded hive-exec jar file when generating golden answers

2014-12-15 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14246799#comment-14246799
 ] 

Cheng Lian commented on SPARK-4852:
---

Lowered priority to Minor since this issue only affects Spark SQL developers.

 Hive query plan deserialization failure caused by shaded hive-exec jar file 
 when generating golden answers
 --

 Key: SPARK-4852
 URL: https://issues.apache.org/jira/browse/SPARK-4852
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Cheng Lian
Priority: Minor

 When adding Hive 0.13.1 support for the Spark SQL Thrift server in PR 
 [2685|https://github.com/apache/spark/pull/2685], the Kryo 2.22 used by the 
 original hive-exec-0.13.1.jar was shaded by the Kryo 2.21 used by Spark SQL 
 because of dependency hell. Unfortunately, Kryo 2.21 has a known bug that may 
 cause Hive query plan deserialization failures. This bug was fixed in Kryo 2.22.
 Normally this issue doesn't affect Spark SQL, because we don't even generate 
 Hive query plans. But when running Hive test suites like 
 {{HiveCompatibilitySuite}}, golden answer files must be generated by Hive, 
 which triggers this issue. A workaround is to replace 
 {{hive-exec-0.13.1.jar}} under {{$HIVE_HOME/lib}} with Spark's 
 {{hive-exec-0.13.1a.jar}} and {{kryo-2.21.jar}} under 
 {{$SPARK_DEV_HOME/lib_managed/jars}}, and then add {{$HIVE_HOME/lib}} to 
 {{$HADOOP_CLASSPATH}}.
 Upgrading to a newer version of Kryo that is binary compatible with Kryo 
 2.22 (if there is one) may fix this issue.






[jira] [Commented] (SPARK-3619) Upgrade to Mesos 0.21 to work around MESOS-1688

2014-12-15 Thread Jing Dong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14246831#comment-14246831
 ] 

Jing Dong commented on SPARK-3619:
--

[~tnachen] Will this be released with Spark 1.2.0? 
I also noticed the documentation on Spark saying Mesos compatibility is 0.18.1. 
Is this up-to-date?

 Upgrade to Mesos 0.21 to work around MESOS-1688
 ---

 Key: SPARK-3619
 URL: https://issues.apache.org/jira/browse/SPARK-3619
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Reporter: Matei Zaharia

 When Mesos 0.21 comes out, it will have a fix for 
 https://issues.apache.org/jira/browse/MESOS-1688, which affects Spark jobs.






[jira] [Commented] (SPARK-1442) Add Window function support

2014-12-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14246910#comment-14246910
 ] 

Apache Spark commented on SPARK-1442:
-

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/3703

 Add Window function support
 ---

 Key: SPARK-1442
 URL: https://issues.apache.org/jira/browse/SPARK-1442
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Chengxiang Li
 Attachments: Window Function.pdf


 Similar to Hive, add window function support for Catalyst.
 https://issues.apache.org/jira/browse/HIVE-4197
 https://issues.apache.org/jira/browse/HIVE-896






[jira] [Resolved] (SPARK-4740) Netty's network throughput is about 1/2 of NIO's in spark-perf sortByKey

2014-12-15 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-4740.

   Resolution: Fixed
Fix Version/s: 1.2.0

 Netty's network throughput is about 1/2 of NIO's in spark-perf sortByKey
 

 Key: SPARK-4740
 URL: https://issues.apache.org/jira/browse/SPARK-4740
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle, Spark Core
Affects Versions: 1.2.0
Reporter: Zhang, Liye
Assignee: Reynold Xin
 Fix For: 1.2.0

 Attachments: (rxin patch better executor)TestRunner  sort-by-key - 
 Thread dump for executor 3_files.zip, (rxin patch normal executor)TestRunner  
 sort-by-key - Thread dump for executor 0 _files.zip, Spark-perf Test Report 
 16 Cores per Executor.pdf, Spark-perf Test Report.pdf, TestRunner  
 sort-by-key - Thread dump for executor 1_files (Netty-48 Cores per node).zip, 
 TestRunner  sort-by-key - Thread dump for executor 1_files (Nio-48 cores per 
 node).zip, repartition test.7z, 
 rxin_patch-on_4_node_cluster_48CoresPerNode(Unbalance).7z


 When testing the current Spark master (1.3.0-snapshot) with spark-perf 
 (sort-by-key, aggregate-by-key, etc.), the Netty-based shuffle transfer service 
 takes much longer than the NIO-based one. The network throughput of Netty is only 
 about half that of NIO. 
 We tested in standalone mode; the data set used for the test is 20 billion 
 records, about 400GB in total. The spark-perf test runs on a 4-node cluster with 
 10G NICs, 48 CPU cores per node and 64GB of memory per executor. The number of 
 reduce tasks is set to 1000. 
 ---
 Reynold's update on Dec 15, 2014: The problem is that in NIO we have multiple 
 connections between two nodes, but in Netty we had only one. We introduced a 
 new config option, spark.shuffle.io.numConnectionsPerPeer, to allow users to 
 explicitly increase the number of connections between two nodes. SPARK-4853 is a 
 follow-up ticket to investigate having Spark set this automatically.






[jira] [Commented] (SPARK-4837) NettyBlockTransferService does not abide by spark.blockManager.port config option

2014-12-15 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14246996#comment-14246996
 ] 

Patrick Wendell commented on SPARK-4837:


Hey [~aash], because there is a workaround (you can simply switch back to the 
old IO mode) we probably won't block on it, but I can include it in the release 
notes as a known issue. We can also spin a bug-fix release to address this in a 
week or two. It is indeed an annoying issue and will be bad for usability if 
someone upgrades.

 NettyBlockTransferService does not abide by spark.blockManager.port config 
 option
 -

 Key: SPARK-4837
 URL: https://issues.apache.org/jira/browse/SPARK-4837
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Aaron Davidson
Assignee: Aaron Davidson
Priority: Blocker

 The NettyBlockTransferService always binds to a random port, and does not use 
 the spark.blockManager.port config as specified.






[jira] [Updated] (SPARK-4837) NettyBlockTransferService does not abide by spark.blockManager.port config option

2014-12-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-4837:
---
Target Version/s: 1.2.1

 NettyBlockTransferService does not abide by spark.blockManager.port config 
 option
 -

 Key: SPARK-4837
 URL: https://issues.apache.org/jira/browse/SPARK-4837
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Aaron Davidson
Assignee: Aaron Davidson
Priority: Blocker

 The NettyBlockTransferService always binds to a random port, and does not use 
 the spark.blockManager.port config as specified.






[jira] [Commented] (SPARK-4837) NettyBlockTransferService does not abide by spark.blockManager.port config option

2014-12-15 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247030#comment-14247030
 ] 

Andrew Ash commented on SPARK-4837:
---

Ok that's fair -- a release note and targeting 1.2.1 sounds good.

Draft language for that release note could be:

- Spark 1.2.0 changes the default block transfer service to 
NettyBlockTransferService, a higher-performance block transfer service than the 
old XYZBlockTransferService. The new transfer service does not yet respect 
`spark.blockManager.port`, so deployments needing full control of Spark's 
network ports in 1.2.0 should temporarily set `spark.abc=xyz` and watch 
SPARK-4837.

 NettyBlockTransferService does not abide by spark.blockManager.port config 
 option
 -

 Key: SPARK-4837
 URL: https://issues.apache.org/jira/browse/SPARK-4837
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Aaron Davidson
Assignee: Aaron Davidson
Priority: Blocker

 The NettyBlockTransferService always binds to a random port, and does not use 
 the spark.blockManager.port config as specified.






[jira] [Commented] (SPARK-4826) Possible flaky tests in WriteAheadLogBackedBlockRDDSuite: java.lang.IllegalStateException: File exists and there is no append support!

2014-12-15 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247043#comment-14247043
 ] 

Patrick Wendell commented on SPARK-4826:


I pushed a hotfix disabling these tests, but let's re-enable them once things 
are working.

 Possible flaky tests in WriteAheadLogBackedBlockRDDSuite: 
 java.lang.IllegalStateException: File exists and there is no append support!
 

 Key: SPARK-4826
 URL: https://issues.apache.org/jira/browse/SPARK-4826
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.2.0, 1.3.0
Reporter: Josh Rosen
Assignee: Tathagata Das
  Labels: flaky-test

 I saw a recent master Maven build failure in WriteAheadLogBackedBlockRDDSuite 
 where four tests failed with the same exception.
 [Link to test result (this will eventually 
 break)|https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/1156/].
   In case that link breaks:
 The failed tests:
 {code}
 org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data 
 available only in block manager, not in write ahead log
 org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data 
 available only in write ahead log, not in block manager
 org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data 
 available only in write ahead log, and test storing in block manager
 org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data 
 with partially available in block manager, and rest in write ahead log
 {code}
 The error messages are all (essentially) the same:
 {code}
  java.lang.IllegalStateException: File exists and there is no append 
 support!
   at 
 org.apache.spark.streaming.util.HdfsUtils$.getOutputStream(HdfsUtils.scala:33)
   at 
 org.apache.spark.streaming.util.WriteAheadLogWriter.org$apache$spark$streaming$util$WriteAheadLogWriter$$stream$lzycompute(WriteAheadLogWriter.scala:34)
   at 
 org.apache.spark.streaming.util.WriteAheadLogWriter.org$apache$spark$streaming$util$WriteAheadLogWriter$$stream(WriteAheadLogWriter.scala:34)
   at 
 org.apache.spark.streaming.util.WriteAheadLogWriter.init(WriteAheadLogWriter.scala:42)
   at 
 org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.writeLogSegments(WriteAheadLogBackedBlockRDDSuite.scala:140)
   at 
 org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.org$apache$spark$streaming$rdd$WriteAheadLogBackedBlockRDDSuite$$testRDD(WriteAheadLogBackedBlockRDDSuite.scala:95)
   at 
 org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite$$anonfun$4.apply$mcV$sp(WriteAheadLogBackedBlockRDDSuite.scala:67)
   at 
 org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite$$anonfun$4.apply(WriteAheadLogBackedBlockRDDSuite.scala:67)
   at 
 org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite$$anonfun$4.apply(WriteAheadLogBackedBlockRDDSuite.scala:67)
   at 
 org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
   at org.scalatest.Transformer.apply(Transformer.scala:22)
   at org.scalatest.Transformer.apply(Transformer.scala:20)
   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
   at 
 org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
   at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
   at 
 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
   at 
 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
   at 
 org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
   at 

[jira] [Commented] (SPARK-4810) Failed to run collect

2014-12-15 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247081#comment-14247081
 ] 

Patrick Wendell commented on SPARK-4810:


Actually, can I suggest we move this to the Spark users list? We use this JIRA 
primarily for tracking identified bugs. For information on how to join the user 
list, see this page:

http://spark.apache.org/community.html

 Failed to run collect
 -

 Key: SPARK-4810
 URL: https://issues.apache.org/jira/browse/SPARK-4810
 Project: Spark
  Issue Type: Question
 Environment: Spark 1.1.1 prebuilt for hadoop 2.4.0
Reporter: newjunwei

 My application failed as shown below; I want to know the possible reason. Could 
 insufficient memory cause this?
 Environment: Spark 1.1.1 prebuilt for Hadoop 2.4.0, standalone deploy mode.
 There is no problem when running with a local master for testing, or when 
 processing another, smaller data set.
 My real data is large, about 200 million key-value records; the smaller data set 
 is about one tenth of that. I get the result via collect, and the result itself 
 is very large too. I now suspect the problem is caused by the many failed tasks 
 when collecting a large result. Is that correct?
 2014-12-09 21:51:47,830 WARN 
 org.apache.spark.Logging$class.logWarning(Logging.scala:71) - Lost task 60.1 
 in stage 1.1 (TID 566, server-21): java.io.IOException: 
 org.apache.spark.SparkException: Failed to get broadcast_4_piece0 of 
 broadcast_4
 org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:930)
 
 org.apache.spark.broadcast.TorrentBroadcast.readObject(TorrentBroadcast.scala:155)
 sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 java.lang.reflect.Method.invoke(Method.java:597)
 java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:969)
 java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1871)
 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1775)
 java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1327)
 
 java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1969)
 java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1775)
 java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1327)
 java.io.ObjectInputStream.readObject(ObjectInputStream.java:349)
 
 org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
 
 org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:160)
 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
 java.lang.Thread.run(Thread.java:662)
 2014-12-09 21:51:49,460 INFO 
 org.apache.spark.Logging$class.logInfo(Logging.scala:59) - Starting task 60.2 
 in stage 1.1 (TID 603, server-11, PROCESS_LOCAL, 1295 bytes)
 2014-12-09 21:51:49,461 INFO 
 org.apache.spark.Logging$class.logInfo(Logging.scala:59) - Lost task 9.3 in 
 stage 1.1 (TID 579) on executor server-11: java.io.IOException 
 (org.apache.spark.SparkException: Failed to get broadcast_4_piece0 of 
 broadcast_4) [duplicate 1]
 2014-12-09 21:51:49,487 ERROR 
 org.apache.spark.Logging$class.logError(Logging.scala:75) - Task 9 in stage 
 1.1 failed 4 times; aborting job
 2014-12-09 21:51:49,494 INFO 
 org.apache.spark.Logging$class.logInfo(Logging.scala:59) - Cancelling stage 1
 2014-12-09 21:51:49,498 INFO 
 org.apache.spark.Logging$class.logInfo(Logging.scala:59) - Stage 1 was 
 cancelled
 2014-12-09 21:51:49,511 INFO 
 org.apache.spark.Logging$class.logInfo(Logging.scala:59) - Failed to run 
 collect at StatVideoService.scala:62






[jira] [Commented] (SPARK-4826) Possible flaky tests in WriteAheadLogBackedBlockRDDSuite: java.lang.IllegalStateException: File exists and there is no append support!

2014-12-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247082#comment-14247082
 ] 

Apache Spark commented on SPARK-4826:
-

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/3704

 Possible flaky tests in WriteAheadLogBackedBlockRDDSuite: 
 java.lang.IllegalStateException: File exists and there is no append support!
 

 Key: SPARK-4826
 URL: https://issues.apache.org/jira/browse/SPARK-4826
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.2.0, 1.3.0
Reporter: Josh Rosen
Assignee: Tathagata Das
  Labels: flaky-test

 I saw a recent master Maven build failure in WriteAheadLogBackedBlockRDDSuite 
 where four tests failed with the same exception.
 [Link to test result (this will eventually 
 break)|https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/1156/].
   In case that link breaks:
 The failed tests:
 {code}
 org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data 
 available only in block manager, not in write ahead log
 org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data 
 available only in write ahead log, not in block manager
 org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data 
 available only in write ahead log, and test storing in block manager
 org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data 
 with partially available in block manager, and rest in write ahead log
 {code}
 The error messages are all (essentially) the same:
 {code}
  java.lang.IllegalStateException: File exists and there is no append 
 support!
   at 
 org.apache.spark.streaming.util.HdfsUtils$.getOutputStream(HdfsUtils.scala:33)
   at 
 org.apache.spark.streaming.util.WriteAheadLogWriter.org$apache$spark$streaming$util$WriteAheadLogWriter$$stream$lzycompute(WriteAheadLogWriter.scala:34)
   at 
 org.apache.spark.streaming.util.WriteAheadLogWriter.org$apache$spark$streaming$util$WriteAheadLogWriter$$stream(WriteAheadLogWriter.scala:34)
   at 
 org.apache.spark.streaming.util.WriteAheadLogWriter.init(WriteAheadLogWriter.scala:42)
   at 
 org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.writeLogSegments(WriteAheadLogBackedBlockRDDSuite.scala:140)
   at 
 org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.org$apache$spark$streaming$rdd$WriteAheadLogBackedBlockRDDSuite$$testRDD(WriteAheadLogBackedBlockRDDSuite.scala:95)
   at 
 org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite$$anonfun$4.apply$mcV$sp(WriteAheadLogBackedBlockRDDSuite.scala:67)
   at 
 org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite$$anonfun$4.apply(WriteAheadLogBackedBlockRDDSuite.scala:67)
   at 
 org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite$$anonfun$4.apply(WriteAheadLogBackedBlockRDDSuite.scala:67)
   at 
 org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
   at org.scalatest.Transformer.apply(Transformer.scala:22)
   at org.scalatest.Transformer.apply(Transformer.scala:20)
   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
   at 
 org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
   at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
   at 
 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
   at 
 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
   at 
 org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
   at 

[jira] [Updated] (SPARK-4841) Batch serializer bug in PySpark's RDD.zip

2014-12-15 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-4841:
-
Assignee: Davies Liu

 Batch serializer bug in PySpark's RDD.zip
 -

 Key: SPARK-4841
 URL: https://issues.apache.org/jira/browse/SPARK-4841
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.2.0
Reporter: Xiangrui Meng
Assignee: Davies Liu

 {code}
 t = sc.textFile("README.md")
 t.zip(t).count()
 {code}
 {code}
 Py4JJavaError Traceback (most recent call last)
 ipython-input-6-60fdeb8339fd in module()
  1 readme.zip(readme).count()
 /Users/meng/src/spark/python/pyspark/rdd.pyc in count(self)
 817 3
 818 
 -- 819 return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
 820
 821 def stats(self):
 /Users/meng/src/spark/python/pyspark/rdd.pyc in sum(self)
 808 6.0
 809 
 -- 810 return self.mapPartitions(lambda x: 
 [sum(x)]).reduce(operator.add)
 811
 812 def count(self):
 /Users/meng/src/spark/python/pyspark/rdd.pyc in reduce(self, f)
 713 yield reduce(f, iterator, initial)
 714
 -- 715 vals = self.mapPartitions(func).collect()
 716 if vals:
 717 return reduce(f, vals)
 /Users/meng/src/spark/python/pyspark/rdd.pyc in collect(self)
 674 
 675 with SCCallSiteSync(self.context) as css:
 -- 676 bytesInJava = self._jrdd.collect().iterator()
 677 return list(self._collect_iterator_through_file(bytesInJava))
 678
 /Users/meng/src/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in 
 __call__(self, *args)
 536 answer = self.gateway_client.send_command(command)
 537 return_value = get_return_value(answer, self.gateway_client,
 -- 538 self.target_id, self.name)
 539
 540 for temp_arg in temp_args:
 /Users/meng/src/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in 
 get_return_value(answer, gateway_client, target_id, name)
 298 raise Py4JJavaError(
 299 'An error occurred while calling {0}{1}{2}.\n'.
 -- 300 format(target_id, '.', name), value)
 301 else:
 302 raise Py4JError(
 Py4JJavaError: An error occurred while calling o69.collect.
 : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 
 (TID 2, localhost): org.apache.spark.api.python.PythonException: Traceback 
 (most recent call last):
   File /Users/meng/src/spark/python/pyspark/worker.py, line 107, in main
 process()
   File /Users/meng/src/spark/python/pyspark/worker.py, line 98, in process
 serializer.dump_stream(func(split_index, iterator), outfile)
   File /Users/meng/src/spark/python/pyspark/serializers.py, line 198, in 
 dump_stream
 self.serializer.dump_stream(self._batched(iterator), stream)
   File /Users/meng/src/spark/python/pyspark/serializers.py, line 81, in 
 dump_stream
 raise NotImplementedError
 NotImplementedError
   at 
 org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:137)
   at 
 org.apache.spark.api.python.PythonRDD$$anon$1.init(PythonRDD.scala:174)
   at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:96)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
   at 
 org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
   at 
 org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:242)
   at 
 org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
   at 
 org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1460)
   at 
 org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:203)
 Driver stacktrace:
   at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at 

[jira] [Commented] (SPARK-2121) Not fully cached when there is enough memory in ALS

2014-12-15 Thread sam (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247148#comment-14247148
 ] 

sam commented on SPARK-2121:


I seem to be hitting this problem too. I have a job that should be caching a 
reasonably small data set (persisted with an _AND_DISK storage level just in 
case, but it really is small). Unfortunately my executors get killed and the 
job re-runs around half of its work to recover the lost data. It is a very 
expensive job: although the data set is small, it iterates over a large 
broadcast variable many times, so I really don't want it recomputed (the 
difference between a 6-hour job and a 9-hour job).

Is there currently any workaround, such as increasing one of the many 
configuration settings?

 Not fully cached when there is enough memory in ALS
 ---

 Key: SPARK-2121
 URL: https://issues.apache.org/jira/browse/SPARK-2121
 Project: Spark
  Issue Type: Bug
  Components: Block Manager, MLlib, Spark Core
Affects Versions: 1.0.0
Reporter: Shuo Xiang

 While factorizing a large matrix with the latest Alternating Least Squares 
 (ALS) in MLlib, the Spark UI shows that Spark fails to cache all the 
 partitions of some RDDs even though memory is sufficient. See [this 
 post](http://apache-spark-user-list.1001560.n3.nabble.com/Not-fully-cached-when-there-is-enough-memory-tt7429.html)
  for screenshots. This may cause subsequent job failures while executing 
 `userOut.count()` or `productsOut.count()`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2121) Not fully cached when there is enough memory in ALS

2014-12-15 Thread sam (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247167#comment-14247167
 ] 

sam commented on SPARK-2121:


I've tried increasing spark.yarn.executor.memoryOverhead but I get:

14/12/15 20:11:20 WARN cluster.YarnClientClusterScheduler: Initial job has not 
accepted any resources; check your cluster UI to ensure that workers are 
registered and have sufficient memory

I've checked the UI and it doesn't really help me determine whether I have 
sufficient memory.

I'm thinking the only workaround is to write the intermediate results out to 
disk, since the caching functionality is not behaving.
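For concreteness, a minimal sketch of that kind of workaround (spark-shell 
style; the RDD below is just a stand-in for the expensive computation, and the 
output path is an illustrative assumption):
{code}
// Materialize the expensive result once, then have later stages read the
// on-disk copy instead of relying on caching.
val factors = sc.parallelize(1 to 100).map(i => (i, Array(i.toDouble)))

factors.saveAsObjectFile("hdfs:///tmp/factors")
val reloaded = sc.objectFile[(Int, Array[Double])]("hdfs:///tmp/factors")

reloaded.count()   // downstream work re-reads from disk if executors are lost
{code}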

 Not fully cached when there is enough memory in ALS
 ---

 Key: SPARK-2121
 URL: https://issues.apache.org/jira/browse/SPARK-2121
 Project: Spark
  Issue Type: Bug
  Components: Block Manager, MLlib, Spark Core
Affects Versions: 1.0.0
Reporter: Shuo Xiang

 While factorizing a large matrix with the latest Alternating Least Squares 
 (ALS) in MLlib, the Spark UI shows that Spark fails to cache all the 
 partitions of some RDDs even though memory is sufficient. See [this 
 post](http://apache-spark-user-list.1001560.n3.nabble.com/Not-fully-cached-when-there-is-enough-memory-tt7429.html)
  for screenshots. This may cause subsequent job failures while executing 
 `userOut.count()` or `productsOut.count()`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4849) Pass partitioning information (distribute by) to In-memory caching

2014-12-15 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247184#comment-14247184
 ] 

Michael Armbrust commented on SPARK-4849:
-

The trick here will be to make sure that the outputPartitioning is correctly 
reported by the InMemoryColumnarTableScan.
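To make the hint concrete, here is a hedged sketch of the kind of change this 
suggests; it is not actual Spark code, and the field carrying the child plan's 
partitioning through the cached relation is an assumption:
{code}
// Illustrative only: the in-memory relation would remember the partitioning of
// the plan that produced it, and the scan would report it instead of the default
// UnknownPartitioning, letting the planner skip a redundant Exchange.
private[sql] case class InMemoryRelation(
    /* ...existing fields... */
    childOutputPartitioning: Partitioning)   // assumed new field

private[sql] case class InMemoryColumnarTableScan(
    attributes: Seq[Attribute],
    relation: InMemoryRelation) extends LeafNode {

  override def output: Seq[Attribute] = attributes

  // Report the original partitioning so a join that is already partitioned on
  // the distribute-by keys does not trigger another shuffle.
  override def outputPartitioning: Partitioning = relation.childOutputPartitioning
}
{code}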

 Pass partitioning information (distribute by) to In-memory caching
 --

 Key: SPARK-4849
 URL: https://issues.apache.org/jira/browse/SPARK-4849
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.2.0
Reporter: Nitin Goyal
Priority: Minor

 HQL distribute by column_name partitions data based on the specified column 
 values. We can pass this information to in-memory caching for further 
 performance improvements, e.g. in joins, an extra partitioning step can be 
 saved based on this information.
 Refer: 
 http://apache-spark-user-list.1001560.n3.nabble.com/SchemaRDD-partition-on-specific-column-values-td20350.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4605) Proposed Contribution: Spark Kernel to enable interactive Spark applications

2014-12-15 Thread Chip Senkbeil (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247235#comment-14247235
 ] 

Chip Senkbeil commented on SPARK-4605:
--

[~rdhyee], the short answer is no. The Spark Kernel is a pure Scala kernel that 
can be connected to from IPython and executes Scala, whereas PySpark is a way to 
connect to a Spark cluster from a Python environment.

 Proposed Contribution: Spark Kernel to enable interactive Spark applications
 

 Key: SPARK-4605
 URL: https://issues.apache.org/jira/browse/SPARK-4605
 Project: Spark
  Issue Type: New Feature
Reporter: Chip Senkbeil
 Attachments: Kernel Architecture Widescreen.pdf, Kernel 
 Architecture.pdf


 Project available on Github: https://github.com/ibm-et/spark-kernel
 
 This architecture describes the running kernel code that was demonstrated at 
 StrataConf in Barcelona, Spain.
 
 It enables applications to interact with a Spark cluster using Scala in several 
 ways:
 * Defining and running core Spark tasks
 * Collecting results from a cluster without needing to write to an external 
 data store
 ** Ability to stream results using a well-defined protocol
 * Arbitrary Scala code definition and execution (without submitting 
 heavy-weight jars)
 Applications can be hosted and managed separately from the Spark cluster, using 
 the kernel as a proxy to communicate requests.
 The Spark Kernel implements the server side of the IPython kernel protocol, the 
 rising “de-facto” protocol for language (Python, Haskell, etc.) execution, and 
 so inherits a suite of industry-adopted clients such as the IPython Notebook.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4510) Add k-medoids Partitioning Around Medoids (PAM) algorithm

2014-12-15 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247269#comment-14247269
 ] 

Xiangrui Meng commented on SPARK-4510:
--

The N^2 factor was what I was worried about. MLlib is supposed to deal with 
distributed datasets, which usually means very large N. Given the N^2 factor, 
the k-medoids implementation won't scale even to one million examples.

 Add k-medoids Partitioning Around Medoids (PAM) algorithm
 -

 Key: SPARK-4510
 URL: https://issues.apache.org/jira/browse/SPARK-4510
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Fan Jiang
Assignee: Fan Jiang
  Labels: features
   Original Estimate: 0h
  Remaining Estimate: 0h

 PAM (k-medoids) is more robust to noise and outliers than k-means because it 
 minimizes a sum of pairwise dissimilarities instead of a sum of squared 
 Euclidean distances. A medoid is the object of a cluster whose average 
 dissimilarity to all other objects in the cluster is minimal, i.e. it is the 
 most centrally located point in the cluster.
 The most common realisation of k-medoid clustering is the Partitioning Around 
 Medoids (PAM) algorithm, which is as follows:
 1. Initialize: randomly select (without replacement) k of the n data points as 
 the medoids.
 2. Associate each data point with the closest medoid (closest is defined by any 
 valid distance metric, most commonly Euclidean, Manhattan or Minkowski 
 distance).
 3. For each medoid m and each non-medoid data point o, swap m and o and compute 
 the total cost of the configuration.
 4. Select the configuration with the lowest cost.
 5. Repeat steps 2 to 4 until the medoids no longer change.
 (A minimal local sketch of this loop is included below.)
 The new feature for MLlib will contain 5 new files:
 /main/scala/org/apache/spark/mllib/clustering/PAM.scala
 /main/scala/org/apache/spark/mllib/clustering/PAMModel.scala
 /main/scala/org/apache/spark/mllib/clustering/LocalPAM.scala
 /test/scala/org/apache/spark/mllib/clustering/PAMSuite.scala
 /main/scala/org/apache/spark/examples/mllib/KMedoids.scala
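 A hedged, purely local sketch of the algorithm above (illustrative only; it is 
 not the content of the proposed files, and a distributed version would have to 
 avoid the all-pairs swap evaluation that the N^2 comment above refers to):
 {code}
 // Minimal local PAM sketch (illustrative only, not the proposed MLlib code).
 // Uses Euclidean distance and an in-memory Seq of points.
 object LocalPAMSketch {
   type Point = Array[Double]

   def dist(a: Point, b: Point): Double =
     math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

   // Total cost = sum of distances from each point to its closest medoid.
   def cost(points: Seq[Point], medoids: Seq[Point]): Double =
     points.map(p => medoids.map(m => dist(p, m)).min).sum

   def pam(points: Seq[Point], k: Int, maxIter: Int = 20): Seq[Point] = {
     var medoids: Seq[Point] = scala.util.Random.shuffle(points).take(k)  // step 1
     var changed = true
     var iter = 0
     while (changed && iter < maxIter) {                                  // step 5
       changed = false
       iter += 1
       val currentCost = cost(points, medoids)
       // Steps 3-4: evaluate every (medoid, non-medoid) swap, keep the cheapest.
       val candidates = for {
         m <- medoids
         o <- points if !medoids.exists(_ eq o)
       } yield medoids.filterNot(_ eq m) :+ o
       if (candidates.nonEmpty) {
         val best = candidates.minBy(cost(points, _))
         if (cost(points, best) < currentCost) {
           medoids = best
           changed = true
         }
       }
     }
     medoids
   }
 }
 // Usage: val medoids = LocalPAMSketch.pam(data, k = 3)
 {code}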



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4494) IDFModel.transform() add support for single vector

2014-12-15 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-4494.
--
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 3603
[https://github.com/apache/spark/pull/3603]

 IDFModel.transform() add support for single vector
 --

 Key: SPARK-4494
 URL: https://issues.apache.org/jira/browse/SPARK-4494
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.1.1, 1.2.0
Reporter: Jean-Philippe Quemener
Priority: Minor
 Fix For: 1.3.0


 For now, when using the tf-idf implementation of MLlib, you have no other 
 possibility to map your data back onto e.g. labels or ids than a hackish 
 way with zipping: {quote} 1. Persist input RDD. 2. Transform it to just 
 vectors and apply IDFModel. 3. Zip with original RDD. 4. Transform label and 
 new vector to LabeledPoint.{quote}
 Source: [http://stackoverflow.com/questions/26897908/spark-mllib-tfidf-implementation-for-logisticregression]
 I think, as in production a lot of users want to map their data back to some 
 identifier, it would be a good improvement to allow using a single vector with 
 IDFModel.transform()
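 For reference, a hedged sketch of the zip workaround from the quote above 
 (spark-shell style; the toy data and variable names are illustrative 
 assumptions, not from the linked question):
 {code}
 // Hedged sketch of the current zip-based workaround (steps 1-4 above).
 import org.apache.spark.mllib.feature.{HashingTF, IDF}
 import org.apache.spark.mllib.regression.LabeledPoint

 // Toy stand-in data: (label, tokens) pairs.
 val labeledDocs = sc.parallelize(Seq(
   (1.0, Seq("spark", "mllib")),
   (0.0, Seq("hadoop", "mapreduce"))))

 labeledDocs.cache()                                            // 1. persist the input

 val tf    = new HashingTF().transform(labeledDocs.map(_._2))   // 2. vectors only,
 val tfidf = new IDF().fit(tf).transform(tf)                    //    then apply IDFModel

 val training = labeledDocs.zip(tfidf).map {                    // 3. zip with the original
   case ((label, _), vector) => LabeledPoint(label, vector)     // 4. rebuild LabeledPoints
 }
 {code}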



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4494) IDFModel.transform() add support for single vector

2014-12-15 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-4494:
-
Assignee: Yu Ishikawa

 IDFModel.transform() add support for single vector
 --

 Key: SPARK-4494
 URL: https://issues.apache.org/jira/browse/SPARK-4494
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.1.1, 1.2.0
Reporter: Jean-Philippe Quemener
Assignee: Yu Ishikawa
Priority: Minor
 Fix For: 1.3.0


 For now, when using the tf-idf implementation of MLlib, you have no other 
 possibility to map your data back onto e.g. labels or ids than a hackish 
 way with zipping: {quote} 1. Persist input RDD. 2. Transform it to just 
 vectors and apply IDFModel. 3. Zip with original RDD. 4. Transform label and 
 new vector to LabeledPoint.{quote}
 Source: [http://stackoverflow.com/questions/26897908/spark-mllib-tfidf-implementation-for-logisticregression]
 I think, as in production a lot of users want to map their data back to some 
 identifier, it would be a good improvement to allow using a single vector with 
 IDFModel.transform()



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4841) Batch serializer bug in PySpark's RDD.zip

2014-12-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247376#comment-14247376
 ] 

Apache Spark commented on SPARK-4841:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/3706

 Batch serializer bug in PySpark's RDD.zip
 -

 Key: SPARK-4841
 URL: https://issues.apache.org/jira/browse/SPARK-4841
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.2.0
Reporter: Xiangrui Meng
Assignee: Davies Liu

 {code}
 t = sc.textFile("README.md")
 t.zip(t).count()
 {code}
 {code}
 Py4JJavaError Traceback (most recent call last)
 ipython-input-6-60fdeb8339fd in module()
  1 readme.zip(readme).count()
 /Users/meng/src/spark/python/pyspark/rdd.pyc in count(self)
 817 3
 818 
 -- 819 return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
 820
 821 def stats(self):
 /Users/meng/src/spark/python/pyspark/rdd.pyc in sum(self)
 808 6.0
 809 
 -- 810 return self.mapPartitions(lambda x: 
 [sum(x)]).reduce(operator.add)
 811
 812 def count(self):
 /Users/meng/src/spark/python/pyspark/rdd.pyc in reduce(self, f)
 713 yield reduce(f, iterator, initial)
 714
 -- 715 vals = self.mapPartitions(func).collect()
 716 if vals:
 717 return reduce(f, vals)
 /Users/meng/src/spark/python/pyspark/rdd.pyc in collect(self)
 674 
 675 with SCCallSiteSync(self.context) as css:
 -- 676 bytesInJava = self._jrdd.collect().iterator()
 677 return list(self._collect_iterator_through_file(bytesInJava))
 678
 /Users/meng/src/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in 
 __call__(self, *args)
 536 answer = self.gateway_client.send_command(command)
 537 return_value = get_return_value(answer, self.gateway_client,
 -- 538 self.target_id, self.name)
 539
 540 for temp_arg in temp_args:
 /Users/meng/src/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in 
 get_return_value(answer, gateway_client, target_id, name)
 298 raise Py4JJavaError(
 299 'An error occurred while calling {0}{1}{2}.\n'.
 -- 300 format(target_id, '.', name), value)
 301 else:
 302 raise Py4JError(
 Py4JJavaError: An error occurred while calling o69.collect.
 : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 
 (TID 2, localhost): org.apache.spark.api.python.PythonException: Traceback 
 (most recent call last):
   File /Users/meng/src/spark/python/pyspark/worker.py, line 107, in main
 process()
   File /Users/meng/src/spark/python/pyspark/worker.py, line 98, in process
 serializer.dump_stream(func(split_index, iterator), outfile)
   File /Users/meng/src/spark/python/pyspark/serializers.py, line 198, in 
 dump_stream
 self.serializer.dump_stream(self._batched(iterator), stream)
   File /Users/meng/src/spark/python/pyspark/serializers.py, line 81, in 
 dump_stream
 raise NotImplementedError
 NotImplementedError
   at 
 org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:137)
   at 
 org.apache.spark.api.python.PythonRDD$$anon$1.init(PythonRDD.scala:174)
   at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:96)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
   at 
 org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
   at 
 org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:242)
   at 
 org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
   at 
 org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1460)
   at 
 org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:203)
 Driver stacktrace:
   at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202)

[jira] [Updated] (SPARK-1037) the name of findTaskFromList findTask in TaskSetManager.scala is confusing

2014-12-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-1037:
--
Description: 
the name of these two functions is confusing 

though in the comments the author said that the method does dequeue tasks 
from the list but from the name, it is not explicitly indicating that the 
method will mutate the parameter

in 

{code}
private def findTaskFromList(list: ArrayBuffer[Int]): Option[Int] = {
while (!list.isEmpty) {
  val index = list.last
  list.trimEnd(1)
  if (copiesRunning(index) == 0 && !successful(index)) {
return Some(index)
  }
}
None
  }
{code}

  was:
the name of these two functions is confusing 

though in the comments the author said that the method does dequeue tasks 
from the list but from the name, it is not explicitly indicating that the 
method will mutate the parameter

in 

private def findTaskFromList(list: ArrayBuffer[Int]): Option[Int] = {
while (!list.isEmpty) {
  val index = list.last
  list.trimEnd(1)
  if (copiesRunning(index) == 0 && !successful(index)) {
return Some(index)
  }
}
None
  }



 the name of findTaskFromList  findTask in TaskSetManager.scala is confusing
 

 Key: SPARK-1037
 URL: https://issues.apache.org/jira/browse/SPARK-1037
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 0.9.1, 1.0.0
Reporter: Nan Zhu
Priority: Minor
  Labels: starter

 the name of these two functions is confusing 
 though in the comments the author said that the method does dequeue tasks 
 from the list but from the name, it is not explicitly indicating that the 
 method will mutate the parameter
 in 
 {code}
 private def findTaskFromList(list: ArrayBuffer[Int]): Option[Int] = {
 while (!list.isEmpty) {
   val index = list.last
   list.trimEnd(1)
   if (copiesRunning(index) == 0 && !successful(index)) {
 return Some(index)
   }
 }
 None
   }
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-1037) the name of findTaskFromList findTask in TaskSetManager.scala is confusing

2014-12-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-1037.
---
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 3665
[https://github.com/apache/spark/pull/3665]

 the name of findTaskFromList  findTask in TaskSetManager.scala is confusing
 

 Key: SPARK-1037
 URL: https://issues.apache.org/jira/browse/SPARK-1037
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 0.9.1, 1.0.0
Reporter: Nan Zhu
Priority: Minor
  Labels: starter
 Fix For: 1.3.0


 the name of these two functions is confusing 
 though in the comments the author said that the method does dequeue tasks 
 from the list but from the name, it is not explicitly indicating that the 
 method will mutate the parameter
 in 
 {code}
 private def findTaskFromList(list: ArrayBuffer[Int]): Option[Int] = {
 while (!list.isEmpty) {
   val index = list.last
   list.trimEnd(1)
   if (copiesRunning(index) == 0 && !successful(index)) {
 return Some(index)
   }
 }
 None
   }
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1037) the name of findTaskFromList findTask in TaskSetManager.scala is confusing

2014-12-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-1037:
--
Assignee: Ilya Ganelin  (was: Josh Rosen)

 the name of findTaskFromList  findTask in TaskSetManager.scala is confusing
 

 Key: SPARK-1037
 URL: https://issues.apache.org/jira/browse/SPARK-1037
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 0.9.1, 1.0.0
Reporter: Nan Zhu
Assignee: Ilya Ganelin
Priority: Minor
  Labels: starter
 Fix For: 1.3.0


 the name of these two functions is confusing 
 though in the comments the author said that the method does dequeue tasks 
 from the list but from the name, it is not explicitly indicating that the 
 method will mutate the parameter
 in 
 {code}
 private def findTaskFromList(list: ArrayBuffer[Int]): Option[Int] = {
 while (!list.isEmpty) {
   val index = list.last
   list.trimEnd(1)
   if (copiesRunning(index) == 0 && !successful(index)) {
 return Some(index)
   }
 }
 None
   }
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4841) Batch serializer bug in PySpark's RDD.zip

2014-12-15 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-4841:
-
Priority: Blocker  (was: Major)

 Batch serializer bug in PySpark's RDD.zip
 -

 Key: SPARK-4841
 URL: https://issues.apache.org/jira/browse/SPARK-4841
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.2.0
Reporter: Xiangrui Meng
Assignee: Davies Liu
Priority: Blocker

 {code}
 t = sc.textFile("README.md")
 t.zip(t).count()
 {code}
 {code}
 Py4JJavaError Traceback (most recent call last)
 ipython-input-6-60fdeb8339fd in module()
  1 readme.zip(readme).count()
 /Users/meng/src/spark/python/pyspark/rdd.pyc in count(self)
 817 3
 818 
 -- 819 return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
 820
 821 def stats(self):
 /Users/meng/src/spark/python/pyspark/rdd.pyc in sum(self)
 808 6.0
 809 
 -- 810 return self.mapPartitions(lambda x: 
 [sum(x)]).reduce(operator.add)
 811
 812 def count(self):
 /Users/meng/src/spark/python/pyspark/rdd.pyc in reduce(self, f)
 713 yield reduce(f, iterator, initial)
 714
 -- 715 vals = self.mapPartitions(func).collect()
 716 if vals:
 717 return reduce(f, vals)
 /Users/meng/src/spark/python/pyspark/rdd.pyc in collect(self)
 674 
 675 with SCCallSiteSync(self.context) as css:
 -- 676 bytesInJava = self._jrdd.collect().iterator()
 677 return list(self._collect_iterator_through_file(bytesInJava))
 678
 /Users/meng/src/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in 
 __call__(self, *args)
 536 answer = self.gateway_client.send_command(command)
 537 return_value = get_return_value(answer, self.gateway_client,
 -- 538 self.target_id, self.name)
 539
 540 for temp_arg in temp_args:
 /Users/meng/src/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in 
 get_return_value(answer, gateway_client, target_id, name)
 298 raise Py4JJavaError(
 299 'An error occurred while calling {0}{1}{2}.\n'.
 -- 300 format(target_id, '.', name), value)
 301 else:
 302 raise Py4JError(
 Py4JJavaError: An error occurred while calling o69.collect.
 : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 
 (TID 2, localhost): org.apache.spark.api.python.PythonException: Traceback 
 (most recent call last):
   File /Users/meng/src/spark/python/pyspark/worker.py, line 107, in main
 process()
   File /Users/meng/src/spark/python/pyspark/worker.py, line 98, in process
 serializer.dump_stream(func(split_index, iterator), outfile)
   File /Users/meng/src/spark/python/pyspark/serializers.py, line 198, in 
 dump_stream
 self.serializer.dump_stream(self._batched(iterator), stream)
   File /Users/meng/src/spark/python/pyspark/serializers.py, line 81, in 
 dump_stream
 raise NotImplementedError
 NotImplementedError
   at 
 org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:137)
   at 
 org.apache.spark.api.python.PythonRDD$$anon$1.init(PythonRDD.scala:174)
   at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:96)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
   at 
 org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
   at 
 org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:242)
   at 
 org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
   at 
 org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1460)
   at 
 org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:203)
 Driver stacktrace:
   at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 

[jira] [Commented] (SPARK-1216) Add a OneHotEncoder for handling categorical features

2014-12-15 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247414#comment-14247414
 ] 

Joseph K. Bradley commented on SPARK-1216:
--

(Addressing old comments I just saw now...)
[~jaggi] I'd recommend keeping this separate from 
[https://issues.apache.org/jira/browse/SPARK-1303] since it is a different kind 
of transformation.
[~srowen] This is a bit different from [SPARK-4081]; see the comment in my PR: 
[https://github.com/apache/spark/pull/3000#issuecomment-62630207].  I'd 
recommend keeping them separate.

I don't immediately see a good way to organize these, but it could be worth 
discussing.

 Add a OneHotEncoder for handling categorical features
 -

 Key: SPARK-1216
 URL: https://issues.apache.org/jira/browse/SPARK-1216
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 0.9.0
Reporter: Sandy Pérez González
Assignee: Sandy Ryza

 It would be nice to add something to MLlib to make it easy to do one-of-K 
 encoding of categorical features.
 Something like:
 http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
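 For illustration, a hedged sketch of what one-of-K encoding does (plain Scala, 
 not the proposed MLlib API):
 {code}
 // One-of-K (one-hot) encoding of a categorical value, illustrative only.
 val categories = Seq("red", "green", "blue")
 val index = categories.zipWithIndex.toMap        // category -> position

 def oneHot(value: String): Array[Double] = {
   val vec = Array.fill(categories.size)(0.0)
   vec(index(value)) = 1.0                        // set the single active slot
   vec
 }

 oneHot("green")                                  // Array(0.0, 1.0, 0.0)
 {code}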



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4501) Create build/mvn to automatically download maven/zinc/scalac

2014-12-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247448#comment-14247448
 ] 

Apache Spark commented on SPARK-4501:
-

User 'brennonyork' has created a pull request for this issue:
https://github.com/apache/spark/pull/3707

 Create build/mvn to automatically download maven/zinc/scalac
 

 Key: SPARK-4501
 URL: https://issues.apache.org/jira/browse/SPARK-4501
 Project: Spark
  Issue Type: New Feature
  Components: Build
Reporter: Patrick Wendell
Assignee: Prashant Sharma

 For a long time we've had sbt/sbt, and this works well for users who want 
 to build Spark with minimal dependencies (only Java). It would be nice to 
 generalize this to Maven as well and have build/sbt and build/mvn, where 
 build/mvn is a script that downloads Maven, Zinc, and Scala locally and sets 
 them up correctly. This would be totally opt-in, and people using system 
 Maven would be able to continue doing so.
 My sense is that very few Maven users are currently using Zinc, even though 
 in some basic tests I saw a huge improvement from using it. Also, having 
 a simple way to use Zinc would make it easier to use Maven on our Jenkins 
 test machines.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-785) ClosureCleaner not invoked on most PairRDDFunctions

2014-12-15 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247473#comment-14247473
 ] 

Josh Rosen commented on SPARK-785:
--

I've merged https://github.com/apache/spark/pull/3690 to fix this in the 
maintenance branches and have tagged this for a 1.2.1 backport.

 ClosureCleaner not invoked on most PairRDDFunctions
 ---

 Key: SPARK-785
 URL: https://issues.apache.org/jira/browse/SPARK-785
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Matei Zaharia
Assignee: Sean Owen
  Labels: backport-needed
 Fix For: 1.1.1, 0.9.3, 1.0.3, 1.3.0


 It's pretty weird that we've missed this so far, but it seems to be the case. 
 Unfortunately it may not be good to fix this in 0.7.3 because it could change 
 behavior in unexpected ways; I haven't decided yet. But we should definitely 
 do it for 0.8, and add tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-785) ClosureCleaner not invoked on most PairRDDFunctions

2014-12-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-785:
-
Labels: backport-needed  (was: )

 ClosureCleaner not invoked on most PairRDDFunctions
 ---

 Key: SPARK-785
 URL: https://issues.apache.org/jira/browse/SPARK-785
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Matei Zaharia
Assignee: Sean Owen
  Labels: backport-needed
 Fix For: 1.1.1, 0.9.3, 1.0.3, 1.3.0


 It's pretty weird that we've missed this so far, but it seems to be the case. 
 Unfortunately it may not be good to fix this in 0.7.3 because it could change 
 behavior in unexpected ways; I haven't decided yet. But we should definitely 
 do it for 0.8, and add tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-785) ClosureCleaner not invoked on most PairRDDFunctions

2014-12-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-785:
-
Target Version/s: 1.2.1
Assignee: Sean Owen

 ClosureCleaner not invoked on most PairRDDFunctions
 ---

 Key: SPARK-785
 URL: https://issues.apache.org/jira/browse/SPARK-785
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Matei Zaharia
Assignee: Sean Owen
  Labels: backport-needed
 Fix For: 1.1.1, 0.9.3, 1.0.3, 1.3.0


 It's pretty weird that we've missed this so far, but it seems to be the case. 
 Unfortunately it may not be good to fix this in 0.7.3 because it could change 
 behavior in unexpected ways; I haven't decided yet. But we should definitely 
 do it for 0.8, and add tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4320) JavaPairRDD should supply a saveAsNewHadoopDataset which takes a Job object

2014-12-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-4320:
--
Fix Version/s: (was: 1.1.1)
   1.1.2

 JavaPairRDD should supply a saveAsNewHadoopDataset which takes a Job object 
 

 Key: SPARK-4320
 URL: https://issues.apache.org/jira/browse/SPARK-4320
 Project: Spark
  Issue Type: Improvement
  Components: Input/Output, Spark Core
Reporter: Corey J. Nolet
 Fix For: 1.2.0, 1.1.2


 I am outputting data to Accumulo using a custom OutputFormat. I have tried 
 using saveAsNewHadoopFile() and that works, though passing an empty path is a 
 bit weird. Since it isn't really a file I'm storing, but rather a generic 
 Pair dataset, I'd be inclined to use the saveAsHadoopDataset() method, though 
 I'm not at all interested in using the legacy mapred API.
 Perhaps we could supply a saveAsNewHadoopDataset method. Personally, I think 
 there should be two ways of calling into this method. Instead of forcing the 
 user to always set up the Job object explicitly, I'm in the camp of having 
 the following method signature (see the sketch below):
 saveAsNewHadoopDataset(keyClass : Class[K], valueClass : Class[V], ofclass : 
 Class[? extends OutputFormat], conf : Configuration). This way, if I'm 
 writing Spark jobs that are going from Hadoop back into Hadoop, I can 
 construct my Configuration once.
 Perhaps an overloaded method signature could be:
 saveAsNewHadoopDataset(job : Job)
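 A hedged Scala rendering of the two proposed overloads (the method name and 
 parameter shapes follow the description above; they are not an existing API):
 {code}
 // Illustrative signatures only, not existing Spark API at the time of this issue.
 import org.apache.hadoop.conf.Configuration
 import org.apache.hadoop.mapreduce.{Job, OutputFormat}

 trait JavaPairRDDProposal[K, V] {
   // Variant 1: caller supplies key/value classes, the OutputFormat, and a Configuration.
   def saveAsNewHadoopDataset(
       keyClass: Class[K],
       valueClass: Class[V],
       outputFormatClass: Class[_ <: OutputFormat[K, V]],
       conf: Configuration): Unit

   // Variant 2: caller passes a fully configured Job object.
   def saveAsNewHadoopDataset(job: Job): Unit
 }
 {code}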



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4232) Truncate table not works when specific the table from non-current database session

2014-12-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-4232:
--
Fix Version/s: (was: 1.1.1)
   1.1.2

 Truncate table not works when specific the table from non-current database 
 session
 --

 Key: SPARK-4232
 URL: https://issues.apache.org/jira/browse/SPARK-4232
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: shengli
Priority: Minor
 Fix For: 1.2.0, 1.1.2


 Currently TRUNCATE TABLE works fine when the table is in the current database 
 session, but it does not work when the table is referenced from a different 
 database session.
 What I mean is:
 Assume we have two databases, default and dw, and a table named test_table in 
 database dw.
 By default we log in with the default database session, so if I run:
 use dw;
 truncate table test_table [partitions..];  it is OK.
 But if I stay in the default database and run:
 use default;
 truncate table dw.test_table;
 it will throw an exception:
 Failed to parse: truncate table dw.test_table.
  line 1:17 missing EOF at '.' near 'dw'
 So this is a bug in parsing truncate table xxx



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4355) OnlineSummarizer doesn't merge mean correctly

2014-12-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-4355:
--
Fix Version/s: (was: 1.1.1)
   1.1.2

 OnlineSummarizer doesn't merge mean correctly
 -

 Key: SPARK-4355
 URL: https://issues.apache.org/jira/browse/SPARK-4355
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.0.2, 1.1.1, 1.2.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
 Fix For: 1.2.0, 1.1.2


 It happens when the mean on one side is zero. I will send a PR with some 
 code clean-up.
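 For context, a hedged illustration of the weighted merge the summarizer needs 
 (generic formula only, not the MultivariateOnlineSummarizer source):
 {code}
 // Merging two running means must weight each side by its count, even when one
 // of the means is zero (illustrative helper, not Spark code).
 def mergeMean(n1: Long, mean1: Double, n2: Long, mean2: Double): Double =
   if (n1 + n2 == 0L) 0.0
   else (n1 * mean1 + n2 * mean2) / (n1 + n2).toDouble

 mergeMean(3L, 0.0, 1L, 4.0)   // 1.0: the zero-mean side still contributes its count
 {code}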



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4265) Better extensibility for TaskEndReason

2014-12-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-4265:
--
Fix Version/s: (was: 1.1.1)
   1.1.2

 Better extensibility for TaskEndReason
 --

 Key: SPARK-4265
 URL: https://issues.apache.org/jira/browse/SPARK-4265
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Shixiong Zhu
Priority: Minor
  Labels: api-change
 Fix For: 1.1.2


 Now all subclasses of TaskEndReason are case classes. As per the discussion in 
 https://github.com/apache/spark/pull/3073#discussion_r19920257 , it's hard to 
 extend them (for example, to add or remove fields) without breaking 
 compatibility for pattern matching.
 It would be better to change them to regular classes for extensibility.
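 A hedged illustration of the compatibility concern (hypothetical class, not the 
 actual TaskEndReason hierarchy):
 {code}
 // Matching on a case class constructor pattern ties callers to the exact field
 // list, so adding a field later breaks every such match at compile time.
 case class FetchFailedV1(host: String, shuffleId: Int)

 def describe(r: FetchFailedV1): String = r match {
   case FetchFailedV1(host, shuffleId) => s"fetch failed on $host (shuffle $shuffleId)"
 }

 // If a new field (say, mapId: Int) were added, the pattern above would no longer
 // compile. A regular class lets callers match on the type and read fields through
 // accessors, which keeps adding fields source-compatible.
 {code}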



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-785) ClosureCleaner not invoked on most PairRDDFunctions

2014-12-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-785:
-
Fix Version/s: (was: 1.1.1)
   1.1.2

 ClosureCleaner not invoked on most PairRDDFunctions
 ---

 Key: SPARK-785
 URL: https://issues.apache.org/jira/browse/SPARK-785
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Matei Zaharia
Assignee: Sean Owen
  Labels: backport-needed
 Fix For: 0.9.3, 1.0.3, 1.3.0, 1.1.2


 It's pretty weird that we've missed this so far, but it seems to be the case. 
 Unfortunately it may not be good to fix this in 0.7.3 because it could change 
 behavior in unexpected ways; I haven't decided yet. But we should definitely 
 do it for 0.8, and add tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4006) Spark Driver crashes whenever an Executor is registered twice

2014-12-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-4006:
--
Fix Version/s: (was: 1.1.1)
   1.1.2

 Spark Driver crashes whenever an Executor is registered twice
 -

 Key: SPARK-4006
 URL: https://issues.apache.org/jira/browse/SPARK-4006
 Project: Spark
  Issue Type: Bug
  Components: Block Manager, Spark Core
Affects Versions: 0.9.2, 1.0.2, 1.1.0, 1.2.0
 Environment: Mesos, Coarse Grained
Reporter: Tal Sliwowicz
Assignee: Tal Sliwowicz
Priority: Critical
 Fix For: 1.2.0, 1.1.2


 This is a huge robustness issue for us (Taboola) in mission-critical, 
 time-sensitive (real-time) Spark jobs.
 We have long-running Spark drivers, and even though we have state-of-the-art 
 hardware, executors disconnect from time to time. In many cases the 
 RemoveExecutor message is not received, and when the new executor registers, 
 the driver crashes. In Mesos coarse-grained mode, executor ids are fixed. 
 The issue is the System.exit(1) in BlockManagerMasterActor:
 {code}
  private def register(id: BlockManagerId, maxMemSize: Long, slaveActor: ActorRef) {
    if (!blockManagerInfo.contains(id)) {
      blockManagerIdByExecutor.get(id.executorId) match {
        case Some(manager) =>
          // A block manager of the same executor already exists.
          // This should never happen. Let's just quit.
          logError("Got two different block manager registrations on " + id.executorId)
          System.exit(1)
        case None =>
          blockManagerIdByExecutor(id.executorId) = id
      }
      logInfo("Registering block manager %s with %s RAM".format(
        id.hostPort, Utils.bytesToString(maxMemSize)))
      blockManagerInfo(id) =
        new BlockManagerInfo(id, System.currentTimeMillis(), maxMemSize, slaveActor)
    }
    listenerBus.post(SparkListenerBlockManagerAdded(id, maxMemSize))
  }
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3987) NNLS generates incorrect result

2014-12-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-3987:
--
Fix Version/s: (was: 1.1.2)
   1.1.1

 NNLS generates incorrect result
 ---

 Key: SPARK-3987
 URL: https://issues.apache.org/jira/browse/SPARK-3987
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.1.0
Reporter: Debasish Das
Assignee: Shuo Xiang
 Fix For: 1.1.1, 1.2.0


 Hi,
 Please see the example gram matrix and linear term:
 val P2 = new DoubleMatrix(20, 20, 333907.312770, -60814.043975, 
 207935.829941, -162881.367739, -43730.396770, 17511.428983, -243340.496449, 
 -225245.957922, 104700.445881, 32430.845099, 336378.693135, -373497.970207, 
 -41147.159621, 53928.060360, -293517.883778, 53105.278068, 0.00, 
 -85257.781696, 84913.970469, -10584.080103, -60814.043975, 13826.806664, 
 -38032.612640, 33475.833875, 10791.916809, -1040.950810, 48106.552472, 
 45390.073380, -16310.282190, -2861.455903, -60790.833191, 73109.516544, 
 9826.614644, -8283.992464, 56991.742991, -6171.366034, 0.00, 
 19152.382499, -13218.721710, 2793.734234, 207935.829941, -38032.612640, 
 129661.677608, -101682.098412, -27401.299347, 10787.713362, -151803.006149, 
 -140563.601672, 65067.935324, 20031.263383, 209521.268600, -232958.054688, 
 -25764.179034, 33507.951918, -183046.845592, 32884.782835, 0.00, 
 -53315.811196, 52770.762546, -6642.187643, -162881.367739, 33475.833875, 
 -101682.098412, 85094.407608, 25422.850782, -5437.646141, 124197.166330, 
 116206.265909, -47093.484134, -11420.168521, -163429.436848, 189574.783900, 
 23447.172314, -24087.375367, 148311.355507, -20848.385466, 0.00, 
 46835.814559, -38180.352878, 6415.873901, -43730.396770, 10791.916809, 
 -27401.299347, 25422.850782, 8882.869799, 15.638084, 35933.473986, 
 34186.371325, -10745.330690, -974.314375, -43537.709621, 54371.010558, 
 7894.453004, -5408.929644, 42231.381747, -3192.010574, 0.00, 
 15058.753110, -8704.757256, 2316.581535, 17511.428983, -1040.950810, 
 10787.713362, -5437.646141, 15.638084, 2794.949847, -9681.950987, 
 -8258.171646, 7754.358930, 4193.359412, 18052.143842, -15456.096769, 
 -253.356253, 4089.672804, -12524.380088, 5651.579348, 0.00, -1513.302547, 
 6296.461898, 152.427321, -243340.496449, 48106.552472, -151803.006149, 
 124197.166330, 35933.473986, -9681.950987, 182931.600236, 170454.352953, 
 -72361.174145, -19270.461728, -244518.179729, 279551.060579, 33340.452802, 
 -37103.267653, 219025.288975, -33687.141423, 0.00, 67347.950443, 
 -58673.009647, 8957.800259, -225245.957922, 45390.073380, -140563.601672, 
 116206.265909, 34186.371325, -8258.171646, 170454.352953, 159322.942894, 
 -66074.960534, -16839.743193, -226173.967766, 260421.044094, 31624.194003, 
 -33839.612565, 203889.695169, -30034.828909, 0.00, 63525.040745, 
 -53572.741748, 8575.071847, 104700.445881, -16310.282190, 65067.935324, 
 -47093.484134, -10745.330690, 7754.358930, -72361.174145, -66074.960534, 
 35869.598076, 13378.653317, 106033.647837, -111831.682883, -10455.465743, 
 18537.392481, -88370.612394, 20344.288488, 0.00, -22935.482766, 
 29004.543704, -2409.461759, 32430.845099, -2861.455903, 20031.263383, 
 -11420.168521, -974.314375, 4193.359412, -19270.461728, -16839.743193, 
 13378.653317, 6802.081898, 33256.395091, -30421.985199, -1296.785870, 
 7026.518692, -24443.378205, 9221.982599, 0.00, -4088.076871, 
 10861.014242, -25.092938, 336378.693135, -60790.833191, 209521.268600, 
 -163429.436848, -43537.709621, 18052.143842, -244518.179729, -226173.967766, 
 106033.647837, 33256.395091, 339200.268106, -375442.716811, -41027.594509, 
 54636.778527, -295133.248586, 54177.278365, 0.00, -85237.666701, 
 85996.957056, -10503.209968, -373497.970207, 73109.516544, -232958.054688, 
 189574.783900, 54371.010558, -15456.096769, 279551.060579, 260421.044094, 
 -111831.682883, -30421.985199, -375442.716811, 427793.208465, 50528.074431, 
 -57375.986301, 335203.382015, -52676.385869, 0.00, 102368.307670, 
 -90679.792485, 13509.390393, -41147.159621, 9826.614644, -25764.179034, 
 23447.172314, 7894.453004, -253.356253, 33340.452802, 31624.194003, 
 -10455.465743, -1296.785870, -41027.594509, 50528.074431, 7255.977434, 
 -5281.636812, 39298.355527, -3440.450858, 0.00, 13717.870243, 
 -8471.405582, 2071.812204, 53928.060360, -8283.992464, 33507.951918, 
 -24087.375367, -5408.929644, 4089.672804, -37103.267653, -33839.612565, 
 18537.392481, 7026.518692, 54636.778527, -57375.986301, -5281.636812, 
 9735.061160, -45360.674033, 10634.633559, 0.00, -11652.364691, 
 15039.566630, -1202.539106, -293517.883778, 56991.742991, -183046.845592, 
 148311.355507, 42231.381747, -12524.380088, 219025.288975, 203889.695169, 
 -88370.612394, -24443.378205, -295133.248586, 

[jira] [Updated] (SPARK-4006) Spark Driver crashes whenever an Executor is registered twice

2014-12-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-4006:
--
Fix Version/s: (was: 1.1.2)
   1.1.1

 Spark Driver crashes whenever an Executor is registered twice
 -

 Key: SPARK-4006
 URL: https://issues.apache.org/jira/browse/SPARK-4006
 Project: Spark
  Issue Type: Bug
  Components: Block Manager, Spark Core
Affects Versions: 0.9.2, 1.0.2, 1.1.0, 1.2.0
 Environment: Mesos, Coarse Grained
Reporter: Tal Sliwowicz
Assignee: Tal Sliwowicz
Priority: Critical
 Fix For: 1.1.1, 1.2.0


 This is a huge robustness issue for us (Taboola) in mission-critical, 
 time-sensitive (real-time) Spark jobs.
 We have long-running Spark drivers, and even though we have state-of-the-art 
 hardware, executors disconnect from time to time. In many cases the 
 RemoveExecutor message is not received, and when the new executor registers, 
 the driver crashes. In Mesos coarse-grained mode, executor ids are fixed. 
 The issue is the System.exit(1) in BlockManagerMasterActor:
 {code}
  private def register(id: BlockManagerId, maxMemSize: Long, slaveActor: ActorRef) {
    if (!blockManagerInfo.contains(id)) {
      blockManagerIdByExecutor.get(id.executorId) match {
        case Some(manager) =>
          // A block manager of the same executor already exists.
          // This should never happen. Let's just quit.
          logError("Got two different block manager registrations on " + id.executorId)
          System.exit(1)
        case None =>
          blockManagerIdByExecutor(id.executorId) = id
      }
      logInfo("Registering block manager %s with %s RAM".format(
        id.hostPort, Utils.bytesToString(maxMemSize)))
      blockManagerInfo(id) =
        new BlockManagerInfo(id, System.currentTimeMillis(), maxMemSize, slaveActor)
    }
    listenerBus.post(SparkListenerBlockManagerAdded(id, maxMemSize))
  }
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2823) GraphX jobs throw IllegalArgumentException

2014-12-15 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247480#comment-14247480
 ] 

Josh Rosen commented on SPARK-2823:
---

I've removed the Fix Versions from this JIRA because its fix was reverted.

 GraphX jobs throw IllegalArgumentException
 --

 Key: SPARK-2823
 URL: https://issues.apache.org/jira/browse/SPARK-2823
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Reporter: Lu Lu

 If users set “spark.default.parallelism” and the value differs from the 
 EdgeRDD partition number, GraphX jobs will throw an IllegalArgumentException:
 14/07/26 21:06:51 WARN DAGScheduler: Creating new stage failed due to 
 exception - job: 1
 java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of 
 partitions
 at 
 org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:60)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
 at 
 org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:54)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
 at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getShuffleMapStage(DAGScheduler.scala:1
 97)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.s
 cala:272)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.s
 cala:269)
 at scala.collection.immutable.List.foreach(List.scala:318)
 at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$visit$1(DAGScheduler.scala:269)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.s
 cala:274)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.s
 cala:269)
 at scala.collection.immutable.List.foreach(List.scala:318)
 at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$visit$1(DAGScheduler.scala:269)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.s
 cala:274)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.s
 cala:269)
 at scala.collection.immutable.List.foreach(List.scala:318)
 at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$visit$1(DAGScheduler.scala:269)
 at 
 org.apache.spark.scheduler.DAGScheduler.getParentStages(DAGScheduler.scala:279)
 at 
 org.apache.spark.scheduler.DAGScheduler.newStage(DAGScheduler.scala:219)
 at 
 org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:672)
 at 
 org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1184)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
 at akka.actor.ActorCell.invoke(ActorCell.scala:456)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
 at akka.dispatch.Mailbox.run(Mailbox.scala:219)
 at 
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
 at 
 scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: 

[jira] [Updated] (SPARK-2823) GraphX jobs throw IllegalArgumentException

2014-12-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-2823:
--
Fix Version/s: (was: 1.1.2)
   (was: 1.0.3)
   (was: 1.2.0)

 GraphX jobs throw IllegalArgumentException
 --

 Key: SPARK-2823
 URL: https://issues.apache.org/jira/browse/SPARK-2823
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Reporter: Lu Lu

 If users set “spark.default.parallelism” and the value differs from the 
 EdgeRDD partition number, GraphX jobs will throw an IllegalArgumentException:
 14/07/26 21:06:51 WARN DAGScheduler: Creating new stage failed due to 
 exception - job: 1
 java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of 
 partitions
 at 
 org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:60)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
 at 
 org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:54)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
 at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getShuffleMapStage(DAGScheduler.scala:197)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.scala:272)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.scala:269)
 at scala.collection.immutable.List.foreach(List.scala:318)
 at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$visit$1(DAGScheduler.scala:269)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.scala:274)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.scala:269)
 at scala.collection.immutable.List.foreach(List.scala:318)
 at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$visit$1(DAGScheduler.scala:269)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.scala:274)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.scala:269)
 at scala.collection.immutable.List.foreach(List.scala:318)
 at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$visit$1(DAGScheduler.scala:269)
 at 
 org.apache.spark.scheduler.DAGScheduler.getParentStages(DAGScheduler.scala:279)
 at 
 org.apache.spark.scheduler.DAGScheduler.newStage(DAGScheduler.scala:219)
 at 
 org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:672)
 at 
 org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1184)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
 at akka.actor.ActorCell.invoke(ActorCell.scala:456)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
 at akka.dispatch.Mailbox.run(Mailbox.scala:219)
 at 
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
 at 
 scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
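
 For illustration, a minimal sketch that can provoke the mismatch described above (the edge-list 
 path and the partition counts below are made up, and whether a particular job fails depends on 
 which GraphX operation ends up zipping the differently-partitioned RDDs):
 {code}
 import org.apache.spark.{SparkConf, SparkContext}
 import org.apache.spark.graphx.GraphLoader

 object ParallelismMismatchSketch {
   def main(args: Array[String]): Unit = {
     val conf = new SparkConf()
       .setMaster("local[4]")
       .setAppName("parallelism-mismatch")
       .set("spark.default.parallelism", "7")   // differs from the edge partition count below
     val sc = new SparkContext(conf)
     // Load edges into 3 partitions; downstream shuffles may use the default
     // parallelism of 7, which is where "Can't zip RDDs with unequal numbers
     // of partitions" can surface.
     val graph = GraphLoader.edgeListFile(sc, "data/edges.txt", false, 3)
     println(graph.degrees.count())
     sc.stop()
   }
 }
 {code}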



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-2823) GraphX jobs throw IllegalArgumentException

2014-12-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-2823:
--
Target Version/s: 1.0.3, 1.3.0, 1.1.2, 1.2.1

 GraphX jobs throw IllegalArgumentException
 --

 Key: SPARK-2823
 URL: https://issues.apache.org/jira/browse/SPARK-2823
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Reporter: Lu Lu

 If users set “spark.default.parallelism” to a value different from the 
 EdgeRDD partition number, GraphX jobs will throw an IllegalArgumentException:
 14/07/26 21:06:51 WARN DAGScheduler: Creating new stage failed due to 
 exception - job: 1
 java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of 
 partitions
 at 
 org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:60)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
 at 
 org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:54)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
 at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getShuffleMapStage(DAGScheduler.scala:197)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.scala:272)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.scala:269)
 at scala.collection.immutable.List.foreach(List.scala:318)
 at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$visit$1(DAGScheduler.scala:269)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.scala:274)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.scala:269)
 at scala.collection.immutable.List.foreach(List.scala:318)
 at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$visit$1(DAGScheduler.scala:269)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.scala:274)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.scala:269)
 at scala.collection.immutable.List.foreach(List.scala:318)
 at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$visit$1(DAGScheduler.scala:269)
 at 
 org.apache.spark.scheduler.DAGScheduler.getParentStages(DAGScheduler.scala:279)
 at 
 org.apache.spark.scheduler.DAGScheduler.newStage(DAGScheduler.scala:219)
 at 
 org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:672)
 at 
 org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1184)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
 at akka.actor.ActorCell.invoke(ActorCell.scala:456)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
 at akka.dispatch.Mailbox.run(Mailbox.scala:219)
 at 
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
 at 
 scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4355) OnlineSummarizer doesn't merge mean correctly

2014-12-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-4355:
--
Fix Version/s: (was: 1.1.2)
   1.1.1

 OnlineSummarizer doesn't merge mean correctly
 -

 Key: SPARK-4355
 URL: https://issues.apache.org/jira/browse/SPARK-4355
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.0.2, 1.1.1, 1.2.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
 Fix For: 1.1.1, 1.2.0


 It happens when the mean on one side is zero. I will send an PR with some 
 code clean-up.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4320) JavaPairRDD should supply a saveAsNewHadoopDataset which takes a Job object

2014-12-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-4320:
--
Target Version/s: 1.1.2, 1.2.1
   Fix Version/s: (was: 1.1.2)
  (was: 1.2.0)

I've changed this issue's Fix Version/s into Target Version/s since it 
hasn't actually been fixed yet.

 JavaPairRDD should supply a saveAsNewHadoopDataset which takes a Job object 
 

 Key: SPARK-4320
 URL: https://issues.apache.org/jira/browse/SPARK-4320
 Project: Spark
  Issue Type: Improvement
  Components: Input/Output, Spark Core
Reporter: Corey J. Nolet

 I am outputting data to Accumulo using a custom OutputFormat. I have tried 
 using saveAsNewHadoopFile() and that works, though passing an empty path is a 
 bit weird. Being that it isn't really a file I'm storing, but rather a 
 generic Pair dataset, I'd be inclined to use the saveAsHadoopDataset() 
 method, though I'm not at all interested in using the legacy mapred API.
 Perhaps we could supply a saveAsNewHadoopDataset method. Personally, I think 
 there should be two ways of calling into this method. Instead of forcing the 
 user to always set up the Job object explicitly, I'm in the camp of having 
 the following method signature:
 saveAsNewHadoopDataset(keyClass : Class[K], valueClass : Class[V], ofclass : 
 Class[? extends OutputFormat], conf : Configuration). This way, if I'm 
 writing spark jobs that are going from Hadoop back into Hadoop, I can 
 construct my Configuration once.
 Perhaps an overloaded method signature could be:
 saveAsNewHadoopDataset(job : Job)
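
 For reference, a rough sketch of what the proposed convenience method could boil down to on top 
 of the existing saveAsNewAPIHadoopDataset(conf) call (the output format, key/value types and 
 output path below are placeholders, and Job.getInstance assumes the Hadoop 2 mapreduce API):
 {code}
 import org.apache.hadoop.conf.Configuration
 import org.apache.hadoop.fs.Path
 import org.apache.hadoop.io.Text
 import org.apache.hadoop.mapreduce.Job
 import org.apache.hadoop.mapreduce.lib.output.{FileOutputFormat, TextOutputFormat}
 import org.apache.spark.{SparkConf, SparkContext}
 import org.apache.spark.SparkContext._   // pair RDD functions in Spark 1.x

 object SaveAsNewHadoopDatasetSketch {
   def main(args: Array[String]): Unit = {
     val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("sketch"))
     val pairs = sc.parallelize(Seq("k1" -> "v1", "k2" -> "v2"))
       .map { case (k, v) => (new Text(k), new Text(v)) }

     // Configure the Job once (key/value/output format), then reuse its
     // Configuration -- essentially what the proposed
     // saveAsNewHadoopDataset(keyClass, valueClass, ofClass, conf) would wrap.
     val job = Job.getInstance(new Configuration())
     job.setOutputKeyClass(classOf[Text])
     job.setOutputValueClass(classOf[Text])
     job.setOutputFormatClass(classOf[TextOutputFormat[Text, Text]])
     FileOutputFormat.setOutputPath(job, new Path("/tmp/sketch-out"))

     pairs.saveAsNewAPIHadoopDataset(job.getConfiguration)
     sc.stop()
   }
 }
 {code}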



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4265) Better extensibility for TaskEndReason

2014-12-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-4265:
--
Fix Version/s: (was: 1.1.2)

Removing the Fix Version/s field, since we reserve that for versions where a 
fix was actually applied.  Please use Target Version/s to plan future fixes.

 Better extensibility for TaskEndReason
 --

 Key: SPARK-4265
 URL: https://issues.apache.org/jira/browse/SPARK-4265
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Shixiong Zhu
Priority: Minor
  Labels: api-change

 Now all subclasses of TaskEndReason are case classes. As per discussion in 
 https://github.com/apache/spark/pull/3073#discussion_r19920257 , it's hard to 
 extend them (such as adding/removing fields) without breaking compatibility 
 for pattern matching.
 It's better to change them to regular classes for extensibility.
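
 A toy illustration of the compatibility concern, outside Spark's actual classes (names here are 
 made up): adding a field to a case class changes the generated extractor, while a regular class 
 with a hand-written unapply can keep old match sites compiling:
 {code}
 // Toy example only; these are not Spark's TaskEndReason classes.
 sealed trait Reason

 // If this had been `case class ExecutorLost(execId: String)`, adding the
 // `time` field later would break every `case ExecutorLost(id)` match.
 class ExecutorLost(val execId: String, val time: Long = 0L) extends Reason

 object ExecutorLost {
   def apply(execId: String, time: Long = 0L): ExecutorLost = new ExecutorLost(execId, time)
   // The hand-written extractor keeps the original one-field pattern stable.
   def unapply(r: ExecutorLost): Option[String] = Some(r.execId)
 }

 object ReasonDemo extends App {
   val reason: Reason = ExecutorLost("exec-1", 42L)
   reason match {
     case ExecutorLost(id) => println(s"lost executor $id")  // still compiles after adding `time`
     case _                => println("other reason")
   }
 }
 {code}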



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4151) Add string operation function trim, ltrim, rtrim, length to support SparkSql (HiveQL)

2014-12-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-4151:
--
Target Version/s: 1.1.2  (was: 1.1.0)
   Fix Version/s: (was: 1.1.2)
  (was: 1.1.0)

Removing the Fix Version/s field, since we only use that to indicate where 
fixes have been applied, not where we plan to apply them; moved those versions 
over to Target Version/s instead.

 Add string operation function trim, ltrim, rtrim, length to support SparkSql 
 (HiveQL) 
 --

 Key: SPARK-4151
 URL: https://issues.apache.org/jira/browse/SPARK-4151
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.1.0
Reporter: shengli
Priority: Minor
   Original Estimate: 72h
  Remaining Estimate: 72h

 Add string operation functions (trim, ltrim, rtrim, length) to support Spark 
 SQL and HiveQL.
 e.g.:
 sql("select trim(' a b ') from src").collect() --> 'a b'
 sql("select ltrim(' a b ') from src").collect() --> 'a b ' 
 sql("select rtrim(' a b ') from src").collect() --> ' a b'
 sql("select length('ab') from src").collect() --> 2
 Also rename the trait in stringOperations.scala: I prefer to rename trait 
 CaseConversionExpression to StringTransformationExpression, which makes more 
 sense because the trait can then support more string transformations, not 
 only case conversion.
 Also add a trait StringCalculationExpression for string computations such as 
 length, indexOf, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4846) When the vocabulary size is large, Word2Vec may yield OutOfMemoryError: Requested array size exceeds VM limit

2014-12-15 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247487#comment-14247487
 ] 

Joseph K. Bradley commented on SPARK-4846:
--

I agree with [~srowen] that the current implementation has to serialize those 
big data structures, no matter what.  Splitting the big syn0Global and 
syn1Global data structures across partitions sounds possible, but I would guess 
that the Word2VecModel itself would then need to be distributed as well since 
it occupies the same order of memory.  A distributed Word2VecModel sounds like 
a much bigger PR.

In the meantime, a simpler & faster solution might be nice.  The easiest would 
be to catch the error and print a warning.  A fancier but better solution might 
be to automatically increase minCount as much as necessary (and print a warning 
about this automatic change).
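
As a user-side stopgap in the meantime (a sketch only, not the automatic handling suggested above; 
the minCount value is just an example), rare tokens can be pruned before fit(), which shrinks the 
vocabulary and therefore the syn0/syn1 arrays:
{code}
import org.apache.spark.SparkContext._   // pair RDD functions in Spark 1.x
import org.apache.spark.mllib.feature.Word2Vec
import org.apache.spark.rdd.RDD

object PruneVocabSketch {
  // Drop tokens that occur fewer than minCount times before training.
  def pruneRareWords(corpus: RDD[Seq[String]], minCount: Int): RDD[Seq[String]] = {
    val keep = corpus.flatMap(identity)
      .map((_, 1L))
      .reduceByKey(_ + _)
      .filter { case (_, count) => count >= minCount }
      .keys
      .collect()
      .toSet
    val keepBc = corpus.sparkContext.broadcast(keep)
    corpus.map(_.filter(keepBc.value.contains))
  }

  def train(corpus: RDD[Seq[String]]): Unit = {
    val pruned = pruneRareWords(corpus, minCount = 20)
    val model = new Word2Vec().setVectorSize(100).fit(pruned)
  }
}
{code}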

 When the vocabulary size is large, Word2Vec may yield OutOfMemoryError: 
 Requested array size exceeds VM limit
 ---

 Key: SPARK-4846
 URL: https://issues.apache.org/jira/browse/SPARK-4846
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.1.0
 Environment: Use Word2Vec to process a corpus (sized 3.5G) with one 
 partition.
 The corpus contains about 300 million words and its vocabulary size is about 
 10 million.
Reporter: Joseph Tang
Priority: Critical

 Exception in thread Driver java.lang.reflect.InvocationTargetException
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at 
 org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162)
 Caused by: java.lang.OutOfMemoryError: Requested array size exceeds VM limit 
 at java.util.Arrays.copyOf(Arrays.java:2271)
 at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
 at 
 java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
 at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
 at 
 java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1870)
 at 
 java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1779)
 at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1186)
 at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
 at 
 org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
 at 
 org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73)
 at 
 org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164)
 at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
 at org.apache.spark.SparkContext.clean(SparkContext.scala:1242)
 at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:610)
 at 
 org.apache.spark.mllib.feature.Word2Vec$$anonfun$fit$1.apply$mcVI$sp(Word2Vec.scala:291)
 at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
 at org.apache.spark.mllib.feature.Word2Vec.fit(Word2Vec.scala:290)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4232) Truncate table not works when specific the table from non-current database session

2014-12-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-4232:
--
Target Version/s: 1.1.2, 1.2.1  (was: 1.1.0)
   Fix Version/s: (was: 1.1.2)
  (was: 1.2.0)

Removing the Fix Version/s field, since we only use that to indicate where 
fixes have been applied, not where we plan to apply them; moved those versions 
over to Target Version/s instead.

 Truncate table not works when specific the table from non-current database 
 session
 --

 Key: SPARK-4232
 URL: https://issues.apache.org/jira/browse/SPARK-4232
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: shengli
Priority: Minor

 Currently, truncate table works fine when the table is in the current 
 database session, but it does not work when the table is qualified with a 
 database other than the current one.
 What I mean is:
 Assume we have two databases, default and dw, and a table named test_table in 
 database dw.
 By default we log in to the default database session, so if I run:
 use dw;
 truncate table test_table [partitions..];  it is OK.
 If I stay in the default database and run:
 use default;
 truncate table dw.test_table;
 it will throw an exception:
 Failed to parse: truncate table dw.test_table.
  line 1:17 missing EOF at '.' near 'dw'
 It's a bug in parsing truncate table xxx



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4323) Utils#fetchFile method should close lock file certainly

2014-12-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-4323.
---
Resolution: Not a Problem

Resolving this as Not a Problem since the pull request was closed after some 
discussion.

 Utils#fetchFile method should close lock file certainly
 ---

 Key: SPARK-4323
 URL: https://issues.apache.org/jira/browse/SPARK-4323
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: Kousuke Saruta

 In the Utils#fetchFile method, the lock file is created as follows.
 {code}
 val raf = new RandomAccessFile(lockFile, "rw")
 // Only one executor entry.
 // The FileLock is only used to control synchronization for executors download file,
 // it's always safe regardless of lock type (mandatory or advisory).
 val lock = raf.getChannel().lock()
 val cachedFile = new File(localDir, cachedFileName)
 try {
   if (!cachedFile.exists()) {
     doFetchFile(url, localDir, cachedFileName, conf, securityMgr, hadoopConf)
   }
 } finally {
   lock.release()
 }
 {code}
 If an error occurs between opening the RandomAccessFile and acquiring the lock, 
 the lock file may never be closed.
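
 For illustration, one way (sketched here, not Spark's actual code) to guarantee the 
 RandomAccessFile is closed even when acquiring the lock or the download itself throws:
 {code}
 import java.io.{File, RandomAccessFile}

 object LockFileSketch {
   // Close the RandomAccessFile in an outer finally so a failure while acquiring
   // or holding the lock cannot leak the underlying file handle.
   def withFileLock[T](lockFile: File)(body: => T): T = {
     val raf = new RandomAccessFile(lockFile, "rw")
     try {
       val lock = raf.getChannel().lock()
       try body finally lock.release()
     } finally {
       raf.close()
     }
   }
 }
 {code}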



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4793) way to find assembly jar is too strict

2014-12-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-4793:
--
Labels: backport-needed  (was: )

 way to find assembly jar is too strict
 --

 Key: SPARK-4793
 URL: https://issues.apache.org/jira/browse/SPARK-4793
 Project: Spark
  Issue Type: Improvement
  Components: Deploy
Affects Versions: 1.1.0
Reporter: Adrian Wang
Assignee: Adrian Wang
Priority: Minor
  Labels: backport-needed
 Fix For: 1.3.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2980) Python support for chi-squared test

2014-12-15 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247505#comment-14247505
 ] 

Joseph K. Bradley commented on SPARK-2980:
--

Duplicated by later JIRA which has been fixed

 Python support for chi-squared test
 ---

 Key: SPARK-2980
 URL: https://issues.apache.org/jira/browse/SPARK-2980
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Doris Xin





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-2980) Python support for chi-squared test

2014-12-15 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley closed SPARK-2980.

   Resolution: Duplicate
Fix Version/s: 1.2.0
 Assignee: Davies Liu

 Python support for chi-squared test
 ---

 Key: SPARK-2980
 URL: https://issues.apache.org/jira/browse/SPARK-2980
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Doris Xin
Assignee: Davies Liu
 Fix For: 1.2.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4855) Python tests for hypothesis testing

2014-12-15 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-4855:


 Summary: Python tests for hypothesis testing
 Key: SPARK-4855
 URL: https://issues.apache.org/jira/browse/SPARK-4855
 Project: Spark
  Issue Type: Test
  Components: MLlib, PySpark
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley
Assignee: Ben Cook
Priority: Minor


Add Python unit tests for Chi-Squared hypothesis testing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4814) Enable assertions in SBT, Maven tests / AssertionError from Hive's LazyBinaryInteger

2014-12-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-4814:
--
Fix Version/s: 1.2.1
   1.1.2
   1.3.0

 Enable assertions in SBT, Maven tests / AssertionError from Hive's 
 LazyBinaryInteger
 

 Key: SPARK-4814
 URL: https://issues.apache.org/jira/browse/SPARK-4814
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Affects Versions: 1.1.0
Reporter: Sean Owen
 Fix For: 1.3.0, 1.1.2, 1.2.1


 Follow up to SPARK-4159, wherein we noticed that Java tests weren't running 
 in Maven, in part because a Java test actually fails with {{AssertionError}}. 
 That code/test was fixed in SPARK-4850.
 The reason it wasn't caught by SBT tests was that they don't run with 
 assertions on, and Maven's surefire does.
 Turning on assertions in the SBT build is trivial, adding one line:
 {code}
 javaOptions in Test += "-ea",
 {code}
 This reveals a test failure in Scala test suites though:
 {code}
 [info] - alter_merge_2 *** FAILED *** (1 second, 305 milliseconds)
 [info]   Failed to execute query using catalyst:
 [info]   Error: Job aborted due to stage failure: Task 1 in stage 551.0 
 failed 1 times, most recent failure: Lost task 1.0 in stage 551.0 (TID 1532, 
 localhost): java.lang.AssertionError
 [info]at 
 org.apache.hadoop.hive.serde2.lazybinary.LazyBinaryInteger.init(LazyBinaryInteger.java:51)
 [info]at 
 org.apache.hadoop.hive.serde2.columnar.ColumnarStructBase$FieldInfo.uncheckedGetField(ColumnarStructBase.java:110)
 [info]at 
 org.apache.hadoop.hive.serde2.columnar.ColumnarStructBase.getField(ColumnarStructBase.java:171)
 [info]at 
 org.apache.hadoop.hive.serde2.objectinspector.ColumnarStructObjectInspector.getStructFieldData(ColumnarStructObjectInspector.java:166)
 [info]at 
 org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$1.apply(TableReader.scala:318)
 [info]at 
 org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$1.apply(TableReader.scala:314)
 [info]at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 [info]at 
 org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:132)
 [info]at 
 org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:128)
 [info]at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:615)
 [info]at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:615)
 [info]at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 [info]at 
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:264)
 [info]at org.apache.spark.rdd.RDD.iterator(RDD.scala:231)
 [info]at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 [info]at 
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:264)
 [info]at org.apache.spark.rdd.RDD.iterator(RDD.scala:231)
 [info]at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
 [info]at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 [info]at org.apache.spark.scheduler.Task.run(Task.scala:56)
 [info]at 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:195)
 [info]at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 [info]at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 [info]at java.lang.Thread.run(Thread.java:745)
 {code}
 The items for this JIRA are therefore:
 - Enable assertions in SBT
 - Fix this failure
 - Figure out why Maven scalatest didn't trigger it - may need assertions 
 explicitly turned on too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4814) Enable assertions in SBT, Maven tests / AssertionError from Hive's LazyBinaryInteger

2014-12-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-4814:
--
Target Version/s: 1.0.3  (was: 1.3.0)
Assignee: Sean Owen
  Labels: backport-needed  (was: )

Alright, I've merged Sean's PR into master, branch-1.2, and branch-1.1.  
There's a large-ish merge conflict that we'll have to fix to get this into 
branch-1.0 (or I can just manually fix things up there).  Tagging this as 
{{backport-needed}} so I remember to come back and do that.

 Enable assertions in SBT, Maven tests / AssertionError from Hive's 
 LazyBinaryInteger
 

 Key: SPARK-4814
 URL: https://issues.apache.org/jira/browse/SPARK-4814
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Affects Versions: 1.1.0
Reporter: Sean Owen
Assignee: Sean Owen
  Labels: backport-needed
 Fix For: 1.3.0, 1.1.2, 1.2.1


 Follow up to SPARK-4159, wherein we noticed that Java tests weren't running 
 in Maven, in part because a Java test actually fails with {{AssertionError}}. 
 That code/test was fixed in SPARK-4850.
 The reason it wasn't caught by SBT tests was that they don't run with 
 assertions on, and Maven's surefire does.
 Turning on assertions in the SBT build is trivial, adding one line:
 {code}
 javaOptions in Test += "-ea",
 {code}
 This reveals a test failure in Scala test suites though:
 {code}
 [info] - alter_merge_2 *** FAILED *** (1 second, 305 milliseconds)
 [info]   Failed to execute query using catalyst:
 [info]   Error: Job aborted due to stage failure: Task 1 in stage 551.0 
 failed 1 times, most recent failure: Lost task 1.0 in stage 551.0 (TID 1532, 
 localhost): java.lang.AssertionError
 [info]at 
 org.apache.hadoop.hive.serde2.lazybinary.LazyBinaryInteger.init(LazyBinaryInteger.java:51)
 [info]at 
 org.apache.hadoop.hive.serde2.columnar.ColumnarStructBase$FieldInfo.uncheckedGetField(ColumnarStructBase.java:110)
 [info]at 
 org.apache.hadoop.hive.serde2.columnar.ColumnarStructBase.getField(ColumnarStructBase.java:171)
 [info]at 
 org.apache.hadoop.hive.serde2.objectinspector.ColumnarStructObjectInspector.getStructFieldData(ColumnarStructObjectInspector.java:166)
 [info]at 
 org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$1.apply(TableReader.scala:318)
 [info]at 
 org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$1.apply(TableReader.scala:314)
 [info]at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 [info]at 
 org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:132)
 [info]at 
 org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:128)
 [info]at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:615)
 [info]at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:615)
 [info]at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 [info]at 
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:264)
 [info]at org.apache.spark.rdd.RDD.iterator(RDD.scala:231)
 [info]at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 [info]at 
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:264)
 [info]at org.apache.spark.rdd.RDD.iterator(RDD.scala:231)
 [info]at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
 [info]at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 [info]at org.apache.spark.scheduler.Task.run(Task.scala:56)
 [info]at 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:195)
 [info]at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 [info]at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 [info]at java.lang.Thread.run(Thread.java:745)
 {code}
 The items for this JIRA are therefore:
 - Enable assertions in SBT
 - Fix this failure
 - Figure out why Maven scalatest didn't trigger it - may need assertions 
 explicitly turned on too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4856) Null & empty string should not be considered as StringType at beginning in JSON schema inferring

2014-12-15 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-4856:


 Summary: Null & empty string should not be considered as 
StringType at beginning in JSON schema inferring
 Key: SPARK-4856
 URL: https://issues.apache.org/jira/browse/SPARK-4856
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Hao


We have data like:
{panel}
TestSQLContext.sparkContext.parallelize(
  """{"ip":"27.31.100.29","headers":{"Host":"1.abc.com","Charset":"UTF-8"}}""" ::
  """{"ip":"27.31.100.29","headers":{}}""" ::
  """{"ip":"27.31.100.29","headers":""}""" :: Nil)
{panel}

As the empty string will be considered as String at the beginning (in lines 2 and 
3), it ignores the real nested data type (the struct type in line 1) and also 
takes the "headers" of line 1 as String type.
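
A short way to see the effect (a sketch against the Spark 1.2-era SQLContext.jsonRDD API; object 
and application names here are illustrative):
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object JsonInferenceSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("json-sketch"))
    val sqlContext = new SQLContext(sc)
    val records = sc.parallelize(Seq(
      """{"ip":"27.31.100.29","headers":{"Host":"1.abc.com","Charset":"UTF-8"}}""",
      """{"ip":"27.31.100.29","headers":{}}""",
      """{"ip":"27.31.100.29","headers":""}"""))
    // With the data above, "headers" ends up inferred as StringType instead of
    // a struct, because the empty string in the later records conflicts with
    // the nested object seen in the first one -- the problem described here.
    sqlContext.jsonRDD(records).printSchema()
    sc.stop()
  }
}
{code}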



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4856) Null & empty string should not be considered as StringType at beginning in JSON schema inferring

2014-12-15 Thread Cheng Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Hao updated SPARK-4856:
-
Description: 
We have data like:
{panel}
TestSQLContext.sparkContext.parallelize(
  """{"ip":"27.31.100.29","headers":{"Host":"1.abc.com","Charset":"UTF-8"}}""" ::
  """{"ip":"27.31.100.29","headers":{}}""" ::
  """{"ip":"27.31.100.29","headers":""}""" :: Nil)
{panel}

As the empty string ("headers") will be considered as String at the beginning 
(in lines 2 and 3), it ignores the real nested data type (the struct type 
"headers" in line 1) and also takes the "headers" of line 1 as String type, 
which is not what we expect.

  was:
We have data like:
{panel}
TestSQLContext.sparkContext.parallelize(
  """{"ip":"27.31.100.29","headers":{"Host":"1.abc.com","Charset":"UTF-8"}}""" ::
  """{"ip":"27.31.100.29","headers":{}}""" ::
  """{"ip":"27.31.100.29","headers":""}""" :: Nil)
{panel}

As the empty string will be considered as String at the beginning (in lines 2 and 
3), it ignores the real nested data type (the struct type in line 1) and also 
takes the "headers" of line 1 as String type.


 Null & empty string should not be considered as StringType at beginning in 
 JSON schema inferring
 ---

 Key: SPARK-4856
 URL: https://issues.apache.org/jira/browse/SPARK-4856
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Hao

 We have data like:
 {panel}
 TestSQLContext.sparkContext.parallelize(
   """{"ip":"27.31.100.29","headers":{"Host":"1.abc.com","Charset":"UTF-8"}}""" ::
   """{"ip":"27.31.100.29","headers":{}}""" ::
   """{"ip":"27.31.100.29","headers":""}""" :: Nil)
 {panel}
 As the empty string ("headers") will be considered as String at the beginning 
 (in lines 2 and 3), it ignores the real nested data type (the struct type 
 "headers" in line 1) and also takes the "headers" of line 1 as String type, 
 which is not what we expect.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4856) Null & empty string should not be considered as StringType at beginning in JSON schema inferring

2014-12-15 Thread Cheng Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Hao updated SPARK-4856:
-
Description: 
We have data like:
{code:java}
TestSQLContext.sparkContext.parallelize(
  """{"ip":"27.31.100.29","headers":{"Host":"1.abc.com","Charset":"UTF-8"}}""" ::
  """{"ip":"27.31.100.29","headers":{}}""" ::
  """{"ip":"27.31.100.29","headers":""}""" :: Nil)
{code}

As the empty string ("headers") will be considered as String at the beginning 
(in lines 2 and 3), it ignores the real nested data type (the struct type 
"headers" in line 1) and also takes the "headers" of line 1 as String type, 
which is not what we expect.

  was:
We have data like:
{panel}
TestSQLContext.sparkContext.parallelize(
  """{"ip":"27.31.100.29","headers":{"Host":"1.abc.com","Charset":"UTF-8"}}""" ::
  """{"ip":"27.31.100.29","headers":{}}""" ::
  """{"ip":"27.31.100.29","headers":""}""" :: Nil)
{panel}

As the empty string ("headers") will be considered as String at the beginning 
(in lines 2 and 3), it ignores the real nested data type (the struct type 
"headers" in line 1) and also takes the "headers" of line 1 as String type, 
which is not what we expect.


 Null & empty string should not be considered as StringType at beginning in 
 JSON schema inferring
 ---

 Key: SPARK-4856
 URL: https://issues.apache.org/jira/browse/SPARK-4856
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Hao

 We have data like:
 {code:java}
 TestSQLContext.sparkContext.parallelize(
   """{"ip":"27.31.100.29","headers":{"Host":"1.abc.com","Charset":"UTF-8"}}""" ::
   """{"ip":"27.31.100.29","headers":{}}""" ::
   """{"ip":"27.31.100.29","headers":""}""" :: Nil)
 {code}
 As the empty string ("headers") will be considered as String at the beginning 
 (in lines 2 and 3), it ignores the real nested data type (the struct type 
 "headers" in line 1) and also takes the "headers" of line 1 as String type, 
 which is not what we expect.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4856) Null & empty string should not be considered as StringType at beginning in JSON schema inferring

2014-12-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247571#comment-14247571
 ] 

Apache Spark commented on SPARK-4856:
-

User 'chenghao-intel' has created a pull request for this issue:
https://github.com/apache/spark/pull/3708

 Null & empty string should not be considered as StringType at beginning in 
 JSON schema inferring
 ---

 Key: SPARK-4856
 URL: https://issues.apache.org/jira/browse/SPARK-4856
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Hao

 We have data like:
 {code:java}
 TestSQLContext.sparkContext.parallelize(
   """{"ip":"27.31.100.29","headers":{"Host":"1.abc.com","Charset":"UTF-8"}}""" ::
   """{"ip":"27.31.100.29","headers":{}}""" ::
   """{"ip":"27.31.100.29","headers":""}""" :: Nil)
 {code}
 As the empty string ("headers") will be considered as String at the beginning 
 (in lines 2 and 3), it ignores the real nested data type (the struct type 
 "headers" in line 1) and also takes the "headers" of line 1 as String type, 
 which is not what we expect.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4857) Add Executor Events to SparkListener

2014-12-15 Thread Kostas Sakellis (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247578#comment-14247578
 ] 

Kostas Sakellis commented on SPARK-4857:


I'll work on this.

 Add Executor Events to SparkListener
 

 Key: SPARK-4857
 URL: https://issues.apache.org/jira/browse/SPARK-4857
 Project: Spark
  Issue Type: Improvement
Reporter: Kostas Sakellis

 We need to add events to SparkListener to indicate that an executor has been 
 added or removed, along with the corresponding information. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4857) Add Executor Events to SparkListener

2014-12-15 Thread Kostas Sakellis (JIRA)
Kostas Sakellis created SPARK-4857:
--

 Summary: Add Executor Events to SparkListener
 Key: SPARK-4857
 URL: https://issues.apache.org/jira/browse/SPARK-4857
 Project: Spark
  Issue Type: Improvement
Reporter: Kostas Sakellis


We need to add events to SparkListener to indicate that an executor has been 
added or removed, along with the corresponding information. 
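
One plausible shape for such events, sketched only for illustration (these names are made up; the 
real events were still to be designed when this issue was filed):
{code}
// Illustrative only -- not the classes that ended up in Spark.
case class ExecutorInfoSketch(host: String, totalCores: Int, logUrls: Map[String, String])

sealed trait SparkListenerEventSketch
case class ExecutorAddedSketch(time: Long, executorId: String, info: ExecutorInfoSketch)
  extends SparkListenerEventSketch
case class ExecutorRemovedSketch(time: Long, executorId: String, reason: String)
  extends SparkListenerEventSketch

trait SparkListenerSketch {
  def onExecutorAdded(event: ExecutorAddedSketch): Unit = {}
  def onExecutorRemoved(event: ExecutorRemovedSketch): Unit = {}
}
{code}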



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3702) Standardize MLlib classes for learners, models

2014-12-15 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247582#comment-14247582
 ] 

Joseph K. Bradley commented on SPARK-3702:
--

I'm canceling my WIP PR for this since I have begun breaking that PR into 
smaller PRs.
The WIP PR branch is in [my ml-api branch | 
https://github.com/jkbradley/spark/tree/ml-api].

Here's the description of the WIP PR:

This is a WIP effort to standardize abstractions and the developer API for prediction 
tasks (classification and regression) in the new ML API (org.apache.spark.ml).
* Please comment on:
** abstractions, class hierarchy
** functionality required by each abstraction
** naming of types and methods
** ease of use for developers
** ease of use for users migrating from org.apache.spark.mllib
* Please ignore for now:
** missing tests and examples
** private/public API (I will make more things private to ml after writing 
tests and examples.)
** style and other details
** the many TODO items noted in the code

Please refer to [https://issues.apache.org/jira/browse/SPARK-3702] for some 
discussion on design, and [this design doc | 
https://docs.google.com/document/d/1I-8PD0DSLEZzzXURYZwmqAFn_OMBc08hgDL1FZnVBmw/]
 for major design decisions.

This is not intended to cover all algorithms; e.g., one big missing item is 
porting the GeneralizedLinearModel class to the new API.  But it hopefully lays 
a fair amount of groundwork.

I have included a limited number of concrete classes in this WIP PR, for 
purposes of illustration:
* LogisticRegression (edited, to show effects of abstract classes)
* NaiveBayes (simple to show ease of use for developers)
* AdaBoost (demonstration of meta-algorithms taking advantage of abstractions)
** (Note discussion of strong vs. weak types for ensemble methods in design 
doc.)
** This implementation is very incomplete but illustrates using the 
abstractions.
* LinearRegression (example of Regressor, for completeness)
* evaluators (to provide default evaluators in the class hierarchy)
* IterativeSolver and IterativeEstimator (to expose iterative algorithms)
* LabeledPoint (Q: Should this include an instance weight?)

Items remaining:
- [ ] helper method for simulating a distribution over weighted instances by 
subsampling (for algorithms which do not support instance weights)
- [ ] several TODO items noted in the code
- [ ] add tests and examples
- [ ] general cleanup
- [ ] make more of hierarchy private to ml
- [ ] split into several smaller PRs

General plan for splitting into multiple PRs, in order:
1. Simple class hierarchy
2. Evaluators
3. IterativeEstimator
4. AdaBoost
5. NaiveBayes (Any time after Evaluators)

Thanks to @epahomov and @BigCrunsh for input, including from 
[https://github.com/apache/spark/pull/2137] which improves upon the 
org.apache.spark.mllib APIs.


 Standardize MLlib classes for learners, models
 --

 Key: SPARK-3702
 URL: https://issues.apache.org/jira/browse/SPARK-3702
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley
Priority: Blocker

 Summary: Create a class hierarchy for learning algorithms and the models 
 those algorithms produce.
 This is a super-task of several sub-tasks (but JIRA does not allow subtasks 
 of subtasks).  See the requires links below for subtasks.
 Goals:
 * give intuitive structure to API, both for developers and for generated 
 documentation
 * support meta-algorithms (e.g., boosting)
 * support generic functionality (e.g., evaluation)
 * reduce code duplication across classes
 [Design doc for class hierarchy | 
 https://docs.google.com/document/d/1I-8PD0DSLEZzzXURYZwmqAFn_OMBc08hgDL1FZnVBmw/]
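
 For readers skimming the thread, a toy sketch of the learner/model split being discussed (names 
 are invented here and do not match the org.apache.spark.ml classes; see the design doc above for 
 the real proposal):
 {code}
 import org.apache.spark.rdd.RDD

 // Toy shapes only -- not the org.apache.spark.ml API.
 case class LabeledPointSketch(label: Double, features: Array[Double])

 abstract class ModelSketch extends Serializable {
   def predict(features: Array[Double]): Double
   def predict(data: RDD[Array[Double]]): RDD[Double] = data.map(x => predict(x))
 }

 abstract class LearnerSketch[M <: ModelSketch] {
   // A meta-algorithm (e.g. boosting) can be written against this single method.
   def train(data: RDD[LabeledPointSketch]): M
 }
 {code}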



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4838) StackOverflowError when serialization task

2014-12-15 Thread Hong Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247629#comment-14247629
 ] 

Hong Shen commented on SPARK-4838:
--

Here is the SQL; it contains 2928 partitions.
{code:title=Bar.java|borderStyle=solid}
SELECT
DISTINCT a.uin
FROM
(
SELECT
DISTINCT uin AS uin
FROM
hlw :: t_dw_pf00135 a1
WHERE
imp_date BETWEEN 2014081008 AND 2014121007
AND (
oper1 LIKE '%设置气泡%'
AND oper2 <> '0'
)
) a
{code}

Here is the full stack.
Error message from Spark is: Job aborted due to stage failure: Task 
serialization failed: java.lang.StackOverflowError
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)

[jira] [Commented] (SPARK-4844) SGD should support custom sampling.

2014-12-15 Thread Guoqiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247652#comment-14247652
 ] 

Guoqiang Li commented on SPARK-4844:


Sorry, I meant that all of the data needs to be serialized in {{RDD.sample}}; it is 
very inefficient. 

 SGD should support custom sampling.
 ---

 Key: SPARK-4844
 URL: https://issues.apache.org/jira/browse/SPARK-4844
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Guoqiang Li
 Fix For: 1.3.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4838) StackOverflowError when serialization task

2014-12-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247662#comment-14247662
 ] 

Sean Owen commented on SPARK-4838:
--

[~shenhong] This stack trace is very large but does not show any new 
information. What I meant was, is there anything different at the root? or was 
it not present in your logs? Obviously the problem is a serialization graph 
that is way too deeply nested, so more copies of these 5 lines aren't needed, 
but it might help to show where the call originated.

 StackOverflowError when serialization task
 --

 Key: SPARK-4838
 URL: https://issues.apache.org/jira/browse/SPARK-4838
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Affects Versions: 1.1.0
Reporter: Hong Shen

 When running a SQL query with more than 2000 partitions, each partition being a 
 HadoopRDD, it will cause a java.lang.StackOverflowError when serializing the task.
  Error message from Spark is: Job aborted due to stage failure: Task 
 serialization failed: java.lang.StackOverflowError
 java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
 java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
 java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
 java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
 java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
 java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
 java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
 ..



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4855) Python tests for hypothesis testing

2014-12-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247665#comment-14247665
 ] 

Apache Spark commented on SPARK-4855:
-

User 'jbencook' has created a pull request for this issue:
https://github.com/apache/spark/pull/3679

 Python tests for hypothesis testing
 ---

 Key: SPARK-4855
 URL: https://issues.apache.org/jira/browse/SPARK-4855
 Project: Spark
  Issue Type: Test
  Components: MLlib, PySpark
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley
Assignee: Ben Cook
Priority: Minor

 Add Python unit tests for Chi-Squared hypothesis testing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4838) StackOverflowError when serialization task

2014-12-15 Thread Hong Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247676#comment-14247676
 ] 

Hong Shen commented on SPARK-4838:
--

This is the whole stack.
All we can know is that it is thrown from DAGScheduler.submitMissingTasks, when 
serializing stage.rdd.
{code}
var taskBinary: Broadcast[Array[Byte]] = null
try {
  // For ShuffleMapTask, serialize and broadcast (rdd, shuffleDep).
  // For ResultTask, serialize and broadcast (rdd, func).
  val taskBinaryBytes: Array[Byte] =
    if (stage.isShuffleMap) {
      closureSerializer.serialize((stage.rdd, stage.shuffleDep.get) : AnyRef).array()
    } else {
      closureSerializer.serialize((stage.rdd, stage.resultOfJob.get.func) : AnyRef).array()
    }
  taskBinary = sc.broadcast(taskBinaryBytes)
} catch {
  // In the case of a failure during serialization, abort the stage.
  case e: NotSerializableException =>
    abortStage(stage, "Task not serializable: " + e.toString)
    runningStages -= stage
    return
  case NonFatal(e) =>
    abortStage(stage, s"Task serialization failed: $e\n${e.getStackTraceString}")
    runningStages -= stage
    return
}
{code}
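
To see the failure mode in isolation (a standalone sketch, nothing Spark-specific): Java 
serialization walks the object graph recursively, so a graph that nests a few thousand levels 
deep, like a plan built from thousands of chained or unioned RDDs, blows the stack; the exact 
depth needed depends on the JVM's thread stack size.
{code}
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

object DeepGraphSketch {
  case class Node(child: Option[Node])

  def main(args: Array[String]): Unit = {
    // Build a linked structure nested ~10000 levels deep (iteratively, no recursion here).
    val deep = (1 to 10000).foldLeft(Option.empty[Node])((child, _) => Some(Node(child)))
    val oos = new ObjectOutputStream(new ByteArrayOutputStream())
    // Recursing through writeObject0/defaultWriteFields for each level
    // eventually throws java.lang.StackOverflowError.
    oos.writeObject(Node(deep))
  }
}
{code}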


 StackOverflowError when serialization task
 --

 Key: SPARK-4838
 URL: https://issues.apache.org/jira/browse/SPARK-4838
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Affects Versions: 1.1.0
Reporter: Hong Shen

 When run a sql with more than 2000 partitions, each partition a  HadoopRDD, 
 it will cause java.lang.StackOverflowError at serialize task.
  Error message from spark is:Job aborted due to stage failure: Task 
 serialization failed: java.lang.StackOverflowError
 java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
 java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
 java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
 java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
 java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
 java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
 java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
 ..



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4858) Add an option to turn off a progress bar in spark-shell

2014-12-15 Thread Takeshi Yamamuro (JIRA)
Takeshi Yamamuro created SPARK-4858:
---

 Summary: Add an option to turn off a progress bar in spark-shell
 Key: SPARK-4858
 URL: https://issues.apache.org/jira/browse/SPARK-4858
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core, Spark Shell
Reporter: Takeshi Yamamuro
Priority: Minor


Add an '--no-progress-bar' option to easily turn off a progress bar in 
spark-shell for users who'd like to look into debug logs or something.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4858) Add an option to turn off a progress bar in spark-shell

2014-12-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247722#comment-14247722
 ] 

Apache Spark commented on SPARK-4858:
-

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/3709

 Add an option to turn off a progress bar in spark-shell
 ---

 Key: SPARK-4858
 URL: https://issues.apache.org/jira/browse/SPARK-4858
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core, Spark Shell
Reporter: Takeshi Yamamuro
Priority: Minor

 Add an '--no-progress-bar' option to easily turn off a progress bar in 
 spark-shell for users who'd like to look into debug logs or something.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4859) Improve StreamingListenerBus

2014-12-15 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-4859:
---

 Summary: Improve StreamingListenerBus
 Key: SPARK-4859
 URL: https://issues.apache.org/jira/browse/SPARK-4859
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Reporter: Shixiong Zhu
Priority: Minor


Fix the race condition of `queueFullErrorMessageLogged`.
Log the error from listener rather than crashing `listenerThread`.
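
Roughly, the two changes amount to something like the following sketch (illustrative names only, 
not the actual StreamingListenerBus code):
{code}
import java.util.concurrent.atomic.AtomicBoolean

// Illustrative sketch only.
class ListenerBusSketch[E](listeners: Seq[E => Unit]) {
  private val queueFullErrorMessageLogged = new AtomicBoolean(false)

  // compareAndSet makes the "log this only once" check race-free.
  def onQueueFull(): Unit =
    if (queueFullErrorMessageLogged.compareAndSet(false, true)) {
      System.err.println("Dropping event because the listener queue is full")
    }

  // Catch per-listener failures so one bad listener cannot kill the dispatch thread.
  def postToAll(event: E): Unit =
    listeners.foreach { listener =>
      try listener(event)
      catch {
        case e: Exception => System.err.println(s"Listener threw an exception: $e")
      }
    }
}
{code}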



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4859) Improve StreamingListenerBus

2014-12-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247801#comment-14247801
 ] 

Apache Spark commented on SPARK-4859:
-

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/3710

 Improve StreamingListenerBus
 

 Key: SPARK-4859
 URL: https://issues.apache.org/jira/browse/SPARK-4859
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Reporter: Shixiong Zhu
Priority: Minor

 Fix the race condition of `queueFullErrorMessageLogged`.
 Log the error from listener rather than crashing `listenerThread`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4792) Add some checks and messages on making local dir

2014-12-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-4792.
---
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 3635
[https://github.com/apache/spark/pull/3635]

 Add some checks and messages on making local dir
 

 Key: SPARK-4792
 URL: https://issues.apache.org/jira/browse/SPARK-4792
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: meiyoula
Priority: Minor
 Fix For: 1.3.0


 There is currently no check on whether creating the local dir succeeds; if 
 creation fails, an error/warning message should be logged. The code should also 
 check whether the dir already exists before trying to create it.
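 
 As a rough illustration, here is a minimal sketch of those checks, not the 
 actual Spark code; the helper name and logging below are hypothetical.
 {code}
 import java.io.File
 
 object LocalDirSketch {
   // Skip creation if the dir already exists, and warn when mkdirs() fails
   // instead of silently continuing.
   def createLocalDir(path: String): Option[File] = {
     val dir = new File(path)
     if (dir.isDirectory) {
       Some(dir)                // already exists, nothing to create
     } else if (dir.mkdirs()) {
       Some(dir)                // created successfully
     } else {
       System.err.println(s"Failed to create local dir: $path")
       None
     }
   }
 }
 {code}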



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4792) Add some checks and messages on making local dir

2014-12-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-4792:
--
Assignee: meiyoula

 Add some checks and messages on making local dir
 

 Key: SPARK-4792
 URL: https://issues.apache.org/jira/browse/SPARK-4792
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: meiyoula
Assignee: meiyoula
Priority: Minor
 Fix For: 1.3.0


 There is currently no check on whether creating the local dir succeeds; if 
 creation fails, an error/warning message should be logged. The code should also 
 check whether the dir already exists before trying to create it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4841) Batch serializer bug in PySpark's RDD.zip

2014-12-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-4841:
--
Fix Version/s: 1.3.0
   Labels: backport-needed  (was: )

I've merged https://github.com/apache/spark/pull/3706 into master; adding a 
{{backport-needed}} label so that this makes it into 1.2.1.

 Batch serializer bug in PySpark's RDD.zip
 -

 Key: SPARK-4841
 URL: https://issues.apache.org/jira/browse/SPARK-4841
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.2.0
Reporter: Xiangrui Meng
Assignee: Davies Liu
Priority: Blocker
  Labels: backport-needed
 Fix For: 1.3.0


 {code}
 t = sc.textFile("README.md")
 t.zip(t).count()
 {code}
 {code}
 Py4JJavaError Traceback (most recent call last)
 <ipython-input-6-60fdeb8339fd> in <module>()
 ----> 1 readme.zip(readme).count()
 /Users/meng/src/spark/python/pyspark/rdd.pyc in count(self)
 817 3
 818 
 --> 819 return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
 820
 821 def stats(self):
 /Users/meng/src/spark/python/pyspark/rdd.pyc in sum(self)
 808 6.0
 809 
 --> 810 return self.mapPartitions(lambda x: 
 [sum(x)]).reduce(operator.add)
 811
 812 def count(self):
 /Users/meng/src/spark/python/pyspark/rdd.pyc in reduce(self, f)
 713 yield reduce(f, iterator, initial)
 714
 --> 715 vals = self.mapPartitions(func).collect()
 716 if vals:
 717 return reduce(f, vals)
 /Users/meng/src/spark/python/pyspark/rdd.pyc in collect(self)
 674 
 675 with SCCallSiteSync(self.context) as css:
 --> 676 bytesInJava = self._jrdd.collect().iterator()
 677 return list(self._collect_iterator_through_file(bytesInJava))
 678
 /Users/meng/src/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in 
 __call__(self, *args)
 536 answer = self.gateway_client.send_command(command)
 537 return_value = get_return_value(answer, self.gateway_client,
 --> 538 self.target_id, self.name)
 539
 540 for temp_arg in temp_args:
 /Users/meng/src/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in 
 get_return_value(answer, gateway_client, target_id, name)
 298 raise Py4JJavaError(
 299 'An error occurred while calling {0}{1}{2}.\n'.
 --> 300 format(target_id, '.', name), value)
 301 else:
 302 raise Py4JError(
 Py4JJavaError: An error occurred while calling o69.collect.
 : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 
 (TID 2, localhost): org.apache.spark.api.python.PythonException: Traceback 
 (most recent call last):
   File /Users/meng/src/spark/python/pyspark/worker.py, line 107, in main
 process()
   File /Users/meng/src/spark/python/pyspark/worker.py, line 98, in process
 serializer.dump_stream(func(split_index, iterator), outfile)
   File /Users/meng/src/spark/python/pyspark/serializers.py, line 198, in 
 dump_stream
 self.serializer.dump_stream(self._batched(iterator), stream)
   File /Users/meng/src/spark/python/pyspark/serializers.py, line 81, in 
 dump_stream
 raise NotImplementedError
 NotImplementedError
   at 
 org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:137)
   at 
 org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:174)
   at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:96)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
   at 
 org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
   at 
 org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:242)
   at 
 org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
   at 
 org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1460)
   at 
 org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:203)
 Driver stacktrace:
   at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214)
   at 
 

[jira] [Commented] (SPARK-4790) Flaky test in ReceivedBlockTrackerSuite: block addition, block to batch allocation, and cleanup with write ahead log

2014-12-15 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247900#comment-14247900
 ] 

Josh Rosen commented on SPARK-4790:
---

/cc [~hshreedharan], you might want to take a look at this issue, too.  This 
test seems to fail intermittently on Jenkins.  Since this is a new test, I 
think we should fix its flakiness in its own PR.

 Flaky test in ReceivedBlockTrackerSuite: block addition, block to batch 
 allocation, and cleanup with write ahead log
 --

 Key: SPARK-4790
 URL: https://issues.apache.org/jira/browse/SPARK-4790
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.3.0
Reporter: Josh Rosen
Assignee: Tathagata Das
  Labels: flaky-test

 Found another flaky streaming test, 
 org.apache.spark.streaming.ReceivedBlockTrackerSuite.block addition, block 
 to batch allocation and cleanup with write ahead log:
 {code}
 Error Message
 File /tmp/1418069118106-0/receivedBlockMetadata/log-0-1000 does not exist.
 Stacktrace
 sbt.ForkMain$ForkError: File 
 /tmp/1418069118106-0/receivedBlockMetadata/log-0-1000 does not exist.
   at 
 org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:397)
   at 
 org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:324)
   at 
 org.apache.spark.streaming.util.WriteAheadLogSuite$.getLogFilesInDirectory(WriteAheadLogSuite.scala:344)
   at 
 org.apache.spark.streaming.ReceivedBlockTrackerSuite.getWriteAheadLogFiles(ReceivedBlockTrackerSuite.scala:248)
   at 
 org.apache.spark.streaming.ReceivedBlockTrackerSuite$$anonfun$4.apply$mcV$sp(ReceivedBlockTrackerSuite.scala:173)
   at 
 org.apache.spark.streaming.ReceivedBlockTrackerSuite$$anonfun$4.apply(ReceivedBlockTrackerSuite.scala:96)
   at 
 org.apache.spark.streaming.ReceivedBlockTrackerSuite$$anonfun$4.apply(ReceivedBlockTrackerSuite.scala:96)
   at 
 org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
   at org.scalatest.Transformer.apply(Transformer.scala:22)
   at org.scalatest.Transformer.apply(Transformer.scala:20)
   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
   at 
 org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
   at 
 org.apache.spark.streaming.ReceivedBlockTrackerSuite.org$scalatest$BeforeAndAfter$$super$runTest(ReceivedBlockTrackerSuite.scala:41)
   at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200)
   at 
 org.apache.spark.streaming.ReceivedBlockTrackerSuite.runTest(ReceivedBlockTrackerSuite.scala:41)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
   at 
 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
   at 
 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
   at 
 org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
   at org.scalatest.Suite$class.run(Suite.scala:1424)
   at 
 org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
   at 
 org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
   at 
 org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
   at 
 org.apache.spark.streaming.ReceivedBlockTrackerSuite.org$scalatest$BeforeAndAfter$$super$run(ReceivedBlockTrackerSuite.scala:41)
   at org.scalatest.BeforeAndAfter$class.run(BeforeAndAfter.scala:241)
   at