[jira] [Created] (SPARK-4849) Pass partitioning information (distribute by) to In-memory caching

2014-12-15 Thread Nitin Goyal (JIRA)
Nitin Goyal created SPARK-4849:
--

 Summary: Pass partitioning information (distribute by) to 
In-memory caching
 Key: SPARK-4849
 URL: https://issues.apache.org/jira/browse/SPARK-4849
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.2.0
Reporter: Nitin Goyal
Priority: Minor









[jira] [Updated] (SPARK-4849) Pass partitioning information (distribute by) to In-memory caching

2014-12-15 Thread Nitin Goyal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nitin Goyal updated SPARK-4849:
---
Description: 
HQL's DISTRIBUTE BY column_name partitions data based on the values of the 
specified column. We can pass this information to in-memory caching for further 
performance improvements, e.g. in joins, where an extra partitioning step can be 
skipped based on this information.

Refer - 
http://apache-spark-user-list.1001560.n3.nabble.com/SchemaRDD-partition-on-specific-column-values-td20350.html
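
For illustration, a minimal sketch of the kind of usage this would help (the table and column names here are hypothetical, and a HiveContext is assumed so that DISTRIBUTE BY is available):

```
// Hypothetical sketch, not part of the proposal itself: cache a table that was
// repartitioned with DISTRIBUTE BY, hoping the planner could reuse that partitioning.
val hc = new org.apache.spark.sql.hive.HiveContext(sc)
hc.sql("SELECT * FROM src DISTRIBUTE BY key").registerTempTable("src_by_key")
hc.cacheTable("src_by_key")
// A later join on `key` could, in principle, skip an extra exchange step
// if the in-memory relation reported its partitioning to the planner.
hc.sql("SELECT a.key, b.value FROM src_by_key a JOIN other b ON a.key = b.key")
```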

 Pass partitioning information (distribute by) to In-memory caching
 --

 Key: SPARK-4849
 URL: https://issues.apache.org/jira/browse/SPARK-4849
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.2.0
Reporter: Nitin Goyal
Priority: Minor

 HQL's DISTRIBUTE BY column_name partitions data based on the values of the 
 specified column. We can pass this information to in-memory caching for further 
 performance improvements, e.g. in joins, where an extra partitioning step can be 
 skipped based on this information.
 Refer - 
 http://apache-spark-user-list.1001560.n3.nabble.com/SchemaRDD-partition-on-specific-column-values-td20350.html






[jira] [Created] (SPARK-4850) GROUP BY can't work if the schema of SchemaRDD contains struct or array type

2014-12-15 Thread Chaozhong Yang (JIRA)
Chaozhong Yang created SPARK-4850:
-

 Summary: GROUP BY can't work if the schema of SchemaRDD contains 
struct or array type
 Key: SPARK-4850
 URL: https://issues.apache.org/jira/browse/SPARK-4850
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0, 1.1.1, 1.2.0, 1.1.2
Reporter: Chaozhong Yang


In the Spark shell:

```
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val path = "path/to/json"
sqlContext.jsonFile(path).registerTempTable("Table")
val t = sqlContext.sql("select * from Table group by a")
t.collect()
```
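
For contrast, a query that only references the grouped column (or aggregates over the other columns) passes analysis; this is a minimal hypothetical sketch, not from the original report:

```
// Sketch only: referencing just the grouped column and an aggregate works.
val ok = sqlContext.sql("SELECT a, COUNT(*) FROM Table GROUP BY a")
ok.collect()
```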

Let's look into the schema of `Table`:

root
 |-- a: integer (nullable = true)
 |-- arr: array (nullable = true)
 |    |-- element: integer (containsNull = false)
 |-- createdAt: string (nullable = true)
 |-- f: struct (nullable = true)
 |    |-- __type: string (nullable = true)
 |    |-- className: string (nullable = true)
 |    |-- objectId: string (nullable = true)
 |-- objectId: string (nullable = true)
 |-- s: string (nullable = true)
 |-- updatedAt: string (nullable = true)

An exception will be thrown:

```

org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Expression not 
in GROUP BY: arr#9, tree:
Aggregate [a#8], [a#8,arr#9,createdAt#10,f#11,objectId#12,s#13,updatedAt#14]
 Subquery TestImport
  LogicalRDD [a#8,arr#9,createdAt#10,f#11,objectId#12,s#13,updatedAt#14], 
MappedRDD[18] at map at JsonRDD.scala:47

at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3$$anonfun$applyOrElse$7.apply(Analyzer.scala:126)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3$$anonfun$applyOrElse$7.apply(Analyzer.scala:125)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3.applyOrElse(Analyzer.scala:125)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3.applyOrElse(Analyzer.scala:108)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$.apply(Analyzer.scala:108)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$.apply(Analyzer.scala:106)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59)
at 
scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
at 
scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51)
at scala.collection.immutable.List.foreach(List.scala:318)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51)
at 
org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:411)
at 
org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:411)
at 
org.apache.spark.sql.SQLContext$QueryExecution.withCachedData$lzycompute(SQLContext.scala:412)
at 
org.apache.spark.sql.SQLContext$QueryExecution.withCachedData(SQLContext.scala:412)
at 
org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan$lzycompute(SQLContext.scala:413)
at 
org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan(SQLContext.scala:413)
at 
org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:418)
at 
org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:416)
at 
org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:422)
at 
org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:422)
at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:444)
at $iwC$$iwC$$iwC$$iwC.init(console:17)
at $iwC$$iwC$$iwC.init(console:22)
at $iwC$$iwC.init(console:24)
at $iwC.init(console:26)
at init(console:28)
at .init(console:32)
at .clinit(console)
at .init(console:7)
at .clinit(console)
at $print(console)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 

[jira] [Updated] (SPARK-4850) GROUP BY can't work if the schema of SchemaRDD contains struct or array type

2014-12-15 Thread Chaozhong Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chaozhong Yang updated SPARK-4850:
--
Description: 
Code in the Spark shell:

```
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val path = "path/to/json"
sqlContext.jsonFile(path).registerTempTable("Table")
val t = sqlContext.sql("select * from Table group by a")
t.collect()
```

Let's look into the schema of `Table`:

root
 |-- a: integer (nullable = true)
 |-- arr: array (nullable = true)
 |    |-- element: integer (containsNull = false)
 |-- createdAt: string (nullable = true)
 |-- f: struct (nullable = true)
 |    |-- __type: string (nullable = true)
 |    |-- className: string (nullable = true)
 |    |-- objectId: string (nullable = true)
 |-- objectId: string (nullable = true)
 |-- s: string (nullable = true)
 |-- updatedAt: string (nullable = true)

An exception will be thrown:

```

org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Expression not 
in GROUP BY: arr#9, tree:
Aggregate [a#8], [a#8,arr#9,createdAt#10,f#11,objectId#12,s#13,updatedAt#14]
 Subquery TestImport
  LogicalRDD [a#8,arr#9,createdAt#10,f#11,objectId#12,s#13,updatedAt#14], 
MappedRDD[18] at map at JsonRDD.scala:47

at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3$$anonfun$applyOrElse$7.apply(Analyzer.scala:126)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3$$anonfun$applyOrElse$7.apply(Analyzer.scala:125)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3.applyOrElse(Analyzer.scala:125)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3.applyOrElse(Analyzer.scala:108)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$.apply(Analyzer.scala:108)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$.apply(Analyzer.scala:106)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59)
at 
scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
at 
scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51)
at scala.collection.immutable.List.foreach(List.scala:318)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51)
at 
org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:411)
at 
org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:411)
at 
org.apache.spark.sql.SQLContext$QueryExecution.withCachedData$lzycompute(SQLContext.scala:412)
at 
org.apache.spark.sql.SQLContext$QueryExecution.withCachedData(SQLContext.scala:412)
at 
org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan$lzycompute(SQLContext.scala:413)
at 
org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan(SQLContext.scala:413)
at 
org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:418)
at 
org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:416)
at 
org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:422)
at 
org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:422)
at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:444)
at $iwC$$iwC$$iwC$$iwC.init(console:17)
at $iwC$$iwC$$iwC.init(console:22)
at $iwC$$iwC.init(console:24)
at $iwC.init(console:26)
at init(console:28)
at .init(console:32)
at .clinit(console)
at .init(console:7)
at .clinit(console)
at $print(console)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at 
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:852)
at 

[jira] [Updated] (SPARK-4740) Netty's network throughput is about 1/2 of NIO's in spark-perf sortByKey

2014-12-15 Thread Zhang, Liye (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhang, Liye updated SPARK-4740:
---
Attachment: repartition test.7z

Hi [~rxin], [~adav], I ran several tests on HDDs and on ramdisk with 
*repartition(192)* to compare shuffle performance between NIO and Netty, using the 
same dataset as before (400GB). I uploaded the archive *repartition test.7z*, which 
contains the results of 6 tests:
1. NIO on ramdisk
2. NIO on HDDs
3. Netty on ramdisk with connectionPerPeer set to 1
4. Netty on ramdisk with connectionPerPeer set to 8
5. Netty on HDDs with connectionPerPeer set to 1
6. Netty on HDDs with connectionPerPeer set to 8
P.S. In the attached HTML files, the unit of IO throughput is requests rather than bytes.

From the 6 tests, it is clear that reduce performance improves a lot when 
*connectionPerPeer* is raised from 1 to 8, both on ramdisk and on HDDs.

For HDDs, the reduce time of Netty with *connectionPerPeer=8* is about the same as 
NIO's (about 6.7 mins).

For ramdisk, Netty outperforms NIO even with *connectionPerPeer=1*. That is because 
NIO is memory-bandwidth bound, which I have confirmed with other tools; that is also 
why the CPU utilization of NIO in the reduce phase is only about 50%. Netty can still 
gain some performance by increasing *connectionPerPeer*. This is expected because NIO 
needs extra memory copies compared to Netty.

Before these 6 tests, I monitored the IO with *iostat* for the HDD case. With 
*connectionPerPeer* kept at its default (1), Netty's read request queue size, read 
requests, await and %util are all smaller than NIO's, which means Netty's read 
parallelism is not being fully exploited.

Till now, we can confirm that Netty does not get good read concurrency on a small 
cluster with many disks (if *connectionPerPeer* is not set), but we still cannot 
conclude that Netty can run faster than NIO on HDDs.
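
For reference, a minimal sketch of how such a Netty run can be configured; this is only an assumption about how the tests were set up, with the connection count matching the new option mentioned later in this digest:

```
// Sketch only: select the Netty transfer service and raise the per-peer connection count.
val conf = new org.apache.spark.SparkConf()
  .set("spark.shuffle.blockTransferService", "netty")
  .set("spark.shuffle.io.numConnectionsPerPeer", "8")
```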

 Netty's network throughput is about 1/2 of NIO's in spark-perf sortByKey
 

 Key: SPARK-4740
 URL: https://issues.apache.org/jira/browse/SPARK-4740
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle, Spark Core
Affects Versions: 1.2.0
Reporter: Zhang, Liye
Assignee: Reynold Xin
 Attachments: (rxin patch better executor)TestRunner  sort-by-key - 
 Thread dump for executor 3_files.zip, (rxin patch normal executor)TestRunner  
 sort-by-key - Thread dump for executor 0 _files.zip, Spark-perf Test Report 
 16 Cores per Executor.pdf, Spark-perf Test Report.pdf, TestRunner  
 sort-by-key - Thread dump for executor 1_files (Netty-48 Cores per node).zip, 
 TestRunner  sort-by-key - Thread dump for executor 1_files (Nio-48 cores per 
 node).zip, repartition test.7z, 
 rxin_patch-on_4_node_cluster_48CoresPerNode(Unbalance).7z


 When testing the current Spark master (1.3.0-snapshot) with spark-perf 
 (sort-by-key, aggregate-by-key, etc.), the Netty-based shuffle transfer service 
 takes much longer than the NIO-based one. The network throughput of Netty is only 
 about half that of NIO. 
 We tested in standalone mode; the data set used for the test is 20 billion 
 records, about 400GB in total. The spark-perf test runs on a 4-node cluster with 
 10G NICs, 48 CPU cores per node and 64GB of memory per executor. The number of 
 reduce tasks is set to 1000. 






[jira] [Commented] (SPARK-4845) Adding a parallelismRatio to control the partitions num of shuffledRDD

2014-12-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14246655#comment-14246655
 ] 

Sean Owen commented on SPARK-4845:
--

I'm interested in the motivation for this. Mapping down to fewer partitions 
will increase the amount of shuffling, right? When is this preferable to 
simply repartitioning directly?

 Adding a parallelismRatio to control the partitions num of shuffledRDD
 --

 Key: SPARK-4845
 URL: https://issues.apache.org/jira/browse/SPARK-4845
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: wangfei
 Fix For: 1.3.0


 Add a parallelismRatio to control the number of partitions of a ShuffledRDD. The 
 rule is:
  Math.max(1, parallelismRatio * number of partitions of the largest upstream 
 RDD)
 The ratio defaults to 1.0 to stay compatible with the old behavior. 
 Once we have more experience with it, we can change the default.
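
A minimal sketch of the rule quoted above (the helper name is hypothetical, not the actual patch):

{code}
// Sketch only: number of shuffle partitions derived from the largest upstream RDD.
def defaultShufflePartitions(parallelismRatio: Double,
                             upstream: Seq[org.apache.spark.rdd.RDD[_]]): Int =
  math.max(1, (parallelismRatio * upstream.map(_.partitions.length).max).toInt)
{code}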






[jira] [Commented] (SPARK-4844) SGD should support custom sampling.

2014-12-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14246657#comment-14246657
 ] 

Sean Owen commented on SPARK-4844:
--

Hm, in what case would you want to not sample the minibatch uniformly at random 
in SGD?

 SGD should support custom sampling.
 ---

 Key: SPARK-4844
 URL: https://issues.apache.org/jira/browse/SPARK-4844
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Guoqiang Li
 Fix For: 1.3.0









[jira] [Commented] (SPARK-4846) When the vocabulary size is large, Word2Vec may yield OutOfMemoryError: Requested array size exceeds VM limit

2014-12-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14246665#comment-14246665
 ] 

Sean Owen commented on SPARK-4846:
--

I think you're just running out of memory on your driver. It does not have 
enough memory to copy and serialize two data structures, syn0Global and 
syn1Global, each of which contains (vocab size * vector length) floats. With the 
default vector length of 100 and a 10M-word vocabulary, that's at least 8GB of 
RAM, and the default driver memory isn't nearly that big.

I think this is just a matter of increasing your driver memory; I imagine you 
will need 16GB+.
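
Back-of-the-envelope version of that estimate (assumptions: ~10M-word vocabulary, vector length 100, 4 bytes per Float):

```
// Rough sketch of the size estimate above, not code from Word2Vec itself:
val bytesPerArray = 10000000L * 100 * 4   // one of syn0Global / syn1Global: ~4 GB
val totalBytes    = 2 * bytesPerArray     // both arrays: ~8 GB before serialization overhead
```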

 When the vocabulary size is large, Word2Vec may yield OutOfMemoryError: 
 Requested array size exceeds VM limit
 ---

 Key: SPARK-4846
 URL: https://issues.apache.org/jira/browse/SPARK-4846
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.1.0
 Environment: Using Word2Vec to process a corpus (about 3.5G in size) with one 
 partition.
 The corpus contains about 300 million words and its vocabulary size is about 
 10 million.
Reporter: Joseph Tang
Priority: Critical

 Exception in thread Driver java.lang.reflect.InvocationTargetException
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at 
 org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162)
 Caused by: java.lang.OutOfMemoryError: Requested array size exceeds VM limit 
 at java.util.Arrays.copyOf(Arrays.java:2271)
 at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
 at 
 java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
 at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
 at 
 java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1870)
 at 
 java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1779)
 at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1186)
 at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
 at 
 org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
 at 
 org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73)
 at 
 org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164)
 at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
 at org.apache.spark.SparkContext.clean(SparkContext.scala:1242)
 at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:610)
 at 
 org.apache.spark.mllib.feature.Word2Vec$$anonfun$fit$1.apply$mcVI$sp(Word2Vec.scala:291)
 at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
 at org.apache.spark.mllib.feature.Word2Vec.fit(Word2Vec.scala:290)






[jira] [Commented] (SPARK-4846) When the vocabulary size is large, Word2Vec may yield OutOfMemoryError: Requested array size exceeds VM limit

2014-12-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14246732#comment-14246732
 ] 

Sean Owen commented on SPARK-4846:
--

But being lazy doesn't really change whether it is serialized, right? One way 
or the other, the recipients of the higher-order function have to get the same 
data. The function does use the data structures; it's not a question of simply 
keeping something out of the closure that shouldn't be there.

Is the problem that only part of this large data structure should go to each 
partition?
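
One direction this discussion points at, purely as a hypothetical sketch and not the current Word2Vec code: ship the arrays as broadcast variables rather than inside the task closure, so each executor fetches chunked copies instead of the driver serializing one huge closure (syn0Global and syn1Global are the structures named above):

```
// Hypothetical sketch only: broadcast the big arrays instead of closure-capturing them.
val syn0Bc = sc.broadcast(syn0Global)
val syn1Bc = sc.broadcast(syn1Global)
// tasks would then read syn0Bc.value / syn1Bc.value
```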

 When the vocabulary size is large, Word2Vec may yield OutOfMemoryError: 
 Requested array size exceeds VM limit
 ---

 Key: SPARK-4846
 URL: https://issues.apache.org/jira/browse/SPARK-4846
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.1.0
 Environment: Using Word2Vec to process a corpus (about 3.5G in size) with one 
 partition.
 The corpus contains about 300 million words and its vocabulary size is about 
 10 million.
Reporter: Joseph Tang
Priority: Critical

 Exception in thread Driver java.lang.reflect.InvocationTargetException
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at 
 org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162)
 Caused by: java.lang.OutOfMemoryError: Requested array size exceeds VM limit 
 at java.util.Arrays.copyOf(Arrays.java:2271)
 at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
 at 
 java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
 at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
 at 
 java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1870)
 at 
 java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1779)
 at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1186)
 at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
 at 
 org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
 at 
 org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73)
 at 
 org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164)
 at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
 at org.apache.spark.SparkContext.clean(SparkContext.scala:1242)
 at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:610)
 at 
 org.apache.spark.mllib.feature.Word2Vec$$anonfun$fit$1.apply$mcVI$sp(Word2Vec.scala:291)
 at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
 at org.apache.spark.mllib.feature.Word2Vec.fit(Word2Vec.scala:290)






[jira] [Commented] (SPARK-4844) SGD should support custom sampling.

2014-12-15 Thread Guoqiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14246738#comment-14246738
 ] 

Guoqiang Li commented on SPARK-4844:


The main reason is that {{RDD.sample}} is not efficient: it loads all the data 
into memory.
See https://github.com/witgo/spark/compare/SPARK-4844 

 SGD should support custom sampling.
 ---

 Key: SPARK-4844
 URL: https://issues.apache.org/jira/browse/SPARK-4844
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Guoqiang Li
 Fix For: 1.3.0









[jira] [Commented] (SPARK-4844) SGD should support custom sampling.

2014-12-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14246749#comment-14246749
 ] 

Sean Owen commented on SPARK-4844:
--

No, it definitely does not. See {{PartitionwiseSampledRDD}}, and how it uses 
{{BernoulliSampler}} and {{PoissonSampler}}, which are already pluggable if you 
want. They use gap-sampling iterators. If that's the only change, I would close 
this. The PR reinvents some of the classes above.
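
A minimal sketch of the point being made, using only the standard API: {{RDD.sample}} is lazy and samples per partition via {{PartitionwiseSampledRDD}}, so no full in-memory materialization is needed:

```
// Sketch only: drawing a minibatch lazily, per partition, with the built-in sampler.
val data = sc.parallelize(1 to 1000000)
val miniBatch = data.sample(withReplacement = false, fraction = 0.01, seed = 42L)
miniBatch.count()
```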

 SGD should support custom sampling.
 ---

 Key: SPARK-4844
 URL: https://issues.apache.org/jira/browse/SPARK-4844
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Guoqiang Li
 Fix For: 1.3.0









[jira] [Commented] (SPARK-4547) OOM when making bins in BinaryClassificationMetrics

2014-12-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14246794#comment-14246794
 ] 

Apache Spark commented on SPARK-4547:
-

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/3702

 OOM when making bins in BinaryClassificationMetrics
 ---

 Key: SPARK-4547
 URL: https://issues.apache.org/jira/browse/SPARK-4547
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.1.0
Reporter: Sean Owen
Assignee: Sean Owen
Priority: Minor

 Also following up on 
 http://mail-archives.apache.org/mod_mbox/spark-dev/201411.mbox/%3CCAMAsSdK4s4TNkf3_ecLC6yD-pLpys_PpT3WB7Tp6=yoxuxf...@mail.gmail.com%3E
  -- this one I intend to make a PR for a bit later. The conversation was 
 basically:
 {quote}
 Recently I was using BinaryClassificationMetrics to build an AUC curve for a 
 classifier over a reasonably large number of points (~12M). The scores were 
 all probabilities, so tended to be almost entirely unique.
 The computation does some operations by key, and this ran out of memory. It's 
 something you can solve with more than the default amount of memory, but in 
 this case it did not seem useful to create an AUC curve with such fine-grained 
 resolution.
 I ended up just binning the scores so there were ~1000 unique values
 and then it was fine.
 {quote}
 and:
 {quote}
 Yes, if there are many distinct values, we need binning to compute the AUC 
 curve. Usually the scores are not evenly distributed, so we cannot simply 
 truncate the digits. Estimating the quantiles for binning is necessary, 
 similar to RangePartitioner:
 https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/Partitioner.scala#L104
 Limiting the number of bins is definitely useful.
 {quote}
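
A minimal sketch of the binning idea in the quotes above, assuming a hypothetical `scoreAndLabels` RDD of (score, label) pairs; this is not the implementation in the linked PR:

{code}
// Sketch only: coarsen scores to ~1000 distinct values before building the curve.
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
val binned = scoreAndLabels.map { case (score, label) => (math.rint(score * 1000) / 1000, label) }
val metrics = new BinaryClassificationMetrics(binned)
val auc = metrics.areaUnderROC()
{code}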






[jira] [Updated] (SPARK-4852) Hive query plan deserialization failure caused by shaded hive-exec jar file when generating golden answers

2014-12-15 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-4852:
--
Priority: Minor  (was: Major)

 Hive query plan deserialization failure caused by shaded hive-exec jar file 
 when generating golden answers
 --

 Key: SPARK-4852
 URL: https://issues.apache.org/jira/browse/SPARK-4852
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Cheng Lian
Priority: Minor

 When adding Hive 0.13.1 support for the Spark SQL Thrift server in PR 
 [2685|https://github.com/apache/spark/pull/2685], the Kryo 2.22 used by the 
 original hive-exec-0.13.1.jar was shaded by the Kryo 2.21 used by Spark SQL 
 because of dependency hell. Unfortunately, Kryo 2.21 has a known bug that may 
 cause Hive query plan deserialization failures. This bug was fixed in Kryo 2.22.
 Normally this issue doesn't affect Spark SQL, because we don't even generate 
 Hive query plans. But when running Hive test suites like 
 {{HiveCompatibilitySuite}}, golden answer files must be generated by Hive, 
 which triggers this issue. A workaround is to replace 
 {{hive-exec-0.13.1.jar}} under {{$HIVE_HOME/lib}} with Spark's 
 {{hive-exec-0.13.1a.jar}} and {{kryo-2.21.jar}} under 
 {{$SPARK_DEV_HOME/lib_managed/jars}}, and then add {{$HIVE_HOME/lib}} to 
 {{$HADOOP_CLASSPATH}}.
 Upgrading to a newer version of Kryo that is binary compatible with Kryo 
 2.22 (if there is one) may fix this issue.






[jira] [Commented] (SPARK-4852) Hive query plan deserialization failure caused by shaded hive-exec jar file when generating golden answers

2014-12-15 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14246799#comment-14246799
 ] 

Cheng Lian commented on SPARK-4852:
---

Lowered priority to Minor since this issue only affects Spark SQL developers.

 Hive query plan deserialization failure caused by shaded hive-exec jar file 
 when generating golden answers
 --

 Key: SPARK-4852
 URL: https://issues.apache.org/jira/browse/SPARK-4852
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Cheng Lian
Priority: Minor

 When adding Hive 0.13.1 support for the Spark SQL Thrift server in PR 
 [2685|https://github.com/apache/spark/pull/2685], the Kryo 2.22 used by the 
 original hive-exec-0.13.1.jar was shaded by the Kryo 2.21 used by Spark SQL 
 because of dependency hell. Unfortunately, Kryo 2.21 has a known bug that may 
 cause Hive query plan deserialization failures. This bug was fixed in Kryo 2.22.
 Normally this issue doesn't affect Spark SQL, because we don't even generate 
 Hive query plans. But when running Hive test suites like 
 {{HiveCompatibilitySuite}}, golden answer files must be generated by Hive, 
 which triggers this issue. A workaround is to replace 
 {{hive-exec-0.13.1.jar}} under {{$HIVE_HOME/lib}} with Spark's 
 {{hive-exec-0.13.1a.jar}} and {{kryo-2.21.jar}} under 
 {{$SPARK_DEV_HOME/lib_managed/jars}}, and then add {{$HIVE_HOME/lib}} to 
 {{$HADOOP_CLASSPATH}}.
 Upgrading to a newer version of Kryo that is binary compatible with Kryo 
 2.22 (if there is one) may fix this issue.






[jira] [Commented] (SPARK-3619) Upgrade to Mesos 0.21 to work around MESOS-1688

2014-12-15 Thread Jing Dong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14246831#comment-14246831
 ] 

Jing Dong commented on SPARK-3619:
--

[~tnachen] Will this be released with Spark 1.2.0? 
I also noticed the documentation on Spark saying Mesos compatibility is 0.18.1. 
Is this up-to-date?

 Upgrade to Mesos 0.21 to work around MESOS-1688
 ---

 Key: SPARK-3619
 URL: https://issues.apache.org/jira/browse/SPARK-3619
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Reporter: Matei Zaharia

 When Mesos 0.21 comes out, it will have a fix for 
 https://issues.apache.org/jira/browse/MESOS-1688, which affects Spark jobs.






[jira] [Commented] (SPARK-1442) Add Window function support

2014-12-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14246910#comment-14246910
 ] 

Apache Spark commented on SPARK-1442:
-

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/3703

 Add Window function support
 ---

 Key: SPARK-1442
 URL: https://issues.apache.org/jira/browse/SPARK-1442
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Chengxiang Li
 Attachments: Window Function.pdf


 Similar to Hive, add window function support for Catalyst.
 https://issues.apache.org/jira/browse/HIVE-4197
 https://issues.apache.org/jira/browse/HIVE-896






[jira] [Resolved] (SPARK-4740) Netty's network throughput is about 1/2 of NIO's in spark-perf sortByKey

2014-12-15 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-4740.

   Resolution: Fixed
Fix Version/s: 1.2.0

 Netty's network throughput is about 1/2 of NIO's in spark-perf sortByKey
 

 Key: SPARK-4740
 URL: https://issues.apache.org/jira/browse/SPARK-4740
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle, Spark Core
Affects Versions: 1.2.0
Reporter: Zhang, Liye
Assignee: Reynold Xin
 Fix For: 1.2.0

 Attachments: (rxin patch better executor)TestRunner  sort-by-key - 
 Thread dump for executor 3_files.zip, (rxin patch normal executor)TestRunner  
 sort-by-key - Thread dump for executor 0 _files.zip, Spark-perf Test Report 
 16 Cores per Executor.pdf, Spark-perf Test Report.pdf, TestRunner  
 sort-by-key - Thread dump for executor 1_files (Netty-48 Cores per node).zip, 
 TestRunner  sort-by-key - Thread dump for executor 1_files (Nio-48 cores per 
 node).zip, repartition test.7z, 
 rxin_patch-on_4_node_cluster_48CoresPerNode(Unbalance).7z


 When testing the current Spark master (1.3.0-snapshot) with spark-perf 
 (sort-by-key, aggregate-by-key, etc.), the Netty-based shuffle transfer service 
 takes much longer than the NIO-based one. The network throughput of Netty is only 
 about half that of NIO. 
 We tested in standalone mode; the data set used for the test is 20 billion 
 records, about 400GB in total. The spark-perf test runs on a 4-node cluster with 
 10G NICs, 48 CPU cores per node and 64GB of memory per executor. The number of 
 reduce tasks is set to 1000. 
 ---
 Reynold's update on Dec 15, 2014: The problem is that in NIO we have multiple 
 connections between two nodes, but in Netty we had only one. We introduced a 
 new config option, spark.shuffle.io.numConnectionsPerPeer, to allow users to 
 explicitly increase the number of connections between two nodes. SPARK-4853 is a 
 follow-up ticket to investigate having Spark set this automatically.






[jira] [Commented] (SPARK-4837) NettyBlockTransferService does not abide by spark.blockManager.port config option

2014-12-15 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14246996#comment-14246996
 ] 

Patrick Wendell commented on SPARK-4837:


Hey [~aash], because there is a workaround (you can simply switch back to the 
old IO mode) we probably won't block on it, but I can include it in the release 
notes as a known issue. We can also spin a bug-fix release to address this in a 
week or two. It is indeed an annoying issue and will be bad for usability if 
someone upgrades.

 NettyBlockTransferService does not abide by spark.blockManager.port config 
 option
 -

 Key: SPARK-4837
 URL: https://issues.apache.org/jira/browse/SPARK-4837
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Aaron Davidson
Assignee: Aaron Davidson
Priority: Blocker

 The NettyBlockTransferService always binds to a random port, and does not use 
 the spark.blockManager.port config as specified.






[jira] [Updated] (SPARK-4837) NettyBlockTransferService does not abide by spark.blockManager.port config option

2014-12-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-4837:
---
Target Version/s: 1.2.1

 NettyBlockTransferService does not abide by spark.blockManager.port config 
 option
 -

 Key: SPARK-4837
 URL: https://issues.apache.org/jira/browse/SPARK-4837
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Aaron Davidson
Assignee: Aaron Davidson
Priority: Blocker

 The NettyBlockTransferService always binds to a random port, and does not use 
 the spark.blockManager.port config as specified.






[jira] [Commented] (SPARK-4837) NettyBlockTransferService does not abide by spark.blockManager.port config option

2014-12-15 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247030#comment-14247030
 ] 

Andrew Ash commented on SPARK-4837:
---

Ok that's fair -- a release note and targeting 1.2.1 sounds good.

Draft language for that release note could be:

- Spark 1.2.0 changes the default block transfer service to 
NettyBlockTransferService, a higher-performance block transfer service than the 
old XYZBlockTransferService. The new transfer service does not yet respect 
`spark.blockManager.port`, so deployments needing full control of Spark's 
network ports in 1.2.0 should temporarily set `spark.abc=xyz` and watch 
SPARK-4837.

 NettyBlockTransferService does not abide by spark.blockManager.port config 
 option
 -

 Key: SPARK-4837
 URL: https://issues.apache.org/jira/browse/SPARK-4837
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Aaron Davidson
Assignee: Aaron Davidson
Priority: Blocker

 The NettyBlockTransferService always binds to a random port, and does not use 
 the spark.blockManager.port config as specified.






[jira] [Commented] (SPARK-4826) Possible flaky tests in WriteAheadLogBackedBlockRDDSuite: java.lang.IllegalStateException: File exists and there is no append support!

2014-12-15 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247043#comment-14247043
 ] 

Patrick Wendell commented on SPARK-4826:


I pushed a hotfix disabling these tests, but let's re-enable them once things 
are working.

 Possible flaky tests in WriteAheadLogBackedBlockRDDSuite: 
 java.lang.IllegalStateException: File exists and there is no append support!
 

 Key: SPARK-4826
 URL: https://issues.apache.org/jira/browse/SPARK-4826
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.2.0, 1.3.0
Reporter: Josh Rosen
Assignee: Tathagata Das
  Labels: flaky-test

 I saw a recent master Maven build failure in WriteAheadLogBackedBlockRDDSuite 
 where four tests failed with the same exception.
 [Link to test result (this will eventually 
 break)|https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/1156/].
   In case that link breaks:
 The failed tests:
 {code}
 org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data 
 available only in block manager, not in write ahead log
 org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data 
 available only in write ahead log, not in block manager
 org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data 
 available only in write ahead log, and test storing in block manager
 org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data 
 with partially available in block manager, and rest in write ahead log
 {code}
 The error messages are all (essentially) the same:
 {code}
  java.lang.IllegalStateException: File exists and there is no append 
 support!
   at 
 org.apache.spark.streaming.util.HdfsUtils$.getOutputStream(HdfsUtils.scala:33)
   at 
 org.apache.spark.streaming.util.WriteAheadLogWriter.org$apache$spark$streaming$util$WriteAheadLogWriter$$stream$lzycompute(WriteAheadLogWriter.scala:34)
   at 
 org.apache.spark.streaming.util.WriteAheadLogWriter.org$apache$spark$streaming$util$WriteAheadLogWriter$$stream(WriteAheadLogWriter.scala:34)
   at 
 org.apache.spark.streaming.util.WriteAheadLogWriter.init(WriteAheadLogWriter.scala:42)
   at 
 org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.writeLogSegments(WriteAheadLogBackedBlockRDDSuite.scala:140)
   at 
 org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.org$apache$spark$streaming$rdd$WriteAheadLogBackedBlockRDDSuite$$testRDD(WriteAheadLogBackedBlockRDDSuite.scala:95)
   at 
 org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite$$anonfun$4.apply$mcV$sp(WriteAheadLogBackedBlockRDDSuite.scala:67)
   at 
 org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite$$anonfun$4.apply(WriteAheadLogBackedBlockRDDSuite.scala:67)
   at 
 org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite$$anonfun$4.apply(WriteAheadLogBackedBlockRDDSuite.scala:67)
   at 
 org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
   at org.scalatest.Transformer.apply(Transformer.scala:22)
   at org.scalatest.Transformer.apply(Transformer.scala:20)
   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
   at 
 org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
   at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
   at 
 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
   at 
 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
   at 
 org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
   at 

[jira] [Commented] (SPARK-4810) Failed to run collect

2014-12-15 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247081#comment-14247081
 ] 

Patrick Wendell commented on SPARK-4810:


Actually, can I suggest we move this to the Spark users list? We use this JIRA 
primarily for tracking identified bugs. For information on how to join the user 
list, see this page:

http://spark.apache.org/community.html

 Failed to run collect
 -

 Key: SPARK-4810
 URL: https://issues.apache.org/jira/browse/SPARK-4810
 Project: Spark
  Issue Type: Question
 Environment: Spark 1.1.1 prebuilt for hadoop 2.4.0
Reporter: newjunwei

 My application failed as shown below; I want to know the possible reason. Could 
 insufficient memory cause this?
 Environment: Spark 1.1.1 prebuilt for Hadoop 2.4.0, standalone deploy mode.
 There is no problem when running with a local master for testing, or when 
 processing another, smaller data set.
 My real data is large, about 200 million key-value records; the smaller data set 
 is about one tenth of that. I get the result via collect, and the result itself 
 is very large too. I now suspect the problem is caused by the many failed tasks 
 when collecting a large result. Is that correct?
 2014-12-09 21:51:47,830 WARN 
 org.apache.spark.Logging$class.logWarning(Logging.scala:71) - Lost task 60.1 
 in stage 1.1 (TID 566, server-21): java.io.IOException: 
 org.apache.spark.SparkException: Failed to get broadcast_4_piece0 of 
 broadcast_4
 org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:930)
 
 org.apache.spark.broadcast.TorrentBroadcast.readObject(TorrentBroadcast.scala:155)
 sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 java.lang.reflect.Method.invoke(Method.java:597)
 java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:969)
 java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1871)
 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1775)
 java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1327)
 
 java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1969)
 java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1775)
 java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1327)
 java.io.ObjectInputStream.readObject(ObjectInputStream.java:349)
 
 org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
 
 org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:160)
 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
 java.lang.Thread.run(Thread.java:662)
 2014-12-09 21:51:49,460 INFO 
 org.apache.spark.Logging$class.logInfo(Logging.scala:59) - Starting task 60.2 
 in stage 1.1 (TID 603, server-11, PROCESS_LOCAL, 1295 bytes)
 2014-12-09 21:51:49,461 INFO 
 org.apache.spark.Logging$class.logInfo(Logging.scala:59) - Lost task 9.3 in 
 stage 1.1 (TID 579) on executor server-11: java.io.IOException 
 (org.apache.spark.SparkException: Failed to get broadcast_4_piece0 of 
 broadcast_4) [duplicate 1]
 2014-12-09 21:51:49,487 ERROR 
 org.apache.spark.Logging$class.logError(Logging.scala:75) - Task 9 in stage 
 1.1 failed 4 times; aborting job
 2014-12-09 21:51:49,494 INFO 
 org.apache.spark.Logging$class.logInfo(Logging.scala:59) - Cancelling stage 1
 2014-12-09 21:51:49,498 INFO 
 org.apache.spark.Logging$class.logInfo(Logging.scala:59) - Stage 1 was 
 cancelled
 2014-12-09 21:51:49,511 INFO 
 org.apache.spark.Logging$class.logInfo(Logging.scala:59) - Failed to run 
 collect at StatVideoService.scala:62






[jira] [Commented] (SPARK-4826) Possible flaky tests in WriteAheadLogBackedBlockRDDSuite: java.lang.IllegalStateException: File exists and there is no append support!

2014-12-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247082#comment-14247082
 ] 

Apache Spark commented on SPARK-4826:
-

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/3704

 Possible flaky tests in WriteAheadLogBackedBlockRDDSuite: 
 java.lang.IllegalStateException: File exists and there is no append support!
 

 Key: SPARK-4826
 URL: https://issues.apache.org/jira/browse/SPARK-4826
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.2.0, 1.3.0
Reporter: Josh Rosen
Assignee: Tathagata Das
  Labels: flaky-test

 I saw a recent master Maven build failure in WriteAheadLogBackedBlockRDDSuite 
 where four tests failed with the same exception.
 [Link to test result (this will eventually 
 break)|https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/1156/].
   In case that link breaks:
 The failed tests:
 {code}
 org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data 
 available only in block manager, not in write ahead log
 org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data 
 available only in write ahead log, not in block manager
 org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data 
 available only in write ahead log, and test storing in block manager
 org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data 
 with partially available in block manager, and rest in write ahead log
 {code}
 The error messages are all (essentially) the same:
 {code}
  java.lang.IllegalStateException: File exists and there is no append 
 support!
   at 
 org.apache.spark.streaming.util.HdfsUtils$.getOutputStream(HdfsUtils.scala:33)
   at 
 org.apache.spark.streaming.util.WriteAheadLogWriter.org$apache$spark$streaming$util$WriteAheadLogWriter$$stream$lzycompute(WriteAheadLogWriter.scala:34)
   at 
 org.apache.spark.streaming.util.WriteAheadLogWriter.org$apache$spark$streaming$util$WriteAheadLogWriter$$stream(WriteAheadLogWriter.scala:34)
   at 
 org.apache.spark.streaming.util.WriteAheadLogWriter.init(WriteAheadLogWriter.scala:42)
   at 
 org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.writeLogSegments(WriteAheadLogBackedBlockRDDSuite.scala:140)
   at 
 org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.org$apache$spark$streaming$rdd$WriteAheadLogBackedBlockRDDSuite$$testRDD(WriteAheadLogBackedBlockRDDSuite.scala:95)
   at 
 org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite$$anonfun$4.apply$mcV$sp(WriteAheadLogBackedBlockRDDSuite.scala:67)
   at 
 org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite$$anonfun$4.apply(WriteAheadLogBackedBlockRDDSuite.scala:67)
   at 
 org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite$$anonfun$4.apply(WriteAheadLogBackedBlockRDDSuite.scala:67)
   at 
 org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
   at org.scalatest.Transformer.apply(Transformer.scala:22)
   at org.scalatest.Transformer.apply(Transformer.scala:20)
   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
   at 
 org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
   at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
   at 
 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
   at 
 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
   at 
 org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
   at 

[jira] [Updated] (SPARK-4841) Batch serializer bug in PySpark's RDD.zip

2014-12-15 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-4841:
-
Assignee: Davies Liu

 Batch serializer bug in PySpark's RDD.zip
 -

 Key: SPARK-4841
 URL: https://issues.apache.org/jira/browse/SPARK-4841
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.2.0
Reporter: Xiangrui Meng
Assignee: Davies Liu

 {code}
 t = sc.textFile("README.md")
 t.zip(t).count()
 {code}
 {code}
 Py4JJavaError Traceback (most recent call last)
 ipython-input-6-60fdeb8339fd in module()
  1 readme.zip(readme).count()
 /Users/meng/src/spark/python/pyspark/rdd.pyc in count(self)
 817 3
 818 
 -- 819 return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
 820
 821 def stats(self):
 /Users/meng/src/spark/python/pyspark/rdd.pyc in sum(self)
 808 6.0
 809 
 -- 810 return self.mapPartitions(lambda x: 
 [sum(x)]).reduce(operator.add)
 811
 812 def count(self):
 /Users/meng/src/spark/python/pyspark/rdd.pyc in reduce(self, f)
 713 yield reduce(f, iterator, initial)
 714
 -- 715 vals = self.mapPartitions(func).collect()
 716 if vals:
 717 return reduce(f, vals)
 /Users/meng/src/spark/python/pyspark/rdd.pyc in collect(self)
 674 
 675 with SCCallSiteSync(self.context) as css:
 -- 676 bytesInJava = self._jrdd.collect().iterator()
 677 return list(self._collect_iterator_through_file(bytesInJava))
 678
 /Users/meng/src/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in 
 __call__(self, *args)
 536 answer = self.gateway_client.send_command(command)
 537 return_value = get_return_value(answer, self.gateway_client,
 -- 538 self.target_id, self.name)
 539
 540 for temp_arg in temp_args:
 /Users/meng/src/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in 
 get_return_value(answer, gateway_client, target_id, name)
 298 raise Py4JJavaError(
 299 'An error occurred while calling {0}{1}{2}.\n'.
 -- 300 format(target_id, '.', name), value)
 301 else:
 302 raise Py4JError(
 Py4JJavaError: An error occurred while calling o69.collect.
 : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 
 (TID 2, localhost): org.apache.spark.api.python.PythonException: Traceback 
 (most recent call last):
   File /Users/meng/src/spark/python/pyspark/worker.py, line 107, in main
 process()
   File /Users/meng/src/spark/python/pyspark/worker.py, line 98, in process
 serializer.dump_stream(func(split_index, iterator), outfile)
   File /Users/meng/src/spark/python/pyspark/serializers.py, line 198, in 
 dump_stream
 self.serializer.dump_stream(self._batched(iterator), stream)
   File /Users/meng/src/spark/python/pyspark/serializers.py, line 81, in 
 dump_stream
 raise NotImplementedError
 NotImplementedError
   at 
 org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:137)
   at 
 org.apache.spark.api.python.PythonRDD$$anon$1.init(PythonRDD.scala:174)
   at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:96)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
   at 
 org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
   at 
 org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:242)
   at 
 org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
   at 
 org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1460)
   at 
 org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:203)
 Driver stacktrace:
   at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at 

[jira] [Commented] (SPARK-2121) Not fully cached when there is enough memory in ALS

2014-12-15 Thread sam (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247148#comment-14247148
 ] 

sam commented on SPARK-2121:


I seem to be hitting this problem too. I have a job that should be caching a 
reasonably small data set (persisted with an _AND_DISK storage level just in 
case, but it really is small). Unfortunately my executors get killed and the 
job re-runs around half of its work to recover the lost data. It is a very 
expensive job: although the data set is small, it iterates over a large 
broadcast variable many times, so I really don't want it recomputed (the 
difference between a 6-hour job and a 9-hour job).

Is there currently any workaround, such as increasing one of the many 
configuration settings?

 Not fully cached when there is enough memory in ALS
 ---

 Key: SPARK-2121
 URL: https://issues.apache.org/jira/browse/SPARK-2121
 Project: Spark
  Issue Type: Bug
  Components: Block Manager, MLlib, Spark Core
Affects Versions: 1.0.0
Reporter: Shuo Xiang

 While factorizing a large matrix with the latest Alternating Least Squares 
 (ALS) in MLlib, the Spark UI shows that Spark fails to cache all the 
 partitions of some RDDs even though memory is sufficient. See [this 
 post](http://apache-spark-user-list.1001560.n3.nabble.com/Not-fully-cached-when-there-is-enough-memory-tt7429.html)
  for screenshots. This may cause subsequent job failures while executing 
 `userOut.count()` or `productsOut.count()`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2121) Not fully cached when there is enough memory in ALS

2014-12-15 Thread sam (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247167#comment-14247167
 ] 

sam commented on SPARK-2121:


I've tried increasing spark.yarn.executor.memoryOverhead but I get:

14/12/15 20:11:20 WARN cluster.YarnClientClusterScheduler: Initial job has not 
accepted any resources; check your cluster UI to ensure that workers are 
registered and have sufficient memory

I've checked the UI and it doesn't really help me determine whether I have 
sufficient memory.

I'm thinking the only workaround is to write the intermediate results out to 
disk, since the caching functionality is not behaving.
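For concreteness, a minimal sketch of that kind of workaround (spark-shell 
style; the RDD below is just a stand-in for the expensive computation, and the 
output path is an illustrative assumption):
{code}
// Materialize the expensive result once, then have later stages read the
// on-disk copy instead of relying on caching.
val factors = sc.parallelize(1 to 100).map(i => (i, Array(i.toDouble)))

factors.saveAsObjectFile("hdfs:///tmp/factors")
val reloaded = sc.objectFile[(Int, Array[Double])]("hdfs:///tmp/factors")

reloaded.count()   // downstream work re-reads from disk if executors are lost
{code}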

 Not fully cached when there is enough memory in ALS
 ---

 Key: SPARK-2121
 URL: https://issues.apache.org/jira/browse/SPARK-2121
 Project: Spark
  Issue Type: Bug
  Components: Block Manager, MLlib, Spark Core
Affects Versions: 1.0.0
Reporter: Shuo Xiang

 While factorizing a large matrix with the latest Alternating Least Squares 
 (ALS) in MLlib, the Spark UI shows that Spark fails to cache all the 
 partitions of some RDDs even though memory is sufficient. See [this 
 post](http://apache-spark-user-list.1001560.n3.nabble.com/Not-fully-cached-when-there-is-enough-memory-tt7429.html)
  for screenshots. This may cause subsequent job failures while executing 
 `userOut.count()` or `productsOut.count()`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4849) Pass partitioning information (distribute by) to In-memory caching

2014-12-15 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247184#comment-14247184
 ] 

Michael Armbrust commented on SPARK-4849:
-

The trick here will be to make sure that the outputPartitioning is correctly 
reported by the InMemoryColumnarTableScan.
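To make the hint concrete, here is a hedged sketch of the kind of change this 
suggests; it is not actual Spark code, and the field carrying the child plan's 
partitioning through the cached relation is an assumption:
{code}
// Illustrative only: the in-memory relation would remember the partitioning of
// the plan that produced it, and the scan would report it instead of the default
// UnknownPartitioning, letting the planner skip a redundant Exchange.
private[sql] case class InMemoryRelation(
    /* ...existing fields... */
    childOutputPartitioning: Partitioning)   // assumed new field

private[sql] case class InMemoryColumnarTableScan(
    attributes: Seq[Attribute],
    relation: InMemoryRelation) extends LeafNode {

  override def output: Seq[Attribute] = attributes

  // Report the original partitioning so a join that is already partitioned on
  // the distribute-by keys does not trigger another shuffle.
  override def outputPartitioning: Partitioning = relation.childOutputPartitioning
}
{code}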

 Pass partitioning information (distribute by) to In-memory caching
 --

 Key: SPARK-4849
 URL: https://issues.apache.org/jira/browse/SPARK-4849
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.2.0
Reporter: Nitin Goyal
Priority: Minor

 HQL distribute by column_name partitions data based on the specified column 
 values. We can pass this information to in-memory caching for further 
 performance improvements, e.g. in joins, an extra partitioning step can be 
 saved based on this information.
 Refer: 
 http://apache-spark-user-list.1001560.n3.nabble.com/SchemaRDD-partition-on-specific-column-values-td20350.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4605) Proposed Contribution: Spark Kernel to enable interactive Spark applications

2014-12-15 Thread Chip Senkbeil (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247235#comment-14247235
 ] 

Chip Senkbeil commented on SPARK-4605:
--

[~rdhyee], the short answer is no. The Spark Kernel is a pure Scala kernel that 
can be connected to from IPython and executes Scala, whereas PySpark is a way to 
connect to a Spark cluster from a Python environment.

 Proposed Contribution: Spark Kernel to enable interactive Spark applications
 

 Key: SPARK-4605
 URL: https://issues.apache.org/jira/browse/SPARK-4605
 Project: Spark
  Issue Type: New Feature
Reporter: Chip Senkbeil
 Attachments: Kernel Architecture Widescreen.pdf, Kernel 
 Architecture.pdf


 Project available on Github: https://github.com/ibm-et/spark-kernel
 
 This architecture describes the running kernel code that was demonstrated at 
 StrataConf in Barcelona, Spain.
 
 It enables applications to interact with a Spark cluster using Scala in several 
 ways:
 * Defining and running core Spark tasks
 * Collecting results from a cluster without needing to write to an external 
 data store
 ** Ability to stream results using a well-defined protocol
 * Arbitrary Scala code definition and execution (without submitting 
 heavy-weight jars)
 Applications can be hosted and managed separately from the Spark cluster, using 
 the kernel as a proxy to communicate requests.
 The Spark Kernel implements the server side of the IPython kernel protocol, the 
 rising “de-facto” protocol for language (Python, Haskell, etc.) execution, and 
 so inherits a suite of industry-adopted clients such as the IPython Notebook.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4510) Add k-medoids Partitioning Around Medoids (PAM) algorithm

2014-12-15 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247269#comment-14247269
 ] 

Xiangrui Meng commented on SPARK-4510:
--

The N^2 factor was what I was worried about. MLlib is supposed to deal with 
distributed datasets, which usually means very large N. Given the N^2 factor, 
the k-medoids implementation won't scale even to one million examples.

 Add k-medoids Partitioning Around Medoids (PAM) algorithm
 -

 Key: SPARK-4510
 URL: https://issues.apache.org/jira/browse/SPARK-4510
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Fan Jiang
Assignee: Fan Jiang
  Labels: features
   Original Estimate: 0h
  Remaining Estimate: 0h

 PAM (k-medoids) is more robust to noise and outliers than k-means because it 
 minimizes a sum of pairwise dissimilarities instead of a sum of squared 
 Euclidean distances. A medoid is the object of a cluster whose average 
 dissimilarity to all other objects in the cluster is minimal, i.e. it is the 
 most centrally located point in the cluster.
 The most common realisation of k-medoid clustering is the Partitioning Around 
 Medoids (PAM) algorithm, which is as follows:
 1. Initialize: randomly select (without replacement) k of the n data points as 
 the medoids.
 2. Associate each data point with the closest medoid (closest is defined by any 
 valid distance metric, most commonly Euclidean, Manhattan or Minkowski 
 distance).
 3. For each medoid m and each non-medoid data point o, swap m and o and compute 
 the total cost of the configuration.
 4. Select the configuration with the lowest cost.
 5. Repeat steps 2 to 4 until the medoids no longer change.
 (A minimal local sketch of this loop is included below.)
 The new feature for MLlib will contain 5 new files:
 /main/scala/org/apache/spark/mllib/clustering/PAM.scala
 /main/scala/org/apache/spark/mllib/clustering/PAMModel.scala
 /main/scala/org/apache/spark/mllib/clustering/LocalPAM.scala
 /test/scala/org/apache/spark/mllib/clustering/PAMSuite.scala
 /main/scala/org/apache/spark/examples/mllib/KMedoids.scala
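 A hedged, purely local sketch of the algorithm above (illustrative only; it is 
 not the content of the proposed files, and a distributed version would have to 
 avoid the all-pairs swap evaluation that the N^2 comment above refers to):
 {code}
 // Minimal local PAM sketch (illustrative only, not the proposed MLlib code).
 // Uses Euclidean distance and an in-memory Seq of points.
 object LocalPAMSketch {
   type Point = Array[Double]

   def dist(a: Point, b: Point): Double =
     math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

   // Total cost = sum of distances from each point to its closest medoid.
   def cost(points: Seq[Point], medoids: Seq[Point]): Double =
     points.map(p => medoids.map(m => dist(p, m)).min).sum

   def pam(points: Seq[Point], k: Int, maxIter: Int = 20): Seq[Point] = {
     var medoids: Seq[Point] = scala.util.Random.shuffle(points).take(k)  // step 1
     var changed = true
     var iter = 0
     while (changed && iter < maxIter) {                                  // step 5
       changed = false
       iter += 1
       val currentCost = cost(points, medoids)
       // Steps 3-4: evaluate every (medoid, non-medoid) swap, keep the cheapest.
       val candidates = for {
         m <- medoids
         o <- points if !medoids.exists(_ eq o)
       } yield medoids.filterNot(_ eq m) :+ o
       if (candidates.nonEmpty) {
         val best = candidates.minBy(cost(points, _))
         if (cost(points, best) < currentCost) {
           medoids = best
           changed = true
         }
       }
     }
     medoids
   }
 }
 // Usage: val medoids = LocalPAMSketch.pam(data, k = 3)
 {code}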



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4494) IDFModel.transform() add support for single vector

2014-12-15 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-4494.
--
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 3603
[https://github.com/apache/spark/pull/3603]

 IDFModel.transform() add support for single vector
 --

 Key: SPARK-4494
 URL: https://issues.apache.org/jira/browse/SPARK-4494
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.1.1, 1.2.0
Reporter: Jean-Philippe Quemener
Priority: Minor
 Fix For: 1.3.0


 For now, when using the tf-idf implementation of MLlib, you have no other 
 possibility to map your data back onto e.g. labels or ids than a hackish 
 way with zipping: {quote} 1. Persist input RDD. 2. Transform it to just 
 vectors and apply IDFModel. 3. Zip with original RDD. 4. Transform label and 
 new vector to LabeledPoint.{quote}
 Source: [http://stackoverflow.com/questions/26897908/spark-mllib-tfidf-implementation-for-logisticregression]
 I think, as in production a lot of users want to map their data back to some 
 identifier, it would be a good improvement to allow using a single vector with 
 IDFModel.transform()
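 For reference, a hedged sketch of the zip workaround from the quote above 
 (spark-shell style; the toy data and variable names are illustrative 
 assumptions, not from the linked question):
 {code}
 // Hedged sketch of the current zip-based workaround (steps 1-4 above).
 import org.apache.spark.mllib.feature.{HashingTF, IDF}
 import org.apache.spark.mllib.regression.LabeledPoint

 // Toy stand-in data: (label, tokens) pairs.
 val labeledDocs = sc.parallelize(Seq(
   (1.0, Seq("spark", "mllib")),
   (0.0, Seq("hadoop", "mapreduce"))))

 labeledDocs.cache()                                            // 1. persist the input

 val tf    = new HashingTF().transform(labeledDocs.map(_._2))   // 2. vectors only,
 val tfidf = new IDF().fit(tf).transform(tf)                    //    then apply IDFModel

 val training = labeledDocs.zip(tfidf).map {                    // 3. zip with the original
   case ((label, _), vector) => LabeledPoint(label, vector)     // 4. rebuild LabeledPoints
 }
 {code}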



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4494) IDFModel.transform() add support for single vector

2014-12-15 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-4494:
-
Assignee: Yu Ishikawa

 IDFModel.transform() add support for single vector
 --

 Key: SPARK-4494
 URL: https://issues.apache.org/jira/browse/SPARK-4494
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.1.1, 1.2.0
Reporter: Jean-Philippe Quemener
Assignee: Yu Ishikawa
Priority: Minor
 Fix For: 1.3.0


 For now, when using the tf-idf implementation of MLlib, you have no other 
 possibility to map your data back onto e.g. labels or ids than a hackish 
 way with zipping: {quote} 1. Persist input RDD. 2. Transform it to just 
 vectors and apply IDFModel. 3. Zip with original RDD. 4. Transform label and 
 new vector to LabeledPoint.{quote}
 Source: [http://stackoverflow.com/questions/26897908/spark-mllib-tfidf-implementation-for-logisticregression]
 I think, as in production a lot of users want to map their data back to some 
 identifier, it would be a good improvement to allow using a single vector with 
 IDFModel.transform()



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4841) Batch serializer bug in PySpark's RDD.zip

2014-12-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247376#comment-14247376
 ] 

Apache Spark commented on SPARK-4841:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/3706

 Batch serializer bug in PySpark's RDD.zip
 -

 Key: SPARK-4841
 URL: https://issues.apache.org/jira/browse/SPARK-4841
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.2.0
Reporter: Xiangrui Meng
Assignee: Davies Liu

 {code}
 t = sc.textFile("README.md")
 t.zip(t).count()
 {code}
 {code}
 Py4JJavaError Traceback (most recent call last)
 ipython-input-6-60fdeb8339fd in module()
  1 readme.zip(readme).count()
 /Users/meng/src/spark/python/pyspark/rdd.pyc in count(self)
 817 3
 818 
 -- 819 return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
 820
 821 def stats(self):
 /Users/meng/src/spark/python/pyspark/rdd.pyc in sum(self)
 808 6.0
 809 
 -- 810 return self.mapPartitions(lambda x: 
 [sum(x)]).reduce(operator.add)
 811
 812 def count(self):
 /Users/meng/src/spark/python/pyspark/rdd.pyc in reduce(self, f)
 713 yield reduce(f, iterator, initial)
 714
 -- 715 vals = self.mapPartitions(func).collect()
 716 if vals:
 717 return reduce(f, vals)
 /Users/meng/src/spark/python/pyspark/rdd.pyc in collect(self)
 674 
 675 with SCCallSiteSync(self.context) as css:
 -- 676 bytesInJava = self._jrdd.collect().iterator()
 677 return list(self._collect_iterator_through_file(bytesInJava))
 678
 /Users/meng/src/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in 
 __call__(self, *args)
 536 answer = self.gateway_client.send_command(command)
 537 return_value = get_return_value(answer, self.gateway_client,
 -- 538 self.target_id, self.name)
 539
 540 for temp_arg in temp_args:
 /Users/meng/src/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in 
 get_return_value(answer, gateway_client, target_id, name)
 298 raise Py4JJavaError(
 299 'An error occurred while calling {0}{1}{2}.\n'.
 -- 300 format(target_id, '.', name), value)
 301 else:
 302 raise Py4JError(
 Py4JJavaError: An error occurred while calling o69.collect.
 : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 
 (TID 2, localhost): org.apache.spark.api.python.PythonException: Traceback 
 (most recent call last):
   File /Users/meng/src/spark/python/pyspark/worker.py, line 107, in main
 process()
   File /Users/meng/src/spark/python/pyspark/worker.py, line 98, in process
 serializer.dump_stream(func(split_index, iterator), outfile)
   File /Users/meng/src/spark/python/pyspark/serializers.py, line 198, in 
 dump_stream
 self.serializer.dump_stream(self._batched(iterator), stream)
   File /Users/meng/src/spark/python/pyspark/serializers.py, line 81, in 
 dump_stream
 raise NotImplementedError
 NotImplementedError
   at 
 org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:137)
   at 
 org.apache.spark.api.python.PythonRDD$$anon$1.init(PythonRDD.scala:174)
   at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:96)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
   at 
 org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
   at 
 org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:242)
   at 
 org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
   at 
 org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1460)
   at 
 org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:203)
 Driver stacktrace:
   at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202)

[jira] [Updated] (SPARK-1037) the name of findTaskFromList findTask in TaskSetManager.scala is confusing

2014-12-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-1037:
--
Description: 
the name of these two functions is confusing 

though in the comments the author said that the method does dequeue tasks 
from the list but from the name, it is not explicitly indicating that the 
method will mutate the parameter

in 

{code}
private def findTaskFromList(list: ArrayBuffer[Int]): Option[Int] = {
while (!list.isEmpty) {
  val index = list.last
  list.trimEnd(1)
  if (copiesRunning(index) == 0 && !successful(index)) {
return Some(index)
  }
}
None
  }
{code}

  was:
the name of these two functions is confusing 

though in the comments the author said that the method does dequeue tasks 
from the list but from the name, it is not explicitly indicating that the 
method will mutate the parameter

in 

private def findTaskFromList(list: ArrayBuffer[Int]): Option[Int] = {
while (!list.isEmpty) {
  val index = list.last
  list.trimEnd(1)
  if (copiesRunning(index) == 0 && !successful(index)) {
return Some(index)
  }
}
None
  }



 the name of findTaskFromList  findTask in TaskSetManager.scala is confusing
 

 Key: SPARK-1037
 URL: https://issues.apache.org/jira/browse/SPARK-1037
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 0.9.1, 1.0.0
Reporter: Nan Zhu
Priority: Minor
  Labels: starter

 the name of these two functions is confusing 
 though in the comments the author said that the method does dequeue tasks 
 from the list but from the name, it is not explicitly indicating that the 
 method will mutate the parameter
 in 
 {code}
 private def findTaskFromList(list: ArrayBuffer[Int]): Option[Int] = {
 while (!list.isEmpty) {
   val index = list.last
   list.trimEnd(1)
   if (copiesRunning(index) == 0 && !successful(index)) {
 return Some(index)
   }
 }
 None
   }
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-1037) the name of findTaskFromList findTask in TaskSetManager.scala is confusing

2014-12-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-1037.
---
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 3665
[https://github.com/apache/spark/pull/3665]

 the name of findTaskFromList  findTask in TaskSetManager.scala is confusing
 

 Key: SPARK-1037
 URL: https://issues.apache.org/jira/browse/SPARK-1037
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 0.9.1, 1.0.0
Reporter: Nan Zhu
Priority: Minor
  Labels: starter
 Fix For: 1.3.0


 the name of these two functions is confusing 
 though in the comments the author said that the method does dequeue tasks 
 from the list but from the name, it is not explicitly indicating that the 
 method will mutate the parameter
 in 
 {code}
 private def findTaskFromList(list: ArrayBuffer[Int]): Option[Int] = {
 while (!list.isEmpty) {
   val index = list.last
   list.trimEnd(1)
   if (copiesRunning(index) == 0 && !successful(index)) {
 return Some(index)
   }
 }
 None
   }
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1037) the name of findTaskFromList findTask in TaskSetManager.scala is confusing

2014-12-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-1037:
--
Assignee: Ilya Ganelin  (was: Josh Rosen)

 the name of findTaskFromList  findTask in TaskSetManager.scala is confusing
 

 Key: SPARK-1037
 URL: https://issues.apache.org/jira/browse/SPARK-1037
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 0.9.1, 1.0.0
Reporter: Nan Zhu
Assignee: Ilya Ganelin
Priority: Minor
  Labels: starter
 Fix For: 1.3.0


 the name of these two functions is confusing 
 though in the comments the author said that the method does dequeue tasks 
 from the list but from the name, it is not explicitly indicating that the 
 method will mutate the parameter
 in 
 {code}
 private def findTaskFromList(list: ArrayBuffer[Int]): Option[Int] = {
 while (!list.isEmpty) {
   val index = list.last
   list.trimEnd(1)
   if (copiesRunning(index) == 0 && !successful(index)) {
 return Some(index)
   }
 }
 None
   }
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4841) Batch serializer bug in PySpark's RDD.zip

2014-12-15 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-4841:
-
Priority: Blocker  (was: Major)

 Batch serializer bug in PySpark's RDD.zip
 -

 Key: SPARK-4841
 URL: https://issues.apache.org/jira/browse/SPARK-4841
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.2.0
Reporter: Xiangrui Meng
Assignee: Davies Liu
Priority: Blocker

 {code}
 t = sc.textFile("README.md")
 t.zip(t).count()
 {code}
 {code}
 Py4JJavaError Traceback (most recent call last)
 ipython-input-6-60fdeb8339fd in module()
  1 readme.zip(readme).count()
 /Users/meng/src/spark/python/pyspark/rdd.pyc in count(self)
 817 3
 818 
 -- 819 return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
 820
 821 def stats(self):
 /Users/meng/src/spark/python/pyspark/rdd.pyc in sum(self)
 808 6.0
 809 
 -- 810 return self.mapPartitions(lambda x: 
 [sum(x)]).reduce(operator.add)
 811
 812 def count(self):
 /Users/meng/src/spark/python/pyspark/rdd.pyc in reduce(self, f)
 713 yield reduce(f, iterator, initial)
 714
 -- 715 vals = self.mapPartitions(func).collect()
 716 if vals:
 717 return reduce(f, vals)
 /Users/meng/src/spark/python/pyspark/rdd.pyc in collect(self)
 674 
 675 with SCCallSiteSync(self.context) as css:
 -- 676 bytesInJava = self._jrdd.collect().iterator()
 677 return list(self._collect_iterator_through_file(bytesInJava))
 678
 /Users/meng/src/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in 
 __call__(self, *args)
 536 answer = self.gateway_client.send_command(command)
 537 return_value = get_return_value(answer, self.gateway_client,
 -- 538 self.target_id, self.name)
 539
 540 for temp_arg in temp_args:
 /Users/meng/src/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in 
 get_return_value(answer, gateway_client, target_id, name)
 298 raise Py4JJavaError(
 299 'An error occurred while calling {0}{1}{2}.\n'.
 -- 300 format(target_id, '.', name), value)
 301 else:
 302 raise Py4JError(
 Py4JJavaError: An error occurred while calling o69.collect.
 : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 
 (TID 2, localhost): org.apache.spark.api.python.PythonException: Traceback 
 (most recent call last):
   File /Users/meng/src/spark/python/pyspark/worker.py, line 107, in main
 process()
   File /Users/meng/src/spark/python/pyspark/worker.py, line 98, in process
 serializer.dump_stream(func(split_index, iterator), outfile)
   File /Users/meng/src/spark/python/pyspark/serializers.py, line 198, in 
 dump_stream
 self.serializer.dump_stream(self._batched(iterator), stream)
   File /Users/meng/src/spark/python/pyspark/serializers.py, line 81, in 
 dump_stream
 raise NotImplementedError
 NotImplementedError
   at 
 org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:137)
   at 
 org.apache.spark.api.python.PythonRDD$$anon$1.init(PythonRDD.scala:174)
   at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:96)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
   at 
 org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
   at 
 org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:242)
   at 
 org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
   at 
 org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1460)
   at 
 org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:203)
 Driver stacktrace:
   at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 

[jira] [Commented] (SPARK-1216) Add a OneHotEncoder for handling categorical features

2014-12-15 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247414#comment-14247414
 ] 

Joseph K. Bradley commented on SPARK-1216:
--

(Addressing old comments I just saw now...)
[~jaggi] I'd recommend keeping this separate from 
[https://issues.apache.org/jira/browse/SPARK-1303] since it is a different kind 
of transformation.
[~srowen] This is a bit different from [SPARK-4081]; see the comment in my PR: 
[https://github.com/apache/spark/pull/3000#issuecomment-62630207].  I'd 
recommend keeping them separate.

I don't immediately see a good way to organize these, but it could be worth 
discussing.

 Add a OneHotEncoder for handling categorical features
 -

 Key: SPARK-1216
 URL: https://issues.apache.org/jira/browse/SPARK-1216
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 0.9.0
Reporter: Sandy Pérez González
Assignee: Sandy Ryza

 It would be nice to add something to MLlib to make it easy to do one-of-K 
 encoding of categorical features.
 Something like:
 http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
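 For illustration, a hedged sketch of what one-of-K encoding does (plain Scala, 
 not the proposed MLlib API):
 {code}
 // One-of-K (one-hot) encoding of a categorical value, illustrative only.
 val categories = Seq("red", "green", "blue")
 val index = categories.zipWithIndex.toMap        // category -> position

 def oneHot(value: String): Array[Double] = {
   val vec = Array.fill(categories.size)(0.0)
   vec(index(value)) = 1.0                        // set the single active slot
   vec
 }

 oneHot("green")                                  // Array(0.0, 1.0, 0.0)
 {code}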



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4501) Create build/mvn to automatically download maven/zinc/scalac

2014-12-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247448#comment-14247448
 ] 

Apache Spark commented on SPARK-4501:
-

User 'brennonyork' has created a pull request for this issue:
https://github.com/apache/spark/pull/3707

 Create build/mvn to automatically download maven/zinc/scalac
 

 Key: SPARK-4501
 URL: https://issues.apache.org/jira/browse/SPARK-4501
 Project: Spark
  Issue Type: New Feature
  Components: Build
Reporter: Patrick Wendell
Assignee: Prashant Sharma

 For a long time we've had sbt/sbt, and this works well for users who want 
 to build Spark with minimal dependencies (only Java). It would be nice to 
 generalize this to Maven as well and have build/sbt and build/mvn, where 
 build/mvn is a script that downloads Maven, Zinc, and Scala locally and sets 
 them up correctly. This would be totally opt-in, and people using system 
 Maven would be able to continue doing so.
 My sense is that very few Maven users are currently using Zinc, even though 
 in some basic tests I saw a huge improvement from using it. Also, having 
 a simple way to use Zinc would make it easier to use Maven on our Jenkins 
 test machines.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-785) ClosureCleaner not invoked on most PairRDDFunctions

2014-12-15 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247473#comment-14247473
 ] 

Josh Rosen commented on SPARK-785:
--

I've merged https://github.com/apache/spark/pull/3690 to fix this in the 
maintenance branches and have tagged this for a 1.2.1 backport.

 ClosureCleaner not invoked on most PairRDDFunctions
 ---

 Key: SPARK-785
 URL: https://issues.apache.org/jira/browse/SPARK-785
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Matei Zaharia
Assignee: Sean Owen
  Labels: backport-needed
 Fix For: 1.1.1, 0.9.3, 1.0.3, 1.3.0


 It's pretty weird that we've missed this so far, but it seems to be the case. 
 Unfortunately it may not be good to fix this in 0.7.3 because it could change 
 behavior in unexpected ways; I haven't decided yet. But we should definitely 
 do it for 0.8, and add tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-785) ClosureCleaner not invoked on most PairRDDFunctions

2014-12-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-785:
-
Labels: backport-needed  (was: )

 ClosureCleaner not invoked on most PairRDDFunctions
 ---

 Key: SPARK-785
 URL: https://issues.apache.org/jira/browse/SPARK-785
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Matei Zaharia
Assignee: Sean Owen
  Labels: backport-needed
 Fix For: 1.1.1, 0.9.3, 1.0.3, 1.3.0


 It's pretty weird that we've missed this so far, but it seems to be the case. 
 Unfortunately it may not be good to fix this in 0.7.3 because it could change 
 behavior in unexpected ways; I haven't decided yet. But we should definitely 
 do it for 0.8, and add tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-785) ClosureCleaner not invoked on most PairRDDFunctions

2014-12-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-785:
-
Target Version/s: 1.2.1
Assignee: Sean Owen

 ClosureCleaner not invoked on most PairRDDFunctions
 ---

 Key: SPARK-785
 URL: https://issues.apache.org/jira/browse/SPARK-785
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Matei Zaharia
Assignee: Sean Owen
  Labels: backport-needed
 Fix For: 1.1.1, 0.9.3, 1.0.3, 1.3.0


 It's pretty weird that we've missed this so far, but it seems to be the case. 
 Unfortunately it may not be good to fix this in 0.7.3 because it could change 
 behavior in unexpected ways; I haven't decided yet. But we should definitely 
 do it for 0.8, and add tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4320) JavaPairRDD should supply a saveAsNewHadoopDataset which takes a Job object

2014-12-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-4320:
--
Fix Version/s: (was: 1.1.1)
   1.1.2

 JavaPairRDD should supply a saveAsNewHadoopDataset which takes a Job object 
 

 Key: SPARK-4320
 URL: https://issues.apache.org/jira/browse/SPARK-4320
 Project: Spark
  Issue Type: Improvement
  Components: Input/Output, Spark Core
Reporter: Corey J. Nolet
 Fix For: 1.2.0, 1.1.2


 I am outputting data to Accumulo using a custom OutputFormat. I have tried 
 using saveAsNewHadoopFile() and that works, though passing an empty path is a 
 bit weird. Since it isn't really a file I'm storing, but rather a generic 
 Pair dataset, I'd be inclined to use the saveAsHadoopDataset() method, though 
 I'm not at all interested in using the legacy mapred API.
 Perhaps we could supply a saveAsNewHadoopDataset method. Personally, I think 
 there should be two ways of calling into this method. Instead of forcing the 
 user to always set up the Job object explicitly, I'm in the camp of having 
 the following method signature (see the sketch below):
 saveAsNewHadoopDataset(keyClass : Class[K], valueClass : Class[V], ofclass : 
 Class[? extends OutputFormat], conf : Configuration). This way, if I'm 
 writing Spark jobs that are going from Hadoop back into Hadoop, I can 
 construct my Configuration once.
 Perhaps an overloaded method signature could be:
 saveAsNewHadoopDataset(job : Job)
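 A hedged Scala rendering of the two proposed overloads (the method name and 
 parameter shapes follow the description above; they are not an existing API):
 {code}
 // Illustrative signatures only, not existing Spark API at the time of this issue.
 import org.apache.hadoop.conf.Configuration
 import org.apache.hadoop.mapreduce.{Job, OutputFormat}

 trait JavaPairRDDProposal[K, V] {
   // Variant 1: caller supplies key/value classes, the OutputFormat, and a Configuration.
   def saveAsNewHadoopDataset(
       keyClass: Class[K],
       valueClass: Class[V],
       outputFormatClass: Class[_ <: OutputFormat[K, V]],
       conf: Configuration): Unit

   // Variant 2: caller passes a fully configured Job object.
   def saveAsNewHadoopDataset(job: Job): Unit
 }
 {code}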



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4232) Truncate table not works when specific the table from non-current database session

2014-12-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-4232:
--
Fix Version/s: (was: 1.1.1)
   1.1.2

 Truncate table not works when specific the table from non-current database 
 session
 --

 Key: SPARK-4232
 URL: https://issues.apache.org/jira/browse/SPARK-4232
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: shengli
Priority: Minor
 Fix For: 1.2.0, 1.1.2


 Currently TRUNCATE TABLE works fine when the table is in the current database 
 session, but it does not work when the table is referenced from a different 
 database session.
 What I mean is:
 Assume we have two databases, default and dw, and a table named test_table in 
 database dw.
 By default we log in with the default database session, so if I run:
 use dw;
 truncate table test_table [partitions..];  it is OK.
 But if I stay in the default database and run:
 use default;
 truncate table dw.test_table;
 it will throw an exception:
 Failed to parse: truncate table dw.test_table.
  line 1:17 missing EOF at '.' near 'dw'
 So this is a bug in parsing truncate table xxx



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4355) OnlineSummarizer doesn't merge mean correctly

2014-12-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-4355:
--
Fix Version/s: (was: 1.1.1)
   1.1.2

 OnlineSummarizer doesn't merge mean correctly
 -

 Key: SPARK-4355
 URL: https://issues.apache.org/jira/browse/SPARK-4355
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.0.2, 1.1.1, 1.2.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
 Fix For: 1.2.0, 1.1.2


 It happens when the mean on one side is zero. I will send a PR with some 
 code clean-up.
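 For context, a hedged illustration of the weighted merge the summarizer needs 
 (generic formula only, not the MultivariateOnlineSummarizer source):
 {code}
 // Merging two running means must weight each side by its count, even when one
 // of the means is zero (illustrative helper, not Spark code).
 def mergeMean(n1: Long, mean1: Double, n2: Long, mean2: Double): Double =
   if (n1 + n2 == 0L) 0.0
   else (n1 * mean1 + n2 * mean2) / (n1 + n2).toDouble

 mergeMean(3L, 0.0, 1L, 4.0)   // 1.0: the zero-mean side still contributes its count
 {code}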



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4265) Better extensibility for TaskEndReason

2014-12-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-4265:
--
Fix Version/s: (was: 1.1.1)
   1.1.2

 Better extensibility for TaskEndReason
 --

 Key: SPARK-4265
 URL: https://issues.apache.org/jira/browse/SPARK-4265
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Shixiong Zhu
Priority: Minor
  Labels: api-change
 Fix For: 1.1.2


 Now all subclasses of TaskEndReason are case classes. As per the discussion in 
 https://github.com/apache/spark/pull/3073#discussion_r19920257 , it's hard to 
 extend them (for example, to add or remove fields) without breaking 
 compatibility for pattern matching.
 It would be better to change them to regular classes for extensibility.
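 A hedged illustration of the compatibility concern (hypothetical class, not the 
 actual TaskEndReason hierarchy):
 {code}
 // Matching on a case class constructor pattern ties callers to the exact field
 // list, so adding a field later breaks every such match at compile time.
 case class FetchFailedV1(host: String, shuffleId: Int)

 def describe(r: FetchFailedV1): String = r match {
   case FetchFailedV1(host, shuffleId) => s"fetch failed on $host (shuffle $shuffleId)"
 }

 // If a new field (say, mapId: Int) were added, the pattern above would no longer
 // compile. A regular class lets callers match on the type and read fields through
 // accessors, which keeps adding fields source-compatible.
 {code}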



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-785) ClosureCleaner not invoked on most PairRDDFunctions

2014-12-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-785:
-
Fix Version/s: (was: 1.1.1)
   1.1.2

 ClosureCleaner not invoked on most PairRDDFunctions
 ---

 Key: SPARK-785
 URL: https://issues.apache.org/jira/browse/SPARK-785
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Matei Zaharia
Assignee: Sean Owen
  Labels: backport-needed
 Fix For: 0.9.3, 1.0.3, 1.3.0, 1.1.2


 It's pretty weird that we've missed this so far, but it seems to be the case. 
 Unfortunately it may not be good to fix this in 0.7.3 because it could change 
 behavior in unexpected ways; I haven't decided yet. But we should definitely 
 do it for 0.8, and add tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4006) Spark Driver crashes whenever an Executor is registered twice

2014-12-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-4006:
--
Fix Version/s: (was: 1.1.1)
   1.1.2

 Spark Driver crashes whenever an Executor is registered twice
 -

 Key: SPARK-4006
 URL: https://issues.apache.org/jira/browse/SPARK-4006
 Project: Spark
  Issue Type: Bug
  Components: Block Manager, Spark Core
Affects Versions: 0.9.2, 1.0.2, 1.1.0, 1.2.0
 Environment: Mesos, Coarse Grained
Reporter: Tal Sliwowicz
Assignee: Tal Sliwowicz
Priority: Critical
 Fix For: 1.2.0, 1.1.2


 This is a huge robustness issue for us (Taboola) in mission-critical, 
 time-sensitive (real-time) Spark jobs.
 We have long-running Spark drivers, and even though we have state-of-the-art 
 hardware, executors disconnect from time to time. In many cases the 
 RemoveExecutor message is not received, and when the new executor registers, 
 the driver crashes. In Mesos coarse-grained mode, executor ids are fixed. 
 The issue is the System.exit(1) in BlockManagerMasterActor:
 {code}
  private def register(id: BlockManagerId, maxMemSize: Long, slaveActor: ActorRef) {
    if (!blockManagerInfo.contains(id)) {
      blockManagerIdByExecutor.get(id.executorId) match {
        case Some(manager) =>
          // A block manager of the same executor already exists.
          // This should never happen. Let's just quit.
          logError("Got two different block manager registrations on " + id.executorId)
          System.exit(1)
        case None =>
          blockManagerIdByExecutor(id.executorId) = id
      }
      logInfo("Registering block manager %s with %s RAM".format(
        id.hostPort, Utils.bytesToString(maxMemSize)))
      blockManagerInfo(id) =
        new BlockManagerInfo(id, System.currentTimeMillis(), maxMemSize, slaveActor)
    }
    listenerBus.post(SparkListenerBlockManagerAdded(id, maxMemSize))
  }
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3987) NNLS generates incorrect result

2014-12-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-3987:
--
Fix Version/s: (was: 1.1.2)
   1.1.1

 NNLS generates incorrect result
 ---

 Key: SPARK-3987
 URL: https://issues.apache.org/jira/browse/SPARK-3987
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.1.0
Reporter: Debasish Das
Assignee: Shuo Xiang
 Fix For: 1.1.1, 1.2.0


 Hi,
 Please see the example gram matrix and linear term:
 val P2 = new DoubleMatrix(20, 20, 333907.312770, -60814.043975, 
 207935.829941, -162881.367739, -43730.396770, 17511.428983, -243340.496449, 
 -225245.957922, 104700.445881, 32430.845099, 336378.693135, -373497.970207, 
 -41147.159621, 53928.060360, -293517.883778, 53105.278068, 0.00, 
 -85257.781696, 84913.970469, -10584.080103, -60814.043975, 13826.806664, 
 -38032.612640, 33475.833875, 10791.916809, -1040.950810, 48106.552472, 
 45390.073380, -16310.282190, -2861.455903, -60790.833191, 73109.516544, 
 9826.614644, -8283.992464, 56991.742991, -6171.366034, 0.00, 
 19152.382499, -13218.721710, 2793.734234, 207935.829941, -38032.612640, 
 129661.677608, -101682.098412, -27401.299347, 10787.713362, -151803.006149, 
 -140563.601672, 65067.935324, 20031.263383, 209521.268600, -232958.054688, 
 -25764.179034, 33507.951918, -183046.845592, 32884.782835, 0.00, 
 -53315.811196, 52770.762546, -6642.187643, -162881.367739, 33475.833875, 
 -101682.098412, 85094.407608, 25422.850782, -5437.646141, 124197.166330, 
 116206.265909, -47093.484134, -11420.168521, -163429.436848, 189574.783900, 
 23447.172314, -24087.375367, 148311.355507, -20848.385466, 0.00, 
 46835.814559, -38180.352878, 6415.873901, -43730.396770, 10791.916809, 
 -27401.299347, 25422.850782, 8882.869799, 15.638084, 35933.473986, 
 34186.371325, -10745.330690, -974.314375, -43537.709621, 54371.010558, 
 7894.453004, -5408.929644, 42231.381747, -3192.010574, 0.00, 
 15058.753110, -8704.757256, 2316.581535, 17511.428983, -1040.950810, 
 10787.713362, -5437.646141, 15.638084, 2794.949847, -9681.950987, 
 -8258.171646, 7754.358930, 4193.359412, 18052.143842, -15456.096769, 
 -253.356253, 4089.672804, -12524.380088, 5651.579348, 0.00, -1513.302547, 
 6296.461898, 152.427321, -243340.496449, 48106.552472, -151803.006149, 
 124197.166330, 35933.473986, -9681.950987, 182931.600236, 170454.352953, 
 -72361.174145, -19270.461728, -244518.179729, 279551.060579, 33340.452802, 
 -37103.267653, 219025.288975, -33687.141423, 0.00, 67347.950443, 
 -58673.009647, 8957.800259, -225245.957922, 45390.073380, -140563.601672, 
 116206.265909, 34186.371325, -8258.171646, 170454.352953, 159322.942894, 
 -66074.960534, -16839.743193, -226173.967766, 260421.044094, 31624.194003, 
 -33839.612565, 203889.695169, -30034.828909, 0.00, 63525.040745, 
 -53572.741748, 8575.071847, 104700.445881, -16310.282190, 65067.935324, 
 -47093.484134, -10745.330690, 7754.358930, -72361.174145, -66074.960534, 
 35869.598076, 13378.653317, 106033.647837, -111831.682883, -10455.465743, 
 18537.392481, -88370.612394, 20344.288488, 0.00, -22935.482766, 
 29004.543704, -2409.461759, 32430.845099, -2861.455903, 20031.263383, 
 -11420.168521, -974.314375, 4193.359412, -19270.461728, -16839.743193, 
 13378.653317, 6802.081898, 33256.395091, -30421.985199, -1296.785870, 
 7026.518692, -24443.378205, 9221.982599, 0.00, -4088.076871, 
 10861.014242, -25.092938, 336378.693135, -60790.833191, 209521.268600, 
 -163429.436848, -43537.709621, 18052.143842, -244518.179729, -226173.967766, 
 106033.647837, 33256.395091, 339200.268106, -375442.716811, -41027.594509, 
 54636.778527, -295133.248586, 54177.278365, 0.00, -85237.666701, 
 85996.957056, -10503.209968, -373497.970207, 73109.516544, -232958.054688, 
 189574.783900, 54371.010558, -15456.096769, 279551.060579, 260421.044094, 
 -111831.682883, -30421.985199, -375442.716811, 427793.208465, 50528.074431, 
 -57375.986301, 335203.382015, -52676.385869, 0.00, 102368.307670, 
 -90679.792485, 13509.390393, -41147.159621, 9826.614644, -25764.179034, 
 23447.172314, 7894.453004, -253.356253, 33340.452802, 31624.194003, 
 -10455.465743, -1296.785870, -41027.594509, 50528.074431, 7255.977434, 
 -5281.636812, 39298.355527, -3440.450858, 0.00, 13717.870243, 
 -8471.405582, 2071.812204, 53928.060360, -8283.992464, 33507.951918, 
 -24087.375367, -5408.929644, 4089.672804, -37103.267653, -33839.612565, 
 18537.392481, 7026.518692, 54636.778527, -57375.986301, -5281.636812, 
 9735.061160, -45360.674033, 10634.633559, 0.00, -11652.364691, 
 15039.566630, -1202.539106, -293517.883778, 56991.742991, -183046.845592, 
 148311.355507, 42231.381747, -12524.380088, 219025.288975, 203889.695169, 
 -88370.612394, -24443.378205, -295133.248586, 

[jira] [Updated] (SPARK-4006) Spark Driver crashes whenever an Executor is registered twice

2014-12-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-4006:
--
Fix Version/s: (was: 1.1.2)
   1.1.1

 Spark Driver crashes whenever an Executor is registered twice
 -

 Key: SPARK-4006
 URL: https://issues.apache.org/jira/browse/SPARK-4006
 Project: Spark
  Issue Type: Bug
  Components: Block Manager, Spark Core
Affects Versions: 0.9.2, 1.0.2, 1.1.0, 1.2.0
 Environment: Mesos, Coarse Grained
Reporter: Tal Sliwowicz
Assignee: Tal Sliwowicz
Priority: Critical
 Fix For: 1.1.1, 1.2.0


 This is a huge robustness issue for us (Taboola) in mission-critical, 
 time-sensitive (real-time) Spark jobs.
 We have long-running Spark drivers, and even though we have state-of-the-art 
 hardware, executors disconnect from time to time. In many cases the 
 RemoveExecutor message is not received, and when the new executor registers, 
 the driver crashes. In Mesos coarse-grained mode, executor ids are fixed. 
 The issue is the System.exit(1) in BlockManagerMasterActor:
 {code}
  private def register(id: BlockManagerId, maxMemSize: Long, slaveActor: ActorRef) {
    if (!blockManagerInfo.contains(id)) {
      blockManagerIdByExecutor.get(id.executorId) match {
        case Some(manager) =>
          // A block manager of the same executor already exists.
          // This should never happen. Let's just quit.
          logError("Got two different block manager registrations on " + id.executorId)
          System.exit(1)
        case None =>
          blockManagerIdByExecutor(id.executorId) = id
      }
      logInfo("Registering block manager %s with %s RAM".format(
        id.hostPort, Utils.bytesToString(maxMemSize)))
      blockManagerInfo(id) =
        new BlockManagerInfo(id, System.currentTimeMillis(), maxMemSize, slaveActor)
    }
    listenerBus.post(SparkListenerBlockManagerAdded(id, maxMemSize))
  }
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2823) GraphX jobs throw IllegalArgumentException

2014-12-15 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247480#comment-14247480
 ] 

Josh Rosen commented on SPARK-2823:
---

I've removed the Fix Versions from this JIRA because its fix was reverted.

 GraphX jobs throw IllegalArgumentException
 --

 Key: SPARK-2823
 URL: https://issues.apache.org/jira/browse/SPARK-2823
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Reporter: Lu Lu

 If users set “spark.default.parallelism” and the value differs from the 
 EdgeRDD partition number, GraphX jobs will throw an IllegalArgumentException:
 14/07/26 21:06:51 WARN DAGScheduler: Creating new stage failed due to 
 exception - job: 1
 java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of 
 partitions
 at 
 org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:60)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
 at 
 org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:54)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
 at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getShuffleMapStage(DAGScheduler.scala:1
 97)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.s
 cala:272)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.s
 cala:269)
 at scala.collection.immutable.List.foreach(List.scala:318)
 at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$visit$1(DAGScheduler.scala:269)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.s
 cala:274)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.s
 cala:269)
 at scala.collection.immutable.List.foreach(List.scala:318)
 at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$visit$1(DAGScheduler.scala:269)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.s
 cala:274)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.s
 cala:269)
 at scala.collection.immutable.List.foreach(List.scala:318)
 at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$visit$1(DAGScheduler.scala:269)
 at 
 org.apache.spark.scheduler.DAGScheduler.getParentStages(DAGScheduler.scala:279)
 at 
 org.apache.spark.scheduler.DAGScheduler.newStage(DAGScheduler.scala:219)
 at 
 org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:672)
 at 
 org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1184)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
 at akka.actor.ActorCell.invoke(ActorCell.scala:456)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
 at akka.dispatch.Mailbox.run(Mailbox.scala:219)
 at 
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
 at 
 scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: 

[jira] [Updated] (SPARK-2823) GraphX jobs throw IllegalArgumentException

2014-12-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-2823:
--
Fix Version/s: (was: 1.1.2)
   (was: 1.0.3)
   (was: 1.2.0)

 GraphX jobs throw IllegalArgumentException
 --

 Key: SPARK-2823
 URL: https://issues.apache.org/jira/browse/SPARK-2823
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Reporter: Lu Lu

 If users set “spark.default.parallelism” and the value differs from the 
 EdgeRDD partition number, GraphX jobs will throw an IllegalArgumentException:
 14/07/26 21:06:51 WARN DAGScheduler: Creating new stage failed due to 
 exception - job: 1
 java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of 
 partitions
 at 
 org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:60)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
 at 
 org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:54)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
 at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getShuffleMapStage(DAGScheduler.scala:197)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.scala:272)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.scala:269)
 at scala.collection.immutable.List.foreach(List.scala:318)
 at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$visit$1(DAGScheduler.scala:269)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.scala:274)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.scala:269)
 at scala.collection.immutable.List.foreach(List.scala:318)
 at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$visit$1(DAGScheduler.scala:269)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.scala:274)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.scala:269)
 at scala.collection.immutable.List.foreach(List.scala:318)
 at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$visit$1(DAGScheduler.scala:269)
 at 
 org.apache.spark.scheduler.DAGScheduler.getParentStages(DAGScheduler.scala:279)
 at 
 org.apache.spark.scheduler.DAGScheduler.newStage(DAGScheduler.scala:219)
 at 
 org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:672)
 at 
 org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1184)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
 at akka.actor.ActorCell.invoke(ActorCell.scala:456)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
 at akka.dispatch.Mailbox.run(Mailbox.scala:219)
 at 
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
 at 
 scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
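
 For illustration, a minimal sketch that can provoke the mismatch described above (the edge-list 
 path and the partition counts below are made up, and whether a particular job fails depends on 
 which GraphX operation ends up zipping the differently-partitioned RDDs):
 {code}
 import org.apache.spark.{SparkConf, SparkContext}
 import org.apache.spark.graphx.GraphLoader

 object ParallelismMismatchSketch {
   def main(args: Array[String]): Unit = {
     val conf = new SparkConf()
       .setMaster("local[4]")
       .setAppName("parallelism-mismatch")
       .set("spark.default.parallelism", "7")   // differs from the edge partition count below
     val sc = new SparkContext(conf)
     // Load edges into 3 partitions; downstream shuffles may use the default
     // parallelism of 7, which is where "Can't zip RDDs with unequal numbers
     // of partitions" can surface.
     val graph = GraphLoader.edgeListFile(sc, "data/edges.txt", false, 3)
     println(graph.degrees.count())
     sc.stop()
   }
 }
 {code}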



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-2823) GraphX jobs throw IllegalArgumentException

2014-12-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-2823:
--
Target Version/s: 1.0.3, 1.3.0, 1.1.2, 1.2.1

 GraphX jobs throw IllegalArgumentException
 --

 Key: SPARK-2823
 URL: https://issues.apache.org/jira/browse/SPARK-2823
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Reporter: Lu Lu

 If users set “spark.default.parallelism” to a value different from the 
 EdgeRDD partition number, GraphX jobs will throw an IllegalArgumentException:
 14/07/26 21:06:51 WARN DAGScheduler: Creating new stage failed due to 
 exception - job: 1
 java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of 
 partitions
 at 
 org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:60)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
 at 
 org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:54)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
 at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getShuffleMapStage(DAGScheduler.scala:197)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.scala:272)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.scala:269)
 at scala.collection.immutable.List.foreach(List.scala:318)
 at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$visit$1(DAGScheduler.scala:269)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.scala:274)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.scala:269)
 at scala.collection.immutable.List.foreach(List.scala:318)
 at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$visit$1(DAGScheduler.scala:269)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.scala:274)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.scala:269)
 at scala.collection.immutable.List.foreach(List.scala:318)
 at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$visit$1(DAGScheduler.scala:269)
 at 
 org.apache.spark.scheduler.DAGScheduler.getParentStages(DAGScheduler.scala:279)
 at 
 org.apache.spark.scheduler.DAGScheduler.newStage(DAGScheduler.scala:219)
 at 
 org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:672)
 at 
 org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1184)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
 at akka.actor.ActorCell.invoke(ActorCell.scala:456)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
 at akka.dispatch.Mailbox.run(Mailbox.scala:219)
 at 
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
 at 
 scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4355) OnlineSummarizer doesn't merge mean correctly

2014-12-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-4355:
--
Fix Version/s: (was: 1.1.2)
   1.1.1

 OnlineSummarizer doesn't merge mean correctly
 -

 Key: SPARK-4355
 URL: https://issues.apache.org/jira/browse/SPARK-4355
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.0.2, 1.1.1, 1.2.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
 Fix For: 1.1.1, 1.2.0


 It happens when the mean on one side is zero. I will send an PR with some 
 code clean-up.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4320) JavaPairRDD should supply a saveAsNewHadoopDataset which takes a Job object

2014-12-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-4320:
--
Target Version/s: 1.1.2, 1.2.1
   Fix Version/s: (was: 1.1.2)
  (was: 1.2.0)

I've changed this issue's Fix Version/s into Target Version/s since it 
hasn't actually been fixed yet.

 JavaPairRDD should supply a saveAsNewHadoopDataset which takes a Job object 
 

 Key: SPARK-4320
 URL: https://issues.apache.org/jira/browse/SPARK-4320
 Project: Spark
  Issue Type: Improvement
  Components: Input/Output, Spark Core
Reporter: Corey J. Nolet

 I am outputting data to Accumulo using a custom OutputFormat. I have tried 
 using saveAsNewHadoopFile() and that works, though passing an empty path is a 
 bit weird. Being that it isn't really a file I'm storing, but rather a 
 generic Pair dataset, I'd be inclined to use the saveAsHadoopDataset() 
 method, though I'm not at all interested in using the legacy mapred API.
 Perhaps we could supply a saveAsNewHadoopDataset method. Personally, I think 
 there should be two ways of calling into this method. Instead of forcing the 
 user to always set up the Job object explicitly, I'm in the camp of having 
 the following method signature:
 saveAsNewHadoopDataset(keyClass : Class[K], valueClass : Class[V], ofclass : 
 Class[? extends OutputFormat], conf : Configuration). This way, if I'm 
 writing spark jobs that are going from Hadoop back into Hadoop, I can 
 construct my Configuration once.
 Perhaps an overloaded method signature could be:
 saveAsNewHadoopDataset(job : Job)
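
 For reference, a rough sketch of what the proposed convenience method could boil down to on top 
 of the existing saveAsNewAPIHadoopDataset(conf) call (the output format, key/value types and 
 output path below are placeholders, and Job.getInstance assumes the Hadoop 2 mapreduce API):
 {code}
 import org.apache.hadoop.conf.Configuration
 import org.apache.hadoop.fs.Path
 import org.apache.hadoop.io.Text
 import org.apache.hadoop.mapreduce.Job
 import org.apache.hadoop.mapreduce.lib.output.{FileOutputFormat, TextOutputFormat}
 import org.apache.spark.{SparkConf, SparkContext}
 import org.apache.spark.SparkContext._   // pair RDD functions in Spark 1.x

 object SaveAsNewHadoopDatasetSketch {
   def main(args: Array[String]): Unit = {
     val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("sketch"))
     val pairs = sc.parallelize(Seq("k1" -> "v1", "k2" -> "v2"))
       .map { case (k, v) => (new Text(k), new Text(v)) }

     // Configure the Job once (key/value/output format), then reuse its
     // Configuration -- essentially what the proposed
     // saveAsNewHadoopDataset(keyClass, valueClass, ofClass, conf) would wrap.
     val job = Job.getInstance(new Configuration())
     job.setOutputKeyClass(classOf[Text])
     job.setOutputValueClass(classOf[Text])
     job.setOutputFormatClass(classOf[TextOutputFormat[Text, Text]])
     FileOutputFormat.setOutputPath(job, new Path("/tmp/sketch-out"))

     pairs.saveAsNewAPIHadoopDataset(job.getConfiguration)
     sc.stop()
   }
 }
 {code}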



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4265) Better extensibility for TaskEndReason

2014-12-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-4265:
--
Fix Version/s: (was: 1.1.2)

Removing the Fix Version/s field, since we reserve that for versions where a 
fix was actually applied.  Please use Target Version/s to plan future fixes.

 Better extensibility for TaskEndReason
 --

 Key: SPARK-4265
 URL: https://issues.apache.org/jira/browse/SPARK-4265
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Shixiong Zhu
Priority: Minor
  Labels: api-change

 Now all subclasses of TaskEndReason are case classes. As per discussion in 
 https://github.com/apache/spark/pull/3073#discussion_r19920257 , it's hard to 
 extend them (such as adding/removing fields) without breaking compatibility 
 for pattern matching.
 It's better to change them to regular classes for extensibility.
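
 A toy illustration of the compatibility concern, outside Spark's actual classes (names here are 
 made up): adding a field to a case class changes the generated extractor, while a regular class 
 with a hand-written unapply can keep old match sites compiling:
 {code}
 // Toy example only; these are not Spark's TaskEndReason classes.
 sealed trait Reason

 // If this had been `case class ExecutorLost(execId: String)`, adding the
 // `time` field later would break every `case ExecutorLost(id)` match.
 class ExecutorLost(val execId: String, val time: Long = 0L) extends Reason

 object ExecutorLost {
   def apply(execId: String, time: Long = 0L): ExecutorLost = new ExecutorLost(execId, time)
   // The hand-written extractor keeps the original one-field pattern stable.
   def unapply(r: ExecutorLost): Option[String] = Some(r.execId)
 }

 object ReasonDemo extends App {
   val reason: Reason = ExecutorLost("exec-1", 42L)
   reason match {
     case ExecutorLost(id) => println(s"lost executor $id")  // still compiles after adding `time`
     case _                => println("other reason")
   }
 }
 {code}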



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4151) Add string operation function trim, ltrim, rtrim, length to support SparkSql (HiveQL)

2014-12-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-4151:
--
Target Version/s: 1.1.2  (was: 1.1.0)
   Fix Version/s: (was: 1.1.2)
  (was: 1.1.0)

Removing the Fix Version/s field, since we only use that to indicate where 
fixes have been applied, not where we plan to apply them; moved those versions 
over to Target Version/s instead.

 Add string operation function trim, ltrim, rtrim, length to support SparkSql 
 (HiveQL) 
 --

 Key: SPARK-4151
 URL: https://issues.apache.org/jira/browse/SPARK-4151
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.1.0
Reporter: shengli
Priority: Minor
   Original Estimate: 72h
  Remaining Estimate: 72h

 Add string operation functions (trim, ltrim, rtrim, length) to support Spark 
 SQL and HiveQL.
 e.g.:
 sql("select trim(' a b ') from src").collect() --> 'a b'
 sql("select ltrim(' a b ') from src").collect() --> 'a b ' 
 sql("select rtrim(' a b ') from src").collect() --> ' a b'
 sql("select length('ab') from src").collect() --> 2
 Also rename the trait in stringOperations.scala: I prefer to rename trait 
 CaseConversionExpression to StringTransformationExpression, which makes more 
 sense because the trait can then support more string transformations, not 
 only case conversion.
 Also add a trait StringCalculationExpression for string computations such as 
 length, indexOf, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4846) When the vocabulary size is large, Word2Vec may yield OutOfMemoryError: Requested array size exceeds VM limit

2014-12-15 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247487#comment-14247487
 ] 

Joseph K. Bradley commented on SPARK-4846:
--

I agree with [~srowen] that the current implementation has to serialize those 
big data structures, no matter what.  Splitting the big syn0Global and 
syn1Global data structures across partitions sounds possible, but I would guess 
that the Word2VecModel itself would then need to be distributed as well since 
it occupies the same order of memory.  A distributed Word2VecModel sounds like 
a much bigger PR.

In the meantime, a simpler & faster solution might be nice.  The easiest would 
be to catch the error and print a warning.  A fancier but better solution might 
be to automatically increase minCount as much as necessary (and print a warning 
about this automatic change).
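
As a user-side stopgap in the meantime (a sketch only, not the automatic handling suggested above; 
the minCount value is just an example), rare tokens can be pruned before fit(), which shrinks the 
vocabulary and therefore the syn0/syn1 arrays:
{code}
import org.apache.spark.SparkContext._   // pair RDD functions in Spark 1.x
import org.apache.spark.mllib.feature.Word2Vec
import org.apache.spark.rdd.RDD

object PruneVocabSketch {
  // Drop tokens that occur fewer than minCount times before training.
  def pruneRareWords(corpus: RDD[Seq[String]], minCount: Int): RDD[Seq[String]] = {
    val keep = corpus.flatMap(identity)
      .map((_, 1L))
      .reduceByKey(_ + _)
      .filter { case (_, count) => count >= minCount }
      .keys
      .collect()
      .toSet
    val keepBc = corpus.sparkContext.broadcast(keep)
    corpus.map(_.filter(keepBc.value.contains))
  }

  def train(corpus: RDD[Seq[String]]): Unit = {
    val pruned = pruneRareWords(corpus, minCount = 20)
    val model = new Word2Vec().setVectorSize(100).fit(pruned)
  }
}
{code}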

 When the vocabulary size is large, Word2Vec may yield OutOfMemoryError: 
 Requested array size exceeds VM limit
 ---

 Key: SPARK-4846
 URL: https://issues.apache.org/jira/browse/SPARK-4846
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.1.0
 Environment: Use Word2Vec to process a corpus (sized 3.5G) with one 
 partition.
 The corpus contains about 300 million words and its vocabulary size is about 
 10 million.
Reporter: Joseph Tang
Priority: Critical

 Exception in thread Driver java.lang.reflect.InvocationTargetException
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at 
 org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162)
 Caused by: java.lang.OutOfMemoryError: Requested array size exceeds VM limit 
 at java.util.Arrays.copyOf(Arrays.java:2271)
 at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
 at 
 java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
 at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
 at 
 java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1870)
 at 
 java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1779)
 at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1186)
 at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
 at 
 org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
 at 
 org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73)
 at 
 org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164)
 at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
 at org.apache.spark.SparkContext.clean(SparkContext.scala:1242)
 at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:610)
 at 
 org.apache.spark.mllib.feature.Word2Vec$$anonfun$fit$1.apply$mcVI$sp(Word2Vec.scala:291)
 at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
 at org.apache.spark.mllib.feature.Word2Vec.fit(Word2Vec.scala:290)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4232) Truncate table not works when specific the table from non-current database session

2014-12-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-4232:
--
Target Version/s: 1.1.2, 1.2.1  (was: 1.1.0)
   Fix Version/s: (was: 1.1.2)
  (was: 1.2.0)

Removing the Fix Version/s field, since we only use that to indicate where 
fixes have been applied, not where we plan to apply them; moved those versions 
over to Target Version/s instead.

 Truncate table not works when specific the table from non-current database 
 session
 --

 Key: SPARK-4232
 URL: https://issues.apache.org/jira/browse/SPARK-4232
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: shengli
Priority: Minor

 Currently, truncate table works fine when the table is in the current 
 database session, but it does not work when the table is qualified with a 
 database other than the current one.
 What I mean is:
 Assume we have two databases, default and dw, and a table named test_table in 
 database dw.
 By default we log in to the default database session, so if I run:
 use dw;
 truncate table test_table [partitions..];  it is OK.
 If I stay in the default database and run:
 use default;
 truncate table dw.test_table;
 it will throw an exception:
 Failed to parse: truncate table dw.test_table.
  line 1:17 missing EOF at '.' near 'dw'
 It's a bug in parsing truncate table xxx



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4323) Utils#fetchFile method should close lock file certainly

2014-12-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-4323.
---
Resolution: Not a Problem

Resolving this as Not a Problem since the pull request was closed after some 
discussion.

 Utils#fetchFile method should close lock file certainly
 ---

 Key: SPARK-4323
 URL: https://issues.apache.org/jira/browse/SPARK-4323
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: Kousuke Saruta

 In the Utils#fetchFile method, the lock file is created as follows.
 {code}
 val raf = new RandomAccessFile(lockFile, "rw")
 // Only one executor entry.
 // The FileLock is only used to control synchronization for executors download file,
 // it's always safe regardless of lock type (mandatory or advisory).
 val lock = raf.getChannel().lock()
 val cachedFile = new File(localDir, cachedFileName)
 try {
   if (!cachedFile.exists()) {
     doFetchFile(url, localDir, cachedFileName, conf, securityMgr, hadoopConf)
   }
 } finally {
   lock.release()
 }
 {code}
 If an error occurs between opening the RandomAccessFile and acquiring the lock, 
 the lock file may never be closed.
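
 For illustration, one way (sketched here, not Spark's actual code) to guarantee the 
 RandomAccessFile is closed even when acquiring the lock or the download itself throws:
 {code}
 import java.io.{File, RandomAccessFile}

 object LockFileSketch {
   // Close the RandomAccessFile in an outer finally so a failure while acquiring
   // or holding the lock cannot leak the underlying file handle.
   def withFileLock[T](lockFile: File)(body: => T): T = {
     val raf = new RandomAccessFile(lockFile, "rw")
     try {
       val lock = raf.getChannel().lock()
       try body finally lock.release()
     } finally {
       raf.close()
     }
   }
 }
 {code}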



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4793) way to find assembly jar is too strict

2014-12-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-4793:
--
Labels: backport-needed  (was: )

 way to find assembly jar is too strict
 --

 Key: SPARK-4793
 URL: https://issues.apache.org/jira/browse/SPARK-4793
 Project: Spark
  Issue Type: Improvement
  Components: Deploy
Affects Versions: 1.1.0
Reporter: Adrian Wang
Assignee: Adrian Wang
Priority: Minor
  Labels: backport-needed
 Fix For: 1.3.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2980) Python support for chi-squared test

2014-12-15 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247505#comment-14247505
 ] 

Joseph K. Bradley commented on SPARK-2980:
--

Duplicated by later JIRA which has been fixed

 Python support for chi-squared test
 ---

 Key: SPARK-2980
 URL: https://issues.apache.org/jira/browse/SPARK-2980
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Doris Xin





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-2980) Python support for chi-squared test

2014-12-15 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley closed SPARK-2980.

   Resolution: Duplicate
Fix Version/s: 1.2.0
 Assignee: Davies Liu

 Python support for chi-squared test
 ---

 Key: SPARK-2980
 URL: https://issues.apache.org/jira/browse/SPARK-2980
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Doris Xin
Assignee: Davies Liu
 Fix For: 1.2.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4855) Python tests for hypothesis testing

2014-12-15 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-4855:


 Summary: Python tests for hypothesis testing
 Key: SPARK-4855
 URL: https://issues.apache.org/jira/browse/SPARK-4855
 Project: Spark
  Issue Type: Test
  Components: MLlib, PySpark
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley
Assignee: Ben Cook
Priority: Minor


Add Python unit tests for Chi-Squared hypothesis testing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4814) Enable assertions in SBT, Maven tests / AssertionError from Hive's LazyBinaryInteger

2014-12-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-4814:
--
Fix Version/s: 1.2.1
   1.1.2
   1.3.0

 Enable assertions in SBT, Maven tests / AssertionError from Hive's 
 LazyBinaryInteger
 

 Key: SPARK-4814
 URL: https://issues.apache.org/jira/browse/SPARK-4814
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Affects Versions: 1.1.0
Reporter: Sean Owen
 Fix For: 1.3.0, 1.1.2, 1.2.1


 Follow up to SPARK-4159, wherein we noticed that Java tests weren't running 
 in Maven, in part because a Java test actually fails with {{AssertionError}}. 
 That code/test was fixed in SPARK-4850.
 The reason it wasn't caught by SBT tests was that they don't run with 
 assertions on, and Maven's surefire does.
 Turning on assertions in the SBT build is trivial, adding one line:
 {code}
 javaOptions in Test += "-ea",
 {code}
 This reveals a test failure in Scala test suites though:
 {code}
 [info] - alter_merge_2 *** FAILED *** (1 second, 305 milliseconds)
 [info]   Failed to execute query using catalyst:
 [info]   Error: Job aborted due to stage failure: Task 1 in stage 551.0 
 failed 1 times, most recent failure: Lost task 1.0 in stage 551.0 (TID 1532, 
 localhost): java.lang.AssertionError
 [info]at 
 org.apache.hadoop.hive.serde2.lazybinary.LazyBinaryInteger.init(LazyBinaryInteger.java:51)
 [info]at 
 org.apache.hadoop.hive.serde2.columnar.ColumnarStructBase$FieldInfo.uncheckedGetField(ColumnarStructBase.java:110)
 [info]at 
 org.apache.hadoop.hive.serde2.columnar.ColumnarStructBase.getField(ColumnarStructBase.java:171)
 [info]at 
 org.apache.hadoop.hive.serde2.objectinspector.ColumnarStructObjectInspector.getStructFieldData(ColumnarStructObjectInspector.java:166)
 [info]at 
 org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$1.apply(TableReader.scala:318)
 [info]at 
 org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$1.apply(TableReader.scala:314)
 [info]at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 [info]at 
 org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:132)
 [info]at 
 org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:128)
 [info]at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:615)
 [info]at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:615)
 [info]at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 [info]at 
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:264)
 [info]at org.apache.spark.rdd.RDD.iterator(RDD.scala:231)
 [info]at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 [info]at 
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:264)
 [info]at org.apache.spark.rdd.RDD.iterator(RDD.scala:231)
 [info]at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
 [info]at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 [info]at org.apache.spark.scheduler.Task.run(Task.scala:56)
 [info]at 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:195)
 [info]at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 [info]at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 [info]at java.lang.Thread.run(Thread.java:745)
 {code}
 The items for this JIRA are therefore:
 - Enable assertions in SBT
 - Fix this failure
 - Figure out why Maven scalatest didn't trigger it - may need assertions 
 explicitly turned on too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4814) Enable assertions in SBT, Maven tests / AssertionError from Hive's LazyBinaryInteger

2014-12-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-4814:
--
Target Version/s: 1.0.3  (was: 1.3.0)
Assignee: Sean Owen
  Labels: backport-needed  (was: )

Alright, I've merged Sean's PR into master, branch-1.2, and branch-1.1.  
There's a large-ish merge conflict that we'll have to fix to get this into 
branch-1.0 (or I can just manually fix things up there).  Tagging this as 
{{backport-needed}} so I remember to come back and do that.

 Enable assertions in SBT, Maven tests / AssertionError from Hive's 
 LazyBinaryInteger
 

 Key: SPARK-4814
 URL: https://issues.apache.org/jira/browse/SPARK-4814
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Affects Versions: 1.1.0
Reporter: Sean Owen
Assignee: Sean Owen
  Labels: backport-needed
 Fix For: 1.3.0, 1.1.2, 1.2.1


 Follow up to SPARK-4159, wherein we noticed that Java tests weren't running 
 in Maven, in part because a Java test actually fails with {{AssertionError}}. 
 That code/test was fixed in SPARK-4850.
 The reason it wasn't caught by SBT tests was that they don't run with 
 assertions on, and Maven's surefire does.
 Turning on assertions in the SBT build is trivial, adding one line:
 {code}
 javaOptions in Test += "-ea",
 {code}
 This reveals a test failure in Scala test suites though:
 {code}
 [info] - alter_merge_2 *** FAILED *** (1 second, 305 milliseconds)
 [info]   Failed to execute query using catalyst:
 [info]   Error: Job aborted due to stage failure: Task 1 in stage 551.0 
 failed 1 times, most recent failure: Lost task 1.0 in stage 551.0 (TID 1532, 
 localhost): java.lang.AssertionError
 [info]at 
 org.apache.hadoop.hive.serde2.lazybinary.LazyBinaryInteger.init(LazyBinaryInteger.java:51)
 [info]at 
 org.apache.hadoop.hive.serde2.columnar.ColumnarStructBase$FieldInfo.uncheckedGetField(ColumnarStructBase.java:110)
 [info]at 
 org.apache.hadoop.hive.serde2.columnar.ColumnarStructBase.getField(ColumnarStructBase.java:171)
 [info]at 
 org.apache.hadoop.hive.serde2.objectinspector.ColumnarStructObjectInspector.getStructFieldData(ColumnarStructObjectInspector.java:166)
 [info]at 
 org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$1.apply(TableReader.scala:318)
 [info]at 
 org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$1.apply(TableReader.scala:314)
 [info]at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 [info]at 
 org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:132)
 [info]at 
 org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:128)
 [info]at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:615)
 [info]at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:615)
 [info]at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 [info]at 
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:264)
 [info]at org.apache.spark.rdd.RDD.iterator(RDD.scala:231)
 [info]at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 [info]at 
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:264)
 [info]at org.apache.spark.rdd.RDD.iterator(RDD.scala:231)
 [info]at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
 [info]at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 [info]at org.apache.spark.scheduler.Task.run(Task.scala:56)
 [info]at 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:195)
 [info]at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 [info]at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 [info]at java.lang.Thread.run(Thread.java:745)
 {code}
 The items for this JIRA are therefore:
 - Enable assertions in SBT
 - Fix this failure
 - Figure out why Maven scalatest didn't trigger it - may need assertions 
 explicitly turned on too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4856) Null & empty string should not be considered as StringType at beginning in JSON schema inferring

2014-12-15 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-4856:


 Summary: Null & empty string should not be considered as 
StringType at beginning in JSON schema inferring
 Key: SPARK-4856
 URL: https://issues.apache.org/jira/browse/SPARK-4856
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Hao


We have data like:
{panel}
TestSQLContext.sparkContext.parallelize(
  """{"ip":"27.31.100.29","headers":{"Host":"1.abc.com","Charset":"UTF-8"}}""" ::
  """{"ip":"27.31.100.29","headers":{}}""" ::
  """{"ip":"27.31.100.29","headers":""}""" :: Nil)
{panel}

As the empty string will be considered as String at the beginning (in lines 2 and 
3), it ignores the real nested data type (the struct type in line 1) and also 
takes the "headers" of line 1 as String type.
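
A short way to see the effect (a sketch against the Spark 1.2-era SQLContext.jsonRDD API; object 
and application names here are illustrative):
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object JsonInferenceSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("json-sketch"))
    val sqlContext = new SQLContext(sc)
    val records = sc.parallelize(Seq(
      """{"ip":"27.31.100.29","headers":{"Host":"1.abc.com","Charset":"UTF-8"}}""",
      """{"ip":"27.31.100.29","headers":{}}""",
      """{"ip":"27.31.100.29","headers":""}"""))
    // With the data above, "headers" ends up inferred as StringType instead of
    // a struct, because the empty string in the later records conflicts with
    // the nested object seen in the first one -- the problem described here.
    sqlContext.jsonRDD(records).printSchema()
    sc.stop()
  }
}
{code}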



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4856) Null & empty string should not be considered as StringType at beginning in JSON schema inferring

2014-12-15 Thread Cheng Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Hao updated SPARK-4856:
-
Description: 
We have data like:
{panel}
TestSQLContext.sparkContext.parallelize(
  """{"ip":"27.31.100.29","headers":{"Host":"1.abc.com","Charset":"UTF-8"}}""" ::
  """{"ip":"27.31.100.29","headers":{}}""" ::
  """{"ip":"27.31.100.29","headers":""}""" :: Nil)
{panel}

As the empty string ("headers") will be considered as String at the beginning 
(in lines 2 and 3), it ignores the real nested data type (the struct type 
"headers" in line 1) and also takes the "headers" of line 1 as String type, 
which is not what we expect.

  was:
We have data like:
{panel}
TestSQLContext.sparkContext.parallelize(
  """{"ip":"27.31.100.29","headers":{"Host":"1.abc.com","Charset":"UTF-8"}}""" ::
  """{"ip":"27.31.100.29","headers":{}}""" ::
  """{"ip":"27.31.100.29","headers":""}""" :: Nil)
{panel}

As the empty string will be considered as String at the beginning (in lines 2 and 
3), it ignores the real nested data type (the struct type in line 1) and also 
takes the "headers" of line 1 as String type.


 Null & empty string should not be considered as StringType at beginning in 
 JSON schema inferring
 ---

 Key: SPARK-4856
 URL: https://issues.apache.org/jira/browse/SPARK-4856
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Hao

 We have data like:
 {panel}
 TestSQLContext.sparkContext.parallelize(
   """{"ip":"27.31.100.29","headers":{"Host":"1.abc.com","Charset":"UTF-8"}}""" ::
   """{"ip":"27.31.100.29","headers":{}}""" ::
   """{"ip":"27.31.100.29","headers":""}""" :: Nil)
 {panel}
 As the empty string ("headers") will be considered as String at the beginning 
 (in lines 2 and 3), it ignores the real nested data type (the struct type 
 "headers" in line 1) and also takes the "headers" of line 1 as String type, 
 which is not what we expect.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4856) Null & empty string should not be considered as StringType at beginning in JSON schema inferring

2014-12-15 Thread Cheng Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Hao updated SPARK-4856:
-
Description: 
We have data like:
{code:java}
TestSQLContext.sparkContext.parallelize(
  """{"ip":"27.31.100.29","headers":{"Host":"1.abc.com","Charset":"UTF-8"}}""" ::
  """{"ip":"27.31.100.29","headers":{}}""" ::
  """{"ip":"27.31.100.29","headers":""}""" :: Nil)
{code}

As the empty string ("headers") will be considered as String at the beginning 
(in lines 2 and 3), it ignores the real nested data type (the struct type 
"headers" in line 1) and also takes the "headers" of line 1 as String type, 
which is not what we expect.

  was:
We have data like:
{panel}
TestSQLContext.sparkContext.parallelize(
  """{"ip":"27.31.100.29","headers":{"Host":"1.abc.com","Charset":"UTF-8"}}""" ::
  """{"ip":"27.31.100.29","headers":{}}""" ::
  """{"ip":"27.31.100.29","headers":""}""" :: Nil)
{panel}

As the empty string ("headers") will be considered as String at the beginning 
(in lines 2 and 3), it ignores the real nested data type (the struct type 
"headers" in line 1) and also takes the "headers" of line 1 as String type, 
which is not what we expect.


 Null & empty string should not be considered as StringType at beginning in 
 JSON schema inferring
 ---

 Key: SPARK-4856
 URL: https://issues.apache.org/jira/browse/SPARK-4856
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Hao

 We have data like:
 {code:java}
 TestSQLContext.sparkContext.parallelize(
   """{"ip":"27.31.100.29","headers":{"Host":"1.abc.com","Charset":"UTF-8"}}""" ::
   """{"ip":"27.31.100.29","headers":{}}""" ::
   """{"ip":"27.31.100.29","headers":""}""" :: Nil)
 {code}
 As the empty string ("headers") will be considered as String at the beginning 
 (in lines 2 and 3), it ignores the real nested data type (the struct type 
 "headers" in line 1) and also takes the "headers" of line 1 as String type, 
 which is not what we expect.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4856) Null & empty string should not be considered as StringType at beginning in JSON schema inferring

2014-12-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247571#comment-14247571
 ] 

Apache Spark commented on SPARK-4856:
-

User 'chenghao-intel' has created a pull request for this issue:
https://github.com/apache/spark/pull/3708

 Null & empty string should not be considered as StringType at beginning in 
 JSON schema inferring
 ---

 Key: SPARK-4856
 URL: https://issues.apache.org/jira/browse/SPARK-4856
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Hao

 We have data like:
 {code:java}
 TestSQLContext.sparkContext.parallelize(
   """{"ip":"27.31.100.29","headers":{"Host":"1.abc.com","Charset":"UTF-8"}}""" ::
   """{"ip":"27.31.100.29","headers":{}}""" ::
   """{"ip":"27.31.100.29","headers":""}""" :: Nil)
 {code}
 As the empty string ("headers") will be considered as String at the beginning 
 (in lines 2 and 3), it ignores the real nested data type (the struct type 
 "headers" in line 1) and also takes the "headers" of line 1 as String type, 
 which is not what we expect.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4857) Add Executor Events to SparkListener

2014-12-15 Thread Kostas Sakellis (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247578#comment-14247578
 ] 

Kostas Sakellis commented on SPARK-4857:


I'll work on this.

 Add Executor Events to SparkListener
 

 Key: SPARK-4857
 URL: https://issues.apache.org/jira/browse/SPARK-4857
 Project: Spark
  Issue Type: Improvement
Reporter: Kostas Sakellis

 We need to add events to SparkListener to indicate that an executor has been 
 added or removed, along with the corresponding information. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4857) Add Executor Events to SparkListener

2014-12-15 Thread Kostas Sakellis (JIRA)
Kostas Sakellis created SPARK-4857:
--

 Summary: Add Executor Events to SparkListener
 Key: SPARK-4857
 URL: https://issues.apache.org/jira/browse/SPARK-4857
 Project: Spark
  Issue Type: Improvement
Reporter: Kostas Sakellis


We need to add events to SparkListener to indicate that an executor has been 
added or removed, along with the corresponding information. 
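
One plausible shape for such events, sketched only for illustration (these names are made up; the 
real events were still to be designed when this issue was filed):
{code}
// Illustrative only -- not the classes that ended up in Spark.
case class ExecutorInfoSketch(host: String, totalCores: Int, logUrls: Map[String, String])

sealed trait SparkListenerEventSketch
case class ExecutorAddedSketch(time: Long, executorId: String, info: ExecutorInfoSketch)
  extends SparkListenerEventSketch
case class ExecutorRemovedSketch(time: Long, executorId: String, reason: String)
  extends SparkListenerEventSketch

trait SparkListenerSketch {
  def onExecutorAdded(event: ExecutorAddedSketch): Unit = {}
  def onExecutorRemoved(event: ExecutorRemovedSketch): Unit = {}
}
{code}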



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3702) Standardize MLlib classes for learners, models

2014-12-15 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247582#comment-14247582
 ] 

Joseph K. Bradley commented on SPARK-3702:
--

I'm canceling my WIP PR for this since I have begun breaking that PR into 
smaller PRs.
The WIP PR branch is in [my ml-api branch | 
https://github.com/jkbradley/spark/tree/ml-api].

Here's the description of the WIP PR:

This is a WIP effort to standardize abstractions and the developer API for prediction 
tasks (classification and regression) in the new ML API (org.apache.spark.ml).
* Please comment on:
** abstractions, class hierarchy
** functionality required by each abstraction
** naming of types and methods
** ease of use for developers
** ease of use for users migrating from org.apache.spark.mllib
* Please ignore for now:
** missing tests and examples
** private/public API (I will make more things private to ml after writing 
tests and examples.)
** style and other details
** the many TODO items noted in the code

Please refer to [https://issues.apache.org/jira/browse/SPARK-3702] for some 
discussion on design, and [this design doc | 
https://docs.google.com/document/d/1I-8PD0DSLEZzzXURYZwmqAFn_OMBc08hgDL1FZnVBmw/]
 for major design decisions.

This is not intended to cover all algorithms; e.g., one big missing item is 
porting the GeneralizedLinearModel class to the new API.  But it hopefully lays 
a fair amount of groundwork.

I have included a limited number of concrete classes in this WIP PR, for 
purposes of illustration:
* LogisticRegression (edited, to show effects of abstract classes)
* NaiveBayes (simple to show ease of use for developers)
* AdaBoost (demonstration of meta-algorithms taking advantage of abstractions)
** (Note discussion of strong vs. weak types for ensemble methods in design 
doc.)
** This implementation is very incomplete but illustrates using the 
abstractions.
* LinearRegression (example of Regressor, for completeness)
* evaluators (to provide default evaluators in the class hierarchy)
* IterativeSolver and IterativeEstimator (to expose iterative algorithms)
* LabeledPoint (Q: Should this include an instance weight?)

Items remaining:
- [ ] helper method for simulating a distribution over weighted instances by 
subsampling (for algorithms which do not support instance weights)
- [ ] several TODO items noted in the code
- [ ] add tests and examples
- [ ] general cleanup
- [ ] make more of hierarchy private to ml
- [ ] split into several smaller PRs

General plan for splitting into multiple PRs, in order:
1. Simple class hierarchy
2. Evaluators
3. IterativeEstimator
4. AdaBoost
5. NaiveBayes (Any time after Evaluators)

Thanks to @epahomov and @BigCrunsh for input, including from 
[https://github.com/apache/spark/pull/2137] which improves upon the 
org.apache.spark.mllib APIs.


 Standardize MLlib classes for learners, models
 --

 Key: SPARK-3702
 URL: https://issues.apache.org/jira/browse/SPARK-3702
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley
Priority: Blocker

 Summary: Create a class hierarchy for learning algorithms and the models 
 those algorithms produce.
 This is a super-task of several sub-tasks (but JIRA does not allow subtasks 
 of subtasks).  See the requires links below for subtasks.
 Goals:
 * give intuitive structure to API, both for developers and for generated 
 documentation
 * support meta-algorithms (e.g., boosting)
 * support generic functionality (e.g., evaluation)
 * reduce code duplication across classes
 [Design doc for class hierarchy | 
 https://docs.google.com/document/d/1I-8PD0DSLEZzzXURYZwmqAFn_OMBc08hgDL1FZnVBmw/]
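
 For readers skimming the thread, a toy sketch of the learner/model split being discussed (names 
 are invented here and do not match the org.apache.spark.ml classes; see the design doc above for 
 the real proposal):
 {code}
 import org.apache.spark.rdd.RDD

 // Toy shapes only -- not the org.apache.spark.ml API.
 case class LabeledPointSketch(label: Double, features: Array[Double])

 abstract class ModelSketch extends Serializable {
   def predict(features: Array[Double]): Double
   def predict(data: RDD[Array[Double]]): RDD[Double] = data.map(x => predict(x))
 }

 abstract class LearnerSketch[M <: ModelSketch] {
   // A meta-algorithm (e.g. boosting) can be written against this single method.
   def train(data: RDD[LabeledPointSketch]): M
 }
 {code}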



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4838) StackOverflowError when serialization task

2014-12-15 Thread Hong Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247629#comment-14247629
 ] 

Hong Shen commented on SPARK-4838:
--

Here is the SQL; it contains 2928 partitions.
{code:title=Bar.java|borderStyle=solid}
SELECT
DISTINCT a.uin
FROM
(
SELECT
DISTINCT uin AS uin
FROM
hlw :: t_dw_pf00135 a1
WHERE
imp_date BETWEEN 2014081008 AND 2014121007
AND (
oper1 LIKE '%设置气泡%'
AND oper2 <> '0'
)
) a
{code}

Here is the full stack.
Error message from Spark is: Job aborted due to stage failure: Task 
serialization failed: java.lang.StackOverflowError
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)

[jira] [Commented] (SPARK-4844) SGD should support custom sampling.

2014-12-15 Thread Guoqiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247652#comment-14247652
 ] 

Guoqiang Li commented on SPARK-4844:


Sorry, I meant that all of the data needs to be serialized in {{RDD.sample}}; it is 
very inefficient. 

 SGD should support custom sampling.
 ---

 Key: SPARK-4844
 URL: https://issues.apache.org/jira/browse/SPARK-4844
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Guoqiang Li
 Fix For: 1.3.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4838) StackOverflowError when serialization task

2014-12-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247662#comment-14247662
 ] 

Sean Owen commented on SPARK-4838:
--

[~shenhong] This stack trace is very large but does not show any new 
information. What I meant was, is there anything different at the root? or was 
it not present in your logs? Obviously the problem is a serialization graph 
that is way too deeply nested, so more copies of these 5 lines aren't needed, 
but it might help to show where the call originated.

 StackOverflowError when serialization task
 --

 Key: SPARK-4838
 URL: https://issues.apache.org/jira/browse/SPARK-4838
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Affects Versions: 1.1.0
Reporter: Hong Shen

 When running a SQL query with more than 2000 partitions, each partition being a 
 HadoopRDD, it will cause a java.lang.StackOverflowError when serializing the task.
  Error message from Spark is: Job aborted due to stage failure: Task 
 serialization failed: java.lang.StackOverflowError
 java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
 java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
 java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
 java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
 java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
 java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
 java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
 ..



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4855) Python tests for hypothesis testing

2014-12-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247665#comment-14247665
 ] 

Apache Spark commented on SPARK-4855:
-

User 'jbencook' has created a pull request for this issue:
https://github.com/apache/spark/pull/3679

 Python tests for hypothesis testing
 ---

 Key: SPARK-4855
 URL: https://issues.apache.org/jira/browse/SPARK-4855
 Project: Spark
  Issue Type: Test
  Components: MLlib, PySpark
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley
Assignee: Ben Cook
Priority: Minor

 Add Python unit tests for Chi-Squared hypothesis testing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4838) StackOverflowError when serialization task

2014-12-15 Thread Hong Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247676#comment-14247676
 ] 

Hong Shen commented on SPARK-4838:
--

This is the whole stack.
All we can know is that it is thrown from DAGScheduler.submitMissingTasks, when 
serializing stage.rdd.
{code}
var taskBinary: Broadcast[Array[Byte]] = null
try {
  // For ShuffleMapTask, serialize and broadcast (rdd, shuffleDep).
  // For ResultTask, serialize and broadcast (rdd, func).
  val taskBinaryBytes: Array[Byte] =
    if (stage.isShuffleMap) {
      closureSerializer.serialize((stage.rdd, stage.shuffleDep.get) : AnyRef).array()
    } else {
      closureSerializer.serialize((stage.rdd, stage.resultOfJob.get.func) : AnyRef).array()
    }
  taskBinary = sc.broadcast(taskBinaryBytes)
} catch {
  // In the case of a failure during serialization, abort the stage.
  case e: NotSerializableException =>
    abortStage(stage, "Task not serializable: " + e.toString)
    runningStages -= stage
    return
  case NonFatal(e) =>
    abortStage(stage, s"Task serialization failed: $e\n${e.getStackTraceString}")
    runningStages -= stage
    return
}
{code}
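
To see the failure mode in isolation (a standalone sketch, nothing Spark-specific): Java 
serialization walks the object graph recursively, so a graph that nests a few thousand levels 
deep, like a plan built from thousands of chained or unioned RDDs, blows the stack; the exact 
depth needed depends on the JVM's thread stack size.
{code}
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

object DeepGraphSketch {
  case class Node(child: Option[Node])

  def main(args: Array[String]): Unit = {
    // Build a linked structure nested ~10000 levels deep (iteratively, no recursion here).
    val deep = (1 to 10000).foldLeft(Option.empty[Node])((child, _) => Some(Node(child)))
    val oos = new ObjectOutputStream(new ByteArrayOutputStream())
    // Recursing through writeObject0/defaultWriteFields for each level
    // eventually throws java.lang.StackOverflowError.
    oos.writeObject(Node(deep))
  }
}
{code}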


 StackOverflowError when serialization task
 --

 Key: SPARK-4838
 URL: https://issues.apache.org/jira/browse/SPARK-4838
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Affects Versions: 1.1.0
Reporter: Hong Shen

 When run a sql with more than 2000 partitions, each partition a  HadoopRDD, 
 it will cause java.lang.StackOverflowError at serialize task.
  Error message from spark is:Job aborted due to stage failure: Task 
 serialization failed: java.lang.StackOverflowError
 java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
 java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
 java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
 java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
 java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
 java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
 java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
 ..



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4858) Add an option to turn off a progress bar in spark-shell

2014-12-15 Thread Takeshi Yamamuro (JIRA)
Takeshi Yamamuro created SPARK-4858:
---

 Summary: Add an option to turn off a progress bar in spark-shell
 Key: SPARK-4858
 URL: https://issues.apache.org/jira/browse/SPARK-4858
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core, Spark Shell
Reporter: Takeshi Yamamuro
Priority: Minor


Add an '--no-progress-bar' option to easily turn off a progress bar in 
spark-shell for users who'd like to look into debug logs or something.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4858) Add an option to turn off a progress bar in spark-shell

2014-12-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247722#comment-14247722
 ] 

Apache Spark commented on SPARK-4858:
-

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/3709

 Add an option to turn off a progress bar in spark-shell
 ---

 Key: SPARK-4858
 URL: https://issues.apache.org/jira/browse/SPARK-4858
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core, Spark Shell
Reporter: Takeshi Yamamuro
Priority: Minor

 Add an '--no-progress-bar' option to easily turn off a progress bar in 
 spark-shell for users who'd like to look into debug logs or something.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4859) Improve StreamingListenerBus

2014-12-15 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-4859:
---

 Summary: Improve StreamingListenerBus
 Key: SPARK-4859
 URL: https://issues.apache.org/jira/browse/SPARK-4859
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Reporter: Shixiong Zhu
Priority: Minor


Fix the race condition of `queueFullErrorMessageLogged`.
Log the error from listener rather than crashing `listenerThread`.
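
Roughly, the two changes amount to something like the following sketch (illustrative names only, 
not the actual StreamingListenerBus code):
{code}
import java.util.concurrent.atomic.AtomicBoolean

// Illustrative sketch only.
class ListenerBusSketch[E](listeners: Seq[E => Unit]) {
  private val queueFullErrorMessageLogged = new AtomicBoolean(false)

  // compareAndSet makes the "log this only once" check race-free.
  def onQueueFull(): Unit =
    if (queueFullErrorMessageLogged.compareAndSet(false, true)) {
      System.err.println("Dropping event because the listener queue is full")
    }

  // Catch per-listener failures so one bad listener cannot kill the dispatch thread.
  def postToAll(event: E): Unit =
    listeners.foreach { listener =>
      try listener(event)
      catch {
        case e: Exception => System.err.println(s"Listener threw an exception: $e")
      }
    }
}
{code}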



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4859) Improve StreamingListenerBus

2014-12-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247801#comment-14247801
 ] 

Apache Spark commented on SPARK-4859:
-

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/3710

 Improve StreamingListenerBus
 

 Key: SPARK-4859
 URL: https://issues.apache.org/jira/browse/SPARK-4859
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Reporter: Shixiong Zhu
Priority: Minor

 Fix the race condition of `queueFullErrorMessageLogged`.
 Log the error from listener rather than crashing `listenerThread`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4792) Add some checks and messages on making local dir

2014-12-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-4792.
---
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 3635
[https://github.com/apache/spark/pull/3635]

 Add some checks and messages on making local dir
 

 Key: SPARK-4792
 URL: https://issues.apache.org/jira/browse/SPARK-4792
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: meiyoula
Priority: Minor
 Fix For: 1.3.0


 There is currently no check on whether creating the local dir succeeds; if 
 creation fails, an error/warning message should be logged. The code should also 
 check whether the dir already exists before trying to create it.
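 
 As a rough illustration, here is a minimal sketch of those checks, not the 
 actual Spark code; the helper name and logging below are hypothetical.
 {code}
 import java.io.File
 
 object LocalDirSketch {
   // Skip creation if the dir already exists, and warn when mkdirs() fails
   // instead of silently continuing.
   def createLocalDir(path: String): Option[File] = {
     val dir = new File(path)
     if (dir.isDirectory) {
       Some(dir)                // already exists, nothing to create
     } else if (dir.mkdirs()) {
       Some(dir)                // created successfully
     } else {
       System.err.println(s"Failed to create local dir: $path")
       None
     }
   }
 }
 {code}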



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4792) Add some checks and messages on making local dir

2014-12-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-4792:
--
Assignee: meiyoula

 Add some checks and messages on making local dir
 

 Key: SPARK-4792
 URL: https://issues.apache.org/jira/browse/SPARK-4792
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: meiyoula
Assignee: meiyoula
Priority: Minor
 Fix For: 1.3.0


 There is currently no check on whether creating the local dir succeeds; if 
 creation fails, an error/warning message should be logged. The code should also 
 check whether the dir already exists before trying to create it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4841) Batch serializer bug in PySpark's RDD.zip

2014-12-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-4841:
--
Fix Version/s: 1.3.0
   Labels: backport-needed  (was: )

I've merged https://github.com/apache/spark/pull/3706 into master; adding a 
{{backport-needed}} label so that this makes it into 1.2.1.

 Batch serializer bug in PySpark's RDD.zip
 -

 Key: SPARK-4841
 URL: https://issues.apache.org/jira/browse/SPARK-4841
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.2.0
Reporter: Xiangrui Meng
Assignee: Davies Liu
Priority: Blocker
  Labels: backport-needed
 Fix For: 1.3.0


 {code}
 t = sc.textFile("README.md")
 t.zip(t).count()
 {code}
 {code}
 Py4JJavaError Traceback (most recent call last)
 <ipython-input-6-60fdeb8339fd> in <module>()
 ----> 1 readme.zip(readme).count()
 /Users/meng/src/spark/python/pyspark/rdd.pyc in count(self)
 817 3
 818 
 --> 819 return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
 820
 821 def stats(self):
 /Users/meng/src/spark/python/pyspark/rdd.pyc in sum(self)
 808 6.0
 809 
 --> 810 return self.mapPartitions(lambda x: 
 [sum(x)]).reduce(operator.add)
 811
 812 def count(self):
 /Users/meng/src/spark/python/pyspark/rdd.pyc in reduce(self, f)
 713 yield reduce(f, iterator, initial)
 714
 --> 715 vals = self.mapPartitions(func).collect()
 716 if vals:
 717 return reduce(f, vals)
 /Users/meng/src/spark/python/pyspark/rdd.pyc in collect(self)
 674 
 675 with SCCallSiteSync(self.context) as css:
 --> 676 bytesInJava = self._jrdd.collect().iterator()
 677 return list(self._collect_iterator_through_file(bytesInJava))
 678
 /Users/meng/src/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in 
 __call__(self, *args)
 536 answer = self.gateway_client.send_command(command)
 537 return_value = get_return_value(answer, self.gateway_client,
 --> 538 self.target_id, self.name)
 539
 540 for temp_arg in temp_args:
 /Users/meng/src/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in 
 get_return_value(answer, gateway_client, target_id, name)
 298 raise Py4JJavaError(
 299 'An error occurred while calling {0}{1}{2}.\n'.
 --> 300 format(target_id, '.', name), value)
 301 else:
 302 raise Py4JError(
 Py4JJavaError: An error occurred while calling o69.collect.
 : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 
 (TID 2, localhost): org.apache.spark.api.python.PythonException: Traceback 
 (most recent call last):
   File /Users/meng/src/spark/python/pyspark/worker.py, line 107, in main
 process()
   File /Users/meng/src/spark/python/pyspark/worker.py, line 98, in process
 serializer.dump_stream(func(split_index, iterator), outfile)
   File /Users/meng/src/spark/python/pyspark/serializers.py, line 198, in 
 dump_stream
 self.serializer.dump_stream(self._batched(iterator), stream)
   File /Users/meng/src/spark/python/pyspark/serializers.py, line 81, in 
 dump_stream
 raise NotImplementedError
 NotImplementedError
   at 
 org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:137)
   at 
 org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:174)
   at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:96)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
   at 
 org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
   at 
 org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:242)
   at 
 org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
   at 
 org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1460)
   at 
 org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:203)
 Driver stacktrace:
   at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214)
   at 
 

[jira] [Commented] (SPARK-4790) Flaky test in ReceivedBlockTrackerSuite: block addition, block to batch allocation, and cleanup with write ahead log

2014-12-15 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247900#comment-14247900
 ] 

Josh Rosen commented on SPARK-4790:
---

/cc [~hshreedharan], you might want to take a look at this issue, too.  This 
test seems to fail intermittently on Jenkins.  Since this is a new test, I 
think we should fix its flakiness in its own PR.

 Flaky test in ReceivedBlockTrackerSuite: block addition, block to batch 
 allocation, and cleanup with write ahead log
 --

 Key: SPARK-4790
 URL: https://issues.apache.org/jira/browse/SPARK-4790
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.3.0
Reporter: Josh Rosen
Assignee: Tathagata Das
  Labels: flaky-test

 Found another flaky streaming test, 
 org.apache.spark.streaming.ReceivedBlockTrackerSuite.block addition, block 
 to batch allocation and cleanup with write ahead log:
 {code}
 Error Message
 File /tmp/1418069118106-0/receivedBlockMetadata/log-0-1000 does not exist.
 Stacktrace
 sbt.ForkMain$ForkError: File 
 /tmp/1418069118106-0/receivedBlockMetadata/log-0-1000 does not exist.
   at 
 org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:397)
   at 
 org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:324)
   at 
 org.apache.spark.streaming.util.WriteAheadLogSuite$.getLogFilesInDirectory(WriteAheadLogSuite.scala:344)
   at 
 org.apache.spark.streaming.ReceivedBlockTrackerSuite.getWriteAheadLogFiles(ReceivedBlockTrackerSuite.scala:248)
   at 
 org.apache.spark.streaming.ReceivedBlockTrackerSuite$$anonfun$4.apply$mcV$sp(ReceivedBlockTrackerSuite.scala:173)
   at 
 org.apache.spark.streaming.ReceivedBlockTrackerSuite$$anonfun$4.apply(ReceivedBlockTrackerSuite.scala:96)
   at 
 org.apache.spark.streaming.ReceivedBlockTrackerSuite$$anonfun$4.apply(ReceivedBlockTrackerSuite.scala:96)
   at 
 org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
   at org.scalatest.Transformer.apply(Transformer.scala:22)
   at org.scalatest.Transformer.apply(Transformer.scala:20)
   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
   at 
 org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
   at 
 org.apache.spark.streaming.ReceivedBlockTrackerSuite.org$scalatest$BeforeAndAfter$$super$runTest(ReceivedBlockTrackerSuite.scala:41)
   at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200)
   at 
 org.apache.spark.streaming.ReceivedBlockTrackerSuite.runTest(ReceivedBlockTrackerSuite.scala:41)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
   at 
 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
   at 
 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
   at 
 org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
   at org.scalatest.Suite$class.run(Suite.scala:1424)
   at 
 org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
   at 
 org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
   at 
 org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
   at 
 org.apache.spark.streaming.ReceivedBlockTrackerSuite.org$scalatest$BeforeAndAfter$$super$run(ReceivedBlockTrackerSuite.scala:41)
   at org.scalatest.BeforeAndAfter$class.run(BeforeAndAfter.scala:241)
   at