[jira] [Created] (SPARK-2613) CLONE - word2vec: Distributed Representation of Words

2014-07-22 Thread Yifan Yang (JIRA)
Yifan Yang created SPARK-2613:
-

 Summary: CLONE - word2vec: Distributed Representation of Words
 Key: SPARK-2613
 URL: https://issues.apache.org/jira/browse/SPARK-2613
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Yifan Yang
Assignee: Liquan Pei


We would like to add parallel implementation of word2vec to MLlib. word2vec 
finds distributed representation of words through training of large data sets. 
The Spark programming model fits nicely with word2vec as the training algorithm 
of word2vec is embarrassingly parallel. We will focus on skip-gram model and 
negative sampling in our initial implementation. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2421) Spark should treat writable as serializable for keys

2014-07-22 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14069880#comment-14069880
 ] 

Sandy Ryza commented on SPARK-2421:
---

It should be relatively straightforward to add a WritableSerializer.  One issue 
is that Spark doesn't pass the types in the conf in the way MR does, so on the 
read side we need a way to know what kind of objects to instantiate.  I'm 
messing around with a prototype that just writes out the class name as Text at 
the beginning of the stream.
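
A rough sketch of that prototype idea (the helper names below are illustrative, not an actual Spark API): prefix the stream with the class name as Text, and use it on the read side to decide what to instantiate.

{code}
import java.io.{DataInput, DataOutput}
import org.apache.hadoop.io.{Text, Writable}

object WritableSerde {
  // Write the class name first (as Text), then the Writable's own fields.
  def writeWritable(out: DataOutput, w: Writable): Unit = {
    new Text(w.getClass.getName).write(out)
    w.write(out)
  }

  // Read the class name, instantiate the Writable reflectively (assumes a
  // no-arg constructor, as Hadoop requires), then read its fields.
  def readWritable(in: DataInput): Writable = {
    val className = new Text()
    className.readFields(in)
    val w = Class.forName(className.toString).newInstance().asInstanceOf[Writable]
    w.readFields(in)
    w
  }
}
{code}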

 Spark should treat writable as serializable for keys
 

 Key: SPARK-2421
 URL: https://issues.apache.org/jira/browse/SPARK-2421
 Project: Spark
  Issue Type: Improvement
  Components: Input/Output, Java API
Affects Versions: 1.0.0
Reporter: Xuefu Zhang

 It seems that Spark requires keys to be serializable (the class must implement 
 the Serializable interface). In the Hadoop world, the Writable interface is used 
 for the same purpose. A lot of existing classes, while Writable, are not 
 considered serializable by Spark. It would be nice if Spark could treat Writable 
 as serializable and automatically serialize and de-serialize these classes using 
 the Writable interface.
 This was identified in HIVE-7279, but its benefits are global.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2612) ALS has data skew for popular product

2014-07-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14069883#comment-14069883
 ] 

Apache Spark commented on SPARK-2612:
-

User 'renozhang' has created a pull request for this issue:
https://github.com/apache/spark/pull/1521

 ALS has data skew for popular product
 -

 Key: SPARK-2612
 URL: https://issues.apache.org/jira/browse/SPARK-2612
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.0.0
Reporter: Peng Zhang

 Usually there are some popular products that are related to many users in the 
 Rating inputs.
 groupByKey() in updateFeatures() may cause one extra shuffle stage that gathers 
 the data for a popular product into one task, because its RDD's partitioner may 
 not be used as the join() partitioner.
 The following join() then needs to shuffle the aggregated product data. The 
 shuffle block can easily be bigger than 2G, and the shuffle fails as mentioned 
 in SPARK-1476; increasing the number of blocks doesn't help.
 IMHO, groupByKey() should use the same partitioner as the other RDD in the 
 join(). Then groupByKey() and join() will be in the same stage, and the shuffle 
 data from many previous tasks will not hit the 2G limit.
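
A minimal sketch of the suggested change (the variable names and types are illustrative, not the actual ALS code): pass the join-side RDD's partitioner to groupByKey() so both operations land in the same stage.

{code}
import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

def groupAndJoin(ratings: RDD[(Int, Double)],
                 factors: RDD[(Int, Array[Double])]) = {
  // Reuse the partitioner of the RDD we are about to join with (fall back to hash).
  val part = factors.partitioner.getOrElse(new HashPartitioner(factors.partitions.length))
  // groupByKey(part) is co-partitioned with the join below, so no extra shuffle of
  // the aggregated popular-product groups is needed.
  ratings.groupByKey(part).join(factors)
}
{code}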



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2614) Add the spark-examples-xxx-.jar to the Debian package created by assembly/pom.xml (e.g. -PDeb)

2014-07-22 Thread Christian Tzolov (JIRA)
Christian Tzolov created SPARK-2614:
---

 Summary: Add the spark-examples-xxx-.jar to the Debian package 
created by assembly/pom.xml (e.g. -PDeb)
 Key: SPARK-2614
 URL: https://issues.apache.org/jira/browse/SPARK-2614
 Project: Spark
  Issue Type: Improvement
  Components: Build, Deploy
Reporter: Christian Tzolov


The tar.gz distribution already includes the spark-examples.jar in the bundle. 
It is a common practice for installers to run SparkPi as a smoke test to verify 
that the installation is OK:

/usr/share/spark/bin/spark-submit \
  --num-executors 10  --master yarn-cluster \
  --class org.apache.spark.examples.SparkPi \
  /usr/share/spark/jars/spark-examples-1.0.1-hadoop2.2.0.jar 10





--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2615) Add == support for HiveQl

2014-07-22 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-2615:


 Summary: Add == support for HiveQl
 Key: SPARK-2615
 URL: https://issues.apache.org/jira/browse/SPARK-2615
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Hao
Priority: Minor


Currently, passing == instead of = in a Hive QL expression will cause an 
exception.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2615) Add == support for HiveQl

2014-07-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14069934#comment-14069934
 ] 

Apache Spark commented on SPARK-2615:
-

User 'chenghao-intel' has created a pull request for this issue:
https://github.com/apache/spark/pull/1522

 Add == support for HiveQl
 ---

 Key: SPARK-2615
 URL: https://issues.apache.org/jira/browse/SPARK-2615
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Hao
Assignee: Cheng Hao
Priority: Minor

 Currently, passing == instead of = in a Hive QL expression will cause an 
 exception.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2616) Update Mesos to 0.19.1

2014-07-22 Thread Timothy Chen (JIRA)
Timothy Chen created SPARK-2616:
---

 Summary: Update Mesos to 0.19.1
 Key: SPARK-2616
 URL: https://issues.apache.org/jira/browse/SPARK-2616
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Reporter: Timothy Chen


Let's update Mesos to 0.19.1 and verify that it works.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-2452) Multi-statement input to spark repl does not work

2014-07-22 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-2452.


   Resolution: Fixed
Fix Version/s: 1.1.0

Issue resolved by pull request 1441
[https://github.com/apache/spark/pull/1441]

 Multi-statement input to spark repl does not work
 -

 Key: SPARK-2452
 URL: https://issues.apache.org/jira/browse/SPARK-2452
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.1
Reporter: Timothy Hunter
Assignee: Prashant Sharma
Priority: Blocker
 Fix For: 1.1.0


 Here is an example:
 {code}
 scala> val x = 4 ; def f() = x
 x: Int = 4
 f: ()Int
 scala> f()
 <console>:11: error: $VAL5 is already defined as value $VAL5
 val $VAL5 = INSTANCE;
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2615) Add == support for HiveQl

2014-07-22 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14069948#comment-14069948
 ] 

Cheng Hao commented on SPARK-2615:
--

https://github.com/apache/spark/pull/1522

 Add == support for HiveQl
 ---

 Key: SPARK-2615
 URL: https://issues.apache.org/jira/browse/SPARK-2615
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Hao
Assignee: Cheng Hao
Priority: Minor

 Currently, passing == instead of = in a Hive QL expression will cause an 
 exception.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Issue Comment Deleted] (SPARK-2615) Add == support for HiveQl

2014-07-22 Thread Cheng Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Hao updated SPARK-2615:
-

Comment: was deleted

(was: https://github.com/apache/spark/pull/1522)

 Add == support for HiveQl
 ---

 Key: SPARK-2615
 URL: https://issues.apache.org/jira/browse/SPARK-2615
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Hao
Assignee: Cheng Hao
Priority: Minor

 Currently, passing == instead of = in a Hive QL expression will cause an 
 exception.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2599) almostEquals mllib.util.TestingUtils does not behave as expected when comparing against 0.0

2014-07-22 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14069979#comment-14069979
 ] 

Sean Owen commented on SPARK-2599:
--

Yeah they're tracking roughly the same issue but it would be good to note this 
discussion in SPARK-2479, including the current issue with 0.

I'd favor one absolute error method, changing the one caller as needed. If 
necessary, a second relative error method to accommodate this one use case.
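
A hedged sketch of the two comparison styles being discussed (the names are illustrative, not the actual TestingUtils API):

{code}
object ToleranceCheck {
  // Absolute error: well-behaved when one side is exactly 0.0.
  def absTolEquals(a: Double, b: Double, eps: Double = 1e-10): Boolean =
    math.abs(a - b) <= eps

  // Relative error: for the one use case that needs a scale-aware comparison.
  def relTolEquals(a: Double, b: Double, eps: Double = 1e-10): Boolean =
    math.abs(a - b) <= eps * math.max(math.abs(a), math.abs(b))
}

// absTolEquals(1e-15, 0.0) is true, while relTolEquals(1e-15, 0.0) is false,
// which is exactly the 0.0 pitfall described in this ticket.
{code}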

 almostEquals mllib.util.TestingUtils does not behave as expected when 
 comparing against 0.0
 ---

 Key: SPARK-2599
 URL: https://issues.apache.org/jira/browse/SPARK-2599
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Reporter: Doris Xin
Priority: Minor

 DoubleWithAlmostEquals.almostEquals, when used to compare a number with 0.0, 
 would always produce an epsilon of 1  1e-10, causing false failure when 
 comparing very small numbers with 0.0.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2617) Correct doc and usage of preservesPartitioning

2014-07-22 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-2617:


 Summary: Correct doc and usage of preservesPartitioning
 Key: SPARK-2617
 URL: https://issues.apache.org/jira/browse/SPARK-2617
 Project: Spark
  Issue Type: Bug
  Components: Documentation, MLlib, Spark Core
Affects Versions: 1.0.1
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng


The name `preservesPartitioning` is ambiguous: 1) preserves the indices of 
partitions, 2) preserves the partitioner. The latter is correct, and 
`preservesPartitioning` should really be called `preservesPartitioner`. 
Unfortunately, this is already part of the API and we cannot change it.

We should be clear in the docs and fix incorrect usages.
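
For the docs, a small shell example (the RDD below is just for illustration) showing the intended meaning, i.e. keeping the parent's partitioner when the keys are untouched:

{code}
import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._

val pairs = sc.parallelize(Seq((1, "a"), (2, "b"))).partitionBy(new HashPartitioner(4))

// Keys are not modified, so it is safe to keep the parent's partitioner.
val kept = pairs.mapPartitions(_.map { case (k, v) => (k, v.toUpperCase) },
  preservesPartitioning = true)
assert(kept.partitioner == pairs.partitioner)

// Here the keys change, so claiming preservesPartitioning = true would be wrong.
val rekeyed = pairs.map { case (k, v) => (k + 1, v) }  // partitioner is dropped
{code}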



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-2612) ALS has data skew for popular product

2014-07-22 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-2612.
--

   Resolution: Fixed
Fix Version/s: 1.1.0

 ALS has data skew for popular product
 -

 Key: SPARK-2612
 URL: https://issues.apache.org/jira/browse/SPARK-2612
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.0.0
Reporter: Peng Zhang
Assignee: Peng Zhang
 Fix For: 1.1.0


 Usually there are some popular products that are related to many users in the 
 Rating inputs.
 groupByKey() in updateFeatures() may cause one extra shuffle stage that gathers 
 the data for a popular product into one task, because its RDD's partitioner may 
 not be used as the join() partitioner.
 The following join() then needs to shuffle the aggregated product data. The 
 shuffle block can easily be bigger than 2G, and the shuffle fails as mentioned 
 in SPARK-1476; increasing the number of blocks doesn't help.
 IMHO, groupByKey() should use the same partitioner as the other RDD in the 
 join(). Then groupByKey() and join() will be in the same stage, and the shuffle 
 data from many previous tasks will not hit the 2G limit.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2612) ALS has data skew for popular product

2014-07-22 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-2612:
-

Assignee: Peng Zhang

 ALS has data skew for popular product
 -

 Key: SPARK-2612
 URL: https://issues.apache.org/jira/browse/SPARK-2612
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.0.0
Reporter: Peng Zhang
Assignee: Peng Zhang
 Fix For: 1.1.0


 Usually there are some popular products that are related to many users in the 
 Rating inputs.
 groupByKey() in updateFeatures() may cause one extra shuffle stage that gathers 
 the data for a popular product into one task, because its RDD's partitioner may 
 not be used as the join() partitioner.
 The following join() then needs to shuffle the aggregated product data. The 
 shuffle block can easily be bigger than 2G, and the shuffle fails as mentioned 
 in SPARK-1476; increasing the number of blocks doesn't help.
 IMHO, groupByKey() should use the same partitioner as the other RDD in the 
 join(). Then groupByKey() and join() will be in the same stage, and the shuffle 
 data from many previous tasks will not hit the 2G limit.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2614) Add the spark-examples-xxx-.jar to the Debian package created by assembly/pom.xml (e.g. -PDeb)

2014-07-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070197#comment-14070197
 ] 

Apache Spark commented on SPARK-2614:
-

User 'tzolov' has created a pull request for this issue:
https://github.com/apache/spark/pull/1527

 Add the spark-examples-xxx-.jar to the Debian package created by 
 assembly/pom.xml (e.g. -PDeb)
 --

 Key: SPARK-2614
 URL: https://issues.apache.org/jira/browse/SPARK-2614
 Project: Spark
  Issue Type: Improvement
  Components: Build, Deploy
Reporter: Christian Tzolov

 The tar.gz distribution already includes the spark-examples.jar in the bundle. 
 It is a common practice for installers to run SparkPi as a smoke test to verify 
 that the installation is OK:
 /usr/share/spark/bin/spark-submit \
   --num-executors 10  --master yarn-cluster \
   --class org.apache.spark.examples.SparkPi \
   /usr/share/spark/jars/spark-examples-1.0.1-hadoop2.2.0.jar 10



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2614) Add the spark-examples-xxx-.jar to the Debian package created by assembly/pom.xml (e.g. -Pdeb)

2014-07-22 Thread Christian Tzolov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christian Tzolov updated SPARK-2614:


Summary: Add the spark-examples-xxx-.jar to the Debian package created by 
assembly/pom.xml (e.g. -Pdeb)  (was: Add the spark-examples-xxx-.jar to the 
Debian package created by assembly/pom.xml (e.g. -PDeb))

 Add the spark-examples-xxx-.jar to the Debian package created by 
 assembly/pom.xml (e.g. -Pdeb)
 --

 Key: SPARK-2614
 URL: https://issues.apache.org/jira/browse/SPARK-2614
 Project: Spark
  Issue Type: Improvement
  Components: Build, Deploy
Reporter: Christian Tzolov

 The tar.gz distribution already includes the spark-examples.jar in the bundle. 
 It is a common practice for installers to run SparkPi as a smoke test to verify 
 that the installation is OK:
 /usr/share/spark/bin/spark-submit \
   --num-executors 10  --master yarn-cluster \
   --class org.apache.spark.examples.SparkPi \
   /usr/share/spark/jars/spark-examples-1.0.1-hadoop2.2.0.jar 10



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2618) use config spark.scheduler.priority for specifying TaskSet's priority on DAGScheduler

2014-07-22 Thread Lianhui Wang (JIRA)
Lianhui Wang created SPARK-2618:
---

 Summary: use config spark.scheduler.priority for specifying 
TaskSet's priority on DAGScheduler
 Key: SPARK-2618
 URL: https://issues.apache.org/jira/browse/SPARK-2618
 Project: Spark
  Issue Type: Improvement
Reporter: Lianhui Wang






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2604) Spark Application hangs on yarn in edge case scenario of executor memory requirement

2014-07-22 Thread Twinkle Sachdeva (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070227#comment-14070227
 ] 

Twinkle Sachdeva commented on SPARK-2604:
-

I tried running in yarn-cluster mode after setting the 
spark.yarn.max.executor.failures property to some number. The application does 
get failed, but with a misleading exception (pasted at the end). Instead of 
handling the condition this way, we should probably check the executor memory 
overhead amount during validation itself (a rough sketch of such a check follows 
the stacktrace below). Please share your thoughts if you think otherwise.

Stacktrace :
Application application_1405933848949_0024 failed 2 times due to Error 
launching appattempt_1405933848949_0024_02. Got exception: 
java.net.ConnectException: Call From NN46/192.168.156.46 to localhost:51322 
failed on connection exception: java.net.ConnectException: Connection refused; 
For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
at java.lang.reflect.Constructor.newInstance(Unknown Source)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730)
at org.apache.hadoop.ipc.Client.call(Client.java:1414)
at org.apache.hadoop.ipc.Client.call(Client.java:1363)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
at com.sun.proxy.$Proxy28.startContainers(Unknown Source)
at 
org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:96)
at 
org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:118)
at 
org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:249)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
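
A rough sketch of the up-front check suggested above (the constant and method names are illustrative, not the actual YARN client code):

{code}
// 384 MB is the executor memory overhead assumed in this ticket's description.
val memoryOverhead = 384

def validateExecutorMemory(requestedExecMemMb: Int, yarnMaxAllocMb: Int): Unit = {
  require(requestedExecMemMb + memoryOverhead <= yarnMaxAllocMb,
    s"Executor memory ($requestedExecMemMb MB) plus overhead ($memoryOverhead MB) " +
    s"exceeds YARN's maximum allocatable memory ($yarnMaxAllocMb MB).")
}
{code}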

 Spark Application hangs on yarn in edge case scenario of executor memory 
 requirement
 

 Key: SPARK-2604
 URL: https://issues.apache.org/jira/browse/SPARK-2604
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Twinkle Sachdeva

 In a YARN environment, let's say:
 MaxAM = maximum allocatable memory
 ExecMem = executor's memory
 If (MaxAM > ExecMem && (MaxAM - ExecMem) < 384m), then the maximum resource 
 validation fails to catch the problem with the executor memory and the 
 application master gets launched, but when the resource is allocated and 
 validated again, the allocations are returned and the application appears to 
 hang.
 A typical use case is asking for executor memory equal to the maximum allowed 
 memory as per the YARN config.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2619) Configurable file-mode for spark/bin folder in the .deb package.

2014-07-22 Thread Christian Tzolov (JIRA)
Christian Tzolov created SPARK-2619:
---

 Summary: Configurable file-mode for spark/bin folder in the .deb 
package. 
 Key: SPARK-2619
 URL: https://issues.apache.org/jira/browse/SPARK-2619
 Project: Spark
  Issue Type: Improvement
  Components: Build, Deploy
Reporter: Christian Tzolov


Currently the /bin folder in the .deb package is hardcoded to 744, so only the 
root user (deb.user defaults to root) can run Spark jobs. 

If we make the /bin file mode a configurable Maven property, then we can easily 
generate a package with less restrictive execution rights. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2446) Add BinaryType support to Parquet I/O.

2014-07-22 Thread Teng Qiu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070310#comment-14070310
 ] 

Teng Qiu commented on SPARK-2446:
-

Hi [~marmbrus], Impala also creates Parquet files without the UTF8 annotation for 
strings; I just tried the newest Impala release in CDH 5.1.0, and it is still the 
case.

Would it be possible to add one more parameter to sqlContext.parquetFile() to 
disable/enable BinaryType?

I am not sure if it's worth it, but it is more flexible, and in our use case we 
have many Parquet files that were created by Impala... :)

For example, change
def parquetFile(path: String): SchemaRDD
to
def parquetFile(path: String, allowBinaryType: Boolean = true): SchemaRDD

The user can call sqlContext.parquetFile(xxx, false) to access Parquet files made 
by an old Spark version or by Impala.

That keeps it backward compatible. What do you think?

 Add BinaryType support to Parquet I/O.
 --

 Key: SPARK-2446
 URL: https://issues.apache.org/jira/browse/SPARK-2446
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Takuya Ueshin
Assignee: Takuya Ueshin
 Fix For: 1.1.0


 To support {{BinaryType}}, the following changes are needed:
 - Make {{StringType}} use {{OriginalType.UTF8}}
 - Add {{BinaryType}} using {{PrimitiveTypeName.BINARY}} without 
 {{OriginalType}}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-2446) Add BinaryType support to Parquet I/O.

2014-07-22 Thread Teng Qiu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070310#comment-14070310
 ] 

Teng Qiu edited comment on SPARK-2446 at 7/22/14 2:47 PM:
--

Hi [~marmbrus], Impala also creates Parquet files without the UTF8 annotation for 
strings; I just tried the newest Impala release in CDH 5.1.0, and it is still the 
case.

Would it be possible to add one more parameter to sqlContext.parquetFile() to 
disable/enable BinaryType?

I am not sure if it's worth it, but it is more flexible, and in our use case we 
have many Parquet files that were created by Impala... :)

For example, change
def parquetFile(path: String): SchemaRDD
to
def parquetFile(path: String, allowBinaryType: Boolean = true): SchemaRDD

By default allowBinaryType is set to true, but the user can call 
sqlContext.parquetFile(xxx, false) to access Parquet files made by an old Spark 
version or by Impala.

That keeps it backward compatible. What do you think?


was (Author: chutium):
Hi [~marmbrus], Impala also creates Parquet files without the UTF8 annotation for 
strings; I just tried the newest Impala release in CDH 5.1.0, and it is still the 
case.

Would it be possible to add one more parameter to sqlContext.parquetFile() to 
disable/enable BinaryType?

I am not sure if it's worth it, but it is more flexible, and in our use case we 
have many Parquet files that were created by Impala... :)

For example, change
def parquetFile(path: String): SchemaRDD
to
def parquetFile(path: String, allowBinaryType: Boolean = true): SchemaRDD

The user can call sqlContext.parquetFile(xxx, false) to access Parquet files made 
by an old Spark version or by Impala.

That keeps it backward compatible. What do you think?

 Add BinaryType support to Parquet I/O.
 --

 Key: SPARK-2446
 URL: https://issues.apache.org/jira/browse/SPARK-2446
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Takuya Ueshin
Assignee: Takuya Ueshin
 Fix For: 1.1.0


 To support {{BinaryType}}, the following changes are needed:
 - Make {{StringType}} use {{OriginalType.UTF8}}
 - Add {{BinaryType}} using {{PrimitiveTypeName.BINARY}} without 
 {{OriginalType}}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2282) PySpark crashes if too many tasks complete quickly

2014-07-22 Thread Ken Carlile (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070367#comment-14070367
 ] 

Ken Carlile commented on SPARK-2282:


Hi Aaron, 

Another question for you: would it work for me to just drop the two changed 
files into our install of the Spark 1.0.1 release, or is that likely to cause 
issues? 

Thanks, 
Ken

 PySpark crashes if too many tasks complete quickly
 --

 Key: SPARK-2282
 URL: https://issues.apache.org/jira/browse/SPARK-2282
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 0.9.1, 1.0.0, 1.0.1
Reporter: Aaron Davidson
Assignee: Aaron Davidson
 Fix For: 0.9.2, 1.0.0, 1.0.1


 Upon every task completion, PythonAccumulatorParam constructs a new socket to 
 the Accumulator server running inside the pyspark daemon. This can cause a 
 buildup of used ephemeral ports from sockets in the TIME_WAIT termination 
 stage, which will cause the SparkContext to crash if too many tasks complete 
 too quickly. We ran into this bug with 17k tasks completing in 15 seconds.
 This bug can be fixed outside of Spark by ensuring these properties are set 
 (on a Linux server):
 echo 1 > /proc/sys/net/ipv4/tcp_tw_reuse
 echo 1 > /proc/sys/net/ipv4/tcp_tw_recycle
 or by adding the SO_REUSEADDR option to the Socket creation within Spark.
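
A minimal sketch of the SO_REUSEADDR alternative mentioned above (illustrative only, not the actual PythonAccumulatorParam code):

{code}
import java.net.{InetSocketAddress, Socket}

def openAccumulatorSocket(host: String, port: Int): Socket = {
  val socket = new Socket()
  socket.setReuseAddress(true)  // SO_REUSEADDR, so ports stuck in TIME_WAIT can be reused
  socket.connect(new InetSocketAddress(host, port))
  socket
}
{code}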



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2620) case class cannot be used as key for reduce

2014-07-22 Thread Gerard Maas (JIRA)
Gerard Maas created SPARK-2620:
--

 Summary: case class cannot be used as key for reduce
 Key: SPARK-2620
 URL: https://issues.apache.org/jira/browse/SPARK-2620
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
 Environment: reproduced on spark-shell local[4]
Reporter: Gerard Maas
Priority: Critical


Using a case class as a key doesn't seem to work properly on Spark 1.0.0

A minimal example:

case class P(name:String)
val ps = Array(P("alice"), P("bob"), P("charly"), P("bob"))
sc.parallelize(ps).map(x => (x, 1)).reduceByKey((x, y) => x + y).collect
[Spark shell local mode] res : Array[(P, Int)] = Array((P(bob),1), (P(bob),1), (P(abe),1), (P(charly),1))

In contrast to the expected behavior, that should be equivalent to:
sc.parallelize(ps).map(x => (x.name, 1)).reduceByKey((x, y) => x + y).collect
Array[(String, Int)] = Array((charly,1), (abe,1), (bob,2))

groupByKey and distinct also present the same behavior.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2620) case class cannot be used as key for reduce

2014-07-22 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070439#comment-14070439
 ] 

Sean Owen commented on SPARK-2620:
--

Duplicate of https://issues.apache.org/jira/browse/SPARK-1199 I think?

 case class cannot be used as key for reduce
 ---

 Key: SPARK-2620
 URL: https://issues.apache.org/jira/browse/SPARK-2620
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
 Environment: reproduced on spark-shell local[4]
Reporter: Gerard Maas
Priority: Critical
  Labels: case-class, core

 Using a case class as a key doesn't seem to work properly on Spark 1.0.0
 A minimal example:
 case class P(name:String)
 val ps = Array(P("alice"), P("bob"), P("charly"), P("bob"))
 sc.parallelize(ps).map(x => (x, 1)).reduceByKey((x, y) => x + y).collect
 [Spark shell local mode] res : Array[(P, Int)] = Array((P(bob),1), (P(bob),1), (P(abe),1), (P(charly),1))
 In contrast to the expected behavior, that should be equivalent to:
 sc.parallelize(ps).map(x => (x.name, 1)).reduceByKey((x, y) => x + y).collect
 Array[(String, Int)] = Array((charly,1), (abe,1), (bob,2))
 groupByKey and distinct also present the same behavior.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2621) Update task InputMetrics incrementally

2014-07-22 Thread Sandy Ryza (JIRA)
Sandy Ryza created SPARK-2621:
-

 Summary: Update task InputMetrics incrementally
 Key: SPARK-2621
 URL: https://issues.apache.org/jira/browse/SPARK-2621
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Sandy Ryza






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2593) Add ability to pass an existing Akka ActorSystem into Spark

2014-07-22 Thread Helena Edelson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Helena Edelson updated SPARK-2593:
--

Issue Type: Improvement  (was: Brainstorming)

 Add ability to pass an existing Akka ActorSystem into Spark
 ---

 Key: SPARK-2593
 URL: https://issues.apache.org/jira/browse/SPARK-2593
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Helena Edelson

 As a developer I want to pass an existing ActorSystem into StreamingContext 
 in load-time so that I do not have 2 actor systems running on a node.
 This would mean having spark's actor system on its own named-dispatchers as 
 well as exposing the new private creation of its own actor system.
 If it makes sense...
 I would like to create an Akka Extension that wraps around Spark/Spark 
 Streaming and Cassandra. So the creation would simply be this for a user
 val extension = SparkCassandra(system)
  and using it is as easy as:
 import extension._
 spark. // do work or, 
 streaming. // do work
  
 and all config comes from reference.conf and user overrides of that.
 The conf file would pick up settings from the deployed environment first, 
 then fallback to -D with a final fallback to configured settings.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2593) Add ability to pass an existing Akka ActorSystem into Spark

2014-07-22 Thread Helena Edelson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Helena Edelson updated SPARK-2593:
--

Description: 
As a developer I want to pass an existing ActorSystem into StreamingContext in 
load-time so that I do not have 2 actor systems running on a node in an Akka 
application.

This would mean having spark's actor system on its own named-dispatchers as 
well as exposing the new private creation of its own actor system.
 
I would like to create an Akka Extension that wraps around Spark/Spark 
Streaming and Cassandra. So the programmatic creation would simply be this for 
a user

val extension = SparkCassandra(system)
 

  was:
As a developer I want to pass an existing ActorSystem into StreamingContext in 
load-time so that I do not have 2 actor systems running on a node.

This would mean having spark's actor system on its own named-dispatchers as 
well as exposing the new private creation of its own actor system.

If it makes sense...

I would like to create an Akka Extension that wraps around Spark/Spark 
Streaming and Cassandra. So the creation would simply be this for a user

val extension = SparkCassandra(system)
 and using it is as easy as:

import extension._
spark. // do work or, 
streaming. // do work
 
and all config comes from reference.conf and user overrides of that.
The conf file would pick up settings from the deployed environment first, then 
fallback to -D with a final fallback to configured settings.




 Add ability to pass an existing Akka ActorSystem into Spark
 ---

 Key: SPARK-2593
 URL: https://issues.apache.org/jira/browse/SPARK-2593
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Helena Edelson

 As a developer I want to pass an existing ActorSystem into StreamingContext 
 in load-time so that I do not have 2 actor systems running on a node in an 
 Akka application.
 This would mean having spark's actor system on its own named-dispatchers as 
 well as exposing the new private creation of its own actor system.
  
 I would like to create an Akka Extension that wraps around Spark/Spark 
 Streaming and Cassandra. So the programmatic creation would simply be this 
 for a user
 val extension = SparkCassandra(system)
  



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2620) case class cannot be used as key for reduce

2014-07-22 Thread Gerard Maas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070443#comment-14070443
 ] 

Gerard Maas commented on SPARK-2620:


[~sowen] No, doesn't look like it is.

 case class cannot be used as key for reduce
 ---

 Key: SPARK-2620
 URL: https://issues.apache.org/jira/browse/SPARK-2620
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
 Environment: reproduced on spark-shell local[4]
Reporter: Gerard Maas
Priority: Critical
  Labels: case-class, core

 Using a case class as a key doesn't seem to work properly on Spark 1.0.0
 A minimal example:
 case class P(name:String)
 val ps = Array(P("alice"), P("bob"), P("charly"), P("bob"))
 sc.parallelize(ps).map(x => (x, 1)).reduceByKey((x, y) => x + y).collect
 [Spark shell local mode] res : Array[(P, Int)] = Array((P(bob),1), (P(bob),1), (P(abe),1), (P(charly),1))
 In contrast to the expected behavior, that should be equivalent to:
 sc.parallelize(ps).map(x => (x.name, 1)).reduceByKey((x, y) => x + y).collect
 Array[(String, Int)] = Array((charly,1), (abe,1), (bob,2))
 groupByKey and distinct also present the same behavior.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2615) Add == support for HiveQl

2014-07-22 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070448#comment-14070448
 ] 

Yin Huai commented on SPARK-2615:
-

Based on the Hive language manual 
(https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF), == is 
invalid. But Hive actually treats == as =. I have sent an email to the hive-dev 
list to ask about it.

 Add == support for HiveQl
 ---

 Key: SPARK-2615
 URL: https://issues.apache.org/jira/browse/SPARK-2615
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Hao
Assignee: Cheng Hao
Priority: Minor

 Currently, passing == instead of = in a Hive QL expression will cause an 
 exception.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2593) Add ability to pass an existing Akka ActorSystem into Spark

2014-07-22 Thread Evan Chan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070476#comment-14070476
 ] 

Evan Chan commented on SPARK-2593:
--

I would say that the base SparkContext should have this ability, after all it 
creates multiple actors as well as a base ActorSystem.

Sharing a single ActorSystem would also speed up all of Spark's tests, many of 
which repeatedly create and then tear down ActorSystems (since that's what the 
base SparkContext does).

 Add ability to pass an existing Akka ActorSystem into Spark
 ---

 Key: SPARK-2593
 URL: https://issues.apache.org/jira/browse/SPARK-2593
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Helena Edelson

 As a developer I want to pass an existing ActorSystem into StreamingContext 
 in load-time so that I do not have 2 actor systems running on a node in an 
 Akka application.
 This would mean having spark's actor system on its own named-dispatchers as 
 well as exposing the new private creation of its own actor system.
  
 I would like to create an Akka Extension that wraps around Spark/Spark 
 Streaming and Cassandra. So the programmatic creation would simply be this 
 for a user
 val extension = SparkCassandra(system)
  



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2622) Add Jenkins build numbers to SparkQA messages

2014-07-22 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-2622:


 Summary: Add Jenkins build numbers to SparkQA messages
 Key: SPARK-2622
 URL: https://issues.apache.org/jira/browse/SPARK-2622
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 1.0.1
Reporter: Xiangrui Meng
Priority: Minor


It takes Jenkins 2 hours to finish testing. It is possible to have the 
following:

{code}
Build 1 started.
PR updated.
Build 2 started.
Build 1 finished successfully.
A committer merged the PR because the last build seemed to be okay.
Build 2 failed.
{code}

It would be nice to put the build number in the SparkQA message so it is easy 
to match the result with the build.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2282) PySpark crashes if too many tasks complete quickly

2014-07-22 Thread Aaron Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070528#comment-14070528
 ] 

Aaron Davidson commented on SPARK-2282:
---

Great to hear! These files haven't been changed since the 1.0.1 release besides 
this patch, so it should be fine to just drop them in. (A generally safer 
option would be to do a git merge, though, against Spark's refs/pull/1503/head 
branch.)

 PySpark crashes if too many tasks complete quickly
 --

 Key: SPARK-2282
 URL: https://issues.apache.org/jira/browse/SPARK-2282
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 0.9.1, 1.0.0, 1.0.1
Reporter: Aaron Davidson
Assignee: Aaron Davidson
 Fix For: 0.9.2, 1.0.0, 1.0.1


 Upon every task completion, PythonAccumulatorParam constructs a new socket to 
 the Accumulator server running inside the pyspark daemon. This can cause a 
 buildup of used ephemeral ports from sockets in the TIME_WAIT termination 
 stage, which will cause the SparkContext to crash if too many tasks complete 
 too quickly. We ran into this bug with 17k tasks completing in 15 seconds.
 This bug can be fixed outside of Spark by ensuring these properties are set 
 (on a Linux server):
 echo 1 > /proc/sys/net/ipv4/tcp_tw_reuse
 echo 1 > /proc/sys/net/ipv4/tcp_tw_recycle
 or by adding the SO_REUSEADDR option to the Socket creation within Spark.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2620) case class cannot be used as key for reduce

2014-07-22 Thread Daniel Siegmann (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070543#comment-14070543
 ] 

Daniel Siegmann commented on SPARK-2620:


I have confirmed this on Spark 1.0.1 as well.

As Gerard noted on the mailing list, this bug does NOT affect Spark 0.9.1 
(confirmed in Spark shell).

 case class cannot be used as key for reduce
 ---

 Key: SPARK-2620
 URL: https://issues.apache.org/jira/browse/SPARK-2620
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
 Environment: reproduced on spark-shell local[4]
Reporter: Gerard Maas
Priority: Critical
  Labels: case-class, core

 Using a case class as a key doesn't seem to work properly on Spark 1.0.0
 A minimal example:
 case class P(name:String)
 val ps = Array(P("alice"), P("bob"), P("charly"), P("bob"))
 sc.parallelize(ps).map(x => (x, 1)).reduceByKey((x, y) => x + y).collect
 [Spark shell local mode] res : Array[(P, Int)] = Array((P(bob),1), (P(bob),1), (P(abe),1), (P(charly),1))
 In contrast to the expected behavior, that should be equivalent to:
 sc.parallelize(ps).map(x => (x.name, 1)).reduceByKey((x, y) => x + y).collect
 Array[(String, Int)] = Array((charly,1), (abe,1), (bob,2))
 groupByKey and distinct also present the same behavior.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2623) Stacked Auto Encoder (Deep Learning )

2014-07-22 Thread Victor Fang (JIRA)
Victor Fang created SPARK-2623:
--

 Summary: Stacked Auto Encoder (Deep Learning )
 Key: SPARK-2623
 URL: https://issues.apache.org/jira/browse/SPARK-2623
 Project: Spark
  Issue Type: New Feature
Reporter: Victor Fang


We would like to add a parallel implementation of the Stacked Auto Encoder (Deep 
Learning) algorithm to Spark MLlib.
SAE is one of the most popular Deep Learning algorithms. It has achieved 
successful benchmarks in MNIST handwritten-digit classification, Google's 
ICML 2012 cat face paper (http://icml.cc/2012/papers/73.pdf), etc.
Our focus is to leverage RDDs and deliver an SAE with the following capabilities, 
with ease of use for both beginners and advanced researchers:
1. multi-layer SAE deep network training and scoring;
2. unsupervised feature learning;
3. supervised learning with multinomial logistic regression (softmax).





--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2623) Stacked Auto Encoder (Deep Learning )

2014-07-22 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-2623:
-

Assignee: Victor Fang

 Stacked Auto Encoder (Deep Learning )
 -

 Key: SPARK-2623
 URL: https://issues.apache.org/jira/browse/SPARK-2623
 Project: Spark
  Issue Type: New Feature
Reporter: Victor Fang
Assignee: Victor Fang
  Labels: deeplearning, machine_learning

 We would like to add a parallel implementation of the Stacked Auto Encoder (Deep 
 Learning) algorithm to Spark MLlib.
 SAE is one of the most popular Deep Learning algorithms. It has achieved 
 successful benchmarks in MNIST handwritten-digit classification, Google's 
 ICML 2012 cat face paper (http://icml.cc/2012/papers/73.pdf), etc.
 Our focus is to leverage RDDs and deliver an SAE with the following capabilities, 
 with ease of use for both beginners and advanced researchers:
 1. multi-layer SAE deep network training and scoring;
 2. unsupervised feature learning;
 3. supervised learning with multinomial logistic regression (softmax).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2624) Datanucleus jars not accessible in yarn-cluster mode

2014-07-22 Thread Andrew Or (JIRA)
Andrew Or created SPARK-2624:


 Summary: Datanucleus jars not accessible in yarn-cluster mode
 Key: SPARK-2624
 URL: https://issues.apache.org/jira/browse/SPARK-2624
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.0.1
Reporter: Andrew Or
 Fix For: 1.1.0


This is because we add them to the class path of the command that launches 
spark-submit, but the containers never get them.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-2620) case class cannot be used as key for reduce

2014-07-22 Thread Aaron (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070604#comment-14070604
 ] 

Aaron edited comment on SPARK-2620 at 7/22/14 6:05 PM:
---

If you look at the diff of distinct from branch-0.9 to master you see  
   -  def distinct(numPartitions: Int): RDD[T] =
   +  def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): 
RDD[T] =  

Is it possible that case classes don't have an implicit ordering and that is 
why this fails?


was (Author: aaronjosephs):
If you look at the diff of distinct from branch-0.9 to master you see  
-  def distinct(numPartitions: Int): RDD[T] =
+  def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = 
 

Is it possible that case classes don't have an implicit ordering and that is 
why this fails?

 case class cannot be used as key for reduce
 ---

 Key: SPARK-2620
 URL: https://issues.apache.org/jira/browse/SPARK-2620
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
 Environment: reproduced on spark-shell local[4]
Reporter: Gerard Maas
Priority: Critical
  Labels: case-class, core

 Using a case class as a key doesn't seem to work properly on Spark 1.0.0
 A minimal example:
 case class P(name:String)
 val ps = Array(P("alice"), P("bob"), P("charly"), P("bob"))
 sc.parallelize(ps).map(x => (x, 1)).reduceByKey((x, y) => x + y).collect
 [Spark shell local mode] res : Array[(P, Int)] = Array((P(bob),1), (P(bob),1), (P(abe),1), (P(charly),1))
 In contrast to the expected behavior, that should be equivalent to:
 sc.parallelize(ps).map(x => (x.name, 1)).reduceByKey((x, y) => x + y).collect
 Array[(String, Int)] = Array((charly,1), (abe,1), (bob,2))
 groupByKey and distinct also present the same behavior.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2620) case class cannot be used as key for reduce

2014-07-22 Thread Aaron (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070604#comment-14070604
 ] 

Aaron commented on SPARK-2620:
--

If you look at the diff of distinct from branch-0.9 to master you see  
-  def distinct(numPartitions: Int): RDD[T] =
+  def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = 
 

Is it possible that case classes don't have an implicit ordering and that is 
why this fails?
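
One hedged way to test that speculation in the shell (P is the ticket's example; the explicit Ordering is purely illustrative):

{code}
case class P(name: String)
// Provide an explicit Ordering for the case class and see whether
// reduceByKey/distinct behave differently with it in scope.
implicit val pOrdering: Ordering[P] = Ordering.by(_.name)

val ps = Array(P("alice"), P("bob"), P("charly"), P("bob"))
sc.parallelize(ps).map(x => (x, 1)).reduceByKey(_ + _).collect()
sc.parallelize(ps).distinct().collect()
{code}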

 case class cannot be used as key for reduce
 ---

 Key: SPARK-2620
 URL: https://issues.apache.org/jira/browse/SPARK-2620
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
 Environment: reproduced on spark-shell local[4]
Reporter: Gerard Maas
Priority: Critical
  Labels: case-class, core

 Using a case class as a key doesn't seem to work properly on Spark 1.0.0
 A minimal example:
 case class P(name:String)
 val ps = Array(P("alice"), P("bob"), P("charly"), P("bob"))
 sc.parallelize(ps).map(x => (x, 1)).reduceByKey((x, y) => x + y).collect
 [Spark shell local mode] res : Array[(P, Int)] = Array((P(bob),1), (P(bob),1), (P(abe),1), (P(charly),1))
 In contrast to the expected behavior, that should be equivalent to:
 sc.parallelize(ps).map(x => (x.name, 1)).reduceByKey((x, y) => x + y).collect
 Array[(String, Int)] = Array((charly,1), (abe,1), (bob,2))
 groupByKey and distinct also present the same behavior.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-2620) case class cannot be used as key for reduce

2014-07-22 Thread Aaron (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070604#comment-14070604
 ] 

Aaron edited comment on SPARK-2620 at 7/22/14 6:05 PM:
---

If you look at the diff of distinct from branch-0.9 to master you see  
  -  def distinct(numPartitions: Int): RDD[T] =
   +  def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): 
RDD[T] = 

Is it possible that case classes don't have an implicit ordering and that is 
why this fails?


was (Author: aaronjosephs):
If you look at the diff of distinct from branch-0.9 to master you see  
   -  def distinct(numPartitions: Int): RDD[T] =
   +  def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): 
RDD[T] =  

Is it possible that case classes don't have an implicit ordering and that is 
why this fails?

 case class cannot be used as key for reduce
 ---

 Key: SPARK-2620
 URL: https://issues.apache.org/jira/browse/SPARK-2620
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
 Environment: reproduced on spark-shell local[4]
Reporter: Gerard Maas
Priority: Critical
  Labels: case-class, core

 Using a case class as a key doesn't seem to work properly on Spark 1.0.0
 A minimal example:
 case class P(name:String)
 val ps = Array(P("alice"), P("bob"), P("charly"), P("bob"))
 sc.parallelize(ps).map(x => (x, 1)).reduceByKey((x, y) => x + y).collect
 [Spark shell local mode] res : Array[(P, Int)] = Array((P(bob),1), (P(bob),1), (P(abe),1), (P(charly),1))
 In contrast to the expected behavior, that should be equivalent to:
 sc.parallelize(ps).map(x => (x.name, 1)).reduceByKey((x, y) => x + y).collect
 Array[(String, Int)] = Array((charly,1), (abe,1), (bob,2))
 groupByKey and distinct also present the same behavior.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2625) Fix ShuffleReadMetrics for NettyBlockFetcherIterator

2014-07-22 Thread Sandy Ryza (JIRA)
Sandy Ryza created SPARK-2625:
-

 Summary: Fix ShuffleReadMetrics for NettyBlockFetcherIterator
 Key: SPARK-2625
 URL: https://issues.apache.org/jira/browse/SPARK-2625
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle
Affects Versions: 1.0.0
Reporter: Sandy Ryza
Priority: Minor


NettyBlockFetcherIterator doesn't report fetchWaitTime and has some race 
conditions where multiple threads can be incrementing bytes read at the same 
time.
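
A hedged sketch of one way to make the counters thread-safe (illustrative field names, not the actual ShuffleReadMetrics):

{code}
import java.util.concurrent.atomic.AtomicLong

class FetchMetrics {
  private val bytesRead = new AtomicLong(0L)
  private val fetchWaitTime = new AtomicLong(0L)

  // Safe to call from multiple fetch threads concurrently.
  def addBytesRead(n: Long): Unit = bytesRead.addAndGet(n)
  def addFetchWaitTime(millis: Long): Unit = fetchWaitTime.addAndGet(millis)

  def totalBytesRead: Long = bytesRead.get()
  def totalFetchWaitTime: Long = fetchWaitTime.get()
}
{code}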



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2618) use config spark.scheduler.priority for specifying TaskSet's priority on DAGScheduler

2014-07-22 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070691#comment-14070691
 ] 

Patrick Wendell commented on SPARK-2618:


Can you explain more what you are trying to accomplish here? The Fair scheduler 
is designed to allow different scheduling policies. Prioritization of jobs is 
not handled in the DAGScheduler.

 use config spark.scheduler.priority for specifying TaskSet's priority on 
 DAGScheduler
 -

 Key: SPARK-2618
 URL: https://issues.apache.org/jira/browse/SPARK-2618
 Project: Spark
  Issue Type: Improvement
Reporter: Lianhui Wang





--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-2047) Use less memory in AppendOnlyMap.destructiveSortedIterator

2014-07-22 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia resolved SPARK-2047.
--

   Resolution: Fixed
Fix Version/s: 1.1.0

 Use less memory in AppendOnlyMap.destructiveSortedIterator
 --

 Key: SPARK-2047
 URL: https://issues.apache.org/jira/browse/SPARK-2047
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Matei Zaharia
Assignee: Aaron Davidson
 Fix For: 1.1.0


 This method tries to sort the key-value pairs in the map in-place but ends 
 up allocating a Tuple2 object for each one, which allocates a nontrivial 
 amount of memory (32 or more bytes per entry on a 64-bit JVM). We could 
 instead try to sort the objects in-place within the data array, or allocate 
 an int array with the indices and sort those using a custom comparator. The 
 latter is probably easiest to begin with.
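
A rough sketch of the index-array variant described above (the data layout and names are assumptions, not the actual AppendOnlyMap code):

{code}
import java.util.Comparator

// Assumes the flat layout used here for illustration: key at data(2*i), value at data(2*i + 1).
def sortedIndices(data: Array[AnyRef], numEntries: Int,
                  keyComparator: Comparator[AnyRef]): Array[Integer] = {
  val indices = Array.tabulate[Integer](numEntries)(i => i)
  // Sort only the small index array; no Tuple2 is allocated per entry.
  java.util.Arrays.sort(indices, new Comparator[Integer] {
    override def compare(a: Integer, b: Integer): Int =
      keyComparator.compare(data(2 * a), data(2 * b))
  })
  indices
}
{code}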



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2047) Use less memory in AppendOnlyMap.destructiveSortedIterator

2014-07-22 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-2047:
-

Assignee: Aaron Davidson

 Use less memory in AppendOnlyMap.destructiveSortedIterator
 --

 Key: SPARK-2047
 URL: https://issues.apache.org/jira/browse/SPARK-2047
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Matei Zaharia
Assignee: Aaron Davidson
 Fix For: 1.1.0


 This method tries to sort the key-value pairs in the map in-place but ends 
 up allocating a Tuple2 object for each one, which allocates a nontrivial 
 amount of memory (32 or more bytes per entry on a 64-bit JVM). We could 
 instead try to sort the objects in-place within the data array, or allocate 
 an int array with the indices and sort those using a custom comparator. The 
 latter is probably easiest to begin with.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2626) Stop SparkContext in all examples

2014-07-22 Thread Andrew Or (JIRA)
Andrew Or created SPARK-2626:


 Summary: Stop SparkContext in all examples
 Key: SPARK-2626
 URL: https://issues.apache.org/jira/browse/SPARK-2626
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.1
Reporter: Andrew Or
 Fix For: 1.1.0


Event logs rely on sc.stop() to close the file. If this is never closed, the 
history server will not be able to find the logs.
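
A minimal illustration (the example app below is assumed, not an existing one) of the pattern the examples should follow:

{code}
import org.apache.spark.{SparkConf, SparkContext}

object ExampleApp {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ExampleApp"))
    try {
      sc.parallelize(1 to 100).map(_ * 2).count()
    } finally {
      sc.stop()  // closes the event log so the history server can find it
    }
  }
}
{code}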



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2627) Check for PEP 8 compliance on all Python code in the Jenkins CI cycle

2014-07-22 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-2627:
---

 Summary: Check for PEP 8 compliance on all Python code in the 
Jenkins CI cycle
 Key: SPARK-2627
 URL: https://issues.apache.org/jira/browse/SPARK-2627
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: Nicholas Chammas


This issue was triggered by [the discussion 
here|https://github.com/apache/spark/pull/1505#issuecomment-49698681].

Requirements:
* make a linter script for Scala under {{dev/lint-scala}} that just calls 
{{scalastyle}}
* make a linter script for Python under {{dev/lint-python}} that calls {{pep8}} 
on all Python files
** One exception to this is {{cloudpickle.py}}, which is a third-party module 
[we don't want to 
touch|https://github.com/apache/spark/pull/1505#discussion-diff-15197904]
* Modify {{dev/run-tests}} to call both linter scripts
* Incorporate these changes into the [Contributing to 
Spark|https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark] 
guide



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2628) Mesos backend throwing unable to find LoginModule

2014-07-22 Thread Timothy Chen (JIRA)
Timothy Chen created SPARK-2628:
---

 Summary: Mesos backend throwing unable to find LoginModule 
 Key: SPARK-2628
 URL: https://issues.apache.org/jira/browse/SPARK-2628
 Project: Spark
  Issue Type: Bug
Reporter: Timothy Chen


http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201406.mbox/%3c1401892590126-6927.p...@n3.nabble.com%3E

14/07/22 19:57:59 INFO HttpServer: Starting HTTP Server
14/07/22 19:57:59 ERROR Executor: Uncaught exception in thread Thread[Executor 
task launch worker-1,5,main]
java.lang.Error: java.io.IOException: failure to login
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1116)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:636)
Caused by: java.io.IOException: failure to login
at 
org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:490)
at 
org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:452)
at 
org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:40)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
... 2 more
Caused by: javax.security.auth.login.LoginException: unable to find LoginModule 
class: org/apache/hadoop/security/UserGroupInformation$HadoopLoginModule
at javax.security.auth.login.LoginContext.invoke(LoginContext.java:823)
at 
javax.security.auth.login.LoginContext.access$000(LoginContext.java:203)
at javax.security.auth.login.LoginContext$5.run(LoginContext.java:721)
at javax.security.auth.login.LoginContext$5.run(LoginContext.java:719)
at java.security.AccessController.doPrivileged(Native Method)
at 
javax.security.auth.login.LoginContext.invokeCreatorPriv(LoginContext.java:718)
at javax.security.auth.login.LoginContext.login(LoginContext.java:590)
at 
org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:471)
... 6 more
14/07/22 19:57:59 ERROR Executor: Uncaught exception in thread Thread[Executor 
task launch worker-0,5,main]
java.lang.Error: java.io.IOException: failure to login
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1116)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:636)
Caused by: java.io.IOException: failure to login
at 
org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:490)
at 
org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:452)
at 
org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:40)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
... 2 more
Caused by: javax.security.auth.login.LoginException: unable to find LoginModule 
class: org/apache/hadoop/security/UserGroupInformation$HadoopLoginModule
at javax.security.auth.login.LoginContext.invoke(LoginContext.java:823)
at 
javax.security.auth.login.LoginContext.access$000(LoginContext.java:203)
at javax.security.auth.login.LoginContext$5.run(LoginContext.java:721)
at javax.security.auth.login.LoginContext$5.run(LoginContext.java:719)
at java.security.AccessController.doPrivileged(Native Method)
at 
javax.security.auth.login.LoginContext.invokeCreatorPriv(LoginContext.java:718)
at javax.security.auth.login.LoginContext.login(LoginContext.java:590)
at 
org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:471)
... 6 more
 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1166) leftover vpc_id may block the creation of new ec2 cluster

2014-07-22 Thread bruce szalwinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070823#comment-14070823
 ] 

bruce szalwinski commented on SPARK-1166:
-

I've been able to reproduce this, but not consistently. On this occasion, I had 
previously started an instance, didn't use it for anything, and shut it down 
soon thereafter; I don't know whether that matters. Using Spark 1.0, I ran:

spark-ec2 -k mykeypair -i ~/.aws/mykeypair.pem -s 7 -r us-west-2 -t r3.large 
launch sparck-cluster
Setting up security groups...
Creating security group sparck-cluster-master
Creating security group sparck-cluster-slaves
ERROR:boto:400 Bad Request
ERROR:boto:<?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>InvalidGroup.NotFound</Code><Message>The security group 'sg-185fe07d' does not exist</Message></Error></Errors><RequestID>80f1e1e3-e340-4cd2-ba64-53c13525ab2b</RequestID></Response>
Traceback (most recent call last):
  File "./spark_ec2.py", line 909, in <module>
    main()
  File "./spark_ec2.py", line 901, in main
    real_main()
  File "./spark_ec2.py", line 779, in real_main
    (master_nodes, slave_nodes) = launch_cluster(conn, opts, cluster_name)
  File "./spark_ec2.py", line 279, in launch_cluster
    master_group.authorize(src_group=slave_group)
  File "/Users/szalwinb/playpen/spark/ec2/third_party/boto-2.4.1.zip/boto-2.4.1/boto/ec2/securitygroup.py", line 184, in authorize
  File "/Users/szalwinb/playpen/spark/ec2/third_party/boto-2.4.1.zip/boto-2.4.1/boto/ec2/connection.py", line 2150, in authorize_security_group
  File "/Users/szalwinb/playpen/spark/ec2/third_party/boto-2.4.1.zip/boto-2.4.1/boto/ec2/connection.py", line 2093, in authorize_security_group_deprecated
  File "/Users/szalwinb/playpen/spark/ec2/third_party/boto-2.4.1.zip/boto-2.4.1/boto/connection.py", line 944, in get_status
boto.exception.EC2ResponseError: EC2ResponseError: 400 Bad Request
<?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>InvalidGroup.NotFound</Code><Message>The security group 'sg-185fe07d' does not exist</Message></Error></Errors><RequestID>80f1e1e3-e340-4cd2-ba64-53c13525ab2b</RequestID></Response>

 leftover vpc_id may block the creation of new ec2 cluster
 -

 Key: SPARK-1166
 URL: https://issues.apache.org/jira/browse/SPARK-1166
 Project: Spark
  Issue Type: Bug
Affects Versions: 0.9.0
Reporter: Nan Zhu
Assignee: Nan Zhu

 When I run the spark-ec2 script to build an EC2 cluster, for some reason I 
 always receive errors like the following:
 {code}
 Setting up security groups...
 ERROR:boto:400 Bad Request
 ERROR:boto:<?xml version="1.0" encoding="UTF-8"?>
 <Response><Errors><Error><Code>InvalidParameterValue</Code><Message>Invalid value 'null' for protocol. VPC security group rules must specify protocols explicitly.</Message></Error></Errors><RequestID>fc56f0ba-915a-45b6-8555-05d4dd0f14ee</RequestID></Response>
 Traceback (most recent call last):
   File "./spark_ec2.py", line 813, in <module>
     main()
   File "./spark_ec2.py", line 806, in main
     real_main()
   File "./spark_ec2.py", line 689, in real_main
     conn, opts, cluster_name)
   File "./spark_ec2.py", line 244, in launch_cluster
     slave_group.authorize(src_group=master_group)
   File "/Users/nanzhu/code/spark/ec2/third_party/boto-2.4.1.zip/boto-2.4.1/boto/ec2/securitygroup.py", line 184, in authorize
   File "/Users/nanzhu/code/spark/ec2/third_party/boto-2.4.1.zip/boto-2.4.1/boto/ec2/connection.py", line 2181, in authorize_security_group
   File "/Users/nanzhu/code/spark/ec2/third_party/boto-2.4.1.zip/boto-2.4.1/boto/connection.py", line 944, in get_status
 boto.exception.EC2ResponseError: EC2ResponseError: 400 Bad Request
 <?xml version="1.0" encoding="UTF-8"?>
 <Response><Errors><Error><Code>InvalidParameterValue</Code><Message>Invalid value 'null' for protocol. VPC security group rules must specify protocols explicitly.</Message></Error></Errors><RequestID>fc56f0ba-915a-45b6-8555-05d4dd0f14ee</RequestID></Response>
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2628) Mesos backend throwing unable to find LoginModule

2014-07-22 Thread Timothy Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070826#comment-14070826
 ] 

Timothy Chen commented on SPARK-2628:
-

[~pwendell] please assign to me, thanks!

 Mesos backend throwing unable to find LoginModule 
 --

 Key: SPARK-2628
 URL: https://issues.apache.org/jira/browse/SPARK-2628
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Reporter: Timothy Chen

 http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201406.mbox/%3c1401892590126-6927.p...@n3.nabble.com%3E
 14/07/22 19:57:59 INFO HttpServer: Starting HTTP Server
 14/07/22 19:57:59 ERROR Executor: Uncaught exception in thread 
 Thread[Executor task launch worker-1,5,main]
 java.lang.Error: java.io.IOException: failure to login
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1116)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
 at java.lang.Thread.run(Thread.java:636)
 Caused by: java.io.IOException: failure to login
 at 
 org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:490)
 at 
 org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:452)
 at 
 org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:40)
 at 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
 ... 2 more
 Caused by: javax.security.auth.login.LoginException: unable to find 
 LoginModule class: 
 org/apache/hadoop/security/UserGroupInformation$HadoopLoginModule
 at 
 javax.security.auth.login.LoginContext.invoke(LoginContext.java:823)
 at 
 javax.security.auth.login.LoginContext.access$000(LoginContext.java:203)
 at javax.security.auth.login.LoginContext$5.run(LoginContext.java:721)
 at javax.security.auth.login.LoginContext$5.run(LoginContext.java:719)
 at java.security.AccessController.doPrivileged(Native Method)
 at 
 javax.security.auth.login.LoginContext.invokeCreatorPriv(LoginContext.java:718)
 at javax.security.auth.login.LoginContext.login(LoginContext.java:590)
 at 
 org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:471)
 ... 6 more
 14/07/22 19:57:59 ERROR Executor: Uncaught exception in thread 
 Thread[Executor task launch worker-0,5,main]
 java.lang.Error: java.io.IOException: failure to login
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1116)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
 at java.lang.Thread.run(Thread.java:636)
 Caused by: java.io.IOException: failure to login
 at 
 org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:490)
 at 
 org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:452)
 at 
 org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:40)
 at 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
 ... 2 more
 Caused by: javax.security.auth.login.LoginException: unable to find 
 LoginModule class: 
 org/apache/hadoop/security/UserGroupInformation$HadoopLoginModule
 at 
 javax.security.auth.login.LoginContext.invoke(LoginContext.java:823)
 at 
 javax.security.auth.login.LoginContext.access$000(LoginContext.java:203)
 at javax.security.auth.login.LoginContext$5.run(LoginContext.java:721)
 at javax.security.auth.login.LoginContext$5.run(LoginContext.java:719)
 at java.security.AccessController.doPrivileged(Native Method)
 at 
 javax.security.auth.login.LoginContext.invokeCreatorPriv(LoginContext.java:718)
 at javax.security.auth.login.LoginContext.login(LoginContext.java:590)
 at 
 org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:471)
 ... 6 more
  



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2628) Mesos backend throwing unable to find LoginModule

2014-07-22 Thread Timothy Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Timothy Chen updated SPARK-2628:


Component/s: Mesos

 Mesos backend throwing unable to find LoginModule 
 --

 Key: SPARK-2628
 URL: https://issues.apache.org/jira/browse/SPARK-2628
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Reporter: Timothy Chen

 http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201406.mbox/%3c1401892590126-6927.p...@n3.nabble.com%3E
 14/07/22 19:57:59 INFO HttpServer: Starting HTTP Server
 14/07/22 19:57:59 ERROR Executor: Uncaught exception in thread 
 Thread[Executor task launch worker-1,5,main]
 java.lang.Error: java.io.IOException: failure to login
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1116)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
 at java.lang.Thread.run(Thread.java:636)
 Caused by: java.io.IOException: failure to login
 at 
 org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:490)
 at 
 org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:452)
 at 
 org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:40)
 at 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
 ... 2 more
 Caused by: javax.security.auth.login.LoginException: unable to find 
 LoginModule class: 
 org/apache/hadoop/security/UserGroupInformation$HadoopLoginModule
 at 
 javax.security.auth.login.LoginContext.invoke(LoginContext.java:823)
 at 
 javax.security.auth.login.LoginContext.access$000(LoginContext.java:203)
 at javax.security.auth.login.LoginContext$5.run(LoginContext.java:721)
 at javax.security.auth.login.LoginContext$5.run(LoginContext.java:719)
 at java.security.AccessController.doPrivileged(Native Method)
 at 
 javax.security.auth.login.LoginContext.invokeCreatorPriv(LoginContext.java:718)
 at javax.security.auth.login.LoginContext.login(LoginContext.java:590)
 at 
 org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:471)
 ... 6 more
 14/07/22 19:57:59 ERROR Executor: Uncaught exception in thread 
 Thread[Executor task launch worker-0,5,main]
 java.lang.Error: java.io.IOException: failure to login
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1116)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
 at java.lang.Thread.run(Thread.java:636)
 Caused by: java.io.IOException: failure to login
 at 
 org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:490)
 at 
 org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:452)
 at 
 org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:40)
 at 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
 ... 2 more
 Caused by: javax.security.auth.login.LoginException: unable to find 
 LoginModule class: 
 org/apache/hadoop/security/UserGroupInformation$HadoopLoginModule
 at 
 javax.security.auth.login.LoginContext.invoke(LoginContext.java:823)
 at 
 javax.security.auth.login.LoginContext.access$000(LoginContext.java:203)
 at javax.security.auth.login.LoginContext$5.run(LoginContext.java:721)
 at javax.security.auth.login.LoginContext$5.run(LoginContext.java:719)
 at java.security.AccessController.doPrivileged(Native Method)
 at 
 javax.security.auth.login.LoginContext.invokeCreatorPriv(LoginContext.java:718)
 at javax.security.auth.login.LoginContext.login(LoginContext.java:590)
 at 
 org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:471)
 ... 6 more
  



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2452) Multi-statement input to spark repl does not work

2014-07-22 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070841#comment-14070841
 ] 

Timothy Hunter commented on SPARK-2452:
---

Excellent, thanks Patrick.



 Multi-statement input to spark repl does not work
 -

 Key: SPARK-2452
 URL: https://issues.apache.org/jira/browse/SPARK-2452
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.1
Reporter: Timothy Hunter
Assignee: Prashant Sharma
Priority: Blocker
 Fix For: 1.1.0


 Here is an example:
 {code}
 scala> val x = 4 ; def f() = x
 x: Int = 4
 f: ()Int
 scala> f()
 <console>:11: error: $VAL5 is already defined as value $VAL5
 val $VAL5 = INSTANCE;
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1166) leftover vpc_id may block the creation of new ec2 cluster

2014-07-22 Thread bruce szalwinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070840#comment-14070840
 ] 

bruce szalwinski commented on SPARK-1166:
-

To resolve this, I go to 
https://console.aws.amazon.com/vpc/home?region=us-west-2#securityGroups: and 
manually delete the security groups; then I'm able to start up a cluster.
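
For reference, a hedged sketch of the same cleanup done programmatically with boto 2 (the library bundled with spark-ec2); the group names are assumptions based on the launch output above, and cross-references between the two groups may still need to be revoked before delete() succeeds:

{code}
import boto.ec2

conn = boto.ec2.connect_to_region("us-west-2")
cluster = "sparck-cluster"  # cluster name used in the launch command above
leftover = {cluster + "-master", cluster + "-slaves"}
for group in conn.get_all_security_groups():
    if group.name in leftover:
        # EC2 rejects the delete if other groups still reference this one,
        # so rules pointing at the sibling group may need revoking first.
        group.delete()
{code}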

 leftover vpc_id may block the creation of new ec2 cluster
 -

 Key: SPARK-1166
 URL: https://issues.apache.org/jira/browse/SPARK-1166
 Project: Spark
  Issue Type: Bug
Affects Versions: 0.9.0
Reporter: Nan Zhu
Assignee: Nan Zhu

 When I run the spark-ec2 script to build an EC2 cluster, for some reason I 
 always receive errors like the following:
 {code}
 Setting up security groups...
 ERROR:boto:400 Bad Request
 ERROR:boto:<?xml version="1.0" encoding="UTF-8"?>
 <Response><Errors><Error><Code>InvalidParameterValue</Code><Message>Invalid value 'null' for protocol. VPC security group rules must specify protocols explicitly.</Message></Error></Errors><RequestID>fc56f0ba-915a-45b6-8555-05d4dd0f14ee</RequestID></Response>
 Traceback (most recent call last):
   File "./spark_ec2.py", line 813, in <module>
     main()
   File "./spark_ec2.py", line 806, in main
     real_main()
   File "./spark_ec2.py", line 689, in real_main
     conn, opts, cluster_name)
   File "./spark_ec2.py", line 244, in launch_cluster
     slave_group.authorize(src_group=master_group)
   File "/Users/nanzhu/code/spark/ec2/third_party/boto-2.4.1.zip/boto-2.4.1/boto/ec2/securitygroup.py", line 184, in authorize
   File "/Users/nanzhu/code/spark/ec2/third_party/boto-2.4.1.zip/boto-2.4.1/boto/ec2/connection.py", line 2181, in authorize_security_group
   File "/Users/nanzhu/code/spark/ec2/third_party/boto-2.4.1.zip/boto-2.4.1/boto/connection.py", line 944, in get_status
 boto.exception.EC2ResponseError: EC2ResponseError: 400 Bad Request
 <?xml version="1.0" encoding="UTF-8"?>
 <Response><Errors><Error><Code>InvalidParameterValue</Code><Message>Invalid value 'null' for protocol. VPC security group rules must specify protocols explicitly.</Message></Error></Errors><RequestID>fc56f0ba-915a-45b6-8555-05d4dd0f14ee</RequestID></Response>
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2628) Mesos backend throwing unable to find LoginModule

2014-07-22 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2628:
---

Assignee: Tim Chen

 Mesos backend throwing unable to find LoginModule 
 --

 Key: SPARK-2628
 URL: https://issues.apache.org/jira/browse/SPARK-2628
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Reporter: Timothy Chen
Assignee: Tim Chen

 http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201406.mbox/%3c1401892590126-6927.p...@n3.nabble.com%3E
 14/07/22 19:57:59 INFO HttpServer: Starting HTTP Server
 14/07/22 19:57:59 ERROR Executor: Uncaught exception in thread 
 Thread[Executor task launch worker-1,5,main]
 java.lang.Error: java.io.IOException: failure to login
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1116)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
 at java.lang.Thread.run(Thread.java:636)
 Caused by: java.io.IOException: failure to login
 at 
 org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:490)
 at 
 org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:452)
 at 
 org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:40)
 at 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
 ... 2 more
 Caused by: javax.security.auth.login.LoginException: unable to find 
 LoginModule class: 
 org/apache/hadoop/security/UserGroupInformation$HadoopLoginModule
 at 
 javax.security.auth.login.LoginContext.invoke(LoginContext.java:823)
 at 
 javax.security.auth.login.LoginContext.access$000(LoginContext.java:203)
 at javax.security.auth.login.LoginContext$5.run(LoginContext.java:721)
 at javax.security.auth.login.LoginContext$5.run(LoginContext.java:719)
 at java.security.AccessController.doPrivileged(Native Method)
 at 
 javax.security.auth.login.LoginContext.invokeCreatorPriv(LoginContext.java:718)
 at javax.security.auth.login.LoginContext.login(LoginContext.java:590)
 at 
 org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:471)
 ... 6 more
 14/07/22 19:57:59 ERROR Executor: Uncaught exception in thread 
 Thread[Executor task launch worker-0,5,main]
 java.lang.Error: java.io.IOException: failure to login
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1116)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
 at java.lang.Thread.run(Thread.java:636)
 Caused by: java.io.IOException: failure to login
 at 
 org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:490)
 at 
 org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:452)
 at 
 org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:40)
 at 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
 ... 2 more
 Caused by: javax.security.auth.login.LoginException: unable to find 
 LoginModule class: 
 org/apache/hadoop/security/UserGroupInformation$HadoopLoginModule
 at 
 javax.security.auth.login.LoginContext.invoke(LoginContext.java:823)
 at 
 javax.security.auth.login.LoginContext.access$000(LoginContext.java:203)
 at javax.security.auth.login.LoginContext$5.run(LoginContext.java:721)
 at javax.security.auth.login.LoginContext$5.run(LoginContext.java:719)
 at java.security.AccessController.doPrivileged(Native Method)
 at 
 javax.security.auth.login.LoginContext.invokeCreatorPriv(LoginContext.java:718)
 at javax.security.auth.login.LoginContext.login(LoginContext.java:590)
 at 
 org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:471)
 ... 6 more
  



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1642) Upgrade FlumeInputDStream's FlumeReceiver to support FLUME-2083

2014-07-22 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-1642:
-

Fix Version/s: (was: 1.1.0)

 Upgrade FlumeInputDStream's FlumeReceiver to support FLUME-2083
 ---

 Key: SPARK-1642
 URL: https://issues.apache.org/jira/browse/SPARK-1642
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Reporter: Ted Malaska
Assignee: Ted Malaska
Priority: Minor

 This will add support for SSL encryption between Flume AvroSink and Spark 
 Streaming.
 It is based on FLUME-2083



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1645) Improve Spark Streaming compatibility with Flume

2014-07-22 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-1645:
-

Issue Type: Improvement  (was: Bug)

 Improve Spark Streaming compatibility with Flume
 

 Key: SPARK-1645
 URL: https://issues.apache.org/jira/browse/SPARK-1645
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Reporter: Hari Shreedharan

 Currently the following issues affect Spark Streaming and Flume compatibility:
 * If a Spark worker goes down, it needs to be restarted on the same node, 
 else Flume cannot send data to it. We can fix this by adding a Flume receiver 
 that polls Flume, and a Flume sink that supports this.
 * The receiver sends acks to Flume before the driver knows about the data. The 
 new receiver should also handle this case.
 * Data loss when the driver goes down - this is true for any streaming ingest, 
 not just Flume. I will file a separate jira for this and we should work on it 
 there. This is a longer-term project and requires considerable development 
 work.
 I intend to start working on these soon. Any input is appreciated. (It'd be 
 great if someone could add me as a contributor on jira, so I can assign the 
 jira to myself).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1645) Improve Spark Streaming compatibility with Flume

2014-07-22 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-1645:
-

Target Version/s: 1.1.0

 Improve Spark Streaming compatibility with Flume
 

 Key: SPARK-1645
 URL: https://issues.apache.org/jira/browse/SPARK-1645
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Reporter: Hari Shreedharan

 Currently the following issues affect Spark Streaming and Flume compatibility:
 * If a Spark worker goes down, it needs to be restarted on the same node, 
 else Flume cannot send data to it. We can fix this by adding a Flume receiver 
 that polls Flume, and a Flume sink that supports this.
 * The receiver sends acks to Flume before the driver knows about the data. The 
 new receiver should also handle this case.
 * Data loss when the driver goes down - this is true for any streaming ingest, 
 not just Flume. I will file a separate jira for this and we should work on it 
 there. This is a longer-term project and requires considerable development 
 work.
 I intend to start working on these soon. Any input is appreciated. (It'd be 
 great if someone could add me as a contributor on jira, so I can assign the 
 jira to myself).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2614) Add the spark-examples-xxx-.jar to the Debian package created by assembly/pom.xml (e.g. -Pdeb)

2014-07-22 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070908#comment-14070908
 ] 

Mark Hamstra commented on SPARK-2614:
-

It's also common for installers/admins to not want all of the examples 
installed on their production machines.

This issue is really touching on the fact that Spark's Debian packaging isn't 
presently everything that it could be.  It's really just a hack to allow 
deployment tools like Chef to be able to manage Spark.  To really do Debian 
packaging right, we should be creating multiple packages -- perhaps spark-core, 
spark-sql, spark-mllib, spark-graphx, spark-examples, etc.  The only thing that 
is really preventing us from doing this is that proper Debian packaging would 
require a package maintainer willing and committed to do all of the maintenance 
work. 

 Add the spark-examples-xxx-.jar to the Debian package created by 
 assembly/pom.xml (e.g. -Pdeb)
 --

 Key: SPARK-2614
 URL: https://issues.apache.org/jira/browse/SPARK-2614
 Project: Spark
  Issue Type: Improvement
  Components: Build, Deploy
Reporter: Christian Tzolov

 The tar.gz distribution already includes the spark-examples.jar in the 
 bundle. It is common practice for installers to run SparkPi as a smoke test 
 to verify that the installation is OK:
 /usr/share/spark/bin/spark-submit \
   --num-executors 10  --master yarn-cluster \
   --class org.apache.spark.examples.SparkPi \
   /usr/share/spark/jars/spark-examples-1.0.1-hadoop2.2.0.jar 10



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1642) Upgrade FlumeInputDStream's FlumeReceiver to support FLUME-2083

2014-07-22 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-1642:
-

Target Version/s: 1.1.0

 Upgrade FlumeInputDStream's FlumeReceiver to support FLUME-2083
 ---

 Key: SPARK-1642
 URL: https://issues.apache.org/jira/browse/SPARK-1642
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Reporter: Ted Malaska
Assignee: Ted Malaska
Priority: Minor

 This will add support for SSL encryption between Flume AvroSink and Spark 
 Streaming.
 It is based on FLUME-2083



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1853) Show Streaming application code context (file, line number) in Spark Stages UI

2014-07-22 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-1853:
-

Target Version/s: 1.1.0

 Show Streaming application code context (file, line number) in Spark Stages UI
 --

 Key: SPARK-1853
 URL: https://issues.apache.org/jira/browse/SPARK-1853
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.0.0
Reporter: Tathagata Das
Assignee: Mubarak Seyed
 Fix For: 1.1.0

 Attachments: Screen Shot 2014-07-03 at 2.54.05 PM.png


 Right now, the code context (file, and line number) shown for streaming jobs 
 in stages UI is meaningless as it refers to internal DStream:random line 
 rather than user application file.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2464) Twitter Receiver does not stop correctly when streamingContext.stop is called

2014-07-22 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-2464:
-

Target Version/s: 1.1.0, 1.0.2

 Twitter Receiver does not stop correctly when streamingContext.stop is called
 -

 Key: SPARK-2464
 URL: https://issues.apache.org/jira/browse/SPARK-2464
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.0.0, 1.0.1
Reporter: Tathagata Das
Assignee: Tathagata Das
Priority: Critical





--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2345) ForEachDStream should have an option of running the foreachfunc on Spark

2014-07-22 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-2345:
-

Issue Type: Wish  (was: Bug)

 ForEachDStream should have an option of running the foreachfunc on Spark
 

 Key: SPARK-2345
 URL: https://issues.apache.org/jira/browse/SPARK-2345
 Project: Spark
  Issue Type: Wish
  Components: Streaming
Reporter: Hari Shreedharan

 Today the Job generated simply calls the foreachfunc, but does not run it on 
 spark itself using the sparkContext.runJob method.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1854) Add a version of StreamingContext.fileStream that take hadoop conf object

2014-07-22 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-1854:
-

Fix Version/s: (was: 1.1.0)

 Add a version of StreamingContext.fileStream that take hadoop conf object
 -

 Key: SPARK-1854
 URL: https://issues.apache.org/jira/browse/SPARK-1854
 Project: Spark
  Issue Type: New Feature
  Components: Streaming
Affects Versions: 1.0.0
Reporter: Tathagata Das
Priority: Critical





--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2464) Twitter Receiver does not stop correctly when streamingContext.stop is called

2014-07-22 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-2464:
-

Fix Version/s: (was: 1.0.2)
   (was: 1.1.0)

 Twitter Receiver does not stop correctly when streamingContext.stop is called
 -

 Key: SPARK-2464
 URL: https://issues.apache.org/jira/browse/SPARK-2464
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.0.0, 1.0.1
Reporter: Tathagata Das
Assignee: Tathagata Das
Priority: Critical





--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2379) stopReceive in dead loop, cause stackoverflow exception

2014-07-22 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070915#comment-14070915
 ] 

Tathagata Das commented on SPARK-2379:
--

Any information on this? If we have no way to reproduce this, I will close this.

 stopReceive in dead loop, cause stackoverflow exception
 ---

 Key: SPARK-2379
 URL: https://issues.apache.org/jira/browse/SPARK-2379
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.0.0
Reporter: sunsc

 streaming/src/main/scala/org/apache/spark/streaming/receiver/ReceiverSupervisor.scala
 stop() calls stopReceiver(), and stopReceiver() calls stop() again if an 
 exception occurs, which creates a dead loop.
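
A minimal sketch of the recursion pattern described above (the method names mirror ReceiverSupervisor, but the bodies are purely illustrative): a persistent failure makes stop() and stopReceiver() call each other until the stack overflows.

{code}
# Illustrative only; this is not the actual ReceiverSupervisor code.
class Supervisor(object):
    def stop(self):
        self.stop_receiver()

    def stop_receiver(self):
        try:
            raise RuntimeError("receiver failed to stop")  # persistent failure
        except RuntimeError:
            # Re-enters stop(), which calls stop_receiver() again, and so on
            # until the stack overflows.
            self.stop()

# Supervisor().stop()  # would end in a stack overflow
{code}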



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2447) Add common solution for sending upsert actions to HBase (put, deletes, and increment)

2014-07-22 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-2447:
-

Target Version/s: 1.1.0

 Add common solution for sending upsert actions to HBase (put, deletes, and 
 increment)
 -

 Key: SPARK-2447
 URL: https://issues.apache.org/jira/browse/SPARK-2447
 Project: Spark
  Issue Type: New Feature
Reporter: Ted Malaska
Assignee: Ted Malaska

 Going to review the design with Tdas today.  
 But my first thought is to have an extension of VoidFunction that handles the 
 connection to HBase and allows for options such as turning auto-flush off for 
 higher throughput.
 Need to answer the following questions first.
 - Can it be written in Java or should it be written in Scala?
 - What is the best way to add the HBase dependency? (will review how Flume 
 does this as the first option)
 - What is the best way to do testing? (will review how Flume does this as the 
 first option)
 - How to support Python? (Python may be a different jira; it is unknown at 
 this time)
 Goals:
 - Simple to use
 - Stable
 - Supports high load
 - Documented (May be in a separate Jira need to ask Tdas)
 - Supports Java, Scala, and hopefully Python
 - Supports Streaming and normal Spark



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2377) Create a Python API for Spark Streaming

2014-07-22 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-2377:
-

Target Version/s: 1.1.0

 Create a Python API for Spark Streaming
 ---

 Key: SPARK-2377
 URL: https://issues.apache.org/jira/browse/SPARK-2377
 Project: Spark
  Issue Type: New Feature
  Components: PySpark, Streaming
Reporter: Nicholas Chammas

 [Spark 
 Streaming|http://spark.apache.org/docs/latest/streaming-programming-guide.html]
  currently offers APIs in Scala and Java. It would be a great feature addition to 
 have a Python API as well.
 This is probably a large task that will span many issues if undertaken. This 
 ticket should provide some place to track overall progress towards an initial 
 Python API for Spark Streaming.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2377) Create a Python API for Spark Streaming

2014-07-22 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-2377:
-

Fix Version/s: (was: 1.1.0)

 Create a Python API for Spark Streaming
 ---

 Key: SPARK-2377
 URL: https://issues.apache.org/jira/browse/SPARK-2377
 Project: Spark
  Issue Type: New Feature
  Components: PySpark, Streaming
Reporter: Nicholas Chammas

 [Spark 
 Streaming|http://spark.apache.org/docs/latest/streaming-programming-guide.html]
  currently offers APIs in Scala and Java. It would be a great feature addition to 
 have a Python API as well.
 This is probably a large task that will span many issues if undertaken. This 
 ticket should provide some place to track overall progress towards an initial 
 Python API for Spark Streaming.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (SPARK-2377) Create a Python API for Spark Streaming

2014-07-22 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das reassigned SPARK-2377:


Assignee: Tathagata Das

 Create a Python API for Spark Streaming
 ---

 Key: SPARK-2377
 URL: https://issues.apache.org/jira/browse/SPARK-2377
 Project: Spark
  Issue Type: New Feature
  Components: PySpark, Streaming
Reporter: Nicholas Chammas
Assignee: Tathagata Das

 [Spark 
 Streaming|http://spark.apache.org/docs/latest/streaming-programming-guide.html]
  currently offers APIs in Scala and Java. It would be a great feature addition to 
 have a Python API as well.
 This is probably a large task that will span many issues if undertaken. This 
 ticket should provide some place to track overall progress towards an initial 
 Python API for Spark Streaming.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2447) Add common solution for sending upsert actions to HBase (put, deletes, and increment)

2014-07-22 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-2447:
-

Component/s: Streaming
 Spark Core

 Add common solution for sending upsert actions to HBase (put, deletes, and 
 increment)
 -

 Key: SPARK-2447
 URL: https://issues.apache.org/jira/browse/SPARK-2447
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core, Streaming
Reporter: Ted Malaska
Assignee: Tathagata Das

 Going to review the design with Tdas today.  
 But my first thought is to have an extension of VoidFunction that handles the 
 connection to HBase and allows for options such as turning auto-flush off for 
 higher throughput.
 Need to answer the following questions first.
 - Can it be written in Java or should it be written in Scala?
 - What is the best way to add the HBase dependency? (will review how Flume 
 does this as the first option)
 - What is the best way to do testing? (will review how Flume does this as the 
 first option)
 - How to support Python? (Python may be a different jira; it is unknown at 
 this time)
 Goals:
 - Simple to use
 - Stable
 - Supports high load
 - Documented (May be in a separate Jira need to ask Tdas)
 - Supports Java, Scala, and hopefully Python
 - Supports Streaming and normal Spark



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1729) Make Flume pull data from source, rather than the current push model

2014-07-22 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-1729:
-

Assignee: Hari Shreedharan  (was: Tathagata Das)

 Make Flume pull data from source, rather than the current push model
 

 Key: SPARK-1729
 URL: https://issues.apache.org/jira/browse/SPARK-1729
 Project: Spark
  Issue Type: Sub-task
  Components: Streaming
Affects Versions: 1.0.0
Reporter: Tathagata Das
Assignee: Hari Shreedharan
 Fix For: 1.1.0


 This makes sure that the if the Spark executor running the receiver goes 
 down, the new receiver on a new node can still get data from Flume.
 This is not possible in the current model, as Flume is configured to push 
 data to a executor/worker and if that worker is down, Flume cant push data.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2599) almostEquals mllib.util.TestingUtils does not behave as expected when comparing against 0.0

2014-07-22 Thread DB Tsai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070929#comment-14070929
 ] 

DB Tsai commented on SPARK-2599:


I'm the original author of `almostEquals` for my unit tests, and I also noticed 
that it suffers when comparing against 0.0. As [~srowen] pointed out, it's 
meaningless to compare against 0.0 (or a really small number) with relative 
error. However, people may still want to write unit tests using relative error 
even when comparing those small numbers. So I propose the following APIs.

`a ~== b +- eps` for relative error; when a or b is near zero, say below 
1e-15, it falls back to absolute error.
`a === b +- eps`, which is already in scalatest 2.0, for absolute error; but 
since we don't use scalatest 2.0 yet, we build the same APIs in mllib for 
absolute error.
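
A small sketch of the proposed comparison semantics (illustrated in Python for brevity; the actual proposal is a Scala API in mllib's TestingUtils, and the names here are made up): compare with relative error, but fall back to absolute error whenever either operand is too close to zero for a relative error to be meaningful.

{code}
# Illustrative semantics only; the zero threshold follows the 1e-15 suggestion above.
def almost_equal(a, b, eps, zero_threshold=1e-15):
    if abs(a) < zero_threshold or abs(b) < zero_threshold:
        return abs(a - b) < eps                        # absolute error near zero
    return abs(a - b) / max(abs(a), abs(b)) < eps      # relative error otherwise

assert almost_equal(1e-16, 0.0, 1e-10)        # would fail under pure relative error
assert almost_equal(100.0, 100.0000001, 1e-8)
{code}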

 almostEquals mllib.util.TestingUtils does not behave as expected when 
 comparing against 0.0
 ---

 Key: SPARK-2599
 URL: https://issues.apache.org/jira/browse/SPARK-2599
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Reporter: Doris Xin
Priority: Minor

 DoubleWithAlmostEquals.almostEquals, when used to compare a number with 0.0, 
 would always produce an epsilon of 1 > 1e-10, causing false failures when 
 comparing very small numbers with 0.0.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1730) Make receiver store data reliably to avoid data-loss on executor failures

2014-07-22 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-1730:
-

Assignee: Hari Shreedharan

 Make receiver store data reliably to avoid data-loss on executor failures
 -

 Key: SPARK-1730
 URL: https://issues.apache.org/jira/browse/SPARK-1730
 Project: Spark
  Issue Type: Sub-task
  Components: Streaming
Affects Versions: 1.0.0
Reporter: Tathagata Das
Assignee: Hari Shreedharan
 Fix For: 1.1.0






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2438) Streaming + MLLib

2014-07-22 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-2438:
-

Target Version/s: 1.1.0

 Streaming + MLLib
 -

 Key: SPARK-2438
 URL: https://issues.apache.org/jira/browse/SPARK-2438
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, Streaming
Reporter: Jeremy Freeman
Assignee: Jeremy Freeman
  Labels: features

 This is a ticket to track progress on developing streaming analyses in MLLib.
 Many streaming applications benefit from or require fitting models online, 
 where the parameters of a model (e.g. regression, clustering) are updated 
 continually as new data arrive. This can be accomplished by incorporating 
 MLLib algorithms into model-updating operations over DStreams. In some cases 
 this can be achieved using existing updaters (e.g. those based on SGD), but 
 in other cases will require custom update rules (e.g. for KMeans). The goal 
 is to have streaming versions of many common algorithms, in particular 
 regression, classification, clustering, and possibly dimensionality reduction.
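
A conceptual sketch of the "update the model as new data arrive" idea (plain Python with NumPy, one SGD step per mini-batch; in Spark Streaming the update would run inside a DStream operation rather than over an in-memory list):

{code}
import numpy as np

def sgd_update(weights, batch, lr=0.01):
    # batch: list of (features, label) pairs; one least-squares gradient step each
    for x, y in batch:
        x = np.asarray(x)
        weights = weights - lr * (weights.dot(x) - y) * x
    return weights

weights = np.zeros(2)
batches = [[([1.0, 0.0], 1.0), ([0.0, 1.0], 2.0)],   # stand-ins for successive
           [([1.0, 1.0], 3.0)]]                      # mini-batches of a stream
for batch in batches:
    weights = sgd_update(weights, batch)             # model evolves continually
{code}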



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2438) Streaming + MLLib

2014-07-22 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-2438:
-

Component/s: Streaming

 Streaming + MLLib
 -

 Key: SPARK-2438
 URL: https://issues.apache.org/jira/browse/SPARK-2438
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, Streaming
Reporter: Jeremy Freeman
Assignee: Jeremy Freeman
  Labels: features

 This is a ticket to track progress on developing streaming analyses in MLLib.
 Many streaming applications benefit from or require fitting models online, 
 where the parameters of a model (e.g. regression, clustering) are updated 
 continually as new data arrive. This can be accomplished by incorporating 
 MLLib algorithms into model-updating operations over DStreams. In some cases 
 this can be achieved using existing updaters (e.g. those based on SGD), but 
 in other cases will require custom update rules (e.g. for KMeans). The goal 
 is to have streaming versions of many common algorithms, in particular 
 regression, classification, clustering, and possibly dimensionality reduction.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1645) Improve Spark Streaming compatibility with Flume

2014-07-22 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-1645:
-

Issue Type: Improvement  (was: New Feature)

 Improve Spark Streaming compatibility with Flume
 

 Key: SPARK-1645
 URL: https://issues.apache.org/jira/browse/SPARK-1645
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Reporter: Hari Shreedharan

 Currently the following issues affect Spark Streaming and Flume compatibility:
 * If a Spark worker goes down, it needs to be restarted on the same node, 
 else Flume cannot send data to it. We can fix this by adding a Flume receiver 
 that polls Flume, and a Flume sink that supports this.
 * The receiver sends acks to Flume before the driver knows about the data. The 
 new receiver should also handle this case.
 * Data loss when the driver goes down - this is true for any streaming ingest, 
 not just Flume. I will file a separate jira for this and we should work on it 
 there. This is a longer-term project and requires considerable development 
 work.
 I intend to start working on these soon. Any input is appreciated. (It'd be 
 great if someone could add me as a contributor on jira, so I can assign the 
 jira to myself).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1645) Improve Spark Streaming compatibility with Flume

2014-07-22 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-1645:
-

Issue Type: New Feature  (was: Improvement)

 Improve Spark Streaming compatibility with Flume
 

 Key: SPARK-1645
 URL: https://issues.apache.org/jira/browse/SPARK-1645
 Project: Spark
  Issue Type: New Feature
  Components: Streaming
Reporter: Hari Shreedharan

 Currently the following issues affect Spark Streaming and Flume compatibility:
 * If a Spark worker goes down, it needs to be restarted on the same node, 
 else Flume cannot send data to it. We can fix this by adding a Flume receiver 
 that polls Flume, and a Flume sink that supports this.
 * The receiver sends acks to Flume before the driver knows about the data. The 
 new receiver should also handle this case.
 * Data loss when the driver goes down - this is true for any streaming ingest, 
 not just Flume. I will file a separate jira for this and we should work on it 
 there. This is a longer-term project and requires considerable development 
 work.
 I intend to start working on these soon. Any input is appreciated. (It'd be 
 great if someone could add me as a contributor on jira, so I can assign the 
 jira to myself).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2438) Streaming + MLLib

2014-07-22 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-2438:
-

Issue Type: New Feature  (was: Improvement)

 Streaming + MLLib
 -

 Key: SPARK-2438
 URL: https://issues.apache.org/jira/browse/SPARK-2438
 Project: Spark
  Issue Type: New Feature
  Components: MLlib, Streaming
Reporter: Jeremy Freeman
Assignee: Jeremy Freeman
  Labels: features

 This is a ticket to track progress on developing streaming analyses in MLLib.
 Many streaming applications benefit from or require fitting models online, 
 where the parameters of a model (e.g. regression, clustering) are updated 
 continually as new data arrive. This can be accomplished by incorporating 
 MLLib algorithms into model-updating operations over DStreams. In some cases 
 this can be achieved using existing updaters (e.g. those based on SGD), but 
 in other cases will require custom update rules (e.g. for KMeans). The goal 
 is to have streaming versions of many common algorithms, in particular 
 regression, classification, clustering, and possibly dimensionality reduction.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1730) Make receiver store data reliably to avoid data-loss on executor failures

2014-07-22 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-1730:
-

Target Version/s: 1.1.0
   Fix Version/s: (was: 1.1.0)

 Make receiver store data reliably to avoid data-loss on executor failures
 -

 Key: SPARK-1730
 URL: https://issues.apache.org/jira/browse/SPARK-1730
 Project: Spark
  Issue Type: Sub-task
  Components: Streaming
Affects Versions: 1.0.0
Reporter: Tathagata Das
Assignee: Tathagata Das





--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2548) JavaRecoverableWordCount is missing

2014-07-22 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-2548:
-

Target Version/s: 1.1.0, 1.0.2, 0.9.3  (was: 1.1.0, 0.9.3)

 JavaRecoverableWordCount is missing
 ---

 Key: SPARK-2548
 URL: https://issues.apache.org/jira/browse/SPARK-2548
 Project: Spark
  Issue Type: Bug
  Components: Documentation, Streaming
Affects Versions: 0.9.2, 1.0.1
Reporter: Xiangrui Meng
Priority: Minor

 JavaRecoverableWordCount was mentioned in the doc but not in the codebase. We 
 need to rewrite the example because the code was lost during the migration 
 from spark/spark-incubating to apache/spark.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2629) Improve performance of DStream.updateStateByKey using IndexRDD

2014-07-22 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-2629:


 Summary: Improve performance of DStream.updateStateByKey using 
IndexRDD
 Key: SPARK-2629
 URL: https://issues.apache.org/jira/browse/SPARK-2629
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Reporter: Tathagata Das
Assignee: Tathagata Das






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2629) Improve performance of DStream.updateStateByKey using IndexRDD

2014-07-22 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070939#comment-14070939
 ] 

Tathagata Das commented on SPARK-2629:
--

Index RDD is necessary for this improvement to be made.

 Improve performance of DStream.updateStateByKey using IndexRDD
 --

 Key: SPARK-2629
 URL: https://issues.apache.org/jira/browse/SPARK-2629
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Reporter: Tathagata Das
Assignee: Tathagata Das





--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1642) Upgrade FlumeInputDStream's FlumeReceiver to support FLUME-2083

2014-07-22 Thread Ted Malaska (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070938#comment-14070938
 ] 

Ted Malaska commented on SPARK-1642:


Are there any changes needed here?

 Upgrade FlumeInputDStream's FlumeReceiver to support FLUME-2083
 ---

 Key: SPARK-1642
 URL: https://issues.apache.org/jira/browse/SPARK-1642
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Reporter: Ted Malaska
Assignee: Ted Malaska
Priority: Minor

 This will add support for SSL encryption between Flume AvroSink and Spark 
 Streaming.
 It is based on FLUME-2083



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2420) Change Spark build to minimize library conflicts

2014-07-22 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070966#comment-14070966
 ] 

Marcelo Vanzin commented on SPARK-2420:
---

I'm all for sanitizing dependencies, but just be aware that this change might 
break people's applications. Any application that uses APIs available only in 
Guava > 11, relies on Spark's transitive dependency on Guava, and relies on the 
Guava classes being packaged with the Spark uber-jar will break.

This is the main reason why all this dependency stuff is tricky. As soon as you 
package things in a particular way, people start depending on it, and now you 
always have to keep that in consideration when you make a change.

 Change Spark build to minimize library conflicts
 

 Key: SPARK-2420
 URL: https://issues.apache.org/jira/browse/SPARK-2420
 Project: Spark
  Issue Type: Wish
  Components: Build
Affects Versions: 1.0.0
Reporter: Xuefu Zhang
 Attachments: spark_1.0.0.patch


 During the prototyping of HIVE-7292, many library conflicts showed up because 
 the Spark build contains versions of libraries that are vastly different from 
 the current major Hadoop version. It would be nice if we could choose versions 
 that are in line with Hadoop, or shade them in the assembly. Here is the wish 
 list:
 1. Upgrade the protobuf version to 2.5.0 from the current 2.4.1.
 2. Shade Spark's jetty and servlet dependencies in the assembly.
 3. Guava version difference. Spark is using a higher version. I'm not sure 
 what the best solution for this is.
 The list may grow as HIVE-7292 proceeds.
 For information only, the attached is a patch that we applied to Spark in 
 order to make Spark work with Hive. It gives an idea of the scope of the changes.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1853) Show Streaming application code context (file, line number) in Spark Stages UI

2014-07-22 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-1853:
-

Assignee: Mubarak Seyed  (was: Tathagata Das)

 Show Streaming application code context (file, line number) in Spark Stages UI
 --

 Key: SPARK-1853
 URL: https://issues.apache.org/jira/browse/SPARK-1853
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.0.0
Reporter: Tathagata Das
Assignee: Mubarak Seyed
 Fix For: 1.1.0

 Attachments: Screen Shot 2014-07-03 at 2.54.05 PM.png


 Right now, the code context (file, and line number) shown for streaming jobs 
 in stages UI is meaningless as it refers to internal DStream:random line 
 rather than user application file.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2630) Input data size goes overflow when size is large then 4G in one task

2014-07-22 Thread Davies Liu (JIRA)
Davies Liu created SPARK-2630:
-

 Summary: Input data size goes overflow when size is large then 4G 
in one task
 Key: SPARK-2630
 URL: https://issues.apache.org/jira/browse/SPARK-2630
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, Web UI
Affects Versions: 1.0.0, 1.0.1
Reporter: Davies Liu
Priority: Critical


Given one big file, such as text.4.3G, and put all of it in one task:

sc.textFile("text.4.3.G").coalesce(1).count()

In the Spark Web UI, you will see that the input size is reported as 5.4M.
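
A hedged back-of-the-envelope check of why an overflow would show up as a few megabytes: assuming the per-task input-size counter is a 32-bit Int, a roughly 4.3 GB input wraps modulo 2^32 and only the remainder is displayed, which lands in the same ballpark as the 5.4M shown in the UI.

{code}
size_bytes = int(4.3 * 10**9)   # assumed size of the ~4.3G test file
wrapped = size_bytes % 2**32    # what a 32-bit counter would retain
print(wrapped / 1e6, "MB")      # roughly 5 MB
{code}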



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2630) Input data size goes overflow when size is large then 4G in one task

2014-07-22 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-2630:
--

Attachment: overflow.tiff

The input size is shown as 5.8MB, but the real input size is 4.3G.

 Input data size goes overflow when size is large then 4G in one task
 

 Key: SPARK-2630
 URL: https://issues.apache.org/jira/browse/SPARK-2630
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, Web UI
Affects Versions: 1.0.0, 1.0.1
Reporter: Davies Liu
Priority: Critical
 Attachments: overflow.tiff


 Given one big file, such as text.4.3G, and put all of it in one task:
 sc.textFile("text.4.3.G").coalesce(1).count()
 In the Spark Web UI, you will see that the input size is reported as 5.4M.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2631) In-memory Compression is not configured with SQLConf

2014-07-22 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-2631:
---

 Summary: In-memory Compression is not configured with SQLConf
 Key: SPARK-2631
 URL: https://issues.apache.org/jira/browse/SPARK-2631
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Michael Armbrust






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2282) PySpark crashes if too many tasks complete quickly

2014-07-22 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14071039#comment-14071039
 ] 

Patrick Wendell commented on SPARK-2282:


[~carlilek] I'd actually recommend just pulling Spark from the branch-1.0 
maintenance branch. We usually recommend that users do this, since we only add 
stability fixes on those branches.

 PySpark crashes if too many tasks complete quickly
 --

 Key: SPARK-2282
 URL: https://issues.apache.org/jira/browse/SPARK-2282
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 0.9.1, 1.0.0, 1.0.1
Reporter: Aaron Davidson
Assignee: Aaron Davidson
 Fix For: 0.9.2, 1.0.0, 1.0.1


 Upon every task completion, PythonAccumulatorParam constructs a new socket to 
 the Accumulator server running inside the pyspark daemon. This can cause a 
 buildup of used ephemeral ports from sockets in the TIME_WAIT termination 
 stage, which will cause the SparkContext to crash if too many tasks complete 
 too quickly. We ran into this bug with 17k tasks completing in 15 seconds.
 This bug can be fixed outside of Spark by ensuring these properties are set 
 (on a Linux server):
 echo 1 > /proc/sys/net/ipv4/tcp_tw_reuse
 echo 1 > /proc/sys/net/ipv4/tcp_tw_recycle
 or by adding the SO_REUSEADDR option to the Socket creation within Spark.
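 As a rough sketch of that in-Spark alternative (not the actual 
 PythonAccumulatorParam code; host and port are placeholders), one could enable 
 SO_REUSEADDR on the client socket before connecting:
 {code}
 import java.net.{InetSocketAddress, Socket}

 // Hedged sketch: open the accumulator socket with SO_REUSEADDR enabled so
 // ephemeral ports lingering in TIME_WAIT can be reused between connections.
 def openAccumulatorSocket(host: String, port: Int): Socket = {
   val socket = new Socket()
   socket.setReuseAddress(true)                    // SO_REUSEADDR
   socket.connect(new InetSocketAddress(host, port))
   socket
 }
 {code}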



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2426) Quadratic Minimization for MLlib ALS

2014-07-22 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-2426:
-

Target Version/s:   (was: 1.1.0)

 Quadratic Minimization for MLlib ALS
 

 Key: SPARK-2426
 URL: https://issues.apache.org/jira/browse/SPARK-2426
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.0.0
Reporter: Debasish Das
Assignee: Debasish Das
   Original Estimate: 504h
  Remaining Estimate: 504h

 Current ALS supports least squares and nonnegative least squares.
 I presented ADMM and IPM based Quadratic Minimization solvers to be used for 
 the following ALS problems:
 1. ALS with bounds
 2. ALS with L1 regularization
 3. ALS with Equality constraint and bounds
 Initial runtime comparisons are presented at Spark Summit. 
 http://spark-summit.org/2014/talk/quadratic-programing-solver-for-non-negative-matrix-factorization-with-spark
 Based on Xiangrui's feedback I am currently comparing the ADMM based 
 Quadratic Minimization solvers with IPM based QpSolvers and the default 
 ALS/NNLS. I will keep updating the runtime comparison results.
 For integration, the detailed plan is as follows:
 1. Add ADMM and IPM based QuadraticMinimization solvers to 
 breeze.optimize.quadratic package.
 2. Add a QpSolver object in spark mllib optimization which calls breeze
 3. Add the QpSolver object in spark mllib ALS
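 For reference, a hedged sketch (with generic symbols, not necessarily the 
 notation used in the PR) of the per-factor subproblem that the constrained ALS 
 variants above reduce to, which is what the ADMM/IPM quadratic solvers would 
 target:
 {code}
 % Per-user (or per-product) ALS subproblem as a constrained QP, where Y holds the
 % fixed factors, r the observed ratings, \lambda the regularization weight, and
 % l, u optional bounds (an L1 term or equality constraints can be added similarly):
 \min_{x \in \mathbb{R}^k} \; \tfrac{1}{2}\, x^\top \left( Y^\top Y + \lambda I \right) x \;-\; \left( Y^\top r \right)^\top x
 \quad \text{subject to} \quad l \le x \le u
 {code}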



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-2613) CLONE - word2vec: Distributed Representation of Words

2014-07-22 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-2613.
--

Resolution: Duplicate

 CLONE - word2vec: Distributed Representation of Words
 -

 Key: SPARK-2613
 URL: https://issues.apache.org/jira/browse/SPARK-2613
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Yifan Yang
Assignee: Liquan Pei
   Original Estimate: 672h
  Remaining Estimate: 672h

 We would like to add parallel implementation of word2vec to MLlib. word2vec 
 finds distributed representation of words through training of large data 
 sets. The Spark programming model fits nicely with word2vec as the training 
 algorithm of word2vec is embarrassingly parallel. We will focus on skip-gram 
 model and negative sampling in our initial implementation. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1545) Add Random Forest algorithm to MLlib

2014-07-22 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-1545:
-

Target Version/s:   (was: 1.1.0)

 Add Random Forest algorithm to MLlib
 

 Key: SPARK-1545
 URL: https://issues.apache.org/jira/browse/SPARK-1545
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.0.0
Reporter: Manish Amde
Assignee: Manish Amde

 This task requires adding Random Forest support to Spark MLlib. The 
 implementation needs to adapt the classic algorithm to the scalable tree 
 implementation.
 The tasks involves:
 - Comparing the various tradeoffs and finalizing the algorithm before 
 implementation
 - Code implementation
 - Unit tests
 - Functional tests
 - Performance tests
 - Documentation



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1547) Add gradient boosting algorithm to MLlib

2014-07-22 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-1547:
-

Target Version/s:   (was: 1.1.0)

 Add gradient boosting algorithm to MLlib
 

 Key: SPARK-1547
 URL: https://issues.apache.org/jira/browse/SPARK-1547
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.0.0
Reporter: Manish Amde
Assignee: Manish Amde

 This task requires adding the gradient boosting algorithm to Spark MLlib. The 
 implementation needs to adapt the gradient boosting algorithm to the scalable 
 tree implementation.
 The tasks involves:
 - Comparing the various tradeoffs and finalizing the algorithm before 
 implementation
 - Code implementation
 - Unit tests
 - Functional tests
 - Performance tests
 - Documentation



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2010) Support for nested data in PySpark SQL

2014-07-22 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2010:


Priority: Blocker  (was: Critical)

 Support for nested data in PySpark SQL
 --

 Key: SPARK-2010
 URL: https://issues.apache.org/jira/browse/SPARK-2010
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Michael Armbrust
Assignee: Kan Zhang
Priority: Blocker





--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2632) Importing a method of a class in the Spark REPL causes the REPL to pull in unnecessary stuff.

2014-07-22 Thread Yin Huai (JIRA)
Yin Huai created SPARK-2632:
---

 Summary: Importing a method of a class in the Spark REPL causes the 
REPL to pull in unnecessary stuff.
 Key: SPARK-2632
 URL: https://issues.apache.org/jira/browse/SPARK-2632
 Project: Spark
  Issue Type: Bug
Reporter: Yin Huai
Priority: Blocker


 To reproduce the exception, you can start a local cluster (sbin/start-all.sh) 
then open a spark shell.

{code}
class X() { println("What!"); def y = 3 }
val x = new X
import x.y
case class Person(name: String, age: Int)
sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p 
=> Person(p(0), p(1).trim.toInt)).collect
{code}
Then you will find the exception. I am attaching the stack trace below...
{code}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0 in 
stage 0.0 (TID 0) had a not serializable result: $iwC$$iwC$$iwC$$iwC$X
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1045)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1029)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1027)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1027)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:632)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:632)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:632)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1230)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
{code}
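Since the task result drags the REPL wrapper holding the X instance into serialization, a hedged workaround sketch (not a fix for the underlying REPL behavior) is to make X serializable; the rest of the snippet is unchanged:
{code}
// Workaround sketch while SPARK-2632 is open: mark X as Serializable so the
// captured REPL wrapper chain can be serialized along with the task result.
class X() extends Serializable { println("What!"); def y = 3 }
val x = new X
import x.y
case class Person(name: String, age: Int)
sc.textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
  .collect()
{code}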



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2632) Importing a method of a class in the Spark REPL causes the REPL to pull in unnecessary stuff.

2014-07-22 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-2632:


Description: 
 Master is affected by this bug. To reproduce the exception, you can start a 
local cluster (sbin/start-all.sh) then open a spark shell.

{code}
class X() { println("What!"); def y = 3 }
val x = new X
import x.y
case class Person(name: String, age: Int)
sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p 
=> Person(p(0), p(1).trim.toInt)).collect
{code}
Then you will find the exception. I am attaching the stack trace below...
{code}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0 in 
stage 0.0 (TID 0) had a not serializable result: $iwC$$iwC$$iwC$$iwC$X
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1045)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1029)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1027)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1027)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:632)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:632)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:632)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1230)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
{code}

  was:
 To reproduce the exception, you can start a local cluster (sbin/start-all.sh) 
then open a spark shell.

{code}
class X() { println("What!"); def y = 3 }
val x = new X
import x.y
case class Person(name: String, age: Int)
sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p 
=> Person(p(0), p(1).trim.toInt)).collect
{code}
Then you will find the exception. I am attaching the stack trace below...
{code}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0 in 
stage 0.0 (TID 0) had a not serializable result: $iwC$$iwC$$iwC$$iwC$X
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1045)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1029)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1027)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1027)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:632)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:632)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:632)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1230)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 

[jira] [Commented] (SPARK-2282) PySpark crashes if too many tasks complete quickly

2014-07-22 Thread Aaron Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14071192#comment-14071192
 ] 

Aaron Davidson commented on SPARK-2282:
---

[~pwendell] That would in general be the right solution, but this particular 
change hasn't been merged yet (referring to the second PR on this bug, which is 
a more complete fix).

 PySpark crashes if too many tasks complete quickly
 --

 Key: SPARK-2282
 URL: https://issues.apache.org/jira/browse/SPARK-2282
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 0.9.1, 1.0.0, 1.0.1
Reporter: Aaron Davidson
Assignee: Aaron Davidson
 Fix For: 0.9.2, 1.0.0, 1.0.1


 Upon every task completion, PythonAccumulatorParam constructs a new socket to 
 the Accumulator server running inside the pyspark daemon. This can cause a 
 buildup of used ephemeral ports from sockets in the TIME_WAIT termination 
 stage, which will cause the SparkContext to crash if too many tasks complete 
 too quickly. We ran into this bug with 17k tasks completing in 15 seconds.
 This bug can be fixed outside of Spark by ensuring these properties are set 
 (on a Linux server):
 echo 1 > /proc/sys/net/ipv4/tcp_tw_reuse
 echo 1 > /proc/sys/net/ipv4/tcp_tw_recycle
 or by adding the SO_REUSEADDR option to the Socket creation within Spark.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2282) PySpark crashes if too many tasks complete quickly

2014-07-22 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14071197#comment-14071197
 ] 

Patrick Wendell commented on SPARK-2282:


Ah, my bad. I was confused.

 PySpark crashes if too many tasks complete quickly
 --

 Key: SPARK-2282
 URL: https://issues.apache.org/jira/browse/SPARK-2282
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 0.9.1, 1.0.0, 1.0.1
Reporter: Aaron Davidson
Assignee: Aaron Davidson
 Fix For: 0.9.2, 1.0.0, 1.0.1


 Upon every task completion, PythonAccumulatorParam constructs a new socket to 
 the Accumulator server running inside the pyspark daemon. This can cause a 
 buildup of used ephemeral ports from sockets in the TIME_WAIT termination 
 stage, which will cause the SparkContext to crash if too many tasks complete 
 too quickly. We ran into this bug with 17k tasks completing in 15 seconds.
 This bug can be fixed outside of Spark by ensuring these properties are set 
 (on a Linux server):
 echo 1 > /proc/sys/net/ipv4/tcp_tw_reuse
 echo 1 > /proc/sys/net/ipv4/tcp_tw_recycle
 or by adding the SO_REUSEADDR option to the Socket creation within Spark.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2615) Add == support for HiveQl

2014-07-22 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14071214#comment-14071214
 ] 

Cheng Hao commented on SPARK-2615:
--

Yes, that's true.
But "==" is actually used in lots of unit test queries; without this being 
fixed in Spark SQL, some of them may not be able to pass. 
For example:
https://github.com/apache/hive/blob/trunk/ql/src/test/queries/clientpositive/groupby_grouping_id1.q

Probably we should create a JIRA issue for Hive, either for the documentation 
or the source code.

 Add == support for HiveQl
 ---

 Key: SPARK-2615
 URL: https://issues.apache.org/jira/browse/SPARK-2615
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Hao
Assignee: Cheng Hao
Priority: Minor

 Currently, passing "==" instead of "=" in a Hive QL expression will 
 cause an exception.
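 A hedged reproduction sketch, assuming an existing SparkContext named sc and a 
 Hive table src(key INT, value STRING) as in the Hive test data:
 {code}
 import org.apache.spark.sql.hive.HiveContext

 // Hedged sketch of the reported behavior; table and column names are assumptions.
 val hiveContext = new HiveContext(sc)
 hiveContext.hql("SELECT key FROM src WHERE key = 100").collect()   // parses fine
 hiveContext.hql("SELECT key FROM src WHERE key == 100").collect()  // currently throws a parse exception
 {code}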



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2614) Add the spark-examples-xxx-.jar to the Debian package created by assembly/pom.xml (e.g. -Pdeb)

2014-07-22 Thread Christian Tzolov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14071220#comment-14071220
 ] 

Christian Tzolov commented on SPARK-2614:
-

Fair point [~markhamstra]. I agree about the benefits of having dedicated 
packages. Actually, I've tried a patch that wraps the examples.jar in a separate 
spark-examples Debian package. The code is here: 
https://github.com/tzolov/spark/tree/SPARK-2614-2 
Shall I open a pull request for this implementation instead? At least until the 
proper packaging is in place. 

Regarding the proper Debian (and hopefully RPM) packaging, I am afraid I have 
too little experience with this subject to get it right. 

 Add the spark-examples-xxx-.jar to the Debian package created by 
 assembly/pom.xml (e.g. -Pdeb)
 --

 Key: SPARK-2614
 URL: https://issues.apache.org/jira/browse/SPARK-2614
 Project: Spark
  Issue Type: Improvement
  Components: Build, Deploy
Reporter: Christian Tzolov

 The tar.gz distribution already includes the spark-examples.jar in the 
 bundle. It is common practice for installers to run SparkPi as a smoke test 
 to verify that the installation is OK:
 /usr/share/spark/bin/spark-submit \
   --num-executors 10  --master yarn-cluster \
   --class org.apache.spark.examples.SparkPi \
   /usr/share/spark/jars/spark-examples-1.0.1-hadoop2.2.0.jar 10



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-975) Spark Replay Debugger

2014-07-22 Thread Phuoc Do (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14071225#comment-14071225
 ] 

Phuoc Do commented on SPARK-975:


Cheng Lian, some JS libraries that can draw flow diagrams:

http://www.graphdracula.net/
http://www.daviddurman.com/automatic-graph-layout-with-jointjs-and-dagre.html

Your diagram in the 01/May/14 comment doesn't look like this one:

http://spark-replay-debugger-overview.readthedocs.org/en/latest/_static/als-1-large.png

Has the requirement changed?

 Spark Replay Debugger
 -

 Key: SPARK-975
 URL: https://issues.apache.org/jira/browse/SPARK-975
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 0.9.0
Reporter: Cheng Lian
  Labels: arthur, debugger
 Attachments: RDD DAG.png


 The Spark debugger was first mentioned as {{rddbg}} in the [RDD technical 
 report|http://www.cs.berkeley.edu/~matei/papers/2011/tr_spark.pdf].
 [Arthur|https://github.com/mesos/spark/tree/arthur], authored by [Ankur 
 Dave|https://github.com/ankurdave], is an old implementation of the Spark 
 debugger, which demonstrated both the elegance and power behind the RDD 
 abstraction.  Unfortunately, the corresponding GitHub branch was not merged 
 into the master branch and development stopped 2 years ago.  For more information 
 about Arthur, please refer to [the Spark Debugger Wiki 
 page|https://github.com/mesos/spark/wiki/Spark-Debugger] in the old GitHub 
 repository.
 As a useful tool for Spark application debugging and analysis, it would be 
 nice to have a complete Spark debugger.  In 
 [PR-224|https://github.com/apache/incubator-spark/pull/224], I propose a new 
 implementation of the Spark debugger, the Spark Replay Debugger (SRD).
 [PR-224|https://github.com/apache/incubator-spark/pull/224] is only a preview 
 for discussion.  In the current version, I only implemented features that can 
 illustrate the basic mechanisms.  There are still features that appeared in Arthur 
 but are missing in SRD, such as checksum-based nondeterminism detection and 
 single-task debugging with a conventional debugger (like {{jdb}}).  However, these 
 features can easily be built upon the current SRD framework.  To minimize code 
 review effort, I intentionally didn't include them in the current version.
 Attached is the visualization of the MLlib ALS application (with 1 iteration) 
 generated by SRD.  For more information, please refer to [the SRD overview 
 document|http://spark-replay-debugger-overview.readthedocs.org/en/latest/].



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-975) Spark Replay Debugger

2014-07-22 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14071232#comment-14071232
 ] 

Cheng Lian commented on SPARK-975:
--

Hey [~phuocd], that image actually shows exactly the same issue as I commented on. 
Take RDD #0 to #4 as an example: #0 to #3 form a stage, and #0, #1, #2 and #4 
form another. These two stages share #0 to #2 and should overlap, and the 
generated dot file describes the topology correctly. But Graphviz gives wrong 
bounding boxes, just like the right part of the image I used in the comment.

 Spark Replay Debugger
 -

 Key: SPARK-975
 URL: https://issues.apache.org/jira/browse/SPARK-975
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 0.9.0
Reporter: Cheng Lian
  Labels: arthur, debugger
 Attachments: RDD DAG.png


 The Spark debugger was first mentioned as {{rddbg}} in the [RDD technical 
 report|http://www.cs.berkeley.edu/~matei/papers/2011/tr_spark.pdf].
 [Arthur|https://github.com/mesos/spark/tree/arthur], authored by [Ankur 
 Dave|https://github.com/ankurdave], is an old implementation of the Spark 
 debugger, which demonstrated both the elegance and power behind the RDD 
 abstraction.  Unfortunately, the corresponding GitHub branch was not merged 
 into the master branch and development stopped 2 years ago.  For more information 
 about Arthur, please refer to [the Spark Debugger Wiki 
 page|https://github.com/mesos/spark/wiki/Spark-Debugger] in the old GitHub 
 repository.
 As a useful tool for Spark application debugging and analysis, it would be 
 nice to have a complete Spark debugger.  In 
 [PR-224|https://github.com/apache/incubator-spark/pull/224], I propose a new 
 implementation of the Spark debugger, the Spark Replay Debugger (SRD).
 [PR-224|https://github.com/apache/incubator-spark/pull/224] is only a preview 
 for discussion.  In the current version, I only implemented features that can 
 illustrate the basic mechanisms.  There are still features that appeared in Arthur 
 but are missing in SRD, such as checksum-based nondeterminism detection and 
 single-task debugging with a conventional debugger (like {{jdb}}).  However, these 
 features can easily be built upon the current SRD framework.  To minimize code 
 review effort, I intentionally didn't include them in the current version.
 Attached is the visualization of the MLlib ALS application (with 1 iteration) 
 generated by SRD.  For more information, please refer to [the SRD overview 
 document|http://spark-replay-debugger-overview.readthedocs.org/en/latest/].



--
This message was sent by Atlassian JIRA
(v6.2#6252)

