[jira] [Commented] (SPARK-1498) Spark can hang if pyspark tasks fail
[ https://issues.apache.org/jira/browse/SPARK-1498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14199968#comment-14199968 ] Kay Ousterhout commented on SPARK-1498: --- I closed this since 0.9 seems pretty ancient now.

Spark can hang if pyspark tasks fail
Key: SPARK-1498 URL: https://issues.apache.org/jira/browse/SPARK-1498 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 0.9.0, 0.9.1, 0.9.2 Reporter: Kay Ousterhout Fix For: 1.0.0

In pyspark, when some kinds of jobs fail, Spark hangs rather than returning an error. This is partially a scheduler problem -- the scheduler sometimes thinks failed tasks succeed, even though they have a stack trace and exception. You can reproduce this problem with:
{code}
ardd = sc.parallelize([(1,2,3), (4,5,6)])
brdd = sc.parallelize([(1,2,6), (4,5,9)])
ardd.join(brdd).count()
{code}
The last line will run forever (the problem in this code is that the RDD entries have 3 values instead of the expected 2). I haven't verified if this is a problem for 1.0 as well as 0.9. Thanks to Shivaram for helping diagnose this issue!
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1498) Spark can hang if pyspark tasks fail
[ https://issues.apache.org/jira/browse/SPARK-1498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout resolved SPARK-1498. --- Resolution: Fixed

Spark can hang if pyspark tasks fail
Key: SPARK-1498 URL: https://issues.apache.org/jira/browse/SPARK-1498 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 0.9.0, 0.9.1, 0.9.2 Reporter: Kay Ousterhout Fix For: 1.0.0

In pyspark, when some kinds of jobs fail, Spark hangs rather than returning an error. This is partially a scheduler problem -- the scheduler sometimes thinks failed tasks succeed, even though they have a stack trace and exception. You can reproduce this problem with:
{code}
ardd = sc.parallelize([(1,2,3), (4,5,6)])
brdd = sc.parallelize([(1,2,6), (4,5,9)])
ardd.join(brdd).count()
{code}
The last line will run forever (the problem in this code is that the RDD entries have 3 values instead of the expected 2). I haven't verified if this is a problem for 1.0 as well as 0.9. Thanks to Shivaram for helping diagnose this issue!
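The failure mode above comes from join expecting (key, value) 2-tuples. A minimal pure-Python sketch of pair-RDD join semantics (not PySpark's actual implementation) shows why a 3-tuple breaks the join: unpacking each element as a pair fails, and in Spark 0.9 that worker-side exception was mistaken for success, hanging the job.

```python
from collections import defaultdict

def pair_join(a, b):
    """Inner join two lists of (key, value) pairs, like rdd.join()."""
    left = defaultdict(list)
    for k, v in a:  # raises ValueError if an element is not a 2-tuple
        left[k].append(v)
    return [(k, (v, w)) for k, w in b if k in left for v in left[k]]

# Well-formed (key, value) pairs join as expected.
good = pair_join([(1, 2), (4, 5)], [(1, 6), (4, 9)])

# 3-element tuples are not (key, value) pairs, so unpacking fails --
# the analogue of the worker-side exception the 0.9 scheduler mishandled.
try:
    pair_join([(1, 2, 3), (4, 5, 6)], [(1, 2, 6), (4, 5, 9)])
    failed = False
except ValueError:
    failed = True
```

With the fix from this issue, PySpark surfaces such an error instead of hanging.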
[jira] [Created] (SPARK-4266) Avoid $$ JavaScript for StagePages with huge numbers of tables
Kay Ousterhout created SPARK-4266: - Summary: Avoid $$ JavaScript for StagePages with huge numbers of tables Key: SPARK-4266 URL: https://issues.apache.org/jira/browse/SPARK-4266 Project: Spark Issue Type: Bug Components: Web UI Reporter: Kay Ousterhout

Some of the new JavaScript added to handle hiding metrics significantly slows the page load for stages with a lot of tasks (e.g., for a job with 10K tasks, it took over a minute for the page to finish loading in Chrome on my laptop). There are at least two issues here:
(1) The new table-striping JavaScript is much slower than the old CSS. The fancier JavaScript is only needed for the stage summary table, so we should change the task table back to using CSS so that it doesn't slow the page load for jobs with lots of tasks.
(2) The JavaScript associated with hiding metrics is expensive when jobs have lots of tasks, I think because the jQuery selectors have to traverse a much larger DOM. ID selectors are much more efficient, so we should consider switching to these, and/or avoiding this code in additional-metrics.js:
{code}
$("input:checkbox:not(:checked)").each(function() {
    var column = "table ." + $(this).attr("name");
    $(column).hide();
});
{code}
by initially hiding the data when we generate the page in the render function instead, which should be easy to do.
[jira] [Updated] (SPARK-4266) Avoid $$ JavaScript for StagePages with huge numbers of tables
[ https://issues.apache.org/jira/browse/SPARK-4266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4266: --- Priority: Critical (was: Major)

Avoid $$ JavaScript for StagePages with huge numbers of tables
Key: SPARK-4266 URL: https://issues.apache.org/jira/browse/SPARK-4266 Project: Spark Issue Type: Bug Components: Web UI Reporter: Kay Ousterhout Priority: Critical

Some of the new JavaScript added to handle hiding metrics significantly slows the page load for stages with a lot of tasks (e.g., for a job with 10K tasks, it took over a minute for the page to finish loading in Chrome on my laptop). There are at least two issues here:
(1) The new table-striping JavaScript is much slower than the old CSS. The fancier JavaScript is only needed for the stage summary table, so we should change the task table back to using CSS so that it doesn't slow the page load for jobs with lots of tasks.
(2) The JavaScript associated with hiding metrics is expensive when jobs have lots of tasks, I think because the jQuery selectors have to traverse a much larger DOM. ID selectors are much more efficient, so we should consider switching to these, and/or avoiding this code in additional-metrics.js:
{code}
$("input:checkbox:not(:checked)").each(function() {
    var column = "table ." + $(this).attr("name");
    $(column).hide();
});
{code}
by initially hiding the data when we generate the page in the render function instead, which should be easy to do.
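Point (2) above proposes hiding hidden-by-default metric columns at render time rather than with a jQuery pass after page load. A hypothetical server-side sketch (names are illustrative, not Spark's actual render code) of what that looks like:

```python
# Emit hidden-by-default metric cells with an inline style at render time,
# so the browser never needs to traverse the DOM to hide them afterward.
def render_cell(value, metric, hidden_metrics):
    style = ' style="display:none"' if metric in hidden_metrics else ""
    return f'<td class="{metric}"{style}>{value}</td>'

# One task row with two metric columns; "gc_time" is hidden by default.
row = "".join(
    render_cell(v, m, hidden_metrics={"gc_time"})
    for m, v in [("duration", "12 s"), ("gc_time", "1 s")]
)
```

The checkbox handler then only needs to toggle visibility for the (few) columns the user actually clicks, instead of scanning every unchecked column on load.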
[jira] [Resolved] (SPARK-4255) Table striping is incorrect on page load
[ https://issues.apache.org/jira/browse/SPARK-4255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout resolved SPARK-4255. --- Resolution: Fixed Fix Version/s: 1.2.0 Fixed by https://github.com/apache/spark/commit/2c84178b8283269512b1c968b9995a7bdedd7aa5 Table striping is incorrect on page load Key: SPARK-4255 URL: https://issues.apache.org/jira/browse/SPARK-4255 Project: Spark Issue Type: Bug Components: Web UI Reporter: Kay Ousterhout Assignee: Kay Ousterhout Priority: Minor Fix For: 1.2.0 Currently, table striping (where every other row is grey, to aid readability) happens before table rows get hidden (because some metrics are hidden by default). If an odd number of contiguous table rows are hidden on page load, this means adjacent rows can both end up the same color. We should fix the table striping to happen after row hiding.
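The striping-after-hiding fix can be sketched in a few lines of pure Python (illustrative only, not the UI's actual code): compute alternating stripe classes over the *visible* rows, so a hidden row never advances the stripe counter.

```python
def stripe(rows):
    """rows: list of (name, hidden) pairs -> list of (name, css_class).

    Alternates "even"/"odd" classes across visible rows only, so two
    adjacent visible rows can never end up the same color.
    """
    out, visible_idx = [], 0
    for name, hidden in rows:
        if hidden:
            out.append((name, "hidden"))
        else:
            out.append((name, "odd" if visible_idx % 2 else "even"))
            visible_idx += 1
    return out

result = stripe([("r1", False), ("r2", True), ("r3", False)])
```

Striping before hiding would have given r1 and r3 the classes "even" and "even" once r2 disappeared; counting visible rows gives "even" and "odd".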
[jira] [Resolved] (SPARK-4186) Support binaryFiles and binaryRecords API in Python
[ https://issues.apache.org/jira/browse/SPARK-4186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-4186. -- Resolution: Fixed Fix Version/s: 1.2.0 Support binaryFiles and binaryRecords API in Python --- Key: SPARK-4186 URL: https://issues.apache.org/jira/browse/SPARK-4186 Project: Spark Issue Type: New Feature Components: PySpark, Spark Core Reporter: Matei Zaharia Assignee: Davies Liu Fix For: 1.2.0 After SPARK-2759, we should expose these methods in Python. Shouldn't be too hard to add.
[jira] [Updated] (SPARK-4267) Failing to launch jobs on Spark on YARN with Hadoop 2.5.0 or later
[ https://issues.apache.org/jira/browse/SPARK-4267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated SPARK-4267: -- Description: Currently we're trying Spark on YARN included in Hadoop 2.5.1. Hadoop 2.5 uses protobuf 2.5.0, so I compiled with protobuf 2.5.0 like this:
{code}
./make-distribution.sh --name spark-1.1.1 --tgz -Pyarn -Dhadoop.version=2.5.1 -Dprotobuf.version=2.5.0
{code}
Then Spark on YARN fails to launch jobs with an NPE:
{code}
$ bin/spark-shell --master yarn-client
scala> sc.textFile("hdfs:///user/ozawa/wordcountInput20G").flatMap(line => line.split(" ")).map(word => (word, 1)).persist().reduceByKey((a, b) => a + b, 16).saveAsTextFile("hdfs:///user/ozawa/sparkWordcountOutNew2");
java.lang.NullPointerException
  at org.apache.spark.SparkContext.defaultParallelism(SparkContext.scala:1284)
  at org.apache.spark.SparkContext.defaultMinPartitions(SparkContext.scala:1291)
  at org.apache.spark.SparkContext.textFile$default$2(SparkContext.scala:480)
  at $iwC$$iwC$$iwC$$iwC.<init>(<console>:13)
  at $iwC$$iwC$$iwC.<init>(<console>:18)
  at $iwC$$iwC.<init>(<console>:20)
  at $iwC.<init>(<console>:22)
  at <init>(<console>:24)
  at .<init>(<console>:28)
  at .<clinit>(<console>)
  at .<init>(<console>:7)
  at .<clinit>(<console>)
  at $print(<console>)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:606)
  at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789)
  at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062)
  at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615)
  at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646)
  at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:610)
  at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:823)
  at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:868)
  at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:780)
{code}
[jira] [Created] (SPARK-4267) Failing to launch jobs on Spark on YARN with Hadoop 2.5.0 or later
Tsuyoshi OZAWA created SPARK-4267: - Summary: Failing to launch jobs on Spark on YARN with Hadoop 2.5.0 or later Key: SPARK-4267 URL: https://issues.apache.org/jira/browse/SPARK-4267 Project: Spark Issue Type: Bug Reporter: Tsuyoshi OZAWA

Currently we're trying Spark on YARN included in Hadoop 2.5.1. Hadoop 2.5 uses protobuf 2.5.0, so I compiled with protobuf 2.5.0 like this:
{code}
./make-distribution.sh --name spark-1.1.1 --tgz -Pyarn -Dhadoop.version=2.5.1 -Dprotobuf.version=2.5.0
{code}
Then Spark on YARN fails to launch jobs with an NPE:
{code}
$ bin/spark-shell --master yarn-client
scala> sc.textFile("hdfs:///user/ozawa/wordcountInput20G").flatMap(line => line.split(" ")).map(word => (word, 1)).persist().reduceByKey((a, b) => a + b, 16).saveAsTextFile("hdfs:///user/ozawa/sparkWordcountOutNew2");
java.lang.NullPointerException
  at org.apache.spark.SparkContext.defaultParallelism(SparkContext.scala:1284)
  at org.apache.spark.SparkContext.defaultMinPartitions(SparkContext.scala:1291)
  at org.apache.spark.SparkContext.textFile$default$2(SparkContext.scala:480)
  at $iwC$$iwC$$iwC$$iwC.<init>(<console>:13)
  at $iwC$$iwC$$iwC.<init>(<console>:18)
  at $iwC$$iwC.<init>(<console>:20)
  at $iwC.<init>(<console>:22)
  at <init>(<console>:24)
  at .<init>(<console>:28)
  at .<clinit>(<console>)
  at .<init>(<console>:7)
  at .<clinit>(<console>)
  at $print(<console>)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:606)
  at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789)
  at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062)
  at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615)
  at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646)
  at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:610)
  at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:823)
  at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:868)
{code}
[jira] [Updated] (SPARK-4267) Failing to launch jobs on Spark on YARN with Hadoop 2.5.0 or later
[ https://issues.apache.org/jira/browse/SPARK-4267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated SPARK-4267: -- Description: Currently we're trying Spark on YARN included in Hadoop 2.5.1. Hadoop 2.5 uses protobuf 2.5.0, so I compiled with protobuf 2.5.0 like this:
{code}
./make-distribution.sh --name spark-1.1.1 --tgz -Pyarn -Dhadoop.version=2.5.1 -Dprotobuf.version=2.5.0
{code}
Then Spark on YARN fails to launch jobs with an NPE:
{code}
$ bin/spark-shell --master yarn-client
scala> sc.textFile("hdfs:///user/ozawa/wordcountInput20G").flatMap(line => line.split(" ")).map(word => (word, 1)).persist().reduceByKey((a, b) => a + b, 16).saveAsTextFile("hdfs:///user/ozawa/sparkWordcountOutNew2");
java.lang.NullPointerException
  at org.apache.spark.SparkContext.defaultParallelism(SparkContext.scala:1284)
  at org.apache.spark.SparkContext.defaultMinPartitions(SparkContext.scala:1291)
  at org.apache.spark.SparkContext.textFile$default$2(SparkContext.scala:480)
  at $iwC$$iwC$$iwC$$iwC.<init>(<console>:13)
  at $iwC$$iwC$$iwC.<init>(<console>:18)
  at $iwC$$iwC.<init>(<console>:20)
  at $iwC.<init>(<console>:22)
  at <init>(<console>:24)
  at .<init>(<console>:28)
  at .<clinit>(<console>)
  at .<init>(<console>:7)
  at .<clinit>(<console>)
  at $print(<console>)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:606)
  at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789)
  at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062)
  at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615)
  at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646)
  at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:610)
  at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:823)
  at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:868)
  at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:780)
{code}
[jira] [Commented] (SPARK-4239) support view in HiveQL
[ https://issues.apache.org/jira/browse/SPARK-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1420#comment-1420 ] Apache Spark commented on SPARK-4239: - User 'adrian-wang' has created a pull request for this issue: https://github.com/apache/spark/pull/3131 support view in HiveQL -- Key: SPARK-4239 URL: https://issues.apache.org/jira/browse/SPARK-4239 Project: Spark Issue Type: New Feature Components: SQL Reporter: Adrian Wang Assignee: Adrian Wang
[jira] [Created] (SPARK-4268) Use #::: to get benefit from Stream in SqlLexical.allCaseVersions
Shixiong Zhu created SPARK-4268: --- Summary: Use #::: to get benefit from Stream in SqlLexical.allCaseVersions Key: SPARK-4268 URL: https://issues.apache.org/jira/browse/SPARK-4268 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.0 Reporter: Shixiong Zhu Priority: Trivial `allCaseVersions` uses `++` to concatenate two Streams. However, to keep the benefit of Stream's laziness, using `#:::` would be better.
[jira] [Commented] (SPARK-4268) Use #::: to get benefit from Stream in SqlLexical.allCaseVersions
[ https://issues.apache.org/jira/browse/SPARK-4268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1428#comment-1428 ] Apache Spark commented on SPARK-4268: - User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/3132 Use #::: to get benefit from Stream in SqlLexical.allCaseVersions - Key: SPARK-4268 URL: https://issues.apache.org/jira/browse/SPARK-4268 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.0 Reporter: Shixiong Zhu Priority: Trivial `allCaseVersions` uses `++` to concatenate two Streams. However, to keep the benefit of Stream's laziness, using `#:::` would be better.
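The eager-vs-lazy concatenation distinction behind this issue can be shown with a Python analogy (not Scala): eagerly building a concatenated list materializes everything up front, while chaining iterators produces elements only as they are consumed, which is the benefit `#:::` preserves for Scala Streams.

```python
import itertools

consumed = []

def trace(it):
    """Wrap an iterator and record each element as it is actually pulled."""
    for x in it:
        consumed.append(x)
        yield x

# Lazy concatenation: nothing is consumed until someone asks for elements.
lazy = itertools.chain(trace(iter([1, 2])), trace(iter([3, 4])))
first = next(lazy)          # only the first element is produced
consumed_after_one = list(consumed)

rest = list(lazy)           # draining the chain consumes the remainder
```

An eager `list(xs) + list(ys)` would have recorded all four elements before the first one was even used; the chained version touches only what the caller demands.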
[jira] [Created] (SPARK-4269) Make wait time in BroadcastHashJoin configurable
Jacky Li created SPARK-4269: --- Summary: Make wait time in BroadcastHashJoin configurable Key: SPARK-4269 URL: https://issues.apache.org/jira/browse/SPARK-4269 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Jacky Li Fix For: 1.2.0 In BroadcastHashJoin, a hard-coded value (5 minutes) is currently used as the time to wait for the execution and broadcast of the small table. In my opinion, it should be a configurable value, since the broadcast may exceed 5 minutes in some cases, e.g. in a busy or congested network environment.
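A minimal sketch of the proposal, in Python for illustration: read the wait time from a configuration map with the current hard-coded value as the default. The config key name here is an assumption for the example, not a setting confirmed by this issue.

```python
# Hard-coded today; becomes the *default* once the value is configurable.
DEFAULT_BROADCAST_TIMEOUT_S = 5 * 60

def broadcast_timeout(conf):
    """Return the broadcast wait time in seconds.

    Falls back to the old 5-minute constant when the (hypothetical)
    key is absent, so existing deployments keep their behavior.
    """
    return int(conf.get("spark.sql.broadcastTimeout", DEFAULT_BROADCAST_TIMEOUT_S))

default = broadcast_timeout({})
tuned = broadcast_timeout({"spark.sql.broadcastTimeout": "600"})
```

Defaulting to the previous constant is the usual way to make a hard-coded value configurable without changing behavior for anyone who does not set it.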
[jira] [Commented] (SPARK-2468) Netty-based block server / client module
[ https://issues.apache.org/jira/browse/SPARK-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14200037#comment-14200037 ] zzc commented on SPARK-2468: Hi Aaron Davidson, I set spark.shuffle.blockTransferService=netty and spark.shuffle.io.mode=nio and ran successfully on CentOS 5.8 with 12G of files, but when I set spark.shuffle.blockTransferService=netty and spark.shuffle.io.mode=epoll, there is an error:
{code}
Exception in thread main java.lang.UnsatisfiedLinkError: /tmp/libnetty-transport-native-epoll7072694982027222413.so: /lib64/libc.so.6: version `GLIBC_2.10' not found
{code}
I only find GLIBC_2.5 on CentOS 5.8 and cannot upgrade it. How can I resolve this?

Netty-based block server / client module
Key: SPARK-2468 URL: https://issues.apache.org/jira/browse/SPARK-2468 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Reporter: Reynold Xin Assignee: Reynold Xin Priority: Critical Fix For: 1.2.0

Right now shuffle send goes through the block manager. This is inefficient because it requires loading a block from disk into a kernel buffer, then into a user-space buffer, and then back into a kernel send buffer before it reaches the NIC. It makes multiple copies of the data and context-switches between kernel and user space. It also creates unnecessary buffers in the JVM that increase GC pressure. Instead, we should use FileChannel.transferTo, which handles this in kernel space with zero-copy. See http://www.ibm.com/developerworks/library/j-zerocopy/ One potential solution is to use Netty. Spark already has a Netty-based network module implemented (org.apache.spark.network.netty). However, it lacks some functionality and is turned off by default.
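The zero-copy path the issue describes (the Python equivalent of Java's FileChannel.transferTo) can be demonstrated on Linux with os.sendfile: the kernel copies file bytes straight to the socket buffer, with no round trip through a user-space buffer. This is a minimal illustration, not Spark's shuffle code.

```python
import os
import socket
import tempfile

payload = b"shuffle block bytes"

# A file standing in for an on-disk shuffle block.
with tempfile.TemporaryFile() as src:
    src.write(payload)
    src.flush()

    # A connected socket pair standing in for the network transfer.
    out_sock, in_sock = socket.socketpair()

    # Kernel-space copy: no read() into a Python/user-space buffer.
    sent = os.sendfile(out_sock.fileno(), src.fileno(), 0, len(payload))
    out_sock.close()

    received = in_sock.recv(1024)
    in_sock.close()
```

The read/write alternative would cross the kernel/user boundary twice per block and allocate JVM (or here, Python) buffers along the way, which is exactly the overhead the Netty module avoids.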
[jira] [Created] (SPARK-4270) Fix Cast from DateType to DecimalType.
Takuya Ueshin created SPARK-4270: Summary: Fix Cast from DateType to DecimalType. Key: SPARK-4270 URL: https://issues.apache.org/jira/browse/SPARK-4270 Project: Spark Issue Type: Bug Components: SQL Reporter: Takuya Ueshin {{Cast}} from {{DateType}} to {{DecimalType}} throws {{NullPointerException}}.
[jira] [Commented] (SPARK-2815) Compilation failed upon the hadoop version 2.0.0-cdh4.5.0
[ https://issues.apache.org/jira/browse/SPARK-2815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14200046#comment-14200046 ] Oliver Bye commented on SPARK-2815: --- This is still an issue for CDH4.6. I can confirm https://github.com/apache/spark/pull/151/files does fix this issue, specifically the patch to yarn/alpha/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala. I've applied the 2-line change based on v1.1.0 and made it available here:
{noformat}
git clone https://github.com/olibye/spark.git --branch SPARK-2815 --single-branch
mvn -T 1C -Pyarn-alpha -Dhadoop.version=2.0.0-cdh4.6.0 -DskipTests install
{noformat}
I've not created a pull request, as this appears to be an unsupported branch, and it would just be a repeat. Compilation failed upon the hadoop version 2.0.0-cdh4.5.0 - Key: SPARK-2815 URL: https://issues.apache.org/jira/browse/SPARK-2815 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.1.0 Reporter: pengyanhong Assignee: Guoqiang Li compile fail via SPARK_HADOOP_VERSION=2.0.0-cdh4.5.0 SPARK_YARN=true SPARK_HIVE=true sbt/sbt assembly, finally get error message : [error] (yarn-stable/compile:compile) Compilation failed, the following is the detail error on console: [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/Client.scala:26: object api is not a member of package org.apache.hadoop.yarn.client [error] import org.apache.hadoop.yarn.client.api.YarnClient [error] ^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/Client.scala:40: not found: value YarnClient [error] val yarnClient = YarnClient.createYarnClient [error]^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala:32: object api is not a member of package org.apache.hadoop.yarn.client [error] import org.apache.hadoop.yarn.client.api.AMRMClient [error] ^ [error] 
/Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala:33: object api is not a member of package org.apache.hadoop.yarn.client [error] import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest [error] ^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala:36: object util is not a member of package org.apache.hadoop.yarn.webapp [error] import org.apache.hadoop.yarn.webapp.util.WebAppUtils [error] ^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala:64: value RM_AM_MAX_ATTEMPTS is not a member of object org.apache.hadoop.yarn.conf.YarnConfiguration [error] YarnConfiguration.RM_AM_MAX_ATTEMPTS, YarnConfiguration.DEFAULT_RM_AM_MAX_ATTEMPTS) [error] ^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala:66: not found: type AMRMClient [error] private var amClient: AMRMClient[ContainerRequest] = _ [error] ^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala:92: not found: value AMRMClient [error] amClient = AMRMClient.createAMRMClient() [error]^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala:137: not found: value WebAppUtils [error] val proxy = WebAppUtils.getProxyHostAndPort(conf) [error] ^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala:40: object api is not a member of package org.apache.hadoop.yarn.client [error] import org.apache.hadoop.yarn.client.api.AMRMClient [error] ^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala:618: not found: type AMRMClient [error] amClient: AMRMClient[ContainerRequest], [error] ^ [error] 
/Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala:596: not found: type AMRMClient [error] amClient: AMRMClient[ContainerRequest], [error] ^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala:577: not found: type AMRMClient [error] amClient:
[jira] [Commented] (SPARK-4270) Fix Cast from DateType to DecimalType.
[ https://issues.apache.org/jira/browse/SPARK-4270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14200045#comment-14200045 ] Apache Spark commented on SPARK-4270: - User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/3134 Fix Cast from DateType to DecimalType. -- Key: SPARK-4270 URL: https://issues.apache.org/jira/browse/SPARK-4270 Project: Spark Issue Type: Bug Components: SQL Reporter: Takuya Ueshin {{Cast}} from {{DateType}} to {{DecimalType}} throws {{NullPointerException}}.
[jira] [Comment Edited] (SPARK-2815) Compilation failed upon the hadoop version 2.0.0-cdh4.5.0
[ https://issues.apache.org/jira/browse/SPARK-2815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14200046#comment-14200046 ] Oliver Bye edited comment on SPARK-2815 at 11/6/14 10:11 AM: - This is still an issue for CDH4.6. I can confirm https://github.com/apache/spark/pull/151/files does fix this issue, specifically the patch to yarn/alpha/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala. I've applied the 2-line change based on v1.1.0 and made it available here:
{noformat}
git clone https://github.com/olibye/spark.git --branch SPARK-2815 --single-branch
mvn -T 1C -Pyarn-alpha -Dhadoop.version=2.0.0-cdh4.6.0 -DskipTests install
{noformat}
I've not created a pull request, as this appears to be an unsupported branch, and it would just be a repeat. was (Author: olibye): This is still an issue for CDH4.6. I can confirm https://github.com/apache/spark/pull/151/files does fix this issue, specifically the patch to yarn/alpha/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala. I've applied the 2-line change based on v1.1.0 and made it available here: git clone https://github.com/olibye/spark.git --branch SPARK-2815 --single-branch mvn -T 1C -Pyarn-alpha -Dhadoop.version=2.0.0-cdh4.6.0 -DskipTests install I've not created a pull request, as this appears to be an unsupported branch, and it would just be a repeat. 
Compilation failed upon the hadoop version 2.0.0-cdh4.5.0 - Key: SPARK-2815 URL: https://issues.apache.org/jira/browse/SPARK-2815 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.1.0 Reporter: pengyanhong Assignee: Guoqiang Li compile fail via SPARK_HADOOP_VERSION=2.0.0-cdh4.5.0 SPARK_YARN=true SPARK_HIVE=true sbt/sbt assembly, finally get error message : [error] (yarn-stable/compile:compile) Compilation failed, the following is the detail error on console: [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/Client.scala:26: object api is not a member of package org.apache.hadoop.yarn.client [error] import org.apache.hadoop.yarn.client.api.YarnClient [error] ^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/Client.scala:40: not found: value YarnClient [error] val yarnClient = YarnClient.createYarnClient [error]^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala:32: object api is not a member of package org.apache.hadoop.yarn.client [error] import org.apache.hadoop.yarn.client.api.AMRMClient [error] ^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala:33: object api is not a member of package org.apache.hadoop.yarn.client [error] import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest [error] ^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala:36: object util is not a member of package org.apache.hadoop.yarn.webapp [error] import org.apache.hadoop.yarn.webapp.util.WebAppUtils [error] ^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala:64: value RM_AM_MAX_ATTEMPTS is not a member of object org.apache.hadoop.yarn.conf.YarnConfiguration [error] YarnConfiguration.RM_AM_MAX_ATTEMPTS, 
YarnConfiguration.DEFAULT_RM_AM_MAX_ATTEMPTS) [error] ^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala:66: not found: type AMRMClient [error] private var amClient: AMRMClient[ContainerRequest] = _ [error] ^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala:92: not found: value AMRMClient [error] amClient = AMRMClient.createAMRMClient() [error]^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala:137: not found: value WebAppUtils [error] val proxy = WebAppUtils.getProxyHostAndPort(conf) [error] ^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala:40: object api is not a member of package org.apache.hadoop.yarn.client [error] import org.apache.hadoop.yarn.client.api.AMRMClient [error] ^ [error]
[jira] [Updated] (SPARK-3000) drop old blocks to disk in parallel when memory is not large enough for caching new blocks
[ https://issues.apache.org/jira/browse/SPARK-3000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhang, Liye updated SPARK-3000: --- Attachment: Spark-3000 Design Doc.pdf drop old blocks to disk in parallel when memory is not large enough for caching new blocks -- Key: SPARK-3000 URL: https://issues.apache.org/jira/browse/SPARK-3000 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Zhang, Liye Assignee: Zhang, Liye Attachments: Spark-3000 Design Doc.pdf In Spark, an RDD can be cached in memory for later use, and the cached memory size is *spark.executor.memory * spark.storage.memoryFraction* for Spark versions before 1.1.0, and *spark.executor.memory * spark.storage.memoryFraction * spark.storage.safetyFraction* after [SPARK-1777|https://issues.apache.org/jira/browse/SPARK-1777]. For storage level *MEMORY_AND_DISK*, when free memory is not enough to cache new blocks, old blocks might be dropped to disk to free up memory for new blocks. This operation is processed by _ensureFreeSpace_ in _MemoryStore.scala_; an *accountingLock* is always held by the caller to ensure only one thread is dropping blocks. This method cannot fully use the disks' throughput when there are multiple disks on the worker node. When testing our workload, we found this is a real bottleneck when the size of the old blocks to be dropped is large. We have tested the parallel method on Spark 1.0, and the speedup is significant. So it's necessary to make the block-dropping operation parallel.
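A toy sketch of the proposal (the block/disk model is illustrative, not Spark's MemoryStore): instead of dropping evicted blocks serially under one lock, hand them to a thread pool so writes to different disks can proceed concurrently.

```python
from concurrent.futures import ThreadPoolExecutor

def drop_blocks_parallel(blocks, disks, max_workers=4):
    """Assign each block to a disk round-robin and 'write' them concurrently.

    Serial dropping under a single accountingLock can drive at most one
    disk at a time; a pool lets writes to different disks overlap.
    """
    placed = {}

    def drop(i, block):
        disk = disks[i % len(disks)]
        placed[block] = disk  # stands in for the actual disk write
        return block

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() blocks until every drop has completed, like waiting for
        # all evictions to finish before admitting the new block.
        list(pool.map(drop, range(len(blocks)), blocks))
    return placed

placed = drop_blocks_parallel(["b0", "b1", "b2"], ["/disk1", "/disk2"])
```

The real change also has to keep the memory-accounting bookkeeping consistent while the writes run in parallel, which is what the attached design doc addresses.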
[jira] [Commented] (SPARK-2468) Netty-based block server / client module
[ https://issues.apache.org/jira/browse/SPARK-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14200096#comment-14200096 ] Lianhui Wang commented on SPARK-2468: - [~adav] I used your branch and the memory overhead on YARN still exists. [~zzcclp] How about your test? Netty-based block server / client module Key: SPARK-2468 URL: https://issues.apache.org/jira/browse/SPARK-2468 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Reporter: Reynold Xin Assignee: Reynold Xin Priority: Critical Fix For: 1.2.0 Right now shuffle send goes through the block manager. This is inefficient because it requires loading a block from disk into a kernel buffer, then into a user-space buffer, and then back into a kernel send buffer before it reaches the NIC. It makes multiple copies of the data and context-switches between kernel and user space. It also creates unnecessary buffers in the JVM that increase GC pressure. Instead, we should use FileChannel.transferTo, which handles this in kernel space with zero-copy. See http://www.ibm.com/developerworks/library/j-zerocopy/ One potential solution is to use Netty. Spark already has a Netty-based network module implemented (org.apache.spark.network.netty). However, it lacks some functionality and is turned off by default.
[jira] [Created] (SPARK-4271) jetty Server can't tryport+1
honestold3 created SPARK-4271: - Summary: jetty Server can't tryport+1 Key: SPARK-4271 URL: https://issues.apache.org/jira/browse/SPARK-4271 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Environment: operating system language is Chinese. Reporter: honestold3 If the operating system language is not English, org.apache.spark.util.Utils.isBindCollision cannot match the BindException message, so the Jetty server can't try port+1.
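The report boils down to matching on a localized exception message, which is fragile; checking the error code is locale-independent. A minimal hedged sketch (the helper `is_bind_collision` is an illustration, not the Spark method):

```python
import errno
import socket

def is_bind_collision(exc):
    # Locale-independent: inspect the error code, not the (possibly
    # translated) message text that broke the Chinese-locale case above.
    return isinstance(exc, OSError) and exc.errno == errno.EADDRINUSE

s1 = socket.socket()
s1.bind(("127.0.0.1", 0))          # grab an ephemeral port
port = s1.getsockname()[1]
s2 = socket.socket()
try:
    s2.bind(("127.0.0.1", port))   # second bind on the same port collides
    collided = False
except OSError as e:
    collided = is_bind_collision(e)
finally:
    s1.close()
    s2.close()
```

A caller that detects the collision this way can then retry on port+1 regardless of the OS language.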
[jira] [Commented] (SPARK-4271) jetty Server can't tryport+1
[ https://issues.apache.org/jira/browse/SPARK-4271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200128#comment-14200128 ] Apache Spark commented on SPARK-4271: - User 'honestold3' has created a pull request for this issue: https://github.com/apache/spark/pull/3135 jetty Server can't tryport+1 Key: SPARK-4271 URL: https://issues.apache.org/jira/browse/SPARK-4271 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Environment: operating system language is Chinese. Reporter: honestold3 If the operating system language is not English, org.apache.spark.util.Utils.isBindCollision cannot match the BindException message, so the Jetty server can't try port+1.
[jira] [Resolved] (SPARK-4271) jetty Server can't tryport+1
[ https://issues.apache.org/jira/browse/SPARK-4271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-4271. -- Resolution: Duplicate Well, here's a good example. I'm sure this is a duplicate of https://issues.apache.org/jira/browse/SPARK-4169, which is solved better in its PR, and that PR is still awaiting review/commit. jetty Server can't tryport+1 Key: SPARK-4271 URL: https://issues.apache.org/jira/browse/SPARK-4271 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Environment: operating system language is Chinese. Reporter: honestold3 If the operating system language is not English, org.apache.spark.util.Utils.isBindCollision cannot match the BindException message, so the Jetty server can't try port+1.
[jira] [Created] (SPARK-4272) Add more unwrap functions for primitive type in TableReader
Cheng Hao created SPARK-4272: Summary: Add more unwrap functions for primitive type in TableReader Key: SPARK-4272 URL: https://issues.apache.org/jira/browse/SPARK-4272 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Hao Priority: Minor Currently, data unwrapping supports only a couple of the primitive types, not all of them. This does not cause an exception, but it costs some performance in table scanning for types like binary, date, timestamp, decimal, etc.
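The idea can be sketched as a dispatch table: types with a registered fast-path unwrapper are converted directly, while anything missing from the table (the situation described in the ticket) falls through to a generic, slower path. This is a pure-Python illustration; the table contents are assumptions, not Hive's ObjectInspector API:

```python
import datetime
import decimal

# Hypothetical fast-path unwrappers keyed by runtime type. Types missing
# from the table fall back to a generic, slower conversion.
FAST_UNWRAP = {
    int: lambda v: v,
    float: lambda v: v,
    bytes: lambda v: v,                    # binary
    datetime.date: lambda v: v,
    decimal.Decimal: lambda v: float(v),
}

def unwrap(value):
    fast = FAST_UNWRAP.get(type(value))
    if fast is not None:
        return fast(value)
    return str(value)                      # generic slow path
```

Adding more entries to the table is the moral equivalent of the PR: it widens the fast path without changing behavior for already-supported types.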
[jira] [Commented] (SPARK-4272) Add more unwrap functions for primitive type in TableReader
[ https://issues.apache.org/jira/browse/SPARK-4272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200160#comment-14200160 ] Apache Spark commented on SPARK-4272: - User 'chenghao-intel' has created a pull request for this issue: https://github.com/apache/spark/pull/3136 Add more unwrap functions for primitive type in TableReader --- Key: SPARK-4272 URL: https://issues.apache.org/jira/browse/SPARK-4272 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Hao Priority: Minor Currently, the data unwrap only support couple of primitive types, not all, it will not cause exception, but may get some performance in table scanning for the type like binary, date, timestamp, decimal etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4273) Providing ExternalSet to avoid OOM when count(distinct)
YanTang Zhai created SPARK-4273: --- Summary: Providing ExternalSet to avoid OOM when count(distinct) Key: SPARK-4273 URL: https://issues.apache.org/jira/browse/SPARK-4273 Project: Spark Issue Type: Improvement Components: Spark Core, SQL Reporter: YanTang Zhai Priority: Minor Some tasks may OOM during count(distinct) if they need to process many records. CombineSetsAndCountFunction puts all records into an OpenHashSet; if it fetches many records, it may occupy a large amount of memory. I think a data structure ExternalSet, like ExternalAppendOnlyMap, could be provided to store OpenHashSet data on disk when its capacity exceeds some threshold. For example, OpenHashSet1 (ohs1) has [d, b, c, a]. It is spilled to file1 sorted by hashCode, so file1 contains [a, b, c, d]. The procedure could be indicated as follows: ohs1 [d, b, c, a] -> [a, b, c, d] -> file1; ohs2 [e, f, g, a] -> [a, e, f, g] -> file2; ohs3 [e, h, i, g] -> [e, g, h, i] -> file3; ohs4 [j, h, a] -> [a, h, j] -> sortedSet. On output, all keys with the same hashCode are put into one OpenHashSet, and then the iterator of that OpenHashSet is accessed. The procedure could be indicated as follows: file1 -> a -> ohsA; file2 -> a -> ohsA; sortedSet -> a -> ohsA; ohsA -> a; file1 -> b -> ohsB; ohsB -> b; file1 -> c -> ohsC; ohsC -> c; file1 -> d -> ohsD; ohsD -> d; file2 -> e -> ohsE; file3 -> e -> ohsE; ohsE -> e; ... I think using the ExternalSet could avoid OOM in count(distinct). Comments welcome.
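The spill-and-merge procedure described above can be sketched in pure Python: when the in-memory set exceeds a threshold it is spilled into hash-partitioned buckets (standing in for the sorted spill files), and the final count deduplicates each bucket separately, so the full distinct set is never held at once. Names and thresholds are illustrative only:

```python
def external_distinct_count(records, num_partitions=4, memory_limit=3):
    """Count distinct records without keeping them all in one set.

    `spills[p]` stands in for an on-disk spill file; hash partitioning
    guarantees every copy of a value lands in the same partition, so
    partitions can be deduplicated independently of one another.
    """
    spills = [[] for _ in range(num_partitions)]
    current = set()

    def spill():
        for r in current:
            spills[hash(r) % num_partitions].append(r)
        current.clear()

    for r in records:
        current.add(r)
        if len(current) >= memory_limit:   # "capacity exceeds some threshold"
            spill()
    spill()                                # flush the remainder
    return sum(len(set(partition)) for partition in spills)
```

Peak memory is bounded by the threshold plus one partition's distinct values, which is the property that avoids the OOM.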
[jira] [Commented] (SPARK-4273) Providing ExternalSet to avoid OOM when count(distinct)
[ https://issues.apache.org/jira/browse/SPARK-4273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200172#comment-14200172 ] Apache Spark commented on SPARK-4273: - User 'YanTangZhai' has created a pull request for this issue: https://github.com/apache/spark/pull/3137 Providing ExternalSet to avoid OOM when count(distinct) --- Key: SPARK-4273 URL: https://issues.apache.org/jira/browse/SPARK-4273 Project: Spark Issue Type: Improvement Components: Spark Core, SQL Reporter: YanTang Zhai Priority: Minor Some task may OOM when count(distinct) if it needs to process many records. CombineSetsAndCountFunction puts all records into an OpenHashSet, if it fetchs many records, it may occupy large memory. I think a data structure ExternalSet like ExternalAppendOnlyMap could be provided to store OpenHashSet data in disks when it's capacity exceeds some threshold. For example, OpenHashSet1(ohs1) has [d, b, c, a]. It is spilled to file1 with hashCode sorted, then the file1 contains [a, b, c, d]. The procedure could be indicated as follows: ohs1 [d, b, c, a] = [a, b, c, d] = file1 ohs2 [e, f, g, a] = [a, e, f, g] = file2 ohs3 [e, h, i, g] = [e, g, h, i] = file3 ohs4 [j, h, a] = [a, h, j] = sortedSet When output, all keys with the same hashCode will be put into a OpenHashSet, then the iterator of this OpenHashSet is accessing. The procedure could be indicated as follows: file1- a - ohsA; file2 - a - ohsA; sortedSet - a - ohsA; ohsA - a; file1 - b - ohsB; ohsB - b; file1 - c - ohsC; ohsC - c; file1 - d - ohsD; ohsD - d; file2 - e - ohsE; file3 - e - ohsE; ohsE- e; ... I think using the ExternalSet could avoid OOM when count(distinct). Welcomes comments. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4249) A problem of EdgePartitionBuilder in Graphx
[ https://issues.apache.org/jira/browse/SPARK-4249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200188#comment-14200188 ] Apache Spark commented on SPARK-4249: - User 'lianhuiwang' has created a pull request for this issue: https://github.com/apache/spark/pull/3138 A problem of EdgePartitionBuilder in Graphx --- Key: SPARK-4249 URL: https://issues.apache.org/jira/browse/SPARK-4249 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.1.0 Reporter: Cookies Priority: Minor https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/impl/EdgePartitionBuilder.scala#L48 In the function toEdgePartition, the snippet index.update(srcIds(0), 0) runs while the elements of the array srcIds are still uninitialized and all 0. The effect is that if the vertex ids don't contain 0, indexSize equals (realIndexSize + 1) and a spurious 0 is added to the `index`. It seems that all versions have this problem.
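The off-by-one can be reproduced in miniature: in the buggy version the index is seeded from a pre-allocated, still-zeroed srcIds array, so the key 0 sneaks in whenever the real vertex ids don't contain it. This is a hedged pure-Python model of the Scala code, not the code itself:

```python
def build_index_buggy(edges):
    # src_ids is pre-allocated; reading src_ids[0] before it is filled
    # yields the default value 0, mirroring index.update(srcIds(0), 0).
    src_ids = [0] * len(edges)
    index = {src_ids[0]: 0}               # spuriously seeds key 0
    for i, (src, _dst) in enumerate(edges):
        src_ids[i] = src
        if src not in index:
            index[src] = i
    return index

def build_index_fixed(edges):
    # Seed the index only from values that have actually been written.
    index = {}
    for i, (src, _dst) in enumerate(edges):
        index.setdefault(src, i)
    return index
```

For edges whose source ids are {5, 7}, the buggy index ends up with three keys (0, 5, 7) instead of two, which is exactly the indexSize == realIndexSize + 1 effect described above.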
[jira] [Created] (SPARK-4274) Hive comparison test framework doesn't print effective information while logical plan analyzing failed
Cheng Hao created SPARK-4274: Summary: Hive comparison test framework doesn't print effective information while logical plan analyzing failed Key: SPARK-4274 URL: https://issues.apache.org/jira/browse/SPARK-4274 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Priority: Minor The Hive comparison test framework doesn't print an informative message if the unit test fails during logical plan analysis.
[jira] [Commented] (SPARK-4274) Hive comparison test framework doesn't print effective information while logical plan analyzing failed
[ https://issues.apache.org/jira/browse/SPARK-4274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200228#comment-14200228 ] Apache Spark commented on SPARK-4274: - User 'chenghao-intel' has created a pull request for this issue: https://github.com/apache/spark/pull/3139 Hive comparison test framework doesn't print effective information while logical plan analyzing failed --- Key: SPARK-4274 URL: https://issues.apache.org/jira/browse/SPARK-4274 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Priority: Minor The Hive comparison test framework doesn't print an informative message if the unit test fails during logical plan analysis.
[jira] [Commented] (SPARK-3913) Spark Yarn Client API change to expose Yarn Resource Capacity, Yarn Application Listener and killApplication() API
[ https://issues.apache.org/jira/browse/SPARK-3913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200270#comment-14200270 ] Thomas Graves commented on SPARK-3913: -- Note that for item 3 you should be able to use the yarn cli command to kill the application: 'yarn kill -applicationId appid'. Spark Yarn Client API change to expose Yarn Resource Capacity, Yarn Application Listener and killApplication() API -- Key: SPARK-3913 URL: https://issues.apache.org/jira/browse/SPARK-3913 Project: Spark Issue Type: Improvement Components: YARN Reporter: Chester When working with Spark in YARN deployment mode, we have three issues: 1) We don't know the YARN maximum capacity (memory and cores) before we specify the number of executors and the memory for the Spark driver and executors. If we set a big number, the job can potentially exceed the limit and get killed. It would be better to let the application know the YARN resource capacity ahead of time so the Spark config can be adjusted dynamically. 2) Once the job has started, we would like some feedback from the YARN application. Currently, the Spark client basically blocks the call and returns when the job is finished, failed, or killed. If the job runs for a few hours, we have no idea how far it has gone, the progress and resource usage, the tracking URL, etc. 3) Once the job is started, you basically can't stop it. The YARN client API stop doesn't work in most cases in our experience. But the YARN API that does work is killApplication(appId). So we need to expose this killApplication() API to the Spark YARN client as well. I will create one pull request and try to address these problems.
[jira] [Commented] (SPARK-4238) Perform network-level retry of shuffle file fetches
[ https://issues.apache.org/jira/browse/SPARK-4238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200334#comment-14200334 ] Nan Zhu commented on SPARK-4238: Is it related to https://issues.apache.org/jira/browse/SPARK-4188? Perform network-level retry of shuffle file fetches --- Key: SPARK-4238 URL: https://issues.apache.org/jira/browse/SPARK-4238 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Aaron Davidson Assignee: Aaron Davidson Priority: Critical During periods of high network (or GC) load, it is not uncommon for IOExceptions to crop up around connection failures when fetching shuffle files. Unfortunately, when such a failure occurs, it is interpreted as an inability to fetch the files, which causes us to mark the executor as lost and recompute all of its shuffle outputs. We should allow retrying at the network level in the event of an IOException in order to avoid this circumstance.
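The proposed behavior can be sketched as a small retry wrapper: transient IOExceptions are retried at the fetch level a bounded number of times before being surfaced, instead of immediately treating the executor as lost. Illustrative only; the names and retry policy are assumptions, not Spark's shuffle client:

```python
import time

def fetch_with_retry(fetch, max_retries=3, backoff_s=0.0):
    """Call `fetch`, retrying transient IOErrors up to `max_retries` times."""
    for attempt in range(max_retries + 1):
        try:
            return fetch()
        except IOError:
            if attempt == max_retries:
                raise                      # exhausted: surface the failure
            time.sleep(backoff_s)          # simple fixed backoff

# Simulated flaky network fetch: fails twice, then succeeds.
calls = {"n": 0}

def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("connection reset during shuffle fetch")
    return b"shuffle-block"
```

With the wrapper, the two transient failures never escalate; only a fetch that keeps failing past the retry budget is reported upward.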
[jira] [Commented] (SPARK-2815) Compilation failed upon the hadoop version 2.0.0-cdh4.5.0
[ https://issues.apache.org/jira/browse/SPARK-2815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200356#comment-14200356 ] Apache Spark commented on SPARK-2815: - User 'pengyanhong' has created a pull request for this issue: https://github.com/apache/spark/pull/3140 Compilation failed upon the hadoop version 2.0.0-cdh4.5.0 - Key: SPARK-2815 URL: https://issues.apache.org/jira/browse/SPARK-2815 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.1.0 Reporter: pengyanhong Assignee: Guoqiang Li compile fail via SPARK_HADOOP_VERSION=2.0.0-cdh4.5.0 SPARK_YARN=true SPARK_HIVE=true sbt/sbt assembly, finally get error message : [error] (yarn-stable/compile:compile) Compilation failed, the following is the detail error on console: [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/Client.scala:26: object api is not a member of package org.apache.hadoop.yarn.client [error] import org.apache.hadoop.yarn.client.api.YarnClient [error] ^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/Client.scala:40: not found: value YarnClient [error] val yarnClient = YarnClient.createYarnClient [error]^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala:32: object api is not a member of package org.apache.hadoop.yarn.client [error] import org.apache.hadoop.yarn.client.api.AMRMClient [error] ^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala:33: object api is not a member of package org.apache.hadoop.yarn.client [error] import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest [error] ^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala:36: object util is not a member of package org.apache.hadoop.yarn.webapp [error] import 
org.apache.hadoop.yarn.webapp.util.WebAppUtils [error] ^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala:64: value RM_AM_MAX_ATTEMPTS is not a member of object org.apache.hadoop.yarn.conf.YarnConfiguration [error] YarnConfiguration.RM_AM_MAX_ATTEMPTS, YarnConfiguration.DEFAULT_RM_AM_MAX_ATTEMPTS) [error] ^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala:66: not found: type AMRMClient [error] private var amClient: AMRMClient[ContainerRequest] = _ [error] ^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala:92: not found: value AMRMClient [error] amClient = AMRMClient.createAMRMClient() [error]^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala:137: not found: value WebAppUtils [error] val proxy = WebAppUtils.getProxyHostAndPort(conf) [error] ^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala:40: object api is not a member of package org.apache.hadoop.yarn.client [error] import org.apache.hadoop.yarn.client.api.AMRMClient [error] ^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala:618: not found: type AMRMClient [error] amClient: AMRMClient[ContainerRequest], [error] ^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala:596: not found: type AMRMClient [error] amClient: AMRMClient[ContainerRequest], [error] ^ [error] /Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala:577: not found: type AMRMClient [error] amClient: AMRMClient[ContainerRequest], [error] ^ [error] 
/Users/pengyanhong/git/spark/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala:410: value CONTAINER_ID is not a member of object org.apache.hadoop.yarn.api.ApplicationConstants.Environment [error] val containerIdString = System.getenv(ApplicationConstants.Environment.CONTAINER_ID.name()) [error] ^
[jira] [Commented] (SPARK-4231) Add RankingMetrics to examples.MovieLensALS
[ https://issues.apache.org/jira/browse/SPARK-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200373#comment-14200373 ] Debasish Das commented on SPARK-4231: - [~coderxiang] [~mengxr] [~srowen] I looked at Sean's implementation of the MAP metric for recommendation engines, and it computes the rank of the predicted test set over all user/product predictions... I don't see how I can send the rank vector to the RankingMetrics API right now: http://cloudera.github.io/oryx/xref/com/cloudera/oryx/als/computation/local/ComputeMAP.html Right now, for each user/product I send predictions and labels from the test set to the RankingMetrics API, but there is no rank order defined... I retrieved the predictions using the ALS.predict(userId, productId) API... Add RankingMetrics to examples.MovieLensALS --- Key: SPARK-4231 URL: https://issues.apache.org/jira/browse/SPARK-4231 Project: Spark Issue Type: Improvement Components: Examples Affects Versions: 1.2.0 Reporter: Debasish Das Fix For: 1.2.0 Original Estimate: 24h Remaining Estimate: 24h examples.MovieLensALS computes RMSE for movielens dataset but after addition of RankingMetrics and enhancements to ALS, it is critical to look at not only the RMSE but also measures like prec@k and MAP. In this JIRA we added RMSE and MAP computation for examples.MovieLensALS and also added a flag that takes an input whether user/product recommendation is being validated.
[jira] [Commented] (SPARK-4231) Add RankingMetrics to examples.MovieLensALS
[ https://issues.apache.org/jira/browse/SPARK-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200396#comment-14200396 ] Sean Owen commented on SPARK-4231: -- So this method basically computes where each test item would rank if you asked for a list of recommendations that ranks every single item. It's not necessarily efficient, but is simple. The reason I did it that way was to avoid recreating a lot of the recommender ranking logic. I don't think one has to define MAP this way -- I effectively averaged over all k to the # of items. Yes I found the straightforward definition hard to implement at scale. I ended up opting to compute an approximation of AUC for recommender eval in this next version I'm working on: https://github.com/OryxProject/oryx/blob/master/oryx-ml-mllib/src/main/java/com/cloudera/oryx/ml/mllib/als/AUC.java#L106 Sorry for the hard-to-read Java 7; going to redo this in Java 8 soon. Basically you're just sampling random relevant/not-relevant pairs and comparing their scores. You might consider that. I dunno if it's worth bothering with a toy implementation in the examples. The example is already just to show Spark really not ALS. Add RankingMetrics to examples.MovieLensALS --- Key: SPARK-4231 URL: https://issues.apache.org/jira/browse/SPARK-4231 Project: Spark Issue Type: Improvement Components: Examples Affects Versions: 1.2.0 Reporter: Debasish Das Fix For: 1.2.0 Original Estimate: 24h Remaining Estimate: 24h examples.MovieLensALS computes RMSE for movielens dataset but after addition of RankingMetrics and enhancements to ALS, it is critical to look at not only the RMSE but also measures like prec@k and MAP. In this JIRA we added RMSE and MAP computation for examples.MovieLensALS and also added a flag that takes an input whether user/product recommendation is being validated. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4275) ./sbt/sbt assembly command fails if path has space in the name
Ravi Kiran created SPARK-4275: - Summary: ./sbt/sbt assembly command fails if path has space in the name Key: SPARK-4275 URL: https://issues.apache.org/jira/browse/SPARK-4275 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.1.0 Reporter: Ravi Kiran Priority: Trivial I downloaded branch-1.1 to build Spark from scratch on my Mac. The path had a space in it, like /Users/rkgurram/VirtualBox VMs/SPARK/spark-branch-1.1. 1) I cd to the above directory. 2) Ran ./sbt/sbt assembly. The command fails with weird messages.
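The failure mode here is classic shell word-splitting: an unquoted variable expansion inside a launcher script breaks a path containing a space into two arguments. A small hedged Python illustration of the quoting difference (the path mirrors the one in the report):

```python
import shlex

path = "/Users/rkgurram/VirtualBox VMs/SPARK/spark-branch-1.1"

# What an unquoted $path expansion effectively does: word-splitting
# turns one path into two bogus arguments.
naive_tokens = path.split()

# What a properly quoted "$path" preserves: a single argument.
quoted_tokens = shlex.split(shlex.quote(path))
```

The usual fix in scripts like sbt/sbt is to quote every path expansion ("$FWDIR" style) so the space survives as part of one argument.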
[jira] [Commented] (SPARK-4231) Add RankingMetrics to examples.MovieLensALS
[ https://issues.apache.org/jira/browse/SPARK-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200426#comment-14200426 ] Debasish Das commented on SPARK-4231: - [~srowen] I need a standard metric to report the ALS enhancements proposed in this PR: https://github.com/apache/spark/pull/2705 and more work that we are doing in this direction... The metric should be consistent for topic modeling using LSA as well, since the PR can solve LSA/PLSA... I have not yet gone into measures like perplexity from the LDA PRs, as they are even more complicated, but MAP, prec@k and ndcg@k are the measures people report LDA results with as well... Does the MAP definition I used in this PR look correct to you? Let me look into the AUC example... Add RankingMetrics to examples.MovieLensALS --- Key: SPARK-4231 URL: https://issues.apache.org/jira/browse/SPARK-4231 Project: Spark Issue Type: Improvement Components: Examples Affects Versions: 1.2.0 Reporter: Debasish Das Fix For: 1.2.0 Original Estimate: 24h Remaining Estimate: 24h examples.MovieLensALS computes RMSE for movielens dataset but after addition of RankingMetrics and enhancements to ALS, it is critical to look at not only the RMSE but also measures like prec@k and MAP. In this JIRA we added RMSE and MAP computation for examples.MovieLensALS and also added a flag that takes an input whether user/product recommendation is being validated.
[jira] [Commented] (SPARK-4231) Add RankingMetrics to examples.MovieLensALS
[ https://issues.apache.org/jira/browse/SPARK-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200470#comment-14200470 ] Sean Owen commented on SPARK-4231: -- Yes I'm mostly questioning implementing this in examples. The definition in RankingMetrics looks like the usual one to me -- average from 1 to min(# recs, # relevant items). You could say the version you found above is 'extended' to look into the long tail (# recs = # items), although the long tail doesn't affect MAP much. Same definition, different limit. precision@k does not have the same question since there is one k value, not lots. AUC may not help you if you're comparing to other things for which you don't have AUC. It was a side comment mostly. (Anyway there is already an AUC implementation here which I am trying to see if I can use.) Add RankingMetrics to examples.MovieLensALS --- Key: SPARK-4231 URL: https://issues.apache.org/jira/browse/SPARK-4231 Project: Spark Issue Type: Improvement Components: Examples Affects Versions: 1.2.0 Reporter: Debasish Das Fix For: 1.2.0 Original Estimate: 24h Remaining Estimate: 24h examples.MovieLensALS computes RMSE for movielens dataset but after addition of RankingMetrics and enhancements to ALS, it is critical to look at not only the RMSE but also measures like prec@k and MAP. In this JIRA we added RMSE and MAP computation for examples.MovieLensALS and also added a flag that takes an input whether user/product recommendation is being validated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
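The two MAP variants being compared in this thread differ only in the cut-off: the usual definition averages precision at each hit up to min(#recommendations, #relevant), while the "extended" version ranks every single item. A hedged pure-Python sketch of the usual definition (not the MLlib implementation):

```python
def average_precision(recommended, relevant):
    """Average precision over a ranked recommendation list,
    normalized by min(#relevant, #recommended)."""
    relevant = set(relevant)
    hits, score = 0, 0.0
    for i, item in enumerate(recommended):
        if item in relevant:
            hits += 1
            score += hits / (i + 1)        # precision@(i+1) at each hit
    denom = min(len(relevant), len(recommended))
    return score / denom if denom else 0.0

def mean_average_precision(recommendations, labels):
    pairs = list(zip(recommendations, labels))
    return sum(average_precision(r, t) for r, t in pairs) / len(pairs)
```

Extending the recommendation list toward the full item catalog only adds low-precision tail terms, which is why, as noted above, the long tail barely moves MAP.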
[jira] [Commented] (SPARK-4216) Eliminate duplicate Jenkins GitHub posts from AMPLab
[ https://issues.apache.org/jira/browse/SPARK-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200491#comment-14200491 ] shane knapp commented on SPARK-4216: Yes, the sparkQA OAuth token was set inside the top-level ghprb settings. The side effect of this was that projects like tachyon had sparkQA posting on their projects and not amplab jenkins. Eliminate duplicate Jenkins GitHub posts from AMPLab Key: SPARK-4216 URL: https://issues.apache.org/jira/browse/SPARK-4216 Project: Spark Issue Type: Bug Components: Build, Project Infra Reporter: Nicholas Chammas Priority: Minor * [Real Jenkins | https://github.com/apache/spark/pull/2988#issuecomment-60873361] * [Imposter Jenkins | https://github.com/apache/spark/pull/2988#issuecomment-60873366]
[jira] [Resolved] (SPARK-644) Jobs canceled due to repeated executor failures may hang
[ https://issues.apache.org/jira/browse/SPARK-644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-644. - Resolution: Fixed Jobs canceled due to repeated executor failures may hang Key: SPARK-644 URL: https://issues.apache.org/jira/browse/SPARK-644 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.6.1 Reporter: Josh Rosen Assignee: Josh Rosen In order to prevent an infinite loop, the standalone master aborts jobs that experience more than 10 executor failures (see https://github.com/mesos/spark/pull/210). Currently, the master crashes when aborting jobs (this is the issue that uncovered SPARK-643). If we fix the crash, which involves removing a {{throw}} from the actor's {{receive}} method, then these failures can lead to a hang because they cause the job to be removed from the master's scheduler, but the upstream scheduler components aren't notified of the failure and will wait for the job to finish. I've considered fixing this by adding additional callbacks to propagate the failure to the higher-level schedulers. It might be cleaner to move the decision to abort the job into the higher-level layers of the scheduler, sending an {{AbortJob(jobId)}} method to the Master. The Client is already notified of executor state changes, so it may be able to make the decision to abort (or defer that decision to a higher layer). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-643) Standalone master crashes during actor restart
[ https://issues.apache.org/jira/browse/SPARK-643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-643. - Resolution: Fixed Standalone master crashes during actor restart -- Key: SPARK-643 URL: https://issues.apache.org/jira/browse/SPARK-643 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.6.1 Reporter: Josh Rosen Assignee: Josh Rosen The standalone master will crash if it restarts due to an exception: {code} 12/12/15 03:10:47 ERROR master.Master: Job SkewBenchmark wth ID job-20121215031047- failed 11 times. spark.SparkException: Job SkewBenchmark wth ID job-20121215031047- failed 11 times. at spark.deploy.master.Master$$anonfun$receive$1.apply(Master.scala:103) at spark.deploy.master.Master$$anonfun$receive$1.apply(Master.scala:62) at akka.actor.Actor$class.apply(Actor.scala:318) at spark.deploy.master.Master.apply(Master.scala:17) at akka.actor.ActorCell.invoke(ActorCell.scala:626) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:197) at akka.dispatch.Mailbox.run(Mailbox.scala:179) at akka.dispatch.ForkJoinExecutorConfigurator$MailboxExecutionTask.exec(AbstractDispatcher.scala:516) at akka.jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:259) at akka.jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:975) at akka.jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1479) at akka.jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104) 12/12/15 03:10:47 INFO master.Master: Starting Spark master at spark://ip-10-226-87-193:7077 12/12/15 03:10:47 INFO io.IoWorker: IoWorker thread 'spray-io-worker-1' started 12/12/15 03:10:47 ERROR master.Master: Failed to create web UI akka.actor.InvalidActorNameException:actor name HttpServer is not unique! 
[05aed000-4665-11e2-b361-12313d316833] at akka.actor.ActorCell.actorOf(ActorCell.scala:392) at akka.actor.LocalActorRefProvider$Guardian$$anonfun$receive$1.liftedTree1$1(ActorRefProvider.scala:394) at akka.actor.LocalActorRefProvider$Guardian$$anonfun$receive$1.apply(ActorRefProvider.scala:394) at akka.actor.LocalActorRefProvider$Guardian$$anonfun$receive$1.apply(ActorRefProvider.scala:392) at akka.actor.Actor$class.apply(Actor.scala:318) at akka.actor.LocalActorRefProvider$Guardian.apply(ActorRefProvider.scala:388) at akka.actor.ActorCell.invoke(ActorCell.scala:626) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:197) at akka.dispatch.Mailbox.run(Mailbox.scala:179) at akka.dispatch.ForkJoinExecutorConfigurator$MailboxExecutionTask.exec(AbstractDispatcher.scala:516) at akka.jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:259) at akka.jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:975) at akka.jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1479) at akka.jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104) {code} When the Master actor restarts, Akka calls the {{postRestart}} hook. [By default|http://doc.akka.io/docs/akka/snapshot/general/supervision.html#supervision-restart], this calls {{preStart}}. The standalone master's {{preStart}} method tries to start the webUI but crashes because it is already running. I ran into this after a job failed more than 11 times, which causes the Master to throw a SparkException from its {{receive}} method. The solution is to implement a custom {{postRestart}} hook. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4276) Spark streaming requires at least two working thread
varun sharma created SPARK-4276: --- Summary: Spark streaming requires at least two working thread Key: SPARK-4276 URL: https://issues.apache.org/jira/browse/SPARK-4276 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.1.0 Reporter: varun sharma Fix For: 1.1.0 Spark streaming requires at least two working threads. But the example in spark/examples/src/main/scala/org/apache/spark/examples/streaming/NetworkWordCount.scala has // Create the context with a 1 second batch size val sparkConf = new SparkConf().setAppName("NetworkWordCount") val ssc = new StreamingContext(sparkConf, Seconds(1)) which creates only one thread. It should have at least 2 threads: http://spark.apache.org/docs/latest/streaming-programming-guide.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
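The underlying starvation is not Spark-specific. As an illustration (plain Python threads and a hypothetical starvation_demo helper, not the Spark scheduler): with a single worker slot, a long-running receiver occupies the only slot and batch processing never runs; with two slots, both make progress. In the Scala example itself, the local fix is a master with at least two threads, e.g. setMaster("local[2]").

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def starvation_demo(num_threads):
    """Return True if the 'processing' task ran while the receiver held a slot."""
    stop = threading.Event()
    processed = threading.Event()
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        pool.submit(stop.wait)             # long-running receiver occupies a slot
        pool.submit(processed.set)         # batch processing wants a slot too
        ran = processed.wait(timeout=1.0)  # starves if the receiver has the only slot
        stop.set()                         # let the receiver finish
    return ran

assert starvation_demo(1) is False   # like local[1]: processing never runs
assert starvation_demo(2) is True    # like local[2]: both make progress
```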
[jira] [Commented] (SPARK-677) PySpark should not collect results through local filesystem
[ https://issues.apache.org/jira/browse/SPARK-677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200514#comment-14200514 ] Matei Zaharia commented on SPARK-677: - [~joshrosen] is this fixed now? PySpark should not collect results through local filesystem --- Key: SPARK-677 URL: https://issues.apache.org/jira/browse/SPARK-677 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 0.7.0 Reporter: Josh Rosen Py4J is slow when transferring large arrays, so PySpark currently dumps data to the disk and reads it back in order to collect() RDDs. On large enough datasets, this data will spill from the buffer cache and write to the physical disk, resulting in terrible performance. Instead, we should stream the data from Java to Python over a local socket or a FIFO. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
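The proposed fix, streaming records over a local socket instead of a temp file, can be sketched as follows. This is an illustrative Python sketch with hypothetical helper names (send_objs/recv_objs), not PySpark's actual implementation: records are length-prefixed, pickled, and streamed over a socketpair, with a zero-length frame as the end-of-stream marker.

```python
import pickle
import socket
import struct
import threading

def _read_exact(sock, n):
    # Read exactly n bytes from the socket.
    buf = b""
    while len(buf) < n:
        buf += sock.recv(n - len(buf))
    return buf

def send_objs(sock, objs):
    # Stream length-prefixed pickled records, never touching the filesystem.
    for obj in objs:
        payload = pickle.dumps(obj)
        sock.sendall(struct.pack(">I", len(payload)) + payload)
    sock.sendall(struct.pack(">I", 0))  # end-of-stream marker

def recv_objs(sock):
    out = []
    while True:
        n = struct.unpack(">I", _read_exact(sock, 4))[0]
        if n == 0:
            return out
        out.append(pickle.loads(_read_exact(sock, n)))

a, b = socket.socketpair()
t = threading.Thread(target=send_objs, args=(a, [(1, 2), (3, 4)]))
t.start()
result = recv_objs(b)
t.join()
assert result == [(1, 2), (3, 4)]
```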
[jira] [Resolved] (SPARK-681) Optimize hashtables used in Spark
[ https://issues.apache.org/jira/browse/SPARK-681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-681. - Resolution: Fixed Optimize hashtables used in Spark - Key: SPARK-681 URL: https://issues.apache.org/jira/browse/SPARK-681 Project: Spark Issue Type: Improvement Reporter: Matei Zaharia The hash tables used in cogroup, join, etc take up a lot more space than they need to because they're using linked data structures. It would be nice to write a custom open hashtable class to use instead, especially since these tables are append-only. A custom one would likely run better than fastutil as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
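The append-only open-addressing idea can be sketched as below (illustrative Python with a hypothetical AppendOnlyMap name; Spark's eventual class of the same name follows a similar design): keys and values sit in flat parallel arrays with linear probing, so there are no per-entry linked nodes to allocate or chase.

```python
class AppendOnlyMap:
    """Open-addressing hash map (linear probing) for append-only workloads."""
    def __init__(self):
        self._keys = [None] * 8
        self._vals = [None] * 8
        self._size = 0

    def _slot(self, key):
        i = hash(key) % len(self._keys)
        while self._keys[i] is not None and self._keys[i] != key:
            i = (i + 1) % len(self._keys)  # linear probing
        return i

    def merge(self, key, value, combine):
        # Insert, or combine with the existing value (cogroup/join style).
        if (self._size + 1) * 2 > len(self._keys):
            self._grow()
        i = self._slot(key)
        if self._keys[i] is None:
            self._keys[i], self._vals[i] = key, value
            self._size += 1
        else:
            self._vals[i] = combine(self._vals[i], value)

    def _grow(self):
        # Double the flat arrays and rehash; no per-entry nodes involved.
        old_keys, old_vals = self._keys, self._vals
        self._keys = [None] * (len(old_keys) * 2)
        self._vals = [None] * len(self._keys)
        for k, v in zip(old_keys, old_vals):
            if k is not None:
                i = self._slot(k)
                self._keys[i], self._vals[i] = k, v

    def items(self):
        return {k: v for k, v in zip(self._keys, self._vals) if k is not None}

m = AppendOnlyMap()
for k, v in [("a", 1), ("b", 2), ("a", 3)]:
    m.merge(k, v, lambda x, y: x + y)
assert m.items() == {"a": 4, "b": 2}
```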
[jira] [Commented] (SPARK-2468) Netty-based block server / client module
[ https://issues.apache.org/jira/browse/SPARK-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200519#comment-14200519 ] Aaron Davidson commented on SPARK-2468: --- [~zzcclp] Use of epoll mode is highly dependent on your environment, and I personally would not recommend it due to known netty bugs which may cause it to be less stable. We have found nio mode to be sufficiently performant in our testing (and netty actually still tries to use epoll if it's available as its selector). [~lianhuiwang] Could you please elaborate on what you mean? Netty-based block server / client module Key: SPARK-2468 URL: https://issues.apache.org/jira/browse/SPARK-2468 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Reporter: Reynold Xin Assignee: Reynold Xin Priority: Critical Fix For: 1.2.0 Right now shuffle send goes through the block manager. This is inefficient because it requires loading a block from disk into a kernel buffer, then into a user space buffer, and then back to a kernel send buffer before it reaches the NIC. It makes multiple copies of the data and context-switches between kernel and user space. It also creates unnecessary buffers in the JVM, which increases GC pressure. Instead, we should use FileChannel.transferTo, which handles this in the kernel space with zero-copy. See http://www.ibm.com/developerworks/library/j-zerocopy/ One potential solution is to use Netty. Spark already has a Netty-based network module implemented (org.apache.spark.network.netty). However, it lacks some functionality and is turned off by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
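The zero-copy path that FileChannel.transferTo provides on the JVM has a rough Python analogue in socket.sendfile (backed by os.sendfile where the platform supports it, with a plain-send fallback), which makes the idea easy to demonstrate. A small sketch, with a temp file standing in for a shuffle block:

```python
import os
import socket
import tempfile

# Write a small "shuffle block" to disk.
blk = tempfile.NamedTemporaryFile(delete=False)
blk.write(b"shuffle-bytes" * 100)
blk.close()

server, client = socket.socketpair()
with open(blk.name, "rb") as f:
    # socket.sendfile uses os.sendfile where available: the kernel moves
    # file pages straight to the socket, skipping the user-space copy
    # (the same idea as FileChannel.transferTo on the JVM).
    sent = server.sendfile(f)
server.close()

received = b""
while True:
    chunk = client.recv(65536)
    if not chunk:
        break
    received += chunk
client.close()
os.unlink(blk.name)

assert sent == 1300 and received == b"shuffle-bytes" * 100
```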
[jira] [Commented] (SPARK-4188) Shuffle fetches should be retried at a lower level
[ https://issues.apache.org/jira/browse/SPARK-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200527#comment-14200527 ] Apache Spark commented on SPARK-4188: - User 'aarondav' has created a pull request for this issue: https://github.com/apache/spark/pull/3101 Shuffle fetches should be retried at a lower level -- Key: SPARK-4188 URL: https://issues.apache.org/jira/browse/SPARK-4188 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.0 Reporter: Aaron Davidson During periods of high network (or GC) load, it is not uncommon that IOExceptions crop up around connection failures when fetching shuffle files. Unfortunately, when such a failure occurs, it is interpreted as an inability to fetch the files, which causes us to mark the executor as lost and recompute all of its shuffle outputs. We should allow retrying at the network level in the event of an IOException in order to avoid this circumstance. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
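The proposed behavior can be sketched as a transport-level retry wrapper. The names here (fetch_with_retry, flaky_fetch) are hypothetical and the real change lives in Spark's network layer, but the idea is the same: swallow a bounded number of IOExceptions before escalating to an executor-lost decision.

```python
import time

def fetch_with_retry(fetch, max_retries=3, backoff_s=0.0):
    """Retry a flaky fetch a few times before declaring the executor lost.
    `fetch` is any zero-arg callable that may raise IOError."""
    for attempt in range(max_retries + 1):
        try:
            return fetch()
        except IOError:
            if attempt == max_retries:
                raise               # only now escalate to "executor lost"
            time.sleep(backoff_s)   # brief pause before the retry

calls = {"n": 0}
def flaky_fetch():
    # Fails twice (transient connection errors), then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("connection reset while fetching shuffle block")
    return b"block-data"

assert fetch_with_retry(flaky_fetch) == b"block-data"
assert calls["n"] == 3
```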
[jira] [Closed] (SPARK-4238) Perform network-level retry of shuffle file fetches
[ https://issues.apache.org/jira/browse/SPARK-4238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Davidson closed SPARK-4238. - Resolution: Duplicate Perform network-level retry of shuffle file fetches --- Key: SPARK-4238 URL: https://issues.apache.org/jira/browse/SPARK-4238 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Aaron Davidson Assignee: Aaron Davidson Priority: Critical During periods of high network (or GC) load, it is not uncommon that IOExceptions crop up around connection failures when fetching shuffle files. Unfortunately, when such a failure occurs, it is interpreted as an inability to fetch the files, which causes us to mark the executor as lost and recompute all of its shuffle outputs. We should allow retrying at the network level in the event of an IOException in order to avoid this circumstance. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4188) Shuffle fetches should be retried at a lower level
[ https://issues.apache.org/jira/browse/SPARK-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Davidson updated SPARK-4188: -- Description: During periods of high network (or GC) load, it is not uncommon that IOExceptions crop up around connection failures when fetching shuffle files. Unfortunately, when such a failure occurs, it is interpreted as an inability to fetch the files, which causes us to mark the executor as lost and recompute all of its shuffle outputs. We should allow retrying at the network level in the event of an IOException in order to avoid this circumstance. was:Sometimes fetches will fail due to garbage collection pauses or network load. A simple retry could save recomputation of a lot of shuffle data, especially if it's below the task level (i.e., on the level of a single fetch). Shuffle fetches should be retried at a lower level -- Key: SPARK-4188 URL: https://issues.apache.org/jira/browse/SPARK-4188 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.0 Reporter: Aaron Davidson During periods of high network (or GC) load, it is not uncommon that IOExceptions crop up around connection failures when fetching shuffle files. Unfortunately, when such a failure occurs, it is interpreted as an inability to fetch the files, which causes us to mark the executor as lost and recompute all of its shuffle outputs. We should allow retrying at the network level in the event of an IOException in order to avoid this circumstance. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-993) Don't reuse Writable objects in HadoopRDDs by default
[ https://issues.apache.org/jira/browse/SPARK-993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-993. - Resolution: Won't Fix We investigated this for 1.0 but found that many InputFormats behave wrongly if you try to clone the object, so we won't fix it. Don't reuse Writable objects in HadoopRDDs by default - Key: SPARK-993 URL: https://issues.apache.org/jira/browse/SPARK-993 Project: Spark Issue Type: Improvement Reporter: Matei Zaharia Right now we reuse them as an optimization, which leads to weird results when you call collect() on a file with distinct items. We should instead make that behavior optional through a flag. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4238) Perform network-level retry of shuffle file fetches
[ https://issues.apache.org/jira/browse/SPARK-4238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200526#comment-14200526 ] Aaron Davidson commented on SPARK-4238: --- Oh, whoops. Perform network-level retry of shuffle file fetches --- Key: SPARK-4238 URL: https://issues.apache.org/jira/browse/SPARK-4238 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Aaron Davidson Assignee: Aaron Davidson Priority: Critical During periods of high network (or GC) load, it is not uncommon that IOExceptions crop up around connection failures when fetching shuffle files. Unfortunately, when such a failure occurs, it is interpreted as an inability to fetch the files, which causes us to mark the executor as lost and recompute all of its shuffle outputs. We should allow retrying at the network level in the event of an IOException in order to avoid this circumstance. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-993) Don't reuse Writable objects in HadoopRDDs by default
[ https://issues.apache.org/jira/browse/SPARK-993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200531#comment-14200531 ] Matei Zaharia commented on SPARK-993: - Arun, you'd see this issue if you do collect() or take() and then println. The problem is that the same Text object (for example) is referenced for all records in the dataset. The counts will be okay. Don't reuse Writable objects in HadoopRDDs by default - Key: SPARK-993 URL: https://issues.apache.org/jira/browse/SPARK-993 Project: Spark Issue Type: Improvement Reporter: Matei Zaharia Right now we reuse them as an optimization, which leads to weird results when you call collect() on a file with distinct items. We should instead make that behavior optional through a flag. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
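The aliasing Matei describes is easy to reproduce with any mutable, reused record object. A minimal Python sketch (the Text class here is a hypothetical stand-in for a Hadoop Writable, not the real class):

```python
class Text:
    """Stand-in for a Hadoop Writable: a mutable, reusable record holder."""
    def __init__(self):
        self.value = None
    def set(self, value):
        self.value = value

def read_records_reusing(lines):
    record = Text()                # one shared object, reused per record
    for line in lines:
        record.set(line)
        yield record               # every yield is the SAME object

collected = list(read_records_reusing(["a", "b", "c"]))
# All three references point at the last record read:
assert [t.value for t in collected] == ["c", "c", "c"]

# Per-record processing (counting, etc.) is unaffected, but collect()-style
# materialization needs to copy each value out before the next record:
copied = [t.value for t in read_records_reusing(["a", "b", "c"])]
assert copied == ["a", "b", "c"]
```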
[jira] [Closed] (SPARK-1000) Crash when running SparkPi example with local-cluster
[ https://issues.apache.org/jira/browse/SPARK-1000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia closed SPARK-1000. Resolution: Cannot Reproduce Crash when running SparkPi example with local-cluster - Key: SPARK-1000 URL: https://issues.apache.org/jira/browse/SPARK-1000 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.0 Reporter: xiajunluan Assignee: Andrew Or when I run SparkPi with local-cluster[2,2,512], it will throw following exception at the end of job. WARNING: An exception was thrown by an exception handler. java.util.concurrent.RejectedExecutionException at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:1768) at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:767) at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:658) at org.jboss.netty.channel.socket.nio.AbstractNioWorker.start(AbstractNioWorker.java:184) at org.jboss.netty.channel.socket.nio.AbstractNioWorker.executeInIoThread(AbstractNioWorker.java:330) at org.jboss.netty.channel.socket.nio.NioWorker.executeInIoThread(NioWorker.java:35) at org.jboss.netty.channel.socket.nio.AbstractNioWorker.executeInIoThread(AbstractNioWorker.java:313) at org.jboss.netty.channel.socket.nio.NioWorker.executeInIoThread(NioWorker.java:35) at org.jboss.netty.channel.socket.nio.AbstractNioChannelSink.execute(AbstractNioChannelSink.java:34) at org.jboss.netty.channel.Channels.fireExceptionCaughtLater(Channels.java:504) at org.jboss.netty.channel.AbstractChannelSink.exceptionCaught(AbstractChannelSink.java:47) at org.jboss.netty.channel.Channels.fireChannelOpen(Channels.java:170) at org.jboss.netty.channel.socket.nio.NioClientSocketChannel.init(NioClientSocketChannel.java:79) at org.jboss.netty.channel.socket.nio.NioClientSocketChannelFactory.newChannel(NioClientSocketChannelFactory.java:176) at 
org.jboss.netty.channel.socket.nio.NioClientSocketChannelFactory.newChannel(NioClientSocketChannelFactory.java:82) at org.jboss.netty.bootstrap.ClientBootstrap.connect(ClientBootstrap.java:213) at org.jboss.netty.bootstrap.ClientBootstrap.connect(ClientBootstrap.java:183) at akka.remote.netty.ActiveRemoteClient$$anonfun$connect$1.apply$mcV$sp(Client.scala:173) at akka.util.Switch.liftedTree1$1(LockUtil.scala:33) at akka.util.Switch.transcend(LockUtil.scala:32) at akka.util.Switch.switchOn(LockUtil.scala:55) at akka.remote.netty.ActiveRemoteClient.connect(Client.scala:158) at akka.remote.netty.NettyRemoteTransport.send(NettyRemoteSupport.scala:153) at akka.remote.RemoteActorRef.$bang(RemoteActorRefProvider.scala:247) at akka.actor.LocalDeathWatch$$anonfun$publish$1.apply(ActorRefProvider.scala:559) at akka.actor.LocalDeathWatch$$anonfun$publish$1.apply(ActorRefProvider.scala:559) at scala.collection.Iterator$class.foreach(Iterator.scala:772) at scala.collection.immutable.VectorIterator.foreach(Vector.scala:648) at scala.collection.IterableLike$class.foreach(IterableLike.scala:73) at scala.collection.immutable.Vector.foreach(Vector.scala:63) at akka.actor.LocalDeathWatch.publish(ActorRefProvider.scala:559) at akka.remote.RemoteDeathWatch.publish(RemoteActorRefProvider.scala:280) at akka.remote.RemoteDeathWatch.publish(RemoteActorRefProvider.scala:262) at akka.actor.ActorCell.doTerminate(ActorCell.scala:701) at akka.actor.ActorCell.handleChildTerminated(ActorCell.scala:747) at akka.actor.ActorCell.systemInvoke(ActorCell.scala:608) at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:209) at akka.dispatch.Mailbox.run(Mailbox.scala:178) at akka.dispatch.ForkJoinExecutorConfigurator$MailboxExecutionTask.exec(AbstractDispatcher.scala:516) at akka.jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:259) at akka.jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:975) at akka.jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1479) at 
akka.jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1023) Remove Thread.sleep(5000) from TaskSchedulerImpl
[ https://issues.apache.org/jira/browse/SPARK-1023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-1023. -- Resolution: Fixed Remove Thread.sleep(5000) from TaskSchedulerImpl Key: SPARK-1023 URL: https://issues.apache.org/jira/browse/SPARK-1023 Project: Spark Issue Type: Bug Reporter: Patrick Wendell Fix For: 1.0.0 This causes the unit tests to take super long. We should figure out why this exists and see if we can lower it or do something smarter. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
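The usual replacement for a fixed sleep in tests is a condition- or event-based wait with a generous timeout: it returns as soon as the condition holds instead of always paying the full delay. A sketch of the pattern in Python (illustrative only, not the actual TaskSchedulerImpl change):

```python
import threading
import time

def wait_fixed(ready, delay_s=0.5):
    # Fixed sleep: always pays the full delay, even if ready long before.
    time.sleep(delay_s)
    return ready.is_set()

def wait_event(ready, timeout_s=5.0):
    # Event-based wait: returns as soon as the condition holds.
    return ready.wait(timeout=timeout_s)

ready = threading.Event()
threading.Timer(0.05, ready.set).start()  # condition becomes true shortly

start = time.monotonic()
assert wait_event(ready, timeout_s=5.0)
elapsed = time.monotonic() - start
assert elapsed < 1.0          # returned promptly, not after the full timeout

assert wait_fixed(ready) is True  # correct, but always burns the full delay
```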
[jira] [Resolved] (SPARK-1185) In Spark Programming Guide, Master URLs should mention yarn-client
[ https://issues.apache.org/jira/browse/SPARK-1185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-1185. -- Resolution: Fixed In Spark Programming Guide, Master URLs should mention yarn-client Key: SPARK-1185 URL: https://issues.apache.org/jira/browse/SPARK-1185 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 0.9.0 Reporter: Sandy Pérez González Assignee: Sandy Pérez González It would also be helpful to mention that the reason a host:port isn't required for YARN mode is that it comes from the Hadoop configuration. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2447) Add common solution for sending upsert actions to HBase (put, deletes, and increment)
[ https://issues.apache.org/jira/browse/SPARK-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200538#comment-14200538 ] Norman He commented on SPARK-2447: -- Hi Ted and Tathagata Das, Would Spark/Spark Streaming consider making HBaseContext a facade to enclose all the simple HBase Get/Put methods? Add common solution for sending upsert actions to HBase (put, deletes, and increment) - Key: SPARK-2447 URL: https://issues.apache.org/jira/browse/SPARK-2447 Project: Spark Issue Type: New Feature Components: Spark Core, Streaming Reporter: Ted Malaska Assignee: Ted Malaska Going to review the design with Tdas today. But the first thought is to have an extension of VoidFunction that handles the connection to HBase and allows for options such as turning auto flush off for higher throughput. Need to answer the following questions first. - Can it be written in Java or should it be written in Scala? - What is the best way to add the HBase dependency? (will review how Flume does this as the first option) - What is the best way to do testing? (will review how Flume does this as the first option) - How to support python? (python may be a different Jira it is unknown at this time) Goals: - Simple to use - Stable - Supports high load - Documented (May be in a separate Jira need to ask Tdas) - Supports Java, Scala, and hopefully Python - Supports Streaming and normal Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4276) Spark streaming requires at least two working thread
[ https://issues.apache.org/jira/browse/SPARK-4276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200539#comment-14200539 ] Sean Owen commented on SPARK-4276: -- This is basically the same concern addressed already by https://issues.apache.org/jira/browse/SPARK-4040 no? This code can set a master of, say, local[2], but it was my understanding that all examples don't set a master and this is supplied by spark-submit. Then again SPARK-4040 changed the doc example to set a local[2] master. hm. Spark streaming requires at least two working thread Key: SPARK-4276 URL: https://issues.apache.org/jira/browse/SPARK-4276 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.1.0 Reporter: varun sharma Fix For: 1.1.0 Spark streaming requires at least two working threads.But example in spark/examples/src/main/scala/org/apache/spark/examples/streaming/NetworkWordCount.scala has // Create the context with a 1 second batch size val sparkConf = new SparkConf().setAppName(NetworkWordCount) val ssc = new StreamingContext(sparkConf, Seconds(1)) which creates only 1 thread. It should have atleast 2 threads: http://spark.apache.org/docs/latest/streaming-programming-guide.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4276) Spark streaming requires at least two working thread
[ https://issues.apache.org/jira/browse/SPARK-4276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200540#comment-14200540 ] Apache Spark commented on SPARK-4276: - User 'svar29' has created a pull request for this issue: https://github.com/apache/spark/pull/3141 Spark streaming requires at least two working thread Key: SPARK-4276 URL: https://issues.apache.org/jira/browse/SPARK-4276 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.1.0 Reporter: varun sharma Fix For: 1.1.0 Spark streaming requires at least two working threads.But example in spark/examples/src/main/scala/org/apache/spark/examples/streaming/NetworkWordCount.scala has // Create the context with a 1 second batch size val sparkConf = new SparkConf().setAppName(NetworkWordCount) val ssc = new StreamingContext(sparkConf, Seconds(1)) which creates only 1 thread. It should have atleast 2 threads: http://spark.apache.org/docs/latest/streaming-programming-guide.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-2237) Add ZLIBCompressionCodec code
[ https://issues.apache.org/jira/browse/SPARK-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia closed SPARK-2237. Resolution: Won't Fix Add ZLIBCompressionCodec code - Key: SPARK-2237 URL: https://issues.apache.org/jira/browse/SPARK-2237 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Yanjie Gao -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2348) In Windows, having an environment variable named 'classpath' gives an error
[ https://issues.apache.org/jira/browse/SPARK-2348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-2348: - Priority: Critical (was: Major) In Windows, having an environment variable named 'classpath' gives an error --- Key: SPARK-2348 URL: https://issues.apache.org/jira/browse/SPARK-2348 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Environment: Windows 7 Enterprise Reporter: Chirag Todarka Assignee: Chirag Todarka Priority: Critical Operating System: Windows 7 Enterprise If an environment variable named 'classpath' is set, then starting 'spark-shell' gives the error below: mydir\spark\binspark-shell Failed to initialize compiler: object scala.runtime in compiler mirror not found . ** Note that as of 2.8 scala does not assume use of the java classpath. ** For the old behavior pass -usejavacp to scala, or if using a Settings ** object programatically, settings.usejavacp.value = true. 14/07/02 14:22:06 WARN SparkILoop$SparkILoopInterpreter: Warning: compiler acces sed before init set up. Assuming no postInit code. Failed to initialize compiler: object scala.runtime in compiler mirror not found . ** Note that as of 2.8 scala does not assume use of the java classpath. ** For the old behavior pass -usejavacp to scala, or if using a Settings ** object programatically, settings.usejavacp.value = true. Exception in thread main java.lang.AssertionError: assertion failed: null at scala.Predef$.assert(Predef.scala:179) at org.apache.spark.repl.SparkIMain.initializeSynchronous(SparkIMain.sca la:202) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(Spar kILoop.scala:929) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop. scala:884) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop. 
scala:884) at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClass Loader.scala:135) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:884) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:982) at org.apache.spark.repl.Main$.main(Main.scala:31) at org.apache.spark.repl.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.lang.reflect.Method.invoke(Unknown Source) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2447) Add common solution for sending upsert actions to HBase (put, deletes, and increment)
[ https://issues.apache.org/jira/browse/SPARK-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200554#comment-14200554 ] Ted Malaska commented on SPARK-2447: Hey Norman, Totally agree. TD and I talked about SparkOnHBase at Hadoop World. Times were crazy leading up to Hadoop World. So I'm doing the following things: 1. I'm writing up a Blog for SparkOnHBase 2. TD is working on directions for how this code should be integrated with Spark 3. I have been working out little bugs with Java integration 4. I want to build a couple more examples 5. I'm having a problem with Maven where the Java JUnits are not executing 6. I'm adding support for Kerberos But yes, the facade is coming. :) Let me know if you want to help. Just do a pull request on https://github.com/tmalaska/SparkOnHBase Add common solution for sending upsert actions to HBase (put, deletes, and increment) - Key: SPARK-2447 URL: https://issues.apache.org/jira/browse/SPARK-2447 Project: Spark Issue Type: New Feature Components: Spark Core, Streaming Reporter: Ted Malaska Assignee: Ted Malaska Going to review the design with Tdas today. But the first thought is to have an extension of VoidFunction that handles the connection to HBase and allows for options such as turning auto flush off for higher throughput. Need to answer the following questions first. - Can it be written in Java or should it be written in Scala? - What is the best way to add the HBase dependency? (will review how Flume does this as the first option) - What is the best way to do testing? (will review how Flume does this as the first option) - How to support python? 
(python may be a different Jira it is unknown at this time) Goals: - Simple to use - Stable - Supports high load - Documented (May be in a separate Jira need to ask Tdas) - Supports Java, Scala, and hopefully Python - Supports Streaming and normal Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4276) Spark streaming requires at least two working thread
[ https://issues.apache.org/jira/browse/SPARK-4276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200571#comment-14200571 ] varun sharma commented on SPARK-4276: - [~srowen] Yeah, right. I spent a lot of time figuring this out, as the example in the repo and in the documentation are different. I now understand the concept of resource starvation here. Thanks for pointing it out and sorry for the inconvenience. Spark streaming requires at least two working thread Key: SPARK-4276 URL: https://issues.apache.org/jira/browse/SPARK-4276 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.1.0 Reporter: varun sharma Fix For: 1.1.0 Spark streaming requires at least two working threads. But the example in spark/examples/src/main/scala/org/apache/spark/examples/streaming/NetworkWordCount.scala has // Create the context with a 1 second batch size val sparkConf = new SparkConf().setAppName("NetworkWordCount") val ssc = new StreamingContext(sparkConf, Seconds(1)) which creates only one thread. It should have at least 2 threads: http://spark.apache.org/docs/latest/streaming-programming-guide.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2447) Add common solution for sending upsert actions to HBase (put, deletes, and increment)
[ https://issues.apache.org/jira/browse/SPARK-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200593#comment-14200593 ] Norman He commented on SPARK-2447: -- Yes, I would like to help. Let me start with facade work in scala first. Add common solution for sending upsert actions to HBase (put, deletes, and increment) - Key: SPARK-2447 URL: https://issues.apache.org/jira/browse/SPARK-2447 Project: Spark Issue Type: New Feature Components: Spark Core, Streaming Reporter: Ted Malaska Assignee: Ted Malaska Going to review the design with Tdas today. But first thoughts is to have an extension of VoidFunction that handles the connection to HBase and allows for options such as turning auto flush off for higher through put. Need to answer the following questions first. - Can it be written in Java or should it be written in Scala? - What is the best way to add the HBase dependency? (will review how Flume does this as the first option) - What is the best way to do testing? (will review how Flume does this as the first option) - How to support python? (python may be a different Jira it is unknown at this time) Goals: - Simple to use - Stable - Supports high load - Documented (May be in a separate Jira need to ask Tdas) - Supports Java, Scala, and hopefully Python - Supports Streaming and normal Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2447) Add common solution for sending upsert actions to HBase (put, deletes, and increment)
[ https://issues.apache.org/jira/browse/SPARK-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200602#comment-14200602 ] Ted Malaska commented on SPARK-2447: Cool thanks. Connect me at ted.mala...@cloudera.com and I'll set up a webex for some time in the future so we can get this going. Thanks Add common solution for sending upsert actions to HBase (put, deletes, and increment) - Key: SPARK-2447 URL: https://issues.apache.org/jira/browse/SPARK-2447 Project: Spark Issue Type: New Feature Components: Spark Core, Streaming Reporter: Ted Malaska Assignee: Ted Malaska Going to review the design with Tdas today. But first thoughts is to have an extension of VoidFunction that handles the connection to HBase and allows for options such as turning auto flush off for higher through put. Need to answer the following questions first. - Can it be written in Java or should it be written in Scala? - What is the best way to add the HBase dependency? (will review how Flume does this as the first option) - What is the best way to do testing? (will review how Flume does this as the first option) - How to support python? (python may be a different Jira it is unknown at this time) Goals: - Simple to use - Stable - Supports high load - Documented (May be in a separate Jira need to ask Tdas) - Supports Java, Scala, and hopefully Python - Supports Streaming and normal Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4277) Support external shuffle service on Worker
Aaron Davidson created SPARK-4277: - Summary: Support external shuffle service on Worker Key: SPARK-4277 URL: https://issues.apache.org/jira/browse/SPARK-4277 Project: Spark Issue Type: Bug Components: Deploy, Spark Core Reporter: Aaron Davidson Assignee: Aaron Davidson It's less useful to have an external shuffle service on the Spark Standalone Worker than on YARN or Mesos (as executor allocations tend to be more static), but it would be good to help test the code path. It would also make Spark more resilient to particular executor failures. Cool side-feature: When SPARK-4236 is fixed and integrated, the Worker will take care of cleaning up executor directories, which will mean executors terminate more quickly and that we don't leak data if the executor dies forcefully.
[jira] [Resolved] (SPARK-4264) SQL HashJoin induces refCnt = 0 error in ShuffleBlockFetcherIterator
[ https://issues.apache.org/jira/browse/SPARK-4264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Davidson resolved SPARK-4264. --- Resolution: Fixed SQL HashJoin induces refCnt = 0 error in ShuffleBlockFetcherIterator -- Key: SPARK-4264 URL: https://issues.apache.org/jira/browse/SPARK-4264 Project: Spark Issue Type: Bug Affects Versions: 1.2.0 Reporter: Aaron Davidson Assignee: Aaron Davidson Priority: Blocker This is because it calls hasNext twice, which unintuitively invokes the completion iterator twice.
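The failure mode Aaron describes is easy to reproduce outside Spark. Below is a hypothetical Python model (not Spark's actual Scala CompletionIterator) of a completion iterator whose cleanup callback is not idempotent: calling hasNext a second time after exhaustion fires the cleanup again, which is exactly how a buffer's refCnt can be decremented to 0 twice. Guarding the callback with a "fired" flag is one fix, shown via the `guard` switch.

```python
class CompletionIterator:
    """Hypothetical Python model of a completion iterator. With
    guard=False, *every* has_next() call on an exhausted iterator
    re-runs the completion callback -- the bug pattern behind the
    refCnt = 0 error. With guard=True the callback runs exactly once."""

    _SENTINEL = object()

    def __init__(self, it, completion, guard=False):
        self._it = iter(it)
        self._completion = completion
        self._guard = guard
        self._fired = False
        self._peeked = self._SENTINEL

    def has_next(self):
        if self._peeked is not self._SENTINEL:
            return True
        self._peeked = next(self._it, self._SENTINEL)
        if self._peeked is self._SENTINEL:
            if not (self._guard and self._fired):
                self._fired = True
                self._completion()  # cleanup, e.g. releasing fetched buffers
            return False
        return True

    def next(self):
        if not self.has_next():
            raise StopIteration
        value, self._peeked = self._peeked, self._SENTINEL
        return value
```

With `guard=False`, two trailing `has_next()` calls run the cleanup twice; with `guard=True` it runs once, no matter how often `has_next()` is re-checked.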
[jira] [Commented] (SPARK-4277) Support external shuffle service on Worker
[ https://issues.apache.org/jira/browse/SPARK-4277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200631#comment-14200631 ] Apache Spark commented on SPARK-4277: - User 'aarondav' has created a pull request for this issue: https://github.com/apache/spark/pull/3142
[jira] [Commented] (SPARK-2447) Add common solution for sending upsert actions to HBase (put, deletes, and increment)
[ https://issues.apache.org/jira/browse/SPARK-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200634#comment-14200634 ] Patrick Wendell commented on SPARK-2447: Hey All, I have a question about this - is there any reason this can't exist as a user library instead of being merged into Spark itself? For utility libraries like this, I could see ones coming for Cassandra, Mongo, etc... I don't see it scaling to put and maintain all of these in the Spark code base. At the same time, however, they are super useful. As an alternative - what if it lived in HBase, similar to e.g. the Hadoop InputFormat implementation?
[jira] [Resolved] (SPARK-4249) A problem of EdgePartitionBuilder in Graphx
[ https://issues.apache.org/jira/browse/SPARK-4249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur Dave resolved SPARK-4249. --- Resolution: Fixed Fix Version/s: 1.0.3 1.1.1 1.2.0 Issue resolved by pull request 3138 [https://github.com/apache/spark/pull/3138] A problem of EdgePartitionBuilder in Graphx --- Key: SPARK-4249 URL: https://issues.apache.org/jira/browse/SPARK-4249 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.1.0 Reporter: Cookies Priority: Minor Fix For: 1.2.0, 1.1.1, 1.0.3 https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/impl/EdgePartitionBuilder.scala#L48 In the function toEdgePartition, the code snippet index.update(srcIds(0), 0) runs while the elements of the array srcIds are not yet initialized and are all 0. The effect is that if the vertex ids don't contain 0, indexSize equals (realIndexSize + 1) and a spurious entry for 0 is added to the `index`. It seems that all versions have this problem.
[jira] [Commented] (SPARK-2447) Add common solution for sending upsert actions to HBase (put, deletes, and increment)
[ https://issues.apache.org/jira/browse/SPARK-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200646#comment-14200646 ] Ted Malaska commented on SPARK-2447: I totally understand. I also don't know the answer. That is why I made the GitHub project. Thankfully my employer also sees value in this project, and it will be moving to Cloudera Labs in the coming weeks. All that means is I will have more help supporting it. A blog post and more improvements will be coming in the next few weeks.
[jira] [Resolved] (SPARK-4173) EdgePartitionBuilder uses wrong value for first clustered index
[ https://issues.apache.org/jira/browse/SPARK-4173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur Dave resolved SPARK-4173. --- Resolution: Fixed Fix Version/s: 1.0.3 1.1.1 1.2.0 Issue resolved by pull request 3138 [https://github.com/apache/spark/pull/3138] EdgePartitionBuilder uses wrong value for first clustered index --- Key: SPARK-4173 URL: https://issues.apache.org/jira/browse/SPARK-4173 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.0.2, 1.1.0, 1.2.0 Reporter: Ankur Dave Assignee: Ankur Dave Fix For: 1.2.0, 1.1.1, 1.0.3 Lines 48 and 49 in EdgePartitionBuilder reference {{srcIds}} before it has been initialized, causing an incorrect value to be stored for the first cluster. https://github.com/apache/spark/blob/23468e7e96bf047ba53806352558b9d661567b23/graphx/src/main/scala/org/apache/spark/graphx/impl/EdgePartitionBuilder.scala#L48-49
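SPARK-4249 and this issue describe the same defect, which can be modeled in a few lines. The sketch below is a simplified, hypothetical Python rendering of the clustered-index construction (not the actual Scala code): the buggy variant seeds the "current source id" from a slot the JVM zero-initializes, so whenever vertex id 0 is absent from the partition the index gains a spurious entry for 0.

```python
def build_index_buggy(src_ids):
    # Mimics toEdgePartition before the fix: index.update(srcIds(0), 0)
    # runs before srcIds is filled, and a zero-initialized slot seeds
    # the "current" source id.
    index = {0: 0}          # spurious unless 0 is a real vertex id
    curr = 0                # seeded from an uninitialized (zeroed) slot
    for i, s in enumerate(src_ids):
        if s != curr:
            curr = s
            index[s] = i    # record where this source id's cluster starts
    return index

def build_index_fixed(src_ids):
    # Fixed version: derive the first cluster from the first real edge.
    index = {}
    curr = None
    for i, s in enumerate(src_ids):
        if s != curr:
            curr = s
            index[s] = i
    return index
```

When 0 is not among the source ids, the buggy index has one extra entry (`realIndexSize + 1`), matching the symptom reported in SPARK-4249; when 0 is present, the two agree, which is why the bug was easy to miss.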
[jira] [Created] (SPARK-4278) SparkSQL job failing with java.lang.ClassCastException
Parviz Deyhim created SPARK-4278: Summary: SparkSQL job failing with java.lang.ClassCastException Key: SPARK-4278 URL: https://issues.apache.org/jira/browse/SPARK-4278 Project: Spark Issue Type: Bug Components: SQL Reporter: Parviz Deyhim The following job fails with a java.lang.ClassCastException. Ideally SparkSQL should have the ability to ignore records that don't conform to the inferred schema. The steps that get me to this error: 1) infer a schema from a small subset of data 2) apply the schema to a larger dataset 3) do a simple join of two datasets sample code: {code}
val sampleJson = sqlContext.jsonRDD(sc.textFile(".../dt=2014-10-10/file.snappy"))
val mydata = sqlContext.jsonRDD(larger_dataset, sampleJson.schema)
mydata.registerTempTable("mytable1")

// other dataset:
val x = sc.textFile(...)
case class Dataset(a: String, state: String, b: String, z: String, c: String, d: String)
val xSchemaRDD = x.map(_.split("\t")).map(f => Dataset(f(0), f(1), f(2), f(3), f(4), f(5)))
xSchemaRDD.registerTempTable("mytable2")
{code}
java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Integer
scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:106)
org.apache.spark.sql.json.JsonRDD$.enforceCorrectType(JsonRDD.scala:389)
org.apache.spark.sql.json.JsonRDD$$anonfun$enforceCorrectType$1.apply(JsonRDD.scala:397)
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
scala.collection.AbstractTraversable.map(Traversable.scala:105)
org.apache.spark.sql.json.JsonRDD$.enforceCorrectType(JsonRDD.scala:397)
org.apache.spark.sql.json.JsonRDD$$anonfun$org$apache$spark$sql$json$JsonRDD$$asRow$1$$anonfun$apply$4.apply(JsonRDD.scala:410)
scala.Option.map(Option.scala:145)
org.apache.spark.sql.json.JsonRDD$$anonfun$org$apache$spark$sql$json$JsonRDD$$asRow$1.apply(JsonRDD.scala:409)
org.apache.spark.sql.json.JsonRDD$$anonfun$org$apache$spark$sql$json$JsonRDD$$asRow$1.apply(JsonRDD.scala:407)
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
org.apache.spark.sql.json.JsonRDD$.org$apache$spark$sql$json$JsonRDD$$asRow(JsonRDD.scala:407)
org.apache.spark.sql.json.JsonRDD$.enforceCorrectType(JsonRDD.scala:398)
org.apache.spark.sql.json.JsonRDD$$anonfun$org$apache$spark$sql$json$JsonRDD$$asRow$1$$anonfun$apply$4.apply(JsonRDD.scala:410)
scala.Option.map(Option.scala:145)
org.apache.spark.sql.json.JsonRDD$$anonfun$org$apache$spark$sql$json$JsonRDD$$asRow$1.apply(JsonRDD.scala:409)
org.apache.spark.sql.json.JsonRDD$$anonfun$org$apache$spark$sql$json$JsonRDD$$asRow$1.apply(JsonRDD.scala:407)
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
org.apache.spark.sql.json.JsonRDD$.org$apache$spark$sql$json$JsonRDD$$asRow(JsonRDD.scala:407)
org.apache.spark.sql.json.JsonRDD$.enforceCorrectType(JsonRDD.scala:398)
org.apache.spark.sql.json.JsonRDD$$anonfun$org$apache$spark$sql$json$JsonRDD$$asRow$1$$anonfun$apply$4.apply(JsonRDD.scala:410)
scala.Option.map(Option.scala:145)
org.apache.spark.sql.json.JsonRDD$$anonfun$org$apache$spark$sql$json$JsonRDD$$asRow$1.apply(JsonRDD.scala:409)
org.apache.spark.sql.json.JsonRDD$$anonfun$org$apache$spark$sql$json$JsonRDD$$asRow$1.apply(JsonRDD.scala:407)
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
org.apache.spark.sql.json.JsonRDD$.org$apache$spark$sql$json$JsonRDD$$asRow(JsonRDD.scala:407)
org.apache.spark.sql.json.JsonRDD$$anonfun$jsonStringToRow$1.apply(JsonRDD.scala:41)
org.apache.spark.sql.json.JsonRDD$$anonfun$jsonStringToRow$1.apply(JsonRDD.scala:41)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:389)
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:209)
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:65)
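The root cause visible in the stack trace is that jsonRDD infers the narrowest type that fits the sampled values and then applies it strictly to the full dataset. Below is a toy Python model of that failure mode (hypothetical helper names, not the actual JsonRDD code): a sample of small integers infers a 32-bit type, and a larger value in the full dataset then fails the cast, analogous to `java.lang.Long cannot be cast to java.lang.Integer`.

```python
INT32_MAX = 2**31 - 1

def infer_int_type(sample):
    """Infer the narrowest integer type that fits the sampled values,
    the way schema inference over a small subset behaves."""
    return "int" if all(abs(v) <= INT32_MAX for v in sample) else "long"

def enforce_type(value, typ):
    """Strictly apply the inferred type; a wider value fails, analogous
    to the Long -> Integer cast error in JsonRDD.enforceCorrectType."""
    if typ == "int" and abs(value) > INT32_MAX:
        raise TypeError(f"{value} cannot be cast to int")
    return value
```

Inferring from a sample that happens to include a long-valued record, or declaring the wider type up front instead of reusing `sampleJson.schema`, avoids the cast failure.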
[jira] [Commented] (SPARK-677) PySpark should not collect results through local filesystem
[ https://issues.apache.org/jira/browse/SPARK-677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200686#comment-14200686 ] Josh Rosen commented on SPARK-677: -- No, it's still an issue in 1.2.0: {code}
def collect(self):
    """
    Return a list that contains all of the elements in this RDD.
    """
    with SCCallSiteSync(self.context) as css:
        bytesInJava = self._jrdd.collect().iterator()
    return list(self._collect_iterator_through_file(bytesInJava))

def _collect_iterator_through_file(self, iterator):
    # Transferring lots of data through Py4J can be slow because
    # socket.readline() is inefficient. Instead, we'll dump the data to a
    # file and read it back.
    tempFile = NamedTemporaryFile(delete=False, dir=self.ctx._temp_dir)
    tempFile.close()
    self.ctx._writeToFile(iterator, tempFile.name)
    # Read the data into Python and deserialize it:
    with open(tempFile.name, 'rb') as tempFile:
        for item in self._jrdd_deserializer.load_stream(tempFile):
            yield item
    os.unlink(tempFile.name)
{code}
PySpark should not collect results through local filesystem --- Key: SPARK-677 URL: https://issues.apache.org/jira/browse/SPARK-677 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 0.7.0 Reporter: Josh Rosen Py4J is slow when transferring large arrays, so PySpark currently dumps data to the disk and reads it back in order to collect() RDDs. On large enough datasets, this data will spill from the buffer cache and write to the physical disk, resulting in terrible performance. Instead, we should stream the data from Java to Python over a local socket or a FIFO.
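The fix the issue proposes, streaming serialized records over a local socket instead of bouncing them through a temp file, can be sketched in pure Python. This is a hypothetical illustration only: a thread plays the JVM side, and pickle stands in for PySpark's real serializers.

```python
import pickle
import socket
import threading

def stream_collect(items):
    """Sketch of the SPARK-677 proposal: stream serialized records over
    a local socket pair instead of writing them to a temp file. Here a
    writer thread stands in for the JVM producer."""
    left, right = socket.socketpair()

    def writer():
        with left.makefile("wb") as f:
            for item in items:
                pickle.dump(item, f)     # serialize each record onto the socket
        left.close()                     # half-close so the reader sees EOF

    t = threading.Thread(target=writer)
    t.start()

    out = []
    with right.makefile("rb") as f:
        while True:
            try:
                out.append(pickle.load(f))   # deserialize until EOF
            except EOFError:
                break
    t.join()
    return out
```

The data never touches the filesystem, so nothing spills from the buffer cache to physical disk on large collects.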
[jira] [Created] (SPARK-4279) Implementing TinkerPop on top of GraphX
Brennon York created SPARK-4279: --- Summary: Implementing TinkerPop on top of GraphX Key: SPARK-4279 URL: https://issues.apache.org/jira/browse/SPARK-4279 Project: Spark Issue Type: New Feature Components: GraphX Reporter: Brennon York Priority: Minor [TinkerPop|https://github.com/tinkerpop] is a great abstraction for graph databases and has been implemented across various graph database backends. Has anyone thought about integrating the TinkerPop framework with GraphX to enable GraphX as another backend? Not sure if this has been brought up or not, but would certainly volunteer to spearhead this effort if the community thinks it to be a good idea.
[jira] [Created] (SPARK-4280) In dynamic allocation, add option to never kill executors with cached blocks
Sandy Ryza created SPARK-4280: - Summary: In dynamic allocation, add option to never kill executors with cached blocks Key: SPARK-4280 URL: https://issues.apache.org/jira/browse/SPARK-4280 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.0 Reporter: Sandy Ryza Even with the external shuffle service, this is useful in situations like Hive on Spark where a query might require caching some data. We want to be able to give back executors after the job ends, but not during the job if it would delete intermediate results.
[jira] [Commented] (SPARK-4133) PARSING_ERROR(2) when upgrading issues from 1.0.2 to 1.1.0
[ https://issues.apache.org/jira/browse/SPARK-4133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200762#comment-14200762 ] Josh Rosen commented on SPARK-4133: --- We can't remove that StreamingContext constructor since we don't want to break binary compatibility, but we can add a warning / raise an error. I'm working on a patch for this now. PARSING_ERROR(2) when upgrading issues from 1.0.2 to 1.1.0 -- Key: SPARK-4133 URL: https://issues.apache.org/jira/browse/SPARK-4133 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.1.0 Reporter: Antonio Jesus Navarro Priority: Blocker Attachments: spark_ex.logs Snappy-related problems found when trying to upgrade an existing Spark Streaming app from 1.0.2 to 1.1.0. We cannot run an existing 1.0.2 Spark app once upgraded to 1.1.0; an IOException is thrown by snappy (parsing_error(2)). {code}
Executor task launch worker-0 DEBUG storage.BlockManager - Getting local block broadcast_0
Executor task launch worker-0 DEBUG storage.BlockManager - Level for block broadcast_0 is StorageLevel(true, true, false, true, 1)
Executor task launch worker-0 DEBUG storage.BlockManager - Getting block broadcast_0 from memory
Executor task launch worker-0 DEBUG storage.BlockManager - Getting local block broadcast_0
Executor task launch worker-0 DEBUG executor.Executor - Task 0's epoch is 0
Executor task launch worker-0 DEBUG storage.BlockManager - Block broadcast_0 not registered locally
Executor task launch worker-0 INFO broadcast.TorrentBroadcast - Started reading broadcast variable 0
sparkDriver-akka.actor.default-dispatcher-4 INFO receiver.ReceiverSupervisorImpl - Registered receiver 0
Executor task launch worker-0 INFO util.RecurringTimer - Started timer for BlockGenerator at time 1414656492400
Executor task launch worker-0 INFO receiver.BlockGenerator - Started BlockGenerator
Thread-87 INFO receiver.BlockGenerator - Started block pushing thread
Executor task launch worker-0 INFO receiver.ReceiverSupervisorImpl - Starting receiver
sparkDriver-akka.actor.default-dispatcher-5 INFO scheduler.ReceiverTracker - Registered receiver for stream 0 from akka://sparkDriver
Executor task launch worker-0 INFO kafka.KafkaReceiver - Starting Kafka Consumer Stream with group: stratioStreaming
Executor task launch worker-0 INFO kafka.KafkaReceiver - Connecting to Zookeeper: node.stratio.com:2181
sparkDriver-akka.actor.default-dispatcher-2 DEBUG local.LocalActor - [actor] received message StatusUpdate(0,RUNNING,java.nio.HeapByteBuffer[pos=0 lim=0 cap=0]) from Actor[akka://sparkDriver/deadLetters]
sparkDriver-akka.actor.default-dispatcher-2 DEBUG local.LocalActor - [actor] received message StatusUpdate(0,RUNNING,java.nio.HeapByteBuffer[pos=0 lim=0 cap=0]) from Actor[akka://sparkDriver/deadLetters]
sparkDriver-akka.actor.default-dispatcher-6 DEBUG local.LocalActor - [actor] received message StatusUpdate(0,RUNNING,java.nio.HeapByteBuffer[pos=0 lim=0 cap=0]) from Actor[akka://sparkDriver/deadLetters]
sparkDriver-akka.actor.default-dispatcher-2 DEBUG local.LocalActor - [actor] handled message (8.442354 ms) StatusUpdate(0,RUNNING,java.nio.HeapByteBuffer[pos=0 lim=0 cap=0]) from Actor[akka://sparkDriver/deadLetters]
sparkDriver-akka.actor.default-dispatcher-2 DEBUG local.LocalActor - [actor] handled message (8.412421 ms) StatusUpdate(0,RUNNING,java.nio.HeapByteBuffer[pos=0 lim=0 cap=0]) from Actor[akka://sparkDriver/deadLetters]
sparkDriver-akka.actor.default-dispatcher-6 DEBUG local.LocalActor - [actor] handled message (8.385471 ms) StatusUpdate(0,RUNNING,java.nio.HeapByteBuffer[pos=0 lim=0 cap=0]) from Actor[akka://sparkDriver/deadLetters]
Executor task launch worker-0 INFO utils.VerifiableProperties - Verifying properties
Executor task launch worker-0 INFO utils.VerifiableProperties - Property group.id is overridden to stratioStreaming
Executor task launch worker-0 INFO utils.VerifiableProperties - Property zookeeper.connect is overridden to node.stratio.com:2181
Executor task launch worker-0 INFO utils.VerifiableProperties - Property zookeeper.connection.timeout.ms is overridden to 1
Executor task launch worker-0 INFO broadcast.TorrentBroadcast - Reading broadcast variable 0 took 0.033998997 s
Executor task launch worker-0 INFO consumer.ZookeeperConsumerConnector - [stratioStreaming_ajn-stratio-1414656492293-8ecb3e3a], Connecting to zookeeper instance at node.stratio.com:2181
Executor task launch worker-0 DEBUG zkclient.ZkConnection - Creating new ZookKeeper instance to connect to node.stratio.com:2181.
ZkClient-EventThread-169-node.stratio.com:2181 INFO zkclient.ZkEventThread - Starting ZkClient event thread.
{code}
[jira] [Commented] (SPARK-4280) In dynamic allocation, add option to never kill executors with cached blocks
[ https://issues.apache.org/jira/browse/SPARK-4280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200785#comment-14200785 ] Andrew Or commented on SPARK-4280: -- Hey [~sandyr], does this require the application to explicitly uncache the RDDs? My concern is that we cache some blocks behind the application's back (e.g. broadcast, streaming blocks) and we don't currently display these on the UI, in which case the application will never remove executors. It seems that some mechanism to unconditionally blow away all the blocks on an executor would be handy.
[jira] [Commented] (SPARK-4280) In dynamic allocation, add option to never kill executors with cached blocks
[ https://issues.apache.org/jira/browse/SPARK-4280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200801#comment-14200801 ] Sandy Ryza commented on SPARK-4280: --- My thinking was that it would just be based on whether the node reports it's storing blocks. So, e.g., applications wouldn't need to explicitly uncache the RDDs in situations where Spark garbage collects them because their references go away. Ideally we would make a special case for broadcast variables, because they're not unique to any node. Will look into whether there's a good way to do this.
[jira] [Created] (SPARK-4281) Yarn shuffle service jars need to include dependencies
Andrew Or created SPARK-4281: Summary: Yarn shuffle service jars need to include dependencies Key: SPARK-4281 URL: https://issues.apache.org/jira/browse/SPARK-4281 Project: Spark Issue Type: Bug Components: Build, YARN Affects Versions: 1.2.0 Reporter: Andrew Or Assignee: Patrick Wendell Priority: Blocker When we package, we only get the jars with the classes. We need to make an assembly jar for the network-yarn module that includes all of its dependencies.
[jira] [Reopened] (SPARK-4148) PySpark's sample uses the same seed for all partitions
[ https://issues.apache.org/jira/browse/SPARK-4148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reopened SPARK-4148: -- Reopening this issue because branch-1.0 is not fixed. PySpark's sample uses the same seed for all partitions -- Key: SPARK-4148 URL: https://issues.apache.org/jira/browse/SPARK-4148 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.2, 1.1.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.1.1, 1.2.0 The current way of seed distribution makes the random sequences from partition i and i+1 offset by 1. {code}
In [14]: import random
In [15]: r1 = random.Random(10)
In [16]: r1.randint(0, 1)
Out[16]: 1
In [17]: r1.random()
Out[17]: 0.4288890546751146
In [18]: r1.random()
Out[18]: 0.5780913011344704
In [19]: r2 = random.Random(10)
In [20]: r2.randint(0, 1)
Out[20]: 1
In [21]: r2.randint(0, 1)
Out[21]: 0
In [22]: r2.random()
Out[22]: 0.5780913011344704
{code}
So the second value from partition 1 is the same as the first value from partition 2.
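The overlap Xiangrui demonstrates comes from reusing one base seed and merely advancing the stream per partition; deriving an independent seed per partition removes it. The sketch below models this in Python 3 (hypothetical helper names, modeled on the JIRA description rather than the actual PySpark code; the advance step uses `random()` draws, whereas the transcript above used `randint`, which consumed a `random()` call on Python 2):

```python
import hashlib
import random

def partition_rng_old(seed, split):
    # Old scheme: every partition shares one base seed, merely advanced
    # by `split` draws, so adjacent partitions emit the same sequence
    # shifted by one value.
    r = random.Random(seed)
    for _ in range(split):
        r.random()
    return r

def partition_rng_new(seed, split):
    # Fixed scheme: derive an independent integer seed per partition by
    # hashing (seed, split), so the streams are unrelated.
    h = hashlib.sha256(f"{seed}:{split}".encode()).digest()
    return random.Random(int.from_bytes(h[:8], "big"))
```

With the old scheme, the second value from partition 1 equals the first value from partition 2, exactly the symptom in the report; with the hash-derived seeds the streams no longer overlap.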
[jira] [Updated] (SPARK-4148) PySpark's sample uses the same seed for all partitions
[ https://issues.apache.org/jira/browse/SPARK-4148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-4148: - Fix Version/s: 1.2.0
[jira] [Created] (SPARK-4282) Stopping flag in YarnClientSchedulerBackend should be volatile
Kousuke Saruta created SPARK-4282: - Summary: Stopping flag in YarnClientSchedulerBackend should be volatile Key: SPARK-4282 URL: https://issues.apache.org/jira/browse/SPARK-4282 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.0 Reporter: Kousuke Saruta In YarnClientSchedulerBackend, a variable named stopping is used as a flag, and it is accessed by multiple threads, so it should be volatile.
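The underlying pattern is a stop flag written by one thread and polled by another; on the JVM the flag needs volatile for cross-thread visibility, while in Python the idiomatic equivalent is threading.Event. A hypothetical sketch of the pattern (class and method names are illustrative, not Spark's):

```python
import threading
import time

class MonitorThread:
    """Illustrative stop-flag pattern. In Scala/Java the boolean would
    need @volatile so the monitoring thread reliably sees the write
    performed by stop() on another thread."""
    def __init__(self):
        self._stopping = threading.Event()
        self.stopped_cleanly = False

    def run(self):
        while not self._stopping.is_set():  # re-reads the flag each iteration
            time.sleep(0.005)
        self.stopped_cleanly = True

    def stop(self):
        self._stopping.set()                # write is visible to the poller

m = MonitorThread()
t = threading.Thread(target=m.run)
t.start()
m.stop()
t.join(timeout=5)
```

Without the visibility guarantee, the JVM poller may keep reading a stale `false` and spin forever, which is why a plain boolean is insufficient there.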
[jira] [Commented] (SPARK-4282) Stopping flag in YarnClientSchedulerBackend should be volatile
[ https://issues.apache.org/jira/browse/SPARK-4282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14201032#comment-14201032 ] Apache Spark commented on SPARK-4282: - User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/3143
[jira] [Commented] (SPARK-3561) Allow for pluggable execution contexts in Spark
[ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14201042#comment-14201042 ] Patrick Wendell commented on SPARK-3561: Hey [~ozhurakousky] - as I said before, I just don't think it's a good idea to make the internals of Spark execution pluggable, given our approach to API compatibility. This attempts to open up many internal APIs. One of the main reasons for this proposal was to have better elasticity on YARN. Toward that end, can you try out the new elastic scaling code for YARN (SPARK-3174)? It will ship in Spark 1.2 and be one of the main new features. It integrates nicely with YARN's native shuffle service; in fact, IIRC, the main design of the shuffle service in YARN was specifically for this purpose. In that way, I think elements of this proposal are indeed making it into Spark. In terms of breaking out the initialization of a SparkContext - that is probably a good idea (it's been discussed separately). Allow for pluggable execution contexts in Spark --- Key: SPARK-3561 URL: https://issues.apache.org/jira/browse/SPARK-3561 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.1.0 Reporter: Oleg Zhurakousky Labels: features Attachments: SPARK-3561.pdf Currently Spark provides integration with external resource managers such as Apache Hadoop YARN, Mesos, etc. Specifically in the context of YARN, the current architecture of Spark-on-YARN can be enhanced to provide significantly better utilization of cluster resources for large-scale, batch and/or ETL applications when run alongside other applications (Spark and others) and services in YARN. Proposal: The proposed approach would introduce a pluggable JobExecutionContext (trait) - a gateway and a delegate to the Hadoop execution environment - as a non-public API (@Experimental) not exposed to end users of Spark.
The trait will define 6 operations:
* hadoopFile
* newAPIHadoopFile
* broadcast
* runJob
* persist
* unpersist
Each method directly maps to the corresponding method in the current version of SparkContext. The JobExecutionContext implementation will be accessed by SparkContext via a master URL of the form execution-context:foo.bar.MyJobExecutionContext, with a default implementation containing the existing code from SparkContext, thus allowing the current (corresponding) methods of SparkContext to delegate to that implementation. An integrator will now have the option to provide a custom implementation of DefaultExecutionContext by either implementing it from scratch or extending from DefaultExecutionContext. Please see the attached design doc for more details.
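The plugin mechanism the proposal describes (select a delegate by class name parsed from a master URL, falling back to a default) can be sketched as follows. This is a hypothetical illustration using names from the JIRA description; it is not Spark's actual API, and the single runJob operation stands in for the six proposed methods.

```java
// Hypothetical sketch of the proposed pluggable execution context:
// a delegate interface resolved by reflection from a master URL of the
// form "execution-context:<class>", with a built-in default otherwise.
import java.util.List;
import java.util.function.Function;

interface JobExecutionContext {
    <T, U> U runJob(Iterable<T> data, Function<Iterable<T>, U> func);
}

class DefaultExecutionContext implements JobExecutionContext {
    @Override
    public <T, U> U runJob(Iterable<T> data, Function<Iterable<T>, U> func) {
        return func.apply(data); // default: run the function directly
    }
}

public class ExecutionContextLoader {
    static JobExecutionContext forMasterUrl(String masterUrl) throws Exception {
        String prefix = "execution-context:";
        if (masterUrl.startsWith(prefix)) {
            String className = masterUrl.substring(prefix.length());
            return (JobExecutionContext) Class.forName(className)
                    .getDeclaredConstructor().newInstance();
        }
        return new DefaultExecutionContext(); // non-plugin master URLs
    }

    public static void main(String[] args) throws Exception {
        // Resolve the default implementation through the plugin path.
        JobExecutionContext ctx = forMasterUrl("execution-context:DefaultExecutionContext");
        int sum = ctx.runJob(List.of(1, 2, 3), it -> {
            int s = 0;
            for (int x : it) s += x;
            return s;
        });
        System.out.println(sum); // 6
    }
}
```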
[jira] [Created] (SPARK-4283) Spark source code does not correctly import into eclipse
Yang Yang created SPARK-4283: Summary: Spark source code does not correctly import into eclipse Key: SPARK-4283 URL: https://issues.apache.org/jira/browse/SPARK-4283 Project: Spark Issue Type: Bug Components: Build Reporter: Yang Yang Priority: Minor When I import the Spark source into Eclipse, either via mvn eclipse:eclipse followed by importing existing general projects, or by importing existing Maven projects, Eclipse does not recognize the project as a Scala project. I am adding a new plugin so that the import works.
[jira] [Updated] (SPARK-4283) Spark source code does not correctly import into eclipse
[ https://issues.apache.org/jira/browse/SPARK-4283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Yang updated SPARK-4283: - Attachment: spark_eclipse.diff patch for all the pom files Spark source code does not correctly import into eclipse Key: SPARK-4283 URL: https://issues.apache.org/jira/browse/SPARK-4283 Project: Spark Issue Type: Bug Components: Build Reporter: Yang Yang Priority: Minor Attachments: spark_eclipse.diff When I import the Spark source into Eclipse, either via mvn eclipse:eclipse followed by importing existing general projects, or by importing existing Maven projects, Eclipse does not recognize the project as a Scala project. I am adding a new plugin so that the import works.
[jira] [Commented] (SPARK-3797) Run the shuffle service inside the YARN NodeManager as an AuxiliaryService
[ https://issues.apache.org/jira/browse/SPARK-3797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14201077#comment-14201077 ] Apache Spark commented on SPARK-3797: - User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/3144 Run the shuffle service inside the YARN NodeManager as an AuxiliaryService -- Key: SPARK-3797 URL: https://issues.apache.org/jira/browse/SPARK-3797 Project: Spark Issue Type: Sub-task Components: YARN Affects Versions: 1.1.0 Reporter: Patrick Wendell Assignee: Andrew Or Fix For: 1.2.0 It's also worth considering running the shuffle service in a YARN container beside the executor(s) on each node.
[jira] [Created] (SPARK-4284) BinaryClassificationMetrics precision-recall method names should correspond to return types
Joseph K. Bradley created SPARK-4284: Summary: BinaryClassificationMetrics precision-recall method names should correspond to return types Key: SPARK-4284 URL: https://issues.apache.org/jira/browse/SPARK-4284 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Priority: Minor BinaryClassificationMetrics has several methods which work with (recall, precision) pairs, but the method names all use the reversed order ("pr", i.e. precision-recall). This order should be fixed.
[jira] [Created] (SPARK-4286) Support External Shuffle Service with Mesos integration
Timothy Chen created SPARK-4286: --- Summary: Support External Shuffle Service with Mesos integration Key: SPARK-4286 URL: https://issues.apache.org/jira/browse/SPARK-4286 Project: Spark Issue Type: Task Reporter: Timothy Chen With the new external shuffle service added, we also need to make the Mesos integration able to launch the shuffle service and support auto-scaling executors.
[jira] [Created] (SPARK-4285) Transpose RDD[Vector] to column store for ML
Joseph K. Bradley created SPARK-4285: Summary: Transpose RDD[Vector] to column store for ML Key: SPARK-4285 URL: https://issues.apache.org/jira/browse/SPARK-4285 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Joseph K. Bradley Priority: Minor For certain ML algorithms, a column store is more efficient than a row store (which is currently used everywhere). E.g., deep decision trees can be faster to train when partitioning by features. Proposal: Provide a method with the following API (probably in util/):
```
def rowToColumnStore(data: RDD[Vector]): RDD[(Int, Vector)]
```
The input Vectors will be data rows/instances, and the output Vectors will be columns/features paired with column/feature indices. **Question**: Is it important to maintain matrix structure? That is, should output Vectors in the same partition be adjacent columns in the matrix?
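The core of the proposed rowToColumnStore is a transpose from row-major records to (column index, column values) pairs. A local, in-memory sketch of that logic (the real proposal operates on a distributed RDD[Vector]; the names here are illustrative):

```java
// Local sketch of the row-to-column transpose behind the proposed
// rowToColumnStore: rows of doubles in, (columnIndex -> column) out.
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class RowToColumnStore {
    static Map<Integer, double[]> rowToColumnStore(List<double[]> rows) {
        if (rows.isEmpty()) return Map.of();
        int numCols = rows.get(0).length;
        Map<Integer, double[]> columns = new TreeMap<>();
        for (int j = 0; j < numCols; j++) {
            double[] col = new double[rows.size()];
            for (int i = 0; i < rows.size(); i++) {
                col[i] = rows.get(i)[j]; // j-th feature of the i-th instance
            }
            columns.put(j, col);
        }
        return columns;
    }

    public static void main(String[] args) {
        List<double[]> rows = List.of(
                new double[]{1, 2, 3},
                new double[]{4, 5, 6});
        Map<Integer, double[]> cols = rowToColumnStore(rows);
        System.out.println(Arrays.toString(cols.get(0))); // [1.0, 4.0]
        System.out.println(Arrays.toString(cols.get(2))); // [3.0, 6.0]
    }
}
```

In the distributed setting the interesting part, which this sketch sidesteps, is exactly the question the issue raises: how column chunks are partitioned and whether adjacent columns land in the same partition.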
[jira] [Updated] (SPARK-4286) Support External Shuffle Service with Mesos integration
[ https://issues.apache.org/jira/browse/SPARK-4286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Chen updated SPARK-4286: Component/s: Mesos Support External Shuffle Service with Mesos integration --- Key: SPARK-4286 URL: https://issues.apache.org/jira/browse/SPARK-4286 Project: Spark Issue Type: Task Components: Mesos Reporter: Timothy Chen With the new external shuffle service added, we also need to make the Mesos integration able to launch the shuffle service and support auto-scaling executors.
[jira] [Commented] (SPARK-4286) Support External Shuffle Service with Mesos integration
[ https://issues.apache.org/jira/browse/SPARK-4286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14201312#comment-14201312 ] Timothy Chen commented on SPARK-4286: - Please assign to me, thanks. Support External Shuffle Service with Mesos integration --- Key: SPARK-4286 URL: https://issues.apache.org/jira/browse/SPARK-4286 Project: Spark Issue Type: Task Components: Mesos Reporter: Timothy Chen With the new external shuffle service added, we also need to make the Mesos integration able to launch the shuffle service and support auto-scaling executors.
[jira] [Updated] (SPARK-4285) Transpose RDD[Vector] to column store for ML
[ https://issues.apache.org/jira/browse/SPARK-4285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-4285: - Issue Type: Sub-task (was: New Feature) Parent: SPARK-3717 Transpose RDD[Vector] to column store for ML Key: SPARK-4285 URL: https://issues.apache.org/jira/browse/SPARK-4285 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Joseph K. Bradley Priority: Minor For certain ML algorithms, a column store is more efficient than a row store (which is currently used everywhere). E.g., deep decision trees can be faster to train when partitioning by features. Proposal: Provide a method with the following API (probably in util/):
```
def rowToColumnStore(data: RDD[Vector]): RDD[(Int, Vector)]
```
The input Vectors will be data rows/instances, and the output Vectors will be columns/features paired with column/feature indices. **Question**: Is it important to maintain matrix structure? That is, should output Vectors in the same partition be adjacent columns in the matrix?
[jira] [Updated] (SPARK-4285) Transpose RDD[Vector] to column store for ML
[ https://issues.apache.org/jira/browse/SPARK-4285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-4285: - Assignee: Joseph K. Bradley Transpose RDD[Vector] to column store for ML Key: SPARK-4285 URL: https://issues.apache.org/jira/browse/SPARK-4285 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Minor For certain ML algorithms, a column store is more efficient than a row store (which is currently used everywhere). E.g., deep decision trees can be faster to train when partitioning by features. Proposal: Provide a method with the following API (probably in util/):
```
def rowToColumnStore(data: RDD[Vector]): RDD[(Int, Vector)]
```
The input Vectors will be data rows/instances, and the output Vectors will be columns/features paired with column/feature indices. **Question**: Is it important to maintain matrix structure? That is, should output Vectors in the same partition be adjacent columns in the matrix?
[jira] [Updated] (SPARK-3066) Support recommendAll in matrix factorization model
[ https://issues.apache.org/jira/browse/SPARK-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3066: - Target Version/s: (was: 1.2.0) Support recommendAll in matrix factorization model -- Key: SPARK-3066 URL: https://issues.apache.org/jira/browse/SPARK-3066 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng ALS returns a matrix factorization model, which we can use to predict ratings for individual queries as well as small batches. In practice, users may want to compute top-k recommendations offline for all users. It is very expensive but a common problem. We can optimize this by: 1) collecting one side (either user or product) and broadcasting it as a matrix; 2) using level-3 BLAS to compute inner products; 3) using Utils.takeOrdered to find the top-k.
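The optimization outline above can be sketched per user. This is a hypothetical helper, not MLlib code: the product-factor matrix plays the role of the broadcast side, the inner loop stands in for the level-3 BLAS call, and a bounded min-heap stands in for Utils.takeOrdered.

```java
// Sketch of per-user top-k recommendation: score a user's factor vector
// against all product factors and keep the k highest-scoring product ids.
import java.util.Arrays;
import java.util.Comparator;
import java.util.PriorityQueue;

public class TopKRecommend {
    static int[] recommend(double[] userFactors, double[][] productFactors, int k) {
        // Min-heap over (score, productId): the smallest of the current
        // top-k sits at the head and is evicted first.
        PriorityQueue<double[]> heap =
                new PriorityQueue<>(Comparator.comparingDouble(a -> a[0]));
        for (int p = 0; p < productFactors.length; p++) {
            double score = 0;
            for (int f = 0; f < userFactors.length; f++) {
                score += userFactors[f] * productFactors[p][f]; // inner product
            }
            heap.offer(new double[]{score, p});
            if (heap.size() > k) heap.poll();
        }
        int[] top = new int[heap.size()];
        for (int i = top.length - 1; i >= 0; i--) top[i] = (int) heap.poll()[1];
        return top; // best product first
    }

    public static void main(String[] args) {
        double[] user = {1.0, 0.5};
        double[][] products = {{0.1, 0.1}, {2.0, 1.0}, {1.0, 1.0}};
        // scores: 0.15, 2.5, 1.5 -> top-2 product ids are 1 then 2
        System.out.println(Arrays.toString(recommend(user, products, 2))); // [1, 2]
    }
}
```

Doing this for all users at once is where the real cost lies; the point of the BLAS-3 suggestion is to batch these inner loops into a single matrix-matrix multiply per partition.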
[jira] [Closed] (SPARK-4277) Support external shuffle service on Worker
[ https://issues.apache.org/jira/browse/SPARK-4277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-4277. Resolution: Fixed Fix Version/s: 1.2.0 Support external shuffle service on Worker -- Key: SPARK-4277 URL: https://issues.apache.org/jira/browse/SPARK-4277 Project: Spark Issue Type: Bug Components: Deploy, Spark Core Affects Versions: 1.2.0 Reporter: Aaron Davidson Assignee: Aaron Davidson Fix For: 1.2.0 It's less useful to have an external shuffle service on the Spark Standalone Worker than on YARN or Mesos (as executor allocations tend to be more static), but it would be good to help test the code path. It would also make Spark more resilient to particular executor failures. Cool side-feature: When SPARK-4236 is fixed and integrated, the Worker will take care of cleaning up executor directories, which will mean executors terminate more quickly and that we don't leak data if the executor dies forcefully.
[jira] [Updated] (SPARK-4277) Support external shuffle service on Worker
[ https://issues.apache.org/jira/browse/SPARK-4277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4277: - Affects Version/s: 1.2.0 Support external shuffle service on Worker -- Key: SPARK-4277 URL: https://issues.apache.org/jira/browse/SPARK-4277 Project: Spark Issue Type: Bug Components: Deploy, Spark Core Affects Versions: 1.2.0 Reporter: Aaron Davidson Assignee: Aaron Davidson Fix For: 1.2.0 It's less useful to have an external shuffle service on the Spark Standalone Worker than on YARN or Mesos (as executor allocations tend to be more static), but it would be good to help test the code path. It would also make Spark more resilient to particular executor failures. Cool side-feature: When SPARK-4236 is fixed and integrated, the Worker will take care of cleaning up executor directories, which will mean executors terminate more quickly and that we don't leak data if the executor dies forcefully.
[jira] [Commented] (SPARK-2468) Netty-based block server / client module
[ https://issues.apache.org/jira/browse/SPARK-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14201378#comment-14201378 ] zzc commented on SPARK-2468: @Aaron Davidson, Thank you for your recommendation. By the way, can PR #3101 be merged into master today? @Lianhui Wang, I haven't tested it. Netty-based block server / client module Key: SPARK-2468 URL: https://issues.apache.org/jira/browse/SPARK-2468 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Reporter: Reynold Xin Assignee: Reynold Xin Priority: Critical Fix For: 1.2.0 Right now shuffle send goes through the block manager. This is inefficient because it requires loading a block from disk into a kernel buffer, then into a user-space buffer, and then back into a kernel send buffer before it reaches the NIC. It makes multiple copies of the data and context-switches between kernel and user space. It also creates unnecessary buffers in the JVM that increase GC pressure. Instead, we should use FileChannel.transferTo, which handles this in kernel space with zero-copy. See http://www.ibm.com/developerworks/library/j-zerocopy/ One potential solution is to use Netty. Spark already has a Netty-based network module implemented (org.apache.spark.network.netty). However, it lacks some functionality and is turned off by default.
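The zero-copy primitive the issue proposes is java.nio's FileChannel.transferTo. A minimal, self-contained sketch (using a file-to-file copy rather than a socket, purely for illustration): the kernel moves the bytes without round-tripping them through a user-space buffer.

```java
// Illustrative use of FileChannel.transferTo: copy a file's contents to
// another channel without pulling the data into a user-space buffer.
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ZeroCopyDemo {
    public static void main(String[] args) throws IOException {
        Path src = Files.createTempFile("block", ".dat");
        Path dst = Files.createTempFile("copy", ".dat");
        Files.write(src, "shuffle block payload".getBytes());

        try (FileChannel in = FileChannel.open(src, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(dst, StandardOpenOption.WRITE)) {
            long pos = 0, size = in.size();
            // transferTo may move fewer bytes than requested, so loop.
            while (pos < size) {
                pos += in.transferTo(pos, size - pos, out);
            }
        }
        System.out.println(new String(Files.readAllBytes(dst)));
    }
}
```

In the shuffle-send case the target channel would be a socket channel, which is exactly where transferTo can avoid the kernel/user copies and context switches described above.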
[jira] [Created] (SPARK-4287) Add TTL-based cleanup in external shuffle service
Andrew Or created SPARK-4287: Summary: Add TTL-based cleanup in external shuffle service Key: SPARK-4287 URL: https://issues.apache.org/jira/browse/SPARK-4287 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Andrew Or A problem with long-running SparkContexts using the external shuffle service is that their shuffle files may never be cleaned up. This is because the ContextCleaner may no longer have executors to go through to clean up these files (they may be killed intentionally). We should have a TTL-based timeout that does this as a backup, defaulting to something like one week.
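A TTL-based backup cleaner of the kind proposed can be sketched as a scan that deletes files whose last-modified time is older than the TTL. This is a hypothetical illustration, not Spark's actual implementation:

```java
// Sketch of a TTL-based backup cleaner: delete files in a directory whose
// last-modified time is older than a configurable TTL.
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;

public class TtlCleaner {
    static int cleanOlderThan(Path dir, long ttlMillis) throws IOException {
        long cutoff = System.currentTimeMillis() - ttlMillis;
        int deleted = 0;
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir)) {
            for (Path f : files) {
                if (Files.isRegularFile(f)
                        && Files.getLastModifiedTime(f).toMillis() < cutoff) {
                    Files.delete(f);
                    deleted++;
                }
            }
        }
        return deleted;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("shuffle");
        // One file backdated eight days, one fresh file.
        Path oldFile = Files.createFile(dir.resolve("old.data"));
        Files.setLastModifiedTime(oldFile, FileTime.fromMillis(
                System.currentTimeMillis() - 8L * 24 * 60 * 60 * 1000));
        Files.createFile(dir.resolve("fresh.data"));

        // TTL of one week, as the issue suggests for a default.
        int n = cleanOlderThan(dir, 7L * 24 * 60 * 60 * 1000);
        System.out.println(n + " file(s) deleted");
    }
}
```

A real implementation would run this periodically from the shuffle service and scope it to per-application shuffle directories.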
[jira] [Created] (SPARK-4288) Add Sparse Autoencoder algorithm to MLlib
Guoqiang Li created SPARK-4288: -- Summary: Add Sparse Autoencoder algorithm to MLlib Key: SPARK-4288 URL: https://issues.apache.org/jira/browse/SPARK-4288 Project: Spark Issue Type: Bug Components: MLlib Reporter: Guoqiang Li
[jira] [Created] (SPARK-4289) Creating an instance of Hadoop Job fails in the Spark shell when toString() is called on the instance.
Corey J. Nolet created SPARK-4289: - Summary: Creating an instance of Hadoop Job fails in the Spark shell when toString() is called on the instance. Key: SPARK-4289 URL: https://issues.apache.org/jira/browse/SPARK-4289 Project: Spark Issue Type: Bug Reporter: Corey J. Nolet This one is easy to reproduce.
```
val job = new Job(sc.hadoopConfiguration)
```
I'm not sure what the solution would be off hand as it's happening when the shell is calling toString() on the instance of Job. The problem is, because of the failure, the instance is never actually assigned to the job val.
java.lang.IllegalStateException: Job in state DEFINE instead of RUNNING
at org.apache.hadoop.mapreduce.Job.ensureState(Job.java:283)
at org.apache.hadoop.mapreduce.Job.toString(Job.java:452)
at scala.runtime.ScalaRunTime$.scala$runtime$ScalaRunTime$$inner$1(ScalaRunTime.scala:324)
at scala.runtime.ScalaRunTime$.stringOf(ScalaRunTime.scala:329)
at scala.runtime.ScalaRunTime$.replStringOf(ScalaRunTime.scala:337)
at <init>(<console>:10)
at <clinit>(<console>)
at $print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789)
at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:610)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:814)
at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:859)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:771)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:616)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:624)
at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:629)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:954)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902)
at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:997)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
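The failure mode in SPARK-4289 is a toString() that throws while the object is still in its initial state, which trips up a REPL that eagerly prints every evaluated value. A minimal reproduction with a hypothetical class (not Hadoop's Job):

```java
// Minimal reproduction of the SPARK-4289 failure mode: toString() throws
// until the object reaches RUNNING state, mimicking Hadoop's Job.
public class StatefulToString {
    enum State { DEFINE, RUNNING }
    private State state = State.DEFINE;

    @Override
    public String toString() {
        if (state != State.RUNNING) {
            throw new IllegalStateException(
                    "Job in state " + state + " instead of RUNNING");
        }
        return "Job[RUNNING]";
    }

    public static void main(String[] args) {
        StatefulToString job = new StatefulToString();
        try {
            System.out.println(job); // mimics the REPL echoing the value
        } catch (IllegalStateException e) {
            System.out.println("caught: " + e.getMessage());
        }
        job.state = State.RUNNING;
        System.out.println(job);     // safe once running
    }
}
```

In the Spark shell the exception is thrown by the REPL's own value-printing step, so the val is never bound; a common workaround is to wrap the assignment so the REPL has nothing to print.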