[jira] [Updated] (SPARK-4772) Accumulators leak memory, both temporarily and permanently

2014-12-10 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-4772:
--
Fix Version/s: 1.1.2
   1.3.0
   1.0.3

> Accumulators leak memory, both temporarily and permanently
> --
>
> Key: SPARK-4772
> URL: https://issues.apache.org/jira/browse/SPARK-4772
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Nathan Kronenfeld
>Assignee: Nathan Kronenfeld
>  Labels: accumulators, backport-needed
> Fix For: 1.0.3, 1.3.0, 1.1.2
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Accumulators.localAccums is cleared at the beginning of a task, and not at 
> the end.
> This means that any locally accumulated values hang around until another task 
> is run on that thread.
> If, for some reason, the thread dies, those values hang around indefinitely.
> This is really only a big issue with very large accumulators.
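
A hypothetical sketch (not the actual Spark patch) of the kind of cleanup this
implies: clear the thread-local accumulator map when the task finishes, rather
than when the next task starts on the same thread.

{code}
// Illustrative only: per-thread accumulator values are released as soon as the
// task completes, so nothing lingers if the thread idles or dies afterwards.
object LocalAccumulators {
  private val localAccums =
    new ThreadLocal[scala.collection.mutable.Map[Long, Any]] {
      override def initialValue() = scala.collection.mutable.Map.empty[Long, Any]
    }

  def add(id: Long, value: Any): Unit = localAccums.get()(id) = value

  def runTask[T](body: => T): T =
    try body
    finally localAccums.get().clear() // cleared at task end, not at the start of the next task
}
{code}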



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4772) Accumulators leak memory, both temporarily and permanently

2014-12-10 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-4772:
--
Labels: accumulators backport-needed  (was: accumulators)

> Accumulators leak memory, both temporarily and permanently
> --
>
> Key: SPARK-4772
> URL: https://issues.apache.org/jira/browse/SPARK-4772
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Nathan Kronenfeld
>Assignee: Nathan Kronenfeld
>  Labels: accumulators, backport-needed
> Fix For: 1.0.3, 1.3.0, 1.1.2
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Accumulators.localAccums is cleared at the beginning of a task, and not at 
> the end.
> This means that any locally accumulated values hang around until another task 
> is run on that thread.
> If, for some reason, the thread dies, those values hang around indefinitely.
> This is really only a big issue with very large accumulators.






[jira] [Commented] (SPARK-4772) Accumulators leak memory, both temporarily and permanently

2014-12-10 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14240793#comment-14240793
 ] 

Josh Rosen commented on SPARK-4772:
---

This was fixed by https://github.com/apache/spark/pull/3570, which I committed 
for 1.0.3, 1.1.2, and 1.3.0.  I've added the {{backport-needed}} tag so that we 
remember to backport this into {{branch-1.2}} after the 1.2.0 release.

> Accumulators leak memory, both temporarily and permanently
> --
>
> Key: SPARK-4772
> URL: https://issues.apache.org/jira/browse/SPARK-4772
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Nathan Kronenfeld
>Assignee: Nathan Kronenfeld
>  Labels: accumulators, backport-needed
> Fix For: 1.0.3, 1.3.0, 1.1.2
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Accumulators.localAccums is cleared at the beginning of a task, and not at 
> the end.
> This means that any locally accumulated values hang around until another task 
> is run on that thread.
> If, for some reason, the thread dies, those values hang around indefinitely.
> This is really only a big issue with very large accumulators.






[jira] [Resolved] (SPARK-4807) Add support for hadoop-2.5 + bump jets3t version

2014-12-10 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-4807.

Resolution: Not a Problem

Closing this in agreement with [~srowen]'s comment.

> Add support for hadoop-2.5 + bump jets3t version
> 
>
> Key: SPARK-4807
> URL: https://issues.apache.org/jira/browse/SPARK-4807
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 1.2.0
>Reporter: Andy Zhang
>Priority: Minor
>  Labels: build, maven
> Fix For: 1.2.0
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> Spark's top-level POM currently has no profile support for hadoop-2.5. I've 
> tested this a bit on my own, and it seems to work well. I also discovered 
> that you had to bump to the newest version of jets3t in order to avoid a 
> dependency conflict when using the Java AWS SDK.






[jira] [Commented] (SPARK-4740) Netty's network throughput is about 1/2 of NIO's in spark-perf sortByKey

2014-12-10 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14240813#comment-14240813
 ] 

Saisai Shao commented on SPARK-4740:


Thanks Aaron, we will try using a ramdisk to minimize the HDD effect, and also 
disable some cores to see if this problem still exists.
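
For reference, a hedged sketch of how the two transfer services under
comparison are selected for a run (assuming the Spark 1.2+ setting
{{spark.shuffle.blockTransferService}}):

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Switch between the Netty- and NIO-based shuffle transfer services to compare
// network throughput on the same workload.
val conf = new SparkConf()
  .setAppName("spark-perf sort-by-key")
  .set("spark.shuffle.blockTransferService", "netty") // or "nio"
val sc = new SparkContext(conf)
{code}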

> Netty's network throughput is about 1/2 of NIO's in spark-perf sortByKey
> 
>
> Key: SPARK-4740
> URL: https://issues.apache.org/jira/browse/SPARK-4740
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 1.2.0
>Reporter: Zhang, Liye
>Assignee: Reynold Xin
> Attachments: (rxin patch better executor)TestRunner  sort-by-key - 
> Thread dump for executor 3_files.zip, (rxin patch normal executor)TestRunner  
> sort-by-key - Thread dump for executor 0 _files.zip, Spark-perf Test Report 
> 16 Cores per Executor.pdf, Spark-perf Test Report.pdf, TestRunner  
> sort-by-key - Thread dump for executor 1_files (Netty-48 Cores per node).zip, 
> TestRunner  sort-by-key - Thread dump for executor 1_files (Nio-48 cores per 
> node).zip, rxin_patch-on_4_node_cluster_48CoresPerNode(Unbalance).7z
>
>
> When testing the current Spark master (1.3.0-SNAPSHOT) with spark-perf 
> (sort-by-key, aggregate-by-key, etc.), the Netty-based shuffle transferService 
> takes much longer than the NIO-based shuffle transferService. The network 
> throughput of Netty is only about half that of NIO. 
> We tested in standalone mode; the data set used for the test is 20 billion 
> records with a total size of about 400GB. The spark-perf test runs on a 4-node 
> cluster with 10G NICs, 48 CPU cores per node, and 64GB of memory per executor. 
> The number of reduce tasks is set to 1000.






[jira] [Created] (SPARK-4813) ContextWaiter didn't handle 'spurious wakeup'

2014-12-10 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-4813:
---

 Summary: ContextWaiter didn't handle 'spurious wakeup'
 Key: SPARK-4813
 URL: https://issues.apache.org/jira/browse/SPARK-4813
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Reporter: Shixiong Zhu


According to 
[javadocs|https://docs.oracle.com/javase/7/docs/api/java/lang/Object.html#wait(long)],
{quote}
A thread can also wake up without being notified, interrupted, or timing out, a 
so-called spurious wakeup. While this will rarely occur in practice, 
applications must guard against it by testing for the condition that should 
have caused the thread to be awakened, and continuing to wait if the condition 
is not satisfied. In other words, waits should always occur in loops, like this 
one:

 synchronized (obj) {
     while (<condition does not hold>)
         obj.wait(timeout);
     ... // Perform action appropriate to condition
 }
{quote}
{{wait}} should always occur in a loop.

But ContextWaiter.waitForStopOrError currently doesn't:
{code}
  def waitForStopOrError(timeout: Long = -1) = synchronized {
// If already had error, then throw it
if (error != null) {
  throw error
}

// If not already stopped, then wait
if (!stopped) {
  if (timeout < 0) wait() else wait(timeout)
  if (error != null) throw error
}
  }
{code}
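
A hedged sketch of a loop-based version (illustration only, not necessarily the
patch submitted for this issue): re-check the stop/error condition after every
wakeup and keep waiting for whatever time remains.

{code}
  def waitForStopOrError(timeout: Long = -1) = synchronized {
    if (timeout < 0) {
      // Wait indefinitely; a spurious wakeup simply re-enters the loop.
      while (error == null && !stopped) wait()
    } else {
      // Wait at most `timeout` ms in total, re-checking the condition each time.
      val deadline = System.currentTimeMillis() + timeout
      var remaining = timeout
      while (error == null && !stopped && remaining > 0) {
        wait(remaining)
        remaining = deadline - System.currentTimeMillis()
      }
    }
    if (error != null) throw error
  }
{code}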






[jira] [Commented] (SPARK-4813) ContextWaiter didn't handle 'spurious wakeup'

2014-12-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14240824#comment-14240824
 ] 

Apache Spark commented on SPARK-4813:
-

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/3661

> ContextWaiter didn't handle 'spurious wakeup'
> -
>
> Key: SPARK-4813
> URL: https://issues.apache.org/jira/browse/SPARK-4813
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Shixiong Zhu
>
> According to 
> [javadocs|https://docs.oracle.com/javase/7/docs/api/java/lang/Object.html#wait(long)],
> {quote}
> A thread can also wake up without being notified, interrupted, or timing out, 
> a so-called spurious wakeup. While this will rarely occur in practice, 
> applications must guard against it by testing for the condition that should 
> have caused the thread to be awakened, and continuing to wait if the 
> condition is not satisfied. In other words, waits should always occur in 
> loops, like this one:
>  synchronized (obj) {
>      while (<condition does not hold>)
>          obj.wait(timeout);
>      ... // Perform action appropriate to condition
>  }
> {quote}
> {{wait}} should always occur in a loop.
> But ContextWaiter.waitForStopOrError currently doesn't:
> {code}
>   def waitForStopOrError(timeout: Long = -1) = synchronized {
> // If already had error, then throw it
> if (error != null) {
>   throw error
> }
> // If not already stopped, then wait
> if (!stopped) {
>   if (timeout < 0) wait() else wait(timeout)
>   if (error != null) throw error
> }
>   }
> {code}






[jira] [Updated] (SPARK-3376) Memory-based shuffle strategy to reduce overhead of disk I/O

2014-12-10 Thread uncleGen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

uncleGen updated SPARK-3376:

Description: 
I think a memory-based shuffle can reduce some overhead of disk I/O. I just 
want to know is there any plan to do something about it. Or any suggestion 
about it. Base on the work (SPARK-2044), it is feasible to have several 
implementations of  shuffle.



  was:I think a memory-based shuffle can reduce some overhead of disk I/O. I 
just want to know is there any plan to do something about it. Or any suggestion 
about it. Base on the work (SPARK-2044), it is feasible to have several 
implementations of  shuffle.


> Memory-based shuffle strategy to reduce overhead of disk I/O
> 
>
> Key: SPARK-3376
> URL: https://issues.apache.org/jira/browse/SPARK-3376
> Project: Spark
>  Issue Type: Planned Work
>Reporter: uncleGen
>Priority: Trivial
>
> I think a memory-based shuffle can reduce some overhead of disk I/O. I just 
> want to know is there any plan to do something about it. Or any suggestion 
> about it. Base on the work (SPARK-2044), it is feasible to have several 
> implementations of  shuffle.






[jira] [Updated] (SPARK-3376) Memory-based shuffle strategy to reduce overhead of disk I/O

2014-12-10 Thread uncleGen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

uncleGen updated SPARK-3376:

Description: 
I think a memory-based shuffle can reduce some overhead of disk I/O. I just 
want to know is there any plan to do something about it. Or any suggestion 
about it. Base on the work (SPARK-2044), it is feasible to have several 
implementations of  shuffle.

| settings |  operation1  | operation2 |
|---|-||
| 1 |2 |2|

  was:
I think a memory-based shuffle can reduce some overhead of disk I/O. I just 
want to know is there any plan to do something about it. Or any suggestion 
about it. Base on the work (SPARK-2044), it is feasible to have several 
implementations of  shuffle.




> Memory-based shuffle strategy to reduce overhead of disk I/O
> 
>
> Key: SPARK-3376
> URL: https://issues.apache.org/jira/browse/SPARK-3376
> Project: Spark
>  Issue Type: Planned Work
>Reporter: uncleGen
>Priority: Trivial
>
> I think a memory-based shuffle can reduce some overhead of disk I/O. I just 
> want to know is there any plan to do something about it. Or any suggestion 
> about it. Base on the work (SPARK-2044), it is feasible to have several 
> implementations of  shuffle.
> | settings |  operation1  | operation2 |
> |---|-||
> | 1 |2 |2|






[jira] [Updated] (SPARK-3376) Memory-based shuffle strategy to reduce overhead of disk I/O

2014-12-10 Thread uncleGen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

uncleGen updated SPARK-3376:

Description: 
I think a memory-based shuffle can reduce some overhead of disk I/O. I just 
want to know is there any plan to do something about it. Or any suggestion 
about it. Base on the work (SPARK-2044), it is feasible to have several 
implementations of  shuffle.

| data size|  partitions  |  resources |
| 5131859218  |2000   |   50 executors/ 4 cores/ 4GB |
| settings   |  operation1   | operation2 |
| shuffle spill & lz4 |  rePartition+flatMap+groupByKey | rePartition + groupByKey |
|memory   |   38s   |  16s |
|sort |   45s   |  28s |
|hash |   46s   |  28s |
| | | |
|no shuffle spill & lz4 | | |
| memory |   16s | 16s |
| | | |
|shuffle spill & lzf | | |
|memory|  28s   | 27s |
|sort  |  29s   | 29s |
|hash  |  41s   | 30s |
| | | |
|no shuffle spill & lzf | | |
| memory |  15s | 16s |

  was:
I think a memory-based shuffle can reduce some overhead of disk I/O. I just 
want to know is there any plan to do something about it. Or any suggestion 
about it. Base on the work (SPARK-2044), it is feasible to have several 
implementations of  shuffle.

| settings |  operation1  | operation2 |
|---|-||
| 1 |2 |2|


> Memory-based shuffle strategy to reduce overhead of disk I/O
> 
>
> Key: SPARK-3376
> URL: https://issues.apache.org/jira/browse/SPARK-3376
> Project: Spark
>  Issue Type: Planned Work
>Reporter: uncleGen
>Priority: Trivial
>
> I think a memory-based shuffle can reduce some overhead of disk I/O. I just 
> want to know is there any plan to do something about it. Or any suggestion 
> about it. Base on the work (SPARK-2044), it is feasible to have several 
> implementations of  shuffle.
> | data size|  partitions  |  resources |
> | 5131859218  |2000   |   50 executors/ 4 cores/ 4GB |
> | settings   |  operation1   | operation2 |
> | shuffle spill & lz4 |  rePartition+flatMap+groupByKey | rePartition + groupByKey |
> |memory   |   38s   |  16s |
> |sort |   45s   |  28s |
> |hash |   46s   |  28s |
> | | | |
> |no shuffle spill & lz4 | | |
> | memory |   16s | 16s |
> | | | |
> |shuffle spill & lzf | | |
> |memory|  28s   | 27s |
> |sort  |  29s   | 29s |
> |hash  |  41s   | 30s |
> | | | |
> |no shuffle spill & lzf | | |
> | memory |  15s | 16s |






[jira] [Updated] (SPARK-3376) Memory-based shuffle strategy to reduce overhead of disk I/O

2014-12-10 Thread uncleGen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

uncleGen updated SPARK-3376:

Description: 
I think a memory-based shuffle can reduce some overhead of disk I/O. I just 
want to know is there any plan to do something about it. Or any suggestion 
about it. Base on the work (SPARK-2044), it is feasible to have several 
implementations of  shuffle.

| data size|  partitions  |  resources |
| 5131859218  |2000   |   50 executors/ 4 cores/ 4GB |

| settings   |  operation1   | operation2 |
| shuffle spill & lz4 |  rePartition+flatMap+groupByKey | rePartition + groupByKey |
|memory   |   38s   |  16s |
|sort |   45s   |  28s |
|hash |   46s   |  28s |
|no shuffle spill & lz4 | | |
| memory |   16s | 16s |
| | | |
|shuffle spill & lzf | | |
|memory|  28s   | 27s |
|sort  |  29s   | 29s |
|hash  |  41s   | 30s |
|no shuffle spill & lzf | | |
| memory |  15s | 16s |

  was:
I think a memory-based shuffle can reduce some overhead of disk I/O. I just 
want to know is there any plan to do something about it. Or any suggestion 
about it. Base on the work (SPARK-2044), it is feasible to have several 
implementations of  shuffle.

| data size|  partitions  |  resources |
| 5131859218  |2000   |   50 executors/ 4 cores/ 4GB |
| settings   |  operation1   | operation2 |
| shuffle spill & lz4 |  rePartition+flatMap+groupByKey | rePartition + groupByKey |
|memory   |   38s   |  16s |
|sort |   45s   |  28s |
|hash |   46s   |  28s |
| | | |
|no shuffle spill & lz4 | | |
| memory |   16s | 16s |
| | | |
|shuffle spill & lzf | | |
|memory|  28s   | 27s |
|sort  |  29s   | 29s |
|hash  |  41s   | 30s |
| | | |
|no shuffle spill & lzf | | |
| memory |  15s | 16s |


> Memory-based shuffle strategy to reduce overhead of disk I/O
> 
>
> Key: SPARK-3376
> URL: https://issues.apache.org/jira/browse/SPARK-3376
> Project: Spark
>  Issue Type: Planned Work
>Reporter: uncleGen
>Priority: Trivial
>
> I think a memory-based shuffle can reduce some overhead of disk I/O. I just 
> want to know is there any plan to do something about it. Or any suggestion 
> about it. Base on the work (SPARK-2044), it is feasible to have several 
> implementations of  shuffle.
> | data size|  partitions  |  resources |
> | 5131859218  |2000   |   50 executors/ 4 cores/ 4GB |
> | settings   |  operation1   | operation2 |
> | shuffle spill & lz4 |  rePartition+flatMap+groupByKey | rePartition + groupByKey |
> |memory   |   38s   |  16s |
> |sort |   45s   |  28s |
> |hash |   46s   |  28s |
> |no shuffle spill & lz4 | | |
> | memory |   16s | 16s |
> | | | |
> |shuffle spill & lzf | | |
> |memory|  28s   | 27s |
> |sort  |  29s   | 29s |
> |hash  |  41s   | 30s |
> |no shuffle spill & lzf | | |
> | memory |  15s | 16s |






[jira] [Commented] (SPARK-2075) Anonymous classes are missing from Spark distribution

2014-12-10 Thread Paul R. Brown (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14240838#comment-14240838
 ] 

Paul R. Brown commented on SPARK-2075:
--

It's not a version of Spark that's in question; it's incompatibility between 
what's published to Maven Central and what's offered for download from the 
Apache site for what is nominally the *same* version.

That said, I don't object to closing the issue, but I agree with [~pferrel] 
above that it's a documentation issue.  The perfect-world solution would be to 
publish *multiple* Spark artifacts with Maven classifiers identifying the 
expected backplane.  This issue will continue to afflict people who build Spark 
applications with Maven Central artifacts and attempt to connect to Spark 
clusters built against a different backplane.

> Anonymous classes are missing from Spark distribution
> -
>
> Key: SPARK-2075
> URL: https://issues.apache.org/jira/browse/SPARK-2075
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Spark Core
>Affects Versions: 1.0.0
>Reporter: Paul R. Brown
>Priority: Critical
>
> Running a job built against the Maven dep for 1.0.0 and the hadoop1 
> distribution produces:
> {code}
> java.lang.ClassNotFoundException:
> org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1
> {code}
> Here's what's in the Maven dep as of 1.0.0:
> {code}
> jar tvf 
> ~/.m2/repository/org/apache/spark/spark-core_2.10/1.0.0/spark-core_2.10-1.0.0.jar
>  | grep 'rdd/RDD' | grep 'saveAs'
>   1519 Mon May 26 13:57:58 PDT 2014 
> org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class
>   1560 Mon May 26 13:57:58 PDT 2014 
> org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class
> {code}
> And here's what's in the hadoop1 distribution:
> {code}
> jar tvf spark-assembly-1.0.0-hadoop1.0.4.jar| grep 'rdd/RDD' | grep 'saveAs'
> {code}
> I.e., it's not there.  It is in the hadoop2 distribution:
> {code}
> jar tvf spark-assembly-1.0.0-hadoop2.2.0.jar| grep 'rdd/RDD' | grep 'saveAs'
>   1519 Mon May 26 07:29:54 PDT 2014 
> org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class
>   1560 Mon May 26 07:29:54 PDT 2014 
> org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class
> {code}






[jira] [Updated] (SPARK-3376) Memory-based shuffle strategy to reduce overhead of disk I/O

2014-12-10 Thread uncleGen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

uncleGen updated SPARK-3376:

Description: 
I think a memory-based shuffle can reduce some overhead of disk I/O. I just 
want to know is there any plan to do something about it. Or any suggestion 
about it. Base on the work (SPARK-2044), it is feasible to have several 
implementations of  shuffle.



Following is my testing on "InMemory Shuffle"

| data size|  partitions  |  resources |
| 5131859218  |2000   |   50 executors/ 4 cores/ 4GB |

| settings   |  operation1   | operation2 |
| shuffle spill & lz4 |  repartition+flatMap+groupByKey | repartition + groupByKey |
|memory   |   38s   |  16s |
|sort |   45s   |  28s |
|hash |   46s   |  28s |
|no shuffle spill & lz4 | | |
| memory |   16s | 16s |
| | | |
|shuffle spill & lzf | | |
|memory|  28s   | 27s |
|sort  |  29s   | 29s |
|hash  |  41s   | 30s |
|no shuffle spill & lzf | | |
| memory |  15s | 16s |

  was:
I think a memory-based shuffle can reduce some overhead of disk I/O. I just 
want to know is there any plan to do something about it. Or any suggestion 
about it. Base on the work (SPARK-2044), it is feasible to have several 
implementations of  shuffle.

| data size|  partitions  |  resources |
| 5131859218  |2000   |   50 executors/ 4 cores/ 4GB |

| settings   |  operation1   | operation2 |
| shuffle spill & lz4 |  rePartition+flatMap+groupByKey | rePartition + groupByKey |
|memory   |   38s   |  16s |
|sort |   45s   |  28s |
|hash |   46s   |  28s |
|no shuffle spill & lz4 | | |
| memory |   16s | 16s |
| | | |
|shuffle spill & lzf | | |
|memory|  28s   | 27s |
|sort  |  29s   | 29s |
|hash  |  41s   | 30s |
|no shuffle spill & lzf | | |
| memory |  15s | 16s |


> Memory-based shuffle strategy to reduce overhead of disk I/O
> 
>
> Key: SPARK-3376
> URL: https://issues.apache.org/jira/browse/SPARK-3376
> Project: Spark
>  Issue Type: Planned Work
>Reporter: uncleGen
>Priority: Trivial
>
> I think a memory-based shuffle can reduce some overhead of disk I/O. I just 
> want to know is there any plan to do something about it. Or any suggestion 
> about it. Base on the work (SPARK-2044), it is feasible to have several 
> implementations of  shuffle.
> Following is my testing on "InMemory Shuffle"
> | data size|  partitions  |  resources |
> | 5131859218  |2000   |   50 executors/ 4 cores/ 4GB |
> | settings   |  operation1   | operation2 |
> | shuffle spill & lz4 |  repartition+flatMap+groupByKey | repartition + groupByKey |
> |memory   |   38s   |  16s |
> |sort |   45s   |  28s |
> |hash |   46s   |  28s |
> |no shuffle spill & lz4 | | |
> | memory |   16s | 16s |
> | | | |
> |shuffle spill & lzf | | |
> |memory|  28s   | 27s |
> |sort  |  29s   | 29s |
> |hash  |  41s   | 30s |
> |no shuffle spill & lzf | | |
> | memory |  15s | 16s |






[jira] [Updated] (SPARK-3376) Memory-based shuffle strategy to reduce overhead of disk I/O

2014-12-10 Thread uncleGen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

uncleGen updated SPARK-3376:

Description: 
I think a memory-based shuffle can reduce some overhead of disk I/O. I just 
want to know is there any plan to do something about it. Or any suggestion 
about it. Base on the work (SPARK-2044), it is feasible to have several 
implementations of  shuffle.





  was:
I think a memory-based shuffle can reduce some overhead of disk I/O. I just 
want to know is there any plan to do something about it. Or any suggestion 
about it. Base on the work (SPARK-2044), it is feasible to have several 
implementations of  shuffle.



Following is my testing on "InMemory Shuffle"

| data size|  partitions  |  resources |
| 5131859218  |2000   |   50 executors/ 4 cores/ 4GB |

| settings   |  operation1   | operation2 |
| shuffle spill & lz4 |  repartition+flatMap+groupByKey | repartition + groupByKey |
|memory   |   38s   |  16s |
|sort |   45s   |  28s |
|hash |   46s   |  28s |
|no shuffle spill & lz4 | | |
| memory |   16s | 16s |
| | | |
|shuffle spill & lzf | | |
|memory|  28s   | 27s |
|sort  |  29s   | 29s |
|hash  |  41s   | 30s |
|no shuffle spill & lzf | | |
| memory |  15s | 16s |


> Memory-based shuffle strategy to reduce overhead of disk I/O
> 
>
> Key: SPARK-3376
> URL: https://issues.apache.org/jira/browse/SPARK-3376
> Project: Spark
>  Issue Type: Planned Work
>Reporter: uncleGen
>Priority: Trivial
>
> I think a memory-based shuffle can reduce some overhead of disk I/O. I just 
> want to know is there any plan to do something about it. Or any suggestion 
> about it. Base on the work (SPARK-2044), it is feasible to have several 
> implementations of  shuffle.






[jira] [Reopened] (SPARK-2075) Anonymous classes are missing from Spark distribution

2014-12-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reopened SPARK-2075:
--

Hm on re-reading more closely, you are right, this is not attributable to 
version mismatch per se. I think it's better to leave it open even if I don't 
know of any action here. Sorry for the noise.

The same thing appears to occur in 1.1.1. Maybe this goes away, accidentally, 
when people all use Hadoop 2.x.

I don't think there will be multiple versions deployed to Maven as I think the 
theory is that the artifacts only exist for the API, and not whatever other 
libs they happen to link against. This is a leak in that theory though.

> Anonymous classes are missing from Spark distribution
> -
>
> Key: SPARK-2075
> URL: https://issues.apache.org/jira/browse/SPARK-2075
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Spark Core
>Affects Versions: 1.0.0
>Reporter: Paul R. Brown
>Priority: Critical
>
> Running a job built against the Maven dep for 1.0.0 and the hadoop1 
> distribution produces:
> {code}
> java.lang.ClassNotFoundException:
> org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1
> {code}
> Here's what's in the Maven dep as of 1.0.0:
> {code}
> jar tvf 
> ~/.m2/repository/org/apache/spark/spark-core_2.10/1.0.0/spark-core_2.10-1.0.0.jar
>  | grep 'rdd/RDD' | grep 'saveAs'
>   1519 Mon May 26 13:57:58 PDT 2014 
> org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class
>   1560 Mon May 26 13:57:58 PDT 2014 
> org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class
> {code}
> And here's what's in the hadoop1 distribution:
> {code}
> jar tvf spark-assembly-1.0.0-hadoop1.0.4.jar| grep 'rdd/RDD' | grep 'saveAs'
> {code}
> I.e., it's not there.  It is in the hadoop2 distribution:
> {code}
> jar tvf spark-assembly-1.0.0-hadoop2.2.0.jar| grep 'rdd/RDD' | grep 'saveAs'
>   1519 Mon May 26 07:29:54 PDT 2014 
> org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class
>   1560 Mon May 26 07:29:54 PDT 2014 
> org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class
> {code}






[jira] [Created] (SPARK-4814) Enable assertions in SBT, Maven tests

2014-12-10 Thread Sean Owen (JIRA)
Sean Owen created SPARK-4814:


 Summary: Enable assertions in SBT, Maven tests
 Key: SPARK-4814
 URL: https://issues.apache.org/jira/browse/SPARK-4814
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Sean Owen


Follow up to SPARK-4159, wherein we noticed that Java tests weren't running in 
Maven, in part because a Java test actually fails with {{AssertionError}}. That 
code/test was fixed in SPARK-4850.

The reason it wasn't caught by SBT tests was that they don't run with 
assertions on, and Maven's surefire does.

Turning on assertions in the SBT build is trivial, adding one line:

{code}
javaOptions in Test += "-ea",
{code}
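
(A hedged aside: in sbt, {{javaOptions}} only apply to forked test JVMs, so a
fuller sketch of the same change, in the sbt 0.13 syntax used above, is:)

{code}
// Enable JVM and system assertions in forked test JVMs.
fork in Test := true,
javaOptions in Test ++= Seq("-ea", "-esa"),
{code}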

This reveals a test failure in Scala test suites though:

{code}
[info] - alter_merge_2 *** FAILED *** (1 second, 305 milliseconds)
[info]   Failed to execute query using catalyst:
[info]   Error: Job aborted due to stage failure: Task 1 in stage 551.0 failed 
1 times, most recent failure: Lost task 1.0 in stage 551.0 (TID 1532, 
localhost): java.lang.AssertionError
[info]  at 
org.apache.hadoop.hive.serde2.lazybinary.LazyBinaryInteger.init(LazyBinaryInteger.java:51)
[info]  at 
org.apache.hadoop.hive.serde2.columnar.ColumnarStructBase$FieldInfo.uncheckedGetField(ColumnarStructBase.java:110)
[info]  at 
org.apache.hadoop.hive.serde2.columnar.ColumnarStructBase.getField(ColumnarStructBase.java:171)
[info]  at 
org.apache.hadoop.hive.serde2.objectinspector.ColumnarStructObjectInspector.getStructFieldData(ColumnarStructObjectInspector.java:166)
[info]  at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$1.apply(TableReader.scala:318)
[info]  at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$1.apply(TableReader.scala:314)
[info]  at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
[info]  at 
org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:132)
[info]  at 
org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:128)
[info]  at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:615)
[info]  at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:615)
[info]  at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
[info]  at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:264)
[info]  at org.apache.spark.rdd.RDD.iterator(RDD.scala:231)
[info]  at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
[info]  at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:264)
[info]  at org.apache.spark.rdd.RDD.iterator(RDD.scala:231)
[info]  at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
[info]  at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
[info]  at org.apache.spark.scheduler.Task.run(Task.scala:56)
[info]  at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:195)
[info]  at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
[info]  at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
[info]  at java.lang.Thread.run(Thread.java:745)
{code}

The items for this JIRA are therefore:

- Enable assertions in SBT
- Fix this failure
- Figure out why Maven scalatest didn't trigger it - may need assertions 
explicitly turned on too.






[jira] [Created] (SPARK-4815) ThriftServer use only one SessionState to run sql using hive

2014-12-10 Thread guowei (JIRA)
guowei created SPARK-4815:
-

 Summary: ThriftServer use only one SessionState to run sql using 
hive 
 Key: SPARK-4815
 URL: https://issues.apache.org/jira/browse/SPARK-4815
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: guowei


ThriftServer uses only one SessionState to run SQL through Hive, even though 
the statements come from different Hive sessions.
This causes mistakes:
For example, if one user runs "use database" in one beeline client, the current 
database changes in the other beeline clients too.
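
A hypothetical sketch of the direction a fix could take (illustrative names
only, not Hive or Spark API): keep one state object per client session instead
of a single shared one.

{code}
// Illustrative only: a registry that hands each session its own state object,
// so "use database" in one session cannot leak into another.
class PerSessionState[S](create: () => S) {
  private val states = scala.collection.mutable.Map.empty[String, S]

  def forSession(sessionId: String): S = synchronized {
    states.getOrElseUpdate(sessionId, create())
  }

  def close(sessionId: String): Unit = synchronized { states.remove(sessionId) }
}
{code}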






[jira] [Created] (SPARK-4816) Maven profile netlib-lgpl does not work

2014-12-10 Thread Guillaume Pitel (JIRA)
Guillaume Pitel created SPARK-4816:
--

 Summary: Maven profile netlib-lgpl does not work
 Key: SPARK-4816
 URL: https://issues.apache.org/jira/browse/SPARK-4816
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.1.0
 Environment: maven 3.0.5 / Ubuntu
Reporter: Guillaume Pitel
Priority: Minor


When doing what the documentation recommends to recompile Spark with the 
netlib native system binding (i.e. to bind with OpenBLAS or, in my case, MKL):

mvn -Pnetlib-lgpl -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests clean package

the resulting assembly jar still lacked the netlib-system class. (I checked the 
content of spark-assembly...jar.)

When forcing the netlib-lgpl profile in the MLlib module to be active, the jar 
is correctly built.

So I guess it's a problem with the way Maven passes profile activations to 
child modules.

Also, despite the documentation claiming that it should work if the job's jar 
contains netlib with the necessary bindings, it does not. The classloader must 
be unhappy with two occurrences of netlib?






[jira] [Updated] (SPARK-3376) Memory-based shuffle strategy to reduce overhead of disk I/O

2014-12-10 Thread uncleGen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

uncleGen updated SPARK-3376:

Description: 
I think a memory-based shuffle can reduce some overhead of disk I/O. I just 
want to know whether there is any plan to do something about it, or any 
suggestion about it. Based on the work (SPARK-2044), it is feasible to have 
several implementations of shuffle.

Currently, there are two implementations of shuffle manager, i.e. SORT and 
HASH. Both of them use disk in some stages. For example, on the map side, all 
the intermediate data is written into temporary files; on the reduce side, 
Spark sometimes uses an external sort. In either case, disk I/O brings some 
performance loss. Maybe we can provide a pure-memory shuffle manager, in which 
intermediate data only goes through memory. In some scenarios, it can improve 
performance. Experimentally, I implemented an in-memory shuffle manager upon 
[SPARK-2044](https://issues.apache.org/jira/browse/SPARK-2044). Following is my 
testing result:

| data size|  partitions  |  resources |
| 5131859218  |2000   |   50 executors/ 4 cores/ 4GB |

| settings   |  operation1   | operation2 |
| shuffle spill & lz4 |  repartition+flatMap+groupByKey | repartition + groupByKey |
|memory   |   38s   |  16s |
|sort |   45s   |  28s |
|hash |   46s   |  28s |
|no shuffle spill & lz4 | | |
| memory |   16s | 16s |
| | | |
|shuffle spill & lzf | | |
|memory|  28s   | 27s |
|sort  |  29s   | 29s |
|hash  |  41s   | 30s |
|no shuffle spill & lzf | | |
| memory |  15s | 16s |

In my implementation, I simply reused the "BlockManager" on the map side and 
set "spark.shuffle.spill" to false on the reduce side. All the intermediate 
data is cached in the memory store. Future work includes, but is not limited to:

- memory usage management in "InMemory Shuffle" mode
- data management when intermediate data can not fit in memory
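
As a hedged illustration of how such an implementation would be wired in
(SPARK-2044 made the shuffle manager pluggable; the class name below is
hypothetical, not an existing Spark class):

{code}
import org.apache.spark.SparkConf

// Select a custom shuffle manager by fully-qualified class name and keep all
// intermediate data in memory by disabling shuffle spill on the reduce side.
val conf = new SparkConf()
  .set("spark.shuffle.manager", "org.apache.spark.shuffle.inmemory.InMemoryShuffleManager") // hypothetical class
  .set("spark.shuffle.spill", "false")
{code}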



  was:
I think a memory-based shuffle can reduce some overhead of disk I/O. I just 
want to know is there any plan to do something about it. Or any suggestion 
about it. Base on the work (SPARK-2044), it is feasible to have several 
implementations of  shuffle.






> Memory-based shuffle strategy to reduce overhead of disk I/O
> 
>
> Key: SPARK-3376
> URL: https://issues.apache.org/jira/browse/SPARK-3376
> Project: Spark
>  Issue Type: Planned Work
>Reporter: uncleGen
>Priority: Trivial
>
> I think a memory-based shuffle can reduce some overhead of disk I/O. I just 
> want to know is there any plan to do something about it. Or any suggestion 
> about it. Base on the work (SPARK-2044), it is feasible to have several 
> implementations of  shuffle.
> Currently, there are two implementions of shuffle manager, i.e. SORT and 
> HASH. Both of them will use disk in some stages. For examples, in the map 
> side, all the intermediate data will be written into temporary files. In the 
> reduce side, Spark will use external sort sometimes. In any case, disk I/O 
> will bring some performance loss. Maybe,we can provide a pure-memory shuffle 
> manager. In this shuffle manager, intermediate data will only go through 
> memory. In some of scenes, it can improve performance. Experimentally, I 
> implemented a in-memory shuffle manager upon 
> [SPARK-2044](https://issues.apache.org/jira/browse/SPARK-2044). Following is 
> my testing result:
> | data size|  partitions  |  resources |
> | 5131859218  |2000   |   50 executors/ 4 cores/ 4GB |
> | settings   |  operation1   | operation2 |
> | shuffle spill & lz4 |  repartition+flatMap+groupByKey | repartition + groupByKey |
> |memory   |   38s   |  16s |
> |sort |   45s   |  28s |
> |hash |   46s   |  28s |
> |no shuffle spill & lz4 | | |
> | memory |   16s | 16s |
> | | | |
> |shuffle spill & lzf | | |
> |memory|  28s   | 27s |
> |sort  |  29s   | 29s |
> |hash  |  41s   | 30s |
> |no shuffle spill & lzf | | |
> | memory |  15s | 16s |
> In my implementation, I simply reused the "BlockManager" in the map-side and 
> set the "spark.shuffle.spill" false in the reduce-side. All the intermediate 
> data is cached in memory store. Future work include and not only:
> - memory usage management in "InMemory Shuffle" mode
> - data management when intermediate data can not fit in memory




[jira] [Updated] (SPARK-4817) [streaming]Print the specified number of data and handle all of the elements in RDD

2014-12-10 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-4817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

宿荣全 updated SPARK-4817:
---
Summary: [streaming]Print the specified number of data and handle all of 
the elements in RDD  (was: Print the specified number of data and handle all of 
the elements in RDD)

> [streaming]Print the specified number of data and handle all of the elements 
> in RDD
> ---
>
> Key: SPARK-4817
> URL: https://issues.apache.org/jira/browse/SPARK-4817
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: 宿荣全
>Priority: Minor
>
> DStream.print prints 10 elements but handles 11 elements.
> A new function based on DStream.print is proposed: print a specified number 
> of records while still handling all of the elements in the RDD.
> A typical work scene:
> val dstream = stream.map->filter->mapPartitions->print
> The data remaining after the filter needs to update a database in 
> mapPartitions, but not every record needs to be printed; only the top 20 need 
> to be printed to check the data processing.






[jira] [Created] (SPARK-4817) Print the specified number of data and handle all of the elements in RDD

2014-12-10 Thread JIRA
宿荣全 created SPARK-4817:
--

 Summary: Print the specified number of data and handle all of the 
elements in RDD
 Key: SPARK-4817
 URL: https://issues.apache.org/jira/browse/SPARK-4817
 Project: Spark
  Issue Type: New Feature
  Components: Streaming
Reporter: 宿荣全
Priority: Minor


DStream.print prints 10 elements but handles 11 elements.
A new function based on DStream.print is proposed: print a specified number of 
records while still handling all of the elements in the RDD.
A typical work scene:
val dstream = stream.map->filter->mapPartitions->print
The data remaining after the filter needs to update a database in mapPartitions, 
but not every record needs to be printed; only the top 20 need to be printed to 
check the data processing.
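
A hedged sketch of what such a printTop-style helper could look like
(illustrative, not the submitted patch): force the whole batch to be evaluated
once, then print only the first {{num}} records.

{code}
import org.apache.spark.streaming.dstream.DStream

// Illustrative only: evaluate every element of each batch (so mapPartitions
// side effects such as database updates run for all records), but print only
// the first `num` records.
def printTop[T](stream: DStream[T], num: Int): Unit = {
  stream.foreachRDD { rdd =>
    val cached = rdd.cache()
    cached.count()                      // processes the full RDD exactly once
    cached.take(num).foreach(println)   // served from the cache, no recompute
    cached.unpersist()
  }
}
{code}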






[jira] [Resolved] (SPARK-4817) [streaming]Print the specified number of data and handle all of the elements in RDD

2014-12-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-4817.
--
Resolution: Duplicate

> [streaming]Print the specified number of data and handle all of the elements 
> in RDD
> ---
>
> Key: SPARK-4817
> URL: https://issues.apache.org/jira/browse/SPARK-4817
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: 宿荣全
>Priority: Minor
>
> DStream.print prints 10 elements but handles 11 elements.
> A new function based on DStream.print is proposed: print a specified number 
> of records while still handling all of the elements in the RDD.
> A typical work scene:
> val dstream = stream.map->filter->mapPartitions->print
> The data remaining after the filter needs to update a database in 
> mapPartitions, but not every record needs to be printed; only the top 20 need 
> to be printed to check the data processing.






[jira] [Updated] (SPARK-3376) Memory-based shuffle strategy to reduce overhead of disk I/O

2014-12-10 Thread uncleGen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

uncleGen updated SPARK-3376:

Description: 
I think a memory-based shuffle can reduce some overhead of disk I/O. I just 
want to know whether there is any plan to do something about it, or any 
suggestion about it. Based on the work (SPARK-2044), it is feasible to have 
several implementations of shuffle.

Currently, there are two implementations of shuffle manager, i.e. SORT and 
HASH. Both of them use disk in some stages. For example, on the map side, all 
the intermediate data is written into temporary files; on the reduce side, 
Spark sometimes uses an external sort. In either case, disk I/O brings some 
performance loss. Maybe we can provide a pure-memory shuffle manager, in which 
intermediate data only goes through memory. In some scenarios, it can improve 
performance. Experimentally, I implemented an in-memory shuffle manager upon 
[SPARK-2044](https://issues.apache.org/jira/browse/SPARK-2044). Following is my 
testing result:

| data size|  partitions  |  resources |
| 5131859218  |2000   |   50 executors/ 4 cores/ 4GB |

| settings   |  operation1   | operation2 |
| shuffle spill & lz4 |  repartition+flatMap+groupByKey | repartition + groupByKey |
|memory   |   38s   |  16s |
|sort |   45s   |  28s |
|hash |   46s   |  28s |
|no shuffle spill & lz4 | | |
| memory |   16s | 16s |
| | | |
|shuffle spill & lzf | | |
|memory|  28s   | 27s |
|sort  |  29s   | 29s |
|hash  |  41s   | 30s |
|no shuffle spill & lzf | | |
| memory |  15s | 16s |

In my implementation, I simply reused the "BlockManager" on the map side and 
set "spark.shuffle.spill" to false on the reduce side. All the intermediate 
data is cached in the memory store. Future work includes, but is not limited to:

- memory usage management in "InMemory Shuffle" mode
- data management when intermediate data can not fit in memory

Test code:

```
val conf = new SparkConf().setAppName("InMemoryShuffleTest")
val sc = new SparkContext(conf)

val dataPath = args(0)
val partitions = args(1).toInt

val rdd1 = sc.textFile(dataPath).cache()
rdd1.count()
val startTime = System.currentTimeMillis()
val rdd2 = rdd1.repartition(partitions)
  .flatMap(_.split(",")).map(s => (s, s))
  .groupBy(e => e._1)
rdd2.count()
val endTime = System.currentTimeMillis()

println("time: " + (endTime - startTime) / 1000 )
```



  was:
I think a memory-based shuffle can reduce some overhead of disk I/O. I just 
want to know is there any plan to do something about it. Or any suggestion 
about it. Base on the work (SPARK-2044), it is feasible to have several 
implementations of  shuffle.

Currently, there are two implementions of shuffle manager, i.e. SORT and HASH. 
Both of them will use disk in some stages. For examples, in the map side, all 
the intermediate data will be written into temporary files. In the reduce side, 
Spark will use external sort sometimes. In any case, disk I/O will bring some 
performance loss. Maybe,we can provide a pure-memory shuffle manager. In this 
shuffle manager, intermediate data will only go through memory. In some of 
scenes, it can improve performance. Experimentally, I implemented a in-memory 
shuffle manager upon 
[SPARK-2044](https://issues.apache.org/jira/browse/SPARK-2044). Following is my 
testing result:

| data size|  partitions  |  resources |
| 5131859218  |2000   |   50 executors/ 4 cores/ 4GB |

| settings   |  operation1   | operation2 |
| shuffle spill & lz4 |  repartition+flatMap+groupByKey | repartition + groupByKey |
|memory   |   38s   |  16s |
|sort |   45s   |  28s |
|hash |   46s   |  28s |
|no shuffle spill & lz4 | | |
| memory |   16s | 16s |
| | | |
|shuffle spill & lzf | | |
|memory|  28s   | 27s |
|sort  |  29s   | 29s |
|hash  |  41s   | 30s |
|no shuffle spill & lzf | | |
| memory |  15s | 16s |

In my implementation, I simply reused the "BlockManager" in the map-side and 
set the "spark.shuffle.spill" false in the reduce-side. All the intermediate 
data is cached in memory store. Future work include and not only:

- memory usage management in "InMemory Shuffle" mode
- data management when intermediate data can not fit in memory




> Memory-based shuffle strategy to reduce overhead of disk I/O
> 
>
> Key: SPARK-3376
>

[jira] [Commented] (SPARK-4817) [streaming]Print the specified number of data and handle all of the elements in RDD

2014-12-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14240928#comment-14240928
 ] 

Apache Spark commented on SPARK-4817:
-

User 'surq' has created a pull request for this issue:
https://github.com/apache/spark/pull/3662

> [streaming]Print the specified number of data and handle all of the elements 
> in RDD
> ---
>
> Key: SPARK-4817
> URL: https://issues.apache.org/jira/browse/SPARK-4817
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: 宿荣全
>Priority: Minor
>
> DStream.print prints 10 elements but handles 11 elements.
> A new function based on DStream.print is proposed: print a specified number 
> of records while still handling all of the elements in the RDD.
> A typical work scene:
> val dstream = stream.map->filter->mapPartitions->print
> The data remaining after the filter needs to update a database in 
> mapPartitions, but not every record needs to be printed; only the top 20 need 
> to be printed to check the data processing.






[jira] [Commented] (SPARK-4815) ThriftServer use only one SessionState to run sql using hive

2014-12-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14240930#comment-14240930
 ] 

Apache Spark commented on SPARK-4815:
-

User 'guowei2' has created a pull request for this issue:
https://github.com/apache/spark/pull/3663

> ThriftServer use only one SessionState to run sql using hive 
> -
>
> Key: SPARK-4815
> URL: https://issues.apache.org/jira/browse/SPARK-4815
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: guowei
>
> ThriftServer uses only one SessionState to run SQL through Hive, even though 
> the statements come from different Hive sessions.
> This causes mistakes:
> For example, if one user runs "use database" in one beeline client, the 
> current database changes in the other beeline clients too.






[jira] [Commented] (SPARK-4816) Maven profile netlib-lgpl does not work

2014-12-10 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14240933#comment-14240933
 ] 

Sean Owen commented on SPARK-4816:
--

No, you can definitely see that the profile is activated in the child module if 
you try {{mvn -Pnetlib-lgpl dependency:tree}} and see that 
{{com.github.fommil.netlib:all}} is there.

If I build and {{grep netlib}} on the assembly, I see loads of netlib items, 
including the expected native libraries. Earlier you showed that you also 
grepped for "Native" but I don't see why this is to be expected and there is no 
entry with that name.

I don't think there's an issue then?

{code}
...
org/netlib/arpack/Sstatn.class
org/netlib/arpack/Sstats.class
org/netlib/arpack/Sstqrb.class
netlib-native_ref-osx-x86_64.jnilib
netlib-native_ref-osx-x86_64.jnilib.asc
netlib-native_ref-osx-x86_64.pom
netlib-native_ref-osx-x86_64.pom.asc
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-osx-x86_64/
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-osx-x86_64/pom.xml
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-osx-x86_64/pom.properties
com/github/fommil/netlib/NativeRefARPACK.class
com/github/fommil/netlib/NativeRefBLAS.class
com/github/fommil/netlib/NativeRefLAPACK.class
...
{code}

> Maven profile netlib-lgpl does not work
> ---
>
> Key: SPARK-4816
> URL: https://issues.apache.org/jira/browse/SPARK-4816
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.1.0
> Environment: maven 3.0.5 / Ubuntu
>Reporter: Guillaume Pitel
>Priority: Minor
>
> When doing what the documentation recommends to recompile Spark with the 
> netlib native system binding (i.e. to bind with OpenBLAS or, in my case, MKL):
> mvn -Pnetlib-lgpl -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests clean package
> the resulting assembly jar still lacked the netlib-system class. (I checked 
> the content of spark-assembly...jar.)
> When forcing the netlib-lgpl profile in the MLlib module to be active, the jar 
> is correctly built.
> So I guess it's a problem with the way Maven passes profile activations to 
> child modules.
> Also, despite the documentation claiming that it should work if the job's jar 
> contains netlib with the necessary bindings, it does not. The classloader 
> must be unhappy with two occurrences of netlib?






[jira] [Commented] (SPARK-4817) [streaming]Print the specified number of data and handle all of the elements in RDD

2014-12-10 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-4817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14240934#comment-14240934
 ] 

宿荣全 commented on SPARK-4817:


[~srowen]
I think that this modification is not the same as [SPARK-3325].
[SPARK-3325]:
Print the specified number of records only.

[SPARK-4817]:
printTop(num)
1. Print the specified number of records only.
2. Handle all of the elements in the RDD.

> [streaming]Print the specified number of data and handle all of the elements 
> in RDD
> ---
>
> Key: SPARK-4817
> URL: https://issues.apache.org/jira/browse/SPARK-4817
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: 宿荣全
>Priority: Minor
>
> DStream.print prints 10 elements but handles 11 elements.
> A new function based on DStream.print is proposed: print a specified number 
> of records while still handling all of the elements in the RDD.
> A typical work scene:
> val dstream = stream.map->filter->mapPartitions->print
> The data remaining after the filter needs to update a database in 
> mapPartitions, but not every record needs to be printed; only the top 20 need 
> to be printed to check the data processing.






[jira] [Updated] (SPARK-3376) Memory-based shuffle strategy to reduce overhead of disk I/O

2014-12-10 Thread uncleGen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

uncleGen updated SPARK-3376:

Description: 
I think a memory-based shuffle can reduce some overhead of disk I/O. I just 
want to know whether there is any plan to do something about it, or any 
suggestion about it. Based on the work (SPARK-2044), it is feasible to have 
several implementations of shuffle.

Currently, there are two implementations of shuffle manager, i.e. SORT and 
HASH. Both of them use disk in some stages. For example, on the map side, all 
the intermediate data is written into temporary files; on the reduce side, 
Spark sometimes uses an external sort. In either case, disk I/O brings some 
performance loss. Maybe we can provide a pure-memory shuffle manager, in which 
intermediate data only goes through memory. In some scenarios, it can improve 
performance. Experimentally, I implemented an in-memory shuffle manager upon 
[SPARK-2044](https://issues.apache.org/jira/browse/SPARK-2044). Following is my 
testing result:

| data size|  partitions  |  resources |
| 5131859218  |2000   |   50 executors/ 4 cores/ 4GB |

| settings   |  operation1   | operation2 |
| shuffle spill & lz4 |  repartition+flatMap+groupByKey | repartition + groupByKey |
|memory   |   38s   |  16s |
|sort |   45s   |  28s |
|hash |   46s   |  28s |
|no shuffle spill & lz4 | | |
| memory |   16s | 16s |
| | | |
|shuffle spill & lzf | | |
|memory|  28s   | 27s |
|sort  |  29s   | 29s |
|hash  |  41s   | 30s |
|no shuffle spill & lzf | | |
| memory |  15s | 16s |

In my implementation, I simply reused the "BlockManager" on the map side and 
set "spark.shuffle.spill" to false on the reduce side. All the intermediate 
data is cached in the memory store. Future work includes, but is not limited to:

- memory usage management in "InMemory Shuffle" mode
- data management when intermediate data can not fit in memory
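
As a minimal sketch of how an application might opt into such a shuffle
implementation (the InMemoryShuffleManager class name is hypothetical and only
for illustration; in Spark 1.x, spark.shuffle.manager can name either a built-in
manager or a ShuffleManager implementation class, and spark.shuffle.spill is an
existing setting):

```
import org.apache.spark.{SparkConf, SparkContext}

// Select a (hypothetical) in-memory shuffle manager built on the pluggable
// ShuffleManager interface from SPARK-2044, and disable reduce-side spilling
// so that intermediate data stays in the memory store.
val conf = new SparkConf()
  .setAppName("InMemoryShuffleTest")
  .set("spark.shuffle.manager",
       "org.apache.spark.shuffle.inmemory.InMemoryShuffleManager")
  .set("spark.shuffle.spill", "false")
val sc = new SparkContext(conf)
```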

Test code:

```
val conf = new SparkConf().setAppName("InMemoryShuffleTest")
val sc = new SparkContext(conf)

val dataPath = args(0)
val partitions = args(1).toInt

val rdd1 = sc.textFile(dataPath).cache()
rdd1.count()
val startTime = System.currentTimeMillis()
val rdd2 = rdd1.repartition(partitions)
  .flatMap(_.split(",")).map(s => (s, s))
  .groupBy(e => e._1)
rdd2.count()
val endTime = System.currentTimeMillis()

println("time: " + (endTime - startTime) / 1000 )

```



  was:
I think a memory-based shuffle can reduce some overhead of disk I/O. I just 
want to know is there any plan to do something about it. Or any suggestion 
about it. Base on the work (SPARK-2044), it is feasible to have several 
implementations of  shuffle.

Currently, there are two implementions of shuffle manager, i.e. SORT and HASH. 
Both of them will use disk in some stages. For examples, in the map side, all 
the intermediate data will be written into temporary files. In the reduce side, 
Spark will use external sort sometimes. In any case, disk I/O will bring some 
performance loss. Maybe,we can provide a pure-memory shuffle manager. In this 
shuffle manager, intermediate data will only go through memory. In some of 
scenes, it can improve performance. Experimentally, I implemented a in-memory 
shuffle manager upon 
[SPARK-2044](https://issues.apache.org/jira/browse/SPARK-2044). Following is my 
testing result:

| data size|  partitions  |  resources |
| 5131859218  |2000   |   50 executors/ 4 cores/ 4GB |

| settings   |  operation1   | 
operation2 |
| shuffle spill & lz4 |  repartition+flatMap+groupByKey | repartition + 
groupByKey | 
|memory   |   38s   |  16s |
|sort |   45s   |  28s |
|hash |   46s   |  28s |
|no shuffle spill & lz4 | | |
| memory |   16s | 16s |
| | | |
|shuffle spill & lzf | | |
|memory|  28s   | 27s |
|sort  |  29s   | 29s |
|hash  |  41s   | 30s |
|no shuffle spill & lzf | | |
| memory |  15s | 16s |

In my implementation, I simply reused the "BlockManager" in the map-side and 
set the "spark.shuffle.spill" false in the reduce-side. All the intermediate 
data is cached in memory store. Future work include but not only:

- memory usage management in "InMemory Shuffle" mode
- data management when intermediate data can not fit in memory

Test code:

```
val conf = new SparkConf().setAppName("InMemoryShuffleTest")
val sc = new SparkContext(conf)

val dataPath = args(0)
val partitions = args(1).toInt


[jira] [Commented] (SPARK-4817) [streaming]Print the specified number of data and handle all of the elements in RDD

2014-12-10 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14240937#comment-14240937
 ] 

Sean Owen commented on SPARK-4817:
--

If you really only mean you want to print, and do X, with a DStream, then you 
need no change to the code. You just call print and you call your other 
operations too. All of them happen.

> [streaming]Print the specified number of data and handle all of the elements 
> in RDD
> ---
>
> Key: SPARK-4817
> URL: https://issues.apache.org/jira/browse/SPARK-4817
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: 宿荣全
>Priority: Minor
>
> The DStream.print function prints 10 elements but pulls 11 elements from the RDD.
> A new function based on DStream.print is proposed:
> the new function prints the specified number of elements while still
> processing all of the elements in the RDD.
> Example workload:
> val dstream = stream.map->filter->mapPartitions->print
> The data that passes the filter must update a database in mapPartitions, but
> there is no need to print every record; printing the top 20 is enough to
> monitor the processing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4817) [streaming]Print the specified number of data and handle all of the elements in RDD

2014-12-10 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-4817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14240954#comment-14240954
 ] 

宿荣全 commented on SPARK-4817:


Yes, it can be expressed with the other streaming operations, but the resulting
pipeline looks redundant and is not very readable. Also, using the print
operation to process all of the elements in the RDD is not efficient.

For example:
1. val dstream = stream.map->filter->mapPartitions->count->print(num)
2. val dstream = stream.map->filter->mapPartitions->filter(false)->print(num)
Either of these is equivalent to (see the sketch below):
val dstream = stream.map->filter->mapPartitions->printTop(num)
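
As a minimal sketch of what such a printTop(num) could look like if built on
the existing DStream.foreachRDD API (printTop is only the proposed name, not an
existing DStream method; the helper below is an illustration, not the actual
patch):

{code}
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

object PrintTopSketch {
  // Hypothetical helper: process every element of each batch RDD, but print
  // only the first `num` of them. Cache the RDD beforehand if recomputation
  // between count() and take() is a concern.
  def printTop[T](stream: DStream[T], num: Int): Unit = {
    stream.foreachRDD { (rdd: RDD[T]) =>
      val total = rdd.count()     // forces evaluation of every element
      val firstN = rdd.take(num)  // fetches only the first `num` for display
      println(s"Processed $total elements in this batch; first $num shown:")
      firstN.foreach(println)
    }
  }
}
{code}

Compared with the count-then-print workaround, this keeps the full-RDD
processing and the limited printing in a single operation.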

> [streaming]Print the specified number of data and handle all of the elements 
> in RDD
> ---
>
> Key: SPARK-4817
> URL: https://issues.apache.org/jira/browse/SPARK-4817
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: 宿荣全
>Priority: Minor
>
> The DStream.print function prints 10 elements but pulls 11 elements from the RDD.
> A new function based on DStream.print is proposed:
> the new function prints the specified number of elements while still
> processing all of the elements in the RDD.
> Example workload:
> val dstream = stream.map->filter->mapPartitions->print
> The data that passes the filter must update a database in mapPartitions, but
> there is no need to print every record; printing the top 20 is enough to
> monitor the processing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4817) [streaming]Print the specified number of data and handle all of the elements in RDD

2014-12-10 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-4817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14240955#comment-14240955
 ] 

宿荣全 commented on SPARK-4817:


[~srowen]
Yes, it can be expressed with the other streaming operations, but the resulting
pipeline looks redundant and is not very readable. Also, using the print
operation to process all of the elements in the RDD is not efficient.
For example:
1. val dstream = stream.map->filter->mapPartitions->count->print(num)
2. val dstream = stream.map->filter->mapPartitions->filter(false)->print(num)
Either of these is equivalent to:
val dstream = stream.map->filter->mapPartitions->printTop(num)

> [streaming]Print the specified number of data and handle all of the elements 
> in RDD
> ---
>
> Key: SPARK-4817
> URL: https://issues.apache.org/jira/browse/SPARK-4817
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: 宿荣全
>Priority: Minor
>
> The DStream.print function prints 10 elements but pulls 11 elements from the RDD.
> A new function based on DStream.print is proposed:
> the new function prints the specified number of elements while still
> processing all of the elements in the RDD.
> Example workload:
> val dstream = stream.map->filter->mapPartitions->print
> The data that passes the filter must update a database in mapPartitions, but
> there is no need to print every record; printing the top 20 is enough to
> monitor the processing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3376) Memory-based shuffle strategy to reduce overhead of disk I/O

2014-12-10 Thread uncleGen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

uncleGen updated SPARK-3376:

Description: 
I think a memory-based shuffle can reduce some overhead of disk I/O. I just 
want to know is there any plan to do something about it. Or any suggestion 
about it. Base on the work (SPARK-2044), it is feasible to have several 
implementations of  shuffle.

Currently, there are two implementions of shuffle manager, i.e. SORT and HASH. 
Both of them will use disk in some stages. For examples, in the map side, all 
the intermediate data will be written into temporary files. In the reduce side, 
Spark will use external sort sometimes. In any case, disk I/O will bring some 
performance loss. Maybe,we can provide a pure-memory shuffle manager. In this 
shuffle manager, intermediate data will only go through memory. In some of 
scenes, it can improve performance. Experimentally, I implemented a in-memory 
shuffle manager upon 
[SPARK-2044](https://issues.apache.org/jira/browse/SPARK-2044). Following is my 
testing result:

| data size|  partitions  |  resources |
| 5131859218  |2000   |   50 executors/ 4 cores/ 4GB |

| settings   |  operation1   | 
operation2 |
| shuffle spill & lz4 |  repartition+flatMap+groupByKey | repartition + 
groupByKey | 
|memory   |   38s   |  16s |
|sort |   45s   |  28s |
|hash |   46s   |  28s |
|no shuffle spill & lz4 | | |
| memory |   16s | 16s |
| | | |
|shuffle spill & lzf | | |
|memory|  28s   | 27s |
|sort  |  29s   | 29s |
|hash  |  41s   | 30s |
|no shuffle spill & lzf | | |
| memory |  15s | 16s |

In my implementation, I simply reused the "BlockManager" in the map-side and 
set the "spark.shuffle.spill" false in the reduce-side. All the intermediate 
data is cached in memory store. Future work include but not only:

- memory usage management in "InMemory Shuffle" mode
- data management when intermediate data can not fit in memory

Test code:

{quote}

val conf = new SparkConf().setAppName("InMemoryShuffleTest")
val sc = new SparkContext(conf)

val dataPath = args(0)
val partitions = args(1).toInt

val rdd1 = sc.textFile(dataPath).cache()
rdd1.count()
val startTime = System.currentTimeMillis()
val rdd2 = rdd1.repartition(partitions)
  .flatMap(_.split(",")).map(s => (s, s))
  .groupBy(e => e._1)
rdd2.count()
val endTime = System.currentTimeMillis()

println("time: " + (endTime - startTime) / 1000 )

{quote}



  was:
I think a memory-based shuffle can reduce some overhead of disk I/O. I just 
want to know is there any plan to do something about it. Or any suggestion 
about it. Base on the work (SPARK-2044), it is feasible to have several 
implementations of  shuffle.

Currently, there are two implementions of shuffle manager, i.e. SORT and HASH. 
Both of them will use disk in some stages. For examples, in the map side, all 
the intermediate data will be written into temporary files. In the reduce side, 
Spark will use external sort sometimes. In any case, disk I/O will bring some 
performance loss. Maybe,we can provide a pure-memory shuffle manager. In this 
shuffle manager, intermediate data will only go through memory. In some of 
scenes, it can improve performance. Experimentally, I implemented a in-memory 
shuffle manager upon 
[SPARK-2044](https://issues.apache.org/jira/browse/SPARK-2044). Following is my 
testing result:

| data size|  partitions  |  resources |
| 5131859218  |2000   |   50 executors/ 4 cores/ 4GB |

| settings   |  operation1   | 
operation2 |
| shuffle spill & lz4 |  repartition+flatMap+groupByKey | repartition + 
groupByKey | 
|memory   |   38s   |  16s |
|sort |   45s   |  28s |
|hash |   46s   |  28s |
|no shuffle spill & lz4 | | |
| memory |   16s | 16s |
| | | |
|shuffle spill & lzf | | |
|memory|  28s   | 27s |
|sort  |  29s   | 29s |
|hash  |  41s   | 30s |
|no shuffle spill & lzf | | |
| memory |  15s | 16s |

In my implementation, I simply reused the "BlockManager" in the map-side and 
set the "spark.shuffle.spill" false in the reduce-side. All the intermediate 
data is cached in memory store. Future work include but not only:

- memory usage management in "InMemory Shuffle" mode
- data management when intermediate data can not fit in memory

Test code:

```
val conf = new SparkConf().setAppName("InMemoryShuffleTest")
val sc = new SparkContext(conf)

val dataPath = args(0)
val partitions = args(1

[jira] [Updated] (SPARK-3376) Memory-based shuffle strategy to reduce overhead of disk I/O

2014-12-10 Thread uncleGen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

uncleGen updated SPARK-3376:

Description: 
I think a memory-based shuffle can reduce some overhead of disk I/O. I just 
want to know is there any plan to do something about it. Or any suggestion 
about it. Base on the work (SPARK-2044), it is feasible to have several 
implementations of  shuffle.

Currently, there are two implementions of shuffle manager, i.e. SORT and HASH. 
Both of them will use disk in some stages. For examples, in the map side, all 
the intermediate data will be written into temporary files. In the reduce side, 
Spark will use external sort sometimes. In any case, disk I/O will bring some 
performance loss. Maybe,we can provide a pure-memory shuffle manager. In this 
shuffle manager, intermediate data will only go through memory. In some of 
scenes, it can improve performance. Experimentally, I implemented a in-memory 
shuffle manager upon 
[SPARK-2044](https://issues.apache.org/jira/browse/SPARK-2044). Following is my 
testing result:

| data size|  partitions  |  resources |
| 5131859218  |2000   |   50 executors/ 4 cores/ 4GB |

| settings   |  operation1   | 
operation2 |
| shuffle spill & lz4 |  repartition+flatMap+groupByKey | repartition + 
groupByKey | 
|memory   |   38s   |  16s |
|sort |   45s   |  28s |
|hash |   46s   |  28s |
|no shuffle spill & lz4 | | |
| memory |   16s | 16s |
| | | |
|shuffle spill & lzf | | |
|memory|  28s   | 27s |
|sort  |  29s   | 29s |
|hash  |  41s   | 30s |
|no shuffle spill & lzf | | |
| memory |  15s | 16s |

In my implementation, I simply reused the "BlockManager" in the map-side and 
set the "spark.shuffle.spill" false in the reduce-side. All the intermediate 
data is cached in memory store. Future work include but not only:

- memory usage management in "InMemory Shuffle" mode
- data management when intermediate data can not fit in memory

Test code:

{code: borderStyle=solid}

val conf = new SparkConf().setAppName("InMemoryShuffleTest")
val sc = new SparkContext(conf)

val dataPath = args(0)
val partitions = args(1).toInt

val rdd1 = sc.textFile(dataPath).cache()
rdd1.count()
val startTime = System.currentTimeMillis()
val rdd2 = rdd1.repartition(partitions)
  .flatMap(_.split(",")).map(s => (s, s))
  .groupBy(e => e._1)
rdd2.count()
val endTime = System.currentTimeMillis()

println("time: " + (endTime - startTime) / 1000 )

{code}



  was:
I think a memory-based shuffle can reduce some overhead of disk I/O. I just 
want to know is there any plan to do something about it. Or any suggestion 
about it. Base on the work (SPARK-2044), it is feasible to have several 
implementations of  shuffle.

Currently, there are two implementions of shuffle manager, i.e. SORT and HASH. 
Both of them will use disk in some stages. For examples, in the map side, all 
the intermediate data will be written into temporary files. In the reduce side, 
Spark will use external sort sometimes. In any case, disk I/O will bring some 
performance loss. Maybe,we can provide a pure-memory shuffle manager. In this 
shuffle manager, intermediate data will only go through memory. In some of 
scenes, it can improve performance. Experimentally, I implemented a in-memory 
shuffle manager upon 
[SPARK-2044](https://issues.apache.org/jira/browse/SPARK-2044). Following is my 
testing result:

| data size|  partitions  |  resources |
| 5131859218  |2000   |   50 executors/ 4 cores/ 4GB |

| settings   |  operation1   | 
operation2 |
| shuffle spill & lz4 |  repartition+flatMap+groupByKey | repartition + 
groupByKey | 
|memory   |   38s   |  16s |
|sort |   45s   |  28s |
|hash |   46s   |  28s |
|no shuffle spill & lz4 | | |
| memory |   16s | 16s |
| | | |
|shuffle spill & lzf | | |
|memory|  28s   | 27s |
|sort  |  29s   | 29s |
|hash  |  41s   | 30s |
|no shuffle spill & lzf | | |
| memory |  15s | 16s |

In my implementation, I simply reused the "BlockManager" in the map-side and 
set the "spark.shuffle.spill" false in the reduce-side. All the intermediate 
data is cached in memory store. Future work include but not only:

- memory usage management in "InMemory Shuffle" mode
- data management when intermediate data can not fit in memory

Test code:

{quote}

val conf = new SparkConf().setAppName("InMemoryShuffleTest")
val sc = new SparkContext(conf)

val dataPath = args(0)
v

[jira] [Updated] (SPARK-3376) Memory-based shuffle strategy to reduce overhead of disk I/O

2014-12-10 Thread uncleGen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

uncleGen updated SPARK-3376:

Description: 
I think a memory-based shuffle can reduce some overhead of disk I/O. I just 
want to know is there any plan to do something about it. Or any suggestion 
about it. Base on the work (SPARK-2044), it is feasible to have several 
implementations of  shuffle.



Currently, there are two implementions of shuffle manager, i.e. SORT and HASH. 
Both of them will use disk in some stages. For examples, in the map side, all 
the intermediate data will be written into temporary files. In the reduce side, 
Spark will use external sort sometimes. In any case, disk I/O will bring some 
performance loss. Maybe,we can provide a pure-memory shuffle manager. In this 
shuffle manager, intermediate data will only go through memory. In some of 
scenes, it can improve performance. Experimentally, I implemented a in-memory 
shuffle manager upon 
[SPARK-2044](https://issues.apache.org/jira/browse/SPARK-2044). Following is my 
testing result:

| data size|  partitions  |  resources |
| 5131859218  |2000   |   50 executors/ 4 cores/ 4GB |

| settings   |  operation1   | 
operation2 |
| shuffle spill & lz4 |  repartition+flatMap+groupByKey | repartition + 
groupByKey | 
|memory   |   38s   |  16s |
|sort |   45s   |  28s |
|hash |   46s   |  28s |
|no shuffle spill & lz4 | | |
| memory |   16s | 16s |
| | | |
|shuffle spill & lzf | | |
|memory|  28s   | 27s |
|sort  |  29s   | 29s |
|hash  |  41s   | 30s |
|no shuffle spill & lzf | | |
| memory |  15s | 16s |

In my implementation, I simply reused the "BlockManager" in the map-side and 
set the "spark.shuffle.spill" false in the reduce-side. All the intermediate 
data is cached in memory store. Future work include but not only:

- memory usage management in "InMemory Shuffle" mode
- data management when intermediate data can not fit in memory

Test code:

{code: borderStyle=solid}

val conf = new SparkConf().setAppName("InMemoryShuffleTest")
val sc = new SparkContext(conf)

val dataPath = args(0)
val partitions = args(1).toInt

val rdd1 = sc.textFile(dataPath).cache()
rdd1.count()
val startTime = System.currentTimeMillis()
val rdd2 = rdd1.repartition(partitions)
  .flatMap(_.split(",")).map(s => (s, s))
  .groupBy(e => e._1)
rdd2.count()
val endTime = System.currentTimeMillis()

println("time: " + (endTime - startTime) / 1000 )

{code}



  was:
I think a memory-based shuffle can reduce some overhead of disk I/O. I just 
want to know is there any plan to do something about it. Or any suggestion 
about it. Base on the work (SPARK-2044), it is feasible to have several 
implementations of  shuffle.

---

Currently, there are two implementions of shuffle manager, i.e. SORT and HASH. 
Both of them will use disk in some stages. For examples, in the map side, all 
the intermediate data will be written into temporary files. In the reduce side, 
Spark will use external sort sometimes. In any case, disk I/O will bring some 
performance loss. Maybe,we can provide a pure-memory shuffle manager. In this 
shuffle manager, intermediate data will only go through memory. In some of 
scenes, it can improve performance. Experimentally, I implemented a in-memory 
shuffle manager upon 
[SPARK-2044](https://issues.apache.org/jira/browse/SPARK-2044). Following is my 
testing result:

| data size|  partitions  |  resources |
| 5131859218  |2000   |   50 executors/ 4 cores/ 4GB |

| settings   |  operation1   | 
operation2 |
| shuffle spill & lz4 |  repartition+flatMap+groupByKey | repartition + 
groupByKey | 
|memory   |   38s   |  16s |
|sort |   45s   |  28s |
|hash |   46s   |  28s |
|no shuffle spill & lz4 | | |
| memory |   16s | 16s |
| | | |
|shuffle spill & lzf | | |
|memory|  28s   | 27s |
|sort  |  29s   | 29s |
|hash  |  41s   | 30s |
|no shuffle spill & lzf | | |
| memory |  15s | 16s |

In my implementation, I simply reused the "BlockManager" in the map-side and 
set the "spark.shuffle.spill" false in the reduce-side. All the intermediate 
data is cach

[jira] [Updated] (SPARK-3376) Memory-based shuffle strategy to reduce overhead of disk I/O

2014-12-10 Thread uncleGen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

uncleGen updated SPARK-3376:

Description: 
I think a memory-based shuffle can reduce some overhead of disk I/O. I just 
want to know is there any plan to do something about it. Or any suggestion 
about it. Base on the work (SPARK-2044), it is feasible to have several 
implementations of  shuffle.

---

Currently, there are two implementions of shuffle manager, i.e. SORT and HASH. 
Both of them will use disk in some stages. For examples, in the map side, all 
the intermediate data will be written into temporary files. In the reduce side, 
Spark will use external sort sometimes. In any case, disk I/O will bring some 
performance loss. Maybe,we can provide a pure-memory shuffle manager. In this 
shuffle manager, intermediate data will only go through memory. In some of 
scenes, it can improve performance. Experimentally, I implemented a in-memory 
shuffle manager upon 
[SPARK-2044](https://issues.apache.org/jira/browse/SPARK-2044). Following is my 
testing result:

| data size|  partitions  |  resources |
| 5131859218  |2000   |   50 executors/ 4 cores/ 4GB |

| settings   |  operation1   | 
operation2 |
| shuffle spill & lz4 |  repartition+flatMap+groupByKey | repartition + 
groupByKey | 
|memory   |   38s   |  16s |
|sort |   45s   |  28s |
|hash |   46s   |  28s |
|no shuffle spill & lz4 | | |
| memory |   16s | 16s |
| | | |
|shuffle spill & lzf | | |
|memory|  28s   | 27s |
|sort  |  29s   | 29s |
|hash  |  41s   | 30s |
|no shuffle spill & lzf | | |
| memory |  15s | 16s |

In my implementation, I simply reused the "BlockManager" in the map-side and 
set the "spark.shuffle.spill" false in the reduce-side. All the intermediate 
data is cached in memory store. Future work include but not only:

- memory usage management in "InMemory Shuffle" mode
- data management when intermediate data can not fit in memory

Test code:

{code: borderStyle=solid}

val conf = new SparkConf().setAppName("InMemoryShuffleTest")
val sc = new SparkContext(conf)

val dataPath = args(0)
val partitions = args(1).toInt

val rdd1 = sc.textFile(dataPath).cache()
rdd1.count()
val startTime = System.currentTimeMillis()
val rdd2 = rdd1.repartition(partitions)
  .flatMap(_.split(",")).map(s => (s, s))
  .groupBy(e => e._1)
rdd2.count()
val endTime = System.currentTimeMillis()

println("time: " + (endTime - startTime) / 1000 )

{code}



  was:
I think a memory-based shuffle can reduce some overhead of disk I/O. I just 
want to know is there any plan to do something about it. Or any suggestion 
about it. Base on the work (SPARK-2044), it is feasible to have several 
implementations of  shuffle.

Currently, there are two implementions of shuffle manager, i.e. SORT and HASH. 
Both of them will use disk in some stages. For examples, in the map side, all 
the intermediate data will be written into temporary files. In the reduce side, 
Spark will use external sort sometimes. In any case, disk I/O will bring some 
performance loss. Maybe,we can provide a pure-memory shuffle manager. In this 
shuffle manager, intermediate data will only go through memory. In some of 
scenes, it can improve performance. Experimentally, I implemented a in-memory 
shuffle manager upon 
[SPARK-2044](https://issues.apache.org/jira/browse/SPARK-2044). Following is my 
testing result:

| data size|  partitions  |  resources |
| 5131859218  |2000   |   50 executors/ 4 cores/ 4GB |

| settings   |  operation1   | 
operation2 |
| shuffle spill & lz4 |  repartition+flatMap+groupByKey | repartition + 
groupByKey | 
|memory   |   38s   |  16s |
|sort |   45s   |  28s |
|hash |   46s   |  28s |
|no shuffle spill & lz4 | | |
| memory |   16s | 16s |
| | | |
|shuffle spill & lzf | | |
|memory|  28s   | 27s |
|sort  |  29s   | 29s |
|hash  |  41s   | 30s |
|no shuffle spill & lzf | | |
| memory |  15s | 16s |

In my implementation, I simply reused the "BlockManager" in the map-side and 
set the "spark.shuffle.spill" false in the reduce-side. All the intermediate 
data is cached in memory store. Future work include but not only:

- memory usage management in "InMemory Shuffle" mode
- data management when intermediate data can not fit i

[jira] [Updated] (SPARK-3376) Memory-based shuffle strategy to reduce overhead of disk I/O

2014-12-10 Thread uncleGen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

uncleGen updated SPARK-3376:

Description: 
I think a memory-based shuffle can reduce some overhead of disk I/O. I just 
want to know is there any plan to do something about it. Or any suggestion 
about it. Base on the work (SPARK-2044), it is feasible to have several 
implementations of  shuffle.



Currently, there are two implementions of shuffle manager, i.e. SORT and HASH. 
Both of them will use disk in some stages. For examples, in the map side, all 
the intermediate data will be written into temporary files. In the reduce side, 
Spark will use external sort sometimes. In any case, disk I/O will bring some 
performance loss. Maybe,we can provide a pure-memory shuffle manager. In this 
shuffle manager, intermediate data will only go through memory. In some of 
scenes, it can improve performance. Experimentally, I implemented a in-memory 
shuffle manager upon SPARK-2044. Following is my testing result:

| data size|  partitions  |  resources |
| 5131859218  |2000   |   50 executors/ 4 cores/ 4GB |

| settings   |  operation1   | 
operation2 |
| shuffle spill & lz4 |  repartition+flatMap+groupByKey | repartition + 
groupByKey | 
|memory   |   38s   |  16s |
|sort |   45s   |  28s |
|hash |   46s   |  28s |
|no shuffle spill & lz4 | | |
| memory |   16s | 16s |
| | | |
|shuffle spill & lzf | | |
|memory|  28s   | 27s |
|sort  |  29s   | 29s |
|hash  |  41s   | 30s |
|no shuffle spill & lzf | | |
| memory |  15s | 16s |

In my implementation, I simply reused the "BlockManager" in the map-side and 
set the "spark.shuffle.spill" false in the reduce-side. All the intermediate 
data is cached in memory store. Future work include but not only:

- memory usage management in "InMemory Shuffle" mode
- data management when intermediate data can not fit in memory

Test code:

{code: borderStyle=solid}

val conf = new SparkConf().setAppName("InMemoryShuffleTest")
val sc = new SparkContext(conf)

val dataPath = args(0)
val partitions = args(1).toInt

val rdd1 = sc.textFile(dataPath).cache()
rdd1.count()
val startTime = System.currentTimeMillis()
val rdd2 = rdd1.repartition(partitions)
  .flatMap(_.split(",")).map(s => (s, s))
  .groupBy(e => e._1)
rdd2.count()
val endTime = System.currentTimeMillis()

println("time: " + (endTime - startTime) / 1000 )

{code}



  was:
I think a memory-based shuffle can reduce some overhead of disk I/O. I just 
want to know is there any plan to do something about it. Or any suggestion 
about it. Base on the work (SPARK-2044), it is feasible to have several 
implementations of  shuffle.



Currently, there are two implementions of shuffle manager, i.e. SORT and HASH. 
Both of them will use disk in some stages. For examples, in the map side, all 
the intermediate data will be written into temporary files. In the reduce side, 
Spark will use external sort sometimes. In any case, disk I/O will bring some 
performance loss. Maybe,we can provide a pure-memory shuffle manager. In this 
shuffle manager, intermediate data will only go through memory. In some of 
scenes, it can improve performance. Experimentally, I implemented a in-memory 
shuffle manager upon 
[SPARK-2044](https://issues.apache.org/jira/browse/SPARK-2044). Following is my 
testing result:

| data size|  partitions  |  resources |
| 5131859218  |2000   |   50 executors/ 4 cores/ 4GB |

| settings   |  operation1   | 
operation2 |
| shuffle spill & lz4 |  repartition+flatMap+groupByKey | repartition + 
groupByKey | 
|memory   |   38s   |  16s |
|sort |   45s   |  28s |
|hash |   46s   |  28s |
|no shuffle spill & lz4 | | |
| memory |   16s | 16s |
| | | |
|shuffle spill & lzf | | |
|memory|  28s   | 27s |
|sort  |  29s   | 29s |
|hash  |  41s   | 30s |
|no shuffle spill & lzf | | |
| memory |  15s | 16s |

In my implementation, I simply reused the "BlockManager" in the map-side and 
set the "spark.shuffle.spill" false in the reduce-side. All the intermediate 
data is cached in memory store. Future work include but not only:

- 

[jira] [Updated] (SPARK-3376) Memory-based shuffle strategy to reduce overhead of disk I/O

2014-12-10 Thread uncleGen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

uncleGen updated SPARK-3376:

Component/s: Shuffle

> Memory-based shuffle strategy to reduce overhead of disk I/O
> 
>
> Key: SPARK-3376
> URL: https://issues.apache.org/jira/browse/SPARK-3376
> Project: Spark
>  Issue Type: Planned Work
>  Components: Shuffle
>Affects Versions: 1.1.0
>Reporter: uncleGen
>Priority: Trivial
>
> I think a memory-based shuffle can reduce some overhead of disk I/O. I just 
> want to know is there any plan to do something about it. Or any suggestion 
> about it. Base on the work (SPARK-2044), it is feasible to have several 
> implementations of  shuffle.
> 
> Currently, there are two implementions of shuffle manager, i.e. SORT and 
> HASH. Both of them will use disk in some stages. For examples, in the map 
> side, all the intermediate data will be written into temporary files. In the 
> reduce side, Spark will use external sort sometimes. In any case, disk I/O 
> will bring some performance loss. Maybe,we can provide a pure-memory shuffle 
> manager. In this shuffle manager, intermediate data will only go through 
> memory. In some of scenes, it can improve performance. Experimentally, I 
> implemented a in-memory shuffle manager upon SPARK-2044. Following is my 
> testing result:
> | data size|  partitions  |  resources |
> | 5131859218  |2000   |   50 executors/ 4 cores/ 4GB |
> | settings   |  operation1   | 
> operation2 |
> | shuffle spill & lz4 |  repartition+flatMap+groupByKey | repartition + 
> groupByKey | 
> |memory   |   38s   |  16s |
> |sort |   45s   |  28s |
> |hash |   46s   |  28s |
> |no shuffle spill & lz4 | | |
> | memory |   16s | 16s |
> | | | |
> |shuffle spill & lzf | | |
> |memory|  28s   | 27s |
> |sort  |  29s   | 29s |
> |hash  |  41s   | 30s |
> |no shuffle spill & lzf | | |
> | memory |  15s | 16s |
> In my implementation, I simply reused the "BlockManager" in the map-side and 
> set the "spark.shuffle.spill" false in the reduce-side. All the intermediate 
> data is cached in memory store. Future work include but not only:
> - memory usage management in "InMemory Shuffle" mode
> - data management when intermediate data can not fit in memory
> Test code:
> {code: borderStyle=solid}
> val conf = new SparkConf().setAppName("InMemoryShuffleTest")
> val sc = new SparkContext(conf)
> val dataPath = args(0)
> val partitions = args(1).toInt
> val rdd1 = sc.textFile(dataPath).cache()
> rdd1.count()
> val startTime = System.currentTimeMillis()
> val rdd2 = rdd1.repartition(partitions)
>   .flatMap(_.split(",")).map(s => (s, s))
>   .groupBy(e => e._1)
> rdd2.count()
> val endTime = System.currentTimeMillis()
> println("time: " + (endTime - startTime) / 1000 )
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3376) Memory-based shuffle strategy to reduce overhead of disk I/O

2014-12-10 Thread uncleGen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

uncleGen updated SPARK-3376:

Issue Type: New Feature  (was: Planned Work)

> Memory-based shuffle strategy to reduce overhead of disk I/O
> 
>
> Key: SPARK-3376
> URL: https://issues.apache.org/jira/browse/SPARK-3376
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle
>Affects Versions: 1.1.0
>Reporter: uncleGen
>Priority: Trivial
>
> I think a memory-based shuffle can reduce some overhead of disk I/O. I just 
> want to know is there any plan to do something about it. Or any suggestion 
> about it. Base on the work (SPARK-2044), it is feasible to have several 
> implementations of  shuffle.
> 
> Currently, there are two implementions of shuffle manager, i.e. SORT and 
> HASH. Both of them will use disk in some stages. For examples, in the map 
> side, all the intermediate data will be written into temporary files. In the 
> reduce side, Spark will use external sort sometimes. In any case, disk I/O 
> will bring some performance loss. Maybe,we can provide a pure-memory shuffle 
> manager. In this shuffle manager, intermediate data will only go through 
> memory. In some of scenes, it can improve performance. Experimentally, I 
> implemented a in-memory shuffle manager upon SPARK-2044. Following is my 
> testing result:
> | data size|  partitions  |  resources |
> | 5131859218  |2000   |   50 executors/ 4 cores/ 4GB |
> | settings   |  operation1   | 
> operation2 |
> | shuffle spill & lz4 |  repartition+flatMap+groupByKey | repartition + 
> groupByKey | 
> |memory   |   38s   |  16s |
> |sort |   45s   |  28s |
> |hash |   46s   |  28s |
> |no shuffle spill & lz4 | | |
> | memory |   16s | 16s |
> | | | |
> |shuffle spill & lzf | | |
> |memory|  28s   | 27s |
> |sort  |  29s   | 29s |
> |hash  |  41s   | 30s |
> |no shuffle spill & lzf | | |
> | memory |  15s | 16s |
> In my implementation, I simply reused the "BlockManager" in the map-side and 
> set the "spark.shuffle.spill" false in the reduce-side. All the intermediate 
> data is cached in memory store. Future work include but not only:
> - memory usage management in "InMemory Shuffle" mode
> - data management when intermediate data can not fit in memory
> Test code:
> {code: borderStyle=solid}
> val conf = new SparkConf().setAppName("InMemoryShuffleTest")
> val sc = new SparkContext(conf)
> val dataPath = args(0)
> val partitions = args(1).toInt
> val rdd1 = sc.textFile(dataPath).cache()
> rdd1.count()
> val startTime = System.currentTimeMillis()
> val rdd2 = rdd1.repartition(partitions)
>   .flatMap(_.split(",")).map(s => (s, s))
>   .groupBy(e => e._1)
> rdd2.count()
> val endTime = System.currentTimeMillis()
> println("time: " + (endTime - startTime) / 1000 )
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3376) Memory-based shuffle strategy to reduce overhead of disk I/O

2014-12-10 Thread uncleGen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

uncleGen updated SPARK-3376:

Affects Version/s: 1.1.0

> Memory-based shuffle strategy to reduce overhead of disk I/O
> 
>
> Key: SPARK-3376
> URL: https://issues.apache.org/jira/browse/SPARK-3376
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle
>Affects Versions: 1.1.0
>Reporter: uncleGen
>Priority: Trivial
>
> I think a memory-based shuffle can reduce some overhead of disk I/O. I just 
> want to know is there any plan to do something about it. Or any suggestion 
> about it. Base on the work (SPARK-2044), it is feasible to have several 
> implementations of  shuffle.
> 
> Currently, there are two implementions of shuffle manager, i.e. SORT and 
> HASH. Both of them will use disk in some stages. For examples, in the map 
> side, all the intermediate data will be written into temporary files. In the 
> reduce side, Spark will use external sort sometimes. In any case, disk I/O 
> will bring some performance loss. Maybe,we can provide a pure-memory shuffle 
> manager. In this shuffle manager, intermediate data will only go through 
> memory. In some of scenes, it can improve performance. Experimentally, I 
> implemented a in-memory shuffle manager upon SPARK-2044. Following is my 
> testing result:
> | data size|  partitions  |  resources |
> | 5131859218  |2000   |   50 executors/ 4 cores/ 4GB |
> | settings   |  operation1   | 
> operation2 |
> | shuffle spill & lz4 |  repartition+flatMap+groupByKey | repartition + 
> groupByKey | 
> |memory   |   38s   |  16s |
> |sort |   45s   |  28s |
> |hash |   46s   |  28s |
> |no shuffle spill & lz4 | | |
> | memory |   16s | 16s |
> | | | |
> |shuffle spill & lzf | | |
> |memory|  28s   | 27s |
> |sort  |  29s   | 29s |
> |hash  |  41s   | 30s |
> |no shuffle spill & lzf | | |
> | memory |  15s | 16s |
> In my implementation, I simply reused the "BlockManager" in the map-side and 
> set the "spark.shuffle.spill" false in the reduce-side. All the intermediate 
> data is cached in memory store. Future work include but not only:
> - memory usage management in "InMemory Shuffle" mode
> - data management when intermediate data can not fit in memory
> Test code:
> {code: borderStyle=solid}
> val conf = new SparkConf().setAppName("InMemoryShuffleTest")
> val sc = new SparkContext(conf)
> val dataPath = args(0)
> val partitions = args(1).toInt
> val rdd1 = sc.textFile(dataPath).cache()
> rdd1.count()
> val startTime = System.currentTimeMillis()
> val rdd2 = rdd1.repartition(partitions)
>   .flatMap(_.split(",")).map(s => (s, s))
>   .groupBy(e => e._1)
> rdd2.count()
> val endTime = System.currentTimeMillis()
> println("time: " + (endTime - startTime) / 1000 )
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3376) Memory-based shuffle strategy to reduce overhead of disk I/O

2014-12-10 Thread uncleGen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

uncleGen updated SPARK-3376:

Target Version/s: 1.3.0

> Memory-based shuffle strategy to reduce overhead of disk I/O
> 
>
> Key: SPARK-3376
> URL: https://issues.apache.org/jira/browse/SPARK-3376
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle
>Affects Versions: 1.1.0
>Reporter: uncleGen
>Priority: Trivial
>  Labels: performance
>
> I think a memory-based shuffle can reduce some overhead of disk I/O. I just 
> want to know is there any plan to do something about it. Or any suggestion 
> about it. Base on the work (SPARK-2044), it is feasible to have several 
> implementations of  shuffle.
> 
> Currently, there are two implementions of shuffle manager, i.e. SORT and 
> HASH. Both of them will use disk in some stages. For examples, in the map 
> side, all the intermediate data will be written into temporary files. In the 
> reduce side, Spark will use external sort sometimes. In any case, disk I/O 
> will bring some performance loss. Maybe,we can provide a pure-memory shuffle 
> manager. In this shuffle manager, intermediate data will only go through 
> memory. In some of scenes, it can improve performance. Experimentally, I 
> implemented a in-memory shuffle manager upon SPARK-2044. Following is my 
> testing result:
> | data size|  partitions  |  resources |
> | 5131859218  |2000   |   50 executors/ 4 cores/ 4GB |
> | settings   |  operation1   | 
> operation2 |
> | shuffle spill & lz4 |  repartition+flatMap+groupByKey | repartition + 
> groupByKey | 
> |memory   |   38s   |  16s |
> |sort |   45s   |  28s |
> |hash |   46s   |  28s |
> |no shuffle spill & lz4 | | |
> | memory |   16s | 16s |
> | | | |
> |shuffle spill & lzf | | |
> |memory|  28s   | 27s |
> |sort  |  29s   | 29s |
> |hash  |  41s   | 30s |
> |no shuffle spill & lzf | | |
> | memory |  15s | 16s |
> In my implementation, I simply reused the "BlockManager" in the map-side and 
> set the "spark.shuffle.spill" false in the reduce-side. All the intermediate 
> data is cached in memory store. Future work include but not only:
> - memory usage management in "InMemory Shuffle" mode
> - data management when intermediate data can not fit in memory
> Test code:
> {code: borderStyle=solid}
> val conf = new SparkConf().setAppName("InMemoryShuffleTest")
> val sc = new SparkContext(conf)
> val dataPath = args(0)
> val partitions = args(1).toInt
> val rdd1 = sc.textFile(dataPath).cache()
> rdd1.count()
> val startTime = System.currentTimeMillis()
> val rdd2 = rdd1.repartition(partitions)
>   .flatMap(_.split(",")).map(s => (s, s))
>   .groupBy(e => e._1)
> rdd2.count()
> val endTime = System.currentTimeMillis()
> println("time: " + (endTime - startTime) / 1000 )
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3376) Memory-based shuffle strategy to reduce overhead of disk I/O

2014-12-10 Thread uncleGen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

uncleGen updated SPARK-3376:

Labels: performance  (was: )

> Memory-based shuffle strategy to reduce overhead of disk I/O
> 
>
> Key: SPARK-3376
> URL: https://issues.apache.org/jira/browse/SPARK-3376
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle
>Affects Versions: 1.1.0
>Reporter: uncleGen
>Priority: Trivial
>  Labels: performance
>
> I think a memory-based shuffle can reduce some overhead of disk I/O. I just 
> want to know is there any plan to do something about it. Or any suggestion 
> about it. Base on the work (SPARK-2044), it is feasible to have several 
> implementations of  shuffle.
> 
> Currently, there are two implementions of shuffle manager, i.e. SORT and 
> HASH. Both of them will use disk in some stages. For examples, in the map 
> side, all the intermediate data will be written into temporary files. In the 
> reduce side, Spark will use external sort sometimes. In any case, disk I/O 
> will bring some performance loss. Maybe,we can provide a pure-memory shuffle 
> manager. In this shuffle manager, intermediate data will only go through 
> memory. In some of scenes, it can improve performance. Experimentally, I 
> implemented a in-memory shuffle manager upon SPARK-2044. Following is my 
> testing result:
> | data size|  partitions  |  resources |
> | 5131859218  |2000   |   50 executors/ 4 cores/ 4GB |
> | settings   |  operation1   | 
> operation2 |
> | shuffle spill & lz4 |  repartition+flatMap+groupByKey | repartition + 
> groupByKey | 
> |memory   |   38s   |  16s |
> |sort |   45s   |  28s |
> |hash |   46s   |  28s |
> |no shuffle spill & lz4 | | |
> | memory |   16s | 16s |
> | | | |
> |shuffle spill & lzf | | |
> |memory|  28s   | 27s |
> |sort  |  29s   | 29s |
> |hash  |  41s   | 30s |
> |no shuffle spill & lzf | | |
> | memory |  15s | 16s |
> In my implementation, I simply reused the "BlockManager" in the map-side and 
> set the "spark.shuffle.spill" false in the reduce-side. All the intermediate 
> data is cached in memory store. Future work include but not only:
> - memory usage management in "InMemory Shuffle" mode
> - data management when intermediate data can not fit in memory
> Test code:
> {code: borderStyle=solid}
> val conf = new SparkConf().setAppName("InMemoryShuffleTest")
> val sc = new SparkContext(conf)
> val dataPath = args(0)
> val partitions = args(1).toInt
> val rdd1 = sc.textFile(dataPath).cache()
> rdd1.count()
> val startTime = System.currentTimeMillis()
> val rdd2 = rdd1.repartition(partitions)
>   .flatMap(_.split(",")).map(s => (s, s))
>   .groupBy(e => e._1)
> rdd2.count()
> val endTime = System.currentTimeMillis()
> println("time: " + (endTime - startTime) / 1000 )
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3376) Memory-based shuffle strategy to reduce overhead of disk I/O

2014-12-10 Thread uncleGen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

uncleGen updated SPARK-3376:

Description: 
I think a memory-based shuffle can reduce some overhead of disk I/O. I just 
want to know is there any plan to do something about it. Or any suggestion 
about it. Base on the work (SPARK-2044), it is feasible to have several 
implementations of  shuffle.



Currently, there are two implementions of shuffle manager, i.e. SORT and HASH. 
Both of them will use disk in some stages. For examples, in the map side, all 
the intermediate data will be written into temporary files. In the reduce side, 
Spark will use external sort sometimes. In any case, disk I/O will bring some 
performance loss. Maybe,we can provide a pure-memory shuffle manager. In this 
shuffle manager, intermediate data will only go through memory. In some of 
scenes, it can improve performance. Experimentally, I implemented a in-memory 
shuffle manager upon SPARK-2044. Following is my testing result:

| data size|  partitions  |  resources |
| 5131859218  |2000   |   50 executors/ 4 cores/ 4GB |

| settings   |  operation1   | 
operation2 |
| shuffle spill & lz4 |  repartition+flatMap+groupByKey | repartition + 
groupByKey | 
|memory   |   38s   |  16s |
|sort |   45s   |  28s |
|hash |   46s   |  28s |
|no shuffle spill & lz4 | | |
| memory |   16s | 16s |
| | | |
|shuffle spill & lzf | | |
|memory|  28s   | 27s |
|sort  |  29s   | 29s |
|hash  |  41s   | 30s |
|no shuffle spill & lzf | | |
| memory |  15s | 16s |

In my implementation, I simply reused the "BlockManager" in the map-side and 
set the "spark.shuffle.spill" false in the reduce-side. All the intermediate 
data is cached in memory store. Just as Reynold Xin has pointed out, our 
disk-based shuffle manager has achieved good performance. With  parameter 
tuning, the disk-based shuffle manager will  obtain similar performance. 
However, I will continue my work and improve it. And as a alternative tuning 
option, "InMemory shuffle" is a good choice. Future works includes, but is not 
limited to:

- memory usage management in "InMemory Shuffle" mode
- data management when intermediate data can not fit in memory

Test code:

{code: borderStyle=solid}

val conf = new SparkConf().setAppName("InMemoryShuffleTest")
val sc = new SparkContext(conf)

val dataPath = args(0)
val partitions = args(1).toInt

val rdd1 = sc.textFile(dataPath).cache()
rdd1.count()
val startTime = System.currentTimeMillis()
val rdd2 = rdd1.repartition(partitions)
  .flatMap(_.split(",")).map(s => (s, s))
  .groupBy(e => e._1)
rdd2.count()
val endTime = System.currentTimeMillis()

println("time: " + (endTime - startTime) / 1000 )

{code}



  was:
I think a memory-based shuffle can reduce some overhead of disk I/O. I just 
want to know is there any plan to do something about it. Or any suggestion 
about it. Base on the work (SPARK-2044), it is feasible to have several 
implementations of  shuffle.



Currently, there are two implementions of shuffle manager, i.e. SORT and HASH. 
Both of them will use disk in some stages. For examples, in the map side, all 
the intermediate data will be written into temporary files. In the reduce side, 
Spark will use external sort sometimes. In any case, disk I/O will bring some 
performance loss. Maybe,we can provide a pure-memory shuffle manager. In this 
shuffle manager, intermediate data will only go through memory. In some of 
scenes, it can improve performance. Experimentally, I implemented a in-memory 
shuffle manager upon SPARK-2044. Following is my testing result:

| data size|  partitions  |  resources |
| 5131859218  |2000   |   50 executors/ 4 cores/ 4GB |

| settings   |  operation1   | 
operation2 |
| shuffle spill & lz4 |  repartition+flatMap+groupByKey | repartition + 
groupByKey | 
|memory   |   38s   |  16s |
|sort |   45s   |  28s |
|hash |   46s   |  28s |
|no shuffle spill & lz4 | | |
| memory |   16s | 16s |
| | | |
|shuffle spill & lzf | | |
|memory|  28s   | 27s |
|sort  |  29s   | 29s |
|hash  |  41s   | 30s |
|no shuffle spill & lzf | | |
|

[jira] [Updated] (SPARK-3376) Memory-based shuffle strategy to reduce overhead of disk I/O

2014-12-10 Thread uncleGen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

uncleGen updated SPARK-3376:

Description: 
I think a memory-based shuffle can reduce some overhead of disk I/O. I just 
want to know is there any plan to do something about it. Or any suggestion 
about it. Base on the work (SPARK-2044), it is feasible to have several 
implementations of  shuffle.



Currently, there are two implementions of shuffle manager, i.e. SORT and HASH. 
Both of them will use disk in some stages. For examples, in the map side, all 
the intermediate data will be written into temporary files. In the reduce side, 
Spark will use external sort sometimes. In any case, disk I/O will bring some 
performance loss. Maybe,we can provide a pure-memory shuffle manager. In this 
shuffle manager, intermediate data will only go through memory. In some of 
scenes, it can improve performance. Experimentally, I implemented a in-memory 
shuffle manager upon SPARK-2044. Following is my testing result:

| data size (Byte)   |  partitions  |  resources |
| 5131859218  |2000   |   50 executors/ 4 cores/ 4GB |

| settings   |  operation1   | 
operation2 |
| shuffle spill & lz4 |  repartition+flatMap+groupByKey | repartition + 
groupByKey | 
|memory   |   38s   |  16s |
|sort |   45s   |  28s |
|hash |   46s   |  28s |
|no shuffle spill & lz4 | | |
| memory |   16s | 16s |
| | | |
|shuffle spill & lzf | | |
|memory|  28s   | 27s |
|sort  |  29s   | 29s |
|hash  |  41s   | 30s |
|no shuffle spill & lzf | | |
| memory |  15s | 16s |

In my implementation, I simply reused the "BlockManager" in the map-side and 
set the "spark.shuffle.spill" false in the reduce-side. All the intermediate 
data is cached in memory store. Just as Reynold Xin has pointed out, our 
disk-based shuffle manager has achieved good performance. With  parameter 
tuning, the disk-based shuffle manager will  obtain similar performance. 
However, I will continue my work and improve it. And as a alternative tuning 
option, "InMemory shuffle" is a good choice. Future works includes, but is not 
limited to:

- memory usage management in "InMemory Shuffle" mode
- data management when intermediate data can not fit in memory

Test code:

{code: borderStyle=solid}

val conf = new SparkConf().setAppName("InMemoryShuffleTest")
val sc = new SparkContext(conf)

val dataPath = args(0)
val partitions = args(1).toInt

val rdd1 = sc.textFile(dataPath).cache()
rdd1.count()
val startTime = System.currentTimeMillis()
val rdd2 = rdd1.repartition(partitions)
  .flatMap(_.split(",")).map(s => (s, s))
  .groupBy(e => e._1)
rdd2.count()
val endTime = System.currentTimeMillis()

println("time: " + (endTime - startTime) / 1000 )

{code}



  was:
I think a memory-based shuffle can reduce some overhead of disk I/O. I just 
want to know is there any plan to do something about it. Or any suggestion 
about it. Base on the work (SPARK-2044), it is feasible to have several 
implementations of  shuffle.



Currently, there are two implementions of shuffle manager, i.e. SORT and HASH. 
Both of them will use disk in some stages. For examples, in the map side, all 
the intermediate data will be written into temporary files. In the reduce side, 
Spark will use external sort sometimes. In any case, disk I/O will bring some 
performance loss. Maybe,we can provide a pure-memory shuffle manager. In this 
shuffle manager, intermediate data will only go through memory. In some of 
scenes, it can improve performance. Experimentally, I implemented a in-memory 
shuffle manager upon SPARK-2044. Following is my testing result:

| data size|  partitions  |  resources |
| 5131859218  |2000   |   50 executors/ 4 cores/ 4GB |

| settings   |  operation1   | 
operation2 |
| shuffle spill & lz4 |  repartition+flatMap+groupByKey | repartition + 
groupByKey | 
|memory   |   38s   |  16s |
|sort |   45s   |  28s |
|hash |   46s   |  28s |
|no shuffle spill & lz4 | | |
| memory |   16s | 16s |
| | | |
|shuffle spill & lzf | | |
|memory|  28s   | 27s |
|sort  |  29s   | 29s |
|hash  |  41s   | 30s |
|no shuffle spill & lzf |

[jira] [Updated] (SPARK-3376) Memory-based shuffle strategy to reduce overhead of disk I/O

2014-12-10 Thread uncleGen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

uncleGen updated SPARK-3376:

Description: 
I think a memory-based shuffle can reduce some of the overhead of disk I/O. I just
want to know whether there is any plan to do something about it, or any suggestion
about it. Based on the work in SPARK-2044, it is feasible to have several
implementations of shuffle.



Currently, there are two implementations of the shuffle manager, i.e. SORT and HASH.
Both of them use disk in some stages. For example, on the map side all of the
intermediate data is written into temporary files, and on the reduce side Spark
sometimes uses an external sort. In any case, disk I/O brings some performance
loss. Maybe we can provide a pure-memory shuffle manager in which intermediate
data only goes through memory. In some scenarios this can improve performance.
Experimentally, I implemented an in-memory shuffle manager on top of SPARK-2044.
Following are my test results:

| data size (Byte) | partitions | resources |
| 5131859218 | 2000 | 50 executors / 4 cores / 4GB |

| settings | operation1 | operation2 |
| shuffle spill & lz4 | repartition+flatMap+groupByKey | repartition+groupByKey |
| memory | 38s | 16s |
| sort | 45s | 28s |
| hash | 46s | 28s |
| no shuffle spill & lz4 | | |
| memory | 16s | 16s |
| | | |
| shuffle spill & lzf | | |
| memory | 28s | 27s |
| sort | 29s | 29s |
| hash | 41s | 30s |
| no shuffle spill & lzf | | |
| memory | 15s | 16s |

In my implementation, I simply reused the "BlockManager" on the map side and
set "spark.shuffle.spill" to false on the reduce side. All the intermediate
data is cached in the memory store. Just as Reynold Xin has pointed out, our
disk-based shuffle manager has achieved good performance, and with parameter
tuning it can obtain similar performance. However, I will continue my work and
improve it, and as an alternative tuning option, "InMemory shuffle" is a good
choice. Future work includes, but is not limited to:

- memory usage management in "InMemory Shuffle" mode
- data management when intermediate data cannot fit in memory

Test code:

{code: borderStyle=solid}

val conf = new SparkConf().setAppName("InMemoryShuffleTest")
val sc = new SparkContext(conf)

val dataPath = args(0)
val partitions = args(1).toInt

val rdd1 = sc.textFile(dataPath).cache()
rdd1.count()
val startTime = System.currentTimeMillis()
val rdd2 = rdd1.repartition(partitions)
  .flatMap(_.split(",")).map(s => (s, s))
  .groupBy(e => e._1)
rdd2.count()
val endTime = System.currentTimeMillis()

println("time: " + (endTime - startTime) / 1000 )

{code}



  was:
I think a memory-based shuffle can reduce some overhead of disk I/O. I just 
want to know is there any plan to do something about it. Or any suggestion 
about it. Base on the work (SPARK-2044), it is feasible to have several 
implementations of  shuffle.



Currently, there are two implementions of shuffle manager, i.e. SORT and HASH. 
Both of them will use disk in some stages. For examples, in the map side, all 
the intermediate data will be written into temporary files. In the reduce side, 
Spark will use external sort sometimes. In any case, disk I/O will bring some 
performance loss. Maybe,we can provide a pure-memory shuffle manager. In this 
shuffle manager, intermediate data will only go through memory. In some of 
scenes, it can improve performance. Experimentally, I implemented a in-memory 
shuffle manager upon SPARK-2044. Following is my testing result:

| data size (Byte)   |  partitions  |  resources |
| 5131859218  |2000   |   50 executors/ 4 cores/ 4GB |

| settings   |  operation1   | 
operation2 |
| shuffle spill & lz4 |  repartition+flatMap+groupByKey | repartition + 
groupByKey | 
|memory   |   38s   |  16s |
|sort |   45s   |  28s |
|hash |   46s   |  28s |
|no shuffle spill & lz4 | | |
| memory |   16s | 16s |
| | | |
|shuffle spill & lzf | | |
|memory|  28s   | 27s |
|sort  |  29s   | 29s |
|hash  |  41s   | 30s |
|no shuffle spill & 

[jira] [Updated] (SPARK-4812) SparkPlan.codegenEnabled may be initialized to a wrong value

2014-12-10 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-4812:

Description: 
The problem is that `codegenEnabled` is a `val`, but it uses a `val` 
`sqlContext`, which can be overridden by subclasses. Here is a simple example 
to show this issue.

{code}
scala> :paste
// Entering paste mode (ctrl-D to finish)

abstract class Foo {

  protected val sqlContext = "Foo"

  val codegenEnabled: Boolean = {
println(sqlContext) // it will call subclass's `sqlContext` which has not 
yet been initialized.
if (sqlContext != null) {
  true
} else {
  false
}
  }
}

class Bar extends Foo {
  override val sqlContext = "Bar"
}

println(new Bar().codegenEnabled)

// Exiting paste mode, now interpreting.

null
false
defined class Foo
defined class Bar

scala> 
{code}

To fix it, we should override `codegenEnabled` in `InMemoryColumnarTableScan`.
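
For illustration, a minimal sketch of avoiding the initialization-order problem 
on the toy example above, using a lazy val so the subclass's field is 
initialized before it is read (the real fix is the override in 
`InMemoryColumnarTableScan`):

{code}
abstract class Foo {
  protected val sqlContext = "Foo"

  // `lazy` defers evaluation until first access, by which time the subclass's
  // `sqlContext` has been initialized.
  lazy val codegenEnabled: Boolean = sqlContext != null
}

class Bar extends Foo {
  override val sqlContext = "Bar"
}

println(new Bar().codegenEnabled)   // prints true instead of reading a null field
{code}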

  was:
The problem is `codegenEnabled` is `val`, but it uses a `val` `sqlContext`, 
which can be override by subclasses. Here is a simple example to show this 
issue.

{code}
scala> :paste
// Entering paste mode (ctrl-D to finish)

abstract class Foo {

  protected val sqlContext = "Foo"

  val codegenEnabled: Boolean = {
println(sqlContext) // it will call subclass's `sqlContext` which has not 
yet been initialized.
if (sqlContext != null) {
  true
} else {
  false
}
  }
}

class Bar extends Foo {
  override val sqlContext = "Bar"
}

println(new Bar().codegenEnabled)

// Exiting paste mode, now interpreting.

null
false
defined class Foo
defined class Bar

scala> 
{code}

To fix it, we should mark `codegenEnabled` as `lazy`.


> SparkPlan.codegenEnabled may be initialized to a wrong value
> 
>
> Key: SPARK-4812
> URL: https://issues.apache.org/jira/browse/SPARK-4812
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Shixiong Zhu
>
> The problem is `codegenEnabled` is `val`, but it uses a `val` `sqlContext`, 
> which can be override by subclasses. Here is a simple example to show this 
> issue.
> {code}
> scala> :paste
> // Entering paste mode (ctrl-D to finish)
> abstract class Foo {
>   protected val sqlContext = "Foo"
>   val codegenEnabled: Boolean = {
> println(sqlContext) // it will call subclass's `sqlContext` which has not 
> yet been initialized.
> if (sqlContext != null) {
>   true
> } else {
>   false
> }
>   }
> }
> class Bar extends Foo {
>   override val sqlContext = "Bar"
> }
> println(new Bar().codegenEnabled)
> // Exiting paste mode, now interpreting.
> null
> false
> defined class Foo
> defined class Bar
> scala> 
> {code}
> To fix it, should override codegenEnabled in `InMemoryColumnarTableScan`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3376) Memory-based shuffle strategy to reduce overhead of disk I/O

2014-12-10 Thread uncleGen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

uncleGen updated SPARK-3376:

Description: 
I think a memory-based shuffle can reduce some overhead of disk I/O. I just 
want to know whether there is any plan to do something about it, or any 
suggestions. Based on the work in SPARK-2044, it is feasible to have several 
implementations of shuffle.



Currently, there are two implementations of the shuffle manager, i.e. SORT and 
HASH. Both of them use disk in some stages. For example, on the map side, all 
the intermediate data is written to temporary files; on the reduce side, Spark 
sometimes uses an external sort. In any case, disk I/O brings some performance 
loss. Maybe we can provide a pure-memory shuffle manager, in which intermediate 
data only goes through memory. In some scenarios, it can improve performance. 
Experimentally, I implemented an in-memory shuffle manager on top of SPARK-2044. 
Following are my test results:

| data size (Byte)   |  partitions  |  resources |
| 5131859218  |2000   |   50 executors/ 4 cores/ 4GB |

| settings | operation1 | operation2 |
| shuffle spill & lz4 | repartition+flatMap+groupByKey | repartition+groupByKey |
|memory   |   38s   |  16s |
|sort |   45s   |  28s |
|hash |   46s   |  28s |
|no shuffle spill & lz4 | | |
| memory |   16s | 16s |
| | | |
|shuffle spill & lzf | | |
|memory|  28s   | 27s |
|sort  |  29s   | 29s |
|hash  |  41s   | 30s |
|no shuffle spill & lzf | | |
| memory |  15s | 16s |

In my implementation, I simply reused the "BlockManager" on the map side and 
set "spark.shuffle.spill" to false on the reduce side. All the intermediate 
data is cached in the memory store. Just as Reynold Xin has pointed out, our 
disk-based shuffle manager has achieved good performance. With parameter 
tuning, the disk-based shuffle manager will obtain performance similar to the 
memory-based shuffle manager. However, I will continue my work and improve it. 
And as an alternative tuning option, "InMemory shuffle" is a good choice. 
Future work includes, but is not limited to:

- memory usage management in "InMemory Shuffle" mode
- data management when intermediate data can not fit in memory

Test code:

{code: borderStyle=solid}

val conf = new SparkConf().setAppName("InMemoryShuffleTest")
val sc = new SparkContext(conf)

val dataPath = args(0)
val partitions = args(1).toInt

val rdd1 = sc.textFile(dataPath).cache()
rdd1.count()
val startTime = System.currentTimeMillis()
val rdd2 = rdd1.repartition(partitions)
  .flatMap(_.split(",")).map(s => (s, s))
  .groupBy(e => e._1)
rdd2.count()
val endTime = System.currentTimeMillis()

println("time: " + (endTime - startTime) / 1000 )

{code}



  was:
I think a memory-based shuffle can reduce some overhead of disk I/O. I just 
want to know is there any plan to do something about it. Or any suggestion 
about it. Base on the work (SPARK-2044), it is feasible to have several 
implementations of  shuffle.



Currently, there are two implementions of shuffle manager, i.e. SORT and HASH. 
Both of them will use disk in some stages. For examples, in the map side, all 
the intermediate data will be written into temporary files. In the reduce side, 
Spark will use external sort sometimes. In any case, disk I/O will bring some 
performance loss. Maybe,we can provide a pure-memory shuffle manager. In this 
shuffle manager, intermediate data will only go through memory. In some of 
scenes, it can improve performance. Experimentally, I implemented a in-memory 
shuffle manager upon SPARK-2044. Following is my testing result:

| data size (Byte)   |  partitions  |  resources |
| 5131859218  |2000   |   50 executors/ 4 cores/ 4GB |

| settings   |  operation1   | 
operation2 |
| shuffle spill & lz4 |  repartition+flatMap+groupByKey | repartition + 
groupByKey | 
|memory   |   38s   |  16s |
|sort |   45s   |  28s |
|hash |   46s   |  28s |
|no shuffle spill & lz4 | | |
| memory |   16s | 16s |
| | | |
|shuffle spill & lzf | | |
|memory|  28s   | 27s |
|sort  |  29s   | 29s |
|hash  |  41s

[jira] [Commented] (SPARK-3376) Memory-based shuffle strategy to reduce overhead of disk I/O

2014-12-10 Thread uncleGen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14240985#comment-14240985
 ] 

uncleGen commented on SPARK-3376:
-

[~rxin] Yeah, I agree with you. We can improve I/O performance (disk I/O and 
network I/O) through both hardware and software. With limited hardware 
resources, we can provide a software-level way to achieve similar performance. 
Maybe it is a good choice to provide an alternative “memory-based” shuffle 
option.

> Memory-based shuffle strategy to reduce overhead of disk I/O
> 
>
> Key: SPARK-3376
> URL: https://issues.apache.org/jira/browse/SPARK-3376
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle
>Affects Versions: 1.1.0
>Reporter: uncleGen
>Priority: Trivial
>  Labels: performance
>
> I think a memory-based shuffle can reduce some overhead of disk I/O. I just 
> want to know is there any plan to do something about it. Or any suggestion 
> about it. Base on the work (SPARK-2044), it is feasible to have several 
> implementations of  shuffle.
> 
> Currently, there are two implementions of shuffle manager, i.e. SORT and 
> HASH. Both of them will use disk in some stages. For examples, in the map 
> side, all the intermediate data will be written into temporary files. In the 
> reduce side, Spark will use external sort sometimes. In any case, disk I/O 
> will bring some performance loss. Maybe,we can provide a pure-memory shuffle 
> manager. In this shuffle manager, intermediate data will only go through 
> memory. In some of scenes, it can improve performance. Experimentally, I 
> implemented a in-memory shuffle manager upon SPARK-2044. Following is my 
> testing result:
> | data size (Byte)   |  partitions  |  resources |
> | 5131859218  |2000   |   50 executors/ 4 cores/ 4GB |
> | settings   |  operation1   | 
> operation2 |
> | shuffle spill & lz4 |  repartition+flatMap+groupByKey | repartition + 
> groupByKey | 
> |memory   |   38s   |  16s |
> |sort |   45s   |  28s |
> |hash |   46s   |  28s |
> |no shuffle spill & lz4 | | |
> | memory |   16s | 16s |
> | | | |
> |shuffle spill & lzf | | |
> |memory|  28s   | 27s |
> |sort  |  29s   | 29s |
> |hash  |  41s   | 30s |
> |no shuffle spill & lzf | | |
> | memory |  15s | 16s |
> In my implementation, I simply reused the "BlockManager" in the map-side and 
> set the "spark.shuffle.spill" false in the reduce-side. All the intermediate 
> data is cached in memory store. Just as Reynold Xin has pointed out, our 
> disk-based shuffle manager has achieved a good performance. With  parameter 
> tuning, the disk-based shuffle manager will  obtain similar performance as 
> memory-based shuffle manager. However, I will continue my work and improve 
> it. And as an alternative tuning option, "InMemory shuffle" is a good choice. 
> Future work includes, but is not limited to:
> - memory usage management in "InMemory Shuffle" mode
> - data management when intermediate data can not fit in memory
> Test code:
> {code: borderStyle=solid}
> val conf = new SparkConf().setAppName("InMemoryShuffleTest")
> val sc = new SparkContext(conf)
> val dataPath = args(0)
> val partitions = args(1).toInt
> val rdd1 = sc.textFile(dataPath).cache()
> rdd1.count()
> val startTime = System.currentTimeMillis()
> val rdd2 = rdd1.repartition(partitions)
>   .flatMap(_.split(",")).map(s => (s, s))
>   .groupBy(e => e._1)
> rdd2.count()
> val endTime = System.currentTimeMillis()
> println("time: " + (endTime - startTime) / 1000 )
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3376) Memory-based shuffle strategy to reduce overhead of disk I/O

2014-12-10 Thread uncleGen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

uncleGen updated SPARK-3376:

Priority: Minor  (was: Trivial)

> Memory-based shuffle strategy to reduce overhead of disk I/O
> 
>
> Key: SPARK-3376
> URL: https://issues.apache.org/jira/browse/SPARK-3376
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle
>Affects Versions: 1.1.0
>Reporter: uncleGen
>Priority: Minor
>  Labels: performance
>
> I think a memory-based shuffle can reduce some overhead of disk I/O. I just 
> want to know is there any plan to do something about it. Or any suggestion 
> about it. Base on the work (SPARK-2044), it is feasible to have several 
> implementations of  shuffle.
> 
> Currently, there are two implementions of shuffle manager, i.e. SORT and 
> HASH. Both of them will use disk in some stages. For examples, in the map 
> side, all the intermediate data will be written into temporary files. In the 
> reduce side, Spark will use external sort sometimes. In any case, disk I/O 
> will bring some performance loss. Maybe,we can provide a pure-memory shuffle 
> manager. In this shuffle manager, intermediate data will only go through 
> memory. In some of scenes, it can improve performance. Experimentally, I 
> implemented a in-memory shuffle manager upon SPARK-2044. Following is my 
> testing result:
> | data size (Byte)   |  partitions  |  resources |
> | 5131859218  |2000   |   50 executors/ 4 cores/ 4GB |
> | settings   |  operation1   | 
> operation2 |
> | shuffle spill & lz4 |  repartition+flatMap+groupByKey | repartition + 
> groupByKey | 
> |memory   |   38s   |  16s |
> |sort |   45s   |  28s |
> |hash |   46s   |  28s |
> |no shuffle spill & lz4 | | |
> | memory |   16s | 16s |
> | | | |
> |shuffle spill & lzf | | |
> |memory|  28s   | 27s |
> |sort  |  29s   | 29s |
> |hash  |  41s   | 30s |
> |no shuffle spill & lzf | | |
> | memory |  15s | 16s |
> In my implementation, I simply reused the "BlockManager" in the map-side and 
> set the "spark.shuffle.spill" false in the reduce-side. All the intermediate 
> data is cached in memory store. Just as Reynold Xin has pointed out, our 
> disk-based shuffle manager has achieved a good performance. With  parameter 
> tuning, the disk-based shuffle manager will  obtain similar performance as 
> memory-based shuffle manager. However, I will continue my work and improve 
> it. And as an alternative tuning option, "InMemory shuffle" is a good choice. 
> Future work includes, but is not limited to:
> - memory usage management in "InMemory Shuffle" mode
> - data management when intermediate data can not fit in memory
> Test code:
> {code: borderStyle=solid}
> val conf = new SparkConf().setAppName("InMemoryShuffleTest")
> val sc = new SparkContext(conf)
> val dataPath = args(0)
> val partitions = args(1).toInt
> val rdd1 = sc.textFile(dataPath).cache()
> rdd1.count()
> val startTime = System.currentTimeMillis()
> val rdd2 = rdd1.repartition(partitions)
>   .flatMap(_.split(",")).map(s => (s, s))
>   .groupBy(e => e._1)
> rdd2.count()
> val endTime = System.currentTimeMillis()
> println("time: " + (endTime - startTime) / 1000 )
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4811) Custom UDTFs not working in Spark SQL

2014-12-10 Thread Saurabh Santhosh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saurabh Santhosh updated SPARK-4811:

Priority: Critical  (was: Major)

> Custom UDTFs not working in Spark SQL
> -
>
> Key: SPARK-4811
> URL: https://issues.apache.org/jira/browse/SPARK-4811
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0, 1.1.1
>Reporter: Saurabh Santhosh
>Priority: Critical
> Fix For: 1.2.0
>
>
> I am using the Thrift server interface to Spark SQL and using beeline to 
> connect to it.
> I tried Spark SQL versions 1.1.0 and 1.1.1 and both are throwing the 
> following exception when using any custom UDTF.
> These are the steps I did:
> *Created a UDTF 'com.x.y.xxx'.*
> Registered the UDTF using the following query: 
> *create temporary function xxx as 'com.x.y.xxx'*
> The registration went through without any errors. But when I tried executing 
> the UDTF I got the following error:
> *java.lang.ClassNotFoundException: xxx*
> The funny thing is that it's trying to load the function name instead of the 
> function class. The exception is at *line no: 81 in hiveudfs.scala*.
> I have been at it for quite a long time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4817) [streaming]Print the specified number of data and handle all of the elements in RDD

2014-12-10 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241001#comment-14241001
 ] 

Sean Owen commented on SPARK-4817:
--

I'm not sure if this is what you're looking for, but you can always call 
{{foreachRDD}}, and operate on all of the RDD, and then call {{take}} on the 
RDD to get a few elements to print. 

Your PR reimplements the same thing less efficiently.

But this is not what your example above does. Neither prints the "top" 
elements. Did you mean "first"?
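
(For reference, a minimal sketch of the {{foreachRDD}} + {{take}} approach 
suggested above; the element type and the {{updateDatabase}} helper are 
assumptions for illustration:)

{code}
import org.apache.spark.streaming.dstream.DStream

def updateDatabase(record: String): Unit = ???   // hypothetical per-record side effect

// Operate on every element of each batch RDD, then take() only a few
// elements back to the driver for printing.
def printSome(dstream: DStream[String], num: Int = 20): Unit = {
  dstream.foreachRDD { rdd =>
    rdd.foreach(updateDatabase)          // handle all of the elements
    rdd.take(num).foreach(println)       // print only the first `num`
  }
}
{code}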

> [streaming]Print the specified number of data and handle all of the elements 
> in RDD
> ---
>
> Key: SPARK-4817
> URL: https://issues.apache.org/jira/browse/SPARK-4817
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: 宿荣全
>Priority: Minor
>
> The DStream.print function prints 10 elements but handles 11 elements.
> A new function based on DStream.print is proposed:
> print the specified number of elements while handling all of the elements in 
> the RDD.
> There is a typical work scenario:
> val dstream = stream.map->filter->mapPartitions->print
> The data after the filter needs to update a database in mapPartitions, but 
> there is no need to print each record; only the first 20 need to be printed 
> to inspect the data processing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4816) Maven profile netlib-lgpl does not work

2014-12-10 Thread Guillaume Pitel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241015#comment-14241015
 ] 

Guillaume Pitel commented on SPARK-4816:


Never mind, the problem does not occur in the v1.1.1 tag, only in the sources 
distributed as spark-1.1.1.tgz.

I should have used git to check, sorry.



> Maven profile netlib-lgpl does not work
> ---
>
> Key: SPARK-4816
> URL: https://issues.apache.org/jira/browse/SPARK-4816
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.1.0
> Environment: maven 3.0.5 / Ubuntu
>Reporter: Guillaume Pitel
>Priority: Minor
> Fix For: 1.1.1
>
>
> When doing what the documentation recommends to recompile Spark with Netlib 
> Native system binding (i.e. to bind with openblas or, in my case, MKL), 
> mvn -Pnetlib-lgpl -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests 
> clean package
> The resulting assembly jar still lacked the netlib-system class. (I checked 
> the content of spark-assembly...jar)
> When forcing the netlib-lgpl profile in MLLib package to be active, the jar 
> is correctly built.
> So I guess it's a problem with the way Maven passes profile activations to 
> child modules.
> Also, despite the documentation claiming that if the job's jar contains 
> netlib with the necessary bindings it should work, it does not. The 
> classloader must be unhappy with two occurrences of netlib?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-4816) Maven profile netlib-lgpl does not work

2014-12-10 Thread Guillaume Pitel (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guillaume Pitel closed SPARK-4816.
--
   Resolution: Fixed
Fix Version/s: 1.1.1

> Maven profile netlib-lgpl does not work
> ---
>
> Key: SPARK-4816
> URL: https://issues.apache.org/jira/browse/SPARK-4816
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.1.0
> Environment: maven 3.0.5 / Ubuntu
>Reporter: Guillaume Pitel
>Priority: Minor
> Fix For: 1.1.1
>
>
> When doing what the documentation recommends to recompile Spark with Netlib 
> Native system binding (i.e. to bind with openblas or, in my case, MKL), 
> mvn -Pnetlib-lgpl -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests 
> clean package
> The resulting assembly jar still lacked the netlib-system class. (I checked 
> the content of spark-assembly...jar)
> When forcing the netlib-lgpl profile in MLLib package to be active, the jar 
> is correctly built.
> So I guess it's a problem with the way Maven passes profile activations to 
> child modules.
> Also, despite the documentation claiming that if the job's jar contains 
> netlib with the necessary bindings it should work, it does not. The 
> classloader must be unhappy with two occurrences of netlib?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4814) Enable assertions in SBT, Maven tests / AssertionError from Hive's LazyBinaryInteger

2014-12-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-4814:
-
Component/s: SQL
Summary: Enable assertions in SBT, Maven tests / AssertionError from 
Hive's LazyBinaryInteger  (was: Enable assertions in SBT, Maven tests)

CC [~lian cheng] as someone who might understand the {{AssertionError}} above. 
It's coming from Hive and I don't know if it's anything we are doing wrong. But 
unfortunately the test fails when assertions are on as a result of this. The 
changes to enable assertions themselves are trivial.

> Enable assertions in SBT, Maven tests / AssertionError from Hive's 
> LazyBinaryInteger
> 
>
> Key: SPARK-4814
> URL: https://issues.apache.org/jira/browse/SPARK-4814
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.1.0
>Reporter: Sean Owen
>
> Follow up to SPARK-4159, wherein we noticed that Java tests weren't running 
> in Maven, in part because a Java test actually fails with {{AssertionError}}. 
> That code/test was fixed in SPARK-4850.
> The reason it wasn't caught by SBT tests was that they don't run with 
> assertions on, and Maven's surefire does.
> Turning on assertions in the SBT build is trivial, adding one line:
> {code}
> javaOptions in Test += "-ea",
> {code}
> This reveals a test failure in Scala test suites though:
> {code}
> [info] - alter_merge_2 *** FAILED *** (1 second, 305 milliseconds)
> [info]   Failed to execute query using catalyst:
> [info]   Error: Job aborted due to stage failure: Task 1 in stage 551.0 
> failed 1 times, most recent failure: Lost task 1.0 in stage 551.0 (TID 1532, 
> localhost): java.lang.AssertionError
> [info]at 
> org.apache.hadoop.hive.serde2.lazybinary.LazyBinaryInteger.init(LazyBinaryInteger.java:51)
> [info]at 
> org.apache.hadoop.hive.serde2.columnar.ColumnarStructBase$FieldInfo.uncheckedGetField(ColumnarStructBase.java:110)
> [info]at 
> org.apache.hadoop.hive.serde2.columnar.ColumnarStructBase.getField(ColumnarStructBase.java:171)
> [info]at 
> org.apache.hadoop.hive.serde2.objectinspector.ColumnarStructObjectInspector.getStructFieldData(ColumnarStructObjectInspector.java:166)
> [info]at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$1.apply(TableReader.scala:318)
> [info]at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$1.apply(TableReader.scala:314)
> [info]at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> [info]at 
> org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:132)
> [info]at 
> org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:128)
> [info]at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:615)
> [info]at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:615)
> [info]at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> [info]at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:264)
> [info]at org.apache.spark.rdd.RDD.iterator(RDD.scala:231)
> [info]at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> [info]at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:264)
> [info]at org.apache.spark.rdd.RDD.iterator(RDD.scala:231)
> [info]at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
> [info]at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> [info]at org.apache.spark.scheduler.Task.run(Task.scala:56)
> [info]at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:195)
> [info]at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> [info]at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> [info]at java.lang.Thread.run(Thread.java:745)
> {code}
> The items for this JIRA are therefore:
> - Enable assertions in SBT
> - Fix this failure
> - Figure out why Maven scalatest didn't trigger it - may need assertions 
> explicitly turned on too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4740) Netty's network throughput is about 1/2 of NIO's in spark-perf sortByKey

2014-12-10 Thread Zhang, Liye (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241206#comment-14241206
 ] 

Zhang, Liye commented on SPARK-4740:


Hi [~adav], [~rxin], I ran the test with the latest master branch today, in 
which rxin's patch is merged.

On my 4-node cluster with 48 cores per node, I set *spark.local.dir* to one 
tmpfs (ramdisk) dir; the ramdisk size is 136GB to make sure there is enough 
room for shuffle (total shuffle write 284GB, total shuffle read 213GB), and 
*spark.executor.memory* is set to 48GB. This eliminates the disk I/O effect. 
Still with the 400GB data set, the test result shows Netty is better than NIO 
(reduce time *Netty: 24mins* vs *NIO: 26mins*).

Also, I retested with 8 HDDs, keeping *spark.executor.memory* at 48GB and 
setting *spark.local.dir* to 8 HDD dirs. The result is about the same as 
before, that is, NIO outperforms Netty (reduce time *Netty: 32mins* vs 
*NIO: 25mins*). And in the Netty test, the imbalance still exists: the best 
executor finished 308 tasks, while the worst executor finished only 222 tasks.

It seems NIO is not affected by whether it is HDD or ramdisk, while Netty is 
more sensitive to HDD.

For now, maybe we can narrow the problem down to the different behavior of 
Netty and NIO on disk operations.
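
(For reference, a minimal sketch of the settings described above; the tmpfs 
mount path is an assumption:)

{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.memory", "48g")
  .set("spark.local.dir", "/mnt/ramdisk/spark")          // tmpfs dir, or 8 HDD dirs comma-separated
  .set("spark.shuffle.blockTransferService", "netty")    // or "nio" for comparison
{code}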

> Netty's network throughput is about 1/2 of NIO's in spark-perf sortByKey
> 
>
> Key: SPARK-4740
> URL: https://issues.apache.org/jira/browse/SPARK-4740
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 1.2.0
>Reporter: Zhang, Liye
>Assignee: Reynold Xin
> Attachments: (rxin patch better executor)TestRunner  sort-by-key - 
> Thread dump for executor 3_files.zip, (rxin patch normal executor)TestRunner  
> sort-by-key - Thread dump for executor 0 _files.zip, Spark-perf Test Report 
> 16 Cores per Executor.pdf, Spark-perf Test Report.pdf, TestRunner  
> sort-by-key - Thread dump for executor 1_files (Netty-48 Cores per node).zip, 
> TestRunner  sort-by-key - Thread dump for executor 1_files (Nio-48 cores per 
> node).zip, rxin_patch-on_4_node_cluster_48CoresPerNode(Unbalance).7z
>
>
> When testing current spark master (1.3.0-snapshot) with spark-perf 
> (sort-by-key, aggregate-by-key, etc), Netty based shuffle transferService 
> takes much longer time than NIO based shuffle transferService. The network 
> throughput of Netty is only about half of that of NIO. 
> We tested with standalone mode, and the data set we used for test is 20 
> billion records, and the total size is about 400GB. Spark-perf test is 
> Running on a 4 node cluster with 10G NIC, 48 cpu cores per node and each 
> executor memory is 64GB. The reduce tasks number is set to 1000. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-1146) Vagrant to setup Spark cluster locally

2014-12-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-1146.
--
Resolution: Won't Fix

(Warning: I've developed a little script to help find JIRAs whose PRs are 
resolved one way or the other, and so should be resolved. There may be a number 
of these coming in the next day or two.)

The discussion in the PR indicates this will be a separate project if anything, 
currently hosted at https://github.com/ngbinh/spark-vagrant

> Vagrant to setup Spark cluster locally
> --
>
> Key: SPARK-1146
> URL: https://issues.apache.org/jira/browse/SPARK-1146
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Project Infra
>Affects Versions: 0.9.0
>Reporter: Binh Nguyen
>  Labels: script
>
> We should use Vagrant to create a local clusters of VMs. It will allow 
> developers run and test Spark Cluster on their dev machines.
> It could be expanded to YARN and Mesos cluster mode but initial focus will be 
> on standalone.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-1127) Add saveAsHBase to PairRDDFunctions

2014-12-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-1127.
--
   Resolution: Won't Fix
Fix Version/s: (was: 1.2.0)

Given the discussion in both PRs, this looks like a WontFix, and consensus was 
it should proceed in a separate project. Questions about SchemaRDD and 1.2 
sound like a new topic.

> Add saveAsHBase to PairRDDFunctions
> ---
>
> Key: SPARK-1127
> URL: https://issues.apache.org/jira/browse/SPARK-1127
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: haosdent huang
>Assignee: haosdent huang
>
> Support to save data in HBase.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-1385) Use existing code-path for JSON de/serialization of BlockId

2014-12-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-1385.
--
Resolution: Fixed

PR is https://github.com/apache/spark/pull/289. This was merged in 
https://github.com/apache/spark/commit/de8eefa804e229635eaa29a78b9e9ce161ac58e1

> Use existing code-path for JSON de/serialization of BlockId
> ---
>
> Key: SPARK-1385
> URL: https://issues.apache.org/jira/browse/SPARK-1385
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 0.9.0, 0.9.1
>Reporter: Andrew Or
>Priority: Minor
> Fix For: 1.0.0
>
>
> BlockId.scala already takes care of JSON de/serialization by parsing the 
> string to and from regex. This functionality is currently duplicated in 
> util/JsonProtocol.scala.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-1380) Add sort-merge based cogroup/joins.

2014-12-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-1380.
--
Resolution: Won't Fix

The PR discussion suggests this is WontFix.

> Add sort-merge based cogroup/joins.
> ---
>
> Key: SPARK-1380
> URL: https://issues.apache.org/jira/browse/SPARK-1380
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Takuya Ueshin
>
> I've written cogroup/joins based on 'Sort-Merge' algorithm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1338) Create Additional Style Rules

2014-12-10 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241297#comment-14241297
 ] 

Sean Owen commented on SPARK-1338:
--

The PR for this was abandoned. What's the thinking on these style-rule JIRAs? 
There are a number still open.

> Create Additional Style Rules
> -
>
> Key: SPARK-1338
> URL: https://issues.apache.org/jira/browse/SPARK-1338
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Reporter: Patrick Wendell
>Assignee: Prashant Sharma
> Fix For: 1.2.0
>
>
> There are a few other rules that would be helpful to have. Also we should add 
> tests for these rules because it's easy to get them wrong. I gave some 
> example comparisons from a javascript style checker.
> Require spaces in type declarations:
> def foo:String = X // no
> def foo: String = XXX
> def x:Int = 100 // no
> val x: Int = 100
> Require spaces after keywords:
> if(x - 3) // no
> if (x + 10)
> See: requireSpaceAfterKeywords from
> https://github.com/mdevils/node-jscs
> Disallow spaces inside of parentheses:
> val x = ( 3 + 5 ) // no
> val x = (3 + 5)
> See: disallowSpacesInsideParentheses from
> https://github.com/mdevils/node-jscs
> Require spaces before and after binary operators:
> See: requireSpaceBeforeBinaryOperators
> See: disallowSpaceAfterBinaryOperators
> from https://github.com/mdevils/node-jscs



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4746) integration tests should be separated from faster unit tests

2014-12-10 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid updated SPARK-4746:

Summary: integration tests should be separated from faster unit tests  
(was: integration tests should be seseparated from faster unit tests)

> integration tests should be separated from faster unit tests
> 
>
> Key: SPARK-4746
> URL: https://issues.apache.org/jira/browse/SPARK-4746
> Project: Spark
>  Issue Type: Bug
>Reporter: Imran Rashid
>Priority: Trivial
>
> Currently there isn't a good way for a developer to skip the longer 
> integration tests.  This can slow down local development.  See 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Spurious-test-failures-testing-best-practices-td9560.html
> One option is to use scalatest's notion of test tags to tag all integration 
> tests, so they could easily be skipped



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2892) Socket Receiver does not stop when streaming context is stopped

2014-12-10 Thread Mark Fisher (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241362#comment-14241362
 ] 

Mark Fisher commented on SPARK-2892:


[~srowen] SPARK-4802 is only related to the receiverInfo not being removed. 
This issue is actually much more critical, given that Receivers do not seem to 
stop other than in local mode. Please reopen.


> Socket Receiver does not stop when streaming context is stopped
> ---
>
> Key: SPARK-2892
> URL: https://issues.apache.org/jira/browse/SPARK-2892
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.0.2
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Critical
>
> Running NetworkWordCount with
> {quote}  
> ssc.start(); Thread.sleep(1); ssc.stop(stopSparkContext = false); 
> Thread.sleep(6)
> {quote}
> gives the following error
> {quote}
> 14/08/06 18:37:13 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) 
> in 10047 ms on localhost (1/1)
> 14/08/06 18:37:13 INFO DAGScheduler: Stage 0 (runJob at 
> ReceiverTracker.scala:275) finished in 10.056 s
> 14/08/06 18:37:13 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks 
> have all completed, from pool
> 14/08/06 18:37:13 INFO SparkContext: Job finished: runJob at 
> ReceiverTracker.scala:275, took 10.179263 s
> 14/08/06 18:37:13 INFO ReceiverTracker: All of the receivers have been 
> terminated
> 14/08/06 18:37:13 WARN ReceiverTracker: All of the receivers have not 
> deregistered, Map(0 -> 
> ReceiverInfo(0,SocketReceiver-0,null,false,localhost,Stopped by driver,))
> 14/08/06 18:37:13 INFO ReceiverTracker: ReceiverTracker stopped
> 14/08/06 18:37:13 INFO JobGenerator: Stopping JobGenerator immediately
> 14/08/06 18:37:13 INFO RecurringTimer: Stopped timer for JobGenerator after 
> time 1407375433000
> 14/08/06 18:37:13 INFO JobGenerator: Stopped JobGenerator
> 14/08/06 18:37:13 INFO JobScheduler: Stopped JobScheduler
> 14/08/06 18:37:13 INFO StreamingContext: StreamingContext stopped successfully
> 14/08/06 18:37:43 INFO SocketReceiver: Stopped receiving
> 14/08/06 18:37:43 INFO SocketReceiver: Closed socket to localhost:
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4818) Join operation should use iterator/lazy evaluation

2014-12-10 Thread Johannes Simon (JIRA)
Johannes Simon created SPARK-4818:
-

 Summary: Join operation should use iterator/lazy evaluation
 Key: SPARK-4818
 URL: https://issues.apache.org/jira/browse/SPARK-4818
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.1.1
Reporter: Johannes Simon
Priority: Minor


The current implementation of the join operation does not use an iterator (i.e. 
lazy evaluation), causing it to explicitly evaluate the co-grouped values. In 
big data applications, these value collections can be very large. This causes 
the *cartesian product of all co-grouped values* for a specific key of both 
RDDs to be kept in memory during the flatMapValues operation, resulting in an 
*O(size(pair._1)*size(pair._2))* memory consumption instead of *O(1)*. Very 
large value collections will therefore cause "GC overhead limit exceeded" 
exceptions and fail the task, or at least slow down execution dramatically.

{code:title=PairRDDFunctions.scala|borderStyle=solid}
//...
def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))] = {
  this.cogroup(other, partitioner).flatMapValues( pair =>
for (v <- pair._1; w <- pair._2) yield (v, w)
  )
}
//...
{code}

Since cogroup returns an Iterable instance of an Array, the join implementation 
could be changed to the following, which uses lazy evaluation instead, and has 
almost no memory overhead:
{code:title=PairRDDFunctions.scala|borderStyle=solid}
//...
def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))] = {
  this.cogroup(other, partitioner).flatMapValues( pair =>
for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, w)
  )
}
//...
{code}

Alternatively, if the current implementation is intentionally not using lazy 
evaluation for some reason, there could be a *lazyJoin()* method next to the 
original join implementation that utilizes lazy evaluation. This of course 
applies to other join operations as well.

Thanks! :)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4746) integration tests should be separated from faster unit tests

2014-12-10 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241383#comment-14241383
 ] 

Imran Rashid commented on SPARK-4746:
-

I think this can be done two ways: (1) by moving all the integration tests into 
a separate folder (e.g. core/src/integ-test).  I don't actually know how to make 
that work in sbt & maven, but I'm hoping it won't be too complicated.

or (2) we use scalatest's test tags for integration tests, so they can be 
skipped.

http://scalatest.org/user_guide/tagging_your_tests

Test tags have the advantage that you can have multiple tags, and you can group 
them, etc., so you get more flexibility in what you choose to run.  This 
could be useful later, e.g. we might want to tag performance tests as separate 
from correctness tests, etc.  We don't need to do all that now, but this would 
open the door for it at least.

I have some experience w/ using test tags.  If we want to use that approach, I 
can assign to myself and work on a PR.
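
As a sketch of option (2), assuming we define our own tag object (the tag name 
here is only an example):

{code}
import org.scalatest.{FunSuite, Tag}

object IntegrationTest extends Tag("org.apache.spark.tags.IntegrationTest")

class ExampleSuite extends FunSuite {
  test("long-running end-to-end check", IntegrationTest) {
    // slow integration test body
  }

  test("fast unit-level check") {
    // runs in the default, untagged set
  }
}
{code}

Tagged tests can then be excluded in a local run, e.g. by passing scalatest's 
{{-l org.apache.spark.tags.IntegrationTest}} argument.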

> integration tests should be separated from faster unit tests
> 
>
> Key: SPARK-4746
> URL: https://issues.apache.org/jira/browse/SPARK-4746
> Project: Spark
>  Issue Type: Bug
>Reporter: Imran Rashid
>Priority: Trivial
>
> Currently there isn't a good way for a developer to skip the longer 
> integration tests.  This can slow down local development.  See 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Spurious-test-failures-testing-best-practices-td9560.html
> One option is to use scalatest's notion of test tags to tag all integration 
> tests, so they could easily be skipped



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3607) ConnectionManager threads.max configs on the thread pools don't work

2014-12-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241431#comment-14241431
 ] 

Apache Spark commented on SPARK-3607:
-

User 'ilganeli' has created a pull request for this issue:
https://github.com/apache/spark/pull/3664

> ConnectionManager threads.max configs on the thread pools don't work
> 
>
> Key: SPARK-3607
> URL: https://issues.apache.org/jira/browse/SPARK-3607
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Thomas Graves
>Priority: Minor
>
> In the ConnectionManager we have a bunch of thread pools. They have settings 
> for the maximum number of threads for each thread pool (like 
> spark.core.connection.handler.threads.max). 
> Those configs don't work because it's using an unbounded queue. From the 
> ThreadPoolExecutor javadoc: no more than corePoolSize threads will ever 
> be created. (And the value of the maximumPoolSize therefore doesn't have any 
> effect.)
> Luckily this doesn't matter too much, as you can work around it by just 
> increasing the minimum, like spark.core.connection.handler.threads.min. 
> These configs aren't documented either, so it's more of an internal thing 
> when someone is reading the code.
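
For illustration, a minimal sketch of the unbounded-queue behavior described 
above (not Spark code): with a LinkedBlockingQueue, a ThreadPoolExecutor never 
grows past corePoolSize, so maximumPoolSize has no effect.

{code}
import java.util.concurrent.{LinkedBlockingQueue, ThreadPoolExecutor, TimeUnit}

// With an unbounded work queue, new tasks are queued rather than triggering
// thread creation, so the pool never grows past corePoolSize and
// maximumPoolSize (64 here) is effectively ignored.
val pool = new ThreadPoolExecutor(
  4,                                   // corePoolSize - the only bound that matters here
  64,                                  // maximumPoolSize - never reached
  60L, TimeUnit.SECONDS,
  new LinkedBlockingQueue[Runnable]()  // unbounded queue
)
{code}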



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1037) the name of findTaskFromList & findTask in TaskSetManager.scala is confusing

2014-12-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241442#comment-14241442
 ] 

Apache Spark commented on SPARK-1037:
-

User 'ilganeli' has created a pull request for this issue:
https://github.com/apache/spark/pull/3665

> the name of findTaskFromList & findTask in TaskSetManager.scala is confusing
> 
>
> Key: SPARK-1037
> URL: https://issues.apache.org/jira/browse/SPARK-1037
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 0.9.1, 1.0.0
>Reporter: Nan Zhu
>Priority: Minor
>  Labels: starter
>
> The names of these two functions are confusing. 
> Though in the comments the author says that the method "dequeues" tasks 
> from the list, the name does not explicitly indicate that the method will 
> mutate its parameter, as in:
> private def findTaskFromList(list: ArrayBuffer[Int]): Option[Int] = {
> while (!list.isEmpty) {
>   val index = list.last
>   list.trimEnd(1)
>   if (copiesRunning(index) == 0 && !successful(index)) {
> return Some(index)
>   }
> }
> None
>   }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4569) Rename "externalSorting" in Aggregator

2014-12-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241455#comment-14241455
 ] 

Apache Spark commented on SPARK-4569:
-

User 'ilganeli' has created a pull request for this issue:
https://github.com/apache/spark/pull/3666

> Rename "externalSorting" in Aggregator
> --
>
> Key: SPARK-4569
> URL: https://issues.apache.org/jira/browse/SPARK-4569
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 1.2.0
>Reporter: Sandy Ryza
>Priority: Trivial
>
> While technically all spilling in Spark does result in sorting, calling this 
> variable externalSorting makes it seem like ExternalSorter will be used, when 
> in fact it just means whether spilling is enabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4740) Netty's network throughput is about 1/2 of NIO's in spark-perf sortByKey

2014-12-10 Thread Aaron Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241463#comment-14241463
 ] 

Aaron Davidson commented on SPARK-4740:
---

Clarification: The merged version of Reynold's patch has connectionsPerPeer set 
to 1, since we could not demonstrate a significant improvement with other 
values. In your test with HDDs, did you have it set, or was it using the 
default value?

> Netty's network throughput is about 1/2 of NIO's in spark-perf sortByKey
> 
>
> Key: SPARK-4740
> URL: https://issues.apache.org/jira/browse/SPARK-4740
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 1.2.0
>Reporter: Zhang, Liye
>Assignee: Reynold Xin
> Attachments: (rxin patch better executor)TestRunner  sort-by-key - 
> Thread dump for executor 3_files.zip, (rxin patch normal executor)TestRunner  
> sort-by-key - Thread dump for executor 0 _files.zip, Spark-perf Test Report 
> 16 Cores per Executor.pdf, Spark-perf Test Report.pdf, TestRunner  
> sort-by-key - Thread dump for executor 1_files (Netty-48 Cores per node).zip, 
> TestRunner  sort-by-key - Thread dump for executor 1_files (Nio-48 cores per 
> node).zip, rxin_patch-on_4_node_cluster_48CoresPerNode(Unbalance).7z
>
>
> When testing current spark master (1.3.0-snapshot) with spark-perf 
> (sort-by-key, aggregate-by-key, etc), Netty based shuffle transferService 
> takes much longer time than NIO based shuffle transferService. The network 
> throughput of Netty is only about half of that of NIO. 
> We tested with standalone mode, and the data set we used for test is 20 
> billion records, and the total size is about 400GB. Spark-perf test is 
> Running on a 4 node cluster with 10G NIC, 48 cpu cores per node and each 
> executor memory is 64GB. The reduce tasks number is set to 1000. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4740) Netty's network throughput is about 1/2 of NIO's in spark-perf sortByKey

2014-12-10 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241523#comment-14241523
 ] 

Reynold Xin commented on SPARK-4740:


Also [~jerryshao], when I asked you to disable transferTo, the code you pasted 
still returned a ByteBuffer, which wouldn't work in Netty. Was the code you 
pasted here different from what was compiled?


Can you try this patch, which disables transferTo? 
https://github.com/apache/spark/pull/3667


> Netty's network throughput is about 1/2 of NIO's in spark-perf sortByKey
> 
>
> Key: SPARK-4740
> URL: https://issues.apache.org/jira/browse/SPARK-4740
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 1.2.0
>Reporter: Zhang, Liye
>Assignee: Reynold Xin
> Attachments: (rxin patch better executor)TestRunner  sort-by-key - 
> Thread dump for executor 3_files.zip, (rxin patch normal executor)TestRunner  
> sort-by-key - Thread dump for executor 0 _files.zip, Spark-perf Test Report 
> 16 Cores per Executor.pdf, Spark-perf Test Report.pdf, TestRunner  
> sort-by-key - Thread dump for executor 1_files (Netty-48 Cores per node).zip, 
> TestRunner  sort-by-key - Thread dump for executor 1_files (Nio-48 cores per 
> node).zip, rxin_patch-on_4_node_cluster_48CoresPerNode(Unbalance).7z
>
>
> When testing current spark master (1.3.0-snapshot) with spark-perf 
> (sort-by-key, aggregate-by-key, etc), Netty based shuffle transferService 
> takes much longer time than NIO based shuffle transferService. The network 
> throughput of Netty is only about half of that of NIO. 
> We tested with standalone mode, and the data set we used for test is 20 
> billion records, and the total size is about 400GB. Spark-perf test is 
> Running on a 4 node cluster with 10G NIC, 48 cpu cores per node and each 
> executor memory is 64GB. The reduce tasks number is set to 1000. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4740) Netty's network throughput is about 1/2 of NIO's in spark-perf sortByKey

2014-12-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241524#comment-14241524
 ] 

Apache Spark commented on SPARK-4740:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/3667

> Netty's network throughput is about 1/2 of NIO's in spark-perf sortByKey
> 
>
> Key: SPARK-4740
> URL: https://issues.apache.org/jira/browse/SPARK-4740
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 1.2.0
>Reporter: Zhang, Liye
>Assignee: Reynold Xin
> Attachments: (rxin patch better executor)TestRunner  sort-by-key - 
> Thread dump for executor 3_files.zip, (rxin patch normal executor)TestRunner  
> sort-by-key - Thread dump for executor 0 _files.zip, Spark-perf Test Report 
> 16 Cores per Executor.pdf, Spark-perf Test Report.pdf, TestRunner  
> sort-by-key - Thread dump for executor 1_files (Netty-48 Cores per node).zip, 
> TestRunner  sort-by-key - Thread dump for executor 1_files (Nio-48 cores per 
> node).zip, rxin_patch-on_4_node_cluster_48CoresPerNode(Unbalance).7z
>
>
> When testing current spark master (1.3.0-snapshot) with spark-perf 
> (sort-by-key, aggregate-by-key, etc), Netty based shuffle transferService 
> takes much longer time than NIO based shuffle transferService. The network 
> throughput of Netty is only about half of that of NIO. 
> We tested with standalone mode, and the data set we used for test is 20 
> billion records, and the total size is about 400GB. Spark-perf test is 
> Running on a 4 node cluster with 10G NIC, 48 cpu cores per node and each 
> executor memory is 64GB. The reduce tasks number is set to 1000. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4675) Find similar products and similar users in MatrixFactorizationModel

2014-12-10 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241535#comment-14241535
 ] 

Debasish Das commented on SPARK-4675:
-

There are a few issues:

1. A batch API for topK similar users and topK similar products
2. A comparison of product x product similarities generated with 
columnSimilarities against the topK similar products

I added batch APIs for topK product recommendations for each user and topK user 
recommendations for each product in SPARK-4231... a similar batch API will be 
very helpful for topK similar users and topK similar products...

I agree with cosine similarity... you should be able to re-use the column 
similarity calculations... I think a better idea is to add 
rowMatrix.similarRows and re-use that code to generate product similarities 
and user similarities...

But my question is more about validation. We can compute product similarities 
on raw features and we can compute product similarities on the product factor 
matrix... which one is better?
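
For illustration, a minimal brute-force sketch of topK similar products 
computed from the learned latent factors via cosine similarity (not the batch 
API discussed above; {{k}} and the helper name are assumptions):

{code}
import org.apache.spark.SparkContext._
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

// Brute-force topK similar products from the model's product factor vectors.
def similarProducts(model: MatrixFactorizationModel, productId: Int, k: Int): Array[(Int, Double)] = {
  val target = model.productFeatures.lookup(productId).head
  val targetNorm = math.sqrt(target.map(x => x * x).sum)
  model.productFeatures
    .filter { case (id, _) => id != productId }
    .map { case (id, features) =>
      val dot = features.zip(target).map { case (a, b) => a * b }.sum
      val norm = math.sqrt(features.map(x => x * x).sum)
      (id, dot / (norm * targetNorm))                  // cosine similarity
    }
    .top(k)(Ordering.by[(Int, Double), Double](_._2))
}
{code}

Running the same queries against similarities computed from the raw features 
would be one way to answer the raw-features-vs-factors validation question 
empirically.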

> Find similar products and similar users in MatrixFactorizationModel
> ---
>
> Key: SPARK-4675
> URL: https://issues.apache.org/jira/browse/SPARK-4675
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Steven Bourke
>Priority: Trivial
>  Labels: mllib, recommender
>
> Using the latent feature space that is learnt in MatrixFactorizationModel, I 
> have added 2 new functions to find similar products and similar users. A user 
> of the API can for example pass a product ID, and get the closest products. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4789) Standardize ML Prediction APIs

2014-12-10 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-4789:
-
Issue Type: Sub-task  (was: New Feature)
Parent: SPARK-1856

> Standardize ML Prediction APIs
> --
>
> Key: SPARK-4789
> URL: https://issues.apache.org/jira/browse/SPARK-4789
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>
> Create a standard set of abstractions for prediction in spark.ml.  This will 
> follow the design doc specified in [SPARK-3702].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-4789) Standardize ML Prediction APIs

2014-12-10 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-4789:


Assignee: Joseph K. Bradley

> Standardize ML Prediction APIs
> --
>
> Key: SPARK-4789
> URL: https://issues.apache.org/jira/browse/SPARK-4789
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>
> Create a standard set of abstractions for prediction in spark.ml.  This will 
> follow the design doc specified in [SPARK-3702].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3702) Standardize MLlib classes for learners, models

2014-12-10 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-3702:
-
Description: 
Summary: Create a class hierarchy for learning algorithms and the models those 
algorithms produce.

This is a super-task of several sub-tasks (but JIRA does not allow subtasks of 
subtasks).  See the "depends on" links below for subtasks.

Goals:
* give intuitive structure to API, both for developers and for generated 
documentation
* support meta-algorithms (e.g., boosting)
* support generic functionality (e.g., evaluation)
* reduce code duplication across classes

[Design doc for class hierarchy | 
https://docs.google.com/document/d/1I-8PD0DSLEZzzXURYZwmqAFn_OMBc08hgDL1FZnVBmw/]

  was:
Summary: Create a class hierarchy for learning algorithms and the models those 
algorithms produce.

Goals:
* give intuitive structure to API, both for developers and for generated 
documentation
* support meta-algorithms (e.g., boosting)
* support generic functionality (e.g., evaluation)
* reduce code duplication across classes

[Design doc for class hierarchy | 
https://docs.google.com/document/d/1I-8PD0DSLEZzzXURYZwmqAFn_OMBc08hgDL1FZnVBmw/]


> Standardize MLlib classes for learners, models
> --
>
> Key: SPARK-3702
> URL: https://issues.apache.org/jira/browse/SPARK-3702
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Blocker
>
> Summary: Create a class hierarchy for learning algorithms and the models 
> those algorithms produce.
> This is a super-task of several sub-tasks (but JIRA does not allow subtasks 
> of subtasks).  See the "depends on" links below for subtasks.
> Goals:
> * give intuitive structure to API, both for developers and for generated 
> documentation
> * support meta-algorithms (e.g., boosting)
> * support generic functionality (e.g., evaluation)
> * reduce code duplication across classes
> [Design doc for class hierarchy | 
> https://docs.google.com/document/d/1I-8PD0DSLEZzzXURYZwmqAFn_OMBc08hgDL1FZnVBmw/]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3702) Standardize MLlib classes for learners, models

2014-12-10 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241611#comment-14241611
 ] 

Joseph K. Bradley commented on SPARK-3702:
--

APIs for Classifiers, Regressors

> Standardize MLlib classes for learners, models
> --
>
> Key: SPARK-3702
> URL: https://issues.apache.org/jira/browse/SPARK-3702
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Blocker
>
> Summary: Create a class hierarchy for learning algorithms and the models 
> those algorithms produce.
> This is a super-task of several sub-tasks (but JIRA does not allow subtasks 
> of subtasks).  See the "depends on" links below for subtasks.
> Goals:
> * give intuitive structure to API, both for developers and for generated 
> documentation
> * support meta-algorithms (e.g., boosting)
> * support generic functionality (e.g., evaluation)
> * reduce code duplication across classes
> [Design doc for class hierarchy | 
> https://docs.google.com/document/d/1I-8PD0DSLEZzzXURYZwmqAFn_OMBc08hgDL1FZnVBmw/]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3702) Standardize MLlib classes for learners, models

2014-12-10 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-3702:
-
Description: 
Summary: Create a class hierarchy for learning algorithms and the models those 
algorithms produce.

This is a super-task of several sub-tasks (but JIRA does not allow subtasks of 
subtasks).  See the "requires" links below for subtasks.

Goals:
* give intuitive structure to API, both for developers and for generated 
documentation
* support meta-algorithms (e.g., boosting)
* support generic functionality (e.g., evaluation)
* reduce code duplication across classes

[Design doc for class hierarchy | 
https://docs.google.com/document/d/1I-8PD0DSLEZzzXURYZwmqAFn_OMBc08hgDL1FZnVBmw/]

  was:
Summary: Create a class hierarchy for learning algorithms and the models those 
algorithms produce.

This is a super-task of several sub-tasks (but JIRA does not allow subtasks of 
subtasks).  See the "depends on" links below for subtasks.

Goals:
* give intuitive structure to API, both for developers and for generated 
documentation
* support meta-algorithms (e.g., boosting)
* support generic functionality (e.g., evaluation)
* reduce code duplication across classes

[Design doc for class hierarchy | 
https://docs.google.com/document/d/1I-8PD0DSLEZzzXURYZwmqAFn_OMBc08hgDL1FZnVBmw/]


> Standardize MLlib classes for learners, models
> --
>
> Key: SPARK-3702
> URL: https://issues.apache.org/jira/browse/SPARK-3702
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Blocker
>
> Summary: Create a class hierarchy for learning algorithms and the models 
> those algorithms produce.
> This is a super-task of several sub-tasks (but JIRA does not allow subtasks 
> of subtasks).  See the "requires" links below for subtasks.
> Goals:
> * give intuitive structure to API, both for developers and for generated 
> documentation
> * support meta-algorithms (e.g., boosting)
> * support generic functionality (e.g., evaluation)
> * reduce code duplication across classes
> [Design doc for class hierarchy | 
> https://docs.google.com/document/d/1I-8PD0DSLEZzzXURYZwmqAFn_OMBc08hgDL1FZnVBmw/]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3918) Forget Unpersist in RandomForest.scala(train Method)

2014-12-10 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241649#comment-14241649
 ] 

Joseph K. Bradley commented on SPARK-3918:
--

Oops!  I forgot to update that PR's name.  It was originally in that PR, but 
[~Junlong Liu] sent a PR with the change first:
[https://github.com/apache/spark/commit/942847fd94c920f7954ddf01f97263926e512b0e]

(The PR linked above was not tagged with this JIRA.)

> Forget Unpersist in RandomForest.scala(train Method)
> 
>
> Key: SPARK-3918
> URL: https://issues.apache.org/jira/browse/SPARK-3918
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.2.0
> Environment: All
>Reporter: junlong
>Assignee: Joseph K. Bradley
>  Labels: decisiontree, train, unpersist
> Fix For: 1.1.0
>
>   Original Estimate: 10m
>  Remaining Estimate: 10m
>
>In version 1.1.0, in the train method of DecisionTree.scala, treeInput is 
> persisted in memory but never unpersisted. This causes heavy disk usage.
>In the GitHub version (1.2.0, maybe), in the train method of 
> RandomForest.scala, baggedInput is persisted but never unpersisted either.
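
A minimal sketch of the persist/unpersist pairing the report asks for; the 
helper name is illustrative and not part of the actual DecisionTree/RandomForest 
code.

{code}
// Hypothetical helper: cache the transformed training input for the duration
// of training, then release it before returning the model.
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

def trainWithCachedInput[T, M](input: RDD[T])(train: RDD[T] => M): M = {
  val cached = input.persist(StorageLevel.MEMORY_AND_DISK)
  try {
    train(cached)
  } finally {
    cached.unpersist()  // release the cached copy once training finishes
  }
}
{code}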



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2951) SerDeUtils.pythonToPairRDD fails on RDDs of pickled array.arrays in Python 2.6

2014-12-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241654#comment-14241654
 ] 

Apache Spark commented on SPARK-2951:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/3668

> SerDeUtils.pythonToPairRDD fails on RDDs of pickled array.arrays in Python 2.6
> --
>
> Key: SPARK-2951
> URL: https://issues.apache.org/jira/browse/SPARK-2951
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.1.0
>Reporter: Josh Rosen
>Assignee: Davies Liu
> Fix For: 1.2.0
>
>
> With Python 2.6, calling SerDeUtils.pythonToPairRDD() on an RDD of pickled 
> Python array.arrays will fail with this exception:
> {code}
> java.lang.ClassCastException: java.lang.String cannot be cast to 
> java.util.ArrayList
> 
> net.razorvine.pickle.objects.ArrayConstructor.construct(ArrayConstructor.java:33)
> net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:617)
> net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:170)
> net.razorvine.pickle.Unpickler.load(Unpickler.java:84)
> net.razorvine.pickle.Unpickler.loads(Unpickler.java:97)
> 
> org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToPairRDD$1$$anonfun$5.apply(SerDeUtil.scala:106)
> 
> org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToPairRDD$1$$anonfun$5.apply(SerDeUtil.scala:106)
> scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:898)
> 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:880)
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
> org.apache.spark.scheduler.Task.run(Task.scala:54)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)
> 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> java.lang.Thread.run(Thread.java:745)
> {code}
> I think this is due to a difference in how array.array is pickled in Python 
> 2.6 vs. Python 2.7.  To see this, run the following script:
> {code}
> from pickletools import dis, optimize
> from pickle import dumps, loads, HIGHEST_PROTOCOL
> from array import array
> arr = array('d', [1, 2, 3])
> #protocol = HIGHEST_PROTOCOL
> protocol = 0
> pickled = dumps(arr, protocol=protocol)
> pickled = optimize(pickled)
> unpickled = loads(pickled)
> print arr
> print unpickled
> print dis(pickled)
> {code}
> In Python 2.7, this outputs
> {code}
> array('d', [1.0, 2.0, 3.0])
> array('d', [1.0, 2.0, 3.0])
> 0: cGLOBAL 'array array'
>13: (MARK
>14: SSTRING 'd'
>19: (MARK
>20: lLIST   (MARK at 19)
>21: FFLOAT  1.0
>26: aAPPEND
>27: FFLOAT  2.0
>32: aAPPEND
>33: FFLOAT  3.0
>38: aAPPEND
>39: tTUPLE  (MARK at 13)
>40: RREDUCE
>41: .STOP
> highest protocol among opcodes = 0
> None
> {code}
> whereas 2.6 outputs
> {code}
> array('d', [1.0, 2.0, 3.0])
> array('d', [1.0, 2.0, 3.0])
> 0: cGLOBAL 'array array'
>13: (MARK
>14: SSTRING 'd'
>19: SSTRING 
> '\x00\x00\x00\x00\x00\x00\xf0?\x00\x00\x00\x00\x00\x00\x00@\x00\x00\x00\x00\x00\x00\x08@'
>   110: tTUPLE  (MARK at 13)
>   111: RREDUCE
>   112: .STOP
> highest protocol among opcodes = 0
> None
> {code}
> I think the Java-side depickling library doesn't expect this pickled format, 
> causing this failure.
> I noticed this when running PySpark's unit tests on 2.6 because the 
> TestOutputFormat.test_newhadoop test failed.
> I think that this issue affects all of the methods that might need to 
> depickle arrays in Java, including all of the Hadoop output format methods.
> How should we try to fix this?  Require that users upgrade to 2.7 if they 
> want to use code that requires this?  Open a bug with the depickling library 
> maintainers?  Try to hack in our own pickling routines for arrays if we 
> detect that we're using 2.6?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4161) Spark shell class path is not correctly set if "spark.driver.extraClassPath" is set in defaults.conf

2014-12-10 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-4161:
-
Target Version/s: 1.3.0, 1.1.2, 1.2.1  (was: 1.1.2, 1.2.1)

> Spark shell class path is not correctly set if "spark.driver.extraClassPath" 
> is set in defaults.conf
> 
>
> Key: SPARK-4161
> URL: https://issues.apache.org/jira/browse/SPARK-4161
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.1.0, 1.2.0
> Environment: Mac, Ubuntu
>Reporter: Shay Seng
>Assignee: Guoqiang Li
>  Labels: backport-needed
> Fix For: 1.3.0
>
>
> (1) I want to launch a spark-shell with jars that are required only by the 
> driver (i.e. not shipped to slaves).
>  
> (2) I added "spark.driver.extraClassPath  /mypath/to.jar" to my 
> spark-defaults.conf
> I launched spark-shell with:  ./spark-shell
> Here I see on the WebUI that spark.driver.extraClassPath has been set, but I 
> am NOT able to access any methods in the jar.
> (3) I removed "spark.driver.extraClassPath" from my spark-defaults.conf
> I launched spark-shell with:  ./spark-shell --driver-class-path /mypath/to.jar
> Again I see on the WebUI that spark.driver.extraClassPath has been set. 
> But this time I am able to access the methods in the jar. 
> It looks like when the driver class path is loaded from spark-defaults.conf, 
> the REPL's classpath is not correctly appended.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4161) Spark shell class path is not correctly set if "spark.driver.extraClassPath" is set in defaults.conf

2014-12-10 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-4161:
-
Fix Version/s: 1.3.0

> Spark shell class path is not correctly set if "spark.driver.extraClassPath" 
> is set in defaults.conf
> 
>
> Key: SPARK-4161
> URL: https://issues.apache.org/jira/browse/SPARK-4161
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.1.0, 1.2.0
> Environment: Mac, Ubuntu
>Reporter: Shay Seng
>Assignee: Guoqiang Li
>  Labels: backport-needed
> Fix For: 1.3.0
>
>
> (1) I want to launch a spark-shell with jars that are required only by the 
> driver (i.e. not shipped to slaves).
>  
> (2) I added "spark.driver.extraClassPath  /mypath/to.jar" to my 
> spark-defaults.conf
> I launched spark-shell with:  ./spark-shell
> Here I see on the WebUI that spark.driver.extraClassPath has been set, but I 
> am NOT able to access any methods in the jar.
> (3) I removed "spark.driver.extraClassPath" from my spark-defaults.conf
> I launched spark-shell with:  ./spark-shell --driver-class-path /mypath/to.jar
> Again I see on the WebUI that spark.driver.extraClassPath has been set. 
> But this time I am able to access the methods in the jar. 
> It looks like when the driver class path is loaded from spark-defaults.conf, 
> the REPL's classpath is not correctly appended.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4161) Spark shell class path is not correctly set if "spark.driver.extraClassPath" is set in defaults.conf

2014-12-10 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-4161:
-
Labels: backport-needed  (was: )

> Spark shell class path is not correctly set if "spark.driver.extraClassPath" 
> is set in defaults.conf
> 
>
> Key: SPARK-4161
> URL: https://issues.apache.org/jira/browse/SPARK-4161
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.1.0, 1.2.0
> Environment: Mac, Ubuntu
>Reporter: Shay Seng
>Assignee: Guoqiang Li
>  Labels: backport-needed
> Fix For: 1.3.0
>
>
> (1) I want to launch a spark-shell with jars that are required only by the 
> driver (i.e. not shipped to slaves).
>  
> (2) I added "spark.driver.extraClassPath  /mypath/to.jar" to my 
> spark-defaults.conf
> I launched spark-shell with:  ./spark-shell
> Here I see on the WebUI that spark.driver.extraClassPath has been set, but I 
> am NOT able to access any methods in the jar.
> (3) I removed "spark.driver.extraClassPath" from my spark-defaults.conf
> I launched spark-shell with:  ./spark-shell --driver-class-path /mypath/to.jar
> Again I see on the WebUI that spark.driver.extraClassPath has been set. 
> But this time I am able to access the methods in the jar. 
> It looks like when the driver class path is loaded from spark-defaults.conf, 
> the REPL's classpath is not correctly appended.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-4329) Add indexing feature for HistoryPage

2014-12-10 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-4329.

Resolution: Fixed
  Assignee: Kousuke Saruta

> Add indexing feature for HistoryPage
> 
>
> Key: SPARK-4329
> URL: https://issues.apache.org/jira/browse/SPARK-4329
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
> Fix For: 1.3.0
>
>
> The current HistoryPage has links only to the previous and next pages.
> I suggest adding an index so that history pages can be accessed easily.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4329) Add indexing feature for HistoryPage

2014-12-10 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-4329:
-
Fix Version/s: 1.3.0

> Add indexing feature for HistoryPage
> 
>
> Key: SPARK-4329
> URL: https://issues.apache.org/jira/browse/SPARK-4329
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.3.0
>Reporter: Kousuke Saruta
> Fix For: 1.3.0
>
>
> The current HistoryPage has links only to the previous and next pages.
> I suggest adding an index so that history pages can be accessed easily.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4771) Document standalone --supervise feature

2014-12-10 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-4771:
-
Target Version/s: 1.3.0, 1.1.2, 1.2.1  (was: 1.1.2, 1.2.1)

> Document standalone --supervise feature
> ---
>
> Key: SPARK-4771
> URL: https://issues.apache.org/jira/browse/SPARK-4771
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Andrew Or
>Assignee: Andrew Or
> Fix For: 1.3.0
>
>
> We need this especially for streaming.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4771) Document standalone --supervise feature

2014-12-10 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-4771:
-
Fix Version/s: 1.3.0

> Document standalone --supervise feature
> ---
>
> Key: SPARK-4771
> URL: https://issues.apache.org/jira/browse/SPARK-4771
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Andrew Or
>Assignee: Andrew Or
> Fix For: 1.3.0
>
>
> We need this especially for streaming.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-4771) Document standalone --supervise feature

2014-12-10 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-4771.

   Resolution: Fixed
Fix Version/s: 1.2.1
   1.1.2

> Document standalone --supervise feature
> ---
>
> Key: SPARK-4771
> URL: https://issues.apache.org/jira/browse/SPARK-4771
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Andrew Or
>Assignee: Andrew Or
> Fix For: 1.3.0, 1.1.2, 1.2.1
>
>
> We need this especially for streaming.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2075) Anonymous classes are missing from Spark distribution

2014-12-10 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241701#comment-14241701
 ] 

Pat Ferrel commented on SPARK-2075:
---

If the explanation is correct, this needs to be filed against Spark for putting 
the wrong, or not enough, artifacts into the Maven repos. There would need to 
be a different artifact for every config option that changes internal naming.

I can't understand why lots of people aren't running into this; all it requires 
is linking against the repo artifact and running against a user-compiled Spark.

> Anonymous classes are missing from Spark distribution
> -
>
> Key: SPARK-2075
> URL: https://issues.apache.org/jira/browse/SPARK-2075
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Spark Core
>Affects Versions: 1.0.0
>Reporter: Paul R. Brown
>Priority: Critical
>
> Running a job built against the Maven dep for 1.0.0 and the hadoop1 
> distribution produces:
> {code}
> java.lang.ClassNotFoundException:
> org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1
> {code}
> Here's what's in the Maven dep as of 1.0.0:
> {code}
> jar tvf 
> ~/.m2/repository/org/apache/spark/spark-core_2.10/1.0.0/spark-core_2.10-1.0.0.jar
>  | grep 'rdd/RDD' | grep 'saveAs'
>   1519 Mon May 26 13:57:58 PDT 2014 
> org/apache/spark/rdd/RDD$$anonfun$saveAsTextFile$1.class
>   1560 Mon May 26 13:57:58 PDT 2014 
> org/apache/spark/rdd/RDD$$anonfun$saveAsTextFile$2.class
> {code}
> And here's what's in the hadoop1 distribution:
> {code}
> jar tvf spark-assembly-1.0.0-hadoop1.0.4.jar| grep 'rdd/RDD' | grep 'saveAs'
> {code}
> I.e., it's not there.  It is in the hadoop2 distribution:
> {code}
> jar tvf spark-assembly-1.0.0-hadoop2.2.0.jar| grep 'rdd/RDD' | grep 'saveAs'
>   1519 Mon May 26 07:29:54 PDT 2014 
> org/apache/spark/rdd/RDD$$anonfun$saveAsTextFile$1.class
>   1560 Mon May 26 07:29:54 PDT 2014 
> org/apache/spark/rdd/RDD$$anonfun$saveAsTextFile$2.class
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4215) Allow requesting executors only on Yarn (for now)

2014-12-10 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-4215:
-
Fix Version/s: 1.3.0

> Allow requesting executors only on Yarn (for now)
> -
>
> Key: SPARK-4215
> URL: https://issues.apache.org/jira/browse/SPARK-4215
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 1.2.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Critical
>  Labels: backport-needed
> Fix For: 1.3.0
>
>
> Currently, if the user attempts to call `sc.requestExecutors` or enables 
> dynamic allocation in, say, standalone mode, it just fails silently. We must 
> at the very least log a warning to say it's only available for Yarn 
> currently, or maybe even throw an exception.
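
A minimal sketch of the kind of guard described above, written as a user-side 
helper; it assumes the 1.2-era {{sc.requestExecutors(n)}} API and a readable 
{{sc.master}}, and is illustrative rather than the merged fix.

{code}
// Illustrative wrapper, not the merged fix: warn and do nothing unless the
// application is actually running on YARN.
import org.apache.spark.SparkContext

def requestExecutorsIfSupported(sc: SparkContext, numAdditional: Int): Boolean = {
  if (sc.master.startsWith("yarn")) {
    sc.requestExecutors(numAdditional)
  } else {
    Console.err.println(
      s"WARNING: requesting executors is currently only supported on YARN; " +
      s"ignoring request for $numAdditional executors (master = ${sc.master}).")
    false
  }
}
{code}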



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4215) Allow requesting executors only on Yarn (for now)

2014-12-10 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-4215:
-
Labels: backport-needed  (was: )

> Allow requesting executors only on Yarn (for now)
> -
>
> Key: SPARK-4215
> URL: https://issues.apache.org/jira/browse/SPARK-4215
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 1.2.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Critical
>  Labels: backport-needed
> Fix For: 1.3.0
>
>
> Currently, if the user attempts to call `sc.requestExecutors` or enables 
> dynamic allocation in, say, standalone mode, it just fails silently. We must 
> at the very least log a warning to say it's only available for Yarn 
> currently, or maybe even throw an exception.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4793) way to find assembly jar is too strict

2014-12-10 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-4793:
-
Assignee: Adrian Wang

> way to find assembly jar is too strict
> --
>
> Key: SPARK-4793
> URL: https://issues.apache.org/jira/browse/SPARK-4793
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Reporter: Adrian Wang
>Assignee: Adrian Wang
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4793) way to find assembly jar is too strict

2014-12-10 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-4793:
-
Affects Version/s: 1.1.0

> way to find assembly jar is too strict
> --
>
> Key: SPARK-4793
> URL: https://issues.apache.org/jira/browse/SPARK-4793
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 1.1.0
>Reporter: Adrian Wang
>Assignee: Adrian Wang
>Priority: Minor
> Fix For: 1.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4793) way to find assembly jar is too strict

2014-12-10 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-4793:
-
Target Version/s: 1.3.0, 1.1.2, 1.2.1
   Fix Version/s: 1.3.0

> way to find assembly jar is too strict
> --
>
> Key: SPARK-4793
> URL: https://issues.apache.org/jira/browse/SPARK-4793
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 1.1.0
>Reporter: Adrian Wang
>Assignee: Adrian Wang
>Priority: Minor
> Fix For: 1.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4819) Remove Guava's "Optional" from public API

2014-12-10 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-4819:
-

 Summary: Remove Guava's "Optional" from public API
 Key: SPARK-4819
 URL: https://issues.apache.org/jira/browse/SPARK-4819
 Project: Spark
  Issue Type: Task
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Marcelo Vanzin


Filing this mostly so this isn't forgotten. Spark currently exposes Guava types 
in its public API (the {{Optional}} class is used in the Java bindings). This 
makes it hard to properly hide Guava from user applications, and makes mixing 
different Guava versions with Spark a little sketchy (even if things should 
work, since those classes are pretty simple in general).

Since this changes the public API, it has to be done in a release that allows 
such breakages. But it would be nice to at least have a transition plan for 
deprecating the affected APIs.
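
One concrete spot where this shows up (as of 1.2) is {{JavaPairRDD.leftOuterJoin}}, 
whose result values carry {{com.google.common.base.Optional}}. A minimal sketch, 
assuming a local master; the object and app names are illustrative.

{code}
// Illustrative only: shows Guava's Optional surfacing in the Java bindings.
import scala.collection.JavaConverters._
import com.google.common.base.Optional
import org.apache.spark.SparkConf
import org.apache.spark.api.java.JavaSparkContext

object GuavaOptionalDemo {
  def main(args: Array[String]): Unit = {
    val jsc = new JavaSparkContext(
      new SparkConf().setMaster("local[2]").setAppName("guava-optional-demo"))

    val left  = jsc.parallelizePairs(Seq(("a", Int.box(1)), ("b", Int.box(2))).asJava)
    val right = jsc.parallelizePairs(Seq(("a", Int.box(10))).asJava)

    // The right-hand values come back wrapped in com.google.common.base.Optional.
    left.leftOuterJoin(right).collect().asScala.foreach { case (key, pair) =>
      val maybeRight: Optional[Integer] = pair._2
      println(s"$key -> ${pair._1}, ${if (maybeRight.isPresent) maybeRight.get else "none"}")
    }

    jsc.stop()
  }
}
{code}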



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4820) Spark build encounters "File name too long" on some encrypted filesystems

2014-12-10 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-4820:
--

 Summary: Spark build encounters "File name too long" on some 
encrypted filesystems
 Key: SPARK-4820
 URL: https://issues.apache.org/jira/browse/SPARK-4820
 Project: Spark
  Issue Type: Bug
Reporter: Patrick Wendell


This was reported by Luchesar Cekov on github along with a proposed fix. The 
fix has some potential downstream issues (it will modify the classnames) so 
until we understand better how many users are affected we aren't going to merge 
it. However, I'd like to include the issue and workaround here.

The issue produces this error:
{code}
[error] == Expanded type of tree ==
[error] 
[error] ConstantType(value = Constant(Throwable))
[error] 
[error] uncaught exception during compilation: java.io.IOException
[error] File name too long
[error] two errors found
{code}

The workaround in Maven is to add the following under the compile options: 

{code}
+  <arg>-Xmax-classfile-name</arg>
+  <arg>128</arg>
{code}

In SBT add:

{code}
+scalacOptions in Compile ++= Seq("-Xmax-classfile-name", "128"),
{code}




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4820) Spark build encounters "File name too long" on some encrypted filesystems

2014-12-10 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-4820:
---
Description: 
This was reported by Luchesar Cekov on github along with a proposed fix. The 
fix has some potential downstream issues (it will modify the classnames) so 
until we understand better how many users are affected we aren't going to merge 
it. However, I'd like to include the issue and workaround here. If you 
encounter this issue please comment on the JIRA so we can assess the frequency.

The issue produces this error:
{code}
[error] == Expanded type of tree ==
[error] 
[error] ConstantType(value = Constant(Throwable))
[error] 
[error] uncaught exception during compilation: java.io.IOException
[error] File name too long
[error] two errors found
{code}

The workaround in Maven is to add the following under the compile options: 

{code}
+  <arg>-Xmax-classfile-name</arg>
+  <arg>128</arg>
{code}

In SBT add:

{code}
+scalacOptions in Compile ++= Seq("-Xmax-classfile-name", "128"),
{code}


  was:
This was reported by Luchesar Cekov on github along with a proposed fix. The 
fix has some potential downstream issues (it will modify the classnames) so 
until we understand better how many users are affected we aren't going to merge 
it. However, I'd like to include the issue and workaround here.

The issue produces this error:
{code}
[error] == Expanded type of tree ==
[error] 
[error] ConstantType(value = Constant(Throwable))
[error] 
[error] uncaught exception during compilation: java.io.IOException
[error] File name too long
[error] two errors found
{code}

The workaround in Maven is to add the following under the compile options: 

{code}
+  <arg>-Xmax-classfile-name</arg>
+  <arg>128</arg>
{code}

In SBT add:

{code}
+scalacOptions in Compile ++= Seq("-Xmax-classfile-name", "128"),
{code}



> Spark build encounters "File name too long" on some encrypted filesystems
> -
>
> Key: SPARK-4820
> URL: https://issues.apache.org/jira/browse/SPARK-4820
> Project: Spark
>  Issue Type: Bug
>Reporter: Patrick Wendell
>
> This was reported by Luchesar Cekov on github along with a proposed fix. The 
> fix has some potential downstream issues (it will modify the classnames) so 
> until we understand better how many users are affected we aren't going to 
> merge it. However, I'd like to include the issue and workaround here. If you 
> encounter this issue please comment on the JIRA so we can assess the 
> frequency.
> The issue produces this error:
> {code}
> [error] == Expanded type of tree ==
> [error] 
> [error] ConstantType(value = Constant(Throwable))
> [error] 
> [error] uncaught exception during compilation: java.io.IOException
> [error] File name too long
> [error] two errors found
> {code}
> The workaround in Maven is to add the following under the compile options: 
> {code}
> +  <arg>-Xmax-classfile-name</arg>
> +  <arg>128</arg>
> {code}
> In SBT add:
> {code}
> +scalacOptions in Compile ++= Seq("-Xmax-classfile-name", "128"),
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4569) Rename "externalSorting" in Aggregator

2014-12-10 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-4569:
-
Target Version/s: 1.3.0, 1.1.2, 1.2.1
   Fix Version/s: 1.3.0

> Rename "externalSorting" in Aggregator
> --
>
> Key: SPARK-4569
> URL: https://issues.apache.org/jira/browse/SPARK-4569
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 1.2.0
>Reporter: Sandy Ryza
>Priority: Trivial
> Fix For: 1.3.0
>
>
> While technically all spilling in Spark does result in sorting, calling this 
> variable externalSorting makes it seem like ExternalSorter will be used, when 
> in fact it just means whether spilling is enabled.
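
For illustration, a sketch of what the rename amounts to; the class and field 
names here are placeholders rather than the actual patch, and the flag is 
assumed to mirror the {{spark.shuffle.spill}} setting as in 1.2-era code.

{code}
// Hypothetical rename (illustrative, not the merged patch): the flag records
// only whether spill-to-disk is enabled, not which sorter implementation runs.
import org.apache.spark.SparkEnv

class AggregatorNamingSketch {
  // was: externalSorting
  private val isSpillEnabled: Boolean =
    SparkEnv.get.conf.getBoolean("spark.shuffle.spill", true)
}
{code}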



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4569) Rename "externalSorting" in Aggregator

2014-12-10 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-4569:
-
Labels: backport-needed  (was: )

> Rename "externalSorting" in Aggregator
> --
>
> Key: SPARK-4569
> URL: https://issues.apache.org/jira/browse/SPARK-4569
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 1.2.0
>Reporter: Sandy Ryza
>Priority: Trivial
>  Labels: backport-needed
> Fix For: 1.3.0
>
>
> While technically all spilling in Spark does result in sorting, calling this 
> variable externalSorting makes it seem like ExternalSorter will be used, when 
> in fact it just means whether spilling is enabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4821) pyspark.mllib.rand docs not generated correctly

2014-12-10 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-4821:


 Summary: pyspark.mllib.rand docs not generated correctly
 Key: SPARK-4821
 URL: https://issues.apache.org/jira/browse/SPARK-4821
 Project: Spark
  Issue Type: Bug
  Components: Documentation, MLlib, PySpark
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley


spark/python/docs/pyspark.mllib.rst needs to be updated to reflect the change 
in package names from pyspark.mllib.random to .rand

Otherwise, the Python API docs are empty.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4759) Deadlock in complex spark job in local mode

2014-12-10 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-4759:
-
Target Version/s: 1.3.0, 1.1.2, 1.2.1

> Deadlock in complex spark job in local mode
> ---
>
> Key: SPARK-4759
> URL: https://issues.apache.org/jira/browse/SPARK-4759
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.1, 1.2.0, 1.3.0
> Environment: Java version "1.7.0_51"
> Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)
> Mac OSX 10.10.1
> Using local spark context
>Reporter: Davis Shepherd
>Assignee: Andrew Or
>Priority: Critical
>  Labels: backport-needed
> Fix For: 1.3.0, 1.1.2
>
> Attachments: SparkBugReplicator.scala
>
>
> The attached test class runs two identical jobs that perform some iterative 
> computation on an RDD[(Int, Int)]. This computation involves 
>   # taking new data and merging it with the previous result
>   # caching and checkpointing the new result
>   # rinse and repeat
> The first time the job is run, it runs successfully, and the spark context is 
> shut down. The second time the job is run with a new spark context in the 
> same process, the job hangs indefinitely, having scheduled only a subset of 
> the necessary tasks for the final stage.
> I've been able to produce a test case that reproduces the issue, and I've 
> added some comments where some knockout experimentation has left breadcrumbs 
> as to where the issue might be.
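
A minimal sketch of that loop (illustrative only; not the attached 
{{SparkBugReplicator.scala}}), assuming a local master and a writable 
checkpoint directory.

{code}
// Illustrative iterate / merge / cache / checkpoint loop.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object IterativeCheckpointSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("iterative-checkpoint"))
    sc.setCheckpointDir("/tmp/spark-checkpoints")  // path is an assumption

    var state: RDD[(Int, Int)] = sc.parallelize(Seq.empty[(Int, Int)])
    for (iteration <- 1 to 3) {
      val newData = sc.parallelize((1 to 100).map(k => (k, iteration)))
      val merged  = state.union(newData).reduceByKey(_ + _).cache()
      merged.checkpoint()
      merged.count()        // force materialization before dropping the old state
      state.unpersist()
      state = merged
    }
    println(s"final state size: ${state.count()}")
    sc.stop()
  }
}
{code}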



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4759) Deadlock in complex spark job in local mode

2014-12-10 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-4759:
-
Labels: backport-needed  (was: )

> Deadlock in complex spark job in local mode
> ---
>
> Key: SPARK-4759
> URL: https://issues.apache.org/jira/browse/SPARK-4759
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.1, 1.2.0, 1.3.0
> Environment: Java version "1.7.0_51"
> Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)
> Mac OSX 10.10.1
> Using local spark context
>Reporter: Davis Shepherd
>Assignee: Andrew Or
>Priority: Critical
>  Labels: backport-needed
> Fix For: 1.3.0, 1.1.2
>
> Attachments: SparkBugReplicator.scala
>
>
> The attached test class runs two identical jobs that perform some iterative 
> computation on an RDD[(Int, Int)]. This computation involves 
>   # taking new data and merging it with the previous result
>   # caching and checkpointing the new result
>   # rinse and repeat
> The first time the job is run, it runs successfully, and the spark context is 
> shut down. The second time the job is run with a new spark context in the 
> same process, the job hangs indefinitely, having scheduled only a subset of 
> the necessary tasks for the final stage.
> I've been able to produce a test case that reproduces the issue, and I've 
> added some comments where some knockout experimentation has left breadcrumbs 
> as to where the issue might be.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4759) Deadlock in complex spark job in local mode

2014-12-10 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-4759:
-
Fix Version/s: 1.1.2
   1.3.0

> Deadlock in complex spark job in local mode
> ---
>
> Key: SPARK-4759
> URL: https://issues.apache.org/jira/browse/SPARK-4759
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.1, 1.2.0, 1.3.0
> Environment: Java version "1.7.0_51"
> Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)
> Mac OSX 10.10.1
> Using local spark context
>Reporter: Davis Shepherd
>Assignee: Andrew Or
>Priority: Critical
> Fix For: 1.3.0, 1.1.2
>
> Attachments: SparkBugReplicator.scala
>
>
> The attached test class runs two identical jobs that perform some iterative 
> computation on an RDD[(Int, Int)]. This computation involves 
>   # taking new data and merging it with the previous result
>   # caching and checkpointing the new result
>   # rinse and repeat
> The first time the job is run, it runs successfully, and the spark context is 
> shut down. The second time the job is run with a new spark context in the 
> same process, the job hangs indefinitely, having scheduled only a subset of 
> the necessary tasks for the final stage.
> I've been able to produce a test case that reproduces the issue, and I've 
> added some comments where some knockout experimentation has left breadcrumbs 
> as to where the issue might be.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


