[jira] [Commented] (SPARK-21866) SPIP: Image support in Spark

2017-09-04 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16152915#comment-16152915
 ] 

Matei Zaharia commented on SPARK-21866:
---

Just to chime in on this, I've also seen feedback that the deep learning 
libraries for Spark are too fragmented: there are too many of them, and people 
don't know where to start. This standard representation would at least give 
them a clear way to interoperate. It would let people write separate libraries 
for image processing, data augmentation, and then training, for example.

> SPIP: Image support in Spark
> 
>
> Key: SPARK-21866
> URL: https://issues.apache.org/jira/browse/SPARK-21866
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Timothy Hunter
>  Labels: SPIP
> Attachments: SPIP - Image support for Apache Spark V1.1.pdf
>
>
> h2. Background and motivation
> As Apache Spark is being used more and more in the industry, some new use 
> cases are emerging for different data formats beyond the traditional SQL 
> types or the numerical types (vectors and matrices). Deep Learning 
> applications commonly deal with image processing. A number of projects add 
> some Deep Learning capabilities to Spark (see list below), but they struggle 
> to communicate with each other or with MLlib pipelines because there is no 
> standard way to represent an image in Spark DataFrames. We propose to 
> federate efforts for representing images in Spark by defining a 
> representation that caters to the most common needs of users and library 
> developers.
> This SPIP proposes a specification to represent images in Spark DataFrames 
> and Datasets (based on existing industrial standards), and an interface for 
> loading sources of images. It is not meant to be a full-fledged image 
> processing library, but rather the core description that other libraries and 
> users can rely on. Several packages already offer various processing 
> facilities for transforming images or doing more complex operations, and each 
> has various design tradeoffs that make them better as standalone solutions.
> This project is a joint collaboration between Microsoft and Databricks, which 
> have been testing this design in two open source packages: MMLSpark and Deep 
> Learning Pipelines.
> The proposed image format is an in-memory, decompressed representation that 
> targets low-level applications. It is significantly more liberal in memory 
> usage than compressed image representations such as JPEG, PNG, etc., but it 
> allows easy communication with popular image processing libraries and has no 
> decoding overhead.
> h2. Target users and personas:
> Data scientists, data engineers, library developers.
> The following libraries define primitives for loading and representing 
> images, and will gain from a common interchange format (in alphabetical 
> order):
> * BigDL
> * DeepLearning4J
> * Deep Learning Pipelines
> * MMLSpark
> * TensorFlow (Spark connector)
> * TensorFlowOnSpark
> * TensorFrames
> * Thunder
> h2. Goals:
> * Simple representation of images in Spark DataFrames, based on pre-existing 
> industrial standards (OpenCV)
> * This format should eventually allow the development of high-performance 
> integration points with image processing libraries such as libOpenCV, Google 
> TensorFlow, CNTK, and other C libraries.
> * The reader should be able to read popular formats of images from 
> distributed sources.
> h2. Non-Goals:
> Images are a versatile medium and encompass a very wide range of formats and 
> representations. This SPIP explicitly aims at the most common use case in the 
> industry currently: multi-channel matrices of binary, int32, int64, float or 
> double data that can fit comfortably in the heap of the JVM:
> * the total size of an image should be restricted to less than 2GB (roughly)
> * the meaning of color channels is application-specific and is not mandated 
> by the standard (in line with the OpenCV standard)
> * specialized formats used in meteorology, the medical field, etc. are not 
> supported
> * this format is specialized to images and does not attempt to solve the more 
> general problem of representing n-dimensional tensors in Spark
> h2. Proposed API changes
> We propose to add a new package in the package structure, under the MLlib 
> project:
> {{org.apache.spark.image}}
> h3. Data format
> We propose to add the following structure:
> imageSchema = StructType([
> * StructField("mode", StringType(), False),
> ** The exact representation of the data.
> ** The values are described in the following OpenCV convention. Basically, 
> the type has both "depth" and "number of channels" info: in particular, type 
> "CV_8UC3" means "3 channel unsigned bytes". BGRA format would be CV_8UC4
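To make the proposed layout concrete, here is a minimal sketch of such a schema as a PySpark StructType. It is based only on the fields described above ("mode" and "origin"); the height/width/nChannels/data fields are illustrative assumptions added so the example is self-contained, not part of the SPIP text quoted here.

{code:python}
# Hedged sketch only: a StructType matching the fields described above.
# The height/width/nChannels/data fields are assumptions for completeness.
from pyspark.sql.types import (StructType, StructField, StringType,
                               IntegerType, BinaryType)

image_schema = StructType([
    StructField("mode", StringType(), False),       # OpenCV-style type, e.g. "CV_8UC3"
    StructField("origin", StringType(), True),      # where the image was loaded from
    StructField("height", IntegerType(), False),    # assumed
    StructField("width", IntegerType(), False),     # assumed
    StructField("nChannels", IntegerType(), False), # assumed
    StructField("data", BinaryType(), False),       # assumed: decompressed pixel bytes
])
{code}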

[jira] [Updated] (SPARK-18278) SPIP: Support native submission of spark jobs to a kubernetes cluster

2017-08-29 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-18278:
--
Labels: SPIP  (was: )

> SPIP: Support native submission of spark jobs to a kubernetes cluster
> -
>
> Key: SPARK-18278
> URL: https://issues.apache.org/jira/browse/SPARK-18278
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Deploy, Documentation, Scheduler, Spark Core
>Reporter: Erik Erlandson
>  Labels: SPIP
> Attachments: SPARK-18278 Spark on Kubernetes Design Proposal Revision 
> 2 (1).pdf
>
>
> A new Apache Spark sub-project that enables native support for submitting 
> Spark applications to a kubernetes cluster. The submitted application runs 
> in a driver executing on a kubernetes pod, and executor lifecycles are also 
> managed as pods.






[jira] [Updated] (SPARK-21866) SPIP: Image support in Spark

2017-08-29 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-21866:
--
Labels: SPIP  (was: )

> SPIP: Image support in Spark
> 
>
> Key: SPARK-21866
> URL: https://issues.apache.org/jira/browse/SPARK-21866
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Timothy Hunter
>  Labels: SPIP
> Attachments: SPIP - Image support for Apache Spark.pdf
>
>
> h2. Background and motivation
> As Apache Spark is being used more and more in the industry, some new use 
> cases are emerging for different data formats beyond the traditional SQL 
> types or the numerical types (vectors and matrices). Deep Learning 
> applications commonly deal with image processing. A number of projects add 
> some Deep Learning capabilities to Spark (see list below), but they struggle 
> to communicate with each other or with MLlib pipelines because there is no 
> standard way to represent an image in Spark DataFrames. We propose to 
> federate efforts for representing images in Spark by defining a 
> representation that caters to the most common needs of users and library 
> developers.
> This SPIP proposes a specification to represent images in Spark DataFrames 
> and Datasets (based on existing industrial standards), and an interface for 
> loading sources of images. It is not meant to be a full-fledged image 
> processing library, but rather the core description that other libraries and 
> users can rely on. Several packages already offer various processing 
> facilities for transforming images or doing more complex operations, and each 
> has various design tradeoffs that make them better as standalone solutions.
> This project is a joint collaboration between Microsoft and Databricks, which 
> have been testing this design in two open source packages: MMLSpark and Deep 
> Learning Pipelines.
> The proposed image format is an in-memory, decompressed representation that 
> targets low-level applications. It is significantly more liberal in memory 
> usage than compressed image representations such as JPEG, PNG, etc., but it 
> allows easy communication with popular image processing libraries and has no 
> decoding overhead.
> h2. Target users and personas:
> Data scientists, data engineers, library developers.
> The following libraries define primitives for loading and representing 
> images, and will gain from a common interchange format (in alphabetical 
> order):
> * BigDL
> * DeepLearning4J
> * Deep Learning Pipelines
> * MMLSpark
> * TensorFlow (Spark connector)
> * TensorFlowOnSpark
> * TensorFrames
> * Thunder
> h2. Goals:
> * Simple representation of images in Spark DataFrames, based on pre-existing 
> industrial standards (OpenCV)
> * This format should eventually allow the development of high-performance 
> integration points with image processing libraries such as libOpenCV, Google 
> TensorFlow, CNTK, and other C libraries.
> * The reader should be able to read popular formats of images from 
> distributed sources.
> h2. Non-Goals:
> Images are a versatile medium and encompass a very wide range of formats and 
> representations. This SPIP explicitly aims at the most common use case in the 
> industry currently: multi-channel matrices of binary, int32, int64, float or 
> double data that can fit comfortably in the heap of the JVM:
> * the total size of an image should be restricted to less than 2GB (roughly)
> * the meaning of color channels is application-specific and is not mandated 
> by the standard (in line with the OpenCV standard)
> * specialized formats used in meteorology, the medical field, etc. are not 
> supported
> * this format is specialized to images and does not attempt to solve the more 
> general problem of representing n-dimensional tensors in Spark
> h2. Proposed API changes
> We propose to add a new package in the package structure, under the MLlib 
> project:
> {{org.apache.spark.image}}
> h3. Data format
> We propose to add the following structure:
> imageSchema = StructType([
> * StructField("mode", StringType(), False),
> ** The exact representation of the data.
> ** The values are described in the following OpenCV convention. Basically, 
> the type has both "depth" and "number of channels" info: in particular, type 
> "CV_8UC3" means "3 channel unsigned bytes". BGRA format would be CV_8UC4 
> (value 32 in the table) with the channel order specified by convention.
> ** The exact channel ordering and meaning of each channel is dictated by 
> convention. By default, the order is RGB (3 channels) and BGRA (4 channels).
> If the image failed to load, the value is the empty string "".
> * StructField("origin", StringType(), True),
> ** Some information about the origin of the image. The content of this is 

[jira] [Commented] (SPARK-17445) Reference an ASF page as the main place to find third-party packages

2016-09-12 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15484732#comment-15484732
 ] 

Matei Zaharia commented on SPARK-17445:
---

Sounds good to me.

> Reference an ASF page as the main place to find third-party packages
> 
>
> Key: SPARK-17445
> URL: https://issues.apache.org/jira/browse/SPARK-17445
> Project: Spark
>  Issue Type: Improvement
>Reporter: Matei Zaharia
>
> Some comments and docs like 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L148-L151
>  say to go to spark-packages.org, but since this is a package index 
> maintained by a third party, it would be better to reference an ASF page that 
> we can keep updated and own the URL for.






[jira] [Commented] (SPARK-17445) Reference an ASF page as the main place to find third-party packages

2016-09-10 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15480419#comment-15480419
 ] 

Matei Zaharia commented on SPARK-17445:
---

Sounds good, but IMO just keep the current supplemental projects there -- don't 
they fit better into "third-party packages" than "powered by"? I viewed powered 
by as a list of users, similar to https://wiki.apache.org/hadoop/PoweredBy, but 
I guess you're viewing it as a list of software that integrates with Spark.

> Reference an ASF page as the main place to find third-party packages
> 
>
> Key: SPARK-17445
> URL: https://issues.apache.org/jira/browse/SPARK-17445
> Project: Spark
>  Issue Type: Improvement
>Reporter: Matei Zaharia
>
> Some comments and docs like 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L148-L151
>  say to go to spark-packages.org, but since this is a package index 
> maintained by a third party, it would be better to reference an ASF page that 
> we can keep updated and own the URL for.






[jira] [Commented] (SPARK-17445) Reference an ASF page as the main place to find third-party packages

2016-09-09 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15479121#comment-15479121
 ] 

Matei Zaharia commented on SPARK-17445:
---

The powered by wiki page is a bit of a mess IMO, so I'd separate out the 
third-party packages from that one. Basically, the powered by page was useful 
when the project was really new and nobody knew who was using it, but right now 
it's a snapshot of the users from back then because few new organizations 
(especially the large ones) list themselves there. Anyway, just linking to this 
wiki page is nice, though I'd try to rename the page to "Third-Party Packages" 
instead of "Supplemental Spark Projects" if it's possible to make the old name 
redirect.

> Reference an ASF page as the main place to find third-party packages
> 
>
> Key: SPARK-17445
> URL: https://issues.apache.org/jira/browse/SPARK-17445
> Project: Spark
>  Issue Type: Improvement
>Reporter: Matei Zaharia
>
> Some comments and docs like 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L148-L151
>  say to go to spark-packages.org, but since this is a package index 
> maintained by a third party, it would be better to reference an ASF page that 
> we can keep updated and own the URL for.






[jira] [Commented] (SPARK-17445) Reference an ASF page as the main place to find third-party packages

2016-09-08 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15474543#comment-15474543
 ] 

Matei Zaharia commented on SPARK-17445:
---

I think one part you're missing, Josh, is that spark-packages.org *is* an index 
of packages from a wide variety of organizations, where anyone can submit a 
package. Have you looked through it? Maybe there is some concern about which 
third-party index we highlight on the site, but AFAIK there are no other 
third-party package indexes. Nonetheless it would make sense to have a stable 
URL on the Spark homepage that lists them.

BTW, in the past, we also used a wiki page to track them: 
https://cwiki.apache.org/confluence/display/SPARK/Supplemental+Spark+Projects 
so we could just link to that. The spark-packages site provides some nicer 
functionality though such as letting anyone add a package with just a GitHub 
account, listing releases, etc.

> Reference an ASF page as the main place to find third-party packages
> 
>
> Key: SPARK-17445
> URL: https://issues.apache.org/jira/browse/SPARK-17445
> Project: Spark
>  Issue Type: Improvement
>Reporter: Matei Zaharia
>
> Some comments and docs like 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L148-L151
>  say to go to spark-packages.org, but since this is a package index 
> maintained by a third party, it would be better to reference an ASF page that 
> we can keep updated and own the URL for.






[jira] [Created] (SPARK-17445) Reference an ASF page as the main place to find third-party packages

2016-09-07 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-17445:
-

 Summary: Reference an ASF page as the main place to find 
third-party packages
 Key: SPARK-17445
 URL: https://issues.apache.org/jira/browse/SPARK-17445
 Project: Spark
  Issue Type: Improvement
Reporter: Matei Zaharia


Some comments and docs like 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L148-L151
 say to go to spark-packages.org, but since this is a package index maintained 
by a third party, it would be better to reference an ASF page that we can keep 
updated and own the URL for.






[jira] [Commented] (SPARK-16031) Add debug-only socket source in Structured Streaming

2016-06-17 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15337182#comment-15337182
 ] 

Matei Zaharia commented on SPARK-16031:
---

FYI I'll post a PR for this soon.

> Add debug-only socket source in Structured Streaming
> 
>
> Key: SPARK-16031
> URL: https://issues.apache.org/jira/browse/SPARK-16031
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL, Streaming
>Reporter: Matei Zaharia
>Assignee: Matei Zaharia
>
> This is a debug-only version of SPARK-15842: for tutorials and debugging of 
> streaming apps, it would be nice to have a text-based socket source similar 
> to the one in Spark Streaming. It will clearly be marked as debug-only so 
> that users don't try to run it in production applications, because this type 
> of source cannot provide HA without storing a lot of state in Spark.
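For reference, a minimal sketch of how a debug-only text socket source of this kind is consumed from PySpark Structured Streaming, assuming a Spark 2.x SparkSession and a local feed (e.g. `nc -lk 9999`); host and port are placeholders.

{code:python}
# Sketch: read newline-delimited text from a socket and echo it to the console.
# Debug-only: this source is not fault tolerant.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("socket-debug").getOrCreate()

lines = (spark.readStream
              .format("socket")
              .option("host", "localhost")
              .option("port", 9999)
              .load())

query = (lines.writeStream
              .format("console")   # print incoming rows, enough for tutorials/debugging
              .outputMode("append")
              .start())
query.awaitTermination()
{code}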






[jira] [Created] (SPARK-16031) Add debug-only socket source in Structured Streaming

2016-06-17 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-16031:
-

 Summary: Add debug-only socket source in Structured Streaming
 Key: SPARK-16031
 URL: https://issues.apache.org/jira/browse/SPARK-16031
 Project: Spark
  Issue Type: New Feature
  Components: SQL, Streaming
Reporter: Matei Zaharia
Assignee: Matei Zaharia


This is a debug-only version of SPARK-15842: for tutorials and debugging of 
streaming apps, it would be nice to have a text-based socket source similar to 
the one in Spark Streaming. It will clearly be marked as debug-only so that 
users don't try to run it in production applications, because this type of 
source cannot provide HA without storing a lot of state in Spark.






[jira] [Created] (SPARK-15879) Update logo in UI and docs to add "Apache"

2016-06-10 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-15879:
-

 Summary: Update logo in UI and docs to add "Apache"
 Key: SPARK-15879
 URL: https://issues.apache.org/jira/browse/SPARK-15879
 Project: Spark
  Issue Type: Task
  Components: Documentation, Web UI
Reporter: Matei Zaharia


We recently added "Apache" to the Spark logo on the website 
(http://spark.apache.org/images/spark-logo.eps) to have it be the full project 
name, and we should do the same in the web UI and docs.






[jira] [Assigned] (SPARK-14356) Update spark.sql.execution.debug to work on Datasets

2016-04-03 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia reassigned SPARK-14356:
-

Assignee: Matei Zaharia

> Update spark.sql.execution.debug to work on Datasets
> 
>
> Key: SPARK-14356
> URL: https://issues.apache.org/jira/browse/SPARK-14356
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Matei Zaharia
>Assignee: Matei Zaharia
>Priority: Minor
>
> Currently it only works on DataFrame, which seems unnecessarily restrictive 
> for 2.0.






[jira] [Created] (SPARK-14356) Update spark.sql.execution.debug to work on Datasets

2016-04-03 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-14356:
-

 Summary: Update spark.sql.execution.debug to work on Datasets
 Key: SPARK-14356
 URL: https://issues.apache.org/jira/browse/SPARK-14356
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Matei Zaharia
Priority: Minor


Currently it only works on DataFrame, which seems unnecessarily restrictive for 
2.0.






[jira] [Commented] (SPARK-10854) MesosExecutorBackend: Received launchTask but executor was null

2015-12-03 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15038058#comment-15038058
 ] 

Matei Zaharia commented on SPARK-10854:
---

Just a note, I saw a log where this happened, and the sequence of events is 
that the executor logs a launchTask callback before registered(). It could be a 
synchronization thing or a problem in the Mesos library.

> MesosExecutorBackend: Received launchTask but executor was null
> ---
>
> Key: SPARK-10854
> URL: https://issues.apache.org/jira/browse/SPARK-10854
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.4.0
> Environment: Spark 1.4.0
> Mesos 0.23.0
> Docker 1.8.1
>Reporter: Kevin Matzen
>Priority: Minor
>
> Sometimes my tasks get stuck in staging.  Here's stdout from one such worker. 
>  I'm running mesos-slave inside a docker container with the host's docker 
> exposed and I'm using Spark's docker support to launch the worker inside its 
> own container.  Both containers are running.  I'm using pyspark.  I can see 
> mesos-slave and java running, but I do not see python running.
> {noformat}
> WARNING: Your kernel does not support swap limit capabilities, memory limited 
> without swap.
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> 15/09/28 15:02:09 INFO MesosExecutorBackend: Registered signal handlers for 
> [TERM, HUP, INT]
> I0928 15:02:09.65854138 exec.cpp:132] Version: 0.23.0
> 15/09/28 15:02:09 ERROR MesosExecutorBackend: Received launchTask but 
> executor was null
> I0928 15:02:09.70295554 exec.cpp:206] Executor registered on slave 
> 20150928-044200-1140850698-5050-8-S190
> 15/09/28 15:02:09 INFO MesosExecutorBackend: Registered with Mesos as 
> executor ID 20150928-044200-1140850698-5050-8-S190 with 1 cpus
> 15/09/28 15:02:09 INFO SecurityManager: Changing view acls to: root
> 15/09/28 15:02:09 INFO SecurityManager: Changing modify acls to: root
> 15/09/28 15:02:09 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(root); users 
> with modify permissions: Set(root)
> 15/09/28 15:02:10 INFO Slf4jLogger: Slf4jLogger started
> 15/09/28 15:02:10 INFO Remoting: Starting remoting
> 15/09/28 15:02:10 INFO Remoting: Remoting started; listening on addresses 
> :[akka.tcp://sparkExecutor@:56458]
> 15/09/28 15:02:10 INFO Utils: Successfully started service 'sparkExecutor' on 
> port 56458.
> 15/09/28 15:02:10 INFO DiskBlockManager: Created local directory at 
> /tmp/spark-28a21c2d-54cc-40b3-b0c2-cc3624f1a73c/blockmgr-f2336fec-e1ea-44f1-bd5c-9257049d5e7b
> 15/09/28 15:02:10 INFO MemoryStore: MemoryStore started with capacity 52.1 MB
> 15/09/28 15:02:11 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 15/09/28 15:02:11 INFO Executor: Starting executor ID 
> 20150928-044200-1140850698-5050-8-S190 on host 
> 15/09/28 15:02:11 INFO Utils: Successfully started service 
> 'org.apache.spark.network.netty.NettyBlockTransferService' on port 57431.
> 15/09/28 15:02:11 INFO NettyBlockTransferService: Server created on 57431
> 15/09/28 15:02:11 INFO BlockManagerMaster: Trying to register BlockManager
> 15/09/28 15:02:11 INFO BlockManagerMaster: Registered BlockManager
> {noformat}






[jira] [Created] (SPARK-11733) Allow shuffle readers to request data from just one mapper

2015-11-13 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-11733:
-

 Summary: Allow shuffle readers to request data from just one mapper
 Key: SPARK-11733
 URL: https://issues.apache.org/jira/browse/SPARK-11733
 Project: Spark
  Issue Type: Sub-task
Reporter: Matei Zaharia


This is needed to do broadcast joins. Right now the shuffle reader interface 
takes a range of reduce IDs but fetches from all maps.






[jira] [Commented] (SPARK-9999) RDD-like API on top of Catalyst/DataFrame

2015-10-16 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14961567#comment-14961567
 ] 

Matei Zaharia commented on SPARK-:
--

Beyond tuples, you'll also want encoders for other generic classes, such as 
Seq[T]. They're the cleanest mechanism to get the most type info. Also, from a 
software engineering point of view it's nice to avoid a central object where 
you register stuff to allow composition between libraries (basically, see the 
problems that the Kryo registry creates today).

> RDD-like API on top of Catalyst/DataFrame
> -
>
> Key: SPARK-
> URL: https://issues.apache.org/jira/browse/SPARK-
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Michael Armbrust
>
> The RDD API is very flexible, and as a result harder to optimize its 
> execution in some cases. The DataFrame API, on the other hand, is much easier 
> to optimize, but lacks some of the nice perks of the RDD API (e.g. harder to 
> use UDFs, lack of strong types in Scala/Java).
> The goal of Spark Datasets is to provide an API that allows users to easily 
> express transformations on domain objects, while also providing the 
> performance and robustness advantages of the Spark SQL execution engine.
> h2. Requirements
>  - *Fast* - In most cases, the performance of Datasets should be equal to or 
> better than working with RDDs.  Encoders should be as fast or faster than 
> Kryo and Java serialization, and unnecessary conversion should be avoided.
>  - *Typesafe* - Similar to RDDs, objects and functions that operate on those 
> objects should provide compile-time safety where possible.  When converting 
> from data where the schema is not known at compile-time (for example data 
> read from an external source such as JSON), the conversion function should 
> fail-fast if there is a schema mismatch.
>  - *Support for a variety of object models* - Default encoders should be 
> provided for a variety of object models: primitive types, case classes, 
> tuples, POJOs, JavaBeans, etc.  Ideally, objects that follow standard 
> conventions, such as Avro SpecificRecords, should also work out of the box.
>  - *Java Compatible* - Datasets should provide a single API that works in 
> both Scala and Java.  Where possible, shared types like Array will be used in 
> the API.  Where not possible, overloaded functions should be provided for 
> both languages.  Scala concepts, such as ClassTags should not be required in 
> the user-facing API.
>  - *Interoperates with DataFrames* - Users should be able to seamlessly 
> transition between Datasets and DataFrames, without specifying conversion 
> boiler-plate.  When names used in the input schema line-up with fields in the 
> given class, no extra mapping should be necessary.  Libraries like MLlib 
> should not need to provide different interfaces for accepting DataFrames and 
> Datasets as input.
> For a detailed outline of the complete proposed API: 
> [marmbrus/dataset-api|https://github.com/marmbrus/spark/pull/18/files]
> For an initial discussion of the design considerations in this API: [design 
> doc|https://docs.google.com/document/d/1ZVaDqOcLm2-NcS0TElmslHLsEIEwqzt0vBvzpLrV6Ik/edit#]






[jira] [Commented] (SPARK-9850) Adaptive execution in Spark

2015-09-24 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14907518#comment-14907518
 ] 

Matei Zaharia commented on SPARK-9850:
--

Hey Imran, this could make sense, but note that the problem will only happen if 
you have 2000 map *output* partitions, which would've been 2000 reduce tasks 
normally. Otherwise, you can have as many map *tasks* as needed with fewer 
partitions. In most jobs, I'd expect data to get significantly smaller after 
the maps, so we'd catch that. In particular, for choosing between broadcast and 
shuffle joins this should be fine. We can do something different if we suspect 
that there is going to be tons of map output *and* we think there's nontrivial 
planning to be done once we see it.

> Adaptive execution in Spark
> ---
>
> Key: SPARK-9850
> URL: https://issues.apache.org/jira/browse/SPARK-9850
> Project: Spark
>  Issue Type: Epic
>  Components: Spark Core, SQL
>Reporter: Matei Zaharia
>Assignee: Yin Huai
> Attachments: AdaptiveExecutionInSpark.pdf
>
>
> Query planning is one of the main factors in high performance, but the 
> current Spark engine requires the execution DAG for a job to be set in 
> advance. Even with cost-based optimization, it is hard to know the behavior 
> of data and user-defined functions well enough to always get great execution 
> plans. This JIRA proposes to add adaptive query execution, so that the engine 
> can change the plan for each query as it sees what data earlier stages 
> produced.
> We propose adding this to Spark SQL / DataFrames first, using a new API in 
> the Spark engine that lets libraries run DAGs adaptively. In future JIRAs, 
> the functionality could be extended to other libraries or the RDD API, but 
> that is more difficult than adding it in SQL.
> I've attached a design doc by Yin Huai and myself explaining how it would 
> work in more detail.






[jira] [Resolved] (SPARK-9852) Let reduce tasks fetch multiple map output partitions

2015-09-24 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia resolved SPARK-9852.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

> Let reduce tasks fetch multiple map output partitions
> -
>
> Key: SPARK-9852
> URL: https://issues.apache.org/jira/browse/SPARK-9852
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Reporter: Matei Zaharia
>Assignee: Matei Zaharia
> Fix For: 1.6.0
>
>







[jira] [Updated] (SPARK-9852) Let reduce tasks fetch multiple map output partitions

2015-09-20 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-9852:
-
Summary: Let reduce tasks fetch multiple map output partitions  (was: Let 
HashShuffleFetcher fetch multiple map output partitions)

> Let reduce tasks fetch multiple map output partitions
> -
>
> Key: SPARK-9852
> URL: https://issues.apache.org/jira/browse/SPARK-9852
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Reporter: Matei Zaharia
>Assignee: Matei Zaharia
>







[jira] [Resolved] (SPARK-9851) Support submitting map stages individually in DAGScheduler

2015-09-14 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia resolved SPARK-9851.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

> Support submitting map stages individually in DAGScheduler
> --
>
> Key: SPARK-9851
> URL: https://issues.apache.org/jira/browse/SPARK-9851
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Reporter: Matei Zaharia
>Assignee: Matei Zaharia
> Fix For: 1.6.0
>
>







[jira] [Assigned] (SPARK-9853) Optimize shuffle fetch of contiguous partition IDs

2015-08-20 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia reassigned SPARK-9853:


Assignee: Matei Zaharia

> Optimize shuffle fetch of contiguous partition IDs
> --
>
> Key: SPARK-9853
> URL: https://issues.apache.org/jira/browse/SPARK-9853
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Reporter: Matei Zaharia
>Assignee: Matei Zaharia
>Priority: Minor
>
> On the map side, we should be able to serve a block representing multiple 
> partition IDs in one block manager request






[jira] [Resolved] (SPARK-10008) Shuffle locality can take precedence over narrow dependencies for RDDs with both

2015-08-16 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia resolved SPARK-10008.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

> Shuffle locality can take precedence over narrow dependencies for RDDs with 
> both
> 
>
> Key: SPARK-10008
> URL: https://issues.apache.org/jira/browse/SPARK-10008
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Reporter: Matei Zaharia
>Assignee: Matei Zaharia
> Fix For: 1.5.0
>
>
> The shuffle locality patch made the DAGScheduler aware of shuffle data, but 
> for RDDs that have both narrow and shuffle dependencies, it can cause them to 
> place tasks based on the shuffle dependency instead of the narrow one. This 
> case is common in iterative join-based algorithms like PageRank and ALS, 
> where one RDD is hash-partitioned and one isn't.
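A small, hypothetical PySpark sketch of the case described above: one side of the join is explicitly hash-partitioned and cached, the other is not, so the joined RDD ends up with a narrow dependency on the former and a shuffle dependency on the latter. Names and data are illustrative, not code from the patch.

{code:python}
# Sketch of the mixed narrow/shuffle dependency scenario (PageRank-style join).
from pyspark import SparkContext

sc = SparkContext(appName="locality-sketch")

links = (sc.parallelize([("a", ["b", "c"]), ("b", ["a"]), ("c", ["a", "b"])])
           .partitionBy(4)   # hash-partitioned once and cached, as in PageRank/ALS loops
           .cache())
ranks = sc.parallelize([("a", 1.0), ("b", 1.0), ("c", 1.0)])  # not partitioned

# join() reuses links' partitioner: narrow dependency on links, shuffle on ranks.
# Ideally tasks are placed next to the cached links partitions, not the shuffle output.
contribs = links.join(ranks).flatMap(
    lambda kv: [(dst, kv[1][1] / len(kv[1][0])) for dst in kv[1][0]])
print(contribs.reduceByKey(lambda a, b: a + b).collect())
{code}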






[jira] [Assigned] (SPARK-10008) Shuffle locality can take precedence over narrow dependencies for RDDs with both

2015-08-14 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia reassigned SPARK-10008:
-

Assignee: Matei Zaharia

> Shuffle locality can take precedence over narrow dependencies for RDDs with 
> both
> 
>
> Key: SPARK-10008
> URL: https://issues.apache.org/jira/browse/SPARK-10008
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Reporter: Matei Zaharia
>Assignee: Matei Zaharia
>
> The shuffle locality patch made the DAGScheduler aware of shuffle data, but 
> for RDDs that have both narrow and shuffle dependencies, it can cause them to 
> place tasks based on the shuffle dependency instead of the narrow one. This 
> case is common in iterative join-based algorithms like PageRank and ALS, 
> where one RDD is hash-partitioned and one isn't.






[jira] [Created] (SPARK-10008) Shuffle locality can take precedence over narrow dependencies for RDDs with both

2015-08-14 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-10008:
-

 Summary: Shuffle locality can take precedence over narrow 
dependencies for RDDs with both
 Key: SPARK-10008
 URL: https://issues.apache.org/jira/browse/SPARK-10008
 Project: Spark
  Issue Type: Bug
  Components: Scheduler
Reporter: Matei Zaharia


The shuffle locality patch made the DAGScheduler aware of shuffle data, but for 
RDDs that have both narrow and shuffle dependencies, it can cause them to place 
tasks based on the shuffle dependency instead of the narrow one. This case is 
common in iterative join-based algorithms like PageRank and ALS, where one RDD 
is hash-partitioned and one isn't.






[jira] [Updated] (SPARK-9851) Support submitting map stages individually in DAGScheduler

2015-08-13 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-9851:
-
Summary: Support submitting map stages individually in DAGScheduler  (was: 
Add support for submitting map stages individually in DAGScheduler)

> Support submitting map stages individually in DAGScheduler
> --
>
> Key: SPARK-9851
> URL: https://issues.apache.org/jira/browse/SPARK-9851
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Reporter: Matei Zaharia
>Assignee: Matei Zaharia
>







[jira] [Created] (SPARK-9923) ShuffleMapStage.numAvailableOutputs should be an Int instead of Long

2015-08-12 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-9923:


 Summary: ShuffleMapStage.numAvailableOutputs should be an Int 
instead of Long
 Key: SPARK-9923
 URL: https://issues.apache.org/jira/browse/SPARK-9923
 Project: Spark
  Issue Type: Improvement
Reporter: Matei Zaharia
Priority: Trivial


Not sure why it was made a Long, but every usage assumes it's an Int.






[jira] [Updated] (SPARK-9923) ShuffleMapStage.numAvailableOutputs should be an Int instead of Long

2015-08-12 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-9923:
-
Labels: Starter  (was: )

> ShuffleMapStage.numAvailableOutputs should be an Int instead of Long
> 
>
> Key: SPARK-9923
> URL: https://issues.apache.org/jira/browse/SPARK-9923
> Project: Spark
>  Issue Type: Improvement
>Reporter: Matei Zaharia
>Priority: Trivial
>  Labels: Starter
>
> Not sure why it was made a Long, but every usage assumes it's an Int.






[jira] [Updated] (SPARK-9850) Adaptive execution in Spark

2015-08-11 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-9850:
-
Issue Type: Epic  (was: New Feature)

> Adaptive execution in Spark
> ---
>
> Key: SPARK-9850
> URL: https://issues.apache.org/jira/browse/SPARK-9850
> Project: Spark
>  Issue Type: Epic
>  Components: Spark Core, SQL
>Reporter: Matei Zaharia
>Assignee: Yin Huai
> Attachments: AdaptiveExecutionInSpark.pdf
>
>
> Query planning is one of the main factors in high performance, but the 
> current Spark engine requires the execution DAG for a job to be set in 
> advance. Even with cost-based optimization, it is hard to know the behavior 
> of data and user-defined functions well enough to always get great execution 
> plans. This JIRA proposes to add adaptive query execution, so that the engine 
> can change the plan for each query as it sees what data earlier stages 
> produced.
> We propose adding this to Spark SQL / DataFrames first, using a new API in 
> the Spark engine that lets libraries run DAGs adaptively. In future JIRAs, 
> the functionality could be extended to other libraries or the RDD API, but 
> that is more difficult than adding it in SQL.
> I've attached a design doc by Yin Huai and myself explaining how it would 
> work in more detail.






[jira] [Updated] (SPARK-9850) Adaptive execution in Spark

2015-08-11 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-9850:
-
Assignee: Yin Huai

> Adaptive execution in Spark
> ---
>
> Key: SPARK-9850
> URL: https://issues.apache.org/jira/browse/SPARK-9850
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, SQL
>Reporter: Matei Zaharia
>Assignee: Yin Huai
> Attachments: AdaptiveExecutionInSpark.pdf
>
>
> Query planning is one of the main factors in high performance, but the 
> current Spark engine requires the execution DAG for a job to be set in 
> advance. Even with cost-based optimization, it is hard to know the behavior 
> of data and user-defined functions well enough to always get great execution 
> plans. This JIRA proposes to add adaptive query execution, so that the engine 
> can change the plan for each query as it sees what data earlier stages 
> produced.
> We propose adding this to Spark SQL / DataFrames first, using a new API in 
> the Spark engine that lets libraries run DAGs adaptively. In future JIRAs, 
> the functionality could be extended to other libraries or the RDD API, but 
> that is more difficult than adding it in SQL.
> I've attached a design doc by Yin Huai and myself explaining how it would 
> work in more detail.






[jira] [Created] (SPARK-9853) Optimize shuffle fetch of contiguous partition IDs

2015-08-11 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-9853:


 Summary: Optimize shuffle fetch of contiguous partition IDs
 Key: SPARK-9853
 URL: https://issues.apache.org/jira/browse/SPARK-9853
 Project: Spark
  Issue Type: Sub-task
Reporter: Matei Zaharia
Priority: Minor


On the map side, we should be able to serve a block representing multiple 
partition IDs in one block manager request






[jira] [Assigned] (SPARK-9852) Let HashShuffleFetcher fetch multiple map output partitions

2015-08-11 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia reassigned SPARK-9852:


Assignee: Matei Zaharia

> Let HashShuffleFetcher fetch multiple map output partitions
> ---
>
> Key: SPARK-9852
> URL: https://issues.apache.org/jira/browse/SPARK-9852
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Reporter: Matei Zaharia
>Assignee: Matei Zaharia
>







[jira] [Created] (SPARK-9851) Add support for submitting map stages individually in DAGScheduler

2015-08-11 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-9851:


 Summary: Add support for submitting map stages individually in 
DAGScheduler
 Key: SPARK-9851
 URL: https://issues.apache.org/jira/browse/SPARK-9851
 Project: Spark
  Issue Type: Sub-task
Reporter: Matei Zaharia









[jira] [Assigned] (SPARK-9851) Add support for submitting map stages individually in DAGScheduler

2015-08-11 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia reassigned SPARK-9851:


Assignee: Matei Zaharia

> Add support for submitting map stages individually in DAGScheduler
> --
>
> Key: SPARK-9851
> URL: https://issues.apache.org/jira/browse/SPARK-9851
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Reporter: Matei Zaharia
>Assignee: Matei Zaharia
>







[jira] [Created] (SPARK-9852) Let HashShuffleFetcher fetch multiple map output partitions

2015-08-11 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-9852:


 Summary: Let HashShuffleFetcher fetch multiple map output 
partitions
 Key: SPARK-9852
 URL: https://issues.apache.org/jira/browse/SPARK-9852
 Project: Spark
  Issue Type: Sub-task
Reporter: Matei Zaharia









[jira] [Updated] (SPARK-9850) Adaptive execution in Spark

2015-08-11 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-9850:
-
Attachment: AdaptiveExecutionInSpark.pdf

> Adaptive execution in Spark
> ---
>
> Key: SPARK-9850
> URL: https://issues.apache.org/jira/browse/SPARK-9850
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, SQL
>Reporter: Matei Zaharia
> Attachments: AdaptiveExecutionInSpark.pdf
>
>
> Query planning is one of the main factors in high performance, but the 
> current Spark engine requires the execution DAG for a job to be set in 
> advance. Even with cost-based optimization, it is hard to know the behavior 
> of data and user-defined functions well enough to always get great execution 
> plans. This JIRA proposes to add adaptive query execution, so that the engine 
> can change the plan for each query as it sees what data earlier stages 
> produced.
> We propose adding this to Spark SQL / DataFrames first, using a new API in 
> the Spark engine that lets libraries run DAGs adaptively. In future JIRAs, 
> the functionality could be extended to other libraries or the RDD API, but 
> that is more difficult than adding it in SQL.
> I've attached a design doc by Yin Huai and myself explaining how it would 
> work in more detail.






[jira] [Created] (SPARK-9850) Adaptive execution in Spark

2015-08-11 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-9850:


 Summary: Adaptive execution in Spark
 Key: SPARK-9850
 URL: https://issues.apache.org/jira/browse/SPARK-9850
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core, SQL
Reporter: Matei Zaharia


Query planning is one of the main factors in high performance, but the current 
Spark engine requires the execution DAG for a job to be set in advance. Even 
with cost-based optimization, it is hard to know the behavior of data and 
user-defined functions well enough to always get great execution plans. This 
JIRA proposes to add adaptive query execution, so that the engine can change 
the plan for each query as it sees what data earlier stages produced.

We propose adding this to Spark SQL / DataFrames first, using a new API in the 
Spark engine that lets libraries run DAGs adaptively. In future JIRAs, the 
functionality could be extended to other libraries or the RDD API, but that is 
more difficult than adding it in SQL.

I've attached a design doc by Yin Huai and myself explaining how it would work 
in more detail.






[jira] [Resolved] (SPARK-9244) Increase some default memory limits

2015-07-22 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia resolved SPARK-9244.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

> Increase some default memory limits
> ---
>
> Key: SPARK-9244
> URL: https://issues.apache.org/jira/browse/SPARK-9244
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Matei Zaharia
>Assignee: Matei Zaharia
>Priority: Minor
> Fix For: 1.5.0
>
>
> There are a few memory limits that people hit often and that we could make 
> higher, especially now that memory sizes have grown.
> - spark.akka.frameSize: This defaults to 10 but is often hit for map output 
> statuses in large shuffles. AFAIK the memory is not fully allocated up-front, 
> so we can just make this larger and still not affect jobs that never sent a 
> status that large.
> - spark.executor.memory: Defaults to 512m, which is really small. We can at 
> least increase it to 1g, though this is something users do need to set on 
> their own.
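For illustration, a sketch of how the two settings above can be raised from user code; the values are placeholders, and in practice spark.executor.memory is often passed on the spark-submit command line instead.

{code:python}
# Sketch only: raise the limits discussed above before creating the context.
# spark.executor.memory must be set before executors launch, so it belongs in
# SparkConf or on the spark-submit command line, not changed at runtime.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("bigger-defaults")
        .set("spark.executor.memory", "1g")    # instead of the old 512m default
        .set("spark.akka.frameSize", "64"))    # MB; headroom for large map-output statuses

sc = SparkContext(conf=conf)
{code}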






[jira] [Created] (SPARK-9244) Increase some default memory limits

2015-07-21 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-9244:


 Summary: Increase some default memory limits
 Key: SPARK-9244
 URL: https://issues.apache.org/jira/browse/SPARK-9244
 Project: Spark
  Issue Type: Improvement
Reporter: Matei Zaharia
Assignee: Matei Zaharia


There are a few memory limits that people hit often and that we could make 
higher, especially now that memory sizes have grown.

- spark.akka.frameSize: This defaults to 10 but is often hit for map output 
statuses in large shuffles. AFAIK the memory is not fully allocated up-front, 
so we can just make this larger and still not affect jobs that never sent a 
status that large.

- spark.executor.memory: Defaults to 512m, which is really small. We can at 
least increase it to 1g, though this is something users do need to set on their 
own.






[jira] [Created] (SPARK-8111) SparkR shell should display Spark logo and version banner on startup

2015-06-04 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-8111:


 Summary: SparkR shell should display Spark logo and version banner 
on startup
 Key: SPARK-8111
 URL: https://issues.apache.org/jira/browse/SPARK-8111
 Project: Spark
  Issue Type: Improvement
  Components: SparkR
Affects Versions: 1.4.0
Reporter: Matei Zaharia
Priority: Trivial









[jira] [Updated] (SPARK-8110) DAG visualizations sometimes look weird in Python

2015-06-04 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-8110:
-
Attachment: Screen Shot 2015-06-04 at 1.51.32 PM.png
Screen Shot 2015-06-04 at 1.51.35 PM.png

> DAG visualizations sometimes look weird in Python
> -
>
> Key: SPARK-8110
> URL: https://issues.apache.org/jira/browse/SPARK-8110
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.4.0
>Reporter: Matei Zaharia
>Priority: Minor
> Attachments: Screen Shot 2015-06-04 at 1.51.32 PM.png, Screen Shot 
> 2015-06-04 at 1.51.35 PM.png
>
>
> Got this by doing sc.textFile("README.md").count() -- there are some RDDs 
> outside of any stages.






[jira] [Created] (SPARK-8110) DAG visualizations sometimes look weird in Python

2015-06-04 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-8110:


 Summary: DAG visualizations sometimes look weird in Python
 Key: SPARK-8110
 URL: https://issues.apache.org/jira/browse/SPARK-8110
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.4.0
Reporter: Matei Zaharia
Priority: Minor


Got this by doing sc.textFile("README.md").count() -- there are some RDDs 
outside of any stages.






[jira] [Resolved] (SPARK-7298) Harmonize style of new UI visualizations

2015-05-08 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia resolved SPARK-7298.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

> Harmonize style of new UI visualizations
> 
>
> Key: SPARK-7298
> URL: https://issues.apache.org/jira/browse/SPARK-7298
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Reporter: Patrick Wendell
>Assignee: Matei Zaharia
>Priority: Blocker
> Fix For: 1.4.0
>
>
> We need to go through all new visualizations in the web UI and make sure they 
> have a consistent style, both with each other and with the rest of the UI.






[jira] [Assigned] (SPARK-7298) Harmonize style of new UI visualizations

2015-05-06 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia reassigned SPARK-7298:


Assignee: Matei Zaharia  (was: Patrick Wendell)

> Harmonize style of new UI visualizations
> 
>
> Key: SPARK-7298
> URL: https://issues.apache.org/jira/browse/SPARK-7298
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Reporter: Patrick Wendell
>Assignee: Matei Zaharia
>Priority: Blocker
>
> We need to go through all new visualizations in the web UI and make sure they 
> have a consistent style, both with each other and with the rest of the UI.






[jira] [Commented] (SPARK-7261) Change default log level to WARN in the REPL

2015-04-29 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14520366#comment-14520366
 ] 

Matei Zaharia commented on SPARK-7261:
--

IMO we can do this even without SPARK-7260 in 1.4, but that one would be nice 
to have.

> Change default log level to WARN in the REPL
> 
>
> Key: SPARK-7261
> URL: https://issues.apache.org/jira/browse/SPARK-7261
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell
>Reporter: Patrick Wendell
>Assignee: Patrick Wendell
>Priority: Minor
>
> We should add a log4j properties file for the repl 
> (log4j-defaults-repl.properties) that sets the default log level to WARN. The 
> main reason for doing this is that we now display nice progress bars in the 
> REPL, so the need for task-level INFO messages is much less.
> A couple other things:
> 1. I'd block this on SPARK-7260
> 2. We should say in the repl opening that the log level is set to WARN and 
> explain to people how to change it programmatically (see the sketch below).
> 3. If the user has a log4j properties file, it should take precedence over 
> this default of WARN.
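
As a rough illustration of point 2, the level can be changed programmatically 
at any time through the plain log4j 1.x API (a sketch, not the actual REPL 
banner text):

{code}
import org.apache.log4j.{Level, Logger}

// Switch the root logger back to verbose output...
Logger.getRootLogger.setLevel(Level.INFO)
// ...or quiet it down again.
Logger.getRootLogger.setLevel(Level.WARN)
{code}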






[jira] [Created] (SPARK-6778) SQL contexts in spark-shell and pyspark should both be called sqlContext

2015-04-08 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-6778:


 Summary: SQL contexts in spark-shell and pyspark should both be 
called sqlContext
 Key: SPARK-6778
 URL: https://issues.apache.org/jira/browse/SPARK-6778
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Spark Shell
Reporter: Matei Zaharia


For some reason the Python one is only called sqlCtx. This is pretty confusing.






[jira] [Commented] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms

2015-04-01 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391456#comment-14391456
 ] 

Matei Zaharia commented on SPARK-6646:
--

Not to rain on the parade here, but I worry that focusing on mobile phones is 
short-sighted. Does this design present a path forward for the Internet of 
Things as well? You'd want something that runs on Arduino, Raspberry Pi, etc. 
We already have MQTT input in Spark Streaming so we could consider using MQTT 
to replace Netty for shuffle as well. Has anybody benchmarked that?

> Spark 2.0: Rearchitecting Spark for Mobile Platforms
> 
>
> Key: SPARK-6646
> URL: https://issues.apache.org/jira/browse/SPARK-6646
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Blocker
> Attachments: Spark on Mobile - Design Doc - v1.pdf
>
>
> Mobile computing is quickly rising to dominance, and by the end of 2017, it 
> is estimated that 90% of CPU cycles will be devoted to mobile hardware. 
> Spark’s project goal can be accomplished only when Spark runs efficiently for 
> the growing population of mobile users.
> Designed and optimized for modern data centers and Big Data applications, 
> Spark is unfortunately not a good fit for mobile computing today. In the past 
> few months, we have been prototyping the feasibility of a mobile-first Spark 
> architecture, and today we would like to share with you our findings. This 
> ticket outlines the technical design of Spark’s mobile support, and shares 
> results from several early prototypes.
> Mobile friendly version of the design doc: 
> https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html






[jira] [Commented] (SPARK-1564) Add JavaScript into Javadoc to turn ::Experimental:: and such into badges

2015-03-12 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359017#comment-14359017
 ] 

Matei Zaharia commented on SPARK-1564:
--

This is still a valid issue AFAIK, isn't it? These things still show up badly 
in Javadoc. So we could change the parent issue or something but I'd like to 
see it fixed.

> Add JavaScript into Javadoc to turn ::Experimental:: and such into badges
> -
>
> Key: SPARK-1564
> URL: https://issues.apache.org/jira/browse/SPARK-1564
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Matei Zaharia
>Assignee: Andrew Or
>Priority: Minor
>







[jira] [Commented] (SPARK-5654) Integrate SparkR into Apache Spark

2015-02-06 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14309782#comment-14309782
 ] 

Matei Zaharia commented on SPARK-5654:
--

Yup, there's a tradeoff, but given that this is a language API and not an 
algorithm, input source or anything like that, I think it's important to 
support it along with the core engine. R is extremely popular for data science, 
more so than Python, and it fits well with many existing concepts in Spark.

> Integrate SparkR into Apache Spark
> --
>
> Key: SPARK-5654
> URL: https://issues.apache.org/jira/browse/SPARK-5654
> Project: Spark
>  Issue Type: New Feature
>Reporter: Shivaram Venkataraman
>
> The SparkR project [1] provides a light-weight frontend to launch Spark jobs 
> from R. The project was started at the AMPLab around a year ago and has been 
> incubated as its own project to make sure it can be easily merged into 
> upstream Spark, i.e. without introducing any external dependencies. SparkR's 
> goals are similar to PySpark's, and it shares a similar design pattern, as 
> described in our meetup talk [2] and Spark Summit presentation [3].
> Integrating SparkR into the Apache project will enable R users to use Spark 
> out of the box, and given R's large user base, it will help the Spark project 
> reach more users. Additionally, work-in-progress features like providing R 
> integration with ML Pipelines and DataFrames can be better achieved by 
> development in a unified code base.
> SparkR is available under the Apache 2.0 License and does not have any 
> external dependencies other than requiring users to have R and Java installed 
> on their machines. SparkR's developers come from many organizations, 
> including UC Berkeley, Alteryx, and Intel, and we will support future 
> development and maintenance after the integration.
> [1] https://github.com/amplab-extras/SparkR-pkg
> [2] http://files.meetup.com/3138542/SparkR-meetup.pdf
> [3] http://spark-summit.org/2014/talk/sparkr-interactive-r-programs-at-scale-2






[jira] [Resolved] (SPARK-5608) Improve SEO of Spark documentation site to let Google find latest docs

2015-02-05 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia resolved SPARK-5608.
--
   Resolution: Fixed
Fix Version/s: 1.3.0

> Improve SEO of Spark documentation site to let Google find latest docs
> --
>
> Key: SPARK-5608
> URL: https://issues.apache.org/jira/browse/SPARK-5608
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Matei Zaharia
>Assignee: Matei Zaharia
> Fix For: 1.3.0
>
>
> Google currently has trouble finding spark.apache.org/docs/latest, so a lot 
> of the results returned for various queries are from random previous versions 
> of Spark where someone created a link. I'd like to do the following:
> - Add a sitemap.xml to spark.apache.org that lists all the docs/latest pages 
> (already done)
> - Add meta description tags on some of the most important doc pages
> - Shorten the titles of some pages to have more relevant keywords; for 
> example there's no reason to have "Spark SQL Programming Guide - Spark 1.2.0 
> documentation", we can just say "Spark SQL - Spark 1.2.0 documentation".






[jira] [Created] (SPARK-5608) Improve SEO of Spark documentation site to let Google find latest docs

2015-02-04 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-5608:


 Summary: Improve SEO of Spark documentation site to let Google 
find latest docs
 Key: SPARK-5608
 URL: https://issues.apache.org/jira/browse/SPARK-5608
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Reporter: Matei Zaharia
Assignee: Matei Zaharia


Google currently has trouble finding spark.apache.org/docs/latest, so a lot of 
the results returned for various queries are from random previous versions of 
Spark where someone created a link. I'd like to do the following:

- Add a sitemap.xml to spark.apache.org that lists all the docs/latest pages 
(already done)
- Add meta description tags on some of the most important doc pages
- Shorten the titles of some pages to have more relevant keywords; for example 
there's no reason to have "Spark SQL Programming Guide - Spark 1.2.0 
documentation", we can just say "Spark SQL - Spark 1.2.0 documentation".






[jira] [Updated] (SPARK-5088) Use spark-class for running executors directly on mesos

2015-01-13 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-5088:
-
Target Version/s: 1.3.0  (was: 1.3.0, 1.2.1)

> Use spark-class for running executors directly on mesos
> ---
>
> Key: SPARK-5088
> URL: https://issues.apache.org/jira/browse/SPARK-5088
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Mesos
>Affects Versions: 1.2.0
>Reporter: Jongyoul Lee
>Priority: Minor
> Fix For: 1.3.0
>
>
> - sbin/spark-executor is only used for running executors in a Mesos environment.
> - spark-executor calls spark-class internally without any specific parameters.
> - PYTHONPATH handling is moved into the spark-class case.
> - Remove a redundant file to simplify code maintenance.






[jira] [Updated] (SPARK-5088) Use spark-class for running executors directly on mesos

2015-01-13 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-5088:
-
Fix Version/s: (was: 1.2.1)

> Use spark-class for running executors directly on mesos
> ---
>
> Key: SPARK-5088
> URL: https://issues.apache.org/jira/browse/SPARK-5088
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Mesos
>Affects Versions: 1.2.0
>Reporter: Jongyoul Lee
>Priority: Minor
> Fix For: 1.3.0
>
>
> - sbin/spark-executor is only used for running executors in a Mesos environment.
> - spark-executor calls spark-class internally without any specific parameters.
> - PYTHONPATH handling is moved into the spark-class case.
> - Remove a redundant file to simplify code maintenance.






[jira] [Resolved] (SPARK-3619) Upgrade to Mesos 0.21 to work around MESOS-1688

2015-01-09 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia resolved SPARK-3619.
--
   Resolution: Fixed
Fix Version/s: 1.3.0
 Assignee: Jongyoul Lee  (was: Timothy Chen)

> Upgrade to Mesos 0.21 to work around MESOS-1688
> ---
>
> Key: SPARK-3619
> URL: https://issues.apache.org/jira/browse/SPARK-3619
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Matei Zaharia
>Assignee: Jongyoul Lee
> Fix For: 1.3.0
>
>
> The Mesos 0.21 release has a fix for 
> https://issues.apache.org/jira/browse/MESOS-1688, which affects Spark jobs.






[jira] [Commented] (SPARK-4660) JavaSerializer uses wrong classloader

2014-12-29 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14260544#comment-14260544
 ] 

Matei Zaharia commented on SPARK-4660:
--

[~pkolaczk] mind sending a pull request against http://github.com/apache/spark 
for this? It will allow us to run it through the automated tests. It looks like 
a good fix but this stuff can be tricky.

> JavaSerializer uses wrong classloader
> -
>
> Key: SPARK-4660
> URL: https://issues.apache.org/jira/browse/SPARK-4660
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0, 1.1.1
>Reporter: Piotr Kołaczkowski
>Priority: Critical
> Attachments: spark-serializer-classloader.patch
>
>
> During testing we found failures when trying to load some classes of the user 
> application:
> {noformat}
> ERROR 2014-11-29 20:01:56 org.apache.spark.storage.BlockManagerWorker: 
> Exception handling buffer message
> java.lang.ClassNotFoundException: 
> org.apache.spark.demo.HttpReceiverCases$HttpRequest
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:270)
>   at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:59)
>   at 
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
>   at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
>   at 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
>   at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
>   at 
> org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:235)
>   at 
> org.apache.spark.storage.MemoryStore.putIterator(MemoryStore.scala:126)
>   at 
> org.apache.spark.storage.MemoryStore.putIterator(MemoryStore.scala:104)
>   at org.apache.spark.storage.MemoryStore.putBytes(MemoryStore.scala:76)
>   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:748)
>   at 
> org.apache.spark.storage.BlockManager.putBytes(BlockManager.scala:639)
>   at 
> org.apache.spark.storage.BlockManagerWorker.putBlock(BlockManagerWorker.scala:92)
>   at 
> org.apache.spark.storage.BlockManagerWorker.processBlockMessage(BlockManagerWorker.scala:73)
>   at 
> org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:48)
>   at 
> org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:48)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at 
> org.apache.spark.storage.BlockMessageArray.foreach(BlockMessageArray.scala:28)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at 
> org.apache.spark.storage.BlockMessageArray.map(BlockMessageArray.scala:28)
>   at 
> org.apache.spark.storage.BlockManagerWorker.onBlockMessageReceive(BlockManagerWorker.scala:48)
>   at 
> org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:38)
>   at 
> org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:38)
>   at 
> org.apache.spark.network.ConnectionManager.org$apache$spark$network$ConnectionManager$$handleMessage(ConnectionManager.scala:682)
>   at 
> org.apache.spark.network.ConnectionManager$$anon$10.run(ConnectionManager.scala:520)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:744)
> {noformat}


[jira] [Commented] (SPARK-3247) Improved support for external data sources

2014-12-11 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14243253#comment-14243253
 ] 

Matei Zaharia commented on SPARK-3247:
--

For those looking to learn about the interface in more detail, there is a 
meetup video on it at https://www.youtube.com/watch?v=GQSNJAzxOr8.

> Improved support for external data sources
> --
>
> Key: SPARK-3247
> URL: https://issues.apache.org/jira/browse/SPARK-3247
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>Priority: Blocker
> Fix For: 1.2.0
>
>







[jira] [Closed] (SPARK-4690) AppendOnlyMap seems not using Quadratic probing as the JavaDoc

2014-12-03 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia closed SPARK-4690.

Resolution: Invalid

> AppendOnlyMap seems not using Quadratic probing as the JavaDoc
> --
>
> Key: SPARK-4690
> URL: https://issues.apache.org/jira/browse/SPARK-4690
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 1.1.0, 1.2.0, 1.3.0
>Reporter: Yijie Shen
>Priority: Minor
>
> org.apache.spark.util.collection.AppendOnlyMap's documentation says this:
> "This implementation uses quadratic probing with a power-of-2 "
> However, the probe procedure in the face of a hash collision appears to just 
> use linear probing; see the code below:
> val delta = i
> pos = (pos + delta) & mask
> i += 1
> Maybe a bug here?






[jira] [Commented] (SPARK-4690) AppendOnlyMap seems not using Quadratic probing as the JavaDoc

2014-12-03 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1429#comment-1429
 ] 

Matei Zaharia commented on SPARK-4690:
--

Yup, that's the definition of it.

> AppendOnlyMap seems not using Quadratic probing as the JavaDoc
> --
>
> Key: SPARK-4690
> URL: https://issues.apache.org/jira/browse/SPARK-4690
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 1.1.0, 1.2.0, 1.3.0
>Reporter: Yijie Shen
>Priority: Minor
>
> org.apache.spark.util.collection.AppendOnlyMap's documentation says this:
> "This implementation uses quadratic probing with a power-of-2 "
> However, the probe procedure in the face of a hash collision appears to just 
> use linear probing; see the code below:
> val delta = i
> pos = (pos + delta) & mask
> i += 1
> Maybe a bug here?
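
To make the closing comment above concrete: an increment that grows by one on 
every collision is the triangular-number form of quadratic probing, since after 
i probes the position is the starting slot plus i(i+1)/2 (masked by the 
power-of-2 capacity). A standalone sketch of that equivalence (not the actual 
AppendOnlyMap internals):

{code}
// Compare the incremental form from the quoted snippet with the closed form.
val capacity = 64
val mask = capacity - 1
val pos0 = 5

// Incremental form: delta grows by 1 on each collision.
val incremental = Iterator.iterate((pos0, 1)) { case (pos, i) =>
  ((pos + i) & mask, i + 1)
}.map(_._1).take(8).toList

// Closed form: pos0 + i*(i+1)/2, i.e. quadratic in the probe count i.
val closedForm = (0 until 8).map(i => (pos0 + i * (i + 1) / 2) & mask).toList

assert(incremental == closedForm)  // same probe sequence
{code}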






[jira] [Updated] (SPARK-4685) Update JavaDoc settings to include spark.ml and all spark.mllib subpackages in the right sections

2014-12-01 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-4685:
-
Target Version/s: 1.2.1  (was: 1.2.0)

> Update JavaDoc settings to include spark.ml and all spark.mllib subpackages 
> in the right sections
> -
>
> Key: SPARK-4685
> URL: https://issues.apache.org/jira/browse/SPARK-4685
> Project: Spark
>  Issue Type: New Feature
>  Components: Documentation
>Reporter: Matei Zaharia
>Priority: Trivial
>
> Right now they're listed under "other packages" on the homepage of the 
> JavaDoc docs.






[jira] [Updated] (SPARK-4685) Update JavaDoc settings to include spark.ml and all spark.mllib subpackages in the right sections

2014-12-01 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-4685:
-
Priority: Trivial  (was: Major)

> Update JavaDoc settings to include spark.ml and all spark.mllib subpackages 
> in the right sections
> -
>
> Key: SPARK-4685
> URL: https://issues.apache.org/jira/browse/SPARK-4685
> Project: Spark
>  Issue Type: New Feature
>  Components: Documentation
>Reporter: Matei Zaharia
>Priority: Trivial
>
> Right now they're listed under "other packages" on the homepage of the 
> JavaDoc docs.






[jira] [Created] (SPARK-4685) Update JavaDoc settings to include spark.ml and all spark.mllib subpackages in the right sections

2014-12-01 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-4685:


 Summary: Update JavaDoc settings to include spark.ml and all 
spark.mllib subpackages in the right sections
 Key: SPARK-4685
 URL: https://issues.apache.org/jira/browse/SPARK-4685
 Project: Spark
  Issue Type: New Feature
  Components: Documentation
Reporter: Matei Zaharia


Right now they're listed under "other packages" on the homepage of the JavaDoc 
docs.






[jira] [Created] (SPARK-4684) Add a script to run JDBC server on Windows

2014-12-01 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-4684:


 Summary: Add a script to run JDBC server on Windows
 Key: SPARK-4684
 URL: https://issues.apache.org/jira/browse/SPARK-4684
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Matei Zaharia
Priority: Minor









[jira] [Created] (SPARK-4683) Add a beeline.cmd to run on Windows

2014-12-01 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-4683:


 Summary: Add a beeline.cmd to run on Windows
 Key: SPARK-4683
 URL: https://issues.apache.org/jira/browse/SPARK-4683
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Matei Zaharia









[jira] [Updated] (SPARK-4613) Make JdbcRDD easier to use from Java

2014-11-27 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-4613:
-
Issue Type: Improvement  (was: Bug)

> Make JdbcRDD easier to use from Java
> 
>
> Key: SPARK-4613
> URL: https://issues.apache.org/jira/browse/SPARK-4613
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Matei Zaharia
>Assignee: Cheng Lian
> Fix For: 1.2.0
>
>
> We might eventually deprecate it, but for now it would be nice to expose a 
> Java wrapper that allows users to create this using the java function 
> interface.






[jira] [Resolved] (SPARK-4613) Make JdbcRDD easier to use from Java

2014-11-27 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia resolved SPARK-4613.
--
   Resolution: Fixed
Fix Version/s: 1.2.0

> Make JdbcRDD easier to use from Java
> 
>
> Key: SPARK-4613
> URL: https://issues.apache.org/jira/browse/SPARK-4613
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Matei Zaharia
>Assignee: Cheng Lian
> Fix For: 1.2.0
>
>
> We might eventually deprecate it, but for now it would be nice to expose a 
> Java wrapper that allows users to create this using the java function 
> interface.






[jira] [Reopened] (SPARK-3628) Don't apply accumulator updates multiple times for tasks in result stages

2014-11-26 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia reopened SPARK-3628:
--

> Don't apply accumulator updates multiple times for tasks in result stages
> -
>
> Key: SPARK-3628
> URL: https://issues.apache.org/jira/browse/SPARK-3628
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Matei Zaharia
>Assignee: Nan Zhu
>Priority: Blocker
> Fix For: 1.2.0
>
>
> In previous versions of Spark, accumulator updates only got applied once for 
> accumulators that are only used in actions (i.e. result stages), letting you 
> use them to deterministically compute a result. Unfortunately, this got 
> broken in some recent refactorings.
> This is related to https://issues.apache.org/jira/browse/SPARK-732, but that 
> issue is about applying the same semantics to intermediate stages too, which 
> is more work and may not be what we want for debugging.






[jira] [Closed] (SPARK-732) Recomputation of RDDs may result in duplicated accumulator updates

2014-11-26 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia closed SPARK-732.
---
Resolution: Won't Fix

> Recomputation of RDDs may result in duplicated accumulator updates
> --
>
> Key: SPARK-732
> URL: https://issues.apache.org/jira/browse/SPARK-732
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.7.0, 0.6.2, 0.7.1, 0.8.0, 0.7.2, 0.7.3, 0.8.1, 0.8.2, 
> 0.9.0, 1.0.1, 1.1.0
>Reporter: Josh Rosen
>Assignee: Nan Zhu
>Priority: Blocker
>
> Currently, Spark doesn't guard against duplicated updates to the same 
> accumulator due to recomputations of an RDD.  For example:
> {code}
> val acc = sc.accumulator(0)
> data.map { x => acc += 1; f(x) }
> data.count()
> // acc should equal data.count() here
> data.foreach{...}
> // Now, acc = 2 * data.count() because the map() was recomputed.
> {code}
> I think that this behavior is incorrect, especially because it allows the 
> addition or removal of a cache() call to affect the outcome of a computation.
> There's an old TODO to fix this duplicate update issue in the [DAGScheduler 
> code|https://github.com/mesos/spark/blob/ec5e553b418be43aa3f0ccc24e0d5ca9d63504b2/core/src/main/scala/spark/scheduler/DAGScheduler.scala#L494].
> I haven't tested whether recomputation due to blocks being dropped from the 
> cache can trigger duplicate accumulator updates.
> Hypothetically someone could be relying on the current behavior to implement 
> performance counters that track the actual number of computations performed 
> (including recomputations).  To be safe, we could add an explicit warning in 
> the release notes that documents the change in behavior when we fix this.
> Ignoring duplicate updates shouldn't be too hard, but there are a few 
> subtleties.  Currently, we allow accumulators to be used in multiple 
> transformations, so we'd need to detect duplicate updates at the 
> per-transformation level.  I haven't dug too deeply into the scheduler 
> internals, but we might also run into problems where pipelining causes what 
> is logically one set of accumulator updates to show up in two different tasks 
> (e.g. rdd.map(accum += x; ...) and rdd.map(accum += x; ...).count() may cause 
> what's logically the same accumulator update to be applied from two different 
> contexts, complicating the detection of duplicate updates).






[jira] [Commented] (SPARK-732) Recomputation of RDDs may result in duplicated accumulator updates

2014-11-26 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14227108#comment-14227108
 ] 

Matei Zaharia commented on SPARK-732:
-

As discussed on https://github.com/apache/spark/pull/2524 this is pretty hard 
to provide good semantics for in the general case (accumulator updates inside 
non-result stages), for the following reasons:

- An RDD may be computed as part of multiple stages. For example, if you update 
an accumulator inside a MappedRDD and then shuffle it, that might be one stage. 
But if you then call map() again on the MappedRDD, and shuffle the result of 
that, you get a second stage where that map is pipelined. Do you want to count 
this accumulator update twice or not?
- Entire stages may be resubmitted if shuffle files are deleted by the periodic 
cleaner or are lost due to a node failure, so anything that tracks RDDs would 
need to do so for long periods of time (as long as the RDD is referenceable in 
the user program), which would be pretty complicated to implement.

So I'm going to mark this as "won't fix" for now, except for the part for 
result stages done in SPARK-3628.
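
A minimal sketch of the first scenario above (illustrative only; whether the 
update should count once or twice is exactly the open question):

{code}
val acc = sc.accumulator(0)

// Stage 1: the map runs as the map side of this shuffle.
val mapped = sc.parallelize(1 to 100).map { x => acc += 1; (x % 10, x) }
mapped.reduceByKey(_ + _).count()

// Stage 2: because `mapped` is not cached, the same map is pipelined again
// into a second shuffle, so its accumulator updates run a second time.
mapped.map { case (k, v) => (v % 7, k) }.reduceByKey(_ + _).count()
{code}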

> Recomputation of RDDs may result in duplicated accumulator updates
> --
>
> Key: SPARK-732
> URL: https://issues.apache.org/jira/browse/SPARK-732
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.7.0, 0.6.2, 0.7.1, 0.8.0, 0.7.2, 0.7.3, 0.8.1, 0.8.2, 
> 0.9.0, 1.0.1, 1.1.0
>Reporter: Josh Rosen
>Assignee: Nan Zhu
>Priority: Blocker
>
> Currently, Spark doesn't guard against duplicated updates to the same 
> accumulator due to recomputations of an RDD.  For example:
> {code}
> val acc = sc.accumulator(0)
> data.map { x => acc += 1; f(x) }
> data.count()
> // acc should equal data.count() here
> data.foreach{...}
> // Now, acc = 2 * data.count() because the map() was recomputed.
> {code}
> I think that this behavior is incorrect, especially because it allows the 
> addition or removal of a cache() call to affect the outcome of a computation.
> There's an old TODO to fix this duplicate update issue in the [DAGScheduler 
> code|https://github.com/mesos/spark/blob/ec5e553b418be43aa3f0ccc24e0d5ca9d63504b2/core/src/main/scala/spark/scheduler/DAGScheduler.scala#L494].
> I haven't tested whether recomputation due to blocks being dropped from the 
> cache can trigger duplicate accumulator updates.
> Hypothetically someone could be relying on the current behavior to implement 
> performance counters that track the actual number of computations performed 
> (including recomputations).  To be safe, we could add an explicit warning in 
> the release notes that documents the change in behavior when we fix this.
> Ignoring duplicate updates shouldn't be too hard, but there are a few 
> subtleties.  Currently, we allow accumulators to be used in multiple 
> transformations, so we'd need to detect duplicate updates at the 
> per-transformation level.  I haven't dug too deeply into the scheduler 
> internals, but we might also run into problems where pipelining causes what 
> is logically one set of accumulator updates to show up in two different tasks 
> (e.g. rdd.map(accum += x; ...) and rdd.map(accum += x; ...).count() may cause 
> what's logically the same accumulator update to be applied from two different 
> contexts, complicating the detection of duplicate updates).






[jira] [Commented] (SPARK-3628) Don't apply accumulator updates multiple times for tasks in result stages

2014-11-26 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14227077#comment-14227077
 ] 

Matei Zaharia commented on SPARK-3628:
--

FYI I merged this into 1.2.0, since the patch is now quite a bit smaller. We 
should decide whether we want to back port it to branch-1.1, so I'll leave it 
open for that reason. I don't think there's much point backporting it further 
because the issue is somewhat rare, but we can do it if people ask for it.

> Don't apply accumulator updates multiple times for tasks in result stages
> -
>
> Key: SPARK-3628
> URL: https://issues.apache.org/jira/browse/SPARK-3628
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Matei Zaharia
>Assignee: Nan Zhu
>Priority: Blocker
> Fix For: 1.2.0
>
>
> In previous versions of Spark, accumulator updates only got applied once for 
> accumulators that are only used in actions (i.e. result stages), letting you 
> use them to deterministically compute a result. Unfortunately, this got 
> broken in some recent refactorings.
> This is related to https://issues.apache.org/jira/browse/SPARK-732, but that 
> issue is about applying the same semantics to intermediate stages too, which 
> is more work and may not be what we want for debugging.






[jira] [Resolved] (SPARK-3628) Don't apply accumulator updates multiple times for tasks in result stages

2014-11-26 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia resolved SPARK-3628.
--
  Resolution: Fixed
   Fix Version/s: 1.2.0
Target Version/s: 1.1.2  (was: 0.9.3, 1.0.3, 1.1.2, 1.2.1)

> Don't apply accumulator updates multiple times for tasks in result stages
> -
>
> Key: SPARK-3628
> URL: https://issues.apache.org/jira/browse/SPARK-3628
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Matei Zaharia
>Assignee: Nan Zhu
>Priority: Blocker
> Fix For: 1.2.0
>
>
> In previous versions of Spark, accumulator updates only got applied once for 
> accumulators that are only used in actions (i.e. result stages), letting you 
> use them to deterministically compute a result. Unfortunately, this got 
> broken in some recent refactorings.
> This is related to https://issues.apache.org/jira/browse/SPARK-732, but that 
> issue is about applying the same semantics to intermediate stages too, which 
> is more work and may not be what we want for debugging.






[jira] [Commented] (SPARK-4613) Make JdbcRDD easier to use from Java

2014-11-25 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14225615#comment-14225615
 ] 

Matei Zaharia commented on SPARK-4613:
--

BTW the strawman for this would be a version of the API that doesn't take Scala 
function objects for getConnection and mapRow, possibly through a static method 
on object JdbcRDD.
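
A minimal sketch of what such a strawman could look like; the create name, the 
SAM-style interfaces, and the JavaRDD return type are illustrative assumptions, 
not the final API:

{code}
// Hypothetical Java-friendly factory shape for JdbcRDD.
object JdbcRDDJavaStrawman {
  import java.sql.{Connection, ResultSet}
  import org.apache.spark.api.java.{JavaRDD, JavaSparkContext}

  // Single-method interfaces so Java callers can pass anonymous classes
  // instead of Scala function objects.
  trait ConnectionFactory extends java.io.Serializable {
    def getConnection(): Connection
  }
  trait RowMapper[T] extends java.io.Serializable {
    def mapRow(rs: ResultSet): T
  }

  def create[T](
      sc: JavaSparkContext,
      connectionFactory: ConnectionFactory,
      sql: String,
      lowerBound: Long,
      upperBound: Long,
      numPartitions: Int,
      mapRow: RowMapper[T]): JavaRDD[T] = {
    // A real implementation would wrap org.apache.spark.rdd.JdbcRDD here,
    // adapting these interfaces to the Scala closures it expects.
    ???
  }
}
{code}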

> Make JdbcRDD easier to use from Java
> 
>
> Key: SPARK-4613
> URL: https://issues.apache.org/jira/browse/SPARK-4613
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Matei Zaharia
>
> We might eventually deprecate it, but for now it would be better to make it 
> more Java-friendly.






[jira] [Created] (SPARK-4613) Make JdbcRDD easier to use from Java

2014-11-25 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-4613:


 Summary: Make JdbcRDD easier to use from Java
 Key: SPARK-4613
 URL: https://issues.apache.org/jira/browse/SPARK-4613
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Matei Zaharia


We might eventually deprecate it, but for now it would be better to make it 
more Java-friendly.






[jira] [Commented] (SPARK-3633) Fetches failure observed after SPARK-2711

2014-11-23 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222545#comment-14222545
 ] 

Matei Zaharia commented on SPARK-3633:
--

[~stephen] you can try the 1.1.1 RC in 
http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-1-1-RC2-td9439.html,
 which includes a Maven staging repo that you can just add as a repo in a build.
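
For anyone trying that out, a hedged sbt sketch of adding a staging repository 
to a build; the repository URL below is a placeholder, not the actual staging 
URL (take that from the vote thread):

{code}
// build.sbt
resolvers += "Spark 1.1.1 RC staging (placeholder URL)" at
  "https://repository.apache.org/content/repositories/orgapachespark-XXXX/"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.1.1"
{code}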

> Fetches failure observed after SPARK-2711
> -
>
> Key: SPARK-3633
> URL: https://issues.apache.org/jira/browse/SPARK-3633
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager
>Affects Versions: 1.1.0
>Reporter: Nishkam Ravi
>Priority: Blocker
>
> Running a variant of PageRank on a 6-node cluster with a 30Gb input dataset. 
> Recently upgraded to Spark 1.1. The workload fails with the following error 
> message(s):
> {code}
> 14/09/19 12:10:38 WARN TaskSetManager: Lost task 51.0 in stage 2.1 (TID 552, 
> c1705.halxg.cloudera.com): FetchFailed(BlockManagerId(1, 
> c1706.halxg.cloudera.com, 49612, 0), shuffleId=3, mapId=75, reduceId=120)
> 14/09/19 12:10:38 INFO DAGScheduler: Resubmitting failed stages
> {code}
> In order to identify the problem, I carried out change set analysis. As I go 
> back in time, the error message changes to:
> {code}
> 14/09/21 12:56:54 WARN TaskSetManager: Lost task 35.0 in stage 3.0 (TID 519, 
> c1706.halxg.cloudera.com): java.io.FileNotFoundException: 
> /var/lib/jenkins/workspace/tmp/spark-local-20140921123257-68ee/1c/temp_3a1ade13-b48a-437a-a466-673995304034
>  (Too many open files)
> java.io.FileOutputStream.open(Native Method)
> java.io.FileOutputStream.(FileOutputStream.java:221)
> 
> org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:117)
> 
> org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:185)
> 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:197)
> 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:145)
> org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
> 
> org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:51)
> 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
> 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> org.apache.spark.scheduler.Task.run(Task.scala:54)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)
> 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> java.lang.Thread.run(Thread.java:745)
> {code}
> All the way until Aug 4th. Turns out the problem changeset is 4fde28c. 






[jira] [Commented] (SPARK-4452) Shuffle data structures can starve others on the same thread for memory

2014-11-18 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14217331#comment-14217331
 ] 

Matei Zaharia commented on SPARK-4452:
--

Forced spilling is orthogonal to how you set the limits actually. For example, 
if there are N objects, one way to set limits is to reserve at least 1/N of 
memory for each one. But another way would be to group them by thread, and use 
a different algorithm for allocation within a thread (e.g. set each object's 
cap to more if other objects in their thread are using less). Whether you force 
spilling or not, you'll have to decide what the right limit for each thing is.

> Shuffle data structures can starve others on the same thread for memory 
> 
>
> Key: SPARK-4452
> URL: https://issues.apache.org/jira/browse/SPARK-4452
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Tianshuo Deng
>Assignee: Tianshuo Deng
>Priority: Critical
>
> When an Aggregator is used with ExternalSorter in a task, Spark will create 
> many small files, which can cause a "too many open files" error during merging.
> Currently, ShuffleMemoryManager does not work well when there are two spillable 
> objects in a thread, which in this case are ExternalSorter and 
> ExternalAppendOnlyMap (used by Aggregator). Here is an example: due to the use 
> of map-side aggregation, ExternalAppendOnlyMap is created first to read the 
> RDD. It may ask for as much memory as it can, which is 
> totalMem/numberOfThreads. Later on, when ExternalSorter is created in the same 
> thread, the ShuffleMemoryManager may refuse to allocate more memory to it, 
> since the memory has already been given to the previously requested object 
> (ExternalAppendOnlyMap). That causes the ExternalSorter to keep spilling small 
> files (due to the lack of memory).
> I'm currently working on a PR to address these two issues. It will include the 
> following changes:
> 1. The ShuffleMemoryManager should track not only the memory usage of each 
> thread, but also the object that holds the memory.
> 2. The ShuffleMemoryManager should be able to trigger the spilling of a 
> spillable object. In this way, if a new object in a thread is requesting 
> memory, the old occupant can be evicted/spilled. Previously, the spillable 
> objects trigger spilling by themselves, so one may not trigger spilling even 
> if another object in the same thread needs more memory. After this change, 
> the ShuffleMemoryManager can trigger the spilling of an object if it needs to.
> 3. Make the iterator of ExternalAppendOnlyMap spillable. Previously, 
> ExternalAppendOnlyMap returns a destructive iterator and cannot be spilled 
> after the iterator is returned. This should be changed so that even after the 
> iterator is returned, the ShuffleMemoryManager can still spill it.
> Currently, I have a working branch in progress: 
> https://github.com/tsdeng/spark/tree/enhance_memory_manager. I have already 
> made change 3 and have a prototype of changes 1 and 2 to evict spillables from 
> the memory manager; this is still in progress. I will send a PR when it's done.
> Any feedback or thoughts on this change are highly appreciated!






[jira] [Commented] (SPARK-4452) Shuffle data structures can starve others on the same thread for memory

2014-11-18 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14216691#comment-14216691
 ] 

Matei Zaharia commented on SPARK-4452:
--

BTW I've thought about this more and here's what I'd suggest: try a version 
where each object is allowed to ramp up to a certain size (say 5 MB) before 
being subject to the limit, and if that doesn't work, then maybe go for the 
forced-spilling one. The reason is that as soon as N objects are active, the 
ShuffleMemoryManager will not let any object ramp up to more than 1/N, so it 
just has to fill up its current quota and stop. This means that scenarios with 
very little free memory might only happen at the beginning (when tasks start 
up). If we can make this work, then we avoid a lot of concurrency problems that 
would happen with forced spilling. 

Another improvement would be to make the Spillables request less than 2x their 
current memory when they ramp up, e.g. 1.5x. They'd then make more requests but 
it would lead to slower ramp-up and more of a chance for other threads to grab 
memory. But I think this will have less impact than simply increasing that free 
minimum amount.
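
A toy, single-threaded sketch of the allocation policy floated above (not the 
real ShuffleMemoryManager; the names and the 5 MB / 1.5x figures are just the 
values mentioned in this thread):

{code}
// Simplified model: each consumer may ramp up to max(5 MB, its 1/N fair share).
class ToyMemoryManager(totalMem: Long) {
  private val minRampUp = 5L * 1024 * 1024         // always allow up to 5 MB
  private var consumers = Map.empty[String, Long]  // consumer -> bytes held

  /** Grant as much of `requested` as the policy allows and record it. */
  def tryAcquire(name: String, requested: Long): Long = {
    val held = consumers.getOrElse(name, 0L)
    val fairShare = totalMem / math.max(1, consumers.size + 1)
    val cap = math.max(minRampUp, fairShare)
    val granted = math.max(0L, math.min(requested, cap - held))
    consumers += name -> (held + granted)
    granted
  }

  /** Spillables could grow requests by ~1.5x instead of 2x. */
  def nextRequestSize(current: Long): Long = (current * 3) / 2
}
{code}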

> Shuffle data structures can starve others on the same thread for memory 
> 
>
> Key: SPARK-4452
> URL: https://issues.apache.org/jira/browse/SPARK-4452
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.1.0
>Reporter: Tianshuo Deng
>Assignee: Tianshuo Deng
>Priority: Blocker
>
> When an Aggregator is used with ExternalSorter in a task, Spark will create 
> many small files, which can cause a "too many open files" error during merging.
> Currently, ShuffleMemoryManager does not work well when there are two spillable 
> objects in a thread, which in this case are ExternalSorter and 
> ExternalAppendOnlyMap (used by Aggregator). Here is an example: due to the use 
> of map-side aggregation, ExternalAppendOnlyMap is created first to read the 
> RDD. It may ask for as much memory as it can, which is 
> totalMem/numberOfThreads. Later on, when ExternalSorter is created in the same 
> thread, the ShuffleMemoryManager may refuse to allocate more memory to it, 
> since the memory has already been given to the previously requested object 
> (ExternalAppendOnlyMap). That causes the ExternalSorter to keep spilling small 
> files (due to the lack of memory).
> I'm currently working on a PR to address these two issues. It will include the 
> following changes:
> 1. The ShuffleMemoryManager should track not only the memory usage of each 
> thread, but also the object that holds the memory.
> 2. The ShuffleMemoryManager should be able to trigger the spilling of a 
> spillable object. In this way, if a new object in a thread is requesting 
> memory, the old occupant can be evicted/spilled. Previously, the spillable 
> objects trigger spilling by themselves, so one may not trigger spilling even 
> if another object in the same thread needs more memory. After this change, 
> the ShuffleMemoryManager can trigger the spilling of an object if it needs to.
> 3. Make the iterator of ExternalAppendOnlyMap spillable. Previously, 
> ExternalAppendOnlyMap returns a destructive iterator and cannot be spilled 
> after the iterator is returned. This should be changed so that even after the 
> iterator is returned, the ShuffleMemoryManager can still spill it.
> Currently, I have a working branch in progress: 
> https://github.com/tsdeng/spark/tree/enhance_memory_manager. I have already 
> made change 3 and have a prototype of changes 1 and 2 to evict spillables from 
> the memory manager; this is still in progress. I will send a PR when it's done.
> Any feedback or thoughts on this change are highly appreciated!






[jira] [Commented] (SPARK-4452) Shuffle data structures can starve others on the same thread for memory

2014-11-17 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14215557#comment-14215557
 ] 

Matei Zaharia commented on SPARK-4452:
--

BTW we may also want to create a separate JIRA for the short-term fix for 1.1 
and 1.2.

> Shuffle data structures can starve others on the same thread for memory 
> 
>
> Key: SPARK-4452
> URL: https://issues.apache.org/jira/browse/SPARK-4452
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.1.0
>Reporter: Tianshuo Deng
>Assignee: Tianshuo Deng
>Priority: Blocker
>
> When an Aggregator is used with ExternalSorter in a task, Spark will create 
> many small files, which can cause a "too many open files" error during merging.
> This happens when using the sort-based shuffle. The issue is caused by 
> multiple factors:
> 1. There seems to be a bug in setting the elementsRead variable in 
> ExternalSorter, which renders the trackMemoryThreshold (defined in Spillable) 
> useless for triggering spilling; the PR to fix it is 
> https://github.com/apache/spark/pull/3302
> 2. The current ShuffleMemoryManager does not work well when there are two 
> spillable objects in a thread, which in this case are ExternalSorter and 
> ExternalAppendOnlyMap (used by Aggregator). Here is an example: due to the use 
> of map-side aggregation, ExternalAppendOnlyMap is created first to read the 
> RDD. It may ask for as much memory as it can, which is 
> totalMem/numberOfThreads. Later on, when ExternalSorter is created in the same 
> thread, the ShuffleMemoryManager may refuse to allocate more memory to it, 
> since the memory has already been given to the previously requested object 
> (ExternalAppendOnlyMap). That causes the ExternalSorter to keep spilling small 
> files (due to the lack of memory).
> I'm currently working on a PR to address these two issues. It will include the 
> following changes:
> 1. The ShuffleMemoryManager should track not only the memory usage of each 
> thread, but also the object that holds the memory.
> 2. The ShuffleMemoryManager should be able to trigger the spilling of a 
> spillable object. In this way, if a new object in a thread is requesting 
> memory, the old occupant can be evicted/spilled. This avoids problem 2 from 
> happening. Previously, spillable objects trigger spilling by themselves, so 
> one may not trigger spilling even if another object in the same thread needs 
> more memory. After this change, the ShuffleMemoryManager can trigger the 
> spilling of an object if it needs to.
> 3. Make the iterator of ExternalAppendOnlyMap spillable. Previously, 
> ExternalAppendOnlyMap returns a destructive iterator and cannot be spilled 
> after the iterator is returned. This should be changed so that even after the 
> iterator is returned, the ShuffleMemoryManager can still spill it.
> Currently, I have a working branch in progress: 
> https://github.com/tsdeng/spark/tree/enhance_memory_manager
> I have already made change 3 and have a prototype of changes 1 and 2 to evict 
> spillables from the memory manager; this is still in progress.
> I will send a PR when it's done.
> Any feedback or thoughts on this change are highly appreciated!






[jira] [Commented] (SPARK-4452) Shuffle data structures can starve others on the same thread for memory

2014-11-17 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14215556#comment-14215556
 ] 

Matei Zaharia commented on SPARK-4452:
--

Got it. It would be fine to do this if you found it to help, I was just 
wondering whether simpler fixes would get us far enough. For the forced 
spilling change, I'd suggest writing a short design doc, or making sure that 
the comments in the code about it are very detailed (essentially having a 
design doc at the top of the class). This can have a lot of tricky cases due to 
concurrency so it's important to document the design.

> Shuffle data structures can starve others on the same thread for memory 
> 
>
> Key: SPARK-4452
> URL: https://issues.apache.org/jira/browse/SPARK-4452
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.1.0
>Reporter: Tianshuo Deng
>Assignee: Tianshuo Deng
>Priority: Blocker
>
> When an Aggregator is used with ExternalSorter in a task, Spark will create 
> many small files, which can cause a "too many open files" error during 
> merging. This happens when using the sort-based shuffle. The issue is caused 
> by multiple factors:
> 1. There seems to be a bug in setting the elementsRead variable in 
> ExternalSorter, which renders trackMemoryThreshold (defined in Spillable) 
> useless for triggering spilling. The PR to fix it is 
> https://github.com/apache/spark/pull/3302
> 2. The current ShuffleMemoryManager does not work well when there are two 
> spillable objects in a thread, which in this case are ExternalSorter and 
> ExternalAppendOnlyMap (used by Aggregator). Here is an example: due to the 
> use of map-side aggregation, ExternalAppendOnlyMap is created first to read 
> the RDD. It may ask for as much memory as it can, which is 
> totalMem/numberOfThreads. Then, when ExternalSorter is later created in the 
> same thread, the ShuffleMemoryManager may refuse to allocate more memory to 
> it, since the memory has already been given to the previously requesting 
> object (ExternalAppendOnlyMap). That causes the ExternalSorter to keep 
> spilling small files (due to the lack of memory).
> I'm currently working on a PR to address these two issues. It will include 
> the following changes:
> 1. The ShuffleMemoryManager should track not only the memory usage of each 
> thread, but also the object that holds the memory.
> 2. The ShuffleMemoryManager should be able to trigger the spilling of a 
> spillable object. That way, if a new object in a thread requests memory, the 
> old occupant can be evicted/spilled, which avoids problem 2. Previously, 
> spillable objects triggered spilling by themselves, so one might not spill 
> even if another object in the same thread needed more memory. After this 
> change, the ShuffleMemoryManager can trigger the spilling of an object 
> whenever it needs to.
> 3. Make the iterator of ExternalAppendOnlyMap spillable. Previously, 
> ExternalAppendOnlyMap returned a destructive iterator and could not be 
> spilled after the iterator was returned. This should be changed so that the 
> ShuffleMemoryManager can still spill it even after the iterator is returned.
> Currently, I have a working branch in progress: 
> https://github.com/tsdeng/spark/tree/enhance_memory_manager
> Change 3 is already made, and a prototype of changes 1 and 2 (evicting 
> spillables from the memory manager) is still in progress.
> I will send a PR when it's done.
> Any feedback or thoughts on this change are highly appreciated!
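To make the proposal concrete, here is a minimal sketch of changes 1 and 2 
(ForcibleSpillable and EnhancedShuffleMemoryManager are hypothetical names, 
not the actual patch): the manager remembers which spillable object holds 
each grant, and when a request cannot be satisfied it forces other holders in 
memory to spill before giving up.

{code}
import scala.collection.mutable

// Hypothetical interface: a consumer that the manager can ask to free memory.
trait ForcibleSpillable {
  /** Spill in-memory data to disk and return the number of bytes freed. */
  def forceSpill(): Long
}

class EnhancedShuffleMemoryManager(maxMemory: Long) {
  // Which spillable object currently holds how many bytes.
  private val holders = mutable.HashMap[ForcibleSpillable, Long]()

  /** Try to grant numBytes, forcing other holders to spill if needed. */
  def tryToAcquire(requester: ForcibleSpillable, numBytes: Long): Long = synchronized {
    var free = maxMemory - holders.values.sum
    if (free < numBytes) {
      // Evict other holders, largest first, until enough memory is freed.
      holders.toSeq.filter(_._1 ne requester).sortBy(-_._2).foreach { case (holder, held) =>
        if (free < numBytes) {
          val freed = math.min(holder.forceSpill(), held)
          holders(holder) = held - freed
          free += freed
        }
      }
    }
    val granted = math.min(numBytes, math.max(free, 0L))
    if (granted > 0) holders(requester) = holders.getOrElse(requester, 0L) + granted
    granted
  }
}
{code}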



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4452) Shuffle data structures can starve others on the same thread for memory

2014-11-17 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14215425#comment-14215425
 ] 

Matei Zaharia commented on SPARK-4452:
--

How much of this gets fixed if you fix the elementsRead bug in ExternalSorter?

With forcing data structures to spill, the problem is that it will introduce 
complexity in every spillable data structure. I wonder if we can make it just 
give out memory in smaller increments, so that threads check whether they 
should spill more often. In addition, we can set a better minimum or maximum on 
each thread (e.g. always let it ramp up to, say, 5 MB, or some fraction of the 
memory space).

I do like the idea of making the ShuffleMemoryManager track limits per object. 
I actually considered this when I wrote that class but didn't do it, possibly 
because it would've created more complexity in figuring out when an object is 
done. But it seems like it should be straightforward to add, as long as you 
also track which objects come from which thread so that you can still call 
releaseMemoryForThisThread() to clean up.
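A rough sketch of that bookkeeping (hypothetical names, not the real 
ShuffleMemoryManager): key each grant by (thread, consumer object), so that 
individual consumers are accounted for separately while 
releaseMemoryForThisThread() can still clean up everything a task held.

{code}
import scala.collection.mutable

class PerObjectMemoryTracker(maxMemory: Long) {
  // (thread id, consumer object) -> bytes currently granted
  private val granted = mutable.HashMap[(Long, AnyRef), Long]()

  /** Grant up to numBytes to this consumer on the calling thread. */
  def tryToAcquire(consumer: AnyRef, numBytes: Long): Long = synchronized {
    val free = maxMemory - granted.values.sum
    val toGrant = math.min(numBytes, math.max(free, 0L))
    if (toGrant > 0) {
      val key = (Thread.currentThread().getId, consumer)
      granted(key) = granted.getOrElse(key, 0L) + toGrant
    }
    toGrant
  }

  /** Drop every grant made by any consumer running on the calling thread. */
  def releaseMemoryForThisThread(): Unit = synchronized {
    val tid = Thread.currentThread().getId
    granted.retain { case ((t, _), _) => t != tid }
  }
}
{code}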

> Shuffle data structures can starve others on the same thread for memory 
> 
>
> Key: SPARK-4452
> URL: https://issues.apache.org/jira/browse/SPARK-4452
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.1.0
>Reporter: Tianshuo Deng
>
> When an Aggregator is used with ExternalSorter in a task, Spark will create 
> many small files, which can cause a "too many open files" error during 
> merging. This happens when using the sort-based shuffle. The issue is caused 
> by multiple factors:
> 1. There seems to be a bug in setting the elementsRead variable in 
> ExternalSorter, which renders trackMemoryThreshold (defined in Spillable) 
> useless for triggering spilling. The PR to fix it is 
> https://github.com/apache/spark/pull/3302
> 2. The current ShuffleMemoryManager does not work well when there are two 
> spillable objects in a thread, which in this case are ExternalSorter and 
> ExternalAppendOnlyMap (used by Aggregator). Here is an example: due to the 
> use of map-side aggregation, ExternalAppendOnlyMap is created first to read 
> the RDD. It may ask for as much memory as it can, which is 
> totalMem/numberOfThreads. Then, when ExternalSorter is later created in the 
> same thread, the ShuffleMemoryManager may refuse to allocate more memory to 
> it, since the memory has already been given to the previously requesting 
> object (ExternalAppendOnlyMap). That causes the ExternalSorter to keep 
> spilling small files (due to the lack of memory).
> I'm currently working on a PR to address these two issues. It will include 
> the following changes:
> 1. The ShuffleMemoryManager should track not only the memory usage of each 
> thread, but also the object that holds the memory.
> 2. The ShuffleMemoryManager should be able to trigger the spilling of a 
> spillable object. That way, if a new object in a thread requests memory, the 
> old occupant can be evicted/spilled, which avoids problem 2. Previously, 
> spillable objects triggered spilling by themselves, so one might not spill 
> even if another object in the same thread needed more memory. After this 
> change, the ShuffleMemoryManager can trigger the spilling of an object 
> whenever it needs to.
> 3. Make the iterator of ExternalAppendOnlyMap spillable. Previously, 
> ExternalAppendOnlyMap returned a destructive iterator and could not be 
> spilled after the iterator was returned. This should be changed so that the 
> ShuffleMemoryManager can still spill it even after the iterator is returned.
> Currently, I have a working branch in progress: 
> https://github.com/tsdeng/spark/tree/enhance_memory_manager
> Change 3 is already made, and a prototype of changes 1 and 2 (evicting 
> spillables from the memory manager) is still in progress.
> I will send a PR when it's done.
> Any feedback or thoughts on this change are highly appreciated!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4439) Expose RandomForest in Python

2014-11-16 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-4439:
-
Summary: Expose RandomForest in Python  (was: Export RandomForest in Python)

> Expose RandomForest in Python
> -
>
> Key: SPARK-4439
> URL: https://issues.apache.org/jira/browse/SPARK-4439
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib, PySpark
>Reporter: Matei Zaharia
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4439) Export RandomForest in Python

2014-11-16 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-4439:


 Summary: Export RandomForest in Python
 Key: SPARK-4439
 URL: https://issues.apache.org/jira/browse/SPARK-4439
 Project: Spark
  Issue Type: New Feature
  Components: MLlib, PySpark
Reporter: Matei Zaharia






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4434) spark-submit cluster deploy mode JAR URLs are broken in 1.1.1

2014-11-16 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214155#comment-14214155
 ] 

Matei Zaharia commented on SPARK-4434:
--

[~joshrosen] make sure to revert this on 1.2 and master as well.

> spark-submit cluster deploy mode JAR URLs are broken in 1.1.1
> -
>
> Key: SPARK-4434
> URL: https://issues.apache.org/jira/browse/SPARK-4434
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Spark Core
>Affects Versions: 1.1.1, 1.2.0
>Reporter: Josh Rosen
>Assignee: Andrew Or
>Priority: Blocker
>
> When submitting a driver using {{spark-submit}} in cluster mode, Spark 1.1.0 
> allowed you to omit the {{file://}} or {{hdfs://}} prefix from the 
> application JAR URL, e.g.
> {code}
> ./bin/spark-submit --deploy-mode cluster --master 
> spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi 
> /Users/joshrosen/Documents/old-spark-releases/spark-1.1.0-bin-hadoop1/lib/spark-examples-1.1.0-hadoop1.0.4.jar
> {code}
> In Spark 1.1.1 and 1.2.0, this same command now fails with an error:
> {code}
> ./bin/spark-submit --deploy-mode cluster --master 
> spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi 
> /Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar
> Jar url 
> 'file:/Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar'
>  is not in valid format.
> Must be a jar file path in URL format (e.g. hdfs://XX.jar, file://XX.jar)
> Usage: DriverClient [options] launch
> [driver options]
> Usage: DriverClient kill  
> {code}
> I tried changing my URL to conform to the new format, but this either 
> resulted in an error or a job that failed:
> {code}
> ./bin/spark-submit --deploy-mode cluster --master 
> spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi 
> file:///Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar
> Jar url 
> 'file:///Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar'
>  is not in valid format.
> Must be a jar file path in URL format (e.g. hdfs://XX.jar, file://XX.jar)
> {code}
> If I omit the extra slash:
> {code}
> ./bin/spark-submit --deploy-mode cluster --master 
> spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi 
> file://Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar
> Sending launch command to spark://joshs-mbp.att.net:7077
> Driver successfully submitted as driver-20141116143235-0002
> ... waiting before polling master for driver state
> ... polling master for driver state
> State of driver-20141116143235-0002 is ERROR
> Exception from cluster was: java.lang.IllegalArgumentException: Wrong FS: 
> file://Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar,
>  expected: file:///
> java.lang.IllegalArgumentException: Wrong FS: 
> file://Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar,
>  expected: file:///
>   at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:381)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:55)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:393)
>   at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251)
>   at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:329)
>   at 
> org.apache.spark.deploy.worker.DriverRunner.org$apache$spark$deploy$worker$DriverRunner$$downloadUserJar(DriverRunner.scala:157)
>   at 
> org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:74)
> {code}
> This bug effectively prevents users from using {{spark-submit}} in cluster 
> mode to run drivers whose JARs are stored on shared cluster filesystems.
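For reference, one way a launcher could accept both bare local paths and 
already-qualified URLs is to fill in a missing scheme instead of rejecting the 
input. This is only a sketch under that assumption (normalizeJarUrl is a 
hypothetical helper, not Spark's actual validation code):

{code}
import java.net.URI
import java.nio.file.Paths

object JarUrlNormalizer {
  /** Bare local paths become file:///... URIs; hdfs://, file://, etc. are kept. */
  def normalizeJarUrl(path: String): String = {
    val uri = new URI(path)
    if (uri.getScheme == null) Paths.get(path).toAbsolutePath.toUri.toString
    else path
  }
}
{code}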



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4435) Add setThreshold in Python LogisticRegressionModel and SVMModel

2014-11-16 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-4435:


 Summary: Add setThreshold in Python LogisticRegressionModel and 
SVMModel
 Key: SPARK-4435
 URL: https://issues.apache.org/jira/browse/SPARK-4435
 Project: Spark
  Issue Type: New Feature
  Components: MLlib, PySpark
Reporter: Matei Zaharia






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4306) LogisticRegressionWithLBFGS support for PySpark MLlib

2014-11-16 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214134#comment-14214134
 ] 

Matei Zaharia commented on SPARK-4306:
--

[~srinathsmn] I've assigned it to you. When do you think you'll get this done? 
It would be great to include in 1.2 but for that we'd need it quite soon (say 
this week). If you don't have time, I can also assign it to someone else.

> LogisticRegressionWithLBFGS support for PySpark MLlib 
> --
>
> Key: SPARK-4306
> URL: https://issues.apache.org/jira/browse/SPARK-4306
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib, PySpark
>Reporter: Varadharajan
>  Labels: newbie
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Currently we support LogisticRegressionWithSGD in the PySpark MLlib 
> interface. This task is to add support for the LogisticRegressionWithLBFGS 
> algorithm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4306) LogisticRegressionWithLBFGS support for PySpark MLlib

2014-11-16 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-4306:
-
Assignee: Varadharajan

> LogisticRegressionWithLBFGS support for PySpark MLlib 
> --
>
> Key: SPARK-4306
> URL: https://issues.apache.org/jira/browse/SPARK-4306
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib, PySpark
>Reporter: Varadharajan
>Assignee: Varadharajan
>  Labels: newbie
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Currently we support LogisticRegressionWithSGD in the PySpark MLlib 
> interface. This task is to add support for the LogisticRegressionWithLBFGS 
> algorithm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4306) LogisticRegressionWithLBFGS support for PySpark MLlib

2014-11-16 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-4306:
-
Target Version/s: 1.2.0

> LogisticRegressionWithLBFGS support for PySpark MLlib 
> --
>
> Key: SPARK-4306
> URL: https://issues.apache.org/jira/browse/SPARK-4306
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib, PySpark
>Reporter: Varadharajan
>  Labels: newbie
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Currently we support LogisticRegressionWithSGD in the PySpark MLlib 
> interface. This task is to add support for the LogisticRegressionWithLBFGS 
> algorithm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4330) Link to proper URL for YARN overview

2014-11-10 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-4330:
-
Assignee: Kousuke Saruta

> Link to proper URL for YARN overview
> 
>
> Key: SPARK-4330
> URL: https://issues.apache.org/jira/browse/SPARK-4330
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 1.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
> Fix For: 1.1.1, 1.2.0
>
>
> In running-on-yarn.md, there is a link to the YARN overview, but the URL 
> points to the YARN alpha documentation. It should point to the stable 
> documentation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4330) Link to proper URL for YARN overview

2014-11-10 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia resolved SPARK-4330.
--
Resolution: Fixed
Fix Version/s: 1.1.1, 1.2.0
Target Version/s: (was: 1.3.0)

> Link to proper URL for YARN overview
> 
>
> Key: SPARK-4330
> URL: https://issues.apache.org/jira/browse/SPARK-4330
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 1.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
> Fix For: 1.1.1, 1.2.0
>
>
> In running-on-yarn.md, there is a link to the YARN overview, but the URL 
> points to the YARN alpha documentation. It should point to the stable 
> documentation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4303) [MLLIB] Use "Long" IDs instead of "Int" in ALS.Rating class

2014-11-07 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14203147#comment-14203147
 ] 

Matei Zaharia commented on SPARK-4303:
--

Yup, this will actually become easier with the new pipeline API, but it's 
probably not going to happen in 1.2.

> [MLLIB] Use "Long" IDs instead of "Int" in ALS.Rating class
> ---
>
> Key: SPARK-4303
> URL: https://issues.apache.org/jira/browse/SPARK-4303
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Jia Xu
>
> In many big data recommendation applications, the IDs used for "users" and 
> "products" are usually of type Long instead of Integer, so a Rating class 
> based on Long IDs would be more useful for these applications, e.g.
> case class Rating(val user: Long, val product: Long, val rating: Double)
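A minimal sketch of what the suggestion could look like in user code 
(LongRating and its parse helper are hypothetical names, not MLlib API):

{code}
import org.apache.spark.rdd.RDD

// Long-keyed variant of the Rating class.
case class LongRating(user: Long, product: Long, rating: Double)

object LongRating {
  /** Parse "userId,productId,rating" lines whose IDs overflow Int. */
  def parse(lines: RDD[String]): RDD[LongRating] =
    lines.map { line =>
      val Array(u, p, r) = line.split(",")
      LongRating(u.toLong, p.toLong, r.toDouble)
    }
}
{code}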



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2348) In Windows, having an environment variable named 'classpath' gives an error

2014-11-06 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-2348:
-
Priority: Critical  (was: Major)

> In Windows, having an environment variable named 'classpath' gives an error
> ---
>
> Key: SPARK-2348
> URL: https://issues.apache.org/jira/browse/SPARK-2348
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
> Environment: Windows 7 Enterprise
>Reporter: Chirag Todarka
>Assignee: Chirag Todarka
>Priority: Critical
>
> Operating System: Windows 7 Enterprise
> If there is an environment variable named 'classpath', then starting 
> 'spark-shell' gives the error below:
> \spark\bin>spark-shell
> Failed to initialize compiler: object scala.runtime in compiler mirror not 
> found
> .
> ** Note that as of 2.8 scala does not assume use of the java classpath.
> ** For the old behavior pass -usejavacp to scala, or if using a Settings
> ** object programatically, settings.usejavacp.value = true.
> 14/07/02 14:22:06 WARN SparkILoop$SparkILoopInterpreter: Warning: compiler 
> acces
> sed before init set up.  Assuming no postInit code.
> Failed to initialize compiler: object scala.runtime in compiler mirror not 
> found
> .
> ** Note that as of 2.8 scala does not assume use of the java classpath.
> ** For the old behavior pass -usejavacp to scala, or if using a Settings
> ** object programatically, settings.usejavacp.value = true.
> Exception in thread "main" java.lang.AssertionError: assertion failed: null
> at scala.Predef$.assert(Predef.scala:179)
> at 
> org.apache.spark.repl.SparkIMain.initializeSynchronous(SparkIMain.sca
> la:202)
> at 
> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(Spar
> kILoop.scala:929)
> at 
> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.
> scala:884)
> at 
> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.
> scala:884)
> at 
> scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClass
> Loader.scala:135)
> at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:884)
> at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:982)
> at org.apache.spark.repl.Main$.main(Main.scala:31)
> at org.apache.spark.repl.Main.main(Main.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
> at java.lang.reflect.Method.invoke(Unknown Source)
> at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-2237) Add ZLIBCompressionCodec code

2014-11-06 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia closed SPARK-2237.

Resolution: Won't Fix

> Add ZLIBCompressionCodec code
> -
>
> Key: SPARK-2237
> URL: https://issues.apache.org/jira/browse/SPARK-2237
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Yanjie Gao
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-1185) In Spark Programming Guide, "Master URLs" should mention yarn-client

2014-11-06 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia resolved SPARK-1185.
--
Resolution: Fixed

> In Spark Programming Guide, "Master URLs" should mention yarn-client
> 
>
> Key: SPARK-1185
> URL: https://issues.apache.org/jira/browse/SPARK-1185
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 0.9.0
>Reporter: Sandy Pérez González
>Assignee: Sandy Pérez González
>
> It would also be helpful to mention that the reason a host:port isn't 
> required for YARN mode is that it comes from the Hadoop configuration.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-1023) Remove Thread.sleep(5000) from TaskSchedulerImpl

2014-11-06 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia resolved SPARK-1023.
--
Resolution: Fixed

> Remove Thread.sleep(5000) from TaskSchedulerImpl
> 
>
> Key: SPARK-1023
> URL: https://issues.apache.org/jira/browse/SPARK-1023
> Project: Spark
>  Issue Type: Bug
>Reporter: Patrick Wendell
> Fix For: 1.0.0
>
>
> This causes the unit tests to take super long. We should figure out why this 
> exists and see if we can lower it or do something smarter.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-1000) Crash when running SparkPi example with local-cluster

2014-11-06 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia closed SPARK-1000.

Resolution: Cannot Reproduce

> Crash when running SparkPi example with local-cluster
> -
>
> Key: SPARK-1000
> URL: https://issues.apache.org/jira/browse/SPARK-1000
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.9.0
>Reporter: xiajunluan
>Assignee: Andrew Or
>
> when I run SparkPi with local-cluster[2,2,512], it will throw following 
> exception at the end of job.
> WARNING: An exception was thrown by an exception handler.
> java.util.concurrent.RejectedExecutionException
>   at 
> java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:1768)
>   at 
> java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:767)
>   at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:658)
>   at 
> org.jboss.netty.channel.socket.nio.AbstractNioWorker.start(AbstractNioWorker.java:184)
>   at 
> org.jboss.netty.channel.socket.nio.AbstractNioWorker.executeInIoThread(AbstractNioWorker.java:330)
>   at 
> org.jboss.netty.channel.socket.nio.NioWorker.executeInIoThread(NioWorker.java:35)
>   at 
> org.jboss.netty.channel.socket.nio.AbstractNioWorker.executeInIoThread(AbstractNioWorker.java:313)
>   at 
> org.jboss.netty.channel.socket.nio.NioWorker.executeInIoThread(NioWorker.java:35)
>   at 
> org.jboss.netty.channel.socket.nio.AbstractNioChannelSink.execute(AbstractNioChannelSink.java:34)
>   at 
> org.jboss.netty.channel.Channels.fireExceptionCaughtLater(Channels.java:504)
>   at 
> org.jboss.netty.channel.AbstractChannelSink.exceptionCaught(AbstractChannelSink.java:47)
>   at org.jboss.netty.channel.Channels.fireChannelOpen(Channels.java:170)
>   at 
> org.jboss.netty.channel.socket.nio.NioClientSocketChannel.(NioClientSocketChannel.java:79)
>   at 
> org.jboss.netty.channel.socket.nio.NioClientSocketChannelFactory.newChannel(NioClientSocketChannelFactory.java:176)
>   at 
> org.jboss.netty.channel.socket.nio.NioClientSocketChannelFactory.newChannel(NioClientSocketChannelFactory.java:82)
>   at 
> org.jboss.netty.bootstrap.ClientBootstrap.connect(ClientBootstrap.java:213)
>   at 
> org.jboss.netty.bootstrap.ClientBootstrap.connect(ClientBootstrap.java:183)
>   at 
> akka.remote.netty.ActiveRemoteClient$$anonfun$connect$1.apply$mcV$sp(Client.scala:173)
>   at akka.util.Switch.liftedTree1$1(LockUtil.scala:33)
>   at akka.util.Switch.transcend(LockUtil.scala:32)
>   at akka.util.Switch.switchOn(LockUtil.scala:55)
>   at akka.remote.netty.ActiveRemoteClient.connect(Client.scala:158)
>   at 
> akka.remote.netty.NettyRemoteTransport.send(NettyRemoteSupport.scala:153)
>   at akka.remote.RemoteActorRef.$bang(RemoteActorRefProvider.scala:247)
>   at 
> akka.actor.LocalDeathWatch$$anonfun$publish$1.apply(ActorRefProvider.scala:559)
>   at 
> akka.actor.LocalDeathWatch$$anonfun$publish$1.apply(ActorRefProvider.scala:559)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:772)
>   at scala.collection.immutable.VectorIterator.foreach(Vector.scala:648)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:73)
>   at scala.collection.immutable.Vector.foreach(Vector.scala:63)
>   at akka.actor.LocalDeathWatch.publish(ActorRefProvider.scala:559)
>   at 
> akka.remote.RemoteDeathWatch.publish(RemoteActorRefProvider.scala:280)
>   at 
> akka.remote.RemoteDeathWatch.publish(RemoteActorRefProvider.scala:262)
>   at akka.actor.ActorCell.doTerminate(ActorCell.scala:701)
>   at akka.actor.ActorCell.handleChildTerminated(ActorCell.scala:747)
>   at akka.actor.ActorCell.systemInvoke(ActorCell.scala:608)
>   at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:209)
>   at akka.dispatch.Mailbox.run(Mailbox.scala:178)
>   at 
> akka.dispatch.ForkJoinExecutorConfigurator$MailboxExecutionTask.exec(AbstractDispatcher.scala:516)
>   at akka.jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:259)
>   at akka.jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:975)
>   at akka.jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1479)
>   at akka.jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-993) Don't reuse Writable objects in HadoopRDDs by default

2014-11-06 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14200531#comment-14200531
 ] 

Matei Zaharia commented on SPARK-993:
-

Arun, you'd see this issue if you do collect() or take() and then println. The 
problem is that the same Text object (for example) is referenced for all 
records in the dataset. The counts will be okay.
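A sketch of the usual workaround, assuming an existing SparkContext named sc: 
copy each record into an immutable value before collecting, so the collected 
elements no longer all point at the same reused Writable instance.

{code}
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

val lines = sc.hadoopFile[LongWritable, Text, TextInputFormat]("hdfs:///data/input")
  .map { case (offset, text) => (offset.get, text.toString) }  // materialize copies
val collected = lines.collect()  // each element now holds its own String
{code}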

> Don't reuse Writable objects in HadoopRDDs by default
> -
>
> Key: SPARK-993
> URL: https://issues.apache.org/jira/browse/SPARK-993
> Project: Spark
>  Issue Type: Improvement
>Reporter: Matei Zaharia
>
> Right now we reuse them as an optimization, which leads to weird results when 
> you call collect() on a file with distinct items. We should instead make that 
> behavior optional through a flag.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-993) Don't reuse Writable objects in HadoopRDDs by default

2014-11-06 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia resolved SPARK-993.
-
Resolution: Won't Fix

We investigated this for 1.0 but found that many InputFormats behave wrongly if 
you try to clone the object, so we won't fix it.

> Don't reuse Writable objects in HadoopRDDs by default
> -
>
> Key: SPARK-993
> URL: https://issues.apache.org/jira/browse/SPARK-993
> Project: Spark
>  Issue Type: Improvement
>Reporter: Matei Zaharia
>
> Right now we reuse them as an optimization, which leads to weird results when 
> you call collect() on a file with distinct items. We should instead make that 
> behavior optional through a flag.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-798) AMI: ami-530f7a3a and Mesos

2014-11-06 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia resolved SPARK-798.
-
Resolution: Won't Fix

This is now about a pretty old AMI, so I'll close it. New versions of Spark use 
newer versions of Mesos.

> AMI: ami-530f7a3a and Mesos
> ---
>
> Key: SPARK-798
> URL: https://issues.apache.org/jira/browse/SPARK-798
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 0.7.2
>Reporter: Alexander Albul
>
> Hi,
> I have been seeing some strange problems since the new version of the AMI 
> came out. The problem is that when I create a Mesos cluster, I can't use it: 
> the Spark console is frozen most of the time, and even when it is not 
> frozen, the scheduled tasks hang.
> Steps to reproduce:
> 1) Start cluster: ./spark-ec2 -s 1 -w 200 -i [identity] -k [key-pair] 
> --cluster-type=mesos launch spark-aalbul
> 2) SSH to the master node
> 3) go to "spark" dir 
> 4) Execute: MASTER=`cat ~/spark-ec2/cluster-url` ./spark-shell
> Problems:
> 1) The most recent problem is that the Spark shell is unable to start, like this:
> {code}
> Welcome to
>     __  
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 0.7.2
>   /_/  
> Using Scala version 2.9.3 (OpenJDK 64-Bit Server VM, Java 1.7.0_25)
> Initializing interpreter...
> 13/07/10 12:14:11 INFO server.Server: jetty-7.6.8.v20121106
> 13/07/10 12:14:11 INFO server.AbstractConnector: Started 
> SocketConnector@0.0.0.0:33601
> Creating SparkContext...
> 13/07/10 12:14:21 INFO slf4j.Slf4jEventHandler: Slf4jEventHandler started
> 13/07/10 12:14:22 INFO spark.SparkEnv: Registering BlockManagerMaster
> 13/07/10 12:14:22 INFO storage.MemoryStore: MemoryStore started with capacity 
> 3.8 GB.
> 13/07/10 12:14:22 INFO storage.DiskStore: Created local directory at 
> /mnt/spark/spark-local-20130710121422-61da
> 13/07/10 12:14:22 INFO storage.DiskStore: Created local directory at 
> /mnt2/spark/spark-local-20130710121422-07d2
> 13/07/10 12:14:22 INFO network.ConnectionManager: Bound socket to port 49473 
> with id = ConnectionManagerId(ip-10-46-37-82.ec2.internal,49473)
> 13/07/10 12:14:22 INFO storage.BlockManagerMaster: Trying to register 
> BlockManager
> 13/07/10 12:14:22 INFO storage.BlockManagerMaster: Registered BlockManager
> 13/07/10 12:14:22 INFO server.Server: jetty-7.6.8.v20121106
> 13/07/10 12:14:22 INFO server.AbstractConnector: Started 
> SocketConnector@0.0.0.0:50518
> 13/07/10 12:14:22 INFO broadcast.HttpBroadcast: Broadcast server started at 
> http://10.46.37.82:50518
> 13/07/10 12:14:22 INFO spark.SparkEnv: Registering MapOutputTracker
> 13/07/10 12:14:22 INFO spark.HttpFileServer: HTTP File server directory is 
> /tmp/spark-6d5266c8-c9f8-4db6-958c-e4791bd8a81d
> 13/07/10 12:14:22 INFO server.Server: jetty-7.6.8.v20121106
> 13/07/10 12:14:22 INFO server.AbstractConnector: Started 
> SocketConnector@0.0.0.0:47677
> 13/07/10 12:14:22 INFO io.IoWorker: IoWorker thread 'spray-io-worker-0' 
> started
> 13/07/10 12:14:23 INFO server.HttpServer: 
> akka://spark/user/BlockManagerHTTPServer started on /0.0.0.0:35347
> 13/07/10 12:14:23 INFO storage.BlockManagerUI: Started BlockManager web UI at 
> http://ip-10-46-37-82.ec2.internal:35347
> {code}
> When I execute jstack on this process, I see that one of the threads is 
> trying to load the Mesos native library:
> {code}
> "main" prio=10 tid=0x7fcfc800c000 nid=0x761 runnable [0x7fcfcefd5000]
>java.lang.Thread.State: RUNNABLE
>   at java.lang.ClassLoader$NativeLibrary.load(Native Method)
>   at java.lang.ClassLoader.loadLibrary1(ClassLoader.java:1957)
>   - locked <0x00077fcb4fb8> (a java.util.Vector)
>   - locked <0x00077fd2b348> (a java.util.Vector)
>   at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1882)
>   at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1843)
>   at java.lang.Runtime.load0(Runtime.java:795)
>   - locked <0x00077fcb66d8> (a java.lang.Runtime)
>   at java.lang.System.load(System.java:1061)
>   at org.apache.mesos.MesosNativeLibrary.load(MesosNativeLibrary.java:38)
>   at 
> spark.executor.MesosExecutorBackend$.main(MesosExecutorBackend.scala:73)
>   at spark.executor.MesosExecutorBackend.main(MesosExecutorBackend.scala)
> {code}
> 2) Scheduled tasks do not finish.
> Even when the console starts (a rare case), I see this:
> {code}
> Spark context available as sc.
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> sc.parallelize(List(1,2,3))
> res0: spark.RDD[Int] = ParallelCollectionRDD[0] at parallelize at :13
> scala> res0.collect()
> 13/07/11 09:06:59 INFO spark.SparkContext: Starting job: collect at 
> :15
> 13/07/11 09:06:59 INFO scheduler.DAGScheduler: Got job 0 (collect

[jira] [Resolved] (SPARK-681) Optimize hashtables used in Spark

2014-11-06 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia resolved SPARK-681.
-
Resolution: Fixed

> Optimize hashtables used in Spark
> -
>
> Key: SPARK-681
> URL: https://issues.apache.org/jira/browse/SPARK-681
> Project: Spark
>  Issue Type: Improvement
>Reporter: Matei Zaharia
>
> The hash tables used in cogroup, join, etc. take up a lot more space than 
> they need to because they use linked data structures. It would be nice to 
> write a custom open-addressing hash table class to use instead, especially 
> since these tables are "append-only". A custom one would likely perform 
> better than fastutil as well.
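As an illustration only (not the class that eventually went into Spark), an 
append-only open-addressing table can keep keys and values in two flat arrays 
with linear probing, avoiding per-entry linked nodes entirely:

{code}
// Minimal append-only open-addressing hash map (no removal, no null keys).
class AppendOnlyHashMap[K, V](initialCapacity: Int = 64) {
  private var capacity = Integer.highestOneBit(initialCapacity * 2 - 1)
  private var keys = new Array[Any](capacity)
  private var values = new Array[Any](capacity)
  private var size = 0

  private def indexOf(key: K): Int = {
    var i = key.hashCode & (capacity - 1)
    while (keys(i) != null && keys(i) != key) i = (i + 1) & (capacity - 1)
    i
  }

  def update(key: K, value: V): Unit = {
    if (size * 2 > capacity) grow()
    val i = indexOf(key)
    if (keys(i) == null) { keys(i) = key; size += 1 }
    values(i) = value
  }

  def get(key: K): Option[V] = {
    val i = indexOf(key)
    if (keys(i) == null) None else Some(values(i).asInstanceOf[V])
  }

  private def grow(): Unit = {
    val (oldKeys, oldValues) = (keys, values)
    capacity *= 2
    keys = new Array[Any](capacity)
    values = new Array[Any](capacity)
    var i = 0
    while (i < oldKeys.length) {
      if (oldKeys(i) != null) {
        val j = indexOf(oldKeys(i).asInstanceOf[K])
        keys(j) = oldKeys(i)
        values(j) = oldValues(i)
      }
      i += 1
    }
  }
}
{code}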



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-677) PySpark should not collect results through local filesystem

2014-11-06 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14200514#comment-14200514
 ] 

Matei Zaharia commented on SPARK-677:
-

[~joshrosen] is this fixed now?

> PySpark should not collect results through local filesystem
> ---
>
> Key: SPARK-677
> URL: https://issues.apache.org/jira/browse/SPARK-677
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 0.7.0
>Reporter: Josh Rosen
>
> Py4J is slow when transferring large arrays, so PySpark currently dumps data 
> to the disk and reads it back in order to collect() RDDs.  On large enough 
> datasets, this data will spill from the buffer cache and write to the 
> physical disk, resulting in terrible performance.
> Instead, we should stream the data from Java to Python over a local socket or 
> a FIFO.
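A minimal sketch of the socket idea on the JVM side, assuming the Python side 
connects to the returned port and reads length-prefixed records (the names and 
framing here are hypothetical, not PySpark's actual protocol):

{code}
import java.io.DataOutputStream
import java.net.{InetAddress, ServerSocket}

object CollectServer {
  /** Serve length-prefixed byte records over a loopback socket; returns the port. */
  def serve(records: Iterator[Array[Byte]]): Int = {
    val server = new ServerSocket(0, 1, InetAddress.getByName("localhost"))
    new Thread("serve-collect") {
      setDaemon(true)
      override def run(): Unit = {
        val sock = server.accept()
        val out = new DataOutputStream(sock.getOutputStream)
        try {
          records.foreach { r => out.writeInt(r.length); out.write(r) }
          out.writeInt(-1)  // end-of-stream marker
        } finally {
          out.close(); sock.close(); server.close()
        }
      }
    }.start()
    server.getLocalPort
  }
}
{code}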



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-643) Standalone master crashes during actor restart

2014-11-06 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia resolved SPARK-643.
-
Resolution: Fixed

> Standalone master crashes during actor restart
> --
>
> Key: SPARK-643
> URL: https://issues.apache.org/jira/browse/SPARK-643
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.6.1
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> The standalone master will crash if it restarts due to an exception:
> {code}
> 12/12/15 03:10:47 ERROR master.Master: Job SkewBenchmark wth ID 
> job-20121215031047- failed 11 times.
> spark.SparkException: Job SkewBenchmark wth ID job-20121215031047- failed 
> 11 times.
> at 
> spark.deploy.master.Master$$anonfun$receive$1.apply(Master.scala:103)
> at 
> spark.deploy.master.Master$$anonfun$receive$1.apply(Master.scala:62)
> at akka.actor.Actor$class.apply(Actor.scala:318)
> at spark.deploy.master.Master.apply(Master.scala:17)
> at akka.actor.ActorCell.invoke(ActorCell.scala:626)
> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:197)
> at akka.dispatch.Mailbox.run(Mailbox.scala:179)
> at 
> akka.dispatch.ForkJoinExecutorConfigurator$MailboxExecutionTask.exec(AbstractDispatcher.scala:516)
> at akka.jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:259)
> at akka.jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:975)
> at akka.jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1479)
> at 
> akka.jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)
> 12/12/15 03:10:47 INFO master.Master: Starting Spark master at 
> spark://ip-10-226-87-193:7077
> 12/12/15 03:10:47 INFO io.IoWorker: IoWorker thread 'spray-io-worker-1' 
> started
> 12/12/15 03:10:47 ERROR master.Master: Failed to create web UI
> akka.actor.InvalidActorNameException:actor name HttpServer is not unique!
> [05aed000-4665-11e2-b361-12313d316833]
> at akka.actor.ActorCell.actorOf(ActorCell.scala:392)
> at 
> akka.actor.LocalActorRefProvider$Guardian$$anonfun$receive$1.liftedTree1$1(ActorRefProvider.scala:394)
> at 
> akka.actor.LocalActorRefProvider$Guardian$$anonfun$receive$1.apply(ActorRefProvider.scala:394)
> at 
> akka.actor.LocalActorRefProvider$Guardian$$anonfun$receive$1.apply(ActorRefProvider.scala:392)
> at akka.actor.Actor$class.apply(Actor.scala:318)
> at 
> akka.actor.LocalActorRefProvider$Guardian.apply(ActorRefProvider.scala:388)
> at akka.actor.ActorCell.invoke(ActorCell.scala:626)
> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:197)
> at akka.dispatch.Mailbox.run(Mailbox.scala:179)
> at 
> akka.dispatch.ForkJoinExecutorConfigurator$MailboxExecutionTask.exec(AbstractDispatcher.scala:516)
> at akka.jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:259)
> at akka.jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:975)
> at akka.jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1479)
> at 
> akka.jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)
> {code}
> When the Master actor restarts, Akka calls the {{postRestart}} hook.  [By 
> default|http://doc.akka.io/docs/akka/snapshot/general/supervision.html#supervision-restart],
>  this calls {{preStart}}.  The standalone master's {{preStart}} method tries 
> to start the webUI but crashes because it is already running.
> I ran into this after a job failed more than 11 times, which causes the 
> Master to throw a SparkException from its {{receive}} method.
> The solution is to implement a custom {{postRestart}} hook.
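In Akka terms the fix could look like the sketch below (the class is 
illustrative, not the actual Master code): override postRestart so a restart 
no longer falls back to preStart, which is what tries to re-bind the web UI.

{code}
import akka.actor.Actor

class MasterSketch extends Actor {
  override def preStart(): Unit = {
    // Bind the web UI, start listeners, etc. Runs only on the initial start.
  }

  // The default postRestart calls preStart(), which would try to bind the
  // web UI a second time after a crash. A custom override avoids that.
  override def postRestart(reason: Throwable): Unit = ()

  def receive = {
    case _ => // handle messages; avoid throwing from receive for expected job failures
  }
}
{code}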



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-644) Jobs canceled due to repeated executor failures may hang

2014-11-06 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia resolved SPARK-644.
-
Resolution: Fixed

> Jobs canceled due to repeated executor failures may hang
> 
>
> Key: SPARK-644
> URL: https://issues.apache.org/jira/browse/SPARK-644
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.6.1
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> In order to prevent an infinite loop, the standalone master aborts jobs that 
> experience more than 10 executor failures (see 
> https://github.com/mesos/spark/pull/210).  Currently, the master crashes when 
> aborting jobs (this is the issue that uncovered SPARK-643).  If we fix the 
> crash, which involves removing a {{throw}} from the actor's {{receive}} 
> method, then these failures can lead to a hang because they cause the job to 
> be removed from the master's scheduler, but the upstream scheduler components 
> aren't notified of the failure and will wait for the job to finish.
> I've considered fixing this by adding additional callbacks to propagate the 
> failure to the higher-level schedulers.  It might be cleaner to move the 
> decision to abort the job into the higher-level layers of the scheduler, 
> sending an {{AbortJob(jobId)}} method to the Master.  The Client is already 
> notified of executor state changes, so it may be able to make the decision to 
> abort (or defer that decision to a higher layer).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4186) Support binaryFiles and binaryRecords API in Python

2014-11-06 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia resolved SPARK-4186.
--
   Resolution: Fixed
Fix Version/s: 1.2.0

> Support binaryFiles and binaryRecords API in Python
> ---
>
> Key: SPARK-4186
> URL: https://issues.apache.org/jira/browse/SPARK-4186
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, Spark Core
>Reporter: Matei Zaharia
>Assignee: Davies Liu
> Fix For: 1.2.0
>
>
> After SPARK-2759, we should expose these methods in Python. Shouldn't be too 
> hard to add.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


