[jira] [Commented] (SPARK-21866) SPIP: Image support in Spark

2017-09-04 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16152915#comment-16152915
 ] 

Matei Zaharia commented on SPARK-21866:
---

Just to chime in on this, I've also seen feedback that the deep learning 
libraries for Spark are too fragmented: there are too many of them, and people 
don't know where to start. This standard representation would at least give 
them a clear way to interoperate. For example, it would let people write 
separate libraries for image processing, data augmentation, and training.

> SPIP: Image support in Spark
> 
>
> Key: SPARK-21866
> URL: https://issues.apache.org/jira/browse/SPARK-21866
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Timothy Hunter
>  Labels: SPIP
> Attachments: SPIP - Image support for Apache Spark V1.1.pdf
>
>
> h2. Background and motivation
> As Apache Spark is being used more and more in the industry, some new use 
> cases are emerging for different data formats beyond the traditional SQL 
> types or the numerical types (vectors and matrices). Deep Learning 
> applications commonly deal with image processing. A number of projects add 
> some Deep Learning capabilities to Spark (see list below), but they struggle 
> to  communicate with each other or with MLlib pipelines because there is no 
> standard way to represent an image in Spark DataFrames. We propose to 
> federate efforts for representing images in Spark by defining a 
> representation that caters to the most common needs of users and library 
> developers.
> This SPIP proposes a specification to represent images in Spark DataFrames 
> and Datasets (based on existing industrial standards), and an interface for 
> loading sources of images. It is not meant to be a full-fledged image 
> processing library, but rather the core description that other libraries and 
> users can rely on. Several packages already offer various processing 
> facilities for transforming images or doing more complex operations, and each 
> has various design tradeoffs that make them better as standalone solutions.
> This project is a joint collaboration between Microsoft and Databricks, which 
> have been testing this design in two open source packages: MMLSpark and Deep 
> Learning Pipelines.
> The proposed image format is an in-memory, decompressed representation that 
> targets low-level applications. It is significantly more liberal in memory 
> usage than compressed image representations such as JPEG, PNG, etc., but it 
> allows easy communication with popular image processing libraries and has no 
> decoding overhead.
> h2. Target users and personas:
> Data scientists, data engineers, library developers.
> The following libraries define primitives for loading and representing 
> images, and will gain from a common interchange format (in alphabetical 
> order):
> * BigDL
> * DeepLearning4J
> * Deep Learning Pipelines
> * MMLSpark
> * TensorFlow (Spark connector)
> * TensorFlowOnSpark
> * TensorFrames
> * Thunder
> h2. Goals:
> * Simple representation of images in Spark DataFrames, based on pre-existing 
> industrial standards (OpenCV)
> * This format should eventually allow the development of high-performance 
> integration points with image processing libraries such as libOpenCV, Google 
> TensorFlow, CNTK, and other C libraries.
> * The reader should be able to read popular formats of images from 
> distributed sources.
> h2. Non-Goals:
> Images are a versatile medium and encompass a very wide range of formats and 
> representations. This SPIP explicitly aims at the most common use case in the 
> industry currently: multi-channel matrices of binary, int32, int64, float or 
> double data that can fit comfortably in the heap of the JVM:
> * the total size of an image should be restricted to less than 2GB (roughly)
> * the meaning of color channels is application-specific and is not mandated 
> by the standard (in line with the OpenCV standard)
> * specialized formats used in meteorology, the medical field, etc. are not 
> supported
> * this format is specialized to images and does not attempt to solve the more 
> general problem of representing n-dimensional tensors in Spark
> h2. Proposed API changes
> We propose to add a new package in the package structure, under the MLlib 
> project:
> {{org.apache.spark.image}}
> h3. Data format
> We propose to add the following structure:
> imageSchema = StructType([
> * StructField("mode", StringType(), False),
> ** The exact representation of the data.
> ** The values are described in the following OpenCV convention. Basically, 
> the type has both "depth" and "number of channels" info: in particular, type 
> "CV_8UC3" means "3 channel unsigned bytes". BGRA format would be CV_8UC4 
> (value 32 in 
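
As a concrete illustration of the structure described above, here is a minimal PySpark sketch of such a schema. Only the "mode" field appears in the excerpt above; the other field names (origin, height, width, nChannels, data) are assumptions inferred from the OpenCV-style, in-memory representation the SPIP describes, not part of the SPIP text.

{noformat}
from pyspark.sql.types import (StructType, StructField, StringType,
                               IntegerType, BinaryType)

# Hypothetical sketch of the proposed image schema. Fields other than "mode"
# and "origin" are assumptions inferred from the OpenCV convention above.
imageSchema = StructType([
    StructField("mode", StringType(), False),        # OpenCV-style type, e.g. "CV_8UC3"
    StructField("origin", StringType(), True),       # where the image came from, e.g. a file path
    StructField("height", IntegerType(), False),     # number of rows
    StructField("width", IntegerType(), False),      # number of columns
    StructField("nChannels", IntegerType(), False),  # number of color channels
    StructField("data", BinaryType(), False),        # decompressed pixels, row-major order
])
{noformat}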

[jira] [Updated] (SPARK-18278) SPIP: Support native submission of spark jobs to a kubernetes cluster

2017-08-29 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-18278:
--
Labels: SPIP  (was: )

> SPIP: Support native submission of spark jobs to a kubernetes cluster
> -
>
> Key: SPARK-18278
> URL: https://issues.apache.org/jira/browse/SPARK-18278
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Deploy, Documentation, Scheduler, Spark Core
>Reporter: Erik Erlandson
>  Labels: SPIP
> Attachments: SPARK-18278 Spark on Kubernetes Design Proposal Revision 
> 2 (1).pdf
>
>
> A new Apache Spark sub-project that enables native support for submitting 
> Spark applications to a kubernetes cluster. The submitted application runs 
> in a driver executing on a kubernetes pod, and executor lifecycles are also 
> managed as pods.






[jira] [Updated] (SPARK-21866) SPIP: Image support in Spark

2017-08-29 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-21866:
--
Labels: SPIP  (was: )

> SPIP: Image support in Spark
> 
>
> Key: SPARK-21866
> URL: https://issues.apache.org/jira/browse/SPARK-21866
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Timothy Hunter
>  Labels: SPIP
> Attachments: SPIP - Image support for Apache Spark.pdf
>
>
> h2. Background and motivation
> As Apache Spark is being used more and more in the industry, some new use 
> cases are emerging for different data formats beyond the traditional SQL 
> types or the numerical types (vectors and matrices). Deep Learning 
> applications commonly deal with image processing. A number of projects add 
> some Deep Learning capabilities to Spark (see list below), but they struggle 
> to  communicate with each other or with MLlib pipelines because there is no 
> standard way to represent an image in Spark DataFrames. We propose to 
> federate efforts for representing images in Spark by defining a 
> representation that caters to the most common needs of users and library 
> developers.
> This SPIP proposes a specification to represent images in Spark DataFrames 
> and Datasets (based on existing industrial standards), and an interface for 
> loading sources of images. It is not meant to be a full-fledged image 
> processing library, but rather the core description that other libraries and 
> users can rely on. Several packages already offer various processing 
> facilities for transforming images or doing more complex operations, and each 
> has various design tradeoffs that make them better as standalone solutions.
> This project is a joint collaboration between Microsoft and Databricks, which 
> have been testing this design in two open source packages: MMLSpark and Deep 
> Learning Pipelines.
> The proposed image format is an in-memory, decompressed representation that 
> targets low-level applications. It is significantly more liberal in memory 
> usage than compressed image representations such as JPEG, PNG, etc., but it 
> allows easy communication with popular image processing libraries and has no 
> decoding overhead.
> h2. Target users and personas:
> Data scientists, data engineers, library developers.
> The following libraries define primitives for loading and representing 
> images, and will gain from a common interchange format (in alphabetical 
> order):
> * BigDL
> * DeepLearning4J
> * Deep Learning Pipelines
> * MMLSpark
> * TensorFlow (Spark connector)
> * TensorFlowOnSpark
> * TensorFrames
> * Thunder
> h2. Goals:
> * Simple representation of images in Spark DataFrames, based on pre-existing 
> industrial standards (OpenCV)
> * This format should eventually allow the development of high-performance 
> integration points with image processing libraries such as libOpenCV, Google 
> TensorFlow, CNTK, and other C libraries.
> * The reader should be able to read popular formats of images from 
> distributed sources.
> h2. Non-Goals:
> Images are a versatile medium and encompass a very wide range of formats and 
> representations. This SPIP explicitly aims at the most common use case in the 
> industry currently: multi-channel matrices of binary, int32, int64, float or 
> double data that can fit comfortably in the heap of the JVM:
> * the total size of an image should be restricted to less than 2GB (roughly)
> * the meaning of color channels is application-specific and is not mandated 
> by the standard (in line with the OpenCV standard)
> * specialized formats used in meteorology, the medical field, etc. are not 
> supported
> * this format is specialized to images and does not attempt to solve the more 
> general problem of representing n-dimensional tensors in Spark
> h2. Proposed API changes
> We propose to add a new package in the package structure, under the MLlib 
> project:
> {{org.apache.spark.image}}
> h3. Data format
> We propose to add the following structure:
> imageSchema = StructType([
> * StructField("mode", StringType(), False),
> ** The exact representation of the data.
> ** The values are described in the following OpenCV convention. Basically, 
> the type has both "depth" and "number of channels" info: in particular, type 
> "CV_8UC3" means "3 channel unsigned bytes". BGRA format would be CV_8UC4 
> (value 32 in the table) with the channel order specified by convention.
> ** The exact channel ordering and meaning of each channel is dictated by 
> convention. By default, the order is RGB (3 channels) and BGRA (4 channels).
> If the image failed to load, the value is the empty string "".
> * StructField("origin", StringType(), True),
> ** Some information about the origin of the image. The content of this is 
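
For a sense of how the proposed image reader looks in practice, the snippet below uses the pyspark.ml.image API that eventually shipped in Spark 2.3 out of this SPIP; it is an illustration rather than part of the proposal text, and the directory path is a placeholder.

{noformat}
from pyspark.sql import SparkSession
from pyspark.ml.image import ImageSchema  # shipped in Spark 2.3+

spark = SparkSession.builder.appName("image-reader-example").getOrCreate()

# Load a directory of images into a DataFrame with a single "image" struct
# column (origin, height, width, nChannels, mode, data). Path is a placeholder.
df = ImageSchema.readImages("data/images/", recursive=True)
df.printSchema()
df.select("image.origin", "image.height", "image.width").show(truncate=False)
{noformat}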

[jira] [Commented] (SPARK-17445) Reference an ASF page as the main place to find third-party packages

2016-09-12 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15484732#comment-15484732
 ] 

Matei Zaharia commented on SPARK-17445:
---

Sounds good to me.

> Reference an ASF page as the main place to find third-party packages
> 
>
> Key: SPARK-17445
> URL: https://issues.apache.org/jira/browse/SPARK-17445
> Project: Spark
>  Issue Type: Improvement
>Reporter: Matei Zaharia
>
> Some comments and docs like 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L148-L151
>  say to go to spark-packages.org, but since this is a package index 
> maintained by a third party, it would be better to reference an ASF page that 
> we can keep updated and own the URL for.






[jira] [Commented] (SPARK-17445) Reference an ASF page as the main place to find third-party packages

2016-09-10 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15480419#comment-15480419
 ] 

Matei Zaharia commented on SPARK-17445:
---

Sounds good, but IMO just keep the current supplemental projects there -- don't 
they fit better into "third-party packages" than "powered by"? I viewed powered 
by as a list of users, similar to https://wiki.apache.org/hadoop/PoweredBy, but 
I guess you're viewing it as a list of software that integrates with Spark.

> Reference an ASF page as the main place to find third-party packages
> 
>
> Key: SPARK-17445
> URL: https://issues.apache.org/jira/browse/SPARK-17445
> Project: Spark
>  Issue Type: Improvement
>Reporter: Matei Zaharia
>
> Some comments and docs like 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L148-L151
>  say to go to spark-packages.org, but since this is a package index 
> maintained by a third party, it would be better to reference an ASF page that 
> we can keep updated and own the URL for.






[jira] [Commented] (SPARK-17445) Reference an ASF page as the main place to find third-party packages

2016-09-09 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15479121#comment-15479121
 ] 

Matei Zaharia commented on SPARK-17445:
---

The powered by wiki page is a bit of a mess IMO, so I'd separate out the 
third-party packages from that one. Basically, the powered by page was useful 
when the project was really new and nobody knew who was using it, but right now 
it's a snapshot of the users from back then, because few new organizations 
(especially the large ones) list themselves there. Anyway, just linking to this 
wiki page is nice, though I'd try to rename the page to "Third-Party Packages" 
instead of "Supplemental Spark Projects" if it's possible to make the old name 
redirect.

> Reference an ASF page as the main place to find third-party packages
> 
>
> Key: SPARK-17445
> URL: https://issues.apache.org/jira/browse/SPARK-17445
> Project: Spark
>  Issue Type: Improvement
>Reporter: Matei Zaharia
>
> Some comments and docs like 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L148-L151
>  say to go to spark-packages.org, but since this is a package index 
> maintained by a third party, it would be better to reference an ASF page that 
> we can keep updated and own the URL for.






[jira] [Commented] (SPARK-17445) Reference an ASF page as the main place to find third-party packages

2016-09-08 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15474543#comment-15474543
 ] 

Matei Zaharia commented on SPARK-17445:
---

I think one part you're missing, Josh, is that spark-packages.org *is* an index 
of packages from a wide variety of organizations, where anyone can submit a 
package. Have you looked through it? Maybe there is some concern about which 
third-party index we highlight on the site, but AFAIK there are no other 
third-party package indexes. Nonetheless it would make sense to have a stable 
URL on the Spark homepage that lists them.

BTW, in the past, we also used a wiki page to track them: 
https://cwiki.apache.org/confluence/display/SPARK/Supplemental+Spark+Projects 
so we could just link to that. The spark-packages site provides some nicer 
functionality, though, such as letting anyone add a package with just a GitHub 
account, listing releases, etc.

> Reference an ASF page as the main place to find third-party packages
> 
>
> Key: SPARK-17445
> URL: https://issues.apache.org/jira/browse/SPARK-17445
> Project: Spark
>  Issue Type: Improvement
>Reporter: Matei Zaharia
>
> Some comments and docs like 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L148-L151
>  say to go to spark-packages.org, but since this is a package index 
> maintained by a third party, it would be better to reference an ASF page that 
> we can keep updated and own the URL for.






[jira] [Created] (SPARK-17445) Reference an ASF page as the main place to find third-party packages

2016-09-07 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-17445:
-

 Summary: Reference an ASF page as the main place to find 
third-party packages
 Key: SPARK-17445
 URL: https://issues.apache.org/jira/browse/SPARK-17445
 Project: Spark
  Issue Type: Improvement
Reporter: Matei Zaharia


Some comments and docs like 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L148-L151
 say to go to spark-packages.org, but since this is a package index maintained 
by a third party, it would be better to reference an ASF page that we can keep 
updated and own the URL for.






[jira] [Commented] (SPARK-16031) Add debug-only socket source in Structured Streaming

2016-06-17 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15337182#comment-15337182
 ] 

Matei Zaharia commented on SPARK-16031:
---

FYI I'll post a PR for this soon.

> Add debug-only socket source in Structured Streaming
> 
>
> Key: SPARK-16031
> URL: https://issues.apache.org/jira/browse/SPARK-16031
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL, Streaming
>Reporter: Matei Zaharia
>Assignee: Matei Zaharia
>
> This is a debug-only version of SPARK-15842: for tutorials and debugging of 
> streaming apps, it would be nice to have a text-based socket source similar 
> to the one in Spark Streaming. It will clearly be marked as debug-only so 
> that users don't try to run it in production applications, because this type 
> of source cannot provide HA without storing a lot of state in Spark.






[jira] [Created] (SPARK-16031) Add debug-only socket source in Structured Streaming

2016-06-17 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-16031:
-

 Summary: Add debug-only socket source in Structured Streaming
 Key: SPARK-16031
 URL: https://issues.apache.org/jira/browse/SPARK-16031
 Project: Spark
  Issue Type: New Feature
  Components: SQL, Streaming
Reporter: Matei Zaharia
Assignee: Matei Zaharia


This is a debug-only version of SPARK-15842: for tutorials and debugging of 
streaming apps, it would be nice to have a text-based socket source similar to 
the one in Spark Streaming. It will clearly be marked as debug-only so that 
users don't try to run it in production applications, because this type of 
source cannot provide HA without storing a lot of state in Spark.
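
For context, the sketch below shows the form this debug-only source took once it landed: a text socket read in Structured Streaming, echoed to the console sink. The host and port values are placeholders.

{noformat}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("socket-source-demo").getOrCreate()

# Debug-only text socket source: one string column named "value" per line received.
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")  # placeholder host
         .option("port", 9999)         # placeholder port
         .load())

# Echo incoming lines to the console; not fault tolerant, for tutorials and debugging only.
query = lines.writeStream.format("console").start()
query.awaitTermination()
{noformat}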






[jira] [Created] (SPARK-15879) Update logo in UI and docs to add "Apache"

2016-06-10 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-15879:
-

 Summary: Update logo in UI and docs to add "Apache"
 Key: SPARK-15879
 URL: https://issues.apache.org/jira/browse/SPARK-15879
 Project: Spark
  Issue Type: Task
  Components: Documentation, Web UI
Reporter: Matei Zaharia


We recently added "Apache" to the Spark logo on the website 
(http://spark.apache.org/images/spark-logo.eps) to have it be the full project 
name, and we should do the same in the web UI and docs.






[jira] [Assigned] (SPARK-14356) Update spark.sql.execution.debug to work on Datasets

2016-04-03 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia reassigned SPARK-14356:
-

Assignee: Matei Zaharia

> Update spark.sql.execution.debug to work on Datasets
> 
>
> Key: SPARK-14356
> URL: https://issues.apache.org/jira/browse/SPARK-14356
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Matei Zaharia
>Assignee: Matei Zaharia
>Priority: Minor
>
> Currently it only works on DataFrame, which seems unnecessarily restrictive 
> for 2.0.






[jira] [Created] (SPARK-14356) Update spark.sql.execution.debug to work on Datasets

2016-04-03 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-14356:
-

 Summary: Update spark.sql.execution.debug to work on Datasets
 Key: SPARK-14356
 URL: https://issues.apache.org/jira/browse/SPARK-14356
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Matei Zaharia
Priority: Minor


Currently it only works on DataFrame, which seems unnecessarily restrictive for 
2.0.






[jira] [Commented] (SPARK-10854) MesosExecutorBackend: Received launchTask but executor was null

2015-12-03 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15038058#comment-15038058
 ] 

Matei Zaharia commented on SPARK-10854:
---

Just a note, I saw a log where this happened, and the sequence of events is 
that the executor logs a launchTask callback before registered(). It could be a 
synchronization thing or a problem in the Mesos library.

> MesosExecutorBackend: Received launchTask but executor was null
> ---
>
> Key: SPARK-10854
> URL: https://issues.apache.org/jira/browse/SPARK-10854
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.4.0
> Environment: Spark 1.4.0
> Mesos 0.23.0
> Docker 1.8.1
>Reporter: Kevin Matzen
>Priority: Minor
>
> Sometimes my tasks get stuck in staging.  Here's stdout from one such worker. 
>  I'm running mesos-slave inside a docker container with the host's docker 
> exposed and I'm using Spark's docker support to launch the worker inside its 
> own container.  Both containers are running.  I'm using pyspark.  I can see 
> mesos-slave and java running, but I do not see python running.
> {noformat}
> WARNING: Your kernel does not support swap limit capabilities, memory limited 
> without swap.
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> 15/09/28 15:02:09 INFO MesosExecutorBackend: Registered signal handlers for 
> [TERM, HUP, INT]
> I0928 15:02:09.65854138 exec.cpp:132] Version: 0.23.0
> 15/09/28 15:02:09 ERROR MesosExecutorBackend: Received launchTask but 
> executor was null
> I0928 15:02:09.70295554 exec.cpp:206] Executor registered on slave 
> 20150928-044200-1140850698-5050-8-S190
> 15/09/28 15:02:09 INFO MesosExecutorBackend: Registered with Mesos as 
> executor ID 20150928-044200-1140850698-5050-8-S190 with 1 cpus
> 15/09/28 15:02:09 INFO SecurityManager: Changing view acls to: root
> 15/09/28 15:02:09 INFO SecurityManager: Changing modify acls to: root
> 15/09/28 15:02:09 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(root); users 
> with modify permissions: Set(root)
> 15/09/28 15:02:10 INFO Slf4jLogger: Slf4jLogger started
> 15/09/28 15:02:10 INFO Remoting: Starting remoting
> 15/09/28 15:02:10 INFO Remoting: Remoting started; listening on addresses 
> :[akka.tcp://sparkExecutor@:56458]
> 15/09/28 15:02:10 INFO Utils: Successfully started service 'sparkExecutor' on 
> port 56458.
> 15/09/28 15:02:10 INFO DiskBlockManager: Created local directory at 
> /tmp/spark-28a21c2d-54cc-40b3-b0c2-cc3624f1a73c/blockmgr-f2336fec-e1ea-44f1-bd5c-9257049d5e7b
> 15/09/28 15:02:10 INFO MemoryStore: MemoryStore started with capacity 52.1 MB
> 15/09/28 15:02:11 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 15/09/28 15:02:11 INFO Executor: Starting executor ID 
> 20150928-044200-1140850698-5050-8-S190 on host 
> 15/09/28 15:02:11 INFO Utils: Successfully started service 
> 'org.apache.spark.network.netty.NettyBlockTransferService' on port 57431.
> 15/09/28 15:02:11 INFO NettyBlockTransferService: Server created on 57431
> 15/09/28 15:02:11 INFO BlockManagerMaster: Trying to register BlockManager
> 15/09/28 15:02:11 INFO BlockManagerMaster: Registered BlockManager
> {noformat}






[jira] [Created] (SPARK-11733) Allow shuffle readers to request data from just one mapper

2015-11-13 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-11733:
-

 Summary: Allow shuffle readers to request data from just one mapper
 Key: SPARK-11733
 URL: https://issues.apache.org/jira/browse/SPARK-11733
 Project: Spark
  Issue Type: Sub-task
Reporter: Matei Zaharia


This is needed to do broadcast joins. Right now the shuffle reader interface 
takes a range of reduce IDs but fetches from all maps.






[jira] [Commented] (SPARK-9999) RDD-like API on top of Catalyst/DataFrame

2015-10-16 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14961567#comment-14961567
 ] 

Matei Zaharia commented on SPARK-:
--

Beyond tuples, you'll also want encoders for other generic classes, such as 
Seq[T]. They're the cleanest mechanism to get the most type info. Also, from a 
software engineering point of view, it's nice to avoid a central object where 
you register stuff, to allow composition between libraries (basically, see the 
problems that the Kryo registry creates today).

> RDD-like API on top of Catalyst/DataFrame
> -
>
> Key: SPARK-
> URL: https://issues.apache.org/jira/browse/SPARK-
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Michael Armbrust
>
> The RDD API is very flexible, and as a result harder to optimize its 
> execution in some cases. The DataFrame API, on the other hand, is much easier 
> to optimize, but lacks some of the nice perks of the RDD API (e.g. harder to 
> use UDFs, lack of strong types in Scala/Java).
> The goal of Spark Datasets is to provide an API that allows users to easily 
> express transformations on domain objects, while also providing the 
> performance and robustness advantages of the Spark SQL execution engine.
> h2. Requirements
>  - *Fast* - In most cases, the performance of Datasets should be equal to or 
> better than working with RDDs.  Encoders should be as fast or faster than 
> Kryo and Java serialization, and unnecessary conversion should be avoided.
>  - *Typesafe* - Similar to RDDs, objects and functions that operate on those 
> objects should provide compile-time safety where possible.  When converting 
> from data where the schema is not known at compile-time (for example data 
> read from an external source such as JSON), the conversion function should 
> fail-fast if there is a schema mismatch.
>  - *Support for a variety of object models* - Default encoders should be 
> provided for a variety of object models: primitive types, case classes, 
> tuples, POJOs, JavaBeans, etc.  Ideally, objects that follow standard 
> conventions, such as Avro SpecificRecords, should also work out of the box.
>  - *Java Compatible* - Datasets should provide a single API that works in 
> both Scala and Java.  Where possible, shared types like Array will be used in 
> the API.  Where not possible, overloaded functions should be provided for 
> both languages.  Scala concepts, such as ClassTags should not be required in 
> the user-facing API.
>  - *Interoperates with DataFrames* - Users should be able to seamlessly 
> transition between Datasets and DataFrames, without specifying conversion 
> boilerplate.  When names used in the input schema line up with fields in the 
> given class, no extra mapping should be necessary.  Libraries like MLlib 
> should not need to provide different interfaces for accepting DataFrames and 
> Datasets as input.
> For a detailed outline of the complete proposed API: 
> [marmbrus/dataset-api|https://github.com/marmbrus/spark/pull/18/files]
> For an initial discussion of the design considerations in this API: [design 
> doc|https://docs.google.com/document/d/1ZVaDqOcLm2-NcS0TElmslHLsEIEwqzt0vBvzpLrV6Ik/edit#]






[jira] [Commented] (SPARK-9850) Adaptive execution in Spark

2015-09-24 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14907518#comment-14907518
 ] 

Matei Zaharia commented on SPARK-9850:
--

Hey Imran, this could make sense, but note that the problem will only happen if 
you have 2000 map *output* partitions, which would've been 2000 reduce tasks 
normally. Otherwise, you can have as many map *tasks* as needed with fewer 
partitions. In most jobs, I'd expect data to get significantly smaller after 
the maps, so we'd catch that. In particular, for choosing between broadcast and 
shuffle joins this should be fine. We can do something different if we suspect 
that there is going to be tons of map output *and* we think there's nontrivial 
planning to be done once we see it.

> Adaptive execution in Spark
> ---
>
> Key: SPARK-9850
> URL: https://issues.apache.org/jira/browse/SPARK-9850
> Project: Spark
>  Issue Type: Epic
>  Components: Spark Core, SQL
>Reporter: Matei Zaharia
>Assignee: Yin Huai
> Attachments: AdaptiveExecutionInSpark.pdf
>
>
> Query planning is one of the main factors in high performance, but the 
> current Spark engine requires the execution DAG for a job to be set in 
> advance. Even with cost­-based optimization, it is hard to know the behavior 
> of data and user-defined functions well enough to always get great execution 
> plans. This JIRA proposes to add adaptive query execution, so that the engine 
> can change the plan for each query as it sees what data earlier stages 
> produced.
> We propose adding this to Spark SQL / DataFrames first, using a new API in 
> the Spark engine that lets libraries run DAGs adaptively. In future JIRAs, 
> the functionality could be extended to other libraries or the RDD API, but 
> that is more difficult than adding it in SQL.
> I've attached a design doc by Yin Huai and myself explaining how it would 
> work in more detail.






[jira] [Resolved] (SPARK-9852) Let reduce tasks fetch multiple map output partitions

2015-09-24 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia resolved SPARK-9852.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

> Let reduce tasks fetch multiple map output partitions
> -
>
> Key: SPARK-9852
> URL: https://issues.apache.org/jira/browse/SPARK-9852
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Reporter: Matei Zaharia
>Assignee: Matei Zaharia
> Fix For: 1.6.0
>
>







[jira] [Updated] (SPARK-9852) Let reduce tasks fetch multiple map output partitions

2015-09-20 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-9852:
-
Summary: Let reduce tasks fetch multiple map output partitions  (was: Let 
HashShuffleFetcher fetch multiple map output partitions)

> Let reduce tasks fetch multiple map output partitions
> -
>
> Key: SPARK-9852
> URL: https://issues.apache.org/jira/browse/SPARK-9852
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Reporter: Matei Zaharia
>Assignee: Matei Zaharia
>







[jira] [Resolved] (SPARK-9851) Support submitting map stages individually in DAGScheduler

2015-09-14 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia resolved SPARK-9851.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

> Support submitting map stages individually in DAGScheduler
> --
>
> Key: SPARK-9851
> URL: https://issues.apache.org/jira/browse/SPARK-9851
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Reporter: Matei Zaharia
>Assignee: Matei Zaharia
> Fix For: 1.6.0
>
>







[jira] [Assigned] (SPARK-9853) Optimize shuffle fetch of contiguous partition IDs

2015-08-20 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia reassigned SPARK-9853:


Assignee: Matei Zaharia

 Optimize shuffle fetch of contiguous partition IDs
 --

 Key: SPARK-9853
 URL: https://issues.apache.org/jira/browse/SPARK-9853
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core, SQL
Reporter: Matei Zaharia
Assignee: Matei Zaharia
Priority: Minor

 On the map side, we should be able to serve a block representing multiple 
 partition IDs in one block manager request






[jira] [Resolved] (SPARK-10008) Shuffle locality can take precedence over narrow dependencies for RDDs with both

2015-08-16 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia resolved SPARK-10008.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

 Shuffle locality can take precedence over narrow dependencies for RDDs with 
 both
 

 Key: SPARK-10008
 URL: https://issues.apache.org/jira/browse/SPARK-10008
 Project: Spark
  Issue Type: Bug
  Components: Scheduler
Reporter: Matei Zaharia
Assignee: Matei Zaharia
 Fix For: 1.5.0


 The shuffle locality patch made the DAGScheduler aware of shuffle data, but 
 for RDDs that have both narrow and shuffle dependencies, it can cause them to 
 place tasks based on the shuffle dependency instead of the narrow one. This 
 case is common in iterative join-based algorithms like PageRank and ALS, 
 where one RDD is hash-partitioned and one isn't.






[jira] [Assigned] (SPARK-10008) Shuffle locality can take precedence over narrow dependencies for RDDs with both

2015-08-14 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia reassigned SPARK-10008:
-

Assignee: Matei Zaharia

 Shuffle locality can take precedence over narrow dependencies for RDDs with 
 both
 

 Key: SPARK-10008
 URL: https://issues.apache.org/jira/browse/SPARK-10008
 Project: Spark
  Issue Type: Bug
  Components: Scheduler
Reporter: Matei Zaharia
Assignee: Matei Zaharia

 The shuffle locality patch made the DAGScheduler aware of shuffle data, but 
 for RDDs that have both narrow and shuffle dependencies, it can cause them to 
 place tasks based on the shuffle dependency instead of the narrow one. This 
 case is common in iterative join-based algorithms like PageRank and ALS, 
 where one RDD is hash-partitioned and one isn't.






[jira] [Created] (SPARK-10008) Shuffle locality can take precedence over narrow dependencies for RDDs with both

2015-08-14 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-10008:
-

 Summary: Shuffle locality can take precedence over narrow 
dependencies for RDDs with both
 Key: SPARK-10008
 URL: https://issues.apache.org/jira/browse/SPARK-10008
 Project: Spark
  Issue Type: Bug
  Components: Scheduler
Reporter: Matei Zaharia


The shuffle locality patch made the DAGScheduler aware of shuffle data, but for 
RDDs that have both narrow and shuffle dependencies, it can cause them to place 
tasks based on the shuffle dependency instead of the narrow one. This case is 
common in iterative join-based algorithms like PageRank and ALS, where one RDD 
is hash-partitioned and one isn't.






[jira] [Updated] (SPARK-9851) Support submitting map stages individually in DAGScheduler

2015-08-13 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-9851:
-
Summary: Support submitting map stages individually in DAGScheduler  (was: 
Add support for submitting map stages individually in DAGScheduler)

 Support submitting map stages individually in DAGScheduler
 --

 Key: SPARK-9851
 URL: https://issues.apache.org/jira/browse/SPARK-9851
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core, SQL
Reporter: Matei Zaharia
Assignee: Matei Zaharia








[jira] [Updated] (SPARK-9923) ShuffleMapStage.numAvailableOutputs should be an Int instead of Long

2015-08-12 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-9923:
-
Labels: Starter  (was: )

 ShuffleMapStage.numAvailableOutputs should be an Int instead of Long
 

 Key: SPARK-9923
 URL: https://issues.apache.org/jira/browse/SPARK-9923
 Project: Spark
  Issue Type: Improvement
Reporter: Matei Zaharia
Priority: Trivial
  Labels: Starter

 Not sure why it was made a Long, but every usage assumes it's an Int.






[jira] [Created] (SPARK-9923) ShuffleMapStage.numAvailableOutputs should be an Int instead of Long

2015-08-12 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-9923:


 Summary: ShuffleMapStage.numAvailableOutputs should be an Int 
instead of Long
 Key: SPARK-9923
 URL: https://issues.apache.org/jira/browse/SPARK-9923
 Project: Spark
  Issue Type: Improvement
Reporter: Matei Zaharia
Priority: Trivial


Not sure why it was made a Long, but every usage assumes it's an Int.






[jira] [Updated] (SPARK-9850) Adaptive execution in Spark

2015-08-12 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-9850:
-
Issue Type: Epic  (was: New Feature)

 Adaptive execution in Spark
 ---

 Key: SPARK-9850
 URL: https://issues.apache.org/jira/browse/SPARK-9850
 Project: Spark
  Issue Type: Epic
  Components: Spark Core, SQL
Reporter: Matei Zaharia
Assignee: Yin Huai
 Attachments: AdaptiveExecutionInSpark.pdf


 Query planning is one of the main factors in high performance, but the 
 current Spark engine requires the execution DAG for a job to be set in 
 advance. Even with cost­-based optimization, it is hard to know the behavior 
 of data and user-defined functions well enough to always get great execution 
 plans. This JIRA proposes to add adaptive query execution, so that the engine 
 can change the plan for each query as it sees what data earlier stages 
 produced.
 We propose adding this to Spark SQL / DataFrames first, using a new API in 
 the Spark engine that lets libraries run DAGs adaptively. In future JIRAs, 
 the functionality could be extended to other libraries or the RDD API, but 
 that is more difficult than adding it in SQL.
 I've attached a design doc by Yin Huai and myself explaining how it would 
 work in more detail.






[jira] [Updated] (SPARK-9850) Adaptive execution in Spark

2015-08-11 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-9850:
-
Assignee: Yin Huai

 Adaptive execution in Spark
 ---

 Key: SPARK-9850
 URL: https://issues.apache.org/jira/browse/SPARK-9850
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core, SQL
Reporter: Matei Zaharia
Assignee: Yin Huai
 Attachments: AdaptiveExecutionInSpark.pdf


 Query planning is one of the main factors in high performance, but the 
 current Spark engine requires the execution DAG for a job to be set in 
 advance. Even with cost­-based optimization, it is hard to know the behavior 
 of data and user-defined functions well enough to always get great execution 
 plans. This JIRA proposes to add adaptive query execution, so that the engine 
 can change the plan for each query as it sees what data earlier stages 
 produced.
 We propose adding this to Spark SQL / DataFrames first, using a new API in 
 the Spark engine that lets libraries run DAGs adaptively. In future JIRAs, 
 the functionality could be extended to other libraries or the RDD API, but 
 that is more difficult than adding it in SQL.
 I've attached a design doc by Yin Huai and myself explaining how it would 
 work in more detail.






[jira] [Assigned] (SPARK-9851) Add support for submitting map stages individually in DAGScheduler

2015-08-11 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia reassigned SPARK-9851:


Assignee: Matei Zaharia

 Add support for submitting map stages individually in DAGScheduler
 --

 Key: SPARK-9851
 URL: https://issues.apache.org/jira/browse/SPARK-9851
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core, SQL
Reporter: Matei Zaharia
Assignee: Matei Zaharia








[jira] [Created] (SPARK-9852) Let HashShuffleFetcher fetch multiple map output partitions

2015-08-11 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-9852:


 Summary: Let HashShuffleFetcher fetch multiple map output 
partitions
 Key: SPARK-9852
 URL: https://issues.apache.org/jira/browse/SPARK-9852
 Project: Spark
  Issue Type: Sub-task
Reporter: Matei Zaharia









[jira] [Assigned] (SPARK-9852) Let HashShuffleFetcher fetch multiple map output partitions

2015-08-11 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia reassigned SPARK-9852:


Assignee: Matei Zaharia

 Let HashShuffleFetcher fetch multiple map output partitions
 ---

 Key: SPARK-9852
 URL: https://issues.apache.org/jira/browse/SPARK-9852
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core, SQL
Reporter: Matei Zaharia
Assignee: Matei Zaharia








[jira] [Created] (SPARK-9851) Add support for submitting map stages individually in DAGScheduler

2015-08-11 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-9851:


 Summary: Add support for submitting map stages individually in 
DAGScheduler
 Key: SPARK-9851
 URL: https://issues.apache.org/jira/browse/SPARK-9851
 Project: Spark
  Issue Type: Sub-task
Reporter: Matei Zaharia









[jira] [Updated] (SPARK-9850) Adaptive execution in Spark

2015-08-11 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-9850:
-
Attachment: AdaptiveExecutionInSpark.pdf

 Adaptive execution in Spark
 ---

 Key: SPARK-9850
 URL: https://issues.apache.org/jira/browse/SPARK-9850
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core, SQL
Reporter: Matei Zaharia
 Attachments: AdaptiveExecutionInSpark.pdf


 Query planning is one of the main factors in high performance, but the 
 current Spark engine requires the execution DAG for a job to be set in 
 advance. Even with cost­-based optimization, it is hard to know the behavior 
 of data and user-defined functions well enough to always get great execution 
 plans. This JIRA proposes to add adaptive query execution, so that the engine 
 can change the plan for each query as it sees what data earlier stages 
 produced.
 We propose adding this to Spark SQL / DataFrames first, using a new API in 
 the Spark engine that lets libraries run DAGs adaptively. In future JIRAs, 
 the functionality could be extended to other libraries or the RDD API, but 
 that is more difficult than adding it in SQL.
 I've attached a design doc by Yin Huai and myself explaining how it would 
 work in more detail.






[jira] [Created] (SPARK-9850) Adaptive execution in Spark

2015-08-11 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-9850:


 Summary: Adaptive execution in Spark
 Key: SPARK-9850
 URL: https://issues.apache.org/jira/browse/SPARK-9850
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core, SQL
Reporter: Matei Zaharia


Query planning is one of the main factors in high performance, but the current 
Spark engine requires the execution DAG for a job to be set in advance. Even 
with cost­-based optimization, it is hard to know the behavior of data and 
user-defined functions well enough to always get great execution plans. This 
JIRA proposes to add adaptive query execution, so that the engine can change 
the plan for each query as it sees what data earlier stages produced.

We propose adding this to Spark SQL / DataFrames first, using a new API in the 
Spark engine that lets libraries run DAGs adaptively. In future JIRAs, the 
functionality could be extended to other libraries or the RDD API, but that is 
more difficult than adding it in SQL.

I've attached a design doc by Yin Huai and myself explaining how it would work 
in more detail.






[jira] [Created] (SPARK-9853) Optimize shuffle fetch of contiguous partition IDs

2015-08-11 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-9853:


 Summary: Optimize shuffle fetch of contiguous partition IDs
 Key: SPARK-9853
 URL: https://issues.apache.org/jira/browse/SPARK-9853
 Project: Spark
  Issue Type: Sub-task
Reporter: Matei Zaharia
Priority: Minor


On the map side, we should be able to serve a block representing multiple 
partition IDs in one block manager request






[jira] [Resolved] (SPARK-9244) Increase some default memory limits

2015-07-22 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia resolved SPARK-9244.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

 Increase some default memory limits
 ---

 Key: SPARK-9244
 URL: https://issues.apache.org/jira/browse/SPARK-9244
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Matei Zaharia
Assignee: Matei Zaharia
Priority: Minor
 Fix For: 1.5.0


 There are a few memory limits that people hit often and that we could make 
 higher, especially now that memory sizes have grown.
 - spark.akka.frameSize: This defaults to 10 but is often hit for map output 
 statuses in large shuffles. AFAIK the memory is not fully allocated up-front, 
 so we can just make this larger and still not affect jobs that never send a 
 status that large.
 - spark.executor.memory: Defaults to 512m, which is really small. We can at 
 least increase it to 1g, though this is something users do need to set on 
 their own.






[jira] [Created] (SPARK-9244) Increase some default memory limits

2015-07-21 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-9244:


 Summary: Increase some default memory limits
 Key: SPARK-9244
 URL: https://issues.apache.org/jira/browse/SPARK-9244
 Project: Spark
  Issue Type: Improvement
Reporter: Matei Zaharia
Assignee: Matei Zaharia


There are a few memory limits that people hit often and that we could make 
higher, especially now that memory sizes have grown.

- spark.akka.frameSize: This defaults to 10 but is often hit for map output 
statuses in large shuffles. AFAIK the memory is not fully allocated up-front, 
so we can just make this larger and still not affect jobs that never send a 
status that large.

- spark.executor.memory: Defaults to 512m, which is really small. We can at 
least increase it to 1g, though this is something users do need to set on their 
own.
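
Purely as an illustration of the two knobs discussed above (the values are placeholders, not recommendations), both limits can be overridden per application:

{noformat}
from pyspark import SparkConf, SparkContext

# Example of raising the two limits discussed above; the chosen values
# (100 MB frame size, 1g of executor memory) are placeholders.
conf = (SparkConf()
        .setAppName("bigger-defaults-example")
        .set("spark.akka.frameSize", "100")    # MB; the default discussed above is 10
        .set("spark.executor.memory", "1g"))   # the default discussed above is 512m

sc = SparkContext(conf=conf)
{noformat}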






[jira] [Updated] (SPARK-8110) DAG visualizations sometimes look weird in Python

2015-06-04 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-8110:
-
Attachment: Screen Shot 2015-06-04 at 1.51.32 PM.png
Screen Shot 2015-06-04 at 1.51.35 PM.png

 DAG visualizations sometimes look weird in Python
 -

 Key: SPARK-8110
 URL: https://issues.apache.org/jira/browse/SPARK-8110
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.4.0
Reporter: Matei Zaharia
Priority: Minor
 Attachments: Screen Shot 2015-06-04 at 1.51.32 PM.png, Screen Shot 
 2015-06-04 at 1.51.35 PM.png


 Got this by doing {{sc.textFile("README.md").count()}} -- there are some RDDs 
 outside of any stages.






[jira] [Created] (SPARK-8110) DAG visualizations sometimes look weird in Python

2015-06-04 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-8110:


 Summary: DAG visualizations sometimes look weird in Python
 Key: SPARK-8110
 URL: https://issues.apache.org/jira/browse/SPARK-8110
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.4.0
Reporter: Matei Zaharia
Priority: Minor


Got this by doing {{sc.textFile("README.md").count()}} -- there are some RDDs 
outside of any stages.






[jira] [Resolved] (SPARK-7298) Harmonize style of new UI visualizations

2015-05-08 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia resolved SPARK-7298.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

 Harmonize style of new UI visualizations
 

 Key: SPARK-7298
 URL: https://issues.apache.org/jira/browse/SPARK-7298
 Project: Spark
  Issue Type: Sub-task
  Components: Web UI
Reporter: Patrick Wendell
Assignee: Matei Zaharia
Priority: Blocker
 Fix For: 1.4.0


 We need to go through all new visualizations in the web UI and make sure they 
 have consistent style. Both consistent with each-other and consistent with 
 the rest of the UI.






[jira] [Commented] (SPARK-7261) Change default log level to WARN in the REPL

2015-04-29 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14520366#comment-14520366
 ] 

Matei Zaharia commented on SPARK-7261:
--

IMO we can do this even without SPARK-7260 in 1.4, but that one would be nice 
to have.

 Change default log level to WARN in the REPL
 

 Key: SPARK-7261
 URL: https://issues.apache.org/jira/browse/SPARK-7261
 Project: Spark
  Issue Type: Improvement
  Components: Spark Shell
Reporter: Patrick Wendell
Assignee: Patrick Wendell
Priority: Minor

 We should add a log4j properties file for the repl 
 (log4j-defaults-repl.properties) that sets the level to WARN. The main 
 reason for doing this is that we now display nice progress bars in the REPL, 
 so the need for task-level INFO messages is much smaller.
 A couple of other things:
 1. I'd block this on SPARK-7260.
 2. We should say in the repl banner that the log level is set to WARN and 
 explain to people how to change it programmatically (see the sketch below).
 3. If the user has their own log4j properties, it should take precedence over 
 this default of WARN.
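
 For item 2, a minimal sketch of the programmatic change we could point people 
 to (plain log4j 1.x calls, which is what Spark's logging goes through here):

{code}
import org.apache.log4j.{Level, Logger}

// Quiet the shell: keep warnings and errors, drop per-task INFO logging.
Logger.getRootLogger.setLevel(Level.WARN)

// Or target only Spark's own loggers and leave user logging untouched.
Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
{code}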



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6778) SQL contexts in spark-shell and pyspark should both be called sqlContext

2015-04-08 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-6778:


 Summary: SQL contexts in spark-shell and pyspark should both be 
called sqlContext
 Key: SPARK-6778
 URL: https://issues.apache.org/jira/browse/SPARK-6778
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Spark Shell
Reporter: Matei Zaharia


For some reason the Python one is only called sqlCtx. This is pretty confusing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms

2015-04-01 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391456#comment-14391456
 ] 

Matei Zaharia commented on SPARK-6646:
--

Not to rain on the parade here, but I worry that focusing on mobile phones is 
short-sighted. Does this design present a path forward for the Internet of 
Things as well? You'd want something that runs on Arduino, Raspberry Pi, etc. 
We already have MQTT input in Spark Streaming so we could consider using MQTT 
to replace Netty for shuffle as well. Has anybody benchmarked that?

 Spark 2.0: Rearchitecting Spark for Mobile Platforms
 

 Key: SPARK-6646
 URL: https://issues.apache.org/jira/browse/SPARK-6646
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Reporter: Reynold Xin
Assignee: Reynold Xin
Priority: Blocker
 Attachments: Spark on Mobile - Design Doc - v1.pdf


 Mobile computing is quickly rising to dominance, and by the end of 2017, it 
 is estimated that 90% of CPU cycles will be devoted to mobile hardware. 
 Spark’s project goal can be accomplished only when Spark runs efficiently for 
 the growing population of mobile users.
 Designed and optimized for modern data centers and Big Data applications, 
 Spark is unfortunately not a good fit for mobile computing today. In the past 
 few months, we have been prototyping the feasibility of a mobile-first Spark 
 architecture, and today we would like to share with you our findings. This 
 ticket outlines the technical design of Spark’s mobile support, and shares 
 results from several early prototypes.
 Mobile friendly version of the design doc: 
 https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1564) Add JavaScript into Javadoc to turn ::Experimental:: and such into badges

2015-03-12 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14359017#comment-14359017
 ] 

Matei Zaharia commented on SPARK-1564:
--

This is still a valid issue AFAIK, isn't it? These things still show up badly 
in Javadoc. So we could change the parent issue or something but I'd like to 
see it fixed.

 Add JavaScript into Javadoc to turn ::Experimental:: and such into badges
 -

 Key: SPARK-1564
 URL: https://issues.apache.org/jira/browse/SPARK-1564
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Reporter: Matei Zaharia
Assignee: Andrew Or
Priority: Minor





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5654) Integrate SparkR into Apache Spark

2015-02-06 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14309782#comment-14309782
 ] 

Matei Zaharia commented on SPARK-5654:
--

Yup, there's a tradeoff, but given that this is a language API and not an 
algorithm, input source or anything like that, I think it's important to 
support it along with the core engine. R is extremely popular for data science, 
more so than Python, and it fits well with many existing concepts in Spark.

 Integrate SparkR into Apache Spark
 --

 Key: SPARK-5654
 URL: https://issues.apache.org/jira/browse/SPARK-5654
 Project: Spark
  Issue Type: New Feature
Reporter: Shivaram Venkataraman

 The SparkR project [1] provides a light-weight frontend to launch Spark jobs 
 from R. The project was started at the AMPLab around a year ago and has been 
 incubated as its own project to make sure it can be easily merged into 
 upstream Spark, i.e. without introducing any external dependencies. SparkR's 
 goals are similar to PySpark's, and it shares a similar design, as described 
 in our meetup talk [2] and Spark Summit presentation [3].
 Integrating SparkR into the Apache project will enable R users to use Spark 
 out of the box, and given R's large user base, it will help the Spark project 
 reach more users. Additionally, work-in-progress features like R integration 
 with ML Pipelines and DataFrames can be better achieved by developing in a 
 unified code base.
 SparkR is available under the Apache 2.0 License and does not have any 
 external dependencies other than requiring users to have R and Java installed 
 on their machines. SparkR's developers come from many organizations, including 
 UC Berkeley, Alteryx, and Intel, and we will support future development and 
 maintenance after the integration.
 [1] https://github.com/amplab-extras/SparkR-pkg
 [2] http://files.meetup.com/3138542/SparkR-meetup.pdf
 [3] http://spark-summit.org/2014/talk/sparkr-interactive-r-programs-at-scale-2



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5608) Improve SEO of Spark documentation site to let Google find latest docs

2015-02-05 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia resolved SPARK-5608.
--
   Resolution: Fixed
Fix Version/s: 1.3.0

 Improve SEO of Spark documentation site to let Google find latest docs
 --

 Key: SPARK-5608
 URL: https://issues.apache.org/jira/browse/SPARK-5608
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Reporter: Matei Zaharia
Assignee: Matei Zaharia
 Fix For: 1.3.0


 Google currently has trouble finding spark.apache.org/docs/latest, so a lot 
 of the results returned for various queries are from random previous versions 
 of Spark where someone created a link. I'd like to do the following:
 - Add a sitemap.xml to spark.apache.org that lists all the docs/latest pages 
 (already done)
 - Add meta description tags on some of the most important doc pages
 - Shorten the titles of some pages to have more relevant keywords; for 
 example, there's no reason to have "Spark SQL Programming Guide - Spark 1.2.0 
 documentation"; we can just say "Spark SQL - Spark 1.2.0 documentation".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5088) Use spark-class for running executors directly on mesos

2015-01-13 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-5088:
-
Fix Version/s: (was: 1.2.1)

 Use spark-class for running executors directly on mesos
 ---

 Key: SPARK-5088
 URL: https://issues.apache.org/jira/browse/SPARK-5088
 Project: Spark
  Issue Type: Improvement
  Components: Deploy, Mesos
Affects Versions: 1.2.0
Reporter: Jongyoul Lee
Priority: Minor
 Fix For: 1.3.0


 - sbin/spark-executor is only used for running executors in a Mesos environment.
 - spark-executor calls spark-class internally, without any specific parameters.
 - PYTHONPATH handling is moved into spark-class.
 - Remove a redundant file to simplify code maintenance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5088) Use spark-class for running executors directly on mesos

2015-01-13 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-5088:
-
Target Version/s: 1.3.0  (was: 1.3.0, 1.2.1)

 Use spark-class for running executors directly on mesos
 ---

 Key: SPARK-5088
 URL: https://issues.apache.org/jira/browse/SPARK-5088
 Project: Spark
  Issue Type: Improvement
  Components: Deploy, Mesos
Affects Versions: 1.2.0
Reporter: Jongyoul Lee
Priority: Minor
 Fix For: 1.3.0


 - sbin/spark-executor is only used for running executors in a Mesos environment.
 - spark-executor calls spark-class internally, without any specific parameters.
 - PYTHONPATH handling is moved into spark-class.
 - Remove a redundant file to simplify code maintenance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3619) Upgrade to Mesos 0.21 to work around MESOS-1688

2015-01-09 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia resolved SPARK-3619.
--
   Resolution: Fixed
Fix Version/s: 1.3.0
 Assignee: Jongyoul Lee  (was: Timothy Chen)

 Upgrade to Mesos 0.21 to work around MESOS-1688
 ---

 Key: SPARK-3619
 URL: https://issues.apache.org/jira/browse/SPARK-3619
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Reporter: Matei Zaharia
Assignee: Jongyoul Lee
 Fix For: 1.3.0


 The Mesos 0.21 release has a fix for 
 https://issues.apache.org/jira/browse/MESOS-1688, which affects Spark jobs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4660) JavaSerializer uses wrong classloader

2014-12-29 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14260544#comment-14260544
 ] 

Matei Zaharia commented on SPARK-4660:
--

[~pkolaczk] mind sending a pull request against http://github.com/apache/spark 
for this? It will allow us to run it through the automated tests. It looks like 
a good fix but this stuff can be tricky.

 JavaSerializer uses wrong classloader
 -

 Key: SPARK-4660
 URL: https://issues.apache.org/jira/browse/SPARK-4660
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0, 1.1.1
Reporter: Piotr Kołaczkowski
Priority: Critical
 Attachments: spark-serializer-classloader.patch


 During testing we found failures when trying to load some classes of the user 
 application:
 {noformat}
 ERROR 2014-11-29 20:01:56 org.apache.spark.storage.BlockManagerWorker: 
 Exception handling buffer message
 java.lang.ClassNotFoundException: 
 org.apache.spark.demo.HttpReceiverCases$HttpRequest
   at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
   at java.security.AccessController.doPrivileged(Native Method)
   at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
   at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
   at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
   at java.lang.Class.forName0(Native Method)
   at java.lang.Class.forName(Class.java:270)
   at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:59)
   at 
 java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
   at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
   at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
   at 
 org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
   at 
 org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
   at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
   at 
 org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:235)
   at 
 org.apache.spark.storage.MemoryStore.putIterator(MemoryStore.scala:126)
   at 
 org.apache.spark.storage.MemoryStore.putIterator(MemoryStore.scala:104)
   at org.apache.spark.storage.MemoryStore.putBytes(MemoryStore.scala:76)
   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:748)
   at 
 org.apache.spark.storage.BlockManager.putBytes(BlockManager.scala:639)
   at 
 org.apache.spark.storage.BlockManagerWorker.putBlock(BlockManagerWorker.scala:92)
   at 
 org.apache.spark.storage.BlockManagerWorker.processBlockMessage(BlockManagerWorker.scala:73)
   at 
 org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:48)
   at 
 org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:48)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
   at 
 org.apache.spark.storage.BlockMessageArray.foreach(BlockMessageArray.scala:28)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at 
 org.apache.spark.storage.BlockMessageArray.map(BlockMessageArray.scala:28)
   at 
 org.apache.spark.storage.BlockManagerWorker.onBlockMessageReceive(BlockManagerWorker.scala:48)
   at 
 org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:38)
   at 
 org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:38)
   at 
 org.apache.spark.network.ConnectionManager.org$apache$spark$network$ConnectionManager$$handleMessage(ConnectionManager.scala:682)
   at 
 org.apache.spark.network.ConnectionManager$$anon$10.run(ConnectionManager.scala:520)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:744)
 {noformat}
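
 For context, a minimal sketch (not the attached patch or Spark's actual code) 
 of the kind of change this issue calls for: deserialization should resolve 
 classes against an explicitly supplied loader, e.g. the thread context 
 classloader that can see the user application's classes, rather than the 
 default one. All names below are illustrative.

{code}
import java.io.{ByteArrayInputStream, InputStream, ObjectInputStream, ObjectStreamClass}

// Sketch only: an ObjectInputStream that resolves classes against a caller-chosen
// loader instead of the default one. (Primitive descriptors are ignored for brevity.)
class LoaderAwareObjectInputStream(in: InputStream, loader: ClassLoader)
    extends ObjectInputStream(in) {
  override def resolveClass(desc: ObjectStreamClass): Class[_] =
    Class.forName(desc.getName, false, loader)
}

// Usage sketch: deserialize with the context classloader of the current thread.
def deserialize(bytes: Array[Byte]): AnyRef = {
  val loader = Thread.currentThread().getContextClassLoader
  val in = new LoaderAwareObjectInputStream(new ByteArrayInputStream(bytes), loader)
  try in.readObject() finally in.close()
}
{code}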



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (SPARK-3247) Improved support for external data sources

2014-12-11 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243253#comment-14243253
 ] 

Matei Zaharia commented on SPARK-3247:
--

For those looking to learn about the interface in more detail, there is a 
meetup video on it at https://www.youtube.com/watch?v=GQSNJAzxOr8.

 Improved support for external data sources
 --

 Key: SPARK-3247
 URL: https://issues.apache.org/jira/browse/SPARK-3247
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Michael Armbrust
Assignee: Michael Armbrust
Priority: Blocker
 Fix For: 1.2.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4690) AppendOnlyMap seems not using Quadratic probing as the JavaDoc

2014-12-03 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1429#comment-1429
 ] 

Matei Zaharia commented on SPARK-4690:
--

Yup, that's the definition of it.

 AppendOnlyMap seems not using Quadratic probing as the JavaDoc
 --

 Key: SPARK-4690
 URL: https://issues.apache.org/jira/browse/SPARK-4690
 Project: Spark
  Issue Type: Question
  Components: Spark Core
Affects Versions: 1.1.0, 1.2.0, 1.3.0
Reporter: Yijie Shen
Priority: Minor

 org.apache.spark.util.collection.AppendOnlyMap's documentation says:
 This implementation uses quadratic probing with a power-of-2 hash table size.
 However, the probe procedure in the face of a hash collision is just using 
 linear probing. The code below:
 val delta = i
 pos = (pos + delta) & mask
 i += 1
 Maybe a bug here?
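
 For reference, a small sketch (illustration only, not Spark's AppendOnlyMap) of 
 why growing delta by one on each probe is quadratic probing: the offsets from 
 the starting slot are the triangular numbers 1, 3, 6, 10, ..., i.e. i*(i+1)/2, 
 a quadratic function of the probe count, and together with the starting slot 
 they visit every slot of a power-of-2 table.

{code}
// Returns the first `steps` probe positions for a table with the given power-of-2 mask.
def probeSequence(start: Int, mask: Int, steps: Int): Seq[Int] = {
  var pos = start & mask
  var i = 1
  (1 to steps).map { _ =>
    val delta = i          // the step grows by one each time...
    pos = (pos + delta) & mask
    i += 1
    pos                    // ...so offsets from `start` are triangular numbers mod the table size
  }
}

// For a table of size 8 starting at slot 0: 1, 3, 6, 2, 7, 5, 4 -- all remaining slots.
// println(probeSequence(0, 7, 7))
{code}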



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-4690) AppendOnlyMap seems not using Quadratic probing as the JavaDoc

2014-12-03 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia closed SPARK-4690.

Resolution: Invalid

 AppendOnlyMap seems not using Quadratic probing as the JavaDoc
 --

 Key: SPARK-4690
 URL: https://issues.apache.org/jira/browse/SPARK-4690
 Project: Spark
  Issue Type: Question
  Components: Spark Core
Affects Versions: 1.1.0, 1.2.0, 1.3.0
Reporter: Yijie Shen
Priority: Minor

 org.apache.spark.util.collection.AppendOnlyMap's documentation says:
 This implementation uses quadratic probing with a power-of-2 hash table size.
 However, the probe procedure in the face of a hash collision is just using 
 linear probing. The code below:
 val delta = i
 pos = (pos + delta) & mask
 i += 1
 Maybe a bug here?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4683) Add a beeline.cmd to run on Windows

2014-12-01 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-4683:


 Summary: Add a beeline.cmd to run on Windows
 Key: SPARK-4683
 URL: https://issues.apache.org/jira/browse/SPARK-4683
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Matei Zaharia






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4684) Add a script to run JDBC server on Windows

2014-12-01 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-4684:


 Summary: Add a script to run JDBC server on Windows
 Key: SPARK-4684
 URL: https://issues.apache.org/jira/browse/SPARK-4684
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Matei Zaharia
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4685) Update JavaDoc settings to include spark.ml and all spark.mllib subpackages in the right sections

2014-12-01 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-4685:
-
Priority: Trivial  (was: Major)

 Update JavaDoc settings to include spark.ml and all spark.mllib subpackages 
 in the right sections
 -

 Key: SPARK-4685
 URL: https://issues.apache.org/jira/browse/SPARK-4685
 Project: Spark
  Issue Type: New Feature
  Components: Documentation
Reporter: Matei Zaharia
Priority: Trivial

 Right now they're listed under other packages on the homepage of the 
 JavaDoc docs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4685) Update JavaDoc settings to include spark.ml and all spark.mllib subpackages in the right sections

2014-12-01 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-4685:


 Summary: Update JavaDoc settings to include spark.ml and all 
spark.mllib subpackages in the right sections
 Key: SPARK-4685
 URL: https://issues.apache.org/jira/browse/SPARK-4685
 Project: Spark
  Issue Type: New Feature
  Components: Documentation
Reporter: Matei Zaharia


Right now they're listed under other packages on the homepage of the JavaDoc 
docs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4685) Update JavaDoc settings to include spark.ml and all spark.mllib subpackages in the right sections

2014-12-01 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-4685:
-
Target Version/s: 1.2.1  (was: 1.2.0)

 Update JavaDoc settings to include spark.ml and all spark.mllib subpackages 
 in the right sections
 -

 Key: SPARK-4685
 URL: https://issues.apache.org/jira/browse/SPARK-4685
 Project: Spark
  Issue Type: New Feature
  Components: Documentation
Reporter: Matei Zaharia
Priority: Trivial

 Right now they're listed under other packages on the homepage of the 
 JavaDoc docs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4613) Make JdbcRDD easier to use from Java

2014-11-27 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia resolved SPARK-4613.
--
   Resolution: Fixed
Fix Version/s: 1.2.0

 Make JdbcRDD easier to use from Java
 

 Key: SPARK-4613
 URL: https://issues.apache.org/jira/browse/SPARK-4613
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Matei Zaharia
Assignee: Cheng Lian
 Fix For: 1.2.0


 We might eventually deprecate it, but for now it would be nice to expose a 
 Java wrapper that allows users to create this using the java function 
 interface.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4613) Make JdbcRDD easier to use from Java

2014-11-27 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-4613:
-
Issue Type: Improvement  (was: Bug)

 Make JdbcRDD easier to use from Java
 

 Key: SPARK-4613
 URL: https://issues.apache.org/jira/browse/SPARK-4613
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Matei Zaharia
Assignee: Cheng Lian
 Fix For: 1.2.0


 We might eventually deprecate it, but for now it would be nice to expose a 
 Java wrapper that allows users to create this using the java function 
 interface.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3628) Don't apply accumulator updates multiple times for tasks in result stages

2014-11-26 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia resolved SPARK-3628.
--
  Resolution: Fixed
   Fix Version/s: 1.2.0
Target Version/s: 1.1.2  (was: 0.9.3, 1.0.3, 1.1.2, 1.2.1)

 Don't apply accumulator updates multiple times for tasks in result stages
 -

 Key: SPARK-3628
 URL: https://issues.apache.org/jira/browse/SPARK-3628
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Matei Zaharia
Assignee: Nan Zhu
Priority: Blocker
 Fix For: 1.2.0


 In previous versions of Spark, accumulator updates only got applied once for 
 accumulators that are only used in actions (i.e. result stages), letting you 
 use them to deterministically compute a result. Unfortunately, this got 
 broken in some recent refactorings.
 This is related to https://issues.apache.org/jira/browse/SPARK-732, but that 
 issue is about applying the same semantics to intermediate stages too, which 
 is more work and may not be what we want for debugging.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3628) Don't apply accumulator updates multiple times for tasks in result stages

2014-11-26 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227077#comment-14227077
 ] 

Matei Zaharia commented on SPARK-3628:
--

FYI I merged this into 1.2.0, since the patch is now quite a bit smaller. We 
should decide whether we want to back port it to branch-1.1, so I'll leave it 
open for that reason. I don't think there's much point backporting it further 
because the issue is somewhat rare, but we can do it if people ask for it.

 Don't apply accumulator updates multiple times for tasks in result stages
 -

 Key: SPARK-3628
 URL: https://issues.apache.org/jira/browse/SPARK-3628
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Matei Zaharia
Assignee: Nan Zhu
Priority: Blocker
 Fix For: 1.2.0


 In previous versions of Spark, accumulator updates only got applied once for 
 accumulators that are only used in actions (i.e. result stages), letting you 
 use them to deterministically compute a result. Unfortunately, this got 
 broken in some recent refactorings.
 This is related to https://issues.apache.org/jira/browse/SPARK-732, but that 
 issue is about applying the same semantics to intermediate stages too, which 
 is more work and may not be what we want for debugging.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-732) Recomputation of RDDs may result in duplicated accumulator updates

2014-11-26 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227108#comment-14227108
 ] 

Matei Zaharia commented on SPARK-732:
-

As discussed on https://github.com/apache/spark/pull/2524 this is pretty hard 
to provide good semantics for in the general case (accumulator updates inside 
non-result stages), for the following reasons:

- An RDD may be computed as part of multiple stages. For example, if you update 
an accumulator inside a MappedRDD and then shuffle it, that might be one stage. 
But if you then call map() again on the MappedRDD, and shuffle the result of 
that, you get a second stage where that map is pipelined (see the sketch below). 
Do you want to count this accumulator update twice or not?
- Entire stages may be resubmitted if shuffle files are deleted by the periodic 
cleaner or are lost due to a node failure, so anything that tracks RDDs would 
need to do so for long periods of time (as long as the RDD is referenceable in 
the user program), which would be pretty complicated to implement.

So I'm going to mark this as won't fix for now, except for the part for 
result stages done in SPARK-3628.
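
A small spark-shell sketch of the first scenario above (the data and functions 
are illustrative): the same map() is pipelined into two different shuffle 
stages, so its accumulator update runs twice unless the mapped RDD is cached.

{code}
val data = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val acc = sc.accumulator(0)

// A MappedRDD that carries an accumulator update.
val mapped = data.map { kv => acc += 1; kv }

// Stage 1: the map is pipelined into this shuffle.
mapped.reduceByKey(_ + _).count()

// Stage 2: mapping again and shuffling re-runs the same pipelined map,
// so acc is incremented a second time for every record.
mapped.map { case (k, _) => (k, 1) }.reduceByKey(_ + _).count()

// acc.value is now 2 * data.count(), unless `mapped` was cached.
{code}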

 Recomputation of RDDs may result in duplicated accumulator updates
 --

 Key: SPARK-732
 URL: https://issues.apache.org/jira/browse/SPARK-732
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.7.0, 0.6.2, 0.7.1, 0.8.0, 0.7.2, 0.7.3, 0.8.1, 0.8.2, 
 0.9.0, 1.0.1, 1.1.0
Reporter: Josh Rosen
Assignee: Nan Zhu
Priority: Blocker

 Currently, Spark doesn't guard against duplicated updates to the same 
 accumulator due to recomputations of an RDD.  For example:
 {code}
 val acc = sc.accumulator(0)
 data.map(x => { acc += 1; f(x) })
 data.count()
 // acc should equal data.count() here
 data.foreach{...}
 // Now, acc = 2 * data.count() because the map() was recomputed.
 {code}
 I think that this behavior is incorrect, especially because this behavior 
 allows the addition or removal of a cache() call to affect the outcome of a 
 computation.
 There's an old TODO to fix this duplicate update issue in the [DAGScheduler 
 code|https://github.com/mesos/spark/blob/ec5e553b418be43aa3f0ccc24e0d5ca9d63504b2/core/src/main/scala/spark/scheduler/DAGScheduler.scala#L494].
 I haven't tested whether recomputation due to blocks being dropped from the 
 cache can trigger duplicate accumulator updates.
 Hypothetically someone could be relying on the current behavior to implement 
 performance counters that track the actual number of computations performed 
 (including recomputations).  To be safe, we could add an explicit warning in 
 the release notes that documents the change in behavior when we fix this.
 Ignoring duplicate updates shouldn't be too hard, but there are a few 
 subtleties.  Currently, we allow accumulators to be used in multiple 
 transformations, so we'd need to detect duplicate updates at the 
 per-transformation level.  I haven't dug too deeply into the scheduler 
 internals, but we might also run into problems where pipelining causes what 
 is logically one set of accumulator updates to show up in two different tasks 
 (e.g. rdd.map(accum += x; ...) and rdd.map(accum += x; ...).count() may cause 
 what's logically the same accumulator update to be applied from two different 
 contexts, complicating the detection of duplicate updates).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-3628) Don't apply accumulator updates multiple times for tasks in result stages

2014-11-26 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia reopened SPARK-3628:
--

 Don't apply accumulator updates multiple times for tasks in result stages
 -

 Key: SPARK-3628
 URL: https://issues.apache.org/jira/browse/SPARK-3628
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Matei Zaharia
Assignee: Nan Zhu
Priority: Blocker
 Fix For: 1.2.0


 In previous versions of Spark, accumulator updates only got applied once for 
 accumulators that are only used in actions (i.e. result stages), letting you 
 use them to deterministically compute a result. Unfortunately, this got 
 broken in some recent refactorings.
 This is related to https://issues.apache.org/jira/browse/SPARK-732, but that 
 issue is about applying the same semantics to intermediate stages too, which 
 is more work and may not be what we want for debugging.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4613) Make JdbcRDD easier to use from Java

2014-11-25 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-4613:


 Summary: Make JdbcRDD easier to use from Java
 Key: SPARK-4613
 URL: https://issues.apache.org/jira/browse/SPARK-4613
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Matei Zaharia


We might eventually deprecate it, but for now it would be better to make it 
more Java-friendly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4613) Make JdbcRDD easier to use from Java

2014-11-25 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14225615#comment-14225615
 ] 

Matei Zaharia commented on SPARK-4613:
--

BTW the strawman for this would be a version of the API that doesn't take Scala 
function objects for getConnection and mapRow, possibly through a static method 
on object JdbcRDD.
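
A hypothetical sketch of what such a factory could look like (the names, 
signatures, and ConnectionFactory trait below are illustrative, not an API that 
Spark ships): a static-style create method that takes single-method interfaces 
instead of Scala closures.

{code}
import java.sql.{Connection, ResultSet}
import scala.reflect.ClassTag

import org.apache.spark.SparkContext
import org.apache.spark.api.java.function.{Function => JFunction}
import org.apache.spark.rdd.JdbcRDD

object JavaJdbcRDD {
  // Single abstract method, so it is easy to implement from Java.
  trait ConnectionFactory extends Serializable {
    def getConnection(): Connection
  }

  // Wraps the Java-friendly arguments into the Scala closures JdbcRDD expects.
  def create[T: ClassTag](
      sc: SparkContext,
      connectionFactory: ConnectionFactory,
      sql: String,
      lowerBound: Long,
      upperBound: Long,
      numPartitions: Int,
      mapRow: JFunction[ResultSet, T]): JdbcRDD[T] = {
    new JdbcRDD[T](
      sc,
      () => connectionFactory.getConnection(),
      sql,
      lowerBound,
      upperBound,
      numPartitions,
      (rs: ResultSet) => mapRow.call(rs))
  }
}
{code}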

 Make JdbcRDD easier to use from Java
 

 Key: SPARK-4613
 URL: https://issues.apache.org/jira/browse/SPARK-4613
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Matei Zaharia

 We might eventually deprecate it, but for now it would be better to make it 
 more Java-friendly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3633) Fetches failure observed after SPARK-2711

2014-11-23 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14222545#comment-14222545
 ] 

Matei Zaharia commented on SPARK-3633:
--

[~stephen] you can try the 1.1.1 RC in 
http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-1-1-RC2-td9439.html,
 which includes a Maven staging repo that you can just add as a repo in a build.
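
From an sbt build, adding a staging repository looks roughly like this; the 
repository URL below is a placeholder, the real one is in the vote thread 
linked above.

{code}
// build.sbt sketch; the staging URL here is hypothetical.
resolvers += "Apache Spark staging" at
  "https://repository.apache.org/content/repositories/orgapachespark-XXXX/"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.1.1"
{code}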

 Fetches failure observed after SPARK-2711
 -

 Key: SPARK-3633
 URL: https://issues.apache.org/jira/browse/SPARK-3633
 Project: Spark
  Issue Type: Bug
  Components: Block Manager
Affects Versions: 1.1.0
Reporter: Nishkam Ravi
Priority: Blocker

 Running a variant of PageRank on a 6-node cluster with a 30Gb input dataset. 
 Recently upgraded to Spark 1.1. The workload fails with the following error 
 message(s):
 {code}
 14/09/19 12:10:38 WARN TaskSetManager: Lost task 51.0 in stage 2.1 (TID 552, 
 c1705.halxg.cloudera.com): FetchFailed(BlockManagerId(1, 
 c1706.halxg.cloudera.com, 49612, 0), shuffleId=3, mapId=75, reduceId=120)
 14/09/19 12:10:38 INFO DAGScheduler: Resubmitting failed stages
 {code}
 In order to identify the problem, I carried out change set analysis. As I go 
 back in time, the error message changes to:
 {code}
 14/09/21 12:56:54 WARN TaskSetManager: Lost task 35.0 in stage 3.0 (TID 519, 
 c1706.halxg.cloudera.com): java.io.FileNotFoundException: 
 /var/lib/jenkins/workspace/tmp/spark-local-20140921123257-68ee/1c/temp_3a1ade13-b48a-437a-a466-673995304034
  (Too many open files)
 java.io.FileOutputStream.open(Native Method)
 java.io.FileOutputStream.<init>(FileOutputStream.java:221)
 
 org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:117)
 
 org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:185)
 
 org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:197)
 
 org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:145)
 org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
 
 org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:51)
 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 org.apache.spark.scheduler.Task.run(Task.scala:54)
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)
 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 java.lang.Thread.run(Thread.java:745)
 {code}
 All the way until Aug 4th. Turns out the problem changeset is 4fde28c. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4452) Shuffle data structures can starve others on the same thread for memory

2014-11-18 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14216691#comment-14216691
 ] 

Matei Zaharia commented on SPARK-4452:
--

BTW I've thought about this more and here's what I'd suggest: try a version 
where each object is allowed to ramp up to a certain size (say 5 MB) before 
being subject to the limit, and if that doesn't work, then maybe go for the 
forced-spilling one. The reason is that as soon as N objects are active, the 
ShuffleMemoryManager will not let any object ramp up to more than 1/N, so it 
just has to fill up its current quota and stop. This means that scenarios with 
very little free memory might only happen at the beginning (when tasks start 
up). If we can make this work, then we avoid a lot of concurrency problems that 
would happen with forced spilling. 

Another improvement would be to make the Spillables request less than 2x their 
current memory when they ramp up, e.g. 1.5x. They'd then make more requests but 
it would lead to slower ramp-up and more of a chance for other threads to grab 
memory. But I think this will have less impact than simply increasing that free 
minimum amount.
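
A minimal sketch of the first idea (not Spark's ShuffleMemoryManager; all names 
and numbers are illustrative): let every consumer ramp up to a small floor 
unconditionally, and only apply the 1/N fairness cap beyond that.

{code}
object FloorThenFairShare {
  private val floorBytes = 5L * 1024 * 1024          // everyone may ramp up to this much
  private val totalBytes = 512L * 1024 * 1024        // pool size, illustrative
  private val usage = scala.collection.mutable.Map.empty[Long, Long]  // consumer id -> bytes held

  /** Try to acquire numBytes for a consumer; returns how many bytes were granted. */
  def tryToAcquire(consumerId: Long, numBytes: Long): Long = synchronized {
    val current = usage.getOrElseUpdate(consumerId, 0L)
    val fairShare = totalBytes / usage.size
    val cap = math.max(floorBytes, fairShare)        // floor first, fairness after
    val free = totalBytes - usage.values.sum
    val granted = math.max(0L, math.min(numBytes, math.min(cap - current, free)))
    usage(consumerId) = current + granted
    granted
  }

  /** Release everything a consumer holds, e.g. when its task finishes. */
  def release(consumerId: Long): Unit = synchronized { usage -= consumerId }
}
{code}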

 Shuffle data structures can starve others on the same thread for memory 
 

 Key: SPARK-4452
 URL: https://issues.apache.org/jira/browse/SPARK-4452
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.1.0
Reporter: Tianshuo Deng
Assignee: Tianshuo Deng
Priority: Blocker

 When an Aggregator is used with ExternalSorter in a task, Spark will create 
 many small files and can hit a "too many open files" error during merging.
 Currently, ShuffleMemoryManager does not work well when there are two spillable 
 objects in a thread, which in this case are ExternalSorter and 
 ExternalAppendOnlyMap (used by Aggregator). Here is an example: due to the use 
 of map-side aggregation, ExternalAppendOnlyMap is created first to read the 
 RDD. It may ask for as much memory as it can get, which is 
 totalMem/numberOfThreads. Then later on, when ExternalSorter is created in the 
 same thread, the ShuffleMemoryManager could refuse to allocate more memory to 
 it, since the memory is already given to the previously requesting object 
 (ExternalAppendOnlyMap). That causes the ExternalSorter to keep spilling small 
 files (due to the lack of memory).
 I'm currently working on a PR to address these issues. It will include the 
 following changes:
 1. The ShuffleMemoryManager should not only track the memory usage for each 
 thread, but also the object that holds the memory.
 2. The ShuffleMemoryManager should be able to trigger the spilling of a 
 spillable object. In this way, if a new object in a thread is requesting 
 memory, the old occupant could be evicted/spilled. Previously the spillable 
 objects triggered spilling by themselves, so one might not trigger spilling 
 even if another object in the same thread needed more memory. After this 
 change, the ShuffleMemoryManager can trigger the spilling of an object if it 
 needs to.
 3. Make the iterator of ExternalAppendOnlyMap spillable. Previously 
 ExternalAppendOnlyMap returned a destructive iterator that could not be spilled 
 after being returned. This should be changed so that even after the iterator is 
 returned, the ShuffleMemoryManager can still spill it.
 Currently, I have a working branch in progress: 
 https://github.com/tsdeng/spark/tree/enhance_memory_manager. I have already 
 made change 3 and have a prototype of changes 1 and 2 to evict spillables from 
 the memory manager, still in progress. I will send a PR when it's done.
 Any feedback or thoughts on this change are highly appreciated!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4452) Shuffle data structures can starve others on the same thread for memory

2014-11-18 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14217331#comment-14217331
 ] 

Matei Zaharia commented on SPARK-4452:
--

Forced spilling is orthogonal to how you set the limits actually. For example, 
if there are N objects, one way to set limits is to reserve at least 1/N of 
memory for each one. But another way would be to group them by thread, and use 
a different algorithm for allocation within a thread (e.g. set each object's 
cap to more if other objects in their thread are using less). Whether you force 
spilling or not, you'll have to decide what the right limit for each thing is.

 Shuffle data structures can starve others on the same thread for memory 
 

 Key: SPARK-4452
 URL: https://issues.apache.org/jira/browse/SPARK-4452
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Tianshuo Deng
Assignee: Tianshuo Deng
Priority: Critical

 When an Aggregator is used with ExternalSorter in a task, Spark will create 
 many small files and can hit a "too many open files" error during merging.
 Currently, ShuffleMemoryManager does not work well when there are two spillable 
 objects in a thread, which in this case are ExternalSorter and 
 ExternalAppendOnlyMap (used by Aggregator). Here is an example: due to the use 
 of map-side aggregation, ExternalAppendOnlyMap is created first to read the 
 RDD. It may ask for as much memory as it can get, which is 
 totalMem/numberOfThreads. Then later on, when ExternalSorter is created in the 
 same thread, the ShuffleMemoryManager could refuse to allocate more memory to 
 it, since the memory is already given to the previously requesting object 
 (ExternalAppendOnlyMap). That causes the ExternalSorter to keep spilling small 
 files (due to the lack of memory).
 I'm currently working on a PR to address these issues. It will include the 
 following changes:
 1. The ShuffleMemoryManager should not only track the memory usage for each 
 thread, but also the object that holds the memory.
 2. The ShuffleMemoryManager should be able to trigger the spilling of a 
 spillable object. In this way, if a new object in a thread is requesting 
 memory, the old occupant could be evicted/spilled. Previously the spillable 
 objects triggered spilling by themselves, so one might not trigger spilling 
 even if another object in the same thread needed more memory. After this 
 change, the ShuffleMemoryManager can trigger the spilling of an object if it 
 needs to.
 3. Make the iterator of ExternalAppendOnlyMap spillable. Previously 
 ExternalAppendOnlyMap returned a destructive iterator that could not be spilled 
 after being returned. This should be changed so that even after the iterator is 
 returned, the ShuffleMemoryManager can still spill it.
 Currently, I have a working branch in progress: 
 https://github.com/tsdeng/spark/tree/enhance_memory_manager. I have already 
 made change 3 and have a prototype of changes 1 and 2 to evict spillables from 
 the memory manager, still in progress. I will send a PR when it's done.
 Any feedback or thoughts on this change are highly appreciated!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4452) Shuffle data structures can starve others on the same thread for memory

2014-11-17 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14215425#comment-14215425
 ] 

Matei Zaharia commented on SPARK-4452:
--

How much of this gets fixed if you fix the elementsRead bug in ExternalSorter?

With forcing data structures to spill, the problem is that it will introduce 
complexity in every spillable data structure. I wonder if we can make it just 
give out memory in smaller increments, so that threads check whether they 
should spill more often. In addition, we can set a better minimum or maximum on 
each thread (e.g. always let it ramp up to, say, 5 MB, or some fraction of the 
memory space).

I do like the idea of making the ShuffleMemoryManager track limits per object. 
I actually considered this when I wrote that and didn't do it, possibly because 
it would've created more complexity in figuring out when an object is done. But 
it seems like it should be straightforward to add in, as long as you also track 
which objects come from which thread so that you can still 
releaseMemoryForThisThread() to clean up.

 Shuffle data structures can starve others on the same thread for memory 
 

 Key: SPARK-4452
 URL: https://issues.apache.org/jira/browse/SPARK-4452
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.1.0
Reporter: Tianshuo Deng

 When an Aggregator is used with ExternalSorter in a task, Spark will create 
 many small files and can hit a "too many open files" error during merging.
 This happens when using the sort-based shuffle. The issue is caused by 
 multiple factors:
 1. There seems to be a bug in setting the elementsRead variable in 
 ExternalSorter, which renders trackMemoryThreshold (defined in Spillable) 
 useless for triggering spilling; the PR to fix it is 
 https://github.com/apache/spark/pull/3302
 2. The current ShuffleMemoryManager does not work well when there are two 
 spillable objects in a thread, which in this case are ExternalSorter and 
 ExternalAppendOnlyMap (used by Aggregator). Here is an example: due to the use 
 of map-side aggregation, ExternalAppendOnlyMap is created first to read the 
 RDD. It may ask for as much memory as it can get, which is 
 totalMem/numberOfThreads. Then later on, when ExternalSorter is created in the 
 same thread, the ShuffleMemoryManager could refuse to allocate more memory to 
 it, since the memory is already given to the previously requesting object 
 (ExternalAppendOnlyMap). That causes the ExternalSorter to keep spilling small 
 files (due to the lack of memory).
 I'm currently working on a PR to address these two issues. It will include the 
 following changes:
 1. The ShuffleMemoryManager should not only track the memory usage for each 
 thread, but also the object that holds the memory.
 2. The ShuffleMemoryManager should be able to trigger the spilling of a 
 spillable object. In this way, if a new object in a thread is requesting 
 memory, the old occupant could be evicted/spilled. This avoids problem 2 from 
 happening. Previously, spillable objects triggered spilling by themselves, so 
 one might not trigger spilling even if another object in the same thread 
 needed more memory. After this change, the ShuffleMemoryManager can trigger 
 the spilling of an object if it needs to.
 3. Make the iterator of ExternalAppendOnlyMap spillable. Previously 
 ExternalAppendOnlyMap returned a destructive iterator that could not be spilled 
 after being returned. This should be changed so that even after the iterator is 
 returned, the ShuffleMemoryManager can still spill it.
 Currently, I have a working branch in progress: 
 https://github.com/tsdeng/spark/tree/enhance_memory_manager
 I have already made change 3 and have a prototype of changes 1 and 2 to evict 
 spillables from the memory manager, still in progress.
 I will send a PR when it's done.
 Any feedback or thoughts on this change are highly appreciated!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4452) Shuffle data structures can starve others on the same thread for memory

2014-11-17 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14215557#comment-14215557
 ] 

Matei Zaharia commented on SPARK-4452:
--

BTW we may also want to create a separate JIRA for the short-term fix for 1.1 
and 1.2.

 Shuffle data structures can starve others on the same thread for memory 
 

 Key: SPARK-4452
 URL: https://issues.apache.org/jira/browse/SPARK-4452
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.1.0
Reporter: Tianshuo Deng
Assignee: Tianshuo Deng
Priority: Blocker

 When an Aggregator is used with ExternalSorter in a task, Spark will create 
 many small files and can hit a "too many open files" error during merging.
 This happens when using the sort-based shuffle. The issue is caused by 
 multiple factors:
 1. There seems to be a bug in setting the elementsRead variable in 
 ExternalSorter, which renders trackMemoryThreshold (defined in Spillable) 
 useless for triggering spilling; the PR to fix it is 
 https://github.com/apache/spark/pull/3302
 2. The current ShuffleMemoryManager does not work well when there are two 
 spillable objects in a thread, which in this case are ExternalSorter and 
 ExternalAppendOnlyMap (used by Aggregator). Here is an example: due to the use 
 of map-side aggregation, ExternalAppendOnlyMap is created first to read the 
 RDD. It may ask for as much memory as it can get, which is 
 totalMem/numberOfThreads. Then later on, when ExternalSorter is created in the 
 same thread, the ShuffleMemoryManager could refuse to allocate more memory to 
 it, since the memory is already given to the previously requesting object 
 (ExternalAppendOnlyMap). That causes the ExternalSorter to keep spilling small 
 files (due to the lack of memory).
 I'm currently working on a PR to address these two issues. It will include the 
 following changes:
 1. The ShuffleMemoryManager should not only track the memory usage for each 
 thread, but also the object that holds the memory.
 2. The ShuffleMemoryManager should be able to trigger the spilling of a 
 spillable object. In this way, if a new object in a thread is requesting 
 memory, the old occupant could be evicted/spilled. This avoids problem 2 from 
 happening. Previously, spillable objects triggered spilling by themselves, so 
 one might not trigger spilling even if another object in the same thread 
 needed more memory. After this change, the ShuffleMemoryManager can trigger 
 the spilling of an object if it needs to.
 3. Make the iterator of ExternalAppendOnlyMap spillable. Previously 
 ExternalAppendOnlyMap returned a destructive iterator that could not be spilled 
 after being returned. This should be changed so that even after the iterator is 
 returned, the ShuffleMemoryManager can still spill it.
 Currently, I have a working branch in progress: 
 https://github.com/tsdeng/spark/tree/enhance_memory_manager
 I have already made change 3 and have a prototype of changes 1 and 2 to evict 
 spillables from the memory manager, still in progress.
 I will send a PR when it's done.
 Any feedback or thoughts on this change are highly appreciated!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4452) Shuffle data structures can starve others on the same thread for memory

2014-11-17 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14215556#comment-14215556
 ] 

Matei Zaharia commented on SPARK-4452:
--

Got it. It would be fine to do this if you found it to help, I was just 
wondering whether simpler fixes would get us far enough. For the forced 
spilling change, I'd suggest writing a short design doc, or making sure that 
the comments in the code about it are very detailed (essentially having a 
design doc at the top of the class). This can have a lot of tricky cases due to 
concurrency so it's important to document the design.

 Shuffle data structures can starve others on the same thread for memory 
 

 Key: SPARK-4452
 URL: https://issues.apache.org/jira/browse/SPARK-4452
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.1.0
Reporter: Tianshuo Deng
Assignee: Tianshuo Deng
Priority: Blocker

 When an Aggregator is used with ExternalSorter in a task, Spark will create 
 many small files and can hit a "too many open files" error during merging.
 This happens when using the sort-based shuffle. The issue is caused by 
 multiple factors:
 1. There seems to be a bug in setting the elementsRead variable in 
 ExternalSorter, which renders trackMemoryThreshold (defined in Spillable) 
 useless for triggering spilling; the PR to fix it is 
 https://github.com/apache/spark/pull/3302
 2. The current ShuffleMemoryManager does not work well when there are two 
 spillable objects in a thread, which in this case are ExternalSorter and 
 ExternalAppendOnlyMap (used by Aggregator). Here is an example: due to the use 
 of map-side aggregation, ExternalAppendOnlyMap is created first to read the 
 RDD. It may ask for as much memory as it can get, which is 
 totalMem/numberOfThreads. Then later on, when ExternalSorter is created in the 
 same thread, the ShuffleMemoryManager could refuse to allocate more memory to 
 it, since the memory is already given to the previously requesting object 
 (ExternalAppendOnlyMap). That causes the ExternalSorter to keep spilling small 
 files (due to the lack of memory).
 I'm currently working on a PR to address these two issues. It will include the 
 following changes:
 1. The ShuffleMemoryManager should not only track the memory usage for each 
 thread, but also the object that holds the memory.
 2. The ShuffleMemoryManager should be able to trigger the spilling of a 
 spillable object. In this way, if a new object in a thread is requesting 
 memory, the old occupant could be evicted/spilled. This avoids problem 2 from 
 happening. Previously, spillable objects triggered spilling by themselves, so 
 one might not trigger spilling even if another object in the same thread 
 needed more memory. After this change, the ShuffleMemoryManager can trigger 
 the spilling of an object if it needs to.
 3. Make the iterator of ExternalAppendOnlyMap spillable. Previously 
 ExternalAppendOnlyMap returned a destructive iterator that could not be spilled 
 after being returned. This should be changed so that even after the iterator is 
 returned, the ShuffleMemoryManager can still spill it.
 Currently, I have a working branch in progress: 
 https://github.com/tsdeng/spark/tree/enhance_memory_manager
 I have already made change 3 and have a prototype of changes 1 and 2 to evict 
 spillables from the memory manager, still in progress.
 I will send a PR when it's done.
 Any feedback or thoughts on this change are highly appreciated!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4306) LogisticRegressionWithLBFGS support for PySpark MLlib

2014-11-16 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-4306:
-
Target Version/s: 1.2.0

 LogisticRegressionWithLBFGS support for PySpark MLlib 
 --

 Key: SPARK-4306
 URL: https://issues.apache.org/jira/browse/SPARK-4306
 Project: Spark
  Issue Type: New Feature
  Components: MLlib, PySpark
Reporter: Varadharajan
  Labels: newbie
   Original Estimate: 48h
  Remaining Estimate: 48h

 Currently we support LogisticRegressionWithSGD in the PySpark MLlib 
 interface. This task is to add support for the LogisticRegressionWithLBFGS 
 algorithm.
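 For reference, a minimal, self-contained sketch of the existing Scala API 
 that a PySpark wrapper would mirror (the dataset and app name are made up 
 for illustration):
 {code}
 import org.apache.spark.{SparkConf, SparkContext}
 import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
 import org.apache.spark.mllib.linalg.Vectors
 import org.apache.spark.mllib.regression.LabeledPoint

 object LBFGSDemo {
   def main(args: Array[String]): Unit = {
     val sc = new SparkContext(new SparkConf().setAppName("lbfgs-demo").setMaster("local[2]"))
     // Tiny made-up dataset: a binary label plus a two-feature vector.
     val training = sc.parallelize(Seq(
       LabeledPoint(0.0, Vectors.dense(0.0, 1.1)),
       LabeledPoint(1.0, Vectors.dense(2.0, 1.0)),
       LabeledPoint(1.0, Vectors.dense(2.5, 0.8))))
     val model = new LogisticRegressionWithLBFGS().run(training)
     println(model.predict(Vectors.dense(2.2, 0.9))) // predicted class label
     sc.stop()
   }
 }
 {code}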



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4306) LogisticRegressionWithLBFGS support for PySpark MLlib

2014-11-16 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14214134#comment-14214134
 ] 

Matei Zaharia commented on SPARK-4306:
--

[~srinathsmn] I've assigned it to you. When do you think you'll get this done? 
It would be great to include in 1.2 but for that we'd need it quite soon (say 
this week). If you don't have time, I can also assign it to someone else.

 LogisticRegressionWithLBFGS support for PySpark MLlib 
 --

 Key: SPARK-4306
 URL: https://issues.apache.org/jira/browse/SPARK-4306
 Project: Spark
  Issue Type: New Feature
  Components: MLlib, PySpark
Reporter: Varadharajan
  Labels: newbie
   Original Estimate: 48h
  Remaining Estimate: 48h

 Currently we support LogisticRegressionWithSGD in the PySpark MLlib 
 interface. This task is to add support for the LogisticRegressionWithLBFGS 
 algorithm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4306) LogisticRegressionWithLBFGS support for PySpark MLlib

2014-11-16 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-4306:
-
Assignee: Varadharajan

 LogisticRegressionWithLBFGS support for PySpark MLlib 
 --

 Key: SPARK-4306
 URL: https://issues.apache.org/jira/browse/SPARK-4306
 Project: Spark
  Issue Type: New Feature
  Components: MLlib, PySpark
Reporter: Varadharajan
Assignee: Varadharajan
  Labels: newbie
   Original Estimate: 48h
  Remaining Estimate: 48h

 Currently we support LogisticRegressionWithSGD in the PySpark MLlib 
 interface. This task is to add support for the LogisticRegressionWithLBFGS 
 algorithm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4435) Add setThreshold in Python LogisticRegressionModel and SVMModel

2014-11-16 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-4435:


 Summary: Add setThreshold in Python LogisticRegressionModel and 
SVMModel
 Key: SPARK-4435
 URL: https://issues.apache.org/jira/browse/SPARK-4435
 Project: Spark
  Issue Type: New Feature
  Components: MLlib, PySpark
Reporter: Matei Zaharia






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4434) spark-submit cluster deploy mode JAR URLs are broken in 1.1.1

2014-11-16 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14214155#comment-14214155
 ] 

Matei Zaharia commented on SPARK-4434:
--

[~joshrosen] make sure to revert this on 1.2 and master as well.

 spark-submit cluster deploy mode JAR URLs are broken in 1.1.1
 -

 Key: SPARK-4434
 URL: https://issues.apache.org/jira/browse/SPARK-4434
 Project: Spark
  Issue Type: Bug
  Components: Deploy, Spark Core
Affects Versions: 1.1.1, 1.2.0
Reporter: Josh Rosen
Assignee: Andrew Or
Priority: Blocker

 When submitting a driver using {{spark-submit}} in cluster mode, Spark 1.1.0 
 allowed you to omit the {{file://}} or {{hdfs://}} prefix from the 
 application JAR URL, e.g.
 {code}
 ./bin/spark-submit --deploy-mode cluster --master 
 spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi 
 /Users/joshrosen/Documents/old-spark-releases/spark-1.1.0-bin-hadoop1/lib/spark-examples-1.1.0-hadoop1.0.4.jar
 {code}
 In Spark 1.1.1 and 1.2.0, this same command now fails with an error:
 {code}
 ./bin/spark-submit --deploy-mode cluster --master 
 spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi 
 /Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar
 Jar url 
 'file:/Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar'
  is not in valid format.
 Must be a jar file path in URL format (e.g. hdfs://XX.jar, file://XX.jar)
 Usage: DriverClient [options] launch active-master jar-url main-class 
 [driver options]
 Usage: DriverClient kill active-master driver-id
 {code}
 I tried changing my URL to conform to the new format, but this either 
 resulted in an error or a job that failed:
 {code}
 ./bin/spark-submit --deploy-mode cluster --master 
 spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi 
 file:///Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar
 Jar url 
 'file:///Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar'
  is not in valid format.
 Must be a jar file path in URL format (e.g. hdfs://XX.jar, file://XX.jar)
 {code}
 If I omit the extra slash:
 {code}
 ./bin/spark-submit --deploy-mode cluster --master 
 spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi 
 file://Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar
 Sending launch command to spark://joshs-mbp.att.net:7077
 Driver successfully submitted as driver-20141116143235-0002
 ... waiting before polling master for driver state
 ... polling master for driver state
 State of driver-20141116143235-0002 is ERROR
 Exception from cluster was: java.lang.IllegalArgumentException: Wrong FS: 
 file://Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar,
  expected: file:///
 java.lang.IllegalArgumentException: Wrong FS: 
 file://Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar,
  expected: file:///
   at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:381)
   at 
 org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:55)
   at 
 org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:393)
   at 
 org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251)
   at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:329)
   at 
 org.apache.spark.deploy.worker.DriverRunner.org$apache$spark$deploy$worker$DriverRunner$$downloadUserJar(DriverRunner.scala:157)
   at 
 org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:74)
 {code}
 This bug effectively prevents users from using {{spark-submit}} in cluster 
 mode to run drivers whose JARs are stored on shared cluster filesystems.
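 One possible direction, shown only as a hedged sketch (the helper name and 
 placement are hypothetical, not the actual spark-submit/DriverClient code): 
 canonicalize a bare local path into a well-formed file:/// URI before 
 validating it, instead of rejecting it outright.
 {code}
 import java.net.URI
 import java.nio.file.Paths

 // Hypothetical helper: leave hdfs:// or well-formed file:// URLs untouched,
 // but turn a bare local path into a proper file:/// URI.
 def normalizeJarUrl(url: String): String = {
   val scheme = new URI(url).getScheme
   if (scheme == null) Paths.get(url).toAbsolutePath.toUri.toString // e.g. /x/y.jar -> file:///x/y.jar
   else url
 }
 {code}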



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4439) Export RandomForest in Python

2014-11-16 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-4439:


 Summary: Export RandomForest in Python
 Key: SPARK-4439
 URL: https://issues.apache.org/jira/browse/SPARK-4439
 Project: Spark
  Issue Type: New Feature
  Components: MLlib, PySpark
Reporter: Matei Zaharia






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4439) Expose RandomForest in Python

2014-11-16 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-4439:
-
Summary: Expose RandomForest in Python  (was: Export RandomForest in Python)

 Expose RandomForest in Python
 -

 Key: SPARK-4439
 URL: https://issues.apache.org/jira/browse/SPARK-4439
 Project: Spark
  Issue Type: New Feature
  Components: MLlib, PySpark
Reporter: Matei Zaharia





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4330) Link to proper URL for YARN overview

2014-11-10 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia resolved SPARK-4330.
--
  Resolution: Fixed
   Fix Version/s: 1.2.0
  1.1.1
Target Version/s:   (was: 1.3.0)

 Link to proper URL for YARN overview
 

 Key: SPARK-4330
 URL: https://issues.apache.org/jira/browse/SPARK-4330
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 1.3.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta
Priority: Minor
 Fix For: 1.1.1, 1.2.0


 In running-on-yarn.md, there is a link to the YARN overview, but the URL 
 points to the YARN alpha documentation. It should point to the stable 
 documentation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4330) Link to proper URL for YARN overview

2014-11-10 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-4330:
-
Assignee: Kousuke Saruta

 Link to proper URL for YARN overview
 

 Key: SPARK-4330
 URL: https://issues.apache.org/jira/browse/SPARK-4330
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 1.3.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta
Priority: Minor
 Fix For: 1.1.1, 1.2.0


 In running-on-yarn.md, there is a link to the YARN overview, but the URL 
 points to the YARN alpha documentation. It should point to the stable 
 documentation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4303) [MLLIB] Use Long IDs instead of Int in ALS.Rating class

2014-11-07 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14203147#comment-14203147
 ] 

Matei Zaharia commented on SPARK-4303:
--

Yup, this will actually become easier with the new pipeline API, but it's 
probably not going to happen in 1.2.

 [MLLIB] Use Long IDs instead of Int in ALS.Rating class
 ---

 Key: SPARK-4303
 URL: https://issues.apache.org/jira/browse/SPARK-4303
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Jia Xu

 In many big data recommendation applications, the IDs used for users and 
 products are usually of Long type instead of Integer, so a Rating class based 
 on Long IDs would be more useful for these applications, i.e.
 case class Rating(val user: Long, val product: Long, val rating: Double)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4186) Support binaryFiles and binaryRecords API in Python

2014-11-06 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia resolved SPARK-4186.
--
   Resolution: Fixed
Fix Version/s: 1.2.0

 Support binaryFiles and binaryRecords API in Python
 ---

 Key: SPARK-4186
 URL: https://issues.apache.org/jira/browse/SPARK-4186
 Project: Spark
  Issue Type: New Feature
  Components: PySpark, Spark Core
Reporter: Matei Zaharia
Assignee: Davies Liu
 Fix For: 1.2.0


 After SPARK-2759, we should expose these methods in Python. Shouldn't be too 
 hard to add.
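 For context, a usage sketch of the Scala methods being exposed (assumes a 
 live SparkContext named sc, e.g. in spark-shell; the paths and the 512-byte 
 record length are made-up examples):
 {code}
 // Whole small binary files: an RDD of (path, PortableDataStream).
 val files = sc.binaryFiles("hdfs:///data/blobs")
 val sizes = files.map { case (_, stream) => stream.toArray().length }

 // Fixed-length binary records: RDD[Array[Byte]] with one 512-byte record per element.
 val records = sc.binaryRecords("hdfs:///data/fixed.bin", 512)
 println(records.count())
 {code}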



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-644) Jobs canceled due to repeated executor failures may hang

2014-11-06 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia resolved SPARK-644.
-
Resolution: Fixed

 Jobs canceled due to repeated executor failures may hang
 

 Key: SPARK-644
 URL: https://issues.apache.org/jira/browse/SPARK-644
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.6.1
Reporter: Josh Rosen
Assignee: Josh Rosen

 In order to prevent an infinite loop, the standalone master aborts jobs that 
 experience more than 10 executor failures (see 
 https://github.com/mesos/spark/pull/210).  Currently, the master crashes when 
 aborting jobs (this is the issue that uncovered SPARK-643).  If we fix the 
 crash, which involves removing a {{throw}} from the actor's {{receive}} 
 method, then these failures can lead to a hang because they cause the job to 
 be removed from the master's scheduler, but the upstream scheduler components 
 aren't notified of the failure and will wait for the job to finish.
 I've considered fixing this by adding additional callbacks to propagate the 
 failure to the higher-level schedulers.  It might be cleaner to move the 
 decision to abort the job into the higher-level layers of the scheduler, 
 sending an {{AbortJob(jobId)}} message to the Master.  The Client is already 
 notified of executor state changes, so it may be able to make the decision to 
 abort (or defer that decision to a higher layer).
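 As a hedged illustration of that direction (the message and class names are 
 hypothetical, not Spark's actual deploy protocol): count executor failures 
 per job and hand the abort decision upward as an explicit message, so 
 upstream schedulers are always told why a job went away.
 {code}
 import scala.collection.mutable

 // Hypothetical messages and tracker, not Spark's real deploy protocol.
 case class ExecutorFailed(jobId: String)
 case class AbortJob(jobId: String, reason: String)

 class ExecutorFailureTracker(maxFailures: Int = 10) {
   private val failures = mutable.Map.empty[String, Int].withDefaultValue(0)

   // Record a failure; once the threshold is crossed, return the AbortJob
   // message that the caller must forward to the higher-level scheduler.
   def record(event: ExecutorFailed): Option[AbortJob] = {
     failures(event.jobId) += 1
     if (failures(event.jobId) > maxFailures)
       Some(AbortJob(event.jobId, s"aborted after ${failures(event.jobId)} executor failures"))
     else None
   }
 }
 {code}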



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-643) Standalone master crashes during actor restart

2014-11-06 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia resolved SPARK-643.
-
Resolution: Fixed

 Standalone master crashes during actor restart
 --

 Key: SPARK-643
 URL: https://issues.apache.org/jira/browse/SPARK-643
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.6.1
Reporter: Josh Rosen
Assignee: Josh Rosen

 The standalone master will crash if it restarts due to an exception:
 {code}
 12/12/15 03:10:47 ERROR master.Master: Job SkewBenchmark wth ID 
 job-20121215031047- failed 11 times.
 spark.SparkException: Job SkewBenchmark wth ID job-20121215031047- failed 
 11 times.
 at 
 spark.deploy.master.Master$$anonfun$receive$1.apply(Master.scala:103)
 at 
 spark.deploy.master.Master$$anonfun$receive$1.apply(Master.scala:62)
 at akka.actor.Actor$class.apply(Actor.scala:318)
 at spark.deploy.master.Master.apply(Master.scala:17)
 at akka.actor.ActorCell.invoke(ActorCell.scala:626)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:197)
 at akka.dispatch.Mailbox.run(Mailbox.scala:179)
 at 
 akka.dispatch.ForkJoinExecutorConfigurator$MailboxExecutionTask.exec(AbstractDispatcher.scala:516)
 at akka.jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:259)
 at akka.jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:975)
 at akka.jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1479)
 at 
 akka.jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)
 12/12/15 03:10:47 INFO master.Master: Starting Spark master at 
 spark://ip-10-226-87-193:7077
 12/12/15 03:10:47 INFO io.IoWorker: IoWorker thread 'spray-io-worker-1' 
 started
 12/12/15 03:10:47 ERROR master.Master: Failed to create web UI
 akka.actor.InvalidActorNameException:actor name HttpServer is not unique!
 [05aed000-4665-11e2-b361-12313d316833]
 at akka.actor.ActorCell.actorOf(ActorCell.scala:392)
 at 
 akka.actor.LocalActorRefProvider$Guardian$$anonfun$receive$1.liftedTree1$1(ActorRefProvider.scala:394)
 at 
 akka.actor.LocalActorRefProvider$Guardian$$anonfun$receive$1.apply(ActorRefProvider.scala:394)
 at 
 akka.actor.LocalActorRefProvider$Guardian$$anonfun$receive$1.apply(ActorRefProvider.scala:392)
 at akka.actor.Actor$class.apply(Actor.scala:318)
 at 
 akka.actor.LocalActorRefProvider$Guardian.apply(ActorRefProvider.scala:388)
 at akka.actor.ActorCell.invoke(ActorCell.scala:626)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:197)
 at akka.dispatch.Mailbox.run(Mailbox.scala:179)
 at 
 akka.dispatch.ForkJoinExecutorConfigurator$MailboxExecutionTask.exec(AbstractDispatcher.scala:516)
 at akka.jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:259)
 at akka.jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:975)
 at akka.jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1479)
 at 
 akka.jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)
 {code}
 When the Master actor restarts, Akka calls the {{postRestart}} hook.  [By 
 default|http://doc.akka.io/docs/akka/snapshot/general/supervision.html#supervision-restart],
  this calls {{preStart}}.  The standalone master's {{preStart}} method tries 
 to start the webUI but crashes because it is already running.
 I ran into this after a job failed more than 11 times, which causes the 
 Master to throw a SparkException from its {{receive}} method.
 The solution is to implement a custom {{postRestart}} hook.
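 A minimal sketch of that fix, written against Akka 2.x hook names (the Spark 
 code of that era used an older Akka, so the details differed); the key point 
 is that the default {{postRestart}} calls {{preStart}}, which would try to 
 bind the web UI a second time:
 {code}
 import akka.actor.Actor

 class MasterLikeActor extends Actor {
   override def preStart(): Unit = {
     // startWebUi()  // hypothetical helper: bind the web UI once, on the initial start
   }

   // The default postRestart(reason) calls preStart(); overriding it to do
   // nothing keeps the already-bound web UI (and other one-time resources) alive.
   override def postRestart(reason: Throwable): Unit = ()

   def receive: Receive = {
     case _ => () // deploy messages would be handled here
   }
 }
 {code}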



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-677) PySpark should not collect results through local filesystem

2014-11-06 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200514#comment-14200514
 ] 

Matei Zaharia commented on SPARK-677:
-

[~joshrosen] is this fixed now?

 PySpark should not collect results through local filesystem
 ---

 Key: SPARK-677
 URL: https://issues.apache.org/jira/browse/SPARK-677
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 0.7.0
Reporter: Josh Rosen

 Py4J is slow when transferring large arrays, so PySpark currently dumps data 
 to the disk and reads it back in order to collect() RDDs.  On large enough 
 datasets, this data will spill from the buffer cache and write to the 
 physical disk, resulting in terrible performance.
 Instead, we should stream the data from Java to Python over a local socket or 
 a FIFO.
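 A rough sketch of the socket approach (illustrative only, not PySpark's 
 actual implementation): the JVM serves length-prefixed byte records on an 
 ephemeral localhost port and hands the port number to the Python process, 
 which connects and reads until it sees the end marker.
 {code}
 import java.io.{BufferedOutputStream, DataOutputStream}
 import java.net.{InetAddress, ServerSocket}

 // Serve the collected records over a local socket and return the port
 // that the Python side should connect to.
 def serveBytes(records: Iterator[Array[Byte]]): Int = {
   val server = new ServerSocket(0, 1, InetAddress.getByName("localhost"))
   new Thread("serve-collect") {
     setDaemon(true)
     override def run(): Unit = {
       val sock = server.accept()
       val out = new DataOutputStream(new BufferedOutputStream(sock.getOutputStream))
       try {
         records.foreach { r => out.writeInt(r.length); out.write(r) }
         out.writeInt(-1) // end-of-stream marker
         out.flush()
       } finally {
         sock.close()
         server.close()
       }
     }
   }.start()
   server.getLocalPort
 }
 {code}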



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-681) Optimize hashtables used in Spark

2014-11-06 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia resolved SPARK-681.
-
Resolution: Fixed

 Optimize hashtables used in Spark
 -

 Key: SPARK-681
 URL: https://issues.apache.org/jira/browse/SPARK-681
 Project: Spark
  Issue Type: Improvement
Reporter: Matei Zaharia

 The hash tables used in cogroup, join, etc take up a lot more space than they 
 need to because they're using linked data structures. It would be nice to 
 write a custom open hashtable class to use instead, especially since these 
 tables are append-only. A custom one would likely run better than fastutil 
 as well.
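 For illustration, a toy append-only open-addressing table of the kind 
 described above (a sketch, not Spark's eventual implementation; no null 
 keys, not thread-safe):
 {code}
 // Keys and values live in one flat array (key at 2*pos, value at 2*pos + 1),
 // with linear probing instead of per-entry linked nodes.
 class AppendOnlyHashMap[K, V](initialCapacity: Int = 64) {
   private var capacity = initialCapacity // always a power of two
   private var data = new Array[AnyRef](2 * capacity)
   private var size = 0

   def update(key: K, value: V): Unit = {
     if (size > capacity * 0.7) grow()
     var pos = key.hashCode & (capacity - 1)
     while (true) {
       val cur = data(2 * pos)
       if (cur == null) {
         data(2 * pos) = key.asInstanceOf[AnyRef]
         data(2 * pos + 1) = value.asInstanceOf[AnyRef]
         size += 1
         return
       } else if (cur == key) {
         data(2 * pos + 1) = value.asInstanceOf[AnyRef]
         return
       }
       pos = (pos + 1) & (capacity - 1) // linear probing
     }
   }

   def get(key: K): Option[V] = {
     var pos = key.hashCode & (capacity - 1)
     while (data(2 * pos) != null) {
       if (data(2 * pos) == key) return Some(data(2 * pos + 1).asInstanceOf[V])
       pos = (pos + 1) & (capacity - 1)
     }
     None
   }

   private def grow(): Unit = {
     val old = data
     val oldCapacity = capacity
     capacity *= 2
     data = new Array[AnyRef](2 * capacity)
     size = 0
     var i = 0
     while (i < oldCapacity) {
       if (old(2 * i) != null) update(old(2 * i).asInstanceOf[K], old(2 * i + 1).asInstanceOf[V])
       i += 1
     }
   }
 }
 {code}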



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-993) Don't reuse Writable objects in HadoopRDDs by default

2014-11-06 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia resolved SPARK-993.
-
Resolution: Won't Fix

We investigated this for 1.0 but found that many InputFormats behave wrongly if 
you try to clone the object, so we won't fix it.

 Don't reuse Writable objects in HadoopRDDs by default
 -

 Key: SPARK-993
 URL: https://issues.apache.org/jira/browse/SPARK-993
 Project: Spark
  Issue Type: Improvement
Reporter: Matei Zaharia

 Right now we reuse them as an optimization, which leads to weird results when 
 you call collect() on a file with distinct items. We should instead make that 
 behavior optional through a flag.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-993) Don't reuse Writable objects in HadoopRDDs by default

2014-11-06 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200531#comment-14200531
 ] 

Matei Zaharia commented on SPARK-993:
-

Arun, you'd see this issue if you do collect() or take() and then println. The 
problem is that the same Text object (for example) is referenced for all 
records in the dataset. The counts will be okay.
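 To make the pitfall concrete, a small self-contained sketch (the input path 
 is made up; converting the Writables to plain JVM values is the standard 
 workaround, not a new API):
 {code}
 import org.apache.hadoop.io.{LongWritable, Text}
 import org.apache.hadoop.mapred.TextInputFormat
 import org.apache.spark.{SparkConf, SparkContext}

 object WritableReuseDemo {
   def main(args: Array[String]): Unit = {
     val sc = new SparkContext(new SparkConf().setAppName("writable-reuse-demo").setMaster("local[1]"))
     // hadoopFile reuses the same LongWritable/Text instances for every record.
     val raw = sc.hadoopFile[LongWritable, Text, TextInputFormat]("hdfs:///data/input.txt")

     // Risky: every element of the collected array may point at the same Text object.
     // val broken = raw.collect()

     // Safe: materialize plain JVM values before collect(); counts are unaffected either way.
     val safe = raw.map { case (offset, line) => (offset.get, line.toString) }.collect()
     safe.take(5).foreach(println)
     sc.stop()
   }
 }
 {code}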

 Don't reuse Writable objects in HadoopRDDs by default
 -

 Key: SPARK-993
 URL: https://issues.apache.org/jira/browse/SPARK-993
 Project: Spark
  Issue Type: Improvement
Reporter: Matei Zaharia

 Right now we reuse them as an optimization, which leads to weird results when 
 you call collect() on a file with distinct items. We should instead make that 
 behavior optional through a flag.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-1000) Crash when running SparkPi example with local-cluster

2014-11-06 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia closed SPARK-1000.

Resolution: Cannot Reproduce

 Crash when running SparkPi example with local-cluster
 -

 Key: SPARK-1000
 URL: https://issues.apache.org/jira/browse/SPARK-1000
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.9.0
Reporter: xiajunluan
Assignee: Andrew Or

 When I run SparkPi with local-cluster[2,2,512], it throws the following 
 exception at the end of the job.
 WARNING: An exception was thrown by an exception handler.
 java.util.concurrent.RejectedExecutionException
   at 
 java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:1768)
   at 
 java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:767)
   at 
 java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:658)
   at 
 org.jboss.netty.channel.socket.nio.AbstractNioWorker.start(AbstractNioWorker.java:184)
   at 
 org.jboss.netty.channel.socket.nio.AbstractNioWorker.executeInIoThread(AbstractNioWorker.java:330)
   at 
 org.jboss.netty.channel.socket.nio.NioWorker.executeInIoThread(NioWorker.java:35)
   at 
 org.jboss.netty.channel.socket.nio.AbstractNioWorker.executeInIoThread(AbstractNioWorker.java:313)
   at 
 org.jboss.netty.channel.socket.nio.NioWorker.executeInIoThread(NioWorker.java:35)
   at 
 org.jboss.netty.channel.socket.nio.AbstractNioChannelSink.execute(AbstractNioChannelSink.java:34)
   at 
 org.jboss.netty.channel.Channels.fireExceptionCaughtLater(Channels.java:504)
   at 
 org.jboss.netty.channel.AbstractChannelSink.exceptionCaught(AbstractChannelSink.java:47)
   at org.jboss.netty.channel.Channels.fireChannelOpen(Channels.java:170)
   at 
 org.jboss.netty.channel.socket.nio.NioClientSocketChannel.init(NioClientSocketChannel.java:79)
   at 
 org.jboss.netty.channel.socket.nio.NioClientSocketChannelFactory.newChannel(NioClientSocketChannelFactory.java:176)
   at 
 org.jboss.netty.channel.socket.nio.NioClientSocketChannelFactory.newChannel(NioClientSocketChannelFactory.java:82)
   at 
 org.jboss.netty.bootstrap.ClientBootstrap.connect(ClientBootstrap.java:213)
   at 
 org.jboss.netty.bootstrap.ClientBootstrap.connect(ClientBootstrap.java:183)
   at 
 akka.remote.netty.ActiveRemoteClient$$anonfun$connect$1.apply$mcV$sp(Client.scala:173)
   at akka.util.Switch.liftedTree1$1(LockUtil.scala:33)
   at akka.util.Switch.transcend(LockUtil.scala:32)
   at akka.util.Switch.switchOn(LockUtil.scala:55)
   at akka.remote.netty.ActiveRemoteClient.connect(Client.scala:158)
   at 
 akka.remote.netty.NettyRemoteTransport.send(NettyRemoteSupport.scala:153)
   at akka.remote.RemoteActorRef.$bang(RemoteActorRefProvider.scala:247)
   at 
 akka.actor.LocalDeathWatch$$anonfun$publish$1.apply(ActorRefProvider.scala:559)
   at 
 akka.actor.LocalDeathWatch$$anonfun$publish$1.apply(ActorRefProvider.scala:559)
   at scala.collection.Iterator$class.foreach(Iterator.scala:772)
   at scala.collection.immutable.VectorIterator.foreach(Vector.scala:648)
   at scala.collection.IterableLike$class.foreach(IterableLike.scala:73)
   at scala.collection.immutable.Vector.foreach(Vector.scala:63)
   at akka.actor.LocalDeathWatch.publish(ActorRefProvider.scala:559)
   at 
 akka.remote.RemoteDeathWatch.publish(RemoteActorRefProvider.scala:280)
   at 
 akka.remote.RemoteDeathWatch.publish(RemoteActorRefProvider.scala:262)
   at akka.actor.ActorCell.doTerminate(ActorCell.scala:701)
   at akka.actor.ActorCell.handleChildTerminated(ActorCell.scala:747)
   at akka.actor.ActorCell.systemInvoke(ActorCell.scala:608)
   at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:209)
   at akka.dispatch.Mailbox.run(Mailbox.scala:178)
   at 
 akka.dispatch.ForkJoinExecutorConfigurator$MailboxExecutionTask.exec(AbstractDispatcher.scala:516)
   at akka.jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:259)
   at akka.jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:975)
   at akka.jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1479)
   at akka.jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-1023) Remove Thread.sleep(5000) from TaskSchedulerImpl

2014-11-06 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia resolved SPARK-1023.
--
Resolution: Fixed

 Remove Thread.sleep(5000) from TaskSchedulerImpl
 

 Key: SPARK-1023
 URL: https://issues.apache.org/jira/browse/SPARK-1023
 Project: Spark
  Issue Type: Bug
Reporter: Patrick Wendell
 Fix For: 1.0.0


 This causes the unit tests to take super long. We should figure out why this 
 exists and see if we can lower it or do something smarter.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-1185) In Spark Programming Guide, Master URLs should mention yarn-client

2014-11-06 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia resolved SPARK-1185.
--
Resolution: Fixed

 In Spark Programming Guide, Master URLs should mention yarn-client
 

 Key: SPARK-1185
 URL: https://issues.apache.org/jira/browse/SPARK-1185
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 0.9.0
Reporter: Sandy Pérez González
Assignee: Sandy Pérez González

 It would also be helpful to mention that the reason a host:port isn't 
 required for YARN mode is that it comes from the Hadoop configuration.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-2237) Add ZLIBCompressionCodec code

2014-11-06 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia closed SPARK-2237.

Resolution: Won't Fix

 Add ZLIBCompressionCodec code
 -

 Key: SPARK-2237
 URL: https://issues.apache.org/jira/browse/SPARK-2237
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Yanjie Gao





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2348) In Windows, having an environment variable named 'classpath' gives an error

2014-11-06 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-2348:
-
Priority: Critical  (was: Major)

 In Windows, having an environment variable named 'classpath' gives an error
 ---

 Key: SPARK-2348
 URL: https://issues.apache.org/jira/browse/SPARK-2348
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
 Environment: Windows 7 Enterprise
Reporter: Chirag Todarka
Assignee: Chirag Todarka
Priority: Critical

 Operating System: Windows 7 Enterprise
 If an environment variable named 'classpath' is set, then starting 
 'spark-shell' gives the error below:
 mydir\spark\bin>spark-shell
 Failed to initialize compiler: object scala.runtime in compiler mirror not 
 found
 .
 ** Note that as of 2.8 scala does not assume use of the java classpath.
 ** For the old behavior pass -usejavacp to scala, or if using a Settings
 ** object programatically, settings.usejavacp.value = true.
 14/07/02 14:22:06 WARN SparkILoop$SparkILoopInterpreter: Warning: compiler 
 acces
 sed before init set up.  Assuming no postInit code.
 Failed to initialize compiler: object scala.runtime in compiler mirror not 
 found
 .
 ** Note that as of 2.8 scala does not assume use of the java classpath.
 ** For the old behavior pass -usejavacp to scala, or if using a Settings
 ** object programatically, settings.usejavacp.value = true.
 Exception in thread main java.lang.AssertionError: assertion failed: null
 at scala.Predef$.assert(Predef.scala:179)
 at 
 org.apache.spark.repl.SparkIMain.initializeSynchronous(SparkIMain.sca
 la:202)
 at 
 org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(Spar
 kILoop.scala:929)
 at 
 org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.
 scala:884)
 at 
 org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.
 scala:884)
 at 
 scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClass
 Loader.scala:135)
 at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:884)
 at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:982)
 at org.apache.spark.repl.Main$.main(Main.scala:31)
 at org.apache.spark.repl.Main.main(Main.scala)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
 at java.lang.reflect.Method.invoke(Unknown Source)
 at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292)
 at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
 at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4222) FixedLengthBinaryRecordReader should readFully

2014-11-05 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-4222:
-
Assignee: Jascha Swisher

 FixedLengthBinaryRecordReader should readFully
 --

 Key: SPARK-4222
 URL: https://issues.apache.org/jira/browse/SPARK-4222
 Project: Spark
  Issue Type: Bug
Reporter: Jascha Swisher
Assignee: Jascha Swisher
Priority: Minor

 The new FixedLengthBinaryRecordReader currently uses a read() call to read 
 from the FSDataInputStream, without checking the number of bytes actually 
 returned. The currentPosition variable is updated assuming that the full 
 number of requested bytes are returned, which could lead to data corruption 
 or other problems if fewer bytes come back than requested.
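 A sketch of the fix described above (the method name is illustrative, not 
 the actual FixedLengthBinaryRecordReader code): {{readFully}} either fills 
 the whole buffer or throws, so the position bookkeeping can never drift the 
 way it can with a bare {{read()}} that returns fewer bytes than requested.
 {code}
 import java.io.{DataInputStream, EOFException}

 // FSDataInputStream extends DataInputStream, so readFully applies directly.
 def nextRecord(in: DataInputStream, recordLength: Int): Option[Array[Byte]] = {
   val buf = new Array[Byte](recordLength)
   try {
     in.readFully(buf) // blocks until recordLength bytes are read, or throws
     Some(buf)
   } catch {
     case _: EOFException => None // end of input (or truncated final record); no partial record returned
   }
 }
 {code}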



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4222) FixedLengthBinaryRecordReader should readFully

2014-11-05 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia resolved SPARK-4222.
--
   Resolution: Fixed
Fix Version/s: 1.2.0

 FixedLengthBinaryRecordReader should readFully
 --

 Key: SPARK-4222
 URL: https://issues.apache.org/jira/browse/SPARK-4222
 Project: Spark
  Issue Type: Bug
Reporter: Jascha Swisher
Assignee: Jascha Swisher
Priority: Minor
 Fix For: 1.2.0


 The new FixedLengthBinaryRecordReader currently uses a read() call to read 
 from the FSDataInputStream, without checking the number of bytes actually 
 returned. The currentPosition variable is updated assuming that the full 
 number of requested bytes are returned, which could lead to data corruption 
 or other problems if fewer bytes come back than requested.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4040) Update spark documentation for local mode and spark-streaming.

2014-11-05 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-4040:
-
Assignee: jay vyas

 Update spark documentation for local mode and spark-streaming. 
 ---

 Key: SPARK-4040
 URL: https://issues.apache.org/jira/browse/SPARK-4040
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Reporter: jay vyas
Assignee: jay vyas
 Fix For: 1.2.0


 *Note: this JIRA has changed since its inception - it's not a bug, but 
 something which can be tricky to surmise from the existing docs. So the 
 attached patch is a doc improvement.*
 Below is the original JIRA which was filed: 
 Please note that I'm somewhat new to Spark Streaming's API and am not a Spark 
 expert, so I've done my best to write up and reproduce this bug. If it's not 
 a bug, I hope an expert will help explain why and promptly close it. However, 
 it appears it could be a bug after discussing with [~rnowling], who is a 
 Spark contributor.
 CC [~rnowling] [~willbenton] 
  
 It appears that in a DStream context, a call to {{MappedRDD.count()}} blocks 
 progress and prevents emission of RDDs from a stream.
 {noformat}
 tweetStream.foreachRDD((rdd, lent) => {
   tweetStream.repartition(1)
   //val count = rdd.count()  // DON'T DO THIS!
   checks += 1
   if (checks > 20) {
     ssc.stop()
   }
 })
 {noformat} 
 The above code block should inevitably halt after 20 intervals of RDDs. 
 However, if we uncomment the call to {{rdd.count()}}, it turns out that we 
 get an infinite stream which emits no RDDs, and thus our program runs forever 
 (ssc.stop is unreachable), because *foreachRDD doesn't receive any more 
 entries*.  
 I suspect this is actually because the foreach block never completes, because 
 {{count()}} winds up calling {{compute}}, which ultimately just reads from 
 the stream.
 I haven't put together a minimal reproducer or unit test yet, but I can work 
 on doing so if more info is needed.
 I guess this could be seen as an application bug, but I think Spark might be 
 made smarter to throw its hands up when people execute blocking code in a 
 stream processor. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4040) Update spark documentation for local mode and spark-streaming.

2014-11-05 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia resolved SPARK-4040.
--
Resolution: Fixed

 Update spark documentation for local mode and spark-streaming. 
 ---

 Key: SPARK-4040
 URL: https://issues.apache.org/jira/browse/SPARK-4040
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Reporter: jay vyas
Assignee: jay vyas
 Fix For: 1.2.0


 *Note: this JIRA has changed since its inception - it's not a bug, but 
 something which can be tricky to surmise from the existing docs. So the 
 attached patch is a doc improvement.*
 Below is the original JIRA which was filed: 
 Please note that I'm somewhat new to Spark Streaming's API and am not a Spark 
 expert, so I've done my best to write up and reproduce this bug. If it's not 
 a bug, I hope an expert will help explain why and promptly close it. However, 
 it appears it could be a bug after discussing with [~rnowling], who is a 
 Spark contributor.
 CC [~rnowling] [~willbenton] 
  
 It appears that in a DStream context, a call to {{MappedRDD.count()}} blocks 
 progress and prevents emission of RDDs from a stream.
 {noformat}
 tweetStream.foreachRDD((rdd, lent) => {
   tweetStream.repartition(1)
   //val count = rdd.count()  // DON'T DO THIS!
   checks += 1
   if (checks > 20) {
     ssc.stop()
   }
 })
 {noformat} 
 The above code block should inevitably halt after 20 intervals of RDDs. 
 However, if we uncomment the call to {{rdd.count()}}, it turns out that we 
 get an infinite stream which emits no RDDs, and thus our program runs forever 
 (ssc.stop is unreachable), because *foreachRDD doesn't receive any more 
 entries*.  
 I suspect this is actually because the foreach block never completes, because 
 {{count()}} winds up calling {{compute}}, which ultimately just reads from 
 the stream.
 I haven't put together a minimal reproducer or unit test yet, but I can work 
 on doing so if more info is needed.
 I guess this could be seen as an application bug, but I think Spark might be 
 made smarter to throw its hands up when people execute blocking code in a 
 stream processor. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-565) Integrate spark in scala standard collection API

2014-11-05 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia resolved SPARK-565.
-
Resolution: Won't Fix

FYI I'm going to close this because we've locked down the API for 1.X, and it's 
pretty clear that it can't fully fit into the Scala collections API (that has a 
lot of things we don't have, and vice versa). This is something we can 
investigate later but it's unlikely that we'll want to bind the API to Scala 
even if we change pieces of it in the future.

 Integrate spark in scala standard collection API
 

 Key: SPARK-565
 URL: https://issues.apache.org/jira/browse/SPARK-565
 Project: Spark
  Issue Type: New Feature
Reporter: tjhunter

 This is more a meta-bug / wish item than a real bug. 
 Scala 2.0 provides some API for parallel collections which might be 
 interesting to leverage, but mostly, as a user, I would like to be able to 
 write a function like:
 def contrived_example(xs: Seq[Int]) = xs.map(_ * 2).sum
 and not have to care whether xs is an array, a Scala parallel collection or 
 an RDD. Given that RDDs already implement most of the Seq API, it seems 
 mostly a matter of standardization. I am probably missing some subtle 
 details here?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


