[jira] [Commented] (SPARK-21866) SPIP: Image support in Spark
[ https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16152915#comment-16152915 ] Matei Zaharia commented on SPARK-21866: --- Just to chime in on this, I've also seen feedback that the deep learning libraries for Spark are too fragmented: there are too many of them, and people don't know where to start. This standard representation would at least give them a clear way to interoperate. It would let people write separate libraries for image processing, data augmentation and then training for example. > SPIP: Image support in Spark > > > Key: SPARK-21866 > URL: https://issues.apache.org/jira/browse/SPARK-21866 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: Timothy Hunter > Labels: SPIP > Attachments: SPIP - Image support for Apache Spark V1.1.pdf > > > h2. Background and motivation > As Apache Spark is being used more and more in the industry, some new use > cases are emerging for different data formats beyond the traditional SQL > types or the numerical types (vectors and matrices). Deep Learning > applications commonly deal with image processing. A number of projects add > some Deep Learning capabilities to Spark (see list below), but they struggle > to communicate with each other or with MLlib pipelines because there is no > standard way to represent an image in Spark DataFrames. We propose to > federate efforts for representing images in Spark by defining a > representation that caters to the most common needs of users and library > developers. > This SPIP proposes a specification to represent images in Spark DataFrames > and Datasets (based on existing industrial standards), and an interface for > loading sources of images. It is not meant to be a full-fledged image > processing library, but rather the core description that other libraries and > users can rely on. 
Several packages already offer various processing > facilities for transforming images or doing more complex operations, and each > has various design tradeoffs that make them better as standalone solutions. > This project is a joint collaboration between Microsoft and Databricks, which > have been testing this design in two open source packages: MMLSpark and Deep > Learning Pipelines. > The proposed image format is an in-memory, decompressed representation that > targets low-level applications. It is significantly more liberal in memory > usage than compressed image representations such as JPEG, PNG, etc., but it > allows easy communication with popular image processing libraries and has no > decoding overhead. > h2. Targets users and personas: > Data scientists, data engineers, library developers. > The following libraries define primitives for loading and representing > images, and will gain from a common interchange format (in alphabetical > order): > * BigDL > * DeepLearning4J > * Deep Learning Pipelines > * MMLSpark > * TensorFlow (Spark connector) > * TensorFlowOnSpark > * TensorFrames > * Thunder > h2. Goals: > * Simple representation of images in Spark DataFrames, based on pre-existing > industrial standards (OpenCV) > * This format should eventually allow the development of high-performance > integration points with image processing libraries such as libOpenCV, Google > TensorFlow, CNTK, and other C libraries. > * The reader should be able to read popular formats of images from > distributed sources. > h2. Non-Goals: > Images are a versatile medium and encompass a very wide range of formats and > representations. 
This SPIP explicitly aims at the most common use case in the > industry currently: multi-channel matrices of binary, int32, int64, float or > double data that can fit comfortably in the heap of the JVM: > * the total size of an image should be restricted to less than 2GB (roughly) > * the meaning of color channels is application-specific and is not mandated > by the standard (in line with the OpenCV standard) > * specialized formats used in meteorology, the medical field, etc. are not > supported > * this format is specialized to images and does not attempt to solve the more > general problem of representing n-dimensional tensors in Spark > h2. Proposed API changes > We propose to add a new package in the package structure, under the MLlib > project: > {{org.apache.spark.image}} > h3. Data format > We propose to add the following structure: > imageSchema = StructType([ > * StructField("mode", StringType(), False), > ** The exact representation of the data. > ** The values are described in the following OpenCV convention. Basically, > the type has both "depth" and "number of channels" info: in particular, type > "CV_8UC3" means "3 channel unsigned bytes". BGRA format would be CV_8UC4 > (value 32 in
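The record layout sketched in the schema excerpt above can be pictured with a tiny stand-alone sketch. Only "mode" and "origin" appear in the excerpt; the remaining fields (height, width, data) and the helper name `image_row` are assumptions about what a decoded image row would carry, not the final Spark schema:

```python
# Illustrative subset of the OpenCV mode table; channel counts are what the
# size check below needs. Plain Python, not Spark code.
CHANNELS = {"CV_8UC1": 1, "CV_8UC3": 3, "CV_8UC4": 4}

def image_row(origin, mode, height, width, data):
    """Build one decompressed image record; checks mirror the stated non-goals."""
    # row-major, channel-interleaved bytes, as an in-memory decoded image
    assert len(data) == height * width * CHANNELS[mode]
    assert len(data) < 2 * 1024**3  # SPIP: total image size under ~2 GB
    return {"origin": origin, "mode": mode,
            "height": height, "width": width, "data": data}

row = image_row("file:/tmp/cat.png", "CV_8UC3", 2, 2, bytes(12))
assert row["height"] == 2 and len(row["data"]) == 12
```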
[jira] [Updated] (SPARK-18278) SPIP: Support native submission of spark jobs to a kubernetes cluster
[ https://issues.apache.org/jira/browse/SPARK-18278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-18278: -- Labels: SPIP (was: ) > SPIP: Support native submission of spark jobs to a kubernetes cluster > - > > Key: SPARK-18278 > URL: https://issues.apache.org/jira/browse/SPARK-18278 > Project: Spark > Issue Type: Umbrella > Components: Build, Deploy, Documentation, Scheduler, Spark Core >Reporter: Erik Erlandson > Labels: SPIP > Attachments: SPARK-18278 Spark on Kubernetes Design Proposal Revision > 2 (1).pdf > > > A new Apache Spark sub-project that enables native support for submitting > Spark applications to a kubernetes cluster. The submitted application runs > in a driver executing on a kubernetes pod, and executors lifecycles are also > managed as pods. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21866) SPIP: Image support in Spark
[ https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-21866: -- Labels: SPIP (was: ) > SPIP: Image support in Spark > > > Key: SPARK-21866 > URL: https://issues.apache.org/jira/browse/SPARK-21866 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: Timothy Hunter > Labels: SPIP > Attachments: SPIP - Image support for Apache Spark.pdf > > > h2. Background and motivation > As Apache Spark is being used more and more in the industry, some new use > cases are emerging for different data formats beyond the traditional SQL > types or the numerical types (vectors and matrices). Deep Learning > applications commonly deal with image processing. A number of projects add > some Deep Learning capabilities to Spark (see list below), but they struggle > to communicate with each other or with MLlib pipelines because there is no > standard way to represent an image in Spark DataFrames. We propose to > federate efforts for representing images in Spark by defining a > representation that caters to the most common needs of users and library > developers. > This SPIP proposes a specification to represent images in Spark DataFrames > and Datasets (based on existing industrial standards), and an interface for > loading sources of images. It is not meant to be a full-fledged image > processing library, but rather the core description that other libraries and > users can rely on. Several packages already offer various processing > facilities for transforming images or doing more complex operations, and each > has various design tradeoffs that make them better as standalone solutions. > This project is a joint collaboration between Microsoft and Databricks, which > have been testing this design in two open source packages: MMLSpark and Deep > Learning Pipelines. 
> The proposed image format is an in-memory, decompressed representation that > targets low-level applications. It is significantly more liberal in memory > usage than compressed image representations such as JPEG, PNG, etc., but it > allows easy communication with popular image processing libraries and has no > decoding overhead. > h2. Targets users and personas: > Data scientists, data engineers, library developers. > The following libraries define primitives for loading and representing > images, and will gain from a common interchange format (in alphabetical > order): > * BigDL > * DeepLearning4J > * Deep Learning Pipelines > * MMLSpark > * TensorFlow (Spark connector) > * TensorFlowOnSpark > * TensorFrames > * Thunder > h2. Goals: > * Simple representation of images in Spark DataFrames, based on pre-existing > industrial standards (OpenCV) > * This format should eventually allow the development of high-performance > integration points with image processing libraries such as libOpenCV, Google > TensorFlow, CNTK, and other C libraries. > * The reader should be able to read popular formats of images from > distributed sources. > h2. Non-Goals: > Images are a versatile medium and encompass a very wide range of formats and > representations. This SPIP explicitly aims at the most common use case in the > industry currently: multi-channel matrices of binary, int32, int64, float or > double data that can fit comfortably in the heap of the JVM: > * the total size of an image should be restricted to less than 2GB (roughly) > * the meaning of color channels is application-specific and is not mandated > by the standard (in line with the OpenCV standard) > * specialized formats used in meteorology, the medical field, etc. are not > supported > * this format is specialized to images and does not attempt to solve the more > general problem of representing n-dimensional tensors in Spark > h2. 
Proposed API changes > We propose to add a new package in the package structure, under the MLlib > project: > {{org.apache.spark.image}} > h3. Data format > We propose to add the following structure: > imageSchema = StructType([ > * StructField("mode", StringType(), False), > ** The exact representation of the data. > ** The values are described in the following OpenCV convention. Basically, > the type has both "depth" and "number of channels" info: in particular, type > "CV_8UC3" means "3 channel unsigned bytes". BGRA format would be CV_8UC4 > (value 32 in the table) with the channel order specified by convention. > ** The exact channel ordering and meaning of each channel is dictated by > convention. By default, the order is RGB (3 channels) and BGRA (4 channels). > If the image failed to load, the value is the empty string "". > * StructField("origin", StringType(), True), > ** Some information about the origin of the image. The content of this is
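The OpenCV "mode" convention quoted above packs both depth and channel count into one integer. A minimal sketch of that encoding, mirroring OpenCV's CV_MAKETYPE macro (the numeric values in the SPIP's own table may be presented differently, so no table values are asserted here):

```python
# OpenCV depth codes: CV_8U = unsigned 8-bit, CV_32F = 32-bit float, etc.
CV_8U, CV_8S, CV_16U, CV_16S, CV_32S, CV_32F, CV_64F = range(7)

def cv_maketype(depth, channels):
    """Mirror of OpenCV's CV_MAKETYPE: depth in the low 3 bits, channels above."""
    return (depth & 7) + ((channels - 1) << 3)

def cv_channels(mode):
    """Recover the channel count packed into a mode value."""
    return (mode >> 3) + 1

# "CV_8UC3" -- "3 channel unsigned bytes", as described in the text
mode = cv_maketype(CV_8U, 3)
assert cv_channels(mode) == 3
```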
[jira] [Commented] (SPARK-17445) Reference an ASF page as the main place to find third-party packages
[ https://issues.apache.org/jira/browse/SPARK-17445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15484732#comment-15484732 ] Matei Zaharia commented on SPARK-17445: --- Sounds good to me. > Reference an ASF page as the main place to find third-party packages > > > Key: SPARK-17445 > URL: https://issues.apache.org/jira/browse/SPARK-17445 > Project: Spark > Issue Type: Improvement >Reporter: Matei Zaharia > > Some comments and docs like > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L148-L151 > say to go to spark-packages.org, but since this is a package index > maintained by a third party, it would be better to reference an ASF page that > we can keep updated and own the URL for.
[jira] [Commented] (SPARK-17445) Reference an ASF page as the main place to find third-party packages
[ https://issues.apache.org/jira/browse/SPARK-17445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15480419#comment-15480419 ] Matei Zaharia commented on SPARK-17445: --- Sounds good, but IMO just keep the current supplemental projects there -- don't they fit better into "third-party packages" than "powered by"? I viewed powered by as a list of users, similar to https://wiki.apache.org/hadoop/PoweredBy, but I guess you're viewing it as a list of software that integrates with Spark. > Reference an ASF page as the main place to find third-party packages > > > Key: SPARK-17445 > URL: https://issues.apache.org/jira/browse/SPARK-17445 > Project: Spark > Issue Type: Improvement >Reporter: Matei Zaharia > > Some comments and docs like > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L148-L151 > say to go to spark-packages.org, but since this is a package index > maintained by a third party, it would be better to reference an ASF page that > we can keep updated and own the URL for.
[jira] [Commented] (SPARK-17445) Reference an ASF page as the main place to find third-party packages
[ https://issues.apache.org/jira/browse/SPARK-17445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15479121#comment-15479121 ] Matei Zaharia commented on SPARK-17445: --- The powered by wiki page is a bit of a mess IMO, so I'd separate out the third-party packages from that one. Basically, the powered by page was useful when the project was really new and nobody knew who's using it, but right now it's a snapshot of the users from back then because few new organizations (especially the large ones) list themselves there. Anyway, just linking to this wiki page is nice, though I'd try to rename the page to "Third-Party Packages" instead of "Supplemental Spark Projects" if it's possible to make the old name redirect. > Reference an ASF page as the main place to find third-party packages > > > Key: SPARK-17445 > URL: https://issues.apache.org/jira/browse/SPARK-17445 > Project: Spark > Issue Type: Improvement >Reporter: Matei Zaharia > > Some comments and docs like > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L148-L151 > say to go to spark-packages.org, but since this is a package index > maintained by a third party, it would be better to reference an ASF page that > we can keep updated and own the URL for.
[jira] [Commented] (SPARK-17445) Reference an ASF page as the main place to find third-party packages
[ https://issues.apache.org/jira/browse/SPARK-17445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15474543#comment-15474543 ] Matei Zaharia commented on SPARK-17445: --- I think one part you're missing, Josh, is that spark-packages.org *is* an index of packages from a wide variety of organizations, where anyone can submit a package. Have you looked through it? Maybe there is some concern about which third-party index we highlight on the site, but AFAIK there are no other third-party package indexes. Nonetheless it would make sense to have a stable URL on the Spark homepage that lists them. BTW, in the past, we also used a wiki page to track them: https://cwiki.apache.org/confluence/display/SPARK/Supplemental+Spark+Projects so we could just link to that. The spark-packages site provides some nicer functionality though such as letting anyone add a package with just a GitHub account, listing releases, etc. > Reference an ASF page as the main place to find third-party packages > > > Key: SPARK-17445 > URL: https://issues.apache.org/jira/browse/SPARK-17445 > Project: Spark > Issue Type: Improvement >Reporter: Matei Zaharia > > Some comments and docs like > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L148-L151 > say to go to spark-packages.org, but since this is a package index > maintained by a third party, it would be better to reference an ASF page that > we can keep updated and own the URL for.
[jira] [Created] (SPARK-17445) Reference an ASF page as the main place to find third-party packages
Matei Zaharia created SPARK-17445: - Summary: Reference an ASF page as the main place to find third-party packages Key: SPARK-17445 URL: https://issues.apache.org/jira/browse/SPARK-17445 Project: Spark Issue Type: Improvement Reporter: Matei Zaharia Some comments and docs like https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L148-L151 say to go to spark-packages.org, but since this is a package index maintained by a third party, it would be better to reference an ASF page that we can keep updated and own the URL for.
[jira] [Commented] (SPARK-16031) Add debug-only socket source in Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-16031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15337182#comment-15337182 ] Matei Zaharia commented on SPARK-16031: --- FYI I'll post a PR for this soon. > Add debug-only socket source in Structured Streaming > > > Key: SPARK-16031 > URL: https://issues.apache.org/jira/browse/SPARK-16031 > Project: Spark > Issue Type: New Feature > Components: SQL, Streaming >Reporter: Matei Zaharia >Assignee: Matei Zaharia > > This is a debug-only version of SPARK-15842: for tutorials and debugging of > streaming apps, it would be nice to have a text-based socket source similar > to the one in Spark Streaming. It will clearly be marked as debug-only so > that users don't try to run it in production applications, because this type > of source cannot provide HA without storing a lot of state in Spark.
[jira] [Created] (SPARK-16031) Add debug-only socket source in Structured Streaming
Matei Zaharia created SPARK-16031: - Summary: Add debug-only socket source in Structured Streaming Key: SPARK-16031 URL: https://issues.apache.org/jira/browse/SPARK-16031 Project: Spark Issue Type: New Feature Components: SQL, Streaming Reporter: Matei Zaharia Assignee: Matei Zaharia This is a debug-only version of SPARK-15842: for tutorials and debugging of streaming apps, it would be nice to have a text-based socket source similar to the one in Spark Streaming. It will clearly be marked as debug-only so that users don't try to run it in production applications, because this type of source cannot provide HA without storing a lot of state in Spark.
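A stdlib sketch of the kind of line-delimited text feed such a source reads (the role `nc -lk <port>` plays in the Spark Streaming tutorials). This is plain Python sockets, not the Structured Streaming API; it also shows why a socket source cannot offer HA -- once sent, lines cannot be replayed:

```python
import socket
import threading

# One-shot text server: sends newline-delimited lines, then closes. There is
# no replay on reconnect, which is exactly why this kind of source stays
# debug-only.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))          # ephemeral port
server.listen(1)
port = server.getsockname()[1]

def feed(lines):
    conn, _ = server.accept()
    conn.sendall("".join(line + "\n" for line in lines).encode())
    conn.close()

t = threading.Thread(target=feed, args=(["hello spark", "hello streaming"],))
t.start()

# The consumer side: read lines until EOF, as a text socket source would.
client = socket.create_connection(("127.0.0.1", port))
received = client.makefile().read().splitlines()
client.close()
t.join()
server.close()
assert received == ["hello spark", "hello streaming"]
```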
[jira] [Created] (SPARK-15879) Update logo in UI and docs to add "Apache"
Matei Zaharia created SPARK-15879: - Summary: Update logo in UI and docs to add "Apache" Key: SPARK-15879 URL: https://issues.apache.org/jira/browse/SPARK-15879 Project: Spark Issue Type: Task Components: Documentation, Web UI Reporter: Matei Zaharia We recently added "Apache" to the Spark logo on the website (http://spark.apache.org/images/spark-logo.eps) to have it be the full project name, and we should do the same in the web UI and docs.
[jira] [Assigned] (SPARK-14356) Update spark.sql.execution.debug to work on Datasets
[ https://issues.apache.org/jira/browse/SPARK-14356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia reassigned SPARK-14356: - Assignee: Matei Zaharia > Update spark.sql.execution.debug to work on Datasets > > > Key: SPARK-14356 > URL: https://issues.apache.org/jira/browse/SPARK-14356 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Matei Zaharia >Assignee: Matei Zaharia >Priority: Minor > > Currently it only works on DataFrame, which seems unnecessarily restrictive > for 2.0.
[jira] [Created] (SPARK-14356) Update spark.sql.execution.debug to work on Datasets
Matei Zaharia created SPARK-14356: - Summary: Update spark.sql.execution.debug to work on Datasets Key: SPARK-14356 URL: https://issues.apache.org/jira/browse/SPARK-14356 Project: Spark Issue Type: Bug Components: SQL Reporter: Matei Zaharia Priority: Minor Currently it only works on DataFrame, which seems unnecessarily restrictive for 2.0.
[jira] [Commented] (SPARK-10854) MesosExecutorBackend: Received launchTask but executor was null
[ https://issues.apache.org/jira/browse/SPARK-10854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15038058#comment-15038058 ] Matei Zaharia commented on SPARK-10854: --- Just a note, I saw a log where this happened, and the sequence of events is that the executor logs a launchTask callback before registered(). It could be a synchronization thing or a problem in the Mesos library. > MesosExecutorBackend: Received launchTask but executor was null > --- > > Key: SPARK-10854 > URL: https://issues.apache.org/jira/browse/SPARK-10854 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 1.4.0 > Environment: Spark 1.4.0 > Mesos 0.23.0 > Docker 1.8.1 >Reporter: Kevin Matzen >Priority: Minor > > Sometimes my tasks get stuck in staging. Here's stdout from one such worker. > I'm running mesos-slave inside a docker container with the host's docker > exposed and I'm using Spark's docker support to launch the worker inside its > own container. Both containers are running. I'm using pyspark. I can see > mesos-slave and java running, but I do not see python running. > {noformat} > WARNING: Your kernel does not support swap limit capabilities, memory limited > without swap. 
> Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > 15/09/28 15:02:09 INFO MesosExecutorBackend: Registered signal handlers for > [TERM, HUP, INT] > I0928 15:02:09.65854138 exec.cpp:132] Version: 0.23.0 > 15/09/28 15:02:09 ERROR MesosExecutorBackend: Received launchTask but > executor was null > I0928 15:02:09.70295554 exec.cpp:206] Executor registered on slave > 20150928-044200-1140850698-5050-8-S190 > 15/09/28 15:02:09 INFO MesosExecutorBackend: Registered with Mesos as > executor ID 20150928-044200-1140850698-5050-8-S190 with 1 cpus > 15/09/28 15:02:09 INFO SecurityManager: Changing view acls to: root > 15/09/28 15:02:09 INFO SecurityManager: Changing modify acls to: root > 15/09/28 15:02:09 INFO SecurityManager: SecurityManager: authentication > disabled; ui acls disabled; users with view permissions: Set(root); users > with modify permissions: Set(root) > 15/09/28 15:02:10 INFO Slf4jLogger: Slf4jLogger started > 15/09/28 15:02:10 INFO Remoting: Starting remoting > 15/09/28 15:02:10 INFO Remoting: Remoting started; listening on addresses > :[akka.tcp://sparkExecutor@:56458] > 15/09/28 15:02:10 INFO Utils: Successfully started service 'sparkExecutor' on > port 56458. > 15/09/28 15:02:10 INFO DiskBlockManager: Created local directory at > /tmp/spark-28a21c2d-54cc-40b3-b0c2-cc3624f1a73c/blockmgr-f2336fec-e1ea-44f1-bd5c-9257049d5e7b > 15/09/28 15:02:10 INFO MemoryStore: MemoryStore started with capacity 52.1 MB > 15/09/28 15:02:11 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > 15/09/28 15:02:11 INFO Executor: Starting executor ID > 20150928-044200-1140850698-5050-8-S190 on host > 15/09/28 15:02:11 INFO Utils: Successfully started service > 'org.apache.spark.network.netty.NettyBlockTransferService' on port 57431. 
> 15/09/28 15:02:11 INFO NettyBlockTransferService: Server created on 57431 > 15/09/28 15:02:11 INFO BlockManagerMaster: Trying to register BlockManager > 15/09/28 15:02:11 INFO BlockManagerMaster: Registered BlockManager > {noformat}
[jira] [Created] (SPARK-11733) Allow shuffle readers to request data from just one mapper
Matei Zaharia created SPARK-11733: - Summary: Allow shuffle readers to request data from just one mapper Key: SPARK-11733 URL: https://issues.apache.org/jira/browse/SPARK-11733 Project: Spark Issue Type: Sub-task Reporter: Matei Zaharia This is needed to do broadcast joins. Right now the shuffle reader interface takes a range of reduce IDs but fetches from all maps.
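The interface change above can be pictured with a toy shuffle store keyed by both map ID and reduce ID: a reader that accepts a list of map IDs can fetch one mapper's output (what a broadcast join needs) instead of always reading from every map. This is a hypothetical in-memory stand-in, not Spark's ShuffleManager API:

```python
# (map_id, reduce_id) -> records written by that map task for that reducer
shuffle_store = {}

def write_map_output(map_id, partitioned):
    """Record one map task's output, already split by reduce partition."""
    for reduce_id, recs in partitioned.items():
        shuffle_store[(map_id, reduce_id)] = recs

def read_partition(reduce_id, map_ids):
    # Today's interface effectively fixes map_ids to "all maps"; the proposal
    # lets a caller pass a single map ID.
    return [r for m in map_ids for r in shuffle_store.get((m, reduce_id), [])]

write_map_output(0, {0: ["a"], 1: ["b"]})
write_map_output(1, {0: ["c"]})
assert read_partition(0, [1]) == ["c"]          # fetch from just one mapper
assert read_partition(0, [0, 1]) == ["a", "c"]  # classic all-maps behavior
```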
[jira] [Commented] (SPARK-9999) RDD-like API on top of Catalyst/DataFrame
[ https://issues.apache.org/jira/browse/SPARK-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14961567#comment-14961567 ] Matei Zaharia commented on SPARK-: -- Beyond tuples, you'll also want encoders for other generic classes, such as Seq[T]. They're the cleanest mechanism to get the most type info. Also, from a software engineering point of view it's nice to avoid a central object where you register stuff to allow composition between libraries (basically, see the problems that the Kryo registry creates today). > RDD-like API on top of Catalyst/DataFrame > - > > Key: SPARK- > URL: https://issues.apache.org/jira/browse/SPARK- > Project: Spark > Issue Type: Story > Components: SQL >Reporter: Reynold Xin >Assignee: Michael Armbrust > > The RDD API is very flexible, and as a result harder to optimize its > execution in some cases. The DataFrame API, on the other hand, is much easier > to optimize, but lacks some of the nice perks of the RDD API (e.g. harder to > use UDFs, lack of strong types in Scala/Java). > The goal of Spark Datasets is to provide an API that allows users to easily > express transformations on domain objects, while also providing the > performance and robustness advantages of the Spark SQL execution engine. > h2. Requirements > - *Fast* - In most cases, the performance of Datasets should be equal to or > better than working with RDDs. Encoders should be as fast or faster than > Kryo and Java serialization, and unnecessary conversion should be avoided. > - *Typesafe* - Similar to RDDs, objects and functions that operate on those > objects should provide compile-time safety where possible. When converting > from data where the schema is not known at compile-time (for example data > read from an external source such as JSON), the conversion function should > fail-fast if there is a schema mismatch. 
> - *Support for a variety of object models* - Default encoders should be > provided for a variety of object models: primitive types, case classes, > tuples, POJOs, JavaBeans, etc. Ideally, objects that follow standard > conventions, such as Avro SpecificRecords, should also work out of the box. > - *Java Compatible* - Datasets should provide a single API that works in > both Scala and Java. Where possible, shared types like Array will be used in > the API. Where not possible, overloaded functions should be provided for > both languages. Scala concepts, such as ClassTags should not be required in > the user-facing API. > - *Interoperates with DataFrames* - Users should be able to seamlessly > transition between Datasets and DataFrames, without specifying conversion > boiler-plate. When names used in the input schema line-up with fields in the > given class, no extra mapping should be necessary. Libraries like MLlib > should not need to provide different interfaces for accepting DataFrames and > Datasets as input. > For a detailed outline of the complete proposed API: > [marmbrus/dataset-api|https://github.com/marmbrus/spark/pull/18/files] > For an initial discussion of the design considerations in this API: [design > doc|https://docs.google.com/document/d/1ZVaDqOcLm2-NcS0TElmslHLsEIEwqzt0vBvzpLrV6Ik/edit#]
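The comment about encoders for generic classes like Seq[T], and about avoiding a Kryo-style central registry, can be sketched in miniature. Encoders resolved structurally from type information compose for nested generics, so libraries never have to mutate shared global state. Hypothetical Python stand-in, not the Spark encoder API:

```python
# An "encoder" here is just a function from a value to a tagged, serializable
# form. Generic types are spelled ('seq', inner) for illustration.
def encoder_for(tp):
    """Derive an encoder from type info alone -- no global registry involved."""
    if tp is int:
        return lambda v: ("int", v)
    if tp is str:
        return lambda v: ("str", v)
    if isinstance(tp, tuple) and tp[0] == "seq":
        inner = encoder_for(tp[1])          # recursion gives composition
        return lambda vs: ("seq", [inner(v) for v in vs])
    raise TypeError(f"no encoder for {tp!r}")

seq_of_int = encoder_for(("seq", int))      # analogue of Encoder[Seq[Int]]
assert seq_of_int([1, 2]) == ("seq", [("int", 1), ("int", 2)])
```

Because two libraries deriving encoders this way never touch a shared table, they compose freely, which is the software-engineering point the comment raises against Kryo's registry.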
[jira] [Commented] (SPARK-9850) Adaptive execution in Spark
[ https://issues.apache.org/jira/browse/SPARK-9850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14907518#comment-14907518 ] Matei Zaharia commented on SPARK-9850: -- Hey Imran, this could make sense, but note that the problem will only happen if you have 2000 map *output* partitions, which would've been 2000 reduce tasks normally. Otherwise, you can have as many map *tasks* as needed with fewer partitions. In most jobs, I'd expect data to get significantly smaller after the maps, so we'd catch that. In particular, for choosing between broadcast and shuffle joins this should be fine. We can do something different if we suspect that there is going to be tons of map output *and* we think there's nontrivial planning to be done once we see it. > Adaptive execution in Spark > --- > > Key: SPARK-9850 > URL: https://issues.apache.org/jira/browse/SPARK-9850 > Project: Spark > Issue Type: Epic > Components: Spark Core, SQL >Reporter: Matei Zaharia >Assignee: Yin Huai > Attachments: AdaptiveExecutionInSpark.pdf > > > Query planning is one of the main factors in high performance, but the > current Spark engine requires the execution DAG for a job to be set in > advance. Even with cost-based optimization, it is hard to know the behavior > of data and user-defined functions well enough to always get great execution > plans. This JIRA proposes to add adaptive query execution, so that the engine > can change the plan for each query as it sees what data earlier stages > produced. > We propose adding this to Spark SQL / DataFrames first, using a new API in > the Spark engine that lets libraries run DAGs adaptively. In future JIRAs, > the functionality could be extended to other libraries or the RDD API, but > that is more difficult than adding it in SQL. > I've attached a design doc by Yin Huai and myself explaining how it would > work in more detail. 
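The broadcast-vs-shuffle decision discussed in this comment can be sketched in a few lines: once map stages finish, the engine sees actual (not estimated) output sizes and can rewrite the rest of the plan. The threshold value is illustrative (it happens to match Spark SQL's default autoBroadcastJoinThreshold), not part of the proposal text:

```python
BROADCAST_THRESHOLD = 10 * 1024 * 1024  # bytes; illustrative default

def choose_join(left_map_output, right_map_output):
    """Pick a join strategy from observed map output sizes, adaptively."""
    if min(left_map_output, right_map_output) <= BROADCAST_THRESHOLD:
        return "broadcast"   # ship the small side to every task, no shuffle
    return "shuffle"         # hash-partition both sides across the cluster

assert choose_join(5 * 1024**3, 2 * 1024**2) == "broadcast"
assert choose_join(5 * 1024**3, 4 * 1024**3) == "shuffle"
```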
[jira] [Resolved] (SPARK-9852) Let reduce tasks fetch multiple map output partitions
[ https://issues.apache.org/jira/browse/SPARK-9852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-9852. -- Resolution: Fixed Fix Version/s: 1.6.0 > Let reduce tasks fetch multiple map output partitions > - > > Key: SPARK-9852 > URL: https://issues.apache.org/jira/browse/SPARK-9852 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, SQL >Reporter: Matei Zaharia >Assignee: Matei Zaharia > Fix For: 1.6.0 > >
[jira] [Updated] (SPARK-9852) Let reduce tasks fetch multiple map output partitions
[ https://issues.apache.org/jira/browse/SPARK-9852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-9852: - Summary: Let reduce tasks fetch multiple map output partitions (was: Let HashShuffleFetcher fetch multiple map output partitions) > Let reduce tasks fetch multiple map output partitions > - > > Key: SPARK-9852 > URL: https://issues.apache.org/jira/browse/SPARK-9852 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, SQL >Reporter: Matei Zaharia >Assignee: Matei Zaharia >
[jira] [Resolved] (SPARK-9851) Support submitting map stages individually in DAGScheduler
[ https://issues.apache.org/jira/browse/SPARK-9851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-9851. -- Resolution: Fixed Fix Version/s: 1.6.0 > Support submitting map stages individually in DAGScheduler > -- > > Key: SPARK-9851 > URL: https://issues.apache.org/jira/browse/SPARK-9851 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, SQL >Reporter: Matei Zaharia >Assignee: Matei Zaharia > Fix For: 1.6.0 > >
[jira] [Assigned] (SPARK-9853) Optimize shuffle fetch of contiguous partition IDs
[ https://issues.apache.org/jira/browse/SPARK-9853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia reassigned SPARK-9853: Assignee: Matei Zaharia Optimize shuffle fetch of contiguous partition IDs -- Key: SPARK-9853 URL: https://issues.apache.org/jira/browse/SPARK-9853 Project: Spark Issue Type: Sub-task Components: Spark Core, SQL Reporter: Matei Zaharia Assignee: Matei Zaharia Priority: Minor On the map side, we should be able to serve a block representing multiple partition IDs in one block manager request
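The optimization described in SPARK-9853 amounts to coalescing runs of contiguous partition IDs so each run can be served in one block manager request. A minimal sketch of that range-merging step; the helper name is hypothetical, not Spark's API:

```python
def merge_contiguous(partition_ids):
    """Collapse partition IDs into (start, end) inclusive ranges so that
    each contiguous run can be fetched in a single request."""
    ranges = []
    for pid in sorted(partition_ids):
        if ranges and pid == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], pid)  # extend the current run
        else:
            ranges.append((pid, pid))          # start a new run
    return ranges

# Six partition IDs collapse into three fetch requests instead of six.
print(merge_contiguous([0, 1, 2, 5, 6, 9]))  # [(0, 2), (5, 6), (9, 9)]
```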
[jira] [Resolved] (SPARK-10008) Shuffle locality can take precedence over narrow dependencies for RDDs with both
[ https://issues.apache.org/jira/browse/SPARK-10008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-10008. --- Resolution: Fixed Fix Version/s: 1.5.0 Shuffle locality can take precedence over narrow dependencies for RDDs with both Key: SPARK-10008 URL: https://issues.apache.org/jira/browse/SPARK-10008 Project: Spark Issue Type: Bug Components: Scheduler Reporter: Matei Zaharia Assignee: Matei Zaharia Fix For: 1.5.0 The shuffle locality patch made the DAGScheduler aware of shuffle data, but for RDDs that have both narrow and shuffle dependencies, it can cause them to place tasks based on the shuffle dependency instead of the narrow one. This case is common in iterative join-based algorithms like PageRank and ALS, where one RDD is hash-partitioned and one isn't.
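The fix for SPARK-10008 boils down to a preference order when computing a task's preferred locations: consult the narrow dependency first and only fall back to shuffle output locations when the narrow parent has no preference. A simplified, hypothetical sketch of that ordering (not the DAGScheduler's actual code):

```python
def preferred_locations(narrow_dep_locs, shuffle_locs):
    """Prefer hosts from a narrow (e.g. hash-partitioned parent) dependency;
    use hosts holding shuffle output only as a fallback."""
    if narrow_dep_locs:
        return narrow_dep_locs
    return shuffle_locs

# PageRank-style iterative join: the links RDD is hash-partitioned (the
# narrow side), while the ranks side arrives via a shuffle.
print(preferred_locations(["host1", "host2"], ["host9"]))
print(preferred_locations([], ["host9"]))
```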
[jira] [Assigned] (SPARK-10008) Shuffle locality can take precedence over narrow dependencies for RDDs with both
[ https://issues.apache.org/jira/browse/SPARK-10008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia reassigned SPARK-10008: - Assignee: Matei Zaharia Shuffle locality can take precedence over narrow dependencies for RDDs with both Key: SPARK-10008 URL: https://issues.apache.org/jira/browse/SPARK-10008 Project: Spark Issue Type: Bug Components: Scheduler Reporter: Matei Zaharia Assignee: Matei Zaharia The shuffle locality patch made the DAGScheduler aware of shuffle data, but for RDDs that have both narrow and shuffle dependencies, it can cause them to place tasks based on the shuffle dependency instead of the narrow one. This case is common in iterative join-based algorithms like PageRank and ALS, where one RDD is hash-partitioned and one isn't.
[jira] [Created] (SPARK-10008) Shuffle locality can take precedence over narrow dependencies for RDDs with both
Matei Zaharia created SPARK-10008: - Summary: Shuffle locality can take precedence over narrow dependencies for RDDs with both Key: SPARK-10008 URL: https://issues.apache.org/jira/browse/SPARK-10008 Project: Spark Issue Type: Bug Components: Scheduler Reporter: Matei Zaharia The shuffle locality patch made the DAGScheduler aware of shuffle data, but for RDDs that have both narrow and shuffle dependencies, it can cause them to place tasks based on the shuffle dependency instead of the narrow one. This case is common in iterative join-based algorithms like PageRank and ALS, where one RDD is hash-partitioned and one isn't.
[jira] [Updated] (SPARK-9851) Support submitting map stages individually in DAGScheduler
[ https://issues.apache.org/jira/browse/SPARK-9851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-9851: - Summary: Support submitting map stages individually in DAGScheduler (was: Add support for submitting map stages individually in DAGScheduler) Support submitting map stages individually in DAGScheduler -- Key: SPARK-9851 URL: https://issues.apache.org/jira/browse/SPARK-9851 Project: Spark Issue Type: Sub-task Components: Spark Core, SQL Reporter: Matei Zaharia Assignee: Matei Zaharia
[jira] [Updated] (SPARK-9923) ShuffleMapStage.numAvailableOutputs should be an Int instead of Long
[ https://issues.apache.org/jira/browse/SPARK-9923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-9923: - Labels: Starter (was: ) ShuffleMapStage.numAvailableOutputs should be an Int instead of Long Key: SPARK-9923 URL: https://issues.apache.org/jira/browse/SPARK-9923 Project: Spark Issue Type: Improvement Reporter: Matei Zaharia Priority: Trivial Labels: Starter Not sure why it was made a Long, but every usage assumes it's an Int.
[jira] [Created] (SPARK-9923) ShuffleMapStage.numAvailableOutputs should be an Int instead of Long
Matei Zaharia created SPARK-9923: Summary: ShuffleMapStage.numAvailableOutputs should be an Int instead of Long Key: SPARK-9923 URL: https://issues.apache.org/jira/browse/SPARK-9923 Project: Spark Issue Type: Improvement Reporter: Matei Zaharia Priority: Trivial Not sure why it was made a Long, but every usage assumes it's an Int.
[jira] [Updated] (SPARK-9850) Adaptive execution in Spark
[ https://issues.apache.org/jira/browse/SPARK-9850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-9850: - Issue Type: Epic (was: New Feature) Adaptive execution in Spark --- Key: SPARK-9850 URL: https://issues.apache.org/jira/browse/SPARK-9850 Project: Spark Issue Type: Epic Components: Spark Core, SQL Reporter: Matei Zaharia Assignee: Yin Huai Attachments: AdaptiveExecutionInSpark.pdf Query planning is one of the main factors in high performance, but the current Spark engine requires the execution DAG for a job to be set in advance. Even with cost-based optimization, it is hard to know the behavior of data and user-defined functions well enough to always get great execution plans. This JIRA proposes to add adaptive query execution, so that the engine can change the plan for each query as it sees what data earlier stages produced. We propose adding this to Spark SQL / DataFrames first, using a new API in the Spark engine that lets libraries run DAGs adaptively. In future JIRAs, the functionality could be extended to other libraries or the RDD API, but that is more difficult than adding it in SQL. I've attached a design doc by Yin Huai and myself explaining how it would work in more detail.
[jira] [Updated] (SPARK-9850) Adaptive execution in Spark
[ https://issues.apache.org/jira/browse/SPARK-9850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-9850: - Assignee: Yin Huai Adaptive execution in Spark --- Key: SPARK-9850 URL: https://issues.apache.org/jira/browse/SPARK-9850 Project: Spark Issue Type: New Feature Components: Spark Core, SQL Reporter: Matei Zaharia Assignee: Yin Huai Attachments: AdaptiveExecutionInSpark.pdf Query planning is one of the main factors in high performance, but the current Spark engine requires the execution DAG for a job to be set in advance. Even with cost-based optimization, it is hard to know the behavior of data and user-defined functions well enough to always get great execution plans. This JIRA proposes to add adaptive query execution, so that the engine can change the plan for each query as it sees what data earlier stages produced. We propose adding this to Spark SQL / DataFrames first, using a new API in the Spark engine that lets libraries run DAGs adaptively. In future JIRAs, the functionality could be extended to other libraries or the RDD API, but that is more difficult than adding it in SQL. I've attached a design doc by Yin Huai and myself explaining how it would work in more detail.
[jira] [Assigned] (SPARK-9851) Add support for submitting map stages individually in DAGScheduler
[ https://issues.apache.org/jira/browse/SPARK-9851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia reassigned SPARK-9851: Assignee: Matei Zaharia Add support for submitting map stages individually in DAGScheduler -- Key: SPARK-9851 URL: https://issues.apache.org/jira/browse/SPARK-9851 Project: Spark Issue Type: Sub-task Components: Spark Core, SQL Reporter: Matei Zaharia Assignee: Matei Zaharia
[jira] [Created] (SPARK-9852) Let HashShuffleFetcher fetch multiple map output partitions
Matei Zaharia created SPARK-9852: Summary: Let HashShuffleFetcher fetch multiple map output partitions Key: SPARK-9852 URL: https://issues.apache.org/jira/browse/SPARK-9852 Project: Spark Issue Type: Sub-task Reporter: Matei Zaharia
[jira] [Assigned] (SPARK-9852) Let HashShuffleFetcher fetch multiple map output partitions
[ https://issues.apache.org/jira/browse/SPARK-9852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia reassigned SPARK-9852: Assignee: Matei Zaharia Let HashShuffleFetcher fetch multiple map output partitions --- Key: SPARK-9852 URL: https://issues.apache.org/jira/browse/SPARK-9852 Project: Spark Issue Type: Sub-task Components: Spark Core, SQL Reporter: Matei Zaharia Assignee: Matei Zaharia
[jira] [Created] (SPARK-9851) Add support for submitting map stages individually in DAGScheduler
Matei Zaharia created SPARK-9851: Summary: Add support for submitting map stages individually in DAGScheduler Key: SPARK-9851 URL: https://issues.apache.org/jira/browse/SPARK-9851 Project: Spark Issue Type: Sub-task Reporter: Matei Zaharia
[jira] [Updated] (SPARK-9850) Adaptive execution in Spark
[ https://issues.apache.org/jira/browse/SPARK-9850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-9850: - Attachment: AdaptiveExecutionInSpark.pdf Adaptive execution in Spark --- Key: SPARK-9850 URL: https://issues.apache.org/jira/browse/SPARK-9850 Project: Spark Issue Type: New Feature Components: Spark Core, SQL Reporter: Matei Zaharia Attachments: AdaptiveExecutionInSpark.pdf Query planning is one of the main factors in high performance, but the current Spark engine requires the execution DAG for a job to be set in advance. Even with cost-based optimization, it is hard to know the behavior of data and user-defined functions well enough to always get great execution plans. This JIRA proposes to add adaptive query execution, so that the engine can change the plan for each query as it sees what data earlier stages produced. We propose adding this to Spark SQL / DataFrames first, using a new API in the Spark engine that lets libraries run DAGs adaptively. In future JIRAs, the functionality could be extended to other libraries or the RDD API, but that is more difficult than adding it in SQL. I've attached a design doc by Yin Huai and myself explaining how it would work in more detail.
[jira] [Created] (SPARK-9850) Adaptive execution in Spark
Matei Zaharia created SPARK-9850: Summary: Adaptive execution in Spark Key: SPARK-9850 URL: https://issues.apache.org/jira/browse/SPARK-9850 Project: Spark Issue Type: New Feature Components: Spark Core, SQL Reporter: Matei Zaharia Query planning is one of the main factors in high performance, but the current Spark engine requires the execution DAG for a job to be set in advance. Even with cost-based optimization, it is hard to know the behavior of data and user-defined functions well enough to always get great execution plans. This JIRA proposes to add adaptive query execution, so that the engine can change the plan for each query as it sees what data earlier stages produced. We propose adding this to Spark SQL / DataFrames first, using a new API in the Spark engine that lets libraries run DAGs adaptively. In future JIRAs, the functionality could be extended to other libraries or the RDD API, but that is more difficult than adding it in SQL. I've attached a design doc by Yin Huai and myself explaining how it would work in more detail.
[jira] [Created] (SPARK-9853) Optimize shuffle fetch of contiguous partition IDs
Matei Zaharia created SPARK-9853: Summary: Optimize shuffle fetch of contiguous partition IDs Key: SPARK-9853 URL: https://issues.apache.org/jira/browse/SPARK-9853 Project: Spark Issue Type: Sub-task Reporter: Matei Zaharia Priority: Minor On the map side, we should be able to serve a block representing multiple partition IDs in one block manager request
[jira] [Resolved] (SPARK-9244) Increase some default memory limits
[ https://issues.apache.org/jira/browse/SPARK-9244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-9244. -- Resolution: Fixed Fix Version/s: 1.5.0 Increase some default memory limits --- Key: SPARK-9244 URL: https://issues.apache.org/jira/browse/SPARK-9244 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Matei Zaharia Assignee: Matei Zaharia Priority: Minor Fix For: 1.5.0 There are a few memory limits that people hit often and that we could make higher, especially now that memory sizes have grown. - spark.akka.frameSize: This defaults at 10 but is often hit for map output statuses in large shuffles. AFAIK the memory is not fully allocated up-front, so we can just make this larger and still not affect jobs that never sent a status that large. - spark.executor.memory: Defaults at 512m, which is really small. We can at least increase it to 1g, though this is something users do need to set on their own.
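The two settings named in SPARK-9244 are set in spark-defaults.conf. A sketch of what raising them looks like; the frame-size value here is an arbitrary illustration, not the value the patch actually chose:

```properties
# spark-defaults.conf (sketch)
# Larger Akka frame size (in MB) so big map output status messages fit.
spark.akka.frameSize   128
# Raise the default executor heap from 512m; users should still tune this
# per workload.
spark.executor.memory  1g
```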
[jira] [Created] (SPARK-9244) Increase some default memory limits
Matei Zaharia created SPARK-9244: Summary: Increase some default memory limits Key: SPARK-9244 URL: https://issues.apache.org/jira/browse/SPARK-9244 Project: Spark Issue Type: Improvement Reporter: Matei Zaharia Assignee: Matei Zaharia There are a few memory limits that people hit often and that we could make higher, especially now that memory sizes have grown. - spark.akka.frameSize: This defaults at 10 but is often hit for map output statuses in large shuffles. AFAIK the memory is not fully allocated up-front, so we can just make this larger and still not affect jobs that never sent a status that large. - spark.executor.memory: Defaults at 512m, which is really small. We can at least increase it to 1g, though this is something users do need to set on their own.
[jira] [Updated] (SPARK-8110) DAG visualizations sometimes look weird in Python
[ https://issues.apache.org/jira/browse/SPARK-8110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-8110: - Attachment: Screen Shot 2015-06-04 at 1.51.32 PM.png Screen Shot 2015-06-04 at 1.51.35 PM.png DAG visualizations sometimes look weird in Python - Key: SPARK-8110 URL: https://issues.apache.org/jira/browse/SPARK-8110 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.4.0 Reporter: Matei Zaharia Priority: Minor Attachments: Screen Shot 2015-06-04 at 1.51.32 PM.png, Screen Shot 2015-06-04 at 1.51.35 PM.png Got this by doing sc.textFile("README.md").count() -- there are some RDDs outside of any stages.
[jira] [Created] (SPARK-8110) DAG visualizations sometimes look weird in Python
Matei Zaharia created SPARK-8110: Summary: DAG visualizations sometimes look weird in Python Key: SPARK-8110 URL: https://issues.apache.org/jira/browse/SPARK-8110 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.4.0 Reporter: Matei Zaharia Priority: Minor Got this by doing sc.textFile("README.md").count() -- there are some RDDs outside of any stages.
[jira] [Resolved] (SPARK-7298) Harmonize style of new UI visualizations
[ https://issues.apache.org/jira/browse/SPARK-7298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-7298. -- Resolution: Fixed Fix Version/s: 1.4.0 Harmonize style of new UI visualizations Key: SPARK-7298 URL: https://issues.apache.org/jira/browse/SPARK-7298 Project: Spark Issue Type: Sub-task Components: Web UI Reporter: Patrick Wendell Assignee: Matei Zaharia Priority: Blocker Fix For: 1.4.0 We need to go through all new visualizations in the web UI and make sure they have consistent style. Both consistent with each other and consistent with the rest of the UI.
[jira] [Commented] (SPARK-7261) Change default log level to WARN in the REPL
[ https://issues.apache.org/jira/browse/SPARK-7261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14520366#comment-14520366 ] Matei Zaharia commented on SPARK-7261: -- IMO we can do this even without SPARK-7260 in 1.4, but that one would be nice to have. Change default log level to WARN in the REPL Key: SPARK-7261 URL: https://issues.apache.org/jira/browse/SPARK-7261 Project: Spark Issue Type: Improvement Components: Spark Shell Reporter: Patrick Wendell Assignee: Patrick Wendell Priority: Minor We should add a log4j properties file for the repl (log4j-defaults-repl.properties) that has the level of warning. The main reason for doing this is that we now display nice progress bars in the REPL so the need for task-level INFO messages is much less. A couple other things: 1. I'd block this on SPARK-7260 2. We should say in the repl opening that the log level is set to WARN and explain to people how to change it programmatically. 3. If the user has a log4j properties, it should take precedence over this default of WARN.
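The log4j-defaults-repl.properties file proposed above would look roughly like this. This is a sketch in log4j 1.x syntax; the exact file Spark eventually shipped may differ:

```properties
# log4j-defaults-repl.properties (sketch): quieter REPL logging, since the
# progress bars already cover task-level status.
log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
```

Per point 3 in the comment, a user-supplied log4j.properties on the classpath would take precedence over this default.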
[jira] [Created] (SPARK-6778) SQL contexts in spark-shell and pyspark should both be called sqlContext
Matei Zaharia created SPARK-6778: Summary: SQL contexts in spark-shell and pyspark should both be called sqlContext Key: SPARK-6778 URL: https://issues.apache.org/jira/browse/SPARK-6778 Project: Spark Issue Type: Bug Components: PySpark, Spark Shell Reporter: Matei Zaharia For some reason the Python one is only called sqlCtx. This is pretty confusing.
[jira] [Commented] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms
[ https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391456#comment-14391456 ] Matei Zaharia commented on SPARK-6646: -- Not to rain on the parade here, but I worry that focusing on mobile phones is short-sighted. Does this design present a path forward for the Internet of Things as well? You'd want something that runs on Arduino, Raspberry Pi, etc. We already have MQTT input in Spark Streaming so we could consider using MQTT to replace Netty for shuffle as well. Has anybody benchmarked that? Spark 2.0: Rearchitecting Spark for Mobile Platforms Key: SPARK-6646 URL: https://issues.apache.org/jira/browse/SPARK-6646 Project: Spark Issue Type: Improvement Components: Project Infra Reporter: Reynold Xin Assignee: Reynold Xin Priority: Blocker Attachments: Spark on Mobile - Design Doc - v1.pdf Mobile computing is quickly rising to dominance, and by the end of 2017, it is estimated that 90% of CPU cycles will be devoted to mobile hardware. Spark’s project goal can be accomplished only when Spark runs efficiently for the growing population of mobile users. Designed and optimized for modern data centers and Big Data applications, Spark is unfortunately not a good fit for mobile computing today. In the past few months, we have been prototyping the feasibility of a mobile-first Spark architecture, and today we would like to share with you our findings. This ticket outlines the technical design of Spark’s mobile support, and shares results from several early prototypes. Mobile friendly version of the design doc: https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html
[jira] [Commented] (SPARK-1564) Add JavaScript into Javadoc to turn ::Experimental:: and such into badges
[ https://issues.apache.org/jira/browse/SPARK-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14359017#comment-14359017 ] Matei Zaharia commented on SPARK-1564: -- This is still a valid issue AFAIK, isn't it? These things still show up badly in Javadoc. So we could change the parent issue or something but I'd like to see it fixed. Add JavaScript into Javadoc to turn ::Experimental:: and such into badges - Key: SPARK-1564 URL: https://issues.apache.org/jira/browse/SPARK-1564 Project: Spark Issue Type: Improvement Components: Documentation Reporter: Matei Zaharia Assignee: Andrew Or Priority: Minor
[jira] [Commented] (SPARK-5654) Integrate SparkR into Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-5654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14309782#comment-14309782 ] Matei Zaharia commented on SPARK-5654: -- Yup, there's a tradeoff, but given that this is a language API and not an algorithm, input source or anything like that, I think it's important to support it along with the core engine. R is extremely popular for data science, more so than Python, and it fits well with many existing concepts in Spark. Integrate SparkR into Apache Spark -- Key: SPARK-5654 URL: https://issues.apache.org/jira/browse/SPARK-5654 Project: Spark Issue Type: New Feature Reporter: Shivaram Venkataraman The SparkR project [1] provides a light-weight frontend to launch Spark jobs from R. The project was started at the AMPLab around a year ago and has been incubated as its own project to make sure it can be easily merged into upstream Spark, i.e. not introduce any external dependencies etc. SparkR’s goals are similar to PySpark and shares a similar design pattern as described in our meetup talk[2], Spark Summit presentation[3]. Integrating SparkR into the Apache project will enable R users to use Spark out of the box and given R’s large user base, it will help the Spark project reach more users. Additionally, work in progress features like providing R integration with ML Pipelines and Dataframes can be better achieved by development in a unified code base. SparkR is available under the Apache 2.0 License and does not have any external dependencies other than requiring users to have R and Java installed on their machines. SparkR’s developers come from many organizations including UC Berkeley, Alteryx, Intel and we will support future development, maintenance after the integration. 
[1] https://github.com/amplab-extras/SparkR-pkg [2] http://files.meetup.com/3138542/SparkR-meetup.pdf [3] http://spark-summit.org/2014/talk/sparkr-interactive-r-programs-at-scale-2
[jira] [Resolved] (SPARK-5608) Improve SEO of Spark documentation site to let Google find latest docs
[ https://issues.apache.org/jira/browse/SPARK-5608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-5608. -- Resolution: Fixed Fix Version/s: 1.3.0 Improve SEO of Spark documentation site to let Google find latest docs -- Key: SPARK-5608 URL: https://issues.apache.org/jira/browse/SPARK-5608 Project: Spark Issue Type: Improvement Components: Documentation Reporter: Matei Zaharia Assignee: Matei Zaharia Fix For: 1.3.0 Google currently has trouble finding spark.apache.org/docs/latest, so a lot of the results returned for various queries are from random previous versions of Spark where someone created a link. I'd like to do the following: - Add a sitemap.xml to spark.apache.org that lists all the docs/latest pages (already done) - Add meta description tags on some of the most important doc pages - Shorten the titles of some pages to have more relevant keywords; for example there's no reason to have Spark SQL Programming Guide - Spark 1.2.0 documentation, we can just say Spark SQL - Spark 1.2.0 documentation.
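The second and third items in the list above are small head-of-page changes. An illustrative sketch of what they might look like on one doc page; the description text is invented, not what the site actually uses:

```html
<!-- Head of a docs/latest page: shortened title plus a meta description
     so search results show relevant keywords for the latest docs. -->
<title>Spark SQL - Spark 1.2.0 documentation</title>
<meta name="description"
      content="Spark SQL programming guide: DataFrames, SQL queries, and data sources.">
```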
[jira] [Updated] (SPARK-5088) Use spark-class for running executors directly on mesos
[ https://issues.apache.org/jira/browse/SPARK-5088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-5088: - Fix Version/s: (was: 1.2.1) Use spark-class for running executors directly on mesos --- Key: SPARK-5088 URL: https://issues.apache.org/jira/browse/SPARK-5088 Project: Spark Issue Type: Improvement Components: Deploy, Mesos Affects Versions: 1.2.0 Reporter: Jongyoul Lee Priority: Minor Fix For: 1.3.0 - sbin/spark-executor is only used for running executors in a Mesos environment. - spark-executor calls spark-class internally without specific parameters. - PYTHONPATH is moved to spark-class' case. - Remove a redundant file to simplify code maintenance.
[jira] [Updated] (SPARK-5088) Use spark-class for running executors directly on mesos
[ https://issues.apache.org/jira/browse/SPARK-5088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-5088: - Target Version/s: 1.3.0 (was: 1.3.0, 1.2.1) Use spark-class for running executors directly on mesos --- Key: SPARK-5088 URL: https://issues.apache.org/jira/browse/SPARK-5088 Project: Spark Issue Type: Improvement Components: Deploy, Mesos Affects Versions: 1.2.0 Reporter: Jongyoul Lee Priority: Minor Fix For: 1.3.0 - sbin/spark-executor is only used for running executors in a Mesos environment. - spark-executor calls spark-class internally without specific parameters. - PYTHONPATH is moved to spark-class' case. - Remove a redundant file to simplify code maintenance.
[jira] [Resolved] (SPARK-3619) Upgrade to Mesos 0.21 to work around MESOS-1688
[ https://issues.apache.org/jira/browse/SPARK-3619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-3619. -- Resolution: Fixed Fix Version/s: 1.3.0 Assignee: Jongyoul Lee (was: Timothy Chen) Upgrade to Mesos 0.21 to work around MESOS-1688 --- Key: SPARK-3619 URL: https://issues.apache.org/jira/browse/SPARK-3619 Project: Spark Issue Type: Improvement Components: Mesos Reporter: Matei Zaharia Assignee: Jongyoul Lee Fix For: 1.3.0 The Mesos 0.21 release has a fix for https://issues.apache.org/jira/browse/MESOS-1688, which affects Spark jobs.
[jira] [Commented] (SPARK-4660) JavaSerializer uses wrong classloader
[ https://issues.apache.org/jira/browse/SPARK-4660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14260544#comment-14260544 ] Matei Zaharia commented on SPARK-4660: -- [~pkolaczk] mind sending a pull request against http://github.com/apache/spark for this? It will allow us to run it through the automated tests. It looks like a good fix but this stuff can be tricky. JavaSerializer uses wrong classloader - Key: SPARK-4660 URL: https://issues.apache.org/jira/browse/SPARK-4660 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0, 1.1.1 Reporter: Piotr Kołaczkowski Priority: Critical Attachments: spark-serializer-classloader.patch During testing we found failures when trying to load some classes of the user application:
{noformat}
ERROR 2014-11-29 20:01:56 org.apache.spark.storage.BlockManagerWorker: Exception handling buffer message
java.lang.ClassNotFoundException: org.apache.spark.demo.HttpReceiverCases$HttpRequest
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:270)
at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:59)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
at org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:235)
at org.apache.spark.storage.MemoryStore.putIterator(MemoryStore.scala:126)
at org.apache.spark.storage.MemoryStore.putIterator(MemoryStore.scala:104)
at org.apache.spark.storage.MemoryStore.putBytes(MemoryStore.scala:76)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:748)
at org.apache.spark.storage.BlockManager.putBytes(BlockManager.scala:639)
at org.apache.spark.storage.BlockManagerWorker.putBlock(BlockManagerWorker.scala:92)
at org.apache.spark.storage.BlockManagerWorker.processBlockMessage(BlockManagerWorker.scala:73)
at org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:48)
at org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:48)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at org.apache.spark.storage.BlockMessageArray.foreach(BlockMessageArray.scala:28)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at org.apache.spark.storage.BlockMessageArray.map(BlockMessageArray.scala:28)
at org.apache.spark.storage.BlockManagerWorker.onBlockMessageReceive(BlockManagerWorker.scala:48)
at org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:38)
at org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:38)
at org.apache.spark.network.ConnectionManager.org$apache$spark$network$ConnectionManager$$handleMessage(ConnectionManager.scala:682)
at org.apache.spark.network.ConnectionManager$$anon$10.run(ConnectionManager.scala:520)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
{noformat}
[jira] [Commented] (SPARK-3247) Improved support for external data sources
[ https://issues.apache.org/jira/browse/SPARK-3247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243253#comment-14243253 ] Matei Zaharia commented on SPARK-3247: -- For those looking to learn about the interface in more detail, there is a meetup video on it at https://www.youtube.com/watch?v=GQSNJAzxOr8. Improved support for external data sources -- Key: SPARK-3247 URL: https://issues.apache.org/jira/browse/SPARK-3247 Project: Spark Issue Type: New Feature Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Blocker Fix For: 1.2.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4690) AppendOnlyMap seems not using Quadratic probing as the JavaDoc
[ https://issues.apache.org/jira/browse/SPARK-4690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1429#comment-1429 ] Matei Zaharia commented on SPARK-4690: -- Yup, that's the definition of it. AppendOnlyMap seems not using Quadratic probing as the JavaDoc -- Key: SPARK-4690 URL: https://issues.apache.org/jira/browse/SPARK-4690 Project: Spark Issue Type: Question Components: Spark Core Affects Versions: 1.1.0, 1.2.0, 1.3.0 Reporter: Yijie Shen Priority: Minor org.apache.spark.util.collection.AppendOnlyMap's documentation says: This implementation uses quadratic probing with a power-of-2 hash table size. However, the probe procedure in the face of a hash collision appears to be just linear probing; the code is:
{code}
val delta = i
pos = (pos + delta) & mask
i += 1
{code}
Maybe a bug here?
[jira] [Closed] (SPARK-4690) AppendOnlyMap seems not using Quadratic probing as the JavaDoc
[ https://issues.apache.org/jira/browse/SPARK-4690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia closed SPARK-4690. Resolution: Invalid AppendOnlyMap seems not using Quadratic probing as the JavaDoc -- Key: SPARK-4690 URL: https://issues.apache.org/jira/browse/SPARK-4690 Project: Spark Issue Type: Question Components: Spark Core Affects Versions: 1.1.0, 1.2.0, 1.3.0 Reporter: Yijie Shen Priority: Minor org.apache.spark.util.collection.AppendOnlyMap's documentation says: This implementation uses quadratic probing with a power-of-2 hash table size. However, the probe procedure in the face of a hash collision appears to be just linear probing; the code is:
{code}
val delta = i
pos = (pos + delta) & mask
i += 1
{code}
Maybe a bug here?
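For readers puzzled by the resolution above: the quoted loop increments delta on every probe, so the cumulative offset from the starting slot after i probes is 1 + 2 + ... + i = i(i+1)/2, a quadratic function of i. Each individual step looks linear, but the probe sequence is triangular-number (i.e. quadratic) probing. A standalone sketch in plain Scala (not the actual AppendOnlyMap code) makes this visible:

```scala
// Illustration only: replay the probe loop from the issue description and
// collect the positions it visits. The offsets from the start are the
// triangular numbers 0, 1, 3, 6, 10, 15, ... = i*(i+1)/2 -- quadratic in
// the probe count, which is why this is quadratic, not linear, probing.
object ProbeDemo {
  def probeSequence(start: Int, mask: Int, n: Int): Seq[Int] = {
    var pos = start
    var i = 1
    (0 until n).map { _ =>
      val cur = pos        // position visited on this probe
      val delta = i        // step size grows by 1 each probe
      pos = (pos + delta) & mask
      i += 1
      cur
    }
  }

  def main(args: Array[String]): Unit = {
    val offsets = probeSequence(start = 0, mask = 63, n = 6)
    assert(offsets == Seq(0, 1, 3, 6, 10, 15))
    println(offsets.mkString(", "))
  }
}
```

With a power-of-2 table size, this probe sequence is guaranteed to visit every slot, which is the property the JavaDoc is claiming.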
[jira] [Created] (SPARK-4683) Add a beeline.cmd to run on Windows
Matei Zaharia created SPARK-4683: Summary: Add a beeline.cmd to run on Windows Key: SPARK-4683 URL: https://issues.apache.org/jira/browse/SPARK-4683 Project: Spark Issue Type: New Feature Components: SQL Reporter: Matei Zaharia -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4684) Add a script to run JDBC server on Windows
Matei Zaharia created SPARK-4684: Summary: Add a script to run JDBC server on Windows Key: SPARK-4684 URL: https://issues.apache.org/jira/browse/SPARK-4684 Project: Spark Issue Type: New Feature Components: SQL Reporter: Matei Zaharia Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4685) Update JavaDoc settings to include spark.ml and all spark.mllib subpackages in the right sections
[ https://issues.apache.org/jira/browse/SPARK-4685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-4685: - Priority: Trivial (was: Major) Update JavaDoc settings to include spark.ml and all spark.mllib subpackages in the right sections - Key: SPARK-4685 URL: https://issues.apache.org/jira/browse/SPARK-4685 Project: Spark Issue Type: New Feature Components: Documentation Reporter: Matei Zaharia Priority: Trivial Right now they're listed under other packages on the homepage of the JavaDoc docs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4685) Update JavaDoc settings to include spark.ml and all spark.mllib subpackages in the right sections
Matei Zaharia created SPARK-4685: Summary: Update JavaDoc settings to include spark.ml and all spark.mllib subpackages in the right sections Key: SPARK-4685 URL: https://issues.apache.org/jira/browse/SPARK-4685 Project: Spark Issue Type: New Feature Components: Documentation Reporter: Matei Zaharia Right now they're listed under other packages on the homepage of the JavaDoc docs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4685) Update JavaDoc settings to include spark.ml and all spark.mllib subpackages in the right sections
[ https://issues.apache.org/jira/browse/SPARK-4685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-4685: - Target Version/s: 1.2.1 (was: 1.2.0) Update JavaDoc settings to include spark.ml and all spark.mllib subpackages in the right sections - Key: SPARK-4685 URL: https://issues.apache.org/jira/browse/SPARK-4685 Project: Spark Issue Type: New Feature Components: Documentation Reporter: Matei Zaharia Priority: Trivial Right now they're listed under other packages on the homepage of the JavaDoc docs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4613) Make JdbcRDD easier to use from Java
[ https://issues.apache.org/jira/browse/SPARK-4613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-4613. -- Resolution: Fixed Fix Version/s: 1.2.0 Make JdbcRDD easier to use from Java Key: SPARK-4613 URL: https://issues.apache.org/jira/browse/SPARK-4613 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Matei Zaharia Assignee: Cheng Lian Fix For: 1.2.0 We might eventually deprecate it, but for now it would be nice to expose a Java wrapper that allows users to create this using the java function interface. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4613) Make JdbcRDD easier to use from Java
[ https://issues.apache.org/jira/browse/SPARK-4613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-4613: - Issue Type: Improvement (was: Bug) Make JdbcRDD easier to use from Java Key: SPARK-4613 URL: https://issues.apache.org/jira/browse/SPARK-4613 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Matei Zaharia Assignee: Cheng Lian Fix For: 1.2.0 We might eventually deprecate it, but for now it would be nice to expose a Java wrapper that allows users to create this using the java function interface. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3628) Don't apply accumulator updates multiple times for tasks in result stages
[ https://issues.apache.org/jira/browse/SPARK-3628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-3628. -- Resolution: Fixed Fix Version/s: 1.2.0 Target Version/s: 1.1.2 (was: 0.9.3, 1.0.3, 1.1.2, 1.2.1) Don't apply accumulator updates multiple times for tasks in result stages - Key: SPARK-3628 URL: https://issues.apache.org/jira/browse/SPARK-3628 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Matei Zaharia Assignee: Nan Zhu Priority: Blocker Fix For: 1.2.0 In previous versions of Spark, accumulator updates only got applied once for accumulators that are only used in actions (i.e. result stages), letting you use them to deterministically compute a result. Unfortunately, this got broken in some recent refactorings. This is related to https://issues.apache.org/jira/browse/SPARK-732, but that issue is about applying the same semantics to intermediate stages too, which is more work and may not be what we want for debugging. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3628) Don't apply accumulator updates multiple times for tasks in result stages
[ https://issues.apache.org/jira/browse/SPARK-3628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227077#comment-14227077 ] Matei Zaharia commented on SPARK-3628: -- FYI I merged this into 1.2.0, since the patch is now quite a bit smaller. We should decide whether we want to back port it to branch-1.1, so I'll leave it open for that reason. I don't think there's much point backporting it further because the issue is somewhat rare, but we can do it if people ask for it. Don't apply accumulator updates multiple times for tasks in result stages - Key: SPARK-3628 URL: https://issues.apache.org/jira/browse/SPARK-3628 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Matei Zaharia Assignee: Nan Zhu Priority: Blocker Fix For: 1.2.0 In previous versions of Spark, accumulator updates only got applied once for accumulators that are only used in actions (i.e. result stages), letting you use them to deterministically compute a result. Unfortunately, this got broken in some recent refactorings. This is related to https://issues.apache.org/jira/browse/SPARK-732, but that issue is about applying the same semantics to intermediate stages too, which is more work and may not be what we want for debugging. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-732) Recomputation of RDDs may result in duplicated accumulator updates
[ https://issues.apache.org/jira/browse/SPARK-732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14227108#comment-14227108 ] Matei Zaharia commented on SPARK-732: - As discussed on https://github.com/apache/spark/pull/2524 this is pretty hard to provide good semantics for in the general case (accumulator updates inside non-result stages), for the following reasons:
- An RDD may be computed as part of multiple stages. For example, if you update an accumulator inside a MappedRDD and then shuffle it, that might be one stage. But if you then call map() again on the MappedRDD, and shuffle the result of that, you get a second stage where that map is pipelined. Do you want to count this accumulator update twice or not?
- Entire stages may be resubmitted if shuffle files are deleted by the periodic cleaner or are lost due to a node failure, so anything that tracks RDDs would need to do so for long periods of time (as long as the RDD is referenceable in the user program), which would be pretty complicated to implement.
So I'm going to mark this as won't fix for now, except for the part for result stages done in SPARK-3628. Recomputation of RDDs may result in duplicated accumulator updates -- Key: SPARK-732 URL: https://issues.apache.org/jira/browse/SPARK-732 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.7.0, 0.6.2, 0.7.1, 0.8.0, 0.7.2, 0.7.3, 0.8.1, 0.8.2, 0.9.0, 1.0.1, 1.1.0 Reporter: Josh Rosen Assignee: Nan Zhu Priority: Blocker Currently, Spark doesn't guard against duplicated updates to the same accumulator due to recomputations of an RDD. For example:
{code}
val acc = sc.accumulator(0)
data.map(x => acc += 1; f(x))
data.count()
// acc should equal data.count() here
data.foreach{...}
// Now, acc = 2 * data.count() because the map() was recomputed.
{code}
I think that this behavior is incorrect, especially because this behavior allows the addition or removal of a cache() call to affect the outcome of a computation. There's an old TODO to fix this duplicate update issue in the [DAGScheduler code|https://github.com/mesos/spark/blob/ec5e553b418be43aa3f0ccc24e0d5ca9d63504b2/core/src/main/scala/spark/scheduler/DAGScheduler.scala#L494]. I haven't tested whether recomputation due to blocks being dropped from the cache can trigger duplicate accumulator updates. Hypothetically someone could be relying on the current behavior to implement performance counters that track the actual number of computations performed (including recomputations). To be safe, we could add an explicit warning in the release notes that documents the change in behavior when we fix this. Ignoring duplicate updates shouldn't be too hard, but there are a few subtleties. Currently, we allow accumulators to be used in multiple transformations, so we'd need to detect duplicate updates at the per-transformation level. I haven't dug too deeply into the scheduler internals, but we might also run into problems where pipelining causes what is logically one set of accumulator updates to show up in two different tasks (e.g. rdd.map(accum += x; ...) and rdd.map(accum += x; ...).count() may cause what's logically the same accumulator update to be applied from two different contexts, complicating the detection of duplicate updates).
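The failure mode in the example above can be reproduced without Spark at all: if the lineage is recomputed, the closure's side effect on the accumulator runs a second time. A minimal plain-Scala simulation (all names here are illustrative stand-ins, not Spark APIs):

```scala
// Simulation of duplicated accumulator updates on recomputation.
// `compute()` stands in for an *uncached* RDD lineage: every "action"
// re-runs the map closure, so the accumulator side effect fires again.
object AccumDemo {
  var acc = 0
  val data = Seq(1, 2, 3)

  def compute(): Seq[Int] = data.map { x => acc += 1; x * 2 }

  def main(args: Array[String]): Unit = {
    compute()                    // first action, e.g. count()
    assert(acc == data.size)     // acc == data.count() here, as expected
    compute()                    // second action recomputes the map
    assert(acc == 2 * data.size) // the update was applied twice
    println(s"acc = $acc")
  }
}
```

Inserting a cache between the map and the actions would make the second run skip the closure, which is exactly why adding or removing cache() changes the observed accumulator value.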
[jira] [Reopened] (SPARK-3628) Don't apply accumulator updates multiple times for tasks in result stages
[ https://issues.apache.org/jira/browse/SPARK-3628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia reopened SPARK-3628: -- Don't apply accumulator updates multiple times for tasks in result stages - Key: SPARK-3628 URL: https://issues.apache.org/jira/browse/SPARK-3628 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Matei Zaharia Assignee: Nan Zhu Priority: Blocker Fix For: 1.2.0 In previous versions of Spark, accumulator updates only got applied once for accumulators that are only used in actions (i.e. result stages), letting you use them to deterministically compute a result. Unfortunately, this got broken in some recent refactorings. This is related to https://issues.apache.org/jira/browse/SPARK-732, but that issue is about applying the same semantics to intermediate stages too, which is more work and may not be what we want for debugging. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4613) Make JdbcRDD easier to use from Java
Matei Zaharia created SPARK-4613: Summary: Make JdbcRDD easier to use from Java Key: SPARK-4613 URL: https://issues.apache.org/jira/browse/SPARK-4613 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Matei Zaharia We might eventually deprecate it, but for now it would be better to make it more Java-friendly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4613) Make JdbcRDD easier to use from Java
[ https://issues.apache.org/jira/browse/SPARK-4613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14225615#comment-14225615 ] Matei Zaharia commented on SPARK-4613: -- BTW the strawman for this would be a version of the API that doesn't take Scala function objects for getConnection and mapRow, possibly through a static method on object JdbcRDD. Make JdbcRDD easier to use from Java Key: SPARK-4613 URL: https://issues.apache.org/jira/browse/SPARK-4613 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Matei Zaharia We might eventually deprecate it, but for now it would be better to make it more Java-friendly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
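A hedged sketch of what that strawman might look like: single-abstract-method interfaces that Java code (lambdas or anonymous classes) can implement, plus adapters to the Scala function objects the existing JdbcRDD constructor expects. None of these names are the actual Spark API; they only illustrate the shape of the proposal.

```scala
// Hypothetical Java-friendly facade for JdbcRDD (illustrative names only).
// Java callers implement these SAM interfaces instead of passing Scala
// function objects for getConnection and mapRow.
trait ConnectionFactory { def getConnection(): java.sql.Connection }
trait RowMapper[T] { def mapRow(rs: java.sql.ResultSet): T }

object JdbcRDDJavaApi {
  // Adapt the Java-friendly interfaces to the () => Connection and
  // ResultSet => T functions the Scala constructor takes.
  def toGetConnection(f: ConnectionFactory): () => java.sql.Connection =
    () => f.getConnection()

  def toMapRow[T](m: RowMapper[T]): java.sql.ResultSet => T =
    rs => m.mapRow(rs)
}
```

A static create method on the JdbcRDD companion object could then accept these interfaces directly, which is the direction the comment suggests.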
[jira] [Commented] (SPARK-3633) Fetches failure observed after SPARK-2711
[ https://issues.apache.org/jira/browse/SPARK-3633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222545#comment-14222545 ] Matei Zaharia commented on SPARK-3633: -- [~stephen] you can try the 1.1.1 RC in http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-1-1-RC2-td9439.html, which includes a Maven staging repo that you can just add as a repo in a build. Fetches failure observed after SPARK-2711 - Key: SPARK-3633 URL: https://issues.apache.org/jira/browse/SPARK-3633 Project: Spark Issue Type: Bug Components: Block Manager Affects Versions: 1.1.0 Reporter: Nishkam Ravi Priority: Blocker Running a variant of PageRank on a 6-node cluster with a 30Gb input dataset. Recently upgraded to Spark 1.1. The workload fails with the following error message(s):
{code}
14/09/19 12:10:38 WARN TaskSetManager: Lost task 51.0 in stage 2.1 (TID 552, c1705.halxg.cloudera.com): FetchFailed(BlockManagerId(1, c1706.halxg.cloudera.com, 49612, 0), shuffleId=3, mapId=75, reduceId=120)
14/09/19 12:10:38 INFO DAGScheduler: Resubmitting failed stages
{code}
In order to identify the problem, I carried out change set analysis.
As I go back in time, the error message changes to:
{code}
14/09/21 12:56:54 WARN TaskSetManager: Lost task 35.0 in stage 3.0 (TID 519, c1706.halxg.cloudera.com): java.io.FileNotFoundException: /var/lib/jenkins/workspace/tmp/spark-local-20140921123257-68ee/1c/temp_3a1ade13-b48a-437a-a466-673995304034 (Too many open files)
java.io.FileOutputStream.open(Native Method)
java.io.FileOutputStream.<init>(FileOutputStream.java:221)
org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:117)
org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:185)
org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:197)
org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:145)
org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:51)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
{code}
All the way until Aug 4th. Turns out the problem changeset is 4fde28c.
[jira] [Commented] (SPARK-4452) Shuffle data structures can starve others on the same thread for memory
[ https://issues.apache.org/jira/browse/SPARK-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14216691#comment-14216691 ] Matei Zaharia commented on SPARK-4452: -- BTW I've thought about this more and here's what I'd suggest: try a version where each object is allowed to ramp up to a certain size (say 5 MB) before being subject to the limit, and if that doesn't work, then maybe go for the forced-spilling one. The reason is that as soon as N objects are active, the ShuffleMemoryManager will not let any object ramp up to more than 1/N, so it just has to fill up its current quota and stop. This means that scenarios with very little free memory might only happen at the beginning (when tasks start up). If we can make this work, then we avoid a lot of concurrency problems that would happen with forced spilling. Another improvement would be to make the Spillables request less than 2x their current memory when they ramp up, e.g. 1.5x. They'd then make more requests but it would lead to slower ramp-up and more of a chance for other threads to grab memory. But I think this will have less impact than simply increasing that free minimum amount. Shuffle data structures can starve others on the same thread for memory Key: SPARK-4452 URL: https://issues.apache.org/jira/browse/SPARK-4452 Project: Spark Issue Type: Bug Affects Versions: 1.1.0 Reporter: Tianshuo Deng Assignee: Tianshuo Deng Priority: Blocker When an Aggregator is used with ExternalSorter in a task, spark will create many small files and could cause too many files open error during merging. Currently, ShuffleMemoryManager does not work well when there are 2 spillable objects in a thread, which are ExternalSorter and ExternalAppendOnlyMap(used by Aggregator) in this case. Here is an example: Due to the usage of mapside aggregation, ExternalAppendOnlyMap is created first to read the RDD. It may ask as much memory as it can, which is totalMem/numberOfThreads. 
Then later on when ExternalSorter is created in the same thread, the ShuffleMemoryManager could refuse to allocate more memory to it, since the memory is already given to the previously requested object (ExternalAppendOnlyMap). That causes the ExternalSorter to keep spilling small files (due to the lack of memory). I'm currently working on a PR to address these two issues. It will include the following changes:
1. The ShuffleMemoryManager should not only track the memory usage for each thread, but also the object that holds the memory.
2. The ShuffleMemoryManager should be able to trigger the spilling of a spillable object. In this way, if a new object in a thread is requesting memory, the old occupant could be evicted/spilled. Previously the spillable objects trigger spilling by themselves, so one may not trigger spilling even if another object in the same thread needs more memory. After this change the ShuffleMemoryManager could trigger the spilling of an object if it needs to.
3. Make the iterator of ExternalAppendOnlyMap spillable. Previously ExternalAppendOnlyMap returns a destructive iterator and cannot be spilled after the iterator is returned. This should be changed so that even after the iterator is returned, the ShuffleMemoryManager can still spill it.
Currently, I have a working branch in progress: https://github.com/tsdeng/spark/tree/enhance_memory_manager. Already made change 3 and have a prototype of change 1 and 2 to evict spillables from the memory manager, still in progress. I will send a PR when it's done. Any feedback or thoughts on this change is highly appreciated!
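The ramp-up policy Matei suggests in his comment can be modeled in a few lines: cap each consumer at max(totalMem / numActive, floor), so that a newly created consumer can always grow to a small floor (say 5 MB) even when the 1/N share has shrunk. A toy model under those assumptions, not the real ShuffleMemoryManager:

```scala
// Toy model (illustration only, not Spark's ShuffleMemoryManager) of the
// "ramp up to a certain size before being subject to the limit" idea:
// each consumer may hold up to max(total / numActive, floor) bytes.
object RampUpDemo {
  def grant(requested: Long, held: Long, free: Long,
            numActive: Int, total: Long, floor: Long): Long = {
    val cap = math.max(total / numActive, floor) // per-consumer ceiling
    val allowed = math.max(0L, cap - held)       // room left under the cap
    math.min(math.min(requested, allowed), free)
  }

  def main(args: Array[String]): Unit = {
    val mb = 1L << 20
    // With 30 active consumers and 100 MB total, the 1/N cap (~3.3 MB)
    // falls below the 5 MB floor, so a fresh consumer still gets 5 MB.
    assert(grant(requested = 10 * mb, held = 0, free = 50 * mb,
                 numActive = 30, total = 100 * mb, floor = 5 * mb) == 5 * mb)
  }
}
```

Without the floor, a consumer arriving late (large numActive) could be capped at almost nothing before it has buffered a single spill's worth of data, which is the starvation scenario this issue describes.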
[jira] [Commented] (SPARK-4452) Shuffle data structures can starve others on the same thread for memory
[ https://issues.apache.org/jira/browse/SPARK-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14217331#comment-14217331 ] Matei Zaharia commented on SPARK-4452: -- Forced spilling is orthogonal to how you set the limits actually. For example, if there are N objects, one way to set limits is to reserve at least 1/N of memory for each one. But another way would be to group them by thread, and use a different algorithm for allocation within a thread (e.g. set each object's cap to more if other objects in their thread are using less). Whether you force spilling or not, you'll have to decide what the right limit for each thing is. Shuffle data structures can starve others on the same thread for memory Key: SPARK-4452 URL: https://issues.apache.org/jira/browse/SPARK-4452 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Tianshuo Deng Assignee: Tianshuo Deng Priority: Critical When an Aggregator is used with ExternalSorter in a task, Spark will create many small files and could cause too many open files errors during merging. Currently, ShuffleMemoryManager does not work well when there are 2 spillable objects in a thread, which are ExternalSorter and ExternalAppendOnlyMap (used by Aggregator) in this case. Here is an example: Due to the usage of map-side aggregation, ExternalAppendOnlyMap is created first to read the RDD. It may ask as much memory as it can, which is totalMem/numberOfThreads. Then later on when ExternalSorter is created in the same thread, the ShuffleMemoryManager could refuse to allocate more memory to it, since the memory is already given to the previously requested object (ExternalAppendOnlyMap). That causes the ExternalSorter to keep spilling small files (due to the lack of memory). I'm currently working on a PR to address these two issues. It will include the following changes: 1.
The ShuffleMemoryManager should not only track the memory usage for each thread, but also the object that holds the memory.
2. The ShuffleMemoryManager should be able to trigger the spilling of a spillable object. In this way, if a new object in a thread is requesting memory, the old occupant could be evicted/spilled. Previously the spillable objects trigger spilling by themselves, so one may not trigger spilling even if another object in the same thread needs more memory. After this change the ShuffleMemoryManager could trigger the spilling of an object if it needs to.
3. Make the iterator of ExternalAppendOnlyMap spillable. Previously ExternalAppendOnlyMap returns a destructive iterator and cannot be spilled after the iterator is returned. This should be changed so that even after the iterator is returned, the ShuffleMemoryManager can still spill it.
Currently, I have a working branch in progress: https://github.com/tsdeng/spark/tree/enhance_memory_manager. Already made change 3 and have a prototype of change 1 and 2 to evict spillables from the memory manager, still in progress. I will send a PR when it's done. Any feedback or thoughts on this change is highly appreciated!
[jira] [Commented] (SPARK-4452) Shuffle data structures can starve others on the same thread for memory
[ https://issues.apache.org/jira/browse/SPARK-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14215425#comment-14215425 ] Matei Zaharia commented on SPARK-4452: -- How much of this gets fixed if you fix the elementsRead bug in ExternalSorter? With forcing data structures to spill, the problem is that it will introduce complexity in every spillable data structure. I wonder if we can make it just give out memory in smaller increments, so that threads check whether they should spill more often. In addition, we can set a better minimum or maximum on each thread (e.g. always let it ramp up to, say, 5 MB, or some fraction of the memory space). I do like the idea of making the ShuffleMemoryManager track limits per object. I actually considered this when I wrote that and didn't do it, possibly because it would've created more complexity in figuring out when an object is done. But it seems like it should be straightforward to add in, as long as you also track which objects come from which thread so that you can still releaseMemoryForThisThread() to clean up. Shuffle data structures can starve others on the same thread for memory Key: SPARK-4452 URL: https://issues.apache.org/jira/browse/SPARK-4452 Project: Spark Issue Type: Bug Affects Versions: 1.1.0 Reporter: Tianshuo Deng When an Aggregator is used with ExternalSorter in a task, spark will create many small files and could cause too many files open error during merging. This happens when using the sort-based shuffle. The issue is caused by multiple factors: 1. There seems to be a bug in setting the elementsRead variable in ExternalSorter, which renders the trackMemoryThreshold(defined in Spillable) useless for triggering spilling, the pr to fix it is https://github.com/apache/spark/pull/3302 2. Current ShuffleMemoryManager does not work well when there are 2 spillable objects in a thread, which are ExternalSorter and ExternalAppendOnlyMap(used by Aggregator) in this case. 
Here is an example: due to the use of map-side aggregation, ExternalAppendOnlyMap is created first to read the RDD. It may ask for as much memory as it can get, which is totalMem/numberOfThreads. Later, when ExternalSorter is created in the same thread, the ShuffleMemoryManager may refuse to allocate more memory to it, since the memory has already been given to the previously requesting object (ExternalAppendOnlyMap). That causes the ExternalSorter to keep spilling small files (due to the lack of memory).

I'm currently working on a PR to address these two issues. It will include the following changes:
1. The ShuffleMemoryManager should track not only the memory usage of each thread, but also the object that holds the memory.
2. The ShuffleMemoryManager should be able to trigger the spilling of a spillable object. That way, if a new object in a thread requests memory, the old occupant can be evicted/spilled, which prevents problem 2 from happening. Previously, spillable objects triggered spilling by themselves, so one object might not trigger spilling even when another object in the same thread needed more memory. After this change, the ShuffleMemoryManager can trigger the spilling of an object whenever it needs to.
3. Make the iterator of ExternalAppendOnlyMap spillable. Previously, ExternalAppendOnlyMap returned a destructive iterator and could not be spilled after the iterator was returned. This should be changed so that even after the iterator is returned, the ShuffleMemoryManager can still spill it.

Currently, I have a working branch in progress: https://github.com/tsdeng/spark/tree/enhance_memory_manager. I have already made change 3 and have a prototype of changes 1 and 2 to evict spillables from the memory manager, still in progress. I will send a PR when it's done. Any feedback or thoughts on this change are highly appreciated!
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
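The "smaller increments with a per-thread floor" idea from the comment above can be sketched as follows. This is a hypothetical toy model only (class and method names are invented, not Spark's ShuffleMemoryManager internals): the manager hands out memory one small quantum at a time, so a consumer re-checks its fair share often, and a thread is always allowed to ramp up to a small floor before being refused.

```python
import threading

class IncrementalMemoryManager:
    """Toy model of a shuffle memory manager that grants memory in
    small increments so callers re-check their fair share often.
    Hypothetical sketch; names do not match Spark's internals."""

    def __init__(self, total, quantum=1 << 20, floor=5 << 20):
        self.total = total        # total shuffle memory in bytes
        self.quantum = quantum    # at most 1 MB granted per request
        self.floor = floor        # always let a thread ramp up to 5 MB
        self.used = {}            # thread id -> bytes currently held
        self.lock = threading.Lock()

    def try_to_acquire(self, requested):
        """Grant at most one quantum, capped by the thread's fair share
        (total / number of active threads) but never below the floor.
        Returns the number of bytes granted; 0 means the caller should spill."""
        tid = threading.get_ident()
        with self.lock:
            self.used.setdefault(tid, 0)
            fair_share = self.total // len(self.used)
            cap = max(self.floor, fair_share)
            free = self.total - sum(self.used.values())
            grant = max(0, min(requested, self.quantum,
                               cap - self.used[tid], free))
            self.used[tid] += grant
            return grant

    def release_all_for_this_thread(self):
        with self.lock:
            self.used.pop(threading.get_ident(), None)
```

Because each grant is small, a greedy consumer asking for 10 MB gets it in 1 MB steps and revisits the fair-share check on every step instead of locking in a large block up front.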
[jira] [Commented] (SPARK-4452) Shuffle data structures can starve others on the same thread for memory
[ https://issues.apache.org/jira/browse/SPARK-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14215557#comment-14215557 ] Matei Zaharia commented on SPARK-4452:
--

BTW, we may also want to create a separate JIRA for the short-term fix for 1.1 and 1.2.

Shuffle data structures can starve others on the same thread for memory
Key: SPARK-4452
URL: https://issues.apache.org/jira/browse/SPARK-4452
Project: Spark
Issue Type: Bug
Affects Versions: 1.1.0
Reporter: Tianshuo Deng
Assignee: Tianshuo Deng
Priority: Blocker
[jira] [Commented] (SPARK-4452) Shuffle data structures can starve others on the same thread for memory
[ https://issues.apache.org/jira/browse/SPARK-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14215556#comment-14215556 ] Matei Zaharia commented on SPARK-4452:
--

Got it. It would be fine to do this if you found it to help; I was just wondering whether simpler fixes would get us far enough.

For the forced-spilling change, I'd suggest writing a short design doc, or making sure that the comments in the code about it are very detailed (essentially having a design doc at the top of the class). This can have a lot of tricky cases due to concurrency, so it's important to document the design.

Shuffle data structures can starve others on the same thread for memory
Key: SPARK-4452
URL: https://issues.apache.org/jira/browse/SPARK-4452
Project: Spark
Issue Type: Bug
Affects Versions: 1.1.0
Reporter: Tianshuo Deng
Assignee: Tianshuo Deng
Priority: Blocker
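The forced-spilling design under discussion (per-object tracking plus manager-triggered eviction) can be sketched as a toy model. This is hypothetical illustration only, not the actual patch: the manager records which object holds which bytes, and when a newer object in the same thread is starved, it forces the old occupant to spill instead of waiting for the occupant to decide on its own.

```python
class SpillableMemoryManager:
    """Toy model of proposed changes 1 and 2: track memory per
    (thread, object) and let the manager force an old occupant to
    spill when a newer object in the same thread is starved.
    Hypothetical sketch; not Spark's ShuffleMemoryManager."""

    def __init__(self, total):
        self.total = total
        self.holdings = {}  # (thread_id, obj) -> bytes held

    def _free(self):
        return self.total - sum(self.holdings.values())

    def acquire(self, thread_id, obj, requested):
        if self._free() < requested:
            # Starved: evict another object owned by the SAME thread,
            # rather than waiting for it to spill by itself.
            for (tid, other), _held in list(self.holdings.items()):
                if tid == thread_id and other is not obj:
                    other.spill()                    # manager-triggered spill
                    del self.holdings[(tid, other)]  # we know who held it
                    break
        granted = min(requested, self._free())
        if granted > 0:
            key = (thread_id, obj)
            self.holdings[key] = self.holdings.get(key, 0) + granted
        return granted

class FakeSpillable:
    """Stand-in for a spillable structure (e.g. ExternalAppendOnlyMap)."""
    def __init__(self):
        self.spilled = False
    def spill(self):
        self.spilled = True
```

In the starvation scenario from the description, the map grabs the whole budget first; when the sorter then asks for memory, the manager evicts the map rather than letting the sorter spill tiny files forever.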
[jira] [Updated] (SPARK-4306) LogisticRegressionWithLBFGS support for PySpark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-4306:
-
Target Version/s: 1.2.0

LogisticRegressionWithLBFGS support for PySpark MLlib
Key: SPARK-4306
URL: https://issues.apache.org/jira/browse/SPARK-4306
Project: Spark
Issue Type: New Feature
Components: MLlib, PySpark
Reporter: Varadharajan
Labels: newbie
Original Estimate: 48h
Remaining Estimate: 48h

Currently we support LogisticRegressionWithSGD in the PySpark MLlib interface. This task is to add support for the LogisticRegressionWithLBFGS algorithm.
[jira] [Commented] (SPARK-4306) LogisticRegressionWithLBFGS support for PySpark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14214134#comment-14214134 ] Matei Zaharia commented on SPARK-4306:
--

[~srinathsmn] I've assigned it to you. When do you think you'll get this done? It would be great to include it in 1.2, but for that we'd need it quite soon (say, this week). If you don't have time, I can also assign it to someone else.
[jira] [Updated] (SPARK-4306) LogisticRegressionWithLBFGS support for PySpark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-4306:
-
Assignee: Varadharajan
[jira] [Created] (SPARK-4435) Add setThreshold in Python LogisticRegressionModel and SVMModel
Matei Zaharia created SPARK-4435:
Summary: Add setThreshold in Python LogisticRegressionModel and SVMModel
Key: SPARK-4435
URL: https://issues.apache.org/jira/browse/SPARK-4435
Project: Spark
Issue Type: New Feature
Components: MLlib, PySpark
Reporter: Matei Zaharia
[jira] [Commented] (SPARK-4434) spark-submit cluster deploy mode JAR URLs are broken in 1.1.1
[ https://issues.apache.org/jira/browse/SPARK-4434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14214155#comment-14214155 ] Matei Zaharia commented on SPARK-4434:
--

[~joshrosen] make sure to revert this on 1.2 and master as well.

spark-submit cluster deploy mode JAR URLs are broken in 1.1.1
Key: SPARK-4434
URL: https://issues.apache.org/jira/browse/SPARK-4434
Project: Spark
Issue Type: Bug
Components: Deploy, Spark Core
Affects Versions: 1.1.1, 1.2.0
Reporter: Josh Rosen
Assignee: Andrew Or
Priority: Blocker

When submitting a driver using {{spark-submit}} in cluster mode, Spark 1.1.0 allowed you to omit the {{file://}} or {{hdfs://}} prefix from the application JAR URL, e.g.
{code}
./bin/spark-submit --deploy-mode cluster --master spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi /Users/joshrosen/Documents/old-spark-releases/spark-1.1.0-bin-hadoop1/lib/spark-examples-1.1.0-hadoop1.0.4.jar
{code}
In Spark 1.1.1 and 1.2.0, this same command now fails with an error:
{code}
./bin/spark-submit --deploy-mode cluster --master spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi /Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar
Jar url 'file:/Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar' is not in valid format. Must be a jar file path in URL format (e.g. hdfs://XX.jar, file://XX.jar)
Usage: DriverClient [options] launch active-master jar-url main-class [driver options]
Usage: DriverClient kill active-master driver-id
{code}
I tried changing my URL to conform to the new format, but this either resulted in an error or a job that failed:
{code}
./bin/spark-submit --deploy-mode cluster --master spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi file:///Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar
Jar url 'file:///Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar' is not in valid format. Must be a jar file path in URL format (e.g. hdfs://XX.jar, file://XX.jar)
{code}
If I omit the extra slash:
{code}
./bin/spark-submit --deploy-mode cluster --master spark://joshs-mbp.att.net:7077 --class org.apache.spark.examples.SparkPi file://Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar
Sending launch command to spark://joshs-mbp.att.net:7077
Driver successfully submitted as driver-20141116143235-0002
... waiting before polling master for driver state
... polling master for driver state
State of driver-20141116143235-0002 is ERROR
Exception from cluster was: java.lang.IllegalArgumentException: Wrong FS: file://Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar, expected: file:///
java.lang.IllegalArgumentException: Wrong FS: file://Users/joshrosen/Documents/Spark/examples/target/scala-2.10/spark-examples_2.10-1.1.2-SNAPSHOT.jar, expected: file:///
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:381)
at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:55)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:393)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:329)
at org.apache.spark.deploy.worker.DriverRunner.org$apache$spark$deploy$worker$DriverRunner$$downloadUserJar(DriverRunner.scala:157)
at org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:74)
{code}
This bug effectively prevents users from using {{spark-submit}} in cluster mode to run drivers whose JARs are stored on shared cluster filesystems.
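The "Wrong FS ... expected: file:///" error comes from standard URL syntax: in a URL, the text between "//" and the next "/" is the authority (host) component, so "file://Users/..." names a host called "Users" and loses the first path segment. A quick stdlib demonstration (independent of Spark or Hadoop):

```python
from urllib.parse import urlparse

# "file://Users/..." parses "Users" as the host, not part of the path,
# which is why Hadoop's RawLocalFileSystem rejects it with "Wrong FS".
two_slashes = urlparse("file://Users/joshrosen/app.jar")
three_slashes = urlparse("file:///Users/joshrosen/app.jar")

assert two_slashes.netloc == "Users"              # path lost its first segment
assert two_slashes.path == "/joshrosen/app.jar"
assert three_slashes.netloc == ""                 # empty authority: local path
assert three_slashes.path == "/Users/joshrosen/app.jar"
```

So {{file:///path}} (empty authority) is the only well-formed way to write a local absolute path, which is what the DriverClient's URL validation should have accepted.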
[jira] [Created] (SPARK-4439) Export RandomForest in Python
Matei Zaharia created SPARK-4439:
Summary: Export RandomForest in Python
Key: SPARK-4439
URL: https://issues.apache.org/jira/browse/SPARK-4439
Project: Spark
Issue Type: New Feature
Components: MLlib, PySpark
Reporter: Matei Zaharia
[jira] [Updated] (SPARK-4439) Expose RandomForest in Python
[ https://issues.apache.org/jira/browse/SPARK-4439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-4439:
-
Summary: Expose RandomForest in Python (was: Export RandomForest in Python)

Expose RandomForest in Python
Key: SPARK-4439
URL: https://issues.apache.org/jira/browse/SPARK-4439
Project: Spark
Issue Type: New Feature
Components: MLlib, PySpark
Reporter: Matei Zaharia
[jira] [Resolved] (SPARK-4330) Link to proper URL for YARN overview
[ https://issues.apache.org/jira/browse/SPARK-4330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-4330.
--
Resolution: Fixed
Fix Version/s: 1.2.0, 1.1.1
Target Version/s: (was: 1.3.0)

Link to proper URL for YARN overview
Key: SPARK-4330
URL: https://issues.apache.org/jira/browse/SPARK-4330
Project: Spark
Issue Type: Improvement
Components: Documentation
Affects Versions: 1.3.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta
Priority: Minor
Fix For: 1.1.1, 1.2.0

In running-on-yarn.md there is a link to the YARN overview, but the URL points to the YARN alpha documentation; it should point to the stable version.
[jira] [Updated] (SPARK-4330) Link to proper URL for YARN overview
[ https://issues.apache.org/jira/browse/SPARK-4330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-4330:
-
Assignee: Kousuke Saruta
[jira] [Commented] (SPARK-4303) [MLLIB] Use Long IDs instead of Int in ALS.Rating class
[ https://issues.apache.org/jira/browse/SPARK-4303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14203147#comment-14203147 ] Matei Zaharia commented on SPARK-4303:
--

Yup, this will actually become easier with the new pipeline API, but it's probably not going to happen in 1.2.

[MLLIB] Use Long IDs instead of Int in ALS.Rating class
Key: SPARK-4303
URL: https://issues.apache.org/jira/browse/SPARK-4303
Project: Spark
Issue Type: Improvement
Components: MLlib
Reporter: Jia Xu

In many big data recommendation applications, the IDs used for users and products are usually of type Long instead of Int. A Rating class based on Long IDs would therefore be more useful for these applications, i.e.
case class Rating(val user: Long, val product: Long, val rating: Double)
[jira] [Resolved] (SPARK-4186) Support binaryFiles and binaryRecords API in Python
[ https://issues.apache.org/jira/browse/SPARK-4186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-4186.
--
Resolution: Fixed
Fix Version/s: 1.2.0

Support binaryFiles and binaryRecords API in Python
Key: SPARK-4186
URL: https://issues.apache.org/jira/browse/SPARK-4186
Project: Spark
Issue Type: New Feature
Components: PySpark, Spark Core
Reporter: Matei Zaharia
Assignee: Davies Liu
Fix For: 1.2.0

After SPARK-2759, we should expose these methods in Python. Shouldn't be too hard to add.
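For readers unfamiliar with what the binaryRecords API does, its record-splitting behavior on a single file can be sketched locally with the stdlib. This is a hypothetical toy model of the semantics (a flat binary stream split into fixed-length records), with no Spark involved; the function name and signature are invented for illustration:

```python
import io
import struct

def binary_records(stream, record_length):
    """Toy local model of splitting one flat binary file into
    fixed-length records, as sc.binaryRecords does per file.
    Hypothetical sketch, not PySpark's implementation."""
    while True:
        rec = stream.read(record_length)
        if not rec:
            return  # clean end of stream
        if len(rec) != record_length:
            raise ValueError("file length is not a multiple of record_length")
        yield rec

# Five 4-byte big-endian integers packed back to back.
data = b"".join(struct.pack(">i", n) for n in range(5))
records = list(binary_records(io.BytesIO(data), 4))
assert [struct.unpack(">i", r)[0] for r in records] == [0, 1, 2, 3, 4]
```

The key property is that record boundaries come from a fixed length rather than delimiters, so a trailing partial record indicates a corrupt or mis-sized file.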
[jira] [Resolved] (SPARK-644) Jobs canceled due to repeated executor failures may hang
[ https://issues.apache.org/jira/browse/SPARK-644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-644.
-
Resolution: Fixed

Jobs canceled due to repeated executor failures may hang
Key: SPARK-644
URL: https://issues.apache.org/jira/browse/SPARK-644
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 0.6.1
Reporter: Josh Rosen
Assignee: Josh Rosen

In order to prevent an infinite loop, the standalone master aborts jobs that experience more than 10 executor failures (see https://github.com/mesos/spark/pull/210). Currently, the master crashes when aborting jobs (this is the issue that uncovered SPARK-643). If we fix the crash, which involves removing a {{throw}} from the actor's {{receive}} method, then these failures can lead to a hang, because they cause the job to be removed from the master's scheduler while the upstream scheduler components aren't notified of the failure and will wait for the job to finish.

I've considered fixing this by adding additional callbacks to propagate the failure to the higher-level schedulers. It might be cleaner to move the decision to abort the job into the higher-level layers of the scheduler, sending an {{AbortJob(jobId)}} message to the Master. The Client is already notified of executor state changes, so it may be able to make the decision to abort (or defer that decision to a higher layer).
[jira] [Resolved] (SPARK-643) Standalone master crashes during actor restart
[ https://issues.apache.org/jira/browse/SPARK-643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-643.
-
Resolution: Fixed

Standalone master crashes during actor restart
Key: SPARK-643
URL: https://issues.apache.org/jira/browse/SPARK-643
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 0.6.1
Reporter: Josh Rosen
Assignee: Josh Rosen

The standalone master will crash if it restarts due to an exception:
{code}
12/12/15 03:10:47 ERROR master.Master: Job SkewBenchmark wth ID job-20121215031047- failed 11 times.
spark.SparkException: Job SkewBenchmark wth ID job-20121215031047- failed 11 times.
at spark.deploy.master.Master$$anonfun$receive$1.apply(Master.scala:103)
at spark.deploy.master.Master$$anonfun$receive$1.apply(Master.scala:62)
at akka.actor.Actor$class.apply(Actor.scala:318)
at spark.deploy.master.Master.apply(Master.scala:17)
at akka.actor.ActorCell.invoke(ActorCell.scala:626)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:197)
at akka.dispatch.Mailbox.run(Mailbox.scala:179)
at akka.dispatch.ForkJoinExecutorConfigurator$MailboxExecutionTask.exec(AbstractDispatcher.scala:516)
at akka.jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:259)
at akka.jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:975)
at akka.jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1479)
at akka.jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)
12/12/15 03:10:47 INFO master.Master: Starting Spark master at spark://ip-10-226-87-193:7077
12/12/15 03:10:47 INFO io.IoWorker: IoWorker thread 'spray-io-worker-1' started
12/12/15 03:10:47 ERROR master.Master: Failed to create web UI
akka.actor.InvalidActorNameException: actor name HttpServer is not unique! [05aed000-4665-11e2-b361-12313d316833]
at akka.actor.ActorCell.actorOf(ActorCell.scala:392)
at akka.actor.LocalActorRefProvider$Guardian$$anonfun$receive$1.liftedTree1$1(ActorRefProvider.scala:394)
at akka.actor.LocalActorRefProvider$Guardian$$anonfun$receive$1.apply(ActorRefProvider.scala:394)
at akka.actor.LocalActorRefProvider$Guardian$$anonfun$receive$1.apply(ActorRefProvider.scala:392)
at akka.actor.Actor$class.apply(Actor.scala:318)
at akka.actor.LocalActorRefProvider$Guardian.apply(ActorRefProvider.scala:388)
at akka.actor.ActorCell.invoke(ActorCell.scala:626)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:197)
at akka.dispatch.Mailbox.run(Mailbox.scala:179)
at akka.dispatch.ForkJoinExecutorConfigurator$MailboxExecutionTask.exec(AbstractDispatcher.scala:516)
at akka.jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:259)
at akka.jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:975)
at akka.jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1479)
at akka.jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)
{code}
When the Master actor restarts, Akka calls the {{postRestart}} hook. [By default|http://doc.akka.io/docs/akka/snapshot/general/supervision.html#supervision-restart], this calls {{preStart}}. The standalone master's {{preStart}} method tries to start the web UI but crashes because it is already running. I ran into this after a job failed more than 11 times, which caused the Master to throw a SparkException from its {{receive}} method. The solution is to implement a custom {{postRestart}} hook.
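The restart-hook pattern described above can be modeled outside Akka. This is a hypothetical Python toy (the real fix is a Scala {{postRestart}} override): the default restart path re-runs the start hook and crashes on the still-running web UI, so the custom hook simply skips resources that survive a restart.

```python
class ToyMaster:
    """Toy model of the SPARK-643 fix: a custom post-restart hook that
    does NOT re-run the start hook, because the web UI survives the
    actor restart. Hypothetical; not Akka or Spark code."""

    def __init__(self):
        self.web_ui_started = False
        self.pre_start()

    def pre_start(self):
        # Runs on initial start; starting the web UI twice is an error,
        # mirroring Akka's InvalidActorNameException for HttpServer.
        if self.web_ui_started:
            raise RuntimeError("actor name HttpServer is not unique!")
        self.web_ui_started = True

    def post_restart(self, reason):
        # Akka's default post-restart hook calls pre_start() again,
        # which would crash here; the fix is to override it to skip
        # re-creating resources that are still running.
        pass

    def restart(self, reason):
        self.post_restart(reason)
```

With the override in place, a restart after "job failed 11 times" leaves the master running instead of crashing on the duplicate HttpServer actor.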
[jira] [Commented] (SPARK-677) PySpark should not collect results through local filesystem
[ https://issues.apache.org/jira/browse/SPARK-677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200514#comment-14200514 ] Matei Zaharia commented on SPARK-677:
-

[~joshrosen] is this fixed now?

PySpark should not collect results through local filesystem
Key: SPARK-677
URL: https://issues.apache.org/jira/browse/SPARK-677
Project: Spark
Issue Type: Improvement
Components: PySpark
Affects Versions: 0.7.0
Reporter: Josh Rosen

Py4J is slow when transferring large arrays, so PySpark currently dumps data to disk and reads it back in order to collect() RDDs. On large enough datasets, this data will spill from the buffer cache and be written to the physical disk, resulting in terrible performance. Instead, we should stream the data from Java to Python over a local socket or a FIFO.
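The proposed fix (streaming results over a local socket with length-prefixed framing instead of a temp file) can be sketched with the stdlib. This is a hypothetical stand-alone model, not PySpark's actual implementation; the serving side plays the JVM role and the collecting side plays Python's:

```python
import socket
import struct
import threading

def serve_records(records, listening_sock):
    """'JVM' side: write length-prefixed byte records to a local socket
    instead of dumping them to a temp file on disk. Hypothetical sketch."""
    conn, _addr = listening_sock.accept()
    with conn:
        for rec in records:
            conn.sendall(struct.pack(">I", len(rec)) + rec)
        conn.sendall(struct.pack(">I", 0))  # zero length marks end of stream

def _read_exact(conn, n):
    """Read exactly n bytes (recv may return short reads)."""
    buf = b""
    while len(buf) < n:
        chunk = conn.recv(n - len(buf))
        if not chunk:
            raise EOFError("socket closed mid-record")
        buf += chunk
    return buf

def collect_records(port):
    """'Python' side of collect(): read records back from the socket."""
    out = []
    with socket.create_connection(("127.0.0.1", port)) as conn:
        while True:
            (n,) = struct.unpack(">I", _read_exact(conn, 4))
            if n == 0:
                return out
            out.append(_read_exact(conn, n))
```

Nothing ever touches the filesystem, so there is no buffer-cache spill to worry about; the data flows kernel-buffer to kernel-buffer on the loopback interface.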
[jira] [Resolved] (SPARK-681) Optimize hashtables used in Spark
[ https://issues.apache.org/jira/browse/SPARK-681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-681.
-
Resolution: Fixed

Optimize hashtables used in Spark
Key: SPARK-681
URL: https://issues.apache.org/jira/browse/SPARK-681
Project: Spark
Issue Type: Improvement
Reporter: Matei Zaharia

The hash tables used in cogroup, join, etc. take up a lot more space than they need to because they're using linked data structures. It would be nice to write a custom open hash table class to use instead, especially since these tables are append-only. A custom one would likely run better than fastutil as well.
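The idea of an append-only open-addressing table (flat key/value arrays probed linearly, with no per-entry linked node objects) can be sketched as follows. This is a toy illustration of the technique only, not the OpenHashMap Spark eventually shipped:

```python
class OpenHashTable:
    """Toy append-only open-addressing hash table: keys and values live
    in flat arrays probed linearly, avoiding the per-entry node objects
    of linked (chained) tables. Sketch of the idea only."""

    def __init__(self, capacity=8):
        self.capacity = capacity
        self.size = 0
        self.keys = [None] * capacity
        self.values = [None] * capacity

    def _index(self, key):
        i = hash(key) % self.capacity
        while self.keys[i] is not None and self.keys[i] != key:
            i = (i + 1) % self.capacity  # linear probing
        return i

    def update(self, key, default, combine):
        """Append-only workload (cogroup/join/aggregate): insert a new
        key with `default`, or merge into an existing value."""
        if (self.size + 1) * 2 > self.capacity:  # keep load factor <= 0.5
            self._grow()
        i = self._index(key)
        if self.keys[i] is None:
            self.keys[i] = key
            self.values[i] = default
            self.size += 1
        else:
            self.values[i] = combine(self.values[i])

    def get(self, key):
        i = self._index(key)
        return self.values[i] if self.keys[i] == key else None

    def _grow(self):
        entries = [(k, v) for k, v in zip(self.keys, self.values) if k is not None]
        self.capacity *= 2
        self.keys = [None] * self.capacity
        self.values = [None] * self.capacity
        for k, v in entries:
            j = self._index(k)
            self.keys[j] = k
            self.values[j] = v
```

Because the table never deletes, no tombstones are needed, which is exactly what makes the append-only case simpler and denser than a general-purpose map.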
[jira] [Resolved] (SPARK-993) Don't reuse Writable objects in HadoopRDDs by default
[ https://issues.apache.org/jira/browse/SPARK-993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-993.
-
Resolution: Won't Fix

We investigated this for 1.0 but found that many InputFormats behave incorrectly if you try to clone the object, so we won't fix it.

Don't reuse Writable objects in HadoopRDDs by default
Key: SPARK-993
URL: https://issues.apache.org/jira/browse/SPARK-993
Project: Spark
Issue Type: Improvement
Reporter: Matei Zaharia

Right now we reuse them as an optimization, which leads to weird results when you call collect() on a file with distinct items. We should instead make that behavior optional through a flag.
[jira] [Commented] (SPARK-993) Don't reuse Writable objects in HadoopRDDs by default
[ https://issues.apache.org/jira/browse/SPARK-993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200531#comment-14200531 ] Matei Zaharia commented on SPARK-993:
-

Arun, you'd see this issue if you do collect() or take() and then println. The problem is that the same Text object (for example) is referenced for all records in the dataset. The counts will be okay.
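The aliasing described in this comment can be reproduced without Hadoop at all. Below is a hypothetical pure-Python stand-in for a reused Writable: the reader mutates one buffer object per file, so collect() ends up with many references to the same object, while per-record processing (counting) and explicit copying both behave correctly.

```python
class ReusedText:
    """Stand-in for a Hadoop Text object that a record reader reuses
    for every record. Hypothetical; illustrates the aliasing only."""
    def __init__(self):
        self.value = None
    def set(self, value):
        self.value = value

def read_records_reusing(lines):
    buf = ReusedText()  # ONE buffer object for the whole file
    for line in lines:
        buf.set(line)
        yield buf       # every yield returns the same object

lines = ["a", "b", "c"]

# collect()-style: keep the references. All three point at the reused
# buffer, which now holds the last record -- the "weird results":
collected = list(read_records_reusing(lines))
assert all(rec.value == "c" for rec in collected)

# Counting-style: each record is consumed before the next overwrite,
# and copying the value out (cloning) fixes collect():
cloned = [rec.value for rec in read_records_reusing(lines)]
assert cloned == ["a", "b", "c"]
```

This also shows why a clone-by-default fix is tempting but, as the resolution above notes, cloning is unsafe for InputFormats whose objects don't copy correctly.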
[jira] [Closed] (SPARK-1000) Crash when running SparkPi example with local-cluster
[ https://issues.apache.org/jira/browse/SPARK-1000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia closed SPARK-1000. Resolution: Cannot Reproduce
Crash when running SparkPi example with local-cluster
Key: SPARK-1000 URL: https://issues.apache.org/jira/browse/SPARK-1000 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.0 Reporter: xiajunluan Assignee: Andrew Or
When I run SparkPi with local-cluster[2,2,512], it throws the following exception at the end of the job:
WARNING: An exception was thrown by an exception handler.
java.util.concurrent.RejectedExecutionException
 at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:1768)
 at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:767)
 at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:658)
 at org.jboss.netty.channel.socket.nio.AbstractNioWorker.start(AbstractNioWorker.java:184)
 at org.jboss.netty.channel.socket.nio.AbstractNioWorker.executeInIoThread(AbstractNioWorker.java:330)
 at org.jboss.netty.channel.socket.nio.NioWorker.executeInIoThread(NioWorker.java:35)
 at org.jboss.netty.channel.socket.nio.AbstractNioWorker.executeInIoThread(AbstractNioWorker.java:313)
 at org.jboss.netty.channel.socket.nio.NioWorker.executeInIoThread(NioWorker.java:35)
 at org.jboss.netty.channel.socket.nio.AbstractNioChannelSink.execute(AbstractNioChannelSink.java:34)
 at org.jboss.netty.channel.Channels.fireExceptionCaughtLater(Channels.java:504)
 at org.jboss.netty.channel.AbstractChannelSink.exceptionCaught(AbstractChannelSink.java:47)
 at org.jboss.netty.channel.Channels.fireChannelOpen(Channels.java:170)
 at org.jboss.netty.channel.socket.nio.NioClientSocketChannel.init(NioClientSocketChannel.java:79)
 at org.jboss.netty.channel.socket.nio.NioClientSocketChannelFactory.newChannel(NioClientSocketChannelFactory.java:176)
 at org.jboss.netty.channel.socket.nio.NioClientSocketChannelFactory.newChannel(NioClientSocketChannelFactory.java:82)
 at org.jboss.netty.bootstrap.ClientBootstrap.connect(ClientBootstrap.java:213)
 at org.jboss.netty.bootstrap.ClientBootstrap.connect(ClientBootstrap.java:183)
 at akka.remote.netty.ActiveRemoteClient$$anonfun$connect$1.apply$mcV$sp(Client.scala:173)
 at akka.util.Switch.liftedTree1$1(LockUtil.scala:33)
 at akka.util.Switch.transcend(LockUtil.scala:32)
 at akka.util.Switch.switchOn(LockUtil.scala:55)
 at akka.remote.netty.ActiveRemoteClient.connect(Client.scala:158)
 at akka.remote.netty.NettyRemoteTransport.send(NettyRemoteSupport.scala:153)
 at akka.remote.RemoteActorRef.$bang(RemoteActorRefProvider.scala:247)
 at akka.actor.LocalDeathWatch$$anonfun$publish$1.apply(ActorRefProvider.scala:559)
 at akka.actor.LocalDeathWatch$$anonfun$publish$1.apply(ActorRefProvider.scala:559)
 at scala.collection.Iterator$class.foreach(Iterator.scala:772)
 at scala.collection.immutable.VectorIterator.foreach(Vector.scala:648)
 at scala.collection.IterableLike$class.foreach(IterableLike.scala:73)
 at scala.collection.immutable.Vector.foreach(Vector.scala:63)
 at akka.actor.LocalDeathWatch.publish(ActorRefProvider.scala:559)
 at akka.remote.RemoteDeathWatch.publish(RemoteActorRefProvider.scala:280)
 at akka.remote.RemoteDeathWatch.publish(RemoteActorRefProvider.scala:262)
 at akka.actor.ActorCell.doTerminate(ActorCell.scala:701)
 at akka.actor.ActorCell.handleChildTerminated(ActorCell.scala:747)
 at akka.actor.ActorCell.systemInvoke(ActorCell.scala:608)
 at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:209)
 at akka.dispatch.Mailbox.run(Mailbox.scala:178)
 at akka.dispatch.ForkJoinExecutorConfigurator$MailboxExecutionTask.exec(AbstractDispatcher.scala:516)
 at akka.jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:259)
 at akka.jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:975)
 at akka.jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1479)
 at akka.jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1023) Remove Thread.sleep(5000) from TaskSchedulerImpl
[ https://issues.apache.org/jira/browse/SPARK-1023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-1023. -- Resolution: Fixed
Remove Thread.sleep(5000) from TaskSchedulerImpl
Key: SPARK-1023 URL: https://issues.apache.org/jira/browse/SPARK-1023 Project: Spark Issue Type: Bug Reporter: Patrick Wendell Fix For: 1.0.0
This makes the unit tests take very long. We should figure out why this sleep exists and see if we can lower it or do something smarter.
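A generic sketch of the usual fix for a fixed sleep like this (not Spark's actual scheduler code; the names here are hypothetical): wait on a condition and keep the old delay only as an upper bound, so callers that become ready immediately no longer pay the full 5000 ms.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class ConditionWait {
    // Hypothetical illustration: block until a readiness signal arrives,
    // or until maxMillis elapses, whichever comes first.
    static boolean awaitReady(CountDownLatch ready, long maxMillis) throws InterruptedException {
        // Returns true as soon as ready.countDown() is called; after
        // maxMillis it gives up and returns false.
        return ready.await(maxMillis, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws Exception {
        CountDownLatch ready = new CountDownLatch(1);
        // Simulate the event (e.g. a backend registering) happening right away.
        new Thread(ready::countDown).start();
        long start = System.nanoTime();
        boolean ok = awaitReady(ready, 5000);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        // We returned well before the 5000 ms ceiling.
        System.out.println(ok && elapsedMs < 5000);
    }
}
```

The timeout only matters on the slow path; the fast path is bounded by how quickly the signal arrives, which is why such a change tends to speed up test suites dramatically.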
[jira] [Resolved] (SPARK-1185) In Spark Programming Guide, Master URLs should mention yarn-client
[ https://issues.apache.org/jira/browse/SPARK-1185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-1185. -- Resolution: Fixed
In Spark Programming Guide, Master URLs should mention yarn-client
Key: SPARK-1185 URL: https://issues.apache.org/jira/browse/SPARK-1185 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 0.9.0 Reporter: Sandy Pérez González Assignee: Sandy Pérez González
It would also be helpful to mention that the reason a host:port isn't required for YARN mode is that it comes from the Hadoop configuration.
[jira] [Closed] (SPARK-2237) Add ZLIBCompressionCodec code
[ https://issues.apache.org/jira/browse/SPARK-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia closed SPARK-2237. Resolution: Won't Fix
Add ZLIBCompressionCodec code
Key: SPARK-2237 URL: https://issues.apache.org/jira/browse/SPARK-2237 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Yanjie Gao
[jira] [Updated] (SPARK-2348) In Windows, having an environment variable named 'classpath' gives an error
[ https://issues.apache.org/jira/browse/SPARK-2348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-2348: - Priority: Critical (was: Major)
In Windows, having an environment variable named 'classpath' gives an error
Key: SPARK-2348 URL: https://issues.apache.org/jira/browse/SPARK-2348 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Environment: Windows 7 Enterprise Reporter: Chirag Todarka Assignee: Chirag Todarka Priority: Critical
Operating System: Windows 7 Enterprise
If an environment variable named 'classpath' is set, then starting 'spark-shell' gives the error below:
mydir\spark\bin>spark-shell
Failed to initialize compiler: object scala.runtime in compiler mirror not found.
** Note that as of 2.8 scala does not assume use of the java classpath.
** For the old behavior pass -usejavacp to scala, or if using a Settings
** object programatically, settings.usejavacp.value = true.
14/07/02 14:22:06 WARN SparkILoop$SparkILoopInterpreter: Warning: compiler accessed before init set up. Assuming no postInit code.
Failed to initialize compiler: object scala.runtime in compiler mirror not found.
** Note that as of 2.8 scala does not assume use of the java classpath.
** For the old behavior pass -usejavacp to scala, or if using a Settings
** object programatically, settings.usejavacp.value = true.
Exception in thread "main" java.lang.AssertionError: assertion failed: null
 at scala.Predef$.assert(Predef.scala:179)
 at org.apache.spark.repl.SparkIMain.initializeSynchronous(SparkIMain.scala:202)
 at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:929)
 at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884)
 at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884)
 at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
 at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:884)
 at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:982)
 at org.apache.spark.repl.Main$.main(Main.scala:31)
 at org.apache.spark.repl.Main.main(Main.scala)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
 at java.lang.reflect.Method.invoke(Unknown Source)
 at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292)
 at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
 at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
[jira] [Updated] (SPARK-4222) FixedLengthBinaryRecordReader should readFully
[ https://issues.apache.org/jira/browse/SPARK-4222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-4222: - Assignee: Jascha Swisher
FixedLengthBinaryRecordReader should readFully
Key: SPARK-4222 URL: https://issues.apache.org/jira/browse/SPARK-4222 Project: Spark Issue Type: Bug Reporter: Jascha Swisher Assignee: Jascha Swisher Priority: Minor
The new FixedLengthBinaryRecordReader currently uses a read() call to read from the FSDataInputStream, without checking the number of bytes actually returned. The currentPosition variable is updated assuming that the full number of requested bytes are returned, which could lead to data corruption or other problems if fewer bytes come back than requested.
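The short-read hazard described in this issue can be demonstrated with plain java.io, independent of Hadoop (a sketch; ChunkyStream is a made-up stand-in for a stream that returns short reads, the way a network-backed FSDataInputStream can): InputStream.read(byte[], int, int) may return fewer bytes than requested, while DataInputStream.readFully keeps reading until the buffer is full or end-of-stream.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ReadFullyDemo {
    // A stream that hands back at most 3 bytes per read() call,
    // simulating short reads from a remote filesystem.
    static class ChunkyStream extends FilterInputStream {
        ChunkyStream(InputStream in) { super(in); }
        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            return super.read(b, off, Math.min(len, 3));
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[10];
        byte[] buf = new byte[10];

        // A single read() is allowed to come back short: only 3 of 10 bytes here.
        int n = new ChunkyStream(new ByteArrayInputStream(data)).read(buf, 0, 10);
        System.out.println(n);

        // readFully() loops internally until all 10 requested bytes arrive
        // (or throws EOFException if the stream ends first).
        new DataInputStream(new ChunkyStream(new ByteArrayInputStream(data)))
                .readFully(buf, 0, 10);
        System.out.println("filled");
    }
}
```

Code that advances a position counter by the requested length after a bare read(), as the record reader did, silently corrupts record boundaries whenever the first pattern occurs.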
[jira] [Resolved] (SPARK-4222) FixedLengthBinaryRecordReader should readFully
[ https://issues.apache.org/jira/browse/SPARK-4222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-4222. -- Resolution: Fixed Fix Version/s: 1.2.0
FixedLengthBinaryRecordReader should readFully
Key: SPARK-4222 URL: https://issues.apache.org/jira/browse/SPARK-4222 Project: Spark Issue Type: Bug Reporter: Jascha Swisher Assignee: Jascha Swisher Priority: Minor Fix For: 1.2.0
[jira] [Updated] (SPARK-4040) Update spark documentation for local mode and spark-streaming.
[ https://issues.apache.org/jira/browse/SPARK-4040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-4040: - Assignee: jay vyas
Update spark documentation for local mode and spark-streaming.
Key: SPARK-4040 URL: https://issues.apache.org/jira/browse/SPARK-4040 Project: Spark Issue Type: Documentation Components: Documentation Reporter: jay vyas Assignee: jay vyas Fix For: 1.2.0
*Note: this JIRA has changed since its inception - it's not a bug, but something that can be tricky to surmise from the existing docs. So the attached patch is a doc improvement.*
Below is the original JIRA that was filed:
Please note that I'm somewhat new to Spark Streaming's API and am not a Spark expert, so I've done my best to write up and reproduce this bug. If it's not a bug, I hope an expert will help explain why and promptly close it. However, it appears it could be a bug after discussing with [~rnowling], who is a Spark contributor. CC [~rnowling] [~willbenton]
It appears that in a DStream context, a call to {{MappedRDD.count()}} blocks progress and prevents emission of RDDs from a stream.
{noformat}
tweetStream.foreachRDD((rdd, lent) => {
  tweetStream.repartition(1)
  // val count = rdd.count()  // DON'T DO THIS!
  checks += 1
  if (checks > 20) {
    ssc.stop()
  }
})
{noformat}
The code block above should inevitably halt after 20 intervals of RDDs. However, if we uncomment the call to {{rdd.count()}}, we get an infinite stream that emits no RDDs, and thus our program runs forever ({{ssc.stop}} is unreachable), because *foreachRDD doesn't receive any more entries*. I suspect this is because the foreach block never completes: {{count()}} winds up calling {{compute}}, which ultimately just reads from the stream. I haven't put together a minimal reproducer or unit test yet, but I can work on doing so if more info is needed.
I guess this could be seen as an application bug, but I think Spark might be made smarter, so that it throws its hands up when people execute blocking code in a stream processor.
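The hang described in this report, a callback blocking on work that only its own thread can perform, can be reproduced outside Spark with a single-threaded executor. This is a hypothetical illustration of the failure mode, not the DStream code itself:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BlockingCallback {
    public static void main(String[] args) throws Exception {
        ExecutorService single = Executors.newSingleThreadExecutor();
        // The "callback" runs on the only worker thread, then blocks waiting
        // for a task that can only run on that same thread, so it starves,
        // much as count() inside foreachRDD starved the stream's driver loop.
        Future<Integer> outer = single.submit(() -> {
            Future<Integer> inner = single.submit(() -> 42);
            try {
                // Can never complete: the single thread is busy right here.
                return inner.get(500, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                return -1; // gave up: the inner task was queued behind us
            }
        });
        System.out.println(outer.get());
        single.shutdownNow();
    }
}
```

With a timeout the sketch merely gives up; without one it would deadlock forever, which is the behavior the reporter observed.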
[jira] [Resolved] (SPARK-4040) Update spark documentation for local mode and spark-streaming.
[ https://issues.apache.org/jira/browse/SPARK-4040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-4040. -- Resolution: Fixed
Update spark documentation for local mode and spark-streaming.
Key: SPARK-4040 URL: https://issues.apache.org/jira/browse/SPARK-4040 Project: Spark Issue Type: Documentation Components: Documentation Reporter: jay vyas Assignee: jay vyas Fix For: 1.2.0
[jira] [Resolved] (SPARK-565) Integrate spark in scala standard collection API
[ https://issues.apache.org/jira/browse/SPARK-565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-565. - Resolution: Won't Fix
FYI I'm going to close this because we've locked down the API for 1.X, and it's pretty clear that it can't fully fit into the Scala collections API (that has a lot of things we don't have, and vice versa). This is something we can investigate later, but it's unlikely that we'll want to bind the API to Scala even if we change pieces of it in the future.
Integrate spark in scala standard collection API
Key: SPARK-565 URL: https://issues.apache.org/jira/browse/SPARK-565 Project: Spark Issue Type: New Feature Reporter: tjhunter
This is more a meta-bug / wish item than a real bug. Scala 2.9 provides an API for parallel collections which might be interesting to leverage, but mostly, as a user, I would like to be able to write a function like:
def contrived_example(xs: Seq[Int]) = xs.map(_ * 2).sum
and not have to care whether xs is an array, a Scala parallel collection, or an RDD. Given that RDDs already implement most of the Seq API, it seems mostly a matter of standardization. I am probably missing some subtle details here?
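The wish in this report (write the map-and-sum once and not care about the concrete collection) is the standard "program to the interface" idea. A rough Java analogue of the reporter's contrived_example, purely illustrative and with nothing Spark-specific in it:

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;

public class ContrivedExample {
    // Accepts any Collection implementation, mirroring the Scala
    // contrived_example(xs: Seq[Int]) = xs.map(_ * 2).sum quoted above.
    static int contrivedExample(Collection<Integer> xs) {
        return xs.stream().mapToInt(i -> i * 2).sum();
    }

    public static void main(String[] args) {
        List<Integer> list = Arrays.asList(1, 2, 3);
        Collection<Integer> set = new LinkedHashSet<>(list);
        // Same call works unchanged for a List and a Set.
        System.out.println(contrivedExample(list));
        System.out.println(contrivedExample(set));
    }
}
```

The resolution above reflects the cost of this approach: binding Spark's API to an external collections contract would have constrained the RDD API's own evolution, which is why the issue was closed Won't Fix.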