[jira] [Created] (SPARK-25124) VectorSizeHint.size is buggy, breaking streaming pipeline
Timothy Hunter created SPARK-25124: -- Summary: VectorSizeHint.size is buggy, breaking streaming pipeline Key: SPARK-25124 URL: https://issues.apache.org/jira/browse/SPARK-25124 Project: Spark Issue Type: Bug Components: ML Affects Versions: 2.3.1 Reporter: Timothy Hunter Currently, when using {{VectorSizeHint().setSize(3)}} in an ML pipeline, transforming a stream returns a nondescript exception about the stream not having started. At the core are the following bugs: {{setSize}} and {{getSize}} do not {{return}} values but {{None}}: https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py#L3846 How to reproduce, using the example in the doc:
{code}
from pyspark.ml.linalg import Vectors
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.feature import VectorAssembler, VectorSizeHint

data = [(Vectors.dense([1., 2., 3.]), 4.)]
df = spark.createDataFrame(data, ["vector", "float"])
sizeHint = VectorSizeHint(inputCol="vector", handleInvalid="skip").setSize(3)  # Will fail
vecAssembler = VectorAssembler(inputCols=["vector", "float"], outputCol="assembled")
pipeline = Pipeline(stages=[sizeHint, vecAssembler])
pipelineModel = pipeline.fit(df)
pipelineModel.transform(df).head().assembled
{code}
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
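To illustrate the failure mode, here is a minimal plain-Python sketch of the missing-return pattern (the class names below are hypothetical stand-ins, not the actual pyspark source, whose params go through the Param machinery): a setter that forgets to return {{self}} makes the fluent call above evaluate to {{None}}, so {{None}} ends up as a pipeline stage.

```python
# Hypothetical minimal classes illustrating the missing-return bug;
# the real pyspark Param plumbing is elided.

class BuggySizeHint:
    def setSize(self, value):
        self._size = value  # bug: falls through and returns None

class FixedSizeHint:
    def setSize(self, value):
        self._size = value
        return self         # fix: return self so calls can be chained

    def getSize(self):
        return self._size   # fix: return the stored value

# With the buggy setter, `hint` is None, which later surfaces as a
# confusing error deep inside the (streaming) pipeline:
hint = BuggySizeHint().setSize(3)
assert hint is None
assert FixedSizeHint().setSize(3).getSize() == 3
```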
[jira] [Commented] (SPARK-23996) Implement the optimal KLL algorithms for quantiles in streams
[ https://issues.apache.org/jira/browse/SPARK-23996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16447781#comment-16447781 ] Timothy Hunter commented on SPARK-23996: [~wm624] yes, this is the implementation: [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/QuantileSummaries.scala] You can see the test suite here: [https://github.com/apache/spark/blob/master/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/QuantileSummariesSuite.scala] The current implementation focuses on doubles, but I do not see much issue in switching to floats. The main entry points are fairly similar: [https://github.com/DataSketches/sketches-core/blob/master/src/main/java/com/yahoo/sketches/kll/KllFloatsSketch.java#L299] > Implement the optimal KLL algorithms for quantiles in streams > - > > Key: SPARK-23996 > URL: https://issues.apache.org/jira/browse/SPARK-23996 > Project: Spark > Issue Type: Improvement > Components: MLlib, SQL >Affects Versions: 2.3.0 >Reporter: Timothy Hunter >Priority: Major > > The current implementation for approximate quantiles - a variant of > Greenwald-Khanna, which I implemented - is not the best in light of recent > papers: > - it is not exactly the one from the paper for performance reasons, but the > changes are not documented beyond comments on the code > - there are now more optimal algorithms with proven bounds (unlike q-digest, > the other contender at the time) > I propose that we revisit the current implementation and look at the > Karnin-Lang-Liberty algorithm (KLL) for example: > [https://arxiv.org/abs/1603.05346] > [https://edoliberty.github.io//papers/streamingQuantiles.pdf] > This algorithm seems to have favorable characteristics for streaming and a > distributed implementation, and there is a python implementation for > reference. 
> It is a fairly standalone piece, and in that respect available to people who > don't know too much about spark internals.
[jira] [Created] (SPARK-23996) Implement the optimal KLL algorithms for quantiles in streams
Timothy Hunter created SPARK-23996: -- Summary: Implement the optimal KLL algorithms for quantiles in streams Key: SPARK-23996 URL: https://issues.apache.org/jira/browse/SPARK-23996 Project: Spark Issue Type: Improvement Components: MLlib, SQL Affects Versions: 2.3.0 Reporter: Timothy Hunter The current implementation for approximate quantiles - a variant of Greenwald-Khanna, which I implemented - is not the best in light of recent papers: - it is not exactly the one from the paper for performance reasons, but the changes are not documented beyond comments in the code - there are now more optimal algorithms with proven bounds (unlike q-digest, the other contender at the time) I propose that we revisit the current implementation and look at the Karnin-Lang-Liberty algorithm (KLL) for example: [https://arxiv.org/abs/1603.05346] [https://edoliberty.github.io//papers/streamingQuantiles.pdf] This algorithm seems to have favorable characteristics for streaming and a distributed implementation, and there is a python implementation for reference. It is a fairly standalone piece, and in that respect accessible to people who don't know too much about spark internals.
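For intuition, the core compaction idea behind KLL can be sketched in a few lines. This is a toy illustration only: the real algorithm uses geometrically varying buffer capacities and carries provable error bounds, neither of which this sketch attempts.

```python
import random

random.seed(0)  # deterministic for the example

class TinyKLL:
    """Toy version of the KLL compaction idea. Each level holds a buffer of
    items with weight 2**level; when a buffer overflows, it is sorted and
    every other item (with a random offset) is promoted to the next level."""

    def __init__(self, capacity=8):
        self.capacity = capacity
        self.levels = [[]]

    def update(self, x):
        self.levels[0].append(x)
        lvl = 0
        while len(self.levels[lvl]) >= self.capacity:
            buf = sorted(self.levels[lvl])
            self.levels[lvl] = []
            if lvl + 1 == len(self.levels):
                self.levels.append([])
            # keep half the items at double weight: total weight is preserved
            self.levels[lvl + 1].extend(buf[random.randint(0, 1)::2])
            lvl += 1

    def rank(self, x):
        """Estimated number of inserted items <= x."""
        return sum((1 << lvl) * sum(1 for v in buf if v <= x)
                   for lvl, buf in enumerate(self.levels))

sketch = TinyKLL()
for i in range(1000):
    sketch.update(i)
# the estimated rank of 500 should be close to the true rank of 501
assert abs(sketch.rank(500) - 501) < 200
```

The random-offset choice during compaction is what makes the estimator unbiased, which is also why the scheme composes well in a distributed merge.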
[jira] [Commented] (SPARK-21866) SPIP: Image support in Spark
[ https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16273575#comment-16273575 ] Timothy Hunter commented on SPARK-21866: [~josephkb] I have created a separate ticket to continue progress on the reader interface in SPARK-22666. > SPIP: Image support in Spark > > > Key: SPARK-21866 > URL: https://issues.apache.org/jira/browse/SPARK-21866 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: Timothy Hunter >Assignee: Ilya Matiach > Labels: SPIP > Fix For: 2.3.0 > > Attachments: SPIP - Image support for Apache Spark V1.1.pdf > > > h2. Background and motivation > As Apache Spark is being used more and more in the industry, some new use > cases are emerging for different data formats beyond the traditional SQL > types or the numerical types (vectors and matrices). Deep Learning > applications commonly deal with image processing. A number of projects add > some Deep Learning capabilities to Spark (see list below), but they struggle > to communicate with each other or with MLlib pipelines because there is no > standard way to represent an image in Spark DataFrames. We propose to > federate efforts for representing images in Spark by defining a > representation that caters to the most common needs of users and library > developers. > This SPIP proposes a specification to represent images in Spark DataFrames > and Datasets (based on existing industrial standards), and an interface for > loading sources of images. It is not meant to be a full-fledged image > processing library, but rather the core description that other libraries and > users can rely on. Several packages already offer various processing > facilities for transforming images or doing more complex operations, and each > has various design tradeoffs that make them better as standalone solutions. 
> This project is a joint collaboration between Microsoft and Databricks, which > have been testing this design in two open source packages: MMLSpark and Deep > Learning Pipelines. > The proposed image format is an in-memory, decompressed representation that > targets low-level applications. It is significantly more liberal in memory > usage than compressed image representations such as JPEG, PNG, etc., but it > allows easy communication with popular image processing libraries and has no > decoding overhead. > h2. Targets users and personas: > Data scientists, data engineers, library developers. > The following libraries define primitives for loading and representing > images, and will gain from a common interchange format (in alphabetical > order): > * BigDL > * DeepLearning4J > * Deep Learning Pipelines > * MMLSpark > * TensorFlow (Spark connector) > * TensorFlowOnSpark > * TensorFrames > * Thunder > h2. Goals: > * Simple representation of images in Spark DataFrames, based on pre-existing > industrial standards (OpenCV) > * This format should eventually allow the development of high-performance > integration points with image processing libraries such as libOpenCV, Google > TensorFlow, CNTK, and other C libraries. > * The reader should be able to read popular formats of images from > distributed sources. > h2. Non-Goals: > Images are a versatile medium and encompass a very wide range of formats and > representations. This SPIP explicitly aims at the most common use case in the > industry currently: multi-channel matrices of binary, int32, int64, float or > double data that can fit comfortably in the heap of the JVM: > * the total size of an image should be restricted to less than 2GB (roughly) > * the meaning of color channels is application-specific and is not mandated > by the standard (in line with the OpenCV standard) > * specialized formats used in meteorology, the medical field, etc. 
are not > supported > * this format is specialized to images and does not attempt to solve the more > general problem of representing n-dimensional tensors in Spark > h2. Proposed API changes > We propose to add a new package in the package structure, under the MLlib > project: > {{org.apache.spark.image}} > h3. Data format > We propose to add the following structure: > imageSchema = StructType([ > * StructField("mode", StringType(), False), > ** The exact representation of the data. > ** The values are described in the following OpenCV convention. Basically, > the type has both "depth" and "number of channels" info: in particular, type > "CV_8UC3" means "3 channel unsigned bytes". BGRA format would be CV_8UC4 > (value 32 in the table) with the channel order specified by convention. > ** The exact channel ordering and meaning of each channel is dictated by > convention. By default, the order is RGB (3 channels
[jira] [Created] (SPARK-22666) Spark reader source for image format
Timothy Hunter created SPARK-22666: -- Summary: Spark reader source for image format Key: SPARK-22666 URL: https://issues.apache.org/jira/browse/SPARK-22666 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 2.3.0 Reporter: Timothy Hunter The current API for the new image format is implemented as a standalone feature, in order to make it reside within the mllib package. As discussed in SPARK-21866, users should be able to load images through the more common spark source reader interface. This ticket is concerned with adding image reading support in the spark source API, through either of the following interfaces: - {{spark.read.format("image")...}} - {{spark.read.image}} The output is a dataframe that contains images (and the file names, for example), following the semantics already discussed in SPARK-21866. A few technical notes: * since the functionality is implemented in {{mllib}}, calling this function may fail at runtime if users have not imported the {{spark-mllib}} dependency * How to deal with very flat directories? It is common to have millions of files in a single "directory" (like in S3), which seems to have caused issues for some users. If this issue is too complex to handle in this ticket, it can be dealt with separately.
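The "may fail at runtime" note above can be made concrete with a lazy-import pattern (a plain-Python sketch; the helper name and error message are hypothetical, and the Scala-side mechanism discussed in SPARK-21866 would use JVM reflection instead):

```python
import importlib

def load_image_module(module_name="pyspark.ml.image"):
    """Resolve the mllib-side image support lazily, so the calling code
    carries no hard import of spark-mllib. Hypothetical helper."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise RuntimeError(
            "the image data source requires the spark-mllib dependency"
        ) from exc
```

With this shape, a user missing the dependency sees one targeted error at the call site rather than a bare ImportError from deep inside the reader.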
[jira] [Commented] (SPARK-21866) SPIP: Image support in Spark
[ https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16248628#comment-16248628 ] Timothy Hunter commented on SPARK-21866: [~josephkb] if I am not mistaken, the image code is implemented in the {{mllib}} package, which depends on {{sql}}. Meanwhile, the data source API is implemented in {{sql}}, and if we want it to have an image-specific source, like we do for csv or json, {{sql}} would need to depend on {{mllib}}. This dependency should not happen: first because it introduces a circular dependency (causing compile-time issues), and second because sql (one of the core modules) should not depend on {{mllib}}, which is large and not related to SQL. [~rxin] suggested that we add a runtime dependency using reflection instead, and I am keen on making that change in a second pull request. What are your thoughts? > SPIP: Image support in Spark > > > Key: SPARK-21866 > URL: https://issues.apache.org/jira/browse/SPARK-21866 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: Timothy Hunter > Labels: SPIP > Attachments: SPIP - Image support for Apache Spark V1.1.pdf
[jira] [Commented] (SPARK-21866) SPIP: Image support in Spark
[ https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16237731#comment-16237731 ] Timothy Hunter commented on SPARK-21866: Adding {{spark.read.image}} is going to create a (soft) dependency between the core and mllib, which hosts the implementation of the current reader methods. This is fine and can be dealt with using reflection, but since this would involve adding a core API to Spark, I suggest we do it as a follow-up task. > SPIP: Image support in Spark > > > Key: SPARK-21866 > URL: https://issues.apache.org/jira/browse/SPARK-21866 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: Timothy Hunter >Priority: Major > Labels: SPIP > Attachments: SPIP - Image support for Apache Spark V1.1.pdf
[jira] [Commented] (SPARK-8515) Improve ML attribute API
[ https://issues.apache.org/jira/browse/SPARK-8515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16206111#comment-16206111 ] Timothy Hunter commented on SPARK-8515: --- Before we commit to an implementation, we should think about the goal of adding metadata in ML, because it comes with its own costs. For instance, there have been a number of bug reports around it. See for example SPARK-2008, SPARK-14862. I see a couple of use cases for metadata: - feature indexing -> that case should require just longs (or strings) for each dimension of a feature vector - expressing categorical info -> the Estimator -> Model -> Transformer pattern is more appropriate, I believe - vector dimensions -> I think that in all cases, the underlying code should be able to proceed without this information, although this is debatable > Improve ML attribute API > > > Key: SPARK-8515 > URL: https://issues.apache.org/jira/browse/SPARK-8515 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.4.0 >Reporter: Xiangrui Meng > Labels: advanced > Attachments: SPARK-8515.pdf > > > In 1.4.0, we introduced ML attribute API to embed feature/label attribute > info inside DataFrame's schema. However, the API is not very friendly to use. > We should re-visit this API and see how we can improve it.
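The last bullet above (downstream code should be able to proceed without the metadata) can be illustrated with a small plain-Python sketch; the helper and its inputs are hypothetical stand-ins for a DataFrame column's metadata and rows:

```python
def vector_size(column_metadata, rows):
    """Prefer the size declared in column metadata; otherwise infer it from
    the first row, so code keeps working when the metadata is absent."""
    if column_metadata and "size" in column_metadata:
        return column_metadata["size"]
    if not rows:
        raise ValueError("cannot infer vector size from an empty column")
    return len(rows[0])

assert vector_size({"size": 3}, []) == 3              # declared size wins
assert vector_size({}, [[1.0, 2.0, 3.0, 4.0]]) == 4   # inferred fallback
```

Treating the metadata as an optimization rather than a requirement is what keeps the failure surface small when metadata is dropped by an intermediate transform.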
[jira] [Commented] (SPARK-21866) SPIP: Image support in Spark
[ https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16175158#comment-16175158 ] Timothy Hunter commented on SPARK-21866: Putting this code under {{org.apache.spark.ml.image}} sounds good to me. Based on the initial exploration, it should not be too hard to integrate this into the data source framework. I am going to submit this proposal to a vote on the dev mailing list. > SPIP: Image support in Spark > > > Key: SPARK-21866 > URL: https://issues.apache.org/jira/browse/SPARK-21866 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: Timothy Hunter > Labels: SPIP > Attachments: SPIP - Image support for Apache Spark V1.1.pdf
[jira] [Commented] (SPARK-21866) SPIP: Image support in Spark
[ https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16154132#comment-16154132 ] Timothy Hunter commented on SPARK-21866: [~yanboliang] thank you for the comments. Regarding your questions: 1. making {{image}} part of {{ml}} or not: I do not have a strong preference, but I think that image support is more general than machine learning. 2. there is no obstacle, but that would create a dependency between the core ({{spark.read}}) and an external module. This sort of dependency inversion is not great design, as any change in a sub-package will have API repercussions in the core of Spark. The SQL team is already struggling with such issues. > SPIP: Image support in Spark > > > Key: SPARK-21866 > URL: https://issues.apache.org/jira/browse/SPARK-21866 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: Timothy Hunter > Labels: SPIP > Attachments: SPIP - Image support for Apache Spark V1.1.pdf
[jira] [Updated] (SPARK-21866) SPIP: Image support in Spark
[ https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Hunter updated SPARK-21866: --- Attachment: (was: SPIP - Image support for Apache Spark.pdf) > SPIP: Image support in Spark > > > Key: SPARK-21866 > URL: https://issues.apache.org/jira/browse/SPARK-21866 > Project: Spark > Issue Type: Improvement > Components: ML > Affects Versions: 2.2.0 > Reporter: Timothy Hunter > Labels: SPIP > Attachments: SPIP - Image support for Apache Spark V1.1.pdf
[jira] [Updated] (SPARK-21866) SPIP: Image support in Spark
[ https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Hunter updated SPARK-21866: --- Attachment: SPIP - Image support for Apache Spark V1.1.pdf Updated authors' list. > SPIP: Image support in Spark > > > Key: SPARK-21866 > URL: https://issues.apache.org/jira/browse/SPARK-21866 > Project: Spark > Issue Type: Improvement > Components: ML > Affects Versions: 2.2.0 > Reporter: Timothy Hunter > Labels: SPIP > Attachments: SPIP - Image support for Apache Spark V1.1.pdf
[jira] [Commented] (SPARK-21184) QuantileSummaries implementation is wrong and QuantileSummariesSuite fails with larger n
[ https://issues.apache.org/jira/browse/SPARK-21184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16149510#comment-16149510 ] Timothy Hunter commented on SPARK-21184: [~a1ray] thank you for the report; someone should investigate these values. You raise some valid questions about the choice of data structures and algorithm, which were discussed during the implementation and can certainly be revisited: - tree structures: the major constraint here is that this structure gets serialized often, due to how UDAFs work. This is why the current implementation amortizes compression over multiple records. Edo Liberty has published some recent work that is relevant in that area. - algorithm: we looked at t-digest (and q-digest). The main concern back then was that there was no published worst-case guarantee for a given target precision. This is still the case to my knowledge. Because of that, it is hard to understand what could happen in some unusual cases - which tend to be not so unusual in big data. That being said, t-digest looks like a popular and well-maintained choice now, so I am certainly open to relaxing this constraint. > QuantileSummaries implementation is wrong and QuantileSummariesSuite fails > with larger n > > > Key: SPARK-21184 > URL: https://issues.apache.org/jira/browse/SPARK-21184 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.1.1 > Reporter: Andrew Ray > > 1. QuantileSummaries implementation does not match the paper it is supposed > to be based on. > 1a. The compress method > (https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/QuantileSummaries.scala#L240) > merges neighboring buckets, but that's not what the paper says to do. The > paper > (http://infolab.stanford.edu/~datar/courses/cs361a/papers/quantiles.pdf) > describes an implicit tree structure and the compress method deletes selected > subtrees. > 1b. 
The paper does not discuss merging these summary data structures at all. > The following comment is in the merge method of QuantileSummaries: > {quote} // The GK algorithm is a bit unclear about it, but it seems > there is no need to adjust the > // statistics during the merging: the invariants are still respected > after the merge.{quote} > Unless I'm missing something, that claim needs substantiation: it's not clear > that the invariants hold. > 2. QuantileSummariesSuite fails with n = 1 (and other non-trivial values) > https://github.com/apache/spark/blob/master/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/QuantileSummariesSuite.scala#L27 > One possible solution, if these issues can't be resolved, would be to move to > an algorithm that explicitly supports merging and is well tested, like > https://github.com/tdunning/t-digest -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
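For readers unfamiliar with the compress/merge pattern being debated, here is a deliberately simplified pure-Python sketch. It is NOT the GK algorithm and bears no relation to Spark's actual QuantileSummaries; all names are illustrative. Each retained entry (v, c) pools c inserted items that are <= v, so cumulative counts give exact ranks for the kept values, and merging is concatenate-then-compress:

```python
import bisect

class ToySummary:
    """Toy mergeable quantile summary (illustrative only, not GK).
    Entry (v, c): c inserted items are <= v and greater than the previous
    kept value, so cumulative counts are exact ranks of kept values."""
    def __init__(self, b=64):
        self.b = b            # accuracy knob: one entry pools <= n/b items
        self.samples = []     # sorted list of (value, pooled_count)
        self.count = 0

    def insert(self, x):
        bisect.insort(self.samples, (x, 1))
        self.count += 1
        if len(self.samples) > 2 * self.b:
            self._compress()

    def _compress(self):
        cap = max(1, self.count // self.b)   # max items one entry may pool
        merged = [self.samples[0]]
        for v, c in self.samples[1:]:
            pv, pc = merged[-1]
            if pc + c <= cap:
                merged[-1] = (v, pc + c)     # pool into neighbor; keeping the
            else:                            # right endpoint preserves exact
                merged.append((v, c))        # ranks for the kept values
        self.samples = merged

    def merge(self, other):
        # concatenate the sorted sample lists, then re-compress
        out = ToySummary(self.b)
        out.samples = sorted(self.samples + other.samples)
        out.count = self.count + other.count
        out._compress()
        return out

    def query(self, q):
        """Smallest kept value whose rank is >= q * count."""
        target = q * self.count
        acc = 0
        for v, c in self.samples:
            acc += c
            if acc >= target:
                return v
        return self.samples[-1][0]
```

Because each entry pools at most count/b items, the rank error of a query is bounded by count/b; GK achieves a similar guarantee with far more careful bookkeeping (the g/delta statistics the quoted comment refers to), which is exactly what makes its merge step hard to verify.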
[jira] [Commented] (SPARK-21866) SPIP: Image support in Spark
[ https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16149186#comment-16149186 ] Timothy Hunter commented on SPARK-21866: [~srowen] thank you for the comments. Indeed, this proposal is limited in scope on purpose, because it aims at achieving consensus across multiple libraries. For instance, the MMLSpark project from Microsoft uses this data format to interface with OpenCV (wrapped through JNI), and the Deep Learning Pipelines project is going to rely on it as its primary mechanism to load and process images. Also, nothing precludes adding common transforms to this package later - it is easier to start small. Regarding the Spark package: yes, it will be discontinued, as the CSV parser was. The aim is to offer a working library that can be tried out without having to wait for an implementation to be merged into Spark itself. > SPIP: Image support in Spark > > > Key: SPARK-21866 > URL: https://issues.apache.org/jira/browse/SPARK-21866 > Project: Spark > Issue Type: Improvement > Components: ML > Affects Versions: 2.2.0 > Reporter: Timothy Hunter > Labels: SPIP > Attachments: SPIP - Image support for Apache Spark.pdf
[jira] [Updated] (SPARK-21866) SPIP: Image support in Spark
[ https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Hunter updated SPARK-21866: --- Attachment: SPIP - Image support for Apache Spark.pdf > SPIP: Image support in Spark > > > Key: SPARK-21866 > URL: https://issues.apache.org/jira/browse/SPARK-21866 > Project: Spark > Issue Type: Improvement > Components: ML > Affects Versions: 2.2.0 > Reporter: Timothy Hunter > Attachments: SPIP - Image support for Apache Spark.pdf
[jira] [Created] (SPARK-21866) SPIP: Image support in Spark
Timothy Hunter created SPARK-21866: -- Summary: SPIP: Image support in Spark Key: SPARK-21866 URL: https://issues.apache.org/jira/browse/SPARK-21866 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 2.2.0 Reporter: Timothy Hunter h2. Background and motivation As Apache Spark is being used more and more in the industry, some new use cases are emerging for different data formats beyond the traditional SQL types or the numerical types (vectors and matrices). Deep Learning applications commonly deal with image processing. A number of projects add some Deep Learning capabilities to Spark (see list below), but they struggle to communicate with each other or with MLlib pipelines because there is no standard way to represent an image in Spark DataFrames. We propose to federate efforts for representing images in Spark by defining a representation that caters to the most common needs of users and library developers. This SPIP proposes a specification to represent images in Spark DataFrames and Datasets (based on existing industrial standards), and an interface for loading sources of images. It is not meant to be a full-fledged image processing library, but rather the core description that other libraries and users can rely on. Several packages already offer various processing facilities for transforming images or doing more complex operations, and each has various design tradeoffs that make them better as standalone solutions. This project is a joint collaboration between Microsoft and Databricks, which have been testing this design in two open source packages: MMLSpark and Deep Learning Pipelines. The proposed image format is an in-memory, decompressed representation that targets low-level applications. It is significantly more liberal in memory usage than compressed image representations such as JPEG, PNG, etc., but it allows easy communication with popular image processing libraries and has no decoding overhead. h2. 
Target users and personas: Data scientists, data engineers, library developers. The following libraries define primitives for loading and representing images, and will gain from a common interchange format (in alphabetical order): * BigDL * DeepLearning4J * Deep Learning Pipelines * MMLSpark * TensorFlow (Spark connector) * TensorFlowOnSpark * TensorFrames * Thunder h2. Goals: * Simple representation of images in Spark DataFrames, based on pre-existing industrial standards (OpenCV) * This format should eventually allow the development of high-performance integration points with image processing libraries such as libOpenCV, Google TensorFlow, CNTK, and other C libraries. * The reader should be able to read popular formats of images from distributed sources. h2. Non-Goals: Images are a versatile medium and encompass a very wide range of formats and representations. This SPIP explicitly aims at the most common use case in the industry currently: multi-channel matrices of binary, int32, int64, float or double data that can fit comfortably in the heap of the JVM: * the total size of an image should be restricted to less than 2GB (roughly) * the meaning of color channels is application-specific and is not mandated by the standard (in line with the OpenCV standard) * specialized formats used in meteorology, the medical field, etc. are not supported * this format is specialized to images and does not attempt to solve the more general problem of representing n-dimensional tensors in Spark h2. Proposed API changes We propose to add a new package in the package structure, under the MLlib project: {{org.apache.spark.image}} h3. Data format We propose to add the following structure: imageSchema = StructType([ * StructField("mode", StringType(), False), ** The exact representation of the data. ** The values are described in the following OpenCV convention. 
Basically, the type has both "depth" and "number of channels" info: in particular, type "CV_8UC3" means "3 channel unsigned bytes". BGRA format would be CV_8UC4 (value 32 in the table) with the channel order specified by convention. ** The exact channel ordering and meaning of each channel is dictated by convention. By default, the order is RGB (3 channels) and BGRA (4 channels). If the image failed to load, the value is the empty string "". * StructField("origin", StringType(), True), ** Some information about the origin of the image. The content of this is application-specific. ** When the image is loaded from files, users should expect to find the file name in this field. * StructField("height", IntegerType(), False), ** the height of the image, pixels ** If the image fails to load, the value is -1. * StructField("width", IntegerType(), False), ** the width of the image, pixels ** If the image fails to load, the value is -1. * StructField("nChannels", In
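As a rough illustration of the row layout described above, here is a Spark-free Python sketch using plain dicts in place of a Spark Row. Field names follow the SPIP text; the "data" payload field and the -1 failure value for nChannels are assumptions extrapolated from the truncated schema text, and the helper names are hypothetical:

```python
# Sketch of the proposed image row, with plain dicts standing in for a Spark
# Row. The "data" field and the nChannels failure convention are assumptions.
IMAGE_FIELDS = ("mode", "origin", "height", "width", "nChannels", "data")

def make_image_row(origin, height, width, n_channels, data, mode="CV_8UC3"):
    """Successfully loaded image; `mode` is an OpenCV type string."""
    return {"mode": mode, "origin": origin, "height": height,
            "width": width, "nChannels": n_channels, "data": data}

def make_failed_row(origin):
    """Failed load, per the conventions above: empty mode, -1 dimensions."""
    return {"mode": "", "origin": origin, "height": -1,
            "width": -1, "nChannels": -1, "data": b""}
```

The point of the fixed conventions (empty mode string, -1 dimensions) is that a failed load still produces a schema-conforming row, so downstream stages can filter rather than crash.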
[jira] [Commented] (SPARK-19634) Feature parity for descriptive statistics in MLlib
[ https://issues.apache.org/jira/browse/SPARK-19634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15944215#comment-15944215 ] Timothy Hunter commented on SPARK-19634: [~sethah], yes, thanks for bringing up these concerns. Regarding the first point: the UDAF interface does not let you update arrays in place, which is a non-starter in our case. This is why the implementation switches to TypedImperativeAggregate (TIA). I have updated the design doc with these comments. Regarding the performance, I agree that there is a tension between having an API that is compatible with structured streaming and the current, RDD-based implementation. I will provide some test numbers so that we have a basis for discussion. That being said, the RDD API is not going away, so if users care about performance and do not need the additional benefit of integrating with SQL or structured streaming, they can still use it. > Feature parity for descriptive statistics in MLlib > -- > > Key: SPARK-19634 > URL: https://issues.apache.org/jira/browse/SPARK-19634 > Project: Spark > Issue Type: Sub-task > Components: ML > Affects Versions: 2.1.0 > Reporter: Timothy Hunter > Assignee: Timothy Hunter > > This ticket tracks porting the functionality of > spark.mllib.MultivariateOnlineSummarizer over to spark.ml. > A design has been discussed in SPARK-19208 . Here is a design doc: > https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit#
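The TypedImperativeAggregate contract mentioned above boils down to four operations: allocate a mutable buffer, update it per row, merge partial buffers, and produce a final result. A pure-Python sketch of that shape for a mean/variance summarizer (illustrative only, not Spark's API), using Welford's online update and the standard pairwise-merge formula:

```python
class SummarizerBuffer:
    """Mutable aggregation buffer in the TypedImperativeAggregate shape:
    in-place per-row update, in-place merge of partial buffers, finish.
    Tracks count, mean, and M2 (sum of squared deviations)."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0

    def update(self, x):
        # Welford's online update, mutating the buffer in place
        self.n += 1
        d = x - self.mean
        self.mean += d / self.n
        self.m2 += d * (x - self.mean)

    def merge(self, other):
        # Combine two partial buffers (Chan et al. parallel formula)
        if other.n == 0:
            return self
        d = other.mean - self.mean
        n = self.n + other.n
        self.mean += d * other.n / n
        self.m2 += other.m2 + d * d * self.n * other.n / n
        self.n = n
        return self

    def finish(self):
        """Final result: (count, mean, sample variance)."""
        var = self.m2 / (self.n - 1) if self.n > 1 else 0.0
        return self.n, self.mean, var
```

The merge step is what the UDAF interface made awkward: it requires mutating a rich buffer object in place, which is natural here and in TIA but not in the generic UDAF array-copy model.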
[jira] [Commented] (SPARK-19634) Feature parity for descriptive statistics in MLlib
[ https://issues.apache.org/jira/browse/SPARK-19634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15944019#comment-15944019 ] Timothy Hunter commented on SPARK-19634: [~dongjin] [~wm624] sorry it looks like I missed your comments... I pushed a PR for this feature. Please feel free to comment on the PR if you have the time. > Feature parity for descriptive statistics in MLlib > -- > > Key: SPARK-19634 > URL: https://issues.apache.org/jira/browse/SPARK-19634 > Project: Spark > Issue Type: Sub-task > Components: ML > Affects Versions: 2.1.0 > Reporter: Timothy Hunter > Assignee: Timothy Hunter
[jira] [Commented] (SPARK-20111) codegen bug surfaced by GraphFrames issue 165
[ https://issues.apache.org/jira/browse/SPARK-20111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15944004#comment-15944004 ] Timothy Hunter commented on SPARK-20111: As Spark SQL is making more and more forays into code generation, I have been wondering if it would make sense to start adopting practical compiler technologies, such as first generating an intermediate representation, instead of doing string manipulation as we currently do. This is of course much beyond the scope of this particular ticket. > codegen bug surfaced by GraphFrames issue 165 > - > > Key: SPARK-20111 > URL: https://issues.apache.org/jira/browse/SPARK-20111 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.0.2, 2.1.0, 2.2.0 > Reporter: Joseph K. Bradley > > In GraphFrames, test {{test("named edges")}} in {{PatternMatchSuite}} > surfaces a SQL codegen bug. > This is described in https://github.com/graphframes/graphframes/issues/165 > Summary > * The unit test does a simple motif query on a graph. Essentially, this > means taking 2 DataFrames, doing a few joins, selecting 2 columns, and > collecting the (tiny) DataFrame. > * The test runs, but codegen fails. See the linked GraphFrames issue for the > stacktrace. > To reproduce this: > * Check out GraphFrames https://github.com/graphframes/graphframes > * Run {{sbt assembly}} to compile it and run tests > Copying [~felixcheung]'s comment from the GraphFrames issue 165: > {quote} > Seems like codegen bug; it looks like at least 2 issues: > 1. At L472, inputadapter_value is not defined within scope > 2. inputadapter_value is an InternalRow, for this statement to work > {{bhj_primitiveA = inputadapter_value;}} > it should be > {{bhj_primitiveA = inputadapter_value.getLong(0);}} > {quote}
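To make the intermediate-representation idea concrete, here is a toy pure-Python sketch (all names invented; it bears no relation to Catalyst's actual internals): expressions are lowered to a linear IR of named temporaries and only then rendered to source, so an out-of-scope reference like the inputadapter_value bug above fails as a structural check on the IR rather than compiling into broken code:

```python
import itertools
from dataclasses import dataclass

# --- tiny expression tree (the "front end"); names are illustrative ---
@dataclass
class Col:
    index: int          # reference to an input column

@dataclass
class Lit:
    value: int

@dataclass
class Add:
    left: object
    right: object

def lower(expr, ir, names):
    """Lower the tree into a linear IR of (dest, op, args) triples.
    Every operand is an explicitly named temporary."""
    if isinstance(expr, Col):
        dest = f"t{next(names)}"
        ir.append((dest, "load_col", (expr.index,)))
    elif isinstance(expr, Lit):
        dest = f"t{next(names)}"
        ir.append((dest, "const", (expr.value,)))
    elif isinstance(expr, Add):
        a = lower(expr.left, ir, names)
        b = lower(expr.right, ir, names)
        dest = f"t{next(names)}"
        ir.append((dest, "add", (a, b)))
    else:
        raise TypeError(expr)
    return dest

def emit(ir):
    """Render the IR to Python source. An undefined temporary is caught
    here, as a structural error, before any generated code is compiled."""
    lines, defined = ["def f(row):"], set()
    for dest, op, args in ir:
        if op == "load_col":
            lines.append(f"    {dest} = row[{args[0]}]")
        elif op == "const":
            lines.append(f"    {dest} = {args[0]}")
        else:
            a, b = args
            assert a in defined and b in defined, f"undefined temp in {dest}"
            lines.append(f"    {dest} = {a} + {b}")
        defined.add(dest)
    lines.append(f"    return {dest}")
    return "\n".join(lines)

def compile_expr(expr):
    ir = []
    lower(expr, ir, itertools.count())
    namespace = {}
    exec(emit(ir), namespace)
    return namespace["f"]
```

With pure string pasting, the scoping mistake in the GraphFrames stacktrace only surfaces when javac chokes on the generated source; with even a minimal IR, the reference graph can be validated before emission.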
[jira] [Created] (SPARK-20077) Documentation for ml.stats.Correlation
Timothy Hunter created SPARK-20077: -- Summary: Documentation for ml.stats.Correlation Key: SPARK-20077 URL: https://issues.apache.org/jira/browse/SPARK-20077 Project: Spark Issue Type: Sub-task Components: ML Affects Versions: 2.1.0 Reporter: Timothy Hunter Now that (Pearson) correlations are available in spark.ml, we need to write some documentation to go along with this feature. For now, it can simply point to the unit tests as examples. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20076) Python interface for ml.stats.Correlation
Timothy Hunter created SPARK-20076: -- Summary: Python interface for ml.stats.Correlation Key: SPARK-20076 URL: https://issues.apache.org/jira/browse/SPARK-20076 Project: Spark Issue Type: Sub-task Components: ML Affects Versions: 2.1.0 Reporter: Timothy Hunter The (Pearson) correlation statistics have been exposed with a DataFrame interface in Scala as part of SPARK-19636. We should now make them available in Python. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19634) Feature parity for descriptive statistics in MLlib
[ https://issues.apache.org/jira/browse/SPARK-19634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15923119#comment-15923119 ] Timothy Hunter commented on SPARK-19634: I was not able to finish it in time, but the bulk of the code is in this branch: https://github.com/apache/spark/compare/master...thunterdb:19634?expand=1 Note that it currently includes a (non-working) UDAF and an incomplete TypedImperativeAggregate. It turns out that the UDAF interface is not suited for this sort of aggregator, which I realized quite late. I started refactoring my code to use TypedImperativeAggregate, but did not have time to finish it. Anyone who wants to pick up this task is welcome to do so. > Feature parity for descriptive statistics in MLlib > -- > > Key: SPARK-19634 > URL: https://issues.apache.org/jira/browse/SPARK-19634 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.1.0 >Reporter: Timothy Hunter >Assignee: Timothy Hunter > > This ticket tracks porting the functionality of > spark.mllib.MultivariateOnlineSummarizer over to spark.ml. > A design has been discussed in SPARK-19208 . Here is a design doc: > https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit# -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
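For readers unfamiliar with why TypedImperativeAggregate fits this use case better than the generic UDAF interface: it allows the aggregation buffer to be an arbitrary object following an update/merge/eval contract, rather than forcing the state into a Catalyst row. The plain-Python sketch below illustrates that contract only; the class name and method names are illustrative, not the Spark API:

```python
class MinMaxSummarizer:
    """Sketch of the update/merge/eval contract used by imperative
    aggregators: update folds one input row into the buffer, merge combines
    two partial buffers (e.g. from two partitions), eval produces the result."""

    def __init__(self, dim):
        self.mins = [float("inf")] * dim
        self.maxs = [float("-inf")] * dim

    def update(self, vector):
        # Fold a single input vector into this buffer.
        for i, x in enumerate(vector):
            self.mins[i] = min(self.mins[i], x)
            self.maxs[i] = max(self.maxs[i], x)
        return self

    def merge(self, other):
        # Combine a partial buffer computed elsewhere into this one.
        for i in range(len(self.mins)):
            self.mins[i] = min(self.mins[i], other.mins[i])
            self.maxs[i] = max(self.maxs[i], other.maxs[i])
        return self

    def eval(self):
        # Produce the final summary.
        return {"min": self.mins, "max": self.maxs}
```

Because the buffer is a plain object, an aggregator like this can carry exactly the arrays it needs (here two, instead of the eight in MultivariateOnlineSummarizer), which is the point of the refactoring discussed in SPARK-19208.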
[jira] [Commented] (SPARK-19634) Feature parity for descriptive statistics in MLlib
[ https://issues.apache.org/jira/browse/SPARK-19634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15886900#comment-15886900 ] Timothy Hunter commented on SPARK-19634: [~wm624] were you able to start to work on this task? I have some time now and I can work on it. > Feature parity for descriptive statistics in MLlib > -- > > Key: SPARK-19634 > URL: https://issues.apache.org/jira/browse/SPARK-19634 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.1.0 >Reporter: Timothy Hunter > > This ticket tracks porting the functionality of > spark.mllib.MultivariateOnlineSummarizer over to spark.ml. > A design has been discussed in SPARK-19208 . Here is a design doc: > https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit# -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19635) Feature parity for Chi-square hypothesis testing in MLlib
[ https://issues.apache.org/jira/browse/SPARK-19635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15881471#comment-15881471 ] Timothy Hunter commented on SPARK-19635: After working on it, I realized that Column operations do not fit the requested operations very well. Hypothesis testing requires chaining a UDAF with a UDF and then with another UDAF, which is not something that can be expressed inside Catalyst by doing {{dataframe.select(test("features"))}}. I am going to propose a simpler, more direct interface (see the design doc above). > Feature parity for Chi-square hypothesis testing in MLlib > - > > Key: SPARK-19635 > URL: https://issues.apache.org/jira/browse/SPARK-19635 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.1.0 >Reporter: Timothy Hunter > > This ticket tracks porting the functionality of > spark.mllib.Statistics.chiSqTest over to spark.ml. > Here is a design doc: > https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit# -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19636) Feature parity for correlation statistics in MLlib
[ https://issues.apache.org/jira/browse/SPARK-19636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15881457#comment-15881457 ] Timothy Hunter commented on SPARK-19636: After working on it, I realized that Column operations do not fit the requested operations very well. Correlations require chaining a UDAF with a UDF and then with another UDAF, which is not something that can be expressed inside Catalyst by doing {{dataframe.select(corr("features"))}}. I am going to propose a simpler, more direct interface (see the design doc above). > Feature parity for correlation statistics in MLlib > -- > > Key: SPARK-19636 > URL: https://issues.apache.org/jira/browse/SPARK-19636 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.1.0 >Reporter: Timothy Hunter > > This ticket tracks porting the functionality of spark.mllib.Statistics.corr() > over to spark.ml. > Here is a design doc: > https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit# -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19573) Make NaN/null handling consistent in approxQuantile
[ https://issues.apache.org/jira/browse/SPARK-19573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15879387#comment-15879387 ] Timothy Hunter commented on SPARK-19573: I do not have too strong an opinion, as long as: 1. we are consistent within Spark, or 2. we follow the IEEE 754 standard for floating-point arithmetic. I am not sure what the standard is for SQL, though. > Make NaN/null handling consistent in approxQuantile > --- > > Key: SPARK-19573 > URL: https://issues.apache.org/jira/browse/SPARK-19573 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: zhengruifeng > > As discussed in https://github.com/apache/spark/pull/16776, this jira is used > to track the following issue: > The multi-column version of approxQuantile drops the rows containing *any* > NaN/null, so the results are not consistent with the outputs of the > single-column version. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
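The inconsistency tracked by this ticket reduces to a small example. The sketch below computes exact quantiles on plain lists rather than the Greenwald-Khanna sketch Spark actually uses, and the function names are illustrative only; it shows how dropping rows with *any* NaN (multi-column path) can change a column's quantile relative to dropping NaN per column (single-column path):

```python
import math

def approx_quantile_single(values, prob):
    """Single-column semantics: ignore NaN in that column only.
    Exact quantile for illustration (Spark uses an approximate sketch)."""
    clean = sorted(v for v in values if not math.isnan(v))
    idx = min(int(prob * len(clean)), len(clean) - 1)
    return clean[idx]

def approx_quantile_multi(rows, prob):
    """Multi-column semantics under discussion: drop rows containing ANY
    NaN, then compute per-column quantiles on the surviving rows."""
    kept = [r for r in rows if not any(math.isnan(v) for v in r)]
    cols = list(zip(*kept))
    return [approx_quantile_single(list(c), prob) for c in cols]
```

With rows whose second column contains NaNs, the first column's median differs between the two paths, because the multi-column path discards first-column values that were perfectly valid.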
[jira] [Commented] (SPARK-19636) Feature parity for correlation statistics in MLlib
[ https://issues.apache.org/jira/browse/SPARK-19636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15877128#comment-15877128 ] Timothy Hunter commented on SPARK-19636: Looking more closely at the code, it makes sense to start with a replacement of MultivariateStatisticalSummary, which is the basis of PearsonCorrelation and the final step of the Spearman correlation. Also, looking at these algorithms, it is not going to be possible to write them as UDAFs (unlike the original design), so the interface will need to take a {{Dataset[Vector]}} instead of a column. > Feature parity for correlation statistics in MLlib > -- > > Key: SPARK-19636 > URL: https://issues.apache.org/jira/browse/SPARK-19636 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.1.0 >Reporter: Timothy Hunter > > This ticket tracks porting the functionality of spark.mllib.Statistics.corr() > over to spark.ml. > Here is a design doc: > https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit# -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
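MultivariateStatisticalSummary is a natural basis because the Pearson correlation matrix is a pure function of sums and cross-product sums, quantities a single streaming or distributed aggregation can maintain. A minimal one-pass sketch in plain Python (illustrative only; it assumes no constant columns, since those would divide by zero):

```python
import math

def pearson_corr_matrix(vectors):
    """One-pass Pearson correlation matrix from accumulated moments (sums
    and cross-product sums), the same quantities a summarizer-style
    aggregator would maintain. Assumes non-constant columns."""
    n = len(vectors)
    d = len(vectors[0])
    s = [0.0] * d                       # per-coordinate sums
    cp = [[0.0] * d for _ in range(d)]  # cross-product sums
    for v in vectors:
        for i in range(d):
            s[i] += v[i]
            for j in range(d):
                cp[i][j] += v[i] * v[j]
    corr = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(d):
            cov = cp[i][j] / n - (s[i] / n) * (s[j] / n)
            sdi = math.sqrt(cp[i][i] / n - (s[i] / n) ** 2)
            sdj = math.sqrt(cp[j][j] / n - (s[j] / n) ** 2)
            corr[i][j] = cov / (sdi * sdj)
    return corr
```

Because only {{s}} and {{cp}} need to be merged across partitions, the computation fits an update/merge aggregation over a {{Dataset[Vector]}}, which is why the interface change proposed above is workable.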
[jira] [Commented] (SPARK-19636) Feature parity for correlation statistics in MLlib
[ https://issues.apache.org/jira/browse/SPARK-19636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15876929#comment-15876929 ] Timothy Hunter commented on SPARK-19636: Unless someone has started to work on this task, I will take it. > Feature parity for correlation statistics in MLlib > -- > > Key: SPARK-19636 > URL: https://issues.apache.org/jira/browse/SPARK-19636 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.1.0 >Reporter: Timothy Hunter > > This ticket tracks porting the functionality of spark.mllib.Statistics.corr() > over to spark.ml. > Here is a design doc: > https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit# -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19636) Feature parity for correlation statistics in MLlib
Timothy Hunter created SPARK-19636: -- Summary: Feature parity for correlation statistics in MLlib Key: SPARK-19636 URL: https://issues.apache.org/jira/browse/SPARK-19636 Project: Spark Issue Type: Sub-task Components: ML Affects Versions: 2.1.0 Reporter: Timothy Hunter This ticket tracks porting the functionality of spark.mllib.Statistics.corr() over to spark.ml. Here is a design doc: https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit# -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19635) Feature parity for Chi-square hypothesis testing in MLlib
Timothy Hunter created SPARK-19635: -- Summary: Feature parity for Chi-square hypothesis testing in MLlib Key: SPARK-19635 URL: https://issues.apache.org/jira/browse/SPARK-19635 Project: Spark Issue Type: Sub-task Components: ML Affects Versions: 2.1.0 Reporter: Timothy Hunter This ticket tracks porting the functionality of spark.mllib.Statistics.chiSqTest over to spark.ml. Here is a design doc: https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit# -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19634) Feature parity for descriptive statistics in MLlib
Timothy Hunter created SPARK-19634: -- Summary: Feature parity for descriptive statistics in MLlib Key: SPARK-19634 URL: https://issues.apache.org/jira/browse/SPARK-19634 Project: Spark Issue Type: Sub-task Components: ML Affects Versions: 2.1.0 Reporter: Timothy Hunter This ticket tracks porting the functionality of spark.mllib.MultivariateOnlineSummarizer over to spark.ml. A design has been discussed in SPARK-19208 . Here is a design doc: https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit# -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19208) MultivariateOnlineSummarizer performance optimization
[ https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15870655#comment-15870655 ] Timothy Hunter commented on SPARK-19208: I put together the ideas in this thread into a document. I will update the umbrella ticket with sub tasks once folks have had a chance to comment: https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit# > MultivariateOnlineSummarizer performance optimization > - > > Key: SPARK-19208 > URL: https://issues.apache.org/jira/browse/SPARK-19208 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: zhengruifeng > Attachments: Tests.pdf, WechatIMG2621.jpeg > > > Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using > {{MultivariateOnlineSummarizer}} to compute the min/max. > However {{MultivariateOnlineSummarizer}} will also compute extra unused > statistics. It slows down the task, moreover it is more prone to cause OOM. > For example: > env : --driver-memory 4G --executor-memory 1G --num-executors 4 > data: > [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)] > 748401 instances, and 29,890,095 features > {{MaxAbsScaler.fit}} fails because of OOM > {{MultivariateOnlineSummarizer}} maintains 8 arrays: > {code} > private var currMean: Array[Double] = _ > private var currM2n: Array[Double] = _ > private var currM2: Array[Double] = _ > private var currL1: Array[Double] = _ > private var totalCnt: Long = 0 > private var totalWeightSum: Double = 0.0 > private var weightSquareSum: Double = 0.0 > private var weightSum: Array[Double] = _ > private var nnz: Array[Long] = _ > private var currMax: Array[Double] = _ > private var currMin: Array[Double] = _ > {code} > For {{MaxAbsScaler}}, only 1 array is needed (max of abs value) > For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz) > After modication in the pr, the above example run successfully. 
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19208) MultivariateOnlineSummarizer performance optimization
[ https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15866768#comment-15866768 ] Timothy Hunter commented on SPARK-19208: Yes, I meant returning a struct and then projecting this struct. I do not think there is any other way right now with the current UDAFs, as you mention. In that proposal, {{VectorSummarizer.metrics(...).summary(...)}} returns a struct, the fields of which are decided by the arguments in {{.metrics}}, and each of the individual functions {{VectorSummarizer.min/max/variance(...)}} returns columns of vectors or matrices. > MultivariateOnlineSummarizer performance optimization > - > > Key: SPARK-19208 > URL: https://issues.apache.org/jira/browse/SPARK-19208 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: zhengruifeng > Attachments: Tests.pdf, WechatIMG2621.jpeg > > > Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using > {{MultivariateOnlineSummarizer}} to compute the min/max. > However {{MultivariateOnlineSummarizer}} will also compute extra unused > statistics. It slows down the task, moreover it is more prone to cause OOM. 
> For example: > env : --driver-memory 4G --executor-memory 1G --num-executors 4 > data: > [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)] > 748401 instances, and 29,890,095 features > {{MaxAbsScaler.fit}} fails because of OOM > {{MultivariateOnlineSummarizer}} maintains 8 arrays: > {code} > private var currMean: Array[Double] = _ > private var currM2n: Array[Double] = _ > private var currM2: Array[Double] = _ > private var currL1: Array[Double] = _ > private var totalCnt: Long = 0 > private var totalWeightSum: Double = 0.0 > private var weightSquareSum: Double = 0.0 > private var weightSum: Array[Double] = _ > private var nnz: Array[Long] = _ > private var currMax: Array[Double] = _ > private var currMin: Array[Double] = _ > {code} > For {{MaxAbsScaler}}, only 1 array is needed (max of abs value) > For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz) > After modication in the pr, the above example run successfully. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-19208) MultivariateOnlineSummarizer performance optimization
[ https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15866714#comment-15866714 ] Timothy Hunter edited comment on SPARK-19208 at 2/14/17 9:24 PM: - Thanks for the clarification [~mlnick]. I was a bit unclear in my previous comment. What I meant by catalyst rules is supporting the case in which the user would naturally request multiple summaries: {code} val summaryDF = df.select(VectorSummary.min("features"), VectorSummary.variance("features")) {code} and have a simple rule that rewrites this logical tree to use a single UDAF under the hood: {code} val tmpDF = df.select(VectorSummary.summary("features", "min", "variance")) val df2 = tmpDF.select(col("vector_summary(features).min").as("min(features)"), col("vector_summary(features).variance").as("variance(features)") {code} Of course this is more advanced, and we should probably start with: - a UDAF that follows some builder pattern such as VectorSummarizer.metrics("min", "max").summary("features") - some simple wrappers that (inefficiently) compute independently their statistics: {{VectorSummarizer.min("feature")}} is a shortcut for: {code} VectorSummarizer.metrics("min").summary("features").getCol("min") {code} etc. We can always optimize this use case later using rewrite rules. What do you think? was (Author: timhunter): Thanks for the clarification [~mlnick]. I was a bit unclear in my previous comment. 
What I meant by catalyst rules is supporting the case in which the user would naturally request multiple summaries: {code} val summaryDF = df.select(VectorSummary.min("features"), VectorSummary.variance("features")) {code} and have a simple rule that rewrites this logical tree to use a single UDAF under the hood: {code} val tmpDF = df.select(VectorSummary.summary("features", "min", "variance")) val df2 = tmpDF.select(col("VectorSummary(features).min").as("min(features)"), col("VectorSummary(features).variance").as("variance(features)") {code} Of course this is more advanced, and we should probably start with: - a UDAF that follows some builder pattern such as VectorSummarizer.metrics("min", "max").summary("features") - some simple wrappers that (inefficiently) compute independently their statistics: {{VectorSummarizer.min("feature")}} is a shortcut for: {code} VectorSummarizer.metrics("min").summary("features").getCol("min") {code} etc. We can always optimize this use case later using rewrite rules. What do you think? > MultivariateOnlineSummarizer performance optimization > - > > Key: SPARK-19208 > URL: https://issues.apache.org/jira/browse/SPARK-19208 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: zhengruifeng > Attachments: Tests.pdf, WechatIMG2621.jpeg > > > Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using > {{MultivariateOnlineSummarizer}} to compute the min/max. > However {{MultivariateOnlineSummarizer}} will also compute extra unused > statistics. It slows down the task, moreover it is more prone to cause OOM. 
> For example: > env : --driver-memory 4G --executor-memory 1G --num-executors 4 > data: > [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)] > 748401 instances, and 29,890,095 features > {{MaxAbsScaler.fit}} fails because of OOM > {{MultivariateOnlineSummarizer}} maintains 8 arrays: > {code} > private var currMean: Array[Double] = _ > private var currM2n: Array[Double] = _ > private var currM2: Array[Double] = _ > private var currL1: Array[Double] = _ > private var totalCnt: Long = 0 > private var totalWeightSum: Double = 0.0 > private var weightSquareSum: Double = 0.0 > private var weightSum: Array[Double] = _ > private var nnz: Array[Long] = _ > private var currMax: Array[Double] = _ > private var currMin: Array[Double] = _ > {code} > For {{MaxAbsScaler}}, only 1 array is needed (max of abs value) > For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz) > After modication in the pr, the above example run successfully. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19208) MultivariateOnlineSummarizer performance optimization
[ https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15866714#comment-15866714 ] Timothy Hunter commented on SPARK-19208: Thanks for the clarification [~mlnick]. I was a bit unclear in my previous comment. What I meant by catalyst rules is supporting the case in which the user would naturally request multiple summaries: {code} val summaryDF = df.select(VectorSummary.min("features"), VectorSummary.variance("features")) {code} and have a simple rule that rewrites this logical tree to use a single UDAF under the hood: {code} val tmpDF = df.select(VectorSummary.summary("features", "min", "variance")) val df2 = tmpDF.select(col("VectorSummary(features).min").as("min(features)"), col("VectorSummary(features).variance").as("variance(features)") {code} Of course this is more advanced, and we should probably start with: - a UDAF that follows some builder pattern such as VectorSummarizer.metrics("min", "max").summary("features") - some simple wrappers that (inefficiently) compute independently their statistics: {{VectorSummarizer.min("feature")}} is a shortcut for: {code} VectorSummarizer.metrics("min").summary("features").getCol("min") {code} etc. We can always optimize this use case later using rewrite rules. What do you think? > MultivariateOnlineSummarizer performance optimization > - > > Key: SPARK-19208 > URL: https://issues.apache.org/jira/browse/SPARK-19208 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: zhengruifeng > Attachments: Tests.pdf, WechatIMG2621.jpeg > > > Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using > {{MultivariateOnlineSummarizer}} to compute the min/max. > However {{MultivariateOnlineSummarizer}} will also compute extra unused > statistics. It slows down the task, moreover it is more prone to cause OOM. 
> For example: > env : --driver-memory 4G --executor-memory 1G --num-executors 4 > data: > [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)] > 748401 instances, and 29,890,095 features > {{MaxAbsScaler.fit}} fails because of OOM > {{MultivariateOnlineSummarizer}} maintains 8 arrays: > {code} > private var currMean: Array[Double] = _ > private var currM2n: Array[Double] = _ > private var currM2: Array[Double] = _ > private var currL1: Array[Double] = _ > private var totalCnt: Long = 0 > private var totalWeightSum: Double = 0.0 > private var weightSquareSum: Double = 0.0 > private var weightSum: Array[Double] = _ > private var nnz: Array[Long] = _ > private var currMax: Array[Double] = _ > private var currMin: Array[Double] = _ > {code} > For {{MaxAbsScaler}}, only 1 array is needed (max of abs value) > For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz) > After modication in the pr, the above example run successfully. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-19208) MultivariateOnlineSummarizer performance optimization
[ https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15866535#comment-15866535 ] Timothy Hunter edited comment on SPARK-19208 at 2/14/17 8:04 PM: - I am not sure if we should follow the Estimator API for classical statistics: - it does not transform the data, it only gets fitted, so it does not quite fit the Estimator API. - more generally, I would argue that the use case is to get some information about a dataframe for its own sake, rather than being part of a ML pipeline. For instance, there was no attempt to fit these algorithms into spark.mllib estimator/model API, and basic scalers are already in the transformer API. I want to second [~josephkb]'s API, because it is the most flexible with respect to implementation, and the only one that is compatible with structured streaming and groupBy. That means users will be able to use all the summary stats without additional work from us to retrofit the API to structured streaming. Furthermore, the exact implementation details (a single private UDAF, more optimized catalyst-based transforms) can be implemented in the future without changing the API. As an intermediate step, if introducing catalyst rules is too hard for now and if we want to address [~mlnick]'s points (a) and (b), we can have an API like this: {code} df.select(VectorSummary.summary("features", "min", "mean", ...) df.select(VectorSummary.summaryWeighted("features", "weights", "min", "mean", ...) {code} or: {code} df.select(VectorSummary.summaryStats("min", "mean").summary("features") df.select(VectorSummary.summaryStats("min", "mean").summaryWeighted("features", "weights") {code} What do you think? I will be happy to put together a proposal. was (Author: timhunter): I am not sure if we should follow the Estimator API for classical statistics: - it does not transform the data, it only gets fitted, so it does not quite fit the Estimator API. 
- more generally, I would argue that the use case is to get some information about a dataframe for its own sake, rather than being part of a ML pipeline. For instance, there was no attempt to fit these algorithms into spark.mllib estimator/model API, and basic scalers are already in the transformer API. I want to second [~josephkb]'s API, because it is the most flexible with respect to implementation, and the only one that is compatible with structured streaming and groupBy. That means users will be able to use all the summary stats without additional work from us to retrofit the API to structured streaming. Furthermore, the exact implementation details (a single private UDAF, more optimized catalyst-based transforms) can be implemented in the future without changing the API. As an intermediate step, if introducing catalyst rules is too hard for now and if we want to address [~mlnick]'s points (a) and (b), we can have a the following API: {code} df.select(VectorSummary.summary("features", "min", "mean", ...) df.select(VectorSummary.summaryWeighted("features", "weights", "min", "mean", ...) {code} or: {code} df.select(VectorSummary.summaryStats("min", "mean").summary("features") df.select(VectorSummary.summaryStats("min", "mean").summaryWeighted("features", "weights") {code} What do you think? I will be happy to put together a proposal. > MultivariateOnlineSummarizer performance optimization > - > > Key: SPARK-19208 > URL: https://issues.apache.org/jira/browse/SPARK-19208 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: zhengruifeng > Attachments: Tests.pdf, WechatIMG2621.jpeg > > > Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using > {{MultivariateOnlineSummarizer}} to compute the min/max. > However {{MultivariateOnlineSummarizer}} will also compute extra unused > statistics. It slows down the task, moreover it is more prone to cause OOM. 
> For example: > env : --driver-memory 4G --executor-memory 1G --num-executors 4 > data: > [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)] > 748401 instances, and 29,890,095 features > {{MaxAbsScaler.fit}} fails because of OOM > {{MultivariateOnlineSummarizer}} maintains 8 arrays: > {code} > private var currMean: Array[Double] = _ > private var currM2n: Array[Double] = _ > private var currM2: Array[Double] = _ > private var currL1: Array[Double] = _ > private var totalCnt: Long = 0 > private var totalWeightSum: Double = 0.0 > private var weightSquareSum: Double = 0.0 > private var weightSum: Array[Double] = _ > private var nnz: Array[Long] = _ > private var currMax: Array[Double] = _ > private var currMin: Array[Double] = _ > {code} > For {{MaxAbsScaler}}, only 1 array is needed (max of
[jira] [Commented] (SPARK-19208) MultivariateOnlineSummarizer performance optimization
[ https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15866535#comment-15866535 ] Timothy Hunter commented on SPARK-19208: I am not sure if we should follow the Estimator API for classical statistics: - it does not transform the data, it only gets fitted, so it does not quite fit the Estimator API. - more generally, I would argue that the use case is to get some information about a dataframe for its own sake, rather than being part of an ML pipeline. For instance, there was no attempt to fit these algorithms into spark.mllib estimator/model API, and basic scalers are already in the transformer API. I want to second [~josephkb]'s API, because it is the most flexible with respect to implementation, and the only one that is compatible with structured streaming and groupBy. That means users will be able to use all the summary stats without additional work from us to retrofit the API to structured streaming. Furthermore, the exact implementation details (a single private UDAF, more optimized catalyst-based transforms) can be implemented in the future without changing the API. As an intermediate step, if introducing catalyst rules is too hard for now and if we want to address [~mlnick]'s points (a) and (b), we can have the following API: {code} df.select(VectorSummary.summary("features", "min", "mean", ...)) df.select(VectorSummary.summaryWeighted("features", "weights", "min", "mean", ...)) {code} or: {code} df.select(VectorSummary.summaryStats("min", "mean").summary("features")) df.select(VectorSummary.summaryStats("min", "mean").summaryWeighted("features", "weights")) {code} What do you think? I will be happy to put together a proposal. 
> MultivariateOnlineSummarizer performance optimization > - > > Key: SPARK-19208 > URL: https://issues.apache.org/jira/browse/SPARK-19208 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: zhengruifeng > Attachments: Tests.pdf, WechatIMG2621.jpeg > > > Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using > {{MultivariateOnlineSummarizer}} to compute the min/max. > However {{MultivariateOnlineSummarizer}} will also compute extra unused > statistics. It slows down the task, moreover it is more prone to cause OOM. > For example: > env : --driver-memory 4G --executor-memory 1G --num-executors 4 > data: > [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)] > 748401 instances, and 29,890,095 features > {{MaxAbsScaler.fit}} fails because of OOM > {{MultivariateOnlineSummarizer}} maintains 8 arrays: > {code} > private var currMean: Array[Double] = _ > private var currM2n: Array[Double] = _ > private var currM2: Array[Double] = _ > private var currL1: Array[Double] = _ > private var totalCnt: Long = 0 > private var totalWeightSum: Double = 0.0 > private var weightSquareSum: Double = 0.0 > private var weightSum: Array[Double] = _ > private var nnz: Array[Long] = _ > private var currMax: Array[Double] = _ > private var currMin: Array[Double] = _ > {code} > For {{MaxAbsScaler}}, only 1 array is needed (max of abs value) > For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz) > After modication in the pr, the above example run successfully. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
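The builder pattern floated in the comments above, {{VectorSummarizer.metrics("min", "max").summary("features")}}, can be sketched outside Spark to make the shape of the API concrete. The names below mirror the JIRA discussion and are not a real Spark API; the computation is deliberately naive:

```python
class VectorSummarizerBuilder:
    """Sketch of the proposed builder: the caller picks metrics up front,
    then asks for a summary, and the result is a struct (here a dict)
    whose fields are decided by the requested metrics."""

    _FUNCS = {
        "min": lambda col: min(col),
        "max": lambda col: max(col),
        "mean": lambda col: sum(col) / len(col),
    }

    def __init__(self, requested):
        unknown = set(requested) - set(self._FUNCS)
        if unknown:
            raise ValueError(f"unknown metrics: {unknown}")
        self.requested = requested

    def summary(self, vectors):
        # One shared pass over the data would be the optimization goal;
        # this sketch just applies each metric per column.
        cols = list(zip(*vectors))
        return {m: [self._FUNCS[m](c) for c in cols] for m in self.requested}

def metrics(*names):
    """Entry point mirroring VectorSummarizer.metrics(...)."""
    return VectorSummarizerBuilder(names)
```

The design point this illustrates: because the caller declares all needed metrics before the aggregation runs, a single pass can compute exactly those statistics, which is what lets the eventual implementation avoid the unused-arrays overhead described in the SPARK-19208 ticket.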
[jira] [Commented] (SPARK-14523) Feature parity for Statistics ML with MLlib
[ https://issues.apache.org/jira/browse/SPARK-14523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15866295#comment-15866295 ] Timothy Hunter commented on SPARK-14523: Also, correlation is missing the multivariate case. I will take this task over unless someone else expresses interest. > Feature parity for Statistics ML with MLlib > --- > > Key: SPARK-14523 > URL: https://issues.apache.org/jira/browse/SPARK-14523 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: yuhao yang > > Some statistics functions are already supported by DataFrame directly. Use this > jira to discuss/design the statistics package in Spark.ML and its function > scope. Hypothesis testing and correlation computation may still need to expose > independent interfaces.
[jira] [Commented] (SPARK-4591) Algorithm/model parity for spark.ml (Scala)
[ https://issues.apache.org/jira/browse/SPARK-4591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15866288#comment-15866288 ] Timothy Hunter commented on SPARK-4591: --- [~josephkb] do you also want some subtasks for KernelDensity and multivariate summaries? They are in the stat module but not covered. > Algorithm/model parity for spark.ml (Scala) > --- > > Key: SPARK-4591 > URL: https://issues.apache.org/jira/browse/SPARK-4591 > Project: Spark > Issue Type: Umbrella > Components: ML >Reporter: Xiangrui Meng >Priority: Critical > > This is an umbrella JIRA for porting spark.mllib implementations to use the > DataFrame-based API defined under spark.ml. We want to achieve critical > feature parity for the next release. > h3. Instructions for 3 subtask types > *Review tasks*: detailed review of a subpackage to identify feature gaps > between spark.mllib and spark.ml. > * Should be listed as a subtask of this umbrella. > * Review subtasks cover major algorithm groups. To pick up a review subtask, > please: > ** Comment that you are working on it. > ** Compare the public APIs of spark.ml vs. spark.mllib. > ** Comment on all missing items within spark.ml: algorithms, models, methods, > features, etc. > ** Check for existing JIRAs covering those items. If there is no existing > JIRA, create one, and link it to your comment. > *Critical tasks*: higher priority missing features which are required for > this umbrella JIRA. > * Should be linked as "requires" links. > *Other tasks*: lower priority missing features which can be completed after > the critical tasks. > * Should be linked as "contains" links. > h4. Excluded items > This does *not* include: > * Python: We can compare Scala vs. Python in spark.ml itself. > * Moving linalg to spark.ml: [SPARK-13944] > * Streaming ML: Requires stabilizing some internal APIs of structured > streaming first > h3. 
TODO list > *Critical issues* > * [SPARK-14501]: Frequent Pattern Mining > * [SPARK-14709]: linear SVM > * [SPARK-15784]: Power Iteration Clustering (PIC) > *Lower priority issues* > * Missing methods within algorithms (see Issue Links below) > * evaluation submodule > * stat submodule (should probably be covered in DataFrames) > * Developer-facing submodules: > ** optimization (including [SPARK-17136]) > ** random, rdd > ** util > *To be prioritized* > * single-instance prediction: [SPARK-10413] > * pmml [SPARK-11171] -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8884) 1-sample Anderson-Darling Goodness-of-Fit test
[ https://issues.apache.org/jira/browse/SPARK-8884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15657825#comment-15657825 ] Timothy Hunter commented on SPARK-8884: --- I do not have a strong preference either way. We should just either complete this feature (with DataFrame APIs) or close the open PR. > 1-sample Anderson-Darling Goodness-of-Fit test > -- > > Key: SPARK-8884 > URL: https://issues.apache.org/jira/browse/SPARK-8884 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Jose Cambronero > > We have implemented a 1-sample Anderson-Darling goodness-of-fit test to add > to the current hypothesis testing functionality. The current implementation > supports various distributions (normal, exponential, gumbel, logistic, and > weibull). However, users must provide distribution parameters for all except > normal/exponential (in which case they are estimated from the data). In > contrast to other tests, such as the Kolmogorov Smirnov test, we only support > specific distributions as the critical values depend on the distribution > being tested. > The distributed implementation of AD takes advantage of the fact that we can > calculate a portion of the statistic within each partition of a sorted data > set, independent of the global order of those observations. We can then carry > some additional information that allows us to adjust the final amounts once > we have collected 1 result per partition. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
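The description above notes that a portion of the AD statistic can be computed within each partition of a globally sorted dataset, given only each partition's global offset. A hedged Python sketch of that decomposition (all names hypothetical; this is not the PR's code): it uses the algebraically equivalent per-element form of the one-sample statistic, A² = -n - (1/n) Σᵢ [(2i-1) ln F(xᵢ) + (2(n-i)+1) ln(1-F(xᵢ))], which makes each element's contribution depend only on its global rank and n.

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    # CDF of the normal distribution via the error function.
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def ad_partition_sum(sorted_chunk, start_index, n, cdf):
    # Contribution of one sorted partition to the AD sum.
    # start_index is the global 0-based index of the chunk's first element;
    # this is the "additional information" carried per partition.
    total = 0.0
    for j, x in enumerate(sorted_chunk):
        i = start_index + j + 1  # global 1-based rank
        f = cdf(x)
        total += (2 * i - 1) * math.log(f) + (2 * (n - i) + 1) * math.log(1.0 - f)
    return total

def anderson_darling(sorted_data, cdf, num_partitions=1):
    # Split globally sorted data into contiguous chunks, as a distributed
    # implementation would, and combine the per-partition sums.
    n = len(sorted_data)
    size = math.ceil(n / num_partitions)
    s = sum(ad_partition_sum(sorted_data[k:k + size], k, n, cdf)
            for k in range(0, n, size))
    return -n - s / n
```

The chunked and single-pass computations agree up to floating-point summation order, which is the property the distributed implementation relies on.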
[jira] [Commented] (SPARK-8884) 1-sample Anderson-Darling Goodness-of-Fit test
[ https://issues.apache.org/jira/browse/SPARK-8884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15655563#comment-15655563 ] Timothy Hunter commented on SPARK-8884: --- [~srowen] this ticket should still be open I believe? [~yuhaoyan] has an open PR for it. > 1-sample Anderson-Darling Goodness-of-Fit test > -- > > Key: SPARK-8884 > URL: https://issues.apache.org/jira/browse/SPARK-8884 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Jose Cambronero > > We have implemented a 1-sample Anderson-Darling goodness-of-fit test to add > to the current hypothesis testing functionality. The current implementation > supports various distributions (normal, exponential, gumbel, logistic, and > weibull). However, users must provide distribution parameters for all except > normal/exponential (in which case they are estimated from the data). In > contrast to other tests, such as the Kolmogorov Smirnov test, we only support > specific distributions as the critical values depend on the distribution > being tested. > The distributed implementation of AD takes advantage of the fact that we can > calculate a portion of the statistic within each partition of a sorted data > set, independent of the global order of those observations. We can then carry > some additional information that allows us to adjust the final amounts once > we have collected 1 result per partition. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17845) Improve window function frame boundary API in DataFrame
[ https://issues.apache.org/jira/browse/SPARK-17845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15566380#comment-15566380 ] Timothy Hunter commented on SPARK-17845: I like the {{Window.rowsBetween(Long.MinValue, -3)}} syntax, but it is exposing a system implementation detail. How about having some static/singleton values that define our notion of plus/minus infinity instead of relying on the system values? Here is a suggestion: {code} Window.rowsBetween(Window.unboundedBefore, -3) object Window { def unboundedBefore: Long = Int.MinValue.toLong } {code} To get around the different integer sizes across languages, I suggest we say that every value with magnitude above 2^31 is considered unbounded (below or above). That should be more than enough and covers at least Python, Scala, R, and Java. > Improve window function frame boundary API in DataFrame > --- > > Key: SPARK-17845 > URL: https://issues.apache.org/jira/browse/SPARK-17845 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > ANSI SQL uses the following to specify the frame boundaries for window > functions: > {code} > ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING > ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING > ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW > ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING > ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING > {code} > In Spark's DataFrame API, we use integer values to indicate relative position: > - 0 means "CURRENT ROW" > - -1 means "1 PRECEDING" > - Long.MinValue means "UNBOUNDED PRECEDING" > - Long.MaxValue means "UNBOUNDED FOLLOWING" > {code} > // ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING > Window.rowsBetween(-3, +3) > // ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING > Window.rowsBetween(Long.MinValue, -3) > // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW > Window.rowsBetween(Long.MinValue, 0) > // ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING > 
Window.rowsBetween(0, Long.MaxValue) > // ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING > Window.rowsBetween(Long.MinValue, Long.MaxValue) > {code} > I think using numeric values to indicate relative positions is actually a > good idea, but the reliance on Long.MinValue and Long.MaxValue to indicate > unbounded ends is pretty confusing: > 1. The API is not self-evident. There is no way for a new user to figure out > how to indicate an unbounded frame by looking at just the API. The user has > to read the doc to figure this out. > 2. It is weird that Long.MinValue or Long.MaxValue has special meaning. > 3. Different languages have different min/max values, e.g. in Python we use > -sys.maxsize and +sys.maxsize. > To make this API less confusing, we have a few options: > Option 1. Add the following (additional) methods: > {code} > // ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING > Window.rowsBetween(-3, +3) // this one exists already > // ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING > Window.rowsBetweenUnboundedPrecedingAnd(-3) > // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW > Window.rowsBetweenUnboundedPrecedingAndCurrentRow() > // ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING > Window.rowsBetweenCurrentRowAndUnboundedFollowing() > // ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING > Window.rowsBetweenUnboundedPrecedingAndUnboundedFollowing() > {code} > This is obviously very verbose, but is very similar to how these functions > are done in SQL, and is perhaps the most obvious to end users, especially if > they come from a SQL background. > Option 2. Decouple the specification for frame begin and frame end into two > functions. Assume the boundary is unlimited unless specified. 
> {code} > // ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING > Window.rowsFrom(-3).rowsTo(3) > // ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING > Window.rowsTo(-3) > // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW > Window.rowsToCurrent() or Window.rowsTo(0) > // ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING > Window.rowsFromCurrent() or Window.rowsFrom(0) > // ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING > // no need to specify > {code} > If we go with option 2, we should throw exceptions if users specify multiple > from's or to's. A variant of option 2 is to require explicit specification > of begin/end even in the case of an unbounded boundary, e.g.: > {code} > Window.rowsFromBeginning().rowsTo(-3) > or > Window.rowsFromUnboundedPreceding().rowsTo(-3) > {code}
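The convention suggested in the comment above (any offset whose magnitude reaches 2^31 is treated as unbounded, sidestepping per-language min/max values) can be sketched as follows. The function and sentinel names are hypothetical, not part of any Spark API:

```python
# Hypothetical sketch: normalize a numeric frame-boundary offset into
# its ANSI SQL meaning, treating any magnitude >= 2^31 as unbounded so
# that Long.MinValue, -sys.maxsize, etc. all map to the same thing.
FRAME_THRESHOLD = 1 << 31  # 2^31

def describe_boundary(value):
    if value <= -FRAME_THRESHOLD:
        return "UNBOUNDED PRECEDING"
    if value >= FRAME_THRESHOLD:
        return "UNBOUNDED FOLLOWING"
    if value == 0:
        return "CURRENT ROW"
    return f"{-value} PRECEDING" if value < 0 else f"{value} FOLLOWING"
```

With this rule, a caller can pass whatever "very large" sentinel its language provides and still get a well-defined frame.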
[jira] [Commented] (SPARK-17219) QuantileDiscretizer does strange things with NaN values
[ https://issues.apache.org/jira/browse/SPARK-17219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15553490#comment-15553490 ] Timothy Hunter commented on SPARK-17219: If I understand the PR correctly, I am concerned by this approach for a couple of reasons: - when users set the number of buckets, the general expectation should be that (number of returned buckets) <= (number of requested buckets). With the current treatment of NaN, you can end up with more buckets than you asked for. Breaking this invariant seems troublesome to me. - in general, MLlib's policy regarding NaNs has been to consider them invalid input. This is also the approach followed by sklearn and the reason for having an imputer with SPARK-13568. If we start to let NaN values go through, they will trigger other issues down the pipeline. Why not simply stop with an error at that point, as [~srowen] suggested at the beginning? [~barrybecker4], I am trying to understand your use case here. > QuantileDiscretizer does strange things with NaN values > --- > > Key: SPARK-17219 > URL: https://issues.apache.org/jira/browse/SPARK-17219 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.6.2 >Reporter: Barry Becker >Assignee: Vincent > Fix For: 2.1.0 > > > How is the QuantileDiscretizer supposed to handle null values? > Actual nulls are not allowed, so I replace them with Double.NaN. > However, when you try to run the QuantileDiscretizer on a column that > contains NaNs, it will create (possibly more than one) NaN split(s) before > the final PositiveInfinity value. > I am using the attached titanic CSV data and trying to bin the "age" column > using the QuantileDiscretizer with 10 bins specified. The age column has a lot > of null values. > These are the splits that I get: > {code} > -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, NaN, NaN, Infinity > {code} > Is that expected? 
It seems to imply that NaN is larger than any positive > number and less than infinity. > I'm not sure of the best way to handle nulls, but I think they need a bucket > all their own. My suggestion would be to include an initial NaN split value > that is always there, just like the sentinel Infinities are. If that were the > case, then the splits for the example above might look like this: > {code} > NaN, -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, Infinity > {code} > This does not seem great either because a bucket that is [NaN, -Inf] doesn't > make much sense. Not sure if the NaN bucket counts toward numBins or not. I > do think it should always be there though in case future data has null even > though the fit data did not. Thoughts?
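The two policies discussed in this thread (fail fast on NaN as invalid input, versus giving NaNs a dedicated extra bucket) can be contrasted in a short sketch. The function and its `handle_invalid` parameter are illustrative names, not the actual Spark API:

```python
import bisect
import math

def bucketize(values, splits, handle_invalid="error"):
    # Assign each value to a bucket defined by sorted split points,
    # e.g. [-inf, 15.0, 48.0, +inf] defines buckets 0, 1, 2.
    # NaN handling follows the policy argued for in the comment above:
    # "error" rejects NaN as invalid input; "keep" routes NaNs to a
    # dedicated extra bucket so numBins stays predictable.
    buckets = []
    nan_bucket = len(splits) - 1  # one past the last regular bucket
    for v in values:
        if math.isnan(v):
            if handle_invalid == "error":
                raise ValueError("NaN encountered; clean or impute the data first")
            buckets.append(nan_bucket)
            continue
        buckets.append(bisect.bisect_right(splits, v) - 1)
    return buckets
```

Either way, the splits themselves never contain NaN, so the invariant (returned buckets) <= (requested buckets) + 1 holds.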
[jira] [Commented] (SPARK-17074) generate histogram information for column
[ https://issues.apache.org/jira/browse/SPARK-17074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15537295#comment-15537295 ] Timothy Hunter commented on SPARK-17074: We have discussed this through email and either is fine. Regarding the second one, even if the result is approximate, you can still get some reasonable bounds on the error. > generate histogram information for column > - > > Key: SPARK-17074 > URL: https://issues.apache.org/jira/browse/SPARK-17074 > Project: Spark > Issue Type: Sub-task > Components: Optimizer >Affects Versions: 2.0.0 >Reporter: Ron Hu > > We support two kinds of histograms: > - Equi-width histogram: We have a fixed width for each column interval in > the histogram. The height of a histogram represents the frequency for those > column values in a specific interval. For this kind of histogram, its height > varies for different column intervals. We use the equi-width histogram when > the number of distinct values is less than 254. > - Equi-height histogram: For this histogram, the width of each column interval > varies. The heights of all column intervals are the same. The equi-height > histogram is effective in handling skewed data distribution. We use the > equi-height histogram when the number of distinct values is equal to or greater > than 254.
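The two histogram kinds described above can be sketched in a few lines of Python (illustrative only; the function names are made up): equi-width fixes the interval width and lets counts vary, while equi-height places bin boundaries at quantiles so every bin holds roughly the same number of rows.

```python
def equi_width_histogram(values, num_bins):
    # Fixed interval width; counts (heights) vary per interval.
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins or 1.0  # guard against a constant column
    counts = [0] * num_bins
    for v in values:
        idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    return counts

def equi_height_histogram(sorted_values, num_bins):
    # Bin boundaries at (approximate) quantile positions, so each bin
    # holds roughly the same number of rows; robust to skewed data.
    n = len(sorted_values)
    return [sorted_values[min(round(i * n / num_bins), n - 1)]
            for i in range(num_bins + 1)]
```

In a real optimizer the equi-height boundaries would come from an approximate quantile sketch rather than a full sort, which is where the error bounds mentioned in the comment come in.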
[jira] [Updated] (SPARK-16485) Additional fixes to Mllib 2.0 documentation
[ https://issues.apache.org/jira/browse/SPARK-16485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Hunter updated SPARK-16485: --- Description: While reviewing the documentation of MLlib, I found some additional issues. Important issues that affect the binary signatures: - GBTClassificationModel: all the setters should be overridden - LogisticRegressionModel: setThreshold(s) - RandomForestClassificationModel: all the setters should be overridden - org.apache.spark.ml.stat.distribution.MultivariateGaussian is exposed but most of the methods are private[ml] -> do we need to expose this class for now? - GeneralizedLinearRegressionModel: linkObj, familyObj, familyAndLink should not be exposed - sqlDataTypes: name does not follow conventions. Do we need to expose it? Issues that involve only documentation: - Evaluator: 1. inconsistent doc between evaluate and isLargerBetter - MinMaxScaler: math rendering - GeneralizedLinearRegressionSummary: aic doc is incorrect The reference documentation that was used was: http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc2-docs/ was: While reviewing the documentation of MLlib, I found some additional issues. Important issues that affect the binary signatures: - GBTClassificationModel: all the setters should be overridden - LogisticRegressionModel: setThreshold(s) - RandomForestClassificationModel: all the setters should be overridden - org.apache.spark.ml.stat.distribution.MultivariateGaussian is exposed but most of the methods are private[ml] -> do we need to expose this class for now? - GeneralizedLinearRegressionModel: linkObj, familyObj, familyAndLink should not be exposed - sqlDataTypes: name does not follow conventions. Do we need to expose it? Issues that involve only documentation: - Evaluator: 1. inconsistent doc between evaluate and isLargerBetter 2. missing `def evaluate(dataset: Dataset[_]): Double` from the doc (the other method with the same name shows up). 
This may be a bug in scaladoc. - MinMaxScaler: math rendering - GeneralizedLinearRegressionSummary: aic doc is incorrect The reference documentation that was used was: http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc2-docs/ > Additional fixes to Mllib 2.0 documentation > --- > > Key: SPARK-16485 > URL: https://issues.apache.org/jira/browse/SPARK-16485 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib, SparkR >Reporter: Timothy Hunter > > While reviewing the documentation of MLlib, I found some additional issues. > Important issues that affect the binary signatures: > - GBTClassificationModel: all the setters should be overridden > - LogisticRegressionModel: setThreshold(s) > - RandomForestClassificationModel: all the setters should be overridden > - org.apache.spark.ml.stat.distribution.MultivariateGaussian is exposed but > most of the methods are private[ml] -> do we need to expose this class for > now? > - GeneralizedLinearRegressionModel: linkObj, familyObj, familyAndLink should > not be exposed > - sqlDataTypes: name does not follow conventions. Do we need to expose it? > Issues that involve only documentation: > - Evaluator: > 1. inconsistent doc between evaluate and isLargerBetter > - MinMaxScaler: math rendering > - GeneralizedLinearRegressionSummary: aic doc is incorrect > The reference documentation that was used was: > http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc2-docs/
[jira] [Commented] (SPARK-14816) Update MLlib, GraphX, SparkR websites for 2.0
[ https://issues.apache.org/jira/browse/SPARK-14816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15371504#comment-15371504 ] Timothy Hunter commented on SPARK-14816: Also, in `mllib-guide.md`, let's switch the order between spark.ml and spark.mllib to give more prominence to spark.ml. > Update MLlib, GraphX, SparkR websites for 2.0 > - > > Key: SPARK-14816 > URL: https://issues.apache.org/jira/browse/SPARK-14816 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib, SparkR >Reporter: Joseph K. Bradley >Priority: Blocker > > Update the sub-projects' websites to include new features in this release. > For MLlib, make it clear that the DataFrame-based API is the primary one now.
[jira] [Created] (SPARK-16485) Additional fixes to Mllib 2.0 documentation
Timothy Hunter created SPARK-16485: -- Summary: Additional fixes to Mllib 2.0 documentation Key: SPARK-16485 URL: https://issues.apache.org/jira/browse/SPARK-16485 Project: Spark Issue Type: Sub-task Reporter: Timothy Hunter While reviewing the documentation of MLlib, I found some additional issues. Important issues that affect the binary signatures: - GBTClassificationModel: all the setters should be overridden - LogisticRegressionModel: setThreshold(s) - RandomForestClassificationModel: all the setters should be overridden - org.apache.spark.ml.stat.distribution.MultivariateGaussian is exposed but most of the methods are private[ml] -> do we need to expose this class for now? - GeneralizedLinearRegressionModel: linkObj, familyObj, familyAndLink should not be exposed - sqlDataTypes: name does not follow conventions. Do we need to expose it? Issues that involve only documentation: - Evaluator: 1. inconsistent doc between evaluate and isLargerBetter 2. missing `def evaluate(dataset: Dataset[_]): Double` from the doc (the other method with the same name shows up). This may be a bug in scaladoc. - MinMaxScaler: math rendering - GeneralizedLinearRegressionSummary: aic doc is incorrect The reference documentation that was used was: http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc2-docs/
[jira] [Commented] (SPARK-12922) Implement gapply() on DataFrame in SparkR
[ https://issues.apache.org/jira/browse/SPARK-12922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15353374#comment-15353374 ] Timothy Hunter commented on SPARK-12922: I opened a separate JIRA for that issue: SPARK-16258 > Implement gapply() on DataFrame in SparkR > - > > Key: SPARK-12922 > URL: https://issues.apache.org/jira/browse/SPARK-12922 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 1.6.0 >Reporter: Sun Rui >Assignee: Narine Kokhlikyan > Fix For: 2.0.0 > > > gapply() applies an R function on groups grouped by one or more columns of a > DataFrame, and returns a DataFrame. It is like GroupedDataSet.flatMapGroups() > in the Dataset API. > Two API styles are supported: > 1. > {code} > gd <- groupBy(df, col1, ...) > gapply(gd, function(grouping_key, group) {}, schema) > {code} > 2. > {code} > gapply(df, grouping_columns, function(grouping_key, group) {}, schema) > {code} > R function input: grouping keys value, a local data.frame of this grouped > data > R function output: local data.frame > Schema specifies the Row format of the output of the R function. It must > match the R function's output. > Note that map-side combination (partial aggregation) is not supported, user > could do map-side combination via dapply(). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16258) Automatically append the grouping keys in SparkR's gapply
Timothy Hunter created SPARK-16258: -- Summary: Automatically append the grouping keys in SparkR's gapply Key: SPARK-16258 URL: https://issues.apache.org/jira/browse/SPARK-16258 Project: Spark Issue Type: Improvement Components: SparkR Reporter: Timothy Hunter While working on the group apply function for python [1], we found it easier to depart from SparkR's gapply function in the following way: - the keys are appended by default to the spark dataframe being returned - the output schema that the users provides is the schema of the R data frame and does not include the keys Here are the reasons for doing so: - in most cases, users will want to know the key associated with a result -> appending the key is the sensible default - most functions in the SQL interface and in MLlib append columns, and gapply departs from this philosophy - for the cases when they do not need it, adding the key is a fraction of the computation time and of the output size - from a formal perspective, it makes calling gapply fully transparent to the type of the key: it is easier to build a function with gapply because it does not need to know anything about the key This ticket proposes to change SparkR's gapply function to follow the same convention as Python's implementation. cc [~Narine] [~shivaram] [1] https://github.com/databricks/spark-sklearn/blob/master/python/spark_sklearn/group_apply.py -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
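The proposed convention (the framework appends the grouping key to every output row, so the user function never needs to know anything about the key) can be sketched with a small local helper, loosely modeled on the spark-sklearn group_apply linked above. The names here are hypothetical, not the SparkR or pyspark API:

```python
from collections import defaultdict

def group_apply(rows, key_fn, fn):
    # Hypothetical sketch of the proposed gapply convention:
    # fn receives only the group's rows and returns output rows;
    # the grouping key is prepended to each output row by the
    # framework, making fn fully transparent to the key's type.
    groups = defaultdict(list)
    for row in rows:
        groups[key_fn(row)].append(row)
    out = []
    for key, group in groups.items():
        for result in fn(group):
            out.append((key,) + tuple(result))
    return out
```

Because the key is added automatically, the schema the user declares describes only fn's output, matching the second bullet of the proposal.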
[jira] [Commented] (SPARK-12922) Implement gapply() on DataFrame in SparkR
[ https://issues.apache.org/jira/browse/SPARK-12922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15351311#comment-15351311 ] Timothy Hunter commented on SPARK-12922: [~Narine] while working on a similar function for python [1], we found it easier to have the following changes: - the keys are appended by default to the spark dataframe being returned - the output schema that the users provides is the schema of the R data frame and does not include the keys Here were our reasons to depart from the R implementation of gapply: - in most cases, users will want to know the key associated with a result -> appending the key is the sensible default - most functions in the SQL interface and in MLlib append columns, and gapply departs from this philosophy - for the cases when they do not need it, adding the key is a fraction of the computation time and of the output size - from a formal perspective, it makes calling gapply fully transparent to the type of the key: it is easier to build a function with gapply because it does not need to know anything about the key I think it would make sense to make this change to the R's gapply implementation. Let me know what you think about it. [1] https://github.com/databricks/spark-sklearn/blob/master/python/spark_sklearn/group_apply.py > Implement gapply() on DataFrame in SparkR > - > > Key: SPARK-12922 > URL: https://issues.apache.org/jira/browse/SPARK-12922 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 1.6.0 >Reporter: Sun Rui >Assignee: Narine Kokhlikyan > Fix For: 2.0.0 > > > gapply() applies an R function on groups grouped by one or more columns of a > DataFrame, and returns a DataFrame. It is like GroupedDataSet.flatMapGroups() > in the Dataset API. > Two API styles are supported: > 1. > {code} > gd <- groupBy(df, col1, ...) > gapply(gd, function(grouping_key, group) {}, schema) > {code} > 2. 
> {code} > gapply(df, grouping_columns, function(grouping_key, group) {}, schema) > {code} > R function input: grouping keys value, a local data.frame of this grouped > data > R function output: local data.frame > Schema specifies the Row format of the output of the R function. It must > match the R function's output. > Note that map-side combination (partial aggregation) is not supported, user > could do map-side combination via dapply(). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15581) MLlib 2.1 Roadmap
[ https://issues.apache.org/jira/browse/SPARK-15581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15342674#comment-15342674 ] Timothy Hunter commented on SPARK-15581: With respect to deep learning, I think it depends on whether we are comfortable having a generic implementation that works for all supported languages but is 1-2 orders of magnitude slower than specialized frameworks. Unlike BLAS for linear algebra, there is no generic interface in Java or C++ for interfacing with specialized deep learning libraries, so just integrating them as a plugin will require a significant effort. Also, we are constrained by the dependencies we can pull into Spark, as experienced with breeze. If we decide to roll out our own deep learning stack, we may face a perception issue that "deep learning on Spark is slow". > MLlib 2.1 Roadmap > - > > Key: SPARK-15581 > URL: https://issues.apache.org/jira/browse/SPARK-15581 > Project: Spark > Issue Type: Umbrella > Components: ML, MLlib >Reporter: Joseph K. Bradley >Priority: Blocker > Labels: roadmap > > This is a master list for MLlib improvements we are working on for the next > release. Please view this as a wish list rather than a definite plan, for we > don't have an accurate estimate of available resources. Due to limited review > bandwidth, features appearing on this list will get higher priority during > code review. But feel free to suggest new items to the list in comments. We > are experimenting with this process. Your feedback would be greatly > appreciated. > h1. Instructions > h2. For contributors: > * Please read > https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark > carefully. Code style, documentation, and unit tests are important. > * If you are a first-time Spark contributor, please always start with a > [starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather > than a medium/big feature. 
Based on our experience, mixing the development > process with a big feature usually causes a long delay in code review. > * Never work silently. Let everyone know on the corresponding JIRA page when > you start working on some features. This is to avoid duplicate work. For > small features, you don't need to wait to get JIRA assigned. > * For medium/big features or features with dependencies, please get assigned > first before coding and keep the ETA updated on the JIRA. If there is no > activity on the JIRA page for a certain amount of time, the JIRA should be > released for other contributors. > * Do not claim multiple (>3) JIRAs at the same time. Try to finish them one > after another. > * Remember to add the `@Since("VERSION")` annotation to new public APIs. > * Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code > review greatly helps to improve others' code as well as yours. > h2. For committers: > * Try to break down big features into small and specific JIRA tasks and link > them properly. > * Add a "starter" label to starter tasks. > * Put a rough estimate for medium/big features and track the progress. > * If you start reviewing a PR, please add yourself to the Shepherd field on > JIRA. > * If the code looks good to you, please comment "LGTM". For non-trivial PRs, > please ping a maintainer to make a final pass. > * After merging a PR, create and link JIRAs for Python, example code, and > documentation if applicable. > h1. Roadmap (*WIP*) > This is NOT [a complete list of MLlib JIRAs for 2.1| > https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20component%20in%20(ML%2C%20MLlib%2C%20SparkR%2C%20GraphX)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.1.0%20AND%20(fixVersion%20is%20EMPTY%20OR%20fixVersion%20!%3D%202.1.0)%20AND%20(Resolution%20is%20EMPTY%20OR%20Resolution%20in%20(Done%2C%20Fixed%2C%20Implemented))%20ORDER%20BY%20priority]. > We only include umbrella JIRAs and high-level tasks. 
> Major efforts in this release: > * Feature parity for the DataFrames-based API (`spark.ml`), relative to the > RDD-based API > * ML persistence > * Python API feature parity and test coverage > * R API expansion and improvements > * Note about new features: As usual, we expect to expand the feature set of > MLlib. However, we will prioritize API parity, bug fixes, and improvements > over new features. > Note `spark.mllib` is in maintenance mode now. We will accept bug fixes for > it, but new features, APIs, and improvements will only be added to `spark.ml`. > h2. Critical feature parity in DataFrame-based API > * Umbrella JIRA: [SPARK-4591] > h2. Persistence > * Complete persistence within MLlib > ** Python tuning (SPARK-13786) > * MLlib in R format: compatibility with other languages (SPARK-15572) > * Impose backwards compatibility for p
[jira] [Comment Edited] (SPARK-14816) Update MLlib, GraphX, SparkR websites for 2.0
[ https://issues.apache.org/jira/browse/SPARK-14816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15264370#comment-15264370 ] Timothy Hunter edited comment on SPARK-14816 at 4/29/16 5:21 PM: - Also, add a comment about the {{spark.lapply}} API was (Author: timhunter): Also, add a comment about the {{doparallel}} API > Update MLlib, GraphX, SparkR websites for 2.0 > - > > Key: SPARK-14816 > URL: https://issues.apache.org/jira/browse/SPARK-14816 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib, SparkR >Reporter: Joseph K. Bradley > > Update the sub-projects' websites to include new features in this release. > For MLlib, make it clear that the DataFrame-based API is the primary one now. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14816) Update MLlib, GraphX, SparkR websites for 2.0
[ https://issues.apache.org/jira/browse/SPARK-14816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15264370#comment-15264370 ] Timothy Hunter commented on SPARK-14816: Also, add a comment about the {{doparallel}} API > Update MLlib, GraphX, SparkR websites for 2.0 > - > > Key: SPARK-14816 > URL: https://issues.apache.org/jira/browse/SPARK-14816 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib, SparkR >Reporter: Joseph K. Bradley > > Update the sub-projects' websites to include new features in this release. > For MLlib, make it clear that the DataFrame-based API is the primary one now.
[jira] [Commented] (SPARK-14571) Log instrumentation in ALS
[ https://issues.apache.org/jira/browse/SPARK-14571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15249058#comment-15249058 ] Timothy Hunter commented on SPARK-14571: Yes, please feel free to take this task. Thanks! > Log instrumentation in ALS > -- > > Key: SPARK-14571 > URL: https://issues.apache.org/jira/browse/SPARK-14571 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Reporter: Timothy Hunter >
[jira] [Commented] (SPARK-7264) SparkR API for parallel functions
[ https://issues.apache.org/jira/browse/SPARK-7264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15243584#comment-15243584 ] Timothy Hunter commented on SPARK-7264: --- I will have a PR for this soon. > SparkR API for parallel functions > - > > Key: SPARK-7264 > URL: https://issues.apache.org/jira/browse/SPARK-7264 > Project: Spark > Issue Type: New Feature > Components: SparkR >Reporter: Shivaram Venkataraman > > This is a JIRA to discuss design proposals for enabling parallel R > computation in SparkR without exposing the entire RDD API. > The rationale for this is that the RDD API has a number of low level > functions and we would like to expose a more light-weight API that is both > friendly to R users and easy to maintain. > http://goo.gl/GLHKZI has a first cut design doc.
[jira] [Commented] (SPARK-14569) Log instrumentation in KMeans
[ https://issues.apache.org/jira/browse/SPARK-14569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15243238#comment-15243238 ] Timothy Hunter commented on SPARK-14569: [~iamshrek] thanks for taking a look! > Log instrumentation in KMeans > - > > Key: SPARK-14569 > URL: https://issues.apache.org/jira/browse/SPARK-14569 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Reporter: Timothy Hunter >
[jira] [Commented] (SPARK-14571) Log instrumentation in ALS
[ https://issues.apache.org/jira/browse/SPARK-14571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15239794#comment-15239794 ] Timothy Hunter commented on SPARK-14571: SPARK-14568 has been merged, so it should be easy to follow the same metrics that have been added to LogisticRegression. [~yuu.ishik...@gmail.com], [~yinxusen], are you interested? > Log instrumentation in ALS > -- > > Key: SPARK-14571 > URL: https://issues.apache.org/jira/browse/SPARK-14571 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Reporter: Timothy Hunter >
[jira] [Commented] (SPARK-14570) Log instrumentation in Random forests
[ https://issues.apache.org/jira/browse/SPARK-14570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15239775#comment-15239775 ] Timothy Hunter commented on SPARK-14570: SPARK-14568 has been merged, so it should be easy to follow the same pattern as in LogisticRegression. In fact, most of the metrics have already been added in {{RandomForest.scala}} by [~josephkb]. It is just a matter of surfacing them better. [~yuu.ishik...@gmail.com], [~yinxusen], are you interested? > Log instrumentation in Random forests > - > > Key: SPARK-14570 > URL: https://issues.apache.org/jira/browse/SPARK-14570 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Reporter: Timothy Hunter >
[jira] [Commented] (SPARK-14569) Log instrumentation in KMeans
[ https://issues.apache.org/jira/browse/SPARK-14569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15239751#comment-15239751 ] Timothy Hunter commented on SPARK-14569: SPARK-14568 has been merged, so it should be easy to follow the same metrics that have been added to LogisticRegression. [~yuu.ishik...@gmail.com], [~yinxusen], are you interested? > Log instrumentation in KMeans > - > > Key: SPARK-14569 > URL: https://issues.apache.org/jira/browse/SPARK-14569 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Reporter: Timothy Hunter >
[jira] [Updated] (SPARK-14567) Add instrumentation logs to MLlib training algorithms
[ https://issues.apache.org/jira/browse/SPARK-14567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Hunter updated SPARK-14567: --- Description: In order to debug performance issues when training mllib algorithms, it is useful to log some metrics about the training dataset, the training parameters, etc. This ticket is an umbrella to add some simple logging messages to the most common MLlib estimators. There should be no performance impact on the current implementation, and the output is simply printed in the logs. Here are some values that are of interest when debugging training tasks: * number of features * number of instances * number of partitions * number of classes * input RDD/DF cache level * hyper-parameters I suggest to start with the most common al was: In order to debug performance issues when training mllib algorithms, it is useful to log some metrics about the training dataset, the training parameters, etc. This ticket is an umbrella to add some simple logging messages to the most common MLlib estimators. There should be no performance impact on the current implementation, and the output is simply printed in the logs. Here are some values that are of interest when debugging training tasks: * number of features * number of instances * number of partitions * number of classes * input RDD/DF cache level * hyper-parameters > Add instrumentation logs to MLlib training algorithms > - > > Key: SPARK-14567 > URL: https://issues.apache.org/jira/browse/SPARK-14567 > Project: Spark > Issue Type: Umbrella > Components: MLlib >Reporter: Timothy Hunter > > In order to debug performance issues when training mllib algorithms, > it is useful to log some metrics about the training dataset, the training > parameters, etc. > This ticket is an umbrella to add some simple logging messages to the most > common MLlib estimators. There should be no performance impact on the current > implementation, and the output is simply printed in the logs. 
> Here are some values that are of interest when debugging training tasks: > * number of features > * number of instances > * number of partitions > * number of classes > * input RDD/DF cache level > * hyper-parameters > I suggest to start with the most common al
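The kind of lightweight instrumentation described in this umbrella ticket can be sketched in plain Python. This is only an illustration of the idea (collect the dataset and parameter metrics listed above, then print them to the logs with no effect on training); the class and method names here are invented, not Spark's actual Instrumentation API:

```python
import logging

logging.basicConfig(level=logging.INFO)

class TrainingInstrumentation:
    """Collects dataset/parameter metrics and prints them to the logs,
    with no side effects on the training itself (illustrative sketch)."""

    def __init__(self, estimator_name):
        self.estimator_name = estimator_name
        self.logger = logging.getLogger(estimator_name)
        self.metrics = {}

    def log_dataset(self, rows, num_features, num_partitions, cache_level):
        # The values of interest listed in the ticket: instances, features,
        # partitions, and the input cache level.
        self.metrics.update({
            "numInstances": len(rows),
            "numFeatures": num_features,
            "numPartitions": num_partitions,
            "cacheLevel": cache_level,
        })

    def log_params(self, **hyper_params):
        self.metrics.update(hyper_params)

    def flush(self):
        # The output is simply printed in the logs, as the ticket requires.
        for key, value in sorted(self.metrics.items()):
            self.logger.info("%s: %s = %s", self.estimator_name, key, value)
        return dict(self.metrics)

instr = TrainingInstrumentation("LogisticRegression")
instr.log_dataset(rows=[(1.0, [0.1, 0.2])] * 100, num_features=2,
                  num_partitions=4, cache_level="MEMORY_ONLY")
instr.log_params(regParam=0.01, maxIter=100)
metrics = instr.flush()
```

The same wrapper would be instantiated once per training call in each estimator (LogisticRegression first, then KMeans, ALS, Random forests), which is why the sub-tasks can all follow the same pattern.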
[jira] [Commented] (SPARK-14568) Log instrumentation in logistic regression as a first task
[ https://issues.apache.org/jira/browse/SPARK-14568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15237737#comment-15237737 ] Timothy Hunter commented on SPARK-14568: I have an upcoming PR for this task > Log instrumentation in logistic regression as a first task > -- > > Key: SPARK-14568 > URL: https://issues.apache.org/jira/browse/SPARK-14568 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Reporter: Timothy Hunter >
[jira] [Updated] (SPARK-14567) Add instrumentation logs to MLlib training algorithms
[ https://issues.apache.org/jira/browse/SPARK-14567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Hunter updated SPARK-14567: --- Description: In order to debug performance issues when training mllib algorithms, it is useful to log some metrics about the training dataset, the training parameters, etc. This ticket is an umbrella to add some simple logging messages to the most common MLlib estimators. There should be no performance impact on the current implementation, and the output is simply printed in the logs. Here are some values that are of interest when debugging training tasks: * number of features * number of instances * number of partitions * number of classes * input RDD/DF cache level * hyper-parameters was: In order to debug performance issues when training mllib algorithms, it is useful to log some metrics about the training dataset, the training parameters, etc. This ticket is an umbrella to add some simple logging messages to the most common MLlib estimators. There should be no performance impact on the current implementation, and the output is simply printed in the logs. Here are some values that are of interest when debugging training tasks: * number of features * number of instances * number of partitions * number of classes * input RDD/DF cache level * hyper-parameters I suggest to start with the most common al > Add instrumentation logs to MLlib training algorithms > - > > Key: SPARK-14567 > URL: https://issues.apache.org/jira/browse/SPARK-14567 > Project: Spark > Issue Type: Umbrella > Components: MLlib >Reporter: Timothy Hunter > > In order to debug performance issues when training mllib algorithms, > it is useful to log some metrics about the training dataset, the training > parameters, etc. > This ticket is an umbrella to add some simple logging messages to the most > common MLlib estimators. There should be no performance impact on the current > implementation, and the output is simply printed in the logs. 
> Here are some values that are of interest when debugging training tasks: > * number of features > * number of instances > * number of partitions > * number of classes > * input RDD/DF cache level > * hyper-parameters
[jira] [Created] (SPARK-14571) Log instrumentation in ALS
Timothy Hunter created SPARK-14571: -- Summary: Log instrumentation in ALS Key: SPARK-14571 URL: https://issues.apache.org/jira/browse/SPARK-14571 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Timothy Hunter
[jira] [Created] (SPARK-14570) Log instrumentation in Random forests
Timothy Hunter created SPARK-14570: -- Summary: Log instrumentation in Random forests Key: SPARK-14570 URL: https://issues.apache.org/jira/browse/SPARK-14570 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Timothy Hunter
[jira] [Created] (SPARK-14569) Log instrumentation in KMeans
Timothy Hunter created SPARK-14569: -- Summary: Log instrumentation in KMeans Key: SPARK-14569 URL: https://issues.apache.org/jira/browse/SPARK-14569 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Timothy Hunter
[jira] [Created] (SPARK-14568) Log instrumentation in logistic regression as a first task
Timothy Hunter created SPARK-14568: -- Summary: Log instrumentation in logistic regression as a first task Key: SPARK-14568 URL: https://issues.apache.org/jira/browse/SPARK-14568 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Timothy Hunter
[jira] [Created] (SPARK-14567) Add instrumentation logs to MLlib training algorithms
Timothy Hunter created SPARK-14567: -- Summary: Add instrumentation logs to MLlib training algorithms Key: SPARK-14567 URL: https://issues.apache.org/jira/browse/SPARK-14567 Project: Spark Issue Type: Umbrella Components: MLlib Reporter: Timothy Hunter In order to debug performance issues when training mllib algorithms, it is useful to log some metrics about the training dataset, the training parameters, etc. This ticket is an umbrella to add some simple logging messages to the most common MLlib estimators. There should be no performance impact on the current implementation, and the output is simply printed in the logs. Here are some values that are of interest when debugging training tasks: * number of features * number of instances * number of partitions * number of classes * input RDD/DF cache level * hyper-parameters
[jira] [Created] (SPARK-14100) Merge StringIndexer and StringIndexerModel
Timothy Hunter created SPARK-14100: -- Summary: Merge StringIndexer and StringIndexerModel Key: SPARK-14100 URL: https://issues.apache.org/jira/browse/SPARK-14100 Project: Spark Issue Type: Sub-task Components: ML, MLlib Reporter: Timothy Hunter This is an initial task to convert a simple estimator (StringIndexer) to the proposed API that merges models and estimators together.
[jira] [Commented] (SPARK-13986) Make `DeveloperApi`-annotated things public
[ https://issues.apache.org/jira/browse/SPARK-13986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15200525#comment-15200525 ] Timothy Hunter commented on SPARK-13986: [~dongjoon] how did you find the conflicting annotation? It would be great to automate this as part of the style checks > Make `DeveloperApi`-annotated things public > --- > > Key: SPARK-13986 > URL: https://issues.apache.org/jira/browse/SPARK-13986 > Project: Spark > Issue Type: Improvement > Components: MLlib, Spark Core >Reporter: Dongjoon Hyun >Priority: Minor > > Spark uses the `@DeveloperApi` annotation, but sometimes it seems to conflict > with its visibility. This issue proposes to fix those conflicts. The following > is an example. > {code:title=JobResult.scala|borderStyle=solid} > @DeveloperApi > sealed trait JobResult > @DeveloperApi > case object JobSucceeded extends JobResult > @DeveloperApi > -private[spark] case class JobFailed(exception: Exception) extends JobResult > +case class JobFailed(exception: Exception) extends JobResult > {code}
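Automating the check suggested in the comment above could be as simple as scanning adjacent source lines for an `@DeveloperApi` annotation followed by a restricted-visibility declaration. A rough sketch (the rule and the function name are made up for illustration; a real style check would plug into Scalastyle instead):

```python
import re

def find_conflicting_annotations(source: str):
    """Report 1-based line numbers where @DeveloperApi is immediately
    followed by a private[...] declaration, i.e. the annotation sits on
    a member that is not actually public (illustrative check only)."""
    lines = source.splitlines()
    conflicts = []
    for i, line in enumerate(lines):
        if line.strip() == "@DeveloperApi":
            # Look at the next non-empty line for restricted visibility.
            for nxt in lines[i + 1:]:
                if not nxt.strip():
                    continue
                if re.match(r"\s*private\[\w+\]", nxt):
                    conflicts.append(i + 1)  # line of the annotation
                break
    return conflicts

scala_snippet = """@DeveloperApi
sealed trait JobResult

@DeveloperApi
private[spark] case class JobFailed(exception: Exception) extends JobResult
"""
print(find_conflicting_annotations(scala_snippet))  # → [4]
```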
[jira] [Commented] (SPARK-10931) PySpark ML Models should contain Param values
[ https://issues.apache.org/jira/browse/SPARK-10931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15190093#comment-15190093 ] Timothy Hunter commented on SPARK-10931: Using Python decorators, it is fairly easy to autogenerate at runtime all the param wrappers, getters, and setters, and to extract the documentation from the Scala side so that the documentation of the parameter is included in the docstring of the getters and setters. There are two issues with that: - do we need to specialize the documentation or some of the conversions between Java and Python? In both cases, it is possible to "subclass" and make sure that the methods do not get overwritten by some autogenerated stubs - the documentation of a class (which relies on the bytecode, not on runtime instances) would miss all the params, because they are only generated in runtime objects. I believe there are some ways around it, such as inserting such methods at import time, but that would require more investigation. > PySpark ML Models should contain Param values > - > > Key: SPARK-10931 > URL: https://issues.apache.org/jira/browse/SPARK-10931 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Joseph K. Bradley > > PySpark spark.ml Models are generally wrappers around Java objects and do not > even contain Param values. This JIRA is for copying the Param values from > the Estimator to the model. > This can likely be solved by modifying Estimator.fit to copy Param values, > but should also include proper unit tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
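The decorator idea described in the comment above can be sketched in pure Python. Everything here (the `with_params` decorator, the param metadata format) is invented for illustration and differs from PySpark's real `Params` machinery; it only shows how getters/setters can be autogenerated at runtime with the param documentation copied into their docstrings, and how hand-written methods avoid being overwritten by generated stubs. Note that the setters return `self`, which is exactly the behavior whose absence causes bugs like the one in SPARK-25124:

```python
def with_params(**params):
    """Class decorator that autogenerates a getter and setter per param,
    attaching the param's documentation to both docstrings (sketch)."""
    def decorate(cls):
        for name, (default, doc) in params.items():
            cap = name[0].upper() + name[1:]

            def make_pair(param_name, param_doc):
                def getter(self):
                    return self._params.get(param_name)
                getter.__doc__ = "Gets " + param_name + ". " + param_doc

                def setter(self, value):
                    self._params[param_name] = value
                    return self  # chaining: a setter must return self
                setter.__doc__ = "Sets " + param_name + ". " + param_doc
                return getter, setter

            g, s = make_pair(name, doc)
            # Only install a stub when the class does not already define
            # the method, so specialized hand-written versions survive.
            if not hasattr(cls, "get" + cap):
                setattr(cls, "get" + cap, g)
            if not hasattr(cls, "set" + cap):
                setattr(cls, "set" + cap, s)
        defaults = {n: d for n, (d, _) in params.items()}
        orig_init = cls.__init__

        def __init__(self, *args, **kwargs):
            self._params = dict(defaults)
            orig_init(self, *args, **kwargs)
        cls.__init__ = __init__
        return cls
    return decorate

@with_params(maxIter=(100, "maximum number of iterations"),
             regParam=(0.0, "regularization parameter"))
class Estimator:
    pass

e = Estimator().setMaxIter(10)
print(e.getMaxIter(), e.getRegParam())  # → 10 0.0
```

The second issue raised in the comment is visible here too: `help(Estimator)` only shows the generated methods because they were installed at decoration (import) time; generating them any later would hide them from documentation tools.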
[jira] [Commented] (SPARK-12566) GLM model family, link function support in SparkR:::glm
[ https://issues.apache.org/jira/browse/SPARK-12566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15188057#comment-15188057 ] Timothy Hunter commented on SPARK-12566: [~yuhaoyan] I took a look at the current implementation of GLM in SparkRWrappers, and it looks like we only check the solver in the case of the gaussian family. [~mengxr] if users use the 'auto' solver, it means we can swap the implementation underneath, right? If this is the case, here is what I suggest, in pseudo-scala-code: {code} (family, solver) match { case (gaussian, auto) => IRLS // This is a behavioral change case (gaussian, normal | l-bfgs) => LinearRegression case (binomial, auto) => IRLS // This is a behavioral change case (binomial, binomial) => LogisticRegression // This is a new option to preserve LogisticRegression if there is a need for that case (_, _) => IRLS } {code} > GLM model family, link function support in SparkR:::glm > --- > > Key: SPARK-12566 > URL: https://issues.apache.org/jira/browse/SPARK-12566 > Project: Spark > Issue Type: New Feature > Components: ML, SparkR >Reporter: Joseph K. Bradley >Assignee: yuhao yang >Priority: Critical > > This JIRA is for extending the support of MLlib's Generalized Linear Models > (GLMs) to more model families and link functions in SparkR. After > SPARK-12811, we should be able to wrap GeneralizedLinearRegression in SparkR > with support of popular families and link functions.
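The dispatch proposed in the pseudo-code above amounts to a small lookup in which 'auto' resolves to IRLS. A sketch of the same table (the family/solver strings follow the pseudo-code; the function itself is illustrative, not SparkR's actual wrapper):

```python
def pick_implementation(family, solver):
    """Map (family, solver) to the underlying implementation, mirroring
    the proposed pseudo-scala dispatch; 'auto' and any unlisted pair
    fall through to IRLS (illustrative only)."""
    if family == "gaussian" and solver in ("normal", "l-bfgs"):
        return "LinearRegression"
    if family == "binomial" and solver == "binomial":
        # The new option preserving the LogisticRegression path if needed.
        return "LogisticRegression"
    # 'auto' and every remaining combination go through IRLS, which is the
    # behavioral change noted for the gaussian and binomial families.
    return "IRLS"

print(pick_implementation("gaussian", "auto"))    # → IRLS
print(pick_implementation("gaussian", "l-bfgs"))  # → LinearRegression
```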
[jira] [Commented] (SPARK-11569) StringIndexer transform fails when column contains nulls
[ https://issues.apache.org/jira/browse/SPARK-11569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15185444#comment-15185444 ] Timothy Hunter commented on SPARK-11569: Also, I suggest to look at Pandas' indexers, which have the same issue to deal with. > StringIndexer transform fails when column contains nulls > > > Key: SPARK-11569 > URL: https://issues.apache.org/jira/browse/SPARK-11569 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 1.4.0, 1.5.0, 1.6.0 >Reporter: Maciej Szymkiewicz > > Transforming column containing {{null}} values using {{StringIndexer}} > results in {{java.lang.NullPointerException}} > {code} > from pyspark.ml.feature import StringIndexer > df = sqlContext.createDataFrame([("a", 1), (None, 2)], ("k", "v")) > df.printSchema() > ## root > ## |-- k: string (nullable = true) > ## |-- v: long (nullable = true) > indexer = StringIndexer(inputCol="k", outputCol="kIdx") > indexer.fit(df).transform(df) > ## ) failed: > py4j.protocol.Py4JJavaError: An error occurred while calling o75.json. 
> ## : java.lang.NullPointerException > {code} > Problem disappears when we drop > {code} > df1 = df.na.drop() > indexer.fit(df1).transform(df1) > {code} > or replace {{nulls}} > {code} > from pyspark.sql.functions import col, when > k = col("k") > df2 = df.withColumn("k", when(k.isNull(), "__NA__").otherwise(k)) > indexer.fit(df2).transform(df2) > {code} > and cannot be reproduced using Scala API > {code} > import org.apache.spark.ml.feature.StringIndexer > val df = sc.parallelize(Seq(("a", 1), (null, 2))).toDF("k", "v") > df.printSchema > // root > // |-- k: string (nullable = true) > // |-- v: integer (nullable = false) > val indexer = new StringIndexer().setInputCol("k").setOutputCol("kIdx") > indexer.fit(df).transform(df).count > // 2 > {code}
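To see the design space the Pandas comparison points at (Pandas' factorize maps missing values to the sentinel -1 instead of raising), here is a tiny pure-Python indexer. The `handle_null` options are hypothetical and do not correspond to the eventual Spark API; the fit step mimics StringIndexer's frequency-based label ordering:

```python
from collections import Counter

def fit_string_indexer(values):
    """Index labels by descending frequency (most frequent label gets 0.0,
    ties broken alphabetically), ignoring nulls during fit."""
    freq = Counter(v for v in values if v is not None)
    ordered = sorted(freq, key=lambda v: (-freq[v], v))
    return {v: float(i) for i, v in enumerate(ordered)}

def transform(values, labels, handle_null="error"):
    """Apply the label mapping; handle_null is a hypothetical knob:
    'error' raises (the current NPE would become a clear error),
    'skip' drops the row (like df.na.drop()),
    'keep' emits a -1.0 sentinel (like Pandas' factorize)."""
    out = []
    for v in values:
        if v is None:
            if handle_null == "error":
                raise ValueError("StringIndexer met a null label")
            if handle_null == "skip":
                continue
            out.append(-1.0)
        else:
            out.append(labels[v])
    return out

data = ["a", "b", "a", None]
labels = fit_string_indexer(data)
print(transform(data, labels, handle_null="keep"))  # → [0.0, 1.0, 0.0, -1.0]
```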
[jira] [Commented] (SPARK-12247) Documentation for spark.ml's ALS and collaborative filtering in general
[ https://issues.apache.org/jira/browse/SPARK-12247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15070500#comment-15070500 ] Timothy Hunter commented on SPARK-12247: Sorry for the delay. That sounds great! Let me know when you get a PR out. > Documentation for spark.ml's ALS and collaborative filtering in general > --- > > Key: SPARK-12247 > URL: https://issues.apache.org/jira/browse/SPARK-12247 > Project: Spark > Issue Type: Sub-task > Components: Documentation, MLlib >Affects Versions: 1.5.2 >Reporter: Timothy Hunter > > We need to add a section in the documentation about collaborative filtering > in the dataframe API: > - copy explanations about collaborative filtering and ALS from spark.mllib > - provide an example with spark.ml's ALS
[jira] [Commented] (SPARK-12247) Documentation for spark.ml's ALS and collaborative filtering in general
[ https://issues.apache.org/jira/browse/SPARK-12247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068908#comment-15068908 ] Timothy Hunter commented on SPARK-12247: It seems to me that the calculation of false positives is more relevant for the movie ratings, and that the RMSE right above in the example is already a good example. What do you think? > Documentation for spark.ml's ALS and collaborative filtering in general > --- > > Key: SPARK-12247 > URL: https://issues.apache.org/jira/browse/SPARK-12247 > Project: Spark > Issue Type: Sub-task > Components: Documentation, MLlib >Affects Versions: 1.5.2 >Reporter: Timothy Hunter > > We need to add a section in the documentation about collaborative filtering > in the dataframe API: > - copy explanations about collaborative filtering and ALS from spark.mllib > - provide an example with spark.ml's ALS
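For reference, the RMSE discussed above is just the root of the mean squared rating error. A minimal sketch over (rating, prediction) pairs, detached from Spark for clarity:

```python
import math

def rmse(pairs):
    """Root-mean-square error over (rating, prediction) pairs, the
    evaluation metric used in the ALS documentation example."""
    errors = [(rating - prediction) ** 2 for rating, prediction in pairs]
    return math.sqrt(sum(errors) / len(errors))

print(rmse([(4.0, 3.5), (2.0, 2.5)]))  # → 0.5
```

In the spark.ml example this is what `RegressionEvaluator` with the "rmse" metric computes over the predictions DataFrame.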
[jira] [Commented] (SPARK-12247) Documentation for spark.ml's ALS and collaborative filtering in general
[ https://issues.apache.org/jira/browse/SPARK-12247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15066857#comment-15066857 ] Timothy Hunter commented on SPARK-12247: Thanks for working on it, [~BenFradet]! > Documentation for spark.ml's ALS and collaborative filtering in general > --- > > Key: SPARK-12247 > URL: https://issues.apache.org/jira/browse/SPARK-12247 > Project: Spark > Issue Type: Sub-task > Components: Documentation, MLlib >Affects Versions: 1.5.2 >Reporter: Timothy Hunter > > We need to add a section in the documentation about collaborative filtering > in the dataframe API: > - copy explanations about collaborative filtering and ALS from spark.mllib > - provide an example with spark.ml's ALS
[jira] [Commented] (SPARK-12247) Documentation for spark.ml's ALS and collaborative filtering in general
[ https://issues.apache.org/jira/browse/SPARK-12247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15066854#comment-15066854 ] Timothy Hunter commented on SPARK-12247: If we could import all the code that builds the ratings dataframe {{val ratings = sc.textFile(params.ratings).map(Rating.parseRating).cache()}}, that would be ideal. > Documentation for spark.ml's ALS and collaborative filtering in general > --- > > Key: SPARK-12247 > URL: https://issues.apache.org/jira/browse/SPARK-12247 > Project: Spark > Issue Type: Sub-task > Components: Documentation, MLlib >Affects Versions: 1.5.2 >Reporter: Timothy Hunter > > We need to add a section in the documentation about collaborative filtering > in the dataframe API: > - copy explanations about collaborative filtering and ALS from spark.mllib > - provide an example with spark.ml's ALS
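The parsing hidden behind `Rating.parseRating` in the snippet above can be sketched in a few lines. This assumes the classic MovieLens `::`-separated format (userId::movieId::rating::timestamp); the helper name and record shape are illustrative, not the example's actual case class:

```python
def parse_rating(line):
    """Parse one 'userId::movieId::rating::timestamp' line of a
    MovieLens-formatted ratings file into a small record (sketch)."""
    fields = line.split("::")
    assert len(fields) == 4, "expected userId::movieId::rating::timestamp"
    return {"userId": int(fields[0]), "movieId": int(fields[1]),
            "rating": float(fields[2]), "timestamp": int(fields[3])}

ratings = [parse_rating(l) for l in ["1::1193::5.0::978300760",
                                     "1::661::3.0::978302109"]]
print(ratings[0]["rating"])  # → 5.0
```

In the Scala example the same per-line function is mapped over `sc.textFile(...)` and the result cached before training ALS.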
[jira] [Commented] (SPARK-12324) The documentation sidebar does not collapse properly
[ https://issues.apache.org/jira/browse/SPARK-12324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15056677#comment-15056677 ] Timothy Hunter commented on SPARK-12324: I am creating a PR with a fix. cc [~josephkb] > The documentation sidebar does not collapse properly > > > Key: SPARK-12324 > URL: https://issues.apache.org/jira/browse/SPARK-12324 > Project: Spark > Issue Type: Bug > Components: Documentation, MLlib >Affects Versions: 1.5.2 >Reporter: Timothy Hunter > Attachments: Screen Shot 2015-12-14 at 12.29.57 PM.png > > > When the browser's window is reduced horizontally, the sidebar slides under > the main content and does not collapse: > - hide the sidebar when the browser's width is not large enough > - add a button to show and hide the sidebar
[jira] [Updated] (SPARK-12324) The documentation sidebar does not collapse properly
[ https://issues.apache.org/jira/browse/SPARK-12324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Hunter updated SPARK-12324: --- Attachment: Screen Shot 2015-12-14 at 12.29.57 PM.png > The documentation sidebar does not collapse properly > > > Key: SPARK-12324 > URL: https://issues.apache.org/jira/browse/SPARK-12324 > Project: Spark > Issue Type: Bug > Components: Documentation, MLlib >Affects Versions: 1.5.2 >Reporter: Timothy Hunter > Attachments: Screen Shot 2015-12-14 at 12.29.57 PM.png > > > When the browser's window is reduced horizontally, the sidebar slides under > the main content and does not collapse: > - hide the sidebar when the browser's width is not large enough > - add a button to show and hide the sidebar
[jira] [Created] (SPARK-12324) The documentation sidebar does not collapse properly
Timothy Hunter created SPARK-12324: -- Summary: The documentation sidebar does not collapse properly Key: SPARK-12324 URL: https://issues.apache.org/jira/browse/SPARK-12324 Project: Spark Issue Type: Bug Components: Documentation, MLlib Affects Versions: 1.5.2 Reporter: Timothy Hunter When the browser's window is reduced horizontally, the sidebar slides under the main content and does not collapse: - hide the sidebar when the browser's width is not large enough - add a button to show and hide the sidebar
[jira] [Closed] (SPARK-12246) Add documentation for spark.ml.clustering.kmeans
[ https://issues.apache.org/jira/browse/SPARK-12246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Hunter closed SPARK-12246. -- Resolution: Duplicate > Add documentation for spark.ml.clustering.kmeans > > > Key: SPARK-12246 > URL: https://issues.apache.org/jira/browse/SPARK-12246 > Project: Spark > Issue Type: Sub-task > Components: Documentation, MLlib >Affects Versions: 1.5.2 >Reporter: Timothy Hunter > > We should add some documentation for the KMeans implementation in spark.ml. > - small description about the concept (maybe copy and adapt the > documentation from spark.mllib) > - add an example for java, scala, python
[jira] [Commented] (SPARK-12246) Add documentation for spark.ml.clustering.kmeans
[ https://issues.apache.org/jira/browse/SPARK-12246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049779#comment-15049779 ] Timothy Hunter commented on SPARK-12246: It does, thanks [~yuhaoyan] > Add documentation for spark.ml.clustering.kmeans > > > Key: SPARK-12246 > URL: https://issues.apache.org/jira/browse/SPARK-12246 > Project: Spark > Issue Type: Sub-task > Components: Documentation, MLlib >Affects Versions: 1.5.2 >Reporter: Timothy Hunter > > We should add some documentation for the KMeans implementation in spark.ml. > - small description about the concept (maybe copy and adapt the > documentation from spark.mllib) > - add an example for java, scala, python -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12247) Documentation for spark.ml's ALS and collaborative filtering in general
[ https://issues.apache.org/jira/browse/SPARK-12247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Hunter updated SPARK-12247: --- Target Version/s: (was: 1.6.0) > Documentation for spark.ml's ALS and collaborative filtering in general > --- > > Key: SPARK-12247 > URL: https://issues.apache.org/jira/browse/SPARK-12247 > Project: Spark > Issue Type: Sub-task > Components: Documentation, MLlib >Affects Versions: 1.5.2 >Reporter: Timothy Hunter > > We need to add a section in the documentation about collaborative filtering > in the dataframe API: > - copy explanations about collaborative filtering and ALS from spark.mllib > - provide an example with spark.ml's ALS -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12246) Add documentation for spark.ml.clustering.kmeans
[ https://issues.apache.org/jira/browse/SPARK-12246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Hunter updated SPARK-12246: --- Target Version/s: (was: 1.6.0) > Add documentation for spark.ml.clustering.kmeans > > > Key: SPARK-12246 > URL: https://issues.apache.org/jira/browse/SPARK-12246 > Project: Spark > Issue Type: Sub-task > Components: Documentation, MLlib >Affects Versions: 1.5.2 >Reporter: Timothy Hunter > > We should add some documentation for the KMeans implementation in spark.ml. > - small description about the concept (maybe copy and adapt the > documentation from spark.mllib) > - add an example for java, scala, python -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8517) Improve the organization and style of MLlib's user guide
[ https://issues.apache.org/jira/browse/SPARK-8517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Hunter updated SPARK-8517: -- Target Version/s: (was: 1.6.0) > Improve the organization and style of MLlib's user guide > > > Key: SPARK-8517 > URL: https://issues.apache.org/jira/browse/SPARK-8517 > Project: Spark > Issue Type: Improvement > Components: Documentation, ML, MLlib >Reporter: Xiangrui Meng >Assignee: Timothy Hunter > > The current MLlib's user guide (and spark.ml's), especially the main page, > doesn't have a nice style. We could update it and re-organize the content to > make it easier to navigate. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12210) Small example that shows how to integrate spark.mllib with spark.ml
[ https://issues.apache.org/jira/browse/SPARK-12210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Hunter updated SPARK-12210: --- Target Version/s: (was: 1.6.0) > Small example that shows how to integrate spark.mllib with spark.ml > --- > > Key: SPARK-12210 > URL: https://issues.apache.org/jira/browse/SPARK-12210 > Project: Spark > Issue Type: Sub-task > Components: Documentation, MLlib >Affects Versions: 1.5.2 >Reporter: Timothy Hunter > > Since we are missing a number of algorithms in {{spark.ml}} such as > clustering or LDA, we should have a small example that shows the recommended > way to go back and forth between {{spark.ml}} and {{spark.mllib}}. It is > mostly putting together existing pieces, but I feel it is important for new > users to see how the interaction plays out in practice. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12212) Clarify the distinction between spark.mllib and spark.ml
[ https://issues.apache.org/jira/browse/SPARK-12212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Hunter updated SPARK-12212: --- Target Version/s: (was: 1.6.0) > Clarify the distinction between spark.mllib and spark.ml > > > Key: SPARK-12212 > URL: https://issues.apache.org/jira/browse/SPARK-12212 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Affects Versions: 1.5.2 >Reporter: Timothy Hunter >Assignee: Timothy Hunter > > There is a confusion in the documentation of MLLib as to what exactly MLlib: > is it the package, or is it the whole effort of ML on spark, and how it > differs from spark.ml? Is MLLib going to be deprecated? > We should do the following: > - refer to the mllib the code package as spark.mllib across all the > documentation. Alternative name is "RDD API of MLlib". > - refer to MLlib the project that encompasses spark.ml + spark.mllib as > MLlib (it should be the default) > - replaces reference to "Pipeline API" by spark.ml or the "Dataframe API of > MLlib". I would deemphasize that this API is for building pipelines. Some > users are lead to believe from the documentation that spark.ml can only be > used for building pipelines and that using a single algorithm can only be > done with spark.mllib. > Most relevant places: > - {{mllib-guide.md}} > - {{mllib-linear-methods.md}} > - {{mllib-dimensionality-reduction.md}} > - {{mllib-pmml-model-export.md}} > - {{mllib-statistics.md}} > In these files, most references to {{MLlib}} are meant to refer to > {{spark.mllib}} instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12247) Documentation for spark.ml's ALS and collaborative filtering in general
[ https://issues.apache.org/jira/browse/SPARK-12247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049521#comment-15049521 ] Timothy Hunter commented on SPARK-12247: [~srowen] would you be interested in this task? > Documentation for spark.ml's ALS and collaborative filtering in general > --- > > Key: SPARK-12247 > URL: https://issues.apache.org/jira/browse/SPARK-12247 > Project: Spark > Issue Type: Sub-task > Components: Documentation, MLlib >Affects Versions: 1.5.2 >Reporter: Timothy Hunter > > We need to add a section in the documentation about collaborative filtering > in the dataframe API: > - copy explanations about collaborative filtering and ALS from spark.mllib > - provide an example with spark.ml's ALS -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12247) Documentation for spark.ml's ALS and collaborative filtering in general
Timothy Hunter created SPARK-12247: -- Summary: Documentation for spark.ml's ALS and collaborative filtering in general Key: SPARK-12247 URL: https://issues.apache.org/jira/browse/SPARK-12247 Project: Spark Issue Type: Sub-task Components: Documentation, MLlib Affects Versions: 1.5.2 Reporter: Timothy Hunter We need to add a section in the documentation about collaborative filtering in the dataframe API: - copy explanations about collaborative filtering and ALS from spark.mllib - provide an example with spark.ml's ALS -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12246) Add documentation for spark.ml.clustering.kmeans
Timothy Hunter created SPARK-12246: -- Summary: Add documentation for spark.ml.clustering.kmeans Key: SPARK-12246 URL: https://issues.apache.org/jira/browse/SPARK-12246 Project: Spark Issue Type: Sub-task Components: Documentation, MLlib Affects Versions: 1.5.2 Reporter: Timothy Hunter We should add some documentation for the KMeans implementation in spark.ml. - a small description of the concept (maybe copy and adapt the documentation from spark.mllib) - add examples for Java, Scala, and Python -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12212) Clarify the distinction between spark.mllib and spark.ml
Timothy Hunter created SPARK-12212: -- Summary: Clarify the distinction between spark.mllib and spark.ml Key: SPARK-12212 URL: https://issues.apache.org/jira/browse/SPARK-12212 Project: Spark Issue Type: Sub-task Components: Documentation Affects Versions: 1.5.2 Reporter: Timothy Hunter There is confusion in the documentation of MLlib as to what exactly MLlib is: is it the package, or is it the whole effort of ML on Spark, and how does it differ from spark.ml? Is MLlib going to be deprecated? We should do the following: - refer to the mllib code package as spark.mllib across all the documentation (an alternative name is "RDD API of MLlib") - refer to the project that encompasses spark.ml + spark.mllib as MLlib (it should be the default) - replace references to the "Pipeline API" with spark.ml or the "DataFrame API of MLlib". I would deemphasize that this API is for building pipelines. Some users are led to believe from the documentation that spark.ml can only be used for building pipelines and that using a single algorithm can only be done with spark.mllib. Most relevant places: - {{mllib-guide.md}} - {{mllib-linear-methods.md}} - {{mllib-dimensionality-reduction.md}} - {{mllib-pmml-model-export.md}} - {{mllib-statistics.md}} In these files, most references to {{MLlib}} are meant to refer to {{spark.mllib}} instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12210) Small example that shows how to integrate spark.mllib with spark.ml
Timothy Hunter created SPARK-12210: -- Summary: Small example that shows how to integrate spark.mllib with spark.ml Key: SPARK-12210 URL: https://issues.apache.org/jira/browse/SPARK-12210 Project: Spark Issue Type: Sub-task Components: Documentation, MLlib Affects Versions: 1.5.2 Reporter: Timothy Hunter Since we are missing a number of algorithms in {{spark.ml}} such as clustering or LDA, we should have a small example that shows the recommended way to go back and forth between {{spark.ml}} and {{spark.mllib}}. It is mostly putting together existing pieces, but I feel it is important for new users to see how the interaction plays out in practice. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12208) Abstract the examples into a common place
Timothy Hunter created SPARK-12208: -- Summary: Abstract the examples into a common place Key: SPARK-12208 URL: https://issues.apache.org/jira/browse/SPARK-12208 Project: Spark Issue Type: Sub-task Components: Documentation, MLlib Affects Versions: 1.5.2 Reporter: Timothy Hunter When we write examples in the code, we put the generation of the data along with the example itself. We typically have either: {code} val data = sqlContext.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt") ... {code} or some more esoteric stuff such as: {code} val data = Array( (0, 0.1), (1, 0.8), (2, 0.2) ) val dataFrame: DataFrame = sqlContext.createDataFrame(data).toDF("label", "feature") {code} {code} val data = Array( Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))), Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0), Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0) ) val df = sqlContext.createDataFrame(data.map(Tuple1.apply)).toDF("features") {code} I suggest we follow the example of sklearn and standardize all the generation of example data inside a few methods, for example in {{org.apache.spark.ml.examples.ExampleData}}. One reason is that just reading the code is sometimes not enough to figure out what the data is supposed to be. For example, when using {{libsvm_data}}, it is unclear what the dataframe columns are. This is something we should document somewhere. Also, it would help to explain in one place all the Scala idiosyncrasies, such as using {{Tuple1.apply}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
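The shared example-data module proposed above could look like the following sketch. Python is used here for brevity, and the module and function names (`example_data`, `simple_labeled_points`, `libsvm_like_columns`) are hypothetical, not an existing Spark API; the point is that each generator documents its schema in one place:

```python
# Hypothetical sketch of a shared example-data module, in the spirit of
# sklearn.datasets: each generator documents exactly what it returns, so
# example code no longer needs to inline (and re-explain) the data.

def simple_labeled_points():
    """Rows of (label, features) pairs; features is a fixed-length list."""
    return [
        (1.0, [0.0, 1.1, 0.1]),
        (0.0, [2.0, 1.0, -1.0]),
        (1.0, [0.0, 1.2, -0.5]),
    ]

def libsvm_like_columns():
    """Column names of the dataframe produced from a libsvm-style file."""
    return ["label", "features"]
```

Examples would then start from `simple_labeled_points()` instead of ad-hoc arrays, and the docstring becomes the single place that explains what the columns mean.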
[jira] [Closed] (SPARK-11601) ML 1.6 QA: API: Binary incompatible changes
[ https://issues.apache.org/jira/browse/SPARK-11601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Hunter closed SPARK-11601. -- Resolution: Done > ML 1.6 QA: API: Binary incompatible changes > --- > > Key: SPARK-11601 > URL: https://issues.apache.org/jira/browse/SPARK-11601 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: Timothy Hunter > > Generate a list of binary incompatible changes using MiMa and create new > JIRAs for issues found. Filter out false positives as needed. > If you want to take this task, ping [~mengxr] for advice since he did it for > 1.5. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12000) `sbt publishLocal` hits a Scala compiler bug caused by `Since` annotation
[ https://issues.apache.org/jira/browse/SPARK-12000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15032058#comment-15032058 ] Timothy Hunter commented on SPARK-12000: Yes, I have this branch with some fixes, but I would need double review from someone more familiar with SBT to make sure it does not break something else: https://github.com/apache/spark/compare/master...thunterdb:1511-java8?expand=1 > `sbt publishLocal` hits a Scala compiler bug caused by `Since` annotation > - > > Key: SPARK-12000 > URL: https://issues.apache.org/jira/browse/SPARK-12000 > Project: Spark > Issue Type: Bug > Components: Build, Documentation, MLlib >Affects Versions: 1.6.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > Reported by [~josephkb]. Not sure what is the root cause, but this is the > error message when I ran "sbt publishLocal": > {code} > [error] (launcher/compile:doc) javadoc returned nonzero exit code > [error] (mllib/compile:doc) scala.reflect.internal.FatalError: > [error] while compiling: > /Users/meng/src/spark/mllib/src/main/scala/org/apache/spark/mllib/util/modelSaveLoad.scala > [error] during phase: global=terminal, atPhase=parser > [error] library version: version 2.10.5 > [error] compiler version: version 2.10.5 > [error] reconstructed args: -Yno-self-type-checks -groups -classpath > 
/Users/meng/src/spark/core/target/scala-2.10/classes:... (remainder of the long classpath omitted)
[jira] [Commented] (SPARK-8517) Improve the organization and style of MLlib's user guide
[ https://issues.apache.org/jira/browse/SPARK-8517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15027512#comment-15027512 ] Timothy Hunter commented on SPARK-8517: --- - A couple of pages such as {{ml-ensembles}} and {{ml-linear-methods}} refer to MLlib "for details". It is unclear what the differences are. I suggest we either copy the details or clearly state that we mean to refer to the mathematical formulation only: the parameters may have the same names (number of iterations, regularization parameter) but the API is different. By the way, I do not see SVM in the spark.ml documentation yet. - The perceptron classifier does not need to be a top-level section of spark.ml. I would move it to a subsection of classification. > Improve the organization and style of MLlib's user guide > > > Key: SPARK-8517 > URL: https://issues.apache.org/jira/browse/SPARK-8517 > Project: Spark > Issue Type: Improvement > Components: Documentation, ML, MLlib >Reporter: Xiangrui Meng >Assignee: Timothy Hunter > > The current MLlib's user guide (and spark.ml's), especially the main page, > doesn't have a nice style. We could update it and re-organize the content to > make it easier to navigate. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8517) Improve the organization and style of MLlib's user guide
[ https://issues.apache.org/jira/browse/SPARK-8517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15027491#comment-15027491 ] Timothy Hunter commented on SPARK-8517: --- - We need a whole page about best practices with dataframes containing numerical data (vector UDTs). That was a big pain point for me. We have a whole page on spark.mllib and we should have something similar for dataframes. - in `ml-guide`, I would split the high-level concepts (`fit`, `transform`, etc.) from chaining them together with a pipeline. From reading the current document, spark.ml seems harder to use than spark.mllib because it introduces complicated examples right at the start (model selection with cross-validation). - small nit: the links under each example should link to the github file; right now they are not super useful. Do we have a ticket for that? Building examples: The current way to build a dead-simple dataframe is as follows. It is rather noisy when you compare it to Python. I would recommend we move all the example code generation to a library and thoroughly explain there what the dataframes contain (or make it part of the guide). For example: {code} val data = Array(-0.5, -0.3, 0.0, 0.2) val dataFrame = sqlContext.createDataFrame(data.map(Tuple1.apply)).toDF("features") {code} This requires some understanding of tuple packing, the synthetic apply method, etc. It is definitely more complicated than the Python or RDD equivalent. I do not have a good solution right now, but I find it a bit unsettling when this is the first line I read in an example. 
Other examples are easier to read, I find: {code} val training = sqlContext.createDataFrame(Seq((1.0, Vectors.dense(0.0, 1.2, -0.5)))).toDF("label", "features") {code} > Improve the organization and style of MLlib's user guide > > > Key: SPARK-8517 > URL: https://issues.apache.org/jira/browse/SPARK-8517 > Project: Spark > Issue Type: Improvement > Components: Documentation, ML, MLlib >Reporter: Xiangrui Meng >Assignee: Timothy Hunter > > The current MLlib's user guide (and spark.ml's), especially the main page, > doesn't have a nice style. We could update it and re-organize the content to > make it easier to navigate. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
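The tuple-packing step that makes the single-column example noisy can be illustrated outside Spark. In plain-Python terms (an analogy, not Spark API), Scala's `data.map(Tuple1.apply)` corresponds to wrapping each scalar in a 1-tuple so that it becomes a one-field row:

```python
# Plain-Python analogue of Scala's data.map(Tuple1.apply): a flat list of
# scalars must become a list of 1-tuples before it can be interpreted as
# rows of a single-column dataframe.
data = [-0.5, -0.3, 0.0, 0.2]
rows = [(x,) for x in data]  # each scalar wrapped as a one-field row
```

This is exactly the packing that `Tuple1.apply` performs, and it is the kind of idiosyncrasy a shared example-data library could hide from readers.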
[jira] [Commented] (SPARK-8517) Improve the organization and style of MLlib's user guide
[ https://issues.apache.org/jira/browse/SPARK-8517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15027375#comment-15027375 ] Timothy Hunter commented on SPARK-8517: --- Here are a few comments at a high level: - branding confusion about spark.mllib vs spark.ml vs the union of the two. It is a bit hard right now, when you navigate to the first page, to see the difference - the focus of spark.ml is on pipelines. It should be on dataframes, which clearly separates it from spark.mllib, which is based on RDDs - make pipelines a sub-concept of spark.ml (instead of saying that spark.ml is the pipeline API). Say that you can build pipelines with spark.ml - make sure that all algorithms in spark.ml have the same level of usability as in mllib. You should not be forced to build a pipeline to use a single algorithm - reorganize the spark.ml menu around the goal and not around the content. Users want to solve problems (clustering, regression, classification), but we organize by theoretical concepts (decision trees, ensembles, linear methods). We should do as mllib and scikit-learn do: {code} - MLlib: machine learning on RDDs ... - SparkML: machine learning with (Spark) Dataframes - General concepts and overview - Building and transforming features - Classification and Regression - Clustering - Collaborative filtering - Chaining transforms with pipelines - Advanced: Evaluation, import/export, developer APIs - Examples {code} Some pieces, such as dimensionality reduction, are missing from this outline. Also, the scikit-learn guide has a more academic focus, splitting roughly at supervised vs unsupervised. I am going to drill down more into the sections for some suggestions. 
> Improve the organization and style of MLlib's user guide > > > Key: SPARK-8517 > URL: https://issues.apache.org/jira/browse/SPARK-8517 > Project: Spark > Issue Type: Improvement > Components: Documentation, ML, MLlib >Reporter: Xiangrui Meng >Assignee: Timothy Hunter > > The current MLlib's user guide (and spark.ml's), especially the main page, > doesn't have a nice style. We could update it and re-organize the content to > make it easier to navigate. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-2762) SparkILoop leaks memory in multi-repl configurations
Timothy Hunter created SPARK-2762: - Summary: SparkILoop leaks memory in multi-repl configurations Key: SPARK-2762 URL: https://issues.apache.org/jira/browse/SPARK-2762 Project: Spark Issue Type: Bug Reporter: Timothy Hunter Priority: Minor When subclassing SparkILoop and instantiating multiple objects, the SparkILoop instances do not get garbage collected. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2452) Multi-statement input to spark repl does not work
[ https://issues.apache.org/jira/browse/SPARK-2452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14070841#comment-14070841 ] Timothy Hunter commented on SPARK-2452: --- Excellent, thanks Patrick. > Multi-statement input to spark repl does not work > - > > Key: SPARK-2452 > URL: https://issues.apache.org/jira/browse/SPARK-2452 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.1 >Reporter: Timothy Hunter >Assignee: Prashant Sharma >Priority: Blocker > Fix For: 1.1.0 > > > Here is an example: > {code} > scala> val x = 4 ; def f() = x > x: Int = 4 > f: ()Int > scala> f() > :11: error: $VAL5 is already defined as value $VAL5 > val $VAL5 = INSTANCE; > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2452) Multi-statement input to spark repl does not work
Timothy Hunter created SPARK-2452: - Summary: Multi-statement input to spark repl does not work Key: SPARK-2452 URL: https://issues.apache.org/jira/browse/SPARK-2452 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Timothy Hunter Here is an example: scala> val x = 4 ; def f() = x x: Int = 4 f: ()Int scala> f() :11: error: $VAL5 is already defined as value $VAL5 val $VAL5 = INSTANCE; -- This message was sent by Atlassian JIRA (v6.2#6252)