[jira] [Updated] (SPARK-10022) Scala-Python method/parameter inconsistency check for ML during 1.5 QA
[ https://issues.apache.org/jira/browse/SPARK-10022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-10022: Summary: Scala-Python method/parameter inconsistency check for ML during 1.5 QA (was: Scala-Python method/parameter inconsistency check for ML MLlib during 1.5 QA) Scala-Python method/parameter inconsistency check for ML during 1.5 QA -- Key: SPARK-10022 URL: https://issues.apache.org/jira/browse/SPARK-10022 Project: Spark Issue Type: Improvement Components: ML, MLlib, PySpark Reporter: Yanbo Liang The missing classes for PySpark were listed at SPARK-9663. Here we check and list the missing methods/parameters for ML MLlib in PySpark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10024) Implement RandomForestParams and TreeEnsembleParams for Python API
[ https://issues.apache.org/jira/browse/SPARK-10024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-10024: Description: Implement RandomForestParams, GBTParams and TreeEnsembleParams for Python API, and make corresponding parameter in place. (was: Implement RandomForestParams and TreeEnsembleParams for Python API, and make corresponding parameter in place.) Implement RandomForestParams and TreeEnsembleParams for Python API -- Key: SPARK-10024 URL: https://issues.apache.org/jira/browse/SPARK-10024 Project: Spark Issue Type: Sub-task Components: ML, MLlib, PySpark Reporter: Yanbo Liang Implement RandomForestParams, GBTParams and TreeEnsembleParams for Python API, and make corresponding parameter in place. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10024) Python API Tree related params clear up
[ https://issues.apache.org/jira/browse/SPARK-10024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-10024: Summary: Python API Tree related params clear up (was: Implement RandomForestParams and TreeEnsembleParams for Python API) Python API Tree related params clear up --- Key: SPARK-10024 URL: https://issues.apache.org/jira/browse/SPARK-10024 Project: Spark Issue Type: Sub-task Components: ML, MLlib, PySpark Reporter: Yanbo Liang Implement RandomForestParams, GBTParams and TreeEnsembleParams for Python API, and make corresponding parameter in place. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10024) Python API Tree related params clear up
[ https://issues.apache.org/jira/browse/SPARK-10024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-10024: Description: Implement RandomForestParams, GBTParams and TreeEnsembleParams for Python API, and make corresponding parameters in place. (was: Implement RandomForestParams, GBTParams and TreeEnsembleParams for Python API, and make corresponding parameter in place.) Python API Tree related params clear up --- Key: SPARK-10024 URL: https://issues.apache.org/jira/browse/SPARK-10024 Project: Spark Issue Type: Sub-task Components: ML, MLlib, PySpark Reporter: Yanbo Liang Implement RandomForestParams, GBTParams and TreeEnsembleParams for Python API, and make corresponding parameters in place. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9663) ML Python API coverage issues found during 1.5 QA
[ https://issues.apache.org/jira/browse/SPARK-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-9663: --- Description: This umbrella is for a list of Python API coverage issues which we should fix for the 1.6 release cycle. This list is to be generated from issues found in [SPARK-9662] and from remaining issues from 1.4: [SPARK-7536]. Here we check and compare the Python and Scala API of MLlib/ML, add missing classes/methods/parameters for PySpark. * Missing classes for PySpark(ML): ** attribute ** feature *** CountVectorizerModel SPARK-9769 *** DCT SPARK-8472 *** ElementwiseProduct SPARK-9768 *** MinMaxScaler SPARK-8530 *** StopWordsRemover SPARK-9679 *** VectorSlicer SPARK-9772 *** IndexToString SPARK-10021 ** classification *** OneVsRest SPARK-7861 *** MultilayerPerceptronClassifier SPARK-9773 ** regression *** IsotonicRegression SPARK-9774 * Missing User Guide documents for PySpark SPARK-8757 * Scala-Python method/parameter inconsistency check for ML MLlib SPARK-10022 was: This umbrella is for a list of Python API coverage issues which we should fix for the 1.6 release cycle. This list is to be generated from issues found in [SPARK-9662] and from remaining issues from 1.4: [SPARK-7536]. Here we check and compare the Python and Scala API of MLlib/ML, add missing classes/methods/parameters for PySpark. * Missing classes for PySpark(ML): ** feature *** CountVectorizerModel SPARK-9769 *** DCT SPARK-8472 *** ElementwiseProduct SPARK-9768 *** MinMaxScaler SPARK-8530 *** StopWordsRemover SPARK-9679 *** VectorSlicer SPARK-9772 *** IndexToString SPARK-10021 ** classification *** OneVsRest SPARK-7861 *** MultilayerPerceptronClassifier SPARK-9773 ** regression *** IsotonicRegression SPARK-9774 * Missing User Guide documents for PySpark SPARK-8757 * Scala-Python method/parameter inconsistency check for ML MLlib SPARK-10022 ML Python API coverage issues found during 1.5 QA - Key: SPARK-9663 URL: https://issues.apache.org/jira/browse/SPARK-9663 Project: Spark Issue Type: Umbrella Components: ML, MLlib, PySpark Reporter: Joseph K. Bradley This umbrella is for a list of Python API coverage issues which we should fix for the 1.6 release cycle. This list is to be generated from issues found in [SPARK-9662] and from remaining issues from 1.4: [SPARK-7536]. Here we check and compare the Python and Scala API of MLlib/ML, add missing classes/methods/parameters for PySpark. * Missing classes for PySpark(ML): ** attribute ** feature *** CountVectorizerModel SPARK-9769 *** DCT SPARK-8472 *** ElementwiseProduct SPARK-9768 *** MinMaxScaler SPARK-8530 *** StopWordsRemover SPARK-9679 *** VectorSlicer SPARK-9772 *** IndexToString SPARK-10021 ** classification *** OneVsRest SPARK-7861 *** MultilayerPerceptronClassifier SPARK-9773 ** regression *** IsotonicRegression SPARK-9774 * Missing User Guide documents for PySpark SPARK-8757 * Scala-Python method/parameter inconsistency check for ML MLlib SPARK-10022 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10028) Add Python API for PrefixSpan
Yanbo Liang created SPARK-10028: --- Summary: Add Python API for PrefixSpan Key: SPARK-10028 URL: https://issues.apache.org/jira/browse/SPARK-10028 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Reporter: Yanbo Liang Add Python API for mllib.fpm.PrefixSpan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
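[Editor's note] For reference, here is a sketch of how the requested wrapper might be used. The Python train() signature below mirrors the Scala mllib.fpm.PrefixSpan API and is an assumption, not a shipped API.

{code}
# Hypothetical usage of the requested Python wrapper for mllib.fpm.PrefixSpan;
# the signature mirrors the Scala API and is an assumption.
from pyspark import SparkContext
from pyspark.mllib.fpm import PrefixSpan

sc = SparkContext("local", "prefixspan-sketch")
# Each sequence is a list of itemsets; each itemset is a list of items.
sequences = sc.parallelize([
    [["a", "b"], ["c"]],
    [["a"], ["c", "b"], ["a", "b"]],
    [["a", "b"], ["e"]],
    [["f"]],
], 2)
model = PrefixSpan.train(sequences, minSupport=0.5, maxPatternLength=5)
for fs in model.freqSequences().collect():
    print(fs.sequence, fs.freq)
{code}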
[jira] [Commented] (SPARK-9431) TimeIntervalType for time intervals
[ https://issues.apache.org/jira/browse/SPARK-9431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14698589#comment-14698589 ] Apache Spark commented on SPARK-9431: - User 'yjshen' has created a pull request for this issue: https://github.com/apache/spark/pull/8224 TimeIntervalType for time intervals --- Key: SPARK-9431 URL: https://issues.apache.org/jira/browse/SPARK-9431 Project: Spark Issue Type: Story Components: SQL Reporter: Reynold Xin Priority: Critical Related to the existing CalendarIntervalType, TimeIntervalType internally has only one component: the number of microseconds, represented as a long. TimeIntervalType can be used in equality tests and ordering. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9431) TimeIntervalType for time intervals
[ https://issues.apache.org/jira/browse/SPARK-9431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9431: --- Assignee: (was: Apache Spark) TimeIntervalType for time intervals --- Key: SPARK-9431 URL: https://issues.apache.org/jira/browse/SPARK-9431 Project: Spark Issue Type: Story Components: SQL Reporter: Reynold Xin Priority: Critical Related to the existing CalendarIntervalType, TimeIntervalType internally has only one component: the number of microseconds, represented as a long. TimeIntervalType can be used in equality tests and ordering. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
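[Editor's note] The proposed semantics can be sketched outside Spark. The class below is illustrative plain Python, not Spark code: a single microseconds component that supports equality tests and ordering.

{code}
# Minimal sketch of the proposed TimeIntervalType semantics (illustrative only):
# one long microseconds component, usable in equality tests and ordering.
import functools

@functools.total_ordering
class TimeInterval(object):
    def __init__(self, microseconds):
        self.microseconds = int(microseconds)

    def __eq__(self, other):
        return (isinstance(other, TimeInterval)
                and self.microseconds == other.microseconds)

    def __lt__(self, other):
        return self.microseconds < other.microseconds

    def __hash__(self):
        return hash(self.microseconds)

assert TimeInterval(1000000) == TimeInterval(1000000)
assert TimeInterval(500) < TimeInterval(1000)
{code}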
[jira] [Updated] (SPARK-9973) Wrong initial size of in-memory columnar buffers
[ https://issues.apache.org/jira/browse/SPARK-9973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-9973: -- Summary: Wrong initial size of in-memory columnar buffers (was: wrong buffer size) Wrong initial size of in-memory columnar buffers Key: SPARK-9973 URL: https://issues.apache.org/jira/browse/SPARK-9973 Project: Spark Issue Type: Bug Components: SQL Reporter: xukun Assignee: xukun When caching a table in memory in Spark SQL, we allocate too much memory. InMemoryColumnarTableScan.class val initialBufferSize = columnType.defaultSize * batchSize ColumnBuilder(attribute.dataType, initialBufferSize, attribute.name, useCompression) BasicColumnBuilder.class buffer = ByteBuffer.allocate(4 + size * columnType.defaultSize) So the total allocated size is (4 + size * columnType.defaultSize * columnType.defaultSize); we change it to 4 + size * columnType.defaultSize. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9973) wrong buffer size
[ https://issues.apache.org/jira/browse/SPARK-9973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-9973: -- Assignee: xukun wrong buffer size - Key: SPARK-9973 URL: https://issues.apache.org/jira/browse/SPARK-9973 Project: Spark Issue Type: Bug Components: SQL Reporter: xukun Assignee: xukun When caching a table in memory in Spark SQL, we allocate too much memory. InMemoryColumnarTableScan.class val initialBufferSize = columnType.defaultSize * batchSize ColumnBuilder(attribute.dataType, initialBufferSize, attribute.name, useCompression) BasicColumnBuilder.class buffer = ByteBuffer.allocate(4 + size * columnType.defaultSize) So the total allocated size is (4 + size * columnType.defaultSize * columnType.defaultSize); we change it to 4 + size * columnType.defaultSize. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10024) Implement RandomForestParams and TreeEnsembleParams for Python API
Yanbo Liang created SPARK-10024: --- Summary: Implement RandomForestParams and TreeEnsembleParams for Python API Key: SPARK-10024 URL: https://issues.apache.org/jira/browse/SPARK-10024 Project: Spark Issue Type: Sub-task Reporter: Yanbo Liang Implement RandomForestParams and TreeEnsembleParams for Python API, and make corresponding parameter in place. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9663) ML Python API coverage issues found during 1.5 QA
[ https://issues.apache.org/jira/browse/SPARK-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-9663: --- Description: This umbrella is for a list of Python API coverage issues which we should fix for the 1.6 release cycle. This list is to be generated from issues found in [SPARK-9662] and from remaining issues from 1.4: [SPARK-7536]. Here we check and compare the Python and Scala API of MLlib/ML, add missing classes/methods/parameters for PySpark. * Missing classes for PySpark(ML): ** attribute SPARK-10025 ** feature *** CountVectorizerModel SPARK-9769 *** DCT SPARK-8472 *** ElementwiseProduct SPARK-9768 *** MinMaxScaler SPARK-8530 *** StopWordsRemover SPARK-9679 *** VectorSlicer SPARK-9772 *** IndexToString SPARK-10021 ** classification *** OneVsRest SPARK-7861 *** MultilayerPerceptronClassifier SPARK-9773 ** regression *** IsotonicRegression SPARK-9774 * Missing User Guide documents for PySpark SPARK-8757 * Scala-Python method/parameter inconsistency check for ML MLlib SPARK-10022 was: This umbrella is for a list of Python API coverage issues which we should fix for the 1.6 release cycle. This list is to be generated from issues found in [SPARK-9662] and from remaining issues from 1.4: [SPARK-7536]. Here we check and compare the Python and Scala API of MLlib/ML, add missing classes/methods/parameters for PySpark. * Missing classes for PySpark(ML): ** attribute ** feature *** CountVectorizerModel SPARK-9769 *** DCT SPARK-8472 *** ElementwiseProduct SPARK-9768 *** MinMaxScaler SPARK-8530 *** StopWordsRemover SPARK-9679 *** VectorSlicer SPARK-9772 *** IndexToString SPARK-10021 ** classification *** OneVsRest SPARK-7861 *** MultilayerPerceptronClassifier SPARK-9773 ** regression *** IsotonicRegression SPARK-9774 * Missing User Guide documents for PySpark SPARK-8757 * Scala-Python method/parameter inconsistency check for ML MLlib SPARK-10022 ML Python API coverage issues found during 1.5 QA - Key: SPARK-9663 URL: https://issues.apache.org/jira/browse/SPARK-9663 Project: Spark Issue Type: Umbrella Components: ML, MLlib, PySpark Reporter: Joseph K. Bradley This umbrella is for a list of Python API coverage issues which we should fix for the 1.6 release cycle. This list is to be generated from issues found in [SPARK-9662] and from remaining issues from 1.4: [SPARK-7536]. Here we check and compare the Python and Scala API of MLlib/ML, add missing classes/methods/parameters for PySpark. * Missing classes for PySpark(ML): ** attribute SPARK-10025 ** feature *** CountVectorizerModel SPARK-9769 *** DCT SPARK-8472 *** ElementwiseProduct SPARK-9768 *** MinMaxScaler SPARK-8530 *** StopWordsRemover SPARK-9679 *** VectorSlicer SPARK-9772 *** IndexToString SPARK-10021 ** classification *** OneVsRest SPARK-7861 *** MultilayerPerceptronClassifier SPARK-9773 ** regression *** IsotonicRegression SPARK-9774 * Missing User Guide documents for PySpark SPARK-8757 * Scala-Python method/parameter inconsistency check for ML MLlib SPARK-10022 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10025) Add Python API for ml.attribute
Yanbo Liang created SPARK-10025: --- Summary: Add Python API for ml.attribute Key: SPARK-10025 URL: https://issues.apache.org/jira/browse/SPARK-10025 Project: Spark Issue Type: Improvement Components: ML, PySpark Reporter: Yanbo Liang Currently there is no Python implementation for ml.attribute, so we cannot use Attribute in ML pipelines. Some transformers need this feature; for example, VectorSlicer can take a subarray of the original features by specifying column names, which must be contained in the column's Attribute. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
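[Editor's note] To make the gap concrete, a hedged sketch: slicing by index needs no attributes, while slicing by name depends on ml.attribute metadata that Python cannot yet produce. Note the Python VectorSlicer used below was itself still being added (SPARK-9772), so treat this as illustrative.

{code}
# Illustrative only: assumes the (then in-progress) Python VectorSlicer exists.
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.ml.feature import VectorSlicer
from pyspark.mllib.linalg import Vectors

sc = SparkContext("local", "attribute-gap-sketch")
sqlContext = SQLContext(sc)
df = sqlContext.createDataFrame([(Vectors.dense([1.0, 2.0, 3.0]),)], ["features"])

# Selecting by position works without attribute metadata.
slicer = VectorSlicer(inputCol="features", outputCol="selected", indices=[0, 2])
slicer.transform(df).show()

# Selecting by name -- slicer.setNames(["f0", "f2"]) -- additionally requires
# per-feature Attribute metadata on the "features" column, which currently
# cannot be created from Python.
{code}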
[jira] [Updated] (SPARK-9793) PySpark DenseVector, SparseVector should override __eq__ and __hash__
[ https://issues.apache.org/jira/browse/SPARK-9793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-9793: --- Summary: PySpark DenseVector, SparseVector should override __eq__ and __hash__ (was: PySpark DenseVector, SparseVector should override __eq__) PySpark DenseVector, SparseVector should override __eq__ and __hash__ - Key: SPARK-9793 URL: https://issues.apache.org/jira/browse/SPARK-9793 Project: Spark Issue Type: Bug Components: ML, PySpark Affects Versions: 1.5.0 Reporter: Joseph K. Bradley Priority: Critical See [SPARK-9750]. PySpark DenseVector and SparseVector do not override the equality operator properly. They should use semantics, not representation, for comparison. (This is what Scala currently does.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
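[Editor's note] The desired behavior can be stated as a couple of assertions. This sketches the semantics the issue asks for (these fail before the fix); it is not a claim about the then-current implementation.

{code}
# Desired semantics per this issue: compare by mathematical value, and keep
# __hash__ consistent with __eq__ so dense/sparse twins share one dict/set key.
from pyspark.mllib.linalg import DenseVector, SparseVector

dv = DenseVector([1.0, 0.0, 3.0])
sv = SparseVector(3, {0: 1.0, 2: 3.0})

assert dv == sv               # equal values, different representations
assert hash(dv) == hash(sv)   # required once __eq__ is semantic
{code}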
[jira] [Created] (SPARK-10027) Add Python API missing methods for ml.feature
Yanbo Liang created SPARK-10027: --- Summary: Add Python API missing methods for ml.feature Key: SPARK-10027 URL: https://issues.apache.org/jira/browse/SPARK-10027 Project: Spark Issue Type: Sub-task Components: ML, PySpark Reporter: Yanbo Liang Missing methods of ml.feature are listed here: * StringIndexer lacks the handleInvalid parameter * VectorIndexerModel lacks numFeatures and categoryMaps -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
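[Editor's note] A sketch of what the two additions might look like from user code; the handleInvalid parameter and the two model attributes are the requested, not-yet-existing Python API, so their Python forms here are assumptions.

{code}
# Sketch of the requested Python API for ml.feature (assumed shapes).
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.ml.feature import StringIndexer, VectorIndexer
from pyspark.mllib.linalg import Vectors

sc = SparkContext("local", "ml-feature-sketch")
sqlContext = SQLContext(sc)
df = sqlContext.createDataFrame(
    [("a", Vectors.dense([1.0, 0.0])), ("b", Vectors.dense([0.0, 1.0]))],
    ["label", "features"])

# Requested: handleInvalid on StringIndexer ("error" or "skip").
indexer = StringIndexer(inputCol="label", outputCol="labelIndex",
                        handleInvalid="error")

# Requested: numFeatures and categoryMaps on VectorIndexerModel.
model = VectorIndexer(inputCol="features", outputCol="indexed",
                      maxCategories=2).fit(df)
print(model.numFeatures)   # 2
print(model.categoryMaps)  # per-feature {category value -> index} maps
{code}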
[jira] [Assigned] (SPARK-9431) TimeIntervalType for time intervals
[ https://issues.apache.org/jira/browse/SPARK-9431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9431: --- Assignee: Apache Spark TimeIntervalType for time intervals --- Key: SPARK-9431 URL: https://issues.apache.org/jira/browse/SPARK-9431 Project: Spark Issue Type: Story Components: SQL Reporter: Reynold Xin Assignee: Apache Spark Priority: Critical Related to the existing CalendarIntervalType, TimeIntervalType internally has only one component: the number of microseconds, represented as a long. TimeIntervalType can be used in equality tests and ordering. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10029) Add Python examples for mllib IsotonicRegression user guide
[ https://issues.apache.org/jira/browse/SPARK-10029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-10029: Issue Type: Sub-task (was: Documentation) Parent: SPARK-8757 Add Python examples for mllib IsotonicRegression user guide --- Key: SPARK-10029 URL: https://issues.apache.org/jira/browse/SPARK-10029 Project: Spark Issue Type: Sub-task Components: Documentation, PySpark Reporter: Yanbo Liang Priority: Minor Add Python examples for mllib IsotonicRegression user guide -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10029) Add Python examples for mllib IsotonicRegression user guide
Yanbo Liang created SPARK-10029: --- Summary: Add Python examples for mllib IsotonicRegression user guide Key: SPARK-10029 URL: https://issues.apache.org/jira/browse/SPARK-10029 Project: Spark Issue Type: Documentation Components: Documentation, PySpark Reporter: Yanbo Liang Priority: Minor Add Python examples for mllib IsotonicRegression user guide -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
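[Editor's note] One possible shape for such an example, using the existing pyspark.mllib.regression API on a toy dataset of (label, feature, weight) triples; treat it as a sketch rather than the final guide snippet.

{code}
# A minimal sketch of a Python user-guide example for mllib IsotonicRegression.
from pyspark import SparkContext
from pyspark.mllib.regression import IsotonicRegression

sc = SparkContext("local", "isotonic-example-sketch")
# Training data: (label, feature, weight) triples.
data = sc.parallelize([
    (1.0, 1.0, 1.0), (2.0, 2.0, 1.0), (1.5, 3.0, 1.0), (3.0, 4.0, 1.0)])
model = IsotonicRegression.train(data, isotonic=True)
# Predict at a new feature value; piecewise-linear interpolation is used
# between the fitted boundaries.
print(model.predict(2.5))
{code}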
[jira] [Updated] (SPARK-10022) Scala-Python method/parameter inconsistency check for ML during 1.5 QA
[ https://issues.apache.org/jira/browse/SPARK-10022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-10022: Description: The missing classes for PySpark were listed at SPARK-9663. Here we check and list the missing methods/parameters for ML in PySpark. was: The missing classes for PySpark were listed at SPARK-9663. Here we check and list the missing methods/parameters for ML MLlib in PySpark. Scala-Python method/parameter inconsistency check for ML during 1.5 QA -- Key: SPARK-10022 URL: https://issues.apache.org/jira/browse/SPARK-10022 Project: Spark Issue Type: Improvement Components: ML, MLlib, PySpark Reporter: Yanbo Liang The missing classes for PySpark were listed at SPARK-9663. Here we check and list the missing methods/parameters for ML in PySpark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10022) Scala-Python method/parameter inconsistency check for ML MLlib during 1.5 QA
[ https://issues.apache.org/jira/browse/SPARK-10022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-10022: Description: The missing classes for PySpark were listed at SPARK-9663. Here we check and list the missing methods/parameters for ML MLlib in PySpark. was: Check the Scala-Python inconsistency of ML MLlib methods/parameters Scala-Python method/parameter inconsistency check for ML MLlib during 1.5 QA -- Key: SPARK-10022 URL: https://issues.apache.org/jira/browse/SPARK-10022 Project: Spark Issue Type: Improvement Components: ML, MLlib, PySpark Reporter: Yanbo Liang The missing classes for PySpark were listed at SPARK-9663. Here we check and list the missing methods/parameters for ML MLlib in PySpark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10008) Shuffle locality can take precedence over narrow dependencies for RDDs with both
[ https://issues.apache.org/jira/browse/SPARK-10008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-10008. --- Resolution: Fixed Fix Version/s: 1.5.0 Shuffle locality can take precedence over narrow dependencies for RDDs with both Key: SPARK-10008 URL: https://issues.apache.org/jira/browse/SPARK-10008 Project: Spark Issue Type: Bug Components: Scheduler Reporter: Matei Zaharia Assignee: Matei Zaharia Fix For: 1.5.0 The shuffle locality patch made the DAGScheduler aware of shuffle data, but for RDDs that have both narrow and shuffle dependencies, it can cause the scheduler to place tasks based on the shuffle dependency instead of the narrow one. This case is common in iterative join-based algorithms like PageRank and ALS, where one RDD is hash-partitioned and one isn't. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
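[Editor's note] The pattern can be sketched as a minimal PageRank-style loop (illustrative code, not the fix): the join stage has a narrow dependency on the cached, hash-partitioned links RDD and a shuffle dependency on ranks, and task placement should follow the former.

{code}
# Illustrative sketch of the RDD pattern described in this issue.
from pyspark import SparkContext

sc = SparkContext("local[2]", "locality-pattern-sketch")
# Hash-partitioned and cached: the narrow, location-bearing parent.
links = sc.parallelize([(1, [2]), (2, [1, 3]), (3, [1])]).partitionBy(4).cache()
ranks = links.mapValues(lambda _: 1.0)
for _ in range(3):
    # The join stage depends narrowly on `links` and, from the second
    # iteration on, on the shuffled output of reduceByKey below.
    contribs = links.join(ranks).flatMap(
        lambda kv: [(dst, kv[1][1] / len(kv[1][0])) for dst in kv[1][0]])
    ranks = contribs.reduceByKey(lambda a, b: a + b)  # shuffle dependency
print(ranks.collect())
{code}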
[jira] [Updated] (SPARK-9663) ML Python API coverage issues found during 1.5 QA
[ https://issues.apache.org/jira/browse/SPARK-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-9663: --- Description: This umbrella is for a list of Python API coverage issues which we should fix for the 1.6 release cycle. This list is to be generated from issues found in [SPARK-9662] and from remaining issues from 1.4: [SPARK-7536]. Here we check and compare the Python and Scala API of MLlib/ML, add missing classes/methods/parameters for PySpark. * Missing classes for PySpark(ML): ** feature *** CountVectorizerModel SPARK-9769 *** DCT SPARK-8472 *** ElementwiseProduct SPARK-9768 *** MinMaxScaler SPARK-8530 *** StopWordsRemover SPARK-9679 *** VectorSlicer SPARK-9772 *** IndexToString SPARK-10021 ** classification *** OneVsRest SPARK-7861 *** MultilayerPerceptronClassifier SPARK-9773 ** regression *** IsotonicRegression SPARK-9774 * Missing User Guide documents for PySpark SPARK-8757 * Scala-Python method/parameter inconsistency check for ML MLlib SPARK-10022 was: This umbrella is for a list of Python API coverage issues which we should fix for the 1.6 release cycle. This list is to be generated from issues found in [SPARK-9662] and from remaining issues from 1.4: [SPARK-7536]. Here we check and compare the Python and Scala API of MLlib/ML, add missing classes/methods/parameters for PySpark. * Missing classes for PySpark(ML): ** feature *** CountVectorizerModel SPARK-9769 *** DCT SPARK-8472 *** ElementwiseProduct SPARK-9768 *** MinMaxScaler SPARK-8530 *** StopWordsRemover SPARK-9679 *** VectorSlicer SPARK-9772 *** IndexToString SPARK-10021 ** classification *** OneVsRest SPARK-7861 *** MultilayerPerceptronClassifier SPARK-9773 ** regression *** IsotonicRegression SPARK-9774 * Missing User Guide documents for PySpark SPARK-8757 ML Python API coverage issues found during 1.5 QA - Key: SPARK-9663 URL: https://issues.apache.org/jira/browse/SPARK-9663 Project: Spark Issue Type: Umbrella Components: ML, MLlib, PySpark Reporter: Joseph K. Bradley This umbrella is for a list of Python API coverage issues which we should fix for the 1.6 release cycle. This list is to be generated from issues found in [SPARK-9662] and from remaining issues from 1.4: [SPARK-7536]. Here we check and compare the Python and Scala API of MLlib/ML, add missing classes/methods/parameters for PySpark. * Missing classes for PySpark(ML): ** feature *** CountVectorizerModel SPARK-9769 *** DCT SPARK-8472 *** ElementwiseProduct SPARK-9768 *** MinMaxScaler SPARK-8530 *** StopWordsRemover SPARK-9679 *** VectorSlicer SPARK-9772 *** IndexToString SPARK-10021 ** classification *** OneVsRest SPARK-7861 *** MultilayerPerceptronClassifier SPARK-9773 ** regression *** IsotonicRegression SPARK-9774 * Missing User Guide documents for PySpark SPARK-8757 * Scala-Python method/parameter inconsistency check for ML MLlib SPARK-10022 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10022) Scala-Python method/parameter inconsistency check for ML MLlib during 1.5 QA
[ https://issues.apache.org/jira/browse/SPARK-10022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-10022: Summary: Scala-Python method/parameter inconsistency check for ML MLlib during 1.5 QA (was: Scala-Python inconsistency check for ML MLlib during 1.5 QA) Scala-Python method/parameter inconsistency check for ML MLlib during 1.5 QA -- Key: SPARK-10022 URL: https://issues.apache.org/jira/browse/SPARK-10022 Project: Spark Issue Type: Improvement Components: ML, MLlib, PySpark Reporter: Yanbo Liang Check the Scala-Python inconsistency of ML MLlib methods/parameters -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10022) Scala-Python inconsistency check for ML MLlib during 1.5 QA
[ https://issues.apache.org/jira/browse/SPARK-10022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-10022: Description: Check the Scala-Python inconsistency of ML MLlib methods/parameters (was: Check the Scala-Python inconsistency of ML MLlib classes/methods/parameters) Scala-Python inconsistency check for ML MLlib during 1.5 QA - Key: SPARK-10022 URL: https://issues.apache.org/jira/browse/SPARK-10022 Project: Spark Issue Type: Improvement Components: ML, MLlib, PySpark Reporter: Yanbo Liang Check the Scala-Python inconsistency of ML MLlib methods/parameters -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9973) Wrong initial size of in-memory columnar buffers
[ https://issues.apache.org/jira/browse/SPARK-9973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-9973: -- Shepherd: Cheng Lian Sprint: Spark 1.5 doc/QA sprint Affects Version/s: 1.5.0 Target Version/s: 1.5.0 Description: Too much memory is allocated for in-memory columnar buffers. The {{initialSize}} argument in {{ColumnBuilder.initialize}} is the initial number of rows rather than bytes, but the value passed in {{InMemoryColumnarTableScan}} is the latter: {code} // Class InMemoryColumnarTableScan val initialBufferSize = columnType.defaultSize * batchSize ColumnBuilder(attribute.dataType, initialBufferSize, attribute.name, useCompression) {code} Then it's converted to a byte size again by multiplying by {{columnType.defaultSize}}: {code} // Class BasicColumnBuilder buffer = ByteBuffer.allocate(4 + size * columnType.defaultSize) {code} was: When caching a table in memory in Spark SQL, we allocate too much memory. InMemoryColumnarTableScan.class val initialBufferSize = columnType.defaultSize * batchSize ColumnBuilder(attribute.dataType, initialBufferSize, attribute.name, useCompression) BasicColumnBuilder.class buffer = ByteBuffer.allocate(4 + size * columnType.defaultSize) So the total allocated size is (4 + size * columnType.defaultSize * columnType.defaultSize); we change it to 4 + size * columnType.defaultSize. Wrong initial size of in-memory columnar buffers Key: SPARK-9973 URL: https://issues.apache.org/jira/browse/SPARK-9973 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: xukun Assignee: xukun Too much memory is allocated for in-memory columnar buffers. The {{initialSize}} argument in {{ColumnBuilder.initialize}} is the initial number of rows rather than bytes, but the value passed in {{InMemoryColumnarTableScan}} is the latter: {code} // Class InMemoryColumnarTableScan val initialBufferSize = columnType.defaultSize * batchSize ColumnBuilder(attribute.dataType, initialBufferSize, attribute.name, useCompression) {code} Then it's converted to a byte size again by multiplying by {{columnType.defaultSize}}: {code} // Class BasicColumnBuilder buffer = ByteBuffer.allocate(4 + size * columnType.defaultSize) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
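[Editor's note] A worked example of the double multiplication, using the INT column type (defaultSize = 4 bytes) and the 10000-row default of spark.sql.inMemoryColumnarStorage.batchSize; this is a sketch of the arithmetic only, not Spark code.

{code}
# Arithmetic sketch of the bug for an INT column.
default_size = 4      # bytes per INT value
batch_size = 10000    # rows per batch (the documented default)

initial_buffer_size = default_size * batch_size        # 40000, already in bytes
allocated = 4 + initial_buffer_size * default_size     # 160004 bytes
intended = 4 + batch_size * default_size               # 40004 bytes
print(allocated, intended)  # 4x over-allocation per column
{code}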
[jira] [Commented] (SPARK-9973) Wrong initial size of in-memory columnar buffers
[ https://issues.apache.org/jira/browse/SPARK-9973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14698553#comment-14698553 ] Cheng Lian commented on SPARK-9973: --- I've updated the title and description. Wrong initial size of in-memory columnar buffers Key: SPARK-9973 URL: https://issues.apache.org/jira/browse/SPARK-9973 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: xukun Assignee: xukun Too much memory is allocated for in-memory columnar buffers. The {{initialSize}} argument in {{ColumnBuilder.initialize}} is the initial number of rows rather than bytes, but the value passed in {{InMemoryColumnarTableScan}} is the latter: {code} // Class InMemoryColumnarTableScan val initialBufferSize = columnType.defaultSize * batchSize ColumnBuilder(attribute.dataType, initialBufferSize, attribute.name, useCompression) {code} Then it's converted to a byte size again by multiplying by {{columnType.defaultSize}}: {code} // Class BasicColumnBuilder buffer = ByteBuffer.allocate(4 + size * columnType.defaultSize) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10024) Python API RF and GBT related params clear up
[ https://issues.apache.org/jira/browse/SPARK-10024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-10024: Description: Implement RandomForestParams, GBTParams and TreeEnsembleParams for Python API, and make corresponding parameters in place. There is a lot of duplicated code in the current implementation. You can refer to the Scala API, which is more compact. (was: Implement RandomForestParams, GBTParams and TreeEnsembleParams for Python API, and make corresponding parameters in place. It can refer to the Scala API.) Python API RF and GBT related params clear up - Key: SPARK-10024 URL: https://issues.apache.org/jira/browse/SPARK-10024 Project: Spark Issue Type: Sub-task Components: ML, MLlib, PySpark Reporter: Yanbo Liang Implement RandomForestParams, GBTParams and TreeEnsembleParams for Python API, and make corresponding parameters in place. There is a lot of duplicated code in the current implementation. You can refer to the Scala API, which is more compact. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
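[Editor's note] One way the consolidation could look in Python, modeled loosely on the Scala param traits; all class names, param names, and doc strings below are sketches, not the final API.

{code}
# Hypothetical sketch of shared tree-ensemble param mixins for pyspark.ml,
# modeled on the Scala traits; names and signatures are assumptions.
from pyspark.ml.param import Param, Params

class TreeEnsembleParams(Params):
    subsamplingRate = Param(
        Params._dummy(), "subsamplingRate",
        "fraction of the training data used for learning each tree, in (0, 1]")

class RandomForestParams(TreeEnsembleParams):
    numTrees = Param(Params._dummy(), "numTrees",
                     "number of trees to train (>= 1)")
    featureSubsetStrategy = Param(
        Params._dummy(), "featureSubsetStrategy",
        "features to consider per split: auto, all, onethird, sqrt, log2")

class GBTParams(TreeEnsembleParams):
    maxIter = Param(Params._dummy(), "maxIter",
                    "number of boosting iterations (>= 1)")
{code}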
[jira] [Updated] (SPARK-9663) ML Python API coverage issues found during 1.5 QA
[ https://issues.apache.org/jira/browse/SPARK-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-9663: --- Description: This umbrella is for a list of Python API coverage issues which we should fix for the 1.6 release cycle. This list is to be generated from issues found in [SPARK-9662] and from remaining issues from 1.4: [SPARK-7536]. Here we check and compare the Python and Scala API of MLlib/ML, add missing classes/methods/parameters for PySpark. * Missing classes for ML: ** attribute SPARK-10025 ** feature *** CountVectorizerModel SPARK-9769 *** DCT SPARK-8472 *** ElementwiseProduct SPARK-9768 *** MinMaxScaler SPARK-8530 *** StopWordsRemover SPARK-9679 *** VectorSlicer SPARK-9772 *** IndexToString SPARK-10021 ** classification *** OneVsRest SPARK-7861 *** MultilayerPerceptronClassifier SPARK-9773 ** regression *** IsotonicRegression SPARK-9774 * Missing classes for MLlib: ** fpm *** PrefixSpan SPARK-10028 * Missing User Guide documents for PySpark SPARK-8757 * Scala-Python method/parameter inconsistency check for ML MLlib SPARK-10022 was: This umbrella is for a list of Python API coverage issues which we should fix for the 1.6 release cycle. This list is to be generated from issues found in [SPARK-9662] and from remaining issues from 1.4: [SPARK-7536]. Here we check and compare the Python and Scala API of MLlib/ML, add missing classes/methods/parameters for PySpark. * Missing classes for PySpark(ML): ** attribute SPARK-10025 ** feature *** CountVectorizerModel SPARK-9769 *** DCT SPARK-8472 *** ElementwiseProduct SPARK-9768 *** MinMaxScaler SPARK-8530 *** StopWordsRemover SPARK-9679 *** VectorSlicer SPARK-9772 *** IndexToString SPARK-10021 ** classification *** OneVsRest SPARK-7861 *** MultilayerPerceptronClassifier SPARK-9773 ** regression *** IsotonicRegression SPARK-9774 * Missing User Guide documents for PySpark SPARK-8757 * Scala-Python method/parameter inconsistency check for ML MLlib SPARK-10022 ML Python API coverage issues found during 1.5 QA - Key: SPARK-9663 URL: https://issues.apache.org/jira/browse/SPARK-9663 Project: Spark Issue Type: Umbrella Components: ML, MLlib, PySpark Reporter: Joseph K. Bradley This umbrella is for a list of Python API coverage issues which we should fix for the 1.6 release cycle. This list is to be generated from issues found in [SPARK-9662] and from remaining issues from 1.4: [SPARK-7536]. Here we check and compare the Python and Scala API of MLlib/ML, add missing classes/methods/parameters for PySpark. * Missing classes for ML: ** attribute SPARK-10025 ** feature *** CountVectorizerModel SPARK-9769 *** DCT SPARK-8472 *** ElementwiseProduct SPARK-9768 *** MinMaxScaler SPARK-8530 *** StopWordsRemover SPARK-9679 *** VectorSlicer SPARK-9772 *** IndexToString SPARK-10021 ** classification *** OneVsRest SPARK-7861 *** MultilayerPerceptronClassifier SPARK-9773 ** regression *** IsotonicRegression SPARK-9774 * Missing classes for MLlib: ** fpm *** PrefixSpan SPARK-10028 * Missing User Guide documents for PySpark SPARK-8757 * Scala-Python method/parameter inconsistency check for ML MLlib SPARK-10022 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-9662) ML 1.5 QA: API: Python API coverage
[ https://issues.apache.org/jira/browse/SPARK-9662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14698565#comment-14698565 ] Yanbo Liang edited comment on SPARK-9662 at 8/16/15 7:37 AM: - [~josephkb] I have finished checking for Scala-Python method/parameter inconsistency and listed what we should do in the next release cycle in SPARK-10022. was (Author: yanboliang): [~josephkb] I have finished checking for Scala-Python inconsistency and listed what we should do in the next release cycle in SPARK-10022. ML 1.5 QA: API: Python API coverage --- Key: SPARK-9662 URL: https://issues.apache.org/jira/browse/SPARK-9662 Project: Spark Issue Type: Sub-task Components: Documentation, ML, MLlib, PySpark Reporter: Joseph K. Bradley Assignee: Yanbo Liang For new public APIs added to MLlib, we need to check the generated HTML doc and compare the Scala and Python versions. We need to track: * Inconsistency: Do class/method/parameter names match? * Docs: Is the Python doc missing or just a stub? We want the Python doc to be as complete as the Scala doc. * API breaking changes: These should be very rare but are occasionally either necessary (intentional) or accidental. These must be recorded and added in the Migration Guide for this release. ** Note: If the API change is for an Alpha/Experimental/DeveloperApi component, please note that as well. * Missing classes/methods/parameters: We should create to-do JIRAs for functionality missing from Python, to be added in the next release cycle. Please use a *separate* JIRA (linked below) for this list of to-do items. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9662) ML 1.5 QA: API: Python API coverage
[ https://issues.apache.org/jira/browse/SPARK-9662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14698565#comment-14698565 ] Yanbo Liang commented on SPARK-9662: [~josephkb] I have finished checking for Scala-Python inconsistency and listed what we should do in the next release cycle in SPARK-10022. ML 1.5 QA: API: Python API coverage --- Key: SPARK-9662 URL: https://issues.apache.org/jira/browse/SPARK-9662 Project: Spark Issue Type: Sub-task Components: Documentation, ML, MLlib, PySpark Reporter: Joseph K. Bradley Assignee: Yanbo Liang For new public APIs added to MLlib, we need to check the generated HTML doc and compare the Scala and Python versions. We need to track: * Inconsistency: Do class/method/parameter names match? * Docs: Is the Python doc missing or just a stub? We want the Python doc to be as complete as the Scala doc. * API breaking changes: These should be very rare but are occasionally either necessary (intentional) or accidental. These must be recorded and added in the Migration Guide for this release. ** Note: If the API change is for an Alpha/Experimental/DeveloperApi component, please note that as well. * Missing classes/methods/parameters: We should create to-do JIRAs for functionality missing from Python, to be added in the next release cycle. Please use a *separate* JIRA (linked below) for this list of to-do items. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10023) Unify DecisionTreeParams checkpointInterval between the Scala and Python APIs.
Yanbo Liang created SPARK-10023: --- Summary: Unify DecisionTreeParams checkpointInterval between the Scala and Python APIs. Key: SPARK-10023 URL: https://issues.apache.org/jira/browse/SPARK-10023 Project: Spark Issue Type: Sub-task Components: ML, PySpark Reporter: Yanbo Liang checkpointInterval is one of DecisionTreeParams in the Scala API, which is inconsistent with the Python API; we should unify them. Proposal: Make checkpointInterval a shared param. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
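[Editor's note] A sketch of the proposal in pyspark.ml.param style; the mixin name, doc string, and setter body are assumptions about what a shared param could look like, not the shipped implementation.

{code}
# Hypothetical shared-param mixin for checkpointInterval; illustrative only.
from pyspark.ml.param import Param, Params

class HasCheckpointInterval(Params):
    """Mixin for the checkpointInterval param shared by tree estimators."""

    checkpointInterval = Param(
        Params._dummy(), "checkpointInterval",
        "checkpoint interval (>= 1); e.g. 10 means the cache is checkpointed "
        "every 10 iterations")

    def setCheckpointInterval(self, value):
        self._paramMap[self.checkpointInterval] = value
        return self

    def getCheckpointInterval(self):
        return self.getOrDefault(self.checkpointInterval)
{code}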
[jira] [Updated] (SPARK-10024) Python API RF and GBT related params clear up
[ https://issues.apache.org/jira/browse/SPARK-10024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-10024: Summary: Python API RF and GBT related params clear up (was: Python API Tree related params clear up) Python API RF and GBT related params clear up - Key: SPARK-10024 URL: https://issues.apache.org/jira/browse/SPARK-10024 Project: Spark Issue Type: Sub-task Components: ML, MLlib, PySpark Reporter: Yanbo Liang Implement RandomForestParams, GBTParams and TreeEnsembleParams for Python API, and make corresponding parameters in place. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10024) Python API RF and GBT related params clear up
[ https://issues.apache.org/jira/browse/SPARK-10024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-10024: Description: Implement RandomForestParams, GBTParams and TreeEnsembleParams for Python API, and make corresponding parameters in place. It can refer to the Scala API. (was: Implement RandomForestParams, GBTParams and TreeEnsembleParams for Python API, and make corresponding parameters in place.) Python API RF and GBT related params clear up - Key: SPARK-10024 URL: https://issues.apache.org/jira/browse/SPARK-10024 Project: Spark Issue Type: Sub-task Components: ML, MLlib, PySpark Reporter: Yanbo Liang Implement RandomForestParams, GBTParams and TreeEnsembleParams for Python API, and make corresponding parameters in place. It can refer to the Scala API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8844) head/collect is broken in SparkR
[ https://issues.apache.org/jira/browse/SPARK-8844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-8844. -- Resolution: Fixed Fix Version/s: 1.5.0 head/collect is broken in SparkR - Key: SPARK-8844 URL: https://issues.apache.org/jira/browse/SPARK-8844 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 1.5.0 Reporter: Davies Liu Assignee: Sun Rui Priority: Blocker Fix For: 1.5.0 {code} t = tables(sqlContext) showDF(T) Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘showDF’ for signature ‘logical’ showDF(t) +-+---+ |tableName|isTemporary| +-+---+ +-+---+ 15/07/06 09:59:10 WARN Executor: Told to re-register on heartbeat head(t) Error in readTypedObject(con, type) : Unsupported type for deserialization collect(t) Error in readTypedObject(con, type) : Unsupported type for deserialization {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8844) head/collect is broken in SparkR
[ https://issues.apache.org/jira/browse/SPARK-8844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14698564#comment-14698564 ] Shivaram Venkataraman commented on SPARK-8844: -- Resolved by https://github.com/apache/spark/pull/7419 head/collect is broken in SparkR - Key: SPARK-8844 URL: https://issues.apache.org/jira/browse/SPARK-8844 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 1.5.0 Reporter: Davies Liu Assignee: Sun Rui Priority: Blocker {code} t = tables(sqlContext) showDF(T) Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘showDF’ for signature ‘logical’ showDF(t) +-+---+ |tableName|isTemporary| +-+---+ +-+---+ 15/07/06 09:59:10 WARN Executor: Told to re-register on heartbeat head(t) Error in readTypedObject(con, type) : Unsupported type for deserialization collect(t) Error in readTypedObject(con, type) : Unsupported type for deserialization {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8844) head/collect is broken in SparkR
[ https://issues.apache.org/jira/browse/SPARK-8844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-8844: - Assignee: Sun Rui head/collect is broken in SparkR - Key: SPARK-8844 URL: https://issues.apache.org/jira/browse/SPARK-8844 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 1.5.0 Reporter: Davies Liu Assignee: Sun Rui Priority: Blocker {code} t = tables(sqlContext) showDF(T) Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘showDF’ for signature ‘logical’ showDF(t) +-+---+ |tableName|isTemporary| +-+---+ +-+---+ 15/07/06 09:59:10 WARN Executor: Told to re-register on heartbeat head(t) Error in readTypedObject(con, type) : Unsupported type for deserialization collect(t) Error in readTypedObject(con, type) : Unsupported type for deserialization {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10029) Add Python examples for mllib IsotonicRegression user guide
[ https://issues.apache.org/jira/browse/SPARK-10029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10029: Assignee: (was: Apache Spark) Add Python examples for mllib IsotonicRegression user guide --- Key: SPARK-10029 URL: https://issues.apache.org/jira/browse/SPARK-10029 Project: Spark Issue Type: Sub-task Components: Documentation, PySpark Reporter: Yanbo Liang Priority: Minor Add Python examples for mllib IsotonicRegression user guide -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10029) Add Python examples for mllib IsotonicRegression user guide
[ https://issues.apache.org/jira/browse/SPARK-10029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14698597#comment-14698597 ] Apache Spark commented on SPARK-10029: -- User 'yanboliang' has created a pull request for this issue: https://github.com/apache/spark/pull/8225 Add Python examples for mllib IsotonicRegression user guide --- Key: SPARK-10029 URL: https://issues.apache.org/jira/browse/SPARK-10029 Project: Spark Issue Type: Sub-task Components: Documentation, PySpark Reporter: Yanbo Liang Priority: Minor Add Python examples for mllib IsotonicRegression user guide -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10029) Add Python examples for mllib IsotonicRegression user guide
[ https://issues.apache.org/jira/browse/SPARK-10029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10029: Assignee: Apache Spark Add Python examples for mllib IsotonicRegression user guide --- Key: SPARK-10029 URL: https://issues.apache.org/jira/browse/SPARK-10029 Project: Spark Issue Type: Sub-task Components: Documentation, PySpark Reporter: Yanbo Liang Assignee: Apache Spark Priority: Minor Add Python examples for mllib IsotonicRegression user guide -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10022) Scala-Python inconsistency check for ML MLlib during 1.5 QA
Yanbo Liang created SPARK-10022: --- Summary: Scala-Python inconsistency check for ML MLlib during 1.5 QA Key: SPARK-10022 URL: https://issues.apache.org/jira/browse/SPARK-10022 Project: Spark Issue Type: Improvement Components: ML, MLlib, PySpark Reporter: Yanbo Liang Check the Scala-Python inconsistency of ML MLlib classes/methods/parameters -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10030) Managed memory leak detected when cache table
[ https://issues.apache.org/jira/browse/SPARK-10030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wangwei updated SPARK-10030: Description: I tested the latest spark-1.5.0 in standalone mode and followed the steps below, then the issues occurred. 1. create table cache_test(id int, name string) stored as textfile ; 2. load data local inpath 'SparkSource/sql/hive/src/test/resources/data/files/kv1.txt' into table cache_test; 3. cache table test as select * from cache_test distribute by id; 15/08/16 17:14:47 ERROR Executor: Managed memory leak detected; size = 67108864 bytes, TID = 434 15/08/16 17:14:47 ERROR Executor: Exception in task 23.0 in stage 9.0 (TID 434) java.util.NoSuchElementException: key not found: val_54 at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.mutable.HashMap.apply(HashMap.scala:64) at org.apache.spark.sql.columnar.compression.DictionaryEncoding$Encoder.compress(compressionSchemes.scala:258) at org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.build(CompressibleColumnBuilder.scala:110) at org.apache.spark.sql.columnar.NativeColumnBuilder.build(ColumnBuilder.scala:87) at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:153) at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:153) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108) at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:153) at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:120) at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:278) at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171) at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78) at org.apache.spark.rdd.RDD.iterator(RDD.scala:262) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:46) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:722) was: 1. create table cache_test(id int, name string) stored as textfile ; 2. load data local inpath 'SparkSource/sql/hive/src/test/resources/data/files/kv1.txt' into table cache_test; 3. cache table test as select * from cache_test distribute by id; 15/08/16 17:14:47 ERROR Executor: Managed memory leak detected; size = 67108864 bytes, TID = 434 15/08/16 17:14:47 ERROR Executor: Exception in task 23.0 in stage 9.0 (TID 434) java.util.NoSuchElementException: key not found: val_54 at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.mutable.HashMap.apply(HashMap.scala:64) at org.apache.spark.sql.columnar.compression.DictionaryEncoding$Encoder.compress(compressionSchemes.scala:258) at org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.build(CompressibleColumnBuilder.scala:110) at org.apache.spark.sql.columnar.NativeColumnBuilder.build(ColumnBuilder.scala:87) at
[jira] [Updated] (SPARK-10030) Managed memory leak detected when cache table
[ https://issues.apache.org/jira/browse/SPARK-10030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wangwei updated SPARK-10030: Description: I tested the latest spark-1.5.0 in local, standalone, and yarn modes, followed the steps below, and the errors below occurred. 1. create table cache_test(id int, name string) stored as textfile ; 2. load data local inpath 'SparkSource/sql/hive/src/test/resources/data/files/kv1.txt' into table cache_test; 3. cache table test as select * from cache_test distribute by id; 15/08/16 17:14:47 ERROR Executor: Managed memory leak detected; size = 67108864 bytes, TID = 434 15/08/16 17:14:47 ERROR Executor: Exception in task 23.0 in stage 9.0 (TID 434) java.util.NoSuchElementException: key not found: val_54 at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.mutable.HashMap.apply(HashMap.scala:64) at org.apache.spark.sql.columnar.compression.DictionaryEncoding$Encoder.compress(compressionSchemes.scala:258) at org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.build(CompressibleColumnBuilder.scala:110) at org.apache.spark.sql.columnar.NativeColumnBuilder.build(ColumnBuilder.scala:87) at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:153) at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:153) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108) at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:153) at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:120) at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:278) at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171) at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78) at org.apache.spark.rdd.RDD.iterator(RDD.scala:262) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:46) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at 
java.lang.Thread.run(Thread.java:722) was: I tested the latest spark-1.5.0 in local, standalone, and yarn modes, followed the steps below, and the issues below occurred. 1. create table cache_test(id int, name string) stored as textfile ; 2. load data local inpath 'SparkSource/sql/hive/src/test/resources/data/files/kv1.txt' into table cache_test; 3. cache table test as select * from cache_test distribute by id; 15/08/16 17:14:47 ERROR Executor: Managed memory leak detected; size = 67108864 bytes, TID = 434 15/08/16 17:14:47 ERROR Executor: Exception in task 23.0 in stage 9.0 (TID 434) java.util.NoSuchElementException: key not found: val_54 at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.mutable.HashMap.apply(HashMap.scala:64) at org.apache.spark.sql.columnar.compression.DictionaryEncoding$Encoder.compress(compressionSchemes.scala:258) at org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.build(CompressibleColumnBuilder.scala:110) at
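For readers trying to reproduce this, the three steps in the report amount to the following spark-shell session (a minimal sketch, assuming {{sqlContext}} is a Hive-enabled context as in spark-shell, and that 'SparkSource' in the report is a placeholder for a Spark source checkout):
{code}
// Minimal reproduction sketch for SPARK-10030; assumes sqlContext is a HiveContext
// (as in spark-shell) and that the working directory is a Spark source checkout.
sqlContext.sql("create table cache_test(id int, name string) stored as textfile")
sqlContext.sql("load data local inpath 'sql/hive/src/test/resources/data/files/kv1.txt' into table cache_test")
// The managed memory leak and NoSuchElementException are reported while
// materializing the cached, redistributed copy:
sqlContext.sql("cache table test as select * from cache_test distribute by id")
{code}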
[jira] [Assigned] (SPARK-7707) User guide and example code for KernelDensity
[ https://issues.apache.org/jira/browse/SPARK-7707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7707: --- Assignee: (was: Apache Spark) User guide and example code for KernelDensity - Key: SPARK-7707 URL: https://issues.apache.org/jira/browse/SPARK-7707 Project: Spark Issue Type: Documentation Components: Documentation, MLlib Affects Versions: 1.4.0 Reporter: Xiangrui Meng -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7707) User guide and example code for Statistics.kernelDensity
[ https://issues.apache.org/jira/browse/SPARK-7707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14698634#comment-14698634 ] Sandy Ryza commented on SPARK-7707: --- [~mengxr] thoughts on which page this should land in? mllib-statistics? User guide and example code for Statistics.kernelDensity Key: SPARK-7707 URL: https://issues.apache.org/jira/browse/SPARK-7707 Project: Spark Issue Type: Documentation Components: Documentation, MLlib Affects Versions: 1.4.0 Reporter: Xiangrui Meng -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7707) User guide and example code for KernelDensity
[ https://issues.apache.org/jira/browse/SPARK-7707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated SPARK-7707: -- Summary: User guide and example code for KernelDensity (was: User guide and example code for Statistics.kernelDensity) User guide and example code for KernelDensity - Key: SPARK-7707 URL: https://issues.apache.org/jira/browse/SPARK-7707 Project: Spark Issue Type: Documentation Components: Documentation, MLlib Affects Versions: 1.4.0 Reporter: Xiangrui Meng -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
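Since this ticket tracks example code for KernelDensity, a minimal Scala sketch of the existing mllib API that the guide entry would cover (sample values are illustrative only; assumes a spark-shell {{sc}}):
{code}
import org.apache.spark.mllib.stat.KernelDensity
import org.apache.spark.rdd.RDD

// An RDD of sample points; the values here are made up for illustration.
val data: RDD[Double] = sc.parallelize(Seq(1.0, 1.5, 3.0, 4.2, 5.1, 9.0))

// Construct a kernel density estimator over the sample with the given
// bandwidth, then evaluate the estimated density at a few query points.
val kd = new KernelDensity().setSample(data).setBandwidth(3.0)
val densities: Array[Double] = kd.estimate(Array(-1.0, 2.0, 5.0))
{code}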
[jira] [Created] (SPARK-10030) Managed memory leak detected when cache table
wangwei created SPARK-10030: --- Summary: Managed memory leak detected when cache table Key: SPARK-10030 URL: https://issues.apache.org/jira/browse/SPARK-10030 Project: Spark Issue Type: Bug Affects Versions: 1.5.0 Reporter: wangwei -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10030) Managed memory leak detected when cache table
[ https://issues.apache.org/jira/browse/SPARK-10030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wangwei updated SPARK-10030: Description: 1. create table cache_test(id int, name string) stored as textfile ; 2. load data local inpath '${SparkSource}/sql/hive/src/test/resources/data/files/kv1.txt' into table cache_test; 3. cache table test as select * from cache_test distribute by id; 15/08/16 17:14:47 ERROR Executor: Managed memory leak detected; size = 67108864 bytes, TID = 434 15/08/16 17:14:47 ERROR Executor: Exception in task 23.0 in stage 9.0 (TID 434) java.util.NoSuchElementException: key not found: val_54 at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.mutable.HashMap.apply(HashMap.scala:64) at org.apache.spark.sql.columnar.compression.DictionaryEncoding$Encoder.compress(compressionSchemes.scala:258) at org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.build(CompressibleColumnBuilder.scala:110) at org.apache.spark.sql.columnar.NativeColumnBuilder.build(ColumnBuilder.scala:87) at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:153) at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:153) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108) at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:153) at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:120) at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:278) at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171) at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78) at org.apache.spark.rdd.RDD.iterator(RDD.scala:262) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:46) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:722) Managed memory leak detected when cache table - Key: SPARK-10030 URL: 
https://issues.apache.org/jira/browse/SPARK-10030 Project: Spark Issue Type: Bug Affects Versions: 1.5.0 Reporter: wangwei 1. create table cache_test(id int, name string) stored as textfile ; 2. load data local inpath '${SparkSource}/sql/hive/src/test/resources/data/files/kv1.txt' into table cache_test; 3. cache table test as select * from cache_test distribute by id; 15/08/16 17:14:47 ERROR Executor: Managed memory leak detected; size = 67108864 bytes, TID = 434 15/08/16 17:14:47 ERROR Executor: Exception in task 23.0 in stage 9.0 (TID 434) java.util.NoSuchElementException: key not found: val_54 at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.mutable.HashMap.apply(HashMap.scala:64) at org.apache.spark.sql.columnar.compression.DictionaryEncoding$Encoder.compress(compressionSchemes.scala:258) at
[jira] [Updated] (SPARK-10030) Managed memory leak detected when cache table
[ https://issues.apache.org/jira/browse/SPARK-10030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wangwei updated SPARK-10030: Description: 1. create table cache_test(id int, name string) stored as textfile ; 2. load data local inpath '${SparkSource}/sql/hive/src/test/resources/data/files/kv1.txt' into table cache_test; 3. cache table test as select * from cache_test distribute by id; 15/08/16 17:14:47 ERROR Executor: Managed memory leak detected; size = 67108864 bytes, TID = 434 15/08/16 17:14:47 ERROR Executor: Exception in task 23.0 in stage 9.0 (TID 434) java.util.NoSuchElementException: key not found: val_54 at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.mutable.HashMap.apply(HashMap.scala:64) at org.apache.spark.sql.columnar.compression.DictionaryEncoding$Encoder.compress(compressionSchemes.scala:258) at org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.build(CompressibleColumnBuilder.scala:110) at org.apache.spark.sql.columnar.NativeColumnBuilder.build(ColumnBuilder.scala:87) at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:153) at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:153) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108) at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:153) at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:120) at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:278) at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171) at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78) at org.apache.spark.rdd.RDD.iterator(RDD.scala:262) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:46) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:722) was: 1. create table cache_test(id int, name string) stored as textfile ; 2. 
load data local inpath '${SparkSource}/sql/hive/src/test/resources/data/files/kv1.txt' into table cache_test; 3. cache table test as select * from cache_test distribute by id; 15/08/16 17:14:47 ERROR Executor: Managed memory leak detected; size = 67108864 bytes, TID = 434 15/08/16 17:14:47 ERROR Executor: Exception in task 23.0 in stage 9.0 (TID 434) java.util.NoSuchElementException: key not found: val_54 at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.mutable.HashMap.apply(HashMap.scala:64) at org.apache.spark.sql.columnar.compression.DictionaryEncoding$Encoder.compress(compressionSchemes.scala:258) at org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.build(CompressibleColumnBuilder.scala:110) at org.apache.spark.sql.columnar.NativeColumnBuilder.build(ColumnBuilder.scala:87) at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:153) at
[jira] [Updated] (SPARK-10030) Managed memory leak detected when cache table
[ https://issues.apache.org/jira/browse/SPARK-10030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wangwei updated SPARK-10030: Description: 1. create table cache_test(id int, name string) stored as textfile ; 2. load data local inpath 'spark/sql/hive/src/test/resources/data/files/kv1.txt' into table cache_test; 3. cache table test as select * from cache_test distribute by id; 15/08/16 17:14:47 ERROR Executor: Managed memory leak detected; size = 67108864 bytes, TID = 434 15/08/16 17:14:47 ERROR Executor: Exception in task 23.0 in stage 9.0 (TID 434) java.util.NoSuchElementException: key not found: val_54 at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.mutable.HashMap.apply(HashMap.scala:64) at org.apache.spark.sql.columnar.compression.DictionaryEncoding$Encoder.compress(compressionSchemes.scala:258) at org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.build(CompressibleColumnBuilder.scala:110) at org.apache.spark.sql.columnar.NativeColumnBuilder.build(ColumnBuilder.scala:87) at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:153) at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:153) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108) at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:153) at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:120) at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:278) at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171) at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78) at org.apache.spark.rdd.RDD.iterator(RDD.scala:262) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:46) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:722) was: 1. create table cache_test(id int, name string) stored as textfile ; 2. 
load data local inpath '${SparkSource}/sql/hive/src/test/resources/data/files/kv1.txt' into table cache_test; 3. cache table test as select * from cache_test distribute by id; 15/08/16 17:14:47 ERROR Executor: Managed memory leak detected; size = 67108864 bytes, TID = 434 15/08/16 17:14:47 ERROR Executor: Exception in task 23.0 in stage 9.0 (TID 434) java.util.NoSuchElementException: key not found: val_54 at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.mutable.HashMap.apply(HashMap.scala:64) at org.apache.spark.sql.columnar.compression.DictionaryEncoding$Encoder.compress(compressionSchemes.scala:258) at org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.build(CompressibleColumnBuilder.scala:110) at org.apache.spark.sql.columnar.NativeColumnBuilder.build(ColumnBuilder.scala:87) at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:153) at
[jira] [Updated] (SPARK-10030) Managed memory leak detected when cache table
[ https://issues.apache.org/jira/browse/SPARK-10030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wangwei updated SPARK-10030: Description: 1. create table cache_test(id int, name string) stored as textfile ; 2. load data local inpath '${SparkSource}/sql/hive/src/test/resources/data/files/kv1.txt' into table cache_test; 3. cache table test as select * from cache_test distribute by id; 15/08/16 17:14:47 ERROR Executor: Managed memory leak detected; size = 67108864 bytes, TID = 434 15/08/16 17:14:47 ERROR Executor: Exception in task 23.0 in stage 9.0 (TID 434) java.util.NoSuchElementException: key not found: val_54 at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.mutable.HashMap.apply(HashMap.scala:64) at org.apache.spark.sql.columnar.compression.DictionaryEncoding$Encoder.compress(compressionSchemes.scala:258) at org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.build(CompressibleColumnBuilder.scala:110) at org.apache.spark.sql.columnar.NativeColumnBuilder.build(ColumnBuilder.scala:87) at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:153) at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:153) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108) at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:153) at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:120) at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:278) at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171) at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78) at org.apache.spark.rdd.RDD.iterator(RDD.scala:262) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:46) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:722) was: 1. create table cache_test(id int, name string) stored as textfile ; 2. 
load data local inpath '${SparkSource}/sql/hive/src/test/resources/data/files/kv1.txt' into table cache_test; 3. cache table test as select * from cache_test distribute by id; 15/08/16 17:14:47 ERROR Executor: Managed memory leak detected; size = 67108864 bytes, TID = 434 15/08/16 17:14:47 ERROR Executor: Exception in task 23.0 in stage 9.0 (TID 434) java.util.NoSuchElementException: key not found: val_54 at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.mutable.HashMap.apply(HashMap.scala:64) at org.apache.spark.sql.columnar.compression.DictionaryEncoding$Encoder.compress(compressionSchemes.scala:258) at org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.build(CompressibleColumnBuilder.scala:110) at org.apache.spark.sql.columnar.NativeColumnBuilder.build(ColumnBuilder.scala:87) at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:153) at
[jira] [Updated] (SPARK-10030) Managed memory leak detected when cache table
[ https://issues.apache.org/jira/browse/SPARK-10030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wangwei updated SPARK-10030: Description: I tested the latest spark-1.5.0 in local, standalone, and yarn modes, followed the steps below, and the issues below occurred. 1. create table cache_test(id int, name string) stored as textfile ; 2. load data local inpath 'SparkSource/sql/hive/src/test/resources/data/files/kv1.txt' into table cache_test; 3. cache table test as select * from cache_test distribute by id; 15/08/16 17:14:47 ERROR Executor: Managed memory leak detected; size = 67108864 bytes, TID = 434 15/08/16 17:14:47 ERROR Executor: Exception in task 23.0 in stage 9.0 (TID 434) java.util.NoSuchElementException: key not found: val_54 at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.mutable.HashMap.apply(HashMap.scala:64) at org.apache.spark.sql.columnar.compression.DictionaryEncoding$Encoder.compress(compressionSchemes.scala:258) at org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.build(CompressibleColumnBuilder.scala:110) at org.apache.spark.sql.columnar.NativeColumnBuilder.build(ColumnBuilder.scala:87) at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:153) at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:153) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108) at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:153) at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:120) at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:278) at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171) at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78) at org.apache.spark.rdd.RDD.iterator(RDD.scala:262) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:46) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at 
java.lang.Thread.run(Thread.java:722) was: I tested the latest spark-1.5.0 in standalone mode, followed the steps below, and the issues below occurred. 1. create table cache_test(id int, name string) stored as textfile ; 2. load data local inpath 'SparkSource/sql/hive/src/test/resources/data/files/kv1.txt' into table cache_test; 3. cache table test as select * from cache_test distribute by id; 15/08/16 17:14:47 ERROR Executor: Managed memory leak detected; size = 67108864 bytes, TID = 434 15/08/16 17:14:47 ERROR Executor: Exception in task 23.0 in stage 9.0 (TID 434) java.util.NoSuchElementException: key not found: val_54 at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.mutable.HashMap.apply(HashMap.scala:64) at org.apache.spark.sql.columnar.compression.DictionaryEncoding$Encoder.compress(compressionSchemes.scala:258) at org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.build(CompressibleColumnBuilder.scala:110) at
[jira] [Updated] (SPARK-10032) Add Python example for mllib LDAModel user guide
[ https://issues.apache.org/jira/browse/SPARK-10032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-10032: Affects Version/s: (was: 1.5.0) Add Python example for mllib LDAModel user guide Key: SPARK-10032 URL: https://issues.apache.org/jira/browse/SPARK-10032 Project: Spark Issue Type: Sub-task Components: Documentation, PySpark Reporter: Yanbo Liang Priority: Minor Labels: 1.5.0 Add Python example for mllib LDAModel user guide -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10032) Add Python example for mllib LDAModel user guide
Yanbo Liang created SPARK-10032: --- Summary: Add Python example for mllib LDAModel user guide Key: SPARK-10032 URL: https://issues.apache.org/jira/browse/SPARK-10032 Project: Spark Issue Type: Sub-task Components: Documentation, PySpark Affects Versions: 1.5.0 Reporter: Yanbo Liang Priority: Minor Add Python example for mllib LDAModel user guide -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10032) Add Python example for mllib LDAModel user guide
[ https://issues.apache.org/jira/browse/SPARK-10032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-10032: Labels: 1.5.0 (was: ) Add Python example for mllib LDAModel user guide Key: SPARK-10032 URL: https://issues.apache.org/jira/browse/SPARK-10032 Project: Spark Issue Type: Sub-task Components: Documentation, PySpark Reporter: Yanbo Liang Priority: Minor Labels: 1.5.0 Add Python example for mllib LDAModel user guide -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
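For context on what the requested Python example would mirror, here is a minimal sketch of the existing Scala mllib LDA API (toy corpus with illustrative values; assumes a spark-shell {{sc}}):
{code}
import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.Vectors

// A toy corpus: each document is (documentId, termCountVector).
val corpus = sc.parallelize(Seq(
  (0L, Vectors.dense(1.0, 2.0, 0.0)),
  (1L, Vectors.dense(0.0, 3.0, 1.0))
))

// Fit an LDA model with 2 topics and inspect the inferred topic-term matrix.
val ldaModel = new LDA().setK(2).run(corpus)
val topics = ldaModel.topicsMatrix
{code}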
[jira] [Assigned] (SPARK-10005) Parquet reader doesn't handle schema merging properly for nested structs
[ https://issues.apache.org/jira/browse/SPARK-10005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10005: Assignee: Apache Spark (was: Cheng Lian) Parquet reader doesn't handle schema merging properly for nested structs Key: SPARK-10005 URL: https://issues.apache.org/jira/browse/SPARK-10005 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Cheng Lian Assignee: Apache Spark Priority: Blocker Spark shell snippet to reproduce this issue:
{code}
import sqlContext.implicits._

val path = "file:///tmp/foo"
(0 until 3).map(i => Tuple1((s"a_$i", s"b_$i"))).toDF().coalesce(1).write.mode("overwrite").parquet(path)
(0 until 3).map(i => Tuple1((s"a_$i", s"b_$i", s"c_$i"))).toDF().coalesce(1).write.mode("append").parquet(path)
sqlContext.read.option("schemaMerging", "true").parquet(path).show()
{code}
Exception: {noformat} org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 39.0 failed 1 times, most recent failure: Lost task 0.0 in stage 39.0 (TID 122, localhost): org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file file:/tmp/foo/part-r-0-ba9dc7cf-3210-4006-9cf7-02c3d57483cd.gz.parquet at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228) at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201) at org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:168) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215) at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1826) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1826) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.ArrayIndexOutOfBoundsException: 2 at org.apache.spark.sql.execution.datasources.parquet.CatalystRowConverter.getConverter(CatalystRowConverter.scala:136) at org.apache.parquet.io.RecordReaderImplementation.init(RecordReaderImplementation.java:269) at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:134) at 
org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:99) at org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:154) at org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:99) at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:137) at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:208) ... 25 more {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
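One note on the snippet as filed: the Spark 1.5 SQL guide documents the Parquet schema-merging switch under the option key {{mergeSchema}} rather than {{schemaMerging}}, so a reader reproducing this may want the documented spelling (a one-line sketch, reusing {{path}} from the snippet above):
{code}
// Documented option key for Parquet schema merging in the Spark SQL guide.
sqlContext.read.option("mergeSchema", "true").parquet(path).show()
{code}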
[jira] [Updated] (SPARK-10005) Parquet reader doesn't handle schema merging properly for nested structs
[ https://issues.apache.org/jira/browse/SPARK-10005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-10005: --- Description: Spark shell snippet to reproduce this issue (note that both {{DataFrame}}s written below contain a single struct column with multiple fields):
{code}
import sqlContext.implicits._

val path = "file:///tmp/foo"
(0 until 3).map(i => Tuple1((s"a_$i", s"b_$i"))).toDF().coalesce(1).write.mode("overwrite").parquet(path)
(0 until 3).map(i => Tuple1((s"a_$i", s"b_$i", s"c_$i"))).toDF().coalesce(1).write.mode("append").parquet(path)
sqlContext.read.option("schemaMerging", "true").parquet(path).show()
{code}
Exception: {noformat} org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 39.0 failed 1 times, most recent failure: Lost task 0.0 in stage 39.0 (TID 122, localhost): org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file file:/tmp/foo/part-r-0-ba9dc7cf-3210-4006-9cf7-02c3d57483cd.gz.parquet at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228) at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201) at org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:168) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215) at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1826) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1826) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.ArrayIndexOutOfBoundsException: 2 at org.apache.spark.sql.execution.datasources.parquet.CatalystRowConverter.getConverter(CatalystRowConverter.scala:136) at org.apache.parquet.io.RecordReaderImplementation.init(RecordReaderImplementation.java:269) at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:134) at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:99) at org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:154) at org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:99) at 
org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:137) at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:208) ... 25 more {noformat} was: Spark shell snippet to reproduce this issue:
{code}
import sqlContext.implicits._

val path = "file:///tmp/foo"
(0 until 3).map(i => Tuple1((s"a_$i", s"b_$i"))).toDF().coalesce(1).write.mode("overwrite").parquet(path)
(0 until 3).map(i => Tuple1((s"a_$i", s"b_$i", s"c_$i"))).toDF().coalesce(1).write.mode("append").parquet(path)
sqlContext.read.option("schemaMerging", "true").parquet(path).show()
{code}
Exception: {noformat} org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 39.0 failed 1 times, most recent failure: Lost task 0.0 in stage 39.0 (TID 122, localhost): org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file file:/tmp/foo/part-r-0-ba9dc7cf-3210-4006-9cf7-02c3d57483cd.gz.parquet at
[jira] [Commented] (SPARK-10005) Parquet reader doesn't handle schema merging properly for nested structs
[ https://issues.apache.org/jira/browse/SPARK-10005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14698631#comment-14698631 ] Apache Spark commented on SPARK-10005: -- User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/8228 Parquet reader doesn't handle schema merging properly for nested structs Key: SPARK-10005 URL: https://issues.apache.org/jira/browse/SPARK-10005 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Blocker Spark shell snippet to reproduce this issue:
{code}
import sqlContext.implicits._

val path = "file:///tmp/foo"
(0 until 3).map(i => Tuple1((s"a_$i", s"b_$i"))).toDF().coalesce(1).write.mode("overwrite").parquet(path)
(0 until 3).map(i => Tuple1((s"a_$i", s"b_$i", s"c_$i"))).toDF().coalesce(1).write.mode("append").parquet(path)
sqlContext.read.option("schemaMerging", "true").parquet(path).show()
{code}
Exception: {noformat} org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 39.0 failed 1 times, most recent failure: Lost task 0.0 in stage 39.0 (TID 122, localhost): org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file file:/tmp/foo/part-r-0-ba9dc7cf-3210-4006-9cf7-02c3d57483cd.gz.parquet at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228) at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201) at org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:168) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215) at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1826) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1826) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.ArrayIndexOutOfBoundsException: 2 at org.apache.spark.sql.execution.datasources.parquet.CatalystRowConverter.getConverter(CatalystRowConverter.scala:136) at org.apache.parquet.io.RecordReaderImplementation.init(RecordReaderImplementation.java:269) at 
org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:134) at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:99) at org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:154) at org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:99) at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:137) at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:208) ... 25 more {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10005) Parquet reader doesn't handle schema merging properly for nested structs
[ https://issues.apache.org/jira/browse/SPARK-10005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10005: Assignee: Cheng Lian (was: Apache Spark) Parquet reader doesn't handle schema merging properly for nested structs Key: SPARK-10005 URL: https://issues.apache.org/jira/browse/SPARK-10005 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Blocker Spark shell snippet to reproduce this issue:
{code}
import sqlContext.implicits._

val path = "file:///tmp/foo"
(0 until 3).map(i => Tuple1((s"a_$i", s"b_$i"))).toDF().coalesce(1).write.mode("overwrite").parquet(path)
(0 until 3).map(i => Tuple1((s"a_$i", s"b_$i", s"c_$i"))).toDF().coalesce(1).write.mode("append").parquet(path)
sqlContext.read.option("schemaMerging", "true").parquet(path).show()
{code}
Exception: {noformat} org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 39.0 failed 1 times, most recent failure: Lost task 0.0 in stage 39.0 (TID 122, localhost): org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file file:/tmp/foo/part-r-0-ba9dc7cf-3210-4006-9cf7-02c3d57483cd.gz.parquet at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228) at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201) at org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:168) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215) at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1826) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1826) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.ArrayIndexOutOfBoundsException: 2 at org.apache.spark.sql.execution.datasources.parquet.CatalystRowConverter.getConverter(CatalystRowConverter.scala:136) at org.apache.parquet.io.RecordReaderImplementation.init(RecordReaderImplementation.java:269) at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:134) at 
org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:99) at org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:154) at org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:99) at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:137) at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:208) ... 25 more {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10030) Managed memory leak detected when cache table
[ https://issues.apache.org/jira/browse/SPARK-10030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wangwei updated SPARK-10030: Description: I tested the latest spark-1.5.0 in local, standalone, and yarn modes, followed the steps below, and the errors below occurred. 1. create table cache_test(id int, name string) stored as textfile ; 2. load data local inpath 'SparkSource/sql/hive/src/test/resources/data/files/kv1.txt' into table cache_test; 3. cache table test as select * from cache_test distribute by id; configuration: spark.driver.memory 5g spark.executor.memory 28g spark.cores.max 21 15/08/16 17:14:47 ERROR Executor: Managed memory leak detected; size = 67108864 bytes, TID = 434 15/08/16 17:14:47 ERROR Executor: Exception in task 23.0 in stage 9.0 (TID 434) java.util.NoSuchElementException: key not found: val_54 at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.mutable.HashMap.apply(HashMap.scala:64) at org.apache.spark.sql.columnar.compression.DictionaryEncoding$Encoder.compress(compressionSchemes.scala:258) at org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.build(CompressibleColumnBuilder.scala:110) at org.apache.spark.sql.columnar.NativeColumnBuilder.build(ColumnBuilder.scala:87) at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:153) at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:153) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108) at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:153) at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:120) at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:278) at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171) at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78) at org.apache.spark.rdd.RDD.iterator(RDD.scala:262) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:46) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:722) was: I tested the latest spark-1.5.0 in local, standalone, and yarn modes, followed the steps below, and the errors below occurred. 1. create table cache_test(id int, name string) stored as textfile ; 2. load data local inpath 'SparkSource/sql/hive/src/test/resources/data/files/kv1.txt' into table cache_test; 3. cache table test as select * from cache_test distribute by id; 15/08/16 17:14:47 ERROR Executor: Managed memory leak detected; size = 67108864 bytes, TID = 434 15/08/16 17:14:47 ERROR Executor: Exception in task 23.0 in stage 9.0 (TID 434) java.util.NoSuchElementException: key not found: val_54 at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.mutable.HashMap.apply(HashMap.scala:64) at org.apache.spark.sql.columnar.compression.DictionaryEncoding$Encoder.compress(compressionSchemes.scala:258) at
[jira] [Commented] (SPARK-8918) Add @since tags to mllib.clustering
[ https://issues.apache.org/jira/browse/SPARK-8918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14698635#comment-14698635 ] Apache Spark commented on SPARK-8918: - User 'XiaoqingWang' has created a pull request for this issue: https://github.com/apache/spark/pull/8229 Add @since tags to mllib.clustering --- Key: SPARK-8918 URL: https://issues.apache.org/jira/browse/SPARK-8918 Project: Spark Issue Type: Sub-task Components: Documentation, MLlib Reporter: Xiangrui Meng Priority: Minor Labels: starter Original Estimate: 2h Remaining Estimate: 2h -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
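As a sketch of the kind of change this starter task asks for, assuming the scaladoc {{@since}} tag convention used in mllib at the time (the class and method below are hypothetical, for illustration only):
{code}
import org.apache.spark.mllib.linalg.Vector

// Hypothetical example showing the scaladoc @since tag this task would add
// to public mllib.clustering APIs; not an actual Spark class.
abstract class ExampleClusteringModel {
  /**
   * Predicts the cluster index for a single input point.
   *
   * @since 1.4.0
   */
  def predict(point: Vector): Int
}
{code}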
[jira] [Assigned] (SPARK-7707) User guide and example code for KernelDensity
[ https://issues.apache.org/jira/browse/SPARK-7707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza reassigned SPARK-7707: - Assignee: Sandy Ryza User guide and example code for KernelDensity - Key: SPARK-7707 URL: https://issues.apache.org/jira/browse/SPARK-7707 Project: Spark Issue Type: Documentation Components: Documentation, MLlib Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Sandy Ryza -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7707) User guide and example code for KernelDensity
[ https://issues.apache.org/jira/browse/SPARK-7707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7707: --- Assignee: Apache Spark User guide and example code for KernelDensity - Key: SPARK-7707 URL: https://issues.apache.org/jira/browse/SPARK-7707 Project: Spark Issue Type: Documentation Components: Documentation, MLlib Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Apache Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7707) User guide and example code for KernelDensity
[ https://issues.apache.org/jira/browse/SPARK-7707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698647#comment-14698647 ] Apache Spark commented on SPARK-7707: - User 'sryza' has created a pull request for this issue: https://github.com/apache/spark/pull/8230 User guide and example code for KernelDensity - Key: SPARK-7707 URL: https://issues.apache.org/jira/browse/SPARK-7707 Project: Spark Issue Type: Documentation Components: Documentation, MLlib Affects Versions: 1.4.0 Reporter: Xiangrui Meng -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10030) Managed memory leak detected when caching a table
[ https://issues.apache.org/jira/browse/SPARK-10030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wangwei updated SPARK-10030: Description: 1. create table cache_test(id int, name string) stored as textfile ; 2. load data local inpath '${spark}/sql/hive/src/test/resources/data/files/kv1.txt' into table cache_test; 3. cache table test as select * from cache_test distribute by id; 15/08/16 17:14:47 ERROR Executor: Managed memory leak detected; size = 67108864 bytes, TID = 434 15/08/16 17:14:47 ERROR Executor: Exception in task 23.0 in stage 9.0 (TID 434) java.util.NoSuchElementException: key not found: val_54 at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.mutable.HashMap.apply(HashMap.scala:64) at org.apache.spark.sql.columnar.compression.DictionaryEncoding$Encoder.compress(compressionSchemes.scala:258) at org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.build(CompressibleColumnBuilder.scala:110) at org.apache.spark.sql.columnar.NativeColumnBuilder.build(ColumnBuilder.scala:87) at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:153) at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:153) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108) at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:153) at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:120) at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:278) at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171) at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78) at org.apache.spark.rdd.RDD.iterator(RDD.scala:262) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:46) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:722) was: 1. create table cache_test(id int, name string) stored as textfile ; 2. 
load data local inpath 'spark/sql/hive/src/test/resources/data/files/kv1.txt' into table cache_test; 3. cache table test as select * from cache_test distribute by id; 15/08/16 17:14:47 ERROR Executor: Managed memory leak detected; size = 67108864 bytes, TID = 434 15/08/16 17:14:47 ERROR Executor: Exception in task 23.0 in stage 9.0 (TID 434) java.util.NoSuchElementException: key not found: val_54 at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.mutable.HashMap.apply(HashMap.scala:64) at org.apache.spark.sql.columnar.compression.DictionaryEncoding$Encoder.compress(compressionSchemes.scala:258) at org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.build(CompressibleColumnBuilder.scala:110) at org.apache.spark.sql.columnar.NativeColumnBuilder.build(ColumnBuilder.scala:87) at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:153) at
[jira] [Updated] (SPARK-10030) Managed memory leak detected when caching a table
[ https://issues.apache.org/jira/browse/SPARK-10030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wangwei updated SPARK-10030: Description: 1. create table cache_test(id int, name string) stored as textfile ; 2. load data local inpath 'SparkSource/sql/hive/src/test/resources/data/files/kv1.txt' into table cache_test; 3. cache table test as select * from cache_test distribute by id; 15/08/16 17:14:47 ERROR Executor: Managed memory leak detected; size = 67108864 bytes, TID = 434 15/08/16 17:14:47 ERROR Executor: Exception in task 23.0 in stage 9.0 (TID 434) java.util.NoSuchElementException: key not found: val_54 at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.mutable.HashMap.apply(HashMap.scala:64) at org.apache.spark.sql.columnar.compression.DictionaryEncoding$Encoder.compress(compressionSchemes.scala:258) at org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.build(CompressibleColumnBuilder.scala:110) at org.apache.spark.sql.columnar.NativeColumnBuilder.build(ColumnBuilder.scala:87) at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:153) at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:153) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108) at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:153) at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:120) at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:278) at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171) at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78) at org.apache.spark.rdd.RDD.iterator(RDD.scala:262) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:46) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:722) was: 1. create table cache_test(id int, name string) stored as textfile ; 2. 
load data local inpath '${spark}/sql/hive/src/test/resources/data/files/kv1.txt' into table cache_test; 3. cache table test as select * from cache_test distribute by id; 15/08/16 17:14:47 ERROR Executor: Managed memory leak detected; size = 67108864 bytes, TID = 434 15/08/16 17:14:47 ERROR Executor: Exception in task 23.0 in stage 9.0 (TID 434) java.util.NoSuchElementException: key not found: val_54 at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.mutable.HashMap.apply(HashMap.scala:64) at org.apache.spark.sql.columnar.compression.DictionaryEncoding$Encoder.compress(compressionSchemes.scala:258) at org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.build(CompressibleColumnBuilder.scala:110) at org.apache.spark.sql.columnar.NativeColumnBuilder.build(ColumnBuilder.scala:87) at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:153) at
[jira] [Commented] (SPARK-10031) Join two UnsafeRows in SortMergeJoin if possible
[ https://issues.apache.org/jira/browse/SPARK-10031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698609#comment-14698609 ] Apache Spark commented on SPARK-10031: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/8226 Join two UnsafeRows in SortMergeJoin if possible Key: SPARK-10031 URL: https://issues.apache.org/jira/browse/SPARK-10031 Project: Spark Issue Type: Improvement Components: SQL Reporter: Liang-Chi Hsieh Currently in SortMergeJoin, when the rows from the left and right plans are both UnsafeRow, we still use JoinedRow to join them and do an extra UnsafeProjection later. We can just use GenerateUnsafeRowJoiner to join two UnsafeRows in SortMergeJoin if possible. Besides, GenerateUnsafeRowJoiner could have a withRight function that updates only row2 while reusing the same row1. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10031) Join two UnsafeRows in SortMergeJoin if possible
[ https://issues.apache.org/jira/browse/SPARK-10031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10031: Assignee: (was: Apache Spark) Join two UnsafeRows in SortMergeJoin if possible Key: SPARK-10031 URL: https://issues.apache.org/jira/browse/SPARK-10031 Project: Spark Issue Type: Improvement Components: SQL Reporter: Liang-Chi Hsieh Currently in SortMergeJoin, when the rows from the left and right plans are both UnsafeRow, we still use JoinedRow to join them and do an extra UnsafeProjection later. We can just use GenerateUnsafeRowJoiner to join two UnsafeRows in SortMergeJoin if possible. Besides, GenerateUnsafeRowJoiner could have a withRight function that updates only row2 while reusing the same row1. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10031) Join two UnsafeRows in SortMergeJoin if possible
[ https://issues.apache.org/jira/browse/SPARK-10031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10031: Assignee: Apache Spark Join two UnsafeRows in SortMergeJoin if possible Key: SPARK-10031 URL: https://issues.apache.org/jira/browse/SPARK-10031 Project: Spark Issue Type: Improvement Components: SQL Reporter: Liang-Chi Hsieh Assignee: Apache Spark Currently in SortMergeJoin, when the rows from the left and right plans are both UnsafeRow, we still use JoinedRow to join them and do an extra UnsafeProjection later. We can just use GenerateUnsafeRowJoiner to join two UnsafeRows in SortMergeJoin if possible. Besides, GenerateUnsafeRowJoiner could have a withRight function that updates only row2 while reusing the same row1. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10031) Join two UnsafeRows in SortMergeJoin if possible
Liang-Chi Hsieh created SPARK-10031: --- Summary: Join two UnsafeRows in SortMergeJoin if possible Key: SPARK-10031 URL: https://issues.apache.org/jira/browse/SPARK-10031 Project: Spark Issue Type: Improvement Components: SQL Reporter: Liang-Chi Hsieh Currently in SortMergeJoin, when the rows from the left and right plans are both UnsafeRow, we still use JoinedRow to join them and do an extra UnsafeProjection later. We can just use GenerateUnsafeRowJoiner to join two UnsafeRows in SortMergeJoin if possible. Besides, GenerateUnsafeRowJoiner could have a withRight function that updates only row2 while reusing the same row1. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
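For readers unfamiliar with the generated joiner, a minimal sketch of the existing entry point, assuming the Spark 1.5 {{GenerateUnsafeRowJoiner.create(schema1, schema2)}} API; the {{withRight}} variant proposed above does not exist yet and is not sketched here:

{code}
import org.apache.spark.sql.catalyst.expressions.UnsafeRow
import org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeRowJoiner
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Generate a joiner once per (left, right) schema pair ...
val leftSchema  = StructType(Seq(StructField("a", IntegerType)))
val rightSchema = StructType(Seq(StructField("b", IntegerType)))
val joiner = GenerateUnsafeRowJoiner.create(leftSchema, rightSchema)

// ... then concatenate two UnsafeRows directly, instead of wrapping them in a
// JoinedRow and running an extra UnsafeProjection afterwards.
def joinRows(left: UnsafeRow, right: UnsafeRow): UnsafeRow = joiner.join(left, right)
{code}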
[jira] [Updated] (SPARK-10029) Add Python examples for mllib IsotonicRegression user guide
[ https://issues.apache.org/jira/browse/SPARK-10029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-10029: Labels: 1.5.0 (was: ) Add Python examples for mllib IsotonicRegression user guide --- Key: SPARK-10029 URL: https://issues.apache.org/jira/browse/SPARK-10029 Project: Spark Issue Type: Sub-task Components: Documentation, PySpark Reporter: Yanbo Liang Priority: Minor Labels: 1.5.0 Add Python examples for mllib IsotonicRegression user guide -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10032) Add Python example for mllib LDAModel user guide
[ https://issues.apache.org/jira/browse/SPARK-10032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10032: Assignee: Apache Spark Add Python example for mllib LDAModel user guide Key: SPARK-10032 URL: https://issues.apache.org/jira/browse/SPARK-10032 Project: Spark Issue Type: Sub-task Components: Documentation, PySpark Reporter: Yanbo Liang Assignee: Apache Spark Priority: Minor Labels: 1.5.0 Add Python example for mllib LDAModel user guide -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10032) Add Python example for mllib LDAModel user guide
[ https://issues.apache.org/jira/browse/SPARK-10032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10032: Assignee: (was: Apache Spark) Add Python example for mllib LDAModel user guide Key: SPARK-10032 URL: https://issues.apache.org/jira/browse/SPARK-10032 Project: Spark Issue Type: Sub-task Components: Documentation, PySpark Reporter: Yanbo Liang Priority: Minor Labels: 1.5.0 Add Python example for mllib LDAModel user guide -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10032) Add Python example for mllib LDAModel user guide
[ https://issues.apache.org/jira/browse/SPARK-10032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698627#comment-14698627 ] Apache Spark commented on SPARK-10032: -- User 'yanboliang' has created a pull request for this issue: https://github.com/apache/spark/pull/8227 Add Python example for mllib LDAModel user guide Key: SPARK-10032 URL: https://issues.apache.org/jira/browse/SPARK-10032 Project: Spark Issue Type: Sub-task Components: Documentation, PySpark Reporter: Yanbo Liang Priority: Minor Labels: 1.5.0 Add Python example for mllib LDAModel user guide -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9973) Wrong initial size of in-memory columnar buffers
[ https://issues.apache.org/jira/browse/SPARK-9973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-9973. --- Resolution: Fixed Resolved by https://github.com/apache/spark/pull/8189 Wrong initial size of in-memory columnar buffers Key: SPARK-9973 URL: https://issues.apache.org/jira/browse/SPARK-9973 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: xukun Assignee: xukun Fix For: 1.5.0 Too much memory is allocated for in-memory columnar buffers. The {{initialSize}} argument in {{ColumnBuilder.initialize}} is the initial number of rows rather than bytes, but the value passed in from {{InMemoryColumnarTableScan}} is the latter: {code} // Class InMemoryColumnarTableScan val initialBufferSize = columnType.defaultSize * batchSize ColumnBuilder(attribute.dataType, initialBufferSize, attribute.name, useCompression) {code} Then it's converted to a byte size again by multiplying by {{columnType.defaultSize}}: {code} // Class BasicColumnBuilder buffer = ByteBuffer.allocate(4 + size * columnType.defaultSize) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9973) Wrong initial size of in-memory columnar buffers
[ https://issues.apache.org/jira/browse/SPARK-9973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-9973: -- Fix Version/s: 1.5.0 Wrong initial size of in-memory columnar buffers Key: SPARK-9973 URL: https://issues.apache.org/jira/browse/SPARK-9973 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: xukun Assignee: xukun Fix For: 1.5.0 Too much memory is allocated for in-memory columnar buffers. The {{initialSize}} argument in {{ColumnBuilder.initialize}} is the initial number of rows rather than bytes, but the value passed in from {{InMemoryColumnarTableScan}} is the latter: {code} // Class InMemoryColumnarTableScan val initialBufferSize = columnType.defaultSize * batchSize ColumnBuilder(attribute.dataType, initialBufferSize, attribute.name, useCompression) {code} Then it's converted to a byte size again by multiplying by {{columnType.defaultSize}}: {code} // Class BasicColumnBuilder buffer = ByteBuffer.allocate(4 + size * columnType.defaultSize) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
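To make the double multiplication concrete, a worked example under assumed values (an {{IntegerType}} column, so {{defaultSize = 4}}, and an illustrative {{batchSize}} of 10000; neither value is taken from the report):

{code}
val defaultSize = 4      // e.g. IntegerType (assumed for illustration)
val batchSize   = 10000  // rows per column batch (assumed for illustration)

// Intended: 4 header bytes plus one defaultSize per row.
val intendedBytes = 4 + batchSize * defaultSize                  // 40004
// Actual: a byte count was passed where a row count was expected, so
// defaultSize ends up applied twice.
val actualBytes   = 4 + (defaultSize * batchSize) * defaultSize  // 160004
{code}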
[jira] [Commented] (SPARK-10016) ML model broadcasts should be stored in private vars: spark.ml Word2Vec
[ https://issues.apache.org/jira/browse/SPARK-10016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698808#comment-14698808 ] Apache Spark commented on SPARK-10016: -- User 'vinodkc' has created a pull request for this issue: https://github.com/apache/spark/pull/8233 ML model broadcasts should be stored in private vars: spark.ml Word2Vec --- Key: SPARK-10016 URL: https://issues.apache.org/jira/browse/SPARK-10016 Project: Spark Issue Type: Sub-task Components: ML Reporter: Joseph K. Bradley Priority: Trivial Labels: starter See parent for details. Applies to: spark.ml.feature.Word2Vec -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10016) ML model broadcasts should be stored in private vars: spark.ml Word2Vec
[ https://issues.apache.org/jira/browse/SPARK-10016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10016: Assignee: (was: Apache Spark) ML model broadcasts should be stored in private vars: spark.ml Word2Vec --- Key: SPARK-10016 URL: https://issues.apache.org/jira/browse/SPARK-10016 Project: Spark Issue Type: Sub-task Components: ML Reporter: Joseph K. Bradley Priority: Trivial Labels: starter See parent for details. Applies to: spark.ml.feature.Word2Vec -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10016) ML model broadcasts should be stored in private vars: spark.ml Word2Vec
[ https://issues.apache.org/jira/browse/SPARK-10016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10016: Assignee: Apache Spark ML model broadcasts should be stored in private vars: spark.ml Word2Vec --- Key: SPARK-10016 URL: https://issues.apache.org/jira/browse/SPARK-10016 Project: Spark Issue Type: Sub-task Components: ML Reporter: Joseph K. Bradley Assignee: Apache Spark Priority: Trivial Labels: starter See parent for details. Applies to: spark.ml.feature.Word2Vec -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9985) DataFrameWriter jdbc method ignores options that have been set
[ https://issues.apache.org/jira/browse/SPARK-9985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698729#comment-14698729 ] Shixiong Zhu commented on SPARK-9985: - I just realized SPARK-8463 didn't fix all problems. You will still encounter a `No suitable driver found` error when using DataFrameReader.jdbc or DataFrameWriter.jdbc. I opened SPARK-10036 to track this issue since it has a different stack trace. DataFrameWriter jdbc method ignores options that have been set - Key: SPARK-9985 URL: https://issues.apache.org/jira/browse/SPARK-9985 Project: Spark Issue Type: Bug Reporter: Richard Garris Assignee: Shixiong Zhu I am working on an RDBMS to DataFrame conversion using Postgres and am hitting a wall where, every time I try to use the PostgreSQL JDBC driver, I get a java.sql.SQLException: No suitable driver found error. Here is the stack trace: {code} at java.sql.DriverManager.getConnection(DriverManager.java:596) at java.sql.DriverManager.getConnection(DriverManager.java:187) at org.apache.spark.sql.jdbc.package$JDBCWriteDetails$.savePartition(jdbc.scala:67) at org.apache.spark.sql.jdbc.package$JDBCWriteDetails$$anonfun$saveTable$1.apply(jdbc.scala:189) at org.apache.spark.sql.jdbc.package$JDBCWriteDetails$$anonfun$saveTable$1.apply(jdbc.scala:188) at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:878) at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:878) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1765) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1765) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63) at org.apache.spark.scheduler.Task.run(Task.scala:70) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code} It appears that DataFrameWriter and DataFrameReader ignore options that we set before invoking {{jdbc}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10034) Can't analyze Sort on Aggregate with aggregation expression named _aggOrdering
[ https://issues.apache.org/jira/browse/SPARK-10034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-10034: Description: {code=scala} val df = Seq(1 -> 2).toDF("i", "j") val query = df.groupBy('i) .agg(max('j).as("_aggOrdering")) .orderBy(sum('j)) checkAnswer(query, Row(1, 2)) {code} Can't analyze Sort on Aggregate with aggregation expression named _aggOrdering Key: SPARK-10034 URL: https://issues.apache.org/jira/browse/SPARK-10034 Project: Spark Issue Type: Bug Components: SQL Reporter: Wenchen Fan {code=scala} val df = Seq(1 -> 2).toDF("i", "j") val query = df.groupBy('i) .agg(max('j).as("_aggOrdering")) .orderBy(sum('j)) checkAnswer(query, Row(1, 2)) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10034) Can't analyze Sort on Aggregate with aggregation expression named _aggOrdering
[ https://issues.apache.org/jira/browse/SPARK-10034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-10034: Description: {code} val df = Seq(1 -> 2).toDF("i", "j") val query = df.groupBy('i) .agg(max('j).as("_aggOrdering")) .orderBy(sum('j)) checkAnswer(query, Row(1, 2)) {code} was: {code=scala} val df = Seq(1 -> 2).toDF("i", "j") val query = df.groupBy('i) .agg(max('j).as("_aggOrdering")) .orderBy(sum('j)) checkAnswer(query, Row(1, 2)) {code} Can't analyze Sort on Aggregate with aggregation expression named _aggOrdering Key: SPARK-10034 URL: https://issues.apache.org/jira/browse/SPARK-10034 Project: Spark Issue Type: Bug Components: SQL Reporter: Wenchen Fan {code} val df = Seq(1 -> 2).toDF("i", "j") val query = df.groupBy('i) .agg(max('j).as("_aggOrdering")) .orderBy(sum('j)) checkAnswer(query, Row(1, 2)) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
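A plausible reading of the failure (hedged, since the snippet alone does not show the analyzer internals): to sort by an aggregate that is not in the output, the analyzer appends {{sum('j)}} to the Aggregate under an internal alias, apparently named {{_aggOrdering}}, and a user column with the same name collides with it. If so, any other user alias sidesteps the problem until the fix lands:

{code}
// Hypothetical workaround: avoid the analyzer's internal alias name.
// Assumes the same imports and df as in the original test snippet.
val query = df.groupBy('i)
  .agg(max('j).as("maxJ"))  // any name other than "_aggOrdering"
  .orderBy(sum('j))
{code}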
[jira] [Commented] (SPARK-10036) DataFrameReader.jdbc and DataFrameWriter.jdbc don't load the JDBC driver class before creating JDBC connection
[ https://issues.apache.org/jira/browse/SPARK-10036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698727#comment-14698727 ] Apache Spark commented on SPARK-10036: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/8232 DataFrameReader.jdbc and DataFrameWriter.jdbc don't load the JDBC driver class before creating JDBC connection -- Key: SPARK-10036 URL: https://issues.apache.org/jira/browse/SPARK-10036 Project: Spark Issue Type: Bug Components: SQL Reporter: Shixiong Zhu Here is the reproduction code and the stack trace: {code} val url = "jdbc:postgresql://.../mytest" import java.util.Properties val prop = new Properties() prop.put("driver", "org.postgresql.Driver") prop.put("user", "...") prop.put("password", "...") val df = sqlContext.read.jdbc(url, "mytest", prop) {code} {code} java.sql.SQLException: No suitable driver found for jdbc:postgresql://.../mytest at java.sql.DriverManager.getConnection(DriverManager.java:689) at java.sql.DriverManager.getConnection(DriverManager.java:208) at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:121) at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:91) at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:200) at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:130) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10036) DataFrameReader.jdbc and DataFrameWriter.jdbc don't load the JDBC driver class before creating JDBC connection
[ https://issues.apache.org/jira/browse/SPARK-10036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10036: Assignee: Apache Spark DataFrameReader.jdbc and DataFrameWriter.jdbc don't load the JDBC driver class before creating JDBC connection -- Key: SPARK-10036 URL: https://issues.apache.org/jira/browse/SPARK-10036 Project: Spark Issue Type: Bug Components: SQL Reporter: Shixiong Zhu Assignee: Apache Spark Here is the reproduction code and the stack trace: {code} val url = "jdbc:postgresql://.../mytest" import java.util.Properties val prop = new Properties() prop.put("driver", "org.postgresql.Driver") prop.put("user", "...") prop.put("password", "...") val df = sqlContext.read.jdbc(url, "mytest", prop) {code} {code} java.sql.SQLException: No suitable driver found for jdbc:postgresql://.../mytest at java.sql.DriverManager.getConnection(DriverManager.java:689) at java.sql.DriverManager.getConnection(DriverManager.java:208) at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:121) at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:91) at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:200) at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:130) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10036) DataFrameReader.jdbc and DataFrameWriter.jdbc don't load the JDBC driver class before creating JDBC connection
[ https://issues.apache.org/jira/browse/SPARK-10036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10036: Assignee: (was: Apache Spark) DataFrameReader.jdbc and DataFrameWriter.jdbc don't load the JDBC driver class before creating JDBC connection -- Key: SPARK-10036 URL: https://issues.apache.org/jira/browse/SPARK-10036 Project: Spark Issue Type: Bug Components: SQL Reporter: Shixiong Zhu Here is the reproduction code and the stack trace: {code} val url = "jdbc:postgresql://.../mytest" import java.util.Properties val prop = new Properties() prop.put("driver", "org.postgresql.Driver") prop.put("user", "...") prop.put("password", "...") val df = sqlContext.read.jdbc(url, "mytest", prop) {code} {code} java.sql.SQLException: No suitable driver found for jdbc:postgresql://.../mytest at java.sql.DriverManager.getConnection(DriverManager.java:689) at java.sql.DriverManager.getConnection(DriverManager.java:208) at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:121) at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:91) at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:200) at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:130) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
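Until the fix lands, a common workaround (plain JDBC practice, not taken from the linked pull request) is to register the driver class explicitly on the driver JVM before the first {{jdbc}} call:

{code}
// Load the driver so java.sql.DriverManager can find it when
// DataFrameReader.jdbc opens a connection to resolve the table schema.
Class.forName("org.postgresql.Driver")
val df = sqlContext.read.jdbc(url, "mytest", prop)  // url/prop as defined above
{code}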
[jira] [Resolved] (SPARK-10005) Parquet reader doesn't handle schema merging properly for nested structs
[ https://issues.apache.org/jira/browse/SPARK-10005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-10005. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 8228 [https://github.com/apache/spark/pull/8228] Parquet reader doesn't handle schema merging properly for nested structs Key: SPARK-10005 URL: https://issues.apache.org/jira/browse/SPARK-10005 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Blocker Fix For: 1.5.0 Spark shell snippet to reproduce this issue (note that both {{DataFrame}}s written below contain a single struct column with multiple fields): {code} import sqlContext.implicits._ val path = "file:///tmp/foo" (0 until 3).map(i => Tuple1((s"a_$i", s"b_$i"))).toDF().coalesce(1).write.mode("overwrite").parquet(path) (0 until 3).map(i => Tuple1((s"a_$i", s"b_$i", s"c_$i"))).toDF().coalesce(1).write.mode("append").parquet(path) sqlContext.read.option("schemaMerging", "true").parquet(path).show() {code} Exception: {noformat} org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 39.0 failed 1 times, most recent failure: Lost task 0.0 in stage 39.0 (TID 122, localhost): org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file file:/tmp/foo/part-r-0-ba9dc7cf-3210-4006-9cf7-02c3d57483cd.gz.parquet at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228) at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201) at org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:168) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215) at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1826) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1826) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.ArrayIndexOutOfBoundsException: 2 at org.apache.spark.sql.execution.datasources.parquet.CatalystRowConverter.getConverter(CatalystRowConverter.scala:136) at 
org.apache.parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:269) at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:134) at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:99) at org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:154) at org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:99) at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:137) at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:208) ... 25 more {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For
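One detail worth flagging in the snippet above: the documented Parquet schema-merging option in Spark 1.5 is {{mergeSchema}} (globally {{spark.sql.parquet.mergeSchema}}), not {{schemaMerging}}; whether the misspelled key affects this particular repro is not established here. A read that explicitly requests merging would look like:

{code}
// Assumes the same `path` as in the snippet above.
val merged = sqlContext.read.option("mergeSchema", "true").parquet(path)
merged.printSchema()  // expected: the union of the two struct schemas
{code}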
[jira] [Created] (SPARK-10033) Sort on
Wenchen Fan created SPARK-10033: --- Summary: Sort on Key: SPARK-10033 URL: https://issues.apache.org/jira/browse/SPARK-10033 Project: Spark Issue Type: Bug Reporter: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10034) Can't analyze Sort on Aggregate with aggregation expression named _aggOrdering
[ https://issues.apache.org/jira/browse/SPARK-10034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10034: Assignee: (was: Apache Spark) Can't analyze Sort on Aggregate with aggregation expression named _aggOrdering Key: SPARK-10034 URL: https://issues.apache.org/jira/browse/SPARK-10034 Project: Spark Issue Type: Bug Components: SQL Reporter: Wenchen Fan {code} val df = Seq(1 -> 2).toDF("i", "j") val query = df.groupBy('i) .agg(max('j).as("_aggOrdering")) .orderBy(sum('j)) checkAnswer(query, Row(1, 2)) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10034) Can't analyze Sort on Aggregate with aggregation expression named _aggOrdering
[ https://issues.apache.org/jira/browse/SPARK-10034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698701#comment-14698701 ] Apache Spark commented on SPARK-10034: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/8231 Can't analyze Sort on Aggregate with aggregation expression named _aggOrdering Key: SPARK-10034 URL: https://issues.apache.org/jira/browse/SPARK-10034 Project: Spark Issue Type: Bug Components: SQL Reporter: Wenchen Fan {code} val df = Seq(1 -> 2).toDF("i", "j") val query = df.groupBy('i) .agg(max('j).as("_aggOrdering")) .orderBy(sum('j)) checkAnswer(query, Row(1, 2)) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10034) Can't analyze Sort on Aggregate with aggregation expression named _aggOrdering
[ https://issues.apache.org/jira/browse/SPARK-10034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10034: Assignee: Apache Spark Can't analyze Sort on Aggregate with aggregation expression named _aggOrdering Key: SPARK-10034 URL: https://issues.apache.org/jira/browse/SPARK-10034 Project: Spark Issue Type: Bug Components: SQL Reporter: Wenchen Fan Assignee: Apache Spark {code} val df = Seq(1 -> 2).toDF("i", "j") val query = df.groupBy('i) .agg(max('j).as("_aggOrdering")) .orderBy(sum('j)) checkAnswer(query, Row(1, 2)) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9985) DataFrameWriter jdbc method ignores options that have been set
[ https://issues.apache.org/jira/browse/SPARK-9985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698706#comment-14698706 ] Shixiong Zhu commented on SPARK-9985: - BTW, `sqlContext.load` will load the driver class. That's why `write` works after `load`. DataFrameWriter jdbc method ignores options that have been set - Key: SPARK-9985 URL: https://issues.apache.org/jira/browse/SPARK-9985 Project: Spark Issue Type: Bug Reporter: Richard Garris Assignee: Shixiong Zhu I am working on an RDBMS to DataFrame conversion using Postgres and am hitting a wall where, every time I try to use the PostgreSQL JDBC driver, I get a java.sql.SQLException: No suitable driver found error. Here is the stack trace: {code} at java.sql.DriverManager.getConnection(DriverManager.java:596) at java.sql.DriverManager.getConnection(DriverManager.java:187) at org.apache.spark.sql.jdbc.package$JDBCWriteDetails$.savePartition(jdbc.scala:67) at org.apache.spark.sql.jdbc.package$JDBCWriteDetails$$anonfun$saveTable$1.apply(jdbc.scala:189) at org.apache.spark.sql.jdbc.package$JDBCWriteDetails$$anonfun$saveTable$1.apply(jdbc.scala:188) at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:878) at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:878) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1765) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1765) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63) at org.apache.spark.scheduler.Task.run(Task.scala:70) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code} It appears that DataFrameWriter and DataFrameReader ignore options that we set before invoking {{jdbc}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
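A sketch of the ordering dependence described in that comment, assuming the (since deprecated) {{SQLContext.load(source, options)}} overload and the usual JDBC source option keys:

{code}
// load() resolves the "driver" option and registers the driver class, so a
// write issued afterwards finds a suitable driver; calling write first fails
// with "No suitable driver found".
val loaded = sqlContext.load("jdbc", Map(
  "url"     -> url,                       // as in the description above
  "dbtable" -> "mytest",
  "driver"  -> "org.postgresql.Driver"))
loaded.write.jdbc(url, "mytest_copy", prop)  // hypothetical target table
{code}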
[jira] [Created] (SPARK-10036) DataFrameReader.jdbc and DataFrameWriter.jdbc don't load the JDBC driver class before creating JDBC connection
Shixiong Zhu created SPARK-10036: Summary: DataFrameReader.jdbc and DataFrameWriter.jdbc don't load the JDBC driver class before creating JDBC connection Key: SPARK-10036 URL: https://issues.apache.org/jira/browse/SPARK-10036 Project: Spark Issue Type: Bug Components: SQL Reporter: Shixiong Zhu Here is the reproduction code and the stack trace: {code} val url = "jdbc:postgresql://.../mytest" import java.util.Properties val prop = new Properties() prop.put("driver", "org.postgresql.Driver") prop.put("user", "...") prop.put("password", "...") val df = sqlContext.read.jdbc(url, "mytest", prop) {code} {code} java.sql.SQLException: No suitable driver found for jdbc:postgresql://.../mytest at java.sql.DriverManager.getConnection(DriverManager.java:689) at java.sql.DriverManager.getConnection(DriverManager.java:208) at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:121) at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:91) at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:200) at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:130) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10034) Can't analyze Sort on Aggregate with aggregation expression named _aggOrdering
Wenchen Fan created SPARK-10034: --- Summary: Can't analyze Sort on Aggregate with aggregation expression named _aggOrdering Key: SPARK-10034 URL: https://issues.apache.org/jira/browse/SPARK-10034 Project: Spark Issue Type: Bug Components: SQL Reporter: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-10033) Sort on
[ https://issues.apache.org/jira/browse/SPARK-10033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan closed SPARK-10033. --- Resolution: Invalid Sort on Key: SPARK-10033 URL: https://issues.apache.org/jira/browse/SPARK-10033 Project: Spark Issue Type: Bug Reporter: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10035) Parquet filters do not process EqualNullSafe filter.
Hyukjin Kwon created SPARK-10035: Summary: Parquet filters do not process EqualNullSafe filter. Key: SPARK-10035 URL: https://issues.apache.org/jira/browse/SPARK-10035 Project: Spark Issue Type: Improvement Components: SQL Reporter: Hyukjin Kwon Priority: Minor This is a follow-up issue to SPARK-9814. Data sources (after {{selectFilters()}} in {{org.apache.spark.sql.execution.datasources.DataSourceStrategy}}) pass {{EqualNullSafe}} to {{ParquetRelation}}, but {{ParquetFilters}} for {{ParquetRelation}} does not handle it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
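For context, null-safe equality is SQL's {{<=>}}: unlike {{=}}, it evaluates to true (rather than NULL) when both operands are NULL, which is why it needs its own push-down case. A quick illustration of the semantics, independent of the Parquet push-down itself:

{code}
import org.apache.spark.sql.functions.lit

// A typed null column for the comparison.
val n = lit(null).cast("int")
// "eq" evaluates to null; "eqNullSafe" evaluates to true.
sqlContext.range(1).select((n === n).as("eq"), (n <=> n).as("eqNullSafe")).show()
{code}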
[jira] [Resolved] (SPARK-9985) DataFrameWriter jdbc method ignores options that have been set
[ https://issues.apache.org/jira/browse/SPARK-9985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-9985. - Resolution: Fixed Target Version/s: (was: 1.5.0) DataFrameWriter jdbc method ignores options that have been set - Key: SPARK-9985 URL: https://issues.apache.org/jira/browse/SPARK-9985 Project: Spark Issue Type: Bug Reporter: Richard Garris Assignee: Shixiong Zhu I am working on an RDBMS to DataFrame conversion using Postgres and am hitting a wall where, every time I try to use the PostgreSQL JDBC driver, I get a java.sql.SQLException: No suitable driver found error. Here is the stack trace: {code} at java.sql.DriverManager.getConnection(DriverManager.java:596) at java.sql.DriverManager.getConnection(DriverManager.java:187) at org.apache.spark.sql.jdbc.package$JDBCWriteDetails$.savePartition(jdbc.scala:67) at org.apache.spark.sql.jdbc.package$JDBCWriteDetails$$anonfun$saveTable$1.apply(jdbc.scala:189) at org.apache.spark.sql.jdbc.package$JDBCWriteDetails$$anonfun$saveTable$1.apply(jdbc.scala:188) at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:878) at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:878) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1765) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1765) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63) at org.apache.spark.scheduler.Task.run(Task.scala:70) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code} It appears that DataFrameWriter and DataFrameReader ignore options that we set before invoking {{jdbc}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9985) DataFrameWriter jdbc method ignores options that have been set
[ https://issues.apache.org/jira/browse/SPARK-9985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698702#comment-14698702 ] Shixiong Zhu commented on SPARK-9985: - [~rlgarris_databricks] I think this has been fixed in 1.4.1 by SPARK-8463. DataFrameWriter jdbc method ignores options that have been set - Key: SPARK-9985 URL: https://issues.apache.org/jira/browse/SPARK-9985 Project: Spark Issue Type: Bug Reporter: Richard Garris Assignee: Shixiong Zhu I am working on an RDBMS to DataFrame conversion using Postgres and am hitting a wall where, every time I try to use the PostgreSQL JDBC driver, I get a java.sql.SQLException: No suitable driver found error. Here is the stack trace: {code} at java.sql.DriverManager.getConnection(DriverManager.java:596) at java.sql.DriverManager.getConnection(DriverManager.java:187) at org.apache.spark.sql.jdbc.package$JDBCWriteDetails$.savePartition(jdbc.scala:67) at org.apache.spark.sql.jdbc.package$JDBCWriteDetails$$anonfun$saveTable$1.apply(jdbc.scala:189) at org.apache.spark.sql.jdbc.package$JDBCWriteDetails$$anonfun$saveTable$1.apply(jdbc.scala:188) at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:878) at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:878) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1765) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1765) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63) at org.apache.spark.scheduler.Task.run(Task.scala:70) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code} It appears that DataFrameWriter and DataFrameReader ignore options that we set before invoking {{jdbc}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9760) SparkSubmit doesn't work with --packages when --repositories is not specified
[ https://issues.apache.org/jira/browse/SPARK-9760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman reassigned SPARK-9760: Assignee: Shivaram Venkataraman SparkSubmit doesn't work with --packages when --repositories is not specified -- Key: SPARK-9760 URL: https://issues.apache.org/jira/browse/SPARK-9760 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.5.0 Reporter: Shivaram Venkataraman Assignee: Shivaram Venkataraman Priority: Blocker Fix For: 1.5.0 Running `./bin/sparkR --packages com.databricks:spark-csv_2.10:1.2.0` gives {code} Exception in thread main java.lang.NullPointerException at org.apache.spark.deploy.SparkSubmitUtils$.createRepoResolvers(SparkSubmit.scala:812) at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:962) at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:286) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:153) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9760) SparkSubmit doesn't work with --packages when --repositories is not specified
[ https://issues.apache.org/jira/browse/SPARK-9760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-9760. -- Resolution: Fixed Fix Version/s: 1.5.0 SparkSubmit doesn't work with --packages when --repositories is not specified -- Key: SPARK-9760 URL: https://issues.apache.org/jira/browse/SPARK-9760 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.5.0 Reporter: Shivaram Venkataraman Assignee: Shivaram Venkataraman Priority: Blocker Fix For: 1.5.0 Running `./bin/sparkR --packages com.databricks:spark-csv_2.10:1.2.0` gives {code} Exception in thread main java.lang.NullPointerException at org.apache.spark.deploy.SparkSubmitUtils$.createRepoResolvers(SparkSubmit.scala:812) at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:962) at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:286) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:153) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7837) NPE when saving as parquet in speculative tasks
[ https://issues.apache.org/jira/browse/SPARK-7837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698767#comment-14698767 ] Cheng Lian commented on SPARK-7837: --- Just a note to people who want to reproduce this issue: # You need to start a Spark cluster with at least two workers running on two distinct nodes. Speculation isn't enabled when running in local mode or on a single-node cluster. If you only have a single machine, you'll probably have to resort to VMs. # Don't forget to set {{spark.speculation}} to {{true}} (it's {{false}} by default) NPE when saving as parquet in speculative tasks - Key: SPARK-7837 URL: https://issues.apache.org/jira/browse/SPARK-7837 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Yin Huai Assignee: Cheng Lian Priority: Critical The query is like {{df.orderBy(...).saveAsTable(...)}}. When there are no partitioning columns and there is a skewed key, I found the following exception in speculative tasks. After these failures, it seems we could not call {{SparkHadoopMapRedUtil.commitTask}} correctly. {code} java.lang.NullPointerException at parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:146) at parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:112) at parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:73) at org.apache.spark.sql.parquet.ParquetOutputWriter.close(newParquet.scala:115) at org.apache.spark.sql.sources.DefaultWriterContainer.abortTask(commands.scala:385) at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.org$apache$spark$sql$sources$InsertIntoHadoopFsRelation$$writeRows$1(commands.scala:150) at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:122) at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:122) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63) at org.apache.spark.scheduler.Task.run(Task.scala:70) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
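The two prerequisites above, expressed as a configuration sketch ({{spark.speculation}} is the real switch; the quantile/multiplier values shown are just the documented defaults, not something the comment asks for):

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Speculation is off by default, and per the comment it also needs a cluster
// with at least two workers on distinct nodes (local mode never speculates).
val conf = new SparkConf()
  .setAppName("spark-7837-repro")
  .set("spark.speculation", "true")
  .set("spark.speculation.quantile", "0.75")   // fraction of tasks finished before speculating
  .set("spark.speculation.multiplier", "1.5")  // how much slower than the median counts as slow
val sc = new SparkContext(conf)
{code}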