[jira] [Updated] (SPARK-10022) Scala-Python method/parameter inconsistency check for ML during 1.5 QA

2015-08-16 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-10022:

Summary: Scala-Python method/parameter inconsistency check for ML during 
1.5 QA  (was: Scala-Python method/parameter inconsistency check for ML & MLlib 
during 1.5 QA)

 Scala-Python method/parameter inconsistency check for ML during 1.5 QA
 --

 Key: SPARK-10022
 URL: https://issues.apache.org/jira/browse/SPARK-10022
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib, PySpark
Reporter: Yanbo Liang

 The missing classes for PySpark were listed at SPARK-9663.
Here we check and list the missing method/parameter for ML & MLlib of PySpark.






[jira] [Updated] (SPARK-10024) Implement RandomForestParams and TreeEnsembleParams for Python API

2015-08-16 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-10024:

Description: Implement RandomForestParams, GBTParams and 
TreeEnsembleParams for Python API, and make corresponding parameter in place. 
 (was: Implement RandomForestParams and TreeEnsembleParams for Python API, 
and make corresponding parameter in place.)

 Implement RandomForestParams and TreeEnsembleParams for Python API
 --

 Key: SPARK-10024
 URL: https://issues.apache.org/jira/browse/SPARK-10024
 Project: Spark
  Issue Type: Sub-task
  Components: ML, MLlib, PySpark
Reporter: Yanbo Liang

 Implement RandomForestParams, GBTParams and TreeEnsembleParams for 
 Python API, and make corresponding parameter in place.






[jira] [Updated] (SPARK-10024) Python API Tree related params clear up

2015-08-16 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-10024:

Summary: Python API Tree related params clear up  (was: Implement 
RandomForestParams and TreeEnsembleParams for Python API)

 Python API Tree related params clear up
 ---

 Key: SPARK-10024
 URL: https://issues.apache.org/jira/browse/SPARK-10024
 Project: Spark
  Issue Type: Sub-task
  Components: ML, MLlib, PySpark
Reporter: Yanbo Liang

 Implement RandomForestParams, GBTParams and TreeEnsembleParams for 
 Python API, and make corresponding parameter in place.






[jira] [Updated] (SPARK-10024) Python API Tree related params clear up

2015-08-16 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-10024:

Description: Implement RandomForestParams, GBTParams and 
TreeEnsembleParams for Python API, and make corresponding parameters in 
place.  (was: Implement RandomForestParams, GBTParams and 
TreeEnsembleParams for Python API, and make corresponding parameter in place.)

 Python API Tree related params clear up
 ---

 Key: SPARK-10024
 URL: https://issues.apache.org/jira/browse/SPARK-10024
 Project: Spark
  Issue Type: Sub-task
  Components: ML, MLlib, PySpark
Reporter: Yanbo Liang

 Implement RandomForestParams, GBTParams and TreeEnsembleParams for 
 Python API, and make corresponding parameters in place.






[jira] [Updated] (SPARK-9663) ML Python API coverage issues found during 1.5 QA

2015-08-16 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-9663:
---
Description: 
This umbrella is for a list of Python API coverage issues which we should fix 
for the 1.6 release cycle.  This list is to be generated from issues found in 
[SPARK-9662] and from remaining issues from 1.4: [SPARK-7536].

Here we check and compare the Python and Scala API of MLlib/ML,
add missing classes/methods/parameters for PySpark. 
* Missing classes for PySpark(ML):
** attribute
** feature
*** CountVectorizerModel SPARK-9769
*** DCT SPARK-8472
*** ElementwiseProduct SPARK-9768
*** MinMaxScaler SPARK-8530
*** StopWordsRemover SPARK-9679
*** VectorSlicer SPARK-9772
*** IndexToString SPARK-10021
** classification
*** OneVsRest SPARK-7861
*** MultilayerPerceptronClassifier SPARK-9773
** regression
*** IsotonicRegression SPARK-9774
* Missing User Guide documents for PySpark SPARK-8757
* Scala-Python method/parameter inconsistency check for ML & MLlib SPARK-10022

  was:
This umbrella is for a list of Python API coverage issues which we should fix 
for the 1.6 release cycle.  This list is to be generated from issues found in 
[SPARK-9662] and from remaining issues from 1.4: [SPARK-7536].

Here we check and compare the Python and Scala API of MLlib/ML,
add missing classes/methods/parameters for PySpark. 
* Missing classes for PySpark(ML):
** feature
*** CountVectorizerModel SPARK-9769
*** DCT SPARK-8472
*** ElementwiseProduct SPARK-9768
*** MinMaxScaler SPARK-8530
*** StopWordsRemover SPARK-9679
*** VectorSlicer SPARK-9772
*** IndexToString SPARK-10021
** classification
*** OneVsRest SPARK-7861
*** MultilayerPerceptronClassifier SPARK-9773
** regression
*** IsotonicRegression SPARK-9774
* Missing User Guide documents for PySpark SPARK-8757
* Scala-Python method/parameter inconsistency check for ML & MLlib SPARK-10022


 ML Python API coverage issues found during 1.5 QA
 -

 Key: SPARK-9663
 URL: https://issues.apache.org/jira/browse/SPARK-9663
 Project: Spark
  Issue Type: Umbrella
  Components: ML, MLlib, PySpark
Reporter: Joseph K. Bradley

 This umbrella is for a list of Python API coverage issues which we should fix 
 for the 1.6 release cycle.  This list is to be generated from issues found in 
 [SPARK-9662] and from remaining issues from 1.4: [SPARK-7536].
 Here we check and compare the Python and Scala API of MLlib/ML,
 add missing classes/methods/parameters for PySpark. 
 * Missing classes for PySpark(ML):
 ** attribute
 ** feature
 *** CountVectorizerModel SPARK-9769
 *** DCT SPARK-8472
 *** ElementwiseProduct SPARK-9768
 *** MinMaxScaler SPARK-8530
 *** StopWordsRemover SPARK-9679
 *** VectorSlicer SPARK-9772
 *** IndexToString SPARK-10021
 ** classification
 *** OneVsRest SPARK-7861
 *** MultilayerPerceptronClassifier SPARK-9773
 ** regression
 *** IsotonicRegression SPARK-9774
 * Missing User Guide documents for PySpark SPARK-8757
 * Scala-Python method/parameter inconsistency check for ML & MLlib SPARK-10022






[jira] [Created] (SPARK-10028) Add Python API for PrefixSpan

2015-08-16 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-10028:
---

 Summary: Add Python API for PrefixSpan
 Key: SPARK-10028
 URL: https://issues.apache.org/jira/browse/SPARK-10028
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, PySpark
Reporter: Yanbo Liang


Add Python API for mllib.fpm.PrefixSpan
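
A sketch of what the Python usage could look like, assuming the wrapper mirrors the existing Scala mllib.fpm.PrefixSpan API (the parameter and method names below are assumptions, not a committed design):
{code}
from pyspark import SparkContext
from pyspark.mllib.fpm import PrefixSpan  # assumes the proposed Python wrapper

sc = SparkContext("local[1]", "prefixspan-example")

# Each sequence is a list of itemsets, the same input shape the Scala API takes.
sequences = sc.parallelize([
    [["a", "b"], ["c"]],
    [["a"], ["c", "b"], ["a", "b"]],
    [["a", "b"], ["e"]],
    [["f"]],
], 2)

model = PrefixSpan.train(sequences, minSupport=0.5, maxPatternLength=5)
for freq_seq in model.freqSequences().collect():
    print(freq_seq.sequence, freq_seq.freq)

sc.stop()
{code}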






[jira] [Commented] (SPARK-9431) TimeIntervalType for time intervals

2015-08-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14698589#comment-14698589
 ] 

Apache Spark commented on SPARK-9431:
-

User 'yjshen' has created a pull request for this issue:
https://github.com/apache/spark/pull/8224

 TimeIntervalType for time intervals
 ---

 Key: SPARK-9431
 URL: https://issues.apache.org/jira/browse/SPARK-9431
 Project: Spark
  Issue Type: Story
  Components: SQL
Reporter: Reynold Xin
Priority: Critical

 Related to the existing CalendarIntervalType, TimeIntervalType internally has 
 only one component: the number of microseconds, represented as a long.
 TimeIntervalType can be used in equality tests and ordering.
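
A minimal Python model of the proposed semantics, purely for illustration (the real type would live in Spark SQL's Scala type system): a single microseconds field with equality and ordering defined on it.
{code}
import functools

@functools.total_ordering
class TimeInterval(object):
    """Toy model: one microseconds component, comparable and orderable."""

    def __init__(self, microseconds):
        self.microseconds = int(microseconds)  # stored as a single long value

    def __eq__(self, other):
        return isinstance(other, TimeInterval) and self.microseconds == other.microseconds

    def __lt__(self, other):
        return self.microseconds < other.microseconds

print(TimeInterval(1000) == TimeInterval(1000))  # True
print(TimeInterval(500) < TimeInterval(1000))    # True
{code}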






[jira] [Assigned] (SPARK-9431) TimeIntervalType for time intervals

2015-08-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9431:
---

Assignee: (was: Apache Spark)

 TimeIntervalType for time intervals
 ---

 Key: SPARK-9431
 URL: https://issues.apache.org/jira/browse/SPARK-9431
 Project: Spark
  Issue Type: Story
  Components: SQL
Reporter: Reynold Xin
Priority: Critical

 Related to the existing CalendarIntervalType, TimeIntervalType internally has 
 only one component: the number of microseconds, represented as a long.
 TimeIntervalType can be used in equality tests and ordering.






[jira] [Updated] (SPARK-9973) Wrong initial size of in-memory columnar buffers

2015-08-16 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-9973:
--
Summary: Wrong initial size of in-memory columnar buffers  (was: wrong 
buffle size)

 Wrong initial size of in-memory columnar buffers
 

 Key: SPARK-9973
 URL: https://issues.apache.org/jira/browse/SPARK-9973
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: xukun
Assignee: xukun

 When caching a table in memory in Spark SQL, we allocate too much memory.
 InMemoryColumnarTableScan.class
   val initialBufferSize = columnType.defaultSize * batchSize
   ColumnBuilder(attribute.dataType, initialBufferSize, attribute.name, 
 useCompression)
 BasicColumnBuilder.class
   buffer = ByteBuffer.allocate(4 + size * columnType.defaultSize)
 So the total allocated size is (4 + size * columnType.defaultSize * 
 columnType.defaultSize); we change it to 4 + size * columnType.defaultSize.






[jira] [Updated] (SPARK-9973) wrong buffle size

2015-08-16 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-9973:
--
Assignee: xukun

 wrong buffle size
 -

 Key: SPARK-9973
 URL: https://issues.apache.org/jira/browse/SPARK-9973
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: xukun
Assignee: xukun

 When caching a table in memory in Spark SQL, we allocate too much memory.
 InMemoryColumnarTableScan.class
   val initialBufferSize = columnType.defaultSize * batchSize
   ColumnBuilder(attribute.dataType, initialBufferSize, attribute.name, 
 useCompression)
 BasicColumnBuilder.class
   buffer = ByteBuffer.allocate(4 + size * columnType.defaultSize)
 So the total allocated size is (4 + size * columnType.defaultSize * 
 columnType.defaultSize); we change it to 4 + size * columnType.defaultSize.






[jira] [Created] (SPARK-10024) Implement RandomForestParams and TreeEnsembleParams for Python API

2015-08-16 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-10024:
---

 Summary: Implement RandomForestParams and TreeEnsembleParams 
for Python API
 Key: SPARK-10024
 URL: https://issues.apache.org/jira/browse/SPARK-10024
 Project: Spark
  Issue Type: Sub-task
Reporter: Yanbo Liang


Implement RandomForestParams and TreeEnsembleParams for Python API, and 
make corresponding parameter in place.






[jira] [Updated] (SPARK-9663) ML Python API coverage issues found during 1.5 QA

2015-08-16 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-9663:
---
Description: 
This umbrella is for a list of Python API coverage issues which we should fix 
for the 1.6 release cycle.  This list is to be generated from issues found in 
[SPARK-9662] and from remaining issues from 1.4: [SPARK-7536].

Here we check and compare the Python and Scala API of MLlib/ML,
add missing classes/methods/parameters for PySpark. 
* Missing classes for PySpark(ML):
** attribute SPARK-10025
** feature
*** CountVectorizerModel SPARK-9769
*** DCT SPARK-8472
*** ElementwiseProduct SPARK-9768
*** MinMaxScaler SPARK-8530
*** StopWordsRemover SPARK-9679
*** VectorSlicer SPARK-9772
*** IndexToString SPARK-10021
** classification
*** OneVsRest SPARK-7861
*** MultilayerPerceptronClassifier SPARK-9773
** regression
*** IsotonicRegression SPARK-9774
* Missing User Guide documents for PySpark SPARK-8757
* Scala-Python method/parameter inconsistency check for ML & MLlib SPARK-10022

  was:
This umbrella is for a list of Python API coverage issues which we should fix 
for the 1.6 release cycle.  This list is to be generated from issues found in 
[SPARK-9662] and from remaining issues from 1.4: [SPARK-7536].

Here we check and compare the Python and Scala API of MLlib/ML,
add missing classes/methods/parameters for PySpark. 
* Missing classes for PySpark(ML):
** attribute
** feature
*** CountVectorizerModel SPARK-9769
*** DCT SPARK-8472
*** ElementwiseProduct SPARK-9768
*** MinMaxScaler SPARK-8530
*** StopWordsRemover SPARK-9679
*** VectorSlicer SPARK-9772
*** IndexToString SPARK-10021
** classification
*** OneVsRest SPARK-7861
*** MultilayerPerceptronClassifier SPARK-9773
** regression
*** IsotonicRegression SPARK-9774
* Missing User Guide documents for PySpark SPARK-8757
* Scala-Python method/parameter inconsistency check for ML & MLlib SPARK-10022


 ML Python API coverage issues found during 1.5 QA
 -

 Key: SPARK-9663
 URL: https://issues.apache.org/jira/browse/SPARK-9663
 Project: Spark
  Issue Type: Umbrella
  Components: ML, MLlib, PySpark
Reporter: Joseph K. Bradley

 This umbrella is for a list of Python API coverage issues which we should fix 
 for the 1.6 release cycle.  This list is to be generated from issues found in 
 [SPARK-9662] and from remaining issues from 1.4: [SPARK-7536].
 Here we check and compare the Python and Scala API of MLlib/ML,
 add missing classes/methods/parameters for PySpark. 
 * Missing classes for PySpark(ML):
 ** attribute SPARK-10025
 ** feature
 *** CountVectorizerModel SPARK-9769
 *** DCT SPARK-8472
 *** ElementwiseProduct SPARK-9768
 *** MinMaxScaler SPARK-8530
 *** StopWordsRemover SPARK-9679
 *** VectorSlicer SPARK-9772
 *** IndexToString SPARK-10021
 ** classification
 *** OneVsRest SPARK-7861
 *** MultilayerPerceptronClassifier SPARK-9773
 ** regression
 *** IsotonicRegression SPARK-9774
 * Missing User Guide documents for PySpark SPARK-8757
 * Scala-Python method/parameter inconsistency check for ML & MLlib SPARK-10022






[jira] [Created] (SPARK-10025) Add Python API for ml.attribute

2015-08-16 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-10025:
---

 Summary: Add Python API for ml.attribute
 Key: SPARK-10025
 URL: https://issues.apache.org/jira/browse/SPARK-10025
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Reporter: Yanbo Liang


Currently there is no Python implementation for ml.attribute, so we cannot use 
Attribute in an ML pipeline. Some transformers need this feature; for example, 
VectorSlicer can take a subarray of the original features by specifying column 
names, which must be contained in the column's Attribute. 







[jira] [Updated] (SPARK-9793) PySpark DenseVector, SparseVector should override __eq__ and __hash__

2015-08-16 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-9793:
---
Summary: PySpark DenseVector, SparseVector should override __eq__ and 
__hash__  (was: PySpark DenseVector, SparseVector should override __eq__)

 PySpark DenseVector, SparseVector should override __eq__ and __hash__
 -

 Key: SPARK-9793
 URL: https://issues.apache.org/jira/browse/SPARK-9793
 Project: Spark
  Issue Type: Bug
  Components: ML, PySpark
Affects Versions: 1.5.0
Reporter: Joseph K. Bradley
Priority: Critical

 See [SPARK-9750].
 PySpark DenseVector and SparseVector do not override the equality operator 
 properly.  They should use semantics, not representation, for comparison.  
 (This is what Scala currently does.)
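
 For illustration, a small sketch of the desired semantics (not the patch itself): a dense and a sparse vector holding the same values should compare equal and hash identically, so either can be used as a dict/set key.
{code}
from pyspark.mllib.linalg import DenseVector, SparseVector

dv = DenseVector([1.0, 0.0, 3.0])
sv = SparseVector(3, {0: 1.0, 2: 3.0})  # same values, different representation

# Desired behaviour once __eq__ and __hash__ are overridden properly:
print(dv == sv)              # True: compared by length and values, not by class
print(hash(dv) == hash(sv))  # True: equal objects must produce equal hashes
{code}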






[jira] [Created] (SPARK-10027) Add Python API missing methods for ml.feature

2015-08-16 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-10027:
---

 Summary: Add Python API missing methods for ml.feature
 Key: SPARK-10027
 URL: https://issues.apache.org/jira/browse/SPARK-10027
 Project: Spark
  Issue Type: Sub-task
  Components: ML, PySpark
Reporter: Yanbo Liang


The missing methods of ml.feature are listed here (see the sketch below):
* StringIndexer lacks the handleInvalid parameter
* VectorIndexerModel lacks numFeatures and categoryMaps
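
A sketch of the intended Python usage once these gaps are filled; the handleInvalid keyword and the model accessors below mirror the Scala API and are assumptions here:
{code}
from pyspark import SparkContext
from pyspark.ml.feature import StringIndexer

sc = SparkContext("local[1]", "ml-feature-gaps")

# handleInvalid exists in the Scala StringIndexer ("error" or "skip") and should be
# accepted by the Python constructor and setters in the same way.
indexer = StringIndexer(inputCol="category", outputCol="categoryIndex",
                        handleInvalid="skip")
print(indexer.getHandleInvalid())

# A fitted VectorIndexerModel should likewise expose the metadata Scala already has:
#   model.numFeatures   -> number of input features
#   model.categoryMaps  -> {feature index: {raw feature value: category index}}

sc.stop()
{code}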






[jira] [Assigned] (SPARK-9431) TimeIntervalType for time intervals

2015-08-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9431:
---

Assignee: Apache Spark

 TimeIntervalType for time intervals
 ---

 Key: SPARK-9431
 URL: https://issues.apache.org/jira/browse/SPARK-9431
 Project: Spark
  Issue Type: Story
  Components: SQL
Reporter: Reynold Xin
Assignee: Apache Spark
Priority: Critical

 Related to the existing CalendarIntervalType, TimeIntervalType internally has 
 only one component: the number of microseconds, represented as a long.
 TimeIntervalType can be used in equality tests and ordering.






[jira] [Updated] (SPARK-10029) Add Python examples for mllib IsotonicRegression user guide

2015-08-16 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-10029:

Issue Type: Sub-task  (was: Documentation)
Parent: SPARK-8757

 Add Python examples for mllib IsotonicRegression user guide
 ---

 Key: SPARK-10029
 URL: https://issues.apache.org/jira/browse/SPARK-10029
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, PySpark
Reporter: Yanbo Liang
Priority: Minor

 Add Python examples for mllib IsotonicRegression user guide






[jira] [Created] (SPARK-10029) Add Python examples for mllib IsotonicRegression user guide

2015-08-16 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-10029:
---

 Summary: Add Python examples for mllib IsotonicRegression user 
guide
 Key: SPARK-10029
 URL: https://issues.apache.org/jira/browse/SPARK-10029
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, PySpark
Reporter: Yanbo Liang
Priority: Minor


Add Python examples for mllib IsotonicRegression user guide
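
A candidate snippet along the lines of what the user guide needs, using the existing pyspark.mllib.regression.IsotonicRegression API (the data and exact wording are placeholders):
{code}
from pyspark import SparkContext
from pyspark.mllib.regression import IsotonicRegression

sc = SparkContext("local[1]", "isotonic-regression-example")

# Training data as (label, feature, weight) tuples, the format mllib expects.
data = sc.parallelize([(1.0, 1.0, 1.0), (2.0, 2.0, 1.0), (3.0, 3.0, 1.0),
                       (1.0, 4.0, 1.0), (6.0, 5.0, 1.0)])

model = IsotonicRegression.train(data, isotonic=True)
print(model.predict(3.5))                                   # predict a single feature
print(model.predict(sc.parallelize([1.0, 5.0])).collect())  # or a whole RDD

sc.stop()
{code}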






[jira] [Updated] (SPARK-10022) Scala-Python method/parameter inconsistency check for ML during 1.5 QA

2015-08-16 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-10022:

Description: 
The missing classes for PySpark were listed at SPARK-9663.
Here we check and list the missing method/parameter for ML of PySpark.

  was:
The missing classes for PySpark were listed at SPARK-9663.
Here we check and list the missing method/parameter for ML & MLlib of PySpark.


 Scala-Python method/parameter inconsistency check for ML during 1.5 QA
 --

 Key: SPARK-10022
 URL: https://issues.apache.org/jira/browse/SPARK-10022
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib, PySpark
Reporter: Yanbo Liang

 The missing classes for PySpark were listed at SPARK-9663.
 Here we check and list the missing method/parameter for ML of PySpark.
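
 As a rough illustration of how this check can be driven from the Python side (a sketch only; StringIndexer is just a convenient example class, not one singled out by this issue), the public methods and Params of a PySpark class can be dumped and compared against the Scala API docs:
{code}
from pyspark import SparkContext
from pyspark.ml.feature import StringIndexer

sc = SparkContext("local[1]", "scala-python-api-check")

# Dump the public attributes and the declared Params of the Python class; comparing
# these lists against the Scala API docs for the same class makes missing or renamed
# methods/parameters easy to spot.
indexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
print(sorted(name for name in dir(indexer) if not name.startswith("_")))
print(sorted(p.name for p in indexer.params))

sc.stop()
{code}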






[jira] [Updated] (SPARK-10022) Scala-Python method/parameter inconsistency check for ML & MLlib during 1.5 QA

2015-08-16 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-10022:

Description: 
The missing classes for PySpark were listed at SPARK-9663.
Here we check and list the missing method/parameter for ML & MLlib of PySpark.

  was:Check the Scala-Python inconsistency of ML & MLlib method/parameter


 Scala-Python method/parameter inconsistency check for ML & MLlib during 1.5 QA
 --

 Key: SPARK-10022
 URL: https://issues.apache.org/jira/browse/SPARK-10022
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib, PySpark
Reporter: Yanbo Liang

 The missing classes for PySpark were listed at SPARK-9663.
 Here we check and list the missing method/parameter for ML & MLlib of PySpark.






[jira] [Resolved] (SPARK-10008) Shuffle locality can take precedence over narrow dependencies for RDDs with both

2015-08-16 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia resolved SPARK-10008.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

 Shuffle locality can take precedence over narrow dependencies for RDDs with 
 both
 

 Key: SPARK-10008
 URL: https://issues.apache.org/jira/browse/SPARK-10008
 Project: Spark
  Issue Type: Bug
  Components: Scheduler
Reporter: Matei Zaharia
Assignee: Matei Zaharia
 Fix For: 1.5.0


 The shuffle locality patch made the DAGScheduler aware of shuffle data, but 
 for RDDs that have both narrow and shuffle dependencies, it can cause them to 
 place tasks based on the shuffle dependency instead of the narrow one. This 
 case is common in iterative join-based algorithms like PageRank and ALS, 
 where one RDD is hash-partitioned and one isn't.
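
 An illustrative PySpark sketch of the pattern described above (a PageRank-style join where one side is hash-partitioned and cached, so the join carries both a narrow and a shuffle dependency):
{code}
from pyspark import SparkContext

sc = SparkContext("local[2]", "narrow-vs-shuffle-locality")

# links is hash-partitioned and cached: the join's dependency on it is narrow, and
# task placement should prefer its cached partitions.
links = sc.parallelize([(1, [2, 3]), (2, [1]), (3, [1])]).partitionBy(2).cache()
# ranks is not partitioned, so the join's dependency on it is a shuffle dependency.
ranks = sc.parallelize([(1, 1.0), (2, 1.0), (3, 1.0)])

contribs = links.join(ranks).flatMap(
    lambda kv: [(dst, kv[1][1] / len(kv[1][0])) for dst in kv[1][0]])
print(contribs.collect())

sc.stop()
{code}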






[jira] [Updated] (SPARK-9663) ML Python API coverage issues found during 1.5 QA

2015-08-16 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-9663:
---
Description: 
This umbrella is for a list of Python API coverage issues which we should fix 
for the 1.6 release cycle.  This list is to be generated from issues found in 
[SPARK-9662] and from remaining issues from 1.4: [SPARK-7536].

Here we check and compare the Python and Scala API of MLlib/ML,
add missing classes/methods/parameters for PySpark. 
* Missing classes for PySpark(ML):
** feature
*** CountVectorizerModel SPARK-9769
*** DCT SPARK-8472
*** ElementwiseProduct SPARK-9768
*** MinMaxScaler SPARK-8530
*** StopWordsRemover SPARK-9679
*** VectorSlicer SPARK-9772
*** IndexToString SPARK-10021
** classification
*** OneVsRest SPARK-7861
*** MultilayerPerceptronClassifier SPARK-9773
** regression
*** IsotonicRegression SPARK-9774
* Missing User Guide documents for PySpark SPARK-8757
* Scala-Python method/parameter inconsistency check for ML & MLlib SPARK-10022

  was:
This umbrella is for a list of Python API coverage issues which we should fix 
for the 1.6 release cycle.  This list is to be generated from issues found in 
[SPARK-9662] and from remaining issues from 1.4: [SPARK-7536].

Here we check and compare the Python and Scala API of MLlib/ML,
add missing classes/methods/parameters for PySpark. 
* Missing classes for PySpark(ML):
** feature
*** CountVectorizerModel SPARK-9769
*** DCT SPARK-8472
*** ElementwiseProduct SPARK-9768
*** MinMaxScaler SPARK-8530
*** StopWordsRemover SPARK-9679
*** VectorSlicer SPARK-9772
*** IndexToString SPARK-10021
** classification
*** OneVsRest SPARK-7861
*** MultilayerPerceptronClassifier SPARK-9773
** regression
*** IsotonicRegression SPARK-9774
* Missing User Guide documents for PySpark SPARK-8757


 ML Python API coverage issues found during 1.5 QA
 -

 Key: SPARK-9663
 URL: https://issues.apache.org/jira/browse/SPARK-9663
 Project: Spark
  Issue Type: Umbrella
  Components: ML, MLlib, PySpark
Reporter: Joseph K. Bradley

 This umbrella is for a list of Python API coverage issues which we should fix 
 for the 1.6 release cycle.  This list is to be generated from issues found in 
 [SPARK-9662] and from remaining issues from 1.4: [SPARK-7536].
 Here we check and compare the Python and Scala API of MLlib/ML,
 add missing classes/methods/parameters for PySpark. 
 * Missing classes for PySpark(ML):
 ** feature
 *** CountVectorizerModel SPARK-9769
 *** DCT SPARK-8472
 *** ElementwiseProduct SPARK-9768
 *** MinMaxScaler SPARK-8530
 *** StopWordsRemover SPARK-9679
 *** VectorSlicer SPARK-9772
 *** IndexToString SPARK-10021
 ** classification
 *** OneVsRest SPARK-7861
 *** MultilayerPerceptronClassifier SPARK-9773
 ** regression
 *** IsotonicRegression SPARK-9774
 * Missing User Guide documents for PySpark SPARK-8757
 * Scala-Python method/parameter inconsistency check for ML & MLlib SPARK-10022






[jira] [Updated] (SPARK-10022) Scala-Python method/parameter inconsistency check for ML & MLlib during 1.5 QA

2015-08-16 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-10022:

Summary: Scala-Python method/parameter inconsistency check for ML & MLlib 
during 1.5 QA  (was: Scala-Python inconsistency check for ML & MLlib during 1.5 
QA)

 Scala-Python method/parameter inconsistency check for ML & MLlib during 1.5 QA
 --

 Key: SPARK-10022
 URL: https://issues.apache.org/jira/browse/SPARK-10022
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib, PySpark
Reporter: Yanbo Liang

 Check the Scala-Python inconsistency of ML & MLlib method/parameter






[jira] [Updated] (SPARK-10022) Scala-Python inconsistency check for ML & MLlib during 1.5 QA

2015-08-16 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-10022:

Description: Check the Scala-Python inconsistency of ML & MLlib 
method/parameter  (was: Check the Scala-Python inconsistency of ML & MLlib 
class/method/parameter)

 Scala-Python inconsistency check for ML & MLlib during 1.5 QA
 -

 Key: SPARK-10022
 URL: https://issues.apache.org/jira/browse/SPARK-10022
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib, PySpark
Reporter: Yanbo Liang

 Check the Scala-Python inconsistency of ML & MLlib method/parameter






[jira] [Updated] (SPARK-9973) Wrong initial size of in-memory columnar buffers

2015-08-16 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-9973:
--
 Shepherd: Cheng Lian
   Sprint: Spark 1.5 doc/QA sprint
Affects Version/s: 1.5.0
 Target Version/s: 1.5.0
  Description: 
Too much memory is allocated for in-memory columnar buffers. The 
{{initialSize}} argument in {{ColumnBuilder.initialize}} is the initial number 
of rows rather than bytes, but the value passed in by 
{{InMemoryColumnarTableScan}} is the latter:
{code}
// Class InMemoryColumnarTableScan
  val initialBufferSize = columnType.defaultSize * batchSize
  ColumnBuilder(attribute.dataType, initialBufferSize, attribute.name, 
useCompression)
{code}
Then it's converted to byte size again by multiplying 
{{columnType.defaultSize}}:
{code}
// Class BasicColumnBuilder
  buffer = ByteBuffer.allocate(4 + size * columnType.defaultSize)
{code}

  was:
When cache table in memory in spark sql, we allocate too more memory.

InMemoryColumnarTableScan.class
  val initialBufferSize = columnType.defaultSize * batchSize
  ColumnBuilder(attribute.dataType, initialBufferSize, attribute.name, 
useCompression)

BasicColumnBuilder.class
  buffer = ByteBuffer.allocate(4 + size * columnType.defaultSize)

So total allocate size is (4+ size * columnType.defaultSize  * 
columnType.defaultSize), We change it to 4+ size * columnType.defaultSize.


 Wrong initial size of in-memory columnar buffers
 

 Key: SPARK-9973
 URL: https://issues.apache.org/jira/browse/SPARK-9973
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: xukun
Assignee: xukun

 Too much memory is allocated for in-memory columnar buffers. The 
 {{initialSize}} argument in {{ColumnBuilder.initialize}} is the initial 
 number of rows rather than bytes, but the value passed in by 
 {{InMemoryColumnarTableScan}} is the latter:
 {code}
 // Class InMemoryColumnarTableScan
   val initialBufferSize = columnType.defaultSize * batchSize
   ColumnBuilder(attribute.dataType, initialBufferSize, attribute.name, 
 useCompression)
 {code}
 Then it's converted to byte size again by multiplying 
 {{columnType.defaultSize}}:
 {code}
 // Class BasicColumnBuilder
   buffer = ByteBuffer.allocate(4 + size * columnType.defaultSize)
 {code}
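
 To make the over-allocation concrete, a quick back-of-the-envelope check, assuming an INT column (defaultSize = 4 bytes) and the default batch size of 10000 rows:
{code}
default_size = 4    # bytes per INT value
batch_size = 10000  # rows per in-memory batch

initial_buffer_size = default_size * batch_size     # the "initialSize" (rows) passed in
allocated = 4 + initial_buffer_size * default_size  # what BasicColumnBuilder allocates
intended = 4 + batch_size * default_size            # what one batch actually needs

print(intended, allocated)  # 40004 vs 160004 bytes: a factor of defaultSize too large
{code}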






[jira] [Commented] (SPARK-9973) Wrong initial size of in-memory columnar buffers

2015-08-16 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14698553#comment-14698553
 ] 

Cheng Lian commented on SPARK-9973:
---

I've updated the title and description.

 Wrong initial size of in-memory columnar buffers
 

 Key: SPARK-9973
 URL: https://issues.apache.org/jira/browse/SPARK-9973
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: xukun
Assignee: xukun

 Too much memory is allocated for in-memory columnar buffers. The 
 {{initialSize}} argument in {{ColumnBuilder.initialize}} is the initial 
 number of rows rather than bytes, but the value passed in by 
 {{InMemoryColumnarTableScan}} is the latter:
 {code}
 // Class InMemoryColumnarTableScan
   val initialBufferSize = columnType.defaultSize * batchSize
   ColumnBuilder(attribute.dataType, initialBufferSize, attribute.name, 
 useCompression)
 {code}
 Then it's converted to byte size again by multiplying 
 {{columnType.defaultSize}}:
 {code}
 // Class BasicColumnBuilder
   buffer = ByteBuffer.allocate(4 + size * columnType.defaultSize)
 {code}






[jira] [Updated] (SPARK-10024) Python API RF and GBT related params clear up

2015-08-16 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-10024:

Description: Implement RandomForestParams, GBTParams and 
TreeEnsembleParams for Python API, and make corresponding parameters in 
place. There is a lot of duplicated code in the current implementation. You can 
refer to the Scala API, which is more compact.   (was: Implement 
RandomForestParams, GBTParams and TreeEnsembleParams for Python API, and 
make corresponding parameters in place. It can refer the Scala API.)

 Python API RF and GBT related params clear up
 -

 Key: SPARK-10024
 URL: https://issues.apache.org/jira/browse/SPARK-10024
 Project: Spark
  Issue Type: Sub-task
  Components: ML, MLlib, PySpark
Reporter: Yanbo Liang

 Implement RandomForestParams, GBTParams and TreeEnsembleParams for 
 Python API, and make corresponding parameters in place. There is a lot of 
 duplicated code in the current implementation; you can refer to the Scala API, 
 which is more compact. 
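
 A minimal sketch of the direction described here, with illustrative mixin and param names rather than the actual pyspark.ml code: shared tree-ensemble params are declared once and inherited by the concrete estimators, mirroring the Scala trait hierarchy.
{code}
from pyspark.ml.param import Param, Params

# Illustrative mixins only: shared tree-ensemble params live in one place and the
# concrete param classes inherit them instead of redefining them.
class TreeEnsembleParams(Params):
    def __init__(self):
        super(TreeEnsembleParams, self).__init__()
        self.subsamplingRate = Param(
            self, "subsamplingRate",
            "fraction of the training data used for learning each tree, in (0, 1]")

class RandomForestParams(TreeEnsembleParams):
    def __init__(self):
        super(RandomForestParams, self).__init__()
        self.numTrees = Param(self, "numTrees", "number of trees to train (>= 1)")

rf = RandomForestParams()
print(rf.numTrees.name, rf.subsamplingRate.name)  # both params available on one object
{code}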






[jira] [Updated] (SPARK-9663) ML Python API coverage issues found during 1.5 QA

2015-08-16 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-9663:
---
Description: 
This umbrella is for a list of Python API coverage issues which we should fix 
for the 1.6 release cycle.  This list is to be generated from issues found in 
[SPARK-9662] and from remaining issues from 1.4: [SPARK-7536].

Here we check and compare the Python and Scala API of MLlib/ML,
add missing classes/methods/parameters for PySpark. 
* Missing classes for ML:
** attribute SPARK-10025
** feature
*** CountVectorizerModel SPARK-9769
*** DCT SPARK-8472
*** ElementwiseProduct SPARK-9768
*** MinMaxScaler SPARK-8530
*** StopWordsRemover SPARK-9679
*** VectorSlicer SPARK-9772
*** IndexToString SPARK-10021
** classification
*** OneVsRest SPARK-7861
*** MultilayerPerceptronClassifier SPARK-9773
** regression
*** IsotonicRegression SPARK-9774
* Missing classes for MLlib:
** fpm
*** PrefixSpan SPARK-10028
* Missing User Guide documents for PySpark SPARK-8757
* Scala-Python method/parameter inconsistency check for ML & MLlib SPARK-10022

  was:
This umbrella is for a list of Python API coverage issues which we should fix 
for the 1.6 release cycle.  This list is to be generated from issues found in 
[SPARK-9662] and from remaining issues from 1.4: [SPARK-7536].

Here we check and compare the Python and Scala API of MLlib/ML,
add missing classes/methods/parameters for PySpark. 
* Missing classes for PySpark(ML):
** attribute SPARK-10025
** feature
*** CountVectorizerModel SPARK-9769
*** DCT SPARK-8472
*** ElementwiseProduct SPARK-9768
*** MinMaxScaler SPARK-8530
*** StopWordsRemover SPARK-9679
*** VectorSlicer SPARK-9772
*** IndexToString SPARK-10021
** classification
*** OneVsRest SPARK-7861
*** MultilayerPerceptronClassifier SPARK-9773
** regression
*** IsotonicRegression SPARK-9774
* Missing User Guide documents for PySpark SPARK-8757
* Scala-Python method/parameter inconsistency check for ML & MLlib SPARK-10022


 ML Python API coverage issues found during 1.5 QA
 -

 Key: SPARK-9663
 URL: https://issues.apache.org/jira/browse/SPARK-9663
 Project: Spark
  Issue Type: Umbrella
  Components: ML, MLlib, PySpark
Reporter: Joseph K. Bradley

 This umbrella is for a list of Python API coverage issues which we should fix 
 for the 1.6 release cycle.  This list is to be generated from issues found in 
 [SPARK-9662] and from remaining issues from 1.4: [SPARK-7536].
 Here we check and compare the Python and Scala API of MLlib/ML,
 add missing classes/methods/parameters for PySpark. 
 * Missing classes for ML:
 ** attribute SPARK-10025
 ** feature
 *** CountVectorizerModel SPARK-9769
 *** DCT SPARK-8472
 *** ElementwiseProduct SPARK-9768
 *** MinMaxScaler SPARK-8530
 *** StopWordsRemover SPARK-9679
 *** VectorSlicer SPARK-9772
 *** IndexToString SPARK-10021
 ** classification
 *** OneVsRest SPARK-7861
 *** MultilayerPerceptronClassifier SPARK-9773
 ** regression
 *** IsotonicRegression SPARK-9774
 * Missing classes for MLlib:
 ** fpm
 *** PrefixSpan SPARK-10028
 * Missing User Guide documents for PySpark SPARK-8757
 * Scala-Python method/parameter inconsistency check for ML & MLlib SPARK-10022






[jira] [Comment Edited] (SPARK-9662) ML 1.5 QA: API: Python API coverage

2015-08-16 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14698565#comment-14698565
 ] 

Yanbo Liang edited comment on SPARK-9662 at 8/16/15 7:37 AM:
-

[~josephkb] I have finished checking for Scala-Python method/parameter 
inconsistency and listed what we should do in the next release cycle in 
SPARK-10022.


was (Author: yanboliang):
[~josephkb] I have finished checking for Scala-Python inconsistency and list 
what we should do in the next release cycle in SPARK-10022.

 ML 1.5 QA: API: Python API coverage
 ---

 Key: SPARK-9662
 URL: https://issues.apache.org/jira/browse/SPARK-9662
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, ML, MLlib, PySpark
Reporter: Joseph K. Bradley
Assignee: Yanbo Liang

 For new public APIs added to MLlib, we need to check the generated HTML doc 
 and compare the Scala & Python versions.  We need to track:
 * Inconsistency: Do class/method/parameter names match?
 * Docs: Is the Python doc missing or just a stub?  We want the Python doc to 
 be as complete as the Scala doc.
 * API breaking changes: These should be very rare but are occasionally either 
 necessary (intentional) or accidental.  These must be recorded and added in 
 the Migration Guide for this release.
 ** Note: If the API change is for an Alpha/Experimental/DeveloperApi 
 component, please note that as well.
 * Missing classes/methods/parameters: We should create to-do JIRAs for 
 functionality missing from Python, to be added in the next release cycle.  
 Please use a *separate* JIRA (linked below) for this list of to-do items.
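
 For the docs item, one quick way to eyeball whether the Python side is just a stub (illustrative only; LogisticRegression is merely an example class):
{code}
from pyspark import SparkContext
from pyspark.ml.classification import LogisticRegression

sc = SparkContext("local[1]", "pydoc-coverage-check")

# The class docstring and the per-Param docs are what end up in the generated HTML,
# so dumping them is a cheap way to compare coverage against the Scala scaladoc.
print(LogisticRegression.__doc__)
for param in LogisticRegression().params:
    print(param.name, "-", param.doc)

sc.stop()
{code}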






[jira] [Commented] (SPARK-9662) ML 1.5 QA: API: Python API coverage

2015-08-16 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14698565#comment-14698565
 ] 

Yanbo Liang commented on SPARK-9662:


[~josephkb] I have finished checking for Scala-Python inconsistency and listed 
what we should do in the next release cycle in SPARK-10022.

 ML 1.5 QA: API: Python API coverage
 ---

 Key: SPARK-9662
 URL: https://issues.apache.org/jira/browse/SPARK-9662
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, ML, MLlib, PySpark
Reporter: Joseph K. Bradley
Assignee: Yanbo Liang

 For new public APIs added to MLlib, we need to check the generated HTML doc 
 and compare the Scala & Python versions.  We need to track:
 * Inconsistency: Do class/method/parameter names match?
 * Docs: Is the Python doc missing or just a stub?  We want the Python doc to 
 be as complete as the Scala doc.
 * API breaking changes: These should be very rare but are occasionally either 
 necessary (intentional) or accidental.  These must be recorded and added in 
 the Migration Guide for this release.
 ** Note: If the API change is for an Alpha/Experimental/DeveloperApi 
 component, please note that as well.
 * Missing classes/methods/parameters: We should create to-do JIRAs for 
 functionality missing from Python, to be added in the next release cycle.  
 Please use a *separate* JIRA (linked below) for this list of to-do items.






[jira] [Created] (SPARK-10023) Unified DecisionTreeParams checkpointInterval between Scala and Python API.

2015-08-16 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-10023:
---

 Summary: Unified DecisionTreeParams checkpointInterval between 
Scala and Python API.
 Key: SPARK-10023
 URL: https://issues.apache.org/jira/browse/SPARK-10023
 Project: Spark
  Issue Type: Sub-task
  Components: ML, PySpark
Reporter: Yanbo Liang


checkpointInterval is one of the DecisionTreeParams in the Scala API, which is 
inconsistent with the Python API; we should unify them.
Proposal: make checkpointInterval a shared param.
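
A rough sketch of what the proposal amounts to on the Python side, in the spirit of the shared-param mixins in pyspark.ml.param.shared (illustrative only, not the actual implementation):
{code}
from pyspark.ml.param import Param, Params

# Define checkpointInterval once in a shared mixin so every tree-based estimator
# inherits the same Param, matching the Scala DecisionTreeParams definition.
class HasCheckpointInterval(Params):
    def __init__(self):
        super(HasCheckpointInterval, self).__init__()
        self.checkpointInterval = Param(
            self, "checkpointInterval",
            "checkpoint interval (>= 1), or -1 to disable checkpointing")

    def getCheckpointInterval(self):
        return self.getOrDefault(self.checkpointInterval)
{code}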






[jira] [Updated] (SPARK-10024) Python API RF and GBT related params clear up

2015-08-16 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-10024:

Summary: Python API RF and GBT related params clear up  (was: Python API 
Tree related params clear up)

 Python API RF and GBT related params clear up
 -

 Key: SPARK-10024
 URL: https://issues.apache.org/jira/browse/SPARK-10024
 Project: Spark
  Issue Type: Sub-task
  Components: ML, MLlib, PySpark
Reporter: Yanbo Liang

 Implement RandomForestParams, GBTParams and TreeEnsembleParams for 
 Python API, and make corresponding parameters in place.






[jira] [Updated] (SPARK-10024) Python API RF and GBT related params clear up

2015-08-16 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-10024:

Description: Implement RandomForestParams, GBTParams and 
TreeEnsembleParams for Python API, and make corresponding parameters in 
place. It can refer to the Scala API.  (was: Implement RandomForestParams, 
GBTParams and TreeEnsembleParams for Python API, and make corresponding 
parameters in place.)

 Python API RF and GBT related params clear up
 -

 Key: SPARK-10024
 URL: https://issues.apache.org/jira/browse/SPARK-10024
 Project: Spark
  Issue Type: Sub-task
  Components: ML, MLlib, PySpark
Reporter: Yanbo Liang

 Implement RandomForestParams, GBTParams and TreeEnsembleParams for 
 Python API, and make corresponding parameters in place. It can refer to the 
 Scala API.






[jira] [Resolved] (SPARK-8844) head/collect is broken in SparkR

2015-08-16 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-8844.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

 head/collect is broken in SparkR 
 -

 Key: SPARK-8844
 URL: https://issues.apache.org/jira/browse/SPARK-8844
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 1.5.0
Reporter: Davies Liu
Assignee: Sun Rui
Priority: Blocker
 Fix For: 1.5.0


 {code}
  t = tables(sqlContext)
  showDF(T)
 Error in (function (classes, fdef, mtable)  :
   unable to find an inherited method for function ‘showDF’ for signature 
 ‘logical’
  showDF(t)
 +-+---+
 |tableName|isTemporary|
 +-+---+
 +-+---+
  15/07/06 09:59:10 WARN Executor: Told to re-register on heartbeat
 
 
  head(t)
 Error in readTypedObject(con, type) :
   Unsupported type for deserialization
  collect(t)
 Error in readTypedObject(con, type) :
   Unsupported type for deserialization
 {code}






[jira] [Commented] (SPARK-8844) head/collect is broken in SparkR

2015-08-16 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14698564#comment-14698564
 ] 

Shivaram Venkataraman commented on SPARK-8844:
--

Resolved by https://github.com/apache/spark/pull/7419

 head/collect is broken in SparkR 
 -

 Key: SPARK-8844
 URL: https://issues.apache.org/jira/browse/SPARK-8844
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 1.5.0
Reporter: Davies Liu
Assignee: Sun Rui
Priority: Blocker

 {code}
  t = tables(sqlContext)
  showDF(T)
 Error in (function (classes, fdef, mtable)  :
   unable to find an inherited method for function ‘showDF’ for signature 
 ‘logical’
  showDF(t)
 +-+---+
 |tableName|isTemporary|
 +-+---+
 +-+---+
  15/07/06 09:59:10 WARN Executor: Told to re-register on heartbeat
 
 
  head(t)
 Error in readTypedObject(con, type) :
   Unsupported type for deserialization
  collect(t)
 Error in readTypedObject(con, type) :
   Unsupported type for deserialization
 {code}






[jira] [Updated] (SPARK-8844) head/collect is broken in SparkR

2015-08-16 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-8844:
-
Assignee: Sun Rui

 head/collect is broken in SparkR 
 -

 Key: SPARK-8844
 URL: https://issues.apache.org/jira/browse/SPARK-8844
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 1.5.0
Reporter: Davies Liu
Assignee: Sun Rui
Priority: Blocker

 {code}
  t = tables(sqlContext)
  showDF(T)
 Error in (function (classes, fdef, mtable)  :
   unable to find an inherited method for function ‘showDF’ for signature 
 ‘logical’
  showDF(t)
 +-+---+
 |tableName|isTemporary|
 +-+---+
 +-+---+
  15/07/06 09:59:10 WARN Executor: Told to re-register on heartbeat
 
 
  head(t)
 Error in readTypedObject(con, type) :
   Unsupported type for deserialization
  collect(t)
 Error in readTypedObject(con, type) :
   Unsupported type for deserialization
 {code}






[jira] [Assigned] (SPARK-10029) Add Python examples for mllib IsotonicRegression user guide

2015-08-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10029:


Assignee: (was: Apache Spark)

 Add Python examples for mllib IsotonicRegression user guide
 ---

 Key: SPARK-10029
 URL: https://issues.apache.org/jira/browse/SPARK-10029
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, PySpark
Reporter: Yanbo Liang
Priority: Minor

 Add Python examples for mllib IsotonicRegression user guide






[jira] [Commented] (SPARK-10029) Add Python examples for mllib IsotonicRegression user guide

2015-08-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14698597#comment-14698597
 ] 

Apache Spark commented on SPARK-10029:
--

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/8225

 Add Python examples for mllib IsotonicRegression user guide
 ---

 Key: SPARK-10029
 URL: https://issues.apache.org/jira/browse/SPARK-10029
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, PySpark
Reporter: Yanbo Liang
Priority: Minor

 Add Python examples for mllib IsotonicRegression user guide






[jira] [Assigned] (SPARK-10029) Add Python examples for mllib IsotonicRegression user guide

2015-08-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10029:


Assignee: Apache Spark

 Add Python examples for mllib IsotonicRegression user guide
 ---

 Key: SPARK-10029
 URL: https://issues.apache.org/jira/browse/SPARK-10029
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, PySpark
Reporter: Yanbo Liang
Assignee: Apache Spark
Priority: Minor

 Add Python examples for mllib IsotonicRegression user guide






[jira] [Created] (SPARK-10022) Scala-Python inconsistency check for ML & MLlib during 1.5 QA

2015-08-16 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-10022:
---

 Summary: Scala-Python inconsistency check for ML & MLlib during 
1.5 QA
 Key: SPARK-10022
 URL: https://issues.apache.org/jira/browse/SPARK-10022
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib, PySpark
Reporter: Yanbo Liang


Check the Scala-Python inconsistency of ML & MLlib class/method/parameter






[jira] [Updated] (SPARK-10030) Managed memory leak detected when cache table

2015-08-16 Thread wangwei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangwei updated SPARK-10030:

Description: 
I tested the latest spark-1.5.0 in standalone mode and followed the steps below, 
then the following issue occurred.

1. create table cache_test(id int,  name string) stored as textfile ;
2. load data local inpath 
'SparkSource/sql/hive/src/test/resources/data/files/kv1.txt' into table 
cache_test;
3. cache table test as select * from cache_test distribute by id;

15/08/16 17:14:47 ERROR Executor: Managed memory leak detected; size = 67108864 
bytes, TID = 434
15/08/16 17:14:47 ERROR Executor: Exception in task 23.0 in stage 9.0 (TID 434)
java.util.NoSuchElementException: key not found: val_54
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:58)
at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
at 
org.apache.spark.sql.columnar.compression.DictionaryEncoding$Encoder.compress(compressionSchemes.scala:258)
at 
org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.build(CompressibleColumnBuilder.scala:110)
at 
org.apache.spark.sql.columnar.NativeColumnBuilder.build(ColumnBuilder.scala:87)
at 
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:153)
at 
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:153)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at 
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:153)
at 
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:120)
at 
org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:278)
at 
org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:46)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)


  was:
1. create table cache_test(id int,  name string) stored as textfile ;
2. load data local inpath 
'SparkSource/sql/hive/src/test/resources/data/files/kv1.txt' into table 
cache_test;
3. cache table test as select * from cache_test distribute by id;

15/08/16 17:14:47 ERROR Executor: Managed memory leak detected; size = 67108864 
bytes, TID = 434
15/08/16 17:14:47 ERROR Executor: Exception in task 23.0 in stage 9.0 (TID 434)
java.util.NoSuchElementException: key not found: val_54
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:58)
at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
at 
org.apache.spark.sql.columnar.compression.DictionaryEncoding$Encoder.compress(compressionSchemes.scala:258)
at 
org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.build(CompressibleColumnBuilder.scala:110)
at 
org.apache.spark.sql.columnar.NativeColumnBuilder.build(ColumnBuilder.scala:87)
at 

[jira] [Updated] (SPARK-10030) Managed memory leak detected when cache table

2015-08-16 Thread wangwei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangwei updated SPARK-10030:

Description: 
I tested the latest spark-1.5.0 in local, standalone, and yarn mode and followed 
the steps below; the following errors occurred.

1. create table cache_test(id int,  name string) stored as textfile ;
2. load data local inpath 
'SparkSource/sql/hive/src/test/resources/data/files/kv1.txt' into table 
cache_test;
3. cache table test as select * from cache_test distribute by id;

15/08/16 17:14:47 ERROR Executor: Managed memory leak detected; size = 67108864 
bytes, TID = 434
15/08/16 17:14:47 ERROR Executor: Exception in task 23.0 in stage 9.0 (TID 434)
java.util.NoSuchElementException: key not found: val_54
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:58)
at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
at 
org.apache.spark.sql.columnar.compression.DictionaryEncoding$Encoder.compress(compressionSchemes.scala:258)
at 
org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.build(CompressibleColumnBuilder.scala:110)
at 
org.apache.spark.sql.columnar.NativeColumnBuilder.build(ColumnBuilder.scala:87)
at 
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:153)
at 
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:153)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at 
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:153)
at 
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:120)
at 
org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:278)
at 
org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:46)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)
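
For reference, the same reproduction can be driven from spark-shell. The sketch below is only a sketch: it assumes a Hive-enabled {{sqlContext}} and that the 'SparkSource' placeholder above is replaced with the path to an actual Spark source checkout.

{code}
// Sketch only: mirrors the three steps above via HiveContext.sql.
sqlContext.sql("create table cache_test(id int, name string) stored as textfile")
sqlContext.sql("load data local inpath 'SparkSource/sql/hive/src/test/resources/data/files/kv1.txt' " +
  "into table cache_test")
// The managed memory leak and NoSuchElementException are reported while the
// in-memory columnar cache is being built for the cached table.
sqlContext.sql("cache table test as select * from cache_test distribute by id")
{code}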


  was:
I tested the latest spark-1.5.0 in local, standalone, and yarn mode and followed 
the steps below; the following issues occurred.

1. create table cache_test(id int,  name string) stored as textfile ;
2. load data local inpath 
'SparkSource/sql/hive/src/test/resources/data/files/kv1.txt' into table 
cache_test;
3. cache table test as select * from cache_test distribute by id;

15/08/16 17:14:47 ERROR Executor: Managed memory leak detected; size = 67108864 
bytes, TID = 434
15/08/16 17:14:47 ERROR Executor: Exception in task 23.0 in stage 9.0 (TID 434)
java.util.NoSuchElementException: key not found: val_54
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:58)
at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
at 
org.apache.spark.sql.columnar.compression.DictionaryEncoding$Encoder.compress(compressionSchemes.scala:258)
at 
org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.build(CompressibleColumnBuilder.scala:110)
at 

[jira] [Assigned] (SPARK-7707) User guide and example code for KernelDensity

2015-08-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7707:
---

Assignee: (was: Apache Spark)

 User guide and example code for KernelDensity
 -

 Key: SPARK-7707
 URL: https://issues.apache.org/jira/browse/SPARK-7707
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, MLlib
Affects Versions: 1.4.0
Reporter: Xiangrui Meng





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7707) User guide and example code for Statistics.kernelDensity

2015-08-16 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14698634#comment-14698634
 ] 

Sandy Ryza commented on SPARK-7707:
---

[~mengxr] thoughts on which page this should land in?  mllib-statistics?
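
For reference, the example code for such a guide section might look roughly like the sketch below. This is only a sketch: the sample values, bandwidth, and evaluation points are made up, and it assumes a SparkContext is available as {{sc}}.

{code}
import org.apache.spark.mllib.stat.KernelDensity
import org.apache.spark.rdd.RDD

// Hypothetical sample data; a real guide example would use a meaningful dataset.
val sample: RDD[Double] = sc.parallelize(Seq(1.0, 1.5, 3.0, 4.2, 5.1, 7.3))

// Configure a Gaussian kernel density estimator over the sample.
val kd = new KernelDensity()
  .setSample(sample)
  .setBandwidth(3.0)

// Evaluate the estimated density at a few query points.
val densities: Array[Double] = kd.estimate(Array(-1.0, 2.0, 5.0))
{code}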

 User guide and example code for Statistics.kernelDensity
 

 Key: SPARK-7707
 URL: https://issues.apache.org/jira/browse/SPARK-7707
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, MLlib
Affects Versions: 1.4.0
Reporter: Xiangrui Meng





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7707) User guide and example code for KernelDensity

2015-08-16 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated SPARK-7707:
--
Summary: User guide and example code for KernelDensity  (was: User guide 
and example code for Statistics.kernelDensity)

 User guide and example code for KernelDensity
 -

 Key: SPARK-7707
 URL: https://issues.apache.org/jira/browse/SPARK-7707
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, MLlib
Affects Versions: 1.4.0
Reporter: Xiangrui Meng





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10030) Managed memory leak detected when cache table

2015-08-16 Thread wangwei (JIRA)
wangwei created SPARK-10030:
---

 Summary: Managed memory leak detected when cache table
 Key: SPARK-10030
 URL: https://issues.apache.org/jira/browse/SPARK-10030
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.5.0
Reporter: wangwei






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10030) Managed memory leak detected when cache table

2015-08-16 Thread wangwei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangwei updated SPARK-10030:

Description: 
1. create table cache_test(id int,  name string) stored as textfile ;
2. load data local inpath 
'${SparkSource}/sql/hive/src/test/resources/data/files/kv1.txt' into table 
cache_test;
3. cache table test as select * from cache_test distribute by id;

15/08/16 17:14:47 ERROR Executor: Managed memory leak detected; size = 67108864 
bytes, TID = 434
15/08/16 17:14:47 ERROR Executor: Exception in task 23.0 in stage 9.0 (TID 434)
java.util.NoSuchElementException: key not found: val_54
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:58)
at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
at 
org.apache.spark.sql.columnar.compression.DictionaryEncoding$Encoder.compress(compressionSchemes.scala:258)
at 
org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.build(CompressibleColumnBuilder.scala:110)
at 
org.apache.spark.sql.columnar.NativeColumnBuilder.build(ColumnBuilder.scala:87)
at 
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:153)
at 
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:153)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at 
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:153)
at 
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:120)
at 
org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:278)
at 
org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:46)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)


 Managed memory leak detected when cache table
 -

 Key: SPARK-10030
 URL: https://issues.apache.org/jira/browse/SPARK-10030
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.5.0
Reporter: wangwei

 1. create table cache_test(id int,  name string) stored as textfile ;
 2. load data local inpath 
 '${SparkSource}/sql/hive/src/test/resources/data/files/kv1.txt' into table 
 cache_test;
 3. cache table test as select * from cache_test distribute by id;
 15/08/16 17:14:47 ERROR Executor: Managed memory leak detected; size = 
 67108864 bytes, TID = 434
 15/08/16 17:14:47 ERROR Executor: Exception in task 23.0 in stage 9.0 (TID 
 434)
 java.util.NoSuchElementException: key not found: val_54
   at scala.collection.MapLike$class.default(MapLike.scala:228)
   at scala.collection.AbstractMap.default(Map.scala:58)
   at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
   at 
 org.apache.spark.sql.columnar.compression.DictionaryEncoding$Encoder.compress(compressionSchemes.scala:258)
   at 
 

[jira] [Updated] (SPARK-10030) Managed memory leak detected when cache table

2015-08-16 Thread wangwei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangwei updated SPARK-10030:

Description: 
1. create table cache_test(id int,  name string) stored as textfile ;
2. load data local inpath 
'${SparkSource}/sql/hive/src/test/resources/data/files/kv1.txt' into table 
cache_test;

3. cache table test as select * from cache_test distribute by id;

15/08/16 17:14:47 ERROR Executor: Managed memory leak detected; size = 67108864 
bytes, TID = 434
15/08/16 17:14:47 ERROR Executor: Exception in task 23.0 in stage 9.0 (TID 434)
java.util.NoSuchElementException: key not found: val_54
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:58)
at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
at 
org.apache.spark.sql.columnar.compression.DictionaryEncoding$Encoder.compress(compressionSchemes.scala:258)
at 
org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.build(CompressibleColumnBuilder.scala:110)
at 
org.apache.spark.sql.columnar.NativeColumnBuilder.build(ColumnBuilder.scala:87)
at 
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:153)
at 
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:153)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at 
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:153)
at 
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:120)
at 
org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:278)
at 
org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:46)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)


  was:
1. create table cache_test(id int,  name string) stored as textfile ;
2. load data local inpath 
'${SparkSource}/sql/hive/src/test/resources/data/files/kv1.txt' into table 
cache_test;
3. cache table test as select * from cache_test distribute by id;

15/08/16 17:14:47 ERROR Executor: Managed memory leak detected; size = 67108864 
bytes, TID = 434
15/08/16 17:14:47 ERROR Executor: Exception in task 23.0 in stage 9.0 (TID 434)
java.util.NoSuchElementException: key not found: val_54
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:58)
at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
at 
org.apache.spark.sql.columnar.compression.DictionaryEncoding$Encoder.compress(compressionSchemes.scala:258)
at 
org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.build(CompressibleColumnBuilder.scala:110)
at 
org.apache.spark.sql.columnar.NativeColumnBuilder.build(ColumnBuilder.scala:87)
at 
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:153)
at 

[jira] [Updated] (SPARK-10030) Managed memory leak detected when cache table

2015-08-16 Thread wangwei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangwei updated SPARK-10030:

Description: 
1. create table cache_test(id int,  name string) stored as textfile ;
2. load data local inpath 
'${SparkSource}/sql/hive/src/test/resources/data/files/kv1.txt' into table 
cache_test;
3. cache table test as select * from cache_test distribute by id;

15/08/16 17:14:47 ERROR Executor: Managed memory leak detected; size = 67108864 
bytes, TID = 434
15/08/16 17:14:47 ERROR Executor: Exception in task 23.0 in stage 9.0 (TID 434)
java.util.NoSuchElementException: key not found: val_54
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:58)
at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
at 
org.apache.spark.sql.columnar.compression.DictionaryEncoding$Encoder.compress(compressionSchemes.scala:258)
at 
org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.build(CompressibleColumnBuilder.scala:110)
at 
org.apache.spark.sql.columnar.NativeColumnBuilder.build(ColumnBuilder.scala:87)
at 
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:153)
at 
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:153)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at 
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:153)
at 
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:120)
at 
org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:278)
at 
org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:46)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)


  was:
1. create table cache_test(id int,  name string) stored as textfile ;
2. load data local inpath 
'${SparkSource}/sql/hive/src/test/resources/data/files/kv1.txt' into table 
cache_test;

3. cache table test as select * from cache_test distribute by id;

15/08/16 17:14:47 ERROR Executor: Managed memory leak detected; size = 67108864 
bytes, TID = 434
15/08/16 17:14:47 ERROR Executor: Exception in task 23.0 in stage 9.0 (TID 434)
java.util.NoSuchElementException: key not found: val_54
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:58)
at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
at 
org.apache.spark.sql.columnar.compression.DictionaryEncoding$Encoder.compress(compressionSchemes.scala:258)
at 
org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.build(CompressibleColumnBuilder.scala:110)
at 
org.apache.spark.sql.columnar.NativeColumnBuilder.build(ColumnBuilder.scala:87)
at 
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:153)
at 

[jira] [Updated] (SPARK-10030) Managed memory leak detected when cache table

2015-08-16 Thread wangwei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangwei updated SPARK-10030:

Description: 
1. create table cache_test(id int,  name string) stored as textfile ;
2. load data local inpath 
'spark/sql/hive/src/test/resources/data/files/kv1.txt' into table cache_test;
3. cache table test as select * from cache_test distribute by id;

15/08/16 17:14:47 ERROR Executor: Managed memory leak detected; size = 67108864 
bytes, TID = 434
15/08/16 17:14:47 ERROR Executor: Exception in task 23.0 in stage 9.0 (TID 434)
java.util.NoSuchElementException: key not found: val_54
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:58)
at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
at 
org.apache.spark.sql.columnar.compression.DictionaryEncoding$Encoder.compress(compressionSchemes.scala:258)
at 
org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.build(CompressibleColumnBuilder.scala:110)
at 
org.apache.spark.sql.columnar.NativeColumnBuilder.build(ColumnBuilder.scala:87)
at 
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:153)
at 
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:153)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at 
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:153)
at 
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:120)
at 
org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:278)
at 
org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:46)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)


  was:
1. create table cache_test(id int,  name string) stored as textfile ;
2. load data local inpath 
'${SparkSource}/sql/hive/src/test/resources/data/files/kv1.txt' into table cache_test;
3. cache table test as select * from cache_test distribute by id;

15/08/16 17:14:47 ERROR Executor: Managed memory leak detected; size = 67108864 
bytes, TID = 434
15/08/16 17:14:47 ERROR Executor: Exception in task 23.0 in stage 9.0 (TID 434)
java.util.NoSuchElementException: key not found: val_54
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:58)
at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
at 
org.apache.spark.sql.columnar.compression.DictionaryEncoding$Encoder.compress(compressionSchemes.scala:258)
at 
org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.build(CompressibleColumnBuilder.scala:110)
at 
org.apache.spark.sql.columnar.NativeColumnBuilder.build(ColumnBuilder.scala:87)
at 
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:153)
at 

[jira] [Updated] (SPARK-10030) Managed memory leak detected when cache table

2015-08-16 Thread wangwei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangwei updated SPARK-10030:

Description: 
1. create table cache_test(id int,  name string) stored as textfile ;
2. load data local inpath 
'${SparkSource}/sql/hive/src/test/resources/data/files/kv1.txt' into table cache_test;
3. cache table test as select * from cache_test distribute by id;

15/08/16 17:14:47 ERROR Executor: Managed memory leak detected; size = 67108864 
bytes, TID = 434
15/08/16 17:14:47 ERROR Executor: Exception in task 23.0 in stage 9.0 (TID 434)
java.util.NoSuchElementException: key not found: val_54
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:58)
at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
at 
org.apache.spark.sql.columnar.compression.DictionaryEncoding$Encoder.compress(compressionSchemes.scala:258)
at 
org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.build(CompressibleColumnBuilder.scala:110)
at 
org.apache.spark.sql.columnar.NativeColumnBuilder.build(ColumnBuilder.scala:87)
at 
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:153)
at 
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:153)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at 
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:153)
at 
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:120)
at 
org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:278)
at 
org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:46)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)


  was:
1. create table cache_test(id int,  name string) stored as textfile ;
2. load data local inpath 
'${SparkSource}/sql/hive/src/test/resources/data/files/kv1.txt' into table 
cache_test;
3. cache table test as select * from cache_test distribute by id;

15/08/16 17:14:47 ERROR Executor: Managed memory leak detected; size = 67108864 
bytes, TID = 434
15/08/16 17:14:47 ERROR Executor: Exception in task 23.0 in stage 9.0 (TID 434)
java.util.NoSuchElementException: key not found: val_54
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:58)
at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
at 
org.apache.spark.sql.columnar.compression.DictionaryEncoding$Encoder.compress(compressionSchemes.scala:258)
at 
org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.build(CompressibleColumnBuilder.scala:110)
at 
org.apache.spark.sql.columnar.NativeColumnBuilder.build(ColumnBuilder.scala:87)
at 
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:153)
at 

[jira] [Updated] (SPARK-10030) Managed memory leak detected when cache table

2015-08-16 Thread wangwei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangwei updated SPARK-10030:

Description: 
I tested the latest spark-1.5.0 in local, standalone, and yarn mode and followed 
the steps below; the following issues occurred.

1. create table cache_test(id int,  name string) stored as textfile ;
2. load data local inpath 
'SparkSource/sql/hive/src/test/resources/data/files/kv1.txt' into table 
cache_test;
3. cache table test as select * from cache_test distribute by id;

15/08/16 17:14:47 ERROR Executor: Managed memory leak detected; size = 67108864 
bytes, TID = 434
15/08/16 17:14:47 ERROR Executor: Exception in task 23.0 in stage 9.0 (TID 434)
java.util.NoSuchElementException: key not found: val_54
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:58)
at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
at 
org.apache.spark.sql.columnar.compression.DictionaryEncoding$Encoder.compress(compressionSchemes.scala:258)
at 
org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.build(CompressibleColumnBuilder.scala:110)
at 
org.apache.spark.sql.columnar.NativeColumnBuilder.build(ColumnBuilder.scala:87)
at 
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:153)
at 
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:153)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at 
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:153)
at 
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:120)
at 
org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:278)
at 
org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:46)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)


  was:
I tested the latest spark-1.5.0 in standalone mode and followed the steps below; 
the following issues occurred.

1. create table cache_test(id int,  name string) stored as textfile ;
2. load data local inpath 
'SparkSource/sql/hive/src/test/resources/data/files/kv1.txt' into table 
cache_test;
3. cache table test as select * from cache_test distribute by id;

15/08/16 17:14:47 ERROR Executor: Managed memory leak detected; size = 67108864 
bytes, TID = 434
15/08/16 17:14:47 ERROR Executor: Exception in task 23.0 in stage 9.0 (TID 434)
java.util.NoSuchElementException: key not found: val_54
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:58)
at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
at 
org.apache.spark.sql.columnar.compression.DictionaryEncoding$Encoder.compress(compressionSchemes.scala:258)
at 
org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.build(CompressibleColumnBuilder.scala:110)
at 

[jira] [Updated] (SPARK-10032) Add Python example for mllib LDAModel user guide

2015-08-16 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-10032:

Affects Version/s: (was: 1.5.0)

 Add Python example for mllib LDAModel user guide
 

 Key: SPARK-10032
 URL: https://issues.apache.org/jira/browse/SPARK-10032
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, PySpark
Reporter: Yanbo Liang
Priority: Minor
  Labels: 1.5.0

 Add Python example for mllib LDAModel user guide



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10032) Add Python example for mllib LDAModel user guide

2015-08-16 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-10032:
---

 Summary: Add Python example for mllib LDAModel user guide
 Key: SPARK-10032
 URL: https://issues.apache.org/jira/browse/SPARK-10032
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, PySpark
Affects Versions: 1.5.0
Reporter: Yanbo Liang
Priority: Minor


Add Python example for mllib LDAModel user guide



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10032) Add Python example for mllib LDAModel user guide

2015-08-16 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-10032:

Labels: 1.5.0  (was: )

 Add Python example for mllib LDAModel user guide
 

 Key: SPARK-10032
 URL: https://issues.apache.org/jira/browse/SPARK-10032
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, PySpark
Reporter: Yanbo Liang
Priority: Minor
  Labels: 1.5.0

 Add Python example for mllib LDAModel user guide



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10005) Parquet reader doesn't handle schema merging properly for nested structs

2015-08-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10005:


Assignee: Apache Spark  (was: Cheng Lian)

 Parquet reader doesn't handle schema merging properly for nested structs
 

 Key: SPARK-10005
 URL: https://issues.apache.org/jira/browse/SPARK-10005
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: Cheng Lian
Assignee: Apache Spark
Priority: Blocker

 Spark shell snippet to reproduce this issue:
 {code}
 import sqlContext.implicits._
 val path = "file:///tmp/foo"
 (0 until 3).map(i => Tuple1((s"a_$i", 
 s"b_$i"))).toDF().coalesce(1).write.mode("overwrite").parquet(path)
 (0 until 3).map(i => Tuple1((s"a_$i", s"b_$i", 
 s"c_$i"))).toDF().coalesce(1).write.mode("append").parquet(path)
 sqlContext.read.option("schemaMerging", "true").parquet(path).show()
 {code}
 Exception:
 {noformat}
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
 stage 39.0 failed 1 times, most recent failure: Lost task 0.0 in stage 39.0 
 (TID 122, localhost): org.apache.parquet.io.ParquetDecodingException: Can not 
 read value at 0 in block -1 in file 
 file:/tmp/foo/part-r-0-ba9dc7cf-3210-4006-9cf7-02c3d57483cd.gz.parquet
 at 
 org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228)
 at 
 org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
 at 
 org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:168)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
 at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
 at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 at 
 scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
 at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
 at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
 at 
 scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
 at scala.collection.AbstractIterator.to(Iterator.scala:1157)
 at 
 scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
 at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
 at 
 scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
 at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
 at 
 org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
 at 
 org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
 at 
 org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1826)
 at 
 org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1826)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
 at org.apache.spark.scheduler.Task.run(Task.scala:88)
 at 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 Caused by: java.lang.ArrayIndexOutOfBoundsException: 2
 at 
 org.apache.spark.sql.execution.datasources.parquet.CatalystRowConverter.getConverter(CatalystRowConverter.scala:136)
 at 
 org.apache.parquet.io.RecordReaderImplementation.init(RecordReaderImplementation.java:269)
 at 
 org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:134)
 at 
 org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:99)
 at 
 org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:154)
 at 
 org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:99)
 at 
 org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:137)
 at 
 org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:208)
 ... 25 more
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10005) Parquet reader doesn't handle schema merging properly for nested structs

2015-08-16 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-10005:
---
Description: 
Spark shell snippet to reproduce this issue (note that both {{DataFrame}}s written 
below contain a single struct column with multiple fields):
{code}
import sqlContext.implicits._

val path = "file:///tmp/foo"

(0 until 3).map(i => Tuple1((s"a_$i", 
s"b_$i"))).toDF().coalesce(1).write.mode("overwrite").parquet(path)
(0 until 3).map(i => Tuple1((s"a_$i", s"b_$i", 
s"c_$i"))).toDF().coalesce(1).write.mode("append").parquet(path)

sqlContext.read.option("schemaMerging", "true").parquet(path).show()
{code}
Exception:
{noformat}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 39.0 failed 1 times, most recent failure: Lost task 0.0 in stage 39.0 
(TID 122, localhost): org.apache.parquet.io.ParquetDecodingException: Can not 
read value at 0 in block -1 in file 
file:/tmp/foo/part-r-0-ba9dc7cf-3210-4006-9cf7-02c3d57483cd.gz.parquet
at 
org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228)
at 
org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
at 
org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:168)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1826)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1826)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 2
at 
org.apache.spark.sql.execution.datasources.parquet.CatalystRowConverter.getConverter(CatalystRowConverter.scala:136)
at 
org.apache.parquet.io.RecordReaderImplementation.init(RecordReaderImplementation.java:269)
at 
org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:134)
at 
org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:99)
at 
org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:154)
at 
org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:99)
at 
org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:137)
at 
org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:208)
... 25 more
{noformat}
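
For comparison, if nested-struct schema merging worked as intended here, one would expect the merged read to expose the wider struct. Below is a hedged sketch of how that could be checked; the expected shape is an assumption inferred from the two writes above, not observed output.

{code}
// Assumed behaviour once fixed: the merged schema unions the struct fields,
// and rows from the first write get null for the extra field.
val merged = sqlContext.read.option("schemaMerging", "true").parquet(path)
merged.printSchema()
merged.show()
{code}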

  was:
Spark shell snippet to reproduce this issue:
{code}
import sqlContext.implicits._

val path = "file:///tmp/foo"

(0 until 3).map(i => Tuple1((s"a_$i", 
s"b_$i"))).toDF().coalesce(1).write.mode("overwrite").parquet(path)
(0 until 3).map(i => Tuple1((s"a_$i", s"b_$i", 
s"c_$i"))).toDF().coalesce(1).write.mode("append").parquet(path)

sqlContext.read.option("schemaMerging", "true").parquet(path).show()
{code}
Exception:
{noformat}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 39.0 failed 1 times, most recent failure: Lost task 0.0 in stage 39.0 
(TID 122, localhost): org.apache.parquet.io.ParquetDecodingException: Can not 
read value at 0 in block -1 in file 
file:/tmp/foo/part-r-0-ba9dc7cf-3210-4006-9cf7-02c3d57483cd.gz.parquet
at 

[jira] [Commented] (SPARK-10005) Parquet reader doesn't handle schema merging properly for nested structs

2015-08-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14698631#comment-14698631
 ] 

Apache Spark commented on SPARK-10005:
--

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/8228

 Parquet reader doesn't handle schema merging properly for nested structs
 

 Key: SPARK-10005
 URL: https://issues.apache.org/jira/browse/SPARK-10005
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Blocker

 Spark shell snippet to reproduce this issue:
 {code}
 import sqlContext.implicits._
 val path = "file:///tmp/foo"
 (0 until 3).map(i => Tuple1((s"a_$i", 
 s"b_$i"))).toDF().coalesce(1).write.mode("overwrite").parquet(path)
 (0 until 3).map(i => Tuple1((s"a_$i", s"b_$i", 
 s"c_$i"))).toDF().coalesce(1).write.mode("append").parquet(path)
 sqlContext.read.option("schemaMerging", "true").parquet(path).show()
 {code}
 Exception:
 {noformat}
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
 stage 39.0 failed 1 times, most recent failure: Lost task 0.0 in stage 39.0 
 (TID 122, localhost): org.apache.parquet.io.ParquetDecodingException: Can not 
 read value at 0 in block -1 in file 
 file:/tmp/foo/part-r-0-ba9dc7cf-3210-4006-9cf7-02c3d57483cd.gz.parquet
 at 
 org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228)
 at 
 org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
 at 
 org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:168)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
 at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
 at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 at 
 scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
 at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
 at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
 at 
 scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
 at scala.collection.AbstractIterator.to(Iterator.scala:1157)
 at 
 scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
 at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
 at 
 scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
 at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
 at 
 org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
 at 
 org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
 at 
 org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1826)
 at 
 org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1826)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
 at org.apache.spark.scheduler.Task.run(Task.scala:88)
 at 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 Caused by: java.lang.ArrayIndexOutOfBoundsException: 2
 at 
 org.apache.spark.sql.execution.datasources.parquet.CatalystRowConverter.getConverter(CatalystRowConverter.scala:136)
 at 
 org.apache.parquet.io.RecordReaderImplementation.init(RecordReaderImplementation.java:269)
 at 
 org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:134)
 at 
 org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:99)
 at 
 org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:154)
 at 
 org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:99)
 at 
 org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:137)
 at 
 org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:208)
 ... 25 more
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10005) Parquet reader doesn't handle schema merging properly for nested structs

2015-08-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10005:


Assignee: Cheng Lian  (was: Apache Spark)

 Parquet reader doesn't handle schema merging properly for nested structs
 

 Key: SPARK-10005
 URL: https://issues.apache.org/jira/browse/SPARK-10005
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Blocker

 Spark shell snippet to reproduce this issue:
 {code}
 import sqlContext.implicits._
 val path = "file:///tmp/foo"
 (0 until 3).map(i => Tuple1((s"a_$i", 
 s"b_$i"))).toDF().coalesce(1).write.mode("overwrite").parquet(path)
 (0 until 3).map(i => Tuple1((s"a_$i", s"b_$i", 
 s"c_$i"))).toDF().coalesce(1).write.mode("append").parquet(path)
 sqlContext.read.option("schemaMerging", "true").parquet(path).show()
 {code}
 Exception:
 {noformat}
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
 stage 39.0 failed 1 times, most recent failure: Lost task 0.0 in stage 39.0 
 (TID 122, localhost): org.apache.parquet.io.ParquetDecodingException: Can not 
 read value at 0 in block -1 in file 
 file:/tmp/foo/part-r-0-ba9dc7cf-3210-4006-9cf7-02c3d57483cd.gz.parquet
 at 
 org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228)
 at 
 org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
 at 
 org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:168)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
 at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
 at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 at 
 scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
 at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
 at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
 at 
 scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
 at scala.collection.AbstractIterator.to(Iterator.scala:1157)
 at 
 scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
 at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
 at 
 scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
 at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
 at 
 org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
 at 
 org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
 at 
 org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1826)
 at 
 org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1826)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
 at org.apache.spark.scheduler.Task.run(Task.scala:88)
 at 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 Caused by: java.lang.ArrayIndexOutOfBoundsException: 2
 at 
 org.apache.spark.sql.execution.datasources.parquet.CatalystRowConverter.getConverter(CatalystRowConverter.scala:136)
 at 
 org.apache.parquet.io.RecordReaderImplementation.init(RecordReaderImplementation.java:269)
 at 
 org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:134)
 at 
 org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:99)
 at 
 org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:154)
 at 
 org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:99)
 at 
 org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:137)
 at 
 org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:208)
 ... 25 more
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10030) Managed memory leak detected when cache table

2015-08-16 Thread wangwei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangwei updated SPARK-10030:

Description: 
I tested the latest spark-1.5.0 in local, standalone, and YARN modes, followed the 
steps below, and hit the errors shown.

1. create table cache_test(id int,  name string) stored as textfile ;
2. load data local inpath 
'SparkSource/sql/hive/src/test/resources/data/files/kv1.txt' into table 
cache_test;
3. cache table test as select * from cache_test distribute by id;

configuration:
spark.driver.memory  5g
spark.executor.memory   28g
spark.cores.max  21

15/08/16 17:14:47 ERROR Executor: Managed memory leak detected; size = 67108864 
bytes, TID = 434
15/08/16 17:14:47 ERROR Executor: Exception in task 23.0 in stage 9.0 (TID 434)
java.util.NoSuchElementException: key not found: val_54
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:58)
at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
at 
org.apache.spark.sql.columnar.compression.DictionaryEncoding$Encoder.compress(compressionSchemes.scala:258)
at 
org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.build(CompressibleColumnBuilder.scala:110)
at 
org.apache.spark.sql.columnar.NativeColumnBuilder.build(ColumnBuilder.scala:87)
at 
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:153)
at 
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:153)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at 
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:153)
at 
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:120)
at 
org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:278)
at 
org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:46)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)


  was:
I tested the latest spark-1.5.0 in local, standalone, and YARN modes, followed the 
steps below, and hit the errors shown.

1. create table cache_test(id int,  name string) stored as textfile ;
2. load data local inpath 
'SparkSource/sql/hive/src/test/resources/data/files/kv1.txt' into table 
cache_test;
3. cache table test as select * from cache_test distribute by id;

15/08/16 17:14:47 ERROR Executor: Managed memory leak detected; size = 67108864 
bytes, TID = 434
15/08/16 17:14:47 ERROR Executor: Exception in task 23.0 in stage 9.0 (TID 434)
java.util.NoSuchElementException: key not found: val_54
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:58)
at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
at 
org.apache.spark.sql.columnar.compression.DictionaryEncoding$Encoder.compress(compressionSchemes.scala:258)
at 

[jira] [Commented] (SPARK-8918) Add @since tags to mllib.clustering

2015-08-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14698635#comment-14698635
 ] 

Apache Spark commented on SPARK-8918:
-

User 'XiaoqingWang' has created a pull request for this issue:
https://github.com/apache/spark/pull/8229

 Add @since tags to mllib.clustering
 ---

 Key: SPARK-8918
 URL: https://issues.apache.org/jira/browse/SPARK-8918
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, MLlib
Reporter: Xiangrui Meng
Priority: Minor
  Labels: starter
   Original Estimate: 2h
  Remaining Estimate: 2h





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7707) User guide and example code for KernelDensity

2015-08-16 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza reassigned SPARK-7707:
-

Assignee: Sandy Ryza

 User guide and example code for KernelDensity
 -

 Key: SPARK-7707
 URL: https://issues.apache.org/jira/browse/SPARK-7707
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, MLlib
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Sandy Ryza





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7707) User guide and example code for KernelDensity

2015-08-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7707:
---

Assignee: Apache Spark

 User guide and example code for KernelDensity
 -

 Key: SPARK-7707
 URL: https://issues.apache.org/jira/browse/SPARK-7707
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, MLlib
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Apache Spark





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7707) User guide and example code for KernelDensity

2015-08-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14698647#comment-14698647
 ] 

Apache Spark commented on SPARK-7707:
-

User 'sryza' has created a pull request for this issue:
https://github.com/apache/spark/pull/8230

 User guide and example code for KernelDensity
 -

 Key: SPARK-7707
 URL: https://issues.apache.org/jira/browse/SPARK-7707
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, MLlib
Affects Versions: 1.4.0
Reporter: Xiangrui Meng





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10030) Managed memory leak detected when cache table

2015-08-16 Thread wangwei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangwei updated SPARK-10030:

Description: 
1. create table cache_test(id int,  name string) stored as textfile ;
2. load data local inpath 
'${spark}/sql/hive/src/test/resources/data/files/kv1.txt' into table cache_test;
3. cache table test as select * from cache_test distribute by id;

15/08/16 17:14:47 ERROR Executor: Managed memory leak detected; size = 67108864 
bytes, TID = 434
15/08/16 17:14:47 ERROR Executor: Exception in task 23.0 in stage 9.0 (TID 434)
java.util.NoSuchElementException: key not found: val_54
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:58)
at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
at 
org.apache.spark.sql.columnar.compression.DictionaryEncoding$Encoder.compress(compressionSchemes.scala:258)
at 
org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.build(CompressibleColumnBuilder.scala:110)
at 
org.apache.spark.sql.columnar.NativeColumnBuilder.build(ColumnBuilder.scala:87)
at 
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:153)
at 
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:153)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at 
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:153)
at 
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:120)
at 
org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:278)
at 
org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:46)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)


  was:
1. create table cache_test(id int,  name string) stored as textfile ;
2. load data local inpath 
'spark/sql/hive/src/test/resources/data/files/kv1.txt' into table cache_test;
3. cache table test as select * from cache_test distribute by id;

15/08/16 17:14:47 ERROR Executor: Managed memory leak detected; size = 67108864 
bytes, TID = 434
15/08/16 17:14:47 ERROR Executor: Exception in task 23.0 in stage 9.0 (TID 434)
java.util.NoSuchElementException: key not found: val_54
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:58)
at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
at 
org.apache.spark.sql.columnar.compression.DictionaryEncoding$Encoder.compress(compressionSchemes.scala:258)
at 
org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.build(CompressibleColumnBuilder.scala:110)
at 
org.apache.spark.sql.columnar.NativeColumnBuilder.build(ColumnBuilder.scala:87)
at 
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:153)
at 

[jira] [Updated] (SPARK-10030) Managed memory leak detected when cache table

2015-08-16 Thread wangwei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangwei updated SPARK-10030:

Description: 
1. create table cache_test(id int,  name string) stored as textfile ;
2. load data local inpath 
'SparkSource/sql/hive/src/test/resources/data/files/kv1.txt' into table 
cache_test;
3. cache table test as select * from cache_test distribute by id;

15/08/16 17:14:47 ERROR Executor: Managed memory leak detected; size = 67108864 
bytes, TID = 434
15/08/16 17:14:47 ERROR Executor: Exception in task 23.0 in stage 9.0 (TID 434)
java.util.NoSuchElementException: key not found: val_54
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:58)
at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
at 
org.apache.spark.sql.columnar.compression.DictionaryEncoding$Encoder.compress(compressionSchemes.scala:258)
at 
org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.build(CompressibleColumnBuilder.scala:110)
at 
org.apache.spark.sql.columnar.NativeColumnBuilder.build(ColumnBuilder.scala:87)
at 
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:153)
at 
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:153)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at 
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:153)
at 
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:120)
at 
org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:278)
at 
org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:46)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)


  was:
1. create table cache_test(id int,  name string) stored as textfile ;
2. load data local inpath 
'${spark}/sql/hive/src/test/resources/data/files/kv1.txt' into table cache_test;
3. cache table test as select * from cache_test distribute by id;

15/08/16 17:14:47 ERROR Executor: Managed memory leak detected; size = 67108864 
bytes, TID = 434
15/08/16 17:14:47 ERROR Executor: Exception in task 23.0 in stage 9.0 (TID 434)
java.util.NoSuchElementException: key not found: val_54
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:58)
at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
at 
org.apache.spark.sql.columnar.compression.DictionaryEncoding$Encoder.compress(compressionSchemes.scala:258)
at 
org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.build(CompressibleColumnBuilder.scala:110)
at 
org.apache.spark.sql.columnar.NativeColumnBuilder.build(ColumnBuilder.scala:87)
at 
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:153)
at 

[jira] [Commented] (SPARK-10031) Join two UnsafeRows in SortMergeJoin if possible

2015-08-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14698609#comment-14698609
 ] 

Apache Spark commented on SPARK-10031:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/8226

 Join two UnsafeRows in SortMergeJoin if possible
 

 Key: SPARK-10031
 URL: https://issues.apache.org/jira/browse/SPARK-10031
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Liang-Chi Hsieh

 Currently in SortMergeJoin, when two rows from left and right plans are both 
 UnsafeRow, we still use JoinedRow to join them and do an extra 
 UnsafeProjection later.
 We can just use GenerateUnsafeRowJoiner to join two UnsafeRows in 
 SortMergeJoin if possible. In addition, GenerateUnsafeRowJoiner could expose a 
 withRight function that only updates row2 while keeping the same row1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10031) Join two UnsafeRows in SortMergeJoin if possible

2015-08-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10031:


Assignee: (was: Apache Spark)

 Join two UnsafeRows in SortMergeJoin if possible
 

 Key: SPARK-10031
 URL: https://issues.apache.org/jira/browse/SPARK-10031
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Liang-Chi Hsieh

 Currently in SortMergeJoin, when two rows from left and right plans are both 
 UnsafeRow, we still use JoinedRow to join them and do an extra 
 UnsafeProjection later.
 We can just use GenerateUnsafeRowJoiner to join two UnsafeRows in 
 SortMergeJoin if possible. In addition, GenerateUnsafeRowJoiner could expose a 
 withRight function that only updates row2 while keeping the same row1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10031) Join two UnsafeRows in SortMergeJoin if possible

2015-08-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10031:


Assignee: Apache Spark

 Join two UnsafeRows in SortMergeJoin if possible
 

 Key: SPARK-10031
 URL: https://issues.apache.org/jira/browse/SPARK-10031
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Liang-Chi Hsieh
Assignee: Apache Spark

 Currently in SortMergeJoin, when two rows from left and right plans are both 
 UnsafeRow, we still use JoinedRow to join them and do an extra 
 UnsafeProjection later.
 We can just use GenerateUnsafeRowJoiner to join two UnsafeRows in 
 SortMergeJoin if possible. In addition, GenerateUnsafeRowJoiner could expose a 
 withRight function that only updates row2 while keeping the same row1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10031) Join two UnsafeRows in SortMergeJoin if possible

2015-08-16 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-10031:
---

 Summary: Join two UnsafeRows in SortMergeJoin if possible
 Key: SPARK-10031
 URL: https://issues.apache.org/jira/browse/SPARK-10031
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Liang-Chi Hsieh


Currently in SortMergeJoin, when two rows from left and right plans are both 
UnsafeRow, we still use JoinedRow to join them and do an extra UnsafeProjection 
later.

We can just use GenerateUnsafeRowJoiner to join two UnsafeRows in SortMergeJoin 
if possible. In addition, GenerateUnsafeRowJoiner could expose a withRight function 
that only updates row2 while keeping the same row1.
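
For readers unfamiliar with the joiner, the sketch below illustrates the proposed 
path. It relies on internal Catalyst APIs (GenerateUnsafeRowJoiner.create, 
UnsafeProjection.create), so treat the exact signatures as assumptions rather than 
a confirmed design.

{code}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.UnsafeProjection
import org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeRowJoiner
import org.apache.spark.sql.types._

// Schemas of the left and right rows as they arrive in SortMergeJoin.
val leftSchema  = StructType(Seq(StructField("a", IntegerType), StructField("b", LongType)))
val rightSchema = StructType(Seq(StructField("c", IntegerType)))

// Build two UnsafeRows, the representation both join sides may already be in.
val leftRow  = UnsafeProjection.create(leftSchema).apply(InternalRow(1, 2L))
val rightRow = UnsafeProjection.create(rightSchema).apply(InternalRow(3))

// Concatenate them directly instead of wrapping them in a JoinedRow and
// running a second UnsafeProjection over the combined row.
val joiner = GenerateUnsafeRowJoiner.create(leftSchema, rightSchema)
val joined = joiner.join(leftRow, rightRow)  // an UnsafeRow with fields a, b, c
{code}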



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10029) Add Python examples for mllib IsotonicRegression user guide

2015-08-16 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-10029:

Labels: 1.5.0  (was: )

 Add Python examples for mllib IsotonicRegression user guide
 ---

 Key: SPARK-10029
 URL: https://issues.apache.org/jira/browse/SPARK-10029
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, PySpark
Reporter: Yanbo Liang
Priority: Minor
  Labels: 1.5.0

 Add Python examples for mllib IsotonicRegression user guide



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10032) Add Python example for mllib LDAModel user guide

2015-08-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10032:


Assignee: Apache Spark

 Add Python example for mllib LDAModel user guide
 

 Key: SPARK-10032
 URL: https://issues.apache.org/jira/browse/SPARK-10032
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, PySpark
Reporter: Yanbo Liang
Assignee: Apache Spark
Priority: Minor
  Labels: 1.5.0

 Add Python example for mllib LDAModel user guide



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10032) Add Python example for mllib LDAModel user guide

2015-08-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10032:


Assignee: (was: Apache Spark)

 Add Python example for mllib LDAModel user guide
 

 Key: SPARK-10032
 URL: https://issues.apache.org/jira/browse/SPARK-10032
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, PySpark
Reporter: Yanbo Liang
Priority: Minor
  Labels: 1.5.0

 Add Python example for mllib LDAModel user guide



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10032) Add Python example for mllib LDAModel user guide

2015-08-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14698627#comment-14698627
 ] 

Apache Spark commented on SPARK-10032:
--

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/8227

 Add Python example for mllib LDAModel user guide
 

 Key: SPARK-10032
 URL: https://issues.apache.org/jira/browse/SPARK-10032
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, PySpark
Reporter: Yanbo Liang
Priority: Minor
  Labels: 1.5.0

 Add Python example for mllib LDAModel user guide



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9973) Wrong initial size of in-memory columnar buffers

2015-08-16 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-9973.
---
Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/8189

 Wrong initial size of in-memory columnar buffers
 

 Key: SPARK-9973
 URL: https://issues.apache.org/jira/browse/SPARK-9973
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: xukun
Assignee: xukun
 Fix For: 1.5.0


 Too much memory is allocated for in-memory columnar buffers. The 
 {{initialSize}} argument of {{ColumnBuilder.initialize}} is the initial 
 number of rows rather than a number of bytes, but the value passed in by 
 {{InMemoryColumnarTableScan}} is the latter:
 {code}
 // Class InMemoryColumnarTableScan
   val initialBufferSize = columnType.defaultSize * batchSize
   ColumnBuilder(attribute.dataType, initialBufferSize, attribute.name, 
 useCompression)
 {code}
 Then it is converted to a byte size again by multiplying by 
 {{columnType.defaultSize}}:
 {code}
 // Class BasicColumnBuilder
   buffer = ByteBuffer.allocate(4 + size * columnType.defaultSize)
 {code}
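 To make the over-allocation concrete, here is a small standalone sketch (not 
 from the original report; the numbers are purely illustrative, assuming an INT 
 column with a 4-byte default size and a 10,000-row batch):
 {code}
 // Hypothetical numbers for illustration only.
 val defaultSize = 4      // bytes per value for an INT column
 val batchSize   = 10000  // rows per in-memory batch

 // Intended: initialSize is a row count, so roughly 4 + 10000 * 4 bytes (~40 KB).
 val intended = 4 + batchSize * defaultSize

 // Actual: a byte size (defaultSize * batchSize) is passed as the row count,
 // so it is multiplied by defaultSize a second time (~160 KB).
 val actual = 4 + (defaultSize * batchSize) * defaultSize

 println(s"intended = $intended bytes, actual = $actual bytes")
 {code}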



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9973) Wrong initial size of in-memory columnar buffers

2015-08-16 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-9973:
--
Fix Version/s: 1.5.0

 Wrong initial size of in-memory columnar buffers
 

 Key: SPARK-9973
 URL: https://issues.apache.org/jira/browse/SPARK-9973
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: xukun
Assignee: xukun
 Fix For: 1.5.0


 Too much memory is allocated for in-memory columnar buffers. The 
 {{initialSize}} argument of {{ColumnBuilder.initialize}} is the initial 
 number of rows rather than a number of bytes, but the value passed in by 
 {{InMemoryColumnarTableScan}} is the latter:
 {code}
 // Class InMemoryColumnarTableScan
   val initialBufferSize = columnType.defaultSize * batchSize
   ColumnBuilder(attribute.dataType, initialBufferSize, attribute.name, 
 useCompression)
 {code}
 Then it is converted to a byte size again by multiplying by 
 {{columnType.defaultSize}}:
 {code}
 // Class BasicColumnBuilder
   buffer = ByteBuffer.allocate(4 + size * columnType.defaultSize)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10016) ML model broadcasts should be stored in private vars: spark.ml Word2Vec

2015-08-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14698808#comment-14698808
 ] 

Apache Spark commented on SPARK-10016:
--

User 'vinodkc' has created a pull request for this issue:
https://github.com/apache/spark/pull/8233

 ML model broadcasts should be stored in private vars: spark.ml Word2Vec
 ---

 Key: SPARK-10016
 URL: https://issues.apache.org/jira/browse/SPARK-10016
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Reporter: Joseph K. Bradley
Priority: Trivial
  Labels: starter

 See parent for details.  Applies to: spark.ml.feature.Word2Vec



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10016) ML model broadcasts should be stored in private vars: spark.ml Word2Vec

2015-08-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10016:


Assignee: (was: Apache Spark)

 ML model broadcasts should be stored in private vars: spark.ml Word2Vec
 ---

 Key: SPARK-10016
 URL: https://issues.apache.org/jira/browse/SPARK-10016
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Reporter: Joseph K. Bradley
Priority: Trivial
  Labels: starter

 See parent for details.  Applies to: spark.ml.feature.Word2Vec



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10016) ML model broadcasts should be stored in private vars: spark.ml Word2Vec

2015-08-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10016:


Assignee: Apache Spark

 ML model broadcasts should be stored in private vars: spark.ml Word2Vec
 ---

 Key: SPARK-10016
 URL: https://issues.apache.org/jira/browse/SPARK-10016
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Reporter: Joseph K. Bradley
Assignee: Apache Spark
Priority: Trivial
  Labels: starter

 See parent for details.  Applies to: spark.ml.feature.Word2Vec



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9985) DataFrameWriter jdbc method ignore options that have been set

2015-08-16 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14698729#comment-14698729
 ] 

Shixiong Zhu commented on SPARK-9985:
-

I just realized SPARK-8463 didn't fix all problems. You will still encounter a 
`No suitable driver found` error when using DataFrameReader.jdbc or 
DataFrameWriter.jdbc. I opened SPARK-10036 to track this issue since it has a 
different stack trace.

 DataFrameWriter jdbc method ignore options that have been set
 -

 Key: SPARK-9985
 URL: https://issues.apache.org/jira/browse/SPARK-9985
 Project: Spark
  Issue Type: Bug
Reporter: Richard Garris
Assignee: Shixiong Zhu

 I am working on an RDBMS-to-DataFrame conversion using Postgres and am 
 hitting a wall: every time I try to use the PostgreSQL JDBC driver I get 
 a java.sql.SQLException: No suitable driver found error.
 Here is the stack trace:
 {code}
 at java.sql.DriverManager.getConnection(DriverManager.java:596)
 at java.sql.DriverManager.getConnection(DriverManager.java:187)
 at 
 org.apache.spark.sql.jdbc.package$JDBCWriteDetails$.savePartition(jdbc.scala:67)
 at 
 org.apache.spark.sql.jdbc.package$JDBCWriteDetails$$anonfun$saveTable$1.apply(jdbc.scala:189)
 at 
 org.apache.spark.sql.jdbc.package$JDBCWriteDetails$$anonfun$saveTable$1.apply(jdbc.scala:188)
 at 
 org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:878)
 at 
 org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:878)
 at 
 org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1765)
 at 
 org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1765)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
 at org.apache.spark.scheduler.Task.run(Task.scala:70)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 {code}
 It appears that DataFrameWriter and DataFrameReader ignore options that we 
 set before invoking {{jdbc}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10034) Can't analyze Sort on Aggregate with aggregation expression named _aggOrdering

2015-08-16 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-10034:

Description: 
{code=scala}
val df = Seq(1 -> 2).toDF("i", "j")
val query = df.groupBy('i)
  .agg(max('j).as("_aggOrdering"))
  .orderBy(sum('j))
checkAnswer(query, Row(1, 2))
{code}
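
Assuming the failure comes from a clash with an alias the analyzer itself 
generates under the same name (as the issue title suggests), a user-level 
workaround is simply to pick a different alias for the aggregate; a minimal 
sketch, using a hypothetical alias name and a spark-shell style session:

{code}
import org.apache.spark.sql.functions.{max, sum}
import sqlContext.implicits._

// Same query shape as the repro above, but the aggregate alias is not "_aggOrdering".
val df = Seq(1 -> 2).toDF("i", "j")
val query = df.groupBy('i)
  .agg(max('j).as("maxJ"))
  .orderBy(sum('j))
query.show()
{code}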

 Can't analyze Sort on Aggregate with aggregation expression named 
 _aggOrdering
 

 Key: SPARK-10034
 URL: https://issues.apache.org/jira/browse/SPARK-10034
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Wenchen Fan

 {code=scala}
 val df = Seq(1 -> 2).toDF("i", "j")
 val query = df.groupBy('i)
   .agg(max('j).as("_aggOrdering"))
   .orderBy(sum('j))
 checkAnswer(query, Row(1, 2))
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10034) Can't analyze Sort on Aggregate with aggregation expression named _aggOrdering

2015-08-16 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-10034:

Description: 
{code}
val df = Seq(1 -> 2).toDF("i", "j")
val query = df.groupBy('i)
  .agg(max('j).as("_aggOrdering"))
  .orderBy(sum('j))
checkAnswer(query, Row(1, 2))
{code}

  was:
{code=scala}
val df = Seq(1 -> 2).toDF("i", "j")
val query = df.groupBy('i)
  .agg(max('j).as("_aggOrdering"))
  .orderBy(sum('j))
checkAnswer(query, Row(1, 2))
{code}


 Can't analyze Sort on Aggregate with aggregation expression named 
 _aggOrdering
 

 Key: SPARK-10034
 URL: https://issues.apache.org/jira/browse/SPARK-10034
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Wenchen Fan

 {code}
 val df = Seq(1 -> 2).toDF("i", "j")
 val query = df.groupBy('i)
   .agg(max('j).as("_aggOrdering"))
   .orderBy(sum('j))
 checkAnswer(query, Row(1, 2))
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10036) DataFrameReader.json and DataFrameWriter.json don't load the JDBC driver class before creating JDBC connection

2015-08-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14698727#comment-14698727
 ] 

Apache Spark commented on SPARK-10036:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/8232

 DataFrameReader.json and DataFrameWriter.json don't load the JDBC driver 
 class before creating JDBC connection
 --

 Key: SPARK-10036
 URL: https://issues.apache.org/jira/browse/SPARK-10036
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Shixiong Zhu

 Here is the reproduction code and the stack trace:
 {code}
 val url = "jdbc:postgresql://.../mytest"
 import java.util.Properties
 val prop = new Properties()
 prop.put("driver", "org.postgresql.Driver")
 prop.put("user", "...")
 prop.put("password", "...")
 val df = sqlContext.read.jdbc(url, "mytest", prop)
 {code}
 {code}
 java.sql.SQLException: No suitable driver found for 
 jdbc:postgresql://.../mytest
   at java.sql.DriverManager.getConnection(DriverManager.java:689)
   at java.sql.DriverManager.getConnection(DriverManager.java:208)
   at 
 org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:121)
   at 
 org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:91)
   at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:200)
   at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:130)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10036) DataFrameReader.json and DataFrameWriter.json don't load the JDBC driver class before creating JDBC connection

2015-08-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10036:


Assignee: Apache Spark

 DataFrameReader.json and DataFrameWriter.json don't load the JDBC driver 
 class before creating JDBC connection
 --

 Key: SPARK-10036
 URL: https://issues.apache.org/jira/browse/SPARK-10036
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Shixiong Zhu
Assignee: Apache Spark

 Here is the reproduction code and the stack trace:
 {code}
 val url = "jdbc:postgresql://.../mytest"
 import java.util.Properties
 val prop = new Properties()
 prop.put("driver", "org.postgresql.Driver")
 prop.put("user", "...")
 prop.put("password", "...")
 val df = sqlContext.read.jdbc(url, "mytest", prop)
 {code}
 {code}
 java.sql.SQLException: No suitable driver found for 
 jdbc:postgresql://.../mytest
   at java.sql.DriverManager.getConnection(DriverManager.java:689)
   at java.sql.DriverManager.getConnection(DriverManager.java:208)
   at 
 org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:121)
   at 
 org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:91)
   at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:200)
   at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:130)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10036) DataFrameReader.json and DataFrameWriter.json don't load the JDBC driver class before creating JDBC connection

2015-08-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10036:


Assignee: (was: Apache Spark)

 DataFrameReader.json and DataFrameWriter.json don't load the JDBC driver 
 class before creating JDBC connection
 --

 Key: SPARK-10036
 URL: https://issues.apache.org/jira/browse/SPARK-10036
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Shixiong Zhu

 Here is the reproduction code and the stack trace:
 {code}
 val url = "jdbc:postgresql://.../mytest"
 import java.util.Properties
 val prop = new Properties()
 prop.put("driver", "org.postgresql.Driver")
 prop.put("user", "...")
 prop.put("password", "...")
 val df = sqlContext.read.jdbc(url, "mytest", prop)
 {code}
 {code}
 java.sql.SQLException: No suitable driver found for 
 jdbc:postgresql://.../mytest
   at java.sql.DriverManager.getConnection(DriverManager.java:689)
   at java.sql.DriverManager.getConnection(DriverManager.java:208)
   at 
 org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:121)
   at 
 org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:91)
   at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:200)
   at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:130)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10005) Parquet reader doesn't handle schema merging properly for nested structs

2015-08-16 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-10005.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 8228
[https://github.com/apache/spark/pull/8228]

 Parquet reader doesn't handle schema merging properly for nested structs
 

 Key: SPARK-10005
 URL: https://issues.apache.org/jira/browse/SPARK-10005
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Blocker
 Fix For: 1.5.0


 Spark shell snippet to reproduce this issue (note that both {{DataFrame}}s 
 written below contain a single struct column with multiple fields):
 {code}
 import sqlContext.implicits._
 val path = "file:///tmp/foo"
 (0 until 3).map(i => Tuple1((s"a_$i", s"b_$i"))).toDF().coalesce(1).write.mode("overwrite").parquet(path)
 (0 until 3).map(i => Tuple1((s"a_$i", s"b_$i", s"c_$i"))).toDF().coalesce(1).write.mode("append").parquet(path)
 sqlContext.read.option("schemaMerging", "true").parquet(path).show()
 {code}
 Exception:
 {noformat}
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
 stage 39.0 failed 1 times, most recent failure: Lost task 0.0 in stage 39.0 
 (TID 122, localhost): org.apache.parquet.io.ParquetDecodingException: Can not 
 read value at 0 in block -1 in file 
 file:/tmp/foo/part-r-0-ba9dc7cf-3210-4006-9cf7-02c3d57483cd.gz.parquet
 at 
 org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228)
 at 
 org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
 at 
 org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:168)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
 at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
 at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 at 
 scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
 at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
 at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
 at 
 scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
 at scala.collection.AbstractIterator.to(Iterator.scala:1157)
 at 
 scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
 at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
 at 
 scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
 at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
 at 
 org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
 at 
 org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
 at 
 org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1826)
 at 
 org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1826)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
 at org.apache.spark.scheduler.Task.run(Task.scala:88)
 at 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 Caused by: java.lang.ArrayIndexOutOfBoundsException: 2
 at 
 org.apache.spark.sql.execution.datasources.parquet.CatalystRowConverter.getConverter(CatalystRowConverter.scala:136)
 at 
 org.apache.parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:269)
 at 
 org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:134)
 at 
 org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:99)
 at 
 org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:154)
 at 
 org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:99)
 at 
 org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:137)
 at 
 org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:208)
 ... 25 more
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For 

[jira] [Created] (SPARK-10033) Sort on

2015-08-16 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-10033:
---

 Summary: Sort on 
 Key: SPARK-10033
 URL: https://issues.apache.org/jira/browse/SPARK-10033
 Project: Spark
  Issue Type: Bug
Reporter: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10034) Can't analyze Sort on Aggregate with aggregation expression named _aggOrdering

2015-08-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10034:


Assignee: (was: Apache Spark)

 Can't analyze Sort on Aggregate with aggregation expression named 
 _aggOrdering
 

 Key: SPARK-10034
 URL: https://issues.apache.org/jira/browse/SPARK-10034
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Wenchen Fan

 {code}
 val df = Seq(1 -> 2).toDF("i", "j")
 val query = df.groupBy('i)
   .agg(max('j).as("_aggOrdering"))
   .orderBy(sum('j))
 checkAnswer(query, Row(1, 2))
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10034) Can't analyze Sort on Aggregate with aggregation expression named _aggOrdering

2015-08-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14698701#comment-14698701
 ] 

Apache Spark commented on SPARK-10034:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/8231

 Can't analyze Sort on Aggregate with aggregation expression named 
 _aggOrdering
 

 Key: SPARK-10034
 URL: https://issues.apache.org/jira/browse/SPARK-10034
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Wenchen Fan

 {code}
 val df = Seq(1 -> 2).toDF("i", "j")
 val query = df.groupBy('i)
   .agg(max('j).as("_aggOrdering"))
   .orderBy(sum('j))
 checkAnswer(query, Row(1, 2))
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10034) Can't analyze Sort on Aggregate with aggregation expression named _aggOrdering

2015-08-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10034:


Assignee: Apache Spark

 Can't analyze Sort on Aggregate with aggregation expression named 
 _aggOrdering
 

 Key: SPARK-10034
 URL: https://issues.apache.org/jira/browse/SPARK-10034
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Wenchen Fan
Assignee: Apache Spark

 {code}
 val df = Seq(1 -> 2).toDF("i", "j")
 val query = df.groupBy('i)
   .agg(max('j).as("_aggOrdering"))
   .orderBy(sum('j))
 checkAnswer(query, Row(1, 2))
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9985) DataFrameWriter jdbc method ignore options that have been set

2015-08-16 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14698706#comment-14698706
 ] 

Shixiong Zhu commented on SPARK-9985:
-

BTW, `sqlContext.load` will load the driver class. That's why `write` works 
after `load`.

 DataFrameWriter jdbc method ignore options that have been set
 -

 Key: SPARK-9985
 URL: https://issues.apache.org/jira/browse/SPARK-9985
 Project: Spark
  Issue Type: Bug
Reporter: Richard Garris
Assignee: Shixiong Zhu

 I am working on an RDBMS-to-DataFrame conversion using Postgres and am 
 hitting a wall: every time I try to use the PostgreSQL JDBC driver I get 
 a java.sql.SQLException: No suitable driver found error.
 Here is the stack trace:
 {code}
 at java.sql.DriverManager.getConnection(DriverManager.java:596)
 at java.sql.DriverManager.getConnection(DriverManager.java:187)
 at 
 org.apache.spark.sql.jdbc.package$JDBCWriteDetails$.savePartition(jdbc.scala:67)
 at 
 org.apache.spark.sql.jdbc.package$JDBCWriteDetails$$anonfun$saveTable$1.apply(jdbc.scala:189)
 at 
 org.apache.spark.sql.jdbc.package$JDBCWriteDetails$$anonfun$saveTable$1.apply(jdbc.scala:188)
 at 
 org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:878)
 at 
 org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:878)
 at 
 org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1765)
 at 
 org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1765)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
 at org.apache.spark.scheduler.Task.run(Task.scala:70)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 {code}
 It appears that DataFrameWriter and DataFrameReader ignore options that we 
 set before invoking {{jdbc}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10036) DataFrameReader.json and DataFrameWriter.json don't load the JDBC driver class before creating JDBC connection

2015-08-16 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-10036:


 Summary: DataFrameReader.json and DataFrameWriter.json don't load 
the JDBC driver class before creating JDBC connection
 Key: SPARK-10036
 URL: https://issues.apache.org/jira/browse/SPARK-10036
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Shixiong Zhu


Here is the reproduction code and the stack trace:

{code}
val url = "jdbc:postgresql://.../mytest"
import java.util.Properties

val prop = new Properties()
prop.put("driver", "org.postgresql.Driver")
prop.put("user", "...")
prop.put("password", "...")

val df = sqlContext.read.jdbc(url, "mytest", prop)
{code}

{code}
java.sql.SQLException: No suitable driver found for jdbc:postgresql://.../mytest
at java.sql.DriverManager.getConnection(DriverManager.java:689)
at java.sql.DriverManager.getConnection(DriverManager.java:208)
at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:121)
at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:91)
at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:200)
at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:130)
{code}
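
A common workaround, not part of this report, is to load the driver class 
explicitly so that it registers itself with java.sql.DriverManager before the 
reader opens a connection; this assumes the PostgreSQL driver jar is already on 
the driver's classpath:

{code}
// Force the driver class to load (its static initializer registers it with
// DriverManager), then read as before, reusing url and prop from the snippet above.
Class.forName("org.postgresql.Driver")
val df = sqlContext.read.jdbc(url, "mytest", prop)
{code}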



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10034) Can't analyze Sort on Aggregate with aggregation expression named _aggOrdering

2015-08-16 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-10034:
---

 Summary: Can't analyze Sort on Aggregate with aggregation 
expression named _aggOrdering
 Key: SPARK-10034
 URL: https://issues.apache.org/jira/browse/SPARK-10034
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-10033) Sort on

2015-08-16 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan closed SPARK-10033.
---
Resolution: Invalid

 Sort on 
 

 Key: SPARK-10033
 URL: https://issues.apache.org/jira/browse/SPARK-10033
 Project: Spark
  Issue Type: Bug
Reporter: Wenchen Fan





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10035) Parquet filters does not process EqualNullSafe filter.

2015-08-16 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-10035:


 Summary: Parquet filters does not process EqualNullSafe filter.
 Key: SPARK-10035
 URL: https://issues.apache.org/jira/browse/SPARK-10035
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Hyukjin Kwon
Priority: Minor


This is a follow-up issue to SPARK-9814.

Data sources (after {{selectFilters()}} in 
{{org.apache.spark.sql.execution.datasources.DataSourceStrategy}}) pass 
{{EqualNullSafe}} to {{ParquetRelation}}, but {{ParquetFilters}} for 
{{ParquetRelation}} does not handle this filter.
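
For context, {{EqualNullSafe}} is the data source filter generated for the 
null-safe equality operator. A minimal sketch of a query that would hit this path 
(hypothetical file and column names, not from this report; per the description 
above, the resulting filter is currently not translated into a Parquet predicate, 
so it is not pushed down):

{code}
import sqlContext.implicits._
import org.apache.spark.sql.functions.lit

// "<=>" is null-safe equality; for a Parquet-backed DataFrame this predicate is
// offered to the relation as org.apache.spark.sql.sources.EqualNullSafe.
val matched = sqlContext.read.parquet("/tmp/people.parquet")
  .filter($"name" <=> lit("Alice"))
matched.show()
{code}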




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9985) DataFrameWriter jdbc method ignore options that have been set

2015-08-16 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-9985.
-
  Resolution: Fixed
Target Version/s:   (was: 1.5.0)

 DataFrameWriter jdbc method ignore options that have been set
 -

 Key: SPARK-9985
 URL: https://issues.apache.org/jira/browse/SPARK-9985
 Project: Spark
  Issue Type: Bug
Reporter: Richard Garris
Assignee: Shixiong Zhu

 I am working on an RDBMS-to-DataFrame conversion using Postgres and am 
 hitting a wall: every time I try to use the PostgreSQL JDBC driver I get 
 a java.sql.SQLException: No suitable driver found error.
 Here is the stack trace:
 {code}
 at java.sql.DriverManager.getConnection(DriverManager.java:596)
 at java.sql.DriverManager.getConnection(DriverManager.java:187)
 at 
 org.apache.spark.sql.jdbc.package$JDBCWriteDetails$.savePartition(jdbc.scala:67)
 at 
 org.apache.spark.sql.jdbc.package$JDBCWriteDetails$$anonfun$saveTable$1.apply(jdbc.scala:189)
 at 
 org.apache.spark.sql.jdbc.package$JDBCWriteDetails$$anonfun$saveTable$1.apply(jdbc.scala:188)
 at 
 org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:878)
 at 
 org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:878)
 at 
 org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1765)
 at 
 org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1765)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
 at org.apache.spark.scheduler.Task.run(Task.scala:70)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 {code}
 It appears that DataFrameWriter and DataFrameReader ignore options that we 
 set before invoking {{jdbc}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9985) DataFrameWriter jdbc method ignores options that have been set

2015-08-16 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14698702#comment-14698702
 ] 

Shixiong Zhu commented on SPARK-9985:
-

[~rlgarris_databricks] I think this has been fixed in 1.4.1 by SPARK-8463.
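
For completeness, a hedged usage sketch (not the reporter's or the fix's actual code; the URL, table name, and credentials are illustrative, and {{df}} stands for the DataFrame being written) of passing the driver class explicitly through the connection {{Properties}} accepted by {{DataFrameWriter.jdbc}}:

{code}
// Hedged sketch: naming the driver class in the connection properties avoids
// relying solely on DriverManager auto-discovery on the executors.
import java.util.Properties

val props = new Properties()
props.setProperty("user", "postgres")
props.setProperty("password", "secret")
props.setProperty("driver", "org.postgresql.Driver")

df.write.jdbc("jdbc:postgresql://dbhost:5432/mydb", "target_table", props)
{code}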

 DataFrameWriter jdbc method ignores options that have been set
 -

 Key: SPARK-9985
 URL: https://issues.apache.org/jira/browse/SPARK-9985
 Project: Spark
  Issue Type: Bug
Reporter: Richard Garris
Assignee: Shixiong Zhu

 I am working on an RDBMS-to-DataFrame conversion using Postgres and am 
 hitting a wall: every time I try to use the PostgreSQL JDBC driver, I get a 
 {{java.sql.SQLException: No suitable driver found}} error.
 Here is the stack trace:
 {code}
 at java.sql.DriverManager.getConnection(DriverManager.java:596)
 at java.sql.DriverManager.getConnection(DriverManager.java:187)
 at 
 org.apache.spark.sql.jdbc.package$JDBCWriteDetails$.savePartition(jdbc.scala:67)
 at 
 org.apache.spark.sql.jdbc.package$JDBCWriteDetails$$anonfun$saveTable$1.apply(jdbc.scala:189)
 at 
 org.apache.spark.sql.jdbc.package$JDBCWriteDetails$$anonfun$saveTable$1.apply(jdbc.scala:188)
 at 
 org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:878)
 at 
 org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:878)
 at 
 org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1765)
 at 
 org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1765)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
 at org.apache.spark.scheduler.Task.run(Task.scala:70)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 {code}
 It appears that DataFrameWriter and DataFrameReader ignore options that we 
 set before invoking {{jdbc}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9760) SparkSubmit doesn't work with --packages when --repositories is not specified

2015-08-16 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman reassigned SPARK-9760:


Assignee: Shivaram Venkataraman

 SparkSubmit doesn't work with --packages when --repositories is not specified 
 --

 Key: SPARK-9760
 URL: https://issues.apache.org/jira/browse/SPARK-9760
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.5.0
Reporter: Shivaram Venkataraman
Assignee: Shivaram Venkataraman
Priority: Blocker
 Fix For: 1.5.0


 Running `./bin/sparkR --packages com.databricks:spark-csv_2.10:1.2.0` gives
 {code}
 Exception in thread main java.lang.NullPointerException
 at 
 org.apache.spark.deploy.SparkSubmitUtils$.createRepoResolvers(SparkSubmit.scala:812)
 at 
 org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:962)
 at 
 org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:286)
 at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:153)
 at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
 at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
 {code}
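
 A minimal illustrative sketch (not Spark's actual {{SparkSubmitUtils}} code; the helper 
 name is made up) of the defensive handling that avoids this kind of NPE, treating an 
 unspecified {{--repositories}} value as an empty list instead of dereferencing null:
 {code}
 // Illustrative helper only: an absent --repositories flag becomes an empty
 // sequence rather than a null that later fails inside createRepoResolvers.
 def extraRepositories(remoteRepos: Option[String]): Seq[String] =
   remoteRepos
     .map(_.split(",").map(_.trim).filter(_.nonEmpty).toSeq)
     .getOrElse(Seq.empty)

 // extraRepositories(None)                                   // Seq()
 // extraRepositories(Some("https://a.repo, https://b.repo")) // Seq("https://a.repo", "https://b.repo")
 {code}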



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9760) SparkSubmit doesn't work with --packages when --repositories is not specified

2015-08-16 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-9760.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

 SparkSubmit doesn't work with --packages when --repositories is not specified 
 --

 Key: SPARK-9760
 URL: https://issues.apache.org/jira/browse/SPARK-9760
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.5.0
Reporter: Shivaram Venkataraman
Assignee: Shivaram Venkataraman
Priority: Blocker
 Fix For: 1.5.0


 Running `./bin/sparkR --packages com.databricks:spark-csv_2.10:1.2.0` gives
 {code}
 Exception in thread main java.lang.NullPointerException
 at 
 org.apache.spark.deploy.SparkSubmitUtils$.createRepoResolvers(SparkSubmit.scala:812)
 at 
 org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:962)
 at 
 org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:286)
 at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:153)
 at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
 at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7837) NPE when save as parquet in speculative tasks

2015-08-16 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14698767#comment-14698767
 ] 

Cheng Lian commented on SPARK-7837:
---

Just a note for people who want to reproduce this issue:

# You need to start a Spark cluster with at least two workers running on two 
distinct nodes. Speculation isn't enabled when running in local mode or on a 
single-node cluster. If you only have a single machine, you'll probably have to 
resort to VMs.
# Don't forget to set {{spark.speculation}} to {{true}} (it's {{false}} by 
default); a minimal configuration sketch follows below.
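
A minimal configuration sketch for the second step ({{spark.speculation}} is a standard configuration key; the application name below is illustrative):

{code}
// Enable speculative execution when building the SparkConf; it defaults to false.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("spark-7837-repro")   // illustrative app name
  .set("spark.speculation", "true")
val sc = new SparkContext(conf)
{code}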

 NPE when save as parquet in speculative tasks
 -

 Key: SPARK-7837
 URL: https://issues.apache.org/jira/browse/SPARK-7837
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Yin Huai
Assignee: Cheng Lian
Priority: Critical

 The query is like {{df.orderBy(...).saveAsTable(...)}}.
 When there are no partitioning columns and there is a skewed key, I found the 
 following exception in speculative tasks. After these failures, it seems we 
 could not call {{SparkHadoopMapRedUtil.commitTask}} correctly.
 {code}
 java.lang.NullPointerException
   at 
 parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:146)
   at 
 parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:112)
   at parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:73)
   at 
 org.apache.spark.sql.parquet.ParquetOutputWriter.close(newParquet.scala:115)
   at 
 org.apache.spark.sql.sources.DefaultWriterContainer.abortTask(commands.scala:385)
   at 
 org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.org$apache$spark$sql$sources$InsertIntoHadoopFsRelation$$writeRows$1(commands.scala:150)
   at 
 org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:122)
   at 
 org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:122)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
   at org.apache.spark.scheduler.Task.run(Task.scala:70)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


