[jira] [Commented] (SPARK-22126) Fix model-specific optimization support for ML tuning

2017-12-07 Thread Bago Amirbekian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16283145#comment-16283145
 ] 

Bago Amirbekian commented on SPARK-22126:
-

Joseph, the way I read your comment is that we should support parallelism and 
optimized models, but not parallelism combined with optimized models. I think 
that would cover our current use cases, but I'm wondering if we want to leave 
open the possibility of optimizing parameters like maxIter & maxDepth and 
having those optimized implementations play nicely with parallelism in 
CrossValidator (see the sketch below).

I normally believe in doing the simple thing first and then changing it if 
needed, but that would require adding another public API later.
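As a rough illustration of that second possibility, here is a minimal sketch of how a tuning algorithm could group a param grid by its non-optimized params, so that each group is handled by one thread while the model-optimized param (maxIter here) is swept inside the group. Plain Scala Maps stand in for ParamMap and the helper names are invented for this sketch:

{code}
object GridGroupingSketch {
  type FakeParamMap = Map[String, Any]

  // Group the full grid by the non-optimized params; each group can then be
  // fit by a single thread/callable that sweeps the optimized param internally.
  def groupByNonOptimized(grid: Seq[FakeParamMap],
                          optimizedParams: Set[String]): Seq[Seq[FakeParamMap]] =
    grid.groupBy(m => m.filterKeys(k => !optimizedParams.contains(k)).toMap)
      .values.toSeq

  def main(args: Array[String]): Unit = {
    val grid = for {
      reg  <- Seq(0.1, 0.3)
      iter <- Seq(5, 10)
    } yield Map[String, Any]("regParam" -> reg, "maxIter" -> iter)

    // Prints two groups, one per regParam value, each containing both maxIter settings.
    groupByNonOptimized(grid, Set("maxIter")).foreach(println)
  }
}
{code}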

> Fix model-specific optimization support for ML tuning
> -
>
> Key: SPARK-22126
> URL: https://issues.apache.org/jira/browse/SPARK-22126
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Weichen Xu
>
> Fix model-specific optimization support for ML tuning. This was discussed in 
> SPARK-19357; more discussion is at
>  https://gist.github.com/MrBago/f501b9e7712dc6a67dc9fea24e309bf0
> I copied the discussion from the gist here.
> I propose to design the API as:
> {code}
> def fitCallables(dataset: Dataset[_], paramMaps: Array[ParamMap]): 
> Array[Callable[Map[Int, M]]]
> {code}
> Let me use an example to explain the API:
> {quote}
> It could be possible to still use the current parallelism and still allow 
> for model-specific optimizations. For example, suppose we are doing cross 
> validation with a param grid of regParam = (0.1, 0.3) and maxIter = (5, 10). 
> Let's say the cross validator could know that maxIter is optimized for the 
> model being evaluated (e.g. via a new method in Estimator that returns such 
> params). It would then be straightforward for the cross validator to remove 
> maxIter from the param map that will be parallelized over and use it to 
> create 2 arrays of paramMaps: ((regParam=0.1, maxIter=5), (regParam=0.1, 
> maxIter=10)) and ((regParam=0.3, maxIter=5), (regParam=0.3, maxIter=10)).
> {quote}
> In this example, the models for ((regParam=0.1, maxIter=5), (regParam=0.1, 
> maxIter=10)) can only be computed in one thread, and the models for 
> ((regParam=0.3, maxIter=5), (regParam=0.3, maxIter=10)) in another thread. 
> There are 4 paramMaps, but we can generate at most two threads to compute 
> the models for them.
> The API above allows "callable.call()" to return multiple models. The return 
> type is {code}Map[Int, M]{code}, where the integer key marks the paramMap 
> index of the corresponding model. In the example above there are 4 paramMaps 
> but only 2 callable objects are returned: one for ((regParam=0.1, maxIter=5), 
> (regParam=0.1, maxIter=10)) and another for ((regParam=0.3, maxIter=5), 
> (regParam=0.3, maxIter=10)).
> The default "fitCallables / fit with paramMaps" can be implemented as 
> follows:
> {code}
> def fitCallables(dataset: Dataset[_], paramMaps: Array[ParamMap]):
> Array[Callable[Map[Int, M]]] = {
>   paramMaps.zipWithIndex.map { case (paramMap: ParamMap, index: Int) =>
> new Callable[Map[Int, M]] {
>   override def call(): Map[Int, M] = Map(index -> fit(dataset, paramMap))
> }
>   }
> }
> def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[M] = {
>   fitCallables(dataset, paramMaps).toSeq.map { _.call().toSeq }
>     .flatten.sortBy(_._1).map(_._2)
> }
> {code}
> If we use the API proposed above, the code in 
> [CrossValidator|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala#L149-L159]
> can be changed to:
> {code}
>   val trainingDataset = sparkSession.createDataFrame(training, 
> schema).cache()
>   val validationDataset = sparkSession.createDataFrame(validation, 
> schema).cache()
>   // Fit models in a Future for training in parallel
>   val modelMapFutures = fitCallables(trainingDataset, paramMaps).map { 
> callable =>
>  Future[Map[Int, Model[_]]] {
> val modelMap = callable.call()
> if (collectSubModelsParam) {
>...
> }
> modelMap
>  } (executionContext)
>   }
>   // Unpersist training data only when all models have trained
>   Future.sequence[Model[_], Iterable](modelMapFutures)(implicitly, 
> executionContext)
> .onComplete { _ => trainingDataset.unpersist() } (executionContext)
>   // Evaluate models in a Future that will calculate a metric and allow 
> model to be cleaned up
>   val foldMetricMapFutures = modelMapFutures.map { modelMapFuture =>
> modelMapFuture.map { modelMap =>
>   modelMap.map { case (index: Int, model: Model[_]) =>
> 

[jira] [Assigned] (SPARK-22452) DataSourceV2Options should have getInt, getBoolean, etc.

2017-12-07 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-22452:
---

Assignee: Sunitha Kambhampati

> DataSourceV2Options should have getInt, getBoolean, etc.
> 
>
> Key: SPARK-22452
> URL: https://issues.apache.org/jira/browse/SPARK-22452
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Sunitha Kambhampati
> Fix For: 2.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22452) DataSourceV2Options should have getInt, getBoolean, etc.

2017-12-07 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-22452.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19921
[https://github.com/apache/spark/pull/19921]

> DataSourceV2Options should have getInt, getBoolean, etc.
> 
>
> Key: SPARK-22452
> URL: https://issues.apache.org/jira/browse/SPARK-22452
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
> Fix For: 2.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22187) Update unsaferow format for saved state such that we can set timeouts when state is null

2017-12-07 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-22187:
-
Target Version/s:   (was: 2.3.0)

> Update unsaferow format for saved state such that we can set timeouts when 
> state is null
> 
>
> Key: SPARK-22187
> URL: https://issues.apache.org/jira/browse/SPARK-22187
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>  Labels: release-notes, releasenotes
>
> Currently the group state of the user-defined type is encoded as top-level 
> columns in the UnsafeRows stored in the state store. The timeout timestamp is 
> also saved (when needed) as the last top-level column. Since the groupState 
> is serialized to top-level columns, you cannot save "null" as the value of 
> the state (setting null in all the top-level columns is not equivalent). 
> So we don't let the user set the timeout without initializing the state for 
> a key. Based on user experience, this leads to confusion. 
> This JIRA is to change the row format such that the state is saved as nested 
> columns. This would allow the state to be set to null and avoid these 
> confusing corner cases.
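To make the row-format change described above concrete, here is an illustrative sketch of the two layouts being contrasted, written with Spark SQL's StructType; the state field names are hypothetical placeholders:

{code}
import org.apache.spark.sql.types._

// Current layout: state fields flattened to top-level columns. There is no way
// to distinguish "state is null" from "all state fields happen to be null".
val flattenedStateSchema = StructType(Seq(
  StructField("stateFieldA", IntegerType),
  StructField("stateFieldB", StringType),
  StructField("timeoutTimestamp", LongType)))

// Proposed layout: the whole state nested under one nullable struct column,
// so the state can be null while the timeout timestamp is still set.
val nestedStateSchema = StructType(Seq(
  StructField("groupState", StructType(Seq(
    StructField("stateFieldA", IntegerType),
    StructField("stateFieldB", StringType))), nullable = true),
  StructField("timeoutTimestamp", LongType)))
{code}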



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22187) Update unsaferow format for saved state such that we can set timeouts when state is null

2017-12-07 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16283097#comment-16283097
 ] 

Shixiong Zhu commented on SPARK-22187:
--

Reverted by https://github.com/apache/spark/pull/19924

> Update unsaferow format for saved state such that we can set timeouts when 
> state is null
> 
>
> Key: SPARK-22187
> URL: https://issues.apache.org/jira/browse/SPARK-22187
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>  Labels: release-notes, releasenotes
>
> Currently the group state of the user-defined type is encoded as top-level 
> columns in the UnsafeRows stored in the state store. The timeout timestamp is 
> also saved (when needed) as the last top-level column. Since the groupState 
> is serialized to top-level columns, you cannot save "null" as the value of 
> the state (setting null in all the top-level columns is not equivalent). 
> So we don't let the user set the timeout without initializing the state for 
> a key. Based on user experience, this leads to confusion. 
> This JIRA is to change the row format such that the state is saved as nested 
> columns. This would allow the state to be set to null and avoid these 
> confusing corner cases.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22647) Docker files for image creation

2017-12-07 Thread Yang Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16283092#comment-16283092
 ] 

Yang Wang commented on SPARK-22647:
---

The current Dockerfiles are based on 
[openjdk:8-alpine|https://hub.docker.com/_/openjdk/], which uses musl libc 
instead of glibc. This will cause a problem if we want to run native libraries 
like Intel MKL for native BLAS support in MLlib. There is also an issue 
(https://github.com/apache-spark-on-k8s/spark/issues/326) reporting a 
compatibility problem with OpenJDK on Alpine. 
Should we add glibc support to the image?

I also commented on the PR: 
https://github.com/apache/spark/pull/19717/files#diff-b5c9c835bf25d23b47c1500a3af1bda3R18

> Docker files for image creation
> ---
>
> Key: SPARK-22647
> URL: https://issues.apache.org/jira/browse/SPARK-22647
> Project: Spark
>  Issue Type: Sub-task
>  Components: Scheduler
>Affects Versions: 2.3.0
>Reporter: Anirudh Ramanathan
>
> This covers the dockerfiles that need to be shipped to enable the Kubernetes 
> backend for Spark.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22737) Simplity OneVsRest transform

2017-12-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16283048#comment-16283048
 ] 

Apache Spark commented on SPARK-22737:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/19927

> Simplity OneVsRest transform
> 
>
> Key: SPARK-22737
> URL: https://issues.apache.org/jira/browse/SPARK-22737
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: zhengruifeng
>
> The current implementation of OneVsRest#transform is over-complicated: it 
> sequentially updates an accumulated column.
> By using a direct UDF for prediction, we obtain a speedup of at least 2x; in 
> an extreme case with 20 classes, it is about a 14x speedup.
> The test code and performance comparison details are in the corresponding PR.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22737) Simplity OneVsRest transform

2017-12-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22737:


Assignee: Apache Spark

> Simplity OneVsRest transform
> 
>
> Key: SPARK-22737
> URL: https://issues.apache.org/jira/browse/SPARK-22737
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: zhengruifeng
>Assignee: Apache Spark
>
> The current implementation of OneVsRest#transform is over-complicated: it 
> sequentially updates an accumulated column.
> By using a direct UDF for prediction, we obtain a speedup of at least 2x; in 
> an extreme case with 20 classes, it is about a 14x speedup.
> The test code and performance comparison details are in the corresponding PR.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22737) Simplity OneVsRest transform

2017-12-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22737:


Assignee: (was: Apache Spark)

> Simplity OneVsRest transform
> 
>
> Key: SPARK-22737
> URL: https://issues.apache.org/jira/browse/SPARK-22737
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: zhengruifeng
>
> The current implementation of OneVsRest#transform is over-complicated: it 
> sequentially updates an accumulated column.
> By using a direct UDF for prediction, we obtain a speedup of at least 2x; in 
> an extreme case with 20 classes, it is about a 14x speedup.
> The test code and performance comparison details are in the corresponding PR.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22737) Simplity OneVsRest transform

2017-12-07 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-22737:


 Summary: Simplity OneVsRest transform
 Key: SPARK-22737
 URL: https://issues.apache.org/jira/browse/SPARK-22737
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.3.0
Reporter: zhengruifeng


The current implementation of OneVsRest#transform is over-complicated: it 
sequentially updates an accumulated column.
By using a direct UDF for prediction, we obtain a speedup of at least 2x; in 
an extreme case with 20 classes, it is about a 14x speedup.

The test code and performance comparison details are in the corresponding PR.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22126) Fix model-specific optimization support for ML tuning

2017-12-07 Thread Weichen Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16283038#comment-16283038
 ] 

Weichen Xu commented on SPARK-22126:


[~josephkb] For the 2nd type: suppose there is a single large Spark job and each 
task of this job fits a model. If the API returns multiple callables, would each 
callable then fetch one model from the Spark job result? Is that what you mean? 
That usage looks a little awkward.
Or could we directly return the type
{code}
RDD[Model[_]]
{code}
instead? Since each task of this Spark job fits a model, directly returning a 
model RDD would be easier to use (and would avoid collecting all the models to 
the driver side, preventing OOM). What do you think?

[~bago.amirbekian] [~tomas.nykodym] Any thoughts?
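A rough sketch of the shape of the RDD-returning alternative floated above, just to show what the signature could look like; this is not an existing Spark API, and the trait and method names are invented:

{code}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Dataset
import org.apache.spark.ml.Model
import org.apache.spark.ml.param.ParamMap

trait DistributedFitSketch[M <: Model[M]] {
  // Existing per-ParamMap fit.
  def fit(dataset: Dataset[_], paramMap: ParamMap): M

  // Hypothetical variant: one Spark task fits one model, and the fitted
  // models stay distributed as an RDD keyed by the ParamMap index instead of
  // being collected to the driver.
  def fitDistributed(dataset: Dataset[_], paramMaps: Array[ParamMap]): RDD[(Int, M)]
}
{code}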

> Fix model-specific optimization support for ML tuning
> -
>
> Key: SPARK-22126
> URL: https://issues.apache.org/jira/browse/SPARK-22126
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Weichen Xu
>
> Fix model-specific optimization support for ML tuning. This was discussed in 
> SPARK-19357; more discussion is at
>  https://gist.github.com/MrBago/f501b9e7712dc6a67dc9fea24e309bf0
> I copied the discussion from the gist here.
> I propose to design the API as:
> {code}
> def fitCallables(dataset: Dataset[_], paramMaps: Array[ParamMap]): 
> Array[Callable[Map[Int, M]]]
> {code}
> Let me use an example to explain the API:
> {quote}
> It could be possible to still use the current parallelism and still allow 
> for model-specific optimizations. For example, suppose we are doing cross 
> validation with a param grid of regParam = (0.1, 0.3) and maxIter = (5, 10). 
> Let's say the cross validator could know that maxIter is optimized for the 
> model being evaluated (e.g. via a new method in Estimator that returns such 
> params). It would then be straightforward for the cross validator to remove 
> maxIter from the param map that will be parallelized over and use it to 
> create 2 arrays of paramMaps: ((regParam=0.1, maxIter=5), (regParam=0.1, 
> maxIter=10)) and ((regParam=0.3, maxIter=5), (regParam=0.3, maxIter=10)).
> {quote}
> In this example, the models for ((regParam=0.1, maxIter=5), (regParam=0.1, 
> maxIter=10)) can only be computed in one thread, and the models for 
> ((regParam=0.3, maxIter=5), (regParam=0.3, maxIter=10)) in another thread. 
> There are 4 paramMaps, but we can generate at most two threads to compute 
> the models for them.
> The API above allows "callable.call()" to return multiple models. The return 
> type is {code}Map[Int, M]{code}, where the integer key marks the paramMap 
> index of the corresponding model. In the example above there are 4 paramMaps 
> but only 2 callable objects are returned: one for ((regParam=0.1, maxIter=5), 
> (regParam=0.1, maxIter=10)) and another for ((regParam=0.3, maxIter=5), 
> (regParam=0.3, maxIter=10)).
> The default "fitCallables / fit with paramMaps" can be implemented as 
> follows:
> {code}
> def fitCallables(dataset: Dataset[_], paramMaps: Array[ParamMap]):
> Array[Callable[Map[Int, M]]] = {
>   paramMaps.zipWithIndex.map { case (paramMap: ParamMap, index: Int) =>
> new Callable[Map[Int, M]] {
>   override def call(): Map[Int, M] = Map(index -> fit(dataset, paramMap))
> }
>   }
> }
> def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[M] = {
>   fitCallables(dataset, paramMaps).toSeq.map { _.call().toSeq }
>     .flatten.sortBy(_._1).map(_._2)
> }
> {code}
> If we use the API proposed above, the code in 
> [CrossValidator|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala#L149-L159]
> can be changed to:
> {code}
>   val trainingDataset = sparkSession.createDataFrame(training, 
> schema).cache()
>   val validationDataset = sparkSession.createDataFrame(validation, 
> schema).cache()
>   // Fit models in a Future for training in parallel
>   val modelMapFutures = fitCallables(trainingDataset, paramMaps).map { 
> callable =>
>  Future[Map[Int, Model[_]]] {
> val modelMap = callable.call()
> if (collectSubModelsParam) {
>...
> }
> modelMap
>  } (executionContext)
>   }
>   // Unpersist training data only when all models have trained
>   Future.sequence[Model[_], Iterable](modelMapFutures)(implicitly, 
> executionContext)
> .onComplete { _ => trainingDataset.unpersist() } (executionContext)
>   // Evaluate models in a Future that will calculate a metric and allow 
> model to be cleaned up
>   val foldMetricMapFutures = modelMapFutures.map { modelMapFuture =>
> modelMapFuture.map { modelMap =>

[jira] [Updated] (SPARK-22688) Upgrade Janino version to 3.0.8

2017-12-07 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-22688:
--
Fix Version/s: 2.1.3

> Upgrade Janino version to 3.0.8
> ---
>
> Key: SPARK-22688
> URL: https://issues.apache.org/jira/browse/SPARK-22688
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
> Fix For: 2.1.3, 2.2.2, 2.3.0
>
>
> [Janino 3.0.8|https://janino-compiler.github.io/janino/changelog.html] 
> includes an important fix to reduce the number of constant pool entries by 
> using {{sipush}} java bytecode.
> * SIPUSH bytecode is not used for short integer constant 
> [#33|https://github.com/janino-compiler/janino/issues/33]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22736) Consider caching decoded dictionaries in VectorizedColumnReader

2017-12-07 Thread Henry Robinson (JIRA)
Henry Robinson created SPARK-22736:
--

 Summary: Consider caching decoded dictionaries in 
VectorizedColumnReader
 Key: SPARK-22736
 URL: https://issues.apache.org/jira/browse/SPARK-22736
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.1
Reporter: Henry Robinson


{{VectorizedColumnReader.decodeDictionaryIds()}} calls {{dictionary.decodeToX}} 
for every dictionary ID encountered in a dict-encoded Parquet page.

The whole idea of dictionary encoding is that a) values are repeated in a page 
and b) the dictionary only contains values that are in a page. So we should be 
able to save some decoding cost by decoding the entire dictionary page once, at 
the cost of using some memory (but theoretically we could discard the encoded 
dictionary, I think), and using the decoded dictionary to populate rows. 

This is particularly true for TIMESTAMP data, which, after SPARK-12297, might 
have a timezone conversion as part of its decoding step.
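A conceptual sketch of the proposed caching, written in Scala for brevity (the real code is the Java VectorizedColumnReader); `decodeToLong` stands in for the per-type dictionary decode call:

{code}
// Decode every dictionary entry once up front...
def materializeDictionary(dictionarySize: Int, decodeToLong: Int => Long): Array[Long] =
  Array.tabulate(dictionarySize)(decodeToLong)

// ...then populate the column with plain array lookups instead of calling the
// (possibly expensive, e.g. timezone-converting) decode once per row.
def fillColumn(dictionaryIds: Array[Int], decoded: Array[Long]): Array[Long] =
  dictionaryIds.map(id => decoded(id))
{code}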



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22660) Use position() and limit() to fix ambiguity issue in scala-2.12

2017-12-07 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16282919#comment-16282919
 ] 

liyunzhang commented on SPARK-22660:


Thanks to [~srowen], [~viirya] and [~hyukjin.kwon] for the review.

> Use position() and limit() to fix ambiguity issue in scala-2.12
> ---
>
> Key: SPARK-22660
> URL: https://issues.apache.org/jira/browse/SPARK-22660
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.2.0
>Reporter: liyunzhang
>Assignee: liyunzhang
>Priority: Minor
> Fix For: 2.3.0
>
>
> Build with Scala 2.12 using the following steps:
> 1. Change the pom.xml to Scala 2.12:
>  ./dev/change-scala-version.sh 2.12
> 2. Build with -Pscala-2.12.
> For Hive on Spark:
> {code}
> ./dev/make-distribution.sh   --tgz -Pscala-2.12 -Phadoop-2.7  -Pyarn 
> -Pparquet-provided -Dhadoop.version=2.7.3
> {code}
> For Spark SQL:
> {code}
> ./dev/make-distribution.sh  --tgz -Pscala-2.12 -Phadoop-2.7  -Pyarn -Phive 
> -Dhadoop.version=2.7.3>log.sparksql 2>&1
> {code}
> This produces the following errors:
> #Error1
> {code}
> /common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java:172: 
> error: cannot find   symbol
> Cleaner cleaner = Cleaner.create(buffer, () -> freeMemory(memory));
> {code}
> This is because sun.misc.Cleaner has been moved to a new location in JDK 9. 
> HADOOP-12760 will be the long-term fix.
> #Error2
> {code}
> spark_source/core/src/main/scala/org/apache/spark/executor/Executor.scala:455:
>  ambiguous reference to overloaded definition, method limit in class 
> ByteBuffer of type (x$1: Int)java.nio.ByteBuffer
> method limit in class Buffer of type ()Int
> match expected type ?
>  val resultSize = serializedDirectResult.limit
> error 
> {code}
> The limit method was moved from ByteBuffer to the superclass Buffer and it 
> can no longer be called without (). The same applies to the position method.
> #Error3
> {code}
> home/zly/prj/oss/jdk9_HOS_SOURCE/spark_source/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/ScriptTransformationExec.scala:415:
>  ambiguous reference to overloaded definition, [error] both method putAll in 
> class Properties of type (x$1: java.util.Map[_, _])Unit [error] and  method 
> putAll in class Hashtable of type (x$1: java.util.Map[_ <: Object, _ <: 
> Object])Unit [error] match argument types (java.util.Map[String,String])
>  [error] properties.putAll(propsMap.asJava)
>  [error]^
> [error] 
> /home/zly/prj/oss/jdk9_HOS_SOURCE/spark_source/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/ScriptTransformationExec.scala:427:
>  ambiguous reference to overloaded definition, [error] both method putAll in 
> class Properties of type (x$1: java.util.Map[_, _])Unit [error] and  method 
> putAll in class Hashtable of type (x$1: java.util.Map[_ <: Object, _ <: 
> Object])Unit [error] match argument types (java.util.Map[String,String])
>  [error]   props.putAll(outputSerdeProps.toMap.asJava)
>  [error] ^
>  {code}
>  This is because the key type is Object instead of String, which is unsafe.
> After fixing these three errors, the build compiles successfully.
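For the Error2-style fixes, the change on the Spark side is mechanical; a minimal sketch of the explicit-parentheses form described above, which compiles cleanly:

{code}
import java.nio.ByteBuffer

val buf: ByteBuffer = ByteBuffer.allocate(16)
// Call the zero-arg getters with explicit parentheses so the call cannot be
// confused with the Int-taking setter overloads.
val resultSize: Int = buf.limit()
val pos: Int = buf.position()
{code}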



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21787) Support for pushing down filters for DateType in native OrcFileFormat

2017-12-07 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-21787.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18995
[https://github.com/apache/spark/pull/18995]

> Support for pushing down filters for DateType in native OrcFileFormat
> -
>
> Key: SPARK-21787
> URL: https://issues.apache.org/jira/browse/SPARK-21787
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Stefan de Koning
> Fix For: 2.3.0
>
>
> See related issue https://issues.apache.org/jira/browse/SPARK-16516
> It seems that DateType should also be pushed down to ORC.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21787) Support for pushing down filters for DateType in native OrcFileFormat

2017-12-07 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-21787:
---

Assignee: Dongjoon Hyun

> Support for pushing down filters for DateType in native OrcFileFormat
> -
>
> Key: SPARK-21787
> URL: https://issues.apache.org/jira/browse/SPARK-21787
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Stefan de Koning
>Assignee: Dongjoon Hyun
> Fix For: 2.3.0
>
>
> See related issue https://issues.apache.org/jira/browse/SPARK-16516
> It seems that DateType should also be pushed down to ORC.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22734) Create Python API for VectorSizeHint

2017-12-07 Thread Bago Amirbekian (JIRA)
Bago Amirbekian created SPARK-22734:
---

 Summary: Create Python API for VectorSizeHint
 Key: SPARK-22734
 URL: https://issues.apache.org/jira/browse/SPARK-22734
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Affects Versions: 2.2.0
Reporter: Bago Amirbekian






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22735) Add VectorSizeHint to ML features documentation

2017-12-07 Thread Bago Amirbekian (JIRA)
Bago Amirbekian created SPARK-22735:
---

 Summary: Add VectorSizeHint to ML features documentation
 Key: SPARK-22735
 URL: https://issues.apache.org/jira/browse/SPARK-22735
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.2.0
Reporter: Bago Amirbekian






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22721) BytesToBytesMap peak memory usage not accurate after reset()

2017-12-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16282566#comment-16282566
 ] 

Apache Spark commented on SPARK-22721:
--

User 'juliuszsompolski' has created a pull request for this issue:
https://github.com/apache/spark/pull/19923

> BytesToBytesMap peak memory usage not accurate after reset()
> 
>
> Key: SPARK-22721
> URL: https://issues.apache.org/jira/browse/SPARK-22721
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Juliusz Sompolski
>Assignee: Juliusz Sompolski
> Fix For: 2.3.0
>
>
> BytesToBytesMap doesn't update peak memory usage before shrinking back to 
> the initial capacity in reset(), so after a disk spill one never knows what 
> the size of the hash table was before spilling.
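As a generic illustration of the fix pattern only (this is not the actual Java class), the peak has to be captured before the backing structure is shrunk or released:

{code}
class PeakTrackingSketch {
  private var currentBytes: Long = 0L
  private var peakBytes: Long = 0L

  def grow(bytes: Long): Unit = {
    currentBytes += bytes
    peakBytes = math.max(peakBytes, currentBytes)
  }

  // Record the peak *before* shrinking back to the initial capacity,
  // otherwise the pre-spill size is lost.
  def reset(): Unit = {
    peakBytes = math.max(peakBytes, currentBytes)
    currentBytes = 0L
  }

  def peakMemoryUsedBytes: Long = peakBytes
}
{code}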



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8418) Add single- and multi-value support to ML Transformers

2017-12-07 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16282796#comment-16282796
 ] 

Joseph K. Bradley commented on SPARK-8418:
--

Agreed; thanks!

> Add single- and multi-value support to ML Transformers
> --
>
> Key: SPARK-8418
> URL: https://issues.apache.org/jira/browse/SPARK-8418
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>
> It would be convenient if all feature transformers supported transforming 
> columns of single values and multiple values, specifically:
> * one column with one value (e.g., type {{Double}})
> * one column with multiple values (e.g., {{Array[Double]}} or {{Vector}})
> We could go as far as supporting multiple columns, but that may not be 
> necessary since VectorAssembler could be used to handle that.
> Estimators under {{ml.feature}} should also support this.
> This will likely require a short design doc to describe:
> * how input and output columns will be specified
> * schema validation
> * code sharing to reduce duplication
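A sketch of the kind of schema check implied by the proposal above, using a hypothetical helper (this is not an existing Spark method): accept either a scalar Double column or a Vector column as input.

{code}
import org.apache.spark.ml.linalg.SQLDataTypes
import org.apache.spark.sql.types.{DoubleType, StructType}

// Hypothetical validation helper: allow one column holding either a single
// value (Double) or multiple values (Vector).
def validateInputCol(schema: StructType, inputCol: String): Unit = {
  val dt = schema(inputCol).dataType
  require(dt == DoubleType || dt == SQLDataTypes.VectorType,
    s"Column $inputCol must be of type DoubleType or Vector, but was $dt")
}
{code}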



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7736) Exception not failing Python applications (in yarn cluster mode)

2017-12-07 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16282764#comment-16282764
 ] 

Marcelo Vanzin commented on SPARK-7736:
---

Make sure all you guys are running apps in cluster mode if you want to see the 
proper status. I just ran a failing pyspark app in cluster mode to double 
check, and all seems fine.

> Exception not failing Python applications (in yarn cluster mode)
> 
>
> Key: SPARK-7736
> URL: https://issues.apache.org/jira/browse/SPARK-7736
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
> Environment: Spark 1.3.1, Yarn 2.7.0, Ubuntu 14.04
>Reporter: Shay Rojansky
>Assignee: Marcelo Vanzin
> Fix For: 1.5.1, 1.6.0
>
>
> It seems that exceptions thrown in Python spark apps after the SparkContext 
> is instantiated don't cause the application to fail, at least in Yarn: the 
> application is marked as SUCCEEDED.
> Note that any exception right before the SparkContext correctly places the 
> application in FAILED state.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22733) refactor StreamExecution for extensibility

2017-12-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16282739#comment-16282739
 ] 

Apache Spark commented on SPARK-22733:
--

User 'joseph-torres' has created a pull request for this issue:
https://github.com/apache/spark/pull/19926

> refactor StreamExecution for extensibility
> --
>
> Key: SPARK-22733
> URL: https://issues.apache.org/jira/browse/SPARK-22733
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Jose Torres
>
> StreamExecution currently mixes together meta-logic (tracking and signalling 
> progress, persistence, reporting) with the core behavior of generating 
> batches and running them. We want to reuse the former but not the latter in 
> continuous execution, so we need to split them up.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22733) refactor StreamExecution for extensibility

2017-12-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22733:


Assignee: Apache Spark

> refactor StreamExecution for extensibility
> --
>
> Key: SPARK-22733
> URL: https://issues.apache.org/jira/browse/SPARK-22733
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Jose Torres
>Assignee: Apache Spark
>
> StreamExecution currently mixes together meta-logic (tracking and signalling 
> progress, persistence, reporting) with the core behavior of generating 
> batches and running them. We want to reuse the former but not the latter in 
> continuous execution, so we need to split them up.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22733) refactor StreamExecution for extensibility

2017-12-07 Thread Jose Torres (JIRA)
Jose Torres created SPARK-22733:
---

 Summary: refactor StreamExecution for extensibility
 Key: SPARK-22733
 URL: https://issues.apache.org/jira/browse/SPARK-22733
 Project: Spark
  Issue Type: Sub-task
  Components: Structured Streaming
Affects Versions: 2.3.0
Reporter: Jose Torres


StreamExecution currently mixes together meta-logic (tracking and signalling 
progress, persistence, reporting) with the core behavior of generating batches 
and running them. We want to reuse the former but not the latter in continuous 
execution, so we need to split them up.
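A speculative sketch of the kind of split described above (the names are made up and this is not the actual refactor): keep the meta-logic in a reusable layer and leave only batch planning and execution to the concrete engine.

{code}
// Reusable meta-logic: progress tracking, offset persistence, reporting.
trait ProgressAndStateManagement {
  def reportProgress(batchId: Long): Unit
  def persistOffsets(batchId: Long): Unit
}

// Engine-specific behavior: how batches (or a continuous stream) are actually run.
trait ExecutionEngine {
  def runActivatedStream(): Unit
}

// Micro-batch and continuous execution would each mix in the shared layer
// while providing their own engine.
abstract class StreamExecutionSketch extends ProgressAndStateManagement with ExecutionEngine
{code}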



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22733) refactor StreamExecution for extensibility

2017-12-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22733:


Assignee: (was: Apache Spark)

> refactor StreamExecution for extensibility
> --
>
> Key: SPARK-22733
> URL: https://issues.apache.org/jira/browse/SPARK-22733
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Jose Torres
>
> StreamExecution currently mixes together meta-logic (tracking and signalling 
> progress, persistence, reporting) with the core behavior of generating 
> batches and running them. We want to reuse the former but not the latter in 
> continuous execution, so we need to split them up.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22126) Fix model-specific optimization support for ML tuning

2017-12-07 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16282699#comment-16282699
 ] 

Joseph K. Bradley commented on SPARK-22126:
---

Continuing the discussion from the gist: Perhaps we're overthinking things when 
worrying about handling multiple interdependent Callables.  I figure there are 
2 main use cases for optimizing model fitting:
* Parallel Spark jobs for fitting multiple models at once
** This is what has been introduced into CrossValidator, TrainValidationSplit 
and OneVsRest for Spark 2.3 already.
** This use case is primarily for _small_ Spark jobs.  E.g., fitting a bunch of 
small models requires a bunch of small jobs.  Each job needs to be 
lightweight/fast in order to get much benefit from running parallel jobs.
* Single Spark jobs for fitting multiple models in a clever, model-specific way
** This is what is used by Deep Learning Pipelines and what we'd like to do 
more of in the future.
** This use case is primarily for _large_ Spark jobs.  E.g., for DLP, the Spark 
job includes a bunch of tasks, and each task is sizable since it fits a model 
in Keras.

Assuming we can reasonably divide the use cases of this fitMultiple() API into 
these 2 types, then we don't need to worry about dependencies between 
Callables.  We only need to worry about dependencies when users use parallelism 
> 1 with the 2nd type of use case, which we can advise against in the 
documentation for the parallelism Param.

What do you think?
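For the first type of use case, a minimal sketch of the bounded-parallelism idea behind the parallelism Param, assuming a caller-supplied fitOne function (this is not the actual CrossValidator code):

{code}
import java.util.concurrent.Executors
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, ExecutionContext, Future}

// Run many small, independent fits on a fixed-size thread pool; each fit
// typically launches its own lightweight Spark jobs.
def fitAllInParallel[P, M](paramMaps: Seq[P], parallelism: Int)(fitOne: P => M): Seq[M] = {
  val pool = Executors.newFixedThreadPool(parallelism)
  implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
  try {
    val futures = paramMaps.map(p => Future(fitOne(p)))
    Await.result(Future.sequence(futures), Duration.Inf)
  } finally {
    pool.shutdown()
  }
}
{code}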

> Fix model-specific optimization support for ML tuning
> -
>
> Key: SPARK-22126
> URL: https://issues.apache.org/jira/browse/SPARK-22126
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Weichen Xu
>
> Fix model-specific optimization support for ML tuning. This was discussed in 
> SPARK-19357; more discussion is at
>  https://gist.github.com/MrBago/f501b9e7712dc6a67dc9fea24e309bf0
> I copied the discussion from the gist here.
> I propose to design the API as:
> {code}
> def fitCallables(dataset: Dataset[_], paramMaps: Array[ParamMap]): 
> Array[Callable[Map[Int, M]]]
> {code}
> Let me use an example to explain the API:
> {quote}
> It could be possible to still use the current parallelism and still allow 
> for model-specific optimizations. For example, suppose we are doing cross 
> validation with a param grid of regParam = (0.1, 0.3) and maxIter = (5, 10). 
> Let's say the cross validator could know that maxIter is optimized for the 
> model being evaluated (e.g. via a new method in Estimator that returns such 
> params). It would then be straightforward for the cross validator to remove 
> maxIter from the param map that will be parallelized over and use it to 
> create 2 arrays of paramMaps: ((regParam=0.1, maxIter=5), (regParam=0.1, 
> maxIter=10)) and ((regParam=0.3, maxIter=5), (regParam=0.3, maxIter=10)).
> {quote}
> In this example, the models for ((regParam=0.1, maxIter=5), (regParam=0.1, 
> maxIter=10)) can only be computed in one thread, and the models for 
> ((regParam=0.3, maxIter=5), (regParam=0.3, maxIter=10)) in another thread. 
> There are 4 paramMaps, but we can generate at most two threads to compute 
> the models for them.
> The API above allows "callable.call()" to return multiple models. The return 
> type is {code}Map[Int, M]{code}, where the integer key marks the paramMap 
> index of the corresponding model. In the example above there are 4 paramMaps 
> but only 2 callable objects are returned: one for ((regParam=0.1, maxIter=5), 
> (regParam=0.1, maxIter=10)) and another for ((regParam=0.3, maxIter=5), 
> (regParam=0.3, maxIter=10)).
> The default "fitCallables / fit with paramMaps" can be implemented as 
> follows:
> {code}
> def fitCallables(dataset: Dataset[_], paramMaps: Array[ParamMap]):
> Array[Callable[Map[Int, M]]] = {
>   paramMaps.zipWithIndex.map { case (paramMap: ParamMap, index: Int) =>
> new Callable[Map[Int, M]] {
>   override def call(): Map[Int, M] = Map(index -> fit(dataset, paramMap))
> }
>   }
> }
> def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[M] = {
>   fitCallables(dataset, paramMaps).toSeq.map { _.call().toSeq }
>     .flatten.sortBy(_._1).map(_._2)
> }
> {code}
> If we use the API proposed above, the code in 
> [CrossValidation|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala#L149-L159]
> can be changed to:
> {code}
>   val trainingDataset = sparkSession.createDataFrame(training, 
> schema).cache()
>   val validationDataset = sparkSession.createDataFrame(validation, 
> schema).cache()
>   // Fit models in a Future for training in parallel
>   val modelMapFutures = 

[jira] [Resolved] (SPARK-22688) Upgrade Janino version to 3.0.8

2017-12-07 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-22688.
---
   Resolution: Fixed
 Assignee: Kazuaki Ishizaki
Fix Version/s: 2.3.0
   2.2.2

Still going to back-port to 2.1.x

> Upgrade Janino version to 3.0.8
> ---
>
> Key: SPARK-22688
> URL: https://issues.apache.org/jira/browse/SPARK-22688
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
> Fix For: 2.2.2, 2.3.0
>
>
> [Janino 3.0.8|https://janino-compiler.github.io/janino/changelog.html] 
> includes an important fix to reduce the number of constant pool entries by 
> using {{sipush}} java bytecode.
> * SIPUSH bytecode is not used for short integer constant 
> [#33|https://github.com/janino-compiler/janino/issues/33]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22187) Update unsaferow format for saved state such that we can set timeouts when state is null

2017-12-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22187:


Assignee: Apache Spark  (was: Tathagata Das)

> Update unsaferow format for saved state such that we can set timeouts when 
> state is null
> 
>
> Key: SPARK-22187
> URL: https://issues.apache.org/jira/browse/SPARK-22187
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Tathagata Das
>Assignee: Apache Spark
>  Labels: release-notes, releasenotes
>
> Currently the group state of the user-defined type is encoded as top-level 
> columns in the UnsafeRows stored in the state store. The timeout timestamp is 
> also saved (when needed) as the last top-level column. Since the groupState 
> is serialized to top-level columns, you cannot save "null" as the value of 
> the state (setting null in all the top-level columns is not equivalent). 
> So we don't let the user set the timeout without initializing the state for 
> a key. Based on user experience, this leads to confusion. 
> This JIRA is to change the row format such that the state is saved as nested 
> columns. This would allow the state to be set to null and avoid these 
> confusing corner cases.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22187) Update unsaferow format for saved state such that we can set timeouts when state is null

2017-12-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16282633#comment-16282633
 ] 

Apache Spark commented on SPARK-22187:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/19924

> Update unsaferow format for saved state such that we can set timeouts when 
> state is null
> 
>
> Key: SPARK-22187
> URL: https://issues.apache.org/jira/browse/SPARK-22187
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>  Labels: release-notes, releasenotes
>
> Currently the group state of the user-defined type is encoded as top-level 
> columns in the UnsafeRows stored in the state store. The timeout timestamp is 
> also saved (when needed) as the last top-level column. Since the groupState 
> is serialized to top-level columns, you cannot save "null" as the value of 
> the state (setting null in all the top-level columns is not equivalent). 
> So we don't let the user set the timeout without initializing the state for 
> a key. Based on user experience, this leads to confusion. 
> This JIRA is to change the row format such that the state is saved as nested 
> columns. This would allow the state to be set to null and avoid these 
> confusing corner cases.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22732) Add DataSourceV2 streaming APIs

2017-12-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22732:


Assignee: Apache Spark

> Add DataSourceV2 streaming APIs
> ---
>
> Key: SPARK-22732
> URL: https://issues.apache.org/jira/browse/SPARK-22732
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Jose Torres
>Assignee: Apache Spark
>
> Structured Streaming APIs are currently tucked in a spark internal package. 
> We need to expose a new version in the DataSourceV2 framework, and add the 
> APIs required for continuous processing.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22732) Add DataSourceV2 streaming APIs

2017-12-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16282634#comment-16282634
 ] 

Apache Spark commented on SPARK-22732:
--

User 'joseph-torres' has created a pull request for this issue:
https://github.com/apache/spark/pull/19925

> Add DataSourceV2 streaming APIs
> ---
>
> Key: SPARK-22732
> URL: https://issues.apache.org/jira/browse/SPARK-22732
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Jose Torres
>
> Structured Streaming APIs are currently tucked in a spark internal package. 
> We need to expose a new version in the DataSourceV2 framework, and add the 
> APIs required for continuous processing.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22732) Add DataSourceV2 streaming APIs

2017-12-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22732:


Assignee: (was: Apache Spark)

> Add DataSourceV2 streaming APIs
> ---
>
> Key: SPARK-22732
> URL: https://issues.apache.org/jira/browse/SPARK-22732
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Jose Torres
>
> Structured Streaming APIs are currently tucked in a spark internal package. 
> We need to expose a new version in the DataSourceV2 framework, and add the 
> APIs required for continuous processing.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22187) Update unsaferow format for saved state such that we can set timeouts when state is null

2017-12-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22187:


Assignee: Tathagata Das  (was: Apache Spark)

> Update unsaferow format for saved state such that we can set timeouts when 
> state is null
> 
>
> Key: SPARK-22187
> URL: https://issues.apache.org/jira/browse/SPARK-22187
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>  Labels: release-notes, releasenotes
>
> Currently the group state of the user-defined type is encoded as top-level 
> columns in the UnsafeRows stored in the state store. The timeout timestamp is 
> also saved (when needed) as the last top-level column. Since the groupState 
> is serialized to top-level columns, you cannot save "null" as the value of 
> the state (setting null in all the top-level columns is not equivalent). 
> So we don't let the user set the timeout without initializing the state for 
> a key. Based on user experience, this leads to confusion. 
> This JIRA is to change the row format such that the state is saved as nested 
> columns. This would allow the state to be set to null and avoid these 
> confusing corner cases.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22187) Update unsaferow format for saved state such that we can set timeouts when state is null

2017-12-07 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16282628#comment-16282628
 ] 

Tathagata Das edited comment on SPARK-22187 at 12/7/17 10:33 PM:
-

I am reverting this because it will break existing streaming pipelines that 
already use mapGroupsWithState. This will be re-applied in the future, after 
we start saving more metadata in checkpoints to signify which version of the 
state row format the existing streaming query is using. Then we can decode the 
old and new formats accordingly.


was (Author: tdas):
I am reverting this because it will break existing streaming pipelines that 
already use mapGroupsWithState

> Update unsaferow format for saved state such that we can set timeouts when 
> state is null
> 
>
> Key: SPARK-22187
> URL: https://issues.apache.org/jira/browse/SPARK-22187
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>  Labels: release-notes, releasenotes
>
> Currently the group state of the user-defined type is encoded as top-level 
> columns in the UnsafeRows stored in the state store. The timeout timestamp is 
> also saved (when needed) as the last top-level column. Since the groupState 
> is serialized to top-level columns, you cannot save "null" as the value of 
> the state (setting null in all the top-level columns is not equivalent). 
> So we don't let the user set the timeout without initializing the state for 
> a key. Based on user experience, this leads to confusion. 
> This JIRA is to change the row format such that the state is saved as nested 
> columns. This would allow the state to be set to null and avoid these 
> confusing corner cases.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-22187) Update unsaferow format for saved state such that we can set timeouts when state is null

2017-12-07 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das reopened SPARK-22187:
---

> Update unsaferow format for saved state such that we can set timeouts when 
> state is null
> 
>
> Key: SPARK-22187
> URL: https://issues.apache.org/jira/browse/SPARK-22187
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>  Labels: release-notes, releasenotes
>
> Currently the group state of the user-defined type is encoded as top-level 
> columns in the UnsafeRows stored in the state store. The timeout timestamp is 
> also saved (when needed) as the last top-level column. Since the groupState 
> is serialized to top-level columns, you cannot save "null" as the value of 
> the state (setting null in all the top-level columns is not equivalent). 
> So we don't let the user set the timeout without initializing the state for 
> a key. Based on user experience, this leads to confusion. 
> This JIRA is to change the row format such that the state is saved as nested 
> columns. This would allow the state to be set to null and avoid these 
> confusing corner cases.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22732) Add DataSourceV2 streaming APIs

2017-12-07 Thread Jose Torres (JIRA)
Jose Torres created SPARK-22732:
---

 Summary: Add DataSourceV2 streaming APIs
 Key: SPARK-22732
 URL: https://issues.apache.org/jira/browse/SPARK-22732
 Project: Spark
  Issue Type: Sub-task
  Components: Structured Streaming
Affects Versions: 2.3.0
Reporter: Jose Torres


Structured Streaming APIs are currently tucked in a spark internal package. We 
need to expose a new version in the DataSourceV2 framework, and add the APIs 
required for continuous processing.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22187) Update unsaferow format for saved state such that we can set timeouts when state is null

2017-12-07 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16282628#comment-16282628
 ] 

Tathagata Das commented on SPARK-22187:
---

I am reverting this because it will break existing streaming pipelines that 
already use mapGroupsWithState.

> Update unsaferow format for saved state such that we can set timeouts when 
> state is null
> 
>
> Key: SPARK-22187
> URL: https://issues.apache.org/jira/browse/SPARK-22187
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>  Labels: release-notes, releasenotes
> Fix For: 2.3.0
>
>
> Currently the group state of the user-defined type is encoded as top-level 
> columns in the UnsafeRows stored in the state store. The timeout timestamp is 
> also saved (when needed) as the last top-level column. Since the groupState 
> is serialized to top-level columns, you cannot save "null" as the value of 
> the state (setting null in all the top-level columns is not equivalent). 
> So we don't let the user set the timeout without initializing the state for 
> a key. Based on user experience, this leads to confusion. 
> This JIRA is to change the row format such that the state is saved as nested 
> columns. This would allow the state to be set to null and avoid these 
> confusing corner cases.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22187) Update unsaferow format for saved state such that we can set timeouts when state is null

2017-12-07 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-22187:
--
Fix Version/s: (was: 2.3.0)

> Update unsaferow format for saved state such that we can set timeouts when 
> state is null
> 
>
> Key: SPARK-22187
> URL: https://issues.apache.org/jira/browse/SPARK-22187
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>  Labels: release-notes, releasenotes
>
> Currently the group state of a user-defined type is encoded as top-level 
> columns in the UnsafeRows stored in the state store. The timeout timestamp is 
> also saved (when needed) as the last top-level column. Since the group state 
> is serialized to top-level columns, you cannot save "null" as the value of the 
> state (setting null in all the top-level columns is not equivalent). 
> So we don't let the user set the timeout without initializing the state for 
> a key. Based on user experience, this leads to confusion. 
> This JIRA is to change the row format such that the state is saved as nested 
> columns. This would allow the state to be set to null, and avoid these 
> confusing corner cases.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22452) DataSourceV2Options should have getInt, getBoolean, etc.

2017-12-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16282252#comment-16282252
 ] 

Apache Spark commented on SPARK-22452:
--

User 'skambha' has created a pull request for this issue:
https://github.com/apache/spark/pull/19921

> DataSourceV2Options should have getInt, getBoolean, etc.
> 
>
> Key: SPARK-22452
> URL: https://issues.apache.org/jira/browse/SPARK-22452
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>
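
For context, a rough sketch of the kind of typed accessors the title asks for; the class name and signatures below are illustrative, not the final DataSourceV2Options API. The idea is typed lookups with defaults on top of the string-to-string option map.

{code}
// Illustrative only; the real DataSourceV2Options API may differ.
class OptionsSketch(options: Map[String, String]) {
  def getInt(key: String, default: Int): Int =
    options.get(key).map(_.trim.toInt).getOrElse(default)

  def getBoolean(key: String, default: Boolean): Boolean =
    options.get(key).map(_.trim.toBoolean).getOrElse(default)

  def getLong(key: String, default: Long): Long =
    options.get(key).map(_.trim.toLong).getOrElse(default)
}

// Usage:
val opts = new OptionsSketch(Map("numPartitions" -> "8", "pushdown" -> "true"))
opts.getInt("numPartitions", 1)    // 8
opts.getBoolean("missing", false)  // false
{code}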




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22688) Upgrade Janino version to 3.0.8

2017-12-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16282433#comment-16282433
 ] 

Apache Spark commented on SPARK-22688:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/19922

> Upgrade Janino version to 3.0.8
> ---
>
> Key: SPARK-22688
> URL: https://issues.apache.org/jira/browse/SPARK-22688
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Kazuaki Ishizaki
>
> [Janino 3.0.8|https://janino-compiler.github.io/janino/changelog.html] 
> includes an important fix that reduces the number of constant pool entries by 
> using the {{sipush}} Java bytecode instruction.
> * SIPUSH bytecode is not used for short integer constants 
> [#33|https://github.com/janino-compiler/janino/issues/33]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22731) Add a test for ROWID type to OracleIntegrationSuite

2017-12-07 Thread Xiao Li (JIRA)
Xiao Li created SPARK-22731:
---

 Summary: Add a test for ROWID type to OracleIntegrationSuite
 Key: SPARK-22731
 URL: https://issues.apache.org/jira/browse/SPARK-22731
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 2.3.0
Reporter: Xiao Li


We need to add a test case to OracleIntegrationSuite to check whether the 
current support for the ROWID type works correctly for Oracle. If not, we also need a fix.
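
As a starting point, here is a sketch of the kind of check such a test could perform. The JDBC URL, credentials, and query below are placeholders, and OracleIntegrationSuite has its own Docker-based fixtures; this only illustrates reading a ROWID column and inspecting how it is mapped.

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").appName("rowid-check").getOrCreate()
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//localhost:1521/xe")         // placeholder
  .option("user", "system").option("password", "oracle")          // placeholders
  .option("dbtable", "(SELECT ROWID AS row_id, dummy FROM dual)") // exposes a ROWID column
  .load()
df.printSchema()  // verify how the ROWID column is mapped and that the read succeeds
df.show()
{code}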



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21866) SPIP: Image support in Spark

2017-12-07 Thread Tomas Nykodym (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16282312#comment-16282312
 ] 

Tomas Nykodym commented on SPARK-21866:
---

I've created a separate ticket to add support for non-integer based images in 
[SPARK-22730|https://issues.apache.org/jira/browse/SPARK-22730]

> SPIP: Image support in Spark
> 
>
> Key: SPARK-21866
> URL: https://issues.apache.org/jira/browse/SPARK-21866
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Timothy Hunter
>Assignee: Ilya Matiach
>  Labels: SPIP
> Fix For: 2.3.0
>
> Attachments: SPIP - Image support for Apache Spark V1.1.pdf
>
>
> h2. Background and motivation
> As Apache Spark is being used more and more in the industry, some new use 
> cases are emerging for different data formats beyond the traditional SQL 
> types or the numerical types (vectors and matrices). Deep Learning 
> applications commonly deal with image processing. A number of projects add 
> some Deep Learning capabilities to Spark (see list below), but they struggle 
> to  communicate with each other or with MLlib pipelines because there is no 
> standard way to represent an image in Spark DataFrames. We propose to 
> federate efforts for representing images in Spark by defining a 
> representation that caters to the most common needs of users and library 
> developers.
> This SPIP proposes a specification to represent images in Spark DataFrames 
> and Datasets (based on existing industrial standards), and an interface for 
> loading sources of images. It is not meant to be a full-fledged image 
> processing library, but rather the core description that other libraries and 
> users can rely on. Several packages already offer various processing 
> facilities for transforming images or doing more complex operations, and each 
> has various design tradeoffs that make them better as standalone solutions.
> This project is a joint collaboration between Microsoft and Databricks, which 
> have been testing this design in two open source packages: MMLSpark and Deep 
> Learning Pipelines.
> The proposed image format is an in-memory, decompressed representation that 
> targets low-level applications. It is significantly more liberal in memory 
> usage than compressed image representations such as JPEG, PNG, etc., but it 
> allows easy communication with popular image processing libraries and has no 
> decoding overhead.
> h2. Target users and personas:
> Data scientists, data engineers, library developers.
> The following libraries define primitives for loading and representing 
> images, and will gain from a common interchange format (in alphabetical 
> order):
> * BigDL
> * DeepLearning4J
> * Deep Learning Pipelines
> * MMLSpark
> * TensorFlow (Spark connector)
> * TensorFlowOnSpark
> * TensorFrames
> * Thunder
> h2. Goals:
> * Simple representation of images in Spark DataFrames, based on pre-existing 
> industrial standards (OpenCV)
> * This format should eventually allow the development of high-performance 
> integration points with image processing libraries such as libOpenCV, Google 
> TensorFlow, CNTK, and other C libraries.
> * The reader should be able to read popular formats of images from 
> distributed sources.
> h2. Non-Goals:
> Images are a versatile medium and encompass a very wide range of formats and 
> representations. This SPIP explicitly aims at the most common use case in the 
> industry currently: multi-channel matrices of binary, int32, int64, float or 
> double data that can fit comfortably in the heap of the JVM:
> * the total size of an image should be restricted to less than 2GB (roughly)
> * the meaning of color channels is application-specific and is not mandated 
> by the standard (in line with the OpenCV standard)
> * specialized formats used in meteorology, the medical field, etc. are not 
> supported
> * this format is specialized to images and does not attempt to solve the more 
> general problem of representing n-dimensional tensors in Spark
> h2. Proposed API changes
> We propose to add a new package in the package structure, under the MLlib 
> project:
> {{org.apache.spark.image}}
> h3. Data format
> We propose to add the following structure:
> imageSchema = StructType([
> * StructField("mode", StringType(), False),
> ** The exact representation of the data.
> ** The values are described in the following OpenCV convention. Basically, 
> the type has both "depth" and "number of channels" info: in particular, type 
> "CV_8UC3" means "3 channel unsigned bytes". BGRA format would be CV_8UC4 
> (value 32 in the table) with the channel order specified by convention.
> ** The exact channel ordering and meaning of each channel is dictated by 
> convention. By default, the order 

[jira] [Created] (SPARK-22730) Add support for non-integer image formats

2017-12-07 Thread Tomas Nykodym (JIRA)
Tomas Nykodym created SPARK-22730:
-

 Summary: Add support for non-integer image formats
 Key: SPARK-22730
 URL: https://issues.apache.org/jira/browse/SPARK-22730
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.3.0
Reporter: Tomas Nykodym


The conversion functions toImage and toNDArray provided by ImageSchema 
currently do not support non-integer image formats. 
Therefore, users who want to work with both integer and floating-point formats 
have to write their own versions.
Related to this problem is the lack of a description of the supported OpenCV 
modes (e.g. number of channels, data type).

This ticket is based on our implementation in spark-deep-learning and aims to 
bring this functionality to ImageSchema. 
To be more specific, we want to: 
1. update the toImage and toNDArray functions to handle float32/float64-based 
images.
See 
https://github.com/tomasatdatabricks/spark-deep-learning/blob/92217afcfdb3f0a42540f396d9018d75ffa6ba7c/python/sparkdl/image/imageIO.py#L61-L87
2. add information about individual OpenCV modes, e.g.
See 
https://github.com/tomasatdatabricks/spark-deep-learning/blob/92217afcfdb3f0a42540f396d9018d75ffa6ba7c/python/sparkdl/image/imageIO.py#L31-L46






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22719) refactor ConstantPropagation

2017-12-07 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-22719.
-
   Resolution: Fixed
 Assignee: Gengliang Wang
Fix Version/s: 2.3.0

> refactor ConstantPropagation
> 
>
> Key: SPARK-22719
> URL: https://issues.apache.org/jira/browse/SPARK-22719
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
> Fix For: 2.3.0
>
>
> The current time complexity of ConstantPropagation is O(n^2), which can be 
> slow when the query is complex.
> Refactor the implementation to O(n) time complexity.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20555) Incorrect handling of Oracle's decimal types via JDBC

2017-12-07 Thread Shankar Kandaswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16282128#comment-16282128
 ] 

Shankar Kandaswamy edited comment on SPARK-20555 at 12/7/17 4:50 PM:
-

[~gfeher]
May I know if the first issue has been resolved?

"1. DECIMAL(1) becomes BooleanType
In Oracle, a DECIMAL(1) can have values from -9 to 9."

I am using Spark 2.2.0 but I am still getting Boolean "false" when the source 
has NUMBER(1) as 0. I want it as 0 without a customSchema. Could you please 
advise?


was (Author: shankarkool):
[~gfeher]
May I know if the first issue has been resolved?

"1. DECIMAL(1) becomes BooleanType
In Oracle, a DECIMAL(1) can have values from -9 to 9."

I am using Spark 2.2.0 but I am still getting Boolean "false" when the source 
has numeric(1) as 0. I want it as 0 without a customSchema. Could you 
please advise?

> Incorrect handling of Oracle's decimal types via JDBC
> -
>
> Key: SPARK-20555
> URL: https://issues.apache.org/jira/browse/SPARK-20555
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Gabor Feher
>Assignee: Gabor Feher
> Fix For: 2.1.2, 2.2.0
>
>
> When querying an Oracle database, Spark maps some Oracle numeric data types 
> to incorrect Catalyst data types:
> 1. DECIMAL(1) becomes BooleanType
> In Oracle, a DECIMAL(1) can have values from -9 to 9.
> In Spark now, values larger than 1 become the boolean value true.
> 2. DECIMAL(3,2) becomes IntegerType
> In Oracle, a DECIMAL(3,2) can have values like 1.23
> In Spark now, digits after the decimal point are dropped.
> 3. DECIMAL(10) becomes IntegerType
> In Oracle, a DECIMAL(10) can have the value 9999999999 (ten nines), which is 
> more than 2^31
> Spark throws an exception: "java.sql.SQLException: Numeric Overflow"
> I think the best solution is to always keep Oracle's decimal types. (In 
> theory we could introduce a FloatType in some case of #2, and fix #3 by only 
> introducing IntegerType for DECIMAL(9). But in my opinion, that would end up 
> complicated and error-prone.)
> Note: I think the above problems were introduced as part of  
> https://github.com/apache/spark/pull/14377
> The main purpose of that PR seems to be converting Spark types to correct 
> Oracle types, and that part seems good to me. But it also adds the inverse 
> conversions. As it turns out in the above examples, that is not possible.
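
For the NUMBER(1) question in the comment above, one possible workaround sketch, assuming a Spark version in which the JDBC customSchema read option is available; the URL, credentials, table, and column names are placeholders.

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").appName("oracle-decimal").getOrCreate()
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/service")  // placeholder
  .option("user", "scott").option("password", "tiger")       // placeholders
  .option("dbtable", "my_table")                             // placeholder
  .option("customSchema", "flag_col DECIMAL(1,0)")           // keep NUMBER(1) numeric instead of Boolean
  .load()
df.printSchema()
{code}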



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22680) SparkSQL scan all partitions when the specified partitions are not exists in parquet formatted table

2017-12-07 Thread Nan Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16282248#comment-16282248
 ] 

Nan Zhu commented on SPARK-22680:
-

How did you observe that Spark scans all partitions? I tried to reproduce it 
but had no luck.

table structure
{code}
zhunan@sparktest:~/testdata1$ ls
count1=1  count1=2  count1=3  _SUCCESS
{code}

query: select * from table1 where count1 = 4

I can see that the log shows: 

17/12/07 17:57:51 INFO PrunedInMemoryFileIndex: Selected 0 partitions out of 0, 
pruned 0 partitions.
17/12/07 17:57:51 TRACE PrunedInMemoryFileIndex: Selected files after partition 
pruning:

No files were selected.
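
If it helps, here is a sketch of how to inspect pruning from the query plan itself; the table and column names follow the reproduction above and are assumptions about your setup.

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").appName("pruning-check").getOrCreate()
spark.sql("SELECT * FROM table1 WHERE count1 = 4").explain(true)
// The FileScan node in the physical plan should list the predicate under
// PartitionFilters; for a value that matches no partition, no files are selected,
// which is consistent with the PrunedInMemoryFileIndex log lines above.
{code}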


> SparkSQL scan all partitions when the specified partitions are not exists in 
> parquet formatted table
> 
>
> Key: SPARK-22680
> URL: https://issues.apache.org/jira/browse/SPARK-22680
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.2.0
> Environment: spark2.0.2 spark2.2.0
>Reporter: Xiaochen Ouyang
>
> 1. spark-sql --master local[2]
> 2. create external table test (id int,name string) partitioned by (country 
> string,province string, day string,hour int) stored as parquet location 
> '/warehouse/test';
> 3. produce data into table test
> 4. select count(1) from test where country = '185' and province = '021' and 
> day = '2017-11-12' and hour = 10; if the 4 filter conditions do not exist 
> in HDFS and the MetaStore [MySQL], this SQL will scan all partitions in table test



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21787) Support for pushing down filters for DateType in native OrcFileFormat

2017-12-07 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-21787:
--
Summary: Support for pushing down filters for DateType in native 
OrcFileFormat  (was: Support for pushing down filters for date types in ORC)

> Support for pushing down filters for DateType in native OrcFileFormat
> -
>
> Key: SPARK-21787
> URL: https://issues.apache.org/jira/browse/SPARK-21787
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Stefan de Koning
>
> See related issue https://issues.apache.org/jira/browse/SPARK-16516
> It seems that DateType should also be pushed down to ORC.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22725) df.select on a Stream is broken, vs a List

2017-12-07 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-22725:
--
Priority: Minor  (was: Major)

> df.select on a Stream is broken, vs a List
> --
>
> Key: SPARK-22725
> URL: https://issues.apache.org/jira/browse/SPARK-22725
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Andrew Ash
>Priority: Minor
>
> See failing test at https://github.com/apache/spark/pull/19917
> Failing:
> {noformat}
>   test("SPARK-ABC123: support select with a splatted stream") {
> val df = spark.createDataFrame(sparkContext.emptyRDD[Row], 
> StructType(List("bar", "foo").map {
>   StructField(_, StringType, false)
> }))
> val allColumns = Stream(df.col("bar"), col("foo"))
> val result = df.select(allColumns : _*)
>   }
> {noformat}
> Succeeds:
> {noformat}
>   test("SPARK-ABC123: support select with a splatted stream") {
> val df = spark.createDataFrame(sparkContext.emptyRDD[Row], 
> StructType(List("bar", "foo").map {
>   StructField(_, StringType, false)
> }))
> val allColumns = Seq(df.col("bar"), col("foo"))
> val result = df.select(allColumns : _*)
>   }
> {noformat}
> After stepping through in a debugger, the difference manifests at 
> https://github.com/apache/spark/blob/8ae004b4602266d1f210e4c1564246d590412c06/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala#L120
> Changing {{seq.map}} to {{seq.toList.map}} causes the test to pass.
> I think there's a very subtle bug here where the {{Seq}} of column names 
> passed into {{select}} is expected to eagerly evaluate when {{.map}} is 
> called on it, even though that's not part of the {{Seq}} contract.
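
A minimal standalone illustration (plain Scala, no Spark) of the laziness difference at play:

{code}
// Stream.map is lazy past the head, so side effects inside map may not have run
// yet when the mapped result is handed off; List.map runs them all eagerly.
val fromList = List("bar", "foo").map { c => println(s"mapping $c"); c }
// prints "mapping bar" and "mapping foo" immediately

val fromStream = Stream("bar", "foo").map { c => println(s"mapping $c"); c }
// prints only "mapping bar"; the tail has not been mapped yet

fromStream.toList  // forces the rest, printing "mapping foo"
{code}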



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20555) Incorrect handling of Oracle's decimal types via JDBC

2017-12-07 Thread Shankar Kandaswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16282128#comment-16282128
 ] 

Shankar Kandaswamy commented on SPARK-20555:


[~gfeher]
May I know if the first issue has been resolved?

"1. DECIMAL(1) becomes BooleanType
In Oracle, a DECIMAL(1) can have values from -9 to 9."

I am using Spark 2.2.0 but I am still getting Boolean "false" when the source 
has numeric(1) as 0. I want it as 0 without a customSchema. Could you 
please advise?

> Incorrect handling of Oracle's decimal types via JDBC
> -
>
> Key: SPARK-20555
> URL: https://issues.apache.org/jira/browse/SPARK-20555
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Gabor Feher
>Assignee: Gabor Feher
> Fix For: 2.1.2, 2.2.0
>
>
> When querying an Oracle database, Spark maps some Oracle numeric data types 
> to incorrect Catalyst data types:
> 1. DECIMAL(1) becomes BooleanType
> In Oracle, a DECIMAL(1) can have values from -9 to 9.
> In Spark now, values larger than 1 become the boolean value true.
> 2. DECIMAL(3,2) becomes IntegerType
> In Oracle, a DECIMAL(3,2) can have values like 1.23
> In Spark now, digits after the decimal point are dropped.
> 3. DECIMAL(10) becomes IntegerType
> In Oracle, a DECIMAL(10) can have the value 9999999999 (ten nines), which is 
> more than 2^31
> Spark throws an exception: "java.sql.SQLException: Numeric Overflow"
> I think the best solution is to always keep Oracle's decimal types. (In 
> theory we could introduce a FloatType in some case of #2, and fix #3 by only 
> introducing IntegerType for DECIMAL(9). But in my opinion, that would end up 
> complicated and error-prone.)
> Note: I think the above problems were introduced as part of  
> https://github.com/apache/spark/pull/14377
> The main purpose of that PR seems to be converting Spark types to correct 
> Oracle types, and that part seems good to me. But it also adds the inverse 
> conversions. As it turns out in the above examples, that is not possible.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22660) Use position() and limit() to fix ambiguity issue in scala-2.12

2017-12-07 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-22660:
-

Assignee: liyunzhang

> Use position() and limit() to fix ambiguity issue in scala-2.12
> ---
>
> Key: SPARK-22660
> URL: https://issues.apache.org/jira/browse/SPARK-22660
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.2.0
>Reporter: liyunzhang
>Assignee: liyunzhang
>Priority: Minor
> Fix For: 2.3.0
>
>
> Build with Scala 2.12 using the following steps:
> 1. change the pom.xml to scala-2.12
>  ./dev/change-scala-version.sh 2.12
> 2. build with -Pscala-2.12
> for hive on spark
> {code}
> ./dev/make-distribution.sh   --tgz -Pscala-2.12 -Phadoop-2.7  -Pyarn 
> -Pparquet-provided -Dhadoop.version=2.7.3
> {code}
> for spark sql
> {code}
> ./dev/make-distribution.sh  --tgz -Pscala-2.12 -Phadoop-2.7  -Pyarn -Phive 
> -Dhadoop.version=2.7.3>log.sparksql 2>&1
> {code}
> We get the following errors:
> #Error1
> {code}
> /common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java:172: 
> error: cannot find   symbol
> Cleaner cleaner = Cleaner.create(buffer, () -> freeMemory(memory));
> {code}
> This is because sun.misc.Cleaner has been moved to a new location in JDK 9. 
> HADOOP-12760 will be the long-term fix.
> #Error2
> {code}
> spark_source/core/src/main/scala/org/apache/spark/executor/Executor.scala:455:
>  ambiguous reference to overloaded definition, method limit in class 
> ByteBuffer of type (x$1: Int)java.nio.ByteBuffer
> method limit in class Buffer of type ()Int
> match expected type ?
>  val resultSize = serializedDirectResult.limit
> error 
> {code}
> The limit method was moved from ByteBuffer to the superclass Buffer and it 
> can no longer be called without (). The same holds for the position method.
> #Error3
> {code}
> home/zly/prj/oss/jdk9_HOS_SOURCE/spark_source/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/ScriptTransformationExec.scala:415:
>  ambiguous reference to overloaded definition, [error] both method putAll in 
> class Properties of type (x$1: java.util.Map[_, _])Unit [error] and  method 
> putAll in class Hashtable of type (x$1: java.util.Map[_ <: Object, _ <: 
> Object])Unit [error] match argument types (java.util.Map[String,String])
>  [error] properties.putAll(propsMap.asJava)
>  [error]^
> [error] 
> /home/zly/prj/oss/jdk9_HOS_SOURCE/spark_source/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/ScriptTransformationExec.scala:427:
>  ambiguous reference to overloaded definition, [error] both method putAll in 
> class Properties of type (x$1: java.util.Map[_, _])Unit [error] and  method 
> putAll in class Hashtable of type (x$1: java.util.Map[_ <: Object, _ <: 
> Object])Unit [error] match argument types (java.util.Map[String,String])
>  [error]   props.putAll(outputSerdeProps.toMap.asJava)
>  [error] ^
>  {code}
>  This is because the key type is Object instead of String, which is unsafe.
> After solving these 3 errors, the build compiles successfully.
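
A tiny standalone example of the style of fix named in the title, which compiles the same way on both Scala versions:

{code}
import java.nio.ByteBuffer

val buf: ByteBuffer = ByteBuffer.allocate(16)
// Call the zero-arg accessors explicitly with () so the reference is not
// ambiguous with the one-arg setter overloads reported above.
val size: Int = buf.limit()     // instead of buf.limit
val pos: Int  = buf.position()  // instead of buf.position
{code}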



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22660) Use position() and limit() to fix ambiguity issue in scala-2.12

2017-12-07 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-22660.
---
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19854
[https://github.com/apache/spark/pull/19854]

> Use position() and limit() to fix ambiguity issue in scala-2.12
> ---
>
> Key: SPARK-22660
> URL: https://issues.apache.org/jira/browse/SPARK-22660
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.2.0
>Reporter: liyunzhang
>Priority: Minor
> Fix For: 2.3.0
>
>
> Build with Scala 2.12 using the following steps:
> 1. change the pom.xml to scala-2.12
>  ./dev/change-scala-version.sh 2.12
> 2. build with -Pscala-2.12
> for hive on spark
> {code}
> ./dev/make-distribution.sh   --tgz -Pscala-2.12 -Phadoop-2.7  -Pyarn 
> -Pparquet-provided -Dhadoop.version=2.7.3
> {code}
> for spark sql
> {code}
> ./dev/make-distribution.sh  --tgz -Pscala-2.12 -Phadoop-2.7  -Pyarn -Phive 
> -Dhadoop.version=2.7.3>log.sparksql 2>&1
> {code}
> We get the following errors:
> #Error1
> {code}
> /common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java:172: 
> error: cannot find   symbol
> Cleaner cleaner = Cleaner.create(buffer, () -> freeMemory(memory));
> {code}
> This is because sun.misc.Cleaner has been moved to a new location in JDK 9. 
> HADOOP-12760 will be the long-term fix.
> #Error2
> {code}
> spark_source/core/src/main/scala/org/apache/spark/executor/Executor.scala:455:
>  ambiguous reference to overloaded definition, method limit in class 
> ByteBuffer of type (x$1: Int)java.nio.ByteBuffer
> method limit in class Buffer of type ()Int
> match expected type ?
>  val resultSize = serializedDirectResult.limit
> error 
> {code}
> The limit method was moved from ByteBuffer to the superclass Buffer and it 
> can no longer be called without (). The same holds for the position method.
> #Error3
> {code}
> home/zly/prj/oss/jdk9_HOS_SOURCE/spark_source/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/ScriptTransformationExec.scala:415:
>  ambiguous reference to overloaded definition, [error] both method putAll in 
> class Properties of type (x$1: java.util.Map[_, _])Unit [error] and  method 
> putAll in class Hashtable of type (x$1: java.util.Map[_ <: Object, _ <: 
> Object])Unit [error] match argument types (java.util.Map[String,String])
>  [error] properties.putAll(propsMap.asJava)
>  [error]^
> [error] 
> /home/zly/prj/oss/jdk9_HOS_SOURCE/spark_source/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/ScriptTransformationExec.scala:427:
>  ambiguous reference to overloaded definition, [error] both method putAll in 
> class Properties of type (x$1: java.util.Map[_, _])Unit [error] and  method 
> putAll in class Hashtable of type (x$1: java.util.Map[_ <: Object, _ <: 
> Object])Unit [error] match argument types (java.util.Map[String,String])
>  [error]   props.putAll(outputSerdeProps.toMap.asJava)
>  [error] ^
>  {code}
>  This is because the key type is Object instead of String, which is unsafe.
> After solving these 3 errors, the build compiles successfully.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21074) Parquet files are read fully even though only count() is requested

2017-12-07 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16281901#comment-16281901
 ] 

Steve Loughran commented on SPARK-21074:


Is there any update on this? 

# I'd like to see if this problem can be replicated with s3a on Hadoop 2.8+.
# If you are seeing this on EMR and the s3:// connector to S3, then it's not 
something the ASF can handle.

FWIW, I think that first GET you see is just the one issued by the open() call; it's 
opening the entire file length from byte 0 before the next seek comes in. It 
doesn't mean the whole file is read, only that the initial GET was for bytes 0-EOF. 
After the seek(), that GET will be aborted and a new read kicked off. S3A now 
postpones the GET until the first read() operation, as the sequence of open + 
seek() is so common.

> Parquet files are read fully even though only count() is requested
> --
>
> Key: SPARK-21074
> URL: https://issues.apache.org/jira/browse/SPARK-21074
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 2.1.0
>Reporter: Michael Spector
>
> I have the following sample code that creates parquet files:
> {code:java}
> val spark = SparkSession.builder()
>   .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", 
> "2")
>   .config("spark.hadoop.parquet.metadata.read.parallelism", "50")
>   .appName("Test Write").getOrCreate()
> val sqc = spark.sqlContext
> import sqc.implicits._
> val random = new scala.util.Random(31L)
> (1465720077 to 1465720077+1000).map(x => Event(x, random.nextString(2)))
>   .toDS()
>   .write
>   .mode(SaveMode.Overwrite)
>   .parquet("s3://my-bucket/test")
> {code}
> Afterwards, I'm trying to read these files with the following code:
> {code:java}
> val spark = SparkSession.builder()
>   .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", 
> "2")
>   .config("spark.hadoop.parquet.metadata.read.parallelism", "50")
>   .config("spark.sql.parquet.filterPushdown", "true")
>   .appName("Test Read").getOrCreate()
> spark.sqlContext.read
>   .option("mergeSchema", "false")
>   .parquet("s3://my-bucket/test")
>   .count()
> {code}
> I've enabled the DEBUG log level to see what requests are actually sent through 
> the S3 API, and I've figured out that in addition to the parquet "footer" retrieval 
> there are requests that ask for the whole file content.
> For example, this is full content request example:
> {noformat}
> 17/06/13 05:46:50 DEBUG wire: http-outgoing-1 >> "GET 
> /test/part-0-b8a8a1b7-0581-401f-b520-27fa9600f35e.snappy.parquet 
> HTTP/1.1[\r][\n]"
> 
> 17/06/13 05:46:50 DEBUG wire: http-outgoing-1 << "Content-Range: bytes 
> 0-7472093/7472094[\r][\n]"
> 
> 17/06/13 05:46:50 DEBUG wire: http-outgoing-1 << "Content-Length: 
> 7472094[\r][\n]"
> {noformat}
> And this is partial request example for footer only:
> {noformat}
> 17/06/13 05:46:50 DEBUG headers: http-outgoing-2 >> GET 
> /test/part-0-b8a8a1b7-0581-401f-b520-27fa9600f35e.snappy.parquet HTTP/1.1
> 
> 17/06/13 05:46:50 DEBUG headers: http-outgoing-2 >> Range: 
> bytes=7472086-7472094
> ...
> 17/06/13 05:46:50 DEBUG wire: http-outgoing-2 << "Content-Length: 8[\r][\n]"
> 
> {noformat}
> Here's what FileScanRDD prints:
> {noformat}
> 17/06/13 05:46:52 INFO FileScanRDD: Reading File path: 
> s3://my-bucket/test/part-4-b8a8a1b7-0581-401f-b520-27fa9600f35e.snappy.parquet,
>  range: 0-7473020, partition values: [empty row]
> 17/06/13 05:46:52 INFO FileScanRDD: Reading File path: 
> s3://my-bucket/test/part-00011-b8a8a1b7-0581-401f-b520-27fa9600f35e.snappy.parquet,
>  range: 0-7472503, partition values: [empty row]
> 17/06/13 05:46:52 INFO FileScanRDD: Reading File path: 
> s3://my-bucket/test/part-6-b8a8a1b7-0581-401f-b520-27fa9600f35e.snappy.parquet,
>  range: 0-7472501, partition values: [empty row]
> 17/06/13 05:46:52 INFO FileScanRDD: Reading File path: 
> s3://my-bucket/test/part-7-b8a8a1b7-0581-401f-b520-27fa9600f35e.snappy.parquet,
>  range: 0-7473104, partition values: [empty row]
> 17/06/13 05:46:52 INFO FileScanRDD: Reading File path: 
> s3://my-bucket/test/part-3-b8a8a1b7-0581-401f-b520-27fa9600f35e.snappy.parquet,
>  range: 0-7472458, partition values: [empty row]
> 17/06/13 05:46:52 INFO FileScanRDD: Reading File path: 
> s3://my-bucket/test/part-00012-b8a8a1b7-0581-401f-b520-27fa9600f35e.snappy.parquet,
>  range: 0-7472594, partition values: [empty row]
> 17/06/13 05:46:52 INFO FileScanRDD: Reading File path: 
> s3://my-bucket/test/part-1-b8a8a1b7-0581-401f-b520-27fa9600f35e.snappy.parquet,
>  range: 0-7472984, partition values: [empty row]
> 17/06/13 05:46:52 INFO FileScanRDD: Reading File path: 
> s3://my-bucket/test/part-00014-b8a8a1b7-0581-401f-b520-27fa9600f35e.snappy.parquet,
>  

[jira] [Assigned] (SPARK-21672) Remove SHS-specific application / attempt data structures

2017-12-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21672:


Assignee: Apache Spark

> Remove SHS-specific application / attempt data structures
> -
>
> Key: SPARK-21672
> URL: https://issues.apache.org/jira/browse/SPARK-21672
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
>Priority: Minor
>
> The SHS has its own view of what applications and attempts look like 
> ({{ApplicationHistoryInfo}} and {{ApplicationAttemptInfo}}, declared in 
> ApplicationHistoryProvider.scala).
> The SHS pages actually use the public API types to represent applications; 
> these types are only used in some internal code paths that should be cleaned 
> up.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21672) Remove SHS-specific application / attempt data structures

2017-12-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21672:


Assignee: (was: Apache Spark)

> Remove SHS-specific application / attempt data structures
> -
>
> Key: SPARK-21672
> URL: https://issues.apache.org/jira/browse/SPARK-21672
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> The SHS has its own view of what applications and attempts look like 
> ({{ApplicationHistoryInfo}} and {{ApplicationAttemptInfo}}, declared in 
> ApplicationHistoryProvider.scala).
> The SHS pages actually use the public API types to represent applications; 
> these types are only used in some internal code paths that should be cleaned 
> up.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21672) Remove SHS-specific application / attempt data structures

2017-12-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16281898#comment-16281898
 ] 

Apache Spark commented on SPARK-21672:
--

User 'smurakozi' has created a pull request for this issue:
https://github.com/apache/spark/pull/19920

> Remove SHS-specific application / attempt data structures
> -
>
> Key: SPARK-21672
> URL: https://issues.apache.org/jira/browse/SPARK-21672
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> The SHS has its own view of what applications and attempts look like 
> ({{ApplicationHistoryInfo}} and {{ApplicationAttemptInfo}}, declared in 
> ApplicationHistoryProvider.scala).
> The SHS pages actually use the public API types to represent applications; 
> these types are only used in some internal code paths that should be cleaned 
> up.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22690) Imputer inherit HasOutputCols

2017-12-07 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-22690.

Resolution: Fixed

> Imputer inherit HasOutputCols
> -
>
> Key: SPARK-22690
> URL: https://issues.apache.org/jira/browse/SPARK-22690
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Trivial
>
> Trait {{HasOutputCols}} was added in SPARK-20542; {{Imputer}} should also 
> inherit it.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22690) Imputer inherit HasOutputCols

2017-12-07 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-22690:
---
Fix Version/s: 2.3.0

> Imputer inherit HasOutputCols
> -
>
> Key: SPARK-22690
> URL: https://issues.apache.org/jira/browse/SPARK-22690
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Trivial
> Fix For: 2.3.0
>
>
> Trait {{HasOutputCols}} was added in SPARK-20542; {{Imputer}} should also 
> inherit it.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22690) Imputer inherit HasOutputCols

2017-12-07 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath reassigned SPARK-22690:
--

Assignee: zhengruifeng

> Imputer inherit HasOutputCols
> -
>
> Key: SPARK-22690
> URL: https://issues.apache.org/jira/browse/SPARK-22690
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Trivial
>
> Trait {{HasOutputCols}} was added in SPARK-20542; {{Imputer}} should also 
> inherit it.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22696) Avoid the generation of useless mutable states by objects functions

2017-12-07 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-22696:
---

Assignee: Marco Gaido

> Avoid the generation of useless mutable states by objects functions
> ---
>
> Key: SPARK-22696
> URL: https://issues.apache.org/jira/browse/SPARK-22696
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
> Fix For: 2.3.0
>
>
> Some of the objects functions define mutable states which are not needed. This 
> is bad because of the well-known issues related to constant pool limits.
> I will submit a PR soon.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22696) Avoid the generation of useless mutable states by objects functions

2017-12-07 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-22696.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19908
[https://github.com/apache/spark/pull/19908]

> Avoid the generation of useless mutable states by objects functions
> ---
>
> Key: SPARK-22696
> URL: https://issues.apache.org/jira/browse/SPARK-22696
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Marco Gaido
> Fix For: 2.3.0
>
>
> Some of the objects functions define mutable states which are not needed. This 
> is bad because of the well-known issues related to constant pool limits.
> I will submit a PR soon.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22699) Avoid the generation of useless mutable states by GenerateSafeProjection

2017-12-07 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-22699.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19914
[https://github.com/apache/spark/pull/19914]

> Avoid the generation of useless mutable states by GenerateSafeProjection
> 
>
> Key: SPARK-22699
> URL: https://issues.apache.org/jira/browse/SPARK-22699
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Marco Gaido
> Fix For: 2.3.0
>
>
> GenerateSafeProjection defines mutable states which are not needed. This 
> is bad because of the well-known issues related to constant pool limits.
> I will submit a PR soon.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22699) Avoid the generation of useless mutable states by GenerateSafeProjection

2017-12-07 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-22699:
---

Assignee: Marco Gaido

> Avoid the generation of useless mutable states by GenerateSafeProjection
> 
>
> Key: SPARK-22699
> URL: https://issues.apache.org/jira/browse/SPARK-22699
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
> Fix For: 2.3.0
>
>
> GenerateSafeProjection defines mutable states which are not needed. This 
> is bad because of the well-known issues related to constant pool limits.
> I will submit a PR soon.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22618) RDD.unpersist can cause fatal exception when used with dynamic allocation

2017-12-07 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-22618:
---

Assignee: Brad

> RDD.unpersist can cause fatal exception when used with dynamic allocation
> -
>
> Key: SPARK-22618
> URL: https://issues.apache.org/jira/browse/SPARK-22618
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Brad
>Assignee: Brad
>Priority: Minor
> Fix For: 2.3.0
>
>
> If you use rdd.unpersist() with dynamic allocation, then an executor can be 
> deallocated while your rdd is being removed, which will throw an uncaught 
> exception that kills your job. 
> I looked into different ways of preventing this error from occurring but 
> couldn't come up with anything that wouldn't require a big change. I propose 
> the best fix is just to catch and log IOExceptions in unpersist() so they 
> don't kill your job. This will match the effective behavior when executors 
> are lost from dynamic allocation in other parts of the code.
> In the worst-case scenario I think this could lead to RDD partitions getting 
> left on executors after they were unpersisted, but this is probably better 
> than the whole job failing. I think in most cases the IOException would be 
> due to the executor dying for some reason, which is effectively the same 
> result as unpersisting the RDD from that executor anyway.
> I noticed this exception in a job that loads a 100GB dataset on a cluster 
> where we use dynamic allocation heavily. Here is the relevant stack trace:
> java.io.IOException: Connection reset by peer
> at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
> at sun.nio.ch.IOUtil.read(IOUtil.java:192)
> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
> at 
> io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:221)
> at 
> io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:899)
> at 
> io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:276)
> at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
> at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
> at java.lang.Thread.run(Thread.java:748)
> Exception in thread "main" org.apache.spark.SparkException: Exception thrown 
> in awaitResult:
> at 
> org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
> at 
> org.apache.spark.storage.BlockManagerMaster.removeRdd(BlockManagerMaster.scala:131)
> at org.apache.spark.SparkContext.unpersistRDD(SparkContext.scala:1806)
> at org.apache.spark.rdd.RDD.unpersist(RDD.scala:217)
> at 
> com.ibm.sparktc.sparkbench.workload.exercise.CacheTest.doWorkload(CacheTest.scala:62)
> at 
> com.ibm.sparktc.sparkbench.workload.Workload$class.run(Workload.scala:40)
> at 
> com.ibm.sparktc.sparkbench.workload.exercise.CacheTest.run(CacheTest.scala:33)
> at 
> com.ibm.sparktc.sparkbench.workload.SuiteKickoff$$anonfun$com$ibm$sparktc$sparkbench$workload$SuiteKickoff$$runSerially$1.apply(SuiteKickoff.scala:78)
> at 
> com.ibm.sparktc.sparkbench.workload.SuiteKickoff$$anonfun$com$ibm$sparktc$sparkbench$workload$SuiteKickoff$$runSerially$1.apply(SuiteKickoff.scala:78)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.immutable.List.foreach(List.scala:381)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.immutable.List.map(List.scala:285)
> at 
> com.ibm.sparktc.sparkbench.workload.SuiteKickoff$.com$ibm$sparktc$sparkbench$workload$SuiteKickoff$$runSerially(SuiteKickoff.scala:78)
> at 
> 

[jira] [Assigned] (SPARK-22712) Use `buildReaderWithPartitionValues` in native OrcFileFormat

2017-12-07 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-22712:
---

Assignee: Dongjoon Hyun

> Use `buildReaderWithPartitionValues` in native OrcFileFormat
> 
>
> Key: SPARK-22712
> URL: https://issues.apache.org/jira/browse/SPARK-22712
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
> Fix For: 2.3.0
>
>
> To support vectorization in native OrcFileFormat, we need to use 
> `buildReaderWithPartitionValues` instead of `buildReader`.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22712) Use `buildReaderWithPartitionValues` in native OrcFileFormat

2017-12-07 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-22712.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19907
[https://github.com/apache/spark/pull/19907]

> Use `buildReaderWithPartitionValues` in native OrcFileFormat
> 
>
> Key: SPARK-22712
> URL: https://issues.apache.org/jira/browse/SPARK-22712
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
> Fix For: 2.3.0
>
>
> To support vectorization in native OrcFileFormat, we need to use 
> `buildReaderWithPartitionValues` instead of `buildReader`.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22729) Add getTruncateQuery to JdbcDialect

2017-12-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16281819#comment-16281819
 ] 

Apache Spark commented on SPARK-22729:
--

User 'danielvdende' has created a pull request for this issue:
https://github.com/apache/spark/pull/19911

> Add getTruncateQuery to JdbcDialect
> ---
>
> Key: SPARK-22729
> URL: https://issues.apache.org/jira/browse/SPARK-22729
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Daniel van der Ende
>
> In order to enable truncate for PostgreSQL databases in Spark JDBC, a change 
> is needed to the query used for truncating a PostgreSQL table. By default, 
> PostgreSQL will automatically truncate any descendant tables if a TRUNCATE 
> query is executed. As this may result in (unwanted) side-effects, the query 
> used for the truncate should be specified separately for PostgreSQL, 
> specifying only to TRUNCATE a single table.
> This will also resolve SPARK-22717
> See PostgreSQL documentation 
> https://www.postgresql.org/docs/current/static/sql-truncate.html
> This change will still not let users truncate a table with cascade enabled 
> (which would also truncate tables with foreign key constraints to the table).
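
A hedged sketch of what the dialect hook could look like; the method name comes from the issue title, and the exact signature merged into the JDBC dialects may differ.

{code}
object PostgresTruncateSketch {
  // PostgreSQL's TRUNCATE ... ONLY restricts truncation to the named table and
  // leaves descendant (inherited) tables untouched, per the documentation linked above.
  def getTruncateQuery(table: String): String = s"TRUNCATE TABLE ONLY $table"
}

// Usage:
PostgresTruncateSketch.getTruncateQuery("public.events")  // "TRUNCATE TABLE ONLY public.events"
{code}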



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22729) Add getTruncateQuery to JdbcDialect

2017-12-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22729:


Assignee: (was: Apache Spark)

> Add getTruncateQuery to JdbcDialect
> ---
>
> Key: SPARK-22729
> URL: https://issues.apache.org/jira/browse/SPARK-22729
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Daniel van der Ende
>
> In order to enable truncate for PostgreSQL databases in Spark JDBC, a change 
> is needed to the query used for truncating a PostgreSQL table. By default, 
> PostgreSQL will automatically truncate any descendant tables if a TRUNCATE 
> query is executed. As this may result in (unwanted) side-effects, the query 
> used for the truncate should be specified separately for PostgreSQL, 
> specifying only to TRUNCATE a single table.
> This will also resolve SPARK-22717
> See PostgreSQL documentation 
> https://www.postgresql.org/docs/current/static/sql-truncate.html
> This change will still not let users truncate a table with cascade enabled 
> (which would also truncate tables with foreign key constraints to the table).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22729) Add getTruncateQuery to JdbcDialect

2017-12-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22729:


Assignee: Apache Spark

> Add getTruncateQuery to JdbcDialect
> ---
>
> Key: SPARK-22729
> URL: https://issues.apache.org/jira/browse/SPARK-22729
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Daniel van der Ende
>Assignee: Apache Spark
>
> In order to enable truncate for PostgreSQL databases in Spark JDBC, a change 
> is needed to the query used for truncating a PostgreSQL table. By default, 
> PostgreSQL will automatically truncate any descendant tables if a TRUNCATE 
> query is executed. As this may result in (unwanted) side-effects, the query 
> used for the truncate should be specified separately for PostgreSQL, 
> specifying only to TRUNCATE a single table.
> This will also resolve SPARK-22717
> See PostgreSQL documentation 
> https://www.postgresql.org/docs/current/static/sql-truncate.html
> This change will still not let users truncate a table with cascade enabled 
> (which would also truncate tables with foreign key constraints to the table).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22618) RDD.unpersist can cause fatal exception when used with dynamic allocation

2017-12-07 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-22618.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19836
[https://github.com/apache/spark/pull/19836]

> RDD.unpersist can cause fatal exception when used with dynamic allocation
> -
>
> Key: SPARK-22618
> URL: https://issues.apache.org/jira/browse/SPARK-22618
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Brad
>Priority: Minor
> Fix For: 2.3.0
>
>
> If you use rdd.unpersist() with dynamic allocation, then an executor can be 
> deallocated while your rdd is being removed, which will throw an uncaught 
> exception that kills your job. 
> I looked into different ways of preventing this error from occurring but 
> couldn't come up with anything that wouldn't require a big change. I propose 
> the best fix is just to catch and log IOExceptions in unpersist() so they 
> don't kill your job. This will match the effective behavior when executors 
> are lost from dynamic allocation in other parts of the code.
> In the worst-case scenario I think this could lead to RDD partitions getting 
> left on executors after they were unpersisted, but this is probably better 
> than the whole job failing. I think in most cases the IOException would be 
> due to the executor dying for some reason, which is effectively the same 
> result as unpersisting the RDD from that executor anyway.
> I noticed this exception in a job that loads a 100GB dataset on a cluster 
> where we use dynamic allocation heavily. Here is the relevant stack trace:
> java.io.IOException: Connection reset by peer
> at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
> at sun.nio.ch.IOUtil.read(IOUtil.java:192)
> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
> at 
> io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:221)
> at 
> io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:899)
> at 
> io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:276)
> at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
> at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
> at java.lang.Thread.run(Thread.java:748)
> Exception in thread "main" org.apache.spark.SparkException: Exception thrown 
> in awaitResult:
> at 
> org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
> at 
> org.apache.spark.storage.BlockManagerMaster.removeRdd(BlockManagerMaster.scala:131)
> at org.apache.spark.SparkContext.unpersistRDD(SparkContext.scala:1806)
> at org.apache.spark.rdd.RDD.unpersist(RDD.scala:217)
> at 
> com.ibm.sparktc.sparkbench.workload.exercise.CacheTest.doWorkload(CacheTest.scala:62)
> at 
> com.ibm.sparktc.sparkbench.workload.Workload$class.run(Workload.scala:40)
> at 
> com.ibm.sparktc.sparkbench.workload.exercise.CacheTest.run(CacheTest.scala:33)
> at 
> com.ibm.sparktc.sparkbench.workload.SuiteKickoff$$anonfun$com$ibm$sparktc$sparkbench$workload$SuiteKickoff$$runSerially$1.apply(SuiteKickoff.scala:78)
> at 
> com.ibm.sparktc.sparkbench.workload.SuiteKickoff$$anonfun$com$ibm$sparktc$sparkbench$workload$SuiteKickoff$$runSerially$1.apply(SuiteKickoff.scala:78)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.immutable.List.foreach(List.scala:381)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.immutable.List.map(List.scala:285)
> at 
> 
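A minimal caller-side sketch of the mitigation described above (catch and log 
the IOException instead of letting it propagate). safeUnpersist is a 
hypothetical helper for illustration only, not the patch from pull request 
19836:
{code}
import java.io.IOException

import org.apache.spark.rdd.RDD

// Hypothetical helper: tolerate executor loss while blocks are being removed
// instead of letting the IOException kill the job.
def safeUnpersist(rdd: RDD[_]): Unit = {
  try {
    rdd.unpersist(blocking = true)
  } catch {
    case e: IOException =>
      // With dynamic allocation an executor may be deallocated mid-removal;
      // log and continue, since its cached partitions are gone either way.
      System.err.println(
        s"Ignoring IOException while unpersisting RDD ${rdd.id}: ${e.getMessage}")
  }
}
{code}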

[jira] [Created] (SPARK-22729) Add getTruncateQuery to JdbcDialect

2017-12-07 Thread Daniel van der Ende (JIRA)
Daniel van der Ende created SPARK-22729:
---

 Summary: Add getTruncateQuery to JdbcDialect
 Key: SPARK-22729
 URL: https://issues.apache.org/jira/browse/SPARK-22729
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.2.1
Reporter: Daniel van der Ende


In order to enable truncate for PostgreSQL databases in Spark JDBC, a change is 
needed to the query used for truncating a PostgreSQL table. By default, 
PostgreSQL will automatically truncate any descendant tables if a TRUNCATE 
query is executed. As this may result in (unwanted) side-effects, the query 
used for the truncate should be specified separately for PostgreSQL, so that 
only the targeted table is truncated.

This will also resolve SPARK-22717.

See PostgreSQL documentation 
https://www.postgresql.org/docs/current/static/sql-truncate.html

This change will still not let users truncate a table with cascade enabled 
(which would also truncate tables with foreign key constraints to the table).
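A rough sketch of what such a dialect hook might look like. The 
getTruncateQuery method is the proposed addition and is not part of the 
released JdbcDialect API; the object name is illustrative:
{code}
import org.apache.spark.sql.jdbc.JdbcDialect

// Illustrative only: a PostgreSQL dialect that truncates just the named table.
object PostgresOnlyTruncateDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:postgresql")

  // Proposed hook: a plain "TRUNCATE TABLE $table" would cascade to descendant
  // tables in PostgreSQL, so ONLY restricts the statement to this table.
  def getTruncateQuery(table: String): String = s"TRUNCATE TABLE ONLY $table"
}
{code}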






[jira] [Resolved] (SPARK-22705) Reduce # of mutable variables in Case, Coalesce, and In

2017-12-07 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-22705.
-
   Resolution: Fixed
 Assignee: Kazuaki Ishizaki
Fix Version/s: 2.3.0

> Reduce # of mutable variables in Case, Coalesce, and In
> ---
>
> Key: SPARK-22705
> URL: https://issues.apache.org/jira/browse/SPARK-22705
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
> Fix For: 2.3.0
>
>







[jira] [Resolved] (SPARK-22703) ColumnarRow should be an immutable view

2017-12-07 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-22703.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19898
[https://github.com/apache/spark/pull/19898]

> ColumnarRow should be an immutable view
> ---
>
> Key: SPARK-22703
> URL: https://issues.apache.org/jira/browse/SPARK-22703
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.3.0
>
>







[jira] [Assigned] (SPARK-22672) Refactor ORC Tests

2017-12-07 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-22672:
---

Assignee: Dongjoon Hyun

> Refactor ORC Tests
> --
>
> Key: SPARK-22672
> URL: https://issues.apache.org/jira/browse/SPARK-22672
> Project: Spark
>  Issue Type: Task
>  Components: SQL, Tests
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
> Fix For: 2.3.0
>
>
> Since SPARK-20682, we have two `OrcFileFormat`s. This issue refactors the ORC 
> tests.






[jira] [Resolved] (SPARK-22672) Refactor ORC Tests

2017-12-07 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-22672.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19882
[https://github.com/apache/spark/pull/19882]

> Refactor ORC Tests
> --
>
> Key: SPARK-22672
> URL: https://issues.apache.org/jira/browse/SPARK-22672
> Project: Spark
>  Issue Type: Task
>  Components: SQL, Tests
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
> Fix For: 2.3.0
>
>
> Since SPARK-20682, we have two `OrcFileFormat`s. This issue refactors the ORC 
> tests.






[jira] [Commented] (SPARK-22728) Unify artifact access for (mesos, standalone and yarn) when HDFS is available

2017-12-07 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16281771#comment-16281771
 ] 

Stavros Kontopoulos commented on SPARK-22728:
-

[~arand][~susanxhuynh] FYI.

> Unify artifact access for (mesos, standalone and yarn) when HDFS is available
> -
>
> Key: SPARK-22728
> URL: https://issues.apache.org/jira/browse/SPARK-22728
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Stavros Kontopoulos
>
> A unified cluster layer for caching artifacts would be very useful, similar 
> to the work that has been done for Flink: 
> https://issues.apache.org/jira/browse/FLINK-6177
> It would be great to make the Hadoop Distributed Cache available when we 
> deploy jobs in Mesos and standalone environments. HDFS is often present in 
> end-to-end applications, so we should have an option to use it.
> I am creating this JIRA as a follow-up to the discussion here: 
> https://github.com/apache/spark/pull/18587#issuecomment-314718391






[jira] [Created] (SPARK-22728) Unify artifact access for (mesos, standalone and yarn) when HDFS is available

2017-12-07 Thread Stavros Kontopoulos (JIRA)
Stavros Kontopoulos created SPARK-22728:
---

 Summary: Unify artifact access for (mesos, standalone and yarn) 
when HDFS is available
 Key: SPARK-22728
 URL: https://issues.apache.org/jira/browse/SPARK-22728
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.3.0
Reporter: Stavros Kontopoulos


A unified cluster layer for caching artifacts would be very useful, similar to 
the work that has been done for Flink: 
https://issues.apache.org/jira/browse/FLINK-6177
It would be great to make the Hadoop Distributed Cache available when we deploy 
jobs in Mesos and standalone environments. HDFS is often present in end-to-end 
applications, so we should have an option to use it.
I am creating this JIRA as a follow-up to the discussion here: 
https://github.com/apache/spark/pull/18587#issuecomment-314718391






[jira] [Resolved] (SPARK-22721) BytesToBytesMap peak memory usage not accurate after reset()

2017-12-07 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-22721.
---
   Resolution: Fixed
 Assignee: Juliusz Sompolski
Fix Version/s: 2.3.0

> BytesToBytesMap peak memory usage not accurate after reset()
> 
>
> Key: SPARK-22721
> URL: https://issues.apache.org/jira/browse/SPARK-22721
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Juliusz Sompolski
>Assignee: Juliusz Sompolski
> Fix For: 2.3.0
>
>
> BytesToBytesMap doesn't update peak memory usage before shrinking back to the 
> initial capacity in reset(), so after a disk spill one never knows what the 
> size of the hash table was before spilling.
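A simplified sketch of the bookkeeping problem, using a hypothetical Scala 
class for illustration (the real BytesToBytesMap is Java code in Spark core): 
the peak has to be recorded before reset() shrinks the map, otherwise the 
pre-spill high-water mark is lost.
{code}
// Hypothetical class, for illustration only.
class SpillableMap(initialCapacityBytes: Long) {
  private var currentBytes: Long = initialCapacityBytes
  private var peakBytes: Long = initialCapacityBytes

  def grow(extraBytes: Long): Unit = { currentBytes += extraBytes }

  private def updatePeak(): Unit = { peakBytes = math.max(peakBytes, currentBytes) }

  // The idea behind the fix: snapshot the peak *before* shrinking back to the
  // initial capacity, so a spill does not erase the pre-spill size.
  def reset(): Unit = {
    updatePeak()
    currentBytes = initialCapacityBytes
  }

  def peakMemoryUsedBytes: Long = { updatePeak(); peakBytes }
}
{code}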






[jira] [Updated] (SPARK-22126) Fix model-specific optimization support for ML tuning

2017-12-07 Thread Weichen Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu updated SPARK-22126:
---
Description: 
Fix model-specific optimization support for ML tuning. This is discussed in 
SPARK-19357
more discussion is here
 https://gist.github.com/MrBago/f501b9e7712dc6a67dc9fea24e309bf0

I copy discussion from gist to here:

I propose to design API as:
{code}
def fitCallables(dataset: Dataset[_], paramMaps: Array[ParamMap]): 
Array[Callable[Map[Int, M]]]
{code}

Let me use an example to explain the API:
{quote}
 It could be possible to still use the current parallelism and still allow for 
model-specific optimizations. For example, if we doing cross validation and 
have a param map with regParam = (0.1, 0.3) and maxIter = (5, 10). Lets say 
that the cross validator could know that maxIter is optimized for the model 
being evaluated (e.g. a new method in Estimator that return such params). It 
would then be straightforward for the cross validator to remove maxIter from 
the param map that will be parallelized over and use it to create 2 arrays of 
paramMaps: ((regParam=0.1, maxIter=5), (regParam=0.1, maxIter=10)) and 
((regParam=0.3, maxIter=5), (regParam=0.3, maxIter=10)).
{quote}
In this example, we can see that the models for ((regParam=0.1, maxIter=5), 
(regParam=0.1, maxIter=10)) can only be computed in one thread, and the models 
for ((regParam=0.3, maxIter=5), (regParam=0.3, maxIter=10)) in another thread. 
There are 4 paramMaps, but we can generate at most two threads to compute the 
models for them.

The API above allows "callable.call()" to return multiple models; the return 
type is {code}Map[Int, M]{code}, where the integer key marks the paramMap 
index of the corresponding model. Using the example above, there are 4 
paramMaps but only 2 callable objects are returned: one for ((regParam=0.1, 
maxIter=5), (regParam=0.1, maxIter=10)) and another for ((regParam=0.3, 
maxIter=5), (regParam=0.3, maxIter=10)).

and the default "fitCallables/fit with paramMaps" can be implemented as 
follows:
{code}
def fitCallables(dataset: Dataset[_], paramMaps: Array[ParamMap]):
    Array[Callable[Map[Int, M]]] = {
  paramMaps.zipWithIndex.map { case (paramMap: ParamMap, index: Int) =>
    new Callable[Map[Int, M]] {
      override def call(): Map[Int, M] = Map(index -> fit(dataset, paramMap))
    }
  }
}

def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[M] = {
  fitCallables(dataset, paramMaps)
    .flatMap(_.call().toSeq)  // merge the per-callable (index -> model) maps
    .sortBy(_._1)             // restore the original paramMap order
    .map(_._2)
}
{code}
If we use the API proposed above, the code in 
[CrossValidator|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala#L149-L159]
can be changed to:
{code}
  val trainingDataset = sparkSession.createDataFrame(training, 
schema).cache()
  val validationDataset = sparkSession.createDataFrame(validation, 
schema).cache()

  // Fit models in a Future for training in parallel
  val modelMapFutures = fitCallables(trainingDataset, paramMaps).map { 
callable =>
 Future[Map[Int, Model[_]]] {
val modelMap = callable.call()
if (collectSubModelsParam) {
   ...
}
modelMap
 } (executionContext)
  }

  // Unpersist training data only when all models have trained
  Future.sequence[Model[_], Iterable](modelMapFutures)(implicitly, 
executionContext)
.onComplete { _ => trainingDataset.unpersist() } (executionContext)

  // Evaluate models in a Future that will calulate a metric and allow 
model to be cleaned up
  val foldMetricMapFutures = modelMapFutures.map { modelMapFuture =>
modelMapFuture.map { modelMap =>
  modelMap.map { case (index: Int, model: Model[_]) =>
val metric = eval.evaluate(model.transform(validationDataset, 
paramMaps(index)))
(index, metric)
  }
} (executionContext)
  }

  // Wait for metrics to be calculated before unpersisting validation 
dataset
  val foldMetrics = foldMetricMapFutures.map(ThreadUtils.awaitResult(_, 
Duration.Inf))
  .map(_.toSeq).sortBy(_._1).map(_._2)
{code}


  was:
Fix model-specific optimization support for ML tuning. This is discussed in 
SPARK-19357
more discussion is here
 https://gist.github.com/MrBago/f501b9e7712dc6a67dc9fea24e309bf0

I copy discussion from gist to here:

I propose to design API as:
{code}
def fitCallables(dataset: Dataset[_], paramMaps: Array[ParamMap]): 
Array[Callable[Map[Int, M]]]
{code}

Let me use an example to explain the API:
{quote}
 It could be possible to still use the current parallelism and still allow for 
model-specific optimizations. For example, if we doing cross validation and 
have a param map with regParam = (0.1, 0.3) and maxIter = (5, 10). Lets say 
that the cross validator could know 

[jira] [Updated] (SPARK-22126) Fix model-specific optimization support for ML tuning

2017-12-07 Thread Weichen Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu updated SPARK-22126:
---
Description: 
Fix model-specific optimization support for ML tuning. This is discussed in 
SPARK-19357
more discussion is here
 https://gist.github.com/MrBago/f501b9e7712dc6a67dc9fea24e309bf0

I copy discussion from gist to here:

I propose to design API as:
{code}
def fitCallables(dataset: Dataset[_], paramMaps: Array[ParamMap]): 
Array[Callable[Map[Int, M]]]
{code}

Let me use an example to explain the API:
{quote}
 It could be possible to still use the current parallelism and still allow for 
model-specific optimizations. For example, if we doing cross validation and 
have a param map with regParam = (0.1, 0.3) and maxIter = (5, 10). Lets say 
that the cross validator could know that maxIter is optimized for the model 
being evaluated (e.g. a new method in Estimator that return such params). It 
would then be straightforward for the cross validator to remove maxIter from 
the param map that will be parallelized over and use it to create 2 arrays of 
paramMaps: ((regParam=0.1, maxIter=5), (regParam=0.1, maxIter=10)) and 
((regParam=0.3, maxIter=5), (regParam=0.3, maxIter=10)).
{quote}
In this example, we can see that the models for ((regParam=0.1, maxIter=5), 
(regParam=0.1, maxIter=10)) can only be computed in one thread, and the models 
for ((regParam=0.3, maxIter=5), (regParam=0.3, maxIter=10)) in another thread. 
There are 4 paramMaps, but we can generate at most two threads to compute the 
models for them.

The API above allows "callable.call()" to return multiple models; the return 
type is {code}Map[Int, M]{code}, where the integer key marks the paramMap 
index of the corresponding model. Using the example above, there are 4 
paramMaps but only 2 callable objects are returned: one for ((regParam=0.1, 
maxIter=5), (regParam=0.1, maxIter=10)) and another for ((regParam=0.3, 
maxIter=5), (regParam=0.3, maxIter=10)).

and the default "fitCallables/fit with paramMaps" can be implemented as 
follows:
{code}
def fitCallables(dataset: Dataset[_], paramMaps: Array[ParamMap]):
    Array[Callable[Map[Int, M]]] = {
  paramMaps.zipWithIndex.map { case (paramMap: ParamMap, index: Int) =>
    new Callable[Map[Int, M]] {
      override def call(): Map[Int, M] = Map(index -> fit(dataset, paramMap))
    }
  }
}

def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[M] = {
  fitCallables(dataset, paramMaps)
    .flatMap(_.call().toSeq)  // merge the per-callable (index -> model) maps
    .sortBy(_._1)             // restore the original paramMap order
    .map(_._2)
}
{code}
If we use the API proposed above, the code in 
[CrossValidator|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala#L149-L159]
can be changed to:
{code}
val trainingDataset = sparkSession.createDataFrame(training, schema).cache()
val validationDataset = sparkSession.createDataFrame(validation, schema).cache()

// Fit models in a Future for training in parallel
val modelMapFutures = fitCallables(trainingDataset, paramMaps).map { callable =>
  Future[Map[Int, Model[_]]] {
    val modelMap = callable.call()
    if (collectSubModelsParam) {
      ...
    }
    modelMap
  } (executionContext)
}

// Unpersist training data only when all models have trained
Future.sequence[Map[Int, Model[_]], Iterable](modelMapFutures)(implicitly, 
    executionContext)
  .onComplete { _ => trainingDataset.unpersist() } (executionContext)

// Evaluate models in a Future that will calculate a metric and allow the
// model to be cleaned up
val foldMetricMapFutures = modelMapFutures.map { modelMapFuture =>
  modelMapFuture.map { modelMap =>
    modelMap.map { case (index: Int, model: Model[_]) =>
      val metric = eval.evaluate(model.transform(validationDataset, paramMaps(index)))
      (index, metric)
    }
  } (executionContext)
}

// Wait for metrics to be calculated before unpersisting the validation dataset
val foldMetrics = foldMetricMapFutures
  .map(ThreadUtils.awaitResult(_, Duration.Inf))
  .flatMap(_.toSeq).sortBy(_._1).map(_._2)
{code}


  was:
Fix model-specific optimization support for ML tuning. This is discussed in 
SPARK-19357
more discussion is here
 https://gist.github.com/MrBago/f501b9e7712dc6a67dc9fea24e309bf0



> Fix model-specific optimization support for ML tuning
> -
>
> Key: SPARK-22126
> URL: https://issues.apache.org/jira/browse/SPARK-22126
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Weichen Xu
>
> Fix model-specific optimization support for ML tuning. This is discussed in 
> SPARK-19357
> more discussion is here
>  

[jira] [Commented] (SPARK-22727) spark.executor.instances's default value should be 2

2017-12-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16281534#comment-16281534
 ] 

Apache Spark commented on SPARK-22727:
--

User 'liu-zhaokun' has created a pull request for this issue:
https://github.com/apache/spark/pull/19919

> spark.executor.instances's default  value should be 2 
> --
>
> Key: SPARK-22727
> URL: https://issues.apache.org/jira/browse/SPARK-22727
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.2.0
>Reporter: liuzhaokun
>Priority: Minor
>
> The default value of spark.executor.instances in running-on-yarn.md is 2, but 
> in ExecutorAllocationManager.scala and org.apache.spark.util.Utils.scala it 
> is used with a default value of 0. I think it should be 2 for application 
> initialization.






[jira] [Assigned] (SPARK-22727) spark.executor.instances's default value should be 2

2017-12-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22727:


Assignee: Apache Spark

> spark.executor.instances's default  value should be 2 
> --
>
> Key: SPARK-22727
> URL: https://issues.apache.org/jira/browse/SPARK-22727
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.2.0
>Reporter: liuzhaokun
>Assignee: Apache Spark
>Priority: Minor
>
> The default value of spark.executor.instances in running-on-yarn.md is 2, but 
> in ExecutorAllocationManager.scala and org.apache.spark.util.Utils.scala it 
> is used with a default value of 0. I think it should be 2 for application 
> initialization.






[jira] [Assigned] (SPARK-22727) spark.executor.instances's default value should be 2

2017-12-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22727:


Assignee: (was: Apache Spark)

> spark.executor.instances's default  value should be 2 
> --
>
> Key: SPARK-22727
> URL: https://issues.apache.org/jira/browse/SPARK-22727
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.2.0
>Reporter: liuzhaokun
>Priority: Minor
>
> The default value of spark.executor.instances in running-on-yarn.md is 2, but 
> in ExecutorAllocationManager.scala and org.apache.spark.util.Utils.scala it 
> is used with a default value of 0. I think it should be 2 for application 
> initialization.






[jira] [Created] (SPARK-22727) spark.executor.instances's default value should be 2

2017-12-07 Thread liuzhaokun (JIRA)
liuzhaokun created SPARK-22727:
--

 Summary: spark.executor.instances's default  value should be 2 
 Key: SPARK-22727
 URL: https://issues.apache.org/jira/browse/SPARK-22727
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 2.2.0
Reporter: liuzhaokun
Priority: Minor


The default value of spark.executor.instances in running-on-yarn.md is 2, but 
in ExecutorAllocationManager.scala and org.apache.spark.util.Utils.scala it is 
used with a default value of 0. I think it should be 2 for application 
initialization.
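A minimal illustration of the mismatch, assuming the key is read through 
SparkConf with an explicit fallback (the exact call sites are in the files 
named above; this snippet only shows the pattern):
{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
// running-on-yarn.md documents the default as 2, but a read like this falls
// back to 0 when spark.executor.instances is not set explicitly:
val numExecutors = conf.getInt("spark.executor.instances", 0)
{code}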






[jira] [Assigned] (SPARK-22726) Basic tests for Binary Comparison and ImplicitTypeCasts

2017-12-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22726:


Assignee: Apache Spark  (was: Xiao Li)

> Basic tests for Binary Comparison and ImplicitTypeCasts
> ---
>
> Key: SPARK-22726
> URL: https://issues.apache.org/jira/browse/SPARK-22726
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> Generate test cases for our binary comparison and ImplicitTypeCasts based on 
> the Apache Derby test cases in 
> https://github.com/apache/derby/blob/10.14/java/testing/org/apache/derbyTesting/functionTests/tests/lang/implicitConversions.sql
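As a quick illustration of the behaviour these generated tests exercise (an 
ad-hoc query, not one of the generated test cases):
{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("implicit-casts").getOrCreate()

// The string literal is implicitly cast before the binary comparison is
// evaluated, so the query returns true rather than failing to type-check.
spark.sql("SELECT 1 = '1' AS int_eq_string").show()
{code}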






[jira] [Commented] (SPARK-22726) Basic tests for Binary Comparison and ImplicitTypeCasts

2017-12-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16281478#comment-16281478
 ] 

Apache Spark commented on SPARK-22726:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/19918

> Basic tests for Binary Comparison and ImplicitTypeCasts
> ---
>
> Key: SPARK-22726
> URL: https://issues.apache.org/jira/browse/SPARK-22726
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> Generate test cases for our binary comparison and ImplicitTypeCasts based on 
> the Apache Derby test cases in 
> https://github.com/apache/derby/blob/10.14/java/testing/org/apache/derbyTesting/functionTests/tests/lang/implicitConversions.sql






[jira] [Assigned] (SPARK-22726) Basic tests for Binary Comparison and ImplicitTypeCasts

2017-12-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22726:


Assignee: Xiao Li  (was: Apache Spark)

> Basic tests for Binary Comparison and ImplicitTypeCasts
> ---
>
> Key: SPARK-22726
> URL: https://issues.apache.org/jira/browse/SPARK-22726
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> Generate test cases for our binary comparison and ImplicitTypeCasts based on 
> the Apache Derby test cases in 
> https://github.com/apache/derby/blob/10.14/java/testing/org/apache/derbyTesting/functionTests/tests/lang/implicitConversions.sql


