[jira] [Comment Edited] (SPARK-23399) Register a task completion listener first for OrcColumnarBatchReader

2018-02-16 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16368121#comment-16368121
 ] 

Dongjoon Hyun edited comment on SPARK-23399 at 2/17/18 7:29 AM:


[~mgaido], I understand your intention, but please see the JIRA issue title.
It's not about `Fix OrcQuerySuite`. Why did you reopen this issue? Please 
file a new JIRA issue for that instead.

This JIRA issue covers the designed scope, as described in the manual test case 
in the PR.
I'll investigate the reported case further.


was (Author: dongjoon):
[~mgaido], I understand your intention, but please see the JIRA issue title.
It's not about `Fix OrcQuerySuite`. Why did you reopen this issue? Please 
file a new JIRA issue for that instead.

> Register a task completion listener first for OrcColumnarBatchReader
> 
>
> Key: SPARK-23399
> URL: https://issues.apache.org/jira/browse/SPARK-23399
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 2.3.1
>
>
> This is related to SPARK-23390.
> Currently, there is an open file leak in OrcColumnarBatchReader.
> {code}
> [info] - Enabling/disabling ignoreMissingFiles using orc (648 milliseconds)
> 15:55:58.673 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in 
> stage 61.0 (TID 85, localhost, executor driver): TaskKilled (Stage cancelled)
> 15:55:58.674 WARN org.apache.spark.DebugFilesystem: Leaked filesystem 
> connection created at:
> java.lang.Throwable
>   at 
> org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36)
>   at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70)
>   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769)
>   at 
> org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.open(RecordReaderUtils.java:173)
>   at 
> org.apache.orc.impl.RecordReaderImpl.(RecordReaderImpl.java:254)
>   at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:633)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initialize(OrcColumnarBatchReader.java:138)
> {code}






[jira] [Commented] (SPARK-23399) Register a task completion listener first for OrcColumnarBatchReader

2018-02-16 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16368121#comment-16368121
 ] 

Dongjoon Hyun commented on SPARK-23399:
---

[~mgaido], I understand your intention, but please see the JIRA issue title.
It's not about `Fix OrcQuerySuite`. Why did you reopen this issue? Please 
file a new JIRA issue for that instead.

> Register a task completion listener first for OrcColumnarBatchReader
> 
>
> Key: SPARK-23399
> URL: https://issues.apache.org/jira/browse/SPARK-23399
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 2.3.1
>
>
> This is related to SPARK-23390.
> Currently, there is an open file leak in OrcColumnarBatchReader.
> {code}
> [info] - Enabling/disabling ignoreMissingFiles using orc (648 milliseconds)
> 15:55:58.673 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in 
> stage 61.0 (TID 85, localhost, executor driver): TaskKilled (Stage cancelled)
> 15:55:58.674 WARN org.apache.spark.DebugFilesystem: Leaked filesystem 
> connection created at:
> java.lang.Throwable
>   at 
> org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36)
>   at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70)
>   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769)
>   at 
> org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.open(RecordReaderUtils.java:173)
>   at 
> org.apache.orc.impl.RecordReaderImpl.(RecordReaderImpl.java:254)
>   at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:633)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initialize(OrcColumnarBatchReader.java:138)
> {code}






[jira] [Commented] (SPARK-23442) Reading from partitioned and bucketed table uses only bucketSpec.numBuckets partitions in all cases

2018-02-16 Thread Pranav Rao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16368117#comment-16368117
 ] 

Pranav Rao commented on SPARK-23442:


Repartitioning is unlikely to be helpful to a user because:

* The map side of the repartition is still limited to num_buckets tasks, so it's 
going to be very slow and will not utilise the available parallelism.
* The user would have pre-partitioned and bucketed the dataset and persisted it 
precisely to avoid a repartition/shuffle at read time, so the purpose of this 
feature is lost (see the sketch below).
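
A minimal sketch (hypothetical table name, assuming an active {{spark}} session) of 
the behaviour described above: the bucketed scan is planned with only 
bucketSpec.numBuckets partitions, and a later repartition just adds the shuffle back 
on top of those 50 scan tasks.
{code:scala}
// The scan of the bucketed table is planned with numBuckets (50) partitions,
// regardless of the 600GB size of the dataset.
val df = spark.table("tablename")
println(df.rdd.getNumPartitions)        // 50

// Repartitioning afterwards raises downstream parallelism, but the map side of
// that shuffle still runs in the 50 scan tasks, and it reintroduces exactly the
// shuffle that partitioning + bucketing was meant to avoid.
val widened = df.repartition(2000)
println(widened.rdd.getNumPartitions)   // 2000, only after a full shuffle
{code}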

> Reading from partitioned and bucketed table uses only bucketSpec.numBuckets 
> partitions in all cases
> ---
>
> Key: SPARK-23442
> URL: https://issues.apache.org/jira/browse/SPARK-23442
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.2.1
>Reporter: Pranav Rao
>Priority: Major
>
> Through the DataFrameWriter[T] interface I have created an external Hive table 
> with 5000 (horizontal) partitions and 50 buckets in each partition. Overall 
> the dataset is 600GB and the provider is Parquet.
> Now this works great when joining with a similarly bucketed dataset - it's 
> able to avoid a shuffle.
> But any action on this DataFrame (from _spark.table("tablename")_) works with 
> only 50 RDD partitions. This is happening because of 
> [createBucketedReadRDD|https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala].
>  So the 600GB dataset is only read through 50 tasks, which makes this 
> partitioning + bucketing scheme not useful.
> I cannot expose the base directory of the parquet folder for reading the 
> dataset, because the partition locations don't follow a (basePath + partSpec) 
> format.
> Meanwhile, are there workarounds to use higher parallelism while reading such 
> a table?
>  Let me know if I can help in any way.






[jira] [Commented] (SPARK-23455) Default Params in ML should be saved separately

2018-02-16 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16368116#comment-16368116
 ] 

Liang-Chi Hsieh commented on SPARK-23455:
-

Currently, {{DefaultParamsWriter}} saves the following metadata + params:

 - class
 - timestamp
 - sparkVersion
 - uid
 - paramMap
 - (optionally, extra metadata)

User-supplied params and default params are all saved in the {{paramMap}} field in 
JSON. We can add a {{defaultParamMap}} field for saving the default params.

For backward compatibility, when loading metadata from a file written prior to 
Spark 2.4, we shouldn't raise an error if the {{defaultParamMap}} field is missing.
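
A rough sketch (assumed JSON shape; not the actual {{DefaultParamsReader}} code) of 
what backward-compatible parsing could look like: a missing {{defaultParamMap}} is 
treated as empty instead of as an error, so metadata written before Spark 2.4 still 
loads.
{code:scala}
import org.json4s._
import org.json4s.jackson.JsonMethods.parse

// Extract a param-map field; pre-2.4 metadata simply has no "defaultParamMap".
def paramFields(metadata: JValue, field: String): Map[String, JValue] =
  metadata \ field match {
    case JObject(pairs)   => pairs.toMap
    case JNothing | JNull => Map.empty
    case other            => throw new IllegalArgumentException(s"unexpected $field: $other")
  }

val metadata = parse(
  """{"class":"...","uid":"...","paramMap":{"maxIter":10},"defaultParamMap":{"tol":1.0E-6}}""")
val userParams    = paramFields(metadata, "paramMap")        // set on the model as user-supplied
val defaultParams = paramFields(metadata, "defaultParamMap") // set on the model as defaults only
{code}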


> Default Params in ML should be saved separately
> ---
>
> Key: SPARK-23455
> URL: https://issues.apache.org/jira/browse/SPARK-23455
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Liang-Chi Hsieh
>Priority: Major
>
> We save ML's user-supplied params and default params as one entity in JSON. 
> When loading the saved models, we set all the loaded params on the created ML 
> model instances as user-supplied params.
> This causes problems, e.g., if we strictly disallow some params to be set 
> at the same time, a default param can fail the param check because it is 
> treated as a user-supplied param after loading.
> The loaded default params should not be set as user-supplied params. We 
> should save ML default params separately in JSON.






[jira] [Assigned] (SPARK-23435) R tests should support latest testthat

2018-02-16 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung reassigned SPARK-23435:


Assignee: Felix Cheung

> R tests should support latest testthat
> --
>
> Key: SPARK-23435
> URL: https://issues.apache.org/jira/browse/SPARK-23435
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.1, 2.4.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>Priority: Major
>
> To follow up on SPARK-22817: the latest version of testthat, 2.0.0, was 
> released in Dec 2017, and its API has changed.
> In order for our tests to keep working, we need to detect the installed 
> version and call the appropriate method.
> Jenkins is still running 1.0.1, though, so we need to check that this keeps working there.






[jira] [Commented] (SPARK-23437) [ML] Distributed Gaussian Process Regression for MLlib

2018-02-16 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16368109#comment-16368109
 ] 

Seth Hendrickson commented on SPARK-23437:
--

TBH, this seems like a pretty reasonable request. While I agree we do seem to 
tell people that the "standard" practice is to implement as a third-party 
package and integrate later, I don't see this happening in practice. I don't 
know that we've even validated that the "implement as a third-party package, 
then bring into Spark later" approach really works. Perhaps an even stronger 
reason for resisting new algorithms is simply the lack of reviewer/developer 
support on Spark ML. It's hard to predict whether there will be anyone to 
review the PR within a reasonable amount of time, even if the code is 
well-designed. AFAIK, we haven't added any major algorithms since 
GeneralizedLinearRegression, which has to have been a couple of years ago.

That said, I think this is something to at least consider. We can start by 
discussing which algorithms exist and why we'd choose a particular one. Strong 
arguments for why we need GPs in Spark ML are also beneficial. The fact that 
there isn't a non-parametric regression algorithm in Spark has some merit, but 
we don't write new algorithms just for the sake of filling in gaps - there 
needs to be user demand (which, unfortunately, is often hard to prove). It also 
helps to point to a package that already implements the algorithm you're 
proposing, but, for example, I don't believe scikit-learn implements the 
linear-time version, so we can't really leverage their experience. Providing 
more information on any/all of these categories will help make a stronger case, 
and I do think GPs could be a useful addition. Thanks for leading the discussion!

> [ML] Distributed Gaussian Process Regression for MLlib
> --
>
> Key: SPARK-23437
> URL: https://issues.apache.org/jira/browse/SPARK-23437
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Affects Versions: 2.2.1
>Reporter: Valeriy Avanesov
>Priority: Major
>
> Gaussian Process Regression (GP) is a well-known black-box non-linear 
> regression approach [1]. For years the approach remained inapplicable to 
> large samples due to its cubic computational complexity; however, more recent 
> techniques (Sparse GP) require only linear complexity. The field 
> continues to attract the interest of researchers – several papers devoted to 
> GP were presented at NIPS 2017.
> Unfortunately, the non-parametric regression techniques shipped with MLlib are 
> restricted to tree-based approaches.
> I propose to create and include an implementation (which I am going to work 
> on) of the so-called robust Bayesian Committee Machine proposed and 
> investigated in [2].
> [1] Carl Edward Rasmussen and Christopher K. I. Williams. 2005. _Gaussian 
> Processes for Machine Learning (Adaptive Computation and Machine Learning)_. 
> The MIT Press.
> [2] Marc Peter Deisenroth and Jun Wei Ng. 2015. Distributed Gaussian 
> processes. In _Proceedings of the 32nd International Conference on 
> International Conference on Machine Learning - Volume 37_ (ICML'15), Francis 
> Bach and David Blei (Eds.), Vol. 37. JMLR.org 1481-1490.
>  






[jira] [Created] (SPARK-23455) Default Params in ML should be saved separately

2018-02-16 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-23455:
---

 Summary: Default Params in ML should be saved separately
 Key: SPARK-23455
 URL: https://issues.apache.org/jira/browse/SPARK-23455
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.4.0
Reporter: Liang-Chi Hsieh


We save ML's user-supplied params and default params as one entity in JSON. 
When loading the saved models, we set all the loaded params on the created ML 
model instances as user-supplied params.

This causes problems, e.g., if we strictly disallow some params to be set at 
the same time, a default param can fail the param check because it is treated 
as a user-supplied param after loading.

The loaded default params should not be set as user-supplied params. We should 
save ML default params separately in JSON.






[jira] [Commented] (SPARK-3159) Check for reducible DecisionTree

2018-02-16 Thread Alessandro Solimando (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16368108#comment-16368108
 ] 

Alessandro Solimando commented on SPARK-3159:
-

As I was not aware of this JIRA ticket, I opened a duplicate and worked on the 
proposed patch independently of the one linked here.

However, the two approaches look quite different (despite being somewhat close 
in spirit), so I think it is fine to review the two PRs independently of each other.

> Check for reducible DecisionTree
> 
>
> Key: SPARK-3159
> URL: https://issues.apache.org/jira/browse/SPARK-3159
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Improvement: test-time computation
> Currently, pairs of leaf nodes with the same parent can both output the same 
> prediction.  This happens since the splitting criterion (e.g., Gini) is not 
> the same as prediction accuracy/MSE; the splitting criterion can sometimes be 
> improved even when both children would still output the same prediction 
> (e.g., based on the majority label for classification).
> We could check the tree and reduce it if possible after training.
> Note: This happens with scikit-learn as well.






[jira] [Commented] (SPARK-3159) Check for reducible DecisionTree

2018-02-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16368106#comment-16368106
 ] 

Apache Spark commented on SPARK-3159:
-

User 'asolimando' has created a pull request for this issue:
https://github.com/apache/spark/pull/20632

> Check for reducible DecisionTree
> 
>
> Key: SPARK-3159
> URL: https://issues.apache.org/jira/browse/SPARK-3159
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Improvement: test-time computation
> Currently, pairs of leaf nodes with the same parent can both output the same 
> prediction.  This happens since the splitting criterion (e.g., Gini) is not 
> the same as prediction accuracy/MSE; the splitting criterion can sometimes be 
> improved even when both children would still output the same prediction 
> (e.g., based on the majority label for classification).
> We could check the tree and reduce it if possible after training.
> Note: This happens with scikit-learn as well.






[jira] [Assigned] (SPARK-23447) Cleanup codegen template for Literal

2018-02-16 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-23447:
---

Assignee: Kris Mok

> Cleanup codegen template for Literal
> 
>
> Key: SPARK-23447
> URL: https://issues.apache.org/jira/browse/SPARK-23447
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Kris Mok
>Assignee: Kris Mok
>Priority: Major
> Fix For: 2.4.0
>
>
> Ideally, the codegen templates for {{Literal}} should emit literals in the 
> {{isNull}} and {{value}} fields of {{ExprCode}} so that they can be 
> effectively inlined into their use sites.
> But currently there are a couple of paths where {{Literal.doGenCode()}} 
> returns an {{ExprCode}} with a non-trivial {{code}} field, and all of those 
> are actually unnecessary.
> We can make a simple refactoring to make sure all codegen templates for 
> {{Literal}} return empty {{code}} and simple literal/constant expressions in 
> {{isNull}} and {{value}}.






[jira] [Resolved] (SPARK-23447) Cleanup codegen template for Literal

2018-02-16 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-23447.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 20626
[https://github.com/apache/spark/pull/20626]

> Cleanup codegen template for Literal
> 
>
> Key: SPARK-23447
> URL: https://issues.apache.org/jira/browse/SPARK-23447
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Kris Mok
>Assignee: Kris Mok
>Priority: Major
> Fix For: 2.4.0
>
>
> Ideally, the codegen templates for {{Literal}} should emit literals in the 
> {{isNull}} and {{value}} fields of {{ExprCode}} so that they can be 
> effectively inlined into their use sites.
> But currently there are a couple of paths where {{Literal.doGenCode()}} 
> returns an {{ExprCode}} with a non-trivial {{code}} field, and all of those 
> are actually unnecessary.
> We can make a simple refactoring to make sure all codegen templates for 
> {{Literal}} return empty {{code}} and simple literal/constant expressions in 
> {{isNull}} and {{value}}.






[jira] [Assigned] (SPARK-23454) Add Trigger information to the Structured Streaming programming guide

2018-02-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23454:


Assignee: Tathagata Das  (was: Apache Spark)

> Add Trigger information to the Structured Streaming programming guide
> -
>
> Key: SPARK-23454
> URL: https://issues.apache.org/jira/browse/SPARK-23454
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Minor
>







[jira] [Commented] (SPARK-23454) Add Trigger information to the Structured Streaming programming guide

2018-02-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16368077#comment-16368077
 ] 

Apache Spark commented on SPARK-23454:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/20631

> Add Trigger information to the Structured Streaming programming guide
> -
>
> Key: SPARK-23454
> URL: https://issues.apache.org/jira/browse/SPARK-23454
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Minor
>







[jira] [Assigned] (SPARK-23454) Add Trigger information to the Structured Streaming programming guide

2018-02-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23454:


Assignee: Apache Spark  (was: Tathagata Das)

> Add Trigger information to the Structured Streaming programming guide
> -
>
> Key: SPARK-23454
> URL: https://issues.apache.org/jira/browse/SPARK-23454
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Tathagata Das
>Assignee: Apache Spark
>Priority: Minor
>







[jira] [Updated] (SPARK-23454) Add Trigger information to the Structured Streaming programming guide

2018-02-16 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-23454:
--
Priority: Minor  (was: Major)

> Add Trigger information to the Structured Streaming programming guide
> -
>
> Key: SPARK-23454
> URL: https://issues.apache.org/jira/browse/SPARK-23454
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Minor
>







[jira] [Created] (SPARK-23454) Add Trigger information to the Structured Streaming programming guide

2018-02-16 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-23454:
-

 Summary: Add Trigger information to the Structured Streaming 
programming guide
 Key: SPARK-23454
 URL: https://issues.apache.org/jira/browse/SPARK-23454
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, Structured Streaming
Affects Versions: 2.3.0
Reporter: Tathagata Das
Assignee: Tathagata Das









[jira] [Updated] (SPARK-23453) ToolBox compiled Spark UDAF causes java.lang.InternalError: Malformed class name

2018-02-16 Thread Eric Lo (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Lo updated SPARK-23453:

Description: 
Here is a weird problem I just ran into... My scenario is that I need to 
compile a UDAF dynamically at runtime, but it never worked.

I am using Scala 2.11.11 and Spark 2.2.1; please refer to my [StackOverflow 
post|https://stackoverflow.com/questions/48820212/toolbox-compiled-spark-udaf-causes-java-lang-internalerror-malformed-class-name]
 for detailed information and minimal examples. The problem itself is very 
similar to other Malformed class name tickets (for example 
[https://github.com/apache/spark/pull/9568]), which were caused by calling 
getSimpleName on a nested class/object, but this case is different and the 
problem is still there. The getSimpleName issue has been fixed in Java 9, which 
Spark doesn't support yet, so any solution/workaround is appreciated.

  was:
Here is a weird problem I just ran into... My scenario is that I need to 
compile a UDAF dynamically at runtime, but it never worked.

I am using Scala 2.11.11 and Spark 2.2.1; please refer to my [StackOverflow 
post|https://stackoverflow.com/questions/48820212/toolbox-compiled-spark-udaf-causes-java-lang-internalerror-malformed-class-name]
 for detailed information. The problem itself is very similar to other 
Malformed class name tickets (for example 
[https://github.com/apache/spark/pull/9568]), which were caused by calling 
getSimpleName on a nested class/object, but this case is different and the 
problem is still there. The getSimpleName issue has been fixed in Java 9, which 
Spark doesn't support yet, so any solution/workaround is appreciated.


> ToolBox compiled Spark UDAF causes java.lang.InternalError: Malformed class 
> name
> 
>
> Key: SPARK-23453
> URL: https://issues.apache.org/jira/browse/SPARK-23453
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
> Environment: Spark 2.2.1
> Scala 2.11.11
> JDK 1.8
>Reporter: Eric Lo
>Priority: Major
>
> Here is a weird problem I just ran into... My scenario is that I need to 
> compile a UDAF dynamically at runtime, but it never worked.
> I am using Scala 2.11.11 and Spark 2.2.1; please refer to my [StackOverflow 
> post|https://stackoverflow.com/questions/48820212/toolbox-compiled-spark-udaf-causes-java-lang-internalerror-malformed-class-name]
>  for detailed information and minimal examples. The problem itself is very 
> similar to other Malformed class name tickets (for example 
> [https://github.com/apache/spark/pull/9568]), which were caused by calling 
> getSimpleName on a nested class/object, but this case is different and the 
> problem is still there. The getSimpleName issue has been fixed in Java 9, 
> which Spark doesn't support yet, so any solution/workaround is appreciated.






[jira] [Created] (SPARK-23453) ToolBox compiled Spark UDAF causes java.lang.InternalError: Malformed class name

2018-02-16 Thread Eric Lo (JIRA)
Eric Lo created SPARK-23453:
---

 Summary: ToolBox compiled Spark UDAF causes 
java.lang.InternalError: Malformed class name
 Key: SPARK-23453
 URL: https://issues.apache.org/jira/browse/SPARK-23453
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.1
 Environment: Spark 2.2.1
Scala 2.11.11
JDK 1.8
Reporter: Eric Lo


Here is a weird problem I just ran into... My scenario is that I need to 
compile a UDAF dynamically at runtime, but it never worked.

I am using Scala 2.11.11 and Spark 2.2.1; please refer to my [StackOverflow 
post|https://stackoverflow.com/questions/48820212/toolbox-compiled-spark-udaf-causes-java-lang-internalerror-malformed-class-name]
 for detailed information. The problem itself is very similar to other 
Malformed class name tickets (for example 
[https://github.com/apache/spark/pull/9568]), which were caused by calling 
getSimpleName on a nested class/object, but this case is different and the 
problem is still there. The getSimpleName issue has been fixed in Java 9, which 
Spark doesn't support yet, so any solution/workaround is appreciated.
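
For reference, a minimal, hypothetical sketch (not the reporter's actual code; that 
is in the linked StackOverflow post) of the "compile a UDAF at runtime with ToolBox" 
scenario. Per the report, the InternalError surfaces once the UDAF is actually used, 
presumably because getSimpleName is called on the ToolBox-generated anonymous class.
{code:scala}
import scala.reflect.runtime.currentMirror
import scala.tools.reflect.ToolBox

val tb = currentMirror.mkToolBox()

// A trivial "sum" UDAF compiled from source at runtime (placeholder for the real one).
val source =
  """
    |import org.apache.spark.sql.Row
    |import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
    |import org.apache.spark.sql.types._
    |new UserDefinedAggregateFunction {
    |  def inputSchema: StructType  = StructType(StructField("value", DoubleType) :: Nil)
    |  def bufferSchema: StructType = StructType(StructField("sum", DoubleType) :: Nil)
    |  def dataType: DataType       = DoubleType
    |  def deterministic: Boolean   = true
    |  def initialize(buffer: MutableAggregationBuffer): Unit = buffer(0) = 0.0
    |  def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    |    buffer(0) = buffer.getDouble(0) + input.getDouble(0)
    |  def merge(b1: MutableAggregationBuffer, b2: Row): Unit =
    |    b1(0) = b1.getDouble(0) + b2.getDouble(0)
    |  def evaluate(buffer: Row): Any = buffer.getDouble(0)
    |}
  """.stripMargin

val udaf = tb.eval(tb.parse(source))
  .asInstanceOf[org.apache.spark.sql.expressions.UserDefinedAggregateFunction]
// Registering and using `udaf` in a query is where the reported error shows up.
{code}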






[jira] [Commented] (SPARK-23417) pyspark tests give wrong sbt instructions

2018-02-16 Thread Bruce Robbins (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16368031#comment-16368031
 ] 

Bruce Robbins commented on SPARK-23417:
---

This does the trick:
{noformat}
build/sbt -Pkafka-0-8 assembly/package streaming-kafka-0-8-assembly/assembly
{noformat}
There are also errant instructions for building a flume assembly jar. In that 
case the following works:
{noformat}
build/sbt -Pflume assembly/package streaming-flume-assembly/assembly
{noformat}
I can submit a PR to fix these messages.

By the way, the above is just for the pyspark-streaming tests. The pyspark-sql 
tests have similar build requirements: e.g., at least one test needs a build 
with the Hive profiles, and udf.py needs 
/sql/core/target/scala-2.11/test-classes/test/org/apache/spark/sql/JavaStringLength.class
 to exist. The pyspark-sql tests don't check for these requirements; they 
just throw exceptions. But I won't address that here.

> pyspark tests give wrong sbt instructions
> -
>
> Key: SPARK-23417
> URL: https://issues.apache.org/jira/browse/SPARK-23417
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Minor
>
> When running python/run-tests, the script indicates that I must run 
> "'build/sbt assembly/package streaming-kafka-0-8-assembly/assembly' or 
> 'build/mvn -Pkafka-0-8 package'". The sbt command fails:
>  
> [error] Expected ID character
> [error] Not a valid command: streaming-kafka-0-8-assembly
> [error] Expected project ID
> [error] Expected configuration
> [error] Expected ':' (if selecting a configuration)
> [error] Expected key
> [error] Not a valid key: streaming-kafka-0-8-assembly
> [error] streaming-kafka-0-8-assembly/assembly
> [error] 






[jira] [Resolved] (SPARK-23362) Migrate Kafka microbatch source to v2

2018-02-16 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-23362.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 20554
[https://github.com/apache/spark/pull/20554]

> Migrate Kafka microbatch source to v2
> -
>
> Key: SPARK-23362
> URL: https://issues.apache.org/jira/browse/SPARK-23362
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Major
> Fix For: 3.0.0
>
>







[jira] [Commented] (SPARK-23337) withWatermark raises an exception on struct objects

2018-02-16 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16367933#comment-16367933
 ] 

Michael Armbrust commented on SPARK-23337:
--

This is essentially the same issue as SPARK-18084. We are taking a column name 
here, not an expression. As such, you can only reference top-level columns. I 
agree this is an annoying aspect of the API, but changing it might have to 
wait for a major release since it would be a change in behavior.
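
A minimal sketch (not from this ticket) of a possible workaround under the current 
API, reusing the reporter's {{spark}}, {{file}}, and {{getJSONSchema()}} from the 
snippet below: promote the nested field to a top-level column before calling 
{{withWatermark}}.
{code:scala}
import org.apache.spark.sql.functions.col

val jsonRow = spark.readStream
  .schema(getJSONSchema())
  .json(file)
  .dropDuplicates("_id")
  // withWatermark takes a column name, not an expression, so alias the nested
  // field as a top-level column first.
  .withColumn("createTime", col("_source.createTime"))
  .withWatermark("createTime", "10 seconds")
  .toDF()
{code}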

> withWatermark raises an exception on struct objects
> ---
>
> Key: SPARK-23337
> URL: https://issues.apache.org/jira/browse/SPARK-23337
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.1
> Environment: Linux Ubuntu, Spark on standalone mode
>Reporter: Aydin Kocas
>Priority: Major
>
> Hi,
>  
> when using a nested field (I mean a field within a struct, concretely 
> _source.createTime) from a JSON file as the parameter for the 
> withWatermark method, I get an exception (see below).
> Everything else works flawlessly with the nested field.
>  
> +*{color:#14892c}works:{color}*+ 
> {code:java}
> Dataset jsonRow = 
> spark.readStream().schema(getJSONSchema()).json(file).dropDuplicates("_id").withWatermark("myTime",
>  "10 seconds").toDF();{code}
>  
> json structure:
> {code:java}
> root
>  |-- _id: string (nullable = true)
>  |-- _index: string (nullable = true)
>  |-- _score: long (nullable = true)
>  |-- myTime: timestamp (nullable = true)
> ..{code}
> +*{color:#d04437}does not work - nested json{color}:*+
> {code:java}
> Dataset jsonRow = 
> spark.readStream().schema(getJSONSchema()).json(file).dropDuplicates("_id").withWatermark("_source.createTime",
>  "10 seconds").toDF();{code}
>  
> json structure:
>  
> {code:java}
> root
>  |-- _id: string (nullable = true)
>  |-- _index: string (nullable = true)
>  |-- _score: long (nullable = true)
>  |-- _source: struct (nullable = true)
>  | |-- createTime: timestamp (nullable = true)
> ..
>  
> Exception in thread "main" 
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, 
> tree:
> 'EventTimeWatermark '_source.createTime, interval 10 seconds
> +- Deduplicate [_id#0], true
>  +- StreamingRelation 
> DataSource(org.apache.spark.sql.SparkSession@5dbbb292,json,List(),Some(StructType(StructField(_id,StringType,true),
>  StructField(_index,StringType,true), StructField(_score,LongType,true), 
> StructField(_source,StructType(StructField(additionalData,StringType,true), 
> StructField(client,StringType,true), 
> StructField(clientDomain,BooleanType,true), 
> StructField(clientVersion,StringType,true), 
> StructField(country,StringType,true), 
> StructField(countryName,StringType,true), 
> StructField(createTime,TimestampType,true), 
> StructField(externalIP,StringType,true), 
> StructField(hostname,StringType,true), 
> StructField(internalIP,StringType,true), 
> StructField(location,StringType,true), 
> StructField(locationDestination,StringType,true), 
> StructField(login,StringType,true), 
> StructField(originalRequestString,StringType,true), 
> StructField(password,StringType,true), 
> StructField(peerIdent,StringType,true), 
> StructField(peerType,StringType,true), 
> StructField(recievedTime,TimestampType,true), 
> StructField(sessionEnd,StringType,true), 
> StructField(sessionStart,StringType,true), 
> StructField(sourceEntryAS,StringType,true), 
> StructField(sourceEntryIp,StringType,true), 
> StructField(sourceEntryPort,StringType,true), 
> StructField(targetCountry,StringType,true), 
> StructField(targetCountryName,StringType,true), 
> StructField(targetEntryAS,StringType,true), 
> StructField(targetEntryIp,StringType,true), 
> StructField(targetEntryPort,StringType,true), 
> StructField(targetport,StringType,true), 
> StructField(username,StringType,true), 
> StructField(vulnid,StringType,true)),true), 
> StructField(_type,StringType,true))),List(),None,Map(path -> ./input/),None), 
> FileSource[./input/], [_id#0, _index#1, _score#2L, _source#3, _type#4]
> at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>  at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:385)
>  at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:300)
>  at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:268)
>  at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$9.applyOrElse(Analyzer.scala:854)
>  at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$9.applyOrElse(Analyzer.scala:796)
>  at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:62)
>  at 
> org.apache.spark

[jira] [Commented] (SPARK-13127) Upgrade Parquet to 1.9 (Fixes parquet sorting)

2018-02-16 Thread Li Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16367925#comment-16367925
 ] 

Li Jin commented on SPARK-13127:


Hi all,

The status of this JIRA is "In Progress". I am wondering whether it is being 
actively worked on?

> Upgrade Parquet to 1.9 (Fixes parquet sorting)
> --
>
> Key: SPARK-13127
> URL: https://issues.apache.org/jira/browse/SPARK-13127
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Justin Pihony
>Priority: Major
>
> Currently, when you write a sorted DataFrame to Parquet, then reading the 
> data back out is not sorted by default. [This is due to a bug in 
> Parquet|https://issues.apache.org/jira/browse/PARQUET-241] that was fixed in 
> 1.9.
> There is a workaround to read the file back in using a file glob (filepath/*).






[jira] [Commented] (SPARK-23452) Extend test coverage to all ORC readers

2018-02-16 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16367905#comment-16367905
 ] 

Xiao Li commented on SPARK-23452:
-

Thanks! I will assign it to you. 

> Extend test coverage to all ORC readers
> ---
>
> Key: SPARK-23452
> URL: https://issues.apache.org/jira/browse/SPARK-23452
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 2.3.1
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> We have five ORC readers. We had better have a test coverage for all ORC 
> readers.
> - Hive Serde
> - Hive OrcFileFormat
> - Apache ORC Vectorized Wrapper
> - Apache ORC Vectorized Copy
> - Apache ORC MR






[jira] [Assigned] (SPARK-23452) Extend test coverage to all ORC readers

2018-02-16 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-23452:
---

Assignee: Dongjoon Hyun

> Extend test coverage to all ORC readers
> ---
>
> Key: SPARK-23452
> URL: https://issues.apache.org/jira/browse/SPARK-23452
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 2.3.1
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
>
> We have five ORC readers. We had better have a test coverage for all ORC 
> readers.
> - Hive Serde
> - Hive OrcFileFormat
> - Apache ORC Vectorized Wrapper
> - Apache ORC Vectorized Copy
> - Apache ORC MR






[jira] [Resolved] (SPARK-23409) RandomForest/DecisionTree (syntactic) pruning of redundant subtrees

2018-02-16 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-23409.
---
Resolution: Duplicate

Linking old JIRA for this issue

> RandomForest/DecisionTree (syntactic) pruning of redundant subtrees
> ---
>
> Key: SPARK-23409
> URL: https://issues.apache.org/jira/browse/SPARK-23409
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.2.1
> Environment: 
>Reporter: Alessandro Solimando
>Priority: Minor
>
> Improvement: redundancy elimination from decision trees where all the leaves 
> of a given subtree share the same prediction.
> Benefits:
>  * Model interpretability
>  * Faster unitary model invocation (relevant for a massive number of 
> invocations)
>  * Smaller model memory footprint
> For instance, consider the following decision tree.
> {panel:title=Original Decision Tree}
> {noformat}
> DecisionTreeClassificationModel (uid=dtc_e794a5a3aa9e) of depth 3 with 15 
> nodes
>   If (feature 1 <= 0.5)
>If (feature 2 <= 0.5)
> If (feature 0 <= 0.5)
>  Predict: 0.0
> Else (feature 0 > 0.5)
>  Predict: 0.0
>Else (feature 2 > 0.5)
> If (feature 0 <= 0.5)
>  Predict: 0.0
> Else (feature 0 > 0.5)
>  Predict: 0.0
>   Else (feature 1 > 0.5)
>If (feature 2 <= 0.5)
> If (feature 0 <= 0.5)
>  Predict: 1.0
> Else (feature 0 > 0.5)
>  Predict: 1.0
>Else (feature 2 > 0.5)
> If (feature 0 <= 0.5)
>  Predict: 0.0
> Else (feature 0 > 0.5)
>  Predict: 0.0
> {noformat}
> {panel}
> The proposed method, taking the first tree as input, aims at producing the 
> following (semantically equivalent) tree as output:
> {panel:title=Pruned Decision Tree}
> {noformat}
> DecisionTreeClassificationModel (uid=dtc_e794a5a3aa9e) of depth 3 with 15 
> nodes
>   If (feature 1 <= 0.5)
>Predict: 0.0
>   Else (feature 1 > 0.5)
>If (feature 2 <= 0.5)
> Predict: 1.0
>Else (feature 2 > 0.5)
> Predict: 0.0
> {noformat}
> {panel}






[jira] [Commented] (SPARK-23381) Murmur3 hash generates a different value from other implementations

2018-02-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16367895#comment-16367895
 ] 

Apache Spark commented on SPARK-23381:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/20630

> Murmur3 hash generates a different value from other implementations
> ---
>
> Key: SPARK-23381
> URL: https://issues.apache.org/jira/browse/SPARK-23381
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Shintaro Murakami
>Priority: Major
>
> Murmur3 hash generates a different value from the original and other 
> implementations (like Scala standard library and Guava or so) when the length 
> of a byte array is not a multiple of 4.






[jira] [Commented] (SPARK-23381) Murmur3 hash generates a different value from other implementations

2018-02-16 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16367845#comment-16367845
 ] 

Joseph K. Bradley commented on SPARK-23381:
---

Copying my comment from the PR:
{quote}
For ML, I actually don't think this has to be a blocker. It's not great, but 
it's not a regression.

However, we should definitely fix this in the future and soon: For ML, it's 
really important that MurmurHash3 behave consistently across platforms.

To fix this, we'll need to keep the old implementation of MurmurHash3 to 
maintain the behavior of ML Pipelines exported from previous versions of Spark.
{quote}

> Murmur3 hash generates a different value from other implementations
> ---
>
> Key: SPARK-23381
> URL: https://issues.apache.org/jira/browse/SPARK-23381
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Shintaro Murakami
>Priority: Major
>
> Murmur3 hash generates a different value from the original and other 
> implementations (like Scala standard library and Guava or so) when the length 
> of a byte array is not a multiple of 4.






[jira] [Updated] (SPARK-23381) Murmur3 hash generates a different value from other implementations

2018-02-16 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-23381:
--
Priority: Major  (was: Minor)

> Murmur3 hash generates a different value from other implementations
> ---
>
> Key: SPARK-23381
> URL: https://issues.apache.org/jira/browse/SPARK-23381
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Shintaro Murakami
>Priority: Major
>
> Murmur3 hash generates a different value from the original and other 
> implementations (like Scala standard library and Guava or so) when the length 
> of a byte array is not a multiple of 4.






[jira] [Updated] (SPARK-23381) Murmur3 hash generates a different value from other implementations

2018-02-16 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-23381:
--
Issue Type: Bug  (was: Improvement)

> Murmur3 hash generates a different value from other implementations
> ---
>
> Key: SPARK-23381
> URL: https://issues.apache.org/jira/browse/SPARK-23381
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Shintaro Murakami
>Priority: Minor
>
> Murmur3 hash generates a different value from the original and other 
> implementations (like Scala standard library and Guava or so) when the length 
> of a byte array is not a multiple of 4.






[jira] [Updated] (SPARK-23452) Extend test coverage to all ORC readers

2018-02-16 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23452:
--
Issue Type: Improvement  (was: Test)

> Extend test coverage to all ORC readers
> ---
>
> Key: SPARK-23452
> URL: https://issues.apache.org/jira/browse/SPARK-23452
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 2.3.1
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> We have five ORC readers. We had better have a test coverage for all ORC 
> readers.
> - Hive Serde
> - Hive OrcFileFormat
> - Apache ORC Vectorized Wrapper
> - Apache ORC Vectorized Copy
> - Apache ORC MR






[jira] [Updated] (SPARK-23452) Extend test coverage to all ORC readers

2018-02-16 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23452:
--
Component/s: Tests

> Extend test coverage to all ORC readers
> ---
>
> Key: SPARK-23452
> URL: https://issues.apache.org/jira/browse/SPARK-23452
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 2.3.1
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> We have five ORC readers. We had better have a test coverage for all ORC 
> readers.
> - Hive Serde
> - Hive OrcFileFormat
> - Apache ORC Vectorized Wrapper
> - Apache ORC Vectorized Copy
> - Apache ORC MR






[jira] [Updated] (SPARK-23452) Extend test coverage to all ORC readers

2018-02-16 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23452:
--
Summary: Extend test coverage to all ORC readers  (was: Improve test 
coverage for ORC readers)

> Extend test coverage to all ORC readers
> ---
>
> Key: SPARK-23452
> URL: https://issues.apache.org/jira/browse/SPARK-23452
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> We have five ORC readers. We had better have a test coverage for all ORC 
> readers.
> - Hive Serde
> - Hive OrcFileFormat
> - Apache ORC Vectorized Wrapper
> - Apache ORC Vectorized Copy
> - Apache ORC MR






[jira] [Updated] (SPARK-23452) Improve test coverage for ORC readers

2018-02-16 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23452:
--
Description: 
We have five ORC readers. We had better have test coverage for all ORC readers.

- Hive Serde
- Hive OrcFileFormat
- Apache ORC Vectorized Wrapper
- Apache ORC Vectorized Copy
- Apache ORC MR


  was:
We have five ORC readers. We had better have test coverage for all cases.

- Hive Serde
- Hive OrcFileFormat
- Apache ORC Vectorized Wrapper
- Apache ORC Vectorized Copy
- Apache ORC MR



> Improve test coverage for ORC readers
> -
>
> Key: SPARK-23452
> URL: https://issues.apache.org/jira/browse/SPARK-23452
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> We have five ORC readers. We had better have a test coverage for all ORC 
> readers.
> - Hive Serde
> - Hive OrcFileFormat
> - Apache ORC Vectorized Wrapper
> - Apache ORC Vectorized Copy
> - Apache ORC MR






[jira] [Updated] (SPARK-23452) Improve test coverage for ORC readers

2018-02-16 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23452:
--
Summary: Improve test coverage for ORC readers  (was: Improve test coverage 
for ORC file format)

> Improve test coverage for ORC readers
> -
>
> Key: SPARK-23452
> URL: https://issues.apache.org/jira/browse/SPARK-23452
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> We have five ORC readers. We had better have a test coverage for all cases.
> - Hive Serde
> - Hive OrcFileFormat
> - Apache ORC Vectorized Wrapper
> - Apache ORC Vectorized Copy
> - Apache ORC MR






[jira] [Commented] (SPARK-23452) Improve test coverage for ORC file format

2018-02-16 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16367792#comment-16367792
 ] 

Dongjoon Hyun commented on SPARK-23452:
---

I created this and will proceed with it for 2.3.1, [~smilegator].

> Improve test coverage for ORC file format
> -
>
> Key: SPARK-23452
> URL: https://issues.apache.org/jira/browse/SPARK-23452
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> We have five ORC readers. We had better have a test coverage for all cases.
> - Hive Serde
> - Hive OrcFileFormat
> - Apache ORC Vectorized Wrapper
> - Apache ORC Vectorized Copy
> - Apache ORC MR






[jira] [Created] (SPARK-23452) Improve test coverage for ORC file format

2018-02-16 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-23452:
-

 Summary: Improve test coverage for ORC file format
 Key: SPARK-23452
 URL: https://issues.apache.org/jira/browse/SPARK-23452
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 2.3.1
Reporter: Dongjoon Hyun


We have five ORC readers. We had better have test coverage for all cases; the 
sketch after the list below shows roughly how each reader is selected.

- Hive Serde
- Hive OrcFileFormat
- Apache ORC Vectorized Wrapper
- Apache ORC Vectorized Copy
- Apache ORC MR
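
A hedged sketch of the configurations that (as far as I understand the 2.3 code) 
steer a query to each of these readers. The first three conf names are public; 
{{spark.sql.orc.copyBatchToSpark}} is an internal conf, so treat that name as an 
assumption.
{code:scala}
// "native" selects the Apache ORC readers; "hive" keeps the Hive OrcFileFormat path.
spark.conf.set("spark.sql.orc.impl", "native")
// Toggles the vectorized readers vs. the Apache ORC MR reader.
spark.conf.set("spark.sql.orc.enableVectorizedReader", "true")
// Chooses the "copy into Spark ColumnVectors" variant instead of the wrapper.
spark.conf.set("spark.sql.orc.copyBatchToSpark", "true")
// Leaving this false keeps Hive ORC tables on the Hive SerDe reader.
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "false")
{code}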







[jira] [Comment Edited] (SPARK-23427) spark.sql.autoBroadcastJoinThreshold causing OOM exception in the driver

2018-02-16 Thread Pratik Dhumal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16367742#comment-16367742
 ] 

Pratik Dhumal edited comment on SPARK-23427 at 2/16/18 7:18 PM:


{code:java}
// code placeholder
@Test
def testLoop() = {
  val schema = new StructType().add("test", types.IntegerType)
  var t1 = spark.createDataFrame(spark.sparkContext.parallelize(1 to 
100).map(i => Row(i)), schema)
  val t2 = spark.createDataFrame(spark.sparkContext.parallelize(4 to 
1400).map(i => Row(i)), schema)
  val t3 = spark.createDataFrame(spark.sparkContext.parallelize(15 to 
190).map(i => Row(i)), schema)
  val t4 = spark.createDataFrame(spark.sparkContext.parallelize(135 to 
652).map(i => Row(i)), schema)
  val t5 = spark.createDataFrame(spark.sparkContext.parallelize(86 to 
352).map(i => Row(i)), schema)

  t1.persist().count()
  t2.persist().count()
  t3.persist().count()
  t4.persist().count()
  t5.persist().count()
  var dfResult: DataFrame = null
  while (true) {
var t3Filter = t3.filter("test % 2 = 1")
var t4Filter = t4.filter("test % 2 = 0")
t1.createOrReplaceTempView("T1")
t2.createOrReplaceTempView("T2")
t3Filter.createOrReplaceTempView("T3")
t4Filter.createOrReplaceTempView("T4")
t5.createOrReplaceTempView("T5")

var query =
  """ SELECT T1.* FROM T1
| INNER JOIN T2 ON T1.test=t2.test
| LEFT JOIN T3 ON T1.test=t3.test
| LEFT JOIN T4 ON T1.test=t4.test
| LEFT JOIN T5 ON T1.test=t5.test

| """.stripMargin
if (t1 == null) {
  t1 = spark.sql(query)
  t1.persist().count()


} else {
  var tmp1 = spark.sql(query)
  var tmp2 = t1
  t1 = tmp1.union(tmp2)
  t1.persist().count()
  tmp2.unpersist(true)
  tmp2 = null
}


println("t1 : " + (SizeEstimator.estimate(t1) / 1024 / 1024))
// Do Something - Currently doing nothing

spark.catalog.dropTempView("T1")
spark.catalog.dropTempView("T2")
spark.catalog.dropTempView("T3")
spark.catalog.dropTempView("T4")
spark.catalog.dropTempView("T5")



  }

  t3.unpersist(true)
  t2.unpersist(true)
  t1.unpersist(true)
  t4.unpersist(true)
  t5.unpersist(true)

  println("VOID")
}


// RESULT LOG

t1 : 8

t1 : 208

t1 : 310

t1 : 187

t1 : 441

t1 : 440

t1 : 547

t1 : 651

t1 : 759

t1 : 733

t1 : 1129{code}
 

 

Hope this helps. 
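
For reference, the threshold being discussed can be toggled per session; a minimal 
sketch (the values are taken from the issue description, not from this test):
{code:java}
// Disable broadcast joins entirely, as in the "flat driver memory" runs.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

// Or set an explicit threshold, e.g. 10 MB (the default), which reproduces the
// growing driver memory reported in the ticket.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "10485760")
{code}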


was (Author: dpratik):
{code:java}
// code placeholder
@Test
def testLoop() = {
  val schema = new StructType().add("test", types.IntegerType)
  var t1 = spark.createDataFrame(spark.sparkContext.parallelize(1 to 
100).map(i => Row(i)), schema)
  val t2 = spark.createDataFrame(spark.sparkContext.parallelize(4 to 
1400).map(i => Row(i)), schema)
  val t3 = spark.createDataFrame(spark.sparkContext.parallelize(15 to 
190).map(i => Row(i)), schema)
  val t4 = spark.createDataFrame(spark.sparkContext.parallelize(135 to 
652).map(i => Row(i)), schema)
  val t5 = spark.createDataFrame(spark.sparkContext.parallelize(86 to 
352).map(i => Row(i)), schema)

  t1.persist().count()
  t2.persist().count()
  t3.persist().count()
  t4.persist().count()
  t5.persist().count()
  var dfResult: DataFrame = null
  while (true) {
var t3Filter = t3.filter("test % 2 = 1")
var t4Filter = t4.filter("test % 2 = 0")
t1.createOrReplaceTempView("T1")
t2.createOrReplaceTempView("T2")
t3Filter.createOrReplaceTempView("T3")
t4Filter.createOrReplaceTempView("T4")
t5.createOrReplaceTempView("T5")

var query =
  """ SELECT T1.* FROM T1
| INNER JOIN T2 ON T1.test=t2.test
| LEFT JOIN T3 ON T1.test=t3.test
| LEFT JOIN T4 ON T1.test=t4.test
| LEFT JOIN T5 ON T1.test=t5.test

| """.stripMargin
if (t1 == null) {
  t1 = spark.sql(query)
  t1.persist().count()


} else {
  var tmp1 = spark.sql(query)
  var tmp2 = t1
  t1 = tmp1.union(tmp2)
  t1.persist().count()
  tmp2.unpersist(true)
  tmp2 = null
}


println("t1 : " + (SizeEstimator.estimate(t1) / 1024 / 1024))
// Do Something - Currently doing nothing

spark.catalog.dropTempView("T1")
spark.catalog.dropTempView("T2")
spark.catalog.dropTempView("T3")
spark.catalog.dropTempView("T4")
spark.catalog.dropTempView("T5")



  }

  t3.unpersist(true)
  t2.unpersist(true)
  t1.unpersist(true)
  t4.unpersist(true)
  t5.unpersist(true)

  println("VOID")
}
{code}
Hope this helps. 

> spark.sql.autoBroadcastJoinThreshold causing OOM exception in the driver 
> -
>
> Key: SPARK-23427
> URL: https://issues.apache.org/jira/browse/SPARK-23427
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: SPARK 2.0 version
>Reporter: Dhiraj
> 

[jira] [Updated] (SPARK-23427) spark.sql.autoBroadcastJoinThreshold causing OOM exception in the driver

2018-02-16 Thread Dhiraj (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dhiraj updated SPARK-23427:
---
Summary: spark.sql.autoBroadcastJoinThreshold causing OOM exception in the 
driver   (was: spark.sql.autoBroadcastJoinThreshold causing OOM  in the driver )

> spark.sql.autoBroadcastJoinThreshold causing OOM exception in the driver 
> -
>
> Key: SPARK-23427
> URL: https://issues.apache.org/jira/browse/SPARK-23427
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: SPARK 2.0 version
>Reporter: Dhiraj
>Priority: Critical
>
> We are facing an issue around the value of spark.sql.autoBroadcastJoinThreshold.
> With spark.sql.autoBroadcastJoinThreshold set to -1 (disabled), driver memory 
> usage stays flat.
> With any other value (10MB, 5MB, 2MB, 1MB, 10K, 1K), driver memory usage grows 
> at a rate that depends on the size of the autoBroadcastJoinThreshold, and we 
> eventually get an OOM exception. The problem is that the memory used by the 
> broadcast is not being freed up in the driver.
> The application imports Oracle tables as master dataframes, which are 
> persisted. Each job applies filters to these tables and then registers them as 
> temp views. SQL queries are then used to process the data further. At the end, 
> all the intermediate dataframes are unpersisted.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23427) spark.sql.autoBroadcastJoinThreshold causing OOM in the driver

2018-02-16 Thread Pratik Dhumal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16367742#comment-16367742
 ] 

Pratik Dhumal commented on SPARK-23427:
---

{code:java}
// code placeholder
@Test
def testLoop() = {
  val schema = new StructType().add("test", types.IntegerType)
  var t1 = spark.createDataFrame(spark.sparkContext.parallelize(1 to 
100).map(i => Row(i)), schema)
  val t2 = spark.createDataFrame(spark.sparkContext.parallelize(4 to 
1400).map(i => Row(i)), schema)
  val t3 = spark.createDataFrame(spark.sparkContext.parallelize(15 to 
190).map(i => Row(i)), schema)
  val t4 = spark.createDataFrame(spark.sparkContext.parallelize(135 to 
652).map(i => Row(i)), schema)
  val t5 = spark.createDataFrame(spark.sparkContext.parallelize(86 to 
352).map(i => Row(i)), schema)

  t1.persist().count()
  t2.persist().count()
  t3.persist().count()
  t4.persist().count()
  t5.persist().count()
  var dfResult: DataFrame = null
  while (true) {
var t3Filter = t3.filter("test % 2 = 1")
var t4Filter = t4.filter("test % 2 = 0")
t1.createOrReplaceTempView("T1")
t2.createOrReplaceTempView("T2")
t3Filter.createOrReplaceTempView("T3")
t4Filter.createOrReplaceTempView("T4")
t5.createOrReplaceTempView("T5")

var query =
  """ SELECT T1.* FROM T1
| INNER JOIN T2 ON T1.test=t2.test
| LEFT JOIN T3 ON T1.test=t3.test
| LEFT JOIN T4 ON T1.test=t4.test
| LEFT JOIN T5 ON T1.test=t5.test

| """.stripMargin
if (t1 == null) {
  t1 = spark.sql(query)
  t1.persist().count()


} else {
  var tmp1 = spark.sql(query)
  var tmp2 = t1
  t1 = tmp1.union(tmp2)
  t1.persist().count()
  tmp2.unpersist(true)
  tmp2 = null
}


println("t1 : " + (SizeEstimator.estimate(t1) / 1024 / 1024))
// Do Something - Currently doing nothing

spark.catalog.dropTempView("T1")
spark.catalog.dropTempView("T2")
spark.catalog.dropTempView("T3")
spark.catalog.dropTempView("T4")
spark.catalog.dropTempView("T5")



  }

  t3.unpersist(true)
  t2.unpersist(true)
  t1.unpersist(true)
  t4.unpersist(true)
  t5.unpersist(true)

  println("VOID")
}
{code}
Hope this helps. 

> spark.sql.autoBroadcastJoinThreshold causing OOM  in the driver 
> 
>
> Key: SPARK-23427
> URL: https://issues.apache.org/jira/browse/SPARK-23427
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: SPARK 2.0 version
>Reporter: Dhiraj
>Priority: Critical
>
> We are facing an issue around the value of spark.sql.autoBroadcastJoinThreshold.
> With spark.sql.autoBroadcastJoinThreshold set to -1 (disabled), driver memory 
> usage stays flat.
> With any other value (10MB, 5MB, 2MB, 1MB, 10K, 1K), driver memory usage grows 
> at a rate that depends on the size of the autoBroadcastJoinThreshold, and we 
> eventually get an OOM exception. The problem is that the memory used by the 
> broadcast is not being freed up in the driver.
> The application imports Oracle tables as master dataframes, which are 
> persisted. Each job applies filters to these tables and then registers them as 
> temp views. SQL queries are then used to process the data further. At the end, 
> all the intermediate dataframes are unpersisted.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23433) java.lang.IllegalStateException: more than one active taskSet for stage

2018-02-16 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16367728#comment-16367728
 ] 

Shixiong Zhu commented on SPARK-23433:
--

[~irashid] I'm busy with other stuff and not working on this. Your approach 
sounds good to me. Please go ahead if you have time to work on this.

> java.lang.IllegalStateException: more than one active taskSet for stage
> ---
>
> Key: SPARK-23433
> URL: https://issues.apache.org/jira/browse/SPARK-23433
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Shixiong Zhu
>Priority: Major
>
> This following error thrown by DAGScheduler stopped the cluster:
> {code}
> 18/02/11 13:22:27 ERROR DAGSchedulerEventProcessLoop: 
> DAGSchedulerEventProcessLoop failed; shutting down SparkContext
> java.lang.IllegalStateException: more than one active taskSet for stage 
> 7580621: 7580621.2,7580621.1
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.submitTasks(TaskSchedulerImpl.scala:229)
>   at 
> org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1193)
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:1059)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$submitWaitingChildStages$6.apply(DAGScheduler.scala:900)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$submitWaitingChildStages$6.apply(DAGScheduler.scala:899)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at 
> org.apache.spark.scheduler.DAGScheduler.submitWaitingChildStages(DAGScheduler.scala:899)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1427)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1929)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1880)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1868)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23234) ML python test failure due to default outputCol

2018-02-16 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-23234:

Priority: Major  (was: Blocker)

> ML python test failure due to default outputCol
> ---
>
> Key: SPARK-23234
> URL: https://issues.apache.org/jira/browse/SPARK-23234
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Marco Gaido
>Priority: Major
>
> SPARK-22799 and SPARK-22797 are causing valid Python test failures. The 
> reason is that the Python API sets the default params with set(), so they are 
> not considered defaults but params passed by the user.
> This means that an outputCol is set not as a default but as a real value.
> Anyway, this is a misbehavior of the Python API which can cause serious 
> problems, and I'd suggest rethinking the way this is done.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23433) java.lang.IllegalStateException: more than one active taskSet for stage

2018-02-16 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16367666#comment-16367666
 ] 

Imran Rashid commented on SPARK-23433:
--

Actually, I realized it's more general than just marking it as a zombie -- it 
should even be able to mark tasks as completed, so you don't have tasks 
submitted by later attempts when an earlier attempt says the output is ready.

> java.lang.IllegalStateException: more than one active taskSet for stage
> ---
>
> Key: SPARK-23433
> URL: https://issues.apache.org/jira/browse/SPARK-23433
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Shixiong Zhu
>Priority: Major
>
> This following error thrown by DAGScheduler stopped the cluster:
> {code}
> 18/02/11 13:22:27 ERROR DAGSchedulerEventProcessLoop: 
> DAGSchedulerEventProcessLoop failed; shutting down SparkContext
> java.lang.IllegalStateException: more than one active taskSet for stage 
> 7580621: 7580621.2,7580621.1
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.submitTasks(TaskSchedulerImpl.scala:229)
>   at 
> org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1193)
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:1059)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$submitWaitingChildStages$6.apply(DAGScheduler.scala:900)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$submitWaitingChildStages$6.apply(DAGScheduler.scala:899)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at 
> org.apache.spark.scheduler.DAGScheduler.submitWaitingChildStages(DAGScheduler.scala:899)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1427)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1929)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1880)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1868)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23433) java.lang.IllegalStateException: more than one active taskSet for stage

2018-02-16 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16367664#comment-16367664
 ] 

Imran Rashid commented on SPARK-23433:
--

Yes, I think you are right [~zsxwing].  Since a zombie taskset might still be 
running the same tasks as the non-zombie one, when a zombie task finishes, it 
should be able to mark the non-zombie taskset as a zombie.  Or in this case, 
task 18.0 from 7580621.0 should be able to mark 7580621.1 as a zombie.

Are you working on this?

> java.lang.IllegalStateException: more than one active taskSet for stage
> ---
>
> Key: SPARK-23433
> URL: https://issues.apache.org/jira/browse/SPARK-23433
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Shixiong Zhu
>Priority: Major
>
> This following error thrown by DAGScheduler stopped the cluster:
> {code}
> 18/02/11 13:22:27 ERROR DAGSchedulerEventProcessLoop: 
> DAGSchedulerEventProcessLoop failed; shutting down SparkContext
> java.lang.IllegalStateException: more than one active taskSet for stage 
> 7580621: 7580621.2,7580621.1
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.submitTasks(TaskSchedulerImpl.scala:229)
>   at 
> org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1193)
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:1059)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$submitWaitingChildStages$6.apply(DAGScheduler.scala:900)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$submitWaitingChildStages$6.apply(DAGScheduler.scala:899)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at 
> org.apache.spark.scheduler.DAGScheduler.submitWaitingChildStages(DAGScheduler.scala:899)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1427)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1929)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1880)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1868)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23446) Explicitly check supported types in toPandas

2018-02-16 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-23446.
-
   Resolution: Fixed
 Assignee: Hyukjin Kwon
Fix Version/s: 2.3.0

> Explicitly check supported types in toPandas
> 
>
> Key: SPARK-23446
> URL: https://issues.apache.org/jira/browse/SPARK-23446
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 2.3.0
>
>
> See:
> {code}
> spark.conf.set("spark.sql.execution.arrow.enabled", "false")
> df = spark.createDataFrame([[bytearray("a")]])
> df.toPandas()
> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
> df.toPandas()
> {code}
> {code}
>  _1
> 0  [97]
>   _1
> 0  a
> {code}
> We didn't finish support for binary type in the Arrow conversion on the Python 
> side. We should disallow it.
> The same thing applies to nested timestamps. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23451) Deprecate KMeans computeCost

2018-02-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16367578#comment-16367578
 ] 

Apache Spark commented on SPARK-23451:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/20629

> Deprecate KMeans computeCost
> 
>
> Key: SPARK-23451
> URL: https://issues.apache.org/jira/browse/SPARK-23451
> Project: Spark
>  Issue Type: Task
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Priority: Trivial
>
> SPARK-11029 added the {{computeCost}} method as a temp fix for the lack of 
> proper cluster evaluators. Now SPARK-14516 introduces a proper 
> {{ClusteringEvaluator}}, so we should deprecate this method and maybe remove 
> it in the next releases.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23451) Deprecate KMeans computeCost

2018-02-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23451:


Assignee: (was: Apache Spark)

> Deprecate KMeans computeCost
> 
>
> Key: SPARK-23451
> URL: https://issues.apache.org/jira/browse/SPARK-23451
> Project: Spark
>  Issue Type: Task
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Priority: Trivial
>
> SPARK-11029 added the {{computeCost}} method as a temp fix for the lack of 
> proper cluster evaluators. Now SPARK-14516 introduces a proper 
> {{ClusteringEvaluator}}, so we should deprecate this method and maybe remove 
> it in the next releases.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23451) Deprecate KMeans computeCost

2018-02-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23451:


Assignee: Apache Spark

> Deprecate KMeans computeCost
> 
>
> Key: SPARK-23451
> URL: https://issues.apache.org/jira/browse/SPARK-23451
> Project: Spark
>  Issue Type: Task
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Assignee: Apache Spark
>Priority: Trivial
>
> SPARK-11029 added the {{computeCost}} method as a temp fix for the lack of 
> proper cluster evaluators. Now SPARK-14516 introduces a proper 
> {{ClusteringEvaluator}}, so we should deprecate this method and maybe remove 
> it in the next releases.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23288) Incorrect number of written records in structured streaming

2018-02-16 Thread Gabor Somogyi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16367569#comment-16367569
 ] 

Gabor Somogyi commented on SPARK-23288:
---

It looks like no statsTrackers are created in FileStreamSink.

> Incorrect number of written records in structured streaming
> ---
>
> Key: SPARK-23288
> URL: https://issues.apache.org/jira/browse/SPARK-23288
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Yuriy Bondaruk
>Priority: Major
>  Labels: Metrics, metrics
>
> I'm using SparkListener.onTaskEnd() to capture input and output metrics but 
> it seems that number of written records 
> ('taskEnd.taskMetrics().outputMetrics().recordsWritten()') is incorrect. Here 
> is my stream construction:
>  
> {code:java}
> StreamingQuery writeStream = session
> .readStream()
> .schema(RecordSchema.fromClass(TestRecord.class))
> .option(OPTION_KEY_DELIMITER, OPTION_VALUE_DELIMITER_TAB)
> .option(OPTION_KEY_QUOTE, OPTION_VALUE_QUOTATION_OFF)
> .csv(inputFolder.getRoot().toPath().toString())
> .as(Encoders.bean(TestRecord.class))
> .flatMap(
> ((FlatMapFunction) (u) -> {
> List resultIterable = new ArrayList<>();
> try {
> TestVendingRecord result = transformer.convert(u);
> resultIterable.add(result);
> } catch (Throwable t) {
> System.err.println("Ooops");
> t.printStackTrace();
> }
> return resultIterable.iterator();
> }),
> Encoders.bean(TestVendingRecord.class))
> .writeStream()
> .outputMode(OutputMode.Append())
> .format("parquet")
> .option("path", outputFolder.getRoot().toPath().toString())
> .option("checkpointLocation", 
> checkpointFolder.getRoot().toPath().toString())
> .start();
> writeStream.processAllAvailable();
> writeStream.stop();
> {code}
> Tested it with one good and one bad (throwing exception in 
> transformer.convert(u)) input records and it produces following metrics:
>  
> {code:java}
> (TestMain.java:onTaskEnd(73)) - ---status--> SUCCESS
> (TestMain.java:onTaskEnd(75)) - ---recordsWritten--> 0
> (TestMain.java:onTaskEnd(76)) - ---recordsRead-> 2
> (TestMain.java:onTaskEnd(83)) - taskEnd.taskInfo().accumulables():
> (TestMain.java:onTaskEnd(84)) - name = duration total (min, med, max)
> (TestMain.java:onTaskEnd(85)) - value =  323
> (TestMain.java:onTaskEnd(84)) - name = number of output rows
> (TestMain.java:onTaskEnd(85)) - value =  2
> (TestMain.java:onTaskEnd(84)) - name = duration total (min, med, max)
> (TestMain.java:onTaskEnd(85)) - value =  364
> (TestMain.java:onTaskEnd(84)) - name = internal.metrics.input.recordsRead
> (TestMain.java:onTaskEnd(85)) - value =  2
> (TestMain.java:onTaskEnd(84)) - name = internal.metrics.input.bytesRead
> (TestMain.java:onTaskEnd(85)) - value =  157
> (TestMain.java:onTaskEnd(84)) - name = 
> internal.metrics.resultSerializationTime
> (TestMain.java:onTaskEnd(85)) - value =  3
> (TestMain.java:onTaskEnd(84)) - name = internal.metrics.resultSize
> (TestMain.java:onTaskEnd(85)) - value =  2396
> (TestMain.java:onTaskEnd(84)) - name = internal.metrics.executorCpuTime
> (TestMain.java:onTaskEnd(85)) - value =  633807000
> (TestMain.java:onTaskEnd(84)) - name = internal.metrics.executorRunTime
> (TestMain.java:onTaskEnd(85)) - value =  683
> (TestMain.java:onTaskEnd(84)) - name = 
> internal.metrics.executorDeserializeCpuTime
> (TestMain.java:onTaskEnd(85)) - value =  55662000
> (TestMain.java:onTaskEnd(84)) - name = 
> internal.metrics.executorDeserializeTime
> (TestMain.java:onTaskEnd(85)) - value =  58
> (TestMain.java:onTaskEnd(89)) - input records 2
> Streaming query made progress: {
>   "id" : "1231f9cb-b2e8-4d10-804d-73d7826c1cb5",
>   "runId" : "bd23b60c-93f9-4e17-b3bc-55403edce4e7",
>   "name" : null,
>   "timestamp" : "2018-01-26T14:44:05.362Z",
>   "numInputRows" : 2,
>   "processedRowsPerSecond" : 0.8163265306122448,
>   "durationMs" : {
> "addBatch" : 1994,
> "getBatch" : 126,
> "getOffset" : 52,
> "queryPlanning" : 220,
> "triggerExecution" : 2450,
> "walCommit" : 41
>   },
>   "stateOperators" : [ ],
>   "sources" : [ {
> "description" : 
> "FileStreamSource[file:/var/folders/4w/zks_kfls2s3glmrj3f725p7hllyb5_/T/junit3661035412295337071]",
> "startOffset" : null,
> "endOffset" : {
>   "logOffset" : 0
> },
> "numInputRows" : 2,
> "processedRowsPerSecond" : 0.8163265306122448
>   } ],
>   "sink" : {
> "description" : 
> "FileSink[/var/folders/4w/z

[jira] [Commented] (SPARK-23288) Incorrect number of written records in structured streaming

2018-02-16 Thread Gabor Somogyi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16367567#comment-16367567
 ] 

Gabor Somogyi commented on SPARK-23288:
---

I'm working on this issue.

> Incorrect number of written records in structured streaming
> ---
>
> Key: SPARK-23288
> URL: https://issues.apache.org/jira/browse/SPARK-23288
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Yuriy Bondaruk
>Priority: Major
>  Labels: Metrics, metrics
>
> I'm using SparkListener.onTaskEnd() to capture input and output metrics but 
> it seems that number of written records 
> ('taskEnd.taskMetrics().outputMetrics().recordsWritten()') is incorrect. Here 
> is my stream construction:
>  
> {code:java}
> StreamingQuery writeStream = session
> .readStream()
> .schema(RecordSchema.fromClass(TestRecord.class))
> .option(OPTION_KEY_DELIMITER, OPTION_VALUE_DELIMITER_TAB)
> .option(OPTION_KEY_QUOTE, OPTION_VALUE_QUOTATION_OFF)
> .csv(inputFolder.getRoot().toPath().toString())
> .as(Encoders.bean(TestRecord.class))
> .flatMap(
> ((FlatMapFunction) (u) -> {
> List resultIterable = new ArrayList<>();
> try {
> TestVendingRecord result = transformer.convert(u);
> resultIterable.add(result);
> } catch (Throwable t) {
> System.err.println("Ooops");
> t.printStackTrace();
> }
> return resultIterable.iterator();
> }),
> Encoders.bean(TestVendingRecord.class))
> .writeStream()
> .outputMode(OutputMode.Append())
> .format("parquet")
> .option("path", outputFolder.getRoot().toPath().toString())
> .option("checkpointLocation", 
> checkpointFolder.getRoot().toPath().toString())
> .start();
> writeStream.processAllAvailable();
> writeStream.stop();
> {code}
> Tested it with one good and one bad (throwing exception in 
> transformer.convert(u)) input records and it produces following metrics:
>  
> {code:java}
> (TestMain.java:onTaskEnd(73)) - ---status--> SUCCESS
> (TestMain.java:onTaskEnd(75)) - ---recordsWritten--> 0
> (TestMain.java:onTaskEnd(76)) - ---recordsRead-> 2
> (TestMain.java:onTaskEnd(83)) - taskEnd.taskInfo().accumulables():
> (TestMain.java:onTaskEnd(84)) - name = duration total (min, med, max)
> (TestMain.java:onTaskEnd(85)) - value =  323
> (TestMain.java:onTaskEnd(84)) - name = number of output rows
> (TestMain.java:onTaskEnd(85)) - value =  2
> (TestMain.java:onTaskEnd(84)) - name = duration total (min, med, max)
> (TestMain.java:onTaskEnd(85)) - value =  364
> (TestMain.java:onTaskEnd(84)) - name = internal.metrics.input.recordsRead
> (TestMain.java:onTaskEnd(85)) - value =  2
> (TestMain.java:onTaskEnd(84)) - name = internal.metrics.input.bytesRead
> (TestMain.java:onTaskEnd(85)) - value =  157
> (TestMain.java:onTaskEnd(84)) - name = 
> internal.metrics.resultSerializationTime
> (TestMain.java:onTaskEnd(85)) - value =  3
> (TestMain.java:onTaskEnd(84)) - name = internal.metrics.resultSize
> (TestMain.java:onTaskEnd(85)) - value =  2396
> (TestMain.java:onTaskEnd(84)) - name = internal.metrics.executorCpuTime
> (TestMain.java:onTaskEnd(85)) - value =  633807000
> (TestMain.java:onTaskEnd(84)) - name = internal.metrics.executorRunTime
> (TestMain.java:onTaskEnd(85)) - value =  683
> (TestMain.java:onTaskEnd(84)) - name = 
> internal.metrics.executorDeserializeCpuTime
> (TestMain.java:onTaskEnd(85)) - value =  55662000
> (TestMain.java:onTaskEnd(84)) - name = 
> internal.metrics.executorDeserializeTime
> (TestMain.java:onTaskEnd(85)) - value =  58
> (TestMain.java:onTaskEnd(89)) - input records 2
> Streaming query made progress: {
>   "id" : "1231f9cb-b2e8-4d10-804d-73d7826c1cb5",
>   "runId" : "bd23b60c-93f9-4e17-b3bc-55403edce4e7",
>   "name" : null,
>   "timestamp" : "2018-01-26T14:44:05.362Z",
>   "numInputRows" : 2,
>   "processedRowsPerSecond" : 0.8163265306122448,
>   "durationMs" : {
> "addBatch" : 1994,
> "getBatch" : 126,
> "getOffset" : 52,
> "queryPlanning" : 220,
> "triggerExecution" : 2450,
> "walCommit" : 41
>   },
>   "stateOperators" : [ ],
>   "sources" : [ {
> "description" : 
> "FileStreamSource[file:/var/folders/4w/zks_kfls2s3glmrj3f725p7hllyb5_/T/junit3661035412295337071]",
> "startOffset" : null,
> "endOffset" : {
>   "logOffset" : 0
> },
> "numInputRows" : 2,
> "processedRowsPerSecond" : 0.8163265306122448
>   } ],
>   "sink" : {
> "description" : 
> "FileSink[/var/folders/4w/zks_kfls2s3glmrj3f725p7hllyb5

[jira] [Commented] (SPARK-23420) Datasource loading not handling paths with regex chars.

2018-02-16 Thread Mitchell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16367529#comment-16367529
 ] 

Mitchell commented on SPARK-23420:
--

Yes, I agree there currently appears to be no way for a user to distinguish a 
path that should be treated literally from one that should be treated as a glob. 
I think we need either two separate methods, or an option that specifies how the 
path should be treated. Having files/paths with these characters in them 
probably isn't a common situation, but it is possible and should be supported.

> Datasource loading not handling paths with regex chars.
> ---
>
> Key: SPARK-23420
> URL: https://issues.apache.org/jira/browse/SPARK-23420
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.2.1
>Reporter: Mitchell
>Priority: Major
>
> Greetings, during some recent testing I ran across an issue attempting to 
> load files with regex chars like []()* etc. in them. The files are valid in 
> the various storages and the normal hadoop APIs all function properly 
> accessing them.
> When my code is executed, I get the following stack trace.
> 8/02/14 04:52:46 ERROR yarn.ApplicationMaster: User class threw exception: 
> java.io.IOException: Illegal file pattern: Unmatched closing ')' near index 
> 130 
> A_VERY_LONG_DIRECTORY_FOLDER_THAT_INCLUDES_MULTIBYTE_AND_SPECIAL_CHARACTERS_abcdefghijklmnopqrst_0123456789_~@#\$%\^&\(\)-_=\+[(?:]);',\._???_???_???_??
>  ^ java.io.IOException: Illegal file pattern: Unmatched closing ')' near 
> index 130 
> A_VERY_LONG_DIRECTORY_FOLDER_THAT_INCLUDES_MULTIBYTE_AND_SPECIAL_CHARACTERS_abcdefghijklmnopqrst_0123456789_~@#\$%\^&\(\)-_=\+[(?:]);',\._???_???_???_??
>  ^ at org.apache.hadoop.fs.GlobFilter.init(GlobFilter.java:71) at 
> org.apache.hadoop.fs.GlobFilter.(GlobFilter.java:50) at 
> org.apache.hadoop.fs.Globber.doGlob(Globber.java:210) at 
> org.apache.hadoop.fs.Globber.glob(Globber.java:149) at 
> org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1955) at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.globStatus(S3AFileSystem.java:2477) at 
> org.apache.spark.deploy.SparkHadoopUtil.globPath(SparkHadoopUtil.scala:234) 
> at 
> org.apache.spark.deploy.SparkHadoopUtil.globPathIfNecessary(SparkHadoopUtil.scala:244)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource$.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:618)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:350)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:350)
>  at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>  at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>  at scala.collection.immutable.List.foreach(List.scala:381) at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) at 
> scala.collection.immutable.List.flatMap(List.scala:344) at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:349)
>  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178) at 
> org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:533) at 
> org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:412) at 
> com.sap.profile.SparkProfileTask.main(SparkProfileTask.java:95) at 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:635)
>  Caused by: java.util.regex.PatternSyntaxException: Unmatched closing ')' 
> near index 130 
> A_VERY_LONG_DIRECTORY_FOLDER_THAT_INCLUDES_MULTIBYTE_AND_SPECIAL_CHARACTERS_abcdefghijklmnopqrst_0123456789_~@#\$%\^&\(\)-_=\+[(?:]);',\._???_???_???_??
>  ^ at java.util.regex.Pattern.error(Pattern.java:1955) at 
> java.util.regex.Pattern.compile(Pattern.java:1700) at 
> java.util.regex.Pattern.(Pattern.java:1351) at 
> java.util.regex.Pattern.compile(Pattern.java:1054) at 
> org.apache.hadoop.fs.GlobPattern.set(GlobPattern.java:156) at 
> org.apache.hadoop.fs.GlobPattern.(GlobPattern.java:42) at 
> org.apache.hadoop.fs.GlobFilter.init(GlobFilter.java:67) ... 25 more 18/02/14 
> 04:52:46 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 15, 
> (reason: User class threw exception: java.io.IOException: Illegal file 
> pattern: Unmatched closing 

[jira] [Created] (SPARK-23451) Deprecate KMeans computeCost

2018-02-16 Thread Marco Gaido (JIRA)
Marco Gaido created SPARK-23451:
---

 Summary: Deprecate KMeans computeCost
 Key: SPARK-23451
 URL: https://issues.apache.org/jira/browse/SPARK-23451
 Project: Spark
  Issue Type: Task
  Components: ML
Affects Versions: 2.4.0
Reporter: Marco Gaido


SPARK-11029 added the {{computeCost}} method as a temporary fix for the lack of 
proper cluster evaluators. Now that SPARK-14516 introduces a proper 
{{ClusteringEvaluator}}, we should deprecate this method and possibly remove it 
in a future release.
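
A minimal sketch of the replacement path, assuming a {{dataset}} with a "features" column:
{code:java}
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.evaluation.ClusteringEvaluator

val model = new KMeans().setK(2).setSeed(1L).fit(dataset)
val predictions = model.transform(dataset)

// Silhouette with squared Euclidean distance, instead of model.computeCost(dataset)
val evaluator = new ClusteringEvaluator()
val silhouette = evaluator.evaluate(predictions)
println(s"Silhouette = $silhouette")
{code}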



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23450) jars option in spark submit is documented in misleading way

2018-02-16 Thread Gregory Reshetniak (JIRA)
Gregory Reshetniak created SPARK-23450:
--

 Summary: jars option in spark submit is documented in misleading 
way
 Key: SPARK-23450
 URL: https://issues.apache.org/jira/browse/SPARK-23450
 Project: Spark
  Issue Type: Bug
  Components: Spark Submit
Affects Versions: 2.2.1
Reporter: Gregory Reshetniak


I am wondering if the {{--jars}} option on spark-submit is actually meant for 
distributing the dependency jars onto the nodes in the cluster?
 
In my case I can see it working as a "symlink" of sorts, but the documentation 
is written in a way that suggests otherwise. Please help me figure out whether 
this is a bug or just my reading of the docs. Thanks!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23449) Extra java options lose order in Docker context

2018-02-16 Thread Andrew Korzhuev (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Korzhuev updated SPARK-23449:

Description: 
`spark.driver.extraJavaOptions` and `spark.executor.extraJavaOptions` do not 
preserve their ordering when processed in `entrypoint.sh`, which makes 
`-XX:+UnlockExperimentalVMOptions` unusable, as you have to pass it before any 
other experimental options.

 

Steps to reproduce:
 # Set `spark.driver.extraJavaOptions`, e.g. `-XX:+UnlockExperimentalVMOptions 
-XX:+UseG1GC -XX:+CMSClassUnloadingEnabled -XX:+UseCGroupMemoryLimitForHeap`
 # Submit application to k8s cluster.
 # Fetch logs and observe that on each run order of options is different and 
when `-XX:+UnlockExperimentalVMOptions` is not the first startup will fail.

 

Expected behaviour:
 # Order of `extraJavaOptions` should be preserved.

 

Cause:

`entrypoint.sh` fetches environment options with `env`, which doesn't guarantee 
ordering.
{code:java}
env | grep SPARK_JAVA_OPT_ | sed 's/[^=]*=\(.*\)/\1/g' > 
/tmp/java_opts.txt{code}

  was:
`spark.driver.extraJavaOptions` and `spark.executor.extraJavaOptions` when 
processed in `entrypoint.sh`, which makes `-XX:+UnlockExperimentalVMOptions` 
unusable, as you have to pass it before any other experimental options.

 

Steps to reproduce:
 # Set `spark.driver.extraJavaOptions`, e.g. `-XX:+UnlockExperimentalVMOptions 
-XX:+UseG1GC -XX:+CMSClassUnloadingEnabled -XX:+UseCGroupMemoryLimitForHeap`
 # Submit application to k8s cluster.
 # Fetch logs and observe that on each run order of options is different and 
when `-XX:+UnlockExperimentalVMOptions` is not the first startup will fail.

 

Expected behaviour:
 # Order of `extraJavaOptions` should be preserved.

 

Cause:

`entrypoint.sh` fetches environment options with `env`, which doesn't guarantee 
ordering.
{code:java}
env | grep SPARK_JAVA_OPT_ | sed 's/[^=]*=\(.*\)/\1/g' > 
/tmp/java_opts.txt{code}


> Extra java options lose order in Docker context
> ---
>
> Key: SPARK-23449
> URL: https://issues.apache.org/jira/browse/SPARK-23449
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
> Environment: Running Spark on K8S with supplied Docker image. Passing 
> along extra java options.
>Reporter: Andrew Korzhuev
>Priority: Minor
> Fix For: 2.3.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> `spark.driver.extraJavaOptions` and `spark.executor.extraJavaOptions` do not 
> preserve their ordering when processed in `entrypoint.sh`, which makes 
> `-XX:+UnlockExperimentalVMOptions` unusable, as you have to pass it before 
> any other experimental options.
>  
> Steps to reproduce:
>  # Set `spark.driver.extraJavaOptions`, e.g. 
> `-XX:+UnlockExperimentalVMOptions -XX:+UseG1GC -XX:+CMSClassUnloadingEnabled 
> -XX:+UseCGroupMemoryLimitForHeap`
>  # Submit application to k8s cluster.
>  # Fetch logs and observe that on each run order of options is different and 
> when `-XX:+UnlockExperimentalVMOptions` is not the first startup will fail.
>  
> Expected behaviour:
>  # Order of `extraJavaOptions` should be preserved.
>  
> Cause:
> `entrypoint.sh` fetches environment options with `env`, which doesn't 
> guarantee ordering.
> {code:java}
> env | grep SPARK_JAVA_OPT_ | sed 's/[^=]*=\(.*\)/\1/g' > 
> /tmp/java_opts.txt{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23439) Ambiguous reference when selecting column inside StructType with same name that outer colum

2018-02-16 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16367414#comment-16367414
 ] 

Wenchen Fan commented on SPARK-23439:
-

This is valid behavior, as `a.b` is an invalid column name for most external 
storages like Parquet. I think it's reasonable to name the nested column after 
its deepest field. Users should manually alias the column to avoid duplication 
before saving data to external storages.
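
A minimal sketch of that workaround, reusing the {{df}} from the report below 
(requires {{import spark.implicits._}} for the {{$}} column syntax):
{code:java}
df.select($"a", $"b.a".as("b_a")).printSchema()
// root
//  |-- a: integer (nullable = false)
//  |-- b_a: integer (nullable = true)
{code}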

> Ambiguous reference when selecting column inside StructType with same name 
> that outer colum
> ---
>
> Key: SPARK-23439
> URL: https://issues.apache.org/jira/browse/SPARK-23439
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: Scala 2.11.8, Spark 2.2.0
>Reporter: Alejandro Trujillo Caballero
>Priority: Minor
>
> Hi.
> I've seen that when working with nested struct fields in a DataFrame and 
> doing a select operation the nesting is lost and this can result in 
> collisions between column names.
> For example:
>  
> {code:java}
> case class Foo(a: Int, b: Bar)
> case class Bar(a: Int)
> val items = List(
>   Foo(1, Bar(1)),
>   Foo(2, Bar(2))
> )
> val df = spark.createDataFrame(items)
> val df_a_a = df.select($"a", $"b.a").show
> //+---+---+
> //|  a|  a|
> //+---+---+
> //|  1|  1|
> //|  2|  2|
> //+---+---+
> df.select($"a", $"b.a").printSchema
> //root
> //|-- a: integer (nullable = false)
> //|-- a: integer (nullable = true)
> df.select($"a", $"b.a").select($"a")
> //org.apache.spark.sql.AnalysisException: Reference 'a' is ambiguous, could 
> be: a#9, a#{code}
>  
>  
> Shouldn't the second column be named "b.a"?
>  
> Thanks.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23449) Extra java options lose order in Docker context

2018-02-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23449:


Assignee: (was: Apache Spark)

> Extra java options lose order in Docker context
> ---
>
> Key: SPARK-23449
> URL: https://issues.apache.org/jira/browse/SPARK-23449
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
> Environment: Running Spark on K8S with supplied Docker image. Passing 
> along extra java options.
>Reporter: Andrew Korzhuev
>Priority: Minor
> Fix For: 2.3.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> `spark.driver.extraJavaOptions` and `spark.executor.extraJavaOptions` lose 
> their ordering when processed in `entrypoint.sh`, which makes 
> `-XX:+UnlockExperimentalVMOptions` unusable, as you have to pass it before any 
> other experimental options.
>  
> Steps to reproduce:
>  # Set `spark.driver.extraJavaOptions`, e.g. 
> `-XX:+UnlockExperimentalVMOptions -XX:+UseG1GC -XX:+CMSClassUnloadingEnabled 
> -XX:+UseCGroupMemoryLimitForHeap`
>  # Submit application to k8s cluster.
>  # Fetch logs and observe that on each run order of options is different and 
> when `-XX:+UnlockExperimentalVMOptions` is not the first startup will fail.
>  
> Expected behaviour:
>  # Order of `extraJavaOptions` should be preserved.
>  
> Cause:
> `entrypoint.sh` fetches environment options with `env`, which doesn't 
> guarantee ordering.
> {code:java}
> env | grep SPARK_JAVA_OPT_ | sed 's/[^=]*=\(.*\)/\1/g' > 
> /tmp/java_opts.txt{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23449) Extra java options lose order in Docker context

2018-02-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16367370#comment-16367370
 ] 

Apache Spark commented on SPARK-23449:
--

User 'andrusha' has created a pull request for this issue:
https://github.com/apache/spark/pull/20628

> Extra java options lose order in Docker context
> ---
>
> Key: SPARK-23449
> URL: https://issues.apache.org/jira/browse/SPARK-23449
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
> Environment: Running Spark on K8S with supplied Docker image. Passing 
> along extra java options.
>Reporter: Andrew Korzhuev
>Priority: Minor
> Fix For: 2.3.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> `spark.driver.extraJavaOptions` and `spark.executor.extraJavaOptions` lose 
> their ordering when processed in `entrypoint.sh`, which makes 
> `-XX:+UnlockExperimentalVMOptions` unusable, as you have to pass it before any 
> other experimental options.
>  
> Steps to reproduce:
>  # Set `spark.driver.extraJavaOptions`, e.g. 
> `-XX:+UnlockExperimentalVMOptions -XX:+UseG1GC -XX:+CMSClassUnloadingEnabled 
> -XX:+UseCGroupMemoryLimitForHeap`
>  # Submit application to k8s cluster.
>  # Fetch logs and observe that on each run order of options is different and 
> when `-XX:+UnlockExperimentalVMOptions` is not the first startup will fail.
>  
> Expected behaviour:
>  # Order of `extraJavaOptions` should be preserved.
>  
> Cause:
> `entrypoint.sh` fetches environment options with `env`, which doesn't 
> guarantee ordering.
> {code:java}
> env | grep SPARK_JAVA_OPT_ | sed 's/[^=]*=\(.*\)/\1/g' > 
> /tmp/java_opts.txt{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23449) Extra java options lose order in Docker context

2018-02-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23449:


Assignee: Apache Spark

> Extra java options lose order in Docker context
> ---
>
> Key: SPARK-23449
> URL: https://issues.apache.org/jira/browse/SPARK-23449
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
> Environment: Running Spark on K8S with supplied Docker image. Passing 
> along extra java options.
>Reporter: Andrew Korzhuev
>Assignee: Apache Spark
>Priority: Minor
> Fix For: 2.3.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> `spark.driver.extraJavaOptions` and `spark.executor.extraJavaOptions` lose 
> their ordering when processed in `entrypoint.sh`, which makes 
> `-XX:+UnlockExperimentalVMOptions` unusable, as you have to pass it before any 
> other experimental options.
>  
> Steps to reproduce:
>  # Set `spark.driver.extraJavaOptions`, e.g. 
> `-XX:+UnlockExperimentalVMOptions -XX:+UseG1GC -XX:+CMSClassUnloadingEnabled 
> -XX:+UseCGroupMemoryLimitForHeap`
>  # Submit application to k8s cluster.
>  # Fetch logs and observe that on each run order of options is different and 
> when `-XX:+UnlockExperimentalVMOptions` is not the first startup will fail.
>  
> Expected behaviour:
>  # Order of `extraJavaOptions` should be preserved.
>  
> Cause:
> `entrypoint.sh` fetches environment options with `env`, which doesn't 
> guarantee ordering.
> {code:java}
> env | grep SPARK_JAVA_OPT_ | sed 's/[^=]*=\(.*\)/\1/g' > 
> /tmp/java_opts.txt{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23449) Extra java options lose order in Docker context

2018-02-16 Thread Andrew Korzhuev (JIRA)
Andrew Korzhuev created SPARK-23449:
---

 Summary: Extra java options lose order in Docker context
 Key: SPARK-23449
 URL: https://issues.apache.org/jira/browse/SPARK-23449
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes
Affects Versions: 2.3.0
 Environment: Running Spark on K8S with supplied Docker image. Passing 
along extra java options.
Reporter: Andrew Korzhuev
 Fix For: 2.3.0


`spark.driver.extraJavaOptions` and `spark.executor.extraJavaOptions` lose their 
ordering when processed in `entrypoint.sh`, which makes 
`-XX:+UnlockExperimentalVMOptions` unusable, as you have to pass it before any 
other experimental options.

 

Steps to reproduce:
 # Set `spark.driver.extraJavaOptions`, e.g. `-XX:+UnlockExperimentalVMOptions 
-XX:+UseG1GC -XX:+CMSClassUnloadingEnabled -XX:+UseCGroupMemoryLimitForHeap`
 # Submit application to k8s cluster.
 # Fetch logs and observe that on each run order of options is different and 
when `-XX:+UnlockExperimentalVMOptions` is not the first startup will fail.

 

Expected behaviour:
 # Order of `extraJavaOptions` should be preserved.

 

Cause:

`entrypoint.sh` fetches environment options with `env`, which doesn't guarantee 
ordering.
{code:java}
env | grep SPARK_JAVA_OPT_ | sed 's/[^=]*=\(.*\)/\1/g' > 
/tmp/java_opts.txt{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23448) Dataframe returns wrong result when column don't respect datatype

2018-02-16 Thread Ahmed ZAROUI (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ahmed ZAROUI updated SPARK-23448:
-
Summary: Dataframe returns wrong result when column don't respect datatype  
(was: Data encoding problem when not finding the right type)

> Dataframe returns wrong result when column don't respect datatype
> -
>
> Key: SPARK-23448
> URL: https://issues.apache.org/jira/browse/SPARK-23448
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.2
> Environment: Local
>Reporter: Ahmed ZAROUI
>Priority: Major
>
> I have the following json file that contains some noisy data (a String instead 
> of an Array):
>  
> {code:java}
> {"attr1":"val1","attr2":"[\"val2\"]"}
> {"attr1":"val1","attr2":["val2"]}
> {code}
> And I need to specify the schema programmatically like this:
>  
> {code:java}
> implicit val spark = SparkSession
>   .builder()
>   .master("local[*]")
>   .config("spark.ui.enabled", false)
>   .config("spark.sql.caseSensitive", "True")
>   .getOrCreate()
> import spark.implicits._
> val schema = StructType(
>   Seq(StructField("attr1", StringType, true),
>   StructField("attr2", ArrayType(StringType, true), true)))
> spark.read.schema(schema).json(input).collect().foreach(println)
> {code}
> The result given by this code is:
> {code:java}
> [null,null]
> [val1,WrappedArray(val2)]
> {code}
> Instead of putting null only in the corrupted column, all columns of the first 
> message are null.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23448) Data encoding problem when not finding the right type

2018-02-16 Thread Ahmed ZAROUI (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ahmed ZAROUI updated SPARK-23448:
-
Description: 
I have the following json file that contains some noisy data (a String instead of 
an Array):

 
{code:java}
{"attr1":"val1","attr2":"[\"val2\"]"}
{"attr1":"val1","attr2":["val2"]}
{code}
And I need to specify the schema programmatically like this:

 
{code:java}
implicit val spark = SparkSession
  .builder()
  .master("local[*]")
  .config("spark.ui.enabled", false)
  .config("spark.sql.caseSensitive", "True")
  .getOrCreate()
import spark.implicits._

val schema = StructType(
  Seq(StructField("attr1", StringType, true),
  StructField("attr2", ArrayType(StringType, true), true)))

spark.read.schema(schema).json(input).collect().foreach(println)
{code}
The result given by this code is:
{code:java}
[null,null]
[val1,WrappedArray(val2)]
{code}
Instead of putting null only in the corrupted column, all columns of the first 
message are null.
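
A workaround sketch for inspecting such rows (the corrupt-record column name is 
an assumption, and {{input}} is the path used above): declaring the column in 
the schema makes PERMISSIVE mode keep the raw text of the malformed row, so it 
is not silently lost.
{code:java}
import org.apache.spark.sql.types.{ArrayType, StringType, StructField, StructType}

val schemaWithCorrupt = StructType(Seq(
  StructField("attr1", StringType, true),
  StructField("attr2", ArrayType(StringType, true), true),
  StructField("_corrupt_record", StringType, true)))

spark.read
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .schema(schemaWithCorrupt)
  .json(input)
  .show(false)
{code}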

 

 

  was:
I have the following json file that contains some noisy data(String instead of 
Array):

 
{code:java}
{"attr1":"val1","attr2":"[\"val2\"]"}
{"attr1":"val1","attr2":["val2"]}
{code}
And i need to specify schema programatically like this:

 
{code:java}
implicit val spark = SparkSession
  .builder()
  .master("local[*]")
  .config("spark.ui.enabled", false)
  .config("spark.sql.caseSensitive", "True")
  .getOrCreate()
import spark.implicits._

val 
schema=StructType(Seq(StructField("attr1",StringType,true),StructField("attr2",ArrayType(StringType,true),true)))
  spark.read.schema(schema).json(input).collect().foreach(println)
{code}
The result given by this code is:
{code:java}
[null,null]
[val1,WrappedArray(val2)]
{code}
Instead of putting null in corrupted column, all columns of the first message 
are null

 

 


> Data encoding problem when not finding the right type
> -
>
> Key: SPARK-23448
> URL: https://issues.apache.org/jira/browse/SPARK-23448
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.2
> Environment: Local
>Reporter: Ahmed ZAROUI
>Priority: Major
>
> I have the following json file that contains some noisy data(String instead 
> of Array):
>  
> {code:java}
> {"attr1":"val1","attr2":"[\"val2\"]"}
> {"attr1":"val1","attr2":["val2"]}
> {code}
> And i need to specify schema programatically like this:
>  
> {code:java}
> implicit val spark = SparkSession
>   .builder()
>   .master("local[*]")
>   .config("spark.ui.enabled", false)
>   .config("spark.sql.caseSensitive", "True")
>   .getOrCreate()
> import spark.implicits._
> val schema = StructType(
>   Seq(StructField("attr1", StringType, true),
>   StructField("attr2", ArrayType(StringType, true), true)))
> spark.read.schema(schema).json(input).collect().foreach(println)
> {code}
> The result given by this code is:
> {code:java}
> [null,null]
> [val1,WrappedArray(val2)]
> {code}
> Instead of putting null in corrupted column, all columns of the first message 
> are null
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23448) Data encoding problem when not finding the right type

2018-02-16 Thread Ahmed ZAROUI (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ahmed ZAROUI updated SPARK-23448:
-
Environment: Local  (was: Tested locally in linux machine)

> Data encoding problem when not finding the right type
> -
>
> Key: SPARK-23448
> URL: https://issues.apache.org/jira/browse/SPARK-23448
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.2
> Environment: Local
>Reporter: Ahmed ZAROUI
>Priority: Major
>
> I have the following json file that contains some noisy data(String instead 
> of Array):
>  
> {code:java}
> {"attr1":"val1","attr2":"[\"val2\"]"}
> {"attr1":"val1","attr2":["val2"]}
> {code}
> And i need to specify schema programatically like this:
>  
> {code:java}
> implicit val spark = SparkSession
>   .builder()
>   .master("local[*]")
>   .config("spark.ui.enabled", false)
>   .config("spark.sql.caseSensitive", "True")
>   .getOrCreate()
> import spark.implicits._
> val 
> schema=StructType(Seq(StructField("attr1",StringType,true),StructField("attr2",ArrayType(StringType,true),true)))
>   spark.read.schema(schema).json(input).collect().foreach(println)
> {code}
> The result given by this code is:
> {code:java}
> [null,null]
> [val1,WrappedArray(val2)]
> {code}
> Instead of putting null in corrupted column, all columns of the first message 
> are null
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23448) Data encoding problem when not finding the right type

2018-02-16 Thread Ahmed ZAROUI (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ahmed ZAROUI updated SPARK-23448:
-
Description: 
I have the following json file that contains some noisy data(String instead of 
Array):

 
{code:java}
{"attr1":"val1","attr2":"[\"val2\"]"}
{"attr1":"val1","attr2":["val2"]}
{code}
And I need to specify the schema programmatically, like this:

 
{code:java}
implicit val spark = SparkSession
  .builder()
  .master("local[*]")
  .config("spark.ui.enabled", false)
  .config("spark.sql.caseSensitive", "True")
  .getOrCreate()
import spark.implicits._

val schema = StructType(
  Seq(StructField("attr1", StringType, true),
    StructField("attr2", ArrayType(StringType, true), true)))
spark.read.schema(schema).json(input).collect().foreach(println)
{code}
The result given by this code is:
{code:java}
[null,null]
[val1,WrappedArray(val2)]
{code}
Instead of putting null only in the corrupted column, all columns of the first 
message are null.
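
As an editor's note (not part of the original report): a minimal workaround sketch is to declare an extra string column for the corrupt record in the user-supplied schema and read in PERMISSIVE mode, so the malformed line is preserved for inspection instead of the row silently becoming all null. The file name {{input.json}} and the column name {{_corrupt_record}} are illustrative placeholders, and the options shown are the Spark 2.x JSON reader options.
{code:java}
// Editor's sketch, not from the report: capture the malformed line instead of
// silently getting an all-null row. Names are illustrative placeholders.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().master("local[*]").getOrCreate()

val schemaWithCorrupt = StructType(Seq(
  StructField("attr1", StringType, true),
  StructField("attr2", ArrayType(StringType, true), true),
  // extra StringType column that receives the raw malformed record
  StructField("_corrupt_record", StringType, true)))

spark.read
  .schema(schemaWithCorrupt)
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .json("input.json")
  .show(false)
{code}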

 

 

  was:
I have the following JSON file that contains some noisy data (a String instead of 
an Array):

 
{code:java}
{"attr1":"val1","attr2":["val2"]} 
{"attr1":"val1","attr2":"[\"val2\"]"}
{code}
And I need to specify the schema programmatically, like this:

 
{code:java}
implicit val spark = SparkSession
  .builder()
  .master("local[*]")
  .config("spark.ui.enabled", false)
  .config("spark.sql.caseSensitive", "True")
  .getOrCreate()
import spark.implicits._

val schema = StructType(
  Seq(StructField("attr1", StringType, true),
    StructField("attr2", ArrayType(StringType, true), true)))
spark.read.schema(schema).json(input).collect().foreach(println)
{code}
The result given by this code is:
{code:java}
[null,null]
[val1,WrappedArray(val2)]
{code}
Instead of putting null only in the corrupted column, all columns of the first 
message are null.

 

 


> Data encoding problem when not finding the right type
> -
>
> Key: SPARK-23448
> URL: https://issues.apache.org/jira/browse/SPARK-23448
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.2
> Environment: Tested locally in linux machine
>Reporter: Ahmed ZAROUI
>Priority: Major
>
> I have the following JSON file that contains some noisy data (a String instead 
> of an Array):
>  
> {code:java}
> {"attr1":"val1","attr2":"[\"val2\"]"}
> {"attr1":"val1","attr2":["val2"]}
> {code}
> And I need to specify the schema programmatically, like this:
>  
> {code:java}
> implicit val spark = SparkSession
>   .builder()
>   .master("local[*]")
>   .config("spark.ui.enabled", false)
>   .config("spark.sql.caseSensitive", "True")
>   .getOrCreate()
> import spark.implicits._
> val schema = StructType(
>   Seq(StructField("attr1", StringType, true),
>   StructField("attr2", ArrayType(StringType, true), true)))
> spark.read.schema(schema).json(input).collect().foreach(println)
> {code}
> The result given by this code is:
> {code:java}
> [null,null]
> [val1,WrappedArray(val2)]
> {code}
> Instead of putting null only in the corrupted column, all columns of the first 
> message are null.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23448) Data encoding problem when not finding the right type

2018-02-16 Thread Ahmed ZAROUI (JIRA)
Ahmed ZAROUI created SPARK-23448:


 Summary: Data encoding problem when not finding the right type
 Key: SPARK-23448
 URL: https://issues.apache.org/jira/browse/SPARK-23448
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.0.2
 Environment: Tested locally in linux machine
Reporter: Ahmed ZAROUI


I have the following JSON file that contains some noisy data (a String instead of 
an Array):

 
{code:java}
{"attr1":"val1","attr2":["val2"]} 
{"attr1":"val1","attr2":"[\"val2\"]"}
{code}
And I need to specify the schema programmatically, like this:

 
{code:java}
implicit val spark = SparkSession
  .builder()
  .master("local[*]")
  .config("spark.ui.enabled", false)
  .config("spark.sql.caseSensitive", "True")
  .getOrCreate()
import spark.implicits._

val schema = StructType(
  Seq(StructField("attr1", StringType, true),
    StructField("attr2", ArrayType(StringType, true), true)))
spark.read.schema(schema).json(input).collect().foreach(println)
{code}
The result given by this code is:
{code:java}
[null,null]
[val1,WrappedArray(val2)]
{code}
Instead of putting null only in the corrupted column, all columns of the first 
message are null.

 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23439) Ambiguous reference when selecting column inside StructType with the same name as the outer column

2018-02-16 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16366945#comment-16366945
 ] 

Marco Gaido commented on SPARK-23439:
-

[~cloud_fan] I think this comes from https://github.com/apache/spark/pull/8215 
(https://github.com/apache/spark/blob/1dc2c1d5e85c5f404f470aeb44c1f3c22786bdea/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala#L203).
 We are adding an Alias with the name of the last extracted value. I am not sure 
whether this is the right behavior (in which case this JIRA is invalid) or 
whether it should be changed. What do you think? Thanks.
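
As a side note (editor's sketch, not part of the discussion): until the naming question is settled, the ambiguity can be avoided by aliasing the nested field explicitly so the flattened column gets a distinct name.
{code:java}
// Editor's sketch: explicit alias as a workaround for the ambiguous reference.
import org.apache.spark.sql.SparkSession

case class Bar(a: Int)
case class Foo(a: Int, b: Bar)

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = spark.createDataFrame(Seq(Foo(1, Bar(1)), Foo(2, Bar(2))))

// Aliasing "b.a" keeps the two columns distinguishable in later selects.
val selected = df.select($"a", $"b.a".as("b_a"))
selected.select($"a").show()   // no longer ambiguous
selected.select($"b_a").show()
{code}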

> Ambiguous reference when selecting column inside StructType with the same 
> name as the outer column
> ---
>
> Key: SPARK-23439
> URL: https://issues.apache.org/jira/browse/SPARK-23439
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: Scala 2.11.8, Spark 2.2.0
>Reporter: Alejandro Trujillo Caballero
>Priority: Minor
>
> Hi.
> I've seen that when working with nested struct fields in a DataFrame and 
> doing a select operation, the nesting is lost, and this can result in 
> collisions between column names.
> For example:
>  
> {code:java}
> case class Foo(a: Int, b: Bar)
> case class Bar(a: Int)
> val items = List(
>   Foo(1, Bar(1)),
>   Foo(2, Bar(2))
> )
> val df = spark.createDataFrame(items)
> val df_a_a = df.select($"a", $"b.a").show
> //+---+---+
> //|  a|  a|
> //+---+---+
> //|  1|  1|
> //|  2|  2|
> //+---+---+
> df.select($"a", $"b.a").printSchema
> //root
> //|-- a: integer (nullable = false)
> //|-- a: integer (nullable = true)
> df.select($"a", $"b.a").select($"a")
> //org.apache.spark.sql.AnalysisException: Reference 'a' is ambiguous, could 
> be: a#9, a#{code}
>  
>  
> Shouldn't the second column be named "b.a"?
>  
> Thanks.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23442) Reading from partitioned and bucketed table uses only bucketSpec.numBuckets partitions in all cases

2018-02-16 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16366898#comment-16366898
 ] 

Marco Gaido commented on SPARK-23442:
-

I am not sure whether this is what you are looking for, but you can repartition 
the resulting DataFrame to get more partitions (see the sketch below).
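
A minimal sketch of that suggestion (editor's illustration; the table name and target partition count are placeholders). Note that {{repartition}} adds a shuffle and discards the bucket-derived distribution, so it trades the avoided exchange for read parallelism.
{code:java}
// Editor's sketch: raise parallelism after the bucketed scan by repartitioning.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

val df = spark.table("tablename").repartition(2000) // 2000 is illustrative
println(df.rdd.getNumPartitions)                    // no longer limited to 50
{code}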

> Reading from partitioned and bucketed table uses only bucketSpec.numBuckets 
> partitions in all cases
> ---
>
> Key: SPARK-23442
> URL: https://issues.apache.org/jira/browse/SPARK-23442
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.2.1
>Reporter: Pranav Rao
>Priority: Major
>
> Through the DataFrameWriter[T] interface I have created an external Hive table 
> with 5000 (horizontal) partitions and 50 buckets in each partition. Overall 
> the dataset is 600GB and the provider is Parquet.
> Now this works great when joining with a similarly bucketed dataset - it's 
> able to avoid a shuffle. 
> But any action on this DataFrame (from _spark.table("tablename")_) works with 
> only 50 RDD partitions. This is happening because of 
> [createBucketedReadRDD|https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala].
>  So the 600GB dataset is only read through 50 tasks, which makes this 
> partitioning + bucketing scheme not useful.
> I cannot expose the base directory of the parquet folder for reading the 
> dataset, because the partition locations don't follow a (basePath + partSpec) 
> format.
> Meanwhile, are there workarounds to use higher parallelism while reading such 
> a table? 
>  Let me know if I can help in any way.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23437) [ML] Distributed Gaussian Process Regression for MLlib

2018-02-16 Thread Valeriy Avanesov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16366857#comment-16366857
 ] 

Valeriy Avanesov commented on SPARK-23437:
--

[~mlnick], is that really supposed to apply to a textbook algorithm that fills 
a clear gap? MLlib currently provides no non-parametric regression technique 
that infers a smooth function. 

Regarding the guidelines: the requirements for the algorithm are 
 # Be widely known
 # Be used and accepted (academic citations and concrete use cases can help 
justify this)
 # Be highly scalable

and I think all of them hold (see the original post). 

> [ML] Distributed Gaussian Process Regression for MLlib
> --
>
> Key: SPARK-23437
> URL: https://issues.apache.org/jira/browse/SPARK-23437
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Affects Versions: 2.2.1
>Reporter: Valeriy Avanesov
>Priority: Major
>
> Gaussian Process Regression (GP) is a well-known black-box non-linear 
> regression approach [1]. For years the approach remained inapplicable to 
> large samples due to its cubic computational complexity; however, more recent 
> techniques (Sparse GP) require only linear complexity. The field continues to 
> attract the interest of researchers – several papers devoted to GP were 
> presented at NIPS 2017. 
> Unfortunately, the non-parametric regression techniques that come with MLlib 
> are restricted to tree-based approaches.
> I propose to create and include an implementation (which I am going to work 
> on) of so-called robust Bayesian Committee Machine proposed and investigated 
> in [2].
> [1] Carl Edward Rasmussen and Christopher K. I. Williams. 2005. _Gaussian 
> Processes for Machine Learning (Adaptive Computation and Machine Learning)_. 
> The MIT Press.
> [2] Marc Peter Deisenroth and Jun Wei Ng. 2015. Distributed Gaussian 
> processes. In _Proceedings of the 32nd International Conference on 
> International Conference on Machine Learning - Volume 37_ (ICML'15), Francis 
> Bach and David Blei (Eds.), Vol. 37. JMLR.org 1481-1490.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23265) Update multi-column error handling logic in QuantileDiscretizer

2018-02-16 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16366809#comment-16366809
 ] 

Nick Pentreath commented on SPARK-23265:


Thanks for the ping - yes, it adds more detailed checking of the exclusive 
params and would cause an error to be thrown in certain additional situations 
(specifically: {{numBucketsArray}} set for a single-column transform, 
{{numBuckets}} and {{numBucketsArray}} both set for a multi-column transform, 
and a mismatched length of {{numBucketsArray}} versus the input/output columns 
for a multi-column transform), as sketched below.
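
For illustration (editor's sketch; the column names are placeholders, and the multi-column setters are the ones added by SPARK-22397), a valid and an invalid combination; the invalid one should be rejected by the param checks when the discretizer is fit:
{code:java}
// Editor's sketch of the exclusive-param checks described above.
import org.apache.spark.ml.feature.QuantileDiscretizer

// Valid: a single numBuckets applied to all input columns.
val qdOk = new QuantileDiscretizer()
  .setInputCols(Array("c1", "c2"))
  .setOutputCols(Array("c1_bucket", "c2_bucket"))
  .setNumBuckets(4)

// Invalid: numBuckets and numBucketsArray set together for a
// multi-column transform; fitting this should fail the param check.
val qdBad = new QuantileDiscretizer()
  .setInputCols(Array("c1", "c2"))
  .setOutputCols(Array("c1_bucket", "c2_bucket"))
  .setNumBuckets(4)
  .setNumBucketsArray(Array(3, 5))
{code}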

I reviewed the PR and it LGTM, so as I said there, we can merge this now before 
RC4 gets cut.

> Update multi-column error handling logic in QuantileDiscretizer
> ---
>
> Key: SPARK-23265
> URL: https://issues.apache.org/jira/browse/SPARK-23265
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Nick Pentreath
>Priority: Major
>
> SPARK-22397 added support for multiple columns to {{QuantileDiscretizer}}. If 
> both single- and multi-column params are set (specifically {{inputCol}} / 
> {{inputCols}}) an error is thrown.
> However, SPARK-22799 added more comprehensive error logic for {{Bucketizer}}. 
> The logic for {{QuantileDiscretizer}} should be updated to match. *Note* that 
> for this transformer, it is acceptable to set the single-column param for 
> \{{numBuckets}} when transforming multiple columns, since that is then 
> applied to all columns.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23399) Register a task completion listener first for OrcColumnarBatchReader

2018-02-16 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16366788#comment-16366788
 ] 

Marco Gaido commented on SPARK-23399:
-

I think we should reopen this; it is still happening: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87486/testReport/org.apache.spark.sql.execution.datasources.orc/OrcQuerySuite/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/

> Register a task completion listener first for OrcColumnarBatchReader
> 
>
> Key: SPARK-23399
> URL: https://issues.apache.org/jira/browse/SPARK-23399
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 2.3.1
>
>
> This is related with SPARK-23390.
> Currently, there was a opened file leak for OrcColumnarBatchReader.
> {code}
> [info] - Enabling/disabling ignoreMissingFiles using orc (648 milliseconds)
> 15:55:58.673 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in 
> stage 61.0 (TID 85, localhost, executor driver): TaskKilled (Stage cancelled)
> 15:55:58.674 WARN org.apache.spark.DebugFilesystem: Leaked filesystem 
> connection created at:
> java.lang.Throwable
>   at 
> org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36)
>   at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70)
>   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769)
>   at 
> org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.open(RecordReaderUtils.java:173)
>   at 
> org.apache.orc.impl.RecordReaderImpl.(RecordReaderImpl.java:254)
>   at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:633)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initialize(OrcColumnarBatchReader.java:138)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12140) Support Streaming UI in HistoryServer

2018-02-16 Thread German Schiavon Matteo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16366774#comment-16366774
 ] 

German Schiavon Matteo commented on SPARK-12140:


Ok [~jerryshao], I'm testing your code and it works, but the streaming tab does 
not refresh until the driver is dead/killed. I'm going to keep testing it and 
also want to run some performance tests to assess the scalability issue.

> Support Streaming UI in HistoryServer
> -
>
> Key: SPARK-12140
> URL: https://issues.apache.org/jira/browse/SPARK-12140
> Project: Spark
>  Issue Type: New Feature
>  Components: DStreams
>Affects Versions: 1.6.0
>Reporter: Marcelo Vanzin
>Priority: Major
>
> SPARK-11206 added infrastructure that would allow the streaming UI to be 
> shown in the History Server. We should add the necessary code to make that 
> happen, although it requires some changes to how events and listeners are 
> used.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23265) Update multi-column error handling logic in QuantileDiscretizer

2018-02-16 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-23265:
---
Description: 
SPARK-22397 added support for multiple columns to {{QuantileDiscretizer}}. If 
both single- and multi-column params are set (specifically {{inputCol}} / 
{{inputCols}}) an error is thrown.

However, SPARK-22799 added more comprehensive error logic for {{Bucketizer}}. 
The logic for {{QuantileDiscretizer}} should be updated to match. *Note* that 
for this transformer, it is acceptable to set the single-column param for 
\{{numBuckets}} when transforming multiple columns, since that is then applied 
to all columns.

  was:
SPARK-22397 added support for multiple columns to {{QuantileDiscretizer}}. If 
both single- and multi-column params are set (specifically {{inputCol}} / 
{{inputCols}}) an error is thrown.

However, SPARK-22799 added more comprehensive error logic for {{Bucketizer}}. 
The logic for {{QuantileDiscretizer}} should be updated to match. *Note* that 
for this transformer, it is acceptable to set the single-column param for 
{{numBuckets }}when transforming multiple columns, since that is then applied 
to all columns.


> Update multi-column error handling logic in QuantileDiscretizer
> ---
>
> Key: SPARK-23265
> URL: https://issues.apache.org/jira/browse/SPARK-23265
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Nick Pentreath
>Priority: Major
>
> SPARK-22397 added support for multiple columns to {{QuantileDiscretizer}}. If 
> both single- and multi-column params are set (specifically {{inputCol}} / 
> {{inputCols}}) an error is thrown.
> However, SPARK-22799 added more comprehensive error logic for {{Bucketizer}}. 
> The logic for {{QuantileDiscretizer}} should be updated to match. *Note* that 
> for this transformer, it is acceptable to set the single-column param for 
> \{{numBuckets}} when transforming multiple columns, since that is then 
> applied to all columns.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23217) Add cosine distance measure to ClusteringEvaluator

2018-02-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16366762#comment-16366762
 ] 

Apache Spark commented on SPARK-23217:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/20627

> Add cosine distance measure to ClusteringEvaluator
> --
>
> Key: SPARK-23217
> URL: https://issues.apache.org/jira/browse/SPARK-23217
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 2.4.0
>
> Attachments: SPARK-23217.pdf
>
>
> SPARK-22119 introduced the cosine distance measure for KMeans. Therefore it 
> would be useful to also provide an implementation of ClusteringEvaluator that 
> uses the cosine distance measure.
>  
> Attached you can find a design document for it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23437) [ML] Distributed Gaussian Process Regression for MLlib

2018-02-16 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16366744#comment-16366744
 ] 

Nick Pentreath commented on SPARK-23437:


It sounds interesting - however, the standard practice is that new algorithms 
should probably be released as a third-party Spark package first. If they become 
widely used, then there is a stronger argument for integration into MLlib.

See [http://spark.apache.org/contributing.html] under the MLlib section for 
more details. 

> [ML] Distributed Gaussian Process Regression for MLlib
> --
>
> Key: SPARK-23437
> URL: https://issues.apache.org/jira/browse/SPARK-23437
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Affects Versions: 2.2.1
>Reporter: Valeriy Avanesov
>Priority: Major
>
> Gaussian Process Regression (GP) is a well-known black-box non-linear 
> regression approach [1]. For years the approach remained inapplicable to 
> large samples due to its cubic computational complexity; however, more recent 
> techniques (Sparse GP) require only linear complexity. The field continues to 
> attract the interest of researchers – several papers devoted to GP were 
> presented at NIPS 2017. 
> Unfortunately, the non-parametric regression techniques that come with MLlib 
> are restricted to tree-based approaches.
> I propose to create and include an implementation (which I am going to work 
> on) of so-called robust Bayesian Committee Machine proposed and investigated 
> in [2].
> [1] Carl Edward Rasmussen and Christopher K. I. Williams. 2005. _Gaussian 
> Processes for Machine Learning (Adaptive Computation and Machine Learning)_. 
> The MIT Press.
> [2] Marc Peter Deisenroth and Jun Wei Ng. 2015. Distributed Gaussian 
> processes. In _Proceedings of the 32nd International Conference on 
> International Conference on Machine Learning - Volume 37_ (ICML'15), Francis 
> Bach and David Blei (Eds.), Vol. 37. JMLR.org 1481-1490.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23447) Cleanup codegen template for Literal

2018-02-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23447:


Assignee: (was: Apache Spark)

> Cleanup codegen template for Literal
> 
>
> Key: SPARK-23447
> URL: https://issues.apache.org/jira/browse/SPARK-23447
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Kris Mok
>Priority: Major
>
> Ideally, the codegen templates for {{Literal}} should emit literals in the 
> {{isNull}} and {{value}} fields of {{ExprCode}} so that they can be 
> effectively inlined into their use sites.
> But currently there are a couple of paths where {{Literal.doGenCode()}} 
> returns an {{ExprCode}} that has a non-trivial {{code}} field, and all of 
> those are actually unnecessary.
> We can make a simple refactoring to make sure all codegen templates for 
> {{Literal}} return empty {{code}} and simple literal/constant expressions in 
> {{isNull}} and {{value}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23447) Cleanup codegen template for Literal

2018-02-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16366688#comment-16366688
 ] 

Apache Spark commented on SPARK-23447:
--

User 'rednaxelafx' has created a pull request for this issue:
https://github.com/apache/spark/pull/20626

> Cleanup codegen template for Literal
> 
>
> Key: SPARK-23447
> URL: https://issues.apache.org/jira/browse/SPARK-23447
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Kris Mok
>Priority: Major
>
> Ideally, the codegen templates for {{Literal}} should emit literals in the 
> {{isNull}} and {{value}} fields of {{ExprCode}} so that they can be 
> effectively inlined into their use sites.
> But currently there are a couple of paths where {{Literal.doGenCode()}} 
> returns an {{ExprCode}} that has a non-trivial {{code}} field, and all of 
> those are actually unnecessary.
> We can make a simple refactoring to make sure all codegen templates for 
> {{Literal}} return empty {{code}} and simple literal/constant expressions in 
> {{isNull}} and {{value}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org