[jira] [Comment Edited] (SPARK-23399) Register a task completion listener first for OrcColumnarBatchReader
[ https://issues.apache.org/jira/browse/SPARK-23399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16368121#comment-16368121 ] Dongjoon Hyun edited comment on SPARK-23399 at 2/17/18 7:29 AM: [~mgaido]. I understand what your intention is, but please see the JIRA issue title. It's not about `Fix OrcQuerySuite`. Why did you reopen this issue? Please proceed to file a new JIRA issue for that. This JIRA issue handles the designed scope as described in the manual test case in the PR. For the reported case, I'll investigate more. was (Author: dongjoon): [~mgaido]. I understand what your intention is, but please see the JIRA issue title. It's not about `Fix OrcQuerySuite`. Why did you reopen this issue? Please proceed to file a new JIRA issue for that. > Register a task completion listener first for OrcColumnarBatchReader > > > Key: SPARK-23399 > URL: https://issues.apache.org/jira/browse/SPARK-23399 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 2.3.1 > > > This is related to SPARK-23390. > Currently, there is an open file leak for OrcColumnarBatchReader. > {code} > [info] - Enabling/disabling ignoreMissingFiles using orc (648 milliseconds) > 15:55:58.673 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in > stage 61.0 (TID 85, localhost, executor driver): TaskKilled (Stage cancelled) > 15:55:58.674 WARN org.apache.spark.DebugFilesystem: Leaked filesystem > connection created at: > java.lang.Throwable > at > org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36) > at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70) > at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769) > at > org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.open(RecordReaderUtils.java:173) > at > org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:254) > at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:633) > at > org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initialize(OrcColumnarBatchReader.java:138) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23399) Register a task completion listener first for OrcColumnarBatchReader
[ https://issues.apache.org/jira/browse/SPARK-23399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16368121#comment-16368121 ] Dongjoon Hyun commented on SPARK-23399: --- [~mgaido]. I understand what your intention is, but please see the JIRA issue title. It's not about `Fix OrcQuerySuite`. Why did you reopen this issue? Please proceed to file a new JIRA issue for that. > Register a task completion listener first for OrcColumnarBatchReader > > > Key: SPARK-23399 > URL: https://issues.apache.org/jira/browse/SPARK-23399 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 2.3.1 > > > This is related to SPARK-23390. > Currently, there is an open file leak for OrcColumnarBatchReader. > {code} > [info] - Enabling/disabling ignoreMissingFiles using orc (648 milliseconds) > 15:55:58.673 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in > stage 61.0 (TID 85, localhost, executor driver): TaskKilled (Stage cancelled) > 15:55:58.674 WARN org.apache.spark.DebugFilesystem: Leaked filesystem > connection created at: > java.lang.Throwable > at > org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36) > at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70) > at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769) > at > org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.open(RecordReaderUtils.java:173) > at > org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:254) > at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:633) > at > org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initialize(OrcColumnarBatchReader.java:138) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
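The direction named in the issue title can be illustrated with a short sketch: register the task-completion callback (which closes the reader) before calling initialize(), so the file handle seen in the stack trace above is released even when initialization throws. This is illustrative only, not the actual patch; `batchReader`, `fileSplit` and `attemptContext` are placeholder names for the objects built in OrcFileFormat.buildReaderWithPartitionValues.
{code}
// Illustrative sketch only: register the cleanup listener *before*
// initialize(), so a failure inside initialize() cannot leak the opened file.
// `batchReader`, `fileSplit` and `attemptContext` are placeholder names.
import org.apache.spark.TaskContext

Option(TaskContext.get()).foreach { taskContext =>
  taskContext.addTaskCompletionListener(_ => batchReader.close())
}
batchReader.initialize(fileSplit, attemptContext)
{code}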
[jira] [Commented] (SPARK-23442) Reading from partitioned and bucketed table uses only bucketSpec.numBuckets partitions in all cases
[ https://issues.apache.org/jira/browse/SPARK-23442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16368117#comment-16368117 ] Pranav Rao commented on SPARK-23442: Repartitioning is unlikely to be helpful to a user because: * The map part of repartition is still limited to num_buckets, so it's going to be very slow and not utilise available parallelism. * The user would have pre-partitioned and bucketed his dataset and persisted it precisely to avoid repartitioning/shuffle at read time. So the purpose of this feature is lost. > Reading from partitioned and bucketed table uses only bucketSpec.numBuckets > partitions in all cases > --- > > Key: SPARK-23442 > URL: https://issues.apache.org/jira/browse/SPARK-23442 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.2.1 >Reporter: Pranav Rao >Priority: Major > > Through the DataFrameWriter[T] interface I have created an external Hive table > with 5000 (horizontal) partitions and 50 buckets in each partition. Overall > the dataset is 600GB and the provider is Parquet. > Now this works great when joining with a similarly bucketed dataset - it's > able to avoid a shuffle. > But any action on this DataFrame (from _spark.table("tablename")_) works with > only 50 RDD partitions. This is happening because of > [createBucketedReadRDD|https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala]. > So the 600GB dataset is only read through 50 tasks, which makes this > partitioning + bucketing scheme not useful. > I cannot expose the base directory of the parquet folder for reading the > dataset, because the partition locations don't follow a (basePath + partSpec) > format. > Meanwhile, are there workarounds to use higher parallelism while reading such > a table? > Let me know if I can help in any way. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23455) Default Params in ML should be saved separately
[ https://issues.apache.org/jira/browse/SPARK-23455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16368116#comment-16368116 ] Liang-Chi Hsieh commented on SPARK-23455: - Currently, {{DefaultParamsWriter}} saves the following metadata + params: * - class * - timestamp * - sparkVersion * - uid * - paramMap * - (optionally, extra metadata) User-supplied params and default params are all saved in {{paramMap}} field in JSON. We can have a {{defaultParamMap}} for saving default params. For backward compatibility, when loading metadata, if it is a metadata file prior to Spark 2.4, we shouldn't raise error if we can't find {{defaultParamMap}} field in the file. > Default Params in ML should be saved separately > --- > > Key: SPARK-23455 > URL: https://issues.apache.org/jira/browse/SPARK-23455 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.4.0 >Reporter: Liang-Chi Hsieh >Priority: Major > > We save ML's user-supplied params and default params as one entity in JSON. > During loading the saved models, we set all the loaded params into created ML > model instances as user-supplied params. > It causes some problems, e.g., if we strictly disallow some params to be set > at the same time, a default param can fail the param check because it is > treated as user-supplied param after loading. > The loaded default params should not be set as user-supplied params. We > should save ML default params separately in JSON. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
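A hypothetical sketch of what the proposed split could look like in the saved metadata; the class name, uid and param values below are illustrative, not the merged format.
{code}
// Hypothetical metadata layout only: user-set params stay in "paramMap",
// while defaults move to a separate "defaultParamMap". A loader reading a
// file written before this change would simply tolerate a missing
// "defaultParamMap" field.
val exampleMetadata =
  """{
    |  "class": "org.apache.spark.ml.feature.Bucketizer",
    |  "timestamp": 1518825600000,
    |  "sparkVersion": "2.4.0",
    |  "uid": "bucketizer_0001",
    |  "paramMap": { "handleInvalid": "keep" },
    |  "defaultParamMap": { "outputCol": "bucketizer_0001__output" }
    |}""".stripMargin
{code}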
[jira] [Assigned] (SPARK-23435) R tests should support latest testthat
[ https://issues.apache.org/jira/browse/SPARK-23435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung reassigned SPARK-23435: Assignee: Felix Cheung > R tests should support latest testthat > -- > > Key: SPARK-23435 > URL: https://issues.apache.org/jira/browse/SPARK-23435 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.3.1, 2.4.0 >Reporter: Felix Cheung >Assignee: Felix Cheung >Priority: Major > > To follow up on SPARK-22817, the latest version of testthat, 2.0.0 was > released in Dec 2017, and its method has been changed. > In order for our tests to keep working, we need to detect that and call a > different method. > Jenkins is running 1.0.1 though, we need to check if it is going to work. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23437) [ML] Distributed Gaussian Process Regression for MLlib
[ https://issues.apache.org/jira/browse/SPARK-23437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16368109#comment-16368109 ] Seth Hendrickson commented on SPARK-23437: -- TBH, this seems like a pretty reasonable request. While I agree we do seem to tell people that the "standard" practice is to implement as a third party package and then integrate later, I don't see this happen in practice. I don't know that we've even validated that the "implement as third party package, then in Spark later on" approach even really works. Perhaps an even stronger reason for resisting new algorithms is just lack of reviewer/developer support on Spark ML. It's hard to predict if there will be anyone to review the PR within a reasonable amount of time, even if the code is well-designed. AFAIK, we haven't added any major algos since GeneralizedLinearRegression, which has to have been a couple years ago. That said, I think this is something to at least consider. We can start by discussing what algorithms exist, and why we'd choose a particular one. Strong arguments for why we need GPs in Spark ML are also beneficial. The fact that there isn't a non-parametric regression algo in Spark has some merit, but we don't write new algorithms just for the sake of filling in gaps - there needs to be user demand (which, unfortunately, is often hard to prove). It also helps to point to a package that already implements the algo you're proposing, but for example I don't believe scikit implements the linear-time version so we can't really leverage their experience. Providing more information on any/all of these categories will help make a stronger case, and I do think GPs can be a useful addition. Thanks for leading the discussion! > [ML] Distributed Gaussian Process Regression for MLlib > -- > > Key: SPARK-23437 > URL: https://issues.apache.org/jira/browse/SPARK-23437 > Project: Spark > Issue Type: New Feature > Components: ML, MLlib >Affects Versions: 2.2.1 >Reporter: Valeriy Avanesov >Priority: Major > > Gaussian Process Regression (GP) is a well known black box non-linear > regression approach [1]. For years the approach remained inapplicable to > large samples due to its cubic computational complexity, however, more recent > techniques (Sparse GP) allowed for only linear complexity. The field > continues to attracts interest of the researches – several papers devoted to > GP were present on NIPS 2017. > Unfortunately, non-parametric regression techniques coming with mllib are > restricted to tree-based approaches. > I propose to create and include an implementation (which I am going to work > on) of so-called robust Bayesian Committee Machine proposed and investigated > in [2]. > [1] Carl Edward Rasmussen and Christopher K. I. Williams. 2005. _Gaussian > Processes for Machine Learning (Adaptive Computation and Machine Learning)_. > The MIT Press. > [2] Marc Peter Deisenroth and Jun Wei Ng. 2015. Distributed Gaussian > processes. In _Proceedings of the 32nd International Conference on > International Conference on Machine Learning - Volume 37_ (ICML'15), Francis > Bach and David Blei (Eds.), Vol. 37. JMLR.org 1481-1490. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23455) Default Params in ML should be saved separately
Liang-Chi Hsieh created SPARK-23455: --- Summary: Default Params in ML should be saved separately Key: SPARK-23455 URL: https://issues.apache.org/jira/browse/SPARK-23455 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 2.4.0 Reporter: Liang-Chi Hsieh We save ML's user-supplied params and default params as one entity in JSON. During loading the saved models, we set all the loaded params into created ML model instances as user-supplied params. It causes some problems, e.g., if we strictly disallow some params to be set at the same time, a default param can fail the param check because it is treated as user-supplied param after loading. The loaded default params should not be set as user-supplied params. We should save ML default params separately in JSON. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3159) Check for reducible DecisionTree
[ https://issues.apache.org/jira/browse/SPARK-3159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16368108#comment-16368108 ] Alessandro Solimando commented on SPARK-3159: - As I was not aware of this JIRA issue, I opened a duplicate ticket and worked on the proposed patch independently of the existing one. However, the two approaches look quite different (though somewhat close in spirit), so I think it is fine to review the PR independently of the other. > Check for reducible DecisionTree > > > Key: SPARK-3159 > URL: https://issues.apache.org/jira/browse/SPARK-3159 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley >Priority: Minor > > Improvement: test-time computation > Currently, pairs of leaf nodes with the same parent can both output the same > prediction. This happens since the splitting criterion (e.g., Gini) is not > the same as prediction accuracy/MSE; the splitting criterion can sometimes be > improved even when both children would still output the same prediction > (e.g., based on the majority label for classification). > We could check the tree and reduce it if possible after training. > Note: This happens with scikit-learn as well. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3159) Check for reducible DecisionTree
[ https://issues.apache.org/jira/browse/SPARK-3159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16368106#comment-16368106 ] Apache Spark commented on SPARK-3159: - User 'asolimando' has created a pull request for this issue: https://github.com/apache/spark/pull/20632 > Check for reducible DecisionTree > > > Key: SPARK-3159 > URL: https://issues.apache.org/jira/browse/SPARK-3159 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley >Priority: Minor > > Improvement: test-time computation > Currently, pairs of leaf nodes with the same parent can both output the same > prediction. This happens since the splitting criterion (e.g., Gini) is not > the same as prediction accuracy/MSE; the splitting criterion can sometimes be > improved even when both children would still output the same prediction > (e.g., based on the majority label for classification). > We could check the tree and reduce it if possible after training. > Note: This happens with scikit-learn as well. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23447) Cleanup codegen template for Literal
[ https://issues.apache.org/jira/browse/SPARK-23447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-23447: --- Assignee: Kris Mok > Cleanup codegen template for Literal > > > Key: SPARK-23447 > URL: https://issues.apache.org/jira/browse/SPARK-23447 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0, 2.2.1 >Reporter: Kris Mok >Assignee: Kris Mok >Priority: Major > Fix For: 2.4.0 > > > Ideally, the codegen templates for {{Literal}} should emit literals in the > {{isNull}} and {{value}} fields of {{ExprCode}} so that they can be > effectively inlined into their use sites. > But currently there are a couple of paths where {{Literal.doGenCode()}} > return {{ExprCode}} that has non-trivial {{code}} field, and all of those are > actually unnecessary. > We can make a simple refactoring to make sure all codegen templates for > {{Literal}} return empty {{code}} and simple literal/constant expressions in > {{isNull}} and {{value}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23447) Cleanup codegen template for Literal
[ https://issues.apache.org/jira/browse/SPARK-23447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-23447. - Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 20626 [https://github.com/apache/spark/pull/20626] > Cleanup codegen template for Literal > > > Key: SPARK-23447 > URL: https://issues.apache.org/jira/browse/SPARK-23447 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0, 2.2.1 >Reporter: Kris Mok >Assignee: Kris Mok >Priority: Major > Fix For: 2.4.0 > > > Ideally, the codegen templates for {{Literal}} should emit literals in the > {{isNull}} and {{value}} fields of {{ExprCode}} so that they can be > effectively inlined into their use sites. > But currently there are a couple of paths where {{Literal.doGenCode()}} > return {{ExprCode}} that has non-trivial {{code}} field, and all of those are > actually unnecessary. > We can make a simple refactoring to make sure all codegen templates for > {{Literal}} return empty {{code}} and simple literal/constant expressions in > {{isNull}} and {{value}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
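A schematic illustration of the intended end state, assuming the Spark 2.x shape of {{ExprCode}} (three plain strings); this is not the merged patch, just the idea that a literal's codegen result carries no statements.
{code}
// Schematic only: a literal contributes no generated statements, just
// constant expressions in isNull/value, so use sites can inline it directly.
import org.apache.spark.sql.catalyst.expressions.codegen.ExprCode

val nonNullIntLiteral = ExprCode(code = "", isNull = "false", value = "42")
val nullIntLiteral    = ExprCode(code = "", isNull = "true",  value = "-1")
{code}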
[jira] [Assigned] (SPARK-23454) Add Trigger information to the Structured Streaming programming guide
[ https://issues.apache.org/jira/browse/SPARK-23454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23454: Assignee: Tathagata Das (was: Apache Spark) > Add Trigger information to the Structured Streaming programming guide > - > > Key: SPARK-23454 > URL: https://issues.apache.org/jira/browse/SPARK-23454 > Project: Spark > Issue Type: Improvement > Components: Documentation, Structured Streaming >Affects Versions: 2.3.0 >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Minor > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23454) Add Trigger information to the Structured Streaming programming guide
[ https://issues.apache.org/jira/browse/SPARK-23454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16368077#comment-16368077 ] Apache Spark commented on SPARK-23454: -- User 'tdas' has created a pull request for this issue: https://github.com/apache/spark/pull/20631 > Add Trigger information to the Structured Streaming programming guide > - > > Key: SPARK-23454 > URL: https://issues.apache.org/jira/browse/SPARK-23454 > Project: Spark > Issue Type: Improvement > Components: Documentation, Structured Streaming >Affects Versions: 2.3.0 >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Minor > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23454) Add Trigger information to the Structured Streaming programming guide
[ https://issues.apache.org/jira/browse/SPARK-23454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23454: Assignee: Apache Spark (was: Tathagata Das) > Add Trigger information to the Structured Streaming programming guide > - > > Key: SPARK-23454 > URL: https://issues.apache.org/jira/browse/SPARK-23454 > Project: Spark > Issue Type: Improvement > Components: Documentation, Structured Streaming >Affects Versions: 2.3.0 >Reporter: Tathagata Das >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23454) Add Trigger information to the Structured Streaming programming guide
[ https://issues.apache.org/jira/browse/SPARK-23454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-23454: -- Priority: Minor (was: Major) > Add Trigger information to the Structured Streaming programming guide > - > > Key: SPARK-23454 > URL: https://issues.apache.org/jira/browse/SPARK-23454 > Project: Spark > Issue Type: Improvement > Components: Documentation, Structured Streaming >Affects Versions: 2.3.0 >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Minor > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23454) Add Trigger information to the Structured Streaming programming guide
Tathagata Das created SPARK-23454: - Summary: Add Trigger information to the Structured Streaming programming guide Key: SPARK-23454 URL: https://issues.apache.org/jira/browse/SPARK-23454 Project: Spark Issue Type: Improvement Components: Documentation, Structured Streaming Affects Versions: 2.3.0 Reporter: Tathagata Das Assignee: Tathagata Das -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23453) ToolBox compiled Spark UDAF causes java.lang.InternalError: Malformed class name
[ https://issues.apache.org/jira/browse/SPARK-23453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Lo updated SPARK-23453: Description: Here is a weird problem I just ran into... My scenario is that I need to compile UDAF dynamically at runtime but it never worked. I am using Scala 2.11.11 and Spark 2.2.1, please refer to my [StackOverflow post|https://stackoverflow.com/questions/48820212/toolbox-compiled-spark-udaf-causes-java-lang-internalerror-malformed-class-name] for detailed information and minimal examples. The problem itself is very similar to other Malformed class name tickets (for example -[https://github.com/apache/spark/pull/9568])- which were caused by calling getSimpleName of nested class/object but actually it is different in this case and the problem is still there. The getSimpleName issue has been fixed in Java 9 which Spark doesn't support it yet...so...any solution/workaround is appreciated. was: Here is a weird problem I just ran into... My scenario is that I need to compile UDAF dynamically at runtime but it never worked. I am using Scala 2.11.11 and Spark 2.2.1, please refer to my [StackOverflow post|https://stackoverflow.com/questions/48820212/toolbox-compiled-spark-udaf-causes-java-lang-internalerror-malformed-class-name] for detailed information. The problem itself is very similar to other Malformed class name tickets (for example - [https://github.com/apache/spark/pull/9568]) - which were caused by calling getSimpleName of nested class/object but actually it is different in this case and the problem is still there. The getSimpleName issue has been fixed in Java 9 which Spark doesn't support it yet...so...any solution/workaround is appreciated. > ToolBox compiled Spark UDAF causes java.lang.InternalError: Malformed class > name > > > Key: SPARK-23453 > URL: https://issues.apache.org/jira/browse/SPARK-23453 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 > Environment: Spark 2.2.1 > Scala 2.11.11 > JDK 1.8 >Reporter: Eric Lo >Priority: Major > > Here is a weird problem I just ran into... My scenario is that I need to > compile UDAF dynamically at runtime but it never worked. > I am using Scala 2.11.11 and Spark 2.2.1, please refer to my [StackOverflow > post|https://stackoverflow.com/questions/48820212/toolbox-compiled-spark-udaf-causes-java-lang-internalerror-malformed-class-name] > for detailed information and minimal examples. The problem itself is very > similar to other Malformed class name tickets (for example > -[https://github.com/apache/spark/pull/9568])- which were caused by calling > getSimpleName of nested class/object but actually it is different in this > case and the problem is still there. The getSimpleName issue has been fixed > in Java 9 which Spark doesn't support it yet...so...any solution/workaround > is appreciated. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23453) ToolBox compiled Spark UDAF causes java.lang.InternalError: Malformed class name
Eric Lo created SPARK-23453: --- Summary: ToolBox compiled Spark UDAF causes java.lang.InternalError: Malformed class name Key: SPARK-23453 URL: https://issues.apache.org/jira/browse/SPARK-23453 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.2.1 Environment: Spark 2.2.1 Scala 2.11.11 JDK 1.8 Reporter: Eric Lo Here is a weird problem I just ran into... My scenario is that I need to compile UDAF dynamically at runtime but it never worked. I am using Scala 2.11.11 and Spark 2.2.1, please refer to my [StackOverflow post|https://stackoverflow.com/questions/48820212/toolbox-compiled-spark-udaf-causes-java-lang-internalerror-malformed-class-name] for detailed information. The problem itself is very similar to other Malformed class name tickets (for example - [https://github.com/apache/spark/pull/9568]) - which were caused by calling getSimpleName of nested class/object but actually it is different in this case and the problem is still there. The getSimpleName issue has been fixed in Java 9 which Spark doesn't support it yet...so...any solution/workaround is appreciated. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
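Until the underlying call is fixed, one workaround pattern on the caller side is to catch the InternalError and derive a display name from getName instead. The helper below is a hypothetical sketch, not Spark code, and the name parsing is only a heuristic.
{code}
// Hypothetical helper: a defensive substitute for Class.getSimpleName that
// falls back to parsing getName when JDK 8 throws
// java.lang.InternalError: Malformed class name (common for classes that the
// Scala ToolBox/REPL generates with many '$' segments in their names).
object SafeClassName {
  def simpleName(cls: Class[_]): String =
    try {
      cls.getSimpleName
    } catch {
      case _: InternalError =>
        // Keep the part after the last '.', drop trailing '$' markers and the
        // enclosing-class prefix; good enough for error messages and logging.
        val afterDot = cls.getName.substring(cls.getName.lastIndexOf('.') + 1)
        afterDot.split('$').filter(_.nonEmpty).lastOption.getOrElse(afterDot)
    }
}
{code}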
[jira] [Commented] (SPARK-23417) pyspark tests give wrong sbt instructions
[ https://issues.apache.org/jira/browse/SPARK-23417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16368031#comment-16368031 ] Bruce Robbins commented on SPARK-23417: --- This does the trick: {noformat} build/sbt -Pkafka-0-8 assembly/package streaming-kafka-0-8-assembly/assembly {noformat} There are also errant instructions for building a flume assembly jar. In that case the following works: {noformat} build/sbt -Pflume assembly/package streaming-flume-assembly/assembly {noformat} I can submit a PR to fix these messages. By the way, the above is just for the pyspark-streaming tests. The pyspark-sql tests have similar build requirements (e.g., at least one test needs a build with Hive profiles. Also, udf.py needs /sql/core/target/scala-2.11/test-classes/test/org/apache/spark/sql/JavaStringLength.class to exist.). The pyspark-sql tests don't check for these requirements, they just throw exceptions. But I won't address that here. > pyspark tests give wrong sbt instructions > - > > Key: SPARK-23417 > URL: https://issues.apache.org/jira/browse/SPARK-23417 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Jose Torres >Priority: Minor > > When running python/run-tests, the script indicates that I must run > "'build/sbt assembly/package streaming-kafka-0-8-assembly/assembly' or > 'build/mvn -Pkafka-0-8 package'". The sbt command fails: > > [error] Expected ID character > [error] Not a valid command: streaming-kafka-0-8-assembly > [error] Expected project ID > [error] Expected configuration > [error] Expected ':' (if selecting a configuration) > [error] Expected key > [error] Not a valid key: streaming-kafka-0-8-assembly > [error] streaming-kafka-0-8-assembly/assembly > [error] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23362) Migrate Kafka microbatch source to v2
[ https://issues.apache.org/jira/browse/SPARK-23362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-23362. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 20554 [https://github.com/apache/spark/pull/20554] > Migrate Kafka microbatch source to v2 > - > > Key: SPARK-23362 > URL: https://issues.apache.org/jira/browse/SPARK-23362 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23337) withWatermark raises an exception on struct objects
[ https://issues.apache.org/jira/browse/SPARK-23337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16367933#comment-16367933 ] Michael Armbrust commented on SPARK-23337: -- This is essentially the same issue as SPARK-18084. We are taking a column name here, not an expression. As such you can only reference top level columns. I agree this is an annoying aspect of the API, but changing it might have to happen at a major release since it would be change in behavior. > withWatermark raises an exception on struct objects > --- > > Key: SPARK-23337 > URL: https://issues.apache.org/jira/browse/SPARK-23337 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.1 > Environment: Linux Ubuntu, Spark on standalone mode >Reporter: Aydin Kocas >Priority: Major > > Hi, > > when using a nested object (I mean an object within a struct, here concrete: > _source.createTime) from a json file as the parameter for the > withWatermark-method, I get an exception (see below). > Anything else works flawlessly with the nested object. > > +*{color:#14892c}works:{color}*+ > {code:java} > Dataset jsonRow = > spark.readStream().schema(getJSONSchema()).json(file).dropDuplicates("_id").withWatermark("myTime", > "10 seconds").toDF();{code} > > json structure: > {code:java} > root > |-- _id: string (nullable = true) > |-- _index: string (nullable = true) > |-- _score: long (nullable = true) > |-- myTime: timestamp (nullable = true) > ..{code} > +*{color:#d04437}does not work - nested json{color}:*+ > {code:java} > Dataset jsonRow = > spark.readStream().schema(getJSONSchema()).json(file).dropDuplicates("_id").withWatermark("_source.createTime", > "10 seconds").toDF();{code} > > json structure: > > {code:java} > root > |-- _id: string (nullable = true) > |-- _index: string (nullable = true) > |-- _score: long (nullable = true) > |-- _source: struct (nullable = true) > | |-- createTime: timestamp (nullable = true) > .. 
> > Exception in thread "main" > org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, > tree: > 'EventTimeWatermark '_source.createTime, interval 10 seconds > +- Deduplicate [_id#0], true > +- StreamingRelation > DataSource(org.apache.spark.sql.SparkSession@5dbbb292,json,List(),Some(StructType(StructField(_id,StringType,true), > StructField(_index,StringType,true), StructField(_score,LongType,true), > StructField(_source,StructType(StructField(additionalData,StringType,true), > StructField(client,StringType,true), > StructField(clientDomain,BooleanType,true), > StructField(clientVersion,StringType,true), > StructField(country,StringType,true), > StructField(countryName,StringType,true), > StructField(createTime,TimestampType,true), > StructField(externalIP,StringType,true), > StructField(hostname,StringType,true), > StructField(internalIP,StringType,true), > StructField(location,StringType,true), > StructField(locationDestination,StringType,true), > StructField(login,StringType,true), > StructField(originalRequestString,StringType,true), > StructField(password,StringType,true), > StructField(peerIdent,StringType,true), > StructField(peerType,StringType,true), > StructField(recievedTime,TimestampType,true), > StructField(sessionEnd,StringType,true), > StructField(sessionStart,StringType,true), > StructField(sourceEntryAS,StringType,true), > StructField(sourceEntryIp,StringType,true), > StructField(sourceEntryPort,StringType,true), > StructField(targetCountry,StringType,true), > StructField(targetCountryName,StringType,true), > StructField(targetEntryAS,StringType,true), > StructField(targetEntryIp,StringType,true), > StructField(targetEntryPort,StringType,true), > StructField(targetport,StringType,true), > StructField(username,StringType,true), > StructField(vulnid,StringType,true)),true), > StructField(_type,StringType,true))),List(),None,Map(path -> ./input/),None), > FileSource[./input/], [_id#0, _index#1, _score#2L, _source#3, _type#4] > at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) > at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:385) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:300) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:268) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$9.applyOrElse(Analyzer.scala:854) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$9.applyOrElse(Analyzer.scala:796) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:62) > at > org.apache.spark
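A minimal sketch of the workaround this implies, using the reporter's schema: promote the nested field to a top-level column first, since withWatermark takes a column name rather than an arbitrary expression. Here getJSONSchema() and file are assumed from the report, and the column alias name is arbitrary.
{code}
// Sketch of the commonly suggested workaround, not a Spark change: alias the
// nested field to a top-level column, then watermark on that column.
import org.apache.spark.sql.functions.col

val jsonRow = spark.readStream
  .schema(getJSONSchema())                               // from the report
  .json(file)                                            // from the report
  .withColumn("createTime", col("_source.createTime"))   // top-level alias
  .dropDuplicates("_id")
  .withWatermark("createTime", "10 seconds")
  .toDF()
{code}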
[jira] [Commented] (SPARK-13127) Upgrade Parquet to 1.9 (Fixes parquet sorting)
[ https://issues.apache.org/jira/browse/SPARK-13127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16367925#comment-16367925 ] Li Jin commented on SPARK-13127: Hi all, the status of this JIRA is "In Progress". I am wondering whether it is being actively worked on? > Upgrade Parquet to 1.9 (Fixes parquet sorting) > -- > > Key: SPARK-13127 > URL: https://issues.apache.org/jira/browse/SPARK-13127 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1 >Reporter: Justin Pihony >Priority: Major > > Currently, when you write a sorted DataFrame to Parquet, the data read back > out is not sorted by default. [This is due to a bug in > Parquet|https://issues.apache.org/jira/browse/PARQUET-241] that was fixed in > 1.9. > There is a workaround to read the file back in using a file glob (filepath/*). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
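For reference, the glob workaround mentioned in the description looks roughly like the sketch below; "out" is a hypothetical output path and df any sorted DataFrame.
{code}
// Sketch of the workaround from the description: read the written files back
// through a glob over the output directory instead of the directory path
// itself. "out" and df are placeholders.
df.sort("key").write.parquet("out")
val readBack = spark.read.parquet("out/*")
{code}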
[jira] [Commented] (SPARK-23452) Extend test coverage to all ORC readers
[ https://issues.apache.org/jira/browse/SPARK-23452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16367905#comment-16367905 ] Xiao Li commented on SPARK-23452: - Thanks! I will assign it to you. > Extend test coverage to all ORC readers > --- > > Key: SPARK-23452 > URL: https://issues.apache.org/jira/browse/SPARK-23452 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 2.3.1 >Reporter: Dongjoon Hyun >Priority: Minor > > We have five ORC readers. We had better have a test coverage for all ORC > readers. > - Hive Serde > - Hive OrcFileFormat > - Apache ORC Vectorized Wrapper > - Apache ORC Vectorized Copy > - Apache ORC MR -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23452) Extend test coverage to all ORC readers
[ https://issues.apache.org/jira/browse/SPARK-23452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reassigned SPARK-23452: --- Assignee: Dongjoon Hyun > Extend test coverage to all ORC readers > --- > > Key: SPARK-23452 > URL: https://issues.apache.org/jira/browse/SPARK-23452 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 2.3.1 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > > We have five ORC readers. We had better have a test coverage for all ORC > readers. > - Hive Serde > - Hive OrcFileFormat > - Apache ORC Vectorized Wrapper > - Apache ORC Vectorized Copy > - Apache ORC MR -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23409) RandomForest/DecisionTree (syntactic) pruning of redundant subtrees
[ https://issues.apache.org/jira/browse/SPARK-23409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-23409. --- Resolution: Duplicate Linking old JIRA for this issue > RandomForest/DecisionTree (syntactic) pruning of redundant subtrees > --- > > Key: SPARK-23409 > URL: https://issues.apache.org/jira/browse/SPARK-23409 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.2.1 > Environment: >Reporter: Alessandro Solimando >Priority: Minor > > Improvement: redundancy elimination from decision trees where all the leaves > of a given subtree share the same prediction. > Benefits: > * Model interpretability > * Faster unitary model invocation (relevant for massive number of > invocations) > * Smaller model memory footprint > For instance, consider the following decision tree. > {panel:title=Original Decision Tree} > {noformat} > DecisionTreeClassificationModel (uid=dtc_e794a5a3aa9e) of depth 3 with 15 > nodes > If (feature 1 <= 0.5) >If (feature 2 <= 0.5) > If (feature 0 <= 0.5) > Predict: 0.0 > Else (feature 0 > 0.5) > Predict: 0.0 >Else (feature 2 > 0.5) > If (feature 0 <= 0.5) > Predict: 0.0 > Else (feature 0 > 0.5) > Predict: 0.0 > Else (feature 1 > 0.5) >If (feature 2 <= 0.5) > If (feature 0 <= 0.5) > Predict: 1.0 > Else (feature 0 > 0.5) > Predict: 1.0 >Else (feature 2 > 0.5) > If (feature 0 <= 0.5) > Predict: 0.0 > Else (feature 0 > 0.5) > Predict: 0.0 > {noformat} > {panel} > The proposed method, taken as input the first tree, aims at producing as > output the following (semantically equivalent) tree: > {panel:title=Pruned Decision Tree} > {noformat} > DecisionTreeClassificationModel (uid=dtc_e794a5a3aa9e) of depth 3 with 15 > nodes > If (feature 1 <= 0.5) >Predict: 0.0 > Else (feature 1 > 0.5) >If (feature 2 <= 0.5) > Predict: 1.0 >Else (feature 2 > 0.5) > Predict: 0.0 > {noformat} > {panel} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
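The transformation shown in the description above can be phrased as a small bottom-up pass. The sketch below uses a toy tree type rather than Spark's internal Node classes, so it is only an illustration of the idea.
{code}
// Toy illustration of the pruning described above (not Spark's Node API):
// recursively prune the children, then collapse any split whose two pruned
// children are leaves with the same prediction into a single leaf.
sealed trait SimpleNode
case class Leaf(prediction: Double) extends SimpleNode
case class Split(feature: Int, threshold: Double,
                 left: SimpleNode, right: SimpleNode) extends SimpleNode

def prune(node: SimpleNode): SimpleNode = node match {
  case leaf: Leaf => leaf
  case Split(f, t, l, r) =>
    (prune(l), prune(r)) match {
      case (Leaf(p1), Leaf(p2)) if p1 == p2 => Leaf(p1)   // redundant split
      case (pl, pr)                         => Split(f, t, pl, pr)
    }
}
{code}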
[jira] [Commented] (SPARK-23381) Murmur3 hash generates a different value from other implementations
[ https://issues.apache.org/jira/browse/SPARK-23381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16367895#comment-16367895 ] Apache Spark commented on SPARK-23381: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/20630 > Murmur3 hash generates a different value from other implementations > --- > > Key: SPARK-23381 > URL: https://issues.apache.org/jira/browse/SPARK-23381 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Shintaro Murakami >Priority: Major > > Murmur3 hash generates a different value from the original and other > implementations (like Scala standard library and Guava or so) when the length > of a bytes array is not multiple of 4. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23381) Murmur3 hash generates a different value from other implementations
[ https://issues.apache.org/jira/browse/SPARK-23381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16367845#comment-16367845 ] Joseph K. Bradley commented on SPARK-23381: --- Copying my comment from the PR: {quote} For ML, I actually don't think this has to be a blocker. It's not great, but it's not a regression. However, we should definitely fix this in the future and soon: For ML, it's really important that MurmurHash3 behave consistently across platforms. To fix this, we'll need to maintain the old implementation of MurmushHash3 to maintain the behavior of ML Pipelines exported from previous versions of Spark. {quote} > Murmur3 hash generates a different value from other implementations > --- > > Key: SPARK-23381 > URL: https://issues.apache.org/jira/browse/SPARK-23381 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Shintaro Murakami >Priority: Major > > Murmur3 hash generates a different value from the original and other > implementations (like Scala standard library and Guava or so) when the length > of a bytes array is not multiple of 4. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23381) Murmur3 hash generates a different value from other implementations
[ https://issues.apache.org/jira/browse/SPARK-23381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-23381: -- Priority: Major (was: Minor) > Murmur3 hash generates a different value from other implementations > --- > > Key: SPARK-23381 > URL: https://issues.apache.org/jira/browse/SPARK-23381 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Shintaro Murakami >Priority: Major > > Murmur3 hash generates a different value from the original and other > implementations (like Scala standard library and Guava or so) when the length > of a bytes array is not multiple of 4. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23381) Murmur3 hash generates a different value from other implementations
[ https://issues.apache.org/jira/browse/SPARK-23381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-23381: -- Issue Type: Bug (was: Improvement) > Murmur3 hash generates a different value from other implementations > --- > > Key: SPARK-23381 > URL: https://issues.apache.org/jira/browse/SPARK-23381 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Shintaro Murakami >Priority: Minor > > Murmur3 hash generates a different value from the original and other > implementations (like Scala standard library and Guava or so) when the length > of a bytes array is not multiple of 4. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
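The cross-implementation comparison the description refers to can be reproduced by hashing the same odd-length byte array with the Scala standard library and Guava and comparing against Spark's output for the corresponding column; the snippet below covers only the first two, and the bytes and seed are arbitrary.
{code}
// Illustrative comparison only: hash a 5-byte array (a length that is not a
// multiple of 4, the case called out above) with two non-Spark
// implementations. Whether these agree with each other and with Spark is
// exactly what the report is about, so the values are printed, not asserted.
import com.google.common.hash.Hashing
import scala.util.hashing.MurmurHash3

val bytes = Array[Byte](1, 2, 3, 4, 5)
val seed  = 42
println("scala: " + MurmurHash3.bytesHash(bytes, seed))
println("guava: " + Hashing.murmur3_32(seed).hashBytes(bytes).asInt())
{code}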
[jira] [Updated] (SPARK-23452) Extend test coverage to all ORC readers
[ https://issues.apache.org/jira/browse/SPARK-23452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-23452: -- Issue Type: Improvement (was: Test) > Extend test coverage to all ORC readers > --- > > Key: SPARK-23452 > URL: https://issues.apache.org/jira/browse/SPARK-23452 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 2.3.1 >Reporter: Dongjoon Hyun >Priority: Minor > > We have five ORC readers. We had better have a test coverage for all ORC > readers. > - Hive Serde > - Hive OrcFileFormat > - Apache ORC Vectorized Wrapper > - Apache ORC Vectorized Copy > - Apache ORC MR -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23452) Extend test coverage to all ORC readers
[ https://issues.apache.org/jira/browse/SPARK-23452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-23452: -- Component/s: Tests > Extend test coverage to all ORC readers > --- > > Key: SPARK-23452 > URL: https://issues.apache.org/jira/browse/SPARK-23452 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 2.3.1 >Reporter: Dongjoon Hyun >Priority: Minor > > We have five ORC readers. We had better have a test coverage for all ORC > readers. > - Hive Serde > - Hive OrcFileFormat > - Apache ORC Vectorized Wrapper > - Apache ORC Vectorized Copy > - Apache ORC MR -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23452) Extend test coverage to all ORC readers
[ https://issues.apache.org/jira/browse/SPARK-23452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-23452: -- Summary: Extend test coverage to all ORC readers (was: Improve test coverage for ORC readers) > Extend test coverage to all ORC readers > --- > > Key: SPARK-23452 > URL: https://issues.apache.org/jira/browse/SPARK-23452 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.3.1 >Reporter: Dongjoon Hyun >Priority: Minor > > We have five ORC readers. We had better have a test coverage for all ORC > readers. > - Hive Serde > - Hive OrcFileFormat > - Apache ORC Vectorized Wrapper > - Apache ORC Vectorized Copy > - Apache ORC MR -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23452) Improve test coverage for ORC readers
[ https://issues.apache.org/jira/browse/SPARK-23452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-23452: -- Description: We have five ORC readers. We had better have a test coverage for all ORC readers. - Hive Serde - Hive OrcFileFormat - Apache ORC Vectorized Wrapper - Apache ORC Vectorized Copy - Apache ORC MR was: We have five ORC readers. We had better have a test coverage for all cases. - Hive Serde - Hive OrcFileFormat - Apache ORC Vectorized Wrapper - Apache ORC Vectorized Copy - Apache ORC MR > Improve test coverage for ORC readers > - > > Key: SPARK-23452 > URL: https://issues.apache.org/jira/browse/SPARK-23452 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.3.1 >Reporter: Dongjoon Hyun >Priority: Minor > > We have five ORC readers. We had better have a test coverage for all ORC > readers. > - Hive Serde > - Hive OrcFileFormat > - Apache ORC Vectorized Wrapper > - Apache ORC Vectorized Copy > - Apache ORC MR -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23452) Improve test coverage for ORC readers
[ https://issues.apache.org/jira/browse/SPARK-23452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-23452: -- Summary: Improve test coverage for ORC readers (was: Improve test coverage for ORC file format) > Improve test coverage for ORC readers > - > > Key: SPARK-23452 > URL: https://issues.apache.org/jira/browse/SPARK-23452 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.3.1 >Reporter: Dongjoon Hyun >Priority: Minor > > We have five ORC readers. We had better have a test coverage for all cases. > - Hive Serde > - Hive OrcFileFormat > - Apache ORC Vectorized Wrapper > - Apache ORC Vectorized Copy > - Apache ORC MR -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23452) Improve test coverage for ORC file format
[ https://issues.apache.org/jira/browse/SPARK-23452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16367792#comment-16367792 ] Dongjoon Hyun commented on SPARK-23452: --- I created this and will proceed with it for 2.3.1, [~smilegator]. > Improve test coverage for ORC file format > - > > Key: SPARK-23452 > URL: https://issues.apache.org/jira/browse/SPARK-23452 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.3.1 >Reporter: Dongjoon Hyun >Priority: Minor > > We have five ORC readers. We had better have a test coverage for all cases. > - Hive Serde > - Hive OrcFileFormat > - Apache ORC Vectorized Wrapper > - Apache ORC Vectorized Copy > - Apache ORC MR -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23452) Improve test coverage for ORC file format
Dongjoon Hyun created SPARK-23452: - Summary: Improve test coverage for ORC file format Key: SPARK-23452 URL: https://issues.apache.org/jira/browse/SPARK-23452 Project: Spark Issue Type: Test Components: SQL Affects Versions: 2.3.1 Reporter: Dongjoon Hyun We have five ORC readers. We had better have a test coverage for all cases. - Hive Serde - Hive OrcFileFormat - Apache ORC Vectorized Wrapper - Apache ORC Vectorized Copy - Apache ORC MR -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
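One way to cover several of these readers from a single test body is to run it under a small configuration matrix. The sketch below assumes the Spark 2.3 configuration keys and the usual SQL test helpers (withSQLConf, checkAnswer); dir and expectedRows are placeholders, and the wrapper-vs-copy variants of the vectorized reader are not toggled here.
{code}
// Rough sketch, assuming SQLTestUtils-style helpers are in scope: exercise
// the Hive and native ORC readers, with and without vectorization.
// `dir` and `expectedRows` are placeholders supplied by the enclosing test.
Seq("hive", "native").foreach { impl =>
  Seq("true", "false").foreach { vectorized =>
    withSQLConf(
      "spark.sql.orc.impl" -> impl,
      "spark.sql.orc.enableVectorizedReader" -> vectorized) {
      checkAnswer(spark.read.orc(dir.getCanonicalPath), expectedRows)
    }
  }
}
{code}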
[jira] [Comment Edited] (SPARK-23427) spark.sql.autoBroadcastJoinThreshold causing OOM exception in the driver
[ https://issues.apache.org/jira/browse/SPARK-23427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16367742#comment-16367742 ] Pratik Dhumal edited comment on SPARK-23427 at 2/16/18 7:18 PM: {code:java} // code placeholder @Test def testLoop() = { val schema = new StructType().add("test", types.IntegerType) var t1 = spark.createDataFrame(spark.sparkContext.parallelize(1 to 100).map(i => Row(i)), schema) val t2 = spark.createDataFrame(spark.sparkContext.parallelize(4 to 1400).map(i => Row(i)), schema) val t3 = spark.createDataFrame(spark.sparkContext.parallelize(15 to 190).map(i => Row(i)), schema) val t4 = spark.createDataFrame(spark.sparkContext.parallelize(135 to 652).map(i => Row(i)), schema) val t5 = spark.createDataFrame(spark.sparkContext.parallelize(86 to 352).map(i => Row(i)), schema) t1.persist().count() t2.persist().count() t3.persist().count() t4.persist().count() t5.persist().count() var dfResult: DataFrame = null while (true) { var t3Filter = t3.filter("test % 2 = 1") var t4Filter = t4.filter("test % 2 = 0") t1.createOrReplaceTempView("T1") t2.createOrReplaceTempView("T2") t3Filter.createOrReplaceTempView("T3") t4Filter.createOrReplaceTempView("T4") t5.createOrReplaceTempView("T5") var query = """ SELECT T1.* FROM T1 | INNER JOIN T2 ON T1.test=t2.test | LEFT JOIN T3 ON T1.test=t3.test | LEFT JOIN T4 ON T1.test=t4.test | LEFT JOIN T5 ON T1.test=t5.test | """.stripMargin if (t1 == null) { t1 = spark.sql(query) t1.persist().count() } else { var tmp1 = spark.sql(query) var tmp2 = t1 t1 = tmp1.union(tmp2) t1.persist().count() tmp2.unpersist(true) tmp2 = null } println("t1 : " + (SizeEstimator.estimate(t1) / 1024 / 1024)) // Do Something - Currently doing nothing spark.catalog.dropTempView("T1") spark.catalog.dropTempView("T2") spark.catalog.dropTempView("T3") spark.catalog.dropTempView("T4") spark.catalog.dropTempView("T5") } t3.unpersist(true) t2.unpersist(true) t1.unpersist(true) t4.unpersist(true) t5.unpersist(true) println("VOID") } // RESULT LOG t1 : 8 t1 : 208 t1 : 310 t1 : 187 t1 : 441 t1 : 440 t1 : 547 t1 : 651 t1 : 759 t1 : 733 t1 : 1129{code} Hope this helps. 
was (Author: dpratik): {code:java} // code placeholder @Test def testLoop() = { val schema = new StructType().add("test", types.IntegerType) var t1 = spark.createDataFrame(spark.sparkContext.parallelize(1 to 100).map(i => Row(i)), schema) val t2 = spark.createDataFrame(spark.sparkContext.parallelize(4 to 1400).map(i => Row(i)), schema) val t3 = spark.createDataFrame(spark.sparkContext.parallelize(15 to 190).map(i => Row(i)), schema) val t4 = spark.createDataFrame(spark.sparkContext.parallelize(135 to 652).map(i => Row(i)), schema) val t5 = spark.createDataFrame(spark.sparkContext.parallelize(86 to 352).map(i => Row(i)), schema) t1.persist().count() t2.persist().count() t3.persist().count() t4.persist().count() t5.persist().count() var dfResult: DataFrame = null while (true) { var t3Filter = t3.filter("test % 2 = 1") var t4Filter = t4.filter("test % 2 = 0") t1.createOrReplaceTempView("T1") t2.createOrReplaceTempView("T2") t3Filter.createOrReplaceTempView("T3") t4Filter.createOrReplaceTempView("T4") t5.createOrReplaceTempView("T5") var query = """ SELECT T1.* FROM T1 | INNER JOIN T2 ON T1.test=t2.test | LEFT JOIN T3 ON T1.test=t3.test | LEFT JOIN T4 ON T1.test=t4.test | LEFT JOIN T5 ON T1.test=t5.test | """.stripMargin if (t1 == null) { t1 = spark.sql(query) t1.persist().count() } else { var tmp1 = spark.sql(query) var tmp2 = t1 t1 = tmp1.union(tmp2) t1.persist().count() tmp2.unpersist(true) tmp2 = null } println("t1 : " + (SizeEstimator.estimate(t1) / 1024 / 1024)) // Do Something - Currently doing nothing spark.catalog.dropTempView("T1") spark.catalog.dropTempView("T2") spark.catalog.dropTempView("T3") spark.catalog.dropTempView("T4") spark.catalog.dropTempView("T5") } t3.unpersist(true) t2.unpersist(true) t1.unpersist(true) t4.unpersist(true) t5.unpersist(true) println("VOID") } {code} Hope this helps. > spark.sql.autoBroadcastJoinThreshold causing OOM exception in the driver > - > > Key: SPARK-23427 > URL: https://issues.apache.org/jira/browse/SPARK-23427 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 > Environment: SPARK 2.0 version >Reporter: Dhiraj >
[jira] [Updated] (SPARK-23427) spark.sql.autoBroadcastJoinThreshold causing OOM exception in the driver
[ https://issues.apache.org/jira/browse/SPARK-23427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dhiraj updated SPARK-23427: --- Summary: spark.sql.autoBroadcastJoinThreshold causing OOM exception in the driver (was: spark.sql.autoBroadcastJoinThreshold causing OOM in the driver ) > spark.sql.autoBroadcastJoinThreshold causing OOM exception in the driver > - > > Key: SPARK-23427 > URL: https://issues.apache.org/jira/browse/SPARK-23427 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 > Environment: SPARK 2.0 version >Reporter: Dhiraj >Priority: Critical > > We are facing issue around value of spark.sql.autoBroadcastJoinThreshold. > With spark.sql.autoBroadcastJoinThreshold -1 ( disable) we seeing driver > memory used flat. > With any other values 10MB, 5MB, 2 MB, 1MB, 10K, 1K we see driver memory used > goes up with rate depending upon the size of the autoBroadcastThreshold and > getting OOM exception. The problem is memory used by autoBroadcast is not > being free up in the driver. > Application imports oracle tables as master dataframes which are persisted. > Each job applies filter to these tables and then registered them as > tempViewTable . Then sql query are using to process data further. At the end > all the intermediate dataFrame are unpersisted. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23427) spark.sql.autoBroadcastJoinThreshold causing OOM in the driver
[ https://issues.apache.org/jira/browse/SPARK-23427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16367742#comment-16367742 ] Pratik Dhumal commented on SPARK-23427: --- {code:java} // code placeholder @Test def testLoop() = { val schema = new StructType().add("test", types.IntegerType) var t1 = spark.createDataFrame(spark.sparkContext.parallelize(1 to 100).map(i => Row(i)), schema) val t2 = spark.createDataFrame(spark.sparkContext.parallelize(4 to 1400).map(i => Row(i)), schema) val t3 = spark.createDataFrame(spark.sparkContext.parallelize(15 to 190).map(i => Row(i)), schema) val t4 = spark.createDataFrame(spark.sparkContext.parallelize(135 to 652).map(i => Row(i)), schema) val t5 = spark.createDataFrame(spark.sparkContext.parallelize(86 to 352).map(i => Row(i)), schema) t1.persist().count() t2.persist().count() t3.persist().count() t4.persist().count() t5.persist().count() var dfResult: DataFrame = null while (true) { var t3Filter = t3.filter("test % 2 = 1") var t4Filter = t4.filter("test % 2 = 0") t1.createOrReplaceTempView("T1") t2.createOrReplaceTempView("T2") t3Filter.createOrReplaceTempView("T3") t4Filter.createOrReplaceTempView("T4") t5.createOrReplaceTempView("T5") var query = """ SELECT T1.* FROM T1 | INNER JOIN T2 ON T1.test=t2.test | LEFT JOIN T3 ON T1.test=t3.test | LEFT JOIN T4 ON T1.test=t4.test | LEFT JOIN T5 ON T1.test=t5.test | """.stripMargin if (t1 == null) { t1 = spark.sql(query) t1.persist().count() } else { var tmp1 = spark.sql(query) var tmp2 = t1 t1 = tmp1.union(tmp2) t1.persist().count() tmp2.unpersist(true) tmp2 = null } println("t1 : " + (SizeEstimator.estimate(t1) / 1024 / 1024)) // Do Something - Currently doing nothing spark.catalog.dropTempView("T1") spark.catalog.dropTempView("T2") spark.catalog.dropTempView("T3") spark.catalog.dropTempView("T4") spark.catalog.dropTempView("T5") } t3.unpersist(true) t2.unpersist(true) t1.unpersist(true) t4.unpersist(true) t5.unpersist(true) println("VOID") } {code} Hope this helps. > spark.sql.autoBroadcastJoinThreshold causing OOM in the driver > > > Key: SPARK-23427 > URL: https://issues.apache.org/jira/browse/SPARK-23427 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 > Environment: SPARK 2.0 version >Reporter: Dhiraj >Priority: Critical > > We are facing issue around value of spark.sql.autoBroadcastJoinThreshold. > With spark.sql.autoBroadcastJoinThreshold -1 ( disable) we seeing driver > memory used flat. > With any other values 10MB, 5MB, 2 MB, 1MB, 10K, 1K we see driver memory used > goes up with rate depending upon the size of the autoBroadcastThreshold and > getting OOM exception. The problem is memory used by autoBroadcast is not > being free up in the driver. > Application imports oracle tables as master dataframes which are persisted. > Each job applies filter to these tables and then registered them as > tempViewTable . Then sql query are using to process data further. At the end > all the intermediate dataFrame are unpersisted. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
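As a point of reference, the only configuration the report found to keep driver memory flat is disabling auto-broadcast entirely. A minimal sketch of that setting, assuming an existing SparkSession named `spark` (nothing here comes from the reporter's code):

{code:java}
// Sketch only: disable automatic broadcast joins so the planner never
// builds broadcast tables on the driver (the "-1" case from the report).
// Assumes an existing SparkSession named `spark`.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

// The current setting can be checked with:
println(spark.conf.get("spark.sql.autoBroadcastJoinThreshold"))
{code}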
[jira] [Commented] (SPARK-23433) java.lang.IllegalStateException: more than one active taskSet for stage
[ https://issues.apache.org/jira/browse/SPARK-23433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16367728#comment-16367728 ] Shixiong Zhu commented on SPARK-23433: -- [~irashid] I'm busy with other stuff and not working on this. Your approach sounds good to me. Please go ahead if you have time to work on this. > java.lang.IllegalStateException: more than one active taskSet for stage > --- > > Key: SPARK-23433 > URL: https://issues.apache.org/jira/browse/SPARK-23433 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Shixiong Zhu >Priority: Major > > This following error thrown by DAGScheduler stopped the cluster: > {code} > 18/02/11 13:22:27 ERROR DAGSchedulerEventProcessLoop: > DAGSchedulerEventProcessLoop failed; shutting down SparkContext > java.lang.IllegalStateException: more than one active taskSet for stage > 7580621: 7580621.2,7580621.1 > at > org.apache.spark.scheduler.TaskSchedulerImpl.submitTasks(TaskSchedulerImpl.scala:229) > at > org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1193) > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:1059) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$submitWaitingChildStages$6.apply(DAGScheduler.scala:900) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$submitWaitingChildStages$6.apply(DAGScheduler.scala:899) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at > org.apache.spark.scheduler.DAGScheduler.submitWaitingChildStages(DAGScheduler.scala:899) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1427) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1929) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1880) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1868) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23234) ML python test failure due to default outputCol
[ https://issues.apache.org/jira/browse/SPARK-23234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-23234: Priority: Major (was: Blocker) > ML python test failure due to default outputCol > --- > > Key: SPARK-23234 > URL: https://issues.apache.org/jira/browse/SPARK-23234 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.3.0 >Reporter: Marco Gaido >Priority: Major > > SPARK-22799 and SPARK-22797 are causing valid Python test failures. The > reason is that Python is setting the default params with set. So they are not > considered as defaults, but as params passed by the user. > This means that an outputCol is set not as a default but as a real value. > Anyway, this is a misbehavior of the python API which can cause serious > problems and I'd suggest to rethink the way this is done. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23433) java.lang.IllegalStateException: more than one active taskSet for stage
[ https://issues.apache.org/jira/browse/SPARK-23433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16367666#comment-16367666 ] Imran Rashid commented on SPARK-23433: -- actually, I realized its more general than just marking it as a zombie -- it should even be able to mark tasks as completed, so you don't have tasks submitted by later attempts when an earlier attempt says the output is ready. > java.lang.IllegalStateException: more than one active taskSet for stage > --- > > Key: SPARK-23433 > URL: https://issues.apache.org/jira/browse/SPARK-23433 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Shixiong Zhu >Priority: Major > > This following error thrown by DAGScheduler stopped the cluster: > {code} > 18/02/11 13:22:27 ERROR DAGSchedulerEventProcessLoop: > DAGSchedulerEventProcessLoop failed; shutting down SparkContext > java.lang.IllegalStateException: more than one active taskSet for stage > 7580621: 7580621.2,7580621.1 > at > org.apache.spark.scheduler.TaskSchedulerImpl.submitTasks(TaskSchedulerImpl.scala:229) > at > org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1193) > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:1059) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$submitWaitingChildStages$6.apply(DAGScheduler.scala:900) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$submitWaitingChildStages$6.apply(DAGScheduler.scala:899) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at > org.apache.spark.scheduler.DAGScheduler.submitWaitingChildStages(DAGScheduler.scala:899) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1427) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1929) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1880) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1868) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23433) java.lang.IllegalStateException: more than one active taskSet for stage
[ https://issues.apache.org/jira/browse/SPARK-23433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16367664#comment-16367664 ] Imran Rashid commented on SPARK-23433: -- Yes, I think you are right [~zsxwing]. Since a zombie taskset might still be running the same tasks as the non-zombie one, when a zombie task finishes, it should be able to mark the non-zombie taskset as a zombie. Or in this case, task 18.0 from 7580621.0 should be able to mark 7580621.1 as a zombie. Are you working on this? > java.lang.IllegalStateException: more than one active taskSet for stage > --- > > Key: SPARK-23433 > URL: https://issues.apache.org/jira/browse/SPARK-23433 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Shixiong Zhu >Priority: Major > > This following error thrown by DAGScheduler stopped the cluster: > {code} > 18/02/11 13:22:27 ERROR DAGSchedulerEventProcessLoop: > DAGSchedulerEventProcessLoop failed; shutting down SparkContext > java.lang.IllegalStateException: more than one active taskSet for stage > 7580621: 7580621.2,7580621.1 > at > org.apache.spark.scheduler.TaskSchedulerImpl.submitTasks(TaskSchedulerImpl.scala:229) > at > org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1193) > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:1059) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$submitWaitingChildStages$6.apply(DAGScheduler.scala:900) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$submitWaitingChildStages$6.apply(DAGScheduler.scala:899) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at > org.apache.spark.scheduler.DAGScheduler.submitWaitingChildStages(DAGScheduler.scala:899) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1427) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1929) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1880) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1868) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23446) Explicitly check supported types in toPandas
[ https://issues.apache.org/jira/browse/SPARK-23446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-23446. - Resolution: Fixed Assignee: Hyukjin Kwon Fix Version/s: 2.3.0 > Explicitly check supported types in toPandas > > > Key: SPARK-23446 > URL: https://issues.apache.org/jira/browse/SPARK-23446 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 2.3.0 > > > See: > {code} > spark.conf.set("spark.sql.execution.arrow.enabled", "false") > df = spark.createDataFrame([[bytearray("a")]]) > df.toPandas() > spark.conf.set("spark.sql.execution.arrow.enabled", "true") > df.toPandas() > {code} > {code} > _1 > 0 [97] > _1 > 0 a > {code} > We didn't finish binary type support in Arrow conversion on the Python side. We should > disallow it. > Same thing applies to nested timestamps. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23451) Deprecate KMeans computeCost
[ https://issues.apache.org/jira/browse/SPARK-23451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16367578#comment-16367578 ] Apache Spark commented on SPARK-23451: -- User 'mgaido91' has created a pull request for this issue: https://github.com/apache/spark/pull/20629 > Deprecate KMeans computeCost > > > Key: SPARK-23451 > URL: https://issues.apache.org/jira/browse/SPARK-23451 > Project: Spark > Issue Type: Task > Components: ML >Affects Versions: 2.4.0 >Reporter: Marco Gaido >Priority: Trivial > > SPARK-11029 added the {{computeCost}} method as a temp fix for the lack of > proper cluster evaluators. Now SPARK-14516 introduces a proper > {{ClusteringEvaluator}}, so we should deprecate this method and maybe remove > it in the next releases. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
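For context, a minimal sketch of the intended migration path: computing a silhouette score with {{ClusteringEvaluator}} instead of relying on the to-be-deprecated {{computeCost}}. The toy dataset below is made up for illustration, and an existing SparkSession named `spark` is assumed:

{code:java}
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.evaluation.ClusteringEvaluator
import org.apache.spark.ml.linalg.Vectors
import spark.implicits._

// Made-up toy data: two obvious clusters in a "features" vector column.
val dataset = Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
  Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)
).map(Tuple1.apply).toDF("features")

val model = new KMeans().setK(2).setSeed(1L).fit(dataset)
val predictions = model.transform(dataset)

// Old metric (subject of this JIRA): within-set sum of squared errors.
val wssse = model.computeCost(dataset)

// Replacement introduced by SPARK-14516: silhouette score (the default metric).
val silhouette = new ClusteringEvaluator().evaluate(predictions)
println(s"WSSSE=$wssse silhouette=$silhouette")
{code}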
[jira] [Assigned] (SPARK-23451) Deprecate KMeans computeCost
[ https://issues.apache.org/jira/browse/SPARK-23451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23451: Assignee: (was: Apache Spark) > Deprecate KMeans computeCost > > > Key: SPARK-23451 > URL: https://issues.apache.org/jira/browse/SPARK-23451 > Project: Spark > Issue Type: Task > Components: ML >Affects Versions: 2.4.0 >Reporter: Marco Gaido >Priority: Trivial > > SPARK-11029 added the {{computeCost}} method as a temp fix for the lack of > proper cluster evaluators. Now SPARK-14516 introduces a proper > {{ClusteringEvaluator}}, so we should deprecate this method and maybe remove > it in the next releases. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23451) Deprecate KMeans computeCost
[ https://issues.apache.org/jira/browse/SPARK-23451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23451: Assignee: Apache Spark > Deprecate KMeans computeCost > > > Key: SPARK-23451 > URL: https://issues.apache.org/jira/browse/SPARK-23451 > Project: Spark > Issue Type: Task > Components: ML >Affects Versions: 2.4.0 >Reporter: Marco Gaido >Assignee: Apache Spark >Priority: Trivial > > SPARK-11029 added the {{computeCost}} method as a temp fix for the lack of > proper cluster evaluators. Now SPARK-14516 introduces a proper > {{ClusteringEvaluator}}, so we should deprecate this method and maybe remove > it in the next releases. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23288) Incorrect number of written records in structured streaming
[ https://issues.apache.org/jira/browse/SPARK-23288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16367569#comment-16367569 ] Gabor Somogyi commented on SPARK-23288: --- Seems like no statsTrackers created in FileStreamSink. > Incorrect number of written records in structured streaming > --- > > Key: SPARK-23288 > URL: https://issues.apache.org/jira/browse/SPARK-23288 > Project: Spark > Issue Type: Bug > Components: SQL, Structured Streaming >Affects Versions: 2.2.0 >Reporter: Yuriy Bondaruk >Priority: Major > Labels: Metrics, metrics > > I'm using SparkListener.onTaskEnd() to capture input and output metrics but > it seems that number of written records > ('taskEnd.taskMetrics().outputMetrics().recordsWritten()') is incorrect. Here > is my stream construction: > > {code:java} > StreamingQuery writeStream = session > .readStream() > .schema(RecordSchema.fromClass(TestRecord.class)) > .option(OPTION_KEY_DELIMITER, OPTION_VALUE_DELIMITER_TAB) > .option(OPTION_KEY_QUOTE, OPTION_VALUE_QUOTATION_OFF) > .csv(inputFolder.getRoot().toPath().toString()) > .as(Encoders.bean(TestRecord.class)) > .flatMap( > ((FlatMapFunction) (u) -> { > List resultIterable = new ArrayList<>(); > try { > TestVendingRecord result = transformer.convert(u); > resultIterable.add(result); > } catch (Throwable t) { > System.err.println("Ooops"); > t.printStackTrace(); > } > return resultIterable.iterator(); > }), > Encoders.bean(TestVendingRecord.class)) > .writeStream() > .outputMode(OutputMode.Append()) > .format("parquet") > .option("path", outputFolder.getRoot().toPath().toString()) > .option("checkpointLocation", > checkpointFolder.getRoot().toPath().toString()) > .start(); > writeStream.processAllAvailable(); > writeStream.stop(); > {code} > Tested it with one good and one bad (throwing exception in > transformer.convert(u)) input records and it produces following metrics: > > {code:java} > (TestMain.java:onTaskEnd(73)) - ---status--> SUCCESS > (TestMain.java:onTaskEnd(75)) - ---recordsWritten--> 0 > (TestMain.java:onTaskEnd(76)) - ---recordsRead-> 2 > (TestMain.java:onTaskEnd(83)) - taskEnd.taskInfo().accumulables(): > (TestMain.java:onTaskEnd(84)) - name = duration total (min, med, max) > (TestMain.java:onTaskEnd(85)) - value = 323 > (TestMain.java:onTaskEnd(84)) - name = number of output rows > (TestMain.java:onTaskEnd(85)) - value = 2 > (TestMain.java:onTaskEnd(84)) - name = duration total (min, med, max) > (TestMain.java:onTaskEnd(85)) - value = 364 > (TestMain.java:onTaskEnd(84)) - name = internal.metrics.input.recordsRead > (TestMain.java:onTaskEnd(85)) - value = 2 > (TestMain.java:onTaskEnd(84)) - name = internal.metrics.input.bytesRead > (TestMain.java:onTaskEnd(85)) - value = 157 > (TestMain.java:onTaskEnd(84)) - name = > internal.metrics.resultSerializationTime > (TestMain.java:onTaskEnd(85)) - value = 3 > (TestMain.java:onTaskEnd(84)) - name = internal.metrics.resultSize > (TestMain.java:onTaskEnd(85)) - value = 2396 > (TestMain.java:onTaskEnd(84)) - name = internal.metrics.executorCpuTime > (TestMain.java:onTaskEnd(85)) - value = 633807000 > (TestMain.java:onTaskEnd(84)) - name = internal.metrics.executorRunTime > (TestMain.java:onTaskEnd(85)) - value = 683 > (TestMain.java:onTaskEnd(84)) - name = > internal.metrics.executorDeserializeCpuTime > (TestMain.java:onTaskEnd(85)) - value = 55662000 > (TestMain.java:onTaskEnd(84)) - name = > internal.metrics.executorDeserializeTime > (TestMain.java:onTaskEnd(85)) - value = 58 > (TestMain.java:onTaskEnd(89)) - input 
records 2 > Streaming query made progress: { > "id" : "1231f9cb-b2e8-4d10-804d-73d7826c1cb5", > "runId" : "bd23b60c-93f9-4e17-b3bc-55403edce4e7", > "name" : null, > "timestamp" : "2018-01-26T14:44:05.362Z", > "numInputRows" : 2, > "processedRowsPerSecond" : 0.8163265306122448, > "durationMs" : { > "addBatch" : 1994, > "getBatch" : 126, > "getOffset" : 52, > "queryPlanning" : 220, > "triggerExecution" : 2450, > "walCommit" : 41 > }, > "stateOperators" : [ ], > "sources" : [ { > "description" : > "FileStreamSource[file:/var/folders/4w/zks_kfls2s3glmrj3f725p7hllyb5_/T/junit3661035412295337071]", > "startOffset" : null, > "endOffset" : { > "logOffset" : 0 > }, > "numInputRows" : 2, > "processedRowsPerSecond" : 0.8163265306122448 > } ], > "sink" : { > "description" : > "FileSink[/var/folders/4w/z
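For reference, a stripped-down sketch of the kind of listener used above, assuming an existing SparkSession named `spark`; it reads the same task-level output metrics where {{recordsWritten}} currently stays at 0 for the streaming file sink:

{code:java}
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Sketch only: log per-task input/output record counts as tasks finish.
val listener = new SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val m = taskEnd.taskMetrics
    if (m != null) {
      println(s"task ${taskEnd.taskInfo.taskId}: " +
        s"recordsRead=${m.inputMetrics.recordsRead} " +
        s"recordsWritten=${m.outputMetrics.recordsWritten}")
    }
  }
}
spark.sparkContext.addSparkListener(listener)
{code}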
[jira] [Commented] (SPARK-23288) Incorrect number of written records in structured streaming
[ https://issues.apache.org/jira/browse/SPARK-23288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16367567#comment-16367567 ] Gabor Somogyi commented on SPARK-23288: --- I'm working on this issue. > Incorrect number of written records in structured streaming > --- > > Key: SPARK-23288 > URL: https://issues.apache.org/jira/browse/SPARK-23288 > Project: Spark > Issue Type: Bug > Components: SQL, Structured Streaming >Affects Versions: 2.2.0 >Reporter: Yuriy Bondaruk >Priority: Major > Labels: Metrics, metrics > > I'm using SparkListener.onTaskEnd() to capture input and output metrics but > it seems that number of written records > ('taskEnd.taskMetrics().outputMetrics().recordsWritten()') is incorrect. Here > is my stream construction: > > {code:java} > StreamingQuery writeStream = session > .readStream() > .schema(RecordSchema.fromClass(TestRecord.class)) > .option(OPTION_KEY_DELIMITER, OPTION_VALUE_DELIMITER_TAB) > .option(OPTION_KEY_QUOTE, OPTION_VALUE_QUOTATION_OFF) > .csv(inputFolder.getRoot().toPath().toString()) > .as(Encoders.bean(TestRecord.class)) > .flatMap( > ((FlatMapFunction) (u) -> { > List resultIterable = new ArrayList<>(); > try { > TestVendingRecord result = transformer.convert(u); > resultIterable.add(result); > } catch (Throwable t) { > System.err.println("Ooops"); > t.printStackTrace(); > } > return resultIterable.iterator(); > }), > Encoders.bean(TestVendingRecord.class)) > .writeStream() > .outputMode(OutputMode.Append()) > .format("parquet") > .option("path", outputFolder.getRoot().toPath().toString()) > .option("checkpointLocation", > checkpointFolder.getRoot().toPath().toString()) > .start(); > writeStream.processAllAvailable(); > writeStream.stop(); > {code} > Tested it with one good and one bad (throwing exception in > transformer.convert(u)) input records and it produces following metrics: > > {code:java} > (TestMain.java:onTaskEnd(73)) - ---status--> SUCCESS > (TestMain.java:onTaskEnd(75)) - ---recordsWritten--> 0 > (TestMain.java:onTaskEnd(76)) - ---recordsRead-> 2 > (TestMain.java:onTaskEnd(83)) - taskEnd.taskInfo().accumulables(): > (TestMain.java:onTaskEnd(84)) - name = duration total (min, med, max) > (TestMain.java:onTaskEnd(85)) - value = 323 > (TestMain.java:onTaskEnd(84)) - name = number of output rows > (TestMain.java:onTaskEnd(85)) - value = 2 > (TestMain.java:onTaskEnd(84)) - name = duration total (min, med, max) > (TestMain.java:onTaskEnd(85)) - value = 364 > (TestMain.java:onTaskEnd(84)) - name = internal.metrics.input.recordsRead > (TestMain.java:onTaskEnd(85)) - value = 2 > (TestMain.java:onTaskEnd(84)) - name = internal.metrics.input.bytesRead > (TestMain.java:onTaskEnd(85)) - value = 157 > (TestMain.java:onTaskEnd(84)) - name = > internal.metrics.resultSerializationTime > (TestMain.java:onTaskEnd(85)) - value = 3 > (TestMain.java:onTaskEnd(84)) - name = internal.metrics.resultSize > (TestMain.java:onTaskEnd(85)) - value = 2396 > (TestMain.java:onTaskEnd(84)) - name = internal.metrics.executorCpuTime > (TestMain.java:onTaskEnd(85)) - value = 633807000 > (TestMain.java:onTaskEnd(84)) - name = internal.metrics.executorRunTime > (TestMain.java:onTaskEnd(85)) - value = 683 > (TestMain.java:onTaskEnd(84)) - name = > internal.metrics.executorDeserializeCpuTime > (TestMain.java:onTaskEnd(85)) - value = 55662000 > (TestMain.java:onTaskEnd(84)) - name = > internal.metrics.executorDeserializeTime > (TestMain.java:onTaskEnd(85)) - value = 58 > (TestMain.java:onTaskEnd(89)) - input records 2 > Streaming query made 
progress: { > "id" : "1231f9cb-b2e8-4d10-804d-73d7826c1cb5", > "runId" : "bd23b60c-93f9-4e17-b3bc-55403edce4e7", > "name" : null, > "timestamp" : "2018-01-26T14:44:05.362Z", > "numInputRows" : 2, > "processedRowsPerSecond" : 0.8163265306122448, > "durationMs" : { > "addBatch" : 1994, > "getBatch" : 126, > "getOffset" : 52, > "queryPlanning" : 220, > "triggerExecution" : 2450, > "walCommit" : 41 > }, > "stateOperators" : [ ], > "sources" : [ { > "description" : > "FileStreamSource[file:/var/folders/4w/zks_kfls2s3glmrj3f725p7hllyb5_/T/junit3661035412295337071]", > "startOffset" : null, > "endOffset" : { > "logOffset" : 0 > }, > "numInputRows" : 2, > "processedRowsPerSecond" : 0.8163265306122448 > } ], > "sink" : { > "description" : > "FileSink[/var/folders/4w/zks_kfls2s3glmrj3f725p7hllyb5
[jira] [Commented] (SPARK-23420) Datasource loading not handling paths with regex chars.
[ https://issues.apache.org/jira/browse/SPARK-23420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16367529#comment-16367529 ] Mitchell commented on SPARK-23420: -- Yes, I agree there is currently no way for a user to indicate that a path should be treated literally rather than as a glob. Having two separate methods for specifying the path, or an option controlling how it is interpreted, would solve this. Files or paths with these characters are probably not common, but they are possible and should be supported. > Datasource loading not handling paths with regex chars. > --- > > Key: SPARK-23420 > URL: https://issues.apache.org/jira/browse/SPARK-23420 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.2.1 >Reporter: Mitchell >Priority: Major > > Greetings, during some recent testing I ran across an issue attempting to > load files with regex chars like []()* etc. in them. The files are valid in > the various storages and the normal hadoop APIs all function properly > accessing them. > When my code is executed, I get the following stack trace. > 8/02/14 04:52:46 ERROR yarn.ApplicationMaster: User class threw exception: > java.io.IOException: Illegal file pattern: Unmatched closing ')' near index > 130 > A_VERY_LONG_DIRECTORY_FOLDER_THAT_INCLUDES_MULTIBYTE_AND_SPECIAL_CHARACTERS_abcdefghijklmnopqrst_0123456789_~@#\$%\^&\(\)-_=\+[(?:]);',\._???_???_???_?? > ^ java.io.IOException: Illegal file pattern: Unmatched closing ')' near > index 130 > A_VERY_LONG_DIRECTORY_FOLDER_THAT_INCLUDES_MULTIBYTE_AND_SPECIAL_CHARACTERS_abcdefghijklmnopqrst_0123456789_~@#\$%\^&\(\)-_=\+[(?:]);',\._???_???_???_?? > ^ at org.apache.hadoop.fs.GlobFilter.init(GlobFilter.java:71) at > org.apache.hadoop.fs.GlobFilter.(GlobFilter.java:50) at > org.apache.hadoop.fs.Globber.doGlob(Globber.java:210) at > org.apache.hadoop.fs.Globber.glob(Globber.java:149) at > org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1955) at > org.apache.hadoop.fs.s3a.S3AFileSystem.globStatus(S3AFileSystem.java:2477) at > org.apache.spark.deploy.SparkHadoopUtil.globPath(SparkHadoopUtil.scala:234) > at > org.apache.spark.deploy.SparkHadoopUtil.globPathIfNecessary(SparkHadoopUtil.scala:244) > at > org.apache.spark.sql.execution.datasources.DataSource$.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:618) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:350) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:350) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at scala.collection.immutable.List.foreach(List.scala:381) at > scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) at > scala.collection.immutable.List.flatMap(List.scala:344) at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:349) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178) at > org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:533) at > org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:412) at > com.sap.profile.SparkProfileTask.main(SparkProfileTask.java:95) at > sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at >
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) at > org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:635) > Caused by: java.util.regex.PatternSyntaxException: Unmatched closing ')' > near index 130 > A_VERY_LONG_DIRECTORY_FOLDER_THAT_INCLUDES_MULTIBYTE_AND_SPECIAL_CHARACTERS_abcdefghijklmnopqrst_0123456789_~@#\$%\^&\(\)-_=\+[(?:]);',\._???_???_???_?? > ^ at java.util.regex.Pattern.error(Pattern.java:1955) at > java.util.regex.Pattern.compile(Pattern.java:1700) at > java.util.regex.Pattern.(Pattern.java:1351) at > java.util.regex.Pattern.compile(Pattern.java:1054) at > org.apache.hadoop.fs.GlobPattern.set(GlobPattern.java:156) at > org.apache.hadoop.fs.GlobPattern.(GlobPattern.java:42) at > org.apache.hadoop.fs.GlobFilter.init(GlobFilter.java:67) ... 25 more 18/02/14 > 04:52:46 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 15, > (reason: User class threw exception: java.io.IOException: Illegal file > pattern: Unmatched closing
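The failure can be reproduced outside Spark, which illustrates why an explicit literal-vs-glob switch is needed: Hadoop's glob machinery itself rejects such a name. A small sketch follows; the directory name is made up to mirror the reported one, and the outcome is only what one would expect given the stack trace above:

{code:java}
import org.apache.hadoop.fs.GlobFilter

// Sketch only: constructing a GlobFilter over a literal name containing
// glob/regex metacharacters is expected to fail the same way as the report,
// because DataSource.checkAndGlobPathIfNecessary always routes paths
// through globStatus(), i.e. through this pattern compilation.
val literalLookingName = """A_DIR_WITH_SPECIAL_CHARS_[(?:]);',"""
val filter = new GlobFilter(literalLookingName)
// expected: java.io.IOException: Illegal file pattern: Unmatched closing ')' ...
{code}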
[jira] [Created] (SPARK-23451) Deprecate KMeans computeCost
Marco Gaido created SPARK-23451: --- Summary: Deprecate KMeans computeCost Key: SPARK-23451 URL: https://issues.apache.org/jira/browse/SPARK-23451 Project: Spark Issue Type: Task Components: ML Affects Versions: 2.4.0 Reporter: Marco Gaido SPARK-11029 added the {{computeCost}} method as a temp fix for the lack of proper cluster evaluators. Now SPARK-14516 introduces a proper {{ClusteringEvaluator}}, so we should deprecate this method and maybe remove it in the next releases. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23450) jars option in spark submit is documented in misleading way
Gregory Reshetniak created SPARK-23450: -- Summary: jars option in spark submit is documented in misleading way Key: SPARK-23450 URL: https://issues.apache.org/jira/browse/SPARK-23450 Project: Spark Issue Type: Bug Components: Spark Submit Affects Versions: 2.2.1 Reporter: Gregory Reshetniak I am wondering if the {{--jars}} option on spark submit is actually meant for distributing the dependency jars onto the nodes in the cluster? In my case I can see it working as a "symlink" of sorts. But the documentation is written in a way that suggests otherwise. Please help me figure out if this is a bug or just my reading of the docs. Thanks! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23449) Extra java options lose order in Docker context
[ https://issues.apache.org/jira/browse/SPARK-23449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Korzhuev updated SPARK-23449: Description: `spark.driver.extraJavaOptions` and `spark.executor.extraJavaOptions` when processed in `entrypoint.sh` does not preserve its ordering, which makes `-XX:+UnlockExperimentalVMOptions` unusable, as you have to pass it before any other experimental options. Steps to reproduce: # Set `spark.driver.extraJavaOptions`, e.g. `-XX:+UnlockExperimentalVMOptions -XX:+UseG1GC -XX:+CMSClassUnloadingEnabled -XX:+UseCGroupMemoryLimitForHeap` # Submit application to k8s cluster. # Fetch logs and observe that on each run order of options is different and when `-XX:+UnlockExperimentalVMOptions` is not the first startup will fail. Expected behaviour: # Order of `extraJavaOptions` should be preserved. Cause: `entrypoint.sh` fetches environment options with `env`, which doesn't guarantee ordering. {code:java} env | grep SPARK_JAVA_OPT_ | sed 's/[^=]*=\(.*\)/\1/g' > /tmp/java_opts.txt{code} was: `spark.driver.extraJavaOptions` and `spark.executor.extraJavaOptions` when processed in `entrypoint.sh`, which makes `-XX:+UnlockExperimentalVMOptions` unusable, as you have to pass it before any other experimental options. Steps to reproduce: # Set `spark.driver.extraJavaOptions`, e.g. `-XX:+UnlockExperimentalVMOptions -XX:+UseG1GC -XX:+CMSClassUnloadingEnabled -XX:+UseCGroupMemoryLimitForHeap` # Submit application to k8s cluster. # Fetch logs and observe that on each run order of options is different and when `-XX:+UnlockExperimentalVMOptions` is not the first startup will fail. Expected behaviour: # Order of `extraJavaOptions` should be preserved. Cause: `entrypoint.sh` fetches environment options with `env`, which doesn't guarantee ordering. {code:java} env | grep SPARK_JAVA_OPT_ | sed 's/[^=]*=\(.*\)/\1/g' > /tmp/java_opts.txt{code} > Extra java options lose order in Docker context > --- > > Key: SPARK-23449 > URL: https://issues.apache.org/jira/browse/SPARK-23449 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 > Environment: Running Spark on K8S with supplied Docker image. Passing > along extra java options. >Reporter: Andrew Korzhuev >Priority: Minor > Fix For: 2.3.0 > > Original Estimate: 1h > Remaining Estimate: 1h > > `spark.driver.extraJavaOptions` and `spark.executor.extraJavaOptions` when > processed in `entrypoint.sh` does not preserve its ordering, which makes > `-XX:+UnlockExperimentalVMOptions` unusable, as you have to pass it before > any other experimental options. > > Steps to reproduce: > # Set `spark.driver.extraJavaOptions`, e.g. > `-XX:+UnlockExperimentalVMOptions -XX:+UseG1GC -XX:+CMSClassUnloadingEnabled > -XX:+UseCGroupMemoryLimitForHeap` > # Submit application to k8s cluster. > # Fetch logs and observe that on each run order of options is different and > when `-XX:+UnlockExperimentalVMOptions` is not the first startup will fail. > > Expected behaviour: > # Order of `extraJavaOptions` should be preserved. > > Cause: > `entrypoint.sh` fetches environment options with `env`, which doesn't > guarantee ordering. > {code:java} > env | grep SPARK_JAVA_OPT_ | sed 's/[^=]*=\(.*\)/\1/g' > > /tmp/java_opts.txt{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23439) Ambiguous reference when selecting column inside StructType with same name that outer colum
[ https://issues.apache.org/jira/browse/SPARK-23439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16367414#comment-16367414 ] Wenchen Fan commented on SPARK-23439: - This is valid behavior, as `a.b` is an invalid column name for most external storage formats like Parquet. I think it's reasonable to name the extracted column after the deepest nested field. Users should manually alias the column to avoid duplication before saving data to external storage. > Ambiguous reference when selecting column inside StructType with same name > that outer colum > --- > > Key: SPARK-23439 > URL: https://issues.apache.org/jira/browse/SPARK-23439 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 > Environment: Scala 2.11.8, Spark 2.2.0 >Reporter: Alejandro Trujillo Caballero >Priority: Minor > > Hi. > I've seen that when working with nested struct fields in a DataFrame and > doing a select operation the nesting is lost and this can result in > collisions between column names. > For example: > > {code:java} > case class Foo(a: Int, b: Bar) > case class Bar(a: Int) > val items = List( > Foo(1, Bar(1)), > Foo(2, Bar(2)) > ) > val df = spark.createDataFrame(items) > val df_a_a = df.select($"a", $"b.a").show > //+---+---+ > //| a| a| > //+---+---+ > //| 1| 1| > //| 2| 2| > //+---+---+ > df.select($"a", $"b.a").printSchema > //root > //|-- a: integer (nullable = false) > //|-- a: integer (nullable = true) > df.select($"a", $"b.a").select($"a") > //org.apache.spark.sql.AnalysisException: Reference 'a' is ambiguous, could > be: a#9, a#{code} > > > Shouldn't the second column be named "b.a"? > > Thanks. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
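A short sketch of the manual aliasing suggested above, reusing the reporter's {{df}} and assuming {{spark.implicits._}} is imported:

{code:java}
// Sketch only: give the extracted nested field an explicit, unambiguous name
// before any further select or save to external storage.
val deduped = df.select($"a", $"b.a".alias("b_a"))
deduped.printSchema()
// root
//  |-- a: integer (nullable = false)
//  |-- b_a: integer (nullable = true)
deduped.select($"a")  // no longer ambiguous
{code}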
[jira] [Assigned] (SPARK-23449) Extra java options lose order in Docker context
[ https://issues.apache.org/jira/browse/SPARK-23449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23449: Assignee: (was: Apache Spark) > Extra java options lose order in Docker context > --- > > Key: SPARK-23449 > URL: https://issues.apache.org/jira/browse/SPARK-23449 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 > Environment: Running Spark on K8S with supplied Docker image. Passing > along extra java options. >Reporter: Andrew Korzhuev >Priority: Minor > Fix For: 2.3.0 > > Original Estimate: 1h > Remaining Estimate: 1h > > `spark.driver.extraJavaOptions` and `spark.executor.extraJavaOptions` when > processed in `entrypoint.sh`, which makes `-XX:+UnlockExperimentalVMOptions` > unusable, as you have to pass it before any other experimental options. > > Steps to reproduce: > # Set `spark.driver.extraJavaOptions`, e.g. > `-XX:+UnlockExperimentalVMOptions -XX:+UseG1GC -XX:+CMSClassUnloadingEnabled > -XX:+UseCGroupMemoryLimitForHeap` > # Submit application to k8s cluster. > # Fetch logs and observe that on each run order of options is different and > when `-XX:+UnlockExperimentalVMOptions` is not the first startup will fail. > > Expected behaviour: > # Order of `extraJavaOptions` should be preserved. > > Cause: > `entrypoint.sh` fetches environment options with `env`, which doesn't > guarantee ordering. > {code:java} > env | grep SPARK_JAVA_OPT_ | sed 's/[^=]*=\(.*\)/\1/g' > > /tmp/java_opts.txt{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23449) Extra java options lose order in Docker context
[ https://issues.apache.org/jira/browse/SPARK-23449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16367370#comment-16367370 ] Apache Spark commented on SPARK-23449: -- User 'andrusha' has created a pull request for this issue: https://github.com/apache/spark/pull/20628 > Extra java options lose order in Docker context > --- > > Key: SPARK-23449 > URL: https://issues.apache.org/jira/browse/SPARK-23449 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 > Environment: Running Spark on K8S with supplied Docker image. Passing > along extra java options. >Reporter: Andrew Korzhuev >Priority: Minor > Fix For: 2.3.0 > > Original Estimate: 1h > Remaining Estimate: 1h > > `spark.driver.extraJavaOptions` and `spark.executor.extraJavaOptions` when > processed in `entrypoint.sh`, which makes `-XX:+UnlockExperimentalVMOptions` > unusable, as you have to pass it before any other experimental options. > > Steps to reproduce: > # Set `spark.driver.extraJavaOptions`, e.g. > `-XX:+UnlockExperimentalVMOptions -XX:+UseG1GC -XX:+CMSClassUnloadingEnabled > -XX:+UseCGroupMemoryLimitForHeap` > # Submit application to k8s cluster. > # Fetch logs and observe that on each run order of options is different and > when `-XX:+UnlockExperimentalVMOptions` is not the first startup will fail. > > Expected behaviour: > # Order of `extraJavaOptions` should be preserved. > > Cause: > `entrypoint.sh` fetches environment options with `env`, which doesn't > guarantee ordering. > {code:java} > env | grep SPARK_JAVA_OPT_ | sed 's/[^=]*=\(.*\)/\1/g' > > /tmp/java_opts.txt{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23449) Extra java options lose order in Docker context
[ https://issues.apache.org/jira/browse/SPARK-23449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23449: Assignee: Apache Spark > Extra java options lose order in Docker context > --- > > Key: SPARK-23449 > URL: https://issues.apache.org/jira/browse/SPARK-23449 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 > Environment: Running Spark on K8S with supplied Docker image. Passing > along extra java options. >Reporter: Andrew Korzhuev >Assignee: Apache Spark >Priority: Minor > Fix For: 2.3.0 > > Original Estimate: 1h > Remaining Estimate: 1h > > `spark.driver.extraJavaOptions` and `spark.executor.extraJavaOptions` when > processed in `entrypoint.sh`, which makes `-XX:+UnlockExperimentalVMOptions` > unusable, as you have to pass it before any other experimental options. > > Steps to reproduce: > # Set `spark.driver.extraJavaOptions`, e.g. > `-XX:+UnlockExperimentalVMOptions -XX:+UseG1GC -XX:+CMSClassUnloadingEnabled > -XX:+UseCGroupMemoryLimitForHeap` > # Submit application to k8s cluster. > # Fetch logs and observe that on each run order of options is different and > when `-XX:+UnlockExperimentalVMOptions` is not the first startup will fail. > > Expected behaviour: > # Order of `extraJavaOptions` should be preserved. > > Cause: > `entrypoint.sh` fetches environment options with `env`, which doesn't > guarantee ordering. > {code:java} > env | grep SPARK_JAVA_OPT_ | sed 's/[^=]*=\(.*\)/\1/g' > > /tmp/java_opts.txt{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23449) Extra java options lose order in Docker context
Andrew Korzhuev created SPARK-23449: --- Summary: Extra java options lose order in Docker context Key: SPARK-23449 URL: https://issues.apache.org/jira/browse/SPARK-23449 Project: Spark Issue Type: Bug Components: Kubernetes Affects Versions: 2.3.0 Environment: Running Spark on K8S with supplied Docker image. Passing along extra java options. Reporter: Andrew Korzhuev Fix For: 2.3.0 `spark.driver.extraJavaOptions` and `spark.executor.extraJavaOptions` when processed in `entrypoint.sh`, which makes `-XX:+UnlockExperimentalVMOptions` unusable, as you have to pass it before any other experimental options. Steps to reproduce: # Set `spark.driver.extraJavaOptions`, e.g. `-XX:+UnlockExperimentalVMOptions -XX:+UseG1GC -XX:+CMSClassUnloadingEnabled -XX:+UseCGroupMemoryLimitForHeap` # Submit application to k8s cluster. # Fetch logs and observe that on each run order of options is different and when `-XX:+UnlockExperimentalVMOptions` is not the first startup will fail. Expected behaviour: # Order of `extraJavaOptions` should be preserved. Cause: `entrypoint.sh` fetches environment options with `env`, which doesn't guarantee ordering. {code:java} env | grep SPARK_JAVA_OPT_ | sed 's/[^=]*=\(.*\)/\1/g' > /tmp/java_opts.txt{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23448) Dataframe returns wrong result when column don't respect datatype
[ https://issues.apache.org/jira/browse/SPARK-23448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ahmed ZAROUI updated SPARK-23448: - Summary: Dataframe returns wrong result when column don't respect datatype (was: Data encoding problem when not finding the right type) > Dataframe returns wrong result when column don't respect datatype > - > > Key: SPARK-23448 > URL: https://issues.apache.org/jira/browse/SPARK-23448 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.2 > Environment: Local >Reporter: Ahmed ZAROUI >Priority: Major > > I have the following json file that contains some noisy data(String instead > of Array): > > {code:java} > {"attr1":"val1","attr2":"[\"val2\"]"} > {"attr1":"val1","attr2":["val2"]} > {code} > And i need to specify schema programatically like this: > > {code:java} > implicit val spark = SparkSession > .builder() > .master("local[*]") > .config("spark.ui.enabled", false) > .config("spark.sql.caseSensitive", "True") > .getOrCreate() > import spark.implicits._ > val schema = StructType( > Seq(StructField("attr1", StringType, true), > StructField("attr2", ArrayType(StringType, true), true))) > spark.read.schema(schema).json(input).collect().foreach(println) > {code} > The result given by this code is: > {code:java} > [null,null] > [val1,WrappedArray(val2)] > {code} > Instead of putting null in corrupted column, all columns of the first message > are null > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
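One hedged workaround sketch (it does not change the whole-row-null behavior itself): keep the user-specified schema but add a corrupt-record column, so the raw text of rows the parser rejects is at least visible. This reuses the reporter's {{spark}} and {{input}} values:

{code:java}
import org.apache.spark.sql.types._

// Sketch only: extend the schema with a corrupt-record column; rows that fail
// to parse against the declared types should surface their raw JSON there
// instead of silently becoming all-null rows.
val schemaWithCorrupt = StructType(Seq(
  StructField("attr1", StringType, true),
  StructField("attr2", ArrayType(StringType, true), true),
  StructField("_corrupt_record", StringType, true)))

val parsed = spark.read
  .schema(schemaWithCorrupt)
  .option("mode", "PERMISSIVE") // the default, spelled out for clarity
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .json(input)

parsed.show(false)
{code}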
[jira] [Updated] (SPARK-23448) Data encoding problem when not finding the right type
[ https://issues.apache.org/jira/browse/SPARK-23448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ahmed ZAROUI updated SPARK-23448: - Description: I have the following json file that contains some noisy data(String instead of Array): {code:java} {"attr1":"val1","attr2":"[\"val2\"]"} {"attr1":"val1","attr2":["val2"]} {code} And i need to specify schema programatically like this: {code:java} implicit val spark = SparkSession .builder() .master("local[*]") .config("spark.ui.enabled", false) .config("spark.sql.caseSensitive", "True") .getOrCreate() import spark.implicits._ val schema = StructType( Seq(StructField("attr1", StringType, true), StructField("attr2", ArrayType(StringType, true), true))) spark.read.schema(schema).json(input).collect().foreach(println) {code} The result given by this code is: {code:java} [null,null] [val1,WrappedArray(val2)] {code} Instead of putting null in corrupted column, all columns of the first message are null was: I have the following json file that contains some noisy data(String instead of Array): {code:java} {"attr1":"val1","attr2":"[\"val2\"]"} {"attr1":"val1","attr2":["val2"]} {code} And i need to specify schema programatically like this: {code:java} implicit val spark = SparkSession .builder() .master("local[*]") .config("spark.ui.enabled", false) .config("spark.sql.caseSensitive", "True") .getOrCreate() import spark.implicits._ val schema=StructType(Seq(StructField("attr1",StringType,true),StructField("attr2",ArrayType(StringType,true),true))) spark.read.schema(schema).json(input).collect().foreach(println) {code} The result given by this code is: {code:java} [null,null] [val1,WrappedArray(val2)] {code} Instead of putting null in corrupted column, all columns of the first message are null > Data encoding problem when not finding the right type > - > > Key: SPARK-23448 > URL: https://issues.apache.org/jira/browse/SPARK-23448 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.2 > Environment: Local >Reporter: Ahmed ZAROUI >Priority: Major > > I have the following json file that contains some noisy data(String instead > of Array): > > {code:java} > {"attr1":"val1","attr2":"[\"val2\"]"} > {"attr1":"val1","attr2":["val2"]} > {code} > And i need to specify schema programatically like this: > > {code:java} > implicit val spark = SparkSession > .builder() > .master("local[*]") > .config("spark.ui.enabled", false) > .config("spark.sql.caseSensitive", "True") > .getOrCreate() > import spark.implicits._ > val schema = StructType( > Seq(StructField("attr1", StringType, true), > StructField("attr2", ArrayType(StringType, true), true))) > spark.read.schema(schema).json(input).collect().foreach(println) > {code} > The result given by this code is: > {code:java} > [null,null] > [val1,WrappedArray(val2)] > {code} > Instead of putting null in corrupted column, all columns of the first message > are null > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23448) Data encoding problem when not finding the right type
[ https://issues.apache.org/jira/browse/SPARK-23448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ahmed ZAROUI updated SPARK-23448: - Environment: Local (was: Tested locally in linux machine) > Data encoding problem when not finding the right type > - > > Key: SPARK-23448 > URL: https://issues.apache.org/jira/browse/SPARK-23448 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.2 > Environment: Local >Reporter: Ahmed ZAROUI >Priority: Major > > I have the following json file that contains some noisy data(String instead > of Array): > > {code:java} > {"attr1":"val1","attr2":"[\"val2\"]"} > {"attr1":"val1","attr2":["val2"]} > {code} > And i need to specify schema programatically like this: > > {code:java} > implicit val spark = SparkSession > .builder() > .master("local[*]") > .config("spark.ui.enabled", false) > .config("spark.sql.caseSensitive", "True") > .getOrCreate() > import spark.implicits._ > val > schema=StructType(Seq(StructField("attr1",StringType,true),StructField("attr2",ArrayType(StringType,true),true))) > spark.read.schema(schema).json(input).collect().foreach(println) > {code} > The result given by this code is: > {code:java} > [null,null] > [val1,WrappedArray(val2)] > {code} > Instead of putting null in corrupted column, all columns of the first message > are null > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23448) Data encoding problem when not finding the right type
[ https://issues.apache.org/jira/browse/SPARK-23448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ahmed ZAROUI updated SPARK-23448: - Description: I have the following json file that contains some noisy data(String instead of Array): {code:java} {"attr1":"val1","attr2":"[\"val2\"]"} {"attr1":"val1","attr2":["val2"]} {code} And i need to specify schema programatically like this: {code:java} implicit val spark = SparkSession .builder() .master("local[*]") .config("spark.ui.enabled", false) .config("spark.sql.caseSensitive", "True") .getOrCreate() import spark.implicits._ val schema=StructType(Seq(StructField("attr1",StringType,true),StructField("attr2",ArrayType(StringType,true),true))) spark.read.schema(schema).json(input).collect().foreach(println) {code} The result given by this code is: {code:java} [null,null] [val1,WrappedArray(val2)] {code} Instead of putting null in corrupted column, all columns of the first message are null was: I have the following json file that contains some noisy data(String instead of Array): {code:java} {"attr1":"val1","attr2":["val2"]} {"attr1":"val1","attr2":"[\"val2\"]"} {code} And i need to specify schema programatically like this: {code:java} implicit val spark = SparkSession .builder() .master("local[*]") .config("spark.ui.enabled", false) .config("spark.sql.caseSensitive", "True") .getOrCreate() import spark.implicits._ val schema=StructType(Seq(StructField("attr1",StringType,true),StructField("attr2",ArrayType(StringType,true),true))) spark.read.schema(schema).json(input).collect().foreach(println) {code} The result given by this code is: {code:java} [null,null] [val1,WrappedArray(val2)] {code} Instead of putting null in corrupted column, all columns of the first message are null > Data encoding problem when not finding the right type > - > > Key: SPARK-23448 > URL: https://issues.apache.org/jira/browse/SPARK-23448 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.2 > Environment: Tested locally in linux machine >Reporter: Ahmed ZAROUI >Priority: Major > > I have the following json file that contains some noisy data(String instead > of Array): > > {code:java} > {"attr1":"val1","attr2":"[\"val2\"]"} > {"attr1":"val1","attr2":["val2"]} > {code} > And i need to specify schema programatically like this: > > {code:java} > implicit val spark = SparkSession > .builder() > .master("local[*]") > .config("spark.ui.enabled", false) > .config("spark.sql.caseSensitive", "True") > .getOrCreate() > import spark.implicits._ > val > schema=StructType(Seq(StructField("attr1",StringType,true),StructField("attr2",ArrayType(StringType,true),true))) > spark.read.schema(schema).json(input).collect().foreach(println) > {code} > The result given by this code is: > {code:java} > [null,null] > [val1,WrappedArray(val2)] > {code} > Instead of putting null in corrupted column, all columns of the first message > are null > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23448) Data encoding problem when not finding the right type
Ahmed ZAROUI created SPARK-23448: Summary: Data encoding problem when not finding the right type Key: SPARK-23448 URL: https://issues.apache.org/jira/browse/SPARK-23448 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.0.2 Environment: Tested locally in linux machine Reporter: Ahmed ZAROUI I have the following json file that contains some noisy data(String instead of Array): {code:java} {"attr1":"val1","attr2":["val2"]} {"attr1":"val1","attr2":"[\"val2\"]"} {code} And i need to specify schema programatically like this: {code:java} implicit val spark = SparkSession .builder() .master("local[*]") .config("spark.ui.enabled", false) .config("spark.sql.caseSensitive", "True") .getOrCreate() import spark.implicits._ val schema=StructType(Seq(StructField("attr1",StringType,true),StructField("attr2",ArrayType(StringType,true),true))) spark.read.schema(schema).json(input).collect().foreach(println) {code} The result given by this code is: {code:java} [null,null] [val1,WrappedArray(val2)] {code} Instead of putting null in corrupted column, all columns of the first message are null -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23439) Ambiguous reference when selecting column inside StructType with same name that outer colum
[ https://issues.apache.org/jira/browse/SPARK-23439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16366945#comment-16366945 ] Marco Gaido commented on SPARK-23439: - [~cloud_fan] I think this comes from https://github.com/apache/spark/pull/8215 (https://github.com/apache/spark/blob/1dc2c1d5e85c5f404f470aeb44c1f3c22786bdea/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala#L203). We are adding an Alias to the name of the last extracted value. I am not sure whether this is the right behavior, so this JIRA is invalid, or this should be changed. What do you think? Thanks. > Ambiguous reference when selecting column inside StructType with same name > that outer colum > --- > > Key: SPARK-23439 > URL: https://issues.apache.org/jira/browse/SPARK-23439 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 > Environment: Scala 2.11.8, Spark 2.2.0 >Reporter: Alejandro Trujillo Caballero >Priority: Minor > > Hi. > I've seen that when working with nested struct fields in a DataFrame and > doing a select operation the nesting is lost and this can result in > collisions between column names. > For example: > > {code:java} > case class Foo(a: Int, b: Bar) > case class Bar(a: Int) > val items = List( > Foo(1, Bar(1)), > Foo(2, Bar(2)) > ) > val df = spark.createDataFrame(items) > val df_a_a = df.select($"a", $"b.a").show > //+---+---+ > //| a| a| > //+---+---+ > //| 1| 1| > //| 2| 2| > //+---+---+ > df.select($"a", $"b.a").printSchema > //root > //|-- a: integer (nullable = false) > //|-- a: integer (nullable = true) > df.select($"a", $"b.a").select($"a") > //org.apache.spark.sql.AnalysisException: Reference 'a' is ambiguous, could > be: a#9, a#{code} > > > Shouldn't the second column be named "b.a"? > > Thanks. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23442) Reading from partitioned and bucketed table uses only bucketSpec.numBuckets partitions in all cases
[ https://issues.apache.org/jira/browse/SPARK-23442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16366898#comment-16366898 ] Marco Gaido commented on SPARK-23442: - I am not sure whether it is what you are looking for, but you can repartition the resulting DataFrame in order to have more partitions. > Reading from partitioned and bucketed table uses only bucketSpec.numBuckets > partitions in all cases > --- > > Key: SPARK-23442 > URL: https://issues.apache.org/jira/browse/SPARK-23442 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.2.1 >Reporter: Pranav Rao >Priority: Major > > Through the DataFrameWriter[T] interface I have created an external Hive table > with 5000 (horizontal) partitions and 50 buckets in each partition. Overall > the dataset is 600GB and the provider is Parquet. > Now this works great when joining with a similarly bucketed dataset - it's > able to avoid a shuffle. > But any action on this DataFrame (from _spark.table("tablename")_) works with > only 50 RDD partitions. This is happening because of > [createBucketedReadRDD|https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala]. > So the 600GB dataset is only read through 50 tasks, which makes this > partitioning + bucketing scheme not useful. > I cannot expose the base directory of the parquet folder for reading the > dataset, because the partition locations don't follow a (basePath + partSpec) > format. > Meanwhile, are there workarounds to use higher parallelism while reading such > a table? > Let me know if I can help in any way. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
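The suggestion above, spelled out as a minimal sketch; the table and column names are placeholders and an active {{SparkSession}} named {{spark}} is assumed. Note that an explicit repartition introduces a shuffle, so the shuffle-avoidance benefit of bucketing for joins on the bucket key is lost in that plan.

{code:java}
// Read the bucketed table, then raise the parallelism for wide scans.
val df = spark.table("tablename")
val wider = df.repartition(2000)

// Any subsequent action now runs with 2000 tasks instead of 50.
wider.groupBy("some_col").count().show()
{code}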
[jira] [Commented] (SPARK-23437) [ML] Distributed Gaussian Process Regression for MLlib
[ https://issues.apache.org/jira/browse/SPARK-23437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16366857#comment-16366857 ] Valeriy Avanesov commented on SPARK-23437: -- [~mlnick], does that really apply to a textbook algorithm that fills an obvious gap? MLlib currently provides no non-parametric regression technique that infers a smooth function. Regarding the guidelines: the requirements for the algorithm are # Be widely known # Be used and accepted (academic citations and concrete use cases can help justify this) # Be highly scalable and I think all of them hold (see the original post). > [ML] Distributed Gaussian Process Regression for MLlib > -- > > Key: SPARK-23437 > URL: https://issues.apache.org/jira/browse/SPARK-23437 > Project: Spark > Issue Type: New Feature > Components: ML, MLlib >Affects Versions: 2.2.1 >Reporter: Valeriy Avanesov >Priority: Major > > Gaussian Process Regression (GP) is a well-known black-box non-linear > regression approach [1]. For years the approach remained inapplicable to > large samples due to its cubic computational complexity; however, more recent > techniques (Sparse GP) reduce this to linear complexity. The field > continues to attract interest from researchers – several papers devoted to > GP were presented at NIPS 2017. > Unfortunately, the non-parametric regression techniques shipped with MLlib are > restricted to tree-based approaches. > I propose to create and include an implementation (which I am going to work > on) of the so-called robust Bayesian Committee Machine proposed and investigated > in [2]. > [1] Carl Edward Rasmussen and Christopher K. I. Williams. 2005. _Gaussian > Processes for Machine Learning (Adaptive Computation and Machine Learning)_. > The MIT Press. > [2] Marc Peter Deisenroth and Jun Wei Ng. 2015. Distributed Gaussian > processes. In _Proceedings of the 32nd International Conference on Machine > Learning - Volume 37_ (ICML'15), Francis Bach and David Blei (Eds.), Vol. 37. > JMLR.org, 1481-1490. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23265) Update multi-column error handling logic in QuantileDiscretizer
[ https://issues.apache.org/jira/browse/SPARK-23265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16366809#comment-16366809 ] Nick Pentreath commented on SPARK-23265: Thanks for the ping - yes, it adds more detailed checking of the exclusive params and would throw an error in certain additional situations (specifically: {{numBucketsArray}} set for a single-column transform, both {{numBuckets}} and {{numBucketsArray}} set for a multi-column transform, or a {{numBucketsArray}} length that does not match the input/output columns of a multi-column transform). I reviewed the PR and it LGTM, so, as I said there, we can merge this now before RC4 gets cut. > Update multi-column error handling logic in QuantileDiscretizer > --- > > Key: SPARK-23265 > URL: https://issues.apache.org/jira/browse/SPARK-23265 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: Nick Pentreath >Priority: Major > > SPARK-22397 added support for multiple columns to {{QuantileDiscretizer}}. If > both single- and multi-column params are set (specifically {{inputCol}} / > {{inputCols}}) an error is thrown. > However, SPARK-22799 added more comprehensive error logic for {{Bucketizer}}. > The logic for {{QuantileDiscretizer}} should be updated to match. *Note* that > for this transformer, it is acceptable to set the single-column param for > {{numBuckets}} when transforming multiple columns, since that is then > applied to all columns. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
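For illustration, the configurations discussed above look roughly as follows (column names are placeholders; this mirrors the multi-column API added by SPARK-22397):

{code:java}
import org.apache.spark.ml.feature.QuantileDiscretizer

// Valid: multi-column transform with one bucket count per column.
val perColumn = new QuantileDiscretizer()
  .setInputCols(Array("c1", "c2"))
  .setOutputCols(Array("c1_binned", "c2_binned"))
  .setNumBucketsArray(Array(5, 10))

// Also valid: the single-column numBuckets param applied to all columns.
val shared = new QuantileDiscretizer()
  .setInputCols(Array("c1", "c2"))
  .setOutputCols(Array("c1_binned", "c2_binned"))
  .setNumBuckets(4)

// Expected to fail the stricter validation: numBucketsArray on a single-column
// transform, numBuckets and numBucketsArray both set on a multi-column
// transform, or an array length that does not match the input/output columns.
{code}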
[jira] [Commented] (SPARK-23399) Register a task completion listener first for OrcColumnarBatchReader
[ https://issues.apache.org/jira/browse/SPARK-23399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16366788#comment-16366788 ] Marco Gaido commented on SPARK-23399: - I think we should reopen this, it is still happening: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87486/testReport/org.apache.spark.sql.execution.datasources.orc/OrcQuerySuite/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/ > Register a task completion listener first for OrcColumnarBatchReader > > > Key: SPARK-23399 > URL: https://issues.apache.org/jira/browse/SPARK-23399 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 2.3.1 > > > This is related with SPARK-23390. > Currently, there was a opened file leak for OrcColumnarBatchReader. > {code} > [info] - Enabling/disabling ignoreMissingFiles using orc (648 milliseconds) > 15:55:58.673 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in > stage 61.0 (TID 85, localhost, executor driver): TaskKilled (Stage cancelled) > 15:55:58.674 WARN org.apache.spark.DebugFilesystem: Leaked filesystem > connection created at: > java.lang.Throwable > at > org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36) > at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70) > at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769) > at > org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.open(RecordReaderUtils.java:173) > at > org.apache.orc.impl.RecordReaderImpl.(RecordReaderImpl.java:254) > at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:633) > at > org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initialize(OrcColumnarBatchReader.java:138) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
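For context, the ordering the issue title refers to can be sketched as follows; this is an illustration only and does not reproduce the actual {{OrcColumnarBatchReader}} API, and the reader type here is a hypothetical stand-in.

{code:java}
import org.apache.spark.TaskContext
import org.apache.spark.util.TaskCompletionListener

// Hypothetical stand-in for a reader that owns a file handle.
class DemoReader extends AutoCloseable {
  def initialize(): Unit = { /* may open the underlying file and then fail */ }
  override def close(): Unit = { /* release the file handle */ }
}

val reader = new DemoReader()

// Register the completion listener *before* initialize(), so the file is
// released even if initialization throws or the task is killed mid-way.
Option(TaskContext.get()).foreach { ctx =>
  ctx.addTaskCompletionListener(new TaskCompletionListener {
    override def onTaskCompletion(context: TaskContext): Unit = reader.close()
  })
}
reader.initialize()
{code}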
[jira] [Commented] (SPARK-12140) Support Streaming UI in HistoryServer
[ https://issues.apache.org/jira/browse/SPARK-12140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16366774#comment-16366774 ] German Schiavon Matteo commented on SPARK-12140: OK [~jerryshao], I'm testing your code and it works, but the streaming tab does not refresh until the driver is dead/killed. I'm going to keep trying it, and I also want to run some performance tests to evaluate the scalability issue. > Support Streaming UI in HistoryServer > - > > Key: SPARK-12140 > URL: https://issues.apache.org/jira/browse/SPARK-12140 > Project: Spark > Issue Type: New Feature > Components: DStreams >Affects Versions: 1.6.0 >Reporter: Marcelo Vanzin >Priority: Major > > SPARK-11206 added infrastructure that would allow the streaming UI to be > shown in the History Server. We should add the necessary code to make that > happen, although it requires some changes to how events and listeners are > used. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23265) Update multi-column error handling logic in QuantileDiscretizer
[ https://issues.apache.org/jira/browse/SPARK-23265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath updated SPARK-23265: --- Description: SPARK-22397 added support for multiple columns to {{QuantileDiscretizer}}. If both single- and multi-column params are set (specifically {{inputCol}} / {{inputCols}}) an error is thrown. However, SPARK-22799 added more comprehensive error logic for {{Bucketizer}}. The logic for {{QuantileDiscretizer}} should be updated to match. *Note* that for this transformer, it is acceptable to set the single-column param for {{numBuckets}} when transforming multiple columns, since that is then applied to all columns. was: SPARK-22397 added support for multiple columns to {{QuantileDiscretizer}}. If both single- and multi-column params are set (specifically {{inputCol}} / {{inputCols}}) an error is thrown. However, SPARK-22799 added more comprehensive error logic for {{Bucketizer}}. The logic for {{QuantileDiscretizer}} should be updated to match. *Note* that for this transformer, it is acceptable to set the single-column param for {{numBuckets }}when transforming multiple columns, since that is then applied to all columns. > Update multi-column error handling logic in QuantileDiscretizer > --- > > Key: SPARK-23265 > URL: https://issues.apache.org/jira/browse/SPARK-23265 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: Nick Pentreath >Priority: Major > > SPARK-22397 added support for multiple columns to {{QuantileDiscretizer}}. If > both single- and multi-column params are set (specifically {{inputCol}} / > {{inputCols}}) an error is thrown. > However, SPARK-22799 added more comprehensive error logic for {{Bucketizer}}. > The logic for {{QuantileDiscretizer}} should be updated to match. *Note* that > for this transformer, it is acceptable to set the single-column param for > {{numBuckets}} when transforming multiple columns, since that is then > applied to all columns. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23217) Add cosine distance measure to ClusteringEvaluator
[ https://issues.apache.org/jira/browse/SPARK-23217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16366762#comment-16366762 ] Apache Spark commented on SPARK-23217: -- User 'mgaido91' has created a pull request for this issue: https://github.com/apache/spark/pull/20627 > Add cosine distance measure to ClusteringEvaluator > -- > > Key: SPARK-23217 > URL: https://issues.apache.org/jira/browse/SPARK-23217 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: Marco Gaido >Assignee: Marco Gaido >Priority: Major > Fix For: 2.4.0 > > Attachments: SPARK-23217.pdf > > > SPARK-22119 introduced the cosine distance measure for KMeans. Therefore it > would be useful to provide also an implementation of ClusteringEvaluator > using the cosine distance measure. > > Attached you can find a design document for it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
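A minimal end-to-end usage sketch of what this adds, assuming the {{distanceMeasure}} param lands on {{ClusteringEvaluator}} as proposed for the 2.4.0 fix version; the tiny dataset is synthetic placeholder data:

{code:java}
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.evaluation.ClusteringEvaluator
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Four 2-D points split across two directions.
val dataset = spark.createDataFrame(Seq(
  Tuple1(Vectors.dense(1.0, 0.1)), Tuple1(Vectors.dense(0.9, 0.2)),
  Tuple1(Vectors.dense(0.1, 1.0)), Tuple1(Vectors.dense(0.2, 0.9))
)).toDF("features")

// Cosine distance for KMeans training was added by SPARK-22119 ...
val model = new KMeans().setK(2).setSeed(1L).setDistanceMeasure("cosine").fit(dataset)

// ... and this issue adds the matching measure on the evaluator side.
val silhouette = new ClusteringEvaluator()
  .setDistanceMeasure("cosine")
  .evaluate(model.transform(dataset))
println(s"Silhouette (cosine) = $silhouette")
{code}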
[jira] [Commented] (SPARK-23437) [ML] Distributed Gaussian Process Regression for MLlib
[ https://issues.apache.org/jira/browse/SPARK-23437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16366744#comment-16366744 ] Nick Pentreath commented on SPARK-23437: It sounds interesting - however, the standard practice is that new algorithms should generally be released as a 3rd-party Spark package first. If they become widely used, there is a stronger argument for integration into MLlib. See [http://spark.apache.org/contributing.html] under the MLlib section for more details. > [ML] Distributed Gaussian Process Regression for MLlib > -- > > Key: SPARK-23437 > URL: https://issues.apache.org/jira/browse/SPARK-23437 > Project: Spark > Issue Type: New Feature > Components: ML, MLlib >Affects Versions: 2.2.1 >Reporter: Valeriy Avanesov >Priority: Major > > Gaussian Process Regression (GP) is a well-known black-box non-linear > regression approach [1]. For years the approach remained inapplicable to > large samples due to its cubic computational complexity; however, more recent > techniques (Sparse GP) reduce this to linear complexity. The field > continues to attract interest from researchers – several papers devoted to > GP were presented at NIPS 2017. > Unfortunately, the non-parametric regression techniques shipped with MLlib are > restricted to tree-based approaches. > I propose to create and include an implementation (which I am going to work > on) of the so-called robust Bayesian Committee Machine proposed and investigated > in [2]. > [1] Carl Edward Rasmussen and Christopher K. I. Williams. 2005. _Gaussian > Processes for Machine Learning (Adaptive Computation and Machine Learning)_. > The MIT Press. > [2] Marc Peter Deisenroth and Jun Wei Ng. 2015. Distributed Gaussian > processes. In _Proceedings of the 32nd International Conference on Machine > Learning - Volume 37_ (ICML'15), Francis Bach and David Blei (Eds.), Vol. 37. > JMLR.org, 1481-1490. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23447) Cleanup codegen template for Literal
[ https://issues.apache.org/jira/browse/SPARK-23447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23447: Assignee: (was: Apache Spark) > Cleanup codegen template for Literal > > > Key: SPARK-23447 > URL: https://issues.apache.org/jira/browse/SPARK-23447 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0, 2.2.1 >Reporter: Kris Mok >Priority: Major > > Ideally, the codegen templates for {{Literal}} should emit literals in the > {{isNull}} and {{value}} fields of {{ExprCode}} so that they can be > effectively inlined into their use sites. > But currently there are a couple of paths where {{Literal.doGenCode()}} > returns an {{ExprCode}} with a non-trivial {{code}} field, and all of those are > actually unnecessary. > We can make a simple refactoring to make sure all codegen templates for > {{Literal}} return empty {{code}} and simple literal/constant expressions in > {{isNull}} and {{value}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23447) Cleanup codegen template for Literal
[ https://issues.apache.org/jira/browse/SPARK-23447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16366688#comment-16366688 ] Apache Spark commented on SPARK-23447: -- User 'rednaxelafx' has created a pull request for this issue: https://github.com/apache/spark/pull/20626 > Cleanup codegen template for Literal > > > Key: SPARK-23447 > URL: https://issues.apache.org/jira/browse/SPARK-23447 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0, 2.2.1 >Reporter: Kris Mok >Priority: Major > > Ideally, the codegen templates for {{Literal}} should emit literals in the > {{isNull}} and {{value}} fields of {{ExprCode}} so that they can be > effectively inlined into their use sites. > But currently there are a couple of paths where {{Literal.doGenCode()}} > returns an {{ExprCode}} with a non-trivial {{code}} field, and all of those are > actually unnecessary. > We can make a simple refactoring to make sure all codegen templates for > {{Literal}} return empty {{code}} and simple literal/constant expressions in > {{isNull}} and {{value}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
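To make the shape of the proposed change concrete, here is a conceptual sketch only; it mirrors the {{ExprCode(code, isNull, value)}} triple with a stand-in case class and does not reproduce the actual Spark codegen classes.

{code:java}
// Stand-in for ExprCode(code, isNull, value); names are illustrative only.
case class DemoExprCode(code: String, isNull: String, value: String)

// Before (conceptually): emit a statement and refer to the generated variable.
def genIntLiteralBefore(v: Int): DemoExprCode =
  DemoExprCode(code = s"final int value_0 = $v;", isNull = "false", value = "value_0")

// After: no extra statement; the literal text itself is the value expression,
// so the use site can inline it directly.
def genIntLiteralAfter(v: Int): DemoExprCode =
  DemoExprCode(code = "", isNull = "false", value = v.toString)

println(genIntLiteralBefore(1)) // DemoExprCode(final int value_0 = 1;,false,value_0)
println(genIntLiteralAfter(1))  // DemoExprCode(,false,1)
{code}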