[jira] [Commented] (SPARK-29584) NOT NULL is not supported in Spark
[ https://issues.apache.org/jira/browse/SPARK-29584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958540#comment-16958540 ] pavithra ramachandran commented on SPARK-29584: --- I shall work on this > NOT NULL is not supported in Spark > -- > > Key: SPARK-29584 > URL: https://issues.apache.org/jira/browse/SPARK-29584 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > > Creating a table with a column restricted to non-NULL values is not supported in Spark, > as shown below. > PostgreSQL: SUCCESS, no exception > CREATE TABLE Persons (ID int *NOT NULL*, LastName varchar(255) *NOT > NULL*, FirstName varchar(255) NOT NULL, Age int); > insert into Persons values(1,'GUPTA','Abhi',NULL); > select * from persons; > > Spark: ParseException > jdbc:hive2://10.18.19.208:23040/default> CREATE TABLE Persons (ID int NOT > NULL, LastName varchar(255) NOT NULL, FirstName varchar(255) NOT NULL, Age > int); > Error: org.apache.spark.sql.catalyst.parser.ParseException: > no viable alternative at input 'CREATE TABLE Persons (ID int NOT' (line 1, pos > 29) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29566) Imputer should support single-column input/output
[ https://issues.apache.org/jira/browse/SPARK-29566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958539#comment-16958539 ] Huaxin Gao commented on SPARK-29566: I will work on this. Thanks! [~podongfeng] > Imputer should support single-column input/output > > > Key: SPARK-29566 > URL: https://issues.apache.org/jira/browse/SPARK-29566 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Minor > > Imputer should support single-column input/output > refer to https://issues.apache.org/jira/browse/SPARK-29565
[jira] [Commented] (SPARK-29565) OneHotEncoder should support single-column input/output
[ https://issues.apache.org/jira/browse/SPARK-29565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958537#comment-16958537 ] Huaxin Gao commented on SPARK-29565: I will work on this. Thanks for pinging me [~podongfeng] > OneHotEncoder should support single-column input/output > -- > > Key: SPARK-29565 > URL: https://issues.apache.org/jira/browse/SPARK-29565 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Minor > > The current feature algorithms > ({color:#5a6e5a}QuantileDiscretizer/Binarizer/Bucketizer/StringIndexer{color}) > are designed to support both single-column & multi-column usage, > and there are already internal utils (like > {color:#c7a65d}checkSingleVsMultiColumnParams{color}) for this. > For OneHotEncoder, it is reasonable to support single-column usage as well.
[jira] [Created] (SPARK-29584) NOT NULL is not supported in Spark
ABHISHEK KUMAR GUPTA created SPARK-29584: Summary: NOT NULL is not supported in Spark Key: SPARK-29584 URL: https://issues.apache.org/jira/browse/SPARK-29584 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: ABHISHEK KUMAR GUPTA Creating a table with a column restricted to non-NULL values is not supported in Spark, as shown below. PostgreSQL: SUCCESS, no exception CREATE TABLE Persons (ID int *NOT NULL*, LastName varchar(255) *NOT NULL*, FirstName varchar(255) NOT NULL, Age int); insert into Persons values(1,'GUPTA','Abhi',NULL); select * from persons; Spark: ParseException jdbc:hive2://10.18.19.208:23040/default> CREATE TABLE Persons (ID int NOT NULL, LastName varchar(255) NOT NULL, FirstName varchar(255) NOT NULL, Age int); Error: org.apache.spark.sql.catalyst.parser.ParseException: no viable alternative at input 'CREATE TABLE Persons (ID int NOT' (line 1, pos 29)
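For a point of comparison outside Spark, the NOT NULL semantics the report expects can be demonstrated with Python's built-in sqlite3 module, which, like PostgreSQL, accepts the constraint at CREATE TABLE time and rejects NULL inserts into constrained columns. This is an illustration of standard SQL behavior, not Spark code.

```python
import sqlite3

# In-memory database; sqlite3 ships with the Python standard library.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Same shape as the table in the report: three NOT NULL columns, one nullable.
cur.execute(
    "CREATE TABLE Persons ("
    "ID INTEGER NOT NULL, "
    "LastName VARCHAR(255) NOT NULL, "
    "FirstName VARCHAR(255) NOT NULL, "
    "Age INTEGER)"
)

# NULL in the nullable Age column succeeds, as in the PostgreSQL example.
cur.execute("INSERT INTO Persons VALUES (1, 'GUPTA', 'Abhi', NULL)")

# NULL in a NOT NULL column is rejected with an IntegrityError.
try:
    cur.execute("INSERT INTO Persons VALUES (NULL, 'X', 'Y', 2)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(rejected)  # the constrained insert was rejected
print(cur.execute("SELECT COUNT(*) FROM Persons").fetchone()[0])
```

Spark, by contrast, fails at parse time: its DDL grammar at the affected version does not accept the NOT NULL column constraint at all.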
[jira] [Updated] (SPARK-16483) Unifying struct fields and columns
[ https://issues.apache.org/jira/browse/SPARK-16483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-16483: - Labels: sql (was: bulk-closed sql) > Unifying struct fields and columns > -- > > Key: SPARK-16483 > URL: https://issues.apache.org/jira/browse/SPARK-16483 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.1 >Reporter: Simeon Simeonov >Priority: Major > Labels: sql > > This issue comes as a result of an exchange with Michael Armbrust outside of > the usual JIRA/dev list channels. > DataFrame provides a full set of manipulation operations for top-level > columns. They can be added, removed, modified and renamed. The same is not > yet true for fields inside structs; from a logical standpoint, Spark users > may very well want to perform the same operations on struct fields, > especially since automatic schema discovery from JSON input tends to create > deeply nested structs. > Common use-cases include: > - Remove and/or rename struct field(s) to adjust the schema > - Fix a data quality issue with a struct field (update/rewrite) > To do this with the existing API by hand requires manually calling > {{named_struct}} and listing all fields, including ones we don't want to > manipulate. This leads to complex, fragile code that cannot survive schema > evolution. > It would be far better if the various APIs that can now manipulate top-level > columns were extended to handle struct fields at arbitrary locations or, > alternatively, if we introduced new APIs for modifying any field in a > dataframe, whether it is a top-level one or one nested inside a struct. > Purely for discussion purposes (overloaded methods are not shown): > {code:java} > class Column(val expr: Expression) extends Logging { > // ... 
> // matches Dataset.schema semantics > def schema: StructType > // matches Dataset.select() semantics > // '* support allows multiple new fields to be added easily, saving > cumbersome repeated withColumn() calls > def select(cols: Column*): Column > // matches Dataset.withColumn() semantics of add or replace > def withColumn(colName: String, col: Column): Column > // matches Dataset.drop() semantics > def drop(colName: String): Column > } > class Dataset[T] ... { > // ... > // Equivalent to sparkSession.createDataset(toDF.rdd, newSchema) > def cast(newSchema: StructType): DataFrame > } > {code} > The benefit of the above API is that it unifies manipulating top-level & > nested columns. The addition of {{schema}} and {{select()}} to {{Column}} > allows for nested field reordering, casting, etc., which is important in data > exchange scenarios where field position matters. That's also the reason to > add {{cast}} to {{Dataset}}: it improves consistency and readability (with > method chaining). Another way to think of {{Dataset.cast}} is as the Spark > schema equivalent of {{Dataset.as}}. {{as}} is to {{cast}} as a Scala > encodable type is to a {{StructType}} instance.
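The pain point the description names, rebuilding an entire struct just to change one field, can be illustrated outside Spark with plain Python dicts standing in for rows with a nested struct. Both helper functions here are hypothetical, for illustration only; the second one mimics the targeted, name-based manipulation the proposed `Column.drop()`-style API would give.

```python
# A row with a nested struct, as schema inference from JSON often produces.
row = {"id": 1, "address": {"street": "Main St", "zip": "94105", "obsolete": "x"}}

# Hand-rolled equivalent of rebuilding a struct with named_struct: every field
# must be listed explicitly, including those we don't want to touch, so the
# code breaks as soon as the schema gains a field.
def drop_nested_field_manual(r):
    a = r["address"]
    return {"id": r["id"],
            "address": {"street": a["street"], "zip": a["zip"]}}

# A name-based drop targets one field and leaves the rest of the schema alone,
# so it survives schema evolution.
def drop_nested_field(r, col, field):
    out = dict(r)
    out[col] = {k: v for k, v in r[col].items() if k != field}
    return out

print(drop_nested_field_manual(row))
print(drop_nested_field(row, "address", "obsolete"))
```

Both produce the same result on this row, but only the second keeps working if `address` later grows a `country` field.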
[jira] [Commented] (SPARK-29583) extract support interval type
[ https://issues.apache.org/jira/browse/SPARK-29583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958479#comment-16958479 ] Yuming Wang commented on SPARK-29583: - cc [~maxgekk] > extract support interval type > - > > Key: SPARK-29583 > URL: https://issues.apache.org/jira/browse/SPARK-29583 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > {code:sql} > postgres=# select extract(minute from INTERVAL '1 YEAR 10 DAYS 50 MINUTES'); > date_part > --- > 50 > (1 row) > postgres=# select extract(minute from cast('2019-07-01 17:12:33.068' as > timestamp) - cast('2019-07-01 15:57:07.912' as timestamp)); > date_part > --- > 15 > (1 row) > {code}
[jira] [Updated] (SPARK-29583) extract support interval type
[ https://issues.apache.org/jira/browse/SPARK-29583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-29583: Description: {code:sql} postgres=# select extract(minute from INTERVAL '1 YEAR 10 DAYS 50 MINUTES'); date_part --- 50 (1 row) postgres=# select extract(minute from cast('2019-07-01 17:12:33.068' as timestamp) - cast('2019-07-01 15:57:07.912' as timestamp)); date_part --- 15 (1 row) {code} was: {code:sql} postgres=# select extract(minute from cast('2019-07-01 17:12:33.068' as timestamp) - cast('2019-07-01 15:57:07.912' as timestamp)); date_part --- 15 (1 row) {code} > extract support interval type > - > > Key: SPARK-29583 > URL: https://issues.apache.org/jira/browse/SPARK-29583 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > {code:sql} > postgres=# select extract(minute from INTERVAL '1 YEAR 10 DAYS 50 MINUTES'); > date_part > --- > 50 > (1 row) > postgres=# select extract(minute from cast('2019-07-01 17:12:33.068' as > timestamp) - cast('2019-07-01 15:57:07.912' as timestamp)); > date_part > --- > 15 > (1 row) > {code}
[jira] [Created] (SPARK-29583) extract support interval type
Yuming Wang created SPARK-29583: --- Summary: extract support interval type Key: SPARK-29583 URL: https://issues.apache.org/jira/browse/SPARK-29583 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Yuming Wang {code:sql} postgres=# select extract(minute from cast('2019-07-01 17:12:33.068' as timestamp) - cast('2019-07-01 15:57:07.912' as timestamp)); date_part --- 15 (1 row) {code}
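For reference, the PostgreSQL semantics shown above, where extract(minute ...) returns the minute component of the interval (0-59) rather than the total number of minutes, can be sketched with Python's datetime arithmetic. This is an illustration of the expected behavior, not Spark code.

```python
from datetime import datetime, timedelta

# Interval obtained by subtracting two timestamps, as in the second example.
t1 = datetime(2019, 7, 1, 17, 12, 33, 68000)
t2 = datetime(2019, 7, 1, 15, 57, 7, 912000)
interval = t1 - t2  # timedelta of 1:15:25.156000

def extract_minute(td: timedelta) -> int:
    """Minute field of the interval (0-59), matching PostgreSQL's
    extract(minute from interval); not the interval's total minutes."""
    return (td.seconds // 60) % 60

print(extract_minute(interval))  # 15, matching the date_part output above
```

Note that hours (and days, in `td.days`) are deliberately not folded into the result: an interval of 2 hours 50 minutes yields 50, not 170.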
[jira] [Created] (SPARK-29582) Unify the behavior of pyspark.TaskContext with spark core
Xianyang Liu created SPARK-29582: Summary: Unify the behavior of pyspark.TaskContext with spark core Key: SPARK-29582 URL: https://issues.apache.org/jira/browse/SPARK-29582 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.4.4 Reporter: Xianyang Liu In Spark core, the `TaskContext` object is a singleton. We set a task context instance, which can be a TaskContext or a BarrierTaskContext, before the task function starts, and unset it to None after the function ends, so both TaskContext and BarrierTaskContext can be retrieved through that object. In PySpark, however, we can only get the BarrierTaskContext via `BarrierTaskContext`; `TaskContext.get` returns `None` in a barrier stage. This patch unifies the behavior of TaskContext in PySpark with Spark core, which is useful when people switch from normal code to barrier code and should only need a small update.
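The set-before/unset-after singleton mechanism described here can be sketched in plain Python: a base class owns one shared slot, so a `get()` on either the base class or the barrier subclass sees whatever instance the runtime installed. The class and method names mirror the description, but this is a hypothetical illustration, not the PySpark implementation.

```python
class TaskContext:
    _current = None  # singleton slot shared by base class and subclass

    @classmethod
    def get(cls):
        # Always read the shared slot, regardless of which class is asked.
        return TaskContext._current

    @classmethod
    def _set(cls, ctx):
        TaskContext._current = ctx

    @classmethod
    def _unset(cls):
        TaskContext._current = None

class BarrierTaskContext(TaskContext):
    pass

# The runtime installs the proper context before the task function runs...
TaskContext._set(BarrierTaskContext())

# ...so both accessors return the same object, as in Spark core:
print(isinstance(TaskContext.get(), BarrierTaskContext))   # True
print(TaskContext.get() is BarrierTaskContext.get())       # True

# ...and unsets it after the task function ends.
TaskContext._unset()
print(TaskContext.get())  # None
```

The behavior reported as the bug corresponds to the subclass keeping its own slot instead of sharing the base class's, so `TaskContext.get()` misses a context installed as a `BarrierTaskContext`.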
[jira] [Commented] (SPARK-15348) Hive ACID
[ https://issues.apache.org/jira/browse/SPARK-15348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958471#comment-16958471 ] Zhaoyang Qin commented on SPARK-15348: -- [~asomani] [~georg.kf.hei...@gmail.com] Thank you very much for your advice. But I focused on this because I wanted to use Hive managed tables from Spark. My concern is that Spark SQL reads Hive's internal tables much faster than external tables, especially with large data. I compared the two and found that using an internal table was five times faster than using an external table on 1 TB of TPC-DS data. So I'm more interested in Spark's solution for reading managed tables. I'll keep going. > Hive ACID > - > > Key: SPARK-15348 > URL: https://issues.apache.org/jira/browse/SPARK-15348 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 1.6.3, 2.0.2, 2.1.2, 2.2.0, 2.3.0 >Reporter: Ran Haim >Priority: Major > > Spark does not support any feature of Hive's transactional tables; > you cannot use Spark to delete/update a table, and it also has problems > reading the aggregated data when no compaction was done. > Also it seems that compaction is not supported - alter table ... partition > COMPACT 'major'
[jira] [Resolved] (SPARK-29576) Use Spark's CompressionCodec for Ser/Deser of MapOutputStatus
[ https://issues.apache.org/jira/browse/SPARK-29576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-29576. --- Resolution: Fixed Issue resolved by pull request 26235 [https://github.com/apache/spark/pull/26235] > Use Spark's CompressionCodec for Ser/Deser of MapOutputStatus > - > > Key: SPARK-29576 > URL: https://issues.apache.org/jira/browse/SPARK-29576 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.4.4 >Reporter: DB Tsai >Assignee: DB Tsai >Priority: Major > Fix For: 3.0.0 > > > Instead of using the ZStd codec directly, we use Spark's CompressionCodec, which > wraps the ZStd codec in a buffered stream to avoid the excessive overhead of > JNI calls when compressing small amounts of data. > Also, by using Spark's CompressionCodec, we can easily make it > configurable in the future if needed.
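The rationale above, that feeding many small payloads through one buffered compression stream beats invoking the codec once per payload, can be illustrated with gzip from the Python standard library (standing in for the Zstd JNI codec; the choice of gzip and the payload shape are assumptions for illustration):

```python
import gzip
import io

chunks = [f"status-{i}".encode() for i in range(1000)]  # many small payloads

# One codec call per chunk: every tiny output pays the full per-call
# header and setup overhead, so the "compressed" total is bloated.
per_call = sum(len(gzip.compress(c)) for c in chunks)

# One stream wrapping the codec: overhead is paid once and the compressor
# sees enough data across chunk boundaries to actually compress.
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as stream:
    for c in chunks:
        stream.write(c)
streamed = len(buf.getvalue())

print(streamed < per_call)  # True: the single wrapped stream is far smaller
```

The same argument applies to call overhead, not just output size: the buffered stream collapses a thousand codec invocations (JNI crossings, in the Spark case) into a handful.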
[jira] [Assigned] (SPARK-29576) Use Spark's CompressionCodec for Ser/Deser of MapOutputStatus
[ https://issues.apache.org/jira/browse/SPARK-29576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-29576: - Assignee: DB Tsai > Use Spark's CompressionCodec for Ser/Deser of MapOutputStatus > - > > Key: SPARK-29576 > URL: https://issues.apache.org/jira/browse/SPARK-29576 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.4.4 >Reporter: DB Tsai >Assignee: DB Tsai >Priority: Major > Fix For: 3.0.0 > > > Instead of using the ZStd codec directly, we use Spark's CompressionCodec, which > wraps the ZStd codec in a buffered stream to avoid the excessive overhead of > JNI calls when compressing small amounts of data. > Also, by using Spark's CompressionCodec, we can easily make it > configurable in the future if needed.
[jira] [Commented] (SPARK-28594) Allow event logs for running streaming apps to be rolled over.
[ https://issues.apache.org/jira/browse/SPARK-28594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958412#comment-16958412 ] Jungtaek Lim commented on SPARK-28594: -- Please note that SPARK-29579 and SPARK-29581 could be moved out of SPARK-28594, since the reason for splitting these issues out of the existing one is that we couldn't find a good way to do it. Things can change if we get some brilliant idea before finishing SPARK-28870, but if not, I'd rather set SPARK-28870 as the finish line for this and move both issues out. > Allow event logs for running streaming apps to be rolled over. > -- > > Key: SPARK-28594 > URL: https://issues.apache.org/jira/browse/SPARK-28594 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 > Environment: This has been reported on 2.0.2.22 but affects all > currently available versions. >Reporter: Stephen Levett >Priority: Major > > In all current Spark releases, when event logging is enabled for Spark Streaming, > the event logs grow massively. The files continue to grow until the > application is stopped or killed. > The Spark history server then has difficulty processing the files. > https://issues.apache.org/jira/browse/SPARK-8617 > addresses .inprogress files but not event log files of applications that are still running. > Identify a mechanism to set a "max file" size so that the file is rolled over > when it reaches this size.
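The "max file size with rollover" mechanism this issue asks for is the same pattern Python's standard library implements in logging.handlers.RotatingFileHandler: once the current file reaches maxBytes, it is renamed to a numbered backup and a fresh file is started. As an illustration of the requested behavior (not Spark code; the file names and sizes are arbitrary):

```python
import logging
import logging.handlers
import os
import tempfile

log_dir = tempfile.mkdtemp()
path = os.path.join(log_dir, "events.log")

# Roll the file over whenever it reaches ~1 KB, keeping up to 3 old segments.
handler = logging.handlers.RotatingFileHandler(path, maxBytes=1024, backupCount=3)
logger = logging.getLogger("eventlog-demo")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# A long-running writer (like a streaming app's event log) no longer grows
# a single file without bound; old segments can be deleted or compacted.
for i in range(200):
    logger.info("event %05d payload", i)

files = sorted(os.listdir(log_dir))
print(files)  # events.log plus rolled-over segments events.log.1, .2, ...
```

For the history server the analogous win is that each rolled segment is a bounded, finished file it can process incrementally, instead of one ever-growing .inprogress file.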
[jira] [Commented] (SPARK-29569) doc build fails with `/api/scala/lib/jquery.js` doesn't exist
[ https://issues.apache.org/jira/browse/SPARK-29569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958410#comment-16958410 ] Dongjoon Hyun commented on SPARK-29569: --- Oh.. Too bad. Got it. Thank you for the update, [~jiangxb1987]. > doc build fails with `/api/scala/lib/jquery.js` doesn't exist > - > > Key: SPARK-29569 > URL: https://issues.apache.org/jira/browse/SPARK-29569 > Project: Spark > Issue Type: Bug > Components: Build, docs >Affects Versions: 3.0.0 >Reporter: Xingbo Jiang >Assignee: Hyukjin Kwon >Priority: Blocker > Fix For: 3.0.0 > > Attachments: Screen Shot 2019-10-23 at 8.25.01 PM.png > > > Running `jekyll build` under `./spark/docs` fails with the following > error message: > {code} > Making directory api/scala > cp -r ../target/scala-2.12/unidoc/. api/scala > Making directory api/java > cp -r ../target/javaunidoc/. api/java > Updating JavaDoc files for badge post-processing > Copying jquery.js from Scala API to Java API for page post-processing of > badges > jekyll 3.8.6 | Error: No such file or directory @ rb_sysopen - > ./api/scala/lib/jquery.js > {code} > This error only happens on the master branch; the command works on branch-2.4
[jira] [Assigned] (SPARK-29567) Update JDBC Integration Test Docker Images
[ https://issues.apache.org/jira/browse/SPARK-29567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-29567: - Assignee: Dongjoon Hyun > Update JDBC Integration Test Docker Images > -- > > Key: SPARK-29567 > URL: https://issues.apache.org/jira/browse/SPARK-29567 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor >
[jira] [Resolved] (SPARK-29567) Update JDBC Integration Test Docker Images
[ https://issues.apache.org/jira/browse/SPARK-29567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-29567. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26224 [https://github.com/apache/spark/pull/26224] > Update JDBC Integration Test Docker Images > -- > > Key: SPARK-29567 > URL: https://issues.apache.org/jira/browse/SPARK-29567 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 3.0.0 > >
[jira] [Created] (SPARK-29581) Enable cleanup old event log files
Jungtaek Lim created SPARK-29581: Summary: Enable cleanup old event log files Key: SPARK-29581 URL: https://issues.apache.org/jira/browse/SPARK-29581 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.0.0 Reporter: Jungtaek Lim This issue can be started only once SPARK-29579 is addressed properly. After SPARK-29579, Spark would guarantee strong compatibility for both live entities and snapshots, which means a snapshot file could replace the older original event log files. This issue tracks the effort to automatically clean up old event logs once a snapshot file can replace them, which keeps the overall size of event logs for a streaming query manageable.
[jira] [Updated] (SPARK-29580) KafkaDelegationTokenSuite fails to create new KafkaAdminClient
[ https://issues.apache.org/jira/browse/SPARK-29580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29580: -- Issue Type: Bug (was: Improvement) > KafkaDelegationTokenSuite fails to create new KafkaAdminClient > -- > > Key: SPARK-29580 > URL: https://issues.apache.org/jira/browse/SPARK-29580 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > > - > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/112562/testReport/org.apache.spark.sql.kafka010/KafkaDelegationTokenSuite/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/ > {code} > sbt.ForkMain$ForkError: org.apache.kafka.common.KafkaException: Failed to > create new KafkaAdminClient > at > org.apache.kafka.clients.admin.KafkaAdminClient.createInternal(KafkaAdminClient.java:407) > at > org.apache.kafka.clients.admin.AdminClient.create(AdminClient.java:55) > at > org.apache.spark.sql.kafka010.KafkaTestUtils.setupEmbeddedKafkaServer(KafkaTestUtils.scala:227) > at > org.apache.spark.sql.kafka010.KafkaTestUtils.setup(KafkaTestUtils.scala:249) > at > org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite.beforeAll(KafkaDelegationTokenSuite.scala:49) > at > org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:212) > at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210) > at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208) > at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:56) > at > org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:317) > at > org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:510) > at sbt.ForkMain$Run$2.call(ForkMain.java:296) > at sbt.ForkMain$Run$2.call(ForkMain.java:286) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: sbt.ForkMain$ForkError: org.apache.kafka.common.KafkaException: > javax.security.auth.login.LoginException: Server not found in Kerberos > database (7) - Server not found in Kerberos database > at > org.apache.kafka.common.network.SaslChannelBuilder.configure(SaslChannelBuilder.java:160) > at > org.apache.kafka.common.network.ChannelBuilders.create(ChannelBuilders.java:146) > at > org.apache.kafka.common.network.ChannelBuilders.clientChannelBuilder(ChannelBuilders.java:67) > at > org.apache.kafka.clients.ClientUtils.createChannelBuilder(ClientUtils.java:99) > at > org.apache.kafka.clients.admin.KafkaAdminClient.createInternal(KafkaAdminClient.java:382) > ... 16 more > Caused by: sbt.ForkMain$ForkError: javax.security.auth.login.LoginException: > Server not found in Kerberos database (7) - Server not found in Kerberos > database > at > com.sun.security.auth.module.Krb5LoginModule.attemptAuthentication(Krb5LoginModule.java:804) > at > com.sun.security.auth.module.Krb5LoginModule.login(Krb5LoginModule.java:617) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at javax.security.auth.login.LoginContext.invoke(LoginContext.java:755) > at > javax.security.auth.login.LoginContext.access$000(LoginContext.java:195) > at javax.security.auth.login.LoginContext$4.run(LoginContext.java:682) > at javax.security.auth.login.LoginContext$4.run(LoginContext.java:680) > at java.security.AccessController.doPrivileged(Native Method) > at > javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:680) > at javax.security.auth.login.LoginContext.login(LoginContext.java:587) > at > 
org.apache.kafka.common.security.authenticator.AbstractLogin.login(AbstractLogin.java:60) > at > org.apache.kafka.common.security.kerberos.KerberosLogin.login(KerberosLogin.java:103) > at > org.apache.kafka.common.security.authenticator.LoginManager.(LoginManager.java:61) > at > org.apache.kafka.common.security.authenticator.LoginManager.acquireLoginManager(LoginManager.java:104) > at > org.apache.kafka.common.network.SaslChannelBuilder.configure(SaslChannelBuilder.java:149) > ... 20 more > Caused by: sbt.ForkMain$ForkError: sun.security.krb5.KrbException:
[jira] [Commented] (SPARK-29580) KafkaDelegationTokenSuite fails to create new KafkaAdminClient
[ https://issues.apache.org/jira/browse/SPARK-29580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958374#comment-16958374 ] Dongjoon Hyun commented on SPARK-29580: --- Hi, [~gsomogyi]. Could you take a look at this failure? > KafkaDelegationTokenSuite fails to create new KafkaAdminClient > -- > > Key: SPARK-29580 > URL: https://issues.apache.org/jira/browse/SPARK-29580 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > > - > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/112562/testReport/org.apache.spark.sql.kafka010/KafkaDelegationTokenSuite/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/ > {code} > sbt.ForkMain$ForkError: org.apache.kafka.common.KafkaException: Failed to > create new KafkaAdminClient > at > org.apache.kafka.clients.admin.KafkaAdminClient.createInternal(KafkaAdminClient.java:407) > at > org.apache.kafka.clients.admin.AdminClient.create(AdminClient.java:55) > at > org.apache.spark.sql.kafka010.KafkaTestUtils.setupEmbeddedKafkaServer(KafkaTestUtils.scala:227) > at > org.apache.spark.sql.kafka010.KafkaTestUtils.setup(KafkaTestUtils.scala:249) > at > org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite.beforeAll(KafkaDelegationTokenSuite.scala:49) > at > org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:212) > at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210) > at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208) > at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:56) > at > org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:317) > at > org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:510) > at sbt.ForkMain$Run$2.call(ForkMain.java:296) > at sbt.ForkMain$Run$2.call(ForkMain.java:286) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: sbt.ForkMain$ForkError: org.apache.kafka.common.KafkaException: > javax.security.auth.login.LoginException: Server not found in Kerberos > database (7) - Server not found in Kerberos database > at > org.apache.kafka.common.network.SaslChannelBuilder.configure(SaslChannelBuilder.java:160) > at > org.apache.kafka.common.network.ChannelBuilders.create(ChannelBuilders.java:146) > at > org.apache.kafka.common.network.ChannelBuilders.clientChannelBuilder(ChannelBuilders.java:67) > at > org.apache.kafka.clients.ClientUtils.createChannelBuilder(ClientUtils.java:99) > at > org.apache.kafka.clients.admin.KafkaAdminClient.createInternal(KafkaAdminClient.java:382) > ... 16 more > Caused by: sbt.ForkMain$ForkError: javax.security.auth.login.LoginException: > Server not found in Kerberos database (7) - Server not found in Kerberos > database > at > com.sun.security.auth.module.Krb5LoginModule.attemptAuthentication(Krb5LoginModule.java:804) > at > com.sun.security.auth.module.Krb5LoginModule.login(Krb5LoginModule.java:617) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at javax.security.auth.login.LoginContext.invoke(LoginContext.java:755) > at > javax.security.auth.login.LoginContext.access$000(LoginContext.java:195) > at javax.security.auth.login.LoginContext$4.run(LoginContext.java:682) > at javax.security.auth.login.LoginContext$4.run(LoginContext.java:680) > at java.security.AccessController.doPrivileged(Native Method) > at > javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:680) > at 
javax.security.auth.login.LoginContext.login(LoginContext.java:587) > at > org.apache.kafka.common.security.authenticator.AbstractLogin.login(AbstractLogin.java:60) > at > org.apache.kafka.common.security.kerberos.KerberosLogin.login(KerberosLogin.java:103) > at > org.apache.kafka.common.security.authenticator.LoginManager.(LoginManager.java:61) > at > org.apache.kafka.common.security.authenticator.LoginManager.acquireLoginManager(LoginManager.java:104) > at > org.apache.kafka.common.network.SaslChannelBuilder.configure(SaslChannelBuilder.java:149) > ... 20 more > Caused by:
[jira] [Commented] (SPARK-29580) KafkaDelegationTokenSuite fails to create new KafkaAdminClient
[ https://issues.apache.org/jira/browse/SPARK-29580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958372#comment-16958372 ] Dongjoon Hyun commented on SPARK-29580: --- This is a different failure from SPARK-29027. > KafkaDelegationTokenSuite fails to create new KafkaAdminClient > -- > > Key: SPARK-29580 > URL: https://issues.apache.org/jira/browse/SPARK-29580 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > > - > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/112562/testReport/org.apache.spark.sql.kafka010/KafkaDelegationTokenSuite/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/ > {code} > sbt.ForkMain$ForkError: org.apache.kafka.common.KafkaException: Failed to > create new KafkaAdminClient > at > org.apache.kafka.clients.admin.KafkaAdminClient.createInternal(KafkaAdminClient.java:407) > at > org.apache.kafka.clients.admin.AdminClient.create(AdminClient.java:55) > at > org.apache.spark.sql.kafka010.KafkaTestUtils.setupEmbeddedKafkaServer(KafkaTestUtils.scala:227) > at > org.apache.spark.sql.kafka010.KafkaTestUtils.setup(KafkaTestUtils.scala:249) > at > org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite.beforeAll(KafkaDelegationTokenSuite.scala:49) > at > org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:212) > at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210) > at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208) > at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:56) > at > org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:317) > at > org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:510) > at sbt.ForkMain$Run$2.call(ForkMain.java:296) > at sbt.ForkMain$Run$2.call(ForkMain.java:286) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: sbt.ForkMain$ForkError: org.apache.kafka.common.KafkaException: > javax.security.auth.login.LoginException: Server not found in Kerberos > database (7) - Server not found in Kerberos database > at > org.apache.kafka.common.network.SaslChannelBuilder.configure(SaslChannelBuilder.java:160) > at > org.apache.kafka.common.network.ChannelBuilders.create(ChannelBuilders.java:146) > at > org.apache.kafka.common.network.ChannelBuilders.clientChannelBuilder(ChannelBuilders.java:67) > at > org.apache.kafka.clients.ClientUtils.createChannelBuilder(ClientUtils.java:99) > at > org.apache.kafka.clients.admin.KafkaAdminClient.createInternal(KafkaAdminClient.java:382) > ... 16 more > Caused by: sbt.ForkMain$ForkError: javax.security.auth.login.LoginException: > Server not found in Kerberos database (7) - Server not found in Kerberos > database > at > com.sun.security.auth.module.Krb5LoginModule.attemptAuthentication(Krb5LoginModule.java:804) > at > com.sun.security.auth.module.Krb5LoginModule.login(Krb5LoginModule.java:617) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at javax.security.auth.login.LoginContext.invoke(LoginContext.java:755) > at > javax.security.auth.login.LoginContext.access$000(LoginContext.java:195) > at javax.security.auth.login.LoginContext$4.run(LoginContext.java:682) > at javax.security.auth.login.LoginContext$4.run(LoginContext.java:680) > at java.security.AccessController.doPrivileged(Native Method) > at > javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:680) > at 
javax.security.auth.login.LoginContext.login(LoginContext.java:587) > at > org.apache.kafka.common.security.authenticator.AbstractLogin.login(AbstractLogin.java:60) > at > org.apache.kafka.common.security.kerberos.KerberosLogin.login(KerberosLogin.java:103) > at > org.apache.kafka.common.security.authenticator.LoginManager.<init>(LoginManager.java:61) > at > org.apache.kafka.common.security.authenticator.LoginManager.acquireLoginManager(LoginManager.java:104) > at > org.apache.kafka.common.network.SaslChannelBuilder.configure(SaslChannelBuilder.java:149) > ... 20 more > Caused by:
[jira] [Created] (SPARK-29580) KafkaDelegationTokenSuite fails to create new KafkaAdminClient
Dongjoon Hyun created SPARK-29580: - Summary: KafkaDelegationTokenSuite fails to create new KafkaAdminClient Key: SPARK-29580 URL: https://issues.apache.org/jira/browse/SPARK-29580 Project: Spark Issue Type: Improvement Components: Tests Affects Versions: 3.0.0 Reporter: Dongjoon Hyun - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/112562/testReport/org.apache.spark.sql.kafka010/KafkaDelegationTokenSuite/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/ {code} sbt.ForkMain$ForkError: org.apache.kafka.common.KafkaException: Failed to create new KafkaAdminClient at org.apache.kafka.clients.admin.KafkaAdminClient.createInternal(KafkaAdminClient.java:407) at org.apache.kafka.clients.admin.AdminClient.create(AdminClient.java:55) at org.apache.spark.sql.kafka010.KafkaTestUtils.setupEmbeddedKafkaServer(KafkaTestUtils.scala:227) at org.apache.spark.sql.kafka010.KafkaTestUtils.setup(KafkaTestUtils.scala:249) at org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite.beforeAll(KafkaDelegationTokenSuite.scala:49) at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:212) at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210) at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208) at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:56) at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:317) at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:510) at sbt.ForkMain$Run$2.call(ForkMain.java:296) at sbt.ForkMain$Run$2.call(ForkMain.java:286) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: sbt.ForkMain$ForkError: org.apache.kafka.common.KafkaException: javax.security.auth.login.LoginException: Server not found in Kerberos 
database (7) - Server not found in Kerberos database at org.apache.kafka.common.network.SaslChannelBuilder.configure(SaslChannelBuilder.java:160) at org.apache.kafka.common.network.ChannelBuilders.create(ChannelBuilders.java:146) at org.apache.kafka.common.network.ChannelBuilders.clientChannelBuilder(ChannelBuilders.java:67) at org.apache.kafka.clients.ClientUtils.createChannelBuilder(ClientUtils.java:99) at org.apache.kafka.clients.admin.KafkaAdminClient.createInternal(KafkaAdminClient.java:382) ... 16 more Caused by: sbt.ForkMain$ForkError: javax.security.auth.login.LoginException: Server not found in Kerberos database (7) - Server not found in Kerberos database at com.sun.security.auth.module.Krb5LoginModule.attemptAuthentication(Krb5LoginModule.java:804) at com.sun.security.auth.module.Krb5LoginModule.login(Krb5LoginModule.java:617) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at javax.security.auth.login.LoginContext.invoke(LoginContext.java:755) at javax.security.auth.login.LoginContext.access$000(LoginContext.java:195) at javax.security.auth.login.LoginContext$4.run(LoginContext.java:682) at javax.security.auth.login.LoginContext$4.run(LoginContext.java:680) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:680) at javax.security.auth.login.LoginContext.login(LoginContext.java:587) at org.apache.kafka.common.security.authenticator.AbstractLogin.login(AbstractLogin.java:60) at org.apache.kafka.common.security.kerberos.KerberosLogin.login(KerberosLogin.java:103) at org.apache.kafka.common.security.authenticator.LoginManager.(LoginManager.java:61) at 
org.apache.kafka.common.security.authenticator.LoginManager.acquireLoginManager(LoginManager.java:104) at org.apache.kafka.common.network.SaslChannelBuilder.configure(SaslChannelBuilder.java:149) ... 20 more Caused by: sbt.ForkMain$ForkError: sun.security.krb5.KrbException: Server not found in Kerberos database (7) - Server not found in Kerberos database at sun.security.krb5.KrbAsRep.<init>(KrbAsRep.java:82) at sun.security.krb5.KrbAsReqBuilder.send(KrbAsReqBuilder.java:316) at sun.security.krb5.KrbAsReqBuilder.action(KrbAsReqBuilder.java:361) at
[jira] [Created] (SPARK-29579) Guarantee compatibility of snapshot (live entities, KVstore entities)
Jungtaek Lim created SPARK-29579: Summary: Guarantee compatibility of snapshot (live entities, KVstore entities) Key: SPARK-29579 URL: https://issues.apache.org/jira/browse/SPARK-29579 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.0.0 Reporter: Jungtaek Lim This is a follow-up to SPARK-29111 and SPARK-29261, neither of which guarantees compatibility. To safely clean up old event log files after a snapshot has been written for them, we have to ensure the snapshot file can restore the same state as replaying those event log files would. The issue arises when migrating to a newer Spark version - if the snapshot is not readable due to incompatibility, the app cannot be read at all, since the old event log files have already been removed. If we can guarantee compatibility, we can move on to the next item: cleaning up old event log files to save space. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
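The failure mode described above (a snapshot that turns out to be unreadable after the old event logs are already gone) suggests gating restore on an explicit format version stamp. A minimal sketch of that idea follows; all names are hypothetical, not Spark's actual API:

```python
# Hypothetical sketch: gate snapshot restore on a version stamp written
# alongside the snapshot, so an incompatible snapshot is detected up
# front instead of silently producing bad state. Names are illustrative.
SUPPORTED_SNAPSHOT_VERSIONS = {1, 2}

def can_restore(snapshot_meta: dict) -> bool:
    """True only when the snapshot's format version is one this reader
    knows how to restore."""
    return snapshot_meta.get("version") in SUPPORTED_SNAPSHOT_VERSIONS

def restore_or_replay(snapshot_meta: dict) -> str:
    # Fall back to replaying raw event logs when the snapshot is not
    # compatible -- which is only safe if the old logs were not deleted.
    return "restore" if can_restore(snapshot_meta) else "replay"
```

The point of the gate is that the "replay" fallback only exists while the raw event logs still exist, which is exactly the trade-off the issue describes.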
[jira] [Updated] (SPARK-29261) Support recover live entities from KVStore for (SQL)AppStatusListener
[ https://issues.apache.org/jira/browse/SPARK-29261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim updated SPARK-29261: - Description: To achieve incremental reply goal in SHS, we need to support recover live entities from KVStore for both SQLAppStatusListener and AppStatusListener. Note that we don't guarantee any compatibility of live entities here - we will file another issue to deal with it altogether. was:To achieve incremental reply goal in SHS, we need to support recover live entities from KVStore for both SQLAppStatusListener and AppStatusListener. > Support recover live entities from KVStore for (SQL)AppStatusListener > - > > Key: SPARK-29261 > URL: https://issues.apache.org/jira/browse/SPARK-29261 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, SQL >Affects Versions: 3.0.0 >Reporter: wuyi >Priority: Major > > To achieve incremental reply goal in SHS, we need to support recover live > entities from KVStore for both SQLAppStatusListener and AppStatusListener. > Note that we don't guarantee any compatibility of live entities here - we > will file another issue to deal with it altogether. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29111) Support snapshot/restore of KVStore
[ https://issues.apache.org/jira/browse/SPARK-29111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim updated SPARK-29111: - Description: This issue tracks the effort of supporting snapshot/restore from/to KVStore. Note that this issue will not touch current behavior - following issue will leverage the output of this issue. This is to reduce the size of each PR. This will not be guaranteeing any compatibility on snapshot - it means this issue must have an approach to determine whether the snapshot is compatible with current version of SHS. was: This issue tracks the effort of supporting snapshot/restore from/to KVStore. Note that this issue will not touch current behavior - following issue will leverage the output of this issue. This is to reduce the size of each PR. > Support snapshot/restore of KVStore > --- > > Key: SPARK-29111 > URL: https://issues.apache.org/jira/browse/SPARK-29111 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Priority: Major > > This issue tracks the effort of supporting snapshot/restore from/to KVStore. > Note that this issue will not touch current behavior - following issue will > leverage the output of this issue. This is to reduce the size of each PR. > This will not be guaranteeing any compatibility on snapshot - it means this > issue must have an approach to determine whether the snapshot is compatible > with current version of SHS. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28870) Snapshot event log files to support incremental reading
[ https://issues.apache.org/jira/browse/SPARK-28870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim updated SPARK-28870: - Description: This issue tracks the effort on compacting event log files into snapshot and enable incremental reading to speed up replaying event logs. This issue will be on top of SPARK-28869 as SPARK-28869 will create rolled event log files. This issue will be also on top of SPARK-29111 and SPARK-29261, as SPARK-29111 will add the ability to snapshot/restore from/to KVStore and SPARK-29261 will add the ability to snapshot/restore of state of (SQL)AppStatusListeners. was: This issue tracks the effort on compacting event log files into snapshot and enable incremental reading to speed up replaying event logs. This issue will be on top of SPARK-28869 as SPARK-28869 will create rolled event log files. This issue will be also on top of SPARK-29111 as SPARK-29111 will add the ability to snapshot/restore from/to KVStore. > Snapshot event log files to support incremental reading > --- > > Key: SPARK-28870 > URL: https://issues.apache.org/jira/browse/SPARK-28870 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Priority: Major > > This issue tracks the effort on compacting event log files into snapshot and > enable incremental reading to speed up replaying event logs. > This issue will be on top of SPARK-28869 as SPARK-28869 will create rolled > event log files. This issue will be also on top of SPARK-29111 and > SPARK-29261, as SPARK-29111 will add the ability to snapshot/restore from/to > KVStore and SPARK-29261 will add the ability to snapshot/restore of state of > (SQL)AppStatusListeners. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28870) Snapshot event log files to support incremental reading
[ https://issues.apache.org/jira/browse/SPARK-28870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim updated SPARK-28870: - Description: This issue tracks the effort on compacting event log files into snapshot and enable incremental reading to speed up replaying event logs. This issue will be on top of SPARK-28869 as SPARK-28869 will create rolled event log files. This issue will be also on top of SPARK-29111 as SPARK-29111 will add the ability to snapshot/restore from/to KVStore. was: This issue tracks the effort on compacting old event log files into snapshot and achieve both goals, 1) reduce overall event log file size 2) speed up replaying event logs. It also deals with cleaning event log files as snapshot will provide the safe way to clean up old event log files without losing ability to replay whole event logs. This issue will be on top of SPARK-28869 as SPARK-28869 will create rolled event log files. This issue will be also on top of SPARK-29111 as SPARK-29111 will add the ability to snapshot/restore from/to KVStore. > Snapshot event log files to support incremental reading > --- > > Key: SPARK-28870 > URL: https://issues.apache.org/jira/browse/SPARK-28870 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Priority: Major > > This issue tracks the effort on compacting event log files into snapshot and > enable incremental reading to speed up replaying event logs. > This issue will be on top of SPARK-28869 as SPARK-28869 will create rolled > event log files. This issue will be also on top of SPARK-29111 as SPARK-29111 > will add the ability to snapshot/restore from/to KVStore. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28870) Snapshot old event log files to support compaction
[ https://issues.apache.org/jira/browse/SPARK-28870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958300#comment-16958300 ] Jungtaek Lim commented on SPARK-28870: -- Discussed with Marcelo/Imran offline: I'm changing the goal of the snapshot here to only allow incremental reading, as "how to guarantee compatibility of live entities/snapshots" has been a blocker for a long time and we haven't figured out a good solution yet. The goal of opening up the chance to clean up old event log files will be split out into another issue, with explicit requirements added - we should guarantee strong compatibility of both live entities and snapshots to achieve that functionality. > Snapshot old event log files to support compaction > -- > > Key: SPARK-28870 > URL: https://issues.apache.org/jira/browse/SPARK-28870 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Priority: Major > > This issue tracks the effort on compacting old event log files into snapshot > and achieve both goals, 1) reduce overall event log file size 2) speed up > replaying event logs. It also deals with cleaning event log files as snapshot > will provide the safe way to clean up old event log files without losing > ability to replay whole event logs. > This issue will be on top of SPARK-28869 as SPARK-28869 will create rolled > event log files. This issue will be also on top of SPARK-29111 as SPARK-29111 > will add the ability to snapshot/restore from/to KVStore.
[jira] [Updated] (SPARK-28870) Snapshot event log files to support incremental reading
[ https://issues.apache.org/jira/browse/SPARK-28870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim updated SPARK-28870: - Summary: Snapshot event log files to support incremental reading (was: Snapshot old event log files to support compaction) > Snapshot event log files to support incremental reading > --- > > Key: SPARK-28870 > URL: https://issues.apache.org/jira/browse/SPARK-28870 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Priority: Major > > This issue tracks the effort on compacting old event log files into snapshot > and achieve both goals, 1) reduce overall event log file size 2) speed up > replaying event logs. It also deals with cleaning event log files as snapshot > will provide the safe way to clean up old event log files without losing > ability to replay whole event logs. > This issue will be on top of SPARK-28869 as SPARK-28869 will create rolled > event log files. This issue will be also on top of SPARK-29111 as SPARK-29111 > will add the ability to snapshot/restore from/to KVStore. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29578) JDK 1.8.0_232 timezone updates cause "Kwajalein" test failures again
Sean R. Owen created SPARK-29578: Summary: JDK 1.8.0_232 timezone updates cause "Kwajalein" test failures again Key: SPARK-29578 URL: https://issues.apache.org/jira/browse/SPARK-29578 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 2.4.4, 3.0.0 Reporter: Sean R. Owen Assignee: Sean R. Owen I have a report that tests fail in JDK 1.8.0_232 because of timezone changes in (I believe) tzdata2018i or later, per https://www.oracle.com/technetwork/java/javase/tzdata-versions-138805.html: {{*** FAILED *** with 8634 did not equal 8633 Round trip of 8633 did not work in tz}} See also https://issues.apache.org/jira/browse/SPARK-24950 I say "I have a report" because I can't get this version easily on my Mac. However, the fix is so inconsequential that I think we can just make it, allowing this additional variation just as before.
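The kind of tolerance the fix adds can be sketched generically: a round-trip check that accepts either the pre- or post-tzdata-update offset instead of pinning a single value. The numbers below come from the error message quoted above; the code is illustrative and not Spark's actual test.

```python
# Sketch: tzdata updates can shift a historical zone offset by a second
# (8633 vs 8634 in the report), so a robust round-trip test accepts any
# known variant rather than hard-coding one. Purely illustrative.
ACCEPTED_KWAJALEIN_OFFSETS = {8633, 8634}

def round_trip_ok(offset_seconds: int) -> bool:
    """True when the observed offset matches any tzdata variant we
    know about."""
    return offset_seconds in ACCEPTED_KWAJALEIN_OFFSETS
```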
[jira] [Created] (SPARK-29577) Implement p-value simulation and unit tests for chi2 test
Alexander Tronchin-James created SPARK-29577: Summary: Implement p-value simulation and unit tests for chi2 test Key: SPARK-29577 URL: https://issues.apache.org/jira/browse/SPARK-29577 Project: Spark Issue Type: Improvement Components: ML, MLlib Affects Versions: 2.4.5, 3.0.0 Reporter: Alexander Tronchin-James Spark MLlib's chi-squared test does not yet include p-value simulation for the goodness-of-fit test, and implementing a robust/scalable approach was non-trivial for us, so we wanted to give this work back to the community for others to use. https://github.com/apache/spark/pull/26197
[jira] [Created] (SPARK-29576) Use Spark's CompressionCodec for Ser/Deser of MapOutputStatus
DB Tsai created SPARK-29576: --- Summary: Use Spark's CompressionCodec for Ser/Deser of MapOutputStatus Key: SPARK-29576 URL: https://issues.apache.org/jira/browse/SPARK-29576 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 2.4.4 Reporter: DB Tsai Fix For: 3.0.0 Instead of using the ZStd codec directly, we use Spark's CompressionCodec, which wraps the ZStd codec in a buffered stream to avoid the excessive overhead of JNI calls when compressing small amounts of data. Also, by using Spark's CompressionCodec, we can easily make it configurable in the future if needed.
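The rationale above (buffering coalesces many tiny writes so the per-call codec overhead, such as a JNI crossing, is paid rarely) can be sketched generically. This is not Spark's code: zlib stands in for ZStd, and the counting wrapper is purely illustrative.

```python
# Sketch of why wrapping a compressor in a buffered stream helps: the
# buffer turns many tiny writes into a few large writes to the codec,
# so the per-call (e.g. JNI) overhead is amortized. zlib is used here
# as a stand-in for ZStd; the principle is the same.
import io
import zlib

class CountingCompressor(io.RawIOBase):
    """Counts how many times the (expensive) codec boundary is crossed."""
    def __init__(self):
        self.calls = 0
        self.compressed = b""
        self._c = zlib.compressobj()

    def writable(self):
        return True

    def write(self, b):
        self.calls += 1                       # one "expensive" codec call
        self.compressed += self._c.compress(bytes(b))
        return len(b)

    def finish(self):
        self.compressed += self._c.flush()

raw = CountingCompressor()
buffered = io.BufferedWriter(raw, buffer_size=64 * 1024)
for _ in range(10_000):
    buffered.write(b"x" * 8)                  # 10,000 tiny 8-byte writes
buffered.flush()
raw.finish()
# The buffer collapses 10,000 tiny writes into a handful of codec calls,
# while the compressed stream still round-trips to the original data.
```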
[jira] [Commented] (SPARK-29571) Fix UT in AllExecutionsPageSuite class
[ https://issues.apache.org/jira/browse/SPARK-29571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958237#comment-16958237 ] Ankit raj boudh commented on SPARK-29571: - The assert condition is wrong in AllExecutionsPageSuite.scala for the test "sorting should be successful": if an IllegalArgumentException occurs, the unit test still passes (it should actually fail). > Fix UT in AllExecutionsPageSuite class > --- > > Key: SPARK-29571 > URL: https://issues.apache.org/jira/browse/SPARK-29571 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 3.0.0 >Reporter: Ankit Raj Boudh >Priority: Minor > > sorting should be successful UT in class AllExecutionsPageSuite failing due > to invalid assert condition. Needs to handle this.
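The class of bug described in this comment can be shown in miniature: when the check sits inside a try block whose catch clause swallows the exception, the failure path "passes". This is a generic sketch, not the actual Scala suite code.

```python
# Generic sketch of the bug class described above (not the actual
# AllExecutionsPageSuite code): the flawed variant swallows the
# exception, so an invalid input never fails the "test".

def sort_rows(rows, key):
    if key not in ("id", "name"):
        raise ValueError(f"unknown sort key: {key}")
    return sorted(rows, key=lambda r: r[key])

def flawed_check(rows, key):
    # BAD: if sort_rows raises, we fall through and return True anyway.
    try:
        result = sort_rows(rows, key)
        assert result == sorted(rows, key=lambda r: r[key])
    except ValueError:
        pass
    return True

def fixed_check(rows, key):
    # GOOD: an unexpected exception propagates and fails the test.
    result = sort_rows(rows, key)
    return result == sorted(rows, key=lambda r: r[key])
```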
[jira] [Resolved] (SPARK-29538) Test failure: org.apache.spark.sql.execution.adaptive.AdaptiveQueryExecSuite.multiple joins
[ https://issues.apache.org/jira/browse/SPARK-29538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-29538. -- Resolution: Duplicate SPARK-29552 dealt with this. Will reopen if it is still flaky. > Test failure: > org.apache.spark.sql.execution.adaptive.AdaptiveQueryExecSuite.multiple joins > --- > > Key: SPARK-29538 > URL: https://issues.apache.org/jira/browse/SPARK-29538 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Priority: Major > > [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/112373/testReport] > org.scalatest.exceptions.TestFailedException: 2 did not equal 1 > > This doesn't look like a rare occurrence - it had been passing, then > failed once, and shows roughly 1 or 2 failures per page of history. > [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/112373/testReport/junit/org.apache.spark.sql.execution.adaptive/AdaptiveQueryExecSuite/multiple_joins/history] > (Please page back through older runs to see how often it failed.)
[jira] [Commented] (SPARK-15348) Hive ACID
[ https://issues.apache.org/jira/browse/SPARK-15348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958225#comment-16958225 ] Georg Heiler commented on SPARK-15348: -- Another workaround could be external tables, as outlined in [https://stackoverflow.com/questions/58406125/how-to-write-a-table-to-hive-from-spark-without-using-the-warehouse-connector-in]. > Hive ACID > - > > Key: SPARK-15348 > URL: https://issues.apache.org/jira/browse/SPARK-15348 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 1.6.3, 2.0.2, 2.1.2, 2.2.0, 2.3.0 >Reporter: Ran Haim >Priority: Major > > Spark does not support any feature of Hive's transactional tables: > you cannot use Spark to delete/update a table, and it also has problems > reading the aggregated data when no compaction has been done. > Also, it seems that compaction is not supported - alter table ... partition > COMPACT 'major'
[jira] [Commented] (SPARK-29571) Fix UT in AllExecutionsPageSuite class
[ https://issues.apache.org/jira/browse/SPARK-29571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958171#comment-16958171 ] shahid commented on SPARK-29571: Could you clarify which UT is failing? > Fix UT in AllExecutionsPageSuite class > --- > > Key: SPARK-29571 > URL: https://issues.apache.org/jira/browse/SPARK-29571 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 3.0.0 >Reporter: Ankit Raj Boudh >Priority: Minor > > sorting should be successful UT in class AllExecutionsPageSuite failing due > to invalid assert condition. Needs to handle this. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29569) doc build fails with `/api/scala/lib/jquery.js` doesn't exist
[ https://issues.apache.org/jira/browse/SPARK-29569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958166#comment-16958166 ] Xingbo Jiang commented on SPARK-29569: -- [~dongjoon] Not yet, the release script is still failing, Wenchen and I are investigating more. > doc build fails with `/api/scala/lib/jquery.js` doesn't exist > - > > Key: SPARK-29569 > URL: https://issues.apache.org/jira/browse/SPARK-29569 > Project: Spark > Issue Type: Bug > Components: Build, docs >Affects Versions: 3.0.0 >Reporter: Xingbo Jiang >Assignee: Hyukjin Kwon >Priority: Blocker > Fix For: 3.0.0 > > Attachments: Screen Shot 2019-10-23 at 8.25.01 PM.png > > > Run `jekyll build` under `./spark/docs`, the command fail with the following > error message: > {code} > Making directory api/scala > cp -r ../target/scala-2.12/unidoc/. api/scala > Making directory api/java > cp -r ../target/javaunidoc/. api/java > Updating JavaDoc files for badge post-processing > Copying jquery.js from Scala API to Java API for page post-processing of > badges > jekyll 3.8.6 | Error: No such file or directory @ rb_sysopen - > ./api/scala/lib/jquery.js > {code} > This error only happens on master branch, the command works on branch-2.4 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29557) Upgrade dropwizard metrics library to 3.2.6
[ https://issues.apache.org/jira/browse/SPARK-29557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29557: -- Description: This proposes to upgrade the dropwizard/codahale metrics library version used by Spark to `3.2.6` which is the last version supporting Ganglia. Spark is currently using Dropwizard metrics version 3.1.5, a version that is no more actively developed nor maintained, according to the project's Github repo README. (was: This proposes to upgrade the dropwizard/codahale metrics library version used by Spark to a recent version, tentatively 4.1.1. Spark is currently using Dropwizard metrics version 3.1.5, a version that is no more actively developed nor maintained, according to the project's Github repo README.) > Upgrade dropwizard metrics library to 3.2.6 > --- > > Key: SPARK-29557 > URL: https://issues.apache.org/jira/browse/SPARK-29557 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: Luca Canali >Assignee: Luca Canali >Priority: Minor > Fix For: 3.0.0 > > > This proposes to upgrade the dropwizard/codahale metrics library version used > by Spark to `3.2.6` which is the last version supporting Ganglia. Spark is > currently using Dropwizard metrics version 3.1.5, a version that is no more > actively developed nor maintained, according to the project's Github repo > README. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29557) Upgrade dropwizard metrics library to 3.2.6
[ https://issues.apache.org/jira/browse/SPARK-29557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-29557. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26212 [https://github.com/apache/spark/pull/26212] > Upgrade dropwizard metrics library to 3.2.6 > --- > > Key: SPARK-29557 > URL: https://issues.apache.org/jira/browse/SPARK-29557 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: Luca Canali >Assignee: Luca Canali >Priority: Minor > Fix For: 3.0.0 > > > This proposes to upgrade the dropwizard/codahale metrics library version used > by Spark to a recent version, tentatively 4.1.1. Spark is currently using > Dropwizard metrics version 3.1.5, a version that is no more actively > developed nor maintained, according to the project's Github repo README. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29557) Upgrade dropwizard metrics library to 3.2.6
[ https://issues.apache.org/jira/browse/SPARK-29557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-29557: - Assignee: Luca Canali > Upgrade dropwizard metrics library to 3.2.6 > --- > > Key: SPARK-29557 > URL: https://issues.apache.org/jira/browse/SPARK-29557 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: Luca Canali >Assignee: Luca Canali >Priority: Minor > > This proposes to upgrade the dropwizard/codahale metrics library version used > by Spark to a recent version, tentatively 4.1.1. Spark is currently using > Dropwizard metrics version 3.1.5, a version that is no more actively > developed nor maintained, according to the project's Github repo README. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29569) doc build fails with `/api/scala/lib/jquery.js` doesn't exist
[ https://issues.apache.org/jira/browse/SPARK-29569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958065#comment-16958065 ] Dongjoon Hyun commented on SPARK-29569: --- Sorry for being late to the party! Thank you for swift fixing, [~hyukjin.kwon]! I saw new `3.0.0-preview-rc1` tag. Now, it's ready for vote? :) > doc build fails with `/api/scala/lib/jquery.js` doesn't exist > - > > Key: SPARK-29569 > URL: https://issues.apache.org/jira/browse/SPARK-29569 > Project: Spark > Issue Type: Bug > Components: Build, docs >Affects Versions: 3.0.0 >Reporter: Xingbo Jiang >Assignee: Hyukjin Kwon >Priority: Blocker > Fix For: 3.0.0 > > Attachments: Screen Shot 2019-10-23 at 8.25.01 PM.png > > > Run `jekyll build` under `./spark/docs`, the command fail with the following > error message: > {code} > Making directory api/scala > cp -r ../target/scala-2.12/unidoc/. api/scala > Making directory api/java > cp -r ../target/javaunidoc/. api/java > Updating JavaDoc files for badge post-processing > Copying jquery.js from Scala API to Java API for page post-processing of > badges > jekyll 3.8.6 | Error: No such file or directory @ rb_sysopen - > ./api/scala/lib/jquery.js > {code} > This error only happens on master branch, the command works on branch-2.4 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29575) from_json can produce nulls for fields which are marked as non-nullable
[ https://issues.apache.org/jira/browse/SPARK-29575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Victor Lopez updated SPARK-29575: - Description: I believe this issue was resolved elsewhere (https://issues.apache.org/jira/browse/SPARK-23173), though for Pyspark this bug seems to still be there. The issue appears when using {{from_json}} to parse a column in a Spark dataframe. It seems like {{from_json}} ignores whether the schema provided has any {{nullable:False}} property. {code:java} schema = T.StructType().add(T.StructField('id', T.LongType(), nullable=False)).add(T.StructField('name', T.StringType(), nullable=False)) data = [{'user': str({'name': 'joe', 'id':1})}, {'user': str({'name': 'jane'})}] df = spark.read.json(sc.parallelize(data)) df.withColumn("details", F.from_json("user", schema)).select("details.*").show() {code} was: I believe this issue was resolved elsewhere (https://issues.apache.org/jira/browse/SPARK-23173), though for Pyspark this bug seems to still be there. {{The issue appears when using from_json to parse a column in a Spark dataframe. It seems like from_json ignores whether the schema provided has any nullable:False property.}} {code:java} schema = T.StructType().add(T.StructField('id', T.LongType(), nullable=False)).add(T.StructField('name', T.StringType(), nullable=False)) data = [{'user': str({'name': 'joe', 'id':1})}, {'user': str({'name': 'jane'})}] df = spark.read.json(sc.parallelize(data)) df.withColumn("details", F.from_json("user", schema)).select("details.*").show() {code} > from_json can produce nulls for fields which are marked as non-nullable > --- > > Key: SPARK-29575 > URL: https://issues.apache.org/jira/browse/SPARK-29575 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.4 >Reporter: Victor Lopez >Priority: Major > > I believe this issue was resolved elsewhere > (https://issues.apache.org/jira/browse/SPARK-23173), though for Pyspark this > bug seems to still be there. 
> The issue appears when using {{from_json}} to parse a column in a Spark > dataframe. It seems like {{from_json}} ignores whether the schema provided > has any {{nullable:False}} property. > {code:java} > schema = T.StructType().add(T.StructField('id', T.LongType(), > nullable=False)).add(T.StructField('name', T.StringType(), nullable=False)) > data = [{'user': str({'name': 'joe', 'id':1})}, {'user': str({'name': > 'jane'})}] > df = spark.read.json(sc.parallelize(data)) > df.withColumn("details", F.from_json("user", > schema)).select("details.*").show() > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
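The snippet in the report assumes a live PySpark session and the usual `T`/`F` aliases, so it is not runnable on its own. The semantics the reporter expects — a record missing a non-nullable field gets rejected instead of silently filled with null — can be illustrated outside Spark in a few lines of plain Python (a conceptual sketch only; the `parse_strict` validator and the schema shape are hypothetical, not PySpark's API):

```python
import json

# Toy schema mirroring the report: both fields are declared non-nullable.
schema = {"id": {"nullable": False}, "name": {"nullable": False}}

def parse_strict(raw, schema):
    """Parse a JSON record; reject it if any non-nullable field is missing."""
    rec = json.loads(raw)
    missing = [field for field, spec in schema.items()
               if not spec["nullable"] and rec.get(field) is None]
    # Spark's from_json instead keeps the record and fills the field with null.
    return rec if not missing else None

rows = ['{"name": "joe", "id": 1}', '{"name": "jane"}']
print([parse_strict(r, schema) for r in rows])
# → [{'name': 'joe', 'id': 1}, None]
```

The second record is rejected because `id` is marked `nullable: False`; the bug report is that PySpark's `from_json` performs no such check against the supplied schema.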
[jira] [Updated] (SPARK-29575) from_json can produce nulls for fields which are marked as non-nullable
[ https://issues.apache.org/jira/browse/SPARK-29575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Victor Lopez updated SPARK-29575: - Description: I believe this issue was resolved elsewhere (https://issues.apache.org/jira/browse/SPARK-23173), though for Pyspark this bug seems to still be there. {{The issue appears when using `from_json` to parse a column in a Spark dataframe. It seems like `from_json` ignores whether the schema provided has any `nullable:False` property.}} {code:java} schema = T.StructType().add(T.StructField('id', T.LongType(), nullable=False)).add(T.StructField('name', T.StringType(), nullable=False)) data = [{'user': str({'name': 'joe', 'id':1})}, {'user': str({'name': 'jane'})}] df = spark.read.json(sc.parallelize(data)) df.withColumn("details", F.from_json("user", schema)).select("details.*").show() {code} was: I believe this issue was resolved elsewhere (https://issues.apache.org/jira/browse/SPARK-23173), though for Pyspark this bug seems to still be there. The issue appears when using `from_json` to parse a column in a Spark dataframe. It seems like `from_json` ignores whether the schema provided has any `nullable:False` property. {code:java} schema = T.StructType().add(T.StructField('id', T.LongType(), nullable=False)).add(T.StructField('name', T.StringType(), nullable=False)) data = [{'user': str({'name': 'joe', 'id':1})}, {'user': str({'name': 'jane'})}] df = spark.read.json(sc.parallelize(data)) df.withColumn("details", F.from_json("user", schema)).select("details.*").show() {code} > from_json can produce nulls for fields which are marked as non-nullable > --- > > Key: SPARK-29575 > URL: https://issues.apache.org/jira/browse/SPARK-29575 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.4 >Reporter: Victor Lopez >Priority: Major > > I believe this issue was resolved elsewhere > (https://issues.apache.org/jira/browse/SPARK-23173), though for Pyspark this > bug seems to still be there. 
> {{The issue appears when using `from_json` to parse a column in a Spark > dataframe. It seems like `from_json` ignores whether the schema provided has > any `nullable:False` property.}} > {code:java} > schema = T.StructType().add(T.StructField('id', T.LongType(), > nullable=False)).add(T.StructField('name', T.StringType(), nullable=False)) > data = [{'user': str({'name': 'joe', 'id':1})}, {'user': str({'name': > 'jane'})}] > df = spark.read.json(sc.parallelize(data)) > df.withColumn("details", F.from_json("user", > schema)).select("details.*").show() > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29575) from_json can produce nulls for fields which are marked as non-nullable
[ https://issues.apache.org/jira/browse/SPARK-29575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Victor Lopez updated SPARK-29575: - Description: I believe this issue was resolved elsewhere (https://issues.apache.org/jira/browse/SPARK-23173), though for Pyspark this bug seems to still be there. {{The issue appears when using from_json to parse a column in a Spark dataframe. It seems like from_json ignores whether the schema provided has any nullable:False property.}} {code:java} schema = T.StructType().add(T.StructField('id', T.LongType(), nullable=False)).add(T.StructField('name', T.StringType(), nullable=False)) data = [{'user': str({'name': 'joe', 'id':1})}, {'user': str({'name': 'jane'})}] df = spark.read.json(sc.parallelize(data)) df.withColumn("details", F.from_json("user", schema)).select("details.*").show() {code} was: I believe this issue was resolved elsewhere (https://issues.apache.org/jira/browse/SPARK-23173), though for Pyspark this bug seems to still be there. {{The issue appears when using `from_json` to parse a column in a Spark dataframe. It seems like `from_json` ignores whether the schema provided has any `nullable:False` property.}} {code:java} schema = T.StructType().add(T.StructField('id', T.LongType(), nullable=False)).add(T.StructField('name', T.StringType(), nullable=False)) data = [{'user': str({'name': 'joe', 'id':1})}, {'user': str({'name': 'jane'})}] df = spark.read.json(sc.parallelize(data)) df.withColumn("details", F.from_json("user", schema)).select("details.*").show() {code} > from_json can produce nulls for fields which are marked as non-nullable > --- > > Key: SPARK-29575 > URL: https://issues.apache.org/jira/browse/SPARK-29575 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.4 >Reporter: Victor Lopez >Priority: Major > > I believe this issue was resolved elsewhere > (https://issues.apache.org/jira/browse/SPARK-23173), though for Pyspark this > bug seems to still be there. 
> {{The issue appears when using from_json to parse a column in a Spark > dataframe. It seems like from_json ignores whether the schema provided has > any nullable:False property.}} > {code:java} > schema = T.StructType().add(T.StructField('id', T.LongType(), > nullable=False)).add(T.StructField('name', T.StringType(), nullable=False)) > data = [{'user': str({'name': 'joe', 'id':1})}, {'user': str({'name': > 'jane'})}] > df = spark.read.json(sc.parallelize(data)) > df.withColumn("details", F.from_json("user", > schema)).select("details.*").show() > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-29106) Add jenkins arm test for spark
[ https://issues.apache.org/jira/browse/SPARK-29106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958056#comment-16958056 ] Shane Knapp edited comment on SPARK-29106 at 10/23/19 5:29 PM: --- > For pyspark test, you mentioned we didn't install any python debs for > testing. Is there any "requirements.txt" or "test-requirements.txt" in the > spark repo? I failed to find them. When we tested pyspark before, we just > realized that we needed to install the numpy package with pip, because the > failure message told us so when we ran the pyspark test scripts. You > mentioned "pyspark testing debs" before; do you mean that we should figure it > all out manually? Is there any kind of suggestion from your side? i manage the jenkins configs via ansible, and python specifically through anaconda. anaconda was my initial choice for package management because we need to support multiple python versions (2.7, 3.x, pypy) and specific package versions for each python version itself. sadly there is no official ARM anaconda python distribution, which is a BIG hurdle for this project. i also don't use requirements.txt and pip to do the initial python env setup as pip is flakier than i like, and the conda envs just work a LOT better. see: https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#building-identical-conda-environments i could check the specific python package configs in to the spark repo, but they're specific to our worker configs, and even though the worker deployment process is automated (via ansible) there is ALWAYS some stupid dependency loop that pops up and requires manual intervention. another issue is that i do NOT want any builds installing/updating/creating either python environments OR packages. builds should NEVER EVER modify the bare-metal (or VM) system-level configs.
so, to summarize what needs to happen to get the python tests up and running: 1) there is no conda distribution for the ARM architecture, meaning... 2) i need to use venv to install everything... 3) which means i need to use pip/requirements.txt, which is known to be flaky... 4) and the python packages for ARM are named differently than x86... 5) or don't exist... 6) or are the wrong version... 7) meaning that setting up and testing three different python versions with differing package names and versions makes this a lot of trial and error. i would like to get this done asap, but i will need to carve out some serious time to get my brain wrapped around the problem. > For sparkR test, we compiled a newer R version, 3.6.1, by fixing many lib > dependencies, and made it work. We then ran the R test script until all of > the tests passed. So we wonder about the difficulty of this testing when it > truly runs in amplab; could you please share more with us? i have a deep and comprehensive hatred of installing and setting up R. i've attached a couple of files showing the packages installed, their versions, and some of the ansible snippets i use to do the initial install. https://issues.apache.org/jira/secure/attachment/12983856/R-ansible.yml https://issues.apache.org/jira/secure/attachment/12983857/R-libs.txt just like you, i need to go back and manually fix lib dependency and version errors once the initial setup is complete. this is why i have a deep and comprehensive hatred of installing and setting up R. > For the current periodic jobs, you said they will be triggered 2 times per > day, and each build will cost at most 11 hours. I have a thought about the > next job deployment and wish to know your opinion. My thought is we can set > up 2 jobs per day: one is the current maven UT test triggered by SCM changes > (11h), the other will run the pyspark and sparkR tests, also triggered by SCM > changes (including spark build and tests, which may cost 5-6 hours). How > about this?
> We can talk and discuss it if we don't yet realize how difficult these are > to do. yeah, i am amenable to having a second ARM build. i'd be curious as to the impact on the VM's performance when we have two builds running simultaneously. if i have some time today i'll experiment. shane was (Author: shaneknapp): > For pyspark test, you mentioned we didn't install any python debs for > testing. Is there any "requirements.txt" or "test-requirements.txt" in the > spark repo? I failed to find them. When we tested pyspark before, we just > realized that we needed to install the numpy package with pip, because the > failure message told us so when we ran the pyspark test scripts. You > mentioned "pyspark testing debs" before; do you mean that we should figure it > all out manually? Is there any kind of suggestion from your side? i manage the jenkins configs via ansible, and python specifically through anaconda. anaconda was my initial choice for package management because we need to support multiple python versions (2.7, 3.x, pypy) and specific package
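The "building identical conda environments" approach Shane links to pins every package to an exact version so each worker gets a byte-for-byte reproducible env. As a hypothetical illustration only (the env name and package versions below are invented, not the actual Jenkins configuration), a pinned conda environment file for one of the test Python versions might look like:

```yaml
# Hypothetical pinned environment for a py3 Spark test env (illustrative versions).
# In practice `conda list --explicit > spec-file.txt` captures exact builds.
name: spark-py36
channels:
  - defaults
dependencies:
  - python=3.6.8
  - numpy=1.16.4
  - pandas=0.24.2
  - pip
```

The exact-pin style is what makes conda envs reproducible across workers; the lack of an official ARM anaconda distribution is why this approach breaks down for the ARM build.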
[jira] [Updated] (SPARK-29575) from_json can produce nulls for fields which are marked as non-nullable
[ https://issues.apache.org/jira/browse/SPARK-29575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Victor Lopez updated SPARK-29575: - Description: I believe this issue was resolved elsewhere (https://issues.apache.org/jira/browse/SPARK-23173), though for Pyspark this bug seems to still be there. The issue appears when using `from_json` to parse a column in a Spark dataframe. It seems like `from_json` ignores whether the schema provided has any `nullable:False` property. {code:java} schema = T.StructType().add(T.StructField('id', T.LongType(), nullable=False)).add(T.StructField('name', T.StringType(), nullable=False)) data = [{'user': str({'name': 'joe', 'id':1})}, {'user': str({'name': 'jane'})}] df = spark.read.json(sc.parallelize(data)) df.withColumn("details", F.from_json("user", schema)).select("details.*").show() {code} was: I believe this issue was resolved elsewhere (https://issues.apache.org/jira/browse/SPARK-23173), though for Pyspark this bug seems to still be there. The issue appears when using `from_json` to parse a column in a Spark dataframe. It seems like `from_json` ignores whether the schema provided has any `nullable:False` property. {code:java} schema = T.StructType().add(T.StructField('id', T.LongType(), nullable=False)).add(T.StructField('name', T.StringType(), nullable=False)) data = [{'user': str({'name': 'joe', 'id':1})}, {'user': str({'name': 'jane'})}] df = spark.read.json(sc.parallelize(data)) df.withColumn("details", F.from_json("user", schema)).select("details.*").show() {code} > from_json can produce nulls for fields which are marked as non-nullable > --- > > Key: SPARK-29575 > URL: https://issues.apache.org/jira/browse/SPARK-29575 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.4 >Reporter: Victor Lopez >Priority: Major > > I believe this issue was resolved elsewhere > (https://issues.apache.org/jira/browse/SPARK-23173), though for Pyspark this > bug seems to still be there. 
> The issue appears when using `from_json` to parse a column in a Spark > dataframe. It seems like `from_json` ignores whether the schema provided has > any `nullable:False` property. > {code:java} > schema = T.StructType().add(T.StructField('id', T.LongType(), > nullable=False)).add(T.StructField('name', T.StringType(), nullable=False)) > data = [{'user': str({'name': 'joe', 'id':1})}, {'user': str({'name': > 'jane'})}] > df = spark.read.json(sc.parallelize(data)) > df.withColumn("details", F.from_json("user", > schema)).select("details.*").show() > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29106) Add jenkins arm test for spark
[ https://issues.apache.org/jira/browse/SPARK-29106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shane Knapp updated SPARK-29106: Attachment: R-libs.txt R-ansible.yml > Add jenkins arm test for spark > -- > > Key: SPARK-29106 > URL: https://issues.apache.org/jira/browse/SPARK-29106 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 3.0.0 >Reporter: huangtianhua >Priority: Minor > Attachments: R-ansible.yml, R-libs.txt > > > Add arm test jobs to amplab jenkins for spark. > So far we have made two periodic arm test jobs for spark in OpenLab: one is > based on master with hadoop 2.7 (similar to the QA test of amplab jenkins), > the other is based on a new branch which we made on 09-09, see > [http://status.openlabtesting.org/builds/job/spark-master-unit-test-hadoop-2.7-arm64] > and > [http://status.openlabtesting.org/builds/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64.|http://status.openlabtesting.org/builds/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64] > We only have to care about the first one when integrating the arm test with > amplab jenkins. > About the k8s test on arm, we have tested it, see > [https://github.com/theopenlab/spark/pull/17], maybe we can integrate it > later. > And we plan to test other stable branches too, and we can integrate them to > amplab when they are ready. > We have offered an arm instance and sent the info to shane knapp; thanks to > shane for adding the first arm job to amplab jenkins :) > The other important thing is about the leveldbjni > [https://github.com/fusesource/leveldbjni,|https://github.com/fusesource/leveldbjni/issues/80] > spark depends on leveldbjni-all-1.8 > [https://mvnrepository.com/artifact/org.fusesource.leveldbjni/leveldbjni-all/1.8], > and we can see there is no arm64 support. 
So we built an arm64-supporting > release of leveldbjni, see > [https://mvnrepository.com/artifact/org.openlabtesting.leveldbjni/leveldbjni-all/1.8], > but we can't modify the spark pom.xml directly with something like a > 'property'/'profile' to choose the correct jar package on the arm or x86 > platform, because spark depends on some hadoop packages like hadoop-hdfs, and > those packages depend on leveldbjni-all-1.8 too, unless hadoop releases with > a new arm-supporting leveldbjni jar. For now we download the > leveldbjni-all-1.8 of openlabtesting and 'mvn install' it for use when arm > testing spark. > PS: The issues found and fixed: > SPARK-28770 > [https://github.com/apache/spark/pull/25673] > > SPARK-28519 > [https://github.com/apache/spark/pull/25279] > > SPARK-28433 > [https://github.com/apache/spark/pull/25186] > > SPARK-28467 > [https://github.com/apache/spark/pull/25864] > > SPARK-29286 > [https://github.com/apache/spark/pull/26021] > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29106) Add jenkins arm test for spark
[ https://issues.apache.org/jira/browse/SPARK-29106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958056#comment-16958056 ] Shane Knapp commented on SPARK-29106: - > For pyspark test, you mentioned we didn't install any python debs for > testing. Is there any "requirements.txt" or "test-requirements.txt" in the > spark repo? I failed to find them. When we tested pyspark before, we just > realized that we needed to install the numpy package with pip, because the > failure message told us so when we ran the pyspark test scripts. You > mentioned "pyspark testing debs" before; do you mean that we should figure it > all out manually? Is there any kind of suggestion from your side? i manage the jenkins configs via ansible, and python specifically through anaconda. anaconda was my initial choice for package management because we need to support multiple python versions (2.7, 3.x, pypy) and specific package versions for each python version itself. sadly there is no official ARM anaconda python distribution, which is a BIG hurdle for this project. i also don't use requirements.txt and pip to do the initial python env setup as pip is flakier than i like, and the conda envs just work a LOT better. see: https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#building-identical-conda-environments i could check the specific python package configs in to the spark repo, but they're specific to our worker configs, and even though the worker deployment process is automated (via ansible) there is ALWAYS some stupid dependency loop that pops up and requires manual intervention. another issue is that i do NOT want any builds installing/updating/creating either python environments OR packages. builds should NEVER EVER modify the bare-metal (or VM) system-level configs. so, to summarize what needs to happen to get the python tests up and running: 1) there is no conda distribution for the ARM architecture, meaning... 
2) i need to use venv to install everything... 3) which means i need to use pip/requirements.txt, which is known to be flaky... 4) and the python packages for ARM are named differently than x86... 5) or don't exist... 6) or are the wrong version... 7) meaning that setting up and testing three different python versions with differing package names and versions makes this a lot of trial and error. i would like to get this done asap, but i will need to carve out some serious time to get my brain wrapped around the problem. > For sparkR test, we compiled a newer R version, 3.6.1, by fixing many lib > dependencies, and made it work. We then ran the R test script until all of > the tests passed. So we wonder about the difficulty of this testing when it > truly runs in amplab; could you please share more with us? i have a deep and comprehensive hatred of installing and setting up R. i'll attach a couple of files showing the packages installed, their versions, and some of the ansible snippets i use to do the initial install. just like you, i need to go back and manually fix lib dependency and version errors once the initial setup is complete. this is why i have a deep and comprehensive hatred of installing and setting up R. > For the current periodic jobs, you said they will be triggered 2 times per > day, and each build will cost at most 11 hours. I have a thought about the > next job deployment and wish to know your opinion. My thought is we can set > up 2 jobs per day: one is the current maven UT test triggered by SCM changes > (11h), the other will run the pyspark and sparkR tests, also triggered by SCM > changes (including spark build and tests, which may cost 5-6 hours). How > about this? > We can talk and discuss it if we don't yet realize how difficult these are > to do. yeah, i am amenable to having a second ARM build. i'd be curious as to the impact on the VM's performance when we have two builds running simultaneously. if i have some time today i'll experiment. 
shane > Add jenkins arm test for spark > -- > > Key: SPARK-29106 > URL: https://issues.apache.org/jira/browse/SPARK-29106 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 3.0.0 >Reporter: huangtianhua >Priority: Minor > > Add arm test jobs to amplab jenkins for spark. > So far we have made two periodic arm test jobs for spark in OpenLab: one is > based on master with hadoop 2.7 (similar to the QA test of amplab jenkins), > the other is based on a new branch which we made on 09-09, see > [http://status.openlabtesting.org/builds/job/spark-master-unit-test-hadoop-2.7-arm64] > and > [http://status.openlabtesting.org/builds/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64.|http://status.openlabtesting.org/builds/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64] > We only have to care about the first one when
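Points 1)–3) in Shane's summary — falling back from conda to the stdlib `venv` module plus pip on ARM — can be sketched with a few lines of Python. This is an illustration under stated assumptions (the env name and layout are invented, not the actual Jenkins provisioning):

```python
import sys
import tempfile
import venv
from pathlib import Path

# Create a throwaway virtualenv the way an ARM provisioning script might.
# with_pip=True would also bootstrap pip, so a pinned requirements.txt
# could then be installed into the env (the flaky step Shane describes).
env_dir = Path(tempfile.mkdtemp()) / "spark-py3"
venv.EnvBuilder(with_pip=False).create(env_dir)

# A Jenkins job would invoke this interpreter for the pyspark tests,
# leaving the bare-metal/VM system Python untouched.
interpreter = env_dir / ("Scripts" if sys.platform == "win32" else "bin") / "python"
print(interpreter.exists())  # → True
```

Keeping all packages inside per-build virtualenvs is one way to honor the constraint that builds never modify system-level configs.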
[jira] [Assigned] (SPARK-29552) Fix the flaky test failed in AdaptiveQueryExecSuite # multiple joins
[ https://issues.apache.org/jira/browse/SPARK-29552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-29552: --- Assignee: Ke Jia > Fix the flaky test failed in AdaptiveQueryExecSuite # multiple joins > > > Key: SPARK-29552 > URL: https://issues.apache.org/jira/browse/SPARK-29552 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ke Jia >Assignee: Ke Jia >Priority: Major > > AQE re-optimizes the logical plan once a query stage has finished. So for an > inner join where both sides are small enough to be the build side, the > planner converting the logical plan to a physical plan will select the build > side as BuildRight if the right side finished first, or BuildLeft if the left > side finished first. In some cases BuildRight or BuildLeft may introduce an > additional exchange at the parent node. The revert approach in the > OptimizeLocalShuffleReader rule may be too conservative: it reverts all the > local shuffle readers when an additional exchange is introduced, rather than > reverting only the local shuffle readers that introduced the shuffle. It may > also be expensive to revert only the local shuffle readers that introduced > the shuffle. The workaround is to apply the OptimizeLocalShuffleReader rule > again when creating a new query stage, to further optimize the subtree's > shuffle readers into local shuffle readers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29552) Fix the flaky test failed in AdaptiveQueryExecSuite # multiple joins
[ https://issues.apache.org/jira/browse/SPARK-29552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-29552. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26207 [https://github.com/apache/spark/pull/26207] > Fix the flaky test failed in AdaptiveQueryExecSuite # multiple joins > > > Key: SPARK-29552 > URL: https://issues.apache.org/jira/browse/SPARK-29552 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ke Jia >Assignee: Ke Jia >Priority: Major > Fix For: 3.0.0 > > > AQE re-optimizes the logical plan once a query stage has finished. So for an > inner join where both sides are small enough to be the build side, the > planner converting the logical plan to a physical plan will select the build > side as BuildRight if the right side finished first, or BuildLeft if the left > side finished first. In some cases BuildRight or BuildLeft may introduce an > additional exchange at the parent node. The revert approach in the > OptimizeLocalShuffleReader rule may be too conservative: it reverts all the > local shuffle readers when an additional exchange is introduced, rather than > reverting only the local shuffle readers that introduced the shuffle. It may > also be expensive to revert only the local shuffle readers that introduced > the shuffle. The workaround is to apply the OptimizeLocalShuffleReader rule > again when creating a new query stage, to further optimize the subtree's > shuffle readers into local shuffle readers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
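The adaptive re-optimization this ticket exercises only runs when AQE is enabled. As a minimal configuration sketch (flag names as documented for Spark 3.0; default values differ between versions, so treat this as illustrative rather than authoritative), a spark-defaults.conf fragment to turn it on might be:

```properties
# Enable Adaptive Query Execution so rules such as OptimizeLocalShuffleReader apply
spark.sql.adaptive.enabled                     true
# Allow AQE to convert shuffle readers into local shuffle readers
spark.sql.adaptive.localShuffleReader.enabled  true
```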
[jira] [Assigned] (SPARK-29503) MapObjects doesn't copy Unsafe data when nested under Safe data
[ https://issues.apache.org/jira/browse/SPARK-29503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-29503: --- Assignee: Jungtaek Lim > MapObjects doesn't copy Unsafe data when nested under Safe data > --- > > Key: SPARK-29503 > URL: https://issues.apache.org/jira/browse/SPARK-29503 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 3.0.0 >Reporter: Aaron Lewis >Assignee: Jungtaek Lim >Priority: Major > Labels: correctness > > In order for MapObjects to operate safely, it checks to see if the result of > the mapping function is an Unsafe type (UnsafeRow, UnsafeArrayData, > UnsafeMapData) and performs a copy before writing it into MapObjects' output > array. This is to protect against expressions which re-use the same native > memory buffer to represent its result across evaluations; if the copy wasn't > here, all results would be pointing to the same native buffer and would > represent the last result written to the buffer. However, MapObjects misses > this needed copy if the Unsafe data is nested below some safe structure, for > instance a GenericArrayData whose elements are all UnsafeRows. In this > scenario, all elements of the GenericArrayData will be pointing to the same > native UnsafeRow buffer which will hold the last value written to it. > > Right now, this bug seems to only occur when a `ProjectExec` goes down the > `execute` path, as opposed to WholeStageCodegen's `produce` and `consume` > path. 
> > Example Reproduction Code: > {code:scala} > import org.apache.spark.sql.catalyst.expressions.objects.MapObjects > import org.apache.spark.sql.catalyst.expressions.CreateArray > import org.apache.spark.sql.catalyst.expressions.Expression > import org.apache.spark.sql.functions.{array, struct} > import org.apache.spark.sql.Column > import org.apache.spark.sql.types.ArrayType > // For the purpose of demonstration, we need to disable WholeStage codegen > spark.conf.set("spark.sql.codegen.wholeStage", "false") > val exampleDS = spark.sparkContext.parallelize(Seq(Seq(1, 2, > 3))).toDF("items") > // Trivial example: Nest unsafe struct inside safe array > // items: Seq[Int] => items.map{item => Seq(Struct(item))} > val result = exampleDS.select( > new Column(MapObjects( > {item: Expression => array(struct(new Column(item))).expr}, > $"items".expr, > exampleDS.schema("items").dataType.asInstanceOf[ArrayType].elementType > )) as "items" > ) > result.show(10, false) > {code} > > Actual Output: > {code:java} > +-+ > |items| > +-+ > |[WrappedArray([3]), WrappedArray([3]), WrappedArray([3])]| > +-+ > {code} > > Expected Output: > {code:java} > +-+ > |items| > +-+ > |[WrappedArray([1]), WrappedArray([2]), WrappedArray([3])]| > +-+ > {code} > > We've confirmed that the bug exists on version 2.1.1 as well as on master > (which I assume corresponds to version 3.0.0?) > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29503) MapObjects doesn't copy Unsafe data when nested under Safe data
[ https://issues.apache.org/jira/browse/SPARK-29503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-29503. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26173 [https://github.com/apache/spark/pull/26173] > MapObjects doesn't copy Unsafe data when nested under Safe data > --- > > Key: SPARK-29503 > URL: https://issues.apache.org/jira/browse/SPARK-29503 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 3.0.0 >Reporter: Aaron Lewis >Assignee: Jungtaek Lim >Priority: Major > Labels: correctness > Fix For: 3.0.0 > > > In order for MapObjects to operate safely, it checks to see if the result of > the mapping function is an Unsafe type (UnsafeRow, UnsafeArrayData, > UnsafeMapData) and performs a copy before writing it into MapObjects' output > array. This is to protect against expressions which re-use the same native > memory buffer to represent its result across evaluations; if the copy wasn't > here, all results would be pointing to the same native buffer and would > represent the last result written to the buffer. However, MapObjects misses > this needed copy if the Unsafe data is nested below some safe structure, for > instance a GenericArrayData whose elements are all UnsafeRows. In this > scenario, all elements of the GenericArrayData will be pointing to the same > native UnsafeRow buffer which will hold the last value written to it. > > Right now, this bug seems to only occur when a `ProjectExec` goes down the > `execute` path, as opposed to WholeStageCodegen's `produce` and `consume` > path. 
> > Example Reproduction Code: > {code:scala} > import org.apache.spark.sql.catalyst.expressions.objects.MapObjects > import org.apache.spark.sql.catalyst.expressions.CreateArray > import org.apache.spark.sql.catalyst.expressions.Expression > import org.apache.spark.sql.functions.{array, struct} > import org.apache.spark.sql.Column > import org.apache.spark.sql.types.ArrayType > // For the purpose of demonstration, we need to disable WholeStage codegen > spark.conf.set("spark.sql.codegen.wholeStage", "false") > val exampleDS = spark.sparkContext.parallelize(Seq(Seq(1, 2, > 3))).toDF("items") > // Trivial example: Nest unsafe struct inside safe array > // items: Seq[Int] => items.map{item => Seq(Struct(item))} > val result = exampleDS.select( > new Column(MapObjects( > {item: Expression => array(struct(new Column(item))).expr}, > $"items".expr, > exampleDS.schema("items").dataType.asInstanceOf[ArrayType].elementType > )) as "items" > ) > result.show(10, false) > {code} > > Actual Output: > {code:java} > +-+ > |items| > +-+ > |[WrappedArray([3]), WrappedArray([3]), WrappedArray([3])]| > +-+ > {code} > > Expected Output: > {code:java} > +-+ > |items| > +-+ > |[WrappedArray([1]), WrappedArray([2]), WrappedArray([3])]| > +-+ > {code} > > We've confirmed that the bug exists on version 2.1.1 as well as on master > (which I assume corresponds to version 3.0.0?) > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
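The buffer-reuse hazard behind this bug can be illustrated outside Spark with a minimal Python analogy (not Spark code; the names are invented for illustration): a producer that reuses one mutable buffer across evaluations makes every stored reference show only the last value, unless each result is copied before being collected.

```python
# Analogy for the MapObjects bug: a "row producer" that reuses one
# mutable buffer across evaluations, like an UnsafeRow-backed expression.
buffer = [None]

def evaluate(value, copy_result):
    buffer[0] = value                          # overwrite shared buffer in place
    return list(buffer) if copy_result else buffer

# Without the defensive copy, every element aliases the same buffer
# and reflects only the last value written (the reported symptom).
no_copy = [evaluate(v, copy_result=False) for v in [1, 2, 3]]
print(no_copy)    # [[3], [3], [3]]

# With the copy (what MapObjects must do even under safe nesting),
# each intermediate result is preserved.
with_copy = [evaluate(v, copy_result=True) for v in [1, 2, 3]]
print(with_copy)  # [[1], [2], [3]]
```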
[jira] [Created] (SPARK-29575) from_json can produce nulls for fields which are marked as non-nullable
Victor Lopez created SPARK-29575: Summary: from_json can produce nulls for fields which are marked as non-nullable Key: SPARK-29575 URL: https://issues.apache.org/jira/browse/SPARK-29575 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.4.4 Reporter: Victor Lopez I believe this issue was resolved elsewhere (https://issues.apache.org/jira/browse/SPARK-23173), though for PySpark the bug still seems to be present. The issue appears when using `from_json` to parse a column in a Spark dataframe. It seems that `from_json` ignores whether the schema provided has any `nullable:False` property. {code:python} schema = T.StructType().add(T.StructField('id', T.LongType(), nullable=False)).add(T.StructField('name', T.StringType(), nullable=False)) data = [{'user': str({'name': 'joe', 'id':1})}, {'user': str({'name': 'jane'})}] df = spark.read.json(sc.parallelize(data)) df.withColumn("details", F.from_json("user", schema)).select("details.*").show() {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
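The reported behavior can be mimicked without Spark in a small Python sketch (an analogy with hypothetical helper names, not PySpark's implementation): a parser that projects schema fields at parse time surfaces a missing field as None even when the schema declares it non-nullable, unless nullability is explicitly enforced.

```python
import json

# Hypothetical schema: field name -> nullable flag, mirroring the
# StructType above (both fields declared non-nullable).
schema = {"id": False, "name": False}

def from_json_like(payload, schema):
    """Parse JSON and project schema fields. Like the reported from_json
    behavior, missing fields come back as None even when nullable=False:
    the declared nullability is not enforced at parse time."""
    record = json.loads(payload)
    return {field: record.get(field) for field in schema}

row = from_json_like('{"name": "jane"}', schema)
print(row)  # {'id': None, 'name': 'jane'} -- None despite nullable=False
```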
[jira] [Commented] (SPARK-29106) Add jenkins arm test for spark
[ https://issues.apache.org/jira/browse/SPARK-29106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958019#comment-16958019 ] Shane Knapp commented on SPARK-29106: - [~huangtianhua]: > we don't have to download and install leveldbjni-all-1.8 in our arm test > instance, we have installed it and it was there. it's a very inexpensive step to execute and i'd rather have builds be atomic. if for some reason the dependency gets wiped/corrupted/etc, the download will ensure we're properly building. > maybe we can try to use 'mvn clean package ' instead of 'mvn clean > install '? sure, i'll give that a shot now. > Add jenkins arm test for spark > -- > > Key: SPARK-29106 > URL: https://issues.apache.org/jira/browse/SPARK-29106 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 3.0.0 >Reporter: huangtianhua >Priority: Minor > > Add arm test jobs to amplab jenkins for spark. > So far we have made two periodic arm test jobs for spark in OpenLab: one is > based on master with hadoop 2.7 (similar to the QA test of amplab jenkins), > the other is based on a new branch which we made on 09-09, see > [http://status.openlabtesting.org/builds/job/spark-master-unit-test-hadoop-2.7-arm64] > and > [http://status.openlabtesting.org/builds/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64] > We only have to care about the first one when integrating the arm test with amplab > jenkins. > About the k8s test on arm, we have tested it, see > [https://github.com/theopenlab/spark/pull/17], maybe we can integrate it > later. > And we plan to test on other stable branches too, and we can integrate them > into amplab when they are ready. 
> We have offered an arm instance and sent the info to shane knapp; thanks to > shane for adding the first arm job to amplab jenkins :) > The other important thing is about leveldbjni > [https://github.com/fusesource/leveldbjni] (see also > [https://github.com/fusesource/leveldbjni/issues/80]): > spark depends on leveldbjni-all-1.8 > [https://mvnrepository.com/artifact/org.fusesource.leveldbjni/leveldbjni-all/1.8], > and we can see there is no arm64 support. So we built an arm64-supporting > release of leveldbjni, see > [https://mvnrepository.com/artifact/org.openlabtesting.leveldbjni/leveldbjni-all/1.8], > but we can't modify the spark pom.xml directly with something like a > 'property'/'profile' to choose the correct jar package on arm or x86 platforms, > because spark depends on some hadoop packages like hadoop-hdfs, and those packages > depend on leveldbjni-all-1.8 too, unless hadoop releases with a new arm-supporting > leveldbjni jar. For now we download the leveldbjni-all-1.8 of > openlabtesting and 'mvn install' it when arm testing spark. > PS: The issues found and fixed: > SPARK-28770 > [https://github.com/apache/spark/pull/25673] > > SPARK-28519 > [https://github.com/apache/spark/pull/25279] > > SPARK-28433 > [https://github.com/apache/spark/pull/25186] > > SPARK-28467 > [https://github.com/apache/spark/pull/25864] > > SPARK-29286 > [https://github.com/apache/spark/pull/26021] > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
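For reference, the kind of pom.xml change the comment describes could be sketched as an os.arch-activated Maven profile (a hypothetical fragment: only the two leveldbjni group IDs come from the thread, the rest is assumed, and as noted above it would not fix Hadoop's own transitive dependency on the x86 jar):

```xml
<!-- Hypothetical sketch: pick the arm64-capable leveldbjni build on
     aarch64 hosts, the stock artifact elsewhere, via os.arch activation. -->
<profiles>
  <profile>
    <id>leveldbjni-arm64</id>
    <activation><os><arch>aarch64</arch></os></activation>
    <properties>
      <leveldbjni.group>org.openlabtesting.leveldbjni</leveldbjni.group>
    </properties>
  </profile>
  <profile>
    <id>leveldbjni-default</id>
    <activation><activeByDefault>true</activeByDefault></activation>
    <properties>
      <leveldbjni.group>org.fusesource.leveldbjni</leveldbjni.group>
    </properties>
  </profile>
</profiles>
<!-- dependencies would then reference ${leveldbjni.group}:leveldbjni-all:1.8 -->
```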
[jira] [Commented] (SPARK-29415) Stage Level Sched: Add base ResourceProfile and Request classes
[ https://issues.apache.org/jira/browse/SPARK-29415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16957974#comment-16957974 ] Thomas Graves commented on SPARK-29415: --- From a high-level design point, these are the base classes needed for other jiras/components to be implemented. You can see the design doc attached to SPARK-27495 for the entire overview, but for this specifically this is what we are looking to add. These will start out private until we have other parts implemented, and then be made public, in case this isn't fully implemented for a release. ResourceProfile: The user will have to build up a _ResourceProfile_ to pass into an RDD withResources call. This profile will have a limited set of resources the user is allowed to specify. It will allow both task and executor resources. It will be a builder-type interface where the main function called will be _ResourceProfile.require_. Adding the ResourceProfile API class leaves it open to do more advanced things in the future. For instance, perhaps you want a _ResourceProfile.prefer_ option where it would run on a node with some resources if available but then fall back if they aren't. The config names supported correspond to the regular spark configs with the prefix removed. For instance, overhead memory in this api is memoryOverhead, which is spark.executor.memoryOverhead with the spark.executor removed. Resources like GPUs are resource.gpu (spark configs spark.executor.resource.gpu.*).
{code:scala}
def require(request: TaskResourceRequest): this.type
def require(request: ExecutorResourceRequest): this.type
{code}
It will also have functions to get the resources out for both scala and java. 
*Resource Requests:*
{code:scala}
class ExecutorResourceRequest(
  val resourceName: String,
  val amount: Int, // potentially make this handle fractional resources
  val units: String, // to handle memory unit types
  val discoveryScript: Option[String] = None,
  val vendor: Option[String] = None)

class TaskResourceRequest(
  val resourceName: String,
  val amount: Double) // double to handle fractional resources (ie 2 tasks using 1 resource)
{code}
This will allow the user to programmatically set the resources vs just using the configs like they can in Spark 3.0 now. The first implementation would support cpu, memory (overhead, pyspark, on heap, off heap), and the generic resources. An example of the way this might work is:
{code:scala}
val rp = new ResourceProfile()
rp.require(new ExecutorResourceRequest("memory", 2048))
rp.require(new ExecutorResourceRequest("cores", 2))
rp.require(new ExecutorResourceRequest("gpu", 1, Some("/opt/gpuScripts/getGpus")))
rp.require(new TaskResourceRequest("gpu", 1))
{code}
Internally we will also create a default profile, which will be based on the normal spark configs passed in. This default one can be used everywhere the user hasn't explicitly set the ResourceProfile. > Stage Level Sched: Add base ResourceProfile and Request classes > --- > > Key: SPARK-29415 > URL: https://issues.apache.org/jira/browse/SPARK-29415 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Assignee: Thomas Graves >Priority: Major > > this is just to add initial ResourceProfile, ExecutorResourceRequest and > taskResourceRequest classes that are used by the other parts of the code. > Initially we will have them private until we have other pieces in place. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
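The builder-style accumulation described above can be sketched in plain Python (illustrative only; method and class names here are invented, and the real API is the Scala/Java one proposed in the comment):

```python
# Illustrative sketch of a builder-style resource profile: require()
# accumulates executor- and task-level requests and returns self so
# calls can be chained. Not Spark's implementation.
class ResourceRequest:
    def __init__(self, resource_name, amount):
        self.resource_name = resource_name
        self.amount = amount

class ResourceProfileSketch:
    def __init__(self):
        self.executor_resources = {}
        self.task_resources = {}

    def require_executor(self, request):
        self.executor_resources[request.resource_name] = request
        return self  # builder style: enables chaining

    def require_task(self, request):
        self.task_resources[request.resource_name] = request
        return self

rp = (ResourceProfileSketch()
      .require_executor(ResourceRequest("memory", 2048))
      .require_executor(ResourceRequest("cores", 2))
      .require_executor(ResourceRequest("gpu", 1))
      .require_task(ResourceRequest("gpu", 1)))
print(sorted(rp.executor_resources))  # ['cores', 'gpu', 'memory']
```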
[jira] [Created] (SPARK-29574) spark with user provided hadoop doesn't work on kubernetes
Michał Wesołowski created SPARK-29574: - Summary: spark with user provided hadoop doesn't work on kubernetes Key: SPARK-29574 URL: https://issues.apache.org/jira/browse/SPARK-29574 Project: Spark Issue Type: Bug Components: Kubernetes Affects Versions: 2.4.4 Reporter: Michał Wesołowski When spark-submit is run with an image built with "hadoop free" spark and user-provided hadoop, it fails on kubernetes (hadoop libraries are not on spark's classpath). I downloaded spark [Pre-built with user-provided Apache Hadoop|https://www-us.apache.org/dist/spark/spark-2.4.4/spark-2.4.4-bin-without-hadoop.tgz]. I created a docker image with [docker-image-tool.sh|https://github.com/apache/spark/blob/master/bin/docker-image-tool.sh]. Based on this image (2.4.4-without-hadoop) I created another one with this Dockerfile: {code:java} FROM spark-py:2.4.4-without-hadoop ENV SPARK_HOME=/opt/spark/ # This is needed for newer kubernetes versions ADD https://repo1.maven.org/maven2/io/fabric8/kubernetes-client/4.6.1/kubernetes-client-4.6.1.jar $SPARK_HOME/jars COPY spark-env.sh /opt/spark/conf/spark-env.sh RUN chmod +x /opt/spark/conf/spark-env.sh RUN wget -qO- https://www-eu.apache.org/dist/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz | tar xz -C /opt/ ENV HADOOP_HOME=/opt/hadoop-3.2.1 ENV PATH=${HADOOP_HOME}/bin:${PATH} {code} Contents of spark-env.sh: {code:java} #!/usr/bin/env bash export SPARK_DIST_CLASSPATH=$(hadoop classpath):$HADOOP_HOME/share/hadoop/tools/lib/* export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native {code} spark-submit run with an image created this way fails, since spark-env.sh is overwritten by the [volume created when the pod starts|https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/KubernetesClientApplication.scala#L108] As a quick workaround I tried to modify the [entrypoint 
script|https://github.com/apache/spark/blob/ea8b5df47476fe66b63bd7f7bcd15acfb80bde78/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh] to run spark-env.sh during startup, and moved spark-env.sh to a different directory. The driver starts without issues in this setup; however, even though SPARK_DIST_CLASSPATH is set, the executor is constantly failing: {code:java} PS C:\Sandbox\projekty\roboticdrive-analytics\components\docker-images\spark-rda> kubectl logs rda-script-1571835692837-exec-12 ++ id -u + myuid=0 ++ id -g + mygid=0 + set +e ++ getent passwd 0 + uidentry=root:x:0:0:root:/root:/bin/ash + set -e + '[' -z root:x:0:0:root:/root:/bin/ash ']' + source /opt/spark-env.sh +++ hadoop classpath ++ export 'SPARK_DIST_CLASSPATH=/opt/hadoop-3.2.1/etc/hadoop:/opt/hadoop-3.2.1/share/hadoop/common/lib/*:/opt/hadoop-3.2.1/share/hadoop/common/*:/opt/hadoop-3.2.1/share/hadoop/hdfs:/opt/hadoop-3.2.1/share/hadoop/hdfs/lib/*:/opt/hadoop-3.2.1/share/hadoop/hdfs/*:/opt/hadoop-3.2.1/share/hadoop/mapreduce/lib/*:/opt/hadoop-3.2.1/share/hadoop/mapreduce/*:/opt/hadoop-3.2.1/share/hadoo++ SPARK_DIST_CLASSPATH='/opt/hadoop-3.2.1/etc/hadoop:/opt/hadoop-3.2.1/share/hadoop/common/lib/*:/opt/hadoop-3.2.1/share/hadoop/common/*:/opt/hadoop-3.2.1/share/hadoop/hdfs:/opt/hadoop-3.2.1/share/hadoop/hdfs/lib/*:/opt/hadoop-3.2.1/share/hadoop/hdfs/*:/opt/hadoop-3.2.1/share/hadoop/mapreduce/lib/*:/opt/hadoop-3.2.1/share/hadoop/mapreduce/*:/opt/hadoop-3.2.1/share/hadoop/yarn:/opt/hadoop-3.2.1/share/hadoop/yarn/lib/*:/opt/hadoop-3.2.1/share/hadoop/yarn/*:/opt/hadoop-3.2.1/share/hadoop/tools/lib/*' ++ export LD_LIBRARY_PATH=/opt/hadoop-3.2.1/lib/native ++ LD_LIBRARY_PATH=/opt/hadoop-3.2.1/lib/native ++ echo 
'SPARK_DIST_CLASSPATH=/opt/hadoop-3.2.1/etc/hadoop:/opt/hadoop-3.2.1/share/hadoop/common/lib/*:/opt/hadoop-3.2.1/share/hadoop/common/*:/opt/hadoop-3.2.1/share/hadoop/hdfs:/opt/hadoop-3.2.1/share/hadoop/hdfs/lib/*:/opt/hadoop-3.2.1/share/hadoop/hdfs/*:/opt/hadoop-3.2.1/share/hadoop/mapreduce/lib/*:/opt/hadoop-3.2.1/share/hadoop/mapreduce/*:/opt/hadoop-3.2.1/share/hadoop/yarn:/opt/hadoop-3.2.1/share/hadoop/yarn/lib/*:/opt/hadoop-3.2.1/share/hadoop/yarn/*:/opt/hadoop-3.2.1/share/hadoop/tools/lib/*' SPARK_DIST_CLASSPATH=/opt/hadoop-3.2.1/etc/hadoop:/opt/hadoop-3.2.1/share/hadoop/common/lib/*:/opt/hadoop-3.2.1/share/hadoop/common/*:/opt/hadoop-3.2.1/share/hadoop/hdfs:/opt/hadoop-3.2.1/share/hadoop/hdfs/lib/*:/opt/hadoop-3.2.1/share/hadoop/hdfs/*:/opt/hadoop-3.2.1/share/hadoop/mapreduce/lib/*:/opt/hadoop-3.2.1/share/hadoop/mapreduce/*:/opt/hadoop-3.2.1/share/hadoop/yarn:/opt/hadoop-3.2.1/share/hadoop/yarn/lib/*:/opt/hadoop-3.2.1/share/hadoop/yarn/*:/opt/hadoop-3.2.1/share/hadoop/tools/lib/* ++ echo LD_LIBRARY_PATH=/opt/hadoop-3.2.1/lib/native + SPARK_K8S_CMD=executor LD_LIBRARY_PATH=/opt/hadoop-3.2.1/lib/native + case "$SPARK_K8S_CMD" in + shift 1 + SPARK_CLASSPATH=':/opt/spark//jars/*' + env + sed 's/[^=]*=\(.*\)/\1/g' + sort -t_
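The reporter's workaround (moving spark-env.sh out of /opt/spark/conf and sourcing it from the entrypoint) can be restated as a Dockerfile fragment (a sketch: the paths follow the log above, everything else is an assumption):

```dockerfile
FROM spark-py:2.4.4-without-hadoop
# Keep the env script outside /opt/spark/conf: Spark-on-K8s mounts a
# config-map volume over that directory at pod startup, overwriting it.
COPY spark-env.sh /opt/spark-env.sh
RUN chmod +x /opt/spark-env.sh
# A patched entrypoint.sh then runs `source /opt/spark-env.sh` before the
# driver/executor command, as seen in the executor log above.
```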
[jira] [Assigned] (SPARK-29513) REFRESH TABLE should look up catalog/table like v2 commands
[ https://issues.apache.org/jira/browse/SPARK-29513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh reassigned SPARK-29513: --- Assignee: Terry Kim > REFRESH TABLE should look up catalog/table like v2 commands > --- > > Key: SPARK-29513 > URL: https://issues.apache.org/jira/browse/SPARK-29513 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Terry Kim >Assignee: Terry Kim >Priority: Major > > REFRESH TABLE should look up catalog/table like v2 commands -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29513) REFRESH TABLE should look up catalog/table like v2 commands
[ https://issues.apache.org/jira/browse/SPARK-29513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh resolved SPARK-29513. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26183 [https://github.com/apache/spark/pull/26183] > REFRESH TABLE should look up catalog/table like v2 commands > --- > > Key: SPARK-29513 > URL: https://issues.apache.org/jira/browse/SPARK-29513 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Terry Kim >Assignee: Terry Kim >Priority: Major > Fix For: 3.0.0 > > > REFRESH TABLE should look up catalog/table like v2 commands -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29557) Upgrade dropwizard metrics library to 3.2.6
[ https://issues.apache.org/jira/browse/SPARK-29557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16957945#comment-16957945 ] Luca Canali commented on SPARK-29557: - Upgrading Apache Spark to use dropwizard/codahale metrics library version 4.x or higher is currently blocked by the fact that the Ganglia reporter has been dropped by the Dropwizard metrics library in version 4.0. Dropwizard metrics library version 3.2 still includes a Ganglia reporter. > Upgrade dropwizard metrics library to 3.2.6 > --- > > Key: SPARK-29557 > URL: https://issues.apache.org/jira/browse/SPARK-29557 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: Luca Canali >Priority: Minor > > This proposes to upgrade the dropwizard/codahale metrics library version used > by Spark to a recent version, tentatively 4.1.1. Spark is currently using > Dropwizard metrics version 3.1.5, a version that is no longer actively > developed or maintained, according to the project's Github repo README. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29557) Upgrade dropwizard metrics library to 3.2.6
[ https://issues.apache.org/jira/browse/SPARK-29557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luca Canali updated SPARK-29557: Summary: Upgrade dropwizard metrics library to 3.2.6 (was: Upgrade dropwizard metrics library to 4.1.1) > Upgrade dropwizard metrics library to 3.2.6 > --- > > Key: SPARK-29557 > URL: https://issues.apache.org/jira/browse/SPARK-29557 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: Luca Canali >Priority: Minor > > This proposes to upgrade the dropwizard/codahale metrics library version used > by Spark to a recent version, tentatively 4.1.1. Spark is currently using > Dropwizard metrics version 3.1.5, a version that is no longer actively > developed or maintained, according to the project's Github repo README. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21287) Cannot use Int.MIN_VALUE as Spark SQL fetchsize
[ https://issues.apache.org/jira/browse/SPARK-21287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16957939#comment-16957939 ] Hu Fuwang commented on SPARK-21287: --- [~smilegator] [~srowen] Just submitted a PR for this: [https://github.com/apache/spark/pull/26230] Please help review. > Cannot use Int.MIN_VALUE as Spark SQL fetchsize > --- > > Key: SPARK-21287 > URL: https://issues.apache.org/jira/browse/SPARK-21287 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1 >Reporter: Maciej Bryński >Priority: Major > > The MySQL JDBC driver offers the possibility of not storing the ResultSet in > memory. > We can do this by setting fetchSize to Int.MIN_VALUE. > Unfortunately this configuration is rejected by Spark. > {code} > java.lang.IllegalArgumentException: requirement failed: Invalid value > `-2147483648` for parameter `fetchsize`. The minimum value is 0. When the > value is 0, the JDBC driver ignores the value and does the estimates. > at scala.Predef$.require(Predef.scala:224) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.(JDBCOptions.scala:105) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.(JDBCOptions.scala:34) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:32) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125) > at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:166) > at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:206) > at sun.reflect.GeneratedMethodAccessor46.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at 
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > at py4j.Gateway.invoke(Gateway.java:280) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:214) > at java.lang.Thread.run(Thread.java:748) > {code} > https://dev.mysql.com/doc/connector-j/5.1/en/connector-j-reference-implementation-notes.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
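The incompatibility above reduces to a range check: Spark validates `fetchsize >= 0`, while MySQL's driver treats Integer.MIN_VALUE as a "stream rows one at a time" sentinel. Here is a minimal Python sketch of a check that would admit the sentinel (hypothetical logic, not the actual patch in the PR above):

```python
INT_MIN = -2**31  # JDBC Integer.MIN_VALUE, MySQL's row-streaming sentinel

def validate_fetchsize(fetchsize):
    """Sketch of a JDBC-option check that, unlike the one in the stack
    trace above, admits MySQL's Integer.MIN_VALUE streaming sentinel."""
    if fetchsize >= 0 or fetchsize == INT_MIN:
        return fetchsize
    raise ValueError(
        f"Invalid value `{fetchsize}` for parameter `fetchsize`.")

print(validate_fetchsize(0))        # 0: the driver does its own estimates
print(validate_fetchsize(INT_MIN))  # -2147483648: stream row by row
```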
[jira] [Created] (SPARK-29573) Spark should work as PostgreSQL when using + Operator
ABHISHEK KUMAR GUPTA created SPARK-29573: Summary: Spark should work as PostgreSQL when using + Operator Key: SPARK-29573 URL: https://issues.apache.org/jira/browse/SPARK-29573 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: ABHISHEK KUMAR GUPTA Spark and PostgreSQL give different results when concatenating with + as below.
Spark: gives a NULL result
0: jdbc:hive2://10.18.19.208:23040/default> select * from emp12;
+-----+---------+
| id  | name    |
+-----+---------+
| 20  | test    |
| 10  | number  |
+-----+---------+
2 rows selected (3.683 seconds)
0: jdbc:hive2://10.18.19.208:23040/default> select id as ID, id+name as address from emp12;
+-----+----------+
| ID  | address  |
+-----+----------+
| 20  | NULL     |
| 10  | NULL     |
+-----+----------+
2 rows selected (0.649 seconds)
0: jdbc:hive2://10.18.19.208:23040/default> select id as ID, id+name as address from emp12;
+-----+----------+
| ID  | address  |
+-----+----------+
| 20  | NULL     |
| 10  | NULL     |
+-----+----------+
2 rows selected (0.406 seconds)
0: jdbc:hive2://10.18.19.208:23040/default> select id as ID, id+','+name as address from emp12;
+-----+----------+
| ID  | address  |
+-----+----------+
| 20  | NULL     |
| 10  | NULL     |
+-----+----------+
PostgreSQL: throws an error saying the operation is not supported
create table emp12(id int,name varchar(255));
insert into emp12 values(10,'number');
insert into emp12 values(20,'test');
select id as ID, id+','+name as address from emp12;
Output: invalid input syntax for integer: ","
create table emp12(id int,name varchar(255));
insert into emp12 values(10,'number');
insert into emp12 values(20,'test');
select id as ID, id+name as address from emp12;
Output: 42883: operator does not exist: integer + character varying
Spark should also throw an error if the operation is not supported. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
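The divergence above likely comes from implicit casting: Spark coerces the varchar operand to a numeric type for `+`, the cast of 'test' fails and yields NULL, and NULL then propagates through the addition, whereas PostgreSQL refuses to resolve `integer + varchar` at all. A small Python sketch of the two behaviors (illustrative, not either engine's code):

```python
def add_spark_style(a, b):
    """Spark-like: implicitly cast strings to numbers; a failed cast
    becomes None (SQL NULL), and NULL propagates through +."""
    def to_num(v):
        if isinstance(v, str):
            try:
                return float(v)
            except ValueError:
                return None  # cast failure -> NULL
        return v
    a, b = to_num(a), to_num(b)
    return None if a is None or b is None else a + b

def add_postgres_style(a, b):
    """PostgreSQL-like: no implicit integer + varchar; raise instead."""
    if isinstance(a, str) or isinstance(b, str):
        raise TypeError("operator does not exist: integer + character varying")
    return a + b

print(add_spark_style(20, "test"))  # None (NULL), matching the Spark output
```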
[jira] [Created] (SPARK-29572) add v1 read fallback API in DS v2
Wenchen Fan created SPARK-29572: --- Summary: add v1 read fallback API in DS v2 Key: SPARK-29572 URL: https://issues.apache.org/jira/browse/SPARK-29572 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15348) Hive ACID
[ https://issues.apache.org/jira/browse/SPARK-15348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16957904#comment-16957904 ] Abhishek Somani commented on SPARK-15348: - [~Kelvin.FE] This seems to be happening because you might have "hive.strict.managed.tables" set to true on the hive metastore server. You can either try setting it to false or running the above query as "create external table test.cars ... " instead of "create table". If you still face an issue or have more questions, please feel free to open an issue at [https://github.com/qubole/spark-acid/issues] > Hive ACID > - > > Key: SPARK-15348 > URL: https://issues.apache.org/jira/browse/SPARK-15348 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 1.6.3, 2.0.2, 2.1.2, 2.2.0, 2.3.0 >Reporter: Ran Haim >Priority: Major > > Spark does not support any feature of hive's transactional tables, > you cannot use spark to delete/update a table and it also has problems > reading the aggregated data when no compaction was done. > Also it seems that compaction is not supported - alter table ... partition > COMPACT 'major' -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
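The two workarounds from the comment can be sketched as follows (the table name comes from the thread; column definitions are elided as in the original):

```sql
-- Option 1 (metastore-side): set hive.strict.managed.tables=false in the
-- metastore server's configuration (hive-site.xml), then restart it.

-- Option 2: create the table as external instead of managed:
CREATE EXTERNAL TABLE test.cars ( ... );
```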
[jira] [Created] (SPARK-29571) Fix UT in AllExecutionsPageSuite class
Ankit Raj Boudh created SPARK-29571: --- Summary: Fix UT in AllExecutionsPageSuite class Key: SPARK-29571 URL: https://issues.apache.org/jira/browse/SPARK-29571 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 3.0.0 Reporter: Ankit Raj Boudh The UT "sorting should be successful" in class AllExecutionsPageSuite is failing due to an invalid assert condition. This needs to be handled. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29571) Fix UT in AllExecutionsPageSuite class
[ https://issues.apache.org/jira/browse/SPARK-29571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16957898#comment-16957898 ] Ankit Raj Boudh commented on SPARK-29571: - I will raise the PR soon > Fix UT in AllExecutionsPageSuite class > --- > > Key: SPARK-29571 > URL: https://issues.apache.org/jira/browse/SPARK-29571 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 3.0.0 >Reporter: Ankit Raj Boudh >Priority: Minor > > The UT "sorting should be successful" in class AllExecutionsPageSuite is > failing due to an invalid assert condition. This needs to be handled. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29570) Improve tooltip for Executor Tab for Shuffle Write,Blacklisted,Logs,Threaddump columns
[ https://issues.apache.org/jira/browse/SPARK-29570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16957897#comment-16957897 ] Ankit Raj Boudh commented on SPARK-29570: - I will fix this issue > Improve tooltip for Executor Tab for Shuffle > Write,Blacklisted,Logs,Threaddump columns > -- > > Key: SPARK-29570 > URL: https://issues.apache.org/jira/browse/SPARK-29570 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Minor > > When the user moves the mouse over the Shuffle Write, Blacklisted, Logs, and > Threaddump columns in the Executors tab, the tooltip is not displayed at the > center, whereas for the other columns it is. > Please fix this issue on all Spark Web UI pages and the History UI page. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29570) Improve tooltip for Executor Tab for Shuffle Write,Blacklisted,Logs,Threaddump columns
ABHISHEK KUMAR GUPTA created SPARK-29570: Summary: Improve tooltip for Executor Tab for Shuffle Write,Blacklisted,Logs,Threaddump columns Key: SPARK-29570 URL: https://issues.apache.org/jira/browse/SPARK-29570 Project: Spark Issue Type: Sub-task Components: Web UI Affects Versions: 3.0.0 Reporter: ABHISHEK KUMAR GUPTA When the user moves the mouse over the Shuffle Write, Blacklisted, Logs, and Threaddump columns in the Executors tab, the tooltip is not displayed at the center, whereas for the other columns it is. Please fix this issue on all Spark Web UI pages and the History UI page. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29499) Add mapPartitionsWithIndex for RDDBarrier
[ https://issues.apache.org/jira/browse/SPARK-29499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xingbo Jiang resolved SPARK-29499. -- Assignee: Xianyang Liu Resolution: Fixed > Add mapPartitionsWithIndex for RDDBarrier > - > > Key: SPARK-29499 > URL: https://issues.apache.org/jira/browse/SPARK-29499 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 2.4.4 >Reporter: Xianyang Liu >Assignee: Xianyang Liu >Priority: Major > > There is only one method in `RDDBarrier`. We often use the partition index as > a label for the current partition. We need to get the index from > `TaskContext` inside `mapPartitions`, which is not convenient. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29569) doc build fails with `/api/scala/lib/jquery.js` doesn't exist
[ https://issues.apache.org/jira/browse/SPARK-29569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16957785#comment-16957785 ] Hyukjin Kwon commented on SPARK-29569: -- This seems to start to happen after Scala 2.12 upgrade. It seems pretty critical since it's unable to generate the doc ... > doc build fails with `/api/scala/lib/jquery.js` doesn't exist > - > > Key: SPARK-29569 > URL: https://issues.apache.org/jira/browse/SPARK-29569 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.0.0 >Reporter: Xingbo Jiang >Priority: Blocker > Attachments: Screen Shot 2019-10-23 at 8.25.01 PM.png > > > Run `jekyll build` under `./spark/docs`, the command fail with the following > error message: > {code} > Making directory api/scala > cp -r ../target/scala-2.12/unidoc/. api/scala > Making directory api/java > cp -r ../target/javaunidoc/. api/java > Updating JavaDoc files for badge post-processing > Copying jquery.js from Scala API to Java API for page post-processing of > badges > jekyll 3.8.6 | Error: No such file or directory @ rb_sysopen - > ./api/scala/lib/jquery.js > {code} > This error only happens on master branch, the command works on branch-2.4 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29569) doc build fails with `/api/scala/lib/jquery.js` doesn't exist
[ https://issues.apache.org/jira/browse/SPARK-29569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-29569: - Component/s: docs > doc build fails with `/api/scala/lib/jquery.js` doesn't exist > - > > Key: SPARK-29569 > URL: https://issues.apache.org/jira/browse/SPARK-29569 > Project: Spark > Issue Type: Bug > Components: Build, docs >Affects Versions: 3.0.0 >Reporter: Xingbo Jiang >Priority: Blocker > Attachments: Screen Shot 2019-10-23 at 8.25.01 PM.png > > > Run `jekyll build` under `./spark/docs`, the command fail with the following > error message: > {code} > Making directory api/scala > cp -r ../target/scala-2.12/unidoc/. api/scala > Making directory api/java > cp -r ../target/javaunidoc/. api/java > Updating JavaDoc files for badge post-processing > Copying jquery.js from Scala API to Java API for page post-processing of > badges > jekyll 3.8.6 | Error: No such file or directory @ rb_sysopen - > ./api/scala/lib/jquery.js > {code} > This error only happens on master branch, the command works on branch-2.4 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29542) [SQL][DOC] The descriptions of `spark.sql.files.*` are confusing.
[ https://issues.apache.org/jira/browse/SPARK-29542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-29542: Assignee: feiwang > [SQL][DOC] The descriptions of `spark.sql.files.*` are confusing. > > > Key: SPARK-29542 > URL: https://issues.apache.org/jira/browse/SPARK-29542 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.4.4 >Reporter: feiwang >Assignee: feiwang >Priority: Minor > Attachments: screenshot-1.png > > > Hi, the description of `spark.sql.files.maxPartitionBytes` is shown below. > {code:java} > The maximum number of bytes to pack into a single partition when reading > files. > {code} > It suggests that each partition processes at most that many bytes in Spark SQL. > As shown in the attachment, the value of spark.sql.files.maxPartitionBytes > is 128MB. > For stage 1, its input is 16.3TB, but there are only 6400 tasks. > I checked the code; it is only effective for data source tables. > So, its description is confusing. > The same applies to all the descriptions of `spark.sql.files.*`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
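The mismatch the reporter describes can be checked with simple arithmetic: if `spark.sql.files.maxPartitionBytes` really capped every partition at 128MB, a 16.3TB input would need far more than 6400 tasks. A back-of-the-envelope sketch (assuming binary units for TB and MB, which the JIRA does not state explicitly):

```python
# Back-of-the-envelope check of the reporter's numbers (binary units assumed).
input_bytes = 16.3 * 1024**4           # stage 1 input: 16.3 TB
max_partition_bytes = 128 * 1024**2    # spark.sql.files.maxPartitionBytes: 128 MB

expected_tasks = input_bytes / max_partition_bytes
print(round(expected_tasks))  # number of partitions if the cap applied everywhere

actual_tasks = 6400
print(round(expected_tasks / actual_tasks))  # factor by which each task exceeds the "maximum"
```

The gap (on the order of 20x) is consistent with the reporter's observation that the setting is only honored for data source tables, so the blanket wording of the description is misleading.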
[jira] [Resolved] (SPARK-29542) [SQL][DOC] The descriptions of `spark.sql.files.*` are confusing.
[ https://issues.apache.org/jira/browse/SPARK-29542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-29542. -- Fix Version/s: 3.0.0 Resolution: Fixed Fixed in https://github.com/apache/spark/pull/26200 > [SQL][DOC] The descriptions of `spark.sql.files.*` are confusing. > > > Key: SPARK-29542 > URL: https://issues.apache.org/jira/browse/SPARK-29542 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.4.4 >Reporter: feiwang >Assignee: feiwang >Priority: Minor > Fix For: 3.0.0 > > Attachments: screenshot-1.png > > > Hi, the description of `spark.sql.files.maxPartitionBytes` is shown below. > {code:java} > The maximum number of bytes to pack into a single partition when reading > files. > {code} > It suggests that each partition processes at most that many bytes in Spark SQL. > As shown in the attachment, the value of spark.sql.files.maxPartitionBytes > is 128MB. > For stage 1, its input is 16.3TB, but there are only 6400 tasks. > I checked the code; it is only effective for data source tables. > So, its description is confusing. > The same applies to all the descriptions of `spark.sql.files.*`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29569) doc build fails with `/api/scala/lib/jquery.js` doesn't exist
[ https://issues.apache.org/jira/browse/SPARK-29569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16957772#comment-16957772 ] Hyukjin Kwon commented on SPARK-29569: -- I attached the ScalaDoc output from the current master. Seems like, at some point, the documentation style became completely different. > doc build fails with `/api/scala/lib/jquery.js` doesn't exist > - > > Key: SPARK-29569 > URL: https://issues.apache.org/jira/browse/SPARK-29569 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.0.0 >Reporter: Xingbo Jiang >Priority: Blocker > Attachments: Screen Shot 2019-10-23 at 8.25.01 PM.png > > > Run `jekyll build` under `./spark/docs`; the command fails with the following > error message: > {code} > Making directory api/scala > cp -r ../target/scala-2.12/unidoc/. api/scala > Making directory api/java > cp -r ../target/javaunidoc/. api/java > Updating JavaDoc files for badge post-processing > Copying jquery.js from Scala API to Java API for page post-processing of > badges > jekyll 3.8.6 | Error: No such file or directory @ rb_sysopen - > ./api/scala/lib/jquery.js > {code} > This error only happens on the master branch; the command works on branch-2.4 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29569) doc build fails with `/api/scala/lib/jquery.js` doesn't exist
[ https://issues.apache.org/jira/browse/SPARK-29569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-29569: - Attachment: Screen Shot 2019-10-23 at 8.25.01 PM.png > doc build fails with `/api/scala/lib/jquery.js` doesn't exist > - > > Key: SPARK-29569 > URL: https://issues.apache.org/jira/browse/SPARK-29569 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.0.0 >Reporter: Xingbo Jiang >Priority: Blocker > Attachments: Screen Shot 2019-10-23 at 8.25.01 PM.png > > > Run `jekyll build` under `./spark/docs`; the command fails with the following > error message: > {code} > Making directory api/scala > cp -r ../target/scala-2.12/unidoc/. api/scala > Making directory api/java > cp -r ../target/javaunidoc/. api/java > Updating JavaDoc files for badge post-processing > Copying jquery.js from Scala API to Java API for page post-processing of > badges > jekyll 3.8.6 | Error: No such file or directory @ rb_sysopen - > ./api/scala/lib/jquery.js > {code} > This error only happens on the master branch; the command works on branch-2.4 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29569) doc build fails with `/api/scala/lib/jquery.js` doesn't exist
[ https://issues.apache.org/jira/browse/SPARK-29569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16957731#comment-16957731 ] Xingbo Jiang commented on SPARK-29569: -- [~sowen][~dongjoon] Can you take a look at this issue? > doc build fails with `/api/scala/lib/jquery.js` doesn't exist > - > > Key: SPARK-29569 > URL: https://issues.apache.org/jira/browse/SPARK-29569 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.0.0 >Reporter: Xingbo Jiang >Priority: Blocker > > Run `jekyll build` under `./spark/docs`; the command fails with the following > error message: > {code} > Making directory api/scala > cp -r ../target/scala-2.12/unidoc/. api/scala > Making directory api/java > cp -r ../target/javaunidoc/. api/java > Updating JavaDoc files for badge post-processing > Copying jquery.js from Scala API to Java API for page post-processing of > badges > jekyll 3.8.6 | Error: No such file or directory @ rb_sysopen - > ./api/scala/lib/jquery.js > {code} > This error only happens on the master branch; the command works on branch-2.4 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29569) doc build fails with `/api/scala/lib/jquery.js` doesn't exist
[ https://issues.apache.org/jira/browse/SPARK-29569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xingbo Jiang updated SPARK-29569: - Summary: doc build fails with `/api/scala/lib/jquery.js` doesn't exist (was: doc build fails because `/api/scala/lib/jquery.js` doesn't exist) > doc build fails with `/api/scala/lib/jquery.js` doesn't exist > - > > Key: SPARK-29569 > URL: https://issues.apache.org/jira/browse/SPARK-29569 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.0.0 >Reporter: Xingbo Jiang >Priority: Blocker > > Run `jekyll build` under `./spark/docs`; the command fails with the following > error message: > {code} > Making directory api/scala > cp -r ../target/scala-2.12/unidoc/. api/scala > Making directory api/java > cp -r ../target/javaunidoc/. api/java > Updating JavaDoc files for badge post-processing > Copying jquery.js from Scala API to Java API for page post-processing of > badges > jekyll 3.8.6 | Error: No such file or directory @ rb_sysopen - > ./api/scala/lib/jquery.js > {code} > This error only happens on the master branch; the command works on branch-2.4 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29569) doc build fails because `/api/scala/lib/jquery.js` doesn't exist
Xingbo Jiang created SPARK-29569: Summary: doc build fails because `/api/scala/lib/jquery.js` doesn't exist Key: SPARK-29569 URL: https://issues.apache.org/jira/browse/SPARK-29569 Project: Spark Issue Type: Bug Components: Build Affects Versions: 3.0.0 Reporter: Xingbo Jiang Run `jekyll build` under `./spark/docs`; the command fails with the following error message: {code} Making directory api/scala cp -r ../target/scala-2.12/unidoc/. api/scala Making directory api/java cp -r ../target/javaunidoc/. api/java Updating JavaDoc files for badge post-processing Copying jquery.js from Scala API to Java API for page post-processing of badges jekyll 3.8.6 | Error: No such file or directory @ rb_sysopen - ./api/scala/lib/jquery.js {code} This error only happens on the master branch; the command works on branch-2.4 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29568) Add flag to stop existing stream when new copy starts
Burak Yavuz created SPARK-29568: --- Summary: Add flag to stop existing stream when new copy starts Key: SPARK-29568 URL: https://issues.apache.org/jira/browse/SPARK-29568 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 3.0.0 Reporter: Burak Yavuz In multi-tenant environments where you have multiple SparkSessions, you can accidentally start multiple copies of the same stream (i.e. streams using the same checkpoint location). This causes all new instantiations of the stream to fail. However, sometimes you may want to shut down the old stream, as it may have turned into a zombie (you no longer have access to the query handle or SparkSession). It would be nice to have a SQL flag that allows stopping the old stream in such zombie cases. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-15348) Hive ACID
[ https://issues.apache.org/jira/browse/SPARK-15348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16957679#comment-16957679 ] Zhaoyang Qin edited comment on SPARK-15348 at 10/23/19 9:15 AM: [~asomani] when I run the following code: {{scala> spark.sql("create table test.cars using HiveAcid options ('table' 'test.acidtbl')")}}, I got an AnalysisException (HiveException): `org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Table test.cars failed strict managed table checks due to the following reason: Table is marked as a managed table but is not transactional.);` Any help with this? Also, this cluster is HDP 3.0 based, with Spark 2.3.1 and Hive 3.0.0. was (Author: kelvin.fe): [~asomani] when i use the following codes:` {{scala> spark.sql("create table symlinkacidtable using HiveAcid options ('table' 'default.acidtbl')")}}`, i got a AnalysisException(HiveException) : `org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Table test.cars failed strict managed table checks due to the following reason: Table is marked as a managed table but is not transactional.);` Any help for this? Also, this cluster is HDP3.0 based,the Spark ver2.3.1 & hive 3.0.0. > Hive ACID > - > > Key: SPARK-15348 > URL: https://issues.apache.org/jira/browse/SPARK-15348 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 1.6.3, 2.0.2, 2.1.2, 2.2.0, 2.3.0 >Reporter: Ran Haim >Priority: Major > > Spark does not support any feature of Hive's transactional tables; > you cannot use Spark to delete/update a table, and it also has problems > reading the aggregated data when no compaction was done. > Also, it seems that compaction is not supported - alter table ... partition > COMPACT 'major' -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-15348) Hive ACID
[ https://issues.apache.org/jira/browse/SPARK-15348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16957679#comment-16957679 ] Zhaoyang Qin edited comment on SPARK-15348 at 10/23/19 9:15 AM: [~asomani] when I run the following code: {{scala> spark.sql("create table test.cars using HiveAcid options ('table' 'test.acidtbl')")}}, I got an AnalysisException (HiveException): `org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Table test.cars failed strict managed table checks due to the following reason: Table is marked as a managed table but is not transactional.);` Any help with this? Also, this cluster is HDP 3.0 based, with Spark 2.3.1 and Hive 3.0.0. was (Author: kelvin.fe): [~asomani] when i use the following codes:` {{scala> spark.sql("create table test.cars using HiveAcid options ('table' 'test.acidtbl')")}}`, i got a AnalysisException(HiveException) : `org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Table test.cars failed strict managed table checks due to the following reason: Table is marked as a managed table but is not transactional.);` Any help for this? Also, this cluster is HDP3.0 based,the Spark ver2.3.1 & hive 3.0.0. > Hive ACID > - > > Key: SPARK-15348 > URL: https://issues.apache.org/jira/browse/SPARK-15348 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 1.6.3, 2.0.2, 2.1.2, 2.2.0, 2.3.0 >Reporter: Ran Haim >Priority: Major > > Spark does not support any feature of Hive's transactional tables; > you cannot use Spark to delete/update a table, and it also has problems > reading the aggregated data when no compaction was done. > Also, it seems that compaction is not supported - alter table ... partition > COMPACT 'major' -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15348) Hive ACID
[ https://issues.apache.org/jira/browse/SPARK-15348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16957679#comment-16957679 ] Zhaoyang Qin commented on SPARK-15348: -- [~asomani] when I run the following code: {{scala> spark.sql("create table symlinkacidtable using HiveAcid options ('table' 'default.acidtbl')")}}, I got an AnalysisException (HiveException): `org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Table test.cars failed strict managed table checks due to the following reason: Table is marked as a managed table but is not transactional.);` Any help with this? Also, this cluster is HDP 3.0 based, with Spark 2.3.1 and Hive 3.0.0. > Hive ACID > - > > Key: SPARK-15348 > URL: https://issues.apache.org/jira/browse/SPARK-15348 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 1.6.3, 2.0.2, 2.1.2, 2.2.0, 2.3.0 >Reporter: Ran Haim >Priority: Major > > Spark does not support any feature of Hive's transactional tables; > you cannot use Spark to delete/update a table, and it also has problems > reading the aggregated data when no compaction was done. > Also, it seems that compaction is not supported - alter table ... partition > COMPACT 'major' -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
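The "failed strict managed table checks" error above comes from Hive 3's strict managed-table enforcement on HDP 3.x. As an editorial aside not confirmed in this thread: a commonly cited workaround (an assumption about the reporter's environment) is to relax the Hive property governing that check, or to create the table as EXTERNAL so the check does not apply:

```
# Hive 3 / HDP 3.x configuration sketch (assumption: verify the exact
# property name and implications for your distribution before changing it).
# Relaxing this disables the "strict managed table checks" in the error above.
hive.strict.managed.tables=false
```

Disabling the check cluster-wide has side effects for ACID enforcement, so creating the table as EXTERNAL is generally the safer option.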
[jira] [Resolved] (SPARK-29352) Move active streaming query state to the SharedState
[ https://issues.apache.org/jira/browse/SPARK-29352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz resolved SPARK-29352. - Fix Version/s: 3.0.0 Resolution: Fixed Resolved by [https://github.com/apache/spark/pull/26018] > Move active streaming query state to the SharedState > > > Key: SPARK-29352 > URL: https://issues.apache.org/jira/browse/SPARK-29352 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Burak Yavuz >Assignee: Burak Yavuz >Priority: Major > Fix For: 3.0.0 > > > We have checks to prevent the restarting of the same stream on the same spark > session, but we can actually make that better in multi-tenant environments by > actually putting that state in the SharedState instead of SessionState. This > would allow a more comprehensive check for multi-tenant clusters. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29564) Cluster deploy mode should support Spark Thrift server
[ https://issues.apache.org/jira/browse/SPARK-29564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lantao Jin updated SPARK-29564: --- Description: Cluster deploy mode is not applicable to the Spark Thrift server now. This restriction is too strict. In our production environment, we run multiple Spark Thrift servers as long-running services launched in yarn-cluster mode. The life cycle of each STS is managed by an upper-layer management system, which also dispatches users' JDBC connections to the appropriate STS. was: Cluster deploy mode is not applicable to Spark Thrift server from SPARK-21403. This restriction is too rude. In our production, we use multiple Spark Thrift servers as long running services which are used yarn-cluster mode to launch. The life cycle of STS is managed by upper layer manager system which is also used to dispatcher user's JDBC connection to applicable STS. SPARK-21403 banned this case. > Cluster deploy mode should support Spark Thrift server > -- > > Key: SPARK-29564 > URL: https://issues.apache.org/jira/browse/SPARK-29564 > Project: Spark > Issue Type: Bug > Components: Spark Submit, SQL >Affects Versions: 2.4.4, 3.0.0 >Reporter: Lantao Jin >Priority: Major > > Cluster deploy mode is not applicable to the Spark Thrift server now. This > restriction is too strict. > In our production environment, we run multiple Spark Thrift servers as long-running > services launched in yarn-cluster mode. The life cycle of each STS is > managed by an upper-layer management system, which also dispatches users' > JDBC connections to the appropriate STS. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29567) Update JDBC Integration Test Docker Images
[ https://issues.apache.org/jira/browse/SPARK-29567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29567: -- Summary: Update JDBC Integration Test Docker Images (was: Upgrade JDBC Integration Test Docker Images) > Update JDBC Integration Test Docker Images > -- > > Key: SPARK-29567 > URL: https://issues.apache.org/jira/browse/SPARK-29567 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29567) Upgrade JDBC Integration Test Docker Images
Dongjoon Hyun created SPARK-29567: - Summary: Upgrade JDBC Integration Test Docker Images Key: SPARK-29567 URL: https://issues.apache.org/jira/browse/SPARK-29567 Project: Spark Issue Type: Improvement Components: Tests Affects Versions: 3.0.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29565) OneHotEncoder should support single-column input/output
[ https://issues.apache.org/jira/browse/SPARK-29565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16957664#comment-16957664 ] zhengruifeng commented on SPARK-29565: -- [~huaxingao] In [https://github.com/apache/spark/pull/26064], I guess you may be interested in these tickets (SPARK-29565/SPARK-29566). If you would like to work on this, please feel free to ping me in the PRs. > OneHotEncoder should support single-column input/output > -- > > Key: SPARK-29565 > URL: https://issues.apache.org/jira/browse/SPARK-29565 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Minor > > Current feature algs > ({color:#5a6e5a}QuantileDiscretizer/Binarizer/Bucketizer/StringIndexer{color}) > are designed to support both single-col & multi-col. > And there are already some internal utils (like > {color:#c7a65d}checkSingleVsMultiColumnParams{color}) for this. > For OneHotEncoder, it's reasonable to support single-col. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29566) Imputer should support single-column input/output
[ https://issues.apache.org/jira/browse/SPARK-29566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-29566: - Description: Imputer should support single-column input/output; refer to https://issues.apache.org/jira/browse/SPARK-29565 was:Imputer should support single-column input/ouput > Imputer should support single-column input/output > > > Key: SPARK-29566 > URL: https://issues.apache.org/jira/browse/SPARK-29566 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Minor > > Imputer should support single-column input/output > refer to https://issues.apache.org/jira/browse/SPARK-29565 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29566) Imputer should support single-column input/output
zhengruifeng created SPARK-29566: Summary: Imputer should support single-column input/output Key: SPARK-29566 URL: https://issues.apache.org/jira/browse/SPARK-29566 Project: Spark Issue Type: New Feature Components: ML, PySpark Affects Versions: 3.0.0 Reporter: zhengruifeng Imputer should support single-column input/output -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29565) OneHotEncoder should support single-column input/output
zhengruifeng created SPARK-29565: Summary: OneHotEncoder should support single-column input/output Key: SPARK-29565 URL: https://issues.apache.org/jira/browse/SPARK-29565 Project: Spark Issue Type: New Feature Components: ML, PySpark Affects Versions: 3.0.0 Reporter: zhengruifeng Current feature algs ({color:#5a6e5a}QuantileDiscretizer/Binarizer/Bucketizer/StringIndexer{color}) are designed to support both single-col & multi-col. And there are already some internal utils (like {color:#c7a65d}checkSingleVsMultiColumnParams{color}) for this. For OneHotEncoder, it's reasonable to support single-col. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29564) Cluster deploy mode should support Spark Thrift server
[ https://issues.apache.org/jira/browse/SPARK-29564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lantao Jin updated SPARK-29564: --- Description: Cluster deploy mode is not applicable to the Spark Thrift server since SPARK-21403. This restriction is too strict. In our production environment, we run multiple Spark Thrift servers as long-running services launched in yarn-cluster mode. The life cycle of each STS is managed by an upper-layer management system, which also dispatches users' JDBC connections to the appropriate STS. SPARK-21403 banned this case. was: Cluster deploy mode is not applicable to Spark Thrift server from [SPARK-21403|https://issues.apache.org/jira/browse/SPARK-21403]. This restriction is too rude. In our production, we use multiple Spark Thrift servers as long running services which are used yarn-cluster mode to launch. The life cycle of STS is managed by upper layer manager system which is also used to dispatcher user's JDBC connection to applicable STS. SPARK-21403 banned this case. > Cluster deploy mode should support Spark Thrift server > -- > > Key: SPARK-29564 > URL: https://issues.apache.org/jira/browse/SPARK-29564 > Project: Spark > Issue Type: Bug > Components: Spark Submit, SQL >Affects Versions: 2.4.4, 3.0.0 >Reporter: Lantao Jin >Priority: Major > > Cluster deploy mode is not applicable to the Spark Thrift server since > SPARK-21403. This restriction is too strict. > In our production environment, we run multiple Spark Thrift servers as long-running > services launched in yarn-cluster mode. The life cycle of each STS is > managed by an upper-layer management system, which also dispatches users' > JDBC connections to the appropriate STS. SPARK-21403 banned this case. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29564) Cluster deploy mode should support to launch Spark Thrift server
Lantao Jin created SPARK-29564: -- Summary: Cluster deploy mode should support to launch Spark Thrift server Key: SPARK-29564 URL: https://issues.apache.org/jira/browse/SPARK-29564 Project: Spark Issue Type: Bug Components: Spark Submit, SQL Affects Versions: 2.4.4, 3.0.0 Reporter: Lantao Jin Cluster deploy mode is not applicable to the Spark Thrift server since [SPARK-21403|https://issues.apache.org/jira/browse/SPARK-21403]. This restriction is too strict. In our production environment, we run multiple Spark Thrift servers as long-running services launched in yarn-cluster mode. The life cycle of each STS is managed by an upper-layer management system, which also dispatches users' JDBC connections to the appropriate STS. SPARK-21403 banned this case. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29564) Cluster deploy mode should support Spark Thrift server
[ https://issues.apache.org/jira/browse/SPARK-29564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lantao Jin updated SPARK-29564: --- Summary: Cluster deploy mode should support Spark Thrift server (was: Cluster deploy mode should support to launch Spark Thrift server) > Cluster deploy mode should support Spark Thrift server > -- > > Key: SPARK-29564 > URL: https://issues.apache.org/jira/browse/SPARK-29564 > Project: Spark > Issue Type: Bug > Components: Spark Submit, SQL >Affects Versions: 2.4.4, 3.0.0 >Reporter: Lantao Jin >Priority: Major > > Cluster deploy mode is not applicable to the Spark Thrift server since > [SPARK-21403|https://issues.apache.org/jira/browse/SPARK-21403]. This > restriction is too strict. > In our production environment, we run multiple Spark Thrift servers as long-running > services launched in yarn-cluster mode. The life cycle of each STS is > managed by an upper-layer management system, which also dispatches users' > JDBC connections to the appropriate STS. SPARK-21403 banned this case. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21492) Memory leak in SortMergeJoin
[ https://issues.apache.org/jira/browse/SPARK-21492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-21492: Fix Version/s: 2.4.5 > Memory leak in SortMergeJoin > > > Key: SPARK-21492 > URL: https://issues.apache.org/jira/browse/SPARK-21492 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0, 2.3.0, 2.3.1, 3.0.0 >Reporter: Zhan Zhang >Assignee: Yuanjian Li >Priority: Major > Fix For: 2.4.5, 3.0.0 > > > In SortMergeJoin, if the iterator is not exhausted, there will be a memory leak > caused by the sort. The memory is not released until the task ends, and cannot > be used by other operators, causing a performance drop or OOM. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24666) Word2Vec generate infinity vectors when numIterations are large
[ https://issues.apache.org/jira/browse/SPARK-24666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16957623#comment-16957623 ] carlos yan commented on SPARK-24666: I also hit this issue, and my Spark version is 2.1.0. I trained on about 10 million records, with a vocabulary of about 1 million words. When numIterations > 10, the generated vectors contain *infinity* and *NaN*. > Word2Vec generate infinity vectors when numIterations are large > --- > > Key: SPARK-24666 > URL: https://issues.apache.org/jira/browse/SPARK-24666 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.3.1 > Environment: 2.0.X, 2.1.X, 2.2.X, 2.3.X >Reporter: ZhongYu >Priority: Critical > > We found that Word2Vec generates large-absolute-value vectors when > numIterations is large, and if numIterations is large enough (>20), the > vector's values may be *infinity (or -infinity)*, resulting in useless > vectors. > In normal situations, vector values are mainly around -1.0~1.0 when > numIterations = 1. > The bug is shown on Spark 2.0.X, 2.1.X, 2.2.X, 2.3.X. > There are already issues reporting this bug: > https://issues.apache.org/jira/browse/SPARK-5261 , but the fix seems to be > missing. > Other people's reports: > [https://stackoverflow.com/questions/49741956/infinity-vectors-in-spark-mllib-word2vec] > [http://apache-spark-user-list.1001560.n3.nabble.com/word2vec-outputs-Infinity-Infinity-vectors-with-increasing-iterations-td29020.html] > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
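Since diverged Word2Vec runs surface as infinities or NaNs in the exported vectors, a quick sanity check over the trained vectors catches the problem before they are used downstream. A minimal sketch in plain Python (the vector values here are illustrative, not from a real model):

```python
import math

def has_nonfinite(vec):
    """Return True if any component of the vector is NaN or +/- infinity."""
    return any(math.isnan(x) or math.isinf(x) for x in vec)

# Healthy vectors sit roughly in [-1.0, 1.0]; diverged ones contain inf/NaN.
print(has_nonfinite([0.12, -0.83, 0.37]))        # False
print(has_nonfinite([float("inf"), 0.1, -0.5]))  # True
print(has_nonfinite([float("nan"), 0.1, -0.5]))  # True
```

Applied to a model's exported word vectors, any True result indicates the training diverged, and lowering the iteration count (or step size) is the usual mitigation suggested in the linked reports.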
[jira] [Resolved] (SPARK-29324) saveAsTable with overwrite mode results in metadata loss
[ https://issues.apache.org/jira/browse/SPARK-29324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-29324. -- Resolution: Not A Problem > saveAsTable with overwrite mode results in metadata loss > > > Key: SPARK-29324 > URL: https://issues.apache.org/jira/browse/SPARK-29324 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Karuppayya >Priority: Major > > {code:java} > scala> spark.range(1).write.option("path", > "file:///tmp/tbl").format("orc").saveAsTable("tbl") > scala> spark.sql("desc extended tbl").collect.foreach(println) > [id,bigint,null] > [,,] > [# Detailed Table Information,,] > [Database,default,] > [Table,tbl,] > [Owner,karuppayyar,] > [Created Time,Wed Oct 02 09:29:06 IST 2019,] > [Last Access,UNKNOWN,] > [Created By,Spark 3.0.0-SNAPSHOT,] > [Type,EXTERNAL,] > [Provider,orc,] > [Location,file:/tmp/tbl_loc,] > [Serde Library,org.apache.hadoop.hive.ql.io.orc.OrcSerde,] > [InputFormat,org.apache.hadoop.hive.ql.io.orc.OrcInputFormat,] > [OutputFormat,org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat,] > {code} > {code:java} > // Overwriting table > scala> spark.range(100).write.mode("overwrite").saveAsTable("tbl") > scala> spark.sql("desc extended tbl").collect.foreach(println) > [id,bigint,null] > [,,] > [# Detailed Table Information,,] > [Database,default,] > [Table,tbl,] > [Owner,karuppayyar,] > [Created Time,Wed Oct 02 09:30:36 IST 2019,] > [Last Access,UNKNOWN,] > [Created By,Spark 3.0.0-SNAPSHOT,] > [Type,MANAGED,] > [Provider,parquet,] > [Location,file:/tmp/tbl,] > [Serde Library,org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe,] > [InputFormat,org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat,] > [OutputFormat,org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat,] > {code} > > > The first code block creates an EXTERNAL table in Orc format > The second code block overwrites it with more data > After the overwrite, > 1. 
The external table became a managed table. > 2. The file format changed from ORC to Parquet (the default file format). > Other information (such as the owner and TBLPROPERTIES) was also overwritten. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
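As a hedged workaround sketch for the report above (assuming the same table name and `/tmp/tbl` path from the reproduction, and a live spark-shell session): re-stating the format and path on every overwrite keeps the re-created table external and ORC, instead of letting it fall back to the managed/Parquet defaults. This is illustrative only, not a confirmed fix from the ticket.

```scala
// Hedged sketch: re-specify format and path on overwrite so the table is
// re-created as EXTERNAL ORC rather than falling back to the defaults
// (MANAGED, parquet). Assumes a running spark-shell session and the
// /tmp/tbl path from the report above.
spark.range(100)
  .write
  .mode("overwrite")
  .format("orc")                       // keep the original file format
  .option("path", "file:///tmp/tbl")   // keep the original location
  .saveAsTable("tbl")
```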
[jira] [Resolved] (SPARK-29546) Recover jersey-guava test dependency in docker-integration-tests
[ https://issues.apache.org/jira/browse/SPARK-29546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-29546. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26203 [https://github.com/apache/spark/pull/26203] > Recover jersey-guava test dependency in docker-integration-tests > > > Key: SPARK-29546 > URL: https://issues.apache.org/jira/browse/SPARK-29546 > Project: Spark > Issue Type: Sub-task > Components: Tests >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.0.0 > > > While SPARK-28737 upgrades `Jersey` to 2.29, `docker-integration-tests` is > broken because `com.spotify.docker-client` depends on `jersey-guava`. The > latest `com.spotify.docker-client` still depends on it as well. > - https://mvnrepository.com/artifact/com.spotify/docker-client/5.0.2 > -> > https://mvnrepository.com/artifact/org.glassfish.jersey.core/jersey-client/2.19 > -> > https://mvnrepository.com/artifact/org.glassfish.jersey.core/jersey-common/2.19 > -> > https://mvnrepository.com/artifact/org.glassfish.jersey.bundles.repackaged/jersey-guava/2.19 > **AFTER** > {code} > build/mvn -Pdocker-integration-tests -pl :spark-docker-integration-tests_2.12 > -Dtest=none > -DwildcardSuites=org.apache.spark.sql.jdbc.PostgresIntegrationSuite test > Tests: succeeded 6, failed 0, canceled 0, ignored 0, pending 0 > All tests passed. > {code}
[jira] [Commented] (SPARK-29106) Add jenkins arm test for spark
[ https://issues.apache.org/jira/browse/SPARK-29106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16957603#comment-16957603 ] zhao bo commented on SPARK-29106: - Hi [~shaneknapp], sorry to disturb you. I have some questions about the upcoming work that I would like to discuss with you. # For the pyspark tests, you mentioned that we didn't install any Python packages for testing. Is there a "requirements.txt" or "test-requirements.txt" in the spark repo? I failed to find one. When we ran the pyspark tests before, we only realized we needed to install the numpy package with pip because the failure messages from the test scripts told us so. So when you mentioned "pyspark testing debs", did you mean we should figure them all out manually? Do you have any suggestions? # For the sparkR tests, we built a newer R version (3.6.1) by fixing many library dependencies and got it working, then ran the R test scripts until all of them passed. So we wonder what difficulties we will face when we actually run these tests in amplab; could you please share more with us? # For the current periodic jobs, you said they will be triggered twice per day, and each build will take at most 11 hours. I have a thought about the next job deployment and would like to hear your opinion: we could set up two jobs per day, one being the current Maven UT test triggered by SCM changes (11h), and the other running the pyspark and sparkR tests, also triggered by SCM changes (including the Spark build and tests, which may take 5-6 hours). How does that sound? We can discuss further if any of this turns out to be more difficult than we realize. Thanks very much, shane, and I hope you can reply when you are free. ;) > Add jenkins arm test for spark > -- > > Key: SPARK-29106 > URL: https://issues.apache.org/jira/browse/SPARK-29106 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 3.0.0 >Reporter: huangtianhua >Priority: Minor > > Add arm test jobs to amplab jenkins for spark.
> So far we have made two periodic ARM test jobs for spark in OpenLab: one is > based on master with hadoop 2.7 (similar to the QA test on amplab jenkins), > and the other is based on a new branch we created on 09-09, see > [http://status.openlabtesting.org/builds/job/spark-master-unit-test-hadoop-2.7-arm64] > and > [http://status.openlabtesting.org/builds/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64] > We only have to care about the first one when integrating the ARM test with amplab > jenkins. > About the k8s test on ARM, we have tested it, see > [https://github.com/theopenlab/spark/pull/17]; maybe we can integrate it > later. > We also plan to test other stable branches, and we can integrate them into > amplab when they are ready. > We have offered an ARM instance and sent the details to shane knapp; thanks to > shane for adding the first ARM job to amplab jenkins :) > The other important thing is leveldbjni > [https://github.com/fusesource/leveldbjni] (see also [https://github.com/fusesource/leveldbjni/issues/80]): > spark depends on leveldbjni-all-1.8 > [https://mvnrepository.com/artifact/org.fusesource.leveldbjni/leveldbjni-all/1.8], > which has no arm64 support. So we built an arm64-supporting > release of leveldbjni, see > [https://mvnrepository.com/artifact/org.openlabtesting.leveldbjni/leveldbjni-all/1.8], > but we can't modify the spark pom.xml directly with something like a > 'property'/'profile' to choose the correct jar on ARM or x86, > because spark depends on hadoop packages such as hadoop-hdfs, and those packages > depend on leveldbjni-all-1.8 too, unless hadoop releases with a new > ARM-supporting leveldbjni jar. For now we download the leveldbjni-all-1.8 from > openlabtesting and 'mvn install' it when testing spark on ARM.
> PS: The issues found and fixed: > SPARK-28770 > [https://github.com/apache/spark/pull/25673] > > SPARK-28519 > [https://github.com/apache/spark/pull/25279] > > SPARK-28433 > [https://github.com/apache/spark/pull/25186] > > SPARK-28467 > [https://github.com/apache/spark/pull/25864] > > SPARK-29286 > [https://github.com/apache/spark/pull/26021] > >
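The pom.xml limitation described in the thread above can be made concrete with a sketch. The artifact coordinates come from the comment; the profile itself is hypothetical and, as the thread explains, would not actually work for Spark because transitive Hadoop dependencies still pin the org.fusesource artifact.

```xml
<!-- Hypothetical sketch only: OS-arch-activated Maven profiles that swap the
     leveldbjni groupId. The thread notes this does NOT solve Spark's case,
     because hadoop-hdfs and friends still pull in
     org.fusesource.leveldbjni:leveldbjni-all:1.8 transitively. -->
<profiles>
  <profile>
    <id>arm64</id>
    <activation>
      <os><arch>aarch64</arch></os>
    </activation>
    <properties>
      <leveldbjni.group>org.openlabtesting.leveldbjni</leveldbjni.group>
    </properties>
  </profile>
  <profile>
    <id>x86</id>
    <activation>
      <os><arch>amd64</arch></os>
    </activation>
    <properties>
      <leveldbjni.group>org.fusesource.leveldbjni</leveldbjni.group>
    </properties>
  </profile>
</profiles>

<dependency>
  <groupId>${leveldbjni.group}</groupId>
  <artifactId>leveldbjni-all</artifactId>
  <version>1.8</version>
</dependency>
```

This is why the thread settles on `mvn install`-ing the openlabtesting artifact locally instead: profile-based selection cannot rewrite what Hadoop's own poms declare.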
[jira] [Assigned] (SPARK-29093) Remove automatically generated param setters in _shared_params_code_gen.py
[ https://issues.apache.org/jira/browse/SPARK-29093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng reassigned SPARK-29093: Assignee: Huaxin Gao > Remove automatically generated param setters in _shared_params_code_gen.py > -- > > Key: SPARK-29093 > URL: https://issues.apache.org/jira/browse/SPARK-29093 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Assignee: Huaxin Gao >Priority: Major > > The main difference between the Scala and Python sides comes from the automatically > generated param setters in _shared_params_code_gen.py. > To keep them in sync, we should remove those setters in _shared_.py, and add > the corresponding setters manually.
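The manual-setter pattern the ticket describes can be sketched in plain Python. The class and param names below are illustrative, not Spark's actual `_shared.py` code: the point is that each setter is written by hand, mirroring the Scala side, rather than emitted by `_shared_params_code_gen.py`.

```python
# Sketch of the manual-setter pattern from SPARK-29093: each mixin class
# defines its setter explicitly instead of having it code-generated.
# Names (HasMaxIter, setMaxIter, _paramMap) are illustrative only.
class HasMaxIter:
    """Mixin holding the maxIter param, with a hand-written setter."""

    def __init__(self):
        self._paramMap = {}

    def _set(self, **kwargs):
        """Store param values; return self to allow setter chaining."""
        self._paramMap.update(kwargs)
        return self

    def setMaxIter(self, value):
        """Sets the value of maxIter (manually written, not generated)."""
        return self._set(maxIter=value)


est = HasMaxIter().setMaxIter(10)
print(est._paramMap["maxIter"])
```

The design choice is the same trade-off the ticket names: a little duplication per class in exchange for setters that are visible in the source and easy to keep in sync with the Scala API.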
[jira] [Commented] (SPARK-29093) Remove automatically generated param setters in _shared_params_code_gen.py
[ https://issues.apache.org/jira/browse/SPARK-29093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16957601#comment-16957601 ] zhengruifeng commented on SPARK-29093: -- [~huaxingao] Thanks! > Remove automatically generated param setters in _shared_params_code_gen.py > -- > > Key: SPARK-29093 > URL: https://issues.apache.org/jira/browse/SPARK-29093 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Major > > The main difference between the Scala and Python sides comes from the automatically > generated param setters in _shared_params_code_gen.py. > To keep them in sync, we should remove those setters in _shared_.py, and add > the corresponding setters manually.
[jira] [Commented] (SPARK-23171) Reduce the time costs of the rule runs that do not change the plans
[ https://issues.apache.org/jira/browse/SPARK-23171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16957593#comment-16957593 ] Takeshi Yamamuro commented on SPARK-23171: -- oh, nice, the performance looks much better. > Reduce the time costs of the rule runs that do not change the plans > > > Key: SPARK-23171 > URL: https://issues.apache.org/jira/browse/SPARK-23171 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > Labels: bulk-closed > > Below is the time stats of Analyzer/Optimizer rules. Try to improve the rules > and reduce the time costs, especially for the runs that do not change the > plans. > {noformat} > === Metrics of Analyzer/Optimizer Rules === > Total number of runs = 175827 > Total time: 20.699042877 seconds > Rule > Total Time Effective Time Total Runs > Effective Runs > org.apache.spark.sql.catalyst.optimizer.ColumnPruning > 2340563794 1338268224 1875 > 761 > org.apache.spark.sql.catalyst.analysis.Analyzer$CTESubstitution > 1632672623 1625071881 788 > 37 > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions > 1395087131 347339931 1982 > 38 > org.apache.spark.sql.catalyst.optimizer.PruneFilters > 1177711364 21344174 1590 > 3 > org.apache.spark.sql.catalyst.optimizer.Optimizer$OptimizeSubqueries > 1145135465 1131417128 285 > 39 > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences > 1008347217 663112062 1982 > 616 > org.apache.spark.sql.catalyst.optimizer.ReorderJoin > 767024424 693001699 1590 > 132 > org.apache.spark.sql.catalyst.analysis.Analyzer$FixNullability > 598524650 40802876 742 > 12 > org.apache.spark.sql.catalyst.analysis.DecimalPrecision > 595384169 436153128 1982 > 211 > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubquery > 548178270 459695885 1982 > 49 > org.apache.spark.sql.catalyst.analysis.TypeCoercion$ImplicitTypeCasts > 423002864 139869503 1982 > 86 > 
org.apache.spark.sql.catalyst.optimizer.BooleanSimplification > 405544962 17250184 1590 > 7 > org.apache.spark.sql.catalyst.optimizer.PushPredicateThroughJoin > 383837603 284174662 1590 > 708 > org.apache.spark.sql.catalyst.optimizer.RemoveRedundantAliases > 372901885 3362332 1590 > 9 > org.apache.spark.sql.catalyst.optimizer.InferFiltersFromConstraints > 364628214 343815519 285 > 192 > org.apache.spark.sql.execution.datasources.FindDataSourceTable > 303293296 285344766 1982 > 233 > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions > 233195019 92648171 1982 > 294 > org.apache.spark.sql.catalyst.analysis.TypeCoercion$FunctionArgumentConversion > 220568919 73932736 1982 > 38 > org.apache.spark.sql.catalyst.optimizer.NullPropagation > 207976072 9072305 1590 >