[jira] [Resolved] (SPARK-30730) Wrong results of `converTz` for different session and system time zones
[ https://issues.apache.org/jira/browse/SPARK-30730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maxim Gekk resolved SPARK-30730.
--------------------------------
    Resolution: Won't Fix

> Wrong results of `converTz` for different session and system time zones
> -----------------------------------------------------------------------
>
>                 Key: SPARK-30730
>                 URL: https://issues.apache.org/jira/browse/SPARK-30730
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Maxim Gekk
>            Priority: Major
>
> Currently, DateTimeUtils.convertTz() assumes that timestamp strings are cast
> to TimestampType using the JVM system time zone, but in fact the session
> time zone defined by the SQL config *spark.sql.session.timeZone* is used in
> the casting. This leads to wrong results of from_utc_timestamp and
> to_utc_timestamp when the session time zone differs from the JVM time zone.
> The issue can be reproduced by the following code:
> {code:java}
> test("to_utc_timestamp in various system and session time zones") {
>   val localTs = "2020-02-04T22:42:10"
>   val defaultTz = TimeZone.getDefault
>   try {
>     DateTimeTestUtils.outstandingTimezonesIds.foreach { systemTz =>
>       TimeZone.setDefault(DateTimeUtils.getTimeZone(systemTz))
>       DateTimeTestUtils.outstandingTimezonesIds.foreach { sessionTz =>
>         withSQLConf(
>           SQLConf.DATETIME_JAVA8API_ENABLED.key -> "true",
>           SQLConf.SESSION_LOCAL_TIMEZONE.key -> sessionTz) {
>           DateTimeTestUtils.outstandingTimezonesIds.foreach { toTz =>
>             val instant = LocalDateTime
>               .parse(localTs)
>               .atZone(DateTimeUtils.getZoneId(toTz))
>               .toInstant
>             val df = Seq(localTs).toDF("localTs")
>             val res = df.select(to_utc_timestamp(col("localTs"), toTz)).first().apply(0)
>             if (instant != res) {
>               println(s"system = $systemTz session = $sessionTz to = $toTz")
>             }
>           }
>         }
>       }
>     }
>   } catch {
>     case NonFatal(_) => TimeZone.setDefault(defaultTz)
>   }
> }
> {code}
> {code:java}
> system = UTC session = PST to = UTC
> system = UTC session = PST to = PST
> system = UTC session = PST to = CET
> system = UTC session = PST to = Africa/Dakar
> system = UTC session = PST to = America/Los_Angeles
> system = UTC session = PST to = Antarctica/Vostok
> system = UTC session = PST to = Asia/Hong_Kong
> system = UTC session = PST to = Europe/Amsterdam
> ...
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
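The mismatch reported above can be illustrated outside Spark with plain Python: parsing the same local timestamp string against the session time zone versus the JVM system time zone yields instants that differ by the zone offset. This is only an illustrative sketch, not Spark's implementation; a fixed -8:00 offset stands in for PST (DST ignored).

```python
from datetime import datetime, timedelta, timezone

local_ts = "2020-02-04T22:42:10"
naive = datetime.fromisoformat(local_ts)

# Session time zone (spark.sql.session.timeZone) -- UTC in this example.
session_tz = timezone.utc
# JVM system time zone -- a fixed -8:00 offset stands in for PST (DST ignored).
system_tz = timezone(timedelta(hours=-8))

# The same wall-clock string maps to two different instants depending on
# which zone is assumed during the cast to TimestampType.
instant_session = naive.replace(tzinfo=session_tz)
instant_system = naive.replace(tzinfo=system_tz)

diff_hours = (instant_system - instant_session).total_seconds() / 3600
print(diff_hours)  # the two assumptions disagree by the zone offset
```

Any code path that assumes the system zone while the cast actually used the session zone is off by exactly this difference, which matches the ticket's observation that results are wrong whenever the two zones differ.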
[jira] [Resolved] (SPARK-30708) first_value/last_value window function throws ParseException
[ https://issues.apache.org/jira/browse/SPARK-30708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

jiaan.geng resolved SPARK-30708.
--------------------------------
    Resolution: Won't Fix

> first_value/last_value window function throws ParseException
> ------------------------------------------------------------
>
>                 Key: SPARK-30708
>                 URL: https://issues.apache.org/jira/browse/SPARK-30708
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: jiaan.geng
>            Priority: Major
>
> first_value/last_value throws a ParseException:
> {code:java}
> SELECT first_value(unique1) over w,
>        last_value(unique1) over w, unique1, four
> FROM tenk1 WHERE unique1 < 10
> WINDOW w AS (order by four range between current row and unbounded following)
>
> org.apache.spark.sql.catalyst.parser.ParseException
> no viable alternative at input 'first_value'(line 1, pos 7)
>
> == SQL ==
> SELECT first_value(unique1) over w,
> -------^^^
> last_value(unique1) over w, unique1, four
> FROM tenk1 WHERE unique1 < 10
> WINDOW w AS (order by four range between current row and unbounded following)
> {code}
> Maybe we need to fix this issue.
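For reference, the query shape above (a named WINDOW clause) is standard SQL; the failure is a parser limitation, not a problem with the query. As a sanity check of the syntax itself (not of Spark), SQLite 3.25+ accepts the same shape, shown here via Python's sqlite3 with a tiny stand-in for the tenk1 table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tenk1 (unique1 INTEGER, four INTEGER)")
conn.executemany("INSERT INTO tenk1 VALUES (?, ?)",
                 [(i, i % 4) for i in range(10)])

# The same named-WINDOW query the ticket reports as unparseable in Spark.
rows = conn.execute("""
    SELECT first_value(unique1) OVER w,
           last_value(unique1) OVER w,
           unique1, four
    FROM tenk1 WHERE unique1 < 10
    WINDOW w AS (ORDER BY four
                 RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
""").fetchall()
print(len(rows))  # one result row per input row
```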
[jira] [Commented] (SPARK-30619) org.slf4j.Logger and org.apache.commons.collections classes not built as part of hadoop-provided profile
[ https://issues.apache.org/jira/browse/SPARK-30619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17032106#comment-17032106 ]

Abhishek Rao commented on SPARK-30619:
--------------------------------------

Hi, any updates on this?

> org.slf4j.Logger and org.apache.commons.collections classes not built as part
> of hadoop-provided profile
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-30619
>                 URL: https://issues.apache.org/jira/browse/SPARK-30619
>             Project: Spark
>          Issue Type: Bug
>          Components: Build
>    Affects Versions: 2.4.2, 2.4.4
>         Environment: Spark on kubernetes
>            Reporter: Abhishek Rao
>            Priority: Major
>
> We're using spark-2.4.4-bin-without-hadoop.tgz and executing the Java word
> count example (org.apache.spark.examples.JavaWordCount) on local files. But
> we're seeing that it expects the org.slf4j.Logger and
> org.apache.commons.collections classes to be available for this. We expected
> the binary to work as-is for local files. Is there anything we're missing?
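For context, the "without-hadoop" (Hadoop-free) builds deliberately omit Hadoop and its transitive dependencies, which include slf4j and commons-collections; the documented way to run them is to point Spark at an existing Hadoop installation's classpath. A sketch (the install location of `hadoop` is an assumption):

```shell
# conf/spark-env.sh -- make the Hadoop-provided jars (slf4j,
# commons-collections, etc.) visible to a "without-hadoop" Spark
# distribution. Requires a local Hadoop install with `hadoop` on PATH.
export SPARK_DIST_CLASSPATH="$(hadoop classpath)"
```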
[jira] [Commented] (SPARK-30712) Estimate sizeInBytes from file metadata for parquet files
[ https://issues.apache.org/jira/browse/SPARK-30712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17032102#comment-17032102 ]

Hyukjin Kwon commented on SPARK-30712:
--------------------------------------

SPARK-24914 JIRA and PR are still open.

> Estimate sizeInBytes from file metadata for parquet files
> ---------------------------------------------------------
>
>                 Key: SPARK-30712
>                 URL: https://issues.apache.org/jira/browse/SPARK-30712
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: liupengcheng
>            Priority: Major
>
> Currently, Spark uses a compressionFactor when calculating `sizeInBytes`
> for `HadoopFsRelation`, but this is not accurate and it's hard to choose the
> best `compressionFactor`. Sometimes this can cause OOMs due to an improper
> BroadcastHashJoin. So I propose to use the rowCount in the BlockMetadata to
> estimate the in-memory size, which can be more accurate.
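The gap between the two estimates can be sketched with plain arithmetic. All numbers below are hypothetical: the current approach scales on-disk bytes by a fixed compression factor, while the proposal multiplies the parquet footer's row count by an estimated per-row in-memory size.

```python
# Hypothetical numbers for a highly compressed parquet file.
on_disk_bytes = 10 * 1024 * 1024   # 10 MiB on disk
compression_factor = 1.0           # a fixed, hard-to-choose factor
row_count = 50_000_000             # from parquet row-group (block) metadata
est_row_size = 48                  # assumed per-row in-memory width from the schema

# Current estimate: on-disk size scaled by a fixed factor.
size_from_factor = int(on_disk_bytes * compression_factor)

# Proposed estimate: rows times estimated in-memory row width.
size_from_rows = row_count * est_row_size

print(size_from_factor, size_from_rows)
# The factor-based estimate can undershoot by orders of magnitude, which is
# what makes the planner pick a BroadcastHashJoin it cannot afford.
```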
[jira] [Commented] (SPARK-30712) Estimate sizeInBytes from file metadata for parquet files
[ https://issues.apache.org/jira/browse/SPARK-30712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17032076#comment-17032076 ]

liupengcheng commented on SPARK-30712:
--------------------------------------

[~hyukjin.kwon] SPARK-24914 seems to be closed already; I left some comments there. I also created another related Jira: [SPARK-30394|https://issues.apache.org/jira/browse/SPARK-30394]

> Estimate sizeInBytes from file metadata for parquet files
> ---------------------------------------------------------
>
>                 Key: SPARK-30712
>                 URL: https://issues.apache.org/jira/browse/SPARK-30712
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: liupengcheng
>            Priority: Major
>
> Currently, Spark uses a compressionFactor when calculating `sizeInBytes`
> for `HadoopFsRelation`, but this is not accurate and it's hard to choose the
> best `compressionFactor`. Sometimes this can cause OOMs due to an improper
> BroadcastHashJoin. So I propose to use the rowCount in the BlockMetadata to
> estimate the in-memory size, which can be more accurate.
[jira] [Updated] (SPARK-30394) Skip collecting stats in DetermineTableStats rule when hive table is convertible to datasource tables
[ https://issues.apache.org/jira/browse/SPARK-30394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

liupengcheng updated SPARK-30394:
---------------------------------
    Description:
Currently, if `spark.sql.statistics.fallBackToHdfs` is enabled, Spark will scan HDFS files to collect table stats in the `DetermineTableStats` rule. But this can be expensive and inaccurate (it only sees the file size on disk, not accounting for the compression factor); actually we can skip this when the Hive table can be converted to a datasource table (parquet etc.) and do a better estimation in `HadoopFsRelation`.

Before SPARK-28573, the implementation would update the CatalogTableStatistics, which caused improper stats (for parquet, this size is much smaller than the real in-memory size) to be used in join selection when the Hive table could be converted to a datasource table.

In our production environment, users' highly compressed parquet tables can cause OOMs when doing `broadcastHashJoin` due to these improper stats.

  was:
Currently, if `spark.sql.statistics.fallBackToHdfs` is enabled, Spark will scan HDFS files to collect table stats in the `DetermineTableStats` rule. But this can be expensive in some cases; actually we can skip this when the Hive table can be converted to a datasource table (parquet etc.).

Before [SPARK-28573|https://issues.apache.org/jira/browse/SPARK-28573], the implementation would update the CatalogTableStatistics, which caused improper stats to be used in join selection when the Hive table could be converted to a datasource table.

In our production environment, users' highly compressed parquet tables can cause OOMs when doing `broadcastHashJoin` due to these improper stats.

> Skip collecting stats in DetermineTableStats rule when hive table is
> convertible to datasource tables
> ---------------------------------------------------------------------
>
>                 Key: SPARK-30394
>                 URL: https://issues.apache.org/jira/browse/SPARK-30394
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.2, 3.0.0
>            Reporter: liupengcheng
>            Priority: Major
>
> Currently, if `spark.sql.statistics.fallBackToHdfs` is enabled, Spark will
> scan HDFS files to collect table stats in the `DetermineTableStats` rule.
> But this can be expensive and inaccurate (it only sees the file size on
> disk, not accounting for the compression factor); actually we can skip this
> when the Hive table can be converted to a datasource table (parquet etc.)
> and do a better estimation in `HadoopFsRelation`.
> Before SPARK-28573, the implementation would update the
> CatalogTableStatistics, which caused improper stats (for parquet, this size
> is much smaller than the real in-memory size) to be used in join selection
> when the Hive table could be converted to a datasource table.
> In our production environment, users' highly compressed parquet tables can
> cause OOMs when doing `broadcastHashJoin` due to these improper stats.
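Until such a change lands, one workaround for the OOM scenario described above is simply not to enable the HDFS fallback, so `DetermineTableStats` never scans the files; sketched here as a spark-defaults fragment (the conf key exists, the choice of value is the assumption):

```
# spark-defaults.conf -- leave the HDFS-size fallback off so DetermineTableStats
# does not scan files; join selection then relies on catalog statistics
# (e.g. from ANALYZE TABLE) instead of raw on-disk sizes.
spark.sql.statistics.fallBackToHdfs  false
```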
[jira] [Commented] (SPARK-30712) Estimate sizeInBytes from file metadata for parquet files
[ https://issues.apache.org/jira/browse/SPARK-30712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17032066#comment-17032066 ]

liupengcheng commented on SPARK-30712:
--------------------------------------

OK, thanks! [~hyukjin.kwon]

> Estimate sizeInBytes from file metadata for parquet files
> ---------------------------------------------------------
>
>                 Key: SPARK-30712
>                 URL: https://issues.apache.org/jira/browse/SPARK-30712
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: liupengcheng
>            Priority: Major
>
> Currently, Spark uses a compressionFactor when calculating `sizeInBytes`
> for `HadoopFsRelation`, but this is not accurate and it's hard to choose the
> best `compressionFactor`. Sometimes this can cause OOMs due to an improper
> BroadcastHashJoin. So I propose to use the rowCount in the BlockMetadata to
> estimate the in-memory size, which can be more accurate.
[jira] [Resolved] (SPARK-30735) Improving writing performance by adding repartition based on columns to partitionBy for DataFrameWriter
[ https://issues.apache.org/jira/browse/SPARK-30735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-30735.
----------------------------------
    Resolution: Won't Fix

> Improving writing performance by adding repartition based on columns to
> partitionBy for DataFrameWriter
> -----------------------------------------------------------------------
>
>                 Key: SPARK-30735
>                 URL: https://issues.apache.org/jira/browse/SPARK-30735
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.1.0
>         Environment: * Spark-3.0.0
> * Scala: version 2.12.10
> * sbt 0.13.18, script ver: 1.3.7 (Built using sbt)
> * Java: 1.8.0_231
> ** Java(TM) SE Runtime Environment (build 1.8.0_231-b11)
> ** Java HotSpot(TM) 64-Bit Server VM (build 25.231-b11, mixed mode)
>            Reporter: Tomohiro Tanaka
>            Priority: Trivial
>              Labels: performance, pull-request-available
>         Attachments: repartition-before-partitionby.png
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> h1. New functionality for {{partitionBy}}
> To improve {{partitionBy}} performance, calling the {{repartition}} method
> on the partition columns before {{partitionBy}} helps a lot. I added a new
> boolean-flag variant of {{partitionBy}} (see the example below).
>
> h2. Problems when not using {{repartition}} before {{partitionBy}}
> When using {{partitionBy}}, the following problems happen because rows for
> the columns specified in {{partitionBy}} are scattered across tasks:
> * A Spark application that includes {{partitionBy}} takes much longer (for
> example: [partitionBy taking too long while saving a dataset on S3 using
> Pyspark - Stack
> Overflow|https://stackoverflow.com/questions/56496387/partitionby-taking-too-long-while-saving-a-dataset-on-s3-using-pyspark])
> * When using {{partitionBy}}, memory usage increases much more than when
> not using {{partitionBy}} (I tested this with Spark 2.4.3).
> * Additional information about the memory-usage impact of partitionBy:
> please check the attachment (the left figure shows "using partitionBy", the
> other shows "not using partitionBy").
> h2. How to use?
> It's very simple. If you want repartitioning before {{partitionBy}}, just
> specify {color:#0747a6}{{true}}{color} in {{partitionBy}}.
> Example:
> {code:java}
> val df = spark.read.format("csv").option("header", true).load()
> df.write.format("json").partitionBy(true, columns).save(){code}
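The mechanics behind the slowdown can be simulated without Spark: when rows for a partition value are scattered across tasks, each task opens one output file per value it sees, so the file count approaches tasks x values; clustering by the column first confines each value to one task. A pure-Python sketch (task counts and data are made up):

```python
import random

random.seed(0)
# 1000 rows spread over 5 hypothetical partition values of a "date" column.
rows = [{"date": f"2020-02-{random.randint(1, 5):02d}", "v": i}
        for i in range(1000)]
num_tasks = 8

def files_written(assignment):
    # Each distinct (task, partition-value) pair becomes one output file.
    return len({(task, row["date"]) for task, row in assignment})

# No repartition: rows land on tasks round-robin, so nearly every task
# sees every date and opens a file for it.
round_robin = [(i % num_tasks, row) for i, row in enumerate(rows)]

# repartition(date): rows are hashed by date, so each date's rows are
# confined to a single task -- one file per date.
by_column = [(hash(row["date"]) % num_tasks, row) for row in rows]

print(files_written(round_robin), files_written(by_column))
```

This is the same effect the proposed boolean flag automates; it is already achievable today by calling `repartition` on the partition columns before `write.partitionBy(...)`.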
[jira] [Commented] (SPARK-30712) Estimate sizeInBytes from file metadata for parquet files
[ https://issues.apache.org/jira/browse/SPARK-30712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17032036#comment-17032036 ]

Hyukjin Kwon commented on SPARK-30712:
--------------------------------------

SPARK-24914 is trying to add the base for this mechanism in general. You should probably take a look at the PRs and help review first.

> Estimate sizeInBytes from file metadata for parquet files
> ---------------------------------------------------------
>
>                 Key: SPARK-30712
>                 URL: https://issues.apache.org/jira/browse/SPARK-30712
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: liupengcheng
>            Priority: Major
>
> Currently, Spark uses a compressionFactor when calculating `sizeInBytes`
> for `HadoopFsRelation`, but this is not accurate and it's hard to choose the
> best `compressionFactor`. Sometimes this can cause OOMs due to an improper
> BroadcastHashJoin. So I propose to use the rowCount in the BlockMetadata to
> estimate the in-memory size, which can be more accurate.
[jira] [Commented] (SPARK-30732) BroadcastExchangeExec does not fully honor "spark.broadcast.compress"
[ https://issues.apache.org/jira/browse/SPARK-30732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031996#comment-17031996 ]

L. C. Hsieh commented on SPARK-30732:
-------------------------------------

`getByteArrayRdd` is not used only there. And as the comment says, UnsafeRow is highly compressible, so it does not sound like a good idea to disable compression there. I think `spark.broadcast.compress` provides an option to disable compression because you might have objects in an RDD that are barely compressible. For `getByteArrayRdd`, the purpose is to collect highly compressible rows back.

> BroadcastExchangeExec does not fully honor "spark.broadcast.compress"
> ---------------------------------------------------------------------
>
>                 Key: SPARK-30732
>                 URL: https://issues.apache.org/jira/browse/SPARK-30732
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.4.0
>            Reporter: Puneet
>            Priority: Major
>
> Setting {{spark.broadcast.compress}} to false disables compression while
> sending broadcast variables to executors
> ([https://spark.apache.org/docs/latest/configuration.html#compression-and-serialization])
> However this does not disable compression for any child relations sent by
> the executors to the driver. Setting spark.broadcast.compress to false
> should disable both sides of the traffic, allowing users to disable
> compression for the whole broadcast join, for example.
> [https://github.com/puneetguptanitj/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/BroadcastExchangeExec.scala#L89]
> Here `executeCollectIterator` calls `getByteArrayRdd`, which by default
> always gets a compressed stream.
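The premise of the comment above, that serialized rows are usually highly compressible, is easy to check with a toy byte pattern. Plain Python and zlib stand in for Spark's codec here; the byte layout is an illustration, not Spark's actual UnsafeRow format.

```python
import zlib

# Fixed-width records with a repeating field layout, loosely mimicking
# serialized row data (illustrative only, not Spark's wire format).
record = b"\x00" * 8 + b"user_" + b"\x01\x02\x03"
payload = record * 10_000

compressed = zlib.compress(payload)
ratio = len(payload) / len(compressed)
print(len(payload), len(compressed))
# Repetitive row bytes compress extremely well, which is why collecting
# rows back to the driver compresses them unconditionally.
```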
[jira] [Commented] (SPARK-27298) Dataset except operation gives different results(dataset count) on Spark 2.3.0 Windows and Spark 2.3.0 Linux environment
[ https://issues.apache.org/jira/browse/SPARK-27298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031938#comment-17031938 ]

Sunitha Kambhampati commented on SPARK-27298:
---------------------------------------------

Thanks for trying it out in your env. That is good to know, that you are getting the right result on Spark 2.4.4 and not on Spark 2.3.0.

Based on that, I ran this test on Spark 2.3.0 in my Linux environment and I can see the wrong count. I generated the explain to debug this; the optimized plan on Spark 2.3.0 is:
{code:java}
== Optimized Logical Plan ==
Aggregate [CustID#92, DOB#93, Gender#94, HouseholdID#95, Income#96, Initials#97, Occupation#98, Surname#99, Telephone#100L, Title#101], [CustID#92, DOB#93, Gender#94, HouseholdID#95, Income#96, Initials#97, Occupation#98, Surname#99, Telephone#100L, Title#101]
+- Filter ((isnotnull(Gender#94) && (Gender#94 = M)) && NOT Income#96 IN (503.65,495.54,486.82,481.28,479.79))
   +- Relation[CustID#92,DOB#93,Gender#94,HouseholdID#95,Income#96,Initials#97,Occupation#98,Surname#99,Telephone#100L,Title#101] parquet
{code}
With Spark 2.3.3, which generates the correct count, I see this optimized plan:
{code:java}
== Optimized Logical Plan ==
Aggregate [CustID#92, DOB#93, Gender#94, HouseholdID#95, Income#96, Initials#97, Occupation#98, Surname#99, Telephone#100L, Title#101], [CustID#92, DOB#93, Gender#94, HouseholdID#95, Income#96, Initials#97, Occupation#98, Surname#99, Telephone#100L, Title#101]
+- Filter ((isnotnull(Gender#94) && (Gender#94 = M)) && NOT coalesce(Income#96 IN (503.65,495.54,486.82,481.28,479.79), false))
   +- Relation[CustID#92,DOB#93,Gender#94,HouseholdID#95,Income#96,Initials#97,Occupation#98,Surname#99,Telephone#100L,Title#101] parquet
{code}
Observations:
- This issue is caused by the nulls in Income rows being filtered out incorrectly. This comes from the optimizer rule 'ReplaceExceptWithFilter'. The bug was fixed in SPARK-26366 and back-ported to Spark 2.3.3.
- There doesn't seem to be anything OS-specific in this fix, so I am not sure why you were seeing the correct count on Windows with Spark 2.3.0 (which has the bug). For that, we will need the explain output, and also the count of rows with null income for gender = M on Windows (as mentioned in my earlier comments).

> Dataset except operation gives different results(dataset count) on Spark
> 2.3.0 Windows and Spark 2.3.0 Linux environment
> ------------------------------------------------------------------------
>
>                 Key: SPARK-27298
>                 URL: https://issues.apache.org/jira/browse/SPARK-27298
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0, 2.4.2
>            Reporter: Mahima Khatri
>            Priority: Blocker
>              Labels: data-loss
>         Attachments: Console-Result-Windows.txt,
> Linux-spark-2.3.0_result.txt, Linux-spark-2.4.4_result.txt,
> console-reslt-2.3.3-linux.txt, console-result-2.3.3-windows.txt,
> console-result-LinuxonVM.txt, console-result-spark-2.4.2-linux,
> console-result-spark-2.4.2-windows, customer.csv, pom.xml
>
> {code:java}
> // package com.verifyfilter.example;
> import org.apache.spark.SparkConf;
> import org.apache.spark.SparkContext;
> import org.apache.spark.sql.SparkSession;
> import org.apache.spark.sql.Column;
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.SaveMode;
>
> public class ExcludeInTesting {
>   public static void main(String[] args) {
>     SparkSession spark = SparkSession.builder()
>       .appName("ExcludeInTesting")
>       .config("spark.some.config.option", "some-value")
>       .getOrCreate();
>     Dataset dataReadFromCSV = spark.read().format("com.databricks.spark.csv")
>       .option("header", "true")
>       .option("delimiter", "|")
>       .option("inferSchema", "true")
>       //.load("E:/resources/customer.csv"); local //below path for VM
>       .load("/home/myproject/bda/home/bin/customer.csv");
>     dataReadFromCSV.printSchema();
>     dataReadFromCSV.show();
>     //Adding an extra step of saving to db and then loading it again
>     dataReadFromCSV.write().mode(SaveMode.Overwrite).saveAsTable("customer");
>     Dataset dataLoaded = spark.sql("select * from customer");
>     //Gender EQ M
>     Column genderCol = dataLoaded.col("Gender");
>     Dataset onlyMaleDS = dataLoaded.where(genderCol.equalTo("M"));
>     //Dataset onlyMaleDS = spark.sql("select count(*) from customer where Gender='M'");
>     onlyMaleDS.show();
>     System.out.println("The count of Male customers is :"+ onlyMaleDS.count());
>     System.out.println("*");
>     // Income in the list
>     Object[] valuesArray = new Object[5];
>     valuesArray[0]=503.65;
>     valuesArray[1]=495.54;
>     valuesArray[2]=486.82;
>     valuesArray[3]=481.28;
>     valuesArray[4]=479.79;
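The null-handling difference between the two optimized plans quoted in the comment above can be reproduced in any SQL engine. SQLite via Python's sqlite3 is used here purely for illustration; the table and values are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (income REAL)")
conn.executemany("INSERT INTO t VALUES (?)",
                 [(503.65,), (100.0,), (None,)])

# Buggy shape: NULL NOT IN (...) evaluates to NULL, so the WHERE clause
# silently drops the null-income row.
a = conn.execute(
    "SELECT count(*) FROM t WHERE income NOT IN (503.65, 495.54)"
).fetchone()[0]

# Fixed shape (mirrors the Spark 2.3.3 plan): coalesce the IN predicate to
# false first, so NOT keeps the null-income row.
b = conn.execute(
    "SELECT count(*) FROM t WHERE NOT coalesce(income IN (503.65, 495.54), 0)"
).fetchone()[0]

print(a, b)  # the fixed predicate keeps one more row (the NULL income)
```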
[jira] [Commented] (SPARK-24655) [K8S] Custom Docker Image Expectations and Documentation
[ https://issues.apache.org/jira/browse/SPARK-24655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031870#comment-17031870 ]

Thomas Graves commented on SPARK-24655:
---------------------------------------

Some other discussions on this from https://github.com/apache/spark/pull/23347

> [K8S] Custom Docker Image Expectations and Documentation
> --------------------------------------------------------
>
>                 Key: SPARK-24655
>                 URL: https://issues.apache.org/jira/browse/SPARK-24655
>             Project: Spark
>          Issue Type: Improvement
>          Components: Kubernetes
>    Affects Versions: 2.3.1
>            Reporter: Matt Cheah
>            Priority: Major
>
> A common use case we want to support with Kubernetes is the usage of custom
> Docker images. Some examples include:
> * A user builds an application using Gradle or Maven, using Spark as a
> compile-time dependency. The application's jars (both the custom-written
> jars and the dependencies) need to be packaged in a docker image that can be
> run via spark-submit.
> * A user builds a PySpark or R application and desires to include custom
> dependencies
> * A user wants to switch the base image from Alpine to CentOS while using
> either built-in or custom jars
> We currently do not document how these custom Docker images are supposed to
> be built, nor do we guarantee stability of these Docker images with various
> spark-submit versions. To illustrate how this can break down, suppose for
> example we decide to change the names of environment variables that denote
> the driver/executor extra JVM options specified by
> {{spark.[driver|executor].extraJavaOptions}}. If we change the environment
> variable spark-submit provides then the user must update their custom
> Dockerfile and build new images.
> Rather than jumping to an implementation immediately though, it's worth
> taking a step back and considering these matters from the perspective of the
> end user. Towards that end, this ticket will serve as a forum where we can
> answer at least the following questions, and any others pertaining to the
> matter:
> # What would be the steps a user would need to take to build a custom Docker
> image, given their desire to customize the dependencies and the content (OS
> or otherwise) of said images?
> # How can we ensure the user does not need to rebuild the image if only the
> spark-submit version changes?
> The end deliverable for this ticket is a design document, and then we'll
> create sub-issues for the technical implementation and documentation of the
> contract.
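As a concrete (entirely hypothetical) instance of the first bullet, a user-built image today typically layers application jars onto a stock Spark image, and this layering is exactly the surface the ticket wants to stabilize. Every name and path below is an assumption about current image layouts, not a documented contract, which is the ticket's point:

```dockerfile
# Hypothetical custom image: app jars layered onto a stock Spark base image.
# The base tag and the /opt/spark/jars location are assumptions about the
# current images, not a guaranteed contract.
FROM spark-base:2.3.1
COPY build/libs/my-app.jar   /opt/spark/jars/
COPY build/libs/deps/*.jar   /opt/spark/jars/
```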
[jira] [Commented] (SPARK-30668) to_timestamp failed to parse 2020-01-27T20:06:11.847-0800 using pattern "yyyy-MM-dd'T'HH:mm:ss.SSSz"
[ https://issues.apache.org/jira/browse/SPARK-30668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031860#comment-17031860 ]

Xiao Li commented on SPARK-30668:
---------------------------------

I think this is still not resolved. Spark 3.0 should not silently return a wrong result for a query whose pattern was right in the previous versions. I did not see the fallback mentioned by [~cloud_fan].

> to_timestamp failed to parse 2020-01-27T20:06:11.847-0800 using pattern
> "yyyy-MM-dd'T'HH:mm:ss.SSSz"
> -----------------------------------------------------------------------
>
>                 Key: SPARK-30668
>                 URL: https://issues.apache.org/jira/browse/SPARK-30668
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Xiao Li
>            Assignee: Maxim Gekk
>            Priority: Blocker
>             Fix For: 3.0.0
>
> {code:java}
> SELECT to_timestamp("2020-01-27T20:06:11.847-0800",
> "yyyy-MM-dd'T'HH:mm:ss.SSSz")
> {code}
> This can return a valid value in Spark 2.4 but returns NULL in the latest
> master.
> **2.4.5 RC2**
> {code}
> scala> sql("""SELECT to_timestamp("2020-01-27T20:06:11.847-0800", "yyyy-MM-dd'T'HH:mm:ss.SSSz")""").show
> +----------------------------------------------------------------------------+
> |to_timestamp('2020-01-27T20:06:11.847-0800', 'yyyy-MM-dd\'T\'HH:mm:ss.SSSz')|
> +----------------------------------------------------------------------------+
> |                                                         2020-01-27 20:06:11|
> +----------------------------------------------------------------------------+
> {code}
> **2.2.3 ~ 2.4.4** (2.0.2 ~ 2.1.3 doesn't have `to_timestamp`).
> {code}
> spark-sql> SELECT to_timestamp("2020-01-27T20:06:11.847-0800",
> "yyyy-MM-dd'T'HH:mm:ss.SSSz");
> 2020-01-27 20:06:11
> {code}
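The input string itself is unambiguous. For reference, plain Python parses it with the analogous pattern, which demonstrates the input is well-formed and the regression lies in Spark's formatter change, not the data:

```python
from datetime import datetime

# %z accepts the "-0800" offset form; %f parses the ".847" fraction.
ts = datetime.strptime("2020-01-27T20:06:11.847-0800",
                       "%Y-%m-%dT%H:%M:%S.%f%z")
print(ts.isoformat())
```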
[jira] [Reopened] (SPARK-30668) to_timestamp failed to parse 2020-01-27T20:06:11.847-0800 using pattern "yyyy-MM-dd'T'HH:mm:ss.SSSz"
[ https://issues.apache.org/jira/browse/SPARK-30668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li reopened SPARK-30668:
-----------------------------

> to_timestamp failed to parse 2020-01-27T20:06:11.847-0800 using pattern
> "yyyy-MM-dd'T'HH:mm:ss.SSSz"
> -----------------------------------------------------------------------
>
>                 Key: SPARK-30668
>                 URL: https://issues.apache.org/jira/browse/SPARK-30668
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Xiao Li
>            Assignee: Maxim Gekk
>            Priority: Blocker
>             Fix For: 3.0.0
>
> {code:java}
> SELECT to_timestamp("2020-01-27T20:06:11.847-0800",
> "yyyy-MM-dd'T'HH:mm:ss.SSSz")
> {code}
> This can return a valid value in Spark 2.4 but returns NULL in the latest
> master.
> **2.4.5 RC2**
> {code}
> scala> sql("""SELECT to_timestamp("2020-01-27T20:06:11.847-0800", "yyyy-MM-dd'T'HH:mm:ss.SSSz")""").show
> +----------------------------------------------------------------------------+
> |to_timestamp('2020-01-27T20:06:11.847-0800', 'yyyy-MM-dd\'T\'HH:mm:ss.SSSz')|
> +----------------------------------------------------------------------------+
> |                                                         2020-01-27 20:06:11|
> +----------------------------------------------------------------------------+
> {code}
> **2.2.3 ~ 2.4.4** (2.0.2 ~ 2.1.3 doesn't have `to_timestamp`).
> {code}
> spark-sql> SELECT to_timestamp("2020-01-27T20:06:11.847-0800",
> "yyyy-MM-dd'T'HH:mm:ss.SSSz");
> 2020-01-27 20:06:11
> {code}
[jira] [Commented] (SPARK-30752) Wrong result of to_utc_timestamp() on daylight saving day
[ https://issues.apache.org/jira/browse/SPARK-30752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031834#comment-17031834 ]

Dongjoon Hyun commented on SPARK-30752:
---------------------------------------

The next release 2.4.6 is scheduled but there is no release manager for that yet.

> Wrong result of to_utc_timestamp() on daylight saving day
> ---------------------------------------------------------
>
>                 Key: SPARK-30752
>                 URL: https://issues.apache.org/jira/browse/SPARK-30752
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.4, 3.0.0
>            Reporter: Maxim Gekk
>            Priority: Major
>
> The to_utc_timestamp() function returns a wrong result when:
> * the JVM system time zone is PST
> * the session local time zone is UTC
> * fromZone is Asia/Hong_Kong
> For the local date '2019-11-03T12:00:00', the result must be
> '2019-11-03T04:00:00'.
> {code}
> scala> import java.util.TimeZone
> import java.util.TimeZone
> scala> import org.apache.spark.sql.catalyst.util.DateTimeUtils._
> import org.apache.spark.sql.catalyst.util.DateTimeUtils._
> scala> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.functions._
> scala> TimeZone.setDefault(getTimeZone("PST"))
> scala> spark.conf.set("spark.sql.session.timeZone", "UTC")
> scala> val df = Seq("2019-11-03T12:00:00").toDF("localTs")
> df: org.apache.spark.sql.DataFrame = [localTs: string]
> scala> df.select(to_utc_timestamp(col("localTs"), "Asia/Hong_Kong")).show
> +-----------------------------------------+
> |to_utc_timestamp(localTs, Asia/Hong_Kong)|
> +-----------------------------------------+
> |                      2019-11-03 03:00:00|
> +-----------------------------------------+
> {code}
>
> See
> https://www.worldtimebuddy.com/?qm=1=8,1819729,100=8=2019-11-2=21-22
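The expected value in the ticket can be cross-checked with fixed offsets in plain Python: Asia/Hong_Kong is UTC+8 year-round (no DST), so a constant offset is a faithful stand-in; the DST transition that trips the bug belongs to the PST system zone, not to the source zone.

```python
from datetime import datetime, timedelta, timezone

hk = timezone(timedelta(hours=8))  # Asia/Hong_Kong is UTC+8, no DST
local = datetime(2019, 11, 3, 12, 0, 0, tzinfo=hk)
utc = local.astimezone(timezone.utc)
print(utc.isoformat())  # 2019-11-03T04:00:00+00:00, not the 03:00:00 Spark returns
```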
[jira] [Commented] (SPARK-30752) Wrong result of to_utc_timestamp() on daylight saving day
[ https://issues.apache.org/jira/browse/SPARK-30752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031833#comment-17031833 ]

Dongjoon Hyun commented on SPARK-30752:
---------------------------------------

Thanks, but you don't need to ping me from today because I'm not a release manager any more, [~maxgekk]. :)

> Wrong result of to_utc_timestamp() on daylight saving day
> ---------------------------------------------------------
>
>                 Key: SPARK-30752
>                 URL: https://issues.apache.org/jira/browse/SPARK-30752
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.4, 3.0.0
>            Reporter: Maxim Gekk
>            Priority: Major
>
> The to_utc_timestamp() function returns a wrong result when:
> * the JVM system time zone is PST
> * the session local time zone is UTC
> * fromZone is Asia/Hong_Kong
> For the local date '2019-11-03T12:00:00', the result must be
> '2019-11-03T04:00:00'.
> {code}
> scala> import java.util.TimeZone
> import java.util.TimeZone
> scala> import org.apache.spark.sql.catalyst.util.DateTimeUtils._
> import org.apache.spark.sql.catalyst.util.DateTimeUtils._
> scala> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.functions._
> scala> TimeZone.setDefault(getTimeZone("PST"))
> scala> spark.conf.set("spark.sql.session.timeZone", "UTC")
> scala> val df = Seq("2019-11-03T12:00:00").toDF("localTs")
> df: org.apache.spark.sql.DataFrame = [localTs: string]
> scala> df.select(to_utc_timestamp(col("localTs"), "Asia/Hong_Kong")).show
> +-----------------------------------------+
> |to_utc_timestamp(localTs, Asia/Hong_Kong)|
> +-----------------------------------------+
> |                      2019-11-03 03:00:00|
> +-----------------------------------------+
> {code}
>
> See
> https://www.worldtimebuddy.com/?qm=1=8,1819729,100=8=2019-11-2=21-22
[jira] [Resolved] (SPARK-30719) AQE should not issue a "not supported" warning for queries being by-passed
[ https://issues.apache.org/jira/browse/SPARK-30719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-30719. - Fix Version/s: 3.0.0 Assignee: Wenchen Fan Resolution: Fixed > AQE should not issue a "not supported" warning for queries being by-passed > -- > > Key: SPARK-30719 > URL: https://issues.apache.org/jira/browse/SPARK-30719 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wei Xue >Assignee: Wenchen Fan >Priority: Minor > Fix For: 3.0.0 > > > This is a follow up for [https://github.com/apache/spark/pull/26813]. > AQE bypasses queries that don't have exchanges or subqueries. This is not a > limitation and it is different from queries that are not supported in AQE. > Issuing a warning in this case can be confusing and annoying. > It would also be good to add an internal conf for this bypassing behavior. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30730) Wrong results of `converTz` for different session and system time zones
[ https://issues.apache.org/jira/browse/SPARK-30730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17031776#comment-17031776 ] Maxim Gekk commented on SPARK-30730: I am going to close this ticket w.r.t https://issues.apache.org/jira/browse/SPARK-30752 and bug fix [https://github.com/apache/spark/pull/27474] where I eliminated any assumptions. > Wrong results of `converTz` for different session and system time zones > --- > > Key: SPARK-30730 > URL: https://issues.apache.org/jira/browse/SPARK-30730 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Major > > Currently, DateTimeUtils.convertTz() assumes that timestamp strings are > casted to TimestampType using the JVM system timezone but in fact the session > time zone defined by the SQL config *spark.sql.session.timeZone* is used in > the casting. This leads to wrong results of from_utc_timestamp and > to_utc_timestamp when session time zone is different from JVM time zones. 
The > issues can be reproduced by the code: > {code:java} > test("to_utc_timestamp in various system and session time zones") { > val localTs = "2020-02-04T22:42:10" > val defaultTz = TimeZone.getDefault > try { > DateTimeTestUtils.outstandingTimezonesIds.foreach { systemTz => > TimeZone.setDefault(DateTimeUtils.getTimeZone(systemTz)) > DateTimeTestUtils.outstandingTimezonesIds.foreach { sessionTz => > withSQLConf( > SQLConf.DATETIME_JAVA8API_ENABLED.key -> "true", > SQLConf.SESSION_LOCAL_TIMEZONE.key -> sessionTz) { > DateTimeTestUtils.outstandingTimezonesIds.foreach { toTz => > val instant = LocalDateTime > .parse(localTs) > .atZone(DateTimeUtils.getZoneId(toTz)) > .toInstant > val df = Seq(localTs).toDF("localTs") > val res = df.select(to_utc_timestamp(col("localTs"), > toTz)).first().apply(0) > if (instant != res) { > println(s"system = $systemTz session = $sessionTz to = $toTz") > } > } > } > } > } > } catch { > case NonFatal(_) => TimeZone.setDefault(defaultTz) > } > } > {code} > {code:java} > system = UTC session = PST to = UTC > system = UTC session = PST to = PST > system = UTC session = PST to = CET > system = UTC session = PST to = Africa/Dakar > system = UTC session = PST to = America/Los_Angeles > system = UTC session = PST to = Antarctica/Vostok > system = UTC session = PST to = Asia/Hong_Kong > system = UTC session = PST to = Europe/Amsterdam > ... > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30752) Wrong result of to_utc_timestamp() on daylight saving day
[ https://issues.apache.org/jira/browse/SPARK-30752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17031774#comment-17031774 ] Maxim Gekk commented on SPARK-30752: [~dongjoon] FYI, 2.4 has the bug. > Wrong result of to_utc_timestamp() on daylight saving day > - > > Key: SPARK-30752 > URL: https://issues.apache.org/jira/browse/SPARK-30752 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4, 3.0.0 >Reporter: Maxim Gekk >Priority: Major > > The to_utc_timestamp() function returns wrong result when: > * JVM system time zone is PST > * the session local time zone is UTC > * fromZone is Asia/Hong_Kong > for the local date '2019-11-03T12:00:00', the result must be > '2019-11-03T04:00:00' > {code} > scala> import java.util.TimeZone > import java.util.TimeZone > scala> import org.apache.spark.sql.catalyst.util.DateTimeUtils._ > import org.apache.spark.sql.catalyst.util.DateTimeUtils._ > scala> import org.apache.spark.sql.functions._ > import org.apache.spark.sql.functions._ > scala> TimeZone.setDefault(getTimeZone("PST")) > scala> spark.conf.set("spark.sql.session.timeZone", "UTC") > scala> val df = Seq("2019-11-03T12:00:00").toDF("localTs") > df: org.apache.spark.sql.DataFrame = [localTs: string] > scala> df.select(to_utc_timestamp(col("localTs"), "Asia/Hong_Kong")).show > +-+ > |to_utc_timestamp(localTs, Asia/Hong_Kong)| > +-+ > | 2019-11-03 03:00:00| > +-+ > {code} > > See > https://www.worldtimebuddy.com/?qm=1=8,1819729,100=8=2019-11-2=21-22 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30752) Wrong result of to_utc_timestamp() on daylight saving day
Maxim Gekk created SPARK-30752: -- Summary: Wrong result of to_utc_timestamp() on daylight saving day Key: SPARK-30752 URL: https://issues.apache.org/jira/browse/SPARK-30752 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.4, 3.0.0 Reporter: Maxim Gekk The to_utc_timestamp() function returns wrong result when: * JVM system time zone is PST * the session local time zone is UTC * fromZone is Asia/Hong_Kong for the local date '2019-11-03T12:00:00', the result must be '2019-11-03T04:00:00' {code} scala> import java.util.TimeZone import java.util.TimeZone scala> import org.apache.spark.sql.catalyst.util.DateTimeUtils._ import org.apache.spark.sql.catalyst.util.DateTimeUtils._ scala> import org.apache.spark.sql.functions._ import org.apache.spark.sql.functions._ scala> TimeZone.setDefault(getTimeZone("PST")) scala> spark.conf.set("spark.sql.session.timeZone", "UTC") scala> val df = Seq("2019-11-03T12:00:00").toDF("localTs") df: org.apache.spark.sql.DataFrame = [localTs: string] scala> df.select(to_utc_timestamp(col("localTs"), "Asia/Hong_Kong")).show +-+ |to_utc_timestamp(localTs, Asia/Hong_Kong)| +-+ | 2019-11-03 03:00:00| +-+ {code} See https://www.worldtimebuddy.com/?qm=1=8,1819729,100=8=2019-11-2=21-22 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
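The expected value in the report above can be cross-checked outside the JVM. This is a minimal Python sketch using the standard-library zoneinfo module (not Spark code; it only confirms the arithmetic behind the expected '2019-11-03T04:00:00' result):

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # standard library in Python 3.9+

# 12:00 local time in Asia/Hong_Kong (UTC+8, no DST) must map to 04:00 UTC,
# regardless of the DST transition happening in the PST zone on that day.
local = datetime(2019, 11, 3, 12, 0, 0, tzinfo=ZoneInfo("Asia/Hong_Kong"))
utc = local.astimezone(ZoneInfo("UTC"))
print(utc.strftime("%Y-%m-%d %H:%M:%S"))  # 2019-11-03 04:00:00
```

The '2019-11-03 03:00:00' that Spark returns is exactly one hour off, which is consistent with the JVM's PST daylight-saving fallback leaking into the conversion.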
[jira] [Updated] (SPARK-30751) Combine the skewed readers into one in AQE skew join optimizations
[ https://issues.apache.org/jira/browse/SPARK-30751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Xue updated SPARK-30751: Description: Assume we have N partitions based on the original join keys, and for a specific partition id {{Pi}} (i = 1 to N), we slice the left partition into {{Li}} sub-partitions (L = 1 if no skew; L > 1 if skewed), the right partition into {{Mi}} sub-partitions (M = 1 if no skew; M > 1 if skewed). With the current approach, we’ll end up with a sum of {{Li}} * {{Mi}} (i = 1 to N where Li > 1 or Mi > 1) plus one (for the rest of the partitions without skew) joins. *This can be a serious performance concern as the size of the query plan now depends on the number and size of skewed partitions.* Now instead of generating so many joins we can create a “repeated” reader for either side of the join so that: # for the left side, with each partition id Pi and any given slice {{Sj}} in {{Pi}} (j = 1 to Li), it generates {{Mi}} repeated partitions with respective join keys as {{PiSjT1}}, {{PiSjT2}}, …, {{PiSjTm}} # for the right side, with each partition id Pi and any given slice {{Tk}} in {{Pi}} (k = 1 to Mi), it generates {{Li}} repeated partitions with respective join keys as {{PiS1Tk}}, {{PiS2Tk}}, …, {{PiSlTk}} That way, we can have one SMJ for all the partitions and only one type of special reader. was: Assume we have N partitions based on the original join keys, and for a specific partition id Pi (i = 1 to N), we slice the left partition into L(i) sub-partitions (L = 1 if no skew; L > 1 if skewed), the right partition into M(i) sub-partitions (M = 1 if no skew; M > 1 if skewed). With the current approach, we’ll end up with a sum of L(i) * M(i) (i = 1 to N where L(i) > 1 or M(i) > 1) plus one joins. 
*This can be a serious performance concern as the size of the query plan now depends on the number and size of skewed partitions.* Now instead of generating so many joins we can create a “repeated” reader for either side of the join so that: # for the left side, with each partition id Pi and any given slice Sj in Pi (j = 1 to L(i)), it generates M(i) repeated partitions with respective join keys as PiSjT1, PiSjT2, …, PiSjTm # for the right side, with each partition id Pi and any given slice Tk in Pi (k = 1 to M(i)), it generates L(i) repeated partitions with respective join keys as PiS1Tk, PiS2Tk, …, PiSlTk That way, we can have one SMJ for all the partitions and only one type of special reader. > Combine the skewed readers into one in AQE skew join optimizations > -- > > Key: SPARK-30751 > URL: https://issues.apache.org/jira/browse/SPARK-30751 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wei Xue >Priority: Major > > Assume we have N partitions based on the original join keys, and for a > specific partition id {{Pi}} (i = 1 to N), we slice the left partition into > {{Li}} sub-partitions (L = 1 if no skew; L > 1 if skewed), the right > partition into {{Mi}} sub-partitions (M = 1 if no skew; M > 1 if skewed). > With the current approach, we’ll end up with a sum of {{Li}} * {{Mi}} (i = 1 > to N where Li > 1 or Mi > 1) plus one (for the rest of the partitions without > skew) joins. 
*This can be a serious performance concern as the size of the > query plan now depends on the number and size of skewed partitions.* > Now instead of generating so many joins we can create a “repeated” reader for > either side of the join so that: > # for the left side, with each partition id Pi and any given slice {{Sj}} in > {{Pi}} (j = 1 to Li), it generates {{Mi}} repeated partitions with respective > join keys as {{PiSjT1}}, {{PiSjT2}}, …, {{PiSjTm}} > # for the right side, with each partition id Pi and any given slice {{Tk}} > in {{Pi}} (k = 1 to Mi), it generates {{Li}} repeated partitions with > respective join keys as {{PiS1Tk}}, {{PiS2Tk}}, …, {{PiSlTk}} > That way, we can have one SMJ for all the partitions and only one type of > special reader. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30751) Combine the skewed readers into one in AQE skew join optimizations
Wei Xue created SPARK-30751: --- Summary: Combine the skewed readers into one in AQE skew join optimizations Key: SPARK-30751 URL: https://issues.apache.org/jira/browse/SPARK-30751 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Wei Xue Assume we have N partitions based on the original join keys, and for a specific partition id Pi (i = 1 to N), we slice the left partition into L(i) sub-partitions (L = 1 if no skew; L > 1 if skewed), the right partition into M(i) sub-partitions (M = 1 if no skew; M > 1 if skewed). With the current approach, we’ll end up with a sum of L(i) * M(i) (i = 1 to N where L(i) > 1 or M(i) > 1) plus one joins. *This can be a serious performance concern as the size of the query plan now depends on the number and size of skewed partitions.* Now instead of generating so many joins we can create a “repeated” reader for either side of the join so that: # for the left side, with each partition id Pi and any given slice Sj in Pi (j = 1 to L(i)), it generates M(i) repeated partitions with respective join keys as PiSjT1, PiSjT2, …, PiSjTm # for the right side, with each partition id Pi and any given slice Tk in Pi (k = 1 to M(i)), it generates L(i) repeated partitions with respective join keys as PiS1Tk, PiS2Tk, …, PiSlTk That way, we can have one SMJ for all the partitions and only one type of special reader. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
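The pairing scheme described above can be sketched outside Spark. In this hedged Python illustration (the helper name and key format are made up for the sketch, not Spark identifiers), each left slice Sj of a skewed partition Pi is repeated once per right slice Tk, so both readers emit Li * Mi aligned sub-partitions and a single sort-merge join can cover them all:

```python
from itertools import product

def repeated_join_keys(pi, li, mi):
    # One key per (left slice, right slice) pair within partition Pi.
    # The left reader repeats each slice Sj mi times and the right reader
    # repeats each slice Tk li times, so the two sides line up one-to-one
    # and one SMJ handles every pair instead of li * mi separate joins.
    return [f"P{pi}S{j}T{k}"
            for j, k in product(range(1, li + 1), range(1, mi + 1))]

# Partition P1 with 2 left slices and 3 right slices -> 6 aligned keys.
print(repeated_join_keys(1, 2, 3))
# ['P1S1T1', 'P1S1T2', 'P1S1T3', 'P1S2T1', 'P1S2T2', 'P1S2T3']
```

The point of the proposal is that the plan size stays constant: the repetition happens inside the readers, not as extra join operators in the query plan.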
[jira] [Created] (SPARK-30750) stage level scheduling: Add ability to set dynamic allocation configs
Thomas Graves created SPARK-30750: - Summary: stage level scheduling: Add ability to set dynamic allocation configs Key: SPARK-30750 URL: https://issues.apache.org/jira/browse/SPARK-30750 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.1.0 Reporter: Thomas Graves The initial Jira to modify dynamic allocation for stage level scheduling applies the same configs (minimum, initial, and max number of executors) to each profile. This means that you can't set them differently per resource profile. Ideally those would be configurable per ResourceProfile, so look at adding support for that. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30749) stage level scheduling: Better cleanup of Resource profiles
[ https://issues.apache.org/jira/browse/SPARK-30749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-30749: -- Summary: stage level scheduling: Better cleanup of Resource profiles (was: Better cleanup of Resource profiles) > stage level scheduling: Better cleanup of Resource profiles > --- > > Key: SPARK-30749 > URL: https://issues.apache.org/jira/browse/SPARK-30749 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Thomas Graves >Priority: Major > > In the initial stage level scheduling jiras we aren't doing a great job of > cleaning up data structures when the ResourceProfile is done being used. This > is mostly because it's hard to tell when it's done being used. > We should find a way to clean up better. > Also in the dynamic allocation manager, if the resource profile is not being > used anymore we should be more active about getting rid of the executors that > are up. Especially if there is a minimum number that is set, we can kill > those off. > > I think this can be done as a follow-up to the main feature -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30749) Better cleanup of Resource profiles
Thomas Graves created SPARK-30749: - Summary: Better cleanup of Resource profiles Key: SPARK-30749 URL: https://issues.apache.org/jira/browse/SPARK-30749 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.1.0 Reporter: Thomas Graves In the initial stage level scheduling jiras we aren't doing a great job of cleaning up data structures when the ResourceProfile is done being used. This is mostly because it's hard to tell when it's done being used. We should find a way to clean up better. Also in the dynamic allocation manager, if the resource profile is not being used anymore we should be more active about getting rid of the executors that are up. Especially if there is a minimum number that is set, we can kill those off. I think this can be done as a follow-up to the main feature -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27570) java.io.EOFException Reached the end of stream - Reading Parquet from Swift
[ https://issues.apache.org/jira/browse/SPARK-27570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17031683#comment-17031683 ] Hadrien Negros commented on SPARK-27570: I have the same problem with reading pretty large parquet files stored in Openstack Swift with Spark 2.4.4 Has someone found a solution? > java.io.EOFException Reached the end of stream - Reading Parquet from Swift > --- > > Key: SPARK-27570 > URL: https://issues.apache.org/jira/browse/SPARK-27570 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Harry Hough >Priority: Major > > I did see issue SPARK-25966 but it seems there are some differences as his > problem was resolved after rebuilding the parquet files on write. This is > 100% reproducible for me across many different days of data. > I get exceptions such as "Reached the end of stream with 750477 bytes left to > read" during some read operations of parquet files. I am reading these files > from Openstack swift using openstack-hadoop 2.7.7 on Spark 2.4. > The issues seem to happen with the where statement. I have also tried filter > and combining the statements into one as well as the dataset method with > column without any luck. Which column or what the actual filter is on the > where also doesn't seem to make a difference to the error occurring or not. > > {code:java} > val engagementDS = spark > .read > .parquet(createSwiftAddr("engagements", folder)) > .where("engtype != 0") > .where("engtype != 1000") > .groupBy($"accid", $"sessionkey") > .agg(collect_list(struct($"time", $"pid", $"engtype", $"pageid", > $"testid")).as("engagements")) > // Exiting paste mode, now interpreting. 
> [Stage 53:> (0 + 32) / 32]2019-04-25 19:02:12 ERROR Executor:91 - Exception > in task 24.0 in stage 53.0 (TID 688) > java.io.EOFException: Reached the end of stream with 1323959 bytes left to > read > at > org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:104) > at > org.apache.parquet.io.DelegatingSeekableInputStream.readFullyHeapBuffer(DelegatingSeekableInputStream.java:127) > at > org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:91) > at > org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1174) > at > org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:805) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:301) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:256) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:159) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:181) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.scan_nextBatch_0$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$1.hasNext(WholeStageCodegenExec.scala:619) > at > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:107) > at > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:105) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$12.apply(RDD.scala:823) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$12.apply(RDD.scala:823) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at
[jira] [Commented] (SPARK-24615) SPIP: Accelerator-aware task scheduling for Spark
[ https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17031647#comment-17031647 ] Sujith Chacko commented on SPARK-24615: --- Great!!Thanks for the update. > SPIP: Accelerator-aware task scheduling for Spark > - > > Key: SPARK-24615 > URL: https://issues.apache.org/jira/browse/SPARK-24615 > Project: Spark > Issue Type: Epic > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Saisai Shao >Assignee: Thomas Graves >Priority: Major > Labels: Hydrogen, SPIP > Attachments: Accelerator-aware scheduling in Apache Spark 3.0.pdf, > SPIP_ Accelerator-aware scheduling.pdf > > > (The JIRA received a major update on 2019/02/28. Some comments were based on > an earlier version. Please ignore them. New comments start at > [#comment-16778026].) > h2. Background and Motivation > GPUs and other accelerators have been widely used for accelerating special > workloads, e.g., deep learning and signal processing. While users from the AI > community use GPUs heavily, they often need Apache Spark to load and process > large datasets and to handle complex data scenarios like streaming. YARN and > Kubernetes already support GPUs in their recent releases. Although Spark > supports those two cluster managers, Spark itself is not aware of GPUs > exposed by them and hence Spark cannot properly request GPUs and schedule > them for users. This leaves a critical gap to unify big data and AI workloads > and make life simpler for end users. > To make Spark be aware of GPUs, we shall make two major changes at high level: > * At cluster manager level, we update or upgrade cluster managers to include > GPU support. Then we expose user interfaces for Spark to request GPUs from > them. > * Within Spark, we update its scheduler to understand available GPUs > allocated to executors, user task requests, and assign GPUs to tasks properly. 
> Based on the work done in YARN and Kubernetes to support GPUs and some > offline prototypes, we could have necessary features implemented in the next > major release of Spark. You can find a detailed scoping doc here, where we > listed user stories and their priorities. > h2. Goals > * Make Spark 3.0 GPU-aware in standalone, YARN, and Kubernetes. > * No regression on scheduler performance for normal jobs. > h2. Non-goals > * Fine-grained scheduling within one GPU card. > ** We treat one GPU card and its memory together as a non-divisible unit. > * Support TPU. > * Support Mesos. > * Support Windows. > h2. Target Personas > * Admins who need to configure clusters to run Spark with GPU nodes. > * Data scientists who need to build DL applications on Spark. > * Developers who need to integrate DL features on Spark. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30712) Estimate sizeInBytes from file metadata for parquet files
[ https://issues.apache.org/jira/browse/SPARK-30712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17031644#comment-17031644 ] liupengcheng commented on SPARK-30712: -- [~hyukjin.kwon] We use the rowCount info in metadata and the schema to infer the memory consumption of `UnsafeRow`s in memory. It works fine. > Estimate sizeInBytes from file metadata for parquet files > - > > Key: SPARK-30712 > URL: https://issues.apache.org/jira/browse/SPARK-30712 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: liupengcheng >Priority: Major > > Currently, Spark will use a compressionFactor when calculating `sizeInBytes` > for `HadoopFsRelation`, but this is not accurate and it's hard to choose the > best `compressionFactor`. Sometimes, this can cause OOMs due to improper > BroadcastHashJoin. > So I propose to use the rowCount in the BlockMetadata to estimate the size in > memory, which can be more accurate. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
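The row-count-based estimate mentioned in the comment can be sketched as follows. This is a rough Python illustration assuming a simplified UnsafeRow layout (a word-aligned null bitset plus one 8-byte word per fixed-width field); it is not Spark's exact accounting and ignores variable-length fields:

```python
import math

def estimate_unsafe_row_bytes(num_fields: int) -> int:
    # Assumed layout: one null bit per field, rounded up to 8-byte words,
    # followed by one 8-byte word per field (fixed-width values inline).
    null_bitset = math.ceil(num_fields / 64) * 8
    return null_bitset + 8 * num_fields

def estimate_size_in_bytes(row_count: int, num_fields: int) -> int:
    # rowCount comes from the Parquet block (row group) metadata,
    # num_fields from the table schema.
    return row_count * estimate_unsafe_row_bytes(num_fields)

print(estimate_size_in_bytes(1_000_000, 4))  # 40000000
```

Compared with multiplying the on-disk size by a compressionFactor, this ties the estimate to the actual number of rows, which is what makes it more robust against mis-sized broadcast joins.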
[jira] [Commented] (SPARK-24615) SPIP: Accelerator-aware task scheduling for Spark
[ https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17031637#comment-17031637 ] Thomas Graves commented on SPARK-24615: --- yes it will be in 3.0, the feature is complete other than Mesos support if someone wants it; see the linked jiras in the epic. You can find the documentation checked into the master branch: https://github.com/apache/spark/blob/master/docs/configuration.md#custom-resource-scheduling-and-configuration-overview > SPIP: Accelerator-aware task scheduling for Spark > - > > Key: SPARK-24615 > URL: https://issues.apache.org/jira/browse/SPARK-24615 > Project: Spark > Issue Type: Epic > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Saisai Shao >Assignee: Thomas Graves >Priority: Major > Labels: Hydrogen, SPIP > Attachments: Accelerator-aware scheduling in Apache Spark 3.0.pdf, > SPIP_ Accelerator-aware scheduling.pdf > > > (The JIRA received a major update on 2019/02/28. Some comments were based on > an earlier version. Please ignore them. New comments start at > [#comment-16778026].) > h2. Background and Motivation > GPUs and other accelerators have been widely used for accelerating special > workloads, e.g., deep learning and signal processing. While users from the AI > community use GPUs heavily, they often need Apache Spark to load and process > large datasets and to handle complex data scenarios like streaming. YARN and > Kubernetes already support GPUs in their recent releases. Although Spark > supports those two cluster managers, Spark itself is not aware of GPUs > exposed by them and hence Spark cannot properly request GPUs and schedule > them for users. This leaves a critical gap to unify big data and AI workloads > and make life simpler for end users. > To make Spark be aware of GPUs, we shall make two major changes at high level: > * At cluster manager level, we update or upgrade cluster managers to include > GPU support. 
Then we expose user interfaces for Spark to request GPUs from > them. > * Within Spark, we update its scheduler to understand available GPUs > allocated to executors, user task requests, and assign GPUs to tasks properly. > Based on the work done in YARN and Kubernetes to support GPUs and some > offline prototypes, we could have necessary features implemented in the next > major release of Spark. You can find a detailed scoping doc here, where we > listed user stories and their priorities. > h2. Goals > * Make Spark 3.0 GPU-aware in standalone, YARN, and Kubernetes. > * No regression on scheduler performance for normal jobs. > h2. Non-goals > * Fine-grained scheduling within one GPU card. > ** We treat one GPU card and its memory together as a non-divisible unit. > * Support TPU. > * Support Mesos. > * Support Windows. > h2. Target Personas > * Admins who need to configure clusters to run Spark with GPU nodes. > * Data scientists who need to build DL applications on Spark. > * Developers who need to integrate DL features on Spark. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30712) Estimate sizeInBytes from file metadata for parquet files
[ https://issues.apache.org/jira/browse/SPARK-30712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17031636#comment-17031636 ] liupengcheng commented on SPARK-30712: -- [~hyukjin.kwon] Yes, in our customized Spark version, we use parquet metadata to compute this size; it's more accurate and works well for some tables. I think we still scan files to get the file size in the `DetermineTableStats` Rule when `fallBackToHdfs` is true. If you are worried about that, we can also add a config for this. Also, in many cases, we can make use of the summary-metadata of parquet files to speed up this estimation. > Estimate sizeInBytes from file metadata for parquet files > - > > Key: SPARK-30712 > URL: https://issues.apache.org/jira/browse/SPARK-30712 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: liupengcheng >Priority: Major > > Currently, Spark will use a compressionFactor when calculating `sizeInBytes` > for `HadoopFsRelation`, but this is not accurate and it's hard to choose the > best `compressionFactor`. Sometimes, this can cause OOMs due to improper > BroadcastHashJoin. > So I propose to use the rowCount in the BlockMetadata to estimate the size in > memory, which can be more accurate. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-30688) Spark SQL Unix Timestamp produces incorrect result with unix_timestamp UDF
[ https://issues.apache.org/jira/browse/SPARK-30688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17031614#comment-17031614 ] Attila Zsolt Piros edited comment on SPARK-30688 at 2/6/20 2:22 PM: I have checked on Spark 3.0.0-preview2 and week in year fails there too (but for invalid format I would say this is fine): {noformat} Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 3.0.0-preview2 /_/ Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_202) Type in expressions to have them evaluated. Type :help for more information. scala> spark.sql("select unix_timestamp('20201', 'ww')").show(); +-+ |unix_timestamp(20201, ww)| +-+ | null| +-+ {noformat} But it fails for a correct pattern too: {noformat} scala> spark.sql("select unix_timestamp('2020-10', '-ww')").show(); ++ |unix_timestamp(2020-10, -ww)| ++ |null| ++ {noformat} But for this it is strangely works: {noformat} scala> spark.sql("select unix_timestamp('2020-01', '-ww')").show(); ++ |unix_timestamp(2020-01, -ww)| ++ | 1577833200| ++ {noformat} was (Author: attilapiros): I have checked on Spark 3.0.0-preview2 and week in year fails there too: {noformat} Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 3.0.0-preview2 /_/ Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_202) Type in expressions to have them evaluated. Type :help for more information. 
scala> spark.sql("select unix_timestamp('20201', 'ww')").show(); +-+ |unix_timestamp(20201, ww)| +-+ | null| +-+ {noformat} As you can see it fails for even a simpler pattern too: {noformat} scala> spark.sql("select unix_timestamp('2020-10', '-ww')").show(); ++ |unix_timestamp(2020-10, -ww)| ++ |null| ++ {noformat} But this strangely works: {noformat} scala> spark.sql("select unix_timestamp('2020-01', '-ww')").show(); ++ |unix_timestamp(2020-01, -ww)| ++ | 1577833200| ++ {noformat} > Spark SQL Unix Timestamp produces incorrect result with unix_timestamp UDF > -- > > Key: SPARK-30688 > URL: https://issues.apache.org/jira/browse/SPARK-30688 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2 >Reporter: Rajkumar Singh >Priority: Major > > > {code:java} > scala> spark.sql("select unix_timestamp('20201', 'ww')").show(); > +-+ > |unix_timestamp(20201, ww)| > +-+ > | null| > +-+ > > scala> spark.sql("select unix_timestamp('20202', 'ww')").show(); > -+ > |unix_timestamp(20202, ww)| > +-+ > | 1578182400| > +-+ > > {code} > > > This seems to happen for leap year only, I dig deeper into it and it seems > that Spark is using the java.text.SimpleDateFormat and try to parse the > expression here > [org.apache.spark.sql.catalyst.expressions.UnixTime#eval|https://github.com/hortonworks/spark2/blob/49ec35bbb40ec6220282d932c9411773228725be/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala#L652] > {code:java} > formatter.parse( > t.asInstanceOf[UTF8String].toString).getTime / 1000L{code} > but fail and SimpleDateFormat unable to parse the date throw Unparseable > Exception but Spark handle it silently and returns NULL. 
>
> *Spark-3.0:* I did some tests; Spark no longer uses the legacy
> java.text.SimpleDateFormat but the Java date/time API, and it seems the
> date/time API expects a valid date with a valid format:
> org.apache.spark.sql.catalyst.util.Iso8601TimestampFormatter#parse -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
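The silent-NULL behavior described above boils down to catching the parse failure and returning null instead of propagating it. A minimal sketch of that pattern (illustrative only, not Spark's actual code; the class and method names are invented for the example):

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;

public class SilentNull {
    // Mimics the Spark 2.x UnixTime#eval behavior the reporter describes:
    // on any parse failure, swallow the ParseException and return null,
    // which then surfaces as NULL in the SQL result.
    static Long unixTimestamp(String value, String pattern) {
        try {
            return new SimpleDateFormat(pattern).parse(value).getTime() / 1000L;
        } catch (ParseException e) {
            return null; // exception swallowed silently
        }
    }

    public static void main(String[] args) {
        System.out.println(unixTimestamp("not-a-date", "yyyy-MM-dd")); // null
    }
}
```

The design trade-off is exactly what the ticket complains about: the caller cannot distinguish "value was NULL" from "value was unparseable".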
[jira] [Commented] (SPARK-30688) Spark SQL Unix Timestamp produces incorrect result with unix_timestamp UDF
[ https://issues.apache.org/jira/browse/SPARK-30688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17031614#comment-17031614 ] Attila Zsolt Piros commented on SPARK-30688:

I have checked on Spark 3.0.0-preview2 and week in year fails there too:
{noformat}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.0-preview2
      /_/

Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_202)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.sql("select unix_timestamp('20201', 'yyyyww')").show();
+-----------------------------+
|unix_timestamp(20201, yyyyww)|
+-----------------------------+
|                         null|
+-----------------------------+
{noformat}
As you can see it fails even for a simpler pattern too:
{noformat}
scala> spark.sql("select unix_timestamp('2020-10', 'yyyy-ww')").show();
+--------------------------------+
|unix_timestamp(2020-10, yyyy-ww)|
+--------------------------------+
|                            null|
+--------------------------------+
{noformat}
But this strangely works:
{noformat}
scala> spark.sql("select unix_timestamp('2020-01', 'yyyy-ww')").show();
+--------------------------------+
|unix_timestamp(2020-01, yyyy-ww)|
+--------------------------------+
|                      1577833200|
+--------------------------------+
{noformat}

> Spark SQL Unix Timestamp produces incorrect result with unix_timestamp UDF
> --------------------------------------------------------------------------
>
>                 Key: SPARK-30688
>                 URL: https://issues.apache.org/jira/browse/SPARK-30688
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.2
>            Reporter: Rajkumar Singh
>            Priority: Major
>
> {code:java}
> scala> spark.sql("select unix_timestamp('20201', 'yyyyww')").show();
> +-----------------------------+
> |unix_timestamp(20201, yyyyww)|
> +-----------------------------+
> |                         null|
> +-----------------------------+
>
> scala> spark.sql("select unix_timestamp('20202', 'yyyyww')").show();
> +-----------------------------+
> |unix_timestamp(20202, yyyyww)|
> +-----------------------------+
> |                   1578182400|
> +-----------------------------+
> {code}
>
> This seems to happen for leap years only. I dug deeper into it and it seems
> that Spark is using java.text.SimpleDateFormat and tries to parse the
> expression here:
> [org.apache.spark.sql.catalyst.expressions.UnixTime#eval|https://github.com/hortonworks/spark2/blob/49ec35bbb40ec6220282d932c9411773228725be/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala#L652]
> {code:java}
> formatter.parse(
>   t.asInstanceOf[UTF8String].toString).getTime / 1000L{code}
> but it fails: SimpleDateFormat is unable to parse the date and throws an
> unparseable exception, which Spark handles silently and returns NULL.
>
> *Spark-3.0:* I did some tests; Spark no longer uses the legacy
> java.text.SimpleDateFormat but the Java date/time API, and it seems the
> date/time API expects a valid date with a valid format:
> org.apache.spark.sql.catalyst.util.Iso8601TimestampFormatter#parse -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
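As a hedged illustration of why the java.time API is stricter: with plain java.time, a year-plus-week pattern only resolves to a date once a day-of-week is supplied. The ISO week fields and the Monday default below are choices made for this sketch, not necessarily what Spark's Iso8601TimestampFormatter does:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.temporal.WeekFields;

public class WeekParse {
    public static void main(String[] args) {
        WeekFields wf = WeekFields.ISO;
        // A week-based year plus a week number alone cannot resolve to a
        // LocalDate; defaulting the day-of-week (here Monday) makes it work.
        DateTimeFormatter fmt = new DateTimeFormatterBuilder()
                .appendValue(wf.weekBasedYear(), 4)
                .appendLiteral('-')
                .appendValue(wf.weekOfWeekBasedYear(), 2)
                .parseDefaulting(wf.dayOfWeek(), 1) // 1 = Monday for ISO weeks
                .toFormatter();
        LocalDate monday = LocalDate.parse("2020-01", fmt);
        System.out.println(monday); // 2019-12-30, the Monday of ISO week 1, 2020
    }
}
```

Without the `parseDefaulting` line, the same parse throws a DateTimeException instead of silently guessing a day, which matches the "expects a valid date with a valid format" observation above.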
[jira] [Created] (SPARK-30748) Storage Memory in Spark Web UI means
islandshinji created SPARK-30748: Summary: Storage Memory in Spark Web UI means Key: SPARK-30748 URL: https://issues.apache.org/jira/browse/SPARK-30748 Project: Spark Issue Type: Question Components: Web UI Affects Versions: 2.4.0 Reporter: islandshinji Does the denominator of 'Storage Memory' in Spark Web UI include execution memory? In my environment, I set 'spark.executor.memory' to 20g and the denominator of 'Storage Memory' is 11.3g. I think that is too big to be storage memory alone. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
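For reference, the denominator shown on the Executors page is the whole unified region (storage plus execution), which Spark's UnifiedMemoryManager sizes as (usable heap − 300 MB reserved) × spark.memory.fraction (0.6 by default). A back-of-envelope sketch of that arithmetic, assuming default settings; the gap between the ~11.8 GiB computed here and the observed 11.3g comes largely from the JVM's Runtime.maxMemory() being somewhat below -Xmx=20g:

```java
public class UnifiedMemoryEstimate {
    // 300 MB reserved by Spark's UnifiedMemoryManager (RESERVED_SYSTEM_MEMORY_BYTES)
    static final long RESERVED = 300L * 1024 * 1024;
    // Default value of spark.memory.fraction since Spark 2.0
    static final double MEMORY_FRACTION = 0.6;

    // Size of the unified (storage + execution) region for a given heap size.
    static long unifiedRegionBytes(long heapBytes) {
        return (long) ((heapBytes - RESERVED) * MEMORY_FRACTION);
    }

    public static void main(String[] args) {
        long heap = 20L * 1024 * 1024 * 1024; // spark.executor.memory=20g
        double gib = unifiedRegionBytes(heap) / (1024.0 * 1024 * 1024);
        System.out.printf("%.1f GiB%n", gib); // 11.8 GiB
    }
}
```

So yes: the denominator includes execution memory as well, since storage and execution borrow from one shared pool.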
[jira] [Resolved] (SPARK-30744) Optimize AnalyzePartitionCommand by calculating location sizes in parallel
[ https://issues.apache.org/jira/browse/SPARK-30744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-30744. - Fix Version/s: 3.1.0 Assignee: wuyi Resolution: Fixed > Optimize AnalyzePartitionCommand by calculating location sizes in parallel > -- > > Key: SPARK-30744 > URL: https://issues.apache.org/jira/browse/SPARK-30744 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: wuyi >Assignee: wuyi >Priority: Major > Fix For: 3.1.0 > > > AnalyzePartitionCommand could use CommandUtils.calculateTotalLocationSize to > calculate location sizes in parallel to improve performance. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
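The optimization amounts to walking each partition location concurrently instead of one directory at a time. A plain-Java sketch of the idea (the class and method names here are illustrative; the real change lives in Spark's CommandUtils):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Stream;

public class ParallelLocationSize {
    // Sum the on-disk size of every location in parallel, analogous in
    // spirit to calculating partition location sizes concurrently.
    static long totalSize(List<Path> locations) {
        return locations.parallelStream()
                .mapToLong(ParallelLocationSize::dirSize)
                .sum();
    }

    // Recursively sum the sizes of regular files under one directory.
    static long dirSize(Path dir) {
        try (Stream<Path> files = Files.walk(dir)) {
            return files.filter(Files::isRegularFile)
                    .mapToLong(p -> p.toFile().length())
                    .sum();
        } catch (IOException e) {
            return 0L; // unreadable location contributes nothing
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempDirectory("part");
        Files.writeString(tmp.resolve("data"), "hello");
        System.out.println(totalSize(List.of(tmp))); // 5
    }
}
```

For tables with many partitions the per-location listing is I/O-bound, so fanning it out is where the speedup comes from.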
[jira] [Commented] (SPARK-30747) Update roxygen2 to 7.0.1
[ https://issues.apache.org/jira/browse/SPARK-30747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17031471#comment-17031471 ] Maciej Szymkiewicz commented on SPARK-30747: CC [~felixcheung] [~hyukjin.kwon] [~shaneknapp] > Update roxygen2 to 7.0.1 > > > Key: SPARK-30747 > URL: https://issues.apache.org/jira/browse/SPARK-30747 > Project: Spark > Issue Type: Improvement > Components: SparkR, Tests >Affects Versions: 3.0.0 >Reporter: Maciej Szymkiewicz >Priority: Minor > > Currently Spark uses {{roxygen2}} 5.0.1. It is already pretty old > (2015-11-11) so it could be a good idea to use current R updates to update it > as well. > At crude inspection: > * SPARK-22430 has been resolved a while ago. > * SPARK-30737][SPARK-27262, https://github.com/apache/spark/pull/27437 and > https://github.com/apache/spark/commit/b95ccb1d8b726b11435789cdb5882df6643430ed > resolved persisting warnings > * Documentation builds and CRAN checks pass > * Generated HTML docs are identical to 5.0.1 > Since {{roxygen2}} shares some potentially unstable dependencies with > {{devtools}} (primarily {{rlang}}) it might be a good idea to keep these in > sync (as a bonus we wouldn't have to worry about {{DESCRIPTION}} being > overwritten by local tests). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30747) Update roxygen2 to 7.0.1
[ https://issues.apache.org/jira/browse/SPARK-30747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Szymkiewicz updated SPARK-30747: --- Description: Currently Spark uses {{roxygen2}} 5.0.1. It is already pretty old (2015-11-11) so it could be a good idea to use current R updates to update it as well. At crude inspection: * SPARK-22430 has been resolved a while ago. * SPARK-30737][SPARK-27262, https://github.com/apache/spark/pull/27437 and https://github.com/apache/spark/commit/b95ccb1d8b726b11435789cdb5882df6643430ed resolved persisting warnings * Documentation builds and CRAN checks pass * Generated HTML docs are identical to 5.0.1 Since {{roxygen2}} shares some potentially unstable dependencies with {{devtools}} (primarily {{rlang}}) it might be a good idea to keep these in sync (as a bonus we wouldn't have to worry about {{DESCRIPTION}} being overwritten by local tests). was: Currently Spark uses {{roxygen2}} 5.0.1. It is already pretty old (2015-11-11) so it could be a good idea to use current R updates to update it as well. At crude inspection: * SPARK-22430 has been resolved a while ago. * https://github.com/apache/spark/pull/27437 and https://github.com/apache/spark/commit/b95ccb1d8b726b11435789cdb5882df6643430ed resolved persisting warnings * Documentation builds and CRAN checks pass * Generated HTML docs are identical to 5.0.1 > Update roxygen2 to 7.0.1 > > > Key: SPARK-30747 > URL: https://issues.apache.org/jira/browse/SPARK-30747 > Project: Spark > Issue Type: Improvement > Components: SparkR, Tests >Affects Versions: 3.0.0 >Reporter: Maciej Szymkiewicz >Priority: Minor > > Currently Spark uses {{roxygen2}} 5.0.1. It is already pretty old > (2015-11-11) so it could be a good idea to use current R updates to update it > as well. > At crude inspection: > * SPARK-22430 has been resolved a while ago. 
> * SPARK-30737][SPARK-27262, https://github.com/apache/spark/pull/27437 and > https://github.com/apache/spark/commit/b95ccb1d8b726b11435789cdb5882df6643430ed > resolved persisting warnings > * Documentation builds and CRAN checks pass > * Generated HTML docs are identical to 5.0.1 > Since {{roxygen2}} shares some potentially unstable dependencies with > {{devtools}} (primarily {{rlang}}) it might be a good idea to keep these in > sync (as a bonus we wouldn't have to worry about {{DESCRIPTION}} being > overwritten by local tests). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30747) Update roxygen2 to 7.0.1
Maciej Szymkiewicz created SPARK-30747: -- Summary: Update roxygen2 to 7.0.1 Key: SPARK-30747 URL: https://issues.apache.org/jira/browse/SPARK-30747 Project: Spark Issue Type: Improvement Components: SparkR, Tests Affects Versions: 3.0.0 Reporter: Maciej Szymkiewicz Currently Spark uses {{roxygen2}} 5.0.1. It is already pretty old (2015-11-11) so it could be a good idea to use current R updates to update it as well. At crude inspection: * SPARK-22430 has been resolved a while ago. * https://github.com/apache/spark/pull/27437 and https://github.com/apache/spark/commit/b95ccb1d8b726b11435789cdb5882df6643430ed resolved persisting warnings * Documentation builds and CRAN checks pass * Generated HTML docs are identical to 5.0.1 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
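For completeness, pinning an exact {{roxygen2}} version (rather than whatever is latest) can be done with the {{remotes}} package; the CRAN mirror URL below is an assumption:

```shell
# Install remotes if missing, then pin roxygen2 to exactly 7.0.1
Rscript -e 'if (!requireNamespace("remotes", quietly = TRUE)) install.packages("remotes", repos = "https://cloud.r-project.org")'
Rscript -e 'remotes::install_version("roxygen2", version = "7.0.1", repos = "https://cloud.r-project.org")'
```

Pinning matters here because, as noted above, {{roxygen2}} rewrites {{DESCRIPTION}} with the version that generated it, so mixed versions across build machines produce spurious diffs.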
[jira] [Commented] (SPARK-30739) unable to turn off Hadoop's trash feature
[ https://issues.apache.org/jira/browse/SPARK-30739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17031437#comment-17031437 ] Ohad Raviv commented on SPARK-30739: Closing as I realized this is actually the documented behaviour [here|https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/core-default.xml]. _fs.trash.interval_ _Number of minutes between trash checkpoints. Should be smaller or equal to fs.trash.interval. If zero, the value is set to the value of fs.trash.interval. Every time the checkpointer runs it creates a new checkpoint out of current and removes checkpoints created more than fs.trash.interval minutes ago._ So I decided to use the _fs.trash.classname_ approach. > unable to turn off Hadoop's trash feature > - > > Key: SPARK-30739 > URL: https://issues.apache.org/jira/browse/SPARK-30739 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ohad Raviv >Priority: Minor > > We're trying to turn off the `TrashPolicyDefault` in one of our Spark > applications by setting `spark.hadoop.fs.trash.interval=0`, but it just stays > `360` as configured in our cluster's `core-site.xml`. > Trying to debug it we managed to set > `spark.hadoop.fs.trash.classname=OtherTrashPolicy` and it worked. The main > difference seems to be that `spark.hadoop.fs.trash.classname` does not appear > in any of the `*-site.xml` files. > When we print the conf that gets initialized in `TrashPolicyDefault` we get: > ``` > Configuration: core-default.xml, core-site.xml, yarn-default.xml, > yarn-site.xml, mapred-default.xml, mapred-site.xml, hdfs-default.xml, > hdfs-site.xml, > org.apache.hadoop.hive.conf.LoopingByteArrayInputStream@561f0431, > file:/hadoop03/yarn/local/usercache/.../hive-site.xml > ``` > and: > `fs.trash.interval=360 [programatically]` > `fs.trash.classname=OtherTrashPolicy [programatically]` > > Any idea why `fs.trash.classname` works but `fs.trash.interval` doesn't? 
> this seems maybe related to: -SPARK-9825.- > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30739) unable to turn off Hadoop's trash feature
[ https://issues.apache.org/jira/browse/SPARK-30739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ohad Raviv resolved SPARK-30739. Resolution: Workaround > unable to turn off Hadoop's trash feature > - > > Key: SPARK-30739 > URL: https://issues.apache.org/jira/browse/SPARK-30739 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ohad Raviv >Priority: Minor > > We're trying to turn off the `TrashPolicyDefault` in one of our Spark > applications by setting `spark.hadoop.fs.trash.interval=0`, but it just stays > `360` as configured in our cluster's `core-site.xml`. > Trying to debug it we managed to set > `spark.hadoop.fs.trash.classname=OtherTrashPolicy` and it worked. the main > difference seems to be that `spark.hadoop.fs.trash.classname` does not appear > in any of the `*-site.xml` files. > when we print the conf that get initialized in `TrashPolicyDefault` we get: > ``` > Configuration: core-default.xml, core-site.xml, yarn-default.xml, > yarn-site.xml, mapred-default.xml, mapred-site.xml, hdfs-default.xml, > hdfs-site.xml, > org.apache.hadoop.hive.conf.LoopingByteArrayInputStream@561f0431, > file:/hadoop03/yarn/local/usercache/.../hive-site.xml > ``` > and: > `fs.trash.interval=360 [programatically]` > `fs.trash.classname=OtherTrashPolicy [programatically]` > > any idea why `fs.trash.classname` works but `fs.trash.interval` doesn't? > this seems maybe related to: -SPARK-9825.- > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24615) SPIP: Accelerator-aware task scheduling for Spark
[ https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17031378#comment-17031378 ] Sujith Chacko commented on SPARK-24615: --- Will this feature be a part of Spark 3.0? Any update on release timeline of this feature. Thanks > SPIP: Accelerator-aware task scheduling for Spark > - > > Key: SPARK-24615 > URL: https://issues.apache.org/jira/browse/SPARK-24615 > Project: Spark > Issue Type: Epic > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Saisai Shao >Assignee: Thomas Graves >Priority: Major > Labels: Hydrogen, SPIP > Attachments: Accelerator-aware scheduling in Apache Spark 3.0.pdf, > SPIP_ Accelerator-aware scheduling.pdf > > > (The JIRA received a major update on 2019/02/28. Some comments were based on > an earlier version. Please ignore them. New comments start at > [#comment-16778026].) > h2. Background and Motivation > GPUs and other accelerators have been widely used for accelerating special > workloads, e.g., deep learning and signal processing. While users from the AI > community use GPUs heavily, they often need Apache Spark to load and process > large datasets and to handle complex data scenarios like streaming. YARN and > Kubernetes already support GPUs in their recent releases. Although Spark > supports those two cluster managers, Spark itself is not aware of GPUs > exposed by them and hence Spark cannot properly request GPUs and schedule > them for users. This leaves a critical gap to unify big data and AI workloads > and make life simpler for end users. > To make Spark be aware of GPUs, we shall make two major changes at high level: > * At cluster manager level, we update or upgrade cluster managers to include > GPU support. Then we expose user interfaces for Spark to request GPUs from > them. > * Within Spark, we update its scheduler to understand available GPUs > allocated to executors, user task requests, and assign GPUs to tasks properly. 
> Based on the work done in YARN and Kubernetes to support GPUs and some > offline prototypes, we could have necessary features implemented in the next > major release of Spark. You can find a detailed scoping doc here, where we > listed user stories and their priorities. > h2. Goals > * Make Spark 3.0 GPU-aware in standalone, YARN, and Kubernetes. > * No regression on scheduler performance for normal jobs. > h2. Non-goals > * Fine-grained scheduling within one GPU card. > ** We treat one GPU card and its memory together as a non-divisible unit. > * Support TPU. > * Support Mesos. > * Support Windows. > h2. Target Personas > * Admins who need to configure clusters to run Spark with GPU nodes. > * Data scientists who need to build DL applications on Spark. > * Developers who need to integrate DL features on Spark. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
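The scheduling described in this SPIP landed in Spark 3.0 as per-resource configs. A typical (illustrative) submission requests GPUs per executor and per task; the discovery script shown ships with Spark's examples, but the master, amounts, and application jar are placeholders:

```shell
# Request 2 GPUs per executor and 1 GPU per task; the discovery script
# reports the GPU addresses visible to each executor.
spark-submit \
  --master yarn \
  --conf spark.executor.resource.gpu.amount=2 \
  --conf spark.task.resource.gpu.amount=1 \
  --conf spark.executor.resource.gpu.discoveryScript=/opt/spark/examples/src/main/scripts/getGpusResources.sh \
  app.jar
```

Tasks can then read their assigned addresses via `TaskContext.get().resources()("gpu").addresses`, which is how the scheduler changes above become visible to user code.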