[jira] [Commented] (SPARK-34167) Reading parquet with Decimal(8,2) written as a Decimal64 blows up
[ https://issues.apache.org/jira/browse/SPARK-34167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17269182#comment-17269182 ]

Attila Zsolt Piros commented on SPARK-34167:

[~razajafri] could you share with us how the parquet files are created? I tried to reproduce this issue in the following way, but I had no luck:

{noformat}
Spark context Web UI available at http://192.168.0.17:4045
Spark context available as 'sc' (master = local, app id = local-1611221568779).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.1
      /_/

Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_242)
Type in expressions to have them evaluated.
Type :help for more information.

scala> import java.math.BigDecimal
import java.math.BigDecimal

scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row

scala> import org.apache.spark.sql.types.{DecimalType, StructField, StructType}
import org.apache.spark.sql.types.{DecimalType, StructField, StructType}

scala> val schema = StructType(Array(StructField("num", DecimalType(8,2),true)))
schema: org.apache.spark.sql.types.StructType = StructType(StructField(num,DecimalType(8,2),true))

scala> val rdd = sc.parallelize((0 to 9).map(v => new BigDecimal(s"123456.7$v")))
rdd: org.apache.spark.rdd.RDD[java.math.BigDecimal] = ParallelCollectionRDD[0] at parallelize at :27

scala> val df = spark.createDataFrame(rdd.map(Row(_)), schema)
df: org.apache.spark.sql.DataFrame = [num: decimal(8,2)]

scala> df.show()
+---------+
|      num|
+---------+
|123456.70|
|123456.71|
|123456.72|
|123456.73|
|123456.74|
|123456.75|
|123456.76|
|123456.77|
|123456.78|
|123456.79|
+---------+

scala> df.write.parquet("num.parquet")

scala> spark.read.parquet("num.parquet").show()
+---------+
|      num|
+---------+
|123456.70|
|123456.71|
|123456.72|
|123456.73|
|123456.74|
|123456.75|
|123456.76|
|123456.77|
|123456.78|
|123456.79|
+---------+
{noformat}

> Reading parquet with Decimal(8,2) written as a Decimal64
blows up
> ---------------------------------------------------------
>
>                 Key: SPARK-34167
>                 URL: https://issues.apache.org/jira/browse/SPARK-34167
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 3.0.1
>            Reporter: Raza Jafri
>            Priority: Major
>         Attachments: part-0-7fecd321-b247-4f7e-bff5-c2e4d8facaa0-c000.snappy.parquet, part-0-940f44f1-f323-4a5e-b828-1e65d87895aa-c000.snappy.parquet
>
> When reading a parquet file in which Decimals with precision < 10 were written in a 64-bit representation, Spark tries to read the column as an INT and fails.
>
> Steps to reproduce:
> Read the attached file that has a single Decimal(8,2) column with 10 values
> {code:java}
> scala> spark.read.parquet("/tmp/pyspark_tests/936454/PARQUET_DATA").show
> ...
> Caused by: java.lang.NullPointerException
>   at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.putLong(OnHeapColumnVector.java:327)
>   at org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readLongs(VectorizedRleValuesReader.java:370)
>   at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readLongBatch(VectorizedColumnReader.java:514)
>   at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:256)
>   at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:273)
>   at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:171)
>   at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
>   at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
>   at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173)
>   at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
>   at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:497)
>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
>   at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:756)
>   at
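For context on why precision 8 trips the reader: Spark chooses the Parquet physical type for a decimal column from its precision (the thresholds mirror Spark's Decimal.MAX_INT_DIGITS = 9 and MAX_LONG_DIGITS = 18), while the Parquet format itself also allows another writer to store a small-precision decimal as INT64, which is the mismatch reported here. A simplified sketch of the mapping (illustrative, not the actual Spark source):

```python
def parquet_physical_type(precision: int) -> str:
    """Physical type Spark's Parquet writer (and vectorized reader) expects
    for DECIMAL(precision, scale). Simplified illustration."""
    if precision <= 9:          # Decimal.MAX_INT_DIGITS
        return "INT32"
    elif precision <= 18:       # Decimal.MAX_LONG_DIGITS
        return "INT64"
    return "FIXED_LEN_BYTE_ARRAY"

# DECIMAL(8,2): Spark expects INT32, so a file that legally stored the
# column as INT64 sends the reader down the readLongs path seen above.
print(parquet_physical_type(8))   # INT32
print(parquet_physical_type(18))  # INT64
```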
[jira] [Commented] (SPARK-33711) Race condition in Spark k8s Pod lifecycle manager that leads to shutdowns
[ https://issues.apache.org/jira/browse/SPARK-33711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17265863#comment-17265863 ]

Attila Zsolt Piros commented on SPARK-33711:

Sure. Working on that.

> Race condition in Spark k8s Pod lifecycle manager that leads to shutdowns
> -------------------------------------------------------------------------
>
>                 Key: SPARK-33711
>                 URL: https://issues.apache.org/jira/browse/SPARK-33711
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes
>    Affects Versions: 2.3.4, 2.4.7, 3.0.0, 3.1.0, 3.2.0
>            Reporter: Attila Zsolt Piros
>            Assignee: Attila Zsolt Piros
>            Priority: Major
>             Fix For: 3.2.0, 3.1.1
>
> Watching a POD (ExecutorPodsWatchSnapshotSource) informs about single POD changes, which could wrongfully lead to the detection of missing PODs (PODs known by the scheduler backend but missing from POD snapshots) by the executor POD lifecycle manager.
> A key indicator of this is seeing this log message:
> "The executor with ID [some_id] was not found in the cluster but we didn't get a reason why. Marking the executor as failed. The executor may have been deleted but the driver missed the deletion event."
> So one of the problems is running the missing-POD detection even when only a single pod has changed, without having a full, consistent snapshot of all the PODs (see ExecutorPodsPollingSnapshotSource). The other could be a race between the executor POD lifecycle manager and the scheduler backend.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
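The failure mode described in this issue can be shown with a toy model (names are assumed; this is not Spark's actual ExecutorPodsLifecycleManager code): diffing the executors the scheduler backend knows about against a pod snapshot is only sound when the snapshot is a full cluster view, not a single-pod watch event.

```python
def missing_pods(known_to_backend: set, snapshot_pods: set) -> set:
    """Pods the backend knows about but that are absent from the snapshot."""
    return known_to_backend - snapshot_pods

known = {"exec-1", "exec-2", "exec-3"}

# Full snapshot (e.g. from a polling source): the diff is meaningful.
print(missing_pods(known, {"exec-1", "exec-2", "exec-3"}))  # set()

# Partial snapshot built from a single-pod watch event: healthy executors
# wrongly appear "missing" and can be marked failed / shut down.
print(missing_pods(known, {"exec-2"}))  # exec-1 and exec-3, both healthy
```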
[jira] [Comment Edited] (SPARK-26836) Columns get switched in Spark SQL using Avro backed Hive table if schema evolves
[ https://issues.apache.org/jira/browse/SPARK-26836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17261398#comment-17261398 ]

Attila Zsolt Piros edited comment on SPARK-26836 at 1/8/21, 4:15 PM:

I am working on this and a PR can be expected this weekend / next week

was (Author: attilapiros):
I am working on this

> Columns get switched in Spark SQL using Avro backed Hive table if schema evolves
> --------------------------------------------------------------------------------
>
>                 Key: SPARK-26836
>                 URL: https://issues.apache.org/jira/browse/SPARK-26836
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.1, 2.4.0
>         Environment: I tested with Hive and HCatalog which run on version 2.3.4, and with Spark 2.3.1 and 2.4
>            Reporter: Tamás Németh
>            Priority: Critical
>              Labels: correctness
>         Attachments: doctors.avro, doctors_evolved.avro, doctors_evolved.json, original.avsc
>
> I have a Hive Avro table where the Avro schema is stored on S3 next to the Avro files.
> In the table definition the avro.schema.url always points to the latest partition's _schema.avsc file, which is always the latest schema. (Avro schemas are backward and forward compatible in a table.)
> When new data comes in, I always add a new partition whose avro.schema.url property is also set to the _schema.avsc that was in use when the partition was added, and of course I always update the table's avro.schema.url property to the latest one.
> Querying this table works fine until the schema evolves in a way that a new optional property is added in the middle.
> When this happens, after the Spark SQL query the columns in the old partition get mixed up and the wrong data is shown for the columns.
> If I query the table with Hive then everything is perfectly fine and it gives me back the correct columns, both for the partitions which were created with the old schema and for the new ones which were created with the evolved schema.
>
> Here is how I could reproduce it with the [doctors.avro|https://github.com/apache/spark/blob/master/sql/hive/src/test/resources/data/files/doctors.avro] example data in the sql test suite.
> # I have created two partition folders:
> {code:java}
> [hadoop@ip-192-168-10-158 hadoop]$ hdfs dfs -ls s3://somelocation/doctors/*/
> Found 2 items
> -rw-rw-rw- 1 hadoop hadoop 418 2019-02-06 12:48 s3://somelocation/doctors/dt=2019-02-05/_schema.avsc
> -rw-rw-rw- 1 hadoop hadoop 521 2019-02-06 12:13 s3://somelocation/doctors/dt=2019-02-05/doctors.avro
> Found 2 items
> -rw-rw-rw- 1 hadoop hadoop 580 2019-02-06 12:49 s3://somelocation/doctors/dt=2019-02-06/_schema.avsc
> -rw-rw-rw- 1 hadoop hadoop 577 2019-02-06 12:13 s3://somelocation/doctors/dt=2019-02-06/doctors_evolved.avro{code}
> Here the first partition holds data created with the schema before evolving, and the second one holds data created with the evolved schema. (The evolved schema is the same as in your testcase, except that I moved the extra_field column from the second position to the last, and I generated two lines of Avro data with the evolved schema.)
> # I have created a Hive table with the following command:
> {code:java}
> CREATE EXTERNAL TABLE `default.doctors`
> PARTITIONED BY (
>   `dt` string
> )
> ROW FORMAT SERDE
>   'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
> WITH SERDEPROPERTIES (
>   'avro.schema.url'='s3://somelocation/doctors/dt=2019-02-06/_schema.avsc')
> STORED AS INPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
> OUTPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
> LOCATION
>   's3://somelocation/doctors/'
> TBLPROPERTIES (
>   'transient_lastDdlTime'='1538130975'){code}
> Here, as you can see, the table schema url points to the latest schema.
> 3. I ran an msck _repair table_ to pick up all the partitions.
> FYI: If I run my select * query from here then everything is fine and no column switch happens.
> 4.
Then I changed the first partition's avro.schema.url to point to the schema which is under the partition folder (the non-evolved one -> s3://somelocation/doctors/dt=2019-02-05/_schema.avsc).
> Then if you run a _select * from default.spark_test_ the columns will be mixed up (in the data below the first_name column becomes the extra_field column, I guess because in the latest schema it is the second column):
> {code:java}
> number,extra_field,first_name,last_name,dt
> 6,Colin,Baker,null,2019-02-05
> 3,Jon,Pertwee,null,2019-02-05
> 4,Tom,Baker,null,2019-02-05
> 5,Peter,Davison,null,2019-02-05
> 11,Matt,Smith,null,2019-02-05
> 1,William,Hartnell,null,2019-02-05
> 7,Sylvester,McCoy,null,2019-02-05
> 8,Paul,McGann,null,2019-02-05
> 2,Patrick,Troughton,null,2019-02-05
>
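The column switch reported above is what position-based schema resolution produces, whereas Avro's own schema resolution matches fields by name. A small illustrative sketch (not the Hive/Spark implementation) using the schemas from this report:

```python
# Rows written with the old schema, decoded against the evolved schema.
old_schema = ["number", "first_name", "last_name"]
evolved_schema = ["number", "extra_field", "first_name", "last_name"]

old_row = [6, "Colin", "Baker"]  # a row from the 2019-02-05 partition

# Position-based resolution: inserting extra_field in the middle shifts
# every later value under the wrong column name, exactly as in the report
# (first_name "Colin" shows up under extra_field).
positional = dict(zip(evolved_schema, old_row))
print(positional)

# Name-based resolution (what Avro schema resolution specifies): unchanged
# columns keep their values; the added optional field defaults to None.
named = dict(zip(old_schema, old_row))
by_name = {field: named.get(field) for field in evolved_schema}
print(by_name)
```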
[jira] [Commented] (SPARK-26836) Columns get switched in Spark SQL using Avro backed Hive table if schema evolves
[ https://issues.apache.org/jira/browse/SPARK-26836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17261398#comment-17261398 ]

Attila Zsolt Piros commented on SPARK-26836:

I am working on this

> Columns get switched in Spark SQL using Avro backed Hive table if schema evolves
> --------------------------------------------------------------------------------
>
>                 Key: SPARK-26836
>                 URL: https://issues.apache.org/jira/browse/SPARK-26836
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.1, 2.4.0
>         Environment: I tested with Hive and HCatalog which run on version 2.3.4, and with Spark 2.3.1 and 2.4
>            Reporter: Tamás Németh
>            Priority: Critical
>              Labels: correctness
>         Attachments: doctors.avro, doctors_evolved.avro, doctors_evolved.json, original.avsc
>
> I have a Hive Avro table where the Avro schema is stored on S3 next to the Avro files.
> In the table definition the avro.schema.url always points to the latest partition's _schema.avsc file, which is always the latest schema. (Avro schemas are backward and forward compatible in a table.)
> When new data comes in, I always add a new partition whose avro.schema.url property is also set to the _schema.avsc that was in use when the partition was added, and of course I always update the table's avro.schema.url property to the latest one.
> Querying this table works fine until the schema evolves in a way that a new optional property is added in the middle.
> When this happens, after the Spark SQL query the columns in the old partition get mixed up and the wrong data is shown for the columns.
> If I query the table with Hive then everything is perfectly fine and it gives me back the correct columns, both for the partitions which were created with the old schema and for the new ones which were created with the evolved schema.
>
> Here is how I could reproduce it with the [doctors.avro|https://github.com/apache/spark/blob/master/sql/hive/src/test/resources/data/files/doctors.avro] example data in the sql test suite.
> # I have created two partition folders:
> {code:java}
> [hadoop@ip-192-168-10-158 hadoop]$ hdfs dfs -ls s3://somelocation/doctors/*/
> Found 2 items
> -rw-rw-rw- 1 hadoop hadoop 418 2019-02-06 12:48 s3://somelocation/doctors/dt=2019-02-05/_schema.avsc
> -rw-rw-rw- 1 hadoop hadoop 521 2019-02-06 12:13 s3://somelocation/doctors/dt=2019-02-05/doctors.avro
> Found 2 items
> -rw-rw-rw- 1 hadoop hadoop 580 2019-02-06 12:49 s3://somelocation/doctors/dt=2019-02-06/_schema.avsc
> -rw-rw-rw- 1 hadoop hadoop 577 2019-02-06 12:13 s3://somelocation/doctors/dt=2019-02-06/doctors_evolved.avro{code}
> Here the first partition holds data created with the schema before evolving, and the second one holds data created with the evolved schema. (The evolved schema is the same as in your testcase, except that I moved the extra_field column from the second position to the last, and I generated two lines of Avro data with the evolved schema.)
> # I have created a Hive table with the following command:
> {code:java}
> CREATE EXTERNAL TABLE `default.doctors`
> PARTITIONED BY (
>   `dt` string
> )
> ROW FORMAT SERDE
>   'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
> WITH SERDEPROPERTIES (
>   'avro.schema.url'='s3://somelocation/doctors/dt=2019-02-06/_schema.avsc')
> STORED AS INPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
> OUTPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
> LOCATION
>   's3://somelocation/doctors/'
> TBLPROPERTIES (
>   'transient_lastDdlTime'='1538130975'){code}
> Here, as you can see, the table schema url points to the latest schema.
> 3. I ran an msck _repair table_ to pick up all the partitions.
> FYI: If I run my select * query from here then everything is fine and no column switch happens.
> 4.
Then I changed the first partition's avro.schema.url to point to the schema which is under the partition folder (the non-evolved one -> s3://somelocation/doctors/dt=2019-02-05/_schema.avsc).
> Then if you run a _select * from default.spark_test_ the columns will be mixed up (in the data below the first_name column becomes the extra_field column, I guess because in the latest schema it is the second column):
> {code:java}
> number,extra_field,first_name,last_name,dt
> 6,Colin,Baker,null,2019-02-05
> 3,Jon,Pertwee,null,2019-02-05
> 4,Tom,Baker,null,2019-02-05
> 5,Peter,Davison,null,2019-02-05
> 11,Matt,Smith,null,2019-02-05
> 1,William,Hartnell,null,2019-02-05
> 7,Sylvester,McCoy,null,2019-02-05
> 8,Paul,McGann,null,2019-02-05
> 2,Patrick,Troughton,null,2019-02-05
> 9,Christopher,Eccleston,null,2019-02-05
> 10,David,Tennant,null,2019-02-05
> 21,fishfinger,Jim,Baker,2019-02-06
> 24,fishfinger,Bean,Pertwee,2019-02-06
> {code}
> If I
[jira] [Commented] (SPARK-32617) Upgrade kubernetes client version to support latest minikube version.
[ https://issues.apache.org/jira/browse/SPARK-32617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17248168#comment-17248168 ]

Attila Zsolt Piros commented on SPARK-32617:

I will have a PR for this soon.

> Upgrade kubernetes client version to support latest minikube version.
> ---------------------------------------------------------------------
>
>                 Key: SPARK-32617
>                 URL: https://issues.apache.org/jira/browse/SPARK-32617
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes
>    Affects Versions: 3.1.0
>            Reporter: Prashant Sharma
>            Priority: Major
>
> The following error occurs when the k8s integration tests are run against a minikube cluster with version 1.2.1:
> {code:java}
> Run starting. Expected test count is: 18
> KubernetesSuite:
> org.apache.spark.deploy.k8s.integrationtest.KubernetesSuite *** ABORTED ***
>   io.fabric8.kubernetes.client.KubernetesClientException: An error has occurred.
>   at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64)
>   at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:53)
>   at io.fabric8.kubernetes.client.utils.HttpClientUtils.createHttpClient(HttpClientUtils.java:196)
>   at io.fabric8.kubernetes.client.utils.HttpClientUtils.createHttpClient(HttpClientUtils.java:62)
>   at io.fabric8.kubernetes.client.BaseClient.(BaseClient.java:51)
>   at io.fabric8.kubernetes.client.DefaultKubernetesClient.(DefaultKubernetesClient.java:105)
>   at org.apache.spark.deploy.k8s.integrationtest.backend.minikube.Minikube$.getKubernetesClient(Minikube.scala:81)
>   at org.apache.spark.deploy.k8s.integrationtest.backend.minikube.MinikubeTestBackend$.initialize(MinikubeTestBackend.scala:33)
>   at org.apache.spark.deploy.k8s.integrationtest.KubernetesSuite.beforeAll(KubernetesSuite.scala:131)
>   at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:212)
>   ...
>   Cause: java.nio.file.NoSuchFileException: /root/.minikube/apiserver.crt
>   at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
>   at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
>   at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
>   at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214)
>   at java.nio.file.Files.newByteChannel(Files.java:361)
>   at java.nio.file.Files.newByteChannel(Files.java:407)
>   at java.nio.file.Files.readAllBytes(Files.java:3152)
>   at io.fabric8.kubernetes.client.internal.CertUtils.getInputStreamFromDataOrFile(CertUtils.java:72)
>   at io.fabric8.kubernetes.client.internal.CertUtils.createKeyStore(CertUtils.java:242)
>   at io.fabric8.kubernetes.client.internal.SSLUtils.keyManagers(SSLUtils.java:128)
>   ...
> Run completed in 1 second, 821 milliseconds.
> Total number of tests run: 0
> Suites: completed 1, aborted 1
> Tests: succeeded 0, failed 0, canceled 0, ignored 0, pending 0
> *** 1 SUITE ABORTED ***
> [INFO] ------------------------------------------------------------------------
> [INFO] Reactor Summary for Spark Project Parent POM 3.1.0-SNAPSHOT:
> [INFO]
> [INFO] Spark Project Parent POM ................................. SUCCESS [  4.454 s]
> [INFO] Spark Project Tags ....................................... SUCCESS [  4.768 s]
> [INFO] Spark Project Local DB ................................... SUCCESS [  2.961 s]
> [INFO] Spark Project Networking ................................. SUCCESS [  4.258 s]
> [INFO] Spark Project Shuffle Streaming Service .................. SUCCESS [  5.703 s]
> [INFO] Spark Project Unsafe ..................................... SUCCESS [  3.239 s]
> [INFO] Spark Project Launcher ................................... SUCCESS [  3.224 s]
> [INFO] Spark Project Core ....................................... SUCCESS [02:25 min]
> [INFO] Spark Project Kubernetes Integration Tests ............... FAILURE [ 17.244 s]
> [INFO] ------------------------------------------------------------------------
> [INFO] BUILD FAILURE
> [INFO] ------------------------------------------------------------------------
> [INFO] Total time: 03:12 min
> [INFO] Finished at: 2020-08-11T06:26:15-05:00
> [INFO] ------------------------------------------------------------------------
> [ERROR] Failed to execute goal org.scalatest:scalatest-maven-plugin:2.0.0:test (integration-test) on project spark-kubernetes-integration-tests_2.12: There are test failures -> [Help 1]
> [ERROR]
> [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
> [ERROR]
> [ERROR] For more information about the errors and possible solutions, please read the following articles:
> [ERROR]
[jira] [Comment Edited] (SPARK-33711) Race condition in Spark k8s Pod lifecycle manager that leads to shutdowns
[ https://issues.apache.org/jira/browse/SPARK-33711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17246092#comment-17246092 ]

Attila Zsolt Piros edited comment on SPARK-33711 at 12/8/20, 7:21 PM:

Yes, it does. I have updated the affected versions accordingly. Thanks [~dongjoon].

was (Author: attilapiros):
Yes, it does. I have updated the affected versions accordingly.

> Race condition in Spark k8s Pod lifecycle manager that leads to shutdowns
> -------------------------------------------------------------------------
>
>                 Key: SPARK-33711
>                 URL: https://issues.apache.org/jira/browse/SPARK-33711
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Kubernetes
>    Affects Versions: 2.3.4, 2.4.7, 3.0.0, 3.1.0, 3.2.0
>            Reporter: Attila Zsolt Piros
>            Priority: Major
>
> Watching a POD (ExecutorPodsWatchSnapshotSource) informs about single POD changes, which could wrongfully lead to the detection of missing PODs (PODs known by the scheduler backend but missing from POD snapshots) by the executor POD lifecycle manager.
> A key indicator of this is seeing this log message:
> "The executor with ID [some_id] was not found in the cluster but we didn't get a reason why. Marking the executor as failed. The executor may have been deleted but the driver missed the deletion event."
> So one of the problems is running the missing-POD detection even when only a single pod has changed, without having a full, consistent snapshot of all the PODs (see ExecutorPodsPollingSnapshotSource). The other could be a race between the executor POD lifecycle manager and the scheduler backend.
[jira] [Commented] (SPARK-33711) Race condition in Spark k8s Pod lifecycle manager that leads to shutdowns
[ https://issues.apache.org/jira/browse/SPARK-33711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17246092#comment-17246092 ]

Attila Zsolt Piros commented on SPARK-33711:

Yes, it does. I have updated the affected versions accordingly.

> Race condition in Spark k8s Pod lifecycle manager that leads to shutdowns
> -------------------------------------------------------------------------
>
>                 Key: SPARK-33711
>                 URL: https://issues.apache.org/jira/browse/SPARK-33711
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Kubernetes
>    Affects Versions: 2.3.4, 2.4.7, 3.0.0, 3.1.0, 3.2.0
>            Reporter: Attila Zsolt Piros
>            Priority: Major
>
> Watching a POD (ExecutorPodsWatchSnapshotSource) informs about single POD changes, which could wrongfully lead to the detection of missing PODs (PODs known by the scheduler backend but missing from POD snapshots) by the executor POD lifecycle manager.
> A key indicator of this is seeing this log message:
> "The executor with ID [some_id] was not found in the cluster but we didn't get a reason why. Marking the executor as failed. The executor may have been deleted but the driver missed the deletion event."
> So one of the problems is running the missing-POD detection even when only a single pod has changed, without having a full, consistent snapshot of all the PODs (see ExecutorPodsPollingSnapshotSource). The other could be a race between the executor POD lifecycle manager and the scheduler backend.
[jira] [Updated] (SPARK-33711) Race condition in Spark k8s Pod lifecycle manager that leads to shutdowns
[ https://issues.apache.org/jira/browse/SPARK-33711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Attila Zsolt Piros updated SPARK-33711:
---------------------------------------
    Affects Version/s: 2.3.4

> Race condition in Spark k8s Pod lifecycle manager that leads to shutdowns
> -------------------------------------------------------------------------
>
>                 Key: SPARK-33711
>                 URL: https://issues.apache.org/jira/browse/SPARK-33711
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Kubernetes
>    Affects Versions: 2.3.4, 2.4.7, 3.0.0, 3.1.0, 3.2.0
>            Reporter: Attila Zsolt Piros
>            Priority: Major
>
> Watching a POD (ExecutorPodsWatchSnapshotSource) informs about single POD changes, which could wrongfully lead to the detection of missing PODs (PODs known by the scheduler backend but missing from POD snapshots) by the executor POD lifecycle manager.
> A key indicator of this is seeing this log message:
> "The executor with ID [some_id] was not found in the cluster but we didn't get a reason why. Marking the executor as failed. The executor may have been deleted but the driver missed the deletion event."
> So one of the problems is running the missing-POD detection even when only a single pod has changed, without having a full, consistent snapshot of all the PODs (see ExecutorPodsPollingSnapshotSource). The other could be a race between the executor POD lifecycle manager and the scheduler backend.
[jira] [Updated] (SPARK-33711) Race condition in Spark k8s Pod lifecycle manager that leads to shutdowns
[ https://issues.apache.org/jira/browse/SPARK-33711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Attila Zsolt Piros updated SPARK-33711:
---------------------------------------
    Affects Version/s: (was: 3.0.1)
                       3.2.0
                       3.1.0

> Race condition in Spark k8s Pod lifecycle manager that leads to shutdowns
> -------------------------------------------------------------------------
>
>                 Key: SPARK-33711
>                 URL: https://issues.apache.org/jira/browse/SPARK-33711
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes
>    Affects Versions: 2.4.7, 3.0.0, 3.1.0, 3.2.0
>            Reporter: Attila Zsolt Piros
>            Priority: Major
>
> Watching a POD (ExecutorPodsWatchSnapshotSource) informs about single POD changes, which could wrongfully lead to the detection of missing PODs (PODs known by the scheduler backend but missing from POD snapshots) by the executor POD lifecycle manager.
> A key indicator of this is seeing this log message:
> "The executor with ID [some_id] was not found in the cluster but we didn't get a reason why. Marking the executor as failed. The executor may have been deleted but the driver missed the deletion event."
> So one of the problems is running the missing-POD detection even when only a single pod has changed, without having a full, consistent snapshot of all the PODs (see ExecutorPodsPollingSnapshotSource). The other could be a race between the executor POD lifecycle manager and the scheduler backend.
[jira] [Updated] (SPARK-33711) Race condition in Spark k8s Pod lifecycle manager that leads to shutdowns
[ https://issues.apache.org/jira/browse/SPARK-33711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Attila Zsolt Piros updated SPARK-33711:
---------------------------------------
    Summary: Race condition in Spark k8s Pod lifecycle manager that leads to shutdowns  (was: Race condition in Spark k8s Pod lifecycle management that leads to shutdowns)

> Race condition in Spark k8s Pod lifecycle manager that leads to shutdowns
> -------------------------------------------------------------------------
>
>                 Key: SPARK-33711
>                 URL: https://issues.apache.org/jira/browse/SPARK-33711
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes
>    Affects Versions: 2.4.7, 3.0.0, 3.0.1
>            Reporter: Attila Zsolt Piros
>            Priority: Major
>
> Watching a POD (ExecutorPodsWatchSnapshotSource) informs about single POD changes, which could wrongfully lead to the detection of missing PODs (PODs known by the scheduler backend but missing from POD snapshots) by the executor POD lifecycle manager.
> A key indicator of this is seeing this log message:
> "The executor with ID [some_id] was not found in the cluster but we didn't get a reason why. Marking the executor as failed. The executor may have been deleted but the driver missed the deletion event."
> So one of the problems is running the missing-POD detection even when only a single pod has changed, without having a full, consistent snapshot of all the PODs (see ExecutorPodsPollingSnapshotSource). The other could be a race between the executor POD lifecycle manager and the scheduler backend.
[jira] [Commented] (SPARK-33711) Race condition in Spark k8s Pod lifecycle management that leads to shutdowns
[ https://issues.apache.org/jira/browse/SPARK-33711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17245993#comment-17245993 ]

Attila Zsolt Piros commented on SPARK-33711:

I am working on this and will soon open a PR.

> Race condition in Spark k8s Pod lifecycle management that leads to shutdowns
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-33711
>                 URL: https://issues.apache.org/jira/browse/SPARK-33711
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes
>    Affects Versions: 2.4.7, 3.0.0, 3.0.1
>            Reporter: Attila Zsolt Piros
>            Priority: Major
>
> Watching a POD (ExecutorPodsWatchSnapshotSource) informs about single POD changes, which could wrongfully lead to the detection of missing PODs (PODs known by the scheduler backend but missing from POD snapshots) by the executor POD lifecycle manager.
> A key indicator of this is seeing this log message:
> "The executor with ID [some_id] was not found in the cluster but we didn't get a reason why. Marking the executor as failed. The executor may have been deleted but the driver missed the deletion event."
> So one of the problems is running the missing-POD detection even when only a single pod has changed, without having a full, consistent snapshot of all the PODs (see ExecutorPodsPollingSnapshotSource). The other could be a race between the executor POD lifecycle manager and the scheduler backend.
[jira] [Created] (SPARK-33711) Race condition in Spark k8s Pod lifecycle management that leads to shutdowns
Attila Zsolt Piros created SPARK-33711:
------------------------------------------

             Summary: Race condition in Spark k8s Pod lifecycle management that leads to shutdowns
                 Key: SPARK-33711
                 URL: https://issues.apache.org/jira/browse/SPARK-33711
             Project: Spark
          Issue Type: Bug
          Components: Kubernetes
    Affects Versions: 3.0.1, 3.0.0, 2.4.7
            Reporter: Attila Zsolt Piros

Watching a POD (ExecutorPodsWatchSnapshotSource) informs about single POD changes, which could wrongfully lead to the detection of missing PODs (PODs known by the scheduler backend but missing from POD snapshots) by the executor POD lifecycle manager.

A key indicator of this is seeing this log message:
"The executor with ID [some_id] was not found in the cluster but we didn't get a reason why. Marking the executor as failed. The executor may have been deleted but the driver missed the deletion event."

So one of the problems is running the missing-POD detection even when only a single pod has changed, without having a full, consistent snapshot of all the PODs (see ExecutorPodsPollingSnapshotSource). The other could be a race between the executor POD lifecycle manager and the scheduler backend.
[jira] [Comment Edited] (SPARK-28210) Shuffle Storage API: Reads
[ https://issues.apache.org/jira/browse/SPARK-28210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17196013#comment-17196013 ]

Attila Zsolt Piros edited comment on SPARK-28210 at 9/15/20, 9:38 AM:

[~tianczha] [~devaraj] I would like to work on this issue if that's fine with you. I intend to progress along the ideas of the linked PR: to pass the metadata when the reducer task is constructed.

was (Author: attilapiros):
[~tianczha] [~devaraj] I would like to work on this issue if that's fine with you. I would like to progress along the ideas of the linked PR: to pass the metadata when the reducer task is constructed.

> Shuffle Storage API: Reads
> --------------------------
>
>                 Key: SPARK-28210
>                 URL: https://issues.apache.org/jira/browse/SPARK-28210
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Shuffle, Spark Core
>    Affects Versions: 3.1.0
>            Reporter: Matt Cheah
>            Priority: Major
>
> As part of the effort to store shuffle data in arbitrary places, this issue tracks implementing an API for reading the shuffle data stored by the write API. Also ensure that the existing shuffle implementation is refactored to use the API.
[jira] [Commented] (SPARK-28210) Shuffle Storage API: Reads
[ https://issues.apache.org/jira/browse/SPARK-28210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17196013#comment-17196013 ] Attila Zsolt Piros commented on SPARK-28210: [~tianczha] [~devaraj] I would like to work on this issue if that's fine with you. I would like to progress along the ideas of the linked PR: to pass the metadata when the reducer task is constructed. > Shuffle Storage API: Reads > -- > > Key: SPARK-28210 > URL: https://issues.apache.org/jira/browse/SPARK-28210 > Project: Spark > Issue Type: Sub-task > Components: Shuffle, Spark Core >Affects Versions: 3.1.0 >Reporter: Matt Cheah >Priority: Major > > As part of the effort to store shuffle data in arbitrary places, this issue > tracks implementing an API for reading the shuffle data stored by the write > API. Also ensure that the existing shuffle implementation is refactored to > use the API. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32663) TransportClient getting closed when there are outstanding requests to the server
[ https://issues.apache.org/jira/browse/SPARK-32663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181138#comment-17181138 ] Attila Zsolt Piros commented on SPARK-32663: You are right. The clients will be closed via [TransportClientFactory#close()|https://github.com/attilapiros/spark/blob/e6795cd3416bbe32505efd4c1fa3202f451bf74d/common/network-common/src/main/java/org/apache/spark/network/client/TransportClientFactory.java#L324]. > TransportClient getting closed when there are outstanding requests to the > server > > > Key: SPARK-32663 > URL: https://issues.apache.org/jira/browse/SPARK-32663 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 3.0.0 >Reporter: Chandni Singh >Priority: Major > > The implementation of {{removeBlocks}} and {{getHostLocalDirs}} in > {{ExternalBlockStoreClient}} closes the client after processing a response in > the callback. > This is a cached client which will be re-used for other responses. There > could be other outstanding requests to the shuffle service, so it should not > be closed after processing a response. > Seems like this is a bug introduced with SPARK-27651 and SPARK-27677. > The older methods {{registerWithShuffleServer}} and {{fetchBlocks}} didn't > close the client. > cc [~attilapiros] [~vanzin] [~mridulm80] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32293) Inconsistent default unit between Spark memory configs and JVM option
[ https://issues.apache.org/jira/browse/SPARK-32293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Attila Zsolt Piros updated SPARK-32293: --- Target Version/s: 3.1.0 > Inconsistent default unit between Spark memory configs and JVM option > - > > Key: SPARK-32293 > URL: https://issues.apache.org/jira/browse/SPARK-32293 > Project: Spark > Issue Type: Bug > Components: Documentation, Spark Core >Affects Versions: 2.2.1, 2.2.2, 2.2.3, 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4, > 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.4.5, 2.4.6, 3.0.0, 3.0.1, 3.1.0 >Reporter: Attila Zsolt Piros >Priority: Major > > Spark's maximum memory can be configured in several ways: > - via Spark config > - command line argument > - environment variables > The memory can be configured separately for the executors and for the driver. > All of these follow the format of the JVM memory options in that they use the > very same size unit suffixes ("k", "m", "g" or "t"), but there is an > inconsistency regarding the default unit. > When no suffix is given, the amount is passed as-is to the JVM > (to the -Xmx and -Xms options), where these options use bytes as the default > unit; see the example > [here|https://docs.oracle.com/javase/8/docs/technotes/tools/windows/java.html]: > {noformat} > The following examples show how to set the maximum allowed size of allocated > memory to 80 MB using various units: > -Xmx83886080 > -Xmx81920k > -Xmx80m > {noformat} > The Spark memory configs, however, use "m" as the default suffix unit. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32293) Inconsistent default unit between Spark memory configs and JVM option
[ https://issues.apache.org/jira/browse/SPARK-32293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Attila Zsolt Piros updated SPARK-32293: --- Summary: Inconsistent default unit between Spark memory configs and JVM option (was: Inconsistent default units for configuring Spark memory) > Inconsistent default unit between Spark memory configs and JVM option > - > > Key: SPARK-32293 > URL: https://issues.apache.org/jira/browse/SPARK-32293 > Project: Spark > Issue Type: Bug > Components: Documentation, Spark Core >Affects Versions: 2.2.1, 2.2.2, 2.2.3, 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4, > 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.4.5, 2.4.6, 3.0.0, 3.0.1, 3.1.0 >Reporter: Attila Zsolt Piros >Priority: Major > > Spark's maximum memory can be configured in several ways: > - via Spark config > - command line argument > - environment variables > The memory can be configured separately for the executors and for the driver. > All of these follow the format of the JVM memory options in that they use the > very same size unit suffixes ("k", "m", "g" or "t"), but there is an > inconsistency regarding the default unit. > When no suffix is given, the amount is passed as-is to the JVM > (to the -Xmx and -Xms options), where these options use bytes as the default > unit; see the example > [here|https://docs.oracle.com/javase/8/docs/technotes/tools/windows/java.html]: > {noformat} > The following examples show how to set the maximum allowed size of allocated > memory to 80 MB using various units: > -Xmx83886080 > -Xmx81920k > -Xmx80m > {noformat} > The Spark memory configs, however, use "m" as the default suffix unit. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32293) Inconsistent default units for configuring Spark memory
[ https://issues.apache.org/jira/browse/SPARK-32293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17156884#comment-17156884 ] Attila Zsolt Piros commented on SPARK-32293: I am working on this. > Inconsistent default units for configuring Spark memory > --- > > Key: SPARK-32293 > URL: https://issues.apache.org/jira/browse/SPARK-32293 > Project: Spark > Issue Type: Bug > Components: Documentation, Spark Core >Affects Versions: 2.2.1, 2.2.2, 2.2.3, 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4, > 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.4.5, 2.4.6, 3.0.0, 3.0.1, 3.1.0 >Reporter: Attila Zsolt Piros >Priority: Major > > Spark's maximum memory can be configured in several ways: > - via Spark config > - command line argument > - environment variables > The memory can be configured separately for the executors and for the driver. > All of these follow the format of the JVM memory options in that they use the > very same size unit suffixes ("k", "m", "g" or "t"), but there is an > inconsistency regarding the default unit. > When no suffix is given, the amount is passed as-is to the JVM > (to the -Xmx and -Xms options), where these options use bytes as the default > unit; see the example > [here|https://docs.oracle.com/javase/8/docs/technotes/tools/windows/java.html]: > {noformat} > The following examples show how to set the maximum allowed size of allocated > memory to 80 MB using various units: > -Xmx83886080 > -Xmx81920k > -Xmx80m > {noformat} > The Spark memory configs, however, use "m" as the default suffix unit. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32293) Inconsistent default units for configuring Spark memory
Attila Zsolt Piros created SPARK-32293: -- Summary: Inconsistent default units for configuring Spark memory Key: SPARK-32293 URL: https://issues.apache.org/jira/browse/SPARK-32293 Project: Spark Issue Type: Bug Components: Documentation, Spark Core Affects Versions: 3.0.0, 2.4.6, 2.4.5, 2.4.4, 2.4.3, 2.4.2, 2.4.1, 2.4.0, 2.3.4, 2.3.3, 2.3.2, 2.3.1, 2.3.0, 2.2.3, 2.2.2, 2.2.1, 3.0.1, 3.1.0 Reporter: Attila Zsolt Piros Spark's maximum memory can be configured in several ways: - via Spark config - command line argument - environment variables The memory can be configured separately for the executors and for the driver. All of these follow the format of the JVM memory options in that they use the very same size unit suffixes ("k", "m", "g" or "t"), but there is an inconsistency regarding the default unit. When no suffix is given, the amount is passed as-is to the JVM (to the -Xmx and -Xms options), where these options use bytes as the default unit; see the example [here|https://docs.oracle.com/javase/8/docs/technotes/tools/windows/java.html]: {noformat} The following examples show how to set the maximum allowed size of allocated memory to 80 MB using various units: -Xmx83886080 -Xmx81920k -Xmx80m {noformat} The Spark memory configs, however, use "m" as the default suffix unit. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
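The inconsistency is easy to reproduce outside Spark. Below is a minimal sketch; `parseSize` is a hypothetical helper written for this illustration (it is not Spark's actual `JavaUtils` parser). A suffix-less "80" means 80 MiB under a Spark-style default of "m", but 80 bytes when handed straight to -Xmx:

```java
import java.util.Map;

public class DefaultUnitDemo {
    // Multipliers for the size suffixes shared by Spark configs and the JVM.
    static final Map<Character, Long> UNITS = Map.of(
        'k', 1024L,
        'm', 1024L * 1024,
        'g', 1024L * 1024 * 1024,
        't', 1024L * 1024 * 1024 * 1024);

    // Parse a size string; a bare number falls back to the given default multiplier.
    static long parseSize(String s, long defaultMultiplier) {
        char last = Character.toLowerCase(s.charAt(s.length() - 1));
        Long mult = UNITS.get(last);
        if (mult != null) {
            return Long.parseLong(s.substring(0, s.length() - 1)) * mult;
        }
        return Long.parseLong(s) * defaultMultiplier;
    }

    public static void main(String[] args) {
        // Spark-style default: a bare "80" is read as 80 MiB.
        System.out.println(parseSize("80", 1L << 20)); // 83886080
        // JVM-style default (-Xmx/-Xms): a bare "80" is read as 80 bytes.
        System.out.println(parseSize("80", 1L));       // 80
        // With an explicit suffix the two agree, matching the -Xmx examples above.
        System.out.println(parseSize("81920k", 1L));   // 83886080
    }
}
```

So "80m", "81920k" and "83886080" (bytes) all mean 80 MiB to -Xmx, while a bare "80" does not; that is exactly the gap the ticket describes.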
[jira] [Updated] (SPARK-32149) Improve file path name normalisation at block resolution within the external shuffle service
[ https://issues.apache.org/jira/browse/SPARK-32149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Attila Zsolt Piros updated SPARK-32149: --- Affects Version/s: (was: 3.0.1) 3.1.0 > Improve file path name normalisation at block resolution within the external > shuffle service > > > Key: SPARK-32149 > URL: https://issues.apache.org/jira/browse/SPARK-32149 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 3.1.0 >Reporter: Attila Zsolt Piros >Priority: Major > > In the external shuffle service, during block resolution, the file paths > (for disk-persisted RDDs and for shuffle blocks) are normalized by custom > Spark code which uses an OS-dependent regexp. This duplicates the > package-private JDK counterpart. > As the code is not a perfect match, it can even happen that the two methods > produce slightly different (but semantically equal) paths. > The reason for this redundant transformation is that the normalized path is > interned to save some heap, which is only possible if both methods produce > the same string. > Checking the JDK code, I believe there is a better solution which is a > perfect match for the JDK code, as it uses that package-private method. > Moreover, based on some benchmarking, even this new method seems to be more > performant. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32149) Improve file path name normalisation at block resolution within the external shuffle service
[ https://issues.apache.org/jira/browse/SPARK-32149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149440#comment-17149440 ] Attila Zsolt Piros commented on SPARK-32149: I am working on this. > Improve file path name normalisation at block resolution within the external > shuffle service > > > Key: SPARK-32149 > URL: https://issues.apache.org/jira/browse/SPARK-32149 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 3.0.1 >Reporter: Attila Zsolt Piros >Priority: Major > > In the external shuffle service, during block resolution, the file paths > (for disk-persisted RDDs and for shuffle blocks) are normalized by custom > Spark code which uses an OS-dependent regexp. This duplicates the > package-private JDK counterpart. > As the code is not a perfect match, it can even happen that the two methods > produce slightly different (but semantically equal) paths. > The reason for this redundant transformation is that the normalized path is > interned to save some heap, which is only possible if both methods produce > the same string. > Checking the JDK code, I believe there is a better solution which is a > perfect match for the JDK code, as it uses that package-private method. > Moreover, based on some benchmarking, even this new method seems to be more > performant. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32149) Improve file path name normalisation at block resolution within the external shuffle service
Attila Zsolt Piros created SPARK-32149: -- Summary: Improve file path name normalisation at block resolution within the external shuffle service Key: SPARK-32149 URL: https://issues.apache.org/jira/browse/SPARK-32149 Project: Spark Issue Type: Improvement Components: Shuffle Affects Versions: 3.0.1 Reporter: Attila Zsolt Piros In the external shuffle service, during block resolution, the file paths (for disk-persisted RDDs and for shuffle blocks) are normalized by custom Spark code which uses an OS-dependent regexp. This duplicates the package-private JDK counterpart. As the code is not a perfect match, it can even happen that the two methods produce slightly different (but semantically equal) paths. The reason for this redundant transformation is that the normalized path is interned to save some heap, which is only possible if both methods produce the same string. Checking the JDK code, I believe there is a better solution which is a perfect match for the JDK code, as it uses that package-private method. Moreover, based on some benchmarking, even this new method seems to be more performant. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
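The heap-saving trick mentioned in the ticket relies on String.intern(): interning only deduplicates strings that are byte-for-byte equal, so a custom normalizer must produce exactly what the JDK produces. A small sketch of the idea, assuming a Unix-like platform (the java.io.File constructor runs the JDK's own, package-private FileSystem.normalize on the path string, e.g. collapsing duplicate and trailing separators; the exact result is platform-dependent):

```java
import java.io.File;

public class PathInterningDemo {
    // Normalize via the JDK itself: constructing a File runs the
    // package-private FileSystem.normalize on the path string.
    static String normalize(String path) {
        return new File(path).getPath();
    }

    public static void main(String[] args) {
        // On Unix-like systems duplicate and trailing separators are collapsed.
        String a = normalize("/tmp//spark/blockmgr-1/");
        String b = normalize("/tmp/spark//blockmgr-1");
        System.out.println(a); // /tmp/spark/blockmgr-1 (on Unix)
        // Interning only saves heap when both results are the same string:
        System.out.println(a.intern() == b.intern()); // true, because a.equals(b)
    }
}
```

A regex-based re-implementation that differs from this in even one character (say, around trailing separators) would defeat the interning, which is the ticket's point.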
[jira] [Commented] (SPARK-27651) Avoid the network when block manager fetches shuffle blocks from the same host
[ https://issues.apache.org/jira/browse/SPARK-27651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17052348#comment-17052348 ] Attila Zsolt Piros commented on SPARK-27651: Well, in case of dynamic allocation and a recomputation, the executors could already be gone. > Avoid the network when block manager fetches shuffle blocks from the same host > -- > > Key: SPARK-27651 > URL: https://issues.apache.org/jira/browse/SPARK-27651 > Project: Spark > Issue Type: Improvement > Components: Block Manager >Affects Versions: 3.0.0 >Reporter: Attila Zsolt Piros >Assignee: Attila Zsolt Piros >Priority: Major > Fix For: 3.0.0 > > > When a shuffle block (content) is fetched, the network is always used, even > when it is fetched from the external shuffle service running on the same > host. This can be avoided by getting the local directories of the same-host > executors from the external shuffle service and accessing those blocks from > the disk directly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27651) Avoid the network when block manager fetches shuffle blocks from the same host
[ https://issues.apache.org/jira/browse/SPARK-27651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Attila Zsolt Piros updated SPARK-27651: --- Description: When a shuffle block (content) is fetched, the network is always used, even when it is fetched from the external shuffle service running on the same host. This can be avoided by getting the local directories of the same-host executors from the external shuffle service and accessing those blocks from the disk directly. (was: When a shuffle block (content) is fetched, the network is always used, even when it is fetched from an executor (or the external shuffle service) running on the same host.) > Avoid the network when block manager fetches shuffle blocks from the same host > -- > > Key: SPARK-27651 > URL: https://issues.apache.org/jira/browse/SPARK-27651 > Project: Spark > Issue Type: Improvement > Components: Block Manager >Affects Versions: 3.0.0 >Reporter: Attila Zsolt Piros >Assignee: Attila Zsolt Piros >Priority: Major > Fix For: 3.0.0 > > > When a shuffle block (content) is fetched, the network is always used, even > when it is fetched from the external shuffle service running on the same > host. This can be avoided by getting the local directories of the same-host > executors from the external shuffle service and accessing those blocks from > the disk directly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27651) Avoid the network when block manager fetches shuffle blocks from the same host
[ https://issues.apache.org/jira/browse/SPARK-27651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17051995#comment-17051995 ] Attila Zsolt Piros commented on SPARK-27651: Yes, the final implementation works only when the external shuffle service is used, as the local directories of the other host-local executors are asked from the external shuffle service. The initial implementation, when the PR was opened, used the driver to get the host-local directories. The technical reasons for asking the external shuffle service instead were: * decreasing network pressure on the driver (the main reason); * getting rid of an unbounded (or bounded, but then with complex fallback logic at the fetcher) map from executors to local dirs. That map would also be redundant, as this information is already available at the external shuffle service, just stored in a distributed way: a running external shuffle service process stores the data only for those executors which are on the same host. > Avoid the network when block manager fetches shuffle blocks from the same host > -- > > Key: SPARK-27651 > URL: https://issues.apache.org/jira/browse/SPARK-27651 > Project: Spark > Issue Type: Improvement > Components: Block Manager >Affects Versions: 3.0.0 >Reporter: Attila Zsolt Piros >Assignee: Attila Zsolt Piros >Priority: Major > Fix For: 3.0.0 > > > When a shuffle block (content) is fetched, the network is always used, even > when it is fetched from an executor (or the external shuffle service) running > on the same host. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
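The design choice in that comment, keeping a per-host registry at each shuffle-service process instead of one global executor-to-dirs map on the driver, can be sketched as follows. The class below is a hypothetical illustration only, not Spark's actual ExternalShuffleBlockResolver:

```java
import java.util.HashMap;
import java.util.Map;

public class HostLocalDirRegistry {
    // Each shuffle-service process registers only executors on its own host,
    // so the map stays naturally bounded by that host's executor count.
    private final Map<String, String[]> localDirsByExecutor = new HashMap<>();

    void registerExecutor(String execId, String[] localDirs) {
        localDirsByExecutor.put(execId, localDirs);
    }

    // A same-host fetcher asks for the dirs and then reads shuffle blocks
    // from disk directly, skipping the network data path entirely.
    String[] getLocalDirs(String execId) {
        String[] dirs = localDirsByExecutor.get(execId);
        if (dirs == null) {
            throw new IllegalStateException("Executor " + execId + " is not registered on this host");
        }
        return dirs;
    }

    public static void main(String[] args) {
        HostLocalDirRegistry registry = new HostLocalDirRegistry();
        registry.registerExecutor("exec-7", new String[]{"/data1/blockmgr", "/data2/blockmgr"});
        System.out.println(registry.getLocalDirs("exec-7")[0]); // /data1/blockmgr
    }
}
```

Distributing the map this way is what avoids both the driver network pressure and the unbounded global state mentioned in the bullets.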
[jira] [Comment Edited] (SPARK-30688) Spark SQL Unix Timestamp produces incorrect result with unix_timestamp UDF
[ https://issues.apache.org/jira/browse/SPARK-30688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17031614#comment-17031614 ] Attila Zsolt Piros edited comment on SPARK-30688 at 2/6/20 2:22 PM: I have checked on Spark 3.0.0-preview2 and week-in-year fails there too (but for an invalid format I would say this is fine):
{noformat}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.0-preview2
      /_/

Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_202)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.sql("select unix_timestamp('20201', 'ww')").show();
+-------------------------+
|unix_timestamp(20201, ww)|
+-------------------------+
|                     null|
+-------------------------+
{noformat}
But it fails for a correct pattern too:
{noformat}
scala> spark.sql("select unix_timestamp('2020-10', '-ww')").show();
+----------------------------+
|unix_timestamp(2020-10, -ww)|
+----------------------------+
|                        null|
+----------------------------+
{noformat}
But for this one, strangely, it works:
{noformat}
scala> spark.sql("select unix_timestamp('2020-01', '-ww')").show();
+----------------------------+
|unix_timestamp(2020-01, -ww)|
+----------------------------+
|                  1577833200|
+----------------------------+
{noformat}
was (Author: attilapiros): I have checked on Spark 3.0.0-preview2 and week-in-year fails there too:
{noformat}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.0-preview2
      /_/

Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_202)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.sql("select unix_timestamp('20201', 'ww')").show();
+-------------------------+
|unix_timestamp(20201, ww)|
+-------------------------+
|                     null|
+-------------------------+
{noformat}
As you can see, it fails for even a simpler pattern too:
{noformat}
scala> spark.sql("select unix_timestamp('2020-10', '-ww')").show();
+----------------------------+
|unix_timestamp(2020-10, -ww)|
+----------------------------+
|                        null|
+----------------------------+
{noformat}
But this, strangely, works:
{noformat}
scala> spark.sql("select unix_timestamp('2020-01', '-ww')").show();
+----------------------------+
|unix_timestamp(2020-01, -ww)|
+----------------------------+
|                  1577833200|
+----------------------------+
{noformat}
> Spark SQL Unix Timestamp produces incorrect result with unix_timestamp UDF > -- > > Key: SPARK-30688 > URL: https://issues.apache.org/jira/browse/SPARK-30688 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2 >Reporter: Rajkumar Singh >Priority: Major > >
> {code:java}
> scala> spark.sql("select unix_timestamp('20201', 'ww')").show();
> +-------------------------+
> |unix_timestamp(20201, ww)|
> +-------------------------+
> |                     null|
> +-------------------------+
>
> scala> spark.sql("select unix_timestamp('20202', 'ww')").show();
> +-------------------------+
> |unix_timestamp(20202, ww)|
> +-------------------------+
> |               1578182400|
> +-------------------------+
> {code}
> This seems to happen for leap years only. I dug deeper into it, and it seems > that Spark is using java.text.SimpleDateFormat and tries to parse the > expression here > [org.apache.spark.sql.catalyst.expressions.UnixTime#eval|https://github.com/hortonworks/spark2/blob/49ec35bbb40ec6220282d932c9411773228725be/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala#L652]: > {code:java} > formatter.parse( > t.asInstanceOf[UTF8String].toString).getTime / 1000L{code} > SimpleDateFormat fails to parse the date and throws an Unparseable > exception, but Spark handles it silently and returns NULL. > > *Spark-3.0:* I did some tests where Spark no longer uses the legacy > java.text.SimpleDateFormat but the Java date/time API; it seems the date/time API > expects a valid date with a valid format > (org.apache.spark.sql.catalyst.util.Iso8601TimestampFormatter#parse). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30688) Spark SQL Unix Timestamp produces incorrect result with unix_timestamp UDF
[ https://issues.apache.org/jira/browse/SPARK-30688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17031614#comment-17031614 ] Attila Zsolt Piros commented on SPARK-30688: I have checked on Spark 3.0.0-preview2 and week-in-year fails there too:
{noformat}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.0-preview2
      /_/

Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_202)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.sql("select unix_timestamp('20201', 'ww')").show();
+-------------------------+
|unix_timestamp(20201, ww)|
+-------------------------+
|                     null|
+-------------------------+
{noformat}
As you can see, it fails for even a simpler pattern too:
{noformat}
scala> spark.sql("select unix_timestamp('2020-10', '-ww')").show();
+----------------------------+
|unix_timestamp(2020-10, -ww)|
+----------------------------+
|                        null|
+----------------------------+
{noformat}
But this, strangely, works:
{noformat}
scala> spark.sql("select unix_timestamp('2020-01', '-ww')").show();
+----------------------------+
|unix_timestamp(2020-01, -ww)|
+----------------------------+
|                  1577833200|
+----------------------------+
{noformat}
> Spark SQL Unix Timestamp produces incorrect result with unix_timestamp UDF > -- > > Key: SPARK-30688 > URL: https://issues.apache.org/jira/browse/SPARK-30688 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2 >Reporter: Rajkumar Singh >Priority: Major > >
> {code:java}
> scala> spark.sql("select unix_timestamp('20201', 'ww')").show();
> +-------------------------+
> |unix_timestamp(20201, ww)|
> +-------------------------+
> |                     null|
> +-------------------------+
>
> scala> spark.sql("select unix_timestamp('20202', 'ww')").show();
> +-------------------------+
> |unix_timestamp(20202, ww)|
> +-------------------------+
> |               1578182400|
> +-------------------------+
> {code}
> This seems to happen for leap years only. I dug deeper into it, and it seems > that Spark is using java.text.SimpleDateFormat and tries to parse the > expression here > [org.apache.spark.sql.catalyst.expressions.UnixTime#eval|https://github.com/hortonworks/spark2/blob/49ec35bbb40ec6220282d932c9411773228725be/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala#L652]: > {code:java} > formatter.parse( > t.asInstanceOf[UTF8String].toString).getTime / 1000L{code} > SimpleDateFormat fails to parse the date and throws an Unparseable > exception, but Spark handles it silently and returns NULL. > > *Spark-3.0:* I did some tests where Spark no longer uses the legacy > java.text.SimpleDateFormat but the Java date/time API; it seems the date/time API > expects a valid date with a valid format > (org.apache.spark.sql.catalyst.util.Iso8601TimestampFormatter#parse). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
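The legacy-vs-java.time behavior change discussed above can be reproduced with plain JDK APIs, independent of Spark. A hedged sketch (the `yyyy-ww` pattern below is chosen for illustration): the lenient `SimpleDateFormat` resolves a year plus week-of-year to some date, while `java.time` cannot resolve a `LocalDate` from those fields alone, because there is no day-of-week and the week field 'w' pairs with the week-based year 'Y' rather than 'y':

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.time.DateTimeException;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class WeekPatternDemo {
    public static void main(String[] args) throws ParseException {
        // Legacy API: lenient by default, happily resolves year + week-of-year.
        SimpleDateFormat legacy = new SimpleDateFormat("yyyy-ww");
        System.out.println(legacy.parse("2020-10")); // some date in week 10 of 2020

        // java.time: parsing the text succeeds, but resolving it to a LocalDate
        // fails, since year + week-of-week-based-year is not enough information.
        try {
            LocalDate.parse("2020-10", DateTimeFormatter.ofPattern("yyyy-ww"));
            System.out.println("parsed");
        } catch (DateTimeException e) {
            System.out.println("failed: " + e.getClass().getSimpleName());
        }
    }
}
```

This matches the observation in the comment: patterns that the old formatter resolved leniently come back as null (or errors) once Spark switches to the java.time-based formatters.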
[jira] [Commented] (SPARK-27021) Leaking Netty event loop group for shuffle chunk fetch requests
[ https://issues.apache.org/jira/browse/SPARK-27021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16996571#comment-16996571 ] Attila Zsolt Piros commented on SPARK-27021: [~roncenzhao] it seems to me you bumped into https://issues.apache.org/jira/browse/SPARK-26418 > Leaking Netty event loop group for shuffle chunk fetch requests > --- > > Key: SPARK-27021 > URL: https://issues.apache.org/jira/browse/SPARK-27021 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0, 2.4.1, 3.0.0 >Reporter: Attila Zsolt Piros >Assignee: Attila Zsolt Piros >Priority: Major > Fix For: 3.0.0 > > Attachments: image-2019-12-14-23-23-50-384.png > > > The extra event loop group created for handling shuffle chunk fetch requests > is never closed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27021) Leaking Netty event loop group for shuffle chunk fetch requests
[ https://issues.apache.org/jira/browse/SPARK-27021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16995538#comment-16995538 ] Attila Zsolt Piros commented on SPARK-27021: [~roncenzhao] # yes # I do not think so. This bug mostly affects the test system, as test execution is the place where multiple TransportContext, NettyRpcEnv, etc. are created and not closed correctly. > Leaking Netty event loop group for shuffle chunk fetch requests > --- > > Key: SPARK-27021 > URL: https://issues.apache.org/jira/browse/SPARK-27021 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0, 2.4.1, 3.0.0 >Reporter: Attila Zsolt Piros >Assignee: Attila Zsolt Piros >Priority: Major > Fix For: 3.0.0 > > > The extra event loop group created for handling shuffle chunk fetch requests > is never closed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30235) Keeping compatibility with 2.4 external shuffle service regarding host local shuffle blocks reading
Attila Zsolt Piros created SPARK-30235: -- Summary: Keeping compatibility with 2.4 external shuffle service regarding host local shuffle blocks reading Key: SPARK-30235 URL: https://issues.apache.org/jira/browse/SPARK-30235 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.0.0 Reporter: Attila Zsolt Piros When `spark.shuffle.readHostLocalDisk.enabled` is true, a new message is used which is not supported by Spark 2.4. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30235) Keeping compatibility with 2.4 external shuffle service regarding host local shuffle blocks reading
[ https://issues.apache.org/jira/browse/SPARK-30235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16994672#comment-16994672 ] Attila Zsolt Piros commented on SPARK-30235: I am working on this. > Keeping compatibility with 2.4 external shuffle service regarding host local > shuffle blocks reading > --- > > Key: SPARK-30235 > URL: https://issues.apache.org/jira/browse/SPARK-30235 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Attila Zsolt Piros >Priority: Minor > > When `spark.shuffle.readHostLocalDisk.enabled` is true, a new message is > used which is not supported by Spark 2.4. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27819) Retry cleanup of disk persisted RDD via external shuffle service when it failed via executor
[ https://issues.apache.org/jira/browse/SPARK-27819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1684#comment-1684 ] Attila Zsolt Piros commented on SPARK-27819: I am working on this. > Retry cleanup of disk persisted RDD via external shuffle service when it > failed via executor > > > Key: SPARK-27819 > URL: https://issues.apache.org/jira/browse/SPARK-27819 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Attila Zsolt Piros >Priority: Major > > This issue is created for not losing the idea that came up during SPARK-27677 > (at org.apache.spark.storage.BlockManagerMasterEndpoint#removeRdd): > {panel:title=Quote} > In certain situations (e.g. executor death) it would make sense to retry this > through the shuffle service. But I'm not sure how to safely detect those > situations (or whether it makes sense to always retry through the shuffle > service). > {panel} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27819) Retry cleanup of disk persisted RDD via external shuffle service when it failed via executor
[ https://issues.apache.org/jira/browse/SPARK-27819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Attila Zsolt Piros updated SPARK-27819: --- Description: This issue is created for not losing the idea that came up during SPARK-27677 (at org.apache.spark.storage.BlockManagerMasterEndpoint#removeRdd): {panel:title=Quote} In certain situations (e.g. executor death) it would make sense to retry this through the shuffle service. But I'm not sure how to safely detect those situations (or whether it makes sense to always retry through the shuffle service). {panel} was: This issue is created for not losing the idea that came up during SPARK-27677 (at org.apache.spark.storage.BlockManagerMasterEndpoint#removeRdd): {noformat} In certain situations (e.g. executor death) it would make sense to retry this through the shuffle service. But I'm not sure how to safely detect those situations (or whether it makes sense to always retry through the shuffle service). {noformat} > Retry cleanup of disk persisted RDD via external shuffle service when it > failed via executor > > > Key: SPARK-27819 > URL: https://issues.apache.org/jira/browse/SPARK-27819 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Attila Zsolt Piros >Priority: Major > > This issue is created for not losing the idea that came up during SPARK-27677 > (at org.apache.spark.storage.BlockManagerMasterEndpoint#removeRdd): > {panel:title=Quote} > In certain situations (e.g. executor death) it would make sense to retry this > through the shuffle service. But I'm not sure how to safely detect those > situations (or whether it makes sense to always retry through the shuffle > service). > {panel} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27819) Retry cleanup of disk persisted RDD via external shuffle service when it failed via executor
Attila Zsolt Piros created SPARK-27819: -- Summary: Retry cleanup of disk persisted RDD via external shuffle service when it failed via executor Key: SPARK-27819 URL: https://issues.apache.org/jira/browse/SPARK-27819 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.0.0 Reporter: Attila Zsolt Piros This issue is created so as not to lose the idea that came up during SPARK-27677 (at org.apache.spark.storage.BlockManagerMasterEndpoint#removeRdd): {noformat} In certain situations (e.g. executor death) it would make sense to retry this through the shuffle service. But I'm not sure how to safely detect those situations (or whether it makes sense to always retry through the shuffle service). {noformat}
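The retry idea quoted above can be sketched as a simple fallback. This is a minimal, hypothetical sketch in Scala; the helper name and the callback shapes are illustrative assumptions, not Spark's actual cleanup API:

```scala
import scala.util.{Success, Try}

// Hypothetical sketch of the proposed behavior: first attempt the RDD block
// cleanup through the executor, and when that fails (e.g. executor death),
// retry it through the external shuffle service.
def removeRddBlocks(viaExecutor: () => Boolean,
                    viaShuffleService: () => Boolean): Boolean =
  Try(viaExecutor()) match {
    case Success(true) => true                           // executor handled it
    case _ => Try(viaShuffleService()).getOrElse(false)  // retry via shuffle service
  }
```

The open question from the quote maps directly onto the `case _` branch: whether to take it only on detectable failures (executor death) or unconditionally.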
[jira] [Commented] (SPARK-27651) Avoid the network when block manager fetches shuffle blocks from the same host
[ https://issues.apache.org/jira/browse/SPARK-27651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834832#comment-16834832 ] Attila Zsolt Piros commented on SPARK-27651: I am already working on this. > Avoid the network when block manager fetches shuffle blocks from the same host > -- > > Key: SPARK-27651 > URL: https://issues.apache.org/jira/browse/SPARK-27651 > Project: Spark > Issue Type: Improvement > Components: Block Manager >Affects Versions: 3.0.0 >Reporter: Attila Zsolt Piros >Priority: Major > > When a shuffle block (content) is fetched the network is always used even > when it is fetched from an executor (or the external shuffle service) running > on the same host. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27651) Avoid the network when block manager fetches shuffle blocks from the same host
Attila Zsolt Piros created SPARK-27651: -- Summary: Avoid the network when block manager fetches shuffle blocks from the same host Key: SPARK-27651 URL: https://issues.apache.org/jira/browse/SPARK-27651 Project: Spark Issue Type: Improvement Components: Block Manager Affects Versions: 3.0.0 Reporter: Attila Zsolt Piros When a shuffle block (content) is fetched, the network is always used, even when it is fetched from an executor (or the external shuffle service) running on the same host.
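A minimal sketch of the improvement the issue proposes, with hypothetical names (Spark's real implementation works on `BlockManagerId` and the disk store, which are only mimicked here): compare the block location's host with the local host and read from disk directly when they match, instead of always going through the network.

```scala
// Illustrative stand-in for Spark's BlockManagerId (executor id, host, port).
case class BlockManagerId(executorId: String, host: String, port: Int)

object SameHostFetch {
  /** True when the block lives on this host, so it could be read directly
    * from the local disk instead of being fetched over the network. */
  def canReadLocally(location: BlockManagerId, localHost: String): Boolean =
    location.host == localHost

  /** Returns which fetch path would be taken for a block at `location`. */
  def fetch(location: BlockManagerId, localHost: String): String =
    if (canReadLocally(location, localHost)) "disk-read"
    else "network-fetch"
}
```

Usage: `SameHostFetch.fetch(BlockManagerId("1", "node-a", 7337), "node-a")` picks the local disk path, while a location on `"node-b"` still goes over the network.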
[jira] [Updated] (SPARK-27622) Avoid network communication when block manager fetches disk persisted RDD blocks from the same host
[ https://issues.apache.org/jira/browse/SPARK-27622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Attila Zsolt Piros updated SPARK-27622: --- Summary: Avoid network communication when block manager fetches disk persisted RDD blocks from the same host (was: Avoid network communication when block manger fetches from the same host) > Avoid network communication when block manager fetches disk persisted RDD > blocks from the same host > --- > > Key: SPARK-27622 > URL: https://issues.apache.org/jira/browse/SPARK-27622 > Project: Spark > Issue Type: Improvement > Components: Block Manager >Affects Versions: 3.0.0 >Reporter: Attila Zsolt Piros >Priority: Major > > Currently fetching blocks always uses the network even when the two block > managers are running on the same host. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27622) Avoid the network when block manager fetches disk persisted RDD blocks from the same host
[ https://issues.apache.org/jira/browse/SPARK-27622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Attila Zsolt Piros updated SPARK-27622: --- Summary: Avoid the network when block manager fetches disk persisted RDD blocks from the same host (was: Avoid network communication when block manager fetches disk persisted RDD blocks from the same host) > Avoid the network when block manager fetches disk persisted RDD blocks from > the same host > - > > Key: SPARK-27622 > URL: https://issues.apache.org/jira/browse/SPARK-27622 > Project: Spark > Issue Type: Improvement > Components: Block Manager >Affects Versions: 3.0.0 >Reporter: Attila Zsolt Piros >Priority: Major > > Currently fetching blocks always uses the network even when the two block > managers are running on the same host. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-27622) Avoid network communication when block manger fetches from the same host
[ https://issues.apache.org/jira/browse/SPARK-27622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16831553#comment-16831553 ] Attila Zsolt Piros edited comment on SPARK-27622 at 5/6/19 6:22 PM: I am already working on this. There is already a working prototype for RDD blocks. was (Author: attilapiros): I am already working on this. A working prototype for RDD blocks are ready and working. > Avoid network communication when block manger fetches from the same host > > > Key: SPARK-27622 > URL: https://issues.apache.org/jira/browse/SPARK-27622 > Project: Spark > Issue Type: Improvement > Components: Block Manager >Affects Versions: 3.0.0 >Reporter: Attila Zsolt Piros >Priority: Major > > Currently fetching blocks always uses the network even when the two block > managers are running on the same host. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27622) Avoid network communication when block manger fetches from the same host
[ https://issues.apache.org/jira/browse/SPARK-27622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Attila Zsolt Piros updated SPARK-27622: --- Summary: Avoid network communication when block manger fetches from the same host (was: Avoiding network communication when block manger fetching from the same host) > Avoid network communication when block manger fetches from the same host > > > Key: SPARK-27622 > URL: https://issues.apache.org/jira/browse/SPARK-27622 > Project: Spark > Issue Type: Improvement > Components: Block Manager >Affects Versions: 3.0.0 >Reporter: Attila Zsolt Piros >Priority: Major > > Currently fetching blocks always uses the network even when the two block > managers are running on the same host. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27622) Avoiding network communication when block manger fetching from the same host
[ https://issues.apache.org/jira/browse/SPARK-27622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Attila Zsolt Piros updated SPARK-27622: --- Summary: Avoiding network communication when block manger fetching from the same host (was: Avoiding network communication when block mangers are running on the same host ) > Avoiding network communication when block manger fetching from the same host > > > Key: SPARK-27622 > URL: https://issues.apache.org/jira/browse/SPARK-27622 > Project: Spark > Issue Type: Improvement > Components: Block Manager >Affects Versions: 3.0.0 >Reporter: Attila Zsolt Piros >Priority: Major > > Currently fetching blocks always uses the network even when the two block > managers are running on the same host. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27622) Avoiding network communication when block mangers are running on the same host
[ https://issues.apache.org/jira/browse/SPARK-27622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Attila Zsolt Piros updated SPARK-27622: --- Summary: Avoiding network communication when block mangers are running on the same host (was: Avoiding network communication when block mangers are running on the host ) > Avoiding network communication when block mangers are running on the same > host > --- > > Key: SPARK-27622 > URL: https://issues.apache.org/jira/browse/SPARK-27622 > Project: Spark > Issue Type: Improvement > Components: Block Manager >Affects Versions: 3.0.0 >Reporter: Attila Zsolt Piros >Priority: Major > > Currently fetching blocks always uses the network even when the two block > managers are running on the same host. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27622) Avoiding network communication when block mangers are running on the host
[ https://issues.apache.org/jira/browse/SPARK-27622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16831553#comment-16831553 ] Attila Zsolt Piros commented on SPARK-27622: I am already working on this. A working prototype for RDD blocks is ready and working. > Avoiding network communication when block mangers are running on the host > -- > > Key: SPARK-27622 > URL: https://issues.apache.org/jira/browse/SPARK-27622 > Project: Spark > Issue Type: Improvement > Components: Block Manager >Affects Versions: 3.0.0 >Reporter: Attila Zsolt Piros >Priority: Major > > Currently fetching blocks always uses the network even when the two block > managers are running on the same host.
[jira] [Created] (SPARK-27622) Avoiding network communication when block mangers are running on the host
Attila Zsolt Piros created SPARK-27622: -- Summary: Avoiding network communication when block mangers are running on the host Key: SPARK-27622 URL: https://issues.apache.org/jira/browse/SPARK-27622 Project: Spark Issue Type: Improvement Components: Block Manager Affects Versions: 3.0.0 Reporter: Attila Zsolt Piros Currently fetching blocks always uses the network even when the two block managers are running on the same host. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25888) Service requests for persist() blocks via external service after dynamic deallocation
[ https://issues.apache.org/jira/browse/SPARK-25888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16825248#comment-16825248 ] Attila Zsolt Piros commented on SPARK-25888: I am working on this. > Service requests for persist() blocks via external service after dynamic > deallocation > - > > Key: SPARK-25888 > URL: https://issues.apache.org/jira/browse/SPARK-25888 > Project: Spark > Issue Type: New Feature > Components: Block Manager, Shuffle, YARN >Affects Versions: 2.3.2 >Reporter: Adam Kennedy >Priority: Major > > Large and highly multi-tenant Spark on YARN clusters with diverse job > execution often display terrible utilization rates (we have observed as low > as 3-7% CPU at max container allocation, but 50% CPU utilization on even a > well policed cluster is not uncommon). > As a sizing example, consider a scenario with 1,000 nodes, 50,000 cores, 250 > users and 50,000 runs of 1,000 distinct applications per week, with > predominantly Spark including a mixture of ETL, Ad Hoc tasks and PySpark > Notebook jobs (no streaming) > Utilization problems appear to be due in large part to difficulties with > persist() blocks (DISK or DISK+MEMORY) preventing dynamic deallocation. > In situations where an external shuffle service is present (which is typical > on clusters of this type) we already solve this for the shuffle block case by > offloading the IO handling of shuffle blocks to the external service, > allowing dynamic deallocation to proceed. > Allowing Executors to transfer persist() blocks to some external "shuffle" > service in a similar manner would be an enormous win for Spark multi-tenancy > as it would limit deallocation blocking scenarios to only MEMORY-only cache() > scenarios. > I'm not sure if I'm correct, but I seem to recall seeing in the original > external shuffle service commits that may have been considered at the time > but getting shuffle blocks moved to the external shuffle service was the > first priority. 
> With support for external persist() DISK blocks in place, we could also then > handle deallocation of DISK+MEMORY, as the memory instance could first be > dropped, changing the block to DISK only, and then further transferred to the > shuffle service. > We have tried to resolve the persist() issue via extensive user training, but > that has typically only allowed us to improve utilization of the worst > offenders (10% utilization) up to around 40-60% utilization, as the need for > persist() is often legitimate and occurs during the middle stages of a job. > In a healthy multi-tenant scenario, a large job might spool up to say 10,000 > cores, persist() data, release executors across a long tail down to 100 > cores, and then spool back up to 10,000 cores for the following stage without > impact on the persist() data. > In an ideal world, if a new executor started up on a node on which blocks > had been transferred to the shuffle service, the new executor might even be > able to "recapture" control of those blocks (if that would help with > performance in some way). > And the behavior of gradually expanding up and down several times over the > course of a job would not just improve utilization, but would allow resources > to more easily be redistributed to other jobs which start on the cluster > during the long-tail periods, which would improve multi-tenancy and bring us > closer to optimal "envy free" YARN scheduling.
[jira] [Commented] (SPARK-27090) Deplementing old LEGACY_DRIVER_IDENTIFIER ("")
[ https://issues.apache.org/jira/browse/SPARK-27090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16787822#comment-16787822 ] Attila Zsolt Piros commented on SPARK-27090: I would have waited a little bit to find out what the others' opinion about this is. But it is fine for me. > Deplementing old LEGACY_DRIVER_IDENTIFIER ("") > -- > > Key: SPARK-27090 > URL: https://issues.apache.org/jira/browse/SPARK-27090 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Attila Zsolt Piros >Priority: Major > > For legacy reasons LEGACY_DRIVER_IDENTIFIER was checked at a few places > along with the new DRIVER_IDENTIFIER ("driver") to decide whether a driver > is running or an executor. > The new DRIVER_IDENTIFIER ("driver") was introduced in Spark version 1.4. So > I think we have a chance to get rid of the LEGACY_DRIVER_IDENTIFIER.
[jira] [Created] (SPARK-27090) Deplementing old LEGACY_DRIVER_IDENTIFIER ("")
Attila Zsolt Piros created SPARK-27090: -- Summary: Deplementing old LEGACY_DRIVER_IDENTIFIER ("") Key: SPARK-27090 URL: https://issues.apache.org/jira/browse/SPARK-27090 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.0.0 Reporter: Attila Zsolt Piros For legacy reasons LEGACY_DRIVER_IDENTIFIER was checked at a few places along with the new DRIVER_IDENTIFIER ("driver") to decide whether a driver is running or an executor. The new DRIVER_IDENTIFIER ("driver") was introduced in Spark version 1.4. So I think we have a chance to get rid of the LEGACY_DRIVER_IDENTIFIER.
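A minimal sketch of the double comparison this issue proposes to remove. `DRIVER_IDENTIFIER` is `"driver"`; the legacy pre-1.4 value is assumed here to be `"<driver>"`, and the object and method names are illustrative, not Spark's actual code:

```scala
// Sketch of the legacy-aware driver check, with the constants mirroring
// SparkContext.DRIVER_IDENTIFIER and the assumed pre-Spark-1.4 legacy value.
object DriverId {
  val DRIVER_IDENTIFIER = "driver"
  val LEGACY_DRIVER_IDENTIFIER = "<driver>" // assumed legacy value

  // The double comparison scattered over a few places today:
  def isDriverLegacyAware(executorId: String): Boolean =
    executorId == DRIVER_IDENTIFIER || executorId == LEGACY_DRIVER_IDENTIFIER

  // After dropping the legacy identifier, a single comparison remains:
  def isDriver(executorId: String): Boolean =
    executorId == DRIVER_IDENTIFIER
}
```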
[jira] [Updated] (SPARK-27021) Netty related thread leaks ()
[ https://issues.apache.org/jira/browse/SPARK-27021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Attila Zsolt Piros updated SPARK-27021: --- Summary: Netty related thread leaks () (was: Netty related thread leaks) > Netty related thread leaks () > - > > Key: SPARK-27021 > URL: https://issues.apache.org/jira/browse/SPARK-27021 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0, 2.4.1, 3.0.0 >Reporter: Attila Zsolt Piros >Priority: Major > > The extra event loop group created for handling shuffle chunk fetch requests > are never closed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27021) Leaking Netty event loop group for shuffle chunk fetch requests
[ https://issues.apache.org/jira/browse/SPARK-27021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Attila Zsolt Piros updated SPARK-27021: --- Summary: Leaking Netty event loop group for shuffle chunk fetch requests (was: Netty related thread leaks (event loop group for shuffle chunk fetch requests)) > Leaking Netty event loop group for shuffle chunk fetch requests > --- > > Key: SPARK-27021 > URL: https://issues.apache.org/jira/browse/SPARK-27021 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0, 2.4.1, 3.0.0 >Reporter: Attila Zsolt Piros >Priority: Major > > The extra event loop group created for handling shuffle chunk fetch requests > are never closed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27021) Netty related thread leaks (event loop group for shuffle chunk fetch requests)
[ https://issues.apache.org/jira/browse/SPARK-27021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Attila Zsolt Piros updated SPARK-27021: --- Summary: Netty related thread leaks (event loop group for shuffle chunk fetch requests) (was: Netty related thread leaks ()) > Netty related thread leaks (event loop group for shuffle chunk fetch requests) > -- > > Key: SPARK-27021 > URL: https://issues.apache.org/jira/browse/SPARK-27021 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0, 2.4.1, 3.0.0 >Reporter: Attila Zsolt Piros >Priority: Major > > The extra event loop group created for handling shuffle chunk fetch requests > are never closed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27021) Netty related thread leaks
Attila Zsolt Piros created SPARK-27021: -- Summary: Netty related thread leaks Key: SPARK-27021 URL: https://issues.apache.org/jira/browse/SPARK-27021 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.0, 2.4.1, 3.0.0 Reporter: Attila Zsolt Piros The extra event loop group created for handling shuffle chunk fetch requests is never closed.
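The leak pattern the issue describes, and the shape of the fix, can be sketched without Netty by using plain JDK thread pools in place of the event loop groups (the class and member names here are illustrative assumptions, not Spark's actual transport code): the server creates an extra pool for chunk fetch handling, and `close()` must shut that pool down too, or its non-daemon threads leak.

```scala
import java.util.concurrent.{Executors, TimeUnit}

// Sketch of the leak: a server owns a main pool plus an "extra" pool for
// chunk fetch requests. The bug class is a close() that only shuts down the
// main pool; the fix is shutting down both.
class ChunkFetchServer extends AutoCloseable {
  private val mainPool = Executors.newFixedThreadPool(2)
  // the extra group from the issue; must also be shut down on close()
  private val chunkFetchPool = Executors.newFixedThreadPool(2)

  override def close(): Unit = {
    mainPool.shutdown()
    chunkFetchPool.shutdown() // the call whose absence causes the thread leak
    mainPool.awaitTermination(5, TimeUnit.SECONDS)
    chunkFetchPool.awaitTermination(5, TimeUnit.SECONDS)
  }

  def isChunkFetchPoolShutdown: Boolean = chunkFetchPool.isShutdown
}
```

With Netty the analogous call is `EventLoopGroup.shutdownGracefully()` on the extra group.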
[jira] [Commented] (SPARK-27021) Netty related thread leaks
[ https://issues.apache.org/jira/browse/SPARK-27021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16781707#comment-16781707 ] Attila Zsolt Piros commented on SPARK-27021: I am already working on it. > Netty related thread leaks > -- > > Key: SPARK-27021 > URL: https://issues.apache.org/jira/browse/SPARK-27021 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0, 2.4.1, 3.0.0 >Reporter: Attila Zsolt Piros >Priority: Major > > The extra event loop group created for handling shuffle chunk fetch requests > is never closed.
[jira] [Commented] (SPARK-26891) Flaky test:YarnSchedulerBackendSuite."RequestExecutors reflects node blacklist and is serializable"
[ https://issues.apache.org/jira/browse/SPARK-26891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16769323#comment-16769323 ] Attila Zsolt Piros commented on SPARK-26891: I am working on it. > Flaky test:YarnSchedulerBackendSuite."RequestExecutors reflects node > blacklist and is serializable" > --- > > Key: SPARK-26891 > URL: https://issues.apache.org/jira/browse/SPARK-26891 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.3.0, 2.4.0, 3.0.0 >Reporter: Attila Zsolt Piros >Priority: Major > > For even an unrelated change sometimes the > YarnSchedulerBackendSuite#"RequestExecutors reflects node blacklist and is > serializable" test fails with the following error: > {noformat} > Error Message > org.mockito.exceptions.misusing.WrongTypeOfReturnValue: EmptySet$ cannot be > returned by resourceOffers() resourceOffers() should return Seq *** If you're > unsure why you're getting above error read on. Due to the nature of the > syntax above problem might occur because: 1. This exception *might* occur in > wrongly written multi-threaded tests.Please refer to Mockito FAQ on > limitations of concurrency testing. 2. A spy is stubbed using > when(spy.foo()).then() syntax. It is safer to stub spies - - with > doReturn|Throw() family of methods. More in javadocs for Mockito.spy() > method. > Stacktrace > sbt.ForkMain$ForkError: > org.mockito.exceptions.misusing.WrongTypeOfReturnValue: > EmptySet$ cannot be returned by resourceOffers() > resourceOffers() should return Seq > *** > If you're unsure why you're getting above error read on. > Due to the nature of the syntax above problem might occur because: > 1. This exception *might* occur in wrongly written multi-threaded tests. >Please refer to Mockito FAQ on limitations of concurrency testing. > 2. A spy is stubbed using when(spy.foo()).then() syntax. It is safer to stub > spies - >- with doReturn|Throw() family of methods. More in javadocs for > Mockito.spy() method.
> {noformat}
[jira] [Updated] (SPARK-26891) Flaky test:YarnSchedulerBackendSuite."RequestExecutors reflects node blacklist and is serializable"
[ https://issues.apache.org/jira/browse/SPARK-26891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Attila Zsolt Piros updated SPARK-26891: --- Description: For even an unrelated change sometimes the YarnSchedulerBackendSuite#"RequestExecutors reflects node blacklist and is serializable" test fails with the following error: {noformat} Error Message org.mockito.exceptions.misusing.WrongTypeOfReturnValue: EmptySet$ cannot be returned by resourceOffers() resourceOffers() should return Seq *** If you're unsure why you're getting above error read on. Due to the nature of the syntax above problem might occur because: 1. This exception *might* occur in wrongly written multi-threaded tests.Please refer to Mockito FAQ on limitations of concurrency testing. 2. A spy is stubbed using when(spy.foo()).then() syntax. It is safer to stub spies - - with doReturn|Throw() family of methods. More in javadocs for Mockito.spy() method. Stacktrace sbt.ForkMain$ForkError: org.mockito.exceptions.misusing.WrongTypeOfReturnValue: EmptySet$ cannot be returned by resourceOffers() resourceOffers() should return Seq *** If you're unsure why you're getting above error read on. Due to the nature of the syntax above problem might occur because: 1. This exception *might* occur in wrongly written multi-threaded tests. Please refer to Mockito FAQ on limitations of concurrency testing. 2. A spy is stubbed using when(spy.foo()).then() syntax. It is safer to stub spies - - with doReturn|Throw() family of methods. More in javadocs for Mockito.spy() method. {noformat} was: For unrelated changes sometimes YarnSchedulerBackendSuite."RequestExecutors reflects node blacklist and is serializable" test fails with the following error: {noformat} Error Message org.mockito.exceptions.misusing.WrongTypeOfReturnValue: EmptySet$ cannot be returned by resourceOffers() resourceOffers() should return Seq *** If you're unsure why you're getting above error read on. 
Due to the nature of the syntax above problem might occur because: 1. This exception *might* occur in wrongly written multi-threaded tests.Please refer to Mockito FAQ on limitations of concurrency testing. 2. A spy is stubbed using when(spy.foo()).then() syntax. It is safer to stub spies - - with doReturn|Throw() family of methods. More in javadocs for Mockito.spy() method. Stacktrace sbt.ForkMain$ForkError: org.mockito.exceptions.misusing.WrongTypeOfReturnValue: EmptySet$ cannot be returned by resourceOffers() resourceOffers() should return Seq *** If you're unsure why you're getting above error read on. Due to the nature of the syntax above problem might occur because: 1. This exception *might* occur in wrongly written multi-threaded tests. Please refer to Mockito FAQ on limitations of concurrency testing. 2. A spy is stubbed using when(spy.foo()).then() syntax. It is safer to stub spies - - with doReturn|Throw() family of methods. More in javadocs for Mockito.spy() method. {noformat} > Flaky test:YarnSchedulerBackendSuite."RequestExecutors reflects node > blacklist and is serializable" > --- > > Key: SPARK-26891 > URL: https://issues.apache.org/jira/browse/SPARK-26891 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.3.0, 2.4.0, 3.0.0 >Reporter: Attila Zsolt Piros >Priority: Major > > For even an unrelated change sometimes the > YarnSchedulerBackendSuite#"RequestExecutors reflects node blacklist and is > serializable" test fails with the following error: > {noformat} > Error Message > org.mockito.exceptions.misusing.WrongTypeOfReturnValue: EmptySet$ cannot be > returned by resourceOffers() resourceOffers() should return Seq *** If you're > unsure why you're getting above error read on. Due to the nature of the > syntax above problem might occur because: 1. This exception *might* occur in > wrongly written multi-threaded tests.Please refer to Mockito FAQ on > limitations of concurrency testing. 2. 
A spy is stubbed using > when(spy.foo()).then() syntax. It is safer to stub spies - - with > doReturn|Throw() family of methods. More in javadocs for Mockito.spy() > method. > Stacktrace > sbt.ForkMain$ForkError: > org.mockito.exceptions.misusing.WrongTypeOfReturnValue: > EmptySet$ cannot be returned by resourceOffers() > resourceOffers() should return Seq > *** > If you're unsure why you're getting above error read on. > Due to the nature of the syntax above problem might occur because: > 1. This exception *might* occur in wrongly written multi-threaded tests. >Please refer to Mockito FAQ on limitations of concurrency testing. > 2. A spy is stubbed using when(spy.foo()).then() syntax. It is safer to stub > spies - >- with
[jira] [Created] (SPARK-26891) Flaky test:YarnSchedulerBackendSuite."RequestExecutors reflects node blacklist and is serializable"
Attila Zsolt Piros created SPARK-26891: -- Summary: Flaky test:YarnSchedulerBackendSuite."RequestExecutors reflects node blacklist and is serializable" Key: SPARK-26891 URL: https://issues.apache.org/jira/browse/SPARK-26891 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 2.4.0, 2.3.0, 3.0.0 Reporter: Attila Zsolt Piros For unrelated changes sometimes YarnSchedulerBackendSuite."RequestExecutors reflects node blacklist and is serializable" test fails with the following error: {noformat} Error Message org.mockito.exceptions.misusing.WrongTypeOfReturnValue: EmptySet$ cannot be returned by resourceOffers() resourceOffers() should return Seq *** If you're unsure why you're getting above error read on. Due to the nature of the syntax above problem might occur because: 1. This exception *might* occur in wrongly written multi-threaded tests.Please refer to Mockito FAQ on limitations of concurrency testing. 2. A spy is stubbed using when(spy.foo()).then() syntax. It is safer to stub spies - - with doReturn|Throw() family of methods. More in javadocs for Mockito.spy() method. Stacktrace sbt.ForkMain$ForkError: org.mockito.exceptions.misusing.WrongTypeOfReturnValue: EmptySet$ cannot be returned by resourceOffers() resourceOffers() should return Seq *** If you're unsure why you're getting above error read on. Due to the nature of the syntax above problem might occur because: 1. This exception *might* occur in wrongly written multi-threaded tests. Please refer to Mockito FAQ on limitations of concurrency testing. 2. A spy is stubbed using when(spy.foo()).then() syntax. It is safer to stub spies - - with doReturn|Throw() family of methods. More in javadocs for Mockito.spy() method. {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26845) Avro to_avro from_avro roundtrip fails if data type is string
[ https://issues.apache.org/jira/browse/SPARK-26845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16763094#comment-16763094 ] Attila Zsolt Piros commented on SPARK-26845: This also works:
{code}
test("roundtrip in to_avro and from_avro - string") {
  val df = spark.createDataset(Seq("1", "2", "3")).select('value.cast("string").as("str"))
  val avroDF = df.select(to_avro('str).as("b"))
  val avroTypeStr = s"""
    |{
    |  "type": "record",
    |  "name": "topLevelRecord",
    |  "fields": [
    |    {
    |      "name": "str",
    |      "type": ["string", "null"]
    |    }
    |  ]
    |}""".stripMargin
  checkAnswer(
    avroDF.select(from_avro('b, avroTypeStr).as("rec")).select($"rec.str"),
    df)
}
{code}
I have introduced a topLevelRecord because a union type is not allowed / not working at the top level (a good question why), I mean this:
{code:javascript}
{
  "name": "str",
  "type": ["string", "null"]
}
{code}
Throws an exception:
{noformat}
org.apache.avro.SchemaParseException: No type: {"name":"str","type":["string","null"]}
{noformat}
> Avro to_avro from_avro roundtrip fails if data type is string > - > > Key: SPARK-26845 > URL: https://issues.apache.org/jira/browse/SPARK-26845 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0, 3.0.0 >Reporter: Gabor Somogyi >Priority: Critical > Labels: correctness > > I was playing with AvroFunctionsSuite and created a situation where test > fails which I believe it shouldn't: > {code:java} > test("roundtrip in to_avro and from_avro - string") { > val df = spark.createDataset(Seq("1", "2", > "3")).select('value.cast("string").as("str")) > val avroDF = df.select(to_avro('str).as("b")) > val avroTypeStr = s""" > |{ > | "type": "string", > | "name": "str" > |} > """.stripMargin > checkAnswer(avroDF.select(from_avro('b, avroTypeStr)), df) > } > {code} > {code:java} > == Results == > !== Correct Answer - 3 == == Spark Answer - 3 == > !struct struct > ![1][] > ![2][] > ![3][] > {code}
[jira] [Commented] (SPARK-26845) Avro to_avro from_avro roundtrip fails if data type is string
[ https://issues.apache.org/jira/browse/SPARK-26845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16763066#comment-16763066 ] Attila Zsolt Piros commented on SPARK-26845: The test would work if you replaced the line {code:java} val df = spark.createDataset(Seq("1", "2", "3")).select('value.cast("string").as("str")) {code} with {code:java} val df = spark.range(3).select('id.cast("string").as("str")) {code} *The difference is caused by the nullable flag of the _StructField_.* For the _Seq_ you used, the schema is: {code:java} scala> spark.createDataset(Seq("1", "2", "3")).select('value.cast("string").as("str")).schema res0: org.apache.spark.sql.types.StructType = StructType(StructField(str,StringType,true)) {code} And for the range: {code:java} scala> spark.range(3).select('id.cast("string").as("str")).schema res1: org.apache.spark.sql.types.StructType = StructType(StructField(str,StringType,false)) {code} So in your case the _avroTypeStr_ does not match the data.
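The nullable-flag mismatch is also visible on the wire: Avro encodes a plain string and a ["string", "null"] union differently, because a union value is prefixed with its branch index. A minimal sketch in Python (hand-rolled encoders for illustration, not the Avro library API; assumes Avro's standard binary encoding of longs and strings):

```python
# Sketch of Avro binary encoding, showing why data written with a nullable
# ["string", "null"] union schema cannot be decoded with a plain "string"
# reader schema: the union inserts an extra branch-index byte.

def zigzag_varint(n: int) -> bytes:
    """Encode a long as Avro's zigzag varint."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while True:
        b = z & 0x7F
        z >>= 7
        if z:
            out.append(b | 0x80)
        else:
            out.append(b)
            return bytes(out)

def encode_string(s: str) -> bytes:
    """Avro string: long length, then UTF-8 bytes."""
    data = s.encode("utf-8")
    return zigzag_varint(len(data)) + data

def encode_union_string(s: str, branch: int = 0) -> bytes:
    """Avro union: long branch index, then the value of that branch."""
    return zigzag_varint(branch) + encode_string(s)

plain = encode_string("1")        # b'\x021'
union = encode_union_string("1")  # b'\x00\x021': leading branch byte
print(plain, union, plain == union)
```

The extra leading byte is exactly why the reader schema must mirror the writer schema's nullability.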
[jira] [Commented] (SPARK-25035) Replicating disk-stored blocks should avoid memory mapping
[ https://issues.apache.org/jira/browse/SPARK-25035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16755322#comment-16755322 ] Attila Zsolt Piros commented on SPARK-25035: I am working on this. > Replicating disk-stored blocks should avoid memory mapping > -- > > Key: SPARK-25035 > URL: https://issues.apache.org/jira/browse/SPARK-25035 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: Imran Rashid >Priority: Major > Labels: memory-analysis > > This is a follow-up to SPARK-24296. > When replicating a disk-cached block, even if we fetch-to-disk, we still > memory-map the file, just to copy it to another location. > Ideally we'd just move the tmp file to the right location. But even without > that, we could read the file as an input stream, instead of memory-mapping > the whole thing. Memory-mapping is particularly a problem when running under > yarn, as the OS may believe there is plenty of memory available, meanwhile > yarn decides to kill the process for exceeding memory limits.
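The description's alternative, reading the file as a stream instead of memory-mapping the whole block, can be sketched outside Spark. This is an illustrative Python sketch (not Spark code): peak memory stays bounded by the chunk size regardless of the block size.

```python
# Copy a disk-stored "block" by streaming fixed-size chunks rather than
# mapping the whole file into memory; memory use is bounded by CHUNK.
import os
import shutil
import tempfile

CHUNK = 64 * 1024  # stream in 64 KiB chunks

def replicate_block(src_path: str, dst_path: str) -> int:
    """Copy src to dst as a bounded-memory stream; returns bytes copied."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst, length=CHUNK)  # reads CHUNK bytes at a time
    return os.path.getsize(dst_path)

with tempfile.TemporaryDirectory() as d:
    src = os.path.join(d, "block.bin")
    dst = os.path.join(d, "replica.bin")
    with open(src, "wb") as f:
        f.write(os.urandom(1 << 20))  # a 1 MiB stand-in for a cached block
    copied = replicate_block(src, dst)

print(copied)
```

The contrast with `mmap` is that mapping charges the whole file to the process's address space at once, which is what the ticket says confuses YARN's memory accounting.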
[jira] [Commented] (SPARK-26688) Provide configuration of initially blacklisted YARN nodes
[ https://issues.apache.org/jira/browse/SPARK-26688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16748922#comment-16748922 ] Attila Zsolt Piros commented on SPARK-26688: As far as I know, a node can only have one label. So if node labels are already used for something else (to group your nodes somehow), this could be a lightweight way to still exclude some YARN nodes. > Provide configuration of initially blacklisted YARN nodes > - > > Key: SPARK-26688 > URL: https://issues.apache.org/jira/browse/SPARK-26688 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 3.0.0 >Reporter: Attila Zsolt Piros >Priority: Major > > Introducing new config for initially blacklisted YARN nodes. > This came up in the apache spark user mailing list: > [http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-Yarn-is-it-possible-to-manually-blacklist-nodes-before-running-spark-job-td34395.html]
[jira] [Updated] (SPARK-26688) Provide configuration of initially blacklisted YARN nodes
[ https://issues.apache.org/jira/browse/SPARK-26688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Attila Zsolt Piros updated SPARK-26688: --- Description: Introducing new config for initially blacklisted YARN nodes. This came up in the apache spark user mailing list: [http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-Yarn-is-it-possible-to-manually-blacklist-nodes-before-running-spark-job-td34395.html] was:Introducing new config for initially blacklisted YARN nodes.
[jira] [Created] (SPARK-26688) Provide configuration of initially blacklisted YARN nodes
Attila Zsolt Piros created SPARK-26688: -- Summary: Provide configuration of initially blacklisted YARN nodes Key: SPARK-26688 URL: https://issues.apache.org/jira/browse/SPARK-26688 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 3.0.0 Reporter: Attila Zsolt Piros Introducing new config for initially blacklisted YARN nodes.
[jira] [Commented] (SPARK-26688) Provide configuration of initially blacklisted YARN nodes
[ https://issues.apache.org/jira/browse/SPARK-26688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16748632#comment-16748632 ] Attila Zsolt Piros commented on SPARK-26688: I am working on this.
[jira] [Commented] (SPARK-26615) Fixing transport server/client resource leaks in the core unittests
[ https://issues.apache.org/jira/browse/SPARK-26615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16742243#comment-16742243 ] Attila Zsolt Piros commented on SPARK-26615: I am working on it (a PR will be created soon). > Fixing transport server/client resource leaks in the core unittests > > > Key: SPARK-26615 > URL: https://issues.apache.org/jira/browse/SPARK-26615 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0, 3.0.0 >Reporter: Attila Zsolt Piros >Priority: Major > > The testing of the SPARK-24938 PR ([https://github.com/apache/spark/pull/22114]) > always fails with OOM. Analysing this problem led to identifying some > resource leaks where TransportClient/TransportServer instances are not closed > properly.
[jira] [Updated] (SPARK-26615) Fixing transport server/client resource leaks in the core unittests
[ https://issues.apache.org/jira/browse/SPARK-26615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Attila Zsolt Piros updated SPARK-26615: --- Affects Version/s: 3.0.0
[jira] [Created] (SPARK-26615) Fixing transport server/client resource leaks in the core unittests
Attila Zsolt Piros created SPARK-26615: -- Summary: Fixing transport server/client resource leaks in the core unittests Key: SPARK-26615 URL: https://issues.apache.org/jira/browse/SPARK-26615 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.0 Reporter: Attila Zsolt Piros The testing of the SPARK-24938 PR ([https://github.com/apache/spark/pull/22114]) always fails with OOM. Analysing this problem led to identifying some resource leaks where TransportClient/TransportServer instances are not closed properly.
[jira] [Commented] (SPARK-24920) Spark should allow sharing netty's memory pools across all uses
[ https://issues.apache.org/jira/browse/SPARK-24920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16715040#comment-16715040 ] Attila Zsolt Piros commented on SPARK-24920: I started to work on it. I would keep a separate memory pool for transport clients and transport servers. This way cache allowance for each pool would be like before the change for transport servers it would be on and for transport clients it would be off. > Spark should allow sharing netty's memory pools across all uses > --- > > Key: SPARK-24920 > URL: https://issues.apache.org/jira/browse/SPARK-24920 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Priority: Major > Labels: memory-analysis > > Spark currently creates separate netty memory pools for each of the following > "services": > 1) RPC Client > 2) RPC Server > 3) BlockTransfer Client > 4) BlockTransfer Server > 5) ExternalShuffle Client > Depending on configuration and whether it's an executor or driver JVM, > a different subset of these is active, but it's always either 3 or 4. > Having them independent somewhat defeats the purpose of using pools at all. > In my experiments I've found each pool will grow due to a burst of activity > in the related service (eg. task start / end msgs), followed by another burst in > a different service (eg. sending torrent broadcast blocks). Because of the > way these pools work, they allocate memory in large chunks (16 MB by default) > for each netty thread, so there is often a surge of 128 MB of allocated > memory, even for really tiny messages. Also a lot of this memory is offheap > by default, which makes it even tougher for users to manage. > I think it would make more sense to combine all of these into a single pool. > In some experiments I tried, this noticeably decreased memory usage, both > onheap and offheap (no significant performance effect in my small > experiments). > As this is a pretty core change, as a first step I'd propose just exposing > this as a conf, to let users experiment more broadly across a wider range of > workloads
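The 16 MB chunk and 128 MB surge figures in the description can be checked with simple arithmetic; this sketch just reproduces those numbers (the thread count of 8 is the value implied by the description, not something measured here):

```python
# Back-of-the-envelope model of per-pool overhead: a pooled allocator that
# grabs 16 MiB chunks per event-loop thread can surge to threads * chunk_size
# per pool, even when the actual messages are tiny.
CHUNK_MIB = 16   # default chunk size cited in the description
THREADS = 8      # assumed number of netty event-loop threads

surge_per_pool = THREADS * CHUNK_MIB
print(f"one pool: {surge_per_pool} MiB")  # matches the 128 MB in the description

# With 3 or 4 independent pools active (RPC client/server, block transfer, ...):
for pools in (3, 4):
    print(f"{pools} pools: {pools * surge_per_pool} MiB worst case")
```

This is why sharing one pool across the services can cut the worst case by a factor of 3-4.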
[jira] [Comment Edited] (SPARK-24920) Spark should allow sharing netty's memory pools across all uses
[ https://issues.apache.org/jira/browse/SPARK-24920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16715040#comment-16715040 ] Attila Zsolt Piros edited comment on SPARK-24920 at 12/10/18 4:39 PM: -- I started to work on it. I would keep a separate memory pool for transport clients and transport servers. This way cache allowance for each pool would be like before this change: for transport servers it would be on and for transport clients it would be off. was (Author: attilapiros): I started to work on it. I would keep a separate memory pool for transport clients and transport servers. This way cache allowance for each pool would be like before the change for transport servers it would be on and for transport clients it would be off.
[jira] [Resolved] (SPARK-26232) Remove old Netty3 dependency from the build
[ https://issues.apache.org/jira/browse/SPARK-26232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Attila Zsolt Piros resolved SPARK-26232. Resolution: Won't Fix One of the third-party tools uses Netty3, so Netty3 is in fact needed as a transitive dependency. > Remove old Netty3 dependency from the build > > > Key: SPARK-26232 > URL: https://issues.apache.org/jira/browse/SPARK-26232 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: Attila Zsolt Piros >Priority: Major > > The old Netty artifact (3.9.9.Final) is unused. > The reason it does not collide with Netty4 is that they use different package > names: > * Netty3: org.jboss.netty.* > * Netty4: io.netty.* > Still, Netty3 is not needed.
[jira] [Updated] (SPARK-26232) Remove old Netty3 dependency from the build
[ https://issues.apache.org/jira/browse/SPARK-26232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Attila Zsolt Piros updated SPARK-26232: --- Summary: Remove old Netty3 dependency from the build (was: Remove old Netty3 artifact from the build )
[jira] [Created] (SPARK-26232) Remove old Netty3 artifact from the build
Attila Zsolt Piros created SPARK-26232: -- Summary: Remove old Netty3 artifact from the build Key: SPARK-26232 URL: https://issues.apache.org/jira/browse/SPARK-26232 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.0.0 Reporter: Attila Zsolt Piros The old Netty artifact (3.9.9.Final) is unused. The reason it does not collide with Netty4 is that they use different package names: * Netty3: org.jboss.netty.* * Netty4: io.netty.* Still, Netty3 is not needed.
[jira] [Updated] (SPARK-26118) Make Jetty's requestHeaderSize configurable in Spark
[ https://issues.apache.org/jira/browse/SPARK-26118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Attila Zsolt Piros updated SPARK-26118: --- Issue Type: New Feature (was: Bug) > Make Jetty's requestHeaderSize configurable in Spark > > > Key: SPARK-26118 > URL: https://issues.apache.org/jira/browse/SPARK-26118 > Project: Spark > Issue Type: New Feature > Components: Web UI >Affects Versions: 3.0.0 >Reporter: Attila Zsolt Piros >Assignee: Attila Zsolt Piros >Priority: Major > Fix For: 2.4.1, 3.0.0 > > > For long authorization fields the request header size can exceed the > default limit (8192 bytes), and in this case Jetty replies with HTTP 413 (Request > Entity Too Large). > This issue may occur if the user is a member of many Active Directory user > groups. > The HTTP request to the server contains the Kerberos token in the > WWW-Authenticate header. The header size increases together with the number > of user groups. > Currently there is no way in Spark to override this limit.
[jira] [Updated] (SPARK-26118) Make Jetty's requestHeaderSize configurable in Spark
[ https://issues.apache.org/jira/browse/SPARK-26118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Attila Zsolt Piros updated SPARK-26118: --- Issue Type: Bug (was: Improvement)
[jira] [Commented] (SPARK-26118) Make Jetty's requestHeaderSize configurable in Spark
[ https://issues.apache.org/jira/browse/SPARK-26118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692018#comment-16692018 ] Attila Zsolt Piros commented on SPARK-26118: I am working on a PR.
[jira] [Updated] (SPARK-26118) Make Jetty's requestHeaderSize configurable in Spark
[ https://issues.apache.org/jira/browse/SPARK-26118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Attila Zsolt Piros updated SPARK-26118: --- Affects Version/s: (was: 2.3.2) (was: 2.4.0) (was: 2.2.2) (was: 2.1.3)
[jira] [Created] (SPARK-26118) Make Jetty's requestHeaderSize configurable in Spark
Attila Zsolt Piros created SPARK-26118: -- Summary: Make Jetty's requestHeaderSize configurable in Spark Key: SPARK-26118 URL: https://issues.apache.org/jira/browse/SPARK-26118 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 2.4.0, 2.3.2, 2.2.2, 2.1.3, 3.0.0 Reporter: Attila Zsolt Piros For long authorization fields the request header size can exceed the default limit (8192 bytes), and in this case Jetty replies with HTTP 413 (Request Entity Too Large). This issue may occur if the user is a member of many Active Directory user groups. The HTTP request to the server contains the Kerberos token in the WWW-Authenticate header. The header size increases together with the number of user groups. Currently there is no way in Spark to override this limit.
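The fix for this ticket exposed the Jetty limit as a Spark configuration. To the best of my knowledge the setting is `spark.ui.requestHeaderSize` with a default of 8k; treat the exact name and default as assumptions and verify them against the configuration documentation of your Spark version:

```shell
# Hedged example: raising Jetty's request header limit for the Spark UI.
# spark.ui.requestHeaderSize accepts byte-size values (e.g. "16k").
spark-shell --conf spark.ui.requestHeaderSize=16k
```

Doubling the limit is usually enough for Kerberos tokens inflated by large Active Directory group memberships.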
[jira] [Commented] (SPARK-26002) SQL date operators calculates with incorrect dayOfYears for dates before 1500-03-01
[ https://issues.apache.org/jira/browse/SPARK-26002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682404#comment-16682404 ] Attila Zsolt Piros commented on SPARK-26002: I am already working on a fix. > SQL date operators calculates with incorrect dayOfYears for dates before > 1500-03-01 > --- > > Key: SPARK-26002 > URL: https://issues.apache.org/jira/browse/SPARK-26002 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.3, 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.3.2, 2.4.0, > 3.0.0 >Reporter: Attila Zsolt Piros >Priority: Major > > Running the following SQL, the result is incorrect: > {noformat} > scala> sql("select dayOfYear('1500-01-02')").show() > +---+ > |dayofyear(CAST(1500-01-02 AS DATE))| > +---+ > | 1| > +---+ > {noformat} > This off-by-one-day error is even more annoying right at the beginning of a year: > {noformat} > scala> sql("select year('1500-01-01')").show() > +--+ > |year(CAST(1500-01-01 AS DATE))| > +--+ > | 1499| > +--+ > scala> sql("select month('1500-01-01')").show() > +---+ > |month(CAST(1500-01-01 AS DATE))| > +---+ > | 12| > +---+ > scala> sql("select dayOfYear('1500-01-01')").show() > +---+ > |dayofyear(CAST(1500-01-01 AS DATE))| > +---+ > |365| > +---+ > {noformat} >
[jira] [Created] (SPARK-26002) SQL date operators calculates with incorrect dayOfYears for dates before 1500-03-01
Attila Zsolt Piros created SPARK-26002: -- Summary: SQL date operators calculates with incorrect dayOfYears for dates before 1500-03-01 Key: SPARK-26002 URL: https://issues.apache.org/jira/browse/SPARK-26002 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0, 2.3.2, 2.3.1, 2.3.0, 2.2.2, 2.2.1, 2.2.0, 2.1.3, 3.0.0 Reporter: Attila Zsolt Piros Running the following SQL, the result is incorrect: {noformat} scala> sql("select dayOfYear('1500-01-02')").show() +---+ |dayofyear(CAST(1500-01-02 AS DATE))| +---+ | 1| +---+ {noformat} This off-by-one-day error is even more annoying right at the beginning of a year: {noformat} scala> sql("select year('1500-01-01')").show() +--+ |year(CAST(1500-01-01 AS DATE))| +--+ | 1499| +--+ scala> sql("select month('1500-01-01')").show() +---+ |month(CAST(1500-01-01 AS DATE))| +---+ | 12| +---+ scala> sql("select dayOfYear('1500-01-01')").show() +---+ |dayofyear(CAST(1500-01-01 AS DATE))| +---+ |365| +---+ {noformat}
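The 1500-03-01 boundary is a calendar artifact: 1500 is a leap year in the Julian calendar but not in the proleptic Gregorian one, so mixing the two calendars shifts dates before 1500-03-01 by one extra day. Python's `datetime` uses the proleptic Gregorian calendar throughout and returns the values the ticket expects; a quick check (illustrative, not Spark code):

```python
# Python's datetime is proleptic Gregorian, so day-of-year and year come out
# as expected for dates before 1500-03-01.
from datetime import date

d = date(1500, 1, 2)
print(d.timetuple().tm_yday)                   # 2, not the 1 reported above
print(date(1500, 1, 1).year)                   # 1500, not 1499

# 1500 is NOT a leap year here (divisible by 100 but not 400), so the year
# has 365 days; in the Julian calendar it IS a leap year with 366 days.
print(date(1500, 12, 31).timetuple().tm_yday)  # 365
```

The divergence between the two calendars at 1500-02-29 is exactly where the reported off-by-one starts.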
[jira] [Commented] (SPARK-24578) Reading remote cache block behavior changes and causes timeout issue
[ https://issues.apache.org/jira/browse/SPARK-24578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16517515#comment-16517515 ] Attila Zsolt Piros commented on SPARK-24578: [~wbzhao] oh sorry, I read your comment late; without your help (pointing to the commit that caused the problem) it would definitely have taken much longer. If you like, please create another PR; that is fine with me. > Reading remote cache block behavior changes and causes timeout issue > > > Key: SPARK-24578 > URL: https://issues.apache.org/jira/browse/SPARK-24578 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0, 2.3.1 >Reporter: Wenbo Zhao >Priority: Major > > After Spark 2.3, we observed lots of errors like the following in some of our > production job > {code:java} > 18/06/15 20:59:42 ERROR TransportRequestHandler: Error sending result > ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=91672904003, > chunkIndex=0}, > buffer=org.apache.spark.storage.BlockManagerManagedBuffer@783a9324} to > /172.22.18.7:60865; closing connection > java.io.IOException: Broken pipe > at sun.nio.ch.FileDispatcherImpl.write0(Native Method) > at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) > at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) > at sun.nio.ch.IOUtil.write(IOUtil.java:65) > at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471) > at > org.apache.spark.network.protocol.MessageWithHeader.writeNioBuffer(MessageWithHeader.java:156) > at > org.apache.spark.network.protocol.MessageWithHeader.copyByteBuf(MessageWithHeader.java:142) > at > org.apache.spark.network.protocol.MessageWithHeader.transferTo(MessageWithHeader.java:123) > at > io.netty.channel.socket.nio.NioSocketChannel.doWriteFileRegion(NioSocketChannel.java:355) > at > io.netty.channel.nio.AbstractNioByteChannel.doWrite(AbstractNioByteChannel.java:224) > at > io.netty.channel.socket.nio.NioSocketChannel.doWrite(NioSocketChannel.java:382) > at >
io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:934) > at > io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.flush0(AbstractNioChannel.java:362) > at > io.netty.channel.AbstractChannel$AbstractUnsafe.flush(AbstractChannel.java:901) > at > io.netty.channel.DefaultChannelPipeline$HeadContext.flush(DefaultChannelPipeline.java:1321) > at > io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:776) > at > io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:768) > at > io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:749) > at > io.netty.channel.ChannelOutboundHandlerAdapter.flush(ChannelOutboundHandlerAdapter.java:115) > at > io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:776) > at > io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:768) > at > io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:749) > at io.netty.channel.ChannelDuplexHandler.flush(ChannelDuplexHandler.java:117) > at > io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:776) > at > io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:768) > at > io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:749) > at > io.netty.channel.DefaultChannelPipeline.flush(DefaultChannelPipeline.java:983) > at io.netty.channel.AbstractChannel.flush(AbstractChannel.java:248) > at > io.netty.channel.nio.AbstractNioByteChannel$1.run(AbstractNioByteChannel.java:284) > at > io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163) > at > io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403) > at 
io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463) > at > io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) > at > io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138) > {code} > > Here is a small reproducible for a small cluster of 2 executors (say host-1 > and host-2) each with 8 cores. Here, the memory of driver and executors are > not an import factor here as long as it is big enough, say 20G. > {code:java} > val n = 1 > val df0 = sc.parallelize(1 to n).toDF > val df = df0.withColumn("x0", rand()).withColumn("x0", rand() > ).withColumn("x1", rand() > ).withColumn("x2", rand() > ).withColumn("x3", rand() > ).withColumn("x4", rand() > ).withColumn("x5", rand() > ).withColumn("x6", rand()
[jira] [Commented] (SPARK-24578) Reading remote cache block behavior changes and causes timeout issue
[ https://issues.apache.org/jira/browse/SPARK-24578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16517452#comment-16517452 ] Attila Zsolt Piros commented on SPARK-24578: [~wbzhao] Yes, you are right, readerIndex is the first (I just checked skipBytes), thanks. [~irashid] Yes, I can create a PR soon. [~vanzin] Netty was upgraded between 2.2.1 and 2.3.0: from 4.0.43.Final to 4.1.17.Final. Could it be? > Reading remote cache block behavior changes and causes timeout issue > > > Key: SPARK-24578 > URL: https://issues.apache.org/jira/browse/SPARK-24578 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0, 2.3.1 >Reporter: Wenbo Zhao >Priority: Major > > After Spark 2.3, we observed lots of errors like the following in some of our > production job > {code:java} > 18/06/15 20:59:42 ERROR TransportRequestHandler: Error sending result > ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=91672904003, > chunkIndex=0}, > buffer=org.apache.spark.storage.BlockManagerManagedBuffer@783a9324} to > /172.22.18.7:60865; closing connection > java.io.IOException: Broken pipe > at sun.nio.ch.FileDispatcherImpl.write0(Native Method) > at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) > at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) > at sun.nio.ch.IOUtil.write(IOUtil.java:65) > at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471) > at > org.apache.spark.network.protocol.MessageWithHeader.writeNioBuffer(MessageWithHeader.java:156) > at > org.apache.spark.network.protocol.MessageWithHeader.copyByteBuf(MessageWithHeader.java:142) > at > org.apache.spark.network.protocol.MessageWithHeader.transferTo(MessageWithHeader.java:123) > at > io.netty.channel.socket.nio.NioSocketChannel.doWriteFileRegion(NioSocketChannel.java:355) > at > io.netty.channel.nio.AbstractNioByteChannel.doWrite(AbstractNioByteChannel.java:224) > at > 
io.netty.channel.socket.nio.NioSocketChannel.doWrite(NioSocketChannel.java:382) > at > io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:934) > at > io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.flush0(AbstractNioChannel.java:362) > at > io.netty.channel.AbstractChannel$AbstractUnsafe.flush(AbstractChannel.java:901) > at > io.netty.channel.DefaultChannelPipeline$HeadContext.flush(DefaultChannelPipeline.java:1321) > at > io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:776) > at > io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:768) > at > io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:749) > at > io.netty.channel.ChannelOutboundHandlerAdapter.flush(ChannelOutboundHandlerAdapter.java:115) > at > io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:776) > at > io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:768) > at > io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:749) > at io.netty.channel.ChannelDuplexHandler.flush(ChannelDuplexHandler.java:117) > at > io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:776) > at > io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:768) > at > io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:749) > at > io.netty.channel.DefaultChannelPipeline.flush(DefaultChannelPipeline.java:983) > at io.netty.channel.AbstractChannel.flush(AbstractChannel.java:248) > at > io.netty.channel.nio.AbstractNioByteChannel$1.run(AbstractNioByteChannel.java:284) > at > io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163) > at > 
io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463) > at > io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) > at > io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138) > {code} > > Here is a small reproducer for a small cluster of 2 executors (say host-1 > and host-2), each with 8 cores. The memory of the driver and executors is > not an important factor here as long as it is big enough, say 20G. > {code:java} > val n = 1 > val df0 = sc.parallelize(1 to n).toDF > val df = df0.withColumn("x0", rand()).withColumn("x0", rand() > ).withColumn("x1", rand() > ).withColumn("x2", rand() > ).withColumn("x3", rand() > ).withColumn("x4", rand() > ).withColumn("x5", rand() >
[jira] [Commented] (SPARK-24578) Reading remote cache block behavior changes and causes timeout issue
[ https://issues.apache.org/jira/browse/SPARK-24578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16517416#comment-16517416 ] Attila Zsolt Piros commented on SPARK-24578: I have written a small test; I know it is a bit naive, but I still think it shows us something: [https://gist.github.com/attilapiros/730d67b62317d14f5fd0f6779adea245] And the result is: {noformat} n = 500 duration221 = 2 duration221.new = 0 duration230 = 5242 duration230.new = 7 {noformat} My guess is that the receiver times out and the sender is writing into a closed socket. > Reading remote cache block behavior changes and causes timeout issue > > > Key: SPARK-24578 > URL: https://issues.apache.org/jira/browse/SPARK-24578 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0, 2.3.1 >Reporter: Wenbo Zhao >Priority: Major
[jira] [Commented] (SPARK-24578) Reading remote cache block behavior changes and causes timeout issue
[ https://issues.apache.org/jira/browse/SPARK-24578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16517287#comment-16517287 ] Attila Zsolt Piros commented on SPARK-24578: copyByteBuf(), via transferTo(), is called several times on a huge byteBuf. If we have a really huge chunk built from small chunks, we re-merge the small chunks again and again at [https://github.com/apache/spark/blob/v2.3.0/common/network-common/src/main/java/org/apache/spark/network/protocol/MessageWithHeader.java#L140] although we need only a very small part of it. Is it possible the following line would help? {code:java} ByteBuffer buffer = buf.nioBuffer(0, Math.min(buf.readableBytes(), NIO_BUFFER_LIMIT));{code} > Reading remote cache block behavior changes and causes timeout issue > > > Key: SPARK-24578 > URL: https://issues.apache.org/jira/browse/SPARK-24578 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0, 2.3.1 >Reporter: Wenbo Zhao >Priority: Major
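The effect of the proposed cap can be sketched without Netty, using plain `java.nio` (a minimal illustration under stated assumptions: `CappedCopy` and `copyCapped` are hypothetical stand-ins, not Spark's actual `MessageWithHeader.copyByteBuf`, and the 256 KB constant assumes Spark's `NIO_BUFFER_LIMIT` value):

```java
import java.nio.ByteBuffer;

// Sketch of the bounded copy proposed above: instead of re-materializing the
// whole (possibly huge) buffer on every write attempt, expose at most
// NIO_BUFFER_LIMIT bytes per call and advance the reader position only by
// the number of bytes actually consumed.
public class CappedCopy {
    // Assumed value of Spark's NIO_BUFFER_LIMIT (256 KB).
    static final int NIO_BUFFER_LIMIT = 256 * 1024;

    // Copies at most `limit` bytes from src to dst; returns the byte count.
    static int copyCapped(ByteBuffer src, ByteBuffer dst, int limit) {
        int n = Math.min(limit, Math.min(src.remaining(), dst.remaining()));
        ByteBuffer slice = src.duplicate();   // bounded view, no chunk re-merge
        slice.limit(slice.position() + n);
        dst.put(slice);
        src.position(src.position() + n);     // advance the reader index
        return n;
    }

    public static void main(String[] args) {
        ByteBuffer src = ByteBuffer.allocate(1_000_000);  // "huge" message body
        ByteBuffer sink = ByteBuffer.allocate(1_000_000); // stands in for the socket
        int total = 0;
        while (src.hasRemaining()) {
            total += copyCapped(src, sink, NIO_BUFFER_LIMIT);
        }
        System.out.println(total); // 1000000: whole body copied in bounded chunks
    }
}
```

Each call touches at most 256 KB, so a partial socket write no longer forces the whole composite buffer to be merged again on the next attempt.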
[jira] [Commented] (SPARK-24594) Introduce metrics for YARN executor allocation problems
[ https://issues.apache.org/jira/browse/SPARK-24594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16517028#comment-16517028 ] Attila Zsolt Piros commented on SPARK-24594: I am working on this. > Introduce metrics for YARN executor allocation problems > > > Key: SPARK-24594 > URL: https://issues.apache.org/jira/browse/SPARK-24594 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.4.0 >Reporter: Attila Zsolt Piros >Priority: Major > > Within SPARK-16630 it came up to introduce metrics for YARN executor > allocation problems. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24594) Introduce metrics for YARN executor allocation problems
[ https://issues.apache.org/jira/browse/SPARK-24594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Attila Zsolt Piros updated SPARK-24594: --- Description: Within SPARK-16630 it came up to introduce metrics for YARN executor allocation problems. (was: Within https://issues.apache.org/jira/browse/SPARK-16630 it come up to introduce metrics for YARN executor allocation problems.) > Introduce metrics for YARN executor allocation problems > > > Key: SPARK-24594 > URL: https://issues.apache.org/jira/browse/SPARK-24594 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.4.0 >Reporter: Attila Zsolt Piros >Priority: Major
[jira] [Created] (SPARK-24594) Introduce metrics for YARN executor allocation problems
Attila Zsolt Piros created SPARK-24594: -- Summary: Introduce metrics for YARN executor allocation problems Key: SPARK-24594 URL: https://issues.apache.org/jira/browse/SPARK-24594 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 2.4.0 Reporter: Attila Zsolt Piros Within https://issues.apache.org/jira/browse/SPARK-16630 it came up to introduce metrics for YARN executor allocation problems.
[jira] [Commented] (SPARK-19526) Spark should raise an exception when it tries to read a Hive view but it doesn't have read access on the corresponding table(s)
[ https://issues.apache.org/jira/browse/SPARK-19526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16504827#comment-16504827 ] Attila Zsolt Piros commented on SPARK-19526: I started working on this issue. > Spark should raise an exception when it tries to read a Hive view but it > doesn't have read access on the corresponding table(s) > --- > > Key: SPARK-19526 > URL: https://issues.apache.org/jira/browse/SPARK-19526 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.4, 2.0.3, 2.2.0, 2.3.0 >Reporter: Reza Safi >Priority: Major > > Spark sees a Hive view as a set of HDFS "files". So to read anything from a > Hive view, Spark needs access to all of the files that belong to the > table(s) that the view queries. In other words, a Spark user cannot be > granted fine-grained permissions at the level of Hive columns or records. > Consider a Spark job that contains a SQL query that tries to > read a Hive view. Currently the Spark job will finish successfully even if the > user that runs the Spark job doesn't have proper read access permissions to > the tables that the Hive view has been built on top of. It will just > return an empty result set. This can be confusing for the users, since the > job finishes without any exception or error. > Spark should raise an exception like AccessDenied when it tries to run a > Hive view query and its user doesn't have proper permissions to the tables > that the Hive view is created on top of.
[jira] [Commented] (SPARK-19181) SparkListenerSuite.local metrics fails when average executorDeserializeTime is too short.
[ https://issues.apache.org/jira/browse/SPARK-19181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16467137#comment-16467137 ] Attila Zsolt Piros commented on SPARK-19181: I am working on this. > SparkListenerSuite.local metrics fails when average executorDeserializeTime > is too short. > - > > Key: SPARK-19181 > URL: https://issues.apache.org/jira/browse/SPARK-19181 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.1.0 >Reporter: Jose Soltren >Priority: Minor > > https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/scheduler/SparkListenerSuite.scala#L249 > The "local metrics" test asserts that tasks should take more than 1ms on > average to complete, even though a code comment notes that this is a small > test and tasks may finish faster. I've been seeing some "failures" here on > fast systems that finish these tasks quite quickly. > There are a few ways forward here: > 1. Disable this test. > 2. Relax this check. > 3. Implement sub-millisecond granularity for task times throughout Spark. > 4. (Imran Rashid's suggestion) Add buffer time by, say, having the task > reference a partition that implements a custom Externalizable.readExternal, > which always waits 1ms before returning.
[jira] [Commented] (SPARK-16630) Blacklist a node if executors won't launch on it.
[ https://issues.apache.org/jira/browse/SPARK-16630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432495#comment-16432495 ] Attila Zsolt Piros commented on SPARK-16630: Let me illustrate my problem with an example: - the limit for blacklisted nodes is configured to 2 - we have one node blacklisted close to the YARN allocator ("host1" -> expiryTime1), this is the new code I am working on - the scheduler requests new executors along with the task-level blacklisted nodes: "host2", "host3" (org.apache.spark.deploy.yarn.YarnAllocator#requestTotalExecutorsWithPreferredLocalities) So I have to choose 2 nodes to communicate towards YARN. My idea is to pass expiryTime2 and expiryTime3 to the YarnAllocator to choose the 2 most relevant nodes (the ones which expire later are more relevant). For this, in the case class RequestExecutors, the nodeBlacklist field type is changed to Map[String, Long] from Set[String]. > Blacklist a node if executors won't launch on it. > - > > Key: SPARK-16630 > URL: https://issues.apache.org/jira/browse/SPARK-16630 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 1.6.2 >Reporter: Thomas Graves >Priority: Major > > On YARN, it's possible that a node is messed up or misconfigured such that a > container won't launch on it. For instance, the Spark external shuffle > handler didn't get loaded on it, or maybe it's just some other hardware issue or > Hadoop configuration issue. > It would be nice if we could recognize this happening and stop trying to launch > executors on it, since that could end up causing us to hit our max number of > executor failures and then kill the job.
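The selection described in the example above can be sketched as follows (illustrative only; `selectNodesToBlacklist` is a hypothetical helper, not the actual YarnAllocator code):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Given node -> blacklist expiry time for all candidates (allocator-level and
// task-level), keep only the `limit` nodes whose blacklisting expires latest,
// i.e. the most recently blacklisted and therefore most relevant ones.
public class BlacklistSelection {
    static Set<String> selectNodesToBlacklist(Map<String, Long> nodeToExpiry, int limit) {
        return nodeToExpiry.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(limit)
                .map(Map.Entry::getKey)
                .collect(Collectors.toSet());
    }

    public static void main(String[] args) {
        Map<String, Long> candidates = new HashMap<>();
        candidates.put("host1", 1_000L); // allocator-level, expires earliest
        candidates.put("host2", 3_000L); // task-level
        candidates.put("host3", 2_000L); // task-level
        // With a limit of 2, host2 and host3 are chosen: they expire later.
        System.out.println(selectNodesToBlacklist(candidates, 2));
    }
}
```

This is why the expiry times (not just the node names) need to reach the YarnAllocator: the sort key is the expiry time.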
[jira] [Commented] (SPARK-16630) Blacklist a node if executors won't launch on it.
[ https://issues.apache.org/jira/browse/SPARK-16630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432415#comment-16432415 ] Attila Zsolt Piros commented on SPARK-16630: I would need the expiry times to choose the most relevant (freshest) subset of nodes to blacklist when the limit is less than the size of the union of all blacklist-able nodes. So it is only used for sorting. > Blacklist a node if executors won't launch on it. > - > > Key: SPARK-16630 > URL: https://issues.apache.org/jira/browse/SPARK-16630 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 1.6.2 >Reporter: Thomas Graves >Priority: Major
[jira] [Comment Edited] (SPARK-16630) Blacklist a node if executors won't launch on it.
[ https://issues.apache.org/jira/browse/SPARK-16630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432165#comment-16432165 ] Attila Zsolt Piros edited comment on SPARK-16630 at 4/10/18 12:15 PM: -- I have a question regarding limiting the number of blacklisted nodes according to the cluster size. With this change there will be two sources of nodes to be blacklisted: - one list comes from the scheduler (existing node-level blacklisting) - the other is computed here close to the YARN allocator (stored along with the expiry times) I think it makes sense to have the limit for the complete list (union) of blacklisted nodes, am I right? If this limit is for the complete list then, regarding the subset, I think the newly blacklisted nodes are more up-to-date than the earlier blacklisted ones. So I would pass the expiry times from the scheduler to the YARN allocator to select the subset of blacklisted nodes to communicate to YARN. What is your opinion? > Blacklist a node if executors won't launch on it. > - > > Key: SPARK-16630 > URL: https://issues.apache.org/jira/browse/SPARK-16630 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 1.6.2 >Reporter: Thomas Graves >Priority: Major
[jira] [Commented] (SPARK-16630) Blacklist a node if executors won't launch on it.
[ https://issues.apache.org/jira/browse/SPARK-16630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432165#comment-16432165 ] Attila Zsolt Piros commented on SPARK-16630: I have a question regarding limiting the number of blacklisted nodes according to the cluster size. With this change there will be two sources of nodes to be blacklisted: - one list comes from the scheduler (existing node-level blacklisting) - the other is computed here close to the YARN allocator (stored along with the expiry times) I think it makes sense to have the limit for the complete list (union) of blacklisted nodes, am I right? If this limit is for the complete list then, regarding the subset, I think the newly blacklisted nodes are more up-to-date than the earlier blacklisted ones. So I would pass the expiry times from the scheduler to the YARN allocator to select the subset of blacklisted nodes to communicate to YARN. What is your opinion? > Blacklist a node if executors won't launch on it. > - > > Key: SPARK-16630 > URL: https://issues.apache.org/jira/browse/SPARK-16630 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 1.6.2 >Reporter: Thomas Graves >Priority: Major
[jira] [Commented] (SPARK-16630) Blacklist a node if executors won't launch on it.
[ https://issues.apache.org/jira/browse/SPARK-16630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16428721#comment-16428721 ] Attila Zsolt Piros commented on SPARK-16630: Yes, you are right, more executors can be allocated on the same host. I have checked, and using getNumClusterNodes is not that hard, so I will do that. > Blacklist a node if executors won't launch on it. > - > > Key: SPARK-16630 > URL: https://issues.apache.org/jira/browse/SPARK-16630 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 1.6.2 >Reporter: Thomas Graves >Priority: Major
[jira] [Commented] (SPARK-16630) Blacklist a node if executors won't launch on it.
[ https://issues.apache.org/jira/browse/SPARK-16630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16427181#comment-16427181 ] Attila Zsolt Piros commented on SPARK-16630: [~tgraves] what about stopping YARN blacklisting when a configured limit is reached, with the default of "spark.executor.instances" * "spark.yarn.blacklisting.default.executor.instances.size.weight" (a better name for the weight is welcome), counting all the blacklisted nodes (even stage- and task-level blacklisted ones), and in case of dynamic allocation a default of Int.MaxValue so there is no limit at all? This idea comes from the calculation of the default for maxNumExecutorFailures. > Blacklist a node if executors won't launch on it. > - > > Key: SPARK-16630 > URL: https://issues.apache.org/jira/browse/SPARK-16630 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 1.6.2 >Reporter: Thomas Graves >Priority: Major
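The proposed limit calculation can be sketched like this (the config names in the comment above are a proposal, so the parameters here are assumptions, not existing Spark settings):

```java
// Sketch of the proposed blacklist-size limit: with static allocation the
// limit is spark.executor.instances times a configurable weight; with dynamic
// allocation the default is Int.MaxValue, i.e. effectively no limit.
public class BlacklistLimit {
    static int maxBlacklistedNodes(boolean dynamicAllocation, int executorInstances, double weight) {
        if (dynamicAllocation) {
            return Integer.MAX_VALUE; // no limit under dynamic allocation
        }
        return (int) (executorInstances * weight);
    }

    public static void main(String[] args) {
        // e.g. spark.executor.instances=10 with a weight of 0.5 caps blacklisting at 5 nodes
        System.out.println(maxBlacklistedNodes(false, 10, 0.5)); // 5
        System.out.println(maxBlacklistedNodes(true, 10, 0.5));  // 2147483647
    }
}
```

The shape mirrors how maxNumExecutorFailures derives its default from the executor count.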
[jira] [Commented] (SPARK-16630) Blacklist a node if executors won't launch on it.
[ https://issues.apache.org/jira/browse/SPARK-16630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16427123#comment-16427123 ] Attila Zsolt Piros commented on SPARK-16630: [~irashid] I would reuse spark.blacklist.application.maxFailedExecutorsPerNode, which already has a default of 2. I think it makes sense to use the same limit for per-node failures before adding the node to the blacklist. But in this case I cannot consider spark.yarn.max.executor.failures in the default. > Blacklist a node if executors won't launch on it. > - > > Key: SPARK-16630 > URL: https://issues.apache.org/jira/browse/SPARK-16630 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 1.6.2 >Reporter: Thomas Graves >Priority: Major
[jira] [Updated] (SPARK-23728) ML test with expected exceptions testing streaming fails on 2.3
[ https://issues.apache.org/jira/browse/SPARK-23728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Attila Zsolt Piros updated SPARK-23728: --- Summary: ML test with expected exceptions testing streaming fails on 2.3 (was: ML test with expected exceptions via streaming fails on 2.3)
> ML test with expected exceptions testing streaming fails on 2.3
> ---------------------------------------------------------------
>
> Key: SPARK-23728
> URL: https://issues.apache.org/jira/browse/SPARK-23728
> Project: Spark
> Issue Type: Bug
> Components: ML, Tests
> Affects Versions: 2.3.0
> Reporter: Attila Zsolt Piros
> Priority: Major
>
> testTransformerByInterceptingException fails to catch the expected message because on 2.3, when an exception happens within an ML feature during streaming, the feature-generated message is not on the direct cause of the exception but one level deeper in the cause chain.
[jira] [Created] (SPARK-23728) ML test with expected exceptions via streaming fails on 2.3
Attila Zsolt Piros created SPARK-23728: -- Summary: ML test with expected exceptions via streaming fails on 2.3 Key: SPARK-23728 URL: https://issues.apache.org/jira/browse/SPARK-23728 Project: Spark Issue Type: Bug Components: ML, Tests Affects Versions: 2.3.0 Reporter: Attila Zsolt Piros
testTransformerByInterceptingException fails to catch the expected message because on 2.3, when an exception happens within an ML feature during streaming, the feature-generated message is not on the direct cause of the exception but one level deeper in the cause chain.
[jira] [Commented] (SPARK-23728) ML test with expected exceptions via streaming fails on 2.3
[ https://issues.apache.org/jira/browse/SPARK-23728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16403850#comment-16403850 ] Attila Zsolt Piros commented on SPARK-23728: I will soon create a PR.
[jira] [Commented] (SPARK-16630) Blacklist a node if executors won't launch on it.
[ https://issues.apache.org/jira/browse/SPARK-16630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16391649#comment-16391649 ] Attila Zsolt Piros commented on SPARK-16630: I am working on this issue.
[jira] [Commented] (SPARK-16630) Blacklist a node if executors won't launch on it.
[ https://issues.apache.org/jira/browse/SPARK-16630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16387071#comment-16387071 ] Attila Zsolt Piros commented on SPARK-16630: Of course we can consider "spark.yarn.executor.failuresValidityInterval" for "yarn-level" blacklisting too.
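Applying the validity interval to allocator-level failure counting might look roughly like this, so that only recent failures count toward the limit. This is an assumption-laden illustration, not the actual YarnAllocator code:

```scala
import scala.collection.mutable

// Value that would come from spark.yarn.executor.failuresValidityInterval;
// one hour here is just an example.
val failuresValidityIntervalMs = 60 * 60 * 1000L

// Illustrative per-host record of failure timestamps.
val failureTimesPerHost = mutable.Map.empty[String, mutable.Queue[Long]]

// Records a failure at nowMs and returns the count still inside the interval.
def recordFailure(host: String, nowMs: Long): Int = {
  val times = failureTimesPerHost.getOrElseUpdate(host, mutable.Queue.empty[Long])
  times.enqueue(nowMs)
  // Drop failures that fell out of the validity interval.
  while (times.nonEmpty && times.head < nowMs - failuresValidityIntervalMs) {
    times.dequeue()
  }
  times.size
}
```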
[jira] [Commented] (SPARK-16630) Blacklist a node if executors won't launch on it.
[ https://issues.apache.org/jira/browse/SPARK-16630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386789#comment-16386789 ] Attila Zsolt Piros commented on SPARK-16630: I have checked the existing sources and I would like to open a discussion about a possible solution. As far as I can see, YarnAllocator#processCompletedContainers could be extended to track the number of failures by host. YarnAllocator is also responsible for updating YARN with the task-level blacklisted nodes (by calling AMRMClient#updateBlacklist). So a relatively easy solution would be to keep a separate counter here (independent of task-level failures) with its own configured limit, and to update YARN with the union of the task-level blacklisted nodes and the "allocator-level" blacklisted nodes. What is your opinion?
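The proposed union of the task-level and allocator-level blacklists can be sketched as follows; the names and the limit value are illustrative, not Spark internals:

```scala
// Hypothetical configured limit for the separate allocator-level counter.
val allocatorLevelLimit = 2

// Computes the set of nodes to report to YARN via AMRMClient#updateBlacklist:
// the task-level blacklist unioned with hosts whose launch-failure count
// reached the allocator-level limit.
def nodesToBlacklist(
    taskLevelBlacklist: Set[String],
    launchFailuresPerHost: Map[String, Int]): Set[String] = {
  val allocatorLevel = launchFailuresPerHost
    .filter { case (_, failures) => failures >= allocatorLevelLimit }
    .keySet
  taskLevelBlacklist union allocatorLevel
}
```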
[jira] [Commented] (SPARK-22915) ML test for StructuredStreaming: spark.ml.feature, N-Z
[ https://issues.apache.org/jira/browse/SPARK-22915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16379056#comment-16379056 ] Attila Zsolt Piros commented on SPARK-22915: Thank you, as I have already put a lot of work into it. I could open a quick PR right now, but I would prefer to do one more self-review and run the tests beforehand. I promise to finish it within the next two days (as I am currently traveling).
> ML test for StructuredStreaming: spark.ml.feature, N-Z
> ------------------------------------------------------
>
> Key: SPARK-22915
> URL: https://issues.apache.org/jira/browse/SPARK-22915
> Project: Spark
> Issue Type: Test
> Components: ML, Tests
> Affects Versions: 2.3.0
> Reporter: Joseph K. Bradley
> Priority: Major
>
> *For featurizers with names from N - Z*
> Task for adding Structured Streaming tests for all Models/Transformers in a sub-module in spark.ml
> For an example, see LinearRegressionSuite.scala in https://github.com/apache/spark/pull/19843