[jira] [Commented] (SPARK-34167) Reading parquet with Decimal(8,2) written as a Decimal64 blows up

2021-01-21 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17269182#comment-17269182
 ] 

Attila Zsolt Piros commented on SPARK-34167:


[~razajafri] could you share with us how the parquet files are created?

I tried to reproduce this issue in the following way but I had no luck:

{noformat}
Spark context Web UI available at http://192.168.0.17:4045
Spark context available as 'sc' (master = local, app id = local-1611221568779).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.1
      /_/

Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_242)
Type in expressions to have them evaluated.
Type :help for more information.

scala> import java.math.BigDecimal
import java.math.BigDecimal

scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row

scala> import org.apache.spark.sql.types.{DecimalType, StructField, StructType}
import org.apache.spark.sql.types.{DecimalType, StructField, StructType}

scala> val schema = StructType(Array(StructField("num", DecimalType(8,2),true)))
schema: org.apache.spark.sql.types.StructType = 
StructType(StructField(num,DecimalType(8,2),true))

scala> val rdd = sc.parallelize((0 to 9).map(v => new 
BigDecimal(s"123456.7$v")))
rdd: org.apache.spark.rdd.RDD[java.math.BigDecimal] = ParallelCollectionRDD[0] 
at parallelize at :27

scala> val df = spark.createDataFrame(rdd.map(Row(_)), schema)
df: org.apache.spark.sql.DataFrame = [num: decimal(8,2)]

scala> df.show()
+---------+
|      num|
+---------+
|123456.70|
|123456.71|
|123456.72|
|123456.73|
|123456.74|
|123456.75|
|123456.76|
|123456.77|
|123456.78|
|123456.79|
+---------+


scala> df.write.parquet("num.parquet")

scala> spark.read.parquet("num.parquet").show()
+---------+
|      num|
+---------+
|123456.70|
|123456.71|
|123456.72|
|123456.73|
|123456.74|
|123456.75|
|123456.76|
|123456.77|
|123456.78|
|123456.79|
+---------+

{noformat}
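For completeness: since the NPE reported below is thrown inside the vectorized Parquet reader, one diagnostic step would be to retry the read with the vectorized reader disabled. This is only a sketch of an idea on my side, not a confirmed workaround (the config spark.sql.parquet.enableVectorizedReader itself does exist):

{code:scala}
// Sketch of a diagnostic step, not a confirmed workaround: fall back to the
// non-vectorized parquet-mr read path and check whether the NPE still occurs.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
spark.read.parquet("/tmp/pyspark_tests/936454/PARQUET_DATA").show()
{code}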



> Reading parquet with Decimal(8,2) written as a Decimal64 blows up
> -
>
> Key: SPARK-34167
> URL: https://issues.apache.org/jira/browse/SPARK-34167
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 3.0.1
>Reporter: Raza Jafri
>Priority: Major
> Attachments: 
> part-0-7fecd321-b247-4f7e-bff5-c2e4d8facaa0-c000.snappy.parquet, 
> part-0-940f44f1-f323-4a5e-b828-1e65d87895aa-c000.snappy.parquet
>
>
> When reading a parquet file written with Decimals with precision < 10 as a 
> 64-bit representation, Spark tries to read it as an INT and fails
>  
> Steps to reproduce:
> Read the attached file that has a single Decimal(8,2) column with 10 values
> {code:java}
> scala> spark.read.parquet("/tmp/pyspark_tests/936454/PARQUET_DATA").show
> ...
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.putLong(OnHeapColumnVector.java:327)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readLongs(VectorizedRleValuesReader.java:370)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readLongBatch(VectorizedColumnReader.java:514)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:256)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:273)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:171)
>   at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:497)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:756)
>   at 
> 

[jira] [Commented] (SPARK-33711) Race condition in Spark k8s Pod lifecycle manager that leads to shutdowns

2021-01-15 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17265863#comment-17265863
 ] 

Attila Zsolt Piros commented on SPARK-33711:


Sure. Working on that.

>  Race condition in Spark k8s Pod lifecycle manager that leads to shutdowns
> --
>
> Key: SPARK-33711
> URL: https://issues.apache.org/jira/browse/SPARK-33711
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.4, 2.4.7, 3.0.0, 3.1.0, 3.2.0
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Major
> Fix For: 3.2.0, 3.1.1
>
>
> Watching a POD (ExecutorPodsWatchSnapshotSource) informs about single POD 
> changes which could wrongfully lead to detecting of missing PODs (PODs known 
> by scheduler backend but missing from POD snapshots) by the executor POD 
> lifecycle manager.
> A key indicator of this is seeing this log msg:
> "The executor with ID [some_id] was not found in the cluster but we didn't 
> get a reason why. Marking the executor as failed. The executor may have been 
> deleted but the driver missed the deletion event."
> So one of the problem is running the missing POD detection even when a single 
> pod is changed without having a full consistent snapshot about all the PODs 
> (see ExecutorPodsPollingSnapshotSource). The other could be a race between 
> the executor POD lifecycle manager and the scheduler backend.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26836) Columns get switched in Spark SQL using Avro backed Hive table if schema evolves

2021-01-08 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17261398#comment-17261398
 ] 

Attila Zsolt Piros edited comment on SPARK-26836 at 1/8/21, 4:15 PM:
-

I am working on this and a PR can be expected this weekend / next week


was (Author: attilapiros):
I am working on a this 

> Columns get switched in Spark SQL using Avro backed Hive table if schema 
> evolves
> 
>
> Key: SPARK-26836
> URL: https://issues.apache.org/jira/browse/SPARK-26836
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1, 2.4.0
> Environment: I tested with Hive and HCatalog which runs on version 
> 2.3.4 and with Spark 2.3.1 and 2.4
>Reporter: Tamás Németh
>Priority: Critical
>  Labels: correctness
> Attachments: doctors.avro, doctors_evolved.avro, 
> doctors_evolved.json, original.avsc
>
>
> I have a hive avro table where the avro schema is stored on s3 next to the 
> avro files. 
> In the table definiton the avro.schema.url always points to the latest 
> partition's _schema.avsc file which is always the lates schema. (Avro schemas 
> are backward and forward compatible in a table)
> When new data comes in, I always add a new partition where the 
> avro.schema.url properties also set to the _schema.avsc which was used when 
> it was added and of course I always update the table avro.schema.url property 
> to the latest one.
> Querying this table works fine until the schema evolves in a way that a new 
> optional property is added in the middle. 
> When this happens then after the spark sql query the columns in the old 
> partition gets mixed up and it shows the wrong data for the columns.
> If I query the table with Hive then everything is perfectly fine and it gives 
> me back the correct columns for the partitions which were created the old 
> schema and for the new which was created the evolved schema.
>  
> Here is how I could reproduce with the 
> [doctors.avro|https://github.com/apache/spark/blob/master/sql/hive/src/test/resources/data/files/doctors.avro]
>  example data in sql test suite.
>  # I have created two partition folder:
> {code:java}
> [hadoop@ip-192-168-10-158 hadoop]$ hdfs dfs -ls s3://somelocation/doctors/*/
> Found 2 items
> -rw-rw-rw- 1 hadoop hadoop 418 2019-02-06 12:48 s3://somelocation/doctors
> /dt=2019-02-05/_schema.avsc
> -rw-rw-rw- 1 hadoop hadoop 521 2019-02-06 12:13 s3://somelocation/doctors
> /dt=2019-02-05/doctors.avro
> Found 2 items
> -rw-rw-rw- 1 hadoop hadoop 580 2019-02-06 12:49 s3://somelocation/doctors
> /dt=2019-02-06/_schema.avsc
> -rw-rw-rw- 1 hadoop hadoop 577 2019-02-06 12:13 s3://somelocation/doctors
> /dt=2019-02-06/doctors_evolved.avro{code}
> Here the first partition had data which was created with the schema before 
> evolving and the second one had the evolved one. (the evolved schema is the 
> same as in your testcase except I moved the extra_field column to the last 
> from the second and I generated two lines of avro data with the evolved 
> schema.
>  # I have created a hive table with the following command:
>  
> {code:java}
> CREATE EXTERNAL TABLE `default.doctors`
>  PARTITIONED BY (
>  `dt` string
>  )
>  ROW FORMAT SERDE
>  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
>  WITH SERDEPROPERTIES (
>  'avro.schema.url'='s3://somelocation/doctors/
> /dt=2019-02-06/_schema.avsc')
>  STORED AS INPUTFORMAT
>  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
>  OUTPUTFORMAT
>  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
>  LOCATION
>  's3://somelocation/doctors/'
>  TBLPROPERTIES (
>  'transient_lastDdlTime'='1538130975'){code}
>  
> Here as you can see the table schema url points to the latest schema
> 3. I ran an msck _repair table_ to pick up all the partitions.
> Fyi: If I run my select * query from here then everything is fine and no 
> columns switch happening.
> 4. Then I changed the first partition's avro.schema.url url to points to the 
> schema which is under the partition folder (non-evolved one -> 
> s3://somelocation/doctors/
> /dt=2019-02-05/_schema.avsc)
> Then if you ran a _select * from default.spark_test_ then the columns will be 
> mixed up (on the data below the first name column becomes the extra_field 
> column. I guess because in the latest schema it is the second column):
>  
> {code:java}
> number,extra_field,first_name,last_name,dt 
> 6,Colin,Baker,null,2019-02-05 
> 3,Jon,Pertwee,null,2019-02-05 
> 4,Tom,Baker,null,2019-02-05 
> 5,Peter,Davison,null,2019-02-05 
> 11,Matt,Smith,null,2019-02-05 
> 1,William,Hartnell,null,2019-02-05 
> 7,Sylvester,McCoy,null,2019-02-05 
> 8,Paul,McGann,null,2019-02-05 
> 2,Patrick,Troughton,null,2019-02-05 
> 

[jira] [Commented] (SPARK-26836) Columns get switched in Spark SQL using Avro backed Hive table if schema evolves

2021-01-08 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17261398#comment-17261398
 ] 

Attila Zsolt Piros commented on SPARK-26836:


I am working on this.

> Columns get switched in Spark SQL using Avro backed Hive table if schema 
> evolves
> 
>
> Key: SPARK-26836
> URL: https://issues.apache.org/jira/browse/SPARK-26836
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1, 2.4.0
> Environment: I tested with Hive and HCatalog which runs on version 
> 2.3.4 and with Spark 2.3.1 and 2.4
>Reporter: Tamás Németh
>Priority: Critical
>  Labels: correctness
> Attachments: doctors.avro, doctors_evolved.avro, 
> doctors_evolved.json, original.avsc
>
>
> I have a hive avro table where the avro schema is stored on s3 next to the 
> avro files. 
> In the table definiton the avro.schema.url always points to the latest 
> partition's _schema.avsc file which is always the lates schema. (Avro schemas 
> are backward and forward compatible in a table)
> When new data comes in, I always add a new partition where the 
> avro.schema.url properties also set to the _schema.avsc which was used when 
> it was added and of course I always update the table avro.schema.url property 
> to the latest one.
> Querying this table works fine until the schema evolves in a way that a new 
> optional property is added in the middle. 
> When this happens then after the spark sql query the columns in the old 
> partition gets mixed up and it shows the wrong data for the columns.
> If I query the table with Hive then everything is perfectly fine and it gives 
> me back the correct columns for the partitions which were created the old 
> schema and for the new which was created the evolved schema.
>  
> Here is how I could reproduce with the 
> [doctors.avro|https://github.com/apache/spark/blob/master/sql/hive/src/test/resources/data/files/doctors.avro]
>  example data in sql test suite.
>  # I have created two partition folder:
> {code:java}
> [hadoop@ip-192-168-10-158 hadoop]$ hdfs dfs -ls s3://somelocation/doctors/*/
> Found 2 items
> -rw-rw-rw- 1 hadoop hadoop 418 2019-02-06 12:48 s3://somelocation/doctors
> /dt=2019-02-05/_schema.avsc
> -rw-rw-rw- 1 hadoop hadoop 521 2019-02-06 12:13 s3://somelocation/doctors
> /dt=2019-02-05/doctors.avro
> Found 2 items
> -rw-rw-rw- 1 hadoop hadoop 580 2019-02-06 12:49 s3://somelocation/doctors
> /dt=2019-02-06/_schema.avsc
> -rw-rw-rw- 1 hadoop hadoop 577 2019-02-06 12:13 s3://somelocation/doctors
> /dt=2019-02-06/doctors_evolved.avro{code}
> Here the first partition had data which was created with the schema before 
> evolving and the second one had the evolved one. (the evolved schema is the 
> same as in your testcase except I moved the extra_field column to the last 
> from the second and I generated two lines of avro data with the evolved 
> schema.
>  # I have created a hive table with the following command:
>  
> {code:java}
> CREATE EXTERNAL TABLE `default.doctors`
>  PARTITIONED BY (
>  `dt` string
>  )
>  ROW FORMAT SERDE
>  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
>  WITH SERDEPROPERTIES (
>  'avro.schema.url'='s3://somelocation/doctors/
> /dt=2019-02-06/_schema.avsc')
>  STORED AS INPUTFORMAT
>  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
>  OUTPUTFORMAT
>  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
>  LOCATION
>  's3://somelocation/doctors/'
>  TBLPROPERTIES (
>  'transient_lastDdlTime'='1538130975'){code}
>  
> Here as you can see the table schema url points to the latest schema
> 3. I ran an msck _repair table_ to pick up all the partitions.
> Fyi: If I run my select * query from here then everything is fine and no 
> columns switch happening.
> 4. Then I changed the first partition's avro.schema.url url to points to the 
> schema which is under the partition folder (non-evolved one -> 
> s3://somelocation/doctors/
> /dt=2019-02-05/_schema.avsc)
> Then if you ran a _select * from default.spark_test_ then the columns will be 
> mixed up (on the data below the first name column becomes the extra_field 
> column. I guess because in the latest schema it is the second column):
>  
> {code:java}
> number,extra_field,first_name,last_name,dt 
> 6,Colin,Baker,null,2019-02-05 
> 3,Jon,Pertwee,null,2019-02-05 
> 4,Tom,Baker,null,2019-02-05 
> 5,Peter,Davison,null,2019-02-05 
> 11,Matt,Smith,null,2019-02-05 
> 1,William,Hartnell,null,2019-02-05 
> 7,Sylvester,McCoy,null,2019-02-05 
> 8,Paul,McGann,null,2019-02-05 
> 2,Patrick,Troughton,null,2019-02-05 
> 9,Christopher,Eccleston,null,2019-02-05 
> 10,David,Tennant,null,2019-02-05 
> 21,fishfinger,Jim,Baker,2019-02-06 
> 24,fishfinger,Bean,Pertwee,2019-02-06
> {code}
> If I 

[jira] [Commented] (SPARK-32617) Upgrade kubernetes client version to support latest minikube version.

2020-12-11 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17248168#comment-17248168
 ] 

Attila Zsolt Piros commented on SPARK-32617:


I will have a PR for this soon.

> Upgrade kubernetes client version to support latest minikube version.
> -
>
> Key: SPARK-32617
> URL: https://issues.apache.org/jira/browse/SPARK-32617
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Prashant Sharma
>Priority: Major
>
> Following error comes, when the k8s integration tests are run against the 
> minikube cluster with version 1.2.1
> {code:java}
> Run starting. Expected test count is: 18
> KubernetesSuite:
> org.apache.spark.deploy.k8s.integrationtest.KubernetesSuite *** ABORTED ***
>   io.fabric8.kubernetes.client.KubernetesClientException: An error has 
> occurred.
>   at 
> io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64)
>   at 
> io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:53)
>   at 
> io.fabric8.kubernetes.client.utils.HttpClientUtils.createHttpClient(HttpClientUtils.java:196)
>   at 
> io.fabric8.kubernetes.client.utils.HttpClientUtils.createHttpClient(HttpClientUtils.java:62)
>   at io.fabric8.kubernetes.client.BaseClient.(BaseClient.java:51)
>   at 
> io.fabric8.kubernetes.client.DefaultKubernetesClient.(DefaultKubernetesClient.java:105)
>   at 
> org.apache.spark.deploy.k8s.integrationtest.backend.minikube.Minikube$.getKubernetesClient(Minikube.scala:81)
>   at 
> org.apache.spark.deploy.k8s.integrationtest.backend.minikube.MinikubeTestBackend$.initialize(MinikubeTestBackend.scala:33)
>   at 
> org.apache.spark.deploy.k8s.integrationtest.KubernetesSuite.beforeAll(KubernetesSuite.scala:131)
>   at 
> org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:212)
>   ...
>   Cause: java.nio.file.NoSuchFileException: /root/.minikube/apiserver.crt
>   at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
>   at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
>   at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
>   at 
> sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214)
>   at java.nio.file.Files.newByteChannel(Files.java:361)
>   at java.nio.file.Files.newByteChannel(Files.java:407)
>   at java.nio.file.Files.readAllBytes(Files.java:3152)
>   at 
> io.fabric8.kubernetes.client.internal.CertUtils.getInputStreamFromDataOrFile(CertUtils.java:72)
>   at 
> io.fabric8.kubernetes.client.internal.CertUtils.createKeyStore(CertUtils.java:242)
>   at 
> io.fabric8.kubernetes.client.internal.SSLUtils.keyManagers(SSLUtils.java:128)
>   ...
> Run completed in 1 second, 821 milliseconds.
> Total number of tests run: 0
> Suites: completed 1, aborted 1
> Tests: succeeded 0, failed 0, canceled 0, ignored 0, pending 0
> *** 1 SUITE ABORTED ***
> [INFO] 
> 
> [INFO] Reactor Summary for Spark Project Parent POM 3.1.0-SNAPSHOT:
> [INFO] 
> [INFO] Spark Project Parent POM ... SUCCESS [  4.454 
> s]
> [INFO] Spark Project Tags . SUCCESS [  4.768 
> s]
> [INFO] Spark Project Local DB . SUCCESS [  2.961 
> s]
> [INFO] Spark Project Networking ... SUCCESS [  4.258 
> s]
> [INFO] Spark Project Shuffle Streaming Service  SUCCESS [  5.703 
> s]
> [INFO] Spark Project Unsafe ... SUCCESS [  3.239 
> s]
> [INFO] Spark Project Launcher . SUCCESS [  3.224 
> s]
> [INFO] Spark Project Core . SUCCESS [02:25 
> min]
> [INFO] Spark Project Kubernetes Integration Tests . FAILURE [ 17.244 
> s]
> [INFO] 
> 
> [INFO] BUILD FAILURE
> [INFO] 
> 
> [INFO] Total time:  03:12 min
> [INFO] Finished at: 2020-08-11T06:26:15-05:00
> [INFO] 
> 
> [ERROR] Failed to execute goal 
> org.scalatest:scalatest-maven-plugin:2.0.0:test (integration-test) on project 
> spark-kubernetes-integration-tests_2.12: There are test failures -> [Help 1]
> [ERROR] 
> [ERROR] To see the full stack trace of the errors, re-run Maven with the -e 
> switch.
> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
> [ERROR] 
> [ERROR] For more information about the errors and possible solutions, please 
> read the following articles:
> [ERROR] 

[jira] [Comment Edited] (SPARK-33711) Race condition in Spark k8s Pod lifecycle manager that leads to shutdowns

2020-12-08 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17246092#comment-17246092
 ] 

Attila Zsolt Piros edited comment on SPARK-33711 at 12/8/20, 7:21 PM:
--

Yes, it does. I have updated the affected versions accordingly. Thanks 
[~dongjoon].


was (Author: attilapiros):
Yes, it does. I have updated the affected versions accordingly.

>  Race condition in Spark k8s Pod lifecycle manager that leads to shutdowns
> --
>
> Key: SPARK-33711
> URL: https://issues.apache.org/jira/browse/SPARK-33711
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 2.3.4, 2.4.7, 3.0.0, 3.1.0, 3.2.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> Watching a POD (ExecutorPodsWatchSnapshotSource) informs about single POD 
> changes which could wrongfully lead to detecting of missing PODs (PODs known 
> by scheduler backend but missing from POD snapshots) by the executor POD 
> lifecycle manager.
> A key indicator of this is seeing this log msg:
> "The executor with ID [some_id] was not found in the cluster but we didn't 
> get a reason why. Marking the executor as failed. The executor may have been 
> deleted but the driver missed the deletion event."
> So one of the problem is running the missing POD detection even when a single 
> pod is changed without having a full consistent snapshot about all the PODs 
> (see ExecutorPodsPollingSnapshotSource). The other could be a race between 
> the executor POD lifecycle manager and the scheduler backend.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33711) Race condition in Spark k8s Pod lifecycle manager that leads to shutdowns

2020-12-08 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17246092#comment-17246092
 ] 

Attila Zsolt Piros commented on SPARK-33711:


Yes, it does. I have updated the affected versions accordingly.

>  Race condition in Spark k8s Pod lifecycle manager that leads to shutdowns
> --
>
> Key: SPARK-33711
> URL: https://issues.apache.org/jira/browse/SPARK-33711
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 2.3.4, 2.4.7, 3.0.0, 3.1.0, 3.2.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> Watching a POD (ExecutorPodsWatchSnapshotSource) informs about single POD 
> changes which could wrongfully lead to detecting of missing PODs (PODs known 
> by scheduler backend but missing from POD snapshots) by the executor POD 
> lifecycle manager.
> A key indicator of this is seeing this log msg:
> "The executor with ID [some_id] was not found in the cluster but we didn't 
> get a reason why. Marking the executor as failed. The executor may have been 
> deleted but the driver missed the deletion event."
> So one of the problem is running the missing POD detection even when a single 
> pod is changed without having a full consistent snapshot about all the PODs 
> (see ExecutorPodsPollingSnapshotSource). The other could be a race between 
> the executor POD lifecycle manager and the scheduler backend.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33711) Race condition in Spark k8s Pod lifecycle manager that leads to shutdowns

2020-12-08 Thread Attila Zsolt Piros (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros updated SPARK-33711:
---
Affects Version/s: 2.3.4

>  Race condition in Spark k8s Pod lifecycle manager that leads to shutdowns
> --
>
> Key: SPARK-33711
> URL: https://issues.apache.org/jira/browse/SPARK-33711
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 2.3.4, 2.4.7, 3.0.0, 3.1.0, 3.2.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> Watching a POD (ExecutorPodsWatchSnapshotSource) informs about single POD 
> changes which could wrongfully lead to detecting of missing PODs (PODs known 
> by scheduler backend but missing from POD snapshots) by the executor POD 
> lifecycle manager.
> A key indicator of this is seeing this log msg:
> "The executor with ID [some_id] was not found in the cluster but we didn't 
> get a reason why. Marking the executor as failed. The executor may have been 
> deleted but the driver missed the deletion event."
> So one of the problem is running the missing POD detection even when a single 
> pod is changed without having a full consistent snapshot about all the PODs 
> (see ExecutorPodsPollingSnapshotSource). The other could be a race between 
> the executor POD lifecycle manager and the scheduler backend.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33711) Race condition in Spark k8s Pod lifecycle manager that leads to shutdowns

2020-12-08 Thread Attila Zsolt Piros (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros updated SPARK-33711:
---
Affects Version/s: (was: 3.0.1)
   3.2.0
   3.1.0

>  Race condition in Spark k8s Pod lifecycle manager that leads to shutdowns
> --
>
> Key: SPARK-33711
> URL: https://issues.apache.org/jira/browse/SPARK-33711
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.7, 3.0.0, 3.1.0, 3.2.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> Watching a POD (ExecutorPodsWatchSnapshotSource) informs about single POD 
> changes which could wrongfully lead to detecting of missing PODs (PODs known 
> by scheduler backend but missing from POD snapshots) by the executor POD 
> lifecycle manager.
> A key indicator of this is seeing this log msg:
> "The executor with ID [some_id] was not found in the cluster but we didn't 
> get a reason why. Marking the executor as failed. The executor may have been 
> deleted but the driver missed the deletion event."
> So one of the problem is running the missing POD detection even when a single 
> pod is changed without having a full consistent snapshot about all the PODs 
> (see ExecutorPodsPollingSnapshotSource). The other could be a race between 
> the executor POD lifecycle manager and the scheduler backend.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33711) Race condition in Spark k8s Pod lifecycle manager that leads to shutdowns

2020-12-08 Thread Attila Zsolt Piros (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros updated SPARK-33711:
---
Summary:  Race condition in Spark k8s Pod lifecycle manager that leads to 
shutdowns  (was:  Race condition in Spark k8s Pod lifecycle management that 
leads to shutdowns)

>  Race condition in Spark k8s Pod lifecycle manager that leads to shutdowns
> --
>
> Key: SPARK-33711
> URL: https://issues.apache.org/jira/browse/SPARK-33711
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.7, 3.0.0, 3.0.1
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> Watching a POD (ExecutorPodsWatchSnapshotSource) informs about single POD 
> changes which could wrongfully lead to detecting of missing PODs (PODs known 
> by scheduler backend but missing from POD snapshots) by the executor POD 
> lifecycle manager.
> A key indicator of this is seeing this log msg:
> "The executor with ID [some_id] was not found in the cluster but we didn't 
> get a reason why. Marking the executor as failed. The executor may have been 
> deleted but the driver missed the deletion event."
> So one of the problem is running the missing POD detection even when a single 
> pod is changed without having a full consistent snapshot about all the PODs 
> (see ExecutorPodsPollingSnapshotSource). The other could be a race between 
> the executor POD lifecycle manager and the scheduler backend.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33711) Race condition in Spark k8s Pod lifecycle management that leads to shutdowns

2020-12-08 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17245993#comment-17245993
 ] 

Attila Zsolt Piros commented on SPARK-33711:


I am working on this and soon I will open a PR.

>  Race condition in Spark k8s Pod lifecycle management that leads to shutdowns
> -
>
> Key: SPARK-33711
> URL: https://issues.apache.org/jira/browse/SPARK-33711
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.7, 3.0.0, 3.0.1
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> Watching a POD (ExecutorPodsWatchSnapshotSource) informs about single POD 
> changes which could wrongfully lead to detecting of missing PODs (PODs known 
> by scheduler backend but missing from POD snapshots) by the executor POD 
> lifecycle manager.
> A key indicator of this is seeing this log msg:
> "The executor with ID [some_id] was not found in the cluster but we didn't 
> get a reason why. Marking the executor as failed. The executor may have been 
> deleted but the driver missed the deletion event."
> So one of the problem is running the missing POD detection even when a single 
> pod is changed without having a full consistent snapshot about all the PODs 
> (see ExecutorPodsPollingSnapshotSource). The other could be a race between 
> the executor POD lifecycle manager and the scheduler backend.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33711) Race condition in Spark k8s Pod lifecycle management that leads to shutdowns

2020-12-08 Thread Attila Zsolt Piros (Jira)
Attila Zsolt Piros created SPARK-33711:
--

 Summary:  Race condition in Spark k8s Pod lifecycle management 
that leads to shutdowns
 Key: SPARK-33711
 URL: https://issues.apache.org/jira/browse/SPARK-33711
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes
Affects Versions: 3.0.1, 3.0.0, 2.4.7
Reporter: Attila Zsolt Piros


Watching a POD (ExecutorPodsWatchSnapshotSource) reports single POD changes, 
which can wrongly lead the executor POD lifecycle manager to detect "missing" 
PODs (PODs known by the scheduler backend but missing from the POD snapshots).

A key indicator of this is seeing this log message:

"The executor with ID [some_id] was not found in the cluster but we didn't get 
a reason why. Marking the executor as failed. The executor may have been 
deleted but the driver missed the deletion event."

So one of the problems is running the missing-POD detection even when only a 
single pod has changed, without a full, consistent snapshot of all the PODs 
(see ExecutorPodsPollingSnapshotSource). The other is a possible race between 
the executor POD lifecycle manager and the scheduler backend.
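A minimal sketch of the first point, using hypothetical types (this only illustrates the intended direction of the fix, not the actual Spark internals): missing-executor detection should run only on snapshots that describe the full cluster state, never on single-pod deltas coming from the watcher.

{code:scala}
// Illustration only: hypothetical types, not Spark's ExecutorPods* classes.
case class PodSnapshot(executorIds: Set[Long], isFullSync: Boolean)

// Report executors known to the scheduler backend but absent from the snapshot,
// but only when the snapshot is a full, consistent view of the cluster.
def missingExecutors(knownToBackend: Set[Long], snapshot: PodSnapshot): Set[Long] =
  if (snapshot.isFullSync) knownToBackend.diff(snapshot.executorIds)
  else Set.empty  // partial (watch) update: skip detection to avoid false positives
{code}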




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-28210) Shuffle Storage API: Reads

2020-09-15 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17196013#comment-17196013
 ] 

Attila Zsolt Piros edited comment on SPARK-28210 at 9/15/20, 9:38 AM:
--

 [~tianczha] [~devaraj] I would like to work on this issue if that's fine for 
you. I intend to progress along the ideas of the linked PR: to pass the 
metadata when the reducer task is constructed. 


was (Author: attilapiros):
 [~tianczha] [~devaraj] I would like to work on this issue if that's fine for 
you. I would like to progress along the ideas of the linked PR: to pass the 
metadata when the reducer task is constructed. 

> Shuffle Storage API: Reads
> --
>
> Key: SPARK-28210
> URL: https://issues.apache.org/jira/browse/SPARK-28210
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Matt Cheah
>Priority: Major
>
> As part of the effort to store shuffle data in arbitrary places, this issue 
> tracks implementing an API for reading the shuffle data stored by the write 
> API. Also ensure that the existing shuffle implementation is refactored to 
> use the API.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28210) Shuffle Storage API: Reads

2020-09-15 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17196013#comment-17196013
 ] 

Attila Zsolt Piros commented on SPARK-28210:


 [~tianczha] [~devaraj] I would like to work on this issue if that's fine for 
you. I would like to progress along the ideas of the linked PR: to pass the 
metadata when the reducer task is constructed. 

> Shuffle Storage API: Reads
> --
>
> Key: SPARK-28210
> URL: https://issues.apache.org/jira/browse/SPARK-28210
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Matt Cheah
>Priority: Major
>
> As part of the effort to store shuffle data in arbitrary places, this issue 
> tracks implementing an API for reading the shuffle data stored by the write 
> API. Also ensure that the existing shuffle implementation is refactored to 
> use the API.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32663) TransportClient getting closed when there are outstanding requests to the server

2020-08-20 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181138#comment-17181138
 ] 

Attila Zsolt Piros commented on SPARK-32663:


You are right. The clients will be closed via 
[TransportClientFactory#close()|https://github.com/attilapiros/spark/blob/e6795cd3416bbe32505efd4c1fa3202f451bf74d/common/network-common/src/main/java/org/apache/spark/network/client/TransportClientFactory.java#L324].
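To illustrate the lifecycle point (a sketch only; it uses the shape of the RpcResponseCallback interface but is not the actual ExternalBlockStoreClient code): a per-request callback should handle its response and leave closing the cached, shared client to the TransportClientFactory.

{code:scala}
import java.nio.ByteBuffer
import org.apache.spark.network.client.{RpcResponseCallback, TransportClient}

// Sketch: the callback belongs to a single request, but the TransportClient is
// cached and shared, so it must not be closed here; other requests may still
// be in flight on the same connection.
def responseCallback(client: TransportClient): RpcResponseCallback =
  new RpcResponseCallback {
    override def onSuccess(response: ByteBuffer): Unit = {
      // consume the response only; do NOT call client.close() here
    }
    override def onFailure(e: Throwable): Unit = {
      // report the error; the factory owns the client's lifecycle
    }
  }
{code}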



> TransportClient getting closed when there are outstanding requests to the 
> server
> 
>
> Key: SPARK-32663
> URL: https://issues.apache.org/jira/browse/SPARK-32663
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.0.0
>Reporter: Chandni Singh
>Priority: Major
>
> The implementation of {{removeBlocks}} and {{getHostLocalDirs}} in 
> {{ExternalBlockStoreClient}} closes the client after processing a response in 
> the callback. 
> This is a cached client which will be re-used for other responses. There 
> could be other outstanding request to the shuffle service, so it should not 
> be closed after processing a response. 
> Seems like this is a bug introduced with SPARK-27651 and SPARK-27677. 
> The older methods  {{registerWithShuffleServer}} and {{fetchBlocks}} didn't 
> close the client.
> cc [~attilapiros] [~vanzin] [~mridulm80]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32293) Inconsistent default unit between Spark memory configs and JVM option

2020-08-03 Thread Attila Zsolt Piros (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros updated SPARK-32293:
---
Target Version/s: 3.1.0

> Inconsistent default unit between Spark memory configs and JVM option
> -
>
> Key: SPARK-32293
> URL: https://issues.apache.org/jira/browse/SPARK-32293
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Spark Core
>Affects Versions: 2.2.1, 2.2.2, 2.2.3, 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4, 
> 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.4.5, 2.4.6, 3.0.0, 3.0.1, 3.1.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> Spark's maximum memory can be configured in several ways:
> - via Spark config
> - command line argument
> - environment variables 
> Both for executors and for the driver the memory can be configured 
> separately. All of these are following the format of JVM memory 
> configurations in a way they are using the very same size unit suffixes ("k", 
> "m", "g" or "t") but there is an inconsistency regarding the default unit. 
> When no suffix is given then the given amount is passed as it is to the JVM 
> (to the -Xmx and -Xms options) where this memory options are using bytes as a 
> default unit, for this please see the example 
> [here|https://docs.oracle.com/javase/8/docs/technotes/tools/windows/java.html]:
> {noformat}
> The following examples show how to set the maximum allowed size of allocated 
> memory to 80 MB using various units:
> -Xmx83886080 
> -Xmx81920k 
> -Xmx80m
> {noformat}
> Although the Spark memory config default suffix unit is "m".



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32293) Inconsistent default unit between Spark memory configs and JVM option

2020-07-13 Thread Attila Zsolt Piros (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros updated SPARK-32293:
---
Summary: Inconsistent default unit between Spark memory configs and JVM 
option  (was: Inconsistent default units for configuring Spark memory)

> Inconsistent default unit between Spark memory configs and JVM option
> -
>
> Key: SPARK-32293
> URL: https://issues.apache.org/jira/browse/SPARK-32293
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Spark Core
>Affects Versions: 2.2.1, 2.2.2, 2.2.3, 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4, 
> 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.4.5, 2.4.6, 3.0.0, 3.0.1, 3.1.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> Spark's maximum memory can be configured in several ways:
> - via Spark config
> - command line argument
> - environment variables 
> Both for executors and for the driver the memory can be configured 
> separately. All of these are following the format of JVM memory 
> configurations in a way they are using the very same size unit suffixes ("k", 
> "m", "g" or "t") but there is an inconsistency regarding the default unit. 
> When no suffix is given then the given amount is passed as it is to the JVM 
> (to the -Xmx and -Xms options) where this memory options are using bytes as a 
> default unit, for this please see the example 
> [here|https://docs.oracle.com/javase/8/docs/technotes/tools/windows/java.html]:
> {noformat}
> The following examples show how to set the maximum allowed size of allocated 
> memory to 80 MB using various units:
> -Xmx83886080 
> -Xmx81920k 
> -Xmx80m
> {noformat}
> Although the Spark memory config default suffix unit is "m".



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32293) Inconsistent default units for configuring Spark memory

2020-07-13 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17156884#comment-17156884
 ] 

Attila Zsolt Piros commented on SPARK-32293:


I am working on this.

> Inconsistent default units for configuring Spark memory
> ---
>
> Key: SPARK-32293
> URL: https://issues.apache.org/jira/browse/SPARK-32293
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Spark Core
>Affects Versions: 2.2.1, 2.2.2, 2.2.3, 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4, 
> 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.4.5, 2.4.6, 3.0.0, 3.0.1, 3.1.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> Spark's maximum memory can be configured in several ways:
> - via Spark config
> - command line argument
> - environment variables 
> Both for executors and for the driver the memory can be configured 
> separately. All of these are following the format of JVM memory 
> configurations in a way they are using the very same size unit suffixes ("k", 
> "m", "g" or "t") but there is an inconsistency regarding the default unit. 
> When no suffix is given then the given amount is passed as it is to the JVM 
> (to the -Xmx and -Xms options) where this memory options are using bytes as a 
> default unit, for this please see the example 
> [here|https://docs.oracle.com/javase/8/docs/technotes/tools/windows/java.html]:
> {noformat}
> The following examples show how to set the maximum allowed size of allocated 
> memory to 80 MB using various units:
> -Xmx83886080 
> -Xmx81920k 
> -Xmx80m
> {noformat}
> Although the Spark memory config default suffix unit is "m".



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32293) Inconsistent default units for configuring Spark memory

2020-07-13 Thread Attila Zsolt Piros (Jira)
Attila Zsolt Piros created SPARK-32293:
--

 Summary: Inconsistent default units for configuring Spark memory
 Key: SPARK-32293
 URL: https://issues.apache.org/jira/browse/SPARK-32293
 Project: Spark
  Issue Type: Bug
  Components: Documentation, Spark Core
Affects Versions: 3.0.0, 2.4.6, 2.4.5, 2.4.4, 2.4.3, 2.4.2, 2.4.1, 2.4.0, 
2.3.4, 2.3.3, 2.3.2, 2.3.1, 2.3.0, 2.2.3, 2.2.2, 2.2.1, 3.0.1, 3.1.0
Reporter: Attila Zsolt Piros


Spark's maximum memory can be configured in several ways:
- via Spark config
- command line argument
- environment variables 

The memory can be configured separately for the executors and for the driver. 
All of these follow the format of the JVM memory options in that they use the 
very same size unit suffixes ("k", "m", "g" or "t"), but there is an 
inconsistency regarding the default unit. When no suffix is given, the amount 
is passed as-is to the JVM (to the -Xmx and -Xms options), where these memory 
options use bytes as the default unit; see the example 
[here|https://docs.oracle.com/javase/8/docs/technotes/tools/windows/java.html]:

{noformat}
The following examples show how to set the maximum allowed size of allocated 
memory to 80 MB using various units:

-Xmx83886080 
-Xmx81920k 
-Xmx80m
{noformat}

The Spark memory configs, however, use "m" as the default suffix unit.
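A quick way to see the mismatch from spark-shell (assuming only that org.apache.spark.network.util.JavaUtils is on the classpath there, which it normally is):

{code:scala}
import org.apache.spark.network.util.JavaUtils

// Spark-side parsing of a suffix-less value: the megabyte-based helper treats
// "80" as 80 mebibytes ...
JavaUtils.byteStringAsMb("80")      // 80
JavaUtils.byteStringAsBytes("80m")  // 83886080

// ... while the very same bare "80" handed to the JVM as -Xmx80 would be
// interpreted as 80 bytes.
{code}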



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32149) Improve file path name normalisation at block resolution within the external shuffle service

2020-07-01 Thread Attila Zsolt Piros (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros updated SPARK-32149:
---
Affects Version/s: (was: 3.0.1)
   3.1.0

> Improve file path name normalisation at block resolution within the external 
> shuffle service
> 
>
> Key: SPARK-32149
> URL: https://issues.apache.org/jira/browse/SPARK-32149
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 3.1.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> In the external shuffle service during the block resolution the file paths 
> (for disk persisted RDD and for shuffle blocks) are normalized by a custom 
> Spark code which uses an OS dependent regexp. This is a redundant code of the 
> package-private JDK counterpart.
> As the code not a perfect match even it could happen one method results in a 
> bit different (but semantically equal) path. 
> The reason of this redundant transformation is the interning of the 
> normalized path to save some heap here which is only possible if both results 
> in the same string.
> Checking the JDK code I believe there is a better solution which is perfect 
> match for the JDK code as it uses that package private method. Moreover based 
> on some benchmarking even this new method seams to be more performant too. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32149) Improve file path name normalisation at block resolution within the external shuffle service

2020-07-01 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149440#comment-17149440
 ] 

Attila Zsolt Piros commented on SPARK-32149:


I am working on this

> Improve file path name normalisation at block resolution within the external 
> shuffle service
> 
>
> Key: SPARK-32149
> URL: https://issues.apache.org/jira/browse/SPARK-32149
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 3.0.1
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> In the external shuffle service during the block resolution the file paths 
> (for disk persisted RDD and for shuffle blocks) are normalized by a custom 
> Spark code which uses an OS dependent regexp. This is a redundant code of the 
> package-private JDK counterpart.
> As the code not a perfect match even it could happen one method results in a 
> bit different (but semantically equal) path. 
> The reason of this redundant transformation is the interning of the 
> normalized path to save some heap here which is only possible if both results 
> in the same string.
> Checking the JDK code I believe there is a better solution which is perfect 
> match for the JDK code as it uses that package private method. Moreover based 
> on some benchmarking even this new method seams to be more performant too. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32149) Improve file path name normalisation at block resolution within the external shuffle service

2020-07-01 Thread Attila Zsolt Piros (Jira)
Attila Zsolt Piros created SPARK-32149:
--

 Summary: Improve file path name normalisation at block resolution 
within the external shuffle service
 Key: SPARK-32149
 URL: https://issues.apache.org/jira/browse/SPARK-32149
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle
Affects Versions: 3.0.1
Reporter: Attila Zsolt Piros


In the external shuffle service, during block resolution the file paths (for 
disk-persisted RDDs and for shuffle blocks) are normalized by custom Spark 
code that uses an OS-dependent regexp. This duplicates the package-private JDK 
counterpart.
As the code is not a perfect match, the two can even produce slightly 
different (but semantically equal) paths.

The reason for this redundant transformation is the interning of the 
normalized path to save some heap, which only works if both produce the same 
string.

Checking the JDK code, I believe there is a better solution that matches the 
JDK behaviour exactly, as it uses that package-private method. Moreover, based 
on some benchmarking, this new method even seems to be more performant.
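A minimal sketch of the idea (my reading of it, not the final code): rely on java.io.File for the platform-specific normalisation instead of a hand-rolled regexp, and intern the result so identical paths share one String instance.

{code:scala}
import java.io.File

// Let java.io.File normalise the separators for the current OS, then intern the
// resulting path so repeated resolutions of the same file reuse one String.
def normalizedInterned(parent: String, child: String): String =
  new File(parent, child).getPath.intern()
{code}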



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27651) Avoid the network when block manager fetches shuffle blocks from the same host

2020-03-05 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17052348#comment-17052348
 ] 

Attila Zsolt Piros commented on SPARK-27651:


Well, in case of dynamic allocation and a recalculation, the executors could 
already be gone.

> Avoid the network when block manager fetches shuffle blocks from the same host
> --
>
> Key: SPARK-27651
> URL: https://issues.apache.org/jira/browse/SPARK-27651
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Major
> Fix For: 3.0.0
>
>
> When a shuffle block (content) is fetched the network is always used even 
> when it is fetched from the external shuffle service running on the same 
> host. This can be avoided by getting the local directories of the same host 
> executors from the external shuffle service and accessing those blocks from 
> the disk directly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27651) Avoid the network when block manager fetches shuffle blocks from the same host

2020-03-05 Thread Attila Zsolt Piros (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros updated SPARK-27651:
---
Description: When a shuffle block (content) is fetched the network is 
always used even when it is fetched from the external shuffle service running 
on the same host. This can be avoided by getting the local directories of the 
same host executors from the external shuffle service and accessing those 
blocks from the disk directly.  (was: When a shuffle block (content) is fetched 
the network is always used even when it is fetched from an executor (or the 
external shuffle service) running on the same host.)

> Avoid the network when block manager fetches shuffle blocks from the same host
> --
>
> Key: SPARK-27651
> URL: https://issues.apache.org/jira/browse/SPARK-27651
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Major
> Fix For: 3.0.0
>
>
> When a shuffle block (content) is fetched the network is always used even 
> when it is fetched from the external shuffle service running on the same 
> host. This can be avoided by getting the local directories of the same host 
> executors from the external shuffle service and accessing those blocks from 
> the disk directly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27651) Avoid the network when block manager fetches shuffle blocks from the same host

2020-03-05 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17051995#comment-17051995
 ] 

Attila Zsolt Piros commented on SPARK-27651:


Yes, the final implementation works only when the external shuffle service is 
used, as the local directories of the other host-local executors are obtained 
from the external shuffle service.
The initial implementation, when the PR was opened, used the driver to get the 
host-local directories.

The technical reasons for asking the external shuffle service instead were 
(see the sketch below):
 * decreasing network pressure on the driver (the main reason);
 * getting rid of an unbounded map from executors to local dirs (or a bounded 
one, which would require complex fallback logic at the fetcher). Such a map 
would also be redundant, as this information is already available at the 
external shuffle service, just stored in a distributed way: each running 
shuffle service process stores this data only for the executors on its own 
host.
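The sketch mentioned above, with hypothetical helper signatures (it only illustrates the decision the fetcher makes, not Spark's actual BlockManager code):

{code:scala}
// Illustration only: if the block lives on this host, ask the external shuffle
// service for that executor's local directories and read the file from disk;
// otherwise fall back to a normal network fetch.
def readShuffleBlock(blockHost: String,
                     localHost: String,
                     localDirsFromShuffleService: => Array[String],
                     readFromDisk: Array[String] => Array[Byte],
                     fetchOverNetwork: () => Array[Byte]): Array[Byte] =
  if (blockHost == localHost) readFromDisk(localDirsFromShuffleService)
  else fetchOverNetwork()
{code}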

> Avoid the network when block manager fetches shuffle blocks from the same host
> --
>
> Key: SPARK-27651
> URL: https://issues.apache.org/jira/browse/SPARK-27651
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Major
> Fix For: 3.0.0
>
>
> When a shuffle block (content) is fetched the network is always used even 
> when it is fetched from an executor (or the external shuffle service) running 
> on the same host.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-30688) Spark SQL Unix Timestamp produces incorrect result with unix_timestamp UDF

2020-02-06 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17031614#comment-17031614
 ] 

Attila Zsolt Piros edited comment on SPARK-30688 at 2/6/20 2:22 PM:


I have checked on Spark 3.0.0-preview2 and week-in-year fails there too (but 
for an invalid format I would say this is fine):
{noformat}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.0-preview2
      /_/

Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_202)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.sql("select unix_timestamp('20201', 'ww')").show();
+-+
|unix_timestamp(20201, ww)|
+-+
| null|
+-+
{noformat}
But it fails for a correct pattern too:
{noformat}
scala> spark.sql("select unix_timestamp('2020-10', '-ww')").show();
++
|unix_timestamp(2020-10, -ww)|
++
|null|
++
{noformat}
But for this one it strangely works:
{noformat}
scala> spark.sql("select unix_timestamp('2020-01', '-ww')").show();
++
|unix_timestamp(2020-01, -ww)|
++
|  1577833200|
++
{noformat}
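For reference, a tiny sketch of the legacy 2.x behaviour described in the issue, as I read it (mirroring the formatter.parse(...).getTime / 1000L call: a parse failure is swallowed and surfaces as NULL instead of an error):

{code:scala}
import java.text.SimpleDateFormat
import scala.util.Try

// Sketch only: unparseable input yields None (NULL in SQL) rather than an error.
def legacyUnixTimestamp(value: String, pattern: String): Option[Long] =
  Try(new SimpleDateFormat(pattern).parse(value).getTime / 1000L).toOption
{code}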


was (Author: attilapiros):
I have checked on Spark 3.0.0-preview2 and week in year fails there too:

{noformat}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.0-preview2
      /_/

Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_202)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.sql("select unix_timestamp('20201', 'ww')").show();
+-+
|unix_timestamp(20201, ww)|
+-+
| null|
+-+
{noformat}

As you can see it fails for even a simpler pattern too:

{noformat}
scala> spark.sql("select unix_timestamp('2020-10', '-ww')").show();
++
|unix_timestamp(2020-10, -ww)|
++
|null|
++
{noformat}

But this strangely works:

{noformat}
scala> spark.sql("select unix_timestamp('2020-01', '-ww')").show();
++
|unix_timestamp(2020-01, -ww)|
++
|  1577833200|
++
{noformat}


> Spark SQL Unix Timestamp produces incorrect result with unix_timestamp UDF
> --
>
> Key: SPARK-30688
> URL: https://issues.apache.org/jira/browse/SPARK-30688
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: Rajkumar Singh
>Priority: Major
>
>  
> {code:java}
> scala> spark.sql("select unix_timestamp('20201', 'ww')").show();
> +-+
> |unix_timestamp(20201, ww)|
> +-+
> |                         null|
> +-+
>  
> scala> spark.sql("select unix_timestamp('20202', 'ww')").show();
> -+
> |unix_timestamp(20202, ww)|
> +-+
> |                   1578182400|
> +-+
>  
> {code}
>  
>  
> This seems to happen for leap years only. I dug deeper into it, and it seems 
> that Spark is using java.text.SimpleDateFormat and tries to parse the 
> expression here
> [org.apache.spark.sql.catalyst.expressions.UnixTime#eval|https://github.com/hortonworks/spark2/blob/49ec35bbb40ec6220282d932c9411773228725be/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala#L652]
> {code:java}
> formatter.parse(
>  t.asInstanceOf[UTF8String].toString).getTime / 1000L{code}
> but the parse fails: SimpleDateFormat is unable to parse the date and throws an 
> Unparseable exception, which Spark handles silently by returning NULL.
>  
> *Spark-3.0:* I did some tests; Spark no longer uses the legacy 
> java.text.SimpleDateFormat but the Java date/time API, and it seems the date/time 
> API expects a valid date with a valid format:
>  org.apache.spark.sql.catalyst.util.Iso8601TimestampFormatter#parse
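
To make the described legacy behavior concrete, here is a small, self-contained 
sketch (illustration only, not Spark's actual code) of the parse-and-swallow 
pattern the quoted description refers to:

{code}
// Illustration only: the legacy SimpleDateFormat-based path swallows parse
// failures and surfaces them as NULL, as described above.
import java.text.{ParseException, SimpleDateFormat}

def legacyUnixTimestamp(input: String, pattern: String): Option[Long] =
  try {
    Some(new SimpleDateFormat(pattern).parse(input).getTime / 1000L)
  } catch {
    case _: ParseException => None  // shown as NULL at the SQL layer
  }
{code}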



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional 

[jira] [Commented] (SPARK-30688) Spark SQL Unix Timestamp produces incorrect result with unix_timestamp UDF

2020-02-06 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17031614#comment-17031614
 ] 

Attila Zsolt Piros commented on SPARK-30688:


I have checked on Spark 3.0.0-preview2 and week in year fails there too:

{noformat}
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.0-preview2
  /_/

Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_202)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.sql("select unix_timestamp('20201', 'ww')").show();
+-+
|unix_timestamp(20201, ww)|
+-+
| null|
+-+
{noformat}

As you can see, it fails even for a simpler pattern:

{noformat}
scala> spark.sql("select unix_timestamp('2020-10', '-ww')").show();
++
|unix_timestamp(2020-10, -ww)|
++
|null|
++
{noformat}

But this strangely works:

{noformat}
scala> spark.sql("select unix_timestamp('2020-01', '-ww')").show();
++
|unix_timestamp(2020-01, -ww)|
++
|  1577833200|
++
{noformat}


> Spark SQL Unix Timestamp produces incorrect result with unix_timestamp UDF
> --
>
> Key: SPARK-30688
> URL: https://issues.apache.org/jira/browse/SPARK-30688
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: Rajkumar Singh
>Priority: Major
>
>  
> {code:java}
> scala> spark.sql("select unix_timestamp('20201', 'ww')").show();
> +-+
> |unix_timestamp(20201, ww)|
> +-+
> |                         null|
> +-+
>  
> scala> spark.sql("select unix_timestamp('20202', 'ww')").show();
> -+
> |unix_timestamp(20202, ww)|
> +-+
> |                   1578182400|
> +-+
>  
> {code}
>  
>  
> This seems to happen for leap years only. I dug deeper into it, and it seems 
> that Spark is using java.text.SimpleDateFormat and tries to parse the 
> expression here
> [org.apache.spark.sql.catalyst.expressions.UnixTime#eval|https://github.com/hortonworks/spark2/blob/49ec35bbb40ec6220282d932c9411773228725be/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala#L652]
> {code:java}
> formatter.parse(
>  t.asInstanceOf[UTF8String].toString).getTime / 1000L{code}
> but the parse fails: SimpleDateFormat is unable to parse the date and throws an 
> Unparseable exception, which Spark handles silently by returning NULL.
>  
> *Spark-3.0:* I did some tests; Spark no longer uses the legacy 
> java.text.SimpleDateFormat but the Java date/time API, and it seems the date/time 
> API expects a valid date with a valid format:
>  org.apache.spark.sql.catalyst.util.Iso8601TimestampFormatter#parse



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27021) Leaking Netty event loop group for shuffle chunk fetch requests

2019-12-14 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16996571#comment-16996571
 ] 

Attila Zsolt Piros commented on SPARK-27021:


[~roncenzhao] it seems to me you bumped into 
https://issues.apache.org/jira/browse/SPARK-26418

> Leaking Netty event loop group for shuffle chunk fetch requests
> ---
>
> Key: SPARK-27021
> URL: https://issues.apache.org/jira/browse/SPARK-27021
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0, 2.4.1, 3.0.0
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: image-2019-12-14-23-23-50-384.png
>
>
> The extra event loop group created for handling shuffle chunk fetch requests 
> are never closed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27021) Leaking Netty event loop group for shuffle chunk fetch requests

2019-12-13 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16995538#comment-16995538
 ] 

Attila Zsolt Piros commented on SPARK-27021:



[~roncenzhao]

# yes
# I do not think so. This bug mostly affects the test system, as test execution 
is the place where multiple TransportContext, NettyRpcEnv, etc ... are created 
and not closed correctly.

> Leaking Netty event loop group for shuffle chunk fetch requests
> ---
>
> Key: SPARK-27021
> URL: https://issues.apache.org/jira/browse/SPARK-27021
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0, 2.4.1, 3.0.0
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Major
> Fix For: 3.0.0
>
>
> The extra event loop group created for handling shuffle chunk fetch requests 
> are never closed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30235) Keeping compatibility with 2.4 external shuffle service regarding host local shuffle blocks reading

2019-12-12 Thread Attila Zsolt Piros (Jira)
Attila Zsolt Piros created SPARK-30235:
--

 Summary: Keeping compatibility with 2.4 external shuffle service 
regarding host local shuffle blocks reading
 Key: SPARK-30235
 URL: https://issues.apache.org/jira/browse/SPARK-30235
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Attila Zsolt Piros


When `spark.shuffle.readHostLocalDisk.enabled` is true then a new message is 
used which is not supported by Spark 2.4.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30235) Keeping compatibility with 2.4 external shuffle service regarding host local shuffle blocks reading

2019-12-12 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16994672#comment-16994672
 ] 

Attila Zsolt Piros commented on SPARK-30235:


I am working on this.

> Keeping compatibility with 2.4 external shuffle service regarding host local 
> shuffle blocks reading
> ---
>
> Key: SPARK-30235
> URL: https://issues.apache.org/jira/browse/SPARK-30235
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Priority: Minor
>
> When `spark.shuffle.readHostLocalDisk.enabled` is true then a new message is 
> used which is not supported by Spark 2.4.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27819) Retry cleanup of disk persisted RDD via external shuffle service when it failed via executor

2019-05-23 Thread Attila Zsolt Piros (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1684#comment-1684
 ] 

Attila Zsolt Piros commented on SPARK-27819:


I am working on this.

> Retry cleanup of disk persisted RDD via external shuffle service when it 
> failed via executor
> 
>
> Key: SPARK-27819
> URL: https://issues.apache.org/jira/browse/SPARK-27819
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> This issue is created so as not to lose the idea that came up during SPARK-27677 
> (at org.apache.spark.storage.BlockManagerMasterEndpoint#removeRdd):
> {panel:title=Quote}
> In certain situations (e.g. executor death) it would make sense to retry this 
> through the shuffle service. But I'm not sure how to safely detect those 
> situations (or whether it makes sense to always retry through the shuffle 
> service).
> {panel}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27819) Retry cleanup of disk persisted RDD via external shuffle service when it failed via executor

2019-05-23 Thread Attila Zsolt Piros (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros updated SPARK-27819:
---
Description: 
This issue is created so as not to lose the idea that came up during SPARK-27677 (at 
org.apache.spark.storage.BlockManagerMasterEndpoint#removeRdd):

{panel:title=Quote}
In certain situations (e.g. executor death) it would make sense to retry this 
through the shuffle service. But I'm not sure how to safely detect those 
situations (or whether it makes sense to always retry through the shuffle 
service).
{panel}

  was:
This issue is created so as not to lose the idea that came up during SPARK-27677 (at 
org.apache.spark.storage.BlockManagerMasterEndpoint#removeRdd):

{noformat}
In certain situations (e.g. executor death) it would make sense to retry this 
through the shuffle service. But I'm not sure how to safely detect those 
situations (or whether it makes sense to always retry through the shuffle 
service).
{noformat}



> Retry cleanup of disk persisted RDD via external shuffle service when it 
> failed via executor
> 
>
> Key: SPARK-27819
> URL: https://issues.apache.org/jira/browse/SPARK-27819
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> This issue is created so as not to lose the idea that came up during SPARK-27677 
> (at org.apache.spark.storage.BlockManagerMasterEndpoint#removeRdd):
> {panel:title=Quote}
> In certain situations (e.g. executor death) it would make sense to retry this 
> through the shuffle service. But I'm not sure how to safely detect those 
> situations (or whether it makes sense to always retry through the shuffle 
> service).
> {panel}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27819) Retry cleanup of disk persisted RDD via external shuffle service when it failed via executor

2019-05-23 Thread Attila Zsolt Piros (JIRA)
Attila Zsolt Piros created SPARK-27819:
--

 Summary: Retry cleanup of disk persisted RDD via external shuffle 
service when it failed via executor
 Key: SPARK-27819
 URL: https://issues.apache.org/jira/browse/SPARK-27819
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Attila Zsolt Piros


This issue is created so as not to lose the idea that came up during SPARK-27677 (at 
org.apache.spark.storage.BlockManagerMasterEndpoint#removeRdd):

{noformat}
In certain situations (e.g. executor death) it would make sense to retry this 
through the shuffle service. But I'm not sure how to safely detect those 
situations (or whether it makes sense to always retry through the shuffle 
service).
{noformat}




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27651) Avoid the network when block manager fetches shuffle blocks from the same host

2019-05-07 Thread Attila Zsolt Piros (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834832#comment-16834832
 ] 

Attila Zsolt Piros commented on SPARK-27651:


I am already working on this.

> Avoid the network when block manager fetches shuffle blocks from the same host
> --
>
> Key: SPARK-27651
> URL: https://issues.apache.org/jira/browse/SPARK-27651
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> When a shuffle block (content) is fetched the network is always used even 
> when it is fetched from an executor (or the external shuffle service) running 
> on the same host.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27651) Avoid the network when block manager fetches shuffle blocks from the same host

2019-05-07 Thread Attila Zsolt Piros (JIRA)
Attila Zsolt Piros created SPARK-27651:
--

 Summary: Avoid the network when block manager fetches shuffle 
blocks from the same host
 Key: SPARK-27651
 URL: https://issues.apache.org/jira/browse/SPARK-27651
 Project: Spark
  Issue Type: Improvement
  Components: Block Manager
Affects Versions: 3.0.0
Reporter: Attila Zsolt Piros


When a shuffle block (content) is fetched the network is always used even when 
it is fetched from an executor (or the external shuffle service) running on the 
same host.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27622) Avoid network communication when block manager fetches disk persisted RDD blocks from the same host

2019-05-06 Thread Attila Zsolt Piros (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros updated SPARK-27622:
---
Summary: Avoid network communication when block manager fetches disk 
persisted RDD blocks from the same host  (was: Avoid network communication when 
block manger fetches from the same host)

> Avoid network communication when block manager fetches disk persisted RDD 
> blocks from the same host
> ---
>
> Key: SPARK-27622
> URL: https://issues.apache.org/jira/browse/SPARK-27622
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> Currently fetching blocks always uses the network even when the two block 
> managers are running on the same host.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27622) Avoid the network when block manager fetches disk persisted RDD blocks from the same host

2019-05-06 Thread Attila Zsolt Piros (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros updated SPARK-27622:
---
Summary: Avoid the network when block manager fetches disk persisted RDD 
blocks from the same host  (was: Avoid network communication when block manager 
fetches disk persisted RDD blocks from the same host)

> Avoid the network when block manager fetches disk persisted RDD blocks from 
> the same host
> -
>
> Key: SPARK-27622
> URL: https://issues.apache.org/jira/browse/SPARK-27622
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> Currently fetching blocks always uses the network even when the two block 
> managers are running on the same host.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-27622) Avoid network communication when block manger fetches from the same host

2019-05-06 Thread Attila Zsolt Piros (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16831553#comment-16831553
 ] 

Attila Zsolt Piros edited comment on SPARK-27622 at 5/6/19 6:22 PM:


I am already working on this. There is already a working prototype for RDD 
blocks.


was (Author: attilapiros):
I am already working on this. A working prototype for RDD blocks are ready and 
working.

> Avoid network communication when block manger fetches from the same host
> 
>
> Key: SPARK-27622
> URL: https://issues.apache.org/jira/browse/SPARK-27622
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> Currently fetching blocks always uses the network even when the two block 
> managers are running on the same host.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27622) Avoid network communication when block manger fetches from the same host

2019-05-06 Thread Attila Zsolt Piros (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros updated SPARK-27622:
---
Summary: Avoid network communication when block manger fetches from the 
same host  (was: Avoiding network communication when block manger fetching from 
the same host)

> Avoid network communication when block manger fetches from the same host
> 
>
> Key: SPARK-27622
> URL: https://issues.apache.org/jira/browse/SPARK-27622
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> Currently fetching blocks always uses the network even when the two block 
> managers are running on the same host.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27622) Avoiding network communication when block manger fetching from the same host

2019-05-06 Thread Attila Zsolt Piros (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros updated SPARK-27622:
---
Summary: Avoiding network communication when block manger fetching from the 
same host  (was: Avoiding network communication when block mangers are running 
on the same host )

> Avoiding network communication when block manger fetching from the same host
> 
>
> Key: SPARK-27622
> URL: https://issues.apache.org/jira/browse/SPARK-27622
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> Currently fetching blocks always uses the network even when the two block 
> managers are running on the same host.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27622) Avoiding network communication when block mangers are running on the same host

2019-05-06 Thread Attila Zsolt Piros (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros updated SPARK-27622:
---
Summary: Avoiding network communication when block mangers are running on 
the same host   (was: Avoiding network communication when block mangers are 
running on the host )

> Avoiding network communication when block mangers are running on the same 
> host 
> ---
>
> Key: SPARK-27622
> URL: https://issues.apache.org/jira/browse/SPARK-27622
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> Currently fetching blocks always uses the network even when the two block 
> managers are running on the same host.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27622) Avoiding network communication when block mangers are running on the host

2019-05-02 Thread Attila Zsolt Piros (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16831553#comment-16831553
 ] 

Attila Zsolt Piros commented on SPARK-27622:


I am already working on this. A working prototype for RDD blocks is ready and 
working.

> Avoiding network communication when block mangers are running on the host 
> --
>
> Key: SPARK-27622
> URL: https://issues.apache.org/jira/browse/SPARK-27622
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> Currently fetching blocks always uses the network even when the two block 
> managers are running on the same host.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27622) Avoiding network communication when block mangers are running on the host

2019-05-02 Thread Attila Zsolt Piros (JIRA)
Attila Zsolt Piros created SPARK-27622:
--

 Summary: Avoiding network communication when block mangers are 
running on the host 
 Key: SPARK-27622
 URL: https://issues.apache.org/jira/browse/SPARK-27622
 Project: Spark
  Issue Type: Improvement
  Components: Block Manager
Affects Versions: 3.0.0
Reporter: Attila Zsolt Piros


Currently fetching blocks always uses the network even when the two block 
managers are running on the same host.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25888) Service requests for persist() blocks via external service after dynamic deallocation

2019-04-24 Thread Attila Zsolt Piros (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16825248#comment-16825248
 ] 

Attila Zsolt Piros commented on SPARK-25888:


I am working on this.

> Service requests for persist() blocks via external service after dynamic 
> deallocation
> -
>
> Key: SPARK-25888
> URL: https://issues.apache.org/jira/browse/SPARK-25888
> Project: Spark
>  Issue Type: New Feature
>  Components: Block Manager, Shuffle, YARN
>Affects Versions: 2.3.2
>Reporter: Adam Kennedy
>Priority: Major
>
> Large and highly multi-tenant Spark on YARN clusters with diverse job 
> execution often display terrible utilization rates (we have observed as low 
> as 3-7% CPU at max container allocation, but 50% CPU utilization on even a 
> well policed cluster is not uncommon).
> As a sizing example, consider a scenario with 1,000 nodes, 50,000 cores, 250 
> users and 50,000 runs of 1,000 distinct applications per week, with 
> predominantly Spark including a mixture of ETL, Ad Hoc tasks and PySpark 
> Notebook jobs (no streaming)
> Utilization problems appear to be due in large part to difficulties with 
> persist() blocks (DISK or DISK+MEMORY) preventing dynamic deallocation.
> In situations where an external shuffle service is present (which is typical 
> on clusters of this type) we already solve this for the shuffle block case by 
> offloading the IO handling of shuffle blocks to the external service, 
> allowing dynamic deallocation to proceed.
> Allowing Executors to transfer persist() blocks to some external "shuffle" 
> service in a similar manner would be an enormous win for Spark multi-tenancy 
> as it would limit deallocation blocking scenarios to only MEMORY-only cache() 
> scenarios.
> I'm not sure if I'm correct, but I seem to recall seeing in the original 
> external shuffle service commits that may have been considered at the time 
> but getting shuffle blocks moved to the external shuffle service was the 
> first priority.
> With support for external persist() DISK blocks in place, we could also then 
> handle deallocation of DISK+MEMORY, as the memory instance could first be 
> dropped, changing the block to DISK only, and then further transferred to the 
> shuffle service.
> We have tried to resolve the persist() issue via extensive user training, but 
> that has typically only allowed us to improve utilization of the worst 
> offenders (10% utilization) up to around 40-60% utilization, as the need for 
> persist() is often legitimate and occurs during the middle stages of a job.
> In a healthy multi-tenant scenario, a large job might spool up to say 10,000 
> cores, persist() data, release executors across a long tail down to 100 
> cores, and then spool back up to 10,000 cores for the following stage without 
> impact on the persist() data.
> In an ideal world, if an new executor started up on a node on which blocks 
> had been transferred to the shuffle service, the new executor might even be 
> able to "recapture" control of those blocks (if that would help with 
> performance in some way).
> And the behavior of gradually expanding up and down several times over the 
> course of a job would not just improve utilization, but would allow resources 
> to more easily be redistributed to other jobs which start on the cluster 
> during the long-tail periods, which would improve multi-tenancy and bring us 
> closer to optimal "envy free" YARN scheduling.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27090) Deplementing old LEGACY_DRIVER_IDENTIFIER ("")

2019-03-08 Thread Attila Zsolt Piros (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16787822#comment-16787822
 ] 

Attila Zsolt Piros commented on SPARK-27090:


I would have waited a little bit to find out the others' opinion about this, but 
it is fine for me.

> Deplementing old LEGACY_DRIVER_IDENTIFIER ("")
> --
>
> Key: SPARK-27090
> URL: https://issues.apache.org/jira/browse/SPARK-27090
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> For legacy reasons LEGACY_DRIVER_IDENTIFIER is checked in a few places 
> along with the new DRIVER_IDENTIFIER ("driver") to decide whether a driver 
> or an executor is running.
> The new DRIVER_IDENTIFIER ("driver") was introduced in Spark version 1.4, so 
> I think we have a chance to get rid of the LEGACY_DRIVER_IDENTIFIER.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27090) Deplementing old LEGACY_DRIVER_IDENTIFIER ("")

2019-03-07 Thread Attila Zsolt Piros (JIRA)
Attila Zsolt Piros created SPARK-27090:
--

 Summary: Deplementing old LEGACY_DRIVER_IDENTIFIER ("")
 Key: SPARK-27090
 URL: https://issues.apache.org/jira/browse/SPARK-27090
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Attila Zsolt Piros


For legacy reasons LEGACY_DRIVER_IDENTIFIER is checked in a few places along 
with the new DRIVER_IDENTIFIER ("driver") to decide whether a driver or an 
executor is running.

The new DRIVER_IDENTIFIER ("driver") was introduced in Spark version 1.4, so I 
think we have a chance to get rid of the LEGACY_DRIVER_IDENTIFIER.
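
For illustration, here is a minimal sketch of the kind of check the description 
refers to; the legacy constant's value below is an assumption added only for the 
example (the actual constants live in Spark's core):

{code}
// Sketch only: the shape of the driver-vs-executor check described above.
object DriverCheck {
  val DRIVER_IDENTIFIER = "driver"
  val LEGACY_DRIVER_IDENTIFIER = "<driver>"  // assumed legacy value, for illustration

  def isDriver(executorId: String): Boolean =
    executorId == DRIVER_IDENTIFIER || executorId == LEGACY_DRIVER_IDENTIFIER
}
{code}

Dropping the legacy constant would reduce this to a single comparison.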



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27021) Netty related thread leaks ()

2019-03-01 Thread Attila Zsolt Piros (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros updated SPARK-27021:
---
Summary: Netty related thread leaks ()  (was: Netty related thread leaks)

> Netty related thread leaks ()
> -
>
> Key: SPARK-27021
> URL: https://issues.apache.org/jira/browse/SPARK-27021
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0, 2.4.1, 3.0.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> The extra event loop group created for handling shuffle chunk fetch requests 
> are never closed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27021) Leaking Netty event loop group for shuffle chunk fetch requests

2019-03-01 Thread Attila Zsolt Piros (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros updated SPARK-27021:
---
Summary: Leaking Netty event loop group for shuffle chunk fetch requests  
(was: Netty related thread leaks (event loop group for shuffle chunk fetch 
requests))

> Leaking Netty event loop group for shuffle chunk fetch requests
> ---
>
> Key: SPARK-27021
> URL: https://issues.apache.org/jira/browse/SPARK-27021
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0, 2.4.1, 3.0.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> The extra event loop group created for handling shuffle chunk fetch requests 
> are never closed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27021) Netty related thread leaks (event loop group for shuffle chunk fetch requests)

2019-03-01 Thread Attila Zsolt Piros (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros updated SPARK-27021:
---
Summary: Netty related thread leaks (event loop group for shuffle chunk 
fetch requests)  (was: Netty related thread leaks ())

> Netty related thread leaks (event loop group for shuffle chunk fetch requests)
> --
>
> Key: SPARK-27021
> URL: https://issues.apache.org/jira/browse/SPARK-27021
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0, 2.4.1, 3.0.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> The extra event loop group created for handling shuffle chunk fetch requests 
> are never closed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27021) Netty related thread leaks

2019-03-01 Thread Attila Zsolt Piros (JIRA)
Attila Zsolt Piros created SPARK-27021:
--

 Summary: Netty related thread leaks
 Key: SPARK-27021
 URL: https://issues.apache.org/jira/browse/SPARK-27021
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.0, 2.4.1, 3.0.0
Reporter: Attila Zsolt Piros


The extra event loop group created for handling shuffle chunk fetch requests 
are never closed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27021) Netty related thread leaks

2019-03-01 Thread Attila Zsolt Piros (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16781707#comment-16781707
 ] 

Attila Zsolt Piros commented on SPARK-27021:


I am already working on it.

> Netty related thread leaks
> --
>
> Key: SPARK-27021
> URL: https://issues.apache.org/jira/browse/SPARK-27021
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0, 2.4.1, 3.0.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> The extra event loop group created for handling shuffle chunk fetch requests 
> are never closed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26891) Flaky test:YarnSchedulerBackendSuite."RequestExecutors reflects node blacklist and is serializable"

2019-02-15 Thread Attila Zsolt Piros (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16769323#comment-16769323
 ] 

Attila Zsolt Piros commented on SPARK-26891:


I am working on it.

> Flaky test:YarnSchedulerBackendSuite."RequestExecutors reflects node 
> blacklist and is serializable"
> ---
>
> Key: SPARK-26891
> URL: https://issues.apache.org/jira/browse/SPARK-26891
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.3.0, 2.4.0, 3.0.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> Sometimes, even for an unrelated change, the 
> YarnSchedulerBackendSuite#"RequestExecutors reflects node blacklist and is 
> serializable" test fails with the following error:
> {noformat}
> Error Message
> org.mockito.exceptions.misusing.WrongTypeOfReturnValue:  EmptySet$ cannot be 
> returned by resourceOffers() resourceOffers() should return Seq *** If you're 
> unsure why you're getting above error read on. Due to the nature of the 
> syntax above problem might occur because: 1. This exception *might* occur in 
> wrongly written multi-threaded tests.Please refer to Mockito FAQ on 
> limitations of concurrency testing. 2. A spy is stubbed using 
> when(spy.foo()).then() syntax. It is safer to stub spies - - with 
> doReturn|Throw() family of methods. More in javadocs for Mockito.spy() 
> method. 
> Stacktrace
> sbt.ForkMain$ForkError: 
> org.mockito.exceptions.misusing.WrongTypeOfReturnValue: 
> EmptySet$ cannot be returned by resourceOffers()
> resourceOffers() should return Seq
> ***
> If you're unsure why you're getting above error read on.
> Due to the nature of the syntax above problem might occur because:
> 1. This exception *might* occur in wrongly written multi-threaded tests.
>Please refer to Mockito FAQ on limitations of concurrency testing.
> 2. A spy is stubbed using when(spy.foo()).then() syntax. It is safer to stub 
> spies - 
>- with doReturn|Throw() family of methods. More in javadocs for 
> Mockito.spy() method.
> {noformat}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26891) Flaky test:YarnSchedulerBackendSuite."RequestExecutors reflects node blacklist and is serializable"

2019-02-15 Thread Attila Zsolt Piros (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros updated SPARK-26891:
---
Description: 
Sometimes, even for an unrelated change, the 
YarnSchedulerBackendSuite#"RequestExecutors reflects node blacklist and is 
serializable" test fails with the following error:
{noformat}
Error Message
org.mockito.exceptions.misusing.WrongTypeOfReturnValue:  EmptySet$ cannot be 
returned by resourceOffers() resourceOffers() should return Seq *** If you're 
unsure why you're getting above error read on. Due to the nature of the syntax 
above problem might occur because: 1. This exception *might* occur in wrongly 
written multi-threaded tests.Please refer to Mockito FAQ on limitations of 
concurrency testing. 2. A spy is stubbed using when(spy.foo()).then() syntax. 
It is safer to stub spies - - with doReturn|Throw() family of methods. More 
in javadocs for Mockito.spy() method. 
Stacktrace
sbt.ForkMain$ForkError: org.mockito.exceptions.misusing.WrongTypeOfReturnValue: 
EmptySet$ cannot be returned by resourceOffers()
resourceOffers() should return Seq
***
If you're unsure why you're getting above error read on.
Due to the nature of the syntax above problem might occur because:
1. This exception *might* occur in wrongly written multi-threaded tests.
   Please refer to Mockito FAQ on limitations of concurrency testing.
2. A spy is stubbed using when(spy.foo()).then() syntax. It is safer to stub 
spies - 
   - with doReturn|Throw() family of methods. More in javadocs for 
Mockito.spy() method.
{noformat}
 

  was:
For unrelated changes sometimes YarnSchedulerBackendSuite."RequestExecutors 
reflects node blacklist and is serializable" test fails with the following 
error:
{noformat}
Error Message
org.mockito.exceptions.misusing.WrongTypeOfReturnValue:  EmptySet$ cannot be 
returned by resourceOffers() resourceOffers() should return Seq *** If you're 
unsure why you're getting above error read on. Due to the nature of the syntax 
above problem might occur because: 1. This exception *might* occur in wrongly 
written multi-threaded tests.Please refer to Mockito FAQ on limitations of 
concurrency testing. 2. A spy is stubbed using when(spy.foo()).then() syntax. 
It is safer to stub spies - - with doReturn|Throw() family of methods. More 
in javadocs for Mockito.spy() method. 
Stacktrace
sbt.ForkMain$ForkError: org.mockito.exceptions.misusing.WrongTypeOfReturnValue: 
EmptySet$ cannot be returned by resourceOffers()
resourceOffers() should return Seq
***
If you're unsure why you're getting above error read on.
Due to the nature of the syntax above problem might occur because:
1. This exception *might* occur in wrongly written multi-threaded tests.
   Please refer to Mockito FAQ on limitations of concurrency testing.
2. A spy is stubbed using when(spy.foo()).then() syntax. It is safer to stub 
spies - 
   - with doReturn|Throw() family of methods. More in javadocs for 
Mockito.spy() method.
{noformat}
 


> Flaky test:YarnSchedulerBackendSuite."RequestExecutors reflects node 
> blacklist and is serializable"
> ---
>
> Key: SPARK-26891
> URL: https://issues.apache.org/jira/browse/SPARK-26891
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.3.0, 2.4.0, 3.0.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> Sometimes, even for an unrelated change, the 
> YarnSchedulerBackendSuite#"RequestExecutors reflects node blacklist and is 
> serializable" test fails with the following error:
> {noformat}
> Error Message
> org.mockito.exceptions.misusing.WrongTypeOfReturnValue:  EmptySet$ cannot be 
> returned by resourceOffers() resourceOffers() should return Seq *** If you're 
> unsure why you're getting above error read on. Due to the nature of the 
> syntax above problem might occur because: 1. This exception *might* occur in 
> wrongly written multi-threaded tests.Please refer to Mockito FAQ on 
> limitations of concurrency testing. 2. A spy is stubbed using 
> when(spy.foo()).then() syntax. It is safer to stub spies - - with 
> doReturn|Throw() family of methods. More in javadocs for Mockito.spy() 
> method. 
> Stacktrace
> sbt.ForkMain$ForkError: 
> org.mockito.exceptions.misusing.WrongTypeOfReturnValue: 
> EmptySet$ cannot be returned by resourceOffers()
> resourceOffers() should return Seq
> ***
> If you're unsure why you're getting above error read on.
> Due to the nature of the syntax above problem might occur because:
> 1. This exception *might* occur in wrongly written multi-threaded tests.
>Please refer to Mockito FAQ on limitations of concurrency testing.
> 2. A spy is stubbed using when(spy.foo()).then() syntax. It is safer to stub 
> spies - 
>- with 

[jira] [Created] (SPARK-26891) Flaky test:YarnSchedulerBackendSuite."RequestExecutors reflects node blacklist and is serializable"

2019-02-15 Thread Attila Zsolt Piros (JIRA)
Attila Zsolt Piros created SPARK-26891:
--

 Summary: Flaky test:YarnSchedulerBackendSuite."RequestExecutors 
reflects node blacklist and is serializable"
 Key: SPARK-26891
 URL: https://issues.apache.org/jira/browse/SPARK-26891
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 2.4.0, 2.3.0, 3.0.0
Reporter: Attila Zsolt Piros


For unrelated changes sometimes YarnSchedulerBackendSuite."RequestExecutors 
reflects node blacklist and is serializable" test fails with the following 
error:
{noformat}
Error Message
org.mockito.exceptions.misusing.WrongTypeOfReturnValue:  EmptySet$ cannot be 
returned by resourceOffers() resourceOffers() should return Seq *** If you're 
unsure why you're getting above error read on. Due to the nature of the syntax 
above problem might occur because: 1. This exception *might* occur in wrongly 
written multi-threaded tests.Please refer to Mockito FAQ on limitations of 
concurrency testing. 2. A spy is stubbed using when(spy.foo()).then() syntax. 
It is safer to stub spies - - with doReturn|Throw() family of methods. More 
in javadocs for Mockito.spy() method. 
Stacktrace
sbt.ForkMain$ForkError: org.mockito.exceptions.misusing.WrongTypeOfReturnValue: 
EmptySet$ cannot be returned by resourceOffers()
resourceOffers() should return Seq
***
If you're unsure why you're getting above error read on.
Due to the nature of the syntax above problem might occur because:
1. This exception *might* occur in wrongly written multi-threaded tests.
   Please refer to Mockito FAQ on limitations of concurrency testing.
2. A spy is stubbed using when(spy.foo()).then() syntax. It is safer to stub 
spies - 
   - with doReturn|Throw() family of methods. More in javadocs for 
Mockito.spy() method.
{noformat}
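
For what it is worth, here is a small sketch of the stubbing style the Mockito 
message above recommends for spies; the Scheduler class and its method are made 
up for illustration and are not the real suite's code:

{code}
// Illustration only: doReturn(...).when(spy).method() instead of
// when(spy.method()).thenReturn(...), which is fragile with spies.
import org.mockito.Mockito.{doReturn, spy}

class Scheduler {
  def resourceOffers(): Seq[String] = Seq("real-offer")
}

val s = spy(new Scheduler)
// Fragile with spies: when(s.resourceOffers()).thenReturn(Seq.empty)
// Safer, as the error message suggests (the extra Nil avoids a Scala
// overload ambiguity with Mockito's varargs doReturn):
doReturn(Seq.empty, Nil: _*).when(s).resourceOffers()
{code}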
 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26845) Avro to_avro from_avro roundtrip fails if data type is string

2019-02-07 Thread Attila Zsolt Piros (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16763094#comment-16763094
 ] 

Attila Zsolt Piros commented on SPARK-26845:


This also works:
{code}
test("roundtrip in to_avro and from_avro - string") {
val df = spark.createDataset(Seq("1", "2", 
"3")).select('value.cast("string").as("str"))

val avroDF = df.select(to_avro('str).as("b"))
val avroTypeStr = s"""
  |{
  |   "type": "record",
  |   "name": "topLevelRecord",
  |   "fields": [
  | {
  |   "name": "str",
  |   "type": ["string", "null"]
  | }
  |   ]
  |}""".stripMargin
checkAnswer(
  avroDF.select(from_avro('b, avroTypeStr).as("rec")).select($"rec.str"),
  df)
  }
{code}
I have introduced a topLevelRecord, as at the top level union types are not 
allowed / not working (good question why); I mean this:
{code:javascript}
  {
"name": "str",
"type": ["string", "null"]
  }
{code}
Throws an exception:
{noformat}
org.apache.avro.SchemaParseException: No type: 
{"name":"str","type":["string","null"]} 
{noformat}

> Avro to_avro from_avro roundtrip fails if data type is string
> -
>
> Key: SPARK-26845
> URL: https://issues.apache.org/jira/browse/SPARK-26845
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Gabor Somogyi
>Priority: Critical
>  Labels: correctness
>
> I was playing with AvroFunctionsSuite and created a situation where test 
> fails which I believe it shouldn't:
> {code:java}
>   test("roundtrip in to_avro and from_avro - string") {
> val df = spark.createDataset(Seq("1", "2", 
> "3")).select('value.cast("string").as("str"))
> val avroDF = df.select(to_avro('str).as("b"))
> val avroTypeStr = s"""
>   |{
>   |  "type": "string",
>   |  "name": "str"
>   |}
> """.stripMargin
> checkAnswer(avroDF.select(from_avro('b, avroTypeStr)), df)
>   }
> {code}
> {code:java}
> == Results ==
> !== Correct Answer - 3 ==   == Spark Answer - 3 ==
> !struct struct
> ![1][]
> ![2][]
> ![3][]
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26845) Avro to_avro from_avro roundtrip fails if data type is string

2019-02-07 Thread Attila Zsolt Piros (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16763066#comment-16763066
 ] 

Attila Zsolt Piros commented on SPARK-26845:


The test would work if you replace the line
{code:java}
val df = spark.createDataset(Seq("1", "2", 
"3")).select('value.cast("string").as("str"))
{code}
with
{code:java}
val df = spark.range(3).select('id.cast("string").as("str"))
{code}

*And the difference is caused by the nullable flag of the _StructField_.*

For the _Seq_ you used the schema is:
{code:java}
scala> spark.createDataset(Seq("1", "2", 
"3")).select('value.cast("string").as("str")).schema 
res0: org.apache.spark.sql.types.StructType = 
StructType(StructField(str,StringType,true))
{code}
And for the range:
{code:java}
scala> spark.range(3).select('id.cast("string").as("str")).schema 
res1: org.apache.spark.sql.types.StructType = 
StructType(StructField(str,StringType,false))
{code}
So in your case the _avroTypeStr_ does not match the data.

> Avro to_avro from_avro roundtrip fails if data type is string
> -
>
> Key: SPARK-26845
> URL: https://issues.apache.org/jira/browse/SPARK-26845
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Gabor Somogyi
>Priority: Critical
>  Labels: correctness
>
> I was playing with AvroFunctionsSuite and created a situation where test 
> fails which I believe it shouldn't:
> {code:java}
>   test("roundtrip in to_avro and from_avro - string") {
> val df = spark.createDataset(Seq("1", "2", 
> "3")).select('value.cast("string").as("str"))
> val avroDF = df.select(to_avro('str).as("b"))
> val avroTypeStr = s"""
>   |{
>   |  "type": "string",
>   |  "name": "str"
>   |}
> """.stripMargin
> checkAnswer(avroDF.select(from_avro('b, avroTypeStr)), df)
>   }
> {code}
> {code:java}
> == Results ==
> !== Correct Answer - 3 ==   == Spark Answer - 3 ==
> !struct struct
> ![1][]
> ![2][]
> ![3][]
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25035) Replicating disk-stored blocks should avoid memory mapping

2019-01-29 Thread Attila Zsolt Piros (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16755322#comment-16755322
 ] 

Attila Zsolt Piros commented on SPARK-25035:


I am working on this.

> Replicating disk-stored blocks should avoid memory mapping
> --
>
> Key: SPARK-25035
> URL: https://issues.apache.org/jira/browse/SPARK-25035
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Imran Rashid
>Priority: Major
>  Labels: memory-analysis
>
> This is a follow-up to SPARK-24296.
> When replicating a disk-cached block, even if we fetch-to-disk, we still 
> memory-map the file, just to copy it to another location.
> Ideally we'd just move the tmp file to the right location.  But even without 
> that, we could read the file as an input stream, instead of memory-mapping 
> the whole thing.  Memory-mapping is particularly a problem when running under 
> yarn, as the OS may believe there is plenty of memory available, meanwhile 
> yarn decides to kill the process for exceeding memory limits.
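
As a rough illustration of the alternatives mentioned above (a sketch only, not 
Spark's replication code), streaming the file or simply moving the temporary file 
both avoid mapping the whole block into memory:

{code}
// Sketch: bounded-memory alternatives to memory-mapping a file just to copy it.
import java.io.{FileInputStream, FileOutputStream}
import java.nio.file.{Files, Paths, StandardCopyOption}

def copyByStreaming(src: String, dst: String): Unit = {
  val in = new FileInputStream(src)
  val out = new FileOutputStream(dst)
  try {
    val buf = new Array[Byte](64 * 1024)  // fixed-size buffer, independent of block size
    var n = in.read(buf)
    while (n != -1) { out.write(buf, 0, n); n = in.read(buf) }
  } finally {
    in.close()
    out.close()
  }
}

// Or, as suggested above, just move the temporary file into place:
def moveTmpFile(src: String, dst: String): Unit = {
  Files.move(Paths.get(src), Paths.get(dst), StandardCopyOption.ATOMIC_MOVE)
}
{code}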



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26688) Provide configuration of initially blacklisted YARN nodes

2019-01-22 Thread Attila Zsolt Piros (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16748922#comment-16748922
 ] 

Attila Zsolt Piros commented on SPARK-26688:


As far as I know, a node can only have one label.

So if node labels are already used for something else (to group your nodes 
somehow), this could be a lightweight way to still exclude some YARN nodes.

> Provide configuration of initially blacklisted YARN nodes
> -
>
> Key: SPARK-26688
> URL: https://issues.apache.org/jira/browse/SPARK-26688
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> Introducing new config for initially blacklisted YARN nodes.
> This came up in the apache spark user mailing list: 
> [http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-Yarn-is-it-possible-to-manually-blacklist-nodes-before-running-spark-job-td34395.html]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26688) Provide configuration of initially blacklisted YARN nodes

2019-01-22 Thread Attila Zsolt Piros (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros updated SPARK-26688:
---
Description: 
Introducing new config for initially blacklisted YARN nodes.

This came up in the apache spark user mailing list: 
[http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-Yarn-is-it-possible-to-manually-blacklist-nodes-before-running-spark-job-td34395.html]

 

  was:Introducing new config for initially blacklisted YARN nodes.


> Provide configuration of initially blacklisted YARN nodes
> -
>
> Key: SPARK-26688
> URL: https://issues.apache.org/jira/browse/SPARK-26688
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> Introducing new config for initially blacklisted YARN nodes.
> This came up in the apache spark user mailing list: 
> [http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-Yarn-is-it-possible-to-manually-blacklist-nodes-before-running-spark-job-td34395.html]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26688) Provide configuration of initially blacklisted YARN nodes

2019-01-22 Thread Attila Zsolt Piros (JIRA)
Attila Zsolt Piros created SPARK-26688:
--

 Summary: Provide configuration of initially blacklisted YARN nodes
 Key: SPARK-26688
 URL: https://issues.apache.org/jira/browse/SPARK-26688
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 3.0.0
Reporter: Attila Zsolt Piros


Introducing new config for initially blacklisted YARN nodes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26688) Provide configuration of initially blacklisted YARN nodes

2019-01-22 Thread Attila Zsolt Piros (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16748632#comment-16748632
 ] 

Attila Zsolt Piros commented on SPARK-26688:


I am working on this.

> Provide configuration of initially blacklisted YARN nodes
> -
>
> Key: SPARK-26688
> URL: https://issues.apache.org/jira/browse/SPARK-26688
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> Introducing new config for initially blacklisted YARN nodes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26615) Fixing transport server/client resource leaks in the core unittests

2019-01-14 Thread Attila Zsolt Piros (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16742243#comment-16742243
 ] 

Attila Zsolt Piros commented on SPARK-26615:


I am working on it (soon a PR will be created)

> Fixing transport server/client resource leaks in the core unittests 
> 
>
> Key: SPARK-26615
> URL: https://issues.apache.org/jira/browse/SPARK-26615
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> The testing of the SPARK-24938 PR ([https://github.com/apache/spark/pull/22114)] 
> always fails with OOM. Analysing this problem led to identifying some 
> resource leaks where TransportClient/TransportServer instances are not closed 
> properly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26615) Fixing transport server/client resource leaks in the core unittests

2019-01-14 Thread Attila Zsolt Piros (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros updated SPARK-26615:
---
Affects Version/s: 3.0.0

> Fixing transport server/client resource leaks in the core unittests 
> 
>
> Key: SPARK-26615
> URL: https://issues.apache.org/jira/browse/SPARK-26615
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> The testing of the SPARK-24938 PR ([https://github.com/apache/spark/pull/22114)] 
> always fails with OOM. Analysing this problem led to identifying some 
> resource leaks where TransportClient/TransportServer instances are not closed 
> properly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26615) Fixing transport server/client resource leaks in the core unittests

2019-01-14 Thread Attila Zsolt Piros (JIRA)
Attila Zsolt Piros created SPARK-26615:
--

 Summary: Fixing transport server/client resource leaks in the core 
unittests 
 Key: SPARK-26615
 URL: https://issues.apache.org/jira/browse/SPARK-26615
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: Attila Zsolt Piros


The testing of the SPARK-24938 PR ([https://github.com/apache/spark/pull/22114]) 
always fails with OOM. Analysing this problem led to identifying some resource 
leaks where TransportClient/TransportServer instances are not closed properly.
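
A minimal sketch of the kind of cleanup such a fix typically introduces in the 
tests (the helper name is hypothetical, not the actual change):
{code:java}
import org.apache.spark.network.client.TransportClient
import org.apache.spark.network.server.TransportServer

// Run a test body and always close the transports afterwards, even when an
// assertion fails, so that the Netty resources they hold are released.
def withTransports(server: TransportServer, client: TransportClient)(body: => Unit): Unit = {
  try {
    body
  } finally {
    client.close()
    server.close()
  }
}
{code}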



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24920) Spark should allow sharing netty's memory pools across all uses

2018-12-10 Thread Attila Zsolt Piros (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16715040#comment-16715040
 ] 

Attila Zsolt Piros commented on SPARK-24920:


I started to work on it.

I would keep a separate memory pool for transport clients and transport 
servers. 
This way the cache allowance for each pool would be as before the change: for 
transport servers it would be on and for transport clients it would be off. 
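
A minimal sketch of that idea (the object and method names are hypothetical, 
and it assumes Netty 4.1's PooledByteBufAllocator constructor and its default 
helpers):
{code:java}
import io.netty.buffer.PooledByteBufAllocator

object SharedPooledAllocators {
  private def create(useCacheForAllThreads: Boolean): PooledByteBufAllocator =
    new PooledByteBufAllocator(
      true, // preferDirect
      PooledByteBufAllocator.defaultNumHeapArena(),
      PooledByteBufAllocator.defaultNumDirectArena(),
      PooledByteBufAllocator.defaultPageSize(),
      PooledByteBufAllocator.defaultMaxOrder(),
      PooledByteBufAllocator.defaultTinyCacheSize(),
      PooledByteBufAllocator.defaultSmallCacheSize(),
      PooledByteBufAllocator.defaultNormalCacheSize(),
      useCacheForAllThreads)

  // One shared pool per side: thread-local caching stays on for all transport
  // servers and off for all transport clients, as described above.
  val serverAllocator: PooledByteBufAllocator = create(useCacheForAllThreads = true)
  val clientAllocator: PooledByteBufAllocator = create(useCacheForAllThreads = false)
}
{code}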

 

> Spark should allow sharing netty's memory pools across all uses
> ---
>
> Key: SPARK-24920
> URL: https://issues.apache.org/jira/browse/SPARK-24920
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Priority: Major
>  Labels: memory-analysis
>
> Spark currently creates separate netty memory pools for each of the following 
> "services":
> 1) RPC Client
> 2) RPC Server
> 3) BlockTransfer Client
> 4) BlockTransfer Server
> 5) ExternalShuffle Client
> Depending on configuration and whether it's an executor or driver JVM, 
> a different subset of these is active, but it's always either 3 or 4.
> Having them independent somewhat defeats the purpose of using pools at all.  
> In my experiments I've found each pool will grow due to a burst of activity 
> in the related service (eg. task start / end msgs), followed by another burst in 
> a different service (eg. sending torrent broadcast blocks).  Because of the 
> way these pools work, they allocate memory in large chunks (16 MB by default) 
> for each netty thread, so there is often a surge of 128 MB of allocated 
> memory, even for really tiny messages.  Also a lot of this memory is offheap 
> by default, which makes it even tougher for users to manage.
> I think it would make more sense to combine all of these into a single pool.  
> In some experiments I tried, this noticeably decreased memory usage, both 
> onheap and offheap (no significant performance effect in my small 
> experiments).
> As this is a pretty core change, as a first step I'd propose just exposing 
> this as a conf, to let users experiment more broadly across a wider range of 
> workloads.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24920) Spark should allow sharing netty's memory pools across all uses

2018-12-10 Thread Attila Zsolt Piros (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16715040#comment-16715040
 ] 

Attila Zsolt Piros edited comment on SPARK-24920 at 12/10/18 4:39 PM:
--

I started to work on it.

I would keep a separate memory pool for transport clients and transport 
servers. 
 This way cache allowance for each pool would be like before this change: for 
transport servers it would be on and for transport clients it would be off.

 


was (Author: attilapiros):
I started to work on it.

I would keep a separate memory pool for transport clients and transport 
servers. 
This way cache allowance for each pool would be like before the change for 
transport servers it would be on and for transport clients it would be off. 

 

> Spark should allow sharing netty's memory pools across all uses
> ---
>
> Key: SPARK-24920
> URL: https://issues.apache.org/jira/browse/SPARK-24920
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Priority: Major
>  Labels: memory-analysis
>
> Spark currently creates separate netty memory pools for each of the following 
> "services":
> 1) RPC Client
> 2) RPC Server
> 3) BlockTransfer Client
> 4) BlockTransfer Server
> 5) ExternalShuffle Client
> Depending on configuration and whether it's an executor or driver JVM, 
> a different subset of these is active, but it's always either 3 or 4.
> Having them independent somewhat defeats the purpose of using pools at all.  
> In my experiments I've found each pool will grow due to a burst of activity 
> in the related service (eg. task start / end msgs), followed by another burst in 
> a different service (eg. sending torrent broadcast blocks).  Because of the 
> way these pools work, they allocate memory in large chunks (16 MB by default) 
> for each netty thread, so there is often a surge of 128 MB of allocated 
> memory, even for really tiny messages.  Also a lot of this memory is offheap 
> by default, which makes it even tougher for users to manage.
> I think it would make more sense to combine all of these into a single pool.  
> In some experiments I tried, this noticeably decreased memory usage, both 
> onheap and offheap (no significant performance effect in my small 
> experiments).
> As this is a pretty core change, as a first step I'd propose just exposing 
> this as a conf, to let users experiment more broadly across a wider range of 
> workloads.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26232) Remove old Netty3 dependency from the build

2018-11-30 Thread Attila Zsolt Piros (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros resolved SPARK-26232.

Resolution: Won't Fix

One of the third-party tools uses Netty3, so Netty3 is really needed as a 
transitive dependency.

> Remove old Netty3 dependency from the build 
> 
>
> Key: SPARK-26232
> URL: https://issues.apache.org/jira/browse/SPARK-26232
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> The old Netty artifact (3.9.9.Final) is unused.
> The reason it does not collide with Netty4 is that they use different package 
> names:
>  * Netty3: org.jboss.netty.*
>  * Netty4: io.netty.*
> Still, Netty3 is not needed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26232) Remove old Netty3 dependency from the build

2018-11-30 Thread Attila Zsolt Piros (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros updated SPARK-26232:
---
Summary: Remove old Netty3 dependency from the build   (was: Remove old 
Netty3 artifact from the build )

> Remove old Netty3 dependency from the build 
> 
>
> Key: SPARK-26232
> URL: https://issues.apache.org/jira/browse/SPARK-26232
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> The old Netty artifact (3.9.9.Final) is unused.
> The reason it does not collide with Netty4 is that they use different package 
> names:
>  * Netty3: org.jboss.netty.*
>  * Netty4: io.netty.*
> Still, Netty3 is not needed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26232) Remove old Netty3 artifact from the build

2018-11-30 Thread Attila Zsolt Piros (JIRA)
Attila Zsolt Piros created SPARK-26232:
--

 Summary: Remove old Netty3 artifact from the build 
 Key: SPARK-26232
 URL: https://issues.apache.org/jira/browse/SPARK-26232
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.0.0
Reporter: Attila Zsolt Piros


The old Netty artifact (3.9.9.Final) is unused.

The reason it does not collide with Netty4 is that they use different package 
names:
 * Netty3: org.jboss.netty.*
 * Netty4: io.netty.*

Still, Netty3 is not needed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26118) Make Jetty's requestHeaderSize configurable in Spark

2018-11-20 Thread Attila Zsolt Piros (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros updated SPARK-26118:
---
Issue Type: New Feature  (was: Bug)

> Make Jetty's requestHeaderSize configurable in Spark
> 
>
> Key: SPARK-26118
> URL: https://issues.apache.org/jira/browse/SPARK-26118
> Project: Spark
>  Issue Type: New Feature
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Major
> Fix For: 2.4.1, 3.0.0
>
>
> For long authorization fields the request header size could be over the 
> default limit (8192 bytes) and in this case Jetty replies HTTP 413 (Request 
> Entity Too Large).
> This issue may occur if the user is a member of many Active Directory user 
> groups.
> The HTTP request to the server contains the Kerberos token in the 
> WWW-Authenticate header. The header size increases together with the number 
> of user groups. 
> Currently there is no way in Spark to override this limit.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26118) Make Jetty's requestHeaderSize configurable in Spark

2018-11-20 Thread Attila Zsolt Piros (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros updated SPARK-26118:
---
Issue Type: Bug  (was: Improvement)

> Make Jetty's requestHeaderSize configurable in Spark
> 
>
> Key: SPARK-26118
> URL: https://issues.apache.org/jira/browse/SPARK-26118
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Major
> Fix For: 2.4.1, 3.0.0
>
>
> For long authorization fields the request header size could be over the 
> default limit (8192 bytes) and in this case Jetty replies HTTP 413 (Request 
> Entity Too Large).
> This issue may occur if the user is a member of many Active Directory user 
> groups.
> The HTTP request to the server contains the Kerberos token in the 
> WWW-Authenticate header. The header size increases together with the number 
> of user groups. 
> Currently there is no way in Spark to override this limit.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26118) Make Jetty's requestHeaderSize configurable in Spark

2018-11-19 Thread Attila Zsolt Piros (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692018#comment-16692018
 ] 

Attila Zsolt Piros commented on SPARK-26118:


I am working on a PR.

> Make Jetty's requestHeaderSize configurable in Spark
> 
>
> Key: SPARK-26118
> URL: https://issues.apache.org/jira/browse/SPARK-26118
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> For long authorization fields the request header size could be over the 
> default limit (8192 bytes) and in this case Jetty replies HTTP 413 (Request 
> Entity Too Large).
> This issue may occur if the user is a member of many Active Directory user 
> groups.
> The HTTP request to the server contains the Kerberos token in the 
> WWW-Authenticate header. The header size increases together with the number 
> of user groups. 
> Currently there is no way in Spark to override this limit.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26118) Make Jetty's requestHeaderSize configurable in Spark

2018-11-19 Thread Attila Zsolt Piros (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros updated SPARK-26118:
---
Affects Version/s: (was: 2.3.2)
   (was: 2.4.0)
   (was: 2.2.2)
   (was: 2.1.3)

> Make Jetty's requestHeaderSize configurable in Spark
> 
>
> Key: SPARK-26118
> URL: https://issues.apache.org/jira/browse/SPARK-26118
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> For long authorization fields the request header size could be over the 
> default limit (8192 bytes) and in this case Jetty replies HTTP 413 (Request 
> Entity Too Large).
> This issue may occur if the user is a member of many Active Directory user 
> groups.
> The HTTP request to the server contains the Kerberos token in the 
> WWW-Authenticate header. The header size increases together with the number 
> of user groups. 
> Currently there is no way in Spark to override this limit.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26118) Make Jetty's requestHeaderSize configurable in Spark

2018-11-19 Thread Attila Zsolt Piros (JIRA)
Attila Zsolt Piros created SPARK-26118:
--

 Summary: Make Jetty's requestHeaderSize configurable in Spark
 Key: SPARK-26118
 URL: https://issues.apache.org/jira/browse/SPARK-26118
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 2.4.0, 2.3.2, 2.2.2, 2.1.3, 3.0.0
Reporter: Attila Zsolt Piros


For long authorization fields, the request header size can exceed the default 
limit (8192 bytes); in this case Jetty replies with HTTP 413 (Request Entity Too 
Large).

This issue may occur if the user is a member of many Active Directory user 
groups.

The HTTP request to the server contains the Kerberos token in the 
WWW-Authenticate header. The header size increases together with the number of 
user groups. 

Currently there is no way in Spark to override this limit.
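
On the Jetty side the limit itself is a single setting; a minimal sketch of how 
it could be made configurable (the 16384 value and the wiring below are only 
illustrative, not the final Spark config):
{code:java}
import org.eclipse.jetty.server.{HttpConfiguration, HttpConnectionFactory, Server, ServerConnector}

// e.g. the value would come from a new Spark configuration entry
val requestHeaderSize = 16384 // bytes, instead of Jetty's 8192-byte default

val server = new Server()
val httpConfig = new HttpConfiguration()
httpConfig.setRequestHeaderSize(requestHeaderSize)
val connector = new ServerConnector(server, new HttpConnectionFactory(httpConfig))
connector.setPort(4040)
server.addConnector(connector)
server.start()
{code}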



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26002) SQL date operators calculates with incorrect dayOfYears for dates before 1500-03-01

2018-11-10 Thread Attila Zsolt Piros (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682404#comment-16682404
 ] 

Attila Zsolt Piros commented on SPARK-26002:


I am already working on a fix.

> SQL date operators calculates with incorrect dayOfYears for dates before 
> 1500-03-01
> ---
>
> Key: SPARK-26002
> URL: https://issues.apache.org/jira/browse/SPARK-26002
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.3, 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.3.2, 2.4.0, 
> 3.0.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> Running the following SQL the result is incorrect:
> {noformat}
> scala> sql("select dayOfYear('1500-01-02')").show()
> +---+
> |dayofyear(CAST(1500-01-02 AS DATE))|
> +---+
> |  1|
> +---+
> {noformat}
> This off by one day is more annoying right at the beginning of a year:
> {noformat}
> scala> sql("select year('1500-01-01')").show()
> +--+
> |year(CAST(1500-01-01 AS DATE))|
> +--+
> |  1499|
> +--+
> scala> sql("select month('1500-01-01')").show()
> +---+
> |month(CAST(1500-01-01 AS DATE))|
> +---+
> | 12|
> +---+
> scala> sql("select dayOfYear('1500-01-01')").show()
> +---+
> |dayofyear(CAST(1500-01-01 AS DATE))|
> +---+
> |365|
> +---+
> {noformat}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26002) SQL date operators calculates with incorrect dayOfYears for dates before 1500-03-01

2018-11-10 Thread Attila Zsolt Piros (JIRA)
Attila Zsolt Piros created SPARK-26002:
--

 Summary: SQL date operators calculates with incorrect dayOfYears 
for dates before 1500-03-01
 Key: SPARK-26002
 URL: https://issues.apache.org/jira/browse/SPARK-26002
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0, 2.3.2, 2.3.1, 2.3.0, 2.2.2, 2.2.1, 2.2.0, 2.1.3, 
3.0.0
Reporter: Attila Zsolt Piros


Running the following SQL, the result is incorrect:
{noformat}
scala> sql("select dayOfYear('1500-01-02')").show()
+---+
|dayofyear(CAST(1500-01-02 AS DATE))|
+---+
|  1|
+---+
{noformat}
This off-by-one-day error is even more annoying right at the beginning of a year:
{noformat}
scala> sql("select year('1500-01-01')").show()
+--+
|year(CAST(1500-01-01 AS DATE))|
+--+
|  1499|
+--+


scala> sql("select month('1500-01-01')").show()
+---+
|month(CAST(1500-01-01 AS DATE))|
+---+
| 12|
+---+


scala> sql("select dayOfYear('1500-01-01')").show()
+---+
|dayofyear(CAST(1500-01-01 AS DATE))|
+---+
|365|
+---+
{noformat}
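
One plausible source of a discrepancy that starts exactly at 1500-03-01 is the 
difference between the Julian and the proleptic Gregorian leap-year rules (this 
is only an illustration of the calendar mismatch, not the fix itself):
{code:java}
import java.time.LocalDate
import java.util.{GregorianCalendar, TimeZone}

// proleptic Gregorian calendar (java.time): 1500 is NOT a leap year
println(LocalDate.of(1500, 1, 1).isLeapYear)   // false
println(LocalDate.of(1500, 1, 2).getDayOfYear) // 2

// hybrid Julian/Gregorian calendar (java.util): 1500 IS a leap year, because
// dates before the 1582 cutover follow the Julian rule (every 4th year)
val hybrid = new GregorianCalendar(TimeZone.getTimeZone("UTC"))
println(hybrid.isLeapYear(1500))               // true
{code}
Mixing the two conventions in a day-count computation yields exactly this kind 
of off-by-one-day result for dates before 1500-03-01.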
 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24578) Reading remote cache block behavior changes and causes timeout issue

2018-06-19 Thread Attila Zsolt Piros (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16517515#comment-16517515
 ] 

Attila Zsolt Piros commented on SPARK-24578:


[~wbzhao] oh sorry, I read your comment late. Definitely, without your help 
(pointing to the commit that caused the problem) it would have taken much, much 
longer. If you like, please create another PR, it is fine with me.

> Reading remote cache block behavior changes and causes timeout issue
> 
>
> Key: SPARK-24578
> URL: https://issues.apache.org/jira/browse/SPARK-24578
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Wenbo Zhao
>Priority: Major
>
> After Spark 2.3, we observed lots of errors like the following in some of our 
> production job
> {code:java}
> 18/06/15 20:59:42 ERROR TransportRequestHandler: Error sending result 
> ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=91672904003, 
> chunkIndex=0}, 
> buffer=org.apache.spark.storage.BlockManagerManagedBuffer@783a9324} to 
> /172.22.18.7:60865; closing connection
> java.io.IOException: Broken pipe
> at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
> at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
> at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
> at sun.nio.ch.IOUtil.write(IOUtil.java:65)
> at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
> at 
> org.apache.spark.network.protocol.MessageWithHeader.writeNioBuffer(MessageWithHeader.java:156)
> at 
> org.apache.spark.network.protocol.MessageWithHeader.copyByteBuf(MessageWithHeader.java:142)
> at 
> org.apache.spark.network.protocol.MessageWithHeader.transferTo(MessageWithHeader.java:123)
> at 
> io.netty.channel.socket.nio.NioSocketChannel.doWriteFileRegion(NioSocketChannel.java:355)
> at 
> io.netty.channel.nio.AbstractNioByteChannel.doWrite(AbstractNioByteChannel.java:224)
> at 
> io.netty.channel.socket.nio.NioSocketChannel.doWrite(NioSocketChannel.java:382)
> at 
> io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:934)
> at 
> io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.flush0(AbstractNioChannel.java:362)
> at 
> io.netty.channel.AbstractChannel$AbstractUnsafe.flush(AbstractChannel.java:901)
> at 
> io.netty.channel.DefaultChannelPipeline$HeadContext.flush(DefaultChannelPipeline.java:1321)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:776)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:768)
> at 
> io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:749)
> at 
> io.netty.channel.ChannelOutboundHandlerAdapter.flush(ChannelOutboundHandlerAdapter.java:115)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:776)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:768)
> at 
> io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:749)
> at io.netty.channel.ChannelDuplexHandler.flush(ChannelDuplexHandler.java:117)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:776)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:768)
> at 
> io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:749)
> at 
> io.netty.channel.DefaultChannelPipeline.flush(DefaultChannelPipeline.java:983)
> at io.netty.channel.AbstractChannel.flush(AbstractChannel.java:248)
> at 
> io.netty.channel.nio.AbstractNioByteChannel$1.run(AbstractNioByteChannel.java:284)
> at 
> io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
> at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
> {code}
>  
> Here is a small reproducible for a small cluster of 2 executors (say host-1 
> and host-2) each with 8 cores. The memory of the driver and executors is 
> not an important factor here as long as it is big enough, say 20G. 
> {code:java}
> val n = 1
> val df0 = sc.parallelize(1 to n).toDF
> val df = df0.withColumn("x0", rand()).withColumn("x0", rand()
> ).withColumn("x1", rand()
> ).withColumn("x2", rand()
> ).withColumn("x3", rand()
> ).withColumn("x4", rand()
> ).withColumn("x5", rand()
> ).withColumn("x6", rand()

[jira] [Commented] (SPARK-24578) Reading remote cache block behavior changes and causes timeout issue

2018-06-19 Thread Attila Zsolt Piros (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16517452#comment-16517452
 ] 

Attila Zsolt Piros commented on SPARK-24578:


[~wbzhao] Yes, you are right, readerIndex should be the first argument (I just 
checked skipBytes), thanks.

[~irashid] Yes, I can create a PR soon.

[~vanzin] Netty was upgraded between 2.2.1 and 2.3.0: from 4.0.43.Final to 
4.1.17.Final. Could that be the cause?

> Reading remote cache block behavior changes and causes timeout issue
> 
>
> Key: SPARK-24578
> URL: https://issues.apache.org/jira/browse/SPARK-24578
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Wenbo Zhao
>Priority: Major
>
> After Spark 2.3, we observed lots of errors like the following in some of our 
> production job
> {code:java}
> 18/06/15 20:59:42 ERROR TransportRequestHandler: Error sending result 
> ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=91672904003, 
> chunkIndex=0}, 
> buffer=org.apache.spark.storage.BlockManagerManagedBuffer@783a9324} to 
> /172.22.18.7:60865; closing connection
> java.io.IOException: Broken pipe
> at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
> at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
> at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
> at sun.nio.ch.IOUtil.write(IOUtil.java:65)
> at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
> at 
> org.apache.spark.network.protocol.MessageWithHeader.writeNioBuffer(MessageWithHeader.java:156)
> at 
> org.apache.spark.network.protocol.MessageWithHeader.copyByteBuf(MessageWithHeader.java:142)
> at 
> org.apache.spark.network.protocol.MessageWithHeader.transferTo(MessageWithHeader.java:123)
> at 
> io.netty.channel.socket.nio.NioSocketChannel.doWriteFileRegion(NioSocketChannel.java:355)
> at 
> io.netty.channel.nio.AbstractNioByteChannel.doWrite(AbstractNioByteChannel.java:224)
> at 
> io.netty.channel.socket.nio.NioSocketChannel.doWrite(NioSocketChannel.java:382)
> at 
> io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:934)
> at 
> io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.flush0(AbstractNioChannel.java:362)
> at 
> io.netty.channel.AbstractChannel$AbstractUnsafe.flush(AbstractChannel.java:901)
> at 
> io.netty.channel.DefaultChannelPipeline$HeadContext.flush(DefaultChannelPipeline.java:1321)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:776)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:768)
> at 
> io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:749)
> at 
> io.netty.channel.ChannelOutboundHandlerAdapter.flush(ChannelOutboundHandlerAdapter.java:115)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:776)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:768)
> at 
> io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:749)
> at io.netty.channel.ChannelDuplexHandler.flush(ChannelDuplexHandler.java:117)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:776)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:768)
> at 
> io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:749)
> at 
> io.netty.channel.DefaultChannelPipeline.flush(DefaultChannelPipeline.java:983)
> at io.netty.channel.AbstractChannel.flush(AbstractChannel.java:248)
> at 
> io.netty.channel.nio.AbstractNioByteChannel$1.run(AbstractNioByteChannel.java:284)
> at 
> io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
> at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
> {code}
>  
> Here is a small reproducible for a small cluster of 2 executors (say host-1 
> and host-2) each with 8 cores. The memory of the driver and executors is 
> not an important factor here as long as it is big enough, say 20G. 
> {code:java}
> val n = 1
> val df0 = sc.parallelize(1 to n).toDF
> val df = df0.withColumn("x0", rand()).withColumn("x0", rand()
> ).withColumn("x1", rand()
> ).withColumn("x2", rand()
> ).withColumn("x3", rand()
> ).withColumn("x4", rand()
> ).withColumn("x5", rand()
> 

[jira] [Commented] (SPARK-24578) Reading remote cache block behavior changes and causes timeout issue

2018-06-19 Thread Attila Zsolt Piros (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16517416#comment-16517416
 ] 

Attila Zsolt Piros commented on SPARK-24578:


I have written a small test. I know it is a bit naive, but I still think it shows 
us something:

[https://gist.github.com/attilapiros/730d67b62317d14f5fd0f6779adea245]

And the result is:
{noformat}
n = 500 
duration221 = 2 
duration221.new = 0 
duration230 = 5242 
duration230.new = 7
{noformat}
My guess is that the receiver times out and the sender is writing into a closed socket.

> Reading remote cache block behavior changes and causes timeout issue
> 
>
> Key: SPARK-24578
> URL: https://issues.apache.org/jira/browse/SPARK-24578
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Wenbo Zhao
>Priority: Major
>
> After Spark 2.3, we observed lots of errors like the following in some of our 
> production job
> {code:java}
> 18/06/15 20:59:42 ERROR TransportRequestHandler: Error sending result 
> ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=91672904003, 
> chunkIndex=0}, 
> buffer=org.apache.spark.storage.BlockManagerManagedBuffer@783a9324} to 
> /172.22.18.7:60865; closing connection
> java.io.IOException: Broken pipe
> at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
> at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
> at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
> at sun.nio.ch.IOUtil.write(IOUtil.java:65)
> at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
> at 
> org.apache.spark.network.protocol.MessageWithHeader.writeNioBuffer(MessageWithHeader.java:156)
> at 
> org.apache.spark.network.protocol.MessageWithHeader.copyByteBuf(MessageWithHeader.java:142)
> at 
> org.apache.spark.network.protocol.MessageWithHeader.transferTo(MessageWithHeader.java:123)
> at 
> io.netty.channel.socket.nio.NioSocketChannel.doWriteFileRegion(NioSocketChannel.java:355)
> at 
> io.netty.channel.nio.AbstractNioByteChannel.doWrite(AbstractNioByteChannel.java:224)
> at 
> io.netty.channel.socket.nio.NioSocketChannel.doWrite(NioSocketChannel.java:382)
> at 
> io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:934)
> at 
> io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.flush0(AbstractNioChannel.java:362)
> at 
> io.netty.channel.AbstractChannel$AbstractUnsafe.flush(AbstractChannel.java:901)
> at 
> io.netty.channel.DefaultChannelPipeline$HeadContext.flush(DefaultChannelPipeline.java:1321)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:776)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:768)
> at 
> io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:749)
> at 
> io.netty.channel.ChannelOutboundHandlerAdapter.flush(ChannelOutboundHandlerAdapter.java:115)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:776)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:768)
> at 
> io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:749)
> at io.netty.channel.ChannelDuplexHandler.flush(ChannelDuplexHandler.java:117)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:776)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:768)
> at 
> io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:749)
> at 
> io.netty.channel.DefaultChannelPipeline.flush(DefaultChannelPipeline.java:983)
> at io.netty.channel.AbstractChannel.flush(AbstractChannel.java:248)
> at 
> io.netty.channel.nio.AbstractNioByteChannel$1.run(AbstractNioByteChannel.java:284)
> at 
> io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
> at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
> {code}
>  
> Here is a small reproducible for a small cluster of 2 executors (say host-1 
> and host-2) each with 8 cores. The memory of the driver and executors is 
> not an important factor here as long as it is big enough, say 20G. 
> {code:java}
> val n = 1
> val df0 = sc.parallelize(1 to n).toDF
> val df = df0.withColumn("x0", rand()).withColumn("x0", rand()
> 

[jira] [Commented] (SPARK-24578) Reading remote cache block behavior changes and causes timeout issue

2018-06-19 Thread Attila Zsolt Piros (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16517287#comment-16517287
 ] 

Attila Zsolt Piros commented on SPARK-24578:


copyByteBuf(), along with transferTo(), is called several times on a huge 
byteBuf. If we have a really huge chunk built from small chunks, we are 
re-merging the small chunks again and again in 
[https://github.com/apache/spark/blob/v2.3.0/common/network-common/src/main/java/org/apache/spark/network/protocol/MessageWithHeader.java#L140]
although we need only a very small part of it.

Is it possible the following line would help?
{code:java}
ByteBuffer buffer = buf.nioBuffer(0, Math.min(buf.readableBytes(), 
NIO_BUFFER_LIMIT));{code}
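
A hedged sketch of where this idea leads (using readerIndex instead of 0, as 
confirmed in the follow-up comment above; the method shape and the constant 
value are only illustrative, not the actual patch):
{code:java}
import java.nio.channels.WritableByteChannel
import io.netty.buffer.ByteBuf

// Cap how much of the (possibly composite) buffer is materialized as a single
// NIO buffer per write call, instead of re-merging the whole buffer every time.
val NIO_BUFFER_LIMIT = 256 * 1024

def copyByteBuf(buf: ByteBuf, target: WritableByteChannel): Int = {
  val length = math.min(buf.readableBytes(), NIO_BUFFER_LIMIT)
  // expose only the next `length` readable bytes, starting at readerIndex
  val nioBuf = buf.nioBuffer(buf.readerIndex(), length)
  val written = target.write(nioBuf)
  buf.skipBytes(written) // advance readerIndex for the next transferTo() call
  written
}
{code}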

> Reading remote cache block behavior changes and causes timeout issue
> 
>
> Key: SPARK-24578
> URL: https://issues.apache.org/jira/browse/SPARK-24578
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Wenbo Zhao
>Priority: Major
>
> After Spark 2.3, we observed lots of errors like the following in some of our 
> production job
> {code:java}
> 18/06/15 20:59:42 ERROR TransportRequestHandler: Error sending result 
> ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=91672904003, 
> chunkIndex=0}, 
> buffer=org.apache.spark.storage.BlockManagerManagedBuffer@783a9324} to 
> /172.22.18.7:60865; closing connection
> java.io.IOException: Broken pipe
> at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
> at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
> at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
> at sun.nio.ch.IOUtil.write(IOUtil.java:65)
> at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
> at 
> org.apache.spark.network.protocol.MessageWithHeader.writeNioBuffer(MessageWithHeader.java:156)
> at 
> org.apache.spark.network.protocol.MessageWithHeader.copyByteBuf(MessageWithHeader.java:142)
> at 
> org.apache.spark.network.protocol.MessageWithHeader.transferTo(MessageWithHeader.java:123)
> at 
> io.netty.channel.socket.nio.NioSocketChannel.doWriteFileRegion(NioSocketChannel.java:355)
> at 
> io.netty.channel.nio.AbstractNioByteChannel.doWrite(AbstractNioByteChannel.java:224)
> at 
> io.netty.channel.socket.nio.NioSocketChannel.doWrite(NioSocketChannel.java:382)
> at 
> io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:934)
> at 
> io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.flush0(AbstractNioChannel.java:362)
> at 
> io.netty.channel.AbstractChannel$AbstractUnsafe.flush(AbstractChannel.java:901)
> at 
> io.netty.channel.DefaultChannelPipeline$HeadContext.flush(DefaultChannelPipeline.java:1321)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:776)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:768)
> at 
> io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:749)
> at 
> io.netty.channel.ChannelOutboundHandlerAdapter.flush(ChannelOutboundHandlerAdapter.java:115)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:776)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:768)
> at 
> io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:749)
> at io.netty.channel.ChannelDuplexHandler.flush(ChannelDuplexHandler.java:117)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:776)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:768)
> at 
> io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:749)
> at 
> io.netty.channel.DefaultChannelPipeline.flush(DefaultChannelPipeline.java:983)
> at io.netty.channel.AbstractChannel.flush(AbstractChannel.java:248)
> at 
> io.netty.channel.nio.AbstractNioByteChannel$1.run(AbstractNioByteChannel.java:284)
> at 
> io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
> at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
> {code}
>  
> Here is a small reproducible for a small cluster of 2 executors 

[jira] [Commented] (SPARK-24594) Introduce metrics for YARN executor allocation problems

2018-06-19 Thread Attila Zsolt Piros (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16517028#comment-16517028
 ] 

Attila Zsolt Piros commented on SPARK-24594:


I am working on this.

> Introduce metrics for YARN executor allocation problems 
> 
>
> Key: SPARK-24594
> URL: https://issues.apache.org/jira/browse/SPARK-24594
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.4.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> Within SPARK-16630 the idea came up to introduce metrics for YARN executor 
> allocation problems.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24594) Introduce metrics for YARN executor allocation problems

2018-06-19 Thread Attila Zsolt Piros (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros updated SPARK-24594:
---
Description: Within SPARK-16630 it come up to introduce metrics for  YARN 
executor allocation problems.  (was: Within 
https://issues.apache.org/jira/browse/SPARK-16630 it come up to introduce 
metrics for  YARN executor allocation problems.)

> Introduce metrics for YARN executor allocation problems 
> 
>
> Key: SPARK-24594
> URL: https://issues.apache.org/jira/browse/SPARK-24594
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.4.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> Within SPARK-16630 the idea came up to introduce metrics for YARN executor 
> allocation problems.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24594) Introduce metrics for YARN executor allocation problems

2018-06-19 Thread Attila Zsolt Piros (JIRA)
Attila Zsolt Piros created SPARK-24594:
--

 Summary: Introduce metrics for YARN executor allocation problems 
 Key: SPARK-24594
 URL: https://issues.apache.org/jira/browse/SPARK-24594
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 2.4.0
Reporter: Attila Zsolt Piros


Within https://issues.apache.org/jira/browse/SPARK-16630 the idea came up to 
introduce metrics for YARN executor allocation problems.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19526) Spark should raise an exception when it tries to read a Hive view but it doesn't have read access on the corresponding table(s)

2018-06-07 Thread Attila Zsolt Piros (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-19526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16504827#comment-16504827
 ] 

Attila Zsolt Piros commented on SPARK-19526:


I started working on this issue.

> Spark should raise an exception when it tries to read a Hive view but it 
> doesn't have read access on the corresponding table(s)
> ---
>
> Key: SPARK-19526
> URL: https://issues.apache.org/jira/browse/SPARK-19526
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.4, 2.0.3, 2.2.0, 2.3.0
>Reporter: Reza Safi
>Priority: Major
>
> Spark sees a Hive view as a set of HDFS "files". So to read anything from a 
> Hive view, Spark needs access to all of the files that belong to the 
> table(s) that the view queries.  In other words, a Spark user cannot be 
> granted fine-grained permissions at the level of Hive columns or records.
> Consider a Spark job that contains a SQL query that tries to 
> read a Hive view. Currently the Spark job will finish successfully if the 
> user that runs the Spark job doesn't have proper read access permissions to 
> the tables that the Hive view has been built on top of. It will just 
> return an empty result set. This can be confusing for the users, since the 
> job finishes without any exception or error. 
> Spark should raise an exception like AccessDenied when it tries to run a 
> Hive view query and its user doesn't have proper permissions to the tables 
> that the Hive view is created on top of. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19181) SparkListenerSuite.local metrics fails when average executorDeserializeTime is too short.

2018-05-08 Thread Attila Zsolt Piros (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16467137#comment-16467137
 ] 

Attila Zsolt Piros commented on SPARK-19181:


I am working on this.

> SparkListenerSuite.local metrics fails when average executorDeserializeTime 
> is too short.
> -
>
> Key: SPARK-19181
> URL: https://issues.apache.org/jira/browse/SPARK-19181
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.1.0
>Reporter: Jose Soltren
>Priority: Minor
>
> https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/scheduler/SparkListenerSuite.scala#L249
> The "local metrics" test asserts that tasks should take more than 1ms on 
> average to complete, even though a code comment notes that this is a small 
> test and tasks may finish faster. I've been seeing some "failures" here on 
> fast systems that finish these tasks quite quickly.
> There are a few ways forward here:
> 1. Disable this test.
> 2. Relax this check.
> 3. Implement sub-millisecond granularity for task times throughout Spark.
> 4. (Imran Rashid's suggestion) Add buffer time by, say, having the task 
> reference a partition that implements a custom Externalizable.readExternal, 
> which always waits 1ms before returning.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16630) Blacklist a node if executors won't launch on it.

2018-04-10 Thread Attila Zsolt Piros (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432495#comment-16432495
 ] 

Attila Zsolt Piros commented on SPARK-16630:


Let me illustrate my problem with an example:
- the limit for blacklisted nodes is configured to 2
- we have one node blacklisted close to the YARN allocator ("host1" -> 
expiryTime1); this is the new code I am working on
- the scheduler requests new executors along with the (task-level) blacklisted 
nodes: "host2", "host3" 
(org.apache.spark.deploy.yarn.YarnAllocator#requestTotalExecutorsWithPreferredLocalities)

So I have to choose 2 nodes to communicate towards YARN. My idea is to pass 
expiryTime2 and expiryTime3 to the YarnAllocator to choose the 2 most relevant 
nodes (the ones which expire later are the more relevant ones); see the sketch 
below.
For this, in the case class RequestExecutors the nodeBlacklist field type is 
changed from Set[String] to Map[String, Long].
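
A minimal sketch of that selection (the helper name is hypothetical):
{code:java}
// Keep only the `limit` nodes whose blacklisting expires latest, i.e. the most
// recently blacklisted and therefore most relevant ones.
def selectNodesToReportToYarn(blacklistWithExpiry: Map[String, Long], limit: Int): Set[String] =
  blacklistWithExpiry.toSeq
    .sortBy { case (_, expiryTime) => -expiryTime } // latest expiry first
    .take(limit)
    .map { case (node, _) => node }
    .toSet

// With the example above (limit = 2, and expiryTime3 > expiryTime2 > expiryTime1):
// selectNodesToReportToYarn(Map("host1" -> 1L, "host2" -> 2L, "host3" -> 3L), 2)
//   == Set("host2", "host3")
{code}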

> Blacklist a node if executors won't launch on it.
> -
>
> Key: SPARK-16630
> URL: https://issues.apache.org/jira/browse/SPARK-16630
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 1.6.2
>Reporter: Thomas Graves
>Priority: Major
>
> On YARN, its possible that a node is messed or misconfigured such that a 
> container won't launch on it.  For instance if the Spark external shuffle 
> handler didn't get loaded on it , maybe its just some other hardware issue or 
> hadoop configuration issue. 
> It would be nice we could recognize this happening and stop trying to launch 
> executors on it since that could end up causing us to hit our max number of 
> executor failures and then kill the job.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16630) Blacklist a node if executors won't launch on it.

2018-04-10 Thread Attila Zsolt Piros (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432415#comment-16432415
 ] 

Attila Zsolt Piros commented on SPARK-16630:


I would need the expiry times to choose the most relevant (freshest) subset 
of nodes to blacklist when the limit is less than the size of the union of all 
blacklist-able nodes. So the expiry time is only used for sorting.


> Blacklist a node if executors won't launch on it.
> -
>
> Key: SPARK-16630
> URL: https://issues.apache.org/jira/browse/SPARK-16630
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 1.6.2
>Reporter: Thomas Graves
>Priority: Major
>
> On YARN, its possible that a node is messed or misconfigured such that a 
> container won't launch on it.  For instance if the Spark external shuffle 
> handler didn't get loaded on it , maybe its just some other hardware issue or 
> hadoop configuration issue. 
> It would be nice we could recognize this happening and stop trying to launch 
> executors on it since that could end up causing us to hit our max number of 
> executor failures and then kill the job.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16630) Blacklist a node if executors won't launch on it.

2018-04-10 Thread Attila Zsolt Piros (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432165#comment-16432165
 ] 

Attila Zsolt Piros edited comment on SPARK-16630 at 4/10/18 12:15 PM:
--

I have a question regarding limiting the number of blacklisted nodes according to 
the cluster size. 
With this change there will be two sources of nodes to be blacklisted: 
- one list is coming from the scheduler (the existing node-level blacklisting) 
- the other is computed here close to the YARN allocator (stored along with the 
expiry times)

I think it makes sense to apply the limit to the complete list (union) of 
blacklisted nodes, am I right? 
If this limit is for the complete list, then regarding the subset I think the 
newly blacklisted nodes are more up-to-date than the earlier 
blacklisted ones. 
So I would pass the expiry times from the scheduler to the YARN allocator to 
build the subset of blacklisted nodes to be communicated to YARN. What is your 
opinion?


was (Author: attilapiros):
I have question regarding limiting the number of blacklisted nodes according to 
the cluster size. 
With this change there will be two sources of nodes to be backlisted: 
- one list is coming from the scheduler (existing node level backlisting) 
- the other is computed here close to the YARN (stored along with the expiry 
times)

I think it makes sense to have the limit for the complete list (union) of 
blacklisted nodes, am I right? 
If this limit is for the complete list then regarding the subset I think the 
newly blacklisted nodes are more up-to-date to be used then the earlier 
backlisted ones. 
So I would pass the expiry times from the scheduler to the YARN allocator to 
make the subset of backlisted nodes to be communicated to YARN. What is your 
opinion?

> Blacklist a node if executors won't launch on it.
> -
>
> Key: SPARK-16630
> URL: https://issues.apache.org/jira/browse/SPARK-16630
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 1.6.2
>Reporter: Thomas Graves
>Priority: Major
>
> On YARN, its possible that a node is messed or misconfigured such that a 
> container won't launch on it.  For instance if the Spark external shuffle 
> handler didn't get loaded on it , maybe its just some other hardware issue or 
> hadoop configuration issue. 
> It would be nice we could recognize this happening and stop trying to launch 
> executors on it since that could end up causing us to hit our max number of 
> executor failures and then kill the job.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16630) Blacklist a node if executors won't launch on it.

2018-04-10 Thread Attila Zsolt Piros (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432165#comment-16432165
 ] 

Attila Zsolt Piros commented on SPARK-16630:


I have a question regarding limiting the number of blacklisted nodes according to 
the cluster size. 
With this change there will be two sources of nodes to be blacklisted: 
- one list is coming from the scheduler (the existing node-level blacklisting) 
- the other is computed here close to YARN (stored along with the expiry 
times)

I think it makes sense to apply the limit to the complete list (union) of 
blacklisted nodes, am I right? 
If this limit is for the complete list, then regarding the subset I think the 
newly blacklisted nodes are more up-to-date than the earlier 
blacklisted ones. 
So I would pass the expiry times from the scheduler to the YARN allocator to 
build the subset of blacklisted nodes to be communicated to YARN. What is your 
opinion?

> Blacklist a node if executors won't launch on it.
> -
>
> Key: SPARK-16630
> URL: https://issues.apache.org/jira/browse/SPARK-16630
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 1.6.2
>Reporter: Thomas Graves
>Priority: Major
>
> On YARN, its possible that a node is messed or misconfigured such that a 
> container won't launch on it.  For instance if the Spark external shuffle 
> handler didn't get loaded on it , maybe its just some other hardware issue or 
> hadoop configuration issue. 
> It would be nice we could recognize this happening and stop trying to launch 
> executors on it since that could end up causing us to hit our max number of 
> executor failures and then kill the job.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16630) Blacklist a node if executors won't launch on it.

2018-04-06 Thread Attila Zsolt Piros (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16428721#comment-16428721
 ] 

Attila Zsolt Piros commented on SPARK-16630:


Yes, you are right: more executors can be allocated on the same host. I have 
checked, and using getNumClusterNodes is not that hard, so I will do that. 

> Blacklist a node if executors won't launch on it.
> -
>
> Key: SPARK-16630
> URL: https://issues.apache.org/jira/browse/SPARK-16630
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 1.6.2
>Reporter: Thomas Graves
>Priority: Major
>
> On YARN, its possible that a node is messed or misconfigured such that a 
> container won't launch on it.  For instance if the Spark external shuffle 
> handler didn't get loaded on it , maybe its just some other hardware issue or 
> hadoop configuration issue. 
> It would be nice we could recognize this happening and stop trying to launch 
> executors on it since that could end up causing us to hit our max number of 
> executor failures and then kill the job.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16630) Blacklist a node if executors won't launch on it.

2018-04-05 Thread Attila Zsolt Piros (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16427181#comment-16427181
 ] 

Attila Zsolt Piros commented on SPARK-16630:


[~tgraves] what about stopping YARN blacklisting when a configured limit is 
reached, with a default of "spark.executor.instances" * 
"spark.yarn.backlisting.default.executor.instances.size.weight" (a better name 
for the weight is welcome)? The limit would count all the blacklisted nodes, 
even the stage- and task-level blacklisted ones, and in case of dynamic 
allocation the default would be Int.MaxValue, so there would be no limit at all.

This idea comes from the calculation of the default for maxNumExecutorFailures.
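
A minimal sketch of the proposed default (the config key names follow the 
comment above and are not final; the helper name is hypothetical):
{code:java}
import org.apache.spark.SparkConf

def defaultMaxBlacklistedNodes(conf: SparkConf, dynamicAllocationEnabled: Boolean): Int =
  if (dynamicAllocationEnabled) {
    Int.MaxValue // effectively no limit at all
  } else {
    (conf.getInt("spark.executor.instances", 2) *
      conf.getDouble("spark.yarn.backlisting.default.executor.instances.size.weight", 1.0)).toInt
  }
{code}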

> Blacklist a node if executors won't launch on it.
> -
>
> Key: SPARK-16630
> URL: https://issues.apache.org/jira/browse/SPARK-16630
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 1.6.2
>Reporter: Thomas Graves
>Priority: Major
>
> On YARN, its possible that a node is messed or misconfigured such that a 
> container won't launch on it.  For instance if the Spark external shuffle 
> handler didn't get loaded on it , maybe its just some other hardware issue or 
> hadoop configuration issue. 
> It would be nice we could recognize this happening and stop trying to launch 
> executors on it since that could end up causing us to hit our max number of 
> executor failures and then kill the job.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16630) Blacklist a node if executors won't launch on it.

2018-04-05 Thread Attila Zsolt Piros (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16427123#comment-16427123
 ] 

Attila Zsolt Piros commented on SPARK-16630:


[~irashid] I would reuse spark.blacklist.application.maxFailedExecutorsPerNode, 
which already has a default of 2. I think it makes sense to use the same limit 
for per-node failures before adding the node to the blacklist. But in this case 
I cannot factor spark.yarn.max.executor.failures into the default.

> Blacklist a node if executors won't launch on it.
> -
>
> Key: SPARK-16630
> URL: https://issues.apache.org/jira/browse/SPARK-16630
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 1.6.2
>Reporter: Thomas Graves
>Priority: Major
>
> On YARN, it is possible that a node is messed up or misconfigured such that a 
> container won't launch on it. For instance, the Spark external shuffle 
> handler might not have been loaded on it, or it might just be some other 
> hardware or Hadoop configuration issue. 
> It would be nice if we could recognize this happening and stop trying to 
> launch executors on it, since that could end up causing us to hit our max 
> number of executor failures and then kill the job.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23728) ML test with expected exceptions testing streaming fails on 2.3

2018-03-17 Thread Attila Zsolt Piros (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros updated SPARK-23728:
---
Summary: ML test with expected exceptions testing streaming fails on 2.3  
(was: ML test with expected exceptions via streaming fails on 2.3)

> ML test with expected exceptions testing streaming fails on 2.3
> ---
>
> Key: SPARK-23728
> URL: https://issues.apache.org/jira/browse/SPARK-23728
> Project: Spark
>  Issue Type: Bug
>  Components: ML, Tests
>Affects Versions: 2.3.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> The testTransformerByInterceptingException helper fails to catch the expected 
> message, because on 2.3, when an exception happens within an ML feature 
> during streaming, the feature-generated message is not on the direct 
> "caused by" exception but one level deeper in the cause chain. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23728) ML test with expected exceptions via streaming fails on 2.3

2018-03-17 Thread Attila Zsolt Piros (JIRA)
Attila Zsolt Piros created SPARK-23728:
--

 Summary: ML test with expected exceptions via streaming fails on 
2.3
 Key: SPARK-23728
 URL: https://issues.apache.org/jira/browse/SPARK-23728
 Project: Spark
  Issue Type: Bug
  Components: ML, Tests
Affects Versions: 2.3.0
Reporter: Attila Zsolt Piros


The testTransformerByInterceptingException helper fails to catch the expected 
message, because on 2.3, when an exception happens within an ML feature during 
streaming, the feature-generated message is not on the direct "caused by" 
exception but one level deeper in the cause chain. 
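
One way to handle this is to walk the whole cause chain instead of checking 
only the direct cause. A minimal sketch (the helper name is illustrative, not 
the actual test utility):

{code:scala}
import scala.annotation.tailrec

// Return true if the expected message appears anywhere in the cause chain,
// not only on the direct "caused by" exception.
@tailrec
def causeChainContains(t: Throwable, expectedMessage: String): Boolean = {
  if (t == null) {
    false
  } else if (Option(t.getMessage).exists(_.contains(expectedMessage))) {
    true
  } else {
    causeChainContains(t.getCause, expectedMessage)
  }
}
{code}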



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23728) ML test with expected exceptions via streaming fails on 2.3

2018-03-17 Thread Attila Zsolt Piros (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16403850#comment-16403850
 ] 

Attila Zsolt Piros commented on SPARK-23728:


I will create a PR soon.

> ML test with expected exceptions via streaming fails on 2.3
> ---
>
> Key: SPARK-23728
> URL: https://issues.apache.org/jira/browse/SPARK-23728
> Project: Spark
>  Issue Type: Bug
>  Components: ML, Tests
>Affects Versions: 2.3.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> The testTransformerByInterceptingException helper fails to catch the expected 
> message, because on 2.3, when an exception happens within an ML feature 
> during streaming, the feature-generated message is not on the direct 
> "caused by" exception but one level deeper in the cause chain. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16630) Blacklist a node if executors won't launch on it.

2018-03-08 Thread Attila Zsolt Piros (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16391649#comment-16391649
 ] 

Attila Zsolt Piros commented on SPARK-16630:


I am working on this issue.

> Blacklist a node if executors won't launch on it.
> -
>
> Key: SPARK-16630
> URL: https://issues.apache.org/jira/browse/SPARK-16630
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 1.6.2
>Reporter: Thomas Graves
>Priority: Major
>
> On YARN, it is possible that a node is messed up or misconfigured such that a 
> container won't launch on it. For instance, the Spark external shuffle 
> handler might not have been loaded on it, or it might just be some other 
> hardware or Hadoop configuration issue. 
> It would be nice if we could recognize this happening and stop trying to 
> launch executors on it, since that could end up causing us to hit our max 
> number of executor failures and then kill the job.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16630) Blacklist a node if executors won't launch on it.

2018-03-05 Thread Attila Zsolt Piros (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16387071#comment-16387071
 ] 

Attila Zsolt Piros commented on SPARK-16630:


Of course, we can consider "spark.yarn.executor.failuresValidityInterval" for 
the "YARN-level" blacklisting too.

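For illustration, a rough sketch of how the validity interval could be honoured 
for these counts as well (the names are made up; a real implementation would 
likely reuse Spark's internal clock utility instead of System.currentTimeMillis):

{code:scala}
import scala.collection.mutable

// Only failures within the validity interval count against a node.
class TimeWindowedNodeFailures(
    validityIntervalMs: Long,
    clock: () => Long = () => System.currentTimeMillis()) {

  private val failureTimestamps = mutable.Map.empty[String, mutable.ArrayBuffer[Long]]

  def registerFailure(host: String): Unit = {
    failureTimestamps.getOrElseUpdate(host, mutable.ArrayBuffer.empty[Long]) += clock()
  }

  def recentFailures(host: String): Int = {
    val cutoff = clock() - validityIntervalMs
    failureTimestamps.get(host).map(_.count(_ >= cutoff)).getOrElse(0)
  }
}
{code}
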
> Blacklist a node if executors won't launch on it.
> -
>
> Key: SPARK-16630
> URL: https://issues.apache.org/jira/browse/SPARK-16630
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 1.6.2
>Reporter: Thomas Graves
>Priority: Major
>
> On YARN, it is possible that a node is messed up or misconfigured such that a 
> container won't launch on it. For instance, the Spark external shuffle 
> handler might not have been loaded on it, or it might just be some other 
> hardware or Hadoop configuration issue. 
> It would be nice if we could recognize this happening and stop trying to 
> launch executors on it, since that could end up causing us to hit our max 
> number of executor failures and then kill the job.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16630) Blacklist a node if executors won't launch on it.

2018-03-05 Thread Attila Zsolt Piros (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386789#comment-16386789
 ] 

Attila Zsolt Piros commented on SPARK-16630:


I have checked the existing sources and I would like to open a discussion about 
a possible solution.


As far as I have seen, YarnAllocator#processCompletedContainers could be 
extended to track the number of failures by host. YarnAllocator is also 
responsible for updating the task-level blacklisted nodes with YARN (by calling 
AMRMClient#updateBlacklist). So a relatively easy solution would be to keep a 
separate counter here (independent of the task-level failures) with its own 
configured limit, and to update YARN with the union of the task-level 
blacklisted nodes and the "allocator-level" blacklisted nodes. What is your 
opinion?
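
To sketch the idea in code (illustrative only, not the eventual implementation; 
the class and method names are made up, while AMRMClient#updateBlacklist is the 
real Hadoop API):

{code:scala}
import scala.collection.JavaConverters._

import org.apache.hadoop.yarn.client.api.AMRMClient
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest

// Keeps an allocator-level failure counter per host and pushes the union of the
// task-level and allocator-level blacklists to YARN.
class AllocatorBlacklistTracker(
    amClient: AMRMClient[ContainerRequest],
    maxFailuresPerHost: Int) {

  private var failuresByHost = Map.empty[String, Int]
  private var currentBlacklist = Set.empty[String]

  def onContainerLaunchFailure(host: String): Unit = {
    failuresByHost = failuresByHost.updated(host, failuresByHost.getOrElse(host, 0) + 1)
  }

  private def allocatorLevelBlacklist: Set[String] =
    failuresByHost.filter { case (_, failures) => failures >= maxFailuresPerHost }.keySet

  // Called whenever either blacklist changes; the task-level blacklist comes from
  // the scheduler backend.
  def synchronizeWithYarn(taskLevelBlacklist: Set[String]): Unit = {
    val desired = taskLevelBlacklist ++ allocatorLevelBlacklist
    val additions = (desired -- currentBlacklist).toList
    val removals = (currentBlacklist -- desired).toList
    if (additions.nonEmpty || removals.nonEmpty) {
      amClient.updateBlacklist(additions.asJava, removals.asJava)
    }
    currentBlacklist = desired
  }
}
{code}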

 

> Blacklist a node if executors won't launch on it.
> -
>
> Key: SPARK-16630
> URL: https://issues.apache.org/jira/browse/SPARK-16630
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 1.6.2
>Reporter: Thomas Graves
>Priority: Major
>
> On YARN, it is possible that a node is messed up or misconfigured such that a 
> container won't launch on it. For instance, the Spark external shuffle 
> handler might not have been loaded on it, or it might just be some other 
> hardware or Hadoop configuration issue. 
> It would be nice if we could recognize this happening and stop trying to 
> launch executors on it, since that could end up causing us to hit our max 
> number of executor failures and then kill the job.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22915) ML test for StructuredStreaming: spark.ml.feature, N-Z

2018-02-27 Thread Attila Zsolt Piros (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16379056#comment-16379056
 ] 

Attila Zsolt Piros commented on SPARK-22915:


Thank you, as I already have a lot of work in it. I could do a quick PR right 
now, but I would prefer to do one more self-review and some test executions 
beforehand. I promise to finish it within the next two days (as I am currently 
traveling). 

> ML test for StructuredStreaming: spark.ml.feature, N-Z
> --
>
> Key: SPARK-22915
> URL: https://issues.apache.org/jira/browse/SPARK-22915
> Project: Spark
>  Issue Type: Test
>  Components: ML, Tests
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Priority: Major
>
> *For featurizers with names from N - Z*
> Task for adding Structured Streaming tests for all Models/Transformers in a 
> sub-module in spark.ml
> For an example, see LinearRegressionSuite.scala in 
> https://github.com/apache/spark/pull/19843
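
For anyone following along, the general shape of such a streaming test is 
roughly the following. This is a self-contained sketch using MemoryStream and a 
memory sink, not the MLTest helpers the PR above actually uses:

{code:scala}
import org.apache.spark.ml.feature.Binarizer
import org.apache.spark.sql.{SQLContext, SparkSession}
import org.apache.spark.sql.execution.streaming.MemoryStream

// Feed rows through a MemoryStream, apply the transformer to the streaming
// DataFrame, and collect the output through a memory sink so it can be compared
// with the batch result.
object StreamingTransformerCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("ml-streaming-check")
      .getOrCreate()
    import spark.implicits._
    implicit val sqlCtx: SQLContext = spark.sqlContext

    val input = MemoryStream[Double]
    val streamingDF = input.toDF().toDF("feature")

    val binarizer = new Binarizer()
      .setInputCol("feature")
      .setOutputCol("binarized")
      .setThreshold(0.5)

    val query = binarizer.transform(streamingDF)
      .writeStream
      .format("memory")
      .queryName("result")
      .outputMode("append")
      .start()

    input.addData(0.1, 0.4, 0.9)
    query.processAllAvailable()

    // compare with binarizer.transform applied to an equivalent batch DataFrame
    spark.table("result").show()
    query.stop()
    spark.stop()
  }
}
{code}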



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


