[ https://issues.apache.org/jira/browse/SPARK-53043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dongjoon Hyun updated SPARK-53043:
----------------------------------
    Description:
Since Java 9, we can use Java's built-in `transferTo` directly, which is **significantly faster (over 100x)** than `IOUtils.copy`. In addition, Java's `transferTo` returns the correct number of copied bytes, while `IOUtils.copy` returns -1 once the input exceeds 2GB, which is a well-known limitation.

{code}
scala> import java.io._
import java.io._

scala> spark.time(new FileInputStream("/tmp/4G.bin").transferTo(new FileOutputStream("/dev/null")))
Time taken: 4 ms
val res0: Long = 4294967296

scala> spark.time(org.apache.commons.io.IOUtils.copy(new FileInputStream("/tmp/4G.bin"), new FileOutputStream("/dev/null")))
Time taken: 781 ms
val res1: Int = -1
{code}

> Use Java `InputStream.transferTo` instead of `IOUtils.copy`
> -----------------------------------------------------------
>
>                 Key: SPARK-53043
>                 URL: https://issues.apache.org/jira/browse/SPARK-53043
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Kubernetes, Spark Core, SQL
>    Affects Versions: 4.1.0
>            Reporter: Dongjoon Hyun
>            Assignee: Dongjoon Hyun
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.1.0
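
For reference, the replacement described above could look roughly like the sketch below. It is only an illustration: the object name `TransferToExample` and the helper `copyStream` are hypothetical and not taken from the Spark code base, and it uses in-memory streams rather than the 4 GB file from the measurement. The point is that `transferTo` returns the byte count as a `Long`, so the result stays meaningful for inputs larger than 2GB.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, InputStream, OutputStream}
import scala.util.Using

// Hypothetical example object; the names here are illustrative, not Spark APIs.
object TransferToExample {

  // Copies everything from `in` to `out`, closes both streams, and returns
  // the number of bytes copied. InputStream.transferTo requires Java 9+.
  def copyStream(in: InputStream, out: OutputStream): Long =
    Using.resources(in, out) { (i, o) => i.transferTo(o) }

  def main(args: Array[String]): Unit = {
    val payload = Array.fill[Byte](1024)(7)
    val sink = new ByteArrayOutputStream()
    val copied = copyStream(new ByteArrayInputStream(payload), sink)
    println(s"copied $copied bytes")
  }
}
```

`scala.util.Using` requires Scala 2.13 or later. A call site that previously used `IOUtils.copy(in, out)` and ignored the result would need no other change; one that consumed the `Int` result would have to accept a `Long` instead.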