[ 
https://issues.apache.org/jira/browse/SPARK-53043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-53043:
----------------------------------
    Description: 
Since Java 9, we can use Java's built-in `InputStream.transferTo` directly, 
which is **significantly faster (over 100x in the example below)** than 
`IOUtils.copy`. In addition, `transferTo` returns the copied byte count as a 
`long`, while `IOUtils.copy` returns an `int` and reports -1 once the count 
exceeds `Integer.MAX_VALUE` (about 2GB), which is a well-known limitation.

{code}
scala> import java.io._
import java.io._

scala> spark.time(new FileInputStream("/tmp/4G.bin").transferTo(new FileOutputStream("/dev/null")))
Time taken: 4 ms
val res0: Long = 4294967296

scala> spark.time(org.apache.commons.io.IOUtils.copy(new FileInputStream("/tmp/4G.bin"), new FileOutputStream("/dev/null")))
Time taken: 781 ms
val res1: Int = -1
{code}
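
As a rough illustration of the replacement pattern described above, here is a minimal sketch that needs only the JDK (Java 9+). The names `TransferToSketch` and `copyStream` are hypothetical and are not Spark or Commons IO APIs. The sketch copies a small temporary file instead of the 4G file used in the session above, so it says nothing about performance.

```scala
import java.io.{ByteArrayOutputStream, InputStream, OutputStream}
import java.nio.file.Files

object TransferToSketch {
  // Hypothetical helper (not part of Spark): copies everything from `in` to
  // `out` using the JDK's InputStream.transferTo and returns the number of
  // bytes copied as a Long. The input stream is closed afterwards.
  def copyStream(in: InputStream, out: OutputStream): Long =
    try in.transferTo(out) finally in.close()

  def main(args: Array[String]): Unit = {
    // Assumption: a 1 MiB temp file stands in for a real input file.
    val src = Files.createTempFile("transfer-to-sketch", ".bin")
    try {
      Files.write(src, Array.fill[Byte](1024 * 1024)(1.toByte))
      val sink = new ByteArrayOutputStream()
      val copied = copyStream(Files.newInputStream(src), sink)
      println(s"copied $copied bytes")
    } finally Files.deleteIfExists(src)
  }
}
```

Because the result is a `Long`, a caller can compare it directly against an expected length even when that length is larger than `Int.MaxValue`.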

> Use Java `InputStream.transferTo` instead of `IOUtils.copy`
> -----------------------------------------------------------
>
>                 Key: SPARK-53043
>                 URL: https://issues.apache.org/jira/browse/SPARK-53043
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Kubernetes, Spark Core, SQL
>    Affects Versions: 4.1.0
>            Reporter: Dongjoon Hyun
>            Assignee: Dongjoon Hyun
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.1.0


