Attila Zsolt Piros created SPARK-32149:
------------------------------------------

             Summary: Improve file path name normalisation at block resolution 
within the external shuffle service
                 Key: SPARK-32149
                 URL: https://issues.apache.org/jira/browse/SPARK-32149
             Project: Spark
          Issue Type: Improvement
          Components: Shuffle
    Affects Versions: 3.0.1
            Reporter: Attila Zsolt Piros


In the external shuffle service, during block resolution, the file paths (for 
disk-persisted RDD blocks and for shuffle blocks) are normalized by custom 
Spark code that uses an OS-dependent regexp. This code duplicates a 
package-private JDK counterpart.
As the two are not a perfect match, one method could even produce a slightly 
different (but semantically equal) path than the other. 

The reason for this redundant transformation is the interning of the 
normalized path to save some heap. That saving is only possible if both 
normalizations result in the same string.

Having checked the JDK code, I believe there is a better solution that is a 
perfect match for the JDK code, as it uses that package-private method. 
Moreover, based on some benchmarking, this new method also seems to be more 
performant. 
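
A minimal sketch of the idea, not the actual patch: it assumes that building a 
java.io.File from the raw path string runs the JDK's own normalization, so 
getPath() returns exactly the string the JDK would hold, and interning it then 
shares that one instance. The class name, the helper name and the sample path 
are hypothetical and only serve as illustration.

```java
import java.io.File;

class PathNormalization {

    // Hypothetical helper: joins the path pieces, lets the JDK normalize the
    // result through the File constructor, and interns the normalized string.
    static String normalizedInternedPath(String dir1, String dir2, String fileName) {
        String raw = dir1 + File.separator + dir2 + File.separator + fileName;
        // File(String) normalizes the pathname; getPath() returns that form.
        return new File(raw).getPath().intern();
    }

    public static void main(String[] args) {
        // Sample input with redundant separators (made-up directory names).
        String a = normalizedInternedPath("/tmp/spark//", "0c", "shuffle_0_0_0.data");

        // The path the JDK builds when the same pieces are resolved step by step.
        String b = new File(new File("/tmp/spark", "0c"), "shuffle_0_0_0.data").getPath();

        System.out.println(a);
        // Equal content means intern() can hand back one shared instance.
        System.out.println(a.equals(b));
        System.out.println(a == b.intern());
    }
}
```

Because both strings come out of the same JDK normalization, no separate 
OS-dependent regexp has to be kept in sync with it.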



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
