[ https://issues.apache.org/jira/browse/SPARK-3570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14318960#comment-14318960 ]

Kay Ousterhout commented on SPARK-3570:
---------------------------------------

There are a bunch of places where files are opened and written to that are not
currently timed, so I did some investigation to figure out where the time can
be significant and is therefore worth logging.  I did this on a 5-machine
cluster using ext3 (which exacerbates disk access issues, making it easy to see
where times can be long), running query 3a of the big data benchmark (which
stresses the disk because of its many shuffle files).  Here's what I found:

SortShuffleWriter.write, call to shuffleBlockManager.getDataFile: this just
opens one file and typically takes about 100us, so it's not worth adding
logging.

SortShuffleWriter.write, call to shuffleBlockManager.getIndexFile: this writes
a single index file and typically took about 0.1ms (at most 1ms in my runs), so
it also doesn't seem worth logging.
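
Measurements this small can be taken with a simple nanoTime wrapper around the
call in question.  A minimal sketch of that technique (the helper name and path
below are made up for illustration; this isn't the actual instrumentation):

    import java.io.FileOutputStream

    // Hypothetical helper, not a Spark API: run a block and return the result
    // together with the elapsed time in microseconds.
    def timed[T](body: => T): (T, Long) = {
      val start = System.nanoTime()
      val result = body
      (result, (System.nanoTime() - start) / 1000)
    }

    // Example: time a single file open, analogous to the getDataFile and
    // getIndexFile cases above.
    val (out, micros) = timed { new FileOutputStream("/tmp/shuffle-timing-test") }
    println(s"open took $micros us")
    out.close()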

ExternalSorter.spillToPartitionFiles, creating the disk writers for each
partition: because this creates one file per partition, the time to create all
of the files adds up; it took 75-100ms here.
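
Since the cost here is one file open per reduce partition, it's the aggregate
that matters.  A self-contained sketch of that measurement using plain java.io
(the partition count and paths are made up; the real code goes through
blockManager.getDiskWriter):

    import java.io.{File, FileOutputStream}

    val numPartitions = 200  // hypothetical reduce-partition count
    val start = System.nanoTime()
    // One file per partition, as spillToPartitionFiles does; on ext3 these
    // opens are what added up to 75-100ms in my measurements.
    val streams = (0 until numPartitions).map { p =>
      new FileOutputStream(new File(s"/tmp/spill-part-$p"))
    }
    val openMillis = (System.nanoTime() - start) / 1e6
    println(s"opening $numPartitions files took $openMillis ms")
    streams.foreach(_.close())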

ExternalSorter.writePartitionedFile, copying the data from the partitioned
files to a single file: because this reads and rewrites all of the shuffle
data, it can take a while: ~13ms on the workload I looked at.

ExternalSorter.writePartitionedFile, time to call blockManager.getDiskWriter on
line 748: getDiskWriter *CAN* take a long time because of the call to
file.length(), which may hit disk.  In this case, however, each call takes 20us
or less (and the measurement is likely noisy -- the times are getting too small
to measure reliably).  To speculate wildly, I'd guess that because this is
called many times on the same file, as opposed to on different files, and the
file is actively being written to, the OS caches the length in memory.
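
That speculation is easy to sanity-check in isolation: repeatedly call
length() on a file that's actively being written and see whether the per-call
cost stays in the tens of microseconds.  A sketch (the path and iteration count
are made up):

    import java.io.{File, FileOutputStream}

    val f = new File("/tmp/length-test")
    val out = new FileOutputStream(f)
    out.write(new Array[Byte](1024))

    // Repeated length() calls on the same, actively written file: if the OS
    // caches the metadata, the average should stay around 20us or below.
    val n = 1000
    val start = System.nanoTime()
    var i = 0
    while (i < n) { f.length(); i += 1 }
    println(s"avg length() time: ${(System.nanoTime() - start) / 1000.0 / n} us")
    out.close()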

To summarize, this all leads to the intuitive conclusion that we only need to
log when we're writing lots of data (e.g., when copying all of the shuffle data
to a single file) or when we're opening a lot of files.
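
Concretely, the fix then comes down to bracketing just those two spots and
folding the elapsed time into the task's shuffle write time.  A sketch of that
pattern with a stand-in metrics class (in Spark this would go through the
shuffle write metrics; the names below are placeholders):

    // Stand-in for the task's shuffle write metrics; hypothetical, for
    // illustration only.
    class WriteMetricsStub { var shuffleWriteTimeNanos: Long = 0L }

    def timeAndRecord[T](metrics: WriteMetricsStub)(body: => T): T = {
      val start = System.nanoTime()
      val result = body
      metrics.shuffleWriteTimeNanos += System.nanoTime() - start
      result
    }

    // Wrap only the expensive operations identified above, e.g. (the two
    // functions named below are placeholders, not real Spark methods):
    //   val writers = timeAndRecord(metrics) { openAllPartitionWriters() }
    //   timeAndRecord(metrics) { copyPartitionsToSingleFile(writers) }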

> Shuffle write time does not include time to open shuffle files
> --------------------------------------------------------------
>
>                 Key: SPARK-3570
>                 URL: https://issues.apache.org/jira/browse/SPARK-3570
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 0.9.2, 1.0.2, 1.1.0
>            Reporter: Kay Ousterhout
>            Assignee: Kay Ousterhout
>         Attachments: 3a_1410854905_0_job_log_waterfall.pdf, 
> 3a_1410957857_0_job_log_waterfall.pdf
>
>
> Currently, the reported shuffle write time does not include time to open the 
> shuffle files.  This time can be very significant when the disk is highly 
> utilized and many shuffle files exist on the machine (I'm not sure how severe 
> this is in 1.0 onward -- since shuffle files are automatically deleted, this 
> may be less of an issue because there are fewer old files sitting around).  
> In experiments I did, in extreme cases, adding the time to open files can 
> increase the shuffle write time from 5ms (of a 2 second task) to 1 second.  
> We should fix this for better performance debugging.
> Thanks [~shivaram] for helping to diagnose this problem.  cc [~pwendell]


