Hi Jay,

I spent some more time over this today and went back to the original thread which brought up the issue with file leaks [1]. I think that output of lsof in that logs has a very important hint:

/home/work/data/soft/kafka-0.8/data/_oakbay_v2_search_topic_ypgsearch_yellowpageV2-0/00000000000068818668.log (deleted) java 8446 root 725u REG 253,2 536910838 26087364

/home/work/data/soft/kafka-0.8/data/_oakbay_v2_search_topic_ypgsearch_yellowpageV2-0/00000000000069457098.log (deleted) java 8446 root 726u REG 253,2 536917902 26087368

Notice the "(deleted)" text in that output. The last time I looked at that output, I thought it was the user who had added that "deleted" text to help us understand that problem. But today I read up on the output format of lsof and it turns out that it's lsof which itself adds that hint whenever a file has already been deleted possibly by a different process but some other process is still holding on to open resources of that (deleted) file [2].

So in the context of the issue that we are discussing and the way Kafka deals with async deletes (i.e. first by attempting a rename of the log/index files), I think this all makes sense now. So what I think is happening is, some (other?) process (not sure what/why) has already deleted the log file that Kafka is using for the LogSegment. The LogSegment however still has open FileChannel resource on that deleted file (and that's why the open file descriptor is held on and shows up in that output). Now Kafka, at some point in time, triggers an async delete of the LogSegment, which involves a file rename of that (already deleted) log file. The rename fails (because the original file path isn't there anymore). As a result, we end up throwing that "failed to rename, KafkaStorageException" and thus leave behind the open FileChannel to continue being open forever (till the Kafka program exits).

So I think we should:

1) Find what/why deletes that underlying log file(s). I'll add a reply to that original mail discussion asking the user if he can provide more details. 2) Handle this case and close the FileChannel. The patch that's been uploaded to review board https://reviews.apache.org/r/29755/ does that. The "immediate delete" on failure to rename, involves (safely) closing the open FileChannel and (safely) deleting the (possibly non-existent) file.

By the way, this entire thing can be easily reproduced by running the following program which first creates a file and open a filechannel to that file and then waits for the user to delete that file externally (I used the rm command) and then go and tries to rename that deleted file, which then fails. In between each of these steps, you can run the lsof command externally to see the open file resources (I used 'lsof | grep test.log'):

    public static void main(String[] args) throws Exception {
        // Open a file and file channel for read/write
final File originalLogFile = new File("/home/jaikiran/deleteme/test.log"); // change this path relevantly if you plan to run it final FileChannel fileChannel = new RandomAccessFile(originalLogFile, "rw").getChannel();
        System.out.println("Opened file channel to " + originalLogFile);
        // wait for underlying file to be deleted externally
System.out.println("Waiting for the " + originalLogFile + " to be deleted externally. Press any key after the file is deleted");
        System.in.read();
        // wait for the user to check the lsof output
System.out.println(originalLogFile + " seems to have been deleted externally, check lsof command output to see open file resources."); System.out.println("Press any key to try renaming this already deleted file, from the program");
        System.in.read();
        // try rename
final File fileToRenameTo = new File(originalLogFile.getPath() + ".deleted"); System.out.println("Trying to rename " + originalLogFile + " to " + fileToRenameTo); final boolean renamedSucceeded = originalLogFile.renameTo(fileToRenameTo);
        if (renamedSucceeded) {
System.out.println("Rename SUCCEEDED. Renamed file exists? " + fileToRenameTo.exists());
        } else {
System.out.println("FAILED to rename file " + originalLogFile + " to " + fileToRenameTo);
        }
// wait for the user to check the lsof output, after our rename failed System.out.println("Check the lsof output and press any key to close the open file channel to a deleted file");
        System.in.read();
        // close the file channel
        fileChannel.close();
// let user check the lsof output one final time. This time he won't see open file resources from this program System.out.println("File channel closed. Check the lsof output and press any key to terminate the program");
        System.in.read();
        // all done, exit
        System.out.println("Program will terminate");
    }



[1] http://mail-archives.apache.org/mod_mbox/kafka-users/201501.mbox/%3CCAA4R6b-7gSbPp5_ebGpwYyNibDAwE_%2BwoE%2BKbiMuU27-2j%2BLkg%40mail.gmail.com%3E
[2] http://unixhelp.ed.ac.uk/CGI/man-cgi?lsof+8


-Jaikiran
On Saturday 24 January 2015 11:12 PM, Jay Kreps wrote:
Hey guys,

Jaikiran posted a patch on KAFKA-1853 to improve the handling of failures during delete.
https://issues.apache.org/jira/browse/KAFKA-1853

The core problem here is that we are doing File.rename() as part of the delete sequence which returns false if the rename failed. Or file delete sequence is something like the following:
1. Remove the file from the index so no new reads can begin on it
2. Rename the file to xyz.deleted so that if we crash it will get cleaned up 3. Schedule a task to delete the file in 30 seconds or so when any in-progress reads have likely completed. The goal here is to avoid errors on in progress reads but also avoid locking on all reads.

The question is what to do when rename fails? Previously if this happened we actually didn't pay attention and would fail to delete the file entirely. This patch changes it so that if the rename fails we log an error and force an immediate delete.

I think this is the right thing to do, but I guess the real question is why would rename fail? Some possibilities:
http://stackoverflow.com/questions/2372374/why-would-a-file-rename-fail-in-java

An alternative would be to treat this as a filesystem error and shutdown as we do elsewhere.

Thoughts?

-Jay


Reply via email to