Rajesh Balamohan created SPARK-21971:
----------------------------------------

             Summary: Too many open files in Spark due to concurrent files being opened
                 Key: SPARK-21971
                 URL: https://issues.apache.org/jira/browse/SPARK-21971
             Project: Spark
          Issue Type: Bug
          Components: Spark Core, SQL
    Affects Versions: 2.1.0
            Reporter: Rajesh Balamohan
            Priority: Minor


When running Q67 of TPC-DS on a 1 TB dataset on a multi-node cluster, it 
consistently fails with a "too many open files" exception.

{noformat}
O scheduler.TaskSetManager: Finished task 25.0 in stage 844.0 (TID 243786) in 394 ms on machine111.xyz (executor 2) (189/200)
17/08/20 10:33:45 INFO scheduler.TaskSetManager: Finished task 172.0 in stage 844.0 (TID 243932) in 11996 ms on cn116-10.l42scl.hortonworks.com (executor 6) (190/200)
17/08/20 10:37:40 WARN scheduler.TaskSetManager: Lost task 144.0 in stage 844.0 (TID 243904, machine1.xyz, executor 1): java.nio.file.FileSystemException: /grid/3/hadoop/yarn/local/usercache/rbalamohan/appcache/application_1490656001509_7207/blockmgr-5180e3f0-f7ed-44bb-affc-8f99f09ba7bc/28/temp_local_690afbf7-172d-4fdb-8492-3e2ebd8d5183: Too many open files
        at sun.nio.fs.UnixException.translateToIOException(UnixException.java:91)
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
        at sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:177)
        at java.nio.channels.FileChannel.open(FileChannel.java:287)
        at java.nio.channels.FileChannel.open(FileChannel.java:335)
        at org.apache.spark.io.NioBufferedFileInputStream.<init>(NioBufferedFileInputStream.java:43)
        at org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.<init>(UnsafeSorterSpillReader.java:75)
        at org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.getReader(UnsafeSorterSpillWriter.java:150)
        at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.getIterator(UnsafeExternalSorter.java:607)
        at org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArray.generateIterator(ExternalAppendOnlyUnsafeRowArray.scala:169)
        at org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArray.generateIterator(ExternalAppendOnlyUnsafeRowArray.scala:173)
{noformat}

The cluster was configured with multiple cores per executor.

The window function uses "spark.sql.windowExec.buffer.spill.threshold=4096", which 
causes a large number of spills on larger datasets. With multiple cores per 
executor, this reproduces easily.
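For reference, the threshold can be raised per session (an illustrative sketch; the value below is arbitrary, and the right setting is workload dependent):

{code:scala}
// Illustrative only: a larger spill threshold means fewer, larger spill
// files per window operator, at the cost of holding more rows in memory
// per task before spilling.
spark.conf.set("spark.sql.windowExec.buffer.spill.threshold", "1048576")
{code}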

{{UnsafeExternalSorter::getIterator()}} invokes {{spillWriter.getReader}} for 
all of the available spillWriters. {{UnsafeSorterSpillReader}} opens its file 
in the constructor and closes it only later, as part of its close() call. 
This is what leads to the "too many open files" failure.
Note that this is not a file leak; it is the number of files open concurrently 
at any given time, which depends on the dataset being processed.
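To confirm descriptor pressure on an executor before tuning anything, one can sample the JVM's open-descriptor count (a hedged sketch; it assumes a JVM that exposes com.sun.management.UnixOperatingSystemMXBean, as HotSpot does on Linux):

{code:scala}
import java.lang.management.ManagementFactory
import com.sun.management.UnixOperatingSystemMXBean

// Compare the JVM's currently open file descriptors against the process
// limit (on Linux, the limit reflects the ulimit the executor started with).
ManagementFactory.getOperatingSystemMXBean match {
  case os: UnixOperatingSystemMXBean =>
    println(s"open fds: ${os.getOpenFileDescriptorCount} / max: ${os.getMaxFileDescriptorCount}")
  case _ =>
    println("open/max fd counts not exposed by this JVM")
}
{code}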

One option could be to increase "spark.sql.windowExec.buffer.spill.threshold" 
so that fewer spill files are generated, but it is hard to determine a 
sweet spot for all workloads. Another option is to raise the file-descriptor 
ulimit to "unlimited", but that would not be a good production setting. It would 
be better to reduce the number of spill files that 
{{UnsafeExternalSorter::getIterator()}} holds open concurrently, as sketched below.
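One way to do that (a minimal sketch, not Spark's actual code; {{open}} stands in for {{spillWriter.getReader}}) is to chain the per-spill iterators lazily, so each spill file is opened only when consumption reaches it and closed as soon as it is exhausted:

{code:scala}
import java.io.File

// Sketch: lazily concatenate per-spill-file iterators. Iterator.flatMap is
// lazy, so open(file) runs only when the chain reaches that file; the
// wrapper closes the file once it is drained. At most one spill file is
// then open per consumer, instead of one per spill.
def chainLazily[T](files: Seq[File])(open: File => (Iterator[T], () => Unit)): Iterator[T] =
  files.iterator.flatMap { file =>
    val (it, close) = open(file)
    new Iterator[T] {
      private var closed = false
      def hasNext: Boolean = {
        val more = it.hasNext
        if (!more && !closed) { close(); closed = true }
        more
      }
      def next(): T = it.next()
    }
  }
{code}

With a structure like this, the peak open-file count per task no longer grows with the number of spills, regardless of the spill threshold chosen.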
