[ 
https://issues.apache.org/jira/browse/APEXMALHAR-2274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15606612#comment-15606612
 ] 

Matt Zhang commented on APEXMALHAR-2274:
----------------------------------------

- FileSystem.listStatusIterator() is not available until Hadoop 2.7.

- FileSystem.listLocatedStatus() returns an iterator, but it is backed by the 
array of files produced by the listStatus() call.

- FileSystem.listFiles() is available in 2.6, but it keeps a stack of 
RemoteIterators that refer to the arrays from listLocatedStatus() (see the 
listing sketch below).

- FileContext.listStatus() likewise ends up referring to the array from 
fs.listStatus().
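
For reference, here is how an iterator-style listing is consumed (a minimal 
standalone sketch, not Malhar code; the directory argument is hypothetical). 
With the default FileSystem implementation the RemoteIterator below is still 
backed by the full FileStatus array built by listStatus(), which is why simply 
switching the call does not solve the problem.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class ListingSketch
{
  public static void main(String[] args) throws IOException
  {
    Path dir = new Path(args[0]);
    FileSystem fs = FileSystem.get(dir.toUri(), new Configuration());

    // Available since Hadoop 2.6; recursive = false lists only this directory.
    RemoteIterator<LocatedFileStatus> files = fs.listFiles(dir, false);
    while (files.hasNext()) {
      // With the default implementation the files have already been
      // materialized into an in-memory array before iteration starts.
      System.out.println(files.next().getPath());
    }
  }
}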

To resolve this we need to use the reconciler pattern: spawn a worker thread 
during setup() of AbstractFileInputOperator that performs the scan and reads 
the results into a thread-safe queue, and have the input operator consume from 
this queue asynchronously.
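
A minimal sketch of that idea (the class name AsyncScanInputOperator, the 
directory property and the string output port are illustrative assumptions, 
not the actual Malhar implementation):

import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import com.datatorrent.api.Context.OperatorContext;
import com.datatorrent.api.DefaultOutputPort;
import com.datatorrent.api.InputOperator;
import com.datatorrent.common.util.BaseOperator;

public class AsyncScanInputOperator extends BaseOperator implements InputOperator
{
  public final transient DefaultOutputPort<String> output = new DefaultOutputPort<>();

  private String directory;
  private transient FileSystem fs;
  private transient ExecutorService scanExecutor;
  // Thread-safe hand-off between the scanner thread and the operator thread.
  private transient LinkedBlockingQueue<Path> pendingFiles;

  @Override
  public void setup(OperatorContext context)
  {
    pendingFiles = new LinkedBlockingQueue<>();
    try {
      fs = FileSystem.newInstance(new Path(directory).toUri(), new Configuration());
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
    // The slow listing runs on a worker thread, so the operator thread stays
    // responsive and the AppMaster does not declare the operator hung.
    scanExecutor = Executors.newSingleThreadExecutor();
    scanExecutor.submit(new Runnable()
    {
      @Override
      public void run()
      {
        try {
          for (FileStatus status : fs.listStatus(new Path(directory))) {
            pendingFiles.put(status.getPath());
          }
        } catch (IOException | InterruptedException e) {
          throw new RuntimeException(e);
        }
      }
    });
  }

  @Override
  public void emitTuples()
  {
    // Non-blocking poll: emit whatever the scanner has produced so far.
    Path next = pendingFiles.poll();
    if (next != null) {
      output.emit(next.toString());
    }
  }

  @Override
  public void teardown()
  {
    scanExecutor.shutdownNow();
  }

  public void setDirectory(String directory)
  {
    this.directory = directory;
  }
}

The same queue could later be drained by the existing DirectoryScanner logic; 
the essential point is only that the listing never blocks the operator thread.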

> AbstractFileInputOperator gets killed when there are a large number of files.
> -----------------------------------------------------------------------------
>
>                 Key: APEXMALHAR-2274
>                 URL: https://issues.apache.org/jira/browse/APEXMALHAR-2274
>             Project: Apache Apex Malhar
>          Issue Type: Bug
>            Reporter: Munagala V. Ramanath
>            Assignee: Matt Zhang
>
> When there are a large number of files in the monitored directory, the call 
> to DirectoryScanner.scan() can take a long time since it calls 
> FileSystem.listStatus() which returns the entire list. Meanwhile, the 
> AppMaster deems this operator hung and restarts it which again results in the 
> same problem.
> It should use FileSystem.listStatusIterator() [in Hadoop 2.7.X] or 
> FileSystem.listFiles() [in 2.6.X] or other similar calls that return
> a remote iterator to limit the number of files processed in a single call.


