Mike,

I am looking into this, and I was able to figure out how we could potentially archive (and eventually age off from the archive) a piece of data for which there is still an open file handle. Specifically, I can see this happening only when we hit a problem reading a FlowFile's content (such as a ContentNotFoundException) during a call to ProcessSession.write() with a StreamCallback.
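To make that concrete, here is a rough sketch of the kind of onTrigger() that exercises this code path. It is hypothetical -- the relationship and the stream-copying callback body are stand-ins, and I am not assuming anything about your actual implementation beyond the write()/StreamCallback call:

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;

    import org.apache.nifi.flowfile.FlowFile;
    import org.apache.nifi.processor.AbstractProcessor;
    import org.apache.nifi.processor.ProcessContext;
    import org.apache.nifi.processor.ProcessSession;
    import org.apache.nifi.processor.Relationship;
    import org.apache.nifi.processor.exception.ProcessException;
    import org.apache.nifi.processor.io.StreamCallback;
    import org.apache.nifi.stream.io.StreamUtils;

    public class MyCustomProcessor extends AbstractProcessor {

        // Hypothetical relationship, just to make the sketch complete
        static final Relationship REL_SUCCESS = new Relationship.Builder()
                .name("success")
                .build();

        @Override
        public void onTrigger(final ProcessContext context, final ProcessSession session) throws ProcessException {
            FlowFile flowFile = session.get();
            if (flowFile == null) {
                return;
            }

            // write() with a StreamCallback must open the FlowFile's existing
            // content claim for reading before invoking the callback; if that
            // claim cannot be found, ContentNotFoundException is thrown from
            // inside write() -- the write() frame in the stack trace below.
            flowFile = session.write(flowFile, new StreamCallback() {
                @Override
                public void process(final InputStream in, final OutputStream out) throws IOException {
                    StreamUtils.copy(in, out); // stand-in for the real transform
                }
            });

            session.transfer(flowFile, REL_SUCCESS);
        }
    }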
So exactly the situation that you laid out here. I will be looking into how to address this.

Do you have any idea what may have caused your ContentNotFoundException to occur in the first place?

Thanks
-Mark

> On Apr 27, 2016, at 5:24 PM, Michael Moser <moser...@gmail.com> wrote:
>
> I found something in the logs on the nodes where I had a problem. A
> ContentNotFoundException begins occurring on these nodes, and after many
> thousands of occurrences we eventually get "too many open files". Once I do
> surgery on the content repository so that the ContentNotFoundException stops
> happening, lsof tells me that I'm no longer 'leaking' (deleted) file
> descriptors. I have no idea why the ContentNotFoundException starts
> happening, but it sounds similar to Tony Kurc's "heisenbug" [1] reported
> back in March.
>
> Here's a stack trace of the ContentNotFoundException:
>
> 2016-04-27 12:00:00,00 ERROR [Timer-Driven Process Thread-1] MyCustomProcessor
> org.apache.nifi.processor.exception.MissingFlowFileException: Unable to find content for FlowFile
>     at o.a.n.controller.repository.StandardProcessSession.handleContentNotFound(StandardProcessSession.java:2349)
>     at o.a.n.controller.repository.StandardProcessSession.write(StandardProcessSession.java:2186)
>     at MyCustomProcessor.onTrigger()
>     at o.a.n.processor.AbstractProcessor.onTrigger(AbstractProcessor.java:27)
> Caused by: o.a.n.controller.repository.ContentNotFoundException: Could not find content for
>     StandardContentClaim [resourceClaim=StandardResourceClaim[id=123456789-1, container=default, section=1],
>     offset=351144, length=30720]
>     at o.a.n.controller.repository.FileSystemRepository.getPath(FileSystemRepository.java:492)
>     at o.a.n.controller.repository.FileSystemRepository.read(FileSystemRepository.java:801)
>     at o.a.n.controller.repository.StandardProcessSession.getInputStream(StandardProcessSession.java:1743)
>     at o.a.n.controller.repository.StandardProcessSession.write(StandardProcessSession.java:2152)
>     ... common frames omitted
>
> MyCustomProcessor just calls session.write() with a StreamCallback
> implementation that reads from an InputStream and writes to an OutputStream.
>
> [1] http://apache-nifi-developer-list.39713.n7.nabble.com/heisenbug-causing-quot-lost-quot-content-claims-td7775.html
>
> -- Mike
>
>
> On Wed, Apr 27, 2016 at 12:15 PM, Joe Witt <joe.w...@gmail.com> wrote:
>
>> Mike,
>>
>> OK, that is a good data point. In my case they are all in the archive,
>> but I do agree that isn't super meaningful, because in reality nothing
>> should ever be open for writing in the archive.
>>
>> If you can, and you have enough logging, try searching for that first
>> part of the filename in your logs. Let's see if we can narrow this
>> down to a particular type of flow or something.
>>
>> Thanks
>> Joe
>>
>> On Wed, Apr 27, 2016 at 12:04 PM, Michael Moser <moser...@gmail.com> wrote:
>>
>>> Another data point ... we had archiving turned on at first, and then most
>>> (but not all) of the files that lsof reported were
>>>
>>> /content_repository/0/archive/123456789-123456 (deleted)
>>>
>>> We turned archiving off, hoping that it was related in some way, but it
>>> was not.
>>>
>>> -- Mike
>>>
>>>
>>> On Wed, Apr 27, 2016 at 11:53 AM, Joe Witt <joe.w...@gmail.com> wrote:
>>>
>>>> Mike,
>>>>
>>>> Definitely does not sound familiar. However, I just looked up what you
>>>> describe and I do see it. In my case there are only three files, but
>>>> they are sitting there open for writing by the nifi process and yet
>>>> have been deleted.
>>>> So I do believe there is an issue... will dig in a bit, but obviously
>>>> if you make more discoveries here, please share.
>>>>
>>>> Thanks
>>>> Joe
>>>>
>>>>
>>>> On Wed, Apr 27, 2016 at 11:31 AM, Michael Moser <moser...@gmail.com> wrote:
>>>>
>>>>> Devs,
>>>>>
>>>>> We recently upgraded from NiFi 0.4.1 to 0.5.1 on a cluster. We noticed
>>>>> half of our cluster nodes getting "too many open files" errors that
>>>>> require a NiFi restart, while the other half works without this
>>>>> problem. Using 'lsof -p <pid>' to identify the open file descriptors
>>>>> at the time of the problem, we see that most of the file descriptors
>>>>> reference deleted files in the content repository, like this:
>>>>>
>>>>> java <pid> <user> <fd> ... /content_repository/81/123456789-123456 (deleted)
>>>>>
>>>>> An 'ls /content_repository/81/123456789-123456' confirms that the file
>>>>> has been deleted.
>>>>>
>>>>> We are continuing our investigation into why some of our nodes have a
>>>>> problem but others don't. Has anyone else seen this? Did anything
>>>>> change between 0.4.1 and 0.5.1 related to deleting files from the
>>>>> content repository?
>>>>>
>>>>> Regards,
>>>>> -- Mike
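As a footnote to the thread, the descriptor-leak pattern under discussion can be illustrated in plain Java. This is a simplified, hypothetical sketch -- it is not NiFi's actual FileSystemRepository code, and the copyClaim() method and its arguments are invented for illustration. The point is that if a stream to a destination file is opened before the source is known to be readable, an exception leaves that descriptor open; if the file is later removed, lsof reports it as "(deleted)":

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;

    public class DescriptorLeakSketch {

        // Hypothetical method: copies one content "claim" into another.
        static void copyClaim(final Path existingClaim, final Path newClaim) throws IOException {
            // The destination descriptor is opened first ...
            final OutputStream out = Files.newOutputStream(newClaim);

            // ... so if the source is missing (the analogue of a
            // ContentNotFoundException), this throws and 'out' is never
            // closed. The process then holds an open descriptor to newClaim,
            // and once that file is later removed, lsof shows "(deleted)".
            final InputStream in = Files.newInputStream(existingClaim);

            try {
                final byte[] buffer = new byte[8192];
                int len;
                while ((len = in.read(buffer)) != -1) {
                    out.write(buffer, 0, len);
                }
            } finally {
                in.close();
                out.close();
            }
        }
    }

In this simplified form the fix is to open (or validate) the read side before creating the write side, or to ensure the write side is closed by a catch/finally that also covers the read-side open.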