AFAIK it is fine and appropriate to issue multiple provenance events for a single FlowFile. In the case of PutAzureBlobStorage uploading a file to Azure, it is the incoming FlowFile that triggers the upload. Attributes are added to the FlowFile before the provenance event is reported, so that "version" of the FlowFile can be the one used to report a SEND event. I have done this in that processor as part of a larger refactor/improvement of the provenance capability:

session.getProvenanceReporter().send(flowFile, blob.getSnapshotQualifiedUri().toString(), transferMillis, REL_SUCCESS);

Having said that, to Mark's point it's probably better to have a separate UPLOAD_FILE event, I can change that in my code. I added a couple like this to similar processors, such as TriggerHiveMetastoreEvent:

session.getProvenanceReporter().invokeRemoteProcess(flowFile, hiveMetastoreUrl, REL_SUCCESS);

I am still working on this, I need to write up a Jira with a thorough treatment of the material and eventually get a PR up for review.

Regards,
Matt

On Thu, Oct 26, 2023 at 12:02 PM Mark Payne <[email protected]> wrote:
>
> Lehel,
>
> I don’t believe we should be trying to create a “Mock FlowFile.” I am ok with
> an update to the ProvenanceReporter interface. But I don’t think it should
> accept a “size” parameter. Rather, I think this is a completely different
> type of event that is occurring. This is not a “send” in that it’s not
> sending the contents of the FlowFile to a remote system. Rather, I’d say it's
> an UPLOAD_FILE event. So I’d lean more toward an uploadFile() method on
> ProvenanceReporter that takes as an argument a `File` (as well as a
> FlowFile). The size would come from the File itself, and the event would
> convey the information about the local file that was uploaded - probably in
> the Event Details.
>
> Thanks
> -Mark
>
>
> > On Oct 26, 2023, at 10:36 AM, Lehel Boér <[email protected]> wrote:
> >
> > Hi everyone,
> >
> > I would like to address a particular scenario that has recently come to my
> > attention regarding the use of the PutAzureBlobStorage processor with the
> > FileResourceService.
> >
> > When the PutAzureBlobStorage processor is used with the
> > FileResourceService, it currently uploads a file from the user's local
> > filesystem to Azure, but it does not create a FlowFile. Instead, it
> > utilizes the incoming FlowFile solely to send a provenance event. In this
> > case the size of the provenance event is the incoming FlowFile's size
> > instead of the uploaded one.
> >
> > There are potential solutions to address this issue and ensure that the
> > provenance events are handled effectively. Two main options have been
> > proposed:
> >
> >
> > * Create a Mock FlowFile: A mock FlowFile with a size matching that of
> > the local file being uploaded could be generated. This mock FlowFile would
> > serve as the basis for the provenance event, even though its size might not
> > reflect the actual content.
> >
> > * Modify the ProvenanceReporter Interface: Alternatively, we could
> > introduce a new method in the ProvenanceReporter interface that doesn't
> > require a FlowFile but instead accepts a "size" parameter as an argument.
> > This would eliminate the need for a mock FlowFile.
> >
> > The lack of a FlowFile operation in this situation creates a distinct
> > challenge because provenance events are typically tied to FlowFiles. Still,
> > it's important to indicate data transmission for monitoring and tracking.
> >
> > While the idea of a "size" parameter for the provenance event seems
> > preferable, we need to carefully consider its feasibility, potential
> > complexities, and community acceptance. The FileResourceService already
> > deviates from NiFi's concept of using FlowFiles to hold payload data, and
> > we must avoid further complicating the framework unless absolutely
> > necessary.
> >
> > If you have any insights or suggestions, please feel free to reply to this
> > email or join the discussion.
> >
> > Best Regards,
> > Lehel
>
