AFAIK it is fine and appropriate to issue multiple provenance events
for a single FlowFile. In the case of PutAzureBlobStorage uploading a
file to Azure, it is the incoming FlowFile that triggers the upload.
Attributes are added to the FlowFile before a provenance event is
reported, so that "version" of the FlowFile can be the one used to
report a SEND event. I have done this for that processor as part of a
larger refactor/improvement of the provenance capability:

    session.getProvenanceReporter().send(flowFile,
        blob.getSnapshotQualifiedUri().toString(), transferMillis,
        REL_SUCCESS);

Having said that, to Mark's point, it's probably better to have a
separate UPLOAD_FILE event. I can change that in my code.

I added a couple of similar provenance calls to related processors,
such as TriggerHiveMetastoreEvent:

    session.getProvenanceReporter().invokeRemoteProcess(flowFile,
        hiveMetastoreUrl, REL_SUCCESS);

I am still working on this. I need to write up a Jira with a thorough
treatment of the material and eventually get a PR up for review.
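
To make the uploadFile() idea from Mark's message (quoted below) a bit
more concrete, here is a rough stand-alone sketch. It does not use the
NiFi API at all: SketchReporter, UploadFileEvent, the String FlowFile
id, and the example URI are hypothetical names I made up for
illustration, and UPLOAD_FILE is only the proposed event type, not one
that exists today. The one point it tries to show is that the reported
size comes from the local File rather than from the FlowFile.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;

public class UploadFileSketch {

    /** Hypothetical event holder; UPLOAD_FILE is a proposed type, not an existing one. */
    public static final class UploadFileEvent {
        public final String flowFileId; // stand-in for the FlowFile that triggered the upload
        public final String transitUri; // where the local file was uploaded to
        public final long fileSize;     // taken from the local File, not from the FlowFile
        public final String details;    // information about the local file that was uploaded

        UploadFileEvent(String flowFileId, String transitUri, long fileSize, String details) {
            this.flowFileId = flowFileId;
            this.transitUri = transitUri;
            this.fileSize = fileSize;
            this.details = details;
        }
    }

    /** Hypothetical stand-in for the part of ProvenanceReporter under discussion. */
    public static final class SketchReporter {
        private final List<UploadFileEvent> events = new ArrayList<>();

        /** Records an upload event whose size is read from the File itself. */
        public void uploadFile(String flowFileId, File file, String transitUri) {
            String details = "Uploaded local file " + file.getAbsolutePath();
            events.add(new UploadFileEvent(flowFileId, transitUri, file.length(), details));
        }

        public List<UploadFileEvent> getEvents() {
            return events;
        }
    }

    /** Writes a 1024-byte temp file, reports it, and returns the size stored in the event. */
    public static long runDemo() {
        try {
            File local = File.createTempFile("upload-sketch", ".bin");
            local.deleteOnExit();
            Files.write(local.toPath(), new byte[1024]);

            SketchReporter reporter = new SketchReporter();
            reporter.uploadFile("flowfile-1", local, "https://example.invalid/container/blob");
            return reporter.getEvents().get(0).fileSize;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("UPLOAD_FILE size=" + runDemo());
    }
}
```

In a real change the first argument would be the FlowFile itself and
the event would go through the normal provenance repository. The
sketch only captures the shape of the method signature being
discussed.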

Regards,
Matt

On Thu, Oct 26, 2023 at 12:02 PM Mark Payne <[email protected]> wrote:
>
> Lehel,
>
> I don’t believe we should be trying to create a “Mock FlowFile.” I am ok with 
> an update to the ProvenanceReporter interface. But I don’t think it should 
> accept a “size” parameter. Rather, I think this is a completely different 
> type of event that is occurring. This is not a “send” in that it’s not 
> sending the contents of the FlowFile to a remote system. Rather, I’d say it's 
> an UPLOAD_FILE event. So I’d lean more toward an uploadFile() method on 
> ProvenanceReporter that takes as an argument a `File` (as well as a 
> FlowFile). The size would come from the File itself, and the event would 
> convey the information about the local file that was uploaded - probably in 
> the Event Details.
>
> Thanks
> -Mark
>
>
> > On Oct 26, 2023, at 10:36 AM, Lehel Boér <[email protected]> wrote:
> >
> > Hi everyone,
> >
> > I would like to address a particular scenario that has recently come to my 
> > attention regarding the use of the PutAzureBlobStorage processor with the 
> > FileResourceService.
> >
> > When the PutAzureBlobStorage processor is used with the 
> > FileResourceService, it currently uploads a file from the user's local 
> > filesystem to Azure, but it does not create a FlowFile. Instead, it 
> > utilizes the incoming FlowFile solely to send a provenance event. In this 
> > case the size of the provenance event is the incoming FlowFile's size 
> > instead of the uploaded one.
> >
> > There are potential solutions to address this issue and ensure that the 
> > provenance events are handled effectively. Two main options have been 
> > proposed:
> >
> >
> >  *   Create a Mock FlowFile: A mock FlowFile with a size matching that of 
> > the local file being uploaded could be generated. This mock FlowFile would 
> > serve as the basis for the provenance event, even though its size might not 
> > reflect the actual content.
> >
> >  *   Modify the ProvenanceReporter Interface: Alternatively, we could 
> > introduce a new method in the ProvenanceReporter interface that doesn't 
> > require a FlowFile but instead accepts a "size" parameter as an argument. 
> > This would eliminate the need for a mock FlowFile.
> >
> > The lack of a FlowFile operation in this situation creates a distinct 
> > challenge because provenance events are typically tied to FlowFiles. Still, 
> > it's important to indicate data transmission for monitoring and tracking.
> >
> > While the idea of a "size" parameter for the provenance event seems 
> > preferable, we need to carefully consider its feasibility, potential 
> > complexities, and community acceptance. The FileResourceService already 
> > deviates from NiFi's concept of using FlowFiles to hold payload data, and 
> > we must avoid further complicating the framework unless absolutely 
> > necessary.
> >
> > If you have any insights or suggestions, please feel free to reply to this 
> > email or join the discussion.
> >
> > Best Regards,
> > Lehel
>
