I think the hard part here is taking a "raw" file like PDF bytes and
creating a record in a particular format. For now I think ScriptedReader
is your best bet: you can read the entire input stream into a byte
array, then return a Record containing a single "bytes" field that
holds the data. You can create the schema for the record(s) in the
script, and whatever writer you configure (AvroRecordSetWriter in this
case?) will write it out in that format. I recommend a binary format
like Avro rather than a text-based one like JSON to ensure the bytes
don't get mangled.
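To sketch what I mean (untested, just the shape of it -- the class
names are mine, but the RecordReaderFactory/RecordReader interfaces
and the "reader" variable binding are what ScriptedReader expects),
a Groovy script for the ScriptedReader could look roughly like:

import org.apache.nifi.controller.AbstractControllerService
import org.apache.nifi.logging.ComponentLog
import org.apache.nifi.serialization.RecordReader
import org.apache.nifi.serialization.RecordReaderFactory
import org.apache.nifi.serialization.SimpleRecordSchema
import org.apache.nifi.serialization.record.MapRecord
import org.apache.nifi.serialization.record.Record
import org.apache.nifi.serialization.record.RecordField
import org.apache.nifi.serialization.record.RecordFieldType
import org.apache.nifi.serialization.record.RecordSchema

class BytesRecordReader implements RecordReader {
    final InputStream input
    final RecordSchema schema
    boolean done = false

    BytesRecordReader(InputStream input) {
        this.input = input
        // One "bytes" field typed as an array of bytes; the Avro
        // writer should map this to Avro's "bytes" type
        this.schema = new SimpleRecordSchema([new RecordField('bytes',
            RecordFieldType.ARRAY.getArrayDataType(
                RecordFieldType.BYTE.getDataType()))])
    }

    Record nextRecord(boolean coerceTypes, boolean dropUnknownFields) {
        if (done) return null // emit exactly one record per FlowFile
        done = true
        // Slurp the whole FlowFile content as raw bytes, no charset involved
        byte[] data = input.bytes
        return new MapRecord(schema, [bytes: data] as Map<String, Object>)
    }

    RecordSchema getSchema() { schema }

    void close() { input.close() }
}

class BytesRecordReaderFactory extends AbstractControllerService
        implements RecordReaderFactory {
    RecordReader createRecordReader(Map<String, String> variables,
            InputStream input, long inputLength, ComponentLog logger) {
        new BytesRecordReader(input)
    }
}

// ScriptedReader looks for a variable named "reader"
reader = new BytesRecordReaderFactory()

Pair that with an AvroRecordSetWriter in ConvertRecord and the FlowFile
content should come out as a single Avro record with a "bytes" field;
UpdateRecord can then fill in the other fields downstream. I haven't
run this, so treat the schema setup (ARRAY of BYTE mapping to Avro
bytes) as a starting point rather than gospel.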

If we don't have a "bytes record reader" or something similar, we
should look into adding one. I think most of the readers (even
GrokReader IIRC) are text-based, so we may not have anything built-in
that can take the content of a FlowFile and turn it into a one-field
Record of bytes. I have a PR for a text-based line reader [1], but
since GrokReader covers that use case I let it go stale. Maybe I
should do a similar one for binary data.

Regards,
Matt

[1] https://github.com/apache/nifi/pull/3735

On Wed, Jan 3, 2024 at 7:20 PM Richard Beare <richard.be...@gmail.com> wrote:
>
> Any insights on this question post-break? I think my problem can be 
> summarised as looking for the right way to place binary data, stored 
> as an on-disk file, into a field of an Avro record.
>
> On Wed, Dec 20, 2023 at 5:06 PM Richard Beare <richard.be...@gmail.com> wrote:
>>
>> I think I've made some progress with this, but I'm now having trouble with 
>> PDF files. The approach that seems to partly solve the problem is to have a 
>> ConvertRecord processor with a scripted reader that places the on-disk file 
>> (as delivered by the GetFile processor) into a record field. I can then use 
>> an UpdateRecord to add other fields. My current problem, I think, is 
>> correctly dealing with dumping a binary object (e.g. a PDF file) into that 
>> field. Going via strings worked for HTML files but breaks PDFs. I'm 
>> struggling with how to correctly set up the schema from within the script.
>>
>> On Tue, Dec 19, 2023 at 12:31 PM Richard Beare <richard.be...@gmail.com> 
>> wrote:
>>>
>>> Hi,
>>> I've gotten rusty, not having done much NiFi work for a while.
>>>
>>> I want to run some tests of the following scenario. I have a workflow that 
>>> takes documents from a DB and feeds them through Tika. I want to test with 
>>> a different document set that is currently living on disk. The Tika 
>>> (Groovy) processor that is my front end expects a record with a number of 
>>> fields, one of which is the document content.
>>>
>>> I can simulate the fields (badly, but that doesn't matter at this stage) 
>>> with GenerateRecord, but how do I get the document contents from disk into 
>>> the right place? I've been thinking of using UpdateRecord to modify the 
>>> random records, but I can't see how to get the data from GetFile into the 
>>> right place.
>>>
>>> Another thought is that perhaps I need to convert the GetFile output into 
>>> the right record structure with ConvertRecord, but then how do I fill in 
>>> the other fields?
>>>
>>> What am I missing here?
