I think the hard part here is taking a "raw" file like PDF bytes and creating a record in a certain format. For now I think ScriptedReader is your best bet: you can read the entire input stream in as a byte array, then return a Record that contains a "bytes" field holding that data. You can create the schema for the record(s) in the script. Then whatever writer you choose (AvroRecordSetWriter in this case?) will write it out in that format. I recommend a binary format like Avro rather than a text-based one like JSON, to ensure the bytes don't get mangled.
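The mangling risk is easy to demonstrate outside NiFi: round-tripping arbitrary binary through a UTF-8 text decode (which is effectively what a text-based format does) corrupts byte sequences that aren't valid UTF-8 — which is why HTML tends to survive but PDFs don't. A minimal standalone Python sketch of the effect (illustration only, not NiFi API):

```python
# Arbitrary binary: a PDF-style header followed by bytes that are not valid UTF-8.
pdf_like = b"%PDF-1.4\n" + bytes([0xFF, 0xD8, 0x80, 0x00])

# Text round-trip, as a text-based format would do: decode to a string,
# then encode back to bytes. Invalid UTF-8 sequences get replaced.
as_text = pdf_like.decode("utf-8", errors="replace")
round_tripped = as_text.encode("utf-8")
print(round_tripped == pdf_like)   # False: the binary was mangled

# A binary "bytes" field carries the data with no decoding at all,
# so nothing is lost.
print(bytes(pdf_like) == pdf_like)  # True
```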
If we don't have a "bytes record reader" or something similar, we should look into adding one. I think most of the readers (even GrokReader, IIRC) are text-based, so we may not have anything built-in that can take the content of a FlowFile and turn it into a one-field Record of bytes. I have a stale PR for a text-based line reader [1], but since GrokReader covers that case I let it go stale. Maybe I should do a similar one for binary data.

Regards,
Matt

[1] https://github.com/apache/nifi/pull/3735

On Wed, Jan 3, 2024 at 7:20 PM Richard Beare <richard.be...@gmail.com> wrote:
>
> Any insights on this question post break? I think my problem can be
> summarised as looking for the right way to place binary data, stored as an
> on-disk file, into a field of an Avro record.
>
> On Wed, Dec 20, 2023 at 5:06 PM Richard Beare <richard.be...@gmail.com> wrote:
>>
>> I think I've made some progress with this, but I'm now having trouble with
>> pdf files. The approach that seems to partly solve the problem is to have a
>> ConvertRecord processor with a scripted reader that places the on-disk file
>> content (as delivered by the GetFile processor) into a record field. I can
>> then use an UpdateRecord to add other fields. My current problem, I think,
>> is correctly dealing with dumping a binary object (e.g. a pdf file) into
>> that field. Going via strings worked for html files but breaks pdfs. I'm
>> struggling with how to correctly set up the schema from within the script.
>>
>> On Tue, Dec 19, 2023 at 12:31 PM Richard Beare <richard.be...@gmail.com>
>> wrote:
>>>
>>> Hi,
>>> I've gotten rusty, not having done much nifi work for a while.
>>>
>>> I want to run some tests of the following scenario. I have a workflow
>>> that takes documents from a DB and feeds them through tika. I want to
>>> test with a different document set that is currently living on disk.
>>> The tika (groovy) processor that is my front end is expecting a record
>>> with a number of fields, one of which is the document content.
>>>
>>> I can simulate the fields (badly, but that doesn't matter at this stage)
>>> with GenerateRecord, but how do I get the document contents from disk
>>> into the right place? I've been thinking of using UpdateRecord to modify
>>> the random records, but can't see how to get the data from GetFile into
>>> the right place.
>>>
>>> Another thought is that perhaps I need to convert the GetFile output into
>>> the right record structure with ConvertRecord, but then how would I fill
>>> in the other fields?
>>>
>>> What am I missing here?
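On the schema question in the quoted thread: NiFi record readers and writers are commonly configured with Avro-format schema text, and a single-field schema whose field type is Avro's primitive `bytes` is the shape being discussed above. The record and field names below ("document", "content") are placeholders, not anything from the thread; the part that matters is the `"bytes"` type. A quick check of the schema shape, using only Python's standard library:

```python
import json

# Hypothetical one-field Avro schema for carrying raw document bytes.
# "document" and "content" are placeholder names; "bytes" is the Avro
# primitive type that keeps binary content intact end to end.
schema_text = """
{
  "type": "record",
  "name": "document",
  "fields": [
    {"name": "content", "type": "bytes"}
  ]
}
"""

schema = json.loads(schema_text)
print(schema["fields"][0]["type"])  # bytes
```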