Hi,
Thanks for this. I'm struggling a little with types, specifically with specifying arrays in the schema; I'm missing something simple, and I'm not sure I've located the best examples to work with. Currently my error is:

ConvertRecord[id=01861024-3979-1a14-d64d-90e6deefd13b] Failed to process FlowFile[filename=dmr.pdf]; will route to failure: org.apache.nifi.serialization.record.util.IllegalTypeConversionException: Cannot convert value [[B@5b011741] of type class [B to Integer for field content
class BLOBReader implements RecordReader {
    private final IS
    //private final String content
    private final byte[] content
    private int calls
    private SimpleRecordSchema schema

    public BLOBReader(InputStream input) {
        content = input.readAllBytes()
        calls = 0
        IS = input
        List<RecordField> recordFields = []
        recordFields.add(new RecordField("content", RecordFieldType.BYTE.getDataType()))
        schema = new SimpleRecordSchema(recordFields)
    }

    public Record nextRecord(final boolean coerceTypes, final boolean dropUnknownFields) throws IOException, MalformedRecordException {
        if (calls > 0) {
            return null
        }
        calls = calls + 1
        Map<String, Object> recordValues = [:]
        recordValues.put("content", content)
        return new MapRecord(schema, recordValues)
    }

    @Override
    public void close() throws IOException {
        //bufferedReader.close()
        IS.close()
    }

    @Override
    public RecordSchema getSchema() {
        return schema
    }
}

On Fri, Jan 5, 2024 at 6:40 PM Matt Burgess <mattyb...@apache.org> wrote:
> I think the hard part here is taking a "raw" file like PDF bytes and
> creating a record in a certain format. For now I think ScriptedReader
> is your best bet, you can read the entire input stream in as a byte
> array then return a Record that contains a "bytes" field containing
> that data. You can create the schema for the record(s) in the script.
> Then whatever writer (AvroRecordSetWriter in this case?) will write it
> out in that format. I recommend a binary format like Avro instead of a
> text-based one like JSON to ensure the bytes don't get mangled.
>
> If we don't have a "bytes record reader" or something similar, we
> should look into adding it, I think most of the readers (even
> GrokReader IIRC) are text-based so we may not have something built-in
> that can take the content of a Flow File and turn it into a one-field
> Record of bytes. I have a stale PR for a text-based line reader [1]
> but you can use GrokReader for that so I let it go stale. Maybe I
> should do a similar one for binary data.
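[Editor's note] The IllegalTypeConversionException above is consistent with the schema declaring "content" as a single BYTE (which NiFi tries to coerce to an Integer) while the value supplied is a whole byte[]. One possible fix, sketched below and untested, is to declare the field as an array of bytes via RecordFieldType.ARRAY; the rest of the reader can stay as it is:

```groovy
// Sketch only: schema construction with "content" typed as an array of bytes,
// so the byte[] value is not coerced to a single Integer.
// This replaces the three schema lines in the constructor above.
List<RecordField> recordFields = []
recordFields.add(new RecordField("content",
        RecordFieldType.ARRAY.getArrayDataType(RecordFieldType.BYTE.getDataType())))
schema = new SimpleRecordSchema(recordFields)
```

The corresponding writer schema (e.g. for AvroRecordSetWriter) would then need the field to be Avro `bytes` or an array of ints, depending on how the writer maps the array type.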
>
> Regards,
> Matt
>
> [1] https://github.com/apache/nifi/pull/3735
>
> On Wed, Jan 3, 2024 at 7:20 PM Richard Beare <richard.be...@gmail.com> wrote:
> >
> > Any insights on this question post break? I think my problem can be
> > summarised as looking for the right way to place binary data, stored as an
> > on-disk file, into a field of an Avro record.
> >
> > On Wed, Dec 20, 2023 at 5:06 PM Richard Beare <richard.be...@gmail.com> wrote:
> >>
> >> I think I've made some progress with this, but I'm now having trouble
> >> with pdf files. The approach that seems to partly solve the problem is to
> >> have a ConvertRecord processor with a scripted reader to place the on-disk
> >> content (as delivered by the GetFile processor) into a record field. I can
> >> then use an UpdateRecord to add other fields. My current problem, I think,
> >> is correctly dealing with dumping a binary object (e.g. a pdf file) into
> >> that field. Going via strings worked for html files but breaks pdfs. I'm
> >> struggling with how to correctly set up the schema from within the script.
> >>
> >> On Tue, Dec 19, 2023 at 12:31 PM Richard Beare <richard.be...@gmail.com> wrote:
> >>>
> >>> Hi,
> >>> I've gotten rusty, not having done much nifi work for a while.
> >>>
> >>> I want to run some tests of the following scenario. I have a workflow
> >>> that takes documents from a DB and feeds them through tika. I want to test
> >>> with a different document set that is currently living on disk. The tika
> >>> (groovy) processor that is my front end is expecting a record with a number
> >>> of fields, one of which is the document content.
> >>>
> >>> I can simulate the fields (badly, but that doesn't matter at this
> >>> stage) with GenerateRecord, but how do I get the document contents from
> >>> disk into the right place? I've been thinking of using UpdateRecord to
> >>> modify the random records, but can't see how to get the data from GetFile
> >>> into the right place.
> >>>
> >>> Another thought is that perhaps I need to convert the GetFile output
> >>> into the right record structure with ConvertRecord, but then how do I
> >>> fill the other fields?
> >>>
> >>> What am I missing here?
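[Editor's note] On Matt's point above about preferring Avro over JSON: round-tripping raw bytes through a text encoding is exactly what breaks PDFs while HTML happens to survive. A small stand-alone Groovy illustration (hypothetical bytes, nothing NiFi-specific):

```groovy
// Bytes that are not valid UTF-8 (common in PDFs) do not survive a
// bytes -> String -> bytes round trip: invalid sequences are replaced
// with U+FFFD, which re-encodes as a different byte sequence.
byte[] raw = [0x25, 0x50, 0x44, 0x46, (byte) 0xFF, (byte) 0x00] as byte[] // "%PDF" + binary junk
String asText = new String(raw, "UTF-8")        // 0xFF decodes to U+FFFD
byte[] roundTripped = asText.getBytes("UTF-8")  // U+FFFD becomes 3 bytes
assert raw.length != roundTripped.length        // the data has been mangled
```

A pure-text HTML file contains only valid UTF-8, so the same round trip happens to be lossless there, which is why the string-based approach worked for html but not pdf.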