Hi,
Thanks for this. I'm struggling a little with types, specifically with specifying arrays in the schema; I'm missing something simple, and I'm not sure I've located the best examples to work with. Currently my error is:

ConvertRecord[id=01861024-3979-1a14-d64d-90e6deefd13b] Failed to process FlowFile[filename=dmr.pdf]; will route to failure: org.apache.nifi.serialization.record.util.IllegalTypeConversionException: Cannot convert value [[B@5b011741] of type class [B to Integer for field content
class BLOBReader implements RecordReader {
    private final IS
    //private final String content
    private final byte[] content
    private int calls
    private SimpleRecordSchema schema

    public BLOBReader(InputStream input) {
        content = input.readAllBytes()
        calls = 0
        IS = input
        List<RecordField> recordFields = []
        recordFields.add(new RecordField("content", RecordFieldType.BYTE.getDataType()))
        schema = new SimpleRecordSchema(recordFields)
    }

    public Record nextRecord(final boolean coerceTypes, final boolean dropUnknownFields) throws IOException, MalformedRecordException {
        if (calls > 0) {
            return null
        }
        calls = calls + 1
        Map<String, Object> recordValues = [:]
        recordValues.put("content", content)
        return new MapRecord(schema, recordValues)
    }

    @Override
    public void close() throws IOException {
        //bufferedReader.close()
        IS.close()
    }

    @Override
    public RecordSchema getSchema() {
        return schema
    }
}

On Fri, Jan 5, 2024 at 6:40 PM Matt Burgess <mattyb...@apache.org> wrote:
> I think the hard part here is taking a "raw" file like PDF bytes and
> creating a record in a certain format. For now I think ScriptedReader
> is your best bet, you can read the entire input stream in as a byte
> array then return a Record that contains a "bytes" field containing
> that data. You can create the schema for the record(s) in the script.
> Then whatever writer (AvroRecordSetWriter in this case?) will write it
> out in that format. I recommend a binary format like Avro instead of a
> text-based one like JSON to ensure the bytes don't get mangled.
>
> If we don't have a "bytes record reader" or something similar, we
> should look into adding it, I think most of the readers (even
> GrokReader IIRC) are text-based so we may not have something built-in
> that can take the content of a Flow File and turn it into a one-field
> Record of bytes. I have a stale PR for a text-based line reader [1]
> but you can use GrokReader for that so I let it go stale. Maybe I
> should do a similar one for binary data.
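[Editor's note] The IllegalTypeConversionException above is consistent with the schema declaring "content" as a single BYTE (which NiFi tries to coerce to an Integer) while the value supplied is a whole byte[]. One possible fix, sketched below and untested, is to declare the field as an array of bytes via RecordFieldType.ARRAY; the rest of the reader can stay as it is:

```groovy
// Sketch only: schema construction with "content" typed as an array of bytes,
// so the byte[] value is not coerced to a single Integer.
// This replaces the three schema lines in the constructor above.
List<RecordField> recordFields = []
recordFields.add(new RecordField("content",
        RecordFieldType.ARRAY.getArrayDataType(RecordFieldType.BYTE.getDataType())))
schema = new SimpleRecordSchema(recordFields)
```

The corresponding writer schema (e.g. for AvroRecordSetWriter) would then need the field to be Avro `bytes` or an array of ints, depending on how the writer maps the array type.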
>
> Regards,
> Matt
>
> [1] https://github.com/apache/nifi/pull/3735
>
> On Wed, Jan 3, 2024 at 7:20 PM Richard Beare <richard.be...@gmail.com> wrote:
> >
> > Any insights on this question post break? I think my problem can be
> > summarised as looking for the right way to place binary data, stored as an
> > on-disk file, into a field of an Avro record.
> >
> > On Wed, Dec 20, 2023 at 5:06 PM Richard Beare <richard.be...@gmail.com> wrote:
> >>
> >> I think I've made some progress with this, but I'm now having trouble
> >> with pdf files. The approach that seems to partly solve the problem is to
> >> have a ConvertRecord processor with a scripted reader to place the on-disk
> >> content (as delivered by the GetFile processor) into a record field. I can
> >> then use an UpdateRecord to add other fields. My current problem, I think,
> >> is correctly dealing with dumping a binary object (e.g. a pdf file) into
> >> that field. Going via strings worked for html files but breaks pdfs. I'm
> >> struggling with how to correctly set up the schema from within the script.
> >>
> >> On Tue, Dec 19, 2023 at 12:31 PM Richard Beare <richard.be...@gmail.com> wrote:
> >>>
> >>> Hi,
> >>> I've gotten rusty, not having done much nifi work for a while.
> >>>
> >>> I want to run some tests of the following scenario. I have a workflow
> >>> that takes documents from a DB and feeds them through tika. I want to test
> >>> with a different document set that is currently living on disk. The tika
> >>> (groovy) processor that is my front end is expecting a record with a number
> >>> of fields, one of which is the document content.
> >>>
> >>> I can simulate the fields (badly, but that doesn't matter at this
> >>> stage) with GenerateRecord, but how do I get the document contents from
> >>> disk into the right place? I've been thinking of using UpdateRecord to
> >>> modify the random records, but can't see how to get the data from GetFile
> >>> into the right place.
> >>>
> >>> Another thought is that perhaps I need to convert the GetFile output
> >>> into the right record structure with ConvertRecord, but then how do I
> >>> fill the other fields?
> >>>
> >>> What am I missing here?
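[Editor's note] On Matt's point above about preferring Avro over JSON: round-tripping raw bytes through a text encoding is exactly what breaks PDFs while HTML happens to survive. A small stand-alone Groovy illustration (hypothetical bytes, nothing NiFi-specific):

```groovy
// Bytes that are not valid UTF-8 (common in PDFs) do not survive a
// bytes -> String -> bytes round trip: invalid sequences are replaced
// with U+FFFD, which re-encodes as a different byte sequence.
byte[] raw = [0x25, 0x50, 0x44, 0x46, (byte) 0xFF, (byte) 0x00] as byte[] // "%PDF" + binary junk
String asText = new String(raw, "UTF-8")        // 0xFF decodes to U+FFFD
byte[] roundTripped = asText.getBytes("UTF-8")  // U+FFFD becomes 3 bytes
assert raw.length != roundTripped.length        // the data has been mangled
```

A pure-text HTML file contains only valid UTF-8, so the same round trip happens to be lossless there, which is why the string-based approach worked for html but not pdf.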