The processor breaks a single large input file down into a huge number of small data points. We're talking about turning a 1.1M-line file into roughly 2.5B data points.
My current approach is: read a file with GetFile, save it to /tmp, break it down into a series of large CSV record batches (a few hundred thousand records per batch), and then commit the session. It's slow, but with some debugging statements in place I can see the processor tearing into the data just fine.

However, I'm thinking about adding an "iterative" variant that follows this pattern: read the file, save it to /tmp, open it, and keep the current read position intact across triggers, so that every onTrigger call sends out one batch with session.commit() until the file is exhausted, then grabs the next flowfile. (There's a rough sketch of what I mean at the bottom of this message.) Does anyone have suggestions on good practices to follow here, potential concerns, etc.?

(Note: I have to write the file to /tmp because a library I'm using, and don't want to fork, has no API for reading from a stream; it only accepts a java.io.File.)

Also, are there any issues with accepting a contribution that makes use of an LGPL-licensed library, in the event that my client wants to open source it? (We think they will.)

Thanks,
Mike
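Here's the rough sketch of the iterative version I mentioned above. RecordReader is a made-up stand-in for the real file-only library, the class name and batch size are placeholders, and I'm assuming @TriggerSerially so the instance fields carrying the read position aren't hit by concurrent tasks:

import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Set;

import org.apache.nifi.annotation.behavior.TriggerSerially;
import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.Relationship;
import org.apache.nifi.processor.exception.ProcessException;

// @TriggerSerially because the fields below carry the read position
// between onTrigger calls and are not safe for concurrent tasks.
@TriggerSerially
public class IterativeSplitProcessor extends AbstractProcessor {

    static final Relationship REL_SUCCESS = new Relationship.Builder()
            .name("success")
            .description("CSV batches produced from the staged file")
            .build();

    private static final int BATCH_SIZE = 250_000; // placeholder

    // State for the file currently being worked through.
    private volatile File tempFile;
    private volatile RecordReader reader;

    @Override
    public Set<Relationship> getRelationships() {
        return Set.of(REL_SUCCESS);
    }

    @Override
    public void onTrigger(ProcessContext context, ProcessSession session)
            throws ProcessException {
        if (reader == null) {
            // No file in progress: take the next flowfile and stage it.
            FlowFile flowFile = session.get();
            if (flowFile == null) {
                return;
            }
            try {
                // The library only accepts java.io.File, so copy to /tmp.
                tempFile = File.createTempFile("split-", ".dat");
                session.exportTo(flowFile, tempFile.toPath(), false);
                reader = RecordReader.open(tempFile);
            } catch (IOException e) {
                throw new ProcessException("failed to stage input file", e);
            }
            session.remove(flowFile);
        }

        List<String> batch = reader.nextBatch(BATCH_SIZE);
        if (batch.isEmpty()) {
            // Finished this file; the next trigger pulls the next flowfile.
            reader.close();
            reader = null;
            tempFile.delete();
        } else {
            FlowFile out = session.create();
            out = session.write(out, os -> {
                for (String record : batch) {
                    os.write(record.getBytes(StandardCharsets.UTF_8));
                    os.write('\n');
                }
            });
            session.transfer(out, REL_SUCCESS);
        }
        session.commit(); // one committed batch per trigger
    }

    // Hypothetical stand-in for the file-only library; not a real API.
    interface RecordReader {
        static RecordReader open(File f) throws IOException {
            throw new UnsupportedOperationException("real library call here");
        }
        List<String> nextBatch(int maxRecords);
        void close();
    }
}

The point of the structure is that session.commit() runs once per trigger, so each batch becomes visible downstream while the rest of the big file is still being read.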