The processor breaks a much larger file down into a huge number of small
data points. We're talking on the order of turning a 1.1M-line file into
about 2.5B data points.

My current approach is "read a file with GetFile, save it to /tmp, break it
down into a bunch of large CSV record batches (a few hundred thousand
records per group)," and then commit.
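In processor terms, the current version is shaped roughly like the sketch
below. This is just a sketch: the class name, relationship, and batch size
are placeholders, and a plain BufferedReader stands in for the File-only
library I'm actually calling.

import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.Collections;
import java.util.Set;

import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.Relationship;
import org.apache.nifi.processor.exception.ProcessException;

public class SplitToRecordBatches extends AbstractProcessor {

    static final Relationship REL_SUCCESS = new Relationship.Builder()
            .name("success").description("CSV record batches").build();

    private static final int BATCH_SIZE = 250_000; // records per outgoing flow file

    @Override
    public Set<Relationship> getRelationships() {
        return Collections.singleton(REL_SUCCESS);
    }

    @Override
    public void onTrigger(final ProcessContext context, final ProcessSession session)
            throws ProcessException {
        final FlowFile original = session.get();
        if (original == null) {
            return;
        }
        File tmp = null;
        try {
            // Stage the content on local disk because the parsing library
            // only accepts a java.io.File, not an InputStream.
            tmp = Files.createTempFile("nifi-split-", ".dat").toFile();
            session.exportTo(original, tmp.toPath(), false);

            // BufferedReader is a stand-in for the File-only library.
            try (BufferedReader reader =
                         Files.newBufferedReader(tmp.toPath(), StandardCharsets.UTF_8)) {
                StringBuilder batch = new StringBuilder();
                int count = 0;
                String line;
                while ((line = reader.readLine()) != null) {
                    batch.append(line).append('\n');
                    if (++count == BATCH_SIZE) {
                        transferBatch(session, original, batch.toString());
                        batch = new StringBuilder();
                        count = 0;
                    }
                }
                if (count > 0) {
                    transferBatch(session, original, batch.toString());
                }
            }
            session.remove(original);
            // The framework commits when onTrigger returns: all batches at once.
        } catch (IOException e) {
            throw new ProcessException("Failed to split " + original, e);
        } finally {
            if (tmp != null) {
                tmp.delete();
            }
        }
    }

    private void transferBatch(final ProcessSession session, final FlowFile parent,
                               final String content) {
        FlowFile out = session.create(parent);
        out = session.write(out,
                rawOut -> rawOut.write(content.getBytes(StandardCharsets.UTF_8)));
        session.transfer(out, REL_SUCCESS);
    }
}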

It's slow, but with some good debug statements I can see the processor
tearing through the data just fine. However, I am thinking about adding an
"iterative" variant that would follow this pattern:

"read the file, save to /tmp, load the file, keep the current read position
intact, every onTrigger call sends out a batch w/ session.commit() until
it's done reading. Then grab the next flowfile."
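Concretely, I'm picturing something like the sketch below. Again, the names
and batch size are made up and the BufferedReader stands in for the real
library; the read position lives in the processor instance, so it would
need @TriggerSerially, and progress on a half-read file would be lost on a
restart.

import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.Collections;
import java.util.Set;

import org.apache.nifi.annotation.behavior.TriggerSerially;
import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.Relationship;
import org.apache.nifi.processor.exception.ProcessException;

@TriggerSerially // single-threaded so the in-memory read position is safe
public class IterativeSplitter extends AbstractProcessor {

    static final Relationship REL_SUCCESS = new Relationship.Builder()
            .name("success").description("One CSV batch per trigger").build();

    private static final int BATCH_SIZE = 250_000;

    // State carried across onTrigger calls; it does not survive a restart,
    // so a crash mid-file means losing or reprocessing the remainder.
    private File tmpFile;
    private BufferedReader reader;

    @Override
    public Set<Relationship> getRelationships() {
        return Collections.singleton(REL_SUCCESS);
    }

    @Override
    public void onTrigger(final ProcessContext context, final ProcessSession session)
            throws ProcessException {
        try {
            if (reader == null) {
                final FlowFile original = session.get();
                if (original == null) {
                    return;
                }
                // Stage to /tmp because the parsing library needs a java.io.File.
                tmpFile = Files.createTempFile("nifi-iter-", ".dat").toFile();
                session.exportTo(original, tmpFile.toPath(), false);
                // BufferedReader stands in for the File-only library.
                reader = Files.newBufferedReader(tmpFile.toPath(), StandardCharsets.UTF_8);
                // The original flow file is gone once this session commits;
                // from here on, only the temp file holds the remaining data.
                session.remove(original);
            }

            final StringBuilder batch = new StringBuilder();
            String line;
            int count = 0;
            while (count < BATCH_SIZE && (line = reader.readLine()) != null) {
                batch.append(line).append('\n');
                count++;
            }

            if (count > 0) {
                FlowFile out = session.create();
                out = session.write(out, rawOut ->
                        rawOut.write(batch.toString().getBytes(StandardCharsets.UTF_8)));
                session.transfer(out, REL_SUCCESS);
            }

            if (count < BATCH_SIZE) {
                // Finished this file; clean up and grab the next flowfile
                // on the next trigger.
                reader.close();
                reader = null;
                tmpFile.delete();
                tmpFile = null;
            }

            // One batch goes downstream per trigger (AbstractProcessor would
            // also commit on return).
            session.commit();
        } catch (IOException e) {
            throw new ProcessException("Iterative split failed", e);
        }
    }
}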

Does anyone have suggestions on good practices to follow here, potential
concerns, etc.? (Note: I have to write the file to /tmp because a library
I'm using, which I don't want to fork, doesn't have an API that can read
from a stream rather than a java.io.File.)

Also, are there any issues with accepting a contribution that makes use of
an LGPL-licensed library, in the event that my client wants to open source
it (we think they will)?

Thanks,

Mike
