Thanks Arun and Peter. Getting that resolved will be nice. The performance difference with the record reader/writer approach is pretty fantastic, so the more we can do to iron out these sorts of edges the better. Thanks!
On Sun, Sep 24, 2017 at 8:56 PM, Peter Wicks (pwicks) <pwi...@micron.com> wrote:
> Arun,
>
> I'm also using Ctrl+A as a delimiter and had the same problem. I haven't had
> time to write up a PR but it looked like a pretty easy fix to me too.
>
> I can't merge the change if you submit it, but I'd be happy to review it.
>
> --Peter
>
> -----Original Message-----
> From: Arun Manivannan [mailto:a...@arunma.com]
> Sent: Sunday, September 24, 2017 11:17 PM
> To: Dev@nifi.apache.org
> Subject: [EXT] ConvertCSVToAvro vs CSVReader - Value Delimiter
>
> Hi,
>
> The ConvertCSVToAvro processor has been having performance issues while
> processing files larger than a GB, and it was suggested that I use
> ConvertRecord, which leverages the RecordReader and Writer. I did some tests
> and they perform well.
>
> Strangely, the CSVReader doesn't accept a Unicode character as the value
> delimiter. The Control-A (\u0001) character is the delimiter of my CSV.
>
> I did some analysis and I see that a minor change needs to be made to
> CSVUtils to unescape the delimiter, like what ConvertCSVToAvro does, and also
> to modify the SingleCharacterValidator.
>
> Please let me know if you believe this isn't an issue and there's a
> workaround for this. Otherwise, I am more than happy to raise an issue and
> submit a PR for review.
>
> Best Regards,
> Arun
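
For anyone following along, the delimiter unescaping described in the quoted message can be sketched in plain Java roughly as below. This is only an illustration and not the NiFi code: the class DelimiterUnescape and the method unescapeDelimiter are hypothetical names, and the sketch assumes the property value arrives as literal text (a backslash, "u", and four hex digits, or a backslash and "t"). The actual change to CSVUtils and SingleCharacterValidator may well look different.

```java
import java.util.regex.Pattern;

// Hypothetical illustration only; this is not NiFi's CSVUtils.
class DelimiterUnescape {

    // Turns a configured delimiter (a plain character, an escaped tab, or a
    // six-character Unicode escape typed as literal text) into one char.
    static char unescapeDelimiter(String configured) {
        if (configured == null || configured.isEmpty()) {
            throw new IllegalArgumentException("delimiter must not be empty");
        }
        String value = configured;
        if (configured.length() == 6 && configured.startsWith("\\u")) {
            // Assumption: four hex digits follow the escape prefix.
            // parseInt throws NumberFormatException if they are not hex.
            value = String.valueOf((char) Integer.parseInt(configured.substring(2), 16));
        } else if (configured.equals("\\t")) {
            value = "\t";
        }
        // After unescaping, the delimiter must be exactly one character.
        if (value.length() != 1) {
            throw new IllegalArgumentException("delimiter must be a single character: " + configured);
        }
        return value.charAt(0);
    }

    public static void main(String[] args) {
        // The property value as a user would type it: six literal characters.
        char delimiter = unescapeDelimiter("\\u0001");
        String line = "id" + delimiter + "name" + delimiter + "city";
        // Quote the delimiter so it is never treated as regex syntax.
        String[] fields = line.split(Pattern.quote(String.valueOf(delimiter)), -1);
        System.out.println(fields.length + " fields, delimiter code point " + (int) delimiter);
    }
}
```

The point of the single-character check after unescaping is that a validator which only looks at the raw text would reject the six-character escaped form, which matches the behaviour reported above.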