Thanks Mark for the valuable input. SplitRecord is the way to handle multi-line records, and NIFI-3921 will let us avoid needing a separate schema when the CSV header row itself can be used as the schema.
Is anyone working on the NIFI-3921 issue? If not, I can take it up.

Regards,
Venkat

On Tue, Jun 6, 2017 at 10:06 PM, Mark Payne <[email protected]> wrote:
> Venkat,
>
> If you do need to split the data up, there is now a SplitRecord processor
> that you can use to accomplish that with the readers and writers,
> so it won't have problems with CSV fields that span multiple lines.
>
> Unfortunately, at this time the writer does require that a schema registry
> be used to designate the schema. For most cases this is fairly easy to do,
> but it is a step that we should be able to skip altogether.
> There already exists a JIRA [1] to update the readers/writers so that
> the Record Writer can just inherit the schema that is provided by the
> Record Reader. Once this has been done, the CSV Reader should be able to
> create the schema based on the CSV header and then pass that along to the
> Record Writer.
>
> Thanks
> -Mark
>
> [1] https://issues.apache.org/jira/browse/NIFI-3921
>
>
> On Jun 6, 2017, at 12:12 PM, Venkat Williams <[email protected]> wrote:
>
> Hi Joe and Mark,
>
> Thanks a lot for your prompt response.
>
> I wasn't able to consider SplitText, because CSV record field values can
> fall onto the next line, with embedded newlines, escaped double-quotes,
> etc., so I had to rule out any logic based on splitting.
>
> Another question: is it possible to convert CSV data to JSON without
> specifying any schema, just by treating the first row of the CSV file as
> the header and building the schema internally from it? If I don't specify
> a schema registry, I get an error that the 'Schema Access Strategy' is
> invalid.
>
> Thanks,
> Venkat
>
> On Tue, Jun 6, 2017 at 9:29 PM, Joe Witt <[email protected]> wrote:
>
>> Venkat,
>>
>> The only heap issues that could be considered common are when you're
>> using SplitText and trying to go from files with hundreds of thousands
>> or millions of lines to single-line outputs in a single processor. You
>> can easily overcome that by doing a two-phase split, where the first
>> processor cuts the file into, say, 1000-line chunks and the next one
>> does single-line chunks. That said, the record approach doesn't have
>> that problem at all, so the only cause for memory issues there would be
>> a single record so large that it takes up all the memory, which doesn't
>> appear likely for your examples.
>>
>> Thanks
>>
>> On Tue, Jun 6, 2017 at 11:49 AM, Venkat Williams <[email protected]> wrote:
>>> Thanks Mark for helping me to build a template and test Convert CSV to
>>> JSON processing.
>>>
>>> I want to know whether it is possible to emit transformed records to
>>> the next processor as they are produced, rather than waiting for the
>>> full file to be processed and keeping the entire result in a single
>>> FlowFile.
>>>
>>> Input:
>>> id,topic,hits
>>> Rahul,scala,120
>>> Nikita,spark,80
>>> Mithun,spark,1
>>> myself,cca175,180
>>>
>>> Actual output:
>>> [{"id":"Rahul","topic":"scala","hits":120},{"id":"Nikita","topic":"spark","hits":80},{"id":"Mithun","topic":"spark","hits":1},{"id":"myself","topic":"cca175","hits":180}]
>>>
>>> Expected output (multiple FlowFiles, like a split result):
>>> {"id":"Rahul","topic":"scala","hits":120}
>>> {"id":"Nikita","topic":"spark","hits":80}
>>> {"id":"Mithun","topic":"spark","hits":1}
>>> {"id":"myself","topic":"cca175","hits":180}
>>>
>>> By doing this I can overcome the heap/out-of-memory issues which are so
>>> common (scenario: NiFi has only 1 GB of RAM but needs to process 5 GB
>>> of input data).
>>>
>>> Regards,
>>> Venkat
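For a concrete sense of what record-at-a-time emission looks like outside of NiFi, here is a minimal standalone sketch (not NiFi code and not the proposed processor) that streams a CSV of any size and writes one JSON object per line, so heap usage is bounded by a single record rather than the whole file. It uses Apache Commons CSV and Jackson; the class name and file paths are placeholders, and values stay strings because no schema or type inference is applied:

import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;

import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;

public class CsvToNdjson {

    // Stream the CSV and write one JSON object per record, so memory is bounded
    // by a single record rather than by the size of the whole input file.
    public static void convert(Reader csvIn, Writer jsonOut) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        CSVFormat format = CSVFormat.RFC4180.withFirstRecordAsHeader();
        try (CSVParser parser = new CSVParser(csvIn, format)) {
            for (CSVRecord record : parser) {             // records are parsed lazily, one at a time
                Map<String, String> row = record.toMap(); // header -> value for this row only
                jsonOut.write(mapper.writeValueAsString(row));
                jsonOut.write('\n');                      // one JSON object per line, like a split result
            }
        }
        jsonOut.flush();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical input and output paths passed on the command line.
        try (Reader in = Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.UTF_8);
             Writer out = Files.newBufferedWriter(Paths.get(args[1]), StandardCharsets.UTF_8)) {
            convert(in, out);
        }
    }
}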
>>>
>>> On Tue, Jun 6, 2017 at 8:32 PM, Mark Payne <[email protected]> wrote:
>>>>
>>>> Hi Venkat,
>>>>
>>>> I just published a blog post [1] on running SQL in NiFi. The post walks
>>>> through creating a CSV Record Reader, running SQL over the data, and
>>>> then writing the results in JSON. This may be helpful to you. In your
>>>> case, you may want to just use the ConvertRecord processor rather than
>>>> QueryRecord, but the concepts of creating the Record Reader and Writer
>>>> are the same. This post references another post [2] that I wrote a
>>>> week or two ago that gives a bit more detail on how to actually create
>>>> the reader and writer.
>>>>
>>>> The CSV Reader uses Apache Commons CSV, so it will support RFC-4180,
>>>> embedded newlines, escaped double-quotes, etc.
>>>>
>>>> I hope this helps give some direction on how to handle this in NiFi.
>>>>
>>>> Thanks
>>>> -Mark
>>>>
>>>> [1] https://blogs.apache.org/nifi/entry/real-time-sql-on-event
>>>> [2] https://blogs.apache.org/nifi/entry/record-oriented-data-with-nifi
>>>>
>>>>
>>>> On Jun 6, 2017, at 9:52 AM, Venkat Williams <[email protected]> wrote:
>>>>
>>>> Hi Joe Witt,
>>>>
>>>> Thanks for your response.
>>>>
>>>> I have heard and read about these record readers but haven't quite
>>>> worked out how to use them with some test data or a template. It would
>>>> be great if you could help me find a working example or flow.
>>>>
>>>> I want to know whether these implementations support RFC-4180-formatted
>>>> CSV files and handle edge cases like embedded newlines in a field value
>>>> and escaped double quotes.
>>>>
>>>> Thanks in advance for your help.
>>>>
>>>> Regards,
>>>> Venkat
>>>>
>>>> On Tue, Jun 6, 2017 at 7:07 PM, Joe Witt <[email protected]> wrote:
>>>>>
>>>>> Venkat,
>>>>>
>>>>> I think you'll want to take a closer look at the Apache NiFi 1.2.0
>>>>> release's support for record readers and record writers. It handles
>>>>> schema-aware parsing/transformation and more for formats like CSV,
>>>>> JSON, and Avro, can be easily extended, and supports scripted readers
>>>>> and writers written right there through the UI. As it is new, examples
>>>>> are still emerging, but we can certainly help you along.
>>>>>
>>>>> Thanks
>>>>> Joe
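To make the RFC-4180 edge cases discussed above concrete, here is a small self-contained check (again not NiFi code, with made-up sample data) showing that a compliant parser such as Apache Commons CSV treats a quoted field containing an escaped double-quote, a comma, and an embedded newline as part of a single record, even though it spans two physical lines; this is exactly why naive line-based splitting cannot be trusted with this kind of data:

import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;

import java.io.StringReader;
import java.util.List;

public class Rfc4180Demo {
    public static void main(String[] args) throws Exception {
        // A quoted field with an escaped (doubled) double-quote, a comma, and an
        // embedded newline: two physical lines, but only one logical record.
        String csv = "id,comment\n"
                   + "1,\"says \"\"hello\"\", and\nspans two lines\"\n";

        try (CSVParser parser = new CSVParser(new StringReader(csv),
                CSVFormat.RFC4180.withFirstRecordAsHeader())) {
            List<CSVRecord> records = parser.getRecords();
            System.out.println("records parsed: " + records.size());          // 1, not 2
            System.out.println("comment field : " + records.get(0).get("comment"));
        }
    }
}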
>>>>>
>>>>> On Tue, Jun 6, 2017 at 3:12 AM, Venkat Williams <[email protected]> wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I want to contribute this processor implementation code to the NiFi
>>>>>> project.
>>>>>>
>>>>>> Requirements:
>>>>>>
>>>>>> 1) Convert CSV files to a standard/canonical JSON format
>>>>>>    a. One JSON object/document per row in the input CSV
>>>>>>    b. The format should encode the data as JSON fields and values
>>>>>>    c. JSON field names should be the original column headers, with
>>>>>>       any invalid characters handled properly
>>>>>>    d. Values should be kept unaltered
>>>>>> 2) Optionally, be able to specify an expected header used to
>>>>>>    validate/reject input CSVs
>>>>>> 3) Support both tab- and comma-delimited files
>>>>>>    a. Auto-detection based on the header row is easy
>>>>>>    b. Allow the operator to specify the delimiter as a way to
>>>>>>       override the auto-detect logic
>>>>>> 4) Handle arbitrarily large files
>>>>>>    a. Should handle CSV files of any length (achieved using batching)
>>>>>> 5) Handle errors gracefully
>>>>>>    a. File failures
>>>>>>    b. Row failures
>>>>>> 6) Support RFC-4180-formatted CSV files and be sure to handle edge
>>>>>>    cases like embedded newlines in a field value and escaped double
>>>>>>    quotes
>>>>>>
>>>>>> Example:
>>>>>>
>>>>>> Input CSV:
>>>>>> user,source_ip,source_country,destination_ip,url,timestamp
>>>>>> Venkat,192.168.0.1,IN,23.246.97.82,http://www.google.com,2017-02-22T14:46:24-05:00
>>>>>>
>>>>>> Desired output JSON:
>>>>>> {"user":"Venkat","source_ip":"192.168.0.1","source_country":"IN","destination_ip":"23.246.97.82","url":"http://www.google.com","timestamp":"2017-02-22T14:46:24-05:00"}
>>>>>>
>>>>>> Implementation:
>>>>>>
>>>>>> 1) Reviewed the existing CSV libraries that can transform a CSV
>>>>>>    record into a JSON document while supporting the RFC-4180 standard
>>>>>>    (embedded newlines in field values, escaped quotes). Found that
>>>>>>    the OpenCSV, FastCSV, and univocity CSV libraries can do this job
>>>>>>    most effectively.
>>>>>> 2) Selected the univocity CSV library, as most of the validations in
>>>>>>    my requirements can be done with this library alone. In
>>>>>>    performance testing with arbitrarily large 5 GB and 10 GB files,
>>>>>>    it gave better results than any of the others.
>>>>>> 3) Processed CSV records are emitted immediately rather than after
>>>>>>    the whole file has been processed; a configurable number in the
>>>>>>    processor controls how many records to accumulate before emitting.
>>>>>>    With this approach I could process 5 GB of CSV data using 1 GB of
>>>>>>    NiFi RAM, which is the most effective/attractive feature of this
>>>>>>    whole implementation for handling large files. (This is a common
>>>>>>    limitation of processors such as SplitText, SplitXml, etc., which
>>>>>>    wait until the whole file has been processed and store the
>>>>>>    resulting FlowFiles in an ArrayList within the processor; this
>>>>>>    causes heap-size/out-of-memory issues.)
>>>>>> 4) Handled file errors and record errors gracefully using
>>>>>>    user-defined configuration and processor routes.
>>>>>>
>>>>>> Can anyone suggest how to proceed further: do I have to open a new
>>>>>> issue, or should I use an existing one? (I don't find any that
>>>>>> matches this requirement.)
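Two of the requirements above, sanitizing column headers into JSON field names (1c) and auto-detecting tab versus comma with an operator override (3a/3b), are easy to sketch in isolation. The helpers below are assumptions for illustration only, not the behaviour of the proposed processor:

import java.util.Optional;

public class CsvHeaderUtils {

    // Requirement 1c: keep the original header where possible, but replace
    // characters that are awkward in JSON field names or downstream schemas
    // with underscores. (The exact character set kept here is an assumption.)
    public static String sanitizeFieldName(String header) {
        String trimmed = header == null ? "" : header.trim();
        String safe = trimmed.replaceAll("[^A-Za-z0-9_]", "_");
        return safe.isEmpty() ? "column" : safe;
    }

    // Requirement 3: guess the delimiter from the header row (tab wins if it
    // appears more often than comma), unless the operator configured one.
    public static char chooseDelimiter(String headerLine, Optional<Character> configured) {
        if (configured.isPresent()) {
            return configured.get();
        }
        long tabs = headerLine.chars().filter(c -> c == '\t').count();
        long commas = headerLine.chars().filter(c -> c == ',').count();
        return tabs > commas ? '\t' : ',';
    }

    public static void main(String[] args) {
        System.out.println(sanitizeFieldName("source ip (v4)"));                               // source_ip__v4_
        System.out.println(chooseDelimiter("user\tsource_ip\turl", Optional.empty()) == '\t'); // true
        System.out.println(chooseDelimiter("user,source_ip,url", Optional.of('|')));           // | (override)
    }
}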
