[ https://issues.apache.org/jira/browse/NIFI-1716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pierre Villard resolved NIFI-1716.
----------------------------------
       Resolution: Duplicate
    Fix Version/s: 1.2.0

> Implement a SplitCsv processor, possibly also a GetCSV
> ------------------------------------------------------
>
>                 Key: NIFI-1716
>                 URL: https://issues.apache.org/jira/browse/NIFI-1716
>             Project: Apache NiFi
>          Issue Type: New Feature
>          Components: Core Framework
>            Reporter: Dmitry Goldenberg
>             Fix For: 1.2.0
>
>
> I'm proposing a SplitCSV processor dedicated specifically to splitting CSV
> content which is assumed to be in the flowfile-content of its incoming
> flowfiles.
> It appears that the current mode of splitting a CSV file is by using the
> SplitText processor. However, it'd be great to have a CSV splitter to read
> CSV records one by one and use the header row's header names to convert each
> record into a FlowFile, with attributes set to correspond to the headers.
> Whether or not the first row is a header should be a boolean configuration
> option. In the absence of a header row, some sensible default column names
> should be utilized, for example, one convention could be: column1, column2,
> column3, etc. (or a naming strategy could be provided by the user in the
> configuration).
> Another option on the splitter needs to be the delimiter character (defaulted
> to comma).
> Empty lines shall be skipped from processing.
> Extracted cell values shall be (optionally) whitespace-trimmed.
> Jagged rows must have some sensible handling:
> 1) For a given row, if there are fewer cells than in the header row, cells
> shall be assigned to columns left to right, and any missing cells shall be
> considered empty.
> 2) For a given row, if there are more cells than in the header row, a
> (non-fatal) error shall be generated for the row and the row shall be dropped
> from processing.
> As typically done with CSV, delimiter characters are ignored within quotes.
> Elements may span multiple lines by having embedded carriage returns; such
> elements must be quoted.
> NIFI-1280 asks for a way to specify which columns are to be kept or skipped.
> I'm proposing that instead of a separate processor, this would be implemented
> as a configuration option on SplitCSV (a list of 0-based indices of columns
> that are to be kept).
> It may also make sense to expose a GetCSV ingress component which would share
> most of its functionality with SplitCSV. Perhaps it's easiest if users just
> follow a GetFile with SplitCSV; however, in some cases it makes sense to avoid
> reading the file into flowfile content and instead process all CSV data
> in place, within a GetCSV.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
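
For readers who want to see the splitting rules proposed in the description in concrete form, here is a minimal sketch. It is written in Python with the standard csv module purely for illustration and is not NiFi code. The function name split_csv and its parameters (has_header, delimiter, trim, keep_columns) are hypothetical, and each returned dict stands in for the attributes of one outgoing FlowFile.

```python
import csv
import io


def split_csv(content, has_header=True, delimiter=",", trim=True, keep_columns=None):
    """Return (records, errors): one dict per CSV record, plus non-fatal row errors.

    Hypothetical helper; the names and return shape are assumptions for illustration.
    """
    # newline="" leaves line breaks untouched so quoted multi-line cells survive.
    reader = csv.reader(io.StringIO(content, newline=""), delimiter=delimiter)
    names = None
    records, errors = [], []
    for row in reader:
        if not row:
            continue  # empty lines are skipped
        if trim:
            row = [cell.strip() for cell in row]
        if names is None:
            if has_header:
                names = row  # first non-empty row supplies the column names
                continue
            # No header row: fall back to column1, column2, column3, ...
            names = [f"column{i + 1}" for i in range(len(row))]
        if len(row) > len(names):
            # More cells than columns: record a non-fatal error and drop the row.
            errors.append(f"line {reader.line_num}: too many cells ({len(row)})")
            continue
        # Fewer cells than columns: assign left to right, missing cells are empty.
        row = row + [""] * (len(names) - len(row))
        # keep_columns is a list of 0-based indices; it is not validated here.
        indices = range(len(names)) if keep_columns is None else keep_columns
        records.append({names[i]: row[i] for i in indices})
    return records, errors


sample = (
    "name, city ,note\n"
    "\n"
    'Ann,"Paris, FR","line one\nline two"\n'
    "Bob,Oslo\n"
    "Cy,Rome,x,extra\n"
)
records, errors = split_csv(sample)
no_header, _ = split_csv("a;b\n", has_header=False, delimiter=";", keep_columns=[1])
```

In this sample the blank line is skipped, the quoted comma and the quoted line break stay inside their cells, the short "Bob" row gets an empty note, and the over-long "Cy" row is reported in errors rather than emitted. The last call shows the no-header case with a semicolon delimiter and a single kept column.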