[ 
https://issues.apache.org/jira/browse/NIFI-1716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pierre Villard resolved NIFI-1716.
----------------------------------
       Resolution: Duplicate
    Fix Version/s: 1.2.0

> Implement a SplitCsv processor, possibly also a GetCSV
> ------------------------------------------------------
>
>                 Key: NIFI-1716
>                 URL: https://issues.apache.org/jira/browse/NIFI-1716
>             Project: Apache NiFi
>          Issue Type: New Feature
>          Components: Core Framework
>            Reporter: Dmitry Goldenberg
>             Fix For: 1.2.0
>
>
> I'm proposing a SplitCSV processor dedicated specifically to splitting CSV 
> content which is assumed to be in the flowfile-content of its incoming 
> flowfiles.
> It appears that the current way to split a CSV file is to use the 
> SplitText processor. However, it'd be great to have a CSV splitter that reads 
> CSV records one by one and uses the header row's column names to convert each 
> record into a FlowFile, with attributes set to correspond to the headers.
> Whether or not the first row is a header should be a boolean configuration 
> option.  In the absence of a header row, some sensible default column names 
> should be utilized, for example, one convention could be: column1, column2, 
> column3, etc. (or a naming strategy could be provided by the user in the 
> configuration).
> Another option on the splitter needs to be the delimiter character (defaulted 
> to comma).
> Empty lines shall be skipped from processing.
> Extracted cell values shall be (optionally) whitespace-trimmed.
> Jagged rows must have some sensible handling:
> 1) For a given row, if there are fewer cells than in the header row, cells 
> shall be assigned to columns left to right, and any missing cells shall be 
> considered empty.
> 2) For a given row, if there are more cells than in the header row, a 
> (non-fatal) error shall be generated for the row and the row shall be dropped 
> from processing.
> As typically done with CSV, delimiter characters are ignored within quotes.
> Elements may span multiple lines by having embedded carriage returns; such 
> elements must be quoted.
> NIFI-1280 asks for a way to specify which columns are to be kept or skipped. 
> I'm proposing that instead of a separate processor, this would be implemented 
> as a configuration option on SplitCSV (a list of 0-based indices of columns 
> that are to be kept).
> It may also make sense to expose a GetCSV ingress component which would share 
> most of its functionality with SplitCSV.  Perhaps it's easiest if users just 
> follow a GetFile with SplitCSV; however, in some cases it makes sense to avoid 
> reading the file into flowfile content and instead process all CSV data 
> in place, within a GetCSV.
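
The splitting rules proposed in the description above (optional header row, configurable delimiter, skipped empty lines, optional trimming, jagged-row handling, quoted delimiters and line breaks, and a list of 0-based column indices to keep) can be sketched outside NiFi as follows. This is only an illustration of the described behavior using Python's standard csv module, not NiFi code or an existing processor API. The function name split_csv and all of its parameter names are hypothetical. Two details are assumptions not stated in the issue: header names are trimmed along with cell values, and when there is no header row the width of the first record fixes the number of default columns.

```python
import csv
import io


def split_csv(text, has_header=True, delimiter=",", trim=True, keep_columns=None):
    """Split CSV text into one dict per record, keyed by column name.

    Returns (records, errors). Each error is a (line_number, message) pair
    for a row that was dropped because it had more cells than the header.
    All names here are hypothetical; this is not a NiFi API.
    """
    # csv.reader handles delimiters and line breaks inside quoted cells.
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter)
    header = None
    records, errors = [], []
    for row in reader:
        if not row:
            continue  # empty lines are skipped
        if trim:
            row = [cell.strip() for cell in row]
        if header is None:
            if has_header:
                header = row  # assumption: header names are trimmed too
                continue
            # Assumption: the first record's width sets the default columns.
            header = [f"column{i + 1}" for i in range(len(row))]
        if len(row) > len(header):
            # Too many cells: non-fatal error, row is dropped.
            errors.append(
                (reader.line_num,
                 f"expected at most {len(header)} cells, got {len(row)}")
            )
            continue
        # Too few cells: assign left to right, treat missing cells as empty.
        row = row + [""] * (len(header) - len(row))
        indices = range(len(header)) if keep_columns is None else keep_columns
        records.append({header[i]: row[i] for i in indices})
    return records, errors
```

In a processor, each returned dict would presumably map to the attributes of one outgoing FlowFile, and each error to a logged warning or a failure relationship; that mapping is left open here because the issue does not specify it.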



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
