[ 
https://issues.apache.org/jira/browse/NIFI-12491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17859726#comment-17859726
 ] 

Brendan Buhr commented on NIFI-12491:
-------------------------------------

[~dstiegli1] I expect that anyone who wants to apply a schema will use the 
reader the way we did before the ExcelReader controller service existed: first 
split the workbook (previously with excelToCSV, now replaced by the SplitExcel 
processor), then feed each sheet into a record reader configured with the 
ExcelReader and a schema per sheet.
 # Treating the first row as the header was an option on the CSV Reader, and it 
was the default. We are merely looking for the same functionality on this 
reader. As mentioned by [~iiojj2], excelToCSV also had options to skip columns 
and rows. Since SplitExcel only splits the file, I think it would be best to 
bring those two features over to the reader and possibly make them dynamic, 
set from flow file attributes and passed to the reader, so that a single 
reader can be used on multiple sheets, with the sheet and the rows/columns to 
skip supplied as attributes. These would be optional, with the default being 
to keep the existing behavior when they are not defined.
!image-2024-06-24-18-01-49-886.png!
!image-2024-06-24-18-02-36-592.png!
 # I have never encountered a case where multiple sheets had the same structure 
and the same header was applied to all of them, but I can picture a scenario 
where data is split into chunks for various reasons and should be treated as 
one dataset. We would usually split the sheets and then do a join of some sort 
before querying, but I can see the benefit of having the reader join the data. 
I would make this optional and not the default behavior, so that you still get 
an error when no sheet name is specified and can trigger alerts on that.
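
To make point 1 concrete, here is a minimal sketch of deriving string field 
names from a header row after skipping rows and columns. It is plain Java over 
in-memory rows and is not the NiFi ExcelReader API; the class name, the method 
name, the parameters rowsToSkip and columnsToSkip, and the positional fallback 
name for blank header cells are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of NiFi: shows the requested behavior only.
final class HeaderSchemaSketch {

    /**
     * Returns the field names taken from the first row that remains after
     * skipping rowsToSkip leading rows, dropping the zero-based column
     * indexes listed in columnsToSkip.
     */
    static List<String> headerFields(List<List<String>> rows,
                                     int rowsToSkip,
                                     List<Integer> columnsToSkip) {
        if (rowsToSkip < 0 || rowsToSkip >= rows.size()) {
            throw new IllegalArgumentException(
                "No header row left after skipping " + rowsToSkip + " rows");
        }
        List<String> header = rows.get(rowsToSkip);
        List<String> fields = new ArrayList<>();
        for (int col = 0; col < header.size(); col++) {
            if (columnsToSkip.contains(col)) {
                continue; // column excluded by configuration
            }
            String name = header.get(col);
            // Assumed fallback: use a positional name when the cell is blank.
            fields.add(name == null || name.isBlank() ? "column_" + col : name.trim());
        }
        return fields;
    }

    public static void main(String[] args) {
        List<List<String>> sheet = List.of(
            List.of("Title row", "", ""),
            List.of("id", "internal", "name"),
            List.of("1", "x", "Alice"));
        // Skip the title row and drop the second column.
        System.out.println(headerFields(sheet, 1, List.of(1)));
    }
}
```

In a flow, the two skip values could come from flow file attributes, with 
"skip nothing" as the default so that existing behavior is preserved.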

 

On a side note, one thing I have experienced is data with two header rows 
where the first one is a merged header: the value was written to the first 
cell and the rest were blank. Sometimes it is nice to have that value in all of 
the cells that were merged, so that you can combine it with the second-row 
headers. (This is a nice-to-have and not relevant to this ticket.)
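
As a rough illustration of that side note, the sketch below forward-fills the 
blank cells of the first header row and joins each value with the second-row 
header. It assumes the merge runs across columns and that blank cells after a 
value belong to the same merged region. The class name, method name, and the 
"_" separator are hypothetical, and nothing here is existing NiFi behavior.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of NiFi.
final class MergedHeaderSketch {

    /** Combines a forward-filled top header row with the second header row. */
    static List<String> combineHeaders(List<String> topRow, List<String> secondRow) {
        List<String> combined = new ArrayList<>();
        String lastTop = ""; // most recent non-blank value seen in the top row
        for (int col = 0; col < secondRow.size(); col++) {
            String top = col < topRow.size() ? topRow.get(col) : "";
            if (top != null && !top.isBlank()) {
                lastTop = top.trim();
            }
            String second = secondRow.get(col) == null ? "" : secondRow.get(col).trim();
            combined.add(lastTop.isEmpty() ? second : lastTop + "_" + second);
        }
        return combined;
    }

    public static void main(String[] args) {
        System.out.println(combineHeaders(
            List.of("Sales", "", "Costs", ""),
            List.of("Q1", "Q2", "Q1", "Q2")));
    }
}
```

A plain forward fill cannot tell a genuinely empty top cell from one that was 
part of a merge. A real implementation would need the sheet's merged-region 
information to decide that.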

> ExcelReader - new Schema Access strategy: Use String Fields From Header
> -----------------------------------------------------------------------
>
>                 Key: NIFI-12491
>                 URL: https://issues.apache.org/jira/browse/NIFI-12491
>             Project: Apache NiFi
>          Issue Type: Improvement
>          Components: Core Framework
>    Affects Versions: 1.23.2
>            Reporter: Philipp Korniets
>            Assignee: Daniel Stieglitz
>            Priority: Major
>         Attachments: image-2024-06-24-18-01-49-886.png, 
> image-2024-06-24-18-02-36-592.png
>
>
> ExcelReader needs an ability similar to CSVReader to "Use String Fields From 
> Header" as a Schema Access Strategy.
> The current implementation has:
> 1. Use Schema Name/Schema Text - this option relies on the order of the 
> columns. Possible issue: the order of the columns changes but the types 
> don't, which causes further calculations to be erroneous.
> 2. Infer Schema - replaces real column names with column_1, column_2, etc. 
> This again loses the "context" of the column and forces us to rely on how 
> the columns are ordered. 
> Any workaround makes the workflow more complicated.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
