Hi all,
I'm testing an upgrade from NiFi 1.18/1.19 to 1.25 and ran into an issue where
most ListHDFS processors that use a file filter stopped picking up new files. A
filter like .*\\.avro no longer works with the default filter mode of
"Directories and Files". Switching the mode to "Files only" fixes the issue, so
we have a way forward, but it requires pushing updates to tens of flows through
the OTAP pipeline.
Looking through Jira and the source code, it seems the new behavior was
introduced in NIFI-11178. When the mode is set to the default, the full path is
split on "/" and then *every* segment must match the pattern. To me, this is a
very unintuitive default. A typical basic use of this processor for us is to
specify a fixed directory like "/raw/sources/xxx/in" and then pick up files of
a certain type/name pattern. With the new code, this fails because none of the
individual directory names match ".*\\.avro". I can't think of any scenario in
which this filter mode works effectively with such a pattern.
The code in question (from
...bundle/nifi-hdfs-processors/src/main/java/org/apache/nifi/processors/hadoop/ListHDFS.java)
... (other cases)
// FILTER_DIRECTORIES_AND_FILES
default:
return path ->
Stream.of(Path.getPathWithoutSchemeAndAuthority(path).toString().split("/"))
.skip(getPathSegmentsToSkip(recursive))
.allMatch(v -> fileFilterRegexPattern.matcher(v).matches());
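For anyone who wants to reproduce this outside NiFi, here is a minimal
standalone sketch of the two behaviors (the class and method names are mine,
not NiFi's; only the split/allMatch logic mirrors the snippet quoted from
ListHDFS.java):

```java
import java.util.regex.Pattern;
import java.util.stream.Stream;

public class FilterDemo {
    static final Pattern FILTER = Pattern.compile(".*\\.avro");

    // "Directories and Files" as of NIFI-11178: every path segment must match.
    static boolean allSegmentsMatch(String path) {
        return Stream.of(path.split("/"))
                .filter(s -> !s.isEmpty()) // drop the empty segment before the leading "/"
                .allMatch(s -> FILTER.matcher(s).matches());
    }

    // The anyMatch variant: at least one segment must match.
    static boolean anySegmentMatches(String path) {
        return Stream.of(path.split("/"))
                .filter(s -> !s.isEmpty())
                .anyMatch(s -> FILTER.matcher(s).matches());
    }

    public static void main(String[] args) {
        String path = "/raw/sources/xxx/in/data.avro";
        // "raw", "sources", "xxx" and "in" all fail ".*\.avro", so allMatch is false
        System.out.println(allSegmentsMatch(path));   // prints false
        // "data.avro" matches, so anyMatch is true
        System.out.println(anySegmentMatches(path));  // prints true
    }
}
```

(The real code additionally skips the configured base-directory segments via
getPathSegmentsToSkip, which this sketch omits; that skipping does not help
here, since the remaining directory segments still fail the pattern.)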
Before I file a Jira ticket, I'd like to ask your opinions: would a default of
"Files only" make more sense, or is this actually a bug, and should the code
use anyMatch rather than allMatch?
Thanks for your time,
Isha