Hi all,
I'm testing an upgrade from NiFi 1.18/1.19 to 1.25 and ran into an issue where
most ListHDFS processors that use a file filter stopped picking up new files. A
filter like .*\\.avro no longer works with the default filter mode of
"Directories and Files". Switching the mode to "Files only" fixes the issue, so
we have a way forward, but it requires pushing updates to tens of flows through
the OTAP pipeline.
Looking through Jira and the source code, it seems the new behavior was
introduced in NIFI-11178. When the mode is set to the default, the full path is
split on "/" and then *every* segment must match the pattern. To me, this is a
very unintuitive default. A typical basic use of this processor for us is to
specify a fixed directory like "/raw/sources/xxx/in" and then pick up files of
a certain type/name pattern. With the new code, this fails because none of the
individual directory names match ".*\\.avro". I can't think of any scenario in
which this filter mode works effectively with such a pattern.
The code in question (from
...bundle/nifi-hdfs-processors/src/main/java/org/apache/nifi/processors/hadoop/ListHDFS.java)
... (other cases)
// FILTER_DIRECTORIES_AND_FILES
default:
return path ->
Stream.of(Path.getPathWithoutSchemeAndAuthority(path).toString().split("/"))
.skip(getPathSegmentsToSkip(recursive))
.allMatch(v -> fileFilterRegexPattern.matcher(v).matches());
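For anyone who wants to reproduce this outside NiFi, here is a minimal
standalone sketch of the two behaviors (the class and method names are mine,
not NiFi's; only the split/allMatch logic mirrors the snippet quoted from
ListHDFS.java):

```java
import java.util.regex.Pattern;
import java.util.stream.Stream;

public class FilterDemo {
    static final Pattern FILTER = Pattern.compile(".*\\.avro");

    // "Directories and Files" as of NIFI-11178: every path segment must match.
    static boolean allSegmentsMatch(String path) {
        return Stream.of(path.split("/"))
                .filter(s -> !s.isEmpty()) // drop the empty segment before the leading "/"
                .allMatch(s -> FILTER.matcher(s).matches());
    }

    // The anyMatch variant: at least one segment must match.
    static boolean anySegmentMatches(String path) {
        return Stream.of(path.split("/"))
                .filter(s -> !s.isEmpty())
                .anyMatch(s -> FILTER.matcher(s).matches());
    }

    public static void main(String[] args) {
        String path = "/raw/sources/xxx/in/data.avro";
        // "raw", "sources", "xxx" and "in" all fail ".*\.avro", so allMatch is false
        System.out.println(allSegmentsMatch(path));   // prints false
        // "data.avro" matches, so anyMatch is true
        System.out.println(anySegmentMatches(path));  // prints true
    }
}
```

(The real code additionally skips the configured base-directory segments via
getPathSegmentsToSkip, which this sketch omits; that skipping does not help
here, since the remaining directory segments still fail the pattern.)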
Before I file a Jira ticket, I'd like to ask your opinions: would a default of
"Files only" make more sense, or is this actually a bug, and should the code
use anyMatch rather than allMatch?
Thanks for your time,
Isha