RE: Handling CSVs dynamically with NiFi

2023-04-07 Thread Mike Sofen
This is where I felt Nifi wasn’t the right tool for the job and Postgres was.  
After I imported the CSV directly into a staging table in the database (using 
Nifi), I converted the payload part of the columns into jsonb and stored that 
into the final table in a column with additional columns as relational data 
(timestamps, identifiers, etc).  It was an object-relational data model.

 

THEN, using the amazingly powerful Postgres jsonb functions, I was able to 
extract the unique keys in an entire dataset or across multiple datasets (to 
build a data catalog for example), perform a wide range of validations on 
individual keys, etc.  I use the word amazing because they are not just 
powerful functions but they run surprisingly fast given the amount of string 
data they are traversing.
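The key-extraction idea can be sketched outside the database (a Python analogy of my own; in Postgres itself the building block for this is `jsonb_object_keys`, and the key names below are illustrative):

```python
import json

def catalog_keys(json_docs):
    # Union of top-level keys across many JSON payloads -- the essence of
    # building a data catalog from a jsonb column.
    keys = set()
    for doc in json_docs:
        keys.update(json.loads(doc).keys())
    return sorted(keys)

docs = ['{"city": "Almena", "state": "KS"}',
        '{"city": "Barboursville", "fund": "10536"}']
print(catalog_keys(docs))  # ['city', 'fund', 'state']
```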

 

Mike Sofen

 

From: James McMahon  
Sent: Thursday, April 06, 2023 2:03 PM
To: users@nifi.apache.org
Subject: Re: Handling CSVs dynamically with NiFi

 

Can I ask you one follow-up? I've gotten my ConvertRecord to work. I created a 
CsvReader service with Schema Access Strategy of Use String Fields From Header. 
I created a JsonRecordSetWriter service with Schema Write Strategy of Do Not 
Write Schema.

When ConvertRecord is finished, my result looks like this sample:

[ {
  "Bank Name�" : "Almena State Bank",
  "City�" : "Almena",
  "State�" : "KS",
  "Cert�" : "15426",
  "Acquiring Institution�" : "Equity Bank",
  "Closing Date�" : "23-Oct-20",
  "Fund" : "10538"
}, {
  "Bank Name�" : "First City Bank of Florida",
  "City�" : "Fort Walton Beach",
  "State�" : "FL",
  "Cert�" : "16748",
  "Acquiring Institution�" : "United Fidelity Bank, fsb",
  "Closing Date�" : "16-Oct-20",
  "Fund" : "10537"
}, {
  "Bank Name�" : "The First State Bank",
  "City�" : "Barboursville",
  "State�" : "WV",
  "Cert�" : "14361",
  "Acquiring Institution�" : "MVB Bank, Inc.",
  "Closing Date�" : "3-Apr-20",
  "Fund" : "10536"
}] 

 

I don't really have a schema. How can I use a combination of SplitJson and 
EvaluateJsonPath to split each json object out to its own nifi flowfile, and to 
pull the json key values out to define the fields in the csv header? I've found 
a few examples through research that allude to this, but they all seem to have 
a fixed schema and they don't offer configurations for the SplitJson. In a case 
where my JSON key definitions change depending on the flowfile, what should 
JsonPathExpression be set to in the SplitJson configuration?
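As an illustration of what that split would do (a plain-Python sketch, not NiFi configuration; for SplitJson a top-level array is commonly addressed with a JsonPath of `$.*`, but verify that against the docs for your version):

```python
import json

sample = ('[{"Bank Name": "Almena State Bank", "City": "Almena"},'
          ' {"Bank Name": "The First State Bank", "Fund": "10536"}]')

records = json.loads(sample)                  # SplitJson: one array element per flowfile
headers = [list(r.keys()) for r in records]   # key set differs record to record
print(headers)  # [['Bank Name', 'City'], ['Bank Name', 'Fund']]
```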

 

On Thu, Apr 6, 2023 at 9:59 AM Mike Sofen <mso...@runbox.com> wrote:

Jim – that’s exactly what I did on that “pre” step – generate a schema from the 
CSVReader and use that to dynamically create the DDL sql needed to build the 
staging table in Postgres.  In my solution, there are 2 separate pipelines 
running – this pre step and the normal file processing.
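That "pre" step might look like this in outline (a hedged sketch of my own; the `staging` schema and table names are hypothetical, and every column lands as text in staging):

```python
import csv, io

def staging_ddl(table, csv_text):
    # Derive CREATE TABLE DDL from the CSV header alone, every column text,
    # the way a registration pre-step might before any data is loaded.
    header = next(csv.reader(io.StringIO(csv_text)))
    cols = ", ".join('"%s" text' % h.strip() for h in header)
    return 'CREATE TABLE staging."%s" (%s);' % (table, cols)

ddl = staging_ddl("banklist", "Bank Name,City,State\n")
print(ddl)  # CREATE TABLE staging."banklist" ("Bank Name" text, "City" text, "State" text);
```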

 

I used the pre step to ensure that all incoming files were from a known and 
valid source and that they conformed to the schema for that source – a very 
tidy way to ensure data quality.

 

Mike

 

From: James McMahon <jsmcmah...@gmail.com> 
Sent: Thursday, April 06, 2023 6:39 AM
To: users@nifi.apache.org
Subject: Re: Handling CSVs dynamically with NiFi

 

Thank you both very much, Bryan and Mike. Mike, had you considered the approach 
mentioned by Bryan - a Reader processor to infer schema  -  and found it wasn't 
suitable for your use case, for some reason? For instance, perhaps you were 
employing a version of Apache NiFi that did not afford access to a CsvReader or 
InferAvroSchema processor?

Jim

 

On Thu, Apr 6, 2023 at 9:30 AM Mike Sofen <mso...@runbox.com> wrote:

Hi James,

 

I don’t have time to go into details, but I had nearly the same scenario and 
solved it by using Nifi as the file processing piece only, sending valid CSV 
files (valid as in CSV formatting) and leveraged Postgres to land the CSV data 
into pre-built staging tables and from there did content validations and 
packaging into jsonb for storage into a single target table.  

 

In my case, an external file source had to “register” a single file (to allow 
creating the matching staging table) prior to sending data.  I used Nifi for 
that pre-staging step to derive the schema for the staging table for a file and 
I used a complex stored procedure to handle a massive amount of logic around 
the contents of a file when processing the actual files prior to storing into 
the destination table.

 

Nifi was VERY fast and efficient in this, as was Postgres.

RE: Handling CSVs dynamically with NiFi

2023-04-06 Thread Mike Sofen
Jim – that’s exactly what I did on that “pre” step – generate a schema from the 
CSVReader and use that to dynamically create the DDL sql needed to build the 
staging table in Postgres.  In my solution, there are 2 separate pipelines 
running – this pre step and the normal file processing.

 

I used the pre step to ensure that all incoming files were from a known and 
valid source and that they conformed to the schema for that source – a very 
tidy way to ensure data quality.

 

Mike

 

From: James McMahon  
Sent: Thursday, April 06, 2023 6:39 AM
To: users@nifi.apache.org
Subject: Re: Handling CSVs dynamically with NiFi

 

Thank you both very much, Bryan and Mike. Mike, had you considered the approach 
mentioned by Bryan - a Reader processor to infer schema  -  and found it wasn't 
suitable for your use case, for some reason? For instance, perhaps you were 
employing a version of Apache NiFi that did not afford access to a CsvReader or 
InferAvroSchema processor?

Jim

 

On Thu, Apr 6, 2023 at 9:30 AM Mike Sofen <mso...@runbox.com> wrote:

Hi James,

 

I don’t have time to go into details, but I had nearly the same scenario and 
solved it by using Nifi as the file processing piece only, sending valid CSV 
files (valid as in CSV formatting) and leveraged Postgres to land the CSV data 
into pre-built staging tables and from there did content validations and 
packaging into jsonb for storage into a single target table.  

 

In my case, an external file source had to “register” a single file (to allow 
creating the matching staging table) prior to sending data.  I used Nifi for 
that pre-staging step to derive the schema for the staging table for a file and 
I used a complex stored procedure to handle a massive amount of logic around 
the contents of a file when processing the actual files prior to storing into 
the destination table.

 

Nifi was VERY fast and efficient in this, as was Postgres.

 

Mike Sofen

 

From: James McMahon <jsmcmah...@gmail.com> 
Sent: Thursday, April 06, 2023 4:35 AM
To: users <users@nifi.apache.org>
Subject: Handling CSVs dynamically with NiFi

 

We have a task requiring that we transform incoming CSV files to JSON. The CSVs 
vary in schema.

 

There are a number of interesting flow examples out there illustrating how one 
can set up a flow to handle the case where the CSV schema is well known and 
fixed, but none for the generalized case.

 

The structure of the incoming CSV files will not be known in advance in our use 
case. Our nifi flow must be generalized because I cannot configure and rely on 
a service that defines a specific fixed Avro schema registry. An Avro schema 
registry seems to presume an awareness of the CSV structure in advance. We 
don't have that luxury in this use case, with CSVs arriving from many different 
providers and so characterized by schemas that are unknown.

 

What is the best way to get around this challenge? Does anyone know of an 
example where NiFi builds the schema on the fly as CSVs arrive for processing, 
dynamically defining the Avro schema for the CSV?
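For what it's worth, the on-the-fly inference being asked about can be approximated in a few lines (a naive sketch with only string/long detection; NiFi's record readers do a far more thorough job of this):

```python
import csv, io, json

def infer_avro_schema(csv_text, name="dynamic_record"):
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, sample = rows[0], rows[1]
    fields = []
    for col, value in zip(header, sample):
        # Naive typing: long if the sample value parses as an integer.
        avro_type = "long" if value.lstrip("-").isdigit() else "string"
        fields.append({"name": col.strip(), "type": avro_type})
    return {"type": "record", "name": name, "fields": fields}

schema = infer_avro_schema("City,Fund\nAlmena,10538\n")
print(json.dumps(schema))
```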

 

Thanks in advance for any thoughts.



RE: Handling CSVs dynamically with NiFi

2023-04-06 Thread Mike Sofen
Hi James,

 

I don’t have time to go into details, but I had nearly the same scenario and 
solved it by using Nifi as the file processing piece only, sending valid CSV 
files (valid as in CSV formatting) and leveraged Postgres to land the CSV data 
into pre-built staging tables and from there did content validations and 
packaging into jsonb for storage into a single target table.  

 

In my case, an external file source had to “register” a single file (to allow 
creating the matching staging table) prior to sending data.  I used Nifi for 
that pre-staging step to derive the schema for the staging table for a file and 
I used a complex stored procedure to handle a massive amount of logic around 
the contents of a file when processing the actual files prior to storing into 
the destination table.

 

Nifi was VERY fast and efficient in this, as was Postgres.

 

Mike Sofen

 

From: James McMahon  
Sent: Thursday, April 06, 2023 4:35 AM
To: users 
Subject: Handling CSVs dynamically with NiFi

 

We have a task requiring that we transform incoming CSV files to JSON. The CSVs 
vary in schema.

 

There are a number of interesting flow examples out there illustrating how one 
can set up a flow to handle the case where the CSV schema is well known and 
fixed, but none for the generalized case.

 

The structure of the incoming CSV files will not be known in advance in our use 
case. Our nifi flow must be generalized because I cannot configure and rely on 
a service that defines a specific fixed Avro schema registry. An Avro schema 
registry seems to presume an awareness of the CSV structure in advance. We 
don't have that luxury in this use case, with CSVs arriving from many different 
providers and so characterized by schemas that are unknown.

 

What is the best way to get around this challenge? Does anyone know of an 
example where NiFi builds the schema on the fly as CSVs arrive for processing, 
dynamically defining the Avro schema for the CSV?

 

Thanks in advance for any thoughts.



RE: Trouble accessing v 1.14.0 on GCP

2021-08-23 Thread Mike Sofen
That was it – setting the nifi.web.proxy.host to the VM’s external IP (and 
leaving the nifi.web.https.host blank) resulted in the Nifi login screen, and I 
was able to log in.

Whew!!  Thank you so much for the information.  Mike


From: David Handermann 
Sent: Monday, August 23, 2021 9:28 AM
To: users@nifi.apache.org
Subject: Re: Trouble accessing v 1.14.0 on GCP

Hi Mike,

Thanks for the reply, it looks like the request is now getting to the NiFi 
server.  The error message indicates that the public IP address is not one of 
the expected values for the HTTP Host header, based on the NiFi configuration. 
The following property should be configured with the public DNS name of the 
NiFi system in order for NiFi to accept requests:
nifi.web.proxy.host

See the Web Properties section of the Administrator's Guide for more details on 
that particular property:

https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html#web-properties

Running a reverse DNS lookup of the public IP address should return the host 
value to use for that property, and for accessing NiFi through the browser.

Regards,
David Handermann

On Mon, Aug 23, 2021 at 11:16 AM Mike Sofen 
<mso...@ansunbiopharma.com> wrote:
Hi David,

Thanks for the tip to try a blank https host address – I hadn’t tried that 
since there was a note somewhere saying something like “nifi will pick the 
network, which may not be what you want”.

However, trying it resulted in the same outcome – my on-prem Windows PC browser 
cannot connect to the GCP nifi. but now gets the result shown below.  I never 
get a login screen as the docs mention.  Mike



[inline screenshot of the browser error omitted]



From: David Handermann <exceptionfact...@apache.org>
Sent: Monday, August 23, 2021 6:38 AM
To: users@nifi.apache.org
Subject: Re: Trouble accessing v 1.14.0 on GCP

Hi Mike,

Small correction, I mistyped the property name the second time, so for 
clarification, I intended to say setting a blank value for the HTTPS host as 
follows:
nifi.web.https.host=

Regards,
David Handermann

On Mon, Aug 23, 2021 at 8:35 AM David Handermann 
<exceptionfact...@apache.org> wrote:
Hi Mike,

The nifi.web.https.host property must match one of the IP addresses assigned to 
the system on which NiFi is running. The GCP virtual machine has a private IP 
address assigned to a local interface, and uses network address translation to 
send requests from the public address to the local interface address. Setting a 
blank value for nifi.web.http.post will cause NiFi 
to listen on all available interfaces, which should allow NiFi to receive 
incoming requests.
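The bind behavior David describes is plain OS socket semantics, which a few lines make concrete (a Python sketch; port 0 asks the OS for any free port):

```python
import socket

# Binding to "" listens on all interfaces -- the effect of leaving
# nifi.web.https.host blank -- while binding a specific address only works
# if that address is actually assigned to a local interface. A NAT'd public
# IP is not, hence the "Cannot assign requested address" failure.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind(("", 0))
print("bound to all interfaces on port", s.getsockname()[1])
s.close()
```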

The purpose of the default 127.0.0.1 address is to avoid public access to NiFi 
without additional security configuration. The default HTTPS and single user 
credentials provide some measure of protection, and I recommend reviewing the 
Security Configuration and User Authentication sections of the NiFi System 
Administrator's Guide for more details on securing the NiFi installation.

https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html#security_configuration

Regards,
David Handermann

On Mon, Aug 23, 2021 at 8:06 AM Mike Sofen 
<mso...@ansunbiopharma.com> wrote:
minor correction - the port shown (8543) was from the alternate port test, the 
regular port test 8443 returns a similar error:
" Nifi fails to start, with the log saying:
2021-08-20 18:55:27,715 WARN [main] org.apache.nifi.web.server.JettyServer 
Failed to start web server... shutting down.
java.io.IOException: Failed to bind to /35.xxx.xx.xxx:8543 Caused by: 
java.net.BindException: Cannot assign requested address"

Mike

-----Original Message-----
From: Mike Sofen
Sent: Monday, August 23, 2021 6:00 AM
To: users@nifi.apache.org
Subject: Trouble accessing v 1.14.0 on GCP

At my prior company I've installed earlier versions of nifi on GCP Debian VMs 
and not had a problem pointing a Windows 10 browser at them and going to work.  
I'm aware v1.14.0 requires a user login when not using certs, but I can't even 
get to that step.

I'm pulling my remaining hair out trying to connect to a new Debian VM on GCP 
running v 1.14.0 on Java 8.  Nifi starts and runs properly, with this caveat - 
I cannot reference the static external IP - only the default IP (127.0.0.1), so 
my browser can't connect.  I have a GCP firewall rule that opens the 8443 port 
for the VM, and even added ICMP to it and can ping it from a CMD shell on my 
PC.  I've checked all of the file permissions on that VM, all uniformly correct.

Details of my nifi.properties:

If I use:
nifi.web.https.host=127.0.0.1 (the default)
nifi.web.https.port=8443

Nifi starts properly and runs, but my browser returns " 127.0.0.1 refused to 
connect "

If I use the VM's static IP (which is what I've used on prior VMs):
nifi.web.https.host=35.xxx.xx.xxx

RE: Trouble accessing v 1.14.0 on GCP

2021-08-23 Thread Mike Sofen
Hi David,

Thanks for the tip to try a blank https host address – I hadn’t tried that 
since there was a note somewhere saying something like “nifi will pick the 
network, which may not be what you want”.

However, trying it resulted in the same outcome – my on-prem Windows PC browser 
cannot connect to the GCP nifi. but now gets the result shown below.  I never 
get a login screen as the docs mention.  Mike



[inline screenshot of the browser error omitted]



From: David Handermann 
Sent: Monday, August 23, 2021 6:38 AM
To: users@nifi.apache.org
Subject: Re: Trouble accessing v 1.14.0 on GCP

Hi Mike,

Small correction, I mistyped the property name the second time, so for 
clarification, I intended to say setting a blank value for the HTTPS host as 
follows:
nifi.web.https.host=

Regards,
David Handermann

On Mon, Aug 23, 2021 at 8:35 AM David Handermann 
<exceptionfact...@apache.org> wrote:
Hi Mike,

The nifi.web.https.host property must match one of the IP addresses assigned to 
the system on which NiFi is running. The GCP virtual machine has a private IP 
address assigned to a local interface, and uses network address translation to 
send requests from the public address to the local interface address. Setting a 
blank value for nifi.web.http.post will cause NiFi 
to listen on all available interfaces, which should allow NiFi to receive 
incoming requests.

The purpose of the default 127.0.0.1 address is to avoid public access to NiFi 
without additional security configuration. The default HTTPS and single user 
credentials provide some measure of protection, and I recommend reviewing the 
Security Configuration and User Authentication sections of the NiFi System 
Administrator's Guide for more details on securing the NiFi installation.

https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html#security_configuration

Regards,
David Handermann

On Mon, Aug 23, 2021 at 8:06 AM Mike Sofen 
<mso...@ansunbiopharma.com> wrote:
minor correction - the port shown (8543) was from the alternate port test, the 
regular port test 8443 returns a similar error:
" Nifi fails to start, with the log saying:
2021-08-20 18:55:27,715 WARN [main] org.apache.nifi.web.server.JettyServer 
Failed to start web server... shutting down.
java.io.IOException: Failed to bind to /35.xxx.xx.xxx:8543 Caused by: 
java.net.BindException: Cannot assign requested address"

Mike

-----Original Message-----
From: Mike Sofen
Sent: Monday, August 23, 2021 6:00 AM
To: users@nifi.apache.org
Subject: Trouble accessing v 1.14.0 on GCP

At my prior company I've installed earlier versions of nifi on GCP Debian VMs 
and not had a problem pointing a Windows 10 browser at them and going to work.  
I'm aware v1.14.0 requires a user login when not using certs, but I can't even 
get to that step.

I'm pulling my remaining hair out trying to connect to a new Debian VM on GCP 
running v 1.14.0 on Java 8.  Nifi starts and runs properly, with this caveat - 
I cannot reference the static external IP - only the default IP (127.0.0.1), so 
my browser can't connect.  I have a GCP firewall rule that opens the 8443 port 
for the VM, and even added ICMP to it and can ping it from a CMD shell on my 
PC.  I've checked all of the file permissions on that VM, all uniformly correct.

Details of my nifi.properties:

If I use:
nifi.web.https.host=127.0.0.1 (the default)
nifi.web.https.port=8443

Nifi starts properly and runs, but my browser returns " 127.0.0.1 refused to 
connect "

If I use the VM's static IP (which is what I've used on prior VMs):
nifi.web.https.host=35.xxx.xx.xxx
nifi.web.https.port=8443

Nifi fails to start, with the log saying:
2021-08-20 18:55:27,715 WARN [main] org.apache.nifi.web.server.JettyServer 
Failed to start web server... shutting down.
java.io.IOException: Failed to bind to /35.xxx.xx.xxx:8543 Caused by: 
java.net.BindException: Cannot assign requested address

Endless web searches and tests have resulted in no change of behavior - with 
the default IP, Nifi runs but I can't access it, and with my external IP, it 
won't start.  I've even tried using a different port (8543), no change.  In 
this GCP project, I have just this one VM and it has successfully been running 
Postgres for many months.

Any ideas?

Mike Sofen


RE: Trouble accessing v 1.14.0 on GCP

2021-08-23 Thread Mike Sofen
minor correction - the port shown (8543) was from the alternate port test, the 
regular port test 8443 returns a similar error:  
" Nifi fails to start, with the log saying: 
2021-08-20 18:55:27,715 WARN [main] org.apache.nifi.web.server.JettyServer 
Failed to start web server... shutting down.
java.io.IOException: Failed to bind to /35.xxx.xx.xxx:8543 Caused by: 
java.net.BindException: Cannot assign requested address"

Mike

-----Original Message-----
From: Mike Sofen 
Sent: Monday, August 23, 2021 6:00 AM
To: users@nifi.apache.org
Subject: Trouble accessing v 1.14.0 on GCP

At my prior company I've installed earlier versions of nifi on GCP Debian VMs 
and not had a problem pointing a Windows 10 browser at them and going to work.  
I'm aware v1.14.0 requires a user login when not using certs, but I can't even 
get to that step.

I'm pulling my remaining hair out trying to connect to a new Debian VM on GCP 
running v 1.14.0 on Java 8.  Nifi starts and runs properly, with this caveat - 
I cannot reference the static external IP - only the default IP (127.0.0.1), so 
my browser can't connect.  I have a GCP firewall rule that opens the 8443 port 
for the VM, and even added ICMP to it and can ping it from a CMD shell on my 
PC.  I've checked all of the file permissions on that VM, all uniformly correct.

Details of my nifi.properties:

If I use:
nifi.web.https.host=127.0.0.1 (the default)
nifi.web.https.port=8443

Nifi starts properly and runs, but my browser returns " 127.0.0.1 refused to 
connect "

If I use the VM's static IP (which is what I've used on prior VMs): 
nifi.web.https.host=35.xxx.xx.xxx
nifi.web.https.port=8443

Nifi fails to start, with the log saying: 
2021-08-20 18:55:27,715 WARN [main] org.apache.nifi.web.server.JettyServer 
Failed to start web server... shutting down.
java.io.IOException: Failed to bind to /35.xxx.xx.xxx:8543 Caused by: 
java.net.BindException: Cannot assign requested address

Endless web searches and tests have resulted in no change of behavior - with 
the default IP, Nifi runs but I can't access it, and with my external IP, it 
won't start.  I've even tried using a different port (8543), no change.  In 
this GCP project, I have just this one VM and it has successfully been running 
Postgres for many months.

Any ideas?

Mike Sofen



Trouble accessing v 1.14.0 on GCP

2021-08-23 Thread Mike Sofen
At my prior company I've installed earlier versions of nifi on GCP Debian VMs 
and not had a problem pointing a Windows 10 browser at them and going to work.  
I'm aware v1.14.0 requires a user login when not using certs, but I can't even 
get to that step.

I'm pulling my remaining hair out trying to connect to a new Debian VM on GCP 
running v 1.14.0 on Java 8.  Nifi starts and runs properly, with this caveat - 
I cannot reference the static external IP - only the default IP (127.0.0.1), so 
my browser can't connect.  I have a GCP firewall rule that opens the 8443 port 
for the VM, and even added ICMP to it and can ping it from a CMD shell on my 
PC.  I've checked all of the file permissions on that VM, all uniformly correct.

Details of my nifi.properties:

If I use:
nifi.web.https.host=127.0.0.1 (the default)
nifi.web.https.port=8443

Nifi starts properly and runs, but my browser returns " 127.0.0.1 refused to 
connect "

If I use the VM's static IP (which is what I've used on prior VMs): 
nifi.web.https.host=35.xxx.xx.xxx
nifi.web.https.port=8443

Nifi fails to start, with the log saying: 
2021-08-20 18:55:27,715 WARN [main] org.apache.nifi.web.server.JettyServer 
Failed to start web server... shutting down.
java.io.IOException: Failed to bind to /35.236.80.234:8543
Caused by: java.net.BindException: Cannot assign requested address

Endless web searches and tests have resulted in no change of behavior - with 
the default IP, Nifi runs but I can't access it, and with my external IP, it 
won't start.  I've even tried using a different port (8543), no change.  In 
this GCP project, I have just this one VM and it has successfully been running 
Postgres for many months.

Any ideas?

Mike Sofen



RE: Need help to insert complete content of a file into JSONB field in postgres

2021-06-09 Thread Mike Sofen
Hi Jens,

I have a flow that does that exact work – and also has to handle removal of 
special characters, with the only difference that I use a stored proc so that I 
can fix the content before storing it.  The ReplaceText-->PutSQL has worked 
extremely well, even for large (50mb) text chunks.

The syntax to use for calling the postgres stored proc/function is similar to 
this (I pass in many more metadata params), and you can see the $1 text content:

select * from docs."DocTextSet" ('${original_filename_fixed}', 
'${original_file_extension}',  '$1');

It looks like this in the ReplaceText processor:
[inline screenshot of the ReplaceText configuration omitted]

Within the stored proc, I use a variety of Replace and other functions to fix 
the content and push a success/fail back to the flow to disposition the source 
file.

Mike

From: Jens M. Kofoed 
Sent: Wednesday, June 9, 2021 5:20 AM
To: users@nifi.apache.org
Subject: Need help to insert complete content of a file into JSONB field in 
postgres

Dear community

I'm struggling with inserting files into a Postgres DB. The whole content of a 
file has to be inserted into one field in one record, not each line as a 
separate record.
The PutSQL process expect the content of an incoming FlowFile to be the SQL 
command to execute. Not data to add to the database.
I managed to use a ReplaceText Process to alter the content to include the:
INSERT INTO tablename (content)
VALUES ('$1')

where $1 is equal to the whole content. But the content has special characters 
and is failing.

Please, any advice on how to insert the whole content of a file into one record/field?

kind regards
Jens
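The quoting failure described above is exactly what parameter binding avoids. A sketch with the stdlib sqlite3 driver (an analogy, not the Postgres/NiFi setup itself; with PutSQL the same idea is expressed as a `?` placeholder in the SQL plus `sql.args.N.type`/`sql.args.N.value` flowfile attributes):

```python
import sqlite3

content = "file body with 'quotes', backslashes \\ and\nnewlines"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (content TEXT)")
# Binding the whole file body as a parameter: no escaping needed, special
# characters survive the round-trip intact.
conn.execute("INSERT INTO docs (content) VALUES (?)", (content,))
stored, = conn.execute("SELECT content FROM docs").fetchone()
print(stored == content)  # True
```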


RE: Problem with the GetFile processor deleting my entire installation

2021-06-04 Thread Mike Sofen
Setting aside the user auth/permissions issue (which sounds like, from your
initial email, that you nailed perfectly with a secured cluster), I'll
address this accidental deletion issue using what I consider a general
design pattern for Nifi (or any process that write-touches files):

*   I don't allow physical file deletes (which happen with the default
GetFile settings)
*   I create an archive folder for files I've processed if I want/need
to move them out of the processing folder
*   I always set the "leave files" flag
*   I use either the timestamp or filename tracker to eliminate repeated
processing if I'm not moving the files
*   OR - I use Move File flag to the archive folder destination.
*   And I advise new users to always spec a small test folder to start,
and to double-check that the remove files is turned off.
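The timestamp-tracker bullet above can be sketched as follows (a simplified model of the state tracking, not NiFi code):

```python
import os

def list_new_files(folder, watermark):
    # Emit only files modified after the stored watermark; never delete,
    # never reprocess. Return the advanced watermark for the next run.
    new = []
    for name in sorted(os.listdir(folder)):
        mtime = os.path.getmtime(os.path.join(folder, name))
        if mtime > watermark:
            new.append(name)
            watermark = max(watermark, mtime)
    return new, watermark
```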

 

In some cases, like when I'm processing text files in place in secure repo,
I can't move the files so I use the timestamp eval method.

 

In others, like log files where I definitely want to archive them out of
that processing folder, I use the move model.

 

Don't get discouraged about this twist in the road - you'll find Nifi to be
a truly exceptional product for pipeline automation, wildly more
powerful/flexible/stable than any other product out there (I've used most).
I've used it IoT to DB to ML processing, document processing, metrology data
processing, you name it.  In particular, the built data provenance and
auditability will be valuable for your situation at Optum.  All the best,

 

Mike

 

From: Ruth, Thomas  
Sent: Friday, June 04, 2021 12:26 PM
To: users@nifi.apache.org
Subject: RE: Problem with the GetFile processor deleting my entire
installation

 

Is it recommended that after installing NiFi, that I then proceed to remove
read permissions from all installation files and directories in order to
protect them from removal by users? Will this present a problem running the
software if it's unable to read any files?

 

So for fun, I loaded up a quick nifi container:
docker run --name nifi -p 8080:8080 -d apache/nifi:latest

 

Connected to localhost:8080/nifi

 

Created a GetFile processor, with the directory /opt/nifi. Connected the
success to a LogAttributes. Hit Run..

 

Now I get this:
HTTP ERROR 404 /nifi/canvas.jsp
URI:     /nifi/
STATUS:  404
MESSAGE: /nifi/canvas.jsp
SERVLET: jsp

 

This behavior shouldn't be possible, in my opinion. It's putting security in
the hands of my developers. I really am looking for a solution to this
issue. 

 

The help text says this:
Creates FlowFiles from files in a directory. NiFi will ignore files it
doesn't have at least read permissions for.

 

So as you suggested, I removed the read permission recursively from all
files in /opt/nifi. After doing this, nifi no longer starts.

 

find /opt/nifi -print > /tmp/files.txt

for i in `cat /tmp/files.txt`; do chmod a-r "$i"; done

 

I also went to https://nvd.nist.gov/vuln-metrics/cvss/v3-calculator and
attempted to calculate a CVSS score for this vulnerability. I ended up
calculating a score of 8.0. 

 

Tom

 

From: Russell Bateman <r...@windofkeltia.com>
Sent: Friday, June 4, 2021 11:20 AM
To: users@nifi.apache.org
Subject: Re: Problem with the GetFile processor deleting my entire
installation

 

Oh, sorry, to finish the answer, yes, you do need to be very careful how you
specify the Input Directory and File Filter properties; the last one is a
regular expression. It's true that the documentation is less than
flag-waving or hair-lighting-on-fire as it presents its help on filling
those in.

Russ

On 6/4/21 11:16 AM, Russell Bateman wrote:

Sorry for this behavior of GetFile which is purposeful. If you configure to
keep the files instead of removing them, you'll keep getting the same files
ingested over and over again as flow files. It's just how it is.

The secret was to read the help blurb when configuring this processor.

Hope this helps,

Russ

On 6/4/21 10:44 AM, Ruth, Thomas wrote:

Hello,

 

I recently built a 3-node NiFi cluster in my organization as a
proof-of-concept for some work we are doing. I used version 1.13.2 and
installed it onto 3 CentOS 7.9 systems. In my organization, I don't have
root access to the system, so I used a different user called "nfadm" to
install and run the product. I don't remember seeing anything in the
documentation that stated that this would be an issue.

 

I am also new to NiFi, and was relying heavily on the Admin documentation on
the web site for instructions to set up the OS and NiFi installations. I
configured certificate-based security and distributed them to my users. I
also configured policies for groups that I thought were OK for them from a
development standpoint.

 

I had an incident occur yesterday in which a user, who is also new to NiFi,
ran a component called "GetFile" for the filesystem "/" with the default
settings (Recurse=true, 

RE: speeding up ListFile

2021-03-22 Thread Mike Sofen
I just retested, to be sure, and there is no impact from setting “include file 
attributes” to False – stopping a flow pointed at a folder tree that had 
already processed the files, adding one file, then restarting it, the flow only 
picked up the new file.  And it still includes the critical attributes of 
filename, path and creation date.  For my use case, this is an appropriate and 
valuable setting.  Mike

From: James McMahon 
Sent: Saturday, March 20, 2021 5:26 PM
To: users@nifi.apache.org
Subject: Re: speeding up ListFile

When we set “include file attributes” to False, does that in any way impact 
ListFile’s ability to track and retrieve new files by state?

On Fri, Mar 19, 2021 at 1:08 PM Mark Payne 
<marka...@hotmail.com> wrote:
It’s hard to say without knowing what’s taking so long. Is it simply crawling 
the directory structure that takes forever? If so, there’s not a lot that can 
be done, as accessing tons of files just tends to be slow. One way to verify 
this, on Linux, would be to run:

ls -laR

I.e., a recursive listing of all files. Not sure what the analogous command 
would be on Windows.

The “Track Performance” property of the processor can be used to understand 
more about the performance characteristics of the disk access. Set that to true 
and enable DEBUG logging for the processor.

If there are heap concerns, generating a million FlowFiles, then you can set a 
Record Writer on the processor so that only a single FlowFile gets created. 
That can then be split up using a tiered approach (SplitRecord to split into 
10,000 Record chunks, and then another SplitRecord to split each 10,000 Record 
chunk into a 1-Record chunk, and then EvaluateJsonPath, for instance, to pull 
the actual filename into an attribute). I suspect this is not the issue, with 
that max heap and given that it’s approximately 1 million files. But it may be 
a factor.
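The tiered split Mark describes (SplitRecord into 10,000-record chunks, then a second SplitRecord down to single records) is just two-level chunking; the arithmetic can be sketched outside NiFi to see why it keeps any single split operation small:

```python
def tiered_split(records, chunk_size=10_000):
    """Two-tier split: first into chunk_size-record chunks, then each
    chunk into single-record pieces, mirroring the
    SplitRecord -> SplitRecord approach described above."""
    # Tier 1: large chunks (the first SplitRecord pass)
    chunks = [records[i:i + chunk_size]
              for i in range(0, len(records), chunk_size)]
    # Tier 2: each chunk into 1-record pieces (the second SplitRecord pass)
    singles = [[r] for chunk in chunks for r in chunk]
    return chunks, singles

# e.g. a 25k-file listing becomes 3 chunks, then 25,000 single records
chunks, singles = tiered_split(list(range(25_000)))
print(len(chunks), len(singles))  # 3 25000
```

The point of the two tiers is that no single split ever fans one FlowFile out into a million children at once, which is what causes the heap pressure.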

Also, setting the “Include File Attributes” to false can significantly improve 
performance, especially on a remote network drive, or some specific types of 
drives/OS’s.

Would recommend you play around with the above options to better understand the 
performance characteristics of your particular environment.

Thanks
-Mark


On Mar 19, 2021, at 12:57 PM, Mike Sofen <mso...@ansunbiopharma.com> wrote:

I’ve built a document processing solution in NiFi, using the ListFile/FetchFile 
model hitting a large document repository on our Windows file server.  It’s 
nearly a million files ranging in size from 100 KB to 300 MB, with file types of 
pdf, doc/docx, xls/xlsx, pptx, text, xml, rtf, png, tiff and some specialized 
binary files.  The million files are distributed across tens of thousands of 
folders.

The challenge is, for an example subfolder that has 25k files in 11k folders 
totalling 17 GB, it took upwards of 30 minutes for a single ListFile to generate 
a list and send it downstream to the next processor.  It’s running on a PC with 
the latest-gen Core i7 with 32 GB RAM and a 1 TB SSD – plenty of horsepower and 
speed.  My bootstrap.conf has java.arg.2=-Xms4g and java.arg.3=-Xmx16g.

Is there any way to speed up ListFile?

Also, is there any way to detect that a file is encrypted?  I’m sending these 
for processing by Tika and Tika generates an error when it receives an 
encrypted file (we have just a few of those, but enough to be annoying).
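On the encrypted-file question: there is no universal "is encrypted" flag, but a cheap pre-check can catch the common cases before the files ever reach Tika. Password-protected PDFs carry an /Encrypt entry in their trailer, and password-protected OOXML files (.docx/.xlsx/.pptx) are stored as OLE/CFB containers rather than ZIP archives. A minimal sketch (a heuristic only, not a replacement for handling Tika's own encryption errors):

```python
OLE_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # OLE/CFB container signature

def looks_encrypted(path):
    """Heuristic pre-check: password-protected PDFs contain an /Encrypt
    dictionary; password-protected OOXML files are OLE containers
    instead of starting with the ZIP magic b'PK'."""
    with open(path, "rb") as f:
        head = f.read(8)
        if head.startswith(b"%PDF"):
            f.seek(0)
            # Scans the whole file; acceptable for an occasional pre-check
            return b"/Encrypt" in f.read()
        if path.lower().endswith((".docx", ".xlsx", ".pptx")):
            return head.startswith(OLE_MAGIC)
    return False
```

In a flow this could run from ExecuteStreamCommand (or a scripted processor) between FetchFile and the Tika step, routing suspected-encrypted files to a side queue.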

Mike Sofen



RE: [EXTERNAL] speeding up ListFile

2021-03-20 Thread Mike Sofen
Mark,

Fantastic info.  Since I’m a fan of minifi, that was a great suggestion.
I’ve built a minifi before to do something quite similar, so all I need now is 
access to that fileserver, will give it a go.  It IS an on-prem server we own.  
And this is a one-time operation to process 20 years of research, regulatory 
and FDA documents, so I can chip away at it over time.

Mike

From: Mark Payne 
Sent: Saturday, March 20, 2021 7:54 AM
To: users@nifi.apache.org
Subject: Re: [EXTERNAL] speeding up ListFile

Mike,

Good to know. In short: yes, absolutely, the Windows file server will slow it 
down that much. You did mention a 100 mb network. Generally, what will be more 
important is the network latency (because performing the listing and gathering 
filenames, sizes, etc. can require many tiny requests) and the performance of 
the server itself (if it’s busy handling tons of other clients, it may be slow 
to respond).

The reason we added the performance metrics in the first place is because we 
had a user who was upset by the poor performance on a network mounted drive (I 
think a Windows file server but I’m not sure). Every time they used ‘ls’ or 
equivalent it was blazing fast. But after instrumenting all of the metrics we 
were able to find that after doing 50,000 disk operations, even though the 
typical request was perhaps < 1 ms, some would block for many seconds, even 
minutes. Not sure if it was a network glitch or the file server itself. That 
then led us to adding the ability to turn off fetching file attributes, as that 
made a massive difference for them.

I don’t know anything about configuring a Windows file server, so I won’t be 
help there. But if you own the Windows file server, perhaps this is a situation 
where it would make sense to run minifi on the file server and have it ship the 
data to NiFi instead of having NiFi polling. That way, minifi would have local 
disk access and could then push the data to nifi more quickly. (If this seems 
like something that would be doable for you, I would recommend you ask for 
details from someone with more experience in the minifi part of the code base 
to ensure that all necessary functionality is there, i haven’t looked at minifi 
in a while. But I think it is).

Thanks
-Mark




On Mar 20, 2021, at 10:41 AM, Mike Sofen <mso...@ansunbiopharma.com> wrote:

It’s NOT ListFile that is slow, or at least for local file systems.

I re-ran a test into a folder tree local to the PC running Nifi (with an SSD).  
It had 667 files in 129 folders, from which it found 117 matching file types to 
list (but it still had to read every folder and file).  Very VERY fast.

248ms ListFile  (.37ms per file)
23 ms UpdateAttribute (add 8 attributes)
12 ms RouteOnAttribute (3 paths)

Is it possible that a Windows file server on a 100mb network can slow it down 
so much? Anyone find a way to speed up remote Windows file access?

Mike

From: Mike Sofen <mso...@runbox.com>
Sent: Friday, March 19, 2021 6:54 PM
To: users@nifi.apache.org
Subject: RE: [EXTERNAL] Re: speeding up ListFile

Someone help me here: the 157 file listing averaged 46ms, so the total duration 
SHOULD have been 7.2 seconds, not nearly 4 minutes (227 seconds).  What could 
be going on for the other 220 seconds?  Something is amiss.
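Mike's arithmetic checks out, and it makes the gap concrete: the per-operation average simply cannot account for the wall-clock time, so the remainder presumably went to directory traversal plus the blocking outliers the debug message mentions:

```python
# Figures from the Track Performance debug output quoted in this thread
ops, avg_ms, wall_s = 157, 46.229, 227

expected_s = ops * avg_ms / 1000  # time the averages alone would predict
print(f"expected: {expected_s:.1f}s, observed: {wall_s}s, "
      f"unaccounted: {wall_s - expected_s:.1f}s")
# expected: 7.3s, observed: 227s, unaccounted: 219.7s
```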

Mike

From: Mike Sofen <mso...@ansunbiopharma.com>
Sent: Friday, March 19, 2021 3:47 PM
To: users@nifi.apache.org
Subject: RE: [EXTERNAL] Re: speeding up ListFile

Hopes dashed on the rocks of reality...dang.  I just retested my folder with 
25k files and 11k subfolders (many nesting levels deep – perhaps 15 levels), 
after clearing state, with the Include File Attributes set to false and it took 
the same amount of time to produce the listing – about 30 minutes.

For some reason my debug setting isn’t writing to the log file (I set debug 
from within the ListFile processor).  But it did pop up that red error square 
on the processor.  So to save time, I re-ran it again for just a deep child 
folder that had 2 subfolders with a total of 157 files.  Here’s my 
transcription of the debug:

“Over the past 227 seconds, For Operation ‘RETRIEVE_NEXT_FILE_FROM_OS’ there 
were 157 operations performed with an average time of 46.229 milliseconds; STD 
Deviation = 34ms; Min Time = 0ms; Max Time = 170ms; 12 significant outliers.”

To state the obvious, this tiny listing of 157 files averaged more than 1 
second per file.  That mirrors the speed from my 25k trial which averaged a bit 
over 1 second per file – that is really slow.  What might be going on with the 
“significant outliers”?

Mike

From: Olson, Eric <eric.ol...@adm.com>
Sent: Friday, March 19, 2021 11:45 AM
To: users@nifi.apache.org
Subject: RE: [EXTERNAL] Re: speeding up ListFile

I’ve observed the same thing. I’m also monitoring directories of large numbers 
of

RE: [EXTERNAL] Re: speeding up ListFile

2021-03-20 Thread Mike Sofen
It’s NOT ListFile that is slow, or at least for local file systems.

I re-ran a test into a folder tree local to the PC running Nifi (with an SSD).  
It had 667 files in 129 folders, from which it found 117 matching file types to 
list (but it still had to read every folder and file).  Very VERY fast.

248ms ListFile  (.37ms per file)
23 ms UpdateAttribute (add 8 attributes)
12 ms RouteOnAttribute (3 paths)

Is it possible that a Windows file server on a 100mb network can slow it down 
so much? Anyone find a way to speed up remote Windows file access?

Mike

From: Mike Sofen 
Sent: Friday, March 19, 2021 6:54 PM
To: users@nifi.apache.org
Subject: RE: [EXTERNAL] Re: speeding up ListFile

Someone help me here: the 157 file listing averaged 46ms, so the total duration 
SHOULD have been 7.2 seconds, not nearly 4 minutes (227 seconds).  What could 
be going on for the other 220 seconds?  Something is amiss.

Mike

From: Mike Sofen <mso...@ansunbiopharma.com>
Sent: Friday, March 19, 2021 3:47 PM
To: users@nifi.apache.org
Subject: RE: [EXTERNAL] Re: speeding up ListFile

Hopes dashed on the rocks of reality...dang.  I just retested my folder with 
25k files and 11k subfolders (many nesting levels deep – perhaps 15 levels), 
after clearing state, with the Include File Attributes set to false and it took 
the same amount of time to produce the listing – about 30 minutes.

For some reason my debug setting isn’t writing to the log file (I set debug 
from within the ListFile processor).  But it did pop up that red error square 
on the processor.  So to save time, I re-ran it again for just a deep child 
folder that had 2 subfolders with a total of 157 files.  Here’s my 
transcription of the debug:

“Over the past 227 seconds, For Operation ‘RETRIEVE_NEXT_FILE_FROM_OS’ there 
were 157 operations performed with an average time of 46.229 milliseconds; STD 
Deviation = 34ms; Min Time = 0ms; Max Time = 170ms; 12 significant outliers.”

To state the obvious, this tiny listing of 157 files averaged more than 1 
second per file.  That mirrors the speed from my 25k trial which averaged a bit 
over 1 second per file – that is really slow.  What might be going on with the 
“significant outliers”?

Mike

From: Olson, Eric <eric.ol...@adm.com>
Sent: Friday, March 19, 2021 11:45 AM
To: users@nifi.apache.org
Subject: RE: [EXTERNAL] Re: speeding up ListFile

I’ve observed the same thing. I’m also monitoring directories of large numbers 
of files and noticed this morning that ListFile took about 30 min to process 
one directory of about 800,000 files. This is under Linux, but the folder in 
question is a shared Windows network folder that has been mounted to the Linux 
machine. (I don’t know how that was done; it’s something my Linux admin set up 
for me.)

I just ran a quick test on a folder with about 75,000 files. ListFile with 
Include File Attributes set to false took about 10 s to emit the 75,000 
FlowFiles. ListFile including file attributes took about 70 s. At the OS level, 
“ls -lR | wc” takes 2 seconds.

Interestingly, in the downstream queue, the two sets of files have the same 
lineage duration. I guess that’s measured starting at when the ListFile 
processor was started.


From: Mark Payne <marka...@hotmail.com>
Sent: Friday, March 19, 2021 12:08 PM
To: users@nifi.apache.org
Subject: [EXTERNAL] Re: speeding up ListFile

It’s hard to say without knowing what’s taking so long. Is it simply crawling 
the directory structure that takes forever? If so, there’s not a lot that can 
be done, as accessing tons of files just tends to be slow. One way to verify 
this, on Linux, would be to run:

ls -laR

I.e., a recursive listing of all files. Not sure what the analogous command 
would be on Windows.

The “Track Performance” property of the processor can be used to understand 
more about the performance characteristics of the disk access. Set that to true 
and enable DEBUG logging for the processor.

If there are heap concerns, generating a million FlowFiles, then you can set a 
Record Writer on the processor so that only a single FlowFile gets created. 
That can then be split up using a tiered approach (SplitRecord to split into 
10,000 Record chunks, and then another SplitRecord to split each 10,000 Record 
chunk into a 1-Record chunk, and then EvaluateJsonPath, for instance, to pull 
the actual filename into an attribute). I suspect this is not the issue, with 
that max heap and given that it’s approximately 1 million files. But it may be 
a factor.

Also, setting the “Include File Attributes” to false can significantly improve 
performance, especially on a remote network drive, or some specific types of 
drives/OS’s.

Would recommend you play around with the above options to better understand the 
performance characteristics of your particular environment.

Thanks
-Mark

On M

RE: [EXTERNAL] Re: speeding up ListFile

2021-03-19 Thread Mike Sofen
Someone help me here: the 157 file listing averaged 46ms, so the total duration 
SHOULD have been 7.2 seconds, not nearly 4 minutes (227 seconds).  What could 
be going on for the other 220 seconds?  Something is amiss.

 

Mike

 

From: Mike Sofen  
Sent: Friday, March 19, 2021 3:47 PM
To: users@nifi.apache.org
Subject: RE: [EXTERNAL] Re: speeding up ListFile

 

Hopes dashed on the rocks of reality...dang.  I just retested my folder with 
25k files and 11k subfolders (many nesting levels deep – perhaps 15 levels), 
after clearing state, with the Include File Attributes set to false and it took 
the same amount of time to produce the listing – about 30 minutes.

 

For some reason my debug setting isn’t writing to the log file (I set debug 
from within the ListFile processor).  But it did pop up that red error square 
on the processor.  So to save time, I re-ran it again for just a deep child 
folder that had 2 subfolders with a total of 157 files.  Here’s my 
transcription of the debug:

 

“Over the past 227 seconds, For Operation ‘RETRIEVE_NEXT_FILE_FROM_OS’ there 
were 157 operations performed with an average time of 46.229 milliseconds; STD 
Deviation = 34ms; Min Time = 0ms; Max Time = 170ms; 12 significant outliers.”

 

To state the obvious, this tiny listing of 157 files averaged more than 1 
second per file.  That mirrors the speed from my 25k trial which averaged a bit 
over 1 second per file – that is really slow.  What might be going on with the 
“significant outliers”?  

 

Mike

 

From: Olson, Eric <eric.ol...@adm.com>
Sent: Friday, March 19, 2021 11:45 AM
To: users@nifi.apache.org
Subject: RE: [EXTERNAL] Re: speeding up ListFile

 

I’ve observed the same thing. I’m also monitoring directories of large numbers 
of files and noticed this morning that ListFile took about 30 min to process 
one directory of about 800,000 files. This is under Linux, but the folder in 
question is a shared Windows network folder that has been mounted to the Linux 
machine. (I don’t know how that was done; it’s something my Linux admin set up 
for me.)

 

I just ran a quick test on a folder with about 75,000 files. ListFile with 
Include File Attributes set to false took about 10 s to emit the 75,000 
FlowFiles. ListFile including file attributes took about 70 s. At the OS level, 
“ls -lR | wc” takes 2 seconds.

 

Interestingly, in the downstream queue, the two sets of files have the same 
lineage duration. I guess that’s measured starting at when the ListFile 
processor was started.

 

 

From: Mark Payne <marka...@hotmail.com>
Sent: Friday, March 19, 2021 12:08 PM
To: users@nifi.apache.org
Subject: [EXTERNAL] Re: speeding up ListFile

 

It’s hard to say without knowing what’s taking so long. Is it simply crawling 
the directory structure that takes forever? If so, there’s not a lot that can 
be done, as accessing tons of files just tends to be slow. One way to verify 
this, on Linux, would be to run: 

 

ls -laR

 

I.e., a recursive listing of all files. Not sure what the analogous command 
would be on Windows.

 

The “Track Performance” property of the processor can be used to understand 
more about the performance characteristics of the disk access. Set that to true 
and enable DEBUG logging for the processor.

 

If there are heap concerns, generating a million FlowFiles, then you can set a 
Record Writer on the processor so that only a single FlowFile gets created. 
That can then be split up using a tiered approach (SplitRecord to split into 
10,000 Record chunks, and then another SplitRecord to split each 10,000 Record 
chunk into a 1-Record chunk, and then EvaluateJsonPath, for instance, to pull 
the actual filename into an attribute). I suspect this is not the issue, with 
that max heap and given that it’s approximately 1 million files. But it may be 
a factor.

 

Also, setting the “Include File Attributes” to false can significantly improve 
performance, especially on a remote network drive, or some specific types of 
drives/OS’s.

 

Would recommend you play around with the above options to better understand the 
performance characteristics of your particular environment.

 

Thanks

-Mark

 

On Mar 19, 2021, at 12:57 PM, Mike Sofen <mso...@ansunbiopharma.com> wrote:

 

I’ve built a document processing solution in NiFi, using the ListFile/FetchFile 
model hitting a large document repository on our Windows file server.  It’s 
nearly a million files ranging in size from 100 KB to 300 MB, with file types of 
pdf, doc/docx, xls/xlsx, pptx, text, xml, rtf, png, tiff and some specialized 
binary files.  The million files are distributed across tens of thousands of 
folders.

 

The challenge is, for an example subfolder that has 25k files in 11k folders 
totalling 17gb, it took upwards of 30 minutes for a single ListFile to generate 
a list and send it downstream to the next pr

RE: [EXTERNAL] Re: speeding up ListFile

2021-03-19 Thread Mike Sofen
Hopes dashed on the rocks of reality...dang.  I just retested my folder with 
25k files and 11k subfolders (many nesting levels deep – perhaps 15 levels), 
after clearing state, with the Include File Attributes set to false and it took 
the same amount of time to produce the listing – about 30 minutes.

For some reason my debug setting isn’t writing to the log file (I set debug 
from within the ListFile processor).  But it did pop up that red error square 
on the processor.  So to save time, I re-ran it again for just a deep child 
folder that had 2 subfolders with a total of 157 files.  Here’s my 
transcription of the debug:

“Over the past 227 seconds, For Operation ‘RETRIEVE_NEXT_FILE_FROM_OS’ there 
were 157 operations performed with an average time of 46.229 milliseconds; STD 
Deviation = 34ms; Min Time = 0ms; Max Time = 170ms; 12 significant outliers.”

To state the obvious, this tiny listing of 157 files averaged more than 1 
second per file.  That mirrors the speed from my 25k trial which averaged a bit 
over 1 second per file – that is really slow.  What might be going on with the 
“significant outliers”?

Mike

From: Olson, Eric 
Sent: Friday, March 19, 2021 11:45 AM
To: users@nifi.apache.org
Subject: RE: [EXTERNAL] Re: speeding up ListFile

I’ve observed the same thing. I’m also monitoring directories of large numbers 
of files and noticed this morning that ListFile took about 30 min to process 
one directory of about 800,000 files. This is under Linux, but the folder in 
question is a shared Windows network folder that has been mounted to the Linux 
machine. (I don’t know how that was done; it’s something my Linux admin set up 
for me.)

I just ran a quick test on a folder with about 75,000 files. ListFile with 
Include File Attributes set to false took about 10 s to emit the 75,000 
FlowFiles. ListFile including file attributes took about 70 s. At the OS level, 
“ls -lR | wc” takes 2 seconds.

Interestingly, in the downstream queue, the two sets of files have the same 
lineage duration. I guess that’s measured starting at when the ListFile 
processor was started.


From: Mark Payne <marka...@hotmail.com>
Sent: Friday, March 19, 2021 12:08 PM
To: users@nifi.apache.org
Subject: [EXTERNAL] Re: speeding up ListFile

It’s hard to say without knowing what’s taking so long. Is it simply crawling 
the directory structure that takes forever? If so, there’s not a lot that can 
be done, as accessing tons of files just tends to be slow. One way to verify 
this, on Linux, would be to run:

ls -laR

I.e., a recursive listing of all files. Not sure what the analogous command 
would be on Windows.

The “Track Performance” property of the processor can be used to understand 
more about the performance characteristics of the disk access. Set that to true 
and enable DEBUG logging for the processor.

If there are heap concerns, generating a million FlowFiles, then you can set a 
Record Writer on the processor so that only a single FlowFile gets created. 
That can then be split up using a tiered approach (SplitRecord to split into 
10,000 Record chunks, and then another SplitRecord to split each 10,000 Record 
chunk into a 1-Record chunk, and then EvaluateJsonPath, for instance, to pull 
the actual filename into an attribute). I suspect this is not the issue, with 
that max heap and given that it’s approximately 1 million files. But it may be 
a factor.

Also, setting the “Include File Attributes” to false can significantly improve 
performance, especially on a remote network drive, or some specific types of 
drives/OS’s.

Would recommend you play around with the above options to better understand the 
performance characteristics of your particular environment.

Thanks
-Mark

On Mar 19, 2021, at 12:57 PM, Mike Sofen <mso...@ansunbiopharma.com> wrote:

I’ve built a document processing solution in NiFi, using the ListFile/FetchFile 
model hitting a large document repository on our Windows file server.  It’s 
nearly a million files ranging in size from 100 KB to 300 MB, with file types of 
pdf, doc/docx, xls/xlsx, pptx, text, xml, rtf, png, tiff and some specialized 
binary files.  The million files are distributed across tens of thousands of 
folders.

The challenge is, for an example subfolder that has 25k files in 11k folders 
totalling 17 GB, it took upwards of 30 minutes for a single ListFile to generate 
a list and send it downstream to the next processor.  It’s running on a PC with 
the latest-gen Core i7 with 32 GB RAM and a 1 TB SSD – plenty of horsepower and 
speed.  My bootstrap.conf has java.arg.2=-Xms4g and java.arg.3=-Xmx16g.

Is there any way to speed up ListFile?

Also, is there any way to detect that a file is encrypted?  I’m sending these 
for processing by Tika and Tika generates an error when it receives an 
encrypted file (we have just a few of those, but enough to be annoying).

Mike Sofen



Confidentiality Notice:
This m

RE: speeding up ListFile

2021-03-19 Thread Mike Sofen
I’m pretty sure it’s the directory crawling that is the issue.  And I’m not 
trying to process the whole thing at once but instead taking small slices like 
the 25k files in 11k folders and testing that.  On Windows, it also takes a 
long time (many minutes) just to generate a file and folder count.

I will use the Track Performance setting as you mentioned and try to get some 
additional data points.

Re Include File Attributes:  I need the filename and path for downstream 
processing, I will have to test that those are still included somewhere if I 
turn off that flag.

Thanks for the tips, will update with more data.

Mike

From: Mark Payne 
Sent: Friday, March 19, 2021 10:08 AM
To: users@nifi.apache.org
Subject: Re: speeding up ListFile

It’s hard to say without knowing what’s taking so long. Is it simply crawling 
the directory structure that takes forever? If so, there’s not a lot that can 
be done, as accessing tons of files just tends to be slow. One way to verify 
this, on Linux, would be to run:

ls -laR

I.e., a recursive listing of all files. Not sure what the analogous command 
would be on Windows.

The “Track Performance” property of the processor can be used to understand 
more about the performance characteristics of the disk access. Set that to true 
and enable DEBUG logging for the processor.

If there are heap concerns, generating a million FlowFiles, then you can set a 
Record Writer on the processor so that only a single FlowFile gets created. 
That can then be split up using a tiered approach (SplitRecord to split into 
10,000 Record chunks, and then another SplitRecord to split each 10,000 Record 
chunk into a 1-Record chunk, and then EvaluateJsonPath, for instance, to pull 
the actual filename into an attribute). I suspect this is not the issue, with 
that max heap and given that it’s approximately 1 million files. But it may be 
a factor.

Also, setting the “Include File Attributes” to false can significantly improve 
performance, especially on a remote network drive, or some specific types of 
drives/OS’s.

Would recommend you play around with the above options to better understand the 
performance characteristics of your particular environment.

Thanks
-Mark


On Mar 19, 2021, at 12:57 PM, Mike Sofen <mso...@ansunbiopharma.com> wrote:

I’ve built a document processing solution in NiFi, using the ListFile/FetchFile 
model hitting a large document repository on our Windows file server.  It’s 
nearly a million files ranging in size from 100 KB to 300 MB, with file types of 
pdf, doc/docx, xls/xlsx, pptx, text, xml, rtf, png, tiff and some specialized 
binary files.  The million files are distributed across tens of thousands of 
folders.

The challenge is, for an example subfolder that has 25k files in 11k folders 
totalling 17 GB, it took upwards of 30 minutes for a single ListFile to generate 
a list and send it downstream to the next processor.  It’s running on a PC with 
the latest-gen Core i7 with 32 GB RAM and a 1 TB SSD – plenty of horsepower and 
speed.  My bootstrap.conf has java.arg.2=-Xms4g and java.arg.3=-Xmx16g.

Is there any way to speed up ListFile?

Also, is there any way to detect that a file is encrypted?  I’m sending these 
for processing by Tika and Tika generates an error when it receives an 
encrypted file (we have just a few of those, but enough to be annoying).

Mike Sofen



speeding up ListFile

2021-03-19 Thread Mike Sofen
I've built a document processing solution in NiFi, using the ListFile/FetchFile 
model hitting a large document repository on our Windows file server.  It's 
nearly a million files ranging in size from 100 KB to 300 MB, with file types of 
pdf, doc/docx, xls/xlsx, pptx, text, xml, rtf, png, tiff and some specialized 
binary files.  The million files are distributed across tens of thousands of 
folders.

The challenge is, for an example subfolder that has 25k files in 11k folders 
totalling 17 GB, it took upwards of 30 minutes for a single ListFile to generate 
a list and send it downstream to the next processor.  It's running on a PC with 
the latest-gen Core i7 with 32 GB RAM and a 1 TB SSD - plenty of horsepower and 
speed.  My bootstrap.conf has java.arg.2=-Xms4g and java.arg.3=-Xmx16g.

Is there any way to speed up ListFile?

Also, is there any way to detect that a file is encrypted?  I'm sending these 
for processing by Tika and Tika generates an error when it receives an 
encrypted file (we have just a few of those, but enough to be annoying).

Mike Sofen


RE: Flow Hotspots

2020-11-20 Thread Mike Sofen
I love this idea.  When developing/debugging new flows, visually seeing where 
the bottlenecks are occurring as the flows are running would be of significant 
value, especially in more complex flows.

 

Mike

 

From: Eric Secules  
Sent: Thursday, November 19, 2020 5:30 PM
To: users@nifi.apache.org
Subject: Flow Hotspots

 

Hello everyone,

 

I was wondering if the nifi summary view could have a summary of the 
connections where flowfiles spend the most time in waiting, that would help 
identify slow points in a complicated flow.

 

Alternatively, does anyone know of some tool which might be able to provide 
this analysis already?

 

Thanks,

Eric



RE: Nested groups for LdapUserGroupProvider

2020-07-24 Thread Mike Sofen
I don’t know how the nifi LDAP provider works specifically, but a commercial 
data virtualization app we use is able to import LDAP groups that contain 
multiple levels of nested groups.  Our LDAP groups have an owner, 1 or more 
supervisors and 1 or more members.  

 

The app can only see LDAP members, so the key for us was to point the config 
settings to the correct spot within our LDAP forest…initially we didn’t point 
it correctly and only saw first-level members, after a bit of trial and error, 
finally got nested groups working, and we’ve tested down 5 levels of nesting.

 

Mike Sofen

 

From: Jens M. Kofoed  
Sent: Friday, July 24, 2020 9:42 AM
To: users@nifi.apache.org
Subject: Re: Nested groups for LdapUserGroupProvider

 

Hi

 

From my knowledge and playing with LDAP and NiFi: NiFi “imports” users and 
groups, and NiFi does not support groups in groups.

In my setup it looks like it imports groups first. Next it imports users. If a 
user is memberOf an imported group it will be connected to the group in nifi.

 

Regards 

Jens


On 24 Jul 2020, at 17:41, Bryan Bende <bbe...@gmail.com> wrote:

From my limited knowledge of how the LDAP providers work, I'm not aware of 
anything that would handle transitive group membership, but others may know 
more.

 

On Fri, Jul 24, 2020 at 11:18 AM Moncef Abboud <moncef.abbou...@gmail.com> wrote:

Thank you for your reply Bryan. 

 

Yes, I understand that they are related. But I still don't see how to address 
my nested groups problem since the configuration properties only talk about 
direct relationships.

 

 

 

On Fri, 24 Jul 2020 at 17:08, Bryan Bende <bbe...@gmail.com> wrote:

There are two different but related things...

 

LdapIdentityProvider for authentication.

 

https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html#ldap_login_identity_provider

 

LdapUserGroupProvider for authorization.

 

https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html#ldapusergroupprovider

 

On Fri, Jul 24, 2020 at 11:03 AM Moncef Abboud <moncef.abbou...@gmail.com> wrote:

Hello Juan, 

 

Thank you for your response. I am not sure that I understand what you mean. I 
believe LdapProvider is used for authentication and doesn't have much to do 
with group memberships and authorization.

 

Moncef

 

On Fri, 24 Jul 2020 at 16:55, Juan Pablo Gardella <gardellajuanpa...@gmail.com> wrote:

Maybe that scenario is not supported, but you can start playing with that 
custom scenario. The LDAP provider is configurable by XML 
(login-identity-providers.xml):

<provider>
    <identifier>ldap-provider</identifier>
    <class>org.apache.nifi.ldap.LdapProvider</class>
</provider>

Juan

 

On Fri, 24 Jul 2020 at 08:20, Moncef Abboud <moncef.abbou...@gmail.com> wrote:

Hello fellow NiFi Users, 

 

I am trying to configure authorization using the LdapUserGroupProvider. The 
documentation is clear : specify your "User Search Base" and "Group Search 
Base"  and define membership either using  "User Group Name Attribute" such as 
"memberOf" or the other way around using "Group Member Attribute" such as 
"member". All that is clear and works perfectly but my problems is as follows: 

 

I have two levels of groups in my directory e.g.

 

GroupA contains Group1 and Group2

GroupB contains Group2 and Group3 

GroupC contains Group1 and Group3 

 

Group1 contains User1 and User2

Group2 contains User1 and User3

 

 LDIF looks something like this: 

 

dn: CN=GroupA 
member: CN= Group1 ..
member: CN= Group2 .. 

 

-

dn: CN=Group1 
member: CN=User1 ..
member: CN=User2.. 

.

memberOf: CN=GroupA ...

memberOf: CN=GroupC ... 

 



 

dn: CN=User1

memberOf: CN=Group1 ...

memberOf: CN=Group2 ... 

--

 

No direct link between a user and a level 1 group (GroupA, GroupB..) 

 

I would like to note that groups of level 1 (GroupA, GroupB ..) are not in the 
same branch in the DIT as those of level 2 (Group1, Group2 ..).  

 

The requirement is that the groups used to manage authorization and that should 
show in the NIFI UI are those of level 1 (GroupA, GroupB..) and that users 
should be assigned to the groups containing their direct groups for instance 
User1 (who is a direct member of Group1 and Group2) should be displayed as a 
member of groups (GroupA, GroupB and GroupC). And level 2 groups (Group1, 
Group2..) must not show and must not be used directly in the UI but only as 
link between users and level 1 groups.

 

So to sum up, NIFI should take into account only level1 groups and handle 
transitive memberships through level2 groups.
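The resolution Moncef wants — following group-in-group links upward so users surface only under level-1 groups — is straightforward to express outside NiFi. A minimal sketch (hypothetical in-memory data standing in for the LDAP `memberOf` attributes; not a NiFi extension point):

```python
def effective_top_level_groups(user, member_of):
    """member_of maps a user or group name to the groups it is a direct
    member of. Walk the membership graph upward; a 'level 1' group is
    one that is itself a member of nothing."""
    seen, stack = set(), list(member_of.get(user, []))
    while stack:
        g = stack.pop()
        if g not in seen:
            seen.add(g)
            stack.extend(member_of.get(g, []))   # follow nested memberships
    return sorted(g for g in seen if not member_of.get(g))

# Moncef's example: User1 is a direct member of Group1 and Group2 only
member_of = {
    "User1":  ["Group1", "Group2"],
    "Group1": ["GroupA", "GroupC"],
    "Group2": ["GroupA", "GroupB"],
}
print(effective_top_level_groups("User1", member_of))
# ['GroupA', 'GroupB', 'GroupC']
```

This reproduces the expected outcome from the example above (User1 appearing in GroupA, GroupB and GroupC); the open question in the thread is whether the LdapUserGroupProvider can do this walk itself, which it apparently cannot.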

 

Thank you in advance for your answers.

 

Best Regards,

Moncef  



-- 

Moncef  ABBOUD



-- 

Moncef  ABBOUD



initiating a machine learning script on a remote server

2020-06-25 Thread Mike Sofen
I've been prototyping various functionality on nifi, initially on a Windows
laptop, now on a single GCP Linux instance (for now), using the more basic
processors for files and databases.  It's really a superb platform.

 

What I now need to solve for is firing a python machine learning script that
exists on another CPU/GPU equipped instance, as part of a pipeline that
detects a new file to process, sends the file name/location to the remote
server and receives the results of the processing from the server, for
further actions.  We need maximum performance and robustness from this step
of the processing.

 

I've read a bunch of posts on this and they point to using the
ExecuteStreamCommand processor (vs the ExecuteProcess, since it allows
inputs) but none seem show how to configure the processor to point to a
remote server and execute a script that exists on that server with
arguments/variables I pass in with the call.  These servers will all be GCP
instances. To keep things simple, let's ignore security for the moment and
assume I own both servers.
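One way the ExecuteStreamCommand route is commonly wired up (hedged: the host, script path, and arguments below are hypothetical) is to make the "command" an ssh invocation of the remote script, with Command Path set to `ssh` and Command Arguments carrying the host, script, and per-file values from Expression Language (e.g. `${absolute.path}${filename}`). The argv NiFi would assemble, sketched in Python:

```python
def remote_scoring_command(host, script, input_path):
    """Build the argv that ExecuteStreamCommand would run with
    Command Path = ssh and Command Arguments = host; python3; script;
    --input; <input path>. Assumes passwordless key-based auth."""
    return ["ssh", host, "python3", script, "--input", input_path]

cmd = remote_scoring_command(
    "gpu-worker.internal",       # hypothetical CPU/GPU instance
    "/opt/ml/score.py",          # hypothetical script on that instance
    "/data/incoming/sample.csv", # would come from ${absolute.path}${filename}
)
print(" ".join(cmd))
# To actually execute: subprocess.run(cmd, capture_output=True, check=True)
```

The script's stdout then becomes the outgoing FlowFile content, which is how the results get back into the flow for further routing. (As Mark suggests elsewhere in the archive, running MiNiFi on the remote box and pushing results back is the more robust alternative when you control both servers.)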

 

Can someone point me in the right direction? Many thanks!

 

Mike Sofen