Re: IntelliJ, language level and jigsaw profile

2020-01-30 Thread Mike Thomsen
While we're on the topic of IntelliJ, I highly recommend that people try out
the new font that JetBrains released for developers:

https://www.jetbrains.com/lp/mono/#how-to-install

So far for me on a MBP, it really does seem to make things even clearer and
easier to read.

On Wed, Jan 29, 2020 at 4:31 PM Jeff  wrote:

> Hi Mark,
>
> IntelliJ IDEA seems to implicitly activate the "jigsaw" profile with a
> "light" checkmark if it thinks you have Java 11 installed versus a "bold"
> checkmark if you've clicked that profile to be enabled explicitly.  Make
> sure that profile is unchecked, and make IDEA rebuild.  That should get you
> back to having your classes compiled with Java 8.
>
> On Wed, Jan 29, 2020 at 4:23 PM Mark Bean  wrote:
>
> > I'm having trouble with IntelliJ setting the proper language level. Has
> > anyone else seen the following behavior?
> >
> > IntelliJ 2019.3.2 (Community Edition)
> > installed from ideaIC-2019.3.2-no-jbr.tar.gz
> >
> > File > Project Structure > Platform Settings > SDKs
> >   - Only 1.8 is loaded
> > File > Project Structure > Project Settings > Project
> >   - Project SDK = 1.8
> >   - Project language level = 8 - Lambdas, type annotations etc.
> > File > Project Structure > Project Settings > Modules
> >   - Every module lists "Language level" as "11 - Local variable syntax
> > for lambda parameters"
> > File > Settings > Build, Execution, Deployment > Build Tools > Maven >
> > Importing
> >   - JDK for importer = 1.8
> >
> > I go to File > Settings > Build, Execution, Deployment > Compiler > Java
> > Compiler
> >   - Project bytecode version = 8
> >   - Select all modules and remove them
> >   - Apply
> > From Project view, select root level pom.xml > Maven > Reimport
> > Return to File > Settings > Build, Execution, Deployment > Compiler >
> > Java Compiler
> >   - All modules have returned and have a Target bytecode version of 11
> >
> > At this point, I can't run unit tests in IntelliJ. I get an error "Error:
> > java: invalid source release: 11"
> >
> > I could just use Java 11, but I'm curious if this is related to the
> > jigsaw profile. When I remove the  from the jigsaw profile, I get
> > the above behavior. When I remove both the activation and the properties
> > (maven.compiler.source and maven.compiler.target), then I can get the
> > language level to remain at 1.8.
> >
> > It appears the jigsaw profile may be activating all the time. Or, maybe
> > IntelliJ is presenting the incorrect JDK version?
> >
> > -Mark
> >
>


Re: IntelliJ, language level and jigsaw profile

2020-01-30 Thread Mark Bean
Jeff,

Can you please be a little more specific where to find the light/bold
checked profile option? The only way I've enabled profiles is by explicit
maven option "-P".

Thanks,
Mark


Provenance Repository and GDPR

2020-01-30 Thread u...@moosheimer.com
Dear NiFi developer team,

NiFi's Data Provenance and Data Lineage is perfectly adequate in the
environment of NiFi, so there is often no need to use Atlas.

When using NiFi with customer data a problem arises.
The problem is the GDPR requirement that a user has the right to be
forgotten. Unfortunately, I can't find any API call or information on
how to delete individual user data from the NiFi Provenance Repository
based on a user-defined attribute and its defined characteristics.

A delete request like "delete all data and dependencies where the
attribute XYZ has the value 123" is currently not possible to my knowledge.

My questions are:
Is this actually possible and how? And if not, is it planned?

Thanks
Uwe


Re: Provenance Repository and GDPR

2020-01-30 Thread Emanuel Oliveira
Hi, I don't think an API for atomic records makes sense:

   1. One can configure the retention of data provenance (the default of 24h is
   "good enough"; GDPR doesn't need millisecond real-time deletion, right?)
   
https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html#persistent-provenance-repository-properties
   2. Even if there were an API to delete FFs with an attribute =
   , that would normally be useless as well, since inbound FFs normally
   hold hundreds or thousands of records that need to be split and aggregated
   in a complex flow. Implementing a cleanup at such an atomic level would be
   too hard, and the extra effort is not needed, since your target single record
   would surely be part of multiple FF UUIDs, some holding only your record, but
   most surely holding 100s of other records with your record
   somewhere in the middle.


In my opinion your answer to business/management gatekeepers is that data
will be stored in data provenance for 24h (default), which can be
configured, and that


Best Regards,
*Emanuel Oliveira*



Re: IntelliJ, language level and jigsaw profile

2020-01-30 Thread Mark Bean
Found it. Profiles are at the top of the list in the Maven tool window.

Thanks for the tip. Unchecking the jigsaw profile did the trick.

Thanks,
Mark
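
For readers following along, a profile of roughly the following shape would
behave as described in this thread. This is only a sketch: the thread confirms
the profile id and the two compiler properties, while the JDK-based activation
range shown here is an assumption for illustration and is not copied from the
NiFi pom.xml.

```xml
<!-- Hypothetical sketch of a JDK-activated profile; not the actual NiFi pom.xml -->
<profile>
    <id>jigsaw</id>
    <activation>
        <!-- Assumed range: activate when Maven runs on JDK 11 or newer -->
        <jdk>[11,)</jdk>
    </activation>
    <properties>
        <!-- These two properties are what push the language level to 11 -->
        <maven.compiler.source>11</maven.compiler.source>
        <maven.compiler.target>11</maven.compiler.target>
    </properties>
</profile>
```

Running "mvn help:active-profiles" on the command line lists the profiles Maven
itself considers active, which may help tell a Maven-side activation apart from
a profile that only the IDE has checked.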


Re: Provenance Repository and GDPR

2020-01-30 Thread Mike Thomsen
IANAL, but I would be surprised if NiFi provenance data even legally falls
under the Right to Be Forgotten because it's internal diagnostic data that
is highly ephemeral.


Re: Provenance Repository and GDPR

2020-01-30 Thread u...@moosheimer.com
Hi,

> GDPR doesnt need milisecond realtime deletion right ?)
right.

> since inbound FFs have
>normally hundreds, thousands of records that will need to split, aggregate,
>in complex flow file, implementing a clean
It depends on your application. Not everyone uses NiFi for IoT and
therefore a single record may be included.

> In my opinion your answer to business/management gate keepers is that data
> will be stored on data provenance for 24h (default) which can be
> configured, and that

This is not necessarily the point of the Data Lineage, that the
information is deleted after 24 hours (or whatever is configured).
If Data Lineage is needed (revision, legal requirements etc.), then
deleting the data after a defined time is not an option.

This is the reason why Atlas supports it.

Best Regards,
Uwe


Re: Provenance Repository and GDPR

2020-01-30 Thread u...@moosheimer.com
I think you have the wrong picture.

Data lineage systems like Atlas and similar are pushed because GDPR
prescribes it!
Data Lineage is by no means a pure "internal diagnostic" but has a legal
background.

Thus GDPR defines a recording requirement.
It states, among other things, that
- a description of the categories of personal data,
- a description of the categories of recipients of personal data,
including recipients in third countries or international organisations, and
- any transfer of personal data to a third country or an international
organisation
must be recorded in an audit-proof manner.

And if you do all this correctly, then you have to make sure that the
data is erasable again (right to be forgotten).

By the way, this does not only apply to special Data Lineage systems but
also to all log files, backups etc. At least as long as no other legal
regulation prohibits this.
Data Lineage is therefore not a nice feature for internal diagnostics
but a must.

So far, too few companies have thought of this. But more and more are
recognizing the necessity.
This is also the reason why formerly Hortonworks and now Cloudera work
hard on Atlas.


Re: Provenance Repository and GDPR

2020-01-30 Thread Mike Thomsen
That's actually a pretty fascinating use case. Our experience on this side
of the Atlantic is that few people really care about lineage.


Re: Provenance Repository and GDPR

2020-01-30 Thread Emanuel Oliveira
Hi,

A recap of some NiFi concepts:

   - The Content Repository stores FF contents.
   - Data Provenance events, used to check the lineage and history of FFs, only
   store pointers to FFs (not contents).
   - So one can have the data deleted and still access the lineage/data
   provenance history.

Here's a lot of in-depth material on the subject, but the 3 points above are
the summary of it all:
https://nifi.apache.org/docs/nifi-docs/html/nifi-in-depth.html


*DATA - persistent data only exists in 2 scenarios:*

   - while your flow file is running.
   - archived in the content repository for 12h (to allow access to contents
   when inspecting data provenance/lineage).
   
https://community.cloudera.com/t5/Community-Articles/Understanding-how-NiFi-s-Content-Repository-Archiving-works/ta-p/249418


*PROVENANCE EVENTS (LINEAGE) OF DATA:*

   - contain only provenance attributes, the FF UUID, etc., but NO CONTENTS;
   available for 24h unless changed in the config files.
   
https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html#persistent-provenance-repository-properties



So as you see, both expire daily by default, fast enough that I don't
think GDPR is a problem or that any action is needed.
Now, one can always boost the retention of just the data provenance events to
months, 1 year or whatever suits. But the data is long gone anyway.

Best Regards,
*Emanuel Oliveira*



Re: Provenance Repository and GDPR

2020-01-30 Thread Joe Witt
Mike,

It was created on this side of the Atlantic because when people do care
about such things - they REALLY care.

I anticipate more and more people will care and I hope that day comes
soon.  I'm proud of NiFi's ability to be a leader here because if your flow
management solution between sensors and processing and storage systems
tells you where things came from and went to it is a heck of a good start.

What exists in our provenance data is information about the data but this
can be 'any attribute' put on a flow file throughout its life in the flow.
We simply cannot guarantee this won't be 'content'.  The notion of what is
metadata vs content gets blurry fast.

Uwe,

The data provenance capabilities within NiFi do not support the ability to
'delete records' based on specified parameters.  The only mechanism is
space or time based age off.  For now, whatever the obligation is to
respond to a right to be forgotten request should be what the provenance
within NiFi is configured to hold.  If for instance you have 24 hours then
provenance in NiFi should hold no more than 24 hours.

I doubt this is something we'll be able to spend time on soon, but I agree
that being able to purge records based on more precise parameters is a
good idea.

The intent is not that the built-in nifi provenance store is for long term
but rather the records are there long enough to support flow management use
cases but are always being exported to a long term store such as Atlas or
even just stored in HDFS or other locations for additional use.  One
day...a sweet graph database...

Thanks
Joe
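
The space- and time-based age-off described above is controlled in
nifi.properties. A minimal sketch, assuming the property names documented in
the Administration Guide linked in this thread; the values mirror the 24h and
12h figures mentioned in the discussion and are illustrative only, so check the
guide for the version in use.

```properties
# Provenance events age off when either the time or the size limit is reached
nifi.provenance.repository.max.storage.time=24 hours
nifi.provenance.repository.max.storage.size=1 GB

# Archived FlowFile content is kept separately from provenance events
nifi.content.repository.archive.enabled=true
nifi.content.repository.archive.max.retention.period=12 hours
```

Under this sketch, a shorter erasure obligation would mean lowering the
max.storage.time value to match it, which is the approach suggested above in
place of per-record deletion.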


Re: Provenance Repository and GDPR

2020-01-30 Thread Emanuel Oliveira
NiFi transforms data.
NiFi is not a persistent database where your data persists.


Best Regards,
*Emanuel Oliveira*



On Thu, Jan 30, 2020 at 2:48 PM u...@moosheimer.com 
wrote:

> I think you have the wrong picture.
>
> Data lineage systems like Atlas and similar are pushed because GDPR
> prescribes it!
> Data Lineage is by no means a pure "internal diagnostic" but has a legal
> background.
>
> Thus GDPR defines a recording requirement.
> It states among other things that
> - a description of the categories of personal data
> - a description of the categories of recipients of personal data,
> including recipients in third countries or international organisations
> Transfer of personal data to a third country or an international
> organisation
> - be recorded in an audit-proof manner.
>
> And if you do all this correctly, then you have to make sure that the
> data is erasable again (right to be forgotten).
>
> By the way, this does not only apply to special Data Lineage systems but
> also to all log files, backups etc. At least as long as no other legal
> regulation prohibits this.
> Data Lineage is therefore not a nice feature for internal diagnostics
> but a must.
>
> So far, too few companies have thought of this. But more and more are
> recognizing the necessity.
> This is also the reason why formerly Hortonworks and now Cloudera work
> hard on Atlas.
>
> Am 30.01.2020 um 15:25 schrieb Mike Thomsen:
> > IANAL, but I would be surprised if NiFi provenance data even legally
> falls
> > under the Right to Be Forgotten because it's internal diagnostic data
> that
> > is highly ephemeral.
> >
> > On Thu, Jan 30, 2020 at 9:07 AM Emanuel Oliveira 
> wrote:
> >
> >> Hi, dont think makes sense an api for atomic records:
> >>
> >>1. one configure retention od data provenance (default 24h is "good
> >>enough" GDPR doesnt need milisecond realtime deletion right ?)
> >>
> >>
> https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html#persistent-provenance-repository-properties
> >>2. even if there would be one api to delete FF's with an attribute =
> >>, that would normally be useless as well, since inbound FFs
> >> have
> >>normally hundreds, thousands of records that will need to split,
> >> aggregate,
> >>in complex flow file, implementing a clean up an nano atomic level
> >> would be
> >>to hard and extra effort not needed, since your target single record
> >> would
> >>surely be part of multiple FF UUIDs, some only holding your record,
> but
> >> mot
> >>surefly will have 100s, 100s of other records including your record
> >>somewhere on the middle.
> >>
> >>
> >> In my opinion your answer to business/management gate keepers is that
> data
> >> will be stored on data provenance for 24h (default) which can be
> >> configured, and that
> >>
> >>
> >> Best Regards,
> >> *Emanuel Oliveira*
> >>
> >>
> >>
> >> On Thu, Jan 30, 2020 at 1:54 PM u...@moosheimer.com 
> >> wrote:
> >>
> >>> Dear NiFi developer team,
> >>>
> >>> NiFi's Data Provenance and Data Lineage is perfectly adequate in the
> >>> environment of NiFi, so there is often no need to use Atlas.
> >>>
> >>> When using NiFi with customer data a problem arises.
> >>> The problem is the GDPR requirement that a user has the right to be
> >>> forgotten. Unfortunately, I can't find any API call or information on
> >>> how to delete individual user data from the NiFi Provenance Repository
> >>> based on a user-defined attribute and its defined characteristics.
> >>>
> >>> A delete request like "delete all data and dependencies where the
> >>> attribute XYZ has the value 123" is currently not possible to my
> >> knowledge.
> >>> My questions are:
> >>> Is this actually possible and how? And if not, is it planned?
> >>>
> >>> Thanks
> >>> Uwe
> >>>
>
>


Re: Provenance Repository and GDPR

2020-01-30 Thread u...@moosheimer.com
@Mike
What we have to consider here is also partly very frustrating, though.
But it is pretty fascinating as well.

Mit freundlichen Grüßen / best regards
Kay-Uwe Moosheimer

> Am 30.01.2020 um 16:23 schrieb Mike Thomsen :
> 
> That's actually a pretty fascinating use case. Our experience on this side
> of the Atlantic is that few people really care about lineage.
> 
>> On Thu, Jan 30, 2020 at 9:48 AM u...@moosheimer.com 
>> wrote:
>> 
>> I think you have the wrong picture.
>> 
>> Data lineage systems like Atlas and similar are pushed because GDPR
>> prescribes it!
>> Data Lineage is by no means a pure "internal diagnostic" but has a legal
>> background.
>> 
>> Thus GDPR defines a recording requirement.
>> It states among other things that
>> - a description of the categories of personal data
>> - a description of the categories of recipients of personal data,
>> including recipients in third countries or international organisations
>> Transfer of personal data to a third country or an international
>> organisation
>> - be recorded in an audit-proof manner.
>> 
>> And if you do all this correctly, then you have to make sure that the
>> data is erasable again (right to be forgotten).
>> 
>> By the way, this does not only apply to special Data Lineage systems but
>> also to all log files, backups etc. At least as long as no other legal
>> regulation prohibits this.
>> Data Lineage is therefore not a nice feature for internal diagnostics
>> but a must.
>> 
>> So far, too few companies have thought of this. But more and more are
>> recognizing the necessity.
>> This is also the reason why formerly Hortonworks and now Cloudera work
>> hard on Atlas.
>> 
>>> Am 30.01.2020 um 15:25 schrieb Mike Thomsen:
>>> IANAL, but I would be surprised if NiFi provenance data even legally
>> falls
>>> under the Right to Be Forgotten because it's internal diagnostic data
>> that
>>> is highly ephemeral.
>>> 
>>> On Thu, Jan 30, 2020 at 9:07 AM Emanuel Oliveira 
>> wrote:
>>> 
>>>> Hi, dont think makes sense an api for atomic records:
>>>>
>>>>    1. one configure retention od data provenance (default 24h is "good
>>>>    enough" GDPR doesnt need milisecond realtime deletion right ?)
>>>>
>>>>
>> https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html#persistent-provenance-repository-properties
>>>>    2. even if there would be one api to delete FF's with an attribute =
>>>>    , that would normally be useless as well, since inbound FFs
>>>> have
>>>>    normally hundreds, thousands of records that will need to split,
>>>> aggregate,
>>>>    in complex flow file, implementing a clean up an nano atomic level
>>>> would be
>>>>    to hard and extra effort not needed, since your target single record
>>>> would
>>>>    surely be part of multiple FF UUIDs, some only holding your record,
>> but
>>>> mot
>>>>    surefly will have 100s, 100s of other records including your record
>>>>    somewhere on the middle.
>>>>
>>>>
>>>> In my opinion your answer to business/management gate keepers is that
>> data
>>>> will be stored on data provenance for 24h (default) which can be
>>>> configured, and that
>>>>
>>>>
>>>> Best Regards,
>>>> *Emanuel Oliveira*
>>>>
>>>>
>>>>
>>>> On Thu, Jan 30, 2020 at 1:54 PM u...@moosheimer.com
>>>> wrote:
>>>>
>>>>> Dear NiFi developer team,
>>>>>
>>>>> NiFi's Data Provenance and Data Lineage is perfectly adequate in the
>>>>> environment of NiFi, so there is often no need to use Atlas.
>>>>>
>>>>> When using NiFi with customer data a problem arises.
>>>>> The problem is the GDPR requirement that a user has the right to be
>>>>> forgotten. Unfortunately, I can't find any API call or information on
>>>>> how to delete individual user data from the NiFi Provenance Repository
>>>>> based on a user-defined attribute and its defined characteristics.
>>>>>
>>>>> A delete request like "delete all data and dependencies where the
>>>>> attribute XYZ has the value 123" is currently not possible to my
>>>> knowledge.
>>>>> My questions are:
>>>>> Is this actually possible and how? And if not, is it planned?
>>>>>
>>>>> Thanks
>>>>> Uwe
>>>>>
>> 
>> 



Re: Provenance Repository and GDPR

2020-01-30 Thread Mike Thomsen
> It was created on this side of the Atlantic because when people do care
> about such things - they REALLY care.

Agreed. I was just commenting on our particular experiences with customers
in the federal space. There are unfortunately many who still don't get all
of the accountability and traceability advantages provenance and lineage
tracking provides.

On Thu, Jan 30, 2020 at 10:32 AM Joe Witt  wrote:

> Mike,
>
> It was created on this side of the Atlantic because when people do care
> about such things - they REALLY care.
>
> I anticipate more and more people will care and I hope that day comes
> soon.  I'm proud of NiFi's ability to be a leader here because if your flow
> management solution between sensors and processing and storage systems
> tells you where things came from and went to it is a heck of a good start.
>
> What exists in our provenance data is information about the data but this
> can be 'any attribute' put on a flow file throughout its life in the flow.
> We simply cannot guarantee this wont be 'content'.  The notion of what is
> metadata vs content gets blurry fast.
>
> Uwe,
>
> The data provenance capabilities within NiFi do no support the ability to
> 'delete records' based on specified parameters.  The only mechanism is
> space or time based age off.  For now, whatever the obligation is to
> respond to a right to be forgotten request should be what the provenance
> within NiFi is configured to hold.  If for instance you have 24 hours then
> provenance in NiFi should hold no more than 24 hours.
>
> I doubt this is something we'll be able to spend time on sooner but I agree
> the idea of being able to purge out records is a good one based on more
> precise parameters.
>
> The intent is not that the built-in nifi provenance store is for long term
> but rather the records are there long enough to support flow management use
> cases but are always being exported to a long term store such as Atlas or
> even just stored in HDFS or other locations for additional use.  One
> day...a sweet graph database...
>
> Thanks
> Joe
>
> On Thu, Jan 30, 2020 at 10:29 AM Emanuel Oliveira 
> wrote:
>
> > Hi,
> >
> > Some recap on NiFi concepts:
> >
> >- Content Repository stores FF contents.
> >- Data Provenance events -used to check lineage of history of FFs-
> only
> >stores pointers to FFs (not contents).
> >- so one can have data deleted and still access lineage/data
> provenance
> >history.
> >
> > Heres a lof of in-depth on the subject, but above 3 points are the
> > summary of all:
> > https://nifi.apache.org/docs/nifi-docs/html/nifi-in-depth.html
> >
> >
> > *DATA - persistent data only exists in 2 scenarios:*
> >
> >- while your flow file running.
> >- archived on content repository for 12h (to allow access contents
> when
> >using inspect data provenance/lineage).
> >
> >
> https://community.cloudera.com/t5/Community-Articles/Understanding-how-NiFi-s-Content-Repository-Archiving-works/ta-p/249418
> >
> >
> > *PROVENANCE EVENTS (LINEAGE) OF DATA:*
> >
> >- contains only provenance attributes and FF uuid etcbut NO CONTENTS,
> >available for 24h unless increasing/changed on config files.
> >-
> >
> >
> https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html#persistent-provenance-repository-properties
> >
> >
> >
> > So as you see both context by default expire daily. fast enough that dont
> > think GDPR is any problem or any action needed.
> > Now one can always boosts retention of just data provenance events for
> > months, 1 year or whatever suits. But data is long gone anyway.
> >
> > Best Regards,
> > *Emanuel Oliveira*
> >
> >
> >
> > On Thu, Jan 30, 2020 at 2:26 PM u...@moosheimer.com 
> > wrote:
> >
> > > Hi,
> > >
> > > > GDPR doesnt need milisecond realtime deletion right ?)
> > > right.
> > >
> > > > since inbound FFs have
> > > >normally hundreds, thousands of records that will need to split,
> > > aggregate,
> > > >in complex flow file, implementing a clean
> > > It depends on your application. Not everyone uses NiFi for IoT and
> > > therefore a single record may be included.
> > >
> > > > In my opinion your answer to business/management gate keepers is that
> > > data
> > > > will be stored on data provenance for 24h (default) which can be
> > > > configured, and that
> > >
> > > This is not necessarily the point of the Data Lineage, that the
> > > information is deleted after 24 hours (or whatever is configured).
> > > If Data Lineage is needed (revision, legal requirements etc.), then
> > > deleting the data after a defined time is not an option.
> > >
> > > This is the reason why Atlas supports it.
> > >
> > > Best Regards,
> > > Uwe
> > >
> > > Am 30.01.2020 um 15:06 schrieb Emanuel Oliveira:
> > > > Hi, dont think makes sense an api for atomic records:
> > > >
> > > >1. one configure retention od data provenance (default 24h is
> "good
> > > >enough" GDPR doesnt need milisecond realtime deletion right ?)
> > >

Re: Provenance Repository and GDPR

2020-01-30 Thread Emanuel Oliveira
But enlighten me please :) Isn't GDPR just about cleaning data from persistent
storage?
In what sense does NiFi relate to GDPR compliance?

   - In terms of data (FF contents): they are too transient (gone in 12 hours
   by default).
   - I guess the discussion is about the fact that FF attributes are kept in
   the data provenance repo? (gone in 24h by default)

I wonder where the culprit is here. Is it the situation where one wants
to keep a long trace of data provenance, like 6 months, but because
attributes are stored on provenance events, they must be deleted?
I guess it can only be a problem of deleting attributes from the provenance
repo, and not FF contents, right, as they are gone fast enough?
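
A minimal Python sketch of the comparison this question turns on: read the two
retention periods from nifi.properties-style text and check them against a
deletion deadline. The two property names are the ones listed in the NiFi
Administration Guide; the sample values are only the defaults quoted in this
thread, and deadline_hours and the helper names are hypothetical. Only "hours"
and "days" units are handled here.

```python
import re

SAMPLE_PROPERTIES = """\
# Values below are the defaults quoted in this thread, not a recommendation.
nifi.content.repository.archive.max.retention.period=12 hours
nifi.provenance.repository.max.storage.time=24 hours
"""

RETENTION_KEYS = (
    "nifi.content.repository.archive.max.retention.period",
    "nifi.provenance.repository.max.storage.time",
)

# Assumption: only these units are supported by this sketch.
UNIT_HOURS = {"hour": 1, "hours": 1, "day": 24, "days": 24}


def parse_properties(text):
    """Parse key=value lines, skipping blanks and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def to_hours(period):
    """Convert a period such as '12 hours' or '180 days' to whole hours."""
    match = re.fullmatch(r"(\d+)\s*([a-z]+)", period.strip().lower())
    if match is None or match.group(2) not in UNIT_HOURS:
        raise ValueError(f"unsupported period: {period!r}")
    return int(match.group(1)) * UNIT_HOURS[match.group(2)]


def retention_report(text, deadline_hours):
    """Map each retention property to (hours, fits_within_deadline)."""
    props = parse_properties(text)
    report = {}
    for key in RETENTION_KEYS:
        hours = to_hours(props[key])
        report[key] = (hours, hours <= deadline_hours)
    return report
```

With the defaults both periods fit a 24-hour deadline; raising the provenance
period to something like "180 days" makes that entry fail the check, which is
the long-retention case asked about here.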

Best Regards,
*Emanuel Oliveira*



On Thu, Jan 30, 2020 at 4:42 PM Mike Thomsen  wrote:

> > It was created on this side of the Atlantic because when people do care
> about such things - they REALLY care.
>
> Agreed. I was just commenting on our particular experiences with customers
> in the federal space. There are unfortunately many who still don't get all
> of the accountability traceability advantages provenance and lineage
> tracking provides.
>
> On Thu, Jan 30, 2020 at 10:32 AM Joe Witt  wrote:
>
> > Mike,
> >
> > It was created on this side of the Atlantic because when people do care
> > about such things - they REALLY care.
> >
> > I anticipate more and more people will care and I hope that day comes
> > soon.  I'm proud of NiFi's ability to be a leader here because if your
> flow
> > management solution between sensors and processing and storage systems
> > tells you where things came from and went to it is a heck of a good
> start.
> >
> > What exists in our provenance data is information about the data but this
> > can be 'any attribute' put on a flow file throughout its life in the
> flow.
> > We simply cannot guarantee this wont be 'content'.  The notion of what is
> > metadata vs content gets blurry fast.
> >
> > Uwe,
> >
> > The data provenance capabilities within NiFi do no support the ability to
> > 'delete records' based on specified parameters.  The only mechanism is
> > space or time based age off.  For now, whatever the obligation is to
> > respond to a right to be forgotten request should be what the provenance
> > within NiFi is configured to hold.  If for instance you have 24 hours
> then
> > provenance in NiFi should hold no more than 24 hours.
> >
> > I doubt this is something we'll be able to spend time on sooner but I
> agree
> > the idea of being able to purge out records is a good one based on more
> > precise parameters.
> >
> > The intent is not that the built-in nifi provenance store is for long
> term
> > but rather the records are there long enough to support flow management
> use
> > cases but are always being exported to a long term store such as Atlas or
> > even just stored in HDFS or other locations for additional use.  One
> > day...a sweet graph database...
> >
> > Thanks
> > Joe
> >
> > On Thu, Jan 30, 2020 at 10:29 AM Emanuel Oliveira 
> > wrote:
> >
> > > Hi,
> > >
> > > Some recap on NiFi concepts:
> > >
> > >- Content Repository stores FF contents.
> > >- Data Provenance events -used to check lineage of history of FFs-
> > only
> > >stores pointers to FFs (not contents).
> > >- so one can have data deleted and still access lineage/data
> > provenance
> > >history.
> > >
> > > Heres a lof of in-depth on the subject, but above 3 points are the
> > > summary of all:
> > > https://nifi.apache.org/docs/nifi-docs/html/nifi-in-depth.html
> > >
> > >
> > > *DATA - persistent data only exists in 2 scenarios:*
> > >
> > >- while your flow file running.
> > >- archived on content repository for 12h (to allow access contents
> > when
> > >using inspect data provenance/lineage).
> > >
> > >
> >
> https://community.cloudera.com/t5/Community-Articles/Understanding-how-NiFi-s-Content-Repository-Archiving-works/ta-p/249418
> > >
> > >
> > > *PROVENANCE EVENTS (LINEAGE) OF DATA:*
> > >
> > >- contains only provenance attributes and FF uuid etcbut NO
> CONTENTS,
> > >available for 24h unless increasing/changed on config files.
> > >-
> > >
> > >
> >
> https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html#persistent-provenance-repository-properties
> > >
> > >
> > >
> > > So as you see both context by default expire daily. fast enough that
> dont
> > > think GDPR is any problem or any action needed.
> > > Now one can always boosts retention of just data provenance events for
> > > months, 1 year or whatever suits. But data is long gone anyway.
> > >
> > > Best Regards,
> > > *Emanuel Oliveira*
> > >
> > >
> > >
> > > On Thu, Jan 30, 2020 at 2:26 PM u...@moosheimer.com  >
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > > GDPR doesnt need milisecond realtime deletion right ?)
> > > > right.
> > > >
> > > > > since inbound FFs have
> > > > >normally hundreds, thousands of records that will need to split,
> > > > aggregate,
> > > > >in complex 

Re: Provenance Repository and GDPR

2020-01-30 Thread u...@moosheimer.com
Joe,

thank you for the detailed and final clarification.
With your statement I know how to argue with my clients.

I would like to share one last idea.

NiFi is being used more and more in Europe. And SMBs/SMEs are starting
to deal with NiFi.
In contrast to the US, the share of SMBs in Europe is extremely high and
therefore a huge market.
And in Europe it is not uncommon to speak of an SMB when the company has
1000 employees.

For SMBs, NiFi is a good (and often the best) entry into the world of
large data such as IoT or process chain events, and often that is enough
for them.
They attach a Postgres or maybe JanusGraph and whatever else they need
and that is enough for their business case.

It would be interesting to expand the NiFi Data Provenance/Data Lineage
for this customer group.
Not everyone starts with HDF or wants to have a Hadoop installation
right away. But they all have the GDPR problem.
And they don't necessarily want to have Atlas, Ranger and
HBase/Cassandra in addition, because they don't have the
personnel/expertise to do so, because it's not their main business.

Even if the flood of data in the Data Lineage becomes extremely high
over time, it would be interesting to expand the possibilities in NiFi
here as well.
Maybe it would be a good idea to extend the ProvenanceReporter to store
in S3 (internally in a Ceph SAN or externally in an S3 cloud). Then the
hard disk limits would be solved. The whole thing would be routed over
e.g. Kafka or NiFi S2S or whatever to avoid latency problems.

Maybe my idea is unrealistic, but I think it can't hurt to discuss it.
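
To make the idea a bit more concrete, here is a rough Python sketch of the
export side only. It is not NiFi code: the event fields (eventId,
timestampMillis, attributes), the key layout and the put_object callable are
all hypothetical stand-ins for whatever the reporter and the S3-compatible
client would really provide.

```python
import json
from datetime import datetime, timezone


def object_key(event, prefix="provenance"):
    """Build a date-partitioned key from the event's UTC timestamp."""
    ts = datetime.fromtimestamp(event["timestampMillis"] / 1000, tz=timezone.utc)
    return f"{prefix}/{ts:%Y/%m/%d}/{event['eventId']}.json"


def export_events(events, put_object):
    """Serialize each event as JSON and pass it to put_object(key, body).

    put_object is a hypothetical callable; in practice it would wrap the
    upload call of an S3-compatible client or a Kafka producer.
    """
    keys = []
    for event in events:
        key = object_key(event)
        body = json.dumps(event, sort_keys=True).encode("utf-8")
        put_object(key, body)
        keys.append(key)
    return keys
```

One object per event is only for readability; a real exporter would most
likely batch events. Partitioning keys by date is a design choice that would
let expiry or deletion be done per day on the storage side.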

Thanks
Uwe

Am 30.01.2020 um 16:32 schrieb Joe Witt:
> Mike,
>
> It was created on this side of the Atlantic because when people do care
> about such things - they REALLY care.
>
> I anticipate more and more people will care and I hope that day comes
> soon.  I'm proud of NiFi's ability to be a leader here because if your flow
> management solution between sensors and processing and storage systems
> tells you where things came from and went to it is a heck of a good start.
>
> What exists in our provenance data is information about the data but this
> can be 'any attribute' put on a flow file throughout its life in the flow.
> We simply cannot guarantee this wont be 'content'.  The notion of what is
> metadata vs content gets blurry fast.
>
> Uwe,
>
> The data provenance capabilities within NiFi do no support the ability to
> 'delete records' based on specified parameters.  The only mechanism is
> space or time based age off.  For now, whatever the obligation is to
> respond to a right to be forgotten request should be what the provenance
> within NiFi is configured to hold.  If for instance you have 24 hours then
> provenance in NiFi should hold no more than 24 hours.
>
> I doubt this is something we'll be able to spend time on sooner but I agree
> the idea of being able to purge out records is a good one based on more
> precise parameters.
>
> The intent is not that the built-in nifi provenance store is for long term
> but rather the records are there long enough to support flow management use
> cases but are always being exported to a long term store such as Atlas or
> even just stored in HDFS or other locations for additional use.  One
> day...a sweet graph database...
>
> Thanks
> Joe
>
> On Thu, Jan 30, 2020 at 10:29 AM Emanuel Oliveira 
> wrote:
>
>> Hi,
>>
>> Some recap on NiFi concepts:
>>
>>- Content Repository stores FF contents.
>>- Data Provenance events -used to check lineage of history of FFs- only
>>stores pointers to FFs (not contents).
>>- so one can have data deleted and still access lineage/data provenance
>>history.
>>
>> Heres a lof of in-depth on the subject, but above 3 points are the
>> summary of all:
>> https://nifi.apache.org/docs/nifi-docs/html/nifi-in-depth.html
>>
>>
>> *DATA - persistent data only exists in 2 scenarios:*
>>
>>- while your flow file running.
>>- archived on content repository for 12h (to allow access contents when
>>using inspect data provenance/lineage).
>>
>> https://community.cloudera.com/t5/Community-Articles/Understanding-how-NiFi-s-Content-Repository-Archiving-works/ta-p/249418
>>
>>
>> *PROVENANCE EVENTS (LINEAGE) OF DATA:*
>>
>>- contains only provenance attributes and FF uuid etcbut NO CONTENTS,
>>available for 24h unless increasing/changed on config files.
>>-
>>
>> https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html#persistent-provenance-repository-properties
>>
>>
>>
>> So as you see both context by default expire daily. fast enough that dont
>> think GDPR is any problem or any action needed.
>> Now one can always boosts retention of just data provenance events for
>> months, 1 year or whatever suits. But data is long gone anyway.
>>
>> Best Regards,
>> *Emanuel Oliveira*
>>
>>
>>
>> On Thu, Jan 30, 2020 at 2:26 PM u...@moosheimer.com 
>> wrote:
>>
>>> Hi,
>>>
>>>> GDPR doesnt need milisecond realtime deletion right

Re: Provenance Repository and GDPR

2020-01-30 Thread u...@moosheimer.com
Emanuel

That was not meant disrespectfully by me. And if that's how you felt,
then I apologize.

>In what sense does NiFi relates to GDPR compliance ?
All person-related data that flows, is read, sent or stored etc.  in a
company is GDPR relevant.

>- in terms of data FF contents - they too transient (gone in 12hours /
> default).
It makes no difference how long the data is stored. And it makes no
difference if data is stored on disk or just in memory.

The data can potentially be read, processed by others or sent to other
systems and so on. Or the data can be used during this time to establish
relationships to other data (pseudo anonymized data etc.).

> I guess discussion is on the fact FF attributes are kept on the data
> provenance repo ? (gone in 24h / default)
I'm afraid not. It's generally a matter of NiFi storing data - as
already mentioned, it doesn't make any difference whether it's on the
hard disk or just in memory.

> I wonder where the culprit here ?
There's no culprit here. It's generally a problem with GDPR when
processing person-related data.
What counts as person-related would fill a book, because machine data
can also be person-related, for example if I can relate a person
directly to the machine and place/time.
This would allow me to track a person/employee and this is not allowed
(unless a law allows me to do so).

All this goes much further and would be far too much to mention now.
In principle, we have a GDPR issue and must act in accordance with the law.

We do not agree with all the regulations either. But all regulations I
know so far have at least one justification. Even if we as enterprise
architects, developers, administrators etc. have our problems with them.

Regards
Uwe

Am 30.01.2020 um 17:51 schrieb Emanuel Oliveira:
> But enlight me please :) isnt GDPR just about cleaning from persistent
> storage ?
> In what sense does NiFi relates to GDPR compliance ?
>
>- in terms of data FF contents - they too transient (gone in 12hours /
>default).
>- I guess discussion is on the fact FF attributes are kept on the data
>provenance repo ? (gone in 24h / default)
>
> I wonder wheres the culprit here ? Is it in the situation hwere one wants
> to keep a long trace of data provenance like 6 months, but because
> attributes are stored on provenance events, then they must be deleted ?
> I guess it can only be a problem of deleting attributes from provenance
> repo and no FF contents right as they gone fast enough ?
>
> Best Regards,
> *Emanuel Oliveira*
>
>
>
> On Thu, Jan 30, 2020 at 4:42 PM Mike Thomsen  wrote:
>
>>> It was created on this side of the Atlantic because when people do care
>> about such things - they REALLY care.
>>
>> Agreed. I was just commenting on our particular experiences with customers
>> in the federal space. There are unfortunately many who still don't get all
>> of the accountability traceability advantages provenance and lineage
>> tracking provides.
>>
>> On Thu, Jan 30, 2020 at 10:32 AM Joe Witt  wrote:
>>
>>> Mike,
>>>
>>> It was created on this side of the Atlantic because when people do care
>>> about such things - they REALLY care.
>>>
>>> I anticipate more and more people will care and I hope that day comes
>>> soon.  I'm proud of NiFi's ability to be a leader here because if your
>> flow
>>> management solution between sensors and processing and storage systems
>>> tells you where things came from and went to it is a heck of a good
>> start.
>>> What exists in our provenance data is information about the data but this
>>> can be 'any attribute' put on a flow file throughout its life in the
>> flow.
>>> We simply cannot guarantee this wont be 'content'.  The notion of what is
>>> metadata vs content gets blurry fast.
>>>
>>> Uwe,
>>>
>>> The data provenance capabilities within NiFi do no support the ability to
>>> 'delete records' based on specified parameters.  The only mechanism is
>>> space or time based age off.  For now, whatever the obligation is to
>>> respond to a right to be forgotten request should be what the provenance
>>> within NiFi is configured to hold.  If for instance you have 24 hours
>> then
>>> provenance in NiFi should hold no more than 24 hours.
>>>
>>> I doubt this is something we'll be able to spend time on sooner but I
>> agree
>>> the idea of being able to purge out records is a good one based on more
>>> precise parameters.
>>>
>>> The intent is not that the built-in nifi provenance store is for long
>> term
>>> but rather the records are there long enough to support flow management
>> use
>>> cases but are always being exported to a long term store such as Atlas or
>>> even just stored in HDFS or other locations for additional use.  One
>>> day...a sweet graph database...
>>>
>>> Thanks
>>> Joe
>>>
>>> On Thu, Jan 30, 2020 at 10:29 AM Emanuel Oliveira 
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> Some recap on NiFi concepts:
>>>>
>>>>- Conten

Re: Provenance Repository and GDPR

2020-01-30 Thread Mike Thomsen
I suppose the elephant in the room here is what sort of personal data is
being stored in your provenance records? Can't you just refactor your flows
to ensure that the provenance data doesn't meaningfully contain anything
traceable to a person?
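
As a sketch of what such a refactoring could look like in principle, the
Python below replaces the values of attributes considered personal with a
keyed hash before they would be written as flow file attributes. This is
illustrative only: the attribute names, the sensitive_keys set and the secret
are hypothetical, and whether a keyed hash is enough to stop a value counting
as personal data is a legal question this sketch does not answer.

```python
import hashlib
import hmac


def pseudonymize_attributes(attributes, sensitive_keys, secret):
    """Return a copy of attributes with sensitive values replaced.

    Each sensitive value is replaced by the hex HMAC-SHA256 of the value,
    keyed with `secret` (bytes). Other attributes are copied unchanged.
    """
    cleaned = {}
    for name, value in attributes.items():
        if name in sensitive_keys:
            digest = hmac.new(secret, value.encode("utf-8"), hashlib.sha256)
            cleaned[name] = digest.hexdigest()
        else:
            cleaned[name] = value
    return cleaned


# Hypothetical example attributes, not taken from any real flow.
example = pseudonymize_attributes(
    {"filename": "orders.csv", "customer.email": "jane@example.org"},
    sensitive_keys={"customer.email"},
    secret=b"rotate-me",
)
```

The same input and secret always give the same digest, so records can still
be correlated downstream without the raw value appearing in provenance events.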

On Thu, Jan 30, 2020 at 12:41 PM u...@moosheimer.com 
wrote:

> Emanuel
>
> That was not meant disrespectfully by me. And if that's how you felt,
> then I apologize.
>
> >In what sense does NiFi relates to GDPR compliance ?
> All person-related data that flows, is read, sent or stored etc.  in a
> company is GDPR relevant.
>
> >- in terms of data FF contents - they too transient (gone in 12hours /
> default).
> It makes no difference how long the data is stored. And it makes no
> difference if data is stored on disk or just in memory.
>
> The data can potentially be read, processed by others or sent to other
> systems and so on. Or the data can be used during this time to establish
> relationships to other data (pseudo anonymized data etc.).
>
> > I guess discussion is on the fact FF attributes are kept on the data
>provenance repo ? (gone in 24h / default)
> I'm afraid not. It's generally a matter of NiFi storing data - as
> already mentioned, it doesn't make any difference whether it's on the
> hard disk or just in memory.
>
> > I wonder where the culprit here ?
> There's no culprit here. It's generally a problem with GDPR when
> processing person-related data.
> It's a problem of person-related data.
> It is a problem of person-related data, which would fill a book, what is
> person-related, because machine data can also be person-related, for
> example if I can relate a person directly to the machine and place/time.
> This would allow me to track a person/employee and this is not allowed
> (unless a law allows me to do so).
>
> All this goes much further and would be far too much to mention now.
> In principle, we have a GDPR issue and must act in accordance with the law.
>
> We do not agree with all the regulation either. But all regulations I
> know so far have at least one justification. Even if we as enterprise
> architects, developers, administrators etc. have our problems with them.
>
> Regards
> Uwe
>
> Am 30.01.2020 um 17:51 schrieb Emanuel Oliveira:
> > But enlight me please :) isnt GDPR just about cleaning from persistent
> > storage ?
> > In what sense does NiFi relates to GDPR compliance ?
> >
> >- in terms of data FF contents - they too transient (gone in 12hours /
> >default).
> >- I guess discussion is on the fact FF attributes are kept on the data
> >provenance repo ? (gone in 24h / default)
> >
> > I wonder wheres the culprit here ? Is it in the situation hwere one wants
> > to keep a long trace of data provenance like 6 months, but because
> > attributes are stored on provenance events, then they must be deleted ?
> > I guess it can only be a problem of deleting attributes from provenance
> > repo and no FF contents right as they gone fast enough ?
> >
> > Best Regards,
> > *Emanuel Oliveira*
> >
> >
> >
> > On Thu, Jan 30, 2020 at 4:42 PM Mike Thomsen 
> wrote:
> >
> >>> It was created on this side of the Atlantic because when people do care
> >> about such things - they REALLY care.
> >>
> >> Agreed. I was just commenting on our particular experiences with
> customers
> >> in the federal space. There are unfortunately many who still don't get
> all
> >> of the accountability traceability advantages provenance and lineage
> >> tracking provides.
> >>
> >> On Thu, Jan 30, 2020 at 10:32 AM Joe Witt  wrote:
> >>
> >>> Mike,
> >>>
> >>> It was created on this side of the Atlantic because when people do care
> >>> about such things - they REALLY care.
> >>>
> >>> I anticipate more and more people will care and I hope that day comes
> >>> soon.  I'm proud of NiFi's ability to be a leader here because if your
> >> flow
> >>> management solution between sensors and processing and storage systems
> >>> tells you where things came from and went to it is a heck of a good
> >> start.
> >>> What exists in our provenance data is information about the data but
> this
> >>> can be 'any attribute' put on a flow file throughout its life in the
> >> flow.
> >>> We simply cannot guarantee this wont be 'content'.  The notion of what
> is
> >>> metadata vs content gets blurry fast.
> >>>
> >>> Uwe,
> >>>
> >>> The data provenance capabilities within NiFi do no support the ability
> to
> >>> 'delete records' based on specified parameters.  The only mechanism is
> >>> space or time based age off.  For now, whatever the obligation is to
> >>> respond to a right to be forgotten request should be what the
> provenance
> >>> within NiFi is configured to hold.  If for instance you have 24 hours
> >> then
> >>> provenance in NiFi should hold no more than 24 hours.
> >>>
> >>> I doubt this is something we'll be able to spend time on sooner but I
> >> agree
> >>> the idea of being able to purge out records is a good one based on more
> >>> precise paramete

Re: Provenance Repository and GDPR

2020-01-30 Thread Lars Winderling
Dear Uwe and fellow devs,

sorry if I completely miss the point here, but I'll try. I am also working with
NiFi under GDPR regulations, in the online ad business. From my point of view it
would be sufficient, if a user requests deletion, to ensure that no new data
gets stored and to delete all personal data from all respective systems. The
NiFi repos will expire their data, which can be argued to equal a delayed
deletion. Remember that GDPR is quite strict, but if you have a proper case for
this kind of process, e.g. due to technical limitations, it needs to be
documented, and then it will likely be ok. We do it similarly, and our legal
counsel approved this. My response, however, is not legally binding. The
regulation says something like you should take appropriate measures. If a tool
like NiFi just doesn't let you delete temporarily stored data instantly, this
may be considered acceptable.
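
The "delayed deletion" argument only holds if the repositories really expire
data within the window that was documented. A minimal Python sketch of such a
check follows, assuming retention is configured in nifi.properties through
nifi.provenance.repository.max.storage.time and
nifi.content.repository.archive.max.retention.period with values such as
"24 hours". The helper names, the sample values and the 24-hour limit are
illustrative assumptions, not part of NiFi.

```python
import re

# Seconds per unit for the duration strings this sketch understands.
_UNIT_SECONDS = {"sec": 1, "secs": 1, "min": 60, "mins": 60,
                 "hour": 3600, "hours": 3600, "day": 86400, "days": 86400}


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def duration_seconds(value):
    """Convert a string such as '24 hours' to seconds."""
    match = re.fullmatch(r"(\d+)\s*([A-Za-z]+)", value.strip())
    if not match or match.group(2).lower() not in _UNIT_SECONDS:
        raise ValueError(f"unsupported duration: {value!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2).lower()]


def retention_violations(props, limit_seconds, keys):
    """Return the keys whose configured retention exceeds the limit."""
    return [k for k in keys
            if k in props and duration_seconds(props[k]) > limit_seconds]


# Hypothetical configuration excerpt, for illustration only.
sample = """
nifi.content.repository.archive.max.retention.period=12 hours
nifi.provenance.repository.max.storage.time=30 days
"""
keys = ["nifi.content.repository.archive.max.retention.period",
        "nifi.provenance.repository.max.storage.time"]
print(retention_violations(parse_properties(sample), 24 * 3600, keys))
```

With the sample above, only the provenance setting is reported, because 30
days exceeds the assumed 24-hour obligation.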

Best,
Lars

On 30 January 2020 21:36:31 CET, Mike Thomsen wrote:
>I suppose the elephant in the room here is what sort of personal data is
>being stored in your provenance records? Can't you just refactor your flows
>to ensure that the provenance data doesn't meaningfully contain anything
>traceable to a person?
>
>On Thu, Jan 30, 2020 at 12:41 PM u...@moosheimer.com wrote:
>
>> Emanuel
>>
>> That was not meant disrespectfully by me. And if that's how you felt,
>> then I apologize.
>>
>> >In what sense does NiFi relates to GDPR compliance ?
>> All person-related data that flows, is read, sent or stored etc.  in
>a
>> company is GDPR relevant.
>>
>> >- in terms of data FF contents - they too transient (gone in 12hours
>/
>> default).
>> It makes no difference how long the data is stored. And it makes no
>> difference if data is stored on disk or just in memory.
>>
>> The data can potentially be read, processed by others or sent to
>other
>> systems and so on. Or the data can be used during this time to
>establish
>> relationships to other data (pseudo anonymized data etc.).
>>
>> > I guess discussion is on the fact FF attributes are kept on the
>data
>>provenance repo ? (gone in 24h / default)
>> I'm afraid not. It's generally a matter of NiFi storing data - as
>> already mentioned, it doesn't make any difference whether it's on the
>> hard disk or just in memory.
>>
>> > I wonder where the culprit here ?
>> There's no culprit here. It's generally a problem with GDPR when
>> processing person-related data.
>> It's a problem of person-related data.
>> It is a problem of person-related data, which would fill a book, what
>is
>> person-related, because machine data can also be person-related, for
>> example if I can relate a person directly to the machine and
>place/time.
>> This would allow me to track a person/employee and this is not
>allowed
>> (unless a law allows me to do so).
>>
>> All this goes much further and would be far too much to mention now.
>> In principle, we have a GDPR issue and must act in accordance with
>the law.
>>
>> We do not agree with all the regulation either. But all regulations I
>> know so far have at least one justification. Even if we as enterprise
>> architects, developers, administrators etc. have our problems with
>them.
>>
>> Regards
>> Uwe
>>
>> Am 30.01.2020 um 17:51 schrieb Emanuel Oliveira:
>> > But enlight me please :) isnt GDPR just about cleaning from
>persistent
>> > storage ?
>> > In what sense does NiFi relates to GDPR compliance ?
>> >
>> >- in terms of data FF contents - they too transient (gone in
>12hours /
>> >default).
>> >- I guess discussion is on the fact FF attributes are kept on
>the data
>> >provenance repo ? (gone in 24h / default)
>> >
>> > I wonder wheres the culprit here ? Is it in the situation hwere one
>wants
>> > to keep a long trace of data provenance like 6 months, but because
>> > attributes are stored on provenance events, then they must be
>deleted ?
>> > I guess it can only be a problem of deleting attributes from
>provenance
>> > repo and no FF contents right as they gone fast enough ?
>> >
>> > Best Regards,
>> > *Emanuel Oliveira*
>> >
>> >
>> >
>> > On Thu, Jan 30, 2020 at 4:42 PM Mike Thomsen
>
>> wrote:
>> >
>> >>> It was created on this side of the Atlantic because when people
>do care
>> >> about such things - they REALLY care.
>> >>
>> >> Agreed. I was just commenting on our particular experiences with
>> customers
>> >> in the federal space. There are unfortunately many who still don't
>get
>> all
>> >> of the accountability traceability advantages provenance and
>lineage
>> >> tracking provides.
>> >>
>> >> On Thu, Jan 30, 2020 at 10:32 AM Joe Witt 
>wrote:
>> >>
>> >>> Mike,
>> >>>
>> >>> It was created on this side of the Atlantic because when people
>do care
>> >>> about such things - they REALLY care.
>> >>>
>> >>> I anticipate more and more people will care and I hope that day
>comes
>> >>> soon.  I'm proud of NiFi's ability to be a leader here because if
>your
>> >> flow
>> >>> management solution between sensors and pr

Parameters, Registry and sensitive values

2020-01-30 Thread Mark Bean
When storing a version controlled process group in the NiFi Registry, the
relevant Parameter Context will get stored as well. Similarly, when a
different NiFi instance instantiates that process group from the Registry,
the instance creates the Parameter Context so it can be used by the
process group.

However, if there are parameters in the context with values marked as
sensitive, then those values are 1) not stored in NiFi Registry and
therefore 2) no value is available on any instance pulling the process
group from the Registry.

Is there work being done to bridge this gap? Are there any recommendations
on how to supply the sensitive values?

Thanks,
Mark


Re: Provenance Repository and GDPR

2020-01-30 Thread u...@moosheimer.com
You see this very much from a technical perspective.
Purely technical IoT data does not have this problem.

But the question is what is purely technical?

If you process IoT data of vehicles, then at first glance this is purely
technical data.
But if there is a VIN (Vehicle Identification Number) in the data, then it is
person-related data.
This is already legally defined.

Without the VIN, however, the data makes no sense. And even if you
pseudonymize the VIN, you can always assign the data to a vehicle and thus
to its owner, because you must keep a KV table somewhere or generate a hash.
You would have to make the data totally anonymous. But that would make
the VIN completely worthless and thus also your IoT data.
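
A minimal Python sketch of why a hashed VIN is still linkable: the hash is
deterministic, so every record from the same vehicle keeps the same key. The
VIN, the salt and the field names below are made up for illustration.

```python
import hashlib


def pseudonymize(vin, salt="example-salt"):
    """Return a deterministic pseudonym for a VIN (salted SHA-256, hex)."""
    return hashlib.sha256((salt + vin).encode("utf-8")).hexdigest()


# Two hypothetical sensor records from the same vehicle (made-up VIN).
records = [
    {"vin": "WVWZZZ1JZXW000001", "speed_kmh": 52},
    {"vin": "WVWZZZ1JZXW000001", "speed_kmh": 87},
]
pseudonymized = [{"vehicle": pseudonymize(r["vin"]), "speed_kmh": r["speed_kmh"]}
                 for r in records]

# The raw VIN is gone, but both records still share one stable key,
# so they can be linked to the same vehicle (and thus to its owner).
print(pseudonymized[0]["vehicle"] == pseudonymized[1]["vehicle"])  # True
```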

There are many books and articles that describe how to identify the person
behind pseudonymized data.
And GDPR explicitly says that such data is then person-related data.

I could continue this list forever. A farmer has IoT sensors on his
tractor and in the field. This data goes to a provider who then
evaluates how to fertilize and plant.
All, all, all person-related data.

This is not as easy as you might think, because there is hardly any data
that is not personalizable.

And in general it also depends on your business. If you explicitly use
personal data in your Use Case, you cannot simply filter that out.
Unless you stop your business.

On 30.01.2020 at 21:36, Mike Thomsen wrote:
> I suppose the elephant in the room here is what sort of personal data is
> being stored in your provenance records? Can't you just refactor your flows
> to ensure that the provenance data doesn't meaningfully contain anything
> traceable to a person?
>
> On Thu, Jan 30, 2020 at 12:41 PM u...@moosheimer.com 
> wrote:
>
>> Emanuel
>>
>> That was not meant disrespectfully by me. And if that's how you felt,
>> then I apologize.
>>
>>> In what sense does NiFi relates to GDPR compliance ?
>> All person-related data that flows, is read, sent or stored etc.  in a
>> company is GDPR relevant.
>>
>>> - in terms of data FF contents - they too transient (gone in 12hours /
>> default).
>> It makes no difference how long the data is stored. And it makes no
>> difference if data is stored on disk or just in memory.
>>
>> The data can potentially be read, processed by others or sent to other
>> systems and so on. Or the data can be used during this time to establish
>> relationships to other data (pseudo anonymized data etc.).
>>
>>> I guess discussion is on the fact FF attributes are kept on the data
>>provenance repo ? (gone in 24h / default)
>> I'm afraid not. It's generally a matter of NiFi storing data - as
>> already mentioned, it doesn't make any difference whether it's on the
>> hard disk or just in memory.
>>
>>> I wonder where the culprit here ?
>> There's no culprit here. It's generally a problem with GDPR when
>> processing person-related data.
>> It's a problem of person-related data.
>> It is a problem of person-related data, which would fill a book, what is
>> person-related, because machine data can also be person-related, for
>> example if I can relate a person directly to the machine and place/time.
>> This would allow me to track a person/employee and this is not allowed
>> (unless a law allows me to do so).
>>
>> All this goes much further and would be far too much to mention now.
>> In principle, we have a GDPR issue and must act in accordance with the law.
>>
>> We do not agree with all the regulation either. But all regulations I
>> know so far have at least one justification. Even if we as enterprise
>> architects, developers, administrators etc. have our problems with them.
>>
>> Regards
>> Uwe
>>
>> Am 30.01.2020 um 17:51 schrieb Emanuel Oliveira:
>>> But enlight me please :) isnt GDPR just about cleaning from persistent
>>> storage ?
>>> In what sense does NiFi relates to GDPR compliance ?
>>>
>>>- in terms of data FF contents - they too transient (gone in 12hours /
>>>default).
>>>- I guess discussion is on the fact FF attributes are kept on the data
>>>provenance repo ? (gone in 24h / default)
>>>
>>> I wonder wheres the culprit here ? Is it in the situation hwere one wants
>>> to keep a long trace of data provenance like 6 months, but because
>>> attributes are stored on provenance events, then they must be deleted ?
>>> I guess it can only be a problem of deleting attributes from provenance
>>> repo and no FF contents right as they gone fast enough ?
>>>
>>> Best Regards,
>>> *Emanuel Oliveira*
>>>
>>>
>>>
>>> On Thu, Jan 30, 2020 at 4:42 PM Mike Thomsen wrote:
>>>>> It was created on this side of the Atlantic because when people do care
>>>>> about such things - they REALLY care.
>>>>
>>>> Agreed. I was just commenting on our particular experiences with customers
>>>> in the federal space. There are unfortunately many who still don't get all
>>>> of the accountability traceability advantages provenance and lineage
>>>> tracking provides.
>>>

Re: Parameters, Registry and sensitive values

2020-01-30 Thread Joe Witt
The initial import of a versioned flow and its associated parameter context
requires the sensitive values to be set.  This does, however, allow rather
simple configuration of a programmatically pushed flow: the flow is pushed
to an instance, then all params, sensitive or otherwise, are set, and the
flow is run.  Subsequent updates are easy as well.

There is no work in the Apache NiFi community that I am aware of to provide
a central secrets storage solution.
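
Since the sensitive values have to come from outside the Registry, a
deployment script needs its own source for them. A minimal Python sketch of
one option, reading them from environment variables, is below. The
NIFI_PARAM_ prefix, the helper names and the parameter names are hypothetical,
and the step that actually sets the values in NiFi is deliberately left out.

```python
import os


def env_var_for(param_name, prefix="NIFI_PARAM_"):
    """Map a parameter name to a hypothetical environment variable name."""
    return prefix + "".join(c if c.isalnum() else "_" for c in param_name).upper()


def collect_sensitive_values(param_names, environ=None):
    """Return (values, missing) for the given sensitive parameter names."""
    environ = os.environ if environ is None else environ
    values, missing = {}, []
    for name in param_names:
        value = environ.get(env_var_for(name))
        if value is None:
            missing.append(name)
        else:
            values[name] = value
    return values, missing


# Example run with an explicit mapping instead of the real environment.
values, missing = collect_sensitive_values(
    ["sensitiveParam", "db.password"],
    environ={"NIFI_PARAM_SENSITIVEPARAM": "s3cret"},
)
print(sorted(values), missing)
```

Reporting the missing names separately lets the script stop before the flow
is started with an unset sensitive parameter.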

Thanks

On Thu, Jan 30, 2020 at 4:34 PM Mark Bean  wrote:

> When storing a version controlled process group in the NiFi Registry, the
> relevant Parameter Context will get stored as well. Similarly, when a
> different NiFi instance instantiates that process group from the Registry,
> the instance creates the Parameter Context so it can be used by the
> process group.
>
> However, if there are parameters in the context with values marked as
> sensitive, then those values are 1) not stored in NiFi Registry and
> therefore 2) no value is available on any instance pulling the process
> group from the Registry.
>
> Is there work being done to bridge this gap? Are there any recommendations
> on how to supply the sensitive values?
>
> Thanks,
> Mark
>


Re: Provenance Repository and GDPR

2020-01-30 Thread u...@moosheimer.com
Lars

You're absolutely right about what you say.
If the data in the NiFi repositories is only stored temporarily for a
few hours, then documentation is quite sufficient.

The original question was how to delete data from the data lineage.
I had assumed the NiFi repository could be used as a full data lineage system.
If NiFi is your central application, you could then avoid having to
install Atlas as well. And with Atlas, you would have to install Ranger,
Cassandra or even Hadoop and HBase.

Joe has already made it clear to me here that Data Provenance/Data
Lineage of NiFi is not designed for this yet.
Maybe in the future...

Best
Uwe

On 30.01.2020 at 22:08, Lars Winderling wrote:
> Dear Uwe and fellow devs,
>
> sorry if I completely miss the point here, but I'll try. Also working with 
> NiFi under GDPR-regulations in online ad business. From my point it would be 
> sufficient to ensure that no new data will get stored, if a user requests 
> deletion, and delete all personal data from all respective systems. The NiFi 
> repos will expire their data, which can be argued to equal a delayed 
> deletion. Remember that GDPR is quite strict, but if you have a proper case 
> for this kind of process e.g. due to technical limitations, it needs to be 
> documented, and then it will likely be ok. We do it similarly, and our legal 
> counsel approved this. My response, however, is not legally binding. The 
> regulation says something like you should take appropriate measures. If such 
> a tool like NiFi just doesn't let you delete temporarily stored data 
> instantly, this may seem acceptable.
>
> Best,
> Lars
>

Re: Provenance Repository and GDPR

2020-01-30 Thread Joe Witt
Our data provenance is.  Just not our repository :)

On Thu, Jan 30, 2020 at 5:00 PM u...@moosheimer.com 
wrote:

> Lars
>
> You're absolutely right about what you say.
> If the data in the NiFi repositories is only stored temporarily for a
> few hours, then documentation is quite sufficient.
>
> The original question was how to delete data from the data lineage.
> I assumed to use the NiFi repository as a full Data Lineage System.
> If NiFi is your central application, then you could avoid having to
> install Atlas as well. And with Atlas, you would have to install Ranger,
> Cassandra or even Hadoop and HBase.
>
> Joe has already made it clear to me here that Data Provenance/Data
> Lineage of NiFi is not designed for this yet.
> Maybe in the future...
>
> Best
> Uwe
>
> Am 30.01.2020 um 22:08 schrieb Lars Winderling:
> > Dear Uwe and fellow devs,
> >
> > sorry if I completely miss the point here, but I'll try. Also working
> with NiFi under GDPR-regulations in online ad business. From my point it
> would be sufficient to ensure that no new data will get stored, if a user
> requests deletion, and delete all personal data from all respective
> systems. The NiFi repos will expire their data, which can be argued to
> equal a delayed deletion. Remember that GDPR is quite strict, but if you
> have a proper case for this kind of process e.g. due to technical
> limitations, it needs to be documented, and then it will likely be ok. We
> do it similarly, and our legal counsel approved this. My response, however,
> is not legally binding. The regulation says something like you should take
> appropriate measures. If such a tool like NiFi just doesn't let you delete
> temporarily stored data instantly, this may seem acceptable.
> >
> > Best,
> > Lars
> >

Re: Provenance Repository and GDPR

2020-01-30 Thread u...@moosheimer.com
Sorry :-)

Best regards
Kay-Uwe Moosheimer

> On 30.01.2020 at 23:08, Joe Witt wrote:
> 
> Our data provenance is.  Just not our repository :)
> 
>> On Thu, Jan 30, 2020 at 5:00 PM u...@moosheimer.com 
>> wrote:
>> 
>> Lars
>> 
>> You're absolutely right about what you say.
>> If the data in the NiFi repositories is only stored temporarily for a
>> few hours, then documentation is quite sufficient.
>> 
>> The original question was how to delete data from the data lineage.
>> I assumed to use the NiFi repository as a full Data Lineage System.
>> If NiFi is your central application, then you could avoid having to
>> install Atlas as well. And with Atlas, you would have to install Ranger,
>> Cassandra or even Hadoop and HBase.
>> 
>> Joe has already made it clear to me here that Data Provenance/Data
>> Lineage of NiFi is not designed for this yet.
>> Maybe in the future...
>> 
>> Best
>> Uwe
>> 
>>> Am 30.01.2020 um 22:08 schrieb Lars Winderling:
>>> Dear Uwe and fellow devs,
>>> 
>>> sorry if I completely miss the point here, but I'll try. Also working
>> with NiFi under GDPR-regulations in online ad business. From my point it
>> would be sufficient to ensure that no new data will get stored, if a user
>> requests deletion, and delete all personal data from all respective
>> systems. The NiFi repos will expire their data, which can be argued to
>> equal a delayed deletion. Remember that GDPR is quite strict, but if you
>> have a proper case for this kind of process e.g. due to technical
>> limitations, it needs to be documented, and then it will likely be ok. We
>> do it similarly, and our legal counsel approved this. My response, however,
>> is not legally binding. The regulation says something like you should take
>> appropriate measures. If such a tool like NiFi just doesn't let you delete
>> temporarily stored data instantly, this may seem acceptable.
>>> 
>>> Best,
>>> Lars
>>> 

Re: Parameters, Registry and sensitive values

2020-01-30 Thread Mark Bean
Joe,

You said "... and then all params, _sensitive or otherwise_ set". This is
not what I observed.

I version controlled a Process Group configured with a Parameter Context
containing one non-sensitive parameter value and one sensitive parameter
value. Then, I instantiated that version controlled Process Group on a
separate NiFi instance. Only the non-sensitive parameter value was
included. The sensitive parameter shows "No value set".

Further, when I look at what is stored in the Registry, I can confirm the
value for the sensitive parameter is not present. I looked down in the
flow_storage directory at the 2.snapshot file corresponding to the flow in
question. It has:

"parameterContexts" : {
  "sample PC" : {
    "name" : "sample PC",
    "parameters" : [ {
      "description" : "",
      "name" : "regularParam",
      "sensitive" : false,
      "value" : "test1"
    }, {
      "description" : "",
      "name" : "sensitiveParam",
      "sensitive" : true
    } ]
  }

Note that there is no "value" for "sensitiveParam"; there is only a "value"
for the non-sensitive parameter.
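
The same check can be scripted. A minimal Python sketch that lists every
parameter carrying no "value" key is below. It assumes the fragment above is
wrapped in an outer object and closed so that it parses on its own; the real
snapshot file contains more than this, and the function name is illustrative.

```python
import json

# The fragment from the message, wrapped in braces so it parses on its own.
snapshot_text = """
{
  "parameterContexts" : {
    "sample PC" : {
      "name" : "sample PC",
      "parameters" : [ {
        "description" : "",
        "name" : "regularParam",
        "sensitive" : false,
        "value" : "test1"
      }, {
        "description" : "",
        "name" : "sensitiveParam",
        "sensitive" : true
      } ]
    }
  }
}
"""


def params_without_value(snapshot):
    """Return (context, parameter) pairs that carry no "value" key."""
    return [(ctx_name, p["name"])
            for ctx_name, ctx in snapshot.get("parameterContexts", {}).items()
            for p in ctx.get("parameters", [])
            if "value" not in p]


print(params_without_value(json.loads(snapshot_text)))
# [('sample PC', 'sensitiveParam')]
```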

Both NiFi instances are version 1.10. NiFi Registry is version 0.5.0.

-Mark


On Thu, Jan 30, 2020 at 4:51 PM Joe Witt  wrote:

> The initial import of a versioned flow and associated parameter context
> requires setting of sensitive values.  This does however provide for rather
> simple configuration of a programmatically pushed flow to an instance and
> then all params, sensitive or otherwise set, and the flow run.  As well as
> easy subsequent updates.
>
>
>   There is no work in the apache nifi community I am aware of to provide a
> central secrets storage solution.
>
> Thanks
>
> On Thu, Jan 30, 2020 at 4:34 PM Mark Bean  wrote:
>
> > When storing a version controlled process group in the NiFi Registry, the
> > relevant Parameter Context will get stored as well. Similarly, when a
> > different NiFi instance instantiates that process group from the
> Registry,
> > the instance creates the Parameter Context so it can be used by the
> > process group.
> >
> > However, if there are parameters in the context with values marked as
> > sensitive, then those values are 1) not stored in NiFi Registry and
> > therefore 2) no value is available on any instance pulling the process
> > group from the Registry.
> >
> > Is there work being done to bridge this gap? Are there any
> recommendations
> > on how to supply the sensitive values?
> >
> > Thanks,
> > Mark
> >
>


Re: Parameters, Registry and sensitive values

2020-01-30 Thread Joe Witt
I agree that nothing is stored in the Registry for sensitive params.  I was
talking about NiFi itself.  This is consistent with the behavior we had
before parameter contexts existed.

On Thu, Jan 30, 2020 at 5:29 PM Mark Bean  wrote:

> Joe,
>
> You said "... and then all params, _sensitive or otherwise_ set". This is
> not what I observed.
>
> I version controlled a Process Group configured with a Parameter Context
> containing one non-sensitive parameter value and one sensitive property
> value. Then, I instantiated that version controlled Process Group on a
> separate NiFi instance. Only the non-sensitive parameter value was
> included. The sensitive parameter value says "No value set".
>
> Further, when I look at what is stored in the Registry, I can confirm the
> value for the sensitive parameter is not present. I looked down in the
> flow_storage directory at the 2.snapshot file corresponding to the flow in
> question. It has:
>
> "parameterContexts" : {
>   "sample PC" : {
>     "name" : "sample PC",
>     "parameters" : [ {
>       "description" : "",
>       "name" : "regularParam",
>       "sensitive" : false,
>       "value" : "test1"
>     }, {
>       "description" : "",
>       "name" : "sensitiveParam",
>       "sensitive" : true
>     } ]
>   }
>
> Note that there is no "value" for "sensitiveParam"; there is only a "value"
> for the non-sensitive parameter.
>
> Both NiFi instances are version 1.10. NiFi registry is version 0.5.0.
>
> -Mark
>
>
> On Thu, Jan 30, 2020 at 4:51 PM Joe Witt  wrote:
>
> > The initial import of a versioned flow and associated parameter context
> > requires setting of sensitive values.  This does however provide for
> rather
> > simple configuration of a programmatically pushed flow to an instance and
> > then all params, sensitive or otherwise set, and the flow run.  As well
> as
> > easy subsequent updates.
> >
> >
> >   There is no work in the apache nifi community I am aware of to provide
> a
> > central secrets storage solution.
> >
> > Thanks
> >
> > On Thu, Jan 30, 2020 at 4:34 PM Mark Bean  wrote:
> >
> > > When storing a version controlled process group in the NiFi Registry,
> the
> > > relevant Parameter Context will get stored as well. Similarly, when a
> > > different NiFi instance instantiates that process group from the
> > Registry,
> > > the instance creates the Parameter Context so it can be used by the
> > > process group.
> > >
> > > However, if there are parameters in the context with values marked as
> > > sensitive, then those values are 1) not stored in NiFi Registry and
> > > therefore 2) no value is available on any instance pulling the process
> > > group from the Registry.
> > >
> > > Is there work being done to bridge this gap? Are there any
> > recommendations
> > > on how to supply the sensitive values?
> > >
> > > Thanks,
> > > Mark
> > >
> >
>