Re: Put data to Elastic with static settings or index template

2018-05-24 Thread Bobby
Yep, I used an index template and it worked well. If my calculation is right,
the size decreased by about 12-18%, a huge gain!

Consider this issue solved.

Thank you, everyone.



Bobby


Re: Proposal: standard record metadata attributes for data sources

2018-05-24 Thread Otto Fowler
I commented on the PR, but I’ll add this to the thread here.

Wouldn’t something like this lend itself to a ReportingTask? If not the
current structure, then a similar structure for records?

That would allow the destination to do time series analysis, etc.
That is not to say there isn’t a case for having it in the flow as well.




Re: Proposal: standard record metadata attributes for data sources

2018-05-24 Thread Mike Thomsen
I wrote a processor that's inspired by one of the Groovy scripts we use at
that client. PR is here if anyone wants to take a look:

https://github.com/apache/nifi/pull/2737

It's called "RecordStats" and provides both a general record count
attribute and lets you specify record path operations to get stats on
individual field values as well. For example, if you have a field called
"department", you can do this:

department_count (prop name) => /department

as a dynamic property which will produce the following:

{
"record_count": "100",
"department": "75",
"department.Engineering": "25",
"department.Marketing": "10",
"department.Operations": "25",
"department.Finance": "15"
}
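
For anyone who wants to see the counting logic spelled out, here is a minimal
Python sketch that reproduces the output above. It is not the code from the PR:
the function name record_stats is hypothetical, records are assumed to be plain
dicts, the record path is assumed to be a single top-level field such as
/department, and the output keys follow the sample above (field name rather
than property name).

```python
from collections import Counter


def record_stats(records, field_paths):
    """Count records overall and per distinct value of each configured field.

    records     -- list of dicts (an assumption; NiFi uses its own Record type)
    field_paths -- mapping of property name -> simple path such as "/department"
                   (only single top-level fields are handled in this sketch)
    """
    stats = {"record_count": str(len(records))}
    for _prop_name, path in field_paths.items():
        field = path.lstrip("/")
        # Only records that actually carry a value for the field are counted.
        values = [r[field] for r in records if r.get(field) is not None]
        stats[field] = str(len(values))
        for value, count in Counter(values).items():
            stats[f"{field}.{value}"] = str(count)
    return stats


def sample_records():
    """Build 100 records, 75 of which have a department (mirrors the example)."""
    per_department = {"Engineering": 25, "Marketing": 10,
                      "Operations": 25, "Finance": 15}
    records = [{"department": name}
               for name, count in per_department.items()
               for _ in range(count)]
    records += [{"name": "no department"} for _ in range(25)]
    return records
```

Calling record_stats(sample_records(), {"department_count": "/department"})
gives the same six key/value pairs as the example, with all counts as strings
since flowfile attributes are strings.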

The scenario we have that led to this involves a lot of big queries and
full collection fetches from MongoDB, often as much as 80GB at a time, so
they'd rather see a little slowdown from examining those stats and be
able to get "accurate counts" than see things go lightning fast and not
have the insight into exactly what came out of those fetches.



On Tue, May 15, 2018 at 8:40 PM Koji Kawamura wrote:

> Hi Mike,
>
> I agree with the approach of enriching provenance events. In order to
> do so, we can use several places to embed meta-data:
>
> - FlowFile attributes: automatically mapped to a provenance event, but
> as Andy mentioned, we need to be careful not to put sensitive data there.
> - Transit URI: when I developed the NiFi Atlas integration, I used this as
> the primary source of what data a processor interacts with, e.g. remote
> address, database, table, etc.
> - The 'details' string: it might not be an ideal solution, but
> ProvenanceReporter accepts an additional 'details' string. We can embed
> whatever we want here.
>
> I'd map the meta-data you mentioned as follows:
> 1. Source system. => Transit URI
> 2. Database/table/index/collection/etc. => Transit URI or FlowFile
> attribute. I think it's fine to put these into attribute.
> 3. The lookup criteria that was used (similar to the "query attribute"
> some already have). => 'details' string
>
> What I learned from the Atlas integration is that it's really hard to design a
> complete standard set of attributes. I'd suggest using what the NiFi
> framework currently provides.
>
> Thanks,
>
> Koji
>
> On Tue, May 15, 2018 at 8:15 AM, Andy LoPresto wrote:
> > Maybe an ADDINFO event or FORK event could be used and a new flowfile
> > with the relevant attributes/content could be created. The flowfiles
> > would be linked, but the “sensitive” information wouldn’t travel with
> > the original.
> >
> > Andy LoPresto
> > alopre...@apache.org
> > alopresto.apa...@gmail.com
> > PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69
> >
> > On May 14, 2018, at 3:32 PM, Mike Thomsen wrote:
> >
> > Does the provenance system have the ability to add user-defined key/value
> > pairs to a flowfile's provenance record at a particular processor?
> >
> > On Mon, May 14, 2018 at 6:11 PM Andy LoPresto wrote:
> >
> > I would actually propose that this is added to the provenance but not
> > always put into the flowfile attributes. There are many scenarios in
> > which the data retrieval should be separated from the analysis/follow-on,
> > both for visibility, responsibility, and security concerns. While I
> > understand a separate UpdateAttribute processor could be put in the
> > downstream flow to remove these attributes, I would push for not adding
> > them by default as a more secure approach. Perhaps this could be
> > configurable on the Get* processor via a boolean property, but I think
> > doing it automatically by default introduces some serious concerns.
> >
> >
> > Andy LoPresto
> > alopre...@apache.org
> > alopresto.apa...@gmail.com
> > PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69
> >
> > On May 13, 2018, at 11:48 AM, Mike Thomsen wrote:
> >
> > @Joe @Matt
> >
> > This is kinda related to the point that Joe made in the graph DB thread
> > about provenance. My thought here was that we need some standards on
> > enriching the metadata about what was fetched so that no matter how you
> > store the provenance, you can find some way to query it for questions
> > like when a data set was loaded into NiFi, how many records went through
> > a terminating processor, etc. IMO this could help batch-oriented
> > organizations feel more at ease with something stream-oriented like NiFi.
> >
> > On Fri, Apr 13, 2018 at 4:01 PM Mike Thomsen wrote:
> >
> > I'd like to propose that all non-deprecated (or likely to be deprecated)
> > Get/Fetch/Query processors get a standard convention for attributes that
> > describe things like:
> >
> > 1. Source system.
> > 2. Database/table/index/collection/etc.
> > 3. The lookup criteria that was used (similar to the "query attribute"
> >