Re: RouteText / ReplaceText to remove first line of file

2017-03-06 Thread Mark Payne
Joe,

In terms of updating a processor to better handle this, we could update 
RouteText to do so. The idea is that if, a large percentage of the time (say 
more than 80% of the time), the FlowFile routed to one of the relationships is 
a single, contiguous subset of the data in the original FlowFile, then we could 
perform a two-pass algorithm. In the first pass, we check whether this is the 
case. If so, we can use session.clone(flowFile, offset, length) and we are done, 
without writing any content. If it is not the case, then we would have to 
perform a second pass over the data and write out the results as we do now.

It is certainly not a trivial update to the processor, but it is something that 
can be hidden from the user and, in cases like this, can provide significantly 
better performance. Something to think about.
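
For the specific "drop the first line" case in this thread, the same primitive 
can already be exercised from ExecuteScript today. A rough Groovy sketch follows; 
it is not the RouteText change described above, and it assumes the standard 
ExecuteScript bindings (session, REL_SUCCESS) and that the 
clone(flowFile, offset, size) overload is callable from the script:

import org.apache.nifi.processor.io.InputStreamCallback

def flowFile = session.get()
if (flowFile == null) return

// Pass 1: measure the first line (bytes up to and including the first newline)
// and remember its text so we can check it for the marker word.
long firstLineBytes = 0L
def firstLineBuf = new ByteArrayOutputStream()
session.read(flowFile, { rawIn ->
    def bis = new BufferedInputStream(rawIn)
    int b
    while ((b = bis.read()) != -1) {
        firstLineBytes++
        if (b == 10) break            // stop at '\n'
        firstLineBuf.write(b)
    }
} as InputStreamCallback)

def firstLine = new String(firstLineBuf.toByteArray(), 'UTF-8')

if (firstLine.contains('ERROR') && firstLineBytes < flowFile.getSize()) {
    // The remaining lines are one contiguous byte range of the original
    // content, so clone by offset/length; no content is rewritten.
    def rest = session.clone(flowFile, firstLineBytes, flowFile.getSize() - firstLineBytes)
    session.remove(flowFile)
    session.transfer(rest, REL_SUCCESS)
} else {
    // First line is clean (or the file has a single line); pass it through untouched.
    session.transfer(flowFile, REL_SUCCESS)
}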

-Mark

Sent from my iPhone

> On Mar 6, 2017, at 5:25 PM, Joe Witt  wrote:
> 
> Juan,
> 
> I think RouteText is the right answer.  It would indeed need to check
> all lines to determine whether the condition is satisfied, and to
> remove the first line it will need to write out all the remaining
> lines.  If a majority of the input files do not have this problematic
> header, I'd use RouteOnContent with a small buffer (however many bytes
> would be in the header line of an erroneous file) and check for the
> presence of "ERROR".  If it did hit, then I'd route it to RouteText to do
> the more expensive piece.  If it didn't hit, then you can move it on
> without paying the RouteText cost.
> 
> It is worth considering whether we'd want a processor, or an update to an
> existing one, to handle removal of lines from the beginning or end based
> on some conditional.  Not sure what that would look like, as the
> requirements can get pretty specific.  I do think that for such cases
> ExecuteScript processors generally offer an excellent tradeoff, in that
> one can build a very small, focused, and fast script to do precisely what
> they need.
> 
> Thanks
> Joe
> 
>> On Mon, Mar 6, 2017 at 4:50 PM, Lee Laim  wrote:
>> Juan,
>> 
>> If you're in a Linux environment, you can use the ExecuteStreamCommand
>> (ESC) processor to run "head -n 1" on the contents of the incoming large
>> flowfile to quickly extract the first line.  ESC has an option to put the
>> output of the command directly into a new attribute and pass the "original
>> contents" to the next processor.  The value of the new attribute contains
>> the first line, while the entire file remains in the flowfile contents.  You
>> can use the new attribute for a quick(er) routing decision.
>> 
>> Thanks,
>> Lee
>> 
>> 
>> 
>>> On Mon, Mar 6, 2017 at 1:46 PM, Juan Sequeiros  wrote:
>>> 
>>> Good afternoon all,
>>> 
>>> I am trying to remove the first line of a file if it contains a certain
>>> word: "ERROR".
>>> I know it will exist only in the first line (I cannot fix the reason why
>>> it gets put there).
>>> 
>>> These files are big and there are lots of them.
>>> 
>>> I cannot find a "fast" way to pop the first line of a file; everything
>>> I can think of within NiFi ends up running through the whole file.
>>> 
>>> I am using RouteText, as suggested at one time on a separate thread.
>>> 
>>> Routing Strategy: Route to "matched" if the line matches any condition.
>>> Matching Strategy: Satisfies Expression
>>> My expression: ${lineNo:lt(2):and(${line:find('ERROR')})}
>>> 
>>> I then route "matched" to auto-terminate and use "unmatched" as my "new"
>>> file without the first line.
>>> 
>>> This seems to be working, but it is slow since I believe it still runs
>>> through the whole file line by line.
>>> 
>>> Are there any other suggestions?  I've read the "ExecuteGroovy" solutions,
>>> but they seem excessive if all I want is to remove the first line of a file.
>>> 
>>> I've also looked at ReplaceText and thought that would give me a clean
>>> solution, since I thought I could control the input stream with the "Maximum
>>> Buffer Size" property, but that is a conditional setting: if "Evaluation Mode"
>>> is Line-by-Line, then (as I later learned) "Maximum Buffer Size" only applies
>>> to the buffer size of each line.
>>> Thanks
>> 
>> 


[ANNOUNCE] CVE-2017-5635 and CVE-2017-5636

2017-03-06 Thread Andy LoPresto
The Apache NiFi PMC would like to announce the discovery and resolution of 
CVE-2017-5635 and CVE-2017-5636. These issues have been resolved, and new 
versions of Apache NiFi have been released in accordance with the Apache 
Release Process.

Fixed in Apache NiFi 0.7.2 and 1.1.2

CVE-2017-5635: Apache NiFi Unauthorized Data Access In Cluster Environment

Severity: Important

Versions Affected:

Apache NiFi 0.7.0
Apache NiFi 0.7.1
Apache NiFi 1.1.0
Apache NiFi 1.1.1
Description: In a cluster environment, if an anonymous user request is 
replicated to another node, the originating node identity is used rather than 
the “anonymous” user.

Mitigation: A fix has been provided (removing the negative check for anonymous 
user before building the proxy chain and throwing an exception, and evaluating 
each user in the proxy chain iteration and comparing against a static constant 
anonymous user). This fix was applied in NIFI-3487 and released in Apache NiFi 
0.7.2 and 1.1.2. 1.x users running a clustered environment should upgrade to 
1.1.2. 0.x users running a clustered environment should upgrade to 0.7.2.

Credit: This issue was discovered by Leonardo Dias in conjunction with Matt 
Gilman.

CVE-2017-5636: Apache NiFi User Impersonation In Cluster Environment

Severity: Moderate

Versions Affected:

Apache NiFi 0.7.0
Apache NiFi 0.7.1
Apache NiFi 1.1.0
Apache NiFi 1.1.1
Description: In a cluster environment, the proxy chain 
serialization/deserialization is vulnerable to an injection attack where a 
carefully crafted username could impersonate another user and gain their 
permissions on a replicated request to another node.

Mitigation: A fix has been provided (modification of the tokenization code and 
sanitization of user-provided input). This fix was applied in NIFI-3487 and 
released in Apache NiFi 0.7.2 and 1.1.2. 1.x users running a clustered 
environment should upgrade to 1.1.2. 0.x users running a clustered environment 
should upgrade to 0.7.2.

Credit: This issue was discovered by Andy LoPresto.

Andy LoPresto
alopre...@apache.org
alopresto.apa...@gmail.com
PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69





Re: RouteText / ReplaceText to remove first line of file

2017-03-06 Thread Joe Witt
Juan,

I think RouteText is the right answer.  It would indeed need to check
all lines to determine whether the condition is satisfied, and to
remove the first line it will need to write out all the remaining
lines.  If a majority of the input files do not have this problematic
header, I'd use RouteOnContent with a small buffer (however many bytes
would be in the header line of an erroneous file) and check for the
presence of "ERROR".  If it did hit, then I'd route it to RouteText to do
the more expensive piece.  If it didn't hit, then you can move it on
without paying the RouteText cost.
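
To make that concrete, a RouteOnContent configuration along those lines might
look roughly like this (property names as I recall them, so worth double-checking
against the processor's usage docs; the buffer size and relationship name are
just examples):

  Match Requirement   = content must contain match
  Character Set       = UTF-8
  Content Buffer Size = 200 B       (roughly the largest expected header line)
  error.header        = ERROR       (dynamic property; creates an "error.header" relationship)

Files routed to "error.header" would then go through RouteText, while "unmatched"
files move on untouched.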

It is worth considering whether we'd want a processor, or an update to an
existing one, to handle removal of lines from the beginning or end based
on some conditional.  Not sure what that would look like, as the
requirements can get pretty specific.  I do think that for such cases
ExecuteScript processors generally offer an excellent tradeoff, in that
one can build a very small, focused, and fast script to do precisely what
they need.
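
For illustration, a small Groovy script for ExecuteScript along those lines
might look like the sketch below. It still streams through the whole file once,
rewriting the content minus the first line when that line contains "ERROR";
session and REL_SUCCESS are the standard ExecuteScript bindings:

import org.apache.nifi.processor.io.StreamCallback
import java.nio.charset.StandardCharsets

def flowFile = session.get()
if (flowFile == null) return

flowFile = session.write(flowFile, { inputStream, outputStream ->
    def reader = new BufferedReader(new InputStreamReader(inputStream, StandardCharsets.UTF_8))
    def writer = new BufferedWriter(new OutputStreamWriter(outputStream, StandardCharsets.UTF_8))
    String first = reader.readLine()
    // Keep the first line unless it carries the bad marker.
    if (first != null && !first.contains('ERROR')) {
        writer.write(first)
        writer.newLine()
    }
    String line
    while ((line = reader.readLine()) != null) {
        writer.write(line)
        writer.newLine()
    }
    writer.flush()
} as StreamCallback)

session.transfer(flowFile, REL_SUCCESS)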

Thanks
Joe

On Mon, Mar 6, 2017 at 4:50 PM, Lee Laim  wrote:
> Juan,
>
> If you're in a Linux environment, you can use the ExecuteStreamCommand
> (ESC) processor to run "head -n 1" on the contents of the incoming large
> flowfile to quickly extract the first line.  ESC has an option to put the
> output of the command directly into a new attribute and pass the "original
> contents" to the next processor.  The value of the new attribute contains
> the first line, while the entire file remains in the flowfile contents.  You
> can use the new attribute for a quick(er) routing decision.
>
> Thanks,
> Lee
>
>
>
> On Mon, Mar 6, 2017 at 1:46 PM, Juan Sequeiros  wrote:
>>
>> Good afternoon all,
>> 
>> I am trying to remove the first line of a file if it contains a certain
>> word: "ERROR".
>> I know it will exist only in the first line (I cannot fix the reason why
>> it gets put there).
>> 
>> These files are big and there are lots of them.
>> 
>> I cannot find a "fast" way to pop the first line of a file; everything
>> I can think of within NiFi ends up running through the whole file.
>> 
>> I am using RouteText, as suggested at one time on a separate thread.
>> 
>> Routing Strategy: Route to "matched" if the line matches any condition.
>> Matching Strategy: Satisfies Expression
>> My expression: ${lineNo:lt(2):and(${line:find('ERROR')})}
>> 
>> I then route "matched" to auto-terminate and use "unmatched" as my "new"
>> file without the first line.
>> 
>> This seems to be working, but it is slow since I believe it still runs
>> through the whole file line by line.
>> 
>> Are there any other suggestions?  I've read the "ExecuteGroovy" solutions,
>> but they seem excessive if all I want is to remove the first line of a file.
>> 
>> I've also looked at ReplaceText and thought that would give me a clean
>> solution, since I thought I could control the input stream with the "Maximum
>> Buffer Size" property, but that is a conditional setting: if "Evaluation Mode"
>> is Line-by-Line, then (as I later learned) "Maximum Buffer Size" only applies
>> to the buffer size of each line.
>>
>> Thanks
>
>


Re: RouteText / ReplaceText to remove first line of file

2017-03-06 Thread Lee Laim
Juan,

If you're in a Linux environment, you can use the ExecuteStreamCommand
(ESC) processor to run "head -n 1" on the contents of the incoming large
flowfile to quickly extract the first line.  ESC has an option to put the
output of the command directly into a new attribute and pass the "original
contents" to the next processor.  The value of the new attribute contains
the first line, while the entire file remains in the flowfile contents.
You can use the new attribute for a quick(er) routing decision.
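
A sketch of that ExecuteStreamCommand configuration (property names from memory,
so verify against the processor documentation; the attribute name is just an
example):

  Command Path                 = /usr/bin/head
  Command Arguments            = -n;1
  Argument Delimiter           = ;
  Ignore STDIN                 = false        (the flowfile content is piped to the command)
  Output Destination Attribute = first.line   (output lands here instead of replacing the content)

The untouched flowfile is passed along with the new first.line attribute, and a
RouteOnAttribute using ${first.line:contains('ERROR')} can then make the routing
decision cheaply.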

Thanks,
Lee



On Mon, Mar 6, 2017 at 1:46 PM, Juan Sequeiros  wrote:

> Good afternoon all,
>
> I am trying to remove the first line of a file if it contains a certain
> word: "ERROR".
> I know it will exist only in the first line (I cannot fix the reason why
> it gets put there).
>
> These files are big and there are lots of them.
>
> I cannot find a "fast" way to pop the first line of a file; everything
> I can think of within NiFi ends up running through the whole file.
>
> I am using RouteText, as suggested at one time on a separate thread.
>
> Routing Strategy: Route to "matched" if the line matches any condition.
> Matching Strategy: Satisfies Expression
> My expression: ${lineNo:lt(2):and(${line:find('ERROR')})}
>
> I then route "matched" to auto-terminate and use "unmatched" as my "new"
> file without the first line.
>
> This seems to be working, but it is slow since I believe it still runs
> through the whole file line by line.
>
> Are there any other suggestions?  I've read the "ExecuteGroovy" solutions,
> but they seem excessive if all I want is to remove the first line of a file.
>
> I've also looked at ReplaceText and thought that would give me a clean
> solution, since I thought I could control the input stream with the "Maximum
> Buffer Size" property, but that is a conditional setting: if "Evaluation Mode"
> is Line-by-Line, then (as I later learned) "Maximum Buffer Size" only applies
> to the buffer size of each line.
>
> Thanks
>


RouteText / ReplaceText to remove first line of file

2017-03-06 Thread Juan Sequeiros
Good afternoon all,

I am trying to remove the first line of a file if it contains a certain
word: "ERROR".
I know it will exist only in the first line (I cannot fix the reason why
it gets put there).

These files are big and there are lots of them.

I cannot find a "fast" way to pop the first line of a file; everything
I can think of within NiFi ends up running through the whole file.

I am using RouteText, as suggested at one time on a separate thread.

Routing Strategy: Route to "matched" if the line matches any condition.
Matching Strategy: Satisfies Expression
My expression: ${lineNo:lt(2):and(${line:find('ERROR')})}

I then route "matched" to auto-terminate and use "unmatched" as my "new"
file without the first line.

This seems to be working, but it is slow since I believe it still runs
through the whole file line by line.

Are there any other suggestions?  I've read the "ExecuteGroovy" solutions,
but they seem excessive if all I want is to remove the first line of a file.

I've also looked at ReplaceText and thought that would give me a clean
solution, since I thought I could control the input stream with the "Maximum
Buffer Size" property, but that is a conditional setting: if "Evaluation Mode"
is Line-by-Line, then (as I later learned) "Maximum Buffer Size" only applies
to the buffer size of each line.

Thanks


Re: Merge join processor for nifi

2017-03-06 Thread Joe Witt
Purbon

The MergeContent processor supports merging data using a correlation
key.  The processor does not support SQL-style inner/outer joins, as it
is not joining data based on per-record/tuple logic but rather at
a per-flowfile level.  We can correlate based on any flowfile
attribute or combination of attributes, though.
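
A sketch of that kind of correlation-based merge (property names from memory;
my.join.key is a hypothetical attribute carrying the shared key):

  Merge Strategy             = Bin-Packing Algorithm
  Correlation Attribute Name = my.join.key
  Minimum Number of Entries  = 2
  Max Bin Age                = 5 min          (so incomplete groups are not held forever)

FlowFiles with the same my.join.key value are binned together and merged into a
single FlowFile; note this is a content-level concatenation, not a record-level
join.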

Thanks
Joe

On Mon, Mar 6, 2017 at 12:56 PM, Pere Urbón Bayes  wrote:
> Hi,
>   my name is Pere and I am kinda new to NiFi, working now on integrating
> this awesome project as a data pipeline tool for a project I am working on.
> In this project I aim to get a collection of FlowFiles that share some kind
> of unique key, so there might be a join operation involved.
>
> I am wondering, since I have not been able to find it myself so far, what
> the current recommendations from the community are for handling this kind
> of situation.
>
> I see from tools like Pentaho something like
> http://wiki.pentaho.com/display/EAI/Merge+Join .
>
> /cheers
>
> - purbon


Merge join processor for nifi

2017-03-06 Thread Pere Urbón Bayes
Hi,
  my name is Pere and I am kinda new to NiFi, working now on integrating
this awesome project as a data pipeline tool for a project I am working on.
In this project I aim to get a collection of FlowFiles that share some kind
of unique key, so there might be a join operation involved.

I am wondering, since I have not been able to find it myself so far, what
the current recommendations from the community are for handling this kind
of situation.

I see from tools like Pentaho something like
http://wiki.pentaho.com/display/EAI/Merge+Join .

/cheers

- purbon


Re: Visual Indicator for "Can't run because there are no threads"?

2017-03-06 Thread Andy LoPresto
I created NIFI-3558 [1] to capture these scenarios. I added this specific 
example, but if anyone has more, please contribute them on the ticket.

[1] https://issues.apache.org/jira/browse/NIFI-3558

Andy LoPresto
alopre...@apache.org
alopresto.apa...@gmail.com
PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69

> On Mar 6, 2017, at 10:52 AM, Andy LoPresto  wrote:
> 
> Peter,
> 
> There have been intermittent discussions around a “system 
> status/configuration traffic light tool” which would be a visual indicator in 
> the UI that addresses common problems that are easily attributed to a 
> specific configuration value or environment scenario not matching best 
> practices. It would aggregate the collective institutional knowledge of the 
> mailing lists when we’ve encountered the same problem multiple times and try 
> to provide that diagnosis and recommended solutions to the user at a much 
> earlier stage, rather than relying on these conversations. This sounds like 
> another great piece of information to collect and display there.
> 
> There is a vague reference to this “better tooling” in [1] but I can’t find 
> an explicit ticket for it right now. I’ll open one and we can start listing 
> the desired functionality for the first pass.
> 
> [1] https://issues.apache.org/jira/browse/NIFI-3496 
> 
> 
> 
> Andy LoPresto
> alopre...@apache.org 
> alopresto.apa...@gmail.com 
> PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69
> 
>> On Mar 6, 2017, at 10:18 AM, Peter Wicks (pwicks) wrote:
>> 
>> Joe,
>> 
>> In my case I had not seen the issue until I added 7 new
>> QueryDatabaseProcessors. All seven of them kicked off against the same SQL
>> database on restart and took 10 to 15 minutes to come back.  During that
>> time my default 10 threads were running with only 3 to spare, which were
>> being shared across a lot of other jobs.  I bumped it up considerably and
>> have not had issues since then.
>> 
>> --Peter
>> 
>> -Original Message-
>> From: Joe Witt [mailto:joe.w...@gmail.com ]
>> Sent: Friday, March 03, 2017 3:02 PM
>> To: users@nifi.apache.org 
>> Subject: Re: Visual Indicator for "Can't run because there are no threads"?
>> 
>> Peter,
>> 
>> That is a good idea and I don't believe there are any existing JIRAs to do
>> so.  But the idea makes a lot of sense.  Being so thread-starved that
>> processors do not get to run for extended periods of time is pretty unique.
>> It makes me think that the flow has processors which are not honoring the
>> model but are rather acting more like greedy thread daemons.  That should
>> also be considered.  But even with that said, I could certainly see how it
>> would be helpful to know that a processor is running less often than it
>> would like due to lack of available threads rather than just backpressure.
>> 
>> Thanks
>> Joe
>> 
>> On Fri, Mar 3, 2017 at 4:57 PM, Peter Wicks (pwicks) wrote:
>>> I think everyone was really happy when backpressure finally got super
>>> great indicators.  Backpressure used to be my #1, “Why isn’t stuff moving?”
>>> problem.  My latest issue is there are no free threads, sometimes for
>>> hours, and I don’t notice and start wondering what’s going on.
>>> 
>>> 
>>> 
>>> Is there anything under consideration for an indicator to show how
>>> many processors can’t run because there aren’t enough threads
>>> available? I can create a ticket, wasn’t sure if there was one floating 
>>> around.
> 





Re: Visual Indicator for "Can't run because there are no threads"?

2017-03-06 Thread Andy LoPresto
Peter,

There have been intermittent discussions around a “system status/configuration 
traffic light tool” which would be a visual indicator in the UI that addresses 
common problems that are easily attributed to a specific configuration value or 
environment scenario not matching best practices. It would aggregate the 
collective institutional knowledge of the mailing lists when we’ve encountered 
the same problem multiple times and try to provide that diagnosis and 
recommended solutions to the user at a much earlier stage, rather than relying 
on these conversations. This sounds like another great piece of information to 
collect and display there.

There is a vague reference to this “better tooling” in [1] but I can’t find an 
explicit ticket for it right now. I’ll open one and we can start listing the 
desired functionality for the first pass.

[1] https://issues.apache.org/jira/browse/NIFI-3496 



Andy LoPresto
alopre...@apache.org
alopresto.apa...@gmail.com
PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69

> On Mar 6, 2017, at 10:18 AM, Peter Wicks (pwicks)  wrote:
> 
> Joe,
> 
> In my case I had not seen the issue until I added 7 new
> QueryDatabaseProcessors. All seven of them kicked off against the same SQL
> database on restart and took 10 to 15 minutes to come back.  During that time
> my default 10 threads were running with only 3 to spare, which were being
> shared across a lot of other jobs.  I bumped it up considerably and have not
> had issues since then.
> 
> --Peter
> 
> -Original Message-
> From: Joe Witt [mailto:joe.w...@gmail.com]
> Sent: Friday, March 03, 2017 3:02 PM
> To: users@nifi.apache.org
> Subject: Re: Visual Indicator for "Can't run because there are no threads"?
> 
> Peter,
> 
> That is a good idea and I don't believe there are any existing JIRAs to do so.
> But the idea makes a lot of sense.  Being so thread-starved that processors
> do not get to run for extended periods of time is pretty unique.  It makes me
> think that the flow has processors which are not honoring the model but are
> rather acting more like greedy thread daemons.  That should also be
> considered.  But even with that said, I could certainly see how it would be
> helpful to know that a processor is running less often than it would like due
> to lack of available threads rather than just backpressure.
> 
> Thanks
> Joe
> 
> On Fri, Mar 3, 2017 at 4:57 PM, Peter Wicks (pwicks)  
> wrote:
>> I think everyone was really happy when backpressure finally got super
>> great indicators.  Backpressure used to be my #1, “Why isn’t stuff moving?”
>> problem.  My latest issue is there are no free threads, sometimes for
>> hours, and I don’t notice and start wondering what’s going on.
>> 
>> 
>> 
>> Is there anything under consideration for an indicator to show how
>> many processors can’t run because there aren’t enough threads
>> available? I can create a ticket, wasn’t sure if there was one floating 
>> around.





RE: Visual Indicator for "Can't run because there are no threads"?

2017-03-06 Thread Peter Wicks (pwicks)
Joe,

In my case I had not seen the issue until I added 7 new
QueryDatabaseProcessors. All seven of them kicked off against the same SQL
database on restart and took 10 to 15 minutes to come back.  During that time
my default 10 threads were running with only 3 to spare, which were being shared
across a lot of other jobs.  I bumped it up considerably and have not had
issues since then.

--Peter

-Original Message-
From: Joe Witt [mailto:joe.w...@gmail.com] 
Sent: Friday, March 03, 2017 3:02 PM
To: users@nifi.apache.org
Subject: Re: Visual Indicator for "Can't run because there are no threads"?

Peter,

That is a good idea and I don't believe there are any existing JIRAs to do so.
But the idea makes a lot of sense.  Being so thread-starved that processors do
not get to run for extended periods of time is pretty unique.  It makes me think
that the flow has processors which are not honoring the model but are rather
acting more like greedy thread daemons.  That should also be considered.  But
even with that said, I could certainly see how it would be helpful to know that
a processor is running less often than it would like due to lack of available
threads rather than just backpressure.

Thanks
Joe

On Fri, Mar 3, 2017 at 4:57 PM, Peter Wicks (pwicks)  wrote:
> I think everyone was really happy when backpressure finally got super 
> great indicators.  Backpressure used to be my #1, “Why isn’t stuff moving?”
> problem.  My latest issue is there are no free threads, sometimes for 
> hours, and I don’t notice and start wondering what’s going on.
>
>
>
> Is there anything under consideration for an indicator to show how 
> many processors can’t run because there aren’t enough threads 
> available? I can create a ticket, wasn’t sure if there was one floating 
> around.


NiFi 1.1.1 AWS EC2 Secure Cluster Zookeeper Connection Loss Error

2017-03-06 Thread Ryan H
Hi All,

I am running into another issue setting up a secure NiFi cluster across 2
EC2 instances in AWS. Shortly after starting up the two nodes, the
nifi-app.log is completely spammed with the following error message(s):

2017-03-06 13:48:06,029 ERROR [Curator-Framework-0]
o.a.c.f.imps.CuratorFrameworkImpl Background operation retry gave up
org.apache.zookeeper.KeeperException$ConnectionLossException:
KeeperErrorCode = ConnectionLoss
at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
~[zookeeper-3.4.6.jar:3.4.6-1569965]
at
org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:728)
[curator-framework-2.11.0.jar:na]
at
org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:857)
[curator-framework-2.11.0.jar:na]
at
org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:809)
[curator-framework-2.11.0.jar:na]
at
org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:64)
[curator-framework-2.11.0.jar:na]
at
org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:267)
[curator-framework-2.11.0.jar:na]
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[na:1.8.0_121]
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
[na:1.8.0_121]
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
[na:1.8.0_121]
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
[na:1.8.0_121]
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
[na:1.8.0_121]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_121]
2017-03-06 13:48:06,029 ERROR [Curator-Framework-0]
o.a.c.f.imps.CuratorFrameworkImpl Background retry gave up
org.apache.curator.CuratorConnectionLossException: KeeperErrorCode =
ConnectionLoss
at
org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:838)
[curator-framework-2.11.0.jar:na]
at
org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:809)
[curator-framework-2.11.0.jar:na]
at
org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:64)
[curator-framework-2.11.0.jar:na]
at
org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:267)
[curator-framework-2.11.0.jar:na]
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[na:1.8.0_121]
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
[na:1.8.0_121]
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
[na:1.8.0_121]
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
[na:1.8.0_121]
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
[na:1.8.0_121]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_121]


The error looks very similar to the following post:
http://apache-nifi-developer-list.39713.n7.nabble.com/Zookeeper-error-td13915.html

However, the solution provided there (clearing state and flow from the NiFi
nodes) did not resolve the issue. As an aside and FWIW, when stopping the
nodes using ./bin/nifi.sh stop I get the following shut-down message:

ERROR [main] org.apache.nifi.bootstrap.Command Failed to send shutdown
command to port 32993 due to java.net.SocketTimeoutException: Read timed
out. Will kill the NiFi Process with PID 5984.

This issue looks extremely close to the following Bug as documented on
Apache:
https://issues.apache.org/jira/browse/CURATOR-209

Here is what I have previously successfully done during my development
efforts:

   1. Setup single standalone Unsecured NiFi.
   2. Setup multiple nodes Unsecured (clustered) on single EC2 instance.
   3. Setup multiple nodes Unsecured across multiple EC2 instances.
   4. Setup single standalone Secured NiFi.
   5. Setup multiple nodes Secured (clustered) on single EC2 instance.

Below are the relevant config files for my 2 nodes. Any help is greatly
appreciated!


Cheers,

Ryan H.


-

EC2 Instance 1

-

nifi.properties

nifi.state.management.embedded.zookeeper.start=true

# Site to Site properties

nifi.remote.input.host=my-host-name-1
nifi.remote.input.secure=true
nifi.remote.input.socket.port=10443
nifi.remote.input.http.enabled=true
nifi.remote.input.http.transaction.ttl=30 sec

# web properties
#

nifi.web.war.directory=./lib
nifi.web.http.host=
nifi.web.http.port=
nifi.web.https.host=my-host-name-1
nifi.web.

Re: How to specify priority attributes for seperate flowfiles?

2017-03-06 Thread prabhu Mahendran
Bryan & Andre,

The PriorityAttributePrioritizer / FIFO strategy worked when the flow had no
loop processing.

But I have configured loop processing in my workflow using this reference:

https://gist.github.com/ijokarumawak/01c4fd2d9291d3e74ec424a581659ca8#file-loop_sample-xml

For example, if I process two files using the GetFile processor, the content of
those flow files gets shuffled after the loop.

That's why I can't prioritize my flow files.

Now I need to combine flow files if they have the same file name.

Use case: consider two files named file1 and file2. I pick up those two files,
process them with some changes to their contents, then loop the flow to apply
the same modifications to all files. Afterwards the queue has some shuffled
flowfiles.

I am trying to combine flow files based on the file name: check the filename
and then combine all flowfiles having the same name into one flowfile.

Can you suggest any way to meet this requirement?

Thanks in Advance,

On Fri, Mar 3, 2017 at 8:08 PM, Bryan Bende  wrote:

> What Andre described is what I had in mind as well...
>
> One thing to keep in mind is that I think you can only guarantee the
> ordering if all the files you want to process are picked up in one
> execution of GetFile.
>
> For example, imagine there are 100 files in the directory, and
> GetFile's Batch Size is set to 10 (the default). The first time
> GetFile executes it is going to get 10 out of the 100 flow files, and
> then using Andre's example with the epoch as the priority, you can get
> those 10 flow files processed in order.
>
> If you were trying to get total order across all 100 files, you would
> either need the batch size to be greater than the total number of
> files, or you would need some kind of custom processor that waited for
> N flow files, and then if the queue before that processor used the
> PriorityAttributePrioritizer, then you would be waiting until all 100
> flow files were in the queue in priority order before letting any of
> them process.
>
>
>
>
> On Fri, Mar 3, 2017 at 2:59 AM, Andre  wrote:
> > Prabhu,
> >
> > I suspect you need to rethink your use of concurrency in your workflow.
> > I'll give you an example:
> >
> > You spoke about 10 concurrent GetFile threads reading a repository and
> > their consequent ordering:
> >
> > Suppose you have 2 threads consuming:
> >
> > file1 - 10 MB
> > file2 - 20 MB
> > file3 - 50 MB
> > file4 - 10 MB
> > file5 - 10 MB
> > file6 - 10 MB
> >
> > All things being equal, consider that each of the 2 threads consumes and
> > dispatches the files at the same speed. How can you guarantee that thread 1
> > will consume file5 (i.e. as in t1-f1, t2-f2, t1-f3, t2-f4, t1-f5, t2-f6)?
> >
> > Or as Brandon DeVries clearly put it a long while ago [1]:
> >
> > "Just because a FlowFile begins processing first doesn't mean it will
> > complete first (assuming the processor has multiple concurrent tasks)"
> >
> > Brandon goes further and provides some suggestions that may help you bin
> > your flowfiles and records together, but in any case...
> >
> >
> > Assuming the filename is based on a date (e.g. file_2017-03-03T010101.csv),
> > have you considered using UpdateAttribute to parse the filename into a date,
> > and that date into epoch (which happens to be an increasing number), as a
> > first-level index / prioritizer?
> >
> > This way you could have:
> >
> > GetFile (single thread) -- Connector with FIFO --> UpdateAttribute (adding
> > epoch from filename date) -- Connector with PriorityAttributePrioritizer -->
> > rest of your flow
> >
> >
> > Once again, assuming the file name is file_2017-03-03T010101.csv, the
> > expression language would be something like:
> >
> > ${filename:toDate("'file_'yyyy-MM-dd'T'HHmmss'.csv'", "UTC"):toNumber()}
> >
> >
> > Would that help?
> >
> >
> > [1]
> > https://lists.apache.org/thread.html/203ddc0423ac7f877817ad5e2b389f079c2a27d8d4b4ef998ad91a32@1449844053@%3Cdev.nifi.apache.org%3E
> >
> >
> > On 3 Mar 2017 5:27 PM, "prabhu Mahendran" wrote:
> >>
> >> This task (NIFI-470) suits some of the workflow. If I set the concurrent
> >> tasks to 10, records run in parallel, so each file gets shuffled, as I
> >> can see in the List Queue.
> >>
> >> If we get the order of files from GetFile, how can I ensure the data from
> >> each file is properly moved to the destination (consider SQL) in the same
> >> order with respect to the concurrent tasks as well?
> >>
> >> I need a flow like this: consider file1 has 10 records and it should be
> >> prioritized from the value 1 to 10, then the next file2 records should
> >> start with the priority value 11 and so on. Filenames can be in date order
> >> from the GetFile processor. Here I can ensure each ordered file is moved
> >> in the same order into SQL.
> >>
> >> Will this be achieved in the ticket, or is there any suggestion for this?
> >>
> >>
> >> On Fri, Mar 3, 2017 at 11:37 AM, Andre  wrote:
> >>>
> >>> Hi,
> >>>
> >>> There's an existing JIRA ticket(NIFI-47