And a final update…
It turns out that ExecuteSql doesn’t distribute processing across the nodes in the cluster – not surprising really when I think about it – so the refinement to the approach below was just to have the text file on one node, so that the flow effectively only runs on that node. Of course I could have set the ExecuteSql processor to run on the Primary Node only, but I fear that node would become overloaded, as I have a bunch of other stuff that runs there exclusively.

I don’t know if it’s on the roadmap, but it would be useful to be able to pin a processor to a specific node in the cluster other than just the primary, so that manual load balancing could be under the user’s control. I think this has been discussed before though.

Conrad

On 05/08/2016, 15:35, "Conrad Crampton" <conrad.cramp...@secdata.com> wrote:

    Thanks for the input on this.
    As a follow-up this is what I have done…
    
    Created a text file which contains a single value – the lowest ID of the data I am retrieving – and saved it on each node of the cluster.
    Then:
    GetFile (to read the file)
    ExtractText (to get the value and set it as an attribute)
    UpdateAttribute (to create another attribute which is the upper bound of the ID to use in the SQL, i.e. lower <= ID < upper)
    ExecuteSql (using the two attributes above as parameters – see the sketch below) – using a DBCPConnectionPool for the connection
    Branch1
        MergeContent
        PutHDFS
    Branch2
        ReplaceText (replace flowfile content with the upper attribute from above)
        PutFile (same filename as came in)
    
    Rinse…Repeat
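    
    As a rough sketch (the table and attribute names here are illustrative, not the actual ones I used), the ExecuteSql query ends up looking something like:
    
        -- lower.id is read from the text file via ExtractText;
        -- upper.id is derived from it in UpdateAttribute,
        -- e.g. upper.id = ${lower.id:plus(100000)}
        SELECT *
        FROM log_table
        WHERE id >= ${lower.id}
          AND id < ${upper.id}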
    
    Appears to be working OK. The issue is that if ExecuteSql fails then, as I have it now, the original file has already been deleted – probably better to move it instead of deleting it, so there is always a backup of the last ID used.
    
    Conrad
    
    
    On 03/08/2016, 15:34, "Yohann Lepage" <yoh...@lepage.info> wrote:
    
        Hi,
        
        I have exactly the same use case: periodically getting rows from some
        security appliances with just read-only access.
        
        Currently (without NiFi) we use an SQL query to track the maximum
        value. Depending on the DB/appliance/vendor, it could be a simple
        "SELECT getdate()" or "select max(SW_TIME) from log_table", or a more
        complex query with an INNER JOIN.
        
        So it would be great to have an option to customize the tracking query.
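        
        As a sketch of the kind of tracking query I mean (table and column
        names invented for illustration):
        
            -- track the latest event time, but only for rows from one
            -- appliance, which requires a join to a lookup table
            SELECT MAX(l.SW_TIME)
            FROM log_table l
            INNER JOIN appliance a ON a.id = l.appliance_id
            WHERE a.name = 'fw-01';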
        
        Regards
        
        2016-08-03 16:02 GMT+02:00 Conrad Crampton <conrad.cramp...@secdata.com>:
        > Hi,
        > Thanks for this.
        > I did think about an MV, but unfortunately I don’t have access to create views – just read access. That would have been my simplest option ;-) Life’s never that easy though, is it?
        > The only part of the SQL I need to be dynamic is the date parameter (I could even use the ID column). Instead of using the MapCache (if that isn’t a good idea), could I use GetFile to pull a single text file containing the parameter (i.e. the last run date or max ID value from the last run), which creates a flowfile; read that value; and pass it into ExecuteSql as the parameter in the SQL, since the select query can be dynamically constructed from attributes/flowfile content (as per the docs)? And then finally write the text file back to the file system to be picked up next time?
        > Thanks
        > Conrad
        >
        > On 03/08/2016, 14:02, "Matt Burgess" <mattyb...@gmail.com> wrote:
        >
        >     Conrad,
        >
        >     Is it possible to add a view (materialized or not) to the RDBMS? That
        >     view could take care of the denormalization, and then
        >     QueryDatabaseTable could point at the view. The DB would take care of
        >     the push-down filters, which functionally is like having a
        >     QueryDatabaseTable for each table and then doing the joins.
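        >
        >     For illustration (the table and column names are made up), such a
        >     view might look like:
        >
        >         CREATE VIEW business_view AS
        >         SELECT l.id, l.sw_time, a.name AS appliance, e.severity
        >         FROM log_table l
        >         JOIN appliance a ON a.id = l.appliance_id
        >         JOIN event_type e ON e.id = l.event_type_id;
        >
        >     QueryDatabaseTable could then treat business_view like any other
        >     table, with id or sw_time as its maximum-value column.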
        >
        >     In NiFi 1.0 there is a GenerateTableFetch processor which is like
        >     QueryDatabaseTable except it generates SQL instead of executing SQL.
        >     That might be used in your option a) above, but you'd have to
        >     reconcile the SQL statements into a JOIN. A possible improvement to
        >     either/both processors would be to add attributes for the
        >     maximum-value columns whose values are the maximum observed values.
        >     Then you wouldn't have to parse or manipulate the SQL if you really
        >     just want the max values.
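        >
        >     As a rough sketch of the kind of statement GenerateTableFetch emits
        >     (the exact form depends on the database adapter; this example is
        >     purely illustrative):
        >
        >         SELECT * FROM business_view
        >         WHERE id > 12345     -- last observed maximum value
        >         ORDER BY id
        >         LIMIT 10000          -- one page of rows per statement
        >
        >     ExecuteSQL can then run each generated statement downstream.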
        >
        >     I have been thinking about how QueryDatabaseTable and
        >     GenerateTableFetch would work if they accepted incoming flow files
        >     (to allow dynamic table names, for example). It's a bit more tricky
        >     because those processors run without input to get max values, so
        >     their behavior would change when a flow file is present but would
        >     return to the original behavior if no flow file is present. Since
        >     the ListDatabaseTables processor is also in 1.0, it would be nice to
        >     use that as input to the other two processors.
        >
        >     I'm definitely interested in any thoughts or discussion around these
        >     things :)
        >
        >     Regards,
        >     Matt
        >
        >     On Wed, Aug 3, 2016 at 8:37 AM, Conrad Crampton
        >     <conrad.cramp...@secdata.com> wrote:
        >     > Hi,
        >     >
        >     > My use case is that I want to ship a load of rows from an RDBMS
        >     > periodically and put them in HDFS as Avro.
        >     >
        >     > The QueryDatabaseTable processor has functionality that would be
        >     > great, i.e. the maximum-value column (there are a couple of columns
        >     > I could use for this in the data), and it is this functionality I
        >     > am looking for. BUT the data is not from one single table: the
        >     > nature of the RDBMS is that the business view on the data requires
        >     > a bunch of joins across other tables/schemas to get the correct
        >     > Avro file, so the options I appear to have are
        >     >
        >     > a)       Use QueryDatabaseTable for each table that makes up the
        >     > business view and do the joins etc. in HDFS (Spark or something) –
        >     > or potentially do the reconciliation within NiFi???
        >     >
        >     > b)       Use ExecuteSQL to run the complete SQL to get the rows,
        >     > which can easily be put into HDFS as Avro given that each row will
        >     > be the business (denormalised) data that is required.
        >     >
        >     > The problem with a) is the reconciliation (denormalisation) of the
        >     > data, and the problem with b) is how to maintain the max-column
        >     > value so I only get the data since the last run.
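        >     >
        >     > For b), the kind of query I mean (names simplified for
        >     > illustration) would be something like:
        >     >
        >     >     SELECT l.id, l.event_time, c.customer_name, d.device_name
        >     >     FROM logs l
        >     >     JOIN customers c ON c.id = l.customer_id
        >     >     JOIN devices d ON d.id = l.device_id
        >     >     WHERE l.event_time > ?    -- last run date, tracked somehow
        >     >
        >     > i.e. one denormalised row per log entry, ready to land in HDFS as
        >     > Avro.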
        >     >
        >     >
        >     >
        >     > In order to address b), can I use the DistributedMapCacheServer &
        >     > Client to hold a key/value pair of the last run date, and extract
        >     > from this date as a parameter?
        >     >
        >     >
        >     >
        >     > Thanks for any suggestions.
        >     >
        >     >
        >     >
        >     > Conrad
        >     >
        
        
        
        -- 
        Yohann Lepage
        
    
    
