Re: Re: Re: Multiple dataflow jobs management (lots of jobs)

2016-03-13 Thread
Hi Thad,

Thank you very much for your advice. Kettle can do the job for sure, but the
metadata I was talking about is the metadata of the job descriptions used by
Kettle itself. The only option left for Kettle is multiple instances, but that
also means we would need to develop a master application to gather the metadata
from all the instances. Moreover, Kettle does not have a web-based GUI for
designing and testing jobs, which is why we want NiFi; but again, multiple
instances of NiFi lead to an HA problem for the master node, so we turned to
Ambari Metrics for that issue.

Talend has a cloud server that does something similar, but it runs on a public
cloud, which our client will not accept.

Kettle is a great ETL tool, but a web-based designer is really the key point
for the future.

Thank you very much,

Yan Liu

Hortonworks Service Division 

Richinfo, Shenzhen, China (PR)

14/03/2016

Original message
From: Thad Guidry <thadgui...@gmail.com>
To: users <us...@nifi.apache.org>
Cc: dev <dev@nifi.apache.org>
Date: 2016-03-13 23:04:39
Subject: Re: Re: Multiple dataflow jobs management (lots of jobs)
Yan,



Pentaho Kettle (PDI) can also certainly handle your needs, but using 10K jobs
to accomplish this is not the proper way to set up Pentaho. Also, using MySQL
to store the metadata is where you made a wrong choice: PostgreSQL with data
silos on SSD drives would be a better choice, combined with proper async
configuration [1] and the other steps necessary for a high write load. (Don't
keep Pentaho's Table Output commit level at its default of 10k rows when
you're processing millions of rows!) For Oracle 11g or PostgreSQL, where I
need 30-second time-slice windows for the metadata logging and typically have
less than 1 KB of data per row on average, I will usually choose 200k rows or
more in Pentaho's Table Output commit option.
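
To make the async part concrete, here is a minimal sketch of the kind of
postgresql.conf tuning being alluded to. The values are illustrative starting
points only, not recommendations from this thread; tune them for your own
hardware and durability requirements. The async-behavior settings are the ones
documented in [1]:

    # postgresql.conf -- illustrative starting points for a write-heavy metadata store
    synchronous_commit = off            # absorb commit bursts; risks losing the last few
                                        # transactions on a crash, not data corruption
    wal_buffers = 16MB                  # extra WAL buffering for sustained insert streams
    checkpoint_completion_target = 0.9  # spread checkpoint I/O across the interval
    effective_io_concurrency = 200      # SSDs can service many concurrent requests (see [1])

The Table Output commit size follows the same logic: committing every 200k
rows instead of every 10k amortizes the per-transaction overhead when millions
of rows are flowing through.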



I would suggest you contact Pentaho for some ad hoc support, or hire some
consultants to help you learn more and set things up properly for your use
case. For free, you can also just do a web search for "Pentaho best
practices". There is a lot to learn from industry experts who have already
used these tools and know their quirks.



[1] 
http://www.postgresql.org/docs/9.5/interactive/runtime-config-resource.html#RUNTIME-CONFIG-RESOURCE-ASYNC-BEHAVIOR





Thad
+ThadGuidry







Re: Re: Multiple dataflow jobs management (lots of jobs)

2016-03-12 Thread
Hi Aldrin,

Some additional information: this is a typical ETL-offloading use case. Each
extraction job should focus on one table and one table only; the data will be
written to HDFS, which is similar to database staging. The reason we need to
focus on one table per job is that a database error or disconnection may occur
during extraction, and if it runs as one script-like extraction job driven by
expression language, it is hard to re-run or skip just the affected table or
tables.

Once the extraction is done, a trigger-like action will do the data cleansing.
This is similar to the ODS layer of data warehousing. If the data passes the
quality check, it will be marked as cleaned; otherwise it returns to the
previous step and the data extraction is redone, or an alert/email is sent to
the system administrator.

Once a certain number of tables are cleaned and checked, a transforming
processor is called to do the transformation and push the data into a data
warehouse (Hive, in our case).

Thank you very much,

Yan Liu

Hortonworks Service Division 

Richinfo, Shenzhen, China (PR)

13/03/2016
Re: Re: Multiple dataflow jobs management (lots of jobs)

2016-03-12 Thread
Hi Aldrin,

Currently we need to extract 60K tables per day, and the time window is
limited to 8 hours, which means we need to run jobs concurrently and we need a
general picture of what is going on across all those 60K job flows so we can
take further action. We have tried Kettle and Talend: Talend is IDE-based, so
it is not what we are looking for, and Kettle crashed because MySQL could not
handle Kettle's metadata with 10K jobs.

So we want to use NiFi; it is really the product we are looking for, but the
missing piece is a dataflow-jobs admin page, so that we can have multiple NiFi
instances running on different nodes while monitoring the jobs on one page. If
it could integrate with the Ambari Metrics API, we could develop an Ambari
View for NiFi job monitoring, just like the HDFS View and Hive View.

Thank you very much,

Yan Liu

Hortonworks Service Division 

Richinfo, Shenzhen, China (PR)

06/03/2016

Original message
From: Aldrin Piri <aldrinp...@gmail.com>
To: users <us...@nifi.apache.org>
Cc: dev <dev@nifi.apache.org>
Date: 2016-03-11 02:27:11
Subject: Re: Multiple dataflow jobs management (lots of jobs)

Hi Yan,
We can get more into details and particulars if needed, but have you
experimented with the expression language [1]? I could see a cron-driven
approach that covers your periodic efforts and feeds some number of ExecuteSQL
processors (perhaps one for each database you are communicating with), each
receiving a table name. This would certainly cut down on the need for 30k
processors on a one-to-one basis with a given table.
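
As a sketch of that approach (the attribute name table.name and the flow
wiring are assumptions for illustration, not something prescribed in this
thread): an upstream processor emits one flowfile per table, each carrying the
table name in an attribute, and a single ExecuteSQL processor per database
references that attribute through the expression language in its query
property:

    SQL select query:  SELECT * FROM ${table.name}

One ExecuteSQL per database can then serve every table in that database,
rather than one processor per table.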

In terms of monitoring the dataflows, could you describe what else you are
searching for beyond the graph view? NiFi tries to provide context for the
flow of data but is not trying to be a sole monitoring solution; we can give
information on a per-processor basis but do not delve into specifics. There is
a summary view for the overall flow where you can monitor stats about the
components and connections in the system. We support interoperation with
monitoring systems via push (ReportingTask [2]) and pull (REST API [3])
semantics.
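
On the pull side, a minimal polling sketch in Python (an illustration under
stated assumptions, not from the original mail: the host/port are
placeholders, and the exact status endpoint path differs between NiFi
releases, so check the REST API docs [3] for your version):

    # poll_nifi_status.py -- hypothetical sketch of pull-based monitoring via the REST API
    import json
    import urllib.request

    NIFI_URL = "http://localhost:8080"  # assumed host/port; adjust for your instance

    def fetch_flow_status():
        # Returns controller-level counters (queued flowfiles, active threads, ...).
        # The endpoint path varies by version: newer releases use /nifi-api/flow/status,
        # while 0.x releases exposed similar data under /nifi-api/controller/status.
        with urllib.request.urlopen(NIFI_URL + "/nifi-api/flow/status") as resp:
            return json.loads(resp.read().decode("utf-8"))

    if __name__ == "__main__":
        print(json.dumps(fetch_flow_status(), indent=2))

Run against each instance, a script like this is one way to aggregate a
single-page view across multiple NiFi nodes in the absence of a dedicated
admin page.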

Any other details beyond your list about how this all interoperates might shed
some more light on what you are trying to accomplish. It seems like NiFi
should be able to help with this. With some additional information we may be
able to provide further guidance, or at least gain some insight into use cases
we could look to improve upon and extend NiFi to support.

Thanks!


[1] http://nifi.apache.org/docs/nifi-docs/html/expression-language-guide.html
[2] 
http://nifi.apache.org/docs/nifi-docs/html/developer-guide.html#reporting-tasks
[3] http://nifi.apache.org/docs/nifi-docs/rest-api/index.html



On Sat, Mar 5, 2016 at 9:25 PM, 刘岩 <liu...@richinfo.cn> wrote:

Hi All,



I am trying to take NiFi into production but cannot find an admin console for
monitoring the dataflows.



   The scenario is simple:



   1.  We gather data from Oracle databases to HDFS and then to Hive.

   2.  Residuals/incrementals are updated daily or monthly via NiFi.

   3.  Full dumps of some tables are executed daily or monthly via NiFi.



It is really simple; however, we have 7 Oracle databases with over 30K tables
that need to implement the above scenario.



This means I would have to drag that ExecuteSQL element something like 30K
times, and also arrange the elements in a nice-looking way on my little
21-inch screen.



I am just wondering whether there is a table-list-like, groupable and
searchable task control and monitoring feature for NiFi.





Thank you very much in advance.





Yan Liu 

Hortonworks Service Division 

Richinfo, Shenzhen, China (PR)

06/03/2016