Hi All! Thanks a lot for the many great and prompt responses!
> hi Adam, please look at some screenshots i have uploaded to Falcon-790 that should address #1.

Yes, the landing page is nice (https://issues.apache.org/jira/secure/attachment/12686141/landing.png) because it provides the search feature. It's super useful! I wonder whether this patch could be split into two smaller patches (one for the landing page with search features and one for write access to entities), if that would help get the new landing page committed to trunk sooner. Being able to create/edit entities in the Web UI is really great, but in our case it's not a priority yet. We currently use Falcon to schedule and manage our daily/hourly ETL and KPI processes and their feeds. To do so, we have dedicated engineers (e.g. myself) who can create these XML files and submit them to the Falcon server with little effort. Other people just look at the Falcon Web UI.

I guess that many people (including Hadoop admins, data engineers and data analysts) could benefit from features like:
- lineage (already looks nice, thanks!),
- triage (https://issues.apache.org/jira/browse/FALCON-796, as Ajay Yadav mentioned),
- being able to quickly find any feed and process on the list,
- being able to have a quick look at the Web UI to check whether some KPIs failed during the night or are still being processed.

I would also find it very useful to be able to re-run an instance of a process from the Web UI (the CLI for that is not that friendly, because you must provide a timestamp and so on). I created a JIRA for that in case it's useful for other people as well: https://issues.apache.org/jira/browse/FALCON-986.

> - (5) Can you please explain it in more detail? You want only size, hdfs location etc. or something more?

I can think of a couple of use cases:

1. the list of recently created partitions of the feed

Initially, it can be just a link to HUE so one can easily navigate back and forth.

2. a hint for the retention period of a dataset

In this talk http://www.slideshare.net/AdamKawa/hadoop-operations-powered-by-hadoop-hadoop-summit-2014-amsterdam (slides 53 and 57), I talked about an "empirical retention policy" for a dataset. This could be (more or less) calculated using the last registered access time kept by the NameNode for each file (with the precision of dfs.namenode.accesstime.precision). Let's say you have a dataset where a new directory is created each day. If you aggregate the last modification time and the last registered access time for each file in each directory, you can get some insight into what the retention for the dataset should be. In other words, you learn how many days after its creation an instance of the dataset is no longer accessed... I can provide exemplary Python code for that, as I have it implemented somewhere... :) This hint could help to set a better retention period for a feed. In practice, many owners of a dataset don't really know what the retention should be and they specify periods that are too long.

3. automatic detection of hot datasets

In the same talk (slides 57-61), I talked about hot datasets. Initially, it could be as simple as counting the number of processes that depend on a feed, and using some colours (e.g. red, yellow, green) or numbers to distinguish the popular feeds in some way. If a dataset is really hot, you might want to temporarily increase its replication factor for the most recent instances (to benefit from the higher data locality). Falcon might also take care of adjusting the HDFS replication factor, since it's about data management... :)

> *P.S.* To see how we are thinking about monitoring dashboard you can take a look at the POC that I did at https://github.com/ajayyadav/falcon-dashboard Currently this works on mocked API calls and we are working to provide the backend APIs in falcon as of now.

Thanks for sharing, I will have a look!

Cheers!
Adam
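P.S. On the CLI pain point above: the timestamp the re-run command expects is the nominal instance time, which can be derived from the process start time and frequency. A minimal sketch of that arithmetic (the helper name is mine, and the daily frequency is just an assumed example):

```python
from datetime import datetime, timedelta

def nominal_instance_time(process_start, frequency, wall_clock_time):
    """Return the nominal start time of the instance covering wall_clock_time.

    This is the timestamp one has to figure out by hand today before
    calling the CLI re-run command (hypothetical helper, not Falcon API).
    """
    elapsed = wall_clock_time - process_start
    whole_periods = elapsed // frequency  # number of complete periods elapsed
    return process_start + whole_periods * frequency

# Example: a daily process started on 2014-01-01; which instance covers
# the failure we noticed at 2014-03-15 07:30?
start = datetime(2014, 1, 1)
t = nominal_instance_time(start, timedelta(days=1), datetime(2014, 3, 15, 7, 30))
print(t.strftime("%Y-%m-%dT%H:%MZ"))  # prints 2014-03-15T00:00Z
```

A Web UI re-run button could do exactly this lookup for the user.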
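And as promised under (2), here is a rough sketch of the "empirical retention" hint. It assumes the per-file (mtime, atime) pairs have already been fetched from the NameNode; the stats below are mocked, and atime precision is of course limited by dfs.namenode.accesstime.precision:

```python
from datetime import datetime

# Mocked per-file stats: {daily directory: [(mtime, atime), ...]}.
# In practice these would come from the NameNode (e.g. an fsimage dump).
stats = {
    "/data/kpi/2014-01-01": [(datetime(2014, 1, 1), datetime(2014, 1, 9))],
    "/data/kpi/2014-01-02": [(datetime(2014, 1, 2), datetime(2014, 1, 8))],
    "/data/kpi/2014-01-03": [(datetime(2014, 1, 3), datetime(2014, 1, 6))],
}

def empirical_retention_days(stats):
    """For each daily instance, measure how long after creation it was
    last accessed; the maximum over all instances is the retention hint."""
    gaps = []
    for files in stats.values():
        created = min(mtime for mtime, _ in files)      # instance creation
        last_access = max(atime for _, atime in files)  # last registered access
        gaps.append((last_access - created).days)
    return max(gaps)

print(empirical_retention_days(stats))  # prints 8 for the mocked data
```

So for this mocked dataset, anything retained longer than ~8 days is never read again, which suggests the feed's retention limit.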
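The hot-dataset idea from (3) could start equally simply. A sketch, assuming the feed-to-dependent-processes mapping is available (mocked below, and the colour thresholds are arbitrary numbers just for illustration):

```python
# Mocked mapping of feeds to the processes that consume them; in a real
# implementation this would come from Falcon's entity dependency metadata.
dependents = {
    "clicks-feed": ["sessionize", "daily-kpi", "ad-report", "ml-training"],
    "logs-feed": ["archive"],
}

def hotness_colour(feed, dependents, yellow_at=2, red_at=4):
    """Classify a feed by the number of processes depending on it.
    Thresholds are assumptions; the Web UI could render the colour badge."""
    n = len(dependents.get(feed, []))
    if n >= red_at:
        return "red"
    if n >= yellow_at:
        return "yellow"
    return "green"

print(hotness_colour("clicks-feed", dependents))  # prints red
print(hotness_colour("logs-feed", dependents))    # prints green
```

A "red" feed would then be a candidate for a temporarily higher HDFS replication factor on its most recent instances.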
