Hi, Stefano, Stuart and the Debian community!
I don’t know how things are going with the “clamouring hordes wanting to pick Debian Metrics Portal project”, but this project is definitely worth it :) I have been thinking about it for a couple of days, and I now have some ideas about the details, as well as some parts that are still unclear to me.

Prelude:

Most modern data visualization software gathers data either through an agent-server architecture (Munin, New Relic) or through a data entry point (Graphite, RRDTool) that is available locally or over the web. The choice of approach depends on the nature of the data. The first option is great for gathering information about the server environment (load, performance), since it provides a plugin-based agent. The second is better for getting information from third parties, because it is effectively an API.

In our case, we are faced with very varied data. As I understand it, we need to be able to show both "normal" data from a database, where each point is a time-value pair, and irregular points from some file, such as the number of GSoC proposals per period or a pie chart of the usage of different Debian versions. Besides, this data comes in different formats; it is not homogeneous. RRDTool, for example, clearly defines what it expects to receive (in some cases at specified intervals), which lets it handle incoming data easily and later apply filters and customize the display according to its configuration.

The task is complicated by the fact that not everyone who already publishes stats on https://wiki.debian.org/Statistics will be willing to provide data in the Debian Metrics Portal (DMP) format or to convert it. In other cases it is simply not realistic. Take the disk I/O metric: besides being responsible for sending it to DMP, the user also has to get it from somewhere (that is, if we are replacing Munin with DMP). On the other hand, data sources such as a relational database (RDB) are a little easier to handle.
They already have a schema, so we can just let the user specify which column is the timestamp and which is the value, or even work it out ourselves.

The main types of data sources for charts (from https://wiki.debian.org/Statistics):

- RRDTool storage
  Example: https://ftp-master.debian.org/stat.html
- Flexible (meaning it would not be difficult for the owner of the graphs to switch to our specific format)
  Examples:
  https://bugs.debian.org/release-critical/
  https://buildd.debian.org/stats/
  http://davesteele.github.io/debian-rfs-stats/
- RDBMS
  Example: UDD, and I guess this one also comes from a DB: http://asdfasdf.debian.net/~tar/bugstats/
- Plain text
  Example: http://qa.debian.org/watch/uscan-status-stats.txt
- Undefined. Sources that may only be available in a hard-to-parse format, like an HTML page.
  Example: http://ircbots.debian.net/factoids/stats.php?q=recently-created

I would suggest:

1. Architecture. The agent-server approach is not optimal: we will not always have the opportunity or the need to run an agent on the server, and in most cases this approach would be redundant (e.g. UDD). The data-entry-point approach also has some trouble spots, but they will be much easier to solve. So, we will provide local and network entry points (the DMP API). Example use cases (actually DMP API wrappers):

   Console: create a simple shell script.

       $ dmp-client add <metric_id> <value> <timestamp>

   Web: run a lightweight web server on a non-80 port for web interaction.

       $.ajax({
           type: "POST",
           url: "http://192.168.0.100:4242/dmp-rpc/",
           data: {
               "metric_id": "website_visit",
               "timestamp": "auto",
           },
       });

   Remote: the trivial way in most software is an open socket, available both locally and remotely:

       echo "users_online 42 <timestamp>" | nc -q0 192.168.0.100 4241

2. Allow multiple display formats (adjusted for certain metrics, depending on the data type): charts, diagrams, etc.

3. In addition to the DMP API, collect data as follows. In the admin web interface, the user sets a “path” to the data: it could be a plain text file (via FTP, HTTP or local), an RRD database, a MySQL database, etc.
With the given path, we try to fetch that data and, in our super-web-gui-import-tool, ask the user to specify how we should treat the “columns” (let’s say separated by “,”), e.g. as a timestamp or as a value. Or we do it ourselves if we know the format (like RRD). It’s like how importing products from any .xls table works in my software. Next, the user sets the frequency of data synchronization and enjoys graphs in DMP and in their own system simultaneously. This also allows us to get data from things like Munin without any problems and without changing anything in the existing system. Later the user may move to the DMP API method, which is better, without losing data.

   This raises the question of data duplication. For example, this method is applicable to the UDD source. However, copying all the data from the remote database to DMP is excessive, while querying UDD directly every time is not flexible either.

4. Provide support for real-time graphs for metrics such as "Online right now", which do not require storing a lot of data, since only the current state at the current moment needs to be analyzed.

5. The DMP API is not only an entry point for sending data; it is also a way to query it.

6. Maybe provide an event-collector API for event-based data, so that DMP would also be an open-source alternative to Google Analytics-like software.

So, all this should cover all data source types, with more or less code to write. But in some cases, the owner will still have to switch to the new format.

In short, what is unclear to me is how to receive/gather the data. How to add metrics in the admin panel, store them and display them: I’ve done that before and I have some ideas about it. Some feedback would help me begin work on a prototype (not just a script that uses matplotlib to graph some metrics, but a web app as a proof of concept).

I think this project has the potential to become a great open-source solution that helps Debian and other projects make their software better by analyzing their stats.
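
P.S. To make the column-mapping import from item 3 a bit more concrete, here is a minimal sketch in Python. Nothing of DMP exists yet, so every name in it (import_points, column_map) is hypothetical, and it assumes the timestamp column holds Unix timestamps in seconds. It only illustrates the idea of letting the user say which column is the timestamp and which is the value; it is not a proposed final design.

```python
import csv
import io
from datetime import datetime, timezone


def import_points(text, column_map, delimiter=","):
    """Turn delimited text into a list of (timestamp, value) points.

    column_map is a hypothetical structure: it records which column
    index the user marked as the timestamp and which as the value.
    Assumption: timestamps are Unix timestamps in seconds.
    """
    points = []
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    for row in reader:
        if not row or row[0].lstrip().startswith("#"):
            continue  # skip blank lines and comment lines
        try:
            ts = int(row[column_map["timestamp"]])
            value = float(row[column_map["value"]])
        except (ValueError, IndexError):
            continue  # skip rows that do not fit the mapping, e.g. a header
        points.append((datetime.fromtimestamp(ts, tz=timezone.utc), value))
    return points


# Made-up sample data: a header row followed by two time-value rows.
sample = "date,proposals\n1362096000,12\n1364774400,30\n"
points = import_points(sample, {"timestamp": 0, "value": 1})
for when, value in points:
    print(when.date().isoformat(), value)
```

Rows that do not match the mapping are silently skipped here to keep the sketch short; a real import tool would probably report them back to the user instead.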
--
Regards,
Nikolay.
