Hi Auke,

I have added your points to the requirements; see the refreshed version 0.12.
A couple of extra comments below.
Thanks for your inputs,
Leonid

-----Original Message-----
From: Kok, Auke-jan H [mailto:[email protected]]
Sent: 19 November 2013 00:07
...

> - any form of input device data is privacy-sensitive. Touches on the screen
>   could reveal unlock patterns, keys typed on the virtual keyboard, etc.

Yep, this part has already been removed.

> - proc is loaded with privacy-sensitive data, and even security-sensitive
>   data, so it should (1) be specified and (2) restricted to only specific
>   proc files that do not contain privacy-sensitive data. Example:
>   /proc/mounts may contain the label of a SD card that was inserted.

The information about mounts is available to any process, so even Chromium
could upload it somewhere. From a practical point of view, df -k contains a
lot of interesting information. I think we should keep the level of security
sane so that it does not get in the way of fixing the system. In any case we
can always have a blacklist of files to prevent uploading them.

> - most of the system logs contain way too much privacy sensitive
>   information to be passed around. This problem is exaggerated by the sheer
>   volume of debug information printed by some of the apps.

Someone pointed out that dlog does filtering; we can re-use that part of the
code. But if some application puts PINs/passwords into syslog, that is
clearly a bug in the application. We could upload only the logs for the
crashed PID, but that is not always useful for analysis, and usually the
filtering is done on the server side.

> - any data sent to a server should be SSL encrypted and do proper
>   certificate verification.

Yep.

> The design is very inclusive - you're trying to capture everything, that
> also means you'll have to assure that all of that is properly filtered and
> selected before sending anything out. If you reduce the amount of things
> you collect, you will have an easier time doing that.

Correct. I tried to cover "an ideal crash reporter"; not all features will be
implemented immediately, it depends on later plans based on feature
prioritization.

Cheers,
Auke

On Mon, Nov 18, 2013 at 7:24 AM, Leonid Moiseichuk <[email protected]> wrote:
> Hello again,
>
> One week has passed for the Crash Reporting proposal.
> The new version implements the "security hardening" changes:
> - no user input is collected
> - no application-specific shell is executed
> - all application-specific files must be readable with the application
>   UID/GID to be added into a report
>
> See the attached files - you are welcome to send more comments.
> Let's set a deadline of 25-Nov-2013: if no changes are introduced, this
> version will become the community-reviewed "working proposal".
>
> Best Wishes,
> Leonid
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]]
> On Behalf Of Leonid Moiseichuk
> Sent: 14 November 2013 10:25
> To: [email protected]
> Subject: [Dev] Crash Reporting proposal for Tizen
>
> Hello,
>
> I am happy to present a Crash Reporter idea based on a number of publicly
> available versions.
> It might become part of Tizen 3.0 if we agree on the approach.
>
> I recommend starting with the architecture document and looking into the
> requirements if you need technical details.
> Please don't hesitate to share your opinion here or by email to me.
> Any constructive criticism is welcome.
>
> ---
> Leonid Moiseichuk
> Tizen Open Source Software engineer
> Finland Research Institute - Branch of Samsung Research UK
> Falcon Business Park, Vaisalantie 4, 02130 Espoo, Finland
> [email protected] | Mobile: +358 50 4872719
>
> _______________________________________________
> Dev mailing list
> [email protected]
> https://lists.tizen.org/listinfo/dev
Crash Reporter High Level requirements
======================================

version    0.12 19-Nov-2013
author     Leonid Moiseichuk [[email protected]]
reviewers  Auke-Jan Kok [[email protected]]
           Juho Son [[email protected]]
           Karol Lewandowski [[email protected]]
           Kyungmin Park [[email protected]]
           Lukasz Stelmach [[email protected]]

Introduction
------------

Crash reporting for an embedded system is not as easy as for desktops/servers
(corewatch/crashdb, apport+whoopsie/daisy, abrt), because you are limited by
system availability, security, energy, memory and performance. Thus, some
design decisions which are suitable for desktops must be strictly avoided on
mobile devices: for example, you cannot install a debugger or symbols on the
device, saturate the connection IO, expect that connectivity is always
available, or use an expensive cellular connection as freely as WiFi. On the
other hand, having just a crash report with a backtrace is not sufficient to
get the problem fixed - you need to understand the use-case, the conditions on
the device, the running applications and memory, logs and data. Practice has
shown that being able to analyze backtraces across many crashes and over time,
and to correlate code changes with the resulting stability, is extremely
helpful.

In this document I have tried to collect all important high-level requirements
with some technical details and rationale which might be useful for the
design/deployment of such a system, based on experience in designing and using
the Maemo/Meego (n800-n9, Meltemi) crash reporter tool and analysis server.
Part of the source code is still available and could be used as a starting
point:
* rich-core - on-device crash data collector
  https://gitorious.org/meego-quality-assurance/rich-core
* public settings - crash reporter configuration to be used in an open
  environment
  https://gitorious.org/meego-quality-assurance/crash-reporter-settings-public
* crash-reporter - on-device UI and daemon to upload crashes
  https://gitorious.org/meego-quality-assurance/crash-reporter/
* corelysis - on-server crash file unpacker/analyzer/backtracer
  https://gitorious.org/meego-quality-assurance/corelysis
* Crash Reports Web UI
  https://gitorious.org/meego-quality-assurance/crash-reports

Crash reporting could be extended further by picking up ideas from Abrt, the
most advanced but non-embedded solution:
* home page   https://github.com/abrt/abrt/wiki/ABRT-Project
* repository  git://github.com/abrt/abrt.git

Components
----------

There are 4 top-level components a crash reporting system should have:
* an on-device Collector which captures oops/crash information for the kernel
  or an application
* an Analyzer on a server (or cluster) which gathers all crashes from the
  device population, unpacks and processes the delivered information, and
  generates the most frequent crashes, crash statistics per release, and as
  much device runtime statistics as we can fetch.
* a Connector which collects the remaining information about the device and
  delivers it from the Collector (device) to the Analyzer (server) at the most
  appropriate time and cost, without impacting other use-cases.
* an on-device Viewer to support cases when crash information cannot be
  uploaded to a server and needs to be processed on the device, like Abrt
  does. This part is mostly for [3rd-party] developers (and a corewatcher
  substitute).

Each of these components may in fact be quite nontrivial.
For example, the Analyzer might be deployed as:
- Uploader - a server which gathers uploads (crashes) from the public internet
- Dispatcher(s) - which take an upload (crash), unpack it, validate it, check
  whether it already exists in the database, and push it on for processing if
  the crash is new
- Processor - which takes a crash, if necessary installs the software needed
  for backtracing for that hardware and release, analyzes it and pushes all
  possible information to the database
- ReleaseManager(s) - a server which keeps a pool of the most often used
  releases and debug symbols and provides them on demand to the Processors
- Database - a database server which collects all information and raw files
  and serves data requests from all other servers
- WebUI - a front-end which knows how to present the information to
  Developers, Managers and Testers and link crashes to JIRA/Bugzilla

Initially (2007, n800) we had the Analyzer deployed on one mid-power PC, but
as the project developed the deployment grew to about 20 servers, mostly to
keep and prepare releases on demand (the Oulu -> Tampere link was quite slow)
and to process crashes. The Database and Web stayed on the same node. This
setup could handle 20K+ crashes per day while keeping processing time below
1 hour.

Let's walk through all components.

Collector
---------

This is the most memory- and performance-critical application in the whole
chain because it has to serve crashes. There are two types of crashes: kernel
oopses and application crashes. Both types are handled in two stages: critical
information is collected at the moment the crash/oops happens, and some extra
data about the device is added to the crash file later, but before uploading
to the crash analysis server. The crash information file should be stored in a
crash folder, e.g. /var/upload, which should ideally be mounted on a separate
partition to avoid situations where the volume of uploads impacts normal
device operation, and to allow sorting and removing uploads if the connection
throughput is not sufficient.

Thus, the Collector functionality can be divided into 3 parts:

1. Collector:base - base permanent device information which does not change at
   runtime, e.g. build version, MAC and IMEI codes, serial number etc. Such
   information should be collected during device boot and prepared in the form
   most suitable for further processing from scripts. It should also apply the
   settings that enable oops/crash dumps, specify folders and partitions etc.
   The settings most critical to dumping cores are:

     kernel.core_pattern = |<PATH_TO_Collector:crash> %p %u %g %s %t %e
     kernel.core_pipe_limit = 0
     kernel.core_uses_pid = 0
     fs.suid_dumpable = 1
     ulimit -c unlimited

2. Collector:kernel - processing kernel oopses. To indicate that an oops
   happened and to avoid follow-up oopses due to in-kernel memory corruption,
   the device must have

     kernel.panic_on_oops = 1

   so that when an oops happens the device is automatically rebooted. Having
   mtdoops or Android apanic allows the oops information to be saved to a
   oneNAND or eMMC partition before the reboot; if the device supports static
   scratchpad memory we might save the information there as well, to cover
   cases when interrupts were disabled at the moment of the oops and saving
   data to eMMC/oneNAND fails. It is also possible to save as much information
   about the user-space status as we can collect at the moment of the oops.
   Thus, Collector:kernel is activated on device boot, applies the settings
   required for kernel oops handling, and checks the scratch memory AND the
   oops partition to determine whether the device [re]booted with a new oops
   or not. If an oops is detected, the related information must be collected
   as a kernel crash and stored in the crashes folder for uploading to the
   server.
3. Collector:crash - this part is activated at the moment of an application
   crash through the kernel interface available via
   /proc/sys/kernel/core_pattern (a minimal pipe-helper sketch follows at the
   end of this section). The core dump of the crashed application arrives in
   ELF format on stdin (fd 0), and the following information should be pushed
   for uploading:

   - crashed application information
     = pid and everything from /proc/<pid> of the crashing process
     = uid and gid of the process
     = the signal which caused it to die
     = the name of the application as set by prctl() (in addition to its
       /proc/<pid>/cmdline)
     = timestamp of the crash
     = the core file, which could be
       * a Google Breakpad minidump ptraced from Collector:crash
       * a reduced core (below 200KB), which is enough for gdb to dump the
         backtrace for all threads, registers, and variables on the stack
       * the full format as generated by the kernel, if the application is on
         a special exception list - sometimes a reduced core is not enough, so
         we have to use the full version, whose size matches the process VM
         (up to 2-3 GB)
     = maps
     = smaps
     = application-specific files which can be pointed to through Settings

     Note: this is necessary to cover cases like java/python/lua crashes, e.g.
     in Python (with save_backtrace() provided by the application):

       import os, signal
       save_backtrace("/tmp/python_backtrace.txt")
       os.kill(os.getpid(), signal.SIGBUS)   # i.e. raise(SIGBUS)

     The process kills itself, but python_backtrace.txt is collected into the
     report.

   - system runtime information such as
     = /proc contents (preferably everything)
     = information about running applications, e.g. their smaps files
     = logs (dmesg, syslog, any other important logs)
     = uptime
     = interface statistics (ifconfig -a)
     = battery level
     = file system usage
     = etc. - this part will be easy to extend in the future

Ideally Collector:crash should be a statically linked application located in
/bin, but for the first versions shell scripting will also be suitable. In
both Collector cases the information should be sent through a pipe to the
Connector interface to be packed properly and delivered to the Analyzer later;
we must avoid temporary files or any other side modification of the file
system contents to prevent follow-up crashes or file system corruption.

As an addition to, or replacement of, Breakpad's minidump, core reduction
becomes an important part of the picture because it allows the crash point,
registers and data in function parameters to be recovered from about 100KB
instead of the 100MB-3GB VM the process had initially. The core file should be
chopped on the fly to speed up processing and reduce memory requirements. As
mentioned above, the application must be checked against an exception list
controlled from the device settings, because in some rare cases a reduced core
might not work or be produced well.

The installation of the Collector on a normal release should alert the User
about the drawbacks and enable default settings which the User may change.
This allows us to turn any device into crash-reporting mode, which is very
useful if we have an oddity with a particular device in a User's hands. Until
the very last releases, the Crash Reporter must be installed on the device and
turned on by default.

Warning: Both the kernel panic and crashes folders MUST NOT be re-flashable if
they already exist. This allows us to do post-mortem analysis in case a device
gets bricked - just re-flash an old/new build and boot.
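As referenced above, here is a rough sketch of the Collector:crash pipe helper
in Python. It assumes the core_pattern shown in Collector:base (arguments
%p %u %g %s %t %e), writes everything under /var/upload, and leaves out core
reduction/minidumping and the hand-off to the Connector; the directory naming
is illustrative only and does not follow the Formats section exactly.

  #!/usr/bin/env python3
  # Hypothetical sketch only: a core_pattern pipe helper matching the
  # "|<PATH_TO_Collector:crash> %p %u %g %s %t %e" setting above.
  import os, shutil, sys, time

  UPLOAD_DIR = "/var/upload"   # crash folder, ideally a separate partition

  def main():
      pid, uid, gid, sig, ts, exe = sys.argv[1:7]
      stamp = time.strftime("%Y%m%d_%H%M%S", time.gmtime(int(ts)))
      workdir = os.path.join(UPLOAD_DIR, "crash.%s.%s.%s" % (exe, pid, stamp))
      os.makedirs(workdir)

      # The kernel streams the core dump on stdin (fd 0); a real collector
      # would reduce it or produce a Breakpad minidump instead.
      with open(os.path.join(workdir, "core"), "wb") as core:
          shutil.copyfileobj(sys.stdin.buffer, core)

      # /proc/<pid> of the crashing process is still accessible while the
      # pipe helper runs, so maps/smaps/cmdline can be copied here.
      for name in ("maps", "smaps", "cmdline"):
          try:
              shutil.copy("/proc/%s/%s" % (pid, name),
                          os.path.join(workdir, name))
          except OSError:
              pass  # some entries may be unreadable; carry on

      # Scalar facts stored as small files, loosely following the Formats
      # section of this document.
      for name, value in (("pid", pid), ("uid", uid), ("gid", gid),
                          ("signal", sig), ("name", exe)):
          with open(os.path.join(workdir, name), "w") as out:
              out.write(value + "\n")

  if __name__ == "__main__":
      main()

In a production Collector the same helper would stream its output to the
Connector pipe instead of touching the file system directly, as required
above.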
Connector
---------

The Connector should perform the following important activities:
- keep enough free space in the upload folder by killing files that are
  produced too often, such as repeated crashes from the same application
- keep a common naming scheme for all types of files, like
  type.MAC.DATE_TIME.application.dump (a small naming helper sketch follows at
  the end of this section), for example:
    crash.1867b036f310.20131003_132301123.kernel.dump
    crash.1867b036f310.20131003_132301204.systemd.dump
    stats.1867b036f310.20131003_132404215.logs.dump
    stats.1867b036f310.20131003_132404215.power.dump
  Note: the extension .dump is chosen because the file may contain crash or
  kernel oops information, a log dump or power management information, and it
  should be auto-compressed.
- add the permanent system information collected by Collector:base, such as
  = build ID
  = list of installed packages
  = it probably also makes sense to support collecting binaries - the crashed
    executable and the dependencies known from the maps file
- fill in the file upload information using pre-filled information and a UI
  dialog OR a file with a pre-defined name (/tmp/crash.info). This is
  necessary to cover both User cases (a crash while browsing) and script-based
  testing. The result goes into the type.MAC.DATE_TIME.application.dump.info
  file, which contains information that is also important for uploading, e.g.
  the server IP used once uploading starts.
- after the .info file is filled, the actual upload may start, one file at a
  time and taking into account file upload priority, when an appropriate
  moment arrives, e.g. Wi-Fi or Cellular is available, the charge level is OK,
  the device is idling or it is night time, etc.

The Connector interface should be provided by libuploader for use in
applications (e.g. the Collector or a connector daemon), or by an uploader
utility which accepts contents and creates a compressed file for upload on the
fly through a pipe:

  cat /var/log/syslog|uploader -t stats -c "Example of logging" logs -f syslog

After the file is created, the upload daemon should finalize the corresponding
.info file and schedule uploading to the appropriate Analyzer server based on
the file type. If the device runs low on space, the oldest file [of the same
type] should be removed.

The uploader Settings should control the following options:
- allowed interface (e.g. wlan0 by default) and level of utilization (1-100%)
- the User's email for notification about uploads and bug assignment;
  otherwise the User has to track uploads manually on the Analyzer:WebUI page
  by device MAC and upload time
- auto-upload, which prohibits showing any dialogs to the User because they
  may break a test flow or simply annoy the User
- upload policy, i.e. when idling, immediately, or at night
- default text for the description (can be overridden by the contents of
  /tmp/crash.info)
- list of servers to be used and the selection policy (primary -> backup or
  random selection)
- blacklist of applications that should not be reported
- list of applications (not from the blacklist) which should be reported
  without core reduction
- application-specific files, e.g. in case of an Xorg crash we should pack
  Xorg.log.*

The priorities for filling in the crash info file could be the following:
1. what the User enters in the upload dialog, if it is activated
2. what the device has in /tmp/crash.info, if it exists
3. what the device has in Settings

See the Formats section for details about the file structure and the Protocol
section for communication protocol information.
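As referenced above, a minimal sketch of the naming scheme as a helper
function. Reading the MAC from sysfs, the wlan0 default and the helper name
itself are assumptions made for illustration, not part of the proposal.

  import datetime

  def dump_name(file_type, application, iface="wlan0", when=None):
      # MAC address of the device, without separators, e.g. 1867b036f310
      with open("/sys/class/net/%s/address" % iface) as f:
          mac = f.read().strip().replace(":", "")
      when = when or datetime.datetime.utcnow()
      # DATE_TIME is YYYYMMDD_HHMMSSmmm (milliseconds, not microseconds)
      stamp = when.strftime("%Y%m%d_%H%M%S") + "%03d" % (when.microsecond // 1000)
      return "%s.%s.%s.%s.dump" % (file_type, mac, stamp, application)

  # dump_name("stats", "logs") -> "stats.1867b036f310.20131119_101502431.logs.dump"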
Analyzer
--------

In comparison to other available crash reporting facilities, the Analyzer is a
key differentiator which allows product quality to be improved significantly
in a short time.

First, the collected crashes per application and kernel can be grouped by
their top 5 function names - as Meego practice showed, instead of hundreds or
thousands of crashes the application support team then has tens of unique
patterns to fix. The same applies to the kernel as well. Second, the developer
gets a lot of extra useful information collected at the moment of the crash:
use-case, logs, memory conditions, files. That simplifies the developer's work
a lot and often allows a fix without having to reproduce a rare crash. Such
crashes can then be verified by tracking them on the Analyzer for 2-4 weeks:
if the crash is gone or has become very rare, it can be accepted as closed.
Third, keeping historical information for several products also allows
similarities and fixes to be found without real coding, just by re-applying
patches from other components. And finally, a lot of useful information can be
produced from the database on the fly, for example time between charging
sessions, minimal/average/maximal memory consumption, out-of-storage cases,
time between reboots or the average number of crashes per day. These indirect
numbers are useful for understanding release stability and the size of the
device population, and give proven product readiness figures.
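The "top 5 function names" grouping above could look roughly like the sketch
below. It assumes the Analyzer:Processor has already turned a core into an
ordered list of frames; the frame representation and the sha256 choice are
illustrative assumptions, not part of the proposal.

  import hashlib

  def crash_signature(frames, depth=5):
      # Reduce a backtrace to a stable signature: keep only the function
      # names of the top frames, ignore parameters and addresses, so
      # thousands of raw crashes collapse into tens of unique patterns.
      top = [frame["function"] for frame in frames[:depth]]
      return hashlib.sha256("\n".join(top).encode()).hexdigest()

  # Two crashes whose top 5 functions match get the same signature and are
  # counted as one unique crash on the statistics pages.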
As mentioned above, the following functional pieces can be identified in the
Analyzer part:
- Analyzer:Uploader - a server which gathers uploads (crashes) from the public
  internet. This is the most security-sensitive part of the chain, and it
  probably makes sense to have a number of such servers which are selected
  randomly on the device for load-balancing purposes.
- Analyzer:Dispatcher(s) - which take an upload, unpack it and do the initial
  handling: create a folder and a database entry, checksum the core file with
  sha256, look for similar cores, and if none exist push the crash further to
  an available Analyzer:Processor instance, e.g. through Analyzer:Database.
- Analyzer:Database - the database server which connects all the important
  pieces together, keeps all information and folders with raw files indexed by
  sha256 sums, and serves data requests from all other servers.
- Analyzer:Processor - fetches a crash from the database according to the
  queue, installs the software release needed for backtracing (e.g. using QEMU
  or a cross-gdb), produces a backtrace, makes a stripped version of the
  function calls without parameters (a crash snapshot), and analyzes the
  delivered files to provide used-memory figures, free memory on partitions,
  logs for the crashed pid, etc. It also sends an email when processing is
  completed, updates statistics about processing time, creates bugs in JIRA if
  necessary and the crash was not known before, etc.
- Analyzer:ReleaseManager(s) - a server which keeps a pool of the most often
  used releases and debug symbols and provides them on demand to the
  Analyzer:Processor(s)
- Analyzer:WebUI - a front-end which knows how to present the information to
  Developers, Managers and Testers and link crashes to JIRA/Bugzilla. It
  allows crashes to be uploaded manually and puts such uploads at the top of
  the processing queue. Analyzer:WebUI should provide 3 types of information:
  = server statistics, like the number of cores queued for processing, the
    average load for the last hour/day/month, the number of allocated
    processors, pre-cached releases in the pool, and the incoming queue of
    uploads to be processed/already processed.
  = device population statistics for a specified product and release: memory
    usage, number of devices in the population, time between charging
    sessions, average uptime, number of oopses, number of crashes, etc.
  = crash statistics; for each product the Customers of the system must see
    the following essential data:
    = a page with application/week/crash statistics, where selecting a
      particular application, week or crash value opens a page with details
      filtered to that selection, e.g. calculator crashes per week or all
      crashes of the calculator for the selected week
    = a page with unique crashes, similar to the previous one but with the
      numbers produced from the top 5 function names in the backtrace.
    = a page with applications/bugs, sorted by the number of crashes and
      taking backtrace uniqueness into account => i.e. which problem should be
      fixed first
    = a page for any particular crash, addressable by sha256 checksum:
      * application, signal
      * backtrace (for all threads), parameters from the stack, direct links
        from function names/files to the source code in git (we used Mozilla
        MXR)
      * logs, /proc contents, files, memory consumption
      * times of crash, upload and processing
      * similar crashes and crashes generated on the same device nearby in
        time (e.g. 15 seconds before/after this particular crash)
      * the linked report in Bugzilla/JIRA, if a bug is specified
    = any application, week or unique crash should be addressable through the
      WebUI; that is very useful for pointing to crashes directly from JIRA or
      any other reports.

All or part of these servers could be located in the cloud, so we have the
opportunity to scale up quickly when there are a lot of crashes.

Viewer
------

We did not have this component in Meego crash reporting, but it seems very
useful to have. The idea of the dump file viewer is pretty simple - allow the
.dump file contents to be seen without access to the Analyzer. It is very
useful for debugging the whole crash system and for developers who often like
to use their own libraries and symbols. Viewing the .dump file contents could
be done using command line tools like lzop, gdb etc., but having a UI is
useful because it allows us to debug the analyser part on the device. The
Viewer should produce a report with the following information:
- type of file
- generic device information
- installed packages
- backtrace (if gdb is installed and symbols are available)
- allow files to be unpacked into some folder for manual checks
- produce a comprehensive text report to be attached e.g. in bugzilla/jira

Formats
-------

Files for exchange should have a pre-defined naming and format. The naming is
expected to be the following:

  type.MAC.DATE_TIME.application.dump

where
  type        - the file type, which is problem-specific, e.g. crash, stats,
                logs etc.
  MAC         - the MAC address of the device, since it is not guaranteed that
                the device has an IMEI
  DATE_TIME   - timestamp of when the file was created, in the format
                YYYYMMDD_HHMMSSmmm
  application - the application name which resulted in this file being
                produced, e.g. logs, systemd, systemui
  .dump       - the file name extension (could be something else like .data if
                you like)

The file itself is an archive of lzo-compressed files where new files are
added one by one. After uploading, the archive must be unpacked into a folder
with roughly the following structure:

  type.MAC.DATE_TIME.application/
    core        - the real or reduced core file collected on the device
    system.log  - the last X MB of syslog collected on the device before the
                  moment of the crash
    procfs.json - contents of /proc, handled like proc2csv does (see the
                  sketch after this listing)
    name        - the application name as reported by the kernel
    signal      - the signal which killed the application
    pid, uid, gid etc. - see the description above
    info        - the description text, if it exists in the .info file
    email       - the email, if it exists in the .info file
    bug         - the bug title, if it exists in the .info file
    packages.ls - the list of installed packages
    SubDir/     - a subfolder if extra files are needed
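A hedged sketch of how procfs.json could be produced. The specific /proc files
listed are examples only (proc2csv, mentioned above, plays this role on
Maemo/Meego), and privacy-sensitive entries would still have to be filtered as
required by the Security section.

  import json

  # Illustrative selection only; a real collector would take its file list
  # from Settings and apply the Security-section filtering.
  PROC_FILES = ("/proc/meminfo", "/proc/uptime", "/proc/loadavg", "/proc/mounts")

  def collect_procfs(paths=PROC_FILES):
      snapshot = {}
      for path in paths:
          try:
              with open(path) as f:
                  snapshot[path] = f.read()
          except OSError:
              snapshot[path] = None   # not every file exists on every kernel
      return json.dumps(snapshot, indent=2)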
The corresponding .info file is named

  type.MAC.DATE_TIME.application.dump.info

and initially contains a number of optional fields which cannot be added at
the moment of .dump file creation:
- uploader email
- bug title, if it is allowed to create a new bug in JIRA and assign it to the
  User
- description text (which should be the bug summary if a bug needs to be
  created)
- etc.

Example of a /tmp/crash.info file, which is created ahead of time before a
crash happens and is used to produce the .info file for uploading:

  ENTRY: email
  [email protected]
  ENTRY: bug
  auto-bug for suite BrowserReliability test case YouTube_01
  ENTRY: summary
  Please fill me with details

When the .info file is detected by the Connector, each entry

  ENTRY: entry_name
  line1
  ..
  lineN

is added to the .dump file as a file with the name entry_name and the contents
line1..lineN, and the original content of the .info file is wiped. Then the
server IP is selected according to the file type, and once the server is
connected the .info file is turned into 2-line content:

  ENTRY: server
  12.122.14.11

After the upload is completed, the .dump and .info files should be erased.

Protocol
--------

There are the following important things we should support in the protocol:
- it should be stateless, to support resuming interrupted transfers
- the channel utilization should be adjusted at runtime and stay below the
  value set in the settings (50% by default)
- if the file checksum does not match, the server may request a re-upload
- encryption and certificate verification are mandatory to hide sensitive
  information (symbols, passwords, crash points) from monitoring, so an
  https-based protocol (port 443) must be used. It also fits nicely into the
  usual network infrastructure such as firewalls/proxies.

When a .info file is discovered we may have 3 situations:
1. the info file contains sections other than "ENTRY: server" => it needs to
   be processed locally as described above, adding the extra data to the .dump
   file
2. the info file is empty - a server must be selected and connected according
   to the .dump file type
3. the info file contains "ENTRY: server" - connect to that server, or remove
   the contents of the .info file if the connection fails

After the connection to the server is established, the following negotiation
steps could be performed between the device (D) and the server (S):

  D: type.MAC.DATE_TIME.application.dump <FILE_SIZE> <FILE_HASH>
     // We have a file for upload with the given size and checksum
  S: <FILE_POSITION> <MAX_BLOCK_SIZE>
     // Yes, please start from that position; FILE_POSITION > 0 if the upload
     // was interrupted earlier
  D: {data_block_sent} * N
     // Sending blocks of data of 4K..MAX_BLOCK_SIZE size until the whole file
     // is transferred
  S: <FILE_HASH>
     // Hash of the file (md5, sha1, sha256) computed on the server side; if
     // the file was transferred badly it will NOT match the hash from the
     // beginning of the transfer => the stream is closed and the file is
     // deleted on the server side only, and re-transmission starts

So on the device side a bad file hash or a closed stream is the signal to
re-connect and re-transmit according to the server's requests.
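To make the negotiation above more concrete, here is a hedged sketch of the
device side over a TLS socket on port 443, as the Security section requires.
The newline-delimited framing, the sha256 choice and the function name are
assumptions of the sketch; channel utilization limiting, retries and the
.info bookkeeping are intentionally left out.

  import hashlib, os, socket, ssl

  def upload(path, server, block=4096):
      size = os.path.getsize(path)
      with open(path, "rb") as f:
          digest = hashlib.sha256(f.read()).hexdigest()

      ctx = ssl.create_default_context()      # verifies the server certificate
      with socket.create_connection((server, 443)) as raw, \
           ctx.wrap_socket(raw, server_hostname=server) as s:
          # D: file name, size and checksum of the file we want to upload
          s.sendall(("%s %d %s\n" % (os.path.basename(path), size, digest)).encode())
          # S: resume position and maximum block size
          position, max_block = map(int, s.recv(256).split())

          with open(path, "rb") as f:
              f.seek(position)                # resume an interrupted upload
              while True:
                  chunk = f.read(min(block, max_block))
                  if not chunk:
                      break
                  s.sendall(chunk)

          # S: hash computed on the server; a mismatch means re-transmission
          return s.recv(256).decode().strip() == digest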
Security
--------

A number of security hardening requirements are present in this document:
- all data uploading works over https, is SSL encrypted and does proper
  certificate verification.
- all configuration files are expected to be integrity-protected
- any form of input device data is privacy-sensitive and is prohibited from
  being uploaded. Touches on the screen could reveal unlock patterns, keys
  typed on the virtual keyboard, etc.
- log collection should take into account that logs may contain sensitive
  information (e.g. PINs); such information should be filtered out (a minimal
  filtering sketch follows this list). The log size needs to be restricted by
  some limit to prevent information irrelevant to the crash from leaving the
  device.
- application-specific files are expected to be collected only if they are
  accessible with the application UID/GID
- static linking should be used and no shells executed from Collector:crash
- file blacklisting should be used to prevent any application access to
  security-sensitive data
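As referenced in the list above, a hedged illustration of the log filtering
and file blacklisting requirements. The regular expressions, the size limit
and the blacklist entries are examples only and would live in the
integrity-protected configuration.

  import re

  SENSITIVE = [re.compile(p, re.IGNORECASE) for p in (
      r"\bpin\b\s*[:=]?\s*\S+",
      r"\bpassword\b\s*[:=]?\s*\S+",
  )]
  PATH_BLACKLIST = ("/var/lib/secure/",)       # example: never packed into a report

  def scrub_log(text, limit=2 * 1024 * 1024):
      # Drop lines with sensitive values and keep only the last `limit` bytes.
      kept = [line for line in text.splitlines()
              if not any(p.search(line) for p in SENSITIVE)]
      return "\n".join(kept)[-limit:]

  def allowed_to_upload(path):
      return not any(path.startswith(prefix) for prefix in PATH_BLACKLIST)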
===[ end of Crash Reporter High Level requirements ]===

_______________________________________________
Dev mailing list
[email protected]
https://lists.tizen.org/listinfo/dev
