Hi Auke,

I have added your statements to the requirements. See the refreshed version 0.12.
A couple of extra comments below.

Thanks for your inputs,
Leonid

-----Original Message-----
From: Kok, Auke-jan H [mailto:[email protected]] 
Sent: 19 November 2013 00:07
...
> - any form of input device data is privacy-sensitive. Touches on the
> screen could reveal unlock patterns, keys typed on the virtual keyboard,
> etc.
Yep, this part already removed.

> - proc is loaded with privacy-sensitive data, and even security-sensitive
> data, so it should (1) be specified and (2) restricted to only specific proc
> files that do not contain privacy-sensitive data. Example: /proc/mounts
> may contain the label of a SD card that was inserted.
The information about mounts is available to any process, so even Chromium can
upload it somewhere. From a practical point of view, df -k contains a lot of
interesting information.
I think we should keep the level of security sane so that it does not get in
the way of fixing the system. Anyhow, we can always have a file blacklist to
prevent uploading them.

> - most of the system logs contain way too much privacy sensitive
> information to be passed around. This problem is exaggerated by the sheer
> volume of debug information printed by some of the apps.
Someone pointed out that dlog does filtering. We can re-use this part of the
code. But if some application shares PINs/passwords in syslog - that is a
clear bug in the application.
We could upload just the logs for the crashed PID, but for analysis that is
not so useful in all cases, and usually it is done on the server side.

> - any data sent to a server should be SSL encrypted and do proper
> certificate verification.
Yep.

> The design is very inclusive - you're trying to capture everything, that
> also means you'll have to assure that all of that is properly filtered and
> selected before sending anything out. If you reduce the amount of
> things you collect, you will have an easier time doing that.
Correct. I tried to cover "an ideal crash reporter"; not all features will
be implemented immediately, it depends on the plans we make later based on
feature prioritization.

Cheers,

Auke

On Mon, Nov 18, 2013 at 7:24 AM, Leonid Moiseichuk
<[email protected]> wrote:
> Hello again,
>
> One week has passed for the Crash reporting proposal.
> The new version implements the "security hardening" changes:
> - no user input collected
> - no application-specific shell executed
> - all application-specific files must be readable by the application
> UID/GID to be added into the report
>
> See the attached files; you are welcome to add more comments.
> Let's set a deadline of 25-Nov-2013, and if no changes are introduced the
> version will become a community-reviewed "working proposal".
>
> Best Wishes,
> Leonid
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] 
> On Behalf Of Leonid Moiseichuk
> Sent: 14 November 2013 10:25
> To: [email protected]
> Subject: [Dev] Crash Reporting proposal for Tizen
>
> Hello,
>
> I am happy to present a Crash Reporter idea based on a number of publicly
> available versions.
> It might be part of Tizen 3.0 if we agree on the approach.
>
> I recommend starting from the architecture document and looking into the
> requirements if you need technical details.
> Please don't hesitate to share your opinion here or by email to me.
> Any constructive criticism is welcome.
>
> ---
> Leonid Moiseichuk
> Tizen Open Source Software engineer
> Finland Research Institute - Branch of Samsung Research UK Falcon 
> Business Park, Vaisalantie 4, 02130 Espoo, Finland 
> [email protected] |
> Mobile:  +358 50 4872719
>
>
>
> _______________________________________________
> Dev mailing list
> [email protected]
> https://lists.tizen.org/listinfo/dev
>
Crash Reporter High Level requirements
======================================
version 0.12 19-Nov-2013
author Leonid Moiseichuk [[email protected]]
reviewers
  Auke-Jan Kok [[email protected]]
  Juho Son [[email protected]]
  Karol Lewandowski [[email protected]]
  Kyungmin Park [[email protected]]
  Lukasz Stelmach [[email protected]]

Introduction
------------
Crash reporting for an embedded system is not as easy as for desktops/servers
(corewatch/crashdb, apport+whoopsie/daisy, abrt) because of limitations in
system availability, security, energy, memory and performance. Thus, some design
decisions which are suitable for desktops must be strongly prohibited for mobile
devices: for example, you cannot install a debugger or symbols on the device,
saturate the connection IO, expect that connectivity is always available, or use
an expensive cellular connection as freely as WiFi. On the other hand, having
just a crash report with a backtrace is not sufficient to get the problem fixed -
you need to understand the use-case, the conditions on the device, the running
apps, memory, logs and data. Practice has demonstrated that the opportunity to
analyze backtraces across many crashes and over time, and to correlate code
changes with the resulting stability, is extremely helpful.

In this document I have tried to collect all important high level requirements,
with some technical details and rationale which might be useful for the design
and deployment of such a system, based on experience in designing and using the
Maemo/Meego (n800-n9, Meltemi) crash reporter tool and analysis server.

Part of the source code is still available and could be used as a starting point:
* rich-core - on-device crash data collector
  https://gitorious.org/meego-quality-assurance/rich-core
* public settings - crash reporter configuration to be used in open environment
  https://gitorious.org/meego-quality-assurance/crash-reporter-settings-public
* crash-reporter - on device UI and daemon to upload crashes
  https://gitorious.org/meego-quality-assurance/crash-reporter/
* corelysis - on server crash files unpacker/analyzer/backtracer
  https://gitorious.org/meego-quality-assurance/corelysis
* Crash Reports Web UI
  https://gitorious.org/meego-quality-assurance/crash-reports

Crash reporting could be extended further by taking into account ideas picked
from Abrt - the most advanced, but non-embedded, solution:
* home page  https://github.com/abrt/abrt/wiki/ABRT-Project
* repository git://github.com/abrt/abrt.git

Components
----------
There are 4 top-level components a crash reporting system should have:
* on-device Collector which captures oops/crash information for the kernel or
  an application
* Analyzer on a server (or cluster) which gathers all crashes from the device
  population, unpacks and processes the delivered information, and generates
  the most frequent crashes, crash statistics per release and as much device
  runtime statistics as we can fetch.
* Connector which collects the rest of the information about the device and
  delivers it from the Collector (device) to the Analyzer (server) at the most
  appropriate time and cost, without impacting other use-cases.
* on-device Viewer to support cases when crash information cannot be uploaded
  to the server and needs to be processed on the device, like Abrt does. This
  part is mostly for [3rd-party] developers (and a corewatcher substitution).

Each of these components might in fact be very nontrivial, e.g. the Analyzer
might be deployed as:
- Uploader - server which gathers uploads (crashes) from the public internet
- Dispatcher(s) - takes an upload (crash), unpacks and validates it, checks its
  existence in the database, and pushes it further for processing if the crash
  is new
- Processor - takes a crash, if necessary installs software for backtracing
  taking into account hardware and release, analyzes it and pushes all possible
  info to the database
- ReleaseManager(s) - server which keeps a pool of the most often used releases
  and debug symbols and provides them on demand to the Processors
- Database - database server which collects all information and raw files and
  provides data on request to all other servers
- WebUI - front-end which knows how to deliver information to Developers,
  Managers and Testers, and links crashes to JIRA/Bugzilla

Initially (2007, n800) we had the Analyzer deployed on one mid-power PC, but
later, as the project developed, the deployment grew to about 20 servers, mostly
to keep and prepare releases on demand (the Oulu -> Tampere link was quite slow)
and to process crashes. The Database and Web stayed on the same node. This
allowed handling 20K+ crashes per day while keeping the processing time below
1 hour.

Let's walk through all components.

Collector
---------
This is the most memory and performance critical application in the whole chain
because it must serve crashes. There are two types of crashes: kernel oopses and
application crashes. Both types are handled in two stages: critical information
is collected at the moment the crash/oops happens, and some extra data about the
device must be added into the crash file later, but before uploading to the
crash analysis server. The crash information file should be stored in a crash
folder, e.g. /var/upload, which ideally should be mounted on a separate
partition to avoid situations when the amount of uploads impacts usual device
operations, and to allow sorting and removing uploads if the connection
throughput is not sufficient.

Thus, the Collector functionality could be divided into 3 parts:

1. Collector:base - base permanent device information which does not change at
   runtime, e.g. build version, MAC and IMEI codes, serial number etc. Such
   information should be collected during device boot and prepared in the form
   most suitable for further processing from scripts. It should also apply the
   settings which enable oops/crash dumps, specify folders and partitions etc.
   The settings most critical to dumping cores are:
    kernel.core_pattern = |<PATH_TO_Collector:crash> %p %u %g %s %t %e
    kernel.core_pipe_limit = 0
    kernel.core_uses_pid = 0
    fs.suid_dumpable = 1
    ulimit -c unlimited
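
   As an illustration, a minimal sketch (not the actual implementation) of how
   Collector:base might apply these settings at boot; the handler path used in
   core_pattern and the crash folder are assumptions:

     #!/usr/bin/env python3
     # Sketch: apply the core-dump related kernel settings at boot (needs root).
     import os
     import resource

     SYSCTL = {
         "kernel/core_pattern": "|/bin/collector-crash %p %u %g %s %t %e",  # hypothetical handler
         "kernel/core_pipe_limit": "0",
         "kernel/core_uses_pid": "0",
         "fs/suid_dumpable": "1",
     }

     def apply_settings(crash_dir="/var/upload"):   # crash folder from this document
         os.makedirs(crash_dir, exist_ok=True)
         for key, value in SYSCTL.items():
             with open("/proc/sys/" + key, "w") as f:
                 f.write(value)
         # equivalent of "ulimit -c unlimited" for this process and its children
         resource.setrlimit(resource.RLIMIT_CORE,
                            (resource.RLIM_INFINITY, resource.RLIM_INFINITY))

     if __name__ == "__main__":
         apply_settings()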

2. Collector:kernel - processing of kernel oopses. To indicate that an oops
   happened, and to avoid sequential oopses due to in-kernel memory corruption,
   the device must have
     kernel.panic_on_oops = 1
   so when an oops happens the device will be rebooted automatically. Having
   mtdoops or Android apanic will save the oops information before the reboot
   to a oneNAND or eMMC partition; if the device supports static scratchpad
   memory we might save the information there as well, to cover cases when
   interrupts were disabled at the moment of the oops and saving data to
   eMMC/oneNAND would fail. It is also possible to save as much information
   about the user-space status as we can collect at the moment of the oops.

   Thus, Collector:kernel is activated on device boot, makes the settings
   required for kernel oopsing, and checks the scratch memory AND the oops
   partition to find out whether the device [re]booted with a new oops or
   without one. If an oops is detected, the related information must be
   collected as a kernel crash and stored in the crashes folder for uploading
   to the server.
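
   A minimal sketch of that boot-time check, assuming the saved oops records
   are exposed through pstore (/sys/fs/pstore); the mtdoops/apanic raw
   partition case would read the partition directly instead, and the paths
   here are assumptions:

     #!/usr/bin/env python3
     # Sketch: at boot, move saved kernel oops/panic records into the crash folder.
     import glob
     import os
     import shutil
     import time

     PSTORE_DIR = "/sys/fs/pstore"   # assumption: oops was saved via pstore/ramoops
     CRASH_DIR = "/var/upload"       # crash folder from this document

     def collect_kernel_oops():
         records = glob.glob(os.path.join(PSTORE_DIR, "dmesg-*"))
         if not records:
             return False            # clean boot, nothing to report
         stamp = time.strftime("%Y%m%d_%H%M%S")
         target = os.path.join(CRASH_DIR, "oops-" + stamp)
         os.makedirs(target, exist_ok=True)
         for record in records:
             shutil.copy(record, target)
             os.remove(record)       # free the persistent store for the next oops
         return True

     if __name__ == "__main__":
         collect_kernel_oops()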

3. Collector:crash - this part is activated at the moment of an application
   crash through the kernel interface available via /proc/sys/kernel/core_pattern.
   The core dump in ELF format from the crashed application comes on stdin
   (fd 0), so the following information should be pushed for uploading:
   - crashed application information
      = pid and everything from /proc/<pid> of the crashed process
      = uid and gid of the process
      = signal which was the reason to die
      = name of the application as set by prctl() (in addition to
        /proc/<pid>/cmdline)
      = timestamp of the crash
      = core file, which could be
        * a Google Breakpad minidump ptraced from Collector:crash
        * a reduced core (below 200KB), which is enough for gdb to dump a
          backtrace for all threads, registers, and variables on the stack
        * the full format as generated by the kernel, if the application is on
          a special exception list - sometimes a reduced core is not enough, so
          we have to use the full version, which will have the size of the VM
          (up to 2-3 GB)
      = maps
      = smaps
      = application-specific files which could be pointed to through Settings
        Note: this is necessary to cover cases like Java/Python/Lua crashes,
          e.g.
            save_backtrace("/tmp/python_backtrace.txt")
            raise(SIGBUS)
          The process kills itself, but python_backtrace.txt is collected into
          the report.
   - system runtime information like
      = /proc contents (better if everything)
      = information about running apps, e.g. smaps files
      = logs (dmesg, syslog, any other important logs)
      = uptime
      = interface statistics (ifconfig -a)
      = battery level
      = file system usage
      = etc. - this part should be easily extendable in the future
   Ideally, Collector:crash should be a statically linked application located
   in /bin, but for the first versions shell scripting will also be suitable
   (see the sketch below).
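
   For illustration, a minimal sketch of such a Collector:crash pipe handler
   invoked via the core_pattern shown above; it assumes the %p %u %g %s %t %e
   arguments in that order, reads the core from stdin and copies a few
   /proc/<pid> files (core reduction, packing and the Connector hand-off are
   left out):

     #!/usr/bin/env python3
     # Sketch: handler for kernel.core_pattern = |<handler> %p %u %g %s %t %e
     import os
     import shutil
     import sys

     CRASH_DIR = "/var/upload"       # crash folder from this document

     def main(argv):
         pid, uid, gid, sig, timestamp, name = argv[1:7]
         out = os.path.join(CRASH_DIR, "crash.%s.%s.%s" % (name, pid, timestamp))
         os.makedirs(out, exist_ok=True)

         # the core dump arrives on stdin (fd 0); a real implementation would
         # reduce it on the fly instead of storing it verbatim
         with open(os.path.join(out, "core"), "wb") as core:
             shutil.copyfileobj(sys.stdin.buffer, core)

         # metadata of the crashed process
         for key, value in (("pid", pid), ("uid", uid), ("gid", gid),
                            ("signal", sig), ("timestamp", timestamp),
                            ("name", name)):
             with open(os.path.join(out, key), "w") as f:
                 f.write(value + "\n")

         # /proc/<pid> of the crashed process is still readable while the
         # kernel waits for this handler to consume the pipe
         for proc_file in ("cmdline", "maps", "smaps", "status"):
             src = "/proc/%s/%s" % (pid, proc_file)
             if os.path.exists(src):
                 shutil.copy(src, os.path.join(out, proc_file))
         return 0

     if __name__ == "__main__":
         sys.exit(main(sys.argv))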

In both Collector cases the information should be sent through a pipe to the
Connector interface to be packed properly and delivered to the Analyzer later;
we must avoid temporary files or any other side modification of the file system
contents to prevent sequential crashes or file system corruption.

In addition to (or as a replacement for) Breakpad's minidump, core reduction
becomes an important part of the picture because it allows recovering the crash
point, the registers and the data in function parameters from about 100KB
instead of the 100MB-3GB VM the process had initially. The core file should be
chopped on the fly, just to speed up processing and reduce memory requirements.
As mentioned above, the application must be checked against an exception list
controlled from device Settings, because in some rare cases a reduced core
might not work or might not be produced well.

Installing the Collector on a usual release should warn the User about the
drawbacks and enable default settings which the User may change. This will
allow us to turn any device into crash-reporting mode, which is very useful if
we see an oddity with a particular device in a User's hands. Until the
almost-final releases the Crash Reporter must be installed on the device and
turned on by default.

Warning: both the kernel panic and the crashes folders MUST NOT be
  re-flashable if they already exist. This will allow us to perform post-mortem
  analysis in case a device gets bricked - just re-flash an old/new build and
  boot.

Connector
---------
The Connector should perform the following important activities:
- keep the available space in the upload folder large enough by killing
  too-frequently produced files, e.g. crashes from the same application
- keep the naming scheme for all types of files, like
    type.MAC.DATE_TIME.application.dump
  examples
    crash.1867b036f310.20131003_132301123.kernel.dump
    crash.1867b036f310.20131003_132301204.systemd.dump
    stats.1867b036f310.20131003_132404215.logs.dump
    stats.1867b036f310.20131003_132404215.power.dump
  Note: the extension .dump is selected because the file may contain crash or
    kernel oops information, a log dump or power management information, and
    should be auto-compressed.
- add the system permanent information collected by Collector:base, like
  = build ID
  = list of installed packages
  = it probably makes sense to also support binary collection - the crashed
    executable and the dependencies known from the maps file
- fill the file upload information, using pre-filled information and a UI
  dialog OR a file with a pre-defined name (/tmp/crash.info) - this is
  necessary to cover User cases (crash during browsing) as well as
  script-based testing - into a
    type.MAC.DATE_TIME.application.dump.info
  file which contains information that is also important for uploading, e.g.
  the server IP which is used when the upload is started.
- after the .info file is filled, the actual upload may start, one file at a
  time, taking into account file upload priority, when an appropriate moment
  happens, e.g. Wi-Fi or Cellular is available, the charge level is OK, the
  device is idling or it is night time, etc. (see the sketch after this list)
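
A possible sketch of that "appropriate moment" check; the battery sysfs path,
the interface name and the thresholds are assumptions that would come from the
real Settings:

  # Sketch: decide whether now is a good moment to start an upload.
  import datetime

  BATTERY = "/sys/class/power_supply/battery/capacity"   # path is an assumption
  WLAN_STATE = "/sys/class/net/wlan0/operstate"           # wlan0 as default interface

  def read_first_line(path, default=""):
      try:
          with open(path) as f:
              return f.readline().strip()
      except OSError:
          return default

  def good_moment_to_upload(min_charge=30, night_start=1, night_end=5):
      wifi_up = read_first_line(WLAN_STATE) == "up"
      charge_ok = int(read_first_line(BATTERY, "0") or 0) >= min_charge
      hour = datetime.datetime.now().hour
      night_time = night_start <= hour <= night_end
      # the real policy (idling, immediate, night) would come from Settings
      return wifi_up and charge_ok and night_time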

The interface to the Connector should be provided by a libuploader library for
use in applications (e.g. the Collector or the connector daemon) or by an
uploader utility which accepts contents and creates a compressed file to upload
on the fly through a pipe:
  cat /var/log/syslog | uploader -t stats -c "Example of logging" logs -f syslog

After the file is created, the upload daemon should finalize the corresponding
.info file and schedule uploading to the appropriate Analyzer server based on
the file type. If the device runs out of space, the oldest file [of the same
type] should be removed.

The Settings for the uploader should control the following options (a sketch
of reading them follows this list):
- allowed interface (e.g. wlan0 by default) and level of utilization (1-100%)
- email of the User for notification of uploads and bug assignment; otherwise
  the User has to track uploads manually on the Analyzer:WebUI page by device
  MAC and upload time
- auto-upload, which prohibits showing any dialogues to the User because they
  may break the test flow or just annoy the User
- policy to upload, i.e. idling, immediate, night
- default text for the description (could be overwritten by the contents of
  /tmp/crash.info)
- list of servers to be used and the selection policy (primary -> backup or
  random selection)
- blacklist of applications not to be reported
- list of applications (not from the blacklist) which should be reported
  without core reduction
- application-specific files, e.g. in case of an Xorg crash we should pack
  Xorg.log.*
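
As an illustration, these Settings could live in a simple INI-style file; a
sketch of reading them, where the file path and the key names are assumptions:

  # Sketch: read the uploader Settings listed above from an INI-style file.
  import configparser

  def load_settings(path="/etc/crash-reporter/uploader.conf"):  # path is an assumption
      config = configparser.ConfigParser()
      config.read(path)
      return {
          "interface":    config.get("upload", "interface", fallback="wlan0"),
          "utilization":  config.getint("upload", "utilization", fallback=50),
          "email":        config.get("upload", "email", fallback=""),
          "auto_upload":  config.getboolean("upload", "auto_upload", fallback=True),
          "policy":       config.get("upload", "policy", fallback="idling"),
          "default_text": config.get("upload", "default_text", fallback=""),
          "servers":      config.get("upload", "servers", fallback="").split(),
          "blacklist":    config.get("apps", "blacklist", fallback="").split(),
          "full_core":    config.get("apps", "full_core", fallback="").split(),
      }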

The priorities for filling the crash info file could be the following (see the
sketch below):
1. what the User enters in the upload dialog, if it is activated
2. what the device has in /tmp/crash.info, if it exists
3. what the device has in Settings
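
A small sketch of that priority order, assuming the dialog text and the
Settings default have already been read into strings (the helper names are
hypothetical):

  # Sketch: pick the crash description according to the priorities above.
  import os

  def read_crash_info(path="/tmp/crash.info"):
      """Return the file contents, or None if the file does not exist."""
      if not os.path.exists(path):
          return None
      with open(path) as f:
          return f.read().strip() or None

  def crash_description(dialog_text, settings_default):
      if dialog_text:                 # 1. what the User entered in the dialog
          return dialog_text
      info = read_crash_info()        # 2. what the device has in /tmp/crash.info
      if info:
          return info
      return settings_default         # 3. the default text from Settings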

See the Formats section for details about the file structure and the Protocol
section for communication protocol information.

Analyzer
--------
In comparison to other available crash reporting facilities, the Analyzer is
the key differentiator which allows significantly improving product quality in
a short time.

First, the collected crashes per application and kernel could be grouped by
the top 5 function names, as was discovered in Meego practice; instead of
hundreds and thousands of crashes, the application support team will have tens
of unique patterns to fix. The same is applicable to the kernel as well (see
the sketch below).
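
To illustrate the grouping, a sketch which derives a crash "signature" from the
top 5 function names of a backtrace, assuming the frames have already been
extracted as a list of function names:

  # Sketch: group crashes by the top 5 function names of their backtrace.
  import collections
  import hashlib

  def crash_signature(frames, depth=5):
      """frames: list of function names, innermost frame first."""
      top = "|".join(frames[:depth])
      return hashlib.sha256(top.encode("utf-8")).hexdigest()

  def group_crashes(crashes):
      """crashes: iterable of (crash_id, frames) -> {signature: [crash_id, ...]}."""
      groups = collections.defaultdict(list)
      for crash_id, frames in crashes:
          groups[crash_signature(frames)].append(crash_id)
      return groups

  # Example: two crashes with the same top frames fall into one group.
  crashes = [("c1", ["strlen", "render_title", "main"]),
             ("c2", ["strlen", "render_title", "main"]),
             ("c3", ["malloc", "load_image", "main"])]
  print({sig[:8]: ids for sig, ids in group_crashes(crashes).items()})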

Second, the developer will have a lot of extra useful information collected at
the moment of the crash: use-case, logs, memory conditions, files. That
simplifies the developer's work a lot and often allows producing a fix without
the need to reproduce a rare crash. Such crashes could be verified after
tracking them on the Analyzer for 2-4 weeks: if the crash is gone or becomes
very rare it might be accepted as closed.

Third, keeping historical information for several products also allows finding
similarities and fixes without real coding, just by re-applying patches from
other components.

And finally, a lot of useful information could be produced from the database on
the fly, for example the time between charging sessions, minimal/average/maximal
memory consumption, lack-of-storage cases, time between reboots or the average
number of crashes per day. These indirect numbers are useful for understanding
release stability and the size of the device population, and for having proven
product readiness numbers.

As mentioned above, the following functional pieces can be identified in the
Analyzer part:
- Analyzer:Uploader - server which gathers uploads (crashes) from the public
    internet. This is the most security-sensitive part of the chain, and it
    probably makes sense to have a number of such servers which are selected
    randomly on the device for load-balancing purposes.
- Analyzer:Dispatcher(s) - takes an upload, unpacks it and does the initial
    handling, like creating a folder and a database entry, sha256 check-summing
    the core file, finding similar cores and, if none exist, pushing it further
    to an available Analyzer:Processor instance, e.g. through Analyzer:Database.
- Analyzer:Database - database server which connects all important pieces
    together, keeps all information and the folders with raw files according to
    their sha256 sums, and provides data on request to all other servers.
- Analyzer:Processor - fetches a crash from the database according to the
    queue, installs the software release for backtracing, e.g. using QEMU or a
    cross-gdb, produces a backtrace, makes a stripped version of the function
    calls without parameters (crash snapshot), and analyzes the delivered files
    to provide used memory figures, free memory on partitions, logs for the
    crashed pid etc. It also sends email when processing is completed, updates
    statistics about processing time, creates bugs in JIRA if necessary and the
    crash was not known before, etc.
- Analyzer:ReleaseManager(s) - server which keeps a pool of the most often used
    releases and debug symbols and provides them on demand to
    Analyzer:Processor(s)
- Analyzer:WebUI - front-end which knows how to deliver information to
    Developers, Managers and Testers, and links crashes to JIRA/Bugzilla. It
    allows uploading crashes manually and puts such uploads at the top of the
    processing queue.
    Analyzer:WebUI should provide 3 types of information:
    = the server statistics, like the number of cores in the queue for
      processing, the average load for the last hour/day/month, the number of
      allocated processors, the pre-cached releases in the pool, and the
      incoming queue of uploads to be processed/already processed.
    = the device population statistics for a specified product and release:
      memory usage, number of devices in the population, time between charging
      sessions, average uptime, number of oopses, number of crashes, etc.
    = the crash statistics; for each product the Customers of the system must
      see the following essential data:
      = a page with application/week/crashes statistics; selecting a particular
        application, week or crash value opens a page with details filtered to
        the selection, e.g. calculator crashes per week or just all crashes of
        the calculator for the pointed week
      = a page with unique crashes, similar to the previous one but with the
        numbers produced based on the top 5 function names in the backtrace
      = a page with applications/bugs, sorted by the number of crashes taking
        into account the uniqueness of the backtrace => i.e. which problem
        should be fixed first
      = any particular crash page, addressable by sha256 checksum:
        * application, signal
        * backtrace (for all threads), parameters from the stack, direct links
          from function names/files to the source code in git (we used Mozilla
          MXR)
        * logs, /proc contents, files, memory consumption
        * moments of crash, upload and processing
        * similar crashes and crashes generated on the same device nearby
          in time (e.g. 15 seconds before/after this particular crash)
        * linked report in Bugzilla/JIRA if a bug is specified
    = any application, week or unique crash should be addressable through the
      WebUI; that is very useful for pointing to crashes directly from JIRA
      or any other reports.

All or some of these servers could be located in the cloud, so we will have an
opportunity to scale up quickly when we have a lot of crashes.

Viewer
------
We did not have this component in the Meego crash reporting, but it seems very
useful to have. The idea of a dump file viewer is pretty simple - allow seeing
the .dump file contents without access to the Analyzer. It is very useful for
debugging the whole crash system and for developers who often like to use their
own libraries and symbols. Viewing the .dump file contents could be done using
command-line tools like lzop, gdb etc., but having a UI is useful because it
allows us to debug the analyser part on the device.

The Viewer should produce a report with the following information:
- type of file
- generic device information
- installed packages
- backtrace (if gdb is installed and symbols are available, see the sketch
  after this list)
- allow unpacking the files into some folder for manual checks
- produce a comprehensive text report to be added e.g. to Bugzilla/JIRA
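
A sketch of the backtrace step, assuming gdb and the matching binary with
symbols are available on the device (the example paths are illustrative):

  # Sketch: produce a backtrace from an unpacked core file with gdb.
  import subprocess

  def backtrace(binary, core):
      """Return the backtrace of all threads as text, or None if gdb fails."""
      cmd = ["gdb", "--batch", "-ex", "thread apply all bt", binary, core]
      try:
          result = subprocess.run(cmd, capture_output=True, text=True, timeout=120)
      except (OSError, subprocess.TimeoutExpired):
          return None
      return result.stdout if result.returncode == 0 else None

  # Example usage:
  #   print(backtrace("/usr/bin/calculator",
  #                   "/var/upload/crash.calculator.1234/core"))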

Formats
-------
Files for exchange should have a pre-defined naming and format. The naming is
expected to be the following:
  type.MAC.DATE_TIME.application.dump
where
  type is the file type, which is problem-specific, e.g. crash, stats, logs etc.
  MAC - MAC address of the device, because it is not guaranteed that the device
    has an IMEI
  DATE_TIME - timestamp of when the file was created, in the format
    YYYYMMDD_HHMMSSmmm
  application - name of the application which resulted in producing this file,
    e.g. logs, systemd, systemui
  .dump - file name extension (could be something else like .data if you like)
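
A sketch of producing such a name, assuming the MAC address has already been
read for the chosen interface; the millisecond part follows the
YYYYMMDD_HHMMSSmmm format above:

  # Sketch: build a type.MAC.DATE_TIME.application.dump file name.
  import datetime

  def dump_file_name(file_type, mac, application, extension="dump"):
      now = datetime.datetime.now()
      # YYYYMMDD_HHMMSSmmm - keep only the millisecond part of the microseconds
      date_time = now.strftime("%Y%m%d_%H%M%S") + "%03d" % (now.microsecond // 1000)
      return "%s.%s.%s.%s.%s" % (file_type, mac, date_time, application, extension)

  # Example, matching the names used elsewhere in this document:
  print(dump_file_name("crash", "1867b036f310", "systemd"))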

The file itself is an archive of lzo-compressed files where new files are
added one by one. The archive must be unpacked after uploading into a folder
with roughly the following structure:
  type.MAC.DATE_TIME.application/
      core       - real or reduced core file collected on the device
      system.log - the last X MB of syslog collected on the device before the
                   moment of the crash
      procfs.json- contents of /proc handled like proc2csv does
      name       - application name as reported by the kernel
      signal     - signal which killed the application
      pid, uid, gid etc. - see the description above
      info       - the description text, if it exists in the .info file
      email      - the email, if it exists in the .info file
      bug        - the title of the bug, if it exists in the .info file
      packages.ls- list of installed packages
      SubDir/    - subfolder, if needed for extra files

The corresponding .info file is named
  type.MAC.DATE_TIME.application.dump.info

and initially contains a number of optional fields which cannot be added at the
moment of .dump file creation:
  - uploader email
  - bug title, if it is allowed to create a new bug in JIRA and assign it to
    the User
  - text with a description (should be the bug summary if one needs to be
    created)
  - etc.

Example of a /tmp/crash.info file which is created ahead of time, before the
crash happens, and should be used to produce the .info file for uploading:
  ENTRY: email
  [email protected]
  ENTRY: bug
  auto-bug for suite BrowserReliability test case YouTube_01
  ENTRY: summary
  Please fill me with details

When the .info file is detected by the Connector, each entry
  ENTRY: entry_name
  line1
  ..
  lineN
will be added to the .dump file as a file with the name entry_name and the
contents line1..lineN, and the original content of the .info file will be
wiped. Then the server IP is selected according to the file type, and if the
server is connected, the .info file is turned into a 2-line content:
  ENTRY: server
  12.122.14.11

After the upload is completed, the .dump and .info files should be erased.
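
A minimal sketch of parsing the ENTRY-based .info format described above; each
entry becomes a name/contents pair which the Connector can then add to the
.dump file:

  # Sketch: parse an .info or /tmp/crash.info file into {entry_name: contents}.
  def parse_info(path):
      entries = {}
      name = None
      lines = []
      with open(path) as f:
          for raw in f:
              line = raw.rstrip("\n")
              if line.startswith("ENTRY: "):
                  if name is not None:
                      entries[name] = "\n".join(lines)
                  name = line[len("ENTRY: "):].strip()
                  lines = []
              elif name is not None:
                  lines.append(line)
      if name is not None:
          entries[name] = "\n".join(lines)
      return entries

  # For the example /tmp/crash.info above this returns the entries "email",
  # "bug" and "summary"; a single "server" entry means the upload target was
  # already selected.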

Protocol
--------
The following important things should be supported in the protocol:
- it should be stateless, to support interruptible uploads
- the channel utilization should be adjusted at runtime and be no more than
  pointed in the settings (50% by default)
- if the file checksum does not match, the server may request a re-upload
- encryption support and certificate verification are mandatory to hide
  sensitive information (symbols, passwords, crash points) from monitoring, so
  an https-based protocol (port 443) must be used. It also fits nicely into the
  usual network infrastructure like firewalls/proxies.

When an .info file is discovered we may have 3 situations:
1. the info file contains sections other than "ENTRY: server"
   => it needs to be processed locally as described above, adding the extra
      data to the .dump file
2. the info file is empty - the server must be selected and connected according
   to the .dump file type
3. the info file contains "ENTRY: server" - connect to that server, or remove
   the contents of the .info file if the connection failed

After the connection to the server is established, the following steps could
be performed for negotiation between the device (D) and the server (S):
D: type.MAC.DATE_TIME.application.dump  <FILE_SIZE>  <FILE_HASH>
  // We have a file for upload with the pointed size and checksum
S: <FILE_POSITION> <MAX_BLOCK_SIZE>
  // Yes, please start from that position; FILE_POSITION > 0 if the upload was
  // interrupted earlier
D: {data_block_sent} * N
  // Sending blocks of data of 4K..MAX_BLOCK_SIZE size until the whole file is
  // loaded
S: <FILE_HASH>
  // Hash of the file (md5, sha1, sha256) computed on the server side; if the
  // file was transferred badly it will NOT be the same as at the beginning of
  // the transfer => the stream is closed, the file is deleted on the server
  // side only, and re-transmission is started

So on the device side a bad file hash or the stream being closed are signals
for re-connection and re-transmission according to the server's requests (see
the sketch below).
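
A rough device-side sketch of this negotiation over a TLS socket; the
line-based framing, the helper names and the block handling are assumptions,
since the document only fixes the order of the messages:

  # Sketch: device-side upload with resume, following the D/S exchange above.
  import hashlib
  import os
  import socket
  import ssl

  def sha256_of(path):
      h = hashlib.sha256()
      with open(path, "rb") as f:
          for chunk in iter(lambda: f.read(65536), b""):
              h.update(chunk)
      return h.hexdigest()

  def upload(host, path, port=443):
      name = os.path.basename(path)
      size = os.path.getsize(path)
      digest = sha256_of(path)
      context = ssl.create_default_context()        # verifies the server certificate
      with socket.create_connection((host, port)) as raw:
          with context.wrap_socket(raw, server_hostname=host) as s:
              f = s.makefile("rwb")
              # D: file name, size and checksum
              f.write(("%s %d %s\n" % (name, size, digest)).encode())
              f.flush()
              # S: resume position and maximum block size
              position, block_size = map(int, f.readline().split())
              # D: send blocks until the whole file is loaded
              with open(path, "rb") as data:
                  data.seek(position)
                  while True:
                      block = data.read(block_size)
                      if not block:
                          break
                      f.write(block)
              f.flush()
              # S: hash computed on the server side
              server_hash = f.readline().decode().strip()
      return server_hash == digest                  # False => re-connect and re-send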

Security
--------
A number of security hardening requirements are presented in this document:
- all data uploading works over https; it should be SSL encrypted and do
  proper certificate verification (see the sketch after this list)
- all configuration files are expected to be integrity-protected
- any form of input device data is privacy-sensitive and is prohibited from
  being uploaded. Touches on the screen could reveal unlock patterns, keys
  typed on the virtual keyboard, etc.
- log collection should be performed taking into account that logs may contain
  sensitive information (e.g. PINs); such information should be filtered out.
  The log size needs to be restricted by some limit to prevent information
  irrelevant to the crash from leaving the device.
- application-specific file collection is expected to handle files only if
  they are accessible from the application UID/GID
- static linking should be used and no shells executed from Collector:crash
- file blacklisting should be used to prevent any application access to
  security-sensitive data
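
For the first requirement, a short sketch of the client-side TLS setup which
both encrypts the channel and verifies the server certificate; the CA bundle
path would be whatever the device ships for the crash reporter:

  # Sketch: TLS with mandatory certificate verification on the device side.
  import socket
  import ssl

  def connect_verified(host, port=443, ca_file=None):
      # ca_file may point to a pinned CA bundle shipped with the crash reporter
      context = ssl.create_default_context(cafile=ca_file)
      context.check_hostname = True
      context.verify_mode = ssl.CERT_REQUIRED
      raw = socket.create_connection((host, port))
      return context.wrap_socket(raw, server_hostname=host)  # raises on a bad certificate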

===[ end of Crash Reporter High Level requirements ]===