Friends,

Please read this; your involvement will make Neo4j more efficient!

A week ago I spiked an idea for storing short string properties more
efficiently.

Since then my colleagues at Neo Technology and I have done some testing,
and the initial measurements look very promising.
We have measured string store sizes decreasing by up to 80%, with the added
benefit that those 80% of the string properties become accessible much
faster.

In order to make these improvements applicable to as many stores as possible
I would like to get some statistics on what the strings stored by actual
applications look like.
For these statistics to be as accurate as possible I need your help.

I have written a small utility that analyzes the string properties stored by
Neo4j and computes some statistics about them.
It would be great if as many of you as possible could run this tool on your
stores and send those statistics to me.

This tool is available for download here:
https://github.com/downloads/thobe/neo4j-admin-store/stringstat.jar

To run it, all you need to do is:
java -jar stringstat.jar /path/to/your/neo4j/store/dir

Then send the output (printed to stdout at the end of execution) in
the STRING STORE STATISTICS block to me via e-mail.

Please keep in mind that even though my tests show that running this tool is
safe and does not modify your datastore, you run it at your own risk.

Since this tool only collects statistics about your string data, and not the
actual data, none of your sensitive information will be leaked by sharing
the statistics from this tool.

Once again, please consider running this tool and submitting the statistics,
since doing so will make Neo4j better at handling YOUR data in the future.

The rest of this email is a technical description of what the tool does and
the statistics it gathers.

This tool will first iterate through all string property records (filtering
out any record that occupies more than one block, since those are too long
to be applicable anyway) and compute the character frequencies in these
strings (under the assumption of three different compression levels).
It will then use this to design three custom compression formats that are
optimized for the short strings in your database.
These compression formats are then used, together with six formats that
would be useful for most databases I've seen, to compute how much your store
would gain from compression.
The resulting statistics on how efficient each compression format would be
are then written to standard out, together with the character frequencies
from step one.
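
As a rough illustration of the two passes described above, here is a minimal
Python sketch. It is not the tool's actual code: the function names, the rule
of picking the 2**bits most frequent characters, the 8-bits-per-character
baseline and the sample strings are all assumptions made for this example.

```python
from collections import Counter


def build_alphabet(strings, bits):
    """Pass 1: count characters and keep the 2**bits most frequent ones.

    The selection rule is an assumption for illustration only.
    """
    counts = Counter(ch for s in strings for ch in s)
    return {ch for ch, _ in counts.most_common(2 ** bits)}


def encodable(s, alphabet):
    """A string fits a format only if every character is in its alphabet."""
    return all(ch in alphabet for ch in s)


def estimated_gain(strings, bits):
    """Pass 2: fraction of bits saved, assuming an 8-bit baseline.

    Strings that do not fit the alphabet are counted at the baseline size.
    """
    alphabet = build_alphabet(strings, bits)
    original = sum(8 * len(s) for s in strings)
    packed = sum(
        (bits if encodable(s, alphabet) else 8) * len(s) for s in strings
    )
    return 1 - packed / original if original else 0.0


# Hypothetical sample data: 12 distinct characters, so a 4 bit format fits.
sample = ["1234", "5678", "0042", "abc"]
print(estimated_gain(sample, 4))
```

The real tool evaluates nine formats (three frequency based, six fixed); the
sketch only shows the frequency based idea for a single bit width.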

Since this process can take some time on a large store, I've added a
progress indicator that prints out while the tool is doing its thing. This
will print a dot for every half percent of progress and print how far
through the store it has gotten (in percent) every ten percent. This
progress indication will be printed twice since the tool iterates through
all string records twice, first to compute the frequencies, and then to
compute the gain of compression.
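
For the curious, a progress indicator of this kind can be sketched in a few
lines of Python. This is only an approximation written to match the
description and the example output below; the function name and its
parameters are made up for the example and are not taken from the tool.

```python
import io
import sys


def report_progress(total, out=sys.stdout):
    """Yield record indexes 0..total-1 while printing progress.

    Prints a dot for every half percent, and the percentage (followed by
    a newline) in place of the dot at every tenth percent.
    """
    steps_done = 0  # half-percent steps printed so far, at most 200
    for i in range(total):
        yield i
        steps_now = (i + 1) * 200 // total
        while steps_done < steps_now:
            steps_done += 1
            if steps_done % 20 == 0:
                out.write("%3d%%\n" % (steps_done // 2))
            else:
                out.write(".")
    out.flush()


# Usage: wrap the iteration over the records (1000 is a made-up count).
for record_index in report_progress(1000):
    pass  # process the record here
```

Running the loop once per pass over the store gives the two progress blocks
mentioned above.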


Here is an example of what the output from running this tool could look
like:

$ java -jar target/stringstat.jar /Users/tobias/tmp/neo4j-db
Computing character frequencies for 12395202 string records
................... 10%
................... 20%
................... 30%
................... 40%
................... 50%
................... 60%
................... 70%
................... 80%
................... 90%
...................100%
Matching potential encodings for 12395202 string records
................... 10%
................... 20%
................... 30%
................... 40%
................... 50%
................... 60%
................... 70%
................... 80%
................... 90%
...................100%
<STRING STORE STATISTICS>
= 4 bit frequencies =
  1: 1327696  2: 1248529  5: 1248296  8: 1247808
  4: 1247263  0: 1246971  3: 1246731  6: 1246669
  7: 1246401  9:  911489  a:  172562  n:  124640
  e:  123376   :   65844  m:   51293  g:   50802
  t:   50388  i:   49993  r:   44902  s:   33587
  k:   27985  b:   27031  .:   22444  o:   19244
  l:   15134  u:   10314  d:    9930  j:    9083
  S:    8593  L:    7706  M:    7564  A:    7258
= 5 bit frequencies =
  a: 268274  n: 267067  e: 256386   : 210979
  i: 196894  r: 154346  l: 134354  o: 131955
  s: 115836  d:  93797  t:  76991  k:  63191
  u:  60566  m:  57359  g:  53697  S:  51287
  .:  50007  @:  49330  v:  46931  h:  42891
  M:  42399  H:  37493  A:  34785  L:  31556
  j:  31315  b:  29391  E:  28335  y:  26427
  T:  26248  B:  24491  c:  23347  K:  22046
  f:  20345  p:  20287  J:  20173  N:  19144
  ?:  16926  W:  16001  G:  14017  R:  13871
  O:  13630  F:  13054  D:  12195  V:  11483
  z:  10023  ?:   8814  ?:   8702  ?:   7903
  x:   6439  U:   5837  C:   5371  I:   4547
  ?:   4491  2:   2703  P:   2637  4:   1218
  w:   1187  Z:   1183  Y:   1178  ,:   1141
  -:    857  ?:    634  1:    632  0:    316
= 6 bit frequencies =
  n: 874452  a: 802639  e: 790998  i: 597945
   : 524600  r: 496695  o: 462202  s: 457249
  l: 429857  d: 327564  t: 310925  .: 217419
  m: 207459  @: 203896  k: 190190  g: 180166
  u: 169502  v: 158119  S: 148109  h: 144968
  M: 118409  H: 101946  b:  93958  c:  85548
  A:  84580  E:  77119  L:  75604  y:  67204
  j:  66711  K:  63279  T:  61919  p:  60935
  f:  60589  B:  51351  J:  48471  ?:  44154
  N:  39113  O:  35251  W:  32999  F:  32734
  G:  30821  R:  30341  z:  27002  V:  25837
  ?:  25434  D:  23706  C:  22801  I:  21277
  ?:  19167  ?:  17891  ?:  14752  P:  13294
  x:  11184  2:  11011  w:   9972  U:   9556
  Y:   7486  -:   4537  Z:   4524  ?:   4496
  1:   4490  4:   3119  q:   1756  0:   1379
  6:   1379  9:   1379  8:   1379  ,:    609
  ?:    342  ?:    294
   9107508 strings with category bitmask              0b0
   2947442 strings with category bitmask        0b1111111
     80361 strings with category bitmask       0b10000000
      8251 strings with category bitmask      0b100000000
     16971 strings with category bitmask      0b100010000
     26235 strings with category bitmask      0b100010001
     20632 strings with category bitmask      0b100010011
     41383 strings with category bitmask      0b101010011
     86358 strings with category bitmask      0b101111111
       508 strings with category bitmask      0b110000000
     32973 strings with category bitmask      0b110010000
     26579 strings with category bitmask      0b110010001
= Category index =
 0. NineSevenBitAscii
 1. LowerCaseHexadecimal
 2. UpperCaseHexadecimal
 3. PunctuatedNumerical
 4. AlphaNumericalName
 5. Numerical
 6. FrequencyBased:4bit
 7. FrequencyBased:5bit
 8. FrequencyBased:6bit
</STRING STORE STATISTICS>
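
One way to read the category bitmask lines is sketched below in Python. Note
that this rests on an assumption that is not spelled out in this email: that
bit i of the mask, counting from the least significant bit, corresponds to
entry i of the category index. The helper name is made up for the example.

```python
# Category names copied from the "Category index" section of the output.
CATEGORIES = [
    "NineSevenBitAscii",
    "LowerCaseHexadecimal",
    "UpperCaseHexadecimal",
    "PunctuatedNumerical",
    "AlphaNumericalName",
    "Numerical",
    "FrequencyBased:4bit",
    "FrequencyBased:5bit",
    "FrequencyBased:6bit",
]


def categories_for(mask):
    """List the categories whose bit is set in mask.

    ASSUMPTION: bit i (least significant first) maps to CATEGORIES[i].
    """
    return [name for bit, name in enumerate(CATEGORIES) if mask & (1 << bit)]


print(categories_for(0b100010001))
```

Under that assumption a mask of 0b0 would mean that none of the nine formats
applies to the string.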


The information I want you to send back to me is the data between
the <STRING STORE STATISTICS> and </STRING STORE STATISTICS> lines.
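
If you have captured the tool's output in a file or a string, something like
the following Python sketch can cut out the block for you. The function name
is made up for this example, and the sketch keeps the two marker lines in the
result so that the block stays self-identifying.

```python
def extract_statistics(output):
    """Return the statistics block, including its opening and closing tags."""
    start_tag = "<STRING STORE STATISTICS>"
    end_tag = "</STRING STORE STATISTICS>"
    start = output.find(start_tag)
    end = output.find(end_tag, start)
    if start == -1 or end == -1:
        raise ValueError("no STRING STORE STATISTICS block found")
    return output[start:end + len(end_tag)]


# Shortened, made-up output used only to demonstrate the function.
captured = (
    "...................100%\n"
    "<STRING STORE STATISTICS>\n"
    "= 4 bit frequencies =\n"
    "</STRING STORE STATISTICS>\n"
)
print(extract_statistics(captured))
```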

May the source be with you,
-- 
Tobias Ivarsson <tobias.ivars...@neotechnology.com>
Hacker, Neo Technology
www.neotechnology.com
Cellphone: +46 706 534857
_______________________________________________
Neo4j mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user
