Friends,

Please read this; your involvement will make Neo4j more efficient!

A week ago I spiked an idea for storing short string properties more efficiently. Since then my colleagues at Neo Technology and I have done some testing, and the initial measurements look very promising. We have measured string store sizes decreasing by up to 80%, with the added benefit that those 80% of the string properties can be accessed much faster.

To make these improvements applicable to as many stores as possible, I would like to get some statistics on what the strings stored by actual applications look like. For these statistics to be as accurate as possible I need your help. I have written a small utility that analyzes the string properties stored by Neo4j and computes some statistics about them. It would be great if as many of you as possible could run this tool on your stores and send me the statistics.

The tool is available for download here:
https://github.com/downloads/thobe/neo4j-admin-store/stringstat.jar

To run it, all you need to do is:

    java -jar stringstat.jar /path/to/your/neo4j/store/dir

Then send me the output in the STRING STORE STATISTICS block (printed to stdout at the end of execution) via e-mail.

Please keep in mind that even though my tests show that running this tool is safe and does not modify your datastore, you run it at your own risk. Since the tool only collects statistics about your string data, and not the actual data, none of your sensitive information will be leaked by sharing the statistics.

Once again, please consider running this tool and submitting the statistics, since doing so will make Neo4j better at handling YOUR data in the future.

The rest of this email is a technical description of what the tool does and the statistics it gathers.

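If you capture the tool's output to a file rather than copying from the terminal, the block to send can also be cut out programmatically. The following is only a minimal sketch, not part of stringstat.jar: the class name StatisticsBlock and the method extract are hypothetical, and the only thing taken from the tool itself is the pair of marker lines it prints.

```java
import java.util.Optional;

// Hypothetical helper, not part of stringstat.jar: cuts the statistics block
// out of the captured output of the tool.
class StatisticsBlock {
    // Marker lines printed by the tool around the statistics.
    static final String START = "<STRING STORE STATISTICS>";
    static final String END = "</STRING STORE STATISTICS>";

    /** Returns the text between the two markers, or empty if either marker is missing. */
    static Optional<String> extract(String output) {
        int start = output.indexOf(START);
        if (start < 0) {
            return Optional.empty();
        }
        int end = output.indexOf(END, start);
        if (end < 0) {
            return Optional.empty();
        }
        return Optional.of(output.substring(start + START.length(), end).trim());
    }

    public static void main(String[] args) {
        // Shortened, made-up sample of captured output, for illustration only.
        String sample = "...................100%\n"
                + START + "\n"
                + "= 4 bit frequencies =\n"
                + "1: 10\n"
                + END + "\n";
        System.out.println(extract(sample).orElse("no statistics block found"));
    }
}
```

The sketch returns an empty Optional when a marker is missing, so a run that was interrupted before the statistics were printed is not mistaken for a complete block.
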
The tool first iterates through all string property records, skipping any record that occupies more than one block (those are too long to be applicable anyway), and computes the character frequencies in these strings under the assumption of three different compression levels. It then uses these frequencies to design three custom compression formats that are optimized for the short strings in your database. These are evaluated together with six formats that would be useful for most databases I've seen, to compute how much your store would gain from compression. The resulting statistics on how efficient each compression format would be are then written to standard out, together with the character frequencies from step one.

Since this process can take some time on a large store, I've added a progress indicator that prints while the tool is running. It prints a dot for every half percent of progress, and every ten percent it prints how far through the store it has gotten (in percent). The progress indication is printed twice, since the tool iterates through all string records twice: first to compute the frequencies, and then to compute the gain of compression.

Here is an example of what the output from running this tool could look like:

$ java -jar target/stringstat.jar /Users/tobias/tmp/neo4j-db
Computing character frequencies for 12395202 string records
................... 10%
................... 20%
................... 30%
................... 40%
................... 50%
................... 60%
................... 70%
................... 80%
................... 90%
...................100%
Matching potential encodings for 12395202 string records
................... 10%
................... 20%
................... 30%
................... 40%
................... 50%
................... 60%
................... 70%
................... 80%
................... 90%
...................100%
<STRING STORE STATISTICS>
= 4 bit frequencies =
1: 1327696
2: 1248529
5: 1248296
8: 1247808
4: 1247263
0: 1246971
3: 1246731
6: 1246669
7: 1246401
9: 911489
a: 172562
n: 124640
e: 123376
 : 65844
m: 51293
g: 50802
t: 50388
i: 49993
r: 44902
s: 33587
k: 27985
b: 27031
.: 22444
o: 19244
l: 15134
u: 10314
d: 9930
j: 9083
S: 8593
L: 7706
M: 7564
A: 7258
= 5 bit frequencies =
a: 268274
n: 267067
e: 256386
 : 210979
i: 196894
r: 154346
l: 134354
o: 131955
s: 115836
d: 93797
t: 76991
k: 63191
u: 60566
m: 57359
g: 53697
S: 51287
.: 50007
@: 49330
v: 46931
h: 42891
M: 42399
H: 37493
A: 34785
L: 31556
j: 31315
b: 29391
E: 28335
y: 26427
T: 26248
B: 24491
c: 23347
K: 22046
f: 20345
p: 20287
J: 20173
N: 19144
?: 16926
W: 16001
G: 14017
R: 13871
O: 13630
F: 13054
D: 12195
V: 11483
z: 10023
?: 8814
?: 8702
?: 7903
x: 6439
U: 5837
C: 5371
I: 4547
?: 4491
2: 2703
P: 2637
4: 1218
w: 1187
Z: 1183
Y: 1178
,: 1141
-: 857
?: 634
1: 632
0: 316
= 6 bit frequencies =
n: 874452
a: 802639
e: 790998
i: 597945
 : 524600
r: 496695
o: 462202
s: 457249
l: 429857
d: 327564
t: 310925
.: 217419
m: 207459
@: 203896
k: 190190
g: 180166
u: 169502
v: 158119
S: 148109
h: 144968
M: 118409
H: 101946
b: 93958
c: 85548
A: 84580
E: 77119
L: 75604
y: 67204
j: 66711
K: 63279
T: 61919
p: 60935
f: 60589
B: 51351
J: 48471
?: 44154
N: 39113
O: 35251
W: 32999
F: 32734
G: 30821
R: 30341
z: 27002
V: 25837
?: 25434
D: 23706
C: 22801
I: 21277
?: 19167
?: 17891
?: 14752
P: 13294
x: 11184
2: 11011
w: 9972
U: 9556
Y: 7486
-: 4537
Z: 4524
?: 4496
1: 4490
4: 3119
q: 1756
0: 1379
6: 1379
9: 1379
8: 1379
,: 609
?: 342
?: 294
9107508 strings with category bitmask 0b0
2947442 strings with category bitmask 0b1111111
80361 strings with category bitmask 0b10000000
8251 strings with category bitmask 0b100000000
16971 strings with category bitmask 0b100010000
26235 strings with category bitmask 0b100010001
20632 strings with category bitmask 0b100010011
41383 strings with category bitmask 0b101010011
86358 strings with category bitmask 0b101111111
508 strings with category bitmask 0b110000000
32973 strings with category bitmask 0b110010000
26579 strings with category bitmask 0b110010001
= Category index =
0. NineSevenBitAscii
1. LowerCaseHexadecimal
2. UpperCaseHexadecimal
3. PunctuatedNumerical
4. AlphaNumericalName
5. Numerical
6. FrequencyBased:4bit
7. FrequencyBased:5bit
8. FrequencyBased:6bit
</STRING STORE STATISTICS>

The information I want you to send back to me is the data between the <STRING STORE STATISTICS> and </STRING STORE STATISTICS> lines.

May the source be with you,
-- 
Tobias Ivarsson <tobias.ivars...@neotechnology.com>
Hacker, Neo Technology
www.neotechnology.com
Cellphone: +46 706 534857
_______________________________________________
Neo4j mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user
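
A note on reading the "category bitmask" lines in the example output above: the email does not spell out the bit order, so the following is only a sketch under the assumption that bit i, counted from the least significant bit, stands for entry i of the category index. The class name CategoryBitmask and the method decode are hypothetical and not part of the tool.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of stringstat.jar.
class CategoryBitmask {
    // Names copied from the "Category index" section of the example output.
    static final String[] CATEGORIES = {
        "NineSevenBitAscii", "LowerCaseHexadecimal", "UpperCaseHexadecimal",
        "PunctuatedNumerical", "AlphaNumericalName", "Numerical",
        "FrequencyBased:4bit", "FrequencyBased:5bit", "FrequencyBased:6bit"
    };

    /**
     * Lists the categories whose bit is set in a mask such as "0b100010001".
     * ASSUMPTION: the least significant bit corresponds to category 0.
     */
    static List<String> decode(String bitmask) {
        String bits = bitmask.startsWith("0b") ? bitmask.substring(2) : bitmask;
        int mask = Integer.parseInt(bits, 2);
        List<String> names = new ArrayList<>();
        for (int i = 0; i < CATEGORIES.length; i++) {
            if ((mask & (1 << i)) != 0) {
                names.add(CATEGORIES[i]);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Prints [NineSevenBitAscii, AlphaNumericalName, FrequencyBased:6bit]
        System.out.println(decode("0b100010001"));
    }
}
```

Under that assumption, 0b0 decodes to an empty list (none of the nine formats applies to those strings), and 0b1111111 decodes to categories 0 through 6.
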