MonetDB: string_imprints - Update comment

2021-09-06 Thread Panagiotis Koutsourakis
Changeset: 63aecf69eb6a for MonetDB
URL: https://dev.monetdb.org/hg/MonetDB/rev/63aecf69eb6a
Modified Files:
gdk/gdk_strimps.c
Branch: string_imprints
Log Message:

Update comment


diffs (173 lines):

diff --git a/gdk/gdk_strimps.c b/gdk/gdk_strimps.c
--- a/gdk/gdk_strimps.c
+++ b/gdk/gdk_strimps.c
@@ -12,11 +12,10 @@
  * A string imprint is an index that can be used as a prefilter in LIKE
  * queries. It has 2 components:
  *
- * - a header of 32 or 64 string element pairs.
+ * - a header of 64 string element pairs.
  *
- * - a 32 or 64 bit mask for each string in the BAT that encodes the
- *   presence or absence of each element of the header in the specific
- *   item.
+ * - a 64 bit mask for each string in the BAT that encodes the presence
+ *   or absence of each element of the header in the specific item.
  *
  * A string imprint is stored in a new Heap in the BAT, aligned in 8
  * byte (64 bit) words.
@@ -24,40 +23,45 @@
  * The first 64 bit word, the header descriptor, describes how the
  * header of the strimp is encoded. The least significant byte (v in the
  * schematic below) is the version number. The second (np) is the number
- * of pairs in the header. The next 2 bytes (hs) is the size of the
- * header in bytes. Finally the fifth byte is the persistence byte. The
- * last 3 bytes needed to align to the 8 byte boundary should be zero,
- * and are reserved for future use.
+ * of pairs in the header. In the current implementation this is always
+ * 64. The next 2 bytes (hs) is the total size of the header in
+ * bytes. Finally the fifth byte is the persistence byte. The last 3
+ * bytes needed to align to the 8 byte boundary should be zero, and are
+ * reserved for future use.
  *
  * The following np bytes are the sizes of the pairs. These can have
  * values from 2 to 8 and are the number of bytes that the corresponding
  * pair takes up. Following that there are the bytes encoding the actual
  * pairs.
  *
- * |   v   |  np   |      hs       |   p   |       reserved        |  8bytes
- * |                                                               |             ---
- *                           Strimp Header                                        |
- * | psz_0 | psz_1 | ...                                           |              |
- * |                                                               |  ---         |
- * |                                                               |np bytes      |
- * |                                                   ... | psz_n |  ---       hs bytes
- * |            pair_0             |            pair_1             |              |
- * |                              ...                              |              |
- * |           pair_k-1            |            pair_k             |              |
- * |                            pair_n                             |              |
- * |                                                               |             ---
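+ * | 1byte | 1byte | 1byte | 1byte | 1byte | 1byte | 1byte | 1byte |
+ * |---------------------------------------------------------------|
+ * |   v   |  np   |      hs       |   p   |       reserved        |  8bytes     ---
+ * |---------------------------------------------------------------|  ___         |
+ * | psz_0 | psz_1 | ...                                           |   |          |
+ * |                                                               |   |          |
+ * |                                                               |np bytes      |
+ * |                                                               |   |          |
+ * |                                                   ... | psz_n |   |       hs bytes
+ * |---------------------------------------------------------------|  ___         |
+ * |            pair_0             |            pair_1             |              |
+ * |                              ...                              |              |
+ * |           pair_k-1            |            pair_k             |              |
+ * |                            pair_n                             |              |
+ * |---------------------------------------------------------------|             ---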
  *
  *
- * The bitmasks for each string in the BAT follow after this.
+ * The bitmasks for each string in the BAT follow after this, aligned to
+ * the string BAT.
  *
  * Strimp creation goes as follows:
  *
  * - Construct a histogram of the element (byte or character) pairs for
  *   all the strings in the BAT.
  *
- * - Take the 32/64 most frequent pairs as the Strimp Header.
+ * - Take the 64 most frequent pairs as the Strimp Header.
  *
- * - For each string in the bat construct a 32/64 bit mask that encodes
+ * - For each string in the bat construct a 64 bit mask that encodes
  *   the presence or absence of each member of the header in the string.
  */
 
@@ -80,8 +84,8 @@
 #define NPAIRS(d) (((d) >> 8) & 0xff)
 #define HSIZE(d) (((d) >> 16) & 0x)
 
-#undef UTF8STRINGS 
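
Reader's note on the header descriptor described in the comment above: a minimal sketch, in C, of how the 64 bit descriptor word could be packed and unpacked. Only the NPAIRS shift and mask are taken from the diff itself. The SK_ names, the 0xffff mask for hs, and the position of the persistence byte are assumptions inferred from the comment text, and the header size of 200 used in the check is an arbitrary value. This is not the code in gdk/gdk_strimps.c.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical field extractors for the 64 bit header descriptor.
 * Layout assumed from the comment: byte 0 = version (v), byte 1 =
 * number of pairs (np), bytes 2-3 = header size (hs), byte 4 =
 * persistence byte (p), bytes 5-7 = reserved and zero. */
#define SK_VERSION(d) ((d) & 0xff)
#define SK_NPAIRS(d)  (((d) >> 8) & 0xff)     /* same shift/mask as NPAIRS in the diff */
#define SK_HSIZE(d)   (((d) >> 16) & 0xffff)  /* mask assumed: hs is 2 bytes wide */
#define SK_PERSIST(d) (((d) >> 32) & 0xff)    /* position assumed: "the fifth byte" */

/* Pack the four fields into one descriptor word; reserved bytes stay 0. */
static uint64_t
sk_make_descriptor(uint8_t version, uint8_t npairs, uint16_t hsize, uint8_t persist)
{
	return (uint64_t) version
		| ((uint64_t) npairs << 8)
		| ((uint64_t) hsize << 16)
		| ((uint64_t) persist << 32);
}

/* The comment requires the last 3 bytes of the descriptor to be zero. */
static int
sk_reserved_is_zero(uint64_t d)
{
	return (d >> 40) == 0;
}

/* Round-trip check with an arbitrary header size of 200 bytes. */
static void
sk_check_roundtrip(void)
{
	uint64_t d = sk_make_descriptor(1, 64, 200, 0);

	assert(SK_VERSION(d) == 1);
	assert(SK_NPAIRS(d) == 64);
	assert(SK_HSIZE(d) == 200);
	assert(SK_PERSIST(d) == 0);
	assert(sk_reserved_is_zero(d));
}
```

Keeping every field in a single 64 bit word means the descriptor can be read with one aligned load, which fits the statement that the strimp heap is aligned in 8 byte words.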

MonetDB: string_imprints - Update comment

2021-04-06 Thread Panagiotis Koutsourakis
Changeset: 0cc344ae7097 for MonetDB
URL: https://dev.monetdb.org/hg/MonetDB/rev/0cc344ae7097
Modified Files:
gdk/gdk_strimps.c
Branch: string_imprints
Log Message:

Update comment


diffs (45 lines):

diff --git a/gdk/gdk_strimps.c b/gdk/gdk_strimps.c
--- a/gdk/gdk_strimps.c
+++ b/gdk/gdk_strimps.c
@@ -16,17 +16,33 @@
  * - a 64 bit mask for each item in the BAT that encodes the presence or
  *   absence of each element of the header in the specific item.
  *
- * A string imprint is stored in a new Heap in the BAT.
+ * A string imprint is stored in a new Heap in the BAT, aligned in 8
+ * byte (64 bit) words.
  *
- * In the current (byte pair) implementation the first 136 bytes
- * (i.e. the first 17 64 bit quantities) in the Heap are as follows:
+ * The first 64 bit word describes how the header of the strimp is
+ * encoded. The most significant byte (v in the schematic below) is the
+ * version number. The second (np) is the number of pairs in the
+ * header. The third (b/p) is the number of bytes per pair if each pair
+ * is encoded using a constant number of bytes or 0 if it is utf-8. The
+ * next 2 bytes (hs) is the size of the header in bytes. The last 3
+ * bytes needed to align to the 8 byte boundary should be zero, and are
+ * reserved for future use.
+ *
+ * In the current implementation we use 64 byte pairs for the header, so
  *
- * |                       Version Number                      |   -
- * | byte pair 01 | byte pair 02 | byte pair 03 | byte pair 04 |   |
- * | byte pair 05 | byte pair 06 | byte pair 07 | byte pair 08 |   |  17 64 bit quantities
- * [...]                                                           |
- * | byte pair 61 | byte pair 62 | byte pair 63 | byte pair 64 |   -
+ * np  == 64
+ * b/p == 2
+ * hs  == 128
+ *
+ * The actual header follows. If it ends before an 8 byte boundary it
+ * is padded with zeros.
  *
+ * |  v   |  np   |  b/p |  hs  | reserved |  8bytes
+ * |                                       |---
+ *              Strimp Header              |
+ * |                                       |  hs bytes + padding
+ * |                                       | |
+ * |                                       |---
  * The bitmasks for each string in the BAT follow after this.
  *
  * Strimp creation goes as follows:
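
Reader's note on the creation steps listed in the comments of these changesets (histogram of pairs, take the 64 most frequent, build one 64 bit mask per string): a minimal sketch in C, assuming plain byte pairs. The comment also allows character pairs of 2 to 8 bytes, which this sketch does not handle. All sk_ names are hypothetical, and the selection loop and the prefilter test are illustrative choices, not the code in gdk/gdk_strimps.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SK_HEADER_PAIRS 64     /* header size stated in the comment */
#define SK_NBYTEPAIRS   65536  /* all possible 2 byte combinations */

/* Step 1: count every adjacent byte pair over all strings.
 * hist must have room for SK_NBYTEPAIRS counters. */
static void
sk_histogram(const char *const *strs, size_t n, uint64_t *hist)
{
	memset(hist, 0, SK_NBYTEPAIRS * sizeof(uint64_t));
	for (size_t i = 0; i < n; i++) {
		const unsigned char *s = (const unsigned char *) strs[i];
		for (; s[0] != 0 && s[1] != 0; s++)
			hist[((size_t) s[0] << 8) | s[1]]++;
	}
}

/* Step 2: pick up to SK_HEADER_PAIRS most frequent pairs by repeated
 * selection. Chosen counters are zeroed, so hist is consumed.
 * Returns the number of pairs stored in header. */
static size_t
sk_pick_header(uint64_t *hist, uint16_t *header)
{
	size_t k;

	for (k = 0; k < SK_HEADER_PAIRS; k++) {
		size_t best = 0;
		for (size_t p = 1; p < SK_NBYTEPAIRS; p++)
			if (hist[p] > hist[best])
				best = p;
		if (hist[best] == 0)
			break;	/* fewer distinct pairs than header slots */
		header[k] = (uint16_t) best;
		hist[best] = 0;
	}
	return k;
}

/* Step 3: bit i of the mask is set iff header[i] occurs in str. */
static uint64_t
sk_bitmask(const char *str, const uint16_t *header, size_t npairs)
{
	const unsigned char *s = (const unsigned char *) str;
	uint64_t mask = 0;

	for (; s[0] != 0 && s[1] != 0; s++) {
		uint16_t pair = (uint16_t) ((s[0] << 8) | s[1]);
		for (size_t i = 0; i < npairs; i++)
			if (header[i] == pair)
				mask |= UINT64_C(1) << i;
	}
	return mask;
}

/* Prefilter: a string can contain a literal only if its mask covers
 * every header pair that occurs in the literal. */
static int
sk_may_match(uint64_t strmask, uint64_t litmask)
{
	return (strmask & litmask) == litmask;
}
```

The prefilter can report false positives (a string may contain the right pairs in the wrong places) but no false negatives for a literal substring, so candidates that pass still need the full LIKE comparison.
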