Changeset: 63aecf69eb6a for MonetDB
URL: https://dev.monetdb.org/hg/MonetDB/rev/63aecf69eb6a
Modified Files:
gdk/gdk_strimps.c
Branch: string_imprints
Log Message:
Update comment
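
The comment updated by this changeset describes a 64-bit header descriptor (version, number of pairs, header size, persistence byte) and a 64-bit mask per string. A minimal sketch of how such a descriptor could be packed and unpacked, and how a mask could be built, is given here. It is not the code in gdk_strimps.c. The SKETCH_* macros and sketch_* functions are hypothetical names; only the NPAIRS shift and mask are taken from the diff. The sketch also assumes every header pair is exactly 2 bytes, whereas the comment allows 2 to 8, and the containment test in sketch_may_match is an assumption about how a prefilter would use the masks.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Field extractors for the 64-bit header descriptor, following the
 * layout in the comment: v | np | hs (2 bytes) | p | 3 reserved bytes.
 * Only the NPAIRS shift/mask appears in the diff; the rest is assumed. */
#define SKETCH_VERSION(d)	((d) & 0xff)
#define SKETCH_NPAIRS(d)	(((d) >> 8) & 0xff)
#define SKETCH_HSIZE(d)		(((d) >> 16) & 0xffff)
#define SKETCH_PERSIST(d)	(((d) >> 32) & 0xff)

/* Pack the four fields into one descriptor word (hypothetical helper).
 * The three most significant bytes stay zero, as the comment requires. */
static uint64_t
sketch_make_descriptor(uint8_t version, uint8_t npairs,
		       uint16_t hsize, uint8_t persist)
{
	return (uint64_t) version
		| ((uint64_t) npairs << 8)
		| ((uint64_t) hsize << 16)
		| ((uint64_t) persist << 32);
}

/* Build the mask for one string: bit j is set when header pair j
 * occurs somewhere in s. Simplification: all pairs are 2 bytes long. */
static uint64_t
sketch_make_mask(const char *s, const uint8_t (*pairs)[2], size_t npairs)
{
	uint64_t mask = 0;
	size_t len = strlen(s);

	for (size_t i = 0; i + 1 < len; i++) {
		for (size_t j = 0; j < npairs && j < 64; j++) {
			if ((uint8_t) s[i] == pairs[j][0] &&
			    (uint8_t) s[i + 1] == pairs[j][1])
				mask |= UINT64_C(1) << j;
		}
	}
	return mask;
}

/* Prefilter test (assumed usage): a string can only contain the query
 * substring if every header pair of the query is also set in its mask. */
static int
sketch_may_match(uint64_t string_mask, uint64_t query_mask)
{
	return (string_mask & query_mask) == query_mask;
}
```

Under these assumptions a string whose mask lacks a bit that the query sets can be skipped without running the full LIKE comparison. A string that passes the test still has to be checked, since the mask records only which pairs occur, not where.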
diffs (173 lines):
diff --git a/gdk/gdk_strimps.c b/gdk/gdk_strimps.c
--- a/gdk/gdk_strimps.c
+++ b/gdk/gdk_strimps.c
@@ -12,11 +12,10 @@
* A string imprint is an index that can be used as a prefilter in LIKE
* queries. It has 2 components:
*
- * - a header of 32 or 64 string element pairs.
+ * - a header of 64 string element pairs.
*
- * - a 32 or 64 bit mask for each string in the BAT that encodes the
- * presence or absence of each element of the header in the specific
- * item.
+ * - a 64 bit mask for each string in the BAT that encodes the presence
+ * or absence of each element of the header in the specific item.
*
* A string imprint is stored in a new Heap in the BAT, aligned in 8
* byte (64 bit) words.
@@ -24,40 +23,45 @@
* The first 64 bit word, the header descriptor, describes how the
* header of the strimp is encoded. The least significant byte (v in the
* schematic below) is the version number. The second (np) is the number
- * of pairs in the header. The next 2 bytes (hs) is the size of the
- * header in bytes. Finally the fifth byte is the persistence byte. The
- * last 3 bytes needed to align to the 8 byte boundary should be zero,
- * and are reserved for future use.
+ * of pairs in the header. In the current implementation this is always
+ * 64. The next 2 bytes (hs) is the total size of the header in
+ * bytes. Finally the fifth byte is the persistence byte. The last 3
+ * bytes needed to align to the 8 byte boundary should be zero, and are
+ * reserved for future use.
*
* The following np bytes are the sizes of the pairs. These can have
* values from 2 to 8 and are the number of bytes that the corresponding
* pair takes up. Following that there are the bytes encoding the actual
* pairs.
*
- * |   v   |  np   |      hs       |   p   |       reserved        |   8bytes
- * |                                                               |               ---
- *                           Strimp Header                                          |
- * | psz_0 | psz_1 | ...                                           |                |
- * |                                                               |    ---         |
- * |                                                               |  np bytes      |
- * |                                               ...     | psz_n |    ---      hs bytes
- * | pair_0 | pair_1|                                                               |
- * |...|                                                                            |
- * | pair_k-1 | pair_k |                                                            |
- * | pair_n |                                                                       |
- * |                                                               |               ---
+ * | 1byte | 1byte | 1byte | 1byte | 1byte | 1byte | 1byte | 1byte |
+ * |---------------------------------------------------------------|
+ * |   v   |  np   |      hs       |   p   |       reserved        |   8bytes      ---
+ * |---------------------------------------------------------------|    ___         |
+ * | psz_0 | psz_1 | ...                                           |     |          |
+ * |                                                               |     |          |
+ * |                                                               |  np bytes      |
+ * |                                                               |     |          |
+ * |                                               ...     | psz_n |     |       hs bytes
+ * |---------------------------------------------------------------|    ___         |
+ * | pair_0| pair_1|                                                                |
+ * | ... |                                                                          |
+ * | pair_k-1 | pair_k |                                                            |
+ * | pair_n |                                                                       |
+ * |---------------------------------------------------------------|               ---
*
*
- * The bitmasks for each string in the BAT follow after this.
+ * The bitmasks for each string in the BAT follow after this, aligned to
+ * the string BAT.
*
* Strimp creation goes as follows:
*
* - Construct a histogram of the element (byte or character) pairs for
* all the strings in the BAT.
*
- * - Take the 32/64 most frequent pairs as the Strimp Header.
+ * - Take the 64 most frequent pairs as the Strimp Header.
*
- * - For each string in the bat construct a 32/64 bit mask that encodes
+ * - For each string in the bat construct a 64 bit mask that encodes
* the presence or absence of each member of the header in the string.
*/
@@ -80,8 +84,8 @@
#define NPAIRS(d) (((d) >> 8) & 0xff)
#define HSIZE(d) (((d) >> 16) & 0xffff)
-#undef UTF8STRINGS