John,

On Tue, Feb 14, 2012 at 3:33 PM, K. John Wu <[email protected]> wrote:

> Hi, Dominique,
>
> It is not a problem that the min/max are not updated.  FastBit will
> figure out that the corresponding index actually has the min and max
> information during query answering time.


But doesn't it take time to figure out the min/max, or is this a function
of the indexing step, like a full pass across the data?


> If your metadata files do not have min/max, don't worry about it.
>

As such, I assume having a min/max in the metadata doesn't speed anything
up at all.

QUESTION: So is it true that in the current implementation of FastBit I do
not need to set "minimum" or "maximum", as I get no benefit from setting
them in the part.txt (or metadata) file at all? No indexing speedup and no
query speedup?

Thanks in Advance

Jon Strabala


> Regards,
>
> John
>
>

Below is just some food for thought. I am just thinking out loud here,
without really understanding the internals of FastBit.

But I would imagine something like storing dates as seconds since the epoch
might benefit (perhaps not under the current design). Assume I store
1,000,000 samples

from

Tue Feb 14 13:58:35 HST 2012
1329263915

to

Tue Feb 14 13:59:33 HST 2012
1329263973


Obviously the span in seconds is only 58, given the min (1329263915) and
the max (1329263973). Knowing or trusting this, wouldn't it be possible and
faster to build a minimal index, dropping a data scan and storing less data
in the "build" process, by using bytes or unsigned bytes with the knowledge
that an offset is applied?

The columnar data could even be stored as a set of bytes instead of
integers or longs.
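
To make the idea concrete, here is a rough sketch in Python (not FastBit
code; the function names pack_with_offset and unpack_with_offset are made up
for illustration). It assumes the min/max can be trusted and that the span
fits in one unsigned byte.

```python
from array import array


def pack_with_offset(values, minimum, maximum):
    """Store each value as an unsigned byte relative to `minimum`."""
    if maximum - minimum > 255:
        raise ValueError("span does not fit in one unsigned byte")
    # array("B", ...) raises OverflowError if any value falls outside 0..255,
    # which is what happens when the declared min/max cannot be trusted.
    return array("B", (v - minimum for v in values))


def unpack_with_offset(packed, minimum):
    """Rebuild the true values by adding the offset back."""
    return [b + minimum for b in packed]


timestamps = [1329263915, 1329263940, 1329263973]
packed = pack_with_offset(timestamps, 1329263915, 1329263973)
restored = unpack_with_offset(packed, 1329263915)
```

Here each packed value takes one byte, and adding the offset back recovers
the original timestamps.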

Normally in FastBit we might have the following part.txt:

Begin Column
name = "TIMESTAMP"
data_type = "LONG"
minimum = 1329263915
maximum = 1329263973
index = <binning precision=4/><encoding interval-equality/>
End Column


But we could enhance FastBit and have the following part.txt, where there
are two new directives: "data_storage_type" (to store a value in something
smaller than the type it represents in the columnar data) and
"data_storage_offset" (the value to add to each stored value to get the
true data).

Begin Column
name = "TIMESTAMP"
data_type = "LONG"
data_storage_type = "BYTE"
data_storage_offset = 1329263915
minimum = 0
maximum = 58
index = <binning precision=4/><encoding interval-equality/>
End Column


Of course there are lots of ways to represent this in part.txt. We might
even have a directive "pack_columnar_data = TRUE" which uses the min/max to
figure out the smallest unit for the columnar data based on the "minimum",
the "maximum", and the "data_type". This is just food for thought.
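
A rough sketch of how such a directive might pick the storage unit, again in
Python and purely hypothetical (smallest_storage_type is a made-up helper,
and the type labels are only illustrative, not taken from the FastBit
source):

```python
def smallest_storage_type(minimum, maximum):
    """Pick the narrowest unsigned width that can hold (value - minimum)."""
    span = maximum - minimum
    if span < 0:
        raise ValueError("maximum is smaller than minimum")
    # Illustrative labels only; widths are in bits.
    for name, bits in (("UBYTE", 8), ("USHORT", 16), ("UINT", 32), ("ULONG", 64)):
        if span < (1 << bits):
            return name
    raise ValueError("span does not fit in 64 bits")


chosen = smallest_storage_type(1329263915, 1329263973)
```

For the timestamps above the span is 58, so a single byte per value would be
enough.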

Begin Column
name = "TIMESTAMP"
data_type = "LONG"
pack_columnar_data = "TRUE"
minimum = 1329263915
maximum = 1329263973
index = <binning precision=4/><encoding interval-equality/>
End Column


This might save a lot of space if we are storing types such as epoch
timestamps (seconds since epoch or millis since epoch) in small ordered (or
semi-ordered) time series directories.
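
As back-of-the-envelope arithmetic for the 1,000,000 samples above (a Python
sketch that counts only the raw column data and ignores index files; the
names are made up for illustration):

```python
ROWS = 1_000_000

# Bytes per stored value: an 8-byte LONG versus a 1-byte packed offset.
BYTES_PER_VALUE = {"LONG": 8, "BYTE": 1}


def column_bytes(rows, storage_type):
    """Raw size of one column, ignoring any index or metadata files."""
    return rows * BYTES_PER_VALUE[storage_type]


as_long = column_bytes(ROWS, "LONG")   # 8,000,000 bytes
as_byte = column_bytes(ROWS, "BYTE")   # 1,000,000 bytes
saved_fraction = 1 - as_byte / as_long
```

So the raw column would shrink from roughly 8 MB to 1 MB, a saving of 87.5%.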

Of course we would have to reconstruct the values and/or indices with the
"data_storage_offset", or with the implied data_storage_offset, which is
the value of "minimum". This would cost some CPU.
_______________________________________________
FastBit-users mailing list
[email protected]
https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
