John,

On Tue, Feb 14, 2012 at 3:33 PM, K. John Wu <[email protected]> wrote:
> Hi, Dominique,
>
> It is not a problem that the min/max are not updated. FastBit will
> figure out that the corresponding index actually have the min and max
> information during query answering time.

But doesn't it take time to figure out the min/max, or is this a function of the indexing step, like a full pass across the data?

> If you are metadata files do not have min/max, don't worry about it.

As such, I assume having a min/max in the metadata doesn't speed anything up at all.

*QUESTION: So is it true that in the current implementation of FastBit I do not need to set "minimum" or "maximum", as I get no benefit from setting them in the part.txt (or metadata) file at all? No indexing speedup and no query speedup?*

Thanks in advance,

Jon Strabala

> Regards,
>
> John

Below is just some food for thought. I am just thinking out loud here, without really understanding the internals of FastBit. But I would imagine that something like storing dates as seconds since the epoch might benefit (perhaps not under the current design).

Assume I store 1,000,000 samples from

Tue Feb 14 13:58:35 HST 2012    1329263915

to

Tue Feb 14 13:59:33 HST 2012    1329263973

Obviously the span in seconds is only 58, given the min (1329263915) and the max (1329263973). Knowing or trusting this, wouldn't it be possible and faster to build a minimal index, dropping a data scan and storing less data in the "build" process by using bytes or unsigned bytes with a known offset applied? The columnar data itself could even be stored as a set of bytes instead of integers or longs.
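To make the offset idea concrete, here is a minimal sketch in Python. It is not FastBit code and says nothing about how FastBit stores columns internally; the function names `encode_with_offset` and `decode_with_offset` are made up for illustration. It stores each timestamp as one unsigned byte relative to the minimum and rebuilds the original values by adding the offset back.

```python
from array import array


def encode_with_offset(values, offset):
    """Pack values as unsigned bytes holding (value - offset).

    array('B') raises OverflowError if any difference falls outside
    0..255, so a wrong or stale minimum is detected rather than
    silently corrupting the data.
    """
    return array("B", (v - offset for v in values))


def decode_with_offset(stored, offset):
    """Recover the original values by adding the offset back."""
    return [offset + b for b in stored]


# Three sample timestamps inside the 58-second window from the example.
timestamps = [1329263915, 1329263940, 1329263973]
minimum = min(timestamps)

packed = encode_with_offset(timestamps, minimum)
restored = decode_with_offset(packed, minimum)

# One byte per value instead of eight for a 64-bit integer.
packed_bytes = len(packed) * packed.itemsize
plain_bytes = len(timestamps) * array("q").itemsize
```

The trade-off is the one already noted: every read pays for an addition, and the encoding only stays valid as long as no value outside the trusted min/max range is ever appended.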
Normally in FastBit we might have the following part.txt:

Begin Column
name = "TIMESTAMP"
data_type = "LONG"
minimum = 1329263915
maximum = 1329263973
index = <binning precision=4/><encoding interval-equality/>
End Column

But we could enhance FastBit to accept the following part.txt, where there are two new directives: "data_storage_type" (to store a value in the columnar data using something smaller than the type it represents) and "data_storage_offset" (the value to add to each stored value to recover the true data; "minimum" and "maximum" are then relative to that offset).

Begin Column
name = "TIMESTAMP"
data_type = "LONG"
data_storage_type = "BYTE"
data_storage_offset = 1329263915
minimum = 0
maximum = 58
index = <binning precision=4/><encoding interval-equality/>
End Column

Of course there are lots of ways to represent this in part.txt. We might even have a directive such as "pack_columnar_data = TRUE", which uses "minimum", "maximum", and "data_type" to figure out the smallest unit for the columnar data. This is just food for thought.

Begin Column
name = "TIMESTAMP"
data_type = "LONG"
pack_columnar_data = "TRUE"
minimum = 1329263915
maximum = 1329263973
index = <binning precision=4/><encoding interval-equality/>
End Column

This might save a lot of space if we are storing types such as epoch timestamps (seconds since epoch or millis since epoch) in small ordered (or semi-ordered) time-series directories. Of course we would have to reconstruct the values and/or indices using "data_storage_offset", or the implied offset, which is the value of "minimum". This would cost some CPU.
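As a rough illustration of what a "pack" directive would have to decide, the Python sketch below picks the narrowest unsigned storage type that can hold every value once the minimum is subtracted. Nothing here is an existing FastBit feature: the table `STORAGE_TYPES`, the function `choose_storage`, and the type labels are assumptions made for the example.

```python
# Candidate storage types, narrowest first: (label, bytes, largest value).
# The labels are illustrative; they are not taken from FastBit's parser.
STORAGE_TYPES = [
    ("UBYTE", 1, 2**8 - 1),
    ("USHORT", 2, 2**16 - 1),
    ("UINT", 4, 2**32 - 1),
    ("ULONG", 8, 2**64 - 1),
]


def choose_storage(minimum, maximum):
    """Return (label, bytes_per_value, offset) for the given value range.

    The offset is the minimum itself, so stored values run from
    0 to (maximum - minimum).
    """
    if maximum < minimum:
        raise ValueError("maximum must not be smaller than minimum")
    span = maximum - minimum
    for label, width, largest in STORAGE_TYPES:
        if span <= largest:
            return label, width, minimum
    raise ValueError("range does not fit in 64 bits")


# The 58-second window from the example fits in a single byte.
label, width, offset = choose_storage(1329263915, 1329263973)

# Estimated column size for 1,000,000 samples, packed vs. 8-byte values.
samples = 1_000_000
packed_size = samples * width
plain_size = samples * 8
```

For the example column this works out to roughly 1 MB instead of 8 MB of raw data, provided the declared "minimum" and "maximum" can actually be trusted.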
_______________________________________________
FastBit-users mailing list
[email protected]
https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
