Hi, Jon,

The min and max in the metadata file are only used for a small number of things. For example, if you ask FastBit to compute a histogram on a column and have asked FastBit to decide automatically how to bin the data, then it will have to figure out what the min and max are. Likewise, if you ask FastBit to build certain binned indexes, it will have to figure out what the min and max are.
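To illustrate why automatic binning needs the bounds, here is a rough sketch in plain Python. It is not FastBit code or its API; the function name `equal_width_histogram` and its parameters are made up for this example. It assumes simple equal-width bins: if trusted min/max values are supplied, they are used directly; otherwise one extra pass over the data is needed to find them.

```python
def equal_width_histogram(values, nbins, vmin=None, vmax=None):
    """Count values into nbins equal-width bins over [vmin, vmax].

    Hypothetical helper for illustration only; not part of FastBit.
    """
    if vmin is None or vmax is None:
        # No bounds recorded in the metadata: scan the data once to find them.
        vmin, vmax = min(values), max(values)
    # Guard against a zero-width range when all values are equal.
    width = (vmax - vmin) / nbins or 1.0
    counts = [0] * nbins
    for v in values:
        # Clamp so that vmax (and any out-of-range value) lands in an edge bin.
        i = int((v - vmin) / width)
        counts[max(0, min(i, nbins - 1))] += 1
    return vmin, vmax, counts


# Bounds discovered by scanning:
print(equal_width_histogram(list(range(11)), 5))
# Bounds taken from metadata, so the extra scan is skipped:
print(equal_width_histogram(list(range(11)), 5, vmin=0, vmax=10))
```

Both calls return the same bins; the only difference is whether the extra pass over the data is needed.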
The presence of min and max does not improve query answering if an index is present. If there isn't an index, there is a possibility that the query can be resolved just by knowing the min and max. However, as you might imagine, this case would be rare.

John

On 2/14/12 4:21 PM, Jon Strabala wrote:
> John,
>
> On Tue, Feb 14, 2012 at 3:33 PM, K. John Wu wrote:
>
> > Hi, Dominique,
> >
> > It is not a problem that the min/max are not updated. FastBit will
> > figure out that the corresponding index actually has the min and max
> > information during query answering time.
>
> But doesn't it take time to figure out the min/max, or is this a function
> of the indexing step, like a full pass across the data?
>
> > If your metadata files do not have min/max, don't worry about it.
>
> As such, I assume having a min/max in the metadata doesn't speed anything
> up at all.
>
> QUESTION: So is it true that in the current implementation of FastBit I do
> not need to set "minimum" or "maximum", as I get no benefit from setting
> them in the part.txt (or metadata) file at all? No indexing speedup and no
> query speedup?
>
> Thanks in advance,
>
> Jon Strabala
>
> > Regards,
> >
> > John
>
> Below is just some food for thought; I am just thinking here, without
> really understanding the internals of FastBit.
>
> But I would imagine something like storing dates as seconds since the
> epoch might benefit (perhaps not under the current design). Assume I store
> 1,000,000 samples from
>
>     Tue Feb 14 13:58:35 HST 2012
>     1329263915
>
> to
>
>     Tue Feb 14 13:59:33 HST 2012
>     1329263973
>
> Obviously the span in seconds is only 58, given the min (1329263915) and
> the max (1329263973). Knowing or trusting this, wouldn't it be possible and
> faster to build a minimal index, dropping a data scan and storing less data
> in the "build" process by using bytes or unsigned bytes, knowing that an
> offset is applied? And even storing the columnar data as a set of bytes
> instead of integers or longs?
>
> Normally in FastBit we might have the following part.txt:
>
> Begin Column
> name = "TIMESTAMP"
> data_type = "LONG"
> minimum = 1329263915
> maximum = 1329263973
> index = <binning precision=4/><encoding interval-equality/>
> End Column
>
> But we could enhance FastBit and have the following part.txt, where there
> are two new directives: "data_storage_type" (to store a value in something
> smaller than the type it represents in the columnar data) and
> "data_storage_offset" (the value to add to the "minimum" to get the true
> data).
>
> Begin Column
> name = "TIMESTAMP"
> data_type = "LONG"
> data_storage_type = "BYTE"
> data_storage_offset = 1329263915
> minimum = 0
> maximum = 58
> index = <binning precision=4/><encoding interval-equality/>
> End Column
>
> Of course there are lots of ways to represent the part.txt; we might even
> have a directive "pack = TRUE" which uses min/max and figures out the
> smallest unit for the columnar data based on the "minimum", the "maximum"
> and the "data_type". This is just food for thought.
>
> Begin Column
> name = "TIMESTAMP"
> data_type = "LONG"
> pack_columnar_data = "TRUE"
> minimum = 1329263915
> maximum = 1329263973
> index = <binning precision=4/><encoding interval-equality/>
> End Column
>
> This might save a lot of space if we are storing our types, for example
> epoch timestamps (seconds since epoch or millis since epoch), in small
> ordered (or semi-ordered) time series directories.
>
> Of course we would have to reconstruct the values and/or indices with the
> "data_storage_offset", or the implied data_storage_offset which is the
> value of "minimum"; this would cost some CPU.
>
> _______________________________________________
> FastBit-users mailing list
> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
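
The offset-packing idea proposed in the quoted message can be sketched in a few lines of plain Python. This is only an illustration of the proposal, not an existing FastBit feature; `pack_with_offset` and `unpack` are hypothetical names, and the sketch assumes integer values whose span (max minus min) fits in 64 bits. It picks the smallest unsigned storage type that can hold the span and stores each value as its distance from the minimum.

```python
from array import array


def pack_with_offset(values):
    """Store integers as (offset, small unsigned deltas).

    Hypothetical helper for illustration only; not part of FastBit.
    """
    offset = min(values)
    span = max(values) - offset
    # Try unsigned 1-, 2-, 4- and 8-byte types (sizes taken from the platform).
    for code in ("B", "H", "I", "Q"):
        if span < (1 << (8 * array(code).itemsize)):
            return offset, array(code, (v - offset for v in values))
    raise OverflowError("span does not fit in 64 bits")


def unpack(offset, packed):
    """Reconstruct the original values; this is the extra CPU cost."""
    return [offset + d for d in packed]


# 1,000 timestamps spread over the 58-second window from the example.
timestamps = [1329263915 + (i % 59) for i in range(1000)]
offset, packed = pack_with_offset(timestamps)
print(offset, packed.typecode, packed.itemsize)
assert unpack(offset, packed) == timestamps
```

With a span of 58 every value fits in one byte, compared with eight bytes for a LONG; the price is the addition performed for every value read back.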
