Hi John,

I described the way we are using the FastBit library to store data in my first post to this thread. What information do you need? I could copy and paste the relevant portions of our code if it helps, although it should boil down to what I described earlier.

Petr

On 28 April 2012 02:18, K. John Wu <[email protected]> wrote:
> Hi, Petr,
>
> Would you mind telling us a bit more about how you use FastBit to store
> data? I am interested in spending some time on this to remove the
> excess operations in FastBit, but will need a concrete example to
> investigate the issue further.
>
> Thanks.
>
> John
>
>
> On 4/22/12 9:50 AM, Thorgrin wrote:
>> Hi John,
>>
>> I finally got to test the difference between the two approaches for
>> storing data. The difference between using the FastBit library and
>> storing the data directly is quite significant.
>>
>> I'm sending the data to the storage program over the network using UDP,
>> so when it cannot cope with the speed, the network layer drops some
>> packets. I tried three speeds, 8k, 10k and 12k packets per second.
>> Each packet contains a number of rows, and the packet sizes are similar.
>> It takes about 154, 122 and 103 seconds to send the data at the
>> respective speeds.
>>
>> The numbers of stored rows are roughly summarized in the following table:
>> #speed   using library   direct storage
>> 8000     7387000         31740000
>> 10000    5925000         31737000
>> 12000    5000000         31715000
>>
>> The way we are storing the data is simple. We use a buffer of 70k
>> values for each column, that is just a piece of allocated memory. When
>> the buffer is full, we flush the memory to the file.
>>
>> I do not know whether these results come from misusing the library
>> somehow, but maybe someone else has stumbled upon this issue.
>>
>> Regards,
>> Petr
>>
>>>>> On 4/13/12 2:05 AM, Thorgrin wrote:
>>>>>> Hi John,
>>>>>>
>>>>>> my colleague is currently working on the data storage and he decided
>>>>>> to incorporate the creation of fastbit files directly into our code,
>>>>>> thus we have no generic storage code to give back. There are no
>>>>>> problems with our approach so far.
>>>>>>
>>>>>> I hope that I'll be able to test the performance soon to see if the
>>>>>> difference is really significant.
>>>>>>
>>>>>> Meanwhile, I have another question. How does fastbit work with byte
>>>>>> arrays? The table API seems to lack this feature, but there are some
>>>>>> internal types for this. We need to support storing byte arrays along
>>>>>> with strings and basic types in the end, so I would like to know
>>>>>> whether this is possible using fastbit, or if we have to come up with
>>>>>> another solution.
>>>>>>
>>>>>> Thank you,
>>>>>> Petr
>>>>>>
>>>>>> On 6 April 2012 19:36, K. John Wu <[email protected]> wrote:
>>>>>>> Hi, Petr,
>>>>>>>
>>>>>>> Have you had a chance to try this? Any comments or questions?
>>>>>>>
>>>>>>> John
>>>>>>>
>>>>>>>
>>>>>>> On 3/22/12 8:14 AM, Thorgrin wrote:
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> we will definitely give it a try. Maybe this simple approach could be
>>>>>>>> made into some C++ class or C code and provided along with the
>>>>>>>> library, assuming the results are significantly better.
>>>>>>>>
>>>>>>>> Petr
>>>>>>>>
>>>>>>>> On 22 March 2012 15:58, Dominique Prunier
>>>>>>>> <[email protected]> wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I was doing pretty much the same thing and here is how I did it:
>>>>>>>>> * I handle the import myself (the column file format and the -part.txt
>>>>>>>>> are pretty straightforward, I was only using category and long,
>>>>>>>>> beware of the endianness though)
>>>>>>>>> * I generate indexes column by column, using the code from the C API
>>>>>>>>> (which, to answer your question, builds the index for a specific
>>>>>>>>> column using part::getColumn and then column::loadIndex and
>>>>>>>>> column::unloadIndex on the selected column)
>>>>>>>>> * From time to time, I merge smaller partitions into larger ones and
>>>>>>>>> reindex them
>>>>>>>>>
>>>>>>>>> Hope this helps,
>>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: [email protected]
>>>>>>>>> [mailto:[email protected]] On Behalf Of Thorgrin
>>>>>>>>> Sent: Thursday, March 22, 2012 10:12 AM
>>>>>>>>> To: K. John Wu
>>>>>>>>> Cc: FastBit Users
>>>>>>>>> Subject: Re: [FastBit-users] Howto store data with fatbit library
>>>>>>>>>
>>>>>>>>> Hi John,
>>>>>>>>>
>>>>>>>>> Thank you for pointing me in the right direction. What we are
>>>>>>>>> currently using seems quite similar, but we are experiencing some
>>>>>>>>> performance issues.
>>>>>>>>>
>>>>>>>>> We are storing quite a lot of rows into multiple tables at the same
>>>>>>>>> time. Currently we are using one thread to write into about 7
>>>>>>>>> different tables, some of which are more heavily used than others.
>>>>>>>>> We are experiencing high CPU load while the throughput is not as
>>>>>>>>> high as expected.
>>>>>>>>> On the test machine we have hit the ceiling at about 35k rows per
>>>>>>>>> second. This of course includes processing the incoming data as well
>>>>>>>>> as storing it, but we believe that the storing is the most
>>>>>>>>> limiting factor. The hard drive performance should not be an issue
>>>>>>>>> here, it is busy at about 7-15%.
>>>>>>>>> The data is not written to the hard drive immediately, but only when
>>>>>>>>> there are about 200k rows (this applies to each table).
>>>>>>>>>
>>>>>>>>> Since the fastbit data partition format is quite simple, it might be
>>>>>>>>> best to store the data directly. This would allow us to create a
>>>>>>>>> buffer for each data type and partition, which could be written
>>>>>>>>> directly to the hard drive. The only drawback is that we need to
>>>>>>>>> generate the -part.txt file ourselves, but that is not too hard.
>>>>>>>>> Then we can use the fastbit library to generate indexes on existing
>>>>>>>>> data. What is your recommendation?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Regarding the buildIndexes() functions, they are indeed present on
>>>>>>>>> both parts and tables. But tables also have a buildIndex() function,
>>>>>>>>> which can be used to generate indexes on specific columns. I cannot
>>>>>>>>> seem to find an equivalent in the parts API.
>>>>>>>>> Is there any way to build an index on one column only using parts?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thank you,
>>>>>>>>> Petr
>>>>>>>>>
>>>>>>>>> On 8 March 2012 18:46, K. John Wu <[email protected]> wrote:
>>>>>>>>>> Hi, Petr,
>>>>>>>>>>
>>>>>>>>>> You are on the right track. The file tests/setqgen.cpp is
>>>>>>>>>> essentially doing what you are talking about. You can take a look
>>>>>>>>>> at the file either in the source code directory or online at
>>>>>>>>>> <http://goo.gl/D1XgX>.
>>>>>>>>>>
>>>>>>>>>> The class ibis::table (for a data table) is a container of ibis::part
>>>>>>>>>> (for a data partition). A table can have multiple partitions. All
>>>>>>>>>> data records written by setqgen.cpp can be regarded as one table, but
>>>>>>>>>> it might have a number of partitions. The function
>>>>>>>>>> ibis::table::buildIndexes calls ibis::part::buildIndexes on each data
>>>>>>>>>> partition it has. The actual work is done in
>>>>>>>>>> ibis::part::buildIndexes. If you are not using
>>>>>>>>>> ibis::table::buildIndexes, you will be doing the looping yourself.
>>>>>>>>>> Either way is fine. Take the option that is convenient for you.
>>>>>>>>>>
>>>>>>>>>> Good luck.
>>>>>>>>>>
>>>>>>>>>> John
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 3/8/12 8:10 AM, Thorgrin wrote:
>>>>>>>>>>> Hi John,
>>>>>>>>>>>
>>>>>>>>>>> thank you for your ongoing work on the fastbit library, the
>>>>>>>>>>> improvements are great.
>>>>>>>>>>>
>>>>>>>>>>> I have several questions regarding correct usage. We are currently
>>>>>>>>>>> creating fastbit data partitions using a tablex object in the
>>>>>>>>>>> following manner:
>>>>>>>>>>>
>>>>>>>>>>> # initialise partition with columns
>>>>>>>>>>> tablex->addColumns();
>>>>>>>>>>>
>>>>>>>>>>> tablex->reserveSpace();
>>>>>>>>>>> # multiple calls to append data. We are storing the data on the fly
>>>>>>>>>>> # as it comes, so there are lots of calls to append.
>>>>>>>>>>> tablex->append();
>>>>>>>>>>>
>>>>>>>>>>> # When we fill the reserved space, we write the data to disk
>>>>>>>>>>> tablex->write();
>>>>>>>>>>> tablex->clearData();
>>>>>>>>>>>
>>>>>>>>>>> # And continue with
>>>>>>>>>>> tablex->append()
>>>>>>>>>>> .
>>>>>>>>>>> .
>>>>>>>>>>>
>>>>>>>>>>> Is this the right and efficient way? Or could you recommend a better
>>>>>>>>>>> approach? We really just need to receive data and store it into the
>>>>>>>>>>> partitions fast. Currently it seems that this consumes quite a lot
>>>>>>>>>>> of CPU resources, just for writing things down.
>>>>>>>>>>>
>>>>>>>>>>> Additionally, we want to create indexes on the newly created
>>>>>>>>>>> partitions. Currently we load it as a table using
>>>>>>>>>>> table = ibis::table::create();
>>>>>>>>>>> # and then
>>>>>>>>>>> table->buildIndexes();
>>>>>>>>>>> delete table;
>>>>>>>>>>>
>>>>>>>>>>> I've noticed that there is a buildIndexes() function on the part
>>>>>>>>>>> class. What is the difference? Should we use the other one?
>>>>>>>>>>> Additionally, table allows building an index on specific columns,
>>>>>>>>>>> part only on all columns. Is there a reason for this?
>>>>>>>>>>>
>>>>>>>>>>> Thank you,
>>>>>>>>>>> Petr
>>>>>>>>> _______________________________________________
>>>>>>>>> FastBit-users mailing list
>>>>>>>>> [email protected]
>>>>>>>>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
_______________________________________________
FastBit-users mailing list
[email protected]
https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
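
For illustration, the direct-storage approach described in this thread (a fixed-size in-memory buffer per column, flushed to the column file when full) could be sketched roughly as below. This is not the code used by any of the posters. `ColumnBuffer`, its members and the file path are hypothetical names, and the sketch assumes that a column file is simply the fixed-width values written back to back in native byte order (hence the endianness caveat mentioned in the thread). Generating the -part.txt file and building indexes are left out.

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper, not part of FastBit: buffers the values of one column
// in memory and appends them to a column file as raw binary data.
template <typename T>
class ColumnBuffer {
public:
    ColumnBuffer(std::string path, std::size_t capacity)
        : path_(std::move(path)), capacity_(capacity) {
        buf_.reserve(capacity_);
    }

    // Store one value; write the buffer out once it reaches its capacity.
    void append(const T& value) {
        buf_.push_back(value);
        if (buf_.size() >= capacity_) flush();
    }

    // Append all buffered values to the end of the file and empty the buffer.
    void flush() {
        if (buf_.empty()) return;
        std::ofstream out(path_, std::ios::binary | std::ios::app);
        if (!out) throw std::runtime_error("cannot open " + path_);
        out.write(reinterpret_cast<const char*>(buf_.data()),
                  static_cast<std::streamsize>(buf_.size() * sizeof(T)));
        if (!out) throw std::runtime_error("write failed for " + path_);
        rows_written_ += buf_.size();
        buf_.clear();
    }

    // Number of values that have reached the file so far.
    std::size_t rowsWritten() const { return rows_written_; }

private:
    std::string path_;
    std::size_t capacity_;
    std::vector<T> buf_;
    std::size_t rows_written_ = 0;
};
```

One such buffer would be created per column of a partition, for example `ColumnBuffer<std::uint32_t> col("part0/a", 70000);`. The class does not flush in its destructor, so `flush()` has to be called on every column before a partition is closed, otherwise the column files end up with different numbers of rows.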
