Hi,

I was doing pretty much the same thing, and here is how I did it:

* I handle the import myself. The column file format and the -part.txt file are pretty straightforward (I was only using category and long columns); beware of the endianness, though.
* I generate indexes column by column, using the code from the C API. To answer your question, it builds the index for a specific column by calling part::getColumn and then column::loadIndex and column::unloadIndex on the selected column.
* From time to time, I merge smaller partitions into a larger one and reindex them.
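
The first point can be sketched roughly as follows. This is not FastBit code and it calls no FastBit functions. It assumes that a column of long values is stored as 8-byte integers written back to back with no header, in the byte order of the machine that will read them; please check that against your own data before relying on it. The names appendColumn, swapBytes and isLittleEndian are made up for this illustration, and the -part.txt file still has to be written separately.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: true when this machine stores the least
// significant byte first.
bool isLittleEndian() {
    const std::uint16_t probe = 1;
    unsigned char first = 0;
    std::memcpy(&first, &probe, 1);
    return first == 1;
}

// Hypothetical helper: reverse the byte order of one 64-bit value.
std::int64_t swapBytes(std::int64_t value) {
    unsigned char bytes[sizeof value];
    std::memcpy(bytes, &value, sizeof value);
    std::reverse(bytes, bytes + sizeof value);
    std::memcpy(&value, bytes, sizeof value);
    return value;
}

// Hypothetical helper: append the buffered values of one column to its
// file as raw 8-byte integers (ASSUMED layout: no header, no padding).
// With swap == true the bytes of every value are reversed first, for a
// reader with the opposite byte order. The buffer is cleared only after
// a successful write, so a failed flush can be retried.
bool appendColumn(const std::string& path,
                  std::vector<std::int64_t>& buffer,
                  bool swap = false) {
    std::vector<std::int64_t> swapped;
    const std::vector<std::int64_t>* source = &buffer;
    if (swap) {
        swapped = buffer;
        for (std::int64_t& v : swapped) v = swapBytes(v);
        source = &swapped;
    }
    std::ofstream out(path, std::ios::binary | std::ios::app);
    if (!out) return false;
    out.write(reinterpret_cast<const char*>(source->data()),
              static_cast<std::streamsize>(source->size() *
                                           sizeof(std::int64_t)));
    out.flush();
    if (!out) return false;
    buffer.clear();
    return true;
}
```

One such buffer per column and per partition can be filled as rows arrive and flushed with appendColumn once it is large enough. If the files will be read on a little-endian machine, passing !isLittleEndian() as the swap argument writes them in that order regardless of the writing host.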

Hope this helps,

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Thorgrin
Sent: Thursday, March 22, 2012 10:12 AM
To: K. John Wu
Cc: FastBit Users
Subject: Re: [FastBit-users] Howto store data with fatbit library

Hi John,

Thank you for pointing me in the right direction. What we are currently using seems quite similar, but we are experiencing some performance issues.

We are storing quite a lot of rows into multiple tables at the same time. Currently we are using one thread to write into about 7 different tables, some of which are more heavily used than others. We are experiencing high CPU load while the throughput is not as high as expected. On the test machine we have hit the ceiling at about 35k rows per second. This of course includes processing the incoming data as well as storing it, but we believe that the storing is the main limiting factor. Hard drive performance should not be an issue here; the drive is busy at about 7-15%. The data is not written to the hard drive immediately, but only once about 200k rows have accumulated (this applies to each table).

Since the fastbit data partition format is quite simple, it might be best to store the data directly. This would allow us to create a buffer for each data type and partition, which could be written directly to the hard drive. The only drawback is that we need to generate the -part.txt file ourselves, but that is not too hard. Then we can use the fastbit library to generate indexes on the existing data. What is your recommendation?

Regarding the buildIndexes() functions, they are indeed present on both parts and tables. But the table also has a buildIndex() function, which can be used to generate indexes on specific columns. I cannot seem to find an equivalent in the part API. Is there any way to build an index on one column only using parts?

Thank you,
Petr

On 8 March 2012 18:46, K. John Wu <[email protected]> wrote:
> Hi, Petr,
>
> You are on the right track.
> The file tests/setqgen.cpp is essentially doing what you are
> talking about. You can take a look at the file either in the source
> code directory or online at <http://goo.gl/D1XgX>.
>
> The class ibis::table (for a data table) is a container of ibis::part
> (for a data partition). A table can have multiple partitions. All
> data records written by setqgen.cpp can be regarded as one table, but
> it might have a number of partitions. The function
> ibis::table::buildIndexes calls ibis::part::buildIndexes on each data
> partition it has. The actual work is done by
> ibis::part::buildIndexes. If you are not using
> ibis::table::buildIndexes, you will be doing the looping yourself.
> Either way is fine. Take the option that is convenient for you.
>
> Good luck.
>
> John
>
> On 3/8/12 8:10 AM, Thorgrin wrote:
>> Hi John,
>>
>> thank you for your ongoing work on the fastbit library, the
>> improvements are great.
>>
>> I have several questions regarding correct usage. We are currently
>> creating fastbit data partitions using a tablex object in the
>> following manner:
>>
>> # initialise partition with columns
>> tablex->addColumns();
>>
>> tablex->reserveSpace();
>> # multiple calls to append data. We are storing the data on the fly
>> # as it comes, so there are lots of calls to append.
>> tablex->append();
>>
>> # When we fill the reserved space, we write the data to disk
>> tablex->write();
>> tablex->clearData();
>>
>> # And continue with
>> tablex->append()
>> .
>> .
>>
>> Is this the right and efficient way? Or could you recommend a better
>> approach? We really just need to receive data and store it into the
>> partitions fast. Currently it seems that this consumes quite a lot of
>> CPU resources, just for writing things down.
>>
>> Additionally, we want to create indexes on the newly created
>> partitions.
>> Currently we load it as a table using
>> table = ibis::table::create();
>> # and then
>> table->buildIndexes();
>> delete table;
>>
>> I've noticed that there is a buildIndexes() function on the part
>> class. What is the difference? Should we use the other one?
>> Additionally, table allows building an index on specific columns,
>> part only on all columns. Is there a reason for this?
>>
>> Thank you,
>> Petr

_______________________________________________
FastBit-users mailing list
[email protected]
https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
