Hi Petr,

Have you had a chance to try this? Any comments or questions?

John

On 3/22/12 8:14 AM, Thorgrin wrote:
> Thanks,
>
> we will definitely give it a try. Maybe this simple approach could be
> made into a C++ class or C code and provided along with the
> library, assuming the results are significantly better.
>
> Petr
>
> On 22 March 2012 15:58, Dominique Prunier
> <[email protected]> wrote:
>> Hi,
>>
>> I was doing pretty much the same thing, and here is how I did it:
>> * I handle the import myself (the column file format and the -part.txt are
>> pretty straightforward; I was only using category and long. Beware of the
>> endianness, though)
>> * I generate indexes column by column, using the code from the C API
>> (which, to answer your question, builds the index for a specific column
>> using part::getColumn and then column::loadIndex and column::unloadIndex on
>> the selected column)
>> * From time to time, I merge smaller partitions into a larger one and
>> reindex them
>>
>> Hope this helps,
>>
>> -----Original Message-----
>> From: [email protected]
>> [mailto:[email protected]] On Behalf Of Thorgrin
>> Sent: Thursday, March 22, 2012 10:12 AM
>> To: K. John Wu
>> Cc: FastBit Users
>> Subject: Re: [FastBit-users] Howto store data with fatbit library
>>
>> Hi John,
>>
>> Thank you for pointing me in the right direction. What we are
>> currently using seems quite similar, but we are experiencing some
>> performance issues.
>>
>> We are storing quite a lot of rows into multiple tables at the same time.
>> Currently we are using one thread to write into about 7 different
>> tables, some of which are more heavily used than others. We are
>> experiencing high CPU load while the throughput is not as high as
>> expected.
>> On the test machine we have hit the ceiling at about 35k rows per
>> second. This of course includes processing the incoming data as well
>> as storing it, but we believe that storing is the most
>> limiting factor. The hard drive performance should not be an issue
>> here; it is busy at about 7-15%.
>> The data is not written to the hard drive immediately, but whenever
>> about 200k rows have accumulated (this applies to each table).
>>
>> Since the fastbit data partition format is quite simple, it might be
>> best to store the data directly. This would allow us to create a
>> buffer for each data type and partition, which could be written
>> directly to the hard drive. The only drawback is that we need to generate
>> the -part.txt file ourselves, but that is not too hard. Then we
>> can use the fastbit library to generate indexes on the existing data.
>> What is your recommendation?
>>
>>
>> Regarding the buildIndexes() functions, they are indeed present on both
>> parts and tables. But the table also has a buildIndex() function,
>> which can be used to generate indexes on specific columns. I cannot
>> seem to find an equivalent in the parts API. Is there any way to build
>> an index on one column only using parts?
>>
>>
>> Thank you,
>> Petr
>>
>> On 8 March 2012 18:46, K. John Wu <[email protected]> wrote:
>>> Hi, Petr,
>>>
>>> You are on the right track. The file tests/setqgen.cpp is essentially
>>> doing what you are talking about. You can take a look at the file
>>> either in the source code directory or online at <http://goo.gl/D1XgX>.
>>>
>>> The class ibis::table (for a data table) is a container of ibis::part
>>> (for a data partition). A table can have multiple partitions. All
>>> data records written by setqgen.cpp can be regarded as one table, but
>>> it might have a number of partitions. The function
>>> ibis::table::buildIndexes calls ibis::part::buildIndexes on each data
>>> partition it has. The actual work is done by
>>> ibis::part::buildIndexes. If you are not using
>>> ibis::table::buildIndexes, you will be doing the looping yourself.
>>> Either way is fine. Take the option that is convenient for you.
>>>
>>> Good luck.
>>>
>>> John
>>>
>>>
>>>
>>> On 3/8/12 8:10 AM, Thorgrin wrote:
>>>> Hi John,
>>>>
>>>> thank you for your ongoing work on the fastbit library, the improvements
>>>> are great.
>>>>
>>>> I have several questions regarding correct usage. We are currently
>>>> creating fastbit data partitions using a tablex object in the following
>>>> manner:
>>>>
>>>> # initialise partition with columns
>>>> tablex->addColumns();
>>>>
>>>> tablex->reserveSpace();
>>>> # multiple calls to append data. We are storing the data on the fly as
>>>> # it comes, so there are lots of calls to append.
>>>> tablex->append();
>>>>
>>>> # When we fill the reserved space, we write the data to disk
>>>> tablex->write();
>>>> tablex->clearData();
>>>>
>>>> # And continue with
>>>> tablex->append()
>>>> .
>>>> .
>>>>
>>>> Is this the right and efficient way? Or could you recommend a better
>>>> approach? We really just need to receive data and store it into the
>>>> partitions fast. Currently it seems that this consumes quite a lot of
>>>> CPU resources, just for writing things down.
>>>>
>>>> Additionally, we want to create indexes on the newly created
>>>> partitions. Currently we load each one as a table using
>>>> table = ibis::table::create();
>>>> # and then
>>>> table->buildIndexes();
>>>> delete table;
>>>>
>>>> I've noticed that there is a buildIndexes() function on the part class.
>>>> What is the difference? Should we use the other one? Additionally,
>>>> table allows building an index on specific columns, part only on all
>>>> columns. Is there a reason for this?
>>>>
>>>> Thank you,
>>>> Petr
>> _______________________________________________
>> FastBit-users mailing list
>> [email protected]
>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
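
The direct-write approach discussed in this thread (buffer each column in memory, dump it as raw values, and generate -part.txt by hand) could be sketched roughly as below. This is only an illustration under stated assumptions, not code from FastBit: LongColumn, hostIsLittleEndian, writeColumn and partMetadata are hypothetical names, only 8-byte "long" columns are handled, and the -part.txt key names are assumed to match what FastBit itself writes. Compare the output against a -part.txt produced by tablex->write() before relying on it.

```cpp
#include <cstddef>
#include <cstdint>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper type: an in-memory buffer for one 8-byte integer column.
struct LongColumn {
    std::string name;
    std::vector<int64_t> values;
};

// True when the host stores the least significant byte first. Raw column
// files are written in host byte order here, so files produced on one
// architecture may not be readable on another (the endianness caveat above).
inline bool hostIsLittleEndian() {
    const uint16_t probe = 1;
    return *reinterpret_cast<const unsigned char*>(&probe) == 1;
}

// Dump the whole buffer as raw values, with no header and no separators.
// Returns false if the stream reports an error.
inline bool writeColumn(std::ostream& out, const LongColumn& col) {
    out.write(reinterpret_cast<const char*>(col.values.data()),
              static_cast<std::streamsize>(col.values.size() * sizeof(int64_t)));
    return static_cast<bool>(out);
}

// Build the text of a -part.txt file for the given columns.
// ASSUMPTION: the header and column key names below mirror FastBit's own
// metadata files. Returns an empty string if the columns differ in length.
inline std::string partMetadata(const std::string& partName,
                                const std::vector<LongColumn>& cols) {
    const std::size_t rows = cols.empty() ? 0 : cols.front().values.size();
    for (const LongColumn& c : cols) {
        if (c.values.size() != rows) return std::string();
    }
    std::ostringstream txt;
    txt << "BEGIN HEADER\n"
        << "Name = \"" << partName << "\"\n"
        << "Number_of_rows = " << rows << "\n"
        << "Number_of_columns = " << cols.size() << "\n"
        << "END HEADER\n";
    for (const LongColumn& c : cols) {
        txt << "\nBegin Column\n"
            << "name = \"" << c.name << "\"\n"
            << "data_type = \"LONG\"\n"
            << "End Column\n";
    }
    return txt.str();
}
```

To produce a partition with these helpers, one would create the partition directory, stream each column into a file named after the column (opened with std::ios::binary), and save the metadata text as -part.txt in the same directory. Index building would then be left to the FastBit calls mentioned in the thread.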
