Hi, Petr,

Would you mind telling us a bit more about how you use FastBit to store data? I am interested in spending some time on this to remove the excess operations in FastBit, but I will need a concrete example to investigate the issue further.
Thanks.

John

On 4/22/12 9:50 AM, Thorgrin wrote:
> Hi John,
>
> I finally got to test the difference between the two approaches for
> storing data. The difference between using the FastBit library and
> storing the data directly is quite significant.
>
> I'm sending the data to the storage program over the network using UDP,
> so when it cannot cope with the speed, the network layer drops some
> packets. I tried three speeds, 8k, 10k and 12k packets per second;
> each packet contains a number of rows, and the packet sizes are similar.
> It takes about 154, 122 and 103 seconds to send the data at the
> respective speeds.
>
> The numbers of stored rows are roughly summed up in the following table:
> #speed   using library   direct storage
>   8000         7387000         31740000
>  10000         5925000         31737000
>  12000         5000000         31715000
>
> The way we are storing the data is simple. We use a buffer of 70k
> values for each column, which is just a piece of allocated memory. When
> the buffer is full, we flush the memory to the file.
>
> I do not know whether the results are a result of misusing the library
> somehow, but maybe someone else stumbled upon this issue.
>
> Regards,
> Petr
>
>>>> On 4/13/12 2:05 AM, Thorgrin wrote:
>>>>> Hi John,
>>>>>
>>>>> my colleague is currently working on the data storage and he decided
>>>>> to incorporate the creation of FastBit files directly into our code,
>>>>> thus we have no generic storage code to give back. There are no
>>>>> problems with our approach so far.
>>>>>
>>>>> I hope that I'll be able to test the performance soon to see if the
>>>>> difference is really significant.
>>>>>
>>>>> Meanwhile, I have another question. How does FastBit work with byte
>>>>> arrays? The table API seems to miss this feature, but there are some
>>>>> internal types for this. We need to support storing byte arrays along
>>>>> with strings and basic types in the end, so I would like to know
>>>>> whether this is possible using FastBit, or if we have to come up with
>>>>> another solution.
>>>>>
>>>>> Thank you,
>>>>> Petr
>>>>>
>>>>> On 6 April 2012 19:36, K. John Wu <[email protected]> wrote:
>>>>>> Hi, Petr,
>>>>>>
>>>>>> Have you had a chance to try this? Any comments or questions?
>>>>>>
>>>>>> John
>>>>>>
>>>>>>
>>>>>> On 3/22/12 8:14 AM, Thorgrin wrote:
>>>>>>> Thanks,
>>>>>>>
>>>>>>> we will definitely give it a try. Maybe this simple approach could be
>>>>>>> made into some C++ class or C code and provided along with the
>>>>>>> library, assuming the results are significantly better.
>>>>>>>
>>>>>>> Petr
>>>>>>>
>>>>>>> On 22 March 2012 15:58, Dominique Prunier
>>>>>>> <[email protected]> wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I was doing pretty much the same thing and here is how I did it:
>>>>>>>> * I handle import myself (the column file format and the -part.txt
>>>>>>>> are pretty straightforward, I was only using category and long, beware
>>>>>>>> of the endianness though)
>>>>>>>> * I generate indexes column by column, using the code from the C API
>>>>>>>> (which, to answer your question, builds the index for a specific
>>>>>>>> column using part::getColumn and then column::loadIndex and
>>>>>>>> column::unloadIndex on the selected column)
>>>>>>>> * From time to time, I merge smaller partitions into larger ones and
>>>>>>>> reindex them
>>>>>>>>
>>>>>>>> Hope this helps,
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: [email protected]
>>>>>>>> [mailto:[email protected]] On Behalf Of Thorgrin
>>>>>>>> Sent: Thursday, March 22, 2012 10:12 AM
>>>>>>>> To: K. John Wu
>>>>>>>> Cc: FastBit Users
>>>>>>>> Subject: Re: [FastBit-users] Howto store data with fatbit library
>>>>>>>>
>>>>>>>> Hi John,
>>>>>>>>
>>>>>>>> Thank you for pointing me in the right direction. What we are
>>>>>>>> currently using seems quite similar, but we are experiencing some
>>>>>>>> performance issues.
>>>>>>>>
>>>>>>>> We are storing quite a lot of rows into multiple tables at the same time.
>>>>>>>> Currently we are using one thread to write into about 7 different
>>>>>>>> tables, some of which are more heavily used than others. We are
>>>>>>>> experiencing high CPU load while the throughput is not as high as
>>>>>>>> expected.
>>>>>>>> On the test machine we have hit the ceiling at about 35k rows per
>>>>>>>> second. This of course includes processing the incoming data as well
>>>>>>>> as storing it, but we believe that the storing is the most
>>>>>>>> limiting factor. The hard drive performance should not be an issue
>>>>>>>> here, it is busy at about 7-15%.
>>>>>>>> The data is not written to the hard drive immediately, but always when
>>>>>>>> there are about 200k rows (this applies to each table).
>>>>>>>>
>>>>>>>> Since the FastBit data partition format is quite simple, it might be
>>>>>>>> best to store the data directly. This would allow us to create a
>>>>>>>> buffer for each data type and partition, which could be written
>>>>>>>> directly to the hard drive. The only drawback is that we need to
>>>>>>>> generate the -part.txt file ourselves, but that is not too hard. Then
>>>>>>>> we can use the FastBit library to generate indexes on existing data.
>>>>>>>> What is your recommendation?
>>>>>>>>
>>>>>>>>
>>>>>>>> Regarding the buildIndexes() functions, they are indeed present both
>>>>>>>> in parts and tables. But the table also has a buildIndex() function,
>>>>>>>> which can be used to generate indexes on specific columns. I cannot
>>>>>>>> seem to find an equivalent in the parts API. Is there any way to build
>>>>>>>> an index on one column only using parts?
>>>>>>>>
>>>>>>>>
>>>>>>>> Thank you,
>>>>>>>> Petr
>>>>>>>>
>>>>>>>> On 8 March 2012 18:46, K. John Wu <[email protected]> wrote:
>>>>>>>>> Hi, Petr,
>>>>>>>>>
>>>>>>>>> You are on the right track. The file tests/setqgen.cpp is essentially
>>>>>>>>> doing what you are talking about.
>>>>>>>>> You can take a look at the file
>>>>>>>>> either in the source code directory or online at
>>>>>>>>> <http://goo.gl/D1XgX>.
>>>>>>>>>
>>>>>>>>> The class ibis::table (for a data table) is a container of ibis::part
>>>>>>>>> (for a data partition). A table can have multiple partitions. All
>>>>>>>>> data records written by setqgen.cpp can be regarded as one table, but
>>>>>>>>> it might have a number of partitions. The function
>>>>>>>>> ibis::table::buildIndexes calls ibis::part::buildIndexes on each data
>>>>>>>>> partition it has. The actual work is done by
>>>>>>>>> ibis::part::buildIndexes. If you are not using
>>>>>>>>> ibis::table::buildIndexes, you will be doing the looping yourself.
>>>>>>>>> Either way is fine. Take the option that is convenient for you.
>>>>>>>>>
>>>>>>>>> Good luck.
>>>>>>>>>
>>>>>>>>> John
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 3/8/12 8:10 AM, Thorgrin wrote:
>>>>>>>>>> Hi John,
>>>>>>>>>>
>>>>>>>>>> thank you for your ongoing work on the FastBit library, the
>>>>>>>>>> improvements are great.
>>>>>>>>>>
>>>>>>>>>> I have several questions regarding correct usage. We are currently
>>>>>>>>>> creating FastBit data partitions using a tablex object in the
>>>>>>>>>> following manner:
>>>>>>>>>>
>>>>>>>>>> # initialise partition with columns
>>>>>>>>>> tablex->addColumns();
>>>>>>>>>>
>>>>>>>>>> tablex->reserveSpace();
>>>>>>>>>> # multiple calls to append data. We are storing the data on the fly
>>>>>>>>>> # as it comes, so there are lots of calls to append.
>>>>>>>>>> tablex->append();
>>>>>>>>>>
>>>>>>>>>> # When we fill the reserved space, we write the data to disk
>>>>>>>>>> tablex->write();
>>>>>>>>>> tablex->clearData();
>>>>>>>>>>
>>>>>>>>>> # And continue with
>>>>>>>>>> tablex->append()
>>>>>>>>>> .
>>>>>>>>>> .
>>>>>>>>>>
>>>>>>>>>> Is this the right and efficient way? Or could you recommend a better
>>>>>>>>>> approach? We really just need to receive data and store it into the
>>>>>>>>>> partitions fast.
>>>>>>>>>> Currently it seems that this consumes quite a lot of
>>>>>>>>>> CPU resources, just for writing things down.
>>>>>>>>>>
>>>>>>>>>> Additionally, we want to create indexes on the newly created
>>>>>>>>>> partitions. Currently we load it as a table using
>>>>>>>>>> table = ibis::table::create();
>>>>>>>>>> # and then
>>>>>>>>>> table->buildIndexes();
>>>>>>>>>> delete table;
>>>>>>>>>>
>>>>>>>>>> I've noticed that there is a buildIndexes() function on the part class.
>>>>>>>>>> What is the difference? Should we use the other one? Additionally,
>>>>>>>>>> table allows building an index on specific columns, part only on all
>>>>>>>>>> columns. Is there a reason for this?
>>>>>>>>>>
>>>>>>>>>> Thank you,
>>>>>>>>>> Petr
>>>>>>>> _______________________________________________
>>>>>>>> FastBit-users mailing list
>>>>>>>> [email protected]
>>>>>>>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
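
For readers following the thread, the "direct storage" approach Petr describes (a fixed-size in-memory buffer per column, flushed to a file when full) might look roughly like the sketch below. This is only an illustration, not the code used in the tests above and not part of FastBit: the class name ColumnBuffer, its methods, and the file path are hypothetical. It assumes that a column of fixed-size values can be stored as a plain binary array in native byte order (see Dominique's note about endianness), and it does not generate the -part.txt file or any indexes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not a FastBit class): buffers the values of one
// column in memory and appends them to a raw binary file when full.
template <typename T>
class ColumnBuffer {
public:
    // path: file that receives the column values (assumed layout: a plain
    // array of T in native byte order). capacity: values held before a flush.
    ColumnBuffer(std::string path, std::size_t capacity)
        : path_(std::move(path)), capacity_(capacity) {
        assert(capacity_ > 0);
        values_.reserve(capacity_);
    }

    // Store one value; flush automatically once the buffer is full.
    void append(const T& v) {
        values_.push_back(v);
        if (values_.size() >= capacity_) flush();
    }

    // Append all buffered values to the end of the file and empty the buffer.
    void flush() {
        if (values_.empty()) return;
        std::FILE* f = std::fopen(path_.c_str(), "ab");
        if (!f) throw std::runtime_error("cannot open " + path_);
        const std::size_t n =
            std::fwrite(values_.data(), sizeof(T), values_.size(), f);
        const int rc = std::fclose(f);
        if (n != values_.size() || rc != 0)
            throw std::runtime_error("incomplete write to " + path_);
        rows_written_ += n;
        values_.clear();
    }

    std::size_t rows_written() const { return rows_written_; }
    std::size_t buffered() const { return values_.size(); }

private:
    std::string path_;
    std::size_t capacity_;
    std::size_t rows_written_ = 0;
    std::vector<T> values_;
};
```

With a capacity of 70000 this would mirror the 70k-value buffers mentioned above: one ColumnBuffer per column, append() called for each incoming row, and a final flush() before the partition's metadata is written. The number of rows recorded for the partition would have to match rows_written() for every column.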
