Re: [FastBit-users] Howto store data with fatbit library

K. John Wu Fri, 27 Apr 2012 17:18:39 -0700

Hi, Petr,

Would you mind telling us a bit more about how you use FastBit to store
data?  I am interested in spending some time on this to remove the
excess operations in FastBit, but will need a concrete example to
investigate the issue further.


Thanks.

John


On 4/22/12 9:50 AM, Thorgrin wrote:
> Hi John,
> 
> I finally got to test the difference between the two approaches for
> storing data. The difference between using the FastBit library and
> storing the data directly is quite significant.
> 
> I'm sending the data to the storage program over the network using UDP,
> so when it cannot cope with the speed, the network layer drops some
> packets. I tried three speeds: 8k, 10k and 12k packets per second.
> Each packet contains a number of rows, and the packets are of similar
> size. It takes about 154, 122 and 103 seconds to send the data at the
> respective speeds.
> 
> The numbers of stored rows are roughly summed up in the following table:
> #speed (packets/s)   using library   direct storage
> 8000                 7387000         31740000
> 10000                5925000         31737000
> 12000                5000000         31715000
> 
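One way to read these figures is to convert them to rows stored per second, using the send durations quoted above. The sketch below does only that arithmetic. The names Run, kRuns, rowsPerSecond and keptFraction are made up for illustration, and it assumes the send duration is a fair stand-in for the time the storage program spent receiving.

```cpp
#include <cassert>
#include <cstdint>

// One measurement from the table: send rate, send duration, rows stored.
struct Run {
    std::uint32_t packetsPerSecond;  // nominal send rate
    double        seconds;           // time taken to send the whole data set
    std::uint64_t rowsViaLibrary;    // rows stored through the FastBit library
    std::uint64_t rowsDirect;        // rows stored by writing the files directly
};

// Figures copied from the message above.
const Run kRuns[] = {
    { 8000, 154.0, 7387000ULL, 31740000ULL},
    {10000, 122.0, 5925000ULL, 31737000ULL},
    {12000, 103.0, 5000000ULL, 31715000ULL},
};

// Average number of rows stored per second over the send window.
inline double rowsPerSecond(std::uint64_t rows, double seconds) {
    assert(seconds > 0.0);
    return static_cast<double>(rows) / seconds;
}

// Fraction of the directly stored rows that the library path also stored.
inline double keptFraction(const Run& r) {
    assert(r.rowsDirect > 0);
    return static_cast<double>(r.rowsViaLibrary) /
           static_cast<double>(r.rowsDirect);
}
```

With these inputs the library path comes out at roughly 48k rows per second in all three runs, while direct storage rises from about 206k to about 308k rows per second. The library path keeps about 23%, 19% and 16% of the rows, respectively.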
> The way we are storing the data is simple. We use a buffer of 70k
> values for each column, which is just a piece of allocated memory. When
> the buffer is full, we flush the memory to the file.
> 
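For readers following along, the buffering described above could look roughly like the sketch below. This is not Petr's code, which was not posted to the list. ColumnBuffer and its members are hypothetical names, and only the C++ standard library is used. It writes the values as raw bytes in host byte order with no header, which is an assumption about what a column file should contain.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <stdexcept>
#include <string>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical per-column buffer: collects values in memory and appends
// them to a raw binary file whenever the buffer fills up.
template <typename T>
class ColumnBuffer {
    static_assert(std::is_trivially_copyable<T>::value,
                  "only plain fixed-size values can be dumped as raw bytes");

public:
    ColumnBuffer(std::string path, std::size_t capacity)
        : path_(std::move(path)), capacity_(capacity), rowsWritten_(0) {
        if (capacity_ == 0) throw std::invalid_argument("capacity must be positive");
        values_.reserve(capacity_);
        // Start with an empty column file for the new partition.
        std::ofstream create(path_, std::ios::binary | std::ios::trunc);
        if (!create) throw std::runtime_error("cannot create " + path_);
    }

    // Store one value; flush automatically once the buffer is full.
    void append(const T& value) {
        values_.push_back(value);
        if (values_.size() >= capacity_) flush();
    }

    // Append the buffered values to the file and empty the buffer.
    void flush() {
        if (values_.empty()) return;
        assert(values_.size() <= capacity_);
        std::ofstream out(path_, std::ios::binary | std::ios::app);
        out.write(reinterpret_cast<const char*>(values_.data()),
                  static_cast<std::streamsize>(values_.size() * sizeof(T)));
        if (!out) throw std::runtime_error("write failed for " + path_);
        rowsWritten_ += values_.size();
        values_.clear();
    }

    std::size_t rowsWritten() const { return rowsWritten_; }
    std::size_t rowsBuffered() const { return values_.size(); }

private:
    std::string    path_;
    std::size_t    capacity_;
    std::size_t    rowsWritten_;
    std::vector<T> values_;
};
```

A capacity of 70000 would correspond to the 70k-value buffer mentioned above, with one such object per column. The caller still has to call flush() at the end of a partition and write the partition metadata separately.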
> I do not know whether these results come from misusing the library
> somehow, but maybe someone else has stumbled upon this issue.
> 
> Regards,
> Petr
> 
>>>> On 4/13/12 2:05 AM, Thorgrin wrote:
>>>>> Hi John,
>>>>>
>>>>> my colleague is currently working on the data storage and he decided
>>>>> to incorporate the creation of fastbit files directly into our code,
>>>>> thus we have no generic storage code to give back. There are no
>>>>> problems with our approach so far.
>>>>>
>>>>> I hope that I'll be able to test the performance soon to see if the
>>>>> difference is really significant.
>>>>>
>>>>> Meanwhile, I have another question. How does fastbit work with byte
>>>>> arrays? The table API seems to lack this feature, but there are some
>>>>> internal types for this. We need to support storing byte arrays along
>>>>> with strings and basic types in the end, so I would like to know
>>>>> whether this is possible using fastbit, or if we have to come up with
>>>>> another solution.
>>>>>
>>>>> Thank you,
>>>>> Petr
>>>>>
>>>>> On 6 April 2012 19:36, K. John Wu wrote:
>>>>>> Hi, Petr,
>>>>>>
>>>>>> Have you had a chance to try this?  Any comments or questions?
>>>>>>
>>>>>> John
>>>>>>
>>>>>>
>>>>>> On 3/22/12 8:14 AM, Thorgrin wrote:
>>>>>>> Thanks,
>>>>>>>
>>>>>>> we will definitely give it a try. Maybe this simple approach could be
>>>>>>> made into some C++ class or C code and provided along with the
>>>>>>> library, assuming the results are significantly better.
>>>>>>>
>>>>>>> Petr
>>>>>>>
>>>>>>> On 22 March 2012 15:58, Dominique Prunier wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I was doing pretty much the same thing, and here is how I did it:
>>>>>>>>  * I handle the import myself (the column file format and the
>>>>>>>>    -part.txt file are pretty straightforward; I was only using
>>>>>>>>    category and long. Beware of the endianness, though).
>>>>>>>>  * I generate indexes column by column, using the code from the C
>>>>>>>>    API (which, to answer your question, builds the index for a
>>>>>>>>    specific column using part::getColumn and then
>>>>>>>>    column::loadIndex and column::unloadIndex on the selected
>>>>>>>>    column).
>>>>>>>>  * From time to time, I merge smaller partitions into larger ones
>>>>>>>>    and reindex them.
>>>>>>>>
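On the endianness caveat in the first bullet: if column files are written as raw binary values, a file produced on one architecture and read on another needs its bytes swapped. The sketch below shows the generic technique for 64-bit values (the "long" case mentioned above) using only the C++ standard library. The helper names are hypothetical, and the thread does not say which byte order the reader expects, so little-endian is used purely as an example target.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// True when the host stores the least significant byte first.
inline bool hostIsLittleEndian() {
    const std::uint16_t probe = 1;
    unsigned char first;
    std::memcpy(&first, &probe, 1);
    return first == 1;
}

// Reverse the byte order of a 64-bit value.
inline std::uint64_t byteSwap64(std::uint64_t v) {
    // Swap the two 32-bit halves, then 16-bit pairs, then single bytes.
    v = ((v & 0x00000000FFFFFFFFULL) << 32) | ((v & 0xFFFFFFFF00000000ULL) >> 32);
    v = ((v & 0x0000FFFF0000FFFFULL) << 16) | ((v & 0xFFFF0000FFFF0000ULL) >> 16);
    v = ((v & 0x00FF00FF00FF00FFULL) << 8)  | ((v & 0xFF00FF00FF00FF00ULL) >> 8);
    return v;
}

// Convert a host-order value so that its in-memory bytes are little-endian.
// Little-endian is an example target here, not a statement about FastBit.
inline std::uint64_t toLittleEndian64(std::uint64_t v) {
    return hostIsLittleEndian() ? v : byteSwap64(v);
}
```

If the files are written and queried on machines with the same byte order, no conversion is needed and the values can be written as they sit in memory.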
>>>>>>>> Hope this helps,
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: [email protected] 
>>>>>>>> [mailto:[email protected]] On Behalf Of Thorgrin
>>>>>>>> Sent: Thursday, March 22, 2012 10:12 AM
>>>>>>>> To: K. John Wu
>>>>>>>> Cc: FastBit Users
>>>>>>>> Subject: Re: [FastBit-users] Howto store data with fatbit library
>>>>>>>>
>>>>>>>> Hi John,
>>>>>>>>
>>>>>>>> Thank you for pointing me in the right direction. What we are
>>>>>>>> currently using seems quite similar, but we are experiencing some
>>>>>>>> performance issues.
>>>>>>>>
>>>>>>>> We are storing quite a lot of rows into multiple tables at the same
>>>>>>>> time. Currently we are using one thread to write into about 7
>>>>>>>> different tables, some of which are more heavily used than others.
>>>>>>>> We are experiencing high CPU load while the throughput is not as
>>>>>>>> high as expected.
>>>>>>>> On the test machine we have hit the ceiling at about 35k rows per
>>>>>>>> second. This of course includes processing the incoming data as well
>>>>>>>> as storing it, but we believe that the storing is the most
>>>>>>>> limiting factor. The hard drive performance should not be an issue
>>>>>>>> here; it's busy at about 7-15%.
>>>>>>>> The data is not written to the hard drive immediately, but only once
>>>>>>>> there are about 200k rows (this applies to each table).
>>>>>>>>
>>>>>>>> Since the fastbit data partition format is quite simple, it might be
>>>>>>>> best to store the data directly. This would allow us to create a
>>>>>>>> buffer for each data type and partition, which could be written
>>>>>>>> directly to the hard drive. The only drawback is that we need to
>>>>>>>> generate the -part.txt file ourselves, but that is not too hard. Then
>>>>>>>> we can use the fastbit library to generate indexes on existing data.
>>>>>>>> What is your recommendation?
>>>>>>>>
>>>>>>>>
>>>>>>>> Regarding the buildIndexes() functions, they are indeed present on
>>>>>>>> both parts and tables. But the table also has a buildIndex() function,
>>>>>>>> which can be used to generate indexes on specific columns. I cannot
>>>>>>>> seem to find an equivalent in the parts API. Is there any way to build
>>>>>>>> an index on one column only using parts?
>>>>>>>>
>>>>>>>>
>>>>>>>> Thank you,
>>>>>>>> Petr
>>>>>>>>
>>>>>>>> On 8 March 2012 18:46, K. John Wu wrote:
>>>>>>>>> Hi, Petr,
>>>>>>>>>
>>>>>>>>> You are on the right track.  The file tests/setqgen.cpp is essentially
>>>>>>>>> doing what you are talking about.  You can take a look at the file
>>>>>>>>> either in the source code directory or online at 
>>>>>>>>> <http://goo.gl/D1XgX>.
>>>>>>>>>
>>>>>>>>> The class ibis::table (for a data table) is a container of ibis::part
>>>>>>>>> (for a data partition).  A table can have multiple partitions.  All
>>>>>>>>> data records written by setqgen.cpp can be regarded as one table, but
>>>>>>>>> it might have a number of partitions.  The function
>>>>>>>>> ibis::table::buildIndexes calls ibis::part::buildIndexes on each data
>>>>>>>>> partition it has.  The actual work is done by
>>>>>>>>> ibis::part::buildIndexes.  If you are not using
>>>>>>>>> ibis::table::buildIndexes, you will be doing the looping yourself.
>>>>>>>>> Either way is fine.  Take the option that is convenient for you.
>>>>>>>>>
>>>>>>>>> Good luck.
>>>>>>>>>
>>>>>>>>> John
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 3/8/12 8:10 AM, Thorgrin wrote:
>>>>>>>>>> Hi John,
>>>>>>>>>>
>>>>>>>>>> thank you for your ongoing work on fastbit library, the improvements 
>>>>>>>>>> are great.
>>>>>>>>>>
>>>>>>>>>> I have several questions regarding correct usage. We are currently
>>>>>>>>>> creating fastbit data partitions using a tablex object in the
>>>>>>>>>> following manner:
>>>>>>>>>>
>>>>>>>>>> # initialise partition with columns
>>>>>>>>>> tablex->addColumns();
>>>>>>>>>>
>>>>>>>>>> tablex->reserveSpace();
>>>>>>>>>> # multiple calls to append data. We are storing the data on the
>>>>>>>>>> # fly as it comes, so there are lots of calls to append.
>>>>>>>>>> tablex->append();
>>>>>>>>>>
>>>>>>>>>> # When we fill the reserved space, we write the data to disk
>>>>>>>>>> tablex->write();
>>>>>>>>>> tablex->clearData();
>>>>>>>>>>
>>>>>>>>>> # And continue with
>>>>>>>>>> tablex->append()
>>>>>>>>>> .
>>>>>>>>>> .
>>>>>>>>>>
>>>>>>>>>> Is this the right and efficient way? Or could you recommend a better
>>>>>>>>>> approach? We really just need to receive data and store it into the
>>>>>>>>>> partitions fast. Currently it seems that this consumes quite a lot of
>>>>>>>>>> CPU resources, just for writing things down.
>>>>>>>>>>
>>>>>>>>>> Additionally, we want to create indexes on the newly created
>>>>>>>>>> partitions. Currently we load each one as a table using
>>>>>>>>>> table = ibis::table::create();
>>>>>>>>>> # and then
>>>>>>>>>> table->buildIndexes();
>>>>>>>>>> delete table;
>>>>>>>>>>
>>>>>>>>>> I've noticed that there is a buildIndexes() function on the part
>>>>>>>>>> class. What is the difference? Should we use that one instead?
>>>>>>>>>> Additionally, the table allows building an index on specific columns,
>>>>>>>>>> while the part only builds them on all columns. Is there a reason for
>>>>>>>>>> this?
>>>>>>>>>>
>>>>>>>>>> Thank you,
>>>>>>>>>> Petr
>>>>>>>> _______________________________________________
>>>>>>>> FastBit-users mailing list
>>>>>>>> [email protected]
>>>>>>>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
