Re: [FastBit-users] Howto store data with fatbit library

Thorgrin Sat, 28 Apr 2012 12:02:57 -0700

Hi John,

I described the way we are using the FastBit library to store data in
my first post to this thread. What information do you need? I could
copy and paste the relevant portions of our code if it helps, although it
should boil down to what I described earlier.


Petr

On 28 April 2012 02:18, K. John Wu <[email protected]> wrote:
> Hi, Petr,
>
> Would you mind telling us a bit more about how you use FastBit to store
> data?  I am interested in spending some time on this to remove the
> excess operations in FastBit, but will need a concrete example to
> investigate the issue further.
>
> Thanks.
>
> John
>
>
> On 4/22/12 9:50 AM, Thorgrin wrote:
>> Hi John,
>>
>> I finally got to test the difference between the two approaches for
>> storing data. The difference between using FastBit library and storing
>> the data directly is quite significant.
>>
>> I'm sending the data to the storage program over the network using UDP,
>> so when it cannot cope with the speed, the network layer drops some
>> packets. I tried three speeds: 8k, 10k and 12k packets per second.
>> Each packet contains a number of rows, and the packet sizes are similar.
>> It takes about 154, 122 and 103 seconds to send the data at the
>> respective speeds.
>>
>> The numbers of stored rows are roughly summed up in the following table:
>> #speed (packets/s)   using library   direct storage
>> 8000                 7387000         31740000
>> 10000                5925000         31737000
>> 12000                5000000         31715000
>>
>> The way we are storing the data is simple. We use a buffer of 70k
>> values for each column, which is just a piece of allocated memory. When
>> the buffer is full, we flush the memory to the file.
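
The buffering scheme described above can be sketched roughly as follows. This is a minimal illustration and not the code used in the thread: the class name ColumnBuffer, its methods and the file path are hypothetical, and it assumes fixed-size values that are appended to one raw binary file per column in host byte order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>
#include <type_traits>
#include <vector>

// Hypothetical helper (not part of FastBit): keeps the values of one
// column in memory and appends them to a raw binary file when full.
template <typename T>
class ColumnBuffer {
    static_assert(std::is_trivially_copyable<T>::value,
                  "only fixed-size plain values can be written raw");

public:
    ColumnBuffer(const std::string& path, std::size_t capacity)
        : path_(path), capacity_(capacity), rowsWritten_(0) {
        assert(capacity > 0);
        values_.reserve(capacity);
    }

    // Store one value; flush to disk once the buffer is full.
    // Returns false only if a flush was needed and failed.
    bool append(const T& value) {
        values_.push_back(value);
        return values_.size() < capacity_ || flush();
    }

    // Append all buffered values to the column file in host byte order.
    bool flush() {
        if (values_.empty()) return true;
        std::FILE* f = std::fopen(path_.c_str(), "ab");
        if (f == nullptr) return false;
        const std::size_t n =
            std::fwrite(values_.data(), sizeof(T), values_.size(), f);
        const bool closed = (std::fclose(f) == 0);
        if (n != values_.size() || !closed) return false;
        rowsWritten_ += n;
        values_.clear();
        return true;
    }

    // Number of values that have reached the file so far.
    std::size_t rowsWritten() const { return rowsWritten_; }

private:
    std::string path_;
    std::size_t capacity_;
    std::size_t rowsWritten_;
    std::vector<T> values_;
};
```

With a capacity of 70000, one such object per column would flush roughly as described in the paragraph above. The remaining values still need an explicit flush() before the partition is closed, and the row count has to be recorded in the partition metadata.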
>>
>> I do not know whether these results are caused by misusing the library
>> somehow, but maybe someone else has stumbled upon this issue.
>>
>> Regards,
>> Petr
>>
>>>>> On 4/13/12 2:05 AM, Thorgrin wrote:
>>>>>> Hi John,
>>>>>>
>>>>>> my colleague is currently working on the data storage and he decided
>>>>>> to incorporate the creation of fastbit files directly into our code,
>>>>>> thus we have no generic storage code to give back. There are no
>>>>>> problems with our approach so far.
>>>>>>
>>>>>> I hope that I'll be able to test the performance soon to see if the
>>>>>> difference is really significant.
>>>>>>
>>>>>> Meanwhile, I've another question. How does fastbit work with byte
>>>>>> arrays? The table API seems to miss this feature, but there are some
>>>>>> internal types for this. We need to support storing byte arrays along
>>>>>> with strings and basic types in the end, so I would like to know
>>>>>> whether this is possible using fastbit, or if we have to come up with
>>>>>> another solution.
>>>>>>
>>>>>> Thank you,
>>>>>> Petr
>>>>>>
>>>>>> On 6 April 2012 19:36, K. John Wu <[email protected]> wrote:
>>>>>>> Hi, Petr,
>>>>>>>
>>>>>>> Have you got a chance to try this?  Any comments or questions?
>>>>>>>
>>>>>>> John
>>>>>>>
>>>>>>>
>>>>>>> On 3/22/12 8:14 AM, Thorgrin wrote:
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> we will definitely give it a try. Maybe this simple approach could be
>>>>>>>> made into some C++ class or C code and provided along with the
>>>>>>>> library, assuming the results are significantly better.
>>>>>>>>
>>>>>>>> Petr
>>>>>>>>
>>>>>>>> On 22 March 2012 15:58, Dominique Prunier
>>>>>>>> <[email protected]> wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I was doing pretty much the same thing, and here is how I did it:
>>>>>>>>>  * I handle import myself (the column file format and the -part.txt
>>>>>>>>>    are pretty straightforward; I was only using category and long,
>>>>>>>>>    beware of the endianness though)
>>>>>>>>>  * I generate indexes column by column, using the code from the C API
>>>>>>>>>    (which, to answer your question, builds the index for a specific
>>>>>>>>>    column using part::getColumn and then column::loadIndex and
>>>>>>>>>    column::unloadIndex on the selected column)
>>>>>>>>>  * From time to time, I merge smaller partitions into larger ones and
>>>>>>>>>    reindex them
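
The endianness caveat in the first item can be illustrated with a small sketch. It assumes that the raw column file must end up in the byte order of the machine that reads it, which is an assumption here and should be checked against the library. The function names are hypothetical and none of this is FastBit API.

```cpp
#include <cstdint>
#include <cstring>

// True when the host stores the least significant byte first.
inline bool hostIsLittleEndian() {
    const std::uint16_t probe = 1;
    unsigned char first = 0;
    std::memcpy(&first, &probe, 1);
    return first == 1;
}

// Reverse the byte order of a 64-bit value.
inline std::uint64_t byteSwap64(std::uint64_t v) {
    std::uint64_t r = 0;
    for (int i = 0; i < 8; ++i) {
        r = (r << 8) | (v & 0xFFu);
        v >>= 8;
    }
    return r;
}

// Convert a 64-bit value from the given source byte order to host order.
// Values already in host order are returned unchanged.
inline std::uint64_t toHostOrder(std::uint64_t v, bool sourceIsLittleEndian) {
    return sourceIsLittleEndian == hostIsLittleEndian() ? v : byteSwap64(v);
}
```

A writer that receives long values from a machine with a different byte order would pass each value through toHostOrder() before placing it in the column buffer.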
>>>>>>>>>
>>>>>>>>> Hope this helps,
>>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: [email protected] [mailto:[email protected]] On Behalf Of Thorgrin
>>>>>>>>> Sent: Thursday, March 22, 2012 10:12 AM
>>>>>>>>> To: K. John Wu
>>>>>>>>> Cc: FastBit Users
>>>>>>>>> Subject: Re: [FastBit-users] Howto store data with fatbit library
>>>>>>>>>
>>>>>>>>> Hi John,
>>>>>>>>>
>>>>>>>>> Thank you for pointing me in the right direction. What we are
>>>>>>>>> currently using seems quite similar, but we are experiencing some
>>>>>>>>> performance issues.
>>>>>>>>>
>>>>>>>>> We are storing quite a lot of rows into multiple tables at the same time.
>>>>>>>>> Currently we are using one thread to write into about 7 different
>>>>>>>>> tables, some of which are more heavily used than others. We are
>>>>>>>>> experiencing high CPU load while the throughput is not as high as
>>>>>>>>> expected.
>>>>>>>>> On the test machine we have hit the ceiling at about 35k rows per
>>>>>>>>> second. This of course includes processing the incoming data as well
>>>>>>>>> as storing it, but we believe that storing is the most
>>>>>>>>> limiting factor. The hard drive performance should not be an issue
>>>>>>>>> here; it is only about 7-15% busy.
>>>>>>>>> The data is not written to the hard drive immediately, but only when
>>>>>>>>> about 200k rows have accumulated (this applies to each table).
>>>>>>>>>
>>>>>>>>> Since the fastbit data partition format is quite simple, it might be
>>>>>>>>> best to store the data directly. This would allow us to create a
>>>>>>>>> buffer for each data type and partition, which could be written
>>>>>>>>> directly to the hard drive. The only drawback is that we need to generate
>>>>>>>>> the -part.txt file ourselves, but that is not too hard. Then we
>>>>>>>>> can use the fastbit library to generate indexes on the existing data.
>>>>>>>>> What is your recommendation?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Regarding the buildIndexes() functions, they are indeed present on both
>>>>>>>>> parts and tables. But the table also has a buildIndex() function,
>>>>>>>>> which can be used to generate indexes on specific columns. I cannot
>>>>>>>>> seem to find an equivalent in the parts API. Is there any way to build
>>>>>>>>> an index on one column only using parts?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thank you,
>>>>>>>>> Petr
>>>>>>>>>
>>>>>>>>> On 8 March 2012 18:46, K. John Wu <[email protected]> wrote:
>>>>>>>>>> Hi, Petr,
>>>>>>>>>>
>>>>>>>>>> You are on the right track.  The file tests/setqgen.cpp is essentially
>>>>>>>>>> doing what you are talking about.  You can take a look at the file
>>>>>>>>>> either in the source code directory or online at <http://goo.gl/D1XgX>.
>>>>>>>>>>
>>>>>>>>>> The class ibis::table (for a data table) is a container of ibis::part
>>>>>>>>>> (for a data partition).  A table can have multiple partitions.  All
>>>>>>>>>> data records written by setqgen.cpp can be regarded as one table, but
>>>>>>>>>> it might have a number of partitions.  The function
>>>>>>>>>> ibis::table::buildIndexes calls ibis::part::buildIndexes on each data
>>>>>>>>>> partition it has.  The actual work is done by
>>>>>>>>>> ibis::part::buildIndexes.  If you are not using
>>>>>>>>>> ibis::table::buildIndexes, you will be doing the looping yourself.
>>>>>>>>>> Either way is fine.  Take the option that is convenient for you.
>>>>>>>>>>
>>>>>>>>>> Good luck.
>>>>>>>>>>
>>>>>>>>>> John
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 3/8/12 8:10 AM, Thorgrin wrote:
>>>>>>>>>>> Hi John,
>>>>>>>>>>>
>>>>>>>>>>> thank you for your ongoing work on the fastbit library; the
>>>>>>>>>>> improvements are great.
>>>>>>>>>>>
>>>>>>>>>>> I have several questions regarding correct usage. We are currently
>>>>>>>>>>> creating fastbit data partitions using a tablex object in the following
>>>>>>>>>>> manner:
>>>>>>>>>>>
>>>>>>>>>>> # initialise the partition with columns
>>>>>>>>>>> tablex->addColumns();
>>>>>>>>>>>
>>>>>>>>>>> tablex->reserveSpace();
>>>>>>>>>>> # multiple calls to append data. We are storing the data on the fly
>>>>>>>>>>> # as it comes, so there are lots of calls to append.
>>>>>>>>>>> tablex->append();
>>>>>>>>>>>
>>>>>>>>>>> # When we fill the reserved space, we write the data to disk
>>>>>>>>>>> tablex->write();
>>>>>>>>>>> tablex->clearData();
>>>>>>>>>>>
>>>>>>>>>>> # And continue with
>>>>>>>>>>> tablex->append();
>>>>>>>>>>> .
>>>>>>>>>>> .
>>>>>>>>>>>
>>>>>>>>>>> Is this the right and efficient way? Or could you recommend a better
>>>>>>>>>>> approach? We really just need to receive data and store it into the
>>>>>>>>>>> partitions fast. Currently it seems that this consumes quite a lot of
>>>>>>>>>>> CPU resources, just for writing things down.
>>>>>>>>>>>
>>>>>>>>>>> Additionally, we want to create indexes on the newly created
>>>>>>>>>>> partitions. Currently we load it as a table using
>>>>>>>>>>> table = ibis::table::create();
>>>>>>>>>>> # and then
>>>>>>>>>>> table->buildIndexes();
>>>>>>>>>>> delete table;
>>>>>>>>>>>
>>>>>>>>>>> I've noticed that there is a buildIndexes() function on the part class.
>>>>>>>>>>> What is the difference? Should we use the other one? Additionally,
>>>>>>>>>>> table allows building an index on specific columns, part only on all
>>>>>>>>>>> columns. Is there a reason for this?
>>>>>>>>>>>
>>>>>>>>>>> Thank you,
>>>>>>>>>>> Petr
>>>>>>>>> _______________________________________________
>>>>>>>>> FastBit-users mailing list
>>>>>>>>> [email protected]
>>>>>>>>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
