Re: [FastBit-users] Howto store data with fatbit library

K. John Wu Fri, 06 Apr 2012 10:36:32 -0700

Hi, Petr,

Have you had a chance to try this?  Any comments or questions?


John


On 3/22/12 8:14 AM, Thorgrin wrote:
> Thanks,
> 
> we will definitely give it a try. Maybe this simple approach could be
> made into some C++ class or C code and provided along with the
> library, assuming the results are significantly better.
> 
> Petr
> 
> On 22 March 2012 15:58, Dominique Prunier
> <[email protected]> wrote:
>> Hi,
>>
>> I was doing pretty much the same thing, and here is how I did it:
>>  * I handle the import myself (the column file format and the -part.txt are
>> pretty straightforward; I was only using category and long, but beware of the
>> endianness)
>>  * I generate indexes column by column, using the code from the C API
>> (which, to answer your question, builds the index for a specific column
>> using part::getColumn and then column::loadIndex and column::unloadIndex on
>> the selected column)
>>  * From time to time, I merge smaller partitions into a larger one and
>> reindex it
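
The merge step in the last bullet could look roughly like the sketch below. It assumes that a fixed-width column (such as long) is stored as a plain array of values in host byte order, one file per column, so that merging partitions amounts to concatenating the per-column files. The function name mergeColumnFiles is hypothetical and not part of FastBit. The row count in the merged partition's -part.txt still has to be updated, and the indexes rebuilt, afterwards.

```cpp
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper (not part of FastBit): concatenates the raw data files
// of one fixed-width column, taken from several small partitions, into a
// single destination file. Returns the number of bytes written, or -1 on error.
long long mergeColumnFiles(const std::vector<std::string>& sources,
                           const std::string& destination) {
    std::ofstream out(destination, std::ios::binary | std::ios::trunc);
    if (!out) return -1;

    long long total = 0;
    std::vector<char> chunk(1 << 16);  // copy in 64 KiB pieces
    for (const std::string& src : sources) {
        std::ifstream in(src, std::ios::binary);
        if (!in) return -1;
        while (in) {
            in.read(chunk.data(), static_cast<std::streamsize>(chunk.size()));
            const std::streamsize got = in.gcount();
            if (got > 0) {
                out.write(chunk.data(), got);
                total += got;
            }
        }
    }
    out.flush();
    return out ? total : -1;
}
```

The sources must be listed in the same order for every column, otherwise the rows of the merged partition no longer line up across columns.
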
>>
>> Hope this helps,
>>
>> -----Original Message-----
>> From: [email protected] 
>> [mailto:[email protected]] On Behalf Of Thorgrin
>> Sent: Thursday, March 22, 2012 10:12 AM
>> To: K. John Wu
>> Cc: FastBit Users
>> Subject: Re: [FastBit-users] Howto store data with fatbit library
>>
>> Hi John,
>>
>> Thank you for pointing me in the right direction. What we are
>> currently using seems quite similar, but we are experiencing some
>> performance issues.
>>
>> We are storing quite a lot of rows into multiple tables at the same time.
>> Currently we are using one thread to write into about 7 different
>> tables, some of which are more heavily used than others. We are
>> experiencing high CPU load while the throughput is not as high as
>> expected.
>> On the test machine we have hit the ceiling at about 35k rows per
>> second. This of course includes processing the incoming data as well
>> as storing it, but we believe that the storing is the most
>> limiting factor. The hard drive performance should not be an issue
>> here; it is only about 7-15% busy.
>> The data is not written to the hard drive immediately, but only once
>> about 200k rows have accumulated (this applies to each table).
>>
>> Since the FastBit data partition format is quite simple, it might be
>> best to store the data directly. This would allow us to create a
>> buffer for each data type and partition, which could be written
>> directly to the hard drive. The only drawback is that we need to generate
>> the -part.txt file ourselves, but that is not too hard. Then we
>> can use the FastBit library to generate indexes on the existing data.
>> What is your recommendation?
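
A minimal sketch of the per-column buffering idea described above, using only the C++ standard library. It assumes that a fixed-width column is stored as a raw array of values in host byte order, in a file of its own. The class ColumnBuffer and its members are hypothetical names, not part of FastBit, and the sketch deliberately leaves out writing the -part.txt file.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not part of FastBit): buffers the values of one
// fixed-width column and appends them to that column's raw data file.
template <typename T>
class ColumnBuffer {
public:
    ColumnBuffer(std::string path, std::size_t flushRows)
        : path_(std::move(path)), flushRows_(flushRows) {
        assert(flushRows_ > 0);
        values_.reserve(flushRows_);
    }

    // Add one value; write the buffer out once flushRows values are pending.
    // Returns false only if a triggered flush failed.
    bool append(T v) {
        values_.push_back(v);
        return values_.size() < flushRows_ || flush();
    }

    // Append all pending values to the file as raw bytes in host byte order.
    bool flush() {
        if (values_.empty()) return true;
        std::ofstream out(path_, std::ios::binary | std::ios::app);
        if (!out) return false;
        out.write(reinterpret_cast<const char*>(values_.data()),
                  static_cast<std::streamsize>(values_.size() * sizeof(T)));
        out.flush();
        if (!out) return false;
        rowsWritten_ += values_.size();
        values_.clear();
        return true;
    }

    std::size_t rowsWritten() const { return rowsWritten_; }
    std::size_t pending() const { return values_.size(); }

private:
    std::string path_;
    std::size_t flushRows_;
    std::vector<T> values_;
    std::size_t rowsWritten_ = 0;
};
```

With one such buffer per column and a threshold of about 200k rows, each flush is a single large sequential write. All columns of a partition need to be flushed at the same row boundary so that their files stay the same length in rows.
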
>>
>>
>> Regarding the buildIndexes() functions, they are indeed present on both
>> parts and tables. But the table also has a buildIndex() function,
>> which can be used to generate indexes on specific columns. I cannot
>> seem to find an equivalent in the part API. Is there any way to build
>> an index on one column only using parts?
>>
>>
>> Thank you,
>> Petr
>>
>> On 8 March 2012 18:46, K. John Wu <[email protected]> wrote:
>>> Hi, Petr,
>>>
>>> You are on the right track.  The file tests/setqgen.cpp is essentially
>>> doing what you are talking about.  You can take a look at the file
>>> either in the source code directory or online at <http://goo.gl/D1XgX>.
>>>
>>> The class ibis::table (for a data table) is a container of ibis::part
>>> (for a data partition).  A table can have multiple partitions.  All
>>> data records written by setqgen.cpp can be regarded as one table, but
>>> it might have a number of partitions.  The function
>>> ibis::table::buildIndexes calls ibis::part::buildIndexes on each data
>>> partition it has.  The actual work is done by
>>> ibis::part::buildIndexes.  If you are not using
>>> ibis::table::buildIndexes, you will be doing the looping yourself.
>>> Either way is fine.  Take the option that is convenient for you.
>>>
>>> Good luck.
>>>
>>> John
>>>
>>>
>>>
>>> On 3/8/12 8:10 AM, Thorgrin wrote:
>>>> Hi John,
>>>>
>>>> thank you for your ongoing work on the FastBit library; the improvements are
>>>> great.
>>>>
>>>> I have several questions regarding correct usage. We are currently
>>>> creating FastBit data partitions using a tablex object in the following
>>>> manner:
>>>>
>>>> # initialise partition with columns
>>>> tablex->addColumns();
>>>>
>>>> tablex->reserveSpace();
>>>> # multiple calls to append data. We are storing the data on the fly as
>>>> # it comes, so there are lots of calls to append.
>>>> tablex->append();
>>>>
>>>> # When we fill the reserved space, we write the data to disk
>>>> tablex->write();
>>>> tablex->clearData();
>>>>
>>>> # And continue with
>>>> tablex->append();
>>>> .
>>>> .
>>>>
>>>> Is this the right and efficient way? Or could you recommend a better
>>>> approach? We really just need to receive data and store it into the
>>>> partitions fast. Currently it seems that this consumes quite a lot of
>>>> CPU resources, just for writing things down.
>>>>
>>>> Additionally, we want to create indexes on the newly created
>>>> partitions. Currently we load each partition as a table using
>>>> table = ibis::table::create();
>>>> # and then
>>>> table->buildIndexes();
>>>> delete table;
>>>>
>>>> I've noticed that there is a buildIndexes() function on the part class.
>>>> What is the difference? Should we use that one instead? Additionally,
>>>> table allows building an index on specific columns, part only on all
>>>> columns. Is there a reason for this?
>>>>
>>>> Thank you,
>>>> Petr
>> _______________________________________________
>> FastBit-users mailing list
>> [email protected]
>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
