Re: [FastBit-users] Howto store data with fatbit library

Dominique Prunier Thu, 22 Mar 2012 07:59:20 -0700

Hi,

I was doing pretty much the same thing, and here is how I did it:
 * I handle the import myself (the column file format and the -part.txt file
are pretty straightforward; I was only using category and long columns. Beware
of the endianness, though)
 * I generate indexes column by column, using the code from the C API (which, 
to answer your question, builds the index for a specific column using 
part::getColumn and then column::loadIndex and column::unloadIndex on the 
selected column)
 * From time to time, I merge smaller partitions into larger ones and reindex them
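
The endianness caveat in the first bullet can be made concrete with a small sketch. It assumes that a column of long values is stored as a plain array of 8-byte integers in the byte order of the machine that will later read the partition; that layout is an assumption here and should be checked against a column file written by the library itself. The helper names (hostIsLittleEndian, encodeInt64, writeLongColumn) are hypothetical and not part of FastBit.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <fstream>
#include <string>
#include <vector>

// True when this machine stores the least significant byte first.
inline bool hostIsLittleEndian() {
    const std::uint16_t probe = 1;
    unsigned char first = 0;
    std::memcpy(&first, &probe, 1);
    return first == 1;
}

// Hypothetical helper: serialise 64-bit values with an explicit byte order,
// so the output does not depend on the machine doing the import.
inline std::vector<unsigned char> encodeInt64(
        const std::vector<std::int64_t>& values, bool littleEndian) {
    std::vector<unsigned char> out;
    out.reserve(values.size() * 8);
    for (std::int64_t v : values) {
        const std::uint64_t u = static_cast<std::uint64_t>(v);
        for (int i = 0; i < 8; ++i) {
            const int shift = littleEndian ? 8 * i : 8 * (7 - i);
            out.push_back(static_cast<unsigned char>((u >> shift) & 0xFF));
        }
    }
    assert(out.size() == values.size() * 8);
    return out;
}

// Hypothetical helper: write one column file, overwriting any existing file.
// `targetIsLittleEndian` describes the machine that will query the data.
inline bool writeLongColumn(const std::string& path,
                            const std::vector<std::int64_t>& values,
                            bool targetIsLittleEndian) {
    const std::vector<unsigned char> bytes =
        encodeInt64(values, targetIsLittleEndian);
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    out.write(reinterpret_cast<const char*>(bytes.data()),
              static_cast<std::streamsize>(bytes.size()));
    out.flush();
    return static_cast<bool>(out);
}
```

When the import and the queries run on the same machine, passing hostIsLittleEndian() as the last argument gives the same bytes as dumping the array directly.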

Hope this helps,

-----Original Message-----
From: [email protected] 
[mailto:[email protected]] On Behalf Of Thorgrin
Sent: Thursday, March 22, 2012 10:12 AM
To: K. John Wu
Cc: FastBit Users
Subject: Re: [FastBit-users] Howto store data with fatbit library

Hi John,

Thank you for pointing me in the right direction. What we are
currently using seems quite similar, but we are experiencing some
performance issues.

We are storing quite a lot of rows into multiple tables at the same time.
Currently we are using one thread to write into about 7 different
tables, some of which are more heavily used than others. We are
experiencing high CPU load while the throughput is not as high as
expected.
On the test machine we have hit the ceiling at about 35k rows per
second. This of course includes processing the incoming data as well
as storing it, but we believe that storing is the main
limiting factor. The hard drive performance should not be an issue
here; it is busy only about 7-15% of the time.
The data is not written to the hard drive immediately, but only once
about 200k rows have accumulated (this applies to each table).

Since the fastbit data partition format is quite simple, it might be
best to store the data directly. This would allow us to create a
buffer for each data type and partition, which could be written
directly to the hard drive. The only drawback is that we need to generate
the -part.txt file ourselves, but that is not too hard. Then we
can use the fastbit library to generate indexes on the existing data.
What is your recommendation?
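
A minimal sketch of the per-column buffer described above, under the assumption that a fixed-width column file is simply the raw values appended one after another in host byte order. ColumnBuffer and its members are hypothetical names, not FastBit API, and generating the -part.txt file is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: buffers one fixed-width column in memory and appends
// the buffered values to the column file once `flushRows` rows have
// accumulated (about 200k in the setup described above).
template <typename T>
class ColumnBuffer {
public:
    ColumnBuffer(std::string path, std::size_t flushRows)
        : path_(std::move(path)), flushRows_(flushRows) {
        assert(flushRows_ > 0);
        values_.reserve(flushRows_);
    }

    // Store one value; flush automatically when the buffer is full.
    void append(T value) {
        values_.push_back(value);
        if (values_.size() >= flushRows_) flush();
    }

    // Append all buffered values to the file in host byte order and return
    // the total number of rows written so far. If the write fails, the
    // values stay in the buffer so a later flush can retry.
    std::size_t flush() {
        if (!values_.empty()) {
            std::ofstream out(path_, std::ios::binary | std::ios::app);
            out.write(reinterpret_cast<const char*>(values_.data()),
                      static_cast<std::streamsize>(values_.size() * sizeof(T)));
            out.flush();
            if (out) {
                rowsOnDisk_ += values_.size();
                values_.clear();
            }
        }
        return rowsOnDisk_;
    }

    std::size_t rowsOnDisk() const { return rowsOnDisk_; }
    std::size_t rowsBuffered() const { return values_.size(); }

private:
    std::string path_;
    std::size_t flushRows_;
    std::vector<T> values_;
    std::size_t rowsOnDisk_ = 0;
};
```

With one such buffer per column, all columns of a partition would need to be flushed together so that they hold the same number of rows, and the row count recorded in -part.txt would have to match rowsOnDisk().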

Regarding the buildIndexes() functions, they are indeed present on both
parts and tables. But the table also has a buildIndex() function,
which can be used to generate indexes on specific columns. I cannot
seem to find an equivalent in the part API. Is there any way to build
an index on one column only using parts?

Thank you,
Petr

On 8 March 2012 18:46, K. John Wu <[email protected]> wrote:
> Hi, Petr,
>
> You are on the right track.  The file tests/setqgen.cpp is essentially
> doing what you are talking about.  You can take a look at the file
> either in the source code directory or online at <http://goo.gl/D1XgX>.
>
> The class ibis::table (for a data table) is a container of ibis::part
> (for a data partition).  A table can have multiple partitions.  All
> data records written by setqgen.cpp can be regarded as one table, but
> it might have a number of partitions.  The function
> ibis::table::buildIndexes calls ibis::part::buildIndexes on each data
> partition it has.  The actual work is done by
> ibis::part::buildIndexes.  If you are not using
> ibis::table::buildIndexes, you will be doing the looping yourself.
> Either way is fine.  Take the option that is convenient for you.
>
> Good luck.
>
> John
>
>
>
> On 3/8/12 8:10 AM, Thorgrin wrote:
>> Hi John,
>>
>> thank you for your ongoing work on fastbit library, the improvements are 
>> great.
>>
>> I have several questions regarding correct usage. We are currently
>> creating fastbit data partitions using a tablex object in the following
>> manner:
>>
>> # initialise partition with columns
>> tablex->addColumns();
>>
>> tablex->reserveSpace();
>> # multiple calls to append data. We are storing the data on the fly as
>> # it comes, so there are lots of calls to append.
>> tablex->append();
>>
>> # When we fill the reserved space, we write the data to disk
>> tablex->write();
>> tablex->clearData();
>>
>> # And continue with
>> tablex->append()
>> .
>> .
>>
>> Is this the right and efficient way? Or could you recommend a better
>> approach? We really just need to receive data and store it into the
>> partitions fast. Currently it seems that this consumes quite a lot of
>> CPU resources, just for writing things down.
>>
>> Additionally, we want to create indexes on the newly created
>> partitions. Currently we load it as a table using
>> table = ibis::table::create();
>> # and then
>> table->buildIndexes();
>> delete table;
>>
>> I've noticed that there is a buildIndexes() function on part class.
>> What is the difference? Should we use the other one? Additionally,
>> table allows building an index on specific columns, part only on all
>> columns. Is there a reason for this?
>>
>> Thank you,
>> Petr
_______________________________________________
FastBit-users mailing list
[email protected]
https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
