Re: [FastBit-users] Howto store data with fatbit library

Thorgrin Fri, 13 Apr 2012 02:07:44 -0700

Hi John,

my colleague is currently working on the data storage and he decided
to incorporate the creation of fastbit files directly into our code,
thus we have no generic storage code to give back. There are no
problems with our approach so far.
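
For anyone following the thread, here is a minimal sketch of what writing
a partition directly could look like. It is not our production code. It
assumes that each column is a raw binary file named after the column,
with values in the host's native byte order, and that -part.txt uses the
header and column keys shown below. Those key names and the "INT" type
string are written from memory, so compare them against a -part.txt
produced by tablex->write() before relying on them. IntColumn and
writePartition are hypothetical names, not part of the fastbit API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// One in-memory column buffer (hypothetical helper type). Only 32-bit
// signed integers are handled in this sketch.
struct IntColumn {
    std::string name;
    std::vector<std::int32_t> values;
};

// Write every column as a raw binary file named after the column, then
// write the -part.txt metadata file into the same directory.
// Values are written in native byte order; `dir` must already exist.
// Returns false on any I/O failure.
bool writePartition(const std::string& dir, const std::string& partName,
                    const std::vector<IntColumn>& cols) {
    if (cols.empty()) return false;
    const std::size_t nrows = cols.front().values.size();
    for (const IntColumn& c : cols) {
        // All columns of one partition must hold the same number of rows.
        assert(c.values.size() == nrows);
        std::ofstream out(dir + "/" + c.name,
                          std::ios::binary | std::ios::trunc);
        if (!out) return false;
        out.write(reinterpret_cast<const char*>(c.values.data()),
                  static_cast<std::streamsize>(nrows * sizeof(std::int32_t)));
        if (!out) return false;
    }
    // Metadata keys below are assumptions; verify against a real file.
    std::ofstream meta(dir + "/-part.txt", std::ios::trunc);
    if (!meta) return false;
    meta << "BEGIN HEADER\n"
         << "Name = \"" << partName << "\"\n"
         << "Number_of_rows = " << nrows << "\n"
         << "Number_of_columns = " << cols.size() << "\n"
         << "END HEADER\n";
    for (const IntColumn& c : cols) {
        meta << "\nBegin Column\n"
             << "name = \"" << c.name << "\"\n"
             << "data_type = \"INT\"\n"
             << "End Column\n";
    }
    meta.flush();
    return static_cast<bool>(meta);
}
```

A caller would fill one IntColumn per column while rows arrive, call
writePartition(dir, name, cols) once the buffer is full, clear the
vectors, and then let the library build indexes on the finished
directory. Because the bytes go out in native order, the files are only
portable between machines of the same endianness.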


I hope that I'll be able to test the performance soon to see if the
difference is really significant.

Meanwhile, I have another question. How does fastbit work with byte
arrays? The table API seems to lack this feature, but there are some
internal types for it. In the end we need to support storing byte arrays
along with strings and basic types, so I would like to know whether this
is possible using fastbit, or whether we have to come up with another
solution.

Thank you,
Petr

On 6 April 2012 19:36, K. John Wu <[email protected]> wrote:
> Hi, Petr,
>
> Have you had a chance to try this?  Any comments or questions?
>
> John
>
>
> On 3/22/12 8:14 AM, Thorgrin wrote:
>> Thanks,
>>
>> we will definitely give it a try. Maybe this simple approach could be
>> made into some C++ class or C code and provided along with the
>> library, assuming the results are significantly better.
>>
>> Petr
>>
>> On 22 March 2012 15:58, Dominique Prunier
>> <[email protected]> wrote:
>>> Hi,
>>>
>>> I was doing pretty much the same thing, and here is how I did it:
>>>  * I handle import myself (the column file format and the -part.txt are
>>> pretty straightforward; I was only using category and long. Beware of the
>>> endianness though)
>>>  * I generate indexes column by column, using the code from the C API
>>> (which, to answer your question, builds the index for a specific column
>>> using part::getColumn and then column::loadIndex and column::unloadIndex on
>>> the selected column)
>>>  * From time to time, I merge smaller partitions into larger ones and
>>> reindex them
>>>
>>> Hope this helps,
>>>
>>> -----Original Message-----
>>> From: [email protected] 
>>> [mailto:[email protected]] On Behalf Of Thorgrin
>>> Sent: Thursday, March 22, 2012 10:12 AM
>>> To: K. John Wu
>>> Cc: FastBit Users
>>> Subject: Re: [FastBit-users] Howto store data with fatbit library
>>>
>>> Hi John,
>>>
>>> Thank you for pointing me in the right direction. What we are
>>> currently using seems quite similar, but we are experiencing some
>>> performance issues.
>>>
>>> We are storing quite a lot of rows into multiple tables at the same time.
>>> Currently we are using one thread to write into about 7 different
>>> tables, some of which are more heavily used than others. We are
>>> experiencing high CPU load while the throughput is lower than
>>> expected.
>>> On the test machine we have hit a ceiling at about 35k rows per
>>> second. This of course includes processing the incoming data as well
>>> as storing it, but we believe that storing is the main limiting
>>> factor. Hard drive performance should not be an issue here, since the
>>> drive is only about 7-15% busy.
>>> The data is not written to the hard drive immediately, but only once
>>> about 200k rows have accumulated (this applies to each table).
>>>
>>> Since the fastbit data partition format is quite simple, it might be
>>> best to store the data directly. This would allow us to create a
>>> buffer for each data type and partition, which could be written
>>> directly to the hard drive. The only drawback is that we need to
>>> generate the -part.txt file ourselves, but that is not too hard. Then
>>> we can use the fastbit library to generate indexes on the existing data.
>>> What is your recommendation?
>>>
>>>
>>> Regarding the buildIndexes() functions, they are indeed present on both
>>> parts and tables. But tables also have a buildIndex() function,
>>> which can be used to generate an index on a specific column. I cannot
>>> seem to find an equivalent in the part API. Is there any way to build
>>> an index on one column only using parts?
>>>
>>>
>>> Thank you,
>>> Petr
>>>
>>> On 8 March 2012 18:46, K. John Wu <[email protected]> wrote:
>>>> Hi, Petr,
>>>>
>>>> You are on the right track.  The file tests/setqgen.cpp is essentially
>>>> doing what you are talking about.  You can take a look at the file
>>>> either in the source code directory or online at <http://goo.gl/D1XgX>.
>>>>
>>>> The class ibis::table (for a data table) is a container of ibis::part
>>>> (for a data partition).  A table can have multiple partitions.  All
>>>> data records written by setqgen.cpp can be regarded as one table, but
>>>> it might have a number of partitions.  The function
>>>> ibis::table::buildIndexes calls ibis::part::buildIndexes on each data
>>>> partition it has.  The actual work is done by
>>>> ibis::part::buildIndexes.  If you are not using
>>>> ibis::table::buildIndexes, you will be doing the looping yourself.
>>>> Either way is fine.  Take the option that is convenient for you.
>>>>
>>>> Good luck.
>>>>
>>>> John
>>>>
>>>>
>>>>
>>>> On 3/8/12 8:10 AM, Thorgrin wrote:
>>>>> Hi John,
>>>>>
>>>>> thank you for your ongoing work on the fastbit library, the
>>>>> improvements are great.
>>>>>
>>>>> I have several questions regarding correct usage. We are currently
>>>>> creating fastbit data partitions using a tablex object in the
>>>>> following manner:
>>>>>
>>>>> // initialise partition with columns
>>>>> tablex->addColumns();
>>>>>
>>>>> tablex->reserveSpace();
>>>>> // multiple calls to append data. We are storing the data on the fly
>>>>> // as it comes, so there are lots of calls to append.
>>>>> tablex->append();
>>>>>
>>>>> // When we fill the reserved space, we write the data to disk
>>>>> tablex->write();
>>>>> tablex->clearData();
>>>>>
>>>>> // And continue with
>>>>> tablex->append();
>>>>> // ...
>>>>> .
>>>>>
>>>>> Is this the right and efficient way? Or could you recommend a better
>>>>> approach? We really just need to receive data and store it into the
>>>>> partitions fast. Currently it seems that this consumes quite a lot of
>>>>> CPU resources, just for writing things down.
>>>>>
>>>>> Additionally, we want to create indexes on the newly created
>>>>> partitions. Currently we load them as a table using
>>>>> table = ibis::table::create();
>>>>> // and then
>>>>> table->buildIndexes();
>>>>> delete table;
>>>>>
>>>>> I've noticed that there is a buildIndexes() function on the part
>>>>> class. What is the difference? Should we use that one instead?
>>>>> Additionally, table allows building an index on specific columns,
>>>>> while part only builds them on all columns. Is there a reason for this?
>>>>>
>>>>> Thank you,
>>>>> Petr
>>> _______________________________________________
>>> FastBit-users mailing list
>>> [email protected]
>>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users