Re: [FastBit-users] Howto store data with fatbit library

Thorgrin Sun, 22 Apr 2012 09:50:24 -0700

Hi John,

I finally got to test the difference between the two approaches for
storing data. The difference between using the FastBit library and
storing the data directly is quite significant.


I'm sending the data to the storage program over the network using UDP,
so when it cannot keep up, the network layer drops some packets. I tried
three speeds: 8k, 10k and 12k packets per second. Each packet contains a
number of rows, and the packets are of similar size. It takes about 154,
122 and 103 seconds to send the data at the respective speeds.

The numbers of stored rows are roughly summed up in the following table:
#speed (packets/s)   using library (rows)   direct storage (rows)
8000                 7387000                31740000
10000                5925000                31737000
12000                5000000                31715000
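
To put the table in perspective, here is a small C++ sketch that works
out how many packets each run sent (speed times send time) and what share
of the directly stored rows the library path managed to store. The struct
and function names are made up for illustration, and treating the direct
storage column as the baseline is an assumption: the message does not say
how many rows were actually sent.

```cpp
#include <cstdio>

// One measurement taken from the table and the send times in the message.
struct Run {
    long long packetsPerSecond;
    long long seconds;       // approximate send time
    long long libraryRows;   // rows stored when using the library
    long long directRows;    // rows stored with direct storage
};

const Run kRuns[] = {
    {8000, 154, 7387000, 31740000},
    {10000, 122, 5925000, 31737000},
    {12000, 103, 5000000, 31715000},
};

// Approximate number of packets sent during the run.
long long packetsSent(const Run& r) { return r.packetsPerSecond * r.seconds; }

// Rows stored via the library as a fraction of rows stored directly.
double libraryShare(const Run& r) {
    return static_cast<double>(r.libraryRows) / static_cast<double>(r.directRows);
}

// Print one summary line per run.
void printSummary() {
    for (const Run& r : kRuns) {
        std::printf("%lld pkt/s: ~%lld packets, library stored %.1f%% of direct\n",
                    r.packetsPerSecond, packetsSent(r), 100.0 * libraryShare(r));
    }
}
```

Each run sends roughly 1.2 million packets, and the library path ends up
with somewhere between about 16% and 23% of the rows that direct storage
keeps, with the share shrinking as the packet rate goes up.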

The way we are storing the data is simple. We use a buffer of 70k values
for each column, which is just a piece of allocated memory. When the
buffer is full, we flush it to the file.
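
For readers who want to try the same idea, here is a minimal C++ sketch
of such a per-column buffer. It is not our actual code and not part of
FastBit: the class name ColumnBuffer and its methods are hypothetical,
and it assumes a fixed-width column whose file is simply the raw values
appended in native byte order (so mind the endianness if files move
between machines). Strings, the -part.txt file and index building are
not covered.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: buffers fixed-width values for one column and
// appends them as raw bytes to that column's file when the buffer fills.
template <typename T>
class ColumnBuffer {
public:
    ColumnBuffer(std::string path, std::size_t capacity)
        : path_(std::move(path)), capacity_(capacity), rowsWritten_(0) {
        assert(capacity_ > 0);
        values_.reserve(capacity_);
    }

    // Add one value; flush to disk once the buffer is full.
    void append(const T& v) {
        values_.push_back(v);
        if (values_.size() >= capacity_) flush();
    }

    // Append all buffered values to the file and empty the buffer.
    void flush() {
        if (values_.empty()) return;
        std::FILE* f = std::fopen(path_.c_str(), "ab");
        if (!f) throw std::runtime_error("cannot open " + path_);
        const std::size_t n =
            std::fwrite(values_.data(), sizeof(T), values_.size(), f);
        std::fclose(f);
        if (n != values_.size())
            throw std::runtime_error("short write to " + path_);
        rowsWritten_ += n;
        values_.clear();
    }

    // Number of values written to the file so far.
    std::size_t rowsWritten() const { return rowsWritten_; }

private:
    std::string path_;
    std::size_t capacity_;
    std::size_t rowsWritten_;
    std::vector<T> values_;
};
```

In use, one would create one buffer per column, for example
ColumnBuffer<int> with a capacity of 70000, call append() for every
incoming row, and call flush() once more before closing the partition so
that a partially filled buffer is not lost.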

I do not know whether these results come from misusing the library
somehow, but maybe someone else has stumbled upon this issue.

Regards,
Petr

>>> On 4/13/12 2:05 AM, Thorgrin wrote:
>>>> Hi John,
>>>>
>>>> my colleague is currently working on the data storage and he decided
>>>> to incorporate the creation of fastbit files directly into our code,
>>>> thus we have no generic storage code to give back. There are no
>>>> problems with our approach so far.
>>>>
>>>> I hope that I'll be able to test the performance soon to see if the
>>>> difference is really significant.
>>>>
>>>> Meanwhile, I have another question. How does fastbit work with byte
>>>> arrays? The table API seems to lack this feature, but there are some
>>>> internal types for this. We need to support storing byte arrays along
>>>> with strings and basic types in the end, so I would like to know
>>>> whether this is possible using fastbit, or if we have to come up with
>>>> another solution.
>>>>
>>>> Thank you,
>>>> Petr
>>>>
>>>> On 6 April 2012 19:36, K. John Wu <[email protected]> wrote:
>>>>> Hi, Petr,
>>>>>
>>>>> Have you got a chance to try this?  Any comments or questions?
>>>>>
>>>>> John
>>>>>
>>>>>
>>>>> On 3/22/12 8:14 AM, Thorgrin wrote:
>>>>>> Thanks,
>>>>>>
>>>>>> we will definitely give it a try. Maybe this simple approach could be
>>>>>> made into some C++ class or C code and provided along with the
>>>>>> library, assuming the results are significantly better.
>>>>>>
>>>>>> Petr
>>>>>>
>>>>>> On 22 March 2012 15:58, Dominique Prunier
>>>>>> <[email protected]> wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> I was doing pretty much the same thing, and here is how I did it:
>>>>>>>  * I handle import myself (the column file format and the -part.txt are
>>>>>>> pretty straightforward; I was only using category and long. Beware of
>>>>>>> the endianness, though)
>>>>>>>  * I generate indexes column by column, using the code from the C API
>>>>>>> (which, to answer your question, builds the index for a specific column
>>>>>>> using part::getColumn and then column::loadIndex and
>>>>>>> column::unloadIndex on the selected column)
>>>>>>>  * From time to time, I merge smaller partitions into a larger one and
>>>>>>> reindex them
>>>>>>>
>>>>>>> Hope this helps,
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: [email protected] 
>>>>>>> [mailto:[email protected]] On Behalf Of Thorgrin
>>>>>>> Sent: Thursday, March 22, 2012 10:12 AM
>>>>>>> To: K. John Wu
>>>>>>> Cc: FastBit Users
>>>>>>> Subject: Re: [FastBit-users] Howto store data with fatbit library
>>>>>>>
>>>>>>> Hi John,
>>>>>>>
>>>>>>> Thank you for pointing me in the right direction. What we are
>>>>>>> currently using seems quite similar, but we are experiencing some
>>>>>>> performance issues.
>>>>>>>
>>>>>>> We are storing quite a lot of rows into multiple tables at the same
>>>>>>> time. Currently we are using one thread to write into about 7 different
>>>>>>> tables, some of which are more heavily used than others. We are
>>>>>>> experiencing high CPU load while the throughput is not as high as
>>>>>>> expected.
>>>>>>> On the test machine we have hit the ceiling at about 35k rows per
>>>>>>> second. This of course includes processing the incoming data as well
>>>>>>> as storing it, but we believe that the storing is the most limiting
>>>>>>> factor. The harddrive performance should not be an issue here; it is
>>>>>>> busy at about 7-15%.
>>>>>>> The data is not written to the harddrive immediately, but only when
>>>>>>> there are about 200k rows (this applies to each table).
>>>>>>>
>>>>>>> Since the fastbit data partition format is quite simple, it might be
>>>>>>> best to store the data directly. This would allow us to create a
>>>>>>> buffer for each data type and partition, which could be written
>>>>>>> directly to the harddrive. The only drawback is that we need to
>>>>>>> generate the -part.txt file ourselves, but that is not too hard. Then
>>>>>>> we can use the fastbit library to generate indexes on the existing data.
>>>>>>> What is your recommendation?
>>>>>>>
>>>>>>>
>>>>>>> Regarding the buildIndexes() functions, they are indeed present on
>>>>>>> both parts and tables. But the table also has a buildIndex() function,
>>>>>>> which can be used to generate indexes on specific columns. I cannot
>>>>>>> seem to find an equivalent in the parts API. Is there any way to build
>>>>>>> an index on one column only using parts?
>>>>>>>
>>>>>>>
>>>>>>> Thank you,
>>>>>>> Petr
>>>>>>>
>>>>>>> On 8 March 2012 18:46, K. John Wu <[email protected]> wrote:
>>>>>>>> Hi, Petr,
>>>>>>>>
>>>>>>>> You are on the right track.  The file tests/setqgen.cpp is essentially
>>>>>>>> doing what you are talking about.  You can take a look at the file
>>>>>>>> either in the source code directory or online at <http://goo.gl/D1XgX>.
>>>>>>>>
>>>>>>>> The class ibis::table (for a data table) is a container of ibis::part
>>>>>>>> (for a data partition).  A table can have multiple partitions.  All
>>>>>>>> data records written by setqgen.cpp can be regarded as one table, but
>>>>>>>> it might have a number of partitions.  The function
>>>>>>>> ibis::table::buildIndexes calls ibis::part::buildIndexes on each data
>>>>>>>> partition it has.  The actual work is done by
>>>>>>>> ibis::part::buildIndexes.  If you are not using
>>>>>>>> ibis::table::buildIndexes, you will be doing the looping yourself.
>>>>>>>> Either way is fine.  Take the option that is convenient for you.
>>>>>>>>
>>>>>>>> Good luck.
>>>>>>>>
>>>>>>>> John
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 3/8/12 8:10 AM, Thorgrin wrote:
>>>>>>>>> Hi John,
>>>>>>>>>
>>>>>>>>> thank you for your ongoing work on the fastbit library; the
>>>>>>>>> improvements are great.
>>>>>>>>>
>>>>>>>>> I have several questions regarding correct usage. We are currently
>>>>>>>>> creating fastbit data partitions using a tablex object in the following
>>>>>>>>> manner:
>>>>>>>>>
>>>>>>>>> # initialise partition with columns
>>>>>>>>> tablex->addColumns();
>>>>>>>>>
>>>>>>>>> tablex->reserveSpace();
>>>>>>>>> # multiple calls to append data. We are storing the data on the fly as
>>>>>>>>> # it comes, so there are lots of calls to append.
>>>>>>>>> tablex->append();
>>>>>>>>>
>>>>>>>>> # When we fill the reserved space, we write the data to disk
>>>>>>>>> tablex->write();
>>>>>>>>> tablex->clearData();
>>>>>>>>>
>>>>>>>>> # And continue with
>>>>>>>>> tablex->append()
>>>>>>>>> .
>>>>>>>>> .
>>>>>>>>>
>>>>>>>>> Is this the right and efficient way? Or could you recommend a better
>>>>>>>>> approach? We really just need to receive data and store it into the
>>>>>>>>> partitions fast. Currently it seems that this consumes quite a lot of
>>>>>>>>> CPU resources, just for writing things down.
>>>>>>>>>
>>>>>>>>> Additionally, we want to create indexes on the newly created
>>>>>>>>> partitions. Currently we load it as a table using
>>>>>>>>> table = ibis::table::create();
>>>>>>>>> # and then
>>>>>>>>> table->buildIndexes();
>>>>>>>>> delete table;
>>>>>>>>>
>>>>>>>>> I've noticed that there is a buildIndexes() function on the part class.
>>>>>>>>> What is the difference? Should we use the other one? Additionally, a
>>>>>>>>> table allows building an index on specific columns, while a part only
>>>>>>>>> builds indexes on all columns. Is there a reason for this?
>>>>>>>>>
>>>>>>>>> Thank you,
>>>>>>>>> Petr
>>>>>>> _______________________________________________
>>>>>>> FastBit-users mailing list
>>>>>>> [email protected]
>>>>>>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
