Hi Matthew,
although I really like Riak, I would seriously consider using Hadoop or
something similar (perhaps Disco, http://discoproject.org/, if you don't
like Java), since you only need to read/write your data sequentially.
Assuming your example is representative of a typical log entry
(under 300 bytes each), your 50,000 writes/s are really a piece of
cake for a Hadoop-like system: that amounts to only 300 * 50,000 bytes/s,
or about 15 MB/s.
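Just to make the arithmetic explicit (a trivial sketch; the 300 bytes/record and 50,000 writes/s are the figures from above):

```python
record_size_bytes = 300      # upper bound per log entry
writes_per_second = 50_000   # required write rate

# Sequential write throughput the cluster has to sustain
throughput_mb_s = record_size_bytes * writes_per_second / 1_000_000
print(throughput_mb_s)  # 15.0 MB/s
```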
Analyzing your data will also be much faster. I am currently testing a
small Hadoop cluster for almost exactly the same thing that you want to
do, and I can analyze 100 GB of logs in about 3 minutes (to be precise,
with a query similar to your example). This is a cluster of 8 desktop
machines, each with 4 cores, 8 GB of RAM, and 4 standard 7200 rpm SATA
drives. So even with somewhat smaller hardware you should be able to
analyze your full dataset of 1 billion records (about 300 GB) in well
under an hour.
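Back-of-the-envelope, using the numbers above (100 GB in about 3 minutes), the implied aggregate scan rate and the projected time for the full 300 GB dataset:

```python
analyzed_bytes = 100e9       # 100 GB of logs analyzed
elapsed_seconds = 3 * 60     # ~3 minutes with Hive

scan_rate = analyzed_bytes / elapsed_seconds   # bytes/s across the cluster
full_dataset_bytes = 300e9                     # ~1 billion records at ~300 bytes

est_minutes = full_dataset_bytes / scan_rate / 60
print(round(est_minutes))  # ~9 minutes, comfortably under an hour
```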
What you should also consider is the number of tools for analyzing
exactly this type of data that you get for free with Hadoop. If you use
Hive to analyze your data, you don't have to write any code at all. It
is much like using a relational DB, while the speed is nearly on par
with handwritten code (my 3-minute result above was also obtained with
Hive). So the additional time to set up Hadoop will amortize quickly.
Your example analysis would look something like this in Hive:
SELECT * FROM log WHERE timestamp>=1289589000 AND
timestamp<=1289590000 AND method='GET' AND url='http://www.test.de'
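For completeness, a sketch of how such a table might be declared over files already sitting in HDFS. This is only an assumption about the layout: the column names are taken from the log format quoted below, and the space delimiter and '/logs' path are made up for illustration:

```sql
-- Hypothetical Hive table over space-delimited log files in HDFS.
-- Column names follow the quoted log format; delimiter and path are assumptions.
CREATE EXTERNAL TABLE log (
  version STRING, account STRING, request_id STRING,
  `timestamp` DOUBLE, duration DOUBLE, ip_addr STRING,
  method STRING, url STRING, http_version STRING,
  response_code INT, size BIGINT
  -- remaining fields (hit_rate, ranges, referrer, agent, ...) omitted here
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE
LOCATION '/logs';
```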
The next area where Hadoop is easy and extremely efficient compared to
Riak is deleting/archiving old data (assuming you want to delete logs
after a certain time). In Hadoop you just delete the old files, which is
only a metadata operation and therefore takes seconds. I don't even want
to think about how long this would take with Riak, especially with the
InnoDB backend.
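If the table is partitioned by date (a common layout; the partition column name 'dt' here is an assumption, and the table would have to be created with a matching PARTITIONED BY clause), expiring old data is a one-liner in Hive as well:

```sql
-- Assumes the log table was declared with PARTITIONED BY (dt STRING).
-- Dropping a partition removes its files and metadata in one cheap operation.
ALTER TABLE log DROP PARTITION (dt='2011-05-01');
```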
If you use the Cloudera distribution of Hadoop, setup is not that hard
any more, so I would at least give it a try. Hadoop was more or less
made for your use case, namely log analysis.
Sorry Basho ;-), but this is exactly the kind of problem where Hadoop
shines. Riak would only be a good choice for you if you need super high
out-of-the-box write availability above everything else, especially
above performance and ease of analyzing your data.
Cheers,
Nico
On 28.06.2011 at 17:17, Evans, Matthew wrote:
Hi,
I've been looking at a number of technologies for a simple application.
We are saving large amounts of data to disc; this data is event-log/sensor data
which may look something like:
Version, Account, RequestID, Timestamp, Duration, IPAddr, Method, URL, HTTP
Version, Response_Code, Size, Hit_Rate, Range_From, Range_To, Referrer, Agent,
Content_Type, Accept_Encoding, Redirect_Code, Progress
For Example:
1 agora 27050938271286652285000000000368375 1289589216.893 1989.938 79.7.41.29
GET http://bi.sciagnij.pl/0/4/TWEE_Upgrade.exe HTTP/1.1 200 953772216 725098308
713834308 -1 -1 - Mozilla/4.0(compatible;MSIE6.0;WindowsNT5.1)
application/octet-stream gzip - 0 progress
The data has no specific key to index on (we will do some parsing of the
data on ingest to extract basic information that allows for fast queries,
but this is outside of Riak).
Really the issue is that we need to be able to apply "analytic" (map-reduce)
type queries on the data. These queries do not need to be real-time, but should not take
days to run.
For example: All GET requests for a specific URL within a specific time range.
The amount of data saved could be quite large (forcing us to use InnoDB
instead of Bitcask). One estimate is ~1 billion records. Architecturally
this data could be split over multiple nodes.
The choice of client-side language is still open, with Erlang as the current
favorite. As I see it the advantages of Riak are:
1) HTTP based API as well as Erlang and other client APIs (the system has a mix
of programming languages including Python and C/C++).
2) More flexible/extensible data model (Cassandra requires you to predefine
keyspaces, columns, etc. ahead of time)
3) Easier to install/setup without the apparent bloat and complexity of
Cassandra (which also includes Java setup)
4) Map-reduce queries
The disadvantages of Riak are:
1) Write performance. We need to handle ~50,000 writes per second.
One option is to run our client app from within the same Erlang VM as Riak,
so hopefully we can gain something there. Alternatively, we could use the
innostore Erlang API directly for writes.
Questions:
1) Is Riak a good database for this application?
2) Can we write to InnoDB directly and still leverage the map-reduce queries on
the data?
Regards
Matt
_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com