Hi Matthew,

Although I really like Riak, I would seriously consider using Hadoop or something similar (perhaps Disco http://discoproject.org/, if you don't like Java), since you only need to read and write your data sequentially. Assuming your example is representative of a typical log entry (under 300 bytes each), your 50,000 writes/s is a piece of cake for a Hadoop-like system: it amounts to only 300 * 50,000 bytes/s, or about 15 MB/s.
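For concreteness, here is that back-of-the-envelope arithmetic as a quick sketch (the 300-byte record size is taken from the example above):

```python
# Rough throughput estimate for the described write load.
record_size_bytes = 300      # approximate size of one log entry (from the example)
writes_per_second = 50_000   # target write rate from the original post

throughput_mb_s = record_size_bytes * writes_per_second / 1_000_000
print(f"{throughput_mb_s:.0f} MB/s")  # → 15 MB/s
```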

Analyzing your data will also be much faster. I am currently testing a small Hadoop cluster for almost exactly what you want to do, and I can analyze (to be precise, I ran a query similar to your example) 100 GB of logs in about 3 minutes. This is a cluster of 8 desktop machines, each with 4 cores, 8 GB of RAM and 4 standard 7200 rpm SATA drives. So even with somewhat smaller hardware you should be able to analyze your full dataset of 1 billion records (about 300 GB) in under an hour. You should also consider the number of tools for analyzing exactly this type of data that you get for free with Hadoop. If you use Hive to analyze your data, you don't have to write any code at all: it is much like using a relational DB, while the speed is nearly on par with handwritten code (my 3-minute result above was also obtained with Hive), so the additional time to set up Hadoop will amortize quickly. Your example analysis would look something like this in Hive:

SELECT * FROM log WHERE timestamp >= 1289589000 AND timestamp <= 1289590000 AND method = 'GET' AND url = 'http://www.test.de'
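To make that query runnable you would first point Hive at the raw files. A minimal sketch of such a table, assuming whitespace-delimited fields in the order from Matthew's mail (column names follow that schema, the types and the /logs/ location are my assumptions; backticks because timestamp is a Hive keyword):

```sql
-- Hypothetical external Hive table over the raw log files; names follow
-- the schema in the original mail, types and location are assumptions.
CREATE EXTERNAL TABLE log (
  version STRING, account STRING, request_id STRING,
  `timestamp` DOUBLE, duration DOUBLE, ip_addr STRING,
  method STRING, url STRING
  -- ... remaining fields omitted for brevity
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
LOCATION '/logs/';
```

The query above then runs directly against the files in /logs/ with no loading step.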

Another area where Hadoop is easy and extremely efficient compared to Riak is deleting/archiving old data (assuming you want to delete logs after a certain time). In Hadoop you just delete old files, which is only a metadata operation and therefore takes seconds. I don't even want to think about how long this would take with Riak, especially with the InnoDB backend.
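Assuming the logs are laid out in one directory per day (the path below is purely illustrative), dropping a day is a single HDFS shell command:

```shell
# Hypothetical layout: one directory of log files per day.
# Removing a whole day is a metadata-only operation in HDFS.
hadoop fs -rm -r /logs/2011-05-01
```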

If you use the Cloudera distribution of Hadoop, it is not that hard to set up any more, so I would at least give it a try. It was more or less made for your use case, namely log analysis.

Sorry Basho ;-), but this is exactly the kind of problem where Hadoop shines. Riak could be a good choice only if you need very high write availability out of the box above everything else, especially above performance and ease of analyzing your data.

Cheers,
Nico

On 28.06.2011 17:17, Evans, Matthew wrote:
Hi,

I've been looking at a number of technologies for a simple application.

We are saving large amounts of data to disc; this data is event-log/sensor data 
which may look something like:

Version, Account, RequestID, Timestamp, Duration, IPAddr, Method, URL, HTTP 
Version, Response_Code, Size, Hit_Rate, Range_From, Range_To, Referrer, Agent, 
Content_Type, Accept_Encoding, Redirect_Code, Progress


For Example:

1 agora 27050938271286652285000000000368375 1289589216.893 1989.938 79.7.41.29 
GET http://bi.sciagnij.pl/0/4/TWEE_Upgrade.exe HTTP/1.1 200 953772216 725098308 
713834308 -1 -1 - Mozilla/4.0(compatible;MSIE6.0;WindowsNT5.1) 
application/octet-stream gzip - 0 progress

The data has no specific key to index on (we will do some parsing of the 
data on ingest to extract basic information allowing for fast queries, but this is 
outside of Riak).

Really the issue is that we need to be able to run "analytic" (map-reduce) 
queries on the data. These queries do not need to be real-time, but should not take 
days to run.

For example: All GET requests for a specific URL within a specific time range.
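A sketch of what that example query could look like as a Hadoop-streaming-style filter (my illustration, not part of the original mail; it assumes whitespace-separated fields in the order listed above, and the URL and time range are placeholder values):

```python
# Hypothetical filter for: all GET requests for a specific URL within a
# specific time range. Field indices follow the schema listed above
# (Version=0, ..., Timestamp=3, ..., Method=6, URL=7) -- an assumption.
T_FROM, T_TO = 1289589000.0, 1289590000.0   # example time range
TARGET_URL = "http://www.test.de"           # example URL

def match(line):
    f = line.split()
    if len(f) < 8:
        return False
    try:
        ts = float(f[3])
    except ValueError:
        return False
    return T_FROM <= ts <= T_TO and f[6] == "GET" and f[7] == TARGET_URL

sample = [
    "1 agora 42 1289589216.893 1989.938 79.7.41.29 GET http://www.test.de HTTP/1.1 200",
    "1 agora 43 1289589300.000 10.0 79.7.41.29 POST http://www.test.de HTTP/1.1 200",
]
print(len([l for l in sample if match(l)]))  # → 1
```

In a real streaming job, the same match() would be applied to lines read from stdin in the mapper.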

The amount of data saved could be quite large (forcing us to use InnoDB instead 
of Bitcask). One estimate is ~1 billion records. Architecturally this data 
could be split over multiple nodes.

The choice of client-side language is still open, with Erlang as the current 
favorite. As I see it the advantages of Riak are:

1) HTTP based API as well as Erlang and other client APIs (the system has a mix 
of programming languages including Python and C/C++).

2) More flexible/extensible data model (Cassandra requires you to predefine the 
key spaces, columns etc ahead of time)

3) Easier to install/setup without the apparent bloat and complexity of 
Cassandra (which also includes Java setup)

4) Map-reduce queries

The disadvantages of Riak are:

1) Write performance. We need to handle ~50,000 writes per second.

I would recommend running our client app from within the same Erlang VM as Riak, 
so hopefully we can gain something there. Alternatively, we could use the innostore 
Erlang API directly for writes.

Questions:

1) Is Riak a good database for this application?

2) Can we write to InnoDB directly and still leverage the map-reduce queries on 
the data?

Regards

Matt



_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com

