Hi Matthew,
although I really like Riak, I would seriously consider using Hadoop or
something similar (perhaps Disco, http://discoproject.org/, if you don't
like Java), since you only need to read/write your data sequentially.
Assuming your example is representative of a typical log entry
(under 300 bytes each), your 50,000 writes/s are really a piece of
cake for a Hadoop-like system: that amounts to only 300 * 50,000 bytes/s,
or about 15 MB/s.
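Just to make the arithmetic explicit (a trivial sketch; the 300 bytes/record and 50,000 writes/s are the figures from above):

```python
record_size_bytes = 300      # upper bound per log entry
writes_per_second = 50_000   # required write rate

# Sequential write throughput the cluster has to sustain
throughput_mb_s = record_size_bytes * writes_per_second / 1_000_000
print(throughput_mb_s)  # 15.0 MB/s
```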
Analyzing your data will also be much faster. I am currently testing a
small Hadoop cluster for almost exactly the same thing that you want to
do, and I can analyze 100 GB of logs in about 3 minutes (to be precise,
with a query similar to your example). This is a cluster of 8 desktop
machines, each with 4 cores, 8 GB of RAM, and 4 standard 7200 rpm SATA
drives. So even with somewhat smaller hardware you should be able to
analyze your full dataset of 1 billion records (about 300 GB) in well
under an hour.
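Back-of-the-envelope, using the numbers above (100 GB in about 3 minutes), the implied aggregate scan rate and the projected time for the full 300 GB dataset:

```python
analyzed_bytes = 100e9       # 100 GB of logs analyzed
elapsed_seconds = 3 * 60     # ~3 minutes with Hive

scan_rate = analyzed_bytes / elapsed_seconds   # bytes/s across the cluster
full_dataset_bytes = 300e9                     # ~1 billion records at ~300 bytes

est_minutes = full_dataset_bytes / scan_rate / 60
print(round(est_minutes))  # ~9 minutes, comfortably under an hour
```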
What you should also consider is the number of tools for analyzing
exactly this type of data that you get for free with Hadoop. If you use
Hive to analyze your data, you don't have to write any code at all. It
is much like using a relational DB, while the speed is nearly on par
with handwritten code (my 3-minute result above was also obtained with
Hive). So the additional time to set up Hadoop will amortize quickly.
Your example analysis would look something like this in Hive:
SELECT * FROM log WHERE timestamp>=1289589000 AND
timestamp<=1289590000 AND method='GET' AND url='http://www.test.de'
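For completeness, a sketch of how such a table might be declared over files already sitting in HDFS. This is only an assumption about the layout: the column names are taken from the log format quoted below, and the space delimiter and '/logs' path are made up for illustration:

```sql
-- Hypothetical Hive table over space-delimited log files in HDFS.
-- Column names follow the quoted log format; delimiter and path are assumptions.
CREATE EXTERNAL TABLE log (
  version STRING, account STRING, request_id STRING,
  `timestamp` DOUBLE, duration DOUBLE, ip_addr STRING,
  method STRING, url STRING, http_version STRING,
  response_code INT, size BIGINT
  -- remaining fields (hit_rate, ranges, referrer, agent, ...) omitted here
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE
LOCATION '/logs';
```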
The next area where Hadoop is easy and extremely efficient compared to
Riak is deleting/archiving old data (assuming you want to delete logs
after a certain time). In Hadoop you just delete the old files, which is
only a metadata operation and therefore takes seconds. I don't even want
to think about how long this would take with Riak, especially with the
InnoDB backend.
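If the table is partitioned by date (a common layout; the partition column name 'dt' here is an assumption, and the table would have to be created with a matching PARTITIONED BY clause), expiring old data is a one-liner in Hive as well:

```sql
-- Assumes the log table was declared with PARTITIONED BY (dt STRING).
-- Dropping a partition removes its files and metadata in one cheap operation.
ALTER TABLE log DROP PARTITION (dt='2011-05-01');
```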
If you use the Cloudera distribution of Hadoop, setup is not that hard
any more, so I would at least give it a try. Hadoop was more or less
made for your use case, namely log analysis.
Sorry Basho ;-), but this is exactly the kind of problem where Hadoop
shines. Riak would only be a good choice for you if you need super high
out-of-the-box write availability above everything else, especially
above performance and ease of analyzing your data.
Cheers,
Nico
On 28.06.2011 at 17:17, Evans, Matthew wrote:
Hi,
I've been looking at a number of technologies for a simple application.
We are saving large amounts of data to disc; this data is event-log/sensor data
which may look something like:
Version, Account, RequestID, Timestamp, Duration, IPAddr, Method, URL, HTTP
Version, Response_Code, Size, Hit_Rate, Range_From, Range_To, Referrer, Agent,
Content_Type, Accept_Encoding, Redirect_Code, Progress
For Example:
1 agora 27050938271286652285000000000368375 1289589216.893 1989.938 79.7.41.29
GET http://bi.sciagnij.pl/0/4/TWEE_Upgrade.exe HTTP/1.1 200 953772216 725098308
713834308 -1 -1 - Mozilla/4.0(compatible;MSIE6.0;WindowsNT5.1)
application/octet-stream gzip - 0 progress
The data has no specific key to index on (we will do some parsing of the
data on ingest to extract basic information that allows for fast queries,
but this is outside of Riak).
Really the issue is that we need to be able to apply "analytic" (map-reduce)
type queries on the data. These queries do not need to be real-time, but should not take
days to run.
For example: All GET requests for a specific URL within a specific time range.
The amount of data saved could be quite large (forcing us to use InnoDB
instead of Bitcask). One estimate is ~1 billion records. Architecturally
this data could be split over multiple nodes.
The choice of client-side language is still open, with Erlang as the current
favorite. As I see it the advantages of Riak are:
1) HTTP based API as well as Erlang and other client APIs (the system has a mix
of programming languages including Python and C/C++).
2) More flexible/extensible data model (Cassandra requires you to predefine
keyspaces, columns, etc. ahead of time)
3) Easier to install/setup without the apparent bloat and complexity of
Cassandra (which also includes Java setup)
4) Map-reduce queries
The disadvantages of Riak are:
1) Write performance. We need to handle ~50,000 writes per second.
One option is to run our client app from within the same Erlang VM as Riak,
so hopefully we can gain something there. Alternatively, we could use the
innostore Erlang API directly for writes.
Questions:
1) Is Riak a good database for this application?
2) Can we write to InnoDB directly and still leverage the map-reduce queries on
the data?
Regards
Matt
_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com