Hi Matthew,

You really don't need to write any Java code to use Hadoop. A lot of high-level tools exist for the relatively common use case you have. As I mentioned, we are evaluating Hadoop to store and analyze our own logs, and we have not written a single line of Java so far. Also, should the need arise, it is possible to write MapReduce code in any language using Hadoop Streaming; you only pay a small price in performance.
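For example, a Streaming mapper is just a program that reads records from stdin and writes tab-separated key/value pairs to stdout. Here is a minimal sketch in Python; the field positions are an assumption based on the whitespace-separated example log line from your earlier mail, so adjust them to the real format:

```python
#!/usr/bin/env python
# Minimal Hadoop Streaming mapper: count GET requests per URL.
import sys

def map_line(line):
    """Return 'url<TAB>1' if the line is a GET request, else None.

    Assumes the 7th whitespace-separated field is the HTTP method and
    the 8th is the URL, as in the example log line (an assumption)."""
    fields = line.split()
    if len(fields) > 7 and fields[6] == "GET":
        return fields[7] + "\t1"
    return None

if __name__ == "__main__":
    # Hadoop Streaming feeds one record per line on stdin and expects
    # key<TAB>value lines on stdout.
    for line in sys.stdin:
        out = map_line(line)
        if out is not None:
            print(out)
```

You would then hand this script to the streaming jar with something like `hadoop jar hadoop-streaming.jar -mapper mapper.py ...`; the exact invocation depends on your Hadoop version, so check the Streaming docs.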

It's true, Hadoop is quite a complex system, but I wouldn't say it's bloated. The learning curve is a bit steep at first, but there is also a lot of information out there. Get a good book, like O'Reilly's 'Hadoop: The Definitive Guide', which helps a lot over the first bump.

I don't understand your point about the configuration at all. Is there something special about Riak configs that makes them fit your management infrastructure better? The configuration of the Hadoop worker nodes is actually quite uniform; I use the same config files on all nodes. The only thing you need to have in place are hostnames that can be properly resolved on each node.

The advantage of Hadoop over Disco is certainly the large ecosystem of tools and information/support around Hadoop. But if you give Disco a try, I would be interested in the results :-). I also think implementing the capability to run Erlang jobs should be quite doable, especially if you have in-house Erlang experience.
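For reference, a Disco job is just a pair of Python functions with roughly the shape below. The map/reduce pair is a runnable sketch; the field positions and the job-submission call in the comment are assumptions to verify against the Disco documentation:

```python
# Sketch of a Disco-style map/reduce pair for "count GET requests per URL".
from itertools import groupby
from operator import itemgetter

def map_fun(line, params):
    # Emit (url, 1) for each GET request; field positions are assumptions
    # based on the whitespace-separated example log line.
    fields = line.split()
    if len(fields) > 7 and fields[6] == "GET":
        yield fields[7], 1

def reduce_fun(rows_iter, params):
    # Sum the counts per key; rows arrive as (key, value) pairs.
    for key, group in groupby(sorted(rows_iter), key=itemgetter(0)):
        yield key, sum(v for _, v in group)

# Submitting the job would then look something like this (hypothetical,
# check the Disco docs for the exact API of your version):
#   from disco.core import Job, result_iterator
#   job = Job().run(input=log_urls, map=map_fun, reduce=reduce_fun)
#   for url, count in result_iterator(job.wait()):
#       print(url, count)
```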

Cheers,
Nico

On 29.06.2011 13:41, Evans, Matthew wrote:
Hi Nico,

You bring up an awesome point. I was actually considering Cassandra and Hadoop; 
as far as (write) performance is concerned, Cassandra seems more than fast 
enough. The issue is we have minimal Java expertise, and from what I have read 
they both seem somewhat bloated and a bit of a support headache. The 
configuration would also be very hard to fit into our management infrastructure.

Disco is a great idea, a good combination of Erlang and Python. It would be 
nice if the jobs could be defined in Erlang as well as Python, since the company 
has equal competence in both languages, but I can't imagine it's too hard to 
implement that.

Matt
________________________________________
From: Nico Meyer [[email protected]]
Sent: Wednesday, June 29, 2011 7:17 AM
To: [email protected]
Cc: Evans, Matthew
Subject: Re: Riak or Cassandra for this...

Hi Matthew,

although I really like Riak, I would seriously consider using Hadoop or
something similar (perhaps Disco http://discoproject.org/, if you don't
like Java), since you only need to read/write your data sequentially.
Assuming your example is representative of your typical log entry
(smaller than 300 bytes each), your 50,000 writes/s is really a piece of
cake for a Hadoop-like system. This amounts to only 300*50000 bytes/s, or
about 15 MB/s.

Also, analyzing your data will be way faster. I am currently testing a small
Hadoop cluster for almost exactly the same thing you want to do,
and I can analyze (to be precise, I ran a query similar to
your example) 100 GB of logs in about 3 minutes. This is a cluster of 8
desktop machines with 4 cores, 8 GB of RAM and 4 standard SATA 7200rpm
drives each. So even with somewhat smaller hardware you should be able
to analyze your full dataset of 1 billion records (about 300 GB) in under an hour.
What you should also consider is the number of tools for analyzing exactly
this type of data that you get for free with Hadoop. If you use Hive to
analyze your data, you don't have to write any code at all. It's really
just like using a relational DB, while the speed is nearly on par with
handwritten code (my 3-minute result above was also done with Hive). So
the additional time to set up Hadoop will amortize quickly. Your example
analysis would be something like this with Hive:
      SELECT * FROM log
      WHERE timestamp >= 1289589000 AND timestamp <= 1289590000
        AND method = 'GET' AND url = 'http://www.test.de'

The next area where Hadoop is easy and extremely efficient compared to
Riak is deleting/archiving old data (assuming you want to delete logs
after a certain time). In Hadoop you just delete the old files, which is
only a metadata operation and therefore takes seconds. I don't even want
to think about how long this would take with Riak, especially using the
InnoDB backend.
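Concretely, if you lay the logs out in HDFS by day (the /logs/YYYY-MM-DD layout here is a hypothetical convention), expiring a day is a single recursive delete against the NameNode. A small wrapper might look like this:

```python
import subprocess

def rmr_command(path):
    """Build the 'hadoop fs -rmr' command line for a recursive HDFS delete.
    The delete itself is a metadata-only operation on the NameNode."""
    return ["hadoop", "fs", "-rmr", path]

def delete_old_logs(day):
    """Drop one day's logs, e.g. day='2011-05-01'.
    The /logs/<day> directory layout is an assumption."""
    subprocess.check_call(rmr_command("/logs/" + day))
```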

If you use the Cloudera distribution of Hadoop, it is not that hard to
set up any more. So I would at least give it a try. It was more or less
made for your use case, namely log analysis.

Sorry Basho ;-), but this is exactly the kind of problem where Hadoop
shines. Only if you need super-high write availability out of the box
above everything else, especially above performance and ease of
analyzing your data, would Riak be a good choice for you.

Cheers,
Nico

On 28.06.2011 17:17, Evans, Matthew wrote:
Hi,

I've been looking at a number of technologies for a simple application.

We are saving large amounts of data to disc; this data is event-log/sensor data 
which may look something like:

Version, Account, RequestID, Timestamp, Duration, IPAddr, Method, URL, HTTP 
Version, Response_Code, Size, Hit_Rate, Range_From, Range_To, Referrer, Agent, 
Content_Type, Accept_Encoding, Redirect_Code, Progress


For Example:

1 agora 27050938271286652285000000000368375 1289589216.893 1989.938 79.7.41.29 
GET http://bi.sciagnij.pl/0/4/TWEE_Upgrade.exe HTTP/1.1 200 953772216 725098308 
713834308 -1 -1 - Mozilla/4.0(compatible;MSIE6.0;WindowsNT5.1) 
application/octet-stream gzip - 0 progress

The data has no specific key to index off (we will be doing some parsing of the 
data on ingest to get basic information allowing for fast queries, but this is 
outside of Riak).

Really the issue is that we need to be able to apply "analytic" (map-reduce) 
type queries on the data. These queries do not need to be real-time, but should not take 
days to run.

For example: All GET requests for a specific URL within a specific time range.

The amount of data saved could be quite large (forcing us to use InnoDB instead 
of Bitcask). One estimate is ~1 billion records. Architecturally this data 
could be split over multiple nodes.

The choice of client-side language is still open, with Erlang as the current 
favorite. As I see it the advantages of Riak are:

1) HTTP based API as well as Erlang and other client APIs (the system has a mix 
of programming languages including Python and C/C++).

2) More flexible/extensible data model (Cassandra requires you to predefine the 
key spaces, columns etc ahead of time)

3) Easier to install/setup without the apparent bloat and complexity of 
Cassandra (which also includes Java setup)

4) Map-reduce queries

The disadvantages of Riak are:

1) Write performance. We need to handle ~50,000 writes per second.

I would recommend running our client app from within the same Erlang VM as Riak, 
so hopefully we can gain something there. Alternatively, we could use the 
innostore Erlang API directly for writes.

Questions:

1) Is Riak a good database for this application?

2) Can we write to InnoDB directly and still leverage the map-reduce queries on 
the data?

Regards

Matt



_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com


