Hi,

We have an HDFS setup with a namenode and three datanodes, all on EC2
large instances.  One of our data partitions is fed by a few Flume
instances rolling *hourly*, which works out to roughly three 4-8 MB files
per hour right now.

Our Mesos cluster consists of a master and three slave nodes colocated on
the same EC2 large instances (slaves -> datanodes, Mesos master ->
namenode).  Spark jobs are currently launched ad hoc from the spark-shell.

The data consists of serialized protobuf messages stored in sequence
files.  Our operations typically deserialize each message, pull a few
primitive fields out of it, and run some maps/reduces.
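
In case it helps, here is roughly what one of these jobs looks like in the
spark-shell.  The names are illustrative: "Event" stands in for our
generated protobuf class, getUserId for one of its primitive fields, and
the path for our actual layout.

import org.apache.hadoop.io.BytesWritable

// Read the Flume-written sequence files (path is illustrative).
val raw = sc.sequenceFile[BytesWritable, BytesWritable](
  "hdfs://namenode/flume/events/*")

val records = raw.map { case (_, value) =>
  // BytesWritable's backing array is padded and the object is reused
  // across records, so copy out exactly getLength bytes before parsing.
  val bytes = java.util.Arrays.copyOfRange(value.getBytes, 0, value.getLength)
  Event.parseFrom(bytes)
}

// Grab a primitive field and do a simple reduce.
val counts = records.map(e => (e.getUserId, 1L)).reduceByKey(_ + _)
counts.take(10)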

For scanning about two days of data at this rate (roughly 144 files, or
around 0.6-1.2 GB in total), what would the expected Spark performance
be?  We are seeing simple maps and takes over this data run on the order
of 15 minutes.

Thanks,

Gary
