MapReduce response time and speed

2013-07-24 Thread Jan Algermissen
Hi,

I am Jan Algermissen (REST-head, freelance programmer/consultant) and 
Cassandra-newbie.

I am looking at Cassandra for an application I am working on. There will be a 
max. of 10 million items (texts and attributes of a retailer's products) in the 
database. There will be occasional writes (e.g. price updates).

The use case for the application is to work on the whole data set, item by item, 
to produce 'exports'. It will be necessary to access the full set every time. 
There is no relationship between the items. Processing is done iteratively.
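
As a rough illustration of the workload described above (a sketch, not code from the original message): because the items are independent, the data set can be split into partitions, each partition exported on its own, and the results concatenated. The field names `sku`, `title` and `price`, the export format, and all function names are hypothetical.

```python
def export_item(item):
    """Map step: turn one item into one export line (hypothetical format)."""
    return f"{item['sku']};{item['title']};{item['price']:.2f}"


def split(items, parts):
    """Deal the items round-robin into `parts` partitions."""
    return [items[i::parts] for i in range(parts)]


def export_partition(partition):
    """Export one partition; no other partition is needed to do this."""
    return [export_item(item) for item in partition]


def export_all(items, parts=4):
    """Export every partition and concatenate the resulting lines."""
    lines = []
    for partition in split(items, parts):
        lines.extend(export_partition(partition))
    return lines


# Made-up sample data, only to show the shape of the computation.
items = [
    {"sku": "A1", "title": "Kettle", "price": 24.9},
    {"sku": "B2", "title": "Toaster", "price": 39.0},
    {"sku": "C3", "title": "Mixer", "price": 59.5},
]
lines = export_all(items, parts=2)
```

In a real cluster each partition would be handled by a separate worker; here the loop in `export_all` only stands in for that.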

My question: I am thinking that this is an ideal scenario for map-reduce but I 
am unsure about two things:

Can a user of the system define new jobs in an ad-hoc fashion (like a query), or 
do map-reduce jobs need to be prepared by a developer (e.g. in RIAK you need a 
developer to compile in the job when you need the performance of Erlang-based 
jobs)?

Suppose a user indeed can specify a job and send it off to Cassandra for 
processing, what is the expected response time?

Is it possible to reduce the response time (by tuning, adding more nodes) to 
make a result available within a couple of minutes? Or will there most 
certainly be a gap of 10 minutes or more?

I understand that map-reduce is not for ad-hoc 'querying', but my users expect 
the system to feel quasi-interactive, because they intend to refine the 
processing job based on the results they get. A short gap would be ok, but a 
definite gap on the order of 10+ minutes would not.

(For example, as far as I have learned, with RIAK you would most certainly have 
such a gap. How about Cassandra? Throwing more nodes at the problem would be ok, 
I just need to understand whether there is a definite 'response time penalty' I 
have to expect no matter what.)

Jan




 

Re: MapReduce response time and speed

2013-07-24 Thread Shahab Yunus
You have a lot of questions there, so I can't answer all of them, but for the
following:
*"Can a user of the system define new jobs in an ad-hoc fashion (like a
query), or do map-reduce jobs need to be prepared by a developer (e.g. in
RIAK you need a developer to compile in the job when you need the performance
of Erlang-based jobs)?

Suppose a user indeed can specify a job and send it off to Cassandra for
processing, what is the expected response time?"*

You can use high-level tools like Pig, Hive and Oozie. But mind you, the
response time will depend on your data size, the complexity of the job, the
cluster, and the tuning parameters.

Regards,
Shahab




Re: MapReduce response time and speed

2013-07-24 Thread aaron morton
> Is it possible to reduce the response time (by tuning, adding more nodes) to 
> make a result available within a couple of minutes? Or will there most 
> certainly be a gap of 10 minutes or so and more?
Yes. 
More nodes will split the task up and it will run faster. 

How long it takes depends on the complexity of the Hadoop tasks and the time 
they have to wait for slots. 
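
A back-of-the-envelope sketch of that relationship, under simplifying assumptions that are not from this thread: a map-only job, tasks of equal length, tasks running in "waves" limited by the number of free slots, and a fixed job startup overhead. The function name and every number below are hypothetical.

```python
import math


def estimate_job_seconds(num_tasks, slots, seconds_per_task, startup_seconds=0.0):
    """Rough wall-clock estimate: tasks run in waves of at most `slots` at a time."""
    if num_tasks < 0 or slots <= 0:
        raise ValueError("num_tasks must be >= 0 and slots must be > 0")
    waves = math.ceil(num_tasks / slots)  # tasks beyond `slots` wait for the next wave
    return startup_seconds + waves * seconds_per_task


# Hypothetical job: 256 tasks of 20 s each, 30 s of fixed startup overhead.
for slots in (8, 16, 32):
    print(slots, "slots:", estimate_job_seconds(256, slots, 20.0, 30.0), "s")
```

In this simple model, doubling the slots roughly halves the part of the run time spent in waves, while the fixed startup overhead stays the same no matter how many nodes are added.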

Cheers

-
Aaron Morton
Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com
