Hi,

interesting...

1. How do you see the 0.7 version evolving besides maintenance updates?
Will it have a life of its own? I mean, 0.7 is very good for intranet
use or a mid-size public site. Why would you want to use the mapred version
when you don't need it? (Maybe I don't know enough :-)


Using MapReduce should have little overall impact. The 0.8 release should be as easy and efficient to use for intranets as the 0.7 release.


2. What I also understand is that the mapred version requires extra processes,


That's not true. By default it is configured to run everything in-process. But it can easily be configured to run things instead on a network of machines, greatly increasing the scalability.

Interesting. Last time I looked at MapReduce, it was necessary to start a lot of separate processes like the jobtracker, workeragent, NDFS controller, etc. Is that now all running in one JVM? Can you give me a pointer to the class that handles this now?


3. It would be great to get a long-term vision or view of the mapred version!
3. There has also been discussion about the Nutch API. Is there any work
going on on this front? I have also seen postings regarding the use of JMX.
Any update?


We would like to have an administration GUI that tracks changes to the core and plugins, without a lot of GUI-specific maintenance. JMX looks like a promising approach for this. The plan is to first improve Nutch's configuration APIs to facilitate this. For example, Nutch does not currently support multiple configurations in the same JVM. Most of the command-line tools in the mapred branch are now implemented so that they can support multiple, simultaneous invocations within the same JVM. Next we must change the plugin APIs to support multiple configurations as well.

I understand that MapReduce needs different configuration instances at runtime, but from the JMX point of view the way this is currently implemented is suboptimal. The problem I see is that the action implements the Configurable interface, which allows a configuration to be set on it. In the JMX world it would be better to have setters and getters for the important configuration parameters. JMX analyzes setter and getter pairs: a value with both a get and a set method is treated as a configurable value, while a value with only a public getter method is a read-only value.
All other methods are interpreted as 'methods' (action methods).

So maybe it would be better to have the action classes extend the configuration object. However, then we might have too many configurable values per action, and this would need some code changes. :(
This is called IoC (Inversion of Control):

http://www.martinfowler.com/articles/injection.html

Maybe one idea would be to deploy the different configuration instances as different MBean instances inside the JMX bus.
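
To make the getter/setter point concrete, here is a minimal sketch using only the standard javax.management API. The names ActionSettings, ActionSettingsMBean, the attributes Threads and ConfigId, and the ObjectName domain "nutch.sketch" are made up for illustration and are not part of Nutch. It registers two configuration instances as two MBeans and shows how JMX classifies the members.

```java
import java.lang.management.ManagementFactory;
import javax.management.Attribute;
import javax.management.MBeanAttributeInfo;
import javax.management.MBeanServer;
import javax.management.ObjectName;

class JmxConfigSketch {
    /** Standard MBean interface (hypothetical names); JMX derives the management view from it. */
    public interface ActionSettingsMBean {
        int getThreads();       // getter + setter: read-write attribute "Threads"
        void setThreads(int n);
        String getConfigId();   // getter only: read-only attribute "ConfigId"
        void reset();           // neither getter nor setter: operation "reset"
    }

    /** One instance per configuration; the class name must be the interface name minus "MBean". */
    public static class ActionSettings implements ActionSettingsMBean {
        private final String configId;
        private final int defaultThreads;
        private volatile int threads;

        public ActionSettings(String configId, int threads) {
            this.configId = configId;
            this.defaultThreads = threads;
            this.threads = threads;
        }
        @Override public int getThreads() { return threads; }
        @Override public void setThreads(int n) { threads = n; }
        @Override public String getConfigId() { return configId; }
        @Override public void reset() { threads = defaultThreads; }
    }

    static final MBeanServer SERVER = ManagementFactory.getPlatformMBeanServer();

    /** Registers one configuration instance under its own ObjectName. */
    static ObjectName register(String configId, int threads) throws Exception {
        ObjectName name = new ObjectName("nutch.sketch:type=ActionSettings,config=" + configId);
        if (SERVER.isRegistered(name)) {
            SERVER.unregisterMBean(name); // keep the sketch re-runnable in one JVM
        }
        SERVER.registerMBean(new ActionSettings(configId, threads), name);
        return name;
    }

    /** Asks the MBean server whether JMX considers the attribute writable. */
    static boolean isWritable(ObjectName name, String attribute) throws Exception {
        for (MBeanAttributeInfo info : SERVER.getMBeanInfo(name).getAttributes()) {
            if (info.getName().equals(attribute)) {
                return info.isWritable();
            }
        }
        throw new IllegalArgumentException("no such attribute: " + attribute);
    }

    public static void main(String[] args) throws Exception {
        ObjectName a = register("crawlA", 10);
        ObjectName b = register("crawlB", 10);
        // Changing one configuration instance leaves the other untouched.
        SERVER.setAttribute(a, new Attribute("Threads", 20));
        System.out.println("crawlA Threads = " + SERVER.getAttribute(a, "Threads"));
        System.out.println("crawlB Threads = " + SERVER.getAttribute(b, "Threads"));
        System.out.println("Threads writable?  " + isWritable(a, "Threads"));
        System.out.println("ConfigId writable? " + isWritable(a, "ConfigId"));
    }
}
```

With something like this, a generic JMX console could edit each configuration separately without any GUI code specific to the action.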



Some more thoughts:
Generally using URLs instead of files (local or remote) could make a lot of things much easier.
We could have custom URL stream handlers for NDFS.
One very big advantage I see would be loading plugins from NDFS or from a centralized HTTP or FTP repository. Furthermore, it would be possible to support more storage systems that can be addressed by URLs.
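
To give an idea of the URL stream handler approach, here is a rough sketch built on java.net.URLStreamHandler. The "ndfs" scheme name and the in-memory STORE map are assumptions for illustration only; the map just stands in for the distributed file system, and a real handler would call the NDFS client there instead.

```java
import java.io.ByteArrayInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class NdfsUrlSketch {
    // Stand-in for the distributed file system (assumption): path -> file content.
    static final Map<String, byte[]> STORE = new ConcurrentHashMap<>();

    /** Handler for a hypothetical "ndfs" scheme; it resolves the URL path against STORE. */
    static class NdfsHandler extends URLStreamHandler {
        @Override
        protected URLConnection openConnection(URL u) {
            return new URLConnection(u) {
                @Override
                public void connect() {
                    connected = true; // nothing to open for the in-memory stand-in
                }

                @Override
                public InputStream getInputStream() throws IOException {
                    byte[] data = STORE.get(getURL().getPath());
                    if (data == null) {
                        throw new FileNotFoundException(getURL().toString());
                    }
                    return new ByteArrayInputStream(data);
                }
            };
        }
    }

    /** Reads a resource through the plain java.net.URL API, like any http: or file: URL. */
    static String read(String spec) throws IOException {
        // Passing the handler explicitly avoids installing a JVM-wide factory.
        URL url = new URL(null, spec, new NdfsHandler());
        try (InputStream in = url.openStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        STORE.put("/plugins/demo/plugin.xml",
                "<plugin id=\"demo\"/>".getBytes(StandardCharsets.UTF_8));
        System.out.println(read("ndfs://namenode:9000/plugins/demo/plugin.xml"));
    }
}
```

Code that loads plugins would then only need a URL and would not care whether it points to a local file, an HTTP server, or the distributed file system.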

Another thing that would help to get a uniform interface for local and distributed installations is the use of dynamic proxies. This would make the Nutch RPC transparent for all objects: you could have an object instantiated locally or remotely (for example, a remote configuration object instance).
To get an idea:

http://www.devx.com/Java/Article/21463

Java's built-in dynamic proxies require that the proxied classes have interfaces, which makes no sense, but it is possible to use cglib (cglib.sf.net) to get dynamic proxies without interfaces.
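
As a rough illustration of the proxy idea, here is a sketch with the built-in java.lang.reflect.Proxy. The Config interface, LocalConfig, and FakeRpcHandler are hypothetical names and not Nutch classes. The handler only records the call and dispatches to an object in the same JVM; a real version would serialize the method name and arguments and send them over the Nutch RPC.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class RemoteProxySketch {
    /** Hypothetical interface; callers depend only on this, not on where the object lives. */
    interface Config {
        String get(String key);
    }

    /** Plain local implementation. */
    static class LocalConfig implements Config {
        private final Map<String, String> values = new HashMap<>();

        LocalConfig set(String key, String value) {
            values.put(key, value);
            return this;
        }

        @Override
        public String get(String key) {
            return values.get(key);
        }
    }

    /** Stand-in for an RPC client: logs each call, then dispatches to the "server side" object. */
    static class FakeRpcHandler implements InvocationHandler {
        final List<String> wireLog = new ArrayList<>();
        private final Object serverSide;

        FakeRpcHandler(Object serverSide) {
            this.serverSide = serverSide;
        }

        @Override
        public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
            wireLog.add(method.getName()); // a real handler would send this over the network
            try {
                return method.invoke(serverSide, args);
            } catch (InvocationTargetException e) {
                throw e.getCause(); // surface the original exception to the caller
            }
        }
    }

    /** Builds a Config whose calls all go through the handler. */
    static Config remote(FakeRpcHandler handler) {
        return (Config) Proxy.newProxyInstance(
                Config.class.getClassLoader(),
                new Class<?>[] { Config.class },
                handler);
    }

    public static void main(String[] args) {
        LocalConfig local = new LocalConfig().set("example.key", "example-value");
        FakeRpcHandler handler = new FakeRpcHandler(local);
        Config config = remote(handler);
        // Same call as on a local object, but routed through the handler.
        System.out.println(config.get("example.key"));
        System.out.println("calls seen by the handler: " + handler.wireLog);
    }
}
```

The calling code is identical for the local and the proxied object, which is the transparency I mean.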

I have sample code for URL stream handlers and dynamic remote proxies. Let me know if anyone is interested...

Greetings,
Stefan



