Hi,
interesting...
1. How do you see the 0.7 version evolving besides maintenance updates?
Will it have a life of its own? I mean, 0.7 is very good for intranet
use or a mid-size public site. Why would you want to use the mapred
version when you don't need it? (Maybe I don't know enough :-)
Using MapReduce should have little overall impact. The 0.8 release
should be as easy and efficient to use for intranets as the 0.7
release.
2. As I understand it, the mapred version also requires extra
processes,
That's not true. By default it is configured to run everything
in-process. But it can easily be configured to run things on a
network of machines instead, greatly increasing scalability.
Interesting. The last time I looked at MapReduce it was necessary to
start a lot of separate processes such as the jobtracker, workeragent,
ndfs controller, etc.
Is that all running in one JVM now? Can you give me a pointer to the
class that handles this now?
3. It would be great to get a long-term vision or view of the
mapred version!
3. There has also been discussion about the Nutch API. Is there any
work going on on this front? I have also seen postings regarding the
use of JMX; any update?
We would like to have an administration GUI that tracks changes to
the core and plugins, without a lot of GUI-specific maintenance.
JMX looks like a promising approach for this. The plan is to first
improve Nutch's configuration APIs to facilitate this. For
example, Nutch does not currently support multiple configurations
in the same JVM. Most of the command-line tools in the mapred
branch are now implemented so that they can support multiple,
simultaneous invocations within the same JVM. Next we must change
the plugin APIs to support multiple configurations as well.
I understand that MapReduce needs different configuration instances
at runtime, but from the JMX point of view the way this is currently
implemented is suboptimal.
The problem I see is that the Action implements the Configurable
interface, which allows a configuration to be set.
In the JMX world it would be better to have setters and getters for
the important configuration parameters.
JMX analyzes getter and setter pairs: a value that has both a get and
a set method is treated as a configurable (read/write) value, while a
value with only a public getter method is read-only.
All other methods are interpreted as operations (action methods).
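To make the getter/setter rule above concrete, here is a minimal
sketch using a standard MBean from javax.management. The names
FetcherSettings, Threads, PagesFetched and reset are made up for
illustration; they are not existing Nutch classes or parameters. The
sketch only shows how JMX classifies the methods of an interface.

```java
import java.lang.management.ManagementFactory;
import javax.management.JMException;
import javax.management.MBeanInfo;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Hypothetical example, not Nutch code: shows how JMX introspects a standard MBean.
class JmxAttributeSketch {

    // Standard MBean convention: the management interface is named
    // <implementation class name> + "MBean" and must be public.
    public interface FetcherSettingsMBean {
        int getThreads();              // getter + setter -> read/write attribute "Threads"
        void setThreads(int threads);
        long getPagesFetched();        // getter only     -> read-only attribute "PagesFetched"
        void reset();                  // anything else   -> operation "reset"
    }

    public static class FetcherSettings implements FetcherSettingsMBean {
        private volatile int threads = 10;
        private volatile long pagesFetched;

        @Override public int getThreads() { return threads; }
        @Override public void setThreads(int threads) { this.threads = threads; }
        @Override public long getPagesFetched() { return pagesFetched; }
        @Override public void reset() { pagesFetched = 0; }
    }

    // Registers one instance (unless the name is already taken) and
    // returns the metadata JMX derived from the interface.
    static MBeanInfo registerAndInspect(String objectName) throws JMException {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        ObjectName name = new ObjectName(objectName);
        if (!server.isRegistered(name)) {
            server.registerMBean(new FetcherSettings(), name);
        }
        return server.getMBeanInfo(name);
    }
}
```

A console such as jconsole would then show Threads as an editable
value, PagesFetched as read-only, and reset as an invocable operation.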
So maybe it would be better to have the action classes extend the
configuration object. However, then we might have too many
configurable values per action, and this would need some code changes. :(
Setting values from outside like this is called IoC (inversion of control):
http://www.martinfowler.com/articles/injection.html
Maybe one idea would be to have the different configuration instances
deployed as separate MBean instances inside the JMX bus.
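As a rough sketch of that idea, the code below registers two
configuration objects as two MBeans in one MBean server, told apart by
the "name" key of their ObjectName. CrawlConfig, FetcherThreads and
the org.example.sketch domain are assumptions made for the example and
do not exist in Nutch.

```java
import javax.management.Attribute;
import javax.management.JMException;
import javax.management.MBeanServer;
import javax.management.MBeanServerFactory;
import javax.management.ObjectName;

// Hypothetical example, not Nutch code: one MBean per configuration instance.
class JmxMultiConfigSketch {

    public interface CrawlConfigMBean {
        int getFetcherThreads();
        void setFetcherThreads(int threads);
    }

    public static class CrawlConfig implements CrawlConfigMBean {
        private volatile int fetcherThreads;

        public CrawlConfig(int fetcherThreads) { this.fetcherThreads = fetcherThreads; }

        @Override public int getFetcherThreads() { return fetcherThreads; }
        @Override public void setFetcherThreads(int threads) { this.fetcherThreads = threads; }
    }

    // Each configuration instance gets its own ObjectName in the same server.
    static ObjectName register(MBeanServer server, String configName, CrawlConfig config)
            throws JMException {
        ObjectName name =
            new ObjectName("org.example.sketch:type=CrawlConfig,name=" + configName);
        server.registerMBean(config, name);
        return name;
    }

    public static void main(String[] args) throws JMException {
        MBeanServer server = MBeanServerFactory.newMBeanServer();
        ObjectName intranet = register(server, "intranet", new CrawlConfig(5));
        ObjectName wholeWeb = register(server, "wholeweb", new CrawlConfig(100));

        // Changing one MBean through JMX touches only that configuration instance.
        server.setAttribute(intranet, new Attribute("FetcherThreads", 8));
        System.out.println(server.getAttribute(intranet, "FetcherThreads"));
        System.out.println(server.getAttribute(wholeWeb, "FetcherThreads"));
    }
}
```

Whether one MBean per configuration or one MBean per action is the
better granularity is left open here; the sketch only shows that
several instances can live side by side in the same JVM.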
Some more thoughts:
Generally using URLs instead of files (local or remote) could make a
lot of things much easier.
We could have custom URL stream handlers for ndfs.
One very big advantage I see would be loading plugins from ndfs or
from a centralized HTTP or FTP repository. Furthermore, it would be
possible to support other storage systems that can be addressed by URL.
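For illustration, a custom URL stream handler can be sketched as
below. The "ndfs" scheme, the host name in the usage note and the
in-memory FAKE_STORE are assumptions; a real handler would talk to the
distributed file system instead of a map. Only the java.net classes
are real APIs.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical example, not Nutch code: a URL scheme backed by a custom handler.
class NdfsUrlSketch {

    // Stand-in for the distributed file system: path -> file contents.
    static final Map<String, byte[]> FAKE_STORE = new ConcurrentHashMap<>();

    static class NdfsHandler extends URLStreamHandler {
        @Override
        protected URLConnection openConnection(URL u) {
            return new URLConnection(u) {
                @Override public void connect() { connected = true; }

                @Override public InputStream getInputStream() throws IOException {
                    byte[] data = FAKE_STORE.get(getURL().getPath());
                    if (data == null) {
                        throw new FileNotFoundException(getURL().toString());
                    }
                    return new ByteArrayInputStream(data);
                }
            };
        }
    }

    // Binds the handler to this one URL. The constructor is deprecated since
    // Java 20 (URL.of(URI, URLStreamHandler) replaces it) but still works.
    static URL ndfsUrl(String spec) throws MalformedURLException {
        return new URL(null, spec, new NdfsHandler());
    }

    // Caller code only sees a URL; it does not care which scheme is behind it.
    static String readAll(URL url) throws IOException {
        try (InputStream in = url.openStream()) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            for (int n; (n = in.read(buf)) != -1; ) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        }
    }
}
```

With this in place, readAll(ndfsUrl("ndfs://namenode/plugins/plugin.xml"))
and readAll(new URL("http://...")) go through the same code path. A
handler can also be installed JVM-wide with
URL.setURLStreamHandlerFactory, which may be called only once per JVM.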
Another thing that would help provide a uniform interface for local
and distributed installations is the use of dynamic proxies.
This would make the Nutch RPC transparent for all objects: an object
could be instantiated locally or remotely (for example, a remote
configuration object instance).
To get an idea:
http://www.devx.com/Java/Article/21463
Java's built-in dynamic proxies require that the proxied classes
implement interfaces, which makes little sense to me, but it is
possible to use cglib (cglib.sf.net) to get dynamic proxies without
interfaces.
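A minimal sketch of the built-in, interface-based variant follows,
using java.lang.reflect.Proxy. ConfigSource, LocalConfig and
FakeRpcHandler are invented for the example; the handler forwards the
call to an object in the same JVM where a real one would marshal it
over the Nutch RPC. The cglib variant is not shown.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical example, not Nutch code: callers cannot tell local from "remote".
class RemoteProxySketch {

    public interface ConfigSource {
        String get(String key);
    }

    // Plain local implementation.
    static class LocalConfig implements ConfigSource {
        private final Map<String, String> values;

        LocalConfig(Map<String, String> values) { this.values = values; }

        @Override public String get(String key) { return values.get(key); }
    }

    // Stand-in for an RPC client: records each call, then forwards it to the
    // "server side" object. A real handler would send method name and
    // arguments over the network here.
    static class FakeRpcHandler implements InvocationHandler {
        private final Object serverSide;
        final List<String> callLog = new ArrayList<>();

        FakeRpcHandler(Object serverSide) { this.serverSide = serverSide; }

        @Override
        public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
            callLog.add(method.getName());
            try {
                return method.invoke(serverSide, args);
            } catch (InvocationTargetException e) {
                throw e.getCause(); // rethrow what the target actually threw
            }
        }
    }

    // Builds an object that implements ConfigSource but routes every call
    // through the handler.
    static ConfigSource remoteView(FakeRpcHandler handler) {
        return (ConfigSource) Proxy.newProxyInstance(
                ConfigSource.class.getClassLoader(),
                new Class<?>[] { ConfigSource.class },
                handler);
    }
}
```

Code that takes a ConfigSource works unchanged with either a
LocalConfig or the object returned by remoteView, which is the kind of
transparency described above.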
I have sample code for URL stream handlers and dynamic remote
proxies; let me know if anyone is interested...
Greetings,
Stefan