On 5/30/07, Doğacan Güney <[EMAIL PROTECTED]> wrote:
On 5/30/07, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> Doğacan Güney wrote:
>
> > My patch is just a draft to see if we can create a better caching
> > mechanism. There are definitely some rough edges there:)
>
> One important piece of information: in future versions of Hadoop the
> method Configuration.setObject() is deprecated and will eventually be
> removed, so we have to grow our own caching mechanism anyway - either
> use a singleton cache, or change nearly all APIs to pass around a
> user/job/task context.
>
> So, we will face this problem pretty soon, with the next upgrade of Hadoop.

Hmm, well, that sucks, but this is not really a problem for
PluginRepository: PluginRepository already has its own caching
mechanism.
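
For reference, a minimal sketch of what such a self-contained cache
could look like without relying on Configuration.setObject(). All
names here are illustrative, not the actual Nutch code; it assumes
PluginRepository keeps its public Configuration-taking constructor:

import java.util.Map;
import java.util.WeakHashMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.plugin.PluginRepository;

public class PluginRepositoryCache {

  // Keyed by the Configuration instance itself; a WeakHashMap lets an
  // entry be collected once the owning conf is unreachable elsewhere.
  private static final Map<Configuration, PluginRepository> CACHE =
      new WeakHashMap<Configuration, PluginRepository>();

  public static synchronized PluginRepository get(Configuration conf) {
    PluginRepository repo = CACHE.get(conf);
    if (repo == null) {
      // The expensive step: scans plugin folders, resolves extension
      // points, and instantiates the plugin classes.
      repo = new PluginRepository(conf);
      CACHE.put(conf, repo);
    }
    return repo;
  }
}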

>
> > You are right about per-plugin parameters, but I think it will be
> > very difficult to keep the PluginProperty class in sync with plugin
> > parameters. I mean, if a plugin defines a new parameter, we have to
> > remember to update PluginProperty. Perhaps we could force plugins to
> > declare the configuration options they use in, say, their plugin.xml
> > files, but that would be very error-prone too. I don't want to
> > compare entire configuration objects, because changing irrelevant
> > options like fetcher.store.content shouldn't force loading plugins
> > again, though it seems that may be inevitable...
>
> Let me see if I understand this ... In my opinion this is a non-issue.
>
> Child tasks are started in separate JVMs, so the only "context"
> information that they have is what they can read from job.xml (which is
> a superset of all properties from config files + job-specific data +
> task-specific data). This context is currently instantiated as a
> Configuration object, and we (ab)use it also as a local per-JVM cache
> for plugin instances and other objects.
>
> Once we instantiate the plugins, they exist unchanged throughout the
> lifecycle of the JVM (== the lifecycle of a single task), so we don't
> have to worry about having different sets of plugins with different
> parameters for different jobs (or even tasks).
>
> In other words, it seems to me that there is no such situation in which
> we have to reload plugins within the same JVM, but with different
> parameters.

The problem is that someone might get a little too clever. For
example, one might write a new job that has two IndexingFilters, each
created from a completely different configuration object, and then
filter some documents with the first and others with the second -
roughly like the sketch below. I agree that this is a bit of a reach,
but it is possible.
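
Something along these lines - the property name and the exact calls
are just for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.indexer.IndexingFilters;
import org.apache.nutch.util.NutchConfiguration;

public class TwoFilterChains {
  public static void main(String[] args) {
    // Hypothetical misuse inside one JVM: two filter chains built
    // from differently tuned configurations.
    Configuration confA = NutchConfiguration.create();
    confA.set("indexer.max.title.length", "100");
    IndexingFilters filtersA = new IndexingFilters(confA);

    Configuration confB = NutchConfiguration.create();
    confB.set("indexer.max.title.length", "1000");
    IndexingFilters filtersB = new IndexingFilters(confB);

    // filtersA and filtersB should behave differently here, but a
    // cache that ignores the conf's contents could silently hand both
    // of them the same underlying filter instances.
  }
}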

Actually, thinking a bit further about this, I mostly agree with you.
I initially thought the best approach would be to change
PluginRepository.get(Configuration) to PluginRepository.get(), where
get() just creates a configuration internally and initializes itself
with it. But then we wouldn't be passing the JobConf to
PluginRepository; PluginRepository would do something like
NutchConfiguration.create() itself, which is probably wrong.

So, all in all, I've come to believe that my (and Nicolas') patch is a
not-so-bad way of fixing this. It lets us pass the JobConf to
PluginRepository and stops creating new PluginRepository instances
over and over...
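
The intended effect, roughly - this is a sketch of the behavior, not
of the patch itself:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapred.JobConf;
import org.apache.nutch.plugin.PluginRepository;
import org.apache.nutch.util.NutchConfiguration;
import org.apache.nutch.util.NutchJob;

public class CachedRepositoryDemo {
  public static void main(String[] args) {
    Configuration conf = NutchConfiguration.create();
    JobConf job = new NutchJob(conf);

    // Repeated lookups against the same conf now reuse a single
    // PluginRepository instead of rebuilding it for every caller.
    PluginRepository first = PluginRepository.get(job);
    PluginRepository second = PluginRepository.get(job);
    System.out.println(first == second); // expected: true
  }
}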

What do you think?



>
> --
> Best regards,
> Andrzej Bialecki     <><
>   ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com


--
Doğacan Güney
