On 5/30/07, Doğacan Güney <[EMAIL PROTECTED]> wrote:
> On 5/30/07, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> > Doğacan Güney wrote:
> >
> > > My patch is just a draft to see if we can create a better caching
> > > mechanism. There are definitely some rough edges there:)
> >
> > One important piece of information: in future versions of Hadoop the
> > method Configuration.setObject() is deprecated and will then be
> > removed, so we have to grow our own caching mechanism anyway - either
> > use a singleton cache, or change nearly all APIs to pass around a
> > user/job/task context.
> >
> > So, we will face this problem pretty soon, with the next upgrade of Hadoop.
>
> Hmm, well, that sucks, but this is not really a problem for
> PluginRepository: PluginRepository already has its own cache
> mechanism.
>
> >
> >
> >
> > > You are right about per-plugin parameters but I think it will be very
> > > difficult to keep PluginProperty class in sync with plugin parameters.
> > > I mean, if a plugin defines a new parameter, we have to remember to
> > > > update PluginProperty. Perhaps we can force each plugin to declare the
> > > > configuration options it will use in, say, its plugin.xml file, but
> > > > that will be very error-prone too. I don't want to compare entire
> > > > configuration objects, because changing irrelevant options, like
> > > fetcher.store.content shouldn't force loading plugins again, though it
> > > seems it may be inevitable....
> >
> > Let me see if I understand this ... In my opinion this is a non-issue.
> >
> > Child tasks are started in separate JVMs, so the only "context"
> > information that they have is what they can read from job.xml (which is
> > a superset of all properties from config files + job-specific data +
> > task-specific data). This context is currently instantiated as a
> > Configuration object, and we (ab)use it also as a local per-JVM cache
> > for plugin instances and other objects.
> >
> > Once we instantiate the plugins, they exist unchanged throughout the
> > lifecycle of JVM (== lifecycle of a single task), so we don't have to
> > worry about having different sets of plugins with different parameters
> > for different jobs (or even tasks).
> >
> > In other words, it seems to me that there is no such situation in which
> > we have to reload plugins within the same JVM, but with different
> > parameters.
>
> The problem is that someone might get a little too smart. For example,
> one may write a new job that has two IndexingFilters, each created
> from a completely different configuration object, and then filter some
> documents with the first filter and others with the second. I agree
> that this is a bit of a reach, but it is possible.

Actually, thinking a bit further into this, I kind of agree with you. I
initially thought that the best approach would be to change
PluginRepository.get(Configuration) to PluginRepository.get(), where
get() just creates a configuration internally and initializes itself
with it. But then we would no longer be passing the JobConf to
PluginRepository; instead, PluginRepository would do something like
NutchConfiguration.create() on its own, which is probably wrong.

So, all in all, I've come to believe that my (and Nicolas') patch is a
not-so-bad way of fixing this. It allows us to pass JobConf to
PluginRepository and stops us from creating new PluginRepository
instances again and again...
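
To make the idea concrete, here is a minimal sketch of a cache keyed by
the configuration object. It is not the actual patch, only an
illustration: Conf and Repo are hypothetical stand-ins for Hadoop's
Configuration/JobConf and for PluginRepository, and the "created"
counter exists only to show how often a repository is really built. I am
assuming lookups by object identity (Conf does not override
equals/hashCode) and weak keys, so an entry can go away once its
configuration is no longer referenced.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.WeakHashMap;

// Hypothetical stand-in for Hadoop's Configuration/JobConf. It does not
// override equals/hashCode, so map lookups go by object identity.
final class Conf {
    private final Map<String, String> props = new HashMap<>();

    void set(String key, String value) {
        props.put(key, value);
    }

    String get(String key, String defaultValue) {
        return props.getOrDefault(key, defaultValue);
    }
}

// Hypothetical stand-in for PluginRepository.
final class Repo {
    // Weak keys: an entry may be collected once its Conf is unreachable.
    // The Repo deliberately keeps no reference back to its Conf, since a
    // strong reference from the value would keep the key alive.
    private static final Map<Conf, Repo> CACHE = new WeakHashMap<>();

    // Illustration only: counts how many repositories were really built.
    static int created = 0;

    final String includes;

    private Repo(Conf conf) {
        created++;
        // Read whatever the repository needs once, at construction time.
        this.includes = conf.get("plugin.includes", "");
    }

    // One repository per configuration object; repeated calls with the
    // same object return the cached instance instead of building a new one.
    static synchronized Repo get(Conf conf) {
        Repo repo = CACHE.get(conf);
        if (repo == null) {
            repo = new Repo(conf);
            CACHE.put(conf, repo);
        }
        return repo;
    }
}

final class PluginCacheDemo {
    public static void main(String[] args) {
        Conf a = new Conf();
        a.set("plugin.includes", "index-basic");
        Conf b = new Conf();
        b.set("plugin.includes", "index-more");

        System.out.println(Repo.get(a) == Repo.get(a)); // same conf object
        System.out.println(Repo.get(a) == Repo.get(b)); // different conf objects
    }
}
```

With something like this, the "too smart" job with two filters built
from two different configuration objects still gets two separate
repositories, while ordinary code that keeps passing the same JobConf
around reuses a single one.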

What do you think?

>
>
> >
> > --
> > Best regards,
> > Andrzej Bialecki     <><
> >   ___. ___ ___ ___ _ _   __________________________________
> > [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> > ___|||__||  \|  ||  |  Embedded Unix, System Integration
> > http://www.sigram.com  Contact: info at sigram dot com
> >
> >
>
>
> --
> Doğacan Güney
>


-- 
Doğacan Güney
_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers
