Re: Nutch is resilient to automated testing

brainstorm Thu, 07 Aug 2008 05:12:06 -0700

I think it would help if you shared some of the code you're actually
writing (preliminary, broken or working), along with:


1) Expected results
2) Actual results

That would be useful for starting to diagnose your problem. IMHO, there's
nothing more specific than the actual test code ;)

Regards,
Roman

On Thu, Aug 7, 2008 at 12:07 PM, Rick Moynihan <[EMAIL PROTECTED]> wrote:
> I first posted this to Nutch-Dev, but had no response; so I'm reposting it
> here.  If you've already seen it, apologies for the dupe.
>
> Hi all,
>
> A colleague I have been working with has developed a plugin to index
> content with Nutch.  And though it does the job admirably, the
> complexity and design of Nutch have proven resistant to easily writing
> automated tests for this component.
>
> I'm desperately trying to write some JUnit unit/integration tests for
> this component, however Nutch doesn't make this simple enough, and I
> fear this amongst other things is a barrier to Nutch adoption.
>
> What I want to do is:
>
> - Set up a Jetty server within the test with the content I want to index
> (easy enough with CrawlDBTestUtil)
> - Configure a crawl (i.e. fetch, index, merge, dedup etc...) and
> override the configuration with my plugin and configuration.
> - Store the index (preferably in memory, but on the disk is ok).
> - Assert that particular searches return items etc...
>
>
> At first I thought this would be a simple matter of using
> CrawlDBTestUtil to establish the server side, then using
> org.apache.nutch.crawl.Crawl to perform all the relevant steps resulting
> in an index of the content, which I can then run assertions on via
> NutchBean.
>
> Ideally I'd like to create just one Configuration object, override the
> settings as I wish, and then pass this object into Crawl and NutchBean
> appropriately.
>
> Sadly however org.apache.nutch.crawl.Crawl isn't really a class, as it
> really only has a static main method which performs all the operations
> in batch.  This design makes the class hard to reuse within the context
> of my test.  This leaves me with the following options:
>
> - Call the main method and pass it an ugly array of Strings to do what I
> require.  This is also ugly because of assumptions underlying the design
> of this component (configuration files on the classpath etc...).  It also
> allows little or no reuse of configuration with other parts of the code
> (e.g. NutchBean).
>
> - Copy/Paste/Modify Crawl into my test.  The code in Crawl recently
> changed to account for hadoop 0.17, so I don't really want to do this
> only to find the API changes.  Plus I believe that tests should be
> simple to read.  Explicitly performing 30 steps in order to test a
> component isn't a good idea, as it hides the forest for the trees.
>
> CrawlDBTestUtil is a step in the right direction, but more work is
> needed.  Is it possible to get this marked as a bug/feature-request and
> fixed in time for 1.0?
>
> Thanks again for your help.
>
> R.
>
>
>
>
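
For the first option in the quoted message (calling Crawl's main method
with an array of Strings), a small helper can at least keep the test
readable. This is only a sketch: CrawlArgs is a hypothetical class, not
part of Nutch, and the flag names (-dir, -depth, -topN) are assumed from
Crawl's usage message, so check them against your Nutch version.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical helper (not part of Nutch): assembles the String[] that a
 * Crawl-style main method expects, so a test can name each setting.
 * The flag names are assumptions; verify them against your Nutch version.
 */
final class CrawlArgs {
    private final List<String> args = new ArrayList<String>();

    /** The first positional argument is the directory holding the seed URLs. */
    CrawlArgs(String urlDir) {
        args.add(urlDir);
    }

    /** Output directory for the crawl (assumed flag: -dir). */
    CrawlArgs dir(String outputDir) {
        return option("-dir", outputDir);
    }

    /** Number of fetch rounds (assumed flag: -depth). */
    CrawlArgs depth(int depth) {
        return option("-depth", Integer.toString(depth));
    }

    /** Maximum pages per round (assumed flag: -topN). */
    CrawlArgs topN(int topN) {
        return option("-topN", Integer.toString(topN));
    }

    /** Appends a flag followed by its value and returns this for chaining. */
    private CrawlArgs option(String flag, String value) {
        args.add(flag);
        args.add(value);
        return this;
    }

    /** Returns the arguments in the order they were added. */
    String[] build() {
        return args.toArray(new String[0]);
    }
}
```

In a test this could be passed along as
Crawl.main(new CrawlArgs("urls").dir("crawl-test").depth(2).build()).
It does nothing about the configuration-reuse problem described in the
quoted message; it only tidies up the argument array.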
