[ 
https://issues.apache.org/jira/browse/NUTCH-1135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13127427#comment-13127427
 ] 

Andrzej Bialecki  commented on NUTCH-1135:
------------------------------------------

A few comments from the author of this monstrosity ;) First, thanks Ferdy for 
taking the time to work on this. It's much appreciated, and we need to move 
forward here. I agree that ultimately this test should be moved to Gora and 
become part of a larger test suite that verifies the correctness of concurrent 
multi-threaded and multi-process operations.

However, the immediate purpose of this class was to stress-test the existing 
Gora versions in usage patterns typical for Nutch, in order to verify that a 
particular version of Gora is a viable storage layer for Nutch - so the test 
tries to replicate typical Nutch scenarios. Remember that this has to work not 
only for a toy crawl in a single JVM in local mode, but also for a fully 
distributed parallel map-reduce crawl. Consequently:

* testMultiThread: tests a scenario of multiple threads in a single JVM all 
writing to the same storage instance. This replicates a scenario present e.g. 
in a single Fetcher task. If this test fails (assuming it's properly 
constructed!) then this means that Gora will fail, perhaps silently (see 
NUTCH-893), in a fundamental Nutch tool.

* testMultiProcess: tests a scenario of multiple processes running in multiple 
JVMs all writing to the same storage instance. This replicates a scenario of 
multiple map-reduce tasks all using the same storage config (shared storage, 
e.g. HSQLDB in server mode), and it's fundamental to all Nutch tools running on 
a cluster. In map-reduce jobs there are usually many concurrent tasks, and some 
of them may execute in several copies in parallel (speculative execution) and 
some others may fail catastrophically without proper cleanup - and Gora 
backends must simply cope with that. If this test fails (again, assuming it's 
properly constructed and doesn't exceed some OS capabilities of the test 
machine, or some known limits of a storage implementation, like the number of 
concurrent connections) then it means that Gora storage is not reliable for 
typical map-reduce usage, which sort of defeats the point of using it at all.
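
To make the testMultiThread pattern concrete, here is a minimal sketch of the 
idea: several threads write distinct rows to one shared store, and afterwards 
every row is read back and checked. This is not the actual TestGoraStorage 
code. SharedStore is a hypothetical in-memory stand-in (not the Gora API), and 
the class name, method name, and numbers are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class MultiThreadWriteSketch {

    // Hypothetical stand-in for a shared storage instance; NOT the Gora API.
    static final class SharedStore {
        private final Map<String, String> rows = new ConcurrentHashMap<>();

        void put(String key, String value) {
            rows.put(key, value);
        }

        String get(String key) {
            return rows.get(key);
        }
    }

    // Starts numThreads writers that each store rowsPerThread distinct rows in
    // the same store, then returns how many rows are missing or wrong.
    public static int runAndCountMissing(int numThreads, int rowsPerThread)
            throws Exception {
        SharedStore store = new SharedStore();
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        List<Future<?>> pending = new ArrayList<>();

        for (int t = 0; t < numThreads; t++) {
            final int id = t;
            pending.add(pool.submit(() -> {
                for (int i = 0; i < rowsPerThread; i++) {
                    store.put("thread-" + id + "-row-" + i, "value-" + i);
                }
            }));
        }

        // Wait for every writer; get() rethrows any exception from a writer,
        // so a failed write cannot go unnoticed.
        for (Future<?> f : pending) {
            f.get();
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);

        // Read everything back and count rows that were lost or corrupted.
        int missing = 0;
        for (int t = 0; t < numThreads; t++) {
            for (int i = 0; i < rowsPerThread; i++) {
                String expected = "value-" + i;
                if (!expected.equals(store.get("thread-" + t + "-row-" + i))) {
                    missing++;
                }
            }
        }
        return missing;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("missing rows: " + runAndCountMissing(8, 1000));
    }
}
```

The read-back step is the important part: a silent failure of the kind 
mentioned above would show up as a non-zero count rather than as an exception. 
A multi-process variant would presumably run the same writer loop in separate 
JVMs against a shared server-mode store instead of an in-memory map.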

To summarize: I think the patch in its current form helps the tests pass, but I 
don't think it addresses the underlying problems in Gora (or perhaps in the 
HSQLDB backend); rather, it hides them. After all, we want the test to mean 
something if it passes: to verify that we can use Gora for more than a toy 
crawl, with guarantees of correctness in the presence of concurrent updates.

If the above errors don't indicate issues with Gora, but instead are caused by 
exceeded OS or HSQLDB limits, or by HSQLDB misconfiguration, then of course we 
should fix the configs and adjust the numbers so that they make sense. But with 
the proper config and proper numbers both tests should pass; otherwise we can't 
be sure that Gora is working properly at all.
                
> Fix TestGoraStorage for Nutchgora
> ---------------------------------
>
>                 Key: NUTCH-1135
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1135
>             Project: Nutch
>          Issue Type: Sub-task
>          Components: storage
>    Affects Versions: nutchgora
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>            Priority: Critical
>             Fix For: nutchgora
>
>         Attachments: NUTCH-1135-v1.patch, NUTCH-1135-v2.patch
>
>
> This issue is part of a larger target which aims to fix broken JUnit tests 
> for Nutchgora
