Re: [Nutch-general] subcollections

liv Fri, 15 Dec 2006 04:09:03 -0800

Thanks for your quick answer. However, I have tried this several times and it
doesn't work. Here is what I do:


- I fetch/index 2 sites, with the subcollections.xml file as follows:

<?xml version="1.0" encoding="UTF-8"?>
<subcollections>
        <subcollection>
                <name>test1</name>
                <id>test1</id>
                <whitelist>
http://npo.d1.com/
                </whitelist>
                <blacklist />
        </subcollection>
        <subcollection>
                <name>test2</name>
                <id>test2</id>
                <whitelist>
http://rollout.d1.com/
                </whitelist>
                <blacklist />
        </subcollection>
        <subcollection>
                <name>test3</name>
                <id>test3</id>
                <whitelist>
http://npo.d1.com/
http://rollout.d1.com/
                </whitelist>
                <blacklist />
        </subcollection>
</subcollections>


- I inspect the resulting db with luke; the subcollection field is populated
with " test1 test3" and " test2 test3" values respectively

- I change the names of collections in the file subcollections.xml as
follows:

<?xml version="1.0" encoding="UTF-8"?>
<subcollections>
        <subcollection>
                <name>new1</name>
                <id>new1</id>
                <whitelist>
http://npo.d1.com/
                </whitelist>
                <blacklist />
        </subcollection>
        <subcollection>
                <name>new2</name>
                <id>new2</id>
                <whitelist>
http://rollout.d1.com/
                </whitelist>
                <blacklist />
        </subcollection>
        <subcollection>
                <name>new3</name>
                <id>new3</id>
                <whitelist>
http://npo.d1.com/
http://rollout.d1.com/
                </whitelist>
                <blacklist />
        </subcollection>
</subcollections>

- I reindex the db: I delete the "indexes" folder and run the command:

bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*

- then I inspect the resulting db with luke again

Unfortunately nothing has changed. Maybe I am missing something... Please
tell me if you see anything wrong.
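
To make clear what I expected to see in the subcollection field after reindexing, here is a rough sketch. It is not the plugin's code: the function name expected_field is made up, and it assumes that a page belongs to a subcollection when its URL starts with a whitelist entry and with no blacklist entry. It also assumes the field is built from the <name> values (name and id are identical in my file, so that choice makes no difference here).

```python
import xml.etree.ElementTree as ET

# The renamed subcollections.xml from above (XML declaration omitted for brevity).
CONFIG = """
<subcollections>
        <subcollection>
                <name>new1</name>
                <id>new1</id>
                <whitelist>
http://npo.d1.com/
                </whitelist>
                <blacklist />
        </subcollection>
        <subcollection>
                <name>new2</name>
                <id>new2</id>
                <whitelist>
http://rollout.d1.com/
                </whitelist>
                <blacklist />
        </subcollection>
        <subcollection>
                <name>new3</name>
                <id>new3</id>
                <whitelist>
http://npo.d1.com/
http://rollout.d1.com/
                </whitelist>
                <blacklist />
        </subcollection>
</subcollections>
"""


def expected_field(url, config_xml):
    """Return the field value I expect for a URL (hypothetical helper).

    Assumption: prefix match against whitelist entries, minus blacklist hits.
    """
    root = ET.fromstring(config_xml)
    names = []
    for sub in root.findall("subcollection"):
        # Entries are whitespace-separated, one URL per line in my file.
        whitelist = (sub.findtext("whitelist") or "").split()
        blacklist = (sub.findtext("blacklist") or "").split()
        allowed = any(url.startswith(entry) for entry in whitelist)
        denied = any(url.startswith(entry) for entry in blacklist)
        if allowed and not denied:
            names.append(sub.findtext("name"))
    # Luke shows the values with a leading space, so mimic that here.
    return "".join(" " + name for name in names)


print(repr(expected_field("http://npo.d1.com/index.html", CONFIG)))
print(repr(expected_field("http://rollout.d1.com/index.html", CONFIG)))
```

Under those assumptions the two sites should end up with " new1 new3" and " new2 new3", in the same way the first run gave " test1 test3" and " test2 test3".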

Thanks! 


Sami Siren-2 wrote:
> 
> liv wrote:
>> I intend to use nutch with a fairly complex structure of subcollections. I
>> did some tests and the storage/search performs as expected; however there is
>> an aspect I may have neglected and cannot find an answer. 
>> 
>> How/at which stage are subcollections added to the index structure?
> 
> If you are talking about the subcollections generated by the
> subcollection plugin then the subcollection data is stored at indexing
> phase.
> 
>> I plan on crawling frequently, adding new sites to existent repository,
>> merging/reindexing as needed. However if I need to change the subcollection
>> structure (ie. add a site to a newly created subcollection) I don't want to
>> recrawl it again. I hope it can be done by simply using the
>> existent/crawled data.
> 
> no need to recrawl, unfortunately you still need to reindex.
> 
> --
>  Sami Siren
> 
> 


