Re: Prevent crawl of parent URL

2013-08-13 Thread stone2dbone
UPDATE: I found that the anchor for any parent directory is blank. In Solr, I have been able to delete the parent directories with: deleteByQuery( "-anchor:[* TO *]" ). However, I would prefer to delete these with Nutch if possible. Any suggestions would be appreciated. Regards.
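
For readers who want to try the Solr-side cleanup described above, here is a minimal sketch of the same delete-by-query sent as a JSON update over HTTP. The host, port, and core name in SOLR_UPDATE_URL are assumptions, not taken from the thread, and the sketch assumes the update handler accepts JSON bodies. It only builds the request; actually sending it is left commented out.

```python
import json
import urllib.request

# Hypothetical Solr update endpoint; the host, port and core name are assumptions.
SOLR_UPDATE_URL = "http://localhost:8983/solr/collection1/update?commit=true"


def build_delete_request(query, url=SOLR_UPDATE_URL):
    """Build (but do not send) a JSON delete-by-query request for Solr."""
    body = json.dumps({"delete": {"query": query}}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# The query from the message: documents that have no value in the anchor field.
req = build_delete_request("-anchor:[* TO *]")

# Sending is deliberately left out so the sketch has no side effects:
# urllib.request.urlopen(req)
```

Running the query as a plain search first is a cheap way to check which documents it would remove before deleting anything.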

Re: Prevent crawl of parent URL

2013-08-12 Thread stone2dbone
Thanks for the reference to debuggex. I have followed your recommendations and tested other regex, but I am still having a problem. I appreciate your patience in helping me understand what I am missing. In seed.txt I have: http://my.domain.name/dir/ I need to index the files but not the directori

RE: Prevent crawl of parent URL

2013-08-08 Thread stone2dbone
Unfortunately, -^http://my.domain.name/dir/$ didn't work for me. I need to skip just the documents in the directory, but this skips all the subdirectories as well. Is there another solution, or possibly some way to go back and remove all the parent directories after the crawl? Thanks for your he
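
As a quick check of what the anchored pattern above matches on its own, here is a small sketch using Python's re module as a stand-in for the Java regex engine Nutch uses (the two should agree for a pattern this simple, but that is an assumption). The test URLs other than the directory itself are made up for illustration.

```python
import re

# The skip pattern from the message, without the leading "-" sign.
SKIP = re.compile(r"^http://my.domain.name/dir/$")


def is_skipped(url):
    """Return True if the anchored skip pattern matches the URL."""
    return SKIP.search(url) is not None


# Illustrative URLs; only the first one appears in the thread.
for url in (
    "http://my.domain.name/dir/",
    "http://my.domain.name/dir/subdirA/",
    "http://my.domain.name/dir/file.txt",
):
    print(url, "skipped" if is_skipped(url) else "not matched")
```

Taken by itself, the pattern matches only the exact directory URL, so the loss of the subdirectories described above presumably comes from somewhere other than the pattern matching those URLs directly.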

Re: Nutch returns index as document

2013-08-02 Thread stone2dbone
Sebastian, Can you please clarify what you mean? Why can I not use https://my.domain.name/inside/test/ as a seed URL? -- View this message in context: http://lucene.472066.n3.nabble.com/Nutch-returns-index-as-document-tp4080323p4082258.html Sent from the Nutch - User mailing list archive at N

Nutch returns index as document

2013-07-25 Thread stone2dbone
When I perform a crawl, one of the documents returned by Nutch is the index of documents. e.g. for a crawl of: https://my.domain.name/inside/test/ the content of the first document is: Index of /inside/test Index of /inside/test Parent Directory test_css.css test_css.html test_css1.html test_css2

RE: Prevent crawl of parent URL

2013-07-24 Thread stone2dbone
Thanks Markus. I will give this a try. I did refilter the crawldb. One more question: I'm not good with regex. If I wanted to crawl http://my.domain.name/dir/subdirA/subdirA1/ http://my.domain.name/dir/subdirB/subdirB1/ http://my.domain.name/dir/subdirB/subdirB2/ http://my.domain.name/dir/subdir

Prevent crawl of parent URL

2013-07-24 Thread stone2dbone
I would like to crawl everything in http://my.domain.name/dir/subdir but nothing in its parent http://my.domain.name/dir/ In regex-urlfilter.txt I have the following: # skip URLs -^http://my.domain.name/dir/ # accept URLs +^http://my.domain.name/dir/subdir/* but Nutch still crawls the skip U
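
The two rules quoted above can be checked offline with a small simulation. This is a sketch, not Nutch code: it assumes the usual regex-urlfilter behaviour in which rules are tried top to bottom and the first pattern that matches decides, and it uses Python's re.search in place of the Java regex engine. The decide helper and the page.html URL are hypothetical.

```python
import re

# Rules in the order quoted in the message: "-" rejects, "+" accepts.
RULES = [
    ("-", r"^http://my.domain.name/dir/"),
    ("+", r"^http://my.domain.name/dir/subdir/*"),
]


def decide(url, rules=RULES):
    """Return the sign of the first rule whose pattern matches, or None."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign
    return None  # no rule matched


# Same rules with the more specific accept rule placed first.
REORDERED = list(reversed(RULES))

page = "http://my.domain.name/dir/subdir/page.html"  # hypothetical URL
parent = "http://my.domain.name/dir/"

print(decide(page), decide(page, REORDERED))
print(decide(parent), decide(parent, REORDERED))
```

Under that assumption, the original order rejects everything below /dir/, including /dir/subdir/, because the broader skip rule is reached first. Putting the accept rule first lets the subdirectory through while the parent is still rejected. Note also that the trailing `/*` in the accept pattern means "zero or more slashes", not "anything after the slash".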

Re: How to add field to index

2013-06-06 Thread stone2dbone
Lewis, A Jira ticket has been created. NUTCH-1580. Thanks for your help. Regards, ADS

Re: How to add field to index

2013-06-05 Thread stone2dbone
Lewis, I've been told by someone who knows Java (I don't) that the plugin is "trying to print out the string array and it's printing the string array object, not the contents of the array". Does this sound plausible to you? Is this something you can/would consider fixing? Please advise. Regards,

Re: How to add field to index

2013-06-05 Thread stone2dbone
lewis john mcgibbney wrote > Hi, > > index-static should work as described out of the box. > Make sure that you have a comma-separated list of fields in the form > > : > > within nutch-site.xml > It is as simple as that, I can only imagine that your field definitions > are > not like so. > Lewi

Re: How to add field to index

2013-06-04 Thread stone2dbone
a field to each document? Please clarify. Regards, ADS

Re: How to add field to index

2013-06-04 Thread stone2dbone
We're using version 1.6, binary version. Okay, IndexFiltersChecker shows the value of my added field is '[Ljava.lang.String;@15d1c817'. What might be causing this? --ADS lewis john mcgibbney wrote > Which version of Nutch and which artifact are you using? > Binaries for 1.x can be run directly

Re: How to add field to index

2013-06-04 Thread stone2dbone
Lewis, We've found the IndexFiltersChecker class file at http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/indexer/ Might you know where to find a compiled version? Thanks. ADS

How to add field to index

2013-06-03 Thread stone2dbone
I would like to know how to add a field to an index using Nutch 1.6 and Solr 4.0. I have tried using the index-static, index-extra and index-metadata plugins, all to no avail. I have modified nutch-default.xml: index.static display_type:page A simple plugin called at indexing that add