UPDATE:
I found that the anchor for any parent directory is blank. In Solr I have
been able to use the following to delete the parent directories:
deleteByQuery( "-anchor:[* TO *]" )
However, I would prefer to delete these with Nutch if possible. Any
suggestions would be appreciated.
Regards.
Thanks for the reference to debuggex. I have followed your recommendations
and tested other regex, but I am still having a problem. I appreciate your
patience in helping me understand what I am missing. In seed.txt I have:
http://my.domain.name/dir/
I need to index the files but not the directori
Unfortunately,
-^http://my.domain.name/dir/$
didn't work for me. I need to skip just the documents in the directory, but
this skips all the subdirectories as well. Is there another solution, or
possibly some way to go back and remove all the parent directories after the
crawl?
Thanks for your he
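For reference, a minimal sketch of what the anchored pattern from that skip rule matches on its own. The class name and the file URL are hypothetical, and the sketch assumes the filter applies the regex with "find" semantics, as java.util.regex does here. It only tests the pattern; it does not reproduce a crawl.

```java
import java.util.regex.Pattern;

// Sketch: which URLs does the anchored skip pattern match by itself?
class AnchoredPatternSketch {
    // The regex from the skip rule, without the leading '-'.
    static final Pattern DIR_ONLY = Pattern.compile("^http://my.domain.name/dir/$");

    // True if the pattern is found in the URL.
    static boolean matches(String url) {
        return DIR_ONLY.matcher(url).find();
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://my.domain.name/dir/",
            "http://my.domain.name/dir/file.html",   // hypothetical file name
            "http://my.domain.name/dir/subdirA/",
        };
        for (String url : urls) {
            System.out.println(url + " -> " + matches(url));
        }
    }
}
```

Because of the trailing "$", only the directory URL itself matches; files and subdirectories below it do not. If subdirectories still drop out of a crawl, one possible explanation (an assumption, not something confirmed in this thread) is that their links are only discovered by fetching the skipped listing page.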
Sebastian,
Can you please clarify what you mean? Why can I not use
https://my.domain.name/inside/test/ as a seed URL?
--
View this message in context:
http://lucene.472066.n3.nabble.com/Nutch-returns-index-as-document-tp4080323p4082258.html
Sent from the Nutch - User mailing list archive at N
When I perform a crawl, one of the documents Nutch returns is the directory
index page itself. For example, for a crawl of:
https://my.domain.name/inside/test/
the content of the first document is:
Index of /inside/test Index of /inside/test Parent Directory test_css.css
test_css.html test_css1.html test_css2
Thanks Markus. I will give this a try. I did refilter the crawldb. One more
question:
I'm not good with regex. If I wanted to crawl
http://my.domain.name/dir/subdirA/subdirA1/
http://my.domain.name/dir/subdirB/subdirB1/
http://my.domain.name/dir/subdirB/subdirB2/
http://my.domain.name/dir/subdir
I would like to crawl everything in
http://my.domain.name/dir/subdir
but nothing in its parent
http://my.domain.name/dir/
In regex-urlfilter.txt I have the following:
# skip URLs
-^http://my.domain.name/dir/
# accept URLs
+^http://my.domain.name/dir/subdir/*
but Nutch still crawls the skip U
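A note on rule order may help here. The comments in the stock regex-urlfilter.txt say that the first matching pattern decides whether a URL is kept, and that a URL matching no pattern is ignored. The sketch below imitates that first-match-wins behaviour. It is not Nutch's own code: the class and method names are hypothetical, the page URLs are made up for illustration, and the accept rule is written without the trailing "*" (in a regex, "/*" means "zero or more slashes", not a wildcard).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Sketch of first-match-wins URL filtering; all names here are hypothetical.
class FirstMatchFilterSketch {
    // One rule line: '+' accepts, '-' rejects, the rest of the line is a regex.
    static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(String line) {
            this.accept = line.charAt(0) == '+';
            this.pattern = Pattern.compile(line.substring(1));
        }
    }

    // Builds the rule list in the order the lines are given.
    static List<Rule> parse(String... lines) {
        List<Rule> rules = new ArrayList<>();
        for (String line : lines) {
            rules.add(new Rule(line));
        }
        return rules;
    }

    // The first rule whose regex is found in the URL decides the outcome.
    // A URL that matches no rule is rejected.
    static boolean accepts(List<Rule> rules, String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String parentFile = "http://my.domain.name/dir/page.html";        // hypothetical
        String subdirFile = "http://my.domain.name/dir/subdir/page.html"; // hypothetical

        List<Rule> skipFirst = parse(
            "-^http://my.domain.name/dir/",
            "+^http://my.domain.name/dir/subdir/");
        List<Rule> acceptFirst = parse(
            "+^http://my.domain.name/dir/subdir/",
            "-^http://my.domain.name/dir/");

        System.out.println(accepts(skipFirst, subdirFile));   // false
        System.out.println(accepts(acceptFirst, subdirFile)); // true
        System.out.println(accepts(acceptFirst, parentFile)); // false
    }
}
```

With the skip rule first, the broader pattern also catches everything under subdir, so the accept rule is never reached. Putting the more specific accept rule first keeps subdir and still rejects the rest of the parent directory. The filter only decides which URLs are kept; how those URLs are discovered in the first place is a separate question.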
Lewis,
A Jira ticket has been created. NUTCH-1580. Thanks for your help.
Regards,
ADS
--
View this message in context:
http://lucene.472066.n3.nabble.com/How-to-add-field-to-index-tp4067894p4068558.html
Sent from the Nutch - User mailing list archive at Nabble.com.
Lewis,
I've been told by someone who knows Java (I don't) that the plugin is "trying
to print out the string array and it's printing the string array object, not
the contents of the array". Does this sound plausible to you? Is this
something you can/would consider fixing? Please advise.
Regards,
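To illustrate the behaviour described above: in Java, converting an array to a string directly gives a type tag plus a hash code rather than the elements, which is the shape of the '[Ljava.lang.String;@15d1c817' value reported in this thread. The sketch below is generic Java, not the plugin's code. The class name is hypothetical, and the sample value "page" is borrowed from the display_type:page setting purely for illustration.

```java
import java.util.Arrays;

// Sketch: printing a String[] directly versus printing its contents.
class ArrayPrintSketch {
    // Arrays do not override toString(), so this yields "[Ljava.lang.String;@<hex hash>".
    static String implicit(String[] values) {
        return String.valueOf((Object) values);
    }

    // Renders the elements in brackets, e.g. "[page]".
    static String contents(String[] values) {
        return Arrays.toString(values);
    }

    // Joins the elements into one plain string, e.g. "page".
    static String joined(String[] values) {
        return String.join(",", values);
    }

    public static void main(String[] args) {
        String[] values = {"page"}; // illustrative field value
        System.out.println(implicit(values)); // hash part varies between runs
        System.out.println(contents(values));
        System.out.println(joined(values));
    }
}
```

Whether index-static in 1.6 does exactly this is not confirmed here; the sketch only shows that the reported value is consistent with an array being stringified as an object.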
lewis john mcgibbney wrote
> Hi,
>
> index-static should work as described out of the box.
> Make sure that you have a comma-separated list of fields in the form
>
> <fieldname>:<fieldvalue>
>
> within nutch-site.xml
> It is as simple as that, I can only imagine that your field definitions are
> not like so.
> Lewi
a field to each document? Please clarify.
Regards,
ADS
On Tue, Jun 4, 2013 at 5:08 PM, lewis john mcgibbney [via Lucene] <
ml-node+s472066n4068186...@n3.nabble.com> wrote:
> Hi,
>
> On Tue, Jun 4, 2013 at 11:40 AM, stone2dbone <[hidden
> email]<http://user/SendEmail.jt
We're using version 1.6, binary version.
Okay, IndexFiltersChecker shows the value of my added field is
'[Ljava.lang.String;@15d1c817'. What might be causing this?
--ADS
lewis john mcgibbney wrote
> Which version of Nutch and which artifact are you using?
> Binaries for 1.x can be run directly
Lewis,
We've found the IndexFiltersChecker class file at
http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/indexer/
Might you know where to find a compiled version?
Thanks.
ADS
--
View this message in context:
http://lucene.472066.n3.nabble.com/How-to-add-field-to-index-tp4
I would like to know how to add a field to an index using Nutch 1.6 and Solr
4.0. I have tried using the index-static, index-extra and index-metadata
plugins, all to no avail. I have modified
nutch-default.xml:
index.static
display_type:page
A simple plugin called at indexing that add