Notice that getAnchors() in Inlinks only returns a single anchor text per
domain. (This is not the case in Nutch 2.) Tricky.
So, in order to correctly setup your test, create all test values with
different domains.
Inlinks inlinks = new Inlinks();
inlinks.add(new Inlink("http://test1.com/", "text1"));
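To see why the fixtures need distinct domains, here is a self-contained sketch in plain Java (a hypothetical stand-in, not the real Nutch classes) that mimics the one-anchor-per-domain behaviour of getAnchors():

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for Inlinks.getAnchors(): only one anchor text
// survives per host, which is why each test value needs its own domain.
public class AnchorDedup {
    static Map<String, String> anchorsByHost(String[][] inlinks) throws Exception {
        Map<String, String> byHost = new LinkedHashMap<>();
        for (String[] link : inlinks) {
            String host = new URI(link[0]).getHost();
            byHost.putIfAbsent(host, link[1]); // later anchors from the same host are dropped
        }
        return byHost;
    }

    public static void main(String[] args) throws Exception {
        String[][] links = {
            {"http://test1.com/", "text1"},
            {"http://test1.com/page", "other"}, // same domain: this anchor is lost
            {"http://test2.com/", "text2"},
        };
        System.out.println(anchorsByHost(links)); // {test1.com=text1, test2.com=text2}
    }
}
```

The second anchor from test1.com silently disappears, which is exactly the kind of surprise that breaks a test asserting on all anchor texts.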
I'm not a regular Solr user, but here are some pointers: somehow you have
added multiple values for the 'id' field. What did you change from the
default indexing behaviour? Perhaps some custom IndexingFilters? What schema
are you using? (Or as a last resort, perhaps you could give Elasticsearch a
Hi,
There might be something wrong with the field modifiedTime. I'm not sure
how well you can rely on this field (with the default or the adaptive
scheduler).
If you want to get to the bottom of this, I suggest debugging or running
small crawls to test the behaviour. In case something doesn't
I think the modifiedTime comes from the HTTP headers if available; if not, it
is left empty. In other words, it is the time the content was last modified
according to the source, when available; otherwise it is left blank.
Depending on what Jacob is trying to achieve the one line patch
In trunk the modified time is based on whether or not the signature has
changed. It makes little sense to rely on HTTP headers because almost no CMS
implements them correctly, and they can interfere with (or be deliberately
manipulated to skew) an adaptive schedule.
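For reference, the adaptive schedule Markus mentions is switched on via nutch-site.xml. The property names below come from the default configuration; the values are illustrative only, so check conf/nutch-default.xml for the actual defaults:

```xml
<!-- Illustrative values only; see conf/nutch-default.xml for defaults. -->
<property>
  <name>db.fetch.schedule.class</name>
  <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
</property>
<property>
  <name>db.fetch.schedule.adaptive.inc_rate</name>
  <value>0.4</value>
</property>
<property>
  <name>db.fetch.schedule.adaptive.dec_rate</name>
  <value>0.2</value>
</property>
```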
In trunk you can use the Inlink and Inlinks classes. The first for each inlink
and the latter to add the Inlink objects to.
Inlinks inlinks = new Inlinks();
inlinks.add(new Inlink("http://nutch.apache.org/", "Apache Nutch"));
The inlink URL is the key in the key/value pair so you won't see that
Additionally, please see this issue below and if you are able please
provide feedback based on the patch.
https://issues.apache.org/jira/browse/NUTCH-1486
hth
Lewis
On Tue, Nov 13, 2012 at 8:57 AM, Ferdy Galema ferdy.gal...@kalooga.com wrote:
I'm not a regular Solr user, but here are some
Nice one Gentlemen thank you very much.
Best
Lewis
On Tue, Nov 13, 2012 at 11:39 AM, Markus Jelsma
markus.jel...@openindex.io wrote:
In trunk you can use the Inlink and Inlinks classes. The first for each
inlink and the latter to add the Inlink objects to.
Inlinks inlinks = new Inlinks();
Hi Erol,
It looks like the error is from the Nutch side, and I would suggest you check
your database entries to see how the documents and fields are saved, or
dump the database and inspect the field values to check whether any of them
contain multiple values. It looks like the document Id
Lewis,
Have you checked it into SVN? Where can I get this patch?
Erol Akarsu
On Tue, Nov 13, 2012 at 6:57 AM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
Additionally, please see this issue below and if you are able please
provide feedback based on the patch.
Hi,
If you look at the attachments on the issue you will see the patches
for trunk and 2.x which *should* get a pretty comprehensive Nutch +
Solr 4.X stack up and running. Markus made some additional suggestions
which I have unfortunately not had time to integrate into the proposed
fix however
Lewis,
I applied the patch you told me about. I replaced schema.xml of the Solr 4
installation with schema-solr4.xml. The Solr 4.0 system is up and running and
I can see its web page at http://localhost:8080/solr40.
I followed the tutorial blindly. Crawling went fine but it seems very slow
compared to before
Hi,
On Tue, Nov 13, 2012 at 2:36 PM, Erol Akarsu eaka...@gmail.com wrote:
Lewis,
I applied the patch you told me about. I replaced schema.xml of the Solr 4
installation with schema-solr4.xml. The Solr 4.0 system is up and running and
I can see its web page at http://localhost:8080/solr40.
You would need to
Lewis,
Thanks for looking at this. Solr has the newest patched schema and I restarted
Tomcat.
I set DEBUG for SolrIndexerJob in log4j.properties file
log4j.logger.org.apache.nutch.indexer.solr.SolrIndexerJob=DEBUG,cmdstdout
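If that one logger stays too quiet, a broader (hypothetical) variant enables DEBUG for the whole indexer package instead of just the Solr job:

```
# Hypothetical broader setting; narrow the package again once you have the logs you need.
log4j.logger.org.apache.nutch.indexer=DEBUG,cmdstdout
```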
Can I
also suggest that you experiment with the crawl script (which
Hi Mourad,
I haven't understood your suggestion.
Can you please explain?
Erol Akarsu
On Tue, Nov 13, 2012 at 10:53 AM, Mouradk mourad...@gmail.com wrote:
Hello karl,
I have restarted a new one, please let me know if that helps.
Regards,
Mourad
On 13 Nov 2012, at 15:45, Erol Akarsu
Hi,
On Tue, Nov 13, 2012 at 3:45 PM, Erol Akarsu eaka...@gmail.com wrote:
Where is this script? bin folder has only nutch script.
https://svn.apache.org/repos/asf/nutch/branches/2.x/src/bin/crawl
I am using Nutch 2.1, not trunk. Does it make any difference to the behavior
of the nutch script?
I
Lewis,
The new crawl script is throwing an error.
eakarsu@ubuntu:~/searchProject/apache-nutch-2.1/runtime/local$ bin/crawl
seedDir myid1 urls http://localhost:8080/solr40/ 2
InjectorJob: starting
InjectorJob: urlDir: seedDir
InjectorJob: finished
bin/crawl: line 100: ((: http://localhost:8080/solr40/:
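The `((:` in that error is bash arithmetic choking on a non-numeric argument. If I read the 2.x script correctly, the expected invocation is `bin/crawl <seedDir> <crawlId> <solrURL> <numberOfRounds>` (four arguments), so the extra `urls` argument here shifts the Solr URL into the numeric rounds slot. A stand-alone illustration of the failure mode (hypothetical variable names, not the actual script):

```shell
# bash (( )) arithmetic over a round counter works with a number...
ROUNDS=2
for ((i = 1; i <= ROUNDS; i++)); do
  echo "round $i"
done
# ...but a URL in the numeric slot (ROUNDS="http://...") reproduces the
# same "((: http://...:" syntax error quoted above.
```

Dropping the stray argument so the URL and the round count land in the right positions should clear the error.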
Hi,
On Tue, Nov 13, 2012 at 4:22 PM, Erol Akarsu eaka...@gmail.com wrote:
Nov 13, 2012 11:11:48 AM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: Document contains multiple
values for uniqueKey field: id=[org.apache.nutch:http/, ]
The
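For context on that SolrException: in a standard Solr schema.xml the field backing uniqueKey must be single-valued. An illustrative fragment (not necessarily the exact schema from the patch) looks like:

```xml
<!-- Illustrative fragment: the uniqueKey field must not be multiValued. -->
<field name="id" type="string" indexed="true" stored="true"
       required="true" multiValued="false"/>
<uniqueKey>id</uniqueKey>
```

If the indexing side sends two values for id anyway (as the `id=[org.apache.nutch:http/, ]` in the log suggests), Solr 4 rejects the document with exactly this error.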
Hi people:
I'm thinking (just a thought for now) about a possible integration between
nutch and some queue messaging service (like RabbitMQ); the idea is to do some
offline processing of data crawled by nutch (and indexed into solr). Let's
take an example: I want to categorize the pages
Hi
I'm thinking (just a thought for now) about a possible integration between
nutch and some queue messaging service (like RabbitMQ); the idea is to do
some offline processing of data crawled by nutch (and indexed into
solr). Let's take an example: I want to categorize the pages crawled
Hi
Thank you for taking the time to reply my email, I'll really appreciate it.
I'm thinking (just a thought for now) about a possible integration between
nutch and some queue messaging service (like RabbitMQ); the idea is to do
some offline processing of data crawled by nutch (and indexed
No, I haven't used Tika.
I didn't know that 2.x will not allow multiple docs from a single entry;
it's good that you pointed that out. I am using 1.5.1 for my current work,
where I need multiple docs from an RSS feed, which generally has several
outlinks, all in one parse phase of a crawl cycle.
Sourajit
On