Re: Simulating 2.x's page.putToInlinks() in trunk

2012-11-13 Thread Ferdy Galema
Notice that the getAnchors() in Inlinks only returns a single text per domain. (This is not in Nutch2). Tricky. So, in order to correctly set up your test, create all test values with different domains. Inlinks inlinks = new Inlinks(); inlinks.add(new Inlink("http://test1.com/", "text1"));
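To illustrate why the test values need different domains, here is a minimal self-contained sketch. It is not the Nutch Inlinks class: InlinksSketch is a hypothetical stand-in, and its rule of keeping at most one anchor text per host is an assumption taken only from the description in this message. The real getAnchors() may differ in detail and should be checked in the Nutch source.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for illustration only; NOT org.apache.nutch.crawl.Inlinks.
class InlinksSketch {
    // Each entry is {fromUrl, anchorText}.
    private final List<String[]> links = new ArrayList<>();

    void add(String fromUrl, String anchor) {
        links.add(new String[] {fromUrl, anchor});
    }

    // Assumption from the message: only one anchor text is returned per host.
    // The first anchor seen for a host wins; later ones from that host are dropped.
    List<String> getAnchors() {
        Set<String> seenHosts = new HashSet<>();
        List<String> anchors = new ArrayList<>();
        for (String[] link : links) {
            String host;
            try {
                host = URI.create(link[0]).getHost();
            } catch (IllegalArgumentException e) {
                continue; // skip unparsable source URLs in this sketch
            }
            if (seenHosts.add(host)) {
                anchors.add(link[1]);
            }
        }
        return anchors;
    }
}
```

Under that assumption, two inlinks from test1.com yield only one anchor, while inlinks from test1.com and test2.com yield both, which is why each test value gets its own domain.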

Re: org.apache.solr.common.SolrException: Document contains multiple values for uniqueKey field: id?

2012-11-13 Thread Ferdy Galema
I'm not a regular Solr user, but here are some pointers: Somehow, you have added multiple values for the 'id' field. What did you change from the default indexing behaviour? Perhaps some custom IndexingFilters? What schema are you using? (Or as a last resort, perhaps you could give Elasticsearch a

Re: How to find ids of pages that have been newly crawled or modified after a given date with Nutch 2.1

2012-11-13 Thread Ferdy Galema
Hi, There might be something wrong with the field modifiedTime. I'm not sure how well you can rely on this field (with the default or the adaptive scheduler). If you want to get to the bottom of this, I suggest debugging or running small crawls to test the behaviour. In case something doesn't

RE: How to find ids of pages that have been newly crawled or modified after a given date with Nutch 2.1

2012-11-13 Thread j.sullivan
I think the modifiedTime comes from the HTTP headers if available; if not, it is left empty. In other words, it is the time the content was last modified according to the source, and it is left blank when the source does not provide one. Depending on what Jacob is trying to achieve the one line patch

RE: How to find ids of pages that have been newly crawled or modified after a given date with Nutch 2.1

2012-11-13 Thread Markus Jelsma
In trunk the modified time is based on whether or not the signature has changed. It makes little sense to rely on HTTP headers because almost no CMS implements them correctly, and wrong headers mess with an adaptive schedule (or allow it to be messed with on purpose).
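The signature-based approach can be sketched as follows. This is a minimal illustration, not Nutch's actual code: PageRecord, its fields, and the choice of MD5 over the raw content string are all assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Hypothetical crawl record for illustration; not a Nutch class.
class PageRecord {
    byte[] signature;   // digest of the last fetched content, null before the first fetch
    long modifiedTime;  // epoch millis of the last detected content change

    // Updates modifiedTime only when the content digest differs from the stored one.
    // Returns true if the page is considered modified.
    boolean update(String content, long fetchTime) {
        byte[] newSignature = md5(content);
        boolean changed = !Arrays.equals(signature, newSignature);
        if (changed) {
            signature = newSignature;
            modifiedTime = fetchTime;
        }
        return changed;
    }

    // MD5 is an assumption here; any stable content digest would serve.
    static byte[] md5(String content) {
        try {
            return MessageDigest.getInstance("MD5")
                    .digest(content.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Refetching unchanged content leaves modifiedTime alone, so the value depends only on what was actually fetched and not on what the server claims in its headers.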

RE: Simulating 2.x's page.putToInlinks() in trunk

2012-11-13 Thread Markus Jelsma
In trunk you can use the Inlink and Inlinks classes. The first for each inlink and the latter to add the Inlink objects to. Inlinks inlinks = new Inlinks(); inlinks.add(new Inlink("http://nutch.apache.org/", "Apache Nutch")); The inlink URL is the key in the key/value pair so you won't see that

Re: org.apache.solr.common.SolrException: Document contains multiple values for uniqueKey field: id?

2012-11-13 Thread Lewis John Mcgibbney
Additionally, please see this issue below and if you are able please provide feedback based on the patch. https://issues.apache.org/jira/browse/NUTCH-1486 hth Lewis On Tue, Nov 13, 2012 at 8:57 AM, Ferdy Galema ferdy.gal...@kalooga.com wrote: I'm not a regular Solr user, but here are some

Re: Simulating 2.x's page.putToInlinks() in trunk

2012-11-13 Thread Lewis John Mcgibbney
Nice one, gentlemen, thank you very much. Best, Lewis On Tue, Nov 13, 2012 at 11:39 AM, Markus Jelsma markus.jel...@openindex.io wrote: In trunk you can use the Inlink and Inlinks classes. The first for each inlink and the latter to add the Inlink objects to. Inlinks inlinks = new Inlinks()

Re: org.apache.solr.common.SolrException: Document contains multiple values for uniqueKey field: id?

2012-11-13 Thread kiran chitturi
Hi Erol, It looks like the error is from the Nutch side. I would suggest checking the entries in your database to see how the documents and fields are saved, or dumping the database and checking whether any of the fields contain multiple values. Looks like the document Id
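A check of this kind can be sketched in a few lines. The sketch assumes the dumped documents have already been loaded into memory as maps from field name to a list of values; that representation, the class UniqueKeyCheck, and the method findBadDocs are hypothetical and are not part of Nutch or Solr.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of Nutch or Solr.
class UniqueKeyCheck {
    // Returns the positions of documents whose key field does not hold exactly one value.
    static List<Integer> findBadDocs(List<Map<String, List<String>>> docs, String keyField) {
        List<Integer> bad = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            List<String> values = docs.get(i).get(keyField);
            // A missing key field and a multi-valued key field are both reported.
            if (values == null || values.size() != 1) {
                bad.add(i);
            }
        }
        return bad;
    }
}
```

Any position it returns points at a document that Solr would reject when the field is the schema's uniqueKey.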

Re: org.apache.solr.common.SolrException: Document contains multiple values for uniqueKey field: id?

2012-11-13 Thread Erol Akarsu
Lewis, Have you checked it in to SVN? Where will I get this patch? Erol Akarsu On Tue, Nov 13, 2012 at 6:57 AM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Additionally, please see this issue below and if you are able please provide feedback based on the patch.

Re: org.apache.solr.common.SolrException: Document contains multiple values for uniqueKey field: id?

2012-11-13 Thread Lewis John Mcgibbney
Hi, If you look at the attachments on the issue you will see the patches for trunk and 2.x which *should* get a pretty comprehensive Nutch + Solr 4.X stack up and running. Markus made some additional suggestions which I have unfortunately not had time to integrate into the proposed fix however

Re: org.apache.solr.common.SolrException: Document contains multiple values for uniqueKey field: id?

2012-11-13 Thread Erol Akarsu
Lewis, I applied the patch you told me about. I replaced the schema.xml of the Solr 4 installation with schema-solr4.xml. The Solr 4.0 system is up and running and I can see its web page at http://localhost:8080/sol40. I followed the tutorial blindly. Crawling went fine but it seems very slow compared to previous before

Re: org.apache.solr.common.SolrException: Document contains multiple values for uniqueKey field: id?

2012-11-13 Thread Lewis John Mcgibbney
Hi, On Tue, Nov 13, 2012 at 2:36 PM, Erol Akarsu eaka...@gmail.com wrote: Lewis, I applied the patch you told me about. I replaced the schema.xml of the Solr 4 installation with schema-solr4.xml. The Solr 4.0 system is up and running and I can see its web page at http://localhost:8080/sol40. You would need to

Re: org.apache.solr.common.SolrException: Document contains multiple values for uniqueKey field: id?

2012-11-13 Thread Erol Akarsu
Lewis, Thanks for looking at this. Solr has the newest patched schema and I restarted Tomcat. I set DEBUG for SolrIndexerJob in the log4j.properties file: log4j.logger.org.apache.nutch.indexer.solr.SolrIndexerJob=DEBUG,cmdstdout Can I also suggest that you experiment with the crawl script (which

Re: org.apache.solr.common.SolrException: Document contains multiple values for uniqueKey field: id?

2012-11-13 Thread Erol Akarsu
Hi Mourad, I haven't understood your suggestion. Can you please explain? Erol Akarsu On Tue, Nov 13, 2012 at 10:53 AM, Mouradk mourad...@gmail.com wrote: Hello karl, I have restarted a new one, please let me know if that helps. Regards, Mourad On 13 Nov 2012, at 15:45, Erol Akarsu

Re: org.apache.solr.common.SolrException: Document contains multiple values for uniqueKey field: id?

2012-11-13 Thread Lewis John Mcgibbney
Hi, On Tue, Nov 13, 2012 at 3:45 PM, Erol Akarsu eaka...@gmail.com wrote: Where is this script? The bin folder has only the nutch script. https://svn.apache.org/repos/asf/nutch/branches/2.x/src/bin/crawl I am using Nutch 2.1, not trunk. Does it make any difference to the behavior of the nutch script? I

Re: org.apache.solr.common.SolrException: Document contains multiple values for uniqueKey field: id?

2012-11-13 Thread Erol Akarsu
Lewis, The new crawl script is throwing an error. eakarsu@ubuntu:~/searchProject/apache-nutch-2.1/runtime/local$ bin/crawl seedDir myid1 urls http://localhost:8080/solr40/ 2 InjectorJob: starting InjectorJob: urlDir: seedDir InjectorJob: finished bin/crawl: line 100: ((: http://localhost:8080/solr40/:

Re: org.apache.solr.common.SolrException: Document contains multiple values for uniqueKey field: id?

2012-11-13 Thread Lewis John Mcgibbney
Hi, On Tue, Nov 13, 2012 at 4:22 PM, Erol Akarsu eaka...@gmail.com wrote: Nov 13, 2012 11:11:48 AM org.apache.solr.common.SolrException log SEVERE: org.apache.solr.common.SolrException: Document contains multiple values for uniqueKey field: id=[org.apache.nutch:http/, ] The

Integrating Nutch and RabbitMQ

2012-11-13 Thread Jorge Luis Betancourt Gonzalez
Hi people: I'm thinking (for now it is just a thought) about a possible integration between Nutch and a message queue service (like RabbitMQ). The idea is to do some offline processing of data crawled by Nutch (and indexed into Solr). Let's take an example: I want to categorize the pages

Re: Integrating Nutch and RabbitMQ

2012-11-13 Thread Julien Nioche
Hi, I'm thinking (for now it is just a thought) about a possible integration between Nutch and a message queue service (like RabbitMQ). The idea is to do some offline processing of data crawled by Nutch (and indexed into Solr). Let's take an example: I want to categorize the pages crawled

Re: Integrating Nutch and RabbitMQ

2012-11-13 Thread Jorge Luis Betancourt Gonzalez
Hi, Thank you for taking the time to reply to my email, I really appreciate it. I'm thinking (for now it is just a thought) about a possible integration between Nutch and a message queue service (like RabbitMQ). The idea is to do some offline processing of data crawled by Nutch (and indexed

Re: rss feed plugin seems broken (1.5.1)

2012-11-13 Thread Sourajit Basak
No. I haven't used Tika. I didn't know that 2.x will not allow multiple docs from a single entry; it's good that you pointed that out. I am using 1.5.1 for my current work, where I need multiple docs from an RSS feed, which generally has several outlinks. All in one parse phase of a crawl cycle. Sourajit On