Re: rss feed plugin seems broken (1.5.1)

2012-11-13 Thread Sourajit Basak
No, I haven't used Tika. I didn't know that 2.x will not allow multiple docs from a single entry; it's good that you pointed that out. I am using 1.5.1 for my current work, where I need multiple docs from an RSS feed, which generally has several outlinks, all in one parse phase of a crawl cycle. Sourajit On

Re: Integrating Nutch and RabbitMQ

2012-11-13 Thread Jorge Luis Betancourt Gonzalez
Hi, thank you for taking the time to reply to my email, I really appreciate it. > I'm thinking (for now it is just a thought) about a possible integration > between Nutch and some message queue service (like RabbitMQ). The idea is to > do some "offline" processing of some data crawled by Nutch (and ind

Re: Integrating Nutch and RabbitMQ

2012-11-13 Thread Julien Nioche
Hi > I'm thinking (for now it is just a thought) about a possible integration > between Nutch and some message queue service (like RabbitMQ). The idea is to > do some "offline" processing of some data crawled by Nutch (and indexed into > Solr). Let's take an example: I want to categorize the pages cra

Integrating Nutch and RabbitMQ

2012-11-13 Thread Jorge Luis Betancourt Gonzalez
Hi people: I'm thinking (for now it is just a thought) about a possible integration between Nutch and some message queue service (like RabbitMQ). The idea is to do some "offline" processing of some data crawled by Nutch (and indexed into Solr). Let's take an example: I want to categorize the pages c

Re: How to find ids of pages that have been newly crawled or modified after a given date with Nutch 2.1

2012-11-13 Thread Jacob Sisk
Hi folks, Thanks for all of your suggestions. Here are two tentative fixes suggested by my colleagues at work: Fix 1: Within Nutch itself, in org.apache.nutch.crawl.DbUpdateReducer change line 129 to: long modifiedTime = (modified == FetchSchedule.STATUS_MODIFIED) ? System.currentTimeMillis()

Re: org.apache.solr.common.SolrException: Document contains multiple values for uniqueKey field: id?

2012-11-13 Thread Erol Akarsu
Lewis, I am sorry, but Solr 4.0 throws an error when we set multiValued = true for the ID field - *collection1:*org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: uniqueKey field (null) can not be configured to be multivalued Please check your logs for more information Erol Ak

Re: org.apache.solr.common.SolrException: Document contains multiple values for uniqueKey field: id?

2012-11-13 Thread Lewis John Mcgibbney
Hi, On Tue, Nov 13, 2012 at 4:22 PM, Erol Akarsu wrote: > Nov 13, 2012 11:11:48 AM org.apache.solr.common.SolrException log > SEVERE: org.apache.solr.common.SolrException: Document contains multiple > values for uniqueKey field: id=[org.apache.nutch:http/, ] The proposed schema

Re: org.apache.solr.common.SolrException: Document contains multiple values for uniqueKey field: id?

2012-11-13 Thread Erol Akarsu
Lewis, The new crawl script is throwing an error. eakarsu@ubuntu:~/searchProject/apache-nutch-2.1/runtime/local$ bin/crawl seedDir myid1 urls http://localhost:8080/solr40/ 2 InjectorJob: starting InjectorJob: urlDir: seedDir InjectorJob: finished bin/crawl: line 100: ((: http://localhost:8080/solr40/: sy

Re: org.apache.solr.common.SolrException: Document contains multiple values for uniqueKey field: id?

2012-11-13 Thread Erol Akarsu
Lewis, Yes, you caught the wrong Solr URL. I corrected it and restarted, and now I get the previous Solr error even though it has the new schema.xml file. INFO: [collection1] webapp=/solr40 path=/admin/luke params={numTerms=0&show=index&wt=json} status=0 QTime=13 Nov 13, 2012 11:11:48 AM org.apache.solr.update.p

Re: org.apache.solr.common.SolrException: Document contains multiple values for uniqueKey field: id?

2012-11-13 Thread Lewis John Mcgibbney
Hi, On Tue, Nov 13, 2012 at 3:45 PM, Erol Akarsu wrote: > Where is this script? bin folder has only nutch script. https://svn.apache.org/repos/asf/nutch/branches/2.x/src/bin/crawl > I am using nutch 2.1 not trunk. Does it make any difference on behavior of > nutch script? I should have been m

Re: org.apache.solr.common.SolrException: Document contains multiple values for uniqueKey field: id?

2012-11-13 Thread Erol Akarsu
Hi Mourad, I haven't understood your suggestion. Can you please explain? Erol Akarsu On Tue, Nov 13, 2012 at 10:53 AM, Mouradk wrote: > Hello karl, > > I have restarted a new one, please let me know if that helps. > > Regards, > > Mourad > On 13 Nov 2012, at 15:45, Erol Akarsu wrote: > > > Le

Re: org.apache.solr.common.SolrException: Document contains multiple values for uniqueKey field: id?

2012-11-13 Thread Mouradk
Hello karl, I have restarted a new one, please let me know if that helps. Regards, Mourad On 13 Nov 2012, at 15:45, Erol Akarsu wrote: > Lewis, > > Thanks for looking at this. Solr has the newest patched schema and I restarted > Tomcat. > > I set DEBUG for SolrIndexerJob in log4j.properties file

Re: org.apache.solr.common.SolrException: Document contains multiple values for uniqueKey field: id?

2012-11-13 Thread Erol Akarsu
Lewis, Thanks for looking at this. Solr has the newest patched schema and I restarted Tomcat. I set DEBUG for SolrIndexerJob in log4j.properties file log4j.logger.org.apache.nutch.indexer.solr.SolrIndexerJob=DEBUG,cmdstdout >Can I >also suggest that you experiment with the crawl script (which >accom

Re: org.apache.solr.common.SolrException: Document contains multiple values for uniqueKey field: id?

2012-11-13 Thread Lewis John Mcgibbney
Hi, On Tue, Nov 13, 2012 at 2:36 PM, Erol Akarsu wrote: > Lewis, > > I applied the patch you told me about. I replaced the schema.xml of the Solr 4 installation > with schme-sol4.xml. The Solr 4.0 system is up and running and I can see its > web page at http://localhost:8080/sol40. You would need to either rename

Re: org.apache.solr.common.SolrException: Document contains multiple values for uniqueKey field: id?

2012-11-13 Thread Erol Akarsu
Lewis, I applied the patch you told me about. I replaced the schema.xml of the Solr 4 installation with schme-sol4.xml. The Solr 4.0 system is up and running and I can see its web page at http://localhost:8080/sol40. I followed the tutorial blindly. Crawling went fine, but it seems very slow compared to previous before

Re: org.apache.solr.common.SolrException: Document contains multiple values for uniqueKey field: id?

2012-11-13 Thread Lewis John Mcgibbney
Hi, If you look at the attachments on the issue you will see the patches for trunk and 2.x which *should* get a pretty comprehensive Nutch + Solr 4.X stack up and running. Markus made some additional suggestions which I have unfortunately not had time to integrate into the proposed fix however you

Re: org.apache.solr.common.SolrException: Document contains multiple values for uniqueKey field: id?

2012-11-13 Thread Erol Akarsu
Lewis, Have you checked it in to SVN? Where will I get this patch? Erol Akarsu On Tue, Nov 13, 2012 at 6:57 AM, Lewis John Mcgibbney < lewis.mcgibb...@gmail.com> wrote: > Additionally, please see this issue below and if you are able please > provide feedback based on the patch. > > https://issues.

Re: org.apache.solr.common.SolrException: Document contains multiple values for uniqueKey field: id?

2012-11-13 Thread Erol Akarsu
Ferdy, I was not able to use the sol4-schema.xml that comes with Nutch 2.1 because it was throwing the error Lewis pointed out ( https://issues.apache.org/jira/browse/NUTCH-1486). Therefore, I have used Solr 4 with the original schema.xml. Maybe this is why I am facing the issue. Lewis suggested to try to patch that I wi

Re: org.apache.solr.common.SolrException: Document contains multiple values for uniqueKey field: id?

2012-11-13 Thread kiran chitturi
Hi Erol, It looks like the error is from the Nutch side, and I would suggest you check your database for entries and see how the documents and fields are saved, or you can dump the database, look at the values of the fields, and check whether any of them hold multiple values. Looks like the document Id i
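
The check suggested in this message (looking through stored records for a key field that holds more than one value) can be sketched in plain Java. This is only an illustration under stated assumptions: the class `UniqueKeyCheckSketch`, the method `findBadIds`, and the representation of a document as a map from field name to a list of values are all hypothetical stand-ins, not Nutch or Solr APIs. The sample value is the one reported in the Solr log quoted elsewhere in this thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper: NOT part of Nutch or Solr. A "document" here is
// simply a map from field name to the list of values stored for it.
class UniqueKeyCheckSketch {

    // Reports every document whose key field does not hold exactly one value.
    static List<String> findBadIds(List<Map<String, List<String>>> docs, String keyField) {
        List<String> problems = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            List<String> values = docs.get(i).getOrDefault(keyField, List.of());
            if (values.size() != 1) {
                problems.add("doc " + i + ": " + keyField + "=" + values);
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        List<Map<String, List<String>>> docs = List.of(
            Map.of("id", List.of("http://nutch.apache.org/")),
            // Two values for "id", the second one empty, as in the reported log line.
            Map.of("id", List.of("org.apache.nutch:http/", "")));
        findBadIds(docs, "id").forEach(System.out::println);
    }
}
```

Running a check like this over a dump of the crawl store would show whether the extra value is already present before indexing, which narrows the problem to the Nutch side rather than the Solr schema.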

Re: Simulating 2.x's page.putToInlinks() in trunk

2012-11-13 Thread Lewis John Mcgibbney
Nice one, gentlemen, thank you very much. Best Lewis On Tue, Nov 13, 2012 at 11:39 AM, Markus Jelsma wrote: > In trunk you can use the Inlink and Inlinks classes. The first for each > inlink and the latter to add the Inlink objects to. > > Inlinks inlinks = new Inlinks(); > inlinks.add(new Inlink

Re: org.apache.solr.common.SolrException: Document contains multiple values for uniqueKey field: id?

2012-11-13 Thread Lewis John Mcgibbney
Additionally, please see this issue below and if you are able please provide feedback based on the patch. https://issues.apache.org/jira/browse/NUTCH-1486 hth Lewis On Tue, Nov 13, 2012 at 8:57 AM, Ferdy Galema wrote: > I'm not a regular Solr user, but here are some pointers: Somehow, you have

RE: Simulating 2.x's page.putToInlinks() in trunk

2012-11-13 Thread Markus Jelsma
In trunk you can use the Inlink and Inlinks classes. The first for each inlink and the latter to add the Inlink objects to. Inlinks inlinks = new Inlinks(); inlinks.add(new Inlink("http://nutch.apache.org/", "Apache Nutch")); The inlink URL is the key in the key/value pair so you won't see tha

RE: How to find ids of pages that have been newly crawled or modified after a given date with Nutch 2.1

2012-11-13 Thread Markus Jelsma
In trunk the modified time is based on whether or not the signature has changed. It makes little sense relying on HTTP headers because almost no CMS implements it correctly and it messes (or allows to be messed with on purpose) with an adaptive schedule. https://issues.apache.org/jira/browse/NU

RE: How to find ids of pages that have been newly crawled or modified after a given date with Nutch 2.1

2012-11-13 Thread j.sullivan
I think the modifiedTime comes from the http headers if available, if not it is left empty. In other words it is the time the content was last modified according to the source if available and if not available it is left blank. Depending on what Jacob is trying to achieve the one line patch at

Re: How to find ids of pages that have been newly crawled or modified after a given date with Nutch 2.1

2012-11-13 Thread Ferdy Galema
Hi, There might be something wrong with the field modifiedTime. I'm not sure how well you can rely on this field (with the default or the adaptive scheduler). If you want to get to the bottom of this, I suggest debugging or running small crawls to test the behaviour. In case something doesn't wor

Re: org.apache.solr.common.SolrException: Document contains multiple values for uniqueKey field: id?

2012-11-13 Thread Ferdy Galema
I'm not a regular Solr user, but here are some pointers: Somehow, you have added multiple values for the 'id' field. What did you change from the default indexing behaviour? Perhaps some custom IndexingFilters? What schema are you using? (Or as a last resort, perhaps you could give Elasticsearch a t

Re: Simulating 2.x's page.putToInlinks() in trunk

2012-11-13 Thread Ferdy Galema
Notice that the getAnchors() in Inlinks only returns a single text per domain. (This is not in Nutch2). Tricky. So, in order to correctly set up your test, create all test values with different domains. Inlinks inlinks = new Inlinks(); inlinks.add(new Inlink("http://test1.com/", "text1")); inlink
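
To see why the advice about distinct domains matters, here is a plain-JDK sketch of per-domain anchor de-duplication. It is a simulation, not the Nutch `Inlinks` class: `AnchorDedupSketch`, `Link`, and `anchors` are hypothetical names, and the rule used (an anchor text is kept only once per source host) is an assumption about what "a single text per domain" means in trunk. Under a stricter reading (one anchor per host regardless of text) the conclusion is the same: test values on distinct hosts always survive.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Plain-JDK stand-in for illustration; NOT the Nutch Inlinks implementation.
class AnchorDedupSketch {

    // Hypothetical value type mirroring the (from URL, anchor text) pair.
    static final class Link {
        final String fromUrl;
        final String anchor;

        Link(String fromUrl, String anchor) {
            this.fromUrl = fromUrl;
            this.anchor = anchor;
        }
    }

    // Assumed rule: keep an anchor text only the first time it is seen for a host.
    static List<String> anchors(List<Link> links) {
        Map<String, Set<String>> seenByHost = new HashMap<>();
        List<String> result = new ArrayList<>();
        for (Link link : links) {
            if (link.anchor.isEmpty()) {
                continue; // empty anchors carry no text worth keeping
            }
            String host = URI.create(link.fromUrl).getHost();
            Set<String> seen = seenByHost.computeIfAbsent(host, h -> new HashSet<>());
            if (seen.add(link.anchor)) {
                result.add(link.anchor);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Same host, same text: the second anchor is dropped.
        System.out.println(anchors(List.of(
            new Link("http://test1.com/a", "text"),
            new Link("http://test1.com/b", "text"))));
        // Different hosts: both anchors are kept.
        System.out.println(anchors(List.of(
            new Link("http://test1.com/", "text"),
            new Link("http://test2.com/", "text"))));
    }
}
```

A test that builds every inlink on the same host can therefore get fewer anchors back than it added, which is the trap this message describes.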