No. I haven't used Tika.
I didn't know that 2.x will not allow multiple docs from a single entry;
it's good that you pointed that out. I am using 1.5.1 for my current work, where I
need multiple docs from an RSS feed, which generally has several outlinks.
All in one parse phase of a crawl cycle.
Sourajit
On
Hi
Thank you for taking the time to reply to my email; I really appreciate it.
> I'm thinking (for now it's just a thought) about a possible integration
> between Nutch and some message queueing service (like RabbitMQ); the idea is to
> do some "offline" processing of some data crawled by Nutch (and ind
Hi
> I'm thinking (for now it's just a thought) about a possible integration
> between Nutch and some message queueing service (like RabbitMQ); the idea is to
> do some "offline" processing of some data crawled by Nutch (and indexed into
> Solr). Let's take an example: I want to categorize the pages cra
Hi people:
I'm thinking (for now it's just a thought) about a possible integration between
Nutch and some message queueing service (like RabbitMQ); the idea is to do some
"offline" processing of some data crawled by Nutch (and indexed into Solr). Let's
take an example: I want to categorize the pages c
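The decoupling described here (crawl, publish to a queue, process offline) can be sketched without RabbitMQ itself. Below is a minimal stand-in using a `BlockingQueue` in place of the message broker; the class and all names are illustrative assumptions, not Nutch or RabbitMQ API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch of the crawl -> queue -> offline-categorizer pipeline.
// A BlockingQueue stands in for RabbitMQ; in a real setup the
// producer would publish crawled page URLs to an exchange and
// the categorizer would consume them from a bound queue.
public class QueueSketch {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();

        // Producer side: after a parse/index step, push page URLs.
        queue.put("http://example.com/page1");
        queue.put("http://example.com/page2");

        // Consumer side: the offline categorizer drains the queue.
        List<String> categorized = new ArrayList<>();
        while (!queue.isEmpty()) {
            String url = queue.take();
            categorized.add("category-for:" + url); // hypothetical categorizer
        }
        System.out.println(categorized.size());
    }
}
```

The point of the queue is that the crawl cycle and the categorizer never block on each other; swapping the `BlockingQueue` for a real broker changes only the publish/consume calls.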
Hi folks,
Thanks for all of your suggestions. Here are two tentative fixes suggested
by my colleagues at work:
Fix 1:
Within Nutch itself, in org.apache.nutch.crawl.DbUpdateReducer, change
line 129 to:
long modifiedTime = (modified == FetchSchedule.STATUS_MODIFIED) ?
System.currentTimeMillis()
Lewis,
I am sorry, Solr 4.0 throws an error when we set multiValued = true for the ID
field:
*collection1:* org.apache.solr.common.SolrException: org.apache.solr.common.SolrException:
uniqueKey field (null) can not be configured to be multivalued
Please check your logs for more information
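For reference, the constraint the error describes lives in schema.xml: the field named by `<uniqueKey>` must stay single-valued. A minimal fragment of what that looks like (field type and attribute values here are common defaults, not copied from any particular Nutch schema):

```xml
<!-- The uniqueKey field must not be multiValued -->
<field name="id" type="string" indexed="true" stored="true" multiValued="false"/>
<uniqueKey>id</uniqueKey>
```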
Erol Ak
Hi,
On Tue, Nov 13, 2012 at 4:22 PM, Erol Akarsu wrote:
> Nov 13, 2012 11:11:48 AM org.apache.solr.common.SolrException log
> SEVERE: org.apache.solr.common.SolrException: Document contains multiple
> values for uniqueKey field: id=[org.apache.nutch:http/, ]
The proposed schema
Lewis,
The new crawl script is throwing an error.
eakarsu@ubuntu:~/searchProject/apache-nutch-2.1/runtime/local$ bin/crawl
seedDir myid1 urls http://localhost:8080/solr40/ 2
InjectorJob: starting
InjectorJob: urlDir: seedDir
InjectorJob: finished
bin/crawl: line 100: ((: http://localhost:8080/solr40/: sy
Lewis,
Yes, you caught it: the wrong Solr URL. I corrected it and restarted, and now I get
the previous Solr error even though it has the new schema.xml file.
INFO: [collection1] webapp=/solr40 path=/admin/luke
params={numTerms=0&show=index&wt=json} status=0 QTime=13
Nov 13, 2012 11:11:48 AM
org.apache.solr.update.p
Hi,
On Tue, Nov 13, 2012 at 3:45 PM, Erol Akarsu wrote:
> Where is this script? bin folder has only nutch script.
https://svn.apache.org/repos/asf/nutch/branches/2.x/src/bin/crawl
> I am using Nutch 2.1, not trunk. Does it make any difference to the behavior
> of the nutch script?
I should have been m
Hi Mourad,
I haven't understood your suggestion.
Can you please explain?
Erol Akarsu
On Tue, Nov 13, 2012 at 10:53 AM, Mouradk wrote:
> Hello Karl,
>
> I have restarted a new one, please let me know if that helps.
>
> Regards,
>
> Mourad
> On 13 Nov 2012, at 15:45, Erol Akarsu wrote:
>
> > Le
Hello Karl,
I have restarted a new one, please let me know if that helps.
Regards,
Mourad
On 13 Nov 2012, at 15:45, Erol Akarsu wrote:
> Lewis,
>
> Thanks for looking at this. Solr has the newest patched schema and I restarted
> Tomcat.
>
> I set DEBUG for SolrIndexerJob in log4j.properties file
Lewis,
Thanks for looking at this. Solr has the newest patched schema and I restarted
Tomcat.
I set DEBUG for SolrIndexerJob in log4j.properties file
log4j.logger.org.apache.nutch.indexer.solr.SolrIndexerJob=DEBUG,cmdstdout
> Can I
> also suggest that you experiment with the crawl script (which
> accom
Hi,
On Tue, Nov 13, 2012 at 2:36 PM, Erol Akarsu wrote:
> Lewis,
>
> I applied the patch you told me. I replaced the schema.xml of the Solr 4
> installation with schema-solr4.xml. The Solr 4.0 system is up and running and
> I can see its web page at http://localhost:8080/solr40.
You would need to either rename
Lewis,
I applied the patch you told me. I replaced the schema.xml of the Solr 4
installation with schema-solr4.xml. The Solr 4.0 system is up and running and I
can see its web page at http://localhost:8080/solr40.
I followed the tutorial blindly. Crawling went fine, but it seems very slow
compared to before
Hi,
If you look at the attachments on the issue you will see the patches
for trunk and 2.x which *should* get a pretty comprehensive Nutch +
Solr 4.X stack up and running. Markus made some additional suggestions
which I have unfortunately not had time to integrate into the proposed
fix; however, you
Lewis,
Have you checked it in to SVN? Where will I get this patch?
Erol Akarsu
On Tue, Nov 13, 2012 at 6:57 AM, Lewis John Mcgibbney <
lewis.mcgibb...@gmail.com> wrote:
> Additionally, please see this issue below and if you are able please
> provide feedback based on the patch.
>
> https://issues.
Ferdy,
I was not able to use the schema-solr4.xml coming with Nutch 2.1 because it was
throwing the error Lewis pointed out (
https://issues.apache.org/jira/browse/NUTCH-1486).
Therefore, I have used Solr 4 with the original schema.xml.
Maybe this is why I am facing the issue. Lewis suggested trying the patch, which
I wi
Hi Erol,
It looks like the error is from the Nutch side, and I would suggest you check
your database for entries and see how the documents and fields are saved, or
you can dump the database, look at the values of the fields, and check whether
there are any multiple values in there. It looks like the document id i
Nice one Gentlemen thank you very much.
Best
Lewis
On Tue, Nov 13, 2012 at 11:39 AM, Markus Jelsma
wrote:
> In trunk you can use the Inlink and Inlinks classes. The first for each
> inlink and the latter to add the Inlink objects to.
>
> Inlinks inlinks = new Inlinks();
> inlinks.add(new Inlink
Additionally, please see this issue below and if you are able please
provide feedback based on the patch.
https://issues.apache.org/jira/browse/NUTCH-1486
hth
Lewis
On Tue, Nov 13, 2012 at 8:57 AM, Ferdy Galema wrote:
> I'm not a regular Solr user, but here are some pointers: Somehow, you have
In trunk you can use the Inlink and Inlinks classes. The first for each inlink
and the latter to add the Inlink objects to.
Inlinks inlinks = new Inlinks();
inlinks.add(new Inlink("http://nutch.apache.org/", "Apache Nutch"));
The inlink URL is the key in the key/value pair, so you won't see tha
In trunk the modified time is based on whether or not the signature has
changed. It makes little sense to rely on HTTP headers because almost no CMS
implements them correctly, and they mess (or allow to be messed with on
purpose) with an adaptive schedule.
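The signature-based approach described above can be illustrated with a plain content hash: a page counts as modified only when the hash of its fetched content changes, regardless of what the HTTP headers claim. This is a simplified stand-in for Nutch's signature classes, not the actual implementation.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Simplified signature comparison: "modified" means the content
// hash changed between fetches, not that a header said so.
public class SignatureSketch {
    static byte[] signature(String content) throws NoSuchAlgorithmException {
        return MessageDigest.getInstance("MD5")
                .digest(content.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        byte[] oldSig  = signature("<html>v1</html>");
        byte[] sameSig = signature("<html>v1</html>");
        byte[] newSig  = signature("<html>v2</html>");

        // Unchanged content -> same signature -> not modified.
        System.out.println(Arrays.equals(oldSig, sameSig));
        // Changed content -> different signature -> modified.
        System.out.println(Arrays.equals(oldSig, newSig));
    }
}
```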
https://issues.apache.org/jira/browse/NU
I think the modifiedTime comes from the HTTP headers if available; if not, it is
left empty. In other words, it is the time the content was last modified
according to the source, if available; if not available, it is left blank.
Depending on what Jacob is trying to achieve the one line patch at
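For a header-derived modifiedTime as described above, the value comes from parsing the server's `Last-Modified` header, which uses the RFC 1123 date format. The header value below is hypothetical; whether any real server sends a trustworthy one is exactly the caveat raised earlier in the thread.

```java
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

// Parsing an HTTP Last-Modified header (RFC 1123 format) into an
// epoch-millisecond timestamp, as a modifiedTime field would hold.
public class LastModifiedSketch {
    public static void main(String[] args) {
        String header = "Tue, 13 Nov 2012 15:45:00 GMT"; // hypothetical value
        ZonedDateTime modified =
            ZonedDateTime.parse(header, DateTimeFormatter.RFC_1123_DATE_TIME);
        long modifiedTime = modified.toInstant().toEpochMilli();
        System.out.println(modifiedTime > 0);
    }
}
```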
Hi,
There might be something wrong with the field modifiedTime. I'm not sure
how well you can rely on this field (with the default or the adaptive
scheduler).
If you want to get to the bottom of this, I suggest debugging or running
small crawls to test the behaviour. In case something doesn't wor
I'm not a regular Solr user, but here are some pointers: Somehow, you have
added multiple values for the 'id' field. What did you change from the
default indexing behaviour? Perhaps some custom IndexingFilters? What schema
are you using? (Or as a last resort, perhaps you could give Elasticsearch a
t
Notice that getAnchors() in Inlinks only returns a single anchor text per
domain. (This is not the case in Nutch 2.) Tricky.
So, in order to correctly setup your test, create all test values with
different domains.
Inlinks inlinks = new Inlinks();
inlinks.add(new Inlink("http://test1.com/", "text1"));
inlink
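The one-anchor-per-domain caveat above can be modelled with a map keyed by host: adding a second inlink from the same domain does not contribute a second anchor text, which is why test values must come from different domains. This is an illustration of the behaviour, not the actual Inlinks implementation.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

// Models getAnchors() keeping only one anchor text per domain:
// a second inlink from the same host does not add another anchor.
public class AnchorSketch {
    static Map<String, String> anchorsByDomain = new LinkedHashMap<>();

    static void add(String url, String anchor) {
        String host = URI.create(url).getHost();
        anchorsByDomain.putIfAbsent(host, anchor); // first anchor per host wins
    }

    public static void main(String[] args) {
        add("http://test1.com/a", "text1");
        add("http://test1.com/b", "text2"); // same domain: anchor dropped
        add("http://test2.com/",  "text3"); // different domain: kept
        System.out.println(anchorsByDomain.size()); // 2 anchors, not 3
    }
}
```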