Indexing HTML Metatags Nutch - SOLR
Hello, I have been trying this for several days without success (Nutch 1.16, Solr 7.3.1). I have followed this description: https://cwiki.apache.org/confluence/display/nutch/IndexMetatags Below I put my nutch-site.xml. I created the core following this description: https://cwiki.apache.org/confluence/display/nutch/NutchTutorial/ By the way, without the metatags everything works fine. Before creating the core I deleted the managed-schema file and inserted my metatag fields into schema.xml in the configsets directory of the core.

First question: after creating the core I see a managed-schema file and a schema.xml.bak file in the conf directory of the core. Sorry, I am new to this, but I believe I do not want a managed-schema? (See description above.)

Anyway, when I run the crawl all is OK until the index is created. Then I end up with this error:

org.apache.solr.common.SolrException: copyField dest :'metatag.SITdescription_str' is not an explicit field and doesn't match a dynamicField.
        at org.apache.solr.schema.IndexSchema.registerCopyField(IndexSchema.java:902)
        at org.apache.solr.schema.ManagedIndexSchema.addCopyFields(ManagedIndexSchema.java:784)

There is no copyField instruction for metatag.SITdescription in managed-schema. I even created a field "metatag.SITdescription_str" in managed-schema, which did not help.

Can you help me please?

Best regards
Martin

nutch-site.xml:

<property>
  <name>http.agent.name</name>
  <value>SIT_NUTCH_SPIDER</value>
</property>

<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>If true, outlinks leading from a page to external hosts will be ignored. This is an effective way to limit the crawl to include only initially injected hosts, without creating complex URLFilters.</description>
</property>

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-(regex|validator)|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to include. Any plugin not matching this expression is excluded. By default Nutch includes plugins to crawl HTML and various other document formats via HTTP/HTTPS and indexing the crawled content into Solr. More plugins are available to support more indexing backends, to fetch ftp:// and file:// URLs, for focused crawling, and many other use cases.</description>
</property>

<property>
  <name>http.robot.rules.whitelist</name>
  <value>sitlux02.sit.de</value>
  <description>Comma separated list of hostnames or IP addresses to ignore robot rules parsing for.</description>
</property>

<property>
  <name>metatags.names</name>
  <value>SITdescription,SITkeywords,SITcategory,SITintern</value>
  <description>Names of the metatags to extract, separated by ','. Use '*' to extract all metatags. Prefixes the names with 'metatag.' in the parse-metadata. For instance to index description and keywords, you need to activate the plugin index-metadata and set the value of the parameter 'index.parse.md' to 'metatag.description,metatag.keywords'.</description>
</property>

<property>
  <name>index.parse.md</name>
  <value>metatag.SITdescription,metatag.SITkeywords,metatag.SITcategory,metatag.SITintern</value>
  <description>Comma-separated list of keys to be taken from the parse metadata to generate fields. Can be used e.g. for 'description' or 'keywords' provided that these values are generated by a parser (see parse-metatags plugin).</description>
</property>

<property>
  <name>index.metadata</name>
  <value>metatag.SITdescription,metatag.SITkeywords,metatag.SITcategory,metatag.SITintern</value>
  <description>Comma-separated list of keys to be taken from the metadata to generate fields. Can be used e.g. for 'description' or 'keywords' provided that these values are generated by a parser (see parse-metatags plugin), and property 'metatags.names'.</description>
</property>

--
Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html
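[A possible direction for this error, sketched under the following assumptions: the core uses Solr's default configset conventions, and the field types `string` and `text_general` exist in the schema. The copyField failure usually means the `*_str` target that Solr's add-schema-fields update processor generates has no matching field or dynamicField. Declaring the metatag fields plus a `*_str` dynamicField explicitly, and switching solrconfig.xml to the classic schema factory so Solr reads schema.xml instead of managed-schema, is one way to address both questions; field types shown are assumptions, not taken from the original post.]

```xml
<!-- solrconfig.xml: make Solr use schema.xml instead of managed-schema -->
<schemaFactory class="ClassicIndexSchemaFactory"/>

<!-- schema.xml: explicit fields for the Nutch metatags (types assumed) -->
<field name="metatag.SITdescription" type="text_general" indexed="true" stored="true"/>
<field name="metatag.SITkeywords"    type="text_general" indexed="true" stored="true"/>
<field name="metatag.SITcategory"    type="string"       indexed="true" stored="true"/>
<field name="metatag.SITintern"      type="string"       indexed="true" stored="true"/>

<!-- Catch-all so generated copyField targets like
     'metatag.SITdescription_str' resolve to something -->
<dynamicField name="*_str" type="string" indexed="true" stored="false" multiValued="true"/>
```

Alternatively, keeping the managed schema but removing the add-schema-fields update processor from the default update chain in solrconfig.xml should stop Solr from generating those `_str` copyFields in the first place.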
Re: Nutch+Solr
This is solved. Nutch 1.15 has an index-writers.xml file in which we can pass the username/password for indexing to Solr.
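[For readers landing on this thread: a minimal sketch of the relevant part of Nutch 1.15's conf/index-writers.xml. The parameter names below (`auth`, `username`, `password`, etc.) are given as I recall them; verify them against the index-writers.xml shipped with your Nutch distribution, and the URL/credentials are placeholders.]

```xml
<writer id="indexer_solr_1" class="org.apache.nutch.indexwriter.solr.SolrIndexWriter">
  <parameters>
    <param name="type" value="http"/>
    <!-- placeholder URL: point at your own core/collection -->
    <param name="url" value="http://localhost:8983/solr/nutch"/>
    <param name="commitSize" value="1000"/>
    <!-- basic-auth credentials, replacing the old solr.auth.* properties -->
    <param name="auth" value="true"/>
    <param name="username" value="solr"/>
    <param name="password" value="SolrRocks"/>
  </parameters>
</writer>
```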
Re: Nutch+Solr
Bineesh, I don't use Nutch, so I don't know if this is relevant, but I've had similar-sounding failures in doing and restoring backups. The solution for me was to deactivate authentication while the backup was being done, and then activate it again afterwards. Then everything was restored correctly. Otherwise, I got a whole bunch of errors (if I left authentication active when doing the backup). Terry

On 10/03/2018 10:21 AM, Bineesh wrote:
> Hello,
>
> We use Solr 7.3.1 and Nutch 1.15
>
> We've placed the authentication for our solr cloud setup using the basic
> auth plugin (login details -> solr/SolrRocks)
>
> For Nutch to index data to Solr, the below properties were added to the
> nutch-site.xml file:
>
> <property>
>   <name>solr.auth</name>
>   <value>true</value>
>   <description>Whether to enable HTTP basic authentication for
>   communicating with Solr. Use the solr.auth.username and
>   solr.auth.password properties to configure your credentials.</description>
> </property>
>
> <property>
>   <name>solr.auth.username</name>
>   <value>solr</value>
>   <description>Username</description>
> </property>
>
> <property>
>   <name>solr.auth.password</name>
>   <value>SolrRocks</value>
>   <description>Password</description>
> </property>
>
> While Nutch indexes data to Solr, it's failing due to authentication. Am I
> doing something wrong? Pls help
Nutch+Solr
Hello,

We use Solr 7.3.1 and Nutch 1.15.

We've placed the authentication for our solr cloud setup using the basic auth plugin (login details -> solr/SolrRocks).

For Nutch to index data to Solr, the below properties were added to the nutch-site.xml file:

<property>
  <name>solr.auth</name>
  <value>true</value>
  <description>Whether to enable HTTP basic authentication for communicating with Solr. Use the solr.auth.username and solr.auth.password properties to configure your credentials.</description>
</property>

<property>
  <name>solr.auth.username</name>
  <value>solr</value>
  <description>Username</description>
</property>

<property>
  <name>solr.auth.password</name>
  <value>SolrRocks</value>
  <description>Password</description>
</property>

While Nutch indexes data to Solr, it's failing due to authentication. Am I doing something wrong? Pls help
Nutch + Solr - Indexer causes java.lang.OutOfMemoryError: Java heap space
Hello everyone, I have configured my 2 servers to run in distributed mode (with Hadoop), and my configuration for the crawling process is Nutch 2.2.1, HBase (as storage) and Solr. Solr is run by Tomcat. The problem occurs every time I try the last step, i.e. when I want to index data from HBase into Solr: it fails with error *[1]*. I tried to add CATALINA_OPTS (or JAVA_OPTS) like this:

CATALINA_OPTS="$JAVA_OPTS -XX:+UseConcMarkSweepGC -Xms1g -Xmx6000m -XX:MinHeapFreeRatio=10 -XX:MaxHeapFreeRatio=30 -XX:MaxPermSize=512m -XX:+CMSClassUnloadingEnabled"

to Tomcat's catalina.sh script and ran the server with this script, but it didn't help. I also added the properties in *[2]* to the nutch-site.xml file, but it ended up with OutOfMemory again. Can you help me please?

*[1]*
2014-09-06 22:52:50,683 FATAL org.apache.hadoop.mapred.Child: Error running child : java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:2367)
        at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
        at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:114)
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:587)
        at java.lang.StringBuffer.append(StringBuffer.java:332)
        at java.io.StringWriter.write(StringWriter.java:77)
        at org.apache.solr.common.util.XML.escape(XML.java:204)
        at org.apache.solr.common.util.XML.escapeCharData(XML.java:77)
        at org.apache.solr.common.util.XML.writeXML(XML.java:147)
        at org.apache.solr.client.solrj.util.ClientUtils.writeVal(ClientUtils.java:161)
        at org.apache.solr.client.solrj.util.ClientUtils.writeXML(ClientUtils.java:129)
        at org.apache.solr.client.solrj.request.UpdateRequest.writeXML(UpdateRequest.java:355)
        at org.apache.solr.client.solrj.request.UpdateRequest.getXML(UpdateRequest.java:271)
        at org.apache.solr.client.solrj.request.RequestWriter.getContentStream(RequestWriter.java:66)
        at org.apache.solr.client.solrj.request.RequestWriter$LazyContentStream.getDelegate(RequestWriter.java:94)
        at org.apache.solr.client.solrj.request.RequestWriter$LazyContentStream.getName(RequestWriter.java:104)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:247)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:197)
        at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
        at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
        at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
        at org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:96)
        at org.apache.nutch.indexer.IndexWriters.close(IndexWriters.java:117)
        at org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:54)
        at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.close(MapTask.java:650)
        at org.apache.hadoop.mapred.MapTask.closeQuietly(MapTask.java:1793)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:779)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)

*[2]*
<property>
  <name>http.content.limit</name>
  <value>15000</value>
  <description>The length limit for downloaded content using the http protocol, in bytes. If this value is nonnegative (>=0), content longer than it will be truncated; otherwise, no truncation at all. Do not confuse this setting with the file.content.limit setting. For our purposes it is twice bigger than default - parsing big pages: 128 * 1024</description>
</property>

<property>
  <name>indexer.max.tokens</name>
  <value>10</value>
</property>

<property>
  <name>http.timeout</name>
  <value>5</value>
  <description>The default network timeout, in milliseconds.</description>
</property>

<property>
  <name>solr.commit.size</name>
  <value>100</value>
  <description>Defines the number of documents to send to Solr in a single update batch. Decrease when handling very large documents to prevent Nutch from running out of memory. NOTE: It does not explicitly trigger a server side commit.</description>
</property>

--
View this message in context: http://lucene.472066.n3.nabble.com/Nutch-Solr-Indexer-causes-java-lang-OutOfMemoryError-Java-heap-space-tp4157308.html
Sent from the Solr - User mailing list archive at Nabble.com.
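[Worth noting for anyone hitting this: the trace in *[1]* comes from org.apache.hadoop.mapred.Child, i.e. the Hadoop task JVM running the Nutch indexer, so CATALINA_OPTS on the Tomcat side cannot affect it. A sketch of the setting that does apply, for the Hadoop 1.x generation matching this stack trace; the heap value is an assumption to be sized to your nodes:]

```xml
<!-- mapred-site.xml: heap for the map/reduce child JVMs, which is where
     the Nutch indexer (and this OutOfMemoryError) actually runs -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx2g</value>
</property>
```

Combining a larger child heap with a smaller solr.commit.size (so each XML update batch Nutch buffers is smaller) is the usual way to get past this.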
Re: document id in nutch/solr
Another way of overriding Nutch fields is to modify the solrindex-mapping.xml file.

hth
Alex.

-----Original Message-----
From: Jack Krupansky j...@basetechnology.com
To: solr-user solr-user@lucene.apache.org
Sent: Sun, Jun 23, 2013 12:04 pm
Subject: Re: document id in nutch/solr

Add the passthrough dynamic field to your Solr schema, and then see what fields get passed through to Solr from Nutch. Then, add the missing fields to your Solr schema and remove the passthrough.

<dynamicField name="*" type="string" indexed="true" stored="true" multiValued="true"/>

Or, add Solr copyField directives to place fields in existing named fields. Or... talk to the Nutch people about how to do field name mapping on the Nutch side of the fence. Hold off on UUIDs until you figure all of the above out and everything is working without them.

-- Jack Krupansky

-----Original Message-----
From: Joe Zhang
Sent: Sunday, June 23, 2013 2:35 PM
To: solr-user@lucene.apache.org
Subject: Re: document id in nutch/solr

Can somebody help with this one, please?

On Fri, Jun 21, 2013 at 10:36 PM, Joe Zhang smartag...@gmail.com wrote:

A quite standard configuration of Nutch seems to automatically map url to id. Two questions:

- Where is such mapping defined? I can't find it anywhere in nutch-site.xml or schema.xml. The latter does define the id field as well as its uniqueness, but not the mapping.

- Given that Nutch has already defined such an id, can I ask Solr to redefine id as UUID? <field name="id" type="uuid" indexed="true" stored="true" default="NEW"/>

- This leads to a related question: do Solr and Nutch have to have IDENTICAL schema.xml?
Re: document id in nutch/solr
Can somebody help with this one, please?

On Fri, Jun 21, 2013 at 10:36 PM, Joe Zhang smartag...@gmail.com wrote:

A quite standard configuration of Nutch seems to automatically map url to id. Two questions:

- Where is such mapping defined? I can't find it anywhere in nutch-site.xml or schema.xml. The latter does define the id field as well as its uniqueness, but not the mapping.

- Given that Nutch has already defined such an id, can I ask Solr to redefine id as UUID? <field name="id" type="uuid" indexed="true" stored="true" default="NEW"/>

- This leads to a related question: do Solr and Nutch have to have IDENTICAL schema.xml?
Re: document id in nutch/solr
Add the passthrough dynamic field to your Solr schema, and then see what fields get passed through to Solr from Nutch. Then, add the missing fields to your Solr schema and remove the passthrough.

<dynamicField name="*" type="string" indexed="true" stored="true" multiValued="true"/>

Or, add Solr copyField directives to place fields in existing named fields. Or... talk to the Nutch people about how to do field name mapping on the Nutch side of the fence. Hold off on UUIDs until you figure all of the above out and everything is working without them.

-- Jack Krupansky

-----Original Message-----
From: Joe Zhang
Sent: Sunday, June 23, 2013 2:35 PM
To: solr-user@lucene.apache.org
Subject: Re: document id in nutch/solr

Can somebody help with this one, please?

On Fri, Jun 21, 2013 at 10:36 PM, Joe Zhang smartag...@gmail.com wrote:

A quite standard configuration of Nutch seems to automatically map url to id. Two questions:

- Where is such mapping defined? I can't find it anywhere in nutch-site.xml or schema.xml. The latter does define the id field as well as its uniqueness, but not the mapping.

- Given that Nutch has already defined such an id, can I ask Solr to redefine id as UUID? <field name="id" type="uuid" indexed="true" stored="true" default="NEW"/>

- This leads to a related question: do Solr and Nutch have to have IDENTICAL schema.xml?
document id in nutch/solr
A quite standard configuration of Nutch seems to automatically map url to id. Two questions:

- Where is such mapping defined? I can't find it anywhere in nutch-site.xml or schema.xml. The latter does define the id field as well as its uniqueness, but not the mapping.

- Given that Nutch has already defined such an id, can I ask Solr to redefine id as UUID? <field name="id" type="uuid" indexed="true" stored="true" default="NEW"/>

- This leads to a related question: do Solr and Nutch have to have IDENTICAL schema.xml?
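[On the first question: in the Nutch 1.x of this era the field-name mapping lives in conf/solrindex-mapping.xml in the Nutch distribution, not in nutch-site.xml or in Solr's schema.xml, which is why it can't be found in either. A sketch of the file's shape; the entries below illustrate the format and are not an exact copy of the shipped file:]

```xml
<!-- conf/solrindex-mapping.xml (Nutch side): maps Nutch document fields
     to Solr field names before documents are pushed to Solr -->
<mapping>
  <fields>
    <field dest="title" source="title"/>
    <field dest="content" source="content"/>
    <field dest="url" source="url"/>
  </fields>
  <!-- the Solr uniqueKey field; in a stock setup the document key
       (the page URL) ends up here, which is the url-to-id mapping -->
  <uniqueKey>id</uniqueKey>
</mapping>
```

This also answers the last question: the two schemas need not be identical, but every field Nutch sends after mapping must exist (or match a dynamicField) in the Solr schema.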
spellchecking in nutch solr
Hello, I have tried to implement a spellchecker based on the index in Nutch-Solr by adding a spell field to schema.xml and making it a copy of the content field. However, this doubled the data folder size, and the spell field, as a copy of the content field, appears in the XML feed, which is not necessary. Is it possible to implement the spellchecker without this issue? Thanks. Alex.
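[One possible answer, sketched under the assumption that the doubling comes from the spell field being stored: making the copyField target unstored means the copy adds no stored data and the field is never returned in query responses, while spellcheck components only need the indexed terms. Field and type names below are assumptions:]

```xml
<!-- schema.xml sketch: an unstored spell field; the copy contributes
     indexed terms only, so it neither doubles the stored data nor
     shows up in the XML response feed -->
<field name="spell" type="text_general" indexed="true" stored="false" multiValued="true"/>
<copyField source="content" dest="spell"/>
```

The indexed terms still take some disk space, but far less than a second stored copy of every page's content.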
Assistance required fine-tuning nutch/solr - (paid work)
I require the expertise of a developer who can assist with fine-tuning my Nutch/Solr setup. I have the basics working, but I think I probably need a custom Nutch plugin written. If you're interested, please contact me: jeanluct [at] gmail . com

Hope it's ok to post this here - I'm not a recruiter.

Jean-Luc
Nutch/Solr
I tried to combine Nutch and Solr, and want to ask something. After crawling, Nutch has certain fields such as content, tstamp, title. How can I map the content field after crawling? Do I have to change the Lucene code (such as adding an extra field)? Or overcome this at the Solr stage? Any suggestions? Thx.

--
Yavuz Selim YILMAZ
Re: Nutch/Solr
Depends on your version of Nutch. At least trunk and 1.1 obey the solrindex-mapping.xml file in Nutch's configuration directory. I'd suggest you start with that mapping file and the Solr schema.xml file shipped with Nutch, as it exactly matches the mapping file. Just restart Solr with the new schema (or change the mapping), crawl, fetch, parse and update your DBs, and then push the index from Nutch to your Solr instance.

On Tuesday 07 September 2010 10:00:47 Yavuz Selim YILMAZ wrote:
I tried to combine Nutch and Solr, and want to ask something. After crawling, Nutch has certain fields such as content, tstamp, title. How can I map the content field after crawling? Do I have to change the Lucene code (such as adding an extra field)? Or overcome this at the Solr stage? Any suggestions? Thx.
--
Yavuz Selim YILMAZ

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
Re: Nutch/Solr
In fact, I used Nutch version 0.9, but I am thinking of moving to the new version. If anybody did something like that, I want to learn from their experience. When indexing an XML file, there are specific fields and all of them are dependent on each other, so duplicates don't happen. I want to extract specific fields from the content field. Doing such extraction, the new fields should be indexed as well, but it seems to me that the content would then be indexed twice for every new field. By the way, any details about how to get new fields from the content will be helpful.

--
Yavuz Selim YILMAZ

2010/9/7 Markus Jelsma markus.jel...@buyways.nl

Depends on your version of Nutch. At least trunk and 1.1 obey the solrindex-mapping.xml file in Nutch's configuration directory. I'd suggest you start with that mapping file and the Solr schema.xml file shipped with Nutch, as it exactly matches the mapping file. Just restart Solr with the new schema (or change the mapping), crawl, fetch, parse and update your DBs, and then push the index from Nutch to your Solr instance.

On Tuesday 07 September 2010 10:00:47 Yavuz Selim YILMAZ wrote:
I tried to combine Nutch and Solr, and want to ask something. After crawling, Nutch has certain fields such as content, tstamp, title. How can I map the content field after crawling? Do I have to change the Lucene code (such as adding an extra field)? Or overcome this at the Solr stage? Any suggestions? Thx.
--
Yavuz Selim YILMAZ

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
Re: Nutch/Solr
You should:
- definitely upgrade to 1.1 (1.2 is on the way), and
- subscribe to the Nutch mailing list for Nutch-specific questions.

On Tuesday 07 September 2010 10:36:58 Yavuz Selim YILMAZ wrote:
In fact, I used Nutch version 0.9, but I am thinking of moving to the new version. If anybody did something like that, I want to learn from their experience. When indexing an XML file, there are specific fields and all of them are dependent on each other, so duplicates don't happen. I want to extract specific fields from the content field. Doing such extraction, the new fields should be indexed as well, but it seems to me that the content would then be indexed twice for every new field. By the way, any details about how to get new fields from the content will be helpful.
--
Yavuz Selim YILMAZ

2010/9/7 Markus Jelsma markus.jel...@buyways.nl
Depends on your version of Nutch. At least trunk and 1.1 obey the solrindex-mapping.xml file in Nutch's configuration directory. I'd suggest you start with that mapping file and the Solr schema.xml file shipped with Nutch, as it exactly matches the mapping file. Just restart Solr with the new schema (or change the mapping), crawl, fetch, parse and update your DBs, and then push the index from Nutch to your Solr instance.

On Tuesday 07 September 2010 10:00:47 Yavuz Selim YILMAZ wrote:
I tried to combine Nutch and Solr, and want to ask something. After crawling, Nutch has certain fields such as content, tstamp, title. How can I map the content field after crawling? Do I have to change the Lucene code (such as adding an extra field)? Or overcome this at the Solr stage? Any suggestions? Thx.
--
Yavuz Selim YILMAZ

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
Re: Nutch - Solr latest?
: I'm curious, is there a spot / patch for the latest on Nutch / Solr
: integration? I've found a few pages (a few outdated it seems), it would be nice
: (?) if it worked as a DataSource type to DataImportHandler, but not sure if
: that fits w/ how it works. Either way a nice contrib patch the way the DIH is
: already set up would be nice to have.
...
: Is there currently work ongoing on this? Seems like it belongs in either / or
: project and not both.

My understanding is that previous work on bridging Nutch crawling with Solr indexing involved patching Nutch and using a Nutch-specific schema.xml and the client code which has since been committed as SolrJ. Most of the discussion seemed to take place on the Nutch list (which makes sense since Nutch required the patching), so you may want to start there.

I'm not sure if Nutch integration would make sense as a DIH plugin (it seems like the Nutch crawler could push the data much more easily than DIH could pull it from the crawler), but if there is any advantage to having plugin code running in Solr to support this, then that would absolutely make sense in the new /contrib area of Solr (that I believe Otis already created/committed), though any Nutch plugins or modifications would obviously need to be made in Nutch.

-Hoss
Nutch - Solr latest?
Hi, I'm curious, is there a spot / patch for the latest on Nutch / Solr integration? I've found a few pages (a few outdated it seems); it would be nice (?) if it worked as a DataSource type to DataImportHandler, but not sure if that fits w/ how it works. Either way a nice contrib patch the way the DIH is already set up would be nice to have. Is there currently work ongoing on this? Seems like it belongs in either / or project and not both. Thanks. - Jon