Indexing HTML Metatags Nutch - SOLR
Hello, I have been trying this for several days without success (Nutch 1.16, Solr 7.3.1). I have followed this description: https://cwiki.apache.org/confluence/display/nutch/IndexMetatags Below I put my nutch-site.xml. I created the core following this description: https://cwiki.apache.org/confluence/display/nutch/NutchTutorial/ By the way, without the metatags everything works fine. Before creating the core I deleted the managed-schema file and inserted my metatag fields into schema.xml in the configsets directory of the core.

First question: after creating the core I see a managed-schema file and a schema.xml.bak file in the conf directory of the core. Sorry, I am new to this, but I believe I do not want a managed-schema? (See description above.)

Anyway, when I run the crawl all is OK until the index is created. Then I end up with this error:

org.apache.solr.common.SolrException: copyField dest :'metatag.SITdescription_str' is not an explicit field and doesn't match a dynamicField.
        at org.apache.solr.schema.IndexSchema.registerCopyField(IndexSchema.java:902)
        at org.apache.solr.schema.ManagedIndexSchema.addCopyFields(ManagedIndexSchema.java:784)

There is no copyField instruction for metatag.SITdescription in managed-schema. I even created a field "metatag.SITdescription_str" in managed-schema, which did not help.

Can you help me please?

Best regards
Martin

nutch-site.xml:

<property>
  <name>http.agent.name</name>
  <value>SIT_NUTCH_SPIDER</value>
</property>

<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>If true, outlinks leading from a page to external hosts will be ignored. This is an effective way to limit the crawl to include only initially injected hosts, without creating complex URLFilters.</description>
</property>

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-(regex|validator)|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to include. Any plugin not matching this expression is excluded. By default Nutch includes plugins to crawl HTML and various other document formats via HTTP/HTTPS and indexing the crawled content into Solr. More plugins are available to support more indexing backends, to fetch ftp:// and file:// URLs, for focused crawling, and many other use cases.</description>
</property>

<property>
  <name>http.robot.rules.whitelist</name>
  <value>sitlux02.sit.de</value>
  <description>Comma separated list of hostnames or IP addresses to ignore robot rules parsing for.</description>
</property>

<property>
  <name>metatags.names</name>
  <value>SITdescription,SITkeywords,SITcategory,SITintern</value>
  <description>Names of the metatags to extract, separated by ','. Use '*' to extract all metatags. Prefixes the names with 'metatag.' in the parse-metadata. For instance to index description and keywords, you need to activate the plugin index-metadata and set the value of the parameter 'index.parse.md' to 'metatag.description,metatag.keywords'.</description>
</property>

<property>
  <name>index.parse.md</name>
  <value>metatag.SITdescription,metatag.SITkeywords,metatag.SITcategory,metatag.SITintern</value>
  <description>Comma-separated list of keys to be taken from the parse metadata to generate fields. Can be used e.g. for 'description' or 'keywords' provided that these values are generated by a parser (see parse-metatags plugin).</description>
</property>

<property>
  <name>index.metadata</name>
  <value>metatag.SITdescription,metatag.SITkeywords,metatag.SITcategory,metatag.SITintern</value>
  <description>Comma-separated list of keys to be taken from the metadata to generate fields. Can be used e.g. for 'description' or 'keywords' provided that these values are generated by a parser (see parse-metatags plugin), and property 'metatags.names'.</description>
</property>

--
Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html
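[A possible direction for this error, sketched under the following assumptions: the core uses Solr's default configset conventions, and the field types `string` and `text_general` exist in the schema. The copyField failure usually means the `*_str` target that Solr's add-schema-fields update processor generates has no matching field or dynamicField. Declaring the metatag fields plus a `*_str` dynamicField explicitly, and switching solrconfig.xml to the classic schema factory so Solr reads schema.xml instead of managed-schema, is one way to address both questions; field types shown are assumptions, not taken from the original post.]

```xml
<!-- solrconfig.xml: make Solr use schema.xml instead of managed-schema -->
<schemaFactory class="ClassicIndexSchemaFactory"/>

<!-- schema.xml: explicit fields for the Nutch metatags (types assumed) -->
<field name="metatag.SITdescription" type="text_general" indexed="true" stored="true"/>
<field name="metatag.SITkeywords"    type="text_general" indexed="true" stored="true"/>
<field name="metatag.SITcategory"    type="string"       indexed="true" stored="true"/>
<field name="metatag.SITintern"      type="string"       indexed="true" stored="true"/>

<!-- Catch-all so generated copyField targets like
     'metatag.SITdescription_str' resolve to something -->
<dynamicField name="*_str" type="string" indexed="true" stored="false" multiValued="true"/>
```

Alternatively, keeping the managed schema but removing the add-schema-fields update processor from the default update chain in solrconfig.xml should stop Solr from generating those `_str` copyFields in the first place.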
Re: Nutch+Solr
This is solved. Nutch 1.15 has an index-writers.xml file in which we can pass the username/password for indexing to Solr.
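[For readers landing on this thread: a minimal sketch of the relevant part of Nutch 1.15's conf/index-writers.xml. The parameter names below (`auth`, `username`, `password`, etc.) are given as I recall them; verify them against the index-writers.xml shipped with your Nutch distribution, and the URL/credentials are placeholders.]

```xml
<writer id="indexer_solr_1" class="org.apache.nutch.indexwriter.solr.SolrIndexWriter">
  <parameters>
    <param name="type" value="http"/>
    <!-- placeholder URL: point at your own core/collection -->
    <param name="url" value="http://localhost:8983/solr/nutch"/>
    <param name="commitSize" value="1000"/>
    <!-- basic-auth credentials, replacing the old solr.auth.* properties -->
    <param name="auth" value="true"/>
    <param name="username" value="solr"/>
    <param name="password" value="SolrRocks"/>
  </parameters>
</writer>
```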
Re: Nutch+Solr
Bineesh, I don't use Nutch, so I don't know if this is relevant, but I've had similar-sounding failures in doing and restoring backups. The solution for me was to deactivate authentication while the backup was being done, and then activate it again afterwards. Then everything was restored correctly. Otherwise, I got a whole bunch of errors (if I left authentication active when doing the backup). Terry

On 10/03/2018 10:21 AM, Bineesh wrote:
> Hello,
>
> We use Solr 7.3.1 and Nutch 1.15
>
> We've placed the authentication for our solr cloud setup using the basic
> auth plugin (login details -> solr/SolrRocks)
>
> For Nutch to index data to Solr, the below properties were added to the
> nutch-site.xml file:
>
> <property>
>   <name>solr.auth</name>
>   <value>true</value>
>   <description>Whether to enable HTTP basic authentication for
>   communicating with Solr. Use the solr.auth.username and
>   solr.auth.password properties to configure your credentials.</description>
> </property>
>
> <property>
>   <name>solr.auth.username</name>
>   <value>solr</value>
>   <description>Username</description>
> </property>
>
> <property>
>   <name>solr.auth.password</name>
>   <value>SolrRocks</value>
>   <description>Password</description>
> </property>
>
> While Nutch indexes data to Solr, it's failing due to authentication. Am I
> doing something wrong? Pls help
Nutch+Solr
Hello,

We use Solr 7.3.1 and Nutch 1.15.

We've placed the authentication for our solr cloud setup using the basic auth plugin (login details -> solr/SolrRocks).

For Nutch to index data to Solr, the below properties were added to the nutch-site.xml file:

<property>
  <name>solr.auth</name>
  <value>true</value>
  <description>Whether to enable HTTP basic authentication for communicating with Solr. Use the solr.auth.username and solr.auth.password properties to configure your credentials.</description>
</property>

<property>
  <name>solr.auth.username</name>
  <value>solr</value>
  <description>Username</description>
</property>

<property>
  <name>solr.auth.password</name>
  <value>SolrRocks</value>
  <description>Password</description>
</property>

While Nutch indexes data to Solr, it's failing due to authentication. Am I doing something wrong? Pls help
Nutch + Solr - Indexer causes java.lang.OutOfMemoryError: Java heap space
Hello everyone, I have configured my 2 servers to run in distributed mode (with Hadoop), and my configuration for the crawling process is Nutch 2.2.1, HBase (as storage) and Solr. Solr is run by Tomcat. The problem occurs every time I try the last step, i.e. when I want to index data from HBase into Solr: it fails with error *[1]*. I tried to add CATALINA_OPTS (or JAVA_OPTS) like this:

CATALINA_OPTS="$JAVA_OPTS -XX:+UseConcMarkSweepGC -Xms1g -Xmx6000m -XX:MinHeapFreeRatio=10 -XX:MaxHeapFreeRatio=30 -XX:MaxPermSize=512m -XX:+CMSClassUnloadingEnabled"

to Tomcat's catalina.sh script and ran the server with this script, but it didn't help. I also added the properties in *[2]* to the nutch-site.xml file, but it ended up with OutOfMemory again. Can you help me please?

*[1]*
2014-09-06 22:52:50,683 FATAL org.apache.hadoop.mapred.Child: Error running child : java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:2367)
        at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
        at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:114)
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:587)
        at java.lang.StringBuffer.append(StringBuffer.java:332)
        at java.io.StringWriter.write(StringWriter.java:77)
        at org.apache.solr.common.util.XML.escape(XML.java:204)
        at org.apache.solr.common.util.XML.escapeCharData(XML.java:77)
        at org.apache.solr.common.util.XML.writeXML(XML.java:147)
        at org.apache.solr.client.solrj.util.ClientUtils.writeVal(ClientUtils.java:161)
        at org.apache.solr.client.solrj.util.ClientUtils.writeXML(ClientUtils.java:129)
        at org.apache.solr.client.solrj.request.UpdateRequest.writeXML(UpdateRequest.java:355)
        at org.apache.solr.client.solrj.request.UpdateRequest.getXML(UpdateRequest.java:271)
        at org.apache.solr.client.solrj.request.RequestWriter.getContentStream(RequestWriter.java:66)
        at org.apache.solr.client.solrj.request.RequestWriter$LazyContentStream.getDelegate(RequestWriter.java:94)
        at org.apache.solr.client.solrj.request.RequestWriter$LazyContentStream.getName(RequestWriter.java:104)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:247)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:197)
        at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
        at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
        at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
        at org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:96)
        at org.apache.nutch.indexer.IndexWriters.close(IndexWriters.java:117)
        at org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:54)
        at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.close(MapTask.java:650)
        at org.apache.hadoop.mapred.MapTask.closeQuietly(MapTask.java:1793)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:779)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)

*[2]*
<property>
  <name>http.content.limit</name>
  <value>15000</value>
  <description>The length limit for downloaded content using the http protocol, in bytes. If this value is nonnegative (>=0), content longer than it will be truncated; otherwise, no truncation at all. Do not confuse this setting with the file.content.limit setting. For our purposes it is twice bigger than default - parsing big pages: 128 * 1024</description>
</property>

<property>
  <name>indexer.max.tokens</name>
  <value>10</value>
</property>

<property>
  <name>http.timeout</name>
  <value>5</value>
  <description>The default network timeout, in milliseconds.</description>
</property>

<property>
  <name>solr.commit.size</name>
  <value>100</value>
  <description>Defines the number of documents to send to Solr in a single update batch. Decrease when handling very large documents to prevent Nutch from running out of memory. NOTE: It does not explicitly trigger a server side commit.</description>
</property>

--
View this message in context: http://lucene.472066.n3.nabble.com/Nutch-Solr-Indexer-causes-java-lang-OutOfMemoryError-Java-heap-space-tp4157308.html
Sent from the Solr - User mailing list archive at Nabble.com.
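[Worth noting for anyone hitting this: the trace in *[1]* comes from org.apache.hadoop.mapred.Child, i.e. the Hadoop task JVM running the Nutch indexer, so CATALINA_OPTS on the Tomcat side cannot affect it. A sketch of the setting that does apply, for the Hadoop 1.x generation matching this stack trace; the heap value is an assumption to be sized to your nodes:]

```xml
<!-- mapred-site.xml: heap for the map/reduce child JVMs, which is where
     the Nutch indexer (and this OutOfMemoryError) actually runs -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx2g</value>
</property>
```

Combining a larger child heap with a smaller solr.commit.size (so each XML update batch Nutch buffers is smaller) is the usual way to get past this.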
Re: document id in nutch/solr
Another way of overriding Nutch fields is to modify the solrindex-mapping.xml file.

hth
Alex.

-----Original Message-----
From: Jack Krupansky j...@basetechnology.com
To: solr-user solr-user@lucene.apache.org
Sent: Sun, Jun 23, 2013 12:04 pm
Subject: Re: document id in nutch/solr

Add the passthrough dynamic field to your Solr schema, and then see what fields get passed through to Solr from Nutch. Then, add the missing fields to your Solr schema and remove the passthrough.

<dynamicField name="*" type="string" indexed="true" stored="true" multiValued="true"/>

Or, add Solr copyField directives to place fields in existing named fields. Or... talk to the Nutch people about how to do field name mapping on the Nutch side of the fence. Hold off on UUIDs until you figure all of the above out and everything is working without them.

-- Jack Krupansky

-----Original Message-----
From: Joe Zhang
Sent: Sunday, June 23, 2013 2:35 PM
To: solr-user@lucene.apache.org
Subject: Re: document id in nutch/solr

Can somebody help with this one, please?

On Fri, Jun 21, 2013 at 10:36 PM, Joe Zhang smartag...@gmail.com wrote:

A quite standard configuration of Nutch seems to automatically map url to id. Two questions:

- Where is such mapping defined? I can't find it anywhere in nutch-site.xml or schema.xml. The latter does define the id field as well as its uniqueness, but not the mapping.

- Given that Nutch has already defined such an id, can I ask Solr to redefine id as UUID? <field name="id" type="uuid" indexed="true" stored="true" default="NEW"/>

- This leads to a related question: do Solr and Nutch have to have IDENTICAL schema.xml?
Re: document id in nutch/solr
Can somebody help with this one, please?

On Fri, Jun 21, 2013 at 10:36 PM, Joe Zhang smartag...@gmail.com wrote:

A quite standard configuration of Nutch seems to automatically map url to id. Two questions:

- Where is such mapping defined? I can't find it anywhere in nutch-site.xml or schema.xml. The latter does define the id field as well as its uniqueness, but not the mapping.

- Given that Nutch has already defined such an id, can I ask Solr to redefine id as UUID? <field name="id" type="uuid" indexed="true" stored="true" default="NEW"/>

- This leads to a related question: do Solr and Nutch have to have IDENTICAL schema.xml?
Re: document id in nutch/solr
Add the passthrough dynamic field to your Solr schema, and then see what fields get passed through to Solr from Nutch. Then, add the missing fields to your Solr schema and remove the passthrough.

<dynamicField name="*" type="string" indexed="true" stored="true" multiValued="true"/>

Or, add Solr copyField directives to place fields in existing named fields. Or... talk to the Nutch people about how to do field name mapping on the Nutch side of the fence. Hold off on UUIDs until you figure all of the above out and everything is working without them.

-- Jack Krupansky

-----Original Message-----
From: Joe Zhang
Sent: Sunday, June 23, 2013 2:35 PM
To: solr-user@lucene.apache.org
Subject: Re: document id in nutch/solr

Can somebody help with this one, please?

On Fri, Jun 21, 2013 at 10:36 PM, Joe Zhang smartag...@gmail.com wrote:

A quite standard configuration of Nutch seems to automatically map url to id. Two questions:

- Where is such mapping defined? I can't find it anywhere in nutch-site.xml or schema.xml. The latter does define the id field as well as its uniqueness, but not the mapping.

- Given that Nutch has already defined such an id, can I ask Solr to redefine id as UUID? <field name="id" type="uuid" indexed="true" stored="true" default="NEW"/>

- This leads to a related question: do Solr and Nutch have to have IDENTICAL schema.xml?
document id in nutch/solr
A quite standard configuration of Nutch seems to automatically map url to id. Two questions:

- Where is such mapping defined? I can't find it anywhere in nutch-site.xml or schema.xml. The latter does define the id field as well as its uniqueness, but not the mapping.

- Given that Nutch has already defined such an id, can I ask Solr to redefine id as UUID? <field name="id" type="uuid" indexed="true" stored="true" default="NEW"/>

- This leads to a related question: do Solr and Nutch have to have IDENTICAL schema.xml?
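[On the first question: in the Nutch 1.x of this era the field-name mapping lives in conf/solrindex-mapping.xml in the Nutch distribution, not in nutch-site.xml or in Solr's schema.xml, which is why it can't be found in either. A sketch of the file's shape; the entries below illustrate the format and are not an exact copy of the shipped file:]

```xml
<!-- conf/solrindex-mapping.xml (Nutch side): maps Nutch document fields
     to Solr field names before documents are pushed to Solr -->
<mapping>
  <fields>
    <field dest="title" source="title"/>
    <field dest="content" source="content"/>
    <field dest="url" source="url"/>
  </fields>
  <!-- the Solr uniqueKey field; in a stock setup the document key
       (the page URL) ends up here, which is the url-to-id mapping -->
  <uniqueKey>id</uniqueKey>
</mapping>
```

This also answers the last question: the two schemas need not be identical, but every field Nutch sends after mapping must exist (or match a dynamicField) in the Solr schema.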
spellchecking in nutch solr
Hello, I have tried to implement a spellchecker based on the index in Nutch-Solr by adding a spell field to schema.xml and making it a copy of the content field. However, this doubled the data folder size, and the spell field, as a copy of the content field, appears in the XML feed, which is not necessary. Is it possible to implement the spellchecker without this issue? Thanks. Alex.
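[One possible answer, sketched under the assumption that the doubling comes from the spell field being stored: making the copyField target unstored means the copy adds no stored data and the field is never returned in query responses, while spellcheck components only need the indexed terms. Field and type names below are assumptions:]

```xml
<!-- schema.xml sketch: an unstored spell field; the copy contributes
     indexed terms only, so it neither doubles the stored data nor
     shows up in the XML response feed -->
<field name="spell" type="text_general" indexed="true" stored="false" multiValued="true"/>
<copyField source="content" dest="spell"/>
```

The indexed terms still take some disk space, but far less than a second stored copy of every page's content.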
Assistance required fine-tuning nutch/solr - (paid work)
I require the expertise of a developer who can assist with fine-tuning my Nutch/Solr setup. I have the basics working, but I think I probably need a custom Nutch plugin written. If you're interested, please contact me: jeanluct [at] gmail . com

Hope it's ok to post this here - I'm not a recruiter.

Jean-Luc
Nutch/Solr
I tried to combine Nutch and Solr, and want to ask something. After crawling, Nutch has certain fields such as content, tstamp, title. How can I map the content field after crawling? Do I have to change the Lucene code (such as adding an extra field)? Or overcome this at the Solr stage? Any suggestions? Thx.

--
Yavuz Selim YILMAZ
Re: Nutch/Solr
Depends on your version of Nutch. At least trunk and 1.1 obey the solrindex-mapping.xml file in Nutch's configuration directory. I'd suggest you start with that mapping file and the Solr schema.xml file shipped with Nutch, as it exactly matches the mapping file. Just restart Solr with the new schema (or change the mapping), crawl, fetch, parse and update your DBs, and then push the index from Nutch to your Solr instance.

On Tuesday 07 September 2010 10:00:47 Yavuz Selim YILMAZ wrote:
I tried to combine Nutch and Solr, and want to ask something. After crawling, Nutch has certain fields such as content, tstamp, title. How can I map the content field after crawling? Do I have to change the Lucene code (such as adding an extra field)? Or overcome this at the Solr stage? Any suggestions? Thx.
--
Yavuz Selim YILMAZ

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
Re: Nutch/Solr
In fact, I used Nutch version 0.9, but I am thinking of moving to the new version. If anybody did something like that, I want to learn from their experience. When indexing an XML file, there are specific fields and all of them are dependent on each other, so duplicates don't happen. I want to extract specific fields from the content field. Doing such extraction, the new fields should be indexed as well, but it seems to me that the content would then be indexed twice for every new field. By the way, any details about how to get new fields from the content will be helpful.

--
Yavuz Selim YILMAZ

2010/9/7 Markus Jelsma markus.jel...@buyways.nl

Depends on your version of Nutch. At least trunk and 1.1 obey the solrindex-mapping.xml file in Nutch's configuration directory. I'd suggest you start with that mapping file and the Solr schema.xml file shipped with Nutch, as it exactly matches the mapping file. Just restart Solr with the new schema (or change the mapping), crawl, fetch, parse and update your DBs, and then push the index from Nutch to your Solr instance.

On Tuesday 07 September 2010 10:00:47 Yavuz Selim YILMAZ wrote:
I tried to combine Nutch and Solr, and want to ask something. After crawling, Nutch has certain fields such as content, tstamp, title. How can I map the content field after crawling? Do I have to change the Lucene code (such as adding an extra field)? Or overcome this at the Solr stage? Any suggestions? Thx.
--
Yavuz Selim YILMAZ

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
Re: Nutch/Solr
You should:
- definitely upgrade to 1.1 (1.2 is on the way), and
- subscribe to the Nutch mailing list for Nutch-specific questions.

On Tuesday 07 September 2010 10:36:58 Yavuz Selim YILMAZ wrote:
In fact, I used Nutch version 0.9, but I am thinking of moving to the new version. If anybody did something like that, I want to learn from their experience. When indexing an XML file, there are specific fields and all of them are dependent on each other, so duplicates don't happen. I want to extract specific fields from the content field. Doing such extraction, the new fields should be indexed as well, but it seems to me that the content would then be indexed twice for every new field. By the way, any details about how to get new fields from the content will be helpful.
--
Yavuz Selim YILMAZ

2010/9/7 Markus Jelsma markus.jel...@buyways.nl
Depends on your version of Nutch. At least trunk and 1.1 obey the solrindex-mapping.xml file in Nutch's configuration directory. I'd suggest you start with that mapping file and the Solr schema.xml file shipped with Nutch, as it exactly matches the mapping file. Just restart Solr with the new schema (or change the mapping), crawl, fetch, parse and update your DBs, and then push the index from Nutch to your Solr instance.

On Tuesday 07 September 2010 10:00:47 Yavuz Selim YILMAZ wrote:
I tried to combine Nutch and Solr, and want to ask something. After crawling, Nutch has certain fields such as content, tstamp, title. How can I map the content field after crawling? Do I have to change the Lucene code (such as adding an extra field)? Or overcome this at the Solr stage? Any suggestions? Thx.
--
Yavuz Selim YILMAZ

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
Re: Nutch - Solr latest?
: I'm curious, is there a spot / patch for the latest on Nutch / Solr
: integration? I've found a few pages (a few outdated it seems), it would be nice
: (?) if it worked as a DataSource type to DataImportHandler, but not sure if
: that fits w/ how it works. Either way a nice contrib patch the way the DIH is
: already set up would be nice to have.
...
: Is there currently work ongoing on this? Seems like it belongs in either / or
: project and not both.

My understanding is that previous work on bridging Nutch crawling with Solr indexing involved patching Nutch and using a Nutch-specific schema.xml and the client code which has since been committed as SolrJ. Most of the discussion seemed to take place on the Nutch list (which makes sense since Nutch required the patching), so you may want to start there.

I'm not sure if Nutch integration would make sense as a DIH plugin (it seems like the Nutch crawler could push the data much more easily than DIH could pull it from the crawler), but if there is any advantage to having plugin code running in Solr to support this, then that would absolutely make sense in the new /contrib area of Solr (that I believe Otis already created/committed), though any Nutch plugins or modifications would obviously need to be made in Nutch.

-Hoss
Nutch - Solr latest?
Hi, I'm curious, is there a spot / patch for the latest on Nutch / Solr integration? I've found a few pages (a few outdated it seems); it would be nice (?) if it worked as a DataSource type to DataImportHandler, but not sure if that fits w/ how it works. Either way a nice contrib patch the way the DIH is already set up would be nice to have. Is there currently work ongoing on this? Seems like it belongs in either / or project and not both. Thanks. - Jon