
Anton commented on NUTCH-1478:

Steps to reproduce:
1) Add fields for the metatags 
     <field name="metatag.description" type="string" stored="true" 
   to schema.xml in both Solr and Nutch
2) Restart Solr
3) Configure nutch-default.xml as in my comment above
4) Set up urls/seed.txt in Nutch
5) Run ant clean && ant runtime
6) Run the crawl command
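
The field definition in step 1 is cut off in this message. As a rough sketch only, a complete declaration in Solr's schema.xml could look like the following. The name, type and stored values are taken from step 1; the indexed and multiValued attributes are assumptions added for illustration and do not come from the report.

```xml
<!-- Sketch only: name, type and stored are from step 1 of the report;
     indexed and multiValued are assumed, not from the original message. -->
<field name="metatag.description" type="string" stored="true"
       indexed="true" multiValued="true"/>
```

multiValued is assumed here because the issue description says the plugin indexes multiple values of the same tag.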

I use solr-4.6.0 and apache-nutch-2.2.1.

When I run a full crawl with this command

urls/seed.txt az http://localhost:8088/solr/ 1

the metadata is successfully parsed and stored in the database; the problem occurs in the Solr indexing step:
14/02/04 13:00:46 INFO solr.SolrIndexerJob: SolrIndexerJob: starting
14/02/04 13:00:46 INFO plugin.PluginRepository: Plugins: looking in: 
14/02/04 13:00:46 INFO plugin.PluginRepository: Plugin Auto-activation mode: 
14/02/04 13:00:46 INFO plugin.PluginRepository: Registered Plugins:
14/02/04 13:00:46 INFO plugin.PluginRepository:         the nutch core 
extension points (nutch-extensionpoints)
14/02/04 13:00:46 INFO plugin.PluginRepository:         Basic URL Normalizer 
14/02/04 13:00:46 INFO plugin.PluginRepository:         Html Parse Plug-in 
14/02/04 13:00:46 INFO plugin.PluginRepository:         Basic Indexing Filter 
14/02/04 13:00:46 INFO plugin.PluginRepository:         HTTP Framework 
14/02/04 13:00:46 INFO plugin.PluginRepository:         Pass-through URL 
Normalizer (urlnormalizer-pass)
14/02/04 13:00:46 INFO plugin.PluginRepository:         Regex URL Filter 
14/02/04 13:00:46 INFO plugin.PluginRepository:         Http Protocol Plug-in 
14/02/04 13:00:46 INFO plugin.PluginRepository:         Regex URL Normalizer 
14/02/04 13:00:46 INFO plugin.PluginRepository:         Tika Parser Plug-in 
14/02/04 13:00:46 INFO plugin.PluginRepository:         OPIC Scoring Plug-in 
14/02/04 13:00:46 INFO plugin.PluginRepository:         CyberNeko HTML Parser 
14/02/04 13:00:46 INFO plugin.PluginRepository:         Anchor Indexing Filter 
14/02/04 13:00:46 INFO plugin.PluginRepository:         Regex URL Filter 
Framework (lib-regex-filter)
14/02/04 13:00:46 INFO plugin.PluginRepository:         MetaTags 
14/02/04 13:00:46 INFO plugin.PluginRepository:         Index Metadata 
14/02/04 13:00:46 INFO plugin.PluginRepository: Registered Extension-Points:
14/02/04 13:00:46 INFO plugin.PluginRepository:         Nutch URL Normalizer 
14/02/04 13:00:46 INFO plugin.PluginRepository:         Nutch Protocol 
14/02/04 13:00:46 INFO plugin.PluginRepository:         Parse Filter 
14/02/04 13:00:46 INFO plugin.PluginRepository:         Nutch URL Filter 
14/02/04 13:00:46 INFO plugin.PluginRepository:         Nutch Indexing Filter 
14/02/04 13:00:46 INFO plugin.PluginRepository:         Nutch Content Parser 
14/02/04 13:00:46 INFO plugin.PluginRepository:         Nutch Scoring 
14/02/04 13:00:46 INFO basic.BasicIndexingFilter: Maximum title length for 
indexing set to: 100
14/02/04 13:00:46 INFO indexer.IndexingFilters: Adding 
14/02/04 13:00:46 INFO anchor.AnchorIndexingFilter: Anchor deduplication is: off
14/02/04 13:00:46 INFO indexer.IndexingFilters: Adding 
14/02/04 13:00:46 INFO indexer.IndexingFilters: Adding 
14/02/04 13:00:46 INFO zookeeper.ZooKeeper: Client 
environment:zookeeper.version=3.3.2-1031432, built on 11/05/2010 05:32 GMT
14/02/04 13:00:46 INFO zookeeper.ZooKeeper: Client 
14/02/04 13:00:46 INFO zookeeper.ZooKeeper: Client 
14/02/04 13:00:46 INFO zookeeper.ZooKeeper: Client 
environment:java.vendor=Oracle Corporation
14/02/04 13:00:46 INFO zookeeper.ZooKeeper: Client 
14/02/04 13:00:46 INFO zookeeper.ZooKeeper: Client 
14/02/04 13:00:46 INFO zookeeper.ZooKeeper: Client 
14/02/04 13:00:46 INFO zookeeper.ZooKeeper: Client 
14/02/04 13:00:46 INFO zookeeper.ZooKeeper: Client 
14/02/04 13:00:46 INFO zookeeper.ZooKeeper: Client environment:os.name=Linux
14/02/04 13:00:46 INFO zookeeper.ZooKeeper: Client environment:os.arch=amd64
14/02/04 13:00:46 INFO zookeeper.ZooKeeper: Client 
14/02/04 13:00:46 INFO zookeeper.ZooKeeper: Client environment:user.name=hadoop
14/02/04 13:00:46 INFO zookeeper.ZooKeeper: Client 
14/02/04 13:00:46 INFO zookeeper.ZooKeeper: Client 
14/02/04 13:00:46 INFO zookeeper.ZooKeeper: Initiating client connection, 
connectString=localhost:2181 sessionTimeout=180000 watcher=hconnection
14/02/04 13:00:46 INFO zookeeper.ClientCnxn: Opening socket connection to 
server localhost/
14/02/04 13:00:46 INFO zookeeper.ClientCnxn: Socket connection established to 
localhost/, initiating session
14/02/04 13:00:46 INFO zookeeper.ClientCnxn: Session establishment complete on 
server localhost/, sessionid = 0x142ea7be01213d1, negotiated 
timeout = 180000
14/02/04 13:00:46 INFO store.HBaseStore: Keyclass and nameclass match but 
mismatching table names  mappingfile schema is 'webpage' vs actual schema 
'az_webpage' , assuming they are the same.
14/02/04 13:00:46 INFO util.NativeCodeLoader: Loaded the native-hadoop library
14/02/04 13:00:46 INFO zookeeper.ZooKeeper: Initiating client connection, 
connectString=localhost:2181 sessionTimeout=180000 watcher=hconnection
14/02/04 13:00:46 INFO zookeeper.ClientCnxn: Opening socket connection to 
server localhost/
14/02/04 13:00:46 INFO zookeeper.ClientCnxn: Socket connection established to 
localhost/, initiating session
14/02/04 13:00:46 INFO zookeeper.ClientCnxn: Session establishment complete on 
server localhost/, sessionid = 0x142ea7be01213d2, negotiated 
timeout = 180000
14/02/04 13:00:46 INFO store.HBaseStore: Keyclass and nameclass match but 
mismatching table names  mappingfile schema is 'webpage' vs actual schema 
'az_webpage' , assuming they are the same.
14/02/04 13:00:46 INFO zookeeper.ZooKeeper: Initiating client connection, 
connectString=localhost:2181 sessionTimeout=180000 watcher=hconnection
14/02/04 13:00:46 INFO zookeeper.ClientCnxn: Opening socket connection to 
server localhost/
14/02/04 13:00:46 INFO zookeeper.ClientCnxn: Socket connection established to 
localhost/, initiating session
14/02/04 13:00:46 INFO zookeeper.ClientCnxn: Session establishment complete on 
server localhost/, sessionid = 0x142ea7be01213d3, negotiated 
timeout = 180000
14/02/04 13:00:47 INFO mapred.JobClient: Running job: job_local1932930342_0001
14/02/04 13:00:47 INFO mapred.LocalJobRunner: Waiting for map tasks
14/02/04 13:00:47 INFO mapred.LocalJobRunner: Starting task: 
14/02/04 13:00:47 INFO util.ProcessTree: setsid exited with exit code 0
14/02/04 13:00:47 INFO mapred.Task:  Using ResourceCalculatorPlugin : 
14/02/04 13:00:47 INFO zookeeper.ZooKeeper: Initiating client connection, 
connectString=localhost:2181 sessionTimeout=180000 watcher=hconnection
14/02/04 13:00:47 INFO zookeeper.ClientCnxn: Opening socket connection to 
server localhost/
14/02/04 13:00:47 INFO zookeeper.ClientCnxn: Socket connection established to 
localhost/, initiating session
14/02/04 13:00:47 INFO zookeeper.ClientCnxn: Session establishment complete on 
server localhost/, sessionid = 0x142ea7be01213d4, negotiated 
timeout = 180000
14/02/04 13:00:47 INFO store.HBaseStore: Keyclass and nameclass match but 
mismatching table names  mappingfile schema is 'webpage' vs actual schema 
'az_webpage' , assuming they are the same.
14/02/04 13:00:47 INFO zookeeper.ZooKeeper: Initiating client connection, 
connectString=localhost:2181 sessionTimeout=180000 watcher=hconnection
14/02/04 13:00:47 INFO zookeeper.ClientCnxn: Opening socket connection to 
server localhost/
14/02/04 13:00:47 INFO zookeeper.ClientCnxn: Socket connection established to 
localhost/, initiating session
14/02/04 13:00:47 INFO zookeeper.ClientCnxn: Session establishment complete on 
server localhost/, sessionid = 0x142ea7be01213d5, negotiated 
timeout = 180000
14/02/04 13:00:47 INFO store.HBaseStore: Keyclass and nameclass match but 
mismatching table names  mappingfile schema is 'webpage' vs actual schema 
'az_webpage' , assuming they are the same.
14/02/04 13:00:47 INFO zookeeper.ZooKeeper: Initiating client connection, 
connectString=localhost:2181 sessionTimeout=180000 watcher=hconnection
14/02/04 13:00:47 INFO zookeeper.ClientCnxn: Opening socket connection to 
server localhost/
14/02/04 13:00:47 INFO zookeeper.ClientCnxn: Socket connection established to 
localhost/, initiating session
14/02/04 13:00:47 INFO zookeeper.ClientCnxn: Session establishment complete on 
server localhost/, sessionid = 0x142ea7be01213d6, negotiated 
timeout = 180000
14/02/04 13:00:47 INFO store.HBaseStore: Keyclass and nameclass match but 
mismatching table names  mappingfile schema is 'webpage' vs actual schema 
'az_webpage' , assuming they are the same.
14/02/04 13:00:47 INFO mapred.MapTask: Processing split: 
14/02/04 13:00:47 INFO mapreduce.GoraRecordReader: gora.buffer.read.limit = 
14/02/04 13:00:47 INFO solr.SolrIndexerJob: Authenticating as: solr-user
14/02/04 13:00:47 INFO conf.Configuration: found resource solrindex-mapping.xml 
at file:/home/hadoop/data/hadoop-unjar8289682370547831088/solrindex-mapping.xml
14/02/04 13:00:47 INFO solr.SolrMappingReader: source: content dest: content
14/02/04 13:00:47 INFO solr.SolrMappingReader: source: title dest: title
14/02/04 13:00:47 INFO solr.SolrMappingReader: source: host dest: host
14/02/04 13:00:47 INFO solr.SolrMappingReader: source: batchId dest: batchId
14/02/04 13:00:47 INFO solr.SolrMappingReader: source: boost dest: boost
14/02/04 13:00:47 INFO solr.SolrMappingReader: source: digest dest: digest
14/02/04 13:00:47 INFO solr.SolrMappingReader: source: tstamp dest: tstamp
14/02/04 13:00:47 INFO basic.BasicIndexingFilter: Maximum title length for 
indexing set to: 100
14/02/04 13:00:47 INFO indexer.IndexingFilters: Adding 
14/02/04 13:00:47 INFO anchor.AnchorIndexingFilter: Anchor deduplication is: off
14/02/04 13:00:47 INFO indexer.IndexingFilters: Adding 
14/02/04 13:00:47 INFO indexer.IndexingFilters: Adding 
14/02/04 13:00:47 INFO zookeeper.ZooKeeper: Initiating client connection, 
connectString=localhost:2181 sessionTimeout=180000 watcher=hconnection
14/02/04 13:00:47 INFO zookeeper.ClientCnxn: Opening socket connection to 
server localhost/
14/02/04 13:00:47 INFO zookeeper.ClientCnxn: Socket connection established to 
localhost/, initiating session
14/02/04 13:00:47 INFO zookeeper.ClientCnxn: Session establishment complete on 
server localhost/, sessionid = 0x142ea7be01213d7, negotiated 
timeout = 180000
14/02/04 13:00:47 INFO store.HBaseStore: Keyclass and nameclass match but 
mismatching table names  mappingfile schema is 'webpage' vs actual schema 
'az_webpage' , assuming they are the same.
14/02/04 13:00:47 INFO zookeeper.ZooKeeper: Initiating client connection, 
connectString=localhost:2181 sessionTimeout=180000 watcher=hconnection
14/02/04 13:00:47 INFO zookeeper.ClientCnxn: Opening socket connection to 
server localhost/
14/02/04 13:00:47 INFO zookeeper.ClientCnxn: Socket connection established to 
localhost/, initiating session
14/02/04 13:00:47 INFO zookeeper.ClientCnxn: Session establishment complete on 
server localhost/, sessionid = 0x142ea7be01213d8, negotiated 
timeout = 180000
14/02/04 13:00:47 INFO mapred.LocalJobRunner: Map task executor complete.
14/02/04 13:00:47 WARN mapred.FileOutputCommitter: Output path is null in 
14/02/04 13:00:47 WARN mapred.LocalJobRunner: job_local1932930342_0001
java.lang.Exception: java.lang.NullPointerException
Caused by: java.lang.NullPointerException
        at org.apache.nutch.indexer.IndexUtil.index(IndexUtil.java:77)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.lang.Thread.run(Thread.java:744)
14/02/04 13:00:48 INFO mapred.JobClient:  map 0% reduce 0%
14/02/04 13:00:48 INFO mapred.JobClient: Job complete: job_local1932930342_0001
14/02/04 13:00:48 INFO mapred.JobClient: Counters: 0
14/02/04 13:00:48 ERROR solr.SolrIndexerJob: SolrIndexerJob: 
java.lang.RuntimeException: job failed: name=[az]solr-index, 
        at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:54)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:160)

> Parse-metatags and index-metadata plugin for Nutch 2.x series 
> --------------------------------------------------------------
>                 Key: NUTCH-1478
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1478
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 2.1
>            Reporter: kiran
>             Fix For: 2.3
>         Attachments: NUTCH-1478-parse-v2.patch, NUTCH-1478v3.patch, 
> NUTCH-1478v4.patch, Nutch1478.patch, Nutch1478.zip, 
> metadata_parseChecker_sites.png
> I have ported the parse-metatags and index-metadata plugins to the Nutch 2.x 
> series. The port takes multiple values of the same tag and indexes them in 
> Solr, as in my earlier patch (https://issues.apache.org/jira/browse/NUTCH-1467).
> The usage is the same as described here 
> (http://wiki.apache.org/nutch/IndexMetatags), with one change: there is 
> no need to put the 'metatag' keyword before metatag names. For example, my 
> configuration looks like this 
> (https://github.com/salvager/NutchDev/blob/master/runtime/local/conf/nutch-site.xml)
> This is only the first version and does not include the JUnit test. I will 
> upload an updated version soon.
> This will parse the tags and index them in Solr. Make sure that the fields 
> listed in 'index.parse.md' in nutch-site.xml are also created in schema.xml 
> in Solr.
> Please let me know if you have any suggestions.
> This is supported by DLA (Digital Library and Archives) of Virginia Tech.

This message was sent by Atlassian JIRA
