[jira] [Commented] (NUTCH-2270) Solr indexer Failed i
[ https://issues.apache.org/jira/browse/NUTCH-2270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15318973#comment-15318973 ] kaveh minooie commented on NUTCH-2270: -- also a duplicate of NUTCH-2267 > Solr indexer Failed i > - > > Key: NUTCH-2270 > URL: https://issues.apache.org/jira/browse/NUTCH-2270 > Project: Nutch > Issue Type: Bug > Components: indexer >Affects Versions: 1.12 > Environment: Hadoop 2.7.2 , Solr 6.0.0 , Nutch 1.12 on Single node >Reporter: narendra > > When i run this command > bin/nutch solrindex http://localhost:8983/solr/#/gettingstarted > crawl_Test1/crawldb -linkdb crawl_Test1/linkdb crawl_Test1/segments/* > 16/05/31 22:21:47 WARN segment.SegmentChecker: The input path at * is not a > segment... skipping > 16/05/31 22:21:47 INFO indexer.IndexingJob: Indexer: starting at 2016-05-31 > 22:21:47 > 16/05/31 22:21:47 INFO indexer.IndexingJob: Indexer: deleting gone documents: > false > 16/05/31 22:21:47 INFO indexer.IndexingJob: Indexer: URL filtering: false > 16/05/31 22:21:47 INFO indexer.IndexingJob: Indexer: URL normalizing: false > 16/05/31 22:21:47 INFO plugin.PluginRepository: Plugins: looking in: > /tmp/hadoop-unjar8621976524622577403/classes/plugins > 16/05/31 22:21:47 INFO plugin.PluginRepository: Plugin Auto-activation mode: > [true] > 16/05/31 22:21:47 INFO plugin.PluginRepository: Registered Plugins: > 16/05/31 22:21:47 INFO plugin.PluginRepository: Regex URL Filter > (urlfilter-regex) > 16/05/31 22:21:47 INFO plugin.PluginRepository: Html Parse Plug-in > (parse-html) > 16/05/31 22:21:47 INFO plugin.PluginRepository: HTTP Framework > (lib-http) > 16/05/31 22:21:47 INFO plugin.PluginRepository: the nutch core > extension points (nutch-extensionpoints) > 16/05/31 22:21:47 INFO plugin.PluginRepository: Basic Indexing Filter > (index-basic) > 16/05/31 22:21:47 INFO plugin.PluginRepository: Anchor Indexing Filter > (index-anchor) > 16/05/31 22:21:47 INFO plugin.PluginRepository: Tika Parser Plug-in 
> (parse-tika) > 16/05/31 22:21:47 INFO plugin.PluginRepository: Basic URL Normalizer > (urlnormalizer-basic) > 16/05/31 22:21:47 INFO plugin.PluginRepository: Regex URL Filter > Framework (lib-regex-filter) > 16/05/31 22:21:47 INFO plugin.PluginRepository: Regex URL Normalizer > (urlnormalizer-regex) > 16/05/31 22:21:47 INFO plugin.PluginRepository: CyberNeko HTML Parser > (lib-nekohtml) > 16/05/31 22:21:47 INFO plugin.PluginRepository: OPIC Scoring Plug-in > (scoring-opic) > 16/05/31 22:21:47 INFO plugin.PluginRepository: Pass-through URL > Normalizer (urlnormalizer-pass) > 16/05/31 22:21:47 INFO plugin.PluginRepository: Http Protocol Plug-in > (protocol-http) > 16/05/31 22:21:47 INFO plugin.PluginRepository: SolrIndexWriter > (indexer-solr) > 16/05/31 22:21:47 INFO plugin.PluginRepository: Registered Extension-Points: > 16/05/31 22:21:47 INFO plugin.PluginRepository: Nutch Content Parser > (org.apache.nutch.parse.Parser) > 16/05/31 22:21:47 INFO plugin.PluginRepository: Nutch URL Filter > (org.apache.nutch.net.URLFilter) > 16/05/31 22:21:47 INFO plugin.PluginRepository: HTML Parse Filter > (org.apache.nutch.parse.HtmlParseFilter) > 16/05/31 22:21:47 INFO plugin.PluginRepository: Nutch Scoring > (org.apache.nutch.scoring.ScoringFilter) > 16/05/31 22:21:47 INFO plugin.PluginRepository: Nutch URL Normalizer > (org.apache.nutch.net.URLNormalizer) > 16/05/31 22:21:47 INFO plugin.PluginRepository: Nutch Protocol > (org.apache.nutch.protocol.Protocol) > 16/05/31 22:21:47 INFO plugin.PluginRepository: Nutch URL Ignore > Exemption Filter (org.apache.nutch.net.URLExemptionFilter) > 16/05/31 22:21:47 INFO plugin.PluginRepository: Nutch Index Writer > (org.apache.nutch.indexer.IndexWriter) > 16/05/31 22:21:47 INFO plugin.PluginRepository: Nutch Segment Merge > Filter (org.apache.nutch.segment.SegmentMergeFilter) > 16/05/31 22:21:47 INFO plugin.PluginRepository: Nutch Indexing Filter > (org.apache.nutch.indexer.IndexingFilter) > 16/05/31 22:21:47 INFO indexer.IndexWriters: 
Adding > org.apache.nutch.indexwriter.solr.SolrIndexWriter > 16/05/31 22:21:47 INFO indexer.IndexingJob: Active IndexWriters : > SOLRIndexWriter > solr.server.url : URL of the SOLR instance > solr.zookeeper.hosts : URL of the Zookeeper quorum > solr.commit.size : buffer size when sending to SOLR (default 1000) > solr.mapping.file : name of the mapping file
[jira] [Commented] (NUTCH-2268) SolrIndexerJob: java.lang.RuntimeException
[ https://issues.apache.org/jira/browse/NUTCH-2268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15318948#comment-15318948 ] kaveh minooie commented on NUTCH-2268: -- The reporter's problem has been solved in the same Stack Overflow question that is referenced. This issue should be closed. > SolrIndexerJob: java.lang.RuntimeException > -- > > Key: NUTCH-2268 > URL: https://issues.apache.org/jira/browse/NUTCH-2268 > Project: Nutch > Issue Type: Bug > Components: indexer >Affects Versions: 2.3.1 > Environment: I am using > HBase: hbase-0.98.19-hadoop2 > Solr: 6.0.0 > Nutch: 2.3.1 > Java: 8 >Reporter: narendra > Labels: indexing > Original Estimate: 12h > Remaining Estimate: 12h > > Could you please help me out with this error: > SolrIndexerJob: java.lang.RuntimeException: job > failed:name=apache-nutch-2.3.1.jar > which I get when I run this command: > local/bin/nutch solrindex http://localhost:8983/solr/ -all > I tried with Solr 4.10.3 but I get the same error -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (NUTCH-2267) Solr indexer fails at the end of the job with a java error message
kaveh minooie created NUTCH-2267: Summary: Solr indexer fails at the end of the job with a java error message Key: NUTCH-2267 URL: https://issues.apache.org/jira/browse/NUTCH-2267 Project: Nutch Issue Type: Bug Components: indexer Affects Versions: 1.12 Environment: hadoop v2.7.2, solr6 in cloud configuration with zookeeper 3.4.6. I use the master branch from github, currently on commit da252eb7b3d2d7b70 ( NUTCH-2263 mingram and maxgram support for Unigram Cosine Similarity Model is provided. ) Reporter: kaveh minooie Fix For: 1.13 This is what I was getting first: 16/05/23 13:52:27 INFO mapreduce.Job: map 100% reduce 100% 16/05/23 13:52:27 INFO mapreduce.Job: Task Id : attempt_1462499602101_0119_r_00_0, Status : FAILED Error: Bad return type Exception Details: Location: org/apache/solr/client/solrj/impl/HttpClientUtil.createClient(Lorg/apache/solr/common/params/SolrParams;Lorg/apache/http/conn/ClientConnectionManager;)Lorg/apache/http/impl/client/CloseableHttpClient; @58: areturn Reason: Type 'org/apache/http/impl/client/DefaultHttpClient' (current frame, stack[0]) is not assignable to 'org/apache/http/impl/client/CloseableHttpClient' (from method signature) Current Frame: bci: @58 flags: { } locals: { 'org/apache/solr/common/params/SolrParams', 'org/apache/http/conn/ClientConnectionManager', 'org/apache/solr/common/params/ModifiableSolrParams', 'org/apache/http/impl/client/DefaultHttpClient' } stack: { 'org/apache/http/impl/client/DefaultHttpClient' } Bytecode: 0x000: bb00 0359 2ab7 0004 4db2 0005 b900 0601 0x010: 0099 001e b200 05bb 0007 59b7 0008 1209 0x020: b600 0a2c b600 0bb6 000c b900 0d02 002b 0x030: b800 104e 2d2c b800 0f2d b0 Stackmap Table: append_frame(@47,Object[#143]) 16/05/23 13:52:28 INFO mapreduce.Job: map 100% reduce 0% As you can see, the failed reducer gets re-spawned. Then I found this issue: https://issues.apache.org/jira/browse/SOLR-7657 and I updated my Hadoop config file.
after that, the indexer seems to be able to finish ( I got the document in the solr, it seems ) but I still get the error message at the end of the job: 16/05/23 16:39:26 INFO mapreduce.Job: map 100% reduce 99% 16/05/23 16:39:44 INFO mapreduce.Job: map 100% reduce 100% 16/05/23 16:39:57 INFO mapreduce.Job: Job job_1464045047943_0001 completed successfully 16/05/23 16:39:58 INFO mapreduce.Job: Counters: 53 File System Counters FILE: Number of bytes read=42700154855 FILE: Number of bytes written=70210771807 FILE: Number of read operations=0 FILE: Number of large read operations=0 FILE: Number of write operations=0 HDFS: Number of bytes read=8699202825 HDFS: Number of bytes written=0 HDFS: Number of read operations=537 HDFS: Number of large read operations=0 HDFS: Number of write operations=0 Job Counters Launched map tasks=134 Launched reduce tasks=1 Data-local map tasks=107 Rack-local map tasks=27 Total time spent by all maps in occupied slots (ms)=49377664 Total time spent by all reduces in occupied slots (ms)=32765064 Total time spent by all map tasks (ms)=3086104 Total time spent by all reduce tasks (ms)=1365211 Total vcore-milliseconds taken by all map tasks=3086104 Total vcore-milliseconds taken by all reduce tasks=1365211 Total megabyte-milliseconds taken by all map tasks=12640681984 Total megabyte-milliseconds taken by all reduce tasks=8387856384 Map-Reduce Framework Map input records=25305474 Map output records=25305474 Map output bytes=27422869763 Map output materialized bytes=27489888004 Input split bytes=15225 Combine input records=0 Combine output records=0 Reduce input groups=16061459 Reduce shuffle bytes=27489888004 Reduce input records=25305474 Reduce output records=230 Spilled Records=54688613 Shuffled Maps =134 Failed Shuffles=0 Merged Map outputs=134 GC time elapsed (ms)=88103 CPU time spent (ms)=3361270 Physical memory (bytes) snapshot=144395186176 Virtual memory (bytes) snapshot=751590166528 Total committed heap usage (bytes)=156232056832 
IndexerStatus in
[jira] [Commented] (NUTCH-1084) ReadDB url throws exception
[ https://issues.apache.org/jira/browse/NUTCH-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15297359#comment-15297359 ] kaveh minooie commented on NUTCH-1084: -- For the next person who ends up here: this is fixed in Hadoop 2.7.3; everyone is waiting for that version to be released. > ReadDB url throws exception > --- > > Key: NUTCH-1084 > URL: https://issues.apache.org/jira/browse/NUTCH-1084 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.3 >Reporter: Markus Jelsma >Assignee: Markus Jelsma > Attachments: NUTCH-1084.patch > > > Readdb -url suffers from two problems: > 1. it trips over the _SUCCESS file generated by newer Hadoop versions > 2. throws can't find class: org.apache.nutch.protocol.ProtocolStatus (???) > The first problem can be remedied by not allowing the injector or updater to > write the _SUCCESS file. Until now that's the solution implemented for > similar issues. I've not been successful as to make the Hadoop readers simply > skip the file. > The second issue seems a bit strange and did not happen on a local check out. > I'm not yet sure whether this is a Hadoop issue or something being corrupt in > the CrawlDB. 
Here's the stack trace: > {code} > Exception in thread "main" java.io.IOException: can't find class: > org.apache.nutch.protocol.ProtocolStatus because > org.apache.nutch.protocol.ProtocolStatus > at > org.apache.hadoop.io.AbstractMapWritable.readFields(AbstractMapWritable.java:204) > at org.apache.hadoop.io.MapWritable.readFields(MapWritable.java:146) > at org.apache.nutch.crawl.CrawlDatum.readFields(CrawlDatum.java:278) > at > org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1751) > at org.apache.hadoop.io.MapFile$Reader.get(MapFile.java:524) > at > org.apache.hadoop.mapred.MapFileOutputFormat.getEntry(MapFileOutputFormat.java:105) > at org.apache.nutch.crawl.CrawlDbReader.get(CrawlDbReader.java:383) > at > org.apache.nutch.crawl.CrawlDbReader.readUrl(CrawlDbReader.java:389) > at org.apache.nutch.crawl.CrawlDbReader.main(CrawlDbReader.java:514) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > at java.lang.reflect.Method.invoke(Method.java:597) > at org.apache.hadoop.util.RunJar.main(RunJar.java:156) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-1084) ReadDB url throws exception
[ https://issues.apache.org/jira/browse/NUTCH-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15292406#comment-15292406 ] kaveh minooie commented on NUTCH-1084: -- has there been any update on this issue? I am running master and this is what I got : crawler@d1r2n2:/2locos/nutch/deploy$ NUTCH_HOME=$( pwd ) bin/nutch readseg -list /2locos/segments/20160511193155 WARNING: Use "yarn jar" to launch YARN applications. 16/05/19 16:55:42 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library 16/05/19 16:55:42 INFO compress.CodecPool: Got brand-new decompressor [.deflate] 16/05/19 16:55:42 INFO compress.CodecPool: Got brand-new decompressor [.deflate] 16/05/19 16:55:42 INFO compress.CodecPool: Got brand-new decompressor [.deflate] 16/05/19 16:55:42 INFO compress.CodecPool: Got brand-new decompressor [.deflate] 16/05/19 16:55:42 INFO compress.CodecPool: Got brand-new decompressor [.deflate] Exception in thread "main" java.io.IOException: can't find class: org.apache.nutch.protocol.ProtocolStatus because org.apache.nutch.protocol.ProtocolStatus at org.apache.hadoop.io.AbstractMapWritable.readFields(AbstractMapWritable.java:212) at org.apache.hadoop.io.MapWritable.readFields(MapWritable.java:167) at org.apache.nutch.crawl.CrawlDatum.readFields(CrawlDatum.java:317) at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:2256) at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2384) at org.apache.hadoop.io.MapFile$Reader.next(MapFile.java:673) at org.apache.nutch.segment.SegmentReader.getStats(SegmentReader.java:534) at org.apache.nutch.segment.SegmentReader.list(SegmentReader.java:477) at org.apache.nutch.segment.SegmentReader.main(SegmentReader.java:655) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.hadoop.util.RunJar.run(RunJar.java:221) at org.apache.hadoop.util.RunJar.main(RunJar.java:136) As you can see, setting NUTCH_HOME has no effect, and if I set HADOOP_CLASSPATH I get the same error that [~markus17] posted two posts above; [~ndouba]'s branch is way behind master. Does anybody have any idea what I should do (short of converting [~ndouba]'s changes into a patch and trying to apply it to master)? > ReadDB url throws exception > --- > > Key: NUTCH-1084 > URL: https://issues.apache.org/jira/browse/NUTCH-1084 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.3 >Reporter: Markus Jelsma >Assignee: Markus Jelsma > Attachments: NUTCH-1084.patch > > > Readdb -url suffers from two problems: > 1. it trips over the _SUCCESS file generated by newer Hadoop version > 2. throws can't find class: org.apache.nutch.protocol.ProtocolStatus (???) > The first problem can be remedied by not allowing the injector or updater to > write the _SUCCESS file. Until now that's the solution implemented for > similar issues. I've not been successful as to make the Hadoop readers simply > skip the file. > The second issue seems a bit strange and did not happen on a local check out. > I'm not yet sure whether this is a Hadoop issue or something being corrupt in > the CrawlDB. 
Here's the stack trace: > {code} > Exception in thread "main" java.io.IOException: can't find class: > org.apache.nutch.protocol.ProtocolStatus because > org.apache.nutch.protocol.ProtocolStatus > at > org.apache.hadoop.io.AbstractMapWritable.readFields(AbstractMapWritable.java:204) > at org.apache.hadoop.io.MapWritable.readFields(MapWritable.java:146) > at org.apache.nutch.crawl.CrawlDatum.readFields(CrawlDatum.java:278) > at > org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1751) > at org.apache.hadoop.io.MapFile$Reader.get(MapFile.java:524) > at > org.apache.hadoop.mapred.MapFileOutputFormat.getEntry(MapFileOutputFormat.java:105) > at org.apache.nutch.crawl.CrawlDbReader.get(CrawlDbReader.java:383) > at > org.apache.nutch.crawl.CrawlDbReader.readUrl(CrawlDbReader.java:389) > at org.apache.nutch.crawl.CrawlDbReader.main(CrawlDbReader.java:514) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) >
[jira] [Updated] (NUTCH-1140) index-more plugin, resetTitle method creates multiple values in the Title field
[ https://issues.apache.org/jira/browse/NUTCH-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kaveh minooie updated NUTCH-1140: - Attachment: 0001-NUTCH-1140-trunk.patch 0001-NUTCH-1140-2.x.patch Sorry, there was a typo in both the patch files > index-more plugin, resetTitle method creates multiple values in the Title > field > --- > > Key: NUTCH-1140 > URL: https://issues.apache.org/jira/browse/NUTCH-1140 > Project: Nutch > Issue Type: Bug > Components: indexer >Affects Versions: 1.3 >Reporter: Joe Liedtke >Priority: Minor > Fix For: 1.10 > > Attachments: 0001-NUTCH-1140-2.x.patch, 0001-NUTCH-1140-trunk.patch, > MoreIndexingFilter.093011.patch > > > From the comments in MoreIndexingFilter.java, the index-more plugin is meant > to reset the Title field of a document if it contains a Content-Disposition > header. The current behavior is to add a Title regardless of whether one > exists or not, which can cause issues down the line with the Solr Indexing > process, and based on a thread in the nutch user list it appears that this is > causing some users to mark the title as multi-valued in the schema: > > http://www.lucidimagination.com/search/document/9440ff6b5deb285b/multiple_values_encountered_for_non_multivalued_field_title#17736c5807826be8 > The following patch removes the title field before adding a new one, which > has resolved the issue for me: > --- MoreIndexingFilter.old2011-09-30 11:44:35.0 + > +++ MoreIndexingFilter.java 2011-09-30 09:58:48.0 + > @@ -276,6 +276,7 @@ > for (int i=0; iif (matcher.contains(contentDisposition,patterns[i])) { > result = matcher.getMatch(); > +doc.removeField("title"); > doc.add("title", result.group(1)); > break; >} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (NUTCH-1140) index-more plugin, resetTitle method creates multiple values in the Title field
[ https://issues.apache.org/jira/browse/NUTCH-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kaveh minooie updated NUTCH-1140: - Attachment: (was: 0001-NUTCH-1140-trunk.patch) > index-more plugin, resetTitle method creates multiple values in the Title > field > --- > > Key: NUTCH-1140 > URL: https://issues.apache.org/jira/browse/NUTCH-1140 > Project: Nutch > Issue Type: Bug > Components: indexer >Affects Versions: 1.3 >Reporter: Joe Liedtke >Priority: Minor > Fix For: 1.10 > > Attachments: MoreIndexingFilter.093011.patch > > > From the comments in MoreIndexingFilter.java, the index-more plugin is meant > to reset the Title field of a document if it contains a Content-Disposition > header. The current behavior is to add a Title regardless of whether one > exists or not, which can cause issues down the line with the Solr Indexing > process, and based on a thread in the nutch user list it appears that this is > causing some users to mark the title as multi-valued in the schema: > > http://www.lucidimagination.com/search/document/9440ff6b5deb285b/multiple_values_encountered_for_non_multivalued_field_title#17736c5807826be8 > The following patch removes the title field before adding a new one, which > has resolved the issue for me: > --- MoreIndexingFilter.old2011-09-30 11:44:35.0 + > +++ MoreIndexingFilter.java 2011-09-30 09:58:48.0 + > @@ -276,6 +276,7 @@ > for (int i=0; iif (matcher.contains(contentDisposition,patterns[i])) { > result = matcher.getMatch(); > +doc.removeField("title"); > doc.add("title", result.group(1)); > break; >} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (NUTCH-1140) index-more plugin, resetTitle method creates multiple values in the Title field
[ https://issues.apache.org/jira/browse/NUTCH-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kaveh minooie updated NUTCH-1140: - Attachment: (was: 0001-NUTCH-1140-2.x.patch) > index-more plugin, resetTitle method creates multiple values in the Title > field > --- > > Key: NUTCH-1140 > URL: https://issues.apache.org/jira/browse/NUTCH-1140 > Project: Nutch > Issue Type: Bug > Components: indexer >Affects Versions: 1.3 >Reporter: Joe Liedtke >Priority: Minor > Fix For: 1.10 > > Attachments: MoreIndexingFilter.093011.patch > > > From the comments in MoreIndexingFilter.java, the index-more plugin is meant > to reset the Title field of a document if it contains a Content-Disposition > header. The current behavior is to add a Title regardless of whether one > exists or not, which can cause issues down the line with the Solr Indexing > process, and based on a thread in the nutch user list it appears that this is > causing some users to mark the title as multi-valued in the schema: > > http://www.lucidimagination.com/search/document/9440ff6b5deb285b/multiple_values_encountered_for_non_multivalued_field_title#17736c5807826be8 > The following patch removes the title field before adding a new one, which > has resolved the issue for me: > --- MoreIndexingFilter.old2011-09-30 11:44:35.0 + > +++ MoreIndexingFilter.java 2011-09-30 09:58:48.0 + > @@ -276,6 +276,7 @@ > for (int i=0; iif (matcher.contains(contentDisposition,patterns[i])) { > result = matcher.getMatch(); > +doc.removeField("title"); > doc.add("title", result.group(1)); > break; >} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (NUTCH-1140) index-more plugin, resetTitle method creates multiple values in the Title field
[ https://issues.apache.org/jira/browse/NUTCH-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kaveh minooie updated NUTCH-1140: - Attachment: 0001-NUTCH-1140-trunk.patch 0001-NUTCH-1140-2.x.patch So this is still an issue. Here is a sample list of URLs in the wild that would trigger this problem: http://www.10-s.com/site/tennis-supply/site-map.html http://www.bigappleherp.com/site/content/big_apple_cares.html http://www.bigappleherp.com/site/content/CareSheets.html http://www.bigappleherp.com/site/content/company_information.html http://www.bigappleherp.com/site/content/customer_service.html http://www.bigappleherp.com/site/content/LiveAnimals.html http://www.bigappleherp.com/site/content/testimonials_02.html http://www.magellangps.com/lp/truckfamily/screens.html Now, based on a bit of reading I did on Content-Disposition, it is a reasonable alternative way of determining a title, which would mostly just be the file name, but it should NOT override the actual title if one exists, as the information in the title is far more valuable than the file name. Not to mention that the title field is meant to hold the actual title and should not be replaced when one already exists. > index-more plugin, resetTitle method creates multiple values in the Title > field > --- > > Key: NUTCH-1140 > URL: https://issues.apache.org/jira/browse/NUTCH-1140 > Project: Nutch > Issue Type: Bug > Components: indexer >Affects Versions: 1.3 >Reporter: Joe Liedtke >Priority: Minor > Fix For: 1.10 > > Attachments: 0001-NUTCH-1140-2.x.patch, 0001-NUTCH-1140-trunk.patch, > MoreIndexingFilter.093011.patch > > > From the comments in MoreIndexingFilter.java, the index-more plugin is meant > to reset the Title field of a document if it contains a Content-Disposition > header. 
The current behavior is to add a Title regardless of whether one > exists or not, which can cause issues down the line with the Solr Indexing > process, and based on a thread in the nutch user list it appears that this is > causing some users to mark the title as multi-valued in the schema: > > http://www.lucidimagination.com/search/document/9440ff6b5deb285b/multiple_values_encountered_for_non_multivalued_field_title#17736c5807826be8 > The following patch removes the title field before adding a new one, which > has resolved the issue for me: > --- MoreIndexingFilter.old2011-09-30 11:44:35.0 + > +++ MoreIndexingFilter.java 2011-09-30 09:58:48.0 + > @@ -276,6 +276,7 @@ > for (int i=0; iif (matcher.contains(contentDisposition,patterns[i])) { > result = matcher.getMatch(); > +doc.removeField("title"); > doc.add("title", result.group(1)); > break; >} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
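In Java terms, the one-line patch quoted above and the alternative behavior argued for in the comment look roughly like this. This is a hedged sketch: `Doc` is a small map-backed stand-in, not the real `NutchDocument` API, and `TitleBehavior` and its method names are illustrative.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal stand-in for NutchDocument, just enough to show the two behaviors.
class Doc {
    private final Map<String, List<String>> fields = new HashMap<>();

    void add(String name, String value) {
        fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    void removeField(String name) {
        fields.remove(name);
    }

    List<String> get(String name) {
        return fields.getOrDefault(name, new ArrayList<>());
    }
}

public class TitleBehavior {
    // The patch's behavior: drop any existing title before adding the one
    // derived from the Content-Disposition filename, keeping the field
    // single-valued so Solr's non-multiValued title field accepts it.
    static void resetTitle(Doc doc, String dispositionName) {
        doc.removeField("title");
        doc.add("title", dispositionName);
    }

    // The behavior argued for in the later comment: keep a real extracted
    // title and only fall back to the filename when no title exists.
    static void titleAsFallback(Doc doc, String dispositionName) {
        if (doc.get("title").isEmpty()) {
            doc.add("title", dispositionName);
        }
    }

    public static void main(String[] args) {
        Doc doc = new Doc();
        doc.add("title", "Big Apple Cares");
        resetTitle(doc, "big_apple_cares.html");
        System.out.println("after resetTitle: " + doc.get("title"));
    }
}
```

Either way, the document ends up with exactly one title value, which is what resolves the "multiple values encountered for non multiValued field title" Solr error.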
[jira] [Created] (NUTCH-1842) crawl.gen.delay has a wrong default value in nutch-default.xml or is being parsed incorrectly
kaveh minooie created NUTCH-1842: Summary: crawl.gen.delay has a wrong default value in nutch-default.xml or is being parsed incorrectly Key: NUTCH-1842 URL: https://issues.apache.org/jira/browse/NUTCH-1842 Project: Nutch Issue Type: Bug Components: generator Affects Versions: 1.9 Reporter: kaveh minooie Priority: Minor This is from nutch-default.xml: crawl.gen.delay 60480 This value, expressed in milliseconds, defines how long we should keep the lock on records in CrawlDb that were just selected for fetching. If these records are not updated in the meantime, the lock is canceled, i.e. they become eligible for selecting. Default value of this is 7 days (60480 ms). This is from o.a.n.crawl.Generator.configure(JobConf job): genDelay = job.getLong(GENERATOR_DELAY, 7L) * 3600L * 24L * 1000L; The value in the config file is in milliseconds, but the code expects it to be in days. I reported this a couple of years ago on the mailing list as well. I didn't post a patch because I am not sure which one needs to be fixed. Considering that all the other values in the config file are in milliseconds, it can be argued that consistency matters, but 'day' is a much more reasonable unit for this property. Also, is this value not being used in 2.x?
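The mismatch can be seen by running the same arithmetic the quoted Generator line performs. This is an illustrative sketch, not the actual Generator class; `GenDelayCheck` and its method name are invented for the example.

```java
// Illustrative sketch of the unit mismatch described in NUTCH-1842: the
// config documents crawl.gen.delay in milliseconds, but the code multiplies
// the raw value as if it were a number of days.
public class GenDelayCheck {

    // Same arithmetic as o.a.n.crawl.Generator.configure:
    // genDelay = job.getLong(GENERATOR_DELAY, 7L) * 3600L * 24L * 1000L;
    static long delayAsParsedMs(long configuredValue) {
        return configuredValue * 3600L * 24L * 1000L; // treats the value as days
    }

    public static void main(String[] args) {
        long sevenDaysMs = 7L * 24L * 3600L * 1000L; // 604,800,000 ms
        // The value quoted from nutch-default.xml in the report above:
        long parsed = delayAsParsedMs(60480L);
        // If 60480 was meant as milliseconds, the code turns it into 60480 days.
        System.out.println("intended lock window (ms): " + sevenDaysMs);
        System.out.println("parsed lock window (ms):   " + parsed);
    }
}
```

With the code's default of 7, the product is 604800000 ms, i.e. exactly 7 days, which is consistent with the report: either the config value should be expressed in days, or the code should stop multiplying.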
[jira] [Updated] (NUTCH-1480) SolrIndexer to write to multiple servers.
[ https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kaveh minooie updated NUTCH-1480: - Attachment: adding-support-for-sharding-indexer-for-solr.patch I just found this issue today when I was checking to see whether what I am about to upload would be a duplicate, and good thing I did, since apparently there are quite a few issues about this. But considering that this is the latest one, I will post it here. This patch adds another plugin, indexer-solrshard, that allows sharding the index data on the Nutch side. This is mostly geared toward Solr 3.x, as there are still a few of them around (including in our production environment), but it could have benefits even with Solr 4.x, which I will get to. It adds two new properties to the Nutch config file (solr.shardkey and solr.server.urls): solr.shardkey would be the name of the field that should be used to generate the hash code (and, if the plugin is used against Solr 3.x, it should be the uniqueKey field in the schema file, otherwise deletes would not work properly), and solr.server.urls would be a comma-separated list of Solr core URLs or instance URLs. The plugin divides the hash value by the number of URLs to figure out into which core it should put the document. It also uses the rest of the Solr properties (commit size, etc.); the code is really the same. But the idea behind having a solr.server.urls instead of just using solr.server.url was so that both plugins could be used simultaneously, which can help in migrating from 3.x to 4.x as well, though I guess the same argument can be made for other properties as well.
The code uses the String.hashCode function, which is really good enough in terms of evenly distributing docs across multiple cores (in our case, with about 85 million docs over 8 cores, the difference between the number of docs in each core is less than 5%), but changing the hash function, or even making it customizable as was suggested in NUTCH-945, is trivial. Turning the hashing mechanism off is also trivial (again, I didn't know about this issue when I was writing this, otherwise I would have done it already): we can add another property such as solr.usehash and, by setting it to false, have the plugin just post the documents to all the servers, which could also be quite useful. As for using it against Solr 4.x, it can function as a load balancer; believe me when I say watching 40 reduce jobs try to write to a single Solr instance is rather horrifying. The patch is against trunk, but porting it to 2.x is trivial (I actually think it can probably be applied as is, but I haven't tested it yet). > SolrIndexer to write to multiple servers. > - > > Key: NUTCH-1480 > URL: https://issues.apache.org/jira/browse/NUTCH-1480 > Project: Nutch > Issue Type: Improvement > Components: indexer >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.10 > > Attachments: NUTCH-1480-1.6.1.patch, > adding-support-for-sharding-indexer-for-solr.patch > > > SolrUtils should return an array of SolrServers and read the SolrUrl as a > comma delimited list of URL's using Configuration.getString(). SolrWriter > should be able to handle this list of SolrServers. > This is useful if you want to send documents to multiple servers if no > replication is available or if you want to send documents to multiple NOCs. > edit: > This does not replace NUTCH-1377 but complements it. With NUTCH-1377 this > issue allows you to index to multiple SolrCloud clusters at the same time.
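The shard-selection idea described in the comment can be sketched as follows. The property names solr.shardkey and solr.server.urls come from the comment itself, but `ShardPicker`, its method names, and the sign handling are illustrative; the attached patch's exact code may differ.

```java
public class ShardPicker {

    // Map a shard-key value to one of the configured Solr cores using
    // String.hashCode, as the comment describes. Masking with
    // Integer.MAX_VALUE keeps the index non-negative even when hashCode
    // returns a negative value.
    static int pickShard(String shardKeyValue, int numServers) {
        return (shardKeyValue.hashCode() & Integer.MAX_VALUE) % numServers;
    }

    public static void main(String[] args) {
        // A hypothetical solr.server.urls value, split on commas as the
        // comment's "comma-separated list of Solr core URLs" suggests.
        String[] urls =
            "http://solr1:8983/solr/core0,http://solr2:8983/solr/core1".split(",");
        String target = urls[pickShard("http://www.example.com/page.html", urls.length)];
        System.out.println("document routes to: " + target);
    }
}
```

Because the same key always hashes to the same core, a delete routed through the same function reaches the core holding the document, which is why the comment insists the shard key be the schema's unique key field when running against Solr 3.x.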
[jira] [Updated] (NUTCH-1840) the describe function in SolrIndexWriter is not correct
[ https://issues.apache.org/jira/browse/NUTCH-1840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kaveh minooie updated NUTCH-1840: - Attachment: 2.x-updated-description-in-SolrIndexWriter.patch trunk-1.10-updated-description-in-SolrIndexWriter.patch > the describe function in SolrIndexWriter is not correct > --- > > Key: NUTCH-1840 > URL: https://issues.apache.org/jira/browse/NUTCH-1840 > Project: Nutch > Issue Type: Bug > Components: indexer >Affects Versions: 2.3, 1.9 > Reporter: kaveh minooie >Priority: Trivial > Attachments: 2.x-updated-description-in-SolrIndexWriter.patch, > trunk-1.10-updated-description-in-SolrIndexWriter.patch > > > the describe function in SolrIndexWriter is not correct -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (NUTCH-1840) the describe function in SolrIndexWriter is not correct
kaveh minooie created NUTCH-1840: Summary: the describe function in SolrIndexWriter is not correct Key: NUTCH-1840 URL: https://issues.apache.org/jira/browse/NUTCH-1840 Project: Nutch Issue Type: Bug Components: indexer Affects Versions: 1.9, 2.3 Reporter: kaveh minooie Priority: Trivial the describe function in SolrIndexWriter is not correct -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (NUTCH-1831) compiling against gora-0.5 fails
[ https://issues.apache.org/jira/browse/NUTCH-1831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kaveh minooie updated NUTCH-1831: - Attachment: NUTCH-1831.patch This seems to fix the problem, but I would appreciate it if someone could verify it. > compiling against gora-0.5 fails > > > Key: NUTCH-1831 > URL: https://issues.apache.org/jira/browse/NUTCH-1831 > Project: Nutch > Issue Type: Bug >Affects Versions: 2.3 > Reporter: kaveh minooie > Attachments: NUTCH-1831.patch > > > currently if you try to compile Nutch against Gora 0.5 you will get the > following errors: > clean-lib: > resolve-default: > [ivy:resolve] :: Apache Ivy 2.3.0-local-20140109133456 - 20140109133456 :: > http://ant.apache.org/ivy/ :: > [ivy:resolve] :: loading settings :: file = /sources/nutch/ivy/ivysettings.xml > [taskdef] Could not load definitions from resource > org/sonar/ant/antlib.xml. It could not be found. > copy-libs: > [copy] Copying 128 files to /sources/nutch/build/lib > compile-core: > [javac] Compiling 200 source files to /sources/nutch/build/classes > [javac] warning: [options] bootstrap class path not set in conjunction > with -source 1.6 > [javac] /sources/nutch/src/java/org/apache/nutch/storage/WebPage.java:8: > error: WebPage is not abstract and does not override abstract method > getFieldsCount() in PersistentBase > [javac] public class WebPage extends > org.apache.gora.persistency.impl.PersistentBase implements > org.apache.avro.specific.SpecificRecord, > org.apache.gora.persistency.Persistent { > [javac]^ > [javac] > /sources/nutch/src/java/org/apache/nutch/storage/ProtocolStatus.java:11: > error: ProtocolStatus is not abstract and does not override abstract method > getFieldsCount() in PersistentBase > [javac] public class ProtocolStatus extends > org.apache.gora.persistency.impl.PersistentBase implements > org.apache.avro.specific.SpecificRecord, > org.apache.gora.persistency.Persistent { > [javac]^ > [javac] > 
/sources/nutch/src/java/org/apache/nutch/storage/ParseStatus.java:8: error: > ParseStatus is not abstract and does not override abstract method > getFieldsCount() in PersistentBase > [javac] public class ParseStatus extends > org.apache.gora.persistency.impl.PersistentBase implements > org.apache.avro.specific.SpecificRecord, > org.apache.gora.persistency.Persistent { > [javac]^ > [javac] /sources/nutch/src/java/org/apache/nutch/storage/Host.java:12: > error: Host is not abstract and does not override abstract method > getFieldsCount() in PersistentBase > [javac] public class Host extends > org.apache.gora.persistency.impl.PersistentBase implements > org.apache.avro.specific.SpecificRecord, > org.apache.gora.persistency.Persistent { > [javac]^ > [javac] Note: Some input files use unchecked or unsafe operations. > [javac] Note: Recompile with -Xlint:unchecked for details. > [javac] 4 errors > [javac] 1 warning > BUILD FAILED > /sources/nutch/build.xml:101: Compile failed; see the compiler error output > for details. -- This message was sent by Atlassian JIRA (v6.2#6252)
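The four errors above are all the same mechanical problem: Gora 0.5's PersistentBase gained an abstract getFieldsCount() method, and data classes generated against earlier Gora releases do not override it. A minimal self-contained illustration of the contract (the class and field names below are simplified stand-ins, not the real Nutch or Gora sources):

```java
// Simplified stand-in for the Gora 0.5 API, not the real class.
abstract class PersistentBase {
    // New abstract method in Gora 0.5: every generated persistent class
    // must report the number of Avro fields it defines. Code generated
    // against Gora 0.4 lacks this override, hence the compile errors above.
    public abstract int getFieldsCount();
}

// A class regenerated for Gora 0.5 satisfies the contract like this.
class WebPage extends PersistentBase {
    private static final int FIELD_COUNT = 2; // e.g. baseUrl, status

    @Override
    public int getFieldsCount() {
        return FIELD_COUNT;
    }
}

public class GetFieldsCountDemo {
    public static void main(String[] args) {
        PersistentBase page = new WebPage();
        System.out.println(page.getFieldsCount()); // prints 2
    }
}
```

Regenerating the classes under org.apache.nutch.storage against the new Gora release (which is presumably what the attached patch does) adds exactly this kind of override to each of the four failing classes.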
[jira] [Created] (NUTCH-1831) compiling against gora-0.5 fails
kaveh minooie created NUTCH-1831: Summary: compiling against gora-0.5 fails Key: NUTCH-1831 URL: https://issues.apache.org/jira/browse/NUTCH-1831 Project: Nutch Issue Type: Bug Affects Versions: 2.3 Reporter: kaveh minooie currenty if you try to compile nutch against Gora 0.5 you will get following errors: clean-lib: resolve-default: [ivy:resolve] :: Apache Ivy 2.3.0-local-20140109133456 - 20140109133456 :: http://ant.apache.org/ivy/ :: [ivy:resolve] :: loading settings :: file = /sources/nutch/ivy/ivysettings.xml [taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found. copy-libs: [copy] Copying 128 files to /sources/nutch/build/lib compile-core: [javac] Compiling 200 source files to /sources/nutch/build/classes [javac] warning: [options] bootstrap class path not set in conjunction with -source 1.6 [javac] /sources/nutch/src/java/org/apache/nutch/storage/WebPage.java:8: error: WebPage is not abstract and does not override abstract method getFieldsCount() in PersistentBase [javac] public class WebPage extends org.apache.gora.persistency.impl.PersistentBase implements org.apache.avro.specific.SpecificRecord, org.apache.gora.persistency.Persistent { [javac]^ [javac] /sources/nutch/src/java/org/apache/nutch/storage/ProtocolStatus.java:11: error: ProtocolStatus is not abstract and does not override abstract method getFieldsCount() in PersistentBase [javac] public class ProtocolStatus extends org.apache.gora.persistency.impl.PersistentBase implements org.apache.avro.specific.SpecificRecord, org.apache.gora.persistency.Persistent { [javac]^ [javac] /sources/nutch/src/java/org/apache/nutch/storage/ParseStatus.java:8: error: ParseStatus is not abstract and does not override abstract method getFieldsCount() in PersistentBase [javac] public class ParseStatus extends org.apache.gora.persistency.impl.PersistentBase implements org.apache.avro.specific.SpecificRecord, org.apache.gora.persistency.Persistent { [javac]^ [javac] 
/sources/nutch/src/java/org/apache/nutch/storage/Host.java:12: error: Host is not abstract and does not override abstract method getFieldsCount() in PersistentBase [javac] public class Host extends org.apache.gora.persistency.impl.PersistentBase implements org.apache.avro.specific.SpecificRecord, org.apache.gora.persistency.Persistent { [javac]^ [javac] Note: Some input files use unchecked or unsafe operations. [javac] Note: Recompile with -Xlint:unchecked for details. [javac] 4 errors [javac] 1 warning BUILD FAILED /sources/nutch/build.xml:101: Compile failed; see the compiler error output for details. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (NUTCH-1791) Null pointer exceptions with gora-cassandra-0.4
[ https://issues.apache.org/jira/browse/NUTCH-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14028977#comment-14028977 ] kaveh minooie commented on NUTCH-1791: -- it would have helped if you could post a bit more of those stack traces. specifically, the beginning lines of the sections that start with 'caused by'. > Null pointer exceptions with gora-cassandra-0.4 > --- > > Key: NUTCH-1791 > URL: https://issues.apache.org/jira/browse/NUTCH-1791 > Project: Nutch > Issue Type: Bug > Components: generator, storage >Affects Versions: 2.3 > Environment: dsc-cassandra-2.0.2, dsc-cassandra-2.0.7 >Reporter: Koen Smets > Fix For: 2.3 > > > Latest nutch-2.x source checkout fails to run with Cassandra 2.0.2 (and also > Cassandra 2.0.7) as storage backend both in normal Nutch operations (inject, > generate, fetch) cycle as in the junit tests {{TestGoraStorage}} > {code} > 2014-06-03 11:24:23,495 INFO connection.CassandraHostRetryService > (CassandraHostRetryService.java:(48)) - Downed Host Retry service > started with queue size -1 and retry delay 10s > 2014-06-03 11:24:23,535 INFO service.JmxMonitor > (JmxMonitor.java:registerMonitor(52)) - Registering JMX > me.prettyprint.cassandra.service_Test > Cluster:ServiceType=hector,MonitorType=hector > Exception in thread "main" java.lang.NullPointerException > at > org.apache.gora.cassandra.query.CassandraResult.updatePersistent(CassandraResult.java:121) > at > org.apache.gora.cassandra.query.CassandraResult.nextInner(CassandraResult.java:57) > at org.apache.gora.query.impl.ResultBase.next(ResultBase.java:114) > at > org.apache.nutch.storage.TestGoraStorage.readWrite(TestGoraStorage.java:93) > at > org.apache.nutch.storage.TestGoraStorage.main(TestGoraStorage.java:230) > {code} > After injecting: > {code} > ksmets@precise64 ~/l/a/r/local> ./bin/nutch inject urls > InjectorJob: starting at 2014-06-03 11:55:11 > InjectorJob: Injecting urlDir: urls > InjectorJob: Using class 
org.apache.gora.cassandra.store.CassandraStore as > the Gora storage class. > InjectorJob: total number of urls rejected by filters: 0 > InjectorJob: total number of urls injected after normalization and filtering: > 1 > Injector: finished at 2014-06-03 11:55:13, elapsed: 00:00:02 > ksmets@precise64 ~/l/a/r/local> ./bin/nutch readdb -stats > WebTable statistics start > Statistics for WebTable: > min score:1.0 > retry 0: 1 > jobs: {db_stats-job_local1403358409_0001={jobID=job_local1403358409_0001, > jobName=db_stats, counters={File Input Format Counters ={BYTES_READ=0}, > Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=97, MAP_INPUT_RECORDS=1, > REDUCE_SHUFFLE_BYTES=0, SPILLED_RECORDS=12, MAP_OUTPUT_BYTES=53, > COMMITTED_HEAP_BYTES=358612992, CPU_MILLISECONDS=0, SPLIT_RAW_BYTES=769, > COMBINE_INPUT_RECORDS=4, REDUCE_INPUT_RECORDS=6, REDUCE_INPUT_GROUPS=6, > COMBINE_OUTPUT_RECORDS=6, PHYSICAL_MEMORY_BYTES=0, REDUCE_OUTPUT_RECORDS=6, > VIRTUAL_MEMORY_BYTES=0, MAP_OUTPUT_RECORDS=4}, > FileSystemCounters={FILE_BYTES_READ=974145, FILE_BYTES_WRITTEN=1144369}, File > Output Format Counters ={BYTES_WRITTEN=225 > max score:1.0 > TOTAL urls: 1 > status 0 (null): 1 > avg score:1.0 > WebTable statistics: done > ksmets@precise64 ~/l/a/r/local> ./bin/nutch readdb -url http://example.com/ > key: http://example.com/ > baseUrl: null > status: 0 (null) > fetchTime:1401789311270 > prevFetchTime:0 > fetchInterval:2592000 > retriesSinceFetch:0 > modifiedTime: 0 > prevModifiedTime: 0 > protocolStatus: (null) > parseStatus: (null) > title:null > score:1.0 > markers: org.apache.gora.persistency.impl.DirtyMapWrapper@eb173c > reprUrl: null > metadata _csh_ : ?� > {code} > After generating, > {code} > ksmets@precise64 ~/l/a/r/local> ./bin/nutch generate -topN 1 > GeneratorJob: starting at 2014-06-03 11:55:38 > GeneratorJob: Selecting best-scoring urls due for fetch. 
> GeneratorJob: starting > GeneratorJob: filtering: true > GeneratorJob: normalizing: true > GeneratorJob: topN: 1 > GeneratorJob: finished at 2014-06-03 11:55:40, time elapsed: 00:00:02 > GeneratorJob: generated batch id: 1401789338-222512082 containing 1 URLs > ksmets@precise64 ~/l/a/r/local> ./bin/nutch readdb -stats > WebTable statistics start > Statistics for WebTable: > jobs: {db_stats-job_local73029265_0001={jobID=job_loc
[jira] [Updated] (NUTCH-1780) ttl and gc_grace_seconds attributes are missing from gora-cassandra-mapping.xml file
[ https://issues.apache.org/jira/browse/NUTCH-1780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kaveh minooie updated NUTCH-1780: - Attachment: NUTCH-1780.patch > ttl and gc_grace_seconds attributes are missing from > gora-cassandra-mapping.xml file > > > Key: NUTCH-1780 > URL: https://issues.apache.org/jira/browse/NUTCH-1780 > Project: Nutch > Issue Type: Bug > Components: storage >Affects Versions: 2.3 > Reporter: kaveh minooie > Attachments: NUTCH-1780.patch > > Original Estimate: 1m > Remaining Estimate: 1m > > after upgrading to groa 0.4 ( NUTCH-1714) we need extra properties in C* > mapping file. I also added a few, IMHO, helpful hints. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (NUTCH-1780) ttl and gc_grace_seconds attributes are missing from gora-cassandra-mapping.xml file
[ https://issues.apache.org/jira/browse/NUTCH-1780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kaveh minooie updated NUTCH-1780: - Attachment: (was: NUTCH-1780.patch) > ttl and gc_grace_seconds attributes are missing from > gora-cassandra-mapping.xml file > > > Key: NUTCH-1780 > URL: https://issues.apache.org/jira/browse/NUTCH-1780 > Project: Nutch > Issue Type: Bug > Components: storage >Affects Versions: 2.3 > Reporter: kaveh minooie > Attachments: NUTCH-1780.patch > > Original Estimate: 1m > Remaining Estimate: 1m > > after upgrading to groa 0.4 ( NUTCH-1714) we need extra properties in C* > mapping file. I also added a few, IMHO, helpful hints. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (NUTCH-1780) ttl and gc_grace_seconds attributes are missing from gora-cassandra-mapping.xml file
kaveh minooie created NUTCH-1780: Summary: ttl and gc_grace_seconds attributes are missing from gora-cassandra-mapping.xml file Key: NUTCH-1780 URL: https://issues.apache.org/jira/browse/NUTCH-1780 Project: Nutch Issue Type: Bug Components: storage Affects Versions: 2.3 Reporter: kaveh minooie Attachments: NUTCH-1780.patch After upgrading to gora 0.4 (NUTCH-1714) we need extra properties in the C* mapping file. I also added a few, IMHO, helpful hints.
[jira] [Updated] (NUTCH-1780) ttl and gc_grace_seconds attributes are missing from gora-cassandra-mapping.xml file
[ https://issues.apache.org/jira/browse/NUTCH-1780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kaveh minooie updated NUTCH-1780: - Attachment: NUTCH-1780.patch There is really no good default value for gc_grace_seconds. We could use Cassandra's default, which is 10 days, but since the out-of-the-box setting is for a single-node cluster, I used 0, which is the best value for that setup. Using 0 also forces people to actually change this before using it in a real cluster, which I think is appropriate here. > ttl and gc_grace_seconds attributes are missing from > gora-cassandra-mapping.xml file > > > Key: NUTCH-1780 > URL: https://issues.apache.org/jira/browse/NUTCH-1780 > Project: Nutch > Issue Type: Bug > Components: storage >Affects Versions: 2.3 > Reporter: kaveh minooie > Attachments: NUTCH-1780.patch > > Original Estimate: 1m > Remaining Estimate: 1m > > after upgrading to gora 0.4 (NUTCH-1714) we need extra properties in the C* > mapping file. I also added a few, IMHO, helpful hints.
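For context, the shape of the mapping change might look like this; the attribute placement (gc_grace_seconds on column families, ttl on individual fields) is assumed from the gora-cassandra 0.4 mapping format, and the excerpt is illustrative, not the attached patch itself:

```xml
<!-- Illustrative excerpt of conf/gora-cassandra-mapping.xml, not the patch.
     gc_grace_seconds="0" is only safe on a single-node cluster: raise it
     (Cassandra's own default is 10 days) before running on a real cluster.
     ttl="0" means "never expire"; if ttl is left undefined, gora-cassandra
     falls back to 60 seconds and rows silently disappear. -->
<gora-orm>
  <keyspace name="webpage" cluster="Test Cluster" host="localhost">
    <family name="f" gc_grace_seconds="0"/>
  </keyspace>
  <class name="org.apache.nutch.storage.WebPage" keyClass="java.lang.String"
         keyspace="webpage">
    <field name="baseUrl" family="f" qualifier="bas" ttl="0"/>
  </class>
</gora-orm>
```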
[jira] [Commented] (NUTCH-1714) Nutch 2.x upgrade to Gora 0.4
[ https://issues.apache.org/jira/browse/NUTCH-1714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998118#comment-13998118 ] kaveh minooie commented on NUTCH-1714: -- Hi everyone, it seems that the ttl and gc_grace_seconds attributes have not been added to the gora-cassandra-mapping.xml file. ttl is especially important since, for some reason, gora-cassandra sets it to 60 seconds if it is not defined. > Nutch 2.x upgrade to Gora 0.4 > - > > Key: NUTCH-1714 > URL: https://issues.apache.org/jira/browse/NUTCH-1714 > Project: Nutch > Issue Type: Improvement >Reporter: Alparslan Avcı >Assignee: Alparslan Avcı > Fix For: 2.3 > > Attachments: NUTCH-1714.patch, NUTCH-1714_NUTCH-1714_v2_v3.patch, > NUTCH-1714v2.patch, NUTCH-1714v4.patch, NUTCH-1714v5.patch, NUTCH-1714v6.patch > > > Nutch upgrade for GORA_94 branch has to be implemented. We can discuss the > details in this issue.
[jira] [Commented] (NUTCH-1642) mvn compile fails on Centos6.3
[ https://issues.apache.org/jira/browse/NUTCH-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13768532#comment-13768532 ] kaveh minooie commented on NUTCH-1642: -- I don't know if you know this or not, but if you don't: Nutch is built using 'ant', and the dependencies get resolved through Ivy, not Maven. I know there is a pom file in the root, but I am pretty sure it is not being maintained. Other people here will be able to give you more detailed information, and they might actually update the pom file as well, but for now just use ant: go to the root of the project and type ant (assuming you have ant installed on your system). > mvn compile fails on Centos6.3 > -- > > Key: NUTCH-1642 > URL: https://issues.apache.org/jira/browse/NUTCH-1642 > Project: Nutch > Issue Type: Bug > Environment: Apache Maven 3.1.0 > (893ca28a1da9d5f51ac03827af98bb730128f9f2; 2013-06-28 10:15:32+0800) > Java version: 1.7.0_25, vendor: Oracle Corporation > Default locale: en_US, platform encoding: UTF-8 > OS name: "linux", version: "2.6.32-279.el6.x86_64", arch: "amd64", family: > "unix" >Reporter: Xibao.Lv > Attachments: NUTCH-1642.patch > > > Hi all, > I am new. When I run 'mvn compile', it returns some errors, like the following.
> 1.[ERROR] Failed to execute goal on project nutch: Could not resolve > dependencies for project org.apache.nutch:nutch:jar:2.2: The following > artifacts could not be resolved: javax.jms:jms:jar:1.1, > com.sun.jdmk:jmxtools:jar:1.2.1, com.sun.jmx:jmxri:jar:1.2.1, > org.restlet.jse:org.restlet:jar:2.0.5, > org.restlet.jse:org.restlet.ext.jackson:jar:2.0.5: Could not transfer > artifact javax.jms:jms:jar:1.1 from/to java.net > (https://maven-repository.dev.java.net/nonav/repository): No connector > available to access repository java.net > (https://maven-repository.dev.java.net/nonav/repository) of type legacy using > the available factories WagonRepositoryConnectorFactory > It means that the org.restlet.jse can not find in main repository. Then I add > repository address to pom.xml > 2.[WARNING] 'dependencies.dependency.(groupId:artifactId:type:classifier)' > must be unique: org.jdom:jdom:jar -> duplicate declaration of version 1.1 @ > line 268, column 29 > 3.[ERROR] Failed to execute goal on project nutch: Could not resolve > dependencies for project org.apache.nutch:nutch:jar:2.2: The following > artifacts could not be resolved: javax.jms:jms:jar:1.1, > com.sun.jdmk:jmxtools:jar:1.2.1, com.sun.jmx:jmxri:jar:1.2.1: Could not > transfer artifact javax.jms:jms:jar:1.1 from/to java.net > (https://maven-repository.dev.java.net/nonav/repository): No connector > available to access repository java.net > (https://maven-repository.dev.java.net/nonav/repository) of type legacy using > the available factories WagonRepositoryConnectorFactory > It means that we can not find javax.jms in > there(https://maven-repository.dev.java.net/nonav/repository). Google said > log4j-1.2.15 dependency javax.jms, so we can use higher log4j, such as > 1.2.16+. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1634) readdb -stats show the result twice
[ https://issues.apache.org/jira/browse/NUTCH-1634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kaveh minooie updated NUTCH-1634: - Attachment: NUTCH-1634-2.x.patch I'll check the trunk as well and will post a patch for it if needed. > readdb -stats show the result twice > --- > > Key: NUTCH-1634 > URL: https://issues.apache.org/jira/browse/NUTCH-1634 > Project: Nutch > Issue Type: Bug > Components: crawldb >Affects Versions: 2.2.1 >Reporter: kaveh minooie >Priority: Minor > Attachments: NUTCH-1634-2.x.patch > > > right now this is the ouput: > WebTable statistics start > Statistics for WebTable: > status 2 (status_fetched):1115 > min score:0.0 > retry 2: 2 > retry 0: 11369 > jobs: {db_stats-job_local1037462743_0001={jobID=job_local1037462743_0001, > jobName=db_stats, counters={File Input Format Counters ={BYTES_READ=0}, > Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=209, > MAP_INPUT_RECORDS=11376, REDUCE_SHUFFLE_BYTES=0, SPILLED_RECORDS=24, > MAP_OUTPUT_BYTES=602928, COMMITTED_HEAP_BYTES=1181220864, CPU_MILLISECONDS=0, > SPLIT_RAW_BYTES=757, COMBINE_INPUT_RECORDS=45504, REDUCE_INPUT_RECORDS=12, > REDUCE_INPUT_GROUPS=12, COMBINE_OUTPUT_RECORDS=12, PHYSICAL_MEMORY_BYTES=0, > REDUCE_OUTPUT_RECORDS=12, VIRTUAL_MEMORY_BYTES=0, MAP_OUTPUT_RECORDS=45504}, > FileSystemCounters={FILE_BYTES_READ=1819, FILE_BYTES_WRITTEN=166199}, File > Output Format Counters ={BYTES_WRITTEN=373 > retry 1: 5 > status 5 (status_redir_perm): 69 > max score:1.0 > TOTAL urls: 11376 > status 3 (status_gone): 3 > status 4 (status_redir_temp): 3 > status 1 (status_unfetched): 10186 > avg score:0.00342827 > WebTable statistics: done > status 2 (status_fetched):1115 > min score:0.0 > retry 2: 2 > retry 0: 11369 > jobs: {db_stats-job_local1037462743_0001={jobID=job_local1037462743_0001, > jobName=db_stats, counters={File Input Format Counters ={BYTES_READ=0}, > Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=209, > MAP_INPUT_RECORDS=11376, REDUCE_SHUFFLE_BYTES=0, 
SPILLED_RECORDS=24, > MAP_OUTPUT_BYTES=602928, COMMITTED_HEAP_BYTES=1181220864, CPU_MILLISECONDS=0, > SPLIT_RAW_BYTES=757, COMBINE_INPUT_RECORDS=45504, REDUCE_INPUT_RECORDS=12, > REDUCE_INPUT_GROUPS=12, COMBINE_OUTPUT_RECORDS=12, PHYSICAL_MEMORY_BYTES=0, > REDUCE_OUTPUT_RECORDS=12, VIRTUAL_MEMORY_BYTES=0, MAP_OUTPUT_RECORDS=45504}, > FileSystemCounters={FILE_BYTES_READ=1819, FILE_BYTES_WRITTEN=166199}, File > Output Format Counters ={BYTES_WRITTEN=373 > retry 1: 5 > status 5 (status_redir_perm): 69 > max score:1.0 > TOTAL urls: 11376 > status 3 (status_gone): 3 > status 4 (status_redir_temp): 3 > status 1 (status_unfetched): 10186 > avg score:0.00342827 > imho, it should be this: > WebTable statistics start > Statistics for WebTable: > status 2 (status_fetched):1115 > min score:0.0 > retry 2: 2 > retry 0: 11369 > jobs: {db_stats-job_local801282144_0001={jobID=job_local801282144_0001, > jobName=db_stats, counters={File Input Format Counters ={BYTES_READ=0}, > Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=209, > MAP_INPUT_RECORDS=11376, REDUCE_SHUFFLE_BYTES=0, SPILLED_RECORDS=24, > MAP_OUTPUT_BYTES=602928, COMMITTED_HEAP_BYTES=1122631680, CPU_MILLISECONDS=0, > SPLIT_RAW_BYTES=757, COMBINE_INPUT_RECORDS=45504, REDUCE_INPUT_RECORDS=12, > REDUCE_INPUT_GROUPS=12, COMBINE_OUTPUT_RECORDS=12, PHYSICAL_MEMORY_BYTES=0, > REDUCE_OUTPUT_RECORDS=12, VIRTUAL_MEMORY_BYTES=0, MAP_OUTPUT_RECORDS=45504}, > FileSystemCounters={FILE_BYTES_READ=1819, FILE_BYTES_WRITTEN=166191}, File > Output Format Counters ={BYTES_WRITTEN=373 > retry 1: 5 > status 5 (status_redir_perm): 69 > max score:1.0 > TOTAL urls: 11376 > status 3 (status_gone): 3 > status 4 (status_redir_temp): 3 > status 1 (status_unfetched): 10186 > avg score:0.00342827 > WebTable statistics: done -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1634) readdb -stats show the result twice
[ https://issues.apache.org/jira/browse/NUTCH-1634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kaveh minooie updated NUTCH-1634: - Description: right now this is the ouput: WebTable statistics start Statistics for WebTable: status 2 (status_fetched): 1115 min score: 0.0 retry 2:2 retry 0:11369 jobs: {db_stats-job_local1037462743_0001={jobID=job_local1037462743_0001, jobName=db_stats, counters={File Input Format Counters ={BYTES_READ=0}, Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=209, MAP_INPUT_RECORDS=11376, REDUCE_SHUFFLE_BYTES=0, SPILLED_RECORDS=24, MAP_OUTPUT_BYTES=602928, COMMITTED_HEAP_BYTES=1181220864, CPU_MILLISECONDS=0, SPLIT_RAW_BYTES=757, COMBINE_INPUT_RECORDS=45504, REDUCE_INPUT_RECORDS=12, REDUCE_INPUT_GROUPS=12, COMBINE_OUTPUT_RECORDS=12, PHYSICAL_MEMORY_BYTES=0, REDUCE_OUTPUT_RECORDS=12, VIRTUAL_MEMORY_BYTES=0, MAP_OUTPUT_RECORDS=45504}, FileSystemCounters={FILE_BYTES_READ=1819, FILE_BYTES_WRITTEN=166199}, File Output Format Counters ={BYTES_WRITTEN=373 retry 1:5 status 5 (status_redir_perm): 69 max score: 1.0 TOTAL urls: 11376 status 3 (status_gone): 3 status 4 (status_redir_temp): 3 status 1 (status_unfetched):10186 avg score: 0.00342827 WebTable statistics: done status 2 (status_fetched): 1115 min score: 0.0 retry 2:2 retry 0:11369 jobs: {db_stats-job_local1037462743_0001={jobID=job_local1037462743_0001, jobName=db_stats, counters={File Input Format Counters ={BYTES_READ=0}, Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=209, MAP_INPUT_RECORDS=11376, REDUCE_SHUFFLE_BYTES=0, SPILLED_RECORDS=24, MAP_OUTPUT_BYTES=602928, COMMITTED_HEAP_BYTES=1181220864, CPU_MILLISECONDS=0, SPLIT_RAW_BYTES=757, COMBINE_INPUT_RECORDS=45504, REDUCE_INPUT_RECORDS=12, REDUCE_INPUT_GROUPS=12, COMBINE_OUTPUT_RECORDS=12, PHYSICAL_MEMORY_BYTES=0, REDUCE_OUTPUT_RECORDS=12, VIRTUAL_MEMORY_BYTES=0, MAP_OUTPUT_RECORDS=45504}, FileSystemCounters={FILE_BYTES_READ=1819, FILE_BYTES_WRITTEN=166199}, File Output Format Counters ={BYTES_WRITTEN=373 
retry 1:5 status 5 (status_redir_perm): 69 max score: 1.0 TOTAL urls: 11376 status 3 (status_gone): 3 status 4 (status_redir_temp): 3 status 1 (status_unfetched):10186 avg score: 0.00342827 imho, it should be this: WebTable statistics start Statistics for WebTable: status 2 (status_fetched): 1115 min score: 0.0 retry 2:2 retry 0:11369 jobs: {db_stats-job_local801282144_0001={jobID=job_local801282144_0001, jobName=db_stats, counters={File Input Format Counters ={BYTES_READ=0}, Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=209, MAP_INPUT_RECORDS=11376, REDUCE_SHUFFLE_BYTES=0, SPILLED_RECORDS=24, MAP_OUTPUT_BYTES=602928, COMMITTED_HEAP_BYTES=1122631680, CPU_MILLISECONDS=0, SPLIT_RAW_BYTES=757, COMBINE_INPUT_RECORDS=45504, REDUCE_INPUT_RECORDS=12, REDUCE_INPUT_GROUPS=12, COMBINE_OUTPUT_RECORDS=12, PHYSICAL_MEMORY_BYTES=0, REDUCE_OUTPUT_RECORDS=12, VIRTUAL_MEMORY_BYTES=0, MAP_OUTPUT_RECORDS=45504}, FileSystemCounters={FILE_BYTES_READ=1819, FILE_BYTES_WRITTEN=166191}, File Output Format Counters ={BYTES_WRITTEN=373 retry 1:5 status 5 (status_redir_perm): 69 max score: 1.0 TOTAL urls: 11376 status 3 (status_gone): 3 status 4 (status_redir_temp): 3 status 1 (status_unfetched):10186 avg score: 0.00342827 WebTable statistics: done was://TODO > readdb -stats show the result twice > --- > > Key: NUTCH-1634 > URL: https://issues.apache.org/jira/browse/NUTCH-1634 > Project: Nutch > Issue Type: Bug > Components: crawldb >Affects Versions: 2.2.1 > Reporter: kaveh minooie >Priority: Minor > > right now this is the ouput: > WebTable statistics start > Statistics for WebTable: > status 2 (status_fetched):1115 > min score:0.0 > retry 2: 2 > retry 0: 11369 > jobs: {db_stats-job_local1037462743_0001={jobID=job_local1037462743_0001, > jobName=db_stats, counters={File Input Format Counters ={BYTES_READ=0}, > Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=209, > MAP_INPUT_RECORDS=11376, REDUCE_SHUFFLE_BYTES=0, SPILLED_RECORDS=24, > MAP_OUTPUT_BYTES=602928, 
COMMITTED_HEAP_BYTES=1181220864, CPU_MILLISECONDS=0, > SPLIT_RAW_BYTES=757, COMBINE_INPUT_RECORDS=45504, REDUCE_INPUT_RECORDS=12, > REDUCE_INPUT_GROUPS=12, COMBINE_OUTPUT_RECORDS=12, PHYSICAL_MEMORY_BYTES=0, > REDUCE_OUTPUT_RECORDS=12, VIRTUAL_MEMORY_BYTES=0, MAP_OUTPUT_RECORDS=45504}, > FileSystemCounters={FILE_BYTES_READ=1819, FILE_BYTES_WRITTEN=166199}, File > Output Format Counters ={BYTES_WRITTEN=373 > retry 1: 5 > status 5 (status_redir_perm): 69 > max score:1.0 >
[jira] [Created] (NUTCH-1634) readdb -stats show the result twice
kaveh minooie created NUTCH-1634: Summary: readdb -stats show the result twice Key: NUTCH-1634 URL: https://issues.apache.org/jira/browse/NUTCH-1634 Project: Nutch Issue Type: Bug Components: crawldb Affects Versions: 2.2.1 Reporter: kaveh minooie Priority: Minor //TODO
[jira] [Updated] (NUTCH-1556) enabling updatedb to accept batchId
[ https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kaveh minooie updated NUTCH-1556: - Attachment: NUTCH-1556-v3.patch there are typos (fetch instead of update) in v2 :) > enabling updatedb to accept batchId > > > Key: NUTCH-1556 > URL: https://issues.apache.org/jira/browse/NUTCH-1556 > Project: Nutch > Issue Type: Improvement >Affects Versions: 2.2 > Reporter: kaveh minooie > Fix For: 2.3 > > Attachments: NUTCH-1556.patch, NUTCH-1556-v2.patch, > NUTCH-1556-v3.patch > > > So the idea here is to be able to run updatedb and fetch for different > batchIds simultaneously. I put together a patch. It seems to be working (it > does skip the rows that do not match the batchId), but I am worried whether and > how it might affect the sorting in the reduce part; anyway, check it out. > It also changes the command-line usage to this: > Usage: DbUpdaterJob ( | -all) [-crawlId ]
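The row-skipping idea in the description can be sketched as follows; the class and method names are hypothetical, not the actual DbUpdaterJob code:

```java
// Illustrative sketch only: the update job is given a batch id and skips any
// row whose batch marker does not match it, so an updatedb for one batch can
// run while a fetch for another batch is still in flight. "-all" keeps the
// old process-everything behaviour.
class BatchIdFilter {
    static final String ALL = "-all";
    private final String wantedBatchId;

    BatchIdFilter(String wantedBatchId) {
        this.wantedBatchId = wantedBatchId;
    }

    /** True if a row carrying this batch marker should be updated. */
    boolean accept(String rowBatchId) {
        return ALL.equals(wantedBatchId) || wantedBatchId.equals(rowBatchId);
    }
}

public class BatchIdFilterDemo {
    public static void main(String[] args) {
        BatchIdFilter onlyOne = new BatchIdFilter("1401789338-222512082");
        System.out.println(onlyOne.accept("1401789338-222512082")); // true
        System.out.println(onlyOne.accept("some-other-batch"));     // false
        System.out.println(new BatchIdFilter(BatchIdFilter.ALL).accept("anything")); // true
    }
}
```

The worry about reduce-side sorting is reasonable: filtering in the mapper changes which keys reach the reducers, but not the sort order of the keys that do.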
[jira] [Updated] (NUTCH-1633) slf4j is provided by hadoop and should not be included in the job file.
[ https://issues.apache.org/jira/browse/NUTCH-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kaveh minooie updated NUTCH-1633: - Attachment: not-include-slf4j-in-job-file.trunk.patch not-include-slf4j-in-job-file.2.x.patch > slf4j is provided by hadoop and should not be included in the job file. > --- > > Key: NUTCH-1633 > URL: https://issues.apache.org/jira/browse/NUTCH-1633 > Project: Nutch > Issue Type: Bug > Components: build >Affects Versions: 1.7, 2.2.1 > Reporter: kaveh minooie >Priority: Minor > Labels: easyfix > Fix For: 2.3 > > Attachments: not-include-slf4j-in-job-file.2.x.patch, > not-include-slf4j-in-job-file.trunk.patch > > Original Estimate: 1m > Remaining Estimate: 1m > > There are two issues with including slf4j in the job file. The minor of the > two is that slf4j starts issuing warnings when it finds more than one > instance on the classpath (GORA-272). The bigger issue happens when the > versions of slf4j in hadoop and nutch are not compatible (e.g. hadoop > 1.1.1 & nutch 2.1), which causes all nutch jobs to crash.
[jira] [Created] (NUTCH-1633) slf4j is provided by hadoop and should not be included in the job file.
kaveh minooie created NUTCH-1633: Summary: slf4j is provided by hadoop and should not be included in the job file. Key: NUTCH-1633 URL: https://issues.apache.org/jira/browse/NUTCH-1633 Project: Nutch Issue Type: Bug Components: build Affects Versions: 2.2.1, 1.7 Reporter: kaveh minooie Priority: Minor Fix For: 2.3 There are two issues with including slf4j in the job file. The minor of the two is that slf4j starts issuing warnings when it finds more than one instance on the classpath (GORA-272). The bigger issue happens when the versions of slf4j in hadoop and nutch are not compatible (e.g. hadoop 1.1.1 & nutch 2.1), which causes all nutch jobs to crash.
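One way to keep slf4j out of the job file is to exclude its jars when the job archive is assembled; a hypothetical build.xml sketch under that assumption, not the contents of the attached patches:

```xml
<!-- Hypothetical sketch, not the attached patch: when assembling the .job
     file, copy runtime libraries but leave out the slf4j jars, since Hadoop
     already provides its own copy on the task classpath. -->
<target name="job" depends="compile">
  <jar jarfile="${build.dir}/${final.name}.job">
    <zipfileset dir="${build.lib.dir}" prefix="lib"
                includes="**/*.jar" excludes="slf4j*.jar"/>
  </jar>
</target>
```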
[jira] [Commented] (NUTCH-1632) add batchId argument for DbUpdaterJob
[ https://issues.apache.org/jira/browse/NUTCH-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13750268#comment-13750268 ] kaveh minooie commented on NUTCH-1632: -- If this one is accepted, make sure to close NUTCH-1556 as well. > add batchId argument for DbUpdaterJob > - > > Key: NUTCH-1632 > URL: https://issues.apache.org/jira/browse/NUTCH-1632 > Project: Nutch > Issue Type: Improvement > Components: crawldb >Affects Versions: 2.2.1 >Reporter: lufeng >Assignee: lufeng >Priority: Minor > Fix For: 2.3 > > Attachments: NUTCH-1632.patch > > > add batchId argument for DbUpdaterJob, so you can pass the batchId to > DbUpdaterJob.
[jira] [Commented] (NUTCH-1629) there is no need to fail on empty lines in seed file when injecting.
[ https://issues.apache.org/jira/browse/NUTCH-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13747879#comment-13747879 ] kaveh minooie commented on NUTCH-1629: -- :) For future reference, the magic switch in git format-patch that does this (not putting 'a/' and 'b/' before the path) is --no-prefix > there is no need to fail on empty lines in seed file when injecting. > > > Key: NUTCH-1629 > URL: https://issues.apache.org/jira/browse/NUTCH-1629 > Project: Nutch > Issue Type: Improvement > Components: injector >Affects Versions: 1.7, 2.2.1 > Environment: Java 1.7.0_25 >Reporter: kaveh minooie > Labels: easyfix > Fix For: 2.3, 1.8 > > Attachments: NUTCH-1629--2.x.svn.patch, NUTCH-1629--trunk.svn.patch > > Original Estimate: 1m > Remaining Estimate: 1m > > right now, if there is an empty line in a seed file, TableUtil.reversUrl > would throw an exception that would kill the inject job.
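A quick way to see the effect of the switch in a throwaway repository (file names and messages are arbitrary):

```shell
# Without --no-prefix, "git format-patch" writes paths as a/... and b/...,
# which an SVN-style "patch -p0" cannot apply; with it, paths come out bare.
tmp=$(mktemp -d)
cd "$tmp"
git init -q
echo 'http://example.com/' > seed.txt
git add seed.txt
git -c user.name=demo -c user.email=demo@example.com commit -q -m 'add seed'
git format-patch --no-prefix -1 --stdout > demo.patch
grep '^+++ seed.txt' demo.patch   # bare path, no b/ prefix
```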
[jira] [Updated] (NUTCH-1629) there is no need to fail on empty lines in seed file when injecting.
[ https://issues.apache.org/jira/browse/NUTCH-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kaveh minooie updated NUTCH-1629: - Attachment: (was: NUTCH-1629--no-need-to-fail-on-empty-lines-in-seed-files-1.x.patch) > there is no need to fail on empty lines in seed file when injecting. > > > Key: NUTCH-1629 > URL: https://issues.apache.org/jira/browse/NUTCH-1629 > Project: Nutch > Issue Type: Improvement > Components: injector >Affects Versions: 2.2.1 > Environment: Java 1.7.0_25 > Reporter: kaveh minooie > Labels: easyfix > Fix For: 2.3 > > Attachments: NUTCH-1629--2.x.svn.patch, NUTCH-1629--trunk.svn.patch > > Original Estimate: 1m > Remaining Estimate: 1m > > right now, if there is an empty line in a seed file, TableUtil.reversUrl > would throw an exception that would kill the inject job. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1629) there is no need to fail on empty lines in seed file when injecting.
[ https://issues.apache.org/jira/browse/NUTCH-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kaveh minooie updated NUTCH-1629:
Attachment: NUTCH-1629--trunk.svn.patch
            NUTCH-1629--2.x.svn.patch
so like this?
[jira] [Updated] (NUTCH-1629) there is no need to fail on empty lines in seed file when injecting.
[ https://issues.apache.org/jira/browse/NUTCH-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kaveh minooie updated NUTCH-1629:
Attachment: (was: NUTCH-1629--no-need-to-fail-on-empty-lines-in-seed-fi.patch)
[jira] [Commented] (NUTCH-1629) there is no need to fail on empty lines in seed file when injecting.
[ https://issues.apache.org/jira/browse/NUTCH-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13747722#comment-13747722 ] kaveh minooie commented on NUTCH-1629:
--
Sorry Julien, could you tell me what I am missing here? Do you mean that I should have only one file for both branches?
[jira] [Updated] (NUTCH-1629) there is no need to fail on empty lines in seed file when injecting.
[ https://issues.apache.org/jira/browse/NUTCH-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kaveh minooie updated NUTCH-1629:
Attachment: NUTCH-1629--no-need-to-fail-on-empty-lines-in-seed-files-1.x.patch
            NUTCH-1629--no-need-to-fail-on-empty-lines-in-seed-fi.patch
@Julien these should do it.
[jira] [Updated] (NUTCH-1629) there is no need to fail on empty lines in seed file when injecting.
[ https://issues.apache.org/jira/browse/NUTCH-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kaveh minooie updated NUTCH-1629:
Attachment: (was: 0001-no-need-to-fail-on-empty-lines-in-seed-files.patch)
[jira] [Created] (NUTCH-1629) there is no need to fail on empty lines in seed file when injecting.
kaveh minooie created NUTCH-1629:
Summary: there is no need to fail on empty lines in seed file when injecting.
Key: NUTCH-1629
URL: https://issues.apache.org/jira/browse/NUTCH-1629
Project: Nutch
Issue Type: Improvement
Components: injector
Affects Versions: 2.2.1
Environment: Java 1.7.0_25
Reporter: kaveh minooie
Fix For: 2.3
Attachments: 0001-no-need-to-fail-on-empty-lines-in-seed-files.patch

right now, if there is an empty line in a seed file, TableUtil.reverseUrl would throw an exception that would kill the inject job.
[jira] [Updated] (NUTCH-1629) there is no need to fail on empty lines in seed file when injecting.
[ https://issues.apache.org/jira/browse/NUTCH-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kaveh minooie updated NUTCH-1629:
Attachment: 0001-no-need-to-fail-on-empty-lines-in-seed-files.patch
Re: problem with running 2.x in eclipse
Never mind. Using an absolute path took care of the issue. May I suggest that in the wiki (http://wiki.apache.org/nutch/RunNutchInEclipse#Troubleshooting), in the troubleshooting section, the sample absolute path should not be src/plugin, it should be build/plugins, as in:

plugin.folders
/home/../trunk/build/plugins

On 08/16/2013 04:11 PM, kaveh minooie wrote:
> Hi everyone so I am trying to run in eclipse and I followed this
> https://wiki.apache.org/nutch/RunNutchInEclipse
> [...]
-- Kaveh Minooie
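[Editor's note] The wiki suggestion above, expressed as a nutch-site.xml property, would look roughly like this. The path is a placeholder; point it at the absolute build/plugins directory of your own checkout:

```xml
<property>
  <name>plugin.folders</name>
  <!-- absolute path to the plugins built by 'ant', not the plugin sources -->
  <value>/path/to/trunk/build/plugins</value>
</property>
```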
problem with running 2.x in eclipse
Hi everyone, so I am trying to run in eclipse and I followed this: https://wiki.apache.org/nutch/RunNutchInEclipse
Now when I try to run the inject command I get this in hadoop.log:

2013-08-16 15:57:18,184 WARN snappy.LoadSnappy - Snappy native library not loaded
2013-08-16 15:57:18,716 INFO mapreduce.GoraRecordWriter - gora.buffer.write.limit = 1
2013-08-16 15:57:18,736 WARN plugin.PluginRepository - Plugins: directory not found: plugins
2013-08-16 15:57:18,738 WARN mapred.FileOutputCommitter - Output path is null in cleanup
2013-08-16 15:57:18,739 WARN mapred.LocalJobRunner - job_local1120981005_0001
java.lang.Exception: java.lang.RuntimeException: x point org.apache.nutch.net.URLNormalizer not found.
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:354)
Caused by: java.lang.RuntimeException: x point org.apache.nutch.net.URLNormalizer not found.
at org.apache.nutch.net.URLNormalizers.<init>(URLNormalizers.java:122)

As you can see, nutch cannot see the plugins directory. My eclipse is 4.3, so the order and export tab in the java build path dialog box is a bit different, but I think everything is set correctly. Now I know that extensions and using the extension point is a bit new in nutch, so I was wondering if, in addition to what is said in the tutorial, there is other stuff that needs to be configured in eclipse to be able to run the code within eclipse?
thanks,
-- Kaveh Minooie
crawl.gen.delay
is 'crawl.gen.delay' still being used anywhere? Because I can't find anything in the source code except for here:

package org.apache.nutch.crawl;

public class GeneratorJob extends NutchTool implements Tool {
  public static final String GENERATOR_TOP_N = "generate.topN";
  public static final String GENERATOR_CUR_TIME = "generate.curTime";
  public static final String GENERATOR_DELAY = "crawl.gen.delay";

Also, I think it has the wrong value in the nutch-default.xml file (the value is in seconds; it should be in days).
[jira] [Updated] (NUTCH-1624) Typo in WebTableReader line 486
[ https://issues.apache.org/jira/browse/NUTCH-1624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kaveh minooie updated NUTCH-1624:
Attachment: 0001-NUTCH-1624.patch

> Typo in WebTableReader line 486
>
> Key: NUTCH-1624
> URL: https://issues.apache.org/jira/browse/NUTCH-1624
> Project: Nutch
> Issue Type: Bug
> Components: documentation
> Affects Versions: 2.2.1
> Environment: this was seen in 2.X HEAD
> Reporter: kaveh minooie
> Priority: Minor
> Fix For: 2.3
>
> Attachments: 0001-NUTCH-1624.patch
>
> the error message suggests to the user to use, among other things, '-stat';
> it should be '-stats'
[jira] [Updated] (NUTCH-1624) Typo in WebTableReader line 486
[ https://issues.apache.org/jira/browse/NUTCH-1624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kaveh minooie updated NUTCH-1624:
Summary: Typo in WebTableReader line 486 (was: Type in WebTableReader line 486)
[jira] [Created] (NUTCH-1624) Type in WebTableReader line 486
kaveh minooie created NUTCH-1624:
Summary: Type in WebTableReader line 486
Key: NUTCH-1624
URL: https://issues.apache.org/jira/browse/NUTCH-1624
Project: Nutch
Issue Type: Bug
Components: documentation
Affects Versions: 2.2.1
Environment: this was seen in 2.X HEAD
Reporter: kaveh minooie
Priority: Minor
Fix For: 2.3

the error message suggests to the user to use, among other things, '-stat'; it should be '-stats'
Re: so why does solrindex-mapping.xml get ignored?
The code puts the value under the original key anyway; there is no 'mapping', it just copies. We have other instructions for copying fields. I think the code should strictly follow the mapping file, and I think that whole if statement should not be there.

On 04/12/2013 02:54 PM, Lewis John Mcgibbney wrote:
Hi Kaveh,
On Thu, Apr 11, 2013 at 11:53 PM, <mailto:dev-digest-h...@nutch.apache.org> wrote:
so why does solrindex-mapping.xml get ignored? 23089 by: kaveh minooie
why are we doing this?
I have no idea. What is wrong?
so why does solrindex-mapping.xml get ignored?
this is from nutch/src/java/org/apache/nutch/indexer/solr/SolrWriter.java, the write function:

@Override
public void write(NutchDocument doc) throws IOException {
  final SolrInputDocument inputDoc = new SolrInputDocument();
  for (final Entry<String, List<String>> e : doc) {
    for (final String val : e.getValue()) {
      Object val2 = val;
      if (e.getKey().equals("content") || e.getKey().equals("title")) {
        val2 = stripNonCharCodepoints(val);
      }
      inputDoc.addField(solrMapping.mapKey(e.getKey()), val2);
      String sCopy = solrMapping.mapCopyKey(e.getKey());
      if (sCopy != e.getKey()) {
        inputDoc.addField(sCopy, val2);
      }
    }
  }

As you can see, it checks to see if the field is mapped to a different name, and if that is the case, it adds it under the original key in addition to the mapped key. Why are we doing this?
-- Kaveh Minooie
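[Editor's note] The behaviour being argued for here, add a field only under its mapped name and emit a second copy only when the mapping explicitly defines one, can be sketched like this. The two maps stand in for solrMapping's mapKey/mapCopyKey lookups; all names are illustrative, not the actual SolrWriter code:

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class StrictMappingSketch {

    // Stand-ins for the solrindex-mapping.xml rename and copy rules.
    static final Map<String, String> RENAME = new HashMap<>();
    static final Map<String, String> COPY = new HashMap<>();

    // Add the value only under the mapped name; add a copy only when
    // the mapping explicitly asks for one (no fallback to the raw key).
    static Map<String, Object> writeField(String key, Object value) {
        Map<String, Object> fields = new LinkedHashMap<>();
        String mapped = RENAME.getOrDefault(key, key);
        fields.put(mapped, value);
        String copy = COPY.get(key);
        if (copy != null && !copy.equals(mapped)) {
            fields.put(copy, value);
        }
        return fields;
    }

    public static void main(String[] args) {
        RENAME.put("content", "text");
        // the value lands only under the mapped name, not the original key
        System.out.println(writeField("content", "hello"));
    }
}
```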
[jira] [Commented] (NUTCH-1555) bug in 2.x ParserJob command line parsing
[ https://issues.apache.org/jira/browse/NUTCH-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13628051#comment-13628051 ] kaveh minooie commented on NUTCH-1555:
--
Commons CLI uses maven to build. Considering that nutch uses ivy, wouldn't that be an issue?

> bug in 2.x ParserJob command line parsing
>
> Key: NUTCH-1555
> URL: https://issues.apache.org/jira/browse/NUTCH-1555
> Project: Nutch
> Issue Type: Bug
> Components: parser
> Affects Versions: 2.1
> Reporter: Lewis John McGibbney
> Fix For: 2.2
>
> I just accidentally passed in the following argument to parser job
> {code}
> law@CEE279Law3-Linux:~/Downloads/asf/2.x/runtime/local$ ./bin/nutch parse updatedb
> ParserJob: starting
> ParserJob: resuming: false
> ParserJob: forced reparse: false
> ParserJob: batchId: updatedb
> ParserJob: success
> {code}
> This is a bug for sure
[jira] [Updated] (NUTCH-1556) enabling updatedb to accept batchId
[ https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kaveh minooie updated NUTCH-1556:
Description: So the idea here is to be able to run updatedb and fetch for different batchIds simultaneously. I put together a patch. It seems to be working (it does skip the rows that do not match the batchId), but I am worried about if and how it might affect the sorting in the reduce part. Anyway, check it out. It also changes the command line usage to this:
Usage: DbUpdaterJob (<batchId> | -all) [-crawlId <id>]
was: So the idea here is to be able to run updatedb and fetch for different batchIds simultaneously. I put together a patch. It seems to be working (it does skip the rows that do not match the batchId), but I am worried about if and how it might affect the sorting in the reduce part. Anyway, check it out.
[jira] [Created] (NUTCH-1556) enabling updatedb to accept batchId
kaveh minooie created NUTCH-1556:
Summary: enabling updatedb to accept batchId
Key: NUTCH-1556
URL: https://issues.apache.org/jira/browse/NUTCH-1556
Project: Nutch
Issue Type: Improvement
Affects Versions: 2.2
Reporter: kaveh minooie
Attachments: NUTCH-1556.patch

So the idea here is to be able to run updatedb and fetch for different batchIds simultaneously. I put together a patch. It seems to be working (it does skip the rows that do not match the batchId), but I am worried about if and how it might affect the sorting in the reduce part. Anyway, check it out.
[jira] [Updated] (NUTCH-1556) enabling updatedb to accept batchId
[ https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kaveh minooie updated NUTCH-1556:
Attachment: NUTCH-1556.patch
Re: Nutch2.x Null Pointer Exception in IndexerJob.Java for a fresh crawl with One Seed
So I rewrote the beginning of the index method in IndexUtil:

public NutchDocument index(String key, WebPage page) {
  NutchDocument doc = new NutchDocument();
  LOG.info("key: " + key);
  doc.add("id", key);
  doc.add("digest", StringUtil.toHexString(page.getSignature().array()));
  //doc.add("batchId", page.getBatchId().toString());
  if (null == page.getBatchId()) {
    LOG.info("batchId is null");
  } else {
    doc.add("batchId", page.getBatchId().toString());
  }
  try {
    LOG.info("page is:" + page);
  } catch (Exception e) {
    LOG.info("error:" + e);
  }

and here is an example of what I am getting:

key: com.nba.www:http/
batchId is null
page is:org.apache.nutch.storage.WebPage@ba6fc739 {
"baseUrl":"null"
"status":"0"
"fetchTime":"1366846228423"
"prevFetchTime":"0"
"fetchInterval":"0"
"retriesSinceFetch":"0"
"modifiedTime":"0"
"prevModifiedTime":"0"
"protocolStatus":"null"
"content":"null"
"contentType":"text/html"
"prevSignature":"null"
"signature":"java.nio.HeapByteBuffer[pos=0 lim=16 cap=16]"
"title":"NBA.com"
"text":"NBA.com Skip to , part of the Turner Sports & Entertainment Digital Network."
"parseStatus":"org.apache.nutch.storage.ParseStatus@7821 { "majorCode":"1" "minorCode":"0" "args":"[]" }"
"score":"1.0"
"reprUrl":"null"
"headers":"{Content-Encoding=gzip, Connection=close, Content-Type=text/html;charset=UTF-8, Content-Length=19526, Cache-Control=max-age=31, Date=Wed, 03 Apr 2013 23:30:28 GMT, Expires=Wed, 03 Apr 2013 23:30:59 GMT, Server=nginx, X-UA-Device=desktop, Vary=User-Agent, X-UA-Profile=desktop}"
"outlinks":"{}"
"inlinks":"{}"
"markers":"{dist=0, _injmrk_=y, _idxmrk_=1365031584-127026, _updmrk_=1365031584-127026}"
"metadata":"{}"
"batchId":"null"
}

The fields baseUrl, protocolStatus, reprUrl, and batchId are null, and outlinks is empty.
I am still in the process of familiarizing myself with the code, so I can't say it for sure, and I apologize for asking stupid questions while we are at it, but this doesn't seem right to me. Am I right to assume that the mentioned fields, or at least most of them, should have values? Also, the example that I am showing here is not a one-off; these fields have the same value for all, emphasis on ALL, the few thousand urls that I have fetched and with which I am playing to test the code. The text field was a lot longer; I removed the extra text since it was irrelevant here. Everything else I copied directly from the log file.
thanks,

On 04/03/2013 02:32 PM, Lewis John Mcgibbney wrote:
Hi Kaveh,
On Wed, Apr 3, 2013 at 1:30 PM, <mailto:dev-digest-h...@nutch.apache.org> wrote:
Hi so I am not sure if binoy is talking about this but here it is: the original exception comes from src/java/org/apache/nutch/indexer/IndexUtil.java line 66

public NutchDocument index(String key, WebPage page) {
  NutchDocument doc = new NutchDocument();
  doc.add("id", key);
  doc.add("digest", StringUtil.toHexString(page.getSignature().array()));
==>> doc.add("batchId", page.getBatchId().toString());

page.getBatchId() returns null for every url. My guess is that updatedb removes the batchId from the rows in webpage, since generate and fetch work fine with batchId, but after updatedb (which by the way does not accept batchId as one of its parameters, which means that it is going over the entire webpage table every time you run it, but that is a different issue) solrindex can't find the batchIds.

I've reopened NUTCH-1532 and attached a trivial patch which should now protect against the NPE people have been getting. Can you please check it out and get back to us?
Thank you Kaveh
-- Kaveh Minooie
Re: dev Digest 2 Apr 2013 18:42:33 -0000 Issue 1587
Hi, so I am not sure if binoy is talking about this, but here it is: the original exception comes from src/java/org/apache/nutch/indexer/IndexUtil.java line 66

public NutchDocument index(String key, WebPage page) {
  NutchDocument doc = new NutchDocument();
  doc.add("id", key);
  doc.add("digest", StringUtil.toHexString(page.getSignature().array()));
==>> doc.add("batchId", page.getBatchId().toString());

page.getBatchId() returns null for every url. My guess is that updatedb removes the batchId from the rows in webpage, since generate and fetch work fine with batchId, but after updatedb (which by the way does not accept batchId as one of its parameters, which means that it is going over the entire webpage table every time you run it, but that is a different issue) solrindex can't find the batchIds. Though I am not sure; I am going over the code right after I hit send :)

On 04/02/2013 01:55 PM, Lewis John Mcgibbney wrote:
Hi Binoy,
On Tue, Apr 2, 2013 at 11:42 AM, <mailto:dev-digest-h...@nutch.apache.org> wrote:
Re: Nutch2.x Null Pointer Exception in IndexerJob.Java for a fresh crawl with One Seed. 22979 by: Binoy d
Hi Lewis, I understand the head branch can be unstable some of the time. I was trying to point out that I was not able to reproduce the issue with HEAD for 2.x. I will try and create the jira after I am back from the office. I try not to create jiras without confirming the issue; they just tend to add noise. I haven't used the crawl scripts much, so it might take some time for me to get logs from there.
Anything you can do to help us better understand the source of the issue is greatly appreciated Binoy. Thank you for your perseverance (and others who are helping on these issues); it is of real value to the Nutch community.
Best
Lewis
-- Kaveh Minooie
[jira] [Updated] (NUTCH-1552) possibility of a NPE in index-more plugin
[ https://issues.apache.org/jira/browse/NUTCH-1552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kaveh minooie updated NUTCH-1552:
Attachment: NUTCH-1552.patch

> possibility of a NPE in index-more plugin
>
> Key: NUTCH-1552
> URL: https://issues.apache.org/jira/browse/NUTCH-1552
> Project: Nutch
> Issue Type: Bug
> Components: indexer
> Affects Versions: 2.2
> Reporter: kaveh minooie
> Attachments: NUTCH-1552.patch
>
> in line 203 of src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
> the code attempts to read from the variable contentType even though it is possible
> for it to be null. for me, it happened when I tried to index http://www.pscars.com/
[jira] [Created] (NUTCH-1552) possibility of a NPE in index-more plugin
kaveh minooie created NUTCH-1552:
Summary: possibility of a NPE in index-more plugin
Key: NUTCH-1552
URL: https://issues.apache.org/jira/browse/NUTCH-1552
Project: Nutch
Issue Type: Bug
Components: indexer
Affects Versions: 2.2
Reporter: kaveh minooie

in line 203 of src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java the code attempts to read from the variable contentType even though it is possible for it to be null. for me, it happened when I tried to index http://www.pscars.com/
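[Editor's note] The null guard this issue asks for amounts to checking contentType before using it. A minimal sketch; the class, method name, and null-return behaviour are assumptions for illustration, not the actual patch:

```java
public class ContentTypeGuard {

    // Derive the primary type only when contentType is present.
    // Returning null lets the caller skip the type-based index fields
    // instead of hitting a NullPointerException.
    public static String primaryType(String contentType) {
        if (contentType == null) {
            return null;
        }
        int slash = contentType.indexOf('/');
        return slash < 0 ? contentType : contentType.substring(0, slash);
    }

    public static void main(String[] args) {
        System.out.println(primaryType("text/html")); // prints text
        System.out.println(primaryType(null));        // prints null, no NPE
    }
}
```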
Re: error using generate in 2.x
OK, so I got gora-core-0.3-20130401.060419-325.jar and gora-hbase-0.3-20130401.065448-305.jar, and when I run generate the code finishes without any exception, but the log file was full of lines like this (one for every url that I had in the webpage table):

INFO mapreduce.GoraRecordWriter - Exception at GoraRecordWriter.class while writing to datastore. HBase mapping for field [org.apache.nutch.storage.WebPage#batchId] not found. Wrong gora-hbase-mapping.xml?

When I checked gora-hbase-mapping.xml there was no field for batchId, so I copied this line from gora-cassandra-mapping.xml, and after that everything (and by that I mean generate, fetch, updatedb) worked fine. So now here are my questions:
1- as I said, that line is missing from gora-hbase-mapping.xml. Does this need a jira issue, or can you guys just add it and commit without going through all the hoops?
2- is the trunk version supposed to be compiled against the gora trunk? Because the current HEAD is not working with 0.2.1.
P.S. this, by the way, worked the same with and without the NUTCH-1551 patch.

On 04/01/2013 03:28 PM, Lewis John Mcgibbney wrote:
You're right, this is a dev issue for sure.
On Mon, Apr 1, 2013 at 2:45 PM, kaveh minooie <ka...@plutoz.com> wrote:
The NUTCH-1551 patch didn't solve my issue. I am still getting the same exact error when I try to run generate. (this was run in local mode):
NUTCH-1551 is not supposed to fix this problem entirely. It merely attempts to make the WebTableReader tool backwards compatible and permits you to check whether the accessor methods WebPage.getBatchId() and WebPage.getPrevModifiedTime() actually work for your use case. If you are able to check and provide feedback of the webtable dump for the URL causing the NPE it would be very valuable indeed.
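[Editor's note] The kind of line that had to be copied over would sit inside the WebPage class element of gora-hbase-mapping.xml and look roughly like this. The family and qualifier shown are assumptions for illustration; they must match whatever the rest of your mapping file uses for the other WebPage fields:

```xml
<class name="org.apache.nutch.storage.WebPage" keyClass="java.lang.String" table="webpage">
  <!-- family/qualifier here are illustrative; align them with the existing fields -->
  <field name="batchId" family="f" qualifier="bid"/>
</class>
```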
Now the likely variable that is null seems to be 'mapkey', which is probably the result of a malformed URL (though I can't say that for sure). Now, the put function is being called from here; this is from gora 0.2.1, gora/blob/0.2.1/gora-core/src/main/java/org/apache/gora/mapreduce/GoraRecordWriter.java:
...
the same function in gora trunk is like this:
...
which seems to me would allow the code to recover from this kind of error. Now, I get gora through ivy and I don't know how or if I can have ivy fetch the trunk, but regardless, I still think the question remains whether it is a nutch issue or gora.

So it appears that some issues have been addressed and improved within Gora trunk (which is nice). You can pull a Gora SNAPSHOT from here [0] and place it on your class path, then try it out. Feedback would be greatly appreciated. The underlying problem here is that not everyone using and developing Gora is using and developing Nutch. We have been making good progress towards building diversity over in Gora so that it is not so heavily reliant upon Nutch users. This means the project can stand on its own two feet. The downside of this is that *some* bugs arising from *some* use cases are not discovered until a little later than we would like. Your feedback is really really helpful. It should be noted that you can also patch your local copy of 2.x HEAD to not contain the two offending issues we've previously discussed.
[0] https://repository.apache.org/content/repositories/snapshots/org/apache/gora/
-- Kaveh Minooie
Re: error using generate in 2.x
I still think the question remains whether it is a nutch issue or gora? sorry for the long email.

On 03/30/2013 04:03 PM, Lewis John Mcgibbney wrote:
I think we also may need to add the BATCH_ID to one Job's HashSet:

private static final Collection<WebPage.Field> FIELDS = new HashSet<WebPage.Field>();
static {
  ...
  FIELDS.add(WebPage.Field.BATCH_ID);
}

On Sat, Mar 30, 2013 at 3:55 PM, Lewis John Mcgibbney <lewis.mcgibb...@gmail.com> wrote:
Hi, I've tried to sort this out locally this morning... I can almost replicate this behaviour with gora-cassandra, and it looks most likely that the patch(es) applied in
* NUTCH-1533 - NUTCH-1532 Implement getPrevModifiedTime(), setPrevModifiedTime(), getBatchId() and setBatchId() accessors in o.a.n.storage.WebPage, and
* NUTCH-1532 - Replace 'segment' mapping field with batchId, respectively
are not backwards compatible, because some URLs within the web database do not contain values for the batchId. Of course this is a major problem. I opened NUTCH-1551 [0] and submitted a patch to make WebTableReader backwards compatible with the above patches. Please try out the patch if you can and comment so I can commit. We have a couple of options here.
1) Revert both of the above until we can get a fix
2) Get a fix just now and commit it.
What do you guys want to do? I have a question about whether or not we can dynamically add fields to existing database entries by injecting them? Say for example, you inject URLs without the batchId field in your mapping file, then add the field and inject some more URLs... will the field be added to your database? If so, then why are we getting the NPE? There must be some other location in the Nutch code where an asserted attempt is being made to obtain the batchId for some given key... it cannot be obtained and we receive the NPE.
[0] https://issues.apache.org/jira/browse/NUTCH-1551 On Fri, Mar 29, 2013 at 5:05 PM, kaveh minooie wrote: I use git and I fetch from github (https://github.com/apache/nutch.git). currently I am on this commit: commit 4bb01d6b908dc230c8be89d398b03a86581ec42b Author: lufeng Date: Thu Mar 28 13:09:09 2013 + NUTCH-1547 BasicIndexingFilter - Problem to index full title git-svn-id: https://svn.apache.org/repos/asf/nutch/branches/2.x@1462079 13f79535-47bb-0310-9956-ffa450edef68 before I was on this commit: commit f02dcf62566583551426c08bd388080e5b2bc93e f02dcf6 NUTCH-XX remove unused db.max.inlinks from nutch-default.xml On 03/29/2013 04:35 PM, alx...@aim.com wrote: Yes, with hbase. Here is the error: 13/03/29 16:33:29 INFO zookeeper.ZooKeeper: Session: 0x13d7770d67d005f closed 13/03/29 16:33:29 ERROR crawl.WebTableReader: WebTableReader: java.lang.NullPointerException at org.apache.gora.hbase.store.HBaseStore.addFields(HBaseStore.java:398) at org.apache.gora.hbase.store.HBaseStore.execute(HBaseStore.java:360) at org.apache.nutch.crawl.WebTableReader.read(WebTableReader.java:234) at org.apache.nutch.crawl.WebTableReader.run(WebTableReader.java:476) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.nutch.crawl.WebTableReader.main(WebTableReader.java:412) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:156) If I revert to the previous release it works fine. Thanks. Alex.
-Original Message- From: Lewis John Mcgibbney To: user Sent: Fri, Mar 29, 2013 4:30 pm Subject: Re: error using generate in 2.x Hi Alex, With HBase also? There 'was' a bug in the gora-cassandra module for this command + params, however I thought it had been addressed and therefore resolved it. Lewis On Fri, Mar 29, 2013 at 4:00 PM, wrote: Hi, It seems that trunk has a few bugs. I found out that readdb -url urlname also gives errors. Thanks. Alex. -Original Message- From: kaveh minooie To: user Sent: Fri, Mar 29, 2013 1:53 pm Subject: Re: error using generate in 2.x Hi Lewis, the mapping file that I am using is the one that comes with Nutch, and I haven't touched it. this message in the log is caused by using -crawlId on the command line. for example, this log was the result of this command: bin/nutch generate -topN 1000 -crawlId t1 which causes Nutch (or, I guess, technically Gora) to use the table name 't1_webpage'. though I have to say that I don't understand the rationale behind the code generating a warning like this (I mean, I know it is not actually a warning, just tha
Re: slf4j issue with nutch 2.x over hadoop 1.1.1
So when you say "prune the dependencies", I am not sure what you are talking about, because what I could think of is not working. let me explain the situation again. the nutch 2.x ivy file (ivy/ivy.xml) has this in it: hadoop 1.1.1 ships with slf4j 1.4.3. these two are not compatible. now, I'd rather not mess with my hadoop cluster, so I tried to downgrade slf4j in nutch. I changed the above lines to: as you can see, I am upgrading solr and zookeeper and removing elasticsearch, and all of these changes work fine, since I can see the appropriate files in the build/lib directory after ant is done. but it doesn't work for slf4j, and the files copied to build/lib (and subsequently into my job file) are: kaveh@d1r2n2:/source/nutch/nutch$ ll build/lib/slf* -rw-r--r-- 1 kaveh kaveh 25496 Jul 5 2010 build/lib/slf4j-api-1.6.1.jar -rw-r--r-- 1 kaveh kaveh 9753 Jul 5 2010 build/lib/slf4j-log4j12-1.6.1.jar since I need the job file, manually changing the files in build/lib won't do me any good. now, I don't know ant very well, and that is mostly why I am asking this of you guys. I have to say that I also changed the same thing in pom.xml as well: org.slf4j slf4j-log4j12 1.4.3 true but I still end up with the 1.6.1 version. I don't know exactly how ant and ivy and pom work together, so I am asking if there is any other config file that I am missing, or why, while it is working fine for solr and zookeeper, it is not affecting slf4j? thanks, On 02/16/2013 09:42 AM, Lewis John Mcgibbney wrote: A solution would be to manually prune the dependencies which are fetched via Ivy. If old slf4j dependencies are fetched for Hadoop via Ivy then maybe we need to make the exclusions explicit within ivy.xml. if you are able, then please provide a patch which fixes this, if it is really a problem. It is important to note that pom.xml will most likely be outdated. You should build nutch with ant + ivy for the time being, as this is stable.
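One reason changing the rev on the direct slf4j dependency doesn't stick is that another module (e.g. a solr or elasticsearch client jar) can still pull slf4j 1.6.1 transitively, and Ivy's default conflict manager keeps the latest revision. A hedged sketch of one way to force a single revision everywhere, assuming Ivy 2.x; the module names and revs shown are illustrative, not the actual Nutch ivy.xml contents:

```xml
<!-- ivy.xml sketch (Ivy 2.x): <override> mediates the revision for a module
     no matter which dependency requests it transitively, which a plain rev
     change on one <dependency> line does not. Revs here are illustrative. -->
<dependencies>
  <override org="org.slf4j" module="slf4j-api"     rev="1.4.3"/>
  <override org="org.slf4j" module="slf4j-log4j12" rev="1.4.3"/>
  <!-- existing <dependency .../> declarations stay as they are -->
</dependencies>
```

Alternatively, an `<exclude org="org.slf4j"/>` on the offending dependency plus one explicit slf4j `<dependency>` achieves the same pinning; which culprit module needs the exclude would have to be confirmed with `ant resolve` / the Ivy resolution report.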
Thank you Lewis On Saturday, February 16, 2013, kaveh minooie wrote: unfortunately your links have been removed from the email that I got, so I am not sure what [0] and [1] are, but this is what I am using: kaveh@d1r2n2:/source/nutch/nutch.git$ git remote -v origin https://github.com/apache/nutch.git (fetch) origin https://github.com/apache/nutch.git (push) kaveh@d1r2n2:/source/nutch/nutch.git$ git branch -v * 2.x f02dcf6 NUTCH-XX remove unused db.max.inlinks from nutch-default.xml trunk a7a1b41 NUTCH-1521 CrawlDbFilter pass null url to urlNormalizers kaveh@d1r2n2:/2locos/source/nutch/nutch.git$ I am using branch 2.x On 02/15/2013 06:02 PM, Lewis John Mcgibbney wrote: Hi Kaveh, Two seconds please. First let's set something straight. Nutch trunk is from here [0] Nutch 2.x is from here [1] Which one do you use? On Fri, Feb 15, 2013 at 4:53 PM, kaveh minooie wrote: but here is my problem. I tried to build nutch using version 1.4.3 of slf4j. I changed the version in both ivy.xml and pom.xml and cleaned my ivy cache, but ant still fetches version 1.6.1 when it builds the project. what am I missing? We can progress with the problem once we know what's actually going on. Thanks Lewis
slf4j issue with nutch 2.x over hadoop 1.1.1
Hi everyone, I recently built nutch 2.x from the trunk, but it crashes almost immediately at run time. it seems that there is a version incompatibility between the slf4j in hadoop, which is 1.4.3, and the one in nutch, which is 1.6.1 (actually it is between versions above 1.6 and those below it): $ PATH="$(pwd)/bin:$PATH" bin/nutch inject /temp/urls/ Error: Could not find or load main class org.apache.hadoop.util.PlatformName 13/02/15 15:47:15 INFO crawl.InjectorJob: InjectorJob: starting at 2013-02-15 15:47:15 13/02/15 15:47:15 INFO crawl.InjectorJob: InjectorJob: Injecting urlDir: /temp/urls Exception in thread "main" java.lang.NoSuchMethodError: org.slf4j.spi.LocationAwareLogger.log(Lorg/slf4j/Marker;Ljava/lang/String;ILjava/lang/String;[Ljava/lang/Object;Ljava/lang/Throwable;)V at org.apache.commons.logging.impl.SLF4JLocationAwareLog.debug(SLF4JLocationAwareLog.java:133) at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:139) at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:205) at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:184) at org.apache.hadoop.security.UserGroupInformation.isSecurityEnabled(UserGroupInformation.java:236) at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:477) at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:463) at org.apache.hadoop.mapreduce.JobContext.<init>(JobContext.java:80) at org.apache.hadoop.mapreduce.Job.<init>(Job.java:50) at org.apache.hadoop.mapreduce.Job.<init>(Job.java:54) at org.apache.nutch.util.NutchJob.<init>(NutchJob.java:37) at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:214) at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:251) at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:273) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:282) at
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.lang.reflect.Method.invoke(Unknown Source) at org.apache.hadoop.util.RunJar.main(RunJar.java:156) but here is my problem. I tried to build the nutch using ver 1.4.3 of the slf4j. i changed the version in both ivy.xml and pom.xml and cleaned my ivy cache but ant still fetches the version 1.6.1 when it builds the project. what am I missing? thanks, -- Kaveh Minooie www.plutoz.com
I think I found a bug --> multiple_values_encountered_for_non_multiValued_field_title
so I've been getting this error "multiple_values_encountered_for_non_multiValued_field_title" every once in a while when I am trying to run solrindex. I can now say that this is being caused by the index-more plugin (MoreIndexingFilter.java): private NutchDocument resetTitle(NutchDocument doc, ParseData data, String url) { String contentDisposition = data.getMeta(Metadata.CONTENT_DISPOSITION); if (contentDisposition == null) return doc; for (int i = 0; i < patterns.length; i++) { Matcher matcher = patterns[i].matcher(contentDisposition); if (matcher.find()) { doc.add("title", matcher.group(1)); break; } } return doc; } the problem here is that in my case this function is not resetting, but just adding, a new title. it seems that the original idea was that if CONTENT_DISPOSITION exists then the document will not have a title set by other plugins (namely index-basic). unfortunately this seems not to always be the case, as you can see by running this command: bin/nutch indexchecker http://www.2modern.com/site/gift-registry.html what I do get (the part that is relevant) is: tstamp : Tue Feb 21 13:18:13 PST 2012 type : text/html type : text type : html date : Tue Feb 21 13:18:13 PST 2012 url : http://www.2modern.com/site/gift-registry.html content : 2Modern Gift Registry Modern Furniture & Lighting items in cart 0 checkout Returning 2Modern cu user_ranking : 25.0 title : 2Modern Gift Registry title : gift-registry.html plutoz_ranking : 10.0 categories : Furniture Home contentLength : 12924 and as you can see there are 2 titles. I think it would be very easy to fix: just check whether a title already exists before setting the file name as the title: if (contentDisposition == null || null != doc.getField("title")) return doc; or, if the substitution must happen in the presence of CONTENT_DISPOSITION, at least remove the old one: if (matcher.find()) { doc.remove("title"); doc.add("title", matcher.group(1)); break; } now, that being said, the real problem here is why NutchDocument doesn't observe the schema.xml file and always assumes that all fields are multi-valued?
public void add(String name, Object value) {
  NutchField field = fields.get(name);
  if (field == null) {
    field = new NutchField(value);
    fields.put(name, field);
  } else {
    field.add(value);   // <--- always appends another value, never checks the schema
  }
}
-- Kaveh Minooie www.plutoz.com
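The behaviour of that add() can be reproduced standalone. A minimal, hypothetical model (plain Map instead of the real NutchField/NutchDocument classes) showing why two plugins each calling add("title", ...) yields two values, which Solr then rejects for a non-multiValued field:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal standalone model of NutchDocument.add(): every field silently
// becomes multi-valued, regardless of what schema.xml declares.
public class MultiValueSketch {
    static final Map<String, List<Object>> fields = new HashMap<>();

    static void add(String name, Object value) {
        // same shape as the real code: first call creates the field,
        // every later call appends instead of replacing
        fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    public static void main(String[] args) {
        add("title", "2Modern Gift Registry");   // set by index-basic
        add("title", "gift-registry.html");      // "reset" attempt by index-more
        System.out.println(fields.get("title")); // both values survive -> Solr error
    }
}
```

Either proposed fix above (skip when a title exists, or remove before adding) breaks this accumulation; making NutchDocument schema-aware would fix the whole class of errors.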
slf4j-log4j12 new version causes runtime error
I hope I am sending this to the correct list :) I just saw this issue when I was trying to run indexchecker in local mode. I solved it by changing this line in ivy.xml: conf="*->master" /> to conf="*->master" /> the pom.xml has the "correct" version, but ivy.xml seems to be overriding it. (if that was an obvious statement, I apologize, but I am new to ivy and the whole Maven stuff) this would also be helpful: http://www.slf4j.org/faq.html#IllegalAccessError and for the record this is the error I got: kaveh@index9:~/build/nutch/runtime/local$ bin/nutch indexchecker Exception in thread "main" java.lang.IllegalAccessError: tried to access field org.slf4j.impl.StaticLoggerBinder.SINGLETON from class org.slf4j.LoggerFactory at org.slf4j.LoggerFactory.staticInitialize(LoggerFactory.java:83) at org.slf4j.LoggerFactory.<clinit>(LoggerFactory.java:73) at org.apache.nutch.indexer.IndexingFiltersChecker.<clinit>(IndexingFiltersChecker.java:36) -- Kaveh Minooie www.plutoz.com
Re: issue in nutch-default.xml
I know, but the code expects to read the number of days: genDelay = job.getLong(GENERATOR_DELAY, 7L) * 3600L * 24L * 1000L; and as you can see, the default value, as mentioned in the description, is 7, and it is in days, not milliseconds. On 02/17/2012 12:33 AM, Markus Jelsma wrote: this is actually 7 days in milliseconds. so I checked the source code. the value, it seems, should in fact be 7. the current default value amounts to 1.6 thousand millennia. crawl.gen.delay 60480 This value, expressed in days, defines how long we should keep the lock on records in CrawlDb that were just selected for fetching. If these records are not updated in the meantime, the lock is canceled, i.e. they become eligible for selecting. The default value of this is 7 days. -- Kaveh Minooie www.plutoz.com
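The unit mix-up above is easy to check arithmetically. A hedged sketch (class and method names here are illustrative, not the actual Nutch code) reproducing what the generator effectively computes from the config value:

```java
// Arithmetic check of the crawl.gen.delay unit mix-up: the code multiplies
// the config value by ms-per-day, so the value must be given in days.
public class GenDelayCheck {
    static final long MS_PER_DAY = 3600L * 24L * 1000L;

    // what the code effectively does: job.getLong(GENERATOR_DELAY, 7L) * MS_PER_DAY
    static long genDelayMs(long configValue) {
        return configValue * MS_PER_DAY;
    }

    static long approxYears(long ms) {
        return ms / (365L * MS_PER_DAY);
    }

    public static void main(String[] args) {
        // intended default: 7 (days) -> exactly one week in milliseconds
        System.out.println(genDelayMs(7L));                       // 604800000
        // a default already expressed as 7 days in milliseconds gets multiplied again
        System.out.println(approxYears(genDelayMs(604800000L)));  // ~1.66 million years
    }
}
```

So a value that is itself "7 days in milliseconds" blows up to roughly 1.6 thousand millennia once the code applies its own days-to-ms conversion, which is exactly the mismatch being reported.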
Re: make nutch plugin to get termfreqvectors
Hi, I am having a similar problem in that I have to update a bunch of plugins that were written for nutch 1.1. It would be great if we could get some hints. Thanks, On 01/19/2012 03:36 PM, Ale wrote: Hi, I'm quite new to working with nutch plugins. I'm trying to save the termfreqvectors of the documents. I'm using nutch 1.4. I've seen that I had to use, in the plugin class, the method addFieldOptions, like: -- public void addIndexBackendOptions(Configuration conf) { // add lucene options // host is un-stored, indexed and tokenized LuceneWriter.addFieldOptions("host", LuceneWriter.STORE.NO, LuceneWriter.INDEX.TOKENIZED, conf); // site is un-stored, indexed and un-tokenized LuceneWriter.addFieldOptions("site", LuceneWriter.STORE.NO, LuceneWriter.INDEX.UNTOKENIZED, conf); // url is both stored and indexed, so it's both searchable and returned LuceneWriter.addFieldOptions("url", LuceneWriter.STORE.YES, LuceneWriter.INDEX.TOKENIZED, conf); // content is indexed, so that it's searchable, but not stored in the index LuceneWriter.addFieldOptions("content", LuceneWriter.STORE.NO, LuceneWriter.INDEX.TOKENIZED, conf); // anchors are indexed, so they're searchable, but not stored in the index LuceneWriter.addFieldOptions("anchor", LuceneWriter.STORE.NO, LuceneWriter.INDEX.TOKENIZED, conf); // title is indexed and stored so that it can be displayed LuceneWriter.addFieldOptions("title", LuceneWriter.STORE.YES, LuceneWriter.INDEX.TOKENIZED, conf); } The problem is, as far as I have seen, that LuceneWriter no longer exists in 1.4 (Lucene 3.5). Which is the correct way to do it? Thank you very much in advance! -- -- Kaveh Minooie www.plutoz.com