[jira] [Resolved] (NUTCH-1640) OOM in ParseSegment Phase
[ https://issues.apache.org/jira/browse/NUTCH-1640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche resolved NUTCH-1640.

Resolution: Fixed

Committed revision 1529802. Thanks, Mitesh.

OOM in ParseSegment Phase

Key: NUTCH-1640
URL: https://issues.apache.org/jira/browse/NUTCH-1640
Project: Nutch
Issue Type: Bug
Components: parser
Affects Versions: 1.7
Environment: RHEL 6.2 x86_64
Reporter: Mitesh Singh Jat
Attachments: NUTCH-1640.patch

The Nutch ParseSegment phase fails after 2 runs on the same TaskTracker, with the following exception:

{noformat}
Exception in thread "main" org.apache.hadoop.ipc.RemoteException: java.io.IOException: java.lang.OutOfMemoryError: unable to create new native thread
    at java.lang.Thread.start0(Native Method)
    at java.lang.Thread.start(Thread.java:640)
    at org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.kill(JvmManager.java:553)
    at org.apache.hadoop.mapred.JvmManager$JvmManagerForType.killJvmRunner(JvmManager.java:317)
    at org.apache.hadoop.mapred.JvmManager$JvmManagerForType.killJvm(JvmManager.java:297)
    at org.apache.hadoop.mapred.JvmManager$JvmManagerForType.taskKilled(JvmManager.java:289)
    at org.apache.hadoop.mapred.JvmManager.taskKilled(JvmManager.java:158)
    at org.apache.hadoop.mapred.TaskRunner.kill(TaskRunner.java:802)
    at org.apache.hadoop.mapred.TaskTracker$TaskInProgress.kill(TaskTracker.java:3315)
    at org.apache.hadoop.mapred.TaskTracker$TaskInProgress.jobHasFinished(TaskTracker.java:3287)
    at org.apache.hadoop.mapred.TaskTracker.purgeTask(TaskTracker.java:2316)
    at org.apache.hadoop.mapred.TaskTracker.fatalError(TaskTracker.java:3710)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:587)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1444)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1440)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1438)
    at org.apache.hadoop.ipc.Client.call(Client.java:1118)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:229)
    at $Proxy1.fatalError(Unknown Source)
    at org.apache.hadoop.mapred.Child.main(Child.java:310)
{noformat}

In contrast, similar parsing done in the Nutch Fetcher phase (fetcher.parse=true, fetcher.store.content=false) does not cause this issue. Comparing the code of Fetcher and ParseSegment, the issue appears to be related to the creation of the parseResult for each URL in ParseSegment.java:

{code}
ParseResult parseResult = null;
try {
  parseResult = new ParseUtil(getConf()).parse(content); // *
} catch (Exception e) {
  LOG.warn("Error parsing: " + key + ": " + StringUtils.stringifyException(e));
  return;
}
{code}

--
This message was sent by Atlassian JIRA
(v6.1#6144)
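One plausible remedy for the per-URL construction flagged in NUTCH-1640 is to build the expensive helper once and reuse it for every record. The following is only a minimal sketch of that idea, not the committed patch: `ExpensiveParser` and `SegmentMapper` are hypothetical stand-ins for `ParseUtil` and the ParseSegment mapper, and the instance counter exists purely to make the difference observable.

```java
/** Hypothetical stand-in for an expensive helper such as ParseUtil. */
class ExpensiveParser {
    static int instancesCreated = 0;       // demo-only counter

    ExpensiveParser() {
        instancesCreated++;                // imagine costly setup happening here
    }

    String parse(String content) {
        return content.trim();
    }
}

/** Hypothetical mapper that reuses one parser for all records it sees. */
class SegmentMapper {
    private ExpensiveParser parser;        // created lazily, then reused

    String map(String content) {
        if (parser == null) {
            parser = new ExpensiveParser();
        }
        return parser.parse(content);
    }
}

class ReuseParserDemo {
    public static void main(String[] args) {
        SegmentMapper mapper = new SegmentMapper();
        String[] pages = {" a ", " b ", " c "};
        for (String page : pages) {
            mapper.map(page);
        }
        System.out.println("parsers created: " + ExpensiveParser.instancesCreated);
    }
}
```

With the construction moved out of the per-record path, the number of helper instances no longer grows with the number of URLs in the segment.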
[jira] [Commented] (NUTCH-1562) Order of execution for scoring filters
[ https://issues.apache.org/jira/browse/NUTCH-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13788039#comment-13788039 ]

Julien Nioche commented on NUTCH-1562:

Hi Seb,

You are right about the order from plugin.includes; this had completely passed me by. I really like your patch: it makes loads of sense to centralize that code, and it will make it simpler to address NUTCH-1606, for instance. I will commit your patch shortly with a minor modification (getOrderedPlugins() is synchronized).

Thanks

Order of execution for scoring filters

Key: NUTCH-1562
URL: https://issues.apache.org/jira/browse/NUTCH-1562
Project: Nutch
Issue Type: Bug
Components: documentation
Affects Versions: 1.6, 2.1
Reporter: Julien Nioche
Fix For: 2.3, 1.8
Attachments: NUTCH-1562-trunk.patch, NUTCH-1562-trunk.patch.v2, NUTCH-1562-trunk.patch.v3

The documentation in nutch-default.xml states that:

{quote}
<property>
  <name>scoring.filter.order</name>
  <value></value>
  <description>The order in which scoring filters are applied. This may be left empty (in which case all available scoring filters will be applied in the order defined in plugin-includes and plugin-excludes), or a space separated list of implementation classes.
  </description>
</property>
{quote}

However, if no order is specified, the filters are ordered randomly and not in the order defined in plugin-includes. The other *order parameters (e.g. urlfilter.order) are documented differently and are loaded and applied in system-defined order, which corresponds to what the code does.

The attached patch is for 1.x and brings the code in line with the documentation by ordering the filters according to the order of the plugins, which gives users more control without having to specify the classes explicitly in scoring.filter.order. We could extend the same idea to the other *order params.
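To make the "space separated list of implementation classes" in scoring.filter.order concrete, here is a small sketch of how such a property could be applied to a set of discovered filters. It is not the code from the NUTCH-1562 patch or from PluginRepository: `FilterOrdering` and the `org.example.*` class names are hypothetical, and the rule that unlisted filters are appended in discovery order is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class FilterOrdering {
    /**
     * Orders the available filter class names by a space-separated order string.
     * Assumption: names missing from the order string keep their discovery
     * order and are appended after the explicitly ordered ones.
     */
    static List<String> order(List<String> available, String orderProperty) {
        List<String> result = new ArrayList<>();
        if (orderProperty != null && !orderProperty.trim().isEmpty()) {
            for (String name : orderProperty.trim().split("\\s+")) {
                // Ignore unknown names and duplicates in the property value.
                if (available.contains(name) && !result.contains(name)) {
                    result.add(name);
                }
            }
        }
        for (String name : available) {
            if (!result.contains(name)) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> available = Arrays.asList(
            "org.example.LinkFilter",
            "org.example.OpicFilter",
            "org.example.DepthFilter");
        // Explicit order: listed classes first, the rest afterwards.
        System.out.println(order(available, "org.example.OpicFilter org.example.LinkFilter"));
        // Empty property: discovery order is kept unchanged.
        System.out.println(order(available, ""));
    }
}
```

The point of a deterministic fallback is that two runs with the same configuration apply the filters in the same sequence, which a hash-ordered collection of extensions does not guarantee.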
[jira] [Resolved] (NUTCH-1562) Order of execution for scoring filters
[ https://issues.apache.org/jira/browse/NUTCH-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche resolved NUTCH-1562.

Resolution: Fixed

Committed revision 1529813.
[jira] [Updated] (NUTCH-1588) Port NUTCH-1245 URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again to 2.x
[ https://issues.apache.org/jira/browse/NUTCH-1588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Talat UYARER updated NUTCH-1588:

Attachment: NUTCH-1588-final.patch

I updated the coding style. Thanks for the notice, Sebastian Nagel.

Port NUTCH-1245 URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again to 2.x

Key: NUTCH-1588
URL: https://issues.apache.org/jira/browse/NUTCH-1588
Project: Nutch
Issue Type: Bug
Reporter: Lewis John McGibbney
Priority: Critical
Fix For: 2.3
Attachments: NUTCH-1588-final.patch, NUTCH-1588.patch

A document gone with 404 after db.fetch.interval.max (90 days) has passed is fetched over and over again. Although its fetch status is fetch_gone, its status in CrawlDb remains db_unfetched. Consequently, this document will be generated and fetched in every cycle from now on.

To reproduce:
# create a CrawlDatum in CrawlDb whose retry interval hits db.fetch.interval.max (I manipulated shouldFetch() in AbstractFetchSchedule to achieve this)
# now this URL is fetched again
# but when updating CrawlDb with the fetch_gone, the CrawlDatum is reset to db_unfetched and the retry interval is fixed to 0.9 * db.fetch.interval.max (81 days)
# this does not change with every generate-fetch-update cycle, shown here for two segments:

{noformat}
/tmp/testcrawl/segments/20120105161430
SegmentReader: get 'http://localhost/page_gone'
Crawl Generate::
Status: 1 (db_unfetched)
Fetch time: Thu Jan 05 16:14:21 CET 2012
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 6998400 seconds (81 days)
Metadata: _ngt_: 1325776461784_pst_: notfound(14), lastModified=0: http://localhost/page_gone

Crawl Fetch::
Status: 37 (fetch_gone)
Fetch time: Thu Jan 05 16:14:48 CET 2012
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 6998400 seconds (81 days)
Metadata: _ngt_: 1325776461784_pst_: notfound(14), lastModified=0: http://localhost/page_gone

/tmp/testcrawl/segments/20120105161631
SegmentReader: get 'http://localhost/page_gone'
Crawl Generate::
Status: 1 (db_unfetched)
Fetch time: Thu Jan 05 16:16:23 CET 2012
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 6998400 seconds (81 days)
Metadata: _ngt_: 1325776583451_pst_: notfound(14), lastModified=0: http://localhost/page_gone

Crawl Fetch::
Status: 37 (fetch_gone)
Fetch time: Thu Jan 05 16:20:05 CET 2012
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 6998400 seconds (81 days)
Metadata: _ngt_: 1325776583451_pst_: notfound(14), lastModified=0: http://localhost/page_gone
{noformat}

As far as I can see, it's caused by setPageGoneSchedule() in AbstractFetchSchedule. Some pseudo-code:

{code}
setPageGoneSchedule (called from update / CrawlDbReducer.reduce):
    datum.fetchInterval = 1.5 * datum.fetchInterval // now 1.5 * 0.9 * maxInterval
    datum.fetchTime = fetchTime + datum.fetchInterval // see NUTCH-516
    if (maxInterval < datum.fetchInterval) // necessarily true
        forceRefetch()

forceRefetch:
    if (datum.fetchInterval > maxInterval) // true because it's 1.35 * maxInterval
        datum.fetchInterval = 0.9 * maxInterval
    datum.status = db_unfetched

shouldFetch (called from generate / Generator.map):
    if ((datum.fetchTime - curTime) > maxInterval)
        // always true if the crawler is launched in short intervals
        // (lower than 0.35 * maxInterval)
        datum.fetchTime = curTime // forces a refetch
{code}

After setPageGoneSchedule is called via update, the state is db_unfetched and the retry interval is 0.9 * db.fetch.interval.max (81 days).

Although the fetch time in the CrawlDb is far in the future

{noformat}
% nutch readdb testcrawl/crawldb -url http://localhost/page_gone
URL: http://localhost/page_gone
Version: 7
Status: 1 (db_unfetched)
Fetch time: Sun May 06 05:20:05 CEST 2012
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 6998400 seconds (81 days)
Score: 1.0
Signature: null
Metadata: _pst_: notfound(14), lastModified=0: http://localhost/page_gone
{noformat}

the URL is generated again because (fetch time - current time) is larger than db.fetch.interval.max. The retry interval (datum.fetchInterval) oscillates between 0.9 and 1.35 times db.fetch.interval.max, and the fetch time is always close to current time + 1.35 * db.fetch.interval.max.

It's possibly a side effect of NUTCH-516, and may be related to NUTCH-578.
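The oscillation described in the pseudo-code can be checked with a small simulation. This is a sketch that mirrors the pseudo-code only, not the actual AbstractFetchSchedule source: the class `GoneScheduleSimulation`, its fields, and the one-hour cycle spacing are hypothetical, and the final `fetchTime <= curTime` test in `shouldFetch` is an assumption about when a datum counts as due.

```java
class GoneScheduleSimulation {
    // db.fetch.interval.max: 90 days, in seconds
    static final long MAX_INTERVAL = 90L * 24 * 3600;

    long fetchInterval = MAX_INTERVAL * 9 / 10;   // 0.9 * maxInterval = 6998400 s (81 days)
    long fetchTime = 0;
    String status = "db_unfetched";

    /** Mirrors the pseudo-code for setPageGoneSchedule. */
    void setPageGoneSchedule(long fetchedAt) {
        fetchInterval = fetchInterval * 3 / 2;    // 1.5 * 0.9 * maxInterval = 1.35 * maxInterval
        fetchTime = fetchedAt + fetchInterval;
        if (MAX_INTERVAL < fetchInterval) {       // necessarily true
            forceRefetch();
        }
    }

    /** Mirrors the pseudo-code for forceRefetch. */
    void forceRefetch() {
        if (fetchInterval > MAX_INTERVAL) {
            fetchInterval = MAX_INTERVAL * 9 / 10; // back to 0.9 * maxInterval
        }
        status = "db_unfetched";
    }

    /** Mirrors shouldFetch: a fetch time too far in the future is reset to now. */
    boolean shouldFetch(long curTime) {
        if (fetchTime - curTime > MAX_INTERVAL) {
            fetchTime = curTime;                   // forces a refetch
        }
        return fetchTime <= curTime;               // assumption: due when not in the future
    }

    public static void main(String[] args) {
        GoneScheduleSimulation datum = new GoneScheduleSimulation();
        long now = 0;
        for (int cycle = 1; cycle <= 3; cycle++) {
            boolean generated = datum.shouldFetch(now);
            datum.setPageGoneSchedule(now);        // the fetch returned 404 again
            System.out.println("cycle " + cycle
                + ": generated=" + generated
                + " status=" + datum.status
                + " interval=" + datum.fetchInterval + "s"
                + " fetchTime-now=" + (datum.fetchTime - now) + "s");
            now += 3600;                           // next cycle one hour later
        }
    }
}
```

In every cycle the stored interval ends at 6998400 seconds while the fetch time sits 1.35 * maxInterval ahead, so the next `shouldFetch` call resets it and the URL is generated again, matching the segment dumps above.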
[jira] [Commented] (NUTCH-1562) Order of execution for scoring filters
[ https://issues.apache.org/jira/browse/NUTCH-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13788059#comment-13788059 ]

Hudson commented on NUTCH-1562:

SUCCESS: Integrated in Nutch-trunk #2380 (See [https://builds.apache.org/job/Nutch-trunk/2380/])
NUTCH-1562 (jnioche: http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1529813)
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/conf/nutch-default.xml
* /nutch/trunk/src/java/org/apache/nutch/indexer/IndexingFilters.java
* /nutch/trunk/src/java/org/apache/nutch/net/URLFilters.java
* /nutch/trunk/src/java/org/apache/nutch/parse/HtmlParseFilters.java
* /nutch/trunk/src/java/org/apache/nutch/plugin/PluginRepository.java
* /nutch/trunk/src/java/org/apache/nutch/scoring/ScoringFilters.java
[jira] [Updated] (NUTCH-1606) Check that Factory classes use the cache in a thread safe way
[ https://issues.apache.org/jira/browse/NUTCH-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-1606:

Attachment: NUTCH-1606.patch

Synchronized methods on ObjectCache + calls from FetchScheduleFactory and SignatureFactory. I haven't checked calls from URLNormalizers or ParserFactory, but the others look fine.

Check that Factory classes use the cache in a thread safe way

Key: NUTCH-1606
URL: https://issues.apache.org/jira/browse/NUTCH-1606
Project: Nutch
Issue Type: Task
Affects Versions: 1.7, 2.2.1
Reporter: Julien Nioche
Priority: Minor
Attachments: NUTCH-1606.patch

I found in [NUTCH-1604] that the ProtocolFactory class was not handling access to the cache properly. The same mechanism is used in other Factory classes, so we should make sure that they are properly synchronized and make ObjectCache thread safe as well.
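The two halves of this task (synchronized cache methods, and synchronized calls from the factories) address different problems, which a short sketch may help show. The classes below are hypothetical and are not the Nutch ObjectCache or factory code: synchronized accessors protect the map itself, while the factory additionally needs the check-then-create sequence to be atomic so that two threads cannot both build an instance.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical minimal cache; not the Nutch ObjectCache class. */
class SimpleObjectCache {
    private final Map<String, Object> objects = new HashMap<>();

    synchronized Object getObject(String key) {
        return objects.get(key);
    }

    synchronized void setObject(String key, Object value) {
        objects.put(key, value);
    }
}

/** Hypothetical factory that creates each cached object at most once. */
class CachedFactory {
    private final SimpleObjectCache cache = new SimpleObjectCache();

    Object getOrCreate(String key, Supplier<Object> creator) {
        // Lookup and creation happen under one lock; synchronized accessors
        // alone would still let two threads both see null and both create.
        synchronized (cache) {
            Object value = cache.getObject(key);
            if (value == null) {
                value = creator.get();
                cache.setObject(key, value);
            }
            return value;
        }
    }
}

class CachedFactoryDemo {
    public static void main(String[] args) throws InterruptedException {
        CachedFactory factory = new CachedFactory();
        AtomicInteger created = new AtomicInteger();
        Thread[] threads = new Thread[8];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(() ->
                factory.getOrCreate("signature", () -> {
                    created.incrementAndGet();
                    return new Object();
                }));
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }
        System.out.println("instances created: " + created.get());
    }
}
```

Locking on the cache object reuses the same monitor as its synchronized methods, so the nested calls are reentrant and no second lock is introduced.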
[jira] [Created] (NUTCH-1652) Avoid instanciation of MimeUtil for each Content object created
Julien Nioche created NUTCH-1652:

Summary: Avoid instanciation of MimeUtil for each Content object created
Key: NUTCH-1652
URL: https://issues.apache.org/jira/browse/NUTCH-1652
Project: Nutch
Issue Type: Improvement
Affects Versions: 1.7
Reporter: Julien Nioche

Content objects instantiate and hold a MimeUtil in the constructor used by the HttpBase class. This is wasteful and unnecessarily slows down the creation of Content objects, as the MimeUtil creates a new Tika instance, reads from the configuration, etc.

Instead, we could create a single instance of the MimeUtil class in HttpBase and pass it to a new Content constructor:

{code}
public Content(String url, String base, byte[] content, String contentType, Metadata metadata, MimeUtil mime)
{code}

We would also need to make sure that synchronisation is handled properly in MimeUtil (especially for the calls to Tika), as the creation of the Content is done in a multithreaded environment.
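The proposal amounts to injecting one shared helper through the constructor instead of building a new one per object. A minimal sketch of that shape follows; `MimeDetector`, `PageContent` and `ProtocolBase` are hypothetical stand-ins for MimeUtil, Content and HttpBase, the detection logic is a placeholder, and marking `detect` as synchronized is a conservative assumption rather than a statement about what MimeUtil or Tika require.

```java
/** Hypothetical stand-in for MimeUtil; the real class wraps Tika. */
class MimeDetector {
    static int instancesCreated = 0;       // demo-only counter

    MimeDetector() {
        instancesCreated++;                // imagine Tika setup and config reads here
    }

    // Conservative assumption: serialize access because instances are shared.
    synchronized String detect(String url, String declaredType) {
        if (declaredType != null) {
            return declaredType;
        }
        return url.endsWith(".html") ? "text/html" : "application/octet-stream";
    }
}

/** Hypothetical, simplified Content: it receives the detector instead of creating one. */
class PageContent {
    final String url;
    final byte[] content;
    final String contentType;

    PageContent(String url, byte[] content, String declaredType, MimeDetector mime) {
        this.url = url;
        this.content = content;
        this.contentType = mime.detect(url, declaredType);
    }
}

/** Hypothetical protocol base class that owns the single detector. */
class ProtocolBase {
    private final MimeDetector mime = new MimeDetector();

    PageContent fetch(String url) {
        // Placeholder for a real fetch; the shared detector is passed along.
        return new PageContent(url, new byte[0], null, mime);
    }
}

class SharedMimeDemo {
    public static void main(String[] args) {
        ProtocolBase protocol = new ProtocolBase();
        for (int i = 0; i < 1000; i++) {
            protocol.fetch("http://localhost/page" + i + ".html");
        }
        System.out.println("detectors created: " + MimeDetector.instancesCreated);
    }
}
```

Keeping the older constructor alongside the new one would let existing callers continue to work while the protocol plugins switch to the shared instance.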
[jira] [Commented] (NUTCH-961) Expose Tika's boilerpipe support
[ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13788911#comment-13788911 ]

Nguyen Manh Tien commented on NUTCH-961:

I used patch NUTCH-961-2.1-v2.patch for nutch-2.2.1 and found that the text parsed by nutch-tika (with boilerpipe support) is different from the text parsed by the demo site http://boilerpipe-web.appspot.com. I upgraded to boilerpipe 1.2.0 to match the demo site.

The URL I tested is http://www.medhelp.org/posts/Eye-Care/EYE/show/1199003

The text from nutch-tika (I use ArticleExtractor):

EYE - Eye Care - MedHelp Experts My MedHelp Login or Signup Eye Care Community EYE Post a Question « Back to Community About This Community: This patient support community is for discussions relating to eye care, cataracts , glaucoma , retinal detachment , eye infections, misaligned eyes , intra-ocular implants, refractive surgery ( LASIK and CK), glasses, contact lenses, amblyopia , eye injuries, dry eyes , ocular allergy, eye pain and discomfort, pediatric eye disorders, eyelid and tearduct surgery, poor eyesight, and eye surgery. View community archives Font Size: A A ABackground: Search this Community: Go 3 Comments EYE My son is 4 and half years old and have + no .Our doctor told me six months ago that + no. decreases as time passed and he not to wear glasses after two -three years if he wears glasses regularly.But yesterday he told me that his + No. increases and he have to wear glasses always.If you wish u can go for laser surgery after 14 years i.e. when my son will have age of 17 years.please help me what to do ? Watch this discussion Tweet Related Discussions How to decide if glasses are needed for children? (8 replies):How can a Doctor tell if a child has amblyopia? Is t... [more] Astigmatism (1 replies):My 5 year old son has severe astigmatism. He wears glass... [more] Can someone help me in regards to my sons eyes? (6 replies):I had noticed my son had, had an eye issue when he was a... [more] Blurred vision with glasses (2 replies):Hi, I recently got new glasses and but the vision in my ... [more] Eyesight getting worse (2 replies):Hello! So here's the story. My eyesight had never been ... [more]

And from the demo:

3 Comments EYE My son is 4 and half years old and have + no .Our doctor told me six months ago that + no. decreases as time passed and he not to wear glasses after two -three years if he wears glasses regularly.But yesterday he told me that his + No. increases and he have to wear glasses always.If you wish u can go for laser surgery after 14 years i.e. when my son will have age of 17 years.please help me what to do ?

The result from the demo is much better for this URL. So parse-tika/boilerpipe not only extracts the main content from the page but also includes the title and the content of other nodes. Is this expected?

Expose Tika's boilerpipe support

Key: NUTCH-961
URL: https://issues.apache.org/jira/browse/NUTCH-961
Project: Nutch
Issue Type: New Feature
Components: parser
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Fix For: 2.3, 1.8
Attachments: BoilerpipeExtractorRepository.java, NUTCH-961-1.3-3.patch, NUTCH-961-1.3-tikaparser1.patch, NUTCH-961-1.3-tikaparser.patch, NUTCH-961-1.4-dombuilder-1.patch, NUTCH-961-1.5-1.patch, NUTCH-961-1.8-1.patch, NUTCH-961-2.1-v1.patch, NUTCH-961-2.1-v2.patch, NUTCH-961v2.patch

Tika 0.8 comes with the Boilerpipe content handler, which can be used to extract boilerplate content from HTML pages. We should see how we can expose Boilerpipe in the Nutch configuration.