[jira] [Resolved] (NUTCH-1640) OOM in ParseSegment Phase

2013-10-07 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche resolved NUTCH-1640.
--

Resolution: Fixed

Committed revision 1529802.

Thanks Mitesh. 

 OOM in ParseSegment Phase
 -

 Key: NUTCH-1640
 URL: https://issues.apache.org/jira/browse/NUTCH-1640
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.7
 Environment: RHEL 6.2 x86_64
Reporter: Mitesh Singh Jat
 Attachments: NUTCH-1640.patch


 The nutch ParseSegment phase fails after 2 runs on same TaskTracker, with the 
 following Exception:
 {noformat}
 Exception in thread "main" org.apache.hadoop.ipc.RemoteException: 
 java.io.IOException: java.lang.OutOfMemoryError: unable to create new native 
 thread
   at java.lang.Thread.start0(Native Method)
   at java.lang.Thread.start(Thread.java:640)
   at 
 org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.kill(JvmManager.java:553)
   at 
 org.apache.hadoop.mapred.JvmManager$JvmManagerForType.killJvmRunner(JvmManager.java:317)
   at 
 org.apache.hadoop.mapred.JvmManager$JvmManagerForType.killJvm(JvmManager.java:297)
   at 
 org.apache.hadoop.mapred.JvmManager$JvmManagerForType.taskKilled(JvmManager.java:289)
   at org.apache.hadoop.mapred.JvmManager.taskKilled(JvmManager.java:158)
   at org.apache.hadoop.mapred.TaskRunner.kill(TaskRunner.java:802)
   at 
 org.apache.hadoop.mapred.TaskTracker$TaskInProgress.kill(TaskTracker.java:3315)
   at 
 org.apache.hadoop.mapred.TaskTracker$TaskInProgress.jobHasFinished(TaskTracker.java:3287)
   at org.apache.hadoop.mapred.TaskTracker.purgeTask(TaskTracker.java:2316)
   at 
 org.apache.hadoop.mapred.TaskTracker.fatalError(TaskTracker.java:3710)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
   at java.lang.reflect.Method.invoke(Method.java:597)
   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:587)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1444)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1440)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:396)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1438)
   at org.apache.hadoop.ipc.Client.call(Client.java:1118)
   at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:229)
   at $Proxy1.fatalError(Unknown Source)
   at org.apache.hadoop.mapred.Child.main(Child.java:310)
 {noformat}
 In contrast, the same parsing done in the Nutch Fetcher phase (fetcher.parse=true, 
 fetcher.store.content=false) does not cause this problem.
 Comparing the code of Fetcher and ParseSegment, the issue appears to be
 related to how parseResult is created for each URL in ParseSegment.java
 (see the line marked with *):
 {code}
 ParseResult parseResult = null;
 try {
   parseResult = new ParseUtil(getConf()).parse(content); // *
 } catch (Exception e) {
   LOG.warn("Error parsing: " + key + ": " + StringUtils.stringifyException(e));
   return;
 }
 {code}
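
 A minimal sketch of the general remedy for this kind of problem: build the expensive helper once and reuse it for every record, instead of constructing it per URL. This is not the committed patch; CostlyParser and ReusingMapper are hypothetical stand-ins for ParseUtil and the ParseSegment mapper, and the instance counter exists only for the demonstration.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for an expensive helper such as ParseUtil. */
class CostlyParser {
  static final AtomicInteger INSTANCES = new AtomicInteger();

  CostlyParser() {
    INSTANCES.incrementAndGet(); // count constructions for the demo
  }

  String parse(String content) {
    return content.trim();
  }
}

/** Hypothetical mapper-like class; not the actual ParseSegment code. */
class ReusingMapper {
  private CostlyParser parser; // created once, on first use

  String map(String content) {
    if (parser == null) {
      parser = new CostlyParser();
    }
    return parser.parse(content);
  }
}

class ParserReuseSketch {
  public static void main(String[] args) {
    ReusingMapper mapper = new ReusingMapper();
    for (String doc : new String[] {" a ", " b ", " c "}) {
      mapper.map(doc);
    }
    System.out.println("instances created: " + CostlyParser.INSTANCES.get());
  }
}
```

 The lazy field assumes one mapper instance is used by a single thread at a time; a shared mapper would need synchronization around the initialization.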



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (NUTCH-1562) Order of execution for scoring filters

2013-10-07 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13788039#comment-13788039
 ] 

Julien Nioche commented on NUTCH-1562:
--

Hi Seb

You are right about the order from plugin.includes, this had completely passed 
me by. I really like your patch, it makes loads of sense to centralize that 
code and will make it simpler to address NUTCH-1606 for instance.

Will commit your patch shortly with a minor modification (getOrderedPlugins() 
is synchronized)

Thanks 

 Order of execution for scoring filters
 --

 Key: NUTCH-1562
 URL: https://issues.apache.org/jira/browse/NUTCH-1562
 Project: Nutch
  Issue Type: Bug
  Components: documentation
Affects Versions: 1.6, 2.1
Reporter: Julien Nioche
 Fix For: 2.3, 1.8

 Attachments: NUTCH-1562-trunk.patch, NUTCH-1562-trunk.patch.v2, 
 NUTCH-1562-trunk.patch.v3


 The documentation in nutch-default.xml states that :
 {quote}
 <property>
   <name>scoring.filter.order</name>
   <value></value>
   <description>The order in which scoring filters are applied.
   This may be left empty (in which case all available scoring
   filters will be applied in the order defined in plugin-includes
   and plugin-excludes), or a space separated list of implementation
   classes.
   </description>
 </property>
 {quote}
 However, if no order is specified, the filters are ordered randomly and not in 
 the order defined in plugin-includes.
 The other *order parameters (e.g. urlfilter.order) are documented differently: 
 they are loaded and applied in a system-defined order, which 
 corresponds to what the code does.
 The attached patch is for 1.x and brings the code in line with the 
 documentation by ordering the filters according to the order of the plugins, 
 which gives users more control without having to specify the classes 
 explicitly in scoring.filter.order.
 We could extend the same idea to the other *order params.
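
 A rough sketch of the ordering idea, not the attached patch: sort the filter classes by the position of their owning plugin in an ordered list of plugin ids. PluginOrderSketch, orderByPlugin, the class names and the plugin ids are all hypothetical, and pluginOrder is assumed to be an already resolved list rather than the raw plugin.includes value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PluginOrderSketch {
  /**
   * Returns the filter class names sorted by the position of their plugin
   * in pluginOrder. Filters of unlisted plugins go last; ties are broken
   * by class name so the result is deterministic.
   */
  static List<String> orderByPlugin(Map<String, String> filterToPlugin,
                                    List<String> pluginOrder) {
    List<String> ordered = new ArrayList<>(filterToPlugin.keySet());
    ordered.sort(Comparator
        .comparingInt((String f) -> rank(filterToPlugin.get(f), pluginOrder))
        .thenComparing(Comparator.naturalOrder()));
    return ordered;
  }

  private static int rank(String pluginId, List<String> pluginOrder) {
    int i = pluginOrder.indexOf(pluginId);
    return i < 0 ? Integer.MAX_VALUE : i;
  }

  public static void main(String[] args) {
    Map<String, String> filters = new HashMap<>();
    filters.put("OpicFilter", "scoring-opic"); // hypothetical names
    filters.put("LinkFilter", "scoring-link");
    List<String> order = Arrays.asList("scoring-link", "scoring-opic");
    System.out.println(orderByPlugin(filters, order));
  }
}
```

 Breaking ties by class name is a design choice of this sketch; it only serves to make the order reproducible when several filters belong to the same plugin.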





[jira] [Resolved] (NUTCH-1562) Order of execution for scoring filters

2013-10-07 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche resolved NUTCH-1562.
--

Resolution: Fixed

Committed revision 1529813.







[jira] [Updated] (NUTCH-1588) Port NUTCH-1245 URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again to 2.x

2013-10-07 Thread Talat UYARER (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Talat UYARER updated NUTCH-1588:


Attachment: NUTCH-1588-final.patch

I updated the coding style. Thanks for pointing it out, Sebastian Nagel.

 Port NUTCH-1245 URL gone with 404 after db.fetch.interval.max stays 
 db_unfetched in CrawlDb and is generated over and over again to 2.x
 ---

 Key: NUTCH-1588
 URL: https://issues.apache.org/jira/browse/NUTCH-1588
 Project: Nutch
  Issue Type: Bug
Reporter: Lewis John McGibbney
Priority: Critical
 Fix For: 2.3

 Attachments: NUTCH-1588-final.patch, NUTCH-1588.patch


 A document that is gone (404) after db.fetch.interval.max (90 days) has passed
 is fetched over and over again: although its fetch status is fetch_gone,
 its status in CrawlDb remains db_unfetched. Consequently, this document will
 be generated and fetched in every cycle from now on.
 To reproduce:
 # create a CrawlDatum in CrawlDb whose retry interval hits 
 db.fetch.interval.max (I manipulated the shouldFetch() in 
 AbstractFetchSchedule to achieve this)
 # now this URL is fetched again
 # but when updating CrawlDb with the fetch_gone the CrawlDatum is reset to 
 db_unfetched, the retry interval is fixed to 0.9 * db.fetch.interval.max (81 
 days)
 # this does not change with every generate-fetch-update cycle, here for two 
 segments:
 {noformat}
 /tmp/testcrawl/segments/20120105161430
 SegmentReader: get 'http://localhost/page_gone'
 Crawl Generate::
 Status: 1 (db_unfetched)
 Fetch time: Thu Jan 05 16:14:21 CET 2012
 Modified time: Thu Jan 01 01:00:00 CET 1970
 Retries since fetch: 0
 Retry interval: 6998400 seconds (81 days)
 Metadata: _ngt_: 1325776461784_pst_: notfound(14), lastModified=0: 
 http://localhost/page_gone
 Crawl Fetch::
 Status: 37 (fetch_gone)
 Fetch time: Thu Jan 05 16:14:48 CET 2012
 Modified time: Thu Jan 01 01:00:00 CET 1970
 Retries since fetch: 0
 Retry interval: 6998400 seconds (81 days)
 Metadata: _ngt_: 1325776461784_pst_: notfound(14), lastModified=0: 
 http://localhost/page_gone
 /tmp/testcrawl/segments/20120105161631
 SegmentReader: get 'http://localhost/page_gone'
 Crawl Generate::
 Status: 1 (db_unfetched)
 Fetch time: Thu Jan 05 16:16:23 CET 2012
 Modified time: Thu Jan 01 01:00:00 CET 1970
 Retries since fetch: 0
 Retry interval: 6998400 seconds (81 days)
 Metadata: _ngt_: 1325776583451_pst_: notfound(14), lastModified=0: 
 http://localhost/page_gone
 Crawl Fetch::
 Status: 37 (fetch_gone)
 Fetch time: Thu Jan 05 16:20:05 CET 2012
 Modified time: Thu Jan 01 01:00:00 CET 1970
 Retries since fetch: 0
 Retry interval: 6998400 seconds (81 days)
 Metadata: _ngt_: 1325776583451_pst_: notfound(14), lastModified=0: 
 http://localhost/page_gone
 {noformat}
 As far as I can see it's caused by setPageGoneSchedule() in 
 AbstractFetchSchedule. Some pseudo-code:
 {code}
 setPageGoneSchedule (called from update / CrawlDbReducer.reduce):
     datum.fetchInterval = 1.5 * datum.fetchInterval // now 1.5 * 0.9 * maxInterval
     datum.fetchTime = fetchTime + datum.fetchInterval // see NUTCH-516
     if (maxInterval < datum.fetchInterval) // necessarily true
         forceRefetch()

 forceRefetch:
     if (datum.fetchInterval > maxInterval) // true because it's 1.35 * maxInterval
         datum.fetchInterval = 0.9 * maxInterval
     datum.status = db_unfetched

 shouldFetch (called from generate / Generator.map):
     if ((datum.fetchTime - curTime) > maxInterval)
         // always true if the crawler is launched in short intervals
         // (lower than 0.35 * maxInterval)
         datum.fetchTime = curTime // forces a refetch
 {code}
 After setPageGoneSchedule is called via update the state is db_unfetched and 
 the retry interval 0.9 * db.fetch.interval.max (81 days). 
 Although the fetch time in the CrawlDb is far in the future
 {noformat}
 % nutch readdb testcrawl/crawldb -url http://localhost/page_gone
 URL: http://localhost/page_gone
 Version: 7
 Status: 1 (db_unfetched)
 Fetch time: Sun May 06 05:20:05 CEST 2012
 Modified time: Thu Jan 01 01:00:00 CET 1970
 Retries since fetch: 0
 Retry interval: 6998400 seconds (81 days)
 Score: 1.0
 Signature: null
 Metadata: _pst_: notfound(14), lastModified=0: http://localhost/page_gone
 {noformat}
 the URL is generated again because (fetch time - current time) is larger than 
 db.fetch.interval.max.
 The retry interval (datum.fetchInterval) oscillates between 0.9 and 1.35 times 
 db.fetch.interval.max, and the fetch time is always close to 
 current time + 1.35 * db.fetch.interval.max.
 It's possibly a side effect of NUTCH-516, and may be related to NUTCH-578.
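
 The cycle described above can be replayed with a small simulation. This is only a sketch that mirrors the pseudo-code, not the Nutch classes: GoneScheduleSketch and its fields are hypothetical, times are in seconds, and two details are assumptions not stated in the report (the page is marked db_gone before the schedule runs, and shouldFetch returns true once the fetch time is not in the future).

```java
/** Replays the setPageGoneSchedule / forceRefetch / shouldFetch cycle. */
class GoneScheduleSketch {
  static final long MAX_INTERVAL = 90L * 24 * 3600; // db.fetch.interval.max (90 days)

  long fetchInterval = MAX_INTERVAL * 9 / 10; // 0.9 * maxInterval
  long fetchTime = 0;
  String status = "db_unfetched";

  /** Update step for a fetch_gone result. */
  void update(long now) {
    status = "db_gone"; // assumption: the page is marked gone first
    fetchInterval = fetchInterval * 3 / 2; // 1.5 * fetchInterval
    fetchTime = now + fetchInterval;
    if (MAX_INTERVAL < fetchInterval) {
      forceRefetch();
    }
  }

  void forceRefetch() {
    if (fetchInterval > MAX_INTERVAL) {
      fetchInterval = MAX_INTERVAL * 9 / 10;
    }
    status = "db_unfetched";
  }

  /** Generate step: true if the URL would be selected again. */
  boolean shouldFetch(long now) {
    if (fetchTime - now > MAX_INTERVAL) {
      fetchTime = now; // forces a refetch
    }
    return fetchTime <= now; // assumption about the final check
  }

  public static void main(String[] args) {
    GoneScheduleSketch datum = new GoneScheduleSketch();
    long now = 0;
    for (int cycle = 1; cycle <= 3; cycle++) {
      datum.update(now);
      now += 24 * 3600; // next crawl cycle one day later
      System.out.println("cycle " + cycle + ": status=" + datum.status
          + ", interval=" + datum.fetchInterval + "s, generated again="
          + datum.shouldFetch(now));
    }
  }
}
```

 With a one-day crawl cycle the interval returns to 6998400 seconds after every update and the URL is selected each time, which matches the segment dumps in the report.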





[jira] [Commented] (NUTCH-1562) Order of execution for scoring filters

2013-10-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13788059#comment-13788059
 ] 

Hudson commented on NUTCH-1562:
---

SUCCESS: Integrated in Nutch-trunk #2380 (See 
[https://builds.apache.org/job/Nutch-trunk/2380/])
NUTCH-1562 (jnioche: 
http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1529813)
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/conf/nutch-default.xml
* /nutch/trunk/src/java/org/apache/nutch/indexer/IndexingFilters.java
* /nutch/trunk/src/java/org/apache/nutch/net/URLFilters.java
* /nutch/trunk/src/java/org/apache/nutch/parse/HtmlParseFilters.java
* /nutch/trunk/src/java/org/apache/nutch/plugin/PluginRepository.java
* /nutch/trunk/src/java/org/apache/nutch/scoring/ScoringFilters.java







[jira] [Updated] (NUTCH-1606) Check that Factory classes use the cache in a thread safe way

2013-10-07 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1606:
-

Attachment: NUTCH-1606.patch

Synchronized methods on ObjectCache + calls from FetchScheduleFactory and 
SignatureFactory.

I haven't checked calls from URLNormalizers or ParserFactory but the others 
look fine

 Check that Factory classes use the cache in a thread safe way
 -

 Key: NUTCH-1606
 URL: https://issues.apache.org/jira/browse/NUTCH-1606
 Project: Nutch
  Issue Type: Task
Affects Versions: 1.7, 2.2.1
Reporter: Julien Nioche
Priority: Minor
 Attachments: NUTCH-1606.patch


 I found in [NUTCH-1604] that the ProtocolFactory class was not handling 
 access to the cache properly. The same mechanism is used in other Factory 
 classes so we should make sure that they are properly synchronized + make 
 ObjectCache thread safe as well
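
 To illustrate why both the cache and the calling factories need attention: a lookup followed by a store is a check-then-act sequence, so synchronizing the cache methods alone still lets two threads create the object twice. The sketch below is an assumption-laden illustration, not the attached patch; SyncCacheSketch and FactorySketch are hypothetical and do not reproduce the ObjectCache API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical cache with synchronized accessors. */
class SyncCacheSketch {
  private final Map<String, Object> objects = new HashMap<>();

  synchronized Object getObject(String key) {
    return objects.get(key);
  }

  synchronized void setObject(String key, Object value) {
    objects.put(key, value);
  }
}

/** Hypothetical factory that guards the whole lookup-or-create step. */
class FactorySketch {
  static Object getOrCreate(SyncCacheSketch cache, String key,
                            Supplier<Object> creator) {
    // Hold the cache lock across both calls so that no other thread can
    // store an instance between the lookup and the store.
    synchronized (cache) {
      Object cached = cache.getObject(key);
      if (cached == null) {
        cached = creator.get();
        cache.setObject(key, cached);
      }
      return cached;
    }
  }

  public static void main(String[] args) {
    SyncCacheSketch cache = new SyncCacheSketch();
    Object first = getOrCreate(cache, "schedule", Object::new);
    Object second = getOrCreate(cache, "schedule", Object::new);
    System.out.println("same instance: " + (first == second));
  }
}
```

 Locking on the cache object works here because the synchronized methods use the same monitor and Java monitors are reentrant.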





[jira] [Created] (NUTCH-1652) Avoid instanciation of MimeUtil for each Content object created

2013-10-07 Thread Julien Nioche (JIRA)
Julien Nioche created NUTCH-1652:


 Summary: Avoid instanciation of MimeUtil for each Content object 
created
 Key: NUTCH-1652
 URL: https://issues.apache.org/jira/browse/NUTCH-1652
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.7
Reporter: Julien Nioche


Content objects instantiate and hold a MimeUtil in the constructor used by the 
HttpBase class. This is wasteful and unnecessarily slows down the creation of 
Content objects, as the MimeUtil creates a new Tika instance, reads from the 
configuration, etc.

Instead we could create a single instance of the MimeUtil class and pass it to 
a new Content constructor

{code}
public Content(String url, String base, byte[] content, String contentType,
  Metadata metadata, MimeUtil mime)
{code}

and create a single instance of MimeUtil in HttpBase. We would also need to 
make sure that the synchronisation is handled properly in MimeUtil (especially 
for the calls to Tika) as the creation of the Content is done in a 
multithreaded environment.
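
A small sketch of how this could look, under stated assumptions: MimeHelper, ContentSketch and ProtocolSketch are hypothetical stand-ins for MimeUtil, Content and HttpBase, and the detection logic is a placeholder rather than a call to Tika. The helper method is synchronized on the assumption that the underlying detector is not thread safe.

```java
/** Hypothetical stand-in for MimeUtil; the real class wraps Tika. */
class MimeHelper {
  /** Synchronized because the shared instance is called from many threads. */
  synchronized String detect(String declaredType, byte[] data) {
    if (declaredType != null && !declaredType.isEmpty()) {
      return declaredType;
    }
    // Placeholder logic only; a real implementation would inspect the bytes.
    return data.length == 0 ? "application/octet-stream" : "text/plain";
  }
}

/** Hypothetical Content-like class that receives the shared helper. */
class ContentSketch {
  final String url;
  final String contentType;

  ContentSketch(String url, byte[] content, String declaredType,
                MimeHelper mime) {
    this.url = url;
    this.contentType = mime.detect(declaredType, content);
  }
}

/** Hypothetical protocol class holding a single helper instance. */
class ProtocolSketch {
  private final MimeHelper mime = new MimeHelper(); // created once

  ContentSketch fetch(String url, byte[] body, String declaredType) {
    return new ContentSketch(url, body, declaredType, mime);
  }

  public static void main(String[] args) {
    ProtocolSketch protocol = new ProtocolSketch();
    ContentSketch c = protocol.fetch("http://localhost/a", new byte[] {1}, null);
    System.out.println(c.url + " -> " + c.contentType);
  }
}
```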






[jira] [Commented] (NUTCH-961) Expose Tika's boilerpipe support

2013-10-07 Thread Nguyen Manh Tien (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13788911#comment-13788911
 ] 

Nguyen Manh Tien commented on NUTCH-961:


I used patch NUTCH-961-2.1-v2.patch with nutch-2.2.1.
I found that the text parsed by nutch-tika (with boilerpipe support) is 
different from the text produced by the demo site http://boilerpipe-web.appspot.com 
I upgraded to boilerpipe 1.2.0 to match the demo site.

The URL I tested is http://www.medhelp.org/posts/Eye-Care/EYE/show/1199003

The text from nutch-tika (I use ArticleExtractor):

EYE - Eye Care - MedHelp Experts My MedHelp Login or Signup Eye Care Community 
EYE Post a Question « Back to Community About This Community: This patient 
support community is for discussions relating to eye care, cataracts , glaucoma 
, retinal detachment , eye infections, misaligned eyes , intra-ocular implants, 
refractive surgery ( LASIK and CK), glasses, contact lenses, amblyopia , eye 
injuries, dry eyes , ocular allergy, eye pain and discomfort, pediatric eye 
disorders, eyelid and tearduct surgery, poor eyesight, and eye surgery. View 
community archives Font Size: A A ABackground: Search this Community: Go 3 
Comments EYE My son is 4 and half years old and have + no .Our doctor told me 
six months ago that + no. decreases as time passed and he not to wear glasses 
after two -three years if he wears glasses regularly.But yesterday he told me 
that his + No. increases and he have to wear glasses always.If you wish u can 
go for laser surgery after 14 years i.e. when my son will have age of 17 
years.please help me what to do ? Watch this discussion Tweet Related 
Discussions How to decide if glasses are needed for children? (8 replies):How 
can a Doctor tell if a child has amblyopia? Is t... [more] Astigmatism (1 
replies):My 5 year old son has severe astigmatism. He wears glass... [more] Can 
someone help me in regards to my sons eyes? (6 replies):I had noticed my son 
had, had an eye issue when he was a... [more] Blurred vision with glasses (2 
replies):Hi, I recently got new glasses and but the vision in my ... [more] 
Eyesight getting worse (2 replies):Hello! So here's the story. My eyesight had 
never been ... [more]

And from the demo:

3 Comments
EYE
My son is 4 and half years old and have + no .Our doctor told me six months ago 
that + no. decreases as time passed and he not to wear glasses after two -three 
years if he wears glasses regularly.But yesterday he told me that his + No. 
increases and he have to wear glasses always.If you wish u can go for laser 
surgery after 14 years i.e. when my son will have age of 17 years.please help 
me what to do ?

The result from the demo is much better for this URL.
So parse-tika/boilerpipe not only extracts the main content from the page but also 
includes the title and other node content.
Is this expected?

 Expose Tika's boilerpipe support
 

 Key: NUTCH-961
 URL: https://issues.apache.org/jira/browse/NUTCH-961
 Project: Nutch
  Issue Type: New Feature
  Components: parser
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 2.3, 1.8

 Attachments: BoilerpipeExtractorRepository.java, 
 NUTCH-961-1.3-3.patch, NUTCH-961-1.3-tikaparser1.patch, 
 NUTCH-961-1.3-tikaparser.patch, NUTCH-961-1.4-dombuilder-1.patch, 
 NUTCH-961-1.5-1.patch, NUTCH-961-1.8-1.patch, NUTCH-961-2.1-v1.patch, 
 NUTCH-961-2.1-v2.patch, NUTCH-961v2.patch


 Tika 0.8 comes with the Boilerpipe content handler which can be used to 
 extract boilerplate content from HTML pages. We should see how we can expose 
 Boilerpipe in the Nutch configuration.


