[jira] [Updated] (NUTCH-1182) fetcher should track and shut down hung threads
[ https://issues.apache.org/jira/browse/NUTCH-1182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-1182:
-----------------------------------
    Attachment: NUTCH-1182-2x.patch

Patch for 2.x.

> fetcher should track and shut down hung threads
> -----------------------------------------------
>
>                 Key: NUTCH-1182
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1182
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 1.3, 1.4
>         Environment: Linux, local job runner
>            Reporter: Sebastian Nagel
>            Priority: Minor
>             Fix For: 2.4, 1.9
>
>         Attachments: NUTCH-1182-2x.patch, NUTCH-1182-trunk-v1.patch
>
>
> While crawling a slow server hosting a couple of very large PDF documents (30 MB),
> after some time and a bulk of successfully fetched documents the fetcher stops
> with the message: ??Aborting with 10 hung threads.??
> From then on every cycle ends with hung threads; almost no documents are fetched
> successfully. In addition, confusing Hadoop errors are logged:
> {noformat}
> fetch of http://.../xyz.pdf failed with: java.lang.NullPointerException
>     at java.lang.System.arraycopy(Native Method)
>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1108)
>     ...
> {noformat}
> or
> {noformat}
> Exception in thread "QueueFeeder" java.lang.NullPointerException
>     at org.apache.hadoop.fs.BufferedFSInputStream.getPos(BufferedFSInputStream.java:48)
>     at org.apache.hadoop.fs.FSDataInputStream.getPos(FSDataInputStream.java:41)
>     at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.readChunk(ChecksumFileSystem.java:214)
> {noformat}
> I've run the debugger and found:
> # After the "hung threads" are reported, the fetcher stops, but the threads are
> still alive and continue fetching a document. In consequence, this will
> #* strain the already limited bandwidth of network/server even more
> #* once the document is fetched, the thread tries to write the content via
> {{output.collect()}}, which must fail because the fetcher map job has already
> finished and the associated temporary mapred directory has been deleted. The error
> message may get mixed with the progress output of the next fetch cycle,
> causing additional confusion.
> # Documents/URLs causing the hung thread are never reported nor stored. That
> is, they are hard to track down, and they will cause a hung thread again and
> again.
> The problem is reproducible when fetching bigger documents and setting
> {{mapred.task.timeout}} to a low value (this will reliably cause hung
> threads).

--
This message was sent by Atlassian JIRA
(v6.2#6252)
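The first finding above (the coordinator aborts while its worker threads stay alive and keep fetching) suggests tracking per-thread progress and actively interrupting stragglers. The following is a minimal sketch of that idea, not the attached patch: the class name {{HungThreadTracker}} and its methods are hypothetical, loosely modelled on how a fetcher coordinator could supervise its workers.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: track when each worker thread last made progress,
// and interrupt workers that have been silent longer than a timeout,
// instead of aborting and leaving them alive (the bug described above).
public class HungThreadTracker {
    // last progress timestamp per worker thread
    private final Map<Thread, Long> lastActivity = new ConcurrentHashMap<>();

    /** Workers call this whenever they make progress (e.g. per fetched chunk). */
    public void recordProgress(Thread t) {
        lastActivity.put(t, System.currentTimeMillis());
    }

    /** Coordinator calls this on shutdown: interrupt stalled workers, return how many were hung. */
    public int shutDownHung(long timeoutMs) {
        long now = System.currentTimeMillis();
        int hung = 0;
        for (Map.Entry<Thread, Long> e : lastActivity.entrySet()) {
            Thread t = e.getKey();
            if (t.isAlive() && now - e.getValue() > timeoutMs) {
                t.interrupt(); // ask the worker to stop rather than leaving it running
                hung++;
            }
        }
        return hung;
    }
}
```

Interrupting is only a request; a worker blocked in non-interruptible native I/O may still not stop, which is one reason a full shutdown is hard (cf. NUTCH-1387, mentioned in a later comment).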
[jira] [Updated] (NUTCH-1182) fetcher should track and shut down hung threads
[ https://issues.apache.org/jira/browse/NUTCH-1182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-1182:
-----------------------------------
    Attachment: NUTCH-1182-trunk-v1.patch

From time to time this problem is reported by users
([2013|http://mail-archives.apache.org/mod_mbox/nutch-user/201304.mbox/%3ccajvbnigoqjl2hbuhv0gdbcjea2xzxhabqrsbpjaqtmfldkw...@mail.gmail.com%3E],
[2012a|http://stackoverflow.com/questions/10331440/nutch-fetcher-aborting-with-n-hung-threads],
[2012b|http://stackoverflow.com/questions/12181249/nutch-crawl-fails-when-run-as-a-background-process-on-linux],
[2011|http://lucene.472066.n3.nabble.com/Nutch-1-2-fetcher-aborting-with-N-hung-threads-td2411724.html]).
Shutting down hung threads is hard to implement (cf. NUTCH-1387), but logging
the URLs which cause threads to hang would definitely help in many situations.
Patch attached.
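The logging idea proposed above (report which URL each hung thread was fetching, so problematic documents can be tracked down) can be sketched as follows. This is an illustration under assumed names ({{ActiveUrlRegistry}} and its methods are hypothetical), not the code in NUTCH-1182-trunk-v1.patch.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: each fetcher thread publishes the URL it is currently
// working on, so that on "Aborting with N hung threads" the coordinator can
// log the culprit URLs instead of losing them silently.
public class ActiveUrlRegistry {
    private final Map<String, String> urlByThread = new ConcurrentHashMap<>();

    /** Called by a worker just before it starts fetching a URL. */
    public void startFetching(String url) {
        urlByThread.put(Thread.currentThread().getName(), url);
    }

    /** Called by a worker after the fetch completes (successfully or not). */
    public void doneFetching() {
        urlByThread.remove(Thread.currentThread().getName());
    }

    /** Copy of the in-flight URLs, keyed by thread name. */
    public Map<String, String> snapshot() {
        return new HashMap<>(urlByThread);
    }

    /** Called by the coordinator on abort: log URLs still in flight. */
    public void reportHung() {
        for (Map.Entry<String, String> e : snapshot().entrySet()) {
            System.err.println("hung thread " + e.getKey()
                + " still fetching " + e.getValue());
        }
    }
}
```

Because entries are removed on completion, whatever remains in the registry at abort time is exactly the set of URLs that caused threads to hang, addressing finding 2 of the issue description.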
[jira] [Updated] (NUTCH-1182) fetcher should track and shut down hung threads
[ https://issues.apache.org/jira/browse/NUTCH-1182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-1182:
-----------------------------------
    Fix Version/s: 1.9
[jira] [Updated] (NUTCH-1182) fetcher should track and shut down hung threads
[ https://issues.apache.org/jira/browse/NUTCH-1182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-1182:
----------------------------------------
    Fix Version/s: 2.2
                   1.7

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira