Hello Lewis,

We do have some weird and complicated rules, but these should not cause a
timeout after 450 seconds, i.e. keep the JVM busy for that amount of time. We
haven't fully investigated yet, so it is possible that some sitemap entries
are very long and complicated. Still, 450 seconds is very odd, although it
does seem reproducible, as it happened twice in a row.
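
For illustration, here is a minimal sketch of how a single regex replacement
could be time-boxed so that one pathological URL cannot keep a mapper busy for
minutes. It relies on the fact that java.util.regex reads its input through
CharSequence.charAt(), so a wrapper can abort the match once a deadline has
passed. Everything in it is an assumption made for the example:
RegexTimeoutSketch, DeadlineCharSequence, RegexTimeoutException and
replaceAllWithTimeout are hypothetical names, not part of Nutch, and the
patterns are not our actual rules.

```java
import java.util.regex.Pattern;

final class RegexTimeoutSketch {

  /** Hypothetical exception used to signal that a match ran past its deadline. */
  public static final class RegexTimeoutException extends RuntimeException {
    public RegexTimeoutException(String message) {
      super(message);
    }
  }

  /** CharSequence wrapper that aborts matching once the deadline has passed. */
  public static final class DeadlineCharSequence implements CharSequence {
    private final CharSequence inner;
    private final long deadlineNanos;

    public DeadlineCharSequence(CharSequence inner, long deadlineNanos) {
      this.inner = inner;
      this.deadlineNanos = deadlineNanos;
    }

    @Override
    public char charAt(int index) {
      // The regex engine reads input through charAt, so this check also runs
      // while the engine is backtracking.
      if (System.nanoTime() - deadlineNanos > 0) {
        throw new RegexTimeoutException("regex exceeded its time budget");
      }
      return inner.charAt(index);
    }

    @Override
    public int length() {
      return inner.length();
    }

    @Override
    public CharSequence subSequence(int start, int end) {
      return new DeadlineCharSequence(inner.subSequence(start, end), deadlineNanos);
    }

    @Override
    public String toString() {
      return inner.toString();
    }
  }

  /** Returns the rewritten URL, or null if the rule ran longer than timeoutMillis. */
  public static String replaceAllWithTimeout(Pattern pattern, String url,
      String substitution, long timeoutMillis) {
    long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
    try {
      return pattern.matcher(new DeadlineCharSequence(url, deadline))
          .replaceAll(substitution);
    } catch (RegexTimeoutException e) {
      return null; // caller decides: skip the URL, keep it unnormalized, log it
    }
  }

  public static void main(String[] args) {
    // A harmless rule: collapse trailing slashes.
    System.out.println(replaceAllWithTimeout(
        Pattern.compile("/+$"), "http://example.org/a///", "/", 1000L));

    // A rule with nested quantifiers, which can backtrack heavily on input
    // that almost matches; the guard bounds the time spent on it.
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < 40; i++) {
      sb.append('a');
    }
    sb.append('!');
    System.out.println(replaceAllWithTimeout(
        Pattern.compile("(a+)+$"), sb.toString(), "", 200L));
  }
}
```

Returning null is only one possible choice; a caller could equally keep the
URL unnormalized or drop the sitemap entry and log the offending rule.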

The disaster is not that big of a problem thanks to HDFS snapshots.

Thanks,
Markus
 
-----Original message-----
> From:lewis john mcgibbney <lewi...@apache.org>
> Sent: Wednesday 17th January 2018 17:47
> To: user@nutch.apache.org
> Subject: Re: SitemapProcessor destroyed our CrawlDB
> 
> Hi Markus,
> 
> What a disaster... do/did you have any crazy rules, replacements and/or
> substitutions present in the urlnormalizer-regex configuration?
> Lewis
> 
> On Wed, Jan 17, 2018 at 2:51 AM, <user-digest-h...@nutch.apache.org> wrote:
> 
> >
> > From: Markus Jelsma <markus.jel...@openindex.io>
> > To: User <user@nutch.apache.org>
> > Cc:
> > Bcc:
> > Date: Wed, 17 Jan 2018 10:51:49 +0000
> > Subject: SitemapProcessor destroyed our CrawlDB
> > Hello,
> >
> > We noticed some abnormalities in our crawl cycle caused by a sudden
> > reduction of our CrawlDB's size. The SitemapProcessor ran, failed (timed
> > out, see below) and left us with a decimated CrawlDB.
> >
> > This is odd because of:
> >
> >     } catch (Exception e) {
> >       if (fs.exists(tempCrawlDb))
> >         fs.delete(tempCrawlDb, true);
> >
> >       LockUtil.removeLockFile(fs, lock);
> >       throw e;
> >     }
> >
> > Any ideas?
> >
> > Thanks,
> > Markus
> >
> > Full thread dump OpenJDK 64-Bit Server VM (25.151-b12 mixed mode):
> >
> > "Thread-52" #74 prio=5 os_prio=0 tid=0x00007fe2adc85000 nid=0x6cf8
> > runnable [0x00007fe28a86d000]
> >    java.lang.Thread.State: RUNNABLE
> > at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3797)
> > at java.util.regex.Pattern$Start.match(Pattern.java:3461)
> > at java.util.regex.Matcher.search(Matcher.java:1248)
> > at java.util.regex.Matcher.find(Matcher.java:637)
> > at java.util.regex.Matcher.replaceAll(Matcher.java:951)
> > at org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer.
> > regexNormalize(RegexURLNormalizer.java:193)
> > at org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer.normalize(
> > RegexURLNormalizer.java:200)
> > at org.apache.nutch.net.URLNormalizers.normalize(URLNormalizers.java:319)
> > at org.apache.nutch.util.SitemapProcessor$SitemapMapper.filterNormalize(
> > SitemapProcessor.java:176)
> > at org.apache.nutch.util.SitemapProcessor$SitemapMapper.
> > generateSitemapUrlDatum(SitemapProcessor.java:225)
> > at org.apache.nutch.util.SitemapProcessor$SitemapMapper.
> > generateSitemapUrlDatum(SitemapProcessor.java:264)
> > at org.apache.nutch.util.SitemapProcessor$SitemapMapper.map(
> > SitemapProcessor.java:154)
> > at org.apache.nutch.util.SitemapProcessor$SitemapMapper.map(
> > SitemapProcessor.java:95)
> > at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
> > at org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper$MapRunner.run(
> > MultithreadedMapper.java:273)
> >
> > "SpillThread" #34 daemon prio=5 os_prio=0 tid=0x00007fe2ada12000
> > nid=0x6c2f waiting on condition [0x00007fe28d2ad000]
> >    java.lang.Thread.State: WAITING (parking)
> > at sun.misc.Unsafe.park(Native Method)
> > - parking to wait for  <0x00000000ede6dc80> (a java.util.concurrent.locks.
> > AbstractQueuedSynchronizer$ConditionObject)
> > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> > at java.util.concurrent.locks.AbstractQueuedSynchronizer$
> > ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> > at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$
> > SpillThread.run(MapTask.java:1530)
> >
> > "org.apache.hadoop.hdfs.PeerCache@1fc0053e" #33 daemon prio=5 os_prio=0
> > tid=0x00007fe2ad7fe000 nid=0x6be7 waiting on condition [0x00007fe28d3ae000]
> >    java.lang.Thread.State: TIMED_WAITING (sleeping)
> > at java.lang.Thread.sleep(Native Method)
> > at org.apache.hadoop.hdfs.PeerCache.run(PeerCache.java:253)
> > at org.apache.hadoop.hdfs.PeerCache.access$000(PeerCache.java:46)
> > at org.apache.hadoop.hdfs.PeerCache$1.run(PeerCache.java:124)
> > at java.lang.Thread.run(Thread.java:748)
> >
> > "communication thread" #28 daemon prio=5 os_prio=0 tid=0x00007fe2ad975800
> > nid=0x6b9e in Object.wait() [0x00007fe28d8b1000]
> >    java.lang.Thread.State: TIMED_WAITING (on object monitor)
> > at java.lang.Object.wait(Native Method)
> > at org.apache.hadoop.mapred.Task$TaskReporter.run(Task.java:799)
> > - locked <0x00000000ede69ae8> (a java.lang.Object)
> > at java.lang.Thread.run(Thread.java:748)
> >
> > "client DomainSocketWatcher" #27 daemon prio=5 os_prio=0
> > tid=0x00007fe2ad952000 nid=0x6b95 runnable [0x00007fe28d9b2000]
> >    java.lang.Thread.State: RUNNABLE
> > at org.apache.hadoop.net.unix.DomainSocketWatcher.doPoll0(Native Method)
> > at org.apache.hadoop.net.unix.DomainSocketWatcher.access$
> > 900(DomainSocketWatcher.java:52)
> > at org.apache.hadoop.net.unix.DomainSocketWatcher$2.run(
> > DomainSocketWatcher.java:503)
> > at java.lang.Thread.run(Thread.java:748)
> >
> > "Thread for syncLogs" #26 daemon prio=5 os_prio=0 tid=0x00007fe2ad820000
> > nid=0x6b81 waiting on condition [0x00007fe28deb3000]
> >    java.lang.Thread.State: TIMED_WAITING (parking)
> > at sun.misc.Unsafe.park(Native Method)
> > - parking to wait for  <0x00000000e7118190> (a java.util.concurrent.locks.
> > AbstractQueuedSynchronizer$ConditionObject)
> > at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
> > at java.util.concurrent.locks.AbstractQueuedSynchronizer$
> > ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
> > at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(
> > ScheduledThreadPoolExecutor.java:1093)
> > at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(
> > ScheduledThreadPoolExecutor.java:809)
> > at java.util.concurrent.ThreadPoolExecutor.getTask(
> > ThreadPoolExecutor.java:1074)
> > at java.util.concurrent.ThreadPoolExecutor.runWorker(
> > ThreadPoolExecutor.java:1134)
> > at java.util.concurrent.ThreadPoolExecutor$Worker.run(
> > ThreadPoolExecutor.java:624)
> > at java.lang.Thread.run(Thread.java:748)
> >
> > "org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner"
> > #24 daemon prio=5 os_prio=0 tid=0x00007fe2ad746800 nid=0x6b79 in
> > Object.wait() [0x00007fe28e1cc000]
> >    java.lang.Thread.State: WAITING (on object monitor)
> > at java.lang.Object.wait(Native Method)
> > - waiting on <0x00000000e7171060> (a java.lang.ref.ReferenceQueue$Lock)
> > at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:143)
> > - locked <0x00000000e7171060> (a java.lang.ref.ReferenceQueue$Lock)
> > at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:164)
> > at org.apache.hadoop.fs.FileSystem$Statistics$
> > StatisticsDataReferenceCleaner.run(FileSystem.java:3212)
> > at java.lang.Thread.run(Thread.java:748)
> >
> > "IPC Parameter Sending Thread #0" #23 daemon prio=5 os_prio=0
> > tid=0x00007fe2ad637000 nid=0x6b6d waiting on condition [0x00007fe28e4cd000]
> >    java.lang.Thread.State: TIMED_WAITING (parking)
> > at sun.misc.Unsafe.park(Native Method)
> > - parking to wait for  <0x00000000e7117338> (a java.util.concurrent.
> > SynchronousQueue$TransferStack)
> > at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
> > at java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(
> > SynchronousQueue.java:460)
> > at java.util.concurrent.SynchronousQueue$TransferStack.transfer(
> > SynchronousQueue.java:362)
> > at java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:941)
> > at java.util.concurrent.ThreadPoolExecutor.getTask(
> > ThreadPoolExecutor.java:1073)
> > at java.util.concurrent.ThreadPoolExecutor.runWorker(
> > ThreadPoolExecutor.java:1134)
> > at java.util.concurrent.ThreadPoolExecutor$Worker.run(
> > ThreadPoolExecutor.java:624)
> > at java.lang.Thread.run(Thread.java:748)
> >
> > "IPC Client (1373419525) connection to /89.188.14.3:36783 from
> > job_1516025831039_0247" #22 daemon prio=5 os_prio=0 tid=0x00007fe2ad632000
> > nid=0x6b6c in Object.wait() [0x00007fe28e5ce000]
> >    java.lang.Thread.State: TIMED_WAITING (on object monitor)
> > at java.lang.Object.wait(Native Method)
> > at org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:1008)
> > - locked <0x00000000e7119130> (a org.apache.hadoop.ipc.Client$Connection)
> > at org.apache.hadoop.ipc.Client$Connection.run(Client.java:1052)
> >
> > "Timer for 'MapTask' metrics system" #21 daemon prio=5 os_prio=0
> > tid=0x00007fe2ad525800 nid=0x6b5f in Object.wait() [0x00007fe28f551000]
> >    java.lang.Thread.State: TIMED_WAITING (on object monitor)
> > at java.lang.Object.wait(Native Method)
> > at java.util.TimerThread.mainLoop(Timer.java:552)
> > - locked <0x00000000e713ca30> (a java.util.TaskQueue)
> > at java.util.TimerThread.run(Timer.java:505)
> >
> > "Service Thread" #17 daemon prio=9 os_prio=0 tid=0x00007fe2ac0f8800
> > nid=0x6b36 runnable [0x0000000000000000]
> >    java.lang.Thread.State: RUNNABLE
> >
> > "C1 CompilerThread11" #16 daemon prio=9 os_prio=0 tid=0x00007fe2ac0eb800
> > nid=0x6b34 waiting on condition [0x0000000000000000]
> >    java.lang.Thread.State: RUNNABLE
> >
> > "C1 CompilerThread10" #15 daemon prio=9 os_prio=0 tid=0x00007fe2ac0e9800
> > nid=0x6b33 waiting on condition [0x0000000000000000]
> >    java.lang.Thread.State: RUNNABLE
> >
> > "C1 CompilerThread9" #14 daemon prio=9 os_prio=0 tid=0x00007fe2ac0e7000
> > nid=0x6b32 waiting on condition [0x0000000000000000]
> >    java.lang.Thread.State: RUNNABLE
> >
> > "C1 CompilerThread8" #13 daemon prio=9 os_prio=0 tid=0x00007fe2ac0e5000
> > nid=0x6b31 waiting on condition [0x0000000000000000]
> >    java.lang.Thread.State: RUNNABLE
> >
> > "C2 CompilerThread7" #12 daemon prio=9 os_prio=0 tid=0x00007fe2ac0e3000
> > nid=0x6b30 waiting on condition [0x0000000000000000]
> >    java.lang.Thread.State: RUNNABLE
> >
> > "C2 CompilerThread6" #11 daemon prio=9 os_prio=0 tid=0x00007fe2ac0e1000
> > nid=0x6b2f waiting on condition [0x0000000000000000]
> >    java.lang.Thread.State: RUNNABLE
> >
> > "C2 CompilerThread5" #10 daemon prio=9 os_prio=0 tid=0x00007fe2ac0de800
> > nid=0x6b2d waiting on condition [0x0000000000000000]
> >    java.lang.Thread.State: RUNNABLE
> >
> > "C2 CompilerThread4" #9 daemon prio=9 os_prio=0 tid=0x00007fe2ac0d4800
> > nid=0x6b2b waiting on condition [0x0000000000000000]
> >    java.lang.Thread.State: RUNNABLE
> >
> > "C2 CompilerThread3" #8 daemon prio=9 os_prio=0 tid=0x00007fe2ac0d2800
> > nid=0x6b2a waiting on condition [0x0000000000000000]
> >    java.lang.Thread.State: RUNNABLE
> >
> > "C2 CompilerThread2" #7 daemon prio=9 os_prio=0 tid=0x00007fe2ac0ce000
> > nid=0x6b29 waiting on condition [0x0000000000000000]
> >    java.lang.Thread.State: RUNNABLE
> >
> > "C2 CompilerThread1" #6 daemon prio=9 os_prio=0 tid=0x00007fe2ac0cc000
> > nid=0x6b28 waiting on condition [0x0000000000000000]
> >    java.lang.Thread.State: RUNNABLE
> >
> > "C2 CompilerThread0" #5 daemon prio=9 os_prio=0 tid=0x00007fe2ac0c9000
> > nid=0x6b26 waiting on condition [0x0000000000000000]
> >    java.lang.Thread.State: RUNNABLE
> >
> > "Signal Dispatcher" #4 daemon prio=9 os_prio=0 tid=0x00007fe2ac0c7000
> > nid=0x6b24 waiting on condition [0x0000000000000000]
> >    java.lang.Thread.State: RUNNABLE
> >
> > "Finalizer" #3 daemon prio=8 os_prio=0 tid=0x00007fe2ac0a0000 nid=0x6ab7
> > in Object.wait() [0x00007fe29592c000]
> >    java.lang.Thread.State: WAITING (on object monitor)
> > at java.lang.Object.wait(Native Method)
> > at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:143)
> > - locked <0x00000000e72f0140> (a java.lang.ref.ReferenceQueue$Lock)
> > at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:164)
> > at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:209)
> >
> > "Reference Handler" #2 daemon prio=10 os_prio=0 tid=0x00007fe2ac09b800
> > nid=0x6ab6 in Object.wait() [0x00007fe295a2d000]
> >    java.lang.Thread.State: WAITING (on object monitor)
> > at java.lang.Object.wait(Native Method)
> > at java.lang.Object.wait(Object.java:502)
> > at java.lang.ref.Reference.tryHandlePending(Reference.java:191)
> > - locked <0x00000000e72f0180> (a java.lang.ref.Reference$Lock)
> > at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:153)
> >
> > "main" #1 prio=5 os_prio=0 tid=0x00007fe2ac014000 nid=0x6a69 in
> > Object.wait() [0x00007fe2b5747000]
> >    java.lang.Thread.State: WAITING (on object monitor)
> > at java.lang.Object.wait(Native Method)
> > - waiting on <0x00000000edb32e88> (a org.apache.hadoop.mapreduce.
> > lib.map.MultithreadedMapper$MapRunner)
> > at java.lang.Thread.join(Thread.java:1252)
> > - locked <0x00000000edb32e88> (a org.apache.hadoop.mapreduce.
> > lib.map.MultithreadedMapper$MapRunner)
> > at java.lang.Thread.join(Thread.java:1326)
> > at org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper.
> > run(MultithreadedMapper.java:146)
> > at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
> > at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
> > at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
> > at java.security.AccessController.doPrivileged(Native Method)
> > at javax.security.auth.Subject.doAs(Subject.java:422)
> > at org.apache.hadoop.security.UserGroupInformation.doAs(
> > UserGroupInformation.java:1836)
> > at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:169)
> >
> > "VM Thread" os_prio=0 tid=0x00007fe2ac093800 nid=0x6aad runnable
> >
> > "GC task thread#0 (ParallelGC)" os_prio=0 tid=0x00007fe2ac029000
> > nid=0x6a77 runnable
> >
> > "GC task thread#1 (ParallelGC)" os_prio=0 tid=0x00007fe2ac02b000
> > nid=0x6a79 runnable
> >
> > "GC task thread#2 (ParallelGC)" os_prio=0 tid=0x00007fe2ac02c800
> > nid=0x6a7b runnable
> >
> > "GC task thread#3 (ParallelGC)" os_prio=0 tid=0x00007fe2ac02e800
> > nid=0x6a7d runnable
> >
> > "GC task thread#4 (ParallelGC)" os_prio=0 tid=0x00007fe2ac030000
> > nid=0x6a80 runnable
> >
> > "GC task thread#5 (ParallelGC)" os_prio=0 tid=0x00007fe2ac032000
> > nid=0x6a95 runnable
> >
> > "GC task thread#6 (ParallelGC)" os_prio=0 tid=0x00007fe2ac033800
> > nid=0x6a96 runnable
> >
> > "GC task thread#7 (ParallelGC)" os_prio=0 tid=0x00007fe2ac035800
> > nid=0x6a97 runnable
> >
> > "GC task thread#8 (ParallelGC)" os_prio=0 tid=0x00007fe2ac037000
> > nid=0x6a98 runnable
> >
> > "GC task thread#9 (ParallelGC)" os_prio=0 tid=0x00007fe2ac039000
> > nid=0x6a99 runnable
> >
> > "GC task thread#10 (ParallelGC)" os_prio=0 tid=0x00007fe2ac03a800
> > nid=0x6a9a runnable
> >
> > "GC task thread#11 (ParallelGC)" os_prio=0 tid=0x00007fe2ac03c800
> > nid=0x6a9b runnable
> >
> > "GC task thread#12 (ParallelGC)" os_prio=0 tid=0x00007fe2ac03e000
> > nid=0x6a9c runnable
> >
> > "VM Periodic Task Thread" os_prio=0 tid=0x00007fe2ac0fb000 nid=0x6b38
> > waiting on condition
> >
> > JNI global references: 275
> >
> > Heap
> >  PSYoungGen      total 116224K, used 105934K [0x00000000f7b00000,
> > 0x0000000100000000, 0x0000000100000000)
> >   eden space 100864K, 89% used [0x00000000f7b00000,0x00000000fd3a6228,
> > 0x00000000fdd80000)
> >   from space 15360K, 98% used [0x00000000fdd80000,0x00000000fec4d7a0,
> > 0x00000000fec80000)
> >   to   space 19456K, 0% used [0x00000000fed00000,0x00000000fed00000,
> > 0x0000000100000000)
> >  ParOldGen       total 273408K, used 189187K [0x00000000e7000000,
> > 0x00000000f7b00000, 0x00000000f7b00000)
> >   object space 273408K, 69% used [0x00000000e7000000,0x00000000f28c0c88,
> > 0x00000000f7b00000)
> >  Metaspace       used 33001K, capacity 33602K, committed 34048K, reserved
> > 1079296K
> >   class space    used 3581K, capacity 3675K, committed 3840K, reserved
> > 1048576K
> >
> >
> >
> 
> 
> -- 
> http://home.apache.org/~lewismc/
> http://people.apache.org/keys/committer/lewismc
> 
