[jira] Created: (NUTCH-181) mapred.local.dir temp dir. space allocation limited by smallest area

2006-01-16 Thread Paul Baclace (JIRA)
Versions: 0.8-dev Environment: all Reporter: Paul Baclace When mapred.local.dir is used to specify multiple temp dir. areas, space allocation is limited by the smallest area because the temp dir. selection algorithm is round robin starting from a randomish point. When round robin is used
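
To illustrate the limitation described above, here is a minimal sketch of one possible alternative to round robin: picking the configured directory that currently has the most usable space. This is not the change proposed in NUTCH-181 (the snippet is cut off before any proposal), and LocalDirChooser, pickIndex, and pick are hypothetical names. File.getUsableSpace() requires a JDK newer than the ones in use at the time of this report.

```java
import java.io.File;
import java.util.List;

// Hypothetical helper, not part of Nutch.
final class LocalDirChooser {

    /** Returns the index of the largest value (first one wins ties), or -1 if empty. */
    static int pickIndex(long[] freeBytes) {
        int best = -1;
        for (int i = 0; i < freeBytes.length; i++) {
            if (best < 0 || freeBytes[i] > freeBytes[best]) {
                best = i;
            }
        }
        return best;
    }

    /** Returns the directory with the most usable space, or null if the list is empty. */
    static File pick(List<File> dirs) {
        long[] free = new long[dirs.size()];
        for (int i = 0; i < free.length; i++) {
            free[i] = dirs.get(i).getUsableSpace(); // bytes available to this JVM
        }
        int idx = pickIndex(free);
        return idx < 0 ? null : dirs.get(idx);
    }
}
```

With round robin, a small area fills up at the same rate as a large one; a free-space-aware choice like this would keep writing to the larger areas for longer.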

[jira] Commented: (NUTCH-151) CommandRunner can hang after the main thread exec is finished and has inefficient busy loop

2006-01-10 Thread Paul Baclace (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-151?page=comments#action_12362383 ] Paul Baclace commented on NUTCH-151: The number of threads that invoke _barrier.barrier() or .attemptBarrier() should match the count passed to the constructor

[jira] Commented: (NUTCH-153) TextParser is only supposed to parse plain text, but if given postscript, it can take hours and then fail

2006-01-09 Thread Paul Baclace (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-153?page=comments#action_12362272 ] Paul Baclace commented on NUTCH-153: NUTCH-160? There is slowness and then there is continental drift. The quantifiers should be used with any regex package unless

[jira] Commented: (NUTCH-162) country code jp is used instead of language code ja for Japanese

2006-01-09 Thread Paul Baclace (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-162?page=comments#action_12362274 ] Paul Baclace commented on NUTCH-162: The best practice for identifying localization is to use the ISO language and country code in the form of lowercase language code
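
As a small illustration of the language code versus country code distinction, the sketch below uses java.util.Locale from the JDK, which keeps the two as separate fields: "ja" is the ISO 639 language code for Japanese and "JP" is the ISO 3166 country code for Japan. LocaleCodeExample and localeId are hypothetical names, and this is not code from the issue.

```java
import java.util.Locale;

// Hypothetical example class, not part of Nutch.
final class LocaleCodeExample {

    /** Builds a locale identifier from a language code and a country code. */
    static String localeId(String language, String country) {
        Locale locale = new Locale.Builder()
                .setLanguage(language) // stored in lowercase
                .setRegion(country)    // stored in uppercase
                .build();
        return locale.toString();
    }
}
```

Here localeId("ja", "JP") yields the language code followed by an underscore and the country code, and Locale.JAPANESE.getLanguage() reports the language code on its own.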

[jira] Created: (NUTCH-156) nutch-daemon.sh should not overwrite old logs by default

2005-12-28 Thread Paul Baclace (JIRA)
: Paul Baclace nutch-daemon.sh creates a log file with the name pattern $NUTCH_LOG_DIR/nutch-$NUTCH_IDENT_STRING-$command-`hostname`.log every time it is run. This can overwrite a previous log without warning. As such, it is too easy to accidentally lose a log that might contain unique failure

[jira] Updated: (NUTCH-156) nutch-daemon.sh should not overwrite old logs by default

2005-12-28 Thread Paul Baclace (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-156?page=all ] Paul Baclace updated NUTCH-156: --- Attachment: nutch-daemon.sh.patch This is the suggested fix as a patch. Example log name: nutch-peb-jobtracker-ia109102.archive.org-20051228T230740.log nutch

[jira] Commented: (NUTCH-151) CommandRunner can hang after the main thread exec is finished and has inefficient busy loop

2005-12-26 Thread Paul Baclace (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-151?page=comments#action_12361242 ] Paul Baclace commented on NUTCH-151: Analysis: CommandRunner uses CyclicBarrier to synchronize the thread that does the exec (let's call it the main thread) with the io

[jira] Updated: (NUTCH-151) CommandRunner can hang after the main thread exec is finished and has inefficient busy loop

2005-12-26 Thread Paul Baclace (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-151?page=all ] Paul Baclace updated NUTCH-151: --- Attachment: CommandRunner.java Minimal required changes to fix bug NUTCH-151: 1. The pipe io threads should be daemons. 2. The main thread should always interrupt
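
The first point above (pipe io threads as daemons) can be sketched as follows. This is an illustration only, not the attached CommandRunner.java; PipeThreads and startDaemonPump are hypothetical names. A daemon thread does not keep the JVM alive, so a pump thread blocked on a read cannot prevent exit once the main thread returns.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical helper, not the Nutch CommandRunner.
final class PipeThreads {

    /** Starts a daemon thread that copies everything from in to out until end of stream. */
    static Thread startDaemonPump(final InputStream in, final OutputStream out, String name) {
        Thread t = new Thread(() -> {
            byte[] buf = new byte[4096];
            try {
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
                out.flush();
            } catch (IOException e) {
                // The stream was closed or failed; the pump simply ends.
            }
        }, name);
        t.setDaemon(true); // must be called before start()
        t.start();
        return t;
    }
}
```

A caller would start one pump for the child's stdout and one for stderr, then join them with a timeout after the child process exits.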

[jira] Updated: (NUTCH-151) CommandRunner can hang after the main thread exec is finished and has inefficient busy loop

2005-12-26 Thread Paul Baclace (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-151?page=all ] Paul Baclace updated NUTCH-151: --- Attachment: CommandRunner.java.patch Here is the patch for CommandRunner (previously, I attached the actual file). CommandRunner can hang after the main thread

[jira] Updated: (NUTCH-152) TaskRunner io pipes are not setDaemon(true), cleanup and exception errors are incomplete, max heap too small

2005-12-26 Thread Paul Baclace (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-152?page=all ] Paul Baclace updated NUTCH-152: --- Attachment: TaskRunner.java.patch The patch addresses each issue listed in the detailed description of this bug. The detailed description is suitable

[jira] Created: (NUTCH-153) TextParser is only supposed to parse plain text, but if given postscript, it can take hours and then fail

2005-12-26 Thread Paul Baclace (JIRA)
Project: Nutch Type: Bug Components: fetcher Versions: 0.8-dev Environment: all Reporter: Paul Baclace If TextParser is given postscript, it can take hours and then fail. This can be avoided with careful configuration, but if the server MIME type is wrong

[jira] Updated: (NUTCH-153) TextParser is only supposed to parse plain text, but if given postscript, it can take hours and then fail

2005-12-26 Thread Paul Baclace (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-153?page=all ] Paul Baclace updated NUTCH-153: --- Attachment: TextParser.java.patch A patch to reject files with %!PS-Adobe in the first 40 characters of the file. TextParser is only supposed to parse plain
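
A check of the kind described above might look like the sketch below. It is not the attached TextParser.java.patch; PostScriptSniffer and looksLikePostScript are hypothetical names, and the sketch assumes the whole %!PS-Adobe marker must fall inside the first 40 characters.

```java
// Hypothetical helper, not part of Nutch.
final class PostScriptSniffer {

    static final String MAGIC = "%!PS-Adobe";
    static final int WINDOW = 40; // number of leading characters to inspect

    /** Returns true if the marker appears entirely within the first WINDOW characters. */
    static boolean looksLikePostScript(String text) {
        String head = text.length() > WINDOW ? text.substring(0, WINDOW) : text;
        return head.contains(MAGIC);
    }
}
```

A plain text parser could call this before doing any expensive work and reject the content early when it returns true.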

[jira] Commented: (NUTCH-128) second configuration nodes overwrites first node

2005-12-26 Thread Paul Baclace (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-128?page=comments#action_12361254 ] Paul Baclace commented on NUTCH-128: In general, it might be helpful to issue an INFO level log msg whenever a configuration attribute is overridden. If the override

[jira] Created: (NUTCH-151) CommandRunner can hang after the main thread exec is finished and has inefficient busy loop

2005-12-23 Thread Paul Baclace (JIRA)
Type: Bug Components: indexer Versions: 0.8-dev Environment: all Reporter: Paul Baclace I encountered a case where the JVM of a Tasktracker child did not exit after the main thread returned; a thread dump showed only the threads named STDOUT and STDERR from CommandRunner as non

[jira] Updated: (NUTCH-150) OutlinkExtractor extremely slow on some non-plain text

2005-12-22 Thread Paul Baclace (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-150?page=all ] Paul Baclace updated NUTCH-150: --- Attachment: OutlinkExtractor.java.patch This patch has 3 changes: 1. Adds a comment that non-plain-text can be a problem. 2. Adds quantifiers to the regular

Re: nutch-0.8-dev *mapred.input.subdir* problem ?

2005-12-21 Thread Paul Baclace
You can ignore mapred.input.subdir; I find it is an unneeded option. Now that the mapred branch is merged to be the trunk, there is a need to clarify the documentation since a change was made to have the input be specified as a directory and then all files in that directory are considered

Re: NDFS Connection reset

2005-12-20 Thread Paul Baclace
I have recently seen the connection reset problem, and no firewall was involved. I have been doing a mapred index build over more than 5TB of arc files and I noticed: SocketException: Connection reset that occurred in 1 of 1070 map tasks during the parse phase; the task was automatically

Re: Do nutch help me?

2005-11-10 Thread Paul Baclace
Arun Kaundal wrote: Hi I want to crawl local files, internet/intranet documents/files. Do u think nutch help me in this case? Although the tutorial describes these separately, conf/crawl-urlfilter.txt can allow any combination of Internet, Intranet, and local filesystem crawling.

Re: problem with inject url on mapred

2005-11-10 Thread Paul Baclace
[regarding mapred ver 0.8] Anton Potehin wrote: I tried to launch mapred on 2 machines: 192.168.0.250 and 192.168.0.111. 051123 053136 task_m_xaynqo -14885.741% /user/root/seeds/urls:31+31 Please help me to find out what the problem is? And what I did wrong? Is the problem the negative

Re: mapred bug -- bad part calculation?

2005-11-09 Thread Paul Baclace
Rod Taylor wrote: The attached patches for Generator.java and Injector.java allow a specific temporary directory to be specified. This gives Nutch the full path to these temporary directories and seems to fix the No input directories issue when using a local filesystem with multiple task

Re: Distributed nutch

2005-11-09 Thread Paul Baclace
In addition to Stefan Groschupf's detailed references, here are some short, high-level answers to your questions: Rozina Sorathia wrote: 1. What is Distributed nutch Nutch is a distributed Lucene with large scale web crawling. 2. How nutch distributed works? Modeled after Google's

Re: mapred bug -- bad part calculation?

2005-11-07 Thread Paul Baclace
Rod Taylor wrote: The attached patches for Generator.java and Injector.java allow a specific temporary directory to be specified. This gives Nutch the full path to these temporary directories and seems to fix the No input directories issue when using a local filesystem with multiple task

Re: mapred bug -- bad part calculation?

2005-11-07 Thread Paul Baclace
Rod Taylor wrote: NDFS accomplishes the above path finding by auto-prefixing any path not beginning with / with a /user/$USER. I didn't think it was appropriate for LocalFileSystem.java to be mucking around trying to automatically adjust paths to what the user may have intended. Grep-ing for

[jira] Updated: (NUTCH-116) TestNDFS a JUnit test specifically for NDFS

2005-11-04 Thread Paul Baclace (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-116?page=all ] Paul Baclace updated NUTCH-116: --- Attachment: required_by_TestNDFS_v3.patch I found and fixed a problem with a standalone DataNode process exiting too early (this was not detected by the current

[jira] Updated: (NUTCH-116) TestNDFS a JUnit test specifically for NDFS

2005-11-04 Thread Paul Baclace (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-116?page=all ] Paul Baclace updated NUTCH-116: --- Attachment: comments_msgs_and_local_renames_during_TestNDFS.patch TestNDFS a JUnit test specifically for NDFS

deltas to wiki page nutch/NutchDistributedFileSystem

2005-10-31 Thread Paul Baclace
I made these corrections to the wiki page nutch/NutchDistributedFileSystem located at: http://wiki.apache.org/nutch/NutchDistributedFileSystem The OLD/NEW diffs below are based on the action=raw view of the text. I hope someone can fold these into the wiki page since it appears as Immutable

[jira] Updated: (NUTCH-122) block numbers need a better random number generator

2005-10-21 Thread Paul Baclace (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-122?page=all ] Paul Baclace updated NUTCH-122: --- Attachment: MersenneTwister.java I am attaching MersenneTwister.java The license on the attached source: All implementations are based on the mt19937 C code

[jira] Updated: (NUTCH-122) block numbers need a better random number generator

2005-10-21 Thread Paul Baclace (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-122?page=all ] Paul Baclace updated NUTCH-122: --- Attachment: MersenneTwister.java Resubmitting MersenneTwister.java, this time with the Grant ASF checked. block numbers need a better random number generator

[jira] Updated: (NUTCH-116) TestNDFS a JUnit test specifically for NDFS

2005-10-21 Thread Paul Baclace (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-116?page=all ] Paul Baclace updated NUTCH-116: --- Attachment: TestNDFS.java Revised TestNDFS to add a log message about which random number generator is in use (also changed the fixed seed to a newly created

[jira] Updated: (NUTCH-116) TestNDFS a JUnit test specifically for NDFS

2005-10-19 Thread Paul Baclace (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-116?page=all ] Paul Baclace updated NUTCH-116: --- Attachment: required_by_TestNDFS_v2.patch Change Notes revised for patch required_by_TestNDFS_v2.patch which supersedes required_by_TestNDFS.patch: src/java/org

[jira] Commented: (NUTCH-116) TestNDFS a JUnit test specifically for NDFS

2005-10-19 Thread Paul Baclace (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-116?page=comments#action_12332546 ] Paul Baclace commented on NUTCH-116: Doug, Thanks for the quick response. 1. Should BLOCKREPORT_INTERVAL and DATANODE_STARTUP_PERIOD be removed from FSConstants

[jira] Created: (NUTCH-116) TestNDFS a JUnit test specifically for NDFS

2005-10-18 Thread Paul Baclace (JIRA)
: Paul Baclace TestNDFS is a JUnit test for NDFS using pseudo multiprocessing (or more strictly, pseudo distributed) meaning all daemons run in one process and sockets are used to communicate between daemons. The test permutes various block sizes, number of files, file sizes, and number

[jira] Updated: (NUTCH-116) TestNDFS a JUnit test specifically for NDFS

2005-10-18 Thread Paul Baclace (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-116?page=all ] Paul Baclace updated NUTCH-116: --- Attachment: required_by_TestNDFS.patch TestNDFS.java Patch comments: src/java/org/apache/nutch/ipc/Server.java improved logging details, use

patch for changes related to TestNDFS

2005-10-13 Thread Paul Baclace
This patch is for comments/local name change or error msg change only, to clarify the code as it relates to the new JUnit test, TestNDFS that I wrote. In many cases the intention was to help the next person reading the source. These changes should be very safe and are unlikely to introduce

org.apache.commons.io.FileUtils

2005-10-12 Thread Paul Baclace
I need a recursive file delete for cleaning up after a JUnit test. There is one in Commons IO (org.apache.commons.io): FileUtils.deleteDirectory(File directory) I wonder whether I should use org.apache.commons.io as a new jar added to lib or arrange a libtest for jars only used by JUnit

Re: org.apache.commons.io.FileUtils

2005-10-12 Thread Paul Baclace
Paul Baclace wrote: I need a recursive file delete for cleaning up after a JUnit test. I just now spotted: org.apache.nutch.fs.LocalFileSystem.delete(File f) which does what I want (recursive, local delete). So no need for commons.io. Paul

Re: DNS

2005-10-11 Thread Paul Baclace
Fuad Efendi wrote: Another cause of another problem: By default, Java 1.4 caches DNS-to-IP mappings forever... java.security.Security.setProperty("networkaddress.cache.ttl", "1"); I had to look up what the units are for this since your message was possibly ambiguous. The units are in
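
For reference, a minimal sketch of the setting discussed above. DnsCacheTtl and setPositiveTtlSeconds are hypothetical names. Per the JDK networking properties documentation, networkaddress.cache.ttl is an integer number of seconds for caching successful lookups, and Security.setProperty takes both key and value as strings.

```java
import java.security.Security;

// Hypothetical helper, not part of Nutch.
final class DnsCacheTtl {

    /** Sets how long successful DNS lookups are cached by the JVM, in seconds. */
    static void setPositiveTtlSeconds(int seconds) {
        Security.setProperty("networkaddress.cache.ttl", Integer.toString(seconds));
    }
}
```

The property is typically set early at startup, before the first name lookup, so that the resolver cache picks up the value.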

Re: failing of org.apache.nutch.tools.TestSegmentMergeTool?

2005-09-27 Thread Paul Baclace
Chris Mattmann wrote: I just noticed after checking out the latest SVN of Nutch that I am currently failing the TestSegmentMergeTool Junit test when I type ant test for Nutch. I'm on the mapred branch, not the trunk, and all tests pass. One thing I have noticed is that it is best to start

Re: why task tracker ports random?

2005-09-27 Thread Paul Baclace
Stefan Groschupf wrote: As far I understand the code there is only one tasktracker per machine. That is true, but only for the most apparent use case. I'm working on testing which needs to emulate a multi machine deployment. As you can see in the tasktracker code, the ports are cleanly closed

Re: why task tracker ports random?

2005-09-26 Thread Paul Baclace
Stefan Groschupf wrote: Beside that a behavior like the datanode that iterates until it find a free port would be a better than just random. There is a possibility that a test run could start many processes on one machine and a sequential available port search could be contentious. If you

Re: why task tracker ports random?

2005-09-26 Thread Paul Baclace
Stefan Groschupf wrote: Hi Paul, my call stack say that actually no other classes using the tasktracker. Beside that tasktracker could be implement NutchConfigurable than all problems would be solved since this is IOC pattern. Or do I oversee something? I am thinking about the mapred branch

Re: Random number generators for NDFS block numbers

2005-09-26 Thread Paul Baclace
Doug Cutting wrote: It just occurred to me that perhaps we could simply use sequential block numbering. All block ids are generated centrally on the namenode. I'm not sure what the advantage of sequential block numbers would be since long period PRNG block numbering does not even need to

Random number generators for NDFS block numbers

2005-09-23 Thread Paul Baclace
Doug Cutting expressed a concern to me about using util.Random to generate random 64 bit block numbers for NDFS. The following is my analysis. Random number generators for NDFS block numbers Requirements capable of billions of block numbers 64 bit block numbers deterministic for
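
To make the requirements above concrete, here is a minimal sketch of a seeded, deterministic allocator for 64 bit block numbers that never hands out the same number twice. It is not the NDFS code and not the analysis from this message; BlockIdAllocator is a hypothetical name, and java.util.Random is used only as a stand-in generator. Note that the JDK documentation states java.util.Random uses a 48 bit seed, so nextLong() cannot return every possible long value, which is one reason a longer-period generator might be preferred.

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Hypothetical sketch, not the NDFS implementation.
final class BlockIdAllocator {

    private final Random rng;                        // stand-in PRNG; swap in another if desired
    private final Set<Long> inUse = new HashSet<>(); // block numbers already handed out

    BlockIdAllocator(long seed) {
        this.rng = new Random(seed); // fixed seed gives a repeatable sequence
    }

    /** Returns a 64 bit block number that this allocator has not returned before. */
    long next() {
        long id;
        do {
            id = rng.nextLong();
        } while (!inUse.add(id)); // retry in the unlikely case of a repeat
        return id;
    }
}
```

Two allocators built with the same seed produce the same sequence, which keeps test runs repeatable. A real name server would track used block numbers in its own metadata rather than in an in-memory set.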

Re: Random number generators for NDFS block numbers

2005-09-23 Thread Paul Baclace
Thanks to archive.org, I found this additional reference: http://www.math.sci.hiroshima-u.ac.jp/~m-mat/eindex.html which is the English home page of Matsumoto san, co-originator of Mersenne Twister. There is a faq and the original C source. Paul Paul Baclace wrote: [...] Alternative

Re: question re: usage of createTempFile() for NDFS

2005-09-21 Thread Paul Baclace
Doug Cutting wrote: Ordway, Ryan wrote: As a quick workaround, I made a few quick adjustments to the NDFSClient.java code to change the directory that temporary files are created in. This is hard coded to /nutch/tmp, but if someone could perhaps add a config option to make it configurable