[jira] Created: (NUTCH-181) mapred.local.dir temp dir. space allocation limited by smallest area

2006-01-16 Thread Paul Baclace (JIRA)
Versions: 0.8-dev Environment: all Reporter: Paul Baclace When mapred.local.dir is used to specify multiple temp dir. areas, space allocation is limited by the smallest area because the temp dir. selection algorithm is "round robin starting from a randomish point". When round robin is
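The effect described in this report can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the Nutch selection code: the class and method names are hypothetical, every allocation is one equal-size chunk, and a full area causes a failure rather than being skipped.

```java
// Hypothetical illustration of round-robin allocation over temp areas of unequal size.
class RoundRobinSpace {
    // Returns how many equal-size chunks are placed before the first failure,
    // assuming (this is an assumption) that a full area is not skipped.
    static long chunksBeforeFailure(long[] capacity, int start) {
        long[] used = new long[capacity.length];
        long placed = 0;
        int i = start % capacity.length;
        while (used[i] < capacity[i]) {
            used[i]++;                       // place one chunk in the current area
            placed++;
            i = (i + 1) % capacity.length;   // move on to the next area in turn
        }
        return placed;
    }

    public static void main(String[] args) {
        long[] capacity = {10, 100, 100};    // one small area, two large ones
        System.out.println("total capacity: " + (10 + 100 + 100));
        System.out.println("placed before failure: " + chunksBeforeFailure(capacity, 0));
    }
}
```

Under these assumptions the usable space is roughly the number of areas times the smallest capacity, regardless of how large the other areas are.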

[jira] Commented: (NUTCH-159) Specify temp/working directory for crawl

2006-01-10 Thread Paul Baclace (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-159?page=comments#action_12362392 ] Paul Baclace commented on NUTCH-159: mapred.temp.dir and mapred.local.dir are used for different purposes. I think this is a sysadmin usability bug that really means: 1

[jira] Commented: (NUTCH-151) CommandRunner can hang after the main thread exec is finished and has inefficient busy loop

2006-01-10 Thread Paul Baclace (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-151?page=comments#action_12362383 ] Paul Baclace commented on NUTCH-151: The number of threads that invoke _barrier.barrier() or .attemptBarrier() should match the count passed to the constructor of

[jira] Commented: (NUTCH-162) country code "jp" is used instead of language code "ja" for Japanese

2006-01-09 Thread Paul Baclace (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-162?page=comments#action_12362274 ] Paul Baclace commented on NUTCH-162: The best practice for identifying localization is to use the ISO language and country code in the form of lowercase language code
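The language/country distinction raised here can be checked against java.util.Locale, which keeps the two codes separate. The sketch below only shows Java's own convention; the class name LocaleCodes and the helper tag() are made up for illustration and are not part of Nutch.

```java
import java.util.Locale;

// Hypothetical helper showing that "ja" (language) and "JP" (country) are distinct codes.
class LocaleCodes {
    // Joins the language and country parts of a locale with an underscore.
    static String tag(Locale locale) {
        return locale.getLanguage() + "_" + locale.getCountry();
    }

    public static void main(String[] args) {
        Locale japan = Locale.JAPAN;                 // Japanese as used in Japan
        System.out.println(japan.getLanguage());     // the language code
        System.out.println(japan.getCountry());      // the country code
        System.out.println(tag(japan));              // language and country combined
    }
}
```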

[jira] Commented: (NUTCH-153) TextParser is only supposed to parse plain text, but if given postscript, it can take hours and then fail

2006-01-09 Thread Paul Baclace (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-153?page=comments#action_12362272 ] Paul Baclace commented on NUTCH-153: > NUTCH-160? There is slowness and then there is continental drift. The quantifiers should be used with any regex package unless

[jira] Commented: (NUTCH-152) TaskRunner io pipes are not setDaemon(true), cleanup and exception errors are incomplete, max heap too small

2006-01-06 Thread Paul Baclace (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-152?page=comments#action_12362043 ] Paul Baclace commented on NUTCH-152: >re 3: Why is a separate thread needed for stdout? It certainly makes the code easier to read. Using the main thread to read

[jira] Commented: (NUTCH-153) TextParser is only supposed to parse plain text, but if given postscript, it can take hours and then fail

2006-01-06 Thread Paul Baclace (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-153?page=comments#action_12362000 ] Paul Baclace commented on NUTCH-153: > mime.type.magic? The particular run that had problems was using mime.type.magic=true. It turns out that the magic "%!

[jira] Updated: (NUTCH-156) nutch-daemon.sh should not overwrite old logs by default

2005-12-28 Thread Paul Baclace (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-156?page=all ] Paul Baclace updated NUTCH-156: --- Attachment: nutch-daemon.sh.patch This is the suggested fix as a patch. Example log name: nutch-peb-jobtracker-ia109102.archive.org-20051228T230740.log

[jira] Created: (NUTCH-156) nutch-daemon.sh should not overwrite old logs by default

2005-12-28 Thread Paul Baclace (JIRA)
: Paul Baclace nutch-daemon.sh creates a log file with the name pattern "$NUTCH_LOG_DIR/nutch-$NUTCH_IDENT_STRING-$command-`hostname`.log" every time it is run. This can overwrite a previous log without warning. As such, it is too easy to accidentally lose a log that might contain uniq

[jira] Commented: (NUTCH-108) tasktracker crashs when reconnecting to a new jobtracker.

2005-12-28 Thread Paul Baclace (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-108?page=comments#action_12361339 ] Paul Baclace commented on NUTCH-108: I just had the opportunity to test this with 33 tasktrackers. One thing I noticed: TaskTracker.java should be patched to reduce the

[jira] Updated: (NUTCH-108) tasktracker crashs when reconnecting to a new jobtracker.

2005-12-28 Thread Paul Baclace (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-108?page=all ] Paul Baclace updated NUTCH-108: --- Attachment: TaskTracker.java.patch Here is a patch for reducing redundant, voluminous output while retrying to connect. > tasktracker crashs w

[jira] Commented: (NUTCH-128) second configuration nodes overwrites first node

2005-12-26 Thread Paul Baclace (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-128?page=comments#action_12361254 ] Paul Baclace commented on NUTCH-128: In general, it might be helpful to issue an INFO level log msg whenever a configuration attribute is overridden. If the override

[jira] Updated: (NUTCH-153) TextParser is only supposed to parse plain text, but if given postscript, it can take hours and then fail

2005-12-26 Thread Paul Baclace (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-153?page=all ] Paul Baclace updated NUTCH-153: --- Attachment: TextParser.java.patch A patch to reject files with "%!PS-Adobe" in the first 40 characters of the file. > TextParser is only supp
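A check of the kind this patch note describes might look like the following. This is a sketch, not the attached TextParser.java.patch: the class and method names are hypothetical, and it assumes the whole "%!PS-Adobe" marker must fall inside the first 40 characters.

```java
// Hypothetical sniffing helper; not the code from the attached patch.
class PostScriptSniffer {
    static final String MAGIC = "%!PS-Adobe";
    static final int WINDOW = 40;   // number of leading characters to inspect

    // True if the PostScript marker appears entirely within the first WINDOW characters.
    static boolean looksLikePostScript(String text) {
        if (text == null) {
            return false;
        }
        String head = text.substring(0, Math.min(WINDOW, text.length()));
        return head.contains(MAGIC);
    }

    public static void main(String[] args) {
        System.out.println(looksLikePostScript("%!PS-Adobe-3.0\n%%Title: demo"));
        System.out.println(looksLikePostScript("Just some plain text."));
    }
}
```

A parser could call such a check before doing any expensive work and reject the content early when it returns true.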

[jira] Created: (NUTCH-153) TextParser is only supposed to parse plain text, but if given postscript, it can take hours and then fail

2005-12-26 Thread Paul Baclace (JIRA)
Project: Nutch Type: Bug Components: fetcher Versions: 0.8-dev Environment: all Reporter: Paul Baclace If TextParser is given postscript, it can take hours and then fail. This can be avoided with careful configuration, but if the server MIME type is wrong and the

[jira] Updated: (NUTCH-152) TaskRunner io pipes are not setDaemon(true), cleanup and exception errors are incomplete, max heap too small

2005-12-26 Thread Paul Baclace (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-152?page=all ] Paul Baclace updated NUTCH-152: --- Attachment: TaskRunner.java.patch The patch addresses each issue listed in the detailed description of this bug. The detailed description is suitable as a

[jira] Created: (NUTCH-152) TaskRunner io pipes are not setDaemon(true), cleanup and exception errors are incomplete, max heap too small

2005-12-26 Thread Paul Baclace (JIRA)
/NUTCH-152 Project: Nutch Type: Bug Components: fetcher Versions: 0.8-dev Environment: all Reporter: Paul Baclace 1. io pipes should be setDaemon(true) so that process cannot hang. 2. error messages for Exceptions are incomplete since e.getMessage() is used and it can be

[jira] Updated: (NUTCH-151) CommandRunner can hang after the main thread exec is finished and has inefficient busy loop

2005-12-26 Thread Paul Baclace (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-151?page=all ] Paul Baclace updated NUTCH-151: --- Attachment: CommandRunner.java.patch Here is the patch for CommandRunner (previously, I attached the actual file). > CommandRunner can hang after the m

[jira] Updated: (NUTCH-151) CommandRunner can hang after the main thread exec is finished and has inefficient busy loop

2005-12-26 Thread Paul Baclace (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-151?page=all ] Paul Baclace updated NUTCH-151: --- Attachment: CommandRunner.java Minimal required changes to fix bug NUTCH-151: 1. The pipe io threads should be daemons. 2. The main thread should always interrupt
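The first point, making the pipe io threads daemons, relies on standard java.lang.Thread behaviour: the JVM exits once only daemon threads remain. The sketch below is not the attached CommandRunner.java; the class and helper names are hypothetical, and the blocked thread merely stands in for a thread stuck reading a pipe.

```java
// Hypothetical helper; illustrates Thread.setDaemon, not the CommandRunner change itself.
class DaemonPipeThread {
    // Starts a named daemon thread. setDaemon must be called before start().
    static Thread startDaemon(Runnable body, String name) {
        Thread t = new Thread(body, name);
        t.setDaemon(true);
        t.start();
        return t;
    }

    public static void main(String[] args) {
        Thread reader = startDaemon(() -> {
            try {
                Thread.sleep(Long.MAX_VALUE);   // stands in for a read that never returns
            } catch (InterruptedException e) {
                // interrupted: fall through and let the thread end
            }
        }, "STDOUT");
        System.out.println(reader.getName() + " daemon=" + reader.isDaemon());
        // main returns here; the JVM can exit even though the reader is still blocked
    }
}
```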

[jira] Commented: (NUTCH-151) CommandRunner can hang after the main thread exec is finished and has inefficient busy loop

2005-12-26 Thread Paul Baclace (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-151?page=comments#action_12361242 ] Paul Baclace commented on NUTCH-151: Analysis: CommandRunner uses CyclicBarrier to synchronize the thread that does the exec (let's call it the main thread) with the io

[jira] Created: (NUTCH-151) CommandRunner can hang after the main thread exec is finished and has inefficient busy loop

2005-12-23 Thread Paul Baclace (JIRA)
Type: Bug Components: indexer Versions: 0.8-dev Environment: all Reporter: Paul Baclace I encountered a case where the JVM of a Tasktracker child did not exit after the main thread returned; a thread dump showed only the threads named STDOUT and STDERR from CommandRunner as non

[jira] Updated: (NUTCH-150) OutlinkExtractor extremely slow on some non-plain text

2005-12-22 Thread Paul Baclace (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-150?page=all ] Paul Baclace updated NUTCH-150: --- Attachment: OutlinkExtractor.java.patch This patch has 3 changes: 1. Adds a comment that non-plain-text can be a problem. 2. Adds quantifiers to the regular

[jira] Created: (NUTCH-150) OutlinkExtractor extremely slow on some non-plain text

2005-12-22 Thread Paul Baclace (JIRA)
OutlinkExtractor extremely slow on some non-plain text -- Key: NUTCH-150 URL: http://issues.apache.org/jira/browse/NUTCH-150 Project: Nutch Type: Bug Versions: 0.8-dev Environment: All Reporter: Paul

Re: nutch-0.8-dev *mapred.input.subdir* problem ?

2005-12-21 Thread Paul Baclace
You can ignore mapred.input.subdir; I find it is an unneeded option. Now that the mapred branch is merged to be the trunk, there is a need to clarify the documentation since a change was made to have the input be specified as a directory and then all files in that directory are considered inp

Re: NDFS Connection reset

2005-12-20 Thread Paul Baclace
I have recently seen the connection reset problem, and no firewall was involved. I have been doing a mapred index build over more than 5TB of arc files and I noticed: SocketException: Connection reset that occurred in 1 of 1070 map tasks during the parse phase; the task was automatically rest

Re: [bug] overwriting job properties until runtime is not possible

2005-12-20 Thread Paul Baclace
Stefan Groschupf wrote: > My suggestion is that we change NutchConf is following way: > > resourceNames.add(resourceNames.size()-1, name); // add second to last > loadResource(properties, name, false); This would make property settings in the new resource (name, in the above) override explicitly

Re: NDFS Connection reset

2005-12-08 Thread Paul Baclace
Jack Tang wrote: It was odd that when I input every command, the NameNode would throw exception: 051206 003714 Server connection on port 9000 from 127.0.0.1: starting 051206 003715 Server connection on port 9000 from 127.0.0.1 caught: java.net.SocketException: Connection reset java.net.SocketExc

[jira] Commented: (NUTCH-120) one "bad" link on a page kills parsing

2005-11-23 Thread Paul Baclace (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-120?page=comments#action_12358426 ] Paul Baclace commented on NUTCH-120: Indeed there is a comment that indicates the code keeps trying, but luckily it does not, and it might be unwise to keep trying after

Re: problem with inject url on mapred

2005-11-10 Thread Paul Baclace
[regarding mapred ver 0.8] Anton Potehin wrote: I tried to launch mapred on 2 machines: 192.168.0.250 and 192.168.0.111. > 051123 053136 task_m_xaynqo -14885.741% /user/root/seeds/urls:31+31 > Please help me to find out what the problem is? And what I did wrong? Is the problem the negative p

Re: Do nutch help me?

2005-11-10 Thread Paul Baclace
Arun Kaundal wrote: Hi I want to crawl local files, internet/intranet documents/files. Do u think nutch help me in this case? Although the tutorial describes these separately, conf/crawl-urlfilter.txt can allow any combination of Internet, Intranet, and local filesystem crawling.

Re: Distributed nutch

2005-11-09 Thread Paul Baclace
In addition to Stefan Groschupf's detailed references, here are some short, high-level answers to your questions: Rozina Sorathia wrote: > 1. What is Distributed nutch Nutch is a distributed Lucene with large scale web crawling. >2. How nutch distributed works? Modeled after Google's Map-R

Re: mapred bug -- bad part calculation?

2005-11-09 Thread Paul Baclace
Rod Taylor wrote: The attached patches for Generator.java and Injector.java allow a specific temporary directory to be specified. This gives Nutch the full path to these temporary directories and seems to fix the "No input directories" issue when using a local filesystem with multiple task tracke

Re: mapred bug -- bad part calculation?

2005-11-07 Thread Paul Baclace
Rod Taylor wrote: NDFS accomplishes the above path finding by auto-prefixing any path not beginning with / with a /user/$USER. I didn't think it was appropriate for LocalFileSystem.java to be mucking around trying to automatically adjust paths to what the user may have intended. Grep-ing for /

Re: mapred bug -- bad part calculation?

2005-11-07 Thread Paul Baclace
Rod Taylor wrote: The attached patches for Generator.java and Injector.java allow a specific temporary directory to be specified. This gives Nutch the full path to these temporary directories and seems to fix the "No input directories" issue when using a local filesystem with multiple task tracke

[jira] Updated: (NUTCH-116) TestNDFS a JUnit test specifically for NDFS

2005-11-04 Thread Paul Baclace (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-116?page=all ] Paul Baclace updated NUTCH-116: --- Attachment: comments_msgs_and_local_renames_during_TestNDFS.patch > TestNDFS a JUnit test specifically for N

[jira] Updated: (NUTCH-116) TestNDFS a JUnit test specifically for NDFS

2005-11-04 Thread Paul Baclace (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-116?page=all ] Paul Baclace updated NUTCH-116: --- Attachment: required_by_TestNDFS_v3.patch I found and fixed a problem with a standalone DataNode process exiting too early (this was not detected by the current

deltas to wiki page nutch/NutchDistributedFileSystem

2005-10-31 Thread Paul Baclace
I made these corrections to the wiki page nutch/NutchDistributedFileSystem located at: http://wiki.apache.org/nutch/NutchDistributedFileSystem The OLD/NEW diffs below are based on the action=raw view of the text. I hope someone can fold these into the wiki page since it appears as "Immutable

[jira] Updated: (NUTCH-116) TestNDFS a JUnit test specifically for NDFS

2005-10-21 Thread Paul Baclace (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-116?page=all ] Paul Baclace updated NUTCH-116: --- Attachment: TestNDFS.java Revised TestNDFS to add a log message about which random number generator is in use (also changed the fixed seed to a newly created

[jira] Updated: (NUTCH-122) block numbers need a better random number generator

2005-10-21 Thread Paul Baclace (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-122?page=all ] Paul Baclace updated NUTCH-122: --- Attachment: MersenneTwister.java Resubmitting MersenneTwister.java, this time with the Grant ASF checked. > block numbers need a better random number genera

[jira] Updated: (NUTCH-122) block numbers need a better random number generator

2005-10-21 Thread Paul Baclace (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-122?page=all ] Paul Baclace updated NUTCH-122: --- Attachment: MersenneTwister.java I am attaching MersenneTwister.java The license on the attached source: All implementations are based on the mt19937 C code

[jira] Created: (NUTCH-122) block numbers need a better random number generator

2005-10-20 Thread Paul Baclace (JIRA)
Reporter: Paul Baclace In order to support billions of block numbers, a better PRNG than java.util.Random is needed. To reach billions with low probability of collision, 64 bit random numbers are needed (the Birthday Problem is the model for the number of bits needed; the result is that
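The Birthday Problem estimate mentioned here can be reproduced with the standard approximation p ≈ 1 - exp(-n(n-1) / (2·2^b)) for n random b-bit numbers. The sketch below uses that textbook formula only; the class name and the sample values of n are illustrative and are not taken from the issue.

```java
// Hypothetical calculator for the birthday-problem collision estimate.
class BirthdayBound {
    // Approximate probability of at least one collision among n uniformly
    // random values drawn from a space of 2^bits values.
    static double collisionProbability(double n, int bits) {
        double space = Math.pow(2.0, bits);
        // 1 - exp(-x), written with expm1 for accuracy when x is small
        return -Math.expm1(-n * (n - 1) / (2.0 * space));
    }

    public static void main(String[] args) {
        double n = 1e9;   // one billion block numbers, an illustrative figure
        System.out.printf("32 bits: %.6f%n", collisionProbability(n, 32));
        System.out.printf("64 bits: %.6f%n", collisionProbability(n, 64));
    }
}
```

With 32-bit numbers a collision among a billion blocks is essentially certain, while 64-bit numbers bring the estimate down to a few percent.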

Re: No buffer space available

2005-10-19 Thread Paul Baclace
[EMAIL PROTECTED] wrote: Thanks for the tips. But I have a monster computer, 12G RAM and dual 64 bits processors, my network connection is 100 MB/S! I guess Nutch doesn't close the opened sockets in the case of bad host! I am still struggling with the problem. If the OS is using a default/generic

[jira] Commented: (NUTCH-116) TestNDFS a JUnit test specifically for NDFS

2005-10-19 Thread Paul Baclace (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-116?page=comments#action_12332546 ] Paul Baclace commented on NUTCH-116: Doug, Thanks for the quick response. 1. Should BLOCKREPORT_INTERVAL and DATANODE_STARTUP_PERIOD be removed from FSConstants

[jira] Updated: (NUTCH-116) TestNDFS a JUnit test specifically for NDFS

2005-10-19 Thread Paul Baclace (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-116?page=all ] Paul Baclace updated NUTCH-116: --- Attachment: required_by_TestNDFS_v2.patch Change Notes revised for patch required_by_TestNDFS_v2.patch which supersedes required_by_TestNDFS.patch: src/java/org

Re: Event queues vs threads

2005-10-18 Thread Paul Baclace
Doug Cutting wrote: >Kelvin Tan wrote: fetcher as a series of event queues (ala SEDA) instead of with threads. I have never been able to write an async version of things with Java's nio that outperforms a threaded version. In theory it is possible, since you can avoid thread switching overhea

[jira] Updated: (NUTCH-116) TestNDFS a JUnit test specifically for NDFS

2005-10-18 Thread Paul Baclace (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-116?page=all ] Paul Baclace updated NUTCH-116: --- Attachment: required_by_TestNDFS.patch TestNDFS.java Patch comments: src/java/org/apache/nutch/ipc/Server.java improved logging details, use

[jira] Created: (NUTCH-116) TestNDFS a JUnit test specifically for NDFS

2005-10-18 Thread Paul Baclace (JIRA)
: Paul Baclace TestNDFS is a JUnit test for NDFS using "pseudo multiprocessing" (or more strictly, pseudo distributed) meaning all daemons run in one process and sockets are used to communicate between daemons. The test permutes various block sizes, number of files, file sizes, and

patch to fix NPE in Daemon.getRunnable()

2005-10-17 Thread Paul Baclace
An obvious fix; I also set the name of the Thread which is needed for debugging clean-shutdown of daemons. This NPE is seen when calling DataNode.shutdown(). Paul Index: E:/Src/nutch/mapred/src/java/org/apache/nutch/util/Daemon.java ===

patch for changes related to TestNDFS

2005-10-13 Thread Paul Baclace
This patch is for comments/local name change or error msg change only, to clarify the code as it relates to the new JUnit test, TestNDFS that I wrote. In many cases the intention was to help the next person reading the source. These changes should be very safe and are unlikely to introduce subtl

Re: org.apache.commons.io.FileUtils

2005-10-12 Thread Paul Baclace
Paul Baclace wrote: I need a recursive file delete for cleaning up after a JUnit test. I just now spotted: org.apache.nutch.fs.LocalFileSystem.delete(File f) which does what I want (recursive, local delete). So no need for commons.io. Paul

org.apache.commons.io.FileUtils

2005-10-11 Thread Paul Baclace
I need a recursive file delete for cleaning up after a JUnit test. There is one in Commons IO (org.apache.commons.io): FileUtils.deleteDirectory(File directory) I wonder whether I should use org.apache.commons.io as a new jar added to lib or arrange a libtest for jars only used by JUnit tests,
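For reference, a recursive local delete of the kind asked about can be written with java.io.File alone. This is a minimal sketch, not the Commons IO or Nutch implementation; the class name is hypothetical, and it makes no attempt to treat symbolic links specially.

```java
import java.io.File;

// Hypothetical stand-alone helper for cleaning up a test directory.
class RecursiveDelete {
    // Deletes f and, if it is a directory, everything beneath it.
    // Returns false as soon as any deletion fails.
    static boolean delete(File f) {
        File[] children = f.listFiles();   // null when f is not a directory
        if (children != null) {
            for (File child : children) {
                if (!delete(child)) {
                    return false;
                }
            }
        }
        return f.delete();
    }

    public static void main(String[] args) {
        File scratch = new File(System.getProperty("java.io.tmpdir"),
                                "recursive-delete-demo-" + System.nanoTime());
        new File(scratch, "a/b").mkdirs();   // build a small nested tree
        System.out.println("deleted: " + delete(scratch));
    }
}
```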

Re: DNS

2005-10-11 Thread Paul Baclace
Fuad Efendi wrote: Another cause of another problem: By default, Java 1.4 caches DNS-to-IP mappings forever... java.security.Security.setProperty("networkaddress.cache.ttl" , "1"); I had to look up what the units are for this since your message was possibly ambiguous. The units are in se
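The setting quoted above is a Java security property whose value is a number of seconds, with -1 meaning cache forever. A minimal sketch of applying it follows; the class and method names are hypothetical, and it assumes the call is made early, before the JVM performs its first name lookup.

```java
import java.security.Security;

// Hypothetical wrapper around the networkaddress.cache.ttl security property.
class DnsCacheTtl {
    // Sets how long successful DNS lookups are cached, in seconds.
    static void setPositiveTtlSeconds(int seconds) {
        Security.setProperty("networkaddress.cache.ttl", Integer.toString(seconds));
    }

    public static void main(String[] args) {
        setPositiveTtlSeconds(1);
        System.out.println(Security.getProperty("networkaddress.cache.ttl"));
    }
}
```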

Re: why task tracker ports random?

2005-09-27 Thread Paul Baclace
Stefan Groschupf wrote: As far as I understand the code there is only one tasktracker per machine. That is true, but only for the most apparent use case. I'm working on testing which needs to emulate a multi-machine deployment. As you can see in the tasktracker code, the ports are cleanly closed i

Re: failing of org.apache.nutch.tools.TestSegmentMergeTool?

2005-09-27 Thread Paul Baclace
Chris Mattmann wrote: I just noticed after checking out the latest SVN of Nutch that I am currently failing the TestSegmentMergeTool Junit test when I type "ant test" for Nutch. I'm on the mapred branch, not the trunk, and all tests pass. One thing I have noticed is that it is best to start

Re: Random number generators for NDFS block numbers

2005-09-26 Thread Paul Baclace
Doug Cutting wrote: It just occurred to me that perhaps we could simply use sequential block numbering. All block ids are generated centrally on the namenode. I'm not sure what the advantage of sequential block numbers would be since long period PRNG block numbering does not even need to sto

Re: why task tracker ports random?

2005-09-26 Thread Paul Baclace
Stefan Groschupf wrote: Hi Paul, my call stack say that actually no other classes using the tasktracker. Beside that tasktracker could be implement NutchConfigurable than all problems would be solved since this is IOC pattern. Or do I oversee something? I am thinking about the mapred branch

Re: why task tracker ports random?

2005-09-26 Thread Paul Baclace
Stefan Groschupf wrote: Beside that a behavior like the datanode that iterates until it find a free port would be a better than just random. There is a possibility that a test run could start many processes on one machine and a sequential available port search could be contentious. If you can

Re: Random number generators for NDFS block numbers

2005-09-23 Thread Paul Baclace
Thanks to archive.org, I found this additional reference: http://www.math.sci.hiroshima-u.ac.jp/~m-mat/eindex.html which is the English home page of Matsumoto san, co-originator of Mersenne Twister. There is a faq and the original C source. Paul Paul Baclace wrote: [...] Alternative

Random number generators for NDFS block numbers

2005-09-23 Thread Paul Baclace
Doug Cutting expressed a concern to me about using util.Random to generate random 64 bit block numbers for NDFS. The following is my analysis. Random number generators for NDFS block numbers Requirements capable of billions of block numbers 64 bit block numbers deterministic for re

Re: question re: usage of createTempFile() for NDFS

2005-09-21 Thread Paul Baclace
Paul Baclace wrote: Doug Cutting wrote: In Nutch, define this in NUTCH_OPTS, with something like: export NUTCH_OPTS=-Djava.io.tmpdir=/foo It is not yet that clean. There are a few explicit uses of "/tmp": JobConf.java NameNode.java NDFS.java (last seen in Release-0.7) Th

Re: question re: usage of createTempFile() for NDFS

2005-09-20 Thread Paul Baclace
Doug Cutting wrote: Ordway, Ryan wrote: As a quick workaround, I made a few quick adjustments to the NDFSClient.java code to change the directory that temporary files are created in. This is hard coded to /nutch/tmp, but if someone could perhaps add a config option to make it configurable t

mapred patch for improved error message and some javadoc comments

2005-09-16 Thread Paul Baclace
Here is a patch for improving the error message that is displayed when an intranet crawl commandline has a file instead of a directory of files containing URLs. The old error msg: java.io.IOException: No input files in: [Ljava.io.File;@c24c0 Obviously, the default toString() says nothing. The
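The "[Ljava.io.File;@c24c0" text in the old message is what Java prints when an array is concatenated into a string. One way to get a readable message is java.util.Arrays.toString, sketched below; this is not the wording or code of the actual patch, and the class and method names are hypothetical.

```java
import java.io.File;
import java.util.Arrays;

// Hypothetical message builder; the real patch's wording is not shown in the post.
class InputDirMessage {
    // Builds a message that lists the directories instead of the array's identity.
    static String describe(File[] dirs) {
        return "No input files in: " + Arrays.toString(dirs);
    }

    public static void main(String[] args) {
        File[] dirs = { new File("urls.txt") };
        System.out.println("No input files in: " + dirs);   // default array toString()
        System.out.println(describe(dirs));                 // element-by-element listing
    }
}
```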

Re: Nutch vulnerabilities

2005-09-16 Thread Paul Baclace
Michael Ji wrote: No particular vulnerable higher than the case you running a web server, if I am not wrong; tomcat is same as a webserver except JSP is its core engine; I would suggest following any instructions that Tomcat has for locking it down. For instance, there is a conf setting (the