Versions: 0.8-dev
Environment: all
Reporter: Paul Baclace
When mapred.local.dir is used to specify multiple temp dir. areas, space
allocation is limited by the smallest area because the temp dir. selection algorithm
is "round robin starting from a randomish point". When round robin is
[
http://issues.apache.org/jira/browse/NUTCH-159?page=comments#action_12362392 ]
Paul Baclace commented on NUTCH-159:
mapred.temp.dir and mapred.local.dir are used for different purposes.
I think this is a sysadmin usability bug that really means:
1
[
http://issues.apache.org/jira/browse/NUTCH-151?page=comments#action_12362383 ]
Paul Baclace commented on NUTCH-151:
The number of threads that invoke _barrier.barrier() or .attemptBarrier()
should match the count passed to the constructor of
[
http://issues.apache.org/jira/browse/NUTCH-162?page=comments#action_12362274 ]
Paul Baclace commented on NUTCH-162:
The best practice for identifying localization is to use the ISO language and
country code in the form of lowercase language code
[
http://issues.apache.org/jira/browse/NUTCH-153?page=comments#action_12362272 ]
Paul Baclace commented on NUTCH-153:
> NUTCH-160?
There is slowness and then there is continental drift. The quantifiers should
be used with any regex package unless
[
http://issues.apache.org/jira/browse/NUTCH-152?page=comments#action_12362043 ]
Paul Baclace commented on NUTCH-152:
>re 3: Why is a separate thread needed for stdout?
It certainly makes the code easier to read. Using the main thread to read
[
http://issues.apache.org/jira/browse/NUTCH-153?page=comments#action_12362000 ]
Paul Baclace commented on NUTCH-153:
> mime.type.magic?
The particular run that had problems was using mime.type.magic=true. It turns
out that the magic "%!
[ http://issues.apache.org/jira/browse/NUTCH-156?page=all ]
Paul Baclace updated NUTCH-156:
---
Attachment: nutch-daemon.sh.patch
This is the suggested fix as a patch.
Example log name:
nutch-peb-jobtracker-ia109102.archive.org-20051228T230740.log
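The timestamped name above can be illustrated with a short Java sketch. This
is only an illustration: the actual fix is a patch to the nutch-daemon.sh
shell script, and the class name LogNameSketch, the helper logName, and the
timestamp layout yyyyMMdd'T'HHmmss are assumptions inferred from the example
name, not taken from the patch.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class LogNameSketch {
    // Assumed timestamp layout, inferred from the example "20051228T230740".
    private static final DateTimeFormatter STAMP =
        DateTimeFormatter.ofPattern("yyyyMMdd'T'HHmmss");

    /** Builds nutch-IDENT-COMMAND-HOST-TIMESTAMP.log (hypothetical helper). */
    static String logName(String ident, String command, String host,
                          LocalDateTime when) {
        return "nutch-" + ident + "-" + command + "-" + host
            + "-" + STAMP.format(when) + ".log";
    }

    public static void main(String[] args) {
        LocalDateTime when = LocalDateTime.of(2005, 12, 28, 23, 7, 40);
        System.out.println(
            logName("peb", "jobtracker", "ia109102.archive.org", when));
    }
}
```

Because the timestamp has one-second resolution, two runs would have to start
within the same second on the same host to collide on a name.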
: Paul Baclace
nutch-daemon.sh creates a log file with the name pattern
"$NUTCH_LOG_DIR/nutch-$NUTCH_IDENT_STRING-$command-`hostname`.log"
every time it is run. This can overwrite a previous log without warning. As
such, it is too easy to accidentally lose a log that might contain uniq
[
http://issues.apache.org/jira/browse/NUTCH-108?page=comments#action_12361339 ]
Paul Baclace commented on NUTCH-108:
I just had the opportunity to test this with 33 tasktrackers.
One thing I noticed: TaskTracker.java should be patched to reduce the
[ http://issues.apache.org/jira/browse/NUTCH-108?page=all ]
Paul Baclace updated NUTCH-108:
---
Attachment: TaskTracker.java.patch
Here is a patch for reducing redundant, voluminous output while retrying to
connect.
> tasktracker crashs w
[
http://issues.apache.org/jira/browse/NUTCH-128?page=comments#action_12361254 ]
Paul Baclace commented on NUTCH-128:
In general, it might be helpful to issue an INFO level log msg whenever a
configuration attribute is overridden. If the override
[ http://issues.apache.org/jira/browse/NUTCH-153?page=all ]
Paul Baclace updated NUTCH-153:
---
Attachment: TextParser.java.patch
A patch to reject files with "%!PS-Adobe" in the first 40 characters of the
file.
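A check of this kind might look like the following minimal sketch. It is not
the attached TextParser.java.patch: the class name PostScriptSniffer and the
method looksLikePostScript are hypothetical, and the sketch assumes that "in
the first 40 characters" means the whole marker must fall inside that window.

```java
public class PostScriptSniffer {
    // Marker and window size as quoted in the patch description.
    private static final String MARKER = "%!PS-Adobe";
    private static final int WINDOW = 40;

    /** Returns true if the marker lies entirely within the first 40 characters. */
    static boolean looksLikePostScript(String text) {
        String head = text.substring(0, Math.min(WINDOW, text.length()));
        return head.contains(MARKER);
    }

    public static void main(String[] args) {
        System.out.println(looksLikePostScript("%!PS-Adobe-3.0\n%%Title: demo"));
        System.out.println(looksLikePostScript("plain text about crawling"));
    }
}
```

Inspecting only a short prefix keeps the test cheap, so it can run before any
expensive parsing starts.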
> TextParser is only supp
Project: Nutch
Type: Bug
Components: fetcher
Versions: 0.8-dev
Environment: all
Reporter: Paul Baclace
If TextParser is given postscript, it can take hours and then fail. This can
be avoided with careful configuration, but if the server MIME type is wrong and
the
[ http://issues.apache.org/jira/browse/NUTCH-152?page=all ]
Paul Baclace updated NUTCH-152:
---
Attachment: TaskRunner.java.patch
The patch addresses each issue listed in the detailed description of this bug.
The detailed description is suitable as a
/NUTCH-152
Project: Nutch
Type: Bug
Components: fetcher
Versions: 0.8-dev
Environment: all
Reporter: Paul Baclace
1. io pipes should be setDaemon(true) so that process cannot hang.
2. error messages for Exceptions are incomplete since e.getMessage() is used
and it can be
[ http://issues.apache.org/jira/browse/NUTCH-151?page=all ]
Paul Baclace updated NUTCH-151:
---
Attachment: CommandRunner.java.patch
Here is the patch for CommandRunner (previously, I attached the actual file).
> CommandRunner can hang after the m
[ http://issues.apache.org/jira/browse/NUTCH-151?page=all ]
Paul Baclace updated NUTCH-151:
---
Attachment: CommandRunner.java
Minimal required changes to fix bug NUTCH-151:
1. The pipe io threads should be daemons.
2. The main thread should always interrupt
[
http://issues.apache.org/jira/browse/NUTCH-151?page=comments#action_12361242 ]
Paul Baclace commented on NUTCH-151:
Analysis:
CommandRunner uses CyclicBarrier to synchronize the thread that does the
exec (let's call it the main thread) with the io
Type: Bug
Components: indexer
Versions: 0.8-dev
Environment: all
Reporter: Paul Baclace
I encountered a case where the JVM of a Tasktracker child did not exit after
the main thread returned; a thread dump showed only the threads named STDOUT
and STDERR from CommandRunner as non
[ http://issues.apache.org/jira/browse/NUTCH-150?page=all ]
Paul Baclace updated NUTCH-150:
---
Attachment: OutlinkExtractor.java.patch
This patch has 3 changes:
1. Adds a comment that non-plain-text can be a problem.
2. Adds quantifiers to the regular
OutlinkExtractor extremely slow on some non-plain text
--
Key: NUTCH-150
URL: http://issues.apache.org/jira/browse/NUTCH-150
Project: Nutch
Type: Bug
Versions: 0.8-dev
Environment: All
Reporter: Paul
You can ignore mapred.input.subdir; I find it is an unneeded option.
Now that the mapred branch is merged to be the trunk, there is a need
to clarify the documentation since a change was made to have the
input be specified as a directory and then all files in that directory
are considered inp
I have recently seen the connection reset problem, and no firewall was involved.
I have been doing a mapred index build over more than 5TB of arc files and I
noticed:
SocketException: Connection reset
that occurred in 1 of 1070 map tasks during the parse phase; the task was
automatically rest
Stefan Groschupf wrote:
> My suggestion is that we change NutchConf in the following way:
>
> resourceNames.add(resourceNames.size()-1, name); // add second to last
> loadResource(properties, name, false);
This would make property settings in the new resource (name, in the above)
override explicitly
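To make the quoted list operation concrete, here is a small sketch of where
add(size()-1, name) places a new entry. It only shows list positions and says
nothing about how NutchConf actually loads or prioritizes resources. The class
ResourceOrderSketch and the file names used in main are assumptions for
illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class ResourceOrderSketch {
    /**
     * Inserts name just before the current last entry, as in the quoted
     * suggestion. The list must be non-empty, otherwise the index is -1
     * and add() throws IndexOutOfBoundsException.
     */
    static void addSecondToLast(List<String> resourceNames, String name) {
        resourceNames.add(resourceNames.size() - 1, name);
    }

    public static void main(String[] args) {
        // Hypothetical resource names, chosen only for the demonstration.
        List<String> names = new ArrayList<>();
        names.add("nutch-default.xml");
        names.add("nutch-site.xml");
        addSecondToLast(names, "extra.xml");
        System.out.println(names);
    }
}
```

The previously last entry stays last, and the new name lands immediately
before it.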
Jack Tang wrote:
It was odd that when I input
every command, the NameNode would throw exception:
051206 003714 Server connection on port 9000 from 127.0.0.1: starting
051206 003715 Server connection on port 9000 from 127.0.0.1 caught:
java.net.SocketException: Connection reset
java.net.SocketExc
[
http://issues.apache.org/jira/browse/NUTCH-120?page=comments#action_12358426 ]
Paul Baclace commented on NUTCH-120:
Indeed there is a comment that indicates the code keeps trying, but luckily it
does not, and it might be unwise to keep trying after
[regarding mapred ver 0.8]
Anton Potehin wrote:
I tried to launch mapred on 2 machines: 192.168.0.250 and 192.168.0.111.
> 051123 053136 task_m_xaynqo -14885.741% /user/root/seeds/urls:31+31
> Please help me to find out what the problem is? And what I did wrong?
Is the problem the negative p
Arun Kaundal wrote:
Hi
I want to crawl local files, internet/intranet documents/files. Do you think
Nutch can help me in this case?
Although the tutorial describes these separately,
conf/crawl-urlfilter.txt can allow any combination of
Internet, Intranet, and local filesystem crawling.
In addition to Stefan Groschupf's detailed references, here are some short,
high-level answers to your questions:
Rozina Sorathia wrote:
> 1. What is Distributed nutch
Nutch is a distributed Lucene with large scale web crawling.
>2. How does distributed Nutch work?
Modeled after Google's Map-R
Rod Taylor wrote:
The attached patches for Generator.java and Injector.java allow a
specific temporary directory to be specified. This gives Nutch the full
path to these temporary directories and seems to fix the "No input
directories" issue when using a local filesystem with multiple task
tracke
Rod Taylor wrote:
NDFS accomplishes the above path finding by auto-prefixing any path not
beginning with / with a /user/$USER. I didn't think it was appropriate
for LocalFileSystem.java to be mucking around trying to automatically
adjust paths to what the user may have intended.
Grep-ing for /
[ http://issues.apache.org/jira/browse/NUTCH-116?page=all ]
Paul Baclace updated NUTCH-116:
---
Attachment: comments_msgs_and_local_renames_during_TestNDFS.patch
> TestNDFS a JUnit test specifically for N
[ http://issues.apache.org/jira/browse/NUTCH-116?page=all ]
Paul Baclace updated NUTCH-116:
---
Attachment: required_by_TestNDFS_v3.patch
I found and fixed a problem with a standalone DataNode process exiting too
early (this was not detected by the current
I made these corrections to the wiki page nutch/NutchDistributedFileSystem
located at:
http://wiki.apache.org/nutch/NutchDistributedFileSystem
The OLD/NEW diffs below are based on the action=raw view of the text.
I hope someone can fold these into the wiki page since it appears as
"Immutable
[ http://issues.apache.org/jira/browse/NUTCH-116?page=all ]
Paul Baclace updated NUTCH-116:
---
Attachment: TestNDFS.java
Revised TestNDFS to add a log message about which random number generator is
in use (also changed the fixed seed to a newly created
[ http://issues.apache.org/jira/browse/NUTCH-122?page=all ]
Paul Baclace updated NUTCH-122:
---
Attachment: MersenneTwister.java
Resubmitting MersenneTwister.java, this time with the Grant ASF checked.
> block numbers need a better random number genera
[ http://issues.apache.org/jira/browse/NUTCH-122?page=all ]
Paul Baclace updated NUTCH-122:
---
Attachment: MersenneTwister.java
I am attaching MersenneTwister.java
The license on the attached source:
All implementations are based on the mt19937 C code
Reporter: Paul Baclace
In order to support billions of block numbers, a better PRNG than
java.util.Random is needed. To reach billions with low probability of
collision, 64 bit random numbers are needed (the Birthday Problem is the model
for the number of bits needed; the result is that
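The Birthday Problem estimate mentioned here can be worked through with the
standard approximation p = 1 - exp(-n(n-1) / 2^(bits+1)). The sketch below is
an illustration under the assumption of independent, uniformly distributed
block numbers. The class BirthdayBound and its method are hypothetical, and
the figures it prints are not the (truncated) result stated in the report.

```java
public class BirthdayBound {
    /**
     * Approximate probability of at least one collision among n values
     * drawn uniformly from 2^bits possibilities:
     * 1 - exp(-n(n-1) / 2^(bits+1)).
     */
    static double collisionProbability(double n, int bits) {
        double space = Math.pow(2.0, bits);
        double exponent = -n * (n - 1.0) / (2.0 * space);
        // expm1 keeps precision when the exponent is close to zero.
        return -Math.expm1(exponent);
    }

    public static void main(String[] args) {
        double billion = 1e9;
        System.out.println("32 bits: " + collisionProbability(billion, 32));
        System.out.println("64 bits: " + collisionProbability(billion, 64));
    }
}
```

With a billion blocks, 32-bit numbers make a collision practically certain,
while 64-bit numbers bring the estimate down to a few percent under this
model.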
[EMAIL PROTECTED] wrote:
Thanks for the tips. But I have a monster computer, 12G RAM and dual
64 bits processors, my network connection is 100 MB/S! I guess Nutch
doesn't close the opened sockets in the case of bad host! I am still
struggling with the problem.
If the OS is using a default/generic
[
http://issues.apache.org/jira/browse/NUTCH-116?page=comments#action_12332546 ]
Paul Baclace commented on NUTCH-116:
Doug,
Thanks for the quick response.
1. Should BLOCKREPORT_INTERVAL and DATANODE_STARTUP_PERIOD be removed from
FSConstants
[ http://issues.apache.org/jira/browse/NUTCH-116?page=all ]
Paul Baclace updated NUTCH-116:
---
Attachment: required_by_TestNDFS_v2.patch
Change Notes revised for patch required_by_TestNDFS_v2.patch which supersedes
required_by_TestNDFS.patch:
src/java/org
Doug Cutting wrote:
>Kelvin Tan wrote:
fetcher as a series of event queues (ala SEDA) instead
of with threads.
I have never been able to write an async version of things with Java's
nio that outperforms a threaded version. In theory it is possible,
since you can avoid thread switching overhea
[ http://issues.apache.org/jira/browse/NUTCH-116?page=all ]
Paul Baclace updated NUTCH-116:
---
Attachment: required_by_TestNDFS.patch
TestNDFS.java
Patch comments:
src/java/org/apache/nutch/ipc/Server.java
improved logging details, use
: Paul Baclace
TestNDFS is a JUnit test for NDFS using "pseudo multiprocessing" (or more
strictly, pseudo distributed) meaning all daemons run in one process and
sockets are used to communicate between daemons.
The test permutes various block sizes, number of files, file sizes, and
An obvious fix; I also set the name of the Thread which is needed for debugging
clean-shutdown of daemons.
This NPE is seen when calling DataNode.shutdown().
Paul
Index: E:/Src/nutch/mapred/src/java/org/apache/nutch/util/Daemon.java
===
This patch is for comments/local name change or error
msg change only, to clarify the code as it relates to
the new JUnit test, TestNDFS that I wrote. In many
cases the intention was to help the next person reading
the source. These changes should be very safe and
are unlikely to introduce subtl
Paul Baclace wrote:
I need a recursive file delete for cleaning up after a JUnit test.
I just now spotted:
org.apache.nutch.fs.LocalFileSystem.delete(File f)
which does what I want (recursive, local delete).
So no need for common.io.
Paul
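For readers who want to see what a recursive, local delete involves, here is
a minimal sketch using only java.io.File. It is not the implementation of
org.apache.nutch.fs.LocalFileSystem.delete(File f); the class
RecursiveDeleteSketch and its methods are hypothetical.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;

public class RecursiveDeleteSketch {
    /**
     * Deletes f and, if it is a directory, everything beneath it.
     * Note: symbolic links to directories are followed, which is
     * usually acceptable only for throwaway test directories.
     */
    static boolean deleteRecursively(File f) {
        File[] children = f.listFiles(); // null when f is not a directory
        if (children != null) {
            for (File child : children) {
                if (!deleteRecursively(child)) {
                    return false;
                }
            }
        }
        return f.delete();
    }

    /** Builds a small temp tree, deletes it, and reports whether it is gone. */
    static boolean demo() {
        try {
            File root = Files.createTempDirectory("testndfs-sketch").toFile();
            File sub = new File(root, "sub");
            if (!sub.mkdir()) {
                return false;
            }
            Files.write(new File(sub, "data.txt").toPath(), "x".getBytes());
            return deleteRecursively(root) && !root.exists();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("deleted: " + demo());
    }
}
```

Children are removed before their parent because File.delete() refuses to
remove a non-empty directory.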
I need a recursive file delete for cleaning up after a JUnit test.
There is one in Commons IO (org.apache.commons.io):
FileUtils.deleteDirectory(File directory)
I wonder whether I should use org.apache.commons.io as a new
jar added to lib or arrange a libtest for jars only used by
JUnit tests,
Fuad Efendi wrote:
Another cause of another problem:
By default, Java 1.4 caches DNS-to-IP mappings forever...
java.security.Security.setProperty("networkaddress.cache.ttl", "1");
I had to look up what the units are for this since your message
was possibly ambiguous.
The units are in se
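According to the Java networking properties documentation,
networkaddress.cache.ttl is an integer number of seconds for which successful
name lookups are cached. The sketch below wraps the quoted call so that the
unit is visible in the method name; the class DnsCacheTtlSketch and the
helper setDnsCacheTtlSeconds are hypothetical.

```java
import java.security.Security;

public class DnsCacheTtlSketch {
    /**
     * Sets the JVM-wide cache lifetime for successful DNS lookups, in
     * seconds. It is generally safest to call this early, before the
     * first name lookup is performed.
     */
    static void setDnsCacheTtlSeconds(int seconds) {
        Security.setProperty("networkaddress.cache.ttl",
                             Integer.toString(seconds));
    }

    public static void main(String[] args) {
        setDnsCacheTtlSeconds(1);
        System.out.println(Security.getProperty("networkaddress.cache.ttl"));
    }
}
```

A very small value trades extra DNS traffic for quicker pickup of address
changes, so the right setting depends on how many hosts a crawl touches.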
Stefan Groschupf wrote:
As far as I understand the code, there is only one tasktracker per machine.
That is true, but only for the most apparent use case. I'm working on
testing that needs to emulate a multi-machine deployment.
As you can see in the tasktracker code, the ports are cleanly closed i
Chris Mattmann wrote:
I just noticed after checking out the latest SVN of Nutch that I am
currently failing the TestSegmentMergeTool Junit test when I type "ant test"
for Nutch.
I'm on the mapred branch, not the trunk, and all tests pass.
One thing I have noticed is that it is best to start
Doug Cutting wrote:
It just occurred to me that perhaps we could simply use sequential block
numbering. All block ids are generated centrally on the namenode.
I'm not sure what the advantage of sequential block numbers would be
since long period PRNG block numbering does not even need to sto
Stefan Groschupf wrote:
Hi Paul,
my call stack says that actually no other classes are using the tasktracker.
Besides that, tasktracker could implement NutchConfigurable; then all
problems would be solved since this is the IoC pattern.
Or am I overlooking something?
I am thinking about the mapred branch
Stefan Groschupf wrote:
Besides that, a behavior like the datanode, which iterates until it finds
a free port, would be better than just random.
There is a possibility that a test run could start many processes
on one machine and a sequential available port search could be
contentious.
If you can
Thanks to archive.org, I found this additional reference:
http://www.math.sci.hiroshima-u.ac.jp/~m-mat/eindex.html
which is the English home page of Matsumoto san, co-originator of Mersenne
Twister.
There is a faq and the original C source.
Paul
Paul Baclace wrote:
[...]
Alternative
Doug Cutting expressed a concern to me about using util.Random to generate
random 64 bit block numbers for NDFS. The following is my analysis.
Random number generators for NDFS block numbers
Requirements
capable of billions of block numbers
64 bit block numbers
deterministic for re
Paul Baclace wrote:
Doug Cutting wrote:
In Nutch, define this in NUTCH_OPTS, with something like:
export NUTCH_OPTS=-Djava.io.tmpdir=/foo
It is not yet that clean. There are a few explicit uses of "/tmp":
JobConf.java
NameNode.java
NDFS.java (last seen in Release-0.7)
Th
Doug Cutting wrote:
Ordway, Ryan wrote:
As a quick workaround, I made a few quick adjustments to the
NDFSClient.java code to change the directory that temporary files
are created in. This is hard coded to /nutch/tmp, but if someone
could perhaps add a config option to make it configurable t
Here is a patch for improving the error message that is displayed
when an intranet crawl commandline has a file instead of a directory
of files containing URLs.
The old error msg:
java.io.IOException: No input files in: [Ljava.io.File;@c24c0
Obviously, the default toString() says nothing.
The
Michael Ji wrote:
No particular vulnerability higher than the case of
running a web server, if I am not wrong;
Tomcat is the same as a web server except JSP is its core
engine;
I would suggest following any instructions that Tomcat has
for locking it down. For instance, there is a conf setting
(the