Versions: 0.8-dev
Environment: all
Reporter: Paul Baclace
When mapred.local.dir is used to specify multiple temp dir. areas, space
allocation is limited by the smallest area because the temp dir. selection algorithm
is round robin starting from a randomish point. When round robin is used
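
A minimal sketch of the selection behaviour described above, not the actual
Nutch code. The class and method names are hypothetical. Directories are handed
out in rotation from a random start, so each area receives roughly the same
number of allocations no matter how much free space it has, which is why the
smallest area fills first.

```java
import java.util.Random;

// Hypothetical illustration only; this is not the Nutch implementation.
final class RoundRobinDirs {
    private final String[] dirs;
    private int next;

    RoundRobinDirs(String[] dirs, Random rnd) {
        this.dirs = dirs.clone();
        this.next = rnd.nextInt(dirs.length); // "randomish" starting point
    }

    // Returns the next directory in rotation; free space is never consulted.
    synchronized String nextDir() {
        String d = dirs[next];
        next = (next + 1) % dirs.length;
        return d;
    }

    public static void main(String[] args) {
        RoundRobinDirs r =
            new RoundRobinDirs(new String[] {"/a", "/b", "/c"}, new Random());
        for (int i = 0; i < 6; i++) {
            System.out.println(r.nextDir());
        }
    }
}
```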
[
http://issues.apache.org/jira/browse/NUTCH-151?page=comments#action_12362383 ]
Paul Baclace commented on NUTCH-151:
The number of threads that invoke _barrier.barrier() or .attemptBarrier()
should match the count passed to the constructor
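
To illustrate the point about matching counts, here is a small sketch. It
swaps in java.util.concurrent.CyclicBarrier and its await() method, which is an
assumption for illustration and is not the barrier class with barrier() and
attemptBarrier() that CommandRunner uses. The class name BarrierParties is
hypothetical.

```java
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.TimeUnit;

// Illustration only; uses java.util.concurrent.CyclicBarrier, not the
// barrier class that CommandRunner uses.
final class BarrierParties {
    // Starts `threads` workers against a barrier built for `parties` and
    // reports whether every worker got through within the timeout.
    static boolean allPass(int parties, int threads) {
        final CyclicBarrier barrier = new CyclicBarrier(parties);
        final boolean[] passed = new boolean[threads];
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            final int id = i;
            workers[i] = new Thread(() -> {
                try {
                    barrier.await(1, TimeUnit.SECONDS);
                    passed[id] = true;
                } catch (Exception e) {
                    // Timed out or barrier broken: this worker did not pass.
                }
            });
            workers[i].setDaemon(true);
            workers[i].start();
        }
        try {
            for (Thread t : workers) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        for (boolean p : passed) {
            if (!p) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("3 parties, 3 threads: " + allPass(3, 3));
        System.out.println("3 parties, 2 threads: " + allPass(3, 2));
    }
}
```

With too few threads nobody is released until the timeout; with too many, the
leftover thread waits alone. Without a timeout either case would hang.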
[
http://issues.apache.org/jira/browse/NUTCH-153?page=comments#action_12362272 ]
Paul Baclace commented on NUTCH-153:
NUTCH-160?
There is slowness and then there is continental drift. The quantifiers should
be used with any regex package unless
[
http://issues.apache.org/jira/browse/NUTCH-162?page=comments#action_12362274 ]
Paul Baclace commented on NUTCH-162:
The best practice for identifying localization is to use the ISO language and
country code in the form of lowercase language code
Reporter: Paul Baclace
nutch-daemon.sh creates a log file with the name pattern
$NUTCH_LOG_DIR/nutch-$NUTCH_IDENT_STRING-$command-`hostname`.log
every time it is run. This can overwrite a previous log without warning. As
such, it is too easy to accidentally lose a log that might contain unique failure
[ http://issues.apache.org/jira/browse/NUTCH-156?page=all ]
Paul Baclace updated NUTCH-156:
---
Attachment: nutch-daemon.sh.patch
This is the suggested fix as a patch.
Example log name:
nutch-peb-jobtracker-ia109102.archive.org-20051228T230740.log
nutch
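
The naming scheme in the example log name can be sketched as follows. The
actual patch is to the shell script nutch-daemon.sh; this Java version only
illustrates the name format. The class name LogName and the explicit time zone
parameter are assumptions, since the excerpt does not say which time zone the
patch uses.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

// Hypothetical helper; shows the name format only, not the patch itself.
final class LogName {
    // Builds nutch-<ident>-<command>-<host>-<yyyyMMdd'T'HHmmss>.log
    static String of(String ident, String command, String host,
                     Date when, TimeZone tz) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyyMMdd'T'HHmmss");
        fmt.setTimeZone(tz);
        return "nutch-" + ident + "-" + command + "-" + host
            + "-" + fmt.format(when) + ".log";
    }

    public static void main(String[] args) {
        System.out.println(of("peb", "jobtracker", "ia109102.archive.org",
            new Date(), TimeZone.getDefault()));
    }
}
```

Because the timestamp has one-second resolution, two runs started within the
same second would still produce the same name.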
[
http://issues.apache.org/jira/browse/NUTCH-151?page=comments#action_12361242 ]
Paul Baclace commented on NUTCH-151:
Analysis:
CommandRunner uses CyclicBarrier to synchronize the thread that does the
exec (let's call it the main thread) with the io
[ http://issues.apache.org/jira/browse/NUTCH-151?page=all ]
Paul Baclace updated NUTCH-151:
---
Attachment: CommandRunner.java
Minimal required changes to fix bug NUTCH-151:
1. The pipe io threads should be daemons.
2. The main thread should always interrupt
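
A minimal sketch of the two changes listed above, not the attached
CommandRunner.java. Item 2 is cut off in this excerpt; the sketch assumes the
interrupt is aimed at the pipe io threads. All names here are hypothetical.

```java
// Hypothetical illustration; this is not the attached CommandRunner.java.
final class DaemonPipeThread {
    // Starts a thread marked as a daemon so it cannot keep the JVM alive.
    static Thread startDaemon(Runnable body, String name) {
        Thread t = new Thread(body, name);
        t.setDaemon(true); // must be called before start()
        t.start();
        return t;
    }

    // Interrupts t and waits up to millis for it to end; true if it ended.
    static boolean interruptAndJoin(Thread t, long millis) {
        t.interrupt();
        try {
            t.join(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !t.isAlive();
    }

    // Stand-in for a pipe-copy loop: blocks until interrupted.
    static Runnable blockUntilInterrupted() {
        return () -> {
            try {
                Thread.sleep(60_000);
            } catch (InterruptedException e) {
                // Interrupted by the main thread: fall through and exit.
            }
        };
    }

    public static void main(String[] args) {
        Thread t = startDaemon(blockUntilInterrupted(), "STDOUT");
        System.out.println("ended: " + interruptAndJoin(t, 2000));
    }
}
```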
[ http://issues.apache.org/jira/browse/NUTCH-151?page=all ]
Paul Baclace updated NUTCH-151:
---
Attachment: CommandRunner.java.patch
Here is the patch for CommandRunner (previously, I attached the actual file).
CommandRunner can hang after the main thread
[ http://issues.apache.org/jira/browse/NUTCH-152?page=all ]
Paul Baclace updated NUTCH-152:
---
Attachment: TaskRunner.java.patch
The patch addresses each issue listed in the detailed description of this bug.
The detailed description is suitable
Project: Nutch
Type: Bug
Components: fetcher
Versions: 0.8-dev
Environment: all
Reporter: Paul Baclace
If TextParser is given postscript, it can take hours and then fail. This can
be avoided with careful configuration, but if the server MIME type is wrong
[ http://issues.apache.org/jira/browse/NUTCH-153?page=all ]
Paul Baclace updated NUTCH-153:
---
Attachment: TextParser.java.patch
A patch to reject files with %!PS-Adobe in the first 40 characters of the
file.
TextParser is only supposed to parse plain
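
The check described in the patch note can be sketched roughly like this. It is
not the attached TextParser.java.patch; the class and method names are
hypothetical, and the sketch assumes the whole marker must fall within the
first 40 characters.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; this is not the attached patch.
final class PostScriptSniffer {
    private static final String MAGIC = "%!PS-Adobe";
    private static final int WINDOW = 40;

    // True if the PostScript marker lies entirely within the first 40 bytes.
    static boolean looksLikePostScript(byte[] content) {
        int n = Math.min(content.length, WINDOW);
        // ISO-8859-1 maps every byte to one char, so no decoding can fail.
        String head = new String(content, 0, n, StandardCharsets.ISO_8859_1);
        return head.contains(MAGIC);
    }

    public static void main(String[] args) {
        byte[] ps = "%!PS-Adobe-3.0\n".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(looksLikePostScript(ps));
    }
}
```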
[
http://issues.apache.org/jira/browse/NUTCH-128?page=comments#action_12361254 ]
Paul Baclace commented on NUTCH-128:
In general, it might be helpful to issue an INFO level log msg whenever a
configuration attribute is overridden. If the override
Type: Bug
Components: indexer
Versions: 0.8-dev
Environment: all
Reporter: Paul Baclace
I encountered a case where the JVM of a Tasktracker child did not exit after
the main thread returned; a thread dump showed only the threads named STDOUT
and STDERR from CommandRunner as non
[ http://issues.apache.org/jira/browse/NUTCH-150?page=all ]
Paul Baclace updated NUTCH-150:
---
Attachment: OutlinkExtractor.java.patch
This patch has 3 changes:
1. Adds a comment that non-plain-text can be a problem.
2. Adds quantifiers to the regular
You can ignore mapred.input.subdir; I find it is an unneeded option.
Now that the mapred branch is merged to be the trunk, there is a need
to clarify the documentation since a change was made to have the
input be specified as a directory and then all files in that directory
are considered
I have recently seen the connection reset problem, and no firewall was involved.
I have been doing a mapred index build over more than 5TB of arc files and I
noticed:
SocketException: Connection reset
that occurred in 1 of 1070 map tasks during the parse phase; the task was
automatically
Arun Kaundal wrote:
Hi
I want to crawl local files, internet/intranet documents/files. Do u think
nutch help me in this case?
Although the tutorial describes these separately,
conf/crawl-urlfilter.txt can allow any combination of
Internet, Intranet, and local filesystem crawling.
[regarding mapred ver 0.8]
Anton Potehin wrote:
I tried to launch mapred on 2 machines: 192.168.0.250 and 192.168.0.111.
051123 053136 task_m_xaynqo -14885.741% /user/root/seeds/urls:31+31
Please help me to find out what the problem is? And what I did wrong?
Is the problem the negative
Rod Taylor wrote:
The attached patches for Generator.java and Injector.java allow a
specific temporary directory to be specified. This gives Nutch the full
path to these temporary directories and seems to fix the No input
directories issue when using a local filesystem with multiple task
In addition to Stefan Groschupf's detailed references, here are some short,
high-level answers to your questions:
Rozina Sorathia wrote:
1. What is Distributed nutch
Nutch is a distributed Lucene with large scale web crawling.
2. How nutch distributed works?
Modeled after Google's
Rod Taylor wrote:
NDFS accomplishes the above path finding by auto-prefixing any path not
beginning with / with a /user/$USER. I didn't think it was appropriate
for LocalFileSystem.java to be mucking around trying to automatically
adjust paths to what the user may have intended.
Grep-ing for
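
The auto-prefixing rule quoted above amounts to something like the following
sketch. It is not the NDFS source; the class and method names are hypothetical.

```java
// Hypothetical illustration of the prefixing rule; not the NDFS code.
final class UserPathResolver {
    // Paths that do not begin with "/" are placed under /user/<user>.
    static String resolve(String path, String user) {
        return path.startsWith("/") ? path : "/user/" + user + "/" + path;
    }

    public static void main(String[] args) {
        System.out.println(resolve("seeds/urls", "root"));
        System.out.println(resolve("/nutch/tmp", "root"));
    }
}
```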
[ http://issues.apache.org/jira/browse/NUTCH-116?page=all ]
Paul Baclace updated NUTCH-116:
---
Attachment: required_by_TestNDFS_v3.patch
I found and fixed a problem with a standalone DataNode process exiting too
early (this was not detected by the current
[ http://issues.apache.org/jira/browse/NUTCH-116?page=all ]
Paul Baclace updated NUTCH-116:
---
Attachment: comments_msgs_and_local_renames_during_TestNDFS.patch
TestNDFS a JUnit test specifically for NDFS
I made these corrections to the wiki page nutch/NutchDistributedFileSystem
located at:
http://wiki.apache.org/nutch/NutchDistributedFileSystem
The OLD/NEW diffs below are based on the action=raw view of the text.
I hope someone can fold these into the wiki page since it appears as
Immutable
[ http://issues.apache.org/jira/browse/NUTCH-122?page=all ]
Paul Baclace updated NUTCH-122:
---
Attachment: MersenneTwister.java
I am attaching MersenneTwister.java
The license on the attached source:
All implementations are based on the mt19937 C code
[ http://issues.apache.org/jira/browse/NUTCH-122?page=all ]
Paul Baclace updated NUTCH-122:
---
Attachment: MersenneTwister.java
Resubmitting MersenneTwister.java, this time with the Grant ASF checked.
block numbers need a better random number generator
[ http://issues.apache.org/jira/browse/NUTCH-116?page=all ]
Paul Baclace updated NUTCH-116:
---
Attachment: TestNDFS.java
Revised TestNDFS to add a log message about which random number generator is
in use (also changed the fixed seed to a newly created
[ http://issues.apache.org/jira/browse/NUTCH-116?page=all ]
Paul Baclace updated NUTCH-116:
---
Attachment: required_by_TestNDFS_v2.patch
Change Notes revised for patch required_by_TestNDFS_v2.patch which supersedes
required_by_TestNDFS.patch:
src/java/org
[
http://issues.apache.org/jira/browse/NUTCH-116?page=comments#action_12332546 ]
Paul Baclace commented on NUTCH-116:
Doug,
Thanks for the quick response.
1. Should BLOCKREPORT_INTERVAL and DATANODE_STARTUP_PERIOD be removed from
FSConstants
Reporter: Paul Baclace
TestNDFS is a JUnit test for NDFS using pseudo multiprocessing (or more
strictly, pseudo distributed) meaning all daemons run in one process and
sockets are used to communicate between daemons.
The test permutes various block sizes, number of files, file sizes, and number
[ http://issues.apache.org/jira/browse/NUTCH-116?page=all ]
Paul Baclace updated NUTCH-116:
---
Attachment: required_by_TestNDFS.patch
TestNDFS.java
Patch comments:
src/java/org/apache/nutch/ipc/Server.java
improved logging details, use
This patch is for comments/local name change or error
msg change only, to clarify the code as it relates to
the new JUnit test, TestNDFS that I wrote. In many
cases the intention was to help the next person reading
the source. These changes should be very safe and
are unlikely to introduce
I need a recursive file delete for cleaning up after a JUnit test.
There is one in Commons IO (org.apache.commons.io):
FileUtils.deleteDirectory(File directory)
I wonder whether I should use org.apache.commons.io as a new
jar added to lib or arrange a libtest for jars only used by
JUnit
Paul Baclace wrote:
I need a recursive file delete for cleaning up after a JUnit test.
I just now spotted:
org.apache.nutch.fs.LocalFileSystem.delete(File f)
which does what I want (recursive, local delete).
So no need for Commons IO.
Paul
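
For readers without either library at hand, a recursive local delete can be
sketched with java.io.File alone. This is not the source of
LocalFileSystem.delete or FileUtils.deleteDirectory; the class name is
hypothetical.

```java
import java.io.File;

// Hypothetical illustration; not the Nutch or Commons IO implementation.
final class RecursiveDelete {
    // Deletes f and, if it is a directory, everything beneath it.
    // Returns true only if every delete succeeded.
    static boolean delete(File f) {
        File[] children = f.listFiles(); // null when f is not a directory
        if (children != null) {
            for (File child : children) {
                if (!delete(child)) {
                    return false;
                }
            }
        }
        return f.delete();
    }

    public static void main(String[] args) {
        File dir = new File(System.getProperty("java.io.tmpdir"),
            "recursive-delete-demo-" + System.nanoTime());
        new File(dir, "a/b").mkdirs();
        System.out.println("deleted: " + delete(dir));
    }
}
```

Note that listFiles() follows symbolic links to directories, so this sketch is
only suitable for scratch directories a test created itself.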
Fuad Efendi wrote:
Another cause of another problem:
By default, Java 1.4 caches DNS-to-IP mappings forever...
java.security.Security.setProperty("networkaddress.cache.ttl", "1");
I had to look up what the units are for this since your message
was possibly ambiguous.
The units are in
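
A small sketch of setting this property from code. Security.setProperty takes
two strings, so both the key and the value are quoted. The JDK networking
properties documentation describes the value as a number of seconds. The class
and method names are hypothetical, and the call is best made early, before the
JVM performs any name lookups.

```java
import java.security.Security;

// Hypothetical helper around the standard security property.
final class DnsCacheTtl {
    // Sets the JVM-wide cache lifetime for successful DNS lookups.
    static void setTtlSeconds(int seconds) {
        Security.setProperty("networkaddress.cache.ttl",
            Integer.toString(seconds));
    }

    public static void main(String[] args) {
        setTtlSeconds(1);
        System.out.println(Security.getProperty("networkaddress.cache.ttl"));
    }
}
```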
Chris Mattmann wrote:
I just noticed after checking out the latest SVN of Nutch that I am
currently failing the TestSegmentMergeTool Junit test when I type ant test
for Nutch.
I'm on the mapred branch, not the trunk, and all tests pass.
One thing I have noticed is that it is best to start
Stefan Groschupf wrote:
As far I understand the code there is only one tasktracker per machine.
That is true, but only for the most apparent use case. I'm working on
testing which needs emulate a multi machine deployment.
As you can see in the tasktracker code, the ports are cleanly closed
Stefan Groschupf wrote:
Beside that a behavior like the datanode that iterates until it find
a free port would be a better than just random.
There is a possibility that a test run could start many processes
on one machine and a sequential available port search could be
contentious.
If you
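
The sequential search under discussion could look like the sketch below. It is
not the datanode code; the class and method names are hypothetical. Returning
the bound socket, not just the port number, avoids a gap between finding a
free port and binding it.

```java
import java.io.IOException;
import java.net.ServerSocket;

// Hypothetical illustration; this is not the datanode implementation.
final class PortSearch {
    // Tries start, start+1, ... and returns a socket bound to the first
    // free port, or null if none of the attempted ports was free.
    static ServerSocket bindSequential(int start, int attempts) {
        for (int p = start; p < start + attempts && p <= 65535; p++) {
            try {
                return new ServerSocket(p);
            } catch (IOException e) {
                // Port is busy or not permitted: try the next one.
            }
        }
        return null;
    }

    public static void main(String[] args) {
        ServerSocket s = bindSequential(50000, 100);
        System.out.println(s == null ? "none free" : "bound " + s.getLocalPort());
    }
}
```

When many processes start at once from the same base port, they all walk the
same busy ports first, which is the contention mentioned above.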
Stefan Groschupf wrote:
Hi Paul,
my call stack say that actually no other classes using the tasktracker.
Beside that tasktracker could be implement NutchConfigurable than all
problems would be solved since this is IOC pattern.
Or do I oversee something?
I am thinking about the mapred branch
Doug Cutting wrote:
It just occurred to me that perhaps we could simply use sequential block
numbering. All block ids are generated centrally on the namenode.
I'm not sure what the advantage of sequential block numbers would be
since long period PRNG block numbering does not even need to
Doug Cutting expressed a concern to me about using util.Random to generate
random 64 bit block numbers for NDFS. The following is my analysis.
Random number generators for NDFS block numbers
Requirements
capable of billions of block numbers
64 bit block numbers
deterministic for
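
As a point of reference for the analysis, here is a minimal sketch of
generating 64 bit ids with java.util.Random and a fixed seed. The class name is
hypothetical and this is not the NDFS code. The Javadoc for Random.nextLong()
notes that, because Random uses a seed of only 48 bits, it cannot return every
possible long value.

```java
import java.util.Random;

// Hypothetical illustration; this is not the NDFS block id generator.
final class BlockIds {
    // Returns n pseudo-random 64 bit ids. The same seed always yields
    // the same sequence, which makes test runs repeatable.
    static long[] generate(long seed, int n) {
        Random r = new Random(seed);
        long[] ids = new long[n];
        for (int i = 0; i < n; i++) {
            ids[i] = r.nextLong();
        }
        return ids;
    }

    public static void main(String[] args) {
        for (long id : generate(42L, 3)) {
            System.out.println(Long.toHexString(id));
        }
    }
}
```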
Thanks to archive.org, I found this additional reference:
http://www.math.sci.hiroshima-u.ac.jp/~m-mat/eindex.html
which is the English home page of Matsumoto san, co-originator of Mersenne
Twister.
There is a faq and the original C source.
Paul
Paul Baclace wrote:
[...]
Alternative
Doug Cutting wrote:
Ordway, Ryan wrote:
As a quick workaround, I made a few quick adjustments to the
NDFSClient.java code to change the directory that temporary files
are created in. This is hard coded to /nutch/tmp, but if someone
could perhaps add a config option to make it configurable