Re: (mapred branch) Job.xml as a directory instead of a file, other issues.

2005-08-16 Thread Doug Cutting

Jeremy Bensley wrote:

After going through your checklist, I realized that my view on how the
MapReduce function behaves was slightly flawed, as I did not realize
that the temporary storage phase between map and reduce had to be in a
shared location.


The temporary storage between map and reduce is actually not stored in 
NDFS, but on the nodes' local disks.  But the input (the url file in this 
case) must be shared.



So, my process for running crawl is now:
1. Set up / start NDFS name and data nodes
2. Copy url file into NDFS 
3. Set up / start job and task trackers

4. run crawl with arguments referencing the NDFS positions of my
inputs and outputs


That looks right to me.
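Sketched as shell commands, those four steps might look as follows. Note the sub-command names and the `-depth` value are assumptions here; check the `bin/nutch` script in your checkout for the exact commands, and run the daemons on whichever machines suit your cluster:

```shell
# Sketch of the four steps; sub-command names are assumptions,
# ports and paths are examples only.
bin/nutch namenode &                       # 1. NDFS name node
bin/nutch datanode &                       #    NDFS data node (one per machine)
bin/nutch ndfs -put urls urls              # 2. copy the url directory into NDFS
bin/nutch jobtracker &                     # 3. job tracker
bin/nutch tasktracker &                    #    task tracker (one per machine)
bin/nutch crawl urls -dir crawl -depth 3   # 4. crawl, referencing NDFS paths
```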

We really need a mapred & ndfs-based tutorial...


The only lasting issue I have is that, whenever I attempt to start a
tasktracker or jobtracker and have the configuration parameters for
mapred specified only in mapred-default.xml, I get the following
error:

050816 164343 parsing
file:/main/home/jeremy/small_project/nutch-mapred/conf/nutch-default.xml
050816 164343 parsing
file:/main/home/jeremy/small_project/nutch-mapred/conf/nutch-site.xml
Exception in thread "main" java.lang.RuntimeException: Bad
mapred.job.tracker: local
at org.apache.nutch.mapred.JobTracker.getAddress(JobTracker.java:245)
at org.apache.nutch.mapred.TaskTracker.<init>(TaskTracker.java:72)
at org.apache.nutch.mapred.TaskTracker.main(TaskTracker.java:609)

It is as if the mapred-default.xml is not being parsed for its
options. If I specify the same options in nutch-site.xml it works just
fine.


The config files are a bit confusing.  mapred-default.xml is for stuff 
that may reasonably be overridden by applications, while nutch-site.xml 
is for stuff that should not be overridden by applications.  So the name 
of the shared filesystem and of the job tracker should be in 
nutch-site.xml, since they should not be overridden.  But, e.g., the 
default number of map and reduce tasks should be in mapred-default.xml, 
since applications do sometimes change these.
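For example, the split might look like this (mapred.job.tracker is taken from the error above; the other property names follow the stock config files of that era, and all host, port, and count values are placeholders):

```xml
<!-- mapred-default.xml: defaults that applications may override -->
<configuration>
  <property><name>mapred.map.tasks</name><value>4</value></property>
  <property><name>mapred.reduce.tasks</name><value>2</value></property>
</configuration>

<!-- nutch-site.xml: site settings that applications should not override -->
<configuration>
  <property><name>fs.default.name</name><value>namenode.example.com:9000</value></property>
  <property><name>mapred.job.tracker</name><value>jobtracker.example.com:9001</value></property>
</configuration>
```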


The "local" job tracker should only be used in standalone 
configurations, when everything runs in the same process.  It doesn't 
make sense to start a task tracker process configured with a "local" job 
tracker.  If you want to run them on the same host then you might 
configure "localhost:<port>" as the job tracker.


Doug


Re: (mapred branch) Job.xml as a directory instead of a file, other issues.

2005-08-16 Thread Jeremy Bensley
After going through your checklist, I realized that my view on how the
MapReduce function behaves was slightly flawed, as I did not realize
that the temporary storage phase between map and reduce had to be in a
shared location. So, my process for running crawl is now:

1. Set up / start NDFS name and data nodes
2. Copy url file into NDFS 
3. Set up / start job and task trackers
4. run crawl with arguments referencing the NDFS positions of my
inputs and outputs

Following these steps I was able to get it to work as expected.


The only lasting issue I have is that, whenever I attempt to start a
tasktracker or jobtracker and have the configuration parameters for
mapred specified only in mapred-default.xml, I get the following
error:

050816 164343 parsing
file:/main/home/jeremy/small_project/nutch-mapred/conf/nutch-default.xml
050816 164343 parsing
file:/main/home/jeremy/small_project/nutch-mapred/conf/nutch-site.xml
Exception in thread "main" java.lang.RuntimeException: Bad
mapred.job.tracker: local
at org.apache.nutch.mapred.JobTracker.getAddress(JobTracker.java:245)
at org.apache.nutch.mapred.TaskTracker.<init>(TaskTracker.java:72)
at org.apache.nutch.mapred.TaskTracker.main(TaskTracker.java:609)

It is as if the mapred-default.xml is not being parsed for its
options. If I specify the same options in nutch-site.xml it works just
fine.

I appreciate the help, and look forward to experimenting with the software.

Jeremy


On 8/16/05, Doug Cutting <[EMAIL PROTECTED]> wrote:
> Jeremy Bensley wrote:
> > First, I have observed the same behavior as a previous poster from
> > yesterday who, instead of specifying a file for the URLs to be read
> > from, must now specify a directory (full path) in which a file
> > containing the URL list is stored. From the response to that thread I
> > am gathering that it isn't desired behavior to specify a directory
> > instead of a file.
> 
> A directory is required.  For consistency, all inputs and outputs are
> now directories of files rather than individual files.
> 
> > Second, and more importantly, I am having issues with task trackers. I
> > have three machines running task tracker, and a fourth running the job
> > tracker, and they seem to be talking well. Whenever I try to invoke
> > crawl using the job tracker, however, all of my task trackers
> > continually fail with this:
> >
> > 050816 134532 parsing /tmp/nutch/mapred/local/tracker/task_m_5o5uvx/job.xml
> > [Fatal Error] :-1:-1: Premature end of file.
> > 050816 134532 SEVERE error parsing conf file:
> > org.xml.sax.SAXParseException: Premature end of file.
> > java.lang.RuntimeException: org.xml.sax.SAXParseException: Premature
> > end of file.
> > at org.apache.nutch.util.NutchConf.loadResource(NutchConf.java:355)
> > at org.apache.nutch.util.NutchConf.getProps(NutchConf.java:290)
> > at org.apache.nutch.util.NutchConf.get(NutchConf.java:91)
> > at org.apache.nutch.mapred.JobConf.getJar(JobConf.java:80)
> > at 
> > org.apache.nutch.mapred.TaskTracker$TaskInProgress.localizeTask(TaskTracker.java:335)
> > at 
> > org.apache.nutch.mapred.TaskTracker$TaskInProgress.<init>(TaskTracker.java:319)
> > at 
> > org.apache.nutch.mapred.TaskTracker.offerService(TaskTracker.java:221)
> > at org.apache.nutch.mapred.TaskTracker.run(TaskTracker.java:269)
> > at org.apache.nutch.mapred.TaskTracker.main(TaskTracker.java:610)
> > Caused by: org.xml.sax.SAXParseException: Premature end of file.
> > at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
> > at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
> > at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:172)
> > at org.apache.nutch.util.NutchConf.loadResource(NutchConf.java:315)
> > ... 8 more
> >
> > Whenever I look at the job.xml file specified by this location, it
> > turns out that it is a directory, not a file.
> >
> > drwxrwxr-x  2 jeremy  users 4096 Aug 16 13:45 job.xml
> 
> I have not seen this before.  If you remove everything in /tmp/nutch, is
> this reproducible?  Are you using NDFS?  If not, how are you sharing
> files between task trackers?  Is this on Win32, Linux or what?  Are you
> running the latest mapred code?  If your troubles continue, please post
> your nutch-site.xml and mapred-default.xml.
> 
> Doug
>


Re: (mapred branch) Job.xml as a directory instead of a file, other issues.

2005-08-16 Thread Doug Cutting

Jeremy Bensley wrote:

First, I have observed the same behavior as a previous poster from
yesterday who, instead of specifying a file for the URLs to be read
from, must now specify a directory (full path) in which a file
containing the URL list is stored. From the response to that thread I
am gathering that it isn't desired behavior to specify a directory
instead of a file.


A directory is required.  For consistency, all inputs and outputs are 
now directories of files rather than individual files.
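Concretely, a url input could be prepared like this before copying it into NDFS (the file name urls.txt is arbitrary here; the point is that the directory, not the file, is what gets passed to the crawl):

```shell
# Inputs are directories of files, so wrap the seed list in a directory.
mkdir -p urls
printf 'http://www.example.com/\n' > urls/urls.txt
```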



Second, and more importantly, I am having issues with task trackers. I
have three machines running task tracker, and a fourth running the job
tracker, and they seem to be talking well. Whenever I try to invoke
crawl using the job tracker, however, all of my task trackers
continually fail with this:

050816 134532 parsing /tmp/nutch/mapred/local/tracker/task_m_5o5uvx/job.xml
[Fatal Error] :-1:-1: Premature end of file.
050816 134532 SEVERE error parsing conf file:
org.xml.sax.SAXParseException: Premature end of file.
java.lang.RuntimeException: org.xml.sax.SAXParseException: Premature
end of file.
at org.apache.nutch.util.NutchConf.loadResource(NutchConf.java:355)
at org.apache.nutch.util.NutchConf.getProps(NutchConf.java:290)
at org.apache.nutch.util.NutchConf.get(NutchConf.java:91)
at org.apache.nutch.mapred.JobConf.getJar(JobConf.java:80)
at 
org.apache.nutch.mapred.TaskTracker$TaskInProgress.localizeTask(TaskTracker.java:335)
at 
org.apache.nutch.mapred.TaskTracker$TaskInProgress.<init>(TaskTracker.java:319)
at 
org.apache.nutch.mapred.TaskTracker.offerService(TaskTracker.java:221)
at org.apache.nutch.mapred.TaskTracker.run(TaskTracker.java:269)
at org.apache.nutch.mapred.TaskTracker.main(TaskTracker.java:610)
Caused by: org.xml.sax.SAXParseException: Premature end of file.
at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:172)
at org.apache.nutch.util.NutchConf.loadResource(NutchConf.java:315)
... 8 more

Whenever I look at the job.xml file specified by this location, it
turns out that it is a directory, not a file.

drwxrwxr-x  2 jeremy  users 4096 Aug 16 13:45 job.xml


I have not seen this before.  If you remove everything in /tmp/nutch, is 
this reproducible?  Are you using NDFS?  If not, how are you sharing 
files between task trackers?  Is this on Win32, Linux or what?  Are you 
running the latest mapred code?  If your troubles continue, please post 
your nutch-site.xml and mapred-default.xml.


Doug


(mapred branch) Job.xml as a directory instead of a file, other issues.

2005-08-16 Thread Jeremy Bensley
I have been attempting to get the mapred branch version of the crawler
working and have hit some snags.

First, I have observed the same behavior as a previous poster from
yesterday who, instead of specifying a file for the URLs to be read
from, must now specify a directory (full path) in which a file
containing the URL list is stored. From the response to that thread I
am gathering that it isn't desired behavior to specify a directory
instead of a file.

Second, and more importantly, I am having issues with task trackers. I
have three machines running task tracker, and a fourth running the job
tracker, and they seem to be talking well. Whenever I try to invoke
crawl using the job tracker, however, all of my task trackers
continually fail with this:

050816 134532 parsing /tmp/nutch/mapred/local/tracker/task_m_5o5uvx/job.xml
[Fatal Error] :-1:-1: Premature end of file.
050816 134532 SEVERE error parsing conf file:
org.xml.sax.SAXParseException: Premature end of file.
java.lang.RuntimeException: org.xml.sax.SAXParseException: Premature
end of file.
at org.apache.nutch.util.NutchConf.loadResource(NutchConf.java:355)
at org.apache.nutch.util.NutchConf.getProps(NutchConf.java:290)
at org.apache.nutch.util.NutchConf.get(NutchConf.java:91)
at org.apache.nutch.mapred.JobConf.getJar(JobConf.java:80)
at 
org.apache.nutch.mapred.TaskTracker$TaskInProgress.localizeTask(TaskTracker.java:335)
at 
org.apache.nutch.mapred.TaskTracker$TaskInProgress.<init>(TaskTracker.java:319)
at 
org.apache.nutch.mapred.TaskTracker.offerService(TaskTracker.java:221)
at org.apache.nutch.mapred.TaskTracker.run(TaskTracker.java:269)
at org.apache.nutch.mapred.TaskTracker.main(TaskTracker.java:610)
Caused by: org.xml.sax.SAXParseException: Premature end of file.
at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:172)
at org.apache.nutch.util.NutchConf.loadResource(NutchConf.java:315)
... 8 more

Whenever I look at the job.xml file specified by this location, it
turns out that it is a directory, not a file.

drwxrwxr-x  2 jeremy  users 4096 Aug 16 13:45 job.xml


Any help / observation of these issues is most appreciated.

Thanks,

Jeremy