Re: (mapred branch) Job.xml as a directory instead of a file, other issues.
Jeremy Bensley wrote:
> After going through your checklist, I realized that my view on how the
> MapReduce function behaves was slightly flawed, as I did not realize that
> the temporary storage phase between map and reduce had to be in a shared
> location.

The temporary storage between map and reduce is actually not stored in NDFS, but on nodes' local disks. But the input (the url file in this case) must be shared.

> So, my process for running crawl is now:
> 1. Set up / start NDFS name and data nodes
> 2. Copy url file into NDFS
> 3. Set up / start job and task trackers
> 4. Run crawl with arguments referencing the NDFS locations of my inputs and outputs

That looks right to me. We really need a mapred- and NDFS-based tutorial...

> The only lasting issue I have is that, whenever I attempt to start a
> tasktracker or jobtracker and have the configuration parameters for mapred
> specified only in mapred-default.xml, I get the following error:
>
> 050816 164343 parsing file:/main/home/jeremy/small_project/nutch-mapred/conf/nutch-default.xml
> 050816 164343 parsing file:/main/home/jeremy/small_project/nutch-mapred/conf/nutch-site.xml
> Exception in thread "main" java.lang.RuntimeException: Bad mapred.job.tracker: local
>         at org.apache.nutch.mapred.JobTracker.getAddress(JobTracker.java:245)
>         at org.apache.nutch.mapred.TaskTracker.<init>(TaskTracker.java:72)
>         at org.apache.nutch.mapred.TaskTracker.main(TaskTracker.java:609)
>
> It is as if the mapred-default.xml is not being parsed for its options. If
> I specify the same options in nutch-site.xml it works just fine.

The config files are a bit confusing. mapred-default.xml is for stuff that may reasonably be overridden by applications, while nutch-site.xml is for stuff that should not be overridden by applications. So the name of the shared filesystem and of the job tracker should be in nutch-site.xml, since they should not be overridden. But, e.g., the default number of map and reduce tasks should be in mapred-default.xml, since applications do sometimes change these.
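The four steps above might be sketched with the mapred branch's bin/nutch launcher roughly as follows. This is only an illustration of the workflow, not a verified recipe: the exact subcommand names, the "urls" path, and the crawl arguments are assumptions about that era's scripts, so check them against your own checkout.

```shell
# 1. Start the NDFS name node, and a data node on each storage machine
#    (subcommand names are assumed, verify against bin/nutch in your tree)
bin/nutch namenode &
bin/nutch datanode &

# 2. Copy the url seed directory into NDFS so all nodes can read the input
bin/nutch ndfs -put urls urls

# 3. Start the job tracker, and a task tracker on each worker machine
bin/nutch jobtracker &
bin/nutch tasktracker &

# 4. Run the crawl, referencing NDFS paths for input and output
bin/nutch crawl urls -dir crawl
```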
The "local" job tracker should only be used in standalone configurations, when everything runs in the same process. It doesn't make sense to start a task tracker process configured with a "local" job tracker. If you want to run them on the same host then you might configure "localhost:" as the job tracker. Doug
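Putting Doug's two points together, a minimal nutch-site.xml for a distributed setup might look like the fragment below. The host names and port numbers are placeholders, not values from this thread, and the `<nutch-conf>` root element is my assumption about the mapred-branch config format:

```xml
<?xml version="1.0"?>
<nutch-conf>
  <!-- Shared filesystem: belongs in nutch-site.xml because
       applications should not override it -->
  <property>
    <name>fs.default.name</name>
    <value>namenode.example.com:9000</value>
  </property>
  <!-- Job tracker host:port; "local" would instead run everything
       in a single standalone process -->
  <property>
    <name>mapred.job.tracker</name>
    <value>jobtracker.example.com:9001</value>
  </property>
</nutch-conf>
```

Per-job tunables such as the default number of map and reduce tasks would go in mapred-default.xml instead, since applications may legitimately change those.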
Re: (mapred branch) Job.xml as a directory instead of a file, other issues.
After going through your checklist, I realized that my view on how the MapReduce function behaves was slightly flawed, as I did not realize that the temporary storage phase between map and reduce had to be in a shared location.

So, my process for running crawl is now:
1. Set up / start NDFS name and data nodes
2. Copy url file into NDFS
3. Set up / start job and task trackers
4. Run crawl with arguments referencing the NDFS locations of my inputs and outputs

Following these steps I was able to get it to work as expected.

The only lasting issue I have is that, whenever I attempt to start a tasktracker or jobtracker and have the configuration parameters for mapred specified only in mapred-default.xml, I get the following error:

050816 164343 parsing file:/main/home/jeremy/small_project/nutch-mapred/conf/nutch-default.xml
050816 164343 parsing file:/main/home/jeremy/small_project/nutch-mapred/conf/nutch-site.xml
Exception in thread "main" java.lang.RuntimeException: Bad mapred.job.tracker: local
        at org.apache.nutch.mapred.JobTracker.getAddress(JobTracker.java:245)
        at org.apache.nutch.mapred.TaskTracker.<init>(TaskTracker.java:72)
        at org.apache.nutch.mapred.TaskTracker.main(TaskTracker.java:609)

It is as if the mapred-default.xml is not being parsed for its options. If I specify the same options in nutch-site.xml it works just fine.

I appreciate the help, and look forward to experimenting with the software.

Jeremy

On 8/16/05, Doug Cutting <[EMAIL PROTECTED]> wrote:
> Jeremy Bensley wrote:
> > First, I have observed the same behavior as a previous poster from
> > yesterday who, instead of specifying a file for the URLs to be read
> > from, must now specify a directory (full path) in which a file
> > containing the URL list is stored. From the response to that thread I
> > am gathering that it isn't desired behavior to specify a directory
> > instead of a file.
>
> A directory is required. For consistency, all inputs and outputs are
> now directories of files rather than individual files.
>
> > Second, and more importantly, I am having issues with task trackers. I
> > have three machines running task tracker, and a fourth running the job
> > tracker, and they seem to be talking well. Whenever I try to invoke
> > crawl using the job tracker, however, all of my task trackers
> > continually fail with this:
> >
> > 050816 134532 parsing /tmp/nutch/mapred/local/tracker/task_m_5o5uvx/job.xml
> > [Fatal Error] :-1:-1: Premature end of file.
> > 050816 134532 SEVERE error parsing conf file: org.xml.sax.SAXParseException: Premature end of file.
> > java.lang.RuntimeException: org.xml.sax.SAXParseException: Premature end of file.
> >         at org.apache.nutch.util.NutchConf.loadResource(NutchConf.java:355)
> >         at org.apache.nutch.util.NutchConf.getProps(NutchConf.java:290)
> >         at org.apache.nutch.util.NutchConf.get(NutchConf.java:91)
> >         at org.apache.nutch.mapred.JobConf.getJar(JobConf.java:80)
> >         at org.apache.nutch.mapred.TaskTracker$TaskInProgress.localizeTask(TaskTracker.java:335)
> >         at org.apache.nutch.mapred.TaskTracker$TaskInProgress.<init>(TaskTracker.java:319)
> >         at org.apache.nutch.mapred.TaskTracker.offerService(TaskTracker.java:221)
> >         at org.apache.nutch.mapred.TaskTracker.run(TaskTracker.java:269)
> >         at org.apache.nutch.mapred.TaskTracker.main(TaskTracker.java:610)
> > Caused by: org.xml.sax.SAXParseException: Premature end of file.
> >         at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
> >         at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
> >         at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:172)
> >         at org.apache.nutch.util.NutchConf.loadResource(NutchConf.java:315)
> >         ... 8 more
> >
> > Whenever I look at the job.xml file specified by this location, it
> > turns out that it is a directory, not a file.
> >
> > drwxrwxr-x 2 jeremy users 4096 Aug 16 13:45 job.xml
>
> I have not seen this before. If you remove everything in /tmp/nutch, is
> this reproducible? Are you using NDFS? If not, how are you sharing
> files between task trackers? Is this on Win32, Linux or what? Are you
> running the latest mapred code? If your troubles continue, please post
> your nutch-site.xml and mapred-default.xml.
>
> Doug
Re: (mapred branch) Job.xml as a directory instead of a file, other issues.
Jeremy Bensley wrote:
> First, I have observed the same behavior as a previous poster from
> yesterday who, instead of specifying a file for the URLs to be read
> from, must now specify a directory (full path) in which a file
> containing the URL list is stored. From the response to that thread I
> am gathering that it isn't desired behavior to specify a directory
> instead of a file.

A directory is required. For consistency, all inputs and outputs are now directories of files rather than individual files.

> Second, and more importantly, I am having issues with task trackers. I
> have three machines running task tracker, and a fourth running the job
> tracker, and they seem to be talking well. Whenever I try to invoke
> crawl using the job tracker, however, all of my task trackers
> continually fail with this:
>
> 050816 134532 parsing /tmp/nutch/mapred/local/tracker/task_m_5o5uvx/job.xml
> [Fatal Error] :-1:-1: Premature end of file.
> 050816 134532 SEVERE error parsing conf file: org.xml.sax.SAXParseException: Premature end of file.
> java.lang.RuntimeException: org.xml.sax.SAXParseException: Premature end of file.
>         at org.apache.nutch.util.NutchConf.loadResource(NutchConf.java:355)
>         at org.apache.nutch.util.NutchConf.getProps(NutchConf.java:290)
>         at org.apache.nutch.util.NutchConf.get(NutchConf.java:91)
>         at org.apache.nutch.mapred.JobConf.getJar(JobConf.java:80)
>         at org.apache.nutch.mapred.TaskTracker$TaskInProgress.localizeTask(TaskTracker.java:335)
>         at org.apache.nutch.mapred.TaskTracker$TaskInProgress.<init>(TaskTracker.java:319)
>         at org.apache.nutch.mapred.TaskTracker.offerService(TaskTracker.java:221)
>         at org.apache.nutch.mapred.TaskTracker.run(TaskTracker.java:269)
>         at org.apache.nutch.mapred.TaskTracker.main(TaskTracker.java:610)
> Caused by: org.xml.sax.SAXParseException: Premature end of file.
>         at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
>         at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
>         at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:172)
>         at org.apache.nutch.util.NutchConf.loadResource(NutchConf.java:315)
>         ... 8 more
>
> Whenever I look at the job.xml file specified by this location, it
> turns out that it is a directory, not a file.
>
> drwxrwxr-x 2 jeremy users 4096 Aug 16 13:45 job.xml

I have not seen this before. If you remove everything in /tmp/nutch, is this reproducible? Are you using NDFS? If not, how are you sharing files between task trackers? Is this on Win32, Linux or what? Are you running the latest mapred code? If your troubles continue, please post your nutch-site.xml and mapred-default.xml.

Doug
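Doug's point that all inputs are now directories of files can be illustrated with a tiny sketch; the directory and file names here are arbitrary examples, not names from the thread:

```shell
# Inputs are directories of files, not single files:
# create a seed directory containing one url-list file.
mkdir -p urls
echo "http://example.com/" > urls/seed.txt

# The crawl is then pointed at the "urls" directory, not at seed.txt.
ls urls
# prints: seed.txt
```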
(mapred branch) Job.xml as a directory instead of a file, other issues.
I have been attempting to get the mapred branch version of the crawler working and have hit some snags.

First, I have observed the same behavior as a previous poster from yesterday who, instead of specifying a file for the URLs to be read from, must now specify a directory (full path) in which a file containing the URL list is stored. From the response to that thread I am gathering that it isn't desired behavior to specify a directory instead of a file.

Second, and more importantly, I am having issues with task trackers. I have three machines running task tracker, and a fourth running the job tracker, and they seem to be talking well. Whenever I try to invoke crawl using the job tracker, however, all of my task trackers continually fail with this:

050816 134532 parsing /tmp/nutch/mapred/local/tracker/task_m_5o5uvx/job.xml
[Fatal Error] :-1:-1: Premature end of file.
050816 134532 SEVERE error parsing conf file: org.xml.sax.SAXParseException: Premature end of file.
java.lang.RuntimeException: org.xml.sax.SAXParseException: Premature end of file.
        at org.apache.nutch.util.NutchConf.loadResource(NutchConf.java:355)
        at org.apache.nutch.util.NutchConf.getProps(NutchConf.java:290)
        at org.apache.nutch.util.NutchConf.get(NutchConf.java:91)
        at org.apache.nutch.mapred.JobConf.getJar(JobConf.java:80)
        at org.apache.nutch.mapred.TaskTracker$TaskInProgress.localizeTask(TaskTracker.java:335)
        at org.apache.nutch.mapred.TaskTracker$TaskInProgress.<init>(TaskTracker.java:319)
        at org.apache.nutch.mapred.TaskTracker.offerService(TaskTracker.java:221)
        at org.apache.nutch.mapred.TaskTracker.run(TaskTracker.java:269)
        at org.apache.nutch.mapred.TaskTracker.main(TaskTracker.java:610)
Caused by: org.xml.sax.SAXParseException: Premature end of file.
        at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
        at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
        at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:172)
        at org.apache.nutch.util.NutchConf.loadResource(NutchConf.java:315)
        ... 8 more

Whenever I look at the job.xml file specified by this location, it turns out that it is a directory, not a file:

drwxrwxr-x 2 jeremy users 4096 Aug 16 13:45 job.xml

Any help / observation of these issues is most appreciated.

Thanks,
Jeremy