Doğacan Güney wrote:
Hi,
After hadoop-0.9.1, parsing and indexing don't seem to work.
If you parse while fetching, everything is fine, but if you run parse as a
separate job, it creates an essentially empty parse_data directory (it
has index files but no data files). I am looking into this, but so far I
haven't found the source of the error.
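To spot the symptom described above quickly, a small check along these lines might help. It is only a sketch under assumptions: it assumes the segment lives on the local file system and that parse_data contains one subdirectory per part (for example part-00000), each holding a file named "index" and a file named "data". The class and method names are made up for illustration and are not part of Nutch or Hadoop.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Nutch or Hadoop.
public class ParseDataCheck {

    // Returns the part directories that have an "index" file but no "data" file.
    // Assumed layout: parse_data/part-NNNNN/{index,data} on the local file system.
    static List<Path> partsMissingData(Path parseData) throws IOException {
        List<Path> broken = new ArrayList<>();
        try (DirectoryStream<Path> parts = Files.newDirectoryStream(parseData)) {
            for (Path part : parts) {
                if (!Files.isDirectory(part)) {
                    continue; // skip stray files at the top level
                }
                boolean hasIndex = Files.isRegularFile(part.resolve("index"));
                boolean hasData = Files.isRegularFile(part.resolve("data"));
                if (hasIndex && !hasData) {
                    broken.add(part);
                }
            }
        }
        return broken;
    }

    public static void main(String[] args) throws IOException {
        // Default path is only a placeholder; pass the real parse_data directory.
        Path parseData = Paths.get(args.length > 0 ? args[0] : "parse_data");
        if (!Files.isDirectory(parseData)) {
            System.out.println("not a directory: " + parseData);
            return;
        }
        for (Path part : partsMissingData(parseData)) {
            System.out.println("missing data file: " + part);
        }
    }
}
```

Running it against a segment's parse_data directory after a separate parse job should list every part that shows the problem. If the segment is on DFS, the same check would have to go through the Hadoop FileSystem API instead.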
Also, indexing fails at Indexer.OutputFormat.getRecordWriter. The
parameter fs seems to be an instance of PhasedFileSystem, which throws
exceptions on delete and {start,complete}LocalOutput. The following
patch should fix it, but it may not be the best way of doing this.
(The inline copy of the patch in my first mail got garbled somehow, so I am
attaching it. I hope the mailing list doesn't drop attachments.)
Index: src/java/org/apache/nutch/indexer/Indexer.java
===================================================================
--- src/java/org/apache/nutch/indexer/Indexer.java (revision 487240)
+++ src/java/org/apache/nutch/indexer/Indexer.java (working copy)
@@ -94,11 +94,15 @@
       final Path temp =
         job.getLocalPath("index/_"+Integer.toString(new Random().nextInt()));
-      fs.delete(perm); // delete old, if any
-
+      final FileSystem dfs = FileSystem.get(job);
+
+      if (dfs.exists(perm)) {
+        dfs.delete(perm); // delete old, if any
+      }
+
       final AnalyzerFactory factory = new AnalyzerFactory(job);
       final IndexWriter writer = // build locally first
-        new IndexWriter(fs.startLocalOutput(perm, temp).toString(),
+        new IndexWriter(dfs.startLocalOutput(perm, temp).toString(),
                         new NutchDocumentAnalyzer(job), true);
       writer.setMergeFactor(job.getInt("indexer.mergeFactor", 10));
@@ -146,7 +150,7 @@
           // optimize & close index
           writer.optimize();
           writer.close();
-          fs.completeLocalOutput(perm, temp); // copy to dfs
+          dfs.completeLocalOutput(perm, temp);
           fs.createNewFile(new Path(perm, DONE_NAME));
         } finally {
           closed = true;
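The idea behind the patch can be sketched in isolation. This is a simplified model, not Hadoop code: SimpleFs, BaseFs, RestrictedFs, prepareOutput and the "done-marker" name are all made up for illustration. RestrictedFs plays the role of PhasedFileSystem (it refuses delete), and BaseFs plays the role of the file system returned by FileSystem.get(job).

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of the workaround; none of these types exist in Hadoop.
public class PhasedFsSketch {

    // Minimal stand-in for a file system API.
    interface SimpleFs {
        boolean exists(String path);
        void delete(String path);
        void create(String path);
    }

    // Stands in for the real file system (what FileSystem.get(job) returns).
    static class BaseFs implements SimpleFs {
        private final Set<String> files = new HashSet<>();
        public boolean exists(String path) { return files.contains(path); }
        public void delete(String path) { files.remove(path); }
        public void create(String path) { files.add(path); }
    }

    // Stands in for PhasedFileSystem: wraps the base fs but rejects delete.
    static class RestrictedFs implements SimpleFs {
        private final SimpleFs base;
        RestrictedFs(SimpleFs base) { this.base = base; }
        public boolean exists(String path) { return base.exists(path); }
        public void delete(String path) {
            throw new UnsupportedOperationException("delete not supported");
        }
        public void create(String path) { base.create(path); }
    }

    // Mirrors the patched logic: unsupported operations go to the base fs,
    // while file creation still goes through the fs handed to the task.
    static void prepareOutput(SimpleFs taskFs, SimpleFs baseFs, String perm) {
        if (baseFs.exists(perm)) {
            baseFs.delete(perm); // delete old, if any
        }
        taskFs.create(perm + "/done-marker"); // marker name is a placeholder
    }

    public static void main(String[] args) {
        BaseFs base = new BaseFs();
        SimpleFs task = new RestrictedFs(base);
        base.create("out/part-00000"); // leftover output from an earlier run

        try {
            task.delete("out/part-00000"); // what the unpatched code effectively did
        } catch (UnsupportedOperationException e) {
            System.out.println("restricted fs: " + e.getMessage());
        }

        prepareOutput(task, base, "out/part-00000");
        System.out.println("old output still present: " + base.exists("out/part-00000"));
    }
}
```

Whether bypassing the wrapper like this is safe in the real code depends on why PhasedFileSystem blocks those calls in the first place, which is why the patch may not be the best fix.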
_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general