[jira] [Commented] (NUTCH-1206) tika parser of nutch 1.3 is failing to process pdfs
[ https://issues.apache.org/jira/browse/NUTCH-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13158366#comment-13158366 ]

dibyendu ghosh commented on NUTCH-1206:
---------------------------------------

Hi Chris, I have attached the direct.pdf file. You can also test with any simple PDF, for example by exporting to PDF from an OpenOffice document; the results are the same. I noticed that Nutch 1.4 was released on 24th Nov. I will update after testing with that. Thanks, Dibyendu

tika parser of nutch 1.3 is failing to process pdfs
---------------------------------------------------

                Key: NUTCH-1206
                URL: https://issues.apache.org/jira/browse/NUTCH-1206
            Project: Nutch
         Issue Type: Bug
         Components: parser
   Affects Versions: 1.3
        Environment: Solaris/Linux/Windows
           Reporter: dibyendu ghosh
           Assignee: Chris A. Mattmann
        Attachments: direct.pdf

Please refer to this message: http://www.mail-archive.com/user%40nutch.apache.org/msg04315.html. The old parse-pdf parser seems to be able to parse old PDFs (checked with Nutch 1.2), though it is not able to parse Acrobat 9.0 PDFs. Nutch 1.3 does not have the parse-pdf plugin and is not able to parse even the older PDFs. My code (TestParse.java):

bash-2.00$ cat TestParse.java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.PrintStream;
import java.util.Iterator;
import java.util.Map;
import java.util.Map.Entry;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.metadata.Metadata;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseStatus;
import org.apache.nutch.parse.ParseUtil;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.NutchConfiguration;

public class TestParse {
  private static Configuration conf = NutchConfiguration.create();

  public TestParse() {
  }

  public static void main(String[] args) {
    String filename = args[0];
    convert(filename);
  }

  public static String convert(String fileName) {
    String newName = "abc.html";
    try {
      System.out.println("Converting " + fileName + " to html.");
      if (convertToHtml(fileName, newName))
        return newName;
    } catch (Exception e) {
      (new File(newName)).delete();
      System.out.println("General exception " + e.getMessage());
    }
    return null;
  }

  private static boolean convertToHtml(String fileName, String newName) throws Exception {
    // Read the file
    FileInputStream in = new FileInputStream(fileName);
    byte[] buf = new byte[in.available()];
    in.read(buf);
    in.close();
    // Parse the file
    Content content = new Content("file:" + fileName, "file:" + fileName, buf, "", new Metadata(), conf);
    ParseResult parseResult = new ParseUtil(conf).parse(content);
    parseResult.filter();
    if (parseResult.isEmpty()) {
      System.out.println("All parsing attempts failed");
      return false;
    }
    Iterator<Map.Entry<Text,Parse>> iterator = parseResult.iterator();
    if (iterator == null) {
      System.out.println("Cannot iterate over successful parse results");
      return false;
    }
    Parse parse = null;
    ParseData parseData = null;
    while (iterator.hasNext()) {
      parse = parseResult.get((Text)iterator.next().getKey());
      parseData = parse.getData();
      ParseStatus status = parseData.getStatus();
      // If Parse failed then bail
      if (!ParseStatus.STATUS_SUCCESS.equals(status)) {
        System.out.println("Could not parse " + fileName + ". " + status.getMessage());
        return false;
      }
    }
    // Start writing to newName
    FileOutputStream fout = new FileOutputStream(newName);
    PrintStream out = new PrintStream(fout, true, "UTF-8");
    // Start Document
    out.println("<html>");
    // Start Header
    out.println("<head>");
    // Write Title
    String title = parseData.getTitle();
    if (title != null && title.trim().length() > 0) {
      out.println("<title>" + parseData.getTitle() + "</title>");
    }
    // Write out Meta tags
    Metadata metaData = parseData.getContentMeta();
    String[] names = metaData.names();
    for (String name : names) {
      String[] subvalues = metaData.getValues(name);
      String values = null;
      for
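For a quick check of whether the failure lies in Tika itself rather than in Nutch's parse-tika plugin, a standalone snippet along these lines can be run against the same PDF. This is a sketch only: the class name is illustrative, it is not part of the attached TestParse.java, and it assumes the tika-core and tika-parsers jars are on the classpath.

import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class TikaPdfCheck {
  public static void main(String[] args) throws Exception {
    InputStream in = new FileInputStream(args[0]);            // e.g. direct.pdf
    BodyContentHandler handler = new BodyContentHandler(-1);  // -1 = no write limit
    Metadata metadata = new Metadata();
    // Let Tika detect the type and parse; an exception here means Tika
    // itself cannot handle the file, independent of Nutch.
    new AutoDetectParser().parse(in, handler, metadata, new ParseContext());
    in.close();
    System.out.println("Title: " + metadata.get("title"));
    String text = handler.toString();
    System.out.println(text.substring(0, Math.min(200, text.length())));
  }
}

If this succeeds while the Nutch parse fails, the problem is more likely in the plugin wiring than in Tika's PDF support.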
[jira] [Closed] (NUTCH-619) Another Language Identifier Plugin using Unicode code point range
[ https://issues.apache.org/jira/browse/NUTCH-619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche closed NUTCH-619.
-------------------------------
    Resolution: Won't Fix

Language identification is now delegated to Tika.

Another Language Identifier Plugin using Unicode code point range
-----------------------------------------------------------------

                Key: NUTCH-619
                URL: https://issues.apache.org/jira/browse/NUTCH-619
            Project: Nutch
         Issue Type: Wish
           Reporter: Vinci

After I checked the language-identifier plugin, I found the internal implementation is inefficient for languages that can be clearly identified based on their Unicode code points (e.g. the CJK languages). If Nutch works with Unicode, could anybody write a language identifier based on Unicode code point ranges? The map is here: http://en.wikipedia.org/wiki/Basic_Multilingual_Plane. You can also refer to NutchAnalysis.jj for some of the language code ranges.
* Some late-developed or rare characters, including some CJK characters, have been moved to the SIP.
* Maybe a special property should be set when characters from multiple languages are detected (languages other than the English alphabet). My suggestion here is to let the CJK locales be the default case, as they need a bi-gram or other analyzer for better indexing.
** CJK text is very difficult to divide further, as the languages share Han characters; if you really want to identify the specific member of CJK, you need to use the language identifier plugin.
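The core idea can be sketched with plain JDK code: Character.UnicodeBlock already encodes the BMP block ranges, so a crude CJK detector reduces to counting code points per block. The class name, the block selection, and the majority threshold below are illustrative, not from the issue.

import java.lang.Character.UnicodeBlock;

public class ScriptSniffer {
  // Returns true if the majority of letters fall in CJK-related BMP blocks,
  // i.e. the text likely needs bi-gram-style analysis.
  public static boolean looksCjk(String text) {
    int cjk = 0, total = 0;
    for (int i = 0; i < text.length(); ) {
      int cp = text.codePointAt(i);
      i += Character.charCount(cp);       // step over surrogate pairs correctly
      if (!Character.isLetter(cp)) continue;
      total++;
      UnicodeBlock b = UnicodeBlock.of(cp);
      if (b == UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
          || b == UnicodeBlock.HIRAGANA
          || b == UnicodeBlock.KATAKANA
          || b == UnicodeBlock.HANGUL_SYLLABLES) {
        cjk++;
      }
    }
    return total > 0 && cjk * 2 > total;  // majority of letters are CJK
  }
}

As the issue notes, this can only separate scripts, not the individual CJK languages, since they share Han characters.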
[jira] [Closed] (NUTCH-839) nutch doesn't run under 0.20.2+228-1~karmic-cdh3b1 version of hadoop
[ https://issues.apache.org/jira/browse/NUTCH-839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche closed NUTCH-839.
-------------------------------
       Resolution: Duplicate
    Fix Version/s: 1.4

https://issues.apache.org/jira/browse/NUTCH-937

nutch doesn't run under 0.20.2+228-1~karmic-cdh3b1 version of hadoop
--------------------------------------------------------------------

                Key: NUTCH-839
                URL: https://issues.apache.org/jira/browse/NUTCH-839
            Project: Nutch
         Issue Type: Bug
         Components: fetcher
   Affects Versions: 1.1
        Environment: ubuntu linux version 2.6.31-14-server, x86_64 GNU/Linux
           Reporter: Robert Gonzalez
            Fix For: 1.4

New versions of Hadoop appear to put jars in a different format now: instead of file:/a/b/c/d/job.jar, it's now jar:file:/a/b/c/d/job.jar!, which breaks Nutch when it's trying to load its plugins. Specifically, the stack trace looks like:

Caused by: java.lang.RuntimeException: x point org.apache.nutch.net.URLNormalizer not found.
        at org.apache.nutch.net.URLNormalizers.<init>(URLNormalizers.java:124)
        at org.apache.nutch.crawl.Injector$InjectMapper.configure(Injector.java:57)

A simple test class was written that used the URLFilters class, and the following stack trace resulted:

10/07/01 14:25:25 INFO mapred.JobClient: Task Id : attempt_201006171624_46525_m_00_1, Status : FAILED
java.lang.RuntimeException: org.apache.nutch.net.URLFilter not found.
        at org.apache.nutch.net.URLFilters.<init>(URLFilters.java:52)
        at com.maxpoint.crawl.BidSampler$BIdSMapper.setup(BidSampler.java:42)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
        at org.apache.hadoop.mapred.Child.main(Child.java:170)

Running this on an older version of Hadoop works.
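The mismatch itself is easy to illustrate: the plugin loader expects a plain file: path, while the newer Hadoop hands back a nested jar: URL that has to be unwrapped first. The sketch below only demonstrates the unwrapping; the actual fix is the one tracked in NUTCH-937, and the class/method names here are illustrative.

public class JobJarPath {
  // "jar:file:/a/b/c/d/job.jar!/..." -> "file:/a/b/c/d/job.jar"
  static String unwrap(String url) {
    if (url.startsWith("jar:")) {
      url = url.substring("jar:".length());   // drop the jar: scheme
      int bang = url.indexOf('!');
      if (bang >= 0) url = url.substring(0, bang);  // drop the in-jar path
    }
    return url;
  }

  public static void main(String[] args) {
    System.out.println(unwrap("jar:file:/a/b/c/d/job.jar!/"));  // file:/a/b/c/d/job.jar
  }
}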
[jira] [Commented] (NUTCH-1213) Pass additional SolrParams when indexing to Solr
[ https://issues.apache.org/jira/browse/NUTCH-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13158429#comment-13158429 ]

Julien Nioche commented on NUTCH-1213:
--------------------------------------

Looks fine to me, feel free to go ahead and commit.

Pass additional SolrParams when indexing to Solr
------------------------------------------------

                Key: NUTCH-1213
                URL: https://issues.apache.org/jira/browse/NUTCH-1213
            Project: Nutch
         Issue Type: Improvement
         Components: indexer
           Reporter: Andrzej Bialecki
           Assignee: Andrzej Bialecki
        Attachments: NUTCH-1213.diff

This is a simple improvement of the SolrIndexer. It adds the ability to pass additional Solr parameters that are applied to each UpdateRequest. This is useful when you have to pass parameters specific to a particular indexing run, which are not in the Solr invariants for the update handler, and modifying the Solr configuration for each different indexing run is inconvenient.
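For a sense of what the improvement enables, the SolrJ side looks roughly like the sketch below (SolrJ 3.x-era API). The parameter names (update.chain, commitWithin), the core URL, and the class name are illustrative examples, not anything taken from the NUTCH-1213 patch.

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.common.SolrInputDocument;

public class SolrParamsExample {
  public static void main(String[] args) throws Exception {
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    UpdateRequest req = new UpdateRequest();
    // Per-request parameters applied to this update only, instead of
    // baking them into the update handler's configuration:
    req.setParam("update.chain", "myChain");
    req.setParam("commitWithin", "10000");
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "doc-1");
    req.add(doc);
    req.process(server);
  }
}

Passing such parameters per run avoids editing solrconfig.xml for every differently configured indexing job, which is exactly the inconvenience the issue describes.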
[jira] [Closed] (NUTCH-657) Estonian N-gram profile has wrong name
[ https://issues.apache.org/jira/browse/NUTCH-657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche closed NUTCH-657.
-------------------------------
    Resolution: Fixed

https://issues.apache.org/jira/browse/TIKA-453

Estonian N-gram profile has wrong name
--------------------------------------

                Key: NUTCH-657
                URL: https://issues.apache.org/jira/browse/NUTCH-657
            Project: Nutch
         Issue Type: Bug
   Affects Versions: 0.8.1, 0.9.0
           Reporter: Jonathan Young
           Priority: Trivial

The Nutch language identifier plugin contains an n-gram profile, ee.ngp, in src/plugin/languageidentifier/src/java/org/apache/nutch/analysis/lang. "ee" is the ISO-3166-1-alpha-2 code for Estonia (see http://www.iso.org/iso/country_codes/iso_3166_code_lists/english_country_names_and_code_elements.htm), but it is the ISO-639-2 code for Ewe (see http://www.loc.gov/standards/iso639-2/php/English_list.php). "et" is the ISO-639-2 code for Estonian, and the language profile in ee.ngp is clearly Estonian. Proposed solution: rename ee.ngp to et.ngp.
[jira] [Resolved] (NUTCH-1213) Pass additional SolrParams when indexing to Solr
[ https://issues.apache.org/jira/browse/NUTCH-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki resolved NUTCH-1213.
-------------------------------------
    Resolution: Fixed

Committed in rev. 1207217, thanks for the review.

Pass additional SolrParams when indexing to Solr
------------------------------------------------

                Key: NUTCH-1213
                URL: https://issues.apache.org/jira/browse/NUTCH-1213
            Project: Nutch
         Issue Type: Improvement
         Components: indexer
           Reporter: Andrzej Bialecki
           Assignee: Andrzej Bialecki
        Attachments: NUTCH-1213.diff

This is a simple improvement of the SolrIndexer. It adds the ability to pass additional Solr parameters that are applied to each UpdateRequest. This is useful when you have to pass parameters specific to a particular indexing run, which are not in the Solr invariants for the update handler, and modifying the Solr configuration for each different indexing run is inconvenient.
nutch and openJDK 1.6 for fedora
Hi everyone,

Can anybody tell me why Nutch doesn't compile on my new old Fedora 9 OS with OpenJDK Java installed? It produces many warnings and errors during compilation. With Sun Java, compilation is OK. Should I conclude that OpenJDK is not compliant with the Java standards? I potentially intend to use it in other Java systems, and this might urge me to re-think my decision.

Best Regards
Alexander Aristov
Class Crawl doesn't close...
Hello everyone... I'm new to development for Nutch. I'm trying to insert some code lines to create a better environment for my project, but one thing caught my attention: the new version of Nutch doesn't exit after the main method finishes. This is a little hard to explain, so I took a screenshot to show my problem: http://lucene.472066.n3.nabble.com/file/n3542214/ScreenShot.png Look at the last line... the terminal doesn't print anything and doesn't show a new prompt. If I press Enter, the terminal apparently closes the application and shows a new prompt, but I need it to close automatically, because I will chain several repeated crawl, readdb and dump runs. How can I do that? Is this a known error? I'm waiting for a response... Thank you very much...

Danilo Fernandes

--
View this message in context: http://lucene.472066.n3.nabble.com/Class-Crawl-don-t-close-tp3542214p3542214.html
Sent from the Nutch - Dev mailing list archive at Nabble.com.
[jira] [Commented] (NUTCH-1213) Pass additional SolrParams when indexing to Solr
[ https://issues.apache.org/jira/browse/NUTCH-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13158487#comment-13158487 ]

Hudson commented on NUTCH-1213:
-------------------------------

Integrated in nutch-trunk-maven #43 (See [https://builds.apache.org/job/nutch-trunk-maven/43/])
NUTCH-1213 Pass additional SolrParams when indexing to Solr.

ab : http://svn.apache.org/viewvc/nutch/trunk/viewvc/?view=rev&root=&revision=1207217
Files :
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrConstants.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrIndexer.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrWriter.java

Pass additional SolrParams when indexing to Solr
------------------------------------------------

                Key: NUTCH-1213
                URL: https://issues.apache.org/jira/browse/NUTCH-1213
            Project: Nutch
         Issue Type: Improvement
         Components: indexer
           Reporter: Andrzej Bialecki
           Assignee: Andrzej Bialecki
        Attachments: NUTCH-1213.diff

This is a simple improvement of the SolrIndexer. It adds the ability to pass additional Solr parameters that are applied to each UpdateRequest. This is useful when you have to pass parameters specific to a particular indexing run, which are not in the Solr invariants for the update handler, and modifying the Solr configuration for each different indexing run is inconvenient.
Re: nutch and openJDK 1.6 for fedora
Hi Alexander,

Which version of OpenJDK is it? I have Nutch running on one of my servers with

java version "1.6.0_22"
OpenJDK Runtime Environment (IcedTea6 1.10.2) (6b22-1.10.2-0ubuntu1~11.04.1)
OpenJDK 64-Bit Server VM (build 20.0-b11, mixed mode)

and I don't have any problems compiling.

Julien

On 28 November 2011 10:43, Alexander Aristov <alexander.aris...@gmail.com> wrote:

  Hi everyone, Can anybody tell me why Nutch doesn't compile on my new old Fedora 9 OS with OpenJDK Java installed? It produces many warnings and errors during compilation. With Sun Java, compilation is OK. Should I conclude that OpenJDK is not compliant with the Java standards? I potentially intend to use it in other Java systems, and this might urge me to re-think my decision. Best Regards Alexander Aristov

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
Best way to get files out of segment directories
Hey Guys,

So, I've completed my crawl of the vault.fbi.gov website for my class that I'm preparing for. I've got:

[chipotle:local/nutch/framework] mattmann% du -hs crawl
 28G    crawl
[chipotle:local/nutch/framework] mattmann%
[chipotle:local/nutch/framework] mattmann% ls -l crawl/segments/
total 0
drwxr-xr-x  8 mattmann  wheel  272 Nov 27 10:49 20111127104947/
drwxr-xr-x  8 mattmann  wheel  272 Nov 27 10:50 20111127104955/
drwxr-xr-x  8 mattmann  wheel  272 Nov 27 10:52 20111127105006/
drwxr-xr-x  8 mattmann  wheel  272 Nov 27 12:57 20111127105251/
drwxr-xr-x  8 mattmann  wheel  272 Nov 27 14:46 20111127125721/
drwxr-xr-x  8 mattmann  wheel  272 Nov 27 16:42 20111127144648/
drwxr-xr-x  8 mattmann  wheel  272 Nov 27 18:43 20111127164220/
drwxr-xr-x  8 mattmann  wheel  272 Nov 27 20:44 20111127184345/
drwxr-xr-x  8 mattmann  wheel  272 Nov 27 22:48 20111127204447/
drwxr-xr-x  8 mattmann  wheel  272 Nov 28 00:50 20111127224816/
[chipotle:local/nutch/framework] mattmann% ./bin/nutch readseg -list -dir crawl/segments/
NAME            GENERATED  FETCHER START        FETCHER END          FETCHED  PARSED
20111127104947  1          2011-11-27T10:49:50  2011-11-27T10:49:50  1        1
20111127104955  31         2011-11-27T10:49:57  2011-11-27T10:49:58  31       31
20111127105006  4898       2011-11-27T10:50:08  2011-11-27T10:51:40  4898     4890
20111127105251  9890       2011-11-27T10:52:52  2011-11-27T11:56:06  714      713
20111127125721  9202       2011-11-27T12:57:24  2011-11-27T14:00:17  971      686
20111127144648  8261       2011-11-27T14:46:50  2011-11-27T15:48:25  714      712
20111127164220  7575       2011-11-27T16:42:22  2011-11-27T17:45:50  720      718
20111127184345  6871       2011-11-27T18:43:48  2011-11-27T19:47:11  767      766
20111127204447  6116       2011-11-27T20:44:50  2011-11-27T21:48:07  725      724
20111127224816  5406       2011-11-27T22:48:18  2011-11-27T23:51:33  744      744
[chipotle:local/nutch/framework] mattmann%

So the reality is, after crawling vault.fbi.gov, all I really wanted is the extracted PDF files that are housed in those segments. I've been playing around with ./bin/nutch readseg, and all I can say based on my initial impressions is that it's really hard to get it to fulfill these simple requirements that I want it to do:

1. Iterate over all the segments
   - pull out URLs that have at_download/file in them
   - for each one of those URLs, get their anchor, aka somefile.pdf (the anchor is the readable PDF name; the actual URL is a Plone CMS URL, with little meaning)
2. for each PDF file anchor name
   - create a file in output_dir with the PDF file data read from the segment

My guess is that even at the scale of data that I'm dealing with (10s of GB), it's impossible and impractical to do anything that's not M/R here. Unfortunately there isn't a tool that will simply grab the PDF files out of the segment files and then output them into a directory, appropriately named with the anchor text. Or... is there? ;-)

I'm running in local mode, with no Hadoop cluster behind me, and with a MacBook Pro (4 cores, 2.8 GHz, 8 GB RAM) to get this working, intentionally, as I don't want a cluster to be a requirement for folks doing this assignment that I'm working on.

I was talking to Ken Krugler about this, and after picking his brain, I think that I'm going to have to end up writing a tool to do what I want. So, if that's the case, fine, but can someone point me in the right direction for a good starting point for this? Ken also thought Andrzej might have like 10 magic solutions to make this happen, so here's hoping he's out there listening :-)

Thanks for the help, guys.

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
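A minimal starting point, assuming the Nutch 1.x segment layout (content/part-00000/data is a SequenceFile of Text keys and Content values) and the old Hadoop 0.20 API, might look like the sketch below; the class and output file names are illustrative. Because SequenceFile.Reader streams one record at a time, this also avoids loading the whole data file into memory.

import java.io.FileOutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.NutchConfiguration;

public class SegmentPdfDump {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    FileSystem fs = FileSystem.get(conf);
    // e.g. crawl/segments/20111127104947/content/part-00000/data
    Path data = new Path(args[0]);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
    Text url = new Text();
    Content content = new Content();
    int n = 0;
    while (reader.next(url, content)) {          // streams one record at a time
      if (url.toString().contains("at_download/file")) {
        FileOutputStream out = new FileOutputStream("dump-" + (n++) + ".pdf");
        out.write(content.getContent());         // raw fetched bytes
        out.close();
      }
    }
    reader.close();
  }
}

Naming the output from the anchor text would additionally require reading the segment's parse_data (or the linkdb), since the anchors are not stored alongside the content records.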
Re: Best way to get files out of segment directories
Hey Guys,

One more thing. Just to let you know, I've followed this blog here: http://www.spicylogic.com/allenday/blog/2008/08/29/using-nutch-to-download-large-binary-media-and-image-files/ and started to write a simple program to read the keys in a segment file and then dump out the byte content if the key matches the desired URL. You can find my code here: https://github.com/chrismattmann/CSCI-572-Code/blob/master/src/main/java/edu/usc/csci572/hw2/PDFDumper.java

Unfortunately, this code keeps dying due to OOM issues, clearly because the data file is too big, and because I likely have to M/R this. Just wanted to let you guys know where I'm at, and what I've been trying.

Thanks,
Chris

On Nov 28, 2011, at 7:23 PM, Mattmann, Chris A (388J) wrote:

[...]
Build failed in Jenkins: Nutch-nutchgora #82
See https://builds.apache.org/job/Nutch-nutchgora/82/

------------------------------------------
[...truncated 2457 lines...]

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/ivy/ivysettings.xml

compile:
     [echo] Compiling plugin: urlfilter-suffix
    [javac] /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/src/plugin/build-plugin.xml:117: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds
    [javac] Compiling 1 source file to /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/urlfilter-suffix/classes
    [javac] Note: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/src/plugin/urlfilter-suffix/src/java/org/apache/nutch/urlfilter/suffix/SuffixURLFilter.java uses unchecked or unsafe operations.
    [javac] Note: Recompile with -Xlint:unchecked for details.

jar:
      [jar] Building jar: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/urlfilter-suffix/urlfilter-suffix.jar

deps-test:

deploy:
     [copy] Copying 1 file to /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/plugins/urlfilter-suffix

copy-generated-lib:
     [copy] Copying 1 file to /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/plugins/urlfilter-suffix

init:
    [mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/urlfilter-validator
    [mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/urlfilter-validator/classes
    [mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/urlfilter-validator/test
    [mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/plugins/urlfilter-validator

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/ivy/ivysettings.xml

compile:
     [echo] Compiling plugin: urlfilter-validator
    [javac] /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/src/plugin/build-plugin.xml:117: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds
    [javac] Compiling 1 source file to /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/urlfilter-validator/classes

jar:
      [jar] Building jar: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/urlfilter-validator/urlfilter-validator.jar

deps-test:

deploy:
     [copy] Copying 1 file to /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/plugins/urlfilter-validator

copy-generated-lib:
     [copy] Copying 1 file to /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/plugins/urlfilter-validator

init:
    [mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/urlnormalizer-basic
    [mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/urlnormalizer-basic/classes
    [mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/urlnormalizer-basic/test
    [mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/plugins/urlnormalizer-basic

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/ivy/ivysettings.xml

compile:
     [echo] Compiling plugin: urlnormalizer-basic
    [javac] /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/src/plugin/build-plugin.xml:117: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds
    [javac] Compiling 1 source file to /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/urlnormalizer-basic/classes

jar:
      [jar] Building jar: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/urlnormalizer-basic/urlnormalizer-basic.jar

deps-test:

deploy:
     [copy] Copying 1 file to
[jira] [Commented] (NUTCH-1213) Pass additional SolrParams when indexing to Solr
[ https://issues.apache.org/jira/browse/NUTCH-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13159076#comment-13159076 ]

Hudson commented on NUTCH-1213:
-------------------------------

Integrated in Nutch-trunk #1678 (See [https://builds.apache.org/job/Nutch-trunk/1678/])
NUTCH-1213 Pass additional SolrParams when indexing to Solr.

ab : http://svn.apache.org/viewvc/nutch/trunk/viewvc/?view=rev&root=&revision=1207217
Files :
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrConstants.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrIndexer.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrWriter.java

Pass additional SolrParams when indexing to Solr
------------------------------------------------

                Key: NUTCH-1213
                URL: https://issues.apache.org/jira/browse/NUTCH-1213
            Project: Nutch
         Issue Type: Improvement
         Components: indexer
           Reporter: Andrzej Bialecki
           Assignee: Andrzej Bialecki
        Attachments: NUTCH-1213.diff

This is a simple improvement of the SolrIndexer. It adds the ability to pass additional Solr parameters that are applied to each UpdateRequest. This is useful when you have to pass parameters specific to a particular indexing run, which are not in the Solr invariants for the update handler, and modifying the Solr configuration for each different indexing run is inconvenient.
Re: Best way to get files out of segment directories
OK, of course, I figured it out and updated my program :-) You can see it on GitHub below. I'm going to clean up and generalize this program because I think it's of general use. I'll create an issue shortly. I'm thinking the tool could be something like:

./bin/nutch org.apache.nutch.tools.SegmentContentDumper [options]

  -segmentRootDir    full file path to the root segment directory, e.g., crawl/segments
  -regexUrlPattern   a regex URL pattern to select URL keys to dump from the content DB in each segment
  -outputDir         the output directory to write files to
  -metadata --key=value   where key is a Content metadata key and value is a value to check. If the URL and its content metadata have a matching (key, value) pair, dump it. Allow for regex matching on the value.

This would allow users to unravel the content hidden in segment directories and in sequence files into usable files that were downloaded by Nutch. Do you guys see this as a useful tool? If so, I'll contribute it this week for 1.5.

Cheers,
Chris

On Nov 28, 2011, at 7:32 PM, Mattmann, Chris A (388J) wrote:

[...]