[jira] [Commented] (NUTCH-1200) Resolving Ivy dependencies in several plugins
[ https://issues.apache.org/jira/browse/NUTCH-1200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160803#comment-13160803 ]

Blaise Thomson commented on NUTCH-1200:
---------------------------------------

Hi - I'm having the same problem with setting up in Eclipse. What was your configuration problem, so that I can try the same fix? Many thanks!

Resolving Ivy dependencies in several plugins
---------------------------------------------

Key: NUTCH-1200
URL: https://issues.apache.org/jira/browse/NUTCH-1200
Project: Nutch
Issue Type: Improvement
Components: build
Affects Versions: 1.4
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Fix For: 1.5
Attachments: NUTCH-1200-trunk.patch, NUTCH-1200-v2-trunk.patch

When configuring Nutch 1.5-SNAPSHOT in Eclipse, I noticed that any plugins requiring additional libraries OVER AND ABOVE the ones specified in NUTCH_HOME/ivy/ivy.xml cannot resolve the dependencies. Specifically, the classes are:
{code}
- FeedParser <dependency org="net.java.dev.rome" name="rome" rev="1.0.0" conf="*->master"/>
- URLAutomationFilter - <dependency org="dk.brics" name="automaton" rev="???"/>
- SWFParser <dependency org="com.google.gwt" name="gwt-incubator" rev="2.0.1"/>
- HTMLParser <dependency org="net.sourceforge.nekohtml" name="nekohtml" rev="1.9.15"/>
{code}
Further to this, I cannot locate the dk.brics dependency!
Finally, the plugin/ivy.xml files for the above plugins cannot be parsed correctly due to the ${nutch.root} variable.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1200) Resolving Ivy dependencies in several plugins
[ https://issues.apache.org/jira/browse/NUTCH-1200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160847#comment-13160847 ]

Lewis John McGibbney commented on NUTCH-1200:
---------------------------------------------

Hi Blaise,

I would direct you to this tutorial [1]. It covers everything you should need to get Nutch working within your Eclipse IDE. It takes about half an hour to set up but definitely works, as I have been debugging some simple jobs from within Eclipse.

Please get back to us on the user lists if you are having any problems.

Thank you

[1] http://wiki.apache.org/nutch/RunNutchInEclipse
[Nutch Wiki] Trivial Update of RunNutchInEclipse by LewisJohnMcgibbney
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification.

The RunNutchInEclipse page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/RunNutchInEclipse?action=diff&rev1=34&rev2=35

   * Once we have ensured that Nutch trunk is correctly configured we can progress to building within Eclipse.

  === Build Nutch ===
-  * We can now progress to building Nutch by simply dragging the build.xml file into the Ant perspective and double clicking on the build file. If you configured the project correctly, Eclipse will build Nutch for you into bin and you should see something similar to the following:
+  * We can now progress to building Nutch by simply dragging the build.xml file into the Ant view and double clicking on the build file. If you configured the project correctly, Eclipse will build Nutch for you into bin and you should see something similar to the following:

  {{{
  BUILD SUCCESSFUL
  Total time: 33 seconds
Jenkins build is back to normal : Nutch-trunk #1681
See https://builds.apache.org/job/Nutch-trunk/1681/
[jira] [Commented] (NUTCH-1206) tika parser of nutch 1.3 is failing to prcess pdfs
[ https://issues.apache.org/jira/browse/NUTCH-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13161463#comment-13161463 ]

dibyendu ghosh commented on NUTCH-1206:
---------------------------------------

Tried with 1.4. It's still not working. 1.3 did not have the parsechecker option for the nutch script. 1.4 shows the following output:
===
bash-2.00$ bin/nutch parsechecker -dumpText https://issues.apache.org/jira/secure/attachment/12505323/direct.pdf
fetching: https://issues.apache.org/jira/secure/attachment/12505323/direct.pdf
Can't fetch URL successfully
===
This is after keeping the above-mentioned conf. setting in nutch-site.xml.

tika parser of nutch 1.3 is failing to prcess pdfs
--------------------------------------------------

Key: NUTCH-1206
URL: https://issues.apache.org/jira/browse/NUTCH-1206
Project: Nutch
Issue Type: Bug
Components: parser
Affects Versions: 1.3
Environment: Solaris/Linux/Windows
Reporter: dibyendu ghosh
Assignee: Chris A. Mattmann
Attachments: direct.pdf

Please refer to this message: http://www.mail-archive.com/user%40nutch.apache.org/msg04315.html.
The old parse-pdf parser seems to be able to parse old PDFs (checked with Nutch 1.2), though it is not able to parse Acrobat 9.0 PDFs. Nutch 1.3 does not have the parse-pdf plugin and is not able to parse even older PDFs.
my code (TestParse.java):

bash-2.00$ cat TestParse.java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.PrintStream;
import java.util.Iterator;
import java.util.Map;
import java.util.Map.Entry;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.metadata.Metadata;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseStatus;
import org.apache.nutch.parse.ParseUtil;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.NutchConfiguration;

public class TestParse {

    private static Configuration conf = NutchConfiguration.create();

    public TestParse() {
    }

    public static void main(String[] args) {
        String filename = args[0];
        convert(filename);
    }

    public static String convert(String fileName) {
        String newName = "abc.html";
        try {
            System.out.println("Converting " + fileName + " to html.");
            if (convertToHtml(fileName, newName))
                return newName;
        } catch (Exception e) {
            (new File(newName)).delete();
            System.out.println("General exception " + e.getMessage());
        }
        return null;
    }

    private static boolean convertToHtml(String fileName, String newName) throws Exception {
        // Read the file
        FileInputStream in = new FileInputStream(fileName);
        byte[] buf = new byte[in.available()];
        in.read(buf);
        in.close();

        // Parse the file
        Content content = new Content("file:" + fileName, "file:" + fileName, buf, "", new Metadata(), conf);
        ParseResult parseResult = new ParseUtil(conf).parse(content);
        parseResult.filter();
        if (parseResult.isEmpty()) {
            System.out.println("All parsing attempts failed");
            return false;
        }
        Iterator<Map.Entry<Text,Parse>> iterator = parseResult.iterator();
        if (iterator == null) {
            System.out.println("Cannot iterate over successful parse results");
            return false;
        }
        Parse parse = null;
        ParseData parseData = null;
        while (iterator.hasNext()) {
            parse = parseResult.get((Text)iterator.next().getKey());
            parseData = parse.getData();
            ParseStatus status = parseData.getStatus();

            // If Parse failed then bail
            if (!ParseStatus.STATUS_SUCCESS.equals(status)) {
                System.out.println("Could not parse " + fileName + ". " + status.getMessage());
                return false;
            }
        }

        // Start writing to newName
        FileOutputStream fout = new FileOutputStream(newName);
        PrintStream out = new PrintStream(fout, true, "UTF-8");

        // Start Document
        out.println("<html>");

        // Start Header
        out.println("<head>");

        // Write Title
        String title = parseData.getTitle();
        if (title != null && title.trim().length() > 0) {
            out.println("<title>" + parseData.getTitle() + "</title>");
        }

        // Write out Meta tags
        Metadata metaData = parseData.getContentMeta();
        String[] names