[jira] [Commented] (NUTCH-1200) Resolving Ivy dependencies in several plugins

2011-12-01 Thread Blaise Thomson (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160803#comment-13160803
 ] 

Blaise Thomson commented on NUTCH-1200:
---

Hi - I'm having the same problem setting up in Eclipse. What was your 
configuration problem, so that I can try the same fix? Many thanks!

 Resolving Ivy dependencies in several plugins 
 --

 Key: NUTCH-1200
 URL: https://issues.apache.org/jira/browse/NUTCH-1200
 Project: Nutch
  Issue Type: Improvement
  Components: build
Affects Versions: 1.4
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.5

 Attachments: NUTCH-1200-trunk.patch, NUTCH-1200-v2-trunk.patch


 When configuring Nutch 1.5-SNAPSHOT in Eclipse, I noticed that any plugins 
 requiring additional libraries OVER AND ABOVE the ones specified in 
 NUTCH_HOME/ivy/ivy.xml cannot resolve their dependencies. Specifically, the 
 classes are:
 {code}
 - FeedParser - <dependency org="net.java.dev.rome" name="rome" rev="1.0.0" conf="*->master"/>
 - URLAutomationFilter - <dependency org="dk.brics" name="automaton" rev="???"/>
 - SWFParser - <dependency org="com.google.gwt" name="gwt-incubator" rev="2.0.1"/>
 - HTMLParser - <dependency org="net.sourceforge.nekohtml" name="nekohtml" rev="1.9.15"/>
 {code}
 Further to this, I cannot locate the dk.brics dependency!
 Finally, the plugin/ivy.xml files for the above plugins cannot be parsed 
 correctly due to the ${nutch.root} variable.
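
 To see which property references an ivy.xml file depends on, a small scan for
 ${...} placeholders can help. The following is only a minimal sketch, not part
 of Nutch or Ivy: the class name IvyPlaceholderCheck, the method
 findPlaceholders, and the sample include line are all hypothetical and used
 purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Nutch or Ivy): lists the ${...} property
// references that appear in a piece of ivy.xml text.
public class IvyPlaceholderCheck {
    // Matches Ant/Ivy-style property references such as ${nutch.root}.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    /** Returns the names of all ${...} references found in the text, in order. */
    public static List<String> findPlaceholders(String text) {
        List<String> names = new ArrayList<String>();
        Matcher m = PLACEHOLDER.matcher(text);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }

    public static void main(String[] args) {
        // Illustrative fragment only; the include path is an assumption.
        String fragment = "<include file=\"${nutch.root}/ivy/ivy-configurations.xml\"/>";
        System.out.println(findPlaceholders(fragment)); // prints [nutch.root]
    }
}
```

 Any name reported this way has to be defined by whatever tool parses the
 file; if the IDE does not define it, the reference stays unresolved.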

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-1200) Resolving Ivy dependencies in several plugins

2011-12-01 Thread Lewis John McGibbney (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160847#comment-13160847
 ] 

Lewis John McGibbney commented on NUTCH-1200:
-

Hi Blaise, I would direct you to this tutorial [1]. It covers everything you 
should need to get Nutch working within your Eclipse IDE. It takes about half 
an hour to set up, but it definitely works: I have been debugging some simple 
jobs from within Eclipse. Please get back to us on the user lists if you are 
having any problems. Thank you.

[1] http://wiki.apache.org/nutch/RunNutchInEclipse





[Nutch Wiki] Trivial Update of RunNutchInEclipse by LewisJohnMcgibbney

2011-12-01 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The RunNutchInEclipse page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/RunNutchInEclipse?action=diff&rev1=34&rev2=35

   * Once we have ensured that Nutch trunk is correctly configured we can 
progress to building within Eclipse.
  
  === Build Nutch ===
-  * We can now progress to building Nutch by simply dragging the build.xml 
file into the Ant perspective and double clicking on the build file. If you 
configured the project correctly, Eclipse will build Nutch for you into bin 
and you should see something similar to the following:
+  * We can now progress to building Nutch by simply dragging the build.xml 
file into the Ant view and double clicking on the build file. If you configured 
the project correctly, Eclipse will build Nutch for you into bin and you 
should see something similar to the following:
  {{{
  BUILD SUCCESSFUL
  Total time: 33 seconds


Jenkins build is back to normal : Nutch-trunk #1681

2011-12-01 Thread Apache Jenkins Server
See https://builds.apache.org/job/Nutch-trunk/1681/




[jira] [Commented] (NUTCH-1206) tika parser of nutch 1.3 is failing to prcess pdfs

2011-12-01 Thread dibyendu ghosh (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13161463#comment-13161463
 ] 

dibyendu ghosh commented on NUTCH-1206:
---

Tried with 1.4. It's still not working. The nutch script in 1.3 did not have 
a parsechecker option. 1.4 shows the following output:
===
bash-2.00$ bin/nutch parsechecker -dumpText https://issues.apache.org/jira/secure/attachment/12505323/direct.pdf
fetching: https://issues.apache.org/jira/secure/attachment/12505323/direct.pdf
Can't fetch URL successfully
===
This is after adding the above-mentioned configuration setting to nutch-site.xml.

 tika parser of nutch 1.3 is failing to prcess pdfs
 --

 Key: NUTCH-1206
 URL: https://issues.apache.org/jira/browse/NUTCH-1206
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.3
 Environment: Solaris/Linux/Windows
Reporter: dibyendu ghosh
Assignee: Chris A. Mattmann
 Attachments: direct.pdf


 Please refer to this message: 
 http://www.mail-archive.com/user%40nutch.apache.org/msg04315.html. The old 
 parse-pdf parser seems able to parse older PDFs (checked with Nutch 1.2), 
 though it cannot parse Acrobat 9.0 PDFs. Nutch 1.3 does not have the 
 parse-pdf plugin and cannot parse even older PDFs.
 My code (TestParse.java):
 
 bash-2.00$ cat TestParse.java
 import java.io.File;
 import java.io.FileInputStream;
 import java.io.FileOutputStream;
 import java.io.PrintStream;
 import java.util.Iterator;
 import java.util.Map;
 import java.util.Map.Entry;
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.io.Text;
 import org.apache.nutch.metadata.Metadata;
 import org.apache.nutch.parse.ParseResult;
 import org.apache.nutch.parse.Parse;
 import org.apache.nutch.parse.ParseStatus;
 import org.apache.nutch.parse.ParseUtil;
 import org.apache.nutch.parse.ParseData;
 import org.apache.nutch.protocol.Content;
 import org.apache.nutch.util.NutchConfiguration;
 public class TestParse {
     private static Configuration conf = NutchConfiguration.create();
     public TestParse() {
     }
     public static void main(String[] args) {
         String filename = args[0];
         convert(filename);
     }
     public static String convert(String fileName) {
         String newName = "abc.html";
         try {
             System.out.println("Converting " + fileName + " to html.");
             if (convertToHtml(fileName, newName))
                 return newName;
         } catch (Exception e) {
             (new File(newName)).delete();
             System.out.println("General exception " + e.getMessage());
         }
         return null;
     }
     private static boolean convertToHtml(String fileName, String newName)
             throws Exception {
         // Read the file
         FileInputStream in = new FileInputStream(fileName);
         byte[] buf = new byte[in.available()];
         in.read(buf);
         in.close();
         // Parse the file
         Content content = new Content("file:" + fileName, "file:" + fileName,
                 buf, "", new Metadata(), conf);
         ParseResult parseResult = new ParseUtil(conf).parse(content);
         parseResult.filter();
         if (parseResult.isEmpty()) {
             System.out.println("All parsing attempts failed");
             return false;
         }
         Iterator<Map.Entry<Text, Parse>> iterator = parseResult.iterator();
         if (iterator == null) {
             System.out.println("Cannot iterate over successful parse results");
             return false;
         }
         Parse parse = null;
         ParseData parseData = null;
         while (iterator.hasNext()) {
             parse = parseResult.get((Text) iterator.next().getKey());
             parseData = parse.getData();
             ParseStatus status = parseData.getStatus();
             // If Parse failed then bail
             if (!ParseStatus.STATUS_SUCCESS.equals(status)) {
                 System.out.println("Could not parse " + fileName + ". " +
                         status.getMessage());
                 return false;
             }
         }
         // Start writing to newName
         FileOutputStream fout = new FileOutputStream(newName);
         PrintStream out = new PrintStream(fout, true, "UTF-8");
         // Start Document
         out.println("<html>");
         // Start Header
         out.println("<head>");
         // Write Title
         String title = parseData.getTitle();
         if (title != null && title.trim().length() > 0) {
             out.println("<title>" + parseData.getTitle() + "</title>");
         }
         // Write out Meta tags
         Metadata metaData = parseData.getContentMeta();
         String[] names