[jira] [Commented] (NUTCH-1206) tika parser of nutch 1.3 is failing to process pdfs

2011-11-28 Thread dibyendu ghosh (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13158366#comment-13158366
 ] 

dibyendu ghosh commented on NUTCH-1206:
---

Hi Chris,
I have attached the direct.pdf file. You can also test with any simple PDF, for 
example one exported from an OpenOffice document. The results are the same.

I noticed that Nutch 1.4 was released on 24 Nov. I will update this issue after 
testing with it.

Thanks,
Dibyendu

 tika parser of nutch 1.3 is failing to process pdfs
 --

 Key: NUTCH-1206
 URL: https://issues.apache.org/jira/browse/NUTCH-1206
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.3
 Environment: Solaris/Linux/Windows
Reporter: dibyendu ghosh
Assignee: Chris A. Mattmann
 Attachments: direct.pdf


 Please refer to this message: 
 http://www.mail-archive.com/user%40nutch.apache.org/msg04315.html. The old 
 parse-pdf parser seems able to parse older PDFs (checked with Nutch 1.2), 
 though it cannot parse Acrobat 9.0 PDFs. Nutch 1.3 does not have the 
 parse-pdf plugin and cannot parse even the older PDFs.
 my code (TestParse.java):
 
 bash-2.00$ cat TestParse.java
 import java.io.File;
 import java.io.FileInputStream;
 import java.io.FileOutputStream;
 import java.io.PrintStream;
 import java.util.Iterator;
 import java.util.Map;
 import java.util.Map.Entry;
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.io.Text;
 import org.apache.nutch.metadata.Metadata;
 import org.apache.nutch.parse.ParseResult;
 import org.apache.nutch.parse.Parse;
 import org.apache.nutch.parse.ParseStatus;
 import org.apache.nutch.parse.ParseUtil;
 import org.apache.nutch.parse.ParseData;
 import org.apache.nutch.protocol.Content;
 import org.apache.nutch.util.NutchConfiguration;

 public class TestParse {
     private static Configuration conf = NutchConfiguration.create();

     public TestParse() {
     }

     public static void main(String[] args) {
         String filename = args[0];
         convert(filename);
     }

     public static String convert(String fileName) {
         String newName = "abc.html";
         try {
             System.out.println("Converting " + fileName + " to html.");
             if (convertToHtml(fileName, newName))
                 return newName;
         } catch (Exception e) {
             (new File(newName)).delete();
             System.out.println("General exception " + e.getMessage());
         }
         return null;
     }

     private static boolean convertToHtml(String fileName, String newName)
             throws Exception {
         // Read the file
         FileInputStream in = new FileInputStream(fileName);
         byte[] buf = new byte[in.available()];
         in.read(buf);
         in.close();
         // Parse the file
         Content content = new Content("file:" + fileName, "file:" + fileName,
                 buf, "", new Metadata(), conf);
         ParseResult parseResult = new ParseUtil(conf).parse(content);
         parseResult.filter();
         if (parseResult.isEmpty()) {
             System.out.println("All parsing attempts failed");
             return false;
         }
         Iterator<Map.Entry<Text, Parse>> iterator = parseResult.iterator();
         if (iterator == null) {
             System.out.println("Cannot iterate over successful parse results");
             return false;
         }
         Parse parse = null;
         ParseData parseData = null;
         while (iterator.hasNext()) {
             parse = parseResult.get((Text) iterator.next().getKey());
             parseData = parse.getData();
             ParseStatus status = parseData.getStatus();
             // If Parse failed then bail
             if (!ParseStatus.STATUS_SUCCESS.equals(status)) {
                 System.out.println("Could not parse " + fileName + ". " +
                         status.getMessage());
                 return false;
             }
         }
         // Start writing to newName
         FileOutputStream fout = new FileOutputStream(newName);
         PrintStream out = new PrintStream(fout, true, "UTF-8");
         // Start Document
         out.println("<html>");
         // Start Header
         out.println("<head>");
         // Write Title
         String title = parseData.getTitle();
         if (title != null && title.trim().length() > 0) {
             out.println("<title>" + parseData.getTitle() + "</title>");
         }
         // Write out Meta tags
         Metadata metaData = parseData.getContentMeta();
         String[] names = metaData.names();
         for (String name : names) {
             String[] subvalues = metaData.getValues(name);
             String values = null;
             for 

[jira] [Closed] (NUTCH-619) Another Language Identifier Plugin using Unicode code point range

2011-11-28 Thread Julien Nioche (Closed) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche closed NUTCH-619.
---

Resolution: Won't Fix

Language identification is now delegated to Tika.

 Another Language Identifier Plugin using Unicode code point range
 -

 Key: NUTCH-619
 URL: https://issues.apache.org/jira/browse/NUTCH-619
 Project: Nutch
  Issue Type: Wish
Reporter: Vinci

 After I checked the language-identifier plugin, I found the internal 
 implementation is inefficient for languages that can be clearly identified 
 from their Unicode code points (e.g. the CJK languages).
 If Nutch works in Unicode, can anybody write a language identifier based on 
 Unicode code point ranges? The map is here:
 http://en.wikipedia.org/wiki/Basic_Multilingual_Plane
 You can also refer to NutchAnalysis.jj for some of the language code ranges.
 * Some late-developed or rare characters, including some CJK characters, 
 have been moved to the SIP.
 * Maybe a special property should be set when characters from multiple 
 languages (beyond the English alphabet) are detected. My suggestion is to 
 let the CJK locales be the default case, as they need a bi-gram or other 
 analyzer for better indexing.
 ** CJK text is very difficult to divide further, since the languages share 
 Han characters. If you really want to identify the specific member of the 
 CJK group, you need to use the language identifier plugin.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Closed] (NUTCH-839) nutch doesn't run under 0.20.2+228-1~karmic-cdh3b1 version of hadoop

2011-11-28 Thread Julien Nioche (Closed) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche closed NUTCH-839.
---

   Resolution: Duplicate
Fix Version/s: 1.4

https://issues.apache.org/jira/browse/NUTCH-937

 nutch doesn't run under 0.20.2+228-1~karmic-cdh3b1 version of hadoop
 ---

 Key: NUTCH-839
 URL: https://issues.apache.org/jira/browse/NUTCH-839
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.1
 Environment: ubuntu linux version 2.6.31-14-server, x86_64 GNU/Linux
Reporter: Robert Gonzalez
 Fix For: 1.4


 new versions of hadoop appear to put jars in a different format now; instead 
 of file:/a/b/c/d/job.jar, it's now jar:file:/a/b/c/d/job.jar!, which breaks 
 nutch when it's trying to load its plugins.  Specifically, the stack trace 
 looks like:
 Caused by: java.lang.RuntimeException: x point 
 org.apache.nutch.net.URLNormalizer not found.
   at org.apache.nutch.net.URLNormalizers.<init>(URLNormalizers.java:124)
   at 
 org.apache.nutch.crawl.Injector$InjectMapper.configure(Injector.java:57)
 A simple test class was written that used the URLFilters class, and the 
 following stack trace resulted:
 10/07/01 14:25:25 INFO mapred.JobClient: Task Id : 
 attempt_201006171624_46525_m_00_1, Status : FAILED
 java.lang.RuntimeException: org.apache.nutch.net.URLFilter not found.
   at org.apache.nutch.net.URLFilters.<init>(URLFilters.java:52)
   at com.maxpoint.crawl.BidSampler$BIdSMapper.setup(BidSampler.java:42)
   at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
   at org.apache.hadoop.mapred.Child.main(Child.java:170)
 Running this on an older version of hadoop works.
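The kind of URL normalization a plugin loader would need for the new format can be sketched as below. The class and method are hypothetical illustrations of the string handling only; the actual fix was tracked in NUTCH-937.

```java
public class JarUrlNormalizer {
    // Converts "jar:file:/a/b/job.jar!/" style URLs back to the plain
    // "file:/a/b/job.jar" form that older code expected.
    public static String normalize(String url) {
        String s = url;
        if (s.startsWith("jar:")) {
            s = s.substring(4);           // drop the "jar:" wrapper
        }
        int bang = s.indexOf('!');
        if (bang >= 0) {
            s = s.substring(0, bang);     // drop the "!/entry" suffix
        }
        return s;
    }

    public static void main(String[] args) {
        // -> file:/a/b/c/d/job.jar
        System.out.println(normalize("jar:file:/a/b/c/d/job.jar!/"));
    }
}
```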

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-1213) Pass additional SolrParams when indexing to Solr

2011-11-28 Thread Julien Nioche (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13158429#comment-13158429
 ] 

Julien Nioche commented on NUTCH-1213:
--

Looks fine to me, feel free to go ahead and commit

 Pass additional SolrParams when indexing to Solr
 

 Key: NUTCH-1213
 URL: https://issues.apache.org/jira/browse/NUTCH-1213
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Attachments: NUTCH-1213.diff


 This is a simple improvement of the SolrIndexer. It adds the ability to pass 
 additional Solr parameters that are applied to each UpdateRequest. This is 
 useful when you have to pass parameters specific to a particular indexing 
 run, which are not in Solr invariants for the update handler, and modifying 
 the Solr configuration for each different indexing run is inconvenient.
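As a rough illustration of how such per-run parameters might be supplied, the sketch below parses a single "key1=value1&key2=value2" string into a map that could then be applied to each update request. The option name and exact syntax used by the NUTCH-1213 patch may differ; this shows only the plumbing idea.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SolrParamStrings {
    // Splits "key1=value1&key2=value2" into an ordered parameter map.
    // Pairs without an '=' (or with an empty key) are ignored.
    public static Map<String, String> parse(String spec) {
        Map<String, String> params = new LinkedHashMap<>();
        if (spec == null || spec.isEmpty()) return params;
        for (String pair : spec.split("&")) {
            int eq = pair.indexOf('=');
            if (eq > 0) {
                params.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
        }
        return params;
    }

    public static void main(String[] args) {
        System.out.println(parse("commit=false&update.chain=mychain"));
    }
}
```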

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Closed] (NUTCH-657) Estonian N-gram profile has wrong name

2011-11-28 Thread Julien Nioche (Closed) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche closed NUTCH-657.
---

Resolution: Fixed

https://issues.apache.org/jira/browse/TIKA-453

 Estonian N-gram profile has wrong name
 --

 Key: NUTCH-657
 URL: https://issues.apache.org/jira/browse/NUTCH-657
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.8.1, 0.9.0
Reporter: Jonathan Young
Priority: Trivial

 The Nutch language identifier plugin contains an ngram profile, ee.ngp, in 
 src/plugin/languageidentifier/src/java/org/apache/nutch/analysis/lang.  "ee" 
 is the ISO-3166-1-alpha-2 code for Estonia (see 
 http://www.iso.org/iso/country_codes/iso_3166_code_lists/english_country_names_and_code_elements.htm),
  but it is the ISO-639-1 code for Ewe (see 
 http://www.loc.gov/standards/iso639-2/php/English_list.php).  "et" is the 
 ISO-639 code for Estonian, and the language profile in ee.ngp is clearly 
 Estonian.
 Proposed solution: rename ee.ngp to et.ngp.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Resolved] (NUTCH-1213) Pass additional SolrParams when indexing to Solr

2011-11-28 Thread Andrzej Bialecki (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  resolved NUTCH-1213.
--

Resolution: Fixed

Committed in rev. 1207217, thanks for the review.

 Pass additional SolrParams when indexing to Solr
 

 Key: NUTCH-1213
 URL: https://issues.apache.org/jira/browse/NUTCH-1213
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Attachments: NUTCH-1213.diff


 This is a simple improvement of the SolrIndexer. It adds the ability to pass 
 additional Solr parameters that are applied to each UpdateRequest. This is 
 useful when you have to pass parameters specific to a particular indexing 
 run, which are not in Solr invariants for the update handler, and modifying 
 the Solr configuration for each different indexing run is inconvenient.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




nutch and openJDK 1.6 for fedora

2011-11-28 Thread Alexander Aristov
Hi everyone,

Can anybody tell me why Nutch doesn't compile on my Fedora 9 system
 with OpenJDK installed?

It fails with many warnings and errors during compilation. With Sun Java
compilation is OK.

Should I conclude that OpenJDK is not compliant with the Java standards?
I intend to use it in other Java systems, and this might force me
to rethink my decision.


Best Regards
Alexander Aristov


Class Crawl doesn't close...

2011-11-28 Thread DanFernandes
Hello everyone...

I'm new to Nutch development.

I'm trying to insert some lines of code to create a better environment for my
project, but one thing caught my attention.

The new version of Nutch doesn't exit the main method after it finishes. This is
a little strange to explain, but I took a screenshot to show the
problem:

http://lucene.472066.n3.nabble.com/file/n3542214/ScreenShot.png 

Look at the last line:
the terminal doesn't print anything and doesn't prompt for a new command.

If I press enter, the terminal apparently closes the application and prompts
for a new command, but I need it to close automatically because I will run the
crawl, readdb and dump steps in a loop.

How can I do that? Is this a known error?

I'm waiting for a response.
Thank you very much.

Danilo Fernandes

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Class-Crawl-don-t-close-tp3542214p3542214.html
Sent from the Nutch - Dev mailing list archive at Nabble.com.


[jira] [Commented] (NUTCH-1213) Pass additional SolrParams when indexing to Solr

2011-11-28 Thread Hudson (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13158487#comment-13158487
 ] 

Hudson commented on NUTCH-1213:
---

Integrated in nutch-trunk-maven #43 (See 
[https://builds.apache.org/job/nutch-trunk-maven/43/])
NUTCH-1213 Pass additional SolrParams when indexing to Solr.

ab : 
http://svn.apache.org/viewvc/nutch/trunk/viewvc/?view=rev&root=&revision=1207217
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrConstants.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrIndexer.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrWriter.java


 Pass additional SolrParams when indexing to Solr
 

 Key: NUTCH-1213
 URL: https://issues.apache.org/jira/browse/NUTCH-1213
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Attachments: NUTCH-1213.diff


 This is a simple improvement of the SolrIndexer. It adds the ability to pass 
 additional Solr parameters that are applied to each UpdateRequest. This is 
 useful when you have to pass parameters specific to a particular indexing 
 run, which are not in Solr invariants for the update handler, and modifying 
 the Solr configuration for each different indexing run is inconvenient.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




Re: nutch and openJDK 1.6 for fedora

2011-11-28 Thread Julien Nioche
Hi Alexander,

Which version of OpenJDK is it? I have Nutch running on one of my servers
with

java version "1.6.0_22"
OpenJDK Runtime Environment (IcedTea6 1.10.2) (6b22-1.10.2-0ubuntu1~11.04.1)
OpenJDK 64-Bit Server VM (build 20.0-b11, mixed mode)

and I don't have any problems compiling

Julien


On 28 November 2011 10:43, Alexander Aristov alexander.aris...@gmail.com wrote:

 Hi everyone,

 Can anybody tell me why Nutch doesn't compile on my Fedora 9 system
  with OpenJDK installed?

 It fails with many warnings and errors during compilation. With Sun Java
 compilation is OK.

 Should I conclude that OpenJDK is not compliant with the Java standards?
 I intend to use it in other Java systems, and this might force me
 to rethink my decision.


 Best Regards
 Alexander Aristov




-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com


Best way to get files out of segment directories

2011-11-28 Thread Mattmann, Chris A (388J)
Hey Guys,

So, I've completed my crawl of the vault.fbi.gov website for my class that I'm 
preparing 
for. I've got:

[chipotle:local/nutch/framework] mattmann% du -hs crawl
 28Gcrawl
[chipotle:local/nutch/framework] mattmann% 

[chipotle:local/nutch/framework] mattmann% ls -l crawl/segments/
total 0
drwxr-xr-x  8 mattmann  wheel  272 Nov 27 10:49 2027104947/
drwxr-xr-x  8 mattmann  wheel  272 Nov 27 10:50 2027104955/
drwxr-xr-x  8 mattmann  wheel  272 Nov 27 10:52 2027105006/
drwxr-xr-x  8 mattmann  wheel  272 Nov 27 12:57 2027105251/
drwxr-xr-x  8 mattmann  wheel  272 Nov 27 14:46 2027125721/
drwxr-xr-x  8 mattmann  wheel  272 Nov 27 16:42 2027144648/
drwxr-xr-x  8 mattmann  wheel  272 Nov 27 18:43 2027164220/
drwxr-xr-x  8 mattmann  wheel  272 Nov 27 20:44 2027184345/
drwxr-xr-x  8 mattmann  wheel  272 Nov 27 22:48 2027204447/
drwxr-xr-x  8 mattmann  wheel  272 Nov 28 00:50 2027224816/
[chipotle:local/nutch/framework] mattmann% 

./bin/nutch readseg -list -dir crawl/segments/
NAME        GENERATED  FETCHER START        FETCHER END          FETCHED  PARSED
2027104947  1          2011-11-27T10:49:50  2011-11-27T10:49:50  1        1
2027104955  31         2011-11-27T10:49:57  2011-11-27T10:49:58  31       31
2027105006  4898       2011-11-27T10:50:08  2011-11-27T10:51:40  4898     4890
2027105251  9890       2011-11-27T10:52:52  2011-11-27T11:56:06  714      713
2027125721  9202       2011-11-27T12:57:24  2011-11-27T14:00:17  971      686
2027144648  8261       2011-11-27T14:46:50  2011-11-27T15:48:25  714      712
2027164220  7575       2011-11-27T16:42:22  2011-11-27T17:45:50  720      718
2027184345  6871       2011-11-27T18:43:48  2011-11-27T19:47:11  767      766
2027204447  6116       2011-11-27T20:44:50  2011-11-27T21:48:07  725      724
2027224816  5406       2011-11-27T22:48:18  2011-11-27T23:51:33  744      744
[chipotle:local/nutch/framework] mattmann% 

So the reality is, after crawling vault.fbi.gov, all I really wanted is the 
extracted PDF files that are housed in those segments. I've been playing 
around with ./bin/nutch readseg, and all I can say based on my initial 
impressions is that it's really hard to get it to fulfill these simple 
requirements:

1. Iterate over all the segments 
  - pull out URLs that have at_download/file in them
  - for each one of those URLs, get their anchor, aka somefile.pdf (the anchor 
is the readable PDF name,
the actual URL is a Plone CMS url, with little meaning)

2. for each PDF file anchor name
   - create a file in output_dir with the PDF file data read from the segment
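The selection half of those two steps (matching the download URLs and deriving an output file name from the anchor text) could look like the sketch below. All class and method names here are illustrative, and reading the actual PDF bytes out of the segment SequenceFiles is deliberately omitted:

```java
import java.util.regex.Pattern;

public class PdfNamePicker {
    private static final Pattern DOWNLOAD_URL =
        Pattern.compile(".*at_download/file.*");

    // Step 1: does this URL look like one of the Plone download links?
    public static boolean isDownloadUrl(String url) {
        return DOWNLOAD_URL.matcher(url).matches();
    }

    // Step 2: turn the anchor text (the readable name, e.g. "somefile.pdf")
    // into a safe output file name, appending ".pdf" if it is missing.
    public static String toFileName(String anchor) {
        String name = anchor.trim().replaceAll("[^A-Za-z0-9._-]+", "_");
        return name.toLowerCase().endsWith(".pdf") ? name : name + ".pdf";
    }

    public static void main(String[] args) {
        System.out.println(isDownloadUrl(
            "http://vault.fbi.gov/some/doc/at_download/file"));
        System.out.println(toFileName("Some File.pdf"));
    }
}
```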

My guess is that even at the scale of data that I'm dealing with (10s of GB), 
it's impossible and impractical to do anything that's not M/R here. 
Unfortunately there isn't a tool that will simply grab the PDF files out of 
the segment files and then output them into a directory, appropriately named 
with the anchor text. Or...is there? ;-)

I'm running in Local mode, with no Hadoop cluster behind me, and with a 
Mac Book Pro, 4 core, 2.8 Ghz, with 8 GB RAM behind me to get this working,
intentionally as I don't want it to be a requirement for folks to have a cluster
to do this assignment that I'm working on.

I was talking to Ken Krugler about this, and after picking his brain, I think 
that 
I'm going to have to end up writing a tool to do what I want. So, if that's the 
case, 
fine, but can someone point me in the right direction for a good starting point
for this? Ken also thought Andrzej might have like 10 magic solutions to make 
this happen, so here's hoping he's out there listening :-)

Thanks for the help, guys.

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++



Re: Best way to get files out of segment directories

2011-11-28 Thread Mattmann, Chris A (388J)
Hey Guys,

One more thing. Just to let you know I've followed this blog here:

http://www.spicylogic.com/allenday/blog/2008/08/29/using-nutch-to-download-large-binary-media-and-image-files/

And started to write a simple program to read the keys in a 
Segment file, and then dump out the byte content if the key
matches the desired URL. You can find my code here:

https://github.com/chrismattmann/CSCI-572-Code/blob/master/src/main/java/edu/usc/csci572/hw2/PDFDumper.java

Unfortunately, this code keeps dying due to OOM issues, 
clearly because the data file is too big, and because 
I likely have to M/R this. 

Just wanted to let you guys know where I'm at, and what
I've been trying.

Thanks,
Chris


Build failed in Jenkins: Nutch-nutchgora #82

2011-11-28 Thread Apache Jenkins Server
See https://builds.apache.org/job/Nutch-nutchgora/82/

--
[...truncated 2457 lines...]

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/ivy/ivysettings.xml

compile:
 [echo] Compiling plugin: urlfilter-suffix
[javac] 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/src/plugin/build-plugin.xml:117:
 warning: 'includeantruntime' was not set, defaulting to 
build.sysclasspath=last; set to false for repeatable builds
[javac] Compiling 1 source file to 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/urlfilter-suffix/classes
[javac] Note: 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/src/plugin/urlfilter-suffix/src/java/org/apache/nutch/urlfilter/suffix/SuffixURLFilter.java
 uses unchecked or unsafe operations.
[javac] Note: Recompile with -Xlint:unchecked for details.

jar:
  [jar] Building jar: 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/urlfilter-suffix/urlfilter-suffix.jar

deps-test:

deploy:
 [copy] Copying 1 file to 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/plugins/urlfilter-suffix

copy-generated-lib:
 [copy] Copying 1 file to 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/plugins/urlfilter-suffix

init:
[mkdir] Created dir: 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/urlfilter-validator
[mkdir] Created dir: 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/urlfilter-validator/classes
[mkdir] Created dir: 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/urlfilter-validator/test
[mkdir] Created dir: 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/plugins/urlfilter-validator

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/ivy/ivysettings.xml

compile:
 [echo] Compiling plugin: urlfilter-validator
[javac] 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/src/plugin/build-plugin.xml:117:
 warning: 'includeantruntime' was not set, defaulting to 
build.sysclasspath=last; set to false for repeatable builds
[javac] Compiling 1 source file to 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/urlfilter-validator/classes

jar:
  [jar] Building jar: 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/urlfilter-validator/urlfilter-validator.jar

deps-test:

deploy:
 [copy] Copying 1 file to 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/plugins/urlfilter-validator

copy-generated-lib:
 [copy] Copying 1 file to 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/plugins/urlfilter-validator

init:
[mkdir] Created dir: 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/urlnormalizer-basic
[mkdir] Created dir: 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/urlnormalizer-basic/classes
[mkdir] Created dir: 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/urlnormalizer-basic/test
[mkdir] Created dir: 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/plugins/urlnormalizer-basic

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/ivy/ivysettings.xml

compile:
 [echo] Compiling plugin: urlnormalizer-basic
[javac] 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/src/plugin/build-plugin.xml:117:
 warning: 'includeantruntime' was not set, defaulting to 
build.sysclasspath=last; set to false for repeatable builds
[javac] Compiling 1 source file to 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/urlnormalizer-basic/classes

jar:
  [jar] Building jar: 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/urlnormalizer-basic/urlnormalizer-basic.jar

deps-test:

deploy:
 [copy] Copying 1 file to 

[jira] [Commented] (NUTCH-1213) Pass additional SolrParams when indexing to Solr

2011-11-28 Thread Hudson (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13159076#comment-13159076
 ] 

Hudson commented on NUTCH-1213:
---

Integrated in Nutch-trunk #1678 (See 
[https://builds.apache.org/job/Nutch-trunk/1678/])
NUTCH-1213 Pass additional SolrParams when indexing to Solr.

ab : 
http://svn.apache.org/viewvc/nutch/trunk/viewvc/?view=rev&root=&revision=1207217
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrConstants.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrIndexer.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrWriter.java


 Pass additional SolrParams when indexing to Solr
 

 Key: NUTCH-1213
 URL: https://issues.apache.org/jira/browse/NUTCH-1213
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Attachments: NUTCH-1213.diff


 This is a simple improvement of the SolrIndexer. It adds the ability to pass 
 additional Solr parameters that are applied to each UpdateRequest. This is 
 useful when you have to pass parameters specific to a particular indexing 
 run, which are not in Solr invariants for the update handler, and modifying 
 the Solr configuration for each different indexing run is inconvenient.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




Re: Best way to get files out of segment directories

2011-11-28 Thread Mattmann, Chris A (388J)
OK, of course, I figured it out, and updated my program :-)

You can see it on Github below. I'm going to clean up and 
generalize this program because I think it's of general use.
I'll create an issue shortly. 

I'm thinking the tool could be something like:

./bin/nutch org.apache.nutch.tools.SegmentContentDumper [options]
  -segmentRootDir   full file path to the root segment directory, e.g., crawl/segments
  -regexUrlPattern  a regex URL pattern to select URL keys to dump from the content DB in each segment
  -outputDir        the output directory to write the files to
  -metadata --key=value   where key is a Content metadata key and value is a value to check;
                    if the URL and its content metadata have a matching key,value pair, dump it
                    (allow for regex matching on the value)

This would allow users to unravel the content hidden in segment directories and
sequence files into usable copies of the files that Nutch downloaded.
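A minimal single-process sketch of what such a dumper might do, following the blog approach of reading a segment's content data file as a Hadoop SequenceFile of Text keys (URLs) and Nutch Content values. The class name, argument layout, and file-naming scheme are assumptions for illustration, not the proposed tool:

```java
import java.io.FileOutputStream;
import java.util.regex.Pattern;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.NutchConfiguration;

public class SegmentContentDumperSketch {
  public static void main(String[] args) throws Exception {
    // args: <content data file> <url regex> <output dir>
    // e.g., crawl/segments/<segment>/content/part-00000/data
    Configuration conf = NutchConfiguration.create();
    FileSystem fs = FileSystem.get(conf);
    Pattern pattern = Pattern.compile(args[1]);

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path(args[0]), conf);
    Text url = new Text();
    Content content = new Content();
    while (reader.next(url, content)) {
      if (pattern.matcher(url.toString()).find()) {
        // Derive a flat file name from the URL; the real tool would
        // prefer the anchor text (e.g., somefile.pdf) as the name.
        String name = url.toString().replaceAll("[^A-Za-z0-9.]", "_");
        FileOutputStream out = new FileOutputStream(args[2] + "/" + name);
        out.write(content.getContent());
        out.close();
      }
    }
    reader.close();
  }
}
```

As noted later in the thread, reading a multi-GB data file in one process risks OOM, which is why a MapReduce version would be the practical form of this tool.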

Do you guys see this as a useful tool? If so, I'll contribute it this week for 
1.5.

Cheers,
Chris

On Nov 28, 2011, at 7:32 PM, Mattmann, Chris A (388J) wrote:

 Hey Guys,
 
 One more thing. Just to let you know I've followed this blog here:
 
 http://www.spicylogic.com/allenday/blog/2008/08/29/using-nutch-to-download-large-binary-media-and-image-files/
 
 And started to write a simple program to read the keys in a 
 Segment file, and then dump out the byte content if the key
 matches the desired URL. You can find my code here:
 
 https://github.com/chrismattmann/CSCI-572-Code/blob/master/src/main/java/edu/usc/csci572/hw2/PDFDumper.java
 
 Unfortunately, this code keeps dying due to OOM issues, 
 clearly because the data file is too big, and because 
 I likely have to M/R this. 
 
 Just wanted to let you guys know where I'm at, and what
 I've been trying.
 
 Thanks,
 Chris
 
 On Nov 28, 2011, at 7:23 PM, Mattmann, Chris A (388J) wrote:
 
 Hey Guys,
 
 So, I've completed my crawl of the vault.fbi.gov website for my class that 
 I'm preparing 
 for. I've got:
 
 [chipotle:local/nutch/framework] mattmann% du -hs crawl
 28G  crawl
 [chipotle:local/nutch/framework] mattmann% 
 
 [chipotle:local/nutch/framework] mattmann% ls -l crawl/segments/
 total 0
 drwxr-xr-x  8 mattmann  wheel  272 Nov 27 10:49 2027104947/
 drwxr-xr-x  8 mattmann  wheel  272 Nov 27 10:50 2027104955/
 drwxr-xr-x  8 mattmann  wheel  272 Nov 27 10:52 2027105006/
 drwxr-xr-x  8 mattmann  wheel  272 Nov 27 12:57 2027105251/
 drwxr-xr-x  8 mattmann  wheel  272 Nov 27 14:46 2027125721/
 drwxr-xr-x  8 mattmann  wheel  272 Nov 27 16:42 2027144648/
 drwxr-xr-x  8 mattmann  wheel  272 Nov 27 18:43 2027164220/
 drwxr-xr-x  8 mattmann  wheel  272 Nov 27 20:44 2027184345/
 drwxr-xr-x  8 mattmann  wheel  272 Nov 27 22:48 2027204447/
 drwxr-xr-x  8 mattmann  wheel  272 Nov 28 00:50 2027224816/
 [chipotle:local/nutch/framework] mattmann% 
 
 ./bin/nutch readseg -list -dir crawl/segments/
 NAME        GENERATED  FETCHER START        FETCHER END          FETCHED  PARSED
 2027104947  1          2011-11-27T10:49:50  2011-11-27T10:49:50  1        1
 2027104955  31         2011-11-27T10:49:57  2011-11-27T10:49:58  31       31
 2027105006  4898       2011-11-27T10:50:08  2011-11-27T10:51:40  4898     4890
 2027105251  9890       2011-11-27T10:52:52  2011-11-27T11:56:06  714      713
 2027125721  9202       2011-11-27T12:57:24  2011-11-27T14:00:17  971      686
 2027144648  8261       2011-11-27T14:46:50  2011-11-27T15:48:25  714      712
 2027164220  7575       2011-11-27T16:42:22  2011-11-27T17:45:50  720      718
 2027184345  6871       2011-11-27T18:43:48  2011-11-27T19:47:11  767      766
 2027204447  6116       2011-11-27T20:44:50  2011-11-27T21:48:07  725      724
 2027224816  5406       2011-11-27T22:48:18  2011-11-27T23:51:33  744      744
 [chipotle:local/nutch/framework] mattmann% 
 
 So the reality is, after crawling vault.fbi.gov, all I really wanted is the 
 extracted PDF files that are housed in those segments. I've been playing 
 around with ./bin/nutch readseg, and all I can say based on my initial 
 impressions is that it's really hard to get it to fulfill these simple 
 requirements:
 
 1. Iterate over all the segments 
 - pull out URLs that have at_download/file in them
 - for each one of those URLs, get their anchor, aka somefile.pdf (the anchor 
 is the readable PDF name,
 the actual URL is a Plone CMS url, with little meaning)
 
 2. for each PDF file anchor name
  - create a file in output_dir with the PDF file data read from the segment
 
 My guess is that even at the scale of data I'm dealing with (10s of 
 GB), it's impossible and impractical to do anything that's not M/R here. 
 Unfortunately there isn't a tool that will