Re: Enabling Nutch wiki override of ACLs for Attachments
Is anyone aware of the AdminGroup and ContributersGroup we have set up for the wiki? The intention would be to have all committers in the AdminGroup; anyone who wishes to edit the wiki in any way can then be added to the ContributersGroup. This would mean that we could enable contributors to upload attachments. It would also enable all other users to view attachments, whilst reducing the possibility of spam.

If I can get an answer (if there is one) to the question above, I'll progress with setting this up.

On Tue, Nov 22, 2011 at 11:15 AM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote:

Any decisions on this, guys? The last thing I want to see is spammers; however, it would also be nice to obtain the attachments to give the wiki articles some additional context.

On Mon, Nov 21, 2011 at 4:23 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote:

I don't think this is possible. The setting can be configured either so that anyone can edit but not upload attachments, or so that ONLY an AdminGroup or ContributersGroup can add material. The latter requires someone to maintain the respective configuration files in our wiki instance... which is not a huge deal.

The whole attachment-blocking issue was introduced because some projects were experiencing high levels of spam. If this has not been, and is not, the case with Nutch, then for the time being we can simply remove this restriction and implement the above restriction if/when spam occurs.

Any thoughts?

Examples of material which has been blocked are
http://wiki.apache.org/nutch/CrawlDatumStates?action=AttachFile&do=view&target=CrawlDatum.uxf
http://wiki.apache.org/nutch/Evaluations?action=AttachFile&do=view&target=OSU_Queries.pdf

On Mon, Nov 21, 2011 at 3:46 PM, Markus Jelsma markus.jel...@openindex.io wrote:

Spam happens once in a while. Can uploading of attachments be restricted to committers?

On Monday 21 November 2011 16:40:11 Lewis John Mcgibbney wrote:

Hi Guys,

There has been some discussion recently about broken links to attachments on the Nutch wiki. The reason for this can be seen here [1]. I am not aware of the Nutch wiki suffering from spam attacks; however, this is not to say that it might not happen. Is it therefore worth re-enabling this feature as per the comments in the link below?

Thanks

[1] http://wiki.apache.org/general/OurWikiFarm#Attachments

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

--
*Lewis*
[RESULT] [VOTE] Apache Nutch 1.4 release rc #2
Hi Everyone,

This VOTE has passed:

+1 PMC
Julien Nioche
Markus Jelsma
Lewis John McGibbney
Chris Mattmann

I'll go ahead and update the website and push the release out to the mirrors.

Thanks for VOTE'ing and for your patience!

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++
Re: [RESULT] [VOTE] Apache Nutch 1.4 release rc #2
Top man Chris.

Well done everyone, there are some great contributions between 1.3 and 1.4.

All the best
Lewis

On Sat, Nov 26, 2011 at 6:31 PM, Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov wrote:

Hi Everyone,

This VOTE has passed:

+1 PMC
Julien Nioche
Markus Jelsma
Lewis John McGibbney
Chris Mattmann

I'll go ahead and update the website and push the release out to the mirrors.

Thanks for VOTE'ing and for your patience!

Cheers,
Chris

--
*Lewis*
[ANNOUNCE] Apache Nutch 1.4 released
(...apologies for the cross posting...)

The Apache Nutch project is pleased to announce the release of Apache Nutch 1.4. The release contents have been pushed out to the main Apache release site so the releases should be available as soon as the mirrors get the syncs.

Apache Nutch is an extensible framework for building out large-scale web-based search. Layered on top of fellow Apache projects Hadoop, Lucene/Solr, and Tika, Nutch provides an out of the box platform for fetching web pages, pdf files, word documents, and more. Nutch parses the content and its relevant information, indexes its metadata, and makes it available for efficient query and retrieval over modern Internet protocols.

Apache Nutch 1.4 contains a number of improvements and bug fixes. Details can be found in the changes file:

http://www.apache.org/dist/nutch/CHANGES-1.4.txt

Apache Nutch is available in source and binary form from the following download page:

http://www.apache.org/dyn/closer.cgi/nutch/

Nutch is also available as a Jar dependency from the Central repository:

http://repo2.maven.org/maven2/org/apache/nutch/

In the initial 48 hours, the release may not be available on all mirrors. When downloading from a mirror site, please remember to verify the downloads using signatures found on the Apache site:

http://www.apache.org/dist/nutch/KEYS

For more information on Apache Nutch, visit the project home page:

http://nutch.apache.org

-- Chris Mattmann (on behalf of the Apache Nutch community)

++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.mattm...@jpl.nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++
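Editorial note on verifying a mirror download: the announcement above refers to signatures checked against the KEYS file, which is normally done with a PGP tool rather than in code. As a complementary integrity check, the sketch below computes a file digest in Java and compares it with a published value. This is a swapped-in technique (a checksum comparison, not signature verification), and it assumes that a checksum value is published alongside the artifact, which the announcement does not state. The class name `DigestCheck` and its method names are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of Nutch: compares a file's digest
// against a published checksum string.
public class DigestCheck {

    /** Streams the file through the named digest and returns lowercase hex. */
    public static String hexDigest(Path file, String algorithm)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance(algorithm);
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    /** Compares with a published value, ignoring case and any whitespace. */
    public static boolean matches(String computedHex, String published) {
        String cleaned = published.replaceAll("\\s+", "");
        return computedHex.equalsIgnoreCase(cleaned);
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 3) {
            System.err.println("usage: DigestCheck <file> <algorithm> <expected-hex>");
            System.exit(2);
        }
        String actual = hexDigest(Paths.get(args[0]), args[1]);
        System.out.println(matches(actual, args[2]) ? "OK" : "MISMATCH: " + actual);
    }
}
```

The algorithm argument is any name accepted by `MessageDigest.getInstance`, such as `MD5` or `SHA-256`. A matching digest only shows the file was not corrupted in transit; it does not replace checking the signature against the project's KEYS file.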
[jira] [Commented] (NUTCH-1206) tika parser of nutch 1.3 is failing to prcess pdfs
[ https://issues.apache.org/jira/browse/NUTCH-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13157654#comment-13157654 ]

Chris A. Mattmann commented on NUTCH-1206:
------------------------------------------

Hi Dibyendu,

Can you please post direct.pdf? Or send me the URL for it? You can use the bin/nutch org.apache.nutch.parse.ParserChecker program to evaluate whether or not Nutch will parse your content. You could also try upgrading to 1.4 and see if that helps.

tika parser of nutch 1.3 is failing to prcess pdfs
--------------------------------------------------

Key: NUTCH-1206
URL: https://issues.apache.org/jira/browse/NUTCH-1206
Project: Nutch
Issue Type: Bug
Components: parser
Affects Versions: 1.3
Environment: Solaris/Linux/Windows
Reporter: dibyendu ghosh
Assignee: Chris A. Mattmann

Please refer to this message: http://www.mail-archive.com/user%40nutch.apache.org/msg04315.html.

The old parse-pdf parser seems to be able to parse old PDFs (checked with Nutch 1.2), though it is not able to parse Acrobat 9.0 PDFs. Nutch 1.3 does not have the parse-pdf plugin and is not able to parse even older PDFs.
my code (TestParse.java):

bash-2.00$ cat TestParse.java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.PrintStream;
import java.util.Iterator;
import java.util.Map;
import java.util.Map.Entry;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.metadata.Metadata;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseStatus;
import org.apache.nutch.parse.ParseUtil;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.NutchConfiguration;

public class TestParse {

    private static Configuration conf = NutchConfiguration.create();

    public TestParse() {
    }

    public static void main(String[] args) {
        String filename = args[0];
        convert(filename);
    }

    public static String convert(String fileName) {
        String newName = "abc.html";
        try {
            System.out.println("Converting " + fileName + " to html.");
            if (convertToHtml(fileName, newName))
                return newName;
        } catch (Exception e) {
            (new File(newName)).delete();
            System.out.println("General exception " + e.getMessage());
        }
        return null;
    }

    private static boolean convertToHtml(String fileName, String newName) throws Exception {
        // Read the file
        FileInputStream in = new FileInputStream(fileName);
        byte[] buf = new byte[in.available()];
        in.read(buf);
        in.close();

        // Parse the file
        Content content = new Content("file:" + fileName, "file:" + fileName, buf, "", new Metadata(), conf);
        ParseResult parseResult = new ParseUtil(conf).parse(content);
        parseResult.filter();
        if (parseResult.isEmpty()) {
            System.out.println("All parsing attempts failed");
            return false;
        }
        Iterator<Map.Entry<Text, Parse>> iterator = parseResult.iterator();
        if (iterator == null) {
            System.out.println("Cannot iterate over successful parse results");
            return false;
        }
        Parse parse = null;
        ParseData parseData = null;
        while (iterator.hasNext()) {
            parse = parseResult.get((Text) iterator.next().getKey());
            parseData = parse.getData();
            ParseStatus status = parseData.getStatus();

            // If Parse failed then bail
            if (!ParseStatus.STATUS_SUCCESS.equals(status)) {
                System.out.println("Could not parse " + fileName + ". " + status.getMessage());
                return false;
            }
        }

        // Start writing to newName
        FileOutputStream fout = new FileOutputStream(newName);
        PrintStream out = new PrintStream(fout, true, "UTF-8");

        // Start Document
        out.println("<html>");

        // Start Header
        out.println("<head>");

        // Write Title
        String title = parseData.getTitle();
        if (title != null && title.trim().length() > 0) {
            out.println("<title>" + parseData.getTitle() + "</title>");
        }

        // Write out Meta tags
        Metadata metaData = parseData.getContentMeta();
        String[] names = metaData.names();
        for (String name : names) {
            String[] subvalues = metaData.getValues(name);
            String values = null;
            for (String subvalue : subvalues) {