Re: Enabling Nutch wiki override of ACLs for Attachments

2011-11-26 Thread Lewis John Mcgibbney
Is anyone aware of the AdminGroup and ContributerGroup we have set up for
the wiki?

The intention would be to have all committers on the AdminGroup, then
anyone who wishes to edit the wiki in any way can be added to the
ContributersGroup. This would mean that we could enable contributors to
upload attachments; it would also enable all other users to view
attachments, whilst reducing the possibility of spam.

If I can get an answer (if there is one) to the question above, I'll
progress with setting this up.

On Tue, Nov 22, 2011 at 11:15 AM, Lewis John Mcgibbney 
lewis.mcgibb...@gmail.com wrote:

 any decisions on this guys?

 The last thing I want to see is spammers, however it would also be nice to
 obtain the attachments to give the wiki articles some additional context.


 On Mon, Nov 21, 2011 at 4:23 PM, Lewis John Mcgibbney 
 lewis.mcgibb...@gmail.com wrote:

 I don't think this is possible. Settings can either be configured such
 that anyone can edit but not upload attachments, or so that ONLY an AdminGroup
 or ContributersGroup can add material. This requires someone to maintain
 the respective configuration files in our wiki instance... which is not a
 huge deal.

 The whole attachment-blocking restriction was introduced because some projects
 were experiencing high levels of spam. If this has not been, and is not, the case
 with Nutch, then for the time being we can simply remove this restriction and
 implement the above restriction if/when spam occurs.

 Any thoughts?

 Examples of material which has been blocked are


 http://wiki.apache.org/nutch/CrawlDatumStates?action=AttachFile&do=view&target=CrawlDatum.uxf

 http://wiki.apache.org/nutch/Evaluations?action=AttachFile&do=view&target=OSU_Queries.pdf



 On Mon, Nov 21, 2011 at 3:46 PM, Markus Jelsma 
 markus.jel...@openindex.io wrote:

 Spam happens once in a while. Can uploading of attachments be restricted
 to
 committers?

 On Monday 21 November 2011 16:40:11 Lewis John Mcgibbney wrote:
  Hi Guys,
 
  There has been some discussion recently about broken links to
 attachments
  on the Nutch wiki. The reason for this can be seen here [1].
 
  I am not aware of the Nutch wiki suffering from Spam attacks, however
 this
  is not to say that it might not happen. Therefore is it worth
 re-enabling
  this feature as per the comments in the link below?
 
  Thanks
 
  [1] http://wiki.apache.org/general/OurWikiFarm#Attachments

 --
 Markus Jelsma - CTO - Openindex
 http://www.linkedin.com/in/markus17
 050-8536620 / 06-50258350




 --
 *Lewis*




 --
 *Lewis*




-- 
*Lewis*


[RESULT] [VOTE] Apache Nutch 1.4 release rc #2

2011-11-26 Thread Mattmann, Chris A (388J)
Hi Everyone,

This VOTE has passed:

+1 PMC

Julien Nioche
Markus Jelsma
Lewis John McGibbney
Chris Mattmann

I'll go ahead and update the website and push the release out to the mirrors. 
Thanks
for VOTE'ing and for your patience!

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++



Re: [RESULT] [VOTE] Apache Nutch 1.4 release rc #2

2011-11-26 Thread Lewis John Mcgibbney
Top man Chris.

Well done everyone, there are some great contributions between 1.3 & 1.4.

All the best

Lewis

On Sat, Nov 26, 2011 at 6:31 PM, Mattmann, Chris A (388J) 
chris.a.mattm...@jpl.nasa.gov wrote:

 Hi Everyone,

 This VOTE has passed:

 +1 PMC

 Julien Nioche
 Markus Jelsma
 Lewis John McGibbney
 Chris Mattmann

 I'll go ahead and update the website and push the release out to the
 mirrors. Thanks
 for VOTE'ing and for your patience!

 Cheers,
 Chris

 ++
 Chris Mattmann, Ph.D.
 Senior Computer Scientist
 NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
 Office: 171-266B, Mailstop: 171-246
 Email: chris.a.mattm...@nasa.gov
 WWW:   http://sunset.usc.edu/~mattmann/
 ++
 Adjunct Assistant Professor, Computer Science Department
 University of Southern California, Los Angeles, CA 90089 USA
 ++




-- 
*Lewis*


[ANNOUNCE] Apache Nutch 1.4 released

2011-11-26 Thread Mattmann, Chris A (388J)
(...apologies for the cross posting...)

The Apache Nutch project is pleased to announce the release of Apache Nutch
1.4. The release contents have been pushed out to the main Apache release
site so the releases should be available as soon as the mirrors get the
syncs. 

Apache Nutch is an extensible framework for building out large-scale
web-based search. Layered on top of fellow Apache projects Hadoop,
Lucene/Solr, and Tika, Nutch provides an out of the box platform for
fetching web pages, pdf files, word documents, and more. Nutch parses the
content and its relevant information, indexes its metadata, and makes it
available for efficient query and retrieval over modern Internet protocols.

Apache Nutch 1.4 contains a number of improvements and bug fixes. Details
can be found in the changes file:

http://www.apache.org/dist/nutch/CHANGES-1.4.txt

Apache Nutch is available in source and binary form from the following
download page: http://www.apache.org/dyn/closer.cgi/nutch/

Nutch is also available as a Jar dependency from the Central repository:

http://repo2.maven.org/maven2/org/apache/nutch/

In the initial 48 hours, the release may not be available on all mirrors.
When downloading from a mirror site, please remember to verify the downloads
using signatures found on the Apache site:

http://www.apache.org/dist/nutch/KEYS
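
For example, verification with GnuPG would look roughly like the following,
where the file names are illustrative (substitute the exact artifact you
downloaded from the mirror):

  gpg --import KEYS
  gpg --verify apache-nutch-1.4-src.tar.gz.asc apache-nutch-1.4-src.tar.gz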

For more information on Apache Nutch, visit the project home page:
http://nutch.apache.org

-- Chris Mattmann (on behalf of the Apache Nutch community)

++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.mattm...@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++




[jira] [Commented] (NUTCH-1206) tika parser of nutch 1.3 is failing to process pdfs

2011-11-26 Thread Chris A. Mattmann (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13157654#comment-13157654
 ] 

Chris A. Mattmann commented on NUTCH-1206:
--

Hi Dibyendu,

Can you please post direct.pdf? Or send me the URL for it? You can use the 
bin/nutch org.apache.nutch.parse.ParserChecker program to evaluate whether or 
not Nutch will parse your content. You could also try upgrading to 1.4 and see 
if that helps.
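
As a rough example of that check, run from the Nutch home directory and using a
placeholder URL in place of your direct.pdf (assuming the relevant protocol
plugin is enabled for that scheme), the invocation would look something like:

  bin/nutch org.apache.nutch.parse.ParserChecker http://example.com/direct.pdf

If the parse succeeds, the checker prints details of the parse result; the exact
options and output vary by Nutch version, so check the usage message first.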


 tika parser of nutch 1.3 is failing to process pdfs
 --

 Key: NUTCH-1206
 URL: https://issues.apache.org/jira/browse/NUTCH-1206
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.3
 Environment: Solaris/Linux/Windows
Reporter: dibyendu ghosh
Assignee: Chris A. Mattmann

 Please refer to this message:
 http://www.mail-archive.com/user%40nutch.apache.org/msg04315.html. The old
 parse-pdf parser seems to be able to parse older pdfs (checked with nutch 1.2),
 though it is not able to parse Acrobat 9.0 pdfs. nutch 1.3 does not have the
 parse-pdf plugin and is not able to parse even older pdfs.
 my code (TestParse.java):
 
 bash-2.00$ cat TestParse.java
 import java.io.File;
 import java.io.FileInputStream;
 import java.io.FileOutputStream;
 import java.io.PrintStream;
 import java.util.Iterator;
 import java.util.Map;
 import java.util.Map.Entry;
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.io.Text;
 import org.apache.nutch.metadata.Metadata;
 import org.apache.nutch.parse.ParseResult;
 import org.apache.nutch.parse.Parse;
 import org.apache.nutch.parse.ParseStatus;
 import org.apache.nutch.parse.ParseUtil;
 import org.apache.nutch.parse.ParseData;
 import org.apache.nutch.protocol.Content;
 import org.apache.nutch.util.NutchConfiguration;
 public class TestParse {
 private static Configuration conf = NutchConfiguration.create();
 public TestParse() {
 }
 public static void main(String[] args) {
 String filename = args[0];
 convert(filename);
 }
 public static String convert(String fileName) {
 String newName = "abc.html";
 try {
 System.out.println("Converting " + fileName + " to html.");
 if (convertToHtml(fileName, newName))
 return newName;
 } catch (Exception e) {
 (new File(newName)).delete();
 System.out.println("General exception " + e.getMessage());
 }
 return null;
 }
 private static boolean convertToHtml(String fileName, String newName)
 throws Exception {
 // Read the file
 FileInputStream in = new FileInputStream(fileName);
 byte[] buf = new byte[in.available()];
 in.read(buf);
 in.close();
 // Parse the file
 Content content = new Content("file:" + fileName, "file:" + fileName,
   buf, "", new Metadata(), conf);
 ParseResult parseResult = new ParseUtil(conf).parse(content);
 parseResult.filter();
 if (parseResult.isEmpty()) {
 System.out.println("All parsing attempts failed");
 return false;
 }
 Iterator<Map.Entry<Text,Parse>> iterator = parseResult.iterator();
 if (iterator == null) {
 System.out.println("Cannot iterate over successful parse results");
 return false;
 }
 Parse parse = null;
 ParseData parseData = null;
 while (iterator.hasNext()) {
 parse = parseResult.get((Text)iterator.next().getKey());
 parseData = parse.getData();
 ParseStatus status = parseData.getStatus();
 // If Parse failed then bail
 if (!ParseStatus.STATUS_SUCCESS.equals(status)) {
 System.out.println("Could not parse " + fileName + ". " +
 status.getMessage());
 return false;
 }
 }
 // Start writing to newName
 FileOutputStream fout = new FileOutputStream(newName);
 PrintStream out = new PrintStream(fout, true, "UTF-8");
 // Start Document
 out.println("<html>");
 // Start Header
 out.println("<head>");
 // Write Title
 String title = parseData.getTitle();
 if (title != null && title.trim().length() > 0) {
 out.println("<title>" + parseData.getTitle() + "</title>");
 }
 // Write out Meta tags
 Metadata metaData = parseData.getContentMeta();
 String[] names = metaData.names();
 for (String name : names) {
 String[] subvalues = metaData.getValues(name);
 String values = null;
 for (String subvalue : subvalues) {