Hi Remzi,

Sure!

Check out this 5-minute guide to writing a parser in Tika:

https://tika.apache.org/1.7/parser_guide.html


OK, so then check out Any23:

http://any23.apache.org/

It supports parsing RDF Microformats, so you may want
to create a MicroformatsParser in Tika. Once Tika supports
it, it will in turn be available in Nutch through its
parse-tika plugin, provided you upgrade the plugin to
the latest version of Tika.

You can see how to do this here:

http://s.apache.org/fsY
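
To make the delegation idea concrete, here is a rough sketch in plain
Java. It is only an illustration and assumes a lot: it does NOT use the
real Tika or Any23 APIs, the SimpleParser interface and parseWith helper
are hypothetical stand-ins for Tika's parser contract and for the role
parse-tika plays in Nutch, and the regex-based hCard matching is a toy
in place of what Any23 would actually do.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: none of these types come from Tika, Any23, or Nutch.
class MicroformatsSketch {

    /** Hypothetical stand-in for a parser plugin: one media type, one extract step. */
    interface SimpleParser {
        String supportedType();
        Map<String, String> parse(String content);
    }

    /** Toy extractor: turns elements with class="fn", "org" or "tel" into metadata. */
    static class MicroformatsParser implements SimpleParser {
        private static final Pattern PROP = Pattern.compile(
            "<(\\w+)[^>]*class=\"(fn|org|tel)\"[^>]*>([^<]*)</\\1>");

        public String supportedType() {
            return "text/html";
        }

        public Map<String, String> parse(String content) {
            Map<String, String> metadata = new LinkedHashMap<>();
            Matcher m = PROP.matcher(content);
            while (m.find()) {
                // Key is the (assumed) property name, value is the element text.
                metadata.put("hcard." + m.group(2), m.group(3).trim());
            }
            return metadata;
        }
    }

    /** Stand-in for the caller: pick the first parser registered for the media type. */
    static Map<String, String> parseWith(List<SimpleParser> parsers,
                                         String type, String content) {
        for (SimpleParser p : parsers) {
            if (p.supportedType().equals(type)) {
                return p.parse(content);
            }
        }
        return new LinkedHashMap<>(); // no parser registered for this type
    }

    public static void main(String[] args) {
        List<SimpleParser> parsers = new ArrayList<>();
        parsers.add(new MicroformatsParser());
        String html = "<div class=\"vcard\"><span class=\"fn\">Jane Doe</span>"
                    + "<span class=\"org\">Example Org</span></div>";
        System.out.println(parseWith(parsers, "text/html", html));
    }
}
```

The point of the sketch is only the shape: once a parser is registered
for a media type, any caller that dispatches on that type picks it up
without further changes.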

Best of luck, and I hope that’s enough to get
your proposal kicked off.

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++






-----Original Message-----
From: Remzi Düzağaç <remz...@gmail.com>
Reply-To: "d...@nutch.apache.org" <d...@nutch.apache.org>
Date: Friday, March 27, 2015 at 7:22 AM
To: dev <d...@nutch.apache.org>
Cc: "dev@tika.apache.org" <dev@tika.apache.org>, "d...@any23.apache.org"
<d...@any23.apache.org>
Subject: Re: GSOC RDF Microformats Support

>Hi Chris,
>
>
>Thanks for your feedback.
>I was planning to use Any23 and Tika, but I don't have a detailed grasp of
>either project. I guess I'm going to need to dive into both.
>
>
>I would appreciate it if you could guide me.
>
>
>thanks
>
>On Fri, Mar 27, 2015 at 4:07 PM, Mattmann, Chris A (3980)
><chris.a.mattm...@jpl.nasa.gov> wrote:
>
>Hi Remzi - thanks! You may want to consider this as a Tika or
>Any23 project since Nutch delegates its parsing to Tika (and
>Any23 uses Tika [and vice versa] to handle microformats).
>
>Cheers,
>Chris
>
>-----Original Message-----
>From: Remzi Düzağaç <remz...@gmail.com>
>Reply-To: "d...@nutch.apache.org" <d...@nutch.apache.org>
>Date: Friday, March 27, 2015 at 5:07 AM
>To: "d...@nutch.apache.org" <d...@nutch.apache.org>
>Subject: GSOC RDF Microformats Support
>
>>Hi Guys,
>>
>>
>>I have sent a proposal to GSoC. I would like to add RDF microformat
>>support to Nutch, and I kindly ask for your support. Would anyone
>>volunteer to be my mentor on this topic?
>>
>>
>>Thank you very much
>>
>
