Hi Remzi,

Sure!
Check out this 5-minute guide to writing a parser in Tika:

https://tika.apache.org/1.7/parser_guide.html

OK, so then check out Any23:

http://any23.apache.org/

It has support for parsing RDF Microformats. So, you may want to
create a MicroformatsParser in Tika; then if it’s supported in Tika,
it will in turn be available in Nutch and its parse-tika plugin
if you upgrade the plugin to the latest version of Tika. You can see
how to do this here:

http://s.apache.org/fsY

Cheers and best of luck - hope that’s enough to get your proposal
kicked off.

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: Remzi Düzağaç <remz...@gmail.com>
Reply-To: "d...@nutch.apache.org" <d...@nutch.apache.org>
Date: Friday, March 27, 2015 at 7:22 AM
To: dev <d...@nutch.apache.org>
Cc: "dev@tika.apache.org" <dev@tika.apache.org>, "d...@any23.apache.org" <d...@any23.apache.org>
Subject: Re: GSOC RDF Microformats Support

>Hi Chris,
>
>Thanks for your feedback.
>I was planning to use Any23 and Tika but I don't have a detailed grasp of
>both projects. I guess I'm gonna need to dive in both.
>
>I would appreciate if you could guide me
>
>thanks
>
>On Fri, Mar 27, 2015 at 4:07 PM, Mattmann, Chris A (3980)
><chris.a.mattm...@jpl.nasa.gov> wrote:
>
>Hi Remzi - thanks! You may want to consider this as a Tika or
>Any23 project since Nutch delegates its parsing to Tika (and
>Any23 uses Tika [and vice versa] to handle micro formats).
>
>Cheers,
>Chris
>
>-----Original Message-----
>From: Remzi Düzağaç <remz...@gmail.com>
>Reply-To: "d...@nutch.apache.org" <d...@nutch.apache.org>
>Date: Friday, March 27, 2015 at 5:07 AM
>To: "d...@nutch.apache.org" <d...@nutch.apache.org>
>Subject: GSOC RDF Microformats Support
>
>>Hi Guys,
>>
>>I have sent a proposal to GSoC. I would like to add RDF microformat
>>support to Nutch. I kindly ask for your support. Would anyone
>>volunteer to be my mentor on this topic?
>>
>>Thank you very much