looking to contribute

2015-12-16 Thread Joey Hong
Hi Tika Developers,

My name is Joey. I am a college freshman with programming experience looking to 
get into the world of open-source. I was hoping to contribute to the Tika 
project, and was wondering if there were any tasks that a beginner like me 
could tackle. I am willing to do anything, whether it be fixing a minor bug, or 
adding test suites or documentation.

Thanks,
Joey

Looking to contribute

2015-12-20 Thread Pavan Sudheendra
Hi all,

My name is Pavan and I'm a software engineer working at Cisco on big data
projects for the past 2 years.

I'm looking to contribute to the Tika project and I'm wondering if I should
start looking at the GitHub issues page or somewhere else?

I've started reading the documentation and getting familiar with the build
process.

Also, any guidance on this subject would be great.

Thanks all.

-- 
Regards-
Pavan


Re: looking to contribute

2015-12-16 Thread Nick Burch

On Wed, 16 Dec 2015, Joey Hong wrote:
> My name is Joey. I am a college freshman with programming experience 
> looking to get into the world of open-source. I was hoping to contribute 
> to the Tika project, and was wondering if there were any tasks that a 
> beginner like me could tackle. I am willing to do anything, whether it 
> be fixing a minor bug, or adding test suites or documentation.


On the docs / examples side, we have a few examples on the website, but 
probably not enough! One thing might be to look through those, identify 
gaps with your fresh eyes, and work on those. We also have instructions 
for some more complicated integrations on the wiki, maybe try some of 
those and feed back on which ones aren't clear enough?


If you want to try more coding, Tim quite often runs Tika against some 
large filesets, and has a nifty tool to report on what breaks. He can 
hopefully point you at the most recent report! Maybe have a look through 
that, identify a few common failures from unidentified or common 
exceptions, and try to fix one or two of those?


Nick


RE: looking to contribute

2015-12-17 Thread Allison, Timothy B.
Speaking of the docs/examples, TIKA-1329 is still open because I haven't gotten 
around to documenting it.

Yes, if you'd like a report of exceptions, let me know.  IIRC, it would be great 
if we could improve XML detection (we're currently over-detecting), and 
there's plenty of work to do on HTML parsing (TIKA-1599).

I also have probably a full grad student semester worth of curation project 
ideas on the test corpus.  Not glamorous, but very useful for the community.

Then there's the eval code itself...that still needs to make it into shape to 
be added.

I agree with Nick though, start small on documentation/examples.

Cheers,

   Tim



Re: looking to contribute

2015-12-17 Thread Mattmann, Chris A (3980)
What Tim and Nick said. :) Joey is at Caltech and interested in
working with me, so I said jump on the Tika lists and let’s see
if there is something we can pin down.

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++







Re: looking to contribute

2015-12-17 Thread Joey Hong
Thanks for the advice! I’ll start with some documentation and tests and move to 
harder tasks from there.

Regarding the JIRA instance for TIKA-1329, would the documentation for the 
RecursiveParserWrapper go with the RecursiveMetadata page on the wiki?

Thanks,
Joey




RE: looking to contribute

2015-12-18 Thread Allison, Timothy B.
Oh, he is?!  Did I mention I have a grad-student-semester of projects for 
corpus curation? :)






RE: looking to contribute

2015-12-18 Thread Allison, Timothy B.
Y, I think we could use some updates there, but I think the key part of what I 
haven't gotten around to doing is [0].

If I understand that correctly, we have to update some stuff on the site branch.

[0] 
https://issues.apache.org/jira/browse/TIKA-1329?focusedCommentId=14295800&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14295800




Re: looking to contribute

2015-12-19 Thread Joey Hong
Regarding TIKA-1329, I found the tika-site in the Subversion source code, and I 
called:

    svn checkout https://svn.apache.org/repos/asf/tika/site/publish/1.11/ .

Since this isn’t part of the main tika/trunk repository, I was wondering if I 
should still follow the same protocol and svn commit my changes to the site 
folder. In case I shouldn’t, I’ve attached my changes to the usage examples 
page of the website below. I basically added how to parse documents with 
embedded docs using the RecursiveParserWrapper class, and how to serialize the 
returned Metadata list to JSON, with some description.

Thanks,
Joey

Title: Apache Tika – Tika API Usage Examples
Apache Tika API Usage Examples
This page provides a number of examples on how to use the various Tika APIs. All of the examples shown are also available in the Tika Example module in SVN.

Apache Tika API Usage Examples

Parsing

Parsing using the Tika Facade
Parsing using the Auto-Detect Parser
 Parsing using the Recursive Parser Wrapper 
Picking different output formats

Parsing to Plain Text
Parsing to XHTML
Fetching just certain bits of the XHTML
Custom Content Handlers

Extract Phone Numbers from Content into the Metadata
Streaming the plain text in chunks
Translation

Translation using the Microsoft Translation API
Language Identification
Additional Examples

Parsing
Tika provides a number of different ways to parse a file. These provide different levels of control, flexibility, and complexity.

Parsing using the Tika Facade
The Tika facade provides a number of very quick and easy ways to have your content parsed by Tika and to return the resulting plain text.

public String parseToStringExample() throws IOException, SAXException, TikaException {
    Tika tika = new Tika();
    try (InputStream stream = ParsingExample.class.getResourceAsStream("test.doc")) {
        return tika.parseToString(stream);
    }
}


Parsing using the Auto-Detect Parser
For more control, you can call the Tika Parsers directly. Most likely, you'll want to start out using the Auto-Detect Parser, which automatically figures out what kind of content you have, then calls the appropriate parser for you.


public String parseExample() throws IOException, SAXException, TikaException {
    AutoDetectParser parser = new AutoDetectParser();
    BodyContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    try (InputStream stream = ParsingExample.class.getResourceAsStream("test.doc")) {
        parser.parse(stream, handler, metadata);
        return handler.toString();
    }
}
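
To make the "detect, then call the appropriate parser" idea concrete, here is a toy sketch in plain Java. It is not Tika's real detection logic and uses no Tika classes: the class name, the three-entry magic-byte table, and the handler strings are all assumptions made up for illustration. Only the magic bytes themselves (%PDF, PK\3\4, and the PNG signature prefix) are well-known facts.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy detect-then-dispatch sketch; hypothetical, NOT Tika's implementation. */
public class DetectAndDispatchSketch {

    // Tiny illustrative table: media type -> leading "magic" bytes.
    private static final Map<String, byte[]> MAGIC = new LinkedHashMap<>();
    static {
        MAGIC.put("application/pdf", "%PDF".getBytes(StandardCharsets.US_ASCII));
        MAGIC.put("application/zip", new byte[] {'P', 'K', 3, 4});
        MAGIC.put("image/png", new byte[] {(byte) 0x89, 'P', 'N', 'G'});
    }

    // True if data begins with the given prefix bytes.
    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    /** Returns the first matching media type, or a generic fallback type. */
    public static String detect(byte[] data) {
        for (Map.Entry<String, byte[]> entry : MAGIC.entrySet()) {
            if (startsWith(data, entry.getValue())) {
                return entry.getKey();
            }
        }
        return "application/octet-stream";
    }

    /** Picks a (pretend) handler based on the detected type. */
    public static String parse(byte[] data) {
        String type = detect(data);
        switch (type) {
            case "application/pdf":
                return "pdf handler";
            case "application/zip":
                return "zip handler";
            case "image/png":
                return "png handler";
            default:
                return "fallback handler";
        }
    }

    public static void main(String[] args) {
        byte[] pdf = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(pdf) + " -> " + parse(pdf));
    }
}
```

The real AutoDetectParser draws on far more than a leading-bytes table, so treat this only as a mental model of the two-step flow.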



Parsing using the Recursive Parser Wrapper
 When you want to parse embedded documents, you can extract content from both the enclosing document and all embedded ones by passing the parser into the ParseContext instance. 


public String parseEmbeddedExample() throws IOException, SAXException, TikaException {
    AutoDetectParser parser = new AutoDetectParser();
    BodyContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    ParseContext context = new ParseContext();
    // Register the parser so that embedded documents are parsed with it too
    context.set(Parser.class, parser);
    try (InputStream stream = ParsingExample.class.getResourceAsStream("test_recursive_embedded.docx")) {
        // The context must be passed in, otherwise the setting above has no effect
        parser.parse(stream, handler, metadata, context);
        return handler.toString();
    }
}

Alternatively, you can use the RecursiveParserWrapper, which handles passing the parser into ParseContext. This wrapper class returns a list of Metadata objects, where the first element is the metadata and content for the container document, and the rest for each embedded document. 


public List<Metadata> recursiveParserWrapperExample() throws IOException, SAXException, TikaException {
    Parser p = new AutoDetectParser();
    // A write limit of -1 disables the limit on extracted content
    ContentHandlerFactory factory = new BasicContentHandlerFactory(
            BasicContentHandlerFactory.HANDLER_TYPE.HTML, -1);
    RecursiveParserWrapper wrapper = new RecursiveParserWrapper(p, factory);

    Metadata metadata = new Metadata();
    metadata.set(Metadata.RESOURCE_NAME_KEY, "test_recursive_embedded.docx");
    ParseContext context = new ParseContext();

    try (InputStream stream = ParsingExample.class.getResourceAsStream("test_recursive_embedded.docx")) {
        wrapper.parse(stream, new DefaultHandler(), metadata, context);
    }
    return wrapper.getMetadata();
}
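
As a rough illustration of the "container first, embedded documents after" ordering of the returned list, the sketch below walks such a list and labels each entry. To keep it runnable without Tika on the classpath, plain Map objects stand in for Tika's Metadata, and the key names resourceName and content are hypothetical placeholders rather than the keys Tika actually uses.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Walks a container-first list of metadata-like maps; a stand-in sketch only. */
public class MetadataListWalkSketch {

    // Hypothetical key names used only by this sketch.
    static final String NAME_KEY = "resourceName";
    static final String CONTENT_KEY = "content";

    /** Produces one summary line per document: index 0 is the container. */
    public static List<String> summarize(List<Map<String, String>> metadataList) {
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < metadataList.size(); i++) {
            Map<String, String> m = metadataList.get(i);
            String role = (i == 0) ? "container" : "embedded #" + i;
            String name = m.getOrDefault(NAME_KEY, "(unnamed)");
            String content = m.getOrDefault(CONTENT_KEY, "");
            lines.add(role + ": " + name + " (" + content.length() + " chars)");
        }
        return lines;
    }

    public static void main(String[] args) {
        Map<String, String> outer = new LinkedHashMap<>();
        outer.put(NAME_KEY, "test_recursive_embedded.docx");
        outer.put(CONTENT_KEY, "outer text");

        Map<String, String> inner = new LinkedHashMap<>();
        inner.put(NAME_KEY, "embedded1.zip");
        inner.put(CONTENT_KEY, "inner");

        List<Map<String, String>> docs = new ArrayList<>();
        docs.add(outer);
        docs.add(inner);
        for (String line : summarize(docs)) {
            System.out.println(line);
        }
    }
}
```

With the real wrapper, the same loop would iterate over the List of Metadata objects and read values through the Metadata accessors instead of a Map.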

 The JsonMetadataList class can serialize the metadata list into JSON, and deserialize back into the list. 

public String serializedRecursiveParserWrapperExample() throws IOException, SAXException, TikaException {
    List<Metadata> metadataList = recursiveParserWrapperExample();
    StringWriter writer = new StringWriter();
    JsonMetadataList.toJson(metadataList, writer);
    return writer.toString();
}



Picking different output formats
With Tika, you can get the textual content of your files returned in a number of different formats. These can be plain text, html, xhtml, 

Re: Looking to contribute

2015-12-20 Thread Mattmann, Chris A (3980)
Pavan, awesome! Glad to have your interest and to have you in the
community!

Check out our JIRA:

https://issues.apache.org/jira/browse/TIKA

My own personal recent interests in Tika are related to Named
Entity Recognition (Stanford NER, CoreNLP and OpenNLP), and in
Automated IR-based Geo-Gazetteers; in Audio/Video extraction,
and so forth. Also in language identification (N-grams; MIT-LL’s
Text.jl) and automated machine translation (Joshua, Moses).

If you are interested in that type of stuff, look for stuff
I reported or assigned to me, or with the label “memex”. In
addition in general if you are more interested in the types
of work that I’m contributing to Tika, see http://memex.jpl.nasa.gov/

Cheers, and happy holidays!

Cheers,
Chris




Re: looking to contribute

2015-12-20 Thread Nick Burch

On Sat, 19 Dec 2015, Joey Hong wrote:

> Regarding TIKA-1329, I found the tika-site in the Subversion source code, 
> and I called:
>   svn checkout https://svn.apache.org/repos/asf/tika/site/publish/1.11/ .
>
> Since this isn’t part of the main tika/trunk repository, I was wondering 
> if I should still follow the same protocol and svn commit my changes to 
> the site folder.


You shouldn't be working on those files - they're the generated HTML. You 
need to work on the original APT (Almost Plain Text) files which are in a 
sibling folder


I'd suggest, if you want to work on any docs stuff (yey!), you just 
checkout https://svn.apache.org/repos/asf/tika/site


Then edit the files in src/site/apt/1.x/, and use "mvn install" in the 
checkout root to test how the resulting HTML looks


> In case I shouldn’t, I’ve attached my changes to the usage examples page 
> of the website below. I basically added how to parse documents with 
> embedded docs using the RecursiveParserWrapper class, and how to 
> serialize the returned Metadata list to JSON, with some description.


Examples is a bit special! Any code should go into the tika-example module 
in the main source tree, along with unit tests that verify that they work 
properly + stay working properly. That avoids the common issue of examples 
that no longer work/compile!


Once your changes are in the example svn area, edit in the site folder the 
file src/site/apt/1.{x+1}/examples.apt to both pull in the appropriate 
code snippet + describe it. Use the %{include} directive to have the code 
pulled in, tell it which file to grab from, and which method, and it'll 
nicely inline the unit-tested example for you


Nick

Re: looking to contribute

2015-12-20 Thread Joey Hong
Oh, my bad. I should have realized when the HTML looked generated. I have now 
added the usage examples to the examples.apt file, and the page looks fine 
after it was built by mvn. As of now, the examples are edited both for the 
1.11/ and 1.12/ folders; should they only affect the 1.12/ one? 

Also, when this is all done, would I svn commit my changes the same way as for 
the main Tika app?

Thanks,
Joey




Re: looking to contribute

2015-12-20 Thread Nick Burch

On Sun, 20 Dec 2015, Joey Hong wrote:
> Oh, my bad. I should have realized when the HTML looked generated. I 
> have now added the usage examples to the examples.apt file, and the page 
> looks fine after it was built by mvn. As of now, the examples are edited 
> both for the 1.11/ and 1.12/ folders; should they only affect the 1.12/ 
> one?


If the example applies to the 1.11 release (e.g. because we forgot to add it 
there in time), pop it into the 1.11 apt file and add a note that we 
should apply it to 1.12 as well


If the example relies on new functionality added since 1.11, just in the 
1.12 folder


> Also, when this is all done, would I svn commit my changes the same way 
> as for the main Tika app?


Because they're in different bits of the tree, you'd likely need one 
patch/commit for the changes to the example source+tests, and one for the 
examples page that references + explains it


Nick


Re: Looking to contribute

2015-12-20 Thread Pavan Sudheendra
Sure, thanks Chris. Sounds very interesting..

Happy Holidays :-)



Re: looking to contribute

2015-12-22 Thread Nick Burch

On Wed, 16 Dec 2015, Nick Burch wrote:
> If you want to try more coding, Tim quite often runs Tika against some 
> large filesets, and has a nifty tool to report on what breaks. He can 
> hopefully point you at the most recent report! Maybe have a look through 
> that, identify a few common failures from unidentified or common 
> exceptions, and try to fix one or two of those?


Another one might be TIKA-1817 - it needs two or three new parsers, all 
hopefully fairly straightforward. There'll want to be a text-based one for 
ASCII DXF, likely along the lines of some of the scientific text-based 
formats. There also needs to be a binary one for binary DXF, maybe also able 
to do DXB at the same time. The DWG parser might be a good starting point for 
that, or maybe even could be extended to do those too


Nick