Re: TIKA-1302 blog post

2016-10-06 Thread Mattmann, Chris A (3980)
Hey Tim, yep, let’s add the other ApacheCon presentations from me and Nick. Thanks.

++
Chris Mattmann, Ph.D.
Principal Data Scientist, Engineering Administrative Office (3010)
Manager, Open Source Projects Formulation and Development Office (8212)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++
 

On 10/6/16, 6:15 AM, "Allison, Timothy B."  wrote:

Looks like you beat me to it!  Thank you.

I added my ApacheCon NA 2015 slides and a link to William Palmer's "Tika to 
Ride".  

Should we add other ApacheCon presentations from you and Nick?

> Will try and make progress today.
Great!  Thank you!

From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov]
Sent: Wednesday, October 5, 2016 12:57 PM
To: dev@tika.apache.org
Subject: Re: TIKA-1302 blog post

Tim this is GREAT!

Please link it from the wiki that mentions web resource document links. I 
think:

http://wiki.apache.org/tika/TikaResources

I fell behind on spinning the release. Will try and make progress today.

Chris

++
Chris Mattmann, Ph.D.
Chief Architect, Instrument Software and Science Data Systems Section (398)
Manager, Open Source Projects Formulation and Development Office (8212)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++
 

On 10/4/16, 7:34 AM, "Allison, Timothy B."  wrote:

All,

  In responding to a request for collaboration on file id from Jenny 
Mitcham at IPRES 2016 (http://www.ipres2016.ch/) , I thought it might be useful 
to blog on the TIKA-1302 work.


http://openpreservation.org/blog/2016/10/04/apache-tikas-regression-corpus-tika-1302/

Let me know if I should add/modify anything.

  Cheers,

   Tim






Re: TIKA-1302 blog post

2016-10-05 Thread Mattmann, Chris A (3980)
Tim this is GREAT!

Please link it from the wiki that mentions web resource document links. I think:

http://wiki.apache.org/tika/TikaResources

I fell behind on spinning the release. Will try and make progress today.

Chris

On 10/4/16, 7:34 AM, "Allison, Timothy B."  wrote:

All,

  In responding to a request for collaboration on file id from Jenny 
Mitcham at IPRES 2016 (http://www.ipres2016.ch/) , I thought it might be useful 
to blog on the TIKA-1302 work.


http://openpreservation.org/blog/2016/10/04/apache-tikas-regression-corpus-tika-1302/

Let me know if I should add/modify anything.

  Cheers,

   Tim




Re: Tika 1.14?

2016-09-29 Thread Mattmann, Chris A (3980)
If there aren’t any objections I’ll roll 1.14 this weekend with an RC1 by 
Monday.

On 9/29/16, 8:07 AM, "Allison, Timothy B."  wrote:

I didn't find any showstoppers.  Are we ready for Chris to roll 1.14-rc1?


Some notes:
We're getting quite a few new attachments: 315k (mostly from newly 
recognized mbox, and MSOffice)
New mimetypes: mbox, text/calendar, x-sh, vnd.djvu, dbf, and many more
The upgraded copy of icu4j is misidentifying a handful of files as 
UTF-16[LB]E.
We're missing a small amount of text from custom PPT templates (known issue)
We're getting quite a few new exceptions for attachments that weren't 
formerly extracted.  These are unknown embedded objects that are being 
misidentified as PSD, other image files or TTF. 
We're getting quite a few new exceptions for files that are now correctly 
identified as "x-ms-asx" because they contain invalid xml


-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org] 
Sent: Wednesday, September 28, 2016 1:34 PM
To: dev@tika.apache.org
Subject: RE: Tika 1.14?

All,
  I finished running the regression tests.  I have just started going 
through the results.

Reports are available here:


https://github.com/tballison/share/blob/master/tika_comparisons/reports_1_14-trunk_vs_1_13.zip



-Original Message-
From: Chris Mattmann [mailto:mattm...@apache.org] 
Sent: Thursday, September 22, 2016 12:25 PM
To: dev@tika.apache.org
Subject: Re: Tika 1.14?

Sounds great to me Tim. If you tell me when the tests are done, I’d be 
happy to RC a release!





On 9/21/16, 11:31 AM, "Allison, Timothy B."  wrote:

All,
  PDFBox 2.0.3 is now integrated, I'm about to push the integration 
with POI-3.15.  I have a few cleanup things I'd like to take care of.
  Any other items for 1.14?
  Should we aim for Mon 26th for final code changes for 1.14?  I can 
run the regression tests, and then maybe we could cut the release candidate 
some time mid to end of next week?

   Best,

   Tim









Re: Plans for the first Tika 2.0 release

2016-09-21 Thread Mattmann, Chris A (3980)
NLP/NER is as high a priority to me as the OCR stuff..we have a whole meta 
framework
for doing NER/NLP with NERRecogniser and really cool Tensorflow and other stuff.
Hoping 2.0 can help solve this! ☺

On 9/21/16, 7:40 AM, "Nick Burch"  wrote:

On Mon, 19 Sep 2016, Bob Paulin wrote:
> I think it's a good thing to discuss.  I know there are other features 
> that are targeted for 2.0.  Do we have a general sense of where those 
> features are at?

I think the big one we need to crack is allowing multiple parsers to run 
against a file. OCR is probably the most critical of these from the 
modularisation perspective, with all those nasty interlinkings between the 
parsers to allow the manual delegation. If we can crack the problem of 
multiple parsers, those proxy issues should go away (or at least get 
better!)

As a bonus, it ought to also improve things for error cases (fallback 
parsers etc), but for your needs, the simplification for "ocr + image 
metadata" is likely your biggest win!

(I think it might also let us tidy up some of the enhancement parsers too, 
like how the NLP stuff fits into the parsing framework)

Nick





Re: Query on correct use of 'fileUrl' in TikaJAXRS Server to extract document at remote url - my request is not working

2016-09-14 Thread Mattmann, Chris A (3980)
+1 Great idea Konstantin

On 9/14/16, 8:35 AM, "Konstantin Gribov"  wrote:

+1 to re-enable fileUrl with warning (with CVE ID at least) and at least
special flag to enable it.

IMHO, even better would be to require two flags (something like
`--enable-dangerous-features/--enable-unsecure-features` plus an actual
`--enable-fileurl`, like Sun/Oracle use for commercial features). It will
force the user to think twice before starting tika-server with fileUrl enabled
and clearly state that the server is running in unsecure mode for anyone looking
in ps/htop/initscript/et cetera.
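
A minimal sketch of the two-flag idea, in plain Java. The flag names are the ones floated above; the `FileUrlGuard` class, its method, and the warning text are hypothetical and are not part of tika-server.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical startup guard: fileUrl stays off unless BOTH flags are passed. */
public class FileUrlGuard {
    // Flag names taken from the proposal above; neither is an existing tika-server option.
    public static final String UNSECURE_FLAG = "--enable-unsecure-features";
    public static final String FILEURL_FLAG = "--enable-fileurl";

    /** Returns true only when the umbrella flag and the specific flag are both present. */
    public static boolean fileUrlEnabled(String[] args) {
        List<String> flags = Arrays.asList(args);
        return flags.contains(UNSECURE_FLAG) && flags.contains(FILEURL_FLAG);
    }

    public static void main(String[] args) {
        if (fileUrlEnabled(args)) {
            // A real warning would also point at the relevant CVE, as suggested in the thread.
            System.err.println("WARNING: fileUrl is enabled; the server is running in unsecure mode.");
        } else if (Arrays.asList(args).contains(FILEURL_FLAG)) {
            System.err.println(FILEURL_FLAG + " ignored: " + UNSECURE_FLAG + " is also required.");
        }
    }
}
```

Because both flags appear verbatim on the command line, anyone inspecting the process list can see that the server was started in the unsecure mode.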

On Wed, 14 Sep 2016 at 17:15, Chris Mattmann wrote:

> As long as we have a switch and a warning (and pointer to CVE URL with that
> warning), I’m +1 to re-enable it.
>
> On 9/14/16, 4:40 AM, "Nick Burch"  wrote:
>
> On Wed, 14 Sep 2016, Allison, Timothy B. wrote:
> > Would it be as much of a disaster to require the user to allow the
> > fileUrl capability on the commandline at server startup?  We could add
> > some menacing "all bets are off, we hope you know what you're doing"
> > warning.
>
> With a special switch, and a warning, enabling file:/// again wouldn't be
> too bad in my view.
>
> I'm not sure about arbitrary URLs though - there's the security + dos
> stuff, plus the fact that we won't be doing robots checking / niceness /
> etc. For anyone doing remote URLs, I think they do need to be using a
> proper + safe + server-friendly crawler, then passing the result of a
> successful fetch to the Tika server
>
> >> My main concern in accessing the Tika libraries via TikaJAXRS is the
> >> performance overheads associated with going through sockets (and
> >> possibly the additional memory/file copying of file data if fileUrl is
> >> not available).
> >
> > In my experience, depending on the file types, yes, there's definitely
> > some overhead, but the bottleneck is in the parsers (esp for complex
> > document formats -- msoffice, pdf, etc), not data sloshing.
>
> I agree - for almost all formats, the slow bit isn't byte shuffling, it's
> parsing
>
> Nick
>
>
>
>
> --

Best regards,
Konstantin Gribov





Re: A new Tika App in 2.0?

2016-09-13 Thread Mattmann, Chris A (3980)
I’ll try and comment on this tomorrow. Sorry, it’s been a tough few weeks, really
busy.

On 9/13/16, 8:35 PM, "Bob Paulin"  wrote:

Hey Nick,


Thanks for the thoughts.  Just to clear a few things up.  The version of 
the app on my github does already include all the parsers as the current 
app does.  If you build it and run --list-parsers you'll see them 
there.  As for the desire to quickly test new bits I think much of the 
OSGi stuff has been abstracted away.  For an example see the example 
folder [1].  The only additions are the Activator class (which is 
identical for all the current bundles) and the maven-bundle-plugin in 
the pom.xml. But don't take my word for it why not give it a spin?

As for the use cases I would say consider whenever we upgrade or add 
parsers/detectors/encodingdetectors/languagedetectors we may introduce 
new dependencies or new versions.  For example the pom for the tika-app 
currently pulls in 3 different versions of commons-io, 2 versions of 
commons-codec, 2 versions of Guava. Maven resolves to just one version 
in the final build but the effect is that every part of the code must 
work with the selected version. In the OSGi version of tika-app the 
modules can have different versions of the dependencies within the same 
app.  Also within TIKA-1285 [2] it could have been possible to support 2 
different versions of PDFBox within different OSGi bundles. So I see it 
as more of a gain but I'd be interested in hearing if there is any 
degradation in the development experience.


- Bob


[1] 

https://github.com/bobpaulin/tika-app-osgi/tree/master/examples/dummy-parser-bundle

[2] https://issues.apache.org/jira/browse/TIKA-1285


On 9/13/2016 3:38 PM, Nick Burch wrote:
> On Sun, 11 Sep 2016, Bob Paulin wrote:
>> I'd like to propose a new Tika App for the 2.0 branch.  One of the 
>> reasons we broke apart the Tika parsers into modules was due to the 
>> complexity of having to deal with all the parser dependencies and 
>> transitive dependencies.  Now developers can use just the modules 
>> they want without pulling the kitchen sink with it.  Unfortunately 
>> this approach doesn't simplify the problem in the tika-parser or 
>> tika-app project where the whole kitchen sink comes together again.
>
> One of the nice things about the tika app (and server) is you do get 
> everything, so it's very easy to test and get started with!
>
> Another nice thing is that you can test small changes (eg a new parser 
> or a new mime type) quite quickly, just by using the tika app jar on 
> your classpath along with your customisation. Makes it very easy to 
> try out new things if you're a new developer, and I usually find it 
> easier than firing up Eclipse if I just want to try a new mime type 
> change for someone.
>
>
> More modular versions of the Tika server I could certainly get behind, 
> if we haven't already done so!
>
> For the app, are there that many use cases for it where you might only 
> want some of Tika? (Most people calling Tika from another language 
> would likely be better off with the server, to avoid the JVM 
> start/stop overhead).
>
> Would the new osgi version make it harder for people to test new bits 
> with tika? For one example, whenever we've done a hackathon and are 
> helping people with a new parser, helping them get their new parser 
> used with just the app is about do-able. I fear if we made them also 
> learn osgi + build a bundle, at that stage when they're trying to do a 
> "hello world", we'd lose them :/
>
> The github project does look interesting though! I'd hate for us to 
> get a few shiny new bits, but lose some key bits important for 
> newbies / quick-win developers in the process though...
>
> Nick
>






Re: Can't get Tensorflow REST recognizer to work

2016-08-14 Thread Mattmann, Chris A (3980)
Fixed!

Finally fixed it! Two issues:

1. Needed startDocument and endDocument in the handler. That fixed the JSON and 
in turn ended up fixing the REST and script-based Tensorflow calls.
2. The problem that comes up often (but is still undocumented; we need to fix 
that!): you can't concurrently modify the metadata object while doing the 
ContentHandler work. You have to have an ImmutableMetadata object by the time 
you do the ContentHandler work.

I'm going to do a few more tests then get this committed! Great work 
@thammegowda. Overall this is an amazing contribution; it will be awesome for 
Tika users!
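
The two issues above can be sketched with the SAX classes that ship with the JDK. This is only an illustration under stated assumptions: `MetadataXhtmlSketch` and its methods are hypothetical names, Tika's own `Metadata` and handler classes are deliberately not used, and an unmodifiable `Map` stands in for the immutable metadata.

```java
import java.io.StringWriter;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

public class MetadataXhtmlSketch {

    /** Emits one <li> per metadata entry, bracketed by startDocument/endDocument. */
    public static void emit(Map<String, String> metadata, ContentHandler handler)
            throws SAXException {
        // Freeze a snapshot first so nothing mutates the metadata while events are written.
        Map<String, String> frozen =
                Collections.unmodifiableMap(new LinkedHashMap<>(metadata));
        AttributesImpl noAttrs = new AttributesImpl();

        handler.startDocument(); // the call that was missing
        handler.startElement("", "ul", "ul", noAttrs);
        for (Map.Entry<String, String> e : frozen.entrySet()) {
            String text = e.getKey() + ": " + e.getValue();
            handler.startElement("", "li", "li", noAttrs);
            handler.characters(text.toCharArray(), 0, text.length());
            handler.endElement("", "li", "li");
        }
        handler.endElement("", "ul", "ul");
        handler.endDocument(); // and its counterpart
    }

    /** Serializes the SAX events to a string with the JDK's identity transformer. */
    public static String render(Map<String, String> metadata) {
        try {
            SAXTransformerFactory factory =
                    (SAXTransformerFactory) TransformerFactory.newInstance();
            TransformerHandler handler = factory.newTransformerHandler();
            handler.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            handler.setResult(new StreamResult(out));
            emit(metadata, handler);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> md = new LinkedHashMap<>();
        md.put("OBJECT", "Egyptian cat (0.09168)");
        System.out.println(render(md));
    }
}
```

Copying the metadata before the first SAX event is the design choice that matters here: whatever is written to the handler reflects one consistent snapshot.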

On 8/14/16, 10:15 AM, "Chris Mattmann"  wrote:

Hi Devs,

Here’s what I’m seeing in TIKA-1993 and 1508, which I would love to finish 
today.

1. Tensorflow python script works great.
2. Tensorflow REST service – Docker container works (had to upgrade Docker 
to latest)
3. Tensorflow REST service – Tika parser metadata works great.
4. Tensorflow REST service – Tika XHTML won’t print or work.

I can’t get the XHTML to print with the tika app –x flag (though –m 
produces the following):

LMC-053601:tika1.14 mattmann$ java -cp 
tika-app/target/tika-app-1.14-SNAPSHOT.jar org.apache.tika.cli.TikaCLI 
--config=tika-parsers/src/test/resources/org/apache/tika/parser/recognition/tika-config-tflow-rest.xml
 -m tika-parsers/src/test/resources/test-documents/testJPEG.jpg
INFO  Available = true, API Status = HTTP/1.0 200 OK
INFO  minConfidence = 0.015, topN=7
INFO  Recogniser = 
org.apache.tika.parser.recognition.tf.TensorflowRESTRecogniser
INFO  Recogniser Available = true
Content-Length: 7686
Content-Type: image/jpeg
OBJECT: Egyptian cat (0.09168)
OBJECT: Border collie (0.07553)
OBJECT: bluetick (0.06043)
OBJECT: collie (0.02982)
OBJECT: English foxhound (0.02759)
OBJECT: Siamese cat, Siamese (0.02053)
OBJECT: tabby, tabby cat (0.01826)
X-Parsed-By: org.apache.tika.parser.CompositeParser
X-Parsed-By: org.apache.tika.parser.recognition.ObjectRecognitionParser
org.apache.tika.parser.recognition.object.rec.impl: 
org.apache.tika.parser.recognition.tf.TensorflowRESTRecogniser
resourceName: testJPEG.jpg
LMC-053601:tika1.14 mattmann$ 

Thoughts? @Thamme?

Cheers,
Chris







Re: Tika 1.14?

2016-08-11 Thread Mattmann, Chris A (3980)
Sounds good to me

On 8/11/16, 11:59 AM, "Allison, Timothy B."  wrote:

All,
  Any interest in a Tika 1.14 release in a few weeks, say first week of 
September?  I'd like to test and integrate POI 3.15-beta3 which should be out 
fairly soon.  Any other blockers or wishes?

 Cheers,

Tim





Re: Your project VM needs to be migrated.

2016-07-17 Thread Mattmann, Chris A (3980)
Thanks Gav, I replied on the INFRA ticket.
++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++










On 7/16/16, 7:41 PM, "Gav"  wrote:

>Hello Tika devs!
>
>You guys have a VM 'tika-vm.apache.org' which is housed on old hardware due
>to be turned off.
>
>Therefore your VM needs to either be migrated to our cloud infrastructure
>and integrated into our puppet management environment or just turned off if
>you no longer require it.
>
>Could we have someone from your project who has access to the VM please
>liaise with
>ASF Infra in planning and executing the migration and/or shutdown of your
>VM.
>
>There is an INFRA Jira ticket at :-
>
>https://issues.apache.org/jira/browse/INFRA-12288
>
>so please respond to either that ticket, infrastructure@ mailing list
>and/or HipChat #asfinfra room. (infra.chat). Note I am not subscribed to
>the tika dev list at this time.
>
>I hope that with your assistance we can have this done within a month.
>
>Thanks and I look forward to working with you on the migration.
>
>Gav... (ASF Infra.)


Re: Sentiment Analysis Parser updates

2016-07-06 Thread Mattmann, Chris A (3980)
Great work!

On 7/6/16, 2:06 PM, "Anastasija Mensikova"  
wrote:

>Hi everyone,
>
>Here are some updates on Sentiment Analysis Parser.
>
>As you might know, during the last week I created a pull request to OpenNLP,
>and I still have a few things to fix there, but the work is happening right
>now for it behind the scenes. I have also used Stanford Sentiment Treebank
>to create a new, categorical, labeled dataset to train on to have more than
>two categories (or Facebook-like categories) for sentiment analysis. Of
>course, this categorical sentiment analysis is not perfect yet, but it is
>up and working. I have also started working on our own SentimentEvaluator
>and SentimentCrossValidator, which will hopefully be done soon.
>My next goal is to, of course, finish the Evaluator and CrossValidator and
>use my new categorical output to create more D3 graphs on our GitHub page.
>
>Have a great day/night!
>
>Thank you,
>Anastasija.


Re: TIKA-1164

2016-07-04 Thread Mattmann, Chris A (3980)
Hi Samuel I am forwarding your email to dev@tika.a.o and moving
dev-owner@t.a.o to BCC.

On 7/4/16, 8:41 AM, "scatherine@gouv.mc"  wrote:

>Hi,
>
>I use Tika to detect the MediaType and I have the same problem as in JIRA issue 
>TIKA-1164
>https://issues.apache.org/jira/browse/TIKA-1164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
> 
>But I use version 1.13. How can I solve this problem, please?
>
>MediaType mediaType = null;
>Metadata md = new Metadata();
>md.set(Metadata.RESOURCE_NAME_KEY, fileName);
>Detector detector = TikaConfig.getDefaultConfig().getDetector();
>
>try {
>    mediaType = detector.detect(TikaInputStream.get(content), md);
>} catch (IOException e) {
>    mediaType = null;
>}
>
>The content size (content.available()) changes between before and after the 
>detect call.
>
>Regards,
>
>Samuel Catherine
>
>
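
One way to picture the reported symptom, sketched with JDK streams only: code that peeks at the first bytes needs `mark`/`reset`, and a buffering wrapper reads ahead from the original stream, so `available()` on the original drops even though the wrapper can still be rewound. The `peek` helper below is a hypothetical stand-in for a detector, and whether this is the exact cause in the report above is an assumption.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;

public class PeekSketch {

    /** Hypothetical stand-in for a detector: reads up to n bytes, then rewinds. */
    public static byte[] peek(InputStream in, int n) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(n);
        byte[] buf = new byte[n];
        int len = in.read(buf, 0, n);
        in.reset(); // rewinds the wrapper; the underlying stream is NOT rewound
        return Arrays.copyOf(buf, Math.max(len, 0));
    }

    /** Returns {raw.available(), wrapped.available()} after one peek. */
    public static int[] demo() {
        try {
            InputStream raw = new ByteArrayInputStream(new byte[100]);
            InputStream wrapped = new BufferedInputStream(raw); // buffers ahead of the caller
            peek(wrapped, 8);
            return new int[] {raw.available(), wrapped.available()};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int[] r = demo();
        System.out.println("raw=" + r[0] + " wrapped=" + r[1]);
    }
}
```

With the JDK's default buffer size, the 100-byte raw stream reports 0 bytes available after the peek while the wrapper still reports all 100. The analogous approach with Tika would be to wrap `content` once and keep reading from that wrapper afterwards, not from the original stream.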


Tika-Python: parsing PDFs and showing analytics

2016-06-30 Thread Mattmann, Chris A (3980)
Great Blog post by Clinton Brownley today:

 


If you haven’t had a chance to check out tika-python [1], I 
recommend doing so! Would also appreciate any feedback or 
stars! 

Cheers,
Chris

[1] http://github.com/chrismattmann/tika-python/








Re: Sentiment Analysis Parser updates

2016-06-28 Thread Mattmann, Chris A (3980)
Thanks William, this is a great idea. I will discuss it with 
Anastasija tomorrow.


Cheers,
Chris


On 6/28/16, 12:01 PM, "William Colen"  wrote:

>Hi,
>
>I tried your code. Very good work so far! Congratulations.
>
>Is the examples/result file corrupted? It has only one line.
>
>Do you plan to implement a simple CLI to use it interactively from command
>line, similar to
>
>bin/opennlp Doccat
>bin/opennlp TokenNameFinder
>
>?
>
>Also, do you plan to add evaluation tools by extending
>AbstractEvaluatorTool and AbstractCrossValidatorTool, as well as the
>listener EvaluationErrorPrinter? I found these tools very useful while I am
>developing new models and features, maybe you would find it useful as well.
>
>You could also check the DoccatFineGrainedReportListener as a start point
>to create a confusion matrix (I think it would be easy because Doccat data
>structures are similar to yours).
>
>The result would look like the following (this is a 300-entry Portuguese
>corpus I am building from Facebook messages):
>
>
>=== Evaluation summary ===
>  Number of documents:    298
>    Min sentence size:      1
>    Max sentence size:    463
>Average sentence size:  18,01
>     Categories count:      4
>             Accuracy: 61,41%
>
>=== Detailed Accuracy By Tag ===
>
>-------------------------------------------------------------------------
>|      Tag | Errors |  Count |   % Err | Precision | Recall | F-Measure |
>-------------------------------------------------------------------------
>|  neutral |     46 |     56 |   0,821 |     0,588 |  0,179 |     0,274 |
>| positive |     46 |     70 |   0,657 |     0,48  |  0,343 |     0,4   |
>| negative |     18 |    167 |   0,108 |     0,651 |  0,892 |     0,753 |
>|     spam |      5 |      5 |   1     |     0     |  0     |     0     |
>-------------------------------------------------------------------------
>
>=== Confusion matrix ===
>
>     a     b     c     d | Accuracy | <-- classified as
> <149>    13     4     1 |   89,22% |   a = negative
>    42   <24>    3     1 |   34,29% |   b = positive
>    35    11   <10>    . |   17,86% |   c = neutral
>     3     2     .    <.>|       0% |   d = spam
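
The summary figures in this report can be cross-checked from the confusion matrix alone. A short sketch follows; the class and method names are hypothetical and are not OpenNLP API. Rows are taken to be the reference class and columns the predicted class, in the order negative, positive, neutral, spam, with `.` read as 0.

```java
public class ConfusionMatrixSketch {
    // Rows: reference class; columns: predicted class (negative, positive, neutral, spam).
    public static final int[][] MATRIX = {
        {149, 13,  4, 1},
        { 42, 24,  3, 1},
        { 35, 11, 10, 0},
        {  3,  2,  0, 0},
    };

    /** Share of all documents that sit on the diagonal. */
    public static double accuracy(int[][] m) {
        int correct = 0;
        int total = 0;
        for (int i = 0; i < m.length; i++) {
            for (int j = 0; j < m[i].length; j++) {
                total += m[i][j];
                if (i == j) {
                    correct += m[i][j];
                }
            }
        }
        return (double) correct / total;
    }

    /** Diagonal cell divided by its row sum (all documents that truly belong to class k). */
    public static double recall(int[][] m, int k) {
        int rowSum = 0;
        for (int j = 0; j < m[k].length; j++) {
            rowSum += m[k][j];
        }
        return rowSum == 0 ? 0.0 : (double) m[k][k] / rowSum;
    }

    /** Diagonal cell divided by its column sum (all documents predicted as class k). */
    public static double precision(int[][] m, int k) {
        int colSum = 0;
        for (int[] row : m) {
            colSum += row[k];
        }
        return colSum == 0 ? 0.0 : (double) m[k][k] / colSum;
    }

    public static void main(String[] args) {
        System.out.printf("accuracy=%.4f%n", accuracy(MATRIX));
        System.out.printf("negative: precision=%.3f recall=%.3f%n",
                precision(MATRIX, 0), recall(MATRIX, 0));
    }
}
```

For this matrix the overall accuracy is 183/298, which agrees with the reported 61,41%, and the negative class gets precision 149/229 and recall 149/167, in line with the 0,651 and 0,892 in the per-tag table.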
>
>
>
>
>Regards,
>William
>
>2016-06-23 2:11 GMT-03:00 Mattmann, Chris A (3980) <
>chris.a.mattm...@jpl.nasa.gov>:
>
>> Thank you Jason!
>>
>> On 6/22/16, 8:41 PM, "Jason Baldridge"  wrote:
>>
>> >Anastasija,
>> >
>> >There might be a few appropriate sentiment datasets listed in my homework
>> >on Twitter sentiment analysis:
>> >
>> >https://github.com/utcompling/applied-nlp/wiki/Homework5
>> >
>> >There may also be some useful data sets in the Crowdflower Open Data
>> >collection:
>> >
>> >https://www.crowdflower.com/data-for-everyone/
>> >
>> >Hope this helps!
>> >
>> >-Jason
>> >
>> >On Wed, 22 Jun 2016 at 15:59 Anastasija Mensikova <
>> >mensikova.anastas...@gmail.com> wrote:
>> >
>> >> Hi everyone,
>> >>
>> >> Some updates on our Sentiment Analysis Parser work.
>>

Re: Metadata key for "original file location/name"?

2016-06-27 Thread Mattmann, Chris A (3980)
Tim:

+1 to TikaCoreProperties.ORIGINAL_RESOURCE_NAME being mapped
to:  

X-TIKA:origResourceName

Sound good?

Cheers,
Chris

On 6/27/16, 10:08 AM, "Allison, Timothy B."  wrote:

>All,
>  Some file formats store the original file location or file name for the file 
> itself or embedded documents.  What metadata key should we use for this info? 
>  DublinCore's Identifier is vaguely on the right track, but not at all 
> appropriate.
>
>TikaCoreProperties.ORIGINAL_RESOURCE_NAME?
>
>Should we create a tika-based namespace for this metadata element?  (If so, 
>let's use that name space for TIKA-1759)
>
>  Thank you.
>
>  Best,
>
>   Tim
>


Re: regression corpus/vm discussions

2016-06-23 Thread Mattmann, Chris A (3980)
dev@tika is a great place, +1 from me.

On 6/23/16, 7:12 AM, "Allison, Timothy B."  wrote:

>All,
>  Dominik Stadler (Apache POI committer, Common Crawl tamer...among many other 
> things) has recently started working on our vm.
>  I'd like to discuss document selection/allocation/removal and other vm 
> related things publicly.  This list already gets plenty of blather from me.  
> Is it ok if we use the dev@tika list with [vm] in the Subject line for these 
> discussions or should we look for a different venue?
>
>Thank you.
>
>  Best,
>
>Tim


Re: Sentiment Analysis Parser updates

2016-06-22 Thread Mattmann, Chris A (3980)
Thank you Jason!

On 6/22/16, 8:41 PM, "Jason Baldridge"  wrote:

>Anastasija,
>
>There might be a few appropriate sentiment datasets listed in my homework
>on Twitter sentiment analysis:
>
>https://github.com/utcompling/applied-nlp/wiki/Homework5
>
>There may also be some useful data sets in the Crowdflower Open Data
>collection:
>
>https://www.crowdflower.com/data-for-everyone/
>
>Hope this helps!
>
>-Jason
>
>On Wed, 22 Jun 2016 at 15:59 Anastasija Mensikova <
>mensikova.anastas...@gmail.com> wrote:
>
>> Hi everyone,
>>
>> Some updates on our Sentiment Analysis Parser work.
>>
>> You might have noticed, I have enhanced our website (the GH page) recently,
>> polished it and made it more user-friendly. My next step will be sending a
>> pull request to Tika. However, my main goal until the end of Google Summer
>> of Code is to enhance the parser in a way that will allow it to work
>> categorically (in other words, the sentiment determined won't be just
>> positive or negative, it will have a few categories). This means that my
>> next step is to look for a categorical open data set (which I will
>> hopefully do by the end of the weekend at the latest) and, of course, enhance
>> my model and training. After that I will look into how the confidence
>> levels can be increased.
>>
>> Have a great day/night!
>>
>> Thank you,
>> Anastasija Mensikova.
>>


Re: Sentiment Analysis Parser updates

2016-06-22 Thread Mattmann, Chris A (3980)
Great work Anastasija!

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++

On 6/22/16, 1:55 PM, "Anastasija Mensikova"  
wrote:

>Hi everyone,
>
>Some updates on our Sentiment Analysis Parser work.
>
>You might have noticed, I have enhanced our website (the GH page) recently,
>polished it and made it more user-friendly. My next step will be sending a
>pull request to Tika. However, my main goal until the end of Google Summer
>of Code is to enhance the parser in a way that will allow it to work
>categorically (in other words, the sentiment determined won't be just
>positive or negative, it will have a few categories). This means that my
>next step is to look for a categorical open data set (which I will
>hopefully do by the end of the weekend at the latest) and, of course, enhance
>my model and training. After that I will look into how the confidence
>levels can be increased.
>
>Have a great day/night!
>
>Thank you,
>Anastasija Mensikova.


Re: Sentiment Analysis Parser updates

2016-06-17 Thread Mattmann, Chris A (3980)
Great update Anastasija!

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++

On 6/17/16, 2:28 PM, "Anastasija Mensikova"  
wrote:

>Hello everyone,
>
>Some updates on my work on the Sentiment Analysis Parser.
>
>As you know, I have finished a basic version of the parser, and I'm
>currently working on going through the results of the parser run on the gun
>ads the right way so I can easily build graphs to illustrate how it all
>works.
>As you probably noticed, I changed some parts of the parser allowing it to
>output the data in JSON. I have also worked on creating scripts (not on
>GitHub) that load the 100 random gun ads, perform sentiment analysis on
>them using the parser and output the data needed for the graph. Using the
>output I received I have already managed to build two graphs using D3: one
>solely on the distribution of sentiment among the gun ads, and the other
>one on the distribution of sentiment of the gun ads in the countries (where
>the guns were made) presented, which you can all see on our GitHub page.
>
>I hope you have a great weekend!
>
>Thank you,
>Anastasija.
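
The step from parser output to graph data described above can be sketched roughly as follows. This is only an illustration, not the actual (unpublished) scripts: the function name, the "sentiment" and "country" field names, and the sample records are all assumptions made up for the example.

```python
import json
from collections import Counter


def sentiment_distribution(records, key="sentiment"):
    """Count how many records carry each sentiment label.

    Returns a list of {"label", "count"} dicts sorted by label, a shape
    that a D3 bar chart can typically consume directly.
    """
    counts = Counter(record[key] for record in records)
    return [{"label": label, "count": n} for label, n in sorted(counts.items())]


# Hypothetical parser output for three ads (illustrative only, not real data).
records = [
    {"country": "US", "sentiment": "positive"},
    {"country": "DE", "sentiment": "negative"},
    {"country": "US", "sentiment": "positive"},
]

# Serialize the aggregated counts as JSON for the charting code to load.
print(json.dumps(sentiment_distribution(records)))
```

Grouping by country first and calling the same function per group would give the per-country distribution.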


Re: About tika-python error

2016-06-11 Thread Mattmann, Chris A (3980)
Thank you! Please note too that when you submit the PR, TravisCI will
automatically run the unit tests for the project.

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++

On 6/10/16, 11:03 PM, "Rakesh Kumar"  wrote:

>Hi, I just saw the mail. I will do it today, as I want to run some tests before
>I push or create a pull request.
>
>On Fri, Jun 10, 2016 at 11:19 PM, Mattmann, Chris A (3980)
> wrote:
>
>Hi Rakesh,
>
>Got it. Can you please submit a PR for your simple fix? I’ll happily
>credit you and it would be great for the Tika community.
>
>Thanks!
>
>-C
>
>++
>Chris Mattmann, Ph.D.
>Chief Architect
>Instrument Software and Science Data Systems Section (398)
>NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>Office: 168-519, Mailstop: 168-527
>Email: chris.a.mattm...@nasa.gov
>WWW:  http://sunset.usc.edu/~mattmann/
>++
>Director, Information Retrieval and Data Science Group (IRDS)
>Adjunct Associate Professor, Computer Science Department
>University of Southern California, Los Angeles, CA 90089 USA
>WWW: http://irds.usc.edu/
>++
>
>On 6/10/16, 9:11 AM, "Rakesh Kumar"  wrote:
>
>>No, I corrected the error; it was a small mistake on your side. The problem was
>>the trailing forward slash after the URL. Please see the explanation below.
>>
>>http://www.assignmenthelp.net/contactus will be saved as /Temp/contactus
>>http://www.assignmenthelp.net/abc.pdf will be saved as /Temp/abc.pdf
>>
>>What about http://www.assignmenthelp.net/ ? You are trying to save it as
>>/tmp/, hence the error.
>>However, if there is no "/" at the end of the URL, i.e.
>>http://www.assignmenthelp.net, then you try to save it as
>>\Temp/www.assignmenthelp.net
>>
>>Hence the small correction was to remove the "/" if it is there at the end of
>>the URL.
>>
>>The rest is OK.
>>
>>
>>On Fri, Jun 10, 2016 at 8:38 PM, Mattmann, Chris A (3980)
>> wrote:
>>
>>[moved to dev@tika.a.o list please follow replies there.]
>>
>>Rakesh - looks like you don’t have permissions to write to
>>your temp dir on Windows. Can you confirm that’s the case?
>>
>>++
>>Chris Mattmann, Ph.D.
>>Chief Architect
>>Instrument Software and Science Data Systems Section (398)
>>NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>Office: 168-519, Mailstop: 168-527
>>Email: chris.a.mattm...@nasa.gov
>>WWW:  http://sunset.usc.edu/~mattmann/
>>++
>>Director, Information Retrieval and Data Science Group (IRDS)
>>Adjunct Associate Professor, Computer Science Department
>>University of Southern California, Los Angeles, CA 90089 USA
>>WWW: http://irds.usc.edu/
>>++
>>
>>On 6/10/16, 2:01 AM, "Rakesh Kumar"  wrote:
>>
>>>Hi, when I try to extract content from a URL, it prints an error:
>>>
>>>tika.py: Retrieving http://www.assignmenthelp.net/ to
>>>C:\Users\Rakesh\AppData\Local\Temp/
>>>Traceback (most recent call last):
>>>File "C:\Users\Rakesh\Anac

Re: About tika-python error

2016-06-10 Thread Mattmann, Chris A (3980)
Hi Rakesh,

Got it. Can you please submit a PR for your simple fix? I’ll happily
credit you and it would be great for the Tika community.

Thanks!

-C

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++

On 6/10/16, 9:11 AM, "Rakesh Kumar"  wrote:

>No, I corrected the error; it was a small mistake on your side. The problem was
>the trailing forward slash after the URL. Please see the explanation below.
>
>http://www.assignmenthelp.net/contactus will be saved as /Temp/contactus
>http://www.assignmenthelp.net/abc.pdf will be saved as /Temp/abc.pdf
>
>What about http://www.assignmenthelp.net/ ? You are trying to save it as
>/tmp/, hence the error.
>However, if there is no "/" at the end of the URL, i.e.
>http://www.assignmenthelp.net, then you try to save it as
>\Temp/www.assignmenthelp.net
>
>Hence the small correction was to remove the "/" if it is there at the end of
>the URL.
>
>The rest is OK.
>
>
>On Fri, Jun 10, 2016 at 8:38 PM, Mattmann, Chris A (3980)
> wrote:
>
>[moved to dev@tika.a.o list please follow replies there.]
>
>Rakesh - looks like you don’t have permissions to write to
>your temp dir on Windows. Can you confirm that’s the case?
>
>++
>Chris Mattmann, Ph.D.
>Chief Architect
>Instrument Software and Science Data Systems Section (398)
>NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>Office: 168-519, Mailstop: 168-527
>Email: chris.a.mattm...@nasa.gov
>WWW:  http://sunset.usc.edu/~mattmann/
>++
>Director, Information Retrieval and Data Science Group (IRDS)
>Adjunct Associate Professor, Computer Science Department
>University of Southern California, Los Angeles, CA 90089 USA
>WWW: http://irds.usc.edu/
>++
>
>On 6/10/16, 2:01 AM, "Rakesh Kumar"  wrote:
>
>>Hi, when I try to extract content from a URL, it prints an error:
>>
>>tika.py: Retrieving http://www.assignmenthelp.net/ to
>>C:\Users\Rakesh\AppData\Local\Temp/
>>Traceback (most recent call last):
>>File "C:\Users\Rakesh\Anaconda3\lib\site-packages\tika\tika.py", line 368, in 
>>getRemoteFile
>>urlretrieve(urlOrPath, destPath)
>>File "C:\Users\Rakesh\Anaconda3\lib\urllib\request.py", line 197, in 
>>urlretrieve
>>tfp = open(filename, 'wb')
>>PermissionError: [Errno 13] Permission denied: 
>>'C:\Users\Rakesh\AppData\Local\Temp/'
>>During handling of the above exception, another exception occurred:
>>Traceback (most recent call last):
>>File "prg.py", line 7, in
>>parsed = parser.from_file(fileUrl, tikServer)
>>File "C:\Users\Rakesh\Anaconda3\lib\site-packages\tika\parser.py", line 25, 
>>in from_file
>>jsonOutput = parse1('all', filename, serverEndpoint)
>>File "C:\Users\Rakesh\Anaconda3\lib\site-packages\tika\tika.py", line 184, in 
>>parse1
>>path, file_type = getRemoteFile(urlOrPath, TikaFilesPath)
>>File "C:\Users\Rakesh\Anaconda3\lib\site-packages\tika\tika.py", line 378, in 
>>getRemoteFile
>>urlretrieve(urlOrPath, destPath)
>>File "C:\Users\Rakesh\Anaconda3\lib\urllib\request.py", line 197, in 
>>urlretrieve
>>tfp = open(filename, 'wb')
>>PermissionError: [Errno 13] Permission denied: 
>>'C:\Users\Rakesh\AppData\Local\Temp/'
>>
>>
>
>
>
>
>
>
>
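
The trailing-slash problem Rakesh describes in this thread can be illustrated with a short sketch. This is not the actual tika-python code: dest_path_for is a hypothetical helper, and falling back to the host name when the URL has no path segment is an assumption based on the "\Temp/www.assignmenthelp.net" case in his explanation. The underlying issue is that a URL ending in "/" yields an empty file name, so the destination is the temp directory itself, which is consistent with the PermissionError in the traceback.

```python
import os
from urllib.parse import urlparse


def dest_path_for(url, dest_dir):
    """Hypothetical helper (not tika-python's code): pick a local file
    path inside dest_dir for a file downloaded from url."""
    parsed = urlparse(url)
    # Strip any trailing "/" so the last path segment is never empty.
    name = os.path.basename(parsed.path.rstrip("/"))
    if not name:
        # No path segment at all (e.g. "http://host/"): use the host name.
        name = parsed.netloc
    return os.path.join(dest_dir, name)


for url in (
    "http://www.assignmenthelp.net/abc.pdf",
    "http://www.assignmenthelp.net/",
    "http://www.assignmenthelp.net",
):
    print(url, "->", dest_path_for(url, "Temp"))
```

With this approach the URL with and without the trailing slash map to the same file name, so the download never targets the directory itself.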


Re: About tika-python error

2016-06-10 Thread Mattmann, Chris A (3980)
[moved to dev@tika.a.o list please follow replies there.]

Rakesh - looks like you don’t have permissions to write to 
your temp dir on Windows. Can you confirm that’s the case?

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++

On 6/10/16, 2:01 AM, "Rakesh Kumar"  wrote:

>Hi, when I try to extract content from a URL, it prints an error:
>
>
>
>
>tika.py: Retrieving http://www.assignmenthelp.net/ to 
>C:\Users\Rakesh\AppData\Local\Temp/
>Traceback (most recent call last):
>File "C:\Users\Rakesh\Anaconda3\lib\site-packages\tika\tika.py", line 368, in 
>getRemoteFile
>urlretrieve(urlOrPath, destPath)
>File "C:\Users\Rakesh\Anaconda3\lib\urllib\request.py", line 197, in 
>urlretrieve
>tfp = open(filename, 'wb')
>PermissionError: [Errno 13] Permission denied: 
>'C:\Users\Rakesh\AppData\Local\Temp/'
>During handling of the above exception, another exception occurred:
>Traceback (most recent call last):
>File "prg.py", line 7, in 
>parsed = parser.from_file(fileUrl, tikServer)
>File "C:\Users\Rakesh\Anaconda3\lib\site-packages\tika\parser.py", line 25, in 
>from_file
>jsonOutput = parse1('all', filename, serverEndpoint)
>File "C:\Users\Rakesh\Anaconda3\lib\site-packages\tika\tika.py", line 184, in 
>parse1
>path, file_type = getRemoteFile(urlOrPath, TikaFilesPath)
>File "C:\Users\Rakesh\Anaconda3\lib\site-packages\tika\tika.py", line 378, in 
>getRemoteFile
>urlretrieve(urlOrPath, destPath)
>File "C:\Users\Rakesh\Anaconda3\lib\urllib\request.py", line 197, in 
>urlretrieve
>tfp = open(filename, 'wb')
>PermissionError: [Errno 13] Permission denied: 
>'C:\Users\Rakesh\AppData\Local\Temp/'
>
>


Re: Profiler for OpenNLP

2016-06-07 Thread Mattmann, Chris A (3980)
We would love to have this as part of Apache Tika. You can take a look
at the existing NER/NLP integrations, such as GeoTopicParser, as
an example, and yes, please file a JIRA issue:

http://issues.apache.org/jira/browse/TIKA 

I would be happy to work with you to make it happen.

See: http://github.com/apache/tika/#contributing-via-github 

For guidance.

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++

On 6/7/16, 9:36 AM, "Anthony Beylerian"  wrote:

>Hello,
>
>We are currently working on an experimental author profiler that we think
>could be added to the toolkit.
>
>The profiler aims to detect the gender and age range of an author.
>Later we hope to add personality aspects such as:
>[extroverted, stable, agreeable, conscientious]
>
>We would like the team's opinion on the matter.
>An initial code drop can be found here [1]. If anyone is willing to
>contribute or collaborate on it with us, please let us know.
>
>Thanks!
>
>[1] https://github.com/beylerian/profiler


Re: Tika 2.0 Migration Guide

2016-05-20 Thread Mattmann, Chris A (3980)
great work Bob!

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++

On 5/19/16, 5:24 AM, "Allison, Timothy B."  wrote:

>Thank you, Bob!
>
>Side note: To be filed under "eating our own dog food" and "yearning for 2.0",
>I was just playing with SolrJ and spent longer removing parsers with 
>conflicting dependencies than it took to code the SolrJ integration with 
>tika-batch.
>
>-----Original Message-----
>From: Bob Paulin [mailto:b...@apache.org] 
>Sent: Tuesday, May 17, 2016 11:04 PM
>To: dev@tika.apache.org
>Subject: Tika 2.0 Migration Guide
>
>Hi,
>
>Started to add some content to the migration guide (thanks for creating it, Tim
>Allison!) now that we've got some folks that are pulling 2.x into test
>projects.  Please review and let me know if there are any questions or 
>omissions.  Thanks!
>
>https://wiki.apache.org/tika/Tika2_0MigrationGuide
>
>- Bob
>


Re: GSoC 2016: OpenNLP Sentiment Analysis: Status Update

2016-05-19 Thread Mattmann, Chris A (3980)
Hi Chen,

Sorry, this should have gone to the Tika lists, my bad!

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++

On 5/18/16, 11:33 PM, "Chen Li"  wrote:

>Just curious, how is this task related to AsterixDB?
>
>
>
>On Wed, May 18, 2016 at 8:57 AM, Mattmann, Chris A (3980) <
>chris.a.mattm...@jpl.nasa.gov> wrote:
>
>> Hi Everyone,
>>
>> Anastasija and I met this morning. Here are her next steps:
>>
>>
>> 0. Completed learning, installing and using GeoTopicParser in Apache Tika
>> 1. Learning about Movie Review Dataset (labeled data, yay!)
>> 2. Try and build OpenNLP model for that
>>
>> She and I will meet again next week and report progress.
>>
>> Cheers,
>> Chris
>>
>> ++
>> Chris Mattmann, Ph.D.
>> Chief Architect
>> Instrument Software and Science Data Systems Section (398)
>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> Office: 168-519, Mailstop: 168-527
>> Email: chris.a.mattm...@nasa.gov
>> WWW:  http://sunset.usc.edu/~mattmann/
>> ++
>> Director, Information Retrieval and Data Science Group (IRDS)
>> Adjunct Associate Professor, Computer Science Department
>> University of Southern California, Los Angeles, CA 90089 USA
>> WWW: http://irds.usc.edu/
>> ++
>>
>> On 4/26/16, 12:23 PM, "Rodrigo Agerri"  wrote:
>>
>> >Hello,
>> >
>> >Everything looks very interesting.  Other options are the Aspect Based
>> >Sentiment Analysis tasks as described in
>> >
>> >http://alt.qcri.org/semeval2014/task4/
>> >http://alt.qcri.org/semeval2015/task12/
>> >http://alt.qcri.org/semeval2016/task5/
>> >
>> >The task is well circumscribed plus data is publicly available, which
>> >is good to try and make manageable objectives for a GSOC.
>> >
>> >Best,
>> >
>> >Rodrigo
>> >
>> >
>> >
>> >On Tue, Apr 26, 2016 at 6:10 PM, Anthony Beylerian
>> > wrote:
>> >> Please check this approach [1] it could be useful to combine
>> >> a labeled seed set with unlabeled Fisher CallHome.
>> >> Since it may be a long read, there's a shorter ppt as well [2]
>> >>
>> >> [1] link.springer.com/article/10.1023%2FA%3A1007692713085
>> >> [2] cseweb.ucsd.edu/~atsmith/presentation_final.ppt
>> >>
>> >>
>> >> On Tue, Apr 26, 2016 at 11:36 PM, Joern Kottmann 
>> wrote:
>> >>
>> >>> The Large Movie Review Dataset might be interesting for this as well:
>> >>> http://ai.stanford.edu/~amaas/data/sentiment/
>> >>>
>> >>> Jörn
>> >>>
>> >>> On Tue, Apr 26, 2016 at 4:26 PM, Anthony Beylerian <
>> >>> anthony.beyler...@gmail.com> wrote:
>> >>>
>> >>> > sentiment analysis discussion doc :
>> >>> >
>> >>> >
>> >>> >
>> >>>
>> https://docs.google.com/document/d/1Gi59YqtisY4NLaVY3B7CNLMTgCRZm9JEk17kmBmWXqQ/edit?usp=sharing
>> >>> >
>> >>> > On Tue, Apr 26, 2016 at 10:56 PM, Mattmann, Chris A (3980) <
>> >>> > chris.a.mattm...@jpl.nasa.gov> wrote:
>> >>> >
>> >>> > > Hi,
>> >>> > >
>> >>> > > Sure here is the link:
>> >>> > >
>> >>> > > https://hangouts.google.com/call/a2w5cgdtirf6jgfb4ww5l2l64ee
>> >>> > >
>> >>> > > Sorry for the delay.
>> >>> > >
>> >>> > > Cheers,
>> >>> > > Chris
>> >>> > >
>> >>> > > +

Re: GSoC 2016: OpenNLP Sentiment Analysis

2016-05-17 Thread Mattmann, Chris A (3980)
Great, OK saw your conversation on Hangouts, I’ll reply back there
and we can set something up for tomorrow cheers!

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++

On 5/17/16, 8:37 AM, "Anastasija Mensikova"  
wrote:

>Hi Chris,
>
>
>I just sent you a Hangout invitation. I definitely can and want to talk 
>tomorrow. I'm back at home (in Latvia) now, so I'm free any time of the day 
>here (with the time difference it would be from around 7am ET till maybe 3pm 
>or 4pm ET at the latest).
>
>
>Let me know!
>
>
>Thank you,
>Anastasija
>
>
>On 17 May 2016 at 07:41, Mattmann, Chris A (3980) 
> wrote:
>
>Dear Anastasija,
>
>I’m reconnecting here since it’s been a bit. Do you have time for
>a Google Hangout tomorrow? Would you like to discuss your progress
>to date on the project?
>
>Thanks and please ping me on Google Hangout so we can chat.
>
>Cheers,
>Chris
>
>++
>Chris Mattmann, Ph.D.
>Chief Architect
>Instrument Software and Science Data Systems Section (398)
>NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>Office: 168-519, Mailstop: 168-527
>Email: chris.a.mattm...@nasa.gov
>WWW:  http://sunset.usc.edu/~mattmann/
>++
>Director, Information Retrieval and Data Science Group (IRDS)
>Adjunct Associate Professor, Computer Science Department
>University of Southern California, Los Angeles, CA 90089 USA
>WWW: http://irds.usc.edu/
>++
>
>On 5/3/16, 7:21 AM, "Anastasija Mensikova"  
>wrote:
>
>>Hi everyone,
>>
>>
>>I joined the hangout at 9:40am ET just like last time, but nobody was there 
>>(it's my fault, I should have checked beforehand that it's still happening), 
>>I waited for about 25 minutes and left because I have to run to class.
>>
>>
>>So, reporting on what I have done this week.
>>
>>
>>I'm in the middle of final exams and projects right now, and have been
>>pulling all-nighters to catch up with all the school work, so I couldn't do
>>as much as I wanted to with this project. School is over in 2 weeks, and I
>>promise I will devote all of my time to this project right after.
>>I was trying to download GeoTopicParser, but for that I had to download and
>>install Maven in order to use the mvn command. Even though it's a simple
>>task, my computer just wouldn't let me use it. It throws an exception, and I
>>spent three hours trying to figure out why, made sure my Java version
>>matched, and even had someone professional look at it, but still couldn't
>>fix it. I will do that as soon as school is over. Nevertheless, I went
>>through the Gazetteer code to understand the logic behind it, and then went
>>on to look through OpenNLP, using the lecture notes from the Coursera course
>>I was telling you about as my guide. It makes more sense now how it works
>>and how training the model is done.
>>I just have one quick question. I noticed OpenNLP uses MaxEntropy. In our
>>case, are we going to be using it as well, or are we going to be using
>>logistic regression instead for data classification?
>>
>>
>>I also have one little problem. I have a final exam this time next week (for 
>>my Theory of Computation class), so I can't do the hangout at this time.
>>
>>
>>Sorry for all the time confusions. I realise how hard it is to find the 
>>perfect time to talk considering the time differences.
>>
>>
>>Thank you very much,
>>Anastasija
>>
>>
>>On 26 April 2016 at 09:56, Mattmann, Chris A (3980)
>> wrote:
>>
>>Hi,
>>
>>Sure here is the link:
>>
>>https://hangouts.google.com/call/a2w5cgdtirf6jgfb4ww5l2l64ee
>>
>>Sorry for the delay.
>>
>&

Re: GSoC 2016: OpenNLP Sentiment Analysis

2016-05-16 Thread Mattmann, Chris A (3980)
Dear Anastasija,

I’m reconnecting here since it’s been a bit. Do you have time for
a Google Hangout tomorrow? Would you like to discuss your progress
to date on the project?

Thanks and please ping me on Google Hangout so we can chat.

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++

On 5/3/16, 7:21 AM, "Anastasija Mensikova"  
wrote:

>Hi everyone,
>
>
>I joined the hangout at 9:40am ET just like last time, but nobody was there 
>(it's my fault, I should have checked beforehand that it's still happening), I 
>waited for about 25 minutes and left because I have to run to class.
>
>
>So, reporting on what I have done this week.
>
>
>I'm in the middle of final exams and projects right now, and have been
>pulling all-nighters to catch up with all the school work, so I couldn't do
>as much as I wanted to with this project. School is over in 2 weeks, and I
>promise I will devote all of my time to this project right after.
>I was trying to download GeoTopicParser, but for that I had to download and
>install Maven in order to use the mvn command. Even though it's a simple
>task, my computer just wouldn't let me use it. It throws an exception, and I
>spent three hours trying to figure out why, made sure my Java version
>matched, and even had someone professional look at it, but still couldn't
>fix it. I will do that as soon as school is over. Nevertheless, I went
>through the Gazetteer code to understand the logic behind it, and then went
>on to look through OpenNLP, using the lecture notes from the Coursera course
>I was telling you about as my guide. It makes more sense now how it works
>and how training the model is done.
>I just have one quick question. I noticed OpenNLP uses MaxEntropy. In our
>case, are we going to be using it as well, or are we going to be using
>logistic regression instead for data classification?
>
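
A note on the MaxEntropy question above: a maximum entropy classifier and multinomial logistic regression are the same model (a softmax over weighted feature sums), so the two options are largely different names for one technique. The sketch below shows the idea with hand-written weights; it does not use OpenNLP's API, and the feature names and weight values are made up for illustration.

```python
import math


def softmax(scores):
    """Turn per-label scores into probabilities that sum to 1."""
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {label: math.exp(s - top) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


def classify(features, weights):
    """MaxEnt / multinomial logistic regression scoring:
    P(label | features) is proportional to exp(sum of active feature weights)."""
    scores = {
        label: sum(w.get(f, 0.0) for f in features)
        for label, w in weights.items()
    }
    return softmax(scores)


# Hypothetical weights; a real model learns these from labeled data.
WEIGHTS = {
    "positive": {"great": 1.5, "boring": -1.0},
    "negative": {"great": -1.0, "boring": 1.5},
}

probs = classify(["great", "movie"], WEIGHTS)
print(probs)
```

With only two labels, the positive probability equals the logistic sigmoid of the score difference, which is ordinary binary logistic regression. Adding more labels to WEIGHTS gives the categorical case without changing the code.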
>
>I also have one little problem. I have a final exam this time next week (for 
>my Theory of Computation class), so I can't do the hangout at this time.
>
>
>Sorry for all the time confusions. I realise how hard it is to find the 
>perfect time to talk considering the time differences.
>
>
>Thank you very much,
>Anastasija
>
>
>On 26 April 2016 at 09:56, Mattmann, Chris A (3980) 
> wrote:
>
>Hi,
>
>Sure here is the link:
>
>https://hangouts.google.com/call/a2w5cgdtirf6jgfb4ww5l2l64ee
>
>Sorry for the delay.
>
>Cheers,
>Chris
>
>++
>Chris Mattmann, Ph.D.
>Chief Architect
>Instrument Software and Science Data Systems Section (398)
>NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>Office: 168-519, Mailstop: 168-527
>Email: chris.a.mattm...@nasa.gov
>WWW:  http://sunset.usc.edu/~mattmann/
>++
>Director, Information Retrieval and Data Science Group (IRDS)
>Adjunct Associate Professor, Computer Science Department
>University of Southern California, Los Angeles, CA 90089 USA
>WWW: http://irds.usc.edu/
>++
>
>On 4/26/16, 6:48 AM, "Anastasija Mensikova"  
>wrote:
>
>>Hi everyone,
>>
>>
>>Is the 9:40 ET hangout still happening? I just have to leave soon to go to 
>>class.
>>
>>
>>Thank you,
>>Anastasija
>>
>>
>>On 25 April 2016 at 23:39, Anastasija Mensikova
>> wrote:
>>
>>Hi Chris,
>>
>>
>>Yes, that's perfect. I'll be ready by 9:40am.
>>
>>
>>Thank you,
>>Anastasija
>>
>>
>>On 25 April 2016 at 23:28, Mattmann, Chris A (3980)
>> wrote:
>>
>>Hey Anastasija,
>>
>>To be honest 9am EST is a little aggressive, I will likely be able
>>to do 6:40 am PT (am traveling back from DC as I type this) which
>>is 9:40am ET.
>>

Re: [VOTE] Release Apache Tika 1.13 Candidate #1

2016-05-16 Thread Mattmann, Chris A (3980)
Late to the party, but voting anyways:

+1 from me, SIGS and MD5 look good!


LMC-053601:apache-tika-1.13-rc1 mattmann$ for name in app server; do 
> /Users/mattmann/bin/stage_apache_rc tika-$name 1.13 
> https://dist.apache.org/repos/dist/dev/tika/
> done
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 52.2M  100 52.2M    0     0  1988k      0  0:00:26  0:00:26 --:--:-- 2389k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   819  100   819    0     0   3196      0 --:--:-- --:--:-- --:--:--  3199
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    33  100    33    0     0    126      0 --:--:-- --:--:-- --:--:--   126
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 54.6M  100 54.6M    0     0  1204k      0  0:00:46  0:00:46 --:--:-- 2385k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   819  100   819    0     0   3225      0 --:--:-- --:--:-- --:--:--  3237
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    33  100    33    0     0    120      0 --:--:-- --:--:-- --:--:--   120
LMC-053601:apache-tika-1.13-rc1 mattmann$ ls
tika-app-1.13.jar   tika-app-1.13.jar.md5   
tika-server-1.13.jar.asc
tika-app-1.13.jar.asc   tika-server-1.13.jar
tika-server-1.13.jar.md5
LMC-053601:apache-tika-1.13-rc1 mattmann$ /Users/mattmann/bin/stage_apache_rc 
tika 1.13-src https://dist.apache.org/repos/dist/dev/tika/
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 96.0M  100 96.0M    0     0  1289k      0  0:01:16  0:01:16 --:--:-- 1186k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   819  100   819    0     0   2917      0 --:--:-- --:--:-- --:--:--  2925
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    33  100    33    0     0    128      0 --:--:-- --:--:-- --:--:--   128
LMC-053601:apache-tika-1.13-rc1 mattmann$ ls
tika-1.13-src.zip   tika-1.13-src.zip.md5   
tika-app-1.13.jar.asc   tika-server-1.13.jar
tika-server-1.13.jar.md5
tika-1.13-src.zip.asc   tika-app-1.13.jar   
tika-app-1.13.jar.md5   tika-server-1.13.jar.asc
LMC-053601:apache-tika-1.13-rc1 mattmann$ $HOME/bin/verify_checksums
-bash: /Users/mattmann/bin/verify_checksums: No such file or directory
LMC-053601:apache-tika-1.13-rc1 mattmann$ /Users/mattmann/bin/verify_gpg_sigs 
Verifying Signature for file tika-1.13-src.zip.asc
gpg: assuming signed data in `tika-1.13-src.zip'
gpg: Signature made Mon May  9 10:38:15 2016 PDT using RSA key ID 0EB30B07
gpg: Can't check signature: public key not found
Verifying Signature for file tika-app-1.13.jar.asc
gpg: assuming signed data in `tika-app-1.13.jar'
gpg: Signature made Mon May  9 10:30:09 2016 PDT using RSA key ID 0EB30B07
gpg: Can't check signature: public key not found
Verifying Signature for file tika-server-1.13.jar.asc
gpg: assuming signed data in `tika-server-1.13.jar'
gpg: Signature made Mon May  9 10:34:48 2016 PDT using RSA key ID 0EB30B07
gpg: Can't check signature: public key not found
LMC-053601:apache-tika-1.13-rc1 mattmann$ curl -O 
https://raw.githubusercontent.com/apache/tika/master/KEYS
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 20287  100 20287    0     0  20881      0 --:--:-- --:--:-- --:--:-- 20871
LMC-053601:apache-tika-1.13-rc1 mattmann$ gpg --import < KEYS
gpg: key A355A63E: public key "Jukka Zitting " imported
gpg: key B876884A: "Chris Mattmann (CODE SIGNING KEY) " 
not changed
gpg: key 9740DD55: public key "David Meikle (CODE SIGNING KEY) 
" imported
gpg: key AEA8C6AB: public key "David Meikle (CODE SIGNING KEY) 
" imported
gpg: key 0EB30B07: public key "David Meikle (CODE SIGNING KEY) 
" imported
gpg: key D4F10117: public key "Tyler Palsulich " imported
gpg: Total number processed: 6
gpg:   imported: 5  (RSA: 3)
gpg:  unchanged: 1
gpg: 3 marginal(s) needed, 1 complete(s) needed, PGP trust model
gpg: depth: 0  valid:   1  signed:   0  tr
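
The session above shows that the verify_checksums helper script was not found. As a rough stand-in, an MD5 check of a release artifact could be sketched in Python as follows. The function names are hypothetical, and the sketch assumes each .md5 file starts with the hex digest (optionally followed by a file name), which fits the 33-byte downloads in the log but is not confirmed by it.

```python
import hashlib


def md5_hex(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, read in chunks so large
    artifacts do not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def md5_matches(artifact_path, md5_path):
    """Compare an artifact against its .md5 sidecar file.

    Assumption: the first whitespace-separated token in the sidecar
    file is the expected hex digest.
    """
    with open(md5_path, "r", encoding="ascii") as fh:
        expected = fh.read().split()[0].lower()
    return md5_hex(artifact_path) == expected
```

For example, md5_matches("tika-app-1.13.jar", "tika-app-1.13.jar.md5") would return True when the digest agrees. A checksum match only guards against corrupted downloads; the GPG signature check is what ties the artifact to the release manager's key.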

Re: Squashing GitHub pull requests while merging

2016-05-07 Thread Mattmann, Chris A (3980)
yep I think so Tyler, I think if someone just does it upstream before
the PR we’re all good.

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++

On 5/7/16, 6:21 AM, "Tyler Palsulich"  wrote:

>A contributor should be able to squash the commits in the pull request
>before we merge it into Tika, so we don't need to mess up Tika's history.
>Right?
>
>Tyler
>On May 6, 2016 8:41 PM, "Mattmann, Chris A (3980)" <
>chris.a.mattm...@jpl.nasa.gov> wrote:
>
>> Squashing messes up history and atm requires infra intervention, so I would
>> suggest we stay away from it for now.
>>
>> Sent from my iPhone
>>
>> > On May 6, 2016, at 2:20 PM, Ken Krugler wrote:
>> >
>> > I was perusing https://wiki.apache.org/tika/UsingGit, and noticed that it
>> > doesn’t talk about squashing a pull request’s commits while merging.
>> >
>> > This is described at https://mahout.apache.org/developers/github.html
>> >
>> > Isn't this something we’d want to do as well?
>> >
>> > Thanks,
>> >
>> > — Ken
>> >
>> > --
>> > Ken Krugler
>> > +1 530-210-6378
>> > http://www.scaleunlimited.com
>> > custom big data solutions & training
>> > Hadoop, Cascading, Cassandra & Solr
>> >
>> >
>> >
>>
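
Tyler's suggestion above, that a contributor squash their own commits before the pull request is merged, can be sketched as follows. This is an illustration only, not documented Tika practice: `squash_branch` is a hypothetical helper, and it assumes the pull-request branch was forked from a local branch named `master`.

```shell
# Hypothetical helper: collapse every commit made on the current branch
# since it forked from BASE (default: master) into a single commit.
squash_branch() {
  base="${1:-master}"
  msg="${2:-Squashed commit}"
  # Find the commit where the current branch diverged from BASE.
  fork="$(git merge-base "$base" HEAD)" || return 1
  # Move the branch pointer back to the fork point; --soft keeps all
  # changes staged, so a single commit can capture them.
  git reset --soft "$fork" && git commit -q -m "$msg"
}

# Typical use on a pull-request branch (this rewrites that branch's history,
# so the contributor's fork needs a forced push afterwards; the ticket id
# and branch name below are placeholders):
#   squash_branch master "TIKA-XXXX: describe the change"
#   git push --force-with-lease origin my-feature-branch
```

Because only the contributor's branch is rewritten, the history of the main repository stays untouched, which is the point Tyler raises.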


Re: Squashing GitHub pull requests while merging

2016-05-06 Thread Mattmann, Chris A (3980)
Squashing messes up history and atm requires infra intervention, so I would
suggest we stay away from it for now.

Sent from my iPhone

> On May 6, 2016, at 2:20 PM, Ken Krugler  wrote:
> 
> I was perusing https://wiki.apache.org/tika/UsingGit, and noticed that it
> doesn’t talk about squashing a pull request’s commits while merging.
> 
> This is described at https://mahout.apache.org/developers/github.html
> 
> Isn't this something we’d want to do as well?
> 
> Thanks,
> 
> — Ken
> 
> --
> Ken Krugler
> +1 530-210-6378
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Cassandra & Solr
> 
> 
> 


Re: pre-release 1.13 regression testing

2016-05-02 Thread Mattmann, Chris A (3980)
+1 go for it Dave!

I’m in Hawaii on vacation so please push forward ;)

On 5/2/16, 5:59 AM, "Allison, Timothy B."  wrote:

>+1 
>
>Thank you!
>
>-Original Message-
>From: David Meikle [mailto:loo...@gmail.com] 
>Sent: Monday, May 2, 2016 11:51 AM
>To: dev@tika.apache.org
>Subject: Re: pre-release 1.13 regression testing
>
>Hi Tim,
>
>> On 2 May 2016, at 12:29, Allison, Timothy B.  wrote:
>> 
>> Dave,
>> Find any showstoppers?
>> 
>> All,
>> Anyone have time to cut the release?
>> 
>>   Cheers,
>> 
>>Tim
>
>Everything is looking good here - been trying it out in production.
>
>I can cut the release today or tomorrow, unless there are any objections.
>
>Cheers,
>Dave


Re: GSoC 2016: OpenNLP Sentiment Analysis

2016-04-27 Thread Mattmann, Chris A (3980)
thanks Anthony

On 4/26/16, 9:10 AM, "Anthony Beylerian"  wrote:

>Please check this approach [1]; it could be useful for combining
>a labeled seed set with the unlabeled Fisher CallHome data.
>Since it may be a long read, there's a shorter ppt as well [2].
>
>[1] link.springer.com/article/10.1023%2FA%3A1007692713085
>[2] cseweb.ucsd.edu/~atsmith/presentation_final.ppt
>
>
>On Tue, Apr 26, 2016 at 11:36 PM, Joern Kottmann  wrote:
>
>> The Large Movie Review Dataset might be interesting for this as well:
>> http://ai.stanford.edu/~amaas/data/sentiment/
>>
>> Jörn
>>
>> On Tue, Apr 26, 2016 at 4:26 PM, Anthony Beylerian <
>> anthony.beyler...@gmail.com> wrote:
>>
>> > sentiment analysis discussion doc :
>> >
>> >
>> >
>> https://docs.google.com/document/d/1Gi59YqtisY4NLaVY3B7CNLMTgCRZm9JEk17kmBmWXqQ/edit?usp=sharing
>> >
>> > On Tue, Apr 26, 2016 at 10:56 PM, Mattmann, Chris A (3980) <
>> > chris.a.mattm...@jpl.nasa.gov> wrote:
>> >
>> > > Hi,
>> > >
>> > > Sure here is the link:
>> > >
>> > > https://hangouts.google.com/call/a2w5cgdtirf6jgfb4ww5l2l64ee
>> > >
>> > > Sorry for the delay.
>> > >
>> > > Cheers,
>> > > Chris
>> > >
>> > > On 4/26/16, 6:48 AM, "Anastasija Mensikova" <
>> > > mensikova.anastas...@gmail.com> wrote:
>> > >
>> > > >Hi everyone,
>> > > >
>> > > >
>> > > >Is the 9:40 ET hangout still happening? I just have to leave soon to go
>> > > >to class.
>> > > >
>> > > >
>> > > >Thank you,
>> > > >Anastasija
>> > > >
>> > > >
>> > > >On 25 April 2016 at 23:39, Anastasija Mensikova
>> > > > wrote:
>> > > >
>> > > >Hi Chris,
>> > > >
>> > > >
>> > > >Yes, that's perfect. I'll be ready by 9:40am.
>> > > >
>> > > >
>> > > >Thank you,
>> > > >Anastasija
>> > > >
>> > > >
>> > > >On 25 April 2016 at 23:28, Mattmann, Chris A (3980)
>> > > > wrote:
>> > > >
>> > > >Hey Anastasija,
>> > > >
>> > > >To be honest 9am EST is a little aggressive, I will likely be able
>> > > >to do 6:40 am PT (am traveling back from DC as I type this) which
>> > > >is 9:40am ET.
>> > > >
>> > > >My GChat handle is chris.mattm...@gmail.com. I will create a hangout
>> > > >and send it to the list; please contact me at 6:40am PT.
>> > > >
>> > > >Cheers,
>> > > >Chris
>> > > >
>> > > >+++

Re: GSoC 2016: OpenNLP Sentiment Analysis

2016-04-26 Thread Mattmann, Chris A (3980)
Hi,

Sure here is the link:

https://hangouts.google.com/call/a2w5cgdtirf6jgfb4ww5l2l64ee

Sorry for the delay.

Cheers,
Chris

On 4/26/16, 6:48 AM, "Anastasija Mensikova"  
wrote:

>Hi everyone,
>
>
>Is the 9:40 ET hangout still happening? I just have to leave soon to go to 
>class.
>
>
>Thank you,
>Anastasija
>
>
>On 25 April 2016 at 23:39, Anastasija Mensikova 
> wrote:
>
>Hi Chris,
>
>
>Yes, that's perfect. I'll be ready by 9:40am. 
>
>
>Thank you,
>Anastasija
>
>
>On 25 April 2016 at 23:28, Mattmann, Chris A (3980) 
> wrote:
>
>Hey Anastasija,
>
>To be honest 9am EST is a little aggressive, I will likely be able
>to do 6:40 am PT (am traveling back from DC as I type this) which
>is 9:40am ET.
>
>My GChat handle is chris.mattm...@gmail.com. I will create a hangout
>and send it to the list; please contact me at 6:40am PT.
>
>Cheers,
>Chris
>
>On 4/25/16, 11:07 PM, "Anastasija Mensikova"  
>wrote:
>
>>Hi everyone,
>>
>>
>>So is the hangout session tomorrow (Tuesday) at 6:30pm IST (9am EST) 
>>confirmed or not?
>>
>>
>>Thank you,
>>Anastasija
>>
>>
>>On 25 April 2016 at 15:23, Madhawa Kasun Gunasekara
>> wrote:
>>
>>Hi all,
>>
>>
>>Shall we have the hangout session tomorrow (Tuesday) about 18:30 IST ?
>>
>>
>>Thanks,
>>
>>Madhawa
>>
>>
>>
>>
>>Madhawa
>>
>>
>>
>>
>>On Sun, Apr 24, 2016 at 10:33 PM, Mondher Bouazizi
>> wrote:
>>
>>Hi,
>>
>>I am sorry for my late reply.
>>
>>Given the time difference between Japan and USA, I think I won't be
>>available on weekdays. I will be available only on Friday/Saturday morning
>>(9-10am EST).
>>
>>I am not sure if Chris is OK with that, we had our previous meetings on
>>Saturday mornings.
>>
>>Otherwise, please go ahead. I will join as soon as I can.
>>
>>Thanks.
>>
>>@Chris: my github ID is mondher-bouazizi
>>
>>Best regards,
>>
>>Mondher
>>
>>On Mon, Apr 25, 2016 at 1:44 AM, Anastasija Mensikova <
>>mensikova.anastas...@gmail.com> wrote:
>>
>>> Hi Anthony,
>>>
>>> I can make it by Madhawa's proposal too, after 6pm IST on Tuesday (after
>>> 8:30am EST). Let me know when exactly!
>>>
>>> Thank you,
>>> Anastasija
>>>
>>> On 24 April 2016 at 03:02, Anthony Beylerian 
>>> wrote:
>>>
>>>> Hi Anastasija,
>>>>
>>>> I'm not available by those times (00-07 JST).  I could make it by
>>>> Madhawa's proposal, but otherwise please go ahead, we may discuss some
>>>> other time.
>>>>
>>>> @Chris: github ID : beylerian
>>>>
>>>> Best,
>>>>
>>>> Anthony
>>>>
>>>>
>>>> Please find my github profile https://github.com/madhawa-gunasekara
>>>>
>>>> Madhawa
>>>>
>>>> On Sun, Apr 24, 2016 at 12:13 AM, Madhawa Kasun Gunasekara <

Re: GSoC 2016: OpenNLP Sentiment Analysis

2016-04-25 Thread Mattmann, Chris A (3980)
Hey Anastasija,

To be honest 9am EST is a little aggressive, I will likely be able
to do 6:40 am PT (am traveling back from DC as I type this) which
is 9:40am ET.

My GChat handle is chris.mattm...@gmail.com. I will create a hangout
and send it to the list; please contact me at 6:40am PT.

Cheers,
Chris

On 4/25/16, 11:07 PM, "Anastasija Mensikova"  
wrote:

>Hi everyone,
>
>
>So is the hangout session tomorrow (Tuesday) at 6:30pm IST (9am EST) confirmed 
>or not?
>
>
>Thank you,
>Anastasija
>
>
>On 25 April 2016 at 15:23, Madhawa Kasun Gunasekara 
> wrote:
>
>Hi all, 
>
>
>Shall we have the hangout session tomorrow (Tuesday) about 18:30 IST ?
>
>
>Thanks,
>
>Madhawa
>
>
>
>
>Madhawa
>
>
>
>
>On Sun, Apr 24, 2016 at 10:33 PM, Mondher Bouazizi 
> wrote:
>
>Hi,
>
>I am sorry for my late reply.
>
>Given the time difference between Japan and USA, I think I won't be
>available on weekdays. I will be available only on Friday/Saturday morning
>(9-10am EST).
>
>I am not sure if Chris is OK with that, we had our previous meetings on
>Saturday mornings.
>
>Otherwise, please go ahead. I will join as soon as I can.
>
>Thanks.
>
>@Chris: my github ID is mondher-bouazizi
>
>Best regards,
>
>Mondher
>
>On Mon, Apr 25, 2016 at 1:44 AM, Anastasija Mensikova <
>mensikova.anastas...@gmail.com> wrote:
>
>> Hi Anthony,
>>
>> I can make it by Madhawa's proposal too, after 6pm IST on Tuesday (after
>> 8:30am EST). Let me know when exactly!
>>
>> Thank you,
>> Anastasija
>>
>> On 24 April 2016 at 03:02, Anthony Beylerian 
>> wrote:
>>
>>> Hi Anastasija,
>>>
>>> I'm not available by those times (00-07 JST).  I could make it by
>>> Madhawa's proposal, but otherwise please go ahead, we may discuss some
>>> other time.
>>>
>>> @Chris: github ID : beylerian
>>>
>>> Best,
>>>
>>> Anthony
>>>
>>>
>>> Please find my github profile https://github.com/madhawa-gunasekara
>>>
>>> Madhawa
>>>
>>> On Sun, Apr 24, 2016 at 12:13 AM, Madhawa Kasun Gunasekara <
>>> madhaw...@gmail.com> wrote:
>>>
>>> > Hi Chris,
>>> >
>>> > I'm available on Tuesday & Wednesday after 6.00 pm IST.
>>> >
>>> > Thanks,
>>> > Madhawa
>>> >
>>> > Madhawa
>>> >
>>> > On Sat, Apr 23, 2016 at 11:38 PM, Anastasija Mensikova <
>>> > mensikova.anastas...@gmail.com> wrote:
>>> >
>>> >> Hi Chris,
>>> >>
>>> >> Thank you very much for your email. I'm so excited to work with you!
>>> >>
>>> >> My Github name is amensiko.
>>> >>
>>> >> And yes, next week sounds good! I'm available on: Tuesday at 4:20pm EST,
>>> >> Thursday 11am - 2:30pm and 4:20 - 6pm EST, Friday 11am - 3pm EST.
>>> >>
>>> >> Thank you,
>>> >> Anastasija
>>> >>
>>> >> On 23 April 2016 at 10:21, Mattmann, Chris A (3980) <
>>> >> chris.a.mattm...@jpl.nasa.gov> wrote:
>>> >>
>>> >>> Hi Anastasija,
>>> >>>
>>> >>> Hope you are well. It’s now time to get started on the project.
>>> >>> Mondher, Anthony, Madhawa and I have been discussing ideas about
>>> >>> how to proceed with the project and even developing a task list.
>>> >>> Let’s get your tasks input into that list, and also coordinate.
>>> >>>
>>> >>> I also have an action to share some Spanish/English data to try
>>> >>> and do cross lingual sentiment analysis.
>>> >>>
>>> >>> Are you available t

Re: [DISCUSS] Backward compatibility

2016-04-25 Thread Mattmann, Chris A (3980)
+1 go ahead and re-enable clirr-maven-plugin.

Thanks Konstantin!

On 4/25/16, 2:14 PM, "Konstantin Gribov"  wrote:

>OK, I'll bring them back into o.a.tika.language tomorrow if no objections
>are raised by then.
>
>I don't see any way this could break anything, but I will recheck it.
>
>Should I also enable clirr-maven-plugin on these classes?
>
>Mon, 25 Apr 2016 at 20:39, Mattmann, Chris A (3980) <
>chris.a.mattm...@jpl.nasa.gov>:
>
>> +1 I am fine with:
>>
>> 1. putting the old classes back in. Fine by me.
>> 2. keeping the new tika-langdetect and improvements.
>>
>> I think that this is the easiest. Sorry for breaking
>> the trunk, apologies. I was just eager to backport Ken’s
>> stuff and also to get Text.jl support.
>>
>> Let’s just add back LanguageIdentifier and I think that would
>> do it, right?
>>
>> Any objections?
>>
>> Cheers,
>> Chris
>>
>> On 4/25/16, 12:56 PM, "Allison, Timothy B."  wrote:
>>
>> >+1
>> >
>> >Thank you, Konstantin, for catching this.  I agree about breaking changes
>> in trunk.
>> >
>> >Should we plop the old classes back where they were, add deprecation and
>> live with a bit of messiness for a few versions?
>> >
>> >
>> >
>> >-Original Message-
>> >From: Konstantin Gribov [mailto:gros...@gmail.com]
>> >Sent: Monday, April 25, 2016 10:50 AM
>> >To: dev@tika.apache.org
>> >Subject: [DISCUSS] Backward compatibility
>> >
>> >Hi, folks.
>> >
>> >I want to bring our attention to maintaining backward compatibility on the
>> >master/1.x branch. I've recently found that we dropped
>> >o.a.tika.language.LanguageIdentifier in 3a7a94c[1] (merged on 2016-03-07,
>> >see also [2]). This will break downstream dependents of `tika-core` that use
>> >`LanguageIdentifier`.
>> >
>> >It looks OK in the 2.x branch, but I'm against sudden API changes (especially
>> >dropping public classes/interfaces) in the 1.x branch. We should at least mark
>> >it `@Deprecated` for a version or two before dropping it.
>> >
>> >I'd like to bring this and related classes back before the 1.13 release if
>> >nobody objects. I won't have time to refactor it to use the new APIs until the
>> >middle of May.
>> >
>> >[1]:
>> >
>> https://github.com/apache/tika/commit/3a7a94ca5040eabd90f6060effc517126def3fc1
>> >[2]: https://issues.apache.org/jira/browse/TIKA-1723
>> >--
>> >Best regards,
>> >Konstantin Gribov
>>
>-- 
>Best regards,
>Konstantin Gribov


Re: [DISCUSS] Backward compatibility

2016-04-25 Thread Mattmann, Chris A (3980)
+1 I am fine with:

1. putting the old classes back in. Fine by me.
2. keeping the new tika-langdetect and improvements.

I think that this is the easiest. Sorry for breaking
the trunk, apologies. I was just eager to backport Ken’s
stuff and also to get Text.jl support.

Let’s just add back LanguageIdentifier and I think that would
do it, right? 

Any objections?

Cheers,
Chris

On 4/25/16, 12:56 PM, "Allison, Timothy B."  wrote:

>+1
>
>Thank you, Konstantin, for catching this.  I agree about breaking changes in 
>trunk.
>
>Should we plop the old classes back where they were, add deprecation and live 
>with a bit of messiness for a few versions?
>
>
>
>-Original Message-
>From: Konstantin Gribov [mailto:gros...@gmail.com] 
>Sent: Monday, April 25, 2016 10:50 AM
>To: dev@tika.apache.org
>Subject: [DISCUSS] Backward compatibility
>
>Hi, folks.
>
>I want to bring our attention to maintaining backward compatibility on the
>master/1.x branch. I've recently found that we dropped
>o.a.tika.language.LanguageIdentifier in 3a7a94c[1] (merged on 2016-03-07, see
>also [2]). This will break downstream dependents of `tika-core` that use
>`LanguageIdentifier`.
>
>It looks OK in the 2.x branch, but I'm against sudden API changes (especially
>dropping public classes/interfaces) in the 1.x branch. We should at least mark it
>`@Deprecated` for a version or two before dropping it.
>
>I'd like to bring this and related classes back before the 1.13 release if nobody
>objects. I won't have time to refactor it to use the new APIs until the middle of
>May.
>
>[1]:
>https://github.com/apache/tika/commit/3a7a94ca5040eabd90f6060effc517126def3fc1
>[2]: https://issues.apache.org/jira/browse/TIKA-1723
>--
>Best regards,
>Konstantin Gribov


Re: pre-release 1.13 regression testing

2016-04-25 Thread Mattmann, Chris A (3980)
Thanks Tim I appreciate it

On 4/25/16, 9:15 AM, "Allison, Timothy B."  wrote:

>All,
>  Given a number of recent changes, I kicked off the regression tests again.  
> I should have results by tomorrow.
>
> Best,
>
>   Tim
>
>


Re: GSoC 2016: OpenNLP Sentiment Analysis

2016-04-23 Thread Mattmann, Chris A (3980)
Mondher, Anthony, Madhawa, Anastasija,

Sorry I’m just getting back to some of the actions from our last
conversation. It would seem I owe:

1. Access to USC IRDS Github - please provide me your Github usernames
2. Access to the Fisher Callhome corpus
3. Anastasija - please let me know your Github username. We are going to
leverage my IRDS @ USC Github org for e.g., models and temp storage while
we prepare the actual code for the project.

To proceed on the project here is my suggested plan:

1. Anastasija to get with Mondher, Anthony, Madhawa and me over
Google Hangout and go over the task list. The idea will be to do a
SentimentAnalysisParser in Apache Tika, akin to the GeoTopicParser in
Tika, described here: http://wiki.apache.org/tika/GeoTopicParser
This would combine Apache OpenNLP, Apache Tika, and sentiment models.

2. We should train the models based on the Fisher Callhome Corpus.
This would give us a cross lingual sentiment training dataset.

If that is acceptable to everyone I propose early next week (Tuesday
or later) that we Google Hangout then. Sound good?

Cheers,
Chris

On 4/23/16, 7:21 AM, "Mattmann, Chris A (3980)"  
wrote:

>Hi Anastasija,
>
>Hope you are well. It’s now time to get started on the project. 
>Mondher, Anthony, Madhawa and I have been discussing ideas about
>how to proceed with the project and even developing a task list.
>Let’s get your tasks input into that list, and also coordinate.
>
>I also have an action to share some Spanish/English data to try
>and do cross lingual sentiment analysis.
>
>Are you available to chat this week?
>
>Cheers,
>Chris
>
>On 4/23/16, 4:49 AM, "Anthony Beylerian"  wrote:
>
>>Hello,
>>
>>Congratulations on being accepted for this year's GSoC.
>>Although Mondher and I will not participate this year as students, we
>>will do our best to help.
>>We are currently busy with academic research, but will join the efforts
>>when possible.
>>Otherwise, for any discussion concerning the proposed approaches, please
>>let us know.
>>
>>Best,
>>
>>On Sat, Apr 23, 2016 at 6:02 PM, Madhawa Kasun Gunasekara <
>>madhaw...@gmail.com> wrote:
>>
>>> Sure we will start working on this.
>>>
>>> Thanks,
>>> Madhawa
>>>
>>> Madhawa
>>>
>>> On Sat, Apr 23, 2016 at 1:38 AM, Chris Mattmann 
>>> wrote:
>>>
>>>> Congrats!
>>>>
>>>> time to get started team.
>>>>


Pivotal, Greenplum and Apache Tika

2016-04-23 Thread Mattmann, Chris A (3980)
Hey All,

Cool article here on Apache Tika’s use at Pivotal:
https://t.co/fPzszrKHtR

Cheers,
Chris


GSoC 2016: OpenNLP Sentiment Analysis

2016-04-23 Thread Mattmann, Chris A (3980)
Hi Anastasija,

Hope you are well. It’s now time to get started on the project. 
Mondher, Anthony, Madhawa and I have been discussing ideas about
how to proceed with the project and even developing a task list.
Let’s get your tasks input into that list, and also coordinate.

I also have an action to share some Spanish/English data to try
and do cross lingual sentiment analysis.

Are you available to chat this week?

Cheers,
Chris

On 4/23/16, 4:49 AM, "Anthony Beylerian"  wrote:

>Hello,
>
>Congratulations on being accepted for this year's GSoC.
>Although Mondher and I will not participate this year as students, we
>will do our best to help.
>We are currently busy with academic research, but will join the efforts
>when possible.
>Otherwise, for any discussion concerning the proposed approaches, please
>let us know.
>
>Best,
>
>On Sat, Apr 23, 2016 at 6:02 PM, Madhawa Kasun Gunasekara <
>madhaw...@gmail.com> wrote:
>
>> Sure we will start working on this.
>>
>> Thanks,
>> Madhawa
>>
>> Madhawa
>>
>> On Sat, Apr 23, 2016 at 1:38 AM, Chris Mattmann 
>> wrote:
>>
>>> Congrats!
>>>
>>> time to get started team.
>>>


Re: last commits before pre-1.13 regression tests?

2016-04-21 Thread Mattmann, Chris A (3980)
Yeah I have time, but honestly I’m not done. I have a few items
left in the MIME type stuff.

One more day please, one more day.

On 4/21/16, 9:43 AM, "Allison, Timothy B."  wrote:

>I think we're good to start cutting rc1.
>
>Any objections?
>
>Chris, do you have the time to do it?
>
>>Hi All,
>>  I'm about to kick off our regression tests to see if there are major issues 
>> before we release 1.13.  Any blockers/last commits outstanding?  Still need 
>> to upgrade POI to 3.15-beta1...  What else?
>>
>>   Cheers,
>>
>>  Tim
>>
>


Re: last commits before pre-1.13 regression tests?

2016-04-20 Thread Mattmann, Chris A (3980)
Just finished meeting will inspect today 

Sent from my iPhone

> On Apr 20, 2016, at 10:31 AM, Allison, Timothy B.  wrote:
> 
> Chris,
>  Any over-recall/bad precision on your new mimes?
> 
> -Original Message-
> From: Allison, Timothy B. [mailto:talli...@mitre.org] 
> Sent: Wednesday, April 20, 2016 11:20 AM
> To: dev@tika.apache.org
> Subject: RE: last commits before pre-1.13 regression tests?
> 
> Results are available here:
> http://162.242.228.174/reports/tika_1_12_v_tika_1_13-SNAPSHOTv2.tar.bz2 
> 
> I've only looked briefly.  Overall, I think things look ok.
> 
> This isn't quite trunk:
> * I applied Nick C's first dbf regex
> * I added a temporary fix for the pooled time series parser
> 
> There are quite a few changes in mime-detection, and clearly some rare 
> problems with pdfs (and other formats?) now being identified as 
> multipart/apple-double.  I think there are some rare problems with 
> "text/html; charset=UTF-8 -> text/plain; charset=UTF-8" 
> 
> What do others see?  Are we good to go for 1.13 after I commit the 2 * above?
> 
> -Original Message-
> From: Allison, Timothy B. [mailto:talli...@mitre.org]
> Sent: Monday, April 18, 2016 12:26 PM
> To: dev@tika.apache.org
> Subject: RE: last commits before pre-1.13 regression tests?
> 
> Sounds good to me.  Given the amount of changes since the last pre-pre-run, I 
> suspect I'll need to redo the tests anyways. ;)
> 
> -Original Message-
> From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov]
> Sent: Monday, April 18, 2016 12:16 PM
> To: dev@tika.apache.org
> Subject: Re: last commits before pre-1.13 regression tests?
> 
> Tim I would like to get in and close out all the scientific MIME updates for 
> TREC-DD-Polar and get that in at least.
> 
> In 1.14, my team from USC and I will deliver an automatic Deep Learning way 
> to do MIME detection based on these updates and also the ContentMIMEDetection 
> mechanism described on the wiki. We are also working on a paper to describe 
> that too.
> 
> But for 1.13 I’ve created a JIRA ticket and will link the relevant JIRAs and 
> PRs and I’d like to plow through those. Can we run 1 more tika-batch after I 
> do that to check any regressions?
> 
> https://issues.apache.org/jira/browse/TIKA-1955
> 
> 
> Cheers,
> Chris
> 
>> On 4/18/16, 9:11 AM, "Allison, Timothy B."  wrote:
>> 
>> Hi All,
>> I'm about to kick off our regression tests to see if there are major issues 
>> before we release 1.13.  Any blockers/last commits outstanding?  Still need 
>> to upgrade POI to 3.15-beta1...  What else?
>> 
>>  Cheers,
>> 
>> Tim
>> 
>> Timothy B. Allison, Ph.D.
>> Principal Artificial Intelligence Engineer Group Lead K83E/Human 
>> Language Technology The MITRE Corporation
>> 7515 Colshire Drive, McLean, VA  22102
>> 703-983-2473 (phone); 703-983-1379 (fax)
>> 


Fwd: Getting Files Tags

2016-04-19 Thread Mattmann, Chris A (3980)


Sent from my iPhone

Begin forwarded message:

From: raj kumar <myidrajku...@gmail.com>
Date: April 19, 2016 at 4:28:08 AM PDT
To: <dev-ow...@tika.apache.org>
Subject: Fwd: Getting Files Tags


Hi All,

In Windows, images and videos can have tags. We can add tags such as
'Favourite' and 'Romantic' to these files. How can these tag values be
retrieved in Tika?


Regards,
Rajkumar.S



Re: last commits before pre-1.13 regression tests?

2016-04-18 Thread Mattmann, Chris A (3980)
Tim I would like to get in and close out all the scientific MIME
updates for TREC-DD-Polar and get that in at least.

In 1.14, my team from USC and I will deliver an automatic Deep
Learning way to do MIME detection based on these updates and also
the ContentMIMEDetection mechanism described on the wiki. We are
also working on a paper to describe that too.

But for 1.13 I’ve created a JIRA ticket and will link the relevant
JIRAs and PRs and I’d like to plow through those. Can we run 1
more tika-batch after I do that to check any regressions?

https://issues.apache.org/jira/browse/TIKA-1955


Cheers,
Chris

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++









On 4/18/16, 9:11 AM, "Allison, Timothy B."  wrote:

>Hi All,
>  I'm about to kick off our regression tests to see if there are major issues 
> before we release 1.13.  Any blockers/last commits outstanding?  Still need 
> to upgrade POI to 3.15-beta1...  What else?
>
>   Cheers,
>
>  Tim
>
>Timothy B. Allison, Ph.D.
>Principal Artificial Intelligence Engineer
>Group Lead
>K83E/Human Language Technology
>The MITRE Corporation
>7515 Colshire Drive, McLean, VA  22102
>703-983-2473 (phone); 703-983-1379 (fax)
>


Fwd: Need Help

2016-04-18 Thread Mattmann, Chris A (3980)


Sent from my iPhone

Begin forwarded message:

From: harsh kumar <kumarhars...@gmail.com>
Date: April 18, 2016 at 2:02:23 AM PDT
To: <dev-ow...@tika.apache.org>
Subject: Fwd: Need Help

Hi,

I am using Tika to detect the encoding of a file, but I found that the
results are not consistent when I use CharsetDetector and UniversalEncodingDetector
on the same file.

Could you please briefly explain the major differences between them and their
best-fit use cases?

Looking forward to your early reply.

--
Warm Regards,
Harsh Kumar



Apache Tika wikipedia page

2016-04-15 Thread Mattmann, Chris A (3980)
Hi All,

I made a Wikipedia page for Apache Tika:

https://en.wikipedia.org/wiki/Apache_Tika


Please update and edit. Thank you.

Cheers,
Chris


Re: file id comparison

2016-04-13 Thread Mattmann, Chris A (3980)
I don't think there are licensing issues and would love to contribute!

Sent from my iPhone

> On Apr 13, 2016, at 9:33 AM, Allison, Timothy B.  wrote:
> 
> All,
>  Can anyone think of licensing issues/ip issues/other concerns with running a 
> comparison of 'file', Droid and Tika on our TIKA-1302 corpus?  Other file id 
> tools to consider?
>  Anyone want to contribute to analysis?
> 
>  Cheers,
> 
>  Tim


Re: @ApacheTika , and release related tweets question

2016-04-06 Thread Mattmann, Chris A (3980)
FYI I updated the front page with a news item link to the Panama
papers.

On 4/6/16, 10:10 AM, "Mattmann, Chris A (3980)"  
wrote:

>++1 on all the feedback from you two below :)
>
>On 4/6/16, 9:08 AM, "Bob Paulin"  wrote:
>
>>Hi Nick,
>>
>>This is awesome and I think should be great for the community!  I looked 
>>to commons as an example https://twitter.com/ApacheCommons . Looks like 
>>they tweet out the releases with a link to the mailing list comments.  
>>Might be a good precedent to follow to bring attention to the fact that 
>>there is a mailing list.  Other things to consider: CVE we'd need to 
>>report publicly, committer/PMC updates, and perhaps PSAs of changes like 
>>the SVN -> Git change.  Thanks again!
>>
>>- Bob
>>
>>On 4/6/2016 7:41 AM, Nick Burch wrote:
>>> Hi All
>>>
>>> Firstly, in case you haven't heard, we've set up a Twitter account for 
>>> the project! It's @ApacheTika - https://twitter.com/ApacheTika
>>>
>>>
>>> One thing we'll want to use it for is project publicity, linking to 
>>> interesting things going on around the project, such as today's post 
>>> on how the panama papers investigation used Apache Tika and SOLR :)
>>>
>>> Another thing we can use it for is release announcements. That leads 
>>> to a question though - which parts? Should we just tweet when there's 
>>> a new release out, linking to the download and the changelog?
>>>
>>> Or would people prefer it if we tweeted when we start the countdown to 
>>> a release (to give a chance to test / get last patches ready), again 
>>> when the vote starts (to get a wider group testing and voting), and 
>>> finally when the release is out?
>>>
>>> Thoughts?
>>>
>>> Nick
>>>
>>


Re: @ApacheTika , and release related tweets question

2016-04-06 Thread Mattmann, Chris A (3980)
++1 on all the feedback from you two below :)

On 4/6/16, 9:08 AM, "Bob Paulin"  wrote:

>Hi Nick,
>
>This is awesome and I think should be great for the community!  I looked 
>to commons as an example https://twitter.com/ApacheCommons . Looks like 
>they tweet out the releases with a link to the mailing list comments.  
>Might be a good precedent to follow to bring attention to the fact that 
>there is a mailing list.  Other things to consider: CVE we'd need to 
>report publicly, committer/PMC updates, and perhaps PSAs of changes like 
>the SVN -> Git change.  Thanks again!
>
>- Bob
>
>On 4/6/2016 7:41 AM, Nick Burch wrote:
>> Hi All
>>
>> Firstly, in case you haven't heard, we've set up a Twitter account for 
>> the project! It's @ApacheTika - https://twitter.com/ApacheTika
>>
>>
>> One thing we'll want to use it for is project publicity, linking to 
>> interesting things going on around the project, such as today's post 
>> on how the panama papers investigation used Apache Tika and SOLR :)
>>
>> Another thing we can use it for is release announcements. That leads 
>> to a question though - which parts? Should we just tweet when there's 
>> a new release out, linking to the download and the changelog?
>>
>> Or would people prefer it if we tweeted when we start the countdown to 
>> a release (to give a chance to test / get last patches ready), again 
>> when the vote starts (to get a wider group testing and voting), and 
>> finally when the release is out?
>>
>> Thoughts?
>>
>> Nick
>>
>


Apache Tika used to parse the Panama papers!

2016-04-05 Thread Mattmann, Chris A (3980)
FYI:
http://www.forbes.com/sites/thomasbrewster/2016/04/05/panama-papers-amazon-encryption-epic-leak/?utm_campaign=ForbesTech&utm_source=TWITTER&utm_medium=social&utm_channel=Technology&linkId=23087770#709893771df5


BTW I know Thomas and am in touch..he wrote an article about MEMEX
last year.


Re: dependency upgrades, release 1.13?

2016-04-01 Thread Mattmann, Chris A (3980)
+1 happy to RM it :)

I’ll cut 1.13 this week or early next week.

-Original Message-
From: "Allison, Timothy B." 
Reply-To: "dev@tika.apache.org" 
Date: Friday, April 1, 2016 at 5:55 AM
To: "dev@tika.apache.org" 
Subject: dependency upgrades, release 1.13?

>All,
>  I upgraded the dependencies I was comfortable upgrading.  Do we want to
>upgrade others?
>
>Opennlp-tools -> 1.6.0
>Commons-csv -> 1.2
>Sis -> 0.6
>Jhighlight -> 1.0.3
>Vorbis-java* -> 0.8
>
>Org.json -> 20160212
>Asm -> 5.1
>
>
>Junit-> 4.12
>slf log4j -> 1.7.20
>
>I ask because I'd like to do a pre-pre-release of 1.13 large scale
>regression test some time mid-late next week, and it'd be great to have
>all "upgrades" in.
>
>Should we aim to release 1.13 by the end of April?  I'd like to wait for
>the next upgrade to POI, which should be out in a week or two.
>
>Cheers,
>
> Tim
>
>



Re: Who's going to Apache: Big Data in May?

2016-03-30 Thread Mattmann, Chris A (3980)
I may be attending briefly :) Just need to get my ducks in a row :)

-Original Message-
From: Sergey Beryozkin 
Reply-To: "dev@tika.apache.org" 
Date: Wednesday, March 30, 2016 at 5:26 AM
To: "dev@tika.apache.org" 
Subject: Re: Who's going to Apache: Big Data in May?

>I'll be arriving with my colleague Tue evening and leaving Fri evening,
>look forward to talking to all of you who will be there :-)
>
>Sergey
>On 30/03/16 13:24, Bob Paulin wrote:
>> I'll be flying in Wednesday afternoon and staying through Friday. My
>> talks are both on Thursday.  Looking forward to seeing some folks there!
>>
>> - Bob
>>
>> On 3/30/2016 6:14 AM, Nick Burch wrote:
>>> On Tue, 29 Mar 2016, Ken Krugler wrote:
>>>> I'll be giving a talk (Cascading+Flink) at the conference on Monday,
>>>> May 9th.
>>>
>>> Great! I think there's quite a few Tika and Tika-related talks
>>> happening, at least based on the schedules:
>>> http://apachebigdata2016.sched.org/?s=tika&iframe=no
>>> http://apachecon2016.sched.org/?s=tika&iframe=no
>>>
>>>> I'm planning to stay through Wednesday noon-ish.
>>>
>>> I'm there Sunday-Saturday, as I'm going to both halves of the event!
>>>
>>> Nick
>>>
>>
>



Re: GSOC2016 Sentiment Analysis

2016-03-29 Thread Mattmann, Chris A (3980)
for the classification of texts into a set of
>> >classes defined by the user, whether they are sentiment classes or
>>other.
>> >
>> >However it doesn't perform well for this purpose.
>> >
>> >Furthermore, the sentiment analysis component would not just perform
>>the
>> >naive classification but also additional tasks (e.g., quantification)
>>and
>> >implement more specific and sophisticated approaches.
>> >
>> >
>> >Please share your thoughts.
>> >
>> >
>> >Mondher
>> >
>> >On Tue, Mar 29, 2016 at 1:51 PM, Madhawa Kasun Gunasekara
>> > wrote:
>> >
>> >Hi Chris / Antony
>> >
>> >
>> >yes, I would like to work on this. This proposal addresses most of the
>> >things in sentiment analysis.
>> >
>> >AFAIK most people use the OpenNLP Document Categorizer for sentiment
>> >analysis, since there isn't proper functionality for sentiment
>> >analysis in OpenNLP. It would be great if we could add this feature to
>> >the OpenNLP project. I would also like to suggest
>> >that we should be able to detect the target object of the opinions with
>> >this feature as well.
>> >
>> >
>> >WDYT ??
>> >
>> >
>> >
>> >Thanks,
>> >
>> >Madhawa
>> >
>> >
>> >Madhawa
>> >
>> >
>> >
>> >
>> >On Tue, Mar 29, 2016 at 2:11 AM, Mattmann, Chris A (3980)
>> > wrote:
>> >
>> >Dear Anthony,
>> >
>> >Great! These both sound like fantastic proposals and I’m happy
>> >to be a mentor. Madhawa, would you like to join in on these
>> >efforts?
>> >
>> >Cheers,
>> >Chris
>> >
>> >-Original Message-
>> >From: Anthony Beylerian 
>> >Date: Monday, March 28, 2016 at 11:48 AM
>> >To: "d...@opennlp.apache.org" ,
>> >"mondher.bouaz...@gmail.com" 
>> >Cc: Madhawa Kasun Gunasekara , jpluser
>> >
>> >Subject: RE: GSOC2016 Sentiment Analysis
>> >
>> >>Dear Chris,
>> >>
>> >>Thank you for starting the discussion.
>> >>We are glad there is an interest in a sentiment analysis component.
>> >>
>> >>My colleague Mondher posted the two JIRA issues related to Sentiment
>> >>Analysis [1][2] as references for our proposals [3][4] for GSoC.
>> >>In fact, we have been researching this topic at our university.
>> >>We are hoping to participate this year and work on integrating both a
>> >>sentiment classifier and a quantifier for the library.
>> >>
>> >>It would be nice to also have an interface with Tika, maybe we can
>> >>collaborate ?
>> >>We are also looking for mentors, in case someone is willing to support
>> >>our proposals.
>> >>
>> >>Best,
>> >>
>> >>Anthony
>> >>
>> >>[1] https://issues.apache.org/jira/browse/OPENNLP-842
>> >>[2] https://issues.apache.org/jira/browse/OPENNLP-840
>> >>[3] https://docs.google.com/document/d/1nVnwpmGaOnwHERXr55IClE4V87jUX2sva-mkgWnR8n0/edit?usp=sharing
>> >>[4] https://docs.google.com/document/d/1x02II9W3rirtuSbx_sY8kOQZSgOp0SIKeIWTCXEOJvo/edit?usp=sharing
>> >>
>> >>> From: chris.a.mattm...@jpl.nasa.gov
>> >>> To: nishant@gmail.com
>> >>> CC: d...@opennlp.apache.org; madhaw...@gmail.com; hmanj...@usc.edu;
>> >>> kamal...@usc.edu
>> >>> Subject: Re: GSOC2016 Sentiment Analysis
>> >>> Date: Sun, 27 Mar 2016 19:34:24 +
>> >>>
>> >>> No problem - I just wanted to encourage discussion thank you for
>> >>> your prompt and courteous replies.
>> >>>
> 



Re: GSOC2016 Sentiment Analysis

2016-03-29 Thread Mattmann, Chris A (3980)
I like both of your comments Mondher and Madhawa. My team at USC
has been investigating the use of particular corpuses including
Fisher Callhome so as to support sentiment analysis. We have been
writing Java code outside of both OpenNLP and Tika but with the
goal of integrating them into both. We have a mix of Java and
Python code that we’d like to bring into both projects.

I’m reviewing the proposals you wrote now, but would it make sense
to have a Google Hangout this Friday, ~10am PT (Los Angeles time)?

Cheers,
Chris

-Original Message-
From: Mondher Bouazizi 
Date: Monday, March 28, 2016 at 11:46 PM
To: Madhawa Kasun Gunasekara , jpluser

Cc: Anthony Beylerian ,
"d...@opennlp.apache.org" , "dev@tika.apache.org"
, Information and Data Science Group USC List

Subject: Re: GSOC2016 Sentiment Analysis

>Dear Madhawa,
>
>
>Thank you for your interest in the proposals.
>The current tasks we proposed refer to the classification and
>quantification regardless of the topic.
>This can be used in a larger context where the topic is not specified, or
>not unique, in which case we will need to identify the topic(s).
>Therefore, a topic detector would be a good idea to implement, in order
>to complement this.
>
>
>As for the Document Categorizer, it is a general purpose component with
>basic features (n-gram, bag of words, etc.).
>
>It is basically used for the classification of texts into a set of
>classes defined by the user, whether they are sentiment classes or other.
>
>However it doesn't perform well for this purpose.
>
>Furthermore, the sentiment analysis component would not just perform the
>naive classification but also additional tasks (e.g., quantification) and
>implement more specific and sophisticated approaches.
>
>
>Please share your thoughts.
>
>
>Mondher
>
>On Tue, Mar 29, 2016 at 1:51 PM, Madhawa Kasun Gunasekara
> wrote:
>
>Hi Chris / Antony
>
>
>yes, I would like to work on this. This proposal addresses most of the
>things in sentiment analysis.
>
>AFAIK most people use the OpenNLP Document Categorizer for sentiment
>analysis, since there isn't proper functionality for sentiment
>analysis in OpenNLP. It would be great if we could add this feature to
>the OpenNLP project. I would also like to suggest
>that we should be able to detect the target object of the opinions with
>this feature as well.
>
>
>WDYT ??
>
>
>
>Thanks,
>
>Madhawa
>
>
>Madhawa
>
>
>
>
>On Tue, Mar 29, 2016 at 2:11 AM, Mattmann, Chris A (3980)
> wrote:
>
>Dear Anthony,
>
>Great! These both sound like fantastic proposals and I’m happy
>to be a mentor. Madhawa, would you like to join in on these
>efforts?
>
>Cheers,
>Chris
>
>-Original Message-
>From: Anthony Beylerian 
>Date: Monday, March 28, 2016 at 11:48 AM
>To: "d...@opennlp.apache.org" ,
>"mondher.bouaz...@gmail.com" 
>Cc: Madhawa Kasun Gunasekara , jpluser
>
>Subject: RE: GSOC2016 Sentiment Analysis
>
>>Dear Chris,
>>
>>Thank you for starting the discussion.
>>We are glad there is an interest in a sentiment analysis component.
>>
>>My colleague Mondher posted the two JIRA issues related to Sentiment
>>Analysis [1][2] as references for our proposals [3

Re: GSOC2016 Sentiment Analysis

2016-03-28 Thread Mattmann, Chris A (3980)
Dear Anthony,

Great! These both sound like fantastic proposals and I’m happy
to be a mentor. Madhawa, would you like to join in on these
efforts?

Cheers,
Chris

-Original Message-
From: Anthony Beylerian 
Date: Monday, March 28, 2016 at 11:48 AM
To: "d...@opennlp.apache.org" ,
"mondher.bouaz...@gmail.com" 
Cc: Madhawa Kasun Gunasekara , jpluser

Subject: RE: GSOC2016 Sentiment Analysis

>Dear Chris,
>
>Thank you for starting the discussion.
>We are glad there is an interest in a sentiment analysis component.
>
>My colleague Mondher posted the two JIRA issues related to Sentiment
>Analysis [1][2] as references for our proposals [3][4] for GSoC.
>In fact, we have been researching this topic at our university.
>We are hoping to participate this year and work on integrating both a
>sentiment classifier and a quantifier for the library.
>
>It would be nice to also have an interface with Tika, maybe we can
>collaborate ?
>We are also looking for mentors, in case someone is willing to support
>our proposals.
>
>Best,
>
>Anthony
>
>[1] https://issues.apache.org/jira/browse/OPENNLP-842
>[2] https://issues.apache.org/jira/browse/OPENNLP-840
>[3] https://docs.google.com/document/d/1nVnwpmGaOnwHERXr55IClE4V87jUX2sva-mkgWnR8n0/edit?usp=sharing
>[4] https://docs.google.com/document/d/1x02II9W3rirtuSbx_sY8kOQZSgOp0SIKeIWTCXEOJvo/edit?usp=sharing
>
>> From: chris.a.mattm...@jpl.nasa.gov
>> To: nishant@gmail.com
>> CC: d...@opennlp.apache.org; madhaw...@gmail.com; hmanj...@usc.edu;
>>kamal...@usc.edu
>> Subject: Re: GSOC2016 Sentiment Analysis
>> Date: Sun, 27 Mar 2016 19:34:24 +
>> 
>> No problem - I just wanted to encourage discussion thank you for
>> your prompt and courteous replies.
>> 
>
>



Re: GSOC2016 Sentiment Analysis

2016-03-27 Thread Mattmann, Chris A (3980)
Nishant,

I’m not sure what you are talking about, at all. It’s part of the
engagement process in GSoC to *engage the community*. At Apache
this is done on list.

I’ve been on this list for months and there is about zero traffic.
Which is not good. Traffic, like this, *is good*. It shows there
is a healthy community that actually discusses things.

Madhawa, you don’t need to take this conversation off list, and
precisely the opposite. The conversation must be kept on list.

Chris

-Original Message-
From: Nishant Kelkar 
Date: Sunday, March 27, 2016 at 11:44 AM
To: 
Cc: Madhawa Kasun Gunasekara , Harshavardhan
Manjunatha , Information and Data Science Group USC List
, "kamal...@usc.edu" ,
"dev@tika.apache.org" 
Subject: Re: GSOC2016 Sentiment Analysis

>Hi Madhawa,
>Could you take this discussion off the dev openNLP list for other
>problems concerning logging in, participation, etc. now that you have a
>positive response? In my humble opinion, that would prevent others not
>involved in your discussion from getting email about the topic.
>
>Good luck!
>
>Best Regards,
>Nishant
>
>
>On Sun, Mar 27, 2016 at 6:37 AM, Mattmann, Chris A (3980)
> wrote:
>
>Thanks please can you create a username with no spaces?
>
>Sent from my iPhone
>
>On Mar 27, 2016, at 2:20 AM, Madhawa Kasun Gunasekara
><madhaw...@gmail.com> wrote:
>
>Hi Chris,
>
>Thanks for the reply. I tried to log in to [1], but I wasn't able to;
>my username is "Madhawa Gunasekara"
>[1] https://wiki.apache.org/tika/GSoC2016
>
>I have created a jira issue on
>https://issues.apache.org/jira/browse/TIKA-1911
>
>Thanks,
>Madhawa
>
>Madhawa
>
>On Sat, Mar 26, 2016 at 3:21 AM, Mattmann, Chris A (3980)
><chris.a.mattm...@jpl.nasa.gov> wrote:
>Thanks Harsha. Yes, I know about the Fisher Callhome Corpus. There
>is related data in there that can be used for sentiment analysis :)
>It can be adapted and is being used for that.
>
>Anyways, yes looking forward to the task. Please send in your proposal
>Madhawa.
>
>-Original Message-
>From: Harshavardhan Manjunatha <hmanj...@usc.edu>
>Date: Friday, March 25, 2016 at 2:45 PM
>To: jpluser <chris.a.mattm...@jpl.nasa.gov>
>Cc: "d...@opennlp.apache.org" <d...@opennlp.apache.org>,
>Information and Data Science Group USC List <ird...@mymaillists.usc.edu>,
>"kamal...@usc.edu" <kamal...@usc.edu>,
>"dev@tika.apache.org" <dev@tika.apache.org>
>Subject: Re: GSOC2016 Sentiment Analysis
>
>>Dear Prof. Mattmann,
>>
>>
>>Thanks. But the Fisher Callhome Corpus is a training corpus for machine
>>translation between Spanish and English.
>>
>>
>>I don't think it can be adapted to sentiment analysis.
>>
>>
>>Developing a generic training model/corpus for Sentiment Analysis that
>>encapsulates social media, movie reviews, etc, etc will be a Challenging
>>& Exciting Task !!
>>
>>
>>Regards,
>>Harsha
>>
>>
>>On Fri, Mar 25, 2016 at 2:42 PM, Mattmann, Chris A (3980)
>>mailto:chris.a.matt

Re: GSOC2016 Sentiment Analysis

2016-03-27 Thread Mattmann, Chris A (3980)
Thanks please can you create a username with no spaces?

Sent from my iPhone

On Mar 27, 2016, at 2:20 AM, Madhawa Kasun Gunasekara
<madhaw...@gmail.com> wrote:

Hi Chris,

Thanks for the reply. I tried to log in to [1], but I wasn't able to;
my username is "Madhawa Gunasekara"
[1] https://wiki.apache.org/tika/GSoC2016

I have created a jira issue on https://issues.apache.org/jira/browse/TIKA-1911

Thanks,
Madhawa

Madhawa

On Sat, Mar 26, 2016 at 3:21 AM, Mattmann, Chris A (3980)
<chris.a.mattm...@jpl.nasa.gov> wrote:
Thanks Harsha. Yes, I know about the Fisher Callhome Corpus. There
is related data in there that can be used for sentiment analysis :)
It can be adapted and is being used for that.

Anyways, yes looking forward to the task. Please send in your proposal
Madhawa.

-Original Message-
From: Harshavardhan Manjunatha <hmanj...@usc.edu>
Date: Friday, March 25, 2016 at 2:45 PM
To: jpluser <chris.a.mattm...@jpl.nasa.gov>
Cc: "d...@opennlp.apache.org" <d...@opennlp.apache.org>,
Information and Data Science Group USC List <ird...@mymaillists.usc.edu>,
"kamal...@usc.edu" <kamal...@usc.edu>,
"dev@tika.apache.org" <dev@tika.apache.org>
Subject: Re: GSOC2016 Sentiment Analysis

>Dear Prof. Mattmann,
>
>
>Thanks. But the Fisher Callhome Corpus is a training corpus for machine
>translation between Spanish and English.
>
>
>I don't think it can be adapted to sentiment analysis.
>
>
>Developing a generic training model/corpus for Sentiment Analysis that
>encapsulates social media, movie reviews, etc, etc will be a Challenging
>& Exciting Task !!
>
>
>Regards,
>Harsha
>
>
>On Fri, Mar 25, 2016 at 2:42 PM, Mattmann, Chris A (3980)
><chris.a.mattm...@jpl.nasa.gov> wrote:
>
>Sounds great Harsha. This is for Google Summer of Code, so collaborating
>would be great, and in this case, we would be working with Madhawa, should
>he choose to accept.
>
>-Original Message-
>From: Harshavardhan Manjunatha <hmanj...@usc.edu>
>Date: Friday, March 25, 2016 at 2:38 PM
>To: jpluser <chris.a.mattm...@jpl.nasa.gov>
>Cc: "d...@opennlp.apache.org" <d...@opennlp.apache.org>,
>Information and Data Science Group USC List <ird...@mymaillists.usc.edu>,
>"kamal...@usc.edu" <kamal...@usc.edu>,
>"dev@tika.apache.org" <dev@tika.apache.org>
>Subject: Re: GSOC2016 Sentiment Analysis
>
>>Dear Prof. Mattmann,
>>
>>
>>I would love to collaborate on this & am interested in developing
>>Sentiment Analysis Tika Parsers leveraging Apache OpenNLP.
>>
>>
>>I have completed an Applied NLP course @ USC.
>>
>>
>>I have done a Literature Review of Papers & Open Source Tools on the same
>>recently.
>>
>>
>>Regards,
>>Harsha
>>
>>
>>On Fri, Mar 25, 2016 at 2:07 PM, Mattmann, Chris A (39

Re: GSOC2016 Sentiment Analysis

2016-03-25 Thread Mattmann, Chris A (3980)
Thanks Harsha. Yes, I know about the Fisher Callhome Corpus. There
is data related in there that can be used for sentiment analysis :)
It can be adapted and is being used for that.

Anyways, yes looking forward to the task. Please send in your proposal
Madhawa.

-Original Message-
From: Harshavardhan Manjunatha 
Date: Friday, March 25, 2016 at 2:45 PM
To: jpluser 
Cc: "d...@opennlp.apache.org" , Information and
Data Science Group USC List ,
"kamal...@usc.edu" , "dev@tika.apache.org"

Subject: Re: GSOC2016 Sentiment Analysis

>Dear Prof. Mattmann,
>
>
>Thanks. But the Fisher Callhome Corpus is a training Corpus for Machine
>Translation b/w Spanish & English.
>
>
>I don't think it can be adapted to Sentiment Analysis.
>
>
>Developing a generic training model/corpus for Sentiment Analysis that
>encapsulates social media, movie reviews, etc, etc will be a Challenging
>& Exciting Task !!
>
>
>Regards,
>Harsha
>
>
>On Fri, Mar 25, 2016 at 2:42 PM, Mattmann, Chris A (3980)
> wrote:
>
>Sounds great Harsha. This is for Google Summer of Code, so collaborating
>would be great, and in this case, we would be working with Madhawa, should
>he choose to accept.
>
>++
>Chris Mattmann, Ph.D.
>Chief Architect
>Instrument Software and Science Data Systems Section (398)
>NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>Office: 168-519, Mailstop: 168-527
>Email: chris.a.mattm...@nasa.gov
>WWW:  http://sunset.usc.edu/~mattmann/
>++
>Director, Information Retrieval and Data Science Group (IRDS)
>Adjunct Associate Professor, Computer Science Department
>University of Southern California, Los Angeles, CA 90089 USA
>WWW: http://irds.usc.edu/
>++
>
>
>
>
>
>-Original Message-
>From: Harshavardhan Manjunatha 
>Date: Friday, March 25, 2016 at 2:38 PM
>To: jpluser 
>Cc: "d...@opennlp.apache.org" , Information and
>Data Science Group USC List ,
>"kamal...@usc.edu" , "dev@tika.apache.org"
>
>Subject: Re: GSOC2016 Sentiment Analysis
>
>>Dear Prof. Mattmann,
>>
>>
>>I would love to collaborate on this & am interested in developing
>>Sentiment Analysis Tika Parsers leveraging Apache OpenNLP.
>>
>>
>>I have completed an Applied NLP course @ USC.
>>
>>
>>I have done a Literature Review of Papers & Open Source Tools on the same
>>recently.
>>
>>
>>Regards,
>>Harsha
>>
>>
>>On Fri, Mar 25, 2016 at 2:07 PM, Mattmann, Chris A (3980)
>> wrote:
>>
>>Hi Madhawa,
>>
>>
>>
>>So, how about a project that develops and contributes an Apache
>>
>>Tika and OpenNLP based SentimentAnalysisParser?
>>
>>
>>
>>I have some students currently doing work using the Fisher Callhome
>>
>>Corpus and you can build off that. I am CC’ing my USC IRDS team
>>
>>and my student Indhu who is working on this.
>>
>>
>>
>>Can you start working on your proposal by:
>>
>>
>>
>>1. Creating a JIRA issue here:
>>
>>http://issues.apache.org/jira/browse/TIKA
>>
>> tag it with ‘gsoc2016’, ‘memex’, and ‘irds’ please
>>
>>
>>
>>2. Develop a proposal on the Tika wiki here:
>>
>>http://wiki.apache.org/tika/GSoC2016

Re: GSOC2016 Sentiment Analysis

2016-03-25 Thread Mattmann, Chris A (3980)
Sounds great Harsha. This is for Google Summer of Code, so collaborating
would be great, and in this case, we would be working with Madhawa, should
he choose to accept.

-Original Message-
From: Harshavardhan Manjunatha 
Date: Friday, March 25, 2016 at 2:38 PM
To: jpluser 
Cc: "d...@opennlp.apache.org" , Information and
Data Science Group USC List ,
"kamal...@usc.edu" , "dev@tika.apache.org"

Subject: Re: GSOC2016 Sentiment Analysis

>Dear Prof. Mattmann,
>
>
>I would love to collaborate on this & am interested in developing
>Sentiment Analysis Tika Parsers leveraging Apache OpenNLP.
>
>
>I have completed an Applied NLP course @ USC.
>
>
>I have done a Literature Review of Papers & Open Source Tools on the same
>recently.
>
>
>Regards,
>Harsha
>
>
>On Fri, Mar 25, 2016 at 2:07 PM, Mattmann, Chris A (3980)
> wrote:
>
>Hi Madhawa,
>
>
>
>So, how about a project that develops and contributes an Apache
>
>Tika and OpenNLP based SentimentAnalysisParser?
>
>
>
>I have some students currently doing work using the Fisher Callhome
>
>Corpus and you can build off that. I am CC’ing my USC IRDS team
>
>and my student Indhu who is working on this.
>
>
>
>Can you start working on your proposal by:
>
>
>
>1. Creating a JIRA issue here:
>
>http://issues.apache.org/jira/browse/TIKA
>
> tag it with ‘gsoc2016’, ‘memex’, and ‘irds’ please
>
>
>
>2. Develop a proposal on the Tika wiki here:
>
>http://wiki.apache.org/tika/GSoC2016
> (you will need permission, first
>
>sign up for your account on the wiki then tell me your username so I
>
>can add permissions for you)
>
>
>
>3. Apply through the Google Summer of Code 2016 program.
>
>
>
>4. Get in touch with me, and Indhu, and keep dev@tika.a.o and
>
>dev@opennlp.a.o and ird...@usc.edu in the loop so we can discuss together
>
>as a community.
>
>
>
>Cool?
>
>
>
>Cheers,
>
>Chris
>
>
>
>++
>
>Chris Mattmann, Ph.D.
>
>Chief Architect
>
>Instrument Software and Science Data Systems Section (398)
>
>NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>
>Office: 168-519, Mailstop: 168-527
>
>Email: chris.a.mattm...@nasa.gov
>
>WWW:  http://sunset.usc.edu/~mattmann/
>
>++
>
>Director, Information Retrieval and Data Science Group (IRDS)
>
>Adjunct Associate Professor, Computer Science Department
>
>University of Southern California, Los Angeles, CA 90089 USA
>
>WWW: http://irds.usc.edu/
>
>++
>
>
>
>
>
>
>
>
>
>
>
>-Original Message-
>
>From: Madhawa Kasun Gunasekara 
>
>Reply-To: "d...@opennlp.apache.org" 
>
>Date: Wednesday, March 16, 2016 at 10:51 PM
>
>To: "d...@opennlp.apache.org" 
>
>Subject: GSOC2016 Sentiment Analysis
>
>
>
>>Hi
>
>>
>
>>I am interested in contributing to OPENNLP-840: "Sentiment Analysis" for
>
>>GSOC2016 this time. Since I have been engaged with some similar projects, I
>
>>think it will be a great experience for me.
>
>>
>
>>I am a final year student in IESL College of Engineering, Sri lanka. I
>
>>have
>
>>learned machine learning and natural language processing stuff when I'm

Re: Change to NER ParserTest re https://builds.apache.org/job/tika-2.x/57

2016-03-25 Thread Mattmann, Chris A (3980)
Hey Tim,

I’ll take a look. Would be good to add the @AfterClass for sure though.
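
The cleanup Tim asks about can be sketched in plain Java, so that it runs without JUnit on the classpath. In a JUnit 4 test the constructor call would sit in a @BeforeClass method and restore() in an @AfterClass method. The class name SystemPropertyGuard and the key "example.ner.impl" are placeholders made up for this illustration; the real key is whatever NamedEntityParser.SYS_PROP_NER_IMPL holds.

```java
/**
 * Remembers a system property's previous value, sets a temporary one, and
 * puts the previous value back on restore(), so that one test class does not
 * leak its setting into the next.
 */
final class SystemPropertyGuard {
    private final String key;
    private final String previous; // null means the property was unset before

    SystemPropertyGuard(String key, String temporaryValue) {
        this.key = key;
        this.previous = System.getProperty(key);
        System.setProperty(key, temporaryValue);
    }

    /** Call this from the @AfterClass method. */
    void restore() {
        if (previous == null) {
            System.clearProperty(key);
        } else {
            System.setProperty(key, previous);
        }
    }

    public static void main(String[] args) {
        // Hypothetical key and class name, for illustration only.
        String key = "example.ner.impl";
        SystemPropertyGuard guard =
                new SystemPropertyGuard(key, "com.example.FakeRecogniser");
        System.out.println("during tests: " + System.getProperty(key));
        guard.restore();
        System.out.println("after restore: " + System.getProperty(key));
    }
}
```

Restoring the earlier value, rather than always clearing the property, keeps a value passed with -D on the command line intact for later tests.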

Cheers,
Chris

-Original Message-
From: "Allison, Timothy B." 
Reply-To: "dev@tika.apache.org" 
Date: Tuesday, March 22, 2016 at 6:55 PM
To: "dev@tika.apache.org" 
Subject: Change to NER ParserTest re
https://builds.apache.org/job/tika-2.x/57

>Chris et al,
>
>  I wasn't able to replicate this test failure on Windows or rhel.  My
>_guess_ is that the OpenNLPNERecogniser is not actually being called,
>only the Regex extractor.
>
> Is it possible that a different ordering of tests could land you in
>NamedEntityParserTest's testParse with a
>NamedEntityParser.SYS_PROP_NER_IMPL set to the Regex parser alone?
>
>  I added an explicit setting of the OpenNLPNERecogniser in
>testParse()...we'll see if that fixes it.  If that does work, am I
>masking something that you wanted to test for?
>
>  Finally, should we add an @AfterClass clean up of the system property
>to these tests so that we don't leave a residue of the tests set in the
>system properties?
>
>Cheers,
>
>  Tim
>
>
>   
>
>-Original Message-
>From: Hudson (JIRA) [mailto:j...@apache.org]
>Sent: Tuesday, March 22, 2016 3:03 PM
>To: dev@tika.apache.org
>Subject: [jira] [Commented] (TIKA-1855) TIka 2.0 - Move shared test-code
>back to tika-core and distribute test files to parser modules
>
>
>[ https://issues.apache.org/jira/browse/TIKA-1855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15207057#comment-15207057 ]
>
>Hudson commented on TIKA-1855:
>--
>
>UNSTABLE: Integrated in tika-2.x #57 (See
>[https://builds.apache.org/job/tika-2.x/57/])
>TIKA-1855 -- test file ending in .bin was git-ignored and fix (tallison:
>rev 4390fba1317d5ee72af6a76f227311cc338043f5)
>* tika-test-resources/src/test/resources/test-documents/test-malformed-header.html.bin
>* tika-app/src/test/java/org/apache/tika/embedder/ExternalEmbedderTest.java
>* .gitignore
>
>
>> TIka 2.0 - Move shared test-code back to tika-core and distribute test
>>files to parser modules
>> 
>>-
>>-
>>
>> Key: TIKA-1855
>> URL: https://issues.apache.org/jira/browse/TIKA-1855
>> Project: Tika
>>  Issue Type: Sub-task
>>Reporter: Tim Allison
>>Assignee: Tim Allison
>>
>> Undo TIKA-1851, and divide test docs to appropriate parser modules.
>
>
>
>--
>This message was sent by Atlassian JIRA
>(v6.3.4#6332)



Re: GSOC2016 Sentiment Analysis

2016-03-25 Thread Mattmann, Chris A (3980)
Hi Madhawa,

So, how about a project that develops and contributes an Apache
Tika and OpenNLP based SentimentAnalysisParser?

I have some students currently doing work using the Fisher Callhome
Corpus and you can build off that. I am CC’ing my USC IRDS team
and my student Indhu who is working on this.

Can you start working on your proposal by:

1. Creating a JIRA issue here:
http://issues.apache.org/jira/browse/TIKA
 tag it with ‘gsoc2016’, ‘memex’, and ‘irds’ please

2. Develop a proposal on the Tika wiki here:
http://wiki.apache.org/tika/GSoC2016 (you will need permission, first
sign up for your account on the wiki then tell me your username so I
can add permissions for you)

3. Apply through the Google Summer of Code 2016 program.

4. Get in touch with me, and Indhu, and keep dev@tika.a.o and
dev@opennlp.a.o and ird...@usc.edu in the loop so we can discuss together
as a community.

Cool?

Cheers,
Chris

-Original Message-
From: Madhawa Kasun Gunasekara 
Reply-To: "d...@opennlp.apache.org" 
Date: Wednesday, March 16, 2016 at 10:51 PM
To: "d...@opennlp.apache.org" 
Subject: GSOC2016 Sentiment Analysis

>Hi
>
>I am interested in contributing to OPENNLP-840: "Sentiment Analysis" for
>GSOC2016 this time. Since I have been engaged with some similar projects, I
>think it will be a great experience for me.
>
>I am a final year student in IESL College of Engineering, Sri Lanka. I
>learned machine learning and natural language processing while doing my
>first degree (Computer Science) at the University of Sri Jayewardhenapura.
>
>In my internship period, I actively contributed to a Twitter-based NLP
>project, and we published an article at an IEEE conference, "Real-time
>Natural Language Processing for Crowdsourced Road Traffic Alerts" [2].
>
>Please let me know what you think and what you suggest.
>
>Please kindly give me further information on how I could proceed. I
>wasn't able to find the mentioned paper "Multi-Class Sentiment Analysis
>in Twitter: a Pattern-Based Approach".
>[1] https://issues.apache.org/jira/browse/OPENNLP-840
>[2] http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=7377667
>
>Thanks
>Madhawa Gunasekara



Re: Need suggestion on file type .HFA to be added Tika.

2016-03-02 Thread Mattmann, Chris A (3980)
I agree with Nick’s replies here

-Original Message-
From: Nandan Padar Chandrashekar 
Reply-To: "dev@tika.apache.org" 
Date: Wednesday, March 2, 2016 at 12:19 AM
To: "dev@tika.apache.org" 
Subject: Need suggestion on file type .HFA to be added Tika.

>Hi All,
>
>I have identified the Hierarchical File Architecture (HFA) file format,
>which Tika does not currently detect.
>
>File format details:
>
>Extension: *.hfa
>The header tag contains the string EHFA_HEADER_TAG
>
>Links:
>
>1. ftp://ftp.ecn.purdue.edu/jshan/86/help/html/appendices/hfa_object_directory.htm
>
>2. ftp://ftp.ecn.purdue.edu/jshan/86/help/html/appendices/Ehfa_HeaderTag.htm
>
>Should this be considered a custom MIME type or a standard MIME type?
>
>I need a suggestion for the content type (MIME type) of this file format.
>
>Regards
>Nandan Padar Chandrashekar



Re: trunk build failing in bundle --, cxf class not found for GrobidRESTParser?

2016-03-02 Thread Mattmann, Chris A (3980)
yeah maybe you’re right thanks for fixing it guys

-Original Message-
From: "Allison, Timothy B." 
Reply-To: "dev@tika.apache.org" 
Date: Wednesday, March 2, 2016 at 6:30 AM
To: "dev@tika.apache.org" 
Subject: RE: trunk build failing in bundle --, cxf class not found for
GrobidRESTParser?

>There's a chance you hadn't merged my breaking commit?
>
>-----Original Message-
>From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov]
>Sent: Wednesday, March 02, 2016 9:27 AM
>To: dev@tika.apache.org
>Subject: Re: trunk build failing in bundle --, cxf class not found for
>GrobidRESTParser?
>
>Wow, this is super odd. The last thing I committed was NLTK, and it built
>fine locally; I tested before committing.
>
>++
>Chris Mattmann, Ph.D.
>Chief Architect
>Instrument Software and Science Data Systems Section (398) NASA Jet
>Propulsion Laboratory Pasadena, CA 91109 USA
>Office: 168-519, Mailstop: 168-527
>Email: chris.a.mattm...@nasa.gov
>WWW:  http://sunset.usc.edu/~mattmann/
>++
>Adjunct Associate Professor, Computer Science Department University of
>Southern California, Los Angeles, CA 90089 USA
>++
>
>
>
>
>
>-Original Message-
>From: "Allison, Timothy B." 
>Reply-To: "dev@tika.apache.org" 
>Date: Wednesday, March 2, 2016 at 4:26 AM
>To: "dev@tika.apache.org" 
>Subject: trunk build failing in bundle --, cxf class not found for
>GrobidRESTParser?
>
>>Anyone have an idea why trunk is now failing?  I couldn't find any
>>changes between the last successful build and last night's failures
>>that would explain this.
>>
>>
>>Test set: org.apache.tika.bundle.BundleIT
>>---
>>---
>>-
>>Tests run: 9, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 21.997 sec <<< FAILURE!
>>testTikaBundle(org.apache.tika.bundle.BundleIT)  Time elapsed: 2.374 sec <<< ERROR!
>>java.lang.ClassNotFoundException: org.apache.cxf.jaxrs.ext.multipart.ContentDisposition not found by org.apache.tika.bundle [17]
>>  at org.apache.felix.framework.BundleWiringImpl.findClassOrResourceByDelegation(BundleWiringImpl.java:1558)
>>  at org.apache.felix.framework.BundleWiringImpl.access$400(BundleWiringImpl.java:79)
>>  at org.apache.felix.framework.BundleWiringImpl$BundleClassLoader.loadClass(BundleWiringImpl.java:1998)
>>  at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>>  at org.apache.tika.parser.journal.GrobidRESTParser.parse(GrobidRESTParser.java:69)
>>  at org.apache.tika.parser.journal.JournalParser.parse(JournalParser.java:60)
>>  at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>>
>>
>>-Original Message-
>>From: Hudson (JIRA) [mailto:j...@apache.org]
>>Sent: Tuesday, March 01, 2016 9:59 PM
>>To: dev@tika.apache.org
>>Subject: [jira] [Commented] (TIKA-1857) Enhance PDFParser to extract
>>text from XFA forms
>>
>>
>>[ https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15174937#comment-15174937 ]
>>
>>Hudson commented on TIKA-1857:
>>--
>>
>>UNSTABLE: Integrated in tika-trunk-jdk1.7 #916 (See
>>[https://builds.apache.org/job/tika-trunk-jdk1.7/916/])
>>TIKA-1857: add basic XFA extraction support via Pascal Essiembre.
>>(tallison: rev dbefe9830b26d05f9ce53503565a069bcc63d7c1)
>>* tika-parsers/src/test/resources/test-documents/testPDF_XFA_govdocs1_258578.pdf
>>* 
>>tika-parsers/src/test/java/org/apache/tika/parser/pdf/PDFParser

Re: trunk build failing in bundle --, cxf class not found for GrobidRESTParser?

2016-03-02 Thread Mattmann, Chris A (3980)
Wow, this is super odd. The last thing I committed was NLTK, and it
built fine locally; I tested before committing.

-Original Message-
From: "Allison, Timothy B." 
Reply-To: "dev@tika.apache.org" 
Date: Wednesday, March 2, 2016 at 4:26 AM
To: "dev@tika.apache.org" 
Subject: trunk build failing in bundle --, cxf class not found for
GrobidRESTParser?

>Anyone have an idea why trunk is now failing?  I couldn't find any
>changes between the last successful build and last night's failures that
>would explain this.
>
>
>Test set: org.apache.tika.bundle.BundleIT
>--
>-
>Tests run: 9, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 21.997 sec <<< FAILURE!
>testTikaBundle(org.apache.tika.bundle.BundleIT)  Time elapsed: 2.374 sec <<< ERROR!
>java.lang.ClassNotFoundException: org.apache.cxf.jaxrs.ext.multipart.ContentDisposition not found by org.apache.tika.bundle [17]
>   at org.apache.felix.framework.BundleWiringImpl.findClassOrResourceByDelegation(BundleWiringImpl.java:1558)
>   at org.apache.felix.framework.BundleWiringImpl.access$400(BundleWiringImpl.java:79)
>   at org.apache.felix.framework.BundleWiringImpl$BundleClassLoader.loadClass(BundleWiringImpl.java:1998)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>   at org.apache.tika.parser.journal.GrobidRESTParser.parse(GrobidRESTParser.java:69)
>   at org.apache.tika.parser.journal.JournalParser.parse(JournalParser.java:60)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>
>
>-Original Message-
>From: Hudson (JIRA) [mailto:j...@apache.org]
>Sent: Tuesday, March 01, 2016 9:59 PM
>To: dev@tika.apache.org
>Subject: [jira] [Commented] (TIKA-1857) Enhance PDFParser to extract text
>from XFA forms
>
>
>[ https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15174937#comment-15174937 ]
>
>Hudson commented on TIKA-1857:
>--
>
>UNSTABLE: Integrated in tika-trunk-jdk1.7 #916 (See
>[https://builds.apache.org/job/tika-trunk-jdk1.7/916/])
>TIKA-1857: add basic XFA extraction support via Pascal Essiembre.
>(tallison: rev dbefe9830b26d05f9ce53503565a069bcc63d7c1)
>* tika-parsers/src/test/resources/test-documents/testPDF_XFA_govdocs1_258578.pdf
>* tika-parsers/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
>* tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParserConfig.java
>* tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java
>* tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java
>* tika-parsers/src/main/resources/org/apache/tika/parser/pdf/PDFParser.properties
>* tika-parsers/src/main/java/org/apache/tika/parser/pdf/XFAExtractor.java
>TIKA-1857: add basic XFA extraction support via Pascal Essiembre.
>(tallison: rev 7c245fa87507cf0887838001c54c65b79b7e7cbc)
>* CHANGES.txt
>
>
>> Enhance PDFParser to extract text from XFA forms
>> 
>>
>> Key: TIKA-1857
>> URL: https://issues.apache.org/jira/browse/TIKA-1857
>> Project: Tika
>>  Issue Type: Improvement
>>  Components: parser
>>Reporter: Pascal Essiembre
>>  Labels: patch
>> Fix For: 1.13
>>
>> Attachments: 041617_filled_out.pdf, govdocs1_xfas.zip,
>>xfa_in_govdocs1.txt
>>
>>
>> Extract text from PDF Forms (XFA).  Information about XFA:
>>https://en.wikipedia.org/wiki/XFA
>
>
>
>--
>This message was sent by Atlassian JIRA
>(v6.3.4#6332)



Re: parallel dev on trunk and 2.x?

2016-02-25 Thread Mattmann, Chris A (3980)
+1. I haven’t fully moved over to 2.x yet because I honestly haven’t
had time to catch up. I suppose I will have time after my class in May,
and I can focus more on 2.x then. For now I am doing all my work in
1.x, so keeping it up to date would be great.

-Original Message-
From: "Allison, Timothy B." 
Reply-To: "dev@tika.apache.org" 
Date: Thursday, February 25, 2016 at 12:50 PM
To: "dev@tika.apache.org" 
Subject: parallel dev on trunk and 2.x?

>All,
>  Do I understand correctly that we should be committing most changes to
>both trunk and 2.x?  Obviously, the 2.x commits are for 2.x. :)
>  Or will merge really, actually, truly work at some point in the future
>to merge changes in trunk to 2.x?
>
>Best,
>
>   Tim
>
>-Original Message-
>From: Hudson (JIRA) [mailto:j...@apache.org]
>Sent: Thursday, February 25, 2016 1:41 PM
>To: dev@tika.apache.org
>Subject: [jira] [Commented] (TIKA-1874) Fix rare npe in XWPFListManager
>
>
>[ https://issues.apache.org/jira/browse/TIKA-1874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15167620#comment-15167620 ]
>
>Hudson commented on TIKA-1874:
>--
>
>SUCCESS: Integrated in tika-2.x #31 (See
>[https://builds.apache.org/job/tika-2.x/31/])
>TIKA-1874 fix small npe (tallison: rev
>5083cc11c6230218ecef7d0161fa92bbf8d317e6)
>* tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/XWPFListManager.java
>
>
>> Fix rare npe in XWPFListManager
>> ---
>>
>> Key: TIKA-1874
>> URL: https://issues.apache.org/jira/browse/TIKA-1874
>> Project: Tika
>>  Issue Type: Bug
>>Reporter: Tim Allison
>>Priority: Trivial
>>
>> Many thanks to [~centic]'s
>>[CommonCrawlDocumentDownload|https://github.com/centic9/CommonCrawlDocumentDownload],
>>I recently grabbed .docx files from the initial index that
>>comes with that code.  I'll be adding these docs to our regular
>>regression testing for TIKA-1302.
>> While running Tika on these ~166k docs, ~30 of those files had an NPE
>>in XWPFListManager.  We need to add a null check.
>
>
>
>--
>This message was sent by Atlassian JIRA
>(v6.3.4#6332)



Re: Integrating Tika with MITLL Text.jl library for language detection

2016-02-23 Thread Mattmann, Chris A (3980)
Thanks Ken. 

We are working on bringing in Text.jl and prefer at this point
to work on the 1.x branch, aka master. I’ve asked Trevor to take a look
at the 1.x branch and at pulling your code for the tika-detect
module from 2.x into 1.x, and then to look at adding Text.jl from
MIT-LL as a corresponding implementation there. It’s a REST-based
server that he set up in Julia that accepts PUT requests. We should be
able to start out with Text.jl and then generalize to any REST service
that will perform language identification later.
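
As a rough sketch of what a generic client for such a service might look like, the class below sends text in a PUT request and hands back the raw response body. It assumes Java 11 or later for java.net.http. The class name RestLanguageClient, the endpoint http://localhost:8000/detect and the text/plain content type are assumptions made for illustration; the thread does not say what path or headers Trevor's Julia server expects, nor what the JSON reply looks like, so the body is returned unparsed.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

/** Minimal client for a language-identification service that accepts PUT. */
final class RestLanguageClient {
    private final URI endpoint;

    RestLanguageClient(URI endpoint) {
        this.endpoint = endpoint;
    }

    /** Builds the PUT request carrying the text to identify. */
    HttpRequest buildRequest(String text) {
        return HttpRequest.newBuilder(endpoint)
                .header("Content-Type", "text/plain; charset=UTF-8")
                .PUT(HttpRequest.BodyPublishers.ofString(text, StandardCharsets.UTF_8))
                .build();
    }

    /** Sends the request and returns the response body as received. */
    String detect(String text) throws IOException, InterruptedException {
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(buildRequest(text), HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected HTTP status " + response.statusCode());
        }
        return response.body();
    }

    public static void main(String[] args) {
        // Hypothetical endpoint; only builds the request, nothing is sent.
        RestLanguageClient client =
                new RestLanguageClient(URI.create("http://localhost:8000/detect"));
        HttpRequest request = client.buildRequest("Bonjour tout le monde");
        System.out.println(request.method() + " " + request.uri());
    }
}
```

Keeping the endpoint as a constructor argument is what lets the same client point at Text.jl first and at any other REST language identifier later.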

You can see the issue from before here:

https://issues.apache.org/jira/browse/TIKA-1696


-Original Message-
From: Ken Krugler 
Date: Tuesday, February 23, 2016 at 11:14 AM
To: "dev@tika.apache.org" 
Cc: jpluser , "Ramirez, Paul M (398M)"

Subject: RE: Integrating Tika with MITLL Text.jl library for language
detection

>
>
>
>Hi Trevor,
>
>
>1. I assume the benchmark was using a pre-2.0 version of Tika, yes?
>
>
>It would be great to try out the current support in the 2.0 branch, as a
>comparison with what we had previously.
>
>
>Also, details on the test corpus used would be useful.
>
>
>2. I started using the ServiceLoader pattern to support dynamic loading
>of language detectors
>
>
>There's a bit more work to move the common support classes
>(LanguageWriter, etc) from the specific implementation sub-project into
>core
>
>
>Once that's done you should be able to try out directly adding your
>integration with Text.jl
>
>
>-- Ken
>
>
>________
>From: Trevor Claude Lewis
>Sent: February 23, 2016 10:55:46am PST
>To: dev@tika.apache.org
>Cc: Mattmann, Chris A (3980); Ramirez, Paul M (398M);
>kkrugler_li...@transpac.com
>Subject: Integrating Tika with MITLL Text.jl library for language
>detection
>
>
>Hi all,
>
>I am Trevor, a grad student at USC currently working with Prof.
>Chris Mattmann and Paul Ramirez on integrating Tika with MIT Lincoln
>Lab’s Text.jl library for language detection.
>https://issues.apache.org/jira/browse/TIKA-1696
>
>Since Text.jl is written in Julia, I have created a Julia HTTP server
>which accepts PUT request data and returns the language of the data as a
>JSON string.
>https://github.com/trevorlewis/csci572dr.git
>
>I have also benchmarked the language identification results of the Julia
>HTTP server against the Tika 1.11 language detector.
>https://docs.google.com/spreadsheets/d/1cW6S2WpiN08pZ3UMVGMyQkO-fotUiUyGRemCrbC1miY/edit?usp=sharing
>
>I was also looking at the work done by Ken Krugler on Tika's 2.x branch
>language detection and I was planning to fork that project and add the
>Text.jl implementation.
>https://issues.apache.org/jira/browse/TIKA-1723
>
>I wanted to gather any input and feedback on this project.
>
>
>Thanks,
>
>Trevor Lewis
>lewis...@usc.edu
>
>
>
>
>
>--
>Ken Krugler
>+1 530-210-6378
>http://www.scaleunlimited.com
>custom big data solutions & training
>Hadoop, Cascading, Cassandra & Solr
>



Miredot built for 1.10, 1.12 and linked in main nav

2016-02-19 Thread Mattmann, Chris A (3980)
...thanks to Lewis for getting Miredot into the build and release
process. I had forgotten to build it for 1.10 and 1.12, so it’s done
and published now. I also updated the tree nav to link to Miredot
too.

Now to start filling out the REST docs there from the wiki..

Cheers,
Chris

Website

2016-02-19 Thread Mattmann, Chris A (3980)
Hey Nick,

Sorry it took me so long. I spent a bunch of time writing a script
on Github to make the release process easier by automatically
extracting and building the /index.apt file during the
release process.

https://git.io/v2Ubm


Anyways the site is updated with 1.12. I’m also building MireDot
for 1.11 and 1.12, and will update the links to link to the REST
API docs.

Cheers,
Chris

[RESULT] [VOTE] Apache Tika 1.12 Release Candidate #1

2016-02-15 Thread Mattmann, Chris A (3980)
Team,

Sorry for the long delay. This VOTE has PASSED with the following
tallies:

+1
Chris Mattmann*
Markus Jelsma
Oleg Tikhonov*
Ken Krugler*
Tim Allison*
Konstantin Gribov*
David Meikle*
Lewis John McGibbney*
Tyler Palsulich*

* - Tika PMC
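
The tally above can be checked with a few lines of Java. This is only a sketch under one reading of the rule in the vote email ("a majority of at least three +1 Tika PMC votes"): at least three binding +1 votes, and more binding +1 than binding -1. The class and method names are made up for the example.

```java
import java.util.Arrays;
import java.util.List;

/** Counts binding votes and applies the assumed passing rule. */
final class VoteTally {
    static final class Vote {
        final String voter;
        final int value;       // +1 or -1
        final boolean binding; // true for PMC members

        Vote(String voter, int value, boolean binding) {
            this.voter = voter;
            this.value = value;
            this.binding = binding;
        }
    }

    /** Number of binding votes with the given sign (+1 or -1). */
    static long bindingCount(List<Vote> votes, int sign) {
        return votes.stream()
                .filter(v -> v.binding && Integer.signum(v.value) == sign)
                .count();
    }

    /** Assumed rule: at least three binding +1, and more +1 than -1. */
    static boolean passes(List<Vote> votes) {
        long plus = bindingCount(votes, 1);
        long minus = bindingCount(votes, -1);
        return plus >= 3 && plus > minus;
    }

    public static void main(String[] args) {
        // Votes as listed in the result above; * marked PMC members.
        List<Vote> votes = Arrays.asList(
                new Vote("Chris Mattmann", 1, true),
                new Vote("Markus Jelsma", 1, false),
                new Vote("Oleg Tikhonov", 1, true),
                new Vote("Ken Krugler", 1, true),
                new Vote("Tim Allison", 1, true),
                new Vote("Konstantin Gribov", 1, true),
                new Vote("David Meikle", 1, true),
                new Vote("Lewis John McGibbney", 1, true),
                new Vote("Tyler Palsulich", 1, true));
        System.out.println("binding +1: " + bindingCount(votes, 1));
        System.out.println("passes: " + passes(votes));
    }
}
```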

I’ll go update the website and update the mirrors and complete
the rest of the tasks.

Cheers,
Chris

-Original Message-
From: jpluser 
Date: Monday, January 25, 2016 at 11:57 AM
To: "u...@tika.apache.org" , "dev@tika.apache.org"

Subject: [VOTE] Apache Tika 1.12 Release Candidate #1

>Hi Folks,
>
>A first candidate for the Tika 1.12 release is available at:
>
>  https://dist.apache.org/repos/dist/dev/tika/
>
>The release candidate is a zip archive of the sources in:
>https://git-wip-us.apache.org/repos/asf?p=tika.git;a=tag;h=203a26ba5e65db2427f9e84bc4ff31e569ae661c
>
>
>The SHA1 checksum of the archive is:
>30e64645af643959841ac3bb3c41f7e64eba7e5f
>
>In addition, a staged maven repository is available here:
>
>https://repository.apache.org/content/repositories/orgapachetika-1015/
>
>
>Please vote on releasing this package as Apache Tika 1.12.
>The vote is open for the next 72 hours and passes if a majority of at
>least three +1 Tika PMC votes are cast.
>
>[ ] +1 Release this package as Apache Tika 1.12
>[ ] -1 Do not release this package because…
>
>Cheers,
>Chris
>
>P.S. Of course here is my +1.
>
>
>
>
>



Re: scm info in pom.xml

2016-02-11 Thread Mattmann, Chris A (3980)
I’ve already fixed this in trunk / master :-)

Needs fixing in 2.x but you can borrow from what
I did there..






-Original Message-
From: Nick Burch 
Reply-To: "dev@tika.apache.org" 
Date: Thursday, February 11, 2016 at 5:16 AM
To: "dev@tika.apache.org" 
Subject: Re: scm info in pom.xml

>On Sat, 6 Feb 2016, Ken Krugler wrote:
>> I'm revisiting the creation of a new tika-langdetect module in the 2.x
>>branch, and have created a pom.xml
>>
>> But in looking at what I started with (from tika-translate), I see this:
>>
>>  <scm>
>>    <url>http://svn.apache.org/viewvc/tika/trunk/tika-langdetect</url>
>>    <connection>scm:svn:http://svn.apache.org/repos/asf/tika/trunk/tika-langdetect</connection>
>>    <developerConnection>scm:svn:https://svn.apache.org/repos/asf/tika/trunk/tika-langdetect</developerConnection>
>>  </scm>
>>
>> What's the plan (if any) for switching to git details in poms?
>
>I think it needs fixing in both trunk and the 2.x branch, since we're on
>Git for both
>
>Nick



Re: Use of interface vs. abstract class

2016-02-09 Thread Mattmann, Chris A (3980)
Hi Ken,

-Original Message-

From: Ken Krugler 
Reply-To: "dev@tika.apache.org" 
Date: Tuesday, February 9, 2016 at 8:54 AM
To: "dev@tika.apache.org" 
Subject: RE: Use of interface vs. abstract class

>Hi Chris,
>
>> From: Mattmann, Chris A (3980)
>> Sent: February 9, 2016 8:40:06am PST
>> To: dev@tika.apache.org
>> Cc: Trevor Claude Lewis; Ramirez, Paul M (398M)
>> Subject: Re: Use of interface vs. abstract class
>> 
>> Hey Ken,
>> 
>> My general preference in these situations is to use an interface;
>> then to provide an abstract base class, and encourage some limited
>> method implementation and basic functionality to bubble up into
>> the abstract base class that implements that interface. Then we
>> can encourage non-API changes to the abstract base class; API (possibly
>> but not guaranteed to be breaking) changes to the interface, and
>> so forth.
>
>What's an example of a non-breaking change to the API that's defined by
>an interface?

Guess it depends on what you and I consider breaking. I’ll illustrate.
Ex1: To me, this is “non-breaking”, since C doesn’t have to do anything
with the API change.

// BEFORE: original A and B
public interface A {
  public long X();

  public String Y();
}

public abstract class B implements A {
  public long X() {
    return 0L;
  }
}

public class C extends B {
  public String Y() {
    return "Y";
  }
}

// UPDATE A and B
public interface A {
  public long X();

  public String Y();

  public int Z();
}

public abstract class B implements A {
  public long X() {
    return 0L;
  }

  // default implementation lives in B, so C compiles unchanged
  public int Z() {
    return 1;
  }
}

// C remains the same.


>
>> So, TL;DR I would prefer interface + abstract base class, but maybe
>> that’s just me.
>
>I'm curious why you prefer an interface (with the breakage issue)

Again, depends on what you consider breaking. Based on my example above,
I don’t consider that breaking. Even if you had a class F that directly
implemented A, you could always add a sub-interface M, put the “changes” there,
and then suggest that classes that want to take advantage of the new
interface methods explicitly use M instead of A.
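
The sub-interface idea can be sketched roughly as below. A, F, M, X, Y and Z are the placeholder names used in this thread; G, zOrDefault and the wrapper class SubInterfaceSketch are hypothetical additions for illustration only.

```java
final class SubInterfaceSketch {
    // The published interface: never changes in this approach.
    interface A {
        long X();
        String Y();
    }

    // New methods go into a sub-interface instead of into A.
    interface M extends A {
        int Z();
    }

    // Existing direct implementor of A: still compiles untouched.
    static class F implements A {
        public long X() { return 0L; }
        public String Y() { return "Y"; }
    }

    // New code opts in to the extra method by implementing M.
    static class G implements M {
        public long X() { return 1L; }
        public String Y() { return "G"; }
        public int Z() { return 1; }
    }

    // Callers that want Z() have to ask for (or test for) M explicitly.
    static int zOrDefault(A a, int fallback) {
        return (a instanceof M) ? ((M) a).Z() : fallback;
    }

    public static void main(String[] args) {
        System.out.println(zOrDefault(new F(), -1)); // F only knows A
        System.out.println(zOrDefault(new G(), -1)); // G implements M
    }
}
```

The cost of this route is visible in zOrDefault: every caller that cares about the new method needs a type check or a signature that takes M, which is the trade-off against simply adding a default implementation to an abstract base class.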

> plus the need for an abstract base class over just a base class.

I guess just flexibility, indirection, and abstraction for me.

>
>What I see is people using interfaces when they're pretty certain that
>(a) implementations will need to support multiple APIs, and (b) the
>interface won't change.
>
>Neither seems to be the case for most of the APIs in Tika.

Meh, to each their own, frankly. I am fine with either way to be
honest, but my preference is what I suggested above. Just my pref,
though.

Cheers,
Chris






Re: Use of interface vs. abstract class

2016-02-09 Thread Mattmann, Chris A (3980)
Hey Ken,

My general preference in these situations is to use an interface;
then to provide an abstract base class, and encourage some limited
method implementation and basic functionality to bubble up into
the abstract base class that implements that interface. Then we
can encourage non-API changes to the abstract base class; API (possibly
but not guaranteed to be breaking) changes to the interface, and
so forth.

So, TL;DR I would prefer interface + abstract base class, but maybe
that’s just me.

Cheers,
Chris

CC/Trevor - FYI Trevor, please talk to Ken since he is working on
what I had assigned you to do as well - aka the LanguageIdentifier
refactor/interface.






-Original Message-
From: Ken Krugler 
Reply-To: "dev@tika.apache.org" 
Date: Tuesday, February 9, 2016 at 8:34 AM
To: "tika-...@lucene.apache.org" 
Subject: Use of interface vs. abstract class

>Hi all,
>
>In general I see open source projects using abstract classes for
>extension points, as that provides for a migration path in the event of
>an API change, versus breaking any code that has implemented the
>interface.
>
>I see some interfaces being used in Tika, e.g. Translator.
>
>Does the ServiceLoader require that these be interfaces? I assume not, as
>isAssignableFrom() should work with either interfaces or abstract
>classes, right?
>
>Asking because I'm looking at the language detector API for 2.x.
>
>Thanks,
>
>-- Ken
>
>
>
>--
>Ken Krugler
>+1 530-210-6378
>http://www.scaleunlimited.com
>custom big data solutions & training
>Hadoop, Cascading, Cassandra & Solr
>
>
>
>
>
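
On the isAssignableFrom() question, a minimal sketch using only java.lang.Class. It assumes nothing about Tika's own ServiceLoader, which may apply further checks; LanguageDetectorBase, DummyDetector and isProviderOf are hypothetical names. For what it is worth, the java.util.ServiceLoader Javadoc describes a service type as an interface or a (usually abstract) class.

```java
final class AssignableSketch {
    // Hypothetical extension point declared as an abstract class, not an interface.
    abstract static class LanguageDetectorBase {
        abstract String detect(String text);
    }

    // Hypothetical provider extending the abstract class.
    static class DummyDetector extends LanguageDetectorBase {
        String detect(String text) {
            return "en";
        }
    }

    // The kind of check a loader can run before instantiating a candidate class.
    static boolean isProviderOf(Class<?> service, Class<?> candidate) {
        return service.isAssignableFrom(candidate);
    }

    public static void main(String[] args) {
        System.out.println(isProviderOf(LanguageDetectorBase.class, DummyDetector.class));
        System.out.println(isProviderOf(LanguageDetectorBase.class, String.class));
    }
}
```

The check behaves the same whether the service type is an interface or an abstract class, since isAssignableFrom() only asks whether the candidate is the same type or a subtype.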



Re: Tika 2.0 and language detection

2016-02-04 Thread Mattmann, Chris A (3980)
Hey Ken,

This is fine. I wanted to get going with our Julia/MIT-LL Text.jl based
detector and with turning LanguageIdentifier into an interface. Trevor
(CC’ed) and I are working on it, but I’m not sure where we’re at, and it
shouldn’t be a blocker to moving forward.

Cheers,
Chris






-Original Message-
From: Ken Krugler 
Reply-To: "dev@tika.apache.org" 
Date: Thursday, February 4, 2016 at 12:23 PM
To: "tika-...@lucene.apache.org" 
Subject: Tika 2.0 and language detection

>Hi all,
>
>Over at https://issues.apache.org/jira/browse/TIKA-1723, Tim & I have
>been discussing whether to focus these pending changes on the 2.0 branch,
>and leave 1.x as-is.
>
>As part of that, we could do a cut-and-run in 2.0, and not spend the time
>to port the current (Tika 1.x) language detector code.
>
>I'm in favor of that approach, as I think leveraging the new detector
>project(s) gives us faster & more accurate results over more languages.
>
>But we're posting to the more general audience here, to gather input on
>things that we might not be considering.
>
>Thanks,
>
>-- Ken
>
>
>
>--
>Ken Krugler
>+1 530-210-6378
>http://www.scaleunlimited.com
>custom big data solutions & training
>Hadoop, Cascading, Cassandra & Solr
>
>
>
>
>



Re: [VOTE] Apache Tika 1.12 Release Candidate #1

2016-01-29 Thread Mattmann, Chris A (3980)
Thank you Tim for catching this. If you remember, please file a
ticket for the below and I’ll fix it in 1.13 (or someone else will :) )






-Original Message-
From: "Allison, Timothy B." 
Reply-To: "dev@tika.apache.org" 
Date: Friday, January 29, 2016 at 10:07 AM
To: "dev@tika.apache.org" 
Subject: RE: [VOTE] Apache Tika 1.12 Release Candidate #1

>+1
>
>With the one caveat that the PooledTimeSeriesParser is now taking
>precedence over the MP4Parser.  So, for those mp4 video files for which
>we used to extract some metadata (length, and a handful of other items),
>we're now getting nothing if the external pooled-time-series application
>is not installed.  This could be a big problem for some people...
>
>Thank you, Chris!
>
>With any luck, I'll be fully dug out by next week and onto our new git
>repo. :) Onward to Tika 1.13 (after TIKA-1830) soon.
>
>
>-Original Message-
>From: Ken Krugler [mailto:kkrugler_li...@transpac.com]
>Sent: Thursday, January 28, 2016 2:44 PM
>To: dev@tika.apache.org
>Subject: RE: [VOTE] Apache Tika 1.12 Release Candidate #1
>
>Built & installed on Mac OS X 10.8.
>
>Switched Bixo to use 1.12, all tests pass.
>
>+1.
>
>-- Ken
>
>> From: Mattmann, Chris A (3980)
>> Sent: January 25, 2016 11:58:04am PST
>> To: u...@tika.apache.org; dev@tika.apache.org
>> Subject: [VOTE] Apache Tika 1.12 Release Candidate #1
>> 
>> Hi Folks,
>> 
>> A first candidate for the Tika 1.12 release is available at:
>> 
>>  https://dist.apache.org/repos/dist/dev/tika/
>> 
>> The release candidate is a zip archive of the sources in:
>> https://git-wip-us.apache.org/repos/asf?p=tika.git;a=tag;h=203a26ba5e65db2427f9e84bc4ff31e569ae661c
>> 
>> 
>> The SHA1 checksum of the archive is:
>> 30e64645af643959841ac3bb3c41f7e64eba7e5f
>> 
>> In addition, a staged maven repository is available here:
>> 
>> https://repository.apache.org/content/repositories/orgapachetika-1015/
>> 
>> 
>> Please vote on releasing this package as Apache Tika 1.12.
>> The vote is open for the next 72 hours and passes if a majority of at
>> least three +1 Tika PMC votes are cast.
>> 
>> [ ] +1 Release this package as Apache Tika 1.12
>> [ ] -1 Do not release this package because...
>> 
>> Cheers,
>> Chris
>> 
>> P.S. Of course here is my +1.
>
>--
>Ken Krugler
>+1 530-210-6378
>http://www.scaleunlimited.com
>custom big data solutions & training
>Hadoop, Cascading, Cassandra & Solr
>
>
>
>
>



[VOTE] Apache Tika 1.12 Release Candidate #1

2016-01-25 Thread Mattmann, Chris A (3980)
Hi Folks,

A first candidate for the Tika 1.12 release is available at:

  https://dist.apache.org/repos/dist/dev/tika/

The release candidate is a zip archive of the sources in:
https://git-wip-us.apache.org/repos/asf?p=tika.git;a=tag;h=203a26ba5e65db2427f9e84bc4ff31e569ae661c


The SHA1 checksum of the archive is:
30e64645af643959841ac3bb3c41f7e64eba7e5f
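
For anyone checking the candidate, a minimal sketch of comparing a downloaded archive against a published SHA1 value using only the JDK. The class name Sha1CheckSketch and the command-line arguments (a local file path and the expected hex string) are assumptions for illustration; this is not the project's release tooling.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class Sha1CheckSketch {
    // Return the SHA-1 digest of the given bytes as lowercase hex.
    static String sha1Hex(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-1").digest(data);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: Sha1CheckSketch <archive-path> <expected-sha1-hex>");
            return;
        }
        // Reads the whole file into memory, which is acceptable for a source archive.
        String actual = sha1Hex(Files.readAllBytes(Paths.get(args[0])));
        System.out.println(actual.equalsIgnoreCase(args[1])
                ? "checksum OK"
                : "MISMATCH, got " + actual);
    }
}
```

Passing the path of the downloaded zip and the checksum quoted above as the two arguments prints whether they agree.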

In addition, a staged maven repository is available here:

https://repository.apache.org/content/repositories/orgapachetika-1015/


Please vote on releasing this package as Apache Tika 1.12.
The vote is open for the next 72 hours and passes if a majority of at
least three +1 Tika PMC votes are cast.

[ ] +1 Release this package as Apache Tika 1.12
[ ] -1 Do not release this package because…

Cheers,
Chris

P.S. Of course here is my +1.







Sorry 1.12-rc1 not done yet

2016-01-25 Thread Mattmann, Chris A (3980)
...ran into: http://goo.gl/ggfF50

Just fixed it in 2eb671574 -> 809370ecc and moving
release:prepare forward again.

Cheers,
Chris






Re: [DISCUSS] Tika 1.12-rc1 (was Re: New Tika release)

2016-01-25 Thread Mattmann, Chris A (3980)
Hey Ken,

It seemed to not like:

'scm:git:https://git-wip-us.apache.org/repos/asf/tika.git'

as the SCM URL, but it likes:

'scm:git:https://git-wip-us.apache.org/repos/asf'

The build/release is going fine.

Cheers,
Chris






-Original Message-
From: Ken Krugler 
Reply-To: "dev@tika.apache.org" 
Date: Monday, January 25, 2016 at 9:55 AM
To: "dev@tika.apache.org" 
Subject: RE: [DISCUSS] Tika 1.12-rc1 (was Re: New Tika release)

>Hi Chris,
>
>Why is it trying to push to tika.git/tika?
>
>> 'https://git-wip-us.apache.org/repos/asf/tika.git/tika/' not found
>
>
>Isn't the URL https://git-wip-us.apache.org/repos/asf/tika.git
>
>-- Ken
>
>> From: Mattmann, Chris A (3980)
>> Sent: January 25, 2016 7:07:03am PST
>> To: u...@tika.apache.org
>> Subject: Re: [DISCUSS] Tika 1.12-rc1 (was Re: New Tika release)
>> 
>> Hey Tim,
>> 
>> I started trying to release last night, but ran into having to
>> upgrade our POM xml tags (well remove them except for tika-parent).
>> I followed the model I saw in Apache drill but ran into this on
>> mvn release:prepare
>> 
>> [INFO] Checking in modified POMs...
>> [INFO] Executing: /bin/sh -c cd /Users/mattmann/tmp/tika1.12 && git add
>>--
>> tika-parent/pom.xml tika-core/pom.xml tika-parsers/pom.xml
>> tika-xmp/pom.xml tika-serialization/pom.xml tika-batch/pom.xml
>> tika-app/pom.xml tika-bundle/pom.xml tika-translate/pom.xml
>> tika-server/pom.xml tika-example/pom.xml tika-java7/pom.xml pom.xml
>> [INFO] Working directory: /Users/mattmann/tmp/tika1.12
>> [INFO] Executing: /bin/sh -c cd /Users/mattmann/tmp/tika1.12 && git
>>status
>> [INFO] Working directory: /Users/mattmann/tmp/tika1.12
>> [INFO] Waiting for 10 seconds before tagging the release.
>> [INFO] Tagging release with the label 1.12-rc1...
>> [INFO] Executing: /bin/sh -c cd /Users/mattmann/tmp/tika1.12 && git tag
>>-F
>> 
>>/var/folders/05/5qw82z2d77q16fhxxhwt22trgq/T/maven-scm-465741365.comm
>>it
>> 1.12-rc1
>> [INFO] Working directory: /Users/mattmann/tmp/tika1.12
>> [INFO] Executing: /bin/sh -c cd /Users/mattmann/tmp/tika1.12 && git push
>> https://git-wip-us.apache.org/repos/asf/tika.git/tika 1.12-rc1
>> [INFO] Working directory: /Users/mattmann/tmp/tika1.12
>> [INFO] 
>> 
>> [INFO] Reactor Summary:
>> [INFO] 
>> [INFO] Apache Tika parent . SKIPPED
>> [INFO] Apache Tika core ... SKIPPED
>> [INFO] Apache Tika parsers  SKIPPED
>> [INFO] Apache Tika XMP  SKIPPED
>> [INFO] Apache Tika serialization .. SKIPPED
>> [INFO] Apache Tika batch .. SKIPPED
>> [INFO] Apache Tika application  SKIPPED
>> [INFO] Apache Tika OSGi bundle  SKIPPED
>> [INFO] Apache Tika translate .. SKIPPED
>> [INFO] Apache Tika server . SKIPPED
>> [INFO] Apache Tika examples ... SKIPPED
>> [INFO] Apache Tika Java-7 Components .. SKIPPED
>> [INFO] Apache Tika  FAILURE
>>[09:58
>> min]
>> [INFO] 
>> 
>> [INFO] BUILD FAILURE
>> [INFO] 
>> 
>> [INFO] Total time: 10:00 min
>> [INFO] Finished at: 2016-01-24T23:34:04-08:00
>> [INFO] Final Memory: 12M/245M
>> [INFO] 
>> 
>> [ERROR] Failed to execute goal
>> org.apache.maven.plugins:maven-release-plugin:2.3.2:prepare
>>(default-cli)
>> on project tika: Unable to tag SCM
>> [ERROR] Provider message:
>> [ERROR] The git-push command failed.
>>

Re: Are we on git?

2016-01-22 Thread Mattmann, Chris A (3980)
Awesome thanks Nick!






-Original Message-
From: Nick Burch 
Reply-To: "dev@tika.apache.org" 
Date: Friday, January 22, 2016 at 1:37 AM
To: "dev@tika.apache.org" 
Subject: Re: Are we on git?

>On Fri, 22 Jan 2016, Mattmann, Chris A (3980) wrote:
>> Our new ASF git repo is:
>> https://git-wip-us.apache.org/repos/asf/tika.git
>>
>> Here’s an email I sent to the OODT-dev list about how
>> to convert from your existing SVN checkout to Git.
>> http://s.apache.org/UNr
>
>Steps I followed on my trunk checkout:
>  * svn status
>  * (ensured no local changes)
>  * mv .svn .svn.old
>  * git init
>  * git remote add origin https://git-wip-us.apache.org/repos/asf/tika.git
>  * git checkout -b merge-branch
>  * git fetch --all
>  * git reset --hard origin/master
>  * git checkout master
>
>And on my Tika 2.x checkout the last two steps were changed to:
>  * git reset --hard origin/2.x
>  * git checkout 2.x
>
>All seems to be working well now, thanks for the pointers!
>
>
>> Can we file a ticket to update the contribute page?
>
>I've done that page, and the parser guide links
>
>
>The thing that remains to be done is to sort out the site to import the
>examples from Git rather than SVN. I'll raise a ticket for that
>
>Nick



Re: Are we on git?

2016-01-21 Thread Mattmann, Chris A (3980)
Hi Nick,

We are officially on Git. SVN remains, but it’s R/O.

Our new ASF git repo is:

https://git-wip-us.apache.org/repos/asf/tika.git

Here’s an email I sent to the OODT-dev list about how
to convert from your existing SVN checkout to Git.

http://s.apache.org/UNr

Can we file a ticket to update the contribute page?

Enjoy!

Cheers,
Chris






-Original Message-
From: Nick Burch 
Reply-To: "dev@tika.apache.org" 
Date: Thursday, January 21, 2016 at 9:43 PM
To: "dev@tika.apache.org" 
Subject: Are we on git?

>Hi All
>
>I've seen a commit message to git, but no "stop using SVN", and
>http://tika.apache.org/contribute.html still talks about SVN being our
>master.
>
>What's the status? Have we switched? Still in progress? Where should we
>commit to? Is it time to delete our SVN checkouts and re-checkout from
>git?
>
>Cheers
>Nick



FW: [jira] [Commented] (INFRA-11092) Move Tika from SVN to Writeable Git repos

2016-01-20 Thread Mattmann, Chris A (3980)
Team, Git repos look good to me I am going to comment as such.
If you want to comment please do so.






-Original Message-
From: "Geoffrey Corey (JIRA)" 
Date: Tuesday, January 19, 2016 at 2:24 PM
To: jpluser 
Subject: [jira] [Commented] (INFRA-11092) Move Tika from SVN to Writeable
Git repos

>
>[ 
>https://issues.apache.org/jira/browse/INFRA-11092?page=com.atlassian.jira.
>plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15107576#co
>mment-15107576 ] 
>
>Geoffrey Corey commented on INFRA-11092:
>
>
>The migration script has finished.
>
>Can you verify that https://git-wip-us.apache.org/repos/asf/tika.git
>looks as expected? The tika git repo is currently set to read-only until
>you verify it.
>
>> Move Tika from SVN to Writeable Git repos
>> -
>>
>> Key: INFRA-11092
>> URL: https://issues.apache.org/jira/browse/INFRA-11092
>> Project: Infrastructure
>>  Issue Type: SVN->GIT Migration
>>  Components: Git
>>Reporter: Chris A. Mattmann
>>Assignee: Geoffrey Corey
>>
>
>
>
>
>--
>This message was sent by Atlassian JIRA
>(v6.3.4#6332)



Re: [DISCUSS] Apache Joshua Incubator Proposal - Machine Translation Toolkit

2016-01-18 Thread Mattmann, Chris A (3980)
Great Hen, we’d love to have you on board as a mentor! Please
add yourself to the proposal on the wiki.

Anyone else have interest in Machine Translation? Any OpenNLP folks,
Hadoop folks, Tika, or Lucene folks? CC’ing the dev lists for visibility;
please feel free to reply to general@i.a.o.

I’ll leave the DISCUSS thread open for a few more days.

Cheers,
Chris






-Original Message-
From: Henri Yandell 
Reply-To: "gene...@incubator.apache.org" 
Date: Monday, January 18, 2016 at 7:57 PM
To: jpluser ,
"gene...@incubator.apache.org" 
Subject: Re: [DISCUSS] Apache Joshua Incubator Proposal - Machine
Translation Toolkit

>Non-binding +1 to Joshua joining the Incubator. I'd be interested in
>mentoring.
>
>
>> -Original Message-
>> From: jpluser 
>> Reply-To: "gene...@incubator.apache.org" 
>> Date: Tuesday, January 12, 2016 at 10:56 PM
>> To: "gene...@incubator.apache.org" 
>> Cc: "p...@cs.jhu.edu" 
>> Subject: [DISCUSS] Apache Joshua Incubator Proposal - Machine
>>Translation
>> Toolkit
>>
>> >Hi Everyone,
>> >
>> >Please find attached for your viewing pleasure a proposed new project,
>> >Apache Joshua, a statistical machine translation toolkit. The proposal
>> >is in wiki draft form at:
>> https://wiki.apache.org/incubator/JoshuaProposal
>> >
>> >Proposal text is copied below. I’ll leave the discussion open for a
>>week
>> >and we are interested in folks who would like to be initial committers
>> >and mentors. Please discuss here on the thread.
>> >
>> >Thanks!
>> >
>> >Cheers,
>> >Chris (Champion)
>> >
>> >———
>> >
>> >= Joshua Proposal =
>> >
>> >== Abstract ==
>> >[[joshua-decoder.org|Joshua]] is an open-source statistical machine
>> >translation toolkit. It includes a Java-based decoder for translating
>>with
>> >phrase-based, hierarchical, and syntax-based translation models, a
>> >Hadoop-based grammar extractor (Thrax), and an extensive set of tools
>>and
>> >scripts for training and evaluating new models from parallel text.
>> >
>> >== Proposal ==
>> >Joshua is a state of the art statistical machine translation system
>>that
>> >provides a number of features:
>> >
>> > * Support for the two main paradigms in statistical machine
>>translation:
>> >phrase-based and hierarchical / syntactic.
>> > * A sparse feature API that makes it easy to add new feature templates
>> >supporting millions of features
>> > * Native implementations of many tuners (MERT, MIRA, PRO, and AdaGrad)
>> > * Support for lattice decoding, allowing upstream NLP tools to expose
>> >their hypothesis space to the MT system
>> > * An efficient representation for models, allowing for quick loading
>>of
>> >multi-gigabyte model files
>> > * Fast decoding speed (on par with Moses and mtplz)
>> > * Language packs — precompiled models that allow the decoder to be
>>run as
>> >a black box
>> > * Thrax, a Hadoop-based tool for learning translation models from
>> >parallel text
>> > * A suite of tools for constructing new models for any language pair
>>for
>> >which sufficient training data exists
>> >
>> >== Background and Rationale ==
>> >A number of factors make this a good time for an Apache project
>>focused on
>> >machine translation (MT): the quality of MT output (for many language
>> >pairs); the average computing resources available on computers,
>>relative
>> >to the needs of MT systems; and the availability of a number of
>> >high-quality toolkits, together with a large base of researchers
>>working
>> >on them.
>> >
>> >Over the past decade, machine translation (MT; the automatic
>>translation
>> >of one human language to another) has become a reality. The research
>>into
>> >statistical approaches to translation that began in the early nineties,
>> >together with the availability of large amounts of training data, and
>> >better computing infrastructure, have all come together to produce
>> >translations results that are “good enough” for a large set of language
>> >pairs and use cases. Free services like
>> >[[https://www.bing.com/translator|Bing Translator]] and
>> >[[https://translate.google.com|Google Translate]] have made these
>> services
>> >available to the average person through direct interfaces and through
>> >tools like browser plugins, and sites across the world with higher
>> >translation needs use them to translate their pages through
>>automatically.
>> >
>> >MT does not require the infrastructure of large corporations in order
>>to

Writeable Git repo migration is underway

2016-01-18 Thread Mattmann, Chris A (3980)
You can track progress here:
https://issues.apache.org/jira/browse/INFRA-11092

Cheers,
Chris






[RESULT] [VOTE] Moving SCM to Git

2016-01-18 Thread Mattmann, Chris A (3980)
This VOTE has passed with the following tallies:

+1

Chris Mattmann*
Tyler Palsulich*
Bob Paulin*
Hong-Thai Nguyen*
Oleg Tikhonov*
David Meikle*
Ken Krugler*
Lewis John McGibbney*
Nick Burch*
Konstantin Gribov*
Julien Nioche*

Tim Allison*

I’ll file an INFRA ticket to begin the process. They will
leave SVN up, but make it read-only and then create our Git
repo from whatever tip/head of SVN they start from.

Cheers,
Chris







-Original Message-
From: jpluser 
Reply-To: "dev@tika.apache.org" 
Date: Friday, January 1, 2016 at 8:30 PM
To: "dev@tika.apache.org" 
Subject: [VOTE] Moving SCM to Git

>Hi Everyone,
>
>DISCUSS thread here: http://s.apache.org/wVE
>
>Time to officially VOTE on moving Tika to Git. I’ve made a wiki
>page for our SCM explaining how to use Git at Apache, and how to
>use it with Github, and how to use it even in a traditional SVN
>sense. The page is here:
>
>https://wiki.apache.org/tika/UsingGit
>
>
>I’ve also linked it from the main wiki page. I took the liberty
>of updating the only other 2 pages on the wiki that referenced
>SCM with (pending) Git instructions as well:
>
>https://wiki.apache.org/tika/DeveloperResources
>https://wiki.apache.org/tika/ReleaseProcess
>
>From the DISCUSS thread it would seem the following members of
>the community support this move:
>
>Chris Mattmann
>Tyler Palsulich
>Bob Paulin
>Hong-Thai Nguyen
>
>Oleg Tikhonov
>David Meikle
>
>
>Given the above I’m going to count the above people as +1 in
>this VOTE if I don’t hear otherwise.
>
>Nick Burch said he would be more supportive if there was a guide,
>so I made one and updated the other wiki docs as above so hopefully
>that garners his VOTE.
>
>If you’d like to revise your VOTE or to VOTE for the first time,
>please use the ballot below:
>
>[ ] +1 Move the Apache Tika source control to Writeable Git repos
>at the ASF
>[ ] +0 Indifferent.
>[ ] -1 Don’t move the Apache Tika source control to Writeable Git
>repos at the ASF because..
>
>Of course, given the conversation I am +1 for this.
>
>Thanks for VOTE’ing. I’ll leave the VOTE open through next Friday.
>
>Cheers,
>Chris
>
>
>
>



Re: New moderators needed

2016-01-16 Thread Mattmann, Chris A (3980)
Hey Jukka,

Am I that single moderator? :)

Cheers,
Chris






-Original Message-
From: Jukka Zitting 
Reply-To: "dev@tika.apache.org" 
Date: Friday, January 15, 2016 at 7:43 AM
To: Tika Development 
Subject: New moderators needed

>Hi,
>
>I haven't been very active at ASF or Tika lately, so I'm stepping down as
>a
>moderator of many mailing lists (INFRA-11076
>
>).
>
>As a result, infra tells that this mailing list is now down to a single
>moderator, so it would be good to have one or two new volunteers. See
>http://apache.org/dev/committers.html#mailing-list-moderators for more
>details.
>
>Best,
>
>Jukka Zitting



Re: Tika questions on StackOverflow

2016-01-13 Thread Mattmann, Chris A (3980)
Great post Nick






-Original Message-
From: Nick Burch 
Reply-To: "dev@tika.apache.org" 
Date: Wednesday, January 13, 2016 at 3:22 AM
To: "dev@tika.apache.org" 
Subject: Tika questions on StackOverflow

>Hi All
>
>This may be old news for some of you, in which case you can skip the
>email, but for others... StackOverflow is a programming-focused question
>and answer site, with excellent google-fu, quite wide use, and growing
>use. At the moment I'd say there's something like a new Tika question a
>day on it, and that number seems to be climbing. (It's quite bursty
>though, 2 one day, nothing for the next few)
>
>Increasingly, new users seem to be turning to StackOverflow to get help
>with projects, learn how to use them etc, in place of joining a mailing
>list and asking a question. There's also a lot of people out there who
>know about Tika, aren't on our lists, but are posting helpful replies
>(answers) to questions on how to use Tika.
>
>(There's also a fair number of useless people asking very basic
>questions, without full information, and without having done any research /
>checked existing questions / checked our site / etc. They tend to get
>moderated down pretty quickly though, or they learn and edit the question)
>
>Because StackOverflow gets a lot of newbie traffic, they have some rules,
>and can be quite strict on enforcing them. A lot stricter than many of the
>other StackExchange network sites, largely because of that traffic. That
>means you will find some restrictions at the start, but they go away soon.
>You do need to be careful to actually answer questions with an answer;
>asking for clarifications or saying "can't help, ask on the list" as an
>answer won't go down well.
>
>
>If you're interested to see what sort of questions there are, see
>http://stackoverflow.com/questions/tagged/apache-tika?sort=newest&pageSize=50
>for what has been asked recently, and
>http://stackoverflow.com/questions/tagged/apache-tika?sort=votes&pageSize=50
>for the most "popular".
>
>
>There are a few of us on StackOverflow already, but you might want to join
>in too. You certainly don't have to! But you might want to, not only to
>help, but also to get bug reports, find out what docs we need to update,
>and maybe even spot people answering who we can ask to join the project.
>
>If you sign up for an account, you can get emails when people ask Tika
>related questions, so you can know to go look if it interests you. To do
>that, go to
> http://stackexchange.com/filters/212512/apache-tika-questions
>On the right it should have an "Email Updates" box, where you can
>subscribe to get emailed for new questions on a timing of your choice
>
>
>If you have questions on using StackOverflow, I'm happy to do my best to
>explain. They have pretty good help/documentation, and they have the
>"meta" site to check policies / why reasons / etc.
>
>You will suffer some restrictions as a new user, but they go away when
>your answers get a few up-votes. Let us know your username if you sign up
>and answer something, then the few of us who already use StackOverflow can
>up vote you to get you to the minimum rep score to escape them!
>
>Nick



Re: Tika 2.0 Modules first pass.

2016-01-05 Thread Mattmann, Chris A (3980)
Thanks Bob took care of 6 for ya:

https://wiki.apache.org/tika/ContributorsGroup

I should be able to review this, but it's not going to be a complete review
for a few weeks. Thanks for your great work!

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++





-Original Message-
From: Bob Paulin 
Reply-To: "dev@tika.apache.org" 
Date: Tuesday, January 5, 2016 at 7:54 PM
To: "dev@tika.apache.org" 
Subject: Tika 2.0 Modules first pass.

>All,
>
>I took a stab at the initial module structure based on Tim and my email
>[1].  If a package didn't seem to fit with anything else I created an
>individual project for it.  If any of the groupings don't make sense or
>folks think there are better ways to organize I'm happy to move stuff
>around.  Patches are welcome :).  I have a JIRA created [2].  Committed
>with rev 1723223.
>
>There's still a good amount of outstanding work:
>1) All this could use more testing.  Especially with the external parsers.
>2) As Tim has already raised there is the issue of dual maintaining
>branches.  There are likely some fixes in trunk that are not currently
>applied to the 2.0 branch.
>3) The tika-parser project is currently using the maven shade plugin and
>that is causing issues creating the OSGi Manifest.MF file.  I should be
>able to find a way around this.
>4) Still need to recreate the OSGi uber jar with all dependencies
>packaged with the tika code.
>5) There are still some classes in the tika-parser project.  Should
>these all be moved to core? A common project?...
>6) Documentation.  I could use some Wiki access.  Username: BobPaulin.
>7) There are some dependencies in the tika-parser project that were not
>needed to compile any of the individual modules or run tests. Are they
>still needed?
>8) Where does the org.apache.tika.parser.external.CompositeExternalParser
>ServiceLoader (META-INF/services/org.apache.tika.parser.Parser) config
>belong?  I moved it to tika-core since that is where the class lives.
>9) Subcomponent licenses.  I moved them to the modules they belong in
>but I need to figure out a way to make them bubble up to the uber jars.
>Or perhaps they need to be dual maintained.
>10) Anything I may be forgetting;)
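
[Editorial sketch, not part of Bob's message.] Item 8 above concerns a provider-configuration file read by java.util.ServiceLoader. The following is a minimal, self-contained illustration of that lookup, assuming a stand-in interface (DemoParser) and provider (DemoExternalParser) that are hypothetical and not Tika classes. It writes a META-INF/services entry into a temporary directory and loads the provider through a class loader rooted there, which mirrors why the file has to travel with whichever jar is expected to advertise the class.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

class ServiceLoaderSketch {

    /** Hypothetical stand-in for a service interface such as org.apache.tika.parser.Parser. */
    public interface DemoParser {
        String name();
    }

    /** Hypothetical provider; ServiceLoader needs a public class with a public no-arg constructor. */
    public static class DemoExternalParser implements DemoParser {
        public DemoExternalParser() {}

        @Override
        public String name() {
            return "demo-external";
        }
    }

    /** Writes a provider-configuration file to a temp dir and returns the provider names found. */
    static List<String> discover() {
        try {
            Path root = Files.createTempDirectory("spi-demo");
            Path services = Files.createDirectories(root.resolve("META-INF").resolve("services"));
            // The file is named after the service interface and lists implementation class names.
            Files.write(services.resolve(DemoParser.class.getName()),
                    DemoExternalParser.class.getName().getBytes(StandardCharsets.UTF_8));

            List<String> names = new ArrayList<>();
            try (URLClassLoader loader = new URLClassLoader(
                    new URL[] {root.toUri().toURL()},
                    ServiceLoaderSketch.class.getClassLoader())) {
                for (DemoParser parser : ServiceLoader.load(DemoParser.class, loader)) {
                    names.add(parser.name());
                }
            }
            return names;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(discover());
    }
}
```

If the configuration file is missing from every jar on the class path, the loop simply finds nothing, which is the failure mode to watch for when the file is moved between modules.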
>
>For the most part the changes just organize the existing
>packages.  There are a handful of changes to the test suite in order to
>break some cyclical dependencies.  Here's an overview of how the
>projects interrelate at the moment:
>
>tika-parser-modules
>  - /tika-advanced-module
>  - /tika-cad-module
>      -> tika-text-module [test]
>  - /tika-code-module
>      -> tika-text-module [test]
>  - /tika-database-module
>      -> tika-office-module [test]
>  - /tika-ebook-module
>      -> tika-text-module
>  - /tika-journal-module
>      -> tika-pdf-module
>  - /tika-multimedia-module
>      -> tika-web-module [test]
>      -> tika-office-module [test]
>      -> tika-pdf-module [test]
>  - /tika-office-module
>      -> tika-web-module [test]
>      -> tika-package-module [test]
>      -> tika-text-module [test]
>  - /tika-package-module
>  - /tika-pdf-module
>      -> tika-text-module [test]
>      -> tika-package-module [test]
>      -> tika-office-module [test]
>  - /tika-scientific-module
>      -> tika-text-module [test]
>  - /tika-text-module
>  - /tika-web-module
>      -> tika-text-module [test]
>      -> tika-package-module [test]
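
[Editorial sketch, not part of Bob's message.] The module overview above can be checked mechanically for the cyclical dependencies the message mentions. This is a rough illustration only: it copies the edges from the listing, ignores the difference between test and compile scope, and uses a plain depth-first search rather than anything Maven itself does. The class and method names are invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class ModuleGraphSketch {

    /** Module -> modules it depends on, transcribed from the listing (scope ignored). */
    static Map<String, List<String>> modules() {
        Map<String, List<String>> g = new LinkedHashMap<>();
        g.put("tika-advanced-module", Collections.<String>emptyList());
        g.put("tika-cad-module", Arrays.asList("tika-text-module"));
        g.put("tika-code-module", Arrays.asList("tika-text-module"));
        g.put("tika-database-module", Arrays.asList("tika-office-module"));
        g.put("tika-ebook-module", Arrays.asList("tika-text-module"));
        g.put("tika-journal-module", Arrays.asList("tika-pdf-module"));
        g.put("tika-multimedia-module",
                Arrays.asList("tika-web-module", "tika-office-module", "tika-pdf-module"));
        g.put("tika-office-module",
                Arrays.asList("tika-web-module", "tika-package-module", "tika-text-module"));
        g.put("tika-package-module", Collections.<String>emptyList());
        g.put("tika-pdf-module",
                Arrays.asList("tika-text-module", "tika-package-module", "tika-office-module"));
        g.put("tika-scientific-module", Arrays.asList("tika-text-module"));
        g.put("tika-text-module", Collections.<String>emptyList());
        g.put("tika-web-module", Arrays.asList("tika-text-module", "tika-package-module"));
        return g;
    }

    /** Returns modules in an order where dependencies come first; throws on a cycle. */
    static List<String> buildOrder(Map<String, List<String>> graph) {
        List<String> order = new ArrayList<>();
        Set<String> done = new HashSet<>();
        Set<String> inProgress = new HashSet<>();
        for (String module : graph.keySet()) {
            visit(module, graph, done, inProgress, order);
        }
        return order;
    }

    private static void visit(String module, Map<String, List<String>> graph,
                              Set<String> done, Set<String> inProgress, List<String> order) {
        if (done.contains(module)) {
            return;
        }
        // Re-entering a module that is still being visited means we followed a cycle.
        if (!inProgress.add(module)) {
            throw new IllegalStateException("cyclical dependency through " + module);
        }
        for (String dep : graph.getOrDefault(module, Collections.<String>emptyList())) {
            visit(dep, graph, done, inProgress, order);
        }
        inProgress.remove(module);
        done.add(module);
        order.add(module);
    }

    public static void main(String[] args) {
        for (String module : buildOrder(modules())) {
            System.out.println(module);
        }
    }
}
```

Run against the edges as listed, the search completes without throwing, which is consistent with the cycles having been broken in the test suite.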
>
>Very interested in feedback since we have been talking about this for a
>bit but I'm sure actually seeing it will create more discussion. Looking
>at how much simpler the individual pom files are does seem to demonstrate
>that this will be a good thing for the project.
>
>Cheers,
>
>- Bob
>
>[1] http://mail-archives.apache.org/mod_mbox/tika-dev/201508.mbox/%3C55CF4C19.6050503%40bobpaulin.com%3E
>[2] https://issues.apache.org/jira/browse/TIKA-1824



Re: [VOTE] Moving SCM to Git

2016-01-02 Thread Mattmann, Chris A (3980)
One final note - this isn't a vote to make GitHub the canonical repo. In the 
future if Whimsy goes well I'd like to explore that but here I am simply 
proposing to use the ASF writeable Git repos (which happen to be mirrored to 
GH).

Cheers,
Chris 

Sent from my iPhone

> On Jan 2, 2016, at 4:31 PM, Mattmann, Chris A (3980) 
>  wrote:
> 
> Hey Ken,
> 
> Projects have been using writeable git repos at the ASF since 2009-2010. The 
> recent conversation at the foundation level was - should we allow GitHub as a 
> canonical external repo and more broadly - is this possible in general? The 
> Whimsy project is currently undergoing that experiment and it's going well 
> but nothing official to report yet.
> 
> Beyond that - projects can release from and use writeable Git repos. Some 
> projects were getting around history by squashing commits ahead of the repo 
> and getting around infra's checks on master (aka trunk) by using different 
> main branch names but we're not in that boat.
> 
> Cheers,
> Chris 
> 
> 
> Sent from my iPhone
> 
>> On Jan 2, 2016, at 3:47 PM, Ken Krugler  wrote:
>> 
>> Hi Chris,
>> 
>> I'd be +1, but I don't have the essence of the "Re: git (Was: ASF/GitHub 
>> Findings of Fact / Statements of Principles)" thread on the Apache members 
>> list clearly in my mind.
>> 
>> Specifically, while that thread was spinning merrily away, there were 
>> concerns about immutability when using git.
>> 
>> E.g. one comment was...
>> 
>>> releases must correspond to an immutable tag in a repository on ASF 
>>> hardware.
>>> 
>>> "Canonical" is needed for releases, and for IP provenance, so I'd augment 
>>> the above with a second requirement: for each release tag, we must be able 
>>> to establish the provenance of all files referenced by that tag.
>>> 
>>> I believe that is the essence of the Foundation's requirements for version 
>>> control. Both can be satisfied via svn or git. Git may require external 
>>> sources to satisfy one or both of those requirements. svn inherently has 
>>> the first nailed, and is much easier for provenance (there may be edge 
>>> cases I'm missing offhand, but we know the ICLA/grant associated with each 
>>> change leading up to the tagged release).
>> 
>> Did it wind up as "projects can experiment with using git for official 
>> releases"?
>> 
>> Thanks,
>> 
>> -- Ken
>> 
>>> From: Mattmann, Chris A (3980)
>>> Sent: January 1, 2016 8:30:16pm PST
>>> To: dev@tika.apache.org
>>> Subject: [VOTE] Moving SCM to Git
>>> 
>>> Hi Everyone,
>>> 
>>> DISCUSS thread here: http://s.apache.org/wVE
>>> 
>>> Time to officially VOTE on moving Tika to Git. I’ve made a wiki
>>> page for our SCM explaining how to use Git at Apache, and how to
>>> use it with Github, and how to use it even in a traditional SVN
>>> sense. The page is here:
>>> 
>>> https://wiki.apache.org/tika/UsingGit
>>> 
>>> 
>>> I’ve also linked it from the main wiki page. I took the liberty
>>> of updating the only other 2 pages on the wiki that referenced
>>> SCM with (pending) Git instructions as well:
>>> 
>>> https://wiki.apache.org/tika/DeveloperResources
>>> https://wiki.apache.org/tika/ReleaseProcess
>>> 
>>> From the DISCUSS thread it would seem the following members of
>>> the community support this move:
>>> 
>>> Chris Mattmann
>>> Tyler Palsulich
>>> Bob Paulin
>>> Hong-Thai Nguyen
>>> 
>>> Oleg Tikhonov
>>> David Meikle
>>> 
>>> 
>>> Given the above I’m going to count the above people as +1 in
>>> this VOTE if I don’t hear otherwise.
>>> 
>>> Nick Burch said he would be more supportive if there was a guide,
>>> so I made one and updated the other wiki docs as above so hopefully
>>> that garners his VOTE.
>>> 
>>> If you’d like to revise your VOTE or to VOTE for the first time,
>>> please use the ballot below:
>>> 
>>> [ ] +1 Move the Apache Tika source control to Writeable Git repos
>>> at the ASF
>>> [ ] +0 Indifferent.
>>> [ ] -1 Don’t move the Apache Tika source control to Writeable Git
>>> repos at the ASF because..
>>> 
>>> Of course, given the conversation I am +1 for this.
>>> 
>>> Thanks for VOTE’in

Re: [VOTE] Moving SCM to Git

2016-01-02 Thread Mattmann, Chris A (3980)
Hey Ken,

Projects have been using writeable git repos at the ASF since 2009-2010. The 
recent conversation at the foundation level was - should we allow GitHub as a 
canonical external repo and more broadly - is this possible in general? The 
Whimsy project is currently undergoing that experiment and it's going well but 
nothing official to report yet.

Beyond that - projects can release from and use writeable Git repos. Some 
projects were getting around history by squashing commits ahead of the repo and 
getting around infra's checks on master (aka trunk) by using different main 
branch names but we're not in that boat.

Cheers,
Chris 


Sent from my iPhone

> On Jan 2, 2016, at 3:47 PM, Ken Krugler  wrote:
> 
> Hi Chris,
> 
> I'd be +1, but I don't have the essence of the "Re: git (Was: ASF/GitHub 
> Findings of Fact / Statements of Principles)" thread on the Apache members 
> list clearly in my mind.
> 
> Specifically, while that thread was spinning merrily away, there were 
> concerns about immutability when using git.
> 
> E.g. one comment was...
> 
>> releases must correspond to an immutable tag in a repository on ASF hardware.
>> 
>> "Canonical" is needed for releases, and for IP provenance, so I'd augment 
>> the above with a second requirement: for each release tag, we must be able 
>> to establish the provenance of all files referenced by that tag.
>> 
>> I believe that is the essence of the Foundation's requirements for version 
>> control. Both can be satisfied via svn or git. Git may require external 
>> sources to satisfy one or both of those requirements. svn inherently has the 
>> first nailed, and is much easier for provenance (there may be edge cases I'm 
>> missing offhand, but we know the ICLA/grant associated with each change 
>> leading up to the tagged release).
> 
> Did it wind up as "projects can experiment with using git for official 
> releases"?
> 
> Thanks,
> 
> -- Ken
> 
>> From: Mattmann, Chris A (3980)
>> Sent: January 1, 2016 8:30:16pm PST
>> To: dev@tika.apache.org
>> Subject: [VOTE] Moving SCM to Git
>> 
>> Hi Everyone,
>> 
>> DISCUSS thread here: http://s.apache.org/wVE
>> 
>> Time to officially VOTE on moving Tika to Git. I’ve made a wiki
>> page for our SCM explaining how to use Git at Apache, and how to
>> use it with Github, and how to use it even in a traditional SVN
>> sense. The page is here:
>> 
>> https://wiki.apache.org/tika/UsingGit
>> 
>> 
>> I’ve also linked it from the main wiki page. I took the liberty
>> of updating the only other 2 pages on the wiki that referenced
>> SCM with (pending) Git instructions as well:
>> 
>> https://wiki.apache.org/tika/DeveloperResources
>> https://wiki.apache.org/tika/ReleaseProcess
>> 
>> From the DISCUSS thread it would seem the following members of
>> the community support this move:
>> 
>> Chris Mattmann
>> Tyler Palsulich
>> Bob Paulin
>> Hong-Thai Nguyen
>> 
>> Oleg Tikhonov
>> David Meikle
>> 
>> 
>> Given the above I’m going to count the above people as +1 in
>> this VOTE if I don’t hear otherwise.
>> 
>> Nick Burch said he would be more supportive if there was a guide,
>> so I made one and updated the other wiki docs as above so hopefully
>> that garners his VOTE.
>> 
>> If you’d like to revise your VOTE or to VOTE for the first time,
>> please use the ballot below:
>> 
>> [ ] +1 Move the Apache Tika source control to Writeable Git repos
>> at the ASF
>> [ ] +0 Indifferent.
>> [ ] -1 Don’t move the Apache Tika source control to Writeable Git
>> repos at the ASF because..
>> 
>> Of course, given the conversation I am +1 for this.
>> 
>> Thanks for VOTE’ing I’ll leave the VOTE open through next Friday.
>> 
>> Cheers,
>> Chris
>> 
>> ++
>> Chris Mattmann, Ph.D.
>> Chief Architect
>> Instrument Software and Science Data Systems Section (398)
>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> Office: 168-519, Mailstop: 168-527
>> Email: chris.a.mattm...@nasa.gov
>> WWW:  http://sunset.usc.edu/~mattmann/
>> ++
>> Adjunct Associate Professor, Computer Science Department
>> University of Southern California, Los Angeles, CA 90089 USA
>> ++
> 
> --
> Ken Krugler
> +1 530-210-6378
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Cassandra & Solr
> 
> 
> 
> 
> 


[VOTE] Moving SCM to Git

2016-01-01 Thread Mattmann, Chris A (3980)
Hi Everyone,

DISCUSS thread here: http://s.apache.org/wVE

Time to officially VOTE on moving Tika to Git. I’ve made a wiki
page for our SCM explaining how to use Git at Apache, and how to
use it with Github, and how to use it even in a traditional SVN
sense. The page is here:

https://wiki.apache.org/tika/UsingGit


I’ve also linked it from the main wiki page. I took the liberty
of updating the only other 2 pages on the wiki that referenced
SCM with (pending) Git instructions as well:

https://wiki.apache.org/tika/DeveloperResources
https://wiki.apache.org/tika/ReleaseProcess

From the DISCUSS thread it would seem the following members of
the community support this move:

Chris Mattmann
Tyler Palsulich
Bob Paulin
Hong-Thai Nguyen

Oleg Tikhonov
David Meikle


Given the above I’m going to count the above people as +1 in
this VOTE if I don’t hear otherwise.

Nick Burch said he would be more supportive if there was a guide,
so I made one and updated the other wiki docs as above so hopefully
that garners his VOTE.

If you’d like to revise your VOTE or to VOTE for the first time,
please use the ballot below:

[ ] +1 Move the Apache Tika source control to Writeable Git repos
at the ASF
[ ] +0 Indifferent.
[ ] -1 Don’t move the Apache Tika source control to Writeable Git
repos at the ASF because..

Of course, given the conversation I am +1 for this.

Thanks for VOTE’ing I’ll leave the VOTE open through next Friday.

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++





Re: Looking to contribute

2015-12-20 Thread Mattmann, Chris A (3980)
Pavan awesome glad to have your interest and to have you in the
community!

Check out our JIRA:

https://issues.apache.org/jira/browse/TIKA

My own personal recent interests in Tika are related to Named
Entity Recognition (Stanford NER, CoreNLP and OpenNLP), and in
Automated IR-based Geo-Gazetteers; in Audio/Video extraction,
and so forth. Also in language identification (N-grams; MIT-LL’s
Text.jl) and automated machine translation (Joshua, Moses).

If you are interested in that type of stuff, look for stuff
I reported or assigned to me, or with the label “memex”. In
addition in general if you are more interested in the types
of work that I’m contributing to Tika, see http://memex.jpl.nasa.gov/

Cheers, and happy holidays!

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++





-Original Message-
From: Pavan Sudheendra 
Reply-To: "dev@tika.apache.org" 
Date: Sunday, December 20, 2015 at 9:52 AM
To: "dev@tika.apache.org" 
Subject: Looking to contribute

>Hi all,
>
>My name is Pavan and I'm a software engineer who has been working at Cisco on
>big data projects for the past 2 years.
>
>I'm looking to contribute to the Tika project and I'm wondering if I should
>start looking at the Github issues page or somewhere else?
>
>I've started reading the documentation and getting familiar with the build
>process.
>
>Also, any guidance on this subject would be great.
>
>Thanks all.
>
>-- 
>Regards-
>Pavan



FW: [opensource] Open Source workshop at GSAW March 2

2015-12-18 Thread Mattmann, Chris A (3980)
FYI..

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++





-Original Message-
From:  on behalf of "Burke, Wayne M
(398M-Affiliate)" 
Date: Friday, December 18, 2015 at 7:16 AM
To: "opensou...@lists.nasa.gov" 
Subject: [opensource] Open Source workshop at GSAW March 2

>GSAW is now open for registration and if you're into open source software,
>you'll be happy to learn that Chris Mattmann from JPL and Vale Sather
>from Aerospace are co-chairing a workshop this year. The goal will be to
>build community around the use and development of open source software
>within aerospace, and shape organizational policy within Federal, State,
>and Local agencies, FFRDCs, and other organizations. You will find this
>particularly helpful if:
>
>* you manage or develop an open source project
>
>* are struggling to build a community around your project
>
>* are unsure how to work in an open source way within the current
>  bureaucracy
>
>
>The workshop will be unconference style, with the agenda formed based on
>the participants’ interests. This means that if you have something relevant
>to share with your peers, this is an excellent place to do it. At the same
>time, the best sessions will start with an open question that you need an
>answer to. For a lot of conferences, the best part is the coffee break,
>where you can have directly relevant and in-depth discussions that help you
>assimilate information and apply it to your own needs. This entire workshop
>will be like that - with the best minds working in open source and
>aerospace.
>
>GSAW is Feb 29 - March 3 and the open source workshop will be the afternoon
>of March 2.
>Register
> here 
>nt-summary-d1571e3f22664277a307cb4a8d52fc08.aspx>
>-- 
>Wayne Moses Burke, MS
>Cognizant Engineer
>Instrument Software and Science Data Systems Section (398)
>NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>Mailstop: 168-527
>E-mail: wayne.m.bu...@jpl.nasa.gov



Re: looking to contribute

2015-12-17 Thread Mattmann, Chris A (3980)
What Tim and Nick said. :) Joey is at Caltech and interested in
working with me, so I said jump on the Tika lists and let’s see
if there is something we can pin down.

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++





-Original Message-
From: "Allison, Timothy B." 
Reply-To: "dev@tika.apache.org" 
Date: Thursday, December 17, 2015 at 5:32 AM
To: "dev@tika.apache.org" 
Subject: RE: looking to contribute

>Speaking of the docs/examples, TIKA-1329 is still open because I haven't
>gotten around to documenting it.
>
>Y, if you'd like a report of exceptions, let me know.  IIRC, it would be
>great if we could improve on XML detection (we're currently over
>detecting), and there's plenty of work to do on html parsing TIKA-1599.
>
>I also have probably a full grad student semester worth of curation
>project ideas on the test corpus.  Not glamorous, but very useful for the
>community.
>
>Then there's the eval code itself...that still needs to make it into
>shape to be added.
>
>I agree with Nick though, start small on documentation/examples.
>
>Cheers,
>
>   Tim
>
>-Original Message-
>From: Nick Burch [mailto:apa...@gagravarr.org]
>Sent: Wednesday, December 16, 2015 4:23 PM
>To: dev@tika.apache.org
>Subject: Re: looking to contribute
>
>On Wed, 16 Dec 2015, Joey Hong wrote:
>> My name is Joey. I am a college freshman with programming experience
>> looking to get into the world of open-source. I was hoping to
>> contribute to the Tika project, and was wondering if there were any
>> tasks that a beginner like me could tackle. I am willing to do
>> anything, whether it be fixing a minor bug, or adding test suites or
>>documentation.
>
>On the docs / examples side, we have a few examples on the website, but
>probably not enough! One thing might be to look through those, identify
>gaps with your fresh eyes, and work on those. We also have instructions
>for some more complicated integrations on the wiki, maybe try some of
>those and feed back on which ones aren't clear enough?
>
>If you want to try more coding, Tim quite often runs Tika against some
>large filesets, and has a nifty tool to report on what breaks. He can
>hopefully point you at the most recent report! Maybe have a look through
>that, identify a few common failures from unidentified or common
>exceptions, and try to fix one or two of those?
>
>Nick



Re: more modular parser bundles

2015-11-30 Thread Mattmann, Chris A (3980)
Sure that’s fine Bob - we don’t need it to be gated on Git.
Create a 2.x branch and go to town, +1 from me :)

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++





-Original Message-
From: Bob Paulin 
Reply-To: "dev@tika.apache.org" 
Date: Monday, November 30, 2015 at 7:08 AM
To: "dev@tika.apache.org" 
Subject: Re: more modular parser bundles

>Hi,
>
>I think Chris actually mentioned that this could be something targeted for
>a 2.0 release.  The first step towards that would be to create the 2.0
>branch since I think this might be a big enough effort to not want to block
>the trunk (or master if we move to git).  Would the list agree that now
>would be a good time to branch?
>
>- Bob
>
>On Mon, Nov 30, 2015 at 6:24 AM, Allison, Timothy B. 
>wrote:
>
>> All,
>>
>>   I'm extremely grateful for all of the new nlp + image processing parsers
>> that we're adding.  Might it be time to start down the implementation path
>> to more modular parser bundles?
>>
>>   Perhaps we could start with a tika-advanced-bundle to gather all of the
>> nlp/advanced parsers?  Or would this have to wait for Tika 2.0?
>>
>>   Bob got us off to a great start.  There hasn't been much discussion
>> since August.  I think my email from 24 Aug [1] was the last?
>>
>>   Cheers,
>>
>> Tim
>>
>> [1] https://mail-archives.apache.org/mod_mbox/tika-dev/201508.mbox/%3cDM2PR09mb071305dfd203e21bfbe7a63ac7...@dm2pr09mb0713.namprd09.prod.outlook.com%3e
>>
>> -Original Message-
>> From: Madhav Sharan (JIRA) [mailto:j...@apache.org]
>> Sent: Wednesday, November 25, 2015 6:16 PM
>> To: dev@tika.apache.org
>> Subject: [jira] [Created] (TIKA-1803) Use lucene-geo-gazetteer REST API in
>> GeoTopicParser
>>
>> Madhav Sharan created TIKA-1803:
>> ---
>>
>>  Summary: Use lucene-geo-gazetteer REST API in GeoTopicParser
>>  Key: TIKA-1803
>>  URL: https://issues.apache.org/jira/browse/TIKA-1803
>>  Project: Tika
>>   Issue Type: Sub-task
>>   Components: parser
>> Reporter: Madhav Sharan
>>
>>
>> As of now Tika uses the lucene-geo-gazetteer CLI to extract the co-ordinates
>> of a location. The CLI requires a JVM and Lucene to be instantiated for every
>> request. With the all-new REST API it will be possible to gain an improvement
>> in this space.
>>
>> The idea is to create a client for lucene-geo-gazetteer in Tika and use it in
>> GeoTopicParser.
>>
>>
>>
>> --
>> This message was sent by Atlassian JIRA
>> (v6.3.4#6332)
>>



Re: more modular parser bundles

2015-11-30 Thread Mattmann, Chris A (3980)
Tim,

Fully agreed. One solution that presents itself to me is to finish
up the Git discuss (which was overwhelmingly positive, and I need
to write a wiki page for Nick), get that VOTE out of the way, move
to Git, then basically have two main branches of development. I’d
like 1.x to continue as-is, and then to create a 2.x branch and
let Bob go to town on it (and others that are interested). Then,
once you guys are ready you can release on it out of 2.x, while
1.x maintenance and existing architecture keep pressing forward.

What do you think?

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++





-Original Message-
From: "Allison, Timothy B." 
Reply-To: "dev@tika.apache.org" 
Date: Monday, November 30, 2015 at 4:24 AM
To: "dev@tika.apache.org" 
Subject: more modular parser bundles

>All,
>
>  I'm extremely grateful for all of the new nlp +image processing parsers
>that we're adding.  Might it be time to start down the implementation
>path to more modular parser bundles?
>
>  Perhaps we could start with a tika-advanced-bundle to gather all of the
>nlp/advanced parsers?  Or would this have to wait for Tika 2.0?
>
>  Bob got us off to a great start.  There hasn't been much discussion
>since August.  I think my email from 24 Aug [1] was the last?
>
>  Cheers,
>
>Tim
>
>[1] https://mail-archives.apache.org/mod_mbox/tika-dev/201508.mbox/%3cDM2PR09Mb071305dfd203e21bfbe7a63ac7...@dm2pr09mb0713.namprd09.prod.outlook.com%3e
>
>-Original Message-
>From: Madhav Sharan (JIRA) [mailto:j...@apache.org]
>Sent: Wednesday, November 25, 2015 6:16 PM
>To: dev@tika.apache.org
>Subject: [jira] [Created] (TIKA-1803) Use lucene-geo-gazetteer REST API
>in GeoTopicParser
>
>Madhav Sharan created TIKA-1803:
>---
>
> Summary: Use lucene-geo-gazetteer REST API in GeoTopicParser
> Key: TIKA-1803
> URL: https://issues.apache.org/jira/browse/TIKA-1803
> Project: Tika
>  Issue Type: Sub-task
>  Components: parser
>Reporter: Madhav Sharan
>
>
>As of now Tika uses the lucene-geo-gazetteer CLI to extract the co-ordinates
>of a location. The CLI requires a JVM and Lucene to be instantiated for every
>request. With the all-new REST API it will be possible to gain an improvement
>in this space.
>
>The idea is to create a client for lucene-geo-gazetteer in Tika and use it in
>GeoTopicParser.
>
>
>
>--
>This message was sent by Atlassian JIRA
>(v6.3.4#6332)
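
[Editorial sketch, not part of the JIRA notification.] The issue above proposes replacing a per-request CLI launch with a client that talks to a long-running REST service. A minimal version of such a client might look like the following. The base URL, the /api/search path and the repeated "s" query parameter are assumptions made for illustration and are not confirmed by the issue text; GazetteerClientSketch is likewise an invented name, not the class that ended up in Tika.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URLEncoder;
import java.util.List;

class GazetteerClientSketch {

    /** Builds a search URI with one "s" parameter per location name (parameter name is assumed). */
    static URI searchUri(String baseUrl, List<String> locations) {
        StringBuilder sb = new StringBuilder(baseUrl);
        char sep = '?';
        for (String location : locations) {
            sb.append(sep).append("s=").append(encode(location));
            sep = '&';
        }
        return URI.create(sb.toString());
    }

    private static String encode(String value) {
        try {
            // URLEncoder uses form encoding, so spaces become '+'.
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    /** Issues a GET and returns the response body as text; no JVM or Lucene start-up per call. */
    static String fetch(URI uri) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) uri.toURL().openConnection();
        conn.setRequestMethod("GET");
        conn.setConnectTimeout(5000);
        conn.setReadTimeout(5000);
        try (InputStream in = conn.getInputStream();
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buffer = new byte[4096];
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
            return out.toString("UTF-8");
        } finally {
            conn.disconnect();
        }
    }
}
```

The design point is only that the expensive index start-up happens once in the server process, so each lookup costs an HTTP round trip instead of a process launch.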



Re: NER Parser tests behind proxy?

2015-11-24 Thread Mattmann, Chris A (3980)
Gotcha Tim, OK that helps.

Thamme, can you try and test this behind a proxy so that we can
try and replicate what Tim is seeing?

As for packaging the models, Stanford NER may be difficult to do
that, not only b/c of the license (GPLv3 [1], which is why we did it
as a runtime dependency, and optional, since we also did Apache
OpenNLP), but b/c of the size of the models. Apache OpenNLP models
are there and freely available, but no Maven packaging exists
for them.

We’ll get this figured out Tim.

Cheers,
Chris

[1] http://nlp.stanford.edu/software/corenlp.shtml

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++





-Original Message-
From: "Allison, Timothy B." 
Reply-To: "dev@tika.apache.org" 
Date: Tuesday, November 24, 2015 at 6:07 AM
To: "dev@tika.apache.org" 
Cc: ThammeGowda Narayanaswamy 
Subject: RE: NER Parser tests behind proxy?

>Y, you do, but you (or I) can set the proxy for Maven correctly and
>(without the NER requirement) the build works fine.
>
>***WARNING, what I'm running into might very well just be user error in
>not telling Maven to pass the proxy info to Groovy...this is why I didn't
>open an issue :) I've done some googling, but haven't found an answer to
>this.***
>
>In response to Thamme's questions:
>>> Which is better?
>>> 1. List 'access to opennlp.sourceforge.net' as a requirement
>I have access without a problem via regular means, the problem is that
>Maven isn't passing proxy information into Groovy when it tries to make
>the call to get the document (I confirmed this by dumping system props
>within ModelGetter).  Perhaps we just document that you need to download
>the four model files manually and stick them in the right subdirectory if
>you are behind a proxy (ugly solution, but would probably work)?
>
>
>>>2. Package and deploy models as a maven artifact
>Are there licensing issues for the current models?  Are the current
>models ASLv2.0?  Would we need all four full models?  And, y, my
>suggestion was to build a very small model and push it to source control
>in the resources directory.
>
>All this said, 1) again, this could be user error and 2) the addition of
>Stanford NER is fantastic...Thank you for this addition!
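
As a rough illustration of the proxy point above, the sketch below resolves
a java.net.Proxy from the standard JVM properties http.proxyHost and
http.proxyPort. The class name ProxyFromSystemProps is hypothetical and this
is not what ModelGetter.groovy does; it only shows what a download step could
check, assuming those properties actually reach the JVM that runs the script.

```java
import java.net.InetSocketAddress;
import java.net.Proxy;

// Hypothetical helper, not part of Tika or ModelGetter.groovy.
final class ProxyFromSystemProps {
    // Returns an HTTP proxy when a host is configured, otherwise NO_PROXY.
    static Proxy resolve(String host, String port) {
        if (host == null || host.isEmpty()) {
            return Proxy.NO_PROXY;
        }
        // The JVM documents 80 as the default value of http.proxyPort.
        int portNumber = (port == null || port.isEmpty()) ? 80 : Integer.parseInt(port);
        return new Proxy(Proxy.Type.HTTP,
                InetSocketAddress.createUnresolved(host, portNumber));
    }

    public static void main(String[] args) {
        Proxy proxy = resolve(System.getProperty("http.proxyHost"),
                System.getProperty("http.proxyPort"));
        // Prints "DIRECT" when no proxy properties are visible to this JVM.
        System.out.println("Proxy in effect: " + proxy);
    }
}
```

The resulting Proxy can be handed to URL.openConnection(Proxy). If the
properties turn out to be missing inside the build, setting them on the
Maven JVM itself (for example through MAVEN_OPTS) may be worth trying,
though that has not been verified in this thread.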
>
>
>-Original Message-
>From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov]
>Sent: Monday, November 23, 2015 11:12 AM
>To: dev@tika.apache.org
>Cc: ThammeGowda Narayanaswamy 
>Subject: Re: NER Parser tests behind proxy?
>
>Hey Tim,
>
>Why shouldn’t we have to worry
>about connectivity outside of the Maven stuff? I mean clearly, if I
>install Tika on a new system today without a Maven repo, I must be
>connected to the internet, right?
>
>Cheers,
>Chris
>
>
>
>-Original Message-
>From: "Allison, Timothy B." 
>Reply-To: "dev@tika.apache.org" 
>Date: Monday, November 23, 2015 at 8:03 AM
>To: "dev@tika.apache.org" 
>Cc: ThammeGowda Narayanaswamy 
>Subject: RE: NER Parser tests behind proxy?
>
>>The problem comes down to: ModelGetter.groovy which is trying to grab:
>>${basedir}/src/test/resources/org/apache/tika/parser/ner/opennlp/ner-person.bin
>>
>>If we could build a small model (and I mean really small) and package
>>it with Tika, we wouldn't have to worry about http connectivity outside
>>of the usual maven stuff.
>>
>>-Original Message-
>>From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov]
>>Sent: Monday, November 23, 2015 10:52 AM
>>To: dev@tika.apache.org
>>Cc: ThammeGowda Narayanaswamy 
>>Subject: Re: NER Parser tests behind proxy?
>>
>>Hey Tim,
>>
>>I’m not seeing these of course b/c I’m not behind a proxy. Thamme, any
>>ideas?
>>
>>Cheers,
>>Chris

Re: NER Parser tests behind proxy?

2015-11-23 Thread Mattmann, Chris A (3980)
Hey Tim,

Why shouldn’t we have to worry
about connectivity outside of the Maven stuff? I mean clearly, if
I install Tika on a new system today without a Maven repo, I must
be connected to the internet, right?

Cheers,
Chris



-Original Message-
From: "Allison, Timothy B." 
Reply-To: "dev@tika.apache.org" 
Date: Monday, November 23, 2015 at 8:03 AM
To: "dev@tika.apache.org" 
Cc: ThammeGowda Narayanaswamy 
Subject: RE: NER Parser tests behind proxy?

>The problem comes down to: ModelGetter.groovy which is trying to grab:
>${basedir}/src/test/resources/org/apache/tika/parser/ner/opennlp/ner-person.bin
>
>If we could build a small model (and I mean really small) and package it
>with Tika, we wouldn't have to worry about http connectivity outside of
>the usual maven stuff.
>
>-Original Message-
>From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov]
>Sent: Monday, November 23, 2015 10:52 AM
>To: dev@tika.apache.org
>Cc: ThammeGowda Narayanaswamy 
>Subject: Re: NER Parser tests behind proxy?
>
>Hey Tim,
>
>I’m not seeing these of course b/c I’m not behind a proxy. Thamme, any
>ideas?
>
>Cheers,
>Chris
>
>
>-Original Message-
>From: "Allison, Timothy B." 
>Reply-To: "dev@tika.apache.org" 
>Date: Thursday, November 19, 2015 at 5:36 PM
>To: "dev@tika.apache.org" 
>Subject: NER Parser tests behind proxy?
>
>>My proxy is configured for git/maven/etc, but how do I configure it
>>within the test so that I don't get this?
>>
>>GET : http://opennlp.sourceforge.net/models-1.5/en-ner-person.bin ->
>>tika-parsers\src\test\resources\org\apache\tika\parser\ner\opennlp\ner-person.bin
>>[INFO] ------------------------------------------------------------------------
>>[INFO] Reactor Summary:
>>[INFO]
>>[INFO] Apache Tika parent ................ SUCCESS [3.264s]
>>[INFO] Apache Tika core .................. SUCCESS [44.470s]
>>[INFO] Apache Tika parsers ............... FAILURE [1:56.462s]
>>[INFO] Apache Tika XMP ................... SKIPPED
>>[INFO] Apache Tika serialization ......... SKIPPED
>>[INFO] Apache Tika batch ................. SKIPPED
>>[INFO] Apache Tika application ........... SKIPPED
>>[INFO] Apache Tika OSGi bundle ........... SKIPPED
>>[INFO] Apache Tika translate ............. SKIPPED
>>[INFO] Apache Tika server ................ SKIPPED
>>[INFO] Apache Tika examples .............. SKIPPED
>>[INFO] Apache Tika Java-7 Components ..... SKIPPED
>>[INFO] Apache Tika ....................... SKIPPED
>>[INFO] ------------------------------------------------------------------------
>>[INFO] BUILD FAILURE
>>[INFO] ------------------------------------------------------------------------
>>[INFO] Total time: 2:45.245s
>>[INFO] Finished at: Thu Nov 19 20:29:34 EST 2015
>>[INFO] Final Memory: 52M/482M
>>[INFO] ------------------------------------------------------------------------
>>[ERROR] Failed to execute goal org.codehaus.groovy.maven:gmaven-plugin:1.0:execute (testSetup) on project tika-parsers: java.net.ConnectException: Connection refused: connect -> [Help 1]
>>org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal org.codehaus.groovy.maven:gmaven-plugin:1.0:execute (testSetup) on project tika-parsers: java.net.ConnectException: Connection refused: connect
>>  at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:217)
>>  at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
>>  at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java

Re: NER Parser tests behind proxy?

2015-11-23 Thread Mattmann, Chris A (3980)
Hey Tyler, it’s only part of the tests.


-Original Message-
From: Tyler Palsulich 
Reply-To: "dev@tika.apache.org" 
Date: Monday, November 23, 2015 at 8:08 AM
To: "dev@tika.apache.org" 
Cc: ThammeGowda Narayanaswamy 
Subject: RE: NER Parser tests behind proxy?

>Apologies if I missed a discussion about this earlier, but should we be
>downloading a model by default?
>
>Tyler
>On Nov 23, 2015 8:03 AM, "Allison, Timothy B."  wrote:
>
>> The problem comes down to: ModelGetter.groovy which is trying to grab:
>> ${basedir}/src/test/resources/org/apache/tika/parser/ner/opennlp/ner-person.bin
>>
>> If we could build a small model (and I mean really small) and package it
>> with Tika, we wouldn't have to worry about http connectivity outside of
>> the usual maven stuff.
>>
>> -Original Message-
>> From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov]
>> Sent: Monday, November 23, 2015 10:52 AM
>> To: dev@tika.apache.org
>> Cc: ThammeGowda Narayanaswamy 
>> Subject: Re: NER Parser tests behind proxy?
>>
>> Hey Tim,
>>
>> I’m not seeing these of course b/c I’m not behind a proxy. Thamme, any
>> ideas?
>>
>> Cheers,
>> Chris
>>
>>
>> -Original Message-
>> From: "Allison, Timothy B." 
>> Reply-To: "dev@tika.apache.org" 
>> Date: Thursday, November 19, 2015 at 5:36 PM
>> To: "dev@tika.apache.org" 
>> Subject: NER Parser tests behind proxy?
>>
>> >My proxy is configured for git/maven/etc, but how do I configure it
>> >within the test so that I don't get this?
>> >
>> >GET : http://opennlp.sourceforge.net/models-1.5/en-ner-person.bin ->
>> >tika-parsers\src\test\resources\org\apache\tika\parser\ner\opennlp\ner-person.bin
>> >[INFO] ------------------------------------------------------------------------
>> >[INFO] Reactor Summary:
>> >[INFO]
>> >[INFO] Apache Tika parent ................ SUCCESS [3.264s]
>> >[INFO] Apache Tika core .................. SUCCESS [44.470s]
>> >[INFO] Apache Tika parsers ............... FAILURE [1:56.462s]
>> >[INFO] Apache Tika XMP ................... SKIPPED
>> >[INFO] Apache Tika serialization ......... SKIPPED
>> >[INFO] Apache Tika batch ................. SKIPPED
>> >[INFO] Apache Tika application ........... SKIPPED
>> >[INFO] Apache Tika OSGi bundle ........... SKIPPED
>> >[INFO] Apache Tika translate ............. SKIPPED
>> >[INFO] Apache Tika server ................ SKIPPED
>> >[INFO] Apache Tika examples .............. SKIPPED
>> >[INFO] Apache Tika Java-7 Components ..... SKIPPED
>> >[INFO] Apache Tika ....................... SKIPPED
>> >[INFO] ------------------------------------------------------------------------
>> >[INFO] BUILD FAILURE
>> >[INFO] ------------------------------------------------------------------------
>> >[INFO] Total time: 2:45.245s
>> >[INFO] Finished at: Thu Nov 19 20:29:34 EST 201

Re: NER Parser tests behind proxy?

2015-11-23 Thread Mattmann, Chris A (3980)
>Caused by: java.net.ConnectException: Connection refused: connect
>   at java.net.DualStackPlainSocketImpl.connect0(Native Method)
>   at java.net.DualStackPlainSocketImpl.socketConnect(DualStackPlainSocketImpl.java:79)
>   at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
>   at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
>   at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
>   at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:172)
>   at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>   at java.net.Socket.connect(Socket.java:589)
>   at java.net.Socket.connect(Socket.java:538)
>   at sun.net.NetworkClient.doConnect(NetworkClient.java:180)
>   at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
>   at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
>   at sun.net.www.http.HttpClient.<init>(HttpClient.java:211)
>   at sun.net.www.http.HttpClient.New(HttpClient.java:308)
>   at sun.net.www.http.HttpClient.New(HttpClient.java:326)
>   at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1169)
>   at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1105)
>   at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:999)
>   at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:933)
>   at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1513)
>   at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1441)
>   at sun.net.www.protocol.http.HttpURLConnection.getHeaderField(HttpURLConnection.java:2943)
>   at java.net.URLConnection.getHeaderFieldLong(URLConnection.java:629)
>   at java.net.URLConnection.getContentLengthLong(URLConnection.java:501)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at org.codehaus.groovy.runtime.callsite.PojoMetaMethodSite$PojoCachedMethodSiteNoUnwrapNoCoerce.invoke(PojoMetaMethodSite.java:229)
>   at org.codehaus.groovy.runtime.callsite.PojoMetaMethodSite.call(PojoMetaMethodSite.java:52)
>   at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:43)
>   at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:116)
>   at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:120)
>   at ModelGetter.downloadFile(ModelGetter.groovy:61)
>   ... 42 more
>
>-Original Message-
>From: Nick Burch [mailto:apa...@gagravarr.org]
>Sent: Thursday, November 19, 2015 7:41 PM
>To: dev@tika.apache.org
>Subject: Re: [DISCUSS] Moving to Git
>
>On Thu, 19 Nov 2015, Mattmann, Chris A (3980) wrote:
>> I’ll be happy to update our docs and to write a wiki page on using
>> Tika & Git that we can refer folks to. I think I’ve demonstrated
>> documenting things on the Tika wiki :)
>
>Great stuff! Scribble something sensible down, and I can vote +1 to the
>move, plus learn more about Git at the same time :)
>
>Nick



NER wiki page up

2015-11-20 Thread Mattmann, Chris A (3980)
Hey Team,

Thamme and I added a wiki page for Tika/Stanford NER and Apache
OpenNLP integration:

http://wiki.apache.org/tika/TikaAndNER


Cheers,
Chris

P.S. Nick - Git instructions coming next :)


Re: [DISCUSS] Moving to Git

2015-11-19 Thread Mattmann, Chris A (3980)
Hey Nick,

I’ll be happy to update our docs and to write a wiki page
on using Tika & Git that we can refer folks to. I think I’ve
demonstrated documenting things on the Tika wiki :)

Fair enough?

Cheers,
Chris


-Original Message-
From: Nick Burch 
Reply-To: "dev@tika.apache.org" 
Date: Thursday, November 19, 2015 at 4:33 AM
To: "dev@tika.apache.org" 
Subject: Re: [DISCUSS] Moving to Git

>On Wed, 18 Nov 2015, Mattmann, Chris A (3980) wrote:
>> Git has something similar to svn:externals:
>>
>> http://stackoverflow.com/questions/571232/svnexternals-equivalent-in-git
>
>Good to know
>
>> I’ve seen both used in the same way. Also the examples site code
>> is something we could always gin up a script solution to and isn’t
>> a blocker by any means
>
>Guess it depends on if we move the website over as well to git, or leave
>it as svn?
>
>
>> As to the discussions of what’s going on with Git/Github/version
>> control, etc., the use of writeable Git repositories at the ASF
>> has been sanctioned and used pervasively for years. That Git/Github
>> /version control *policy* discussion is pretty independent of using
>> the ASF’s own sanctioned writeable git repos on ASF hardware, which
>> is all I’m proposing to do.
>
>I know it's allowed! I've just also seen lots of things about how it can
>be done wrong, either deliberately or accidentally, and I don't want Tika
>having that issue too. I haven't used Git at the ASF enough to be sure
>what we should or shouldn't be doing, so I think having that written down
>by our git experts first would be good for everyone like me!
>
>> Infra has put policies (temporarily) in place to deal with any of
>> the branching issues that have shown up etc. So there is already
>> enforcement and so on.
>
>Once that's relaxed, we'll want our own rules about when, where and
>if-ever that's allowed, so everyone knows!
>
>Additionally, on the github side, quite a few people currently have their
>own github mirrors of Tika with branches in that which aren't held in
>SVN. I'm not sure what the right answer is, but I think we need to get a
>policy written down on when those need to be pushed into the ASF git
>master, what happens when they are etc
>
>
>> Finally it seems like there is good support so far for this, so
>> I’ll keep collecting feedback before calling an official vote maybe
>> in the next few days. I’m really hoping there is really no big
>> difference other than replacing svn co with git clone and replacing
>> svn commit with git commit && git push in most places.
>
>I agree, for simple stuff it should be a small change. It's the less
>simple stuff I'd rather we got right first, rather than doing wrong and
>having to unpick later! Especially as we bring in new committers, it's a
>lot easier if they can refer to somewhere to see our rules. (Even if it
>is a short wiki page that just says "don't" against a long list of things!)
>
>Nick



Re: [DISCUSS] Moving to Git

2015-11-18 Thread Mattmann, Chris A (3980)
Hey Nick,

Git has something similar to svn:externals:

http://stackoverflow.com/questions/571232/svnexternals-equivalent-in-git


I’ve seen both used in the same way. Also the examples site code
is something we could always gin up a script solution to and isn’t
a blocker by any means - it’s a smallish portion of the overall
process and even if it had to be done by hand it’s something we don’t
do often enough for it to be a real burden. I can speak from experience
having done most or all of Tika’s releases.

As to the discussions of what’s going on with Git/Github/version
control, etc., the use of writeable Git repositories at the ASF
has been sanctioned and used pervasively for years. That Git/Github
/version control *policy* discussion is pretty independent of using
the ASF’s own sanctioned writeable git repos on ASF hardware, which
is all I’m proposing to do. AKA I’m proposing we move Tika’s
canonical repo from:

http://svn.apache.org/repos/asf/tika/

TO:

https://git-wip-us.apache.org/repos/asf/tika.git

Infra has put policies (temporarily) in place to deal with any of
the branching issues that have shown up etc. So there is already
enforcement and so on. And like I said, the ASF has allowed writeable
Git repos for many years now.

Finally it seems like there is good support so far for this, so
I’ll keep collecting feedback before calling an official vote maybe
in the next few days. I’m really hoping there is really no big
difference other than replacing svn co with git clone and replacing
svn commit with git commit && git push in most places. One last note:
many of the “issues” brought up on other projects or being discussed
at a Foundation policy level are issues e.g., with the Incubator,
some with newer (ish) TLPs that have arisen over the past few years
and that are pushing the boundaries on how to use Git in ways that
are forcing the foundation to ask questions at its core policy
levels. That discussion is ongoing. Tika has been around since 2007,
includes a strong set of ASF members, has seen the version control
debates over the years and long since survived them, etc. I see no
evidence and an extremely low probability that we will use writeable
ASF git repos in any such way that drives the policy at the foundation
level in the same way.

Instead, I see pretty boring use of Git writeable repos to become
more consistent with the way it seems like more and more of us are
doing development (even today with Tika).

HTH.

Cheers,
Chris


-Original Message-
From: Nick Burch 
Reply-To: "dev@tika.apache.org" 
Date: Wednesday, November 18, 2015 at 7:44 AM
To: "dev@tika.apache.org" 
Subject: Re: [DISCUSS] Moving to Git

>On Wed, 18 Nov 2015, Mattmann, Chris A (3980) wrote:
>> I propose we move to writeable git repos for Tika for our repository. I
>> mostly interact with Git & Github nowadays even with Tika using the
>> mirroring and PR interaction support.
>
>I'm -0 on this at the moment
>
>Having followed other Apache lists, it seems that there's quite a few
>ways to use Git, not all of them compatible with the Apache way, and some
>of them easy to do wrong.
>
>Were we to have some proposed guidelines/information/rules on using Git
>for Tika, such as about what branches squashing might be permitted on,
>rules for that, information/rules on remote branches, how to handle /
>when to use / not-use private branches and github branches, and the like,
>then I'd be minded to change my vote
>
>I'm also wondering how it would work with the website pulling in bits of
>the Tika Examples module from SVN for the examples page? That currently
>uses a svn:externals, so we can keep the code in a normal module + unit
>test it, then pulls in snippets, how would that work if the code moved to
>git?
>
>Nick


