Re: support of Java 5

2011-06-08 Thread Benson Margulies
Virtually no one is paying attention to Oracle's pronouncements on this subject. If 1.7 ever shows up, that might change. On Wed, Jun 8, 2011 at 1:05 PM, Nick Burch wrote: > On 08/06/11 16:21, Oleg Tikhonov wrote: >> >> As you may know, Oracle announced Java 5 SE EOL (End Of Life) since 2009. >> H

[jira] [Created] (TIKA-657) Email parser gets into trouble on malformed html in enron corpus

2011-05-07 Thread Benson Margulies (JIRA)
Components: parser Affects Versions: 0.9 Reporter: Benson Margulies There is a very large corpus of email addresses available: http://www.cs.cmu.edu/~enron/. In processing even a subset of this corpus, I see numerous 'unexpected RuntimeException' errors resulting fr

Two oddities

2011-05-06 Thread Benson Margulies
1: org.apache.tika.parser.CompositeParser.parse(InputStream, ContentHandler, Metadata, ParseContext) calls org.apache.tika.parser.CompositeParser.getParser(Metadata), not passing in its parser context, and the latter makes a new one. 2: I have a somewhat odd classpath environment: I have embedded t

Re: [jira] [Commented] (TIKA-213) JSON output from Tika CLI

2011-04-10 Thread Benson Margulies
Don't stop on my account. On Sun, Apr 10, 2011 at 2:20 PM, Chris A. Mattmann (JIRA) wrote: > >    [ > https://issues.apache.org/jira/browse/TIKA-213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13018126#comment-13018126 > ] > > Chris A. Mattmann comment

[jira] [Commented] (TIKA-213) JSON output from Tika CLI

2011-04-10 Thread Benson Margulies (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13018119#comment-13018119 ] Benson Margulies commented on TIKA-213: --- I use jackson all the time. It's

Re: [jira] [Commented] (TIKA-213) JSON output from Tika CLI

2011-04-10 Thread Benson Margulies
I use jackson all the time. It's not that big and you know Tatu does a good job on it. On Sun, Apr 10, 2011 at 1:52 PM, Chris A. Mattmann (JIRA) wrote: > >    [ > https://issues.apache.org/jira/browse/TIKA-213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId

Readability

2011-03-28 Thread Benson Margulies
I've pushed some code: git://github.com/basis-technology-corp/Java-readability.git

[jira] Commented: (TIKA-610) Invent TikaRuntimeException

2011-03-09 Thread Benson Margulies (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13004886#comment-13004886 ] Benson Margulies commented on TIKA-610: --- I didn't spot that MimeExcepti

Re: Really no binary releases?

2011-03-07 Thread Benson Margulies
It Would Be Nice. I might make you a patch. On Mon, Mar 7, 2011 at 4:19 AM, Jukka Zitting wrote: > Hi, > > On 03/06/2011 06:44 PM, Benson Margulies wrote: >> >> It seems that tika doesn't push any sort of binary package to the >> mirrors. Is this on purpose? &

Really no binary releases?

2011-03-06 Thread Benson Margulies
It seems that tika doesn't push any sort of binary package to the mirrors. Is this on purpose?

[jira] Created: (TIKA-610) Invent TikaRuntimeException

2011-03-02 Thread Benson Margulies (JIRA)
Invent TikaRuntimeException --- Key: TIKA-610 URL: https://issues.apache.org/jira/browse/TIKA-610 Project: Tika Issue Type: Bug Reporter: Benson Margulies As per TIKA-597, there are cases where Tika

[jira] Commented: (TIKA-469) The Parser is not correctly outputting Arabic text documents

2011-02-16 Thread Benson Margulies (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12995473#comment-12995473 ] Benson Margulies commented on TIKA-469: --- oops. I agree with you about that,

[jira] Commented: (TIKA-469) The Parser is not correctly outputting Arabic text documents

2011-02-16 Thread Benson Margulies (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12995385#comment-12995385 ] Benson Margulies commented on TIKA-469: --- ken, HTML is the one file type *not* on

[jira] Commented: (TIKA-597) Bogus exception handler in org.apache.tika.parser.mail.MailContentHandler.body(BodyDescriptor, InputStream)

2011-02-14 Thread Benson Margulies (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12994455#comment-12994455 ] Benson Margulies commented on TIKA-597: --- I made you a patch. > Bogus ex

[jira] Updated: (TIKA-597) Bogus exception handler in org.apache.tika.parser.mail.MailContentHandler.body(BodyDescriptor, InputStream)

2011-02-14 Thread Benson Margulies (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benson Margulies updated TIKA-597: -- Attachment: TIKA-597.patch > Bogus exception handler

[jira] Created: (TIKA-597) Bogus exception handler in org.apache.tika.parser.mail.MailContentHandler.body(BodyDescriptor, InputStream)

2011-02-14 Thread Benson Margulies (JIRA)
/jira/browse/TIKA-597 Project: Tika Issue Type: Bug Components: parser Affects Versions: 0.8 Reporter: Benson Margulies org.apache.tika.parser.mail.MailContentHandler.body(BodyDescriptor, InputStream) contains an exception handler that calls

[jira] Created: (TIKA-572) Update plugin versions in the POM structure

2010-12-14 Thread Benson Margulies (JIRA)
Reporter: Benson Margulies The tika build uses an ancient version of the gpg plugin, and probably some other fossils as well. It would be good to update before the next release. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to

[jira] Updated: (TIKA-570) If this is a BMP, my name is horatio alger

2010-12-11 Thread Benson Margulies (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benson Margulies updated TIKA-570: -- Description: I am attaching a file which Tika is identifying as a bmp. It contains ordinary

[jira] Updated: (TIKA-570) If this is a BMP, my name is horatio alger

2010-12-11 Thread Benson Margulies (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benson Margulies updated TIKA-570: -- Attachment: C80A5295-EFC7-44DD-9A39-B882D1EC6F38.txt C80A5295-EFC7-44DD-9A39

[jira] Created: (TIKA-570) If this is a BMP, my name is horatio alger

2010-12-11 Thread Benson Margulies (JIRA)
Reporter: Benson Margulies I am attaching a file which Tika is identifying as a bmp. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.

Re: Charset SPI

2010-11-06 Thread Benson Margulies
at 3:19 PM, Ken Krugler wrote: > > On Nov 4, 2010, at 7:08am, Benson Margulies wrote: > >> Have you all ever considered wiring the CharsetDetector to the >> java.nio.Charset SPI mechanism as an autodetecting charset? > > No, I don't remember this coming up. > > C

[jira] Commented: (TIKA-539) Encoding detection is too biased by encoding in meta tag

2010-11-06 Thread Benson Margulies (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12928978#action_12928978 ] Benson Margulies commented on TIKA-539: --- Have you checked the insides of nutch

Charset SPI

2010-11-04 Thread Benson Margulies
Have you all ever considered wiring the CharsetDetector to the java.nio.Charset SPI mechanism as an autodetecting charset? I could knock one off. Would you want it to be a separate JAR or just in the parsers with the detector?
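For readers unfamiliar with the mechanism being proposed, here is a minimal sketch of what an "autodetecting charset" plugged into the java.nio.charset SPI could look like. It is an illustration under stated assumptions, not the code offered in this thread: the class names and the charset name "x-autodetect-sketch" are hypothetical, and the detection step is a simple byte-order-mark check standing in for Tika's CharsetDetector, which is not called here.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;
import java.nio.charset.spi.CharsetProvider;
import java.util.Collections;
import java.util.Iterator;

// Hypothetical provider exposing one decode-only charset that picks a real
// charset from the first bytes it sees.
class AutoDetectCharsetProvider extends CharsetProvider {
    static final String NAME = "x-autodetect-sketch"; // assumed name

    private final Charset charset = new AutoDetectCharset();

    @Override
    public Iterator<Charset> charsets() {
        return Collections.singletonList(charset).iterator();
    }

    @Override
    public Charset charsetForName(String name) {
        return NAME.equalsIgnoreCase(name) ? charset : null;
    }

    // Stand-in for a real detector: a UTF-16 BOM selects UTF-16, anything
    // else falls back to UTF-8. Peeks at the buffer without consuming it.
    static Charset guess(ByteBuffer in) {
        int p = in.position();
        if (in.remaining() >= 2) {
            int b0 = in.get(p) & 0xFF;
            int b1 = in.get(p + 1) & 0xFF;
            if ((b0 == 0xFE && b1 == 0xFF) || (b0 == 0xFF && b1 == 0xFE)) {
                return StandardCharsets.UTF_16; // its decoder consumes the BOM
            }
        }
        return StandardCharsets.UTF_8;
    }

    static final class AutoDetectCharset extends Charset {
        AutoDetectCharset() {
            super(NAME, null); // no aliases
        }

        @Override
        public boolean contains(Charset cs) {
            return false;
        }

        @Override
        public CharsetDecoder newDecoder() {
            return new Decoder(this);
        }

        @Override
        public CharsetEncoder newEncoder() {
            throw new UnsupportedOperationException("decode-only charset");
        }

        @Override
        public boolean canEncode() {
            return false;
        }
    }

    static final class Decoder extends CharsetDecoder {
        private CharsetDecoder delegate; // chosen on the first non-empty input

        Decoder(Charset cs) {
            super(cs, 1.0f, 2.0f);
        }

        @Override
        protected CoderResult decodeLoop(ByteBuffer in, CharBuffer out) {
            if (delegate == null) {
                if (!in.hasRemaining()) {
                    return CoderResult.UNDERFLOW; // nothing to inspect yet
                }
                delegate = guess(in).newDecoder();
            }
            // Leftover partial sequences are reported by the outer decoder.
            return delegate.decode(in, out, false);
        }

        @Override
        protected CoderResult implFlush(CharBuffer out) {
            if (delegate == null) {
                return CoderResult.UNDERFLOW;
            }
            delegate.decode(ByteBuffer.allocate(0), out, true); // signal end of input
            return delegate.flush(out);
        }

        @Override
        protected void implReset() {
            delegate = null; // detect again for the next input
        }
    }
}
```

To make such a charset visible to Charset.forName, the provider class would need to be public with a public no-argument constructor and be listed in a META-INF/services/java.nio.charset.spi.CharsetProvider file inside the JAR, which is where the "separate JAR or in the parsers" question above comes in. A real detector would also want to buffer more than the first two bytes before committing to a guess.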

Re: Build problem with trunk?

2010-11-04 Thread Benson Margulies
> want to release it to Maven central. > > I've also added back in the java.net repository, so the build works again. > > -- Ken > > On Nov 4, 2010, at 5:21am, Benson Margulies wrote: > >> Apologies, but google didn't yield an answer to this. Building with m

Boilerpipe is nice, but what about readability?

2010-11-04 Thread Benson Margulies
I just coded a Java port of the arclabs 'readability' javascript code, which has a very strong reputation as a device for grabbing the useful content from newsy web pages. I could contribute it to Tika, if (a) you wanted it, and (b) there was some reasonable way to decide or configure which one to

[jira] Created: (TIKA-542) Publish Javadoc on tika.apache.org

2010-11-04 Thread Benson Margulies (JIRA)
Reporter: Benson Margulies The front page of the site doesn't seem to offer a path to the javadoc. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.

Build problem with trunk?

2010-11-04 Thread Benson Margulies
Apologies, but google didn't yield an answer to this. Building with maven 3 ... [ERROR] Failed to execute goal on project tika-parsers: Could not resolve dependencies for project org.apache.tika:tika-parsers:bundle:0.8-SNAPSHOT: Could not find artifact rome:rome:jar:1.0 in maven2-repository.google