Re: [VOTE] Release Apache Tika 2.9.0 Candidate #1

2023-08-28 Thread Julien Nioche
Thanks Tim!

Compiled StormCrawler with Tika 2.9.0 and ran a crawl without noticing any
issues.

+1 (non-binding) to release

Julien

On Wed, 23 Aug 2023 at 15:50, Tim Allison  wrote:

> A candidate for the Tika 2.9.0 release is available at:
> https://dist.apache.org/repos/dist/dev/tika/2.9.0
>
> The release candidate is a zip archive of the sources in:
> https://github.com/apache/tika/tree/2.9.0-rc1/
>
> The SHA-512 checksum of the archive is
>
> 4b54172163a2e86b805e7077b11d21902dc2137a849eb0d58ca06a904a91007ed14ac78ee8266531ff62cd666059409d728b679c571304c7b672c6446d9c5a15.
>
> In addition, a staged maven repository is available here:
>
> https://repository.apache.org/content/repositories/orgapachetika-1095/org/apache/tika
>
> Please vote on releasing this package as Apache Tika 2.9.0.
> The vote is open for the next 72 hours and passes if a majority of at
> least three +1 Tika PMC votes are cast.
>
> [ ] +1 Release this package as Apache Tika 2.9.0
> [ ] -1 Do not release this package because...
>
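
For anyone checking a candidate locally, the SHA-512 posted in the vote email can be recomputed along these lines. This is a sketch, not part of the thread: in a real check the bytes come from the downloaded source archive, while here a small in-memory sample keeps the snippet self-contained.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class Sha512Check {
    public static void main(String[] args) throws Exception {
        // Assumption: in a real verification these bytes would be read from
        // the downloaded release archive (e.g. tika-2.9.0-src.zip).
        byte[] data = "example archive contents".getBytes(StandardCharsets.UTF_8);

        MessageDigest md = MessageDigest.getInstance("SHA-512");
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest(data)) {
            hex.append(String.format("%02x", b));
        }

        // Compare hex.toString() against the checksum posted in the vote email.
        System.out.println("digest length: " + hex.length()); // SHA-512 -> 128 hex chars
        System.out.println(hex);
    }
}
```

The digest of any SHA-512 computation is 64 bytes, so a correct checksum line is always 128 hex characters, as in the vote emails above.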


-- 

*Open Source Solutions for Text Engineering*

http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble 


Re: [VOTE] Release Apache Tika 2.8.0 Candidate #2

2023-05-12 Thread Julien Nioche
Thanks Tim,

I have tried with the RC2 and it is now working fine.

+1 from me

J

On Thu, 11 May 2023 at 21:08, Tim Allison  wrote:

> A candidate for the Tika 2.8.0 release is available at:
> https://dist.apache.org/repos/dist/dev/tika/2.8.0
>
> The release candidate is a zip archive of the sources in:
> https://github.com/apache/tika/tree/2.8.0-rc2/
>
> The SHA-512 checksum of the archive is
>
> b39d485c8046019fb9319d7d76c68d14b8494dea25619209058244cb567d0c51e6c243ca2a478d611e079ed47d64294c82cf9475889f23cd73cbba13ee4e6cd9.
>
> In addition, a staged maven repository is available here:
>
> https://repository.apache.org/content/repositories/orgapachetika-1094/org/apache/tika
>
> Please vote on releasing this package as Apache Tika 2.8.0.
> The vote is open for the next 72 hours and passes if a majority of at
> least three +1 Tika PMC votes are cast.
>
> [ ] +1 Release this package as Apache Tika 2.8.0
> [ ] -1 Do not release this package because...
>
> Here's my +1.
>
> Thank you!
>
> Best,
>
> Tim
>


-- 

*Open Source Solutions for Text Engineering*

http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble 


Re: [VOTE] Apache Tika 2.8.0 Release Candidate 1

2023-05-11 Thread Julien Nioche
Thanks Tim,

I am testing 2.8.0 with StormCrawler

Apart from a lot of warnings about missing classes like
*Caused by: java.lang.ClassNotFoundException:
org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream*
I am also getting a failed test when trying to extract text from an
embedded document.

I can't see anything related in the release notes, apart maybe from:

   * Improve extraction of embedded file names in .docx (TIKA-3968).

I've created a branch for it in SC ->
https://github.com/DigitalPebble/storm-crawler/tree/tika2.8
in case anyone has the time and inclination to try to reproduce the issue.

I'll see if I can find the source of the problem.

Julien
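
A quick way to see whether the missing class is on the classpath at all, and which jar supplies it if so. This probe is a sketch, not from the thread; the class name is taken from the ClassNotFoundException quoted above.

```java
public class ClasspathCheck {
    public static void main(String[] args) {
        // Class name taken from the ClassNotFoundException in the warning above.
        String cls = "org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream";
        try {
            Class<?> c = Class.forName(cls);
            // Print which jar actually supplied the class.
            System.out.println(cls + " loaded from "
                    + c.getProtectionDomain().getCodeSource().getLocation());
        } catch (ClassNotFoundException e) {
            System.out.println("Not on classpath: " + cls);
        }
    }
}
```

If the class is absent, the usual fix is to declare commons-compress as an explicit dependency of the embedding project rather than relying on a transitive version.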


On Tue, 9 May 2023 at 17:40, Tim Allison  wrote:

> A candidate for the Tika 2.8.0 release is available at:
> https://dist.apache.org/repos/dist/dev/tika/2.8.0
>
> The release candidate is a zip archive of the sources in:
> https://github.com/apache/tika/tree/2.8.0-rc1/
>
> The SHA-512 checksum of the archive is
>
> 6b514a45b87013c566e57af2b6a526bce0b3bf02a1dabefe998068aa49672ec4a7ec2ecfa538a84aca719607f339a44341caeaab1ca313fc1c161154ec095bbb.
>
> In addition, a staged maven repository is available here:
>
> https://repository.apache.org/content/repositories/orgapachetika-1093/org/apache/tika
>
> Please vote on releasing this package as Apache Tika 2.8.0.
> The vote is open for the next 72 hours and passes if a majority of at
> least three +1 Tika PMC votes are cast.
>
> [ ] +1 Release this package as Apache Tika 2.8.0
> [ ] -1 Do not release this package because...
>
> Here's my +1.
>
> Best,
>
> Tim
>


-- 

*Open Source Solutions for Text Engineering*

http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble 


Re: [VOTE] Release Apache Tika 2.7.0 Candidate #1

2023-02-03 Thread Julien Nioche
Hi Tim,

Thanks for the release. I ran Tika 2.7.0 with StormCrawler and did not
notice any problems.

Cheers

Julien

On Tue, 31 Jan 2023 at 19:13, Tim Allison  wrote:

> A candidate for the Tika 2.7.0 release is available at:
> https://dist.apache.org/repos/dist/dev/tika/2.7.0
>
> The release candidate is a zip archive of the sources in:
> https://github.com/apache/tika/tree/2.7.0-rc1/
>
> The SHA-512 checksum of the archive is
>
> 7f3505f6a86b617a37f25f31f4c6b3e4028d2baab700a5fe4070d38d6f625dba3c18db4010da84acb71af14ffdb1259cc64ea10d8ec2a22fc56667bfe1b52ad7.
>
> In addition, a staged maven repository is available here:
>
> https://repository.apache.org/content/repositories/orgapachetika-1092/org/apache/tika
>
> Please vote on releasing this package as Apache Tika 2.7.0.
> The vote is open for the next 72 hours and passes if a majority of at
> least three +1 Tika PMC votes are cast.
>
> [ ] +1 Release this package as Apache Tika 2.7.0
> [ ] -1 Do not release this package because...
>
>
> Here's my +1.
>
> Thank you!
>
> Best,
>
>   Tim
>


-- 

*Open Source Solutions for Text Engineering*

http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble 


[jira] [Closed] (TIKA-2269) NPE with FeedParser

2017-02-21 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche closed TIKA-2269.
---

Thanks for committing, [~talli...@mitre.org]

> NPE with FeedParser
> ---
>
> Key: TIKA-2269
> URL: https://issues.apache.org/jira/browse/TIKA-2269
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.14
>    Reporter: Julien Nioche
> Fix For: 2.0, 1.15
>
>
> Getting the NPE below when parsing 
> [https://chm.tbe.taleo.net/chm02/ats/servlet/Rss?org=TSA&cws=43]
> bq. Caused by: java.lang.NullPointerException
>   at org.apache.tika.parser.feed.FeedParser.stripTags(FeedParser.java:119)
>   at org.apache.tika.parser.feed.FeedParser.parse(FeedParser.java:74)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   ... 43 more



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (TIKA-2269) NPE with FeedParser

2017-02-20 Thread Julien Nioche (JIRA)
Julien Nioche created TIKA-2269:
---

 Summary: NPE with FeedParser
 Key: TIKA-2269
 URL: https://issues.apache.org/jira/browse/TIKA-2269
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.14
Reporter: Julien Nioche


Getting the NPE below when parsing 
[https://chm.tbe.taleo.net/chm02/ats/servlet/Rss?org=TSA&cws=43]

bq. Caused by: java.lang.NullPointerException
at org.apache.tika.parser.feed.FeedParser.stripTags(FeedParser.java:119)
at org.apache.tika.parser.feed.FeedParser.parse(FeedParser.java:74)
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
... 43 more




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
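
The trace above points at a null reaching FeedParser.stripTags, e.g. a feed entry with no description. The sketch below shows the kind of null guard that avoids such an NPE; the method name comes from the stack trace, but the body is purely illustrative and is not the real FeedParser code.

```java
public class StripTagsSketch {
    // Illustrative only: the real FeedParser.stripTags differs. The point is
    // the null guard, since feed entries may legitimately lack a description.
    static String stripTags(String s) {
        if (s == null) {
            return "";
        }
        return s.replaceAll("<[^>]*>", ""); // naive tag stripping for illustration
    }

    public static void main(String[] args) {
        System.out.println("[" + stripTags(null) + "]");        // prints []
        System.out.println(stripTags("<p>hello</p>"));          // prints hello
    }
}
```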


Re: [VOTE] Apache Tika 1.14 Release Candidate #1

2016-10-20 Thread Julien Nioche
Hi Tim

I had exiftool installed indeed, so that might explain it. All tests now
pass. Will have a closer look at it all later.

Thanks

Julien

On 20 October 2016 at 13:45, Allison, Timothy B.  wrote:

> https://issues.apache.org/jira/browse/TIKA-2056
>
> Perhaps?
>
> -Original Message-
> From: Julien Nioche [mailto:lists.digitalpeb...@gmail.com]
> Sent: Thursday, October 20, 2016 8:34 AM
> To: dev@tika.apache.org
> Subject: Re: [VOTE] Apache Tika 1.14 Release Candidate #1
>
> Hi
>
> Am getting the following when running 'mvn clean package', have I
> forgotten something obvious?
>
> Julien
>
> *Failed tests: *
> *  ForkParserIntegrationTest.testParserHandlingOfNonSerializable:210
> expected: but
> was:* *Tests
> in error: *
> *
> ForkParserIntegrationTest.testAttachingADebuggerOnTheFor
> kedParserShouldWork:234
> » Tika*
> *  ForkParserIntegrationTest.testForkedPDFParsing:257 » Tika Unable to
> serialize ...*
> *  ForkParserIntegrationTest.testForkedTextParsing:66 » Tika Unable to
> serialize ...*
>
> *Tests run: 755, Failures: 1, Errors: 3, Skipped: 17*
>
> *[INFO]
> *
> *[INFO] Reactor Summary:*
> *[INFO] *
> *[INFO] Apache Tika parent  SUCCESS
> [4.368s]*
> *[INFO] Apache Tika core .. SUCCESS
> [16.487s]*
> *[INFO] Apache Tika parsers ... FAILURE
> [4:54.631s]*
>
>
>
> On 19 October 2016 at 19:48, Chris Mattmann  wrote:
>
> > Hi Folks,
> >
> > A first candidate for the Tika 1.14 release is available at:
> >
> >   https://dist.apache.org/repos/dist/dev/tika/
> >
> > The release candidate is a zip archive of the sources in:
> >
> > https://git-wip-us.apache.org/repos/asf?p=tika.git;a=tree;hb=
> > 687d7706c9778e4f49f2834a07e5a9d99b23042b
> >
> > The SHA1 checksum of the archive is:
> > ad9152392ffe6b620c8102ab538df0579b36c520
> >
> > In addition, a staged maven repository is available here:
> >
> > https://repository.apache.org/content/repositories/orgapachetika-1020/
> >
> > Please vote on releasing this package as Apache Tika 1.14.
> > The vote is open for the next 72 hours and passes if a majority of at
> > least three +1 Tika PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Tika 1.14 [ ] -1 Do not release
> > this package because..
> >
> > Cheers,
> > Chris
> >
> > P.S. Of course here is my +1.
> >
> >
> >
> >
> >
> >
>
>
> --
>
> *Open Source Solutions for Text Engineering*
>
> http://www.digitalpebble.com
> http://digitalpebble.blogspot.com/
> #digitalpebble <http://twitter.com/digitalpebble>
>



-- 

*Open Source Solutions for Text Engineering*

http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble <http://twitter.com/digitalpebble>


Re: [VOTE] Apache Tika 1.14 Release Candidate #1

2016-10-20 Thread Julien Nioche
Hi

Am getting the following when running 'mvn clean package', have I forgotten
something obvious?

Julien

*Failed tests: *
*  ForkParserIntegrationTest.testParserHandlingOfNonSerializable:210
expected: but
was:*
*Tests in error: *
*
ForkParserIntegrationTest.testAttachingADebuggerOnTheForkedParserShouldWork:234
» Tika*
*  ForkParserIntegrationTest.testForkedPDFParsing:257 » Tika Unable to
serialize ...*
*  ForkParserIntegrationTest.testForkedTextParsing:66 » Tika Unable to
serialize ...*

*Tests run: 755, Failures: 1, Errors: 3, Skipped: 17*

*[INFO]
*
*[INFO] Reactor Summary:*
*[INFO] *
*[INFO] Apache Tika parent  SUCCESS
[4.368s]*
*[INFO] Apache Tika core .. SUCCESS
[16.487s]*
*[INFO] Apache Tika parsers ... FAILURE
[4:54.631s]*
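
Failures like "Tika Unable to serialize" stem from the fork parser shipping the parser object to a forked JVM, which requires it to be java.io.Serializable. A minimal probe for that property (a sketch, not from the thread):

```java
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.OutputStream;

public class SerializableProbe {
    // Returns true if the object can actually be serialized end to end,
    // which is what a fork-style parser needs before sending it to a child JVM.
    static boolean isSerializable(Object o) {
        try (ObjectOutputStream out =
                     new ObjectOutputStream(OutputStream.nullOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (IOException e) {
            // NotSerializableException is an IOException subclass.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("String: " + isSerializable("hello"));   // true
        System.out.println("Object: " + isSerializable(new Object())); // false
    }
}
```

A parser picking up a non-serializable member (for example one discovered from locally installed tools) would fail such a probe, which matches the environment-dependent nature of the failures reported here.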



On 19 October 2016 at 19:48, Chris Mattmann  wrote:

> Hi Folks,
>
> A first candidate for the Tika 1.14 release is available at:
>
>   https://dist.apache.org/repos/dist/dev/tika/
>
> The release candidate is a zip archive of the sources in:
>
> https://git-wip-us.apache.org/repos/asf?p=tika.git;a=tree;hb=
> 687d7706c9778e4f49f2834a07e5a9d99b23042b
>
> The SHA1 checksum of the archive is:
> ad9152392ffe6b620c8102ab538df0579b36c520
>
> In addition, a staged maven repository is available here:
>
> https://repository.apache.org/content/repositories/orgapachetika-1020/
>
> Please vote on releasing this package as Apache Tika 1.14.
> The vote is open for the next 72 hours and passes if a majority of at
> least three +1 Tika PMC votes are cast.
>
> [ ] +1 Release this package as Apache Tika 1.14
> [ ] -1 Do not release this package because..
>
> Cheers,
> Chris
>
> P.S. Of course here is my +1.
>
>
>
>
>
>


-- 

*Open Source Solutions for Text Engineering*

http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble 


Re: [VOTE] Moving SCM to Git

2016-01-13 Thread Julien Nioche
+1

On 2 January 2016 at 04:30, Mattmann, Chris A (3980) <
chris.a.mattm...@jpl.nasa.gov> wrote:

> Hi Everyone,
>
> DISCUSS thread here: http://s.apache.org/wVE
>
> Time to officially VOTE on moving Tika to Git. I’ve made a wiki
> page for our SCM explaining how to use Git at Apache, and how to
> use it with Github, and how to use it even in a traditional SVN
> sense. The page is here:
>
> https://wiki.apache.org/tika/UsingGit
>
>
> I’ve also linked it from the main wiki page. I took the liberty
> of updating the only other 2 pages on the wiki that referenced
> SCM with (pending) Git instructions as well:
>
> https://wiki.apache.org/tika/DeveloperResources
> https://wiki.apache.org/tika/ReleaseProcess
>
> From the DISCUSS thread it would seem the following members of
> the community support this move:
>
> Chris Mattmann
> Tyler Palsulich
> Bob Paulin
> Hong-Thai Nguyen
>
> Oleg Tikhonov
> David Meikle
>
>
> Given the above I’m going to count the above people as +1 in
> this VOTE if I don’t hear otherwise.
>
> Nick Burch said he would be more supportive if there was a guide,
> so I made one and updated the other wiki docs as above so hopefully
> that garners his VOTE.
>
> If you’d like to revise your VOTE or to VOTE for the first time,
> please use the ballot below:
>
> [ ] +1 Move the Apache Tika source control to Writeable Git repos
> at the ASF
> [ ] +0 Indifferent.
> [ ] -1 Don’t move the Apache Tika source control to Writeable Git
> repos at the ASF because..
>
> Of course, given the conversation I am +1 for this.
>
> Thanks for VOTE’ing I’ll leave the VOTE open through next Friday.
>
> Cheers,
> Chris
>
> ++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++
>
>
>
>


-- 

*Open Source Solutions for Text Engineering*

http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble 


[jira] [Commented] (TIKA-1599) Switch from TagSoup to JSoup

2015-12-09 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049248#comment-15049248
 ] 

Julien Nioche commented on TIKA-1599:
-

Don't think that this is the version they use now. 
[https://github.com/commoncrawl/nutch] maybe?

> Switch from TagSoup to JSoup
> 
>
> Key: TIKA-1599
> URL: https://issues.apache.org/jira/browse/TIKA-1599
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.7, 1.8
>Reporter: Ken Krugler
>Assignee: Ken Krugler
>Priority: Minor
>
> There are several Tika issues related to how TagSoup cleans up HTML 
> ([TIKA-381], [TIKA-985], maybe [TIKA-715]), but TagSoup doesn't seem to be 
> under active development.
> On the other hand I know of several projects that are now using 
> [JSoup|https://github.com/jhy/jsoup], which is an active project (albeit only 
> one main contributor) under the MIT license.
> I haven't looked into how hard it would be to switch this dependency.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1599) Switch from TagSoup to JSoup

2015-12-09 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049239#comment-15049239
 ] 

Julien Nioche commented on TIKA-1599:
-

Hi [~talli...@mitre.org]

Haven't kept a log of specific examples, but it was frequent enough to 
discourage any attempt at using XPath on them. 


> Switch from TagSoup to JSoup
> 
>
> Key: TIKA-1599
> URL: https://issues.apache.org/jira/browse/TIKA-1599
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.7, 1.8
>Reporter: Ken Krugler
>Assignee: Ken Krugler
>Priority: Minor
>
> There are several Tika issues related to how TagSoup cleans up HTML 
> ([TIKA-381], [TIKA-985], maybe [TIKA-715]), but TagSoup doesn't seem to be 
> under active development.
> On the other hand I know of several projects that are now using 
> [JSoup|https://github.com/jhy/jsoup], which is an active project (albeit only 
> one main contributor) under the MIT license.
> I haven't looked into how hard it would be to switch this dependency.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: [VOTE] Apache Tika 1.8 Release Candidate #2

2015-04-20 Thread Julien Nioche
Both Nutch and Behemoth declare Hadoop 1.2.1 as a dependency, and since it
does not use Guava they won't have the same issue. However, that is just the
default version and some people run them on Hadoop 2.x, in which case
they might need to find a workaround.

On 20 April 2015 at 15:56, Julien Nioche 
wrote:

> and I haven't tested it with Nutch either...
>
> On 20 April 2015 at 15:46, Julien Nioche 
> wrote:
>
>> I haven't tested the RC with Behemoth, it will probably have the same
>> issue but I'll do like you and defer the update if that's the case.
>>
>> On 20 April 2015 at 15:23, Ken Krugler 
>> wrote:
>>
>>>
>>> > From: Allison, Timothy B.
>>> > Sent: April 20, 2015 5:11:04am PDT
>>> > To: dev@tika.apache.org
>>> > Subject: RE: [VOTE] Apache Tika 1.8 Release Candidate #2
>>> >
>>> > If I understand correctly, if we release rc2, Tika 1.8 will break in
>>> Hadoop clusters across the land?!
>>> > Or, Hadoop folks will have to apply a classloading workaround or
>>> rebuild 1.8/trunk with small version mod in TIKA-1606 to get Tika to work.
>>> >
>>> > For most Hadoopites, this will be a straightforward fix, and I'm
>>> assuming that's why Ken is not more outspoken against releasing rc2 as is
>>> (Ken, let me know if I'm wrong!).
>>>
>>> Usually it's straightforward. Though whenever you start manipulating the
>>> classloader logic, you can get odd results.
>>>
>>> E.g. by forcing your job jar's dependencies to show up first, now you
>>> can have an issue where one of your jars masks an older/newer version that
>>> Hadoop needs, so the job fails for some other reason.
>>>
>>> But yes, I don't feel strongly enough about this to vote -1, as I don't
>>> think there are that many people using Tika with Hadoop.
>>>
>>> For Bixo, I'd defer updating the Tika dependency until another version
>>> is released.
>>>
>>> Don't know about Behemoth - Julien?
>>>
>>> -- Ken
>>>
>>>
>>> > For other users, though, say, in healthcare, where code security
>>> review is stringent, this could be a real pain, no?
>>> >
>>> > Am I understanding correctly what will happen?  If so, do we really
>>> want to do this?
>>> >
>>> >
>>> > -Original Message-
>>> > From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov]
>>> > Sent: Saturday, April 18, 2015 11:48 PM
>>> > To: dev@tika.apache.org
>>> > Subject: Re: [VOTE] Apache Tika 1.8 Release Candidate #2
>>> >
>>> > +1 to pushing on Monday - if we have to roll a 1.9 quickly
>>> > after, we can :)
>>> >
>>> > ++
>>> > Chris Mattmann, Ph.D.
>>> > Chief Architect
>>> > Instrument Software and Science Data Systems Section (398)
>>> > NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>> > Office: 168-519, Mailstop: 168-527
>>> > Email: chris.a.mattm...@nasa.gov
>>> > WWW:  http://sunset.usc.edu/~mattmann/
>>> > ++
>>> > Adjunct Associate Professor, Computer Science Department
>>> > University of Southern California, Los Angeles, CA 90089 USA
>>> > ++
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> > -Original Message-
>>> > From: Tyler Palsulich 
>>> > Reply-To: "dev@tika.apache.org" 
>>> > Date: Saturday, April 18, 2015 at 11:29 PM
>>> > To: "dev@tika.apache.org" 
>>> > Subject: RE: [VOTE] Apache Tika 1.8 Release Candidate #2
>>> >
>>> >> Hi Folks,
>>> >>
>>> >> If there are no blocking complaints (OSGi?) by Monday (a little longer
>>> >> than
>>> >> 3 days, I realize), I'll mark this as passed and finish the release
>>> >> process.
>>> >>
>>> >> Of course, it's no problem for me to cut another RC, if it's needed.
>>> >>
>>> >> Have a great weekend!
>>> >> Tyler
>>> >> I've run into one problem while testing Tika 1.8 with Bixo
>>> &

Re: [VOTE] Apache Tika 1.8 Release Candidate #2

2015-04-20 Thread Julien Nioche
and I haven't tested it with Nutch either...

On 20 April 2015 at 15:46, Julien Nioche 
wrote:

> I haven't tested the RC with Behemoth, it will probably have the same
> issue but I'll do like you and defer the update if that's the case.
>
> On 20 April 2015 at 15:23, Ken Krugler 
> wrote:
>
>>
>> > From: Allison, Timothy B.
>> > Sent: April 20, 2015 5:11:04am PDT
>> > To: dev@tika.apache.org
>> > Subject: RE: [VOTE] Apache Tika 1.8 Release Candidate #2
>> >
>> > If I understand correctly, if we release rc2, Tika 1.8 will break in
>> Hadoop clusters across the land?!
>> > Or, Hadoop folks will have to apply a classloading workaround or
>> rebuild 1.8/trunk with small version mod in TIKA-1606 to get Tika to work.
>> >
>> > For most Hadoopites, this will be a straightforward fix, and I'm
>> assuming that's why Ken is not more outspoken against releasing rc2 as is
>> (Ken, let me know if I'm wrong!).
>>
>> Usually it's straightforward. Though whenever you start manipulating the
>> classloader logic, you can get odd results.
>>
>> E.g. by forcing your job jar's dependencies to show up first, now you can
>> have an issue where one of your jars masks an older/newer version that
>> Hadoop needs, so the job fails for some other reason.
>>
>> But yes, I don't feel strongly enough about this to vote -1, as I don't
>> think there are that many people using Tika with Hadoop.
>>
>> For Bixo, I'd defer updating the Tika dependency until another version is
>> released.
>>
>> Don't know about Behemoth - Julien?
>>
>> -- Ken
>>
>>
>> > For other users, though, say, in healthcare, where code security review
>> is stringent, this could be a real pain, no?
>> >
>> > Am I understanding correctly what will happen?  If so, do we really
>> want to do this?
>> >
>> >
>> > -Original Message-
>> > From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov]
>> > Sent: Saturday, April 18, 2015 11:48 PM
>> > To: dev@tika.apache.org
>> > Subject: Re: [VOTE] Apache Tika 1.8 Release Candidate #2
>> >
>> > +1 to pushing on Monday - if we have to roll a 1.9 quickly
>> > after, we can :)
>> >
>> > ++
>> > Chris Mattmann, Ph.D.
>> > Chief Architect
>> > Instrument Software and Science Data Systems Section (398)
>> > NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> > Office: 168-519, Mailstop: 168-527
>> > Email: chris.a.mattm...@nasa.gov
>> > WWW:  http://sunset.usc.edu/~mattmann/
>> > ++
>> > Adjunct Associate Professor, Computer Science Department
>> > University of Southern California, Los Angeles, CA 90089 USA
>> > ++
>> >
>> >
>> >
>> >
>> >
>> >
>> > -Original Message-
>> > From: Tyler Palsulich 
>> > Reply-To: "dev@tika.apache.org" 
>> > Date: Saturday, April 18, 2015 at 11:29 PM
>> > To: "dev@tika.apache.org" 
>> > Subject: RE: [VOTE] Apache Tika 1.8 Release Candidate #2
>> >
>> >> Hi Folks,
>> >>
>> >> If there are no blocking complaints (OSGi?) by Monday (a little longer
>> >> than
>> >> 3 days, I realize), I'll mark this as passed and finish the release
>> >> process.
>> >>
>> >> Of course, it's no problem for me to cut another RC, if it's needed.
>> >>
>> >> Have a great weekend!
>> >> Tyler
>> >> I've run into one problem while testing Tika 1.8 with Bixo
>> >>
>> >> It involves a dependency issue involving (of course) Guava, since that
>> >> project loves to break their API :(
>> >>
>> >> The bixo-core jar has these transitive dependencies on various
>> versions of
>> >> Guava:
>> >>
>> >> Hadoop - 11.0.2
>> >> Cascading - 14.0.1
>> >> Tika-parsers - 10.0.1
>> >>   cdm - 17.0
>> >>
>> >> Everyone winds up using version 10.0.1 (note that Tika has a
>> dependency on
>> >> cdm, which wants to use 17.0)
>> >>
>> >> The problem is that Hadoop (for any recent version) 

Re: [VOTE] Apache Tika 1.8 Release Candidate #2

2015-04-20 Thread Julien Nioche
I haven't tested the RC with Behemoth, it will probably have the same issue
but I'll do like you and defer the update if that's the case.

On 20 April 2015 at 15:23, Ken Krugler  wrote:

>
> > From: Allison, Timothy B.
> > Sent: April 20, 2015 5:11:04am PDT
> > To: dev@tika.apache.org
> > Subject: RE: [VOTE] Apache Tika 1.8 Release Candidate #2
> >
> > If I understand correctly, if we release rc2, Tika 1.8 will break in
> Hadoop clusters across the land?!
> > Or, Hadoop folks will have to apply a classloading workaround or rebuild
> 1.8/trunk with small version mod in TIKA-1606 to get Tika to work.
> >
> > For most Hadoopites, this will be a straightforward fix, and I'm
> assuming that's why Ken is not more outspoken against releasing rc2 as is
> (Ken, let me know if I'm wrong!).
>
> Usually it's straightforward. Though whenever you start manipulating the
> classloader logic, you can get odd results.
>
> E.g. by forcing your job jar's dependencies to show up first, now you can
> have an issue where one of your jars masks an older/newer version that
> Hadoop needs, so the job fails for some other reason.
>
> But yes, I don't feel strongly enough about this to vote -1, as I don't
> think there are that many people using Tika with Hadoop.
>
> For Bixo, I'd defer updating the Tika dependency until another version is
> released.
>
> Don't know about Behemoth - Julien?
>
> -- Ken
>
>
> > For other users, though, say, in healthcare, where code security review
> is stringent, this could be a real pain, no?
> >
> > Am I understanding correctly what will happen?  If so, do we really want
> to do this?
> >
> >
> > -Original Message-
> > From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov]
> > Sent: Saturday, April 18, 2015 11:48 PM
> > To: dev@tika.apache.org
> > Subject: Re: [VOTE] Apache Tika 1.8 Release Candidate #2
> >
> > +1 to pushing on Monday - if we have to roll a 1.9 quickly
> > after, we can :)
> >
> > ++
> > Chris Mattmann, Ph.D.
> > Chief Architect
> > Instrument Software and Science Data Systems Section (398)
> > NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> > Office: 168-519, Mailstop: 168-527
> > Email: chris.a.mattm...@nasa.gov
> > WWW:  http://sunset.usc.edu/~mattmann/
> > ++
> > Adjunct Associate Professor, Computer Science Department
> > University of Southern California, Los Angeles, CA 90089 USA
> > ++
> >
> >
> >
> >
> >
> >
> > -Original Message-
> > From: Tyler Palsulich 
> > Reply-To: "dev@tika.apache.org" 
> > Date: Saturday, April 18, 2015 at 11:29 PM
> > To: "dev@tika.apache.org" 
> > Subject: RE: [VOTE] Apache Tika 1.8 Release Candidate #2
> >
> >> Hi Folks,
> >>
> >> If there are no blocking complaints (OSGi?) by Monday (a little longer
> >> than
> >> 3 days, I realize), I'll mark this as passed and finish the release
> >> process.
> >>
> >> Of course, it's no problem for me to cut another RC, if it's needed.
> >>
> >> Have a great weekend!
> >> Tyler
> >> I've run into one problem while testing Tika 1.8 with Bixo
> >>
> >> It involves a dependency issue involving (of course) Guava, since that
> >> project loves to break their API :(
> >>
> >> The bixo-core jar has these transitive dependencies on various versions
> of
> >> Guava:
> >>
> >> Hadoop - 11.0.2
> >> Cascading - 14.0.1
> >> Tika-parsers - 10.0.1
> >>   cdm - 17.0
> >>
> >> Everyone winds up using version 10.0.1 (note that Tika has a dependency
> on
> >> cdm, which wants to use 17.0)
> >>
> >> The problem is that Hadoop (for any recent version) uses an API from
> >> Guava's cache implementation that no longer exists:
> >>
> >>
> com.google.common.cache.CacheBuilder.build(Lcom/google/common/cache/CacheL
> >> oader;)Lcom/google/common/cache/LoadingCache;
> >> java.lang.NoSuchMethodError:
> >>
> com.google.common.cache.CacheBuilder.build(Lcom/google/common/cache/CacheL
> >> oader;)Lcom/google/common/cache/LoadingCache;
> >>   at
> >> org.apache.hadoop.io.compress.CodecPool.createCache(CodecPool.java:62)
> >>   at
> >> org.apache.hadoop.io.compress.CodecPool.(CodecPool.java:74)
> >>   at
> >> org.apache.hadoop.io.SequenceFile$Writer.close(SequenceFile.java:1272)
> >>   at
> >>
> org.apache.hadoop.mapred.SequenceFileOutputFormat$1.close(SequenceFileOutp
> >> utFormat.java:79)
> >>
> >> So what this means is that anyone trying to use Tika with Hadoop will
> need
> >> to play games with the class loader to get the older version of Guava -
> >> though that can cause other issues if Hadoop (or Cascading, etc) rely on
> >> anything that's only in the newer Guava API.
> >>
> >> Guava 10.0.1 was released about 3.5 years ago; 11.0.2 was from about 3
> >> years ago. So it seems like we should upgrade to at least 11.0.2
> >>
> >> But I don't know if this is enough of an issue to require anot
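
The clash described above can be probed without running a Hadoop job. The reflective check below (a sketch, not from the thread) looks for the method named in the quoted NoSuchMethodError; note that getMethod ignores the return type, while the JVM's NoSuchMethodError is about the full descriptor including the LoadingCache return type, so this is only a coarse check.

```java
public class GuavaApiCheck {
    public static void main(String[] args) {
        // Method signature taken from the NoSuchMethodError quoted above:
        // CacheBuilder.build(CacheLoader) -> LoadingCache
        try {
            Class<?> builder = Class.forName("com.google.common.cache.CacheBuilder");
            Class<?> loader = Class.forName("com.google.common.cache.CacheLoader");
            builder.getMethod("build", loader);
            System.out.println("CacheBuilder.build(CacheLoader) is available");
        } catch (ClassNotFoundException e) {
            System.out.println("Guava is not on the classpath: " + e.getMessage());
        } catch (NoSuchMethodException e) {
            System.out.println("Resolved Guava lacks CacheBuilder.build(CacheLoader)");
        }
    }
}
```

Run on the resolved application classpath, this shows quickly which Guava actually won the dependency-mediation fight before a job fails at runtime.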

Re: [VOTE] Apache Tika 1.8 Release Candidate #2

2015-04-14 Thread Julien Nioche
Hi Tim

Great to hear that you managed to use the dataset from CommonCrawl. Thanks!

Julien

On 14 April 2015 at 14:15, Allison, Timothy B.  wrote:

> +1
>
> Thank you, Tyler!
>
> Apologies to Hong-Thai and community for not recognizing the severity of
> TIKA-1600 when I voted in favor of rc1!
>
> Details...
>
> I reran against govdocs1, and there aren't any major surprises.
>
> On our Rackspace vm, I  _finally_ unzipped the Common Crawl slice that
> Julien Nioche created for us, and I ran against that as well.  That turned
> up TIKA-1605 and another exceedingly rare NPE in the PDFParser.  I don't
> think either of these are blockers, and they're now fixed in trunk.
>
> There are slightly fewer metadata values for some jpegs.  For the one file
> that I manually reviewed, 1.8-rc was missing these values (that were
> available in 1.7):
>
> JPEG quality
> IPTC-NAA record
> Plug-in 1 Data
>
> Comparison reports are available here (much more work remains to be done
> on tika-eval):
>
> https://github.com/tballison/share/tree/master/tika_comparisons
>
> 
> From: Tyler Palsulich 
> Sent: Monday, April 13, 2015 1:56 PM
> To: dev@tika.apache.org; u...@tika.apache.org
> Subject: [VOTE] Apache Tika 1.8 Release Candidate #2
>
> Hi Folks,
>
> A candidate for the Tika 1.8 release is available at:
>   https://dist.apache.org/repos/dist/dev/tika/
>
> The release candidate is a zip archive of the sources in:
>   http://svn.apache.org/repos/asf/tika/tags/1.8-rc2/
>
> The SHA1 checksum of the archive is
>   5e22fee9079370398472e59082d171ae2d7fdd31.
>
> In addition, a staged maven repository is available here:
>   https://repository.apache.org/content/repositories/orgapachetika-1009
>
> Please vote on releasing this package as Apache Tika 1.8. The vote is open
> for the next 72 hours and passes if a majority of at least three +1 Tika
> PMC votes are cast.
>
> [ ] +1 Release this package as Apache Tika 1.8
> [ ] ±0 I don't object to this release, but I haven't checked it
> [ ] -1 Do not release this package because...
>
> Thanks,
> Tyler
>



-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


[jira] [Commented] (TIKA-1599) Switch from TagSoup to JSoup

2015-04-09 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14487012#comment-14487012
 ] 

Julien Nioche commented on TIKA-1599:
-

FWIW we've just added a JSoup-based parser to 
[storm-crawler|https://github.com/DigitalPebble/storm-crawler], as the HTML 
parsing in Tika normalizes/filters the original content far too much.


> Switch from TagSoup to JSoup
> 
>
> Key: TIKA-1599
> URL: https://issues.apache.org/jira/browse/TIKA-1599
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.7, 1.8
>Reporter: Ken Krugler
>Assignee: Ken Krugler
>Priority: Minor
>
> There are several Tika issues related to how TagSoup cleans up HTML 
> ([TIKA-381], [TIKA-985], maybe [TIKA-715]), but TagSoup doesn't seem to be 
> under active development.
> On the other hand I know of several projects that are now using 
> [JSoup|https://github.com/jhy/jsoup], which is an active project (albeit only 
> one main contributor) under the MIT license.
> I haven't looked into how hard it would be to switch this dependency.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-11-28 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228305#comment-14228305
 ] 

Julien Nioche commented on TIKA-1302:
-

FYI I have extracted data from the CommonCrawl dataset using Behemoth and put 
it on the server. See 
[http://digitalpebble.blogspot.co.uk/2014/11/generating-test-corpus-for-apache-tika.html]
 for a description of the steps. Roughly 220GB of compressed data, 2M documents 
of all mime-types, mostly non-HTML. 
[~talli...@apache.org] please let me know if you have any problems with the data

> Let's run Tika against a large batch of docs nightly
> 
>
> Key: TIKA-1302
> URL: https://issues.apache.org/jira/browse/TIKA-1302
> Project: Tika
>  Issue Type: Improvement
>  Components: cli, general, server
>Reporter: Tim Allison
> Attachments: wayback_exception_summaries.xlsx
>
>
> Many thanks to [~lewismc] for TIKA-1301!  Once we get nightly builds up and 
> running again, it might be fun to run Tika regularly against a large set of 
> docs and report metrics.
> One excellent candidate corpus is govdocs1: 
> http://digitalcorpora.org/corpora/files.
> Any other candidate corpora?  
> [~willp-bl], have anything handy you'd like to contribute? 
> [http://www.openplanetsfoundation.org/blogs/2014-03-21-tika-ride-characterising-web-content-nanite]
>  ;) 





[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-11-26 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14226397#comment-14226397
 ] 

Julien Nioche commented on TIKA-1302:
-

Sure, will get back to you with the scp details when I have the data ready

> Let's run Tika against a large batch of docs nightly
> 
>
> Key: TIKA-1302
> URL: https://issues.apache.org/jira/browse/TIKA-1302
> Project: Tika
>  Issue Type: Improvement
>  Components: cli, general, server
>Reporter: Tim Allison
> Attachments: wayback_exception_summaries.xlsx
>
>
> Many thanks to [~lewismc] for TIKA-1301!  Once we get nightly builds up and 
> running again, it might be fun to run Tika regularly against a large set of 
> docs and report metrics.
> One excellent candidate corpus is govdocs1: 
> http://digitalcorpora.org/corpora/files.
> Any other candidate corpora?  
> [~willp-bl], have anything handy you'd like to contribute? 
> [http://www.openplanetsfoundation.org/blogs/2014-03-21-tika-ride-characterising-web-content-nanite]
>  ;) 





[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-11-26 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14226336#comment-14226336
 ] 

Julien Nioche commented on TIKA-1302:
-

Hi [~talli...@apache.org]
It would be easy to do that with Behemoth. I am not sure CC contains many multimedia 
files, but it will certainly have the other types you mentioned. We could either 
dump the content of the URLs to an archive to process with Tika later, or do the 
Tika parsing with Behemoth as well. 

> Let's run Tika against a large batch of docs nightly
> 
>
> Key: TIKA-1302
> URL: https://issues.apache.org/jira/browse/TIKA-1302
> Project: Tika
>  Issue Type: Improvement
>  Components: cli, general, server
>Reporter: Tim Allison
> Attachments: wayback_exception_summaries.xlsx
>
>
> Many thanks to [~lewismc] for TIKA-1301!  Once we get nightly builds up and 
> running again, it might be fun to run Tika regularly against a large set of 
> docs and report metrics.
> One excellent candidate corpus is govdocs1: 
> http://digitalcorpora.org/corpora/files.
> Any other candidate corpora?  
> [~willp-bl], have anything handy you'd like to contribute? 
> [http://www.openplanetsfoundation.org/blogs/2014-03-21-tika-ride-characterising-web-content-nanite]
>  ;) 





[jira] [Commented] (TIKA-595) HtmlHandler does not support multivalue metadata

2014-11-19 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14217749#comment-14217749
 ] 

Julien Nioche commented on TIKA-595:


Thanks Dave!

> HtmlHandler does not support multivalue metadata
> 
>
> Key: TIKA-595
> URL: https://issues.apache.org/jira/browse/TIKA-595
> Project: Tika
>  Issue Type: Bug
>  Components: metadata, parser
>Affects Versions: 0.8
>Reporter: Lutz Pumpenmeier
>Assignee: Dave Meikle
>Priority: Minor
> Fix For: 1.7
>
> Attachments: TIKA-595.patch
>
>
> The HtmlHandler uses metadata.set(...). So META tags that occur more than 
> once are not handled correctly (DublinCore metadata can be set more than 
> once).
> The handler should use metadata.add(...) instead.
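The behavioural difference between set() and add() that the issue describes can be sketched with a small stand-in container. This is an illustration of the contract only, not Tika's actual Metadata class:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Minimal stand-in for a multi-valued metadata container, illustrating why
// set() loses repeated META tags while add() preserves them.
public class MultiValueMetadata {
    private final Map<String, List<String>> data = new LinkedHashMap<>();

    // set() replaces any existing values -- only the last META tag survives.
    public void set(String name, String value) {
        List<String> values = new ArrayList<>();
        values.add(value);
        data.put(name, values);
    }

    // add() appends -- every occurrence of the META tag is kept.
    public void add(String name, String value) {
        data.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    public List<String> getValues(String name) {
        return data.getOrDefault(name, List.of());
    }
}
```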





[jira] [Updated] (TIKA-595) HtmlHandler does not support multivalue metadata

2014-11-07 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated TIKA-595:
---
Fix Version/s: 1.7

> HtmlHandler does not support multivalue metadata
> 
>
> Key: TIKA-595
> URL: https://issues.apache.org/jira/browse/TIKA-595
> Project: Tika
>  Issue Type: Bug
>  Components: metadata, parser
>Affects Versions: 0.8
>Reporter: Lutz Pumpenmeier
>Priority: Minor
> Fix For: 1.7
>
> Attachments: TIKA-595.patch
>
>
> The HtmlHandler uses metadata.set(...). So META tags that occur more than 
> once are not handled correctly (DublinCore metadata can be set more than 
> once).
> The handler should use metadata.add(...) instead.





[jira] [Updated] (TIKA-595) HtmlHandler does not support multivalue metadata

2014-11-07 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated TIKA-595:
---
Attachment: TIKA-595.patch

Any reason why we wouldn't want to have multiple values in the metadata if they 
are present in the HTML doc?

> HtmlHandler does not support multivalue metadata
> 
>
> Key: TIKA-595
> URL: https://issues.apache.org/jira/browse/TIKA-595
> Project: Tika
>  Issue Type: Bug
>  Components: metadata, parser
>Affects Versions: 0.8
>Reporter: Lutz Pumpenmeier
>Priority: Minor
> Attachments: TIKA-595.patch
>
>
> The HtmlHandler uses metadata.set(...). So META tags that occur more than 
> once are not handled correctly (DublinCore metadata can be set more than 
> once).
> The handler should use metadata.add(...) instead.





Re: Parse Html with Tika

2014-11-03 Thread Julien Nioche
Hi Linh

You can specify a mapper to control what the html parser will filter or not.

see
https://github.com/DigitalPebble/storm-crawler/commit/27364cb7ddb3998f973ab6e09f384e28cc5b7639
for an example
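The mapper is essentially a policy object the parser consults for every element. A stand-in sketch of the idea follows; the interface below is illustrative only, and the real Tika HtmlMapper signatures should be checked in the linked commit:

```java
import java.util.Set;

// Stand-in sketch of an element-mapping policy: the parser asks it about
// every element and keeps, renames, or drops the tag accordingly. Names
// here are illustrative, not Tika's real HtmlMapper API.
interface ElementMapper {
    // Return the (possibly renamed) element to emit, or null to strip the
    // tag while keeping its character content.
    String mapElement(String name);

    // Return true to drop the element and everything inside it.
    boolean discardElement(String name);
}

// A permissive mapper that keeps every element as-is and discards nothing --
// the kind of mapper one would plug in to stop the parser filtering tags.
class IdentityElementMapper implements ElementMapper {
    private static final Set<String> DISCARD = Set.of(); // discard nothing

    @Override
    public String mapElement(String name) { return name; }

    @Override
    public boolean discardElement(String name) { return DISCARD.contains(name); }
}
```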

Julien

On Monday, 3 November 2014, Linh Tang  wrote:

> Dear All,
>
> I am Phuong Linh,
> I am using Tika to extract content from HTML files for search, but HtmlParser
> cannot parse all tags of the HTML. (I get the HTML pages with Nutch, then use Tika to
> extract the important information, and then use Solr to search.)
> Can you tell me what I can do to parse all tags of the HTML?
>
> Thanks in advance!
>
> Regards,
> Tang Thi Phuong Linh.
> --
> P.Linh
>




[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-05-19 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001612#comment-14001612
 ] 

Julien Nioche commented on TIKA-1302:
-

How large do you want that batch to be? If we are talking millions of pages, 
one option would be to use the Tika module of  Behemoth on the CommonCrawl 
dataset. See 
[http://digitalpebble.blogspot.co.uk/2011/05/processing-enron-dataset-using-behemoth.html]
 for comparable work we did some time ago on the Enron dataset. Behemoth 
already has a module for ingesting data from CommonCrawl. This means of course 
having Hadoop up and running.

Alternatively it would be simple to extract the documents from the CC dataset 
into the server's filesystem and use the TikaServer without Hadoop. Not sure 
what the legal implications of using these documents would be though.

The beauty of using the CommonCrawl dataset is that apart from volume, it is a 
good sample of the web with all the weird and beautiful things it contains 
(broken documents, large ones, etc...)





> Let's run Tika against a large batch of docs nightly
> 
>
> Key: TIKA-1302
> URL: https://issues.apache.org/jira/browse/TIKA-1302
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>
> Many thanks to [~lewismc] for TIKA-1301!  Once we get nightly builds up and 
> running again, it might be fun to run Tika regularly against a large set of 
> docs and report metrics.
> One excellent candidate corpus is govdocs1: 
> http://digitalcorpora.org/corpora/files.
> Any other candidate corpora?  
> [~willp-bl], have anything handy you'd like to contribute? 
> [http://www.openplanetsfoundation.org/blogs/2014-03-21-tika-ride-characterising-web-content-nanite]
>  ;) 





Re: [VOTE] Apache Tika 1.5 RC2

2014-02-10 Thread Julien Nioche
Hi Dave,

+1 from me. Compiled fine on Linux Mint + tested Maven artefacts with
Behemoth and ran a parse without problems.

Thanks for doing this.

Julien


On 9 February 2014 22:53, Dave Meikle  wrote:

> Hi Guys,
>
> A new release candidate for the Tika 1.5 release is now available at:
> http://people.apache.org/~dmeikle/tika-1.5-rc2/
>
> This fixes the issues with the POM version numbers for tika-dotnet and
> tika-java7 in Tika 1.5 RC1.
>
> The release candidate is a zip archive of the sources in:
> http://svn.apache.org/repos/asf/tika/tags/1.5-rc2/
>
> The SHA1 checksum of the archive is:
> f9a3c04dc3d1ce27742d0db7b8c171bbd89063b6
>
> A staged M2 repository can also be found on repository.apache.org here:
> https://repository.apache.org/content/repositories/orgapachetika-1002
>
> Please vote on releasing this package as Apache Tika 1.5.
> The vote is open for the next 72 hours and passes if a majority of at
> least three +1 Tika PMC votes are cast.
>
>[ ] +1 Release this package as Apache Tika 1.5
>[ ] -1 Do not release this package because...
>
> Here is my +1 for the release.
>
> Cheers,
> Dave
>





Re: [VOTE] Apache Tika 1.5 RC1

2014-02-05 Thread Julien Nioche
Hi Dave

Am trying to compile from src and am getting

[ERROR] The build could not read 1 project -> [Help 1]
[ERROR]
[ERROR]   The project org.apache.tika:tika-java7:1.5-SNAPSHOT
(/data/tika-1.5/tika-java7/pom.xml) has 1 error
[ERROR] Non-resolvable parent POM: Could not find artifact
org.apache.tika:tika-parent:pom:1.5-SNAPSHOT and 'parent.relativePath'
points at wrong local POM @ line 25, column 11 -> [Help 2]
[ERROR]

*mvn -version*
Apache Maven 3.0.4
Maven home: /usr/share/maven
Java version: 1.7.0_21, vendor: Oracle Corporation
Java home: /usr/lib/jvm/java-7-openjdk-amd64/jre
Default locale: en_GB, platform encoding: UTF-8
OS name: "linux", version: "3.5.0-17-generic", arch: "amd64", family: "unix"

Am I missing something?

Julien



On 5 February 2014 01:59, David Meikle  wrote:

> Hi Guys,
>
> A candidate for the Tika 1.5 release is now available at:
> http://people.apache.org/~dmeikle/tika-1.5-rc1/
>
> The release candidate is a zip archive of the sources in:
> http://svn.apache.org/repos/asf/tika/tags/1.5-rc1/
>
> The SHA1 checksum of the archive is:
> 66adb7e73058da73a055a823bd61af48129c1179
>
> A staged M2 repository can also be found on repository.apache.org here:
> https://repository.apache.org/content/repositories/orgapachetika-1000
>
> Please vote on releasing this package as Apache Tika 1.5.
> The vote is open for the next 72 hours and passes if a majority of at
> least three +1 Tika PMC votes are cast.
>
>[ ] +1 Release this package as Apache Tika 1.5
>[ ] -1 Do not release this package because...
>
> Here is my +1 for the release.
>
> Cheers,
> Dave






Re: [DISCUSS] Integrate Apache Any23 into Apache Tika

2013-10-18 Thread Julien Nioche
Hi,

I had a look at Any23 some time ago and found that it did indeed overlap with
quite a few other projects, but it could (should?) have either relied on those
projects (e.g. parsing and mimetype detection to Tika) or delegated the
functionality altogether (e.g. crawling to Nutch) instead of reinventing
the wheel and spreading itself thin.

I am not familiar with the history of the project, where the code comes
from and who was behind it but I am a bit surprised that the project was
allowed to graduate from incubation without these points being addressed.

Migrating the code to Tika as a whole would not be a good idea I think.
However from a Tika point of view, it could be interesting to have the meta
parsers to convert the semantic information into a neutral representation
as a ContentHandler as in TIKA-980. Most people would probably be
interested in that more than the generation side of Any23 (what is referred
to as output format) which I think is not so relevant for Tika. From an
Any23 perspective, the project could then focus on the generation side and
just rely on Tika for pretty much everything else.

I haven't looked into Any23 in great detail and there could be other
interesting things to take from it.

Julien



On 18 October 2013 15:46, Ken Krugler  wrote:

> Hi Lewis,
>
> I haven't have much time to look into Any23, which includes reviewing
> Markus's patch for integrating some portions of that into Tika (see
> https://issues.apache.org/jira/browse/TIKA-980)
>
> The main challenge I see is that Tika seems to do best as a wrapper for
> other parsers, versus outright ownership of parsers.
>
> Which isn't to say that rolling Any23 into Tika wouldn't work, but without
> at least one active developer it would seem likely that it would languish,
> without active development.
>
> But maybe that's OK…
>
> -- Ken
>
> On Oct 18, 2013, at 7:30am, Lewis John Mcgibbney wrote:
>
> > Hi Tika Dev's/PMC,
> >
> > This thread is aimed at recognizing common ground shared by Any23 and
> Tika
> > in an attempt to possibly integrate Any23 into Tika.
> > First however it will serve a purpose for me to put this into context and
> > also provide some rationale behind this initiative.
> >
> > It is my understanding that the Tika PMC sponsored Any23 through the
> Apache
> > Incubator until we (the Any23 PMC) were ready to graduate having made an
> > incubating release and having grown the community somewhat. Post
> > graduation, we made a 0.8.0 release in July 2013.
> >
> > It is also my understanding that the logical justification for the Tika
> PMC
> > sponsoring us, was that it was envisaged (by numerous dev's) that there
> was
> > already some common ground between the aim and objectives of both
> projects
> > e.g. mime type detection, parsing, extraction of metadata, serialization,
> > etc. therefore with a little positive thinking and understanding of both
> > projects, one can clearly see the shared interests.
> >
> > I am speaking on behalf of the Any23 community here when I say that we
> have
> > however come to a realization that the community is not as vibrant as we
> > would like. This is combined with the fact that initial/original project
> > dev's are not around right now to keep the project moving in a forward
> > direction.
> >
> > It is therefore of interest to us, to approach the Tika community with
> the
> > intention of discussing a proposal to integrate Any23 code into Apache
> Tika.
> >
> > For those interested, the Any23 project URL is http://any23.apache.org,
> we
> > also have a live service which you can use to get a feel for what Any23
> > actually does. It can be found at http://any23.org.
> >
> > Any feedback from this community would be really appreciated, as it looks
> > like the alternative would be for us to take the code into the Apache
> > Attic... which is always a last resort.
> >
> > Thanks in advance.
> >
> > Lewis
> >
> > --
> > *Lewis*
>
> --
> Ken Krugler
> +1 530-210-6378
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Cassandra & Solr
>
>
>
>
>
>




[ANNOUNCEMENT] 0.3 release of crawler-commons

2013-10-11 Thread Julien Nioche
Hi,

Just to let you know that we have just released version 0.3 of
crawler-commons. Crawler-commons is a set of reusable Java components that
implement functionality common to any web crawler. These components benefit
from collaboration among various existing web crawler projects, and reduce
duplication of effort. The main components are parsers for robots.txt,
sitemap files, domain utilities and fetchers.

Crawler-commons is used in Bixo and Apache Nutch for parsing robots.txt
files.

 *Project* -> https://code.google.com/p/crawler-commons/

 *Release notes* ->
http://crawler-commons.googlecode.com/svn/tags/crawler-commons-0.3/CHANGES.txt

 *Info about artifacts* ->
http://search.maven.org/#artifactdetails|com.google.code.crawler-commons|crawler-commons|0.3|jar

Thanks!

Julien



Re: Pluggable language detection

2012-03-22 Thread Julien Nioche
If you mean integrating a better third-party detector - that's exactly my
point. We don't develop and maintain our own parsers, so why should we follow
a different logic when it comes to language identification? There are other
resources around, so why don't we just use them? I assume that by default our
existing detector (improved or not) could still be used; all we need is a
mechanism to select an alternative implementation, and a common interface.
That's probably not a big deal to implement. Any thoughts on how to do it?
Are there any things we should reuse from the way we deal
with the parsers?
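One possible shape for such a selection mechanism, sketched purely as an illustration: none of these types exist in Tika, and a real version would likely reuse whatever service-loading machinery the parsers already use.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical common interface for pluggable language identifiers; the
// names here are illustrative and do not correspond to any actual Tika API.
interface LanguageIdentifier {
    String identify(String text);
}

// A registry that lets callers select an alternative implementation by name,
// falling back to a default -- roughly how parsers are chosen by media type.
class LanguageIdentifierRegistry {
    private final Map<String, LanguageIdentifier> impls = new LinkedHashMap<>();
    private final String defaultName;

    LanguageIdentifierRegistry(String defaultName) {
        this.defaultName = defaultName;
    }

    void register(String name, LanguageIdentifier impl) {
        impls.put(name, impl);
    }

    // Return the requested implementation, or the default when none is named.
    LanguageIdentifier select(Optional<String> requested) {
        return impls.get(requested.orElse(defaultName));
    }
}
```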

Thanks for your comments

Julien


On 21 March 2012 16:55, Ken Krugler  wrote:

>
> On Mar 21, 2012, at 8:51am, Julien Nioche wrote:
>
> > Hi guys,
> >
> > Just wondering about the best way to make the language detection
> pluggable
> > instead of having it hard-wired as it is now. We know that the resources
> > that are currently in Tika are both slow and inaccurate [1] and there are
> > other libraries that we could leverage. Why not having the option to
> select
> > a different implementation just like we do for parsers? Obviously we'd
> need
> > a common interface for the parsers etc...
> >
> > What do you think?
>
> I'd be more in favor of using that time to integrate a better language
> detector into Tika, so that everybody wins from the work :)
>
> -- Ken
>
>
> > [1]
> >
> http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html
> >
> > --
> > *
> > *Open Source Solutions for Text Engineering
> >
> > http://digitalpebble.blogspot.com/
> > http://www.digitalpebble.com
> > http://twitter.com/digitalpebble
>
> --
> Ken Krugler
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Mahout & Solr
>
>
>
>
>




Re: % of different content types out there on the web

2012-01-29 Thread Julien Nioche
That could be an interesting experiment to do with the commoncrawl dataset
and Tika on Behemoth. Assuming of course that the detection is done
correctly by Tika.  Does anyone have a spare cluster on EC2 ;-) ?

Julien

On 28 January 2012 02:01, Mattmann, Chris A (388J) <
chris.a.mattm...@jpl.nasa.gov> wrote:

> (sorry for the cross post)
>
> Hey Guys,
>
> I'm trying to find a good citation or estimate (if anyone has done one)
> that estimates
> the breakout (by % or some other metric) of content types out there out
> the web
> (with a whole web crawl or a meaningful representative dataset) that are
> non HTML.
>
> Anyone have any ideas about this?
>
> Thanks!
>
> Cheers,
> Chris
>
> ++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: chris.a.mattm...@nasa.gov
> WWW:   http://sunset.usc.edu/~mattmann/
> ++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++
>
>




Re: [VOTE] Add Any23 to the Apache Incubator

2011-09-27 Thread Julien Nioche
+1 from me

On 27 September 2011 06:18, Mattmann, Chris A (388J) <
chris.a.mattm...@jpl.nasa.gov> wrote:

> Hi Folks,
>
> OK, the proposal period had died now and I'm now calling a formal VOTE on
> the Any23 proposal located here:
>
> http://wiki.apache.org/incubator/Any23Proposal
>
> Proposal text copied at the bottom of this email. I'll leave the VOTE open
> through the
> rest of the week, and close it around Saturday, October 1, early AM PDT.
>
> Please VOTE:
>
> [ ] +1 Accept Any23 into the Apache Incubator
> [ ] +0 Don't care
> [ ] -1  Don't Accept Any23 into the Apache Incubator because...
>
> Thanks!
>
> Cheers,
> Chris
>
> P.S. Here's my +1
>
> Proposal Text:
>
> = Any23 =
> == Abstract ==
> The following proposal is about ''Anything To Triples'' (shortly Any23)
> defined as a Java library,  a Web service and a set of command line tools to
> extract and validate structured data  in [[http://www.w3.org/RDF/|RDF]]
> format from a variety of Web documents and markup formats.  Any23 is what it
> is informally named an ''RDF Distiller''.
>
> == Proposal ==
> Any23 "Anything to Triples" is a library written in Java 6 and released
> under the Apache 2.0 License. It provides a set of extractors for scraping
> semantic markup (such as [[http://microformats.org/|Microformats]], [[
> http://www.w3.org/TR/rdfa-syntax/|RDFa]] and [[
> http://www.w3.org/TR/microdata/|Microdata]])  from several sources (HTML4,
> XHTML5, CSV), a set of data validations, a set of parsers and writers to
> handle the main RDF transport formats (RDFXML, Ntriples, NQuads, Turtle).
>  The library provides a command line tool for dealing with data extraction,
> conversion and validation, and a REST service implementation. The library is
> plugin based, allowing the hot loading of new extractors and validators.
> Any23 enables third-parties developers to access structured data from Web
> pages without the need of implementing ad-hoc scraping techniques. In this
> sense, Any23 will relieve developers from build complex solutions when
> developing data acquisition pipelines and processes targeted to semantically
> marked-up Web data.
>
> == Background ==
> Any23 has been initially developed at [[http://www.deri.ie/|DERI (Digital
> Enterprise Research Institute)]],  as main component of the RDF extraction
> pipeline used in [[http://sindice.com/|Sindice (the Semantic Web Index)]],
> now is evolved in joint effort with [[http://www.fbk.eu/|FBK (Fondazione
> Bruno Kessler)]]. At present time the Any23 official [[
> http://developers.any23.org|developers page]] contains all the
> documentation, while the code is maintained on [[
> http://code.google.com/p/any23/|Google Code]]. An official up-to-date
> showcase [[http://any23.org|demo]] is also available.
>
> == Rationale ==
> Provide and maintain a robust, standard and updated library for extracting
> and validating semantic markup from heterogeneous sources would provide
> large benefits to the entire Open Source Community. Researchers and academic
> projects are adopting RDF related technologies from years  while the
> industry is actually moving toward Semantic Web technologies with more
> concreteness. Several industry initiatives related to the [[
> http://en.wikipedia.org/wiki/Semantic_Web|Web of Data]]  are taking place
> in the these months. [[http://schema.org|Schema.org]], for example, is an
> initiative sponsored by  [[
> http://www.google.com/about/corporate/company/|Google Inc]], [[
> http://info.yahoo.com/center/us/yahoo/|Yahoo Inc]]  and [[
> http://www.microsoft.com/about/companyinformation/en/us/default.aspx|Microsoft Corporation]]
>   to structure the data in a harmonized way on [[
> http://dev.w3.org/html5/spec/Overview.html|HTML5]] pages. [[
> http://schema.org|Schema.org]] leverages on the [[
> http://dev.w3.org/html5/md/|HTML5 Microdata]] native specification. [[
> http://ogp.me/|OpenGraphProtocol]] is the open standard sponsored by  [[
> https://www.facebook.com/pages/Facebooking/114721225206500|Facebook Inc]]
> to include metadata in HTML page headers.  [[
> http://ogp.me/|OpenGraphProtocol]], initially based on [[
> http://www.w3.org/TR/xhtml-rdfa-primer/|RDFa]], allows to describe the
> content of a Web page and its underlying vocabulary could be directly
> represented using RDF.
>
> = Current Status =
> == Meritocracy ==
> The historical Any23 team believes in meritocracy and always acted as a
> community. Mailing list, open issue tracker and other communication channels
> have always been adopted since its first release. The adoption in a larger
> community, such as Apache,  is the natural evolution for Any23. Moreover,
> the Apache standards will enforce the existing Any23 community practices and
> will be a foundation for future committers involvement.
>
> == Core Developers ==
> In alphabetical order:
>
>  * Davide Palmisano 
>  * Giovanni Tummarello 
>  * Michele Mostarda 
>  * Richard Cyganiak 
>  * Reto Bachmann-Gmuer 
>  * Simone Tripodi 
>  * Szymon Danielczyk

Re: index video and image format with nutch 1.3?

2011-09-10 Thread Julien Nioche
This is not a Tika issue. Ask this on the Nutch user list instead.

On 9 September 2011 22:34, hadi  wrote:

> When I want to index a video file with Nutch 1.3 I get the following error:
>
> *Error parsing: file:///D:/film.avi: failed(2,0): Can't retrieve Tika
> parser
> for
>   mime-type video/x-msvideo*
> (also it is the same error for images file)
>
> and in hadoop log the detail error is:
>
> *parse.ParserFactory - ParserFactory:Plugin:
> org.apache.nutch.parse.feed.FeedParser mapped to contentType
> video/x-msvideo
> via parse-plugins.xml, but its plugin.xml file does not claim to support
> contentType: video/x-msvideo*
>
> I mentioned that I added the following config in parse-plugins.xml:
>
> *
>
>
> *
>
> also add the folowing config in nutch-site.xml
>
> *
>  plugin.includes
>
>
> nutch-extensionpoints|protocol-file|protocol-http|urlfilter-regex|parse-(html|tika|pdf|zip|avi)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)
> *
>
> but it doesn't work and I get the same Tika error, please help me
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/index-video-and-image-format-with-nutch-1-3-tp3324172p3324172.html
> Sent from the Apache Tika - Development mailing list archive at Nabble.com.
>





[jira] [Updated] (TIKA-612) Specify PDFBox options via ParseContext

2011-08-24 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated TIKA-612:
---

Attachment: Tika-612.patch

Patch which allows specifying the options via the ParseContext object. WDYT?

> Specify PDFBox options via ParseContext 
> 
>
> Key: TIKA-612
> URL: https://issues.apache.org/jira/browse/TIKA-612
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Affects Versions: 0.9
>    Reporter: Julien Nioche
>    Assignee: Julien Nioche
>Priority: Minor
> Attachments: Tika-612.patch
>
>
> See https://issues.apache.org/jira/browse/TIKA-611. The options used by 
> PDFBox are currently hardwritten in the PDFParser code, we will allow them to 
> be specified via the ParseContext objects

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (TIKA-696) Extract watermarks from Word documents

2011-08-23 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13089498#comment-13089498
 ] 

Julien Nioche commented on TIKA-696:


The text of the watermark can be found towards the end of word/header1.xml from 
the .docx

{code}

{code}
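Since a .docx is just a zip archive, the header part mentioned above can be pulled out with the standard library alone. A minimal sketch: the entry name word/header1.xml comes from the comment, and no parsing of the watermark element itself is attempted.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Minimal sketch: a .docx file is a zip archive, so the header part that
// carries the watermark text can be read out directly as XML.
public class DocxHeaderDump {

    // Returns the raw XML of the given part, or null if the entry is absent.
    static String readPart(String docxPath, String partName) throws IOException {
        try (ZipFile zip = new ZipFile(docxPath)) {
            ZipEntry entry = zip.getEntry(partName);
            if (entry == null) {
                return null;
            }
            try (InputStream in = zip.getInputStream(entry)) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.out.println("usage: DocxHeaderDump <file.docx>");
            return;
        }
        // "word/header1.xml" is where the watermark text was observed above.
        System.out.println(readPart(args[0], "word/header1.xml"));
    }
}
```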

> Extract watermarks from Word documents
> --
>
> Key: TIKA-696
> URL: https://issues.apache.org/jira/browse/TIKA-696
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Affects Versions: 0.9
>    Reporter: Julien Nioche
> Attachments: Demo with watermark.doc, Demo+with+watermark.docx
>
>
> It would be nice to store the text of a watermark as metadata.





[jira] [Updated] (TIKA-696) Extract watermarks from Word documents

2011-08-23 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated TIKA-696:
---

Attachment: Demo+with+watermark.docx

.docx version generated with MS Office

Can't see the watermark with OO, but a reliable informant has told me that it is 
visible when loading with MS Office. 

> Extract watermarks from Word documents
> --
>
> Key: TIKA-696
> URL: https://issues.apache.org/jira/browse/TIKA-696
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Affects Versions: 0.9
>Reporter: Julien Nioche
> Attachments: Demo with watermark.doc, Demo+with+watermark.docx
>
>
> It would be nice to store the text of a watermark as metadata.





[jira] [Commented] (TIKA-696) Extract watermarks from Word documents

2011-08-23 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13089480#comment-13089480
 ] 

Julien Nioche commented on TIKA-696:


Can't see the watermark when saving and reopening the doc in the .docx format. 
Have used OpenOffice to generate it.

> Extract watermarks from Word documents
> --
>
> Key: TIKA-696
> URL: https://issues.apache.org/jira/browse/TIKA-696
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Affects Versions: 0.9
>Reporter: Julien Nioche
> Attachments: Demo with watermark.doc
>
>
> It would be nice to store the text of a watermark as metadata.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (TIKA-696) Extract watermarks from Word documents

2011-08-23 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated TIKA-696:
---

Attachment: Demo with watermark.doc

Attached doc file containing a watermark

> Extract watermarks from Word documents
> --
>
> Key: TIKA-696
> URL: https://issues.apache.org/jira/browse/TIKA-696
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Affects Versions: 0.9
>    Reporter: Julien Nioche
> Attachments: Demo with watermark.doc
>
>
> It would be nice to store the text of a watermark as metadata.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Created] (TIKA-696) Extract watermarks from Word documents

2011-08-23 Thread Julien Nioche (JIRA)
Extract watermarks from Word documents
--

 Key: TIKA-696
 URL: https://issues.apache.org/jira/browse/TIKA-696
 Project: Tika
  Issue Type: New Feature
  Components: parser
Affects Versions: 0.9
Reporter: Julien Nioche
 Attachments: Demo with watermark.doc

It would be nice to store the text of a watermark as metadata.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




Re: Towards 1.0

2011-05-21 Thread Julien Nioche
Hi

> It's a few months since 0.9 and our Tika in Action book is soon ready
> for print, so I think it's good time to start planning for the 1.0
> release.
>
> There are a few odds and ends that I'd still like to sort out in the
> trunk, but overall I think we're in a pretty much ready for the switch
> from 0.x to 1.x.
>

+1


>
> One major issue to be decided is whether we want to follow up with the
> earlier intention of dropping deprecated functionality (like the
> three-argument parse() method) before the 1.0 release. I think we
> should do that and also make some other backwards-incompatible
> cleanups while we're at it. That way we'll have less old baggage to
> carry as we evolve through the 1.x release cycle


+1 this is the perfect time to do these changes

We'll spend some time next week on TIKA-657, i.e. processing the Enron corpus
with Tika + Behemoth; we'll probably find things to improve in the email
parser as a result. Maybe it would be good to do 1.0 after that?

Julien

-- 
*Open Source Solutions for Text Engineering*

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com


[jira] [Assigned] (TIKA-657) Email parser gets into trouble on malformed html in enron corpus

2011-05-21 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche reassigned TIKA-657:
--

Assignee: Julien Nioche

> Email parser gets into trouble on malformed html in enron corpus
> 
>
> Key: TIKA-657
> URL: https://issues.apache.org/jira/browse/TIKA-657
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 0.9
>Reporter: Benson Margulies
>    Assignee: Julien Nioche
>
> There is a very large corpus of email messages available: 
> http://www.cs.cmu.edu/~enron/.
> In processing even a subset of this corpus, I see numerous 'unexpected 
> RuntimeException' errors resulting from tagsoup throwing on truly awful html. 
> It seems to me that being able to do something with this entire stack would 
> make a good '1.0' criterion for Tika's email parser.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (TIKA-657) Email parser gets into trouble on malformed html in enron corpus

2011-05-08 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13030467#comment-13030467
 ] 

Julien Nioche commented on TIKA-657:


Good idea. We need more tutorials and examples for Behemoth 
[https://github.com/jnioche/behemoth] and processing the Enron corpus with Tika 
would be an interesting one. We get the stacktraces in the Hadoop logs and 
could then look into the details of each problem.

> Email parser gets into trouble on malformed html in enron corpus
> 
>
> Key: TIKA-657
> URL: https://issues.apache.org/jira/browse/TIKA-657
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 0.9
>Reporter: Benson Margulies
>
> There is a very large corpus of email messages available: 
> http://www.cs.cmu.edu/~enron/.
> In processing even a subset of this corpus, I see numerous 'unexpected 
> RuntimeException' errors resulting from tagsoup throwing on truly awful html. 
> It seems to me that being able to do something with this entire stack would 
> make a good '1.0' criterion for Tika's email parser.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (TIKA-649) NPE while parsing a .docx

2011-04-28 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13026266#comment-13026266
 ] 

Julien Nioche commented on TIKA-649:


Sorry, should have tested on the trunk as well. Thanks Nick anyway

> NPE while parsing a .docx  
> ---
>
> Key: TIKA-649
> URL: https://issues.apache.org/jira/browse/TIKA-649
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 0.9
>    Reporter: Julien Nioche
> Fix For: 1.0
>
> Attachments: Popcorn.docx
>
>
> The method extractHeaders in 
> org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator throws a 
> NPE on line 234 as XWPFHeaderFooterPolicy hfPolicy is null.
>  

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (TIKA-649) NPE while parsing a .docx

2011-04-27 Thread Julien Nioche (JIRA)
NPE while parsing a .docx  
---

 Key: TIKA-649
 URL: https://issues.apache.org/jira/browse/TIKA-649
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 0.9
Reporter: Julien Nioche
 Attachments: Popcorn.docx

The method extractHeaders in 
org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator throws a NPE 
on line 234 as XWPFHeaderFooterPolicy hfPolicy is null.
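The fix pattern is simply a defensive guard: skip header extraction when the document carries no header/footer policy at all. A hedged Python sketch of the idea follows (the real fix lives in the Java decorator; the dict below stands in for the POI policy object):

```python
def extract_headers(hf_policy):
    """Return the headers held by a header/footer policy, if any.

    Documents without headers or footers yield no policy object at all,
    so guard before dereferencing instead of assuming it exists -- the
    missing guard is what caused the NPE in the decorator.
    """
    if hf_policy is None:
        return []
    return hf_policy["headers"]

print(extract_headers(None))                  # → []
print(extract_headers({"headers": ["top"]}))  # → ['top']
```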

 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (TIKA-649) NPE while parsing a .docx

2011-04-27 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated TIKA-649:
---

Attachment: Popcorn.docx

Wikipedia content on popcorn within a docx page

> NPE while parsing a .docx  
> ---
>
> Key: TIKA-649
> URL: https://issues.apache.org/jira/browse/TIKA-649
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 0.9
>    Reporter: Julien Nioche
> Attachments: Popcorn.docx
>
>
> The method extractHeaders in 
> org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator throws a 
> NPE on line 234 as XWPFHeaderFooterPolicy hfPolicy is null.
>  

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


Invisible text displayed for headings in doc files

2011-04-06 Thread Julien Nioche
Hi guys,

We are currently getting duplicated text for headings in .doc files,
e.g.

*29. No Partnership or Agency XE "29. No
Partnership or Agency" *

XE seems to be a flag in MS Word
http://taxonomist.tripod.com/indexing/wordflags.html but I don't think it
should be displayed.

Have I missed a parameter somewhere that could be used to hide these things
or shall I open a JIRA?

BTW, does the class name vary from one user to another (depending on the
stylesheet) or is it consistent?

Thanks

Julien

-- 
*Open Source Solutions for Text Engineering*

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com


[jira] Closed: (TIKA-611) PDFParser mixes the text from separate columns

2011-03-09 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche closed TIKA-611.
--


> PDFParser mixes the text from separate columns
> --
>
> Key: TIKA-611
> URL: https://issues.apache.org/jira/browse/TIKA-611
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 0.9
>    Reporter: Julien Nioche
>    Assignee: Julien Nioche
> Fix For: 1.0
>
>
> As reported on the dev list by  Michael Schmitz :
> bq. I don't think the current snapshot is parsing articles (pdfs with 
> columns/beads) correctly.  The text is not in the write order as it 
> intermixes text from different beads.  Try it on an academic paper. 
> http://turing.cs.washington.edu/papers/acl08.pdf
> This can be fixed by changing the value of setSortByPosition to false, which 
> is the default value in PDFTextStripper. This line (PDF2XHTML:82) had been 
> added as part of the commit rev 1029510, see 
> https://issues.apache.org/jira/browse/TIKA-446?focusedCommentId=12926787&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12926787
> Ideally we could specify what value to set for these parameters via the 
> Context object, but for the time being wouldn't it make sense to set 
> setSortByPosition to the default value of false? I think that this would be 
> the best option for most cases where docs have columns.
>  

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] Resolved: (TIKA-611) PDFParser mixes the text from separate columns

2011-03-09 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche resolved TIKA-611.


Resolution: Fixed

Committed revision 1079705.

Opened TIKA-612 for the params via ParseContext 

> PDFParser mixes the text from separate columns
> --
>
> Key: TIKA-611
> URL: https://issues.apache.org/jira/browse/TIKA-611
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 0.9
>    Reporter: Julien Nioche
>    Assignee: Julien Nioche
> Fix For: 1.0
>
>
> As reported on the dev list by  Michael Schmitz :
> bq. I don't think the current snapshot is parsing articles (pdfs with 
> columns/beads) correctly.  The text is not in the right order as it 
> intermixes text from different beads.  Try it on an academic paper. 
> http://turing.cs.washington.edu/papers/acl08.pdf
> This can be fixed by changing the value of setSortByPosition to false, which 
> is the default value in PDFTextStripper. This line (PDF2XHTML:82) had been 
> added as part of the commit rev 1029510, see 
> https://issues.apache.org/jira/browse/TIKA-446?focusedCommentId=12926787&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12926787
> Ideally we could specify what value to set for these parameters via the 
> Context object, but for the time being wouldn't it make sense to set 
> setSortByPosition to the default value of false? I think that this would be 
> the best option for most cases where docs have columns.
>  

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] Created: (TIKA-612) Specify PDFBox options via ParseContext

2011-03-09 Thread Julien Nioche (JIRA)
Specify PDFBox options via ParseContext 


 Key: TIKA-612
 URL: https://issues.apache.org/jira/browse/TIKA-612
 Project: Tika
  Issue Type: New Feature
  Components: parser
Affects Versions: 0.9
Reporter: Julien Nioche
Assignee: Julien Nioche
Priority: Minor


See https://issues.apache.org/jira/browse/TIKA-611. The options used by PDFBox 
are currently hard-coded in the PDFParser code; we will allow them to be 
specified via the ParseContext object.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] Commented: (TIKA-611) PDFParser mixes the text from separate columns

2011-03-08 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13004035#comment-13004035
 ] 

Julien Nioche commented on TIKA-611:


The current behaviour is incorrect not only for academic research papers but 
for any document using columns (e.g. contracts, etc.); it also differs from how 
things were done prior to the modification in TIKA-446.
Can we fix the boolean value in this issue, then open a new issue to implement 
the mechanism with ParseContext for this and the other params?

> PDFParser mixes the text from separate columns
> --
>
> Key: TIKA-611
> URL: https://issues.apache.org/jira/browse/TIKA-611
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 0.9
>    Reporter: Julien Nioche
>Assignee: Julien Nioche
> Fix For: 1.0
>
>
> As reported on the dev list by  Michael Schmitz :
> bq. I don't think the current snapshot is parsing articles (pdfs with 
> columns/beads) correctly.  The text is not in the right order as it 
> intermixes text from different beads.  Try it on an academic paper. 
> http://turing.cs.washington.edu/papers/acl08.pdf
> This can be fixed by changing the value of setSortByPosition to false, which 
> is the default value in PDFTextStripper. This line (PDF2XHTML:82) had been 
> added as part of the commit rev 1029510, see 
> https://issues.apache.org/jira/browse/TIKA-446?focusedCommentId=12926787&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12926787
> Ideally we could specify what value to set for these parameters via the 
> Context object, but for the time being wouldn't it make sense to set 
> setSortByPosition to the default value of false? I think that this would be 
> the best option for most cases where docs have columns.
>  

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] Commented: (TIKA-611) PDFParser mixes the text from separate columns

2011-03-08 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13003884#comment-13003884
 ] 

Julien Nioche commented on TIKA-611:


No objections? Shall I commit this?

> PDFParser mixes the text from separate columns
> --
>
> Key: TIKA-611
> URL: https://issues.apache.org/jira/browse/TIKA-611
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 0.9
>    Reporter: Julien Nioche
>Assignee: Julien Nioche
> Fix For: 1.0
>
>
> As reported on the dev list by  Michael Schmitz :
> bq. I don't think the current snapshot is parsing articles (pdfs with 
> columns/beads) correctly.  The text is not in the right order as it 
> intermixes text from different beads.  Try it on an academic paper. 
> http://turing.cs.washington.edu/papers/acl08.pdf
> This can be fixed by changing the value of setSortByPosition to false, which 
> is the default value in PDFTextStripper. This line (PDF2XHTML:82) had been 
> added as part of the commit rev 1029510, see 
> https://issues.apache.org/jira/browse/TIKA-446?focusedCommentId=12926787&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12926787
> Ideally we could specify what value to set for these parameters via the 
> Context object, but for the time being wouldn't it make sense to set 
> setSortByPosition to the default value of false? I think that this would be 
> the best option for most cases where docs have columns.
>  

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] Created: (TIKA-611) PDFParser mixes the text from separate columns

2011-03-07 Thread Julien Nioche (JIRA)
PDFParser mixes the text from separate columns
--

 Key: TIKA-611
 URL: https://issues.apache.org/jira/browse/TIKA-611
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 0.9
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.0


As reported on the dev list by  Michael Schmitz :

bq. I don't think the current snapshot is parsing articles (pdfs with 
columns/beads) correctly.  The text is not in the right order as it intermixes 
text from different beads.  Try it on an academic paper. 
http://turing.cs.washington.edu/papers/acl08.pdf

This can be fixed by changing the value of setSortByPosition to false, which is 
the default value in PDFTextStripper. This line (PDF2XHTML:82) had been added 
as part of the commit rev 1029510, see 
https://issues.apache.org/jira/browse/TIKA-446?focusedCommentId=12926787&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12926787

Ideally we could specify what value to set for these parameters via the Context 
object, but for the time being wouldn't it make sense to set setSortByPosition 
to the default value of false? I think that this would be the best option for 
most cases where docs have columns.
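The column-mixing effect is easy to reproduce outside PDFBox: if text chunks are sorted purely by their position on the page, lines from the two columns interleave, whereas keeping the content-stream order keeps each column intact. A toy Python sketch (coordinates and chunk names are invented for illustration):

```python
# Each chunk is (y, x, text); the PDF content stream emits the whole
# left column first, then the whole right column.
stream_order = [
    (10, 0, "L1"), (20, 0, "L2"), (30, 0, "L3"),     # left column
    (10, 50, "R1"), (20, 50, "R2"), (30, 50, "R3"),  # right column
]

def extract(chunks, sort_by_position):
    if sort_by_position:
        # Sorting top-to-bottom (then left-to-right) interleaves columns,
        # which is what sortByPosition=true does for multi-column pages.
        chunks = sorted(chunks, key=lambda c: (c[0], c[1]))
    return " ".join(text for _, _, text in chunks)

print(extract(stream_order, sort_by_position=True))   # → L1 R1 L2 R2 L3 R3
print(extract(stream_order, sort_by_position=False))  # → L1 L2 L3 R1 R2 R3
```

Sorting by position is still the better choice for PDFs whose content stream is scrambled, which is why making the flag configurable (rather than removing it) is the longer-term fix.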

 




--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] Resolved: (TIKA-597) Bogus exception handler in org.apache.tika.parser.mail.MailContentHandler.body(BodyDescriptor, InputStream)

2011-03-02 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche resolved TIKA-597.


   Resolution: Fixed
Fix Version/s: 1.0

Committed revision 1076300

Thanks Benson

> Bogus exception handler in 
> org.apache.tika.parser.mail.MailContentHandler.body(BodyDescriptor, 
> InputStream)
> ---
>
> Key: TIKA-597
> URL: https://issues.apache.org/jira/browse/TIKA-597
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 0.8
>Reporter: Benson Margulies
>Assignee: Julien Nioche
> Fix For: 1.0
>
> Attachments: TIKA-597.patch
>
>
> org.apache.tika.parser.mail.MailContentHandler.body(BodyDescriptor, 
> InputStream) 
> contains an exception handler that calls printStackTrace instead of rethrowing
> as a RuntimeException. Should it be 'throws TikaException' in any case?

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (TIKA-597) Bogus exception handler in org.apache.tika.parser.mail.MailContentHandler.body(BodyDescriptor, InputStream)

2011-03-02 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13001497#comment-13001497
 ] 

Julien Nioche commented on TIKA-597:


Benson, 

I can't see any TikaRuntimeException in the current SVN (Revision: 1076294). 
This could be a nice addition; would you mind opening a separate issue for it 
so that people can discuss the pros and cons?

Will embed the TikaException in a RuntimeException for now.

Thanks

Julien
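The pattern being settled on here, i.e. rethrowing a wrapped exception instead of calling printStackTrace and continuing, can be sketched as follows (a Python illustration of the idea only; the actual fix is in the Java MailContentHandler, and TikaException below is a stand-in for the real checked exception):

```python
class TikaException(Exception):
    """Stand-in for a checked parser exception."""

def parse_body(stream):
    raise TikaException("bad body")

def handle_body_swallowing(stream):
    # Anti-pattern: print and continue; the caller never learns of the failure.
    try:
        parse_body(stream)
    except TikaException as e:
        print("ignored:", e)

def handle_body_rethrowing(stream):
    # Preferred: wrap the checked exception so it propagates to the caller,
    # keeping the original as the cause.
    try:
        parse_body(stream)
    except TikaException as e:
        raise RuntimeError("body parsing failed") from e

handle_body_swallowing(None)  # failure is merely logged
try:
    handle_body_rethrowing(None)
except RuntimeError as e:
    print("caller sees cause:", e.__cause__)
```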

> Bogus exception handler in 
> org.apache.tika.parser.mail.MailContentHandler.body(BodyDescriptor, 
> InputStream)
> ---
>
> Key: TIKA-597
> URL: https://issues.apache.org/jira/browse/TIKA-597
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 0.8
>Reporter: Benson Margulies
>Assignee: Julien Nioche
> Attachments: TIKA-597.patch
>
>
> org.apache.tika.parser.mail.MailContentHandler.body(BodyDescriptor, 
> InputStream) 
> contains an exception handler that calls printStackTrace instead of rethrowing
> as a RuntimeException. Should it be 'throws TikaException' in any case?

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Assigned: (TIKA-597) Bogus exception handler in org.apache.tika.parser.mail.MailContentHandler.body(BodyDescriptor, InputStream)

2011-03-02 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche reassigned TIKA-597:
--

Assignee: Julien Nioche  (was: Chris A. Mattmann)

> Bogus exception handler in 
> org.apache.tika.parser.mail.MailContentHandler.body(BodyDescriptor, 
> InputStream)
> ---
>
> Key: TIKA-597
> URL: https://issues.apache.org/jira/browse/TIKA-597
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 0.8
>Reporter: Benson Margulies
>Assignee: Julien Nioche
> Attachments: TIKA-597.patch
>
>
> org.apache.tika.parser.mail.MailContentHandler.body(BodyDescriptor, 
> InputStream) 
> contains an exception handler that calls printStackTrace instead of rethrowing
> as a RuntimeException. Should it be 'throws TikaException' in any case?

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




Re: [VOTE] Apache Tika 0.9 Release Candidate #1

2011-02-15 Thread Julien Nioche
>
>
> Please vote on releasing these packages as Apache Tika 0.9. The vote is
> open
> for the next 72 hours. Only votes from Tika PMC are binding, but everyone
> is welcome to check the release candidate and voice their approval or
> disapproval. The vote passes if at least three binding +1 votes are cast.
>
> [ ] +1 Release the packages as Apache Tika 0.9.
>
> [ ] -1 Do not release the packages because...
>

+1: I tried 0.9 with Behemoth and it worked fine. As for Nutch 1.3, it causes 
an issue with the zip parser plugin, but I don't think Tika is to blame. I'll 
fix it when 0.9 is released.

-- 
*Open Source Solutions for Text Engineering*

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com


[jira] Commented: (TIKA-461) RFC822 messages not parsed

2010-11-30 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12965286#action_12965286
 ] 

Julien Nioche commented on TIKA-461:


patch -p1 failed 

peb...@lucid-vostro:/data/tika$ patch -p1 < TIKA-461-tests-1.patch 
patching file 
tika-parsers/src/test/java/org/apache/tika/parser/mail/RFC822ParserTest.java
Hunk #1 FAILED at 16.
Hunk #2 FAILED at 26.
Hunk #3 FAILED at 40.
etc...

Can you please create an aggregate patch with svn diff? Thanks 

> RFC822 messages not parsed
> --
>
> Key: TIKA-461
> URL: https://issues.apache.org/jira/browse/TIKA-461
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Affects Versions: 0.7
>Reporter: Joshua Turner
>Assignee: Julien Nioche
> Attachments: testRFC822-multipart, TIKA-461-tests-1.patch, 
> TIKA-461.patch
>
>
> Presented with an RFC822 message exported from Thunderbird, AutodetectParser 
> produces an empty body, and a Metadata containing only one key-value pair: 
> "Content-Type=message/rfc822". Directly calling MboxParser likewise gives an 
> empty body, but with two metadata pairs: "Content-Encoding=us-ascii 
> Content-Type=application/mbox".
> A quick peek at the source of MboxParser shows that the implementation is 
> pretty naive. If the wiring can be sorted out, something like Apache James' 
> mime4j might be a better bet.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (TIKA-461) RFC822 messages not parsed

2010-11-30 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12965271#action_12965271
 ] 

Julien Nioche commented on TIKA-461:


Benjamin, thanks for your patch. Could you generate it with 'svn diff'? I am 
not able to apply it successfully to my code.

> RFC822 messages not parsed
> --
>
> Key: TIKA-461
> URL: https://issues.apache.org/jira/browse/TIKA-461
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Affects Versions: 0.7
>Reporter: Joshua Turner
>Assignee: Julien Nioche
> Attachments: testRFC822-multipart, TIKA-461-tests-1.patch, 
> TIKA-461.patch
>
>
> Presented with an RFC822 message exported from Thunderbird, AutodetectParser 
> produces an empty body, and a Metadata containing only one key-value pair: 
> "Content-Type=message/rfc822". Directly calling MboxParser likewise gives an 
> empty body, but with two metadata pairs: "Content-Encoding=us-ascii 
> Content-Type=application/mbox".
> A quick peek at the source of MboxParser shows that the implementation is 
> pretty naive. If the wiring can be sorted out, something like Apache James' 
> mime4j might be a better bet.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (TIKA-461) RFC822 messages not parsed

2010-11-30 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated TIKA-461:
---

Attachment: testRFC822-multipart

Test document for mail parsing with multiple parts: text + HTML 
representations of the body and an attached picture. 

> RFC822 messages not parsed
> --
>
> Key: TIKA-461
> URL: https://issues.apache.org/jira/browse/TIKA-461
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Affects Versions: 0.7
>Reporter: Joshua Turner
>    Assignee: Julien Nioche
> Attachments: testRFC822-multipart, TIKA-461-tests-1.patch, 
> TIKA-461.patch
>
>
> Presented with an RFC822 message exported from Thunderbird, AutodetectParser 
> produces an empty body, and a Metadata containing only one key-value pair: 
> "Content-Type=message/rfc822". Directly calling MboxParser likewise gives an 
> empty body, but with two metadata pairs: "Content-Encoding=us-ascii 
> Content-Type=application/mbox".
> A quick peek at the source of MboxParser shows that the implementation is 
> pretty naive. If the wiring can be sorted out, something like Apache James' 
> mime4j might be a better bet.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Furthering Along TIKA-461

2010-11-25 Thread Julien Nioche
Hi Ben,

Great! I still haven't found the time to work on Nick's suggestions but you
can definitely work on the tests if you want to and add some of the emails
you mentioned. Having some cases of multipart messages with HTML and text
content plus images and attachments would be good.

Thanks

Julien

On 25 November 2010 18:01, Benjamin Douglas  wrote:

> Hello,
>
> I am working on a project with rfc-822 email messages and ran into the
> problem discussed in TIKA-461. I'd be interested in helping this story
> along, if there is anything more to be done. In particular, I have a pile of
> public domain emails that might be useful for testing.
>
> Thanks,
> Ben Douglas
>
>


-- 
*Open Source Solutions for Text Engineering*

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com


[jira] Commented: (TIKA-461) RFC822 messages not parsed

2010-11-09 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930180#action_12930180
 ] 

Julien Nioche commented on TIKA-461:


Nope. I was planning to refactor the parser first along the lines of what you 
described (org.apache.tika.extractor) and haven't found the time to do it so 
far.

> RFC822 messages not parsed
> --
>
> Key: TIKA-461
> URL: https://issues.apache.org/jira/browse/TIKA-461
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Affects Versions: 0.7
>Reporter: Joshua Turner
>Assignee: Julien Nioche
> Attachments: TIKA-461.patch
>
>
> Presented with an RFC822 message exported from Thunderbird, AutodetectParser 
> produces an empty body, and a Metadata containing only one key-value pair: 
> "Content-Type=message/rfc822". Directly calling MboxParser likewise gives an 
> empty body, but with two metadata pairs: "Content-Encoding=us-ascii 
> Content-Type=application/mbox".
> A quick peek at the source of MboxParser shows that the implementation is 
> pretty naive. If the wiring can be sorted out, something like Apache James' 
> mime4j might be a better bet.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (TIKA-461) RFC822 messages not parsed

2010-09-28 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915708#action_12915708
 ] 

Julien Nioche commented on TIKA-461:


Nick, 

Thanks for taking the time to review my patch. 

bq. It'd probably be good to see some more tests with it. For now, just 
checking your basic message should be fine, but I'd suggest we also try to get 
an email with plain text, html, images and similar in to check the more complex 
bits.

Agreed

bq. In terms of the nested parser, I'm tempted to say we do something so that 
plain text comes out without any extra work needed. Anything else gets handled 
via a Parser fetched from the ParseContext if required, much as we're doing for 
container formats like zip, .docx etc. That way, you can throw a simple email 
at it and get the text, but the rest of the parts are available if you want them

I hadn't noticed that you've added org.apache.tika.extractor, seems an elegant 
way of doing. Will have a closer look and see how I can leverage it in  
RFC822Parser

bq.  Also, the james jars need to be listed in the tika bundle pom so they get 
properly included 

Ok, did not know about that. Thanks
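As an aside, the behaviour Nick describes (plain text comes out directly, while other parts are left for a delegate parser) can be illustrated with Python's standard email module. This is only a sketch of the idea, not the mime4j-based RFC822Parser itself:

```python
from email.message import EmailMessage

# Build a small multipart message: a text body plus a binary attachment,
# mirroring the multipart test cases discussed for the RFC822 parser.
msg = EmailMessage()
msg["Subject"] = "test"
msg.set_content("plain text body")
msg.add_attachment(b"\x89PNG", maintype="image", subtype="png",
                   filename="pic.png")

# Plain text is extracted directly; the remaining parts are merely
# enumerated here, standing in for delegation to a ParseContext-supplied
# parser in Tika.
body = msg.get_body(preferencelist=("plain",)).get_content().strip()
attachments = [p.get_filename() for p in msg.iter_attachments()]
print(body)         # → plain text body
print(attachments)  # → ['pic.png']
```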

> RFC822 messages not parsed
> --
>
> Key: TIKA-461
> URL: https://issues.apache.org/jira/browse/TIKA-461
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Affects Versions: 0.7
>    Reporter: Joshua Turner
>Assignee: Julien Nioche
> Attachments: TIKA-461.patch
>
>
> Presented with an RFC822 message exported from Thunderbird, AutodetectParser 
> produces an empty body, and a Metadata containing only one key-value pair: 
> "Content-Type=message/rfc822". Directly calling MboxParser likewise gives an 
> empty body, but with two metadata pairs: "Content-Encoding=us-ascii 
> Content-Type=application/mbox".
> A quick peek at the source of MboxParser shows that the implementation is 
> pretty naive. If the wiring can be sorted out, something like Apache James' 
> mime4j might be a better bet.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (TIKA-461) RFC822 messages not parsed

2010-09-27 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915269#action_12915269
 ] 

Julien Nioche commented on TIKA-461:


Hi guys, 

Could anyone have a look at the patch and review it, please? Otherwise I will 
assume no one has seen any major issues with it and will commit at the end of 
this week.
Thanks

Julien  

> RFC822 messages not parsed
> --
>
> Key: TIKA-461
> URL: https://issues.apache.org/jira/browse/TIKA-461
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Affects Versions: 0.7
>Reporter: Joshua Turner
>Assignee: Julien Nioche
> Attachments: TIKA-461.patch
>
>
> Presented with an RFC822 message exported from Thunderbird, AutodetectParser 
> produces an empty body, and a Metadata containing only one key-value pair: 
> "Content-Type=message/rfc822". Directly calling MboxParser likewise gives an 
> empty body, but with two metadata pairs: "Content-Encoding=us-ascii 
> Content-Type=application/mbox".
> A quick peek at the source of MboxParser shows that the implementation is 
> pretty naive. If the wiring can be sorted out, something like Apache James' 
> mime4j might be a better bet.




[jira] Updated: (TIKA-461) RFC822 messages not parsed

2010-09-06 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated TIKA-461:
---

Issue Type: New Feature  (was: Bug)

changed from bug to new feature

> RFC822 messages not parsed
> --
>
> Key: TIKA-461
> URL: https://issues.apache.org/jira/browse/TIKA-461
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Affects Versions: 0.7
>Reporter: Joshua Turner
>    Assignee: Julien Nioche
> Attachments: TIKA-461.patch
>
>
> Presented with an RFC822 message exported from Thunderbird, AutodetectParser 
> produces an empty body, and a Metadata containing only one key-value pair: 
> "Content-Type=message/rfc822". Directly calling MboxParser likewise gives an 
> empty body, but with two metadata pairs: "Content-Encoding=us-ascii 
> Content-Type=application/mbox".
> A quick peek at the source of MboxParser shows that the implementation is 
> pretty naive. If the wiring can be sorted out, something like Apache James' 
> mime4j might be a better bet.




[jira] Updated: (TIKA-461) RFC822 messages not parsed

2010-09-06 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated TIKA-461:
---

Attachment: TIKA-461.patch

This patch contains an initial version of the RFC822Parser which uses 
apache-mime4j

Metadata is currently created only for the main message and not for the parts. 
Note that each part is wrapped in its own element and parsed with the right 
parser for its declared mime-type, e.g. txt, html etc. The multiparts seem to 
be properly handled as well.

There is definitely room for improvement; as usual, comments are more than 
welcome.
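The per-part dispatch described above (pick a parser for each part based on its declared mime type, with a fallback for unknown types) can be sketched in isolation like this; the class and handler names are illustrative, not the actual RFC822Parser or Tika API:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Minimal mime-type dispatch: each message part is handed to the
// handler registered for its declared type, or to a fallback.
public class PartDispatch {
    private final Map<String, Function<String, String>> handlers = new HashMap<>();
    private final Function<String, String> fallback = body -> "[unparsed] " + body;

    public void register(String mimeType, Function<String, String> handler) {
        handlers.put(mimeType, handler);
    }

    public String parsePart(String mimeType, String body) {
        return handlers.getOrDefault(mimeType, fallback).apply(body);
    }

    public static void main(String[] args) {
        PartDispatch d = new PartDispatch();
        d.register("text/plain", body -> body.trim());
        d.register("text/html", body -> body.replaceAll("<[^>]+>", "").trim());
        System.out.println(d.parsePart("text/html", "<p>hello</p>")); // hello
        System.out.println(d.parsePart("text/plain", " hi "));        // hi
        System.out.println(d.parsePart("image/png", "raw"));          // [unparsed] raw
    }
}
```

In the real parser the handlers would be Tika Parser instances chosen by the declared Content-Type; the point here is only the lookup-with-fallback shape.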

> RFC822 messages not parsed
> --
>
> Key: TIKA-461
> URL: https://issues.apache.org/jira/browse/TIKA-461
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 0.7
>Reporter: Joshua Turner
>    Assignee: Julien Nioche
> Attachments: TIKA-461.patch
>
>
> Presented with an RFC822 message exported from Thunderbird, AutodetectParser 
> produces an empty body, and a Metadata containing only one key-value pair: 
> "Content-Type=message/rfc822". Directly calling MboxParser likewise gives an 
> empty body, but with two metadata pairs: "Content-Encoding=us-ascii 
> Content-Type=application/mbox".
> A quick peek at the source of MboxParser shows that the implementation is 
> pretty naive. If the wiring can be sorted out, something like Apache James' 
> mime4j might be a better bet.




[jira] Commented: (TIKA-461) RFC822 messages not parsed

2010-09-06 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12906468#action_12906468
 ] 

Julien Nioche commented on TIKA-461:


I'll have a look at mime4j and try to use it in Tika

> RFC822 messages not parsed
> --
>
> Key: TIKA-461
> URL: https://issues.apache.org/jira/browse/TIKA-461
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 0.7
>Reporter: Joshua Turner
>Assignee: Julien Nioche
>
> Presented with an RFC822 message exported from Thunderbird, AutodetectParser 
> produces an empty body, and a Metadata containing only one key-value pair: 
> "Content-Type=message/rfc822". Directly calling MboxParser likewise gives an 
> empty body, but with two metadata pairs: "Content-Encoding=us-ascii 
> Content-Type=application/mbox".
> A quick peek at the source of MboxParser shows that the implementation is 
> pretty naive. If the wiring can be sorted out, something like Apache James' 
> mime4j might be a better bet.




[jira] Assigned: (TIKA-461) RFC822 messages not parsed

2010-09-06 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche reassigned TIKA-461:
--

Assignee: Julien Nioche

> RFC822 messages not parsed
> --
>
> Key: TIKA-461
> URL: https://issues.apache.org/jira/browse/TIKA-461
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 0.7
>Reporter: Joshua Turner
>    Assignee: Julien Nioche
>
> Presented with an RFC822 message exported from Thunderbird, AutodetectParser 
> produces an empty body, and a Metadata containing only one key-value pair: 
> "Content-Type=message/rfc822". Directly calling MboxParser likewise gives an 
> empty body, but with two metadata pairs: "Content-Encoding=us-ascii 
> Content-Type=application/mbox".
> A quick peek at the source of MboxParser shows that the implementation is 
> pretty naive. If the wiring can be sorted out, something like Apache James' 
> mime4j might be a better bet.




[jira] Commented: (TIKA-463) HtmlParser doesn't extract links from img, map, object, frame, iframe, area, link

2010-08-17 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899465#action_12899465
 ] 

Julien Nioche commented on TIKA-463:


Looks good. I must be missing something obvious, but I can't work out where an 
element like META is sent to the XHTML output. That wasn't the case before, as 
far as I can remember, and it can't be in HtmlHandler, which imposes the 
constraints I described earlier, i.e. it used to simply put the info in the 
metadata. Ken, would you mind giving me a hint?


> HtmlParser doesn't extract links from img, map, object, frame, iframe, area, 
> link
> -
>
> Key: TIKA-463
> URL: https://issues.apache.org/jira/browse/TIKA-463
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Reporter: Ken Krugler
>Assignee: Ken Krugler
> Fix For: 0.8
>
> Attachments: TIKA-463-1.patch, TIKA-463-2.patch, TIKA-463-3.patch, 
> TIKA-463.patch
>
>
> All of the listed HTML elements can have URLs as attributes, and thus we'd 
> want to extract those links, if possible.
> For elements that aren't valid as XHTML 1.0, there might be some challenges 
> in the right way to handle this.
> But if XHTML 1.0 means the union of "transitional and frameset" variants, 
> then all of the above are valid, and thus should be emitted by the parser,




[jira] Resolved: (TIKA-460) HTMLHandler misses treatment of A elements

2010-08-14 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche resolved TIKA-460.


Resolution: Fixed

Committed revision 985444

The A elements are now processed correctly when using the IdentityMapper. I 
have added A to the list of safe elements in the DefaultHTMLMapper.

Ken - the element A still has special treatment, so the safe attributes you 
added in

{code}
put("a", attrSet("rel", "name"));
{code}

are still not used. Since A was not in the list of safe elements, these 
attributes were not used anyway.

I still think that we should delegate the logic to the mappers as suggested in 
TIKA-463 but in the meantime this fix allows us to get to the A's using the 
IdentityMapper and simplifies the code a bit. 

> HTMLHandler misses treatment of A elements 
> ---
>
> Key: TIKA-460
> URL: https://issues.apache.org/jira/browse/TIKA-460
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 0.7
>    Reporter: Julien Nioche
>Assignee: Julien Nioche
> Fix For: 0.8
>
> Attachments: TIKA-460.patch
>
>
> The A elements should be processed before any other safe element, otherwise 
> it never happens




[jira] Commented: (TIKA-460) HTMLHandler misses treatment of A elements

2010-08-13 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898335#action_12898335
 ] 

Julien Nioche commented on TIKA-460:


Hi Ken, correct. The A's get bypassed otherwise. TIKA-463 would be a cleaner 
way of dealing with situations like these, but in the meantime the patch should 
be OK.

> HTMLHandler misses treatment of A elements 
> ---
>
> Key: TIKA-460
> URL: https://issues.apache.org/jira/browse/TIKA-460
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 0.7
>Reporter: Julien Nioche
>Assignee: Julien Nioche
> Fix For: 0.8
>
> Attachments: TIKA-460.patch
>
>
> The A elements should be processed before any other safe element, otherwise 
> it never happens




Re: Post link to Tika in Action book on Tika website?

2010-08-02 Thread Julien Nioche
+1 from me

On 2 August 2010 18:33, Mattmann, Chris A (388J) <
chris.a.mattm...@jpl.nasa.gov> wrote:

> Hi Tika community,
>
> Jukka Zitting and I are working on the Tika in Action book [1]. How would
> everyone feel about us posting a link to it on the Tika website [2]?
>
> If so, I'll prepare a patch and update the website shortly.
>
> Cheers,
> Chris
>
> [1] http://manning.com/mattmann/
> [2] http://tika.apache.org/
>
>
> ++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: chris.mattm...@jpl.nasa.gov
> WWW:   http://sunset.usc.edu/~mattmann/
> ++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++
>
>
>


-- 
DigitalPebble Ltd

Open Source Solutions for Text Engineering
http://www.digitalpebble.com


[jira] Commented: (TIKA-463) HtmlParser doesn't extract links from img, map, object, frame, iframe, area, link

2010-07-27 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892958#action_12892958
 ] 

Julien Nioche commented on TIKA-463:


I am very tempted to push things one step further and delegate startElement() 
and endElement() to the mappers so that users can do whatever they fancy in 
their custom mapper implementations. In that case we would probably not need 
mapSafeElement and mapSafeAttribute any longer. The patch above gives the 
mappers access to the metadata.

For example, A elements have special treatment in the HTMLHandler and we 
currently can't get the rel attribute from an anchor like 
<a href="http://www.nutch.org" rel="nofollow">, which for a crawler is quite an 
embarrassment. Instead, by delegating the logic to the mappers we get total 
control over what can be done, while at the same time remaining able to keep 
the existing behaviour by default.

Any reason not to delegate start/endElement to the mappers? It would be good to 
get some feedback on this, as I really need to improve the handling of HTML 
for Nutch :-)
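As a plain-JDK illustration of the kind of information a delegating handler could expose (this uses the standard SAX parser on a small well-formed snippet, not Tika's HtmlParser; the class name is made up):

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// A handler that keeps both href and rel from anchor elements - the
// information a crawler needs to honour rel="nofollow".
public class RelAwareHandler extends DefaultHandler {
    final List<String> links = new ArrayList<>();

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        if ("a".equalsIgnoreCase(qName)) {
            String href = atts.getValue("href");
            String rel = atts.getValue("rel");
            links.add(href + (rel != null ? " [rel=" + rel + "]" : ""));
        }
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body>"
            + "<a href=\"http://www.nutch.org\" rel=\"nofollow\">Nutch</a>"
            + "</body></html>";
        RelAwareHandler h = new RelAwareHandler();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xhtml)), h);
        System.out.println(h.links); // [http://www.nutch.org [rel=nofollow]]
    }
}
```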

> HtmlParser doesn't extract links from img, map, object, frame, iframe, area, 
> link
> -
>
> Key: TIKA-463
> URL: https://issues.apache.org/jira/browse/TIKA-463
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Reporter: Ken Krugler
>Assignee: Ken Krugler
> Attachments: TIKA-463.patch
>
>
> All of the listed HTML elements can have URLs as attributes, and thus we'd 
> want to extract those links, if possible.
> For elements that aren't valid as XHTML 1.0, there might be some challenges 
> in the right way to handle this.
> But if XHTML 1.0 means the union of "transitional and frameset" variants, 
> then all of the above are valid, and thus should be emitted by the parser,




[jira] Closed: (TIKA-466) Feed Parser

2010-07-20 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche closed TIKA-466.
--


> Feed Parser
> ---
>
> Key: TIKA-466
> URL: https://issues.apache.org/jira/browse/TIKA-466
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>    Reporter: Julien Nioche
>Assignee: Chris A. Mattmann
>Priority: Minor
> Fix For: 0.8
>
> Attachments: TIKA-466.patch
>
>
> We currently have no parsers for feeds in Tika and since we are progressively 
> getting rid of our legacy parsers in Nutch I thought it could make sense to 
> have one.
> The patch attached is based on the ROME feed parser 
> (https://rome.dev.java.net/) which is under Apache License. Rome provides a 
> unified API for different feed formats and seems well maintained.
> The implementation of the FeedParser is by no means complete but should serve 
> as a basis for further improvements. It currently stores the title and 
> description from the feed and stores them in the metadata and uses the 
> following XHTML representation for the entries : 
> ENTRY_TITLE
> 
> ENTRY_DESCRIPTION
>  
> This is pretty basic but should at least allow us to retrieve the outlinks in 
> Nutch as well as some text. 
> J. 




[jira] Commented: (TIKA-466) Feed Parser

2010-07-20 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890380#action_12890380
 ] 

Julien Nioche commented on TIKA-466:


Thanks Chris for reviewing and committing it

> Feed Parser
> ---
>
> Key: TIKA-466
> URL: https://issues.apache.org/jira/browse/TIKA-466
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Reporter: Julien Nioche
>Assignee: Chris A. Mattmann
>Priority: Minor
> Fix For: 0.8
>
> Attachments: TIKA-466.patch
>
>
> We currently have no parsers for feeds in Tika and since we are progressively 
> getting rid of our legacy parsers in Nutch I thought it could make sense to 
> have one.
> The patch attached is based on the ROME feed parser 
> (https://rome.dev.java.net/) which is under Apache License. Rome provides a 
> unified API for different feed formats and seems well maintained.
> The implementation of the FeedParser is by no means complete but should serve 
> as a basis for further improvements. It currently stores the title and 
> description from the feed and stores them in the metadata and uses the 
> following XHTML representation for the entries : 
> ENTRY_TITLE
> 
> ENTRY_DESCRIPTION
>  
> This is pretty basic but should at least allow us to retrieve the outlinks in 
> Nutch as well as some text. 
> J. 




[jira] Commented: (TIKA-147) Add Flash parser

2010-07-19 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12889883#action_12889883
 ] 

Julien Nioche commented on TIKA-147:


There is http://www.jswiff.com/licensing/ which seems maintained but is under 
GPL. 

Can't we push the JavaSWF jar to a public repository ourselves?  

> Add Flash parser
> 
>
> Key: TIKA-147
> URL: https://issues.apache.org/jira/browse/TIKA-147
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Reporter: Jukka Zitting
>Assignee: Dave Meikle
>Priority: Minor
>
> Adobe has published the Flash SWF file format specification at 
> http://www.adobe.com/devnet/swf/.
> Once there's a parser library available for Flash files we should use it to 
> make especially downstream web crawlers like Nutch happy.




[jira] Updated: (TIKA-463) HtmlParser doesn't extract links from img, map, object, frame, iframe, area, link

2010-07-19 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated TIKA-463:
---

Attachment: TIKA-463.patch

Patch which implements some of the ideas described in this issue. 

- HtmlMapper is an abstract class with a constructor HtmlMapper(Metadata 
metadata, ParseContext context)
- all extensions of HtmlMapper can access the metadata and context
- HtmlMapper implements the method resolve(String url)
- created a LinksHtmlMapper which extends DefaultHtmlMapper
- HtmlHandler.bodyLevel is used to restrict the propagation of characters() but 
not the elements
- HtmlHandler has a variable inHead to separate the treatment of elements in 
the header from the rest (don't know if this is really needed but that's how it 
is done now)

Note that:
- HtmlMapper.resolve() is currently called from the HtmlHandler
- the signatures of the mapper methods have not been changed
- custom processing of some elements (A, BASE, LINK, ...) is still done in the 
HtmlHandler and not in the mapper

This patch passes the tests.



> HtmlParser doesn't extract links from img, map, object, frame, iframe, area, 
> link
> -
>
> Key: TIKA-463
> URL: https://issues.apache.org/jira/browse/TIKA-463
> Project: Tika
>  Issue Type: Bug
>Reporter: Ken Krugler
>Assignee: Ken Krugler
> Attachments: TIKA-463.patch
>
>
> All of the listed HTML elements can have URLs as attributes, and thus we'd 
> want to extract those links, if possible.
> For elements that aren't valid as XHTML 1.0, there might be some challenges 
> in the right way to handle this.
> But if XHTML 1.0 means the union of "transitional and frameset" variants, 
> then all of the above are valid, and thus should be emitted by the parser,




[jira] Updated: (TIKA-466) Feed Parser

2010-07-16 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated TIKA-466:
---

Attachment: TIKA-466.patch

> Feed Parser
> ---
>
> Key: TIKA-466
> URL: https://issues.apache.org/jira/browse/TIKA-466
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>    Reporter: Julien Nioche
>Priority: Minor
> Attachments: TIKA-466.patch
>
>
> We currently have no parsers for feeds in Tika and since we are progressively 
> getting rid of our legacy parsers in Nutch I thought it could make sense to 
> have one.
> The patch attached is based on the ROME feed parser 
> (https://rome.dev.java.net/) which is under Apache License. Rome provides a 
> unified API for different feed formats and seems well maintained.
> The implementation of the FeedParser is by no means complete but should serve 
> as a basis for further improvements. It currently stores the title and 
> description from the feed and stores them in the metadata and uses the 
> following XHTML representation for the entries : 
> ENTRY_TITLE
> 
> ENTRY_DESCRIPTION
>  
> This is pretty basic but should at least allow us to retrieve the outlinks in 
> Nutch as well as some text. 
> J. 




[jira] Created: (TIKA-466) Feed Parser

2010-07-16 Thread Julien Nioche (JIRA)
Feed Parser
---

 Key: TIKA-466
 URL: https://issues.apache.org/jira/browse/TIKA-466
 Project: Tika
  Issue Type: New Feature
  Components: parser
Reporter: Julien Nioche
Priority: Minor
 Attachments: TIKA-466.patch

We currently have no parsers for feeds in Tika and since we are progressively 
getting rid of our legacy parsers in Nutch I thought it could make sense to 
have one.

The patch attached is based on the ROME feed parser 
(https://rome.dev.java.net/) which is under Apache License. Rome provides a 
unified API for different feed formats and seems well maintained.

The implementation of the FeedParser is by no means complete but should serve 
as a basis for further improvements. It currently takes the title and 
description from the feed, stores them in the metadata, and uses the 
following XHTML representation for the entries: 

ENTRY_TITLE

ENTRY_DESCRIPTION
 

This is pretty basic but should at least allow us to retrieve the outlinks in 
Nutch as well as some text. 

J. 






[jira] Commented: (TIKA-460) HTMLHandler misses treatment of A elements

2010-07-13 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12887718#action_12887718
 ] 

Julien Nioche commented on TIKA-460:


This would work if we had A in the list of safe elements in the 
DefaultHTMLMapper, which is not the case. I will wait for the outcome of the 
discussions on TIKA-463, which will affect the way link elements are handled.

> HTMLHandler misses treatment of A elements 
> ---
>
> Key: TIKA-460
> URL: https://issues.apache.org/jira/browse/TIKA-460
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 0.7
>    Reporter: Julien Nioche
>Assignee: Julien Nioche
> Fix For: 0.8
>
> Attachments: TIKA-460.patch
>
>
> The A elements should be processed before any other safe element, otherwise 
> it never happens




[jira] Commented: (TIKA-463) HtmlParser doesn't extract links from img, map, object, frame, iframe, area, link

2010-07-13 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12887716#action_12887716
 ] 

Julien Nioche commented on TIKA-463:


Creating a LinksHtmlMapper: +1, that would be a nice intermediate between the 
default mapper and the identity mapper.

Handling of links in the mapper: mapSafeAttribute() returns a normalised 
representation of the attribute names that are allowed but does not affect the 
value of the attributes. Maybe we should change the method so that it returns 
BOTH the normalised name (or null if the attribute must be skipped) and the 
corresponding normalised value (e.g. the resolved URL) given a name/value 
couple. The mapper implementation could then manage the resolution of the URLs 
internally. This would also be useful for normalising the names and values of 
elements in the header such as http-equiv.

HtmlParser as an abstract class: what about following Jukka's suggestion for 
Handlers in https://issues.apache.org/jira/browse/TIKA-458 and having a Factory?

As for frames, it raises another issue (see 
https://issues.apache.org/jira/browse/TIKA-457), which is that anything outside 
HEAD and BODY is currently discarded by the HTMLMapper. This is why I 
considered doing TIKA-458, but maybe we could make the HTMLHandler more generic 
and delegate the decisions to the Mappers, e.g. by adding a method isBody(). 

The body level is currently used to: 
a) distinguish the elements in the header
b) determine where characters should be added to the text of the document

Do we really need (a)? Are elements such as LINK, BASE or META found anywhere 
outside the HEAD? Should mapSafeElement() take into account the path of an 
element as well, e.g. to allow a LINK only if it has HEAD for parent?
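The name-and-value idea above (a mapper method that returns both the normalised attribute name and a normalised, e.g. resolved, value) could look roughly like this. This is an illustrative sketch, not the actual HtmlMapper API; all names are made up:

```java
import java.net.URI;

// Sketch of a mapper whose attribute method returns BOTH the normalised
// name and a normalised value, so URL resolution lives in the mapper.
public class NameValueMapper {
    private final URI base;

    public NameValueMapper(String baseUrl) {
        this.base = URI.create(baseUrl);
    }

    // Returns {normalisedName, normalisedValue}, or null to drop the attribute.
    public String[] mapSafeAttribute(String element, String name, String value) {
        if ("a".equals(element) && "href".equals(name)) {
            // Resolve relative URLs against the document base.
            return new String[] { "href", base.resolve(value).toString() };
        }
        if ("a".equals(element) && "rel".equals(name)) {
            return new String[] { "rel", value };
        }
        return null; // attribute skipped
    }

    public static void main(String[] args) {
        NameValueMapper m = new NameValueMapper("http://example.com/dir/");
        String[] mapped = m.mapSafeAttribute("a", "href", "page.html");
        System.out.println(mapped[0] + "=" + mapped[1]); // href=http://example.com/dir/page.html
    }
}
```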




> HtmlParser doesn't extract links from img, map, object, frame, iframe, area, 
> link
> -
>
> Key: TIKA-463
> URL: https://issues.apache.org/jira/browse/TIKA-463
> Project: Tika
>  Issue Type: Bug
>Reporter: Ken Krugler
>Assignee: Ken Krugler
>
> All of the listed HTML elements can have URLs as attributes, and thus we'd 
> want to extract those links, if possible.
> For elements that aren't valid as XHTML 1.0, there might be some challenges 
> in the right way to handle this.
> But if XHTML 1.0 means the union of "transitional and frameset" variants, 
> then all of the above are valid, and thus should be emitted by the parser,




[jira] Updated: (TIKA-460) HTMLHandler misses treatment of A elements

2010-07-08 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated TIKA-460:
---

Attachment: TIKA-460.patch

> HTMLHandler misses treatment of A elements 
> ---
>
> Key: TIKA-460
> URL: https://issues.apache.org/jira/browse/TIKA-460
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 0.7
>    Reporter: Julien Nioche
>    Assignee: Julien Nioche
> Fix For: 0.8
>
> Attachments: TIKA-460.patch
>
>
> The A elements should be processed before any other safe element, otherwise 
> it never happens




[jira] Created: (TIKA-460) HTMLHandler misses treatment of A elements

2010-07-08 Thread Julien Nioche (JIRA)
HTMLHandler misses treatment of A elements 
---

 Key: TIKA-460
 URL: https://issues.apache.org/jira/browse/TIKA-460
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 0.7
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 0.8


The A elements should be processed before any other safe element, otherwise it 
never happens




[jira] Created: (TIKA-458) Specify HTMLHandler via Context

2010-07-07 Thread Julien Nioche (JIRA)
Specify HTMLHandler via Context
---

 Key: TIKA-458
 URL: https://issues.apache.org/jira/browse/TIKA-458
 Project: Tika
  Issue Type: New Feature
  Components: parser
Affects Versions: 0.7
Reporter: Julien Nioche
 Attachments: TIKA-458.patch

One of the recent changes on Tika is the possibility to specify a custom 
HTMLMapper via the Context - which I think is an elegant mechanism. I was 
wondering whether there would be a reason NOT to be able to do the same for the 
HTMLHandler and if nothing is passed via the Context, rely on the current 
implementation. This would give more control to the user on what to do with the 
SAX events while at the same time preserving the functionality by default.
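The fallback lookup this proposal relies on (use the handler registered in the Context, otherwise fall back to the current implementation) is the usual type-keyed context pattern. Here is a minimal self-contained sketch of that shape, with all names illustrative (it mirrors, but is not, Tika's ParseContext):

```java
import java.util.HashMap;
import java.util.Map;

// Minimal type-keyed context: get() falls back to a default instance
// when nothing was registered for the key.
public class MiniContext {
    private final Map<Class<?>, Object> objects = new HashMap<>();

    public <T> void set(Class<T> key, T value) {
        objects.put(key, value);
    }

    public <T> T get(Class<T> key, T defaultValue) {
        T value = key.cast(objects.get(key));
        return value != null ? value : defaultValue;
    }

    public static void main(String[] args) {
        MiniContext ctx = new MiniContext();
        // Nothing registered yet: the built-in default is returned.
        System.out.println(ctx.get(String.class, "default handler")); // default handler
        ctx.set(String.class, "custom handler");
        System.out.println(ctx.get(String.class, "default handler")); // custom handler
    }
}
```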




[jira] Created: (TIKA-457) HTMLParser gets an early </body> event

2010-07-07 Thread Julien Nioche (JIRA)
HTMLParser gets an early </body> event
--

 Key: TIKA-457
 URL: https://issues.apache.org/jira/browse/TIKA-457
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Julien Nioche


I am using the IdentityMapper in the HtmlParser with this simple document:

{code}
<html>
<head>
<title> my title </title>
</head>
<frameset>
...
</frameset>
</html>
{code}

Strangely, the HTMLHandler is getting a call to endElement on the body *BEFORE* 
we reach the frameset. As a result the variable bodyLevel is decremented back 
to 0 and the remaining elements are ignored due to the logic implemented in 
HTMLHandler.

Any idea?







[jira] Updated: (TIKA-458) Specify HTMLHandler via Context

2010-07-07 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated TIKA-458:
---

Attachment: TIKA-458.patch

> Specify HTMLHandler via Context
> ---
>
> Key: TIKA-458
> URL: https://issues.apache.org/jira/browse/TIKA-458
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Affects Versions: 0.7
>    Reporter: Julien Nioche
> Attachments: TIKA-458.patch
>
>
> One of the recent changes on Tika is the possibility to specify a custom 
> HTMLMapper via the Context - which I think is an elegant mechanism. I was 
> wondering whether there would be a reason NOT to be able to do the same for 
> the HTMLHandler and if nothing is passed via the Context, rely on the current 
> implementation. This would give more control to the user on what to do with 
> the SAX events while at the same time preserving the functionality by default.




Specify HTMLHandler via Context

2010-07-07 Thread Julien Nioche
Hi guys,

One of the recent changes on Tika is the possibility to specify a custom
HTMLMapper via the Context - which I think is an elegant mechanism. I was
wondering whether there would be a reason NOT to be able to do the same for
the HTMLHandler and if nothing is passed via the Context, rely on the
current implementation. This would give more control to the user on what to
do with the SAX events while at the same time preserving the functionality
by default.

Any thoughts on this?

Julien

-- 
DigitalPebble Ltd

Open Source Solutions for Text Engineering
http://www.digitalpebble.com


[jira] Closed: (TIKA-454) Illegal Charset Name crashes HTMLParser

2010-07-05 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche closed TIKA-454.
--

Resolution: Fixed

Committed revision 960487

> Illegal Charset Name crashes HTMLParser
> ---
>
> Key: TIKA-454
> URL: https://issues.apache.org/jira/browse/TIKA-454
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 0.7
>    Reporter: Julien Nioche
>    Assignee: Julien Nioche
> Fix For: 0.8
>
> Attachments: TIKA-454.patch
>
>
> As reported by Andrzej [1], the HTMLParser crashes when the charset found in 
> meta is illegal e.g. 
> 
> [1] 
> http://mail-archives.apache.org/mod_mbox/tika-user/201006.mbox/%3c4c2a102d.7090...@getopt.org%3e




[jira] Assigned: (TIKA-454) Illegal Charset Name crashes HTMLParser

2010-07-02 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche reassigned TIKA-454:
--

Assignee: Julien Nioche

> Illegal Charset Name crashes HTMLParser
> ---
>
> Key: TIKA-454
> URL: https://issues.apache.org/jira/browse/TIKA-454
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 0.7
>    Reporter: Julien Nioche
>    Assignee: Julien Nioche
> Fix For: 0.8
>
> Attachments: TIKA-454.patch
>
>
> As reported by Andrzej [1], the HTMLParser crashes when the charset found in 
> meta is illegal e.g. 
> 
> [1] 
> http://mail-archives.apache.org/mod_mbox/tika-user/201006.mbox/%3c4c2a102d.7090...@getopt.org%3e




[jira] Updated: (TIKA-454) Illegal Charset Name crashes HTMLParser

2010-07-02 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated TIKA-454:
---

Attachment: TIKA-454.patch

Trivial fix: simply catch the exception and let the guesswork begin.
The only drawback is that the content type is not corrected accordingly

{code}
Content-Encoding: ISO-8859-1
Content-Length: 110
Content-Type: text/html; charset=ISO 8859-1
{code}

but at least it no longer crashes the parser
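The catch-and-fall-back approach can be sketched with the standard library alone (this is a sketch of the idea, not the committed patch): an invalid name such as "ISO 8859-1" makes Charset.forName() throw, and we return a fallback instead of letting the exception propagate.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.UnsupportedCharsetException;

public class CharsetGuess {
    // Resolve a charset name from a meta tag; fall back to a default
    // instead of crashing on illegal names like "ISO 8859-1".
    static Charset safeCharset(String name, Charset fallback) {
        if (name == null) {
            return fallback;
        }
        try {
            return Charset.forName(name.trim());
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            // Illegal or unknown name: let encoding detection take over.
            return fallback;
        }
    }
}
```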

> Illegal Charset Name crashes HTMLParser
> ---
>
> Key: TIKA-454
> URL: https://issues.apache.org/jira/browse/TIKA-454
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 0.7
>    Reporter: Julien Nioche
> Fix For: 0.8
>
> Attachments: TIKA-454.patch
>
>
> As reported by Andrzej [1], the HTMLParser crashes when the charset found in 
> meta is illegal e.g. 
> 
> [1] 
> http://mail-archives.apache.org/mod_mbox/tika-user/201006.mbox/%3c4c2a102d.7090...@getopt.org%3e




[jira] Created: (TIKA-454) Illegal Charset Name crashes HTMLParser

2010-07-02 Thread Julien Nioche (JIRA)
Illegal Charset Name crashes HTMLParser
---

 Key: TIKA-454
 URL: https://issues.apache.org/jira/browse/TIKA-454
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 0.7
Reporter: Julien Nioche
 Fix For: 0.8


As reported by Andrzej [1], the HTMLParser crashes when the charset found in 
meta is illegal e.g. 



[1] 
http://mail-archives.apache.org/mod_mbox/tika-user/201006.mbox/%3c4c2a102d.7090...@getopt.org%3e




[jira] Commented: (TIKA-448) Tika FLVParser hangs

2010-06-29 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883660#action_12883660
 ] 

Julien Nioche commented on TIKA-448:


I have seen similar cases with FLV when the content fetched by Nutch had been 
trimmed. Setting the log level to debug should give you more information about 
which URL is problematic.
One simple workaround for cases like these (apart from filtering on *.flv of 
course) is to use the skip record options in Hadoop 

{code}
  skipRecordsOptions="-D mapred.skip.attempts.to.start.skipping=2 -D 
mapred.skip.map.max.skip.records=1"
  ./nutch parse $commonOptions $skipRecordsOptions $hdfspath/segments/$SEGMENT
{code}

This will skip the problematic entries after a couple of retries.

Of course, preventing the FLV parser from looping would be even better. I'll see if I 
can reproduce the problem later

> Tika FLVParser hangs
> 
>
> Key: TIKA-448
> URL: https://issues.apache.org/jira/browse/TIKA-448
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 0.7
> Environment: Linux JDK 1.6u13, Nutch 1.1
>Reporter: Jeroen van Vianen
>
> I am crawling a site with Nutch and creating an index using SOLR.
> After happy crawling for a couple of hours, my Nutch Parse phase hangs. A 
> thread dump shows:
> "Thread-12" prio=10 tid=0xb4974000 nid=0x1b1b runnable [0xb4a5]
>java.lang.Thread.State: RUNNABLE
> at java.io.FilterInputStream.skip(FilterInputStream.java:125)
> at org.apache.tika.parser.video.FLVParser.parse(FLVParser.java:246)
> at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:95)
> at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
> at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:85)
> at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:41)
> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
> at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
> The only reason I see why the code might be stuck there is when skip(datalen 
> - skiplen) returns 0 for whatever reason in 
> org.apache.tika.parser.video.FLVParser.parse around line 246:
> // Tag was not metadata, skip over data we cannot handle
> for (int skiplen = 0; skiplen < datalen;) {
> long currentSkipLen = datainput.skip(datalen - skiplen);
> skiplen += currentSkipLen;
> }
> As I don't know which FLV was downloaded that caused the problem I cannot 
> easily create a testcase.
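For reference, a defensive version of such a skip loop can be written against plain java.io (this is a sketch of the general fix, not the change actually committed to FLVParser): when skip() makes no progress, force a single-byte read, and throw on EOF rather than spinning forever.

```java
import java.io.IOException;
import java.io.InputStream;

public class SkipFully {
    // Skip exactly n bytes. InputStream.skip() may legally return 0,
    // so fall back to read() to make progress, and fail on EOF
    // instead of looping indefinitely.
    static void skipFully(InputStream in, long n) throws IOException {
        while (n > 0) {
            long skipped = in.skip(n);
            if (skipped > 0) {
                n -= skipped;
            } else if (in.read() >= 0) {
                n -= 1;                 // skip() stalled; consume one byte
            } else {
                throw new IOException("EOF before skipping requested bytes");
            }
        }
    }
}
```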




Re: svnpubsub for the Tika web site

2010-06-21 Thread Julien Nioche
Same here

+1



On 21 June 2010 18:00, Ken Krugler  wrote:

> Hi Jukka,
>
> I can't think of any cons, so +1
>
> -- Ken
>
>
> On Jun 21, 2010, at 3:02am, Jukka Zitting wrote:
>
>  Hi,
>>
>> The PDFBox web site [1] is now managed using the new svnpubsub
>> mechanism set up by the infra team. Basically, the generated web site
>> is committed to svn along with the site sources, and the svnpubsub
>> magic will automatically publish the latest changes as soon as they've
>> been committed. No more hours waiting for the rsync delay or wondering
>> if the CI build setup works correctly. :-) See PDFBOX-623 [2] for the
>> basic site update process now used by PDFBox.
>>
>> I'd like to set up a similar system also for Tika. We already have a
>> Maven generated site, so it'll be easy to duplicate the setup from
>> PDFBox.
>>
>> WDYT?
>>
>> [1] http://pdfbox.apache.org/
>> [2] https://issues.apache.org/jira/browse/PDFBOX-623
>>
>> BR,
>>
>> Jukka Zitting
>>
>
> 
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> e l a s t i c   w e b   m i n i n g
>
>
>
>
>


-- 
DigitalPebble Ltd

Open Source Solutions for Text Engineering
http://www.digitalpebble.com


Re: Welcome Julien Nioche, new Tika PMC member and committer

2010-06-06 Thread Julien Nioche
Hi,

Thank you for the warm welcome, I feel very honoured to have been made a
Tika committer.

A few lines about myself: I am the director of DigitalPebble, a small
consultancy based in Bristol, UK. I started using Lucene back in 2001, made
a few small contributions to it and started LIMO, an open source web
application used for monitoring Lucene indices. Over the last 3 years I have
used and contributed to quite a few Apache projects such as SOLR, UIMA,
Nutch and Tika. I have also been a Nutch committer since last December and am
actively working on NutchBase, which will (probably) become Nutch 2.0 at
some point.

I have used Tika quite a lot in the last couple of years, notably by writing
a Tika component for Apache UIMA and a parse plugin for Nutch. Most of the
parsing duties have already been delegated to Tika in the forthcoming Nutch
1.1 and this trend will continue in Nutch 2.0. I will probably spend some
time porting the missing parsers from Nutch to Tika (e.g. feeds) and dig a
bit more on any bugs that my Nutch crawls could reveal (e.g. errors in
mime-type magic detection).

My activities at DigitalPebble also cover Natural Language Processing (which
is my initial background) and text analysis, and I recently started an open
source project named Behemoth which makes it possible to scale text analysis
applications using Hadoop.

Best,

Julien Nioche

-- 
DigitalPebble Ltd

Open Source Solutions for Text Engineering
http://www.digitalpebble.com



On 5 June 2010 23:42, Mattmann, Chris A (388J) <
chris.a.mattm...@jpl.nasa.gov> wrote:

> Hi Folks,
>
> In recognition of his contributions to the Tika project, the Tika PMC has
> voted to make Julien Nioche a Tika PMC member and committer, and Julien has
> accepted!
>
> Julien, please feel free to say a few words about yourself, and most
> importantly, welcome aboard!
>
> Cheers,
> Chris
>
> ++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: chris.mattm...@jpl.nasa.gov
> WWW:   http://sunset.usc.edu/~mattmann/
> ++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++
>
>
>


[jira] Commented: (TIKA-433) Tika + Hadoop

2010-05-26 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871623#action_12871623
 ] 

Julien Nioche commented on TIKA-433:


Could do. I can't see a place in Tika's code for non-core contributions / 
sandbox though and am not sure that we want to burden Tika with Hadoop 
dependencies just for the sake of implementing this. My comment was actually 
more about the fact that functionalities such as the one you described *are* 
what Behemoth is all about i.e. processing documents in various ways using 
mapreduce, storing the data in a neutral, stand-off based implementation and 
using that in conjunction with projects such as SOLR or Mahout.
I suppose it also depends on whether Tika's focus should be on its API or 
provide a sandbox as well. WDYT?

> Tika + Hadoop
> -
>
> Key: TIKA-433
> URL: https://issues.apache.org/jira/browse/TIKA-433
> Project: Tika
>  Issue Type: New Feature
>  Components: general
>Reporter: Grant Ingersoll
>Priority: Minor
>
> Would be great to have a Tika contrib that took in an HDFS location with 
> "rich" documents on it and an output format (or output processor) and 
> converted the docs to XHTML or Solr or whatever.  Seems like it should be 
> pretty straightforward to do on the Hadoop side of things.  Only tricky part, 
> I suppose, is the output format and how to make that pluggable.




[jira] Commented: (TIKA-430) Automatically let all valid XHTML 1.0 attributes through from HTML documents

2010-05-26 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871585#action_12871585
 ] 

Julien Nioche commented on TIKA-430:


The method mapSafeAttribute(String elementName, String attributeName) 
introduced in TIKA-379 should make it possible to implement that.
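The idea behind such a mapping can be sketched as a per-element attribute whitelist. This is a hypothetical standalone illustration of the mapSafeAttribute(elementName, attributeName) contract, not Tika's DefaultHtmlMapper code; the element and attribute names below are examples only.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class AttributeMap {
    // Per-element whitelist of attributes to let through (illustrative data).
    static final Map<String, Set<String>> SAFE = new HashMap<String, Set<String>>();
    static {
        SAFE.put("a", new HashSet<String>(Arrays.asList("href", "rel", "name")));
        SAFE.put("img", new HashSet<String>(Arrays.asList("src", "alt")));
    }

    // Return the attribute name if it is allowed on this element, else null
    // (null meaning: drop the attribute from the normalized output).
    static String mapSafeAttribute(String element, String attribute) {
        Set<String> allowed = SAFE.get(element);
        return (allowed != null && allowed.contains(attribute)) ? attribute : null;
    }
}
```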

> Automatically let all valid XHTML 1.0 attributes through from HTML documents
> 
>
> Key: TIKA-430
> URL: https://issues.apache.org/jira/browse/TIKA-430
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Reporter: Ken Krugler
>Assignee: Ken Krugler
>
> Many consumers of parse output wouldn't want to process the raw 
> (unnormalized) elements they'd get with the IdentityHtmlMapper, but they 
> would want to get any standard attributes. For example, with  elements 
> they would get any rel attributes.
> I believe this would require changing the DefaultHtmlMapper to "know" about 
> valid attributes for different elements.




[jira] Commented: (TIKA-433) Tika + Hadoop

2010-05-26 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871544#action_12871544
 ] 

Julien Nioche commented on TIKA-433:


You can do that with [Behemoth|http://code.google.com/p/behemoth-pebble/] as it 
uses Tika on rich documents stored in a SequenceFile. There is an application 
in the Behemoth Sandbox which sends the annotated documents to SOLR and I am 
planning to write one to generate vectors for Mahout. The output format is a 
very straightforward standoff annotation model and should fit most 
applications.

> Tika + Hadoop
> -
>
> Key: TIKA-433
> URL: https://issues.apache.org/jira/browse/TIKA-433
> Project: Tika
>  Issue Type: New Feature
>  Components: general
>Reporter: Grant Ingersoll
>Priority: Minor
>
> Would be great to have a Tika contrib that took in an HDFS location with 
> "rich" documents on it and an output format (or output processor) and 
> converted the docs to XHTML or Solr or whatever.  Seems like it should be 
> pretty straightforward to do on the Hadoop side of things.  Only tricky part, 
> I suppose, is the output format and how to make that pluggable.
