Re: Per Page Document Content

2015-07-15 Thread Nick Burch
On Wed, 15 Jul 2015, Nazar Hussain wrote: The problem I am facing is with pages. I can extract total pages from document metadata. But I can't find any way to extract content per page from the document. What file formats is this for? And how are you calling Tika? If the file format is page-ba

Re: Licensing of Tika

2015-07-15 Thread Nick Burch
On Tue, 14 Jul 2015, Chris Harshman wrote: Personally, I'd conduct a review of each component if license compliance is important to you (e.g., if you're going to release a commercial product incorporating the code). While Apache tries to ensure the software it produces is "commercially friend

Re: Licensing of Tika

2015-07-15 Thread lgilardon...@gmail.com
On that page there is an explicit reference to the unrar license mentioned in the first post. It says it may be used provided a notice is given. On 7/15/2015 11:07 AM, Nick Burch wrote: On Tue, 14 Jul 2015, Chris Harshman wrote: Personally, I'd conduct a review of each component if license complianc

Re: Per Page Document Content

2015-07-15 Thread Nazar Hussain
Yes, in the first phase I am targeting PDF and DOC files. Later I will use PPT and others, but all would be page-based documents. I have read in different references on the web that it returns a div per page. Can anyone help with exact code that works with Tika 1.9? I have this code written in JRuby class M

Re: Per Page Document Content

2015-07-15 Thread Nick Burch
On Wed, 15 Jul 2015, Nazar Hussain wrote: Yes, in the first phase I am targeting PDF and DOC files. Later I will use PPT and others, but all would be page-based documents. .doc is not a page-based format; it's a run-based format. There is no page information in the file format, it's calculated on the f

robust Tika and Hadoop

2015-07-15 Thread Allison, Timothy B.
All, I'd like to fill out our Wiki a bit more on using Tika robustly within Hadoop. I'm aware of Behemoth [0], Nanite [1] and Morphlines [2]. I haven't looked carefully into these packages yet. Does anyone have any recommendations for specific configurations/design patterns that will def

Re: Per Page Document Content

2015-07-15 Thread Mattmann, Chris A (3980)
Also, Nazar, are you talking about e.g., Scrapy style extractions? If so, Tika has the Content Handler interface. From Java, this is relatively easy to call, but we don’t really provide a mechanism from the command line and/or REST server to call arbitrary extractions. Maybe we should think about d

Re: robust Tika and Hadoop

2015-07-15 Thread Chris Mattmann
I would add Nutch to the list too, Tim :-) +1 from me. — Chris Mattmann chris.mattm...@gmail.com -Original Message- From: "Allison, Timothy B." Reply-To: Date: Wednesday, July 15, 2015 at 4:38 AM To: "user@tika.apache.org" Subject: robust Tika and Hadoop >All, > > I’d like to

Re: Per Page Document Content

2015-07-15 Thread Nazar Hussain
@Matt. I am looking for plain text extraction, no CSS or XPath. I just want to extract text per page, so I would have an array of plain text content in which each index holds the content of a single page. @Nick. I have made progress with the links you shared. Now my working handler class is: class PageConten

Re: Per Page Document Content

2015-07-15 Thread Nick Burch
On Wed, 15 Jul 2015, Nazar Hussain wrote: @Matt. I am looking for plain text extraction, no CSS or XPath. I just want to extract text per page, so I would have an array of plain text content in which each index holds the content of a single page. You won't be able to do it in the plain-text space. You
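
A minimal sketch of the per-page idea discussed in this thread, assuming the parser's XHTML output wraps each page in a div with class "page" (as the thread reports for page-based formats such as PDF; it would not apply to .doc, which carries no page information). The class name PageTextHandler and the helper splitPages are hypothetical and are not part of Tika. The sketch uses only the JDK's SAX classes, so it can be tried on any XHTML string.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the text found inside each <div class="page">.
// Assumption: the XHTML being parsed marks page boundaries this way.
class PageTextHandler extends DefaultHandler {
    private final List<String> pages = new ArrayList<>();
    private StringBuilder current; // null while outside a page div
    private int depth;             // div nesting depth inside the current page

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        String name = localName.isEmpty() ? qName : localName;
        if (!"div".equals(name)) {
            return;
        }
        if (current != null) {
            depth++; // a nested div inside the current page
        } else if ("page".equals(atts.getValue("class"))) {
            current = new StringBuilder(); // a new page starts here
            depth = 1;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String name = localName.isEmpty() ? qName : localName;
        if (current != null && "div".equals(name) && --depth == 0) {
            pages.add(current.toString().trim()); // the page div has closed
            current = null;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    List<String> getPages() {
        return pages;
    }

    // Convenience for trying the handler on an XHTML string with the JDK parser.
    static List<String> splitPages(String xhtml) throws Exception {
        PageTextHandler handler = new PageTextHandler();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.getPages();
    }
}
```

Since the handler is an ordinary SAX ContentHandler, it could in principle be passed to a Tika parser in place of a body-text handler, with getPages() read after parsing. Whether a given format actually emits the page divs needs to be checked against that parser's output.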

Re: Licensing of Tika

2015-07-15 Thread Ingo Wiarda
Hi, generating a list of all licenses is a good idea. The last thing you want for your product is to discover that the most recent version of a dependency is AGPL'ed, if you plan to publish under another license. I have done this some time ago for the Cinnamon CMS: http://cinnamon-cms.com/de

Re: Licensing of Tika

2015-07-15 Thread Mattmann, Chris A (3980)
Hi Ingo, The other thing to realize is that Tika is a “collective” work, and that its collective work is licensed under the Apache Software Foundation license, and that the collective work and its dependencies are compatible either with ALv2, or with category-A, category-B licenses from the Apach