Hi Ingo,
The other thing to realize is that Tika is a “collective” work, and
that its collective work is licensed under the Apache Software Foundation
license, and that the collective work and its dependencies are compatible
either with ALv2, or with category-A, category-B licenses from the Apach
Hi,
generating a list of all licenses is a good idea. The last thing you
want for your product is to discover that the most recent version of a
dependency is AGPL'ed, if you plan to publish under another license.
I have done this some time ago for the Cinnamon CMS:
http://cinnamon-cms.com/de
On Wed, 15 Jul 2015, Nazar Hussain wrote:
@Matt. I am looking for plain text extraction, no css or xpath. I just
want to extract text per page. So I would have an array of plain-text
content in which each index holds the content of a single page.
You won't be able to do it in the plain-text space. You
@Matt. I am looking for plain text extraction, no css or xpath. I just want
to extract text per page. So I would have an array of plain-text content in
which each index holds the content of a single page.
@Nick. I have made progress with the links you shared. My working handler
class is now:
class PageConten
I would add Nutch to the list too, Tim :-)
+1 from me.
—
Chris Mattmann
chris.mattm...@gmail.com
-Original Message-
From: "Allison, Timothy B."
Reply-To:
Date: Wednesday, July 15, 2015 at 4:38 AM
To: "user@tika.apache.org"
Subject: robust Tika and Hadoop
>All,
>
> I’d like to
Also, Nazar, are you talking about e.g., Scrapy style extractions?
If so, Tika has the Content Handler interface. From Java, this is
relatively easy to call, but we don’t really provide a mechanism
from the command line and/or REST server to call arbitrary extractions.
Maybe we should think about d
All,
I'd like to fill out our Wiki a bit more on using Tika robustly within
Hadoop. I'm aware of Behemoth [0], Nanite [1] and Morphlines [2]. I haven't
looked carefully into these packages yet.
Does anyone have any recommendations for specific configurations/design
patterns that will def
On Wed, 15 Jul 2015, Nazar Hussain wrote:
Yes, in the first phase I am targeting PDF and DOC files. Later I will add PPT
and others, but all would be page-based documents.
.doc is not a page-based format; it's a run-based format. There is no page
information in the file format, it's calculated on the f
Yes, in the first phase I am targeting PDF and DOC files. Later I will add PPT and
others, but all would be page-based documents.
I have read in several references on the web that it returns a div per page. Can
anyone help with exact code that works with Tika 1.9?
I have this code written in JRuby
class M
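The per-page idea discussed in this thread can be sketched with a plain SAX handler. This is a minimal sketch, not code from any of the posters: it assumes, as the message above reports, that the parser emits one `<div class="page">` element per page in its XHTML output. The class name `PageTextHandler` and the helper `extractPages` are hypothetical, and the example feeds a hand-written XHTML string through the JDK's own SAX parser rather than through Tika, so it runs without Tika on the classpath.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the text found inside each <div class="page">.
// Assumption: the XHTML event stream wraps every page in such a div.
public class PageTextHandler extends DefaultHandler {
    private final List<String> pages = new ArrayList<>();
    private StringBuilder current; // null while we are outside a page div
    private int depth;             // div nesting depth inside the current page

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        boolean isDiv = "div".equals(localName) || "div".equals(qName);
        if (current == null) {
            // Start a new page buffer when a page div opens.
            if (isDiv && "page".equals(atts.getValue("class"))) {
                current = new StringBuilder();
                depth = 1;
            }
        } else if (isDiv) {
            depth++; // nested div inside a page
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        boolean isDiv = "div".equals(localName) || "div".equals(qName);
        // The page ends when the div that opened it closes.
        if (current != null && isDiv && --depth == 0) {
            pages.add(current.toString().trim());
            current = null;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    public List<String> getPages() {
        return pages;
    }

    // Illustration only: parse an XHTML string with the JDK SAX parser.
    public static List<String> extractPages(String xhtml) throws Exception {
        PageTextHandler handler = new PageTextHandler();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.getPages();
    }

    public static void main(String[] args) throws Exception {
        String sample = "<html><body>"
                + "<div class=\"page\"><p>First page</p></div>"
                + "<div class=\"page\"><p>Second page</p></div>"
                + "</body></html>";
        System.out.println(extractPages(sample));
    }
}
```

Since the class is an ordinary SAX `ContentHandler`, an instance could in principle be passed as the handler argument of a Tika `Parser.parse(...)` call and `getPages()` read afterwards. Whether page divs are present depends on the format: as noted elsewhere in the thread, .doc carries no page information, so this approach would likely only help for page-based formats such as PDF.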
On that page there is an explicit reference to the unrar license
mentioned in first post. It says it may be used provided a notice is given
On 7/15/2015 11:07 AM, Nick Burch wrote:
On Tue, 14 Jul 2015, Chris Harshman wrote:
Personally, I'd conduct a review of each component if license
complianc
On Tue, 14 Jul 2015, Chris Harshman wrote:
Personally, I'd conduct a review of each component if license compliance
is important to you (e.g., if you're going to release a commercial
product incorporating the code).
While Apache tries to ensure the software it produces is "commercially
friend
On Wed, 15 Jul 2015, Nazar Hussain wrote:
The problem I am facing is with pages. I can extract total pages from
document metadata. But I can't find any way to extract content per page
from the document.
What file formats is this for? And how are you calling Tika?
If the file format is page-ba
12 matches