Re: Parser removes file content and treats it as Metadata

2024-01-24 Thread Ken Krugler
lines are also included in the handler contents. Regards, Gerardo From: Ken Krugler <kkrugler_li...@transpac.com> Sent: Saturday, January 20, 2024 11:54 AM To: user@tika.apache.org <user@tika.apache.org> Cc: Mikhail Gushinets <mikhail.gushin...@aparavi.com> Subject: Re: Parser removes file

Re: Parser removes file content and treats it as Metadata

2024-01-20 Thread Ken Krugler
(Till the end of the file). > > and the initial text of the file (FROM, TO, DATE, LOCATION) is not included > but registered as metadata: > > > > I would like to know if there is any way to prevent this from happening using > AutoDetectParser so that all the text is inc

Re: [DISCUSS] Release planning for 3.x and 2.x's EOL

2023-09-13 Thread Ken Krugler
) Keep Java 11 in "main"/3.x now and set the EOL for Tika 2.x/Java 8 in say > 6 months or fewer? > > Thank you, all, for your feedback! > > Best, > > Tim > > -- Ken Krugler http://www.scaleunlimited.com Custom big data solutions Flink, Pinot, Solr, Elasticsearch

Re: [DISCUSS] Release planning for 3.x and 2.x's EOL

2023-09-12 Thread Ken Krugler
e, Sep 12, 2023 at 10:49 AM Tim Allison <mailto:talli...@apache.org>> wrote: >> >If Tika users will be happy to move on and drop Java 8 and/or javax. Please >> >drop them :))) >> >> Fellow devs and broader Tika community, are we ok with EOL'ing Tika

Re: Dependencies error in Tika

2022-09-04 Thread Ken Krugler
deaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:35) > at > com.intellij.rt.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:235) > at com.intellij.rt.junit.JUnitStarter.main(JUnitStarter.java:54) > > Mark Kerzner, SHMsoft <http://shmsoft.com/>, > Book a call with m

Slack channel report of failing Docker build

2021-09-23 Thread Ken Krugler
idea what should be fixed where, just passing this along :) — Ken ------ Ken Krugler http://www.scaleunlimited.com Custom big data solutions Flink, Pinot, Solr, Elasticsearch

Re: Detecting multiple languages in a long text

2021-02-02 Thread Ken Krugler
> In the end I only actually care about the languages, the probabilities I’d > only use to see if it’s even worth mentioning a specific one if it should > return more than one for longer text samples. > > > From: Ken Krugler <mailto:kkrugler_li...@transpac.com>> >

Re: Detecting multiple languages in a long text

2021-02-01 Thread Ken Krugler
a single language with the full text and my first > French-Greek text sample. > > How do I get the other languages (in my case: French & Greek) as a result too? -- Ken Krugler http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

Re: [VOTE] Release Apache Tika 1.25 Candidate #2

2020-11-25 Thread Ken Krugler
a> > > Please vote on releasing this package as Apache Tika 1.25. > The vote is open for the next 72 hours and passes if a majority of at > least three +1 Tika PMC votes are cast. > > [ ] +1 Release this package as Apache Tika 1.25 > [ ] -1 Do not release this package bec

Re: Why does Tika offer a client-server option?

2020-11-25 Thread Ken Krugler
. >> >> Thanks so much, >> Robert >> >> >> > -- > Imixs Software Solutions GmbH > Web: www.imixs.com <http://www.imixs.com/> Phone: +49 (0)89-452136 16 > Office: Agnes-Pockels-Bogen 1, 80992 München > Registergericht: Amtsgericht Muenchen, HRB 136045

Re: [ANNOUNCE] Apache Tika 1.22 released

2019-08-02 Thread Ken Krugler
ika.apache.org/ > > -- Tim Allison, on behalf of the Apache Tika community -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com Custom big data solutions & training Flink, Solr, Hadoop, Cascading & Cassandra

Re: [VOTE] Release Apache Tika 1.20 Candidate #1

2018-12-21 Thread Ken Krugler
for the next 72 hours and passes if a majority of at > least three +1 Tika PMC votes are cast. > > [ ] +1 Release this package as Apache Tika 1.20 > [ ] -1 Do not release this package because... > > Here's my +1. > > Cheers, > > Tim -

Re: Does Tika parse QuickBooks files?

2018-07-01 Thread Ken Krugler
to work, would it be a useful addition? That would be helpful, thanks! — Ken ---------- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com Custom big data solutions & training Flink, Solr, Hadoop, Cascading & Cassandra

Re: Extract HTML objects using TIKA

2018-05-24 Thread Ken Krugler
ed in getting structured data (column names, data types, etc). And > that’s something Tika doesn’t automagically do for you. > > It would be interesting to create such a thing (similar to what we did for > Boilerpipe) for use with Tika. E.g. see https://github.com/s

Re: Extract HTML objects using TIKA

2018-05-24 Thread Ken Krugler
t would be interesting to create such a thing (similar to what we did for Boilerpipe) for use with Tika. E.g. see https://github.com/seagatesoft/sde <https://github.com/seagatesoft/sde> — Ken -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com Custom big data solutions & training Flink, Solr, Hadoop, Cascading & Cassandra

Re: Tika detects short Japanese sentences as Chinese

2018-04-05 Thread Ken Krugler
Hi Artur, Is the detector that you get back from getDefaultLanguageDetector the OptimaizeLangDetector? — Ken > On Apr 3, 2018, at 2:55 AM, Artur Rashitov wrote: > > Given the following code: > > val japanese = "私はガラスを食べられます。それは私を傷つけません。" > LanguageDetector.getDefaultLanguageDetector.loadMod

Re: HTML parsing, script tags,

2017-06-28 Thread Ken Krugler
ken HTML, with varying degrees of success, depending on the way that HTML is broken. — Ken -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

Re: Extracting macros in 1.15

2017-06-05 Thread Ken Krugler
> On Jun 5, 2017, at 10:43am, Allison, Timothy B. wrote: > > Jim, > Thank you, again, for reaching out to us. Now that we have a user who > actually cares about macros, I have some follow up questions, we aren’t > treating js in html as a macro…should we try to do that? Are there other >

Re: French Language Detection with Tika

2017-05-12 Thread Ken Krugler
element with the “lang” = attribute before emitting the text. If that’s important, though, it wouldn’t be hard to create your own version of the BodyHandler that does this. — Ken -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

Re: French Language Detection with Tika

2017-05-10 Thread Ken Krugler
ne or a > reference place in that respect, I'll have a better degree of confidence in > my recommendation > > Best Regards > -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

Re: How to keep all HTML link when doing file content extraction?

2017-02-14 Thread Ken Krugler
apache.tika.sax.TeeContentHandler that sends SAX events to the regular content extraction handler, and (typically) the SimpleLinkExtractor class (in the same package). — Ken -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions &
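The fan-out described in the snippet above (one parse, with the same SAX events delivered to both a text handler and a link collector) can be sketched with plain JDK SAX classes. This is only an illustration under stated assumptions: `TeeSketch`, `Tee`, `TextCollector` and `LinkCollector` are hypothetical names invented here, they are not Tika classes, and the input is assumed to be well-formed XHTML rather than real-world HTML.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for the tee idea; not Tika's own implementation.
class TeeSketch {

    /** Forwards each SAX event it receives to every wrapped handler. */
    static class Tee extends DefaultHandler {
        private final DefaultHandler[] targets;

        Tee(DefaultHandler... targets) {
            this.targets = targets;
        }

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts)
                throws SAXException {
            for (DefaultHandler h : targets) h.startElement(uri, local, qName, atts);
        }

        @Override
        public void endElement(String uri, String local, String qName) throws SAXException {
            for (DefaultHandler h : targets) h.endElement(uri, local, qName);
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            for (DefaultHandler h : targets) h.characters(ch, start, length);
        }
    }

    /** Collects character data, i.e. the "regular content extraction" side. */
    static class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    /** Collects href values of <a> elements, i.e. the link extraction side. */
    static class LinkCollector extends DefaultHandler {
        final List<String> links = new ArrayList<>();

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            if ("a".equals(qName)) {
                String href = atts.getValue("href");
                if (href != null) links.add(href);
            }
        }
    }

    /** Parses the document once and sends its events to all given handlers. */
    static void parse(String xhtml, DefaultHandler... handlers) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), new Tee(handlers));
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    public static void main(String[] args) {
        TextCollector text = new TextCollector();
        LinkCollector links = new LinkCollector();
        parse("<p>See <a href=\"http://example.com/\">this page</a>.</p>", text, links);
        System.out.println(text.text);
        System.out.println(links.links);
    }
}
```

With Tika itself, the same shape would presumably be obtained by passing the two real handlers to the TeeContentHandler named in the message and handing that to the parser; the sketch only shows why a single parse is enough to keep both the text and the links.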

Re: script tags in LinkContentHandler

2016-04-06 Thread Ken Krugler
Hi Joe, In that case, I’d file a Jira issue with two test docs attached, one with a regular

Re: script tags in LinkContentHandler

2016-04-06 Thread Ken Krugler
> On Apr 6, 2016, at 1:33pm, Allison, Timothy B. wrote: > > On #2, I'd prefer not skipping elements. I definitely understand the use > case to extract what a human can see, but I suspect if your email address > ends in 'forensics.com', you'd probably like to see everything as well. I’m not s

Re: script tags in LinkContentHandler

2016-04-05 Thread Ken Krugler
Hi Joe, I was looking at the version of this file in the (git) Tika-2.0 branch, not the (svn) trunk, and that change isn’t yet in 2.0 - my mistake. I’d rolled in Markus’s patch directly to support these other link types, but I wish I’d remembered the old TIKA-503 discussion, as it would have be

Re: script tags in LinkContentHandler

2016-04-05 Thread Ken Krugler
Hi Joe, > On Apr 5, 2016, at 12:27pm, Joseph Naegele > wrote: > > Hi all, > > I'm using Nutch for crawling the web, and one of its built-in HTML parsers > uses Tika and its LinkContentHandler. I'm interested in collecting *all* > links on a web page, but I'm surprised the LinkContentHandler

RE: Logging

2016-03-02 Thread Ken Krugler
rsers > tika-server > > SLF4J is used by; > tika-batch > tika-core > tika-parsers > tika-translate > > If I do a patch which way should I refactor? My personal preference is to use > SLF4J. > > John -- Ken Krugler +1 530-210-6378 http://

RE: Unable to extract content from chunked portion of large file

2016-02-26 Thread Ken Krugler
(filePath); > > > get the chunk of 1 MB out of srcBytes > > > when I pass this 1 MB chunk to Tika it is giving me the error. > > > As per the wiki, Tika needs the entire file to extract content. > > this is where I am stuck. I don't want to pass entire file to

RE: Unable to extract content from chunked portion of large file

2016-02-24 Thread Ken Krugler
> @POST >@Consumes("multipart/form-data") >@Produces("text/plain") >@Path("form") >public StreamingOutput getTextFromMultipart(Attachment att, @Context final > UriInfo info) { >return produceText(att.getObject(InputStream.class

RE: Unable to extract content from chunked portion of large file

2016-02-24 Thread Ken Krugler
g timeout issues. > > I tried getting a chunk of the file and passing it to Tika. Tika gave me an invalid data > exception. > > I think for Tika we need to pass the entire file at once to extract content. > > Raghu. > > From: Ken Krugler > Sent: Friday, February 19, 2016 8:22 P

RE: Jackson & Fat tika-server jar question

2016-02-23 Thread Ken Krugler
er with just org.apache.tika.server and tika-server-all which > is the bloating version with dependencies? > > Cheers, > John > -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

RE: Removing cryptographic JARs from Tika

2016-02-19 Thread Ken Krugler
luding > cryptographic JARs, I won't be able to use. > > Thank you!! > > Steve > -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

RE: Removing cryptographic JARs from Tika

2016-02-19 Thread Ken Krugler
its full capacity less encrypted files? > > I have no need to extract text off encrypted files, but due to Tika including > cryptographic JARs, I won't be able to use. > > Thank you!! > > Steve -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

RE: Unable to extract content from chunked portion of large file

2016-02-19 Thread Ken Krugler
> we are using Tika Server(REST api) in our .net application. > > please suggest us better approach for this scenario. > > Regards, > Raghu. > > -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop,

RE: Extraction table from HTML document in Tika

2015-11-12 Thread Ken Krugler
> Sent: November 12, 2015 6:49:23am PST > To: user@tika.apache.org > Subject: Extraction table from HTML document in Tika > > Hi > > Is there a way to extract tables from a HTML document using Tika? > thanks! > > Benjamin -- Ken Krugler +1 53

RE: LanguageIdentifier.isReasonablyCertain is always false

2015-09-03 Thread Ken Krugler
> > Tika tika = new Tika(); > > InputStream is = new FileInputStream( fileName ); > String content = tika.parseToString( is ); > > LanguageIdentifier identifier = new LanguageIdentifier( content ); > > System.out.println( identifier.getLanguage() ); > System.out.println( identifier.isReasonablyCertain() ); > -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

RE: Extracting the structure of an HTML Document

2015-08-17 Thread Ken Krugler
get the structure of the documents : what are the > different sections, what are the titles of these sections etc... > > Is there a way to do that with Tika? > > Thanks! > > Benjamin -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

RE: robust Tika and Hadoop

2015-07-21 Thread Ken Krugler
ing archive files. But that's a good point, with current versions of Tika we could now more easily handle those. It gets a bit tricky, though, as the UID for content is the URL, but now we'd have multiple sub-docs that we'd want to index separately. > From: Ken Krugler [mailto

RE: robust Tika and Hadoop

2015-07-20 Thread Ken Krugler
Tim > > > [0] https://github.com/DigitalPebble/behemoth > [1] > http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/ > [2] > http://blog.cloudera.com/blog/2013/07/morphlines-the-easy-way-to-build-and-integrate-etl-apps-f

RE: [VOTE] Release Apache Tika 1.9 Candidate #2

2015-06-09 Thread Ken Krugler
++ > Adjunct Associate Professor, Computer Science Department > University of Southern California, Los Angeles, CA 90089 USA > ++ -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

RE: XPath support for attributes

2015-06-03 Thread Ken Krugler
ML for the content handler, with for example > SECTION CONTENT. Which is the proper content > handler for having all the sections? Will the XPath implementation for Tika > support expressions with @ATTRIBUTE and [@ATTRIBUTE='value'] ? > > Thank you

RE: getLanguage returns "lt" if pdf-file contains only images

2014-12-18 Thread Ken Krugler
n‘t think this is supposed to work > that way. > > Is there any way to get a value that indicates the probability of the > detected language or another way to get a proper (in this case no) language? > Regards Sven > -- Ken Krugler +1 530-210-6378

RE: HTML parsing error with tag inside tag

2014-09-09 Thread Ken Krugler
; > > > > > >-Original Message- > >From: Devaraja Swami > >Reply-To: > >Date: Monday, September 8, 2014 7:12 PM > >To: > >Subject: HTML parsing error with tag inside tag > > > >>In the following HTML document, the is inside the

RE: Tika versions compatibility

2014-09-02 Thread Ken Krugler
> From: Baldwin, David > Sent: September 2, 2014 5:39:17pm PDT > To: user@tika.apache.org > Subject: RE: Tika versions compatibility > > Did you mean a different version number of Lucene other than 4.1? i.e. the > lucene.apache.org says 4.9 came out on 25 June 2014 Uwe said 4.10, not 4.1 -- Ke

Re: How to identify binary content ?

2014-08-08 Thread Ken Krugler
Hi Avi, Just to clarify, are you asking for some way to determine whether a given file (format) will never return any text (other than metadata)? Thanks, -- Ken On Aug 7, 2014, at 11:28pm, Avi Hayun wrote: > Hi, > > I am crawling my site and am using Tika for binary content parsing. > > Bu

Re: How to index the parsed content effectively

2014-07-02 Thread Ken Krugler
> Perhaps a custom ContentHandler can index content fragments in a unique > Lucene field every time its characters(...) method is called, something > I've been planning to experiment with. > > The feedback will be appreciated > Cheers, Sergey -- Ke

Re: Detecting html file which is urf-16 encoded

2014-06-17 Thread Ken Krugler
r the utf-16 file is still not recognised as html, despite > the text having multiple matches. It seems that the detect method does not > realise what encoding is being used for the file. Is there a way to tell a > detector what encoding a file is in to aid detection? > > Thanks >

Re: What exception does CharsetDetector.detect() throw?

2014-05-29 Thread Ken Krugler
t text has been provided > * > > But it seems to me that the method returns null but does not raise an > exception. What exception does the method throw? > > Thanks in advance. > > Best Regards, > EungJun Yi > -- Ken Krugler +1 53

Re: Parse out Name:Value pairs

2014-02-11 Thread Ken Krugler
> parsing ability and SAX event firing to make life easier. > > Sounds like you'll want to define / identify a suitable mimetype for these, > add some mime magic so they get detected, then write your own parser that > spots these name/value pairs and emits suitable SAX events for you to consume > > See http://tika.apache.org/1.4/parser_guide.html for a guide as to how to do > all of that > > Nick > -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

Re: Parse out Name:Value pairs

2014-02-10 Thread Ken Krugler
xt, then why do you want to deal with SAX events? Is it that the file is too big? In any case, I imagine you could get the desired behavior by implementing your own ContentHandler. -- Ken -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

Re: [VOTE] Apache Tika 1.5 RC2

2014-02-09 Thread Ken Krugler
Do not release this package because... > > Here is my +1 for the release. > > Cheers, > Dave -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

Re: language detection in tika ...

2013-12-15 Thread Ken Krugler
o put a release stake in the ground, it would help motivate me to at least close out that issue :) Regards, -- Ken -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

Re: Script element not reported in custom handler

2013-10-10 Thread Ken Krugler
> The confusing thing is that I am able to get it in my handler when adding >> the script element to TagSoup inside HtmlParser's constructor: >>HTML_SCHEMA.elementType("script", HTMLSchema.M_EMPTY, 65535, 0); >> >> Without this, script and its charact

Part 2 blog post on extracting text features using Tika

2013-07-21 Thread Ken Krugler
on on alternative approaches to text parsing (NLP vs. Solr tokenization). Thanks, -- Ken ------ Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

Blog post on extracting text features using Tika

2013-07-11 Thread Ken Krugler
share some of what I'd learned over the years in processing text for classification, clustering and other related ML tasks. It undoubtedly has some things that are unclear or even incorrect, so please comment :) Thanks, -- Ken ------ Ken Krugler +1 530-210-6378