Re: Parser removes file content and treats it as Metadata

2024-01-24 Thread Ken Krugler
lines are also included in the handler contents. Regards, Gerardo From: Ken Krugler <kkrugler_li...@transpac.com> Sent: Saturday, January 20, 2024 11:54 AM To: user@tika.apache.org <user@tika.apache.org> Cc: Mikhail Gushinets <mikhail.gushin...@aparavi.com> Subject: Re: Parser removes file

Re: Parser removes file content and treats it as Metadata

2024-01-20 Thread Ken Krugler
.. (Till the end of the file). > > and the initial text of the file (FROM, TO, DATE, LOCATION) is not included > but registered as metadata: > > > > I would like to know if there is any way to prevent this from happening using > AutoDetectParser so that all the text is included

Re: [DISCUSS] Release planning for 3.x and 2.x's EOL

2023-09-13 Thread Ken Krugler
p Java 11 in "main"/3.x now and set the EOL for Tika 2.x/Java 8 in say > 6 months or fewer? > > Thank you, all, for your feedback! > > Best, > > Tim > > -- Ken Krugler http://www.scaleunlimited.com Custom big data solutions Flink, Pinot, Solr, Elasticsearch

Re: [DISCUSS] Release planning for 3.x and 2.x's EOL

2023-09-12 Thread Ken Krugler
2023 at 10:49 AM Tim Allison <mailto:talli...@apache.org>> wrote: >> >If Tika users will be happy to move on and drop Java 8 and/or javax. Please >> >drop them :))) >> >> Fellow devs and broader Tika community, are we ok with EOL'ing Tika 2.x and >>

Re: Dependencies error in Tika

2022-09-04 Thread Ken Krugler
estRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:35) > at > com.intellij.rt.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:235) > at com.intellij.rt.junit.JUnitStarter.main(JUnitStarter.java:54) > > Mark Kerzner, SHMsoft <http://shmsoft.com/>, > Book a call with me here <

Slack channel report of failing Docker build

2021-09-23 Thread Ken Krugler
idea what should be fixed where, just passing this along :) — Ken ------ Ken Krugler http://www.scaleunlimited.com Custom big data solutions Flink, Pinot, Solr, Elasticsearch

Re: Detecting multiple languages in a long text

2021-02-02 Thread Ken Krugler
In the end I only actually care about the languages; the probabilities I’d > only use to see if it’s even worth mentioning a specific one if it should > return more than one for longer text samples. > > > From: Ken Krugler <mailto:kkrugler_li...@transpac.com>> >

Re: Detecting multiple languages in a long text

2021-02-01 Thread Ken Krugler
ngle language with the full text and my first > French-Greek text sample. > > How do I get the other languages (in my case: French & Greek) as a result too? -- Ken Krugler http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

Re: [VOTE] Release Apache Tika 1.25 Candidate #2

2020-11-25 Thread Ken Krugler
che/tika> > > Please vote on releasing this package as Apache Tika 1.25. > The vote is open for the next 72 hours and passes if a majority of at > least three +1 Tika PMC votes are cast. > > [ ] +1 Release this package as Apache Tika 1.25 > [ ] -1 Do not release this package bec

Re: Why does Tika offer a client-server option?

2020-11-25 Thread Ken Krugler
ks so much, >> Robert >> >> >> > -- > Imixs Software Solutions GmbH > Web: www.imixs.com <http://www.imixs.com/> Phone: +49 (0)89-452136 16 > Office: Agnes-Pockels-Bogen 1, 80992 München > Registergericht: Amtsgericht Muenchen, HRB 136045 > Geschaeftsführer: G

Re: [ANNOUNCE] Apache Tika 1.22 released

2019-08-02 Thread Ken Krugler
ika.apache.org/ > > -- Tim Allison, on behalf of the Apache Tika community -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com Custom big data solutions & training Flink, Solr, Hadoop, Cascading & Cassandra

Re: [VOTE] Release Apache Tika 1.20 Candidate #1

2018-12-21 Thread Ken Krugler
for the next 72 hours and passes if a majority of at > least three +1 Tika PMC votes are cast. > > [ ] +1 Release this package as Apache Tika 1.20 > [ ] -1 Do not release this package because... > > Here's my +1. > > Cheers, > > Tim -- Ken

Re: Does Tika parse QuickBooks files?

2018-07-01 Thread Ken Krugler
get it to work, would it be a useful addition? That would be helpful, thanks! — Ken ---------- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com Custom big data solutions & training Flink, Solr, Hadoop, Cascading & Cassandra

Re: Extract HTML objects using TIKA

2018-05-24 Thread Ken Krugler
Handler is org.xml.sax.ContentHandler. -- Ken > > Thanks. > From: Ken Krugler [mailto:kkrugler_li...@transpac.com > <mailto:kkrugler_li...@transpac.com>] > Sent: Thursday, May 24, 2018 4:09 PM > To: user@tika.apache.org <mailto:user@tika.apache.org> > Subject: Re: Ext

Re: Extract HTML objects using TIKA

2018-05-24 Thread Ken Krugler
omagically do for you. It would be interesting to create such a thing (similar to what we did for Boilerpipe) for use with Tika. E.g. see https://github.com/seagatesoft/sde -- Ken -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com Custom big data solutions & training Flink, Solr, Hadoop, Cascading & Cassandra

Re: Tika detects short Japanese sentences as Chinese

2018-04-05 Thread Ken Krugler
Hi Artur, Is the detector that you get back from getDefaultLanguageDetector the OptimaizeLangDetector? — Ken > On Apr 3, 2018, at 2:55 AM, Artur Rashitov wrote: > > Given the following code: > > val japanese = "私はガラスを食べられます。それは私を傷つけません。" >

Re: HTML parsing, script tags,

2017-06-28 Thread Ken Krugler
braries that try to fix up broken HTML, with varying degrees of success, depending on the way that HTML is broken. — Ken -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

Re: Extracting macros in 1.15

2017-06-05 Thread Ken Krugler
> On Jun 5, 2017, at 10:43am, Allison, Timothy B. wrote: > > Jim, > Thank you, again, for reaching out to us. Now that we have a user who > actually cares about macros, I have some follow up questions, we aren’t > treating js in html as a macro…should we try to do that?

Re: French Language Detection with Tika

2017-05-12 Thread Ken Krugler
so that we could send out the element with the “lang” attribute before emitting the text. If that’s important, though, it wouldn’t be hard to create your own version of the BodyHandler that does this. -- Ken -- Ken Krugler +1 530-210-6378 http://www.scaleunl

Re: French Language Detection with Tika

2017-05-10 Thread Ken Krugler
n Tika, if you are able to refer me to someone or a > reference place in that respect, I'll have a better degree of confidence in > my recommendation > > Best Regards > -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

Re: How to keep all HTML link when doing file content extraction?

2017-02-14 Thread Ken Krugler
job). It calls the Tika parse() method with an org.apache.tika.sax.TeeContentHandler that sends SAX events to the regular content extraction handler, and (typically) the SimpleLinkExtractor class (in the same package). -- Ken -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.c
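The fan-out described above can be sketched with nothing but the JDK's SAX classes. This is only an illustration of the idea, not Tika's TeeContentHandler or the SimpleLinkExtractor mentioned in the message: TeeSketch, Tee, TextCollector, LinkCollector and Result are made-up names, and the sketch assumes well-formed XHTML input fed to the JDK parser rather than a call to Tika's parse().

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical names throughout; a JDK-only illustration of sending
// one stream of SAX events to several handlers at once.
public class TeeSketch {

    /** Forwards element and character events to every wrapped handler. */
    static class Tee extends DefaultHandler {
        private final ContentHandler[] handlers;

        Tee(ContentHandler... handlers) {
            this.handlers = handlers;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts)
                throws SAXException {
            for (ContentHandler h : handlers) {
                h.startElement(uri, localName, qName, atts);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            for (ContentHandler h : handlers) {
                h.endElement(uri, localName, qName);
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            for (ContentHandler h : handlers) {
                h.characters(ch, start, length);
            }
        }
    }

    /** Stands in for the regular content extraction handler: collects text. */
    static class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    /** Stands in for a link extractor: collects href values of <a> elements. */
    static class LinkCollector extends DefaultHandler {
        final List<String> links = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            // The default SAXParserFactory is not namespace-aware, so the tag name is in qName.
            if ("a".equals(qName) && atts.getValue("href") != null) {
                links.add(atts.getValue("href"));
            }
        }
    }

    /** Simple holder for both results of a single parse pass. */
    public static class Result {
        public final String text;
        public final List<String> links;

        Result(String text, List<String> links) {
            this.text = text;
            this.links = links;
        }
    }

    /** Parses the input once and feeds both collectors through the tee. */
    public static Result parse(String xhtml) {
        TextCollector textHandler = new TextCollector();
        LinkCollector linkHandler = new LinkCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), new Tee(textHandler, linkHandler));
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return new Result(textHandler.text.toString(), linkHandler.links);
    }

    public static void main(String[] args) {
        Result r = parse("<html><body><p>See <a href=\"http://example.com/\">this page</a>.</p></body></html>");
        System.out.println(r.text);
        System.out.println(r.links);
    }
}
```

The point of the pattern is that the document is parsed only once while each handler keeps a single responsibility. With Tika on the classpath, the same shape would presumably apply with the real TeeContentHandler wrapping a text handler and a link handler, but that wiring is not shown in the message above.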

Re: script tags in LinkContentHandler

2016-04-05 Thread Ken Krugler
Hi Joe, I was looking at the version of this file in the (git) Tika-2.0 branch, not the (svn) trunk, and that change isn’t yet in 2.0 - my mistake. I’d rolled in Markus’s patch directly to support these other link types, but I wish I’d remembered the old TIKA-503 discussion, as it would have

Re: script tags in LinkContentHandler

2016-04-05 Thread Ken Krugler
Hi Joe, > On Apr 5, 2016, at 12:27pm, Joseph Naegele > wrote: > > Hi all, > > I'm using Nutch for crawling the web, and one of its built-in HTML parsers > uses Tika and its LinkContentHandler. I'm interested in collecting *all* > links on a web page, but I'm

RE: Logging

2016-03-02 Thread Ken Krugler
ers > tika-server > > SLF4J is used by; > tika-batch > tika-core > tika-parsers > tika-translate > > If I do a patch which way should I refactor? My personal preference is to use > SLF4J. > > John -- Ken Krugler +1 530-210-6378 http://www.s

RE: Unable to extract content from chunked portion of large file

2016-02-26 Thread Ken Krugler
get the chunk of 1 MB out of srcBytes > > > when I pass this 1 MB chunk to Tika it is giving me the error. > > > As per the wiki, Tika needs the entire file to extract content. > > this is where I am stuck. I don't want to pass the entire file to Tika. > > correct me if I am

RE: Unable to extract content from chunked portion of large file

2016-02-24 Thread Ken Krugler
eout issues. > > I tried getting a chunk of the file and passing it to Tika. Tika gave me an invalid data > exception. > > I think for Tika we need to pass the entire file at once to extract content. > > Raghu. > > From: Ken Krugler <kkrugler_li...@transpac.com> > Sent:

RE: Jackson & Fat tika-server jar question

2016-02-23 Thread Ken Krugler
che.tika.server and tika-server-all which > is the bloating version with dependencies? > > Cheers, > John > -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

RE: Removing cryptographic JARs from Tika

2016-02-19 Thread Ken Krugler
including > cryptographic JARs, I won't be able to use. > > Thank you!! > > Steve > -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

RE: Removing cryptographic JARs from Tika

2016-02-19 Thread Ken Krugler
ted files? > > I have no need to extract text off encrypted files, but due to Tika including > cryptographic JARs, I won't be able to use. > > Thank you!! > > Steve -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

RE: Extraction table from HTML document in Tika

2015-11-12 Thread Ken Krugler
ovember 12, 2015 6:49:23am PST > To: user@tika.apache.org > Subject: Extraction table from HTML document in Tika > > Hi > > Is there a way to extract tables from a HTML document using Tika? > thanks! > > Benjamin -- Ken Krugler +1 530-210-6378

RE: LanguageIdentifier.isReasonablyCertain is always false

2015-09-03 Thread Ken Krugler
; Tika tika = new Tika(); > > InputStream is = new FileInputStream( fileName ); > String content = tika.parseToString( is ); > > LanguageIdentifier identifier = new LanguageIdentifier( content ); > > System.out.println( identifier.getLanguage() ); > System.out.println( identifier.isReasonablyCertain() ); > -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

RE: Extracting the structure of an HTML Document

2015-08-17 Thread Ken Krugler
sections, what are the titles of these sections etc... Is there a way to do that with Tika? Thanks! Benjamin -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

RE: robust Tika and Hadoop

2015-07-21 Thread Ken Krugler
those. It gets a bit tricky, though, as the UID for content is the URL, but now we'd have multiple sub-docs that we'd want to index separately. From: Ken Krugler [mailto:kkrugler_li...@transpac.com] Sent: Monday, July 20, 2015 7:21 PM To: user@tika.apache.org Subject: RE: robust Tika

RE: robust Tika and Hadoop

2015-07-20 Thread Ken Krugler
://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/ [2] http://blog.cloudera.com/blog/2013/07/morphlines-the-easy-way-to-build-and-integrate-etl-apps-for-apache-hadoop/ -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big

RE: XPath support for attributes

2015-06-03 Thread Ken Krugler
<section id = myID>SECTION CONTENT</section>. Which is the proper content handler for having all the sections? Will the XPath implementation for Tika support expressions with @ATTRIBUTE and [@ATTRIBUTE='value'] ? Thank you Andrea -- Ken Krugler +1 530-210-6378 http

RE: HTML parsing error with a tag inside h1 tag

2014-09-09 Thread Ken Krugler
-- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

Re: How to identify binary content ?

2014-08-08 Thread Ken Krugler
Hi Avi, Just to clarify, are you asking for some way to determine whether a given file (format) will never return any text (other than metadata)? Thanks, -- Ken On Aug 7, 2014, at 11:28pm, Avi Hayun avrah...@gmail.com wrote: Hi, I am crawling my site and am using Tika for binary content

Re: How to index the parsed content effectively

2014-07-02 Thread Ken Krugler
content fragments in a unique Lucene field every time its characters(...) method is called, something I've been planning to experiment with. The feedback will be appreciated Cheers, Sergey -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data

Re: Detecting html file which is utf-16 encoded

2014-06-17 Thread Ken Krugler
is being used for the file. Is there a way to tell a detector what encoding a file is in to aid detection? Thanks George -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

Re: What exception does CharsetDetector.detect() throw?

2014-05-29 Thread Ken Krugler
But it seems to me that the method returns null but does not raise an exception. What exception does the method throw? Thanks in advance. Best Regards, EungJun Yi -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions

Re: Parse out Name:Value pairs

2014-02-11 Thread Ken Krugler
://tika.apache.org/1.4/parser_guide.html for a guide as to how to do all of that Nick -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

Re: Parse out Name:Value pairs

2014-02-10 Thread Ken Krugler
with SAX events? Is it that the file is too big? In any case, I imagine you could get the desired behavior by implementing your own ContentHandler. -- Ken -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading

Re: [VOTE] Apache Tika 1.5 RC2

2014-02-09 Thread Ken Krugler
, Dave -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

Re: language detection in tika ...

2013-12-15 Thread Ken Krugler
help motivate me to at least close out that issue :) Regards, -- Ken -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

Part 2 blog post on extracting text features using Tika

2013-07-21 Thread Ken Krugler
on alternative approaches to text parsing (NLP vs. Solr tokenization). Thanks, -- Ken -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

Blog post on extracting text features using Tika

2013-07-11 Thread Ken Krugler
to share some of what I'd learned over the years in processing text for classification, clustering and other related ML tasks. It undoubtedly has some things that are unclear or even incorrect, so please comment :) Thanks, -- Ken -- Ken Krugler +1 530-210-6378 http