Fwd: Error when processing doap file http://svn.apache.org/repos/asf/tika/site/src/site/resources/doap.rdf:

2021-12-25 Thread Dave Fisher
FYI-

Sent from my iPhone

Begin forwarded message:

> From: Projects 
> Date: December 25, 2021 at 6:01:14 PM PST
> To: Site Development 
> Subject: Error when processing doap file 
> http://svn.apache.org/repos/asf/tika/site/src/site/resources/doap.rdf:
> Reply-To: site-...@apache.org
> 
> URL: http://svn.apache.org/repos/asf/tika/site/src/site/resources/doap.rdf
> mismatched tag: line 364, column 4
> Source: 
> https://svn.apache.org/repos/asf/comdev/projects.apache.org/trunk/data/projects.xml


Re: 2.2.0 JARs not pushed to Maven Central

2021-12-17 Thread Dave Fisher
Get on #afinfra in slack and look at the scroll back.

> On Dec 17, 2021, at 2:59 PM, lewis john mcgibbney  wrote:
> 
> I’ve been waiting for the Maven Central repository to be updated with the 2.2.0
> jars…
> I checked repository.apache.org and they are NOT staged, which I assume
> means that the staging repository has been closed, which should have
> triggered the release to Maven Central.
> Anyone know what’s going on?
> lewismc
> -- 
> http://home.apache.org/~lewismc/
> http://people.apache.org/keys/committer/lewismc



Re: Log4j 2.16.0 a more complete fix to Log4Shell

2021-12-13 Thread Dave Fisher
You’ll need to evaluate that yourself.

Sent from my iPhone

> On Dec 13, 2021, at 4:56 PM, Tim Allison  wrote:
> 
> Do we have to do a respin of the release candidate or is this marginally 
> better?
> 
>> On Mon, Dec 13, 2021 at 7:43 PM Dave Fisher  wrote:
>> 
>> https://lists.apache.org/thread/d6v4r6nosxysyq9rvnr779336yf0woz4



Log4j 2.16.0 a more complete fix to Log4Shell

2021-12-13 Thread Dave Fisher
https://lists.apache.org/thread/d6v4r6nosxysyq9rvnr779336yf0woz4


[jira] [Commented] (TIKA-3544) Extraction of long sequences of digits from Excel spreadsheets using Tika 1.20 doesn’t yield the expected results

2021-09-08 Thread Dave Fisher (Jira)


[ https://issues.apache.org/jira/browse/TIKA-3544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17412084#comment-17412084 ]

Dave Fisher commented on TIKA-3544:
---

The OP's source [https://getcreditcardnumbers.com/] produces invalid
numbers. In JSON and JavaScript, numbers are always double-precision
floating point.

See [https://www.w3schools.com/js/js_numbers.asp]
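The 53-bit significand limit behind this can be sketched in plain Java (illustrative only, not Tika code):

```java
public class DoublePrecisionDemo {
    public static void main(String[] args) {
        // 2^53 is the largest power of two below which a double can
        // represent every integer exactly; 2^53 + 1 rounds back down.
        long maxExact = 9007199254740992L;          // 2^53
        double rounded = (double) (maxExact + 1L);  // the +1 is lost
        System.out.println((long) rounded == maxExact);  // true

        // Double.toString also switches to scientific notation at 10^7,
        // which is why long digit runs surface as "...E15".
        System.out.println(Double.toString(6011799905775830d));
    }
}
```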

 

> Extraction of long sequences of digits from Excel spreadsheets using Tika 
> 1.20 doesn’t yield the expected results
> -
>
> Key: TIKA-3544
> URL: https://issues.apache.org/jira/browse/TIKA-3544
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.20
>Reporter: Jitin Jindal
>Priority: Major
> Attachments: Credit Card Numbers.xlsx
>
>
> If an Excel spreadsheet contains a long sequence of digits, such as a credit 
> card number, Tika 1.13 will emit the said sequence in scientific notation.
> For example, the credit card number “6011799905775830” is extracted from the 
> attached spreadsheet as 6.480195344642784E15, which clearly is not the 
> desired output.
> I think the impact of this issue is significant. There’s plenty of 
> information that can no longer be reliably extracted from spreadsheets. Think 
> credit card numbers, telephone numbers and product identifiers to name a few.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (TIKA-3544) Extraction of long sequences of digits from Excel spreadsheets using Tika 1.20 doesn’t yield the expected results

2021-09-08 Thread Dave Fisher (Jira)


[ https://issues.apache.org/jira/browse/TIKA-3544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17412035#comment-17412035 ]

Dave Fisher commented on TIKA-3544:
---

See [https://en.wikipedia.org/wiki/Double-precision_floating-point_format]

A double can only keep between 15 and 17 significant digits of precision. I
think you have to leave things at 15 digits or do a more precise analysis,
which would be slower.

There is a reason floating point comes with an error term called epsilon.

Credit card numbers are strings of numeric characters. Use strings, just like
you have to for US ZIP codes because of the leading '0'.
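A minimal Java illustration of the point (the 16-digit run is the one quoted in this issue; the ZIP code is just an example):

```java
public class DigitStringDemo {
    public static void main(String[] args) {
        // A ZIP code with a leading zero only survives as a string.
        String zip = "02138";
        System.out.println(Integer.parseInt(zip));  // prints 2138

        // The digit run from this issue round-trips exactly as text,
        // but routed through a double it is re-rendered in E-notation.
        String card = "6011799905775830";
        System.out.println(card);
        System.out.println(Double.parseDouble(card));
    }
}
```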



[jira] [Commented] (TIKA-2939) Figure out how to allow OCR'ing of large PDFs via tika-server

2020-10-23 Thread Dave Fisher (Jira)


[ https://issues.apache.org/jira/browse/TIKA-2939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17219857#comment-17219857 ]

Dave Fisher commented on TIKA-2939:
---

On your client side, the PDFBox command-line tools might help you with the
split:

https://pdfbox.apache.org/2.0/commandline.html#pdfsplit

> Figure out how to allow OCR'ing of large PDFs via tika-server
> -
>
> Key: TIKA-2939
> URL: https://issues.apache.org/jira/browse/TIKA-2939
> Project: Tika
>  Issue Type: Improvement
>  Components: server
>Reporter: Tim Allison
>Priority: Minor
>
> Tesseract can take quite a bit of time on large PDFs, which can lead to 
> timeouts in jax-rs and the connection closing:
> {noformat}
> Caused by: com.ctc.wstx.exc.WstxIOException: Closed
> at com.ctc.wstx.sw.BaseStreamWriter.flush(BaseStreamWriter.java:262)
> at 
> org.apache.cxf.jaxrs.interceptor.JAXRSDefaultFaultOutInterceptor.handleMessage(JAXRSDefaultFaultOutInterceptor.java:104)
> Caused by: org.eclipse.jetty.io.EofException: Closed
> at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:491)
> at 
> org.apache.cxf.transport.http_jetty.JettyHTTPDestination$JettyOutputStream.write(JettyHTTPDestination.java:322)
> at 
> org.apache.cxf.io.AbstractWrappedOutputStream.write(AbstractWrappedOutputStream.java:51)
> at 
> com.ctc.wstx.sw.EncodingXmlWriter.flushBuffer(EncodingXmlWriter.java:742)
> at com.ctc.wstx.sw.EncodingXmlWriter.flush(EncodingXmlWriter.java:176)
> at com.ctc.wstx.sw.BaseStreamWriter.flush(BaseStreamWriter.java:260)
> {noformat}
> I tried expanding the timeouts on the client side: 
> {noformat}
> RequestConfig config = RequestConfig.custom()
>     .setConnectTimeout(TIMEOUT * 1000)
>     .setConnectionRequestTimeout(TIMEOUT * 1000)
>     .setSocketTimeout(TIMEOUT * 1000)
>     .build();
> {noformat}
> But this doesn't solve the problem.
> Can we increase the timeout on the server side, and if so, how? Is there a
> maximum?
> If we can't fix the problem with timeouts, we should figure out a way to let 
> people select only a few pages for OCR so that clients can iterate through 
> large PDFs.
> This issue is different from TIKA-1871 in that the problem isn't chunking the 
> large document to get the file to tika-server; rather the problem is the 
> amount of time it can take tika-server to run OCR on every page of a large 
> PDF and return the full results.





Re: Tika lib is huge.. why?

2020-09-26 Thread Dave Fisher
IIRC - if you know you only want PDF extraction, then take a look at Apache
PDFBox: https://pdfbox.apache.org

Sent from my iPhone

> On Sep 26, 2020, at 10:00 AM, Laurence Vanhelsuwe  
> wrote:
> 
> Thanks for the explanation.
> 
> I understand the approach… but in my particular use case, I cannot
> reasonably justify inflating my application size from 7 MB to 77 MB just to
> add functionality amounting to less than 1% of all functionality.
> 
> I guess there’s no way to surgically extract just the PDF metadata parsing
> functionality from Tika?
> 
> Laurence
> 
>> On 26 Sep 2020, at 18:04, Keith Bennett  wrote:
>> 
>> Tika coordinates the use of many external-to-Tika parser libraries, each
>> with their own dependencies, for parsing many types of files. These parser
>> libraries are bundled into the tika-app jar file for your convenience. I
>> believe it's these libraries that make up the bulk of the download. For
>> example, if you unzip the jar file and inspect the contents, you can see
>> that just one of these parsers, poi, consists of 24 MB:
>> 
>> % cd org/apache/poi
>> 
>> % du -sh .
>> 24M .
>> 
>> - Keith
>> 
>>> On Sat, Sep 26, 2020 at 7:54 AM Laurence Vanhelsuwe 
>>> 
>>> wrote:
>>> 
>>> I found Tika during a quest to extract PDF metadata in Java. Did I screw
>>> up the JAR download, or is Tika really 70 MB?
>>> 
>>> Kind regards,
>>> 
>>> Laurence
>>> 
>>> 
> 



Re: HTML to PDF conversion

2019-10-16 Thread Dave Fisher
Hi -

You may want to take a look at Apache FOP, which is part of the Apache XML
Graphics project. My team had success with it generating PDFs from XML.
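For reference, FOP renders PDF from XSL-FO input; a minimal hello-world FO file (page dimensions and master name here are just an example) looks like:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<fo:root xmlns:fo="http://www.w3.org/1999/XSL/Format">
  <fo:layout-master-set>
    <fo:simple-page-master master-name="page"
                           page-height="11in" page-width="8.5in" margin="1in">
      <fo:region-body/>
    </fo:simple-page-master>
  </fo:layout-master-set>
  <fo:page-sequence master-reference="page">
    <fo:flow flow-name="xsl-region-body">
      <fo:block font-size="12pt">Hello from Apache FOP</fo:block>
    </fo:flow>
  </fo:page-sequence>
</fo:root>
```

Rendered with the FOP command line, e.g. `fop -fo hello.fo -pdf hello.pdf`.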

Regards,
Dave

> On Oct 16, 2019, at 5:05 AM, Sergey Beryozkin  wrote:
> 
> Ken, thanks for the feedback, I meant to reply to your comments,
> 
> I suppose I really meant Tika offering a uniform API to create some simple
> structured PDF/etc files.
> ContentCreator creator = ContentCreator.get("PDF");
> creator.addTitle("Introduction to Tika");
> creator.addText("");
> creator.addTable("tablename", new LinkedHashMap>());
> creator.addAttachment(someImage);
> creator.complete();
> 
> It would be consistent with the Tika approach on the read side.
> 
> Cheers, Sergey
> On Mon, Oct 14, 2019 at 4:13 PM Ken Krugler  wrote:
> 
>> If you’re suggesting ways to make it easier to use something like
>> YaHPConverter with Tika, definitely yes.
>> 
>> If you’re talking about integrating this functionality…my personal view is
>> no.
>> 
>> I think Tika should focus on extracting content from documents, versus
>> format transformations.
>> 
>> Tika is an attractive location for functionality like this, since it sits
>> in the middle of a lot of data processing pipelines, but I worry about a
>> bloated code base, with corresponding challenges in maintenance and support.
>> 
>> Regards,
>> 
>> — Ken
>> 
>> 
>>> On Oct 14, 2019, at 4:38 AM, Sergey Beryozkin 
>> wrote:
>>> 
>>> Hi All
>>> 
>>> I've seen a Quarkus user asking how to convert to PDF, and one of my
>>> colleagues pointed to
>>> 
>> http://www.allcolor.org/YaHPConverter/doc/org/allcolor/yahp/converter/IHtmlToPdfTransformer.html
>>> 
>>> Does it make sense for Tika to offer something related to the text to PDF
>>> (for a start, something on top of that transformer), and then may be even
>>> for other formats ?
>>> 
>>> Sergey
>> 
>> --
>> Ken Krugler
>> http://www.scaleunlimited.com
>> custom big data solutions & training
>> Hadoop, Cascading, Cassandra & Solr
>> 
>> 



Re: Guidance to avoid Tika's integration with Solr's ExtractingRequestHandler in production

2018-05-29 Thread Dave Fisher
Having run a Solr service, I know you strive for quick response on queries
and want to avoid anything that can pause the JVM. You work hard to make your
updates quick and NRT (near real-time). Text extraction from XML-based
documents like Office files, and from big object files like PDFs, is memory
intensive and should be sandboxed on separate VMs.
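One way to do that sandboxing is to run tika-server on a separate VM and call it over HTTP from your indexing pipeline. A sketch of building such a request with only the JDK (host, port 9998, and the /tika endpoint are tika-server defaults; verify against your version):

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class TikaServerRequestDemo {
    // Builds (but does not send) a PUT request against a standalone
    // tika-server instance; the endpoint URL is an assumption for
    // illustration and the document bytes would go in the request body.
    static HttpURLConnection buildRequest(String endpoint) throws Exception {
        HttpURLConnection conn =
                (HttpURLConnection) new URL(endpoint).openConnection();
        conn.setRequestMethod("PUT");               // /tika accepts PUT
        conn.setRequestProperty("Accept", "text/plain");
        conn.setDoOutput(true);                     // body carries the document
        return conn;
    }

    public static void main(String[] args) throws Exception {
        HttpURLConnection conn = buildRequest("http://localhost:9998/tika");
        System.out.println(conn.getRequestMethod());            // PUT
        System.out.println(conn.getRequestProperty("Accept"));  // text/plain
    }
}
```

Because the connection is built lazily, this keeps the Solr JVM insulated: a hung or crashing parse only takes down the tika-server process on the other VM.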

Regards,
Dave

> On May 29, 2018, at 12:11 PM, Ken Krugler  wrote:
> 
> Thanks for the ref, Tim.
> 
> I’m curious why SolrCell doesn’t fire up threads when parsing docs with Tika 
> (or use the fork parser), to mitigate issues with hangs & crashes?
> 
> — Ken
> 
>> On May 29, 2018, at 11:54 AM, Tim Allison  wrote:
>> 
>> All,
>> 
>> Over the weekend, Shawn Heisey very kindly drafted a wikipage about the
>> challenges of using Solr's ExtractingRequestHandler and the guidance to
>> avoid it in production.
>> 
>>  I completely agree with this point, and I think that Shawn did a very
>> nice job of capturing some of the challenges.  If you have any feedback or
>> would like to make edits, see:
>> 
>> https://wiki.apache.org/solr/RecommendCustomIndexingWithTika
>> 
>>  Cheers,
>> 
>>Tim
> 
> 
> http://about.me/kkrugler
> +1 530-210-6378
> 


