Fwd: Error when processing doap file http://svn.apache.org/repos/asf/tika/site/src/site/resources/doap.rdf:
FYI-

Sent from my iPhone

Begin forwarded message:

> From: Projects
> Date: December 25, 2021 at 6:01:14 PM PST
> To: Site Development
> Subject: Error when processing doap file http://svn.apache.org/repos/asf/tika/site/src/site/resources/doap.rdf:
> Reply-To: site-...@apache.org
>
> URL: http://svn.apache.org/repos/asf/tika/site/src/site/resources/doap.rdf
> mismatched tag: line 364, column 4
> Source: https://svn.apache.org/repos/asf/comdev/projects.apache.org/trunk/data/projects.xml
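[A quick way to reproduce this kind of error locally is to run the file through an XML parser, which reports the same line/column position for the mismatched tag. A minimal sketch, assuming a local copy of the file saved as doap.rdf:]

    import java.io.File;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.SAXParseException;
    import org.xml.sax.helpers.DefaultHandler;

    public class WellFormedCheck {
        public static void main(String[] args) throws Exception {
            try {
                // Parse with a no-op handler; we only care about well-formedness.
                SAXParserFactory.newInstance()
                        .newSAXParser()
                        .parse(new File("doap.rdf"), new DefaultHandler());
                System.out.println("doap.rdf is well-formed");
            } catch (SAXParseException e) {
                // Reports the same kind of position as the crawler,
                // e.g. line 364, column 4.
                System.out.printf("Error at line %d, column %d: %s%n",
                        e.getLineNumber(), e.getColumnNumber(), e.getMessage());
            }
        }
    }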
Re: 2.2.0 JARs not pushed to Maven Central
Get on #asfinfra in Slack and look at the scrollback.

> On Dec 17, 2021, at 2:59 PM, lewis john mcgibbney wrote:
>
> I’ve been waiting on the Maven Central repository being updated with the 2.2.0 jars…
> I checked repository.apache.org and they are NOT staged, which I assume
> means that the staging repository has been closed, which should have
> triggered the release to Maven Central.
> Anyone know what’s going on?
> lewismc
> --
> http://home.apache.org/~lewismc/
> http://people.apache.org/keys/committer/lewismc
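[One way to check whether an artifact has actually reached Maven Central is its search API. A minimal sketch; the endpoint and query parameters below are the search.maven.org interface as publicly documented, and tika-core is just a representative artifact:]

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class CentralCheck {
        public static void main(String[] args) throws Exception {
            // Query Maven Central's search API for org.apache.tika:tika-core.
            String url = "https://search.maven.org/solrsearch/select"
                    + "?q=g:org.apache.tika+AND+a:tika-core&core=gav&rows=5&wt=json";
            HttpResponse<String> resp = HttpClient.newHttpClient().send(
                    HttpRequest.newBuilder(URI.create(url)).GET().build(),
                    HttpResponse.BodyHandlers.ofString());
            // The JSON response lists the versions Central knows about;
            // if 2.2.0 is absent, the staged release never reached Central.
            System.out.println(resp.body());
        }
    }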
Re: Log4j 2.16.0 a more complete fix to Log4Shell
You’ll need to evaluate that yourself.

Sent from my iPhone

> On Dec 13, 2021, at 4:56 PM, Tim Allison wrote:
>
> Do we have to do a respin of the release candidate or is this marginally better?
>
>> On Mon, Dec 13, 2021 at 7:43 PM Dave Fisher wrote:
>>
>> https://lists.apache.org/thread/d6v4r6nosxysyq9rvnr779336yf0woz4
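[For anyone auditing a deployment, a quick runtime check of which log4j-core version is actually on the classpath helps decide whether a respin matters. A minimal sketch; it assumes log4j-core is present and that its jar manifest carries an Implementation-Version, which the official release jars do:]

    public class Log4jVersionCheck {
        public static void main(String[] args) throws Exception {
            // Load a class from log4j-core without a compile-time dependency.
            Class<?> core = Class.forName("org.apache.logging.log4j.core.Logger");
            // The official jars record the release version in the manifest.
            String version = core.getPackage().getImplementationVersion();
            System.out.println("log4j-core version: " + version);
            // Anything below 2.16.0 still has message-lookup code paths.
        }
    }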
Log4j 2.16.0 a more complete fix to Log4Shell
https://lists.apache.org/thread/d6v4r6nosxysyq9rvnr779336yf0woz4
[jira] [Commented] (TIKA-3544) Extraction of long sequences of digits from Excel spreadsheets using Tika 1.20 doesn’t yield the expected results
[ https://issues.apache.org/jira/browse/TIKA-3544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17412084#comment-17412084 ]

Dave Fisher commented on TIKA-3544:
-----------------------------------

The OP's source, https://getcreditcardnumbers.com/, produces invalid numbers. In JSON and JavaScript, numbers are always double-precision floating point. See https://www.w3schools.com/js/js_numbers.asp

> Extraction of long sequences of digits from Excel spreadsheets using Tika
> 1.20 doesn’t yield the expected results
> -------------------------------------------------------------------------
>
>                 Key: TIKA-3544
>                 URL: https://issues.apache.org/jira/browse/TIKA-3544
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.20
>            Reporter: Jitin Jindal
>            Priority: Major
>         Attachments: Credit Card Numbers.xlsx
>
> If an Excel spreadsheet contains a long sequence of digits, such as a credit
> card number, Tika 1.13 will emit the said sequence in scientific notation.
> For example, the credit card number “6011799905775830” is extracted from the
> attached spreadsheet as 6.480195344642784E15, which clearly is not the
> desired output.
> I think the impact of this issue is significant. There’s plenty of
> information that can no longer be reliably extracted from spreadsheets. Think
> credit card numbers, telephone numbers and product identifiers, to name a few.
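[The precision ceiling is easy to demonstrate directly. A small self-contained sketch; the 17-digit value is just an illustration, not taken from the attached spreadsheet:]

    public class DoublePrecision {
        public static void main(String[] args) {
            // A 16-digit credit-card-sized value is still below 2^53,
            // so this particular double happens to be exact...
            double cc = 6011799905775830d;
            System.out.println((long) cc);          // 6011799905775830

            // ...but Java renders large doubles in scientific notation,
            // which is the kind of output the issue describes.
            System.out.println(cc);                 // 6.01179990577583E15

            // At 17 significant digits the value itself is silently rounded.
            double seventeen = 99999999999999999d;  // 17 nines
            System.out.println((long) seventeen);   // 100000000000000000
            System.out.println(seventeen == 1e17);  // true
        }
    }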
[jira] [Commented] (TIKA-3544) Extraction of long sequences of digits from Excel spreadsheets using Tika 1.20 doesn’t yield the expected results
[ https://issues.apache.org/jira/browse/TIKA-3544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17412035#comment-17412035 ]

Dave Fisher commented on TIKA-3544:
-----------------------------------

See https://en.wikipedia.org/wiki/Double-precision_floating-point_format

A double can only keep between 15 and 17 digits of precision. I think you have to leave things at 15 digits or do a more precise analysis, which would be slower. There is a reason floating point has an error term called epsilon.

Credit card numbers are strings of numeric characters. Use strings, just like you have to for US zip codes because of the leading '0'.
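[On the extraction side, the usual way to get the digits as the user sees them is to read the cell through POI's DataFormatter, which applies the cell's display format instead of converting the numeric value. A minimal sketch, assuming a workbook like the attached one with the numbers in the first column:]

    import java.io.File;
    import org.apache.poi.ss.usermodel.DataFormatter;
    import org.apache.poi.ss.usermodel.Row;
    import org.apache.poi.ss.usermodel.Sheet;
    import org.apache.poi.ss.usermodel.Workbook;
    import org.apache.poi.ss.usermodel.WorkbookFactory;

    public class CellAsString {
        public static void main(String[] args) throws Exception {
            try (Workbook wb = WorkbookFactory.create(new File("Credit Card Numbers.xlsx"))) {
                DataFormatter fmt = new DataFormatter();
                Sheet sheet = wb.getSheetAt(0);
                for (Row row : sheet) {
                    // formatCellValue() returns the text Excel would display,
                    // so a cell formatted as a number keeps all of its digits.
                    System.out.println(fmt.formatCellValue(row.getCell(0)));
                }
            }
        }
    }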
[jira] [Commented] (TIKA-2939) Figure out how to allow OCR'ing of large PDFs via tika-server
[ https://issues.apache.org/jira/browse/TIKA-2939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17219857#comment-17219857 ]

Dave Fisher commented on TIKA-2939:
-----------------------------------

On your client side, the PDFBox tools might help you with your split.

https://pdfbox.apache.org/2.0/commandline.html#pdfsplit

> Figure out how to allow OCR'ing of large PDFs via tika-server
> --------------------------------------------------------------
>
>                 Key: TIKA-2939
>                 URL: https://issues.apache.org/jira/browse/TIKA-2939
>             Project: Tika
>          Issue Type: Improvement
>          Components: server
>            Reporter: Tim Allison
>            Priority: Minor
>
> Tesseract can take quite a bit of time on large PDFs, which can lead to
> timeouts in jax-rs and the connection closing:
> {noformat}
> Caused by: com.ctc.wstx.exc.WstxIOException: Closed
>         at com.ctc.wstx.sw.BaseStreamWriter.flush(BaseStreamWriter.java:262)
>         at org.apache.cxf.jaxrs.interceptor.JAXRSDefaultFaultOutInterceptor.handleMessage(JAXRSDefaultFaultOutInterceptor.java:104)
> Caused by: org.eclipse.jetty.io.EofException: Closed
>         at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:491)
>         at org.apache.cxf.transport.http_jetty.JettyHTTPDestination$JettyOutputStream.write(JettyHTTPDestination.java:322)
>         at org.apache.cxf.io.AbstractWrappedOutputStream.write(AbstractWrappedOutputStream.java:51)
>         at com.ctc.wstx.sw.EncodingXmlWriter.flushBuffer(EncodingXmlWriter.java:742)
>         at com.ctc.wstx.sw.EncodingXmlWriter.flush(EncodingXmlWriter.java:176)
>         at com.ctc.wstx.sw.BaseStreamWriter.flush(BaseStreamWriter.java:260)
> {noformat}
> I tried expanding the timeouts on the client side:
> {noformat}
> RequestConfig config = RequestConfig.custom()
>         .setConnectTimeout(TIMEOUT * 1000)
>         .setConnectionRequestTimeout(TIMEOUT * 1000)
>         .setSocketTimeout(TIMEOUT * 1000).build();
> {noformat}
> But this doesn't solve the problem.
> Can we increase the timeout on the server side, and is there a maximum?
> If we can't fix the problem with timeouts, we should figure out a way to let
> people select only a few pages for OCR so that clients can iterate through
> large PDFs.
> This issue is different from TIKA-1871 in that the problem isn't chunking the
> large document to get the file to tika-server; rather, the problem is the
> amount of time it can take tika-server to run OCR on every page of a large
> PDF and return the full results.
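[For the client-side split the comment suggests, what the pdfsplit command does is also available programmatically through PDFBox's Splitter class. A minimal sketch, assuming PDFBox 2.x and a hypothetical large.pdf, splitting into 10-page chunks that can each be sent to tika-server separately:]

    import java.io.File;
    import java.util.List;
    import org.apache.pdfbox.multipdf.Splitter;
    import org.apache.pdfbox.pdmodel.PDDocument;

    public class SplitForOcr {
        public static void main(String[] args) throws Exception {
            try (PDDocument doc = PDDocument.load(new File("large.pdf"))) {
                Splitter splitter = new Splitter();
                splitter.setSplitAtPage(10); // emit one document per 10 pages
                List<PDDocument> chunks = splitter.split(doc);
                int i = 0;
                for (PDDocument chunk : chunks) {
                    File out = new File("chunk-" + (i++) + ".pdf");
                    chunk.save(out); // each chunk can now be OCR'd on its own
                    chunk.close();
                }
            }
        }
    }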
Re: Tika lib is huge.. why?
IIRC - if you know you only want PDF extraction, then take a look at Apache PDFBox.

https://pdfbox.apache.org

Sent from my iPhone

> On Sep 26, 2020, at 10:00 AM, Laurence Vanhelsuwe wrote:
>
> Thanks for the explanation.
>
> I understand the approach... but in my particular use case, I cannot
> reasonably justify inflating my application size from 7 MB to 77 MB just to add
> functionality amounting to less than 1% of all functionality.
>
> I guess there’s no way to surgically extract just the PDF metadata parsing
> functionality from Tika?
>
> Laurence
>
>> On 26 Sep 2020, at 18:04, Keith Bennett wrote:
>>
>> Tika coordinates the use of many external-to-Tika parser libraries, each
>> with their own dependencies, for parsing many types of files. These parser
>> libraries are bundled into the tika-app jar file for your convenience. I
>> believe it's these libraries that make up the bulk of the download. For
>> example, if you unzip the jar file and inspect the contents, you can see
>> that just one of these parsers, poi, consists of 24 MB:
>>
>> % cd org/apache/poi
>> % du -sh .
>> 24M    .
>>
>> - Keith
>>
>>> On Sat, Sep 26, 2020 at 7:54 AM Laurence Vanhelsuwe wrote:
>>>
>>> I found Tika during a quest to extract PDF metadata in Java. Did I screw
>>> up the JAR download, or is Tika really 70 MB?
>>>
>>> Kind regards,
>>>
>>> Laurence
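[For the "just PDF metadata" case, PDFBox alone is indeed enough. A minimal sketch, assuming PDFBox 2.x and a hypothetical input file some.pdf:]

    import java.io.File;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.PDDocumentInformation;

    public class PdfMetadata {
        public static void main(String[] args) throws Exception {
            try (PDDocument doc = PDDocument.load(new File("some.pdf"))) {
                // The document information dictionary holds the classic metadata fields.
                PDDocumentInformation info = doc.getDocumentInformation();
                System.out.println("Title:    " + info.getTitle());
                System.out.println("Author:   " + info.getAuthor());
                System.out.println("Producer: " + info.getProducer());
                System.out.println("Created:  " + info.getCreationDate());
            }
        }
    }

[XMP metadata, if the file carries it, is a separate stream reachable via doc.getDocumentCatalog().getMetadata().]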
Re: HTML to PDF conversion
Hi - You may want to take a look at Apache FOP, which is part of the Apache XML Graphics project. My team had success with it generating PDF from XML.

Regards,
Dave

> On Oct 16, 2019, at 5:05 AM, Sergey Beryozkin wrote:
>
> Ken, thanks for the feedback, I meant to reply to your comments.
>
> I suppose I really meant Tika offering a uniform API to create some simple
> structured PDF/etc files:
>
> ContentCreator creator = ContentCreator.get("PDF");
> creator.addTitle("Introduction to Tika");
> creator.addText("");
> creator.addTable("tablename", new LinkedHashMap<String, List<String>>());
> creator.addAttachment(someImage);
> creator.complete();
>
> It would be consistent with the Tika approach on the read side.
>
> Cheers, Sergey
>
> On Mon, Oct 14, 2019 at 4:13 PM Ken Krugler wrote:
>
>> If you’re suggesting ways to make it easier to use something like
>> YaHPConverter with Tika, definitely yes.
>>
>> If you’re talking about integrating this functionality… my personal view is no.
>>
>> I think Tika should focus on extracting content from documents, versus
>> format transformations.
>>
>> Tika is an attractive location for functionality like this, since it sits
>> in the middle of a lot of data processing pipelines, but I worry about a
>> bloated code base, with corresponding challenges in maintenance and support.
>>
>> Regards,
>>
>> — Ken
>>
>>> On Oct 14, 2019, at 4:38 AM, Sergey Beryozkin wrote:
>>>
>>> Hi All,
>>>
>>> I've seen a Quarkus user asking how to convert to PDF, and one of my
>>> colleagues pointed to
>>> http://www.allcolor.org/YaHPConverter/doc/org/allcolor/yahp/converter/IHtmlToPdfTransformer.html
>>>
>>> Does it make sense for Tika to offer something related to text-to-PDF
>>> conversion (for a start, something on top of that transformer), and then
>>> maybe even for other formats?
>>>
>>> Sergey
>>
>> --
>> Ken Krugler
>> http://www.scaleunlimited.com
>> custom big data solutions & training
>> Hadoop, Cascading, Cassandra & Solr
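[For the XML-to-PDF route Dave mentions, FOP's embedding API drives a transform straight into a PDF stream. A minimal sketch, assuming FOP 2.x and a hypothetical input.fo; the input here is already XSL-FO, so an identity transformer is used, whereas with a stylesheet you would pass an XSLT source instead:]

    import java.io.BufferedOutputStream;
    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.OutputStream;
    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.sax.SAXResult;
    import javax.xml.transform.stream.StreamSource;
    import org.apache.fop.apps.Fop;
    import org.apache.fop.apps.FopFactory;
    import org.apache.fop.apps.MimeConstants;

    public class FoToPdf {
        public static void main(String[] args) throws Exception {
            FopFactory fopFactory = FopFactory.newInstance(new File(".").toURI());
            try (OutputStream out = new BufferedOutputStream(new FileOutputStream("out.pdf"))) {
                Fop fop = fopFactory.newFop(MimeConstants.MIME_PDF, out);
                // Identity transform: the input is already XSL-FO.
                Transformer transformer = TransformerFactory.newInstance().newTransformer();
                transformer.transform(new StreamSource(new File("input.fo")),
                        new SAXResult(fop.getDefaultHandler()));
            }
        }
    }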
Re: Guidance to avoid Tika's integration with Solr's ExtractingRequestHandler in production
Having run a Solr service: you are striving for quick response on queries and want to avoid anything that can pause the JVM, and you work hard to make your updates quick and NRT. Text extraction from XML-based documents like Office files, and from big object files like PDFs, is memory intensive and should be sandboxed on separate VMs.

Regards,
Dave

> On May 29, 2018, at 12:11 PM, Ken Krugler wrote:
>
> Thanks for the ref, Tim.
>
> I’m curious why SolrCell doesn’t fire up threads when parsing docs with Tika
> (or use the fork parser), to mitigate issues with hangs & crashes?
>
> — Ken
>
>> On May 29, 2018, at 11:54 AM, Tim Allison wrote:
>>
>> All,
>>
>> Over the weekend, Shawn Heisey very kindly drafted a wiki page about the
>> challenges of using Solr's ExtractingRequestHandler and the guidance to
>> avoid it in production.
>>
>> I completely agree with this point, and I think that Shawn did a very
>> nice job of capturing some of the challenges. If you have any feedback or
>> would like to make edits, see:
>>
>> https://wiki.apache.org/solr/RecommendCustomIndexingWithTika
>>
>> Cheers,
>>
>> Tim
>
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://about.me/kkrugler
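[The sandboxing advice in practice means running tika-server on its own VM and indexing the extracted text into Solr yourself. A minimal sketch of the extraction half, assuming a tika-server instance at a hypothetical http://tika-host:9998; PUT /tika with an Accept header is tika-server's standard extraction endpoint:]

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.file.Path;

    public class TikaServerClient {
        public static void main(String[] args) throws Exception {
            HttpRequest request = HttpRequest.newBuilder(URI.create("http://tika-host:9998/tika"))
                    .header("Accept", "text/plain") // ask for plain-text extraction
                    .PUT(HttpRequest.BodyPublishers.ofFile(Path.of("big.pdf")))
                    .build();
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            // The extracted text is the response body; index it into Solr from
            // here, so a hung or crashed parse stays contained to the Tika VM.
            System.out.println(response.body());
        }
    }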