Re: Single line in extracted PDF contents

2010-11-11 Thread Staffan
On Thu, Nov 11, 2010 at 10:14 AM, Staffan wrote: > Hi, > > Current trunk/0.8RC seems to concatenate the PDF body from PDFBox into > one line. Last time I tested trunk, about a month ago, it did not. See > the following command line output: > Had the time to make a unit test now and track the regre

[jira] Updated: (TIKA-548) PDF content extracted as single line

2010-11-11 Thread Staffan Olsson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Staffan Olsson updated TIKA-548: Attachment: tika-PDF-content-regression-test.patch > PDF content extracted as single line > -

[jira] Created: (TIKA-548) PDF content extracted as single line

2010-11-11 Thread Staffan Olsson (JIRA)
PDF content extracted as single line Key: TIKA-548 URL: https://issues.apache.org/jira/browse/TIKA-548 Project: Tika Issue Type: Bug Components: parser Affects Versions: 0.8 Repo

Re: svn commit: r1033937 - in /tika/trunk: tika-core/src/main/java/org/apache/tika/extractor/ tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ tika-parsers/src/main/java/org/apache/tika/pa

2010-11-11 Thread Mattmann, Chris A (388J)
BTW: that said, thanks for taking the time to implement this functionality – it looks great and of course I’m +1 for making it easier for you guys to use Tika in your company! Cheers, Chris On 11/11/10 6:38 AM, "Maxim Valyanskiy" wrote: Hello! 11.11.2010 17:05, Jukka Zitting wrote: > Log: >

Re: svn commit: r1033937 - in /tika/trunk: tika-core/src/main/java/org/apache/tika/extractor/ tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ tika-parsers/src/main/java/org/apache/tika/pa

2010-11-11 Thread Mattmann, Chris A (388J)
Hi Max, > > We have POI-based utility that extracts all embedded files (attachments, > pictures > and etc) from different file formats. This utility takes arbitrary file and > returns ZIP-archive with all attachments. > > This utility duplicates functionality of embedded file processing in Tika.

Re: svn commit: r1033937 - in /tika/trunk: tika-core/src/main/java/org/apache/tika/extractor/ tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ tika-parsers/src/main/java/org/apache/tika/pa

2010-11-11 Thread Nick Burch
On Thu, 11 Nov 2010, Maxim Valyanskiy wrote: So I need to create JIRA issue before commit? Yup. If it's a major change, or you're not sure about the route to take, post the patch for review on the jira first. If it's a smaller change (eg the scope of this one), create the jira before you star

Re: svn commit: r1033937 - in /tika/trunk: tika-core/src/main/java/org/apache/tika/extractor/ tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ tika-parsers/src/main/java/org/apache/tika/pa

2010-11-11 Thread Maxim Valyanskiy
Hello! 11.11.2010 17:05, Jukka Zitting wrote: Log: Extract interface for EmbeddedDocumentExtractor We have POI-based utility that extracts all embedded files (attachments, pictures and etc) from different file formats. This utility takes arbitrary file and returns ZIP-archive with all attac

Re: [VOTE] Apache Tika 0.8 Release Candidate #1

2010-11-11 Thread Ken Krugler
Hi Chris, We built/ran Bixo against the released Tika 0.8 jars, and it passed all of our tests. +1 for me -- Ken On Nov 9, 2010, at 1:29pm, Mattmann, Chris A (388J) wrote: Hi Folks, I have posted a candidate for the Apache Tika 0.8 release. The source code is at: http://people.apache

Re: svn commit: r1033937 - in /tika/trunk: tika-core/src/main/java/org/apache/tika/extractor/ tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ tika-parsers/src/main/java/org/apache/tika/pa

2010-11-11 Thread Jukka Zitting
Hi, On Thu, Nov 11, 2010 at 3:31 PM, wrote: > Log: > Extract interface for EmbeddedDocumentExtractor It would be good if all non-trivial commit messages contained a reference to a relevant issue in Jira for better context of why particular changes are being made. Nick correctly noted earlier t

Single line in extracted PDF contents

2010-11-11 Thread Staffan
Hi, Current trunk/0.8RC seems to concatenate the PDF body from PDFBox into one line. Last time I tested trunk, about a month ago, it did not. See the following command line output: $> java -jar pdfbox-app-1.3.1.jar ExtractText -console docs/shortpdf.pdf 1   ·   untitled 3   ·   2010-02-13 09:52