I forgot to mention earlier that you should probably move the parser
initialization code (the DefaultDetector and AutoDetectParser) out of the
exec method. Constructing them for every tuple will probably cause
significant overhead, both in GC pressure and in runtime. You'll want to
initialize your parser once and evaluate all your docs against it.
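
Here is a rough sketch of what I mean, using only the JDK so it runs on its
own. StubParser and Extractor are hypothetical stand-ins for Tika's
AutoDetectParser and your UDF class; in the real UDF the parser would be a
field of ExtractTextFromPDFs instead of a local variable in exec. I'd assume
the content handler and metadata still need to be created per document.

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicInteger;

public class ReuseParserSketch {

    // Hypothetical stand-in for an expensive-to-construct parser such as
    // Tika's AutoDetectParser. It only counts how often it is constructed.
    static final class StubParser {
        static final AtomicInteger CONSTRUCTED = new AtomicInteger();

        StubParser() {
            CONSTRUCTED.incrementAndGet();
        }

        String parse(byte[] content) {
            return new String(content, StandardCharsets.UTF_8);
        }
    }

    // Hypothetical stand-in for the UDF. The parser is a field, so it is
    // built once per instance and reused by every exec call.
    static final class Extractor {
        private final StubParser parser = new StubParser();

        String exec(byte[] content) {
            if (content == null) {
                return "N/A";
            }
            // StringBuilder accumulates the pieces; String.concat would
            // return a new string each time and leave the original unchanged.
            StringBuilder text = new StringBuilder();
            text.append(content.length);
            text.append(" : ");
            text.append(parser.parse(content));
            return text.toString();
        }
    }

    public static void main(String[] args) {
        Extractor udf = new Extractor();
        for (String doc : new String[] {"first", "second", "third"}) {
            System.out.println(udf.exec(doc.getBytes(StandardCharsets.UTF_8)));
        }
        // Three documents were processed, but only one parser was built.
        System.out.println("parsers constructed: " + StubParser.CONSTRUCTED.get());
    }
}
```

The only point here is where the parser gets constructed: once per UDF
instance, not once per exec call.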

- Pradeep

On Fri Dec 05 2014 at 9:18:16 AM Pradeep Gollakota <pradeep...@gmail.com>
wrote:

> Java strings are immutable, so "pdfText.concat()" returns a new string
> and leaves the original string untouched. Since you discard the return
> value, pdfText is still empty at the end, and all you're doing is
> returning an empty string. Instead, you can do "pdfText =
> pdfText.concat(...)". But the better way to write it is to use a
> StringBuilder.
>
> StringBuilder pdfText = ...;
> pdfText.append(...);
> pdfText.append(...);
> ...
> return pdfText.toString();
>
> On Fri Dec 05 2014 at 9:12:37 AM Ryan <freelanceflashga...@gmail.com>
> wrote:
>
>> Hi,
>>
>> I'm working on an open source project attempting to convert raw content
>> from a pdf (stored as a databytearray) into plain text using a Pig UDF and
>> Apache Tika. I could use your help. For some reason, the UDF I'm using
>> isn't working. The script succeeds but no output is written. *This is the
>> Pig script I'm following:*
>>
>> register 'target/warcbase-0.1.0-SNAPSHOT-fatjar.jar';
>> DEFINE ExtractTextFromPDFs
>>  org.warcbase.pig.piggybank.ExtractTextFromPDFs();
>> DEFINE ArcLoader org.warcbase.pig.ArcLoader();
>>
>> raw = load '/data/arc/sample.arc' using ArcLoader as (url: chararray,
>>   date: chararray, mime: chararray, content: bytearray); --load the data
>>
>> --gets all PDF pages from the arc file
>> a = FILTER raw BY (url matches '.*\\.pdf$');
>> b = LIMIT a 2; --limit to 2 pages to speed up testing time
>> c = foreach b generate url, ExtractTextFromPDFs(content);
>> store c into 'output/pdf_test';
>>
>>
>> *This is the UDF I wrote:*
>>
>> public class ExtractTextFromPDFs extends EvalFunc<String> {
>>
>>   @Override
>>   public String exec(Tuple input) throws IOException {
>>       String pdfText = "";
>>
>>       if (input == null || input.size() == 0 || input.get(0) == null) {
>>           return "N/A";
>>       }
>>
>>       DataByteArray dba = (DataByteArray)input.get(0);
>>       //my attempt at debugging. Nothing written
>>       pdfText.concat(String.valueOf(dba.size()));
>>
>>       InputStream is = new ByteArrayInputStream(dba.get());
>>
>>       ContentHandler contenthandler = new BodyContentHandler();
>>       Metadata metadata = new Metadata();
>>       DefaultDetector detector = new DefaultDetector();
>>       AutoDetectParser pdfparser = new AutoDetectParser(detector);
>>
>>       try {
>>         pdfparser.parse(is, contenthandler, metadata, new ParseContext());
>>       } catch (SAXException | TikaException e) {
>>         // TODO Auto-generated catch block
>>         e.printStackTrace();
>>       }
>>       //another attempt at debugging. Still nothing written
>>       pdfText.concat(" : ");
>>       pdfText.concat(contenthandler.toString());
>>
>>       //close the input stream
>>       if(is != null){
>>         is.close();
>>       }
>>       return pdfText;
>>   }
>>
>> }
>>
>> Thank you for your assistance,
>> Ryan
>>
>
