Java strings are immutable, so "pdfText.concat(...)" returns a new string
and leaves the original unchanged. Because your code discards that return
value every time, pdfText is still the empty string it started as when you
return it. You could write "pdfText = pdfText.concat(...)" instead, but the
better way is to use a StringBuilder:

StringBuilder pdfText = ...;
pdfText.append(...);
pdfText.append(...);
...
return pdfText.toString();
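
To see the difference in isolation, here is a small standalone sketch with no
Pig or Tika involved. The class name ConcatDemo, the two method names, and the
sample values are made up for illustration. The first method repeats the
pattern from your UDF, the second uses a StringBuilder.

```java
public class ConcatDemo {

    // Same pattern as the UDF: concat() returns a new String each time,
    // but the result is never assigned, so pdfText never changes.
    public static String discardingConcat(String size, String body) {
        String pdfText = "";
        pdfText.concat(size);
        pdfText.concat(" : ");
        pdfText.concat(body);
        return pdfText; // still the empty string
    }

    // StringBuilder is mutable, so each append() modifies the same object.
    public static String withStringBuilder(String size, String body) {
        StringBuilder pdfText = new StringBuilder();
        pdfText.append(size);
        pdfText.append(" : ");
        pdfText.append(body);
        return pdfText.toString();
    }

    public static void main(String[] args) {
        // Brackets make the empty result visible.
        System.out.println("[" + discardingConcat("42", "hello") + "]");  // prints []
        System.out.println("[" + withStringBuilder("42", "hello") + "]"); // prints [42 : hello]
    }
}
```

In the UDF itself the same change should apply directly: declare pdfText as a
StringBuilder, replace each concat call with append, and return
pdfText.toString() at the end.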

On Fri Dec 05 2014 at 9:12:37 AM Ryan <freelanceflashga...@gmail.com> wrote:

> Hi,
>
> I'm working on an open source project attempting to convert raw content
> from a pdf (stored as a databytearray) into plain text using a Pig UDF and
> Apache Tika. I could use your help. For some reason, the UDF I'm using
> isn't working. The script succeeds but no output is written. *This is the
> Pig script I'm following:*
>
> register 'target/warcbase-0.1.0-SNAPSHOT-fatjar.jar';
> DEFINE ExtractTextFromPDFs
>  org.warcbase.pig.piggybank.ExtractTextFromPDFs();
> DEFINE ArcLoader org.warcbase.pig.ArcLoader();
>
> raw = load '/data/arc/sample.arc' using ArcLoader as (url: chararray, date:
> chararray, mime: chararray, content: bytearray); --load the data
>
> a = FILTER raw BY (url matches '.*\\.pdf$');  --gets all PDF pages from the
> arc file
> b = LIMIT a 2; --limit to 2 pages to speed up testing time
> c = foreach b generate url, ExtractTextFromPDFs(content);
> store c into 'output/pdf_test';
>
>
> *This is the UDF I wrote:*
>
> public class ExtractTextFromPDFs extends EvalFunc<String> {
>
>   @Override
>   public String exec(Tuple input) throws IOException {
>       String pdfText = "";
>
>       if (input == null || input.size() == 0 || input.get(0) == null) {
>           return "N/A";
>       }
>
>       DataByteArray dba = (DataByteArray)input.get(0);
>       pdfText.concat(String.valueOf(dba.size())); //my attempt at
> debugging. Nothing written
>
>       InputStream is = new ByteArrayInputStream(dba.get());
>
>       ContentHandler contenthandler = new BodyContentHandler();
>       Metadata metadata = new Metadata();
>       DefaultDetector detector = new DefaultDetector();
>       AutoDetectParser pdfparser = new AutoDetectParser(detector);
>
>       try {
>         pdfparser.parse(is, contenthandler, metadata, new ParseContext());
>       } catch (SAXException | TikaException e) {
>         // TODO Auto-generated catch block
>         e.printStackTrace();
>       }
>       pdfText.concat(" : "); //another attempt at debugging. Still nothing
> written
>       pdfText.concat(contenthandler.toString());
>
>       //close the input stream
>       if(is != null){
>         is.close();
>       }
>       return pdfText;
>   }
>
> }
>
> Thank you for your assistance,
> Ryan
>
