Java strings are immutable, so "pdfText.concat(...)" returns a new string and leaves the original untouched. Your UDF discards every one of those return values, so at the end all you're doing is returning the empty string you started with. You could instead write "pdfText = pdfText.concat(...)", but the better way to write it is to use a StringBuilder.
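To make the difference concrete, here is a minimal standalone sketch, not the UDF itself. The class name ConcatDemo and its three methods are hypothetical, and the arguments stand in for the byte-array size and the text Tika extracts.

```java
// Hypothetical demo class; none of these names come from the original UDF.
class ConcatDemo {

    // Mirrors the bug: concat() returns a new String that is thrown away.
    static String discarded(String size, String body) {
        String text = "";
        text.concat(size);
        text.concat(" : ");
        text.concat(body);
        return text; // still ""
    }

    // Works, but allocates a new String on every step.
    static String reassigned(String size, String body) {
        String text = "";
        text = text.concat(size);
        text = text.concat(" : ");
        text = text.concat(body);
        return text;
    }

    // Preferred: StringBuilder is mutable, so append() changes it in place.
    static String built(String size, String body) {
        StringBuilder text = new StringBuilder();
        text.append(size);
        text.append(" : ");
        text.append(body);
        return text.toString();
    }

    public static void main(String[] args) {
        System.out.println("[" + discarded("123", "body") + "]");  // prints "[]"
        System.out.println("[" + reassigned("123", "body") + "]"); // prints "[123 : body]"
        System.out.println("[" + built("123", "body") + "]");      // prints "[123 : body]"
    }
}
```

Applied to the UDF, the StringBuilder version has this outline: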
    StringBuilder pdfText = ...;
    pdfText.append(...);
    pdfText.append(...);
    ...
    return pdfText.toString();

On Fri Dec 05 2014 at 9:12:37 AM Ryan <freelanceflashga...@gmail.com> wrote:

> Hi,
>
> I'm working on an open source project attempting to convert raw content
> from a pdf (stored as a databytearray) into plain text using a Pig UDF and
> Apache Tika. I could use your help. For some reason, the UDF I'm using
> isn't working. The script succeeds but no output is written.
>
> *This is the Pig script I'm following:*
>
> register 'target/warcbase-0.1.0-SNAPSHOT-fatjar.jar';
> DEFINE ExtractTextFromPDFs org.warcbase.pig.piggybank.ExtractTextFromPDFs();
> DEFINE ArcLoader org.warcbase.pig.ArcLoader();
>
> raw = load '/data/arc/sample.arc' using ArcLoader as (url: chararray, date: chararray, mime: chararray, content: bytearray); --load the data
>
> a = FILTER raw BY (url matches '.*\\.pdf$'); --gets all PDF pages from the arc file
> b = LIMIT a 2; --limit to 2 pages to speed up testing time
> c = foreach b generate url, ExtractTextFromPDFs(content);
> store c into 'output/pdf_test';
>
>
> *This is the UDF I wrote:*
>
> public class ExtractTextFromPDFs extends EvalFunc<String> {
>
>   @Override
>   public String exec(Tuple input) throws IOException {
>     String pdfText = "";
>
>     if (input == null || input.size() == 0 || input.get(0) == null) {
>       return "N/A";
>     }
>
>     DataByteArray dba = (DataByteArray)input.get(0);
>     pdfText.concat(String.valueOf(dba.size())); //my attempt at debugging. Nothing written
>
>     InputStream is = new ByteArrayInputStream(dba.get());
>
>     ContentHandler contenthandler = new BodyContentHandler();
>     Metadata metadata = new Metadata();
>     DefaultDetector detector = new DefaultDetector();
>     AutoDetectParser pdfparser = new AutoDetectParser(detector);
>
>     try {
>       pdfparser.parse(is, contenthandler, metadata, new ParseContext());
>     } catch (SAXException | TikaException e) {
>       // TODO Auto-generated catch block
>       e.printStackTrace();
>     }
>     pdfText.concat(" : "); //another attempt at debugging. Still nothing written
>     pdfText.concat(contenthandler.toString());
>
>     //close the input stream
>     if(is != null){
>       is.close();
>     }
>     return pdfText;
>   }
>
> }
>
> Thank you for your assistance,
> Ryan
>