I forgot to mention earlier that you should probably move the pdfparser initialization code (the DefaultDetector and AutoDetectParser setup) out of the exec method. Building a new parser on every call will probably cause significant overhead, both in GC pressure and in runtime. You'll want to initialize your parser once and evaluate all your docs against it.
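
As a rough sketch of that suggestion: the parser becomes a field that is built once per UDF instance, and exec only uses it. The classes below are hypothetical stand-ins so the example runs without Tika or Pig on the classpath. `StubParser` plays the role of AutoDetectParser, and `ExtractTextSketch` plays the role of the UDF, which in the real code would extend EvalFunc<String>. In the real UDF the content handler and Metadata would presumably still be created per document, since they accumulate per-document state.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for an expensive-to-build parser such as Tika's AutoDetectParser.
final class StubParser {
    static int constructed = 0; // counts how many parser instances were built

    StubParser() {
        constructed++;
    }

    // Pretends to extract text; here it simply decodes the bytes as UTF-8.
    String parse(byte[] content) {
        return new String(content, StandardCharsets.UTF_8);
    }
}

// Hypothetical stand-in for the UDF class.
final class ExtractTextSketch {
    // Built once when the UDF instance is created, then reused for every document.
    private final StubParser parser = new StubParser();

    String exec(byte[] content) {
        if (content == null) {
            return "N/A";
        }
        // Only per-document work happens inside exec.
        return parser.parse(content);
    }
}
```

Calling exec on the same instance for many documents constructs the parser only once, which is the point of moving it out of the method.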
- Pradeep

On Fri Dec 05 2014 at 9:18:16 AM Pradeep Gollakota <pradeep...@gmail.com> wrote:

> Java strings are immutable. So "pdfText.concat()" returns a new string
> and the original string is left unmolested. So at the end, all you're doing
> is returning an empty string. Instead, you can do
> "pdfText = pdfText.concat(...)". But the better way to write it is to use a
> StringBuilder.
>
> StringBuilder pdfText = ...;
> pdfText.append(...);
> pdfText.append(...);
> ...
> return pdfText.toString();
>
> On Fri Dec 05 2014 at 9:12:37 AM Ryan <freelanceflashga...@gmail.com>
> wrote:
>
>> Hi,
>>
>> I'm working on an open source project attempting to convert raw content
>> from a PDF (stored as a DataByteArray) into plain text using a Pig UDF and
>> Apache Tika. I could use your help. For some reason, the UDF I'm using
>> isn't working. The script succeeds but no output is written. *This is the
>> Pig script I'm following:*
>>
>> register 'target/warcbase-0.1.0-SNAPSHOT-fatjar.jar';
>> DEFINE ExtractTextFromPDFs org.warcbase.pig.piggybank.ExtractTextFromPDFs();
>> DEFINE ArcLoader org.warcbase.pig.ArcLoader();
>>
>> raw = load '/data/arc/sample.arc' using ArcLoader as (url: chararray,
>>     date: chararray, mime: chararray, content: bytearray); --load the data
>>
>> a = FILTER raw BY (url matches '.*\\.pdf$'); --gets all PDF pages from the arc file
>> b = LIMIT a 2; --limit to 2 pages to speed up testing time
>> c = foreach b generate url, ExtractTextFromPDFs(content);
>> store c into 'output/pdf_test';
>>
>>
>> *This is the UDF I wrote:*
>>
>> public class ExtractTextFromPDFs extends EvalFunc<String> {
>>
>>     @Override
>>     public String exec(Tuple input) throws IOException {
>>         String pdfText = "";
>>
>>         if (input == null || input.size() == 0 || input.get(0) == null) {
>>             return "N/A";
>>         }
>>
>>         DataByteArray dba = (DataByteArray)input.get(0);
>>         pdfText.concat(String.valueOf(dba.size())); //my attempt at debugging. Nothing written
>>
>>         InputStream is = new ByteArrayInputStream(dba.get());
>>
>>         ContentHandler contenthandler = new BodyContentHandler();
>>         Metadata metadata = new Metadata();
>>         DefaultDetector detector = new DefaultDetector();
>>         AutoDetectParser pdfparser = new AutoDetectParser(detector);
>>
>>         try {
>>             pdfparser.parse(is, contenthandler, metadata, new ParseContext());
>>         } catch (SAXException | TikaException e) {
>>             // TODO Auto-generated catch block
>>             e.printStackTrace();
>>         }
>>         pdfText.concat(" : "); //another attempt at debugging. Still nothing written
>>         pdfText.concat(contenthandler.toString());
>>
>>         //close the input stream
>>         if(is != null){
>>             is.close();
>>         }
>>         return pdfText;
>>     }
>>
>> }
>>
>> Thank you for your assistance,
>> Ryan
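
The point about String.concat in the quoted reply can be checked in isolation. The following is a minimal sketch with a hypothetical `ConcatDemo` class (not part of the original UDF) that contrasts the discarded concat() result with the two suggested fixes.

```java
// Hypothetical demo class; the names here are illustrative only.
final class ConcatDemo {
    // Mirrors the original bug: the String returned by concat() is discarded.
    static String discarded(String part) {
        String text = "";
        text.concat(part); // produces a new String that nobody keeps
        return text;       // still the empty string
    }

    // Reassigning keeps the result, at the cost of a new String per call.
    static String reassigned(String a, String b) {
        String text = "";
        text = text.concat(a);
        text = text.concat(b);
        return text;
    }

    // StringBuilder appends into one buffer and converts to a String once at the end.
    static String built(String a, String b) {
        StringBuilder text = new StringBuilder();
        text.append(a);
        text.append(b);
        return text.toString();
    }
}
```

Both fixes return the same text; the StringBuilder version avoids creating an intermediate String for each piece, which matters more as the number of appended pieces grows.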