After running the updated code, the problem appears to be related to Tika: the output now shows that the input has the correct number of bytes (i.e. the content is being passed in correctly). I'm going to test further to narrow it down.

Pradeep, would you recommend using a static variable inside the ExtractTextFromPDFs UDF to store the PdfParser once it has been initialized? I'm still learning how to best do things within the Pig/MapReduce/Hadoop framework.

Ryan

On Fri, Dec 5, 2014 at 1:35 PM, Ryan <[email protected]> wrote:

> Thanks Pradeep! I'll give it a try and report back
>
> Ryan
>
> On Fri, Dec 5, 2014 at 12:30 PM, Pradeep Gollakota <[email protected]> wrote:
>
>> I forgot to mention earlier that you should probably move the PdfParser
>> initialization code out of the evaluate method. Re-creating the parser on
>> every call will probably cause significant overhead, both in terms of gc
>> and runtime performance. You'll want to initialize your parser once and
>> evaluate all your docs against it.
>>
>> - Pradeep
>>
>> On Fri Dec 05 2014 at 9:18:16 AM Pradeep Gollakota <[email protected]> wrote:
>>
>> > Java strings are immutable. So "pdfText.concat()" returns a new string
>> > and the original string is left unmolested. So at the end, all you're
>> > doing is returning an empty string. Instead, you can do
>> > "pdfText = pdfText.concat(...)". But the better way to write it is to
>> > use a StringBuilder.
>> >
>> > StringBuilder pdfText = ...;
>> > pdfText.append(...);
>> > pdfText.append(...);
>> > ...
>> > return pdfText.toString();
>> >
>> > On Fri Dec 05 2014 at 9:12:37 AM Ryan <[email protected]> wrote:
>> >
>> >> Hi,
>> >>
>> >> I'm working on an open source project attempting to convert raw content
>> >> from a pdf (stored as a databytearray) into plain text using a Pig UDF
>> >> and Apache Tika. I could use your help. For some reason, the UDF I'm
>> >> using isn't working. The script succeeds but no output is written.
>> >>
>> >> *This is the Pig script I'm following:*
>> >>
>> >> register 'target/warcbase-0.1.0-SNAPSHOT-fatjar.jar';
>> >> DEFINE ExtractTextFromPDFs org.warcbase.pig.piggybank.ExtractTextFromPDFs();
>> >> DEFINE ArcLoader org.warcbase.pig.ArcLoader();
>> >>
>> >> raw = load '/data/arc/sample.arc' using ArcLoader as (url: chararray, date: chararray, mime: chararray, content: bytearray); --load the data
>> >>
>> >> a = FILTER raw BY (url matches '.*\\.pdf$'); --gets all PDF pages from the arc file
>> >> b = LIMIT a 2; --limit to 2 pages to speed up testing time
>> >> c = foreach b generate url, ExtractTextFromPDFs(content);
>> >> store c into 'output/pdf_test';
>> >>
>> >> *This is the UDF I wrote:*
>> >>
>> >> public class ExtractTextFromPDFs extends EvalFunc<String> {
>> >>
>> >>   @Override
>> >>   public String exec(Tuple input) throws IOException {
>> >>     String pdfText = "";
>> >>
>> >>     if (input == null || input.size() == 0 || input.get(0) == null) {
>> >>       return "N/A";
>> >>     }
>> >>
>> >>     DataByteArray dba = (DataByteArray)input.get(0);
>> >>     pdfText.concat(String.valueOf(dba.size())); //my attempt at debugging. Nothing written
>> >>
>> >>     InputStream is = new ByteArrayInputStream(dba.get());
>> >>
>> >>     ContentHandler contenthandler = new BodyContentHandler();
>> >>     Metadata metadata = new Metadata();
>> >>     DefaultDetector detector = new DefaultDetector();
>> >>     AutoDetectParser pdfparser = new AutoDetectParser(detector);
>> >>
>> >>     try {
>> >>       pdfparser.parse(is, contenthandler, metadata, new ParseContext());
>> >>     } catch (SAXException | TikaException e) {
>> >>       // TODO Auto-generated catch block
>> >>       e.printStackTrace();
>> >>     }
>> >>     pdfText.concat(" : "); //another attempt at debugging. Still nothing written
>> >>     pdfText.concat(contenthandler.toString());
>> >>
>> >>     //close the input stream
>> >>     if(is != null){
>> >>       is.close();
>> >>     }
>> >>     return pdfText;
>> >>   }
>> >>
>> >> }
>> >>
>> >> Thank you for your assistance,
>> >> Ryan
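
On the question of where to keep the parser: one common option, sketched below, is a plain (non-static) instance field that is created lazily on the first call, so each UDF instance builds its parser once and reuses it for every record. This is only an illustration under stated assumptions. ExpensiveParser and Extractor are hypothetical stand-ins for the Tika parser and the Pig UDF, written against the Java standard library alone, and are not part of either project's API.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class ParserReuseSketch {

    /** Hypothetical stand-in for a parser that is expensive to construct. */
    static final class ExpensiveParser {
        // Counts constructions so the reuse can be observed.
        static final AtomicInteger CONSTRUCTED = new AtomicInteger();

        ExpensiveParser() {
            CONSTRUCTED.incrementAndGet();
        }

        String parse(byte[] content) {
            return "parsed " + content.length + " bytes";
        }
    }

    /** Hypothetical stand-in for the UDF class; exec() is called once per record. */
    static final class Extractor {
        // Instance field, not static: each Extractor owns one parser.
        private ExpensiveParser parser;

        String exec(byte[] content) {
            if (content == null) {
                return "N/A";
            }
            if (parser == null) {
                // Built on first use only, then reused for later records.
                parser = new ExpensiveParser();
            }
            return parser.parse(content);
        }
    }

    public static void main(String[] args) {
        Extractor udf = new Extractor();
        for (int i = 1; i <= 3; i++) {
            System.out.println(udf.exec(new byte[i * 10]));
        }
        System.out.println("parsers constructed: " + ExpensiveParser.CONSTRUCTED.get());
    }
}
```

A static field would also avoid repeated construction, but it would be shared by every instance in the same JVM, which only makes sense if the parser is safe to share. An instance field sidesteps that question.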

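
The string immutability point raised earlier in the thread can be reproduced in isolation. The sketch below uses only the Java standard library; the class and method names are made up for illustration and the input values are arbitrary.

```java
public class ConcatVsBuilder {

    /** Mirrors the pattern in the original UDF: the result of concat() is discarded. */
    static String discardedConcat(int size, String body) {
        String text = "";
        text.concat(String.valueOf(size)); // returns a new String that is thrown away
        text.concat(" : ");
        text.concat(body);
        return text; // still the empty string
    }

    /** Accumulates the same pieces in a mutable StringBuilder. */
    static String withBuilder(int size, String body) {
        StringBuilder text = new StringBuilder();
        text.append(size).append(" : ").append(body);
        return text.toString();
    }

    public static void main(String[] args) {
        System.out.println("[" + discardedConcat(42, "hello") + "]"); // prints "[]"
        System.out.println("[" + withBuilder(42, "hello") + "]");     // prints "[42 : hello]"
    }
}
```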