Re: Help with Pig UDF?

Pradeep Gollakota Fri, 05 Dec 2014 15:22:50 -0800

A static variable is not necessary... a simple instance variable is just
fine.
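
To illustrate, here is a minimal sketch of the instance-variable approach combined with the StringBuilder fix. ExpensiveParser and ExtractText are hypothetical stand-ins (not Tika or Pig classes) so the example runs on its own. In the real UDF the field would hold the AutoDetectParser and the class would extend EvalFunc<String>. It assumes exec is called repeatedly on the same UDF instance within a task, so the parser is built on the first call and reused afterwards.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class InstanceFieldSketch {

    /** Hypothetical stand-in for a parser that is costly to construct. */
    public static final class ExpensiveParser {
        // Counts constructions so the reuse can be observed.
        public static final AtomicInteger CREATED = new AtomicInteger();

        public ExpensiveParser() {
            CREATED.incrementAndGet();
        }

        public String parse(byte[] content) {
            return content.length + " bytes";
        }
    }

    /** Hypothetical stand-in for the UDF; the real one would extend EvalFunc<String>. */
    public static final class ExtractText {
        // Plain instance variable, not static: created lazily, then reused.
        private ExpensiveParser parser;

        public String exec(byte[] content) {
            if (content == null) {
                return "N/A";
            }
            if (parser == null) {
                parser = new ExpensiveParser();
            }
            // StringBuilder is mutable, so append() accumulates text in place.
            StringBuilder text = new StringBuilder();
            text.append(content.length);
            text.append(" : ");
            text.append(parser.parse(content));
            return text.toString();
        }
    }

    public static void main(String[] args) {
        ExtractText udf = new ExtractText();
        System.out.println(udf.exec(new byte[3]));
        System.out.println(udf.exec(new byte[5]));
        System.out.println("parsers created: " + ExpensiveParser.CREATED.get());
    }
}
```

Both calls go through the same parser object, so the construction cost is paid once per UDF instance rather than once per document.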


On Fri Dec 05 2014 at 2:27:53 PM Ryan <freelanceflashga...@gmail.com> wrote:

> After running it with updated code, it seems like the problem has to do
> with something related to Tika since my output says that my input is the
> correct number of bytes (i.e. it's actually being sent in correctly). Going
> to test further to narrow down the problem.
>
> Pradeep, would you recommend using a static variable inside the
> ExtractTextFromPDFs function to store the PdfParser after it has been
> initialized? I'm still learning how best to do things within the
> Pig/MapReduce/Hadoop framework.
>
> Ryan
>
> On Fri, Dec 5, 2014 at 1:35 PM, Ryan <freelanceflashga...@gmail.com>
> wrote:
>
> > Thanks Pradeep! I'll give it a try and report back
> >
> > Ryan
> >
> > On Fri, Dec 5, 2014 at 12:30 PM, Pradeep Gollakota <pradeep...@gmail.com>
> > wrote:
> >
> >> I forgot to mention earlier that you should probably move the PdfParser
> >> initialization code out of the exec method. Creating a new parser on
> >> every call will probably cause significant overhead, both in GC and in
> >> runtime performance. You'll want to initialize your parser once and
> >> evaluate all your docs against it.
> >>
> >> - Pradeep
> >>
> >> On Fri Dec 05 2014 at 9:18:16 AM Pradeep Gollakota <pradeep...@gmail.com>
> >> wrote:
> >>
> >> > Java strings are immutable, so "pdfText.concat()" returns a new string
> >> > and leaves the original string unchanged. In the end, all you're doing
> >> > is returning an empty string. Instead, you can write
> >> > "pdfText = pdfText.concat(...)", but the better way is to use a
> >> > StringBuilder:
> >> >
> >> > StringBuilder pdfText = ...;
> >> > pdfText.append(...);
> >> > pdfText.append(...);
> >> > ...
> >> > return pdfText.toString();
> >> >
> >> > On Fri Dec 05 2014 at 9:12:37 AM Ryan <freelanceflashga...@gmail.com>
> >> > wrote:
> >> >
> >> >> Hi,
> >> >>
> >> >> I'm working on an open source project that attempts to convert the raw
> >> >> content of a PDF (stored as a DataByteArray) into plain text using a Pig
> >> >> UDF and Apache Tika, and I could use your help. For some reason, the UDF
> >> >> isn't working: the script succeeds but no output is written. *This is
> >> >> the Pig script I'm following:*
> >> >>
> >> >> register 'target/warcbase-0.1.0-SNAPSHOT-fatjar.jar';
> >> >> DEFINE ExtractTextFromPDFs
> >> >>  org.warcbase.pig.piggybank.ExtractTextFromPDFs();
> >> >> DEFINE ArcLoader org.warcbase.pig.ArcLoader();
> >> >>
> >> >> raw = load '/data/arc/sample.arc' using ArcLoader as (url: chararray,
> >> >>     date: chararray, mime: chararray, content: bytearray); --load the data
> >> >>
> >> >> --gets all PDF pages from the arc file
> >> >> a = FILTER raw BY (url matches '.*\\.pdf$');
> >> >> b = LIMIT a 2; --limit to 2 pages to speed up testing time
> >> >> c = foreach b generate url, ExtractTextFromPDFs(content);
> >> >> store c into 'output/pdf_test';
> >> >>
> >> >>
> >> >> *This is the UDF I wrote:*
> >> >>
> >> >> public class ExtractTextFromPDFs extends EvalFunc<String> {
> >> >>
> >> >>   @Override
> >> >>   public String exec(Tuple input) throws IOException {
> >> >>       String pdfText = "";
> >> >>
> >> >>       if (input == null || input.size() == 0 || input.get(0) == null) {
> >> >>           return "N/A";
> >> >>       }
> >> >>
> >> >>       DataByteArray dba = (DataByteArray)input.get(0);
> >> >>       pdfText.concat(String.valueOf(dba.size())); //my attempt at debugging. Nothing written
> >> >>
> >> >>       InputStream is = new ByteArrayInputStream(dba.get());
> >> >>
> >> >>       ContentHandler contenthandler = new BodyContentHandler();
> >> >>       Metadata metadata = new Metadata();
> >> >>       DefaultDetector detector = new DefaultDetector();
> >> >>       AutoDetectParser pdfparser = new AutoDetectParser(detector);
> >> >>
> >> >>       try {
> >> >>         pdfparser.parse(is, contenthandler, metadata, new ParseContext());
> >> >>       } catch (SAXException | TikaException e) {
> >> >>         // TODO Auto-generated catch block
> >> >>         e.printStackTrace();
> >> >>       }
> >> >>       pdfText.concat(" : "); //another attempt at debugging. Still nothing written
> >> >>       pdfText.concat(contenthandler.toString());
> >> >>
> >> >>       //close the input stream
> >> >>       if(is != null){
> >> >>         is.close();
> >> >>       }
> >> >>       return pdfText;
> >> >>   }
> >> >>
> >> >> }
> >> >>
> >> >> Thank you for your assistance,
> >> >> Ryan
> >> >>
> >> >
> >>
> >
> >
>
