Re: Help with Pig UDF?

2014-12-06 Thread Ryan
Got it, thanks! Any idea why Tika might not be working? I've been testing and while no exceptions are being thrown, neither is anything being appended when I call pdfText.append(contenthandler.toString()); On Fri, Dec 5, 2014 at 6:21 PM, Pradeep Gollakota pradeep...@gmail.com wrote: A static

Help with Pig UDF?

2014-12-05 Thread Ryan
Hi, I'm working on an open source project that attempts to convert the raw content of a PDF (stored as a DataByteArray) into plain text using a Pig UDF and Apache Tika, and I could use your help. For some reason, the UDF I'm using isn't working: the script succeeds, but no output is written. *This is the

Re: Help with Pig UDF?

2014-12-05 Thread Pradeep Gollakota
Java strings are immutable, so pdfText.concat() returns a new string and leaves the original unchanged. At the end, all you're doing is returning an empty string. Instead, you can write pdfText = pdfText.concat(...), but the better way is to use a StringBuilder.
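A minimal sketch of the three variants described above, using only the standard library. The class name ConcatDemo, the method names, and the sample strings are made up for illustration; they are not taken from Ryan's UDF.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class; names are illustrative, not from the original UDF.
class ConcatDemo {

    // Buggy variant: concat() returns a new String, and that result is discarded.
    static String joinBroken(List<String> parts) {
        String pdfText = "";
        for (String part : parts) {
            pdfText.concat(part); // return value thrown away, pdfText is unchanged
        }
        return pdfText; // still the empty string
    }

    // Working variant: reassign the result of concat() back to the variable.
    static String joinReassigned(List<String> parts) {
        String pdfText = "";
        for (String part : parts) {
            pdfText = pdfText.concat(part);
        }
        return pdfText;
    }

    // Preferred variant: StringBuilder mutates one buffer instead of
    // allocating a new String on every iteration.
    static String joinWithBuilder(List<String> parts) {
        StringBuilder pdfText = new StringBuilder();
        for (String part : parts) {
            pdfText.append(part);
        }
        return pdfText.toString();
    }

    public static void main(String[] args) {
        List<String> pages = Arrays.asList("page one ", "page two");
        System.out.println("[" + joinBroken(pages) + "]");
        System.out.println("[" + joinReassigned(pages) + "]");
        System.out.println("[" + joinWithBuilder(pages) + "]");
    }
}
```

The reassigned and StringBuilder variants produce the same text; the difference is only in how many intermediate strings get allocated along the way.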

Re: Help with Pig UDF?

2014-12-05 Thread Pradeep Gollakota
I forgot to mention earlier that you should probably move the PdfParser initialization code out of the evaluate method. Initializing the parser on every call will probably cause significant overhead, both in terms of GC and runtime performance. You'll want to initialize your parser once and evaluate all your docs against it. -
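A rough sketch of that suggestion, with no Pig or Tika dependency so it can be run on its own. ExpensiveParser is a hypothetical stand-in for the real PdfParser, and PerCallUdf and ReusingUdf are made-up names; the construction counter exists only to make the difference visible. The parser is held in a plain instance field, not a static one.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a parser that is costly to construct.
class ExpensiveParser {
    static int constructed = 0; // counts constructions, for demonstration only

    ExpensiveParser() {
        constructed++;
    }

    // Placeholder "parse": decodes the bytes as UTF-8 text.
    String parse(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }
}

// Anti-pattern: a new parser is built for every record.
class PerCallUdf {
    String evaluate(byte[] raw) {
        ExpensiveParser parser = new ExpensiveParser();
        return parser.parse(raw);
    }
}

// Suggested shape: the parser is created once per UDF instance and reused.
class ReusingUdf {
    private final ExpensiveParser parser = new ExpensiveParser();

    String evaluate(byte[] raw) {
        return parser.parse(raw);
    }
}

class ParserReuseDemo {
    public static void main(String[] args) {
        byte[] doc = "some document".getBytes(StandardCharsets.UTF_8);

        PerCallUdf perCall = new PerCallUdf();
        for (int i = 0; i < 3; i++) {
            perCall.evaluate(doc);
        }
        System.out.println("per-call constructions: " + ExpensiveParser.constructed);

        ExpensiveParser.constructed = 0;
        ReusingUdf reusing = new ReusingUdf();
        for (int i = 0; i < 3; i++) {
            reusing.evaluate(doc);
        }
        System.out.println("reused constructions: " + ExpensiveParser.constructed);
    }
}
```

Reusing one parser assumes it keeps no per-document state between calls; if the real parser does, that state would need to be reset inside evaluate.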

Re: Help with Pig UDF?

2014-12-05 Thread Ryan
After running it with updated code, it seems like the problem has to do with something related to Tika since my output says that my input is the correct number of bytes (i.e. it's actually being sent in correctly). Going to test further to narrow down the problem. Pradeep, would you recommend

Re: Help with Pig UDF?

2014-12-05 Thread Pradeep Gollakota
A static variable is not necessary... a simple instance variable is just fine. On Fri Dec 05 2014 at 2:27:53 PM Ryan freelanceflashga...@gmail.com wrote: After running it with updated code, it seems like the problem has to do with something related to Tika since my output says that my input is