Got it, thanks! Any idea why Tika might not be working? I've been testing
and while no exceptions are being thrown, neither is anything being
appended when I call pdfText.append(contenthandler.toString());
On Fri, Dec 5, 2014 at 6:21 PM, Pradeep Gollakota pradeep...@gmail.com
wrote:
A static variable is not necessary... a simple instance variable is just
fine.
On Fri Dec 05 2014 at 2:27:53 PM Ryan freelanceflashga...@gmail.com
wrote:
After running it with the updated code, the problem seems to be related to
Tika: my output shows that the input is the correct number of bytes (i.e.
it's actually being sent in correctly). I'm going to test further to narrow
down the problem.
Pradeep, would you recommend using a static variable inside the
ExtractTextFromPDFs function to store the PdfParser once it has been
initialized? I'm still learning how best to do things within the
Pig/MapReduce/Hadoop framework.
Ryan
On Fri, Dec 5, 2014 at 1:35 PM, Ryan freelanceflashga...@gmail.com
wrote:
Thanks Pradeep! I'll give it a try and report back.
Ryan
On Fri, Dec 5, 2014 at 12:30 PM, Pradeep Gollakota
pradeep...@gmail.com
wrote:
I forgot to mention earlier that you should probably move the PdfParser
initialization code out of the exec method. Initializing the parser on
every call will probably cause significant overhead in terms of both GC
and runtime performance. You'll want to initialize your parser once and
evaluate all your docs against it.
- Pradeep
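A minimal sketch of the "initialize once, reuse on every call" advice above. To keep it runnable without Pig or Tika on the classpath, `ExpensiveParser` and `ExtractText` are hypothetical stand-ins for the real parser and the UDF; only the shape (a parser held in an instance field instead of being built inside the per-record method) is the point.

```java
public class ParserReuseDemo {
    // Hypothetical stand-in for a parser that is costly to construct.
    static class ExpensiveParser {
        static int constructed = 0; // counts how many parsers were built

        ExpensiveParser() {
            constructed++;
        }

        String parse(byte[] content) {
            return "parsed " + content.length + " bytes";
        }
    }

    // Hypothetical stand-in for the UDF: the parser lives in an instance
    // field, so it is built once per UDF instance, not once per record.
    static class ExtractText {
        private final ExpensiveParser parser = new ExpensiveParser();

        String exec(byte[] content) {
            if (content == null) {
                return "N/A";
            }
            return parser.parse(content);
        }
    }

    public static void main(String[] args) {
        ExtractText udf = new ExtractText();
        for (int i = 1; i <= 3; i++) {
            System.out.println(udf.exec(new byte[i]));
        }
        System.out.println("parsers constructed: " + ExpensiveParser.constructed);
    }
}
```

In the real UDF the field would hold the Tika parser, while per-document objects such as the content handler and metadata would presumably still be created inside exec, since they carry state for a single document.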
On Fri Dec 05 2014 at 9:18:16 AM Pradeep Gollakota
pradeep...@gmail.com
wrote:
Java strings are immutable, so pdfText.concat() returns a new string and
leaves the original string untouched. At the end, all you're doing is
returning the original empty string. Instead, you can do pdfText =
pdfText.concat(...). But the better way to write it is to use a
StringBuilder.
StringBuilder pdfText = ...;
pdfText.append(...);
pdfText.append(...);
...
return pdfText.toString();
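A small self-contained sketch of the difference described above. The class and method names (`ConcatDemo`, `withConcat`, `withBuilder`) and the sample values are made up for illustration; the first method repeats the pattern of calling concat() and discarding the result, the second uses a StringBuilder.

```java
public class ConcatDemo {
    // Calls concat() but throws away each returned string,
    // so pdfText never changes.
    static String withConcat() {
        String pdfText = "";
        pdfText.concat("42");
        pdfText.concat(" : ");
        pdfText.concat("body");
        return pdfText;
    }

    // StringBuilder is mutable: each append() modifies the same object.
    static String withBuilder() {
        StringBuilder pdfText = new StringBuilder();
        pdfText.append("42");
        pdfText.append(" : ");
        pdfText.append("body");
        return pdfText.toString();
    }

    public static void main(String[] args) {
        System.out.println("[" + withConcat() + "]");  // prints "[]"
        System.out.println("[" + withBuilder() + "]"); // prints "[42 : body]"
    }
}
```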
On Fri Dec 05 2014 at 9:12:37 AM Ryan
freelanceflashga...@gmail.com
wrote:
Hi,
I'm working on an open source project attempting to convert raw content
from a PDF (stored as a DataByteArray) into plain text using a Pig UDF and
Apache Tika. I could use your help. For some reason, the UDF I'm using
isn't working. The script succeeds but no output is written.
*This is the Pig script I'm following:*
register 'target/warcbase-0.1.0-SNAPSHOT-fatjar.jar';
DEFINE ExtractTextFromPDFs org.warcbase.pig.piggybank.ExtractTextFromPDFs();
DEFINE ArcLoader org.warcbase.pig.ArcLoader();
raw = load '/data/arc/sample.arc' using ArcLoader as (url: chararray, date: chararray, mime: chararray, content: bytearray); --load the data
a = FILTER raw BY (url matches '.*\\.pdf$'); --gets all PDF pages from the arc file
b = LIMIT a 2; --limit to 2 pages to speed up testing time
c = foreach b generate url, ExtractTextFromPDFs(content);
store c into 'output/pdf_test';
*This is the UDF I wrote:*
public class ExtractTextFromPDFs extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        String pdfText = "";
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return "N/A";
        }
        DataByteArray dba = (DataByteArray)input.get(0);
        pdfText.concat(String.valueOf(dba.size())); //my attempt at debugging. Nothing written
        InputStream is = new ByteArrayInputStream(dba.get());
        ContentHandler contenthandler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        DefaultDetector detector = new DefaultDetector();
        AutoDetectParser pdfparser = new AutoDetectParser(detector);
        try {
            pdfparser.parse(is, contenthandler, metadata, new ParseContext());
        } catch (SAXException | TikaException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
        pdfText.concat(" : "); //another attempt at debugging. Still nothing written
        pdfText.concat(contenthandler.toString());
        //close the input stream
        if(is != null){
            is.close();
        }
        return pdfText;
    }
}
Thank you for your assistance,
Ryan