Pig HBase Scan Performance using HBaseStorage API

2014-12-06 Thread Krishna Kalyan
Hi,
Would there be a performance difference query1 vs query2?

*query1 :*
cc = LOAD '$TBL_CLEARCODE'
     USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
         'cf_data:cq_description cf_data:cq_category cf_data:cq_dqtimestamp cf_data:cq_checkarray',
         '-loadKey true')
     as (key, description, category, ActiveStagTmStamp, transformArray);

*query2:*
cc = LOAD '$TBL_CLEARCODE'
     USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
         'cf_data:cq_description cf_data:cq_category cf_data:cq_dqtimestamp cf_data:cq_checkarray',
         '-loadKey true *-maxTimestamp $CORR_DATE*')
     as (key, description, category, ActiveStagTmStamp, transformArray);

The only difference between the two queries is the -maxTimestamp parameter in
query2.

Regards,
Krishna
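
For context: as far as I can tell, HBaseStorage applies -maxTimestamp to the
underlying Scan as a time range (via Scan.setTimeRange), so the filtering
happens on the region servers and older cells are never shipped to the Pig
mappers. In that case query2 reads at most as much data as query1, and
usually less. A rough Java sketch of the Scan that query2 would set up (the
class name and the corrDate parameter standing in for $CORR_DATE are just for
illustration):

import java.io.IOException;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch only: roughly the Scan behind query2, assuming -maxTimestamp maps
// onto Scan.setTimeRange. The time-range check runs on the region servers,
// so cells with timestamps at or after corrDate are skipped before any data
// reaches the mappers.
public class Query2ScanSketch {
    static Scan buildScan(long corrDate) throws IOException { // corrDate stands in for $CORR_DATE
        Scan scan = new Scan();
        byte[] cf = Bytes.toBytes("cf_data");
        scan.addColumn(cf, Bytes.toBytes("cq_description"));
        scan.addColumn(cf, Bytes.toBytes("cq_category"));
        scan.addColumn(cf, Bytes.toBytes("cq_dqtimestamp"));
        scan.addColumn(cf, Bytes.toBytes("cq_checkarray"));
        scan.setTimeRange(0L, corrDate); // only cells with timestamp < corrDate are returned
        return scan;
    }
}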


Pig Writing a Pig UDF for checkAndPut using HBaseStorage API

2014-12-06 Thread Krishna Kalyan
Hi,

Currently we have all our batch processing written in Pig. I need to store
some data into HBase using Pig. Before storing the data, I need to check
whether the value is already present; if it is, don't put.

I plan to write a Pig UDF to do this, as all our data pipelines use Pig.

My sample code is below:

HTable test_table = new HTable(conf, "test");
Put put = new Put(Bytes.toBytes("1"));
put.add(Bytes.toBytes("c1"), Bytes.toBytes("a"), Bytes.toBytes("1abc"));
test_table.checkAndPut(Bytes.toBytes("1"), Bytes.toBytes("c1"),
    Bytes.toBytes("a"), Bytes.toBytes("11abc"), put);

I don't see any changes in my table.

Would appreciate guidance/references to achieve this.

Regards,

Krishna
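
One thing to note about the sample: checkAndPut applies the Put only when the
current value of the check cell equals the supplied expected value, so if the
c1:a cell does not currently hold "11abc" nothing is written, which would
explain seeing no change. To express "store only if the value is not already
there", the expected value is passed as null. A minimal sketch of a UDF along
those lines (the class name is hypothetical; table and column names follow the
sample above):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Sketch: takes (rowKey, value) and writes value to c1:a only if that cell
// is currently empty. Returns true if the Put was applied, false otherwise.
public class CheckAndPutIfAbsent extends EvalFunc<Boolean> {
    private HTable table; // reused across tuples instead of reconnecting per call

    @Override
    public Boolean exec(Tuple input) throws IOException {
        if (input == null || input.size() < 2 || input.get(0) == null) {
            return false;
        }
        if (table == null) {
            Configuration conf = HBaseConfiguration.create();
            table = new HTable(conf, "test"); // table name taken from the sample above
        }
        byte[] row = Bytes.toBytes((String) input.get(0));
        byte[] value = Bytes.toBytes((String) input.get(1));
        Put put = new Put(row);
        put.add(Bytes.toBytes("c1"), Bytes.toBytes("a"), value);
        // With null as the expected value, the Put is applied only if the
        // c1:a cell is currently empty, i.e. put-if-absent.
        return table.checkAndPut(row, Bytes.toBytes("c1"), Bytes.toBytes("a"), null, put);
    }
}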



Re: Help with Pig UDF?

2014-12-06 Thread Ryan
Got it, thanks! Any idea why Tika might not be working? I've been testing
and while no exceptions are being thrown, neither is anything being
appended when I call pdfText.append(contenthandler.toString());

On Fri, Dec 5, 2014 at 6:21 PM, Pradeep Gollakota pradeep...@gmail.com
wrote:

 A static variable is not necessary... a simple instance variable is just
 fine.
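
 A minimal sketch of that suggestion, reusing the class name from the thread
 (the lazy initialization inside exec is illustrative):

 import java.io.IOException;
 import org.apache.pig.EvalFunc;
 import org.apache.pig.data.Tuple;
 import org.apache.tika.detect.DefaultDetector;
 import org.apache.tika.parser.AutoDetectParser;

 public class ExtractTextFromPDFs extends EvalFunc<String> {
     // Instance field: built on first use and reused for every tuple this
     // UDF instance processes, instead of being re-created inside exec().
     private AutoDetectParser pdfParser;

     @Override
     public String exec(Tuple input) throws IOException {
         if (pdfParser == null) {
             pdfParser = new AutoDetectParser(new DefaultDetector());
         }
         // ... parse input.get(0) with pdfParser exactly as in the original UDF ...
         return "";
     }
 }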

 On Fri Dec 05 2014 at 2:27:53 PM Ryan freelanceflashga...@gmail.com
 wrote:

  After running it with updated code, it seems like the problem has to do
  with something related to Tika, since my output says that my input is the
  correct number of bytes (i.e. it's actually being sent in correctly).
  Going to test further to narrow down the problem.

  Pradeep, would you recommend using a static variable inside the
  ExtractTextFromPDFs function to store the PdfParser once it has been
  initialized? I'm still learning how to best do things within the
  Pig/MapReduce/Hadoop framework.

  Ryan

  On Fri, Dec 5, 2014 at 1:35 PM, Ryan freelanceflashga...@gmail.com wrote:
 
   Thanks Pradeep! I'll give it a try and report back
  
   Ryan
  
   On Fri, Dec 5, 2014 at 12:30 PM, Pradeep Gollakota pradeep...@gmail.com
   wrote:

   I forgot to mention earlier that you should probably move the PdfParser
   initialization code out of the evaluate method. This will probably cause
   a significant overhead both in terms of gc and runtime performance. You'll
   want to initialize your parser once and evaluate all your docs against it.

   - Pradeep

   On Fri Dec 05 2014 at 9:18:16 AM Pradeep Gollakota pradeep...@gmail.com
   wrote:
  
    Java strings are immutable, so pdfText.concat() returns a new string and
    the original string is left unmolested. So at the end, all you're doing
    is returning an empty string. Instead, you can do pdfText =
    pdfText.concat(...). But the better way to write it is to use a
    StringBuilder.

    StringBuilder pdfText = ...;
    pdfText.append(...);
    pdfText.append(...);
    ...
    return pdfText.toString();

    On Fri Dec 05 2014 at 9:12:37 AM Ryan freelanceflashga...@gmail.com
    wrote:
   
    Hi,

    I'm working on an open source project attempting to convert raw content
    from a pdf (stored as a databytearray) into plain text using a Pig UDF
    and Apache Tika. I could use your help. For some reason, the UDF I'm
    using isn't working. The script succeeds but no output is written. *This
    is the Pig script I'm following:*

    register 'target/warcbase-0.1.0-SNAPSHOT-fatjar.jar';
    DEFINE ExtractTextFromPDFs org.warcbase.pig.piggybank.ExtractTextFromPDFs();
    DEFINE ArcLoader org.warcbase.pig.ArcLoader();

    raw = load '/data/arc/sample.arc' using ArcLoader
          as (url: chararray, date: chararray, mime: chararray, content: bytearray); --load the data

    a = FILTER raw BY (url matches '.*\\.pdf$'); --gets all PDF pages from the arc file
    b = LIMIT a 2; --limit to 2 pages to speed up testing time
    c = foreach b generate url, ExtractTextFromPDFs(content);
    store c into 'output/pdf_test';
   
   
*This is the UDF I wrote:*
   
    public class ExtractTextFromPDFs extends EvalFunc<String> {

        @Override
        public String exec(Tuple input) throws IOException {
            String pdfText = "";

            if (input == null || input.size() == 0 || input.get(0) == null) {
                return "N/A";
            }

            DataByteArray dba = (DataByteArray) input.get(0);
            pdfText.concat(String.valueOf(dba.size())); //my attempt at debugging. Nothing written

            InputStream is = new ByteArrayInputStream(dba.get());

            ContentHandler contenthandler = new BodyContentHandler();
            Metadata metadata = new Metadata();
            DefaultDetector detector = new DefaultDetector();
            AutoDetectParser pdfparser = new AutoDetectParser(detector);

            try {
                pdfparser.parse(is, contenthandler, metadata, new ParseContext());
            } catch (SAXException | TikaException e) {
                // TODO Auto-generated catch block
                e.printStackTrace();
            }
            pdfText.concat(" : "); //another attempt at debugging. Still nothing written
            pdfText.concat(contenthandler.toString());

            //close the input stream
            if (is != null) {
                is.close();
            }
            return pdfText;
        }

    }
   
Thank you for your assistance,
Ryan
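
Putting Pradeep's two suggestions above together (build the result with a
StringBuilder instead of discarding concat()'s return value, and reuse the
parser across calls), exec would look roughly like the following. This is a
sketch under those assumptions, meant to drop into the existing
ExtractTextFromPDFs class, not a tested fix; pdfParser is assumed to be an
AutoDetectParser instance field initialized once.

@Override
public String exec(Tuple input) throws IOException {
    if (input == null || input.size() == 0 || input.get(0) == null) {
        return "N/A";
    }
    DataByteArray dba = (DataByteArray) input.get(0);
    StringBuilder pdfText = new StringBuilder();
    pdfText.append(dba.size()).append(" : "); // keep the debugging markers from the original

    // pdfParser: instance field reused across tuples, as suggested earlier
    // in the thread. try-with-resources closes the stream for us.
    try (InputStream is = new ByteArrayInputStream(dba.get())) {
        ContentHandler contenthandler = new BodyContentHandler();
        pdfParser.parse(is, contenthandler, new Metadata(), new ParseContext());
        pdfText.append(contenthandler.toString());
    } catch (SAXException | TikaException e) {
        throw new IOException("PDF parse failed", e);
    }
    return pdfText.toString();
}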