Hi Micael, the transaction state is kept in memory by the transaction manager, and its edits are written to a write-ahead log so that the state can be reconstructed after a failure.
You are right that the transaction object does not need to be serialized for each put: I opened two improvement Jiras (TEPHRA-233 and -234) to address this. Were you able to clean up the transaction state and rerun your benchmark?

Cheers -Andreas

On Thu, Jun 8, 2017 at 2:02 AM, Micael Capitão <[email protected]> wrote:
> Hi,
>
> (I have inadvertently deleted the previous reply email, so this email is a
> response to my previous email.)
>
> Probably I have lots of invalidated transactions because of the first
> tests I was performing, which were taking more than 30s per transaction. It
> is possible that the invalidated transactions have piled up.
>
> Below are the stats on the Transaction object. And yes, I have lots of
> invalid transactions, and that explains the absurd size I am getting for the
> serialized representation. Where does Tephra store that? ZooKeeper?
>
> 2017-06-07 09:50:08 INFO TransactionAwareHTableFix:109 - startTx Encoded
> transaction size: 104203 bytes
> 2017-06-07 09:50:08 INFO TransactionAwareHTableFix:110 - inprogress Tx: 0
> 2017-06-07 09:50:08 INFO TransactionAwareHTableFix:111 - invalid Tx: 13015
> 2017-06-07 09:50:08 INFO TransactionAwareHTableFix:112 - checkpoint write
> pointers: 0
>
> Another question: can the Transaction object change outside the
> startTx and updateTx calls? I was wondering if it is really necessary to
> serialize it on each single operation.
>
>
> Regards.
>
>
> On 31/05/17 09:49, Micael Capitão wrote:
>
>> Hi all,
>>
>> I've been testing Tephra 0.11.0 for a project that may need transactions
>> on top of HBase, and I find its performance, for instance for a bulk load,
>> very poor. Let's not discuss why I am doing a bulk load with transactions.
>>
>> In my use case I am generating batches of ~10000 elements and inserting
>> them with the *put(List<Put> puts)* method. There are no concurrent writers
>> or readers.
>> If I do the put without transactions it takes ~0.5s. If I use the
>> *TransactionAwareHTable* it takes ~12s.
>> I've tracked down the performance killer to be
>> *addToOperation(OperationWithAttributes op, Transaction tx)*, more
>> specifically the *txCodec.encode(tx)* call.
>>
>> I've created a TransactionAwareHTableFix with the *addToOperation(txPut,
>> tx)* call commented out, used it in my code, and each batch started to take
>> ~0.5s.
>>
>> I've noticed that inside the *TransactionCodec* you are instantiating a
>> new TSerializer and TDeserializer on each call to encode/decode. I tried
>> instantiating the ser/deser in the constructor, but even that way each of my
>> batches would take the same ~12s.
>>
>> Further investigation has shown me that the Transaction instance, after
>> being encoded by the TransactionCodec, is 104171 bytes long. So in my
>> 10000-element batches, ~970MB is metadata. Is that supposed to happen?
>>
>>
>> Regards,
>>
>> Micael Capitão
>>
>
> --
>
> Micael Capitão
> *BIG DATA ENGINEER*
>
> *E-mail:* [email protected]
> *Mobile:* (+351) 91 260 94 27 | *Skype:* micaelcapitao
>
> Xpand IT | Delivering Innovation and Technology
> Phone: (+351) 21 896 71 50
> Fax: (+351) 21 896 71 51
> Site: www.xpand-it.com <http://www.xpand-it.com>
>
> Facebook <http://www.xpand-it.com/facebook> Linkedin <
> http://www.xpand-it.com/linkedin> Twitter <http://www.xpand-it.com/twitter>
> Youtube <http://www.xpand-it.com/youtube>
>
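For context, the improvement Andreas refers to boils down to this: a Transaction only changes on startTx/updateTx (and checkpoints), so the encoded bytes can be computed once per transaction and reused for every Put in a batch, instead of calling txCodec.encode(tx) for every single operation. Below is a minimal sketch of that idea; the class name, method names, and the "tephra.tx" attribute key are illustrative assumptions, not the actual TEPHRA-233/234 change.

    import java.io.IOException;

    import org.apache.hadoop.hbase.client.Put;
    import org.apache.tephra.Transaction;
    import org.apache.tephra.TransactionCodec;

    // Sketch: serialize the transaction once, attach the cached bytes to each Put.
    public class CachedTxEncoder {
      private final TransactionCodec codec = new TransactionCodec();
      private byte[] encodedTx;

      // Call whenever the transaction changes (startTx, updateTx, checkpoint).
      public void updateTx(Transaction tx) throws IOException {
        encodedTx = codec.encode(tx);              // pay the serialization cost once
      }

      // Attach the cached bytes to a Put; no per-operation encoding.
      public void addToOperation(Put put) {
        put.setAttribute("tephra.tx", encodedTx);  // hypothetical attribute key
      }
    }

With a ~104 KB encoded Transaction, encoding once per 10000-put batch amounts to ~104 KB of serialization work instead of the ~970 MB of repeated metadata Micael measured.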
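Separately, a quick way to reproduce the stats Micael logged (encoded size, in-progress and invalid counts) when diagnosing a bloated Transaction; a minimal sketch, assuming the getInProgress()/getInvalids() getters and TransactionCodec.encode() from the 0.11.0 API:

    import java.io.IOException;

    import org.apache.tephra.Transaction;
    import org.apache.tephra.TransactionCodec;

    public final class TxStats {
      // Print roughly the same numbers as the TransactionAwareHTableFix log lines above.
      public static void dump(Transaction tx) throws IOException {
        byte[] encoded = new TransactionCodec().encode(tx);
        System.out.println("encoded transaction size: " + encoded.length + " bytes");
        System.out.println("inprogress tx: " + tx.getInProgress().length);
        System.out.println("invalid tx: " + tx.getInvalids().length);
      }
    }

The numbers line up: 13015 invalid write pointers at roughly 8 bytes each (Thrift i64) is about 104 KB, essentially the whole encoded payload, so cleaning up the invalid transaction state, as Andreas suggests, should shrink the encoded Transaction dramatically.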
