Re: Incubations

2006-03-17 Thread lichtner

On Fri, 17 Mar 2006, Jason Dillon wrote:

> Prior to escalation to the ASF, a Podling needs to show that:
>
>  * it is a worthy and healthy project;
>  * it truly fits within the ASF framework; and
>  * it "gets" the Apache Way.
> 
>
> Part of the way is to resolve conflict within the community.
> lichtner's comment of "should pack up its code and move somewhere
> else" is IMO sensationalism and is not helpful to the Geronimo
> community, or to the incoming podling communities for which we are
> trying to work out how best to integrate with our own.

To someone who is "ASF or nothing" it was not helpful. To people whose
first priority is the code, it could be helpful.


Incubations

2006-03-17 Thread lichtner

I wanted to see what this incubation problem is all about, so I took a
look at the web site http://incubator.apache.org/resolution.html .

It says that the B.o.D. has determined that it's in "the best interests of
the Foundation" to create this incubator PMC charged with "providing
guidance", to help products engender "their own collaborative community",
and "educating" new developers.

So you do not want to get incubated by Apache unless:

- You care deeply about the Apache Foundation.
- Your project needs "guidance."
- You need your project to have a "community."

Philosophy eventually rises to the surface. This resolution may explain
why we are seeing so many emails about ActiveMQ graduating, etc.

The Apache Foundation is generous in providing resources to open-source
projects, but this is not an entirely selfless act. If your project
consists of one person, for example, it does not qualify as a good ASF
project because it doesn't have a "community". If your project doesn't
believe in democracy, then it's not a good ASF project.

Personally, I would not get a project incubated to help the ASF be all it
can be. I don't necessarily care about the ASF; I do care about Apache
httpd and the other projects which are hosted by the ASF but which might
as well be hosted somewhere else.

I think that if Geronimo is at odds with the ASF's idealistic, abstract
motivations it should pack up its code and move somewhere else where it
can focus on coding. If it is only staying for the free services then
perhaps IBM can donate those.

Not to mention that the project is so big it could have its own
foundation ...


Re: Summary?

2006-03-14 Thread lichtner

On Tue, 14 Mar 2006, David Blevins wrote:

> Provisioning of the actual stateful session bean keys is easy to
> isolate, but as I say inventing a client id that you could use as
> part of a stateful session bean's id is not easy.

Would it be enough to generate a cluster-wide unique id?
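
For what it's worth, a minimal sketch of the kind of thing I mean (the
per-node identifier here is only an assumption for illustration; anything
the cluster agrees is unique per node would do):

import java.net.InetAddress;
import java.net.UnknownHostException;

// Sketch only: builds ids of the form <node>-<bootTime>-<counter>.
// Uniqueness assumes every node has a distinct host name (or you swap in
// some other per-node identifier agreed on by the cluster).
public class ClusterIdGenerator {
    private final String nodeId;
    private final long bootTime = System.currentTimeMillis();
    private long counter = 0;

    public ClusterIdGenerator() throws UnknownHostException {
        nodeId = InetAddress.getLocalHost().getHostName();
    }

    public synchronized String nextId() {
        return nodeId + "-" + bootTime + "-" + (counter++);
    }
}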


New release of EVS4J and performance news

2006-03-12 Thread lichtner

FYI, I put out 1.0b3, which contains a couple of major bug fixes for
problems that showed up specifically on Windows.

Also, a user reported running the local benchmark and getting 25,000
messages per second (= 300 Mbps). My new Turion 64 laptop gets about
240 Mbps, also on a local test.


Re: Apache-licensed version of jgroups?

2006-02-20 Thread lichtner

On Mon, 20 Feb 2006, James Strachan wrote:

> Incidentally I'm interested to see if reliable multicast in Java
> actually makes sense from a performance perspective.

Back in 2000 I was curious also. That's why I wrote an implementation of
totem. EVS4J can send about 7000 messages per second. The latency depends
on the size of the ring. You can tune both together using the windowSize
as a tuning knob.

> JGroups for example is generally pretty slow; we tend to use TCP in
> ActiveMQ if folks want high performance reliable communication and find
> it's way faster; so I'm not sure if reliable multicast in Java would ever
> perform well enough to be worth it from a pure performance perspective.
> I'd love to be proved wrong of course :).

When you have a free hour, see

http://www.bway.net/~lichtner/evs4j.html

I cannot guarantee that you will decide to use evs4j, but I can guarantee
that you will get a kick out of watching it run.

Don't forget to try different window sizes.

> Certainly reliable multicast can use less bandwidth if multiple nodes
> are all consuming the same data, so even if its slower, there are use
> cases for it.
>
> I'm looking forward to whatever you come up with :)

It doesn't sound like people need it, so it probably won't happen.


Re: Apache-licensed version of jgroups?

2006-02-17 Thread lichtner

James,

I looked at the API to try to get an idea of what it does. I tried to
look at the FAQs and other links on activeio.codehaus.org, but I had to
create a Confluence account, and the links were empty (boilerplate
links?).

I am sure you can explain to me what the mission of this project is
supposed to be. But in any case, it seems pretty complex. Without spending
a lot more time on it I just don't know if this is a good foundation for
adding reliable multicasting; specifically, I cannot gauge what kind of
performance you are going to get. I am not interested in building toys -
we have those already. Perhaps you have a specific design in mind for
adding reliable multicast?

If it were up to me, as a first guess, I could see using ActiveIO to
handle state transfer (to add new nodes or migrate replicas) but not for
the multicasting itself. That's just my first impression. Please advise.

Guglielmo

On Fri, 17 Feb 2006, James Strachan wrote:

> Have you taken a look at ActiveIO which we use for various low level
> communication protocols like TCP, NIO, AIO etc...
>
> http://svn.apache.org/repos/asf/incubator/activemq/trunk/activeio/
>
> e.g.
>
> http://svn.apache.org/repos/asf/incubator/activemq/trunk/activeio/
> activeio-core/src/main/java/org/apache/activeio/
>
> it might prove a useful starting place to layer on protocols like
> membership, mnak etc?
>
> We're already using ActiveIO in ActiveMQ and OpenEJB; I'm sure it
> could be used in other places too.
>
> James
>
>
> On 15 Feb 2006, at 19:38, lichtner wrote:
> >
> > Is there any interest in an apache-licensed version of jgroups?
> >
> > I am thinking something along these lines:
> >
> > 1. Well-understood layered architecture, of x-kernel, Ensemble, and
> > JGroups fame.
> >
> > 2. Performance-focused: low thread count per protocol layer (0+),
> > no java
> > serialization.
> >
> > 3. Simple: implement best-of-breed protocols only, and provide
> > pre-assembled protocol stacks.
> >
> > 4. Release 1.0 with basic protocols: membership, mnak, flow
> > control, etc.
> >
> > I would be happiest with an arrangement where somebody junior would
> > code
> > it and I would serve as an advisor (if needed) based on my
> > experience with
> > EVS4J, with the actual coder taking the credit for it.
> >
> > I would like to see Geronimo evolve quickly into an industrial
> > strength
> > server (which I think it will) so I can stop using all the other app
> > servers ..
> >
> > Guglielmo
>
>
> James
> ---
> http://radio.weblogs.com/0112098/
>
>


Re: Apache-licensed version of jgroups?

2006-02-17 Thread lichtner

I will take a look.

On Fri, 17 Feb 2006, James Strachan wrote:

> Have you taken a look at ActiveIO which we use for various low level
> communication protocols like TCP, NIO, AIO etc...
>
> http://svn.apache.org/repos/asf/incubator/activemq/trunk/activeio/
>
> e.g.
>
> http://svn.apache.org/repos/asf/incubator/activemq/trunk/activeio/
> activeio-core/src/main/java/org/apache/activeio/
>
> it might prove a useful starting place to layer on protocols like
> membership, mnak etc?
>
> We're already using ActiveIO in ActiveMQ and OpenEJB; I'm sure it
> could be used in other places too.
>
> James
>
>
> On 15 Feb 2006, at 19:38, lichtner wrote:
> >
> > Is there any interest in an apache-licensed version of jgroups?
> >
> > I am thinking something along these lines:
> >
> > 1. Well-understood layered architecture, of x-kernel, Ensemble, and
> > JGroups fame.
> >
> > 2. Performance-focused: low thread count per protocol layer (0+),
> > no java
> > serialization.
> >
> > 3. Simple: implement best-of-breed protocols only, and provide
> > pre-assembled protocol stacks.
> >
> > 4. Release 1.0 with basic protocols: membership, mnak, flow
> > control, etc.
> >
> > I would be happiest with an arrangement where somebody junior would
> > code
> > it and I would serve as an advisor (if needed) based on my
> > experience with
> > EVS4J, with the actual coder taking the credit for it.
> >
> > I would like to see Geronimo evolve quickly into an industrial
> > strength
> > server (which I think it will) so I can stop using all the other app
> > servers ..
> >
> > Guglielmo
>
>
> James
> ---
> http://radio.weblogs.com/0112098/
>
>


Apache-licensed version of jgroups?

2006-02-15 Thread lichtner

Is there any interest in an apache-licensed version of jgroups?

I am thinking something along these lines:

1. Well-understood layered architecture, of x-kernel, Ensemble, and
JGroups fame (a rough interface sketch follows below).

2. Performance-focused: low thread count per protocol layer (0+), no Java
serialization.

3. Simple: implement best-of-breed protocols only, and provide
pre-assembled protocol stacks.

4. Release 1.0 with basic protocols: membership, mnak, flow control, etc.
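
Here is roughly the kind of layer interface I mean for point 1 (the names
are mine and purely illustrative - not taken from EVS4J, JGroups or
Ensemble):

// Sketch only. Each protocol layer (membership, mnak, flow control, ...)
// implements this interface and gets wired into a stack; messages travel
// down on send and up on receive, and no layer knows anything beyond its
// two neighbours.
public interface ProtocolLayer {
    void setLower(ProtocolLayer lower); // layer closer to the network
    void setUpper(ProtocolLayer upper); // layer closer to the application

    // called by the layer above; typically forwarded to lower.down(...)
    void down(byte[] message);

    // called by the layer below; typically forwarded to upper.up(...)
    void up(byte[] message);
}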

I would be happiest with an arrangement where somebody junior would code
it and I would serve as an advisor (if needed) based on my experience with
EVS4J, with the actual coder taking the credit for it.

I would like to see Geronimo evolve quickly into an industrial-strength
server (which I think it will) so I can stop using all the other app
servers ..

Guglielmo


Re: Oracle XA RAR for G1.0?

2006-02-09 Thread lichtner

On Thu, 9 Feb 2006, Jason Dillon wrote:

> Thanks!  My DBA cleared this for me and now XA is working with 1 Oracle
> DS and 1 ActiveMQ CF. I still can not get the 2 Oracle datasources
> working together with XA though.

Glad it worked out.

> Did anyone have a chance to peek at that URL I mailed describing a
> similar problem?  The one that suggested that some Oracle XA flag needs
> to be set for loosly coupled branches?

I haven't looked at it.


Re: Oracle XA RAR for G1.0?

2006-02-07 Thread lichtner

On Tue, 7 Feb 2006, Jason Dillon wrote:

> I've got a db looking into fixing that for me...
>
> And created https://issues.apache.org/jira/browse/GERONIMO-1599
>
> I'm not sure how to fix this though :-(

It looks like line 219 is setting a null xidFactory. It looks like
xidFactory is a GBean attribute, so perhaps it is null because the
XidFactory GBean does not start properly. I think G has a log that prints
out all the GBeans as they start up. Maybe you can see some earlier error
messages there. Or maybe it's an ordering problem.

I suppose that since this is IoC you could break this if you hacked the
deployment plans, or else it's some kind of oversight.

Who wrote this code?


Re: Oracle XA RAR for G1.0?

2006-02-07 Thread lichtner

Since you crashed so many times and then had to delete the log, which
knows how to clean up the in-doubt transactions, you now have some
transactions which are waiting to be committed or rolled back and are
holding locks (as they should.)

If you have a dba I would get him/her involved.

To do it manually you have to do a select on

DBA_2PC_PENDING

http://download-west.oracle.com/docs/cd/B19306_01/server.102/b14237/statviews_3002.htm#sthref1821

and then do ROLLBACK FORCE or COMMIT FORCE as shown here:

http://download-west.oracle.com/docs/cd/B19306_01/server.102/b14231/ds_txnman.htm#i1007905

If you do not have privileges for the select, you can try to log in as
sys with the default Oracle password:

sqlplus sys/CHANGE_ON_INSTALL

Sometimes they don't bother to change it.

If you grant the JDBC user the ability to select on DBA_2PC_PENDING (or
other appropriate view) then Geronimo (once the NPE is fixed) can settle
these automatically for you.
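
If you prefer to script the inspection rather than use sqlplus, something
along these lines should work (a rough JDBC sketch; the URL, user and
password are placeholders, and it only prints the FORCE statements for you
to review and run by hand):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Sketch only: lists the in-doubt transactions and prints the FORCE
// statements you would then review and run manually in sqlplus.
public class ListInDoubt {
    public static void main(String[] args) throws Exception {
        Class.forName("oracle.jdbc.driver.OracleDriver");
        Connection c = DriverManager.getConnection(
                "jdbc:oracle:thin:@dbhost:1521:mydb", "user", "password");
        Statement s = c.createStatement();
        ResultSet rs = s.executeQuery(
                "SELECT local_tran_id, global_tran_id, state FROM dba_2pc_pending");
        while (rs.next()) {
            String localId = rs.getString("LOCAL_TRAN_ID");
            System.out.println(localId + "  " + rs.getString("GLOBAL_TRAN_ID")
                    + "  " + rs.getString("STATE"));
            // Decide per transaction whether the other branches committed:
            System.out.println("  -- ROLLBACK FORCE '" + localId + "';");
            System.out.println("  -- or: COMMIT FORCE '" + localId + "';");
        }
        rs.close();
        s.close();
        c.close();
    }
}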

Guglielmo

On Tue, 7 Feb 2006, Jason Dillon wrote:

> It finally dawned on me that my connection to ActiveMQ using:
>
> vm://localhost?asyncSend=true
>
> Was a bad idea.  So I tried using these:
>
>  * vm://localhost
>  * tcp://localhost:61616
>
> Both of which don't hang... but now were back to more Oracle exceptions:
>
> 
> 18:24:47,683 WARN  [JDBCExceptionReporter] SQL Error: 1591, SQLState: 72000
> 18:24:47,683 ERROR [JDBCExceptionReporter] ORA-01591: lock held by
> in-doubt distributed transaction 6.28.6034
>
> 18:24:47,684 WARN  [JDBCExceptionReporter] SQL Error: 1591, SQLState: 72000
> 18:24:47,684 ERROR [JDBCExceptionReporter] ORA-01591: lock held by
> in-doubt distributed transaction 6.28.6034
>
> 18:24:47,686 WARN  [JDBCExceptionReporter] SQL Error: 1591, SQLState: 72000
> 18:24:47,686 ERROR [JDBCExceptionReporter] ORA-01591: lock held by
> in-doubt distributed transaction 6.28.6034
>
> 18:24:47,686 WARN  [JDBCExceptionReporter] SQL Error: 1591, SQLState: 72000
> 18:24:47,686 ERROR [JDBCExceptionReporter] ORA-01591: lock held by
> in-doubt distributed transaction 6.28.6034
>
> 18:24:47,687 ERROR [JDBCExceptionReporter] Could not execute JDBC batch update
> java.sql.BatchUpdateException: ORA-01591: lock held by in-doubt
> distributed transaction 6.28.6034
>
> at 
> oracle.jdbc.driver.DatabaseError.throwBatchUpdateException(DatabaseError.java:498)
> at 
> oracle.jdbc.driver.OraclePreparedStatement.executeBatch(OraclePreparedStatement.java:12368)
> at 
> org.tranql.connector.jdbc.StatementHandle.executeBatch(StatementHandle.java:157)
> at 
> net.sf.hibernate.impl.BatchingBatcher.doExecuteBatch(BatchingBatcher.java:54)
> at 
> net.sf.hibernate.impl.BatcherImpl.executeBatch(BatcherImpl.java:126)
> at net.sf.hibernate.impl.SessionImpl.executeAll(SessionImpl.java:2421)
> at net.sf.hibernate.impl.SessionImpl.execute(SessionImpl.java:2371)
> at net.sf.hibernate.impl.SessionImpl.flush(SessionImpl.java:2240)
> at 
> org.springframework.orm.hibernate.HibernateTemplate$22.doInHibernate(HibernateTemplate.java:595)
> at 
> org.springframework.orm.hibernate.HibernateTemplate.execute(HibernateTemplate.java:312)
> at 
> org.springframework.orm.hibernate.HibernateTemplate.flush(HibernateTemplate.java:593)
> at 
> com.solidusnetworks.paycore.util.hibernate.BaseDAOHibernate.save(BaseDAOHibernate.java:176)
> at 
> com.solidusnetworks.ach.oltp.dao.impl.ECheckDAOHibernate.saveECheck(ECheckDAOHibernate.java:208)
> at 
> com.solidusnetworks.paycore.ach.model.check.service.CheckUpdateServiceBean.addECheck(CheckUpdateServiceBean.java:117)
> at 
> com.solidusnetworks.paycore.ach.model.check.service.CheckUpdateServiceBean$$FastClassByCGLIB$$70674c74.invoke()
> at 
> org.openejb.dispatch.AbstractMethodOperation.invoke(AbstractMethodOperation.java:90)
> at org.openejb.slsb.BusinessMethod.execute(BusinessMethod.java:67)
> at 
> org.openejb.dispatch.DispatchInterceptor.invoke(DispatchInterceptor.java:72)
> at 
> org.apache.geronimo.naming.java.ComponentContextInterceptor.invoke(ComponentContextInterceptor.java:56)
> at 
> org.openejb.ConnectionTrackingInterceptor.invoke(ConnectionTrackingInterceptor.java:81)
> at 
> org.openejb.transaction.ContainerPolicy$TxRequired.invoke(ContainerPolicy.java:119)
> at 
> org.openejb.transaction.TransactionContextInterceptor.invoke(TransactionContextInterceptor.java:80)
> at 
> org.openejb.slsb.StatelessInstanceInterceptor.invoke(StatelessInstanceInterceptor.java:98)
> at 
> org.openejb.transaction.ContainerPolicy$TxRequired.invoke(ContainerPolicy.java:119)
> at 
> org.openejb.transaction.TransactionContextInterceptor.invoke(TransactionContextInterceptor.java:80)
> at 
> org.openejb.SystemExceptionInterceptor.invoke(SystemExceptionInterceptor.java:82)
> at 
> org.openejb.GenericEJBContainer.invoke(GenericEJBContainer.java:238)
> at 

Re: Oracle XA RAR for G1.0?

2006-02-07 Thread lichtner

It just sounds like a bug, I guess.


On Tue, 7 Feb 2006, Jason Dillon wrote:

> I'm not saying it won't work... but it's definitely not happy with
> TranQL with its throwing an exception for a metadata query instead of
> returning false.
>
> --jason
>
>
> On 2/7/06, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> >
> > 9.2.x.x does work with XA.
> >
> > > I'm going to retest everything with the 10.2.0.1.0_g driver... since
> > > 9.2.* was whack for non-xa I'm not sure that anything would work as
> > > expected.
> > >
> > > --jason
> > >
> > >
> > > On 2/7/06, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> > >> > I have a feeling that something else is wrong, as I mentioned before I
> > >> > see hanging transactions when using the local adapter in local-tx
> > >> > mode.  And when I ctrl-c G it corrupts the txlog each time... which is
> > >> > very bad IMO.
> > >>
> > >> What do you mean by "corrupts"? Do you mean that the transaction manager
> > >> does not perform recovery properly upon boot?
> > >>
> > >> > I'm starting to think this is a god must hate jason problem more than
> > >> > anything else :-(
> > >>
> > >> Since you are getting an XAException.XA_RMERR error while trying to
> > >> enlist
> > >> a resource manager, maybe Oracle is not set up properly to do XA
> > >> transactions for you.
> > >>
> > >> I do remember that to get XAResource.recover() to work for example you
> > >> have to grant the jdbc user certain database catalog privileges -
> > >> because
> > >> it has to do a select on the in-doubt transaction table. It's not
> > >> impossible that you have to do some configuration in the database server
> > >> to be able to enlist properly.
> > >>
> > >> If I were you I would try to run an xa transaction myself by calling new
> > >> OracleXADataSource(), calling setConnectionURL, setPassword,
> > >> setUserName,
> > >> and then getConnection() and getXAResource(), and then
> > >> start/end/prepare/commit. You can do this from the command line. The
> > >> Oracle driver has an example class that does this so you can cut and
> > >> paste.
> > >>
> > >> That could be a big sanity check.
> > >>
> > >> Guglielmo
> > >>
> > >>
> > >>
> > >
> >
> >
> >
>


Re: Hanging transaction

2006-02-07 Thread lichtner

I guess HOWLLog.java line 362 should not be throwing an exception.

> And starting up G right after produces this:
>
> 
> Booting Geronimo Kernel (in Java 1.4.2_09)...
> Started configuration  1/23   0s geronimo/rmi-naming/1.0/car
> 16:15:06,779 ERROR [GBeanInstanceState] Error while starting; GBean is
> now in the FAILED state:
> objectName="geronimo.server:J2EEApplication=null,J2EEModule=geronimo/j2ee-server/1.0/car,J2EEServer=geronimo,j2eeType=TransactionLog,name=HOWLTransactionLog"
> java.lang.NullPointerException
> at
> org.apache.geronimo.transaction.log.HOWLLog$GeronimoReplayListener.onRecord(HOWLLog.java:362)
> at
> org.objectweb.howl.log.xa.XALogger.replayActiveTx(XALogger.java:1059)
> at
> org.apache.geronimo.transaction.log.HOWLLog.doStart(HOWLLog.java:220)
> at
> org.apache.geronimo.gbean.runtime.GBeanInstance.createInstance(GBeanInstance.java:936)
> at
> org.apache.geronimo.gbean.runtime.GBeanInstanceState.attemptFullStart(GBeanInstanceState.java:325)
> at
> org.apache.geronimo.gbean.runtime.GBeanInstanceState.start(GBeanInstanceState.java:110)
> at
> org.apache.geronimo.gbean.runtime.GBeanInstanceState.startRecursive(GBeanInstanceState.java:132)
> at
> org.apache.geronimo.gbean.runtime.GBeanInstance.startRecursive(GBeanInstance.java:537)
> at
> org.apache.geronimo.kernel.basic.BasicKernel.startRecursiveGBean(BasicKernel.java:208)
> at
> org.apache.geronimo.kernel.config.Configuration.startRecursiveGBeans(Configuration.java:315)
> at
> org.apache.geronimo.kernel.config.Configuration$$FastClassByCGLIB$$7f4b4a9b.invoke()
> at net.sf.cglib.reflect.FastMethod.invoke(FastMethod.java:53)
> at
> org.apache.geronimo.gbean.runtime.FastMethodInvoker.invoke(FastMethodInvoker.java:38)
> at
> org.apache.geronimo.gbean.runtime.GBeanOperation.invoke(GBeanOperation.java:118)
> at
> org.apache.geronimo.gbean.runtime.GBeanInstance.invoke(GBeanInstance.java:835)
> at
> org.apache.geronimo.kernel.basic.BasicKernel.invoke(BasicKernel.java:178)
> at
> org.apache.geronimo.kernel.basic.BasicKernel.invoke(BasicKernel.java:173)
> at
> org.apache.geronimo.kernel.config.ConfigurationManagerImpl.start(ConfigurationManagerImpl.java:142)
> at
> org.apache.geronimo.kernel.config.ConfigurationManagerImpl$$FastClassByCGLIB$$fbed85d2.invoke()
> at net.sf.cglib.reflect.FastMethod.invoke(FastMethod.java:53)
> at
> org.apache.geronimo.gbean.runtime.FastMethodInvoker.invoke(FastMethodInvoker.java:38)
> at
> org.apache.geronimo.gbean.runtime.GBeanOperation.invoke(GBeanOperation.java:118)
> at
> org.apache.geronimo.gbean.runtime.GBeanInstance.invoke(GBeanInstance.java:800)
> at
> org.apache.geronimo.gbean.runtime.RawInvoker.invoke(RawInvoker.java:57)
> at
> org.apache.geronimo.kernel.basic.RawOperationInvoker.invoke(RawOperationInvoker.java:36)
> at
> org.apache.geronimo.kernel.basic.ProxyMethodInterceptor.intercept(ProxyMethodInterceptor.java:96)
> at
> org.apache.geronimo.kernel.config.ConfigurationManager$$EnhancerByCGLIB$$ac1e62eb.start()
> at
> org.apache.geronimo.system.main.Daemon.doStartup(Daemon.java:323)
> at org.apache.geronimo.system.main.Daemon.<init>(Daemon.java:82)
> at org.apache.geronimo.system.main.Daemon.main(Daemon.java:404)
> Started configuration  2/23   1s geronimo/j2ee-server/1.0/car
> 16:15:07,603 INFO  [RMIConnectorServer] RMIConnectorServer started at:
> service:jmx:rmi://localhost/jndi/rmi:/JMXConnector
> Started configuration  3/23   1s geronimo/j2ee-security/1.0/car
> Started configuration  4/23   2s geronimo/activemq-broker/1.0/car
> Started configuration  5/23   0s geronimo/activemq/1.0/car
> Started configuration  6/23   0s geronimo/system-database/1.0/car
> Started configuration  7/23   2s geronimo/directory/1.0/car
> Started configuration  8/23   0s geronimo/ldap-realm/1.0/car
> Started configuration  9/23   1s geronimo/jetty/1.0/car
> Started configuration 10/23   0s geronimo/geronimo-gbean-deployer/1.0/car
> Started configuration 11/23   1s geronimo/j2ee-deployer/1.0/car
> Started configuration 12/23   0s geronimo/jetty-deployer/1.0/car
> Started configuration 13/23   0s geronimo/welcome-jetty/1.0/car
> Started configuration 14/23   0s geronimo/ldap-demo-jetty/1.0/car
> Started configuration 15/23   0s geronimo/servlets-examples-jetty/1.0/car
> Started configuration 16/23   2s geronimo/jsp-examples-jetty/1.0/car
> Started configuration 17/23   2s geronimo/webconsole-jetty/1.0/car
> Started configuration 18/23   0s geronimo/uddi-jetty/1.0/car
> Started configuration 19/23   0s geronimo/jmxdebug-jetty/1.0/car
> Started configuration 20/23   4s geronimo/daytrader-derby-jetty/1.0/car
> Started configuration 21/23   0s geronimo/remote-deploy-jetty/1.0/car
> java.lang.IllegalStateException: Cannot retrieve the value for
> non-persistent attribute containerName when GBeanIn

Re: Oracle XA RAR for G1.0?

2006-02-07 Thread lichtner

It's not supposed to do that. It should scan the recovery log, then call
XAResource.recover() before the data source is first used. Since you are
getting an NPE, there may be a bug in the code.
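
For reference, the recovery step I mean is roughly the standard idiom
below (a sketch, not Geronimo's actual code; the TransactionLog interface
is just a stand-in for whatever the recovery log exposes):

import javax.transaction.xa.XAResource;
import javax.transaction.xa.Xid;

// Sketch of the standard recovery idiom: ask the resource manager for its
// in-doubt branches, then commit or roll back each one according to the
// outcome recorded in the transaction log.
public class RecoverySketch {

    // Stand-in for whatever the recovery log exposes.
    public interface TransactionLog {
        boolean wasCommitted(Xid xid);
    }

    public static void settle(XAResource xaRes, TransactionLog log)
            throws Exception {
        Xid[] inDoubt = xaRes.recover(XAResource.TMSTARTRSCAN
                | XAResource.TMENDRSCAN);
        for (int i = 0; i < inDoubt.length; i++) {
            if (log.wasCommitted(inDoubt[i])) {
                xaRes.commit(inDoubt[i], false);
            } else {
                xaRes.rollback(inDoubt[i]);
            }
        }
    }
}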

>> What do you mean by "corrupts"? Do you mean that the transaction manager
>> does not perform recovery properly upon boot?
>
> So, at some point the remote EJB call appears to hang (looking into
> that more now), it never times out, just sits there.  So I kill the
> process (not a -9), so the vm shutdown gracefully (or at least tries
> to).
>
> Then if I try to start it up again, I get NPE when the system boots
> (I've posted the exception previously).
>
> Now it might be possible that recovery failed for some reason... but
> it should not NPE, and it should not cause the application server to
> not load.  Or it should just quickly fail with a reasonable error
> message about failure to recover from txlog and explain how to fix it.
>
> But, so far all I can do is `rm var/txlog/*` and then start up the server
> again.
>
> --jason
>




Re: Oracle XA RAR for G1.0?

2006-02-07 Thread lichtner

9.2.x.x does work with XA.

> I'm going to retest everything with the 10.2.0.1.0_g driver... since
> 9.2.* was whack for non-xa I'm not sure that anything would work as
> expected.
>
> --jason
>
>
> On 2/7/06, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
>> > I have a feeling that something else is wrong, as I mentioned before I
>> > see hanging transactions when using the local adapter in local-tx
>> > mode.  And when I ctrl-c G it corrupts the txlog each time... which is
>> > very bad IMO.
>>
>> What do you mean by "corrupts"? Do you mean that the transaction manager
>> does not perform recovery properly upon boot?
>>
>> > I'm starting to think this is a god must hate jason problem more than
>> > anything else :-(
>>
>> Since you are getting an XAException.XA_RMERR error while trying to
>> enlist
>> a resource manager, maybe Oracle is not set up properly to do XA
>> transactions for you.
>>
>> I do remember that to get XAResource.recover() to work for example you
>> have to grant the jdbc user certain database catalog privileges -
>> because
>> it has to do a select on the in-doubt transaction table. It's not
>> impossible that you have to do some configuration in the database server
>> to be able to enlist properly.
>>
>> If I were you I would try to run an xa transaction myself by calling new
>> OracleXADataSource(), calling setConnectionURL, setPassword,
>> setUserName,
>> and then getConnection() and getXAResource(), and then
>> start/end/prepare/commit. You can do this from the command line. The
>> Oracle driver has an example class that does this so you can cut and
>> paste.
>>
>> That could be a big sanity check.
>>
>> Guglielmo
>>
>>
>>
>




Re: Oracle XA RAR for G1.0?

2006-02-07 Thread lichtner
> I have a feeling that something else is wrong, as I mentioned before I
> see hanging transactions when using the local adapter in local-tx
> mode.  And when I ctrl-c G it corrupts the txlog each time... which is
> very bad IMO.

What do you mean by "corrupts"? Do you mean that the transaction manager
does not perform recovery properly upon boot?

> I'm starting to think this is a god must hate jason problem more than
> anything else :-(

Since you are getting an XAException.XAER_RMERR error while trying to
enlist a resource manager, maybe Oracle is not set up properly to do XA
transactions for you.

I do remember that to get XAResource.recover() to work for example you
have to grant the jdbc user certain database catalog privileges - because
it has to do a select on the in-doubt transaction table. It's not
impossible that you have to do some configuration in the database server
to be able to enlist properly.

If I were you I would try to run an xa transaction myself by calling new
OracleXADataSource(), calling setConnectionURL, setPassword, setUserName,
and then getConnection() and getXAResource(), and then
start/end/prepare/commit. You can do this from the command line. The
Oracle driver has an example class that does this so you can cut and
paste.

That could be a big sanity check.
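
Roughly like this (a sketch from memory; the URL, user, password and table
are placeholders, and the Xid is just a throwaway implementation for the
test):

import java.sql.Connection;
import java.sql.Statement;
import javax.sql.XAConnection;
import javax.transaction.xa.XAResource;
import javax.transaction.xa.Xid;
import oracle.jdbc.xa.client.OracleXADataSource;

// Sketch only: runs a single XA transaction end-to-end against Oracle.
public class XASanityCheck {
    public static void main(String[] args) throws Exception {
        OracleXADataSource ds = new OracleXADataSource();
        ds.setURL("jdbc:oracle:thin:@dbhost:1521:mydb");
        ds.setUser("user");
        ds.setPassword("password");

        XAConnection xaConn = ds.getXAConnection();
        XAResource xaRes = xaConn.getXAResource();
        Connection conn = xaConn.getConnection();

        Xid xid = new DummyXid();
        xaRes.start(xid, XAResource.TMNOFLAGS);
        Statement s = conn.createStatement();
        s.executeUpdate("UPDATE some_table SET some_col = some_col"); // any harmless DML
        s.close();
        xaRes.end(xid, XAResource.TMSUCCESS);

        if (xaRes.prepare(xid) == XAResource.XA_OK) {
            xaRes.commit(xid, false);
        }
        conn.close();
        xaConn.close();
    }

    // Throwaway Xid for the test; real ids come from the transaction manager.
    static class DummyXid implements Xid {
        public int getFormatId() { return 0x1234; }
        public byte[] getGlobalTransactionId() { return new byte[] { 1 }; }
        public byte[] getBranchQualifier() { return new byte[] { 1 }; }
    }
}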

Guglielmo




Re: Oracle XA RAR for G1.0?

2006-02-06 Thread lichtner

-3 should be javax.transaction.xa.XAException.XAER_RMERR:

http://java.sun.com/j2ee/1.4/docs/api/constant-values.html#javax.transaction
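
If you need to decode one of these numbers again, a quick throwaway trick
is to reflect over the XAException constants:

import java.lang.reflect.Field;
import javax.transaction.xa.XAException;

// Throwaway helper: prints the XAException constant name(s) matching a
// numeric XA error code, e.g. -3 -> XAER_RMERR.
public class XAErrorCodes {
    public static void main(String[] args) throws Exception {
        int code = Integer.parseInt(args[0]);
        Field[] fields = XAException.class.getFields();
        for (int i = 0; i < fields.length; i++) {
            if (fields[i].getType() == int.class
                    && fields[i].getInt(null) == code) {
                System.out.println(fields[i].getName());
            }
        }
    }
}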

Anyhow, you are not actually enlisting the Oracle resource manager, so
that's a step in the right direction.

I think that Geronimo is not printing the stack trace. If you can, maybe
add a printStackTrace() and recompile.

What version of Oracle server and driver are you using?

On Mon, 6 Feb 2006, Jason Dillon wrote:

> And the plot thickens... this warning is issued before the exceptions
> are dumped:
>
> 
> 21:21:23,050 WARN  [Transaction] Unable to enlist XAResource
> [EMAIL PROTECTED], e
> rrorCode: -3
> oracle.jdbc.xa.OracleXAException
>  at oracle.jdbc.xa.OracleXAResource.checkError
> (OracleXAResource.java:1190)
>  at oracle.jdbc.xa.client.OracleXAResource.start
> (OracleXAResource.java:311)
>  at
> org.apache.geronimo.transaction.manager.WrapperNamedXAResource.start
> (WrapperNamedXAResource.java:86)
>  at
> org.apache.geronimo.transaction.manager.TransactionImpl.enlistResource
> (TransactionImpl.java:166)
>  at
> org.apache.geronimo.transaction.context.InheritableTransactionContext.en
> listResource(InheritableTransactionContext.java:92)
>  at
> org.apache.geronimo.connector.outbound.TransactionEnlistingInterceptor.g
> etConnection(TransactionEnlistingInterceptor.java:53)
>  at
> org.apache.geronimo.connector.outbound.TransactionCachingInterceptor.get
> Connection(TransactionCachingInterceptor.java:81)
>  at
> org.apache.geronimo.connector.outbound.ConnectionHandleInterceptor.getCo
> nnection(ConnectionHandleInterceptor.java:43)
>  at
> org.apache.geronimo.connector.outbound.TCCLInterceptor.getConnection
> (TCCLInterceptor.java:39)
>  at
> org.apache.geronimo.connector.outbound.ConnectionTrackingInterceptor.get
> Connection(ConnectionTrackingInterceptor.java:66)
>  at
> org.apache.geronimo.connector.outbound.AbstractConnectionManager.allocat
> eConnection(AbstractConnectionManager.java:57)
>  at org.tranql.connector.jdbc.DataSource.getConnection
> (DataSource.java:56)
>  at $javax.sql.DataSource$$FastClassByCGLIB$$6525cafd.invoke
> ()
>  at net.sf.cglib.proxy.MethodProxy.invoke(MethodProxy.java:149)
>  at
> org.apache.geronimo.connector.ConnectorMethodInterceptor.intercept
> (ConnectorMethodInterceptor.java:53)
>  at $javax.sql.DataSource$$EnhancerByCGLIB$
> $4e89d0c0.getConnection()
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at sun.reflect.NativeMethodAccessorImpl.invoke
> (NativeMethodAccessorImpl.java:39)
>  at sun.reflect.DelegatingMethodAccessorImpl.invoke
> (DelegatingMethodAccessorImpl.java:25)
>  at java.lang.reflect.Method.invoke(Method.java:324)
>  at
> org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(
> AopUtils.java:282)
>  at
> org.springframework.aop.framework.JdkDynamicAopProxy.invoke
> (JdkDynamicAopProxy.java:163)
>  at $Proxy1.getConnection(Unknown Source)
>  at
> org.springframework.orm.hibernate.LocalDataSourceConnectionProvider.getC
> onnection(LocalDataSourceConnectionProvider.java:75)
>  at net.sf.hibernate.cfg.SettingsFactory.buildSettings
> (SettingsFactory.java:73)
>  at net.sf.hibernate.cfg.Configuration.buildSettings
> (Configuration.java:1155)
>  at net.sf.hibernate.cfg.Configuration.buildSessionFactory
> (Configuration.java:789)
>  at
> org.springframework.orm.hibernate.LocalSessionFactoryBean.newSessionFact
> ory(LocalSessionFactoryBean.java:535)
>  at
> org.springframework.orm.hibernate.LocalSessionFactoryBean.afterPropertie
> sSet(LocalSessionFactoryBean.java:470)
>  at
> org.springframework.beans.factory.support.AbstractAutowireCapableBeanFac
> tory.invokeInitMethods(AbstractAutowireCapableBeanFactory.java:1065)
>  at
> org.springframework.beans.factory.support.AbstractAutowireCapableBeanFac
> tory.createBean(AbstractAutowireCapableBeanFactory.java:343)
>  at
> org.springframework.beans.factory.support.AbstractAutowireCapableBeanFac
> tory.createBean(AbstractAutowireCapableBeanFactory.java:260)
>  at
> org.springframework.beans.factory.support.AbstractBeanFactory.getBean
> (AbstractBeanFactory.java:221)
>  at
> org.springframework.beans.factory.support.AbstractBeanFactory.getBean
> (AbstractBeanFactory.java:145)
>  at
> org.springframework.beans.factory.support.AbstractAutowireCapableBeanFac
> tory.resolveReference(AbstractAutowireCapableBeanFactory.java:973)
>  at
> org.springframework.beans.factory.support.AbstractAutowireCapableBeanFac
> tory.resolveValueIfNecessary(AbstractAutowireCapableBeanFactory.java:
> 911)
>  at
> org.springframework.beans.factory.support.AbstractAutowireCapableBeanFac
> tory.applyPropertyValues(AbstractAutowireCapableBeanFactory.java:852)
>  at
> org.sprin

Re: Oracle XA RAR for G1.0?

2006-02-06 Thread lichtner

It's my Sanskrit-only support contract.

Here on dejanews you can see other people suffering from a similar
problem:

http://groups.google.com/groups?q=COMMIT+is+not+allowed+in+a+subordinate+session&hl=en

On Mon, 6 Feb 2006, Aaron Mulder wrote:

> That's about as helpful as Sanskrit.
>
> Aaron
>
> On 2/6/06, lichtner <[EMAIL PROTECTED]> wrote:
> >
> > ORA-02089: COMMIT is not allowed in a subordinate session
> > Cause: COMMIT was issued in a session that is not the two-phase commit 
> > global coordinator.
> > Action: Issue commit at the global coordinator only.


Re: Oracle XA RAR for G1.0?

2006-02-06 Thread lichtner

What resource managers are in your transaction? Is it just Geronimo and
one instance of Oracle?

Do you happen to be executing a stored procedure, or are you calling
commit explicitly anywhere except through JTA?
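
The reason I ask: the kind of thing that bypasses JTA looks roughly like
the sketch below (illustrative only - the JNDI name and SQL are made up,
and whether it surfaces as ORA-02089 or some other error depends on how
the connection is wrapped and where the commit is issued):

import java.sql.Connection;
import java.sql.Statement;
import javax.naming.InitialContext;
import javax.sql.DataSource;

// Anti-pattern sketch: inside a container-managed (JTA) transaction the
// commit must come from the transaction manager, never directly from the
// connection or from SQL / PL/SQL.
public class CommitAntiPattern {
    public void doWork() throws Exception {
        DataSource ds = (DataSource)
                new InitialContext().lookup("java:comp/env/jdbc/MyDS");
        Connection conn = ds.getConnection(); // enlisted in the global transaction
        Statement s = conn.createStatement();
        s.executeUpdate("UPDATE accounts SET balance = balance - 1");
        conn.commit(); // wrong: commits one branch of a two-phase transaction
        s.close();
        conn.close();
    }
}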

On Tue, 7 Feb 2006, Jason Dillon wrote:

> Thanks But I'm still clueless as to why this happens :-(
>
> --jason
>
>
> -Original Message-
> From: lichtner <[EMAIL PROTECTED]>
> Date: Mon, 6 Feb 2006 22:06:48
> To:dev@geronimo.apache.org
> Subject: Re: Oracle XA RAR for G1.0?
>
>
> ORA-02089: COMMIT is not allowed in a subordinate session
> Cause: COMMIT was issued in a session that is not the two-phase commit 
> global coordinator.
> Action: Issue commit at the global coordinator only.
>
> http://oraclesvca2.oracle.com/docs/cd/B19306_01/server.102/b14219/e1500.htm#sthref32
>
> On Mon, 6 Feb 2006, Jason Dillon wrote:
>
> > No love :-(
> >
> > When I configure both of my datasources to use the oracle xa adapter,
> > and use the xa adapter for activemq, I get exceptions like:
> >
> > 
> > [2/6/06 16:35:17:456 PST]  [ERROR] -
> > org.apache.geronimo.kernel.log.GeronimoLog.error(line:104) -
> > ORA-02089: COMMIT is not allowed in a subordinate session
> >
> > [2/6/06 16:35:17:458 PST]  [ERROR] -
> > org.apache.geronimo.kernel.log.GeronimoLog.error(line:108) - Could not
> > execute query
> > java.sql.SQLException: ORA-02089: COMMIT is not allowed in a subordinate 
> > session
> >
> > at oracle.jdbc.dbaccess.DBError.throwSqlException(DBError.java:134)
> > at oracle.jdbc.ttc7.TTIoer.processError(TTIoer.java:289)
> > at oracle.jdbc.ttc7.Oall7.receive(Oall7.java:582)
> > at oracle.jdbc.ttc7.TTC7Protocol.doOall7(TTC7Protocol.java:1986)
> > at 
> > oracle.jdbc.ttc7.TTC7Protocol.parseExecuteDescribe(TTC7Protocol.java:880)
> > at 
> > oracle.jdbc.driver.OracleStatement.doExecuteQuery(OracleStatement.java:2516)
> > at 
> > oracle.jdbc.driver.OracleStatement.doExecuteWithTimeout(OracleStatement.java:2850)
> > at 
> > oracle.jdbc.driver.OraclePreparedStatement.executeUpdate(OraclePreparedStatement.java:609)
> > at 
> > oracle.jdbc.driver.OraclePreparedStatement.executeQuery(OraclePreparedStatement.java:537)
> > at 
> > org.tranql.connector.jdbc.PreparedStatementHandle.executeQuery(PreparedStatementHandle.java:49)
> > at 
> > net.sf.hibernate.impl.BatcherImpl.getResultSet(BatcherImpl.java:87)
> > at net.sf.hibernate.loader.Loader.getResultSet(Loader.java:875)
> > at net.sf.hibernate.loader.Loader.doQuery(Loader.java:269)
> > at 
> > net.sf.hibernate.loader.Loader.doQueryAndInitializeNonLazyCollections(Loader.java:133)
> > at net.sf.hibernate.loader.Loader.doList(Loader.java:1033)
> > at net.sf.hibernate.loader.Loader.list(Loader.java:1024)
> > at 
> > net.sf.hibernate.hql.QueryTranslator.list(QueryTranslator.java:854)
> > at net.sf.hibernate.impl.SessionImpl.find(SessionImpl.java:1544)
> > at net.sf.hibernate.impl.QueryImpl.list(QueryImpl.java:39)
> > at 
> > org.springframework.orm.hibernate.HibernateTemplate$26.doInHibernate(HibernateTemplate.java:667)
> > at 
> > org.springframework.orm.hibernate.HibernateTemplate.execute(HibernateTemplate.java:312)
> > at 
> > org.springframework.orm.hibernate.HibernateTemplate.find(HibernateTemplate.java:655)
> > at 
> > com.solidusnetworks.paycore.util.hibernate.BaseDAOHibernate.find(BaseDAOHibernate.java:423)
> > at 
> > com.solidusnetworks.ach.oltp.dao.impl.NegativeFileDAOHibernate.findActiveNegativeFiles(NegativeFileDAOHibernate.java:394)
> > at 
> > com.solidusnetworks.paycore.ach.model.negativefile.service.NegativeFileViewServiceBean.retrieveActiveGlobalAndMerchantNegativeFiles(NegativeFileViewServiceBean.java:87)
> > at 
> > com.solidusnetworks.paycore.ach.model.negativefile.service.NegativeFileViewServiceBean$$FastClassByCGLIB$$55b05efa.invoke()
> > at 
> > org.openejb.dispatch.AbstractMethodOperation.invoke(AbstractMethodOperation.java:90)
> > at org.openejb.slsb.BusinessMethod.execute(BusinessMethod.java:67)
> > at 
> > org.openejb.dispatch.DispatchInterceptor.invoke(DispatchInterceptor.java:72)
> > at 
> > org.apache.geronimo.naming.java.ComponentContextInterceptor.invoke(ComponentContextInterceptor.java:56)
> > at 
> > org.openejb.ConnectionTrackingInterceptor.invoke(ConnectionTracki

Re: Oracle XA RAR for G1.0?

2006-02-06 Thread lichtner

ORA-02089: COMMIT is not allowed in a subordinate session
Cause: COMMIT was issued in a session that is not the two-phase commit 
global coordinator.
Action: Issue commit at the global coordinator only.

http://oraclesvca2.oracle.com/docs/cd/B19306_01/server.102/b14219/e1500.htm#sthref32

On Mon, 6 Feb 2006, Jason Dillon wrote:

> No love :-(
>
> When I configure both of my datasources to use the oracle xa adapter,
> and use the xa adapter for activemq, I get exceptions like:
>
> 
> [2/6/06 16:35:17:456 PST]  [ERROR] -
> org.apache.geronimo.kernel.log.GeronimoLog.error(line:104) -
> ORA-02089: COMMIT is not allowed in a subordinate session
>
> [2/6/06 16:35:17:458 PST]  [ERROR] -
> org.apache.geronimo.kernel.log.GeronimoLog.error(line:108) - Could not
> execute query
> java.sql.SQLException: ORA-02089: COMMIT is not allowed in a subordinate 
> session
>
> at oracle.jdbc.dbaccess.DBError.throwSqlException(DBError.java:134)
> at oracle.jdbc.ttc7.TTIoer.processError(TTIoer.java:289)
> at oracle.jdbc.ttc7.Oall7.receive(Oall7.java:582)
> at oracle.jdbc.ttc7.TTC7Protocol.doOall7(TTC7Protocol.java:1986)
> at 
> oracle.jdbc.ttc7.TTC7Protocol.parseExecuteDescribe(TTC7Protocol.java:880)
> at 
> oracle.jdbc.driver.OracleStatement.doExecuteQuery(OracleStatement.java:2516)
> at 
> oracle.jdbc.driver.OracleStatement.doExecuteWithTimeout(OracleStatement.java:2850)
> at 
> oracle.jdbc.driver.OraclePreparedStatement.executeUpdate(OraclePreparedStatement.java:609)
> at 
> oracle.jdbc.driver.OraclePreparedStatement.executeQuery(OraclePreparedStatement.java:537)
> at 
> org.tranql.connector.jdbc.PreparedStatementHandle.executeQuery(PreparedStatementHandle.java:49)
> at net.sf.hibernate.impl.BatcherImpl.getResultSet(BatcherImpl.java:87)
> at net.sf.hibernate.loader.Loader.getResultSet(Loader.java:875)
> at net.sf.hibernate.loader.Loader.doQuery(Loader.java:269)
> at 
> net.sf.hibernate.loader.Loader.doQueryAndInitializeNonLazyCollections(Loader.java:133)
> at net.sf.hibernate.loader.Loader.doList(Loader.java:1033)
> at net.sf.hibernate.loader.Loader.list(Loader.java:1024)
> at net.sf.hibernate.hql.QueryTranslator.list(QueryTranslator.java:854)
> at net.sf.hibernate.impl.SessionImpl.find(SessionImpl.java:1544)
> at net.sf.hibernate.impl.QueryImpl.list(QueryImpl.java:39)
> at 
> org.springframework.orm.hibernate.HibernateTemplate$26.doInHibernate(HibernateTemplate.java:667)
> at 
> org.springframework.orm.hibernate.HibernateTemplate.execute(HibernateTemplate.java:312)
> at 
> org.springframework.orm.hibernate.HibernateTemplate.find(HibernateTemplate.java:655)
> at 
> com.solidusnetworks.paycore.util.hibernate.BaseDAOHibernate.find(BaseDAOHibernate.java:423)
> at 
> com.solidusnetworks.ach.oltp.dao.impl.NegativeFileDAOHibernate.findActiveNegativeFiles(NegativeFileDAOHibernate.java:394)
> at 
> com.solidusnetworks.paycore.ach.model.negativefile.service.NegativeFileViewServiceBean.retrieveActiveGlobalAndMerchantNegativeFiles(NegativeFileViewServiceBean.java:87)
> at 
> com.solidusnetworks.paycore.ach.model.negativefile.service.NegativeFileViewServiceBean$$FastClassByCGLIB$$55b05efa.invoke()
> at 
> org.openejb.dispatch.AbstractMethodOperation.invoke(AbstractMethodOperation.java:90)
> at org.openejb.slsb.BusinessMethod.execute(BusinessMethod.java:67)
> at 
> org.openejb.dispatch.DispatchInterceptor.invoke(DispatchInterceptor.java:72)
> at 
> org.apache.geronimo.naming.java.ComponentContextInterceptor.invoke(ComponentContextInterceptor.java:56)
> at 
> org.openejb.ConnectionTrackingInterceptor.invoke(ConnectionTrackingInterceptor.java:81)
> at 
> org.openejb.transaction.ContainerPolicy$TxRequired.invoke(ContainerPolicy.java:119)
> at 
> org.openejb.transaction.TransactionContextInterceptor.invoke(TransactionContextInterceptor.java:80)
> at 
> org.openejb.slsb.StatelessInstanceInterceptor.invoke(StatelessInstanceInterceptor.java:98)
> at 
> org.openejb.transaction.ContainerPolicy$TxRequired.invoke(ContainerPolicy.java:119)
> at 
> org.openejb.transaction.TransactionContextInterceptor.invoke(TransactionContextInterceptor.java:80)
> at 
> org.openejb.SystemExceptionInterceptor.invoke(SystemExceptionInterceptor.java:82)
> at 
> org.openejb.GenericEJBContainer.invoke(GenericEJBContainer.java:238)
> at 
> org.openejb.proxy.EJBMethodInterceptor.intercept(EJBMethodInterceptor.java:129)
> at 
> org.openejb.proxy.SessionEJBLocalObject$$EnhancerByCGLIB$$c4231b05.retrieveActiveGlobalAndMerchantNegativeFiles()
> at 
> com.solidusnetworks.paycore.ach.business.pos.POSCheckServiceBean.retrieveNegativeFileData(POSCheckServiceBean.java:414)
> at 
> com.solidusnetworks.paycore.ach.business.pos.POSCheckServiceBean.

Re: Oracle XA RAR for G1.0?

2006-02-05 Thread lichtner

I think the properties were ConnectionURL, UserName and Password,
but don't spend a lot of time on these because I could be wrong ..

On Sun, 5 Feb 2006, Jason Dillon wrote:

> Any clue on the required config to get the rar deployed?
>
> I'm trying to convert this URL to the params for the RAR:
>
>  jdbc:oracle:thin:@mydbhost:1621:devdb
>
> Unfortunately the Oracle XA RAR does not take a URL, but wants
> granular configuration.  Some obvious stuff I get (like the port
> number), but what to use for protocol and type, etc have me
> scratching my head.
>
> I also looked for the Javadocs for
> oracle.jdbc.xa.client.OracleXADataSource with no luck to see what
> properties it exposed.  The only docs I can find are to expose the
> XAResource, but there must be more since the TranQL RAR is calling
> some of them.
>
> Any ideas?
>
> --jason
>
>
> On Feb 3, 2006, at 5:44 AM, Matt Hogstrom wrote:
>
> > I think David means that it has not been extensively tested and so
> > there are no guarantees that you'll simply be able to drop it in.
> > I'm currently working on a DB2 XA RAR and am still working out some
> > kinks.  It should work well, we're just not sure it's been tested
> > enough to know that it does.
> >
> > I looked on CodeHaus and it appears that Jeremy had not previous
> > released a SNAPSHOT.  I compiled the connector this morning against
> > the Oracle 10.1.4.0 classes12.jar.
> >
> > I've published it and it is called tranql/rars/tranql-connector-
> > oracle-xa-1.0-SNAPSHOT.rar
> >
> > If someone can try this out then that would be excellent.  I have
> > only compiled it and not tested it so caveat emptor.
> >
> > lichtner wrote:
> >> On Fri, 3 Feb 2006, David Jencks wrote:
> >>> It is likely to work if you build it.  However I don't know that it
> >>> has been used in the last year or more, so I won't make any
> >>> promises.  Matt might have tried it, I don't know.  We have been a
> >>> bit reluctant to publish it without more evidence that it works
> >>> well.
> >> Why would it not work well? When I was in my last job I remember
> >> getting
> >> that rar to work with mysql xa, so it probably also works with
> >> Oracle xa.
>
>


Re: Oracle XA RAR for G1.0?

2006-02-03 Thread lichtner

On Fri, 3 Feb 2006, David Jencks wrote:

> It is likely to work if you build it.  However I don't know that it
> has been used in the last year or more, so I won't make any
> promises.  Matt might have tried it, I don't know.  We have been a
> bit reluctant to publish it without more evidence that it works well.

Why would it not work well? When I was in my last job I remember getting
that RAR to work with MySQL XA, so it probably also works with Oracle XA.


Re: Supporting applications that need a database

2006-02-01 Thread lichtner

I don't generally like default things when the default is a completely
arbitrary choice, as is the case here. This is not like a default port
number.

If someone is having trouble configuring a database then he/she or someone
else should write a tool that tries to figure out the settings, not dumb
down the software for everyone else.

On Wed, 1 Feb 2006, David Jencks wrote:

> Dain has been complaining that the default database is no more and
> IIUC suggesting that we reinstate it and by default hook at least ejb
> applications that don't have an explicit database configuration up to
> it.  Since I removed the default database I'd like to somewhat
> preemptively explain my thinking.
>
> Based on my support experiences with another app server that did
> something like this, I think this is a really bad idea.  What
> happened there was that no one knew how to connect their app to a non-
> default database, and we got zillions of problem reports based on the
> app using the default database rather than the one that was
> misconfigured :-)
>
> I also don't think that encouraging all applications to use the same
> database is a very good policy.  It certainly invites collisions
> between applications and reduces portability.
>
> We have the capabilities to build a derby database for a particular
> schema, and package it , and to bundle a datasource configuration
> with a j2ee app plan.  This is used for the daytrader and uddi server
> configurations.  Rather than including a database no one should
> want :-) and encouraging people to use it, I would rather see us
> automate the construction of a configured database for an app, and
> the construction and bundling of a datasource configuration with the
> app's plan.
>
> thanks
> david jencks
>
>


Re: Clustering docs - DB Section

2006-01-30 Thread lichtner

On Mon, 30 Jan 2006, Ryan Thomas wrote:

> > I just took a look at the c-jdbc design and it seems that they have to
> > execute writes one at a time (one insert statement at a time) - because if
> > you start multiple write transaction at the same time then multiple sites
> > could execute the writes in different orders. So it's not a good general
> > solution except for read-mostly applications.
> >
>
> Pardon my ignorance, but is the purpose here to develop a db-clustering
> module for geronimo, or is that beyond the scope of the server?

A db-clustering module like c-jdbc is not beyond the scope, but then it
has very limited value so I would not work on it.


Re: Clustering docs - DB Section

2006-01-27 Thread lichtner


On Fri, 27 Jan 2006, James Strachan wrote:

> > Is anybody working on Derby clustering?
>
> http://sequoia.continuent.org/HomePage
>
> its meant to be mostly ASF licensed now; though given its L/GPL
> heritage of C-JDBC I'd be a little cautious of the licensing

It says that it is the "continuation" of C-JDBC. In other words, it's the
same project.

I just took a look at the C-JDBC design and it seems that they have to
execute writes one at a time (one insert statement at a time) - because if
you start multiple write transactions at the same time then multiple sites
could execute the writes in different orders. So it's not a good general
solution except for read-mostly applications.
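
A trivial illustration of why the writes need a single agreed order:

// Two non-commuting writes applied in different orders leave the
// replicas in different states, so every site must apply them in the
// same (totally ordered) sequence.
public class OrderingMatters {
    public static void main(String[] args) {
        int replicaA = 0;
        int replicaB = 0;

        // replica A applies: set to 1, then double
        replicaA = 1;
        replicaA = replicaA * 2; // -> 2

        // replica B applies the same two writes in the other order
        replicaB = replicaB * 2;
        replicaB = 1;            // -> 1

        System.out.println("A=" + replicaA + " B=" + replicaB); // A=2 B=1
    }
}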


Clustering docs - DB Section

2006-01-27 Thread lichtner

I see that the DB section says "Any takers?". Do you want somebody to
write about clustering for databases that have built-in support, or about
tools like C-JDBC?

Is anybody working on Derby clustering?


Re: web clustering componentization

2006-01-27 Thread lichtner

I would pick one type of clustering at a time, solve that problem, roll it
out, and then move on to the next one. And I would specifically address
each type of clustering requirement separately (e.g. HTTP sessions and
entity beans) because the best solution is different in each case.

On Fri, 27 Jan 2006, Jules Gosnell wrote:

> Paul McMahan wrote:
>
> > Thanks for initiating this important conversation.  As part of the
> > clustering effort I would like to work on a new admin portlet that can
> > be used to visualize the cluster topology and do things like set
> > properties for the session managers, deploy/undeploy applications,
> > etc.  I think that making this aspect of Geronimo intuitive and easy
> > to use will be a big differentiator in our favor.  I would love to
> > hear everyone's feedback about this idea.  E.g. given our current
> > direction does this idea make sense? and what should it look like?
> >
> > If we are generally in favor of this idea then before proceeding I
> > would need help understanding certain fundamental concepts of our
> > approach, such as:
> >
> > -- Will G have explicit knowledge and control over clustering or will
> > it just be "embedded" within the applications' logic/configuration?
>
> Geronimo should have full knowledge at the container level.
>
> > -- Will G follow a deployment manager paradigm where a certain node is
> > in charge of managing the entire config and pushing it to the worker
> > nodes?  Or will each node be responsible for its own configuration?
>
> I don't know if there has been any discussion of this ? I think that IBM
> donated a control panel of some sort but I haven't seen it and don't
> know if it covers deployment.
>
> There is some stuff of relevance in the clustering overview that I just
> posted.
>
> I would like to see a situation where you can [un]deploy to/from any
> node in the [sub]cluster. This would perform the same operation on all
> nodes in that [sub]cluster. Essentially, applications become 1->all
> replicated data.
>
> This is a very simplistic view - as it is probably not useful to
> synchronise the lifecycle of instances of the same app around the
> cluster, otherwise you can't stop one without stopping them all (but
> maybe this is what you want?) - maybe we need a synchronised and
> non-sychronised mode...?
>
> Of course, it may be that this conversation has already been had and
> decisions made elsewhere ? In which case, I would be interested in
> knowing the outcome so that I can update the clustering overview and add
> pointers to the necessary threads/people etc...
>
> > -- What component(s) of G will provide an interface to the clustering
> > metadata and config?  Would it be possible to provide a single point
> > of control in G that can take as input a clustering "plan" and handle
> > the resulting deployments and node configuration?
>
> I'm not aware that anyone has drilled down this far yet, into how a
> unified approach might work, but I suspect that we will end up with some
> form of ClusterGBean which provides the node's connection to the Cluster
> (an activecluster.Cluster), notification of other nodes leaving and
> joining, messaging primitives etc. A clustering portlet might register
> listeners with this.
>
> > -- Is each member of a cluster assumed to be using an identical
> > configuration?
>
> In a homogeneous cluster yes - but I think that we will have to cater
> for heterogeneous deployments. There is a nascent section in the
> overview doc about this.
>
> > -- Will G be able to participate in heterogeneous clusters with other
> > types of app servers and/or versions of G?
>
> I don't know whether anyone else has considered mixing (a) appservers or
> (b) geronimo versions in the same cluster - I haven't. I think (a)
> unlikely (unless you are talking about e.g. a business tier that is
> geronimo and a web-tier that is standalone Jetty/Tomcat, which I guess
> might be possible, but i suspect that both tiers would just be geronimo
> running different configurations), (b) is more possible, provided that
> the protocols used by  the different clustering components were
> compatible. With the amount of complexity already surrounding
> clustering, though, I think you would just be making your life much more
> difficult than it need be.
>
> Hope that helps,
>
>
> Jules
>
> >
> > Looking forward to your feedback and advice.
> >
> > Best wishes,
> > Paul
> >
> >
> >
> > On 1/26/06, David Jencks <[EMAIL PROTECTED]> wrote:
> >
> > I have some ideas about how to set up web clustering in terms of
> > geronimo modules and configurations and connections between gbeans.
> > I've talked with Jules and Jeff about this and think that they agree
> > that this is a good starting point so I'll try to describe it and
> > possibly even help implement parts of it.
> >
> > There are quite a few questions I'm going to ignore, such as EJB SFSB
> > clustering and the new session 

Re: Replication using totem protocol

2006-01-23 Thread lichtner

Still, it doesn't seem like there is much interest in using totem. For
session replication you can use primary-backup, if anything.

On Sun, 22 Jan 2006, Geir Magnusson Jr wrote:

> Catching up :
>
> [EMAIL PROTECTED] wrote:
> >> No.  You license the code to the Apache Software Foundation giving
> >> the foundation the rights to relicense under any license (so the
> >> foundation can upgrade the license as they did with ASL2).  We do ask
> >> that you change the copyrights on the version of the code you give to
> >> the ASF to something like "Copyright 2004 The Apache Software
> >> Foundation or its licensors, as applicable."
> >
> > That _is_ transferring the copyright.
>
> No, it isn't.  You are still the copyright holder of the contributed
> material.  The (c) statement that Dain suggested represents the
> collective copyright of the whole package, which is your original code
> (for which you hold the copyright), and additions from other people (who
> individually hold copyright or share copyright depending on the
> contribution.)
>
> That's why it's "or it's licensors", which you would certainly be.
>
> >
> > As I told Jeff on the phone, I would definitely considering this if it
> > turns that evs4j will really be used, but I would rather not grant someone
> > an unlimited license at the present time. Jeff said we are going to have a
> > discussion, so we'll know more soon enough.
>
> The Apache License is fairly close to an unlimited license, so if it's
> available under the AL, you are already there.
>
> The only thing different is that you are giving the ASF the ability to
> distribute the collective work under other terms other than the current
> version of the Apache License.
>
> I hope that makes you feel a little more comfortable about things.
>
> geir
>
>


Re: heads up: initial contribution of a client API to session state management for OpenEJB, ServiceMix, Lingo and Tuscany

2006-01-19 Thread lichtner

On Thu, 19 Jan 2006, Dain Sundstrom wrote:

> On Jan 18, 2006, at 10:20 PM, lichtner wrote:
>
> > This state is transactional, I take it?
>
> Nope.  For OpenEJB, only stateful session beans (SFSB) would use this api.

I see. I plan to never use them, if I can help it.

> EJB Entity beans in OpenEJB or really any shared persistence data
> would need a different API as the data is supposed to be shared by
> multiple simultaneous clients.

That's what I thought.


Re: WADI clustering

2006-01-19 Thread lichtner

On Thu, 19 Jan 2006, Jules Gosnell wrote:

> > We should avoid making those decisions beforehand.

What decisions does the user need to make?

Users need to make a lot of decisions already. Are the decisions you
mention worth the time it will take for users to make them?

> as far as clustering sessions is concerned. I can't understand why
> anyone would want to run without affinity turned up as high as it will go.

Me neither. The session is just not intended for concurrent use.


Re: heads up: initial contribution of a client API to session state management for OpenEJB, ServiceMix, Lingo and Tuscany

2006-01-18 Thread lichtner

This state is transactional, I take it?

On Wed, 18 Jan 2006 [EMAIL PROTECTED] wrote:

>
> On 18 Jan 2006, at 18:10, lichtner wrote:
> > It looks like a map-like interface. When you say this could manage
> > state
> > for OpenEJB, what kind of state do you have in mind?
>
> For a given client in EJB / JBI / Lingo / SCA there tends to be
> chunks of state for each client.  e.g. in EJB a single EJB client can
> create N stateful session beans each of which can be updated
> independently; but the entire state might be managed as a whole. So
> each chunk of state inside the Session represents the state of a
> single session bean for a given EJB client.
>
> James
> ---
> http://radio.weblogs.com/0112098/
>
>


Re: heads up: initial contribution of a client API to session state management for OpenEJB, ServiceMix, Lingo and Tuscany

2006-01-18 Thread lichtner

It looks like a map-like interface. When you say this could manage state
for OpenEJB, what kind of state do you have in mind?

On Wed, 18 Jan 2006, James Strachan wrote:

> I got chance to have a mini-hackathon with some geronimo committers
> over the weekend to hack up a real simple client API to some kind of
> state store, which could be clustered, that the OpenEJB, ServiceMix,
> Lingo and Tuscany guys could use.
>
> Rather than focussing on the possible technical implementations and
> techniques (like group communication, election strategies,
> distributed locking, totem protocol, distributed hash maps etc) we
> tried to put in place a simple client API for the person who has to
> integrate some kind of client session state into the client/server
> side of OpenEJB or ServiceMix etc .
>
> Its a very simple API and should be trivial to implement in a
> gazillion of different ways (a HashMap, totem, WADI, just a database
> or file system, a combination of database for being the controller &
> using point to point non-reliable messaging with the other members to
> group election strategies etc). Without further ado here's where it
> lives...
>
> http://svn.apache.org/repos/asf/geronimo/trunk/modules/session/
>
> There's some javadoc that tries to explain the use cases, the design
> goals behind the client API and a variety of possible implementations
> we could do - the idea is based on your requirements and performance
> targets you may use a real simple implementation or a wacky complex
> one.  We tried to assume a possibly low QoS (e.g. 1 box with a
> HashMap) while allowing any implementation to plug in based on its
> requirements.
>
> Here's more docs in HTML which explains it much better
> http://svn.apache.org/repos/asf/geronimo/trunk/modules/session/src/
> java/org/apache/geronimo/session/package.html
>
> Thoughts?
>
> To the WADI folks - do you think it'd be easy to put WADI underneath
> this API?
>
> James
> ---
> http://radio.weblogs.com/0112098/
>


Re: Clustering - initial overview doc... - where should we keep it ?

2006-01-18 Thread lichtner

So where is this document now? I am not very familiar with the web site;
there seems to be more than one place.

On Wed, 18 Jan 2006, Hernan Cunico wrote:

> Hi Jules,
> many of the articles (if not all) started the same way and many of them are 
> still a work in progress.
>
> It would be great if you can publish your doc there, I can give you a hand 
> with the confluence
> formatting if you want to.
>
> Cheers!
> Hernan
>
> Jules Gosnell wrote:
> > Hernan Cunico wrote:
> >
> >> Hi Jules,
> >> can you put your docs in confluence!?
> >>
> >> There is already a section for performance in the TOC, it would be
> >> great if you put the clustering documentation there.
> >
> >
> >
> > OK - can do, if everyone is happy with this...
> >
> > It is a pretty rough doc though... the rest of the stuff in there looks
> > pretty polished... - this really is a work in progress... - is that OK ?
> >
> > Jules
> >
> >>
> >> Here is the link to the TOC, to edit confluence you will have to
> >> register first.
> >>
> >> http://opensource2.atlassian.com/confluence/oss/pages/viewpage.action?pageId=1692
> >>
> >>
> >> Let me know if you need any help with confluence.
> >>
> >> Cheers!
> >> Hernan
> >>
> >> Jules Gosnell wrote:
> >>
> >>> Guys,
> >>>
> >>> I have the beginnings of this doc...
> >>>
> >>> Where would be the best place to keep it ? Ideally it should be r/w
> >>> by everyone, with history - SVN, WIKI or where ? What is best practice ?
> >>>
> >>> Also, if in SVN, what format - text, html, ... etc...
> >>>
> >>>
> >>> Jules
> >>>
> >
> >
>


Work

2006-01-18 Thread lichtner

I am actually looking for another job/contract right now (in the San Diego
area, or I can telecommute), so I thought I would mention it in case
anybody knows of any openings.

Guglielmo



Re: Replication using totem protocol

2006-01-18 Thread lichtner

On Wed, 18 Jan 2006, Jules Gosnell wrote:

> I haven't been able to convince myself to take the quorum approach
> because...
>
> shared-something approach:
> - the shared something is a Single Point of Failure (SPoF) - although
> you could use an HA something.

It's not really a SPoF. You just fail over to a different resource. All
you need is a lock. You could use two Java processes anywhere on the
network, each listening on a socket, and use only one of them at a time.
If one is not listening, you try the other one.

> - If the node holding the lock 'goes crazy', but does not die, the rest
> of the cluster becomes a fragment - so it becomes an SPoF as well.

If by 'goes crazy' you mean that it's up but it's not doing anything,
totem defends against this by detecting processors which fail to make
progress after N token rotations and declaring them failed.

But if you mean that it just sends corrupt data or starts using broken
algorithms etc. then I would need to research it a bit. But definitely
defending against these byzantine failures will be more expensive. I
believe the solution is that you have to process operations on multiple
nodes and compare the results.

I believe this is how Tandem machines work. Each cpu step is voted on.
Byzantine failures can happen because of cosmic rays, or other
physics-related issues.

Definitely much more fun.

> - used in isolation, it does not take into account that the lock may be
> held by the smallest cluster fragment

Yes, it does. The question is, why do you have a partition? If you have a
partition because a network element failed, then put some redundancy in
your network topology. If the partition is virtual, i.e.
congestion-induced, then wait a few seconds for it to heal.

And if you get too many virtual partitions it means you either need to
tweak your failure detection parameters (token-loss-timeout in totem) or
your load is too high and you need to add some capacity to the cluster.

> shared-nothing approach:
> - I prefer this approach, but, as you have stated, if the two halves are
> equally sized...

I didn't mean to say that. In this approach you _must_ set a minimum
quorum which is _the_ majority of the size of the rack. If you own five
machines, make the quorum three.

> - What if there are two concurrent fractures (does this happen?)

It's no different than any other partition.

> - ActiveCluster notifies you of one membership change at a time - so you
> would have to decide on an algorithm for 'chunking' node loss, so that
> you could decide when a fragmentation had occurred...

The problem exists anyway. Even in totem you can have several
'configurations' installed in quick succession. In order to defend against
this you need to design your state transfer algorithms around it.

> perhaps a hybrid of the two would be able to cover more bases... -
> shared-nothing falling back to shared-something if your fragment is
> sized N/2.

You can definitely make a totally custom algorithm for determining your
majority partition, and it's a fact that hybrid approaches can solve
difficult problems, but for the reasons I said above I believe that you
can do just fine if you have a redundant network and you keep some CPU
unused.

> As far as my plans for WADI, I think I am happy to stick with the, 'rely
> on affinity and keep going' approach.
>
> As far as situations where a distributed object may have more than one
> client, I can see that quorum offers the hope of a solution, but,
> without some very careful thought, I would still be hesitant to stake my
> shirt on it :-) for the reasons given above...
>
> I hadn't really considered 'pausing' a cluster fragment, so this is a
> useful idea. I guess that I have been thinking more in terms of
> long-lived fractures, rather than short-lived ones. If the latter are
> that much more common, then this is great input and I need to take it
> into account.
>
> The issue about 'chunking' node loss interests me... I see that the
> EVS4J Listener returns a set of members, so it is possible to express
> the loss of more than one node. How is membership decided and node loss
> aggregated ?

Read the totem protocol article. The membership protocol is in there. But
as I said you can still get a flurry of configurations installed one after
the other. It is only a problem if you plan to do your cluster
re-organization all at once.

Guglielmo


Re: Replication using totem protocol

2006-01-17 Thread lichtner

By reading selected parts of this book you can get a background on various
issues that you have asked about:

http://citeseer.ist.psu.edu/birman96building.html

On Tue, 17 Jan 2006, Rajith Attapattu wrote:

> > Can u guys talk more about locking mechanisms pros and cons wrt in memory
> > replication and storaged backed replication.
>
> >I don't know what you have in mind here by 'storage-backed'.
>
> Sorry if I was not clear on that. what i meant was in memory vs serialized
> form, either stored in a file or database or some other mechanism.
>
> >>you want to guarantee that the user's work is _never_lost, just send all
> session updates to yourself in a totem-protocol 'safe' message
> hmm can we really make a garuntee here even that you assumption
> holds (Assuming 4 nodes and likely to survive node crashes up to 4 - R = 2
> node crashes.)
>
> Also I didn't understand how u arrived at the 4-R value. I guess it's bcos I
> don't have much knowledge about totem.
> If there is a short answer and if it's not beyond the scope of the thread
> can u try one more time to explain the thoery behind your assumption
>
> Regards,
>
> Rajith.
>
> On 1/17/06, lichtner <[EMAIL PROTECTED]> wrote:
> >
> >
> > On Tue, 17 Jan 2006, Rajith Attapattu wrote:
> >
> > > Can u guys talk more about locking mechanisms pros and cons wrt in
> > memory
> > > replication and storaged backed replication.
> >
> > I don't know what you have in mind here by 'storage-backed'.
> >
> > > Also what if a node goes down while the lock is aquirred?? I assume
> > there is
> > > a time out.
> >
> > Which architecture do you have in mind here? I think the question is
> > relevant if you use a standalone lock server, but if you don't then you
> > just put the lock queue with the data item in question.
> >
> > > When it comes to partition (either network/power failure or vistual) or
> > > healing (same new nodes comming up as well??) what are some of the
> > > algorithms and stratergies that are widely used to handle those
> > situations
> > > ?? any pointers will be great.
> >
> > I believe the best strategy depends on what type of state the application
> > has. Clearly if the state took zero time to transfer over you could
> > compare version numbers, transfer the state to the nodes that happen to be
> > out-of-date, and you are back in business. OTOH if the state is 1Gb you
> > will take a different approach. There is not much to look up here. Think
> > about it carefully and you can come up with the best state transfer for
> > your application.
> >
> > Session state is easier than others because it consists of myriad small,
> > independent data items that do not support concurrent access.
> >
> > > so if u are in the middle of filling a 10 page application on the web
> > and
> > > while in the 9th page and the server goes down, if you can restart again
> > > with the 7 or 8th page (a resonable percentage of data was preserved
> > through
> > > merge/split/change) I guess it would be tolarable if not excellent in a
> > very
> > > busy server.
> >
> > Since this is a question about availability consider a cluster, say 4
> > nodes, with a minimum R=2, say, where all the sessions are replicated on
> > _each_ node. If you want to guarantee that the user's work is _never_
> > lost, just send all session updates to yourself in a totem-protocol 'safe'
> > message, which is delivered only after the message has been received (but
> > not delivered) by all the nodes, and wait for your own message to arrive.
> > This takes between 1 and 2 token rotations, which on 4 nodes I guess would
> > be between 10-20 milliseconds, which is not a lot as http request
> > latencies go.
> >
> > As a result of this after an http request returns, the work done is likely
> > to survive node crashes up to 4 - R = 2 node crashes.
> >
> >
>


Re: Replication using totem protocol

2006-01-17 Thread lichtner

On Tue, 17 Jan 2006, Rajith Attapattu wrote:

> Can u guys talk more about locking mechanisms pros and cons wrt in memory
> replication and storaged backed replication.

I don't know what you have in mind here by 'storage-backed'.

> Also what if a node goes down while the lock is aquirred?? I assume there is
> a time out.

Which architecture do you have in mind here? I think the question is
relevant if you use a standalone lock server, but if you don't then you
just put the lock queue with the data item in question.

> When it comes to partition (either network/power failure or vistual) or
> healing (same new nodes comming up as well??) what are some of the
> algorithms and stratergies that are widely used to handle those situations
> ?? any pointers will be great.

I believe the best strategy depends on what type of state the application
has. Clearly if the state took zero time to transfer over you could
compare version numbers, transfer the state to the nodes that happen to be
out-of-date, and you are back in business. OTOH if the state is 1Gb you
will take a different approach. There is not much to look up here. Think
about it carefully and you can come up with the best state transfer for
your application.

Session state is easier than others because it consists of myriad small,
independent data items that do not support concurrent access.

> so if u are in the middle of filling a 10 page application on the web and
> while in the 9th page and the server goes down, if you can restart again
> with the 7 or 8th page (a resonable percentage of data was preserved through
> merge/split/change) I guess it would be tolarable if not excellent in a very
> busy server.

Since this is a question about availability consider a cluster, say 4
nodes, with a minimum R=2, say, where all the sessions are replicated on
_each_ node. If you want to guarantee that the user's work is _never_
lost, just send all session updates to yourself in a totem-protocol 'safe'
message, which is delivered only after the message has been received (but
not delivered) by all the nodes, and wait for your own message to arrive.
This takes between 1 and 2 token rotations, which on 4 nodes I guess would
be between 10-20 milliseconds, which is not a lot as http request
latencies go.

As a result of this after an http request returns, the work done is likely
to survive node crashes up to 4 - R = 2 node crashes.



Re: Replication using totem protocol

2006-01-17 Thread lichtner


On Tue, 17 Jan 2006, Jules Gosnell wrote:

> just when you thought that this thread would die :-)

I think Jeff Genender wanted a discussion to be sparked, and it worked.

> So, I am wondering how might I use e.g. a shared disc or majority voting
> in this situation ? In order to decide which fragment was the original
> cluster and which was the piece that had broken off ? but then what
> would the piece that had broken off do ? shutdown ?

Wait to rejoin the cluster. Since it is not "the" cluster, it waits. It is
not safe to make any updates.

_How_ a group decides it is "the" cluster can be done in several ways. A
shared-disk cluster can do it by a locking operation on a disk (I would
have to research the details on this), and a cluster with a database can
get a lock from the database (and keep the connection open). One way to do
this in a shared-nothing cluster is to use a quorum of N/2 + 1, where N is
the maximum number of nodes. Clearly it has to be the majority or else you
can have a split-brain cluster.

> Do you think that we need to worry about situations where a piece of
> state has more than one client, so a network partition may result in two
> copies diverging in different and incompatible directions, rather than
> only one diverging.

If you use a quorum or quorum-resource as above you do not have this
problem. You can turn down the requests or let them block until the
cluster re-discovers the 'failed' nodes.

> I can imagine this happening in an Entity Bean (but
> we should be able to use the DB to resolve this) or an application POJO.
> I haven't considered the latter case and it looks pretty hopeless to me,
> unless you have some alternative route over which the two fragments can
> communicate... but then, if you did, would you not pair it with your
> original network, so that the one failed over to the other or replicated
> its activity, so that you never perceived a split in the first place ?
> Is this a common solution, or do people use other mechanisms here ?

I do believe that membership and quorum is all you need.

Guglielmo


Re: Replication using totem protocol

2006-01-16 Thread lichtner


On Tue, 17 Jan 2006, Jules Gosnell wrote:

> >I believe that if you put some spare capacity in your cluster you will get
> >good availability. For example, if your minimum R is 2 and the normal
> >operating value is 4, when a node fails you will not be frantically doing
> >state transfer.
> >
> >
> OK - so your system is a little more relaxed about the exact number of
> replicants. You specify upper and lower bounds rather  than an absolute
> number, then you move towards the upper bound when you have the capacity ?

That's the idea. It's a bit like having hot spares, but all nodes are
treated on the same footing.

> >I would also just send a redirect. I don't think it's worth relocating a
> >session.
> >
> If you can communicate the session's location to the load-balancer, then
> I agree, but some load-balancers are pretty dumb :-)

I see .. I was hoping somebody was not going to say that. Even so, it
depends on the latency of the request when the redirect actually happens.
After all, this only happens after a failure. But no matter, you can also
move the session over.

Guglielmo


Re: Replication using totem protocol

2006-01-16 Thread lichtner


On Mon, 16 Jan 2006, Jules Gosnell wrote:

> >2. When an HTTP request arrives, if the cluster which received it does not
> >have R copies then it blocks (it waits until there are.) This should work in
> >data centers because partitions are likely to be very short-lived (aka
> >virtual partitions, which are due to congestion, not to any hardware
> >issue.)
> >
> >
> Interesting. I was intending to actively repopulate the cluster
> fragment, as soon as the split was detected. I figure that
> - the longer that sessions spend without their full complement of
> backups, the more likely that a further failure may result in data loss.
> - the split is an exceptional cicumstance at which you would expect to
> pay an exceptional cost (regenerating missing primaries from backups and
> vice-versa)
>
> by waiting for a request to arrive for a session before ensuring it has
> its correct complement of backups, you extend the time during which it
> is 'at risk'. By doing this 'lazily', you will also have to perform an
> additional check on every request arrival, which you would not have to
> do if you had regenerated missing state at the point that you noticed
> the split.

Actually I didn't mean to say that you should do it lazily. You most
definitely do it aggressively, but I would not try to do _all_ the state
transfer ASAP, because this can kill availability.

If I had to do the state transfer using totem I would use priority queues,
so that you know that while the system is doing state transfer it is still
operating at, say, 80% efficiency.

It was not about lazy vs. greedy.

I believe that if you put some spare capacity in your cluster you will get
good availability. For example, if your minimum R is 2 and the normal
operating value is 4, when a node fails you will not be frantically doing
state transfer.

> >3. If at any time an HTTP request reaches a server which does not itself have a
> >replica of the session it sends a client redirect to a node which does.
> >
> >
> WADI can relocate request to session, as you suggest (via redirect or
> proxy), or session to request, by migration. Relocation of request
> should scale better since requests are generally smaller and, in the web
> tier, may run concurrently through the same session, whereas sessions
> are generally larger and may only be migrated serially (since only one
> copy at a time may be 'active').

I would also just send a redirect. I don't think it's worth relocating a
session.

> > and possibly migration of some session for
> >proper load balancing.
> >
> >
> forcing the balancing of state around the cluster is something that I
> have considered with WADI, but not yet tried to implement. The type of
> load-balancer that is being used has a big impact here. If you cannot
> communicate a change of session location satisfactorily to the Http load
> balancer, then you have to just go with wherever it decides a session is
> located With SFSBs we should have much more control at the client
> side, so this becomes a real option.

In my opinion load balancing is not something that a cluster api can
address effectively. Half the problem is evaluating how busy the system is
in the first place.

> all in all, though, it sounds like we see pretty much eye to eye :-)

Better than the other way ..

> the lazy partition regeneration is an interesting idea and this is the
> second time it has been suggested to me, so I will give it some serious
> thought.

Again, I wasn't advocating lazy state transfer. But perhaps it has
applications somewhere.

> Thanks for taking the time to share your thoughts,

No problem.


Re: Replication using totem protocol

2006-01-16 Thread lichtner

On Mon, 16 Jan 2006, Rajith Attapattu wrote:

> This is a very educating thread, maybe Jules can incoporate some of the
> ideas into your document on clustering.

Let's hope the thread also eventually translates into working code :)

> >1. The user should configure a minimum-degree-of-replication R. This is
> >the number of replicas of a specific session which need to be available in
> >order for an HTTP request to be serviced.
>
> 1.) How do u figure out the most efficient value for R?

I am not sure what you mean by efficient. If you mean that it maximizes
availability, I have seen a derivation in this book:

"Fault Tolerance in Distributed Systems"
Pankaj Jalote, 1994
Chapter 7, Section 5, "Degree of Replication"

He shows that the availability _as_ a function of the number of replicas
goes up and then down again, basically because more replicas defend
against failures but require more housekeeping, and the resources used to
do housekeeping cannot be used for servicing transactions.

I believe it is very difficult to compute availability analytically, and
that the majority of downtime would not be due to hardware failures. It's
probably 1) power failures and 2) software failures. I think Pfister talks
about the various causes of downtime in his book.

> I assume when R increases, network chatter increases at a magnitue of X, and
> X depends on wether it's a multicast  protocol or 1->1 (first of all is this
> assumption correct ???).

I think for this thread we were assuming reliable multicast. See also
the thread about infiniband, which completely changes the calculus because
of the lack of context switching - that would be closer to just using a
symmetric multiprocessor.

> And when R reduces the chances of a request hitting a server where the
> session is not replicated is high.

That doesn't matter. When the request hits a server where the session is
not replicated you send a redirect - the system is available, but perhaps
the latency for that particular request is larger than for others.

> So the sweet spot is a balance btw the above to factors ??? or have I missed
> any other critical factor(s) ??

See reference above.

> 2.) When you say minimum-degree-of-replication it imples to me a floor?? is
> there like a ceiling value like maximum-degree-of-replication?? I guess we
> don't want the session to grow beyond a point.

Yes. See above. Availability goes down past a certain value of R.

> >2. When an HTTP request arrives, if the cluster which received it does not
> >have R copies then it blocks (it waits until there are.) This should work in
> >data centers because partitions are likely to be very short-lived (aka
> >virtual partitions, which are due to congestion, not to any hardware
> >issue.)
>
> 1) Can u pls elaborate a bit more on this, didn't really understand it, when
> u said wait untill, does it mean
> a) wait till there are R no of replicas in the cluster?

Any time there is a change in the composition of the cluster it must
review its global state and if necessary arrange for new session
replicas to be installed on some nodes, for replicas to be migrated, or
for replicas to be deleted. For example, if R=3 and replica no. 2 of
session 49030 was on node N7 which just bowed out, the cluster might
decide to install a replica of session 49030 on node N3.

Rearranging replicas, aka state-transfer, takes time. While that happens
you block new http requests for the relevant sessions.

> b) or until a session is replicated within the server the http request
> is received?

No. See above. Although when rearranging replicas you have some freedom
and you are free to give priority to some nodes over others.

> 2) when u said virtual partition did u mean a sub set of nodes
> being isolated due to congestion.

Yes.

> By isolation I meant they have not able to
> replicate there sessions or receive replications from sessions from other
> nodes outside of the subset due to congestion. Is this correct??

It's also possible that all nodes are up to date on a given session, and
the virtual partition heals before the user tries to update the session
again.

A partition occurs when nodes 1 and 2 agree with each other that nodes 3
and 4 are no longer around and install a new group, a.k.a. "view", a.k.a.
"configuration".

But 3 and 4 may appear again soon after (e.g. 5 seconds) and so the
partition may end up having few consequences if any.

> 3) Assuming an HTTP request arrives and the cluster does not have R copies.
> How different is this situation from "an HTTP request arrives but no session
> replication in that server" ??
>
> >3. If at any time an HTTP request reaches a server which does not itself have a
> >replica of the session it sends a client redirect to a node which does.
> How can this be achived?? Is it by having a central cordinator that handles
> a mapping or is this information replicated in all nodes on the entire
> cluster.
>
> information == "which clusters have replicas of each session

Re: Replication using totem protocol

2006-01-16 Thread lichtner

On the subject of partitions, I remembered this paper I read a few years
ago which shows that partitions, whether caused by hardware failures or by
heavy traffic, are a fact of life:

"Understanding Partitions and the 'No Partition' Assumption"
A. Ricciardi et al.

http://citeseer.ist.psu.edu/32449.html


Re: Replication using totem protocol

2006-01-16 Thread lichtner


On Mon, 16 Jan 2006, Jules Gosnell wrote:

> REMOVE_NODE is when a node leaves cleanly, FAILED_NODE when a node dies ...

I figured. I imagine that if I had to add this distinction to totem I
would add a message where the node in question announces that it is
leaving, and then stops forwarding the token. On the other hand, it does
not need to announce anything, and the other nodes will detect that it
left. In fact totem does not judge a node either way: you can leave
because you want to or under duress, and the consequences as far as
distributed algorithms go are probably minimal. I think the only place
where this might matter is for logging purposes (but that could be handled
at the application level) or to speed up the membership protocol, although
it's already pretty fast.

So I would not draw a distinction there.

> By also treating nodes joining, leaving and dieing, as split and merge
> operations I can reduce the number of cases that I have to deal with.

I would even add that the difference is known only to the application.

> and ensure that what might be very uncommonly run code (run on network
> partition/healing) is the same code that is commonly run on e.g. node
> join/leave - so it is likely to be more robust.

Sounds good.

> In the case of a binary split, I envisage two sets of nodes losing
> contact with each other. Each cluster fragment will repair its internal
> structure. I expect that after this repair, neither fragment will carry
> a complete copy of the cluster's original state (unless we are
> replicating 1->all, which WADI will usually not do), rather, the two
> datasets will intersect and their union will be the original dataset.
> Replicated state will carry a version number.

I think a version number should work very well.

> If client affinity survives the split (i.e. clients continue to talk to
> the same nodes), then we should find ourselves in a working state, with
> two smaller clusters carrying overlapping and diverging state. Each
> piece of state should be static in one subcluster and divergant in the
> other (it has only one client). The version carried by each piece of
> state may be used to decide which is the most recent version.
>
> (If client affinity is not maintained, then, without a backchannel of
> some sort, we are in trouble).
>
> When a merge occurs, WADI will be able to merge the internal
> representations of the participants, delegating awkward decisions about
> divergant state to deploy-time pluggable algorithms. Hopefully, each
> piece of state will only have diverged in one cluster fragment so the
> choosing which copy to go forward with will be trivial.

> A node death can just be thought of as a 'split' which never 'merges'.

Definitely :)

> Of course, multiple splits could occur concurrently and merging them is
> a little more complicated than I may have implied, but I am getting
> there

Although I consider the problem of session replication less than
glamorous, since it is at hand, I would approach it this way:

1. The user should configure a minimum-degree-of-replication R. This is
the number of replicas of a specific session which need to be available in
order for an HTTP request to be serviced.

2. When an HTTP request arrives, if the cluster which received it does not
have R copies then it blocks (it waits until there are). This should work
in data centers because partitions are likely to be very short-lived (aka
virtual partitions, which are due to congestion, not to any hardware
issue.)

3. If at any time an HTTP request reaches a server which does not itself
have a replica of the session, it sends a client redirect to a node which
does.

4. When a new cluster is formed (with nodes coming or going), it takes an
inventory of all the sessions and their version numbers. Sessions which do
not have the necessary degree of replication need to be fixed, which will
require some state transfer, and possibly migration of some sessions for
proper load balancing.

Guglielmo


Re: Infiniband

2006-01-15 Thread lichtner

I think I have found some information which if I had hardware available
would lead me to skip the prototyping stage entirely:

This paper benchmarks the performance of infiniband through 1) UDAPL and
2) Sockets Direct Protocol (SDP) - also available from openib.org:

"Sockets Direct Protocol over Infiniband Clusters: Is it Beneficial?"
Balaji et al.

I think the value of SDP over UDAPL is that it looks like a socket
interface, which means porting applications over could even be trivial.

However, I think that SDP in that paper does not have zero copy, and that
is why the paper shows that UDAPL is faster.

However, Mellanox has a zero-copy SDP:

"Transparently Achieving Superior Socket Performance Using Zero Copy
Socket Direct Protocol over 20Gb/s Infiniband Links"
Goldenberg et al.

It's just enlightening to see the two main sources of waste in network
applications be removed one at a time, namely 1) context switches and 2)
copies.

Using zero-copy SDP from Java should be pretty easy, although interfacing
with UDAPL would also be valuable.

I have found evidence that Sun was planning to include support for Sockets
Direct Protocol in jdk 1.6, but that they gave up because infiniband is
not mainstream hardware (yet).

I think IBM may have put some of this in WebSphere 6:

http://domino.research.ibm.com/comm/research.nsf/pages/r.distributed.innovation.html?Open&printable

That would be just typical of IBM, understating or flat out
hiding important achievements. When Parallel Sysplex came out IBM did a
test with 100% data sharing, meaning _all_ reads and writes were remote
(to the CF), and measured the scalability, and it's basically linear, but
off the diagonal by (only) 13%. Instead of understanding that mainframes
now scaled horizontally, the press focused on the "overhead". This
prompted Lou Gerstner to remark that if he invited the press to his house
and showcased his dog walking on water the press would report "Gerstner
buys dog that can't swim."

I think if I had a few thousand dollars to spare I would definitely get
a couple of opteron boxes and get this off the ground.

I think from the numbers you can conclude that even where a person wants
to be perverse and keep using object serialization, they will still get
much better throughput (if not better latency) because half the cpu can be
spent executing the jvm rather than switching context and copying data
from memory to memory.

I hope somebody with a budget picks this up soon.

Guglielmo

On Sun, 15 Jan 2006, James Strachan wrote:

> On 14 Jan 2006, at 22:27, lichtner wrote:
> > On Fri, 13 Jan 2006, James Strachan wrote:
> >
> >>> The infiniband transport would be native code, so you could use JNI.
> >>> However, it would definitely be worth it.
> >>
> >> Agreed! I'd *love* a Java API to Infiniband! Have wanted one for ages
> >> & google every once in a while to see if one shows up :)
> >>
> >> It looks like MPI has support for Infiniband; would it be worth
> >> trying to wrap that in JNI?
> >> http://www-unix.mcs.anl.gov/mpi/
> >> http://www-unix.mcs.anl.gov/mpi/mpich2/
> >
> > I did find that HP has a Java interface for MPI. However, to me it
> > doesn't
> > necessarily seem that this is the way to go. I think for writing
> > distributed computations it would be the right choice, but I think
> > that
> > the people who write those choose to work in a natively compiled
> > language,
> > and I think that this may be the reason why this Java mpi doesn't
> > appear
> > to be that well-known.
> >
> > However I did find something which might work for us, namely UDAPL
> > from the DAT Collaborative. A consortium created a spec for
> > interface to
> > anything that provides RDMA capabilities:
> >
> > http://www.datcollaborative.org/udapl.html
> >
> > The header files and the spec are right there.
> >
> > I downloaded the only release made by infiniband.sf.net and they claim
> > that it only works with kernel 2.4, and that for 2.6 you have to use
> > openib.org. The latter claims to provide an implementation of UDAPL:
> >
> > http://openib.org/doc.html
> >
> > The wiki has a lot of info.
> >
> > From the mailing list archive you can tell that this project has a
> > lot of
> > momentum:
> >
> > http://openib.org/pipermail/openib-general/
>
> Awesome! Thanks for all the links
>
>
> > I think the next thing to do would be to prove that using RDMA as
> > opposed
> > to udp is worthwhile. I think it is, because JITs are so fast now,
> > but I
> > think that before planning anything long-term I would get two
> > infiniband-enabled b

Re: Dev branches?

2006-01-15 Thread lichtner

I support your idea. Making branches for new feature development is
a common practice.

Were you thinking of doing it for every single change request, or only for
big ones?

On Sun, 15 Jan 2006, Greg Wilkins wrote:

>
> I would like to create a dev branch to start working on some
> 1.1 and 2.0 stuff.
>
> But I don't think it is appropriate to pollute /branch with
> private branches as it will be good to be able to go there and see
> all the official branches:
>
>/branch/1.0
>/branch/1.1
>
> without seeing
>
>/branch/djencks
>/branch/gregw
>/branch/dain
>
> etc. etc.
>
> So I would like to propose a secondary location for development
> branches /devbranch.
>
> Moreover, I don't think that development branches should be
> considered private branches as this would encourage many branches
> and discourage cooperative development.  I think they should be named
> for the features they are trying to develop.   So we would have
> things like
>
>  /devbranch/servlet-2.5
>  /devbranch/openejb-3
>  /devbranch/kernel
>
>
> I think the policy should be that anything targeted for an
> x.0 release should be developed in a /devbranch.
>
> Anything for a x.y branch can be developed in /trunk or
> in a /devbranch if it's development may take longer
> than a single x.y cycle or if it's inclusion in an x.y
> release is up for debate.
>
> Anything for a x.y.z branch can be developed in trunk but
> should be stabilized in the /branch/x.y
>
>
>


Re: Infiniband

2006-01-14 Thread lichtner


On Fri, 13 Jan 2006, James Strachan wrote:

> > The infiniband transport would be native code, so you could use JNI.
> > However, it would definitely be worth it.
>
> Agreed! I'd *love* a Java API to Infiniband! Have wanted one for ages
> & google every once in a while to see if one shows up :)
>
> It looks like MPI has support for Infiniband; would it be worth
> trying to wrap that in JNI?
> http://www-unix.mcs.anl.gov/mpi/
> http://www-unix.mcs.anl.gov/mpi/mpich2/

I did find that HP has a Java interface for MPI. However, to me it doesn't
necessarily seem that this is the way to go. I think for writing
distributed computations it would be the right choice, but I think that
the people who write those choose to work in a natively compiled language,
and I think that this may be the reason why this Java mpi doesn't appear
to be that well-known.

However I did find something which might work for us, namely UDAPL
from the DAT Collaborative. A consortium created a spec for an interface
to anything that provides RDMA capabilities:

http://www.datcollaborative.org/udapl.html

The header files and the spec are right there.

I downloaded the only release made by infiniband.sf.net and they claim
that it only works with kernel 2.4, and that for 2.6 you have to use
openib.org. The latter claims to provide an implementation of UDAPL:

http://openib.org/doc.html

The wiki has a lot of info.

From the mailing list archive you can tell that this project has a lot of
momentum:

http://openib.org/pipermail/openib-general/

I think the next thing to do would be to prove that using RDMA as opposed
to udp is worthwhile. I think it is, because JITs are so fast now, but I
think that before planning anything long-term I would get two
infiniband-enabled boxes and write a little prototype. I think Appro sells
infiniband blades with Mellanox hcas.

There is also IBM's proprietary API for clustering mainframes, the
Coupling Facility:

http://www.research.ibm.com/journal/sj36-2.html

There are some amazing articles there.

Personally I also think there is value in implementing replication using
udp (process groups libraries such as evs4j), so I would pursue both at
the same time.

Guglielmo


Re: Release and Version Philosophy [Discussion]

2006-01-14 Thread lichtner

To me the only important requirements in release numbers are that they
should tell the user:

1. Whether the release is backward compatible.

2. Whether it's a stable build vs. unstable.

I would rather not have to learn the various meanings of digits 1-N. It
seems like it would make it more transparent, but actually it makes it
less transparent because people need to think (!).

As far as patching goes, there will be people who want bug fix 5 but not
6,7,8 because they think that any change other than the one they want will
cause them to stay past five o'clock, but I think that those are the kind
of people that must be made to pay for commercial support.

On Sun, 15 Jan 2006, Matt Hogstrom wrote:

> I've seen several posts about the upcoming 1.0.x release and 1.1 and 2.0 etc.
> lately and I think its great that we're having these discussions.
>
> I'd like to use this thread to aggregate people's thoughts about this topic 
> in a
> single thread for reference and clarification as we make forward progress.  So
> I'd like to clarify some terminology (at least post what the terms mean to me)
> so we can make some meaningful plans for our various efforts going forward.
>
> This is a strawman so don't get too revved up.  I think we need to balance
> between structure and fluidity and I'm not sure exactly how to do that; input
> welcome.
>
> First, I see there is a structure for versioning like:
>
> v.r.m[.f] where:
>
> v = Version
> r = Release
> m = modification
> f = fix (optional)
>
> Version
> ---
> - Represents a significant set of improvements in Geronimo and a definite
> milestone for our users.
> - New features are introduced that may break compatibility and require users 
> to
> have to modify their existing applications that ran on previous Versions.
> (Although we should strive to not force them to change immediately but rather
> provide something like a V-1 or -2 compatibility story.  -2 Would be excellent
> but that might be too optimistic given the rate of change.
> - Things like JEE 5 would be found in a version change.
> - Goes through a formal Release Candidate process for user feedback and has
> broad coverage in terms of announcement.  (Not just the Dev List)
> - Release Candidates look something like Geronimo-2.0-RC1/2/3 etc.
>
> Release
> ---
> - Can include significant new features / improvements.
> - Should not break existing applications (lot's of traffic from users saying
> something worked on M5 but doesn't on 1.0)
> - Includes bug fixes and the like.
> - It would be hard to justify moving to JEE 5 based on a release change.
> - Has broad announcement
> - Does not go through formal Release Candidate Process but does make interim
> release attempts based on a dated binary release (ala 
> Geronimo-jetty-1.1-rc20060315)
>
> Modification
> 
> - Incremental release that builds on the goals of the V.R its based on.
> - Can include new features
> - Cannot disrupt existing application deployments
> - Includes multiple bug fixes
>
> Fix
> ---
> - Focused release that addresses a specific critical bug.
> - We're no where near this now but it would be nice to release specific bug
> fixes and not whole server releases.
> - An example of this would be something like a fix to the recent problem Jetty
> uncovered related to security.  A fix in this context would be a simple
> packaging change to get the new Jetty Jar into the build and wouldn't require 
> a
> whole new server to be spun off.
>
> Thoughts?
>


Re: Infiniband

2006-01-14 Thread lichtner

On Fri, 13 Jan 2006, James Strachan wrote:

> > The infiniband transport would be native code, so you could use JNI.
> > However, it would definitely be worth it.
>
> Agreed! I'd *love* a Java API to Infiniband! Have wanted one for ages
> & google every once in a while to see if one shows up :)
>
> It looks like MPI has support for Infiniband; would it be worth
> trying to wrap that in JNI?
> http://www-unix.mcs.anl.gov/mpi/
> http://www-unix.mcs.anl.gov/mpi/mpich2/

I don't know MPI. Do you think it's a better interface, or that it is much
easier?

I will take a look at MPI.

Guglielmo


Re: -1 on checkin of 368344 was Re: [wadi-dev] Clustering: WADI/Geronimo integrations.

2006-01-14 Thread lichtner


On Sat, 14 Jan 2006, David Jencks wrote:

> What would the reaction be to something
> that only sort of works in an official release?

IMHO all features in a production release should be usable.
It's not a problem if the functionality is limited, as long as it works.


Re: Replication using totem protocol

2006-01-13 Thread lichtner

As Jules requested I am looking at the AC api. I report my observations
below:

ClusterEvent appears to represent membership-related events. These you
can generate from evs4j, as follows: write an adapter that implements
evs4j.Listener. In the onConfiguration(..) method you get notified of
new configurations (new groups). You can generate ClusterEvent.ADD_NODE
etc. by doing a diff of the old configuration and the new one.

Evs4j does not support arbitrary algorithms for electing coordinators.
In fact, in totem there is no coordinator. If a specific election is
important for you, you can design one using totem's messages. If not,
in evs4j node names are integers, so the coordinator can be the lowest
integer. This is checked by evs4j.Configuration.getCoordinator().

I don't know the difference between REMOVE_NODE and FAILED_NODE. In totem
there is no difference between the two.

The only other class I think I need to comment on is Cluster. It
resembles a jms session, even being coupled to actual jms interfaces. You
can definitely implement producers and consumers and put them on top of
evs4j. The method send(Destination, Message) would have to encode Message
on top of fixed-length evs4j messages. No problem here.

Personally, I would not have mixed half the jms api with an original api.
I don't think it sends a good message as far as risk management goes. I
think people are prepared to deal with a product that says 'we assume jms'
or 'we are completely home-grown because we are so much better', but not a
mix of the two. Anyway that's not for me to say. Whatever works.

In conclusion, yes, I think you could build an implementation of AC on top
of evs4j.

BTW, how does AC defend against the problem of a split-brain cluster?
Shared scsi disk? Majority voting? Curious.

Guglielmo


Re: Infiniband

2006-01-13 Thread lichtner

On Fri, 13 Jan 2006, Alan D. Cabrera wrote:

> > The infiniband transport would be native code, so you could use JNI.
> > However, it would definitely be worth it.
>
> Do you have any references to the where one could get a peek at the
> transport API?

http://infiniband.sourceforge.net/



Re: Replication using totem protocol

2006-01-13 Thread lichtner
> Interesting.  Can you suggest a protocol we should use for
> pessimistic distributed locking?   I expect the cluster size to be
> between 2-16 nodes with the sweet spot at 4 nodes.   Each node will
> be processing about 500-1000 tps and each tps will require on average
> about 1-4 lock requests (most likely just one request for the web
> session).  Nodes should be able to join and leave the cluster easily.

If you must be a pessimist, then get shared locks for reads, exclusive
locks for writes, two locks conflicting if at least one of them is an
exclusive lock. Hold the locks acquired until after commit (strict 2pl).

To get a lock you send a totem message and wait for it to arrive. A few ms.

The latency for 4 nodes should be very respectable. For 16 nodes it might
still be acceptable. I would measure the throughput/latency curve in your
lab and based on that you can decide at what point you need something more
sophisticated (which for me would be an independent replicated lock
manager which can be reached through short tcp messages and some basic
load balancing.)

This paper actually shows some simulations of various concurrency control
protocols, so you can make an educated decision:

http://citeseer.ist.psu.edu/299097.html

Guglielmo




Infiniband

2006-01-13 Thread lichtner

With regard to clustering, I also want to mention a remote option, which
is to use infiniband RDMA for inter-node communication.

With an infiniband link between two machines you can copy a buffer
directly from the memory of one to the memory of the other, without
switching context. This means the kernel scheduler is not involved at all,
and there are no copies.

I think the bandwidth can be up to 30Gbps right now. Pathscale makes an IB
adapter which plugs into the new HTX hypertransport slot, which is to say
it bypasses the pci bus (!). They report an 8-byte message latency of 1.32
microseconds.

I think IB costs about $500 per node. But the cost is going down steadily
because the people who use IB typically buy thousands of network cards at
a time (for supercomputers.)

The infiniband transport would be native code, so you could use JNI.
However, it would definitely be worth it.

Guglielmo



Re: Replication using totem protocol

2006-01-13 Thread lichtner

> > If you cluster an entity bean on two nodes naively, you lose many of the
> > benefits of caching. This is because neither node, at the beginning of a
> > transaction, knows whether the other node has changed the beans contents
> > since it was last loaded into cache, so the cache must be assumed
> > invalid. Thus, you find yourself going to the db much more frequently
> > than you would like, and the number of trips increases linearly with the
> > number of clients - i.e. you are no longer scalable.
>
>
> It depends on your transaction isolation level; i.e. do you want to do a
> dirty read or not. You should be able to enable dirty reads to get
> scalability & performance.

I like dirty reads from a theoretical standpoint because if you can do
dirty reads it means you have a high-class message bus. However, I don't
expect people to ask for dirty reads unless you mean that they are going
to roll back automatically. Example: inventory.

Non-transactional applications however could use dirty data and still
find it useful even if they don't roll back.

> The only way to really and truly know if the cache is up to date is to use a
> pessimistic read lock; but thats what databases are great for - so you might
> as well use the DB and not the cache in those circumstances. i.e. you always
> use caches for dirty reads

Major databases currently do not use read locks. Oracle and SQL Server use
mv2pl (multiversion two-phase locking.) MySQL on the InnoDB storage engine
also. PostgreSQL too. (Interestingly, the first article I know of on this
is Bernstein and Goodman, 1978 (!)). Sybase I don't know. I think it may
have fallen a bit behind.

When a tx starts you assign a cluster-wide unique id to it. That's its
'position' in time (known in Oracle as the scn, system change number).
When the tx writes a data item it creates a new version, tagged with this
scn. When a transaction wants to read the data, it reads the last version
before _its_ scn. So when you read you definitely don't need a lock. When
you write you can either use a lock (mv2pl) or proceed until you find a
conflict, in which case you roll back. The latter should be used in
workloads that have very little contention. Or you can use it in general
also, but you need to have automatic retries, as with mdbs, and you should
really not send any data to the dbms until you know for sure, to cut down
on the time required to roll back.

> * invalidation; as each entity changes it sends an invalidation message -
> which is really simple & doesn't need total ordering, just something
> lightweight & fast. (Actually pure multicast is fine for invalidation stuff
> since messages are tiny & reliability is not that big a deal, particularly
> if coupled with a cache timeout/reload policy).
>
> * broadcasting the new data to interested parties (say everyone else in the
> cluster). This typically requires either (i) a global publisher (maybe
> listening to the DB transaction log) or (ii) total ordering if each entity
> bean server sends its changes.

That's the one.

> The former is good for very high update rates or very sparse caches, the
> latter is good when everyone interested in the cluster needs to cache mostly
> the same stuff & the cache size is sufficient that most nodes have all the
> same data in their cache. The former is more lightweight and simpler & a
> good first step :)

You can split up the data also. You can keep 4 replicas of each data item
instead of N, and just migrate it around. But for semi-constant data like
reference data, e.g. stock symbols or client data you can keep copies
everywhere.

Guglielmo


Re: Replication using totem protocol

2006-01-13 Thread lichtner
> You could go one step further and send, not an invalidation, but a
> replication message. This would contain the Entity's new value and head
> off any reloading from the DB at all
>
> All of this needs to be properly integrated with e.g. transactions,
> locking etc...
>
> Perhaps Totem might be useful here ?

Yes, I would say that in this area it would be worth using totem. As I said
in a different email, you would have to pick a concurrency control
protocol. I favor multi-version 2-phase locking, or multiversion timestamp
ordering in the case of transactions that don't mind rolling back.



Re: Replication using totem protocol

2006-01-13 Thread lichtner

I will take a closer look at it. My first impression was that
activecluster assumes a jms or jms-like api as a transport.

> [EMAIL PROTECTED] wrote:
>
>>>Given the inherent over head in total order protocols, I think we
>>>should work to limit the messages passed over the protocol, to only
>>>the absolute minimum to make our cluster work reliably.
>>>Specifically, I think this is only the distributed lock.  For state
>>>replication we can use a much more efficient tcp or udp based protocol.
>>>
>>>
>>
>>As I said, if your workload has low data sharing (e.g. session
>>replication), you should not use totem. It's designed for systems where
>>_each_ processor needs _most_ of the messages.
>>
>>
> Geronimo has a number of replication usecases (I'll be enumerating them
> in a document that I am putting together at the moment) Totem may well
> suit some of these. If we were to look seriously at using it, I think
> the first technical consideration would be API. Geronimo already has
> ActiveCluster (AC) in the incubator and WADI (An HttpSession and SFSB
> clustering solution is built on AC). AC is both an API to basic
> clustering fn-ality and a number of pluggable impls. My suggestion would
> be that we look at how we could map Totem to the AC API.
>
> Do Totem and AC (http://activecluster.codehaus.org/) look like a good
> match ?
>
>
> Jules
>
>
> --
> "Open Source is a self-assembling organism. You dangle a piece of
> string into a super-saturated solution and a whole operating-system
> crystallises out around it."
>
> /**
>  * Jules Gosnell
>  * Partner
>  * Core Developers Network (Europe)
>  *
>  *www.coredevelopers.net
>  *
>  * Open Source Training & Support.
>  **/
>
>




Re: Replication using totem protocol

2006-01-12 Thread lichtner
> It's been talked about but currently not implemented. I'm catching up on the
> conversation and haven't looked at the pointers yet so I have a bit of
> reading
> to do.
>
> Are you thinking about using Totem to replicate Entity cache information
> in a cluster?

Yes.

You can take your pick of concurrency control protocol:

1. Lock locally for reads and possibly roll back if a write arrived with
an earlier transaction id. This works well for low sharing and mdb-based
transactions, because jms will retry the transaction automatically.

2. Lock globally, shared locks for reads and exclusive locks for writes.
This is good if you really cannot handle rollbacks or you have close to
100% sharing so you are destined to wait anyway. It actually turns out
that if you have this type of workload _and_ you expect your application
to operate at peak throughput then totem provides pretty predictable
latency (typically a few ms), which is nice.

3. Multiversion timestamp ordering. See 1, but without read locks.

4. Multiversion two-phase locking. Get exclusive locks for writes, with
each write creating a new version, while reads just use an existing
version without locks. This is my favorite.

Guglielmo




Re: Replication using totem protocol

2006-01-12 Thread lichtner
> [EMAIL PROTECTED] wrote:
>> Well, you guys let me know if I can help you in any way.
>
> Keep on talking ;-)

Okay. I will ask you a question then. What are you doing as far as caching
entity beans goes?



Re: Replication using totem protocol

2006-01-12 Thread lichtner

Well, you guys let me know if I can help you in any way.

> I think there is a time and place for this and can be leveraged in other
> protocols.  As a minimum it can be a pluggable protocol.  Its a great
> start.
>
> [EMAIL PROTECTED] wrote:
>>> Given the inherent over head in total order protocols, I think we
>>> should work to limit the messages passed over the protocol, to only
>>> the absolute minimum to make our cluster work reliably.
>>> Specifically, I think this is only the distributed lock.  For state
>>> replication we can use a much more efficient tcp or udp based protocol.
>>
>> As I said, if your workload has low data sharing (e.g. session
>> replication), you should not use totem. It's designed for systems where
>> _each_ processor needs _most_ of the messages.
>>
>>
>>
>




Re: Replication using totem protocol

2006-01-12 Thread lichtner
> Given the inherent overhead in total order protocols, I think we
> should work to limit the messages passed over the protocol, to only
> the absolute minimum to make our cluster work reliably.
> Specifically, I think this is only the distributed lock.  For state
> replication we can use a much more efficient tcp or udp based protocol.

As I said, if your workload has low data sharing (e.g. session
replication), you should not use totem. It's designed for systems where
_each_ processor needs _most_ of the messages.






Re: Replication using totem protocol

2006-01-12 Thread lichtner
> No.  You license the code to the Apache Software Foundation giving
> the foundation the rights to relicense under any license (so the
> foundation can upgrade the license as they did with ASL2).  We do ask
> that you change the copyrights on the version of the code you give to
> the ASF to something like "Copyright 2004 The Apache Software
> Foundation or its licensors, as applicable."

That _is_ transferring the copyright.

As I told Jeff on the phone, I would definitely consider this if it turns
out that evs4j will really be used, but I would rather not grant someone
an unlimited license at the present time. Jeff said we are going to have a
discussion, so we'll know more soon enough.

> Nothing better to do between jobs than coding :)

You should see the next program I am writing ;)

>> Also, what do you need the locks for?
>
> Locking web sessions and stateful session beans in the cluster when a
> node is working on it.

I see. I don't think I would pass the token around all the nodes just for
session replication. It's a low-sharing workload, meaning you could have
50 servers but you only want 3 copies of a session, say.

But you could write a high-available lock manager using totem, say, with
three copies of the system, and write a low-latency tcp-based protocol to
grab the lock. The time to get the lock would be the tcp round-trip plus
the time it takes for totem to send itself a 'safe' message, which on
average takes 1.5 token rotations (as opposed to 0.5). And you would
load-balance among the three copies. That would probably get you a total
latency of about 5 ms to acquire a lock (just a gut feeling), plus
scalability. And you can always add more copies.
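
A very rough sketch of the client side of that (the "LOCK"/"GRANTED" wire
protocol, the host:port list and the LockClient name are all invented for
the example; the totem-backed grant logic inside the lock-manager replicas
is not shown): the client just round-robins over the replicas and blocks on
a TCP request until one of them reports the grant.

  import java.io.BufferedReader;
  import java.io.IOException;
  import java.io.InputStreamReader;
  import java.io.PrintWriter;
  import java.net.Socket;
  import java.util.concurrent.atomic.AtomicInteger;

  // Hypothetical client for a lock service replicated over totem.
  // The "LOCK <name>" / "GRANTED" wire protocol is made up for illustration.
  public class LockClient {

      private final String[] managers;   // e.g. {"nodeA:9000", "nodeB:9000", "nodeC:9000"}
      private final AtomicInteger next = new AtomicInteger();

      public LockClient(String... managers) {
          this.managers = managers;
      }

      public void lock(String name) throws IOException {
          // Load-balance across the replicas; any one of them can answer,
          // because the grant itself is agreed on inside the replica group
          // with a totally ordered ('safe') message.
          int i = Math.floorMod(next.getAndIncrement(), managers.length);
          String[] hostPort = managers[i].split(":");
          try (Socket s = new Socket(hostPort[0], Integer.parseInt(hostPort[1]));
               PrintWriter out = new PrintWriter(s.getOutputStream(), true);
               BufferedReader in = new BufferedReader(
                       new InputStreamReader(s.getInputStream()))) {
              out.println("LOCK " + name);
              // Blocks for one TCP round-trip plus ~1.5 token rotations.
              String reply = in.readLine();
              if (!"GRANTED".equals(reply)) {
                  throw new IOException("lock refused: " + reply);
              }
          }
      }
  }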

Guglielmo



Re: Fwd: Replication using totem protocol

2006-01-12 Thread lichtner

I didn't see it - I'm not sure why.

> According to the website (http://www.bway.net/~lichtner/evs4j.html):
>
>  "Extended Virtual Synchrony for Java (EVS4J), an Apache-
> Licensed, pure-Java implementation of the fastest known totally
> ordered reliable multicast protocol."

Yes, I wrote that.

> Once you have a totally ordered messaging protocol, implementing a
> distributed lock is trivial (I can go into detail if you want).

Yes. You just send a totally-ordered message and wait for it to arrive.
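
Spelled out as code, the lock is roughly this (a sketch only;
TotalOrderChannel and DeliveryListener are stand-ins I made up, not the
evs4j API): every node broadcasts its request, every node delivers all
requests in the same order, and whoever's request sits at the head of that
common queue holds the lock.

  import java.util.ArrayDeque;
  import java.util.Deque;

  // Stand-in for a reliable, totally ordered broadcast (not the evs4j API).
  interface TotalOrderChannel {
      void broadcast(String message);
      void setListener(DeliveryListener listener);
  }

  interface DeliveryListener {
      void deliver(String sender, String message);  // same order on every node
  }

  // Distributed lock built on total order: first request delivered wins.
  class TotalOrderLock implements DeliveryListener {

      private final TotalOrderChannel channel;
      private final String self;
      private final Deque<String> queue = new ArrayDeque<>();  // requesters, in delivery order

      TotalOrderLock(TotalOrderChannel channel, String self) {
          this.channel = channel;
          this.self = self;
          channel.setListener(this);
      }

      public synchronized void lock() throws InterruptedException {
          channel.broadcast("LOCK");
          // We own the lock once our own request reaches the head of the
          // queue, which is identical on every node.
          while (queue.isEmpty() || !queue.peekFirst().equals(self)) {
              wait();
          }
      }

      public void unlock() {
          channel.broadcast("UNLOCK");
      }

      @Override
      public synchronized void deliver(String sender, String message) {
          if ("LOCK".equals(message)) {
              queue.addLast(sender);
          } else if ("UNLOCK".equals(message)) {
              queue.remove(sender);
          }
          notifyAll();
      }
  }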

> I suggest we ask Guglielmo if he would like to donate his
> implementation to this incubator project

I don't know about donating it. Who would they want me to transfer the
copyright to?

> and if he would like to work on a pessimistic distributed locking
> implementation.
> What do you think?

I would definitely like to work on it, but I still work for a living, so
that's something to think about. (I happen to be between jobs right now.)

Also, what do you need the locks for?

Guglielmo



Totem Protocol and Geronimo Replication

2006-01-12 Thread lichtner

Over the phone Jeff asked me to start a discussion about the totem
protocol, so here it is.

If anyone just wants to get it from the horse's mouth you can read this
paper:

"The Totem Single-Ring Ordering and Membership Protocol",
Y. Amir, L. E. Moser, P. M. Melliar-Smith, D. A. Agarwal, and P. Ciarfella,
ACM Transactions on Computer Systems 13, 4 (November 1995), 311-342.

It's on the web.

When implementing replication using multicasting, you need the following:

1. The ability to discern which 'processors' or 'nodes' are in "the group".

2. The ability to re-send dropped multicast packets.

3. Flow control.

4. The ability to compute at least a partial ordering on all messages.

5. The ability to avoid a split-brain cluster.

6. State transfer.

Number 5 is not part of the totem protocol so I am not going to address it
right now. You can google "quorum resource" or read Pfister's book "In
Search of Clusters".

Number 6 is not really tied to reliable multicasting, so I am not going to
address it here. Interesting issue.

The membership protocol (1) is not a major challenge (although from the
process-group theoretical viewpoint it may be the most interesting.) Totem
has its own membership protocol. When you start up the totem protocol it
detects the presence of other processors using join messages, which is
like saying "i am here, and these are the other processors I have detected
so far". You can read about the details in the totem paper above. The
membership protocol also detects failures and totem has a recovery state
to 'flush' the messages from the last ring. Again, see article.

Imagine that you now have a 'ring' of processors numbered 1 through n.
Processor number 1 can send some multicast packets, with ids 1, 2, etc. When
it has nothing more to send, it sends a udp packet to processor 2. This
packet is called the 'token'. In my implementation (evs4j) the token is
also multicast to keep it simple. The token contains the number of the
last message sent. Processor 2 may have missed any number of messages,
including the token itself. Retransmission of the token is handled using a
timeout (see article.) Once processor 2 has the token it looks at the last
id sent and its own buffer of received messages and it detects any gaps.
If there are any, then it lists the ids of the missed messages in the
token before it sends it on. It also sends its own new messages before
sending the token on. When a processor receives the token and sees that
some other processor missed messages, it resends whatever it can.

That's how totem adds reliability to multicasting.
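
A toy version of the gap-detection step (names are invented, and a real
implementation would work from a low-water mark rather than scanning from
id 1): when the token arrives, compare the highest id recorded on it with
what was actually received, and put the missing ids on the token's
retransmission request list.

  import java.util.ArrayList;
  import java.util.List;
  import java.util.Set;

  // Toy illustration of what a processor does when the token arrives.
  final class TokenGapCheck {

      // highestIdOnToken: last message id any processor has sent so far.
      // received: ids of the messages this processor actually has.
      static List<Long> missingIds(long highestIdOnToken, Set<Long> received) {
          List<Long> missing = new ArrayList<>();
          // Simplification: scan from 1; a real implementation would keep a
          // low-water mark of messages already delivered everywhere.
          for (long id = 1; id <= highestIdOnToken; id++) {
              if (!received.contains(id)) {
                  missing.add(id);   // goes on the token's retransmission list
              }
          }
          return missing;
      }
  }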

Ordering is trivial. Since each message has a positive integer as an id,
totem does not deliver message N to the application unless it already
delivered message N-1.
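
The delivery rule really is that small. A toy buffer (not the evs4j code)
that enforces it:

  import java.util.SortedMap;
  import java.util.TreeMap;
  import java.util.function.Consumer;

  // Toy in-order delivery: hand message N to the application only
  // after N-1 has been delivered; buffer anything that arrives early.
  final class OrderedDelivery<M> {

      private final SortedMap<Long, M> pending = new TreeMap<>();
      private long nextToDeliver = 1;

      synchronized void onReceive(long id, M message, Consumer<M> application) {
          pending.put(id, message);
          // Drain every consecutive message starting at the next expected id.
          while (pending.containsKey(nextToDeliver)) {
              application.accept(pending.remove(nextToDeliver));
              nextToDeliver++;
          }
      }
  }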

Flow control is actually totem's strongest point. On a LAN, when a message
is dropped it's usually because of buffer overflow at the receiver.
Therefore throttling is the critical factor that needs to be addressed in
order to reach the maximum theoretical throughput of the group as a whole.
Totem implements flow control using a window which sets the maximum number
of messages that may be sent during the current token rotation. It works
pretty well. You can read the details in the article above.
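
Reduced to a snippet (deliberately simplified, with invented names; the
real token carries more bookkeeping than this): the window caps how many
messages the whole ring may send in one rotation, and each processor, when
it holds the token, sends at most whatever is left of that budget.

  // Deliberately simplified flow control: the window is a budget of
  // messages per token rotation, shared by the whole ring.
  final class WindowBudget {

      private final int windowSize;   // max messages per rotation (the tuning knob)

      WindowBudget(int windowSize) {
          this.windowSize = windowSize;
      }

      // sentThisRotation: messages already sent on this rotation, as
      // recorded on the token. backlog: messages this processor wants to send.
      int allowance(int sentThisRotation, int backlog) {
          int remaining = Math.max(0, windowSize - sentThisRotation);
          return Math.min(remaining, backlog);
      }
  }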

EVS4J has an extra feature which is not in the totem article, namely
congestion control, aka window tuning. I used to test totem on my lan and
telnetting to the various servers was a nightmare because the network
cards were all packed with totem messages. I knew the problem had been
solved in TCP using the Van Jacobson et al. algorithm, so I adapted this
algorithm to the totem window. Now if you run the totem benchmark and try
to transfer a big file on the same lan the window backs off. So now I
still use a _maximum_ window size, but when there is extraneous load on
the LAN the window is free to oscillate between zero (just sending the
token around) and full throttle.
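
The tuning can be pictured as a TCP-style additive-increase /
multiplicative-decrease loop (my paraphrase of the idea, not the exact
evs4j algorithm or its constants): back the window off sharply when
retransmissions show up, and creep back toward the configured maximum when
the ring is quiet.

  // AIMD-style adaptation of the totem window, in the spirit of the
  // Van Jacobson TCP algorithm. Constants and shape are illustrative only.
  final class AdaptiveWindow {

      private final int maxWindow;   // the configured ceiling (full throttle)
      private double window;         // current effective window

      AdaptiveWindow(int maxWindow) {
          this.maxWindow = maxWindow;
          this.window = maxWindow;
      }

      // Called once per token rotation.
      void onRotation(boolean retransmissionsRequested) {
          if (retransmissionsRequested) {
              // Multiplicative decrease: someone is dropping packets,
              // so back off towards just circulating the token.
              window = window / 2;
          } else {
              // Additive increase back toward full throttle.
              window = Math.min(maxWindow, window + 1);
          }
      }

      int current() {
          return (int) window;
      }
  }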

As far as performance goes, I believe that on a regular pc and a fast
ethernet _hub_ evs4j will do at least 6000 messages per second (with
1500-byte messages) with an average latency of a few ms for a small ring.
For a larger ring the latency can get out of hand, and indeed there is a
whole other protocol called the totem "multiple-ring" protocol, which uses
several rings and gateways between them. I didn't implement that because I
think testing it for me would be too challenging.

The thing I like about totem is that its throughput and latency are
predictable. The maximum window size is the tuning parameter. The bigger
the window size, the greater the throughput (until you reach maximum) but
also the greater the latency. So it's ideal for systems in a data center
with high data-sharing requirements, whereas for loosely coupled clusters
you can do better with a non-token-passing protocol. The problem with
those is that flow control itself requires communication among the nodes,
so those protocols probably don't get close to the maximum theoretical
throughput.

Re: Fwd: Replication using totem protocol

2006-01-12 Thread lichtner
> Yes...awesome.  Bruce had chatted with me about this too...I am very
> interested.

Thanks.

> Guglielmo, I would be very interested in speaking with you further on
> this.

I am available to speak more about it. If you need my phone number, it's
six one nine, two five five, nine seven eight six.

> This looks like something we could heavily use.  What are your thoughts?

I think totem is a great protocol. Whether _you_ need it depends on the
application. I originally wrote this code back in 2000, and it took me
this long to find the ideal application for it.

I would like to recommend the following article by Ken Birman (probably
the grandad of process groups):

http://portal.acm.org/citation.cfm?id=326136

Unfortunately you need to be a member of the acm to read it (I used to be,
but right now I am not.) This article describes his experiences using
ISIS, an early process group library, to build some interesting systems.

Guglielmo