+1
+1
Regards,
Thomas
Hi,
Network delay .. is faster than the delay of a disk
I wrote the network is the new disk (in terms of bottleneck, in
terms of performance problem). Network delay may be a bit faster now
than disk access. But it's *still* a huge bottleneck (compared to
in-memory operations), especially if
Hi,
See section 7 Vector Time. Also see [1] from slide 14 onwards for a more
approachable reference.
[1] http://www.cambridge.org/resources/0521876346/6334_Chapter3.pdf
Thanks! From what I read so far it sounds like my idea is called Time
Warp / Virtual Time.
On page 10 and 11 there is the
Hi,
Let's discuss partitioning / sharding in another thread. Asynchronous
change merging is not about how to manage huge repositories (for that
you need partitioning / sharding), it's about how to manage cluster
nodes that are relatively far apart. I'm not sure if this is the
default use case for
The current Jackrabbit clustering doesn't scale well for writes
because all cluster nodes use the same persistent storage. Even if
the persistent storage is clustered, the cluster journal relies on
changes being immediately visible in all nodes. That means Jackrabbit
clustering can scale well for
Hi,
Are you sure the problem is concurrency and not performance? Are
you sure that the persistence manager you use does support higher
write throughput? What persistence manager do you use, and what is the
write throughput you see, and what do you need?
Regards,
Thomas
Hi,
Do you use Day CRX / CQ? If yes, I suggest to use the Day support.
Regards,
Thomas
Hi,
What I am getting here is that writes will be
serialized due to a single write lock
For scalability, you also need scalable hardware. Just using multiple
threads will not improve performance if all the data is then stored on
the same disk.
Regards,
Thomas
Hi,
It looks good to me too (but I'm not an expert in this area).
i don't fully understand the distinction between 'Component' and 'Shared
Code',
but apart from that, looks good to me.
I also don't understand the difference.
Regards,
Thomas
Hi,
I am not completely sure if that solves the problem, but could you try
sharing the repository-level FileSystem between cluster nodes? That
means, configure the first FileSystem entry in the repository.xml
(the one directly within the Repository element; not the one within
the Workspace
well, i don't ;) i don't think that a proper oo design will
necessarily be overly complex.
Having everything convoluted just for the sake of avoiding public
implementation methods is completely unrelated to proper OO design. It
may be your understanding of proper OO design, but it's definitely
Hi,
I'm sorry about the tone of my mails.
I just want to avoid that we run into the trap of making Jackrabbit 3
much too complicated and complex for the sake of being modular. I
agree there shouldn't be many public implementation methods, but what
I don't want to do is add additional glue
Do we want have public methods in the Jackrabbit 3 implementation that
can possibly be misused (if somebody casts to an implementation
class)? See the discussion at
https://issues.apache.org/jira/browse/JCR-2640
The advantage of not having public classes: people can't cast to
implementation classes
Not exposing implementation details through public API
is a basic OO design principle.
i think with a proper design and packaging, this will not be a problem.
I don't think you talk about the same thing here.
Proper OO is using interfaces, and not casting to implementation classes.
For
Hi,
If you think it's proper OO and such, could you please provide *one*
example of a larger project that does *not* have public implementation
methods?
Regards,
Thomas
Hi,
I completely agree with Justin.
Package-protected
I think it does have its use, but for more complex products it's just
not enough. Somewhat related: in Java 1.0.2 you could use private
protected: http://www.jguru.com/faq/view.jsp?EID=15576
Security
For real security, you either need
Hi,
objectives of the
jr3 project is to deliver better performance than jr2 on scalability,
concurrency, latency, etc., it would be helpful to have an automated stress
test framework
That's true. There are already a few such test cases, but more are
required. Patches are welcome of course
Hi,
My suggestion (admittedly as a bystander) would be that the sooner people
can start breaking it, the sooner it can get fixed, so prioritize activities
based on first getting it to the point of breakability (rather than
usability), and then merge.
Sorry, I don't understand what you mean
Hi,
I'm not sure if this will help more than it will complicate things.
Disadvantages:
- Isn't almost every class in o.a.j.core at least somewhat session related?
- If you move classes to other packages, you will have to make many
methods public.
Instead of moving session related classes to
Hi,
These unrelated classes are mostly things like RepositoryImpl,
TransientRepository, RepositoryCopier, etc. to which many external
codebases are linking, so we can't move them.
SessionImpl is used in my applications as well.
RepositoryImpl,
TransientRepository
I don't think those
Hi,
As far as I understand, you want to move the classes so we can add
checkstyle / PMD constraints, and more easily ensure every method call
from an external class is synchronized. I think that's fine.
Having the 'proxy' classes sounds like a solution for the backward
compatibility concerns
Hi,
So far the prototype is not yet usable, meaning too many features
are missing, tools are missing, documentation is missing. I guess this
needs to be fixed first, so that it becomes somewhat usable (even with
limited functionality). We also need to find out how / where exactly
we want to add
Hi
An alternative is: download the old Jackrabbit jar files when running
the tests (download the jar files dynamically when required, for
example to the target directory), and then load them using a custom
class loader, or create the old repository in a separate process.
While this is currently
There are a few interfaces that might be interesting for all users of
Jackrabbit. Those should be in the api package (not only for OSGi).
The most important is probably:
org.apache.jackrabbit.core.observation.SynchronousEventListener
What about 'officially' supporting it, and moving it to
Hi,
org.apache.jackrabbit.api.observation.JackrabbitEvent ?
You are right, I didn't see that... sorry... JackrabbitEvent already
has isExternal, so forget about ClusterEvent.
org.apache.jackrabbit.api.observation.ExtendedEvent
I can't find this one.
Regards,
Thomas
Hi,
I'm wondering if Jackrabbit 3 should support storage backends that
use the path as the identifier. It's probably possible (with some
limitations), but I'm not sure if it's necessary. I'm sure it's
inefficient, but sometimes that's not a problem.
What do others think? If we want to
Hi,
I agree, we should concentrate on a few backends. I think there are at least two:
- database (what we have now, default)
- in-memory (for testing only)
Still I will check what it takes to support path based node ids.
Currently I think it will only take one additional parameter in one
method
== Node Identifier Format ==
Jackrabbit node ids are currently UUIDs. For Jackrabbit 3, I think
that embedded storage mechanisms should use a long sequence instead.
Advantages of sequences: faster to generate (nodeId = nextId++);
faster index lookup (nodes generated at around the same time have
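The "nodeId = nextId++" idea above can be sketched in a few lines. This is an illustrative stand-in, not Jackrabbit code; the class and method names are made up:

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of a sequence-based node id generator, as an
// alternative to UUIDs. Generating an id is a single atomic increment,
// much cheaper than creating a random UUID, and ids of nodes created
// around the same time are numerically close (good for index locality).
public class NodeIdSequence {
    private final AtomicLong nextId = new AtomicLong(1);

    public long nextNodeId() {
        return nextId.getAndIncrement();
    }
}
```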
Hi,
- The jackrabbit repository is accessed from our app using RMI.
Can you use the repository in embedded mode? That would help a lot.
embedded Derby database
We've tested using postgres
I would test the H2 database if you have time.
Regards,
Thomas
Hi,
With regard to concurrency, are there any plans for jackrabbit to support
concurrency out of the box?
If you use one session for each thread then it should already work.
It's a bug if it doesn't.
In any case I would use one session per thread, no matter if a future
version of Jackrabbit
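The session-per-thread pattern recommended above can be wired up with a ThreadLocal. The Session type here is an empty stand-in, not javax.jcr.Session; only the ThreadLocal plumbing is the point:

```java
// Sketch: each thread lazily gets (and keeps) its own session instance,
// so no session is ever shared between threads. All names are illustrative.
public class SessionPerThread {
    static class Session { }   // hypothetical stand-in for a real session

    private static final ThreadLocal<Session> SESSIONS =
            ThreadLocal.withInitial(Session::new);

    public static Session currentSession() {
        return SESSIONS.get();
    }
}
```

Within one thread, repeated calls return the same instance; a different thread gets its own.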
Hi,
Stefan is right, File.createTempFile() doesn't generate colliding
files. However, there is a potential problem with the
TransientFileFactory. Consider the following case:
- The file bin-1.tmp is created (BLOBInTempFile line 51).
- The TransientFileFactory adds a PhantomReference A in its
Hi,
it's too early IMO to judge whether a caching hierarchy manager is
needed or not...
IMO the only statement that can be made based on your comparison
is that if the prototype with very limited functionality were slower than
jackrabbit with a fully implemented feature set, the prototype's
Hi,
i doubt that the results of this comparison are in any way significant.
It was not supposed to be a fair comparison :-) Of course the
prototype doesn't implement all features. For example, paths are parsed
in a very simplistic way. I don't think the end result will be as fast
as the prototype.
Hi,
I have some early performance test results: There is a test with 3
levels of child nodes (each node 20 children)
(TestSimple.createReadNodes).
With the JDBC storage and the H2 database, this is about 14 times
faster than the Jackrabbit 2.0 trunk (0.2 seconds versus 2.9 seconds
for Jackrabbit
Currently the journal (cluster journal and event journal) is stored
using a separate storage mechanism.
I think it should be stored using the 'normal' storage mechanism.
Advantages:
- Simplifies the architecture (especially for clustering)
- Events and node data are in the same transaction, which
Hi,
(except logging
Yes, I think SLF4J is fine
and configuration, probably
Some information needs to be available when the repository is
constructed, or at the latest when logging in: What storage backend to
use, and how to connect to the storage backend.
The rest of the configuration
Hi,
In case of cluster db journal, the hostname of db connection.
The hostname of the database (if a database is used) and the database
name need to be known when creating the repository object. Storing it
in a 'repository.xml' file is possible, but it's just an unnecessary
indirection. If you
Hi,
consistency. I don't know of a relational database that allows you
to violate referential integrity, unique key constraints, or check
constraints - simply by using the same connection in multiple threads.
jcr repository should have some point to do the constraints check as
well. Should
Hi,
It may slow down writes by around 50%. I think it should be an optional
feature (some storage backends may not support it at all, and there
should be a way to disable / enable it for those backends that can
support it). I think we should support writing our own transaction log
even when using
Hi,
I think the persistence / storage API should be generic enough to
support at least 3 different implementations efficiently:
- an implementation based on a relational database
- a file based implementation
- in-memory
I think the storage API should support some kind of storage session
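A storage API with a "storage session", generic enough for database, file based, and in-memory backends, could look roughly like this. Everything here is a hypothetical sketch, not an actual Jackrabbit interface; the in-memory backend is the simplest possible implementation:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a minimal storage API: a backend hands out sessions, and a
// session buffers writes until commit. Names and shapes are made up.
public class InMemoryStorage {
    public interface StorageSession extends AutoCloseable {
        byte[] read(long nodeId);
        void write(long nodeId, byte[] data);
        void commit();
        @Override void close();
    }

    private final Map<Long, byte[]> nodes = new HashMap<>();

    public StorageSession openSession() {
        return new StorageSession() {
            private final Map<Long, byte[]> pending = new HashMap<>();

            public byte[] read(long nodeId) {
                byte[] d = pending.get(nodeId);        // own uncommitted changes first
                return d != null ? d : nodes.get(nodeId);
            }
            public void write(long nodeId, byte[] data) { pending.put(nodeId, data); }
            public void commit() { nodes.putAll(pending); pending.clear(); }
            public void close()  { pending.clear(); }  // discard uncommitted changes
        };
    }
}
```

A database implementation would map the same interface onto a JDBC connection, a file based one onto an append-only file.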
Hi,
I am not clear what credentials you are referring to
I refer to the database user name and password that are currently
stored in the repository.xml (except when using JNDI):
http://jackrabbit.apache.org/api/1.5/org/apache/jackrabbit/core/persistence/bundle/BundleDbPersistenceManager.html
#
Hi,
Currently Jackrabbit doesn't support delayed initialization. Unless I
misunderstood Felix, he would also like to get rid of this
restriction.
Just to clarify: my suggestion is *not* about requiring the repository
is initialized when the first session is opened. It's also *not* about
Currently Jackrabbit initializes the repository storage (persistence
manager) when creating the repository object. If the repository data
is stored in relational database, then the database connection is
opened at that time.
I suggest to allow delayed initialization (allow, not require). For
some
For Jackrabbit 3, I would like to improve exception handling. Some ideas:
== Use Error Codes ==
Currently exception messages are hardcoded (in English). When using
error codes, exception messages could be translated. I'm not saying we
should translate them ourselves, but if somebody wants to, he
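A minimal sketch of the error-code idea: the exception carries a stable code, and the message is produced from the code, so translations could be plugged in without changing calling code. The codes and messages below are invented for illustration:

```java
// Hypothetical exception type: the numeric code is the stable contract,
// the English text is just the default rendering of that code.
public class CodedRepositoryException extends Exception {
    private final int code;

    public CodedRepositoryException(int code, Object... args) {
        super(String.format(messageFor(code), args));
        this.code = code;
    }

    public int getCode() { return code; }

    private static String messageFor(int code) {
        switch (code) {
            case 1001: return "Node not found: %s";
            case 1002: return "Access denied for session %s";
            default:   return "Unknown error";
        }
    }
}
```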
Hi,
I would prefer to initialise the repository in the first place and
make sure everything is correct for the repository
I wrote: *allow* delayed initialization (allow, not require).
If the user wants to delay the initialisation, they may create the
repository reference only when it is first accessed.
If the
Hi,
A JDBC connection is not thread safe. A JCR session works in a similar
way, and I prefer to follow the same pattern.
Me too. But there is a difference between thread safety and
consistency. I don't know of a relational database that allows you
to violate referential integrity, unique key constraints,
Hi,
Multiple threads adding child nodes to the same parent node
Yes, that's an important use case, and should not be a problem
for my proposed solution.
For instance, more than 1 thread calling
UserManager.createUser(userId, shardPath(userId)) where shardPath(userId)
results in a
Hi Ian,
Could you describe your use case?
probability of conflict when updating a multivalued property is reduced
What methods do you call, and how should the conflict be resolved?
Example: if you currently use the following code:
1) session1.getNode(test).setProperty(multi, new String[]{a,
There are low level merges and high level merges. A low level
merge is problematic: it can result in unexpected behavior. I would
even say the way Jackrabbit merges changes currently (by looking at
the data itself, not at the operations) is problematic.
Example: Currently, orderBefore can not be
Hi,
this creates a big potential for deadlocks
Could you provide an example on how such a deadlock could look like?
just synchronizing all methods
So you also synchronize all Node/Item/Property methods
Some methods don't need to be synchronized, for example some getter
methods such as
Hi,
Consider two or more threads reading different items at the same time:
they all are chained one after the other.
Only if those threads use the same session.
this is unsupported, yet you want to add synchronization to secure
this unsupported case ...
When we are done it becomes a
Hi
http://issues.apache.org/jira/browse/JCR-2443.
Unfortunately this bug doesn't have a test case. Also I didn't find a
thread dump that shows what the problem was exactly. I can't say what
was the problem there.
Observation is definitely an area where synchronization can
potentially lead to
== Current Behavior ==
Currently Jackrabbit tries to merge changes when two sessions
add/change/remove different properties concurrently on the same node.
As far as I understand, Jackrabbit merges changes by looking at the
data (baseline, currently stored, and new). The same for child nodes:
when
Currently, Jackrabbit sessions are somewhat synchronized, but not
completely (for example it's still possible to concurrently read and
update data). There were some problems because of that, and probably
there still are.
I believe it's better to synchronize all methods in the session (on
the
Hi,
deadlocks
I think it's relatively simple to synchronize all methods on the session.
If we want to make sessions thread-safe, we should use proper
implementations.
Yes, that's what I want to write: a proper implementation.
any concurrent use of the same session is unsupported.
The
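A minimal sketch of what "synchronize all methods on the session" could look like: a delegating session object where every public method takes a single per-session lock. The class and its methods are stand-ins, not Jackrabbit code:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative session wrapper: all access goes through one lock, so
// concurrent use of the same session is serialized instead of corrupting
// state. Getter methods that touch no shared state could skip the lock.
public class SynchronizedSession {
    private final Object lock = new Object();
    private final Map<String, String> data = new HashMap<>();

    public void setProperty(String name, String value) {
        synchronized (lock) { data.put(name, value); }
    }

    public String getProperty(String name) {
        synchronized (lock) { return data.get(name); }
    }
}
```

Two threads sharing this session never block threads using *other* sessions, which is why per-session synchronization does not serialize the whole repository.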
I would like to define a new storage format for nodes and properties.
A few ideas:
== Name and Namespace Index ==
Currently each new property and node name is stored in the name index.
Each namespace is stored in the namespace index. Those indexes are
used to compress the data. There are several
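The name-index compression described above amounts to interning: each distinct name gets a small integer on first use, and records store the integer. A sketch (names and shape are illustrative, not the Jackrabbit format):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical name index: maps each distinct property/node name to a
// stable small integer, so node records can store ints instead of strings.
public class NameIndex {
    private final Map<String, Integer> nameToId = new HashMap<>();
    private final List<String> idToName = new ArrayList<>();

    public int getIndex(String name) {
        Integer id = nameToId.get(name);
        if (id == null) {
            id = idToName.size();      // next free index
            idToName.add(name);
            nameToId.put(name, id);
        }
        return id;
    }

    public String getName(int index) {
        return idToName.get(index);
    }
}
```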
Hi,
I agree with Marcel, the current Jackrabbit SPI is too high level.
it must be impossible to create inconsistent data using the micro-kernel API
- tree and referential integrity awareness
+1. No more consistency check / fix. No more inconsistent index if the
property-value-index is also
Hi,
Which makes observation listeners an integral part of the microkernel, btw.
The microkernel would only need to support one callback object
(listener is probably the wrong word, because it is also called for
read operations). This one would then call (and allow to register)
regular JCR
About micro-kernel: I think the micro-kernel shouldn't have to support
splitting large child lists into smaller lists. That could be done on
a higher level. What it should support is all the features required by
this mechanism:
- Support 'hidden' nodes (those are marked using a hidden property).
Hi,
About clustering: there are two main use cases:
A) to improve read throughput and to achieve high availability. In
this case writes can be serialized.
B) to improve write throughput. In this case writes should not be
serialized, instead writes should be merged later on (eventually
Hi,
This would be after the fact and wouldn't work to validate that
changes are correct (to verify added / changed nodes don't violate
node type constraints). Also it wouldn't work for security.
Regards,
Thomas
Hi,
I don't see the point of doing such steps after the transaction has already
been committed.
Well, because you don't have a callback mechanism that gets called
_before_ committing (or reading, in the case of security).
I'd make node type constraints and security checks the responsibility
Hi,
The configuration should be persisted in the repository itself. Not in
external configuration files.
* dynamic configuration
First of all, I would define an API for configuration changes. This
API could be the regular JCR API, and the configuration could be
stored in special system nodes.
Hi,
Is Jackrabbit too slow for you? Or do you have out of memory problems?
Or why do you want to use your own cache?
features like overflow to disk
I would try to avoid that. It's not really a 'cache' if it has to be
stored to disk, if the original data is also on disk.
I would try to solve
Currently node types are an integral part of the repository. There is a
special storage mechanism (the file custom_nodetypes.xml), which is
non-transactional and problematic for clustering.
To simplify the architecture ('microkernel'), could the node type
functionality be implemented outside the
Hi,
Thanks for the explanation!
index every unique jcr fieldname in a unique lucene field, and do not prefix
values as is currently being done.
This sounds very reasonable.
Regards,
Thomas
I'd use Lucene to manage it.
There are several problems. One is transactions; another is updating
the index synchronously. A third is the dependence on Lucene, which
complicates persistence and clustering.
I would very much like to avoid inventing our own search index.
I would
Hi,
I would also use a b-tree structure. If a node has too many child
nodes, two new invisible internal nodes are created, and the list of
child nodes is split up. Those internal nodes wouldn't have any
properties.
For efficient path lookup, the child node list should be sorted by
name. This is
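The split described above can be sketched as follows. This is a toy model under stated assumptions: the threshold is tiny for illustration, and the two halves stand for the child lists of the two invisible internal nodes:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustrative sketch: when a node has too many children, its sorted
// child name list is divided in two; each half would become the child
// list of an invisible internal node (which itself has no properties).
public class ChildListSplit {
    static final int MAX_CHILDREN = 4;   // unrealistically small, for clarity

    public static List<List<String>> splitIfNeeded(List<String> children) {
        List<String> sorted = new ArrayList<>(children);
        Collections.sort(sorted);        // sorted by name, for efficient path lookup
        if (sorted.size() <= MAX_CHILDREN) {
            return Collections.singletonList(sorted);
        }
        int mid = sorted.size() / 2;
        return Arrays.asList(sorted.subList(0, mid),
                             sorted.subList(mid, sorted.size()));
    }
}
```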
Hi,
I would also use a b-tree structure. If a node has too many child
nodes, two new invisible internal nodes are created, and the list of
child nodes is split up. Those internal nodes wouldn't have any
properties.
You mean a b-tree for each node? I think this could be a separate
index,
Hi,
Property/value indexes: We anyway will have to implement some kind of
database persistence. Databases support transactional indexes. We
could use those instead of using Lucene. Or we could store the index
in JCR nodes (which is part of the large repository b-tree). Indexes
in databases are
Hi,
A Jackrabbit repository is some kind of b-tree - just that the pages
are never split and balanced automatically. Maybe the term b-tree is
confusing? Let's call it a manual b-tree then.
i agree that flat hierarchies is an important feature, however we
shouldn't compromise
the performance for the
not sure that the JCR EventListener interface could be used for persistent
observation listeners
You are right. It would probably be a different API (to be defined).
This mechanism could be used for (just an idea):
- JCR observation
- security (filtering nodes and properties; allowing /
Hi,
I think Jukka is correct that the correct use of B-trees is to use one for
each list of child nodes, not as a way to model the entire hierarchy.
If you are more comfortable with this view, that's OK. I probably
should have said: the whole repository is a tree data structure.
And there
Hi,
JCR requires lookup of children by name and/or position (for orderable
children), so the implementation needs to support all these cases
efficiently. The trickiest one to handle is probably Node.getNodes(String
namePattern) because it requires using both name and position together.
Hi,
Even without using orderBefore, the specification still requires a stable
ordering of children for nodes that support orderable child nodes (see 23.1
of JCR 2.0 spec).
Thanks for the link! I see now my idea violates the specification
which says (23.3) When a child node is added to a node
+1
For simple search a built-in index would help a lot, for example node
names and (some) property values. Each property name could have its
own index. Advantages:
- transactional index updates
- reduced complexity
- reduced number of open files
- makes it possible to implement Jackrabbit in C
I would not
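A per-property-name index as proposed above could be a simple value-to-node-ids map, one per property name, updated in the same transaction as the node write. The sketch below is an in-memory stand-in only, with made-up names:

```java
import java.util.*;

// Hypothetical built-in property index: one sorted map per property name,
// mapping a value to the set of node ids carrying that value. In a real
// repository the update would be part of the node's write transaction.
public class PropertyIndex {
    private final Map<String, TreeMap<String, Set<Long>>> indexes = new HashMap<>();

    public void add(String propertyName, String value, long nodeId) {
        indexes.computeIfAbsent(propertyName, k -> new TreeMap<>())
               .computeIfAbsent(value, k -> new HashSet<>())
               .add(nodeId);
    }

    public Set<Long> lookup(String propertyName, String value) {
        TreeMap<String, Set<Long>> index = indexes.get(propertyName);
        if (index == null) return Collections.emptySet();
        return index.getOrDefault(value, Collections.emptySet());
    }
}
```

Using a TreeMap (rather than a hash map) would also allow range queries over property values.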
Hi
each property indexed in its own Lucene field
Could you explain in more details? What is a 1:1 mapping? Do you mean
each property type should have its own index, or each property name
should have its own index? Would this not increase the number of
Lucene index files a lot?
Regards,
Thomas
Hi,
I would implement the storage layer ourselves. It could look like:
- FileDataStore: keep as is (maybe reduce the directory level by one).
- Each node has a number (I would use a long). Used for indexing.
- MainStorage: the node data is kept in an append-only main
persistence storage. When
Hi,
The most obvious trouble with this approach is that the node UUIDs
would no longer be unique within such a super-workspace. I'm not sure
how to best solve that problem, apart from switching to some
alternative internal node identifiers. Any ideas?
Use a number (variable size when stored
Hi,
For me, there are two kinds of indexes: the property/value indexes,
and the fulltext index.
The property/value indexes are for property values, node names, paths,
node references, and so on. Such indexes (or indices) are relatively
small and fast. In relational databases, those are the
Hi,
I would do MVCC in a similar way it is done in relational databases
such as PostgreSQL. See also
www.postgresql.org/files/developer/transactions.pdf
Concurrent writes and MVCC: usually MVCC means readers are never
blocked by other readers or writers, and writers are not blocked by
readers.
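A very small sketch of the MVCC idea in the spirit of the PostgreSQL model referenced above: each write creates a new version, and a reader sees the newest version at or below its snapshot, so readers never see half-finished writes. This is a single-value toy model, not a real engine:

```java
import java.util.Map;
import java.util.TreeMap;

// Toy MVCC store: versions are never overwritten, only appended, so a
// reader holding an old snapshot version keeps a consistent view while
// writers continue to commit new versions.
public class MvccStore {
    private long version = 0;
    private final TreeMap<Long, String> versions = new TreeMap<>();

    public synchronized long write(String value) {
        versions.put(++version, value);
        return version;
    }

    // Read as of the given snapshot: the newest version <= snapshot.
    public synchronized String readAt(long snapshot) {
        Map.Entry<Long, String> e = versions.floorEntry(snapshot);
        return e == null ? null : e.getValue();
    }

    public synchronized long currentVersion() { return version; }
}
```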
Hi,
About 'append only' and 'immutable' storage. Here is an interesting link:
http://eclipsesource.com/blogs/2009/12/13/persistent-trees-in-git-clojure-and-couchdb-data-structure-convergence/
Regards,
Thomas
Hi,
A very simple implementation of my idea:
http://h2database.com/p.html#e5e5d0fa3aabc42932e6065a37b1f6a8
The method hasSameNameSibling() is called for each remove(). If
it turns out to be a performance problem we could add a hidden
property in the first SNS node itself (only required
Hi,
Could you point me in the right direction for a production-ready model 3
deployment model (where we can access the repository remotely)?
There is some documentation available here:
http://wiki.apache.org/jackrabbit/RemoteAccess
Regards,
Thomas
Hi,
About SNS (same name siblings): what about moving that part away from
the core? Currently, the Jackrabbit architecture is (simplified):
1) API layer (JCR API, SPI API)
2) Jackrabbit core, which knows about SNS
After moving the SNS support, it would be something like this:
1) API layer (JCR
Hi,
Could this be an optional feature in 3.x? As JCR 2.x is out, it could
raise compatibility problems, right?
This change wouldn't affect the public API. SNS would still be
supported as they are done now. Maybe with a few changes, but all
within the JCR 2.0 specification.
About
Hi,
Please use the 'user' list for questions.
the lock timeouts are occurring only with non-jcr tables during routine
actions in other areas of our site, even though they have nothing to do with
Jackrabbit.
It sounds like the problem is not related to Jackrabbit then.
disabling
Hi,
Currently there is only one data store per repository. If you need a data
store per workspace, then you need one repository per workspace.
- Assign a datastore per workspace (customer) so it's possible to measure
(and limit) storage usage for a given customer
This sounds more like an
Hi,
extend the datastore interface
workspace name, node name, property name ...
I'm not sure if the workspace / node name / node identifier / property name
is always available.
One advantage of this addition would be: it could speed up garbage
collection. If a binary object knows the node
+1 Release this package as Apache Jackrabbit 2.0.0
- checksums OK
- licences OK
- notice.txt, readme.txt and release-notes.txt files OK
- mvn clean install OK with Sun Java 1.5.0_22 / Mac OS X
Regards,
Thomas
Hi,
About the repository lock see http://wiki.apache.org/jackrabbit/RepositoryLock
P.S. Please use the user list for usage questions
Regards,
Thomas
On Thu, Jan 21, 2010 at 2:04 PM, abhishek reddy
abhishek.c1...@gmail.com wrote:
hi,
For the first time, I am able to access the repository
Hi,
now org.apache.jackrabbit.core.util.CooperativeFileLockTest.testFileLock
failed
thomas, I think it was you who added this test recently, right?
Yes... does Hudson run on Windows? It looks like a timing problem (the
thread doesn't stop quickly enough).
Regards,
Thomas
Hi,
We can't change that API part in 2.x.
I understand we should not _change_ (or remove) a public API within
2.x. That's actually the main reason why I wouldn't export the
PersistenceManager API now, because it would force us to keep it like
it is for the whole 2.x.
But we can still export
Hi,
The problem with Jackrabbit Core is that, apart from implementing the
Jackrabbit API (which is imported in the bundle), it has its internal
API (for example the PersistenceManager interface or others). This
internal API is not properly separated (in terms of Java Packages) from
Hi,
It's worth moving some of the internal API to jackrabbit-api
so other bundles can provide different implementations. It could be
well documented, making it better for third parties to extend Jackrabbit.
I would do that only if there is an actual need for it. Do you have
another implementation?
Hi,
I don't have another implementation at the moment for any of them.
OK, good to know.
I think it might be possible to add a key/value store as a bundle
persistence store in the future.
I would wait until it's a real problem. Trying to solve _potential_
problems in advance is usually the
Hi,
I would not move the API to the Jackrabbit API. Just moving the
interfaces into separate packages, e.g. below o.a.j.core.api, would
suffice to export this space and leave the implementation private.
+1
Moving the persistence manager interface and co (basically everything
that can be
+1 Release this package as Apache Jackrabbit 2.0-beta3
- checksums OK
- licences OK
- notice.txt, readme.txt and release-notes.txt files OK
- mvn clean install OK with Sun Java 1.6.0_15 / Mac OS X
Regards,
Thomas
On Mon, Nov 23, 2009 at 10:21 AM, Sébastien Launay
sebastienlau...@gmail.com
Hi,
How big is this directory?
By default, Jackrabbit uses Apache Derby to persist data. This directory
belongs to the embedded Apache Derby databases. There is a way to compact
Derby databases, however you would need to implement this yourself. I found the
link to the Apache Derby documentation: