[GitHub] jena pull request: Add memoizing of LuceneTextIndexes so that ther...

2016-01-30 Thread afs
Github user afs commented on the pull request:

https://github.com/apache/jena/pull/123#issuecomment-177246334
  
Starting from the JENA-1122 description:

> Two Fuseki services, linking to the same dataset description.

Fuseki only calls assemblers once. No other system is (legitimately) 
calling Fuseki service building.  The configuration file processing puts 
service access points into the server-wide state.  There is no service 
assembler (it could be done but it isn't, it serves no purpose); it is done by 
custom processing during walking the datastructure which is happening anyway.

In the Fuseki case, we want shared datasets descriptions, that is, same 
name, to yield the same dataset.  Processing dataset descriptions is driven by 
assemblers and they have names for keys using the root resource.  A general 
"dataset sharing" outside assemblers is hard because of the lack of key. In 
other cases, I can imagine that a shared description alone does not always 
imply a shared object - in-memory datasets for example.  The more general area 
is not clearly defined.

The solution I see is that Fuseki handles the process step for the link:


```
fuseki:dataset   <#dataset>
   
<#dataset> rdf:type ja:RDFDataset(OrSubType)
```


This happens in `Builder.buildDataService` as it calls 
`Assembler.general.open(datasetDesc)`.

It looks to me that if sharing is provided  here, the problem statement of 
JENA-1122 is addressed.

One matter arising: 

Service descriptions can be in multiple files (it is the preferred pattern 
to use `configuration/`). The template system behind the UI uses relative URIs 
so names of descriptions are unique across the server.  

If a user manually writes two configuration files, but uses the same 
absolute URI and they meant it to be different, we have a problem and this 
could be made to cause an error (safe choice to force shared datasets to be in 
the server `config.ttl`).

`FusekiConfig.initializeDataAccessPoints` is the driver, it calls 
`readConfigurationDirectory` and the others places service descriptions can be 
and so needs checking.

For now, just solving this for two services in the server configuration 
file, with entries in the `fuseki:services` list links is a good start.





---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] jena pull request: Add memoizing of LuceneTextIndexes so that ther...

2016-01-30 Thread afs
Github user afs commented on the pull request:

https://github.com/apache/jena/pull/123#issuecomment-177243620
  
This PR protects against multiple creation of a text index (JENA-1104), not 
against two calls to create the same dataset for two services in Fuseki. By 
chance, TDB is less prone to problems if that happens but that is luck.  
General datasets e.g. with inference graphs, SDB or plain in-memory datasets 
are likely exposed to problems.

Let's solve the immediate issue described in JENA-1122, then see if 
JENA-1104 needs addressing or whether the situations where it can still happen 
are uninteresting or have other problems in which case the application must be 
responsible for creating the index only once.

For the record, there are some specific items with the current PR that I 
would like clarified or refuted before this code is used to address JENA-1107, 
if that is still needed.

1: `TextIndexLucene.close` is not reference counted.

* Create text index -> T1
* Create text index -> T2 (which is T1, shared)
* Close T2.
* Any T1 code will now crash - the index is closed.

2: Using WeakReferences and managing `close()` seems to be duplicating 
lifecycle management.

I am not clear that the WeakReference to the Lucene index helps because 
there are no finalizers, so GC fnalization does not tidy up lucene.  A freed 
WeakReference would cause a new attempt to create the index but it will hit the 
state lock.

3: Creation by `createLuceneIndex` is not thread safe.  It has a 
get-create-put timing hole.

4: Need to be clear on the contract for "same `Directory`, different Lucene 
configuration (`TextIndexConfig`)". 



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] jena pull request: Add memoizing of LuceneTextIndexes so that ther...

2016-01-25 Thread ajs6f
Github user ajs6f commented on the pull request:

https://github.com/apache/jena/pull/123#issuecomment-174658906
  
Is there any understanding about the sharing of components between the 
results of calls to the various forms of `Assembler::open`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] jena pull request: Add memoizing of LuceneTextIndexes so that ther...

2016-01-24 Thread bwmcbride
Github user bwmcbride commented on the pull request:

https://github.com/apache/jena/pull/123#issuecomment-174351980
  
Thanks for all the comments.  I've updated the pull request implementing 
all the proposed changes and made a few other minor amendments.

However, looking at @afs comments again, I realise I am heading off down 
the wrong path.  What @afs had suggested was not memoizing the text index 
objects in TextDataset factory, which is what this PR currently does, but 
memoizing  dataset objects.

My bad.  I thought the suggestion was to memoize the TextDataSetLucene 
objects in TextDatasetFactory.  That does I think solve JENA-1122 (as I 
understand it) but, as Andy says, it opens up other nasty problems.

So I propose to switch to memoizing datasets in the assembler.  

I'm sorry the rest of this is so long.  The short version is, what is the 
best way to go about managing state for memoizing datasets in assemblers?

The simple option is to just move the code in this pull request for 
memoizing TextIndexLucene datasets from TextDatasetFactory to 
TextIndexLuceneAssembler.  That addresses my immediate problem, but a solution 
that worked for datasets in general would be better.

A problem is if/when/how to clear out the memoized state.

In use cases where an application or a service starts up, assembles its 
components and then does what it does until it terminates, there may not be 
much of an issue leaving state around in the assembler maps used to relate 
nodes to reusable objects.

In use cases where an application is repeatedly building and tearing down 
assemblies during its lifetime then not clearing out the memoizing state can 
lead to failures when a component is reused between 'builds'.  The current 
TextIndexLucene test cases do this and they fail if the memoizing map is not 
cleared out between tests.  Maybe the tests are broken.

The current code in this pull request clears a TextIndexLucene object out 
of the memoizingmap when the TextIndexLucene object is closed.  But that is 
basically a kludge.  I don't think it a good idea to go round changing all 
existing dataset implementations to support event handling callbacks on close.

The way I would naturally expect things to work is for assemblers to have 
some notion of a build.  A build defines a context in which state like the 
memoizing map can be built up.  The build context can be thrown away at the end 
of the build and a new one created for the next build.  This would prevent 
reuse across builds.  So if you do:

  foo = assembler.build(R);
  bar = assembler.build(R);

you will get two different assemblies, typically with no sharing between 
them.  (You would still get a lock failure if R had a TextIndexLucene component 
that was not closed between creating the two assemblies.)

As far as I can see assemblers are not designed to work like this.  Nor can 
I see how to add this notion of a build context without affecting many existing 
assemblers.  I may be missing something.

I have raised the question in the hope that someone will suggest an 
approach to a more general solution. 
















---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] jena pull request: Add memoizing of LuceneTextIndexes so that ther...

2016-01-24 Thread ajs6f
Github user ajs6f commented on the pull request:

https://github.com/apache/jena/pull/123#issuecomment-174311885
  
There has been some 
[discussion](http://markmail.org/message/6uab6glusw7lqpmz) about the 
commonalities between `jena-text` and `jena-spatial`. Is the same problem 
likely to come up with `jena-spatial`? Maybe it's worth doing the 
assembler-based fix because of that? (Sorry if I'm really off-base here, just 
trying to avoid any potential future problems.)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] jena pull request: Add memoizing of LuceneTextIndexes so that ther...

2016-01-24 Thread ajs6f
Github user ajs6f commented on the pull request:

https://github.com/apache/jena/pull/123#issuecomment-174311642
  
Thanks for the clarification re: `Mode`. Should the Javadocs for `Mode` say 
that something like that it "advises implementations of user intention, but 
does not constrain implementation behavior"?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] jena pull request: Add memoizing of LuceneTextIndexes so that ther...

2016-01-24 Thread afs
Github user afs commented on the pull request:

https://github.com/apache/jena/pull/123#issuecomment-174309243
  
As far as I can see, only an assembler solution will address 
[JENA-1122](https://issues.apache.org/jira/browse/JENA-1122). @bwmcbride - do 
you believe this PR alone will address JENA-1122? Has this PR been run under 
load?

By protecting the text index alone, I see a new class of concurrency issues 
arising. A shared text index between two different TDB datasets will go wrong 
on update. At the moment, you can't create that setup because of JENA-1104. 
Other setups will have concurrency problems.

*If* this works in a "one dataset, must be TDB, only in Fuseki" setup, it 
only protects against 
[JENA-1104](https://issues.apache.org/jira/browse/JENA-1104). If then use of 
TDB happens to work at the moment, it is based on internal implementation 
handling in Fuseki. That is very fragile. 

This PR does no harm though it is a bit more complicated than needed 
because a solution to the root cause 
[JENA-1122](https://issues.apache.org/jira/browse/JENA-1122) may well remove 
the need for some of the machinery.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] jena pull request: Add memoizing of LuceneTextIndexes so that ther...

2016-01-24 Thread afs
Github user afs commented on the pull request:

https://github.com/apache/jena/pull/123#issuecomment-174306928
  
`org.apache.jena.assembler.Mode` does not help unfortunately. It assists 
with passing down a requirement to share for certain setups that work as 
tightly coordinated groups but that isn't good enough here (it's a hint of 
unclear meaning really, not an instruction). It should not be in the primary 
interface method but it got there a long time ago.

1. How do you decide when to set it in the first place?
1. When set, why would assembler know another should be shared? It assumes 
knowledge about implementation but the target may be a subtype or 
specialization.

 e.g. TDB internally manages sharing because only TDB knows what must be 
shared. But a TDB dataset assembler is a subclass of DatasetRDF assembler.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] jena pull request: Add memoizing of LuceneTextIndexes so that ther...

2016-01-24 Thread bwmcbride
Github user bwmcbride commented on a diff in the pull request:

https://github.com/apache/jena/pull/123#discussion_r50635465
  
--- Diff: 
jena-text/src/main/java/org/apache/jena/query/text/TextDatasetFactory.java ---
@@ -191,5 +213,50 @@ public static DatasetGraph 
createSolrIndex(DatasetGraph base, SolrServer server,
 TextIndex index = createSolrIndex(server, entMap) ;
 return create(base, index, true) ; 
 }
+ 
+protected static class LuceneIndexMap {
+   
+   private static HashMap> map = 
new HashMap>();
+
+   protected synchronized TextIndex get(Directory directory) {
--- End diff --

Thanks for the sketch.  I've tried something based on it and does look 
better.  Will push shortly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] jena pull request: Add memoizing of LuceneTextIndexes so that ther...

2016-01-24 Thread ajs6f
Github user ajs6f commented on the pull request:

https://github.com/apache/jena/pull/123#issuecomment-174299109
  
Just as a sidenote, isn't `org.apache.jena.assembler.Mode` intended to 
control whether new things are created or old things are reused during the 
assembly?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] jena pull request: Add memoizing of LuceneTextIndexes so that ther...

2016-01-24 Thread bwmcbride
Github user bwmcbride commented on the pull request:

https://github.com/apache/jena/pull/123#issuecomment-174298772
  
The main problem I had was figuring out how to manage the static state so 
it was consistent with the lifecyle implemented by TextIndexLucene and avoid 
memory leaks on long running systems. Hence the call backs. That is an issue 
that is common to both API and assembler construction.

The assembler mechanism is not designed to handle graph (or even lattice) 
structured composition of components. According to Chris, it is by design that 
It does a recursive decent over the structure and assembles a new thing each 
time a node is encountered.

Perhaps there is a need for an assembly mechanism designed to handle graph 
structures. Would also be useful to have some notion of an assember 'session', 
that is the process of building a complete structure, during which session 
state could be built up and thrown away at the end.  Maybe there is one and I 
didn;t find it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] jena pull request: Add memoizing of LuceneTextIndexes so that ther...

2016-01-22 Thread afs
Github user afs commented on the pull request:

https://github.com/apache/jena/pull/123#issuecomment-174066342
  
**Design**

Protecting the text index this way sort of works for TDB specifically 
because of an internal feature of TDB (it manages storage to stop duplication) 
which is not a guaranteed feature. Other dataset implementations will not work 
out so nicely. It will be like two separate datasets and one index will and 
probably lead to corruption or inconsistent reading (c.f. email ["transactions 
and 
docProducers"](http://mail-archives.apache.org/mod_mbox/jena-users/201601.mbox/%3C568FD70B.8060301%40epimorphics.com%3E)).

On [JENA-1122](https://issues.apache.org/jira/browse/JENA-1122) I 
summarized discussions up to here as two options suggested:

1. Internal static state in `TextDatasetFactory` that the same datasets 
object is returned each time. c.f. TDB's StoreConnection. Extends sharing of 
text datasets to work with java/API uses but not "any dataset" in Fuseki 
configurations.
2. Fuseki (or in `DatasetAssembler` maybe) assembling datasets deals with 
sharing using the graph structure. This copes with any dataset but not API use. 

The first one looks hard because of choosing the key to include the dataset 
in the general case.

The second one is easier to do because there is a natural key of the 
resource (URI, bnode) for the dataset. Bonus would a similar per-text index 
assembler check on reuse 
[JENA-1104](https://issues.apache.org/jira/browse/JENA-1104).

There is one minor point - Fuseki can have multiple assembler files and 
badly chosen, clashing dataset URIs (solution - keep a list of all URIs acorss 
assembler configs - useful check anyway)

The ideal for [JENA-1122](https://issues.apache.org/jira/browse/JENA-1122) 
is this PR (simplified?) to protect text indexes and (2) above to allow complex 
configurations.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] jena pull request: Add memoizing of LuceneTextIndexes so that ther...

2016-01-22 Thread bwmcbride
Github user bwmcbride commented on a diff in the pull request:

https://github.com/apache/jena/pull/123#discussion_r50553029
  
--- Diff: 
jena-text/src/test/java/org/apache/jena/query/text/assembler/TestTextIndexLuceneAssembler.java
 ---
@@ -97,6 +97,14 @@
 index.close();
 }
 }
+
+@Test public void testOneIndexPerDirectoryMem() {
--- End diff --

Yes.  Will do.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] jena pull request: Add memoizing of LuceneTextIndexes so that ther...

2016-01-22 Thread ajs6f
Github user ajs6f commented on a diff in the pull request:

https://github.com/apache/jena/pull/123#discussion_r50548291
  
--- Diff: 
jena-text/src/main/java/org/apache/jena/query/text/TextDatasetFactory.java ---
@@ -191,5 +213,50 @@ public static DatasetGraph 
createSolrIndex(DatasetGraph base, SolrServer server,
 TextIndex index = createSolrIndex(server, entMap) ;
 return create(base, index, true) ; 
 }
+ 
+protected static class LuceneIndexMap {
+   
+   private static HashMap> map = 
new HashMap>();
+
+   protected synchronized TextIndex get(Directory directory) {
--- End diff --

Fair enough. I will try and sketch something quickly and if you want (if 
you think it is clearer), you can use it, or not.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] jena pull request: Add memoizing of LuceneTextIndexes so that ther...

2016-01-22 Thread bwmcbride
Github user bwmcbride commented on a diff in the pull request:

https://github.com/apache/jena/pull/123#discussion_r50546853
  
--- Diff: 
jena-text/src/main/java/org/apache/jena/query/text/TextDatasetFactory.java ---
@@ -191,5 +213,50 @@ public static DatasetGraph 
createSolrIndex(DatasetGraph base, SolrServer server,
 TextIndex index = createSolrIndex(server, entMap) ;
 return create(base, index, true) ; 
 }
+ 
+protected static class LuceneIndexMap {
+   
+   private static HashMap> map = 
new HashMap>();
+
+   protected synchronized TextIndex get(Directory directory) {
+   Object key = key(directory);
+   WeakReference ref = map.get(key);
+   if (ref == null)
+   return null;
+   TextIndex result = ref.get();
+   if (result == null)
+   map.put(key, null);
+   return result;
+   }
+   
+   protected synchronized void put(Directory directory, TextIndex 
index) {
+   map.put(key(directory), new 
WeakReference(index));
+   }
+   
+   protected synchronized void remove(Directory directory) {
+   map.put(key(directory), null);
+   }
+   
+   private Object key(Directory directory) {
+   // uglyness alert
+   if (directory instanceof FSDirectory) {
+   return getFilePath( ( (FSDirectory) 
directory).getDirectory() );
+   } else if (directory instanceof CompoundFileDirectory) {
+   return getFilePath( ( (FSDirectory) 
directory).getDirectory() );
--- End diff --

Yes - thank you.  Will fix.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] jena pull request: Add memoizing of LuceneTextIndexes so that ther...

2016-01-22 Thread bwmcbride
Github user bwmcbride commented on a diff in the pull request:

https://github.com/apache/jena/pull/123#discussion_r50546729
  
--- Diff: 
jena-text/src/main/java/org/apache/jena/query/text/TextDatasetFactory.java ---
@@ -191,5 +213,50 @@ public static DatasetGraph 
createSolrIndex(DatasetGraph base, SolrServer server,
 TextIndex index = createSolrIndex(server, entMap) ;
 return create(base, index, true) ; 
 }
+ 
+protected static class LuceneIndexMap {
+   
+   private static HashMap> map = 
new HashMap>();
+
+   protected synchronized TextIndex get(Directory directory) {
--- End diff --

Yes, there is some commonality.  It may be possible to factor something out 
from both.

There are differences.  One uses a WeakHashMap and the other a normal 
HashMap and the handling of keys is a little different.

I reckoned the common functionality was not worth factoring out.  Ymmv.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] jena pull request: Add memoizing of LuceneTextIndexes so that ther...

2016-01-22 Thread rvesse
Github user rvesse commented on a diff in the pull request:

https://github.com/apache/jena/pull/123#discussion_r50546299
  
--- Diff: 
jena-text/src/test/java/org/apache/jena/query/text/assembler/TestTextIndexLuceneAssembler.java
 ---
@@ -97,6 +97,14 @@
 index.close();
 }
 }
+
+@Test public void testOneIndexPerDirectoryMem() {
--- End diff --

Should there be a corresponding test case that validates that when you have 
two different nodes specified as RAM indexes you get two different 
`TextIndexLucene` instances?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] jena pull request: Add memoizing of LuceneTextIndexes so that ther...

2016-01-22 Thread ajs6f
Github user ajs6f commented on a diff in the pull request:

https://github.com/apache/jena/pull/123#discussion_r50546226
  
--- Diff: 
jena-text/src/main/java/org/apache/jena/query/text/assembler/TextIndexLuceneAssembler.java
 ---
@@ -61,7 +74,12 @@ public TextIndex open(Assembler a, Resource root, Mode 
mode) {
 if ( n.isLiteral() ) {
 String literalValue = n.asLiteral().getLexicalForm() ; 
 if (literalValue.equals("mem")) {
-directory = new RAMDirectory() ;
+   // on RAMDirectory per node, not one per invocation
+   directory = ramDirectories.get(root);
+   if (directory == null) {
+directory = new RAMDirectory() ;
+ramDirectories.put(root, (RAMDirectory) directory);
--- End diff --

Sorry-- didn't look hard enough!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] jena pull request: Add memoizing of LuceneTextIndexes so that ther...

2016-01-22 Thread bwmcbride
Github user bwmcbride commented on a diff in the pull request:

https://github.com/apache/jena/pull/123#discussion_r50546204
  
--- Diff: 
jena-text/src/main/java/org/apache/jena/query/text/TextIndexLucene.java ---
@@ -399,4 +404,24 @@ private Node entryToNode(String v) {
 // TEMP
 return NodeFactoryExtra.createLiteralNode(v, null, null) ;
 }
+
+
+// set up event handlers
+
+public enum Event { CLOSED };
+
+public static interface EventHandler {
+   void handle(TextIndexLucene index);
+}
+
+private Map> eventHandlers = new 
HashMap>();
+
+public void addEventHandler(Event event, EventHandler handler) {
+   List handlers = eventHandlers.get(event);
--- End diff --

Yes it is.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] jena pull request: Add memoizing of LuceneTextIndexes so that ther...

2016-01-22 Thread rvesse
Github user rvesse commented on a diff in the pull request:

https://github.com/apache/jena/pull/123#discussion_r50546051
  
--- Diff: 
jena-text/src/main/java/org/apache/jena/query/text/TextDatasetFactory.java ---
@@ -191,5 +213,50 @@ public static DatasetGraph 
createSolrIndex(DatasetGraph base, SolrServer server,
 TextIndex index = createSolrIndex(server, entMap) ;
 return create(base, index, true) ; 
 }
+ 
+protected static class LuceneIndexMap {
+   
+   private static HashMap> map = 
new HashMap>();
+
+   protected synchronized TextIndex get(Directory directory) {
+   Object key = key(directory);
+   WeakReference ref = map.get(key);
+   if (ref == null)
+   return null;
+   TextIndex result = ref.get();
+   if (result == null)
+   map.put(key, null);
+   return result;
+   }
+   
+   protected synchronized void put(Directory directory, TextIndex 
index) {
+   map.put(key(directory), new 
WeakReference(index));
+   }
+   
+   protected synchronized void remove(Directory directory) {
+   map.put(key(directory), null);
+   }
+   
+   private Object key(Directory directory) {
+   // uglyness alert
+   if (directory instanceof FSDirectory) {
+   return getFilePath( ( (FSDirectory) 
directory).getDirectory() );
+   } else if (directory instanceof CompoundFileDirectory) {
+   return getFilePath( ( (FSDirectory) 
directory).getDirectory() );
--- End diff --

Isn't this an invalid cast, you check for `instanceof 
CompoundFileDirectory` and then proceed to cast to `FSDirectory` which is going 
to fail

Unless `CompoundFileDirectory` is actally a type derived from `FSDirectory` 
in which case the previous branch of the `if` would have caught it already and 
this branch can never be reached


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] jena pull request: Add memoizing of LuceneTextIndexes so that ther...

2016-01-22 Thread bwmcbride
Github user bwmcbride commented on a diff in the pull request:

https://github.com/apache/jena/pull/123#discussion_r50545658
  
--- Diff: 
jena-text/src/main/java/org/apache/jena/query/text/assembler/TextIndexLuceneAssembler.java
 ---
@@ -61,7 +74,12 @@ public TextIndex open(Assembler a, Resource root, Mode 
mode) {
 if ( n.isLiteral() ) {
 String literalValue = n.asLiteral().getLexicalForm() ; 
 if (literalValue.equals("mem")) {
-directory = new RAMDirectory() ;
+   // on RAMDirectory per node, not one per invocation
+   directory = ramDirectories.get(root);
+   if (directory == null) {
+directory = new RAMDirectory() ;
+ramDirectories.put(root, (RAMDirectory) directory);
--- End diff --

... and would give a compilation error on line 85.

Its easy to get rid of the cast if its a problem, but not like that.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] jena pull request: Add memoizing of LuceneTextIndexes so that ther...

2016-01-22 Thread ajs6f
Github user ajs6f commented on a diff in the pull request:

https://github.com/apache/jena/pull/123#discussion_r50542029
  
--- Diff: 
jena-text/src/main/java/org/apache/jena/query/text/TextDatasetFactory.java ---
@@ -191,5 +213,50 @@ public static DatasetGraph 
createSolrIndex(DatasetGraph base, SolrServer server,
 TextIndex index = createSolrIndex(server, entMap) ;
 return create(base, index, true) ; 
 }
+ 
+protected static class LuceneIndexMap {
+   
+   private static HashMap> map = 
new HashMap>();
+
+   protected synchronized TextIndex get(Directory directory) {
--- End diff --

It seems like there is common logic between this method and 
`RAMDirectoryMap::get`? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] jena pull request: Add memoizing of LuceneTextIndexes so that ther...

2016-01-22 Thread ajs6f
Github user ajs6f commented on a diff in the pull request:

https://github.com/apache/jena/pull/123#discussion_r50541229
  
--- Diff: 
jena-text/src/main/java/org/apache/jena/query/text/assembler/TextIndexLuceneAssembler.java
 ---
@@ -61,7 +74,12 @@ public TextIndex open(Assembler a, Resource root, Mode 
mode) {
 if ( n.isLiteral() ) {
 String literalValue = n.asLiteral().getLexicalForm() ; 
 if (literalValue.equals("mem")) {
-directory = new RAMDirectory() ;
+   // on RAMDirectory per node, not one per invocation
+   directory = ramDirectories.get(root);
+   if (directory == null) {
+directory = new RAMDirectory() ;
+ramDirectories.put(root, (RAMDirectory) directory);
--- End diff --

Changing `Directory directory;` above to `RAMDirectory directory;` would 
remove the cast.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] jena pull request: Add memoizing of LuceneTextIndexes so that ther...

2016-01-22 Thread ajs6f
Github user ajs6f commented on a diff in the pull request:

https://github.com/apache/jena/pull/123#discussion_r50540841
  
--- Diff: 
jena-text/src/main/java/org/apache/jena/query/text/assembler/TextIndexLuceneAssembler.java
 ---
@@ -123,10 +141,35 @@ public TextIndex open(Assembler a, Resource root, 
Mode mode) {
 config.setMultilingualSupport(isMultilingualSupport);
 config.setValueStored(storeValues);
 
-return TextDatasetFactory.createLuceneIndex(directory, config) 
;
+TextIndexLucene index = 
TextDatasetFactory.createLuceneIndex(directory, config) ;
+index.addEventHandler(TextIndexLucene.Event.CLOSED, 
+   ( i ) -> ramDirectories.remove(root));
+return index;
 } catch (IOException e) {
 IO.exception(e) ;
 return null ;
 }
 }
+protected static class RAMDirectoryMap {
+   
+   private static WeakHashMap> map = new WeakHashMap>();
+
+   protected synchronized RAMDirectory get(RDFNode node) {
+   WeakReference ref = map.get(node);
+   if (ref == null)
+   return null;
+   RAMDirectory result = ref.get();
+   if (result == null)
+   map.put(node, null);
+   return result;
+   }
+   
+   protected synchronized void put(RDFNode node, RAMDirectory 
directory) {
+   map.put(node, new 
WeakReference(directory));
--- End diff --

Could be little shorter with `new WeakReference<>(directory)`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] jena pull request: Add memoizing of LuceneTextIndexes so that ther...

2016-01-22 Thread ajs6f
Github user ajs6f commented on a diff in the pull request:

https://github.com/apache/jena/pull/123#discussion_r50538211
  
--- Diff: 
jena-text/src/main/java/org/apache/jena/query/text/TextIndexLucene.java ---
@@ -399,4 +404,24 @@ private Node entryToNode(String v) {
 // TEMP
 return NodeFactoryExtra.createLiteralNode(v, null, null) ;
 }
+
+
+// set up event handlers
+
+public enum Event { CLOSED };
+
+public static interface EventHandler {
+   void handle(TextIndexLucene index);
+}
+
+private Map> eventHandlers = new 
HashMap>();
--- End diff --

You can use just `new HashMap<>()` here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] jena pull request: Add memoizing of LuceneTextIndexes so that ther...

2016-01-22 Thread ajs6f
Github user ajs6f commented on a diff in the pull request:

https://github.com/apache/jena/pull/123#discussion_r50538146
  
--- Diff: 
jena-text/src/main/java/org/apache/jena/query/text/assembler/TextIndexLuceneAssembler.java
 ---
@@ -47,8 +59,9 @@
 text:entityMap <#endMap> ;
 .
 */
+   
+   private static RAMDirectoryMap ramDirectories = new RAMDirectoryMap();;
--- End diff --

Just a typo, but two semicolons here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] jena pull request: Add memoizing of LuceneTextIndexes so that ther...

2016-01-22 Thread ajs6f
Github user ajs6f commented on a diff in the pull request:

https://github.com/apache/jena/pull/123#discussion_r50537882
  
--- Diff: 
jena-text/src/main/java/org/apache/jena/query/text/TextIndexLucene.java ---
@@ -399,4 +404,24 @@ private Node entryToNode(String v) {
 // TEMP
 return NodeFactoryExtra.createLiteralNode(v, null, null) ;
 }
+
+
+// set up event handlers
+
+public enum Event { CLOSED };
+
+public static interface EventHandler {
+   void handle(TextIndexLucene index);
+}
+
+private Map> eventHandlers = new 
HashMap>();
+
+public void addEventHandler(Event event, EventHandler handler) {
+   List handlers = eventHandlers.get(event);
--- End diff --

I think this and the `if-then` below might just be 
`eventHandlers.computeIfAbsent(event, e -> new ArrayList<>())`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] jena pull request: Add memoizing of LuceneTextIndexes so that ther...

2016-01-22 Thread bwmcbride
GitHub user bwmcbride opened a pull request:

https://github.com/apache/jena/pull/123

Add memoizing of LuceneTextIndexes so that there is one TextIndexLucene

JENA-1122

These changes memoize LuceneTextIndexes so that there is one per directory, 
and Lucene RAMDirectories created by the Lucene assembler so there is one 
RAMDirectory per node in the configuration graph.

One issue is when to forget a memoized object.  The policy implemented in 
this code is to forget the object when it is closed.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/epimorphics/jena JENA-1122

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/jena/pull/123.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #123


commit d0451f1fda636aa9fa0915e4bedd3507cea54e7a
Author: Brian McBride 
Date:   2016-01-21T12:32:20Z

Add memoizing of LuceneTextIndexes so that there is one TextIndexLucene
object per directory.  Similary so the Lucene assembler only creates one
RAMDirectory per node.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---