Re: Serializing Queries

2016-03-18 Thread McKinley, James T
We use Kryo to pass query objects between hosts:

https://github.com/EsotericSoftware/kryo

We initially had some trouble with it creating dynamic classes and running out
of PermGen space, but we got around that by reusing Kryo instances via an ObjectPool:

http://commons.apache.org/proper/commons-pool/api-1.6/org/apache/commons/pool/impl/StackObjectPool.html

I've not looked at the project recently; we're using version 2.21 and there's a
3.0.0 now, so they may have resolved the issues we hit and made things nicer.
Here's how we're doing it with 2.21:

To serialize:

static private ObjectPool<Kryo> pool = new StackObjectPool<Kryo>(new KryoFactory(), 75, 75);

ByteArrayOutputStream baos = new ByteArrayOutputStream();
Output output = new Output(baos);
Kryo kryo = pool.borrowObject();         // reuse a pooled Kryo instance
kryo.writeClassAndObject(output, query); // writes the class name plus the object graph
pool.returnObject(kryo);
output.close();
String base64EncodedSerializedObject = Base64.encodeBytes(baos.toByteArray());

Where query is a Lucene Query object (I've left out the error handling for 
brevity).

To deserialize:

ByteArrayInputStream bais = new ByteArrayInputStream(Base64.decode(encodedQuery));
Input input = new Input(bais);
Kryo kryo = pool.borrowObject();
Query deserializedQueryObject = (Query) kryo.readClassAndObject(input); // reads the class, then the object
pool.returnObject(kryo);
input.close();
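
The KryoFactory above is one of our own helper classes that I didn't include;
a minimal sketch of such a factory, assuming commons-pool 1.6's
BasePoolableObjectFactory, might look like:

import com.esotericsoftware.kryo.Kryo;
import org.apache.commons.pool.BasePoolableObjectFactory;

// Hypothetical sketch: the pool calls makeObject() to create each reusable
// Kryo instance, so Kryo (and the classes it generates) is not re-created
// for every serialization.
public class KryoFactory extends BasePoolableObjectFactory<Kryo> {
    @Override
    public Kryo makeObject() throws Exception {
        return new Kryo(); // register any custom serializers here
    }
}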

Hope that might help.

Jim


From: Bauer, Herbert S. (Scott) 
Sent: 18 March 2016 10:02
To: java-user@lucene.apache.org
Subject: Serializing Queries

Has anyone in this group solved the problem of serializing complex boolean
queries (some of our clauses have span and other query types)?  Our Java RMI
depends upon being able to do so.  I have seen posts that say you can just
parse the string representation, but apparently that only works on simple query
representations. I'm looking at the CoreParser and its supporting XML parsing
capabilities, with an eye toward marshalling the boolean query into a DOM
object and unmarshalling it on the server side using some of the support
implied by the CoreParser and related classes.  -scott
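
For reference, a rough sketch of the CoreParser round trip Scott describes,
assuming Lucene 5.x's queryparser xml module and its element syntax (the
field name and the XML here are invented):

import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.xml.CoreParser;
import org.apache.lucene.search.Query;

// The client ships the query as XML; the server parses it back into a Query.
String xml =
      "<BooleanQuery fieldName=\"contents\">"
    + "<Clause occurs=\"must\"><TermQuery>span</TermQuery></Clause>"
    + "<Clause occurs=\"should\"><TermQuery>lucene</TermQuery></Clause>"
    + "</BooleanQuery>";
CoreParser parser = new CoreParser("contents", new StandardAnalyzer());
Query query = parser.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));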



Re: Scoring over Multiple Indexes

2015-10-22 Thread McKinley, James T
Hi Scott,

I don't know your reasons for splitting your index up, but assuming you want to
do that and then merge the search results back together, I think you could
re-unify the term document frequencies across all your indexes and then extend
IndexSearcher, overriding its termStatistics and collectionStatistics methods to
return the statistics from the re-unified data.  Our index is partitioned for
performance reasons, and we re-combine the document frequencies from all the
partitions into a file during our indexing workflow so that we can add new
partitions as needed without suffering from the low-statistics problem Erick
described.  We use an FST (see org.apache.lucene.util.fst.Builder) to hold the
stats in memory so that the lookups are fast.
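
A rough sketch of the searcher side, assuming the Lucene 5.x API (GlobalStats
is hypothetical; ours is the FST-backed lookup mentioned above):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermContext;
import org.apache.lucene.search.CollectionStatistics;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermStatistics;

// Hypothetical lookup over the re-unified, cross-partition counts.
interface GlobalStats {
    long docFreq(Term t);
    long totalTermFreq(Term t);
    long maxDoc();
    long docCount(String field);
    long sumTotalTermFreq(String field);
    long sumDocFreq(String field);
}

class GlobalStatsSearcher extends IndexSearcher {
    private final GlobalStats stats;

    GlobalStatsSearcher(IndexReader reader, GlobalStats stats) {
        super(reader);
        this.stats = stats;
    }

    @Override
    public TermStatistics termStatistics(Term term, TermContext context) throws IOException {
        // Answer from the global counts rather than this partition's local ones.
        return new TermStatistics(term.bytes(), stats.docFreq(term), stats.totalTermFreq(term));
    }

    @Override
    public CollectionStatistics collectionStatistics(String field) throws IOException {
        return new CollectionStatistics(field, stats.maxDoc(), stats.docCount(field),
                stats.sumTotalTermFreq(field), stats.sumDocFreq(field));
    }
}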

Jim

From: Erick Erickson 
Sent: 22 October 2015 15:15
To: java-user
Subject: Re: Scoring over Multiple Indexes

bq:  Given that the content loaded for these indexes
represents individually curated terminologies, I think we can argue to our
users that what comes from combined queries over the latter is as
meaningful in its own right as those run over the monolithic index

If one assumes that the individually curated terminologies are that way
for a reason, putting these all into a single index in some sense undid the
reason for curating them. Presumably an index specialized for
pharmaceuticals has a much different set of characteristics than for
an index specific to financials. I doubt that "leveraged buyout" appears
very often in a pharmaceutical index...

But let's say that two documents in your pharmaceutical index do
mention this phrase. The score in that index will be high since
the terms are so rare. How does one even theoretically relate
the scores coming from the financial index to one coming from the
pharmaceutical index?

None of which you can explain to an end user ;). Often the most use
to the _user_ is achieved by giving them some way to indicate which
sources they're most interested in and presenting those first.

FWIW,
Erick

On Thu, Oct 22, 2015 at 11:29 AM, Bauer, Herbert S. (Scott)
 wrote:
> Thanks for your reply.  We've recently moved from a single large index to
> multiple indexes. Given that the content loaded for these indexes
> represents individually curated terminologies, I think we can argue to our
> users that what comes from combined queries over the latter is as
> meaningful in its own right as those run over the monolithic index. We
> had to consider that our changes to the back end of our application might
> change sorting orders for results, which is what we normally want to avoid.
>
>
> On 10/22/15, 10:43 AM, "Erick Erickson"  wrote:
>
>>In a word, no. At least not that I've heard of. "normalizing scores"
>>is one of those things
>>that sounds reasonable on the surface, but is really meaningless.
>>Scores don't really
>>_tell_ you anything about the abstract "goodness" of a doc, they just
>>tell you that
>>doc1 is likely better than doc2 _within a single query_. You can't even
>>compare
>>scores in the _same_ index across two different queries.
>>
>>At its lowest level, say one index has 1,000,000 occurrences of
>>"erick", while index 2 has
>>exactly 1. Term frequency is one of the numbers that is used to
>>calculate the score.
>>How does one normalize the part of the calculation resulting from
>>matching "erick"
>>between the two indexes? Anything you do is wrong.
>>
>>Similarly, expecting documents to be returned in a particular order
>>because of boosting
>>is not going to be satisfactory. Boosting will influence the final
>>score and thus the
>>position of the document, but not absolutely order them unless you put
>>in insane boosts.
>>Tests based on boosting and doc ordering will be very fragile I'd guess.
>>
>>Best,
>>Erick
>>
>>On Thu, Oct 22, 2015 at 8:34 AM, Bauer, Herbert S. (Scott)
>> wrote:
>>> We have a test case that boosts a set of terms.  Something along the
>>>lines of "term1^2 AND term2^3 AND term3^4", and this query runs over two
>>>content-distinct indexes.  Our expectation is that the terms would be
>>>returned to us as term3, term2 and term1.  Instead we get something
>>>along the lines of term3, term1 and term2.  I realize from a number of
>>>postings that this is the result of the scoring method's action taking
>>>place within an individual index rather than against several indexes.
>>>At the same time I don't see a lot of solutions offered. Is there an out
>>>of the box solution to normalize scoring over diverse indexes?  If not,
>>>is there a strategy for rolling your own normalizing solution?  I'm
>>>assuming this has to be a common problem.  -scott
>>>
>>

Re: Lucene 5 : any merge performance metrics compared to 4.x?

2015-09-30 Thread McKinley, James T
Hi Mike,

Thanks for your response.  We've been using NFS for 10 years with Lucene and 
never saw index corruption until we moved to 4.x if I remember correctly.  We 
are aware of the locking and other issues you mentioned with NFS, but they've 
not been much of a problem for us.  You're probably correct that using a 
network file system is causing the check index to be slower than it would be on 
local disk.

We really don't have the option of moving to local disk without a significant 
redesign of our systems.  However, we do have the possibility of switching to 
iSCSI instead of NFS without changing our hardware, do you happen to know 
whether iSCSI would be a better protocol for use with Lucene?  Thanks!

Jim


From: will martin <wmartin...@gmail.com>
Sent: 30 September 2015 06:49
To: java-user@lucene.apache.org
Subject: RE: Lucene 5 : any merge performance metrics compared to 4.x?

Thanks Mike. This is very informative.



-Original Message-
From: Michael McCandless [mailto:luc...@mikemccandless.com]
Sent: Tuesday, September 29, 2015 3:22 PM
To: Lucene Users
Subject: Re: Lucene 5 : any merge performance metrics compared to 4.x?

No, it is not possible to disable, and, yes, we removed that API in 5.x because 
1) the risk of silent index corruption is too high to warrant this small 
optimization and 2) we re-worked how merging works so that this checkIntegrity 
has IO locality with what's being merged next.

There were other performance gains for merging in 5.x, e.g. using much less 
memory in the many-fields case, not decompressing + recompressing stored fields 
and term vectors, etc.

As Adrien pointed out, the cost should be much lower than 25% for a local 
filesystem ... I suspect something about your NFS setup is making it more 
costly.

NFS is in general a dangerous filesystem to use with Lucene (no delete on last 
close, locking is tricky to get right, incoherent client file contents and 
directory listing caching).

If you want to also checkIntegrity of the merged segment you could e.g. install 
an IndexReaderWarmer in your IW and call IndexReader.checkIntegrity.
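
For illustration, a minimal sketch of that suggestion against the Lucene 5.x
API (the analyzer is just a placeholder):

import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.LeafReader;

IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
iwc.setMergedSegmentWarmer(new IndexWriter.IndexReaderWarmer() {
    @Override
    public void warm(LeafReader reader) throws IOException {
        // Verify each newly merged segment as soon as it is produced;
        // throws CorruptIndexException on a bad checksum.
        reader.checkIntegrity();
    }
});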

Mike McCandless

http://blog.mikemccandless.com


On Tue, Sep 29, 2015 at 9:00 PM, will martin <wmartin...@gmail.com> wrote:
> OK, so I'm a little confused:
>
> The 4.10 JavaDoc for LiveIndexWriterConfig supports volatile access on
> a flag to setCheckIntegrityAtMerge ...
>
> Method states it controls pre-merge cost.
>
> Ref:
>
> https://lucene.apache.org/core/4_10_0/core/org/apache/lucene/index/LiveIndexWriterConfig.html#setCheckIntegrityAtMerge%28boolean%29
>
> And it seems to be gone in 5.3, folks? Meaning Adrien's comment is a
> whole lot more significant? Merges ALWAYS pre-merge checkIntegrity? Is this
> a 5.0 feature drop? You can't deprecate, um, er, totally remove an
> index-time audit feature on a point release of any level, IMHO.
>
>
> -Original Message-
> From: McKinley, James T [mailto:james.mckin...@cengage.com]
> Sent: Tuesday, September 29, 2015 2:42 PM
> To: java-user@lucene.apache.org
> Subject: Re: Lucene 5 : any merge performance metrics compared to 4.x?
>
> Yes, the indexing workflow is completely separate from the runtime system.
> The file system is EMC Isilon via NFS.
>
> Jim
>
> 
> From: will martin <wmartin...@gmail.com>
> Sent: 29 September 2015 14:29
> To: java-user@lucene.apache.org
> Subject: RE: Lucene 5 : any merge performance metrics compared to 4.x?
>
> This sounds robust. Is the index batch creation workflow a separate process?
> Distributed shared filesystems?
>
> --will
>
> -Original Message-
> From: McKinley, James T [mailto:james.mckin...@cengage.com]
> Sent: Tuesday, September 29, 2015 2:22 PM
> To: java-user@lucene.apache.org
> Subject: Re: Lucene 5 : any merge performance metrics compared to 4.x?
>
> Hi Adrien and Will,
>
> Thanks for your responses.  I work with Selva and he's busy right now
> with other things, so I'll add some more context to his question in an
> attempt to improve clarity.
>
> The merge in question is part of our batch indexing workflow wherein
> we index new content for a given partition and then merge this new
> index with the big index of everything that was previously loaded on
> the given partition.  The increase in merge time we've seen since
> upgrading from 4.10 to 5.2 is on the order of 25%.  It varies from
> partition to partition, but 25% is a good ballpark estimate I think.
> Maybe our case is non-standard, we have a large number of fields (> 200).
>
> The reason we perform an index check after the merge is that this is
> the final index state that will be used for a given batch.  Since we
> have a batch-oriented workflow we are able to roll back to a

Re: Lucene 5 : any merge performance metrics compared to 4.x?

2015-09-29 Thread McKinley, James T
Hi Adrien and Will,

Thanks for your responses.  I work with Selva and he's busy right now with 
other things, so I'll add some more context to his question in an attempt to 
improve clarity.

The merge in question is part of our batch indexing workflow wherein we index 
new content for a given partition and then merge this new index with the big 
index of everything that was previously loaded on the given partition.  The 
increase in merge time we've seen since upgrading from 4.10 to 5.2 is on the 
order of 25%.  It varies from partition to partition, but 25% is a good 
ballpark estimate I think.  Maybe our case is non-standard, we have a large 
number of fields (> 200).

The reason we perform an index check after the merge is that this is the final 
index state that will be used for a given batch.  Since we have a 
batch-oriented workflow we are able to roll back to a previous batch if we find 
a problem with a given batch (Lucene or other problem).  However, due to disk
space constraints we can only keep a couple of batches.  If our indexing workflow
completes without errors but the index is corrupt, we may not know right away,
and we might delete the previous good batch thinking the latest batch is OK,
which would be very bad, requiring a full reload of all our content.

Checking the index prior to the merge would no doubt catch many issues, but it 
might not catch corruption that occurs during the merge step itself, so we 
implemented a check step once the index is in its final state to ensure that it 
is OK.
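
A post-merge check step along those lines might look like this sketch
(assuming the Lucene 5.x API; the index path is made up):

import java.io.IOException;
import java.nio.file.Paths;
import org.apache.lucene.index.CheckIndex;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

try (Directory dir = FSDirectory.open(Paths.get("/indexes/partition01"));
     CheckIndex checker = new CheckIndex(dir)) {
    CheckIndex.Status status = checker.checkIndex();
    if (!status.clean) {
        // Fail the batch so the previous good batch is kept.
        throw new IOException("post-merge index check failed");
    }
}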

So, since we want to do the check post-merge, is there a way to disable the 
check during merge so we don't have to do two checks?

Thanks!

Jim 


From: will martin 
Sent: 29 September 2015 12:08
To: java-user@lucene.apache.org
Subject: RE: Lucene 5 : any merge performance metrics compared to 4.x?

So, if it's new, it adds to pre-existing time? Then it is a cost that needs to be
understood, I think.

And I'm really curious: what happens to the result of the post-merge
checkIntegrity IFF (if and only if) there was corruption pre-merge? I mean, if
you let it merge anyway, could you get a false positive for integrity?  [see the
concept of lazy-evaluation]

These are, imo, the kinds of engineering questions Selva's post raised in my
triage mode of the scenario.



-Original Message-
From: Adrien Grand [mailto:jpou...@gmail.com]
Sent: Tuesday, September 29, 2015 8:46 AM
To: java-user@lucene.apache.org
Subject: Re: Lucene 5 : any merge performance metrics compared to 4.x?



Indeed this is new but I'm a bit surprised this is the source of your issues as 
it should be much faster than the merge itself. I don't understand your 
proposal to check the index after merge: the goal is to make sure that we do 
not propagate corruptions so it's better to check the index before the merge 
starts so that we don't even try to merge if there are corruptions?



On Tue, 15 Sep 2015 at 00:40, Selva Kumar <selva.kumar.at.w...@gmail.com> wrote:



> it appears Lucene 5.2 index merge is running checkIntegrity on
> existing index prior to merging additional indices.
> This seems to be new.
>
> We have an existing checkIndex but this is run post index merge.
>
> Two follow-up questions:
> * Is there a way to turn off the built-in checkIntegrity? Just for my
> understanding; no plan to turn this off.
> * Is running checkIntegrity prior to index merge better than running it
> post merge?
>
> On Mon, Sep 14, 2015 at 12:24 PM, Selva Kumar <selva.kumar.at.w...@gmail.com> wrote:
>
> > We observe some merge slowness after we migrated from 4.10 to 5.2.
> > Is this expected? Any new tunable merge parameters in Lucene 5?
> >
> > -Selva





Re: Lucene 5 : any merge performance metrics compared to 4.x?

2015-09-29 Thread McKinley, James T
Yes, the indexing workflow is completely separate from the runtime system.  The 
file system is EMC Isilon via NFS.

Jim


From: will martin <wmartin...@gmail.com>
Sent: 29 September 2015 14:29
To: java-user@lucene.apache.org
Subject: RE: Lucene 5 : any merge performance metrics compared to 4.x?

This sounds robust. Is the index batch creation workflow a separate process?
Distributed shared filesystems?

--will

-Original Message-
From: McKinley, James T [mailto:james.mckin...@cengage.com]
Sent: Tuesday, September 29, 2015 2:22 PM
To: java-user@lucene.apache.org
Subject: Re: Lucene 5 : any merge performance metrics compared to 4.x?

Hi Adrien and Will,

Thanks for your responses.  I work with Selva and he's busy right now with
other things, so I'll add some more context to his question in an attempt to
improve clarity.

The merge in question is part of our batch indexing workflow wherein we
index new content for a given partition and then merge this new index with
the big index of everything that was previously loaded on the given
partition.  The increase in merge time we've seen since upgrading from 4.10
to 5.2 is on the order of 25%.  It varies from partition to partition, but
25% is a good ballpark estimate I think.  Maybe our case is non-standard, we
have a large number of fields (> 200).

The reason we perform an index check after the merge is that this is the
final index state that will be used for a given batch.  Since we have a
batch-oriented workflow we are able to roll back to a previous batch if we
find a problem with a given batch (Lucene or other problem).  However due to
disk space constraints we can only keep a couple batches.  If our indexing
workflow completes without errors but the index is corrupt, we may not know
right away and we might delete the previous good batch thinking the latest
batch is OK, which would be very bad, requiring a full reload of all our
content.

Checking the index prior to the merge would no doubt catch many issues, but
it might not catch corruption that occurs during the merge step itself, so
we implemented a check step once the index is in its final state to ensure
that it is OK.

So, since we want to do the check post-merge, is there a way to disable the
check during merge so we don't have to do two checks?

Thanks!

Jim


From: will martin <wmartin...@gmail.com>
Sent: 29 September 2015 12:08
To: java-user@lucene.apache.org
Subject: RE: Lucene 5 : any merge performance metrics compared to 4.x?

So, if it's new, it adds to pre-existing time? Then it is a cost that needs to
be understood, I think.

And I'm really curious: what happens to the result of the post-merge
checkIntegrity IFF (if and only if) there was corruption pre-merge? I mean,
if you let it merge anyway, could you get a false positive for integrity?
[see the concept of lazy-evaluation]

These are, imo, the kinds of engineering questions Selva's post raised in my
triage mode of the scenario.


-Original Message-
From: Adrien Grand [mailto:jpou...@gmail.com]
Sent: Tuesday, September 29, 2015 8:46 AM
To: java-user@lucene.apache.org
Subject: Re: Lucene 5 : any merge performance metrics compared to 4.x?



Indeed this is new but I'm a bit surprised this is the source of your issues
as it should be much faster than the merge itself. I don't understand your
proposal to check the index after merge: the goal is to make sure that we do
not propagate corruptions so it's better to check the index before the merge
starts so that we don't even try to merge if there are corruptions?



On Tue, 15 Sep 2015 at 00:40, Selva Kumar <selva.kumar.at.w...@gmail.com> wrote:



> it appears Lucene 5.2 index merge is running checkIntegrity on
> existing index prior to merging additional indices.
> This seems to be new.
>
> We have an existing checkIndex but this is run post index merge.
>
> Two follow-up questions:
> * Is there a way to turn off the built-in checkIntegrity? Just for my
> understanding; no plan to turn this off.
> * Is running checkIntegrity prior to index merge better than running it
> post merge?
>
> On Mon, Sep 14, 2015 at 12:24 PM, Selva Kumar <selva.kumar.at.w...@gmail.com> wrote:
>
> > We observe some merge slowness after we migrated from 4.10 to 5.2.
> > Is this expected? Any new tunable merge parameters in Lucene 5?
> >
> > -Selva



Re: How to index & search arrays of double?

2015-08-06 Thread McKinley, James T
Hi Stan,

I played around with LIRE a couple of years ago.  I don't know exactly how it
works, but from what I remember it doesn't just use Lucene; it has its own
classes built around Lucene to perform the image search.  There used to be a
PDF of a paper on the site, but I couldn't find a link when I just looked.
Here's a quote from the search section of it:

For search, classes implementing the ImageSearcher interface
are used. The ImageSearcher either takes the given
query feature or extracts the feature from a query image. It
then reads documents from the index sequentially and compares
them to the query image (linear search). Although
the main indexing features of Lucene (e.g. an inverted list
or stemming) are not employed in this kind of search, LIRe
takes advantage of the efficient and fast disk access layer
of Lucene, which results in lower search times compared to
implementations using the embedded databases HSQLDB2,
which is used in Open Office, and Apache Derby3, which is
also included in the Java runtime releases as Java DB. Also
the use of Lucene allows indexes bigger than common RAM
restrictions (e.g. smaller than 2 GB on 32 bit Java) and
additional indexing of textual metadata for the images.

So it sounds like they're just using Lucene as a fast document store and then
implementing their own matching, if I understand that blurb correctly.  Here's
the GitHub page of the project if you want to dig around in the code and see
what they're actually doing.

https://github.com/dermotte/LIRE
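
To make that blurb concrete, such a linear scan might look roughly like the
following sketch, assuming Lucene 5.x doc values and a hypothetical field
"feature1" holding the packed doubles (deletions and error handling ignored):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.DoubleBuffer;
import org.apache.lucene.index.BinaryDocValues;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiDocValues;
import org.apache.lucene.util.BytesRef;

// Each document stores its feature vector as packed doubles, e.g. indexed
// with new BinaryDocValuesField("feature1", new BytesRef(bytes)).
static int closestDoc(IndexReader reader, double[] query) throws IOException {
    BinaryDocValues features = MultiDocValues.getBinaryValues(reader, "feature1");
    int best = -1;
    double bestDist = Double.MAX_VALUE;
    for (int doc = 0; doc < reader.maxDoc(); doc++) {
        BytesRef raw = features.get(doc);
        DoubleBuffer vec = ByteBuffer.wrap(raw.bytes, raw.offset, raw.length).asDoubleBuffer();
        double dist = 0; // L1 distance to the query vector
        for (int i = 0; i < query.length; i++) {
            dist += Math.abs(vec.get(i) - query[i]);
        }
        if (dist < bestDist) {
            bestDist = dist;
            best = doc;
        }
    }
    return best;
}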

Jim


From: Estanislao Oubel estanislao.ou...@gmail.com
Sent: 06 August 2015 10:13
To: java-user@lucene.apache.org
Subject: Re: How to index & search arrays of double?

Thanks Phaneendra for responding,

I know LIRE, I have been playing around with this library, but I don't
understand what the added value is. To be more specific, LIRE allows
computing several image features and similarity between them; no problem so
far. My main concern is that the index used by LIRE is a Lucene index (at
least in the examples). However, a Lucene index is an inverted index that
seems suitable for indexing terms, and it's not clear to me how arrays of
values (LIRE features for example) are managed. What is even more strange
is that, when searching for a specific feature, it is compared to all
documents in the index, and therefore I don't see the advantage of
using a Lucene index ... Perhaps I am missing something, but my
understanding is that an index should optimize the search of documents,
which seems not to be the case ...

If you have some experience with LIRE, could you please help me understand
all this? The one-million question is: do I necessarily have to use LIRE to
solve my specific problem?

If you think that this topic is not suitable for the Lucene forum, please
tell me and we can continue the discussion outside the mailing list. But
I think it is of general interest, because perhaps there are solutions
using native Lucene functions.

Thanks!

Stan





2015-08-06 10:48 GMT+02:00 Phaneendra N phaneendran.gi...@gmail.com:

 Hello Stan,
   Great question. I came across one such implementation based on
 Lucene. It's called LIRE.
 This is an open source project. http://www.lire-project.net/
 You might get some ideas there.
 Please let me know if you find answers to your specific questions there.
 I'm curious.

 Thanks
 Phaneendra

 On Thu, Aug 6, 2015 at 12:39 PM, Estanislao Oubel 
 estanislao.ou...@gmail.com wrote:

  Hello everybody,
 
  I'm currently investigating methods for content-based image retrieval. In
  this context, I would like to index documents containing arrays of
 doubles
  and then perform an approximate search based on these arrays. For
 example,
  I would like to insert in the index three documents (d1,d2,d3)
 containing a
  field called feature1, a vector of doubles of dimension 3:
 
  d1_feature1  = [0.5 1.8 2.4].
  d2_feature1  = [30.1 0 9.1].
  d3_feature1  = [0.6 5.8 2.0].
 
  Now, I would like Lucene to give me d1 when I search with a document
  containing [0.51 1.79 2.41] (because d1 is the closest one according to an
  L1 distance, for example).
 
  Is it possible to do this type of things with lucene? More specifically:
  1. Does lucene support arrays of doubles as field type?
  2. Is it possible to search documents based on custom distances between
  these arrays?
 
  If so, can you provide some clues about how to implement it? (fields
 types
  and classes to use,  or an example)
 
  Thanks!
 
  Stan
 




Re: Lucene Searcher Caching and Performance

2015-08-04 Thread McKinley, James T
Hi Clive,

We essentially do what you're suggesting, namely we create a single index 
searcher (as well as the directory reader it uses) on each partition that is 
shared amongst all threads.  We also perform various index operations 
(searching, browsing terms etc.) for a while to warm up Lucene's internal 
data structures as well as the Linux OS file caches prior to putting the 
partition server in service.  I don't know if this is the recommended method, 
but it seems to work for us.
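
In code, that pattern boils down to something like this sketch (Lucene 4.x
API; the index path and warm-up queries are placeholders):

import java.io.File;
import java.util.Arrays;
import java.util.List;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.FSDirectory;

// One reader/searcher per partition, shared by all request threads
// (IndexSearcher is thread-safe).
DirectoryReader reader = DirectoryReader.open(FSDirectory.open(new File("/indexes/partition01")));
IndexSearcher searcher = new IndexSearcher(reader);

// Run representative queries before putting the partition in service to
// warm Lucene's data structures and the OS file cache.
List<Query> warmupQueries = Arrays.asList(
        new TermQuery(new Term("text", "lucene")),
        new TermQuery(new Term("text", "index")));
for (Query q : warmupQueries) {
    searcher.search(q, 10);
}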

Jim

From: kiwi clive kiwi_cl...@yahoo.com.INVALID
Sent: 04 August 2015 11:41
To: Java-user
Subject: Lucene Searcher Caching and Performance

Hi Guys,

We have an index/query server that contains several thousand fairly hefty
indexes. Each searcher is shared between many 'user-threads' and, once opened,
we keep the searcher in a cache which is refreshed depending on how often it is
used. Due to memory limitations on the server, we need some kind of LRU
mechanism to drop unused searchers to make way for newer ones.

We are seeing load spikes when we get hit by queries that try to open several
non-cached searchers at the same (or at least a small delta) time. This looks to
be the disks struggling to open all the appropriate files for that period, and
it takes a little while for the server to return to normal operating limits
thereafter.

Given that upgrading hardware/memory is not currently an option, we need a way
to smooth over these spikes, even if it is at the cost of slowing query
performance overall.

It strikes me that if we could cache all of our searchers on the machine (i.e.
have all of our indexes 'open for business'), possibly having to alter kernel
parameters to cater for the large number of file handles, without caching many
query results, this might solve the problem without pushing memory usage too
high. However, the higher number of searchers stored in the heap is going to
steal space from the Lucene file cache, so is there a recommended mechanism for
doing this?

So, is there a way to minimize the searcher cache memory footprint, to possibly
keep more of them in memory at the cost of storing less data?

Any insight would be most appreciated.

Thanks,
Clive





RE: Lucene Version Upgrade (3-4) and Java JVM Versions(6-8)

2015-02-12 Thread McKinley, James T
Hi Robert,

Thanks for responding to my message.  Are you saying that you or others have 
encountered problems running Lucene 4.8+ on the 64-bit Java SE 1.7 JVM with G1 
and was it on Windows or on Linux?  If so, where can I find out more?  I only 
looked into the one bug because that was the only bug I saw on the 
https://wiki.apache.org/lucene-java/JavaBugs page that was related to G1.  If 
there are other Lucene on Java 1.7 with G1 related bugs how can I find them?  
Also, are these failures something that would be triggered by running the 
standard Lucene 4.8.1 test suite or are there other tests I should run in order 
to reproduce these bugs?

We have been running the user facing runtime portion of our search engine using 
Java SE 1.7.0_04 with the G1 garbage collector for almost two years now and I 
was not aware of these JVM bugs with Lucene.  However, the indexing workflow 
portion of our system uses Parallel GC since it is a batch system and is not 
constrained by user facing response time requirements.  From what I understood 
from the JDK-8038348 bug comments, it is a compiler bug that can be tripped 
when using G1 and if the compiler is producing incorrect code I guess any 
behaviour is possible.  

We have experienced index corruption 3 times so far since upgrading to Lucene 
4.8.1 from Lucene 4.4 (I don't recall any corruption prior to moving to 4.8) 
but as I said we are using Parallel GC (-XX:+UseParallelGC 
-XX:+UseParallelOldGC) in the indexing workflow that writes the indexes, we 
only use G1 in the runtime system that does no index writing.  We have twice 
encountered index corruption during the index creation workflow (the runtime 
system never opened the indexes) and once found the index to be corrupt when we 
restarted the runtime on it.  So this may just be JVM bugs that can be 
triggered regardless of which garbage collector is used (which is of course 
even worse).  We do have relatively large indexes (530M+ docs total across 30 
partitions), so maybe we're more likely to see corruption even when using 
Parallel GC?  We haven't seen any corruption since the end of September 2014, 
but we have now added an index checking step to our workflow to ensure we don't 
ever point the runtime at a bad batch.  When we've encountered index corruption 
in the past we've just deleted the bad batch and re-ran the workflow and the 
subsequent runs have succeeded.  We've never figured out what caused the 
corruption.  Thanks for any further help.

Jim

From: Robert Muir [rcm...@gmail.com]
Sent: Wednesday, February 11, 2015 5:05 PM
To: java-user
Subject: Re: Lucene Version Upgrade (3-4) and Java JVM Versions(6-8)

No, because you only looked into one bug. We have seen and still do see
many G1-related test failures, including on the latest 1.8.0 update 40 early
access editions. These include things like corruption.

I added this message with *every intention* to scare away users,
because I don't want them having index corruption.

I am sick of people asking "but isn't it fine on the latest version"
and so on. It is not.

On Wed, Feb 11, 2015 at 11:41 AM, McKinley, James T
james.mckin...@cengage.com wrote:
 Hi,

 A couple mailing list members have brought the following paragraph from the 
 https://wiki.apache.org/lucene-java/JavaBugs page to my attention:

 Do not, under any circumstances, run Lucene with the G1 garbage collector. 
 Lucene's test suite fails with the G1 garbage collector on a regular basis, 
 including bugs that cause index corruption. There is no person on this planet 
 that seems to understand such bugs (see 
 https://bugs.openjdk.java.net/browse/JDK-8038348, open for over a year), so 
 don't count on the situation changing soon. This information is not out of 
 date, and don't think that the next oracle java release will fix the 
 situation.

 Since we run Lucene 4.8.1 on Java(TM) SE Runtime Environment (build 
 1.7.0_04-b20) Java HotSpot(TM) 64-Bit Server VM (build 23.0-b21, mixed mode) 
 using G1GC in production I felt I should look into the issue and see if it is 
 reproducible in our environment.  First I read the bug linked in the above 
 paragraph as well as https://issues.apache.org/jira/browse/LUCENE-5168 and it 
 appears quite a bit of work in trying to track down this bug has already been 
 done by Dawid Weiss and Vladmir Kozlov but it seems it is limited to the 
 32-bit JVM (maybe even only on Windows), to quote Dawid Weiss from the Jira 
 bug:

 My quest continues

 I thought it'd be interesting to see how far back I can trace this
 issue. I fetched the official binaries for jdk17 (windows, 32-bit) and
 did a binary search with the failing Lucene test command. The results
 show that, in short:

 ...
 jdk1.7.0_03: PASSES
 jdk1.7.0_04: FAILS
 ...

 and are consistent before and after. jdk1.7.0_04, 64-bit does *NOT*
 exhibit the issue (and neither does any version afterwards, it only
 happens on 32-bit; perhaps it's because of smaller number of available
 registers and the need to spill?).

RE: Lucene Version Upgrade (3-4) and Java JVM Versions(6-8)

2015-02-11 Thread McKinley, James T
Hi,

A couple mailing list members have brought the following paragraph from the 
https://wiki.apache.org/lucene-java/JavaBugs page to my attention:

Do not, under any circumstances, run Lucene with the G1 garbage collector. 
Lucene's test suite fails with the G1 garbage collector on a regular basis, 
including bugs that cause index corruption. There is no person on this planet 
that seems to understand such bugs (see 
https://bugs.openjdk.java.net/browse/JDK-8038348, open for over a year), so 
don't count on the situation changing soon. This information is not out of 
date, and don't think that the next oracle java release will fix the situation.

Since we run Lucene 4.8.1 on Java(TM) SE Runtime Environment (build 
1.7.0_04-b20) Java HotSpot(TM) 64-Bit Server VM (build 23.0-b21, mixed mode) 
using G1GC in production I felt I should look into the issue and see if it is 
reproducible in our environment.  First I read the bug linked in the above 
paragraph as well as https://issues.apache.org/jira/browse/LUCENE-5168 and it 
appears quite a bit of work in trying to track down this bug has already been 
done by Dawid Weiss and Vladmir Kozlov but it seems it is limited to the 32-bit 
JVM (maybe even only on Windows), to quote Dawid Weiss from the Jira bug:

My quest continues 

I thought it'd be interesting to see how far back I can trace this
issue. I fetched the official binaries for jdk17 (windows, 32-bit) and
did a binary search with the failing Lucene test command. The results
show that, in short:

...
jdk1.7.0_03: PASSES
jdk1.7.0_04: FAILS
...

and are consistent before and after. jdk1.7.0_04, 64-bit does *NOT*
exhibit the issue (and neither does any version afterwards, it only
happens on 32-bit; perhaps it's because of smaller number of available
registers and the need to spill?).

jdk1.7.0_04 was when G1GC was officially made supported but I don't
think this plays a big difference. I'll see if I can bsearch on
mercurial revisions to see which particular revision introduced the
problem. Anyway, the problem has to be a long-standing issue and not a
regression. Which makes it even more interesting I guess.

Dawid

In addition, the second-to-last comment in the LUCENE-5168 bug is "I don't think
this is closely related to G1GC. It looks more that G1GC happily triggers this
bug in this special case."

Just to make sure the bug wasn't reproducible with our specific environment I 
checked out the tag for Lucene 4.8.1 
(http://svn.apache.org/repos/asf/lucene/dev/tags/lucene_solr_4_8_1) and made 
the following change to common-build.xml:

gada@C006129:~/workspace-java/lucene_solr_4_8_1/lucene$ svn diff common-build.xml
Index: common-build.xml
===================================================================
--- common-build.xml	(revision 1658458)
+++ common-build.xml	(working copy)
@@ -92,7 +92,7 @@
   </path>
 
   <!-- default arguments to pass to JVM executing tests -->
-  <property name="args" value=""/>
+  <property name="args" value="-XX:+UnlockDiagnosticVMOptions -XX:+UseG1GC -XX:MaxGCPauseMillis=100 -XX:InitiatingHeapOccupancyPercent=65 -XX:ParallelGCThreads=12 -verbose:gc -XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintAdaptiveSizePolicy -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -Xloggc:/home/gada/tmp/lucene-test-gc.log -XX:LogFile=/home/gada/tmp/lucene-test-vmop.log -XX:+LogVMOutput -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1"/>
 
   <property name="tests.seed" value=""/>
 
I then ran the following script:

#!/bin/bash
count=0
while ant test ; do
    count=$((count + 1))
    printf "\n\n\nrun $count completed without errors\n\n\n"
    if [ $count -ge 100 ]; then
        break
    fi
    sleep 1
done

All tests ran successfully 100 times in a row on a dual 6-core CPU Intel Xeon 
Lenovo C30 ThinkStation with 64GB RAM running the Ubuntu 14.04 LTS Linux 
distribution.  I also successfully ran the test suite a few times on Java(TM) 
SE Runtime Environment (build 1.7.0_55-b13) Java HotSpot(TM) 64-Bit Server VM 
(build 24.55-b03, mixed mode) since I had it available.

TL;DR:

I think perhaps the sentence "Do not, under any circumstances, run Lucene with
the G1 garbage collector." is a bit too strong.  Maybe a more balanced
statement is in order?  For example: we've found that the OpenJDK/Oracle
32-bit JVM (if only on Windows, say only on Windows) has a bug that, when used
in combination with the G1 garbage collector, causes incorrect code to be
produced, possibly resulting in index corruption; or something along those
lines.  It seems a shame to possibly scare new Lucene users away from using
G1GC with the 64-bit JVM, given that it has better performance on large heaps,
which are becoming more common today.

FWIW,
Jim

From: McKinley, James T [james.mckin...@cengage.com]
Sent: Monday, February 09, 2015 11:00 AM
To: java-user@lucene.apache.org
Subject: RE: Lucene Version

RE: Lucene Version Upgrade (3-4) and Java JVM Versions(6-8)

2015-02-09 Thread McKinley, James T
OK thanks Erick, I have put a story in our jira backlog to investigate the G1GC 
issues with the Lucene test suite.  I don't know if we'll be able to shed any 
light on the issue, but since we're using Lucene with Java 7 G1GC, I guess we 
better investigate it.

Jim

From: Erick Erickson [erickerick...@gmail.com]
Sent: Saturday, February 07, 2015 2:22 PM
To: java-user
Subject: Re: Lucene Version Upgrade (3-4) and Java JVM Versions(6-8)

The G1GC issue referenced by Robert Muir on the Wiki page is at a
Lucene level. Lucene, of course, is critically important to Solr, so
from that perspective it is about Solr too.

https://wiki.apache.org/lucene-java/JavaBugs

And, I assume, it also applies to your custom app.

FWIW,
Erick

On Fri, Feb 6, 2015 at 12:10 PM, McKinley, James T
james.mckin...@cengage.com wrote:
 Just to be clear in case there was any confusion about my previous message 
 regarding G1GC, we do not use Solr, my team works on a proprietary 
 Lucene-based search engine.  Consequently, I can't really give any advice 
 regarding Solr with G1GC, but for our uses (so far anyway), G1GC seems to 
 work well with Lucene.

 Jim
 
 From: Piotr Idzikowski [piotridzikow...@gmail.com]
 Sent: Friday, February 06, 2015 5:35 AM
 To: java-user@lucene.apache.org
 Subject: Re: Lucene Version Upgrade (3-4) and Java JVM Versions(6-8)

 Hello.
A little bit delayed question, but recently I have found these articles:
 https://wiki.apache.org/solr/SolrPerformanceProblems
 https://wiki.apache.org/solr/ShawnHeisey#GC_Tuning

 Especially this part from first url:
 *Using the ConcurrentMarkSweep (CMS) collector with tuning parameters is a
 very good option for Solr, but with the latest Java 7 releases (7u72 at
 the time of this writing), G1 is looking like a better option, if the
 -XX:+ParallelRefProcEnabled option is used.*

 How does it play with *Do not, under any circumstances, run Lucene with
 the G1 garbage collector.*
 from https://wiki.apache.org/lucene-java/JavaBugs?

 Regards
 Piotr

 On Tue, Jan 27, 2015 at 9:55 PM, McKinley, James T 
 james.mckin...@cengage.com wrote:

 Hi Uwe,

 OK, thanks for the info.  We'll see if we can download the Lucene test
 suite and check it out.

 FWIW, we use G1GC in our production runtime (~70 12-16 core Cisco UCS and
 HP Gen7/Gen8 nodes with 20+ GB heaps using Java 7 and Lucene 4.8.1 with
 pairs of 30 index partitions with 15M-23M docs each) and have not
 experienced any VM crashes (well, maybe a couple, but not directly
 traceable to G1 to my knowledge).  We have found some undocumented pauses
 in G1 due to very large object arrays and filed a bug report which was
 confirmed and also affects CMS (we worked around this in our code using
 memory mapping of some files whose contents we previously held all in
 RAM).  I think the only index corruption we've ever seen was in our index
 creation workflow (~30 HP Gen7 nodes with 27GB heaps) but this was using
 Parallel GC since it is a batch system, so that corruption (which we've not
 seen recently and never found a cause for) was definitely not due to G1GC.

 G1GC has bugs as does CMS but we've found it to work pretty well so far in
 our runtime system.  Of course YMMV, thanks again for the info.

 Jim
 
 From: Uwe Schindler [u...@thetaphi.de]
 Sent: Tuesday, January 27, 2015 3:02 PM
 To: java-user@lucene.apache.org
 Subject: RE: Lucene Version Upgrade (3-4) and Java JVM Versions(6-8)

 Hi,

 About G1GC. We consistently see problems when running the Lucene Testsuite
 with G1GC enabled. The people from Elasticsearch concluded:

 There is a newer GC called the Garbage First GC (G1GC). This newer GC is
 designed to minimize pausing even more than CMS, and operate on large
 heaps. It works by dividing the heap into regions and predicting which
 regions contain the most reclaimable space. By collecting those regions
 first (garbage first), it can minimize pauses and operate on very large
 heaps.

 Sounds great! Unfortunately, G1GC is still new, and fresh bugs are found
 routinely. These bugs are usually of the segfault variety, and will cause
 hard crashes. The Lucene test suite is brutal on GC algorithms, and it
 seems that G1GC hasn’t had the kinks worked out yet.

 We would like to recommend G1GC someday, but for now, it is simply not
 stable enough to meet the demands of Elasticsearch and Lucene.
 (
 http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/_don_8217_t_touch_these_settings.html
 )

 In fact, the problems with G1GC can sometimes lead to index corruption,
 and are hard to reproduce. So better don't use...

 Uwe

 -
 Uwe Schindler
 H.-H.-Meier-Allee 63, D-28213 Bremen
 http://www.thetaphi.de
 eMail: u...@thetaphi.de


  -Original Message-
  From: McKinley, James T [mailto:james.mckin...@cengage.com]
  Sent: Tuesday, January 27, 2015 8:58 PM
  To: java-user@lucene.apache.org
  Subject

RE: Lucene Version Upgrade (3-4) and Java JVM Versions(6-8)

2015-02-06 Thread McKinley, James T
Just to be clear, in case there was any confusion about my previous message
regarding G1GC: we do not use Solr; my team works on a proprietary Lucene-based
search engine.  Consequently, I can't really give any advice regarding Solr
with G1GC, but for our uses (so far anyway), G1GC seems to work well with 
Lucene.

Jim

From: Piotr Idzikowski [piotridzikow...@gmail.com]
Sent: Friday, February 06, 2015 5:35 AM
To: java-user@lucene.apache.org
Subject: Re: Lucene Version Upgrade (3-4) and Java JVM Versions(6-8)

Hello.
A little bit delayed question, but recently I have found these articles:
https://wiki.apache.org/solr/SolrPerformanceProblems
https://wiki.apache.org/solr/ShawnHeisey#GC_Tuning

Especially this part from first url:
*Using the ConcurrentMarkSweep (CMS) collector with tuning parameters is a
very good option for Solr, but with the latest Java 7 releases (7u72 at
the time of this writing), G1 is looking like a better option, if the
-XX:+ParallelRefProcEnabled option is used.*

How does it play with *Do not, under any circumstances, run Lucene with
the G1 garbage collector.*
from https://wiki.apache.org/lucene-java/JavaBugs?

Regards
Piotr

On Tue, Jan 27, 2015 at 9:55 PM, McKinley, James T 
james.mckin...@cengage.com wrote:

 Hi Uwe,

 OK, thanks for the info.  We'll see if we can download the Lucene test
 suite and check it out.

 FWIW, we use G1GC in our production runtime (~70 12-16 core Cisco UCS and
 HP Gen7/Gen8 nodes with 20+ GB heaps using Java 7 and Lucene 4.8.1 with
 pairs of 30 index partitions with 15M-23M docs each) and have not
 experienced any VM crashes (well, maybe a couple, but not directly
 traceable to G1 to my knowledge).  We have found some undocumented pauses
 in G1 due to very large object arrays and filed a bug report which was
 confirmed and also affects CMS (we worked around this in our code using
 memory mapping of some files whose contents we previously held all in
 RAM).  I think the only index corruption we've ever seen was in our index
 creation workflow (~30 HP Gen7 nodes with 27GB heaps) but this was using
 Parallel GC since it is a batch system, so that corruption (which we've not
 seen recently and never found a cause for) was definitely not due to G1GC.

 G1GC has bugs as does CMS but we've found it to work pretty well so far in
 our runtime system.  Of course YMMV, thanks again for the info.

 Jim
 
 From: Uwe Schindler [u...@thetaphi.de]
 Sent: Tuesday, January 27, 2015 3:02 PM
 To: java-user@lucene.apache.org
 Subject: RE: Lucene Version Upgrade (3-4) and Java JVM Versions(6-8)

 Hi,

 About G1GC. We consistently see problems when running the Lucene Testsuite
 with G1GC enabled. The people from Elasticsearch concluded:

 There is a newer GC called the Garbage First GC (G1GC). This newer GC is
 designed to minimize pausing even more than CMS, and operate on large
 heaps. It works by dividing the heap into regions and predicting which
 regions contain the most reclaimable space. By collecting those regions
 first (garbage first), it can minimize pauses and operate on very large
 heaps.

 Sounds great! Unfortunately, G1GC is still new, and fresh bugs are found
 routinely. These bugs are usually of the segfault variety, and will cause
 hard crashes. The Lucene test suite is brutal on GC algorithms, and it
 seems that G1GC hasn’t had the kinks worked out yet.

 We would like to recommend G1GC someday, but for now, it is simply not
 stable enough to meet the demands of Elasticsearch and Lucene.
 (
 http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/_don_8217_t_touch_these_settings.html
 )

 In fact, the problems with G1GC can sometimes lead to index corruption,
 and are hard to reproduce. So better don't use...

 Uwe

 -
 Uwe Schindler
 H.-H.-Meier-Allee 63, D-28213 Bremen
 http://www.thetaphi.de
 eMail: u...@thetaphi.de


  -Original Message-
  From: McKinley, James T [mailto:james.mckin...@cengage.com]
  Sent: Tuesday, January 27, 2015 8:58 PM
  To: java-user@lucene.apache.org
  Subject: RE: Lucene Version Upgrade (3-4) and Java JVM Versions(6-8)
 
  Why do you say not to use G1GC?  We are using Java 7 & G1GC with Lucene
  4.8.1 in production.  Thanks.
 
  Jim
  
  From: Uwe Schindler [u...@thetaphi.de]
  Sent: Tuesday, January 27, 2015 2:49 PM
  To: java-user@lucene.apache.org; 'kiwi clive'
  Subject: RE: Lucene Version Upgrade (3-4) and Java JVM Versions(6-8)
 
  Java 8 update 20 or later is also fine. At current time, always use
 latest update
  release and you will be fine with Java 7 and Java 8. Don't use older
 releases
  and don't use G1 Garbage Collector.
 
  -
  Uwe Schindler
  H.-H.-Meier-Allee 63, D-28213 Bremen
  http://www.thetaphi.de
  eMail: u...@thetaphi.de
 
 
   -Original Message-
   From: kiwi clive [mailto:kiwi_cl...@yahoo.com.INVALID]
   Sent: Tuesday, January 27, 2015 8:03 PM

RE: Lucene Version Upgrade (3-4) and Java JVM Versions(6-8)

2015-01-27 Thread McKinley, James T
Why do you say not to use G1GC?  We are using Java 7 & G1GC with Lucene 4.8.1
in production.  Thanks.

Jim

From: Uwe Schindler [u...@thetaphi.de]
Sent: Tuesday, January 27, 2015 2:49 PM
To: java-user@lucene.apache.org; 'kiwi clive'
Subject: RE: Lucene Version Upgrade (3-4) and Java JVM Versions(6-8)

Java 8 update 20 or later is also fine. At current time, always use latest 
update release and you will be fine with Java 7 and Java 8. Don't use older
releases and don't use G1 Garbage Collector.

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


 -Original Message-
 From: kiwi clive [mailto:kiwi_cl...@yahoo.com.INVALID]
 Sent: Tuesday, January 27, 2015 8:03 PM
 To: java-user@lucene.apache.org
 Subject: Re: Lucene Version Upgrade (3-4) and Java JVM Versions(6-8)

 Hi Hoss,
 Many thanks for the information. This looks very encouraging as the Java7
 bug I remember was fixed and, as far as I know, we should not be affected
 by the others.
 I'll put a few tests together and put my toe in the water :-) Clive

   From: Chris Hostetter hossman_luc...@fucit.org
  To: java-user@lucene.apache.org java-user@lucene.apache.org; kiwi
 clive kiwi_cl...@yahoo.com
  Sent: Tuesday, January 27, 2015 4:01 PM
  Subject: Re: Lucene Version Upgrade (3-4) and Java JVM Versions(6-8)




 : I seem to remember reading that certain versions of lucene were
 : incompatible with some java versions although I cannot find anything to
 : verify this. As we have tens of thousands of large indexes, backwards
 : compatibility without the need to reindex on an upgrade is of prime
 : importance to us.

 All known JVM bugs affecting Lucene are listed here...

 https://wiki.apache.org/lucene-java/JavaBugs


 -Hoss
 http://www.lucidworks.com/










RE: Lucene Version Upgrade (3-4) and Java JVM Versions(6-8)

2015-01-27 Thread McKinley, James T
Hi Uwe,

OK, thanks for the info.  We'll see if we can download the Lucene test suite 
and check it out.  

FWIW, we use G1GC in our production runtime (~70 12-16 core Cisco UCS and HP 
Gen7/Gen8 nodes with 20+ GB heaps using Java 7 and Lucene 4.8.1 with pairs of 
30 index partitions with 15M-23M docs each) and have not experienced any VM 
crashes (well, maybe a couple, but not directly traceable to G1 to my 
knowledge).  We have found some undocumented pauses in G1 due to very large 
object arrays and filed a bug report which was confirmed and also affects CMS 
(we worked around this in our code using memory mapping of some files whose 
contents we previously held all in RAM).  I think the only index corruption 
we've ever seen was in our index creation workflow (~30 HP Gen7 nodes with 27GB 
heaps) but this was using Parallel GC since it is a batch system, so that 
corruption (which we've not seen recently and never found a cause for) was 
definitely not due to G1GC.

G1GC has bugs as does CMS but we've found it to work pretty well so far in our 
runtime system.  Of course YMMV, thanks again for the info.

Jim

From: Uwe Schindler [u...@thetaphi.de]
Sent: Tuesday, January 27, 2015 3:02 PM
To: java-user@lucene.apache.org
Subject: RE: Lucene Version Upgrade (3-4) and Java JVM Versions(6-8)

Hi,

About G1GC. We consistently see problems when running the Lucene Testsuite with 
G1GC enabled. The people from Elasticsearch concluded:

There is a newer GC called the Garbage First GC (G1GC). This newer GC is 
designed to minimize pausing even more than CMS, and operate on large heaps. It 
works by dividing the heap into regions and predicting which regions contain 
the most reclaimable space. By collecting those regions first (garbage first), 
it can minimize pauses and operate on very large heaps.

Sounds great! Unfortunately, G1GC is still new, and fresh bugs are found 
routinely. These bugs are usually of the segfault variety, and will cause hard 
crashes. The Lucene test suite is brutal on GC algorithms, and it seems that 
G1GC hasn’t had the kinks worked out yet.

We would like to recommend G1GC someday, but for now, it is simply not stable 
enough to meet the demands of Elasticsearch and Lucene.
(http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/_don_8217_t_touch_these_settings.html)

In fact, the problems with G1GC can sometimes lead to index corruption, and are 
hard to reproduce. So better don't use...

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


 -Original Message-
 From: McKinley, James T [mailto:james.mckin...@cengage.com]
 Sent: Tuesday, January 27, 2015 8:58 PM
 To: java-user@lucene.apache.org
 Subject: RE: Lucene Version Upgrade (3-4) and Java JVM Versions(6-8)

  Why do you say not to use G1GC?  We are using Java 7 & G1GC with Lucene
 4.8.1 in production.  Thanks.

 Jim
 
 From: Uwe Schindler [u...@thetaphi.de]
 Sent: Tuesday, January 27, 2015 2:49 PM
 To: java-user@lucene.apache.org; 'kiwi clive'
 Subject: RE: Lucene Version Upgrade (3-4) and Java JVM Versions(6-8)

 Java 8 update 20 or later is also fine. At current time, always use latest 
 update
 release and you will be fine with Java 7 and Java 8. Don't use older releases
 and don't use G1 Garbage Collector.

 -
 Uwe Schindler
 H.-H.-Meier-Allee 63, D-28213 Bremen
 http://www.thetaphi.de
 eMail: u...@thetaphi.de


  -Original Message-
  From: kiwi clive [mailto:kiwi_cl...@yahoo.com.INVALID]
  Sent: Tuesday, January 27, 2015 8:03 PM
  To: java-user@lucene.apache.org
  Subject: Re: Lucene Version Upgrade (3-4) and Java JVM Versions(6-8)
 
  Hi Hoss,
  Many thanks for the information. This looks very encouraging as the
  Java7 bug I remember  was fixed and as far as I know, we should not be
  affected by the others.
  I'll put a few tests together and put my toe in the water :-) Clive
 
From: Chris Hostetter hossman_luc...@fucit.org
   To: java-user@lucene.apache.org java-user@lucene.apache.org; kiwi
  clive kiwi_cl...@yahoo.com
   Sent: Tuesday, January 27, 2015 4:01 PM
   Subject: Re: Lucene Version Upgrade (3-4) and Java JVM
  Versions(6-8)
 
 
 
 
  : I seem to remember reading that certain versions of lucene were
  : incompatible with some java versions although I cannot find anything
  to
  : verify this. As we have tens of thousands of large indexes,
  backwards
  : compatibility without the need to reindex on an upgrade is of prime
  : importance to us.
 
  All known JVM bugs affecting Lucene are listed here...
 
  https://wiki.apache.org/lucene-java/JavaBugs
 
 
  -Hoss
  http://www.lucidworks.com/
 

RE: ToChildBlockJoinQuery question

2015-01-22 Thread McKinley, James T
);
    }
}

private void displayCreators(IndexReader reader, IndexSearcher searcher, TopDocs worksDocs) throws IOException {
    for (int i = 0; i < worksDocs.scoreDocs.length; i++) {
        String agdn = reader.document(worksDocs.scoreDocs[i].doc).get("AGDN");
        String agty = reader.document(worksDocs.scoreDocs[i].doc).get("AGTY");
        String nt = reader.document(worksDocs.scoreDocs[i].doc).get("NT");
        String poc = reader.document(worksDocs.scoreDocs[i].doc).get("POC");
        System.out.println("\tby: " + agdn + " - " + agty + ", " + nt + ", " + poc);
    }
}

When I try to use ToParentBlockJoinQuery I don't get any results either, and it
is not what I really want anyway; I want the child documents limited by the
parent documents.

ToChildBlockJoinQuery almost gives me what I want, but I really need to be able
to filter the child docs returned as well as the parents from which they came.
If you (or anybody) still think I'm doing it wrong, please let me know.  If I
should file a bug report, also let me know; I have a small index I can
provide if it is useful.  Thanks again for your help.

Jim


From: Gregory Dearing [gregdear...@gmail.com]
Sent: Wednesday, January 21, 2015 6:59 PM
To: java-user@lucene.apache.org
Subject: Re: ToChildBlockJoinQuery question

Jim,

I think you hit the nail on the head... that's not what BlockJoinQueries do.

If you're wanting to search for children and join to their parents... then
use ToParentBlockJoinQuery, with a query that matches the set of children
and a filter that matches the set of parents.

If you're searching for parents, then joining to their children... then use
ToChildBlockJoinQuery, with a query that matches the set of parents and a
filter that matches the set of children.

When you add related documents to the index (via addDocuments), make sure that
children are added before their parents.

The reason all the above is necessary is that it makes it possible to have
a nested hierarchy of relationships (ie. Parents have Children, which have
Children of their own).  You need a query to indicate which part of the
hierarchy you're starting from, and a filter indicating which part of the
hierarchy you're joining to.

Also, you will always get an exception if your query and your filter both
match the same document.  A child can't be its own parent.

BlockJoin is a very powerful feature, but what it's really doing is
modelling relationships using an index that doesn't know what a
relationship is.  The relationships are determined by a combination of the
order in which you indexed the block and the format of your query.  This
disconnect can lead to some weird behavior if you're not absolutely sure how
it works.
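
To make the first case concrete, here's a minimal sketch (this assumes the
Lucene 4.x join API; the docType and title field names are made up for
illustration, not taken from any real schema):

Document work = new Document();
work.add(new StringField("docType", "work", Field.Store.NO));
work.add(new TextField("title", "The Voyages of Doctor Dolittle", Field.Store.YES));

Document creator = new Document();
creator.add(new StringField("docType", "creator", Field.Store.NO));

// Children first, parent last, in a single addDocuments() call:
writer.addDocuments(Arrays.asList(work, creator));

// A filter matching the set of parents:
Filter parentsFilter = new FixedBitSetCachingWrapperFilter(
        new QueryWrapperFilter(new TermQuery(new Term("docType", "creator"))));

// A query matching the set of children, joined up to their parents:
Query joined = new ToParentBlockJoinQuery(
        new TermQuery(new Term("title", "doctor")), parentsFilter, ScoreMode.Avg);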

Thanks,
Greg





On Wed, Jan 21, 2015 at 4:34 PM, McKinley, James T 
james.mckin...@cengage.com wrote:


 Am I understanding how this is supposed to work?  What I think I am (and
 should be) doing is providing a query and filter that specifies the parent
 docs, and the ToChildBlockJoinQuery should return me all the child docs for
 the resulting parent docs.  Is this correct?  The reason I think I'm not
 understanding is that I don't see why I need both a filter and a query to
 specify the parent docs when a single query or filter should suffice.  Am I
 misunderstanding what parentQuery and parentFilter mean? They both refer to
 parent docs, right?

 Jim




RE: ToChildBlockJoinQuery question

2015-01-22 Thread McKinley, James T
Hi Mike,

I guess, given the difficulty I've had getting the block join query to work,
it didn't occur to me to try combining it in a BooleanQuery. :P  Using the
BJQ in a BooleanQuery with other TermQuerys works fine and does exactly what
I wanted! Thanks very much for your help!

Jim

From: Michael Sokolov [msoko...@safaribooksonline.com]
Sent: Thursday, January 22, 2015 11:45 AM
To: java-user@lucene.apache.org
Subject: Re: ToChildBlockJoinQuery question

I think the idea is that you create a block join query that encapsulates
the join relation, and then you can create additional constraints in the
result document space. In the case of ToChildBJQ, the result documents
are child documents, so any additional query constraints will be applied
to child documents.  For example, you could create the following:

ToChildBlockJoinQuery bjq = jamesBJQ();
TermQuery tq = new TermQuery(new Term("title", "doctor"));
// BooleanQuery has no (Query, Query) constructor; add the clauses explicitly:
BooleanQuery bq = new BooleanQuery();
bq.add(bjq, Occur.MUST);
bq.add(tq, Occur.MUST);

bq would then match books with parent (i.e. author) restrictions defined
in jamesBJQ(), and child (i.e. book) restrictions defined by other queries
like tq (title:doctor).
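
For example (a sketch; searcher is assumed to be an IndexSearcher opened
over the block-indexed index):

TopDocs hits = searcher.search(bq, 10); // top 10 child (book) docs matching both clauses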

-Mike

On 1/22/15 11:27 AM, McKinley, James T wrote:
 Hi Greg,

 Thanks for describing how block join queries were intended to work.  Your
 description makes sense to me; however, according to the API docs:

 http://lucene.apache.org/core/4_8_0/join/org/apache/lucene/search/join/ToChildBlockJoinQuery.html

 and particularly the naming of the parameters, I don't think the API
 actually works as you described:

   ToChildBlockJoinQuery(Query parentQuery, Filter parentsFilter, boolean doScores)

 If the filter was intended to filter the child docs, I think it would be
 called childFilter, no?

 I think the use of the CachingWrapperFilter in the example I got from Mike
 McCandless' blog post was the real cause of the exception I was seeing (maybe
 things have changed internally since that post).  I finally noticed a mention
 of the FixedBitSetCachingWrapperFilter in the description of the
 ToChildBlockJoinQuery constructor in the API docs.  When I changed to using a
 filter produced by the FixedBitSetCachingWrapperFilter class, the
 IllegalStateException no longer occurs, I get the child docs using
 ToChildBlockJoinQuery with a parent doc filter and parent doc query, and the
 results look correctly limited by the parent constraints.  For example:

 ...
 Gub-Gub's Book: An Encyclopedia of Food (Fictional work), Fictional work, 
 119320101
   by: Lofting, Hugh - NP, American, Writer

 The Story of Doctor Dolittle, Being the History of His Peculiar Life at Home 
 and Astonishing Adventures in Foreign Parts (Novel), Novel, 119200101
   by: Lofting, Hugh - NP, American, Writer

 The Voyages of Doctor Dolittle (Novel), Novel, 119220101
   by: Lofting, Hugh - NP, American, Writer

 The Story of Doctor Dolittle (Novel), Novel, 119200101
   by: Lofting, Hugh - NP, American, Writer

 ...
 Mister Beers (Poem), Poem, null
   by: Lofting, Hugh - NP, American, Writer

 The Twilight of Magic (Novel), Novel, 119300101
   by: Lofting, Hugh - NP, American, Writer

 Picnic (Lofting, Hugh) (Poem), Poem, null
   by: Lofting, Hugh - NP, American, Writer

 The Impossible Patriotism Project (Picture story), Picture story, 120070101

 A Skeleton in God's Closet: A Novel (Novel), Novel, 119940101
   by: Maier, Paul Luther - NP, American, null

 Pontius Pilate (Novel), Novel, 119680101
   by: Maier, Paul Luther - NP, American, null

 ...
 Josephus: The Essential Writings (Collection), Collection, 119880101
   by: Maier, Paul Luther - NP, American, null

 She Said the Geese (Poem), Poem, null
   by: Lifshin, Lyn - NP, American, Poet

 She Said She Could See Music (Poem), Poem, null
   by: Lifshin, Lyn - NP, American, Poet
 ...

 However, I see no way to further limit the children as you describe.  If I
 use a query that matches the set of parents and a filter that matches the set
 of children as you suggest, I get no results back.  I think your description
 of how it should work makes complete sense, but that is not what I'm seeing
 when I try it.  Here's the code that produced the above output:

   private void runToChildBlockJoinQuery(String indexPath) throws IOException {
       FSDirectory dir = FSDirectory.open(new File(indexPath));
       IndexReader reader = DirectoryReader.open(dir);
       IndexSearcher searcher = new IndexSearcher(reader);

       TermQuery parentFilterQuery = new TermQuery(new Term("AGTY", "np"));
       BooleanQuery parentQuery = new BooleanQuery();
       parentQuery.add(new TermQuery(new Term("AGTY", "np")), Occur.MUST);
       parentQuery.add(new TermQuery(new Term("NT", "american")), Occur.MUST);

       Filter parentFilter = new FixedBitSetCachingWrapperFilter(
               new QueryWrapperFilter(parentFilterQuery));

       ToChildBlockJoinQuery

RE: ToChildBlockJoinQuery question

2015-01-21 Thread McKinley, James T
I tried making the parentFilterQuery and the parentQuery the same and still
got the exception.

Am I understanding how this is supposed to work?  What I think I am (and
should be) doing is providing a query and filter that specifies the parent
docs, and the ToChildBlockJoinQuery should return me all the child docs for
the resulting parent docs.  Is this correct?  The reason I think I'm not
understanding is that I don't see why I need both a filter and a query to
specify the parent docs when a single query or filter should suffice.  Am I
misunderstanding what parentQuery and parentFilter mean? They both refer to
parent docs, right?

I attempted to attach a small tar.gz file (< 1MB) to this message that
contained a 100-parent index (~10,000 docs total) that gives the exception
with my block join query, but the mailing list rejected my message.  If
there's a better place to send/upload this index, let me know and I surely
will.  Thanks again for any help.

Jim


From: Gregory Dearing [gregdear...@gmail.com]
Sent: Wednesday, January 21, 2015 1:01 PM
To: java-user@lucene.apache.org
Subject: Re: ToChildBlockJoinQuery question

James,

I haven't actually run your example, but I think the source problem is that
your source query (NT:American) is hitting documents that have no children.

The reason the exception is so weird is that one of your index segments
contains zero documents that match your filter.  Specifically, there's an
index segment containing docs matching NT:american, but with no documents
matching AGTY:np.

This will cause CachingWrapperFilter, which normally returns a FixedBitSet,
to instead return a generic Empty DocIdSet.  Which leads to the exception
from ToChildBlockJoinQuery.

The summary is, make sure that your source query only hits documents that
were actually added using 'addDocuments()'.  Since it looks like you're
extracting your block relationships from the existing index, that might
mean that you'll need to add some extra metadata to the newly created docs
instead of just cloning what already exists.
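
For example, something along these lines at block-indexing time might do it
(a sketch; the blockParent field name is made up for illustration):

// Tag the parent doc that closes each block, so the parent query/filter
// can only ever match documents that were indexed via addDocuments():
parentDoc.add(new StringField("blockParent", "true", Field.Store.NO));
writer.addDocuments(docs); // children first, tagged parent last

Filter parentsFilter = new FixedBitSetCachingWrapperFilter(
        new QueryWrapperFilter(new TermQuery(new Term("blockParent", "true"))));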

-Greg


On Wed, Jan 21, 2015 at 10:00 AM, McKinley, James T 
james.mckin...@cengage.com wrote:

 Hi,

 I'm attempting to use ToChildBlockJoinQuery in Lucene 4.8.1 by following
 Mike McCandless' blog post:


 http://blog.mikemccandless.com/2012/01/searching-relational-content-with.html

 I have a set of child documents which are named works and a set of parent
 documents which are named persons that are the creators of the named
 works.  The parent document has a nationality and the child document does
 not.  I want to query the children (named works) limiting by the
 nationality of the parent (named person).  I've indexed the documents as
 follows (I'm pulling the docs from an existing index):

 private void createNamedWorkIndex(String srcIndexPath, String destIndexPath) throws IOException {
     FSDirectory srcDir = FSDirectory.open(new File(srcIndexPath));
     FSDirectory destDir = FSDirectory.open(new File(destIndexPath));

     IndexReader reader = DirectoryReader.open(srcDir);

     Version version = Version.LUCENE_48;
     IndexWriterConfig conf = new IndexWriterConfig(version, new StandardTextAnalyzer(version));

     Set<String> crids = getCreatorIds(reader);

     String[] crida = crids.toArray(new String[crids.size()]);

     int numThreads = 24;
     ExecutorService executor = Executors.newFixedThreadPool(numThreads);

     int numCrids = crids.size();
     int batchSize = numCrids / numThreads;
     int remainder = numCrids % numThreads;

     System.out.println("Inserting work/creator blocks using " + numThreads + " threads...");
     try (IndexWriter writer = new IndexWriter(destDir, conf)) {
         for (int i = 0; i < numThreads; i++) {
             String[] cridRange;
             if (i == numThreads - 1) {
                 cridRange = Arrays.copyOfRange(crida, i * batchSize, (i + 1) * batchSize + remainder);
             } else {
                 cridRange = Arrays.copyOfRange(crida, i * batchSize, (i + 1) * batchSize);
             }
             String id = "" + ((char) ('A' + i));
             Runnable indexer = new IndexRunnable(id, reader, writer, new HashSet<String>(Arrays.asList(cridRange)));
             executor.execute(indexer);
         }
         executor.shutdown();
         executor.awaitTermination(2, TimeUnit.HOURS);
     } catch (Exception e) {
         executor.shutdownNow();
         throw new RuntimeException(e);
     } finally {
         reader.close

ToChildBlockJoinQuery question

2015-01-21 Thread McKinley, James T
Hi,

I'm attempting to use ToChildBlockJoinQuery in Lucene 4.8.1 by following Mike 
McCandless' blog post:

http://blog.mikemccandless.com/2012/01/searching-relational-content-with.html

I have a set of child documents which are named works and a set of parent 
documents which are named persons that are the creators of the named works.  
The parent document has a nationality and the child document does not.  I want 
to query the children (named works) limiting by the nationality of the parent 
(named person).  I've indexed the documents as follows (I'm pulling the docs 
from an existing index):

private void createNamedWorkIndex(String srcIndexPath, String destIndexPath) throws IOException {
    FSDirectory srcDir = FSDirectory.open(new File(srcIndexPath));
    FSDirectory destDir = FSDirectory.open(new File(destIndexPath));

    IndexReader reader = DirectoryReader.open(srcDir);

    Version version = Version.LUCENE_48;
    IndexWriterConfig conf = new IndexWriterConfig(version, new StandardTextAnalyzer(version));

    Set<String> crids = getCreatorIds(reader);

    String[] crida = crids.toArray(new String[crids.size()]);

    int numThreads = 24;
    ExecutorService executor = Executors.newFixedThreadPool(numThreads);

    int numCrids = crids.size();
    int batchSize = numCrids / numThreads;
    int remainder = numCrids % numThreads;

    System.out.println("Inserting work/creator blocks using " + numThreads + " threads...");
    try (IndexWriter writer = new IndexWriter(destDir, conf)) {
        for (int i = 0; i < numThreads; i++) {
            String[] cridRange;
            // copyOfRange's 'to' bound is exclusive, so no -1 is needed here:
            if (i == numThreads - 1) {
                cridRange = Arrays.copyOfRange(crida, i * batchSize, (i + 1) * batchSize + remainder);
            } else {
                cridRange = Arrays.copyOfRange(crida, i * batchSize, (i + 1) * batchSize);
            }
            String id = "" + ((char) ('A' + i));
            Runnable indexer = new IndexRunnable(id, reader, writer, new HashSet<String>(Arrays.asList(cridRange)));
            executor.execute(indexer);
        }
        executor.shutdown();
        executor.awaitTermination(2, TimeUnit.HOURS);
    } catch (Exception e) {
        executor.shutdownNow();
        throw new RuntimeException(e);
    } finally {
        reader.close();
        srcDir.close();
        destDir.close();
    }

    System.out.println("Done!");
}

public static class IndexRunnable implements Runnable {
    private String id;
    private IndexReader reader;
    private IndexWriter writer;
    private Set<String> crids;

    public IndexRunnable(String id, IndexReader reader, IndexWriter writer, Set<String> crids) {
        this.id = id;
        this.reader = reader;
        this.writer = writer;
        this.crids = crids;
    }

    @Override
    public void run() {
        IndexSearcher searcher = new IndexSearcher(reader);

        try {
            int count = 0;
            for (String crid : crids) {
                List<Document> docs = new ArrayList<Document>();

                BooleanQuery abidQuery = new BooleanQuery();
                abidQuery.add(new TermQuery(new Term("ABID", crid)), Occur.MUST);
                abidQuery.add(new TermQuery(new Term("AGPR", "true")), Occur.MUST);

                TermQuery cridQuery = new TermQuery(new Term("CRID", crid));

                TopDocs creatorDocs = searcher.search(abidQuery, Integer.MAX_VALUE);
                TopDocs workDocs = searcher.search(cridQuery, Integer.MAX_VALUE);

                for (int i = 0; i < workDocs.scoreDocs.length; i++) {
                    docs.add(reader.document(workDocs.scoreDocs[i].doc));
                }