Re: When zero offsets are not bad - a.k.a. multi-token synonyms yet again

2020-08-25 Thread Roman Chyla
Hi Mike,

Sorry for the delay, I was away last week. Now that I'm back to it,
my plan is to write a test for the WordDelimiterFilter and
pinpoint the problem; a rough sketch of what I have in mind follows.
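(A sketch of such a test using Lucene's test framework; the flags and the expected tokens are illustrative assumptions, not the actual ADS chain:)

```java
import java.io.StringReader;
import org.apache.lucene.analysis.BaseTokenStreamTestCase;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.miscellaneous.WordDelimiterGraphFilter;

public class TestWdfOffsets extends BaseTokenStreamTestCase {
  public void testSplitAndCatenate() throws Exception {
    Tokenizer tok = new WhitespaceTokenizer();
    tok.setReader(new StringReader("space-time"));
    // split on the hyphen and also emit the catenated form
    TokenStream ts = new WordDelimiterGraphFilter(tok,
        WordDelimiterGraphFilter.GENERATE_WORD_PARTS | WordDelimiterGraphFilter.CATENATE_ALL,
        null);
    // the graph filter should emit the catenated token first, at the same position as "space"
    assertTokenStreamContents(ts, new String[] {"spacetime", "space", "time"});
  }
}
```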

Cheers,

  Roman

On Thu, Aug 20, 2020 at 11:21 AM Michael McCandless
 wrote:
>
> Hi Roman,
>
> No need for anyone to be falling on swords here!  This is really complicated 
> stuff, no worries.  And I think we have a compelling plan to move forwards so 
> that we can index multi-token synonyms AND have 100% correct positional 
> queries at search time, thanks to Michael Gibney's cool approach on 
> https://issues.apache.org/jira/browse/LUCENE-4312.
>
> So it looks like WordDelimiterGraphFilter is producing buggy tokens (out-of-order
> offsets) here?
>
> Or are you running SynonymGraphFilter after WordDelimiterFilter?
>
> Looking at that failing example, it should have output that spacetime token
> immediately after the space token, not after the time token.
>
> Maybe use TokenStreamToDot to visualize what the heck token graph you are 
> getting ...
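(For reference, a minimal sketch of doing that with TokenStreamToDot from Lucene's test framework; the analyzer here is a stand-in for the actual chain:)

```java
import java.io.PrintWriter;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.TokenStreamToDot;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class DumpTokenGraph {
  public static void main(String[] args) throws Exception {
    String text = "MIT and anti de sitter space-time";
    Analyzer analyzer = new StandardAnalyzer(); // stand-in: substitute the real analysis chain
    try (TokenStream ts = analyzer.tokenStream("title", text)) {
      PrintWriter out = new PrintWriter(System.out);
      new TokenStreamToDot(text, ts, out).toDot(); // writes a Graphviz .dot description of the graph
      out.flush();
    }
  }
}
```

The resulting .dot output can be rendered with Graphviz (`dot -Tpng`) to see positions, posLength and offsets at a glance.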
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Tue, Aug 18, 2020 at 9:41 PM Roman Chyla  wrote:
>>
>> Hi Mike,
>>
>> I'm sorry, the problem has all along been related to a
>> word-delimiter filter factory. This is embarrassing, but I have to
>> admit it publicly and self-flagellate.
>>
>> A word-delimiter filter is used to split tokens; these are then used
>> to find multi-token synonyms (hence the connection). In my desire to
>> simplify, I omitted that detail when writing my first email.
>>
>> I went to generate the stack trace:
>>
>> ```
>> assertU(adoc("id", "603", "bibcode", "xx603",
>>     "title", "THE HUBBLE constant: a summary of the HUBBLE SPACE TELESCOPE program"));
>> ```
>>
>> stage:indexer term=xx603 pos=1 type=word offsetStart=0 offsetEnd=13
>> stage:indexer term=acr::the pos=1 type=ACRONYM offsetStart=0 offsetEnd=3
>> stage:indexer term=hubble pos=1 type=word offsetStart=4 offsetEnd=10
>> stage:indexer term=acr::hubble pos=0 type=ACRONYM offsetStart=4 offsetEnd=10
>> stage:indexer term=constant pos=1 type=word offsetStart=11 offsetEnd=20
>> stage:indexer term=summary pos=1 type=word offsetStart=23 offsetEnd=30
>> stage:indexer term=hubble pos=1 type=word offsetStart=38 offsetEnd=44
>> stage:indexer term=syn::hubble space telescope pos=0 type=SYNONYM
>> offsetStart=38 offsetEnd=60
>> stage:indexer term=syn::hst pos=0 type=SYNONYM offsetStart=38 offsetEnd=60
>> stage:indexer term=space pos=1 type=word offsetStart=45 offsetEnd=50
>> stage:indexer term=telescope pos=1 type=word offsetStart=51 offsetEnd=60
>> stage:indexer term=program pos=1 type=word offsetStart=61 offsetEnd=68
>>
>> that worked, only the next one failed:
>>
>> ```
>> assertU(adoc("id", "605", "bibcode", "xx604",
>>     "title", "MIT and anti de sitter space-time"));
>> ```
>>
>>
>> stage:indexer term=xx604 pos=1 type=word offsetStart=0 offsetEnd=13
>> stage:indexer term=mit pos=1 type=word offsetStart=0 offsetEnd=3
>> stage:indexer term=acr::mit pos=0 type=ACRONYM offsetStart=0 offsetEnd=3
>> stage:indexer term=syn::massachusetts institute of technology pos=0
>> type=SYNONYM offsetStart=0 offsetEnd=3
>> stage:indexer term=syn::mit pos=0 type=SYNONYM offsetStart=0 offsetEnd=3
>> stage:indexer term=anti pos=1 type=word offsetStart=8 offsetEnd=12
>> stage:indexer term=syn::ads pos=0 type=SYNONYM offsetStart=8 offsetEnd=28
>> stage:indexer term=syn::anti de sitter space pos=0 type=SYNONYM
>> offsetStart=8 offsetEnd=28
>> stage:indexer term=syn::antidesitter spacetime pos=0 type=SYNONYM
>> offsetStart=8 offsetEnd=28
>> stage:indexer term=de pos=1 type=word offsetStart=13 offsetEnd=15
>> stage:indexer term=sitter pos=1 type=word offsetStart=16 offsetEnd=22
>> stage:indexer term=space pos=1 type=word offsetStart=23 offsetEnd=28
>> stage:indexer term=time pos=1 type=word offsetStart=29 offsetEnd=33
>> stage:indexer term=spacetime pos=0 type=word offsetStart=23 offsetEnd=33
>>
>> ```
>> 325677 ERROR
>> (TEST-TestAdsabsTypeFulltextParsing.testNoSynChain-seed#[ADFAB495DA8F6F40])
>> [] o.a.s.h.RequestHandlerBase
>> org.apache.solr.common.SolrException: Exception writing document id
>> 605 to the index; possible analysis error: startOffset must be
>> non-negative, and endOffset must be >= startOffset, and offsets must
>> not go backwards startOffset=23,endOffset=33,lastStar

Re: When zero offsets are not bad - a.k.a. multi-token synonyms yet again

2020-08-18 Thread Roman Chyla
es(DirectUpdateHandler2.java:969)
at 
org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(DirectUpdateHandler2.java:341)
at 
org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:288)
at 
org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:235)
... 61 more
```

Embarrassingly Yours,

  Roman



On Mon, Aug 17, 2020 at 10:39 AM Michael McCandless
 wrote:
>
> Hi Roman,
>
> Can you share the full exception / stack trace that IndexWriter throws on 
> that one *'d token in your first example?  I thought IndexWriter checks 1) 
> startOffset >= last token's startOffset, and 2) endOffset >= startOffset for 
> the current token.
>
> But you seem to be hitting an exception due to endOffset check across tokens, 
> which I didn't remember/realize IW was enforcing.
>
> Could you share a small standalone test case showing the first example?  
> Maybe attach it to the issue 
> (http://issues.apache.org/jira/browse/LUCENE-8776)?
>
> Thanks,
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Fri, Aug 14, 2020 at 12:09 PM Roman Chyla  wrote:
>>
>> Hi Mike,
>>
>> Thanks for the question! And sorry for the delay, I didn't manage to
>> get to it yesterday. I have generated better output, marked with (*)
>> where it currently fails the first time, and also included one extra
>> case to illustrate the PositionLength attribute.
>>
>> assertU(adoc("id", "603", "bibcode", "xx603",
>> "title", "THE HUBBLE constant: a summary of the hubble space
>> telescope program"));
>>
>>
>> term=hubble posInc=2 posLen=1 type=word offsetStart=4 offsetEnd=10
>> term=acr::hubble posInc=0 posLen=1 type=ACRONYM offsetStart=4 offsetEnd=10
>> term=constant posInc=1 posLen=1 type=word offsetStart=11 offsetEnd=20
>> term=summary posInc=1 posLen=1 type=word offsetStart=23 offsetEnd=30
>> term=hubble posInc=1 posLen=1 type=word offsetStart=38 offsetEnd=44
>> term=syn::hubble space telescope posInc=0 posLen=3 type=SYNONYM
>> offsetStart=38 offsetEnd=60
>> term=syn::hst posInc=0 posLen=3 type=SYNONYM offsetStart=38 offsetEnd=60
>> term=acr::hst posInc=0 posLen=3 type=ACRONYM offsetStart=38 offsetEnd=60
>> * term=space posInc=1 posLen=1 type=word offsetStart=45 offsetEnd=50
>> term=telescope posInc=1 posLen=1 type=word offsetStart=51 offsetEnd=60
>> term=program posInc=1 posLen=1 type=word offsetStart=61 offsetEnd=68
>>
>> * - fails because of offsetEnd < lastToken.offsetEnd; if reordered
>> (the multi-token synonym emitted as the last token) it would fail as
>> well, because of the check for lastToken.beginOffset <
>> currentToken.beginOffset. Basically, any reordering would result in a
>> failure (unless offsets are trimmed).
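(A standalone repro along the lines Mike asks for below, using the test-framework Token/CannedTokenStream classes; ByteBuffersDirectory and the no-arg IndexWriterConfig assume Lucene 8.x:)

```java
import org.apache.lucene.analysis.CannedTokenStream;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class BackwardOffsetsRepro {
  public static void main(String[] args) throws Exception {
    Token time = new Token("time", 29, 33);
    Token spacetime = new Token("spacetime", 23, 33); // startOffset goes backwards
    spacetime.setPositionIncrement(0);
    try (Directory dir = new ByteBuffersDirectory();
        IndexWriter w = new IndexWriter(dir, new IndexWriterConfig())) {
      Document doc = new Document();
      doc.add(new TextField("title", new CannedTokenStream(time, spacetime)));
      w.addDocument(doc); // expected: IllegalArgumentException "offsets must not go backwards"
    }
  }
}
```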
>>
>>
>>
>> The following example has an additional twist because of `space-time`;
>> the tokenizer first splits the word and generates two new tokens --
>> those alternative tokens are then used to find synonyms (space ==
>> universe)
>>
>> assertU(adoc("id", "605", "bibcode", "xx604",
>> "title", "MIT and anti de sitter space-time"));
>>
>>
>> term=xx604 posInc=1 posLen=1 type=word offsetStart=0 offsetEnd=13
>> term=mit posInc=1 posLen=1 type=word offsetStart=0 offsetEnd=3
>> term=acr::mit posInc=0 posLen=1 type=ACRONYM offsetStart=0 offsetEnd=3
>> term=syn::massachusetts institute of technology posInc=0 posLen=1
>> type=SYNONYM offsetStart=0 offsetEnd=3
>> term=syn::mit posInc=0 posLen=1 type=SYNONYM offsetStart=0 offsetEnd=3
>> term=acr::mit posInc=0 posLen=1 type=ACRONYM offsetStart=0 offsetEnd=3
>> term=anti posInc=1 posLen=1 type=word offsetStart=8 offsetEnd=12
>> term=syn::ads posInc=0 posLen=4 type=SYNONYM offsetStart=8 offsetEnd=28
>> term=acr::ads posInc=0 posLen=4 type=ACRONYM offsetStart=8 offsetEnd=28
>> term=syn::anti de sitter space posInc=0 posLen=4 type=SYNONYM
>> offsetStart=8 offsetEnd=28
>> term=syn::antidesitter spacetime posInc=0 posLen=4 type=SYNONYM
>> offsetStart=8 offsetEnd=28
>> term=syn::antidesitter space posInc=0 posLen=4 type=SYNONYM
>> offsetStart=8 offsetEnd=28
>> * term=de posInc=1 posLen=1 type=word offsetStart=13 offsetEnd=15
>> term=sitter posInc=1 posLen=1 type=word offsetStart=16 offsetEnd=22
>> term=space posInc=1 posLen=1 type=word offsetStart=23 offsetEnd=28
>> term=syn::universe posInc=0 posLen=1 type=SYNONYM offsetStart=23 offsetEnd=28
>> term=time posInc=1 posLen=1 type=word offsetStart=29 offsetEnd=33
>> term=spacetime posInc=0 p

Re: When zero offsets are not bad - a.k.a. multi-token synonyms yet again

2020-08-14 Thread Roman Chyla
ile offsets are still correct.

This would, I think, affect not only highlighting, but also search
(which is, at least for us, more important). But I can imagine that in
more NLP-related domains, the ability to identify the source of a
transformation could be more than a highlighting problem.

Admittedly, most users would not care to notice, but it might be
important to some. Fundamentally, I think, the problem translates to
an inability to reconstruct the DAG (under certain circumstances)
because of the lost pieces of information.

~roman

On Wed, Aug 12, 2020 at 4:59 PM Michael McCandless
 wrote:
>
> Hi Roman,
>
> Sorry for the late reply!
>
> I think there remains substantial confusion about multi-token synonyms and 
> IW's enforcement of offsets.  It really is worth thoroughly 
> iterating/understanding your examples so we can get to the bottom of this.  
> It looks to me like it is possible to emit tokens whose offsets do not go
> backwards and that properly model your example synonyms, so I do not yet see 
> what the problem is.  Maybe I am being blind/tired ...
>
> What do you mean by pos=2, pos=0, etc.?  I think that is really the position 
> increment?  Can you re-do the examples with posInc instead?  (Alternatively, 
> you could keep "pos" but make it the absolute position, not the increment?).
>
> Could you also add posLength to each token?  This helps (me?) visualize the 
> resulting graph, even though IW does not enforce it today.
>
> Looking at your first example, "THE HUBBLE constant: a summary of the hubble 
> space telescope program", it looks to me like those tokens would all be 
> accepted by IW's checks as they are?  startOffset never goes backwards, and 
> for every token, endOffset >= startOffset.  Where in that first example does 
> IW throw an exception?  Maybe insert a "** IW fails here" under the 
> problematic token?  Or, maybe write a simple test case using e.g. 
> CannedTokenStream?
>
> Your second example should also be fine, and not at all weird, but could you 
> enumerate it into the specific tokens with posInc, posLength, start/end 
> offset, "** IW fails here", etc., so we have a concrete example to discuss?
>
> Lucene's TokenStreams are really serializing a directed acyclic graph (DAG), 
> in a specific order, one transition at a time.  Ironically/strangely, it is 
> similar to the graph that git history maintains, and how "git log" then 
> serializes that graph into an ordered series of transitions.  The simple int 
> position in Lucene's TokenStream corresponds to git's githashes, to uniquely 
> identify each "node", though, I do not think there is an analog in git to 
> Lucene's offsets.  Hmm, maybe a timestamp?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Thu, Aug 6, 2020 at 10:49 AM Roman Chyla  wrote:
>>
>> Hi Mike,
>>
>> Yes, they are not zero offsets - I was instinctively avoiding
>> "negative offsets"; but they are indeed backward offsets.
>>
>> Here is the token stream as produced by the analyzer chain indexing
>> "THE HUBBLE constant: a summary of the hubble space telescope program"
>>
>> term=hubble pos=2 type=word offsetStart=4 offsetEnd=10
>> term=acr::hubble pos=0 type=ACRONYM offsetStart=4 offsetEnd=10
>> term=constant pos=1 type=word offsetStart=11 offsetEnd=20
>> term=summary pos=1 type=word offsetStart=23 offsetEnd=30
>> term=hubble pos=1 type=word offsetStart=38 offsetEnd=44
>> term=syn::hubble space telescope pos=0 type=SYNONYM offsetStart=38 
>> offsetEnd=60
>> term=syn::hst pos=0 type=SYNONYM offsetStart=38 offsetEnd=60
>> term=acr::hst pos=0 type=ACRONYM offsetStart=38 offsetEnd=60
>> term=space pos=1 type=word offsetStart=45 offsetEnd=50
>> term=telescope pos=1 type=word offsetStart=51 offsetEnd=60
>> term=program pos=1 type=word offsetStart=61 offsetEnd=68
>>
>> Sometimes, we'll even have a situation when synonyms overlap: for
>> example "anti de sitter space time"
>>
>> "anti de sitter space time" -> "antidesitter space" (one token
>> spanning offsets 0-26; it gets emitted with the first token "anti"
>> right now)
>> "space time" -> "spacetime" (synonym 16-26)
>> "space" -> "universe" (25-26)
>>
>> Yes, weird, but useful if people want to search for `universe NEAR
>> anti` -- but another use case which would be prohibited by the "new"
>> rule.
>>
>> DefaultIndexingChain checks the new token's offsets against the last emitted
>> token, so I don't see a way to emit a multi-token synonym with
>> offsets span

Re: When zero offsets are not bad - a.k.a. multi-token synonyms yet again

2020-08-10 Thread Roman Chyla
oh,thanks! that saves everybody some time. I have commented in there,
pleading to be allowed to do something - if that proposal sounds even
little bit reasonable, please consider amplifying the signal

On Mon, Aug 10, 2020 at 4:22 PM David Smiley  wrote:
>
> There already is one: https://issues.apache.org/jira/browse/LUCENE-8776
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
>
>
> On Mon, Aug 10, 2020 at 1:30 PM Roman Chyla  wrote:
>>
>> I'll have to somehow find a solution for this situation; giving up
>> offsets seems like too big a price to pay. I see that overriding
>> DefaultIndexingChain is not exactly easy -- the only thing I can think
>> of is to just trick the classloader into giving it a different version
>> of the chain (praying this can be done without compromising security;
>> I have not followed JDK evolution for some time...) - aside from
>> forking lucene and editing that, which I decidedly don't want to do
>> (monkey-patching it, OK, I can live with that... :-))
>>
>> It *seems* to me that the original reason for the negative offset checks
>> stemmed from the fact that a negative vint could have been written (and
>> possibly a vlong too) - https://issues.apache.org/jira/browse/LUCENE-3738
>>
>> but the underlying issue and some of the patches seem to have been
>> addressing those problems; a much shorter version of the patch was
>> committed -- despite the perf results not being indicative (i.e. it
>> could have been good with the longer patch) -- but to really
>> understand it, one would have to spend more than 10 minutes reading the
>> comments
>>
>> Further to the point, I think negative offsets can be produced only on
>> the very first token, unless there is a bug in a filter (there was/is
>> a separate check for that in 6.x, and perhaps it is still there in 7.x).
>> That would be much less restrictive than the current condition, which
>> disallows all backward offsets. We never ran into index corruption
>> in Lucene 4-6.x, so I really wonder if the "forbid all backwards
>> offsets" approach might be too restrictive.
>>
>> Looks like I should create an issue...
>>
>> On Thu, Aug 6, 2020 at 11:28 AM Gus Heck  wrote:
>> >
>> > I've had a nearly identical experience to what Dave describes; I also
>> > chafe under this restriction.
>> >
>> > On Thu, Aug 6, 2020 at 11:07 AM David Smiley  wrote:
>> >>
>> >> I sympathize with your pain, Roman.
>> >>
>> >> It appears we can't really do index-time multi-word synonyms because of 
>> >> the offset ordering rule.  But it's not just synonyms, it's other forms 
>> >> of multi-token expansion.  Where I work, I've seen an interesting 
>> >> approach to mixed language text analysis in which a sophisticated 
>> >> Tokenizer effectively re-tokenizes an input multiple ways by producing a 
>> >> token stream that is a concatenation of different interpretations of the 
>> >> input.  On a Lucene upgrade, we had to "coarsen" the offsets to the point 
>> >> of having highlights that point to a whole sentence instead of the words 
>> >> in that sentence :-(.  I need to do something to fix this; I'm trying 
>> >> hard to resist modifying our Lucene fork for this constraint.  Maybe 
>> >> instead of concatenating, it might be interleaved / overlapped but the 
>> >> interpretations aren't necessarily aligned to make this possible without 
>> >> risking breaking position-sensitive queries.
>> >>
>> >> So... I'm not a fan of this constraint on offsets.
>> >>
>> >> ~ David Smiley
>> >> Apache Lucene/Solr Search Developer
>> >> http://www.linkedin.com/in/davidwsmiley
>> >>
>> >>
>> >> On Thu, Aug 6, 2020 at 10:49 AM Roman Chyla  wrote:
>> >>>
>> >>> Hi Mike,
>> >>>
>> >>> Yes, they are not zero offsets - I was instinctively avoiding
>> >>> "negative offsets"; but they are indeed backward offsets.
>> >>>
>> >>> Here is the token stream as produced by the analyzer chain indexing
>> >>> "THE HUBBLE constant: a summary of the hubble space telescope program"
>> >>>
>> >>> term=hubble pos=2 type=word offsetStart=4 offsetEnd=10
>> >>> term=acr::hubble pos=0 type=ACRONYM offsetStart=4 offsetEnd=10
>> >>> term=constant pos=1 type=word offsetStart=11 offsetEnd=20
>

Re: When zero offsets are not bad - a.k.a. multi-token synonyms yet again

2020-08-10 Thread Roman Chyla
I'll have to somehow find a solution for this situation; giving up
offsets seems like too big a price to pay. I see that overriding
DefaultIndexingChain is not exactly easy -- the only thing I can think
of is to just trick the classloader into giving it a different version
of the chain (praying this can be done without compromising security;
I have not followed JDK evolution for some time...) - aside from
forking lucene and editing that, which I decidedly don't want to do
(monkey-patching it, OK, I can live with that... :-))

It *seems* to me that the original reason for the negative offset checks
stemmed from the fact that a negative vint could have been written (and
possibly a vlong too) - https://issues.apache.org/jira/browse/LUCENE-3738

but the underlying issue and some of the patches seem to have been
addressing those problems; a much shorter version of the patch was
committed -- despite the perf results not being indicative (i.e. it
could have been good with the longer patch) -- but to really
understand it, one would have to spend more than 10 minutes reading the
comments

Further to the point, I think negative offsets can be produced only on
the very first token, unless there is a bug in a filter (there was/is
a separate check for that in 6.x, and perhaps it is still there in 7.x).
That would be much less restrictive than the current condition, which
disallows all backward offsets. We never ran into index corruption
in Lucene 4-6.x, so I really wonder if the "forbid all backwards
offsets" approach might be too restrictive.

Looks like I should create an issue...

On Thu, Aug 6, 2020 at 11:28 AM Gus Heck  wrote:
>
> I've had a nearly identical experience to what Dave describes; I also chafe
> under this restriction.
>
> On Thu, Aug 6, 2020 at 11:07 AM David Smiley  wrote:
>>
>> I sympathize with your pain, Roman.
>>
>> It appears we can't really do index-time multi-word synonyms because of the 
>> offset ordering rule.  But it's not just synonyms, it's other forms of 
>> multi-token expansion.  Where I work, I've seen an interesting approach to 
>> mixed language text analysis in which a sophisticated Tokenizer effectively 
>> re-tokenizes an input multiple ways by producing a token stream that is a 
>> concatenation of different interpretations of the input.  On a Lucene 
>> upgrade, we had to "coarsen" the offsets to the point of having highlights 
>> that point to a whole sentence instead of the words in that sentence :-(.  I 
>> need to do something to fix this; I'm trying hard to resist modifying our 
>> Lucene fork for this constraint.  Maybe instead of concatenating, it might 
>> be interleaved / overlapped but the interpretations aren't necessarily 
>> aligned to make this possible without risking breaking position-sensitive 
>> queries.
>>
>> So... I'm not a fan of this constraint on offsets.
>>
>> ~ David Smiley
>> Apache Lucene/Solr Search Developer
>> http://www.linkedin.com/in/davidwsmiley
>>
>>
>> On Thu, Aug 6, 2020 at 10:49 AM Roman Chyla  wrote:
>>>
>>> Hi Mike,
>>>
>>> Yes, they are not zero offsets - I was instinctively avoiding
>>> "negative offsets"; but they are indeed backward offsets.
>>>
>>> Here is the token stream as produced by the analyzer chain indexing
>>> "THE HUBBLE constant: a summary of the hubble space telescope program"
>>>
>>> term=hubble pos=2 type=word offsetStart=4 offsetEnd=10
>>> term=acr::hubble pos=0 type=ACRONYM offsetStart=4 offsetEnd=10
>>> term=constant pos=1 type=word offsetStart=11 offsetEnd=20
>>> term=summary pos=1 type=word offsetStart=23 offsetEnd=30
>>> term=hubble pos=1 type=word offsetStart=38 offsetEnd=44
>>> term=syn::hubble space telescope pos=0 type=SYNONYM offsetStart=38 
>>> offsetEnd=60
>>> term=syn::hst pos=0 type=SYNONYM offsetStart=38 offsetEnd=60
>>> term=acr::hst pos=0 type=ACRONYM offsetStart=38 offsetEnd=60
>>> term=space pos=1 type=word offsetStart=45 offsetEnd=50
>>> term=telescope pos=1 type=word offsetStart=51 offsetEnd=60
>>> term=program pos=1 type=word offsetStart=61 offsetEnd=68
>>>
>>> Sometimes, we'll even have a situation when synonyms overlap: for
>>> example "anti de sitter space time"
>>>
>>> "anti de sitter space time" -> "antidesitter space" (one token
>>> spanning offsets 0-26; it gets emitted with the first token "anti"
>>> right now)
>>> "space time" -> "spacetime" (synonym 16-26)
>>> "space" -> "universe" (25-26)
>>>
>>> Yes, weird, but useful if peop

Re: When zero offsets are not bad - a.k.a. multi-token synonyms yet again

2020-08-06 Thread Roman Chyla
Hi Mike,

Yes, they are not zero offsets - I was instinctively avoiding
"negative offsets"; but they are indeed backward offsets.

Here is the token stream as produced by the analyzer chain indexing
"THE HUBBLE constant: a summary of the hubble space telescope program"

term=hubble pos=2 type=word offsetStart=4 offsetEnd=10
term=acr::hubble pos=0 type=ACRONYM offsetStart=4 offsetEnd=10
term=constant pos=1 type=word offsetStart=11 offsetEnd=20
term=summary pos=1 type=word offsetStart=23 offsetEnd=30
term=hubble pos=1 type=word offsetStart=38 offsetEnd=44
term=syn::hubble space telescope pos=0 type=SYNONYM offsetStart=38 offsetEnd=60
term=syn::hst pos=0 type=SYNONYM offsetStart=38 offsetEnd=60
term=acr::hst pos=0 type=ACRONYM offsetStart=38 offsetEnd=60
term=space pos=1 type=word offsetStart=45 offsetEnd=50
term=telescope pos=1 type=word offsetStart=51 offsetEnd=60
term=program pos=1 type=word offsetStart=61 offsetEnd=68

Sometimes, we'll even have a situation when synonyms overlap: for
example "anti de sitter space time"

"anti de sitter space time" -> "antidesitter space" (one token
spanning offsets 0-26; it gets emitted with the first token "anti"
right now)
"space time" -> "spacetime" (synonym 16-26)
"space" -> "universe" (25-26)

Yes, weird, but useful if people want to search for `universe NEAR
anti` -- but another use case which would be prohibited by the "new"
rule.
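(For illustration, the kind of proximity query this layout enables; a sketch with the field name and slop made up:)

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public class UniverseNearAnti {
  static Query build() {
    // syn::universe is indexed at the position of "space", so it can be near "anti"
    SpanQuery universe = new SpanTermQuery(new Term("title", "syn::universe"));
    SpanQuery anti = new SpanTermQuery(new Term("title", "anti"));
    return new SpanNearQuery(new SpanQuery[] {universe, anti}, 5, false); // unordered, slop 5
  }
}
```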

DefaultIndexingChain checks the new token's offsets against the last emitted
token, so I don't see a way to emit a multi-token synonym with
offsets spanning multiple tokens if even one of those tokens was
already emitted. And the complement is equally true: if the multi-token is
emitted as the last of the group, it trips over `startOffset <
invertState.lastStartOffset`

https://github.com/apache/lucene-solr/blame/master/lucene/core/src/java/org/apache/lucene/index/DefaultIndexingChain.java#L915
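(The check in question, paraphrased; close to, but not verbatim, the actual DefaultIndexingChain source:)

```java
// paraphrased from DefaultIndexingChain.PerField.invert(); not the verbatim source
int startOffset = invertState.offset + offsetAttribute.startOffset();
int endOffset = invertState.offset + offsetAttribute.endOffset();
if (startOffset < invertState.lastStartOffset || endOffset < startOffset) {
  throw new IllegalArgumentException(
      "startOffset must be non-negative, and endOffset must be >= startOffset, and offsets "
          + "must not go backwards startOffset=" + startOffset + ",endOffset=" + endOffset
          + ",lastStartOffset=" + invertState.lastStartOffset);
}
invertState.lastStartOffset = startOffset;
```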


  -roman

On Thu, Aug 6, 2020 at 6:17 AM Michael McCandless
 wrote:
>
> Hi Roman,
>
> Hmm, this is all very tricky!
>
> First off, why do you call this "zero offsets"?  Isn't it "backwards offsets" 
> that your analysis chain is trying to produce?
>
> Second, in your first example, if you output the tokens in the right order, 
> they would not violate the "offsets do not go backwards" check in 
> IndexWriter?  I thought IndexWriter is just checking that the startOffset for 
> a token is not lower than the previous token's startOffset?  (And that the 
> token's endOffset is not lower than its startOffset).
>
> So I am confused why your first example is tripping up on IW's offset checks. 
>  Could you maybe redo the example, listing single token per line with the 
> start/end offsets they are producing?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Wed, Aug 5, 2020 at 6:41 PM Roman Chyla  wrote:
>>
>> Hello devs,
>>
>> I wanted to create an issue but the helpful message in red letters
>> reminded me to ask first.
>>
>> While porting from Lucene 6.x to 7.x I'm struggling with a change that
>> was introduced in LUCENE-7626
>> (https://issues.apache.org/jira/browse/LUCENE-7626)
>>
>> It is believed that zero-offset tokens are bad, bad - Mike McCandless
>> made the change, which made me automatically doubt myself. I must be
>> wrong; hell, I was living in sin for the past 5 years!
>>
>> Sadly, we have been indexing and searching large volumes of data
>> without any corruption in the index whatsoever, but also without this new
>> change:
>>
>> https://github.com/apache/lucene-solr/commit/64b86331c29d074fa7b257d65d3fda3b662bf96a#diff-cbdbb154cb6f3553edff2fcdb914a0c2L774
>>
>> With that change, our multi-token synonym house of cards is falling.
>>
>> Mike has this wonderful blogpost explaining troubles with multi-token 
>> synonyms:
>> http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html
>>
>> The recommended way to index multi-token synonyms appears to be this:
>> https://stackoverflow.com/questions/19927537/multi-word-synonyms-in-solr
>>
>> BUT, but! We don't want to place the multi-token synonym into the same
>> position as the other words. We want to preserve their positions! We
>> want to preserve information about offsets!
>>
>> Here is an example:
>>
>> * THE HUBBLE constant: a summary of the HUBBLE SPACE TELESCOPE program
>>
>> This is how it gets indexed:
>>
>> [(0, []),
>> (1, ['acr::hubble']),
>> (2, ['constant']),
>> (3, ['summary']),
>> (4, []),
>> (5, ['acr::hubble', 'syn::hst', 'syn::hubble space telescope', 'hubble']),
>> (6, ['a

When zero offsets are not bad - a.k.a. multi-token synonyms yet again

2020-08-05 Thread Roman Chyla
Hello devs,

I wanted to create an issue but the helpful message in red letters
reminded me to ask first.

While porting from Lucene 6.x to 7.x I'm struggling with a change that
was introduced in LUCENE-7626
(https://issues.apache.org/jira/browse/LUCENE-7626)

It is believed that zero-offset tokens are bad, bad - Mike McCandless
made the change, which made me automatically doubt myself. I must be
wrong; hell, I was living in sin for the past 5 years!

Sadly, we have been indexing and searching large volumes of data
without any corruption in the index whatsoever, but also without this new
change:

https://github.com/apache/lucene-solr/commit/64b86331c29d074fa7b257d65d3fda3b662bf96a#diff-cbdbb154cb6f3553edff2fcdb914a0c2L774

With that change, our multi-token synonym house of cards is falling.

Mike has this wonderful blogpost explaining troubles with multi-token synonyms:
http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html

The recommended way to index multi-token synonyms appears to be this:
https://stackoverflow.com/questions/19927537/multi-word-synonyms-in-solr
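(For contrast, a minimal sketch of that recommended chain, SynonymGraphFilter followed by FlattenGraphFilter; the classes are in lucene-analyzers-common, and the synonym mapping here is illustrative:)

```java
import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.FlattenGraphFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.synonym.SynonymGraphFilter;
import org.apache.lucene.analysis.synonym.SynonymMap;
import org.apache.lucene.util.CharsRef;
import org.apache.lucene.util.CharsRefBuilder;

public class RecommendedSynonymChain {
  static Analyzer build() throws IOException {
    SynonymMap.Builder builder = new SynonymMap.Builder(true);
    // hst -> hubble space telescope, keeping the original token
    builder.add(new CharsRef("hst"),
        SynonymMap.Builder.join(new String[] {"hubble", "space", "telescope"},
            new CharsRefBuilder()),
        true);
    final SynonymMap map = builder.build();
    return new Analyzer() {
      @Override
      protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new WhitespaceTokenizer();
        TokenStream sink = new SynonymGraphFilter(source, map, true);
        sink = new FlattenGraphFilter(sink); // squashes the graph so it can be indexed
        return new TokenStreamComponents(source, sink);
      }
    };
  }
}
```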

BUT, but! We don't want to place the multi-token synonym into the same
position as the other words. We want to preserve their positions! We
want to preserve information about offsets!

Here is an example:

* THE HUBBLE constant: a summary of the HUBBLE SPACE TELESCOPE program

This is how it gets indexed:

[(0, []),
(1, ['acr::hubble']),
(2, ['constant']),
(3, ['summary']),
(4, []),
(5, ['acr::hubble', 'syn::hst', 'syn::hubble space telescope', 'hubble']),
(6, ['acr::space', 'space']),
(7, ['acr::telescope', 'telescope']),
(8, ['program'])]

Notice position 5 - the multi-token synonym `syn::hubble space
telescope` token sits on the first token of the group
(emitted by Lucene's synonym filter). hst is another synonym; we also
index the word 'hubble' there.

If you were to search for the phrase "HST program", it will be found
because our search parser will search for ("HST ? ? program" | "Hubble
Space Telescope program")

It simply found that by looking at synonyms: HST -> Hubble Space Telescope
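(Roughly, in Lucene query terms; a sketch assuming a field named "title":)

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;

public class HstProgramQuery {
  static Query build() {
    // "HST ? ? program": explicit positions leave a two-token gap for "space telescope"
    PhraseQuery.Builder gapped = new PhraseQuery.Builder();
    gapped.add(new Term("title", "syn::hst"), 0);
    gapped.add(new Term("title", "program"), 3);

    // "Hubble Space Telescope program" as a plain phrase
    PhraseQuery expanded = new PhraseQuery("title", "hubble", "space", "telescope", "program");

    return new BooleanQuery.Builder()
        .add(gapped.build(), BooleanClause.Occur.SHOULD)
        .add(expanded, BooleanClause.Occur.SHOULD)
        .build();
  }
}
```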

And because of those funny 'syn::' prefixes, we don't suffer from the
other problem that Mike described -- an "hst space" phrase search will
NOT find this paper (and that is the correct behaviour)

But all of this is possible only because Lucene was indexing tokens
with offsets that can be lower than those of the last emitted token; for
example, 'hubble space telescope' will have offsets 21-45, and the next
emitted token "space" will have offsets 28-33

And it just works (lucene 6.x)

Here is another proof with the appropriate verbiage ("crazy"):

https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/test/org/apache/solr/analysis/TestAdsabsTypeFulltextParsing.java#L618

Zero offsets have been working wonderfully for us so far. And I
actually cannot imagine how it can work without them - i.e. without
the ability to emit a token stream with offsets that are lower than
those of the last seen token.

I haven't tried the SynonymFlatten filter, but because of this line in the
DefaultIndexingChain, I'm convinced the flatten filter is not going
to do what we need (as seen in the example above)

https://github.com/apache/lucene-solr/blame/master/lucene/core/src/java/org/apache/lucene/index/DefaultIndexingChain.java#L915

What would you say? Is it a bug, or is it not a bug but just some special
use case? If it is a special use case, what do we need to do? Plug in
our own indexing chain?

Thanks!

  -roman




[jira] [Updated] (LUCENE-7481) SpanPayloadCheckQuery is missing rewrite method

2016-10-06 Thread Roman Chyla (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Roman Chyla updated LUCENE-7481:

Description: 
If used with a wildcard query, the result is a failure saying: "Rewrite query 
first"

The SpanNearQuery has the rewrite method; however the SpanPayloadCheckQuery 
just returns the query itself. 

this works:

```
spanNear([vectrfield:ebyuugz, SpanMultiTermQueryWrapper(vectrfield:e*), 
SpanMultiTermQueryWrapper(vectrfield:m*), 
SpanMultiTermQueryWrapper(vectrfield:f*)], 0, true)
```

code to generate the query:

```
private Query getSpanQuery(String[] parts, int howMany, boolean truncate)
    throws UnsupportedEncodingException {
  SpanQuery[] clauses = new SpanQuery[howMany + 1];
  clauses[0] = new SpanTermQuery(new Term("vectrfield", parts[0])); // surname
  for (int i = 0; i < howMany; i++) {
    if (truncate) {
      SpanMultiTermQueryWrapper<WildcardQuery> q = new SpanMultiTermQueryWrapper<>(
          new WildcardQuery(new Term("vectrfield", parts[i + 1].substring(0, 1) + "*")));
      clauses[i + 1] = q;
    } else {
      clauses[i + 1] = new SpanTermQuery(new Term("vectrfield", parts[i + 1]));
    }
  }
  SpanNearQuery sq = new SpanNearQuery(clauses, 0, true); // match in order
  return sq;
}
```

and this fails:

```
spanPayCheck(spanNear([vectrfield:ebyuugz, 
SpanMultiTermQueryWrapper(vectrfield:e*), 
SpanMultiTermQueryWrapper(vectrfield:m*), 
SpanMultiTermQueryWrapper(vectrfield:f*)], 1, true), payloadRef: 0;1;2;3;)
```

each clause is made of:

```
new SpanMultiTermQueryWrapper<>(new WildcardQuery(
    new Term("vectrfield", parts[i + 1].substring(0, 1) + "*")));
```

It is a regression; the code was working well in SOLR4.x
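(For illustration, the kind of rewrite that appears to be missing; a sketch by analogy with other span wrappers, assuming fields named `match` and `payloadToMatch`, and not the committed fix:)

```java
// sketch only: rewrite the wrapped span query, by analogy with other span wrappers
@Override
public Query rewrite(IndexReader reader) throws IOException {
  SpanQuery rewritten = (SpanQuery) match.rewrite(reader);
  if (rewritten != match) {
    return new SpanPayloadCheckQuery(rewritten, payloadToMatch);
  }
  return this;
}
```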


  was:
If used with a wildcard query, the result is a failure saying: "Rewrite query 
first"

The SpanNearQuery has the rewrite method; however the SpanPayloadCheckQuery 
just returns the query itself. 

this works:

```
spanNear([vectrfield:ebyuugz, SpanMultiTermQueryWrapper(vectrfield:e*), 
SpanMultiTermQueryWrapper(vectrfield:m*), 
SpanMultiTermQueryWrapper(vectrfield:f*)], 0, true)
```

code to generate the query:

```
private Query getSpanQuery(String[] parts, int howMany, boolean truncate)
    throws UnsupportedEncodingException {
  SpanQuery[] clauses = new SpanQuery[howMany + 1];
  clauses[0] = new SpanTermQuery(new Term("vectrfield", parts[0])); // surname
  for (int i = 0; i < howMany; i++) {
    if (truncate) {
      SpanMultiTermQueryWrapper<WildcardQuery> q = new SpanMultiTermQueryWrapper<>(
          new WildcardQuery(new Term("vectrfield", parts[i + 1].substring(0, 1) + "*")));
      clauses[i + 1] = q;
    } else {
      clauses[i + 1] = new SpanTermQuery(new Term("vectrfield", parts[i + 1]));
    }
  }
  SpanNearQuery sq = new SpanNearQuery(clauses, 0, true); // match in order
  return sq;
}
```

and this fails:

{code:java}
spanPayCheck(spanNear([vectrfield:ebyuugz, 
SpanMultiTermQueryWrapper(vectrfield:e*), 
SpanMultiTermQueryWrapper(vectrfield:m*), 
SpanMultiTermQueryWrapper(vectrfield:f*)], 1, true), payloadRef: 0;1;2;3;)
{code}

each clause is made of:

```
new SpanMultiTermQueryWrapper<>(new WildcardQuery(
    new Term("vectrfield", parts[i + 1].substring(0, 1) + "*")));
```

It is a regression; the code was working well in SOLR4.x



> SpanPayloadCheckQuery is missing rewrite method
> ---
>
> Key: LUCENE-7481
> URL: https://issues.apache.org/jira/browse/LUCENE-7481
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 6.x
>Reporter: Roman Chyla
>
> If used with a wildcard query, the result is a failure saying: "Rewrite query 
> first"
> The SpanNearQuery has the rewrite method; however the SpanPayloadCheckQuery 
> just returns the query itself. 
> this works:
> ```
> spanNear([vectrfield:ebyuugz, SpanMultiTermQueryWrapper(vectrfield:e*), 
> SpanMultiTermQueryWrapper(vectrfield:m*), 
> SpanMultiTermQueryWrapper(vectrfield:f*)], 0, true)
> ```
> code to generate the query:
> ```
> private Query getSpanQuery(String[] parts, int howMany, boolean truncate) 
> throws UnsupportedEncodingException {
>   SpanQuery[] clauses = new SpanQuery[howMany+1];
>

[jira] [Updated] (LUCENE-7481) SpanPayloadCheckQuery is missing rewrite method

2016-10-06 Thread Roman Chyla (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Roman Chyla updated LUCENE-7481:

Description: 
If used with a wildcard query, the result is a failure saying: "Rewrite query 
first"

The SpanNearQuery has the rewrite method; however the SpanPayloadCheckQuery 
just returns the query itself. 

this works:

```
spanNear([vectrfield:ebyuugz, SpanMultiTermQueryWrapper(vectrfield:e*), 
SpanMultiTermQueryWrapper(vectrfield:m*), 
SpanMultiTermQueryWrapper(vectrfield:f*)], 0, true)
```

code to generate the query:

```
private Query getSpanQuery(String[] parts, int howMany, boolean truncate)
    throws UnsupportedEncodingException {
  SpanQuery[] clauses = new SpanQuery[howMany + 1];
  clauses[0] = new SpanTermQuery(new Term("vectrfield", parts[0])); // surname
  for (int i = 0; i < howMany; i++) {
    if (truncate) {
      SpanMultiTermQueryWrapper<WildcardQuery> q = new SpanMultiTermQueryWrapper<>(
          new WildcardQuery(new Term("vectrfield", parts[i + 1].substring(0, 1) + "*")));
      clauses[i + 1] = q;
    } else {
      clauses[i + 1] = new SpanTermQuery(new Term("vectrfield", parts[i + 1]));
    }
  }
  SpanNearQuery sq = new SpanNearQuery(clauses, 0, true); // match in order
  return sq;
}
```

and this fails:

{code:java}
spanPayCheck(spanNear([vectrfield:ebyuugz, 
SpanMultiTermQueryWrapper(vectrfield:e*), 
SpanMultiTermQueryWrapper(vectrfield:m*), 
SpanMultiTermQueryWrapper(vectrfield:f*)], 1, true), payloadRef: 0;1;2;3;)
{code}

each clause is made of:

```
new SpanMultiTermQueryWrapper<>(new WildcardQuery(
    new Term("vectrfield", parts[i + 1].substring(0, 1) + "*")));
```

It is a regression; the code was working well in SOLR4.x


  was:
If used with a wildcard query, the result is a failure saying: "Rewrite query 
first"

The SpanNearQuery has the rewrite method; however the SpanPayloadCheckQuery 
just returns the query itself. 

this works:

```
spanNear([vectrfield:ebyuugz, SpanMultiTermQueryWrapper(vectrfield:e*), 
SpanMultiTermQueryWrapper(vectrfield:m*), 
SpanMultiTermQueryWrapper(vectrfield:f*)], 0, true)
```

code to generate the query:

```
private Query getSpanQuery(String[] parts, int howMany, boolean truncate)
    throws UnsupportedEncodingException {
  SpanQuery[] clauses = new SpanQuery[howMany + 1];
  clauses[0] = new SpanTermQuery(new Term("vectrfield", parts[0])); // surname
  for (int i = 0; i < howMany; i++) {
    if (truncate) {
      SpanMultiTermQueryWrapper<WildcardQuery> q = new SpanMultiTermQueryWrapper<>(
          new WildcardQuery(new Term("vectrfield", parts[i + 1].substring(0, 1) + "*")));
      clauses[i + 1] = q;
    } else {
      clauses[i + 1] = new SpanTermQuery(new Term("vectrfield", parts[i + 1]));
    }
  }
  SpanNearQuery sq = new SpanNearQuery(clauses, 0, true); // match in order
  return sq;
}
```

and this fails:

```
spanPayCheck(spanNear([vectrfield:ebyuugz, 
SpanMultiTermQueryWrapper(vectrfield:e*), 
SpanMultiTermQueryWrapper(vectrfield:m*), 
SpanMultiTermQueryWrapper(vectrfield:f*)], 1, true), payloadRef: 0;1;2;3;)
```

each clause is made of:

```
new SpanMultiTermQueryWrapper<>(new WildcardQuery(
    new Term("vectrfield", parts[i + 1].substring(0, 1) + "*")));
```

It is a regression; the code was working well in SOLR4.x



> SpanPayloadCheckQuery is missing rewrite method
> ---
>
> Key: LUCENE-7481
> URL: https://issues.apache.org/jira/browse/LUCENE-7481
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 6.x
>Reporter: Roman Chyla
>
> If used with a wildcard query, the result is a failure saying: "Rewrite query 
> first"
> The SpanNearQuery has the rewrite method; however the SpanPayloadCheckQuery 
> just returns the query itself. 
> this works:
> ```
> spanNear([vectrfield:ebyuugz, SpanMultiTermQueryWrapper(vectrfield:e*), 
> SpanMultiTermQueryWrapper(vectrfield:m*), 
> SpanMultiTermQueryWrapper(vectrfield:f*)], 0, true)
> ```
> code to generate the query:
> ```
> private Query getSpanQuery(String[] parts, int howMany, boolean truncate) 
> throws UnsupportedEncodingException {
>   SpanQuery[] clauses = new SpanQuery[howMany+1];
>

[jira] [Created] (LUCENE-7481) SpanPayloadCheckQuery is missing rewrite method

2016-10-06 Thread Roman Chyla (JIRA)
Roman Chyla created LUCENE-7481:
---

 Summary: SpanPayloadCheckQuery is missing rewrite method
 Key: LUCENE-7481
 URL: https://issues.apache.org/jira/browse/LUCENE-7481
 Project: Lucene - Core
  Issue Type: Bug
Affects Versions: 6.x
Reporter: Roman Chyla


If used with a wildcard query, the result is a failure saying: "Rewrite query 
first"

The SpanNearQuery has the rewrite method; however the SpanPayloadCheckQuery 
just returns the query itself. 

this works:

```
spanNear([vectrfield:ebyuugz, SpanMultiTermQueryWrapper(vectrfield:e*), 
SpanMultiTermQueryWrapper(vectrfield:m*), 
SpanMultiTermQueryWrapper(vectrfield:f*)], 0, true)
```

code to generate the query:

```
private Query getSpanQuery(String[] parts, int howMany, boolean truncate)
    throws UnsupportedEncodingException {
  SpanQuery[] clauses = new SpanQuery[howMany + 1];
  clauses[0] = new SpanTermQuery(new Term("vectrfield", parts[0])); // surname
  for (int i = 0; i < howMany; i++) {
    if (truncate) {
      SpanMultiTermQueryWrapper<WildcardQuery> q = new SpanMultiTermQueryWrapper<>(
          new WildcardQuery(new Term("vectrfield", parts[i + 1].substring(0, 1) + "*")));
      clauses[i + 1] = q;
    } else {
      clauses[i + 1] = new SpanTermQuery(new Term("vectrfield", parts[i + 1]));
    }
  }
  SpanNearQuery sq = new SpanNearQuery(clauses, 0, true); // match in order
  return sq;
}
```

and this fails:

```
spanPayCheck(spanNear([vectrfield:ebyuugz, 
SpanMultiTermQueryWrapper(vectrfield:e*), 
SpanMultiTermQueryWrapper(vectrfield:m*), 
SpanMultiTermQueryWrapper(vectrfield:f*)], 1, true), payloadRef: 0;1;2;3;)
```

each clause is made of:

```
new SpanMultiTermQueryWrapper<>(new WildcardQuery(
    new Term("vectrfield", parts[i + 1].substring(0, 1) + "*")));
```

It is a regression; the code was working well in SOLR4.x







[jira] [Commented] (SOLR-6468) Regression: StopFilterFactory doesn't work properly without enablePositionIncrements="false"

2016-09-22 Thread Roman Chyla (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-6468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15514785#comment-15514785
 ] 

Roman Chyla commented on SOLR-6468:
---

Ha! :-)
I've found my own comment above; 2 years later I'm facing this situation again.
I completely forgot (and truth be told: I preferred running old Solr 4.x).

This is how the new solr sees things:

A 350-MHz GBT Survey of 50 Faint Fermi γ ray Sources for Radio Millisecond 
Pulsars

is indexed as
```
null_1
1   :350|350mhz
2   :mhz|syn::mhz
3   :acr::gbt|gbt|syn::gbt|syn::green bank telescope
4   :survey|syn::survey
null_1
6   :50
```

the 1st and 5th positions are gaps - so the search for "350-MHz GBT Survey of 50
Faint" will fail, because 'of' is a stopword and the stop filter will always
increment the position (what's the purpose of a stop filter if it is leaving
gaps?)
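(A tiny sketch of the behaviour, assuming recent Lucene class locations; the position increment of the token after the stopword becomes 2, which is exactly the gap:)

```java
import java.io.StringReader;
import java.util.Arrays;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

public class StopGapDemo {
  public static void main(String[] args) throws Exception {
    WhitespaceTokenizer tok = new WhitespaceTokenizer();
    tok.setReader(new StringReader("survey of 50"));
    TokenStream ts = new StopFilter(tok, new CharArraySet(Arrays.asList("of"), true));
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    PositionIncrementAttribute posInc = ts.addAttribute(PositionIncrementAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      System.out.println(term + " posInc=" + posInc.getPositionIncrement());
    }
    ts.end();
    ts.close();
    // prints: survey posInc=1, then 50 posInc=2 -- the gap where "of" used to be
  }
}
```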

anyway, the solution with CharFilterFactory cannot work for me; I have to do
this:
 
 1. search for synonyms (they can contain stopwords)
 2. remove stopwords
 3. search for other synonyms (that don't have stopwords)

I'm afraid real life is a little bit more complex than it seems; but
there is a logic to your choices, Solr devs, and I'm afraid I can agree with you.
People who understand the *why* will make it work again as it *should*. Others 
will happily keep using the 'simplified' version.

> Regression: StopFilterFactory doesn't work properly without 
> enablePositionIncrements="false"
> 
>
> Key: SOLR-6468
> URL: https://issues.apache.org/jira/browse/SOLR-6468
> Project: Solr
>  Issue Type: Bug
>Affects Versions: 4.8.1, 4.9
>Reporter: Alexander S.
>
> Setup:
> * Schema version is 1.5
> * Field config:
> {code}
> <fieldType name="words_ngram" class="solr.TextField" omitNorms="false"
>     autoGeneratePhraseQueries="true">
>   <analyzer>
>     <tokenizer class="solr.PatternTokenizerFactory" pattern="[^\w]+" />
>     <filter class="solr.StopFilterFactory" words="url_stopwords.txt" ignoreCase="true" />
>     <filter class="solr.LowerCaseFilterFactory" />
>   </analyzer>
> </fieldType>
> {code}
> * Stop words:
> {code}
> http 
> https 
> ftp 
> www
> {code}
> So very simple. In the index I have:
> * twitter.com/testuser
> All these queries do match:
> * twitter.com/testuser
> * com/testuser
> * testuser
> But none of these does:
> * https://twitter.com/testuser
> * https://www.twitter.com/testuser
> * www.twitter.com/testuser
> Debug output shows:
> "parsedquery_toString": "+(url_words_ngram:\"? twitter com testuser\")"
> But we need:
> "parsedquery_toString": "+(url_words_ngram:\"twitter com testuser\")"
> Complete debug outputs:
> * a valid search: 
> http://pastie.org/pastes/9500661/text?key=rgqj5ivlgsbk1jxsudx9za
> * an invalid search: 
> http://pastie.org/pastes/9500662/text?key=b4zlh2oaxtikd8jvo5xaww
> The complete discussion and explanation of the problem is here: 
> http://lucene.472066.n3.nabble.com/Help-with-StopFilterFactory-td4153839.html
> I didn't find a clear explanation of how we can upgrade Solr; there's no
> replacement or workaround for this, so this is not just a major change but
> a major disrespect to all existing Solr users who are using this feature.






[jira] [Commented] (SOLR-6468) Regression: StopFilterFactory doesn't work properly without enablePositionIncrements=false

2014-11-25 Thread Roman Chyla (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-6468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14225186#comment-14225186
 ] 

Roman Chyla commented on SOLR-6468:
---

I also find this change to be unfortunate. If this is just developers making
decisions for users, then it causes problems for users who really know why they
do need that feature: for phrase search that should ignore stopwords. But if
the underlying issue is something serious, with the indexer not being able to
work with the position, then it would be even weirder - and actually very bad
for many users. I don't really understand the benefits of this change. Any chance
to return to the original?

 Regression: StopFilterFactory doesn't work properly without 
 enablePositionIncrements=false
 

 Key: SOLR-6468
 URL: https://issues.apache.org/jira/browse/SOLR-6468
 Project: Solr
  Issue Type: Bug
Affects Versions: 4.8.1, 4.9
Reporter: Alexander S.

 Setup:
 * Schema version is 1.5
 * Field config:
 {code}
 <fieldType name="words_ngram" class="solr.TextField" omitNorms="false"
     autoGeneratePhraseQueries="true">
   <analyzer>
     <tokenizer class="solr.PatternTokenizerFactory" pattern="[^\w]+" />
     <filter class="solr.StopFilterFactory" words="url_stopwords.txt" ignoreCase="true" />
     <filter class="solr.LowerCaseFilterFactory" />
   </analyzer>
 </fieldType>
 {code}
 * Stop words:
 {code}
 http 
 https 
 ftp 
 www
 {code}
 So very simple. In the index I have:
 * twitter.com/testuser
 All these queries do match:
 * twitter.com/testuser
 * com/testuser
 * testuser
 But none of these does:
 * https://twitter.com/testuser
 * https://www.twitter.com/testuser
 * www.twitter.com/testuser
 Debug output shows:
 parsedquery_toString: +(url_words_ngram:\? twitter com testuser\)
 But we need:
 parsedquery_toString: +(url_words_ngram:\twitter com testuser\)
 Complete debug outputs:
 * a valid search: 
 http://pastie.org/pastes/9500661/text?key=rgqj5ivlgsbk1jxsudx9za
 * an invalid search: 
 http://pastie.org/pastes/9500662/text?key=b4zlh2oaxtikd8jvo5xaww
 The complete discussion and explanation of the problem is here: 
 http://lucene.472066.n3.nabble.com/Help-with-StopFilterFactory-td4153839.html
 I didn't find a clear explanation of how we can upgrade Solr; there's no
 replacement or workaround for this, so this is not just a major change but
 a major disrespect to all existing Solr users who are using this feature.






Query parsing - what's new?

2013-10-01 Thread Roman Chyla
Hi guys,

Could somebody please take a look at LUCENE-5014 and comment on it? That JIRA
ticket proposes a new way to build query parsers:

https://issues.apache.org/jira/browse/LUCENE-5014

The thing is: the new code has been lying there for about 6 months, and I
don't know whether it is because people don't have time to actually look
at it, or because it is a bad solution, or anything else... I don't want to
assume anything at this point, but your input would be much appreciated. I
know you are busy and I understand that parsers are not as exciting as
cloud etc., but at the same time I do NOT understand how Lucene can live so
long with 'that' standard query parser...

Thank you!

  roman


Re: Measuring SOLR performance

2013-07-31 Thread Roman Chyla
Hi Dmitry,
probably a mistake in the readme; try calling it with -q
/home/dmitry/projects/lab/solrjmeter/queries/demo/demo.queries

as for the base_url, I was testing it on Solr 4.0, where it tries contacting
/solr/admin/system - is it different for 4.3? I guess I should make it
configurable (it already is; the endpoint is set in check_options())

thanks

roman


On Wed, Jul 31, 2013 at 10:01 AM, Dmitry Kan solrexp...@gmail.com wrote:

 Ok, got the error fixed by modifying the base Solr URL in solrjmeter.py
 (added the core name after the /solr part).
 Next error is:

 WARNING: no test name(s) supplied nor found in:
 ['/home/dmitry/projects/lab/solrjmeter/demo/queries/demo.queries']

 It is a 'slow start with new tool' symptom I guess.. :)


 On Wed, Jul 31, 2013 at 4:39 PM, Dmitry Kan solrexp...@gmail.com wrote:

 Hi Roman,

 What  version and config of SOLR does the tool expect?

 Tried to run, but got:

 **ERROR**
  File "solrjmeter.py", line 1390, in <module>
    main(sys.argv)
  File "solrjmeter.py", line 1296, in main
    check_prerequisities(options)
  File "solrjmeter.py", line 351, in check_prerequisities
    error('Cannot contact: %s' % options.query_endpoint)
  File "solrjmeter.py", line 66, in error
    traceback.print_stack()
 Cannot contact: http://localhost:8983/solr


 complains about URL, clicking which leads properly to the admin page...
 solr 4.3.1, 2 cores shard

 Dmitry


 On Wed, Jul 31, 2013 at 3:59 AM, Roman Chyla roman.ch...@gmail.comwrote:

 Hello,

 I have been wanting some tools for measuring the performance of SOLR, similar
 to Mike McCandless' Lucene benchmark.

 so yet another monitor was born; it is described here:
 http://29min.wordpress.com/2013/07/31/measuring-solr-query-performance/

 I tested it on the problem of garbage collectors (see the blogs for
 details) and so far I can't conclude whether highly customized G1 is
 better
 than highly customized CMS, but I think interesting details can be seen
 there.

 Hope this helps someone, and of course, feel free to improve the tool and
 share!

 roman






Re: Measuring SOLR performance

2013-07-31 Thread Roman Chyla
On Wed, Jul 31, 2013 at 1:21 PM, Shawn Heisey s...@elyograg.org wrote:

 On 7/31/2013 10:21 AM, Roman Chyla wrote:

 Hi Dmitry,
 probably a mistake in the readme; try calling it with -q
 /home/dmitry/projects/lab/solrjmeter/queries/demo/demo.queries

 as for the base_url, I was testing it on Solr 4.0, where it tries
 contacting /solr/admin/system - is it different for 4.3? I guess I should
 make it configurable (it already is; the endpoint is set in
 check_options())


 /solr URLs that don't include a core name (like /solr/admin/system) will
 only work if you have a defaultCoreName attribute in your solr.xml file and
 its value refers to an existing core.  Behind the scenes, Solr just directs
 those queries to the default core.


thanks, so I should add a way to specify a core - or rather, I will make the
whole endpoint user-configurable



 If you use the new solr.xml format (required in trunk), then there is no
 defaultCoreName, so these URLs currently don't work at all.  I think this
 behavior is correct, but it's early days for this feature.  The default
 core name might get re-introduced.


and which URLs will work? /solr/admin/collection or /solr/collection/admin?
Can we assume the info handlers will be available under the collection
URL as well?




 Exceptions to the above rule include the CoreAdmin API, the Collections
 API, and the new admin info handler introduced in Solr 4.4 by SOLR-4943.

 In 4.5, SOLR-3633 will use the new info handler to allow the UI to work when
 there are no cores present.


hmm, ok, I guess I'm fine now; I'll worry about that later

roman


 Thanks,
 Shawn






Measuring SOLR performance

2013-07-30 Thread Roman Chyla
Hello,

I have been wanting some tools for measuring the performance of SOLR, similar
to Mike McCandless' Lucene benchmark.

so yet another monitor was born; it is described here:
http://29min.wordpress.com/2013/07/31/measuring-solr-query-performance/

I tested it on the problem of garbage collectors (see the blogs for
details) and so far I can't conclude whether highly customized G1 is better
than highly customized CMS, but I think interesting details can be seen
there.

Hope this helps someone, and of course, feel free to improve the tool and
share!

roman


Re: for those of you using gmail...

2013-07-17 Thread Roman Chyla
On Wed, Jul 17, 2013 at 10:26 AM, Michael McCandless 
luc...@mikemccandless.com wrote:

 Can you try this search in your gmail:

 from:jenk...@thetaphi.de regression build 6605

 And let me know if you get 1 or 0 results back?


0 results back



 I get 0 results back but I should get 1, I think.

 Furthermore, if I search for:

 from:jenk...@thetaphi.de regression

 I only get results up to Jul 2, even though there are many build
 failures after that.


I am getting many before Jul 2, even March and beyond

--roman




 It's as if on Jul 2 Google made "regression" an index-time-only
 stopword; "failed", "replication", "handler" also became stopwords (but
 apparently at different times).

 Frustrating ...

 Mike McCandless

 http://blog.mikemccandless.com





[jira] [Commented] (LUCENE-5014) ANTLR Lucene query parser

2013-07-03 Thread Roman Chyla (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13698981#comment-13698981
 ] 

Roman Chyla commented on LUCENE-5014:
-

Hi Erik, I'll add a Solr qparser plugin too. Thanks for reminding me.

 ANTLR Lucene query parser
 -

 Key: LUCENE-5014
 URL: https://issues.apache.org/jira/browse/LUCENE-5014
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/queryparser, modules/queryparser
Affects Versions: 4.3
 Environment: all
Reporter: Roman Chyla
  Labels: antlr, query, queryparser
 Attachments: LUCENE-5014.txt, LUCENE-5014.txt, LUCENE-5014.txt, 
 LUCENE-5014.txt


 I would like to propose a new way of building query parsers for Lucene.  
 Currently, most Lucene parsers are hard to extend because they are either 
 written in Java (ie. the SOLR query parser, or edismax) or the parsing logic 
 is 'married' with the query building logic (i.e. the standard lucene parser, 
 generated by JavaCC) - which makes any extension really hard.
 Few years back, Lucene got the contrib/modern query parser (later renamed to 
 'flexible'), yet that parser didn't become a star (it must be very confusing 
 for many users). However, that parsing framework is very powerful! And it is 
 a real pity that there aren't more parsers already using it - because it 
 allows us to add/extend/change almost any aspect of the query parsing. 
 So, if we combine ANTLR + queryparser.flexible, we can get very powerful 
 framework for building almost any query language one can think of. And I hope 
 this extension can become useful.
 The details:
  - every new query syntax is written in EBNF, it lives in separate files (and 
 can be tested/developed independently - using 'gunit')
  - ANTLR parser generates parsing code (and it can generate parsers in 
 several languages, the main target is Java, but it can also do Python - which 
 may be interesting for pylucene)
  - the parser generates AST (abstract syntax tree) which is consumed by a  
 'pipeline' of processors, users can easily modify this pipeline to add a 
 desired functionality
  - the new parser contains a few (very important) debugging functions; it can 
 print results of every stage of the build, generate AST's as graphical 
 charts; ant targets help to build/test/debug grammars
  - I've tried to reuse the existing queryparser.flexible components as much 
 as possible, only adding new processors when necessary
 Assumptions about the grammar:
  - every grammar must have one top parse rule called 'mainQ'
  - parsers must generate AST (Abstract Syntax Tree)
 The structure of the AST is left open, there are components which make 
 assumptions about the shape of the AST (i.e. that a MODIFIER is the parent of a 
 FIELD); however, users are free to choose/write different processors with 
 different assumptions about the AST shape.
 More documentation on how to use the parser can be seen here:
 http://29min.wordpress.com/category/antlrqueryparser/
 The parser was created more than a year ago and is used in production 
 (http://labs.adsabs.harvard.edu/adsabs/). Different dialects of query 
 languages (with proximity operators, functions, special logic, etc.) can be 
 seen here: 
 https://github.com/romanchyla/montysolr/tree/master/contrib/adsabs
 https://github.com/romanchyla/montysolr/tree/master/contrib/invenio




[jira] [Commented] (LUCENE-5014) ANTLR Lucene query parser

2013-07-03 Thread Roman Chyla (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13698985#comment-13698985
 ] 

Roman Chyla commented on LUCENE-5014:
-

Will it be OK to include the Solr parts in this ticket? Besides the JIRA name, 
that seems the best option to me.





[jira] [Commented] (LUCENE-5014) ANTLR Lucene query parser

2013-07-03 Thread Roman Chyla (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13699417#comment-13699417
 ] 

Roman Chyla commented on LUCENE-5014:
-

New addition: a Solr qparser plugin. 

It is unfortunately not as easy as one might think, because of various defaults - 
e.g. a user may want to specify a different defaultField, whether wildcards are 
allowed at the beginning, what the maximum range for proximity values is... 
some of which should live only in solrconfig.xml, and some also in query params. 

So here is a stab at it. It works, but may require more config options - there 
is also a new unittest. Only the Ivy mirrors decided not to work right now 
(ughhh), so I could not run the Solr unittests - I hope they pass. Lucene's 
'ant test' went fine. 

If somebody wants to try it in Solr, please make sure you have antlr-runtime.jar 
in your Solr libs, and this should go inside solrconfig.xml:

{code}
<queryParser name="lucene2" class="AqpLuceneQParserPlugin">
  <lst name="defaults">
    <str name="defaultField">text</str>
  </lst>
</queryParser>
{code}
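
For orientation, a plugin registered this way only has to hand the query string 
over to the ANTLR/flexible machinery. Against the stock Solr 4.x API it takes 
roughly this shape - a sketch only: the class name matches the config above, 
but the body is illustrative, not the actual patch:

{code}
import org.apache.lucene.search.Query;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.search.QParser;
import org.apache.solr.search.QParserPlugin;
import org.apache.solr.search.SyntaxError;

public class AqpLuceneQParserPlugin extends QParserPlugin {

  @Override
  public void init(NamedList args) {
    // the 'defaults' block from solrconfig.xml (e.g. defaultField) arrives here
  }

  @Override
  public QParser createParser(String qstr, SolrParams localParams,
      SolrParams params, SolrQueryRequest req) {
    return new QParser(qstr, localParams, params, req) {
      @Override
      public Query parse() throws SyntaxError {
        // hand qstr to the ANTLR grammar + queryparser.flexible pipeline;
        // the real entry point lives in the patch and is not reproduced here
        throw new SyntaxError("sketch only - delegate to the ANTLR pipeline");
      }
    };
  }
}
{code}

The real parse() would of course run the grammar and the processor pipeline 
instead of throwing.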






[jira] [Updated] (LUCENE-5014) ANTLR Lucene query parser

2013-07-03 Thread Roman Chyla (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Roman Chyla updated LUCENE-5014:


Attachment: LUCENE-5014.txt

Added the Solr qparser plugin





[jira] [Updated] (LUCENE-5014) ANTLR Lucene query parser

2013-06-28 Thread Roman Chyla (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Roman Chyla updated LUCENE-5014:


Attachment: LUCENE-5014.txt

The patch that *actually* contains the extended parser with NEAR operator 
support





[jira] [Commented] (LUCENE-5014) ANTLR Lucene query parser

2013-06-27 Thread Roman Chyla (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13695149#comment-13695149
 ] 

Roman Chyla commented on LUCENE-5014:
-

Adding an example: the standard Lucene grammar extended with NEAR operators (as 
discussed above).

This should illustrate how easy it is to extend/modify/add a new query dialect. 
Handling of NEAR operators is not at all trivial, so I hope you will have some 
fun realizing it can be done in two lines ;)


{code}
setGrammarName("ExtendedLuceneGrammar");
((AqpQueryTreeBuilder) qp.getQueryBuilder())
    .setBuilder(AqpNearQueryNode.class, new AqpNearQueryNodeBuilder());
{code}

Have a look at TestAqpExtendedLGSimple
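
To give a flavour of what that setBuilder call wires in, here is roughly the 
shape of such a builder - a sketch against the stock queryparser.flexible API; 
AqpNearQueryNode and the way it carries its slop come from the patch, so the 
fixed slop below is an assumption:

{code}
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.queryparser.flexible.core.QueryNodeException;
import org.apache.lucene.queryparser.flexible.core.builders.QueryBuilder;
import org.apache.lucene.queryparser.flexible.core.builders.QueryTreeBuilder;
import org.apache.lucene.queryparser.flexible.core.nodes.QueryNode;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;

public class AqpNearQueryNodeBuilder implements QueryBuilder {

  @Override
  public SpanNearQuery build(QueryNode node) throws QueryNodeException {
    // children were already built bottom-up by QueryTreeBuilder and tagged
    List<SpanQuery> clauses = new ArrayList<SpanQuery>();
    for (QueryNode child : node.getChildren()) {
      clauses.add((SpanQuery) child
          .getTag(QueryTreeBuilder.QUERY_TREE_BUILDER_TAGID));
    }
    int slop = 5; // the real patch reads the slop off the NEAR/x token (assumption)
    return new SpanNearQuery(clauses.toArray(new SpanQuery[clauses.size()]),
        slop, true);
  }
}
{code}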





[jira] [Updated] (LUCENE-5014) ANTLR Lucene query parser

2013-06-27 Thread Roman Chyla (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Roman Chyla updated LUCENE-5014:


Attachment: LUCENE-5014.txt

The same patch + the Lucene grammar extended with the NEAR/x operator






[jira] [Comment Edited] (LUCENE-5014) ANTLR Lucene query parser

2013-05-27 Thread Roman Chyla (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13667908#comment-13667908
 ] 

Roman Chyla edited comment on LUCENE-5014 at 5/27/13 7:04 PM:
--

Hi David,
In practical terms ANTLR can do exactly the same things as PEG (i.e. lookahead, 
backtracking, memoization) - see 
http://stackoverflow.com/questions/8816759/ll-versus-peg-parsers-what-is-the-difference

But it is also capable of doing more than PEG (i.e. better error recovery - a 
PEG parser needs to parse the whole input before it discovers an error, and 
then the error recovery is not the same thing).

PEGs can be easier *especially* because of the first-choice operator; in fact 
at times I wished that ANTLR just chose the first available option (well, it 
does, but it reports an error, and I didn't want to have a grammar with errors). 
So, in the CFG/ANTLR world, ambiguity is solved using syntactic predicates 
(lookahead) -- so far, this has been theoretical; here are a few more points:

Grammar vs code
===

I looked at the presentation and the parser does encode the operator precedence, 
however there it is spread across several screens of Java code; I find the 
following much more readable:

{code}
mainQ : 
  clauseOr+ EOF
  ;
  
clauseOr
  : clauseAnd (or clauseAnd )*
  ;

clauseAnd
  : clauseNot  (and clauseNot)*
  ; 
{code}
  
It is essentially the same thing, but it is independent of the Java code and I 
can see it in a few lines - and extend it by adding a few more lines. The patch 
I wrote makes the handling of the separate grammar and the generated code 
seamless. So two of the three advantages of PEG over ANTLR disappear.


Syntax vs semantics (business logic)
====================================

The example from the presentation needs to be much more involved if it is to be 
used in real life. Consider this query:

{noformat}
dog NEAR cat
{noformat}

This is going to work only in the simplest case, where each term is a single 
TermQuery. Yet if there were a synonym expansion (where it would go inside the 
PEG parser is one question), the parser would need to *rewrite* the query 

something like:

{noformat}
(dog|canin) NEAR cat --> (dog NEAR cat) OR (canin NEAR cat)
{noformat}
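
Expressed with Lucene's span queries, the two sides of that rewrite would be 
built roughly like this (a sketch; the field name "f" and slop 5 are made-up 
values):

{code}
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanOrQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

SpanQuery dog = new SpanTermQuery(new Term("f", "dog"));
SpanQuery canin = new SpanTermQuery(new Term("f", "canin"));
SpanQuery cat = new SpanTermQuery(new Term("f", "cat"));
int slop = 5;

// left side: (dog|canin) NEAR cat kept as a single span query
SpanQuery left = new SpanNearQuery(
    new SpanQuery[] { new SpanOrQuery(dog, canin), cat }, slop, true);

// right side: (dog NEAR cat) OR (canin NEAR cat)
BooleanQuery right = new BooleanQuery();
right.add(new SpanNearQuery(new SpanQuery[] { dog, cat }, slop, true), Occur.SHOULD);
right.add(new SpanNearQuery(new SpanQuery[] { canin, cat }, slop, true), Occur.SHOULD);
{code}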

So there you get the 'spaghetti problem' - in the example presented, the logic 
that rewrites the query must reside in the same place as the query parsing. 
That is not an improvement IMO; it is the same thing as the old Lucene parsers 
written in JavaCC, which are very difficult to extend or debug.

I think I'll add a new grammar with the proximity operators so that you can see 
how easy it is to solve the same situation with ANTLR (but you will need to 
read the patch this time ;)). By the way, the patch is big because I included 
the HTML with SVG charts of the generated parse trees and one Excel file (that 
one helps in writing unittests for the grammar).


Developer vs user experience
============================

I think PEG definitely looks simpler to developers (in the presented example) 
and its main advantage is the first-choice operator. But since ANTLR can do the 
same and has a programming-language-independent grammar, it can do the same job. 
The difference may be in the maturity of the projects, the tools available 
(i.e. debuggers) - and of course the implementation (see the link above for 
details).

I can imagine that for PEG you can use your IDE of choice, while with ANTLR 
there is this 'pesky' level of abstraction - but there are tools that make life 
bearable, such as ANTLRWorks or the Eclipse ANTLR debugger (though I have not 
liked that one); there are grammar unittests, and I added ways to debug/view 
the grammar. If you apply the patch, you can try:

{code}
ant -f aqp-build.xml gunit
# edit StandardLuceneGrammar and save as 'mytestgrammar'
ant -f aqp-build.xml try-view -Dquery="foo NEAR bar" -Dgrammar=mytestgrammar
{code}


There may of course be more things to consider, but I believe the three issues 
above present some interesting vantage points.


[jira] [Created] (LUCENE-5014) ANTLR Lucene query parser

2013-05-22 Thread Roman Chyla (JIRA)
Roman Chyla created LUCENE-5014:
---

 Summary: ANTLR Lucene query parser
 Key: LUCENE-5014
 URL: https://issues.apache.org/jira/browse/LUCENE-5014
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/queryparser, modules/queryparser
Affects Versions: 4.3
 Environment: all
Reporter: Roman Chyla


I would like to propose a new way of building query parsers for Lucene. 
Currently, most Lucene parsers are hard to extend because they are either 
written in Java (i.e. the Solr query parser, or edismax) or the parsing logic is 
'married' to the query-building logic (i.e. the standard Lucene parser, 
generated by JavaCC) - which makes any extension really hard.


A few years back, Lucene got the contrib/modern query parser (later renamed to 
'flexible'), yet that parser didn't become a star (it must be very confusing 
for many users). However, that parsing framework is very powerful! And it is a 
real pity that there aren't more parsers already using it - because it allows 
us to add/extend/change almost any aspect of query parsing. 

So, if we combine ANTLR + queryparser.flexible, we can get a very powerful 
framework for building almost any query language one can think of. And I hope 
this extension can become useful.

The details:

 - every new query syntax is written in EBNF; it lives in separate files (and 
can be tested/developed independently - using 'gunit')
 - ANTLR generates the parsing code (and it can generate parsers in several 
languages; the main target is Java, but it can also do Python - which may be 
interesting for pylucene)
 - the parser generates an AST (abstract syntax tree) which is consumed by a 
'pipeline' of processors; users can easily modify this pipeline to add desired 
functionality (a minimal processor sketch follows below)
 - the new parser contains a few (very important) debugging functions; it can 
print the results of every stage of the build and generate ASTs as graphical 
charts; ant targets help to build/test/debug grammars
 - I've tried to reuse the existing queryparser.flexible components as much as 
possible, only adding new processors when necessary

Assumptions about the grammar:
 - every grammar must have one top parse rule called 'mainQ'
 - parsers must generate AST (Abstract Syntax Tree)

The structure of the AST is left open; there are components which make 
assumptions about the shape of the AST (i.e. that MODIFIER is the parent of a 
FIELD), however users are free to choose/write different processors with 
different assumptions about the AST shape.
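
As an illustration of the processor pipeline mentioned above, this is roughly 
the shape of a custom processor - a minimal sketch against the stock 
queryparser.flexible API; the class name is made up:

{code}
import java.util.List;

import org.apache.lucene.queryparser.flexible.core.QueryNodeException;
import org.apache.lucene.queryparser.flexible.core.nodes.QueryNode;
import org.apache.lucene.queryparser.flexible.core.processors.QueryNodeProcessorImpl;

public class MyCustomProcessor extends QueryNodeProcessorImpl {

  @Override
  protected QueryNode preProcessNode(QueryNode node) throws QueryNodeException {
    return node; // inspect/replace nodes on the way down the tree
  }

  @Override
  protected QueryNode postProcessNode(QueryNode node) throws QueryNodeException {
    return node; // rewrite nodes on the way back up, e.g. expand synonyms
  }

  @Override
  protected List<QueryNode> setChildrenOrder(List<QueryNode> children)
      throws QueryNodeException {
    return children; // no reordering
  }
}
{code}

Dropping it into an existing QueryNodeProcessorPipeline should then be a 
one-liner (the pipeline implements java.util.List), e.g. 
pipeline.add(new MyCustomProcessor()).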



More documentation on how to use the parser can be seen here:

http://29min.wordpress.com/category/antlrqueryparser/


The parser was created more than a year ago and is used in production 
(http://labs.adsabs.harvard.edu/adsabs/). Different dialects of query languages 
(with proximity operators, functions, special logic etc.) can be seen here: 

https://github.com/romanchyla/montysolr/tree/master/contrib/adsabs
https://github.com/romanchyla/montysolr/tree/master/contrib/invenio







[jira] [Updated] (LUCENE-5014) ANTLR Lucene query parser

2013-05-22 Thread Roman Chyla (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Roman Chyla updated LUCENE-5014:


Attachment: LUCENE-5014.txt

Patch without binary files (if possible, use the other patch)





[jira] [Updated] (LUCENE-5014) ANTLR Lucene query parser

2013-05-22 Thread Roman Chyla (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Roman Chyla updated LUCENE-5014:


Attachment: LUCENE-5014.txt

Includes binary files (i.e. one jar and an xls)

svn diff --force --diff-cmd /usr/bin/diff -x -au > LUCENE-5014.txt





Re: New query parser?

2013-05-22 Thread Roman Chyla
Hello,
The new JIRA issue has been created -
https://issues.apache.org/jira/browse/LUCENE-5014
Thank you for trying it,

roman



Re: New query parser?

2013-05-15 Thread Roman Chyla
Hi Jan,

Thanks for the thumbs up


On Tue, May 14, 2013 at 11:14 AM, Jan Høydahl jan@cominvent.com wrote:

 Hello :)

 I think it has been the intention of the dev community for a long time to
 start using the flex parser framework, and in this regard this contribution
 is much welcome as a kickstarter for that.
 I have not looked much at the code, but I hope it could be a starting
 point for writing future parsers in a less spaghetti way.

 One question. Say we want to add a new operator such as NEAR/N. Ideally
 this should be added in Lucene, then all the Solr QParsers extending the
 lucene flex parser would benefit from the same new operator. Would this be
 easily achieved with your code you think? We also have a ton of



To add a new operator is very simple on the syntax level -- i.e. when I want
the NEAR/x operator, I just change the ANTLR grammar, which produces the
appropriate abstract syntax tree. The flex parser then consumes this.

Yet, imagine the following query

dog NEAR/5 cat

if you are using synonyms, an analyzer could have expanded 'dog' with
synonyms, so it becomes something like

(dog | canin) NEAR/5 cat

and since Lucene cannot handle these queries, the flex builder must rewrite
them, effectively producing

SpanNear(SpanOr(dog | canin), SpanTerm(cat), 5)

but you could also argue, that a better way to handle this query is:

SpanNear(dog, cat, 5) OR SpanNear(canin, cat, 5)

If that is the case, then a different builder will have to be used -

Just an example where the syntax is relatively simple, but the semantics is the
hard part. But I believe the flex parser gives all the necessary tools to deal
with that and avoid the spaghetti problem.


--roman



 feature requests on the eDisMax parser for new kinds of query syntax
 support. Before we start implementing that on top of the
 already-hard-to-maintain eDismax code, we should think about
 re-implementing eDismax on top of flex, perhaps on top of Roman's contrib
 here?


btw: I am using edismax in one of my grammars -- i.e. users can type: query
AND edismax("foo OR (dog AND cat)") -- and the edismax(...) part will be parsed
by edismax. But I hit the problems there as well: it is not doing such a
nice job with operators, and of course it doesn't know how to handle
multi-token synonym expansion. I think it could be nicely extracted
into a flex processor and effectively become a plugin for a Solr parser
(now, it is a parser of its own, which makes it hard to extend).






  --
 Jan Høydahl, search solution architect
 Cominvent AS - www.cominvent.com






New query parser?

2013-05-14 Thread Roman Chyla
Hello World!

Following the recommended practice I'd like to let you know that I am about
to start porting our existing query parser into JIRA with the aim of making
it available to the Lucene/Solr community.

The query parser is built on top of the flexible query parser, but it
separates the parsing (ANTLR) and the query building - it allows for very
sophisticated custom logic and has self-introspecting methods, so one can
actually 'see' what is going on - I have had lots of FUN working with it
(which I consider to be a feature, not a shameless plug ;)).

Some write up is here:
http://29min.wordpress.com/category/antlrqueryparser/

You can see the source code at:
https://github.com/romanchyla/montysolr/tree/master/contrib/antlrqueryparser


If you think this project duplicates something or is even useless (I
hope not!) please let me know, stop me, say something...

Thank you!

  roman


[jira] [Created] (LUCENE-4679) LowercaseExpandedTermsQueryNodeProcessor changes regex queries

2013-01-10 Thread Roman Chyla (JIRA)
Roman Chyla created LUCENE-4679:
---

 Summary: LowercaseExpandedTermsQueryNodeProcessor changes regex 
queries
 Key: LUCENE-4679
 URL: https://issues.apache.org/jira/browse/LUCENE-4679
 Project: Lucene - Core
  Issue Type: Wish
Reporter: Roman Chyla
Priority: Trivial


This is really a very silly request, but could the lowercase processor 
'abstain' from changing regex queries? For example, \\W should stay uppercase, 
but it will be lowercased.
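
One way to get that behaviour (a sketch against the Lucene 4.x flexible-parser 
API, not the attached patch): subclass the processor and leave regex nodes 
alone. The class name here is made up.

{code}
import org.apache.lucene.queryparser.flexible.core.QueryNodeException;
import org.apache.lucene.queryparser.flexible.core.nodes.QueryNode;
import org.apache.lucene.queryparser.flexible.standard.nodes.RegexpQueryNode;
import org.apache.lucene.queryparser.flexible.standard.processors.LowercaseExpandedTermsQueryNodeProcessor;

public class RegexSafeLowercaseProcessor
    extends LowercaseExpandedTermsQueryNodeProcessor {

  @Override
  protected QueryNode postProcessNode(QueryNode node) throws QueryNodeException {
    if (node instanceof RegexpQueryNode) {
      return node; // leave regex queries (e.g. \\W) untouched
    }
    return super.postProcessNode(node);
  }
}
{code}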








[jira] [Updated] (LUCENE-4679) LowercaseExpandedTermsQueryNodeProcessor changes regex queries

2013-01-10 Thread Roman Chyla (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Roman Chyla updated LUCENE-4679:


Attachment: LUCENE-4679.patch





[jira] [Updated] (LUCENE-4679) LowercaseExpandedTermsQueryNodeProcessor changes regex queries

2013-01-10 Thread Roman Chyla (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Roman Chyla updated LUCENE-4679:


Description: 
This is really a very silly request, but could the lowercase processor 
'abstain' from changing regex queries? For example, \\W should stay 
uppercase, but it will be lowercased.





  was:
This is really a very silly request, but could the lowercase processor 
'abstain' from changing regex queries? For example, \\W should stay uppercase, 
but it will be lowercased.










[jira] [Updated] (LUCENE-4679) LowercaseExpandedTermsQueryNodeProcessor changes regex queries

2013-01-10 Thread Roman Chyla (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Roman Chyla updated LUCENE-4679:


Description: 
This is really a very silly request, but could the lowercase processor 
'abstain' from changing regex queries? For example, W should stay 
uppercase, but it is lowercased.





  was:
This is really a very silly request, but could the lowercase processor 
'abstain' from changing regex queries? For example, W should stay 
uppercase, but it will be lowercased.






 LowercaseExpandedTermsQueryNodeProcessor changes regex queries
 --

 Key: LUCENE-4679
 URL: https://issues.apache.org/jira/browse/LUCENE-4679
 Project: Lucene - Core
  Issue Type: Wish
Reporter: Roman Chyla
Priority: Trivial
 Attachments: LUCENE-4679.patch


 This is really a very silly request, but could the lowercase processor 
 'abstain' from changing regex queries? For example, W should stay 
 uppercase, but it is lowercased.




[jira] [Updated] (LUCENE-4499) Multi-word synonym filter (synonym expansion)

2012-12-04 Thread Roman Chyla (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Roman Chyla updated LUCENE-4499:


Attachment: LUCENE-4499.patch

A new patch, as the old version was extending the wrong class (which caused web 
tests to fail)

 Multi-word synonym filter (synonym expansion)
 -

 Key: LUCENE-4499
 URL: https://issues.apache.org/jira/browse/LUCENE-4499
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/other
Affects Versions: 4.1, 5.0
Reporter: Roman Chyla
Priority: Minor
  Labels: analysis, multi-word, synonyms
 Fix For: 5.0

 Attachments: LUCENE-4499.patch, LUCENE-4499.patch


 I apologize for bringing the multi-token synonym expansion up again. There is 
 an old, unresolved issue at LUCENE-1622 [1]
 While solving the problem for our needs [2], I discovered that the current 
 SolrSynonym parser (and the wonderful FTS) have almost everything to 
 satisfactorily handle both the query and index time synonym expansion. It 
 seems that people often need to use the synonym filter *slightly* differently 
 at indexing and query time.
 In our case, we must do different things during indexing and querying.
 Example sentence: Mirrors of the Hubble space telescope pointed at XA5
 This is what we need (comma marks position bump):
 indexing: mirrors,hubble|hubble space 
 telescope|hst,space,telescope,pointed,xa5|astroobject#5
 querying: +mirrors +(hubble space telescope | hst) +pointed 
 +(xa5|astroboject#5)
 This translates to the following needs:
   indexing time: 
 single-token synonyms => return only synonyms
 multi-token synonyms => return original tokens *AND* the synonyms
   query time:
 single-token: return only synonyms (but preserve case)
 multi-token: return only synonyms
  
 We need the original tokens for proximity queries: if we indexed 'hubble 
 space telescope'
 as one token, we could not search for 'hubble NEAR telescope'
 You may (not) be surprised, but Lucene already supports ALL of these 
 requirements. The patch is an attempt to state the problem differently. I am 
 not sure if it is the best option; however, it works perfectly for our needs 
 and it seems it could work for the general public too. Especially if the 
 SynonymFilterFactory had preconfigured sets of SynonymMapBuilders - and 
 people would just choose which situation applies to them. Please look at the unit test.
 links:
 [1] https://issues.apache.org/jira/browse/LUCENE-1622
 [2] http://labs.adsabs.harvard.edu/trac/ads-invenio/ticket/158
 [3] seems to have similar request: 
 http://lucene.472066.n3.nabble.com/Proposal-Full-support-for-multi-word-synonyms-at-query-time-td4000522.html
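
 For concreteness, a hedged sketch of the desired token streams from the 
 example above, written out as plain data (this is an illustration only, not 
 the patch itself):

```python
# Desired index-time stream for:
#   "Mirrors of the Hubble space telescope pointed at XA5"
# as (token, position_increment) pairs; increment 0 stacks a synonym
# on the previous token, which is what keeps proximity queries working.
index_time = [
    ("mirrors", 1),
    ("hubble", 1), ("hubble space telescope", 0), ("hst", 0),
    ("space", 1),
    ("telescope", 1),
    ("pointed", 1),
    ("xa5", 1), ("astroobject#5", 0),
]

# Desired query-time interpretation of the same text:
#   +mirrors +(hubble space telescope | hst) +pointed +(xa5 | astroobject#5)
```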




Re: pro coding style

2012-12-01 Thread Roman Chyla
On Fri, Nov 30, 2012 at 8:56 AM, Robert Muir rcm...@gmail.com wrote:



 On Fri, Nov 30, 2012 at 8:50 AM, Per Steffensen st...@designware.dk wrote:

 Robert Muir wrote:

  Is it really git? Because it's my understanding pull requests aren't
 actually a git thing but a github thing.

 The distinction is important.


  Actually I'm not sure. I have never used git outside github, but at least
  part of it has to be git and not github (I think) - or else I couldn't
  imagine how you get the advantages you get. Remember that when using git
  you actually run a repository on every developer's local machine. When
  you commit, you commit only to your local repository. You need to push
  in order to have it upstreamed (as they call it)


 Right, I'm positive this (pull requests) is github :)

  I just wanted to make this point: when we have discussions about using git
 instead of svn, I'm not sure it makes things easier on anyone, actually
 probably worse and more complex.

 Its the github workflow that contributors want (I would +1 some scheme
 that supports this!), but git by itself, is pretty unusable.

 Github is like a nice front-end to this mess.


This is like medicine to me! With all the craze about git (and we use it
for our main project and also for solr development) this just confirms my
three years of experience. Git is pain. Github is great (too bad there is git
behind it ;))

And now the problem of forks - with git the fork is a natural evil; git
just makes it established practice. But it still doesn't save us from the
(slow) process of incorporating new patches. While that is inevitable, and we
cannot be grateful enough to all the committers for their hard work (really,
thanks!), perhaps there is a way to make solr/lucene more sandbox friendly?

In our organization we are doing something similar (using SOLR as a
library); the automated build/deployment goes like this (a rough sketch
follows the list):

- checkout our sources
- download & build solr sources
- compile our code
- merge with solr & test
- deploy
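
A driver-script sketch of those steps (every URL, path, and build target
below is a hypothetical placeholder, not our actual setup):

```python
import subprocess

def run(*cmd):
    # Echo and execute one build step, failing fast on error.
    print("$", " ".join(cmd))
    subprocess.check_call(cmd)

run("svn", "checkout", "https://example.org/repos/ours/trunk", "ours")     # our sources
run("svn", "checkout",                                                     # solr sources
    "http://svn.apache.org/repos/asf/lucene/dev/trunk", "lucene-solr")
run("ant", "-f", "lucene-solr/solr/build.xml", "dist")                     # build solr
run("ant", "-f", "ours/build.xml", "compile")                              # our code
run("ant", "-f", "ours/build.xml", "test")                                 # merge & test
run("ant", "-f", "ours/build.xml", "deploy")                               # deploy
```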

This avoids forking solr, and we always develop against the chosen branch.
The pain was in porting the solr build infrastructure - if this
infrastructure lived inside solr, ready for developers to take advantage of,
others would be saved the pain of reinventing it. As far as I am aware, there
is only one hard problem - the confusing nature of the classloaders inside
web containers; I had a really hard time understanding it well enough to get
it right - but there are surely more knowledgeable people here. And if the
worst comes to worst, the automated procedure could easily merge jars.
Sounds evil? Is forking Solr a better way?


roman


[jira] [Commented] (LUCENE-4499) Multi-word synonym filter (synonym expansion)

2012-11-30 Thread Roman Chyla (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13507440#comment-13507440
 ] 

Roman Chyla commented on LUCENE-4499:
-

Hi Nolan, your case seems to confirm the need for some solution. You have decided 
to make a separate query parser; I have put the expanding logic into a query 
parser as well.

See this for the working example:
https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/test/org/apache/solr/analysis/TestAdsabsTypeFulltextParsing.java

And its config
https://github.com/romanchyla/montysolr/blob/master/contrib/examples/adsabs/solr/collection1/conf/schema.xml#L325

I see two added benefits (besides not needing a query parser plugin - in our 
case, it must be plugged into our qparser):

 1. you can use the filter at index/query time inside a standard query parser
 2. special configuration of the synonym expansion (for example, we have found 
it very useful to match multi-token synonyms case-insensitively but to 
recognize single tokens only case-sensitively; or to expand with multi-token 
synonyms only for multi-word originals, outputting the original words as well, 
and otherwise eating (replacing) them) - see the toy sketch below
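
A toy sketch of that matching policy in plain Python (a stand-in for the
filter configuration, not the actual implementation):

```python
# Policy: multi-token synonyms match case-insensitively; single-token
# synonyms match only case-sensitively. Originals of multi-token
# matches are kept so proximity queries still work.
MULTI = {("hubble", "space", "telescope"): "syn::hst"}
SINGLE = {"MIT": "syn::massachusetts institute of technology"}

def expand(tokens):
    out, i = [], 0
    while i < len(tokens):
        tri = tuple(t.lower() for t in tokens[i:i + 3])
        if tri in MULTI:                  # case-insensitive multi-token match
            out.extend(tokens[i:i + 3])   # output the original words too
            out.append(MULTI[tri])
            i += 3
        elif tokens[i] in SINGLE:         # exact-case single-token match
            out.append(SINGLE[tokens[i]])
            i += 1
        else:
            out.append(tokens[i])
            i += 1
    return out

print(expand(["MIT", "and", "the", "Hubble", "Space", "Telescope"]))
# ['syn::massachusetts institute of technology', 'and', 'the',
#  'Hubble', 'Space', 'Telescope', 'syn::hst']
```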

Nice blog post - I wish I could write as instructively :)

 Multi-word synonym filter (synonym expansion)
 -

 Key: LUCENE-4499
 URL: https://issues.apache.org/jira/browse/LUCENE-4499
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/other
Affects Versions: 4.1, 5.0
Reporter: Roman Chyla
Priority: Minor
  Labels: analysis, multi-word, synonyms
 Fix For: 5.0

 Attachments: LUCENE-4499.patch


 I apologize for bringing the multi-token synonym expansion up again. There is 
 an old, unresolved issue at LUCENE-1622 [1]
 While solving the problem for our needs [2], I discovered that the current 
 SolrSynonym parser (and the wonderful FTS) have almost everything to 
 satisfactorily handle both the query and index time synonym expansion. It 
 seems that people often need to use the synonym filter *slightly* differently 
 at indexing and query time.
 In our case, we must do different things during indexing and querying.
 Example sentence: Mirrors of the Hubble space telescope pointed at XA5
 This is what we need (comma marks position bump):
 indexing: mirrors,hubble|hubble space 
 telescope|hst,space,telescope,pointed,xa5|astroobject#5
 querying: +mirrors +(hubble space telescope | hst) +pointed 
 +(xa5|astroboject#5)
 This translates to the following needs:
   indexing time: 
 single-token synonyms => return only synonyms
 multi-token synonyms => return original tokens *AND* the synonyms
   query time:
 single-token: return only synonyms (but preserve case)
 multi-token: return only synonyms
  
 We need the original tokens for proximity queries: if we indexed 'hubble 
 space telescope'
 as one token, we could not search for 'hubble NEAR telescope'
 You may (not) be surprised, but Lucene already supports ALL of these 
 requirements. The patch is an attempt to state the problem differently. I am 
 not sure if it is the best option; however, it works perfectly for our needs 
 and it seems it could work for the general public too. Especially if the 
 SynonymFilterFactory had preconfigured sets of SynonymMapBuilders - and 
 people would just choose which situation applies to them. Please look at the unit test.
 links:
 [1] https://issues.apache.org/jira/browse/LUCENE-1622
 [2] http://labs.adsabs.harvard.edu/trac/ads-invenio/ticket/158
 [3] seems to have similar request: 
 http://lucene.472066.n3.nabble.com/Proposal-Full-support-for-multi-word-synonyms-at-query-time-td4000522.html




Re: Changing Python class/module layout, dropping --rename ?

2012-07-19 Thread Roman Chyla
The script must have thought about it somehow :-) Have a great,
undisturbed vacation!

roman

On Thu, Jul 19, 2012 at 9:33 AM, Andi Vajda va...@apache.org wrote:

 On Fri, 13 Jul 2012, Roman Chyla wrote:

 Hi,
 I was playing with the idea of creating virtual packages, attached is a
 working script that illustrates it. I am getting this output:

 Did it work?


 No, I haven't forgotten, I'm just on vacation.

 Andi..


  ==
  from org.apache.lucene.search import SearcherFactory; print SearcherFactory
  <type 'SearcherFactory'>
  from org.apache.lucene.analysis import Analyzer as Banalyzer; print Banalyzer
  <type 'Analyzer'>
  print sys.modules['org']
  <module 'org' (built-in)>
  print sys.modules['org.apache']
  <module 'org.apache' (built-in)>
  print sys.modules['org.apache.lucene']
  <module 'org.apache.lucene' (built-in)>
  print sys.modules['org.apache.lucene.search']
  <module 'org.apache.lucene.search' (built-in)>

 Cheers,

  roman


 On Fri, Jul 13, 2012 at 1:34 PM, Andi Vajda va...@apache.org wrote:


 On Jul 13, 2012, at 18:33, Roman Chyla roman.ch...@gmail.com wrote:

 I think this would be great. Let me add little bit more to your
 observations (whole night yesterday was spent fighting with renames -
 because I was building a project which imports shared lucene and solr
 --
 there were thousands of same classes, I am not sure it would be possible
 without some sort of a flexible rename...)

 JCC is a great tool and is used by potentially many projects - so

 stripping

 org.apache seems right for pylucene, but looks arbitrary otherwise


 Yes, I forgot to say that there would be a way to declare one or more
 mappings  so that org.apache.lucene becomes lucene.

 Andi..

 (unless there is a flexible stripping mechanism). Also, if the full
 namespace remains original, then the code written in Python would be
 also
 executable by Jython, which is IMHO an advantage.

 But this being Python, the packages cannot be spread in different

 locations

 (ie. there can be only one org.apache.lucene.analysis package) - unless
 there exists (again) some flexible mechanism which populates the

 namespace

 with objects that belong there. It may seem an overkill to you, because

 for

 single projects it would work, but seems perfectly justifiable in case
 of
 imported shared libraries

 I don't know what is your idea for implementing the python packages, but
 your last email got me thinking as well - there might be a very simple

 way

 of getting to the java packages inside Python without too much work.

 Let's say the java org.apache.lucene.search.IndexSearcher is known to
 python as org_apache_lucene_search_IndexSearcher

 and users do:

 import lucene
 lucene.initVM()

 initVM() first initiates java VM (and populates the lucene namespace
 with
 all objects), but then it will call jcc.register_module(self)

 A new piece of code inside JCC grabs the lucene module and creates (on

 the

 fly) python packages -- using types.ModuleType (or new.module()) -- the

 new

 packages will be inserted into sys.modules

 so after lucene.initVM() returns

 users can do from org.apache.lucene.search import IndexSearcher and
 get
 lucene.org_apache_lucene_search_IndexSearcher object

 and also, when shared libraries are present (let's say 'solr') users do:

 import solr
 solr.initVM()

 The JCC will just update the existing packages and create new ones if
 needed (and from this perspective, having fully qualified name is safer
 than to have lucene.search.IndexSearcher)

 I think this change is totally possible and will not change the way how
 extensions are built. Does it have some serious flaw?

 I would be of course more than happy to contribute and test.

 Best,

  roman


 On Fri, Jul 13, 2012 at 11:47 AM, Andi Vajda va...@apache.org wrote:


 On Tue, 10 Jul 2012, Andi Vajda wrote:

 I would also like to propose a change, to allow for more flexible

 mechanism of generating Python class names. The patch doesn't change
 the default pylucene behaviour, but it gives people a way to replace
 class names with patterns. I have noticed that there are more
 same-name classes from different packages in the new lucene (and it
 becomes worse when one has to deal with both lucene and solr).


 Another way to fix this is to reproduce the namespace hierarchy used
 in
 Lucene, following along the Java packages, something I've been

 dreading to

 do. Lucene just loves a really long deeply nested class structure.
 I'm not convinced yet it is bad enough to go down that route, though.

 Your proposal to use patterns may in fact yield a much more convenient
 solution. Thanks !


 Rethinking this a bit, I'm prepared to change my mind on this. Your
 patterned rename patch shows that we're slowly but surely reaching the
 limit of the current setup that consists in throwing all wrapped
 classes
 under the one global 'lucene' namespace.

 Lucene 4.0 has seen a large number of deeply nested classes with
 similar
 names added since 3.x. Renaming

Re: Changing Python class/module layout, dropping --rename ?

2012-07-13 Thread Roman Chyla
 also say:
   - import lucene.document.Document as whateverOneLikes

 If that proposal isn't mortally flawed somewhere, I'm prepared to drop
 support for --rename and replace it with this new Python class/module
 layout.

 Since this is being talked about in the context of a major PyLucene
 release, version 4.0, and that all tests/samples have to be reworked
 anyway, this backwards compat break shouldn't be too controversial,
 hopefully.

 If it is, the old --rename could be preserved for sure, but I'd prefer
 simplying the JCC interface than to accrete more to it.

 What do you think ?

 Andi..


 Andi..


 I can confirm the test test_BinaryDocument.py no longer crashes the JVM.

 Roman


 On Tue, Jul 10, 2012 at 8:54 AM, Andi Vajda va...@apache.org wrote:


  Hi Roman,


 On Mon, 9 Jul 2012, Roman Chyla wrote:

  Thanks, I am attaching a new patch that adds the missing test base.
 Sorry for the tabs, I was probably messing around with a few editors
 (some of them not configured properly)



 I integrated your test class (renaming it to fit the naming scheme
 used).
 Thanks !


  So far, found one serious problem, crashes VM -- see. eg
 test/test_BinaryDocument.py - when getting the document using:
 reader.document(0)



  test/test_BinaryDocument.py doesn't seem to crash the VM but fails
 because
 of some API changes. I suspect the crash to be some issue related to
 using
 an older jcc.

 I see a comment saying: couldn't find any combination with lucene4.0
 where
 it would raise errors. Most of these unit tests are straight ports
 from the
 original Java version. If you're stumped about a change, check the
 original
 Java test, it may have changed too.

 Andi..






Re: Changing Python class/module layout, dropping --rename ?

2012-07-13 Thread Roman Chyla
Hi,
I was playing with the idea of creating virtual packages, attached is a
working script that illustrates it. I am getting this output:

Did it work?
==
from org.apache.lucene.search import SearcherFactory; print SearcherFactory
<type 'SearcherFactory'>
from org.apache.lucene.analysis import Analyzer as Banalyzer; print Banalyzer
<type 'Analyzer'>
print sys.modules['org']
<module 'org' (built-in)>
print sys.modules['org.apache']
<module 'org.apache' (built-in)>
print sys.modules['org.apache.lucene']
<module 'org.apache.lucene' (built-in)>
print sys.modules['org.apache.lucene.search']
<module 'org.apache.lucene.search' (built-in)>
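
A minimal, self-contained sketch of the same trick (the register_module
helper and all names here are illustrative only, not the actual JCC code):

```python
import sys
import types

def register_module(flat_module, flat_name, dotted_name):
    """Expose flat_module.<flat_name> under a dotted package path."""
    parts = dotted_name.split(".")
    # Create (or reuse) each package level: org, org.apache, ...
    for i in range(1, len(parts)):
        pkg_name = ".".join(parts[:i])
        pkg = sys.modules.setdefault(pkg_name, types.ModuleType(pkg_name))
        if i > 1:
            setattr(sys.modules[".".join(parts[:i - 1])], parts[i - 1], pkg)
    # Attach the wrapped class to its leaf package.
    setattr(sys.modules[".".join(parts[:-1])], parts[-1],
            getattr(flat_module, flat_name))

# Stand-in for the flat module JCC generates:
flat = types.ModuleType("lucene")
flat.org_apache_lucene_search_IndexSearcher = type("IndexSearcher", (), {})

register_module(flat, "org_apache_lucene_search_IndexSearcher",
                "org.apache.lucene.search.IndexSearcher")

from org.apache.lucene.search import IndexSearcher  # now resolvable
print(IndexSearcher)
```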

Cheers,

  roman


On Fri, Jul 13, 2012 at 1:34 PM, Andi Vajda va...@apache.org wrote:


 On Jul 13, 2012, at 18:33, Roman Chyla roman.ch...@gmail.com wrote:

  I think this would be great. Let me add little bit more to your
  observations (whole night yesterday was spent fighting with renames -
  because I was building a project which imports shared lucene and solr  --
  there were thousands of same classes, I am not sure it would be possible
  without some sort of a flexible rename...)
 
  JCC is a great tool and is used by potentially many projects - so
 stripping
  org.apache seems right for pylucene, but looks arbitrary otherwise

 Yes, I forgot to say that there would be a way to declare one or more
 mappings  so that org.apache.lucene becomes lucene.

 Andi..

  (unless there is a flexible stripping mechanism). Also, if the full
  namespace remains original, then the code written in Python would be also
  executable by Jython, which is IMHO an advantage.
 
  But this being Python, the packages cannot be spread in different
 locations
  (ie. there can be only one org.apache.lucene.analysis package) - unless
  there exists (again) some flexible mechanism which populates the
 namespace
  with objects that belong there. It may seem an overkill to you, because
 for
  single projects it would work, but seems perfectly justifiable in case of
  imported shared libraries
 
  I don't know what is your idea for implementing the python packages, but
  your last email got me thinking as well - there might be a very simple
 way
  of getting to the java packages inside Python without too much work.
 
  Let's say the java org.apache.lucene.search.IndexSearcher is known to
  python as org_apache_lucene_search_IndexSearcher
 
  and users do:
 
  import lucene
  lucene.initVM()
 
  initVM() first initiates java VM (and populates the lucene namespace with
  all objects), but then it will call jcc.register_module(self)
 
  A new piece of code inside JCC grabs the lucene module and creates (on
 the
  fly) python packages -- using types.ModuleType (or new.module()) -- the
 new
  packages will be inserted into sys.modules
 
  so after lucene.initVM() returns
 
  users can do from org.apache.lucene.search import IndexSearcher and get
  lucene.org_apache_lucene_search_IndexSearcher object
 
  and also, when shared libraries are present (let's say 'solr') users do:
 
  import solr
  solr.initVM()
 
  The JCC will just update the existing packages and create new ones if
  needed (and from this perspective, having fully qualified name is safer
  than to have lucene.search.IndexSearcher)
 
  I think this change is totally possible and will not change the way how
  extensions are built. Does it have some serious flaw?
 
  I would be of course more than happy to contribute and test.
 
  Best,
 
   roman
 
 
  On Fri, Jul 13, 2012 at 11:47 AM, Andi Vajda va...@apache.org wrote:
 
 
  On Tue, 10 Jul 2012, Andi Vajda wrote:
 
  I would also like to propose a change, to allow for more flexible
  mechanism of generating Python class names. The patch doesn't change
  the default pylucene behaviour, but it gives people a way to replace
  class names with patterns. I have noticed that there are more
  same-name classes from different packages in the new lucene (and it
  becomes worse when one has to deal with both lucene and solr).
 
 
  Another way to fix this is to reproduce the namespace hierarchy used in
  Lucene, following along the Java packages, something I've been
 dreading to
  do. Lucene just loves a really long deeply nested class structure.
  I'm not convinced yet it is bad enough to go down that route, though.
 
  Your proposal to use patterns may in fact yield a much more convenient
  solution. Thanks !
 
 
  Rethinking this a bit, I'm prepared to change my mind on this. Your
  patterned rename patch shows that we're slowly but surely reaching the
  limit of the current setup that consists in throwing all wrapped classes
  under the one global 'lucene' namespace.
 
  Lucene 4.0 has seen a large number of deeply nested classes with similar
  names added since 3.x. Renaming these one by one (or excluding some)
  doesn't scale. Using the proposed patterned rename scales more but
 makes it
  difficult to know what got renamed and how.
  Ultimately, the more classes that are like-named

Re: lucene4.0 release

2012-07-10 Thread Roman Chyla
Hi Andi,

Thanks again. With the new JCC I encountered new errors - about
already used class names - patch attached.

I would also like to propose a change, to allow for more flexible
mechanism of generating Python class names. The patch doesn't change
the default pylucene behaviour, but it gives people a way to replace
class names with patterns. I have noticed that there are more
same-name classes from different packages in the new lucene (and it
becomes worse when one has to deal with both lucene and solr).

I can confirm the test test_BinaryDocument.py no longer crashes the JVM.

Roman


On Tue, Jul 10, 2012 at 8:54 AM, Andi Vajda va...@apache.org wrote:

  Hi Roman,


 On Mon, 9 Jul 2012, Roman Chyla wrote:

 Thanks, I am attaching a new patch that adds the missing test base.
 Sorry for the tabs, I was probably messing around with a few editors
 (some of them not configured properly)


 I integrated your test class (renaming it to fit the naming scheme used).
 Thanks !


 So far, found one serious problem, crashes VM -- see. eg
 test/test_BinaryDocument.py - when getting the document using:
 reader.document(0)


 test/test_BinaryDocument.py doesn't seem to crash the VM but fails because
 of some API changes. I suspect the crash to be some issue related to using
 an older jcc.

 I see a comment saying: couldn't find any combination with lucene4.0 where
 it would raise errors. Most of these unit tests are straight ports from the
 original Java version. If you're stumped about a change, check the original
 Java test, it may have changed too.

 Andi..



Re: lucene4.0 release

2012-07-09 Thread Roman Chyla
Hi Andi,

Thanks, I am attaching a new patch that adds the missing test base.
Sorry for the tabs, I was probably messing around with a few editors
(some of them not configured properly)

The test_Analyzer.py no longer works for me - it imports
PythonAttributeImpl, which I cannot find in the trunk

I wasn't able to build JCC; there has been a build error since the new
commit (tested on Debian with Python 2.7 and CentOS with Python 2.6)


gcc -pthread -fno-strict-aliasing -O2 -g -pipe -Wall
-Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector
--param=ssp-buffer-size=4 -m64 -mtune=generic -D_GNU_SOURCE -fPIC
-fwrapv -I/usr/kerberos/include -DNDEBUG -O2 -g -pipe -Wall
-Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector
--param=ssp-buffer-size=4 -m64 -mtune=generic -D_GNU_SOURCE -fPIC
-fwrapv -fPIC -D_java_generics -DJCC_VER=2.13
-I/usr/lib/jvm/java-openjdk/include
-I/usr/lib/jvm/java-openjdk/include/linux -I_jcc -Ijcc/sources
-I/usr/include/python2.6 -c jcc/sources/functions.cpp -o
build/temp.linux-x86_64-2.6/jcc/sources/functions.o -DPYTHON
-fno-strict-aliasing -Wno-write-strings
jcc/sources/functions.cpp: In function ‘PyObject*
makeInterface(PyObject*, PyObject*)’:
jcc/sources/functions.cpp:153: error: ‘htons’ was not declared in this scope
jcc/sources/functions.cpp: In function ‘PyObject* makeClass(PyObject*,
PyObject*)’:
jcc/sources/functions.cpp:244: error: ‘htons’ was not declared in this scope
error: command 'gcc' failed with exit status 1


So I tried building pylucene with JCC 2.8, after adding to the Makefile

--reserved mutable \
--reserved token \

but got an error:

build/_lucene/__wrap01__.cpp: In function ‘PyObject*
org::apache::pylucene::util::t_PythonListIterator_next(org::apache::pylucene::util::t_PythonListIterator*,
PyObject*)’:
build/_lucene/__wrap01__.cpp:17920:38: error: ‘class
org::apache::pylucene::util::t_PythonListIterator’ has no member named
‘parameters’
build/_lucene/__wrap01__.cpp:17920:77: error: ‘class
org::apache::pylucene::util::t_PythonListIterator’ has no member named
‘parameters’
error: command 'gcc' failed with exit status 1

Then I tried using the pylucene code from Friday (just updated the Lucene
java source) and it worked; it seems the changes inside lucene are
not the cause of this

roman


On Sat, Jul 7, 2012 at 11:35 AM, Andi Vajda va...@apache.org wrote:

  Hi Roman,


 On Fri, 6 Jul 2012, Roman Chyla wrote:

 I figured this is not complete for jira, retrying /w email...


 I integrated your patch after merging 3.6.0 -> 3.x and then 3.x into trunk.
 PyLucene's trunk is now setup to track Lucene's branch_4x branch.

 I wasn't able to run all tests that succeed for you as you didn't send in
 your new PyLuceneTestCase.py class. Please add it to the test directory
 (instead of a new package) along with the other test helper classes already
 there such as BaseTokenStreamTestCase.py and send it in.

 Also, please, please, please, avoid using tab characters in the Java code
 you send in. Tabs are pain to manage, they mess up indentation and make the
 code hard to read.

 As this time, PyLucene on trunk builds and runs the few tests you ported
 that don't require this missing file, such as test_Analyzers.py.

 Thanks !

 Andi..




 On Fri, Jul 6, 2012 at 1:55 PM, Andi Vajda va...@apache.org wrote:

 I think that the apache mail server is eating up the attachment. Try to
 make it a .diff file or attach the patch to a jira issue. Thanks !

 Andi..

 On Jul 6, 2012, at 18:54, Roman Chyla roman.ch...@gmail.com wrote:

 Attaching the patch (there is no chance I could do it in one go, but
 if parts are committed in the trunk, then we can do more...I have also
 introduced base class for unittests, so that may be st to wave)

 So far, found one serious problem, crashes VM -- see. eg
 test/test_BinaryDocument.py - when getting the document using:
 reader.document(0)


 What works fine now:

  test/
test_Analyzers
test_Binary
test_RegexQuery

  samples/LuceneInAction/
index.py
BasicSearchingTest.py



 On Thu, Jul 5, 2012 at 8:22 PM, Roman Chyla roman.ch...@gmail.com
 wrote:

  The patch probably didn't make it to the list, I'll file a
 ticket later

  It is definitely a lot of work with the python code, I have gone through
 1.5 test cases now, and it is just 'unpleasant', so many API changes
 out there - but I'll try to convert more

 roman

 On Thu, Jul 5, 2012 at 7:48 PM, Andi Vajda va...@apache.org wrote:


 On Jul 6, 2012, at 0:27, Roman Chyla roman.ch...@gmail.com wrote:

  Lucene 4.0 is in alpha release and we would like to start working
 with
 pylucene4.0 already. I checked out the pylucene trunk and made the
 necessary changes so that it compiles. Would it be possible to
 incorporate (some of) these changes?


 Absolutely, please send a patch to the list or file a bug and attach
 it there.

 The issue with a PyLucene 4.0 release is not so much getting it to
 compile and run but rewriting all the tests and samples (originally 
 ported
 from Java) since

Re: lucene4.0 release

2012-07-06 Thread Roman Chyla
You can also get it temporarily here:
https://github.com/romanchyla/pylucene-trunk

roman

On Fri, Jul 6, 2012 at 2:04 PM, Roman Chyla roman.ch...@gmail.com wrote:
 I figured this is not complete for jira, retrying /w email...
 r


 On Fri, Jul 6, 2012 at 1:55 PM, Andi Vajda va...@apache.org wrote:
 I think that the apache mail server is eating up the attachment. Try to make 
 it a .diff file or attach the patch to a jira issue. Thanks !

 Andi..

 On Jul 6, 2012, at 18:54, Roman Chyla roman.ch...@gmail.com wrote:

 Attaching the patch (there is no chance I could do it in one go, but
 if parts are committed in the trunk, then we can do more...I have also
 introduced base class for unittests, so that may be st to wave)

 So far, found one serious problem, crashes VM -- see. eg
 test/test_BinaryDocument.py - when getting the document using:
 reader.document(0)


 What works fine now:

  test/
test_Analyzers
test_Binary
test_RegexQuery

  samples/LuceneInAction/
index.py
BasicSearchingTest.py



 On Thu, Jul 5, 2012 at 8:22 PM, Roman Chyla roman.ch...@gmail.com wrote:
  The patch probably didn't make it to the list, I'll file a ticket 
 later

  It is definitely a lot of work with the python code, I have gone through
 1.5 test cases now, and it is just 'unpleasant', so many API changes
 out there - but I'll try to convert more

 roman

 On Thu, Jul 5, 2012 at 7:48 PM, Andi Vajda va...@apache.org wrote:

 On Jul 6, 2012, at 0:27, Roman Chyla roman.ch...@gmail.com wrote:

  Lucene 4.0 is in alpha release and we would like to start working with
 pylucene4.0 already. I checked out the pylucene trunk and made the
 necessary changes so that it compiles. Would it be possible to
 incorporate (some of) these changes?

 Absolutely, please send a patch to the list or file a bug and attach it 
 there.

 The issue with a PyLucene 4.0 release is not so much getting it to 
 compile and run but rewriting all the tests and samples (originally 
 ported from Java) since the Lucene api changed in many ways. That's a 
 large amount of work and some of the new analyzer/tokenizer framework 
 stuff needs some new jcc support for generating classes on the fly. I've 
 got that written to some extent already but porting the samples and tests 
 again is daunting.

 Andi..


 Thanks,

 Roman


Re: lucene4.0 release

2012-07-06 Thread Roman Chyla
Attaching the patch (there is no chance I could do it in one go, but
if parts are committed in the trunk, then we can do more...I have also
introduced base class for unittests, so that may be st to wave)

So far, found one serious problem, crashes VM -- see. eg
test/test_BinaryDocument.py - when getting the document using:
reader.document(0)


What works fine now:

  test/
test_Analyzers
test_Binary
test_RegexQuery

  samples/LuceneInAction/
index.py
BasicSearchingTest.py



On Thu, Jul 5, 2012 at 8:22 PM, Roman Chyla roman.ch...@gmail.com wrote:
  The patch probably didn't make it to the list, I'll file a ticket 
 later

  It is definitely a lot of work with the python code, I have gone through
 1.5 test cases now, and it is just 'unpleasant', so many API changes
 out there - but I'll try to convert more

 roman

 On Thu, Jul 5, 2012 at 7:48 PM, Andi Vajda va...@apache.org wrote:

 On Jul 6, 2012, at 0:27, Roman Chyla roman.ch...@gmail.com wrote:

  Lucene 4.0 is in alpha release and we would like to start working with
 pylucene4.0 already. I checked out the pylucene trunk and made the
 necessary changes so that it compiles. Would it be possible to
 incorporate (some of) these changes?

 Absolutely, please send a patch to the list or file a bug and attach it 
 there.

 The issue with a PyLucene 4.0 release is not so much getting it to compile 
 and run but rewriting all the tests and samples (originally ported from 
 Java) since the Lucene api changed in many ways. That's a large amount of 
 work and some of the new analyzer/tokenizer framework stuff needs some new 
 jcc support for generating classes on the fly. I've got that written to some 
 extent already but porting the samples and tests again is daunting.

 Andi..


 Thanks,

  Roman


Re: lucene4.0 release

2012-07-05 Thread Roman Chyla
The patch probably didn't make it to the list, I'll file a ticket later

It is definitely a lot of work with the python code, I have gone through
1.5 test cases now, and it is just 'unpleasant', so many API changes
out there - but I'll try to convert more

roman

On Thu, Jul 5, 2012 at 7:48 PM, Andi Vajda va...@apache.org wrote:

 On Jul 6, 2012, at 0:27, Roman Chyla roman.ch...@gmail.com wrote:

 Lucene 4.0 is in alpha release and we would like to start working with
 pylucene4.0 already. I checked out the pylucene trunk and made the
 necessary changes so that it compiles. Would it be possible to
 incorporate (some of) these changes?

 Absolutely, please send a patch to the list or file a bug and attach it there.

 The issue with a PyLucene 4.0 release is not so much getting it to compile 
 and run but rewriting all the tests and samples (originally ported from Java) 
 since the Lucene api changed in many ways. That's a large amount of work and 
 some of the new analyzer/tokenizer framework stuff needs some new jcc support 
 for generating classes on the fly. I've got that written to some extent 
 already but porting the samples and tests again is daunting.

 Andi..


 Thanks,

  Roman


JArray not shared - TypeError

2011-11-24 Thread Roman Chyla
Hi,

I am using lucene together with other modules (all built in shared
mode, JCC=2.11). But JArray... objects are not built as shared.

This works when using only lucene, but fails when I use the
other module built and linked against lucene:

# create array of string objects
x = j.JArray_object(5)
for i in range(5):
   x[i] = j.JArray_string(['x', 'z'])

In [7]: for i in range(5):
   ...:     x[i] = j.JArray_string(['x', 'z'])
   ...:
---
TypeError Traceback (most recent call last)

/dvt/workspace/montysolr/src/python/<ipython console> in <module>()

TypeError: JArray<string>[u'x', u'z']


The JArray functions/objects are different:

In [9]: id(lucene.JArray_string)
Out[9]: 140313957671376

In [10]: id(solr_java.JArray_string)
Out[10]: 140313919877648

In [11]: id(montysolr_java.JArray_string)
Out[11]: 140313909254704

In [12]: id(j.JArray_string)
Out[12]: 140313909254704



Others are shared:

In [18]: id(lucene.Weight)
Out[18]: 140313957203040

In [19]: id(solr_java.Weight)
Out[19]: 140313957203040

In [20]: id(j.Weight)
Out[20]: 140313957203040


The module 'j' is built with:
-m  jcc  --shared  --import  lucene  --import  solr_java  --package
org.apache.solr.request  --classpath ...  --include
../build/jar/montysolr_java-0.1.jar  --python  montysolr_java  --build
 --bdist


What am I doing wrong?

Thanks,

  roman


Re: set PYTHONPATH programmatically from Java?

2011-11-14 Thread Roman Chyla
hi,

so after reading
http://docs.python.org/c-api/init.html#PySys_SetArgvEx and the source
code for _PythonVM_init I figured it out

I have to do:

PythonVM.start("/dvt/workspace/montysolr/src/python/montysolr");

and sys.path then contains the parent folder (above montysolr), and
I can then set more things by loading a bootstrap module (sketched below)
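
For reference, such a bootstrap module is nothing fancy - a hedged sketch
(the extra paths are hypothetical):

```python
# bootstrap.py - hypothetical example, loaded once right after
# PythonVM.start() so the embedded interpreter can find our code.
import sys

EXTRA_PATHS = [
    "/dvt/workspace/montysolr/src/python",
    "/dvt/workspace/montysolr/lib",
]

for p in EXTRA_PATHS:
    if p not in sys.path:
        sys.path.insert(0, p)
```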

but something like
http://docs.python.org/c-api/veryhigh.html#PyRun_SimpleString would be
much more flexible. Is that something that could be added? I can prepare
a patch (it seems trivial enough that my knowledge might be sufficient
for it :))

roman

On Mon, Nov 14, 2011 at 1:12 PM, Roman Chyla roman.ch...@gmail.com wrote:
 On Mon, Nov 14, 2011 at 4:25 AM, Andi Vajda va...@apache.org wrote:

 On Sun, 13 Nov 2011, Roman Chyla wrote:

 I am using JCC to run Python inside Java. For unittest, I'd like to
 set PYTHONPATH environment variable programmatically. I can change env
 vars inside Java (using

 http://stackoverflow.com/questions/318239/how-do-i-set-environment-variables-from-java)
  and System.getenv("PYTHONPATH") shows correct values

 However, I am still getting ImportError: no module named

 If I set PYTHONPATH before starting unittest, it works fine

 Is it possible what I would like to do?

 Why mess with the environment instead of setting sys.path directly instead ?

 That would be great, but I don't know how. I am doing roughly this:

 PythonVM.start(programName)
 vm = PythonVM.get()
 vm.instantiate(moduleName, className);

 I tried also:
  PythonVM.start(programName, new String[]{"-c", "import
  sys;sys.path.insert(0, '/dvt/workspace/montysolr/src/python')"});

 it is failing on vm.instantiate when Python cannot find the module


 Alternatively, if JCC could execute/eval python string, I could set
 sys.argv that way

 I'm not sure what you mean here but JCC's Java PythonVM.init() method takes
 an array of strings that is fed into sys.argv. See _PythonVM_Init() sources
 in jcc.cpp for details.

 sorry, i meant sys.path, not sys.argv

 roman


 Andi..




Re: set PYTHONPATH programmatically from Java?

2011-11-14 Thread Roman Chyla
On Mon, Nov 14, 2011 at 4:25 AM, Andi Vajda va...@apache.org wrote:

 On Sun, 13 Nov 2011, Roman Chyla wrote:

 I am using JCC to run Python inside Java. For unittest, I'd like to
 set PYTHONPATH environment variable programmatically. I can change env
 vars inside Java (using

 http://stackoverflow.com/questions/318239/how-do-i-set-environment-variables-from-java)
  and System.getenv("PYTHONPATH") shows correct values

 However, I am still getting ImportError: no module named

 If I set PYTHONPATH before starting unittest, it works fine

 Is it possible what I would like to do?

 Why mess with the environment instead of setting sys.path directly instead ?

That would be great, but I don't know how. I am doing roughly this:

PythonVM.start(programName)
vm = PythonVM.get()
vm.instantiate(moduleName, className);

I tried also:
PythonVM.start(programName, new String[]{"-c", "import
sys;sys.path.insert(0, '/dvt/workspace/montysolr/src/python')"});

it is failing on vm.instantiate when Python cannot find the module


 Alternatively, if JCC could execute/eval python string, I could set
 sys.argv that way

 I'm not sure what you mean here but JCC's Java PythonVM.init() method takes
 an array of strings that is fed into sys.argv. See _PythonVM_Init() sources
 in jcc.cpp for details.

sorry, i meant sys.path, not sys.argv

roman


 Andi..



Re: Building is too difficult and release of a first pre-built egg

2011-06-02 Thread Roman Chyla
Hi Philippe,

On Thu, Jun 2, 2011 at 5:54 AM, Philippe Ombredanne
pombreda...@gmail.com wrote:
 On 2011-06-01 20:54, Roman Chyla wrote:

 I would build some other binaries and upload them, will you get me
 access?

 Done: I added you as a committer to
 http://code.google.com/a/apache-extras.org/p/pylucene-extra/

Thanks!

 I'll try to keep and post detailed logs for each build I do.
 I am planning to add some detailed egg building instructions too.
 I will also contact the dudes at:
 http://code.google.com/p/pylucene-win32-binary/
 They are building windows eggs already

I'll also do -- the project for which it is needed is this one:
https://github.com/romanchyla/montysolr


 But I also need to build JCC and upload them. Note that the
  location of the java that was used to build the project will be
 hardcoded inside the dynamic library, but I plan to change the header
 and set a few standard paths there.

 Ah... good point... meaning this is bad... a build would not be java
 location independent then?
 This would be a major bummer to have the path to java hardcoded in the .so.
 You could commit the patches there if you have some?

oh, I assumed that was not patchable -- but maybe I was wrong; what
I certainly planned to do is to change each binary produced and
set some standard paths. Any ideas of what would be the standard
library paths for linux?



 Building pylucene/jcc is indeed difficult for newcomers.

  Indeed too hard imho. A big deterrent. Such that it likely impairs the
  project's reach, growth and health.

and it is a very wonderful project, I agree

roman


 On Wed, Jun 1, 2011 at 10:54 AM, Philippe Ombredanne
 pombreda...@gmail.com  wrote:

 Howdy!
 I think it is way too hard to build PyLucene for the mere mortals.
 Getting eggs is yet another level of difficulties
 I created an issue:
 https://issues.apache.org/jira/browse/PYLUCENE-10
 and started an Apache extra project, releasing a first egg for the Linux
 64/Python 2.5.2/Oracle JDK 1.5 combo

 http://code.google.com/a/apache-extras.org/p/pylucene-extra/downloads/list
 I hope that can help some folks.


 --
 Cordially
 Philippe

 philippe ombredanne | 1 650 799 0949 | pombredanne at nexb.com
 nexB - Open by Design (tm) - http://www.nexb.com
 http://eclipse.org/atf - http://eclipse.org/soc - http://eclipse.org/vep
 http://drools.org/ - http://easyeclipse.org - http://phpeclipse.com



Re: Hardcoded java paths in shared objects [was:Re: Building is too difficult and release of a first pre-built egg]

2011-06-02 Thread Roman Chyla
On Thu, Jun 2, 2011 at 12:26 PM, Andi Vajda va...@apache.org wrote:

 On Jun 2, 2011, at 3:10, Philippe Ombredanne pombreda...@gmail.com wrote:

 On 2011-06-01 20:54, Roman Chyla wrote:
 Note that the
  location of the java that was used to build the project will be
 hardcoded inside the dynamic library, but I plan to change the header
 and set a few standard paths there.
 This is actually worse than I thought: not only the java location seems 
 hardcoded in the shared object as a hard path to the libs folder, but also 
 there is an implied dep on setuptools via pkg_resources
 So for now, you cannot even build on a jdk and deploy on a jre.

 If the solution to this is to remove the hardcoded paths and expect the 
 dynamic linker to find the dependencies via some environment variable like 
 LD_LIBRARY_PATH you'd be creating a security vulnerability.

I am not an expert on this, but I remember that LD_LIBRARY_PATH was
not recommended (as it could break other libraries, if I remember
correctly). That's why I thought more about 'more standard' hardcoded
locations. Or is there something else besides LD_LIBRARY_PATH and
multiple hardcoded paths?

Roman

 This is how I did it originally (years ago) and people complained about it so 
 I switched to hardcoded paths for shared library dependencies wherever 
 possible.

 Andi..


 --
 Cordially
 Philippe

 philippe ombredanne | 1 650 799 0949 | pombredanne at nexb.com
 nexB - Open by Design (tm) - http://www.nexb.com
 http://eclipse.org/atf - http://eclipse.org/soc - http://eclipse.org/vep
 http://drools.org/ - http://easyeclipse.org - http://phpeclipse.com



Re: Hardcoded java paths in shared objects [was:Re: Building is too difficult and release of a first pre-built egg]

2011-06-02 Thread Roman Chyla
On Thu, Jun 2, 2011 at 6:10 AM, Philippe Ombredanne
pombreda...@gmail.com wrote:
 On 2011-06-01 20:54, Roman Chyla wrote:

 Note that the
  location of the java that was used to build the project will be
 hardcoded inside the dynamic library, but I plan to change the header
 and set a few standard paths there.

 This is actually worse than I thought: not only the java location seems
 hardcoded in the shared object as a hard path to the libs folder, but also
 there is an implied dep on setuptools via pkg_resources
 So for now, you cannot even build on a jdk and deploy on a jre.

I am sorry, but I don't understand - what is the additional dependency
hardcoded there?
Thanks,

Roman


 --
 Cordially
 Philippe

 philippe ombredanne | 1 650 799 0949 | pombredanne at nexb.com
 nexB - Open by Design (tm) - http://www.nexb.com
 http://eclipse.org/atf - http://eclipse.org/soc - http://eclipse.org/vep
 http://drools.org/ - http://easyeclipse.org - http://phpeclipse.com



Re: Building is too difficult and release of a first pre-built egg

2011-06-01 Thread Roman Chyla
Hi Philippe,

I would build some other binaries and upload them, will you get me
access? But I also need to build JCC and upload them. Note that the
location of the java that was used to build the project will be
hardcoded inside the dynamic library, but I plan to change the header
and set a few standard paths there.

Building pylucene/jcc is indeed difficult for newcomers.

Cheers,

Roman

On Wed, Jun 1, 2011 at 10:54 AM, Philippe Ombredanne
pombreda...@gmail.com wrote:
 Howdy!
 I think it is way too hard to build PyLucene for the mere mortals.
 Getting eggs is yet another level of difficulties

 I created an issue:
 https://issues.apache.org/jira/browse/PYLUCENE-10

 and started an Apache extra project, releasing a first egg for the Linux
 64/Python 2.5.2/Oracle JDK 1.5 combo

 http://code.google.com/a/apache-extras.org/p/pylucene-extra/downloads/list

 I hope that can help some folks.

 --
 Cordially
 Philippe

 philippe ombredanne | 1 650 799 0949 | pombredanne at nexb.com
 nexB - Open by Design (tm) - http://www.nexb.com
 http://twitter.com/pombr
 http://eclipse.org/atf - http://eclipse.org/soc - http://eclipse.org/vep
 http://drools.org/ - http://easyeclipse.org - http://phpeclipse.com



Re: finding exceptions the crash pylucene

2011-04-15 Thread Roman Chyla
I have had a similar experience, but it was always a problem on the java side.
What helped was to dump memory:

-Xms512m -Xmx4500m -XX:+HeapDumpOnCtrlBreak -XX:+HeapDumpOnOutOfMemoryError

The documentation says that upon catching the OOM you should stop the JVM
immediately, but in practice it was possible to handle these problems. I
started the processing inside a separate thread, cleaning up properly --
if the thread raises OOM, it is possible to continue - I have done
tests on thousands of docs and it always worked. The main benefit
of that solution is that I can see the errors inside Python and
gracefully stop execution (without being shot out into space).
Marcus, I would recommend wrapping your processing inside a thread
that starts another worker thread, making sure no references are
kept - something like the sketch below.
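
A hedged sketch only (do_work is a placeholder, and the heap flags above are
passed through initVM's maxheap/vmargs options):

```python
import threading
import lucene

# Heap flags as in the settings above.
lucene.initVM(maxheap="4g", vmargs="-XX:+HeapDumpOnOutOfMemoryError")

def do_work(doc):
    return len(doc)  # placeholder for the real per-document processing

def safe_process(doc):
    result = {}
    def worker():
        # Every new thread must attach itself to the JVM first.
        lucene.getVMEnv().attachCurrentThread()
        try:
            result["value"] = do_work(doc)
        except lucene.JavaError as e:  # java.lang.OutOfMemoryError surfaces here
            result["error"] = e
    t = threading.Thread(target=worker)
    t.start()
    t.join()  # no references to the worker are kept beyond this point
    return result

print(safe_process("some document text"))
```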

Roman

On Fri, Apr 15, 2011 at 4:33 PM, Bill Janssen jans...@parc.com wrote:
 Marcus qwe...@gmail.com wrote:


 we're currently using 4GB max heap.
 We recently moved from 2GB to 4GB when we discovered it prevented a crash
 with a certain set of docs.
 Marcus

 I've tried the same workaround with the heap in the past, and I found it
 caused NoMemory crashes in the Python side of the house, because the
 Python VM couldn't get enough memory to operate.  So, be careful.

 On Thu, Apr 14, 2011 at 5:01 PM, Andi Vajda va...@apache.org wrote:

 
  On Thu, 14 Apr 2011, Marcus wrote:
 
   thanks.
 
  I have documents that will consistently cause this upon writing them to
  the
  index. let me see if I can reduce them down to the crux of the crash.
   granted, these docs are very large, unruly bad data, that should
   have
   never gotten to this stage in our pipeline, but I was hoping for a java or
  lucene exception.
 
  I also get Java GC overhead exceptions passed into my code from time to
  time, but those manageable, and not crashes.
 
  Are there known memory constraint scenarios that force a c++ exception,
  whereas in a normal Java environment,  you would get a memory error?
 
 
  Not sure.
 
 
   and just confirming, do java.lang.OutOfMemoryError errors pass into
  python, or force a crash?
 
 
  Not sure, I've never seen these as I make sure I've got enough memory.
  initVM() is the place where you can configure the memory for your JVM.
 
  Andi..
 
 
 
  thanks again
  Marcus
 
  On Thu, Apr 14, 2011 at 2:07 PM, Andi Vajda va...@apache.org wrote:
 
 
  On Thu, 14 Apr 2011, Marcus wrote:
 
   in certain cases when a java/pylucene exception occurs,  it gets passed
  up
 
  in my code, and I'm able to analyze the situation.
  sometimes though,  the python process just crashes, and if I happen to
  be
  in
  top (linux top that is), I see a JCC exception flash up in the top
  console.
  where can I go to look for this exception, or is it just lost?
  I looked in the locations where a java crash would be located, but
  didn't
  find anything.
 
 
  If you're hitting a crash because of an unhandled C++ exception, running
  a
  debug build with symbols under gdb will help greatly in tracking it down.
 
  An unhandled C++ exception would be a PyLucene/JCC bug. If you have a
  simple way to reproduce this failure, send it to this list.
 
  Andi..
 
 
 




Re: Using JCC / PyLucene with JEPP?

2011-03-04 Thread Roman Chyla
Yes, and I can say it is working extremely well so far - we have done
and are doing some extensive benchmarking and tests. I also use
multiprocessing inside (python2.6), and I hope I will be able to
publish the source code soon; it could be reusable. If you are
interested before that happens, please send me an email.

Best,

  Roman

On Fri, Mar 4, 2011 at 7:27 AM, Andi Vajda va...@apache.org wrote:

 On Mar 3, 2011, at 21:50, Bill Janssen jans...@parc.com wrote:

 New topic.

 I'd like to wrap my UpLib codebase, which is Python using PyLucene, in
 Java using JEPP (http://jepp.sourceforge.net/), so that I can use it
 with Tomcat.

 Now, am I going to have to do some trickery to get a VM?  Or will
 getVMEnv() just work with a previously initialized JVM?

 Not so long ago on this list someone asked about this, using python from java 
 via jcc, something I've been doing with tomcat for a couple of years now.
 I sent a long, detailed answer. I believe it was to Roman Chyla. A quick look 
 in this mailing list archives should help you locate that thread and get 
 answers to the above questions.

 Andi..


 Bill




Re: pass compressed string

2011-02-25 Thread Roman Chyla
Hi Andi,

Thanks, the JArray_byte() does what I needed - I was (wrongly) passing a
bytestring (which I think got automatically converted to unicode), and
trying to get the bytes of that string was not correct.

Though it would be interesting to find out if it is possible to pass a
string and get the bytes in java; I don't know what conversion is
happening on the JNI side, or whether it happens only in java - I shall do some reading

Example in python:

In [4]: s = zlib.compress("python")

In [5]: repr(s)
Out[5]: 'x\\x9c+\\xa8,\\xc9\\xc8\\xcf\\x03\\x00\\tW\\x02\\xa3'

In [6]: lucene.JArray_byte(s)
Out[6]: JArray<byte>(120, -100, 43, -88, 44, -55, -56, -49, 3, 0, 9, 87, 2, -93)

The same thing in Jython:

>>> s = zlib.compress("python")
>>> s
'x\x9c+\xa8,\xc9\xc8\xcf\x03\x00\tW\x02\xa3'
>>> repr(s)
'x\\x9c+\\xa8,\\xc9\\xc8\\xcf\\x03\\x00\\tW\\x02\\xa3'
>>> String(s).getBytes()
array('b', [120, -62, -100, 43, -62, -88, 44, -61, -119, -61, -120,
-61, -113, 3, 0, 9, 87, 2, -62, -93])
>>> String(s).getBytes('utf8')
array('b', [120, -62, -100, 43, -62, -88, 44, -61, -119, -61, -120,
-61, -113, 3, 0, 9, 87, 2, -62, -93])
>>> String(s).getBytes('utf16')
array('b', [-2, -1, 0, 120, 0, -100, 0, 43, 0, -88, 0, 44, 0, -55, 0,
-56, 0, -49, 0, 3, 0, 0, 0, 9, 0, 87, 0, 2, 0, -93])
>>> String(s).getBytes('ascii')
array('b', [120, 63, 43, 63, 44, 63, 63, 63, 3, 0, 9, 87, 2, 63])
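
The mangled getBytes() output above can be reproduced without Java - a small
pure-Python illustration of what happened:

```python
# The compressed bytes from above (zlib.compress(b"python")).
data = b"x\x9c+\xa8,\xc9\xc8\xcf\x03\x00\tW\x02\xa3"

# Routing binary data through a unicode String re-encodes it: every
# byte >= 0x80 becomes a two-byte UTF-8 sequence (0x9c -> 0xc2 0x9c).
mangled = data.decode("latin-1").encode("utf-8")
print(list(data[:2]))     # [120, 156]      i.e. 0x78, 0x9c
print(list(mangled[:3]))  # [120, 194, 156] i.e. 120, -62, -100 as signed Java bytes
```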




Roman

On Thu, Feb 24, 2011 at 3:42 AM, Andi Vajda va...@apache.org wrote:

 On Thu, 24 Feb 2011, Roman Chyla wrote:

 I would like to transfer results from python to java:

  hello = zlib.compress("hello")

 on the java side do:

 byte[] data = string.getBytes()

 But I am not successful. Is there any translation going on somewhere?

 Can you be more specific ?
 Actual lines of code, errors, expected results, actual results...

 An array of bytes in JCC is not created with a string but a
 JArray('byte')(len or str)

   >>> import lucene
   >>> lucene.initVM()
   <jcc.JCCEnv object at 0x1004100d8>
   >>> lucene.JArray('byte')(10)
   JArray<byte>(0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
   >>> lucene.JArray('byte')("abcd")
   JArray<byte>(97, 98, 99, 100)
   >>>

 Andi..



pass compressed string

2011-02-23 Thread Roman Chyla
Hello,

I would like to transfer results from python to java:

hello = zlib.compress("hello")

on the java side do:

byte[] data = string.getBytes()

But I am not successful. Is there any translation going on somewhere?

Thank you,

  Roman


Re: Problem loading jcc from java : undefined symbol: PyExc_IOError

2011-02-15 Thread Roman Chyla
On Tue, Feb 15, 2011 at 4:22 AM, Andi Vajda va...@apache.org wrote:

 On Tue, 15 Feb 2011, Roman Chyla wrote:

 from:
 http://realmike.org/blog/2010/07/18/python-extensions-in-cpp-using-swig/

  Q. "Fatal Python error: Interpreter not initialized (version mismatch?)"

 A. This error occurs when the version of the Python interpreter for
 which the extension module has been built is different from the
 version of the interpreter that attempts to import the module.

 Is there a way to find out which python interpreter version is inside
  JCC? Also, is it somehow possible that the java process that loads the jcc
  library will be picking up the default python (2.4) instead of the python
 (2.5)? PATH is set to python2.5.

 There is no Python interpreter inside jcc. It's dynamically linked.
 To know which version of the shared library is looked for and expected, use
 the 'ldd' utility against the various shared libraries involved to tell you.
 That version is selected at build time, when you run 'python setup.py ...'
 That version of python determines the version of libpython.so used.

This is probably the problem (as you said before): the libjcc.so
shows no python -

bash-3.2$ ldd build/lib.linux-x86_64-2.5/libjcc.so
linux-vdso.so.1 =>  (0x7fff7affc000)
/$LIB/snoopy.so => /lib64/snoopy.so (0x2b8ed0e74000)
libjava.so => /afs/cern.ch/user/r/rchyla/public/jdk1.6.0_18/jre/lib/amd64/libjava.so
(0x2b8ed1076000)
libjvm.so => /afs/cern.ch/user/r/rchyla/public/jdk1.6.0_18/jre/lib/amd64/server/libjvm.so
(0x2b8ed11a5000)
libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x2b8ed1c3f000)
libm.so.6 => /lib64/libm.so.6 (0x2b8ed1f3f000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x2b8ed21c2000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x2b8ed23cf000)
libc.so.6 => /lib64/libc.so.6 (0x2b8ed25eb000)
libdl.so.2 => /lib64/libdl.so.2 (0x2b8ed2943000)
libverify.so => /afs/cern.ch/user/r/rchyla/public/jdk1.6.0_18/jre/lib/amd64/libverify.so
(0x2b8ed2b47000)
libnsl.so.1 => /lib64/libnsl.so.1 (0x2b8ed2c57000)
/lib64/ld-linux-x86-64.so.2 (0x2b8ed08c9000)

And I think the python2.4 (the default on the system) is being
loaded -- but I don't know how to force loading of python2.5 (if that
is possible at all). Compilation is definitely done with -lpython2.5.

Cheers,

  roman


 Andi..


 Cheers,

  roman


 On Tue, Feb 15, 2011 at 2:40 AM, Roman Chyla roman.ch...@gmail.com
 wrote:

 On Tue, Feb 15, 2011 at 1:32 AM, Andi Vajda va...@apache.org wrote:

 On Tue, 15 Feb 2011, Roman Chyla wrote:

 The python embedded in Java works really well on MacOsX and also
 Ubuntu. But I am trying hard to make it work also on Scientific Linux
 (SLC5) with *statically* built Python. The python is a build from
 ActiveState.

 You mean you're going to try to dynamically load libpython.a into a JVM
 ?
 I have no idea if this can work at all.

 I am very ignorant as far as the difference between statically and
 dynamically linked libraries go - I just wanted to use JCC wrapped
 code with this particular statically linked python

 I got little bit further, but just little:

 after I changed -Xlinker --export-dynamic into -Xlinker
 -export-dynamic (and installed python into /opt...) I am getting a
 different error:

 SEVERE: org.apache.jcc.PythonException: No module named
 solrpie.java_bridge
 null
        at org.apache.jcc.PythonVM.instantiate(Native Method)
        at rca.python.jni.PythonVMBridge.start(Unknown Source)
        at rca.python.jni.PythonVMBridge.start(Unknown Source)
        at rca.python.jni.PythonVMBridge.start(Unknown Source)
        at rca.python.jni.SolrpieVM.getBridge(Unknown Source)


 My understanding is that the previous error has gone (and the python
 module time is loaded), because if I set PYTHONPATH incorrectly, I
 get:
 This message is IMHO coming from Python

 But when I correct the PYTHONPATH, I am getting only this:

 [java] Fatal Python error: Interpreter not initialized (version
 mismatch?)
 [java] Java Result: 134




 If my understanding of static builds is correct, I'd imagine the only
 way
 for this to work would be to statically compile the JVM (hotspot) and
 python
 together.

 oooups, that is way over my head


 But why all this ?

 Because on the grid, we already had a statically linked python and it
 was working very well with pylucene (and after all, I managed to make
 it work also for solr and other packages)

 But if you think that it is not possible, I should do something else :)
 But it was fun trying, if you get some idea, please let me know.

 Thank you,

  Roman


 Andi..

 So far, I managed to build all the needed extensions (jcc, lucene,
 solr) and I can run them in python, but when I try to start the java
 app and use python, I get:

 SEVERE: org.apache.jcc.PythonException:


 /afs/cern.ch/user/r/rchyla/public/ActivePython-2.5.5.7-linux-x86_64/INSTALLDIR/lib/python2.5/lib

Re: Problem loading jcc from java : undefined symbol: PyExc_IOError

2011-02-15 Thread Roman Chyla
In the end, I compiled a new python with the necessary modules, and
that works just fine.
But it was an interesting experience. Thank you Andi, your help is always great.

Cheers,

  roman


Re: Problem loading jcc from java : undefined symbol: PyExc_IOError

2011-02-14 Thread Roman Chyla
from: http://realmike.org/blog/2010/07/18/python-extensions-in-cpp-using-swig/

Q. “Fatal Python error: Interpreter not initialized (version mismatch?)”

A. This error occurs when the version of the Python interpreter for
which the extension module has been built is different from the
version of the interpreter that attempts to import the module.

Is there a way to find out which Python interpreter version is inside
JCC? Also, is it somehow possible that the Java process that loads the
jcc library is picking up the default Python (2.4) instead of Python
2.5? PATH is set to python2.5.

Cheers,

  roman



Problem loading jcc from java : undefined symbol: PyExc_IOError

2011-02-14 Thread Roman Chyla
Hello Andi, all,

The python embedded in Java works really well on MacOsX and also
Ubuntu. But I am trying hard to make it work also on Scientific Linux
(SLC5) with *statically* built Python. The python is a build from
ActiveState.

So far, I managed to build all the needed extensions (jcc, lucene,
solr) and I can run them in python, but when I try to start the java
app and use python, I get:

SEVERE: org.apache.jcc.PythonException:
/afs/cern.ch/user/r/rchyla/public/ActivePython-2.5.5.7-linux-x86_64/INSTALLDIR/lib/python2.5/lib-dynload/time.so:
undefined symbol: PyExc_IOError


I understand that the missing symbol PyExc_IOError is in the static
python library:

bash-3.2$ nm 
/afs/cern.ch/user/r/rchyla/public/ActivePython-2.5.5.7-linux-x86_64/INSTALLDIR/lib/python2.5/config/libpython2.5.a
| grep IOError
4120 D PyExc_IOError
4140 d _PyExc_IOError
 U PyExc_IOError
 U PyExc_IOError
 U PyExc_IOError
 U PyExc_IOError
 U PyExc_IOError
 U PyExc_IOError
 U PyExc_IOError

So when building JCC, I build with these arguments:

lflags += ['-lpython%s.%s' % (sys.version_info[0:2]),
'-L',
'/afs/cern.ch/user/r/rchyla/public/ActivePython-2.5.5.7-linux-x86_64/INSTALLDIR/lib/python2.5/config',
'-rdynamic',
'-Wl,--export-dynamic',
'-Xlinker',
'--export-dynamic']

I just found instructions at:
http://stackoverflow.com/questions/4223312/python-interpreter-embedded-in-the-application-fails-to-load-native-modules
I don't really understand g++, but the symbol is there after the compilation

bash-3.2$ nm 
/afs/cern.ch/user/r/rchyla/public/ActivePython-2.5.5.7-linux-x86_64/INSTALLDIR/lib/python2.5/site-packages/JCC-2.7-py2.5-linux-x86_64.egg/libjcc.so
| grep IOError
00352240 D PyExc_IOError
00352260 d _PyExc_IOError

And when starting java, I do
-Djava.library.path=/afs/cern.ch/user/r/rchyla/public/ActivePython-2.5.5.7-linux-x86_64/INSTALLDIR/lib/python2.5/site-packages/JCC-2.7-py2.5-linux-x86_64.egg

The code works fine on mac (python 2.6) and ubuntu (python2.6), but
not this statically linked python2.5 - would you know what I can try?

Thanks.


  roman


PS: I tried several compilations, but I was usually re-compiling JCC
without building lucene etc again, I hope that is not the problem.


cannot instantiate HashMap until after shared-module initVM()

2011-02-04 Thread Roman Chyla
Hi Andi, all,

I have just come across behaviour which seems strange -- I have built
lucene and wrapped solr + my own extension with JCC (ver. 2.7; OS is
Mac, 32-bit Python 2.6; using generics in all of the packages). All of
the packages are compiled in shared mode -- I import them in the
correct order of: lucene, solr, my extension.

Now I realize it is not possible to instantiate a HashMap until the
first extension (in this case lucene) is initialized.

Is this the effect of building them in the shared mode - where one
depends on one another?

Thank you,

   roman

In [1]: from solrpie import initvm
Warning: we add the default folder to sys.path:
/x/dev/workspace/sandbox/solrpie/build/dist

In [2]: sj = initvm.solrpie_java

In [3]: sj.initVM()
Out[3]: <jcc.JCCEnv object at 0x194b70>

In [4]: sj.Hash
sj.HashDocSet  sj.HashMap sj.HashSet sj.Hashtable

In [4]: sj.HashMap().of_(sj.String, sj.String)
---
InvalidArgsError  Traceback (most recent call last)

/x/dev/workspace/sandbox/solrpie/python/<ipython console> in <module>()

InvalidArgsError: (<type 'HashMap'>, 'of_', (<type 'String'>, <type 'String'>))

In [5]: import lucene

In [6]: sj.HashMap().of_(lucene.String, lucene.String)
---
InvalidArgsError  Traceback (most recent call last)

/x/dev/workspace/sandbox/solrpie/python/<ipython console> in <module>()

InvalidArgsError: (<type 'HashMap'>, 'of_', (<type 'String'>, <type 'String'>))

In [7]: lucene.HashMap().of_(lucene.String, lucene.String)
---
InvalidArgsError  Traceback (most recent call last)

/x/dev/workspace/sandbox/solrpie/python/<ipython console> in <module>()

InvalidArgsError: (<type 'HashMap'>, 'of_', (<type 'String'>, <type 'String'>))

In [8]: lucene.initVM()
Out[8]: <jcc.JCCEnv object at 0x194cd0>

In [9]: lucene.HashMap().of_(lucene.String, lucene.String)
Out[9]: <HashMap: {}>

In [10]: sj.HashMap().of_(sj.String, sj.String)
Out[10]: <HashMap: {}>
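
A minimal sketch of the ordering rule the session demonstrates (module names
taken from the transcript above):

    import lucene                  # the egg that owns the shared classes
    from solrpie import initvm

    sj = initvm.solrpie_java

    # wrappers are importable earlier, but constructing shared classes
    # only works once the owning extension's VM entry point has run
    lucene.initVM()
    sj.initVM()

    m = sj.HashMap().of_(sj.String, sj.String)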


--module option not playing nicely with relative paths

2011-01-13 Thread Roman Chyla
Hi,

Until recently, I wasn't using the --module parameter. But now I do,
and the compilation was failing because I am not building things in the
top folder but from inside build - to avoid clutter.

I believe I discovered a bug and I am sending a patch. Basically,
jcc.py is copying modules into the build dir.


my project is organized as:


  build
    build
  java
  python
    packageA
    packageB

I build things inside build; if I specify a relative path, --module
'../python/packageA', jcc will correctly copy the tree structure,
resulting in


extension
  packageA
  packageB


However, the package names (for distutils setup) will be set to
['extension', 'extension..python.packageA',
'extension..python.packageB']
Which ends up in this error:


 [exec] running install
 [exec] running bdist_egg
 [exec] running egg_info
 [exec] writing solrpie_java.egg-info/PKG-INFO
 [exec] writing top-level names to solrpie_java.egg-info/top_level.txt
 [exec] writing dependency_links to
solrpie_java.egg-info/dependency_links.txt
 [exec] warning: manifest_maker: standard file '__main__.py' not found
 [exec] error: package directory
'build/solrpie_java/python/solrpye' does not exist

Cheers,

  roman
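
Until a fix lands, a possible workaround is to hand jcc an absolute path so
no '..' segments survive into the distutils package names (invocation shape
assumed, not verbatim from the thread):

    import os
    import subprocess
    import sys

    module_dir = os.path.abspath('../python/packageA')

    # remaining jcc arguments for the project omitted
    subprocess.call([sys.executable, '-m', 'jcc.__main__',
                     '--shared',
                     '--module', module_dir])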


Re: call python from java - what strategy do you use?

2011-01-12 Thread Roman Chyla
Hi Andi,

I think I will give it a try, if only because I am curious. Please see
one remaining question below.


On Tue, Jan 11, 2011 at 10:37 PM, Andi Vajda va...@apache.org wrote:


 On Tue, 11 Jan 2011, Roman Chyla wrote:

 Hi Andy,

 This is much more than I could have hoped! Just yesterday, I was
 looking for ways how to embed Python VM in Jetty, as that would be
 more natural, but found only jepp.sourceforge.net and off-putting was
 the necessity to compile it against the newly built python. I could
 not want it from the guys who may need my extension. And I realize
 only now, that embedding Python in Java is even documented on the
 website, but honestly i would not know how to do it without your
 detailed examples.

 Now to the questions, I apologize, some of them or all must seem very
 stupid to you

 - pylucene is used on many platforms and with jcc always worked as
 expected (i love it!), but is it as reliable in the opposite
 direction? The PythonVM.java loads jcc library, so I wonder if in
 principle there is any difference in the directionality - but I am not
 sure. To rephrase my convoluted question: would you expect this
 wrapping be as reliable as wrapping java inside python is now?

 I've been using this for over two years, in production.
 My main worry was memory leaks because a server process is expected to stay
 up and running for weeks at a time and it's been very stable on that front
 too. Of course, when there is a bug somewhere that causes your Python VM to
 crash, the entire server crashes. Just like when the JVM crashes (which is
 normally rare). In other words, this isn't any less reliable than a
 standalone Python VM process. It can be tricky, but is possible, to run gdb,
 pdb and jdb together to step through the three languages involved, python,
 java and C++. I've had to do this a few times but not in a long time.

 - in the past, i built jcc libraries on one host and distributed them on
 various machines. As long the family OS and the python main version were the
 same, it worked on Win/Lin/Mac just fine. As far as I can tell, this does
 not change, or will it be dependent on the python against which the egg was
 built?

 Distributing binaries is risky. The same caveats apply. I wouldn't do it,
 even in the simple PyLucene case.

unfortunately, I don't have that many choices left - this is not for
some client-software scenario, we are running the jobs on the grid,
and there I cannot compile the binaries. So, if previously the
location of the python interpreter or python minor version did not
cause problems, now perhaps it will be different. But that wasn't for
the Solr, wrapping Solr is not meant for the grid.


 - now a little tricky issue; when I wrap jetty inside python, I hoped
 to build it in a shared mode with lucene to be able to do some
 low-level lucene indexing tasks from inside Python. If I do the
 opposite and wrap Python VM in Java, I would still like to access the
 lucene (which is possible, as I see well from your examples) But on
 the python side, you are calling initVM() - will the initVM() call
 create a new Java VM or will it access the parent Java VM which
 started it?

 No, initVM() in this case just initializes your egg and adds its stuff to
 the CLASSPATH. No Java VM init is done. As with any shared-mode JCC-built
 extension, all calls to initVM() but the first one just do that.
 The first call to initVM() in the embedding Python case is like that too
 because there already is a Java VM running when PythonVM is instantiated and
 called.

And if, in the python, I do:

import lucene
lucene.initVM(lucene.CLASSPATH)

Will it work in this case, giving access to the java classes from
inside python? Or will I have to forget pylucene and prepare some
extra java classes (the jcc in reverse trick, as you put it)?


 - you say that threads are not managed by the Python VM, does that
 mean there is no Python GIL?

 No, there is a Python GIL (and that is the Achilles' heel of this setup if
 you expect high concurrent servlet performance from your server calling
 Python). That Python GIL is connected to this thread state I was mentioning
 earlier. Because the thread is not managed by Python, when Python is called
 (by way of the code generated by JCC) it doesn't find a thread state for the
 thread and creates one. When the call completes, the thread state is
 destroyed because its refcount goes to zero. My TerminatingThread class
 acquires a Python thread state and keeps it for the life of the thread,
 thereby working around this problem.

OK, this then looks like a normal Python - which is somehow making me
less worried :) I wanted to use multiprocessing inside python to deal
with GIL, and I see no reason why it should not work in this case.
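
A minimal sketch of that route (each worker is a separate process with its
own GIL, so presumably each needs its own initVM(); nothing from the
parent's JVM survives the fork):

    from multiprocessing import Pool
    import lucene

    def init_worker():
        lucene.initVM()            # one JVM per worker process

    def work(item):
        # JVM calls are safe here, inside an initialized worker
        return item

    if __name__ == '__main__':
        pool = Pool(processes=4, initializer=init_worker)
        print pool.map(work, range(10))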

Thank you very much.
Cheers,

  roman


 - I don't really know what is exactly in the python thread local
 storage, could that somehow negatively affect the Python process if
 acquireThreadState/releaseThreadState

Re: call python from java - what strategy do you use?

2011-01-12 Thread Roman Chyla
Hi Andi, all,

I tried to implement the PythonVM wrapping on Mac 10.6, with JDK
1.6.22, jcc is freshly built, in shared mode, v. 2.6. The python is
the standard Python distributed with MacOsX

When I try to run the java, it throws an error when it gets to:

static {
System.loadLibrary("jcc");
}

I am getting this error:

Exception in thread "main" java.lang.UnsatisfiedLinkError:
/Library/Python/2.6/site-packages/JCC-2.6-py2.6-macosx-10.6-universal.egg/libjcc.dylib:
 Symbol not found: _PyExc_RuntimeError   Referenced from:
/Library/Python/2.6/site-packages/JCC-2.6-py2.6-macosx-10.6-universal.egg/libjcc.dylib
  Expected in: flat namespace  in
/Library/Python/2.6/site-packages/JCC-2.6-py2.6-macosx-10.6-universal.egg/libjcc.dylib
at java.lang.ClassLoader$NativeLibrary.load(Native Method)
at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1823)
at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1746)
at java.lang.Runtime.loadLibrary0(Runtime.java:823)
at java.lang.System.loadLibrary(System.java:1045)
at org.apache.jcc.PythonVM.<clinit>(PythonVM.java:23)
at rca.solr.JettyRunnerPythonVM.start(JettyRunnerPythonVM.java:53)
at rca.solr.JettyRunnerPythonVM.main(JettyRunnerPythonVM.java:139)


MacBeth:JCC-2.6-py2.6-macosx-10.6-universal.egg rca$ nm libjcc.dylib | grep Exc
 U _PyExc_RuntimeError
 U _PyExc_TypeError
 U _PyExc_ValueError
3442 T __ZNK6JCCEnv15reportExceptionEv
21f0 T __ZNK6JCCEnv23getPythonExceptionClassEv


Any pointers to what I could be doing wrong? Note, I haven't built any
emql.egg yet; I just run my java program and try to start PythonVM() and see if
that works.

Thanks,

  roman




Re: call python from java - what strategy do you use?

2011-01-12 Thread Roman Chyla
Hi Andi,

Thanks for the help, now I was able to run the java and loaded
PythonVM. I then built the python egg, after a bit of fiddling with
parameters, it seems ok. I can import the jcc wrapped python class and
call it:

In [1]: from solrpie_java import emql

In [2]: em = emql.Emql()

In [3]: em.javaTestPrint()
java is printing

In [4]: em.pythonTestPrint()
just a test

But I haven't found out how to call the same from java.

The egg is built fine, it is named solrpie_java and contains one python module:

==

from solrpie_java import initVM, CLASSPATH, EMQL

initVM(CLASSPATH)


class Emql(EMQL):
    '''
    classdocs
    '''

    def __init__(self):
        super(Emql, self).__init__()
        print '__init__'

    def init(self, me):
        print self, me
        return 'init'

    def emql_refresh(self, tid, type):
        print self, tid, type
        return 'refresh'

    def emql_status(self):
        return "some status"

    def pythonTestPrint(self):
        print 'just a test'


The corresponding java class looks like this:


public class EMQL {

   private long pythonObject;

   public EMQL()
   {
   }

   public void pythonExtension(long pythonObject)
   {
   this.pythonObject = pythonObject;
   }
   public long pythonExtension()
   {
   return this.pythonObject;
   }

   public void finalize()
   throws Throwable
   {
   pythonDecRef();
   }

   public void javaTestPrint() {
   System.out.println("java is printing");
   }

   public native void pythonDecRef();

   // the methods implemented in python
   public native String init(EMQL me);
   public native String emql_refresh(String tid, String type);
   public native String emql_status();

   public native void pythonTestPrint();


}

===

I tried running it as:

PythonVM vm = PythonVM.start("sorlpie_java");
EMQL em = new EMQL();
em.javaTestPrint();
em.pythonTestPrint();

I get this:

java is printing
Exception in thread "main" java.lang.UnsatisfiedLinkError:
rca.pythonvm.EMQL.pythonTestPrint()V
at rca.pythonvm.EMQL.pythonTestPrint(Native Method)
at rca.solr.JettyRunnerPythonVM.start(JettyRunnerPythonVM.java:60)
at rca.solr.JettyRunnerPythonVM.main(JettyRunnerPythonVM.java:148)

I understand that java cannot find the linked c++ method, but I don't
know how to fix that.
If i try:

PythonVM vm = PythonVM.start("sorlpie_java");
Object m = vm.instantiate("emql", "Emql");

I get:

org.apache.jcc.PythonException: No module named emql
ImportError: No module named emql

at org.apache.jcc.PythonVM.instantiate(Native Method)
at rca.solr.JettyRunnerPythonVM.start(JettyRunnerPythonVM.java:56)
at rca.solr.JettyRunnerPythonVM.main(JettyRunnerPythonVM.java:148)

I tried various combinations of instantiation, and setting the
classpath or -Djava.library.path,
but no success. What am I doing wrong?

Thank you,

  roman



On Wed, Jan 12, 2011 at 7:55 PM, Andi Vajda va...@apache.org wrote:

 On Wed, 12 Jan 2011, Roman Chyla wrote:

 Hi Andi, all,

 I tried to implement the PythonVM wrapping on Mac 10.6, with JDK
 1.6.22, jcc is freshly built, in shared mode, v. 2.6. The python is
 the standard Python distributed with MacOsX

 When I try to run the java, it throws an error when it gets to:

 static {
        System.loadLibrary("jcc");
   }

 I am getting this error:

 Exception in thread "main" java.lang.UnsatisfiedLinkError:

 /Library/Python/2.6/site-packages/JCC-2.6-py2.6-macosx-10.6-universal.egg/libjcc.dylib:
 Symbol not found: _PyExc_RuntimeError   Referenced from:

 That's because Python's shared library wasn't found. The reason is that, by
 default, Python's shared lib is not on JCC's link line because normally JCC is
 loaded into a Python process and the dynamic linker thus finds the symbols
 needed inside the process.

 Here, since you're not starting inside a Python process, you need to add
 '-framework Python' to JCC's LFLAGS in setup.py so that the dynamic linker
 can find the Python VM shared lib and load it.

 Andi..


 /Library/Python/2.6/site-packages/JCC-2.6-py2.6-macosx-10.6-universal.egg/libjcc.dylib
  Expected in: flat namespace  in

 /Library/Python/2.6/site-packages/JCC-2.6-py2.6-macosx-10.6-universal.egg/libjcc.dylib
        at java.lang.ClassLoader$NativeLibrary.load(Native Method)
        at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1823)
        at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1746)
        at java.lang.Runtime.loadLibrary0(Runtime.java:823)
        at java.lang.System.loadLibrary(System.java:1045)
        at org.apache.jcc.PythonVM.<clinit>(PythonVM.java:23)
        at rca.solr.JettyRunnerPythonVM.start

Re: call python from java - what strategy do you use?

2011-01-12 Thread Roman Chyla
Hi Andi,

Your help is great, thanks a lot! Without your detailed instructions,
I would not be able to figure it out - and the last bit with the
python...I should have thought before writing :-)

I call the class EMQL just because I was lazy to change it. But I will
now that I understand a little bit more. What I find very cool is the
fact that if I build this extension the way you showed me, I can run
java from inside python, but also python from inside Java - and with
one jar and one compiled egg. Very handy. But as you said, the devil is
in the details, so I expect some bumps.

And about the thing with LFLAGS '-framework Python': will other
platforms also need something similar to Mac? I assume this is Mac
dynamic discovery of the libraries; will anything bad happen if I
change the path of the Python now that the extension was built?


Cheers!

  roman

On Wed, Jan 12, 2011 at 11:54 PM, Andi Vajda va...@apache.org wrote:

  Hi Roman,

 On Wed, 12 Jan 2011, Roman Chyla wrote:

 Thanks for the help, now I was able to run the java and loaded
 PythonVM. I then built the python egg, after a bit of fiddling with
 parameters, it seems ok. I can import the jcc wrapped python class and
 call it:

 In [1]: from solrpie_java import emql

 Why are you calling your class EMQL ? (this name was just an example culled
 from my code).

 In [2]: em = emql.Emql()

 In [3]: em.javaTestPrint()
 java is printing

 In [4]: em.pythonTestPrint()
 just a test

 But I haven't found out how to call the same from java.

 Ah, yes, I forgot to tell you how to pull that in.
 In Java, you import that 'EMQL' java class and instantiate it by way of the
 PythonVM instance's instantiate() call:

            import org.blah.blah.EMQL;
            import org.apache.jcc.PythonVM;

            .

            PythonVM vm = PythonVM.get();

            emql = (EMQL) vm.instantiate("jemql.emql", "emql");
            ... call method on emql instance just created ...

 The instantiate("foo", "bar") method in effect asks Python to run
  from foo import bar
  return bar()
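
In other words, roughly this Python (a hedged rendering, not the actual
JCC source):

    def instantiate(module_name, class_name):
        # "from module_name import class_name; return class_name()"
        module = __import__(module_name, fromlist=[class_name])
        return getattr(module, class_name)()

    # e.g. instantiate('solrpie_java.emql', 'Emql') returns an Emql instance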

 Andi..



Re: call python from java - what strategy do you use?

2011-01-11 Thread Roman Chyla
Hi Andy,

This is much more than I could have hoped! Just yesterday, I was
looking for ways how to embed Python VM in Jetty, as that would be
more natural, but found only jepp.sourceforge.net and off-putting was
the necessity to compile it against a newly built python. I could
not ask that of the people who may need my extension. And I realize
only now, that embedding Python in Java is even documented on the
website, but honestly i would not know how to do it without your
detailed examples.

Now to the questions, I apologize, some of them or all must seem very
stupid to you

- pylucene is used on many platforms and with jcc has always worked as
expected (I love it!), but is it as reliable in the opposite
direction? PythonVM.java loads the jcc library, so I wonder if in
principle there is any difference in the directionality - but I am not
sure. To rephrase my convoluted question: would you expect this
wrapping to be as reliable as wrapping java inside python is now?

- in the past, I built jcc libraries on one host and distributed them
on various machines. As long as the OS family and the python main version
were the same, it worked on Win/Lin/Mac just fine. As far as I can
tell, this does not change, or will it be dependent on the python
against which the egg was built?

- now a little tricky issue; when I wrap jetty inside python, I hoped
to build it in a shared mode with lucene to be able to do some
low-level lucene indexing tasks from inside Python. If I do the
opposite and wrap Python VM in Java, I would still like to access the
lucene (which is possible, as I see well from your examples) But on
the python side, you are calling initVM() - will the initVM() call
create a new Java VM or will it access the parent Java VM which
started it?

- you say that threads are not managed by the Python VM, does that
mean there is no Python GIL?

- I don't really know what is exactly in the python thread local
storage, could that somehow negatively affect the Python process if
acquireThreadState/releaseThreadState are not called?

Thank you.

Cheers,

  roman


On Tue, Jan 11, 2011 at 8:13 PM, Andi Vajda va...@apache.org wrote:

  Hi Roman,

 On Tue, 11 Jan 2011, Roman Chyla wrote:

 I have recently wrapped solr inside jetty with JCC (we need to access
 very big result sets quickly, via JNI, but also keep solr running as
 normal) and was wondering what strategies do you guys use to speak
 *from inside* Java towards the Python end.

 So far, I was able to think about these:

 - raise exceptions in java and catch in python (I think I have seen
 this in some posts from Bill Jansen)
 - communicate via sockets
 - wait passively - call some java method and wait for its return
 - monitor actively - in python check in loop some java object

 Is there something else?

 I'm not sure I completely understand your questions but if what you're
 asking is how to run Python code from inside a Java servlet container, that
 I've done with Tomcat and Lucene.

 Basically, instead of embedding a JVM inside a Python VM - as is done for
 PyLucene - you do the opposite, you embed a Python VM inside a JVM.

 For that purpose, see the org.apache.jcc.PythonVM class available in JCC's
 java tree. This class must be instantiated from the main thread at Java
 servlet engine startup time. In Tomcat, I patched some startup code, in
 BootStrap.java (see patches below) for this purpose.

 Then, to make some Python code accessible from Java, use the usual way of
 writing extensions, the so-called JCC in reverse trick. Define a Java
 class
 with some native methods implemented in Python; define a Python class that
 extends it; build the Java class into a JAR; include it into a JCC-built
 egg; install the egg into Python's env (site-packages, PYTHONPATH,
 whatever).
 Then, write servlet code in Java that imports your Java class and calls it.

 As you can see, this sounds simple but the devil is in the details. Of
 course,
 bending Jetty for this may have different requirements but the code snippets
 below should give you a good idea about what's required.

 This approach has been in production running the freebase.com's search
 server
 for over two years now.

 If you have questions, of course, please ask.
 Good luck !

 Andi..

 --
 Patch to Bootstrap.java to use JCC's PythonVM (which initializes the
 embedded
 Python VM)

 --- apache-tomcat-6.0.29-src/java/org/apache/catalina/startup/Bootstrap.java
    2010-07-19 06:02:32.0 -0700
 +++
 apache-tomcat-6.0.29-src/java/org/apache/catalina/startup/Bootstrap.java.patched
    2010-08-04 08:49:05.0 -0700
 @@ -30,16 +30,18 @@
  import javax.management.MBeanServer;
  import javax.management.MBeanServerFactory;
  import javax.management.ObjectName;

  import org.apache.catalina.security.SecurityClassLoad;
  import org.apache.juli.logging.Log;
  import org.apache.juli.logging.LogFactory;

 +import org.apache.jcc.PythonVM;
 +

  /**
  * Boostrap loader for Catalina.  This application

Re: building PyLucene 3.0.2 on Win7/MinGW with Python 2.7

2010-11-23 Thread Roman Chyla
On Mon, Nov 22, 2010 at 9:45 PM, Bill Janssen jans...@parc.com wrote:
 Roman Chyla roman.ch...@gmail.com wrote:

 I had similar/same issue on win xp, it was the space in the java path,
 but i can't recall details. What happens if you change config.py to?
 C:\\Program\ Files\ (x86)\\Java\\jdk1.6.0_22\\lib

 Wouldn't that eval to the same Python string?

Which reminds me I ended up with 'Program\\ Files', but that must have
been for the compilation - so nevermind, sorry, that was another
problem.


 I tried quoting all the spaces in the strings, with no help.

 It's when it attempts to load jcc/_jcc.pyd that it fails.

 One possible problem is that there are two different _jcc submodules
 there:

  -rw-rw-rw-   1 wjanssen root           282 11-22 12:29 _jcc.py
  -rw-rw-rw-   1 wjanssen root           577 11-22 12:29 _jcc.pyc
  -rw-rw-rw-   1 wjanssen root        512418 11-22 12:29 _jcc.pyd

 I'm not sure why, or if that's a problem.

 Using depends.exe on _jcc.pyd says that the missing file is
 Python27.dll, which seems odd.  Where should I find that?

 Bill


 roman



Re: building PyLucene 3.0.2 on Win7/MinGW with Python 2.7

2010-11-22 Thread Roman Chyla
I had a similar/same issue on win xp; it was the space in the java path,
but I can't recall details. What happens if you change config.py to:
C:\\Program\ Files\ (x86)\\Java\\jdk1.6.0_22\\lib

roman

On Mon, Nov 22, 2010 at 7:53 PM, Bill Janssen jans...@parc.com wrote:
 I got a brand-new Windows 7 machine, and thought I'd try building
 PyLucene with a newer version of Python, 2.7, the 32-bit version.

 I also had to move to setuptools-0.6c11, because 0.6c9 doesn't seem to
 work with Python 2.7.  Using 32-bit Java 6.0_22.

 But I can't get JCC to run here:

 sh-3.1$ which jcc.dll
 /c/Python27/Lib/site-packages/JCC-2.6-py2.7-win32.egg/jcc.dll
 sh-3.1$ which jvm.dll
 /c/Program Files (x86)/Java/jre6/bin/client/jvm.dll
 sh-3.1$ python -m jcc.__main__ --help
 c:\Python27\python.exe: DLL load failed: The specified module could not be 
 found.
 sh-3.1$

 sh-3.1$ python -c 'import os; print os.environ.get("PATH")'
 c:\Windows\system32;c:\Windows;c:\Windows\System32\Wbem;c:\Windows\System32\WindowsPowerShell\v1.0\;C:\MinGW\msys\1.0\bin;C:\MinGW\bin;c:\Python27;c:\Program
  Files\apache-ant-1.8.1\bin;c:\Program Files 
 (x86)\Java\jre6\bin\client;c:\Python27\Lib\site-packages\JCC-2.6-py2.7-win32.egg

 It seems to build and install OK, but when I run python in verbose mode,
 I see

 import jcc # directory 
 c:\Python27\lib\site-packages\jcc-2.6-py2.7-win32.egg\jcc
 # c:\Python27\lib\site-packages\jcc-2.6-py2.7-win32.egg\jcc\__init__.pyc 
 matches c:\Python27\lib\site-packages\jcc-2.6-py2.7-win32.egg\jcc\__init__.py
 import jcc # precompiled from 
 c:\Python27\lib\site-packages\jcc-2.6-py2.7-win32.egg\jcc\__init__.pyc
 # c:\Python27\lib\site-packages\jcc-2.6-py2.7-win32.egg\jcc\config.pyc 
 matches c:\Python27\lib\site-packages\jcc-2.6-py2.7-win32.egg\jcc\config.py
 import jcc.config # precompiled from 
 c:\Python27\lib\site-packages\jcc-2.6-py2.7-win32.egg\jcc\config.pyc
 c:\Python27\python.exe: DLL load failed: The specified module could not be 
 found.

 So, what's in jcc/config.py?  Here's what's in it:

 INCLUDES=['C:\\Program Files (x86)\\Java\\jdk1.6.0_22\\include', 'C:\\Program 
 Files (x86)\\Java\\jdk1.6.0_22\\include\\win32']
 CFLAGS=['-fno-strict-aliasing', '-Wno-write-strings']
 DEBUG_CFLAGS=['-O0', '-g', '-DDEBUG']
 LFLAGS=['-LC:\\Program Files (x86)\\Java\\jdk1.6.0_22\\lib', '-ljvm']
 IMPLIB_LFLAGS=['-Wl,--out-implib,%s']
 SHARED=True
 VERSION=2.6

 Any ideas about what's going wrong?  I suspect those parentheses in the
 path to the jvm, myself.

 Bill



Re: PatternAnalyzer not implemented?

2010-10-02 Thread Roman Chyla
Thank you, Andi. Recompiled, it works just fine now.

roman

On Fri, Oct 1, 2010 at 8:28 PM, Andi Vajda va...@apache.org wrote:

 On Fri, 1 Oct 2010, Roman Chyla wrote:

 I tried to use the PatternAnalyzer, but am getting NotImplementedError
 - in case it is not available, shall I rather use PythonAnalyzer and
 implement the regex pattern analyzer with that?

 using version: 2.9.3

 In [44]: import lucene
 In [45]: import pyjama #-- this package contains java.util.regex.Pattern
 In [46]: p = pyjama.Pattern.compile("\\s")
 In [47]: p
 Out[47]: <Pattern: \s>
 In [48]: import lucene.collections as col
 In [49]: s = col.JavaSet([])
 In [50]: s
 Out[50]: <JavaSet: org.apache.pylucene.util.python...@16925b0>
 In [51]: pa = lucene.PatternAnalyzer(p,True,s)

 ---
 NotImplementedError                       Traceback (most recent call
 last)

 /Users/rca/<ipython console> in <module>()

 NotImplementedError: ('instantiating java class', <type
 'PatternAnalyzer'>)

 This is because no constructors were generated for PatternAnalyzer. That in
 turn is because the java.util.regex package is missing from the JCC command
 line in PyLucene's Makefile, causing methods and constructors using classes
 in that package to be skipped.

 To fix this, add
            --package java.util.regex \
 around line 214 to PyLucene's Makefile.

 It is also strongly recommended that you rebuild pyjama with --import lucene
 on the JCC command line so that you don't have JCC generate wrappers again
 for classes that are shared between pyjama and lucene.

 Andi..



PatternAnalyzer not implemented?

2010-10-01 Thread Roman Chyla
Hello,

I tried to use the PatternAnalyzer, but am getting NotImplementedError
- in case it is not available, shall I rather use PythonAnalyzer and
implement the regex pattern analyzer with that?

using version: 2.9.3

In [44]: import lucene
In [45]: import pyjama #-- this package contains java.util.regex.Pattern
In [46]: p = pyjama.Pattern.compile("\\s")
In [47]: p
Out[47]: <Pattern: \s>
In [48]: import lucene.collections as col
In [49]: s = col.JavaSet([])
In [50]: s
Out[50]: <JavaSet: org.apache.pylucene.util.python...@16925b0>
In [51]: pa = lucene.PatternAnalyzer(p,True,s)
---
NotImplementedError   Traceback (most recent call last)

/Users/rca/<ipython console> in <module>()

NotImplementedError: ('instantiating java class', <type 'PatternAnalyzer'>)

In [52]:

Kind regards,

  roman


Re: Issues while connecting PyLucene code to Apache WSGI interface

2010-08-30 Thread Roman Chyla
I recently had a problem with this:
http://stackoverflow.com/questions/548493/jcc-initvm-doesnt-return-when-mod-wsgi-is-configured-as-daemon-mode

you may want to check that too

roman

On Mon, Aug 30, 2010 at 8:50 PM, Andi Vajda va...@apache.org wrote:


 On Mon, 30 Aug 2010, technology inspired wrote:

 Thanks for the reply. My example runs fine when it runs alone (pure
 python).
 Here is the code:

 Ok, then the next step is to port it to a python http server such as [1] so
 that you get the threading and initialization story straight:
  - initVM() must be called from the main thread, once
  - any thread created from Python must call attachCurrentThread() before
    making any other calls that involve the JVM
 I'm not sure how this is done in the apache2/wsgi environment, that is a
 question for another forum. That being said, if you solve this problem,
 posting your answer here would be helpful as this has come up before.

 About the errors you're reporting, what you're seeing in your browser is
 irrelevant. Instead, you must log errors that happen on the Python side and
 look for these stacktraces there.

 Andi..

 [1] http://docs.python.org/library/simplehttpserver.html
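
A minimal sketch of those two rules together (hedged; the handler function
is made up):

    import threading
    import lucene

    env = lucene.initVM()          # once, from the main thread

    def handler():
        # any thread not created by the JVM must attach itself before
        # touching wrapped classes
        env.attachCurrentThread()
        # ... Lucene calls are safe from here on ...

    t = threading.Thread(target=handler)
    t.start()
    t.join()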



 #import sys, os
 #sys.path.append("/home/v/workspace/example-project/src/trunk")
 #os.environ['DJANGO_SETTINGS_MODULE'] = 'example.settings'
 from lucene import Field, Document, initVM, NIOFSDirectory, IndexWriter, \
     StandardAnalyzer, Version, File
 from lucene import SimpleFSLockFactory, NumericField, IndexSearcher, \
     QueryParser, NumericRangeQuery
 from lucene import Integer, BooleanQuery, BooleanClause
 #from django.shortcuts import render_to_response

 def build():
     initVM()
     dir = NIOFSDirectory(File("/home/v/index"), SimpleFSLockFactory())
     analyzer = StandardAnalyzer(Version.LUCENE_30)
     writer = IndexWriter(dir, analyzer, True,
                          IndexWriter.MaxFieldLength(1024))

     field_rows = FieldDoc.objects.all()  # Currently there is only one row in database
     for row in field_rows:
         doc = Document()
         if row.category != "":
             doc.add(Field('category', row.category, Field.Store.YES,
                           Field.Index.NOT_ANALYZED))
         writer.addDocument(doc)

     writer.close()
     #return render_to_response("index.html", {"var": "Success"})

 But when I connect it with httpd/mod_wsgi, I see the Success page
 sometimes, and other times it says Internal Server Error with the errors
 as mentioned in the previous email. I am not aware of the best practice
 for running
 Python Lucene code from a web server.

 You have mentioned about using attachCurrentThread(). I tried using it
 this
 way:
 env = initVM()
 env.attachCurrentThread()

 but no change in the response. I don't know if this is how
 attachCurrentThread() should be used in the above build function. Please
 advise on how to connect Lucene code with Apache2/wsgi. My apache2/wsgi is
 configured
 properly as I can run non lucene coded web pages. Apache2 is using
 mpm-worker, a threaded environment.

 Thanks.

 Regards,
 Vin



 On Sun, Aug 29, 2010 at 12:21 PM, Andi Vajda va...@apache.org wrote:

      On Sun, 29 Aug 2010, technology inspired wrote:

            I am using PyLucene 3.0.2 on Ubuntu 10.04 with
            Python 2.6.5 and Sun Java
            1.6. I am written an example script to build index
            and store in a directory.
            Later on, I want it to search in my next example
            script which as of now I
            haven't written.

            There are two issues I have to mention and looking
            for your help:

            ISSUE 1:
            I am using Apache2 with mod_wsgi 3.3. I have got the
            index building script
            connected to a GET request. When I call that GET
            request, I get following
            errors:

            [error] [client 127.0.0.1] Premature end of script
            headers: wsgi
            [notice] child pid exit signal Aborted (6).

            With this error, I see Internal Server Error on my
            browser screen. This
            error appears only if I make GET request very often,
            i.e. around 1 per 2
            seconds. If I issue GET at the interval of 10
            seconds, I don't see these
            errors.

            ISSUE 2:
            When I index Date field using NumericField, the GET
            request gives Internal
            Server Error on every alternate request. and the
            Apache2 log files gets
            these errors:
            [error] [client 127.0.0.1] Premature end of script
            headers: wsgi
            [notice] child pid exit signal Segmentation fault
            (11)

            I am looking for help to solve these problems. I am
            running WSGI deamon
            mode. WSGI settings are:
            ...
            WSGIDaemonProcess example.com user=www-data
            group=www-data threads=25
            WSGIProcessGroup example.com
            WSGIScriptAlias /
            

InvalidArgsError - passing TopDocs object

2010-08-24 Thread Roman Chyla
Hi,

I am trying to understand PyLucene more and to see if it is faster to
retrieve result ids with java instead of with Python. The use case is
to retrieve millions of recids -- with python, 700K ids takes about
1.5s (even if the query takes just a fraction of that).

I wrote a simple java class (works in java) which returns an array of
ints. I have wrapped it with jcc and it is visible from inside python,
but calling the static method throws InvalidArgsError (below is an
example python session).

JCC is version 2.4, built with shared mode -- the DistUtils is in a
different package than lucene (ie. not inside lucene jars). Can this
problem be similar to passing jcc-wrapped objects between different
jcc-packages? http://search-lucene.com/m/SPgeW1hDtAw1

The java class is very simple:

import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

public class DumpUtils {
    public static int[] GetDocIds(TopDocs topdocs) {
        int[] out = new int[topdocs.totalHits];
        ScoreDoc[] hits = topdocs.scoreDocs;
        for (int i = 0; i < topdocs.totalHits; i++) {
            out[i] = hits[i].doc;
        }
        return out;
    }
}

Thanks for any help/pointers,

   roman


Here is an example python session:

In [1]: import pyjama

In [2]: pyjama.initVM(pyjama.CLASSPATH)
Out[2]: <jcc.JCCEnv object at 0x00C0E1F0>

In [3]: import lucene as lu

In [4]: pyjama.DumpUtils
Out[4]: <type 'DumpUtils'>

In [5]: pyjama.DumpUtils.GetDocIds
Out[5]: <built-in method GetDocIds of type object at 0x0189E780>

In [6]:

In [7]: import newseman.pyjamic.slucene.searcher as se

In [8]: s = se.Searcher();s.open('/tmp/whisper/')

In [9]: hits = s._search(s._query('key:bo*',None), 50)

In [10]: hits
Out[10]: <TopDocs: org.apache.lucene.search.topd...@480457>

In [11]:

In [12]: pyjama.DumpUtils.GetDocIds(hits)
---
InvalidArgsError  Traceback (most recent call last)

InvalidArgsError: (<type 'DumpUtils'>, 'GetDocIds', <TopDocs: org.apache.lucene.
search.topd...@480457>)


Re: InvalidArgsError - passing TopDocs object

2010-08-24 Thread Roman Chyla
Thank you very much, Andi.
Best,

  roman

On Tue, Aug 24, 2010 at 5:36 PM, Andi Vajda va...@apache.org wrote:

 On Aug 24, 2010, at 8:03, Roman Chyla roman.ch...@gmail.com wrote:

 I am trying to understand PyLucene more and to see if it is faster to
 retrieve result ids with java instead of with Python. The use case is
 to retrieve millions of recids -- with python, 700K ids takes about
 1.5s. (even if query takes just fraction of that).

 I wrote a simple java code (works in java) which returns array of
 ints. I have wrapped it with jcc, it is visible from inside python,
  but calling the static method throws InvalidArgsError (below is an
 example python session)

 JCC is version 2.4, built with shared mode -- the DistUtils is in a
 different package than lucene (ie. not inside lucene jars). Can this
 problem be similar to passing jcc-wrapped objects between different
 jcc-packages? http://search-lucene.com/m/SPgeW1hDtAw1

 The java class is very simple:

 import org.apache.lucene.search.TopDocs;

 public class DumpUtils {
   public static int[] GetDocIds(TopDocs topdocs) {
       int[] out;
       out = new int[topdocs.totalHits];
       ScoreDoc[] hits = topdocs.scoreDocs;
        for (int i=0; i < topdocs.totalHits; i++) {
           out[i] = hits[i].doc;
       }
       return out;
   }
 }

 Thanks for any help/pointers,

 Ah yes, importing separately built extensions that share classes (or
 dependencies) didn't work until support for the --import parameter was added
 in jcc 2.6 to solve the problem of incompatible shared classes. To make this
 work:
  - first, build PyLucene as usual, with --shared
  - then, build your DistUtils package with --import lucene and with --shared

 That way, instead of generating code and wrapper classes again for the
 lucene classes, jcc will import them at build time thus making a much
 smaller library and faster build. The resulting shared library is linked
 against the lucene one.

 See docs and list archives about --import for more examples. Then, when
 running all this, you should also import lucene first, then your other
 package.
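
 In command form, the two builds might look roughly like this (a sketch;
 the jar and module names here are hypothetical):

   # 1) build PyLucene itself with --shared (its Makefile invokes jcc)
   # 2) then build the helper package, importing the lucene wrappers
   #    instead of re-generating them:
   python -m jcc --shared --import lucene \
       --classpath dumputils.jar --include dumputils.jar \
       DumpUtils --python pyjama --build --install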

 Andi..


  roman


 Here is an example python session:

 In [1]: import pyjama

 In [2]: pyjama.initVM(pyjama.CLASSPATH)
 Out[2]: <jcc.JCCEnv object at 0x00C0E1F0>

 In [3]: import lucene as lu

 In [4]: pyjama.DumpUtils
 Out[4]: <type 'DumpUtils'>

 In [5]: pyjama.DumpUtils.GetDocIds
 Out[5]: <built-in method GetDocIds of type object at 0x0189E780>

 In [6]:

 In [7]: import newseman.pyjamic.slucene.searcher as se

 In [8]: s = se.Searcher();s.open('/tmp/whisper/')

 In [9]: hits = s._search(s._query('key:bo*',None), 50)

 In [10]: hits
 Out[10]: <TopDocs: org.apache.lucene.search.topd...@480457>

 In [11]:

 In [12]: pyjama.DumpUtils.GetDocIds(hits)

 ---------------------------------------------------------------------------
 InvalidArgsError                          Traceback (most recent call last)

 InvalidArgsError: (<type 'DumpUtils'>, 'GetDocIds', <TopDocs: org.apache.lucene.search.topd...@480457>)



_addClasspath question

2010-08-18 Thread Roman Chyla
Hi,

I have noticed that JCC 2.6 has an env.classpath attribute and also a
method _addClassPath().

When I use _addClassPath(), env.classpath shows the change -- can we
use this method to extend the classpath when the VM is already running?
And will it stay? The leading underscore suggests it is not meant to be
public.
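
For the record, the usage being asked about would look something like
this (a sketch only -- whether _addClassPath() is supported API, and
where exactly it lives, is precisely the question; the jar path is
hypothetical):

```python
import lucene

env = lucene.initVM()       # the VM is now running
print env.classpath         # the classpath attribute mentioned above
lucene._addClassPath('/path/to/extra.jar')   # underscore: private API?
print env.classpath         # shows the change, per the observation above
```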

Thanks,

roman


Re: Building PyLucene on Windows

2010-03-09 Thread Roman Chyla
Hi,

I would also like to thank Andi (and others?) for the great tool
and the samples; it is really excellent. I am using MSVC 7.1 on Win XP,
and it builds fine, but it was quite difficult at the beginning
(especially because I tried with mingw before falling back to msvc).

And indeed, is gnu make indispensable? In some previous posts it was
said that Ant is not an option (makes Python programmers scream and
run away) and 'make' is there because nobody has provided anything else.
This naturally brings us to the practical problem: it can be done, but
somebody has to DO IT, right? ;-) What would you think about scons?
http://www.scons.org/


roman

On Tue, Mar 9, 2010 at 3:50 PM, Andi Vajda va...@apache.org wrote:

 On Mar 9, 2010, at 13:13, Thomas Koch k...@orbiteam.de wrote:

 Dear PyLucene-fans,

 I just managed to build pylucene-2.9.1-1 on Windows with Python 2.6 and
 Java 1.6, and I would like to tell my 'story' - just in case anyone else
 runs into similar problems...

 First I should mention that I've been using PyLucene for quite a while
 now - I just never needed to build it on Windows - there used to be
 binary distributions on the net (here:
 http://code.google.com/p/pylucene-win32-binary/ - however, it's
 out-of-date). Also, I am a bit familiar with Makefiles, Ant and other
 toolchains...

 Next it should be said that not only is PyLucene a great piece of
 software, but the documentation (and samples / test-suite) is also very
 well maintained.

 The only thing that's missing, from my point of view, is clear advice on
 the requirements for building PyLucene on specific platforms. Maybe
 that's also the cause of the trouble I had in building it ... I knew I
 needed a C++ compiler, Ant, Java and Python. Also, as a Makefile is
 used, some kind of make utility would be needed. So here's the setup I
 chose:

 - Python 2.6.4 (r264:75708, Oct 26 2009, 08:23:19)
 - Java 1.6 (jdk1.6.0_06)
 - compiler: MS-Visual-Studio9 (Microsoft Visual C++ 2008 Express Edition)
 - mingw32-make from MinGW-5.1.6 - see http://www.mingw.org/
 (GNU Make 3.81 built for i386-pc-mingw32)
 - ANT 1.8.0
 - pylucene-2.9.1-1 / lucene-java-2.9.1
 - Windows7

 Building JCC was no problem. The first issues came up when entering
 the make toolchain: apparently there are some differences on Windows
 that either my Windows binary of GNU make couldn't handle very well or
 that need to be fixed for Windows anyway...

 This especially holds for path separators and command separators. For
 example I had to change


 $(LUCENE_JAR): $(LUCENE)
    cd $(LUCENE) ; $(ANT) -Dversion=$(LUCENE_VER)
 to
 $(LUCENE_JAR): $(LUCENE)
    cd $(LUCENE) && $(ANT) -Dversion=$(LUCENE_VER)

 (took me a while to figure this out ;-)

 PYLUCENE:=$(shell pwd)
 to
 PYLUCENE:=$(shell cd)

 BUILD_TEST:=$(PYLUCENE)/build/test
 to
 BUILD_TEST:=$(PYLUCENE)\build\test

 (note: cd may work with / but when it comes to mkdir this fails - e.g.
 mingw32-make test
 mkdir -p pylucene-2.9.1-1/build/test
 Syntaxfehler. [cmd.exe output, German for "syntax error"]
 mingw32-make: *** [install-test] Error 1
 )


 Finally, here are my Makefile settings:

 # Windows  (Win32, Python 2.6, Java 1.6, ant 1.8)
 SHELL=cmd.exe
 PYLUCENE:=$(shell cd)
 ANT=F:\devel\apache-ant-1.8.0\bin\ant
 JAVA_HOME=C:\\Program Files\\Java\\jdk1.6.0_06
 PREFIX_PYTHON=C:\\Python26
 PYTHON=$(PREFIX_PYTHON)\python.exe
 JCC=$(PYTHON) -m jcc.__main__
 NUM_FILES=3


 So either I've chosen the wrong tools, or there should be others with
 similar problems. If my toolchain is wrong or unsupported, please
 advise. Is it recommended/required to use Cygwin on Windows?

 Yes, cygwin is required so that you have a functional gnu make.
 Note that you still need to use a MS compiler or mingw, which some people
 have been able to use.

 I test build pylucene every now and then on an old win2k system with cygwin
 (for make and shell) and msvc 7.1. Not a setup with the most recent software
 but that's all I've got for windows.

 Andi..


 If anyone is interested I can offer to
 - post my adapted Makefile here (or on the web)
 - provide binary version of PyLucene (on the web)

 Finally some suggestion: wouldn't it be possible to skip the Makefile
 completely? I'm not that familiar with ANT but know it has been developed
 to
 provide platform independant built processes - and it includes shell-tasks
 for anything that is not java... (I know this could be some work, just
 wanted to know if this question has been raised before or if this is a
 no-go
 option ?)

 best regards

 Thomas Koch
 --
 OrbiTeam Software GmbH & Co. KG
 Endenicher Allee 35
 53121 Bonn Germany
 i...@orbiteam.de
 http://www.orbiteam.de






Re: unload JVM?

2010-02-28 Thread Roman Chyla
 - consecutive calls to initVM raise errors

 Only if you use parameters other than classpath, right ?

yes

 Or did you find a different problem ?

 - in my program components interact with several JCC wrapped libraries

 Normally, it is no problem, but clashes may occur - especially in a GUI
 when running complex workflows - the solution (in theory) would be to
 destroy the JVM and load it again. Is it possible?

 In theory, it might be. The JNI API has a call to destroy a VM.
 http://java.sun.com/j2se/1.5.0/docs/guide/jni/spec/invocation.html
 But cleanly doing so is rather tricky, so JCC doesn't support it.

I tried (naively) to add a destroyJavaVM call into the source, recompiled
it and tried calling it from a thread. The destroy call returns 0, but
as soon as I delete the references and the object is garbage collected
(probably), Python crashes. I have no idea what's going on :-)
Browsing through the Sun bug reports, it seems it was never possible
(i.e. DestroyJavaVM never worked, at least for others, who apparently
understood what they were doing).
But if it is possible and one day it happens, cool. JCC is really a
blessing for connecting Python with Java, and I imagine more people will
start using it - and with more programmers using it, there will come
more Python packages in one installation...

roman


 A different approach to supporting your use case might be to consider
 compiling all your JCC-wrapped libraries into one, picking only the APIs
 you need so as to control the size of the resulting library.
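
 Such a combined build might look roughly like this (a sketch modeled on
 the jcc invocations elsewhere in these threads; the jar and module names
 are hypothetical):

   # wrap the classes from both jars into a single Python extension,
   # so one VM and one classpath serve all the libraries
   python -m jcc --shared --python combined \
       --jar my-lucene-classes.jar --jar gate-wrappers.jar \
       --build --install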

 Andi..



Re: starting several modules in one VM

2010-02-10 Thread Roman Chyla
Thank you Andi for checking,

I am able to reproduce it again (please see below). My problem is
probably two packages that both contain lucene (I started to play with
PyLucene only recently, and the older code is there doing other work).

The lucene module is PyLucene (lucene 2.9.1), and pyjama contains GATE
and also my own lucene (2.9.1) - so, effectively, I have two lucenes:

 - in the pyjama package, the jcc wrapper was built only for my
own classes (which talk to java-lucene behind the scenes)
 - when the ClassNotFoundException happens, Java is apparently searching
inside pyjama's jars (and the classes are only in lucene's jars)


So that brings me to the question: is it safe to mix Python
packages that contain the same Java classes? Or is it not recommended at
all?

Best,

Roman


In [1]: import lucene,pyjama

In [2]: pyjama.initVM(pyjama.CLASSPATH, vmargs='-Dgate.site.config=C:/dev/workspace/newseman/src/merkur/cfg//gate.xml,-Dgate.plugins.home=C:/dev/workspace/newseman/src/merkur/cfg/ANNIE/plugins,-Dgate.user.config=C:/dev/workspace/newseman/src/merkur/cfg//user.xml,-Dgate.user.session=C:/dev/workspace/newseman/src/merkur/cfg//gate.session,-Xms32m,-Xmx256m')
Out[2]: <jcc.JCCEnv object at 0x00A4E7B0>

In [3]: lucene.initVM(lucene.CLASSPATH)
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/lucene/analysis/ar/ArabicAnalyzer
Caused by: java.lang.ClassNotFoundException: org.apache.lucene.analysis.ar.ArabicAnalyzer
        at java.net.URLClassLoader$1.run(Unknown Source)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(Unknown Source)
        at java.lang.ClassLoader.loadClass(Unknown Source)
        at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
        at java.lang.ClassLoader.loadClass(Unknown Source)
---------------------------------------------------------------------------
JavaError                                 Traceback (most recent call last)

C:\dev\WORKSP~1\newseman\utils\pyjama\build\dist\pyjama\<ipython console> in <module>()

JavaError: java.lang.NoClassDefFoundError: org/apache/lucene/analysis/ar/ArabicAnalyzer
Java stacktrace:
java.lang.NoClassDefFoundError: org/apache/lucene/analysis/ar/ArabicAnalyzer
Caused by: java.lang.ClassNotFoundException: org.apache.lucene.analysis.ar.ArabicAnalyzer
        at java.net.URLClassLoader$1.run(Unknown Source)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(Unknown Source)
        at java.lang.ClassLoader.loadClass(Unknown Source)
        at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
        at java.lang.ClassLoader.loadClass(Unknown Source)




On Tue, Feb 9, 2010 at 10:25 PM, Andi Vajda va...@apache.org wrote:

 On Tue, 9 Feb 2010, Roman Chyla wrote:

 I wanted to ask if there was any progress on this issue (extending
 classpath runtime):

 http://lists.osafoundation.org/pipermail/pylucene-dev/2008-March/002455.html

 Yes, this should work provided you invoke jcc with --shared when building
 your modules.

 I just verified this worked by using PyLucene and PyPDFBox together, both
 built with --shared.
 (Note that with a recent JCC, you no longer need to pass the classpath to
 initVM(), the parameter is defaulted to the module's CLASSPATH variable):

     yuzu:vajda python
     Python 2.6.2 (r262:71600, Sep 20 2009, 20:40:09)
     [GCC 4.2.1 (Apple Inc. build 5646)] on darwin
     Type "help", "copyright", "credits" or "license" for more information.
     >>> import pdfbox
     >>> pdfbox.initVM(vmargs='-Djava.awt.headless=true')
     <jcc.JCCEnv object at 0x1004030d8>
     >>> import lucene
     >>> lucene.initVM()
     <jcc.JCCEnv object at 0x1004034e0>
     >>> lucene.Document()
     <Document: Document<>>
     >>> pdfbox.PDFTextStripper()
     <PDFTextStripper: org.apache.pdfbox.util.pdftextstrip...@83e96cf>
     >>>

 or in a different order:

     yuzu:vajda python
     Python 2.6.2 (r262:71600, Sep 20 2009, 20:40:09)
     [GCC 4.2.1 (Apple Inc. build 5646)] on darwin
     Type "help", "copyright", "credits" or "license" for more information.
     >>> import lucene, pdfbox
     >>> lucene.initVM(vmargs='-Djava.awt.headless=true')
     <jcc.JCCEnv object at 0x1004030d8>
     >>> pdfbox.initVM()
     <jcc.JCCEnv object at 0x100403150>
     >>> lucene.Document()
     <Document: Document<>>
     >>> pdfbox.PDFTextStripper()
     <PDFTextStripper: org.apache.pdfbox.util.pdftextstrip...@6548f8c8>

 The vmargs='-Djava.awt.headless=true' parameter to the first initVM() is
 required by pdfbox. The first initVM() call starts and initializes the Java
 VM, the second one just updates its classpath and cannot change or set
 vmargs.

 Andi..



Re: starting several modules in one VM

2010-02-10 Thread Roman Chyla

 So that brings me to question like if it is safe to mix python
 packages that contain the same java classes? Or not recommended at
 all?

 Hmm, not sure. I've never tried that. It seems a little insane to me.
 Where this may cause trouble is with the different sets of wrappers jcc has
 generated for the same classes. I don't expect them to be usable

So I think I should include two jars in my pyjama package: one with my
own lucene classes, the other with lucene.jar -- and put lucene.jar on
the classpath only if PyLucene is not available on the system (see the
sketch below)

 interchangeably since which methods get wrapped depends on the transitive
 closure of dependencies that was computed during generation.
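
Concretely, the fallback idea above might look like this (a sketch; the
jar names are hypothetical):

```python
# sketch: put the bundled lucene jar on the classpath only when
# PyLucene is not installed on the system
import os

classpath = ['lucene-standalone-pyjama-0.1.jar']
try:
    import lucene                            # PyLucene is available
    classpath.append(lucene.CLASSPATH)       # use its own jars
except ImportError:
    classpath.append('bundled-lucene.jar')   # hypothetical fallback jar

import pyjama
pyjama.initVM(os.pathsep.join(classpath))
```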

 That being said, I don't see why the classes would not be found in the first
 place.

 What are the _exact_ jcc invocations you used to build both extensions ?

python -m jcc --shared --package java.util java.util.ArrayList
newseman.gate.PythonicAnnie newseman.lucene.whisperer.LuceneWhisperer
newseman.lucene.whisperer.IndexDictionary --python pyjama --build
--classpath 
../build/jar/lucene-standalone-pyjama-0.1.jar;../build/jar/gate-standalone-pyjama-0.1.jar
--include ../build/jar/lucene-standalone-pyjama-0.1.jar --include
../build/jar/gate-standalone-pyjama-0.1.jar --bdist --version 0.1

pylucene is 2.9.1 and I didn't change anything besides the windows section:

PREFIX_PYTHON=/cygdrive/c/dev/Python251/
ANT=/cygdrive/c/dev/apache-ant-1.7.1/bin/ant
JAVA_HOME=/cygdrive/c/Program Files/Java/jdk1.6.0_12
PYTHON=$(PREFIX_PYTHON)/python.exe
JCC=$(PYTHON) -m jcc --shared
NUM_FILES=3

I have updated the JDK in the meantime (I am using jcc built on
1.6.0_12, now JDK 1.6.0_18) - I can try to recompile JCC and both
extensions with the new JDK, if that makes any sense (?)

roman




 Andi..



starting several modules in one VM

2010-02-09 Thread Roman Chyla
Hi,

I wanted to ask if there was any progress on this issue (extending
classpath runtime):
http://lists.osafoundation.org/pipermail/pylucene-dev/2008-March/002455.html



Here is a longer version:

I would like to use several Java libraries from Python; one of them is
PyLucene, another is GATE, and there are others. I compiled GATE into a
separate egg and, after some experiments, I was able to start two jcc
modules - however, it fails if I initialize my module first and then
lucene.


This works fine, but note that all the -Dgate.* settings are actually
needed for the second call, pyjama.initVM():
==

import lucene
import pyjama

lucene.initVM(lucene.CLASSPATH,
vmargs='-Dgate.site.config=C:/dev/workspace/newseman/src/merkur/cfg//gate.xml,-Dgate.plugins.home=C:/dev/workspace/newseman/src/merkur/cfg/ANNIE/plugins,-Dgate.user.config=C:/dev/workspace/newseman/src/merkur/cfg//user.xml,-Dgate.user.session=C:/dev/workspace/newseman/src/merkur/cfg//gate.session,-Xms32m,-Xmx256m')
pyjama.initVM(pyjama.CLASSPATH)

==

this will fail:

==

import lucene
import pyjama

pyjama.initVM(pyjama.CLASSPATH,
vmargs='-Dgate.site.config=C:/dev/workspace/newseman/src/merkur/cfg//gate.xml,-Dgate.plugins.home=C:/dev/workspace/newseman/src/merkur/cfg/ANNIE/plugins,-Dgate.user.config=C:/dev/workspace/newseman/src/merkur/cfg//user.xml,-Dgate.user.session=C:/dev/workspace/newseman/src/merkur/cfg//gate.session,-Xms32m,-Xmx256m')
lucene.initVM(lucene.CLASSPATH)



ERROR:root:Traceback (most recent call last):
  File "C:\dev\workspace\newseman\src\merkur\runwf.py", line 79, in get_workflow
    execfile(filename, x.__dict__)
  File "wtf_test.py", line 19, in <module>
    from merkur import test
  File "C:\dev\workspace\newseman\src\merkur\test.py", line 24, in <module>
    lucene.initVM(lucene.CLASSPATH)
JavaError: java.lang.NoClassDefFoundError: org/apache/lucene/analysis/ar/ArabicAnalyzer
    Java stacktrace:
java.lang.NoClassDefFoundError: org/apache/lucene/analysis/ar/ArabicAnalyzer
Caused by: java.lang.ClassNotFoundException: org.apache.lucene.analysis.ar.ArabicAnalyzer
        at java.net.URLClassLoader$1.run(Unknown Source)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(Unknown Source)
        at java.lang.ClassLoader.loadClass(Unknown Source)
        at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
        at java.lang.ClassLoader.loadClass(Unknown Source)

===

If I do this, everything is OK:

pyjama.initVM(os.pathsep.join([lucene.CLASSPATH, pyjama.CLASSPATH]), vmargs=...
lucene.initVM(lucene.CLASSPATH)
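
A self-contained version of that workaround (a sketch with the vmargs
shortened; the point is that the union of both classpaths goes to
whichever initVM() runs first):

```python
import os
import lucene, pyjama

# the first initVM() call starts the VM, so give it both classpaths
pyjama.initVM(os.pathsep.join([lucene.CLASSPATH, pyjama.CLASSPATH]),
              vmargs='-Xms32m,-Xmx256m')
# later calls can only extend the classpath of the already-running VM
lucene.initVM(lucene.CLASSPATH)
```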


So it seems to me that the second initVM() call has no effect. And
obviously, I have to make sure that it is me who calls initVM() first,
with the correct arguments (which might be difficult to guarantee).
Am I doing something wrong?


Best,

 Roman