Re: When zero offsets are not bad - a.k.a. multi-token synonyms yet again
Hi Mike, Sorry for the delay, I was away last week. Now that I get back to it again my plan is to write a test for the WordDelimiterFilter and pinpoint the problem. Cheers, Roman On Thu, Aug 20, 2020 at 11:21 AM Michael McCandless wrote: > > Hi Roman, > > No need for anyone to be falling on swords here! This is really complicated > stuff, no worries. And I think we have a compelling plan to move forwards so > that we can index multi-token synonyms AND have 100% correct positional > queries at search time, thanks to Michael Gibney's cool approach on > https://issues.apache.org/jira/browse/LUCENE-4312. > > So it looks like WordDelimiterGraphFilter is producing buggy (out of order > offsets) tokens here? > > Or are you running SynonymGraphFilter after WordDelimiterFilter? > > Looking at that failing example, it should have output'd that spacetime token > immediately after the space token, not after the time token. > > Maybe use TokenStreamToDot to visualize what the heck token graph you are > getting ... > > Mike McCandless > > http://blog.mikemccandless.com > > > On Tue, Aug 18, 2020 at 9:41 PM Roman Chyla wrote: >> >> Hi Mike, >> >> I'm sorry, the problem all the time is inside related to a >> word-delimiter filter factory. This is embarrassing but I have to >> admit publicly and self-flagellate. >> >> A word-delimiter filter is used to split tokens, these then are used >> to find multi-token synonyms (hence the connection). In my desire to >> simplify, I have omitted that detail while writing my first email. 
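For what it's worth, the invariant that DefaultIndexingChain enforces (startOffset must never go backwards relative to the previously emitted token, and each token's endOffset must be >= its startOffset) can be modelled in a few lines of plain Java. This is a sketch, not Lucene's actual code; fed the tail of the failing "space-time" token stream from this thread, it flags the catenated `spacetime` token:

```java
import java.util.List;

// Minimal model of the offset checks discussed in this thread
// (not Lucene's actual code): startOffset must never go backwards
// relative to the previously emitted token, and each token's
// endOffset must be >= its startOffset.
public class OffsetCheck {
    record Token(String term, int start, int end) {}

    // Returns the term of the first offending token, or null if the
    // stream passes the checks.
    static String firstViolation(List<Token> tokens) {
        int lastStart = -1;
        for (Token t : tokens) {
            if (t.start() < 0 || t.end() < t.start() || t.start() < lastStart) {
                return t.term();
            }
            lastStart = t.start();
        }
        return null;
    }

    public static void main(String[] args) {
        // Token order from the failing "MIT and anti de sitter space-time"
        // example: "spacetime" (23..33) is emitted after "time" (29..33),
        // so its startOffset goes backwards.
        List<Token> tokens = List.of(
            new Token("space", 23, 28),
            new Token("time", 29, 33),
            new Token("spacetime", 23, 33));
        System.out.println(firstViolation(tokens)); // prints "spacetime"
    }
}
```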
>> >> I went to generate the stack trace: >> >> ``` >> assertU(adoc("id", "603", "bibcode", "xx603", >> "title", "THE HUBBLE constant: a summary of the HUBBLE SPACE >> TELESCOPE program"));``` >> >> stage:indexer term=xx603 pos=1 type=word offsetStart=0 offsetEnd=13 >> stage:indexer term=acr::the pos=1 type=ACRONYM offsetStart=0 offsetEnd=3 >> stage:indexer term=hubble pos=1 type=word offsetStart=4 offsetEnd=10 >> stage:indexer term=acr::hubble pos=0 type=ACRONYM offsetStart=4 offsetEnd=10 >> stage:indexer term=constant pos=1 type=word offsetStart=11 offsetEnd=20 >> stage:indexer term=summary pos=1 type=word offsetStart=23 offsetEnd=30 >> stage:indexer term=hubble pos=1 type=word offsetStart=38 offsetEnd=44 >> stage:indexer term=syn::hubble space telescope pos=0 type=SYNONYM >> offsetStart=38 offsetEnd=60 >> stage:indexer term=syn::hst pos=0 type=SYNONYM offsetStart=38 offsetEnd=60 >> stage:indexer term=space pos=1 type=word offsetStart=45 offsetEnd=50 >> stage:indexer term=telescope pos=1 type=word offsetStart=51 offsetEnd=60 >> stage:indexer term=program pos=1 type=word offsetStart=61 offsetEnd=68 >> >> that worked, only the next one failed: >> >> ```assertU(adoc("id", "605", "bibcode", "xx604", >> "title", "MIT and anti de sitter space-time"));``` >> >> >> stage:indexer term=xx604 pos=1 type=word offsetStart=0 offsetEnd=13 >> stage:indexer term=mit pos=1 type=word offsetStart=0 offsetEnd=3 >> stage:indexer term=acr::mit pos=0 type=ACRONYM offsetStart=0 offsetEnd=3 >> stage:indexer term=syn::massachusetts institute of technology pos=0 >> type=SYNONYM offsetStart=0 offsetEnd=3 >> stage:indexer term=syn::mit pos=0 type=SYNONYM offsetStart=0 offsetEnd=3 >> stage:indexer term=anti pos=1 type=word offsetStart=8 offsetEnd=12 >> stage:indexer term=syn::ads pos=0 type=SYNONYM offsetStart=8 offsetEnd=28 >> stage:indexer term=syn::anti de sitter space pos=0 type=SYNONYM >> offsetStart=8 offsetEnd=28 >> stage:indexer term=syn::antidesitter spacetime pos=0 type=SYNONYM >> 
offsetStart=8 offsetEnd=28 >> stage:indexer term=de pos=1 type=word offsetStart=13 offsetEnd=15 >> stage:indexer term=sitter pos=1 type=word offsetStart=16 offsetEnd=22 >> stage:indexer term=space pos=1 type=word offsetStart=23 offsetEnd=28 >> stage:indexer term=time pos=1 type=word offsetStart=29 offsetEnd=33 >> stage:indexer term=spacetime pos=0 type=word offsetStart=23 offsetEnd=33 >> >> ```325677 ERROR >> (TEST-TestAdsabsTypeFulltextParsing.testNoSynChain-seed#[ADFAB495DA8F6F40]) >> [] o.a.s.h.RequestHandlerBase >> org.apache.solr.common.SolrException: Exception writing document id >> 605 to the index; possible analysis error: startOffset must be >> non-negative, and endOffset must be >= startOffset, and offsets must >> not go backwards startOffset=23,endOffset=33,lastStar
es(DirectUpdateHandler2.java:969) at org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(DirectUpdateHandler2.java:341) at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:288) at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:235) ... 61 more ``` Embarrassingly Yours, Roman On Mon, Aug 17, 2020 at 10:39 AM Michael McCandless wrote: > > Hi Roman, > > Can you share the full exception / stack trace that IndexWriter throws on > that one *'d token in your first example? I thought IndexWriter checks 1) > startOffset >= last token's startOffset, and 2) endOffset >= startOffset for > the current token. > > But you seem to be hitting an exception due to endOffset check across tokens, > which I didn't remember/realize IW was enforcing. > > Could you share a small standalone test case showing the first example? > Maybe attach it to the issue > (http://issues.apache.org/jira/browse/LUCENE-8776)? > > Thanks, > > Mike McCandless > > http://blog.mikemccandless.com > > > On Fri, Aug 14, 2020 at 12:09 PM Roman Chyla wrote: >> >> Hi Mike, >> >> Thanks for the question! And sorry for the delay, I haven't managed to >> get to it yesterday. I have generated better output, marked with (*) >> where it currently fails the first time and also included one extra >> case to illustrate the PositionLength attribute. 
>> >> assertU(adoc("id", "603", "bibcode", "xx603", >> "title", "THE HUBBLE constant: a summary of the hubble space >> telescope program")); >> >> >> term=hubble posInc=2 posLen=1 type=word offsetStart=4 offsetEnd=10 >> term=acr::hubble posInc=0 posLen=1 type=ACRONYM offsetStart=4 offsetEnd=10 >> term=constant posInc=1 posLen=1 type=word offsetStart=11 offsetEnd=20 >> term=summary posInc=1 posLen=1 type=word offsetStart=23 offsetEnd=30 >> term=hubble posInc=1 posLen=1 type=word offsetStart=38 offsetEnd=44 >> term=syn::hubble space telescope posInc=0 posLen=3 type=SYNONYM >> offsetStart=38 offsetEnd=60 >> term=syn::hst posInc=0 posLen=3 type=SYNONYM offsetStart=38 offsetEnd=60 >> term=acr::hst posInc=0 posLen=3 type=ACRONYM offsetStart=38 offsetEnd=60 >> * term=space posInc=1 posLen=1 type=word offsetStart=45 offsetEnd=50 >> term=telescope posInc=1 posLen=1 type=word offsetStart=51 offsetEnd=60 >> term=program posInc=1 posLen=1 type=word offsetStart=61 offsetEnd=68 >> >> * - fails because of offsetEnd < lastToken.offsetEnd; If reordered >> (the multi-token synonym emitted as a last token) it would fail as >> well, because of the check for lastToken.beginOffset < >> currentToken.beginOffset. Basically, any reordering would result in a >> failure (unless offsets are trimmed). 
>> >> >> >> The following example has additional twist because of `space-time`; >> the tokenizer first splits the word and generate two new tokens -- >> those alternative tokens are then used to find synonyms (space == >> universe) >> >> assertU(adoc("id", "605", "bibcode", "xx604", >> "title", "MIT and anti de sitter space-time")); >> >> >> term=xx604 posInc=1 posLen=1 type=word offsetStart=0 offsetEnd=13 >> term=mit posInc=1 posLen=1 type=word offsetStart=0 offsetEnd=3 >> term=acr::mit posInc=0 posLen=1 type=ACRONYM offsetStart=0 offsetEnd=3 >> term=syn::massachusetts institute of technology posInc=0 posLen=1 >> type=SYNONYM offsetStart=0 offsetEnd=3 >> term=syn::mit posInc=0 posLen=1 type=SYNONYM offsetStart=0 offsetEnd=3 >> term=acr::mit posInc=0 posLen=1 type=ACRONYM offsetStart=0 offsetEnd=3 >> term=anti posInc=1 posLen=1 type=word offsetStart=8 offsetEnd=12 >> term=syn::ads posInc=0 posLen=4 type=SYNONYM offsetStart=8 offsetEnd=28 >> term=acr::ads posInc=0 posLen=4 type=ACRONYM offsetStart=8 offsetEnd=28 >> term=syn::anti de sitter space posInc=0 posLen=4 type=SYNONYM >> offsetStart=8 offsetEnd=28 >> term=syn::antidesitter spacetime posInc=0 posLen=4 type=SYNONYM >> offsetStart=8 offsetEnd=28 >> term=syn::antidesitter space posInc=0 posLen=4 type=SYNONYM >> offsetStart=8 offsetEnd=28 >> * term=de posInc=1 posLen=1 type=word offsetStart=13 offsetEnd=15 >> term=sitter posInc=1 posLen=1 type=word offsetStart=16 offsetEnd=22 >> term=space posInc=1 posLen=1 type=word offsetStart=23 offsetEnd=28 >> term=syn::universe posInc=0 posLen=1 type=SYNONYM offsetStart=23 offsetEnd=28 >> term=time posInc=1 posLen=1 type=word offsetStart=29 offsetEnd=33 >> term=spacetime posInc=0 p
ile offsets are still correct. This would, I think, affect not only highlighting, but also search (which is, at least for us, more important). But I can imagine that in more NLP-related domains, ability to identify the source of a transformation could be more than a highlighting problem. Admittedly, most users would not care to notice, but it might be important to some. Fundamentally, I think, the problem translates to inability to reconstruct the DAG graph (under certain circumstances) because of the lost pieces of information. ~roman On Wed, Aug 12, 2020 at 4:59 PM Michael McCandless wrote: > > Hi Roman, > > Sorry for the late reply! > > I think there remains substantial confusion about multi-token synonyms and > IW's enforcement of offsets. It really is worth thoroughly > iterating/understanding your examples so we can get to the bottom of this. > It looks to me it is possible to emit tokens whose offsets do not go > backwards and that properly model your example synonyms, so I do not yet see > what the problem is. Maybe I am being blind/tired ... > > What do you mean by pos=2, pos=0, etc.? I think that is really the position > increment? Can you re-do the examples with posInc instead? (Alternatively, > you could keep "pos" but make it the absolute position, not the increment?). > > Could you also add posLength to each token? This helps (me?) visualize the > resulting graph, even though IW does not enforce it today. > > Looking at your first example, "THE HUBBLE constant: a summary of the hubble > space telescope program", it looks to me like those tokens would all be > accepted by IW's checks as they are? startOffset never goes backwards, and > for every token, endOffset >= startOffset. Where in that first example does > IW throw an exception? Maybe insert a "** IW fails here" under the > problematic token? Or, maybe write a simple test case using e.g. > CannedTokenStream? 
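As an aside, the posInc-to-absolute-position conversion Mike asks for is just a running sum. A minimal sketch in plain Java (Lucene's convention is that the position starts at -1, so the first token with posInc=1 lands on position 0), using the increments from the first example in this thread:

```java
import java.util.Arrays;

// Sketch of how Lucene serializes positions: each token carries a
// position increment (posInc), and the absolute position is the
// running sum, starting from -1 so the first posInc=1 token is at 0.
// A posInc of 0 stacks a token on the same position as the previous one.
public class Positions {
    static int[] absolutePositions(int[] posIncs) {
        int[] out = new int[posIncs.length];
        int pos = -1;
        for (int i = 0; i < posIncs.length; i++) {
            pos += posIncs[i];
            out[i] = pos;
        }
        return out;
    }

    public static void main(String[] args) {
        // Increments from the "THE HUBBLE constant ..." example:
        // hubble(2) acr::hubble(0) constant(1) summary(1) hubble(1)
        // syn::hubble space telescope(0) syn::hst(0) acr::hst(0)
        // space(1) telescope(1) program(1)
        int[] incs = {2, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1};
        System.out.println(Arrays.toString(absolutePositions(incs)));
        // [1, 1, 2, 3, 4, 4, 4, 4, 5, 6, 7]
    }
}
```

The stacked positions (the runs of 4s) are exactly where the synonym tokens sit on top of the word that started the group.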
> > Your second example should also be fine, and not at all weird, but could you > enumerate it into the specific tokens with posInc, posLength, start/end > offset, "** IW fails here", etc., so we have a concrete example to discuss? > > Lucene's TokenStreams are really serializing a directed acyclic graph (DAG), > in a specific order, one transition at a time. Ironically/strangely, it is > similar to the graph that git history maintains, and how "git log" then > serializes that graph into an ordered series of transitions. The simple int > position in Lucene's TokenStream corresponds to git's githashes, to uniquely > identify each "node", though, I do not think there is an analog in git to > Lucene's offsets. Hmm, maybe a timestamp? > > Mike McCandless > > http://blog.mikemccandless.com > > > On Thu, Aug 6, 2020 at 10:49 AM Roman Chyla wrote: >> >> Hi Mike, >> >> Yes, they are not zero offsets - I was instinctively avoiding >> "negative offsets"; but they are indeed backward offsets. >> >> Here is the token stream as produced by the analyzer chain indexing >> "THE HUBBLE constant: a summary of the hubble space telescope program" >> >> term=hubble pos=2 type=word offsetStart=4 offsetEnd=10 >> term=acr::hubble pos=0 type=ACRONYM offsetStart=4 offsetEnd=10 >> term=constant pos=1 type=word offsetStart=11 offsetEnd=20 >> term=summary pos=1 type=word offsetStart=23 offsetEnd=30 >> term=hubble pos=1 type=word offsetStart=38 offsetEnd=44 >> term=syn::hubble space telescope pos=0 type=SYNONYM offsetStart=38 >> offsetEnd=60 >> term=syn::hst pos=0 type=SYNONYM offsetStart=38 offsetEnd=60 >> term=acr::hst pos=0 type=ACRONYM offsetStart=38 offsetEnd=60 >> term=space pos=1 type=word offsetStart=45 offsetEnd=50 >> term=telescope pos=1 type=word offsetStart=51 offsetEnd=60 >> term=program pos=1 type=word offsetStart=61 offsetEnd=68 >> >> Sometimes, we'll even have a situation when synonyms overlap: for >> example "anti de sitter space time" >> >> "anti de sitter space time" -> 
"antidesitter space" (one token >> spanning offsets 0-26; it gets emitted with the first token "anti" >> right now) >> "space time" -> "spacetime" (synonym 16-26) >> "space" -> "universe" (25-26) >> >> Yes, weird, but useful if people want to search for `universe NEAR >> anti` -- but another usecase which would be prohibited by the "new" >> rule. >> >> DefaultIndexingChain checks new token offset against the last emitted >> token, so I don't see a way to emit the multi-token synonym with >> offsetts span
Oh, thanks! That saves everybody some time. I have commented in there, pleading to be allowed to do something - if that proposal sounds even a little bit reasonable, please consider amplifying the signal On Mon, Aug 10, 2020 at 4:22 PM David Smiley wrote: > > There already is one: https://issues.apache.org/jira/browse/LUCENE-8776 > > ~ David Smiley > Apache Lucene/Solr Search Developer > http://www.linkedin.com/in/davidwsmiley > > > On Mon, Aug 10, 2020 at 1:30 PM Roman Chyla wrote: >> >> I'll have to somehow find a solution for this situation, giving up >> offsets seems like too big a price to pay, I see that overriding >> DefaultIndexingChain is not exactly easy -- the only thing I can think >> of is to just trick the classloader into giving it a different version >> of the chain (praying this can be done without compromising security, >> I have not followed JDK evolutions for some time...) - aside from >> forking lucene and editing that; which I decidedly don't want to do >> (monkey-patching it, ok, i can live with that... :-)) >> >> It *seems* to me that the original reason for negative offset checks >> stemmed from the fact that vint could have been written (and possibly >> vlong too) - https://issues.apache.org/jira/browse/LUCENE-3738 >> >> but the underlying issue and some of the patches seem to have been >> addressing those problems; but a much shorter version of the patch was >> committed -- despite the perf results not being indicative (i.e. it >> could have been good with the longer patch) -- but to really >> understand it, one would have to spend more than 10mins reading the >> comments >> >> Further to the point, I think negative offsets can be produced only on >> the very first token, unless there is a bug in a filter (there was/is >> a separate check for that in 6x and perhaps it is still there in 7x). >> That would be much less restrictive than the current condition which >> disallows all backward offsets. 
We never ran into an index corruption >> in lucene 4-6x, so I really wonder if the "forbid all backwards >> offsets" approach might be too restrictive. >> >> Looks like I should create an issue... >> >> On Thu, Aug 6, 2020 at 11:28 AM Gus Heck wrote: >> > >> > I've had a nearly identical experience to what Dave describes, I also >> > chafe under this restriction. >> > >> > On Thu, Aug 6, 2020 at 11:07 AM David Smiley wrote: >> >> >> >> I sympathize with your pain, Roman. >> >> >> >> It appears we can't really do index-time multi-word synonyms because of >> >> the offset ordering rule. But it's not just synonyms, it's other forms >> >> of multi-token expansion. Where I work, I've seen an interesting >> >> approach to mixed language text analysis in which a sophisticated >> >> Tokenizer effectively re-tokenizes an input multiple ways by producing a >> >> token stream that is a concatenation of different interpretations of the >> >> input. On a Lucene upgrade, we had to "coarsen" the offsets to the point >> >> of having highlights that point to a whole sentence instead of the words >> >> in that sentence :-(. I need to do something to fix this; I'm trying >> >> hard to resist modifying our Lucene fork for this constraint. Maybe >> >> instead of concatenating, it might be interleaved / overlapped but the >> >> interpretations aren't necessarily aligned to make this possible without >> >> risking breaking position-sensitive queries. >> >> >> >> So... I'm not a fan of this constraint on offsets. >> >> >> >> ~ David Smiley >> >> Apache Lucene/Solr Search Developer >> >> http://www.linkedin.com/in/davidwsmiley >> >> >> >> >> >> On Thu, Aug 6, 2020 at 10:49 AM Roman Chyla wrote: >> >>> >> >>> Hi Mike, >> >>> >> >>> Yes, they are not zero offsets - I was instinctively avoiding >> >>> "negative offsets"; but they are indeed backward offsets. 
>> >>> >> >>> Here is the token stream as produced by the analyzer chain indexing >> >>> "THE HUBBLE constant: a summary of the hubble space telescope program" >> >>> >> >>> term=hubble pos=2 type=word offsetStart=4 offsetEnd=10 >> >>> term=acr::hubble pos=0 type=ACRONYM offsetStart=4 offsetEnd=10 >> >>> term=constant pos=1 type=word offsetStart=11 offsetEnd=20 >
I'll have to somehow find a solution for this situation, giving up offsets seems like too big a price to pay, I see that overriding DefaultIndexingChain is not exactly easy -- the only thing I can think of is to just trick the classloader into giving it a different version of the chain (praying this can be done without compromising security, I have not followed JDK evolutions for some time...) - aside from forking lucene and editing that; which I decidedly don't want to do (monkey-patching it, ok, i can live with that... :-)) It *seems* to me that the original reason for negative offset checks stemmed from the fact that vint could have been written (and possibly vlong too) - https://issues.apache.org/jira/browse/LUCENE-3738 but the underlying issue and some of the patches seem to have been addressing those problems; but a much shorter version of the patch was committed -- despite the perf results not being indicative (i.e. it could have been good with the longer patch) -- but to really understand it, one would have to spend more than 10mins reading the comments Further to the point, I think negative offsets can be produced only on the very first token, unless there is a bug in a filter (there was/is a separate check for that in 6x and perhaps it is still there in 7x). That would be much less restrictive than the current condition which disallows all backward offsets. We never ran into an index corruption in lucene 4-6x, so I really wonder if the "forbid all backwards offsets" approach might be too restrictive. Looks like I should create an issue... On Thu, Aug 6, 2020 at 11:28 AM Gus Heck wrote: > > I've had a nearly identical experience to what Dave describes, I also chafe > under this restriction. > > On Thu, Aug 6, 2020 at 11:07 AM David Smiley wrote: >> >> I sympathize with your pain, Roman. >> >> It appears we can't really do index-time multi-word synonyms because of the >> offset ordering rule. 
But it's not just synonyms, it's other forms of >> multi-token expansion. Where I work, I've seen an interesting approach to >> mixed language text analysis in which a sophisticated Tokenizer effectively >> re-tokenizes an input multiple ways by producing a token stream that is a >> concatenation of different interpretations of the input. On a Lucene >> upgrade, we had to "coarsen" the offsets to the point of having highlights >> that point to a whole sentence instead of the words in that sentence :-(. I >> need to do something to fix this; I'm trying hard to resist modifying our >> Lucene fork for this constraint. Maybe instead of concatenating, it might >> be interleaved / overlapped but the interpretations aren't necessarily >> aligned to make this possible without risking breaking position-sensitive >> queries. >> >> So... I'm not a fan of this constraint on offsets. >> >> ~ David Smiley >> Apache Lucene/Solr Search Developer >> http://www.linkedin.com/in/davidwsmiley >> >> >> On Thu, Aug 6, 2020 at 10:49 AM Roman Chyla wrote: >>> >>> Hi Mike, >>> >>> Yes, they are not zero offsets - I was instinctively avoiding >>> "negative offsets"; but they are indeed backward offsets. 
>>> >>> Here is the token stream as produced by the analyzer chain indexing >>> "THE HUBBLE constant: a summary of the hubble space telescope program" >>> >>> term=hubble pos=2 type=word offsetStart=4 offsetEnd=10 >>> term=acr::hubble pos=0 type=ACRONYM offsetStart=4 offsetEnd=10 >>> term=constant pos=1 type=word offsetStart=11 offsetEnd=20 >>> term=summary pos=1 type=word offsetStart=23 offsetEnd=30 >>> term=hubble pos=1 type=word offsetStart=38 offsetEnd=44 >>> term=syn::hubble space telescope pos=0 type=SYNONYM offsetStart=38 >>> offsetEnd=60 >>> term=syn::hst pos=0 type=SYNONYM offsetStart=38 offsetEnd=60 >>> term=acr::hst pos=0 type=ACRONYM offsetStart=38 offsetEnd=60 >>> term=space pos=1 type=word offsetStart=45 offsetEnd=50 >>> term=telescope pos=1 type=word offsetStart=51 offsetEnd=60 >>> term=program pos=1 type=word offsetStart=61 offsetEnd=68 >>> >>> Sometimes, we'll even have a situation when synonyms overlap: for >>> example "anti de sitter space time" >>> >>> "anti de sitter space time" -> "antidesitter space" (one token >>> spanning offsets 0-26; it gets emitted with the first token "anti" >>> right now) >>> "space time" -> "spacetime" (synonym 16-26) >>> "space" -> "universe" (25-26) >>> >>> Yes, weird, but useful if peop
Hi Mike, Yes, they are not zero offsets - I was instinctively avoiding "negative offsets"; but they are indeed backward offsets. Here is the token stream as produced by the analyzer chain indexing "THE HUBBLE constant: a summary of the hubble space telescope program" term=hubble pos=2 type=word offsetStart=4 offsetEnd=10 term=acr::hubble pos=0 type=ACRONYM offsetStart=4 offsetEnd=10 term=constant pos=1 type=word offsetStart=11 offsetEnd=20 term=summary pos=1 type=word offsetStart=23 offsetEnd=30 term=hubble pos=1 type=word offsetStart=38 offsetEnd=44 term=syn::hubble space telescope pos=0 type=SYNONYM offsetStart=38 offsetEnd=60 term=syn::hst pos=0 type=SYNONYM offsetStart=38 offsetEnd=60 term=acr::hst pos=0 type=ACRONYM offsetStart=38 offsetEnd=60 term=space pos=1 type=word offsetStart=45 offsetEnd=50 term=telescope pos=1 type=word offsetStart=51 offsetEnd=60 term=program pos=1 type=word offsetStart=61 offsetEnd=68 Sometimes, we'll even have a situation when synonyms overlap: for example "anti de sitter space time" "anti de sitter space time" -> "antidesitter space" (one token spanning offsets 0-26; it gets emitted with the first token "anti" right now) "space time" -> "spacetime" (synonym 16-26) "space" -> "universe" (25-26) Yes, weird, but useful if people want to search for `universe NEAR anti` -- but another use case which would be prohibited by the "new" rule. DefaultIndexingChain checks the new token's offset against the last emitted token, so I don't see a way to emit the multi-token synonym with offsets spanning multiple tokens if even one of those tokens was already emitted. And the complement is equally true: if the multi-token synonym is emitted as the last of the group - it trips over `startOffset < invertState.lastStartOffset` https://github.com/apache/lucene-solr/blame/master/lucene/core/src/java/org/apache/lucene/index/DefaultIndexingChain.java#L915 -roman On Thu, Aug 6, 2020 at 6:17 AM Michael McCandless wrote: > > Hi Roman, > > Hmm, this is all very tricky! 
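The dilemma in this email (emitting the multi-token synonym with the first word of its group vs. after the group) can be sketched in plain Java. The word start offsets for "anti de sitter space time" below are assumed for illustration; only the 0-26 span of "antidesitter space" comes from the text:

```java
import java.util.List;

// Illustration of the emission-order dilemma: a multi-token synonym
// emitted at the FIRST word of its group keeps startOffsets
// non-decreasing, while the same synonym emitted AFTER the group makes
// startOffset go backwards -- the exact condition that
// DefaultIndexingChain rejects.
public class EmissionOrder {
    static boolean startOffsetsNonDecreasing(List<Integer> starts) {
        int last = -1;
        for (int s : starts) {
            if (s < last) return false;
            last = s;
        }
        return true;
    }

    public static void main(String[] args) {
        // Assumed word starts: anti=0, de=5, sitter=8, space=15, time=21.
        // Synonym "antidesitter space" (start 0) emitted with "anti":
        List<Integer> synonymFirst = List.of(0, 0, 5, 8, 15, 21);
        // The same synonym emitted after "space" (start 15):
        List<Integer> synonymLast = List.of(0, 5, 8, 15, 0, 21);
        System.out.println(startOffsetsNonDecreasing(synonymFirst)); // true
        System.out.println(startOffsetsNonDecreasing(synonymLast));  // false
    }
}
```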
> > First off, why do you call this "zero offsets"? Isn't it "backwards offsets" > that your analysis chain is trying to produce? > > Second, in your first example, if you output the tokens in the right order, > they would not violate the "offsets do not go backwards" check in > IndexWriter? I thought IndexWriter is just checking that the startOffset for > a token is not lower than the previous token's startOffset? (And that the > token's endOffset is not lower than its startOffset). > > So I am confused why your first example is tripping up on IW's offset checks. > Could you maybe redo the example, listing single token per line with the > start/end offsets they are producing? > > Mike McCandless > > http://blog.mikemccandless.com > > > On Wed, Aug 5, 2020 at 6:41 PM Roman Chyla wrote: >> >> Hello devs, >> >> I wanted to create an issue but the helpful message in red letters >> reminded me to ask first. >> >> While porting from lucene 6.x to 7x I'm struggling with a change that >> was introduced in LUCENE-7626 >> (https://issues.apache.org/jira/browse/LUCENE-7626) >> >> It is believed that zero offset tokens are bad bad - Mike McCandles >> made the change which made me automatically doubt myself. I must be >> wrong, hell, I was living in sin the past 5 years! >> >> Sadly, we have been indexing and searching large volumes of data >> without any corruption in index whatsover, but also without this new >> change: >> >> https://github.com/apache/lucene-solr/commit/64b86331c29d074fa7b257d65d3fda3b662bf96a#diff-cbdbb154cb6f3553edff2fcdb914a0c2L774 >> >> With that change, our multi-token synonyms house of cards is falling. >> >> Mike has this wonderful blogpost explaining troubles with multi-token >> synonyms: >> http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html >> >> Recommended way to index multi-token synonyms appears to be this: >> https://stackoverflow.com/questions/19927537/multi-word-synonyms-in-solr >> >> BUT, but! 
We don't want to place multi-token synonym into the same >> position as the other words. We want to preserve their positions! We >> want to preserve informaiton about offsets! >> >> Here is an example: >> >> * THE HUBBLE constant: a summary of the HUBBLE SPACE TELESCOPE program >> >> This is how it gets indexed >> >> [(0, []), >> (1, ['acr::hubble']), >> (2, ['constant']), >> (3, ['summary']), >> (4, []), >> (5, ['acr::hubble', 'syn::hst', 'syn::hubble space telescope', 'hubble'']), >> (6, ['a
When zero offsets are not bad - a.k.a. multi-token synonyms yet again (original post)
Hello devs, I wanted to create an issue but the helpful message in red letters reminded me to ask first. While porting from lucene 6.x to 7.x I'm struggling with a change that was introduced in LUCENE-7626 (https://issues.apache.org/jira/browse/LUCENE-7626) It is believed that zero offset tokens are bad bad - Mike McCandless made the change which made me automatically doubt myself. I must be wrong, hell, I was living in sin the past 5 years! Sadly, we have been indexing and searching large volumes of data without any corruption in the index whatsoever, but also without this new change: https://github.com/apache/lucene-solr/commit/64b86331c29d074fa7b257d65d3fda3b662bf96a#diff-cbdbb154cb6f3553edff2fcdb914a0c2L774 With that change, our multi-token synonyms house of cards is falling. Mike has this wonderful blogpost explaining troubles with multi-token synonyms: http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html The recommended way to index multi-token synonyms appears to be this: https://stackoverflow.com/questions/19927537/multi-word-synonyms-in-solr BUT, but! We don't want to place a multi-token synonym into the same position as the other words. We want to preserve their positions! We want to preserve information about offsets! Here is an example: * THE HUBBLE constant: a summary of the HUBBLE SPACE TELESCOPE program This is how it gets indexed: [(0, []), (1, ['acr::hubble']), (2, ['constant']), (3, ['summary']), (4, []), (5, ['acr::hubble', 'syn::hst', 'syn::hubble space telescope', 'hubble']), (6, ['a
program" | "Hubble Space Telescope program") It simply found that by looking at synonyms: HST -> Hubble Space Telescope And because of those funny 'syn::' prefixes, we don't suffer from the other problem that Mike described -- a "hst space" phrase search will NOT find this paper (and that is correct behaviour) But all of this is possible only because lucene was indexing tokens with offsets that can be lower than the last emitted token; for example 'hubble space telescope' will have offset 21-45; and the next emitted token "space" will have offset 28-33 And it just works (lucene 6.x) Here is another proof with the appropriate verbiage ("crazy"): https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/test/org/apache/solr/analysis/TestAdsabsTypeFulltextParsing.java#L618 Zero offsets have been working wonderfully for us so far. And I actually cannot imagine how it can work without them - i.e. without the ability to emit a token stream with offsets that are lower than the last seen token. I haven't tried the SynonymFlatten filter, but because of this line in the DefaultIndexingChain I'm convinced the flattening is not going to do what we need (as seen in the example above): https://github.com/apache/lucene-solr/blame/master/lucene/core/src/java/org/apache/lucene/index/DefaultIndexingChain.java#L915 What would you say? Is it a bug, is it not a bug but just some special use case? If it is a special use case, what do we need to do? Plug in our own indexing chain? Thanks! -roman
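The query-side trick described above ("HST program" expanding to "HST ? ? program" | "Hubble Space Telescope program") can be sketched, very roughly, like this. This is plain Java; `variants` and the inline data are illustrative, not the actual ADS parser:

```java
import java.util.List;
import java.util.StringJoiner;

// Sketch (not the actual parser from the thread): given one query term
// that has a known multi-token synonym, build the two phrase variants
// the email describes -- the original term padded with '?' position
// gaps, and the fully expanded synonym -- so both phrases line up with
// the positions used at index time.
public class PhraseExpand {
    static List<String> variants(List<String> phrase, String term, List<String> synonym) {
        StringJoiner padded = new StringJoiner(" ");
        StringJoiner expanded = new StringJoiner(" ");
        for (String t : phrase) {
            if (t.equals(term)) {
                padded.add(t);
                // one '?' per extra token in the synonym keeps positions aligned
                for (int i = 1; i < synonym.size(); i++) padded.add("?");
                synonym.forEach(expanded::add);
            } else {
                padded.add(t);
                expanded.add(t);
            }
        }
        return List.of(padded.toString(), expanded.toString());
    }

    public static void main(String[] args) {
        List<String> v = variants(List.of("hst", "program"), "hst",
                                  List.of("hubble", "space", "telescope"));
        System.out.println(v.get(0)); // hst ? ? program
        System.out.println(v.get(1)); // hubble space telescope program
    }
}
```

This only works, of course, if the indexed positions of the synonym group line up with the positions of the original words, which is exactly what the thread argues for preserving.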
[jira] [Updated] (LUCENE-7481) SpanPayloadCheckQuery is missing rewrite method
[ https://issues.apache.org/jira/browse/LUCENE-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Roman Chyla updated LUCENE-7481:

Description:

If used with a wildcard query, the result is a failure saying: "Rewrite query first"

SpanNearQuery has a rewrite method; however, SpanPayloadCheckQuery just returns the query itself.

This works:

```
spanNear([vectrfield:ebyuugz, SpanMultiTermQueryWrapper(vectrfield:e*), SpanMultiTermQueryWrapper(vectrfield:m*), SpanMultiTermQueryWrapper(vectrfield:f*)], 0, true)
```

Code to generate the query:

```java
private Query getSpanQuery(String[] parts, int howMany, boolean truncate)
    throws UnsupportedEncodingException {
  SpanQuery[] clauses = new SpanQuery[howMany + 1];
  clauses[0] = new SpanTermQuery(new Term("vectrfield", parts[0])); // surname
  for (int i = 0; i < howMany; i++) {
    if (truncate) {
      SpanMultiTermQueryWrapper q = new SpanMultiTermQueryWrapper(
          new WildcardQuery(new Term("vectrfield", parts[i + 1].substring(0, 1) + "*")));
      clauses[i + 1] = q;
    } else {
      clauses[i + 1] = new SpanTermQuery(new Term("vectrfield", parts[i + 1]));
    }
  }
  SpanNearQuery sq = new SpanNearQuery(clauses, 0, true); // match in order
  return sq;
}
```

And this fails:

```
spanPayCheck(spanNear([vectrfield:ebyuugz, SpanMultiTermQueryWrapper(vectrfield:e*), SpanMultiTermQueryWrapper(vectrfield:m*), SpanMultiTermQueryWrapper(vectrfield:f*)], 1, true), payloadRef: 0;1;2;3;)
```

Each clause is made of:

```java
new SpanMultiTermQueryWrapper(new WildcardQuery(new Term("vectrfield", parts[i + 1].substring(0, 1) + "*")));
```

It is a regression; the code was working well in Solr 4.x.

> SpanPayloadCheckQuery is missing rewrite method
> -----------------------------------------------
>
> Key: LUCENE-7481
> URL: https://issues.apache.org/jira/browse/LUCENE-7481
> Project: Lucene - Core
> Issue Type: Bug
> Affects Versions: 6.x
> Reporter: Roman Chyla
>
> If used with a wildcard query, the result is a failure saying: "Rewrite query first"
> The SpanNearQuery has the rewrite method; however the SpanPayloadCheckQuery just returns the query itself.
> this works: > ``` > spanNear([vectrfield:ebyuugz, SpanMultiTermQueryWrapper(vectrfield:e*), > SpanMultiTermQueryWrapper(vectrfield:m*), > SpanMultiTermQueryWrapper(vectrfield:f*)], 0, true) > ``` > code to generate the query: > ``` > private Query getSpanQuery(String[] parts, int howMany, boolean truncate) > throws UnsupportedEncodingException { > SpanQuery[] clauses = new SpanQuery[howMany+1]; >
[jira] [Created] (LUCENE-7481) SpanPayloadCheckQuery is missing rewrite method
Roman Chyla created LUCENE-7481: --- Summary: SpanPayloadCheckQuery is missing rewrite method Key: LUCENE-7481 URL: https://issues.apache.org/jira/browse/LUCENE-7481 Project: Lucene - Core Issue Type: Bug Affects Versions: 6.x Reporter: Roman Chyla If used with a wildcard query, the result is a failure saying: "Rewrite query first" The SpanNearQuery has the rewrite method; however the SpanPayloadCheckQuery just returns the query itself. this works: ``` spanNear([vectrfield:ebyuugz, SpanMultiTermQueryWrapper(vectrfield:e*), SpanMultiTermQueryWrapper(vectrfield:m*), SpanMultiTermQueryWrapper(vectrfield:f*)], 0, true) ``` code to generate the query: ``` private Query getSpanQuery(String[] parts, int howMany, boolean truncate) throws UnsupportedEncodingException { SpanQuery[] clauses = new SpanQuery[howMany+1]; clauses[0] = new SpanTermQuery(new Term("vectrfield", parts[0])); // surname for (int i = 0; i < howMany; i++) { if (truncate) { SpanMultiTermQueryWrapper q = new SpanMultiTermQueryWrapper(new WildcardQuery(new Term("vectrfield", parts[i+1].substring(0, 1) + "*"))); clauses[i+1] = q; } else { clauses[i+1] = new SpanTermQuery(new Term("vectrfield", parts[i+1])); } } SpanNearQuery sq = new SpanNearQuery(clauses, 0, true); // match in order return sq; } ``` and this fails: ``` spanPayCheck(spanNear([vectrfield:ebyuugz, SpanMultiTermQueryWrapper(vectrfield:e*), SpanMultiTermQueryWrapper(vectrfield:m*), SpanMultiTermQueryWrapper(vectrfield:f*)], 1, true), payloadRef: 0;1;2;3;) ``` each clause is made of: ``` new SpanMultiTermQueryWrapper(new WildcardQuery(new Term("vectrfield", parts[i+1].substring(0, 1) + "*"))); ``` It is a regression; the code was working well in SOLR4.x -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
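The fix the report implies - a wrapper query whose rewrite() delegates to its inner query instead of returning itself - can be sketched with a toy model (deliberately not the real Lucene API; every class name here is a stand-in):

```java
// Toy model of the rewrite contract: a wrapper query must propagate
// rewrite() to its inner query, otherwise a multi-term (wildcard) query
// nested inside it is never expanded and the searcher fails with
// "Rewrite query first". Not the real Lucene API.
public class RewriteSketch {
    interface Query {
        /** Returns a directly-executable query (or this if already executable). */
        Query rewrite();
        default boolean isExecutable() { return true; }
    }

    /** Stand-in for a WildcardQuery: must be rewritten before execution. */
    static final class WildcardLike implements Query {
        final String pattern;
        WildcardLike(String pattern) { this.pattern = pattern; }
        public Query rewrite() { return new TermLike(pattern.replace("*", "expanded")); }
        public boolean isExecutable() { return false; }
    }

    static final class TermLike implements Query {
        final String term;
        TermLike(String term) { this.term = term; }
        public Query rewrite() { return this; }
    }

    /** Buggy wrapper: returns itself, leaving the wildcard unexpanded. */
    static final class BuggyPayloadCheck implements Query {
        final Query inner;
        BuggyPayloadCheck(Query inner) { this.inner = inner; }
        public Query rewrite() { return this; }
        public boolean isExecutable() { return inner.isExecutable(); }
    }

    /** Fixed wrapper: rewrites the inner query and rebuilds itself around it. */
    static final class FixedPayloadCheck implements Query {
        final Query inner;
        FixedPayloadCheck(Query inner) { this.inner = inner; }
        public Query rewrite() {
            Query rewritten = inner.rewrite();
            return rewritten == inner ? this : new FixedPayloadCheck(rewritten);
        }
        public boolean isExecutable() { return inner.isExecutable(); }
    }

    public static void main(String[] args) {
        Query wildcard = new WildcardLike("e*");
        System.out.println(new BuggyPayloadCheck(wildcard).rewrite().isExecutable()); // false
        System.out.println(new FixedPayloadCheck(wildcard).rewrite().isExecutable()); // true
    }
}
```

The `rewritten == inner ? this : new ...` pattern mirrors how wrapper queries typically avoid allocating when nothing inside them changed.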
[jira] [Commented] (SOLR-6468) Regression: StopFilterFactory doesn't work properly without enablePositionIncrements="false"
[ https://issues.apache.org/jira/browse/SOLR-6468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15514785#comment-15514785 ]

Roman Chyla commented on SOLR-6468:
---

Ha! :-) I've found my own comment above; 2 years later I'm facing this situation again, I completely forgot (and truth be told: I preferred running the old Solr 4.x). This is how the new Solr sees things: "A 350-MHz GBT Survey of 50 Faint Fermi γ ray Sources for Radio Millisecond Pulsars" is indexed as

```
null_1
1 :350|350mhz
2 :mhz|syn::mhz
3 :acr::gbt|gbt|syn::gbt|syn::green bank telescope
4 :survey|syn::survey
null_1
6 :50
```

The 1st and 5th positions are gaps - so the search for "350-MHz GBT Survey of 50 Faint" will fail, because 'of' is a stopword and the stop filter will always increment the position (what's the purpose of a stop filter if it is leaving gaps?). Anyways, the solution with CharFilterFactory cannot work for me; I have to do this:

1. search for synonyms (they can contain stopwords)
2. remove stopwords
3. search for other synonyms (that don't have stopwords)

I'm afraid real life is a little bit more complex than what it seems; but there is a logic to your choices, SOLR devs, and I'm afraid I can agree with you. People who understand the *why* will make it work again as it *should*. Others will happily keep using the 'simplified' version.

> Regression: StopFilterFactory doesn't work properly without
> enablePositionIncrements="false"
>
> Key: SOLR-6468
> URL: https://issues.apache.org/jira/browse/SOLR-6468
> Project: Solr
> Issue Type: Bug
> Affects Versions: 4.8.1, 4.9
> Reporter: Alexander S.
>
> Setup:
> * Schema version is 1.5
> * Field config:
> {code}
> <fieldType name="words_ngram" class="solr.TextField" omitNorms="false"
>   autoGeneratePhraseQueries="true">
>   <analyzer>
>     <tokenizer class="solr.PatternTokenizerFactory" pattern="[^\w]+" />
>     <filter class="solr.StopFilterFactory" words="url_stopwords.txt" ignoreCase="true" />
>     <filter class="solr.LowerCaseFilterFactory" />
>   </analyzer>
> </fieldType>
> {code}
> * Stop words:
> {code}
> http
> https
> ftp
> www
> {code}
> So very simple. In the index I have:
> * twitter.com/testuser
> All these queries do match:
> * twitter.com/testuser
> * com/testuser
> * testuser
> But none of these does:
> * https://twitter.com/testuser
> * https://www.twitter.com/testuser
> * www.twitter.com/testuser
> Debug output shows:
> "parsedquery_toString": "+(url_words_ngram:\"? twitter com testuser\")"
> But we need:
> "parsedquery_toString": "+(url_words_ngram:\"twitter com testuser\")"
> Complete debug outputs:
> * a valid search:
> http://pastie.org/pastes/9500661/text?key=rgqj5ivlgsbk1jxsudx9za
> * an invalid search:
> http://pastie.org/pastes/9500662/text?key=b4zlh2oaxtikd8jvo5xaww
> The complete discussion and explanation of the problem is here:
> http://lucene.472066.n3.nabble.com/Help-with-StopFilterFactory-td4153839.html
> I didn't find a clear explanation of how we can upgrade Solr; there's no
> replacement or workaround for this, so this is not just a major change but
> a major disrespect to all existing Solr users who are using this feature.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)

- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
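A tiny simulation of the two stop-filter behaviours the report contrasts (illustrative plain Java, not Lucene code): with position increments enabled, the removed 'https'/'www' tokens leave holes, which is where the leading "?" in the parsed phrase query comes from.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StopFilterGapDemo {
    /**
     * Tokenizes on non-word characters and drops stopwords, recording each
     * surviving term as "term@position". With keepGaps=true (the behaviour
     * left after Solr 4.x) removed stopwords still advance the position;
     * with keepGaps=false (the old enablePositionIncrements="false") they don't.
     */
    public static List<String> analyze(String text, Set<String> stopwords,
                                       boolean keepGaps) {
        List<String> out = new ArrayList<>();
        int pos = -1;
        for (String token : text.toLowerCase().split("[^\\w]+")) {
            if (token.isEmpty()) continue;
            pos++;                      // every token advances the position
            if (stopwords.contains(token)) {
                if (!keepGaps) pos--;   // old behaviour: close the hole
                continue;
            }
            out.add(token + "@" + pos);
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<>(Arrays.asList("http", "https", "ftp", "www"));
        // indexed value has no stopwords, so its positions start at 0 either way
        System.out.println(analyze("twitter.com/testuser", stop, true));
        // with gaps, the query terms start at position 2 -> misaligned phrase
        System.out.println(analyze("https://www.twitter.com/testuser", stop, true));
        // without gaps, query positions line up with the indexed positions
        System.out.println(analyze("https://www.twitter.com/testuser", stop, false));
    }
}
```

Only the last two analyses differ, and only in positions: the terms are identical, but the gap-preserving version shifts them, which is the mismatch behind the failing phrase queries above.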
[jira] [Commented] (SOLR-6468) Regression: StopFilterFactory doesn't work properly without enablePositionIncrements=false
[ https://issues.apache.org/jira/browse/SOLR-6468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14225186#comment-14225186 ]

Roman Chyla commented on SOLR-6468:
---

I also find this change to be unfortunate. If this is just developers making decisions for users, then it causes problems to users who really know why they need that feature: phrase search that should ignore stopwords. But if the underlying issue is something serious - the indexer not being able to work with the positions - then it would be even weirder, and actually very bad for many users. I don't really understand the benefits of this change. Any chance to return to the original behaviour?

> Regression: StopFilterFactory doesn't work properly without
> enablePositionIncrements=false
>
> Key: SOLR-6468
> URL: https://issues.apache.org/jira/browse/SOLR-6468
> Project: Solr
> Issue Type: Bug
> Affects Versions: 4.8.1, 4.9
> Reporter: Alexander S.
>
> Setup:
> * Schema version is 1.5
> * Field config:
> {code}
> <fieldType name="words_ngram" class="solr.TextField" omitNorms="false"
>   autoGeneratePhraseQueries="true">
>   <analyzer>
>     <tokenizer class="solr.PatternTokenizerFactory" pattern="[^\w]+" />
>     <filter class="solr.StopFilterFactory" words="url_stopwords.txt" ignoreCase="true" />
>     <filter class="solr.LowerCaseFilterFactory" />
>   </analyzer>
> </fieldType>
> {code}
> * Stop words:
> {code}
> http
> https
> ftp
> www
> {code}
> So very simple. In the index I have:
> * twitter.com/testuser
> All these queries do match:
> * twitter.com/testuser
> * com/testuser
> * testuser
> But none of these does:
> * https://twitter.com/testuser
> * https://www.twitter.com/testuser
> * www.twitter.com/testuser
> Debug output shows:
> "parsedquery_toString": "+(url_words_ngram:\"? twitter com testuser\")"
> But we need:
> "parsedquery_toString": "+(url_words_ngram:\"twitter com testuser\")"
> Complete debug outputs:
> * a valid search:
> http://pastie.org/pastes/9500661/text?key=rgqj5ivlgsbk1jxsudx9za
> * an invalid search:
> http://pastie.org/pastes/9500662/text?key=b4zlh2oaxtikd8jvo5xaww
> The complete discussion and explanation of the problem is here:
> http://lucene.472066.n3.nabble.com/Help-with-StopFilterFactory-td4153839.html
> I didn't find a clear explanation of how we can upgrade Solr; there's no
> replacement or workaround for this, so this is not just a major change but
> a major disrespect to all existing Solr users who are using this feature.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)

- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Query parsing - what's new?
Hi guys,

Could somebody please take a look at LUCENE-5014 and comment on it? That JIRA ticket proposes a new way to build query parsers: https://issues.apache.org/jira/browse/LUCENE-5014

The thing is: the new code has been lying there for about 6 months, and I don't know whether it is because people don't have time to actually look at it, or because it is a bad solution, or anything else... I don't want to assume anything at this point, but your input would be much appreciated. I know you are busy and I understand that parsers are not as exciting as cloud etc., but at the same time I do NOT understand how Lucene can live so long with 'that' standard query parser...

Thank you!

roman
Re: Measuring SOLR performance
Hi Dmitry,

Probably a mistake in the readme - try calling it with:

-q /home/dmitry/projects/lab/solrjmeter/queries/demo/demo.queries

As for the base_url, I was testing it on Solr 4.0, where it tries contacting /solr/admin/system - is it different for 4.3? I guess I should make it configurable (it already is; the endpoint is set in check_options()).

thanks

roman

On Wed, Jul 31, 2013 at 10:01 AM, Dmitry Kan solrexp...@gmail.com wrote:

Ok, got the error fixed by modifying the base solr url in solrjmeter.py (added the core name after the /solr part). Next error is:

WARNING: no test name(s) supplied nor found in: ['/home/dmitry/projects/lab/solrjmeter/demo/queries/demo.queries']

It is a 'slow start with a new tool' symptom, I guess.. :)

On Wed, Jul 31, 2013 at 4:39 PM, Dmitry Kan solrexp...@gmail.com wrote:

Hi Roman,

What version and config of SOLR does the tool expect? Tried to run, but got:

**ERROR**
File "solrjmeter.py", line 1390, in <module>
  main(sys.argv)
File "solrjmeter.py", line 1296, in main
  check_prerequisities(options)
File "solrjmeter.py", line 351, in check_prerequisities
  error('Cannot contact: %s' % options.query_endpoint)
File "solrjmeter.py", line 66, in error
  traceback.print_stack()
Cannot contact: http://localhost:8983/solr

It complains about the URL, clicking which leads properly to the admin page... solr 4.3.1, 2 cores shard

Dmitry

On Wed, Jul 31, 2013 at 3:59 AM, Roman Chyla roman.ch...@gmail.com wrote:

Hello,

I have been wanting some tools for measuring performance of SOLR, similar to Mike McCandless' lucene benchmark. So yet another monitor was born; it is described here: http://29min.wordpress.com/2013/07/31/measuring-solr-query-performance/

I tested it on the problem of garbage collectors (see the blogs for details) and so far I can't conclude whether a highly customized G1 is better than a highly customized CMS, but I think interesting details can be seen there.

Hope this helps someone, and of course, feel free to improve the tool and share!

roman
Re: Measuring SOLR performance
On Wed, Jul 31, 2013 at 1:21 PM, Shawn Heisey s...@elyograg.org wrote:

On 7/31/2013 10:21 AM, Roman Chyla wrote:

Hi Dmitry, probably a mistake in the readme, try calling it with -q /home/dmitry/projects/lab/solrjmeter/queries/demo/demo.queries as for the base_url, i was testing it on solr4.0, where it tries contacting /solr/admin/system - is it different for 4.3? I guess I should make it configurable (it already is, the endpoint is set at the check_options())

/solr URLs that don't include a core name (like /solr/admin/system) will only work if you have a defaultCoreName attribute in your solr.xml file and its value refers to an existing core. Behind the scenes, Solr just directs those queries to the default core.

thanks, so i should add a way to specify a core, or rather i will make the whole endpoint user configurable

If you use the new solr.xml format (required in trunk), then there is no defaultCoreName, so these URLs currently don't work at all. I think this behavior is correct, but it's early days for this feature. The default core name might get re-introduced.

and which urls will work? /solr/admin/collection or /solr/collection/admin? can we assume the info handlers will be available under the collection url as well?

Exceptions to the above rule include the CoreAdmin API, the Collections API, and the new admin info handler introduced in Solr 4.4 by SOLR-4943. In 4.5, SOLR-3633 will use the new info handler to allow the UI to work when there are no cores present.

hmm, ok, i guess i'm fine now, i'll worry about that later

roman

Thanks, Shawn

- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Measuring SOLR performance
Hello,

I have been wanting some tools for measuring performance of SOLR, similar to Mike McCandless' lucene benchmark. So yet another monitor was born; it is described here: http://29min.wordpress.com/2013/07/31/measuring-solr-query-performance/

I tested it on the problem of garbage collectors (see the blogs for details) and so far I can't conclude whether a highly customized G1 is better than a highly customized CMS, but I think interesting details can be seen there.

Hope this helps someone, and of course, feel free to improve the tool and share!

roman
Re: for those of you using gmail...
On Wed, Jul 17, 2013 at 10:26 AM, Michael McCandless luc...@mikemccandless.com wrote: Can you try this search in your gmail: from:jenk...@thetaphi.de regression build 6605 And let me know if you get 1 or 0 results back? 0 results back I get 0 results back but I should get 1, I think. Furthermore, if I search for: from:jenk...@thetaphi.de regression I only get results up to Jul 2, even though there are many build failures after that. I am getting many before Jul 2, even March and beyond --roman It's as if on Jul 2 Google made regression an index-time-only stopword; failed, replication, handler also became stopwords (but apparently at different times). Frustrating ... Mike McCandless http://blog.mikemccandless.com - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-5014) ANTLR Lucene query parser
[ https://issues.apache.org/jira/browse/LUCENE-5014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13698981#comment-13698981 ] Roman Chyla commented on LUCENE-5014: - Hi Erik, i'll add a solr qparser plugin too. thanks for reminding me. ANTLR Lucene query parser - Key: LUCENE-5014 URL: https://issues.apache.org/jira/browse/LUCENE-5014 Project: Lucene - Core Issue Type: Improvement Components: core/queryparser, modules/queryparser Affects Versions: 4.3 Environment: all Reporter: Roman Chyla Labels: antlr, query, queryparser Attachments: LUCENE-5014.txt, LUCENE-5014.txt, LUCENE-5014.txt, LUCENE-5014.txt I would like to propose a new way of building query parsers for Lucene. Currently, most Lucene parsers are hard to extend because they are either written in Java (i.e. the SOLR query parser, or edismax) or the parsing logic is 'married' with the query building logic (i.e. the standard lucene parser, generated by JavaCC) - which makes any extension really hard. A few years back, Lucene got the contrib/modern query parser (later renamed to 'flexible'), yet that parser didn't become a star (it must be very confusing for many users). However, that parsing framework is very powerful! And it is a real pity that there aren't more parsers already using it - because it allows us to add/extend/change almost any aspect of the query parsing. So, if we combine ANTLR + queryparser.flexible, we can get a very powerful framework for building almost any query language one can think of. And I hope this extension can become useful. 
The details: - every new query syntax is written in EBNF, it lives in separate files (and can be tested/developed independently - using 'gunit') - ANTLR parser generates parsing code (and it can generate parsers in several languages, the main target is Java, but it can also do Python - which may be interesting for pylucene) - the parser generates AST (abstract syntax tree) which is consumed by a 'pipeline' of processors, users can easily modify this pipeline to add a desired functionality - the new parser contains a few (very important) debugging functions; it can print results of every stage of the build, generate AST's as graphical charts; ant targets help to build/test/debug grammars - I've tried to reuse the existing queryparser.flexible components as much as possible, only adding new processors when necessary Assumptions about the grammar: - every grammar must have one top parse rule called 'mainQ' - parsers must generate AST (Abstract Syntax Tree) The structure of the AST is left open, there are components which make assumptions about the shape of the AST (ie. that MODIFIER is parent of a a FIELD) however users are free to choose/write different processors with different assumptions about the AST shape. More documentation on how to use the parser can be seen here: http://29min.wordpress.com/category/antlrqueryparser/ The parser has been created more than one year back and is used in production (http://labs.adsabs.harvard.edu/adsabs/). A different dialects of query languages (with proximity operatos, functions, special logic etc) - can be seen here: https://github.com/romanchyla/montysolr/tree/master/contrib/adsabs https://github.com/romanchyla/montysolr/tree/master/contrib/invenio -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
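The AST-plus-processor-pipeline design the proposal describes (parser emits a tree; an ordered list of processors each transforms it before query building) can be sketched roughly like this - all names here are hypothetical stand-ins, not the actual LUCENE-5014 classes:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.List;

public class AstPipelineSketch {
    /** A node of the abstract syntax tree the parser would emit. */
    public static final class Node {
        public final String type;      // e.g. MODIFIER, FIELD, TERM
        public String value;
        public final List<Node> children = new ArrayList<>();
        public Node(String type, String value) { this.type = type; this.value = value; }
    }

    /** One stage of the pipeline: consumes an AST, returns a (possibly new) AST. */
    public interface Processor {
        Node process(Node ast);
    }

    /** Runs processors in order, each consuming the previous one's output. */
    public static Node run(Node ast, List<Processor> pipeline) {
        for (Processor p : pipeline) ast = p.process(ast);
        return ast;
    }

    public static void main(String[] args) {
        // AST for something like +title:HUBBLE
        // (MODIFIER is the parent of FIELD, as the proposal's text assumes)
        Node root = new Node("MODIFIER", "+");
        Node field = new Node("FIELD", "title");
        field.children.add(new Node("TERM", "HUBBLE"));
        root.children.add(field);

        // example processor: lowercase every TERM node, depth-first
        Processor lowercaseTerms = ast -> {
            Deque<Node> stack = new ArrayDeque<>(Collections.singleton(ast));
            while (!stack.isEmpty()) {
                Node n = stack.pop();
                if (n.type.equals("TERM")) n.value = n.value.toLowerCase();
                stack.addAll(n.children);
            }
            return ast;
        };

        Node out = run(root, Collections.singletonList(lowercaseTerms));
        System.out.println(out.children.get(0).children.get(0).value); // hubble
    }
}
```

Because each stage only sees a tree, stages can be added, removed, or reordered independently - which is the extensibility argument the proposal makes for separating parsing (ANTLR grammar) from query building (the processor pipeline).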
[jira] [Commented] (LUCENE-5014) ANTLR Lucene query parser
[ https://issues.apache.org/jira/browse/LUCENE-5014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13698985#comment-13698985 ] Roman Chyla commented on LUCENE-5014: - will it be OK to include the solr parts in this ticket? besides the jira name, that seems as the best option to me. ANTLR Lucene query parser - Key: LUCENE-5014 URL: https://issues.apache.org/jira/browse/LUCENE-5014 Project: Lucene - Core Issue Type: Improvement Components: core/queryparser, modules/queryparser Affects Versions: 4.3 Environment: all Reporter: Roman Chyla Labels: antlr, query, queryparser Attachments: LUCENE-5014.txt, LUCENE-5014.txt, LUCENE-5014.txt, LUCENE-5014.txt I would like to propose a new way of building query parsers for Lucene. Currently, most Lucene parsers are hard to extend because they are either written in Java (ie. the SOLR query parser, or edismax) or the parsing logic is 'married' with the query building logic (i.e. the standard lucene parser, generated by JavaCC) - which makes any extension really hard. Few years back, Lucene got the contrib/modern query parser (later renamed to 'flexible'), yet that parser didn't become a star (it must be very confusing for many users). However, that parsing framework is very powerful! And it is a real pity that there aren't more parsers already using it - because it allows us to add/extend/change almost any aspect of the query parsing. So, if we combine ANTLR + queryparser.flexible, we can get very powerful framework for building almost any query language one can think of. And I hope this extension can become useful. 
The details: - every new query syntax is written in EBNF, it lives in separate files (and can be tested/developed independently - using 'gunit') - ANTLR parser generates parsing code (and it can generate parsers in several languages, the main target is Java, but it can also do Python - which may be interesting for pylucene) - the parser generates AST (abstract syntax tree) which is consumed by a 'pipeline' of processors, users can easily modify this pipeline to add a desired functionality - the new parser contains a few (very important) debugging functions; it can print results of every stage of the build, generate AST's as graphical charts; ant targets help to build/test/debug grammars - I've tried to reuse the existing queryparser.flexible components as much as possible, only adding new processors when necessary Assumptions about the grammar: - every grammar must have one top parse rule called 'mainQ' - parsers must generate AST (Abstract Syntax Tree) The structure of the AST is left open, there are components which make assumptions about the shape of the AST (ie. that MODIFIER is parent of a a FIELD) however users are free to choose/write different processors with different assumptions about the AST shape. More documentation on how to use the parser can be seen here: http://29min.wordpress.com/category/antlrqueryparser/ The parser has been created more than one year back and is used in production (http://labs.adsabs.harvard.edu/adsabs/). A different dialects of query languages (with proximity operatos, functions, special logic etc) - can be seen here: https://github.com/romanchyla/montysolr/tree/master/contrib/adsabs https://github.com/romanchyla/montysolr/tree/master/contrib/invenio -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-5014) ANTLR Lucene query parser
[ https://issues.apache.org/jira/browse/LUCENE-5014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13699417#comment-13699417 ] Roman Chyla commented on LUCENE-5014: - New addition: solr qparser plugin. It is unfortunately not as easy as one may think, because of various defaults - e.g. user may want to specify different defaultField, whether wildcards are allowed at the beginning, what is the maximum range for proximity values... some of which should be only in solrconfig.xml, and some also in query params. So here is a stab at it, it works, but may require more config options - there is also a new unittest. Only that Ivy mirrors decided to not work now (ughhh) so I could not test solr unittests - ihope it works. Lucene's 'ant test' went fine. If sb wants to try in solr, please make sure you have antlr-runtime.jar in your solr libs and this should go inside solrconfig.xml {code} queryParser name=lucene2 class=AqpLuceneQParserPlugin lst name=defaults str name=defaultFieldtext/str /lst /queryParser {code} ANTLR Lucene query parser - Key: LUCENE-5014 URL: https://issues.apache.org/jira/browse/LUCENE-5014 Project: Lucene - Core Issue Type: Improvement Components: core/queryparser, modules/queryparser Affects Versions: 4.3 Environment: all Reporter: Roman Chyla Labels: antlr, query, queryparser Attachments: LUCENE-5014.txt, LUCENE-5014.txt, LUCENE-5014.txt, LUCENE-5014.txt, LUCENE-5014.txt I would like to propose a new way of building query parsers for Lucene. Currently, most Lucene parsers are hard to extend because they are either written in Java (ie. the SOLR query parser, or edismax) or the parsing logic is 'married' with the query building logic (i.e. the standard lucene parser, generated by JavaCC) - which makes any extension really hard. Few years back, Lucene got the contrib/modern query parser (later renamed to 'flexible'), yet that parser didn't become a star (it must be very confusing for many users). 
However, that parsing framework is very powerful! And it is a real pity that there aren't more parsers already using it - because it allows us to add/extend/change almost any aspect of the query parsing. So, if we combine ANTLR + queryparser.flexible, we can get very powerful framework for building almost any query language one can think of. And I hope this extension can become useful. The details: - every new query syntax is written in EBNF, it lives in separate files (and can be tested/developed independently - using 'gunit') - ANTLR parser generates parsing code (and it can generate parsers in several languages, the main target is Java, but it can also do Python - which may be interesting for pylucene) - the parser generates AST (abstract syntax tree) which is consumed by a 'pipeline' of processors, users can easily modify this pipeline to add a desired functionality - the new parser contains a few (very important) debugging functions; it can print results of every stage of the build, generate AST's as graphical charts; ant targets help to build/test/debug grammars - I've tried to reuse the existing queryparser.flexible components as much as possible, only adding new processors when necessary Assumptions about the grammar: - every grammar must have one top parse rule called 'mainQ' - parsers must generate AST (Abstract Syntax Tree) The structure of the AST is left open, there are components which make assumptions about the shape of the AST (ie. that MODIFIER is parent of a a FIELD) however users are free to choose/write different processors with different assumptions about the AST shape. More documentation on how to use the parser can be seen here: http://29min.wordpress.com/category/antlrqueryparser/ The parser has been created more than one year back and is used in production (http://labs.adsabs.harvard.edu/adsabs/). 
A different dialects of query languages (with proximity operatos, functions, special logic etc) - can be seen here: https://github.com/romanchyla/montysolr/tree/master/contrib/adsabs https://github.com/romanchyla/montysolr/tree/master/contrib/invenio -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-5014) ANTLR Lucene query parser
[ https://issues.apache.org/jira/browse/LUCENE-5014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Roman Chyla updated LUCENE-5014: Attachment: LUCENE-5014.txt Added Solr QParserPlugin

ANTLR Lucene query parser
Key: LUCENE-5014
URL: https://issues.apache.org/jira/browse/LUCENE-5014
Project: Lucene - Core
Issue Type: Improvement
Components: core/queryparser, modules/queryparser
Affects Versions: 4.3
Environment: all
Reporter: Roman Chyla
Labels: antlr, query, queryparser
Attachments: LUCENE-5014.txt, LUCENE-5014.txt, LUCENE-5014.txt, LUCENE-5014.txt, LUCENE-5014.txt

I would like to propose a new way of building query parsers for Lucene. Currently, most Lucene parsers are hard to extend because they are either written in Java (i.e. the Solr query parser, or edismax) or the parsing logic is 'married' to the query-building logic (i.e. the standard Lucene parser, generated by JavaCC), which makes any extension really hard. A few years back, Lucene got the contrib/modern query parser (later renamed to 'flexible'), yet that parser didn't become a star (it must be very confusing for many users). However, that parsing framework is very powerful! And it is a real pity that there aren't more parsers already using it, because it allows us to add/extend/change almost any aspect of the query parsing. So, if we combine ANTLR + queryparser.flexible, we get a very powerful framework for building almost any query language one can think of. And I hope this extension can become useful.

The details:
- every new query syntax is written in EBNF; it lives in separate files (and can be tested/developed independently, using 'gunit')
- ANTLR generates the parsing code (it can generate parsers in several languages; the main target is Java, but it can also do Python, which may be interesting for PyLucene)
- the parser generates an AST (abstract syntax tree) which is consumed by a 'pipeline' of processors; users can easily modify this pipeline to add desired functionality
- the new parser contains a few (very important) debugging functions; it can print the results of every stage of the build and generate ASTs as graphical charts; ant targets help to build/test/debug grammars
- I've tried to reuse the existing queryparser.flexible components as much as possible, only adding new processors when necessary

Assumptions about the grammar:
- every grammar must have one top parse rule called 'mainQ'
- parsers must generate an AST (Abstract Syntax Tree)

The structure of the AST is left open; there are components which make assumptions about the shape of the AST (i.e. that MODIFIER is the parent of a FIELD), however users are free to choose/write different processors with different assumptions about the AST shape. More documentation on how to use the parser can be seen here: http://29min.wordpress.com/category/antlrqueryparser/ The parser was created more than a year back and is used in production (http://labs.adsabs.harvard.edu/adsabs/). Different dialects of query languages (with proximity operators, functions, special logic, etc.) can be seen here: https://github.com/romanchyla/montysolr/tree/master/contrib/adsabs https://github.com/romanchyla/montysolr/tree/master/contrib/invenio
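The 'pipeline of processors' design described above can be pictured as an ordered list of tree-rewriting functions: each processor consumes the AST and hands a (possibly rewritten) AST to the next one. A minimal, self-contained sketch of the idea, using a plain string as a stand-in for the AST and invented processors (this is not the queryparser.flexible API):

```java
import java.util.List;
import java.util.function.UnaryOperator;

public class Pipeline {
    // Run the AST (a plain string here, query nodes in the real parser)
    // through each processor in order.
    static String process(String ast, List<UnaryOperator<String>> processors) {
        for (UnaryOperator<String> p : processors) {
            ast = p.apply(ast);
        }
        return ast;
    }

    public static void main(String[] args) {
        List<UnaryOperator<String>> processors = List.of(
            s -> s.toLowerCase(),                                 // stand-in for a lowercase processor
            s -> s.replace("hst", "\"hubble space telescope\"")   // stand-in for synonym expansion
        );
        System.out.println(process("HST program", processors));
        // prints: "hubble space telescope" program
    }
}
```

The extension point is the shape of the list: to change the parser's behavior you add, remove, or reorder a processor rather than editing the grammar or the query builders.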
[jira] [Updated] (LUCENE-5014) ANTLR Lucene query parser
[ https://issues.apache.org/jira/browse/LUCENE-5014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Roman Chyla updated LUCENE-5014: Attachment: LUCENE-5014.txt The patch that *actually* contains the extended parser with NEAR operator support
[jira] [Commented] (LUCENE-5014) ANTLR Lucene query parser
[ https://issues.apache.org/jira/browse/LUCENE-5014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13695149#comment-13695149 ] Roman Chyla commented on LUCENE-5014: -

Adding an example: the standard Lucene grammar extended with NEAR operators (as discussed above). This should illustrate how easy it is to extend/modify/add a new query dialect. Handling of NEAR operators is not at all trivial, so I hope you will have some fun realizing it can be done in two lines ;)

{code}
setGrammarName("ExtendedLuceneGrammar");
((AqpQueryTreeBuilder) qp.getQueryBuilder()).setBuilder(AqpNearQueryNode.class, new AqpNearQueryNodeBuilder());
{code}

Have a look at TestAqpExtendedLGSimple.
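The reason the extension above fits in two lines is the dispatch pattern behind the flexible query parser: the tree builder keeps a map from AST node class to builder, so supporting a new node type means registering one more entry. A toy, self-contained sketch of that pattern (hypothetical names and string output; not the actual AqpQueryTreeBuilder code):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class TreeBuilder {
    // AST node class -> the builder that turns nodes of that class into a query.
    private final Map<Class<?>, Function<Object, String>> builders = new HashMap<>();

    public void setBuilder(Class<?> nodeClass, Function<Object, String> builder) {
        builders.put(nodeClass, builder);
    }

    public String build(Object node) {
        Function<Object, String> b = builders.get(node.getClass());
        if (b == null) throw new IllegalStateException("no builder for " + node.getClass());
        return b.apply(node);
    }

    // A hypothetical NEAR node as the grammar might produce it.
    record NearNode(String left, String right, int slop) { }

    public static void main(String[] args) {
        TreeBuilder tb = new TreeBuilder();
        // Registering support for a new node type is one call, analogous to
        // setBuilder(AqpNearQueryNode.class, new AqpNearQueryNodeBuilder()).
        tb.setBuilder(NearNode.class, n -> {
            NearNode near = (NearNode) n;
            return "spanNear(" + near.left() + ", " + near.right() + ", " + near.slop() + ")";
        });
        System.out.println(tb.build(new NearNode("dog", "cat", 5)));
        // prints: spanNear(dog, cat, 5)
    }
}
```

The grammar change supplies the new node type; the builder registration supplies its semantics. Neither touches the other's code, which is what keeps the extension small.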
[jira] [Updated] (LUCENE-5014) ANTLR Lucene query parser
[ https://issues.apache.org/jira/browse/LUCENE-5014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Roman Chyla updated LUCENE-5014: Attachment: LUCENE-5014.txt The same patch + Lucene grammar extended with the NEARx operator
[jira] [Commented] (LUCENE-5014) ANTLR Lucene query parser
[ https://issues.apache.org/jira/browse/LUCENE-5014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13667908#comment-13667908 ] Roman Chyla commented on LUCENE-5014: -

Hi David,

In practical terms ANTLR can do exactly the same things as PEG (i.e. lookahead, backtracking, memoization) - see http://stackoverflow.com/questions/8816759/ll-versus-peg-parsers-what-is-the-difference But it is also capable of doing more than PEG (i.e. better error recovery - a PEG parser needs to parse the whole tree before it discovers an error, and then the error recovery is not the same thing). PEGs can be easier *especially* because of the first-choice operator; in fact at times I wished that ANTLR just chose the first available option (well, it does, but it reports an error, and I didn't want to have a grammar with errors). So, in the CFG/ANTLR world, ambiguity is solved using syntactic predicates (lookahead) -- so far, this has been theoretical; here are a few more points:

Clarity
===
I looked at the presentation and the parser contains the operator precedence, however there it is spread across several screens of Java code. I find the following much more readable:

{code}
mainQ : clauseOr+ EOF ;
clauseOr : clauseAnd (or clauseAnd)* ;
clauseAnd : clauseNot (and clauseNot)* ;
{code}

It is essentially the same thing, but it is independent of the Java code and I can see it in a few lines - and extend it by adding a few more lines. The patch I wrote makes the handling of the separate grammar and the generated code seamless. So 2 of the 3 advantages of PEG over ANTLR disappear.

Syntax vs semantics (business logic)
===
The example from the presentation needs to be much more involved if it is to be used in real life. Consider this query:

{noformat}
dog NEAR cat
{noformat}

This is going to work only in the simplest case, where each term is a single TermQuery. Yet if there was a synonym expansion (where it would go inside the PEG parser is one question), the parser needs to *rewrite* the query, something like:

{noformat}
(dog|canin) NEAR cat --> (dog NEAR cat) OR (canin NEAR cat)
{noformat}

So there you get the 'spaghetti problem' - in the example presented, the logic that rewrites the query must reside in the same place as the query parsing. That is not an improvement IMO; it is the same thing as the old Lucene parsers written in JavaCC, which are very difficult to extend or debug. I think I'll add a new grammar with the proximity operators so that you can see how easy it is to solve the same situation with ANTLR (but you will need to read the patch this time ;)) btw. the patch is big because I included the HTML with SVG charts of the generated parse trees and one Excel file (that one helps in writing unit tests for the grammar).

Developer vs user experience
===
I think PEG definitely looks simpler (in the presented example) and its main advantage is the first-choice operator. But since ANTLR can do the same and has a programming-language-independent grammar, it can do the same job. The difference may be in the maturity of the projects, the tools available (i.e. debuggers) - and of course the implementation (see the link above for details). I can imagine that for PEG you can use your IDE of choice, while with ANTLR there is this 'pesky' level of abstraction - but there are tools that make life bearable, such as ANTLRWorks or the Eclipse ANTLR debugger (though I have not liked that one); grammar unit tests; and I added ways to debug/view the grammar. Again, I recommend trying it, e.g.

{code}
ant -f aqp-build.xml gunit
# edit StandardLuceneGrammar and save as 'mytestgrammar'
ant -f aqp-build.xml try-view -Dquery="foo NEAR bar" -Dgrammar=mytestgrammar
{code}

There may of course be more things to consider, but I believe the 3 issues above present some interesting vantage points.
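The (dog|canin) NEAR cat rewrite sketched in the comment above is a distributive transform over the query AST: an OR under a NEAR is pulled up into an OR of simple NEARs. A self-contained toy version of that transform (invented node classes, not the patch's code):

```java
import java.util.ArrayList;
import java.util.List;

public class NearRewrite {
    interface Node { }
    record Term(String text) implements Node { }
    record Or(List<Node> children) implements Node { }
    record Near(Node left, Node right, int slop) implements Node { }

    // Distribute NEAR over an OR on its left side:
    // (a|b) NEAR c  ->  (a NEAR c) OR (b NEAR c)
    static Node rewrite(Node n) {
        if (n instanceof Near near && near.left() instanceof Or or) {
            List<Node> expanded = new ArrayList<>();
            for (Node child : or.children()) {
                expanded.add(new Near(child, near.right(), near.slop()));
            }
            return new Or(expanded);
        }
        return n;
    }

    static String print(Node n) {
        if (n instanceof Term t) return t.text();
        if (n instanceof Near near)
            return "(" + print(near.left()) + " NEAR " + print(near.right()) + ")";
        List<String> parts = new ArrayList<>();
        for (Node c : ((Or) n).children()) parts.add(print(c));
        return String.join(" OR ", parts);
    }

    public static void main(String[] args) {
        // dog expanded with a synonym by the analyzer, then NEAR cat
        List<Node> synonyms = List.of(new Term("dog"), new Term("canin"));
        Node query = new Near(new Or(synonyms), new Term("cat"), 5);
        System.out.println(print(rewrite(query)));
        // prints: (dog NEAR cat) OR (canin NEAR cat)
    }
}
```

In a pipeline-of-processors design this rewrite lives in its own processor, which is the point of the comment: the grammar stays untouched while the semantics are handled separately.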
[jira] [Comment Edited] (LUCENE-5014) ANTLR Lucene query parser
[ https://issues.apache.org/jira/browse/LUCENE-5014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13667908#comment-13667908 ] Roman Chyla edited comment on LUCENE-5014 at 5/27/13 7:04 PM: -

Hi David,

In practical terms ANTLR can do exactly the same things as PEG (i.e. lookahead, backtracking, memoization) - see http://stackoverflow.com/questions/8816759/ll-versus-peg-parsers-what-is-the-difference But it is also capable of doing more than PEG (i.e. better error recovery - a PEG parser needs to parse the whole tree before it discovers an error, and then the error recovery is not the same thing). PEGs can be easier *especially* because of the first-choice operator; in fact at times I wished that ANTLR just chose the first available option (well, it does, but it reports an error, and I didn't want to have a grammar with errors). So, in the CFG/ANTLR world, ambiguity is solved using syntactic predicates (lookahead) -- so far, this has been theoretical; here are a few more points:

Grammar vs code
===
I looked at the presentation and the parser contains the operator precedence, however there it is spread across several screens of Java code. I find the following much more readable:

{code}
mainQ : clauseOr+ EOF ;
clauseOr : clauseAnd (or clauseAnd)* ;
clauseAnd : clauseNot (and clauseNot)* ;
{code}

It is essentially the same thing, but it is independent of the Java code and I can see it in a few lines - and extend it by adding a few more lines. The patch I wrote makes the handling of the separate grammar and the generated code seamless. So 2 of the 3 advantages of PEG over ANTLR disappear.

Syntax vs semantics (business logic)
===
The example from the presentation needs to be much more involved if it is to be used in real life. Consider this query:

{noformat}
dog NEAR cat
{noformat}

This is going to work only in the simplest case, where each term is a single TermQuery. Yet if there was a synonym expansion (where it would go inside the PEG parser is one question), the parser needs to *rewrite* the query, something like:

{noformat}
(dog|canin) NEAR cat --> (dog NEAR cat) OR (canin NEAR cat)
{noformat}

So there you get the 'spaghetti problem' - in the example presented, the logic that rewrites the query must reside in the same place as the query parsing. That is not an improvement IMO; it is the same thing as the old Lucene parsers written in JavaCC, which are very difficult to extend or debug. I think I'll add a new grammar with the proximity operators so that you can see how easy it is to solve the same situation with ANTLR (but you will need to read the patch this time ;)) btw. the patch is big because I included the HTML with SVG charts of the generated parse trees and one Excel file (that one helps in writing unit tests for the grammar).

Developer vs user experience
===
I think PEG definitely looks simpler to developers (in the presented example) and its main advantage is the first-choice operator. But since ANTLR can do the same and has a programming-language-independent grammar, it can do the same job. The difference may be in the maturity of the projects, the tools available (i.e. debuggers) - and of course the implementation (see the link above for details). I can imagine that for PEG you can use your IDE of choice, while with ANTLR there is this 'pesky' level of abstraction - but there are tools that make life bearable, such as ANTLRWorks or the Eclipse ANTLR debugger (though I have not liked that one); grammar unit tests; and I added ways to debug/view the grammar. If you apply the patch, you can try:

{code}
ant -f aqp-build.xml gunit
# edit StandardLuceneGrammar and save as 'mytestgrammar'
ant -f aqp-build.xml try-view -Dquery="foo NEAR bar" -Dgrammar=mytestgrammar
{code}

There may of course be more things to consider, but I believe the 3 issues above present some interesting vantage points.
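The three grammar rules quoted in the comment above encode operator precedence structurally: clauseOr sits above clauseAnd, so AND binds tighter than OR. A hand-written sketch of that same shape as a recursive-descent parser (ANTLR generates equivalent code from the grammar; clauseNot is folded into clauseAnd here for brevity):

```java
import java.util.Arrays;
import java.util.List;

public class PrecedenceParser {
    private final List<String> tokens;
    private int pos = 0;

    PrecedenceParser(String query) {
        // A toy tokenizer: whitespace-separated terms and the keywords AND/OR.
        this.tokens = Arrays.asList(query.trim().split("\\s+"));
    }

    // clauseOr : clauseAnd (OR clauseAnd)* ;
    String clauseOr() {
        String left = clauseAnd();
        while (pos < tokens.size() && tokens.get(pos).equals("OR")) {
            pos++;
            left = "(" + left + " OR " + clauseAnd() + ")";
        }
        return left;
    }

    // clauseAnd : term (AND term)* ;  (clauseNot elided for brevity)
    String clauseAnd() {
        String left = tokens.get(pos++);
        while (pos < tokens.size() && tokens.get(pos).equals("AND")) {
            pos++;
            left = "(" + left + " AND " + tokens.get(pos++) + ")";
        }
        return left;
    }

    public static void main(String[] args) {
        // AND binds tighter than OR because clauseAnd is nested under clauseOr:
        System.out.println(new PrecedenceParser("a OR b AND c").clauseOr());
        // prints: (a OR (b AND c))
    }
}
```

The grammar version says the same thing in three lines, which is the readability point being made: the precedence lives in the rule nesting, not in pages of parser code.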
[jira] [Created] (LUCENE-5014) ANTLR Lucene query parser
Roman Chyla created LUCENE-5014: --- Summary: ANTLR Lucene query parser Key: LUCENE-5014 URL: https://issues.apache.org/jira/browse/LUCENE-5014 Project: Lucene - Core Issue Type: Improvement Components: core/queryparser, modules/queryparser Affects Versions: 4.3 Environment: all Reporter: Roman Chyla
[jira] [Updated] (LUCENE-5014) ANTLR Lucene query parser
[ https://issues.apache.org/jira/browse/LUCENE-5014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Roman Chyla updated LUCENE-5014: Attachment: LUCENE-5014.txt Patch without binary files (if possible, use the other patch)
[jira] [Updated] (LUCENE-5014) ANTLR Lucene query parser
[ https://issues.apache.org/jira/browse/LUCENE-5014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Roman Chyla updated LUCENE-5014: Attachment: LUCENE-5014.txt Includes binary files (ie. one jar and xls) svn diff --force --diff-cmd /usr/bin/diff -x -au LUCENE-5014.txt ANTLR Lucene query parser - Key: LUCENE-5014 URL: https://issues.apache.org/jira/browse/LUCENE-5014 Project: Lucene - Core Issue Type: Improvement Components: core/queryparser, modules/queryparser Affects Versions: 4.3 Environment: all Reporter: Roman Chyla Labels: antlr, query, queryparser Attachments: LUCENE-5014.txt, LUCENE-5014.txt I would like to propose a new way of building query parsers for Lucene. Currently, most Lucene parsers are hard to extend because they are either written in Java (ie. the SOLR query parser, or edismax) or the parsing logic is 'married' with the query building logic (i.e. the standard lucene parser, generated by JavaCC) - which makes any extension really hard. Few years back, Lucene got the contrib/modern query parser (later renamed to 'flexible'), yet that parser didn't become a star (it must be very confusing for many users). However, that parsing framework is very powerful! And it is a real pity that there aren't more parsers already using it - because it allows us to add/extend/change almost any aspect of the query parsing. So, if we combine ANTLR + queryparser.flexible, we can get very powerful framework for building almost any query language one can think of. And I hope this extension can become useful. 
The details: - every new query syntax is written in EBNF, it lives in separate files (and can be tested/developed independently - using 'gunit') - ANTLR parser generates parsing code (and it can generate parsers in several languages, the main target is Java, but it can also do Python - which may be interesting for pylucene) - the parser generates AST (abstract syntax tree) which is consumed by a 'pipeline' of processors, users can easily modify this pipeline to add a desired functionality - the new parser contains a few (very important) debugging functions; it can print results of every stage of the build, generate AST's as graphical charts; ant targets help to build/test/debug grammars - I've tried to reuse the existing queryparser.flexible components as much as possible, only adding new processors when necessary Assumptions about the grammar: - every grammar must have one top parse rule called 'mainQ' - parsers must generate AST (Abstract Syntax Tree) The structure of the AST is left open, there are components which make assumptions about the shape of the AST (ie. that MODIFIER is parent of a a FIELD) however users are free to choose/write different processors with different assumptions about the AST shape. More documentation on how to use the parser can be seen here: http://29min.wordpress.com/category/antlrqueryparser/ The parser has been created more than one year back and is used in production (http://labs.adsabs.harvard.edu/adsabs/). A different dialects of query languages (with proximity operatos, functions, special logic etc) - can be seen here: https://github.com/romanchyla/montysolr/tree/master/contrib/adsabs https://github.com/romanchyla/montysolr/tree/master/contrib/invenio -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
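The grammar → AST → processor-pipeline → builder architecture described in the issue above can be sketched roughly as follows. This is an illustrative Python sketch, not the actual ANTLR or queryparser.flexible API; all class and function names here are hypothetical:

```python
# Minimal sketch of the proposed architecture: a parser produces an AST,
# a pipeline of processors rewrites it, and a builder turns the final
# tree into a query. All names are hypothetical stand-ins.

class Node:
    def __init__(self, kind, value=None, children=()):
        self.kind, self.value, self.children = kind, value, list(children)

def parse(query):
    # Stand-in for the ANTLR-generated parser: handles only "a OP b".
    left, op, right = query.split()
    return Node(op, children=[Node("TERM", left), Node("TERM", right)])

class LowercaseProcessor:
    # One stage of the pipeline: rewrites TERM nodes in place.
    def process(self, node):
        if node.kind == "TERM":
            node.value = node.value.lower()
        for child in node.children:
            self.process(child)
        return node

def build(node):
    # Stand-in for the query builder consuming the final AST.
    if node.kind == "TERM":
        return node.value
    return "(" + (" %s " % node.kind).join(build(c) for c in node.children) + ")"

pipeline = [LowercaseProcessor()]
ast = parse("Dog AND Cat")
for proc in pipeline:
    ast = proc.process(ast)
print(build(ast))  # -> (dog AND cat)
```

The point of the design is that new behaviour is added by inserting another processor into the pipeline, without touching the grammar or the builder.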
Re: New query parser?
Hello, The new JIRA issue has been created - https://issues.apache.org/jira/browse/LUCENE-5014 Thank you for trying it, roman On Wed, May 15, 2013 at 7:34 AM, Roman Chyla roman.ch...@gmail.com wrote: Hi Jan, Thanks for the thumbs up On Tue, May 14, 2013 at 11:14 AM, Jan Høydahl jan@cominvent.com wrote: Hello :) I think it has been the intention of the dev community for a long time to start using the flex parser framework, and in this regard this contribution is much welcome as a kickstarter for that. I have not looked much at the code, but I hope it could be a starting point for writing future parsers in a less spaghetti way. One question. Say we want to add a new operator such as NEAR/N. Ideally this should be added in Lucene, then all the Solr QParsers extending the lucene flex parser would benefit from the same new operator. Would this be easily achieved with your code you think? To add a new operator is very simple on the syntax level -- i.e. when I want the NEAR/x operator, I just change the ANTLR grammar, which produces the appropriate abstract syntax tree. The flex parser is consuming this. Yet, imagine the following query: dog NEAR/5 cat If you are using synonyms, an analyzer could have expanded dog with synonyms, so it becomes something like (dog | canin) NEAR/5 cat and since Lucene cannot handle these queries, the flex builder must rewrite them, effectively producing SpanNear(SpanOr(dog | canin), SpanTerm(cat), 5) but you could also argue that a better way to handle this query is: SpanNear(dog, cat, 5) OR SpanNear(canin, cat, 5) If that is the case, then a different builder will have to be used - Just an example where the syntax is relatively simple, but the semantics is the hard part. But I believe the flex parser gives all the necessary tools to deal with that and avoid the spaghetti problem --roman We also have a ton of feature requests on the eDisMax parser for new kinds of query syntax support.
Before we start implementing that on top of the already-hard-to-maintain eDismax code, we should think about re-implementing eDismax on top of flex, perhaps on top of Roman's contrib here? btw: i am using edismax in one of my grammars -- ie. users can type: query AND edismax(foo OR (dog AND cat)) -- and the edismax() will be parsed by edismax, but I hit the problems there as well, it is not doing such a nice job with operators and of course it doesn't know how to handle multi-token synonym expansion, but I think it could be nicely extracted into a flex processor and effectively become a plugin for a solr parser (now, it is a parser of its own, which makes it hard to extend) -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com 14. mai 2013 kl. 17:07 skrev Roman Chyla roman.ch...@gmail.com: Hello World! Following the recommended practice I'd like to let you know that I am about to start porting our existing query parser into JIRA with the aim of making it available to Lucene/SOLR community. The query parser is built on top of the flexible query parser, but it separates the parsing (ANTLR) and the query building - it allows for a very sophisticated custom logic and has self-retrospecting methods, so one can actually 'see' what is going on - I have had lots of FUN working with it (which I consider to be a feature, not a shameless plug ;)). Some write up is here: http://29min.wordpress.com/category/antlrqueryparser/ You can see the source code at: https://github.com/romanchyla/montysolr/tree/master/contrib/antlrqueryparser If you think this project is duplicating something or even being useless (I hope not!) please let me know, stop me, say something... Thank you! roman
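The two possible rewrites of `dog NEAR/5 cat` under synonym expansion discussed in this thread can be mocked up as follows. Plain strings stand in for Lucene's SpanNear/SpanOr query objects, and the synonym map is a hypothetical example:

```python
# Sketch of the two rewrite strategies for a proximity query whose left
# term was expanded with a synonym ("dog" -> "canin"). Strings are used
# in place of real Lucene span query objects.

def span_near(a, b, slop):
    return "SpanNear(%s, %s, %d)" % (a, b, slop)

def span_or(terms):
    return "SpanOr(%s)" % " | ".join(terms)

SYNONYMS = {"dog": ["dog", "canin"]}  # hypothetical expansion

def rewrite_grouped(left, right, slop):
    # One query: SpanNear(SpanOr(dog | canin), cat, 5)
    return span_near(span_or(SYNONYMS.get(left, [left])), right, slop)

def rewrite_expanded(left, right, slop):
    # Alternative: SpanNear(dog, cat, 5) OR SpanNear(canin, cat, 5)
    return " OR ".join(span_near(t, right, slop)
                       for t in SYNONYMS.get(left, [left]))

print(rewrite_grouped("dog", "cat", 5))
print(rewrite_expanded("dog", "cat", 5))
```

As the thread notes, the syntax is identical in both cases; choosing between the two trees is a builder decision, which is exactly where the flex framework lets you plug in different logic.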
Re: New query parser?
Hi Jan, Thanks for the thumbs up On Tue, May 14, 2013 at 11:14 AM, Jan Høydahl jan@cominvent.com wrote: Hello :) I think it has been the intention of the dev community for a long time to start using the flex parser framework, and in this regard this contribution is much welcome as a kickstarter for that. I have not looked much at the code, but I hope it could be a starting point for writing future parsers in a less spaghetti way. One question. Say we want to add a new operator such as NEAR/N. Ideally this should be added in Lucene, then all the Solr QParsers extending the lucene flex parser would benefit from the same new operator. Would this be easily achieved with your code you think? To add a new operator is very simple on the syntax level -- i.e. when I want the NEAR/x operator, I just change the ANTLR grammar, which produces the appropriate abstract syntax tree. The flex parser is consuming this. Yet, imagine the following query: dog NEAR/5 cat If you are using synonyms, an analyzer could have expanded dog with synonyms, so it becomes something like (dog | canin) NEAR/5 cat and since Lucene cannot handle these queries, the flex builder must rewrite them, effectively producing SpanNear(SpanOr(dog | canin), SpanTerm(cat), 5) but you could also argue that a better way to handle this query is: SpanNear(dog, cat, 5) OR SpanNear(canin, cat, 5) If that is the case, then a different builder will have to be used - Just an example where the syntax is relatively simple, but the semantics is the hard part. But I believe the flex parser gives all the necessary tools to deal with that and avoid the spaghetti problem --roman We also have a ton of feature requests on the eDisMax parser for new kinds of query syntax support. Before we start implementing that on top of the already-hard-to-maintain eDismax code, we should think about re-implementing eDismax on top of flex, perhaps on top of Roman's contrib here? BTW: I am using edismax in one of my grammars -- i.e.
users can type: query AND edismax(foo OR (dog AND cat)) -- and the edismax() will be parsed by edismax, but I hit the problems there as well, it is not doing such a nice job with operators and of course it doesn't know how to handle multi-token synonym expansion, but I think it could be nicely extracted into a flex processor and effectively become a plugin for a solr parser (now, it is a parser of its own, which makes it hard to extend) -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com 14. mai 2013 kl. 17:07 skrev Roman Chyla roman.ch...@gmail.com: Hello World! Following the recommended practice I'd like to let you know that I am about to start porting our existing query parser into JIRA with the aim of making it available to Lucene/SOLR community. The query parser is built on top of the flexible query parser, but it separates the parsing (ANTLR) and the query building - it allows for a very sophisticated custom logic and has self-retrospecting methods, so one can actually 'see' what is going on - I have had lots of FUN working with it (which I consider to be a feature, not a shameless plug ;)). Some write up is here: http://29min.wordpress.com/category/antlrqueryparser/ You can see the source code at: https://github.com/romanchyla/montysolr/tree/master/contrib/antlrqueryparser If you think this project is duplicating something or even being useless (I hope not!) please let me know, stop me, say something... Thank you! roman
New query parser?
Hello World! Following the recommended practice I'd like to let you know that I am about to start porting our existing query parser into JIRA with the aim of making it available to Lucene/SOLR community. The query parser is built on top of the flexible query parser, but it separates the parsing (ANTLR) and the query building - it allows for a very sophisticated custom logic and has self-retrospecting methods, so one can actually 'see' what is going on - I have had lots of FUN working with it (which I consider to be a feature, not a shameless plug ;)). Some write up is here: http://29min.wordpress.com/category/antlrqueryparser/ You can see the source code at: https://github.com/romanchyla/montysolr/tree/master/contrib/antlrqueryparser If you think this project is duplicating something or even being useless (I hope not!) please let me know, stop me, say something... Thank you! roman
[jira] [Created] (LUCENE-4679) LowercaseExpandedTermsQueryNodeProcessor changes regex queries
Roman Chyla created LUCENE-4679: --- Summary: LowercaseExpandedTermsQueryNodeProcessor changes regex queries Key: LUCENE-4679 URL: https://issues.apache.org/jira/browse/LUCENE-4679 Project: Lucene - Core Issue Type: Wish Reporter: Roman Chyla Priority: Trivial This is really a very silly request, but could the lowercase processor 'abstain' from changing regex queries? For example, \\W should stay uppercase, but it will be lowercased.
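The requested behaviour - a lowercasing step that abstains from regex query nodes so escapes like `\W` keep their case - amounts to something like the sketch below. The node representation is hypothetical; this is not the actual LowercaseExpandedTermsQueryNodeProcessor code:

```python
# Sketch of a lowercasing pass that skips regex nodes. Each query node is
# modelled as a (kind, text) pair; the kinds are hypothetical.

def lowercase_terms(nodes):
    out = []
    for kind, text in nodes:
        if kind == "REGEX":
            out.append((kind, text))      # abstain: \W must stay uppercase
        else:
            out.append((kind, text.lower()))
    return out

nodes = [("TERM", "Hubble"), ("REGEX", r"\Wfoo")]
print(lowercase_terms(nodes))  # [('TERM', 'hubble'), ('REGEX', '\\Wfoo')]
```

Without the `REGEX` guard, `\W` (non-word character) would be silently turned into `\w` (word character), inverting the meaning of the pattern - which is the bug the issue describes.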
[jira] [Updated] (LUCENE-4679) LowercaseExpandedTermsQueryNodeProcessor changes regex queries
[ https://issues.apache.org/jira/browse/LUCENE-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Roman Chyla updated LUCENE-4679: Attachment: LUCENE-4679.patch LowercaseExpandedTermsQueryNodeProcessor changes regex queries -- Key: LUCENE-4679 URL: https://issues.apache.org/jira/browse/LUCENE-4679 Project: Lucene - Core Issue Type: Wish Reporter: Roman Chyla Priority: Trivial Attachments: LUCENE-4679.patch This is really a very silly request, but could the lowercase processor 'abstain' from changing regex queries? For example, \\W should stay uppercase, but it will be lowercased.
[jira] [Updated] (LUCENE-4679) LowercaseExpandedTermsQueryNodeProcessor changes regex queries
[ https://issues.apache.org/jira/browse/LUCENE-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Roman Chyla updated LUCENE-4679: Description: This is really a very silly request, but could the lowercase processor 'abstain' from changing regex queries? For example, W should stay uppercase, but it will be lowercased. was: This is really a very silly request, but could the lowercase processor 'abstain' from changing regex queries? For example, \\W should stay uppercase, but it will be lowercased. LowercaseExpandedTermsQueryNodeProcessor changes regex queries -- Key: LUCENE-4679 URL: https://issues.apache.org/jira/browse/LUCENE-4679 Project: Lucene - Core Issue Type: Wish Reporter: Roman Chyla Priority: Trivial Attachments: LUCENE-4679.patch This is really a very silly request, but could the lowercase processor 'abstain' from changing regex queries? For example, W should stay uppercase, but it will be lowercased.
[jira] [Updated] (LUCENE-4679) LowercaseExpandedTermsQueryNodeProcessor changes regex queries
[ https://issues.apache.org/jira/browse/LUCENE-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Roman Chyla updated LUCENE-4679: Description: This is really a very silly request, but could the lowercase processor 'abstain' from changing regex queries? For example, W should stay uppercase, but it is lowercased. was: This is really a very silly request, but could the lowercase processor 'abstain' from changing regex queries? For example, W should stay uppercase, but it will be lowercased. LowercaseExpandedTermsQueryNodeProcessor changes regex queries -- Key: LUCENE-4679 URL: https://issues.apache.org/jira/browse/LUCENE-4679 Project: Lucene - Core Issue Type: Wish Reporter: Roman Chyla Priority: Trivial Attachments: LUCENE-4679.patch This is really a very silly request, but could the lowercase processor 'abstain' from changing regex queries? For example, W should stay uppercase, but it is lowercased.
[jira] [Updated] (LUCENE-4499) Multi-word synonym filter (synonym expansion)
[ https://issues.apache.org/jira/browse/LUCENE-4499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Roman Chyla updated LUCENE-4499: Attachment: LUCENE-4499.patch A new patch, as the old version was extending the wrong class (which caused web tests to fail) Multi-word synonym filter (synonym expansion) - Key: LUCENE-4499 URL: https://issues.apache.org/jira/browse/LUCENE-4499 Project: Lucene - Core Issue Type: Improvement Components: core/other Affects Versions: 4.1, 5.0 Reporter: Roman Chyla Priority: Minor Labels: analysis, multi-word, synonyms Fix For: 5.0 Attachments: LUCENE-4499.patch, LUCENE-4499.patch I apologize for bringing the multi-token synonym expansion up again. There is an old, unresolved issue at LUCENE-1622 [1] While solving the problem for our needs [2], I discovered that the current SolrSynonym parser (and the wonderful FST) have almost everything to satisfactorily handle both the query and index time synonym expansion. It seems that people often need to use the synonym filter *slightly* differently at indexing and query time. In our case, we must do different things during indexing and querying. Example sentence: Mirrors of the Hubble space telescope pointed at XA5 This is what we need (comma marks position bump): indexing: mirrors,hubble|hubble space telescope|hst,space,telescope,pointed,xa5|astroobject#5 querying: +mirrors +(hubble space telescope | hst) +pointed +(xa5|astroobject#5) This translates to the following needs: indexing time: single-token synonyms = return only synonyms multi-token synonyms = return original tokens *AND* the synonyms query time: single-token: return only synonyms (but preserve case) multi-token: return only synonyms We need the original tokens for the proximity queries; if we indexed 'hubble space telescope' as one token, we cannot search for 'hubble NEAR telescope' You may (not) be surprised, but Lucene already supports ALL of these requirements. The patch is an attempt to state the problem differently.
I am not sure if it is the best option, however it works perfectly for our needs and it seems it could work for the general public too. Especially if the SynonymFilterFactory had preconfigured sets of SynonymMapBuilders - people would just choose the one that fits their situation. Please look at the unittest. links: [1] https://issues.apache.org/jira/browse/LUCENE-1622 [2] http://labs.adsabs.harvard.edu/trac/ads-invenio/ticket/158 [3] seems to have a similar request: http://lucene.472066.n3.nabble.com/Proposal-Full-support-for-multi-word-synonyms-at-query-time-td4000522.html
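The index-time vs. query-time expansion modes described in the issue can be sketched as follows. This is a simplified illustration under assumed names (the `syn::` prefix mirrors the indexed terms shown elsewhere in this archive); it omits position-increment handling and is not the SynonymFilter implementation:

```python
# Sketch of the two synonym-expansion modes for the multi-token synonym
# "hubble space telescope" -> "hst": at index time we keep the original
# tokens AND the synonyms (so proximity queries like 'hubble NEAR telescope'
# still work); at query time we keep only the synonyms.

SYNONYMS = {
    ("hubble", "space", "telescope"): ["hubble space telescope", "hst"],
}

def expand(tokens, keep_original):
    out, i = [], 0
    while i < len(tokens):
        for phrase, syns in SYNONYMS.items():
            if tuple(tokens[i:i + len(phrase)]) == phrase:
                if keep_original:                      # index-time behaviour
                    out.extend(tokens[i:i + len(phrase)])
                out.extend("syn::" + s for s in syns)  # synonyms in both modes
                i += len(phrase)
                break
        else:
            out.append(tokens[i])
            i += 1
    return out

doc = ["mirrors", "hubble", "space", "telescope", "pointed"]
print(expand(doc, keep_original=True))   # index time: originals AND synonyms
print(expand(doc, keep_original=False))  # query time: synonyms only
```

A real filter would additionally emit the synonym tokens at position increment 0 over the original span, which is what the `pos=0` entries in the indexer traces above correspond to.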
Re: pro coding style
On Fri, Nov 30, 2012 at 8:56 AM, Robert Muir rcm...@gmail.com wrote: On Fri, Nov 30, 2012 at 8:50 AM, Per Steffensen st...@designware.dk wrote: Robert Muir skrev: Is it really git? Because it's my understanding pull requests aren't actually a git thing but a github thing. The distinction is important. Actually I'm not sure. Have never used git outside github, but at least part of it has to be git and not github (I think) - or else I couldn't imagine how you get the advantages you get. Remember that when using git you actually run a repository on every developer's local machine. When you commit, you commit only to your local repository. You need to push in order to have it upstreamed (as they call it) Right, I'm positive this (pull requests) is github :) I just wanted to make this point: when we have discussions about using git instead of svn, I'm not sure it makes things easier on anyone, actually probably worse and more complex. It's the github workflow that contributors want (I would +1 some scheme that supports this!), but git by itself is pretty unusable. Github is like a nice front-end to this mess. This is like a medicine to me! With all the craze about git (and we use it for our main project and also for solr development) it just confirms my 3-year-long experience. Git is a pain. Github is great (too bad there is git behind it ;)) And now the problems of forks - with git the fork is the natural evil - git just makes it established practice. But it still doesn't save us from the (slow) process of incorporating new patches. While it is inevitable and we cannot be more grateful to all the committers for their hard work (really thanks!) perhaps there is a way to make solr/lucene more sandbox friendly?
In our organization we are doing something similar (to using SOLR as a library); the automated build/deployment goes like this:
- checkout our sources
- download/build solr sources
- compile our code
- merge with solr
- test
- deploy
This avoids forking solr and we always develop against the chosen branch. The pain was in porting the solr build infrastructure - if this infrastructure existed inside solr, ready for developers to take advantage of it, others would be saved the pain of reinventing it. As far as I am aware, there is only one hard problem - the confusing nature of the classloaders inside webcontainers; I have really had a hard time understanding it well enough to get it right - but there are surely more knowledgeable people here. And if the worst comes to the worst, the automated procedure could easily merge jars. Sounds evil? Is forking Solr a better way? roman
[jira] [Commented] (LUCENE-4499) Multi-word synonym filter (synonym expansion)
[ https://issues.apache.org/jira/browse/LUCENE-4499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13507440#comment-13507440 ] Roman Chyla commented on LUCENE-4499: - Hi Nolan, your case seems to confirm a need for some solution. You have decided to make a separate query parser; I have put the expanding logic into a query parser as well. See this for the working example: https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/test/org/apache/solr/analysis/TestAdsabsTypeFulltextParsing.java And its config https://github.com/romanchyla/montysolr/blob/master/contrib/examples/adsabs/solr/collection1/conf/schema.xml#L325 I see two added benefits (besides not needing a query parser plugin - in our case, it must be plugged into our qparser): 1. you can use the filter at index/query time inside a standard query parser 2. special configuration for synonym expansion (for example, we have found it very useful to be able to search for multi-tokens in a case-insensitive manner, but recognize single tokens only case-sensitively; or expand with multi-token synonyms only for multi-word originals and output also the original words, otherwise eat them (replace them)) Nice blog post, I wish I could write as instructively :) Multi-word synonym filter (synonym expansion) - Key: LUCENE-4499 URL: https://issues.apache.org/jira/browse/LUCENE-4499 Project: Lucene - Core Issue Type: Improvement Components: core/other Affects Versions: 4.1, 5.0 Reporter: Roman Chyla Priority: Minor Labels: analysis, multi-word, synonyms Fix For: 5.0 Attachments: LUCENE-4499.patch I apologize for bringing the multi-token synonym expansion up again. There is an old, unresolved issue at LUCENE-1622 [1] While solving the problem for our needs [2], I discovered that the current SolrSynonym parser (and the wonderful FST) have almost everything to satisfactorily handle both the query and index time synonym expansion.
It seems that people often need to use the synonym filter *slightly* differently at indexing and query time. In our case, we must do different things during indexing and querying. Example sentence: Mirrors of the Hubble space telescope pointed at XA5 This is what we need (comma marks position bump): indexing: mirrors,hubble|hubble space telescope|hst,space,telescope,pointed,xa5|astroobject#5 querying: +mirrors +(hubble space telescope | hst) +pointed +(xa5|astroobject#5) This translates to the following needs: indexing time: single-token synonyms = return only synonyms multi-token synonyms = return original tokens *AND* the synonyms query time: single-token: return only synonyms (but preserve case) multi-token: return only synonyms We need the original tokens for the proximity queries; if we indexed 'hubble space telescope' as one token, we cannot search for 'hubble NEAR telescope' You may (not) be surprised, but Lucene already supports ALL of these requirements. The patch is an attempt to state the problem differently. I am not sure if it is the best option, however it works perfectly for our needs and it seems it could work for the general public too. Especially if the SynonymFilterFactory had preconfigured sets of SynonymMapBuilders - people would just choose the one that fits their situation. Please look at the unittest. links: [1] https://issues.apache.org/jira/browse/LUCENE-1622 [2] http://labs.adsabs.harvard.edu/trac/ads-invenio/ticket/158 [3] seems to have a similar request: http://lucene.472066.n3.nabble.com/Proposal-Full-support-for-multi-word-synonyms-at-query-time-td4000522.html
Re: Changing Python class/module layout, dropping --rename ?
The script must have thought about it somehow :-) Have a great, undisturbed vacation! roman On Thu, Jul 19, 2012 at 9:33 AM, Andi Vajda va...@apache.org wrote: On Fri, 13 Jul 2012, Roman Chyla wrote: Hi, I was playing with the idea of creating virtual packages, attached is a working script that illustrates it. I am getting this output: Did it work? No, I haven't forgotten, I'm just on vacation. Andi.. == from org.apache.lucene.search import SearcherFactory; print SearcherFactory type 'SearcherFactory' from org.apache.lucene.analysis import Analyzer as Banalyzer; print Banalyzer type 'Analyzer' print sys.modules['org'] module 'org' (built-in) print sys.modules['org.apache'] module 'org.apache' (built-in) print sys.modules['org.apache.lucene'] module 'org.apache.lucene' (built-in) print sys.modules['org.apache.lucene.search'] module 'org.apache.lucene.search' (built-in) Cheers, roman On Fri, Jul 13, 2012 at 1:34 PM, Andi Vajda va...@apache.org wrote: On Jul 13, 2012, at 18:33, Roman Chyla roman.ch...@gmail.com wrote: I think this would be great. Let me add a little bit more to your observations (the whole night yesterday was spent fighting with renames - because I was building a project which imports shared lucene and solr -- there were thousands of same-named classes, I am not sure it would be possible without some sort of a flexible rename...) JCC is a great tool and is used by potentially many projects - so stripping org.apache seems right for pylucene, but looks arbitrary otherwise Yes, I forgot to say that there would be a way to declare one or more mappings so that org.apache.lucene becomes lucene. Andi.. (unless there is a flexible stripping mechanism). Also, if the full namespace remains original, then the code written in Python would also be executable by Jython, which is IMHO an advantage. But this being Python, the packages cannot be spread in different locations (i.e.
there can be only one org.apache.lucene.analysis package) - unless there exists (again) some flexible mechanism which populates the namespace with objects that belong there. It may seem an overkill to you, because for single projects it would work, but seems perfectly justifiable in case of imported shared libraries I don't know what is your idea for implementing the python packages, but your last email got me thinking as well - there might be a very simple way of getting to the java packages inside Python without too much work. Let's say the java org.apache.lucene.search.IndexSearcher is known to python as org_apache_lucene_search_IndexSearcher and users do: import lucene lucene.initVM() initVM() first initiates java VM (and populates the lucene namespace with all objects), but then it will call jcc.register_module(self) A new piece of code inside JCC grabs the lucene module and creates (on the fly) python packages -- using types.ModuleType (or new.module()) -- the new packages will be inserted into sys.modules so after lucene.initVM() returns users can do from org.apache.lucene.search import IndexSearcher and get lucene.org_apache_lucene_search_IndexSearcher object and also, when shared libraries are present (let's say 'solr') users do: import solr solr.initVM() The JCC will just update the existing packages and create new ones if needed (and from this perspective, having fully qualified name is safer than to have lucene.search.IndexSearcher) I think this change is totally possible and will not change the way how extensions are built. Does it have some serious flaw? I would be of course more than happy to contribute and test. Best, roman On Fri, Jul 13, 2012 at 11:47 AM, Andi Vajda va...@apache.org wrote: On Tue, 10 Jul 2012, Andi Vajda wrote: I would also like to propose a change, to allow for more flexible mechanism of generating Python class names. 
The patch doesn't change the default pylucene behaviour, but it gives people a way to replace class names with patterns. I have noticed that there are more same-name classes from different packages in the new lucene (and it becomes worse when one has to deal with both lucene and solr). Another way to fix this is to reproduce the namespace hierarchy used in Lucene, following along the Java packages, something I've been dreading to do. Lucene just loves a really long deeply nested class structure. I'm not convinced yet it is bad enough to go down that route, though. Your proposal to use patterns may in fact yield a much more convenient solution. Thanks ! Rethinking this a bit, I'm prepared to change my mind on this. Your patterned rename patch shows that we're slowly but surely reaching the limit of the current setup that consists in throwing all wrapped classes under the one global 'lucene' namespace. Lucene 4.0 has seen a large number of deeply nested classes with similar names added since 3.x. Renaming
Re: Changing Python class/module layout, dropping --rename ?
also say: - import lucene.document.Document as whateverOneLikes If that proposal isn't mortally flawed somewhere, I'm prepared to drop support for --rename and replace it with this new Python class/module layout. Since this is being talked about in the context of a major PyLucene release, version 4.0, and all tests/samples have to be reworked anyway, this backwards compat break shouldn't be too controversial, hopefully. If it is, the old --rename could be preserved for sure, but I'd prefer simplifying the JCC interface than to accrete more to it. What do you think ? Andi.. I can confirm that test/test_BinaryDocument.py crashes the JVM no more. Roman On Tue, Jul 10, 2012 at 8:54 AM, Andi Vajda va...@apache.org wrote: Hi Roman, On Mon, 9 Jul 2012, Roman Chyla wrote: Thanks, I am attaching a new patch that adds the missing test base. Sorry for the tabs, I was probably messing around with a few editors (some of them not configured properly) I integrated your test class (renaming it to fit the naming scheme used). Thanks ! So far, found one serious problem, crashes VM -- see e.g. test/test_BinaryDocument.py - when getting the document using: reader.document(0) test/test_BinaryDocument.py doesn't seem to crash the VM but fails because of some API changes. I suspect the crash to be some issue related to using an older jcc. I see a comment saying: couldn't find any combination with lucene4.0 where it would raise errors. Most of these unit tests are straight ports from the original Java version. If you're stumped about a change, check the original Java test, it may have changed too. Andi..
Re: Changing Python class/module layout, dropping --rename ?
Hi, I was playing with the idea of creating virtual packages, attached is a working script that illustrates it. I am getting this output: Did it work? == from org.apache.lucene.search import SearcherFactory; print SearcherFactory type 'SearcherFactory' from org.apache.lucene.analysis import Analyzer as Banalyzer; print Banalyzer type 'Analyzer' print sys.modules['org'] module 'org' (built-in) print sys.modules['org.apache'] module 'org.apache' (built-in) print sys.modules['org.apache.lucene'] module 'org.apache.lucene' (built-in) print sys.modules['org.apache.lucene.search'] module 'org.apache.lucene.search' (built-in) Cheers, roman On Fri, Jul 13, 2012 at 1:34 PM, Andi Vajda va...@apache.org wrote: On Jul 13, 2012, at 18:33, Roman Chyla roman.ch...@gmail.com wrote: I think this would be great. Let me add a little bit more to your observations (the whole night yesterday was spent fighting with renames - because I was building a project which imports shared lucene and solr -- there were thousands of same-named classes, I am not sure it would be possible without some sort of a flexible rename...) JCC is a great tool and is used by potentially many projects - so stripping org.apache seems right for pylucene, but looks arbitrary otherwise Yes, I forgot to say that there would be a way to declare one or more mappings so that org.apache.lucene becomes lucene. Andi.. (unless there is a flexible stripping mechanism). Also, if the full namespace remains original, then the code written in Python would also be executable by Jython, which is IMHO an advantage. But this being Python, the packages cannot be spread in different locations (i.e. there can be only one org.apache.lucene.analysis package) - unless there exists (again) some flexible mechanism which populates the namespace with objects that belong there.
It may seem an overkill to you, because for single projects it would work, but seems perfectly justifiable in case of imported shared libraries I don't know what is your idea for implementing the python packages, but your last email got me thinking as well - there might be a very simple way of getting to the java packages inside Python without too much work. Let's say the java org.apache.lucene.search.IndexSearcher is known to python as org_apache_lucene_search_IndexSearcher and users do: import lucene lucene.initVM() initVM() first initiates java VM (and populates the lucene namespace with all objects), but then it will call jcc.register_module(self) A new piece of code inside JCC grabs the lucene module and creates (on the fly) python packages -- using types.ModuleType (or new.module()) -- the new packages will be inserted into sys.modules so after lucene.initVM() returns users can do from org.apache.lucene.search import IndexSearcher and get lucene.org_apache_lucene_search_IndexSearcher object and also, when shared libraries are present (let's say 'solr') users do: import solr solr.initVM() The JCC will just update the existing packages and create new ones if needed (and from this perspective, having fully qualified name is safer than to have lucene.search.IndexSearcher) I think this change is totally possible and will not change the way how extensions are built. Does it have some serious flaw? I would be of course more than happy to contribute and test. Best, roman On Fri, Jul 13, 2012 at 11:47 AM, Andi Vajda va...@apache.org wrote: On Tue, 10 Jul 2012, Andi Vajda wrote: I would also like to propose a change, to allow for more flexible mechanism of generating Python class names. The patch doesn't change the default pylucene behaviour, but it gives people a way to replace class names with patterns. I have noticed that there are more same-name classes from different packages in the new lucene (and it becomes worse when one has to deal with both lucene and solr). 
Another way to fix this is to reproduce the namespace hierarchy used in Lucene, following along the Java packages, something I've been dreading to do. Lucene just loves a really long deeply nested class structure. I'm not convinced yet it is bad enough to go down that route, though. Your proposal to use patterns may in fact yield a much more convenient solution. Thanks ! Rethinking this a bit, I'm prepared to change my mind on this. Your patterned rename patch shows that we're slowly but surely reaching the limit of the current setup that consists in throwing all wrapped classes under the one global 'lucene' namespace. Lucene 4.0 has seen a large number of deeply nested classes with similar names added since 3.x. Renaming these one by one (or excluding some) doesn't scale. Using the proposed patterned rename scales more but makes it difficult to know what got renamed and how. Ultimately, the more classes that are like-named
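The types.ModuleType idea proposed above can be sketched in plain Python. This is only an illustration of the mechanism, not JCC's actual register_module implementation; it assumes flat names whose final component is the class name and whose package components contain no underscores:

```python
import sys
import types

def register_module(flat_module):
    """Turn flat attribute names like org_apache_lucene_search_IndexSearcher
    into real dotted packages inserted into sys.modules, so that
    `from org.apache.lucene.search import IndexSearcher` works afterwards."""
    for flat_name in dir(flat_module):
        if flat_name.startswith('_') or '_' not in flat_name:
            continue
        parts = flat_name.split('_')
        pkg_parts, cls_name = parts[:-1], parts[-1]
        parent = None
        path = ''
        for part in pkg_parts:
            path = part if not path else path + '.' + part
            module = sys.modules.get(path)
            if module is None:
                # Create the intermediate package on the fly and register it.
                module = types.ModuleType(path)
                sys.modules[path] = module
            if parent is not None:
                # Link child package as an attribute of its parent.
                setattr(parent, part, module)
            parent = module
        # Finally expose the wrapped class under its short name.
        setattr(parent, cls_name, getattr(flat_module, flat_name))
```

Calling register_module() again for a second wrapper module (e.g. a solr extension) only updates the already-registered packages, which is exactly the "update the existing packages and create new ones if needed" behaviour described above.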
Re: lucene4.0 release
Hi Andi, Thanks again. With the new JCC I encountered new errors - about already used class names - patch attached. I would also like to propose a change, to allow for a more flexible mechanism of generating Python class names. The patch doesn't change the default pylucene behaviour, but it gives people a way to replace class names with patterns. I have noticed that there are more same-name classes from different packages in the new lucene (and it becomes worse when one has to deal with both lucene and solr). I can confirm that test_BinaryDocument.py no longer crashes the JVM. Roman On Tue, Jul 10, 2012 at 8:54 AM, Andi Vajda va...@apache.org wrote: Hi Roman, On Mon, 9 Jul 2012, Roman Chyla wrote: Thanks, I am attaching a new patch that adds the missing test base. Sorry for the tabs, I was probably messing around with a few editors (some of them not configured properly) I integrated your test class (renaming it to fit the naming scheme used). Thanks ! So far, found one serious problem, crashes VM -- see e.g. test/test_BinaryDocument.py - when getting the document using: reader.document(0) test/test_BinaryDocument.py doesn't seem to crash the VM but fails because of some API changes. I suspect the crash to be some issue related to using an older jcc. I see a comment saying: couldn't find any combination with lucene4.0 where it would raise errors. Most of these unit tests are straight ports from the original Java version. If you're stumped about a change, check the original Java test, it may have changed too. Andi..
Re: lucene4.0 release
Hi Andi, Thanks, I am attaching a new patch that adds the missing test base. Sorry for the tabs, I was probably messing around with a few editors (some of them not configured properly) The test_Analyzer.py works for me no more - it imports PythonAttributeImpl which I cannot find in the trunk I wasn't able to build JCC, there is a build error since the new commit (tested on Debian with Python2.7 and CentOS with Python 2.6) gcc -pthread -fno-strict-aliasing -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv -I/usr/kerberos/include -DNDEBUG -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv -fPIC -D_java_generics -DJCC_VER=2.13 -I/usr/lib/jvm/java-openjdk/include -I/usr/lib/jvm/java-openjdk/include/linux -I_jcc -Ijcc/sources -I/usr/include/python2.6 -c jcc/sources/functions.cpp -o build/temp.linux-x86_64-2.6/jcc/sources/functions.o -DPYTHON -fno-strict-aliasing -Wno-write-strings jcc/sources/functions.cpp: In function ‘PyObject* makeInterface(PyObject*, PyObject*)’: jcc/sources/functions.cpp:153: error: ‘htons’ was not declared in this scope jcc/sources/functions.cpp: In function ‘PyObject* makeClass(PyObject*, PyObject*)’: jcc/sources/functions.cpp:244: error: ‘htons’ was not declared in this scope error: command 'gcc' failed with exit status 1 So I tried building pylucene with JCC 2.8, after adding to the Makefile --reserved mutable \ --reserved token \ but got an error: build/_lucene/__wrap01__.cpp: In function ‘PyObject* org::apache::pylucene::util::t_PythonListIterator_next(org::apache::pylucene::util::t_PythonListIterator*, PyObject*)’: build/_lucene/__wrap01__.cpp:17920:38: error: ‘class org::apache::pylucene::util::t_PythonListIterator’ has no member named ‘parameters’ build/_lucene/__wrap01__.cpp:17920:77: error: ‘class 
org::apache::pylucene::util::t_PythonListIterator’ has no member named ‘parameters’ error: command 'gcc' failed with exit status 1 Then I tried using the pylucene code from Friday (just updated Lucene java source) and it worked, it seems that changes inside lucene are not cause of this roman On Sat, Jul 7, 2012 at 11:35 AM, Andi Vajda va...@apache.org wrote: Hi Roman, On Fri, 6 Jul 2012, Roman Chyla wrote: I figured this is not complete for jira, retrying /w email... I integrated your patch after merging 3.6.0 - 3.x and then 3.x into trunk. PyLucene's trunk is now setup to track Lucene's branch_4x branch. I wasn't able to run all tests that succeed for you as you didn't send in your new PyLuceneTestCase.py class. Please add it to the test directory (instead of a new package) along with the other test helper classes already there such as BaseTokenStreamTestCase.py and send it in. Also, please, please, please, avoid using tab characters in the Java code you send in. Tabs are pain to manage, they mess up indentation and make the code hard to read. As this time, PyLucene on trunk builds and runs the few tests you ported that don't require this missing file, such as test_Analyzers.py. Thanks ! Andi.. On Fri, Jul 6, 2012 at 1:55 PM, Andi Vajda va...@apache.org wrote: I think that the apache mail server is eating up the attachment. Try to make it a .diff file or attach the patch to a jira issue. Thanks ! Andi.. On Jul 6, 2012, at 18:54, Roman Chyla roman.ch...@gmail.com wrote: Attaching the patch (there is no chance I could do it in one go, but if parts are committed in the trunk, then we can do more...I have also introduced base class for unittests, so that may be st to wave) So far, found one serious problem, crashes VM -- see. 
eg test/test_BinaryDocument.py - when getting the document using: reader.document(0) What works fine now: test/ test_Analyzers test_Binary test_RegexQuery samples/LuceneInAction/ index.py BasicSearchingTest.py On Thu, Jul 5, 2012 at 8:22 PM, Roman Chyla roman.ch...@gmail.com wrote: The patch probably probably didn't make it to the list, I'll file a ticket later It is definitely lot of work with the python code, I have gone through 1.5 test cases now, and it is just 'unpleasant', so many API changes out there - but I'll try to convert more roman On Thu, Jul 5, 2012 at 7:48 PM, Andi Vajda va...@apache.org wrote: On Jul 6, 2012, at 0:27, Roman Chyla roman.ch...@gmail.com wrote: Lucene is 4.0 in alpha release and we would like to start working with pylucene4.0 already. I checked out the pylucene trunk and made the necessary changes so that it compiles. Would it be possible to incorporate (some of) these changes? Absolutely, please send a patch to the list or file a bug and attach it there. The issue with a PyLucene 4.0 release is not so much getting it to compile and run but rewriting all the tests and samples (originally ported from Java) since
Re: lucene4.0 release
You can also get it temporarily here: https://github.com/romanchyla/pylucene-trunk roman On Fri, Jul 6, 2012 at 2:04 PM, Roman Chyla roman.ch...@gmail.com wrote: I figured this is not complete for jira, retrying /w email... r On Fri, Jul 6, 2012 at 1:55 PM, Andi Vajda va...@apache.org wrote: I think that the apache mail server is eating up the attachment. Try to make it a .diff file or attach the patch to a jira issue. Thanks ! Andi.. On Jul 6, 2012, at 18:54, Roman Chyla roman.ch...@gmail.com wrote: Attaching the patch (there is no chance I could do it in one go, but if parts are committed in the trunk, then we can do more...I have also introduced base class for unittests, so that may be st to wave) So far, found one serious problem, crashes VM -- see. eg test/test_BinaryDocument.py - when getting the document using: reader.document(0) What works fine now: test/ test_Analyzers test_Binary test_RegexQuery samples/LuceneInAction/ index.py BasicSearchingTest.py On Thu, Jul 5, 2012 at 8:22 PM, Roman Chyla roman.ch...@gmail.com wrote: The patch probably probably didn't make it to the list, I'll file a ticket later It is definitely lot of work with the python code, I have gone through 1.5 test cases now, and it is just 'unpleasant', so many API changes out there - but I'll try to convert more roman On Thu, Jul 5, 2012 at 7:48 PM, Andi Vajda va...@apache.org wrote: On Jul 6, 2012, at 0:27, Roman Chyla roman.ch...@gmail.com wrote: Lucene is 4.0 in alpha release and we would like to start working with pylucene4.0 already. I checked out the pylucene trunk and made the necessary changes so that it compiles. Would it be possible to incorporate (some of) these changes? Absolutely, please send a patch to the list or file a bug and attach it there. The issue with a PyLucene 4.0 release is not so much getting it to compile and run but rewriting all the tests and samples (originally ported from Java) since the Lucene api changed in many ways. 
That's a large amount of work and some of the new analyzer/tokenizer framework stuff needs some new jcc support for generating classes on the fly. I've got that written to some extent already but porting the samples and tests again is daunting. Andi.. Thanks, Roman
Re: lucene4.0 release
Attaching the patch (there is no chance I could do it in one go, but if parts are committed in the trunk, then we can do more...I have also introduced base class for unittests, so that may be st to wave) So far, found one serious problem, crashes VM -- see. eg test/test_BinaryDocument.py - when getting the document using: reader.document(0) What works fine now: test/ test_Analyzers test_Binary test_RegexQuery samples/LuceneInAction/ index.py BasicSearchingTest.py On Thu, Jul 5, 2012 at 8:22 PM, Roman Chyla roman.ch...@gmail.com wrote: The patch probably probably didn't make it to the list, I'll file a ticket later It is definitely lot of work with the python code, I have gone through 1.5 test cases now, and it is just 'unpleasant', so many API changes out there - but I'll try to convert more roman On Thu, Jul 5, 2012 at 7:48 PM, Andi Vajda va...@apache.org wrote: On Jul 6, 2012, at 0:27, Roman Chyla roman.ch...@gmail.com wrote: Lucene is 4.0 in alpha release and we would like to start working with pylucene4.0 already. I checked out the pylucene trunk and made the necessary changes so that it compiles. Would it be possible to incorporate (some of) these changes? Absolutely, please send a patch to the list or file a bug and attach it there. The issue with a PyLucene 4.0 release is not so much getting it to compile and run but rewriting all the tests and samples (originally ported from Java) since the Lucene api changed in many ways. That's a large amount of work and some of the new analyzer/tokenizer framework stuff needs some new jcc support for generating classes on the fly. I've got that written to some extent already but porting the samples and tests again is daunting. Andi.. Thanks, Roman
Re: lucene4.0 release
The patch probably didn't make it to the list, I'll file a ticket later. It is definitely a lot of work with the python code, I have gone through 1.5 test cases now, and it is just 'unpleasant', so many API changes out there - but I'll try to convert more roman On Thu, Jul 5, 2012 at 7:48 PM, Andi Vajda va...@apache.org wrote: On Jul 6, 2012, at 0:27, Roman Chyla roman.ch...@gmail.com wrote: Lucene is 4.0 in alpha release and we would like to start working with pylucene4.0 already. I checked out the pylucene trunk and made the necessary changes so that it compiles. Would it be possible to incorporate (some of) these changes? Absolutely, please send a patch to the list or file a bug and attach it there. The issue with a PyLucene 4.0 release is not so much getting it to compile and run but rewriting all the tests and samples (originally ported from Java) since the Lucene api changed in many ways. That's a large amount of work and some of the new analyzer/tokenizer framework stuff needs some new jcc support for generating classes on the fly. I've got that written to some extent already but porting the samples and tests again is daunting. Andi.. Thanks, Roman
JArray not shared - TypeError
Hi, I am using lucene together with other modules (all built in shared mode, JCC=2.11). But JArray... objects are not built as shared. This works when using only lucene, but fails when I use the other module built and linked against lucene:

# create array of string objects
x = j.JArray_object(5)
for i in range(5): x[i] = j.JArray_string(['x', 'z'])

In [7]: for i in range(5): x[i] = j.JArray_string(['x', 'z'])
   ...:
   ...:
--- TypeError Traceback (most recent call last) /dvt/workspace/montysolr/src/python/ipython console in module() TypeError: JArraystring[u'x', u'z']

The JArray functions/objects are different:

In [9]: id(lucene.JArray_string)
Out[9]: 140313957671376
In [10]: id(solr_java.JArray_string)
Out[10]: 140313919877648
In [11]: id(montysolr_java.JArray_string)
Out[11]: 140313909254704
In [12]: id(j.JArray_string)
Out[12]: 140313909254704

Others are shared:

In [18]: id(lucene.Weight)
Out[18]: 140313957203040
In [19]: id(solr_java.Weight)
Out[19]: 140313957203040
In [20]: id(j.Weight)
Out[20]: 140313957203040

The module 'j' is built with: -m jcc --shared --import lucene --import solr_java --package org.apache.solr.request --classpath ... --include ../build/jar/montysolr_java-0.1.jar --python montysolr_java --build --bdist What am I doing wrong? Thanks, roman
Re: set PYTHONPATH programmatically from Java?
hi, so after reading http://docs.python.org/c-api/init.html#PySys_SetArgvEx and the source code for _PythonVM_init I figured it out. I have to do: PythonVM.start(/dvt/workspace/montysolr/src/python/montysolr); and sys.path then contains the parent folder (above montysolr), and I can then set more things by loading some bootstrap module. But something like http://docs.python.org/c-api/veryhigh.html#PyRun_SimpleString would be much more flexible. Is it something that could be added? I can prepare a patch (as it seems really trivial, my knowledge might be sufficient for this :)) roman On Mon, Nov 14, 2011 at 1:12 PM, Roman Chyla roman.ch...@gmail.com wrote: On Mon, Nov 14, 2011 at 4:25 AM, Andi Vajda va...@apache.org wrote: On Sun, 13 Nov 2011, Roman Chyla wrote: I am using JCC to run Python inside Java. For unittest, I'd like to set PYTHONPATH environment variable programmatically. I can change env vars inside Java (using http://stackoverflow.com/questions/318239/how-do-i-set-environment-variables-from-java) and System.getenv(PYTHONPATH) shows correct values However, I am still getting ImportError: no module named If I set PYTHONPATH before starting unittest, it works fine Is it possible what I would like to do? Why mess with the environment instead of setting sys.path directly instead ? That would be great, but I don't know how. I am doing roughly this: PythonVM.start(programName) vm = PythonVM.get() vm.instantiate(moduleName, className); I tried also: PythonVM.start(programName, new String[]{-c, import sys;sys.path.insert(0, \'/dvt/workspace/montysolr/src/python\'}); it is failing on vm.instantiate when Python cannot find the module Alternatively, if JCC could execute/eval python string, I could set sys.argv that way I'm not sure what you mean here but JCC's Java PythonVM.init() method takes an array of strings that is fed into sys.argv. See _PythonVM_Init() sources in jcc.cpp for details. sorry, i meant sys.path, not sys.argv roman Andi..
Re: set PYTHONPATH programmatically from Java?
On Mon, Nov 14, 2011 at 4:25 AM, Andi Vajda va...@apache.org wrote: On Sun, 13 Nov 2011, Roman Chyla wrote: I am using JCC to run Python inside Java. For unittest, I'd like to set PYTHONPATH environment variable programmatically. I can change env vars inside Java (using http://stackoverflow.com/questions/318239/how-do-i-set-environment-variables-from-java) and System.getenv(PYTHONPATH) shows correct values However, I am still getting ImportError: no module named If I set PYTHONPATH before starting unittest, it works fine Is it possible what I would like to do? Why mess with the environment instead of setting sys.path directly instead ? That would be great, but I don't know how. I am doing roughly this: PythonVM.start(programName) vm = PythonVM.get() vm.instantiate(moduleName, className); I tried also: PythonVM.start(programName, new String[]{-c, import sys;sys.path.insert(0, \'/dvt/workspace/montysolr/src/python\'}); it is failing on vm.instantiate when Python cannot find the module Alternatively, if JCC could execute/eval python string, I could set sys.argv that way I'm not sure what you mean here but JCC's Java PythonVM.init() method takes an array of strings that is fed into sys.argv. See _PythonVM_Init() sources in jcc.cpp for details. sorry, i meant sys.path, not sys.argv roman Andi..
Re: Building is too difficult and release of a first pre-built egg
Hi Philippe, On Thu, Jun 2, 2011 at 5:54 AM, Philippe Ombredanne pombreda...@gmail.com wrote: On 2011-06-01 20:54, Roman Chyla wrote: I would build some other binaries and upload them, will you get me access? Done: I added you as a committer to http://code.google.com/a/apache-extras.org/p/pylucene-extra/ Thanks! I'll try to keep and post detailed logs for each build I do. I am planning to add some detailed egg building instructions too. I will also contact the dudes at: http://code.google.com/p/pylucene-win32-binary/ They are building windows eggs already I'll also do -- the project for which it is needed is this one: https://github.com/romanchyla/montysolr But I also need to build JCC and upload them. Note that the location of the java that was used for the project built will be hardcoded inside the dynamic library, but I plan to change the header and set a few standard paths there. Ah... good point... meaning this is bad... a build would not be java location independent then? This would be a major bummer to have the path to java hardcoded in the .so. You could commit the patches there if you have some? oh, I assumed that was not patcheable -- but maybe I was wrong; but what I certainly planned to do is to change each binary produced and set some standard paths. Any ideas of what would be the standard library paths for linux? Building pylucene/jcc is indeed difficult for newcomers. Indeed too hard imho. A big deterrent. Such that it does likely impair the project reach, growth and health. and it is a very wonderful project, i agree roman On Wed, Jun 1, 2011 at 10:54 AM, Philippe Ombredanne pombreda...@gmail.com wrote: Howdy! I think it is way too hard to build PyLucene for the mere mortals. 
Getting eggs is yet another level of difficulties I created an issue: https://issues.apache.org/jira/browse/PYLUCENE-10 and started an Apache extra project, releasing a first egg for the Linux 64/Python 2.5.2/Oracle JDK 1.5 combo http://code.google.com/a/apache-extras.org/p/pylucene-extra/downloads/list I hope that can help some folks. -- Cordially Philippe philippe ombredanne | 1 650 799 0949 | pombredanne at nexb.com nexB - Open by Design (tm) - http://www.nexb.com http://eclipse.org/atf - http://eclipse.org/soc - http://eclipse.org/vep http://drools.org/ - http://easyeclipse.org - http://phpeclipse.com
Re: Hardcoded java paths in shared objects [was:Re: Building is too difficult and release of a first pre-built egg]
On Thu, Jun 2, 2011 at 12:26 PM, Andi Vajda va...@apache.org wrote: On Jun 2, 2011, at 3:10, Philippe Ombredanne pombreda...@gmail.com wrote: On 2011-06-01 20:54, Roman Chyla wrote: Note that the location of the java that was used for the project built will be hardcoded inside the dynamic library, but I plan to change the header and set a few standard paths there. This is actually worse than I thought: not only the java location seems hardcoded in the shared object as a hard path to the libs folder, but also there is an implied dep on setuptools via pkg_resources So for now, you cannot even build on a jdk and deploy on a jre. If the solution to this is to remove the hardcoded paths and expect the dynamic linker to find the dependencies via some environment variable like LD_LIBRARY_PATH you'd be creating a security vulnerability. I am not an expert on this, but I remember that LD_LIBRARY_PATH was not recommended (as it could break other libraries, if I remember correctly). So that's why I was thinking more of 'more standard' hardcoded locations. Or is there something else besides LD_LIBRARY_PATH and multiple hardcoded paths? Roman This is how I did it originally (years ago) and people complained about it so I switched to hardcoded paths for shared library dependencies wherever possible. Andi.. -- Cordially Philippe philippe ombredanne | 1 650 799 0949 | pombredanne at nexb.com nexB - Open by Design (tm) - http://www.nexb.com http://eclipse.org/atf - http://eclipse.org/soc - http://eclipse.org/vep http://drools.org/ - http://easyeclipse.org - http://phpeclipse.com
Re: Hardcoded java paths in shared objects [was:Re: Building is too difficult and release of a first pre-built egg]
On Thu, Jun 2, 2011 at 6:10 AM, Philippe Ombredanne pombreda...@gmail.com wrote: On 2011-06-01 20:54, Roman Chyla wrote: Note that the location of the java that was used for the project built will be hardcoded inside the dynamic library, but I plan to change the header and set a few standard paths there. This is actually worse than I thought: not only the java location seems hardcoded in the shared object as a hard path to the libs folder, but also there is an implied dep on setuptools via pkg_resources So for now, you cannot even build on a jdk and deploy on a jre. I am sorry, but I don't understand - what is the additional dependency hardcoded there? Thanks, Roman -- Cordially Philippe philippe ombredanne | 1 650 799 0949 | pombredanne at nexb.com nexB - Open by Design (tm) - http://www.nexb.com http://eclipse.org/atf - http://eclipse.org/soc - http://eclipse.org/vep http://drools.org/ - http://easyeclipse.org - http://phpeclipse.com
Re: Building is too difficult and release of a first pre-built egg
Hi Philippe, I would build some other binaries and upload them, will you get me access? But I also need to build JCC and upload them. Note that the location of the java that was used for the project built will be hardcoded inside the dynamic library, but I plan to change the header and set a few standard paths there. Building pylucene/jcc is indeed difficult for newcomers. Cheers, Roman On Wed, Jun 1, 2011 at 10:54 AM, Philippe Ombredanne pombreda...@gmail.com wrote: Howdy! I think it is way too hard to build PyLucene for the mere mortals. Getting eggs is yet another level of difficulties I created an issue: https://issues.apache.org/jira/browse/PYLUCENE-10 and started an Apache extra project, releasing a first egg for the Linux 64/Python 2.5.2/Oracle JDK 1.5 combo http://code.google.com/a/apache-extras.org/p/pylucene-extra/downloads/list I hope that can help some folks. -- Cordially Philippe philippe ombredanne | 1 650 799 0949 | pombredanne at nexb.com nexB - Open by Design (tm) - http://www.nexb.com http://twitter.com/pombr http://eclipse.org/atf - http://eclipse.org/soc - http://eclipse.org/vep http://drools.org/ - http://easyeclipse.org - http://phpeclipse.com
Re: finding exceptions that crash pylucene
I have had a similar experience, but it was always a problem on the java side. What helped was to dump memory: -Xms512m -Xmx4500m -XX:+HeapDumpOnCtrlBreak -XX:+HeapDumpOnOutOfMemoryError Documentation says that upon catching the OOM, you should stop the JVM immediately. But actually it was possible to handle these problems. I started the processing inside a separate thread, cleaning up properly -- if the thread raises OOM, it is possible to continue - I have done tests on thousands of docs and it always worked. But the main benefit of that solution is that I can see the errors inside Python and gracefully stop execution (without being shut out into space). Marcus, I would recommend wrapping your processing inside a thread that starts another worker thread and make sure no references are kept. Roman On Fri, Apr 15, 2011 at 4:33 PM, Bill Janssen jans...@parc.com wrote: Marcus qwe...@gmail.com wrote: we're currently using 4GB max heap. We recently moved from 2GB to 4GB when we discovered it prevented a crash with a certain set of docs. Marcus I've tried the same workaround with the heap in the past, and I found it caused NoMemory crashes in the Python side of the house, because the Python VM couldn't get enough memory to operate. So, be careful. On Thu, Apr 14, 2011 at 5:01 PM, Andi Vajda va...@apache.org wrote: On Thu, 14 Apr 2011, Marcus wrote: thanks. I have documents that will consistently cause this upon writing them to the index. let me see if I can reduce them down to the crux of the crash. granted, these docs are very large, unruly bad data, that should have never gotten to this stage in our pipeline, but I was hoping for a java or lucene exception. I also get Java GC overhead exceptions passed into my code from time to time, but those are manageable, and not crashes. 
Are there known memory constraint scenarios that force a C++ exception, whereas in a normal Java environment, you would get a memory error? Not sure. And just confirming, do java.lang.OutOfMemoryError errors pass into python, or force a crash? Not sure, I've never seen these as I make sure I've got enough memory. initVM() is the place where you can configure the memory for your JVM. Andi.. thanks again Marcus On Thu, Apr 14, 2011 at 2:07 PM, Andi Vajda va...@apache.org wrote: On Thu, 14 Apr 2011, Marcus wrote: in certain cases when a java/pylucene exception occurs, it gets passed up in my code, and I'm able to analyze the situation. sometimes though, the python process just crashes, and if I happen to be in top (linux top that is), I see a JCC exception flash up in the top console. where can I go to look for this exception, or is it just lost? I looked in the locations where a java crash would be located, but didn't find anything. If you're hitting a crash because of an unhandled C++ exception, running a debug build with symbols under gdb will help greatly in tracking it down. An unhandled C++ exception would be a PyLucene/JCC bug. If you have a simple way to reproduce this failure, send it to this list. Andi..
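The worker-thread pattern Roman recommends above can be sketched in pure Python. This is only an analog of the idea: under PyLucene a Java OutOfMemoryError would surface as a wrapped Java exception inside the worker, where it can be caught the same way an ordinary Python exception is caught here:

```python
import threading

def run_isolated(fn, *args):
    """Run fn(*args) in a worker thread and report (ok, result_or_error).

    If the worker raises (e.g. a wrapped java.lang.OutOfMemoryError under
    PyLucene), the caller sees the error object and can continue processing
    the next document instead of the whole process dying."""
    box = {'ok': False, 'value': None}

    def worker():
        try:
            box['value'] = fn(*args)
            box['ok'] = True
        except Exception as exc:  # the failure stays contained in the worker
            box['value'] = exc

    thread = threading.Thread(target=worker)
    thread.start()
    thread.join()  # worker's locals are dropped here, keeping no references
    return box['ok'], box['value']
```

The "make sure no references are kept" advice matters because any document objects still referenced after the failure would keep the (Java) heap from being reclaimed.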
Re: Using JCC / PyLucene with JEPP?
Yes, and I can say it is working extremely well so far - we have done and are doing some extensive benchmarking and tests. I also use multiprocessing inside (python2.6) and I hope I would be able to publish the source code soon, it could be re-usable. If you are interested before that happens, please send me an email. Best, Roman On Fri, Mar 4, 2011 at 7:27 AM, Andi Vajda va...@apache.org wrote: On Mar 3, 2011, at 21:50, Bill Janssen jans...@parc.com wrote: New topic. I'd like to wrap my UpLib codebase, which is Python using PyLucene, in Java using JEPP (http://jepp.sourceforge.net/), so that I can use it with Tomcat. Now, am I going to have to do some trickery to get a VM? Or will getVMEnv() just work with a previously initialized JVM? Not so long ago on this list someone asked about this, using python from java via jcc, something I've been doing with tomcat for a couple of years now. I sent a long, detailed answer. I believe it was to Roman Chyla. A quick look in this mailing list archives should help you locate that thread and get answers to the above questions. Andi.. Bill
Re: pass compressed string
Hi Andi, Thanks, the JArray_byte() does what I needed - I was (wrongly) passing a bytestring (which I think got automatically converted to unicode) and trying to get the bytes of that string was not correct. Though it would be interesting to find out if it is possible to pass a string and get the bytes in java; I don't know what conversion is happening on the JNI side, or whether it happens only in Java - I shall do some reading. Example in python:

In [4]: s = zlib.compress(python)
In [5]: repr(s)
Out[5]: 'x\\x9c+\\xa8,\\xc9\\xc8\\xcf\\x03\\x00\\tW\\x02\\xa3'
In [6]: lucene.JArray_byte(s)
Out[6]: JArraybyte(120, -100, 43, -88, 44, -55, -56, -49, 3, 0, 9, 87, 2, -93)

The same thing in Jython:

s = zlib.compress(python)
s
'x\x9c+\xa8,\xc9\xc8\xcf\x03\x00\tW\x02\xa3'
repr(s)
'x\\x9c+\\xa8,\\xc9\\xc8\\xcf\\x03\\x00\\tW\\x02\\xa3'
String(s).getBytes()
array('b', [120, -62, -100, 43, -62, -88, 44, -61, -119, -61, -120, -61, -113, 3, 0, 9, 87, 2, -62, -93])
String(s).getBytes('utf8')
array('b', [120, -62, -100, 43, -62, -88, 44, -61, -119, -61, -120, -61, -113, 3, 0, 9, 87, 2, -62, -93])
String(s).getBytes('utf16')
array('b', [-2, -1, 0, 120, 0, -100, 0, 43, 0, -88, 0, 44, 0, -55, 0, -56, 0, -49, 0, 3, 0, 0, 0, 9, 0, 87, 0, 2, 0, -93])
String(s).getBytes('ascii')
array('b', [120, 63, 43, 63, 44, 63, 63, 63, 3, 0, 9, 87, 2, 63])

Roman On Thu, Feb 24, 2011 at 3:42 AM, Andi Vajda va...@apache.org wrote: On Thu, 24 Feb 2011, Roman Chyla wrote: I would like to transfer results from python to java: hello = zlib.compress(hello) on the java side do: byte[] data = string.getBytes() But I am not successful. Is there any translation going on somewhere? Can you be more specific ? Actual lines of code, errors, expected results, actual results... 
An array of bytes in JCC is not created with a string but a JArray('byte')(len or str) import lucene lucene.initVM() jcc.JCCEnv object at 0x1004100d8 lucene.JArray('byte')(10) JArraybyte(0, 0, 0, 0, 0, 0, 0, 0, 0, 0) lucene.JArray('byte')(abcd) JArraybyte(97, 98, 99, 100) Andi..
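The corruption visible in the Jython `getBytes()` output above (the spurious -62/-61 bytes, i.e. 0xC2/0xC3) is exactly what happens when compressed bytes are treated as text and re-encoded as UTF-8: every byte value >= 0x80 becomes two bytes. A short sketch in modern Python showing why only a byte-preserving encoding round-trips the data:

```python
import zlib

data = zlib.compress(b"python")

# latin-1 maps byte values 0-255 one-to-one onto code points 0-255,
# so decoding with it gives a byte-preserving "string" view of the data.
as_text = data.decode("latin-1")

utf8_roundtrip = as_text.encode("utf-8")      # corrupts: >=0x80 bytes double up
latin1_roundtrip = as_text.encode("latin-1")  # byte-for-byte identical

assert latin1_roundtrip == data
assert len(utf8_roundtrip) > len(data)        # the extra 0xC2/0xC3 lead bytes
assert zlib.decompress(latin1_roundtrip) == b"python"
```

This is why passing the raw bytes as a JArray('byte') (as Andi suggests) works, while routing them through a string type silently applies a text encoding in between.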
pass compressed string
Hello, I would like to transfer results from python to java: hello = zlib.compress(hello) on the java side do: byte[] data = string.getBytes() But I am not successful. Is there any translation going on somewhere? Thank you, Roman
Re: Problem loading jcc from java : undefined symbol: PyExc_IOError
On Tue, Feb 15, 2011 at 4:22 AM, Andi Vajda va...@apache.org wrote: On Tue, 15 Feb 2011, Roman Chyla wrote: from: http://realmike.org/blog/2010/07/18/python-extensions-in-cpp-using-swig/ Q. ?Fatal Python error: Interpreter not initialized (version mismatch?)? A. This error occurs when the version of the Python interpreter for which the extension module has been built is different from the version of the interpreter that attempts to import the module. Is there a way to find out which python interpreter version is inside JCC? Also, Is it somehow possible that the java process that load jcc library will be picking the default python (2.4) instead of the python (2.5)? PATH is set to python2.5. There is no Python interpreter inside jcc. It's dynamically linked. To know which version of the shared library is looked for and expected, use the 'ldd' utility against the various shared libraries involved to tell you. That version is selected at build time, when you run 'python setup.py ...' That version of python determines the version of libpython.so used. 
This will probably be the problem (as you said before); the libjcc.so shows no python:

bash-3.2$ ldd build/lib.linux-x86_64-2.5/libjcc.so
 linux-vdso.so.1 => (0x7fff7affc000)
 /$LIB/snoopy.so => /lib64/snoopy.so (0x2b8ed0e74000)
 libjava.so => /afs/cern.ch/user/r/rchyla/public/jdk1.6.0_18/jre/lib/amd64/libjava.so (0x2b8ed1076000)
 libjvm.so => /afs/cern.ch/user/r/rchyla/public/jdk1.6.0_18/jre/lib/amd64/server/libjvm.so (0x2b8ed11a5000)
 libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x2b8ed1c3f000)
 libm.so.6 => /lib64/libm.so.6 (0x2b8ed1f3f000)
 libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x2b8ed21c2000)
 libpthread.so.0 => /lib64/libpthread.so.0 (0x2b8ed23cf000)
 libc.so.6 => /lib64/libc.so.6 (0x2b8ed25eb000)
 libdl.so.2 => /lib64/libdl.so.2 (0x2b8ed2943000)
 libverify.so => /afs/cern.ch/user/r/rchyla/public/jdk1.6.0_18/jre/lib/amd64/libverify.so (0x2b8ed2b47000)
 libnsl.so.1 => /lib64/libnsl.so.1 (0x2b8ed2c57000)
 /lib64/ld-linux-x86-64.so.2 (0x2b8ed08c9000)

And I think python2.4 (the default on the system) is being loaded -- but how to force loading of python2.5 (if that is possible at all) I don't know. Compilation is definitely done with -lpython2.5. Cheers, roman Andi.. Cheers, roman On Tue, Feb 15, 2011 at 2:40 AM, Roman Chyla roman.ch...@gmail.com wrote: On Tue, Feb 15, 2011 at 1:32 AM, Andi Vajda va...@apache.org wrote: On Tue, 15 Feb 2011, Roman Chyla wrote: The python embedded in Java works really well on MacOsX and also Ubuntu. But I am trying hard to make it work also on Scientific Linux (SLC5) with *statically* built Python. The python is a build from ActiveState. You mean you're going to try to dynamically load libpython.a into a JVM ? I have no idea if this can work at all. 
I am very ignorant as far as the difference between statically and dynamically linked libraries go - I just wanted to use JCC wrapped code with this particular statically linked python I got little bit further, but just little: after I changed -Xlinker --export-dynamic into -Xlinker -export-dynamic (and installed python into /opt...) I am getting a different error: SEVERE: org.apache.jcc.PythonException: No module named solrpie.java_bridge null at org.apache.jcc.PythonVM.instantiate(Native Method) at rca.python.jni.PythonVMBridge.start(Unknown Source) at rca.python.jni.PythonVMBridge.start(Unknown Source) at rca.python.jni.PythonVMBridge.start(Unknown Source) at rca.python.jni.SolrpieVM.getBridge(Unknown Source) My understanding is that the previous error has gone (and the python module time is loaded), because if I set PYTHONPATH incorrectly, I get: This message is IMHO coming from Python But when I correct the PYTHONPATH, I am getting only this: [java] Fatal Python error: Interpreter not initialized (version mismatch?) [java] Java Result: 134 If my understanding of static builds is correct, I'd imagine the only way for this to work would be to statically compile the JVM (hotspot) and python together. oooups, that is way over my head But why all this ? Because on the grid, we already had a statically linked python and it was working very well with pylucene (and after all, I managed to make it work also for solr and other packages) But if you think that it is not possible, I should do something else :) But it was fun trying, if you get some idea, please let me know. Thank you, Roman Andi.. So far, I managed to build all the needed extensions (jcc, lucene, solr) and I can run them in python, but when I try to start the java app and use python, I get: SEVERE: org.apache.jcc.PythonException: /afs/cern.ch/user/r/rchyla/public/ActivePython-2.5.5.7-linux-x86_64/INSTALLDIR/lib/python2.5/lib
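Andi's "use ldd" advice above can also be approached from the Python side. The following is only an illustrative sketch (it uses the stdlib `sysconfig` module of modern Pythons; on the Python 2.5 discussed here the equivalent values lived in `distutils.sysconfig`): it prints which interpreter version a given Python build reports and which libpython it was configured to produce, which is the version an embedding JVM would need to find.

```python
import sys
import sysconfig

# The version this interpreter would report if it were the one embedded
# in the JVM -- a mismatch with the libpython the JVM actually loads is
# what produces "Interpreter not initialized (version mismatch?)".
major, minor = sys.version_info[:2]
print("interpreter: python%d.%d" % (major, minor))

# LDLIBRARY is the library name chosen at build time, e.g.
# libpython2.5.so for a shared build, libpython2.5.a for a static one.
ldlibrary = sysconfig.get_config_var("LDLIBRARY")
libdir = sysconfig.get_config_var("LIBDIR")
print("configured library: %s (in %s)" % (ldlibrary, libdir))

# A statically built Python is recognizable by the .a suffix here.
print("static build:", str(ldlibrary).endswith(".a"))
```

Running this with the same interpreter that `setup.py` was run with tells you which libpython JCC was linked against.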
Re: Problem loading jcc from java : undefined symbol: PyExc_IOError
In the end, I compiled a new python with the necessary modules, and that works just fine. But it was an interesting experience. Thank you Andi, your help is always great. Cheers, roman On Tue, Feb 15, 2011 at 9:22 AM, Roman Chyla roman.ch...@gmail.com wrote: On Tue, Feb 15, 2011 at 4:22 AM, Andi Vajda va...@apache.org wrote: On Tue, 15 Feb 2011, Roman Chyla wrote: from: http://realmike.org/blog/2010/07/18/python-extensions-in-cpp-using-swig/ Q. “Fatal Python error: Interpreter not initialized (version mismatch?)” A. This error occurs when the version of the Python interpreter for which the extension module has been built is different from the version of the interpreter that attempts to import the module. Is there a way to find out which python interpreter version is inside JCC? Also, Is it somehow possible that the java process that load jcc library will be picking the default python (2.4) instead of the python (2.5)? PATH is set to python2.5. There is no Python interpreter inside jcc. It's dynamically linked. To know which version of the shared library is looked for and expected, use the 'ldd' utility against the various shared libraries involved to tell you. That version is selected at build time, when you run 'python setup.py ...' That version of python determines the version of libpython.so used. 
This will be probably the problem (as you said before), the libjcc.so shows no python - bash-3.2$ ldd build/lib.linux-x86_64-2.5/libjcc.so linux-vdso.so.1 => (0x7fff7affc000) /$LIB/snoopy.so => /lib64/snoopy.so (0x2b8ed0e74000) libjava.so => /afs/cern.ch/user/r/rchyla/public/jdk1.6.0_18/jre/lib/amd64/libjava.so (0x2b8ed1076000) libjvm.so => /afs/cern.ch/user/r/rchyla/public/jdk1.6.0_18/jre/lib/amd64/server/libjvm.so (0x2b8ed11a5000) libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x2b8ed1c3f000) libm.so.6 => /lib64/libm.so.6 (0x2b8ed1f3f000) libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x2b8ed21c2000) libpthread.so.0 => /lib64/libpthread.so.0 (0x2b8ed23cf000) libc.so.6 => /lib64/libc.so.6 (0x2b8ed25eb000) libdl.so.2 => /lib64/libdl.so.2 (0x2b8ed2943000) libverify.so => /afs/cern.ch/user/r/rchyla/public/jdk1.6.0_18/jre/lib/amd64/libverify.so (0x2b8ed2b47000) libnsl.so.1 => /lib64/libnsl.so.1 (0x2b8ed2c57000) /lib64/ld-linux-x86-64.so.2 (0x2b8ed08c9000) And I think, the python2.4 (the default on the system) is being loaded -- but how to force loading of python2.5 (if that was possible at all) I don't know. Compilation is definitely done with -lpython2.5 Cheers, roman Andi.. Cheers, roman On Tue, Feb 15, 2011 at 2:40 AM, Roman Chyla roman.ch...@gmail.com wrote: On Tue, Feb 15, 2011 at 1:32 AM, Andi Vajda va...@apache.org wrote: On Tue, 15 Feb 2011, Roman Chyla wrote: The python embedded in Java works really well on MacOsX and also Ubuntu. But I am trying hard to make it work also on Scientific Linux (SLC5) with *statically* built Python. The python is a build from ActiveState. You mean you're going to try to dynamically load libpython.a into a JVM ? I have no idea if this can work at all. 
I am very ignorant as far as the difference between statically and dynamically linked libraries go - I just wanted to use JCC wrapped code with this particular statically linked python I got little bit further, but just little: after I changed -Xlinker --export-dynamic into -Xlinker -export-dynamic (and installed python into /opt...) I am getting a different error: SEVERE: org.apache.jcc.PythonException: No module named solrpie.java_bridge null at org.apache.jcc.PythonVM.instantiate(Native Method) at rca.python.jni.PythonVMBridge.start(Unknown Source) at rca.python.jni.PythonVMBridge.start(Unknown Source) at rca.python.jni.PythonVMBridge.start(Unknown Source) at rca.python.jni.SolrpieVM.getBridge(Unknown Source) My understanding is that the previous error has gone (and the python module time is loaded), because if I set PYTHONPATH incorrectly, I get: This message is IMHO coming from Python But when I correct the PYTHONPATH, I am getting only this: [java] Fatal Python error: Interpreter not initialized (version mismatch?) [java] Java Result: 134 If my understanding of static builds is correct, I'd imagine the only way for this to work would be to statically compile the JVM (hotspot) and python together. oooups, that is way over my head But why all this ? Because on the grid, we already had a statically linked python and it was working very well with pylucene (and after all, I managed to make it work also for solr and other packages) But if you think that it is not possible, I should do something else :) But it was fun trying, if you get some idea, please let me know. Thank you, Roman Andi.. So far, I managed to build all
Re: Problem loading jcc from java : undefined symbol: PyExc_IOError
from: http://realmike.org/blog/2010/07/18/python-extensions-in-cpp-using-swig/ Q. “Fatal Python error: Interpreter not initialized (version mismatch?)” A. This error occurs when the version of the Python interpreter for which the extension module has been built is different from the version of the interpreter that attempts to import the module. Is there a way to find out which python interpreter version is inside JCC? Also, is it somehow possible that the java process that loads the jcc library will be picking the default python (2.4) instead of the python (2.5)? PATH is set to python2.5. Cheers, roman On Tue, Feb 15, 2011 at 2:40 AM, Roman Chyla roman.ch...@gmail.com wrote: On Tue, Feb 15, 2011 at 1:32 AM, Andi Vajda va...@apache.org wrote: On Tue, 15 Feb 2011, Roman Chyla wrote: The python embedded in Java works really well on MacOsX and also Ubuntu. But I am trying hard to make it work also on Scientific Linux (SLC5) with *statically* built Python. The python is a build from ActiveState. You mean you're going to try to dynamically load libpython.a into a JVM ? I have no idea if this can work at all. I am very ignorant as far as the difference between statically and dynamically linked libraries goes - I just wanted to use JCC wrapped code with this particular statically linked python. I got a little bit further, but just a little: after I changed -Xlinker --export-dynamic into -Xlinker -export-dynamic (and installed python into /opt...) 
I am getting a different error: SEVERE: org.apache.jcc.PythonException: No module named solrpie.java_bridge null at org.apache.jcc.PythonVM.instantiate(Native Method) at rca.python.jni.PythonVMBridge.start(Unknown Source) at rca.python.jni.PythonVMBridge.start(Unknown Source) at rca.python.jni.PythonVMBridge.start(Unknown Source) at rca.python.jni.SolrpieVM.getBridge(Unknown Source) My understanding is that the previous error has gone (and the python module time is loaded), because if I set PYTHONPATH incorrectly, I get: This message is IMHO coming from Python But when I correct the PYTHONPATH, I am getting only this: [java] Fatal Python error: Interpreter not initialized (version mismatch?) [java] Java Result: 134 If my understanding of static builds is correct, I'd imagine the only way for this to work would be to statically compile the JVM (hotspot) and python together. oooups, that is way over my head But why all this ? Because on the grid, we already had a statically linked python and it was working very well with pylucene (and after all, I managed to make it work also for solr and other packages) But if you think that it is not possible, I should do something else :) But it was fun trying, if you get some idea, please let me know. Thank you, Roman Andi.. 
So far, I managed to build all the needed extensions (jcc, lucene, solr) and I can run them in python, but when I try to start the java app and use python, I get: SEVERE: org.apache.jcc.PythonException: /afs/cern.ch/user/r/rchyla/public/ActivePython-2.5.5.7-linux-x86_64/INSTALLDIR/lib/python2.5/lib-dynload/time.so: undefined symbol: PyExc_IOError I understand, that the missing symbol PyExc_IOError is in the static python library: bash-3.2$ nm /afs/cern.ch/user/r/rchyla/public/ActivePython-2.5.5.7-linux-x86_64/INSTALLDIR/lib/python2.5/config/libpython2.5.a | grep IOError 4120 D PyExc_IOError 4140 d _PyExc_IOError U PyExc_IOError U PyExc_IOError U PyExc_IOError U PyExc_IOError U PyExc_IOError U PyExc_IOError U PyExc_IOError So when building JCC, I build with these arguments: lflags + ['-lpython%s.%s' %(sys.version_info[0:2]), '-L', '/afs/cern.ch/user/r/rchyla/public/ActivePython-2.5.5.7-linux-x86_64/INSTALLDIR/lib/python2.5/config', '-rdynamic', '-Wl,--export-dynamic', '-Xlinker', '--export-dynamic'] I just found instructions at: http://stackoverflow.com/questions/4223312/python-interpreter-embedded-in-the-application-fails-to-load-native-modules I don't really understand g++, but the symbol is there after the compilation bash-3.2$ nm /afs/cern.ch/user/r/rchyla/public/ActivePython-2.5.5.7-linux-x86_64/INSTALLDIR/lib/python2.5/site-packages/JCC-2.7-py2.5-linux-x86_64.egg/libjcc.so | grep IOError 00352240 D PyExc_IOError 00352260 d _PyExc_IOError And when starting java, I do -Djava.library.path=/afs/cern.ch/user/r/rchyla/public/ActivePython-2.5.5.7-linux-x86_64/INSTALLDIR/lib/python2.5/site-packages/JCC-2.7-py2.5-linux-x86_64.egg The code works find on mac (python 2.6) and ubuntu (python2.6), but not this statically linked python2.5 - would you know what I can try? Thanks. roman PS: I tried several compilations, but I was usually re-compiling JCC without building lucene etc again, I hope that is not the problem.
Problem loading jcc from java : undefined symbol: PyExc_IOError
Hello Andi, all, The python embedded in Java works really well on MacOsX and also Ubuntu. But I am trying hard to make it work also on Scientific Linux (SLC5) with *statically* built Python. The python is a build from ActiveState. So far, I managed to build all the needed extensions (jcc, lucene, solr) and I can run them in python, but when I try to start the java app and use python, I get: SEVERE: org.apache.jcc.PythonException: /afs/cern.ch/user/r/rchyla/public/ActivePython-2.5.5.7-linux-x86_64/INSTALLDIR/lib/python2.5/lib-dynload/time.so: undefined symbol: PyExc_IOError I understand that the missing symbol PyExc_IOError is in the static python library: bash-3.2$ nm /afs/cern.ch/user/r/rchyla/public/ActivePython-2.5.5.7-linux-x86_64/INSTALLDIR/lib/python2.5/config/libpython2.5.a | grep IOError 4120 D PyExc_IOError 4140 d _PyExc_IOError U PyExc_IOError U PyExc_IOError U PyExc_IOError U PyExc_IOError U PyExc_IOError U PyExc_IOError U PyExc_IOError So when building JCC, I build with these arguments: lflags + ['-lpython%s.%s' %(sys.version_info[0:2]), '-L', '/afs/cern.ch/user/r/rchyla/public/ActivePython-2.5.5.7-linux-x86_64/INSTALLDIR/lib/python2.5/config', '-rdynamic', '-Wl,--export-dynamic', '-Xlinker', '--export-dynamic'] I just found instructions at: http://stackoverflow.com/questions/4223312/python-interpreter-embedded-in-the-application-fails-to-load-native-modules I don't really understand g++, but the symbol is there after the compilation bash-3.2$ nm /afs/cern.ch/user/r/rchyla/public/ActivePython-2.5.5.7-linux-x86_64/INSTALLDIR/lib/python2.5/site-packages/JCC-2.7-py2.5-linux-x86_64.egg/libjcc.so | grep IOError 00352240 D PyExc_IOError 00352260 d _PyExc_IOError And when starting java, I do -Djava.library.path=/afs/cern.ch/user/r/rchyla/public/ActivePython-2.5.5.7-linux-x86_64/INSTALLDIR/lib/python2.5/site-packages/JCC-2.7-py2.5-linux-x86_64.egg The code works fine on mac (python 2.6) and ubuntu (python2.6), but not this statically linked python2.5 - would you know what I can try? Thanks. roman PS: I tried several compilations, but I was usually re-compiling JCC without building lucene etc again, I hope that is not the problem.
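Loosely related to the nm checks above, the same symbol-resolution question can be asked from inside a running interpreter with ctypes. This is only an illustrative sketch: `PyExc_OSError` stands in for the `PyExc_IOError` symbol from the Python 2.5 report, because it is exported by every modern CPython.

```python
import ctypes

# Resolve a CPython C-API data symbol the way the dynamic linker would.
# If the interpreter's symbols are visible to the process, this works;
# in the embedded-JVM scenario discussed above, time.so failed exactly
# this kind of lookup ("undefined symbol: PyExc_IOError").
sym = ctypes.c_void_p.in_dll(ctypes.pythonapi, "PyExc_OSError")
print("PyExc_OSError resolved:", sym.value is not None)

# A symbol that genuinely does not exist raises ValueError -- the
# Python-level analogue of the loader's "undefined symbol" error.
try:
    ctypes.c_void_p.in_dll(ctypes.pythonapi, "PyExc_DoesNotExist")
except ValueError:
    print("missing symbol -> ValueError, as expected")
```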
cannot instantiate HashMap until after shared-module initVM()
Hi Andi, all, I have just come across behaviour which seems strange -- I have built lucene and wrapped solr + my own extension with JCC (ver. 2.7; OS is Mac, Python 32-bit 2.6; using generics in all of the packages) All of the packages are compiled in the shared mode -- I import them in the correct order of: lucene, solr, my extension. Now I realized it was not possible to initialize a HashMap until the first extension (in this case lucene) is started. Is this the effect of building them in the shared mode - where one depends on one another? Thank you, roman In [1]: from solrpie import initvm Warning: we add the default folder to sys.path: /x/dev/workspace/sandbox/solrpie/build/dist In [2]: sj = initvm.solrpie_java In [3]: sj.initVM() Out[3]: <jcc.JCCEnv object at 0x194b70> In [4]: sj.Hash sj.HashDocSet sj.HashMap sj.HashSet sj.Hashtable In [4]: sj.HashMap().of_(sj.String, sj.String) --- InvalidArgsError Traceback (most recent call last) /x/dev/workspace/sandbox/solrpie/python/<ipython console> in <module>() InvalidArgsError: (<type 'HashMap'>, 'of_', (<type 'String'>, <type 'String'>)) In [5]: import lucene In [6]: sj.HashMap().of_(lucene.String, lucene.String) --- InvalidArgsError Traceback (most recent call last) /x/dev/workspace/sandbox/solrpie/python/<ipython console> in <module>() InvalidArgsError: (<type 'HashMap'>, 'of_', (<type 'String'>, <type 'String'>)) In [7]: lucene.HashMap().of_(lucene.String, lucene.String) --- InvalidArgsError Traceback (most recent call last) /x/dev/workspace/sandbox/solrpie/python/<ipython console> in <module>() InvalidArgsError: (<type 'HashMap'>, 'of_', (<type 'String'>, <type 'String'>)) In [8]: lucene.initVM() Out[8]: <jcc.JCCEnv object at 0x194cd0> In [9]: lucene.HashMap().of_(lucene.String, lucene.String) Out[9]: HashMap: {} In [10]: sj.HashMap().of_(sj.String, sj.String) Out[10]: HashMap: {}
--module option not playing nicely with relative paths
Hi, Until recently, I wasn't using the --module parameter. But now I do and the compilation was failing, because I am not building things in the top folder, but from inside build - to avoid clutter. I believe I discovered a bug and I am sending a patch. Basically, jcc.py is copying modules into the build dir. My project is organized as:

build
java
python
    packageA
    packageB

I build things inside build; if I specify a relative path, --module '../python/packageA', jcc will correctly copy the tree structure, resulting in:

extension
    packageA
    packageB

However, the package names (for distutils setup) will be set to ['extension', 'extension..python.packageA', 'extension..python.packageB'] Which ends up in this error: [exec] running install [exec] running bdist_egg [exec] running egg_info [exec] writing solrpie_java.egg-info/PKG-INFO [exec] writing top-level names to solrpie_java.egg-info/top_level.txt [exec] writing dependency_links to solrpie_java.egg-info/dependency_links.txt [exec] warning: manifest_maker: standard file '__main__.py' not found [exec] error: package directory 'build/solrpie_java/python/solrpye' does not exist Cheers, roman
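The fix the report implies can be sketched in a few lines. This is not the actual patch sent to the list, just an illustrative helper (`package_name` is an invented name): distutils package names must be dotted identifiers, so any `..` components from a relative --module path have to be stripped before the name is built.

```python
import os

def package_name(module_path, prefix="extension"):
    """Build a distutils package name from a --module path.

    Illustrative sketch only: normalize the path, drop '..' and '.'
    components, then join the remaining parts with dots.
    """
    parts = [p for p in os.path.normpath(module_path).split(os.sep)
             if p not in ("..", ".", "")]
    return ".".join([prefix] + parts)

print(package_name("../python/packageA"))  # extension.python.packageA
print(package_name("python/packageB"))     # extension.python.packageB
```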
Re: call python from java - what strategy do you use?
Hi Andi, I think I will give it a try, if only because I am curious. Please see one remaining question below. On Tue, Jan 11, 2011 at 10:37 PM, Andi Vajda va...@apache.org wrote: On Tue, 11 Jan 2011, Roman Chyla wrote: Hi Andy, This is much more than I could have hoped! Just yesterday, I was looking for ways how to embed Python VM in Jetty, as that would be more natural, but found only jepp.sourceforge.net and off-putting was the necessity to compile it against the newly built python. I could not want it from the guys who may need my extension. And I realize only now, that embedding Python in Java is even documented on the website, but honestly i would not know how to do it without your detailed examples. Now to the questions, I apologize, some of them or all must seem very stupid to you - pylucene is used on many platforms and with jcc always worked as expected (i love it!), but is it as reliable in the opposite direction? The PythonVM.java loads jcc library, so I wonder if in principle there is any difference in the directionality - but I am not sure. To rephrase my convoluted question: would you expect this wrapping be as reliable as wrapping java inside python is now? I've been using this for over two years, in production. My main worry was memory leaks because a server process is expected to stay up and running for weeks at a time and it's been very stable on that front too. Of course, when there is a bug somewhere that causes your Python VM to crash, the entire server crashes. Just like when the JVM crashes (which is normally rare). In other words, this isn't any less reliable than a standalone Python VM process. It can be tricky, but is possible, to run gdb, pdb and jdb together to step through the three languages involved, python, java and C++. I've had to do this a few times but not in a long time. - in the past, i built jcc libraries on one host and distributed them on various machines. 
As long as the OS family and the python main version were the same, it worked on Win/Lin/Mac just fine. As far as I can tell, this does not change, or will it be dependent on the python against which the egg was built? Distributing binaries is risky. The same caveats apply. I wouldn't do it, even in the simple PyLucene case. Unfortunately, I don't have that many choices left - this is not for some client-software scenario, we are running the jobs on the grid, and there I cannot compile the binaries. So, if previously the location of the python interpreter or python minor version did not cause problems, now perhaps it will be different. But that wasn't for the Solr, wrapping Solr is not meant for the grid. - now a little tricky issue; when I wrap jetty inside python, I hoped to build it in a shared mode with lucene to be able to do some low-level lucene indexing tasks from inside Python. If I do the opposite and wrap Python VM in Java, I would still like to access the lucene (which is possible, as I see well from your examples) But on the python side, you are calling initVM() - will the initVM() call create a new Java VM or will it access the parent Java VM which started it? No, initVM() in this case just initializes your egg and adds its stuff to the CLASSPATH. No Java VM init is done. As with any shared-mode JCC-built extension, all calls to initVM() but the first one just do that. The first call to initVM() in the embedding Python case is like that too because there already is a Java VM running when PythonVM is instantiated and called. And if in the python, I will do: import lucene; lucene.initVM(lucene.CLASSPATH) Will it work in this case? Giving access to the java classes from inside python. Or will I have to forget pylucene, and prepare some extra java classes? (the jcc in reverse trick, as you put it) - you say that threads are not managed by the Python VM, does that mean there is no Python GIL? 
No, there is a Python GIL (and that is the Achilles' heel of this setup if you expect high concurrent servlet performance from your server calling Python). That Python GIL is connected to this thread state I was mentioning earlier. Because the thread is not managed by Python, when Python is called (by way of the code generated by JCC) it doesn't find a thread state for the thread and creates one. When the call completes, the thread state is destroyed because its refcount goes to zero. My TerminatingThread class acquires a Python thread state and keeps it for the life of the thread, thereby working around this problem. OK, this then looks like a normal Python - which is somehow making me less worried :) I wanted to use multiprocessing inside python to deal with GIL, and I see no reason why it should not work in this case. Thank you very much. Cheers, roman - I don't really know what is exactly in the python thread local storage, could that somehow negatively affect the Python process if acquireThreadState/releaseThreadState
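The per-thread state lifecycle Andi describes (a thread state created on a thread's first call into Python, destroyed when its refcount drops, and kept alive for the thread's lifetime by his TerminatingThread class) has a simple pure-Python analogue in `threading.local()`: each thread gets its own slot, created lazily on first touch and torn down with the thread. This sketch only illustrates that lifecycle; it is not JCC's C++ thread-state code.

```python
import threading

state = threading.local()   # one independent slot per thread
created = []
lock = threading.Lock()

def worker(name):
    # First touch from this thread: the slot does not exist yet,
    # so the thread "acquires" its own state exactly once.
    if not hasattr(state, "name"):
        state.name = name
        with lock:
            created.append(name)
    # Later touches from the same thread reuse the same state.
    assert state.name == name

threads = [threading.Thread(target=worker, args=("t%d" % i,))
           for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(created))  # each thread created its own state exactly once
```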
Re: call python from java - what strategy do you use?
Hi Andi, all, I tried to implement the PythonVM wrapping on Mac 10.6, with JDK 1.6.22, jcc is freshly built, in shared mode, v. 2.6. The python is the standard Python distributed with MacOsX When I try to run the java, it throws an error when it gets to: static { System.loadLibrary("jcc"); } I am getting this error: Exception in thread "main" java.lang.UnsatisfiedLinkError: /Library/Python/2.6/site-packages/JCC-2.6-py2.6-macosx-10.6-universal.egg/libjcc.dylib: Symbol not found: _PyExc_RuntimeError Referenced from: /Library/Python/2.6/site-packages/JCC-2.6-py2.6-macosx-10.6-universal.egg/libjcc.dylib Expected in: flat namespace in /Library/Python/2.6/site-packages/JCC-2.6-py2.6-macosx-10.6-universal.egg/libjcc.dylib at java.lang.ClassLoader$NativeLibrary.load(Native Method) at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1823) at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1746) at java.lang.Runtime.loadLibrary0(Runtime.java:823) at java.lang.System.loadLibrary(System.java:1045) at org.apache.jcc.PythonVM.<clinit>(PythonVM.java:23) at rca.solr.JettyRunnerPythonVM.start(JettyRunnerPythonVM.java:53) at rca.solr.JettyRunnerPythonVM.main(JettyRunnerPythonVM.java:139) MacBeth:JCC-2.6-py2.6-macosx-10.6-universal.egg rca$ nm libjcc.dylib | grep Exc U _PyExc_RuntimeError U _PyExc_TypeError U _PyExc_ValueError 3442 T __ZNK6JCCEnv15reportExceptionEv 21f0 T __ZNK6JCCEnv23getPythonExceptionClassEv Any pointers what I could do wrong? Note, I haven't built any emql.egg yet, I just run my java program and try to start PythonVM() and see if that works. Thanks, roman On Wed, Jan 12, 2011 at 11:05 AM, Roman Chyla roman.ch...@gmail.com wrote: Hi Andi, I think I will give it a try, if only because I am curious. Please see one remaining question below. On Tue, Jan 11, 2011 at 10:37 PM, Andi Vajda va...@apache.org wrote: On Tue, 11 Jan 2011, Roman Chyla wrote: Hi Andy, This is much more than I could have hoped! 
Just yesterday, I was looking for ways how to embed Python VM in Jetty, as that would be more natural, but found only jepp.sourceforge.net and off-putting was the necessity to compile it against the newly built python. I could not want it from the guys who may need my extension. And I realize only now, that embedding Python in Java is even documented on the website, but honestly i would not know how to do it without your detailed examples. Now to the questions, I apologize, some of them or all must seem very stupid to you - pylucene is used on many platforms and with jcc always worked as expected (i love it!), but is it as reliable in the opposite direction? The PythonVM.java loads jcc library, so I wonder if in principle there is any difference in the directionality - but I am not sure. To rephrase my convoluted question: would you expect this wrapping be as reliable as wrapping java inside python is now? I've been using this for over two years, in production. My main worry was memory leaks because a server process is expected to stay up and running for weeks at a time and it's been very stable on that front too. Of course, when there is a bug somewhere that causes your Python VM to crash, the entire server crashes. Just like when the JVM crashes (which is normally rare). In other words, this isn't any less reliable than a standalone Python VM process. It can be tricky, but is possible, to run gdb, pdb and jdb together to step through the three languages involved, python, java and C++. I've had to do this a few times but not in a long time. - in the past, i built jcc libraries on one host and distributed them on various machines. As long the family OS and the python main version were the same, it worked on Win/Lin/Mac just fine. As far as I can tell, this does not change, or will it be dependent on the python against which the egg was built? Distributing binaries is risky. The same caveats apply. I wouldn't do it, even in the simple PyLucene case. 
unfortunately, I don't have that many choices left - this is not for some client-software scenario, we are running the jobs on the grid, and there I cannot compile the binaries. So, if previously the location of the python interpreter or python minor version did not cause problems, now perhaps it will be different. But that wasn't for the Solr, wrapping Solr is not meant for the grid. - now a little tricky issue; when I wrap jetty inside python, I hoped to build it in a shared mode with lucene to be able to do some low-level lucene indexing tasks from inside Python. If I do the opposite and wrap Python VM in Java, I would still like to access the lucene (which is possible, as I see well from your examples) But on the python side, you are calling initVM() - will the initVM() call create a new Java VM or will it access the parent Java
Re: call python from java - what strategy do you use?
Hi Andi, Thanks for the help, now I was able to run the java and loaded PythonVM. I then built the python egg, after a bit of fiddling with parameters, it seems ok. I can import the jcc wrapped python class and call it: In [1]: from solrpie_java import emql In [2]: em = emql.Emql() In [3]: em.javaTestPrint() java is printing In [4]: em.pythonTestPrint() just a test But I haven't found out how to call the same from java. The egg is built fine, it is named solrpie_java and contains one python module:

==
from solrpie_java import initVM, CLASSPATH, EMQL
initVM(CLASSPATH)

class Emql(EMQL):
    '''
    classdocs
    '''

    def __init__(self):
        super(Emql, self).__init__()
        print '__init__'

    def init(self, me):
        print self, me
        return 'init'

    def emql_refresh(self, tid, type):
        print self, tid, type
        return 'refresh'

    def emql_status(self):
        return "some status"

    def pythonTestPrint(self):
        print 'just a test'
==

The corresponding java class looks like this:

===
public class EMQL {
    private long pythonObject;

    public EMQL() {
    }

    public void pythonExtension(long pythonObject) {
        this.pythonObject = pythonObject;
    }

    public long pythonExtension() {
        return this.pythonObject;
    }

    public void finalize() throws Throwable {
        pythonDecRef();
    }

    public void javaTestPrint() {
        System.out.println("java is printing");
    }

    public native void pythonDecRef();

    // the methods implemented in python
    public native String init(EMQL me);
    public native String emql_refresh(String tid, String type);
    public native String emql_status();
    public native void pythonTestPrint();
}
===

I tried running it as: PythonVM vm = PythonVM.start("sorlpie_java"); EMQL em = new EMQL(); em.javaTestPrint(); em.pythonTestPrint(); I get this: java is printing Exception in thread "main" java.lang.UnsatisfiedLinkError: rca.pythonvm.EMQL.pythonTestPrint()V at rca.pythonvm.EMQL.pythonTestPrint(Native Method) at rca.solr.JettyRunnerPythonVM.start(JettyRunnerPythonVM.java:60) at rca.solr.JettyRunnerPythonVM.main(JettyRunnerPythonVM.java:148) I understand that java cannot find 
the linked c++ method, but I don't know how to fix that. If I try: PythonVM vm = PythonVM.start("sorlpie_java"); Object m = vm.instantiate("emql", "Emql"); I get: org.apache.jcc.PythonException: No module named emql ImportError: No module named emql at org.apache.jcc.PythonVM.instantiate(Native Method) at rca.solr.JettyRunnerPythonVM.start(JettyRunnerPythonVM.java:56) at rca.solr.JettyRunnerPythonVM.main(JettyRunnerPythonVM.java:148) I tried various combinations of instantiation, and setting the classpath or -Djava.library.path But no success. What am I doing wrong? Thank you, roman On Wed, Jan 12, 2011 at 7:55 PM, Andi Vajda va...@apache.org wrote: On Wed, 12 Jan 2011, Roman Chyla wrote: Hi Andi, all, I tried to implement the PythonVM wrapping on Mac 10.6, with JDK 1.6.22, jcc is freshly built, in shared mode, v. 2.6. The python is the standard Python distributed with MacOsX When I try to run the java, it throws an error when it gets to: static { System.loadLibrary("jcc"); } I am getting this error: Exception in thread "main" java.lang.UnsatisfiedLinkError: /Library/Python/2.6/site-packages/JCC-2.6-py2.6-macosx-10.6-universal.egg/libjcc.dylib: Symbol not found: _PyExc_RuntimeError Referenced from: That's because Python's shared library wasn't found. The reason is that, by default, Python's shared lib is not on JCC's link line because normally JCC is loaded into a Python process and the dynamic linker thus finds the symbols needed inside the process. Here, since you're not starting inside a Python process, you need to add '-framework Python' to JCC's LFLAGS in setup.py so that the dynamic linker can find the Python VM shared lib and load it. Andi.. 
/Library/Python/2.6/site-packages/JCC-2.6-py2.6-macosx-10.6-universal.egg/libjcc.dylib Expected in: flat namespace in /Library/Python/2.6/site-packages/JCC-2.6-py2.6-macosx-10.6-universal.egg/libjcc.dylib at java.lang.ClassLoader$NativeLibrary.load(Native Method) at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1823) at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1746) at java.lang.Runtime.loadLibrary0(Runtime.java:823) at java.lang.System.loadLibrary(System.java:1045) at org.apache.jcc.PythonVM.<clinit>(PythonVM.java:23) at rca.solr.JettyRunnerPythonVM.start
Re: call python from java - what strategy do you use?
Hi Andi, Your help is great, thanks a lot! Without your detailed instructions, I would not be able to figure it out - and the last bit with the python...I should have thought before writing :-) I call the class EMQL just because I was lazy to change it. But I will do now that I understand a little bit more. What I find very cool is the fact that if I build this extension the way you showed me, I can run java from inside python, but also python from inside Java - and with one jar and one compiled egg. Very handy. But as you said, the evil is in the details, so I expect some bumps. And about the thing with LFLAGS '-framework Python': will other platforms also need something similar to Mac? I assume this is a mac dynamic discovery of the libraries; will anything bad happen if I changed the path of the Python now when the extension was built? Cheers! roman On Wed, Jan 12, 2011 at 11:54 PM, Andi Vajda va...@apache.org wrote: Hi Roman, On Wed, 12 Jan 2011, Roman Chyla wrote: Thanks for the help, now I was able to run the java and loaded PythonVM. I then built the python egg, after a bit of fiddling with parameters, it seems ok. I can import the jcc wrapped python class and call it: In [1]: from solrpie_java import emql Why are you calling your class EMQL ? (this name was just an example culled from my code). In [2]: em = emql.Emql() In [3]: em.javaTestPrint() java is printing In [4]: em.pythonTestPrint() just a test But I haven't found out how to call the same from java. Ah, yes, I forgot to tell you how to pull that in. In Java, you import that 'EMQL' java class and instantiate it by way of the PythonVM instance's instantiate() call: import org.blah.blah.EMQL; import org.apache.jcc.PythonVM; . PythonVM vm = PythonVM.get(); emql = (EMQL) vm.instantiate("jemql.emql", "emql"); ... call method on emql instance just created ... The instantiate("foo", "bar") method in effect asks Python to run: from foo import bar; return bar() Andi..
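Andi's description of instantiate() has a direct Python equivalent. This sketch implements exactly the "from foo import bar; return bar()" behaviour he describes (the function name `instantiate` mirrors the Java method; the demo uses a stdlib class instead of the emql example):

```python
import importlib

def instantiate(module_name, class_name):
    """Python-side equivalent of PythonVM.instantiate("foo", "bar"):
    import the module, look up the class, return a new instance."""
    module = importlib.import_module(module_name)
    return getattr(module, class_name)()

# Demo with a stdlib class standing in for the emql.Emql example:
obj = instantiate("collections", "OrderedDict")
print(type(obj).__name__)  # OrderedDict
```

A missing module raises ImportError here exactly as it surfaced on the Java side above ("org.apache.jcc.PythonException: No module named emql").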
Re: call python from java - what strategy do you use?
Hi Andi,

This is much more than I could have hoped for! Just yesterday I was looking for ways to embed a Python VM in Jetty, as that would be more natural, but found only jepp.sourceforge.net, and off-putting was the necessity to compile it against a newly built python - I could not ask that of the people who may need my extension. And I realize only now that embedding Python in Java is even documented on the website, but honestly I would not have known how to do it without your detailed examples.

Now to the questions - I apologize, some or all of them must seem very stupid to you:

- pylucene is used on many platforms, and with jcc it always worked as expected (I love it!), but is it as reliable in the opposite direction? PythonVM.java loads the jcc library, so I wonder if in principle there is any difference in the directionality. To rephrase my convoluted question: would you expect this wrapping to be as reliable as wrapping Java inside Python is now?
- in the past, I built jcc libraries on one host and distributed them to various machines. As long as the OS family and the Python major version were the same, it worked on Win/Lin/Mac just fine. As far as I can tell this does not change - or will it depend on the python against which the egg was built?
- now a little tricky issue: when I wrapped jetty inside python, I hoped to build it in shared mode with lucene, to be able to do some low-level lucene indexing tasks from inside Python. If I do the opposite and wrap the Python VM in Java, I would still like to access lucene (which is possible, as I see from your examples). But on the python side you are calling initVM() - will that initVM() call create a new Java VM, or will it attach to the parent Java VM which started it?
- you say that threads are not managed by the Python VM - does that mean there is no Python GIL?
- I don't really know exactly what is in the python thread-local storage; could it somehow negatively affect the Python process if acquireThreadState/releaseThreadState are not called?

Thank you. Cheers,

roman

On Tue, Jan 11, 2011 at 8:13 PM, Andi Vajda va...@apache.org wrote:

> Hi Roman,
>
> On Tue, 11 Jan 2011, Roman Chyla wrote:
>
>> I have recently wrapped solr inside jetty with JCC (we need to access very big result sets quickly, via JNI, but also keep solr running as normal) and was wondering what strategies you guys use to speak *from inside* Java towards the Python end. So far, I was able to think of these:
>> - raise exceptions in java and catch them in python (I think I have seen this in some posts from Bill Janssen)
>> - communicate via sockets
>> - wait passively: call some java method and wait for its return
>> - monitor actively: in python, check some java object in a loop
>> Is there something else?
>
> I'm not sure I completely understand your questions, but if what you're asking is how to run Python code from inside a Java servlet container, I've done that with Tomcat and Lucene.
>
> Basically, instead of embedding a JVM inside a Python VM - as is done for PyLucene - you do the opposite: you embed a Python VM inside a JVM. For that purpose, see the org.apache.jcc.PythonVM class available in JCC's java tree. This class must be instantiated from the main thread at Java servlet engine startup time. In Tomcat, I patched some startup code in Bootstrap.java (see patches below) for this purpose.
>
> Then, to make some Python code accessible from Java, use the usual way of writing extensions, the so-called "JCC in reverse" trick:
> - define a Java class with some native methods implemented in Python
> - define a Python class that extends it
> - build the Java class into a JAR
> - include it into a JCC-built egg
> - install the egg into Python's env (site-packages, PYTHONPATH, whatever)
>
> Then, write servlet code in Java that imports your Java class and calls it.
> As you can see, this sounds simple, but the devil is in the details. Of course, bending Jetty to this may have different requirements, but the code snippets below should give you a good idea of what's required. This approach has been in production running freebase.com's search server for over two years now.
>
> If you have questions, of course, please ask. Good luck !
>
> Andi..
>
> -- Patch to Bootstrap.java to use JCC's PythonVM (which initializes the embedded Python VM):
>
> ```
> --- apache-tomcat-6.0.29-src/java/org/apache/catalina/startup/Bootstrap.java 2010-07-19 06:02:32.0 -0700
> +++ apache-tomcat-6.0.29-src/java/org/apache/catalina/startup/Bootstrap.java.patched 2010-08-04 08:49:05.0 -0700
> @@ -30,16 +30,18 @@
>  import javax.management.MBeanServer;
>  import javax.management.MBeanServerFactory;
>  import javax.management.ObjectName;
>  import org.apache.catalina.security.SecurityClassLoad;
>  import org.apache.juli.logging.Log;
>  import org.apache.juli.logging.LogFactory;
> +import org.apache.jcc.PythonVM;
> +
>  /**
>   * Boostrap loader for Catalina. This application
> ```
Re: building PyLucene 3.0.2 on Win7/MinGW with Python 2.7
On Mon, Nov 22, 2010 at 9:45 PM, Bill Janssen jans...@parc.com wrote:

> Roman Chyla roman.ch...@gmail.com wrote:
>
>> I had a similar/same issue on win xp - it was the space in the java path, but I can't recall the details. What happens if you change config.py to:
>>
>> C:\\Program\ Files\ (x86)\\Java\\jdk1.6.0_22\\lib
>
> Wouldn't that eval to the same Python string?

Which reminds me: I ended up with 'Program\\ Files', but that must have been for the compilation - so nevermind, sorry, that was another problem.

> I tried quoting all the spaces in the strings, with no help. It's when it attempts to load jcc/_jcc.pyd that it fails. One possible problem is that there are two different _jcc submodules there:
>
> ```
> -rw-rw-rw- 1 wjanssen root    282 11-22 12:29 _jcc.py
> -rw-rw-rw- 1 wjanssen root    577 11-22 12:29 _jcc.pyc
> -rw-rw-rw- 1 wjanssen root 512418 11-22 12:29 _jcc.pyd
> ```
>
> I'm not sure why, or if that's a problem. Using depends.exe on _jcc.pyd says that the missing file is Python27.dll, which seems odd. Where should I find that?
>
> Bill

roman

On Mon, Nov 22, 2010 at 7:53 PM, Bill Janssen jans...@parc.com wrote:

> I got a brand-new Windows 7 machine, and thought I'd try building PyLucene with a newer version of Python, 2.7, the 32-bit version. I also had to move to setuptools-0.6c11, because 0.6c9 doesn't seem to work with Python 2.7. Using 32-bit Java 6.0_22. But I can't get JCC to run here:
>
> ```
> sh-3.1$ which jcc.dll
> /c/Python27/Lib/site-packages/JCC-2.6-py2.7-win32.egg/jcc.dll
> sh-3.1$ which jvm.dll
> /c/Program Files (x86)/Java/jre6/bin/client/jvm.dll
> sh-3.1$ python -m jcc.__main__ --help
> c:\Python27\python.exe: DLL load failed: The specified module could not be found.
> sh-3.1$ python -c 'import os; print os.environ.get("PATH")'
> c:\Windows\system32;c:\Windows;c:\Windows\System32\Wbem;c:\Windows\System32\WindowsPowerShell\v1.0\;C:\MinGW\msys\1.0\bin;C:\MinGW\bin;c:\Python27;c:\Program Files\apache-ant-1.8.1\bin;c:\Program Files (x86)\Java\jre6\bin\client;c:\Python27\Lib\site-packages\JCC-2.6-py2.7-win32.egg
> ```
>
> It seems to build and install OK, but when I run python in verbose mode, I see:
>
> ```
> import jcc # directory c:\Python27\lib\site-packages\jcc-2.6-py2.7-win32.egg\jcc
> # c:\Python27\lib\site-packages\jcc-2.6-py2.7-win32.egg\jcc\__init__.pyc matches c:\Python27\lib\site-packages\jcc-2.6-py2.7-win32.egg\jcc\__init__.py
> import jcc # precompiled from c:\Python27\lib\site-packages\jcc-2.6-py2.7-win32.egg\jcc\__init__.pyc
> # c:\Python27\lib\site-packages\jcc-2.6-py2.7-win32.egg\jcc\config.pyc matches c:\Python27\lib\site-packages\jcc-2.6-py2.7-win32.egg\jcc\config.py
> import jcc.config # precompiled from c:\Python27\lib\site-packages\jcc-2.6-py2.7-win32.egg\jcc\config.pyc
> c:\Python27\python.exe: DLL load failed: The specified module could not be found.
> ```
>
> So, what's in jcc/config.py? Here's what's in it:
>
> ```
> INCLUDES=['C:\\Program Files (x86)\\Java\\jdk1.6.0_22\\include',
>           'C:\\Program Files (x86)\\Java\\jdk1.6.0_22\\include\\win32']
> CFLAGS=['-fno-strict-aliasing', '-Wno-write-strings']
> DEBUG_CFLAGS=['-O0', '-g', '-DDEBUG']
> LFLAGS=['-LC:\\Program Files (x86)\\Java\\jdk1.6.0_22\\lib', '-ljvm']
> IMPLIB_LFLAGS=['-Wl,--out-implib,%s']
> SHARED=True
> VERSION=2.6
> ```
>
> Any ideas about what's going wrong? I suspect those parentheses in the path to the jvm, myself.
>
> Bill
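Bill's question - whether Roman's extra backslashes would eval to the same Python string - can be checked directly. `'\ '` is not an escape sequence in Python, so the backslash survives as a literal character and the two spellings differ; the config.py entries above are already well-formed string literals (spaces and parentheses need no escaping):

```python
# '\\' in a Python string literal is one backslash; '\ ' is a literal
# backslash followed by a space (there is no such escape sequence).
plain = '-LC:\\Program Files (x86)\\Java\\jdk1.6.0_22\\lib'
escaped = 'C:\\Program\ Files\ (x86)\\Java\\jdk1.6.0_22\\lib'

print(plain)    # -LC:\Program Files (x86)\Java\jdk1.6.0_22\lib
print(escaped)  # C:\Program\ Files\ (x86)\Java\jdk1.6.0_22\lib
```

So the suggested edit changes the value rather than leaving it equal, which supports Bill's suspicion that the failure lies in DLL resolution (PATH lookup), not in the string literals themselves.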
Re: building PyLucene 3.0.2 on Win7/MinGW with Python 2.7
I had a similar/same issue on win xp - it was the space in the java path, but I can't recall the details. What happens if you change config.py to:

```
C:\\Program\ Files\ (x86)\\Java\\jdk1.6.0_22\\lib
```

roman

On Mon, Nov 22, 2010 at 7:53 PM, Bill Janssen jans...@parc.com wrote:

> I got a brand-new Windows 7 machine, and thought I'd try building PyLucene with a newer version of Python, 2.7, the 32-bit version. I also had to move to setuptools-0.6c11, because 0.6c9 doesn't seem to work with Python 2.7. Using 32-bit Java 6.0_22. But I can't get JCC to run here:
>
> ```
> sh-3.1$ which jcc.dll
> /c/Python27/Lib/site-packages/JCC-2.6-py2.7-win32.egg/jcc.dll
> sh-3.1$ which jvm.dll
> /c/Program Files (x86)/Java/jre6/bin/client/jvm.dll
> sh-3.1$ python -m jcc.__main__ --help
> c:\Python27\python.exe: DLL load failed: The specified module could not be found.
> sh-3.1$ python -c 'import os; print os.environ.get("PATH")'
> c:\Windows\system32;c:\Windows;c:\Windows\System32\Wbem;c:\Windows\System32\WindowsPowerShell\v1.0\;C:\MinGW\msys\1.0\bin;C:\MinGW\bin;c:\Python27;c:\Program Files\apache-ant-1.8.1\bin;c:\Program Files (x86)\Java\jre6\bin\client;c:\Python27\Lib\site-packages\JCC-2.6-py2.7-win32.egg
> ```
>
> It seems to build and install OK, but when I run python in verbose mode, I see:
>
> ```
> import jcc # directory c:\Python27\lib\site-packages\jcc-2.6-py2.7-win32.egg\jcc
> # c:\Python27\lib\site-packages\jcc-2.6-py2.7-win32.egg\jcc\__init__.pyc matches c:\Python27\lib\site-packages\jcc-2.6-py2.7-win32.egg\jcc\__init__.py
> import jcc # precompiled from c:\Python27\lib\site-packages\jcc-2.6-py2.7-win32.egg\jcc\__init__.pyc
> # c:\Python27\lib\site-packages\jcc-2.6-py2.7-win32.egg\jcc\config.pyc matches c:\Python27\lib\site-packages\jcc-2.6-py2.7-win32.egg\jcc\config.py
> import jcc.config # precompiled from c:\Python27\lib\site-packages\jcc-2.6-py2.7-win32.egg\jcc\config.pyc
> c:\Python27\python.exe: DLL load failed: The specified module could not be found.
> ```
>
> So, what's in jcc/config.py? Here's what's in it:
>
> ```
> INCLUDES=['C:\\Program Files (x86)\\Java\\jdk1.6.0_22\\include',
>           'C:\\Program Files (x86)\\Java\\jdk1.6.0_22\\include\\win32']
> CFLAGS=['-fno-strict-aliasing', '-Wno-write-strings']
> DEBUG_CFLAGS=['-O0', '-g', '-DDEBUG']
> LFLAGS=['-LC:\\Program Files (x86)\\Java\\jdk1.6.0_22\\lib', '-ljvm']
> IMPLIB_LFLAGS=['-Wl,--out-implib,%s']
> SHARED=True
> VERSION=2.6
> ```
>
> Any ideas about what's going wrong? I suspect those parentheses in the path to the jvm, myself.
>
> Bill
Re: PatternAnalyzer not implemented?
Thank you, Andi. Recompiled, it works just fine now.

roman

On Fri, Oct 1, 2010 at 8:28 PM, Andi Vajda va...@apache.org wrote:

> On Fri, 1 Oct 2010, Roman Chyla wrote:
>
>> I tried to use the PatternAnalyzer, but am getting NotImplementedError - in case it is not available, shall I rather use PythonAnalyzer and implement the regex pattern analyzer with that? Using version: 2.9.3
>>
>> ```
>> In [44]: import lucene
>> In [45]: import pyjama  # <-- this package contains java.util.regex.Pattern
>> In [46]: p = pyjama.Pattern.compile("\\s")
>> In [47]: p
>> Out[47]: <Pattern: \s>
>> In [48]: import lucene.collections as col
>> In [49]: s = col.JavaSet([])
>> In [50]: s
>> Out[50]: <JavaSet: org.apache.pylucene.util.python...@16925b0>
>> In [51]: pa = lucene.PatternAnalyzer(p, True, s)
>> ---------------------------------------------------------------------------
>> NotImplementedError                       Traceback (most recent call last)
>> /Users/rca/<ipython console> in <module>()
>> NotImplementedError: ('instantiating java class', <type 'PatternAnalyzer'>)
>> ```
>
> This is because no constructors were generated for PatternAnalyzer. That in turn is because the java.util.regex package is missing from the JCC command line in PyLucene's Makefile, causing methods and constructors using classes in that package to be skipped. To fix this, add
>
> ```
> --package java.util.regex \
> ```
>
> around line 214 of PyLucene's Makefile.
>
> It is also strongly recommended that you rebuild pyjama with --import lucene on the JCC command line, so that you don't have JCC generate wrappers again for classes that are shared between pyjama and lucene.
>
> Andi..
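The fallback Roman asked about - implementing the regex analyzer on top of PythonAnalyzer - boils down to regex tokenization. A pure-Python sketch of that behavior (names and signature are illustrative only; a real PyLucene fallback would subclass lucene.PythonAnalyzer and emit tokens through a TokenStream):

```python
import re

def pattern_tokenize(text, pattern=r'\s+', lowercase=True, stopwords=frozenset()):
    """PatternAnalyzer-style tokenization: split on a regex (here \s+,
    matching the "\\s" pattern in the session above), optionally
    lowercase, and drop stop words."""
    tokens = [t for t in re.split(pattern, text) if t]
    if lowercase:
        tokens = [t.lower() for t in tokens]
    return [t for t in tokens if t not in stopwords]

print(pattern_tokenize("The PatternAnalyzer splits  on whitespace"))
# ['the', 'patternanalyzer', 'splits', 'on', 'whitespace']
```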
PatternAnalyzer not implemented?
Hello,

I tried to use the PatternAnalyzer, but am getting NotImplementedError - in case it is not available, shall I rather use PythonAnalyzer and implement the regex pattern analyzer with that? Using version: 2.9.3

```
In [44]: import lucene
In [45]: import pyjama  # <-- this package contains java.util.regex.Pattern
In [46]: p = pyjama.Pattern.compile("\\s")
In [47]: p
Out[47]: <Pattern: \s>
In [48]: import lucene.collections as col
In [49]: s = col.JavaSet([])
In [50]: s
Out[50]: <JavaSet: org.apache.pylucene.util.python...@16925b0>
In [51]: pa = lucene.PatternAnalyzer(p, True, s)
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
/Users/rca/<ipython console> in <module>()
NotImplementedError: ('instantiating java class', <type 'PatternAnalyzer'>)
In [52]:
```

Kind regards,

roman
Re: Issues while connecting PyLucene code to Apache WSGI interface
I recently had a problem with this: http://stackoverflow.com/questions/548493/jcc-initvm-doesnt-return-when-mod-wsgi-is-configured-as-daemon-mode - you may want to check that too.

roman

On Mon, Aug 30, 2010 at 8:50 PM, Andi Vajda va...@apache.org wrote:

> On Mon, 30 Aug 2010, technology inspired wrote:
>
>> Thanks for the reply. My example runs fine when it runs alone (pure python). Here is the code:
>
> Ok, then the next step is to port it to a python http server such as [1], so that you get the threading and initialization story straight:
> - initVM() must be called from the main thread, once
> - any thread created from Python must call attachCurrentThread() before making any other calls that involve the JVM
>
> I'm not sure how this is done in the apache2/wsgi environment; that is a question for another forum. That being said, if you solve this problem, posting your answer here would be helpful, as this has come up before.
>
> About the errors you're reporting: what you're seeing in your browser is irrelevant. Instead, you must log the errors that happen on the Python side and look for the stacktraces there.
>
> Andi..
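The two rules above - initVM() once from the main thread, attachCurrentThread() in every other thread before it touches the JVM - amount to a once-per-thread initialization pattern. A runnable sketch with a stand-in env object (FakeEnv is hypothetical; real code would use the JCCEnv returned by lucene.initVM()):

```python
import threading

class FakeEnv:
    """Stand-in for the JCCEnv returned by initVM(); it only records
    that each worker thread attached itself exactly once."""
    def __init__(self):
        self._tls = threading.local()
        self._lock = threading.Lock()
        self.attach_count = 0

    def attachCurrentThread(self):
        # Idempotent per thread, so a request handler can call it
        # unconditionally at the top without double-attaching.
        if not getattr(self._tls, "attached", False):
            self._tls.attached = True
            with self._lock:
                self.attach_count += 1

env = FakeEnv()  # real code: env = lucene.initVM(), main thread only

def worker():
    env.attachCurrentThread()  # required before any JVM call in this thread
    env.attachCurrentThread()  # safe to call again; no double counting

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(env.attach_count)  # 4: one attach per worker thread
```

In a threaded WSGI container, the attachCurrentThread() call belongs at the start of each request handler, not next to initVM() in the main thread.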
> [1] http://docs.python.org/library/simplehttpserver.html

```python
#import sys, os
#sys.path.append("/home/v/workspace/example-project/src/trunk")
#os.environ['DJANGO_SETTINGS_MODULE'] = 'example.settings'
from lucene import Field, Document, initVM, NIOFSDirectory, IndexWriter, StandardAnalyzer, Version, File
from lucene import SimpleFSLockFactory, NumericField, IndexSearcher, QueryParser, NumericRangeQuery
from lucene import Integer, BooleanQuery, BooleanClause
#from django.shortcuts import render_to_response

def build():
    initVM()
    dir = NIOFSDirectory(File("/home/v/index"), SimpleFSLockFactory())
    analyzer = StandardAnalyzer(Version.LUCENE_30)
    writer = IndexWriter(dir, analyzer, True, IndexWriter.MaxFieldLength(1024))
    field_rows = FieldDoc.objects.all()  # currently there is only one row in the database
    for row in field_rows:
        doc = Document()
        if row.category != "":
            doc.add(Field('category', row.category, Field.Store.YES, Field.Index.NOT_ANALYZED))
        writer.addDocument(doc)
    writer.close()
    #return render_to_response("index.html", {"var": "Success"})
```

But when I connect it with httpd/mod_wsgi, I see the Success page some of the time; other times it says Internal Server Error, with the errors mentioned in the previous email. I am not aware of the best practice for running Python Lucene code from a web server. You have mentioned using attachCurrentThread(). I tried using it this way:

```python
env = initVM()
env.attachCurrentThread()
```

but there was no change in the response. I don't know if this is how attachCurrentThread() should be used in the above build function. Please guide me on how to connect Lucene code with Apache2/wsgi. My apache2/wsgi is configured properly, as I can run non-lucene web pages. Apache2 is using mpm-worker, a threaded environment. Thanks.

Regards, Vin

On Sun, Aug 29, 2010 at 12:21 PM, Andi Vajda va...@apache.org wrote:

> On Sun, 29 Aug 2010, technology inspired wrote:
>
>> I am using PyLucene 3.0.2 on Ubuntu 10.04 with Python 2.6.5 and Sun Java 1.6.
>> I have written an example script to build an index and store it in a directory. Later on, I want to search it with my next example script, which as of now I haven't written. There are two issues I have to mention and am looking for your help with:
>>
>> ISSUE 1: I am using Apache2 with mod_wsgi 3.3. I have the index-building script connected to a GET request. When I call that GET request, I get the following errors:
>>
>> ```
>> [error] [client 127.0.0.1] Premature end of script headers: wsgi
>> [notice] child pid exit signal Aborted (6)
>> ```
>>
>> With this error, I see "Internal Server Error" on my browser screen. This error appears only if I make the GET request very often, i.e. around 1 per 2 seconds. If I issue GET at an interval of 10 seconds, I don't see these errors.
>>
>> ISSUE 2: When I index a Date field using NumericField, the GET request gives Internal Server Error on every alternate request, and the Apache2 log files get these errors:
>>
>> ```
>> [error] [client 127.0.0.1] Premature end of script headers: wsgi
>> [notice] child pid exit signal Segmentation fault (11)
>> ```
>>
>> I am looking for help to solve these problems. I am running WSGI daemon mode. WSGI settings are:
>>
>> ```
>> ...
>> WSGIDaemonProcess example.com user=www-data group=www-data threads=25
>> WSGIProcessGroup example.com
>> WSGIScriptAlias /
>> ```
InvalidArgsError - passing TopDocs object
Hi,

I am trying to understand PyLucene more and to see if it is faster to retrieve result ids with java instead of with Python. The use case is to retrieve millions of recids - with python, 700K ids takes about 1.5s (even if the query takes just a fraction of that). I wrote simple java code (works in java) which returns an array of ints. I have wrapped it with jcc; it is visible from inside python, but calling the static method throws InvalidArgsError (below is an example python session). JCC is version 2.4, built in shared mode - the DistUtils is in a different package than lucene (i.e. not inside the lucene jars). Can this problem be similar to passing jcc-wrapped objects between different jcc packages? http://search-lucene.com/m/SPgeW1hDtAw1

The java class is very simple:

```
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.ScoreDoc;

public class DumpUtils {
    public static int[] GetDocIds(TopDocs topdocs) {
        int[] out = new int[topdocs.totalHits];
        ScoreDoc[] hits = topdocs.scoreDocs;
        for (int i = 0; i < topdocs.totalHits; i++) {
            out[i] = hits[i].doc;
        }
        return out;
    }
}
```

Thanks for any help/pointers,

roman

Here is an example python session:

```
In [1]: import pyjama
In [2]: pyjama.initVM(pyjama.CLASSPATH)
Out[2]: <jcc.JCCEnv object at 0x00C0E1F0>
In [3]: import lucene as lu
In [4]: pyjama.DumpUtils
Out[4]: <type 'DumpUtils'>
In [5]: pyjama.DumpUtils.GetDocIds
Out[5]: <built-in method GetDocIds of type object at 0x0189E780>
In [7]: import newseman.pyjamic.slucene.searcher as se
In [8]: s = se.Searcher(); s.open('/tmp/whisper/')
In [9]: hits = s._search(s._query('key:bo*', None), 50)
In [10]: hits
Out[10]: <TopDocs: org.apache.lucene.search.topd...@480457>
In [12]: pyjama.DumpUtils.GetDocIds(hits)
---------------------------------------------------------------------------
InvalidArgsError                          Traceback (most recent call last)
InvalidArgsError: (<type 'DumpUtils'>, 'GetDocIds', <TopDocs: org.apache.lucene.search.topd...@480457>)
```
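The point of returning a primitive int[] instead of iterating hits from Python is allocation: one contiguous typed buffer instead of hundreds of thousands of Python objects crossing the bridge. A pure-Python illustration of the same idea using the stdlib array module (the Hit class is a made-up stand-in for Lucene's ScoreDoc):

```python
from array import array

class Hit:
    """Made-up stand-in for a Lucene ScoreDoc; only the doc id matters."""
    def __init__(self, doc):
        self.doc = doc

def get_doc_ids(hits):
    # One contiguous C array of 32-bit ints, analogous to the int[]
    # returned by the Java DumpUtils.GetDocIds() above.
    return array('i', (h.doc for h in hits))

ids = get_doc_ids([Hit(i * 10) for i in range(5)])
print(list(ids))  # [0, 10, 20, 30, 40]
```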
Re: InvalidArgsError - passing TopDocs object
Thank you very much, Andi.

Best,

roman

On Tue, Aug 24, 2010 at 5:36 PM, Andi Vajda va...@apache.org wrote:

> On Aug 24, 2010, at 8:03, Roman Chyla roman.ch...@gmail.com wrote:
>
>> I am trying to understand PyLucene more and to see if it is faster to retrieve result ids with java instead of with Python. The use case is to retrieve millions of recids - with python, 700K ids takes about 1.5s (even if the query takes just a fraction of that). I wrote simple java code (works in java) which returns an array of ints. I have wrapped it with jcc; it is visible from inside python, but calling the static method throws InvalidArgsError (below is an example python session). JCC is version 2.4, built in shared mode - the DistUtils is in a different package than lucene (i.e. not inside the lucene jars). Can this problem be similar to passing jcc-wrapped objects between different jcc packages? http://search-lucene.com/m/SPgeW1hDtAw1
>>
>> The java class is very simple:
>>
>> ```
>> import org.apache.lucene.search.TopDocs;
>> import org.apache.lucene.search.ScoreDoc;
>>
>> public class DumpUtils {
>>     public static int[] GetDocIds(TopDocs topdocs) {
>>         int[] out = new int[topdocs.totalHits];
>>         ScoreDoc[] hits = topdocs.scoreDocs;
>>         for (int i = 0; i < topdocs.totalHits; i++) {
>>             out[i] = hits[i].doc;
>>         }
>>         return out;
>>     }
>> }
>> ```
>>
>> Thanks for any help/pointers,
>
> Ah yes, importing separately built extensions that share classes (or dependencies) didn't work until support for the --import parameter was added in jcc 2.6, to solve the problem of incompatible shared classes. To make this work:
> - first, build PyLucene as usual, with --shared
> - then, build your DistUtils package with --import lucene and with --shared
>
> That way, instead of generating code and wrapper classes again for the lucene classes, jcc will import them at build time, thus making a much smaller library and a faster build. The resulting shared library is linked against the lucene one. See the docs and list archives about --import for more examples. Then, when running all this, you should also import lucene first, then your other package.
> Andi..
>
>> roman
>>
>> Here is an example python session:
>>
>> ```
>> In [1]: import pyjama
>> In [2]: pyjama.initVM(pyjama.CLASSPATH)
>> Out[2]: <jcc.JCCEnv object at 0x00C0E1F0>
>> In [3]: import lucene as lu
>> In [4]: pyjama.DumpUtils
>> Out[4]: <type 'DumpUtils'>
>> In [5]: pyjama.DumpUtils.GetDocIds
>> Out[5]: <built-in method GetDocIds of type object at 0x0189E780>
>> In [7]: import newseman.pyjamic.slucene.searcher as se
>> In [8]: s = se.Searcher(); s.open('/tmp/whisper/')
>> In [9]: hits = s._search(s._query('key:bo*', None), 50)
>> In [10]: hits
>> Out[10]: <TopDocs: org.apache.lucene.search.topd...@480457>
>> In [12]: pyjama.DumpUtils.GetDocIds(hits)
>> ---------------------------------------------------------------------------
>> InvalidArgsError                          Traceback (most recent call last)
>> InvalidArgsError: (<type 'DumpUtils'>, 'GetDocIds', <TopDocs: org.apache.lucene.search.topd...@480457>)
>> ```
_addClasspath question
Hi,

I have noticed that JCC 2.6 has an env.classpath and also a method _addClassPath(). When I use _addClassPath(), jvm.classpath shows the change - can we use this method to add to the classpath when the VM is already running? And will it stay? The leading underscore in the name suggests it is not meant to be public.

Thanks,

roman
Re: Building PyLucene on Windows
Hi,

I would also like to thank Andi (and others?) for the great tool and the samples - it is really excellent. I am using MSVC 7.1 on win xp; it builds fine, but it was quite difficult at the beginning (especially because I tried with mingw before falling back to msvc). And indeed, is gnu make indispensable? In some previous posts it was said that Ant is not an option (makes Python programmers scream and run away) and 'make' is there because nobody provided something else. This naturally brings us to the practical problem: it can be done, but somebody has to DO IT, right? ;-) What would you think about scons? http://www.scons.org/

roman

On Tue, Mar 9, 2010 at 3:50 PM, Andi Vajda va...@apache.org wrote:

> On Mar 9, 2010, at 13:13, Thomas Koch k...@orbiteam.de wrote:
>
>> Dear PyLucene fans,
>>
>> I just managed to build pylucene-2.9.1-1 on Windows with Python 2.6 and Java 1.6, and would like to tell my 'story' - just in case anyone else runs into similar problems...
>>
>> First I should mention that I have been using PyLucene for quite a while now - I just never needed to build it on windows; there used to be binary distributions on the net (here: http://code.google.com/p/pylucene-win32-binary/ - however, it's out of date). Also, I am a bit familiar with Makefiles, Ant and other toolchains...
>>
>> Next, it should be said that not only is PyLucene a great piece of software, but the documentation (and samples / test suite) is very well maintained. The only thing missing, from my point of view, is clear advice on the requirements for building PyLucene on specific platforms. Maybe that is also the cause of the trouble I had building it... I knew I needed a C++ compiler, Ant, Java and Python. Also, as a Makefile is used, some kind of make utility would be needed.
>> So here's the setup I've chosen:
>> - Python 2.6.4 (r264:75708, Oct 26 2009, 08:23:19)
>> - Java 1.6 (jdk1.6.0_06)
>> - compiler: MS Visual Studio 9 (Microsoft Visual C++ 2008 Express Edition)
>> - mingw32-make from MinGW-5.1.6 - see http://www.mingw.org/ (GNU Make 3.81 built for i386-pc-mingw32)
>> - Ant 1.8.0
>> - pylucene-2.9.1-1 / lucene-java-2.9.1
>> - Windows 7
>>
>> Building JCC was no problem. The first issues came up when entering the make toolchain: apparently there are some differences on Windows that either my windows binary of GNU make couldn't handle very well, or that need to be fixed for windows anyway... This especially holds for path and command separators. For example, I had to change
>>
>> ```
>> $(LUCENE_JAR): $(LUCENE)
>> 	cd $(LUCENE) ; $(ANT) -Dversion=$(LUCENE_VER)
>> ```
>> to
>> ```
>> $(LUCENE_JAR): $(LUCENE)
>> 	cd $(LUCENE) && $(ANT) -Dversion=$(LUCENE_VER)
>> ```
>>
>> (took me a while to figure this out ;-), and
>>
>> ```
>> PYLUCENE:=$(shell pwd)
>> ```
>> to
>> ```
>> PYLUCENE:=$(shell cd)
>> ```
>> and
>> ```
>> BUILD_TEST:=$(PYLUCENE)/build/test
>> ```
>> to
>> ```
>> BUILD_TEST:=$(PYLUCENE)\build\test
>> ```
>>
>> (note: cd may work with /, but when it comes to mkdir this fails, e.g.:
>>
>> ```
>> mingw32-make test
>> mkdir -p pylucene-2.9.1-1/build/test
>> Syntaxfehler.
>> mingw32-make: *** [install-test] Error 1
>> ```
>> )
>>
>> Finally, here are my Makefile settings:
>>
>> ```
>> # Windows (Win32, Python 2.6, Java 1.6, ant 1.8)
>> SHELL=cmd.exe
>> PYLUCENE:=$(shell cd)
>> ANT=F:\devel\apache-ant-1.8.0\bin\ant
>> JAVA_HOME=C:\\Program Files\\Java\\jdk1.6.0_06
>> PREFIX_PYTHON=C:\\Python26
>> PYTHON=$(PREFIX_PYTHON)\python.exe
>> JCC=$(PYTHON) -m jcc.__main__
>> NUM_FILES=3
>> ```
>>
>> So either I have chosen the wrong tools, or there should be others with similar problems. If my toolchain is wrong or unsupported, please advise. Is it recommended/required to use Cygwin on Windows?
>
> Yes, cygwin is required so that you have a functional gnu make. Note that you still need to use an MS compiler or mingw, which some people have been able to use. I test-build pylucene every now and then on an old win2k system with cygwin (for make and shell) and msvc 7.1.
> Not a setup with the most recent software, but that's all I've got for windows.
>
> Andi..

>> If anyone is interested, I can offer to:
>> - post my adapted Makefile here (or on the web)
>> - provide a binary version of PyLucene (on the web)
>>
>> Finally, a suggestion: wouldn't it be possible to skip the Makefile completely? I'm not that familiar with Ant, but I know it has been developed to provide platform-independent build processes - and it includes shell tasks for anything that is not java... (I know this could be some work; I just wanted to know if this question has been raised before, or if it is a no-go option?)
>>
>> best regards
>> Thomas Koch
>>
>> --
>> OrbiTeam Software GmbH & Co. KG
>> Endenicher Allee 35
>> 53121 Bonn, Germany
>> i...@orbiteam.de
>> http://www.orbiteam.de
Re: unload JVM?
>>> - consecutive calls to initVM raise errors
>>
>> Only if you use parameters other than classpath, right?

yes

>> Or did you find a different problem?
>>
>>> - in my program, components interact with several JCC-wrapped libraries. Normally it is no problem, but clashes may occur - especially in GUI when running complex workflows - and the solution (in theory) would be to destroy the JVM and load it again. Is it possible?
>>
>> In theory, it might be. The JNI API has a call to destroy a VM: http://java.sun.com/j2se/1.5.0/docs/guide/jni/spec/invocation.html
>> But cleanly doing so is rather tricky, so JCC doesn't support it.

I tried (naively) to add a destroyJavaVM call into the source, recompiled it and tried calling it from a thread. The destroy call returns 0, but as soon as I delete the references and the object is garbage collected (probably), Python crashes. I have no idea what's going on :-) Browsing through the Sun bug reports, it seems it wasn't possible (i.e. DestroyJavaVM never worked, at least for others, who apparently understood what they were doing). But if it is possible and one day it happens, cool. JCC is really a blessing for connecting Python with Java, and I imagine more people will start using it - and with more programmers using it, there will come more python packages in one installation...

roman

> A different approach to supporting your use case might be to consider compiling all your JCC-wrapped libraries into one, picking only the APIs you need so as to control the size of the resulting library.
>
> Andi..
Re: starting several modules in one VM
Thank you, Andi, for checking. I am able to reproduce it again (please see below). My problem is probably two packages with lucene (I started to play with PyLucene only recently, and the older code is there doing other work):

- lucene is PyLucene (lucene 2.9.1)
- pyjama contains GATE and also my own lucene (2.9.1) - so, effectively, I have two lucenes; but in the pyjama package, the jcc wrapper was built only for my own classes (which talk to java-lucene behind the scenes)
- when the ClassNotFoundException happens, java is apparently searching inside pyjama's jars (and the classes are only in lucene's jars)

So that brings me to the question: is it safe to mix python packages that contain the same java classes, or is it not recommended at all?

Best,

Roman

```
In [1]: import lucene, pyjama
In [2]: pyjama.initVM(pyjama.CLASSPATH, vmargs='-Dgate.site.config=C:/dev/workspace/newseman/src/merkur/cfg//gate.xml,-Dgate.plugins.home=C:/dev/workspace/newseman/src/merkur/cfg/ANNIE/plugins,-Dgate.user.config=C:/dev/workspace/newseman/src/merkur/cfg//user.xml,-Dgate.user.session=C:/dev/workspace/newseman/src/merkur/cfg//gate.session,-Xms32m,-Xmx256m')
Out[2]: <jcc.JCCEnv object at 0x00A4E7B0>
In [3]: lucene.initVM(lucene.CLASSPATH)
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/lucene/analysis/ar/ArabicAnalyzer
Caused by: java.lang.ClassNotFoundException: org.apache.lucene.analysis.ar.ArabicAnalyzer
        at java.net.URLClassLoader$1.run(Unknown Source)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(Unknown Source)
        at java.lang.ClassLoader.loadClass(Unknown Source)
        at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
        at java.lang.ClassLoader.loadClass(Unknown Source)
---------------------------------------------------------------------------
JavaError                                 Traceback (most recent call last)
C:\dev\WORKSP~1\newseman\utils\pyjama\build\dist\pyjama\<ipython console> in <module>()
JavaError: java.lang.NoClassDefFoundError: org/apache/lucene/analysis/ar/ArabicAnalyzer
Java stacktrace:
```
```
java.lang.NoClassDefFoundError: org/apache/lucene/analysis/ar/ArabicAnalyzer
Caused by: java.lang.ClassNotFoundException: org.apache.lucene.analysis.ar.ArabicAnalyzer
        at java.net.URLClassLoader$1.run(Unknown Source)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(Unknown Source)
        at java.lang.ClassLoader.loadClass(Unknown Source)
        at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
        at java.lang.ClassLoader.loadClass(Unknown Source)
```

On Tue, Feb 9, 2010 at 10:25 PM, Andi Vajda va...@apache.org wrote:

> On Tue, 9 Feb 2010, Roman Chyla wrote:
>
>> I wanted to ask if there was any progress on this issue (extending the classpath at runtime): http://lists.osafoundation.org/pipermail/pylucene-dev/2008-March/002455.html
>
> Yes, this should work provided you invoke jcc with --shared when building your modules. I just verified this works by using PyLucene and PyPDFBox together, both built with --shared. (Note that with a recent JCC you no longer need to pass the classpath to initVM(); the parameter defaults to the module's CLASSPATH variable.)
>
> ```
> yuzu:vajda python
> Python 2.6.2 (r262:71600, Sep 20 2009, 20:40:09)
> [GCC 4.2.1 (Apple Inc. build 5646)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import pdfbox
> >>> pdfbox.initVM(vmargs='-Djava.awt.headless=true')
> <jcc.JCCEnv object at 0x1004030d8>
> >>> import lucene
> >>> lucene.initVM()
> <jcc.JCCEnv object at 0x1004034e0>
> >>> lucene.Document()
> <Document: Document>
> >>> pdfbox.PDFTextStripper()
> <PDFTextStripper: org.apache.pdfbox.util.pdftextstrip...@83e96cf>
> ```
>
> or in a different order:
> >>> import lucene, pdfbox
> >>> lucene.initVM(vmargs='-Djava.awt.headless=true')
> <jcc.JCCEnv object at 0x1004030d8>
> >>> pdfbox.initVM()
> <jcc.JCCEnv object at 0x100403150>
> >>> lucene.Document()
> <Document: Document<>>
> >>> pdfbox.PDFTextStripper()
> <PDFTextStripper: org.apache.pdfbox.util.PDFTextStripper@6548f8c8>
>
> The vmargs='-Djava.awt.headless=true' parameter to the first initVM() is
> required by pdfbox. The first initVM() call starts and initializes the
> Java VM; the second one just updates its classpath and cannot change or
> set vmargs.
>
> Andi..
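Andi's point that only the first initVM() call starts the JVM (later calls can only extend the classpath, and vmargs are ignored) suggests a defensive pattern: hand the first call the union of every module's CLASSPATH, so the result does not depend on initialization order. A minimal sketch - the helper name and the stand-in modules are mine, not part of JCC; real code would pass the imported lucene and pyjama modules:

```python
import os

def init_jcc_modules(modules, vmargs=None):
    """Initialize several jcc-built extensions in one shared JVM.

    Only the first initVM() call actually starts the JVM and honours
    vmargs; subsequent calls can only extend the classpath. Passing the
    joined classpath of *all* modules to the first call makes the setup
    independent of initialization order.
    """
    classpath = os.pathsep.join(m.CLASSPATH for m in modules)
    first, rest = modules[0], modules[1:]
    if vmargs is not None:
        envs = [first.initVM(classpath, vmargs=vmargs)]
    else:
        envs = [first.initVM(classpath)]
    for m in rest:
        envs.append(m.initVM(m.CLASSPATH))
    return envs

# Stand-in modules so the pattern can be exercised without a JVM.
class _FakeModule:
    def __init__(self, classpath):
        self.CLASSPATH = classpath
        self.calls = []  # records (classpath, vmargs) per initVM() call
    def initVM(self, classpath, vmargs=None):
        self.calls.append((classpath, vmargs))
        return "env"

pyjama_mod = _FakeModule("pyjama.jar")
lucene_mod = _FakeModule("lucene.jar")
init_jcc_modules([pyjama_mod, lucene_mod], vmargs="-Xms32m,-Xmx256m")
```

The first module receives both jars joined with os.pathsep plus the vmargs; the second receives only its own classpath, which is harmless because the JVM already knows the joined path.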
Re: starting several modules in one VM
>> So that brings me to the question: is it safe to mix Python packages
>> that contain the same Java classes? Or is it not recommended at all?
>
> Hmm, not sure. I've never tried that. It seems a little unsane to me.
> Where this may cause trouble is with the different sets of wrappers jcc
> has generated for the same classes. I don't expect them to be usable
> interchangeably, since which methods get wrapped depends on the
> transitive closure of dependencies that was computed during generation.

So I think I should include two jars in my pyjama package, one with my own lucene classes, the other with lucene.jar -- and put lucene.jar into the classpath only if pylucene is not available on the system.

> That being said, I don't see why the classes would not be found in the
> first place. What are the _exact_ jcc invocations you used to build both
> extensions ?

```
python -m jcc --shared --package java.util java.util.ArrayList \
    newseman.gate.PythonicAnnie \
    newseman.lucene.whisperer.LuceneWhisperer \
    newseman.lucene.whisperer.IndexDictionary \
    --python pyjama --build \
    --classpath ../build/jar/lucene-standalone-pyjama-0.1.jar;../build/jar/gate-standalone-pyjama-0.1.jar \
    --include ../build/jar/lucene-standalone-pyjama-0.1.jar \
    --include ../build/jar/gate-standalone-pyjama-0.1.jar \
    --bdist --version 0.1
```

pylucene is 2.9.1 and I didn't change anything besides the windows section:

```
PREFIX_PYTHON=/cygdrive/c/dev/Python251/
ANT=/cygdrive/c/dev/apache-ant-1.7.1/bin/ant
JAVA_HOME=/cygdrive/c/Program Files/Java/jdk1.6.0_12
PYTHON=$(PREFIX_PYTHON)/python.exe
JCC=$(PYTHON) -m jcc --shared
NUM_FILES=3
```

I have updated the JDK in the meantime (jcc was built with 1.6.0_12, now I have JDK 1.6.0_18) - I can try to recompile JCC and both extensions with the new JDK, if that makes any sense (?)

roman
starting several modules in one VM
Hi,

I wanted to ask if there was any progress on this issue (extending the classpath at runtime):
http://lists.osafoundation.org/pipermail/pylucene-dev/2008-March/002455.html

Here is the longer version: I would like to use several Java libraries from Python - one of them PyLucene, another GATE, and others. I compiled GATE into a separate egg, and after some experiments I was able to start two jcc modules - however, it fails if I import my module first and then lucene.

This works fine, although all the -Dgate.* arguments are actually needed by the second call, pyjama.initVM():

```
import lucene
import pyjama
lucene.initVM(lucene.CLASSPATH, vmargs='-Dgate.site.config=C:/dev/workspace/newseman/src/merkur/cfg//gate.xml,-Dgate.plugins.home=C:/dev/workspace/newseman/src/merkur/cfg/ANNIE/plugins,-Dgate.user.config=C:/dev/workspace/newseman/src/merkur/cfg//user.xml,-Dgate.user.session=C:/dev/workspace/newseman/src/merkur/cfg//gate.session,-Xms32m,-Xmx256m')
pyjama.initVM(pyjama.CLASSPATH)
```

This will fail:

```
import lucene
import pyjama
pyjama.initVM(pyjama.CLASSPATH, vmargs='-Dgate.site.config=C:/dev/workspace/newseman/src/merkur/cfg//gate.xml,-Dgate.plugins.home=C:/dev/workspace/newseman/src/merkur/cfg/ANNIE/plugins,-Dgate.user.config=C:/dev/workspace/newseman/src/merkur/cfg//user.xml,-Dgate.user.session=C:/dev/workspace/newseman/src/merkur/cfg//gate.session,-Xms32m,-Xmx256m')
lucene.initVM(lucene.CLASSPATH)
```

```
ERROR:root:Traceback (most recent call last):
  File "C:\dev\workspace\newseman\src\merkur\runwf.py", line 79, in get_workflow
    execfile(filename, x.__dict__)
  File "wtf_test.py", line 19, in <module>
    from merkur import test
  File "C:\dev\workspace\newseman\src\merkur\test.py", line 24, in <module>
    lucene.initVM(lucene.CLASSPATH)
JavaError: java.lang.NoClassDefFoundError: org/apache/lucene/analysis/ar/ArabicAnalyzer
    Java stacktrace:
java.lang.NoClassDefFoundError: org/apache/lucene/analysis/ar/ArabicAnalyzer
Caused by: java.lang.ClassNotFoundException: org.apache.lucene.analysis.ar.ArabicAnalyzer
        at java.net.URLClassLoader$1.run(Unknown Source)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(Unknown Source)
        at java.lang.ClassLoader.loadClass(Unknown Source)
        at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
        at java.lang.ClassLoader.loadClass(Unknown Source)
```

If I do this, everything is OK:

```
pyjama.initVM(os.pathsep.join([lucene.CLASSPATH, pyjama.CLASSPATH]), vmargs=...)
lucene.initVM(lucene.CLASSPATH)
```

So it seems to me that the second initVM() call has no effect. And obviously, I have to make sure it is me who calls initVM() first, with the correct arguments (which might be difficult to ensure).

Am I doing something wrong?

Best,

  Roman
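The workaround above leans on os.pathsep, which is ';' on Windows and ':' on Unix, so the joined string always uses the classpath separator the JVM expects on that platform. A tiny illustration with made-up jar paths (in real code, lucene.CLASSPATH and pyjama.CLASSPATH supply the actual values):

```python
import os

# Made-up stand-ins for lucene.CLASSPATH and pyjama.CLASSPATH.
lucene_classpath = "jars/lucene-core-2.9.1.jar"
pyjama_classpath = "jars/pyjama-0.1.jar"

# os.pathsep is ';' on Windows and ':' on Unix, so the same join
# produces a valid classpath string on either platform.
combined = os.pathsep.join([lucene_classpath, pyjama_classpath])
```

The order of the list only affects lookup precedence; what matters for the workaround is that the combined string reaches the initVM() call that actually starts the JVM.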