Re: Semantic autocomplete with Solr

2012-02-14 Thread Roman Chyla
We have done something along these lines:

https://svnweb.cern.ch/trac/rcarepo/wiki/InspireAutoSuggest#Autosuggestautocompletefunctionality

but you would need MontySolr for that - https://github.com/romanchyla/montysolr
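
If you want to stay with plain Solr, one common starting point is to facet on
the attribute fields for the partial query and use the top facet values as
suggestions. A minimal SolrJ sketch - the core URL and the color/fabric field
names are assumptions, not something taken from InspireAutoSuggest:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.FacetField;
import org.apache.solr.client.solrj.response.QueryResponse;

public class RelatedTermSuggester {
    public static void main(String[] args) throws Exception {
        // hypothetical core URL and attribute field names
        SolrServer server = new HttpSolrServer("http://localhost:8983/solr/products");
        SolrQuery q = new SolrQuery("jacket");   // what the user has typed so far
        q.setRows(0);                            // we only want the facet counts
        q.setFacet(true);
        q.addFacetField("color", "fabric");      // attribute fields to suggest from
        q.setFacetMinCount(1);
        q.setFacetLimit(10);
        QueryResponse rsp = server.query(q);
        for (FacetField f : rsp.getFacetFields()) {
            for (FacetField.Count c : f.getValues()) {
                System.out.println(f.getName() + ": " + c.getName() + " (" + c.getCount() + ")");
            }
        }
    }
}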

roman

On Tue, Feb 14, 2012 at 11:10 PM, Octavian Covalschi
octavian.covals...@gmail.com wrote:
 Hey guys,

 Has anyone done any kind of smart autocomplete? Let's say we have a web
 store, and we'd like to autocomplete users' searches. So if I type in
 "jacket", the next word suggested should be something related to jackets
 (color, fabric), etc.

 It seems to me I have to structure this data in a particular way, but in that
 case I could do it without Solr, so I was wondering if Solr could help us here.

 Thank you in advance.


Re: Regexp and speed

2012-11-30 Thread Roman Chyla
I also found some results from a 1M test:


258033ms.  Building index of 100 docs
29703ms.  Verifying data integrity with 100 docs
1821ms.  Preparing 1 random queries
2867284ms.  Regex queries
18772ms.  Regexp queries (new style)
29257ms.  Wildcard queries
4920ms.  Boolean queries
Totals: [1749708, 1744494, 1749708, 1744494]
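
For reference, a rough sketch of how the query types above are constructed
programmatically - the field name comes from the quoted examples below,
everything else is illustrative and is not the actual benchmark code:

import org.apache.lucene.index.Term;
import org.apache.lucene.sandbox.queries.regex.RegexQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.RegexpQuery;
import org.apache.lucene.search.WildcardQuery;

public class RegexpQuerySketch {
    public static void main(String[] args) {
        // old-style RegexQuery (sandbox module) - the slow "Regex queries" line above
        Query oldStyle = new RegexQuery(new Term("vectrfield", "bgiyodjrr.*"));

        // new-style automaton-based RegexpQuery - the fast "Regexp queries (new style)" line
        Query newStyle = new RegexpQuery(new Term("vectrfield", "bgiyodjrr.*"));

        // wildcard query for comparison
        Query wildcard = new WildcardQuery(new Term("vectrfield", "bgiyodjrr*"));

        System.out.println(oldStyle + " | " + newStyle + " | " + wildcard);
    }
}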


On Fri, Nov 30, 2012 at 12:13 PM, Roman Chyla roman.ch...@gmail.com wrote:

 Hi,

 Some time ago we did some measurements of the performance of the
 regexp queries and found that they are VERY FAST! We can't be grateful
 enough; it saves many days/lives ;)

 This was on an old Lenovo X61 laptop, Core 2 Duo, 1.7 GHz, no special memory
 allocation, SSD disk:


 51459ms.  Building index of 10 docs
 181175ms.  Verifying data integrity with 100 docs
 315ms.  Preparing 1000 random queries

 61167ms.  Regex queries - Stopping execution, # queries finished: 150
 2795ms.  Regexp queries (new style)
 3936ms.  Wildcard queries
 777ms.  Boolean queries
 893ms.  Boolean queries (truncated)
 3596ms.  Span queries
 91751ms.  Span queries (truncated) Stopping execution, # queries finished: 100
 3937ms.  Payload queries
 93726ms.  Payload queries (truncated) Stopping execution, # queries finished: 100
 Totals: [4865, 18284, 18286, 18284, 18405, 287934, 44375, 18284, 2489]

 Examples of queries:
 
 regex:bgiyodjrr, k\w* michael\w* jay\w* .*
 regexp:/bgiyodjrr, k\w* michael\w* jay\w* .*/
 wildcard:bgiyodjrr, k*1 michael*2 jay*3 *
 +n0:bgiyodjrr +n1:k +n2:michael +n3:jay
 +n0:bgiyodjrr +n1:k* +n2:m* +n3:j*
 spanNear([vectrfield:bgiyodjrr, vectrfield:k, vectrfield:michael, 
 vectrfield:jay], 0, true)
 spanNear([vectrfield:bgiyodjrr, SpanMultiTermQueryWrapper(vectrfield:k*), 
 SpanMultiTermQueryWrapper(vectrfield:m*), 
 SpanMultiTermQueryWrapper(vectrfield:j*)], 0, true)
 spanPayCheck(spanNear([vectrfield:bgiyodjrr, vectrfield:k, 
 vectrfield:michael, vectrfield:jay], 1, true), payloadRef: 
 b[0]=48;b[0]=49;b[0]=50;b[0]=51;)
 spanPayCheck(spanNear([vectrfield:bgiyodjrr, 
 SpanMultiTermQueryWrapper(vectrfield:k*), 
 SpanMultiTermQueryWrapper(vectrfield:m*), 
 SpanMultiTermQueryWrapper(vectrfield:j*)], 1, true), payloadRef: 
 b[0]=48;b[0]=49;b[0]=50;b[0]=51;)


 The code here:

 https://github.com/romanchyla/montysolr/blob/solr-trunk/contrib/adsabs/src/test/org/adsabs/lucene/BenchmarkAuthorSearch.java

 The benchmark should probably not be called a 'benchmark' - do you think it
 may be too simplistic? Can we expect some bad surprises somewhere?

 Thanks,

   roman



Re: Multi word synonyms

2012-11-30 Thread Roman Chyla
Try separating multi-word synonyms with a null byte:

simple\0syrup,sugar\0syrup,stock\0syrup

see https://issues.apache.org/jira/browse/LUCENE-4499 for details
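
If it helps, here is a tiny sketch of producing such a synonyms line
programmatically; the separator is the same null byte, and the analysis chain
must of course glue multi-word tokens the same way (see the issue above):

public class NullByteSynonyms {
    public static void main(String[] args) {
        String[] group = {"simple syrup", "sugar syrup", "stock syrup"};
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < group.length; i++) {
            if (i > 0) line.append(',');
            // glue the words of each multi-word synonym with a null byte
            line.append(group[i].replace(' ', '\u0000'));
        }
        // prints: simple<NUL>syrup,sugar<NUL>syrup,stock<NUL>syrup
        System.out.println(line);
    }
}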

roman

On Sun, Feb 5, 2012 at 10:31 PM, Zac Smith z...@trinkit.com wrote:

 Thanks for your response. When I don't include the KeywordTokenizerFactory
 in the SynonymFilter definition, I get additional term values that I don't
 want.

 e.g. synonyms.txt looks like:
 simple syrup,sugar syrup,stock syrup

 A document with a value containing 'simple syrup' can now be found when
 searching for just 'stock'.

 So the problem I am trying to address with KeywordTokenizerFactory, is to
 prevent my multi word synonyms from getting broken down into single words.

 Thanks
 Zac

 -Original Message-
 From: Erick Erickson [mailto:erickerick...@gmail.com]
 Sent: Sunday, February 05, 2012 8:07 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Multi word synonyms

 I'm not quite sure what you're trying to do with KeywordTokenizerFactory
 in your SynonymFilter definition, but if I use the defaults, then the
 all-phrase form works just fine.

 So the question is what problem are you trying to address by using
 KeywordTokenizerFactory?

 Best
 Erick

 On Sun, Feb 5, 2012 at 8:21 AM, O. Klein kl...@octoweb.nl wrote:
  Your query analyzer will tokenize 'simple sirup' into 'simple' and 'sirup'
  and won't match on 'simple syrup' in the synonyms.txt.
 
  So you have to change the query analyzer into KeywordTokenizerFactory
  as well.
 
  It might be an idea to make a field for synonyms only with this tokenizer
  and another field to search on, and use dismax. Never tried this though.
 
  --
  View this message in context:
  http://lucene.472066.n3.nabble.com/Multi-word-synonyms-tp3716292p3717215.html
  Sent from the Solr - User mailing list archive at Nabble.com.





Re: Can a field with defined synonym be searched without the synonym?

2012-12-12 Thread Roman Chyla
@wunder
It is a misconception (well, supported by that wiki description) that the
query-time synonym filter has these problems. It is actually the default
parser that is causing these problems. Look at this if you still think
that index-time synonyms are a cure-all:
https://issues.apache.org/jira/browse/LUCENE-4499

@joe
If you can use the flexible query parser (as linked by @Swati) then all
you need to do is define a different field with a different tokenizer
chain and then swap the field names before the analyzer processes the
query (and then rewrite the field name back) - for example, we have
fields called author and author_nosyn.

roman

On Wed, Dec 12, 2012 at 12:38 PM, Walter Underwood wun...@wunderwood.orgwrote:

 Query time synonyms have known problems. They are slower, cause incorrect
 IDF, and don't work for phrase synonyms.

 Apply synonyms at index time and you will have none of those problems.

 See:
 http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory

 wunder

 On Dec 12, 2012, at 9:34 AM, Swati Swoboda wrote:

  Query-time analyzers are still applied, even if you include a string in
 quotes. Would you expect 'foo' to not match 'Foo' just because it's
 enclosed in quotes?
 
  Also look at this, someone who had similar requirements:
 
 http://lucene.472066.n3.nabble.com/Synonym-Filter-disable-at-query-time-td2919876.html
 
 
  -Original Message-
  From: joe.cohe...@gmail.com [mailto:joe.cohe...@gmail.com]
  Sent: Wednesday, December 12, 2012 12:09 PM
  To: solr-user@lucene.apache.org
  Subject: Re: Can a field with defined synonym be searched without the
 synonym?
 
 
  I'm applying only query-time synonyms, so I have the original values
 stored and indexed.
  I would've expected that if I search a string with quotation marks, I'll get
 the exact match, without applying a synonym.
 
  any way to achieve that?
 
 
  Upayavira wrote
  You can only search against terms that are stored in your index. If
  you have applied index time synonyms, you can't remove them at query
 time.
 
  You can, however, use copyField to clone an incoming field to another
  field that doesn't use synonyms, and search against that field instead.
 
  Upayavira
 
  On Wed, Dec 12, 2012, at 04:26 PM,
 
  joe.cohen.m@
 
  wrote:
  Hi
  I have a field type with a defined synonym.txt which retrieves both
  records with 'home' and 'house' when I search for either one of them.
 
  I want to be able to search this field on the specific value that I
  enter, without the synonym filter.
 
  is it possible?
 
  thanks.
 
 
 
  --
  View this message in context:
  http://lucene.472066.n3.nabble.com/Can-a-field-with-defined-synonym-be-searched-without-the-synonym-tp4026381.html
  Sent from the Solr - User mailing list archive at Nabble.com.
 
 
 
 
 
  --
  View this message in context:
 http://lucene.472066.n3.nabble.com/Can-a-field-with-defined-synonym-be-searched-without-the-synonym-tp4026381p4026405.html
  Sent from the Solr - User mailing list archive at Nabble.com.

 --
 Walter Underwood
 wun...@wunderwood.org






Re: Can a field with defined synonym be searched without the synonym?

2012-12-12 Thread Roman Chyla
Well, this IDF problem has more sides. So, let's say your synonym file
contains multi-token synonyms (it does, right? or perhaps you don't need
it? well, some people do)

TV, TV set, TV foo, television

If you use the default synonym expansion, then when you index 'television'
you have also increased the frequency of 'set' and 'foo'. So the IDF of 'TV' is
the same as that of 'television', but the IDF of 'foo' and 'set' has changed
(their frequency increased, their IDF decreased) -- TVs have in fact made the
'foo' term very frequent and undesirable.

So, you might be sure that the IDF of 'TV' and 'television' are the same, but
not be aware that it has 'screwed' other (desirable) terms - so it really
depends. And I wouldn't argue these cases are esoteric.

And finally: there are use cases out there where people NEED to switch off
synonym expansion at will (find only those documents that contain the word
'TV' and not that bloody 'foo'). This cannot be done if the index contains
all synonym terms (unless you have a way to mark the original and the
synonym in the index).

roman


On Wed, Dec 12, 2012 at 12:50 PM, Walter Underwood wun...@wunderwood.orgwrote:

 Query parsers cannot fix the IDF problem or make query-time synonyms
 faster. Query synonym expansion makes more search terms. More search terms
 are more work at query time.

  The IDF problem is real; I've run up against it. The rarest variant of
 the synonym has the highest score. This is probably the opposite of what you
 want. For me, it was TV and television. Documents with 'TV' had higher
 scores than those with 'television'.

 wunder

 On Dec 12, 2012, at 9:45 AM, Roman Chyla wrote:

  @wunder
  It is a misconception (well, supported by that wiki description) that the
  query time synonym filter have these problems. It is actually the default
  parser, that is causing these problems. Look at this if you still think
  that index time synonyms are cure for all:
  https://issues.apache.org/jira/browse/LUCENE-4499
 
  @joe
  If you can use the flexible query parser (as linked in by @Swati) then
 all
  you need to do is to define a different field with a different tokenizer
  chain and then swap the field names before the analyzers processes the
  document (and then rewrite the field name back - for example, we have
  fields called author and author_nosyn)
 
  roman
 
  On Wed, Dec 12, 2012 at 12:38 PM, Walter Underwood 
 wun...@wunderwood.orgwrote:
 
  Query time synonyms have known problems. They are slower, cause
 incorrect
  IDF, and don't work for phrase synonyms.
 
  Apply synonyms at index time and you will have none of those problems.
 
  See:
 
 http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory
 
  wunder
 
  On Dec 12, 2012, at 9:34 AM, Swati Swoboda wrote:
 
  Query-time analyzers are still applied, even if you include a string in
  quotes. Would you expect foo to not match Foo just because it's
  enclosed in quotes?
 
  Also look at this, someone who had similar requirements:
 
 
 http://lucene.472066.n3.nabble.com/Synonym-Filter-disable-at-query-time-td2919876.html
 
 
  -Original Message-
  From: joe.cohe...@gmail.com [mailto:joe.cohe...@gmail.com]
  Sent: Wednesday, December 12, 2012 12:09 PM
  To: solr-user@lucene.apache.org
  Subject: Re: Can a field with defined synonym be searched without the
  synonym?
 
 
  I'm aplying only query-time synonym, so I have the original values
  stored and indexed.
  I would've expected that if I search a strin with quotations, i'll get
  the exact match, without applying a synonym.
 
  any way to achieve that?
 
 
  Upayavira wrote
  You can only search against terms that are stored in your index. If
  you have applied index time synonyms, you can't remove them at query
  time.
 
  You can, however, use copyField to clone an incoming field to another
  field that doesn't use synonyms, and search against that field
 instead.
 
  Upayavira
 
  On Wed, Dec 12, 2012, at 04:26 PM,
 
  joe.cohen.m@
 
  wrote:
  Hi
  I hava a field type without defined synonym.txt which retrieves both
  records with home and house when I search either one of them.
 
  I want to be able to search this field on the specific value that I
  enter, without the synonym filter.
 
  is it possible?
 
  thanks.
 
 
 
  --
  View this message in context:
 
 http://lucene.472066.n3.nabble.com/Can-a-field-with-defined-synonym-b
  e-searched-without-the-synonym-tp4026381.html
  Sent from the Solr - User mailing list archive at Nabble.com.
 
 
 
 
 
  --
  View this message in context:
 
 http://lucene.472066.n3.nabble.com/Can-a-field-with-defined-synonym-be-searched-without-the-synonym-tp4026381p4026405.html
  Sent from the Solr - User mailing list archive at Nabble.com.
 
  --
  Walter Underwood
  wun...@wunderwood.org
 
 
 
 

 --
 Walter Underwood
 wun...@wunderwood.org






Re: MoreLikeThis supporting multiple document IDs as input?

2012-12-26 Thread Roman Chyla
Jay Luker has written a "MoreLikeThese" handler, which is probably what you
want. You may give it a try, though I am not sure if it works with Solr 4.0
at this point (we haven't ported it yet):

https://github.com/romanchyla/montysolr/blob/MLT/contrib/adsabs/src/java/org/apache/solr/handler/MoreLikeTheseHandler.java
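
Jack's suggestion in the quoted message below (merge the field values of the
base documents and POST the text back to the MLT handler) can be sketched with
SolrJ roughly like this; the handler path /mlt, the field names, and the
availability of stream.body (remote streaming enabled) are assumptions:

import org.apache.solr.client.solrj.SolrRequest;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.QueryRequest;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.params.ModifiableSolrParams;

public class MergedMltSketch {
    public static void main(String[] args) throws Exception {
        SolrServer server = new HttpSolrServer("http://localhost:8080/solr");

        // text merged from the base documents, fetched beforehand with a normal query
        String mergedText = "...concatenated field values of the 5 articles...";

        ModifiableSolrParams p = new ModifiableSolrParams();
        p.set("stream.body", mergedText);   // requires remote streaming to be enabled
        p.set("mlt.fl", "title,body");      // assumed similarity fields
        p.set("fl", "id,score");
        p.set("rows", 10);

        QueryRequest req = new QueryRequest(p, SolrRequest.METHOD.POST);
        req.setPath("/mlt");                // assumes a MoreLikeThisHandler registered at /mlt
        QueryResponse rsp = req.process(server);
        System.out.println(rsp.getResults());
    }
}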

roman

On Wed, Dec 26, 2012 at 12:06 AM, Jack Krupansky j...@basetechnology.comwrote:

 MLT has both a request handler and a search component.

 The MLT handler returns similar documents only for the first document that
 the query matches.

 The MLT search component returns similar documents for each of the
 documents in the search results, but processes each search result base
 document one at a time and keeps its similar documents segregated by each
 of the base documents.

 It sounds like you wanted to merge the base search results and then find
 documents similar to that merged super-document. Is that what you were
 really seeking, as opposed to what the MLT component does? Unfortunately,
 you can't do that with the components as they are.

 You would have to manually merge the values from the base documents and
 then you could POST that text back to the MLT handler and find similar
 documents using the posted text rather than a query. Kind of messy, but in
 theory that should work.

 -- Jack Krupansky

 -Original Message- From: David Parks
 Sent: Tuesday, December 25, 2012 5:04 AM
 To: solr-user@lucene.apache.org
 Subject: MoreLikeThis supporting multiple document IDs as input?


 I'm unclear on this point from the documentation. Is it possible to give
 Solr X # of document IDs and tell it that I want documents similar to those
 X documents?

 Example:

  - The user is browsing 5 different articles
  - I send Solr the IDs of these 5 articles so I can present the user other
 similar articles

 I see this example for sending it 1 document ID:
  http://localhost:8080/solr/select/?qt=mlt&q=id:[document id]&mlt.fl=[field1],[field2],[field3]&fl=id&rows=10

 But can I send it 2+ document IDs as the query?



Re: Getting Lucense Query from Solr query (Or converting Solr Query to Lucense's query)

2013-01-07 Thread Roman Chyla
if you are inside Solr, as seems to be the case, you can do this:

QParserPlugin qplug =
    req.getCore().getQueryPlugin(LuceneQParserPlugin.NAME);
QParser parser = qplug.createParser("PATIENT_GENDER:Male OR
    STUDY_DIVISION:\"Cancer Center\"", null, req.getParams(), req);
Query q = parser.parse();

Maybe there is a one-line call to get the parser from the Solr core, but I
can't find it now. Have a look at one of the subclasses of QParser.
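
(For what it's worth, a shorter route - assuming you are inside a request
handler and want the default lucene parser - may be the static helper on
QParser; a sketch, exception handling omitted:)

import org.apache.lucene.search.Query;
import org.apache.solr.search.QParser;

// same example query as above, parsed via the static helper
QParser parser = QParser.getParser(
    "PATIENT_GENDER:Male OR STUDY_DIVISION:\"Cancer Center\"", "lucene", req);
Query q = parser.parse();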

--roman

On Mon, Jan 7, 2013 at 4:27 AM, Sabeer Hussain shuss...@del.aithent.comwrote:

 Is there a way to get a Lucene query from a Solr query? I have a requirement
 to search for terms in multiple heterogeneous indices. Presently, I am
 using the following approach:

 try {
     Directory directory1 = FSDirectory.open(new File("E:\\database\\patient\\index"));
     Directory directory2 = FSDirectory.open(new File("E:\\database\\study\\index"));

     BooleanQuery myQuery = new BooleanQuery();
     myQuery.add(new TermQuery(new Term("PATIENT_GENDER", "Male")),
             BooleanClause.Occur.SHOULD);
     myQuery.add(new TermQuery(new Term("STUDY_DIVISION", "Cancer Center")),
             BooleanClause.Occur.SHOULD);

     int indexCount = 2;
     IndexReader[] indexReader = new IndexReader[indexCount];
     indexReader[0] = DirectoryReader.open(directory1);
     indexReader[1] = DirectoryReader.open(directory2);

     IndexSearcher searcher = new IndexSearcher(new MultiReader(indexReader));
     TopDocs col = searcher.search(myQuery, 10);

     // results
     ScoreDoc[] docs = col.scoreDocs;

 } catch (IOException e) {
     // TODO Auto-generated catch block
     e.printStackTrace();
 }

 Here, I need to create TermQuery objects based on field names and their values.
 If I could get this boolean query directly from the Solr query
 q=PATIENT_GENDER:Male OR STUDY_DIVISION:"Cancer Center", that would save my
 coding effort. This one is a simple example, but when we need to create a more
 complex query it will be a time-consuming and error-prone activity. So, is
 there a way to get the Lucene query from the Solr query?




 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Getting-Lucense-Query-from-Solr-query-Or-converting-Solr-Query-to-Lucense-s-query-tp4031187.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: unittest fail (sometimes) for float field search

2013-01-08 Thread Roman Chyla
apparently, it fails also with @SuppressCodecs("Lucene3x")

roman


On Tue, Jan 8, 2013 at 6:15 PM, Roman Chyla roman.ch...@gmail.com wrote:

 Hi,

 I have a float field 'read_count' - and a unit test like:

 assertQ(req("q", "read_count:1.0"),
     "//doc/int[@name='recid'][.='9218920']",
     "//*[@numFound='1']");

 sometimes, the unittest will fail, sometimes it succeeds.

 @SuppressCodecs("Lucene3x")

 Seems to solve the issue, however I don't understand what's wrong. Is this
 behaviour expected?

 thanks,

   roman


 INFO: Opening Searcher@752a2259 main
 9.1.2013 06:51:32 org.apache.solr.search.SolrIndexSearcher getIndexDir
 WARNING: WARNING: Directory impl does not support setting indexDir:
 org.apache.lucene.store.MockDirectoryWrapper
 9.1.2013 06:51:32 org.apache.solr.update.DirectUpdateHandler2 commit
 INFO: end_commit_flush
 9.1.2013 06:51:32 org.apache.solr.core.QuerySenderListener newSearcher
 INFO: QuerySenderListener sending requests to 
 Searcher@752a2259main{StandardDirectoryReader(segments_2:3 _0(4.0.0.2):C30)}
 9.1.2013 06:51:32 org.apache.solr.core.QuerySenderListener newSearcher
 INFO: QuerySenderListener done.
 9.1.2013 06:51:32 org.apache.solr.core.SolrCore registerSearcher
 INFO: [collection1] Registered new searcher 
 Searcher@752a2259main{StandardDirectoryReader(segments_2:3 _0(4.0.0.2):C30)}



Re: unittest fail (sometimes) for float field search

2013-01-08 Thread Roman Chyla
The test checks that we are properly getting/indexing data - we index a database
and fetch parts of the documents separately from MongoDB. You can look at
the file here:
https://github.com/romanchyla/montysolr/blob/3c18312b325874bdecefceb9df63096b2cf20ca2/contrib/adsabs/src/test/org/apache/solr/update/TestAdsDataImport.java

But your comment made me run the tests on the command line and I am seeing that
I can't make it fail (it fails only inside Eclipse). Sorry, I should have
tried that myself, but I am so used to running unit tests inside Eclipse that
it didn't occur to me... I'll try to find out what is going on...

thanks,

  roman



On Tue, Jan 8, 2013 at 6:53 PM, Chris Hostetter hossman_luc...@fucit.orgwrote:


 : apparently, it fails also with @SuppressCodecs(Lucene3x)

 what exactly is the test failure message?

 When you run tests that use the lucene test framework, any failure should
 include information about the random seed used to run the test -- that
 random seed affects things like the codec used, the directoryfactory used,
 etc...

  Can you confirm whether the test reliably passes/fails consistently when
 you reuse the same seed?

 Can you elaborate more on what exactly your test does? ... we probably
 need to see the entire test to make sense of why you might get
 inconsistent failures.



 -Hoss



Re: unittest fail (sometimes) for float field search

2013-01-09 Thread Roman Chyla
Hi,

It is neither Eclipse-related nor codec-related. There were two issues:

I had a wrong configuration of NumericConfig:

new NumericConfig(4, NumberFormat.getNumberInstance(), NumericType.FLOAT))

I changed that to:
new NumericConfig(4, NumberFormat.getNumberInstance(Locale.US),
NumericType.FLOAT))

And the second problem was that I used the default float with
precisionStep=0; however, NumericRangeQuery requires a precision step >= 1.
I tried all steps 1-8, and it worked only if the precisionStep of the field
and of the NumericConfig are the same (for range queries).
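
For illustration, a minimal sketch of the matching setup - the field name and
the precisionStep value are just the ones discussed in this thread, and this is
not the actual test code:

import java.text.NumberFormat;
import java.util.Locale;
import org.apache.lucene.document.FieldType.NumericType;
import org.apache.lucene.queryparser.flexible.standard.config.NumericConfig;
import org.apache.lucene.search.NumericRangeQuery;

public class PrecisionStepSketch {
    public static void main(String[] args) {
        // the precisionStep used here must match the one declared on the field
        int precisionStep = 4;

        // this config would be registered with the flexible query parser
        NumericConfig config = new NumericConfig(precisionStep,
                NumberFormat.getNumberInstance(Locale.US), NumericType.FLOAT);

        // the equivalent hand-built range query for read_count:1.0
        NumericRangeQuery<Float> q = NumericRangeQuery.newFloatRange(
                "read_count", precisionStep, 1.0f, 1.0f, true, true);
        System.out.println(q);
    }
}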


  roman





On Tue, Jan 8, 2013 at 7:34 PM, Roman Chyla roman.ch...@gmail.com wrote:

 The test checks we are properly getting/indexing data  - we index database
 and fetch parts of the documents separately from mongodb. You can look at
 the file here:
 https://github.com/romanchyla/montysolr/blob/3c18312b325874bdecefceb9df63096b2cf20ca2/contrib/adsabs/src/test/org/apache/solr/update/TestAdsDataImport.java

 But your comment made me to run the tests on command line and I am seeing
 I can't make it fail (it fails only inside Eclipse). Sorry, I should have
 tried that myself, but I am so used to running unittests inside Eclipse it
 didn't occur to me...i'll try to find out what is going on...

 thanks,

   roman




 On Tue, Jan 8, 2013 at 6:53 PM, Chris Hostetter 
 hossman_luc...@fucit.orgwrote:


 : apparently, it fails also with @SuppressCodecs(Lucene3x)

 what exactly is the test failure message?

 When you run tests that use the lucene test framework, any failure should
 include information about the random seed used to run the test -- that
 random seed affects things like the codec used, the directoryfactory used,
 etc...

 Can you confirm wether the test reliably passes/fails consistently when
 you reuse the same seed?

 Can you elaborate more on what exactly your test does? ... we probably
 need to see the entire test to make sense of why you might get
 inconsistent failures.



 -Hoss





Re: Large data importing getting rollback with solr

2013-01-22 Thread Roman Chyla
hi,
it is probably correct to revisit your design/requirements, but if you
still find you need it, then there may be a different way.

DIH is using a writer to commit documents; you can detect errors inside
it and try to recover - i.e. in some situations you want to commit
instead of calling rollback.

These writers can be specified in the solrconfig.xml, for example:

  <requestHandler name="/invenio/import"
                  class="solr.WaitingDataImportHandler">
    <lst name="defaults">
      <str name="config">data-config.xml</str>
      <bool name="clean">false</bool>
      <bool name="commit">false</bool>
      <str name="update.chain">blanketyblank</str>
      <!-- this parameter activates the logging/restart of failed imports -->
      <str name="writerImpl">org.apache.solr.handler.dataimport.FailSafeInvenioNoRollbackWriter</str>
    </lst>
  </requestHandler>

When an error happens, DIH will call rollback - that is when you can inspect
what was going on (but alas, it is not always easy) and do something.

You can see an example here:
https://github.com/romanchyla/montysolr/blob/master/contrib/invenio/src/java/org/apache/solr/handler/dataimport/FailSafeInvenioNoRollbackWriter.java

This writer will find which documents were already indexed and call a
handler to register the missing ones into the queue.
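
To give a flavour of what such a writer can look like, here is a stripped-down
sketch assuming the Solr 4.x DIH SolrWriter API (the class name is
illustrative; the real writer linked above does considerably more bookkeeping):

import org.apache.solr.handler.dataimport.SolrWriter;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.update.processor.UpdateRequestProcessor;

public class NoRollbackWriter extends SolrWriter {

    public NoRollbackWriter(UpdateRequestProcessor processor, SolrQueryRequest req) {
        super(processor, req);
    }

    @Override
    public void rollback() {
        // DIH calls this when the import fails; instead of discarding the work,
        // commit the documents indexed so far (the real implementation also
        // records which documents are missing so the import can be restarted)
        commit(false);
    }
}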

But DIH needs to use its own interfaces properly if you want to write these
writers - please vote on this issue!

https://issues.apache.org/jira/browse/SOLR-3671

Best,

  Roman

On Tue, Jan 22, 2013 at 8:57 AM, Gora Mohanty g...@mimirtech.com wrote:

 On 21 January 2013 17:06, ashimbose ashimb...@gmail.com wrote:
 [...]
  Here I used two data config
  1. data_conf1.xml
  2. data_conf2.xml
 [...]

 Your configuration looks fine.

  Either one of them runs fine on its own. Meaning,
  if I run the first dataimport it will index successfully, but if after that
  I run dataimport1, it gives the error below:
 
  Caused by: java.sql.SQLException: JBC0088E: JBC0002E: Socket timeout
  detected: Read timed out

 Have you looked at the load on your database server?
 I am guessing that is where the bottleneck lies. This
  configuration is useful only if you can scale your
 database server, or have multiple servers, each with
 a different set of tables.

  As Upayavira suggested, you could look into SolrJ,
 or a similar library to control your indexing. I would
 once again suggest starting with smaller goals, and
 fixing issues one by one, rather than jumping in and
 trying to get everything working at once.

 Regards,
 Gora



Re: Getting Lucense Query from Solr query (Or converting Solr Query to Lucense's query)

2013-02-04 Thread Roman Chyla
You could use LocalSolrQueryRequest to create the request, but it is not
necessary, if all what you need is to get the lucene query parser, just do:

import org.apache.lucene.queryparser.classic.QueryParser

qp = new QueryParser(Version.LUCENE_40, defaultField, new SimpleAnalyzer());
Query q = qp.parse(queryString)

hth

roman

On Mon, Feb 4, 2013 at 3:57 AM, Sabeer Hussain shuss...@del.aithent.comwrote:

 Hi,
 Thanks for the reply. In my application, I am using some servlets to receive
 the request from the user, since I need to authenticate the user and add
 conditions like userid=... before sending the request to the Solr server,
 using one of two approaches:

 1) Using SolrServer
 SolrServer server = new CommonsHttpSolrServer(...);
 ModifiableSolrParams params = new ModifiableSolrParams();
 params.set(...);
 QueryResponse response = server.query(params);

 2) Using URLConnection
 ModifiableSolrParams params = new ModifiableSolrParams();
 params.set(...);
 String paramString = params.toString();

 URL url = new URL("http://localhost:8080/solr/select?" + paramString);
 URLConnection connection = null;
 try {
     connection = url.openConnection();
 } catch (Exception e) {
     e.printStackTrace();
 }

 ... reading the response

 All I am doing is using the SolrJ APIs. So, please tell me how I can get a
 SolrQueryRequest object, or anything like that, to get an instance of
 QParserPlugin. Is it possible to create a SolrQueryRequest from an
 HttpServletRequest? I would like to use SolrJ to create a Lucene Query from
 a Solr query (but I do not know whether it is possible or not).


 Regards
 Sabeer



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Getting-Lucense-Query-from-Solr-query-Or-converting-Solr-Query-to-Lucense-s-query-tp4031187p4038300.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Anyone else see this error when running unit tests?

2013-02-04 Thread Roman Chyla
Me too, it fails randomly with test classes. We use Solr 4.0 for testing, no
Maven, only Ant.
--roman
On 4 Feb 2013 20:48, Mike Schultz mike.schu...@gmail.com wrote:

 Yes.  Just today actually.  I had some unit tests based on
 AbstractSolrTestCase which worked in 4.0 but in 4.1 they would fail
 intermittently with that error message.  The key to this behavior is found
 by looking at the code in the lucene class:
 TestRuleSetupAndRestoreClassEnv.
 I don't understand it completely but there are a number of random code
 paths
 through there.  The following helped me get around the problem, at least in
 the short term.


 @org.apache.lucene.util.LuceneTestCase.SuppressCodecs({"Lucene3x","Lucene40"})
 public class CoreLevelTest extends AbstractSolrTestCase {

 I also need to call this inside my setUp() method; in 4.0 this wasn't
 required:
 initCore("solrconfig.xml", "schema.xml", "/tmp/my-solr-home");



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Anyone-else-see-this-error-when-running-unit-tests-tp4015034p4038472.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: what do you use for testing relevance?

2013-02-13 Thread Roman Chyla
All,

Thank you for your comments and links, I will explore them.

I think that many people face similar questions when they tune
their search engines, especially in the Solr/Lucene community. While the
requirements will differ, ultimately it is what they can do with
Lucene/Solr that guides such efforts. As an example, let me use this:

https://github.com/romanchyla/r-ranking-fun/blob/master/plots/raw/test-plot-showing-factors.pdf?raw=true

The graph shows you the effect of different values of the qf parameter. This
use case is probably very common, so somebody has probably already done
something similar.

In the real world, I would like to: 1) change something, 2) collect
(click) data, 3) apply a statistical test (of my choice) to see if the changes
had an effect (be it worse or better) and whether that change is
statistically significant. But do we have to write these tools from scratch
every time?
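
As a trivial example of the kind of regression test Steffen describes in the
quoted reply below, a JUnit/SolrJ sketch - the URL, the query and the expected
id are made up:

import static org.junit.Assert.assertTrue;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrDocument;
import org.junit.Test;

public class RelevanceRegressionTest {

    private final SolrServer solr = new HttpSolrServer("http://localhost:8983/solr");

    @Test
    public void topTenContainsExpectedDoc() throws Exception {
        // "if the query is X, then the document with id Y should be in the first 10 results"
        SolrQuery q = new SolrQuery("title:dark energy");
        q.setRows(10);
        q.addField("id");
        boolean found = false;
        for (SolrDocument doc : solr.query(q).getResults()) {
            if ("12345".equals(doc.getFieldValue("id"))) {
                found = true;
            }
        }
        assertTrue("expected document missing from the top 10", found);
    }
}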

All your comments are very valuable and useful. But I am still wondering if
there are more tools one could use to tune the search. More comments
welcome!

Thank you!

  roman

On Wed, Feb 13, 2013 at 1:04 PM, Amit Nithian anith...@gmail.com wrote:

 Ultimately this is dependent on what your metrics for success are. For some
 places it may be just raw CTR (did my click through rate increase) but for
 other places it may be a function of money (either it may be gross revenue,
 profits, # items sold etc). I don't know if there is a generic answer for
 this question which is leading those to write their own frameworks b/c it's
 very specific to your needs. A scoring change that leads to an increase in
 CTR may not necessarily lead to an increase in the metric that makes your
 business go.


 On Tue, Feb 12, 2013 at 10:31 PM, Steffen Elberg Godskesen 
 steffen.godske...@gmail.com wrote:

 
  Hi Roman,
 
  If you're looking for regression testing then
  https://github.com/sul-dlss/rspec-solr might be worth looking at. If
  you're not a ruby shop, doing something similar in another language
   shouldn't be too hard.
 
 
  The basic idea is that you setup a set of tests like
 
  If the query is X, then the document with id Y should be in the first 10
  results
  If the query is S, then a document with title T should be the first
  result
  If the query is P, then a document with author Q should not be in the
   first 10 results
 
  and that you run these whenever you tune your scoring formula to ensure
  that you haven't introduced unintended effects. New ideas/requirements
 for
  your relevance ranking should always result in writing new tests - that
  will probably fail until you tune your scoring formula. This is certainly
  no magic bullet, but it will give you some confidence that you didn't
 make
  things worse. And - in my humble opinion - it also gives you the benefit
 of
  discouraging you from tuning your scoring just for fun. To put it
 bluntly:
  if you cannot write up a requirement in form of a test, you probably have
  no need to tune your scoring.
 
 
  Regards,
 
  --
  Steffen
 
 
 
  On Tuesday, February 12, 2013 at 23:03 , Roman Chyla wrote:
 
   Hi,
   I do realize this is a very broad question, but still I need to ask it.
   Suppose you make a change into the scoring formula. How do you
   test/know/see what impact it had? Any framework out there?
  
   It seems like people are writing their own tools to measure relevancy.
  
   Thanks for any pointers,
  
   roman
 
 
 



Re: [ANN] vifun: tool to help visually tweak Solr boosting

2013-02-25 Thread Roman Chyla
Oh, wonderful! Thank you :) I was hacking some simple python/R scripts that
can do a similar job for qf... the idea was to let the algorithm create
possible combinations of params and compare that against the baseline.

Would it be possible/easy to instruct the tool to harvest results for
different combinations and export it? I would like to make plots similar to
those:

https://github.com/romanchyla/r-ranking-fun/blob/master/plots/raw/test-plot-showing-factors.pdf?raw=true

roman

On Sat, Feb 23, 2013 at 9:12 AM, jmlucjav jmluc...@gmail.com wrote:

 Hi,

 I have built a small tool to help me tweak some params in Solr (typically
 qf and bf in edismax). As others may find it useful, I am open-sourcing it
 on GitHub: https://github.com/jmlucjav/vifun

 Check github for some more info and screenshots. I include part of the
 github page below.
 regards

 Description

 Did you ever spend lots of time trying to tweak all the numbers in an *edismax*
 handler's *qf*, *bf*, etc. params so docs get scored to your liking? Imagine
 you have the params below: is 20 the right boost for *name*, or is it too
 much? Is *population* being boosted too much versus distance? What about
 new documents?

 <!-- fields, boost some -->
 <str name="qf">name^20 textsuggest^10 edge^5 ngram^2 phonetic^1</str>
 <str name="mm">33%</str>
 <!-- boost closest hits -->
 <str name="bf">recip(geodist(),1,500,0)</str>
 <!-- boost by population -->
 <str name="bf">product(log(sum(population,1)),100)</str>
 <!-- boost newest docs -->
 <str name="bf">recip(rord(moddate),1,1000,1000)</str>

 This tool was developed to help me tweak the values of boosting
 functions etc. in Solr, typically when using the edismax handler. If you are
 fed up with: change a number a bit, restart Solr, run the same query to see how
 documents are scored now... then this tool is for you.

  Features

- Can tweak numeric values in the following params: *qf, pf, bf, bq,
boost, mm* (others can be easily added) even in *appends or
invariants*
- View side by side a Baseline query result and how it changes when you
gradually change each value in the params
- Colorized values, color depends on how the document does related to
baseline query
- Tooltips give you Explain info
- Works on remote Solr installations
- Tested with Solr 3.6, 4.0 and 4.1 (other versions would work too, as
long as wt=javabin format is compatible)
- Developed using Groovy/Griffon

  Requirements

- */select* handler should be available, and not have any *appends or
invariants*, as it could interfere with how vifun works.
- Java6 is needed (maybe it runs on Java5 too). A JRE should be enough.

  Getting started

 - Download the latest version
   (http://code.google.com/p/vifun/downloads/detail?name=vifun-0.4.zip) and unzip

 - Run vifun-0.4\bin\vifun.bat, or vifun-0.4/bin/vifun if on Linux/OSX
 - Edit *Solr URL* to match yours (in Solr 4.1 the default is
   http://localhost:8983/solr/collection1, for example)
   [screenshot: https://github.com/jmlucjav/vifun/raw/master/img/screenshot-handlers.jpg]
 - *Show Handlers*, and select the handler you wish to tweak from the
   *Handlers* dropdown. The text area below shows the parameters of the handler.
 - Modify the values to run a baseline query:
    - *q*: query string you want to use
    - *rows*: as in Solr; don't choose a number too small, so you can see
      more documents. I typically use 500
    - *fl*: comma separated list of fields you want to show for each doc;
      keep it short (other fields needed, like the id and score, will be added)
    - *rest*: in case you need to add more params, for example: sfield, fq, etc.
   [screenshot: https://github.com/jmlucjav/vifun/raw/master/img/screenshot-qparams.jpg]
 - *Run Query*. The two panels on the right will show the same result,
   sorted by score.
   [screenshot: https://github.com/jmlucjav/vifun/raw/master/img/screenshot-results.jpg]
 - Use the mouse to select the number you want to tweak in Score params
   (select all the digits). Note the label of the field is highlighted with the
   current value.
   [screenshot: https://github.com/jmlucjav/vifun/raw/master/img/screenshot-selecttarget.jpg]
 - Move the slider, release, and see how a new query is run; you can compare
   how the result changes with the current value. In the Current table, you can
   see the current position/score and also the delta relative to the baseline.
   The colour of the row reflects how much the doc gained/lost.
   [screenshot: https://github.com/jmlucjav/vifun/raw/master/img/screenshot-baseline.jpg]
- You can increase the limits of the 

Re: Formal Query Grammar

2013-02-27 Thread Roman Chyla
Or if you prefer EBNF, look here (but it differs slightly from the grammar
Jack linked to):

https://github.com/romanchyla/montysolr/blob/master/contrib/antlrqueryparser/grammars/StandardLuceneGrammar.g

roman

On Wed, Feb 27, 2013 at 1:38 PM, Jack Krupansky j...@basetechnology.comwrote:

 Right here:

  http://svn.apache.org/viewvc/lucene/dev/tags/lucene_solr_4_1_0/solr/core/src/java/org/apache/solr/parser/QueryParser.jj?revision=1436334&view=markup

 -- Jack Krupansky

 -Original Message- From: z...@navigo.com
 Sent: Wednesday, February 27, 2013 11:44 AM
 To: solr-user@lucene.apache.org
 Subject: Formal Query Grammar


 I found where this had been asked, but did not find an answer.

 Is there a formal definition of the solr query grammar? Like a Chomsky
 grammar?

 Previous ask:

  http://lucene.472066.n3.nabble.com/FW-Formal-grammar-for-solr-lucene-td4010949.html



 --
  View this message in context:
  http://lucene.472066.n3.nabble.com/Formal-Query-Grammar-tp4043419.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: How can I limit my Solr search to an arbitrary set of 100,000 documents?

2013-03-08 Thread Roman Chyla
hi Andy,

It seems like a common type of operation and I would also be curious what
others think. My take on this is to create a compressed int bitset and send
it as a query filter, then have the handler decompress/deserialize it and
use it as a filter query. We have already done experiments with int bitsets
and they are fast to send/receive.

look at page 20
http://www.slideshare.net/lbjay/letting-in-the-light-using-solr-as-an-external-search-component
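
The handler-side piece does not have to be complicated; once the ids are
decompressed, something like the sketch below (assuming the Lucene 4.1-style
TermsFilter and the flrid field from the question) replaces the giant boolean OR:

import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.Term;
import org.apache.lucene.queries.TermsFilter;
import org.apache.lucene.search.Filter;

public class IdSetFilterSketch {

    // turn a (decompressed) array of ids into a single filter instead of a
    // huge boolean OR query; the field name is taken from Andy's example
    public static Filter buildIdFilter(int[] flrids) {
        List<Term> terms = new ArrayList<Term>(flrids.length);
        for (int id : flrids) {
            terms.add(new Term("flrid", Integer.toString(id)));
        }
        return new TermsFilter(terms);
    }
}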

it is not on my immediate list of tasks, but if you want to help, it can be
done sooner

roman

On Fri, Mar 8, 2013 at 12:10 PM, Andy Lester a...@petdance.com wrote:

 We've got an 11,000,000-document index.  Most documents have a unique ID
 called flrid, plus a different ID called solrid that is Solr's PK.  For
 some searches, we need to be able to limit the searches to a subset of
 documents defined by a list of FLRID values.  The list of FLRID values can
 change between every search and it will be rare enough to call it never
 that any two searches will have the same set of FLRIDs to limit on.

 What we're doing right now is, roughly:

 q=title:dogs AND
 (flrid:(123 125 139  34823) OR
  flrid:(34837 ... 59091) OR
  ... OR
  flrid:(101294813 ... 103049934))

  Each of those parenthesized groups can be 1,000 FLRIDs strung together.  We
  have to subgroup to get past Solr's limitations on the number of terms that
  can be ORed together.

 The problem with this approach (besides that it's clunky) is that it seems
 to perform O(N^2) or so.  With 1,000 FLRIDs, the search comes back in 50ms
 or so.  If we have 10,000 FLRIDs, it comes back in 400-500ms.  With 100,000
 FLRIDs, that jumps up to about 75000ms.  We want it be on the order of
 1000-2000ms at most in all cases up to 100,000 FLRIDs.

 How can we do this better?

 Things we've tried or considered:

 * Tried: Using dismax with minimum-match mm:0 to simulate an OR query.  No
 improvement.
 * Tried: Putting the FLRIDs into the fq instead of the q.  No improvement.
 * Considered: dumping all the FLRIDs for a given search into another core
 and doing a join between it and the main core, but if we do five or ten
 searches per second, it seems like Solr would die from all the commits.
  The set of FLRIDs is unique between searches so there is no reuse possible.
 * Considered: Translating FLRIDs to SolrID and then limiting on SolrID
 instead, so that Solr doesn't have to hit the documents in order to
  translate FLRID->SolrID to do the matching.

 What we're hoping for:

 * An efficient way to pass a long set of IDs, or for Solr to be able to
 pull them from the app's Oracle database.
 * Have Solr do big ORs as a set operation not as (what we assume is) a
 naive one-at-a-time matching.
 * A way to create a match vector that gets passed to the query, because
 strings of fqs in the query seems to be a suboptimal way to do it.

 I've searched SO and the web and found people asking about this type of
 situation a few times, but no answers that I see beyond what we're doing
 now.

 *
 http://stackoverflow.com/questions/11938342/solr-search-within-subset-defined-by-list-of-keys
 *
 http://stackoverflow.com/questions/9183898/searching-within-a-subset-of-data-solr
 *
 http://lucene.472066.n3.nabble.com/Filtered-search-for-subset-of-ids-td502245.html
 *
 http://lucene.472066.n3.nabble.com/Search-within-a-subset-of-documents-td1680475.html

 Thanks,
 Andy

 --
 Andy Lester = a...@petdance.com = www.petdance.com = AIM:petdance




Re: How can I limit my Solr search to an arbitrary set of 100,000 documents?

2013-03-08 Thread Roman Chyla
I think we speak of one use case where a user wants to limit the search to
a collection of documents but there is no unifying (easy) way to select
those papers - besides a loong query: id:1 OR id:5 OR id:90...

And no, a latency of several hundred milliseconds is perfectly achievable
with several hundred thousand ids; you should explore the link...
roman



On Fri, Mar 8, 2013 at 12:56 PM, Walter Underwood wun...@wunderwood.orgwrote:

 First, terms used to subset the index should be a filter query, not part
 of the main query. That may help, because the filter query terms are not
 used for relevance scoring.

 Have you done any system profiling? Where is the bottleneck: CPU or disk?
 There is no point in optimising things before you know the bottleneck.

 Also, your latency goals may be impossible. Assume roughly one disk access
 per term in the query. You are not going to be able to do 100,000 random
 access disk IOs in 2 seconds, let alone process the results.

 wunder

 On Mar 8, 2013, at 9:32 AM, Roman Chyla wrote:

  hi Andy,
 
  It seems like a common type of operation and I would be also curious what
  others think. My take on this is to create a compressed intbitset and
 send
  it as a query filter, then have the handler decompress/deserialize it,
 and
  use it as a filter query. We have already done experiments with
 intbitsets
  and it is fast to send/receive
 
  look at page 20
 
 http://www.slideshare.net/lbjay/letting-in-the-light-using-solr-as-an-external-search-component
 
  it is not on my immediate list of tasks, but if you want to help, it can
 be
  done sooner
 
  roman
 
  On Fri, Mar 8, 2013 at 12:10 PM, Andy Lester a...@petdance.com wrote:
 
  We've got an 11,000,000-document index.  Most documents have a unique ID
  called flrid, plus a different ID called solrid that is Solr's PK.
  For
  some searches, we need to be able to limit the searches to a subset of
  documents defined by a list of FLRID values.  The list of FLRID values
 can
  change between every search and it will be rare enough to call it
 never
  that any two searches will have the same set of FLRIDs to limit on.
 
  What we're doing right now is, roughly:
 
 q=title:dogs AND
 (flrid:(123 125 139  34823) OR
  flrid:(34837 ... 59091) OR
  ... OR
  flrid:(101294813 ... 103049934))
 
  Each of those FQs parentheticals can be 1,000 FLRIDs strung together.
  We
  have to subgroup to get past Solr's limitations on the number of terms
 that
  can be ORed together.
 
  The problem with this approach (besides that it's clunky) is that it
 seems
  to perform O(N^2) or so.  With 1,000 FLRIDs, the search comes back in
 50ms
  or so.  If we have 10,000 FLRIDs, it comes back in 400-500ms.  With
 100,000
  FLRIDs, that jumps up to about 75000ms.  We want it be on the order of
  1000-2000ms at most in all cases up to 100,000 FLRIDs.
 
  How can we do this better?
 
  Things we've tried or considered:
 
  * Tried: Using dismax with minimum-match mm:0 to simulate an OR query.
  No
  improvement.
  * Tried: Putting the FLRIDs into the fq instead of the q.  No
 improvement.
  * Considered: dumping all the FLRIDs for a given search into another
 core
  and doing a join between it and the main core, but if we do five or ten
  searches per second, it seems like Solr would die from all the commits.
  The set of FLRIDs is unique between searches so there is no reuse
 possible.
  * Considered: Translating FLRIDs to SolrID and then limiting on SolrID
  instead, so that Solr doesn't have to hit the documents in order to
  translate FLRID-SolrID to do the matching.
 
  What we're hoping for:
 
  * An efficient way to pass a long set of IDs, or for Solr to be able to
  pull them from the app's Oracle database.
  * Have Solr do big ORs as a set operation not as (what we assume is) a
  naive one-at-a-time matching.
  * A way to create a match vector that gets passed to the query, because
  strings of fqs in the query seems to be a suboptimal way to do it.
 
  I've searched SO and the web and found people asking about this type of
  situation a few times, but no answers that I see beyond what we're doing
  now.
 
  *
 
 http://stackoverflow.com/questions/11938342/solr-search-within-subset-defined-by-list-of-keys
  *
 
 http://stackoverflow.com/questions/9183898/searching-within-a-subset-of-data-solr
  *
 
 http://lucene.472066.n3.nabble.com/Filtered-search-for-subset-of-ids-td502245.html
  *
 
 http://lucene.472066.n3.nabble.com/Search-within-a-subset-of-documents-td1680475.html
 
  Thanks,
  Andy
 
  --
  Andy Lester = a...@petdance.com = www.petdance.com = AIM:petdance
 
 







How to plug a new ANTLR grammar

2011-09-13 Thread Roman Chyla
Hi,

The standard Lucene/Solr parsing is nice but not really flexible. I
saw questions and discussions about ANTLR, but unfortunately never a
working grammar, so... maybe you will find this useful:
https://github.com/romanchyla/montysolr/tree/master/src/java/org/apache/lucene/queryParser/iqp/antlr

In the grammar, the parsing is completely abstracted from the Lucene
objects, and the parser is not mixed with Java code. At first it
produces structures like this:
https://svnweb.cern.ch/trac/rcarepo/raw-attachment/wiki/MontySolrQueryParser/index.html

But now I have a problem: I don't know if I should use the query-parsing
framework in contrib.

It seems that the qParser in contrib can use different parser
generators (the default JavaCC, but also ANTLR). But I am confused and
I don't understand this new query parser from contrib. It is really
very confusing to me. Is there any benefit in trying to plug the ANTLR
tree into it? Because looking at the AST pictures, it seems that with
a relatively simple tree walker we could build the same queries as the
current standard Lucene query parser. And it would be much simpler and
more flexible. Does it bring something new? I have a feeling I am missing
something...

Many thanks for help,

  Roman


Re: How to plug a new ANTLR grammar

2011-09-14 Thread Roman Chyla
Hi Peter,

Yes, with the tree it is pretty straightforward. I'd prefer to do it
that way, but what is the purpose of the new qParser then? Is it just
that the qParser was built with different paradigms in mind, where
the parse tree was not in the equation? Does anybody know if there is any
advantage?

I looked bit more into the contrib

org.apache.lucene.queryParser.standard.StandardQueryParser.java
org.apache.lucene.queryParser.standard.QueryParserWrapper.java

And some things there (like setting the default fuzzy value) are in my
case set directly in the grammar. So the query builder is still
somehow involved in parsing (IMHO not good).

But if someone knows some reasons to keep using the qParser, please
let me know.

Also, a question for Peter: at which stage do you use the Lucene analyzers
on the query? After it was parsed into the tree, or before we start
processing the query string?

Thanks!

  Roman





On Tue, Sep 13, 2011 at 10:14 PM, Peter Keegan peterlkee...@gmail.com wrote:
 Roman,

 I'm not familiar with the contrib, but you can write your own Java code to
 create Query objects from the tree produced by your lexer and parser
 something like this:

 StandardLuceneGrammarLexer lexer = new StandardLuceneGrammarLexer(
     new ANTLRReaderStream(new StringReader(queryString)));
 CommonTokenStream tokens = new CommonTokenStream(lexer);
 StandardLuceneGrammarParser parser = new StandardLuceneGrammarParser(tokens);
 StandardLuceneGrammarParser.query_return ret = parser.mainQ();
 CommonTree t = (CommonTree) ret.getTree();
 parseTree(t);

 void parseTree(Tree t) {
     // recursively parse the tree, visiting each node
     visit(t);
     for (int i = 0; i < t.getChildCount(); i++) {
         parseTree(t.getChild(i));
     }
 }

 void visit(Tree node) {
     switch (node.getType()) {
     case StandardLuceneGrammarParser.AND:
         // Create BooleanQuery, push onto stack
         ...
     }
 }

 I use the stack to build up the final Query from the queries produced in the
 tree parsing.

 Hope this helps.
 Peter


 On Tue, Sep 13, 2011 at 3:16 PM, Jason Toy jason...@gmail.com wrote:

 I'd love to see the progress on this.

 On Tue, Sep 13, 2011 at 10:34 AM, Roman Chyla roman.ch...@gmail.com
 wrote:

  Hi,
 
  The standard lucene/solr parsing is nice but not really flexible. I
  saw questions and discussion about ANTLR, but unfortunately never a
  working grammar, so... maybe you find this useful:
 
 
 https://github.com/romanchyla/montysolr/tree/master/src/java/org/apache/lucene/queryParser/iqp/antlr
 
  In the grammar, the parsing is completely abstracted from the Lucene
  objects, and the parser is not mixed with Java code. At first it
  produces structures like this:
 
 
 https://svnweb.cern.ch/trac/rcarepo/raw-attachment/wiki/MontySolrQueryParser/index.html
 
  But now I have a problem. I don't know if I should use query parsing
  framework in contrib.
 
  It seems that the qParser in contrib can use different parser
  generators (the default JavaCC, but also ANTLR). But I am confused and
  I don't understand this new queryParser from contrib. It is really
  very confusing to me. Is there any benefit in trying to plug the ANTLR
  tree into it? Because looking at the AST pictures, it seems that with
  a relatively simple tree walker we could build the same queries as the
  current standard lucene query parser. And it would be much simpler and
  flexible. Does it bring something new? I have a feeling I miss
  something...
 
  Many thanks for help,
 
   Roman
 



 --
 - sent from my mobile
 6176064373




Re: ANTLR SOLR query/filter parser

2011-09-22 Thread Roman Chyla
Hi, I agree that people can register arbitrary qparsers; however, the
question might have been understood differently - as being about an ANTLR
parser that can handle what the Solr qparser does (and that one is looking at
_query_: and similar stuff -- or at local params, which is what can be
copy-pasted into the business logic of the new parser; i.e. the
solution might be similar to what is already done in the Solr qparser).

I think I'm going to try just that :)

So here is my working ANTLR grammar for Lucene in case anybody is interested:
https://github.com/romanchyla/montysolr/tree/master/src/java/org/apache/lucene/queryParser/iqp/antlr

And I now plan to build a wrapper that calls this parser to parse the
query, get the tree, then translate the tree into a Lucene query object.
The local stuff {} may not even be part of the grammar -- some unclear
ideas in here, but they will be sorted out...

roman

On Wed, Aug 17, 2011 at 9:26 PM, Chris Hostetter
hossman_luc...@fucit.org wrote:

 : I'm looking for an ANTLR parser that consumes solr queries and filters.
 : Before I write my own, thought I'd ask if anyone has one they are
 : willing to share or can point me to one?

 I'm pretty sure that this will be impossible to do in the general case --
 arbitrary QParser instances (that support arbitrary syntax) can be
 registered in the solrconfig.xml and specified using either localparams or
 defType.  So even if you did write a parser that understood all of the
 rules of all of the default QParsers, and even if you made your parser
 smart enough to know how to look at other params (i.e. defType, or
 variable substitution of type) to understand which subset of parse rules
 to use, that still might give false positives or false failures if the
 user registered their own QParser using a new name (or changed the
 names used in registering existing parsers).

 The main question I have is: why are you looking for an ANTLR parser to do
 this?  What is your goal?

 https://people.apache.org/~hossman/#xyproblem
 Your question appears to be an XY Problem ... that is: you are dealing
 with X, you are assuming Y will help you, and you are asking about Y
 without giving more details about the X so that we can understand the
 full issue.  Perhaps the best solution doesn't involve Y at all?
 See Also: http://www.perlmonks.org/index.pl?node_id=542341




 -Hoss



Is there anything like MultiSearcher?

2011-02-05 Thread Roman Chyla
Dear Solr experts,

Could you recommend some strategies, or perhaps tell me if I am approaching
my problem from the wrong side? I was hoping to use MultiSearcher to
search across multiple indexes in Solr, but there is no such thing,
and MultiSearcher was removed according to this post:
http://osdir.com/ml/solr-user.lucene.apache.org/2011-01/msg00250.html

I thought I had two use cases:

1. maintenance - I wanted to build two separate indexes, one for
fulltext and one for metadata (the docs have the unique ids) -
indexing them separately would make things much simpler
2. ability to switch indexes at search time (ie. for testing purposes
- one fulltext index could be built by Solr standard mechanism, the
other by a rather different process - independent instance of lucene)

I think the recommended approach is to use distributed search - I
found a nice solution here:
http://stackoverflow.com/questions/2139030/search-multiple-solr-cores-and-return-one-result-set
- however, it seems to me that data are sent over HTTP (5M from one
core and 5M from the other core, being merged by a 3rd Solr core?)
and I would like to do it only for local indexes and without the
network overhead.

Could you please shed some light on whether there already exists an optimal
solution to my use cases? And if not, whether I could just try to
build a new SolrQuerySearcher that extends Lucene's MultiSearcher
instead of IndexSearcher - or do you think there are some deeply rooted
problems there and the MultiSearcher cannot work inside Solr?

Thank you,

  Roman


Re: Is there anything like MultiSearcher?

2011-02-05 Thread Roman Chyla
Unless I am wrong, sharding across two cores is done over HTTP and has
the limitations listed at:
http://wiki.apache.org/solr/DistributedSearch
MultiSearcher, on the other hand, is just a decorator over IndexSearcher -
therefore those limitations would (?) not apply, and if the indexes reside
locally it would also be faster.
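
In plain Lucene, the local multi-index search can also be done with a
MultiReader wrapped by a single IndexSearcher, roughly like this (a sketch
against the Lucene 3.x API; the paths are made up) - the open question is only
how to wire something like it into Solr's SolrIndexSearcher:

import java.io.File;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.FSDirectory;

public class LocalMultiIndexSearch {
    public static void main(String[] args) throws Exception {
        // both indexes live on the local disk; paths are illustrative
        IndexReader metadata = IndexReader.open(FSDirectory.open(new File("/indexes/metadata")));
        IndexReader fulltext = IndexReader.open(FSDirectory.open(new File("/indexes/fulltext")));

        // one searcher over both local indexes, no network hop involved
        IndexSearcher searcher = new IndexSearcher(new MultiReader(metadata, fulltext));
        System.out.println("maxDoc across both indexes: " + searcher.getIndexReader().maxDoc());
    }
}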

Cheers,

roman

On Sat, Feb 5, 2011 at 10:02 PM, Bill Bell billnb...@gmail.com wrote:
 Why not just use sharding across the 2 cores?

 On 2/5/11 8:49 AM, Roman Chyla roman.ch...@gmail.com wrote:

Dear Solr experts,

Could you recommend some strategies or perhaps tell me if I approach
my problem from a wrong side? I was hoping to use MultiSearcher to
search across multiple indexes in Solr, but there is no such a thing
and MultiSearcher was removed according to this post:
http://osdir.com/ml/solr-user.lucene.apache.org/2011-01/msg00250.html

I though I had two use cases:

1. maintenance - I wanted to build two separate indexes, one for
fulltext and one for metadata (the docs have the unique ids) -
indexing them separately would make things much simpler
2. ability to switch indexes at search time (ie. for testing purposes
- one fulltext index could be built by Solr standard mechanism, the
other by a rather different process - independent instance of lucene)

I think the recommended approach is to use the Distributed search - I
found a nice solution here:
http://stackoverflow.com/questions/2139030/search-multiple-solr-cores-and-
return-one-result-set
- however it seems to me, that data are sent over HTTP (5M from one
core, and 5M from the other core being merged by the 3rd solr core?)
and I would like to do it only for local indexes and without the
network overhead.

Could you please shed some light if there already exist an optimal
solution to my use cases? And if not, whether I could just try to
build a new SolrQuerySearcher that is extending lucene MultiSearcher
instead of IndexSearch - or you think there are some deeply rooted
problems there and the MultiSearch-er cannot work inside Solr?

Thank you,

  Roman





multiple localParams for each query clause

2011-03-02 Thread Roman Chyla
Hi,

Is it possible to set local arguments for each query clause?

example:

{!type=x q.field=z}something AND {!type=database}something


I am pulling together result sets coming from two sources, Solr index
and DB engine - however I realized that local parameters apply only to
the whole query - so I don't know how to set the query to mark the
second clause as db-searchable.

Thanks,

  Roman


Re: multiple localParams for each query clause

2011-03-02 Thread Roman Chyla
Thanks Jonathan, this will be useful -- in the meantime, I have
implemented the query rewriting, using the QueryParsing.toString()
utility as an example.

On Wed, Mar 2, 2011 at 5:40 PM, Jonathan Rochkind rochk...@jhu.edu wrote:
 Not per clause, no. But you can use the nested queries feature to set
 local params for each nested query instead.  Which is in fact one of the
 most common use cases for local params.

 q=_query_:"{!type=x q.field=z}something" AND
 _query_:"{!type=database}something"

 URL encode that whole thing though.

 http://www.lucidimagination.com/blog/2009/03/31/nested-queries-in-solr/

 On 3/2/2011 10:24 AM, Roman Chyla wrote:

 Hi,

 Is it possible to set local arguments for each query clause?

 example:

 {!type=x q.field=z}something AND {!type=database}something


 I am pulling together result sets coming from two sources, Solr index
 and DB engine - however I realized that local parameters apply only to
 the whole query - so I don't know how to set the query to mark the
 second clause as db-searchable.

 Thanks,

   Roman



Re: Help! Confused about using Jquery for the Search query - Want to ditch it

2012-06-08 Thread Roman Chyla
Hi,
what you want to do is not that difficult, you can use json, eg.

# (inside a function; needs: import urllib, simplejson and a 'log' logger)
page = None
try:
    conn = urllib.urlopen(url, params)
    page = conn.read()          # raw response from solr (wt=json)
    rsp = simplejson.loads(page)
    conn.close()
    return rsp
except Exception, e:
    log.error(str(e))
    log.error(page)
    raise e

but this way you are initiating a connection each time, which is
expensive - it would be better to pool the connections

but as you can see, you can get json or xml either way

another option is to use solrpy

import solr
import urllib
# create a connection to a solr server
s = solr.SolrConnection('http://localhost:8984/solr')
s.select = solr.SearchHandler(s, '/invenio')

def search(query, kwargs=None, fields=['id'], qt='invenio'):
    # do a remote search in solr
    url_params = urllib.urlencode([(k, v) for k, v in kwargs.items()
                                   if k not in ['_', 'req']])

    if 'rg' in kwargs and kwargs['rg']:
        rows = min(kwargs['rg'], 100)  # inv maximum limit is 100
    else:
        rows = 25
    response = s.query(query, fields=fields, rows=rows, qt=qt,
                       inv_params=url_params)
    num_found = response.numFound
    q_time = response.header['QTime']
    # ... more and return


On Thu, Jun 7, 2012 at 3:16 PM, Ben Woods bwo...@quincyinc.com wrote:
 But, check out things like httplib2 and urllib2.

 -Original Message-
 From: Spadez [mailto:james_will...@hotmail.com]
 Sent: Thursday, June 07, 2012 2:09 PM
 To: solr-user@lucene.apache.org
 Subject: RE: Help! Confused about using Jquery for the Search query - Want to 
 ditch it

 Thank you, that helps. The bit I am still confused about how the server sends 
 the response to the server though. I get the impression that there are 
 different ways that this could be done, but is sending an XML response back 
 to the Python server the best way to do this?



 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Help-Confused-about-using-Jquery-for-the-Search-query-Want-to-ditch-it-tp3988123p3988302.html
 Sent from the Solr - User mailing list archive at Nabble.com.




Re: 4.0.ALPHA vs 4.0 branch/trunk - what is best for upgrade?

2012-07-15 Thread Roman Chyla
I am using AbstractSolrTestCase (which in turn uses
solr.util.TestHarness) as a basis for unittests, but the solr
installation is outside of my source tree and I don't want to
duplicate it just to change a few lines (and with the new solr4.0 I
hope I can get the test-framework in a jar file, previously that
wasn't possible). So in essence, I have to deal with the expected
folder structure for all my unittests.

The way I make the configuration visible outside the solr standard
paths is to get the classloader and add folders to it; this way I can test
extensions for solr without duplicating the configuration. But I
should mimic the folder structure to be compatible.

Thanks all for your help, it is much appreciated.

roman
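
The classloader bit, for the record, is just the usual reflection hack -
roughly like this, inside a @BeforeClass method that declares 'throws
Exception' (the path below is made up, and it assumes the context classloader
is a URLClassLoader, which the default application classloader is):

import java.io.File;
import java.lang.reflect.Method;
import java.net.URL;
import java.net.URLClassLoader;

// test setup only: make an external conf folder visible through the classloader
URLClassLoader loader = (URLClassLoader) Thread.currentThread().getContextClassLoader();
Method addURL = URLClassLoader.class.getDeclaredMethod("addURL", URL.class);
addURL.setAccessible(true);
addURL.invoke(loader, new File("/path/to/external/solr/conf").toURI().toURL());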

On Sun, Jul 15, 2012 at 1:46 PM, Mark Miller markrmil...@gmail.com wrote:
 The beta will have files that where in solr/conf and solr/data in 
 solr/collection1/conf|data instead.

 What Solr test cases are you referring to? The only ones that should care 
 about this would have to be looking at the file system. If that is the case, 
 simply update the path. The built in tests had to be adjusted for this as 
 well.

 The problem with having the default core use /solr as a conf dir is that if 
 you create another core, where does it logically go? The default collection 
 is called collection1, so now its conf and data lives in a folder called 
 collection1. A new SolrCore called newsarticles would have it's conf and data 
 in /solr/newsarticles.

 There are still going to be some bumps as you move from alpha to beta to 
 release if you are depending on very specific file system locations - 
 however, they should be small bumps that are easily handled.

 Just send an email to the user list if you'd like some help with anything in 
 particular.

 In this case, I'd update what you have to look at /solr/collection1 rather 
 than simply /solr. It's still the default core, so simple URLs without the 
 core name will still work. It won't affect HTTP communication. Just file 
 system location.

 On Jul 14, 2012, at 9:54 PM, Roman Chyla wrote:

 Hi,

 Is it intentional that the ALPHA release has a different folder structure
 as opposed to the trunk?

 eg. collection1 folder is missing in the ALPHA, but present in branch_4x
 and trunk

 lucene-trunk/solr/example/solr/collection1/conf/xslt/example_atom.xsl
 4.0.0-ALPHA/solr/example/solr/conf/xslt/example_atom.xsl
 lucene_4x/solr/example/solr/collection1/conf/xslt/example_atom.xsl


 This has consequences for development - e.g. solr testcases do not expect
 that the collection1 is there for ALPHA.

 In general, what is your advice for developers who are upgrading from solr
 3.x to solr 4.x? What codebase should we follow to minimize the pain of
 porting to the next BETA and stable releases?

 Thanks!

  roman

 - Mark Miller
 lucidimagination.com













java.lang.AssertionError: System properties invariant violated.

2012-07-17 Thread Roman Chyla
Hello,

(Please excuse cross-posting, my problem is with a solr component, but
the underlying issue is inside the lucene test-framework)

I am porting 3.x unittests to the solr/lucene trunk. My unittests are
OK and pass, but in the end fail because the new rule checks for
modified system properties. I know what the problem is: I am creating new
system properties in @BeforeClass, but I think I need to do it
there, because the project loads a C library before initializing the tests.
Anybody knows how to work around it cleanly? There is a property that
can be set to ignore certain names
(LuceneTestCase.IGNORED_INVARIANT_PROPERTIES), but unfortunately it is
declared as private.

Thank you,

  Roman


Exception:

java.lang.AssertionError: System properties invariant violated.
New keys:
  montysolr.bridge=montysolr.java_bridge.SimpleBridge
  montysolr.home=/dvt/workspace/montysolr
  montysolr.modulepath=/dvt/workspace/montysolr/src/python/montysolr
  solr.test.sys.prop1=propone
  solr.test.sys.prop2=proptwo

at 
com.carrotsearch.randomizedtesting.rules.SystemPropertiesInvariantRule$1.evaluate(SystemPropertiesInvariantRule.java:66)
at 
org.apache.lucene.util.TestRuleNoInstanceHooksOverrides$1.evaluate(TestRuleNoInstanceHooksOverrides.java:53)
at 
org.apache.lucene.util.TestRuleNoStaticHooksShadowing$1.evaluate(TestRuleNoStaticHooksShadowing.java:52)
at 
org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:36)
at 
org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:48)
at 
org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:70)
at 
org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:55)
at 
com.carrotsearch.randomizedtesting.RandomizedRunner.runSuite(RandomizedRunner.java:605)
at 
com.carrotsearch.randomizedtesting.RandomizedRunner.access$400(RandomizedRunner.java:132)
at 
com.carrotsearch.randomizedtesting.RandomizedRunner$2.run(RandomizedRunner.java:551)


Re: java.lang.AssertionError: System properties invariant violated.

2012-07-18 Thread Roman Chyla
Thank you! I haven't really understood the LuceneTestCase.classRules
before this.

roman

On Wed, Jul 18, 2012 at 3:11 PM, Chris Hostetter
hossman_luc...@fucit.org wrote:

 : I am porting 3x unittests to the solr/lucene trunk. My unittests are
 : OK and pass, but in the end fail because the new rule checks for
 : modifier properties. I know what the problem is, I am creating new
 : system properties in the @beforeClass, but I think I need to do it
 : there, because the project loads C library before initializing tests.

 The purpose ot the assertion is to verify that no code being tested is
 modifying system properties -- if you are setting hte properties yourself
 in some @BeforeClass methods, just use System.clearProperty to unset them
 in corrisponding @AfterClass methods


 -Hoss
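
In concrete terms, a minimal sketch of that pattern (the class name is made
up, the property is one of those from the failure above):

import org.apache.lucene.util.LuceneTestCase;
import org.junit.AfterClass;
import org.junit.BeforeClass;

public class MontySolrPropertiesTest extends LuceneTestCase {

  @BeforeClass
  public static void setupProperties() {
    System.setProperty("montysolr.home", "/dvt/workspace/montysolr");
  }

  @AfterClass
  public static void cleanupProperties() {
    // undo what @BeforeClass set, so SystemPropertiesInvariantRule sees no change
    System.clearProperty("montysolr.home");
  }
}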


Re: using Solr to search for names

2012-07-22 Thread Roman Chyla
Or, for names that are more involved, you can use a special
tokenizer/filter chain and index different variants of the name into
one index

example: 
https://github.com/romanchyla/montysolr/blob/solr-trunk/contrib/adsabs/src/java/org/apache/lucene/analysis/synonym/AuthorSynonymFilter.java

roman
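
To give a flavour of what such a chain does - this is a stripped-down sketch,
not the actual AuthorSynonymFilter from the link: a TokenFilter that, for a
"surname, firstname" token, also emits the "surname, f" variant at the same
position, so both forms end up searchable:

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.util.AttributeSource;

public final class NameVariantFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PositionIncrementAttribute posAtt = addAttribute(PositionIncrementAttribute.class);
  private AttributeSource.State saved;
  private String pending;

  public NameVariantFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (pending != null) {                // emit the queued variant first
      restoreState(saved);
      termAtt.setEmpty().append(pending);
      posAtt.setPositionIncrement(0);     // same position as the original token
      pending = null;
      return true;
    }
    if (!input.incrementToken()) {
      return false;
    }
    String term = termAtt.toString();
    int comma = term.indexOf(", ");
    if (comma > 0 && term.length() > comma + 3) {
      pending = term.substring(0, comma + 3);   // "surname, f"
      saved = captureState();
    }
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    pending = null;
    saved = null;
  }
}

With this in the index-time chain, a query for "smith, j" matches documents
indexed as "smith, john".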

On Sun, Jul 22, 2012 at 10:52 AM, Alireza Salimi
alireza.sal...@gmail.com wrote:
 Hi Ahmet,

 Thanks for the reply, Yes, actually after I posted the first question,
 I found that edismax is very helpful in this use case. There is another
 problem which is about hyphens in the search query.

 I guess I need to post it in another email.

 Thank you very much

 On Sun, Jul 22, 2012 at 3:35 AM, Ahmet Arslan iori...@yahoo.com wrote:

  So here is the problem, I have a requirement to implement
  search by a
  person name.
  Names consist of
  - first name
   - middle name
  - last name
  - nickname
 
  there is a list of synonyms which should be applied just for
  first name and
  middle name.
 
  In search, all fields should be searched for the search
  keyword. That's why
  I thought
  maybe having an aggregate field - named 'name' - which keeps
  all fields - by
  copyField tag - can be used for search.
 
  The problem is: how can I apply synonyms for first names and
  middle names,
  when I
  want to copy them into 'name' field?
 
  If you know of any link which is for using Solr to search
  for names,
  I would appreciate if you let me know.

 There is a flexible approach when you want to search over multiple fields
 having different field types. http://wiki.apache.org/solr/ExtendedDisMax
 You just specify the list of fields by qf parameter.

 defType=edismaxqf=firstName^1.2 middleName lastName^1.5 nickname




 --
 Alireza Salimi
 Java EE Developer


Re: Batch Search Query

2013-03-28 Thread Roman Chyla
Apologies if you already do something similar, but perhaps of general
interest...

One (different) approach to your problem is to implement local
fingerprints - if you want to find documents with overlapping segments, this
algorithm will dramatically reduce the number of segments you create/search
for every document:

http://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf

Then you simply end up indexing each document, and upon submission:
computing fingerprints and querying for them. I don't remember the
exact numbers, but my feeling is that you end up storing ~13% of the document
text (besides, each fingerprint is a single token, therefore quite fast to
search for - you could even try one huge boolean query with 1024 clauses,
ouch... :))

roman
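
A bare-bones version of the winnowing selection from that paper, just to show
the idea (k-gram size k and window size w are whatever you tune them to; this
is a sketch, not the exact code we use):

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class Winnowing {
  // naive winnowing: hash every k-gram, then keep the minimum hash of each
  // window of w consecutive k-gram hashes
  public static Set<Integer> fingerprints(List<String> tokens, int k, int w) {
    List<Integer> hashes = new ArrayList<Integer>();
    for (int i = 0; i + k <= tokens.size(); i++) {
      hashes.add(tokens.subList(i, i + k).hashCode());
    }
    Set<Integer> selected = new HashSet<Integer>();
    for (int i = 0; i + w <= hashes.size(); i++) {
      int min = hashes.get(i);
      for (int j = i + 1; j < i + w; j++) {
        min = Math.min(min, hashes.get(j));
      }
      selected.add(min);  // identical windows always pick identical fingerprints
    }
    return selected;
  }
}

Each selected fingerprint is indexed as a single token; any run of at least
w + k - 1 shared tokens between two documents is guaranteed to produce at
least one common fingerprint, so a plain term query finds the overlap.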

On Thu, Mar 28, 2013 at 11:43 AM, Mike Haas mikehaas...@gmail.com wrote:

 Hello. My company is currently thinking of switching over to Solr 4.2,
 coming off of SQL Server. However, what we need to do is a bit weird.

 Right now, we have ~12 million segments and growing. Usually these are
 sentences but can be other things. These segments are what will be stored
 in Solr. I’ve already done that.

 Now, what happens is a user will upload say a word document to us. We then
 parse it and process it into segments. It very well could be 5000 segments
 or even more in that word document. Each one of those ~5000 segments needs
 to be searched for similar segments in solr. I’m not quite sure how I will
 do the query (whether proximate or something else). The point though, is to
 get back similar results for each segment.

 However, I think I’m seeing a bigger problem first. I have to search
 against ~5000 segments. That would be 5000 http requests. That’s a lot! I’m
 pretty sure that would take a LOT of hardware. Keep in mind this could be
 happening with maybe 4 different users at once right now (and of course
 more in the future). Is there a good way to send a batch query over one (or
 at least a lot fewer) http requests?

 If not, what kinds of things could I do to implement such a feature (if
 feasible, of course)?


 Thanks,

 Mike



Re: Batch Search Query

2013-03-28 Thread Roman Chyla
On Thu, Mar 28, 2013 at 12:27 PM, Mike Haas mikehaas...@gmail.com wrote:

 Thanks for your reply, Roman. Unfortunately, the business has been running
 this way forever so I don't think it would be feasible to switch to a whole


sure, no arguing against that :)


 document store versus segments store. Even then, if I understand you
 correctly it would not work for our needs. I'm thinking because we don't
 care about any other parts of the document, just the segment. If a similar
 segment is in an entirely different document, we want that segment.


the algo should work for this case - the beauty of the local winnowing is
that it is *local*, i.e. it tends to select the same segments from the text
(say you process two documents, written by two different people - if
they cited the same thing, and it is longer than 'm' tokens, you will have
at least one identical fingerprint in both documents - which means:
match!). Then, of course, you can store the position offsets of the original
words of the fingerprint, retrieve the original, compute the ratio of
overlap etc... but a database seems to be better suited for that kind of
job...

let us know what you adopt!

ps: MoreLikeThis selects 'significant' tokens from the document you
selected and then constructs a new boolean query searching for those.
http://cephas.net/blog/2008/03/30/how-morelikethis-works-in-lucene/


 I'll keep taking any and all feedback however so that I can develop an idea
 and present it to my manager.


 On Thu, Mar 28, 2013 at 11:16 AM, Roman Chyla roman.ch...@gmail.com
 wrote:

  Apologies if you already do something similar, but perhaps of general
  interest...
 
  One (different approach) to your problem is to implement a local
  fingerprint - if you want to find documents with overlapping segments,
 this
  algorithm will dramatically reduce the number of segments you
 create/search
  for every document
 
  http://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf
 
  Then you simply end up indexing each document, and upon submission:
  computing fingerprints and querying for them. I don't know (ie. remember)
  exact numbers, but my feeling is that you end up storing ~13% of document
  text (besides, it is a one token fingerprint, therefore quite fast to
  search for - you could even try one huge boolean query with 1024 clauses,
  ouch... :))
 
  roman
 
  On Thu, Mar 28, 2013 at 11:43 AM, Mike Haas mikehaas...@gmail.com
 wrote:
 
   Hello. My company is currently thinking of switching over to Solr 4.2,
   coming off of SQL Server. However, what we need to do is a bit weird.
  
   Right now, we have ~12 million segments and growing. Usually these are
   sentences but can be other things. These segments are what will be
 stored
   in Solr. I’ve already done that.
  
   Now, what happens is a user will upload say a word document to us. We
  then
   parse it and process it into segments. It very well could be 5000
  segments
   or even more in that word document. Each one of those ~5000 segments
  needs
   to be searched for similar segments in solr. I’m not quite sure how I
  will
   do the query (whether proximate or something else). The point though,
 is
  to
   get back similar results for each segment.
  
   However, I think I’m seeing a bigger problem first. I have to search
   against ~5000 segments. That would be 5000 http requests. That’s a lot!
  I’m
   pretty sure that would take a LOT of hardware. Keep in mind this could
 be
   happening with maybe 4 different users at once right now (and of course
   more in the future). Is there a good way to send a batch query over one
  (or
   at least a lot fewer) http requests?
  
   If not, what kinds of things could I do to implement such a feature (if
   feasible, of course)?
  
  
   Thanks,
  
   Mike
  
 



Re: Query Parser OR AND and NOT

2013-04-15 Thread Roman Chyla
should be: -city:H* OR zip:30*




On Mon, Apr 15, 2013 at 12:03 PM, Peter Schütt newsgro...@pstt.de wrote:

 Hallo,
 I do not really understand the query language of the SOLR-Queryparser.

 I use SOLR 4.2 und I have nearly 20 sample address records in the
 SOLR-Database.

 I only use the q field in the SOLR Admin Web GUI and every other
 controls  on this website is on default.


 First category:

 zip:30* numFound=2896

 city:H* OR zip:30*  numFound=12519

 city:H* AND zip:30* numFound=376

 These results seems to me correct.

 Now I tried with negations:

 !city:H*numFound:194577(seems to be correct)

 !city:H* AND zip:30*numFound:2520(seems to be correct)


 !city:H* OR zip:30* numFound:2520(!! this is wrong !!)

 Or do I do not understand something?


 (!city:H*) OR zip:30*numFound: 2896

 This is also wrong.

 Thanks for any hint to understand the negation handling of the query
 language.

 Ciao
   Peter Schütt







Re: Query Parser OR AND and NOT

2013-04-15 Thread Roman Chyla
Oh, sorry, I had assumed the Lucene query parser. The Solr QP must be
different then, because for me it works as expected (our query parser is
identical to Lucene in the way it treats the modifiers +/- and the operators
AND/OR/NOT -- NOT must join two clauses: a NOT b, and the first cannot be
negative, as Chris points out; a modifier, however, can come first - but it
cannot stand alone, there must be at least one positive clause). Otherwise,
-field:x is changed into field:x

http://labs.adsabs.harvard.edu/adsabs/search/?q=%28*+-abstract%3Ablack%29+AND+abstract%3Ahole*db_key=ASTRONOMYsort_type=DATE
http://labs.adsabs.harvard.edu/adsabs/search/?q=%28-abstract%3Ablack%29+AND+abstract%3Ahole*db_key=ASTRONOMYsort_type=DATE

roman
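
The usual Solr workaround for a purely negative clause inside a larger
boolean query is to make it explicit with the match-all-docs query:

(*:* -city:H*) OR zip:30*

Solr only rewrites a pure negation for you when it is the entire top-level
query; a negative clause standing alone inside an OR matches nothing by
itself, which explains the counts Peter is seeing.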


On Mon, Apr 15, 2013 at 12:25 PM, Peter Schütt newsgro...@pstt.de wrote:

 Hallo,


 Roman Chyla roman.ch...@gmail.com wrote in
 news:caen8dywjrl+e3b0hpc9ntlmjtrkasrqlvkzhkqxopmlhhfn...@mail.gmail.com:

  should be: -city:H* OR zip:30*
 
 -city:H* OR zip:30*   numFound:2520

 gives the same wrong result.


 Another Idea?

 Ciao
   Peter Schütt





Why filter query doesn't use the same query parser as the main query?

2013-04-16 Thread Roman Chyla
Hi,

Is there some profound reason why the defType is not passed onto the filter
query?

Both query and filterQuery are created inside the QueryComponent, however
differently:

QParser parser = QParser.getParser(rb.getQueryString(), defType, req);
QParser fqp = QParser.getParser(fq, null, req);

So the filter query parser will default to 'lucene' and besides local
params such as '{!regex}' the only way to force solr to use a different
parser is to override the lucene query parser in the solrconfig.xml

queryParser name=lucene class=solr.SomeOtherQParserPlugin /

That doesn't seem right. Are there other options I missed?

If not, should the defType be passed to instantiate fqp?

Thanks,

  roman


Re: Why filter query doesn't use the same query parser as the main query?

2013-04-17 Thread Roman Chyla
Makes sense, thanks. One more question. Shouldn't there be a mechanism to
define a default query parser?

something like (inside QParserPlugin):

public static String DEFAULT_QTYPE = "default"; // now it is LuceneQParserPlugin.NAME

public static final Object[] standardPlugins = {
    DEFAULT_QTYPE, LuceneQParserPlugin.class,
    LuceneQParserPlugin.NAME, LuceneQParserPlugin.class,
    ...
}

in this way we can use solrconfig.xml to override the default qparser

Or does that break some assumptions?

roman



On Wed, Apr 17, 2013 at 8:34 AM, Yonik Seeley yo...@lucidworks.com wrote:

 On Tue, Apr 16, 2013 at 9:44 PM, Roman Chyla roman.ch...@gmail.com
 wrote:
  Is there some profound reason why the defType is not passed onto the
 filter
  query?

 defType is a convenience so that the main query parameter q can
 directly be the user query (without specifying it's type like
 edismax).
 Filter queries are normally machine generated.

 -Yonik
 http://lucidworks.com



Re: List of Solr Query Parsers

2013-05-06 Thread Roman Chyla
Hi Jan,
Please add this one http://29min.wordpress.com/category/antlrqueryparser/
- I can't edit the wiki

This parser is written with ANTLR on top of the Lucene modern query parser.
There is a version which implements the Lucene standard QP, as well as a version
which includes proximity operators, multi-token synonym handling and all of
the Solr qparsers via a function syntax - i.e. for a query like: multi synonym
NEAR/5 edismax(foo)

I would like to create a JIRA ticket soon

Thanks

Roman
On 6 May 2013 09:21, Jan Høydahl jan@cominvent.com wrote:

 Hi,

 I just added a Wiki page to try to gather a list of all known Solr query
 parsers in one place, both those which are part of Solr and those in JIRA
 or 3rd party.

   http://wiki.apache.org/solr/QueryParser

 If you known about other cool parsers out there, please add to the list.

 --
 Jan Høydahl, search solution architect
 Cominvent AS - www.cominvent.com




Re: List of Solr Query Parsers

2013-05-06 Thread Roman Chyla
Hi Jan,
My login is RomanChyla
Thanks,

Roman
On 6 May 2013 10:00, Jan Høydahl jan@cominvent.com wrote:

 Hi Roman,

 This sounds great! Please register as a user on the WIKI and give us your
 username here, then we'll grant you editing karma so you can edit the page
 yourself! The NEAR/5 syntax is really something I think we should get into
 the default lucene parser. Can't wait to have a look at your code.

 --
 Jan Høydahl, search solution architect
 Cominvent AS - www.cominvent.com

 6. mai 2013 kl. 15:41 skrev Roman Chyla roman.ch...@gmail.com:

  Hi Jan,
  Please add this one
 http://29min.wordpress.com/category/antlrqueryparser/
  - I can't edit the wiki
 
  This parser is written with ANTLR and on top of lucene modern query
 parser.
  There is a version which implements Lucene standard QP as well as a
 version
  which includes proximity operators, multi token synonym handling and all
 of
  solr qparsers using function syntax - ie,. for a query like: multi
 synonym
  NEAR/5 edismax(foo)
 
  I would like to create a JIRA ticket soon
 
  Thanks
 
  Roman
  On 6 May 2013 09:21, Jan Høydahl jan@cominvent.com wrote:
 
  Hi,
 
  I just added a Wiki page to try to gather a list of all known Solr query
  parsers in one place, both those which are part of Solr and those in
 JIRA
  or 3rd party.
 
   http://wiki.apache.org/solr/QueryParser
 
  If you known about other cool parsers out there, please add to the list.
 
  --
  Jan Høydahl, search solution architect
  Cominvent AS - www.cominvent.com
 
 




RE: Solr Cloud with large synonyms.txt

2013-05-07 Thread Roman Chyla
We have synonym files bigger than 5MB, so even with compression that would
probably be failing (we are not using solr cloud yet)
Roman
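
A side note on the jute.maxbuffer workaround mentioned below: as far as I can
tell it has to be passed to every ZooKeeper server *and* to every client JVM,
i.e. Solr as well - roughly like this (the 4MB value is only an example):

# ZooKeeper side, e.g. in conf/java.env picked up by zkServer.sh
SERVER_JVMFLAGS="-Djute.maxbuffer=4194304"

# Solr side, on each node that talks to ZK
java -Djute.maxbuffer=4194304 -DzkHost=... -jar start.jar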
On 6 May 2013 23:09, David Parks davidpark...@yahoo.com wrote:

 Wouldn't it make more sense to only store a pointer to a synonyms file in
 zookeeper? Maybe just make the synonyms file accessible via http so other
 boxes can copy it if needed? Zookeeper was never meant for storing
 significant amounts of data.


 -Original Message-
 From: Jan Høydahl [mailto:jan@cominvent.com]
 Sent: Tuesday, May 07, 2013 4:35 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Solr Cloud with large synonyms.txt

 See discussion here
 http://lucene.472066.n3.nabble.com/gt-1MB-file-to-Zookeeper-td3958614.html

 One idea was compression. Perhaps if we add gzip support to SynonymFilter
 it
 can read synonyms.txt.gz which would then fit larger raw dicts?

 --
 Jan Høydahl, search solution architect
 Cominvent AS - www.cominvent.com

 6. mai 2013 kl. 18:32 skrev Son Nguyen s...@trancorp.com:

  Hello,
 
  I'm building a Solr Cloud (version 4.1.0) with 2 shards and a Zookeeper
 (the Zookeeer is on different machine, version 3.4.5).
  I've tried to start with a 1.7MB synonyms.txt, but got a
 ConnectionLossException:
  Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException:
 KeeperErrorCode = ConnectionLoss for /configs/solr1/synonyms.txt
 at
 org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
 at
 org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
 at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1266)
 at
 org.apache.solr.common.cloud.SolrZkClient$8.execute(SolrZkClient.java:270)
 at
 org.apache.solr.common.cloud.SolrZkClient$8.execute(SolrZkClient.java:267)
 at

 org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java
 :65)
 at
 org.apache.solr.common.cloud.SolrZkClient.setData(SolrZkClient.java:267)
 at
 org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:436)
 at
 org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:315)
 at
 org.apache.solr.cloud.ZkController.uploadToZK(ZkController.java:1135)
 at
 org.apache.solr.cloud.ZkController.uploadConfigDir(ZkController.java:955)
 at
 org.apache.solr.core.CoreContainer.initZooKeeper(CoreContainer.java:285)
 ... 43 more
 
  I did some researches on internet and found out that because Zookeeper
 znode size limit is 1MB. I tried to increase the system property
 jute.maxbuffer but it won't work.
  Does anyone have experience of dealing with it?
 
  Thanks,
  Son




RE: Solr Cloud with large synonyms.txt

2013-05-08 Thread Roman Chyla
David, have you seen the finite state automata the synonym lookup is built
on? The lookup is very efficient and fast. You have a point though, it is
going to fail for someone.
Roman
On 8 May 2013 03:11, David Parks davidpark...@yahoo.com wrote:

 I can see your point, though I think edge cases would be one concern, if
 someone *can* create a very large synonyms file, someone *will* create that
 file.  What  would you set the zookeeper max data size to be? 50MB? 100MB?
 Someone is going to do something bad if there's nothing to tell them not
 to.
 Today solr cloud just crashes if you try to create a modest sized synonyms
 file, clearly at a minimum some zookeeper settings should be configured out
 of the box.  Any reasonable setting you come up with for zookeeper is
 virtually guaranteed to fail for some percentage of users over a reasonably
 sized user-base (which solr has).

 What if I plugged in a 200MB synonyms file just for testing purposes (I
 don't care about performance implications)?  I don't think most users would
 catch the footnote in the docs that calls out a max synonyms file size.

 Dave


 -Original Message-
 From: Mark Miller [mailto:markrmil...@gmail.com]
 Sent: Tuesday, May 07, 2013 11:53 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Solr Cloud with large synonyms.txt

 I'm not so worried about the large file in zk issue myself.

 The concern is that you start storing and accessing lots of large files in
 ZK. This is not what it was made for, and everything stays in RAM, so they
 guard against this type of usage.

 We are talking about a config file that is loaded on Core load though. It's
 uploaded and read very rarely. On modern hardware and networks, making that
 file 5MB rather than 1MB is not going to ruin your day. It just won't. Solr
 does not use ZooKeeper heavily - in a steady state cluster, it doesn't read
 or write from ZooKeeper at all to any degree that registers. I'm going to
 have to see problems loading these larger config files from ZooKeeper
 before
 I'm worried that it's a problem.

 - Mark

 On May 7, 2013, at 12:21 PM, Son Nguyen s...@trancorp.com wrote:

  Mark,
 
  I tried to set that property on both ZK (I have only one ZK instance) and
 Solr, but it still didn't work.
  But I read somewhere that ZK is not really designed for keeping large
 data
 files, so this solution - increasing jute.maxbuffer (if I can implement it)
 should be just temporary.
 
  Son
 
  -Original Message-
  From: Mark Miller [mailto:markrmil...@gmail.com]
  Sent: Tuesday, May 07, 2013 9:35 PM
  To: solr-user@lucene.apache.org
  Subject: Re: Solr Cloud with large synonyms.txt
 
 
  On May 7, 2013, at 10:24 AM, Mark Miller markrmil...@gmail.com wrote:
 
 
  On May 6, 2013, at 12:32 PM, Son Nguyen s...@trancorp.com wrote:
 
  I did some researches on internet and found out that because Zookeeper
 znode size limit is 1MB. I tried to increase the system property
 jute.maxbuffer but it won't work.
  Does anyone have experience of dealing with it?
 
  Perhaps hit up the ZK list? They doc it as simply raising
 jute.maxbuffer,
 though you have to do it for each ZK instance.
 
  - Mark
 
 
  the system property must be set on all servers and clients otherwise
 problems will arise.
 
  Make sure you try passing it both to ZK *and* to Solr.
 
  - Mark
 




Re: Portability of Solr index

2013-05-10 Thread Roman Chyla
Hi Mukesh,
This seems like something the Lucene developers should be aware of - you have
probably spent quite some time finding the problem/solution. Could you create a
JIRA ticket?

Roman
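
For anyone else bitten by the issue described below: sun.misc.BASE64Encoder
wraps its output every 76 characters using the platform line separator, while
commons-codec lets you encode without chunking, e.g.:

import org.apache.commons.codec.binary.Base64;
import org.apache.commons.codec.binary.StringUtils;

// no chunking, so no platform-dependent newlines end up in the indexed value
String encoded = StringUtils.newStringUtf8(Base64.encodeBase64(bytes));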
On 10 May 2013 03:29, mukesh katariya mukesh.katar...@e-zest.in wrote:

 There is a problem with Base64 encoding. There is a project specific
 requirement where i need to do some processing on solr string field type
 and
 then base64 encode it. I was using Sun's base64 encoder which is dependent
 on the JRE of the system. So when i used to index the base64 it was adding
 system specific new line character after every 77 characters.   I googled a
 bit and changed the base64 encoder to apache codec  for base64 encoding.
 this fixed the problem.


 Thanks for all your time and help.

 Best Regards
 Mukesh Katariya



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Portability-of-Solr-index-tp4061829p4062230.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: List of Solr Query Parsers

2013-05-22 Thread Roman Chyla
Hello,
I have just created a new JIRA issue. If you are interested in trying out
the new query parser, please visit:
https://issues.apache.org/jira/browse/LUCENE-5014
Thanks,

roman

On Mon, May 6, 2013 at 5:36 PM, Jan Høydahl jan@cominvent.com wrote:

 Added. Please try editing the page now.

 --
 Jan Høydahl, search solution architect
 Cominvent AS - www.cominvent.com

 6. mai 2013 kl. 19:58 skrev Roman Chyla roman.ch...@gmail.com:

  Hi Jan,
  My login is RomanChyla
  Thanks,
 
  Roman
  On 6 May 2013 10:00, Jan Høydahl jan@cominvent.com wrote:
 
  Hi Roman,
 
  This sounds great! Please register as a user on the WIKI and give us
 your
  username here, then we'll grant you editing karma so you can edit the
 page
  yourself! The NEAR/5 syntax is really something I think we should get
 into
  the default lucene parser. Can't wait to have a look at your code.
 
  --
  Jan Høydahl, search solution architect
  Cominvent AS - www.cominvent.com
 
  6. mai 2013 kl. 15:41 skrev Roman Chyla roman.ch...@gmail.com:
 
  Hi Jan,
  Please add this one
  http://29min.wordpress.com/category/antlrqueryparser/
  - I can't edit the wiki
 
  This parser is written with ANTLR and on top of lucene modern query
  parser.
  There is a version which implements Lucene standard QP as well as a
  version
  which includes proximity operators, multi token synonym handling and
 all
  of
  solr qparsers using function syntax - ie,. for a query like: multi
  synonym
  NEAR/5 edismax(foo)
 
  I would like to create a JIRA ticket soon
 
  Thanks
 
  Roman
  On 6 May 2013 09:21, Jan Høydahl jan@cominvent.com wrote:
 
  Hi,
 
  I just added a Wiki page to try to gather a list of all known Solr
 query
  parsers in one place, both those which are part of Solr and those in
  JIRA
  or 3rd party.
 
  http://wiki.apache.org/solr/QueryParser
 
  If you known about other cool parsers out there, please add to the
 list.
 
  --
  Jan Høydahl, search solution architect
  Cominvent AS - www.cominvent.com
 
 
 
 




Re: Prevention of heavy wildcard queries

2013-05-27 Thread Roman Chyla
You are right that starting to parse the query before the query component
can soon get very ugly and complicated. You should take advantage of the
flex parser; it is already in lucene contrib - but if you are interested in
the better version, look at
https://issues.apache.org/jira/browse/LUCENE-5014

The way you can solve this is:

1. use the standard syntax grammar (which allows *foo*)
2. add (or modify) a WildcardQueryNodeProcessor to disallow that case, raise
an error, etc.

this way, you are changing semantics - but don't need to touch the syntax
definition; of course, you may also change the grammar and allow only one
instance of wildcard (or some combination) but for that you should probably
use LUCENE-5014

roman

On Mon, May 27, 2013 at 2:18 PM, Isaac Hebsh isaac.he...@gmail.com wrote:

 Hi.

 Searching terms with wildcard in their start, is solved with
 ReversedWildcardFilterFactory. But, what about terms with wildcard in both
 start AND end?

 This query is heavy, and I want to disallow such queries from my users.

 I'm looking for a way to cause these queries to fail.
 I guess there is no built-in support for my need, so it is OK to write a
 new solution.

 My current plan is to create a search component (which will run before
 QueryComponent). It should analyze the query string, and to drop the query
 if too heavy wildcard are found.

 Another option is to create a query parser, which wraps the current
 (specified or default) qparser, and does the same work as above.

 These two options require an analysis of the query text, which might be an
 ugly work (just think about nested queries [using _query_], OR even a lot
 of more basic scenarios like quoted terms, etc.)

 Am I missing a simple and clean way to do this?
 What would you do?

 P.S. if no simple solution exists, timeAllowed limit is the best
 work-around I could think about. Any other suggestions?



Re: Prevention of heavy wildcard queries

2013-05-27 Thread Roman Chyla
Hi Isaac,
it is as you say, with the exception that you create a QParserPlugin, not a
search component

* create QParserPlugin, give it some name, eg. 'nw'
* make a copy of the pipeline - your component should be at the same place,
or just above, the wildcard processor

also make sure you are setting your qparser for FQ queries, ie.
fq={!nw}foo
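
To make it concrete, a rough sketch of the 'nw' wrapper (the class name is
mine, and the string-level check is deliberately naive - inspecting the parsed
query nodes, as discussed above, is the cleaner route):

import org.apache.lucene.search.Query;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.search.QParser;
import org.apache.solr.search.QParserPlugin;
import org.apache.solr.search.SyntaxError;

public class NoDoubleWildcardQParserPlugin extends QParserPlugin {

  @Override
  public void init(NamedList args) {}

  @Override
  public QParser createParser(final String qstr, final SolrParams localParams,
                              final SolrParams params, final SolrQueryRequest req) {
    return new QParser(qstr, localParams, params, req) {
      @Override
      public Query parse() throws SyntaxError {
        // reject terms carrying a wildcard at both ends
        for (String tok : qstr.split("\\s+")) {
          String t = tok.contains(":") ? tok.substring(tok.indexOf(':') + 1) : tok;
          if (t.length() > 2 && t.startsWith("*") && t.endsWith("*")) {
            throw new SyntaxError("Wildcard on both ends is not allowed: " + tok);
          }
        }
        // delegate the real parsing to the standard parser
        return QParser.getParser(qstr, "lucene", req).parse();
      }
    };
  }
}

Registered in solrconfig.xml as
<queryParser name="nw" class="com.example.NoDoubleWildcardQParserPlugin"/>
(package name made up) and then used as q={!nw}... and fq={!nw}... as above.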


On Mon, May 27, 2013 at 5:01 PM, Isaac Hebsh isaac.he...@gmail.com wrote:

 Thanks Roman.
 Based on some of your suggestions, will the steps below do the work?

 * Create (and register) a new SearchComponent
 * In its prepare method: Do for Q and all of the FQs (so this
 SearchComponent should run AFTER QueryComponent, in order to see all of the
 FQs)
 * Create org.apache.lucene.queryparser.flexible.core.StandardQueryParser,
 with a special implementation of QueryNodeProcessorPipeline, which contains
 my NodeProcessor in the top of its list.
 * Set my analyzer into that StandardQueryParser
 * My NodeProcessor will be called for each term in the query, so it can
 throw an exception if a (basic) querynode contains wildcard in both start
 and end of the term.

 Do I have a way to avoid from reimplementing the whole StandardQueryParser
 class?


you can try subclassing it, if it allows it


 Will this work for both LuceneQParser and EdismaxQParser queries?


this will not work for edismax, nothing but changing the edismax qparser
will do the trick



 Any other solution/work-around? How do other production environments of
 Solr overcome this issue?


you can also try modifying the standard solr parser, or even the JavaCC
generated classes
I believe many people do just that (or some sort of preprocessing)

roman




 On Mon, May 27, 2013 at 10:15 PM, Roman Chyla roman.ch...@gmail.com
 wrote:

  You are right that starting to parse the query before the query component
  can get soon very ugly and complicated. You should take advantage of the
  flex parser, it is already in lucene contrib - but if you are interested
 in
  the better version, look at
  https://issues.apache.org/jira/browse/LUCENE-5014
 
  The way you can solve this is:
 
  1. use the standard syntax grammar (which allows *foo*)
  2. add (or modify) WildcardQueryNodeProcessor to dis/allow that case, or
  raise error etc
 
  this way, you are changing semantics - but don't need to touch the syntax
  definition; of course, you may also change the grammar and allow only one
  instance of wildcard (or some combination) but for that you should
 probably
  use LUCENE-5014
 
  roman
 
  On Mon, May 27, 2013 at 2:18 PM, Isaac Hebsh isaac.he...@gmail.com
  wrote:
 
   Hi.
  
   Searching terms with wildcard in their start, is solved with
   ReversedWildcardFilterFactory. But, what about terms with wildcard in
  both
   start AND end?
  
   This query is heavy, and I want to disallow such queries from my users.
  
   I'm looking for a way to cause these queries to fail.
   I guess there is no built-in support for my need, so it is OK to write
 a
   new solution.
  
   My current plan is to create a search component (which will run before
   QueryComponent). It should analyze the query string, and to drop the
  query
   if too heavy wildcard are found.
  
   Another option is to create a query parser, which wraps the current
   (specified or default) qparser, and does the same work as above.
  
   These two options require an analysis of the query text, which might be
  an
   ugly work (just think about nested queries [using _query_], OR even a
 lot
   of more basic scenarios like quoted terms, etc.)
  
   Am I missing a simple and clean way to do this?
   What would you do?
  
   P.S. if no simple solution exists, timeAllowed limit is the best
   work-around I could think about. Any other suggestions?
  
 



Re: Solr/Lucene Analayzer That Writes To File

2013-05-28 Thread Roman Chyla
You can store them and then run different analyzer chains on them (the field
needs to be stored, it doesn't need to be indexed)

I'd probably use the collector pattern


// 'se' is the (Solr)IndexSearcher, 'analyzer' the chain you want to run,
// 'fieldsToLoad' the set of stored field names to re-analyze
se.search(new MatchAllDocsQuery(), new Collector() {
  private AtomicReader reader;

  @Override
  public boolean acceptsDocsOutOfOrder() {
    return true;
  }

  @Override
  public void collect(int docId) {
    try {
      Document d = reader.document(docId, fieldsToLoad);
      for (String f : fieldsToLoad) {
        for (String s : d.getValues(f)) {
          TokenStream ts = analyzer.tokenStream(f, new StringReader(s));
          CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
          ts.reset();
          while (ts.incrementToken()) {
            // do something with term.toString() - e.g. write it to your file
          }
          ts.end();
          ts.close();
        }
      }
    } catch (IOException e) {
      // pass
    }
  }

  @Override
  public void setNextReader(AtomicReaderContext context) {
    this.reader = context.reader();
  }

  @Override
  public void setScorer(org.apache.lucene.search.Scorer scorer) {
    // scores are not needed here
  }
});

// or, without going through the index at all, run the analyzer directly on the
// raw value ("xxx" stands for your input) and persist the tokens yourself:
TokenStream ts = analyzer.tokenStream(data.targetField, new StringReader(xxx));
ts.reset();



On Mon, May 27, 2013 at 9:37 AM, Furkan KAMACI furkankam...@gmail.comwrote:

 Hi;

 I want to use Solr for an academical research. One step of my purpose is I
 want to store tokens in a file (I will store it at a database later) and I
 don't want to index them. For such kind of purposes should I use core
 Lucene or Solr? Is there an example for writing a custom analyzer and just
 storing tokens in a file?



Re: how are you handling killer queries?

2013-06-03 Thread Roman Chyla
I think you should take a look at the TimeLimitingCollector (it is used
also inside SolrIndexSearcher).
My understanding is that it will stop your server from consuming
unnecessary resources.

--roman
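
In Solr it is exposed as the timeAllowed request parameter (in milliseconds),
e.g. q=foo&timeAllowed=5000 - when the collector runs out of time you get back
whatever was collected so far, and the response header carries
partialResults=true so the client can tell that the result set is incomplete.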


On Mon, Jun 3, 2013 at 4:39 AM, Bernd Fehling 
bernd.fehl...@uni-bielefeld.de wrote:

 How are you handling killer queries with solr?

 While solr/lucene (currently 4.2.1) is trying to do its best I see
 sometimes stupid queries
 in my logs, located with extremly long query time.

 Example:
 q=???+and+??+and+???+and++and+???+and+??

 I even get hits for this (hits=34091309 status=0 QTime=88667).

 But the jetty log says:
 WARN:oejs.Response:Committed before 500 {msg=Datenübergabe unterbrochen
  (broken pipe),trace=org.eclipse.jetty.io.EofException...
 org.eclipse.jetty.http.HttpGenerator.flushBuffer(HttpGenerator.java:838)|?...
 35 more|,code=500}
 WARN:oejs.ServletHandler:/solr/base/select
 java.lang.IllegalStateException: Committed
 at
 org.eclipse.jetty.server.Response.resetBuffer(Response.java:1136)

 Because I get hits and qtime the search is successful, right?

 But jetty/http has already closed the connection and solr doesn't know
 about this?

 How are you handling killer queries, just ignoring?
 Or something to tune (jetty config about timeout) or filter (query
 filtering)?

 Would be pleased to hear your comments.

 Bernd



Two instances of solr - the same datadir?

2013-06-04 Thread Roman Chyla
Hello,

I need your expert advice. I am thinking about running two instances of
solr that share the same data directory. The *reason* being: the indexing
instance is constantly building its cache after every commit (we have a big
cache) and this slows it down. But indexing doesn't need much RAM, only the
search does (and the server has lots of CPUs).

So, it is like having two solr instances

1. solr-indexing-master
2. solr-read-only-master

In the solrconfig.xml I can disable the update components; that should be fine.
However, I don't know how to 'trigger' index re-opening on (2) after the
commit happens on (1).

Ideally, the second instance could monitor the disk and re-open the index after
new files appear there. Do I have to implement a custom IndexReaderFactory?
Or something else?

Please note: I know about the replication, this usecase is IMHO slightly
different - in fact, write-only-master (1) is also a replication master

Googling turned out only this
http://comments.gmane.org/gmane.comp.jakarta.lucene.solr.user/71912 - no
pointers there.

But If I am approaching the problem wrongly, please don't hesitate to
're-educate' me :)

Thanks!

  roman


Re: Two instances of solr - the same datadir?

2013-06-04 Thread Roman Chyla
OK, so I have verified the two instances can run alongside, sharing the
same datadir

All update handlers are inaccessible in the read-only master:

<updateHandler class="solr.DirectUpdateHandler2"
               enable="${solr.can.write:true}">

java -Dsolr.can.write=false .

And I can reload the index manually:

curl "http://localhost:5005/solr/admin/cores?wt=json&action=RELOAD&core=collection1"


But this is not an ideal solution; I'd like for the read-only server to
discover index changes on its own. Any pointers?

Thanks,

  roman


On Tue, Jun 4, 2013 at 2:01 PM, Roman Chyla roman.ch...@gmail.com wrote:

 Hello,

 I need your expert advice. I am thinking about running two instances of
 solr that share the same datadirectory. The *reason* being: indexing
 instance is constantly building cache after every commit (we have a big
 cache) and this slows it down. But indexing doesn't need much RAM, only the
 search does (and server has lots of CPUs)

 So, it is like having two solr instances

 1. solr-indexing-master
 2. solr-read-only-master

 In the solrconfig.xml I can disable update components, It should be fine.
 However, I don't know how to 'trigger' index re-opening on (2) after the
 commit happens on (1).

 Ideally, the second instance could monitor the disk and re-open disk after
 new files appear there. Do I have to implement custom IndexReaderFactory?
 Or something else?

 Please note: I know about the replication, this usecase is IMHO slightly
 different - in fact, write-only-master (1) is also a replication master

 Googling turned out only this
 http://comments.gmane.org/gmane.comp.jakarta.lucene.solr.user/71912 - no
 pointers there.

 But If I am approaching the problem wrongly, please don't hesitate to
 're-educate' me :)

 Thanks!

   roman



Re: Two instances of solr - the same datadir?

2013-06-04 Thread Roman Chyla
Replication is fine, I am going to use it, but I wanted it for instances
*distributed* across several (physical) machines - but here I have one
physical machine, it has many cores. I want to run 2 instances of solr
because I think it has these benefits:

1) I can give less RAM to the writer (4GB), and use more RAM for the
searcher (28GB)
2) I can deactivate warming for the writer and keep it for the searcher
(this considerably speeds up indexing - each time we commit, the server is
rebuilding a citation network of 80M edges)
3) saving disk space and better OS caching (OS should be able to use more
RAM for the caching, which should result in faster operations - the two
processes are accessing the same index)

Maybe I should just forget it and go with the replication, but it doesn't
'feel right' IFF it is on the same physical machine. And Lucene
specifically has a method for discovering changes and re-opening the index
(DirectoryReader.openIfChanged)

Am I not seeing something?

roman
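
(For context, the Lucene-level call I have in mind is nothing more than:

DirectoryReader newReader = DirectoryReader.openIfChanged(oldReader);
if (newReader != null) {
  // the files on disk changed: build a searcher over the new reader and
  // close the old reader once in-flight requests have finished
  IndexSearcher newSearcher = new IndexSearcher(newReader);
}

- the question is really just where to hook that into Solr.)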



On Tue, Jun 4, 2013 at 5:30 PM, Jason Hellman 
jhell...@innoventsolutions.com wrote:

 Roman,

 Could you be more specific as to why replication doesn't meet your
 requirements?  It was geared explicitly for this purpose, including the
 automatic discovery of changes to the data on the index master.

 Jason

 On Jun 4, 2013, at 1:50 PM, Roman Chyla roman.ch...@gmail.com wrote:

  OK, so I have verified the two instances can run alongside, sharing the
  same datadir
 
  All update handlers are unaccessible in the read-only master
 
  updateHandler class=solr.DirectUpdateHandler2
  enable=${solr.can.write:true}
 
  java -Dsolr.can.write=false .
 
  And I can reload the index manually:
 
  curl 
 
 http://localhost:5005/solr/admin/cores?wt=jsonaction=RELOADcore=collection1
  
 
  But this is not an ideal solution; I'd like for the read-only server to
  discover index changes on its own. Any pointers?
 
  Thanks,
 
   roman
 
 
  On Tue, Jun 4, 2013 at 2:01 PM, Roman Chyla roman.ch...@gmail.com
 wrote:
 
  Hello,
 
  I need your expert advice. I am thinking about running two instances of
  solr that share the same datadirectory. The *reason* being: indexing
  instance is constantly building cache after every commit (we have a big
  cache) and this slows it down. But indexing doesn't need much RAM, only
 the
  search does (and server has lots of CPUs)
 
  So, it is like having two solr instances
 
  1. solr-indexing-master
  2. solr-read-only-master
 
  In the solrconfig.xml I can disable update components, It should be
 fine.
  However, I don't know how to 'trigger' index re-opening on (2) after the
  commit happens on (1).
 
  Ideally, the second instance could monitor the disk and re-open disk
 after
  new files appear there. Do I have to implement custom
 IndexReaderFactory?
  Or something else?
 
  Please note: I know about the replication, this usecase is IMHO slightly
  different - in fact, write-only-master (1) is also a replication master
 
  Googling turned out only this
  http://comments.gmane.org/gmane.comp.jakarta.lucene.solr.user/71912 -
 no
  pointers there.
 
  But If I am approaching the problem wrongly, please don't hesitate to
  're-educate' me :)
 
  Thanks!
 
   roman
 




Re: Two instances of solr - the same datadir?

2013-06-05 Thread Roman Chyla
Hi Peter,

Thank you, I am glad to read that this use case is not alien.

I'd like to make the second instance (searcher) completely read-only, so I
have disabled all the components that can write.

(being lazy ;)) I'll probably use
http://wiki.apache.org/solr/CollectionDistribution to call the curl after
commit, or write some IndexReaderFactory that checks for changes

The problem with calling the 'core reload' is that it seems like a lot of work
for just opening a new searcher, eeekkk... somewhere I read that it is cheap
to reload a core, but re-opening the index searcher must definitely be
cheaper...

roman


On Wed, Jun 5, 2013 at 4:03 AM, Peter Sturge peter.stu...@gmail.com wrote:

 Hi,
 We use this very same scenario to great effect - 2 instances using the same
 dataDir with many cores - 1 is a writer (no caching), the other is a
 searcher (lots of caching).
 To get the searcher to see the index changes from the writer, you need the
 searcher to do an empty commit - i.e. you invoke a commit with 0 documents.
 This will refresh the caches (including autowarming), [re]build the
 relevant searchers etc. and make any index changes visible to the RO
 instance.
 Also, make sure to use <lockType>native</lockType> in solrconfig.xml to
 ensure the two instances don't try to commit at the same time.
 There are several ways to trigger a commit:
 Call commit() periodically within your own code.
 Use autoCommit in solrconfig.xml.
 Use an RPC/IPC mechanism between the 2 instance processes to tell the
 searcher the index has changed, then call commit when called (more complex
 coding, but good if the index changes on an ad-hoc basis).
 Note, doing things this way isn't really suitable for an NRT environment.

 HTH,
 Peter



 On Tue, Jun 4, 2013 at 11:23 PM, Roman Chyla roman.ch...@gmail.com
 wrote:

  Replication is fine, I am going to use it, but I wanted it for instances
  *distributed* across several (physical) machines - but here I have one
  physical machine, it has many cores. I want to run 2 instances of solr
  because I think it has these benefits:
 
  1) I can give less RAM to the writer (4GB), and use more RAM for the
  searcher (28GB)
  2) I can deactivate warming for the writer and keep it for the searcher
  (this considerably speeds up indexing - each time we commit, the server
 is
  rebuilding a citation network of 80M edges)
  3) saving disk space and better OS caching (OS should be able to use more
  RAM for the caching, which should result in faster operations - the two
  processes are accessing the same index)
 
  Maybe I should just forget it and go with the replication, but it doesn't
  'feel right' IFF it is on the same physical machine. And Lucene
  specifically has a method for discovering changes and re-opening the
 index
  (DirectoryReader.openIfChanged)
 
  Am I not seeing something?
 
  roman
 
 
 
  On Tue, Jun 4, 2013 at 5:30 PM, Jason Hellman 
  jhell...@innoventsolutions.com wrote:
 
   Roman,
  
   Could you be more specific as to why replication doesn't meet your
   requirements?  It was geared explicitly for this purpose, including the
   automatic discovery of changes to the data on the index master.
  
   Jason
  
   On Jun 4, 2013, at 1:50 PM, Roman Chyla roman.ch...@gmail.com wrote:
  
OK, so I have verified the two instances can run alongside, sharing
 the
same datadir
   
All update handlers are unaccessible in the read-only master
   
updateHandler class=solr.DirectUpdateHandler2
enable=${solr.can.write:true}
   
java -Dsolr.can.write=false .
   
And I can reload the index manually:
   
curl 
   
  
 
 http://localhost:5005/solr/admin/cores?wt=jsonaction=RELOADcore=collection1

   
But this is not an ideal solution; I'd like for the read-only server
 to
discover index changes on its own. Any pointers?
   
Thanks,
   
 roman
   
   
On Tue, Jun 4, 2013 at 2:01 PM, Roman Chyla roman.ch...@gmail.com
   wrote:
   
Hello,
   
I need your expert advice. I am thinking about running two instances
  of
solr that share the same datadirectory. The *reason* being: indexing
instance is constantly building cache after every commit (we have a
  big
cache) and this slows it down. But indexing doesn't need much RAM,
  only
   the
search does (and server has lots of CPUs)
   
So, it is like having two solr instances
   
1. solr-indexing-master
2. solr-read-only-master
   
In the solrconfig.xml I can disable update components, It should be
   fine.
However, I don't know how to 'trigger' index re-opening on (2) after
  the
commit happens on (1).
   
Ideally, the second instance could monitor the disk and re-open disk
   after
new files appear there. Do I have to implement custom
   IndexReaderFactory?
Or something else?
   
Please note: I know about the replication, this usecase is IMHO
  slightly
different - in fact, write-only-master (1) is also a replication

Re: Two instances of solr - the same datadir?

2013-06-05 Thread Roman Chyla
So here it is for a record how I am solving it right now:

Write-master is started with: -Dmontysolr.warming.enabled=false
-Dmontysolr.write.master=true -Dmontysolr.read.master=http://localhost:5005
Read-master is started with: -Dmontysolr.warming.enabled=true
-Dmontysolr.write.master=false


solrconfig.xml changes:

1. all index changing components have this bit,
enable=${montysolr.master:true} - ie.

<updateHandler class="solr.DirectUpdateHandler2"
               enable="${montysolr.master:true}">

2. for cache warming de/activation

<listener event="newSearcher"
          class="solr.QuerySenderListener"
          enable="${montysolr.enable.warming:true}">...

3. to trigger refresh of the read-only-master (from write-master):

<listener event="postCommit"
          class="solr.RunExecutableListener"
          enable="${montysolr.master:true}">
  <str name="exe">curl</str>
  <str name="dir">.</str>
  <bool name="wait">false</bool>
  <arr name="args"><str>${montysolr.read.master:http://localhost}/solr/admin/cores?wt=json&amp;action=RELOAD&amp;core=collection1</str></arr>
</listener>

This works. I still don't like reloading the whole core, but it seems
like the easiest thing to do now.

-- roman


On Wed, Jun 5, 2013 at 12:07 PM, Roman Chyla roman.ch...@gmail.com wrote:

 Hi Peter,

 Thank you, I am glad to read that this usecase is not alien.

 I'd like to make the second instance (searcher) completely read-only, so I
 have disabled all the components that can write.

 (being lazy ;)) I'll probably use
 http://wiki.apache.org/solr/CollectionDistribution to call the curl after
 commit, or write some IndexReaderFactory that checks for changes

 The problem with calling the 'core reload' - is that it seems lots of work
 for just opening a new searcher, eeekkk...somewhere I read that it is cheap
 to reload a core, but re-opening the index searches must be definitely
 cheaper...

 roman


 On Wed, Jun 5, 2013 at 4:03 AM, Peter Sturge peter.stu...@gmail.comwrote:

 Hi,
 We use this very same scenario to great effect - 2 instances using the
 same
 dataDir with many cores - 1 is a writer (no caching), the other is a
 searcher (lots of caching).
 To get the searcher to see the index changes from the writer, you need the
 searcher to do an empty commit - i.e. you invoke a commit with 0
 documents.
 This will refresh the caches (including autowarming), [re]build the
 relevant searchers etc. and make any index changes visible to the RO
 instance.
 Also, make sure to use lockTypenative/lockType in solrconfig.xml to
 ensure the two instances don't try to commit at the same time.
 There are several ways to trigger a commit:
 Call commit() periodically within your own code.
 Use autoCommit in solrconfig.xml.
 Use an RPC/IPC mechanism between the 2 instance processes to tell the
 searcher the index has changed, then call commit when called (more complex
 coding, but good if the index changes on an ad-hoc basis).
 Note, doing things this way isn't really suitable for an NRT environment.

 HTH,
 Peter



 On Tue, Jun 4, 2013 at 11:23 PM, Roman Chyla roman.ch...@gmail.com
 wrote:

  Replication is fine, I am going to use it, but I wanted it for instances
  *distributed* across several (physical) machines - but here I have one
  physical machine, it has many cores. I want to run 2 instances of solr
  because I think it has these benefits:
 
  1) I can give less RAM to the writer (4GB), and use more RAM for the
  searcher (28GB)
  2) I can deactivate warming for the writer and keep it for the searcher
  (this considerably speeds up indexing - each time we commit, the server
 is
  rebuilding a citation network of 80M edges)
  3) saving disk space and better OS caching (OS should be able to use
 more
  RAM for the caching, which should result in faster operations - the two
  processes are accessing the same index)
 
  Maybe I should just forget it and go with the replication, but it
 doesn't
  'feel right' IFF it is on the same physical machine. And Lucene
  specifically has a method for discovering changes and re-opening the
 index
  (DirectoryReader.openIfChanged)
 
  Am I not seeing something?
 
  roman
 
 
 
  On Tue, Jun 4, 2013 at 5:30 PM, Jason Hellman 
  jhell...@innoventsolutions.com wrote:
 
   Roman,
  
   Could you be more specific as to why replication doesn't meet your
   requirements?  It was geared explicitly for this purpose, including
 the
   automatic discovery of changes to the data on the index master.
  
   Jason
  
   On Jun 4, 2013, at 1:50 PM, Roman Chyla roman.ch...@gmail.com
 wrote:
  
OK, so I have verified the two instances can run alongside, sharing
 the
same datadir
   
All update handlers are unaccessible in the read-only master
   
updateHandler class=solr.DirectUpdateHandler2
enable=${solr.can.write:true}
   
java -Dsolr.can.write=false .
   
And I can reload the index manually:
   
curl 
   
  
 
 http://localhost:5005/solr/admin/cores?wt=jsonaction=RELOADcore=collection1

Re: Two instances of solr - the same datadir?

2013-06-07 Thread Roman Chyla
I have autoCommit set at 40k records / 1800 seconds, but I have only tested
with manual commits; I don't see why it should work differently.
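For reference, that corresponds to something like this in solrconfig.xml
(maxTime is in milliseconds):

<autoCommit>
  <maxDocs>40000</maxDocs>
  <maxTime>1800000</maxTime>  <!-- 1800 seconds -->
</autoCommit>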
Roman
On 7 Jun 2013 20:52, Tim Vaillancourt t...@elementspace.com wrote:

 If it makes you feel better, I also considered this approach when I was in
 the same situation with a separate indexer and searcher on one Physical
 linux machine.

 My main concern was re-using the FS cache between both instances - If I
 replicated to myself there would be two independent copies of the index,
 FS-cached separately.

 I like the suggestion of using autoCommit to reload the index. If I'm
 reading that right, you'd set an autoCommit on 'zero docs changing', or
 just 'every N seconds'? Did that work?

 Best of luck!

 Tim


 On 5 June 2013 10:19, Roman Chyla roman.ch...@gmail.com wrote:

  So here it is for a record how I am solving it right now:
 
  Write-master is started with: -Dmontysolr.warming.enabled=false
  -Dmontysolr.write.master=true -Dmontysolr.read.master=
  http://localhost:5005
  Read-master is started with: -Dmontysolr.warming.enabled=true
  -Dmontysolr.write.master=false
 
 
  solrconfig.xml changes:
 
  1. all index changing components have this bit,
  enable=${montysolr.master:true} - ie.
 
  updateHandler class=solr.DirectUpdateHandler2
   enable=${montysolr.master:true}
 
  2. for cache warming de/activation
 
  listener event=newSearcher
class=solr.QuerySenderListener
enable=${montysolr.enable.warming:true}...
 
  3. to trigger refresh of the read-only-master (from write-master):
 
  listener event=postCommit
class=solr.RunExecutableListener
enable=${montysolr.master:true}
str name=execurl/str
str name=dir./str
bool name=waitfalse/bool
arr name=args str${montysolr.read.master:http://localhost
 
 
 }/solr/admin/cores?wt=jsonamp;action=RELOADamp;core=collection1/str/arr
  /listener
 
  This works, I still don't like the reload of the whole core, but it seems
  like the easiest thing to do now.
 
  -- roman
 
 
  On Wed, Jun 5, 2013 at 12:07 PM, Roman Chyla roman.ch...@gmail.com
  wrote:
 
   Hi Peter,
  
   Thank you, I am glad to read that this usecase is not alien.
  
   I'd like to make the second instance (searcher) completely read-only,
 so
  I
   have disabled all the components that can write.
  
   (being lazy ;)) I'll probably use
   http://wiki.apache.org/solr/CollectionDistribution to call the curl
  after
   commit, or write some IndexReaderFactory that checks for changes
  
   The problem with calling the 'core reload' - is that it seems lots of
  work
   for just opening a new searcher, eeekkk...somewhere I read that it is
  cheap
   to reload a core, but re-opening the index searches must be definitely
   cheaper...
  
   roman
  
  
   On Wed, Jun 5, 2013 at 4:03 AM, Peter Sturge peter.stu...@gmail.com
  wrote:
  
   Hi,
   We use this very same scenario to great effect - 2 instances using the
   same
   dataDir with many cores - 1 is a writer (no caching), the other is a
   searcher (lots of caching).
   To get the searcher to see the index changes from the writer, you need
  the
   searcher to do an empty commit - i.e. you invoke a commit with 0
   documents.
   This will refresh the caches (including autowarming), [re]build the
   relevant searchers etc. and make any index changes visible to the RO
   instance.
   Also, make sure to use lockTypenative/lockType in solrconfig.xml
 to
   ensure the two instances don't try to commit at the same time.
   There are several ways to trigger a commit:
   Call commit() periodically within your own code.
   Use autoCommit in solrconfig.xml.
   Use an RPC/IPC mechanism between the 2 instance processes to tell the
   searcher the index has changed, then call commit when called (more
  complex
   coding, but good if the index changes on an ad-hoc basis).
   Note, doing things this way isn't really suitable for an NRT
  environment.
  
   HTH,
   Peter
  
  
  
   On Tue, Jun 4, 2013 at 11:23 PM, Roman Chyla roman.ch...@gmail.com
   wrote:
  
Replication is fine, I am going to use it, but I wanted it for
  instances
*distributed* across several (physical) machines - but here I have
 one
physical machine, it has many cores. I want to run 2 instances of
 solr
because I think it has these benefits:
   
1) I can give less RAM to the writer (4GB), and use more RAM for the
searcher (28GB)
2) I can deactivate warming for the writer and keep it for the
  searcher
(this considerably speeds up indexing - each time we commit, the
  server
   is
rebuilding a citation network of 80M edges)
3) saving disk space and better OS caching (OS should be able to use
   more
RAM for the caching, which should result in faster operations - the
  two
processes are accessing the same index)
   
Maybe I should just forget it and go with the replication, but it
   doesn't
'feel right' IFF

Re: New operator.

2013-06-17 Thread Roman Chyla
Hello Yanis,

We are probably using something similar - 'functional operators', e.g.
edismax() to treat everything inside the brackets as an argument for
edismax, or pos() to search for authors based on their position. And
invenio(), which is exactly what you describe, gets results from an external
engine. Depending on the level of complexity, you may need any/all of the
following:

1. a query parser that understands the operator syntax and can build some
'external search' query object
2. the 'query object' that knows how to contact the external service and
return lucene docids - so you will need some translation
externalIds->luceneDocIds; you can, for example, index the same primary key
in both solr and the ext engine, and then use a cache for the mapping (a
small illustrative sketch of that translation is below)

To solve the 1, you could use the
https://issues.apache.org/jira/browse/LUCENE-5014 - sorry for the shameless
plug :) - but this is what we use and what i am familiar with, you can see
a grammar that gives you the 'functional operator' here - if you dig
deeper, you will see how it is building different query objects for
different operators:
https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/grammars/ADS.g

and here an example how to ask the external engine for results and return
lucene docids:
https://github.com/romanchyla/montysolr/blob/master/contrib/invenio/src/java/org/apache/lucene/search/InvenioWeight.java

It is a bit messy and you should probably ignore how we are getting the
results - just look at nextDoc().
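To make point 2 a bit more concrete, here is a minimal sketch (not the
MontySolr code, just an illustration) of the externalIds->luceneDocIds
translation, assuming both systems index the same primary key field and that
you cache the lookups:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

public class ExternalIdMapper {

  private final IndexSearcher searcher;
  private final String keyField;
  // NOTE: lucene docids are searcher-specific - throw this cache away when the searcher changes
  private final Map<String, Integer> cache = new HashMap<String, Integer>();

  public ExternalIdMapper(IndexReader reader, String keyField) {
    this.searcher = new IndexSearcher(reader);
    this.keyField = keyField;
  }

  /** Returns the lucene docid for one external primary key, or -1 if it is not indexed. */
  public int luceneDocId(String externalKey) throws IOException {
    Integer cached = cache.get(externalKey);
    if (cached != null) return cached;
    TopDocs hits = searcher.search(new TermQuery(new Term(keyField, externalKey)), 1);
    int docid = hits.totalHits == 0 ? -1 : hits.scoreDocs[0].doc;
    cache.put(externalKey, docid);
    return docid;
  }
}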

HTH,

  roman


On Mon, Jun 17, 2013 at 2:34 PM, Yanis Kakamaikis 
yanis.kakamai...@gmail.com wrote:

 Hi all,   thanks for your reply.
 I want to be able to ask a combined query,  a normal solr querym but one of
 the query fields should get it's answer not from within the solr engine,
 but from an external engine.
 the rest should work normaly with the ability to do more tasks on the
 answer like faceting for example.
 The external engine will use the same objects ids like solr, so the boolean
 query that uses this engine answer be executed correctly.
 For example, let say I want to find a person by his name, age, address, and
 also by his picture. I have a picture indexing engine, I want to create a
 combined query that will call this engine like other query field.   I hope
 it's more clear now...


 On Sun, Jun 16, 2013 at 4:02 PM, Jack Krupansky j...@basetechnology.com
 wrote:

  It all depends on what you mean by an operator.
 
  Start by describing in more detail what problem you are trying to solve.
 
  And how do you expect your users or applications to use this operator.
  Give some examples.
 
  Solr and Lucene do not have operators per say, except in query parser
  syntax, but that is hard-wired into the individual query parsers.
 
  -- Jack Krupansky
 
  -Original Message- From: Yanis Kakamaikis
  Sent: Sunday, June 16, 2013 2:01 AM
  To: solr-user@lucene.apache.org
  Subject: New operator.
 
 
  Hi all,I want to add a new operator to my solr.   I need that
 operator
  to call my proprietary engine and build an answer vector to solr, in a
 way
  that this vector will be part of the boolean query at the next step.
 How
  do I do that?
  Thanks
 



Re: Avoiding OOM fatal crash

2013-06-17 Thread Roman Chyla
I think you can modify the response writer and stream results instead of
building them first and then sending them in one go. I am using this
technique to dump millions of docs in JSON format - but in your case you may
have to figure out how to dump during streaming if you don't want to save
the data to disk first.
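A very rough sketch of what I mean, against the Solr 4.x response writer API
(the field names and JSON handling are simplified; real code must escape
values and load only the stored fields it needs):

import java.io.IOException;
import java.io.Writer;

import org.apache.lucene.document.Document;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.QueryResponseWriter;
import org.apache.solr.response.ResultContext;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.search.DocIterator;
import org.apache.solr.search.DocList;
import org.apache.solr.search.SolrIndexSearcher;

public class StreamingJsonWriter implements QueryResponseWriter {

  @Override
  public void write(Writer out, SolrQueryRequest req, SolrQueryResponse rsp) throws IOException {
    SolrIndexSearcher searcher = req.getSearcher();
    DocList docs = ((ResultContext) rsp.getValues().get("response")).docs;
    DocIterator it = docs.iterator();
    out.write("[");
    boolean first = true;
    while (it.hasNext()) {
      int docid = it.nextDoc();
      Document d = searcher.doc(docid);   // one document at a time
      if (!first) out.write(",");
      first = false;
      out.write("{\"id\":\"" + d.get("id") + "\"}");
      out.flush();                        // push bytes out instead of buffering the whole response
    }
    out.write("]");
  }

  @Override
  public String getContentType(SolrQueryRequest req, SolrQueryResponse rsp) {
    return "application/json";
  }

  @Override
  public void init(NamedList args) {
  }
}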

Roman
On 17 Jun 2013 20:02, Mark Miller markrmil...@gmail.com wrote:

 There is a java cmd line arg that lets you run a command on OOM - I'd
 configure it to log and kill -9 Solr. Then use runit or something to
 supervice Solr - so that if it's killed, it just restarts.

 I think that is the best way to deal with OOM's. Other than that, you have
 to write a middle layer and put limits on user requests before making Solr
 requests.

 - Mark

 On Jun 17, 2013, at 4:44 PM, Manuel Le Normand manuel.lenorm...@gmail.com
 wrote:

  Hello again,
 
  After a heavy query on my index (returning 100K docs in a single query)
 my
  JVM heap's floods and I get an JAVA OOM exception, and then that my
  GCcannot collect anything (GC
  overhead limit exceeded) as these memory chunks are not disposable.
 
  I want to afford queries like this, my concern is that this case
 provokes a
  total Solr crash, returning a 503 Internal Server Error while trying to *
  index.*
 
  Is there anyway to separate these two logics? I'm fine with solr not
 being
  able to return any response after returning this OOM, but I don't see the
  justification the query to flood JVM's internal (bounded) buffers for
  writings.
 
  Thanks,
  Manuel




Re: UnInverted multi-valued field

2013-06-19 Thread Roman Chyla
On Wed, Jun 19, 2013 at 5:30 AM, Jochen Lienhard 
lienh...@ub.uni-freiburg.de wrote:

 Hi @all.

 We have the problem that after an update the index takes to much time for
 'warm up'.

 We have some multivalued facet-fields and during the startup solr creates
 the messages:

 INFO: UnInverted multi-valued field {field=mt_facet,memSize=**
 18753256,tindexSize=54,time=**170,phase1=156,nTerms=17,**
 bigTerms=3,termInstances=**903276,uses=0}


 In the solconfig we use the facet.method 'fc'.
 We know, that the start-up with the method 'enum' is faster, but then the
 searches are very slow.

 How do you handle this problem?
 Or have you any idea for optimizing the warm up?
 Or what do you do after an update?


You probably know this, but just in case... you may use autowarming; the new
searcher will populate the cache, and only after the warm-up queries have
finished will it be exposed to the world. The old searcher continues to
handle requests in the meantime.
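For example, a newSearcher warming query that touches the facet field might
look like this (adjust the field and params to your setup):

<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="q">*:*</str>
      <str name="rows">0</str>
      <str name="facet">true</str>
      <str name="facet.field">mt_facet</str>
      <str name="facet.method">fc</str>
    </lst>
  </arr>
</listener>

The same block can be registered for the firstSearcher event so that the very
first searcher after startup gets warmed as well.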

roman



 Greetings

 Jochen

 --
 Dr. rer. nat. Jochen Lienhard
 Dezernat EDV

 Albert-Ludwigs-Universität Freiburg
 Universitätsbibliothek
 Rempartstr. 10-16  | Postfach 1629
 79098 Freiburg | 79016 Freiburg

 Telefon: +49 761 203-3908
 E-Mail: lienh...@ub.uni-freiburg.de
 Internet: www.ub.uni-freiburg.de




Re: cores sharing an instance

2013-06-29 Thread Roman Chyla
Cores can be reloaded - they live inside the solr core loader (I forgot the
exact name) and they will have different classloaders (that's a servlet
thing) - so if you want singletons you must load them outside of the core,
using a parent classloader. In case of jetty, this means writing your own
jetty initialization or config to force shared class loaders, or finding a
place inside solr, before the core is created. Google for montysolr to see
an example of the first approach.

But unless you really have no other choice, using singletons is IMHO a bad
idea in this case.

Roman

On 29 Jun 2013 10:18, Peyman Faratin pey...@robustlinks.com wrote:

 its the singleton pattern, where in my case i want an object (which is
RAM expensive) to be a centralized coordinator of application logic.

 thank you

 On Jun 29, 2013, at 1:16 AM, Shalin Shekhar Mangar shalinman...@gmail.com
wrote:

  There is very little shared between multiple cores (instanceDir paths,
  logging config maybe?). Why are you trying to do this?
 
  On Sat, Jun 29, 2013 at 1:14 AM, Peyman Faratin pey...@robustlinks.com
wrote:
  Hi
 
  I have a multicore setup (in 4.3.0). Is it possible for one core to
share an instance of its class with other cores at run time? i.e.
 
  At run time core 1 makes an instance of object O_i
 
  core 1 -- object O_i
  core 2
  ---
  core n
 
  then can core K access O_i? I know they can share properties but is it
possible to share objects?
 
  thank you
 
 
 
 
  --
  Regards,
  Shalin Shekhar Mangar.



Re: cores sharing an instance

2013-07-01 Thread Roman Chyla
as for the second option:

If you look inside SolrResourceLoader, you will notice that before a
CoreContainer is created, a new class loader is also created

line:111

this.classLoader = createClassLoader(null, parent);

however, this parent object is always null, because it is called from:

public SolrResourceLoader( String instanceDir )
  {
this( instanceDir, null, null );
  }

but if you were able to replace the second null (parent class loader) with
a classloader of your own choice - ie. one that loads your singleton (but
only that singleton, you don't want to share other objects), your cores
should be able to see/share that object

So, as you can see, if you test it and it works, you may file a JIRA ticket
and help other folks out there (I was too lazy and worked around it in the
past - but that wasn't a good solution). If there is a well-justified reason
to share objects, it seems weird that the core is using 'null' as a parent
class loader.
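Just to illustrate the idea (the jar path and the way you obtain the
instanceDir are made up; the three-arg constructor is the one shown above):

import java.io.File;
import java.net.URL;
import java.net.URLClassLoader;

import org.apache.solr.core.SolrResourceLoader;

public class SharedLoaderExample {

  public static SolrResourceLoader buildLoader(String instanceDir) throws Exception {
    // a parent loader that already knows the jar holding the shared singleton classes
    URLClassLoader shared = new URLClassLoader(
        new URL[] { new File("/opt/solr/shared/app-caches.jar").toURI().toURL() },
        Thread.currentThread().getContextClassLoader());
    // today SolrResourceLoader(instanceDir) delegates to this(instanceDir, null, null)
    return new SolrResourceLoader(instanceDir, shared, null);
  }
}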

HTH,

  roman






On Sun, Jun 30, 2013 at 2:18 PM, Peyman Faratin pey...@robustlinks.comwrote:

 I see. If I wanted to try the second option (find a place inside solr
 before the core is created) then where would that place be in the flow of
 app waking up? Currently what I am doing is each core loads its app caches
 via a requesthandler (in solrconfig.xml) that initializes the java class
 that does the loading. For instance:

 requestHandler name=/cachedResources class=solr.SearchHandler
 startup=lazy 
arr name=last-components
  strAppCaches/str
/arr
 /requestHandler
 searchComponent name=AppCaches
 class=com.name.Project.AppCaches/


 So each core has its own so specific cachedResources handler. Where in
 SOLR would I need to place the AppCaches code to make it visible to all
 other cores then?

 thank you Roman

 On Jun 29, 2013, at 10:58 AM, Roman Chyla roman.ch...@gmail.com wrote:

  Cores can be reloaded, they are inside solrcore loader /I forgot the
 exact
  name/, and they will have different classloaders /that's servlet thing/,
 so
  if you want singletons you must load them outside of the core, using a
  parent classloader - in case of jetty, this means writing your own jetty
  initialization or config to force shared class loaders. or find a place
  inside the solr, before the core is created. Google for montysolr to see
  the example of the first approach.
 
  But, unless you really have no other choice, using singletons is IMHO a
 bad
  idea in this case
 
  Roman
 
  On 29 Jun 2013 10:18, Peyman Faratin pey...@robustlinks.com wrote:
 
  its the singleton pattern, where in my case i want an object (which is
  RAM expensive) to be a centralized coordinator of application logic.
 
  thank you
 
  On Jun 29, 2013, at 1:16 AM, Shalin Shekhar Mangar 
 shalinman...@gmail.com
  wrote:
 
  There is very little shared between multiple cores (instanceDir paths,
  logging config maybe?). Why are you trying to do this?
 
  On Sat, Jun 29, 2013 at 1:14 AM, Peyman Faratin 
 pey...@robustlinks.com
  wrote:
  Hi
 
  I have a multicore setup (in 4.3.0). Is it possible for one core to
  share an instance of its class with other cores at run time? i.e.
 
  At run time core 1 makes an instance of object O_i
 
  core 1 -- object O_i
  core 2
  ---
  core n
 
  then can core K access O_i? I know they can share properties but is it
  possible to share objects?
 
  thank you
 
 
 
 
  --
  Regards,
  Shalin Shekhar Mangar.
 




Re: Solr large boolean filter

2013-07-02 Thread Roman Chyla
Hello @,

This thread 'kicked' me into finishing some long-past task of
sending/receiving a large boolean (bitset) filter. We have been using
bitsets with solr before, but now I sat down and wrote it as a qparser. The
use cases, as you have discussed, are:

 - the necessity to send a very long list of ids as a query (where it is not
possible to do it the 'normal' way)
 - or filtering ACLs


It works in the following way:

  - the external application constructs a bitset and sends it as a query to
solr (q or fq, depending on your needs)
  - solr unpacks the bitset (translating bits into lucene ids, if
necessary) and wraps it into a query which then has the easy job of
'filtering' wanted/unwanted items

Therefore it is good only if you can search against something that is
indexed as an integer (ids often are).
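For a rough idea of the client side, this is more or less all that is needed
(a sketch only - it assumes Java 7's BitSet.toByteArray(); the '{!bitset}'
syntax is illustrative, the actual parameter names are whatever the linked
qparser expects):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.BitSet;
import java.util.zip.GZIPOutputStream;
import javax.xml.bind.DatatypeConverter;

public class BitSetFilterClient {

  // set bits for the wanted ids, gzip the raw bytes, base64-encode, send as the query value
  public static String encode(BitSet bits) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    GZIPOutputStream gzip = new GZIPOutputStream(bos);
    gzip.write(bits.toByteArray());   // Java 7+
    gzip.close();
    return DatatypeConverter.printBase64Binary(bos.toByteArray());
  }

  public static void main(String[] args) throws IOException {
    BitSet wanted = new BitSet();
    wanted.set(17);
    wanted.set(42);
    wanted.set(1000000);
    // hypothetical query syntax; send it via POST as fq to avoid URL length limits
    System.out.println("fq={!bitset}" + encode(wanted));
  }
}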

A simple benchmark shows acceptable performance: to send the bitset
(randomly populated, 10M, with 4M bits set) takes 110ms (25+64+20).

To decode this string (resulting byte size 1.5MB!) takes ~90ms
(5+14+68ms).

But I haven't tested latency of sending it over the network and the query
performance, but since the query is very similar as MatchAllDocs, it is
probably very fast (and I know that sending many Mbs to Solr is fast as
well)

I know this is not exactly a 'standard' solution, and it is probably not
something you want to see with hundreds of millions of docs, but people
seem to be doing 'not the right thing' all the time ;)
So if you think this is something useful for the community, please let me
know. If somebody is willing to test it, I can file a JIRA ticket.

Thanks!

Roman


The code, if no JIRA is needed, can be found here:
https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/java/org/apache/solr/search/AdsQParserPlugin.java
https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/test/org/apache/solr/search/TestBitSetQParserPlugin.java

839ms.  run
154ms.  Building random bitset indexSize=1000 fill=0.5 --
Size=15054208,cardinality=3934477 highestBit=999
 25ms.  Converting bitset to byte array -- resulting array length=125
20ms.  Encoding byte array into base64 -- resulting array length=168
ratio=1.344
 62ms.  Compressing byte array with GZIP -- resulting array length=1218602
ratio=0.9748816
20ms.  Encoding gzipped byte array into base64 -- resulting string
length=1624804 ratio=1.2998432
 5ms.  Decoding gzipped byte array from base64
14ms.  Uncompressing decoded byte array
68ms.  Converting from byte array to bitset
 743ms.  running


On Tue, Jun 18, 2013 at 3:51 PM, Erick Erickson erickerick...@gmail.comwrote:

 Not necessarily. If the auth tokens are available on some
 other system (DB, LDAP, whatever), one could get them
 in the PostFilter and cache them somewhere since,
 presumably, they wouldn't be changing all that often. Or
 use a UserCache and get notified whenever a new searcher
 was opened and regenerate or purge the cache.

 Of course you're right if the post filter does NOT have
 access to the source of truth for the user's privileges.

 FWIW,
 Erick

 On Tue, Jun 18, 2013 at 8:54 AM, Otis Gospodnetic
 otis.gospodne...@gmail.com wrote:
  Hi,
 
  The unfortunate thing about this is what you still have to *pass* that
  filter from the client to the server every time you want to use that
  filter.  If that filter is big/long, passing that in all the time has
  some price that could be eliminated by using server-side named
  filters.
 
  Otis
  --
  Solr  ElasticSearch Support
  http://sematext.com/
 
 
 
 
 
  On Tue, Jun 18, 2013 at 8:16 AM, Erick Erickson erickerick...@gmail.com
 wrote:
  You might consider post filters. The idea
  is to write a custom filter that gets applied
  after all other filters etc. One use-case
  here is exactly ACL lists, and can be quite
  helpful if you're not doing *:* type queries.
 
  Best
  Erick
 
  On Mon, Jun 17, 2013 at 5:12 PM, Otis Gospodnetic
  otis.gospodne...@gmail.com wrote:
  Btw. ElasticSearch has a nice feature here.  Not sure what it's
  called, but I call it named filter.
 
  http://www.elasticsearch.org/blog/terms-filter-lookup/
 
  Maybe that's what OP was after?
 
  Otis
  --
  Solr  ElasticSearch Support
  http://sematext.com/
 
 
 
 
 
  On Mon, Jun 17, 2013 at 4:59 PM, Alexandre Rafalovitch
  arafa...@gmail.com wrote:
  On Mon, Jun 17, 2013 at 12:35 PM, Igor Kustov ivkus...@gmail.com
 wrote:
  So I'm using query like
 
 http://127.0.0.1:8080/solr/select?q=*:*fq={!mqparser}id:%281%202%203%29
 
  If the IDs are purely numeric, I wonder if the better way is to send a
  bitset. So, bit 1 is on if ID:1 is included, bit 2000 is on if ID:2000
  is included. Even using URL-encoding rules, you can fit at least 65
  sequential ID flags per character and I am sure there are more
  efficient encoding schemes for long empty sequences.
 
  Regards,
 Alex.
 
 
 
  Personal website: http://www.outerthoughts.com/
  LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
  - Time is the quality 

Re: Solr large boolean filter

2013-07-02 Thread Roman Chyla
Wrong link to the parser, should be:
https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/java/org/apache/solr/search/BitSetQParserPlugin.java


On Tue, Jul 2, 2013 at 1:25 PM, Roman Chyla roman.ch...@gmail.com wrote:

 Hello @,

 This thread 'kicked' me into finishing som long-past task of
 sending/receiving large boolean (bitset) filter. We have been using bitsets
 with solr before, but now I sat down and wrote it as a qparser. The use
 cases, as you have discussed are:

  - necessity to send lng list of ids as a query (where it is not
 possible to do it the 'normal' way)
  - or filtering ACLs


 It works in the following way:

   - external application constructs bitset and sends it as a query to solr
 (q or fq, depends on your needs)
   - solr unpacks the bitset (translated bits into lucene ids, if
 necessary), and wraps this into a query which then has the easy job of
 'filtering' wanted/unwanted items

 Therefore it is good only if you can search against something that is
 indexed as integer (id's often are).

 A simple benchmark shows acceptable performance, to send the bitset
 (randomly populated, 10M, with 4M bits set), it takes 110ms (25+64+20)

 To decode this string (resulting byte size 1.5Mb!) it takes ~90ms
 (5+14+68ms)

 But I haven't tested latency of sending it over the network and the query
 performance, but since the query is very similar as MatchAllDocs, it is
 probably very fast (and I know that sending many Mbs to Solr is fast as
 well)

 I know this is not exactly 'standard' solution, and it is probably not
 something you want to see with hundreds of millions of docs, but people
 seem to be doing 'not the right thing' all the time;)
 So if you think this is something useful for the community, please let me
 know. If somebody would be willing to test it, i can file a JIRA ticket.

 Thanks!

 Roman


 The code, if no JIRA is needed, can be found here:

 https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/java/org/apache/solr/search/AdsQParserPlugin.java

 https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/test/org/apache/solr/search/TestBitSetQParserPlugin.java

 839ms.  run
 154ms.  Building random bitset indexSize=1000 fill=0.5 --
 Size=15054208,cardinality=3934477 highestBit=999
  25ms.  Converting bitset to byte array -- resulting array length=125
 20ms.  Encoding byte array into base64 -- resulting array length=168
 ratio=1.344
  62ms.  Compressing byte array with GZIP -- resulting array
 length=1218602 ratio=0.9748816
 20ms.  Encoding gzipped byte array into base64 -- resulting string
 length=1624804 ratio=1.2998432
  5ms.  Decoding gzipped byte array from base64
 14ms.  Uncompressing decoded byte array
 68ms.  Converting from byte array to bitset
  743ms.  running


 On Tue, Jun 18, 2013 at 3:51 PM, Erick Erickson 
 erickerick...@gmail.comwrote:

 Not necessarily. If the auth tokens are available on some
 other system (DB, LDAP, whatever), one could get them
 in the PostFilter and cache them somewhere since,
 presumably, they wouldn't be changing all that often. Or
 use a UserCache and get notified whenever a new searcher
 was opened and regenerate or purge the cache.

 Of course you're right if the post filter does NOT have
 access to the source of truth for the user's privileges.

 FWIW,
 Erick

 On Tue, Jun 18, 2013 at 8:54 AM, Otis Gospodnetic
 otis.gospodne...@gmail.com wrote:
  Hi,
 
  The unfortunate thing about this is what you still have to *pass* that
  filter from the client to the server every time you want to use that
  filter.  If that filter is big/long, passing that in all the time has
  some price that could be eliminated by using server-side named
  filters.
 
  Otis
  --
  Solr  ElasticSearch Support
  http://sematext.com/
 
 
 
 
 
  On Tue, Jun 18, 2013 at 8:16 AM, Erick Erickson 
 erickerick...@gmail.com wrote:
  You might consider post filters. The idea
  is to write a custom filter that gets applied
  after all other filters etc. One use-case
  here is exactly ACL lists, and can be quite
  helpful if you're not doing *:* type queries.
 
  Best
  Erick
 
  On Mon, Jun 17, 2013 at 5:12 PM, Otis Gospodnetic
  otis.gospodne...@gmail.com wrote:
  Btw. ElasticSearch has a nice feature here.  Not sure what it's
  called, but I call it named filter.
 
  http://www.elasticsearch.org/blog/terms-filter-lookup/
 
  Maybe that's what OP was after?
 
  Otis
  --
  Solr  ElasticSearch Support
  http://sematext.com/
 
 
 
 
 
  On Mon, Jun 17, 2013 at 4:59 PM, Alexandre Rafalovitch
  arafa...@gmail.com wrote:
  On Mon, Jun 17, 2013 at 12:35 PM, Igor Kustov ivkus...@gmail.com
 wrote:
  So I'm using query like
 
 http://127.0.0.1:8080/solr/select?q=*:*fq={!mqparser}id:%281%202%203%29http://127.0.0.1:8080/solr/select?q=*:*fq=%7B!mqparser%7Did:%281%202%203%29
 
  If the IDs are purely numeric, I wonder if the better way is to send
 a
  bitset. So, bit 1 is on if ID:1 is included, bit 2000

Re: Two instances of solr - the same datadir?

2013-07-02 Thread Roman Chyla
As I discovered, it is not good to use the 'native' locktype in this
scenario; there is actually a note in solrconfig.xml which says the same.

When a core is reloaded and solr tries to grab the lock, it will fail - even
if the instance is configured to be read-only. So I am using the 'single'
lock for the readers and 'native' for the writer, which seems to work OK.
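In solrconfig.xml this can be driven by a system property, the same way as
the other bits of this setup (the property name here is just an example):

<indexConfig>
  <lockType>${montysolr.locktype:native}</lockType>
</indexConfig>

and then the read-only instances are started with -Dmontysolr.locktype=single.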

roman


On Fri, Jun 7, 2013 at 9:05 PM, Roman Chyla roman.ch...@gmail.com wrote:

 I have auto commit after 40k RECs/1800secs. But I only tested with manual
 commit, but I don't see why it should work differently.
 Roman
 On 7 Jun 2013 20:52, Tim Vaillancourt t...@elementspace.com wrote:

 If it makes you feel better, I also considered this approach when I was in
 the same situation with a separate indexer and searcher on one Physical
 linux machine.

 My main concern was re-using the FS cache between both instances - If I
 replicated to myself there would be two independent copies of the index,
 FS-cached separately.

 I like the suggestion of using autoCommit to reload the index. If I'm
 reading that right, you'd set an autoCommit on 'zero docs changing', or
 just 'every N seconds'? Did that work?

 Best of luck!

 Tim


 On 5 June 2013 10:19, Roman Chyla roman.ch...@gmail.com wrote:

  So here it is for a record how I am solving it right now:
 
  Write-master is started with: -Dmontysolr.warming.enabled=false
  -Dmontysolr.write.master=true -Dmontysolr.read.master=
  http://localhost:5005
  Read-master is started with: -Dmontysolr.warming.enabled=true
  -Dmontysolr.write.master=false
 
 
  solrconfig.xml changes:
 
  1. all index changing components have this bit,
  enable=${montysolr.master:true} - ie.
 
  updateHandler class=solr.DirectUpdateHandler2
   enable=${montysolr.master:true}
 
  2. for cache warming de/activation
 
  listener event=newSearcher
class=solr.QuerySenderListener
enable=${montysolr.enable.warming:true}...
 
  3. to trigger refresh of the read-only-master (from write-master):
 
  listener event=postCommit
class=solr.RunExecutableListener
enable=${montysolr.master:true}
str name=execurl/str
str name=dir./str
bool name=waitfalse/bool
arr name=args str${montysolr.read.master:http://localhost
 
 
 }/solr/admin/cores?wt=jsonamp;action=RELOADamp;core=collection1/str/arr
  /listener
 
  This works, I still don't like the reload of the whole core, but it
 seems
  like the easiest thing to do now.
 
  -- roman
 
 
  On Wed, Jun 5, 2013 at 12:07 PM, Roman Chyla roman.ch...@gmail.com
  wrote:
 
   Hi Peter,
  
   Thank you, I am glad to read that this usecase is not alien.
  
   I'd like to make the second instance (searcher) completely read-only,
 so
  I
   have disabled all the components that can write.
  
   (being lazy ;)) I'll probably use
   http://wiki.apache.org/solr/CollectionDistribution to call the curl
  after
   commit, or write some IndexReaderFactory that checks for changes
  
   The problem with calling the 'core reload' - is that it seems lots of
  work
   for just opening a new searcher, eeekkk...somewhere I read that it is
  cheap
   to reload a core, but re-opening the index searches must be definitely
   cheaper...
  
   roman
  
  
   On Wed, Jun 5, 2013 at 4:03 AM, Peter Sturge peter.stu...@gmail.com
  wrote:
  
   Hi,
   We use this very same scenario to great effect - 2 instances using
 the
   same
   dataDir with many cores - 1 is a writer (no caching), the other is a
   searcher (lots of caching).
   To get the searcher to see the index changes from the writer, you
 need
  the
   searcher to do an empty commit - i.e. you invoke a commit with 0
   documents.
   This will refresh the caches (including autowarming), [re]build the
   relevant searchers etc. and make any index changes visible to the RO
   instance.
   Also, make sure to use lockTypenative/lockType in solrconfig.xml
 to
   ensure the two instances don't try to commit at the same time.
   There are several ways to trigger a commit:
   Call commit() periodically within your own code.
   Use autoCommit in solrconfig.xml.
   Use an RPC/IPC mechanism between the 2 instance processes to tell the
   searcher the index has changed, then call commit when called (more
  complex
   coding, but good if the index changes on an ad-hoc basis).
   Note, doing things this way isn't really suitable for an NRT
  environment.
  
   HTH,
   Peter
  
  
  
   On Tue, Jun 4, 2013 at 11:23 PM, Roman Chyla roman.ch...@gmail.com
   wrote:
  
Replication is fine, I am going to use it, but I wanted it for
  instances
*distributed* across several (physical) machines - but here I have
 one
physical machine, it has many cores. I want to run 2 instances of
 solr
because I think it has these benefits:
   
1) I can give less RAM to the writer (4GB), and use more RAM for
 the
searcher (28GB)
2) I can deactivate warming for the writer and keep it for the
  searcher

Re: Solr large boolean filter

2013-07-02 Thread Roman Chyla
Hello Mikhail,

Yes, GET is limited, but POST is not - so I just wanted it to work the same
way in both. But I am not sure I am understanding your question completely.
Could you elaborate on the parameters/body part? Is there no need for
encoding binary data inside the body? Or do you mean it is treated as a
string? Or is it just a bytestream while the other parameters are seen as
strings?

On a general note: my main concern was to send many ids fast. If we use
ints (32 bit), one MB fits ~250K of them; as a bitset, the same MB covers
~8M ids, i.e. 32 times more. But certainly, if the bitset is sparse or the
collection is just a few thousand ids, a stream of ints/longs will be
smaller and better to use.

roman



On Tue, Jul 2, 2013 at 2:00 PM, Mikhail Khludnev mkhlud...@griddynamics.com
 wrote:

 Hello Roman,

 Don't you consider to pass long id sequence as body and access internally
 in solr as a content stream? It makes base64 compression not necessary.
 AFAIK url length is limited somehow, anyway.


 On Tue, Jul 2, 2013 at 9:32 PM, Roman Chyla roman.ch...@gmail.com wrote:

  Wrong link to the parser, should be:
 
 
 https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/java/org/apache/solr/search/BitSetQParserPlugin.java
 
 
  On Tue, Jul 2, 2013 at 1:25 PM, Roman Chyla roman.ch...@gmail.com
 wrote:
 
   Hello @,
  
   This thread 'kicked' me into finishing som long-past task of
   sending/receiving large boolean (bitset) filter. We have been using
  bitsets
   with solr before, but now I sat down and wrote it as a qparser. The use
   cases, as you have discussed are:
  
- necessity to send lng list of ids as a query (where it is not
   possible to do it the 'normal' way)
- or filtering ACLs
  
  
   It works in the following way:
  
 - external application constructs bitset and sends it as a query to
  solr
   (q or fq, depends on your needs)
 - solr unpacks the bitset (translated bits into lucene ids, if
   necessary), and wraps this into a query which then has the easy job of
   'filtering' wanted/unwanted items
  
   Therefore it is good only if you can search against something that is
   indexed as integer (id's often are).
  
   A simple benchmark shows acceptable performance, to send the bitset
   (randomly populated, 10M, with 4M bits set), it takes 110ms (25+64+20)
  
   To decode this string (resulting byte size 1.5Mb!) it takes ~90ms
   (5+14+68ms)
  
   But I haven't tested latency of sending it over the network and the
 query
   performance, but since the query is very similar as MatchAllDocs, it is
   probably very fast (and I know that sending many Mbs to Solr is fast as
   well)
  
   I know this is not exactly 'standard' solution, and it is probably not
   something you want to see with hundreds of millions of docs, but people
   seem to be doing 'not the right thing' all the time;)
   So if you think this is something useful for the community, please let
 me
   know. If somebody would be willing to test it, i can file a JIRA
 ticket.
  
   Thanks!
  
   Roman
  
  
   The code, if no JIRA is needed, can be found here:
  
  
 
 https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/java/org/apache/solr/search/AdsQParserPlugin.java
  
  
 
 https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/test/org/apache/solr/search/TestBitSetQParserPlugin.java
  
   839ms.  run
   154ms.  Building random bitset indexSize=1000 fill=0.5 --
   Size=15054208,cardinality=3934477 highestBit=999
25ms.  Converting bitset to byte array -- resulting array
 length=125
   20ms.  Encoding byte array into base64 -- resulting array
 length=168
   ratio=1.344
62ms.  Compressing byte array with GZIP -- resulting array
   length=1218602 ratio=0.9748816
   20ms.  Encoding gzipped byte array into base64 -- resulting string
   length=1624804 ratio=1.2998432
5ms.  Decoding gzipped byte array from base64
   14ms.  Uncompressing decoded byte array
   68ms.  Converting from byte array to bitset
743ms.  running
  
  
   On Tue, Jun 18, 2013 at 3:51 PM, Erick Erickson 
 erickerick...@gmail.com
  wrote:
  
   Not necessarily. If the auth tokens are available on some
   other system (DB, LDAP, whatever), one could get them
   in the PostFilter and cache them somewhere since,
   presumably, they wouldn't be changing all that often. Or
   use a UserCache and get notified whenever a new searcher
   was opened and regenerate or purge the cache.
  
   Of course you're right if the post filter does NOT have
   access to the source of truth for the user's privileges.
  
   FWIW,
   Erick
  
   On Tue, Jun 18, 2013 at 8:54 AM, Otis Gospodnetic
   otis.gospodne...@gmail.com wrote:
Hi,
   
The unfortunate thing about this is what you still have to *pass*
 that
filter from the client to the server every time you want to use that
filter.  If that filter is big/long, passing that in all the time
 has
some price

Re: Two instances of solr - the same datadir?

2013-07-02 Thread Roman Chyla
Interesting, we are running 4.0 - and solr will refuse to start (or reload)
the core. But from looking at the code I am not seeing that it does any
writing - I should dig more...

Are you sure it needs to do any writing? I am not calling commits; in fact I
have deactivated *all* components that write into the index, so unless there
is something deep inside which automatically calls a commit, it should never
happen.

roman


On Tue, Jul 2, 2013 at 2:54 PM, Peter Sturge peter.stu...@gmail.com wrote:

 Hmmm, single lock sounds dangerous. It probably works ok because you've
 been [un]lucky.
 For example, even with a RO instance, you still need to do a commit in
 order to reload caches/changes from the other instance.
 What happens if this commit gets called in the middle of the other
 instance's commit? I've not tested this scenario, but it's very possible
 with a 'single' lock the results are indeterminate.
 If the 'single' lock mechanism is making assumptions e.g. no other process
 will interfere, and then one does, the Lucene index could very well get
 corrupted.

 For the error you're seeing using 'native', we use native lockType for both
 write and RO instances, and it works fine - no contention.
 Which version of Solr are you using? Perhaps there's been a change in
 behaviour?

 Peter


 On Tue, Jul 2, 2013 at 7:30 PM, Roman Chyla roman.ch...@gmail.com wrote:

  as i discovered, it is not good to use 'native' locktype in this
 scenario,
  actually there is a note in the solrconfig.xml which says the same
 
  when a core is reloaded and solr tries to grab lock, it will fail - even
 if
  the instance is configured to be read-only, so i am using 'single' lock
 for
  the readers and 'native' for the writer, which seems to work OK
 
  roman
 
 
  On Fri, Jun 7, 2013 at 9:05 PM, Roman Chyla roman.ch...@gmail.com
 wrote:
 
   I have auto commit after 40k RECs/1800secs. But I only tested with
 manual
   commit, but I don't see why it should work differently.
   Roman
   On 7 Jun 2013 20:52, Tim Vaillancourt t...@elementspace.com wrote:
  
   If it makes you feel better, I also considered this approach when I
 was
  in
   the same situation with a separate indexer and searcher on one
 Physical
   linux machine.
  
   My main concern was re-using the FS cache between both instances -
 If
  I
   replicated to myself there would be two independent copies of the
 index,
   FS-cached separately.
  
   I like the suggestion of using autoCommit to reload the index. If I'm
   reading that right, you'd set an autoCommit on 'zero docs changing',
 or
   just 'every N seconds'? Did that work?
  
   Best of luck!
  
   Tim
  
  
   On 5 June 2013 10:19, Roman Chyla roman.ch...@gmail.com wrote:
  
So here it is for a record how I am solving it right now:
   
Write-master is started with: -Dmontysolr.warming.enabled=false
-Dmontysolr.write.master=true -Dmontysolr.read.master=
http://localhost:5005
Read-master is started with: -Dmontysolr.warming.enabled=true
-Dmontysolr.write.master=false
   
   
solrconfig.xml changes:
   
1. all index changing components have this bit,
enable=${montysolr.master:true} - ie.
   
updateHandler class=solr.DirectUpdateHandler2
 enable=${montysolr.master:true}
   
2. for cache warming de/activation
   
listener event=newSearcher
  class=solr.QuerySenderListener
  enable=${montysolr.enable.warming:true}...
   
3. to trigger refresh of the read-only-master (from write-master):
   
listener event=postCommit
  class=solr.RunExecutableListener
  enable=${montysolr.master:true}
  str name=execurl/str
  str name=dir./str
  bool name=waitfalse/bool
  arr name=args str${montysolr.read.master:
 http://localhost
   
   
  
 
 }/solr/admin/cores?wt=jsonamp;action=RELOADamp;core=collection1/str/arr
/listener
   
This works, I still don't like the reload of the whole core, but it
   seems
like the easiest thing to do now.
   
-- roman
   
   
On Wed, Jun 5, 2013 at 12:07 PM, Roman Chyla roman.ch...@gmail.com
 
wrote:
   
 Hi Peter,

 Thank you, I am glad to read that this usecase is not alien.

 I'd like to make the second instance (searcher) completely
  read-only,
   so
I
 have disabled all the components that can write.

 (being lazy ;)) I'll probably use
 http://wiki.apache.org/solr/CollectionDistribution to call the
 curl
after
 commit, or write some IndexReaderFactory that checks for changes

 The problem with calling the 'core reload' - is that it seems lots
  of
work
 for just opening a new searcher, eeekkk...somewhere I read that it
  is
cheap
 to reload a core, but re-opening the index searches must be
  definitely
 cheaper...

 roman


 On Wed, Jun 5, 2013 at 4:03 AM, Peter Sturge 
  peter.stu...@gmail.com
wrote:

 Hi

Re: Surround query parser not working?

2013-07-03 Thread Roman Chyla
Hi Niran, all,
Please look at JIRA LUCENE-5014. There you will find a Lucene parser that
does both analysis and span queries - equivalent to a combination of
lucene+surround, and much more. The ticket needs your review.

Roman


Re: What are the options for obtaining IDF at interactive speeds?

2013-07-03 Thread Roman Chyla
Hi Kathryn,
I wonder if you could index all your terms as separate documents and then
construct a new query (2nd pass)

q=term:term1 OR term:term2 OR term:term3

and use a function query to score them:

idf(other_field,field(term))

The 'term' field cannot be multi-valued, obviously.

Other than that, if you could do it on the server side, that would be the
fastest - the code is ready inside IDFValueSource:
http://lucene.apache.org/core/4_3_0/queries/org/apache/lucene/queries/function/valuesource/IDFValueSource.html

roman


On Tue, Jul 2, 2013 at 5:06 PM, Kathryn Mazaitis
kathryn.riv...@gmail.comwrote:

 Hi,

 I'm using SOLRJ to run a query, with the goal of obtaining:

 (1) the retrieved documents,
 (2) the TF of each term in each document,
 (3) the IDF of each term in the set of retrieved documents (TF/IDF would be
 fine too)

 ...all at interactive speeds, or 10s per query. This is a demo, so if all
 else fails I can adjust the corpus, but I'd rather, y'know, actually do it.

 (1) and (2) are working; I completed the patch posted in the following
 issue:
 https://issues.apache.org/jira/browse/SOLR-949
 and am just setting tv=truetv.tf=true for my query. This way I get the
 documents and the tf information all in one go.

 With (3) I'm running into trouble. I have found 2 ways to do it so far:

 Option A: set tv.df=true or tv.tf_idf for my query, and get the idf
 information along with the documents and tf information. Since each term
 may appear in multiple documents, this means retrieving idf information for
 each term about 20 times, and takes over a minute to do.

 Option B: After I've gathered the tf information, run through the list of
 terms used across the set of retrieved documents, and for each term, run a
 query like:
 {!func}idf(text,'the_term')deftype=funcfl=scorerows=1
 ...while this retrieves idf information only once for each term, the added
 latency for doing that many queries piles up to almost two minutes on my
 current corpus.

 Is there anything I didn't think of -- a way to construct a query to get
 idf information for a set of terms all in one go, outside the bounds of
 what terms happen to be in a document?

 Failing that, does anyone have a sense for how far I'd have to scale down a
 corpus to approach interactive speeds, if I want this sort of data?

 Katie



Re: Two instances of solr - the same datadir?

2013-07-04 Thread Roman Chyla
I have spent a lot of time in the past day playing with this setup and
finally made it work; here are a few bits of interest:

- solr v40
- linux, java7, local filesystem
- big index, 1 RW instance + 2 RO instances (sharing the same index)


The lock is acquired when solr is writing data - if you happen to be
starting your RO instance at that moment and you are using the 'native'
lock, it will fail. However, when the RW instance uses the 'native' lock and
the 2 RO instances use the 'single' lock, the RO instances can start, but
they will eventually get into trouble too - our index is too big, so when a
core RELOAD is called while indexing is under way, the RO instances time out.

Core reload, when using the 'native' lock, seems to work fine - if you were
lucky and all instances managed to start. HOWEVER, the core is unresponsive
until fully loaded (makes sense), which is actually terrible - your search
is gone for seconds/minutes.

The best setup is as described in my original post - the RO instances MUST
NOT commit anything, nor use core reload (because during a reload solr tries
to acquire the lock). Instead, they should just reopen the searcher - I
repeat: you should make sure that nothing is ever going to write on the RO
instance. And because there is no public API for reopening the searcher, I
wrote a simple handler which just calls:

req.getCore().getSearcher(true, false, null, false);

When called, the RO instance continues to handle requests using the old
searcher while warming runs in the background; once ready, the new searcher
takes over. [To repeat: I am triggering this refresh from the RW instance,
which does 'curl http://foo/solr/myhandler?command=reopenSearcher'.]
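The handler itself is trivial - roughly this (the class name is mine, and the
set of abstract methods in RequestHandlerBase may differ slightly between
Solr versions):

import org.apache.solr.handler.RequestHandlerBase;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;

public class ReopenSearcherHandler extends RequestHandlerBase {

  @Override
  public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp) throws Exception {
    // forceNew=true, returnSearcher=false: a new searcher is opened and warmed in the
    // background; the old one keeps serving until the new one is registered
    req.getCore().getSearcher(true, false, null, false);
    rsp.add("status", "reopening searcher");
  }

  @Override
  public String getDescription() {
    return "Reopens the index searcher on a read-only instance";
  }

  @Override
  public String getSource() {
    return "";
  }
}

It is registered in solrconfig.xml like any other request handler, and it can
be guarded with the same enable="${...}" trick as the rest of the setup.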


The bad thing: when the RO instance dies (e.g. an OOM error) while the RW is
in the middle of writing data, you can't restart the RO instance (unless you
use the 'single' lock or some other lock type).

HTH,

  roman




On Tue, Jul 2, 2013 at 5:35 PM, Michael Della Bitta 
michael.della.bi...@appinions.com wrote:

 Wouldn't it be better to do a RELOAD?

 http://wiki.apache.org/solr/CoreAdmin#RELOAD

 Michael Della Bitta

 Applications Developer

 o: +1 646 532 3062  | c: +1 917 477 7906

 appinions inc.

 “The Science of Influence Marketing”

 18 East 41st Street

 New York, NY 10017

 t: @appinions https://twitter.com/Appinions | g+:
 plus.google.com/appinions
 w: appinions.com http://www.appinions.com/


 On Tue, Jul 2, 2013 at 5:05 PM, Peter Sturge peter.stu...@gmail.com
 wrote:

  The RO instance commit isn't (or shouldn't be) doing any real writing,
 just
  an empty commit to force new searchers, autowarm/refresh caches etc.
  Admittedly, we do all this on 3.6, so 4.0 could have different behaviour
 in
  this area.
  As long as you don't have autocommit in solrconfig.xml, there wouldn't be
  any commits 'behind the scenes' (we do all our commits via a local solrj
  client so it can be fully managed).
  The only caveat might be NRT/soft commits, but I'm not too familiar with
  this in 4.0.
  In any case, your RO instance must be getting updated somehow, otherwise
  how would it know your write instance made any changes?
  Perhaps your write instance notifies the RO instance externally from
 Solr?
  (a perfectly valid approach, and one that would allow a 'single' lock to
  work without contention)
 
 
 
  On Tue, Jul 2, 2013 at 7:59 PM, Roman Chyla roman.ch...@gmail.com
 wrote:
 
   Interesting, we are running 4.0 - and solr will refuse the start (or
   reload) the core. But from looking at the code I am not seeing it is
  doing
   any writing - but I should digg more...
  
   Are you sure it needs to do writing? Because I am not calling commits,
 in
   fact I have deactivated *all* components that write into index, so
 unless
   there is something deep inside, which automatically calls the commit,
 it
   should never happen.
  
   roman
  
  
   On Tue, Jul 2, 2013 at 2:54 PM, Peter Sturge peter.stu...@gmail.com
   wrote:
  
Hmmm, single lock sounds dangerous. It probably works ok because
 you've
been [un]lucky.
For example, even with a RO instance, you still need to do a commit
 in
order to reload caches/changes from the other instance.
What happens if this commit gets called in the middle of the other
instance's commit? I've not tested this scenario, but it's very
  possible
with a 'single' lock the results are indeterminate.
If the 'single' lock mechanism is making assumptions e.g. no other
   process
will interfere, and then one does, the Lucene index could very well
 get
corrupted.
   
For the error you're seeing using 'native', we use native lockType
 for
   both
write and RO instances, and it works fine - no contention.
Which version of Solr are you using? Perhaps there's been a change in
behaviour?
   
Peter
   
   
On Tue, Jul 2, 2013 at 7:30 PM, Roman Chyla roman.ch...@gmail.com
   wrote:
   
 as i discovered, it is not good to use 'native' locktype in this
scenario,
 actually there is a note

Re: SOLR 4.0 frequent admin problem

2013-07-04 Thread Roman Chyla
Yes :-)  see SOLR-118, seems an old issue...
On 4 Jul 2013 06:43, David Quarterman da...@corexe.com wrote:

 Hi,

 About once a week the admin system comes up with SolrCore Initialization
 Failures. There's nothing in the logs and SOLR continues to work in the
 application it's supporting and in the 'direct access' mode (i.e.
 http://123.465.789.100:8080/solr/collection1/select?q=bingo:*).

 The cure is to restart Jetty (8.1.7) and then we can use the admin system
 again via pc's. However, a colleague can get into admin on an iPad with no
 trouble when no browser on a pc can!

 Anyone any ideas? It's really frustrating!

 Best regards,

 DQ




Re: Sending Documents via SolrServer as MapReduce Jobs at Solrj

2013-07-05 Thread Roman Chyla
I don't want to sound negative, but I think it is a valid question to
consider - lack of information and a certain mental rigidity may make it
sound bad. First of all, it is probably not for a few gigabytes of data, and
I can imagine that building indexes on the side where the data lives is much
faster/cheaper than sending the data to solr. If we think of the index as
the product of the map, then the 'reduce' part may be this:
http://wiki.apache.org/solr/MergingSolrIndexes

I don't really know enough about CloudSolrServer and how to fit the cloud
there

roman

On Fri, Jul 5, 2013 at 12:23 PM, Jack Krupansky j...@basetechnology.comwrote:

 Software developers are sometimes compensated based on the degree of
 complexity that they deal with.

 And managers are sometimes compensated based on the number of people they
 manage, as well as the degree of complexity of what they manage.

 And... training organizations can charge more and have a larger pool of
 eager customers when the subject matter has higher complexity.

 And... consultants and contractors will be in higher demand and able to
 charge more, based on the degree of complexity that they have mastered.

 So, more complexity results in greater opportunity for higher income!

 (Oh, and, writers and book authors have more to write about and readers
 are more eager to purchase those writings as well, especially if the
 subject matter is constantly changing.)

 Somebody please remind me I said this any time you catch me trying to
 argue for Solr to be made simpler and easier to use!

 -- Jack Krupansky

 -Original Message- From: Walter Underwood
 Sent: Friday, July 05, 2013 12:11 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Sending Documents via SolrServer as MapReduce Jobs at Solrj


 Why is it better to require another large software system (Hadoop), when
 it works fine without it?

 That just sounds like more stuff to configure, misconfigure, and cause
 problems with indexing.

 wunder

 On Jul 5, 2013, at 4:48 AM, Furkan KAMACI wrote:

  We are using Nutch to crawl web sites and it stores documents at Hbase.
 Nutch uses Solrj to send documents to be indexed. We have Hadoop at our
 ecosystem as well. I think that there should be an implementation at Solrj
 that sends documents (via CloudSolrServer or something like that) as
 MapReduce jobs. Is there any implentation for it or is it not a good idea?






Re: What are the options for obtaining IDF at interactive speeds?

2013-07-08 Thread Roman Chyla
Hi,
I am curious about the functional query - did you try it and it didn't work,
or was it too slow?

idf(other_field,field(term))

Thanks!

  roman


On Mon, Jul 8, 2013 at 4:34 PM, Kathryn Mazaitis ka...@rivard.org wrote:

 Hi All,

 Resolution: I ended up cheating. :P Though now that I look at it, I think
 this was Roman's second suggestion. Thanks!

 Since the application that will be processing the IDF figures is located on
 the same machine as SOLR, I opened a second IndexReader on the lucene index
 and used

 reader.numDocs()
 reader.docFreq(field,term)

 to generate IDF by hand, ref: http://en.wikipedia.org/wiki/Tf%E2%80%93idf

 As it turns out, using this method to get IDF on all the terms mentioned in
 the set of relevant documents runs in time comparable to retrieving the
 documents in the first place (so, .1-1s). This makes it fast enough that
 it's no longer the slowest part of my algorithm by far. Problem solved! It
 is possible that IDFValueSource would be faster; I may swap that in at a
 later date.

 I will keep Mikhail's debugQuery=true in my pocket, too; that technique
 would never have occurred to me. Thank you too!

 Best,
 Katie


 On Wed, Jul 3, 2013 at 11:35 PM, Roman Chyla roman.ch...@gmail.com
 wrote:

  Hi Kathryn,
  I wonder if you could index all your terms as separate documents and then
  construct a new query (2nd pass)
 
  q=term:term1 OR term:term2 OR term:term3
 
  and use func to score them
 
  *idf(other_field,field(term))*
  *
  *
  the 'term' index cannot be multi-valued, obviously.
 
  Other than that, if you could do it on server side, that weould be the
  fastest - the code is ready inside IDFValueSource:
 
 
 http://lucene.apache.org/core/4_3_0/queries/org/apache/lucene/queries/function/valuesource/IDFValueSource.html
 
  roman
 
 
  On Tue, Jul 2, 2013 at 5:06 PM, Kathryn Mazaitis
  kathryn.riv...@gmail.comwrote:
 
   Hi,
  
   I'm using SOLRJ to run a query, with the goal of obtaining:
  
   (1) the retrieved documents,
   (2) the TF of each term in each document,
   (3) the IDF of each term in the set of retrieved documents (TF/IDF
 would
  be
   fine too)
  
   ...all at interactive speeds, or 10s per query. This is a demo, so if
  all
   else fails I can adjust the corpus, but I'd rather, y'know, actually do
  it.
  
   (1) and (2) are working; I completed the patch posted in the following
   issue:
   https://issues.apache.org/jira/browse/SOLR-949
   and am just setting tv=truetv.tf=true for my query. This way I get
 the
   documents and the tf information all in one go.
  
   With (3) I'm running into trouble. I have found 2 ways to do it so far:
  
   Option A: set tv.df=true or tv.tf_idf for my query, and get the idf
   information along with the documents and tf information. Since each
 term
   may appear in multiple documents, this means retrieving idf information
  for
   each term about 20 times, and takes over a minute to do.
  
   Option B: After I've gathered the tf information, run through the list
 of
   terms used across the set of retrieved documents, and for each term,
 run
  a
   query like:
   {!func}idf(text,'the_term')&deftype=func&fl=score&rows=1
   ...while this retrieves idf information only once for each term, the
  added
   latency for doing that many queries piles up to almost two minutes on
 my
   current corpus.
  
   Is there anything I didn't think of -- a way to construct a query to
 get
   idf information for a set of terms all in one go, outside the bounds of
   what terms happen to be in a document?
  
   Failing that, does anyone have a sense for how far I'd have to scale
  down a
   corpus to approach interactive speeds, if I want this sort of data?
  
   Katie
  
 



Re: joins in solr cloud - good or bad idea?

2013-07-08 Thread Roman Chyla
Hello,

The joins are not the only idea, you may want to write your own function
(ValueSource) that can implement your logic. However, I think you should
not throw away the regex idea (as being slow), before trying it out -
because it can be faster than the joins. The catch is that the number of
entities needs to be limited - see the recent replies of Jack Krupansky on the
number of fields.
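
For illustration, this is roughly what the ValueSource route looks like against the
Lucene 4.x function API. It is only a sketch of the shape of a custom function - the
class name and the wrapped-source/threshold logic are made up, not a solution to the
relational problem itself - and to call it from a query you would additionally register
a custom ValueSourceParser in solrconfig.xml:

import java.io.IOException;
import java.util.Map;
import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.queries.function.FunctionValues;
import org.apache.lucene.queries.function.ValueSource;
import org.apache.lucene.queries.function.docvalues.FloatDocValues;

// yields 1.0 when the wrapped per-document value exceeds a threshold, else 0.0
public class ThresholdValueSource extends ValueSource {
  private final ValueSource source;
  private final float threshold;

  public ThresholdValueSource(ValueSource source, float threshold) {
    this.source = source;
    this.threshold = threshold;
  }

  @Override
  public FunctionValues getValues(Map context, AtomicReaderContext readerContext)
      throws IOException {
    final FunctionValues vals = source.getValues(context, readerContext);
    return new FloatDocValues(this) {
      @Override
      public float floatVal(int doc) {
        // the custom "logic" goes here; it is evaluated per document
        return vals.floatVal(doc) > threshold ? 1.0f : 0.0f;
      }
    };
  }

  @Override
  public String description() {
    return "threshold(" + source.description() + "," + threshold + ")";
  }

  @Override
  public boolean equals(Object o) {
    if (!(o instanceof ThresholdValueSource)) return false;
    ThresholdValueSource other = (ThresholdValueSource) o;
    return source.equals(other.source) && threshold == other.threshold;
  }

  @Override
  public int hashCode() {
    return source.hashCode() + Float.floatToIntBits(threshold);
  }
}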

The joins are of different kinds, I recommend this link to see their
differences: http://vimeo.com/44299232

If your data relations can fit in memory, a smart cache (ie [un]inverted
index) will always outperform lucene joins - look at the chart inside this:
http://code4lib.org/files/2ndOrderOperatorsv2.pdf

roman


On Mon, Jul 8, 2013 at 4:03 PM, Marcelo Elias Del Valle
mvall...@gmail.com wrote:

 Hello all,

 I am using Solr Cloud today and I have the following need:

- My queries focus on counting how many users attend to some criteria.
So my main document is user (parent table)
- Each user can access several web pages (a child table) and each web
page might have several attributes.
- I need to lookup for users where there is some page accessed by them
which matches a set of attributes. For example, I have two scenarios:
   1. if a user accessed a web page WP1 with a URL that starts with
   www. and with a title that includes solr, then the user is a
 match.
   2. However, if there is a webpage WP1 with such url and ANOTHER WP2
   that includes solr in the title, this is not a match.


 If I were modeling this on a relational DB, user would be a table and
 url would be other. However, as I using solr, my first option would be
 denormalizing first. Simply storing all the fields in the user document
 wouldn't work, as it would behave as described in scenario 2.
  I thought in two solutions for these:

- Using the idea of an inverted index - Having several kinds of
documents (user, web page, entity 3, entity 4, etc.) where each entity
 (web
page, for instance) would have a field to relate to the user id. Then,
using a cross join in solr to get the results where there was a match on
user (parent table) and also on each child entity (in other words, to
 merge
the results of several queries that might return user ids). This has a
drawback of using a join.
- Having just a user document and storing each web page as only one
field (like a json). To search, the same field would need to match a
regular expression that includes both conditions. This would make my
 search
slower and I would not be able to apply the same technique if the child
tables also had children.

 Am I missing any obvious solution here? I would love to receive critics
 on this, as I am probably not the only one who have this problem...  I
 would like more ideas on how to denormalize data in this case.  Is the join
 my best option here?

 Best regards,
 --
 Marcelo Elias Del Valle
 http://mvalle.com - @mvallebr



Re: solr way to exclude terms

2013-07-08 Thread Roman Chyla
One of the approaches is to create, at index time, a new field based on the stopwords
(ie. a field that accepts only the stopwords :)) - ie. if the document contains them, you
index 1 - and then use q=apple&fq=bad_apple:0
This has many limitations (in terms of flexibility), but it will be
superfast
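
A rough sketch of what that index-time flagging could look like from the client side
with SolrJ - the field name bad_apple, the exclusion list and the naive whitespace
tokenization are all illustrative (in practice you might do the same thing inside an
UpdateRequestProcessor instead):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class FlaggingIndexer {
  // the "excluding list" is applied once, at index time
  private static final Set<String> EXCLUDED =
      new HashSet<String>(Arrays.asList("bad", "wrong", "staled"));

  public static void main(String[] args) throws Exception {
    SolrServer server = new HttpSolrServer("http://localhost:8983/solr");
    String id = "doc-1";
    String text = "a wrong apple";

    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", id);
    doc.addField("text", text);

    // bad_apple=1 when any excluded term occurs in the text, else 0
    int flag = 0;
    for (String token : text.toLowerCase().split("\\s+")) {
      if (EXCLUDED.contains(token)) {
        flag = 1;
        break;
      }
    }
    doc.addField("bad_apple", flag);

    server.add(doc);
    server.commit();
  }
}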

roman


On Mon, Jul 8, 2013 at 4:14 PM, Angela Zhu ang...@squareup.com wrote:

 Is there a solr way to remove any result from the list of search results that
 contains a term in an excluding list?

 For example, suppose I search for apple and get 5 documents contains it,
 and my excluding list is something like ['bad', 'wrong', 'staled'].
 Out of the 5 documents, 3 has a word in this list, so I want solr to return
 only the other 2 documents.

 I know exclude will work, but my list is super long and I don't want to have a
 very long url.
 I know stopwords is not returning the thing I want.
 So is there something I don't know that would work as expected?

 Thanks a lot!
 angela



Re: Solr large boolean filter

2013-07-08 Thread Roman Chyla
OK, thank you Otis, I *think* this should be easy to add - I can try. We
were calling them 'private library' searches

roman


On Mon, Jul 8, 2013 at 11:58 PM, Otis Gospodnetic 
otis.gospodne...@gmail.com wrote:

 Hi Roman,

 I referred to something I called server-side named filters.  It
 matches the feature described at
 http://www.elasticsearch.org/blog/terms-filter-lookup/

 Would be a cool addition, IMHO.

 Otis
 --
 Solr & ElasticSearch Support -- http://sematext.com/
 Performance Monitoring -- http://sematext.com/spm



 On Tue, Jul 2, 2013 at 1:25 PM, Roman Chyla roman.ch...@gmail.com wrote:
  Hello @,
 
  This thread 'kicked' me into finishing some long-past task of
  sending/receiving a large boolean (bitset) filter. We have been using
 bitsets
  with solr before, but now I sat down and wrote it as a qparser. The use
  cases, as you have discussed are:
 
   - necessity to send a long list of ids as a query (where it is not
  possible to do it the 'normal' way)
   - or filtering ACLs
 
 
  It works in the following way:
 
- external application constructs bitset and sends it as a query to
 solr
  (q or fq, depends on your needs)
- solr unpacks the bitset (translating bits into lucene ids, if
  necessary), and wraps this into a query which then has the easy job of
  'filtering' wanted/unwanted items
 
  Therefore it is good only if you can search against something that is
  indexed as integer (id's often are).
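
  (To illustrate the client side of this, a rough sketch of building and packing such a
  bitset - the gzip+base64 packing mirrors the steps timed in the benchmark below; how
  the resulting string is wrapped into the q/fq local-params is plugin-specific and not
  shown here. Assumes Java 7+ for BitSet.toByteArray().)

  import java.io.ByteArrayOutputStream;
  import java.util.BitSet;
  import java.util.zip.GZIPOutputStream;
  import javax.xml.bind.DatatypeConverter;

  public class BitSetFilterClient {
    public static void main(String[] args) throws Exception {
      // bit i is set when the document with (integer) id i should pass the filter
      BitSet bits = new BitSet();
      bits.set(3);
      bits.set(42);
      bits.set(1000000);

      // pack: bitset -> bytes -> gzip -> base64, ready to be put into q/fq
      byte[] raw = bits.toByteArray();
      ByteArrayOutputStream bos = new ByteArrayOutputStream();
      GZIPOutputStream gzip = new GZIPOutputStream(bos);
      gzip.write(raw);
      gzip.close();
      String encoded = DatatypeConverter.printBase64Binary(bos.toByteArray());

      System.out.println("filter payload length: " + encoded.length());
    }
  }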
 
  A simple benchmark shows acceptable performance, to send the bitset
  (randomly populated, 10M, with 4M bits set), it takes 110ms (25+64+20)
 
  To decode this string (resulting byte size 1.5Mb!) it takes ~90ms
  (5+14+68ms)
 
  But I haven't tested latency of sending it over the network and the query
  performance, but since the query is very similar as MatchAllDocs, it is
  probably very fast (and I know that sending many Mbs to Solr is fast as
  well)
 
  I know this is not exactly 'standard' solution, and it is probably not
  something you want to see with hundreds of millions of docs, but people
  seem to be doing 'not the right thing' all the time;)
  So if you think this is something useful for the community, please let me
  know. If somebody would be willing to test it, i can file a JIRA ticket.
 
  Thanks!
 
  Roman
 
 
  The code, if no JIRA is needed, can be found here:
 
 https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/java/org/apache/solr/search/AdsQParserPlugin.java
 
 https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/test/org/apache/solr/search/TestBitSetQParserPlugin.java
 
  839ms.  run
  154ms.  Building random bitset indexSize=1000 fill=0.5 --
  Size=15054208,cardinality=3934477 highestBit=999
   25ms.  Converting bitset to byte array -- resulting array length=125
  20ms.  Encoding byte array into base64 -- resulting array length=168
  ratio=1.344
   62ms.  Compressing byte array with GZIP -- resulting array
 length=1218602
  ratio=0.9748816
  20ms.  Encoding gzipped byte array into base64 -- resulting string
  length=1624804 ratio=1.2998432
   5ms.  Decoding gzipped byte array from base64
  14ms.  Uncompressing decoded byte array
  68ms.  Converting from byte array to bitset
   743ms.  running
 
 
  On Tue, Jun 18, 2013 at 3:51 PM, Erick Erickson erickerick...@gmail.com
 wrote:
 
  Not necessarily. If the auth tokens are available on some
  other system (DB, LDAP, whatever), one could get them
  in the PostFilter and cache them somewhere since,
  presumably, they wouldn't be changing all that often. Or
  use a UserCache and get notified whenever a new searcher
  was opened and regenerate or purge the cache.
 
  Of course you're right if the post filter does NOT have
  access to the source of truth for the user's privileges.
 
  FWIW,
  Erick
 
  On Tue, Jun 18, 2013 at 8:54 AM, Otis Gospodnetic
  otis.gospodne...@gmail.com wrote:
   Hi,
  
   The unfortunate thing about this is what you still have to *pass* that
   filter from the client to the server every time you want to use that
   filter.  If that filter is big/long, passing that in all the time has
   some price that could be eliminated by using server-side named
   filters.
  
   Otis
   --
   Solr  ElasticSearch Support
   http://sematext.com/
  
  
  
  
  
   On Tue, Jun 18, 2013 at 8:16 AM, Erick Erickson 
 erickerick...@gmail.com
  wrote:
   You might consider post filters. The idea
   is to write a custom filter that gets applied
   after all other filters etc. One use-case
   here is exactly ACL lists, and can be quite
   helpful if you're not doing *:* type queries.
  
   Best
   Erick
  
   On Mon, Jun 17, 2013 at 5:12 PM, Otis Gospodnetic
   otis.gospodne...@gmail.com wrote:
   Btw. ElasticSearch has a nice feature here.  Not sure what it's
   called, but I call it named filter.
  
   http://www.elasticsearch.org/blog/terms-filter-lookup/
  
   Maybe that's what OP was after?
  
   Otis
   --
   Solr  ElasticSearch Support
   http

Re: Best way to call asynchronously - Custom data import handler

2013-07-09 Thread Roman Chyla
Other than using futures and callables? Runnables ;-) Beyond that you
will need an async request (ie. an async client).

But in case somebody else is looking for an easy recipe for the server-side async:


public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp) {
  if (isBusy()) {
    rsp.add("message", "Batch processing is already running...");
    rsp.add("status", "busy");
    return;
  }
  runAsynchronously(new LocalSolrQueryRequest(req.getCore(), req.getParams()));
}

private void runAsynchronously(SolrQueryRequest req) {
  // 'thread', 'queue', isBusy()/setBusy() and runSynchronously() are the
  // handler's own fields/helpers
  final SolrQueryRequest request = req;
  thread = new Thread(new Runnable() {
    public void run() {
      try {
        while (queue.hasMore()) {
          runSynchronously(queue, request);
        }
      } catch (Exception e) {
        log.error(e.getLocalizedMessage());
      } finally {
        request.close();
        setBusy(false);
      }
    }
  });

  thread.start();
}


On Tue, Jul 9, 2013 at 1:10 AM, Learner bbar...@gmail.com wrote:


 I wrote a custom data import handler to import data from files. I am trying
 to figure out a way to make asynchronous call instead of waiting for the
 data import response. Is there an easy way to invoke asynchronously  (other
 than using futures and callables) ?

 public class CustomFileImportHandler extends RequestHandlerBase implements SolrCoreAware {
   public void handleRequestBody(SolrQueryRequest arg0, SolrQueryResponse arg1) {
     indexer a = new indexer();   // constructor
     String status = a.Index();   // method to do indexing, trying to make it async
   }
 }




 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Best-way-to-call-asynchronously-Custom-data-import-handler-tp4076475.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: amount of values in a multi value field - is denormalization always the best option?

2013-07-10 Thread Roman Chyla
On Wed, Jul 10, 2013 at 5:37 PM, Marcelo Elias Del Valle mvall...@gmail.com
 wrote:

 Hello,

 I have asked a question recently about solr limitations and some about
 joins. It comes that this question is about both at the same time.
 I am trying to figure how to denormalize my data so I will need just 1
 document in my index instead of performing a join. I figure one way of
 doing this is storing an entity as a multivalued field, instead of storing
 different fields.
 Let me give an example. Consider the entities:

 User:
 id: 1
 type: Joan of Arc
 age: 27

 Webpage:
 id: 1
 url: http://wiki.apache.org/solr/Join
 category: Technical
 user_id: 1

 id: 2
 url: http://stackoverflow.com
 category: Technical
 user_id: 1

 Instead of creating 1 document for user, 1 for webpage 1 and 1 for
 webpage 2 (1 parent and 2 childs) I could store webpages in a user
 multivalued field, as follows:

 User:
 id: 1
 name: Joan of Arc
 age: 27
 webpage1: [id:1, url: http://wiki.apache.org/solr/Join;, category:
 Technical]
 webpage2: [id:2, url: http://stackoverflow.com;, category:
 Technical]

 It would probably perform better than the join, right? However, it made
 me think about solr limitations again. What if I have 200 million webpges
 (200 million fields) per user? Or imagine a case where I could have 200
 million values on a field, like in the case I need to index every html DOM
 element (div, a, etc.) for each web page user visited.
 I mean, if I need to do the query and this is a business requirement no
 matter what, although denormalizing could be better than using query time
 joins, I wonder it distributing the data present in this single document
 along the cluster wouldn't give me better performance. And this is
 something I won't get with block joins or multivalued fields...


Indeed, and when you think of it, then there are only (2?) alternatives

1. let your distributed search cluster have the knowledge of relations
2. denormalize & duplicate the data


 I guess there is probably no right answer for this question (at least
 not a known one), and I know I should create a POC to check how each
 perform... But do you think a so large number of values in a single
 document could make denormalization not possible in an extreme case like
 this? Would you share my thoughts if I said denormalization is not always
 the right option?


Aren't words of natural language (and whatever crap there comes with them
in the fulltext) similar? You may not want to retrieve relations between
every word that you indexed, but still you can index millions of unique
tokens (well, having 200 million seems too high). But if you were having
such a high number of unique values, you can think of indexing hash values
- search for 'near-duplicates' could be acceptable too.

And so, with lucene, only denormalization will give you anywhere close
to acceptable search speed. If you look at the code that executes the join
search, you would see that values for the 1st order search are harvested,
then a new search (or lookup) is performed - so it has to be almost always
slower than the inverted index lookup

roman



 Best regards,
 --
 Marcelo Elias Del Valle
 http://mvalle.com - @mvallebr



Re: Performance of cross join vs block join

2013-07-12 Thread Roman Chyla
Hi Mikhail,
I have commented on your blog, but it seems I have done something wrong, as the
comment is not there. Would it be possible to share the test setup (script)?

I have found out that the crucial thing with joins is the number of 'joins'
[hits returned] and it seems that the experiments I have seen so far were
geared towards small collections - even if Erick's index was 26M, the number
of hits was probably small - you can see a very different story if you face
some [other] real data. Here is a citation network and I was comparing
lucene joins [ie not the block joins, because these cannot be used for
citation data - we cannot reasonably index them into one segment]:

https://github.com/romanchyla/r-ranking-fun/blob/master/plots/raw/comparison-join-2nd.png

Notice, the y axis is sqrt-scaled, so the running time for the lucene join is growing
and growing very fast! It takes lucene 30s to do the search that selects 1M
hits.

The comparison is against our own implementation of a similar search - but
the main point I am making is that the join benchmarks should be showing
the number of hits selected by the join operation. Otherwise, a very
important detail is hidden.

Best,

  roman


On Fri, Jul 12, 2013 at 4:57 AM, Mikhail Khludnev 
mkhlud...@griddynamics.com wrote:

 On Fri, Jul 12, 2013 at 12:19 PM, mihaela olteanu mihaela...@yahoo.com
 wrote:

  Hi Mikhail,
 
  I have used wrong the term block join. When I said block join I was
  referring to a join performed on a single core versus cross join which
 was
  performed on multiple cores.
  But I saw your benchmark (from cache) and it seems that block join has
  better performance. Is this functionality available on Solr 4.3.1?

 nope SOLR-3076 awaits for ages.


  I did not find such examples on Solr's wiki page.
  Does this functionality require a special schema, or a special indexing?

 Special indexing - yes.


  How would I need to index the data from my tables? In my case anyway all
  the indices have a common schema since I am using dynamic fields, thus I
  can easily add all documents from all tables in one Solr core, but for
 each
  document to add a discriminator field.
 
 correct. but notion of ' discriminator field' is a little bit different for
 blockjoin.


 
  Could you point me to some more documentation?
 

 I can recommend only those

 http://blog.mikemccandless.com/2012/01/searching-relational-content-with.html
 http://www.youtube.com/watch?v=-OiIlIijWH0


  Thanks in advance,
  Mihaela
 
 
  
   From: Mikhail Khludnev mkhlud...@griddynamics.com
  To: solr-user solr-user@lucene.apache.org; mihaela olteanu 
  mihaela...@yahoo.com
  Sent: Thursday, July 11, 2013 2:25 PM
  Subject: Re: Performance of cross join vs block join
 
 
  Mihaela,
 
  For me it's reasonable that single core join takes the same time as cross
  core one. I just can't see which gain can be obtained from in the former
  case.
  I hardly able to comment join code, I looked into, it's not trivial, at
  least. With block join it doesn't need to obtain parentId term
  values/numbers and lookup parents by them. Both of these actions are
  expensive. Also blockjoin works as an iterator, but join need to allocate
  memory for parents bitset and populate it out of order that impacts
  scalability.
  Also in None scoring mode BJQ don't need to walk through all children,
 but
  only hits first. Also, nice feature is 'both side leapfrog' if you have a
  highly restrictive filter/query intersects with BJQ, it allows to skip
 many
  parents and children as well, that's not possible in Join, which has
 fairly
  'full-scan' nature.
  Main performance factor for Join is number of child docs.
  I'm not sure I got all your questions, please specify them in more
 details,
  if something is still unclear.
  have you saw my benchmark
  http://blog.griddynamics.com/2012/08/block-join-query-performs.html ?
 
 
 
  On Thu, Jul 11, 2013 at 1:52 PM, mihaela olteanu mihaela...@yahoo.com
  wrote:
 
   Hello,
  
   Does anyone know about some measurements in terms of performance for
  cross
   joins compared to joins inside a single index?
  
   Is it faster the join inside a single index that stores all documents
 of
   various types (from parent table or from children tables)with a
   discriminator field compared to the cross join (basically in this case
  each
   document type resides in its own index)?
  
   I have performed some tests but to me it seems that having a join in a
   single index (bigger index) does not add too much speed improvements
   compared to cross joins.
  
   Why a block join would be faster than a cross join if this is the case?
   What are the variables that count when trying to improve the query
   execution time?
  
   Thanks!
   Mihaela
 
 
 
 
  --
  Sincerely yours
  Mikhail Khludnev
  Principal Engineer,
  Grid Dynamics
 
  http://www.griddynamics.com
  mkhlud...@griddynamics.com




 --
 Sincerely yours
 Mikhail Khludnev
 Principal Engineer,
 

Re: ACL implementation: Pseudo-join performance & Atomic Updates

2013-07-15 Thread Roman Chyla
On Sun, Jul 14, 2013 at 1:45 PM, Oleg Burlaca oburl...@gmail.com wrote:

 Hello Erick,

  Join performance is most sensitive to the number of values
  in the field being joined on. So if you have lots and lots of
  distinct values in the corpus, join performance will be affected.
 Yep, we have a list of unique Id's that we get by first searching for
 records
 where loggedInUser IS IN (userIDs)
 This corpus is stored in memory I suppose? (not a problem) and then the
 bottleneck is to match this huge set with the core where I'm searching?

 Somewhere in maillist archive people were talking about external list of
 Solr unique IDs
 but didn't find if there is a solution.
 Back in 2010 Yonik posted a comment:
 http://find.searchhub.org/document/363a4952446b3cd#363a4952446b3cd


sorry, haven't read the previous thread in its entirety, but a few weeks back that
Yonik's proposal got implemented, it seems ;)

http://search-lucene.com/m/Fa3Dg14mqoj/bitsetsubj=Re+Solr+large+boolean+filter

You could use this to send a very large bitset filter (which can be
translated into any integers, if you can come up with a mapping function).

roman



  bq: I suppose the delete/reindex approach will not change soon
  There is ongoing work (search the JIRA for Stacked Segments)
 Ah, ok, I was feeling it affects the architecture, ok, now the only hope is
 Pseudo-Joins ))

  One way to deal with this is to implement a post filter, sometimes
 called
  a no cache filter.
 thanks, will have a look, but as you describe it, it's not the best option.

 The approach
 too many documents, man. Please refine your query. Partial results below
 means faceting will not work correctly?

 ... I have in mind a hybrid approach, comments welcome:
 Most of the time users are not searching, but browsing content, so our
 virtual filesystem stored in SOLR will use only the index with the Id of
 the file and the list of users that have access to it. i.e. not touching
 the fulltext index at all.

 Files may have metadata (EXIF info for images for ex) that we'd like to
 filter by, calculate facets.
 Meta will be stored in both indexes.

 In case of a fulltext query:
 1. search FT index (the fulltext index), get only the number of search
 results, let it be Rf
 2. search DAC index (the index with permissions), get number of search
 results, let it be Rd

 let maxR be the maximum size of the corpus for the pseudo-join.
 *That was actually my question: what is a reasonable number? 10, 100, 1000
 ?
 *

 if (Rf < maxR) or (Rd < maxR) then use the smaller corpus to join onto the
 second one.
 this happens when (only a few documents contains the search query) OR (user
 has access to a small number of files).

 In case none of these happens, we can use the
 too many documents, man. Please refine your query. Partial results below
 but first searching the FT index, because we want relevant results first.

 What do you think?

 Regards,
 Oleg




 On Sun, Jul 14, 2013 at 7:42 PM, Erick Erickson erickerick...@gmail.com
 wrote:

  Join performance is most sensitive to the number of values
  in the field being joined on. So if you have lots and lots of
  distinct values in the corpus, join performance will be affected.
 
  bq: I suppose the delete/reindex approach will not change soon
 
  There is ongoing work (search the JIRA for Stacked Segments)
  on actually doing something about this, but it's been under
 consideration
  for at least 3 years so your guess is as good as mine.
 
  bq: notice that the worst situation is when everyone has access to all
 the
  files, it means the first filter will be the full index.
 
  One way to deal with this is to implement a post filter, sometimes
 called
  a no cache filter. The distinction here is that
  1 it is not cached (duh!)
  2 it is only called for documents that have made it through all the
   other lower cost filters (and the main query of course).
  3 lower cost means the filter is either a standard, cached filters
  and any no cache filters with a cost (explicitly stated in the
 query)
  lower than this one's.
 
  Critically, and unlike normal filter queries, the result set is NOT
  calculated for all documents ahead of time
 
  You _still_ have to deal with the sysadmin doing a *:* query as you
  are well aware. But one can mitigate that by having the post-filter
  fail all documents after some arbitrary N, and display a message in the
  app like too many documents, man. Please refine your query. Partial
  results below. Of course this may not be acceptable, but
 
  HTH
  Erick
 
  On Sun, Jul 14, 2013 at 12:05 PM, Jack Krupansky
  j...@basetechnology.com wrote:
   Take a look at LucidWorks Search and its access control:
  
 
 http://docs.lucidworks.com/display/help/Search+Filters+for+Access+Control
  
   Role-based security is an easier nut to crack.
  
   Karl Wright of ManifoldCF had a Solr patch for document access control
 at
   one point:
   SOLR-1895 - ManifoldCF SearchComponent plugin for enforcing 

Re: ACL implementation: Pseudo-join performance & Atomic Updates

2013-07-16 Thread Roman Chyla
Erick,

I wasn't sure this issue is important, so I wanted to first solicit some
feedback. You and Otis expressed interest, and I could create the JIRA -
however, as Alexandre, points out, the SOLR-1913 seems similar (actually,
closer to the Otis request to have the elasticsearch named filter) but the
SOLR-1913 was created in 2010 and is not integrated yet, so I am wondering
whether this new feature (somewhat overlapping, but still different from
SOLR-1913) is something people would really want and the effort on the JIRA
is well spent. What's your view?

Thanks,

  roman




On Tue, Jul 16, 2013 at 8:23 AM, Alexandre Rafalovitch
arafa...@gmail.com wrote:

 Is that this one: https://issues.apache.org/jira/browse/SOLR-1913 ?

 Regards,
Alex.

 Personal website: http://www.outerthoughts.com/
 LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
 - Time is the quality of nature that keeps events from happening all at
 once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


 On Tue, Jul 16, 2013 at 8:01 AM, Erick Erickson erickerick...@gmail.com
 wrote:

  Roman:
 
  Did this ever make into a JIRA? Somehow I missed it if it did, and this
  would
  be pretty cool
 
  Erick
 
  On Mon, Jul 15, 2013 at 6:52 PM, Roman Chyla roman.ch...@gmail.com
  wrote:
   On Sun, Jul 14, 2013 at 1:45 PM, Oleg Burlaca oburl...@gmail.com
  wrote:
  
   Hello Erick,
  
Join performance is most sensitive to the number of values
in the field being joined on. So if you have lots and lots of
distinct values in the corpus, join performance will be affected.
   Yep, we have a list of unique Id's that we get by first searching for
   records
   where loggedInUser IS IN (userIDs)
   This corpus is stored in memory I suppose? (not a problem) and then
 the
   bottleneck is to match this huge set with the core where I'm
 searching?
  
   Somewhere in maillist archive people were talking about external list
  of
   Solr unique IDs
   but didn't find if there is a solution.
   Back in 2010 Yonik posted a comment:
   http://find.searchhub.org/document/363a4952446b3cd#363a4952446b3cd
  
  
   sorry, haven't read the previous thread in its entirety, but a few weeks back
  that
   Yonik's proposal got implemented, it seems ;)
  
  
 
 http://search-lucene.com/m/Fa3Dg14mqoj/bitsetsubj=Re+Solr+large+boolean+filter
  
   You could use this to send very large bitset filter (which can be
   translated into any integers, if you can come up with a mapping
  function).
  
   roman
  
  
  
bq: I suppose the delete/reindex approach will not change soon
There is ongoing work (search the JIRA for Stacked Segments)
   Ah, ok, I was feeling it affects the architecture, ok, now the only
  hope is
   Pseudo-Joins ))
  
One way to deal with this is to implement a post filter, sometimes
   called
a no cache filter.
   thanks, will have a look, but as you describe it, it's not the best
  option.
  
   The approach
   too many documents, man. Please refine your query. Partial results
  below
   means faceting will not work correctly?
  
   ... I have in mind a hybrid approach, comments welcome:
   Most of the time users are not searching, but browsing content, so our
   virtual filesystem stored in SOLR will use only the index with the
 Id
  of
   the file and the list of users that have access to it. i.e. not
 touching
   the fulltext index at all.
  
   Files may have metadata (EXIF info for images for ex) that we'd like
 to
   filter by, calculate facets.
   Meta will be stored in both indexes.
  
   In case of a fulltext query:
   1. search FT index (the fulltext index), get only the number of search
   results, let it be Rf
   2. search DAC index (the index with permissions), get number of search
   results, let it be Rd
  
   let maxR be the maximum size of the corpus for the pseudo-join.
   *That was actually my question: what is a reasonable number? 10, 100,
  1000
   ?
   *
  
   if (Rf < maxR) or (Rd < maxR) then use the smaller corpus to join onto
  the
   second one.
   this happens when (only a few documents contains the search query) OR
  (user
   has access to a small number of files).
  
   In case none of these happens, we can use the
   too many documents, man. Please refine your query. Partial results
  below
   but first searching the FT index, because we want relevant results
  first.
  
   What do you think?
  
   Regards,
   Oleg
  
  
  
  
   On Sun, Jul 14, 2013 at 7:42 PM, Erick Erickson 
  erickerick...@gmail.com
   wrote:
  
Join performance is most sensitive to the number of values
in the field being joined on. So if you have lots and lots of
distinct values in the corpus, join performance will be affected.
   
bq: I suppose the delete/reindex approach will not change soon
   
There is ongoing work (search the JIRA for Stacked Segments)
on actually doing something about this, but it's been under
   consideration
for at least 3 years so your guess is as good

Re: Range query on a substring.

2013-07-16 Thread Roman Chyla
Well, I think this is slightly too categorical - a range query on a
substring can be thought of as a simple range query. So, for example the
following query:

lucene 1*

becomes behind the scenes: lucene (10|11|12|13|14|1abcd)

the issue there is that it is a string range, but it is a range query - it
just has to be indexed in a clever way

So, Marcin, you still have quite a few options besides the strict boolean
query model

1. have a special tokenizer chain which creates one token out of these
groups (eg. some text prefix_1) and search for some text prefix_* [and
do some post-filtering if necessary]
2. another version, using regex /some text (1|2|3...)/ - you got the idea
3. construct the lucene multi-term range query automatically, in your
qparser - to produce a phrase query lucene (10|11|12|13|14); a minimal
sketch of this one follows right after this list
4. use payloads to index your integer at the position of some text and
then retrieve only some text where the payload is in range x-y - an
example is here, look at getPayloadQuery()
https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/test/org/adsabs/lucene/BenchmarkAuthorSearch.java
- but this is a more complex situation and if you google, you will find a
better description
5. use a qparser that is able to handle nested search and analysis at the
same time - eg. your query is: field:some text NEAR1 field:[0 TO 10] - i
know about a parser that can handle this and i invite others to check it
out (yeah, JIRA tickets need reviewers ;-))
https://issues.apache.org/jira/browse/LUCENE-5014
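
The sketch promised under option 3, using Lucene's MultiPhraseQuery - the field name
and the expanded terms are made up; a real qparser would enumerate the terms falling
into the range from the index instead of hard-coding them:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.MultiPhraseQuery;
import org.apache.lucene.search.Query;

public class RangePhraseExample {
  public static Query build() {
    // "lucene (10|11|12|13|14)" - one fixed token followed by any of the
    // terms that fall into the wanted numeric range
    MultiPhraseQuery q = new MultiPhraseQuery();
    q.add(new Term("myfield", "lucene"));
    q.add(new Term[] {
        new Term("myfield", "10"),
        new Term("myfield", "11"),
        new Term("myfield", "12"),
        new Term("myfield", "13"),
        new Term("myfield", "14")
    });
    return q;
  }
}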

there might be others i forgot, but it is certainly doable; but as Jack
points out, you may want to stop for a moment to reflect whether it is
necessary

HTH,

  roman


On Tue, Jul 16, 2013 at 8:35 AM, Jack Krupansky j...@basetechnology.com wrote:

 Sorry, but you are basically misusing Solr (and multivalued fields),
 trying to take a shortcut to avoid a proper data model.

 To properly use Solr, you need to put each of these multivalued field
 values in a separate Solr document, with a text field and a value
 field. Then, you can query:

   text:"some text" AND value:[min-value TO max-value]

 Exactly how you should restructure your data model is dependent on all of
 your other requirements.

 You may be able to simply flatten your data.

 You may be able to use a simple join operation.

 Or, maybe you need to do a multi-step query operation if you data is
 sufficiently complex.

 If you want to keep your multivalued field in its current form for display
 purposes or keyword search, or exact match search, fine, but your stated
 goal is inconsistent with the Semantics of Solr and Lucene.

 To be crystal clear, there is no such thing as a range query on a
 substring in Solr or Lucene.

 -- Jack Krupansky

 -Original Message- From: Marcin Rzewucki
 Sent: Tuesday, July 16, 2013 5:13 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Range query on a substring.


 By multivalued I meant an array of values. For example:
 <arr name="myfield">
   <str>text1 (X)</str>
   <str>text2 (Y)</str>
 </arr>

 I'd like to avoid splitting it as you propose. I have a 2.3mn-record collection with
 pretty large records (a few hundred fields and more per record). Duplicating
 them would impact performance.

 Regards.



 On 16 July 2013 10:26, Oleg Burlaca oburl...@gmail.com wrote:

  Ah, you mean something like this:
 record:
 Id=10, text =  this is a text N1 (X), another text N2 (Y), text N3 (Z)
 Id=11, text =  this is a text N1 (W), another text N2 (Q), third text
 (M)

 and you need to search for: text N1 and X  B ?
 How big is the core? the first thing that comes to my mind, again, at
 indexing level,
 split the text into pieces and index it in solr like this:

 record_id | text  | value
 10   | text N1 | X
 10   | text N2 | Y
 10   | text N3 | Z

 does it help?



 On Tue, Jul 16, 2013 at 10:51 AM, Marcin Rzewucki mrzewu...@gmail.com
 wrote:

  Hi Oleg,
  It's a multivalued field and it won't be easier to query when I split
 this
  field into text and numbers. I may get wrong results.
 
  Regards.
 
 
  On 16 July 2013 09:35, Oleg Burlaca oburl...@gmail.com wrote:
 
   IMHO the number(s) should be extracted and stored in separate columns
 in
   SOLR at indexing time.
  
   --
   Oleg
  
  
   On Tue, Jul 16, 2013 at 10:12 AM, Marcin Rzewucki 
 mrzewu...@gmail.com
   wrote:
  
Hi,
   
I have a problem (wonder if it is possible to solve it at all) with
 the
following query. There are documents with a field which contains a
 text
   and
a number in brackets, eg.
   
myfield: this is a text (number)
   
There might be some other documents with the same text but different
   number
in brackets.
I'd like to find documents with the given text say this is a text
 and
number between A and B. Is it possible in Solr ? Any ideas ?
   
Kind regards.
   
  
 





Re: Range query on a substring.

2013-07-16 Thread Roman Chyla
On Tue, Jul 16, 2013 at 5:08 PM, Marcin Rzewucki mrzewu...@gmail.com wrote:

 Hi guys,

 First of all, thanks for your response.

 Jack: Data structure was created some time ago and this is a new
 requirement in my project. I'm trying to find a solution. I wouldn't like
 to split multivalued field into N similar records varying in this
 particular field only. That could impact performance and imply more changes
 in backend architecture as well. I'd prefer to create yet another
 collection and use pseudo-joins...

 Roman: Your ideas seem to be much closer to what I'm looking for. However,
 the following syntax: text (1|2|3) does not work for me. Are you sure it
 works like OR inside a regexp ?


I wasn't clear, sorry: the text (1|2|3) is a result of the term expansion
- you can see something like that when you look at the debugQuery=true output
after you send a phrase query like quer* - lucene will search for the variants by
enumerating the possible alternatives, hence phrase (token|token|token)

it is possible to construct such a query manually, it depends on your
application

one more thing: the term expansion depends on the type of the field (ie.
expanding string field is different from the int field type), yet you could
very easily write a small processor that looks at the range values and
treats them as numbers (*after* they were parsed by the qparser, but
*before* they were built into a query - hmmm, now when I think of it...
your values will be indexed as strings, so you have to search/expand into
string byterefs - it's doable, just wanted to point out this detail - in
normal situations, SOLR will be building query tokens using the string/text
field, because your field will be of that type)

roman



 By the way: Honestly, I have one more requirement for which I would have to
 extend Solr query syntax. Basically, it should be possible to do some math
 on few fields and do range query on the result (without indexing it,
 because a combination of different fields is allowed). I'd like to spend
 some time on ANTLR and the new way of parsing you mentioned. I will let you
 know if it was useful for me. Thanks.

 Kind regards.


 On 16 July 2013 20:07, Roman Chyla roman.ch...@gmail.com wrote:

  Well, I think this is slightly too categorical - a range query on a
  substring can be thought of as a simple range query. So, for example the
  following query:
 
  lucene 1*
 
  becomes behind the scenes: lucene (10|11|12|13|14|1abcd)
 
  the issue there is that it is a string range, but it is a range query -
 it
  just has to be indexed in a clever way
 
  So, Marcin, you still have quite a few options besides the strict boolean
  query model
 
  1. have a special tokenizer chain which creates one token out of these
  groups (eg. some text prefix_1) and search for some text prefix_*
 [and
  do some post-filtering if necessary]
  2. another version, using regex /some text (1|2|3...)/ - you got the idea
  3. construct the lucene multi-term range query automatically, in your
  qparser - to produce a phrase query lucene (10|11|12|13|14)
  4. use payloads to index your integer at the position of some text and
  then retrieve only some text where the payload is in range x-y - an
  example is here, look at getPayloadQuery()
 
 
 https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/test/org/adsabs/lucene/BenchmarkAuthorSearch.java-
  but this is more complex situation and if you google, you will find a
  better description
  5. use a qparser that is able to handle nested search and analysis at the
  same time - eg. your query is: field:some text NEAR1 field:[0 TO 10] -
 i
  know about a parser that can handle this and i invite others to check it
  out (yeah, JIRA tickets need reviewers ;-))
  https://issues.apache.org/jira/browse/LUCENE-5014
 
  there might be others i forgot, but it is certainly doable; but as Jack
  points out, you may want to stop for a moment to reflect whether it is
  necessary
 
  HTH,
 
roman
 
 
  On Tue, Jul 16, 2013 at 8:35 AM, Jack Krupansky j...@basetechnology.com
  wrote:
 
   Sorry, but you are basically misusing Solr (and multivalued fields),
   trying to take a shortcut to avoid a proper data model.
  
   To properly use Solr, you need to put each of these multivalued field
   values in a separate Solr document, with a text field and a value
   field. Then, you can query:
  
  text:some text AND value:[min-value TO max-value]
  
   Exactly how you should restructure your data model is dependent on all
 of
   your other requirements.
  
   You may be able to simply flatten your data.
  
   You may be able to use a simple join operation.
  
   Or, maybe you need to do a multi-step query operation if you data is
   sufficiently complex.
  
   If you want to keep your multivalued field in its current form for
  display
   purposes or keyword search, or exact match search, fine, but your
 stated
   goal is inconsistent with the Semantics of Solr and Lucene.
  
   To be crystal clear

Re: Searching w/explicit Multi-Word Synonym Expansion

2013-07-17 Thread Roman Chyla
Hi all,

What I find very 'sad' is that Lucene/SOLR contain all the necessary
components for handling multi-token synonyms; the Finite State Automaton
works perfectly for matching these items; the biggest problem is IMO the
old query parser which splits things on spaces and doesn't know how to be
smarter.

THIS IS A LONG-TIME PROBLEM - THERE EXIST SEVERAL WORKING SOLUTIONS (but
none was committed...sigh, we are re-inventing wheel all the time...)

LUCENE-1622
LUCENE-4381
LUCENE-4499


The problem of synonym expansion is more difficult because of the parsing -
the default parsers are not flexible and they split on empty space -
recently I have proposed a solution which makes also the multi-token
synonym expansion simple

this is the ticket:
https://issues.apache.org/jira/browse/LUCENE-5014

that query parser is able to split on spaces, then look back, do the second
pass to see whether to expand with synonyms - and even discover different
parse paths and construct different queries based on that. if you want to
see some complex examples, look at:
https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/test/org/apache/solr/analysis/TestAdsabsTypeFulltextParsing.java
-
eg. line 373, 483


Lucene/SOLR developers are already doing great work and have much to do -
they need help from everybody who is able to apply patch, test it and
report back to JIRA.

roman



On Wed, Jul 17, 2013 at 9:37 AM, dmarini david.marini...@gmail.com wrote:

 iorixxx,

 Thanks for pointing me in the direction of the QueryElevation component. If
 it did not require that the target documents be keyed by the unique key
 field it would be ideal, but since our Sku field is not the Unique field
 (we
 have an internal id which serves as the key while this is the client's key)
 it doesn't seem like it will match unless I make a larger scope change.

 Jack,

 I agree that out of the box there hasn't been a generalized solution for
 this yet. I guess what I'm looking for is confirmation that I've gone as
 far
 as I can properly and from this point need to consider using something like
 the HON custom query parser component (which we're leery of using because
 from my reading it solves a specific scenario that may overcompensate what
 we're attempting to fix). I would personally rather stay IN solr than add
 custom .jar files from around the web if at all possible.

 Thanks for the replies.

 --Dave





 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Searching-w-explicit-Multi-Word-Synonym-Expansion-tp4078469p4078610.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Searching w/explicit Multi-Word Synonym Expansion

2013-07-17 Thread Roman Chyla
OK, let's do a simple test instead of making claims - take your solr
instance, anything bigger or equal to version 4.0

In your schema.xml, pick a field and add the synonym filter

<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
        ignoreCase="true" expand="true"
        tokenizerFactory="solr.KeywordTokenizerFactory" />

in your synonyms.txt, add these entries:

hubble\0space\0telescope, HST

ATTENTION: the \0 is a null byte, it must be written as a real null byte! You can
do it with: python -c "print \"hubble\0space\0telescope,HST\"" > synonyms.txt

send a phrase query q=field:"hubble space telescope"&debugQuery=true

if you have done it right, you will see 'HST' is in the list - this means,
solr is able to recognize the multi-token synonym! As far as recognition is
concerned, there is no need for more work on FST.

I have written a big unittest that proves the point (9 months ago,
LUCENE-4499) making no changes in the way how FST works. What is missing is
the query parser that can take advantage - another JIRA issue.

I'll repeat my claim now: the solution(s) are there, they solve the problem
completely - they are not inside one JIRA issue, but they are there. They
need to be proven wrong, NOT proclaimed incomplete.


roman


On Wed, Jul 17, 2013 at 10:22 AM, Jack Krupansky j...@basetechnology.com wrote:

 To the best of my knowledge, there is no patch or collection of patches
 which constitutes a working solution - just partial solutions.

 Yes, it is true, there is some FST work underway (active??) that shows
 promise depending on query parser implementation, but again, this is all a
 longer-term future, not a here and now. Maybe in the 5.0 timeframe?

 I don't want anyone to get the impression that there are off-the-shelf
 patches that completely solve the synonym phrase problem. Yes, progress is
 being made, but we're not there yet.

 -- Jack Krupansky

 -Original Message- From: Roman Chyla
 Sent: Wednesday, July 17, 2013 9:58 AM
 To: solr-user@lucene.apache.org

 Subject: Re: Searching w/explicit Multi-Word Synonym Expansion

 Hi all,

 What I find very 'sad' is that Lucene/SOLR contain all the necessary
 components for handling multi-token synonyms; the Finite State Automaton
 works perfectly for matching these items; the biggest problem is IMO the
 old query parser which split things on spaces and doesn't know to be
 smarter.

 THIS IS A LONG-TIME PROBLEM - THERE EXIST SEVERAL WORKING SOLUTIONS (but
 none was committed...sigh, we are re-inventing wheel all the time...)

 LUCENE-1622
 LUCENE-4381
 LUCENE-4499


 The problem of synonym expansion is more difficult becuase of the parsing -
 the default parsers are not flexible and they split on empty space -
 recently I have proposed a solution which makes also the multi-token
 synonym expansion simple

 this is the ticket:
 https://issues.apache.org/jira/browse/LUCENE-5014

 that query parser is able to split on spaces, then look back, do the second
 pass to see whether to expand with synonyms - and even discover different
 parse paths and construct different queries based on that. if you want to
 see some complex examples, look at:
 https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/test/org/apache/solr/analysis/TestAdsabsTypeFulltextParsing.java
 -
 eg. line 373, 483


 Lucene/SOLR developers are already doing great work and have much to do -
 they need help from everybody who is able to apply patch, test it and
 report back to JIRA.

 roman



 On Wed, Jul 17, 2013 at 9:37 AM, dmarini david.marini...@gmail.com
 wrote:

  iorixxx,

 Thanks for pointing me in the direction of the QueryElevation component.
 If
 it did not require that the target documents be keyed by the unique key
 field it would be ideal, but since our Sku field is not the Unique field
 (we
 have an internal id which serves as the key while this is the client's
 key)
 it doesn't seem like it will match unless I make a larger scope change.

 Jack,

 I agree that out of the box there hasn't been a generalized solution for
 this yet. I guess what I'm looking for is confirmation that I've gone as
 far
 as I can properly and from this point need to consider using something
 like
 the HON custom query parser component (which we're leery of using because
 from my reading it solves a specific scenario that may overcompensate what
 we're attempting to fix). I would personally rather stay IN solr than add
 custom .jar files from around the web if at all possible.

 Thanks for the replies.

 --Dave





 --
 View this message in context:
  http://lucene.472066.n3.nabble.com/Searching-w-explicit-Multi-Word-Synonym-Expansion-tp4078469p4078610.html
 Sent

Re: Searching w/explicit Multi-Word Synonym Expansion

2013-07-17 Thread Roman Chyla
As I can't see into the heads of the users, I can make different assumptions
- but OK, it seems reasonable that only a minority of users here are actually
willing to do more (btw, I've received coding advice in the past here in
this list). I am working under the assumption that Lucene/SOLR devs are
swamped (there are always more requests and many unclosed JIRA issues), so
where else do they get a helping hand than from users of this list? Users
like me, for example.

roman


On Wed, Jul 17, 2013 at 11:59 AM, Jack Krupansky j...@basetechnology.com wrote:

 Remember, this is the users list, not the dev list. Users want to know
 what they can do and use off the shelf today, not what could be
 developed. Hopefully, the situation will be brighter in six months or a
 year, but today... is today, not tomorrow.

 (And, in fact, users can use LucidWorks Search for query-time phrase
 synonyms, off-the-shelf, today, no patches required.)


 -- Jack Krupansky

 -Original Message- From: Roman Chyla
 Sent: Wednesday, July 17, 2013 11:44 AM

 To: solr-user@lucene.apache.org
 Subject: Re: Searching w/explicit Multi-Word Synonym Expansion

 OK, let's do a simple test instead of making claims - take your solr
 instance, anything bigger or equal to version 4.0

 In your schema.xml, pick a field and add the synonym filter

 <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
         ignoreCase="true" expand="true"
         tokenizerFactory="solr.KeywordTokenizerFactory" />

 in your synonyms.txt, add these entries:

 hubble\0space\0telescope, HST

 ATTENTION: the \0 is a null byte, it must be written as a real null byte! You can
 do it with: python -c "print \"hubble\0space\0telescope,HST\"" > synonyms.txt

 send a phrase query q=field:"hubble space telescope"&debugQuery=true

 if you have done it right, you will see 'HST' is in the list - this means,
 solr is able to recognize the multi-token synonym! As far as recognition is
 concerned, there is no need for more work on FST.

 I have written a big unittest that proves the point (9 months ago,
 LUCENE-4499) making no changes in the way how FST works. What is missing is
 the query parser that can take advantage - another JIRA issue.

 I'll repeat my claim now: the solution(s) are there, they solve the problem
 completely - they are not inside one JIRA issue, but they are there. They
 need to be proven wrong, NOT proclaimed incomplete.


 roman


 On Wed, Jul 17, 2013 at 10:22 AM, Jack Krupansky j...@basetechnology.com
 wrote:

  To the best of my knowledge, there is no patch or collection of patches
 which constitutes a working solution - just partial solutions.

 Yes, it is true, there is some FST work underway (active??) that shows
 promise depending on query parser implementation, but again, this is all a
 longer-term future, not a here and now. Maybe in the 5.0 timeframe?

 I don't want anyone to get the impression that there are off-the-shelf
 patches that completely solve the synonym phrase problem. Yes, progress is
 being made, but we're not there yet.

 -- Jack Krupansky

 -Original Message- From: Roman Chyla
 Sent: Wednesday, July 17, 2013 9:58 AM
 To: solr-user@lucene.apache.org

 Subject: Re: Searching w/explicit Multi-Word Synonym Expansion

 Hi all,

 What I find very 'sad' is that Lucene/SOLR contain all the necessary
 components for handling multi-token synonyms; the Finite State Automaton
 works perfectly for matching these items; the biggest problem is IMO the
 old query parser which split things on spaces and doesn't know to be
 smarter.

 THIS IS A LONG-TIME PROBLEM - THERE EXIST SEVERAL WORKING SOLUTIONS (but
 none was committed...sigh, we are re-inventing wheel all the time...)

 LUCENE-1622
 LUCENE-4381
 LUCENE-4499


 The problem of synonym expansion is more difficult because of the parsing
 -
 the default parsers are not flexible and they split on empty space -
 recently I have proposed a solution which makes also the multi-token
 synonym expansion simple

 this is the ticket:
 https://issues.apache.org/jira/browse/LUCENE-5014
 


 that query parser is able to split on spaces, then look back, do the
 second
 pass to see whether to expand with synonyms - and even discover different
 parse paths and construct different queries based on that. if you want to
 see some complex examples, look at:
 https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/test/org/apache/solr/analysis/TestAdsabsTypeFulltextParsing.java

Re: Searching w/explicit Multi-Word Synonym Expansion

2013-07-17 Thread Roman Chyla
Hi Dave,



On Wed, Jul 17, 2013 at 2:03 PM, dmarini david.marini...@gmail.com wrote:

 Roman,

 As a developer, I understand where you are coming from. My issue is that I
 specialize in .NET, haven't done java dev in over 10 years. As an
 organization we're new to solr (coming from endeca) and we're looking to
 use
 it more across the organization, so for us, we are looking to do the
 classic
 time/payoff justification for most features that are causing a bit of
 friction. I have seen custom query parsers that are out there that seem
 like
 they will do what we're looking to do, but I worry that they might fix a
 custom case and not necessarily work for us.


been in the same position 2 years back, that's why I have developed the
ANTLR query parser (before that, I went through the phase of hacking
different query parsers, but it was always obvious to me it cannot work for
anything but simple cases)



 Also, Roman, are you suggesting that I can have an indexed document titled
 hubble telescope and as long as I separate multi-word synonyms with the
 null character \0 in the synonyms.txt file the query expansion will just
 work? if so, that would suffice for our needs.. can you elaborate or will

the query parser still foil the system. I ask because I've seen instances


First, a bit of explanation of how indexing/tokenization operates:

input text: hubble space telescope is in the space

let's say we are tokenizing on empty space and we use stopwords; this is
what gets indexed:

hubble
space
telescope
space

these tokens can have different positions, but let's ignore that for a
moment - the first three are adjacent


 where I can use the admin analysis tool against a custom field type to
 expand a multi-word synonym where it appears it's expanding the terms
 properly but when I run a search against it using the actual handler, it
 doesn't behave the same way and the debugQuery shows that indeed it split
 my
 term and did not expand it.


this is because the solr analysis tool is seeing the whole input as one
string hubble space telescope, WHILST the standard query parser first
tokenizes, then builds the query *out of every token* - so it is seeing 3
tokens instead of 1 big token, and builds the following query

field:hubble field:space field:telescope field:space

HOWEVER, when you send the phrase query, it arrives as one token - the
synonym filter will see it, it will recognize it as a multi-token synonym
and it will expand it

BUT, the standard behaviour is to insert the new token into the position of
the first token, so you will get a phrase query

(hubble | HST) space telescope space

So really, the problem of the multi-token synonym expansion is in essence a
problem of a query parser - it must know how to harvest tokens, expand
them, and how to build a proper query - in this case, the HST [one token]
spans over 3 original tokens, so the parser must be smart enough to build:

hubble space telescope space OR HST in the space

So, the synonym expansion part is standard FST, already in the Lucene/SOLR
core. The parser that can handle these cases (and not just them, but also
many others) is also inside Lucene - it is called 'flexible' and has been
contributed by IBM a few years back. But so far it has been a sleeping beauty.

I haven't seen the LucidWorks parser, but from the description it seems it does
a much better job than the standard parser (if, when you do a quoted phrase
search for hubble space telescope in the space and the result is: hubble
space telescope space OR HST in the space, you can be reasonably sure it
does everything - well, to be 100% sure: HST in the space should also
produce the same query; but that's a much longer discussion about
index-time XOR query-time analysis)

roman




 Jack,

 Is there a link where I can read more about the LucidWorks search parser
 and
 how we can perchance tie into that so I can test to see if it yields better
 results?

 Thanks again for the help and suggestions. As an organization, we've
 learned
 much of solr since we started in 4.1 (especially with the cloud). The devs
 are doing phenomenal work and my query is really meant more as confirmation
 that I'm taking the correct approach than to beg for a specific feature :)

 --Dave



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Searching-w-explicit-Multi-Word-Synonym-Expansion-tp4078469p4078675.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: ACL implementation: Pseudo-join performance & Atomic Updates

2013-07-17 Thread Roman Chyla
Hello Oleg,


On Wed, Jul 17, 2013 at 3:49 PM, Oleg Burlaca oburl...@gmail.com wrote:

 Hello Roman and all,

  sorry, haven't the previous thread in its entirety, but few weeks back
 that
  Yonik's proposal got implemented, it seems ;)

 http://search-lucene.com/m/Fa3Dg14mqoj/bitsetsubj=Re+Solr+large+boolean+filter

 In that post I see a reference to your plugin BitSetQParserPlugin, right ?

 https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/java/org/apache/solr/search/BitSetQParserPlugin.java

 I understood it as follows:
 1. query the core and get ALL search results,
search results == (id1, id2, id7 .. id28263)   // a long arrays of
 Unique IDs
 2. Generate a bitset from this array of IDs
 3. search a core using a bitsetfilter

 Correct?


yes, the BitSetQParserPlugin does the 3rd step

the unittest may explain it better:
https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/test/org/apache/solr/search/TestBitSetQParserPlugin.java
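
Just to illustrate the mechanics of the 3rd step (this is only a sketch - the
class name is made up, and the real plugin additionally handles decoding the
bitset it receives from the client and the mapping between your ids and lucene
docids):

import java.io.IOException;
import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.search.DocIdSet;
import org.apache.lucene.search.Filter;
import org.apache.lucene.util.Bits;
import org.apache.lucene.util.FixedBitSet;
import org.apache.lucene.util.OpenBitSet;

// wraps a bitset indexed by global (top-level) lucene docids
public class GlobalBitSetFilter extends Filter {

  private final OpenBitSet global;

  public GlobalBitSetFilter(OpenBitSet global) {
    this.global = global;
  }

  @Override
  public DocIdSet getDocIdSet(AtomicReaderContext context, Bits acceptDocs) throws IOException {
    int docBase = context.docBase;
    int maxDoc = context.reader().maxDoc();
    FixedBitSet local = new FixedBitSet(maxDoc);   // bits local to this segment
    for (int i = 0; i < maxDoc; i++) {
      if (global.get(docBase + i) && (acceptDocs == null || acceptDocs.get(i))) {
        local.set(i);
      }
    }
    return local;   // FixedBitSet is a DocIdSet in lucene 4.x
  }
}

a qparser plugin can then simply return new ConstantScoreQuery(new
GlobalBitSetFilter(bits)), so the second query never touches anything outside
the bitset.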




 I was thinking that pseudo-joins can help exactly with this situation
 (actually didn't even tried yet pseudo-joins, still watching the mail
 list).
 i.e. to make the first step efficient and at the same time perform a second
 query without to send a lot of data to the client and then receiving this
 data back.

 I have a feeling that such a situation: a list of Unique IDs from query1
 participates in filter in query2
 happens frequently, and would be very useful if SOLR has an optimized
 approach to handle it.
 mmm, it's transform the pseudo-join in a real JOIN like in SQL world.

 I think I'll just test to see the performance of pseudo-joins with large
 datasets (was waiting to find the perfect solution).


I'd be very curious; if you do some experiments, please let us know. Thanks,

roman



 Thanks for all the ideas/links, now I have a better view of the situation.

 Regards.




 On Wed, Jul 17, 2013 at 3:34 PM, Erick Erickson erickerick...@gmail.com
 wrote:

  Roman:
 
  I think that SOLR-1913 is completely different. It's
  about having a field in a document and being able
  to do bitwise operations on it. So say I have a
  field in a Solr doc with the value 6 in it. I can then
  form a query like
  {!bitwise field=myfield op=AND source=2}
  and it would match.
 
  You're talking about a much different operation as I
  understand it.
 
  In which case, go ahead and open up a JIRA, there's
  no harm in it.
 
  Best
  Erick
 
  On Tue, Jul 16, 2013 at 1:32 PM, Roman Chyla roman.ch...@gmail.com
  wrote:
   Erick,
  
   I wasn't sure this issue is important, so I wanted first solicit some
   feedback. You and Otis expressed interest, and I could create the JIRA
 -
   however, as Alexandre, points out, the SOLR-1913 seems similar
 (actually,
   closer to the Otis request to have the elasticsearch named filter) but
  the
   SOLR-1913 was created in 2010 and is not integrated yet, so I am
  wondering
   whether this new feature (somewhat overlapping, but still different
 from
   SOLR-1913) is something people would really want and the effort on the
  JIRA
   is well spent. What's your view?
  
   Thanks,
  
 roman
  
  
  
  
   On Tue, Jul 16, 2013 at 8:23 AM, Alexandre Rafalovitch
   arafa...@gmail.comwrote:
  
   Is that this one: https://issues.apache.org/jira/browse/SOLR-1913 ?
  
   Regards,
  Alex.
  
   Personal website: http://www.outerthoughts.com/
   LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
   - Time is the quality of nature that keeps events from happening all
 at
   once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
  book)
  
  
   On Tue, Jul 16, 2013 at 8:01 AM, Erick Erickson 
  erickerick...@gmail.com
   wrote:
  
Roman:
   
Did this ever make into a JIRA? Somehow I missed it if it did, and
  this
would
be pretty cool
   
Erick
   
On Mon, Jul 15, 2013 at 6:52 PM, Roman Chyla roman.ch...@gmail.com
 
wrote:
 On Sun, Jul 14, 2013 at 1:45 PM, Oleg Burlaca oburl...@gmail.com
 
wrote:

 Hello Erick,

  Join performance is most sensitive to the number of values
  in the field being joined on. So if you have lots and lots of
  distinct values in the corpus, join performance will be
 affected.
 Yep, we have a list of unique Id's that we get by first searching
  for
 records
 where loggedInUser IS IN (userIDs)
 This corpus is stored in memory I suppose? (not a problem) and
 then
   the
 bottleneck is to match this huge set with the core where I'm
   searching?

 Somewhere in maillist archive people were talking about external
  list
of
 Solr unique IDs
 but didn't find if there is a solution.
 Back in 2010 Yonik posted a comment:

 http://find.searchhub.org/document/363a4952446b3cd#363a4952446b3cd


 sorry, haven't the previous thread in its entirety, but few weeks
  back
that
 Yonik's proposal got implemented, it seems ;)


   
  
 
 http

Re: Getting a large number of documents by id

2013-07-18 Thread Roman Chyla
Look at the speed of reading the data - likely, it takes a long time to assemble
a big response, especially if there are many long stored fields - you may want to
try SSD disks, if you have that option.

Also, to gain better understanding: Start your solr, start jvisualvm and
attach to your running solr. Start sending queries and observe where the
most time is spent - it is very easy, you don't have to be a programmer to
do it.

The crucial parts (but they will show up under different names) are:

1. query parsing
2. search execution
3. response assembly

Quite likely, your query is a huge boolean OR clause, which may not be as
efficient as a filter query.
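
For example (just a sketch of the idea, not a guaranteed win - measure it),
move the big OR list from q into a non-cached filter query and keep q cheap:

q=*:*&rows=847&fq={!cache=false}id:(12345 OR 23456 OR ...)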

Your use case is actually not at all exotic. There will soon be a JIRA
ticket that makes the scenario of sending/querying with a large number of IDs
less painful.

http://lucene.472066.n3.nabble.com/Solr-large-boolean-filter-td4070747.html#a4070964
http://lucene.472066.n3.nabble.com/ACL-implementation-Pseudo-join-performance-amp-Atomic-Updates-td4077894.html

But I would really recommend you do the jvisualvm measurement - that's
like bringing light into the darkness.

roman


On Thu, Jul 18, 2013 at 10:46 AM, Brian Hurt bhur...@gmail.com wrote:

 I have a situation which is common in our current use case, where I need to
 get a large number (many hundreds) of documents by id.  What I'm doing
 currently is creating a large query of the form id:12345 OR id:23456 OR
 ... and sending it off.  Unfortunately, this query is taking a long time,
 especially the first time it's executed.  I'm seeing times of like 4+
 seconds for this query to return, to get 847 documents.

 So, my question is: what should I be looking at to improve the performance
 here?

 Brian



Re: short-circuit OR operator in lucene/solr

2013-07-22 Thread Roman Chyla
Deepak,

I think your goal is to gain something in speed, but most likely the
function query will be slower than the query without score computation (the
filter query) - this stems from how the query is executed, but I
may, of course, be wrong. Would you mind sharing the measurements you make?

Thanks,

  roman


On Mon, Jul 22, 2013 at 10:54 AM, Yonik Seeley yo...@lucidworks.com wrote:

 function queries to the rescue!

 q={!func}def(query($a),query($b),query($c))
 a=field1:value1
 b=field2:value2
 c=field3:value3

 def or default function returns the value of the first argument that
 matches.  It's named default because it's more commonly used like
 def(popularity,50)  (return the value of the popularity field, or 50
 if the doc has no value for that field).

 -Yonik
 http://lucidworks.com


 On Sun, Jul 21, 2013 at 8:48 PM, Deepak Konidena deepakk...@gmail.com
 wrote:
  I understand that lucene's AND (), OR (||) and NOT (!) operators are
  shorthands for REQUIRED, OPTIONAL and EXCLUDE respectively, which is why
  one can't treat them as boolean operators (adhering to boolean algebra).
 
  I have been trying to construct a simple OR expression, as follows
 
  q = +(field1:value1 OR field2:value2)
 
  with a match on either field1 or field2. But since the OR is merely an
  optional, documents where both field1:value1 and field2:value2 are
 matched,
  the query returns a score resulting in a match on both the clauses.
 
  How do I enforce short-circuiting in this context? In other words, how to
  implement short-circuiting as in boolean algebra where an expression A
 || B
  || C returns true if A is true without even looking into whether B or C
  could be true.
  -Deepak



Re: Performance of cross join vs block join

2013-07-22 Thread Roman Chyla
Hello Mikhail,

ps: sending to solr-user as well - I've realized I was writing just to
you, sorry...

On Mon, Jul 22, 2013 at 3:07 AM, Mikhail Khludnev 
mkhlud...@griddynamics.com wrote:

 Hello Roman,

 Pleas get me right. I have no idea what happened with that dependency.
 There are recent patches from Yonik, they should be more actual, and I
 think he can help you with particular issues. From the common (captain's)
 sense I propose to specify any closer version of jetty, I don't think there
 are much reason to rely on that particular one.

 I'm thinking about your problem from time to time. You are right, it's
 definitely not a case for block join. I still trying to figure out how to
 make it computationally easier. As far as I get you have recursive
 many-to-many relationship and need to traverse it during the search.

 doc(id, author, text, references:[docid,] )

 I'm not sure it's possible with lucene now, but if it can, what you think
 about writing DocValues stripe contains internal Lucene docnums instead of
 external docIds. It moves few steps from query time to index time, hence
 can get some performance.


Our use case of many-to-many relations is probably a weird one and we ought
to de-normalize the values. What I do (building a citation network in
memory, using Lucene caches) is just a work-around that happens to
outperform the index seeking - no surprise there, but at the expense of
memory. I am aware the de-normalization may be necessary; the DocValues
would probably be a step towards it - the joins give great flexibility,
it is really cool, but that comes with its own price...



 Also, I mentioned you hesitates regarding cross segments join. You
 actually shouldn't due to the following reasons:
  - Join is a Solr code (which is a top reader beast);
  - it obtains and works with SolrIndexSearcher which is a top reader...
  - join happens at Weight without any awareness about leaf segments.

 https://github.com/apache/lucene-solr/blob/trunk/solr/core/src/java/org/apache/solr/search/JoinQParserPlugin.java#L272


Thanks. I think I have not used it (I believe) because there was a very small
chance it could have been fast enough. It reads terms/joins for docs
that match the query, so in that sense it is not different from
pre-computing the citation cache - but it happens for every query/request,
and so for 0.5M edges it must take some time. But I guess I should
measure it. I haven't made notes, so now I am having a hard time backtracking
:)

roman


 It seems to me cross segment join works well.



 On Mon, Jul 22, 2013 at 3:08 AM, Roman Chyla roman.ch...@gmail.comwrote:

 ah, in case you know the solution, here ant output:

 resolve:
 [ivy:retrieve]
 [ivy:retrieve] :: problems summary ::
 [ivy:retrieve]  WARNINGS
 [ivy:retrieve] module not found:
 org.eclipse.jetty#jetty-deploy;8.1.10.v20130312
 [ivy:retrieve]  local: tried
 [ivy:retrieve]  
 /home/rchyla/.ivy2/local/org.eclipse.jetty/jetty-deploy/8.1.10.v20130312/ivys/ivy.xml
 [ivy:retrieve]   -- artifact
 org.eclipse.jetty#jetty-deploy;8.1.10.v20130312!jetty-deploy.jar:
 [ivy:retrieve]  
 /home/rchyla/.ivy2/local/org.eclipse.jetty/jetty-deploy/8.1.10.v20130312/jars/jetty-deploy.jar
 [ivy:retrieve]  shared: tried
 [ivy:retrieve]  
 /home/rchyla/.ivy2/shared/org.eclipse.jetty/jetty-deploy/8.1.10.v20130312/ivys/ivy.xml
 [ivy:retrieve]   -- artifact
 org.eclipse.jetty#jetty-deploy;8.1.10.v20130312!jetty-deploy.jar:
 [ivy:retrieve]  
 /home/rchyla/.ivy2/shared/org.eclipse.jetty/jetty-deploy/8.1.10.v20130312/jars/jetty-deploy.jar
 [ivy:retrieve]  public: tried
 [ivy:retrieve]
 http://repo1.maven.org/maven2/org/eclipse/jetty/jetty-deploy/8.1.10.v20130312/jetty-deploy-8.1.10.v20130312.pom
 [ivy:retrieve]  sonatype-releases: tried
 [ivy:retrieve]
 http://oss.sonatype.org/content/repositories/releases/org/eclipse/jetty/jetty-deploy/8.1.10.v20130312/jetty-deploy-8.1.10.v20130312.pom
 [ivy:retrieve]   -- artifact
 org.eclipse.jetty#jetty-deploy;8.1.10.v20130312!jetty-deploy.jar:
 [ivy:retrieve]
 http://oss.sonatype.org/content/repositories/releases/org/eclipse/jetty/jetty-deploy/8.1.10.v20130312/jetty-deploy-8.1.10.v20130312.jar
 [ivy:retrieve]  maven.restlet.org: tried
 [ivy:retrieve]
 http://maven.restlet.org/org/eclipse/jetty/jetty-deploy/8.1.10.v20130312/jetty-deploy-8.1.10.v20130312.pom
 [ivy:retrieve]   -- artifact
 org.eclipse.jetty#jetty-deploy;8.1.10.v20130312!jetty-deploy.jar:
 [ivy:retrieve]
 http://maven.restlet.org/org/eclipse/jetty/jetty-deploy/8.1.10.v20130312/jetty-deploy-8.1.10.v20130312.jar
 [ivy:retrieve]  working-chinese-mirror: tried
 [ivy:retrieve]
 http://mirror.netcologne.de/maven2/org/eclipse/jetty/jetty-deploy/8.1.10.v20130312/jetty-deploy-8.1.10.v20130312.pom
 [ivy:retrieve]   -- artifact
 org.eclipse.jetty#jetty-deploy;8.1.10.v20130312!jetty-deploy.jar:
 [ivy:retrieve]
 http://mirror.netcologne.de/maven2/org/eclipse/jetty/jetty-deploy/8.1.10.v20130312/jetty-deploy-8.1.10.v20130312

Re: Processing a lot of results in Solr

2013-07-23 Thread Roman Chyla
Hello Matt,

You can consider writing a batch processing handler which receives a query
and, instead of sending results back, writes them into a file that is
then available for streaming (identified by its own UUID). I am dumping many GBs
of data from solr in a few minutes - your query + a streaming writer can go
a very long way :)
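
The skeleton of such a handler is small. A bare-bones sketch (class name,
output path and the 'id' field are made up; our real code does more, e.g. it
runs the dump in the background and streams JSON):

import java.io.File;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.UUID;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.AtomicReader;
import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Scorer;
import org.apache.solr.handler.RequestHandlerBase;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.search.QParser;
import org.apache.solr.search.SolrIndexSearcher;

public class DumpToFileHandler extends RequestHandlerBase {

  @Override
  public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp) throws Exception {
    SolrIndexSearcher searcher = req.getSearcher();
    Query q = QParser.getParser(req.getParams().get("q"), null, req).getQuery();

    File out = new File("/tmp/dump-" + UUID.randomUUID() + ".txt");
    final PrintWriter w = new PrintWriter(out, "UTF-8");
    try {
      searcher.search(q, new Collector() {
        private AtomicReader reader;
        public void setScorer(Scorer scorer) {}
        public void setNextReader(AtomicReaderContext ctx) { reader = ctx.reader(); }
        public boolean acceptsDocsOutOfOrder() { return true; }
        public void collect(int doc) throws IOException {
          Document d = reader.document(doc);   // load (or lazy-load) the fields you need
          w.println(d.get("id"));              // and write them out
        }
      });
    } finally {
      w.close();
    }
    rsp.add("file", out.getAbsolutePath());    // the client streams the file later
  }

  public String getDescription() { return "dumps search results into a server-side file"; }
  public String getSource() { return ""; }
  // (depending on your Solr version there may be another SolrInfoMBean method or two to implement)
}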

roman


On Tue, Jul 23, 2013 at 5:04 PM, Matt Lieber mlie...@impetus.com wrote:

 Hello Solr users,

 Question regarding processing a lot of docs returned from a query; I
 potentially have millions of documents returned back from a query. What is
 the common design to deal with this ?

 2 ideas I have are:
 - create a client service that is multithreaded to handled this
 - Use the Solr pagination to retrieve a batch of rows at a time (start,
 rows in Solr Admin console )

 Any other ideas that I may be missing ?

 Thanks,
 Matt


 






 NOTE: This message may contain information that is confidential,
 proprietary, privileged or otherwise protected by law. The message is
 intended solely for the named addressee. If received in error, please
 destroy and notify the sender. Any use of this email is prohibited when
 received in error. Impetus does not represent, warrant and/or guarantee,
 that the integrity of this communication has been maintained nor that the
 communication is free of errors, virus, interception or interference.



Re: Processing a lot of results in Solr

2013-07-24 Thread Roman Chyla
Mikhail,
It is a slightly hacked JSONWriter - actually, while poking around, I have
discovered that dumping big hitsets would be possible - the main hurdle
right now is that the writer expects to receive documents with fields
loaded, but if it received something that loads docs lazily, you could
stream thousands and thousands of recs just as is done with the normal
response - standard operation. Well, people may cry this is not how SOLR is
meant to operate ;-)

roman


On Wed, Jul 24, 2013 at 5:28 AM, Mikhail Khludnev 
mkhlud...@griddynamics.com wrote:

 Roman,

 Can you disclosure how that streaming writer works? What does it stream
 docList or docSet?

 Thanks


 On Wed, Jul 24, 2013 at 5:57 AM, Roman Chyla roman.ch...@gmail.com
 wrote:

  Hello Matt,
 
  You can consider writing a batch processing handler, which receives a
 query
  and instead of sending results back, it writes them into a file which is
  then available for streaming (it has its own UUID). I am dumping many GBs
  of data from solr in few minutes - your query + streaming writer can go
  very long way :)
 
  roman
 
 
  On Tue, Jul 23, 2013 at 5:04 PM, Matt Lieber mlie...@impetus.com
 wrote:
 
   Hello Solr users,
  
   Question regarding processing a lot of docs returned from a query; I
   potentially have millions of documents returned back from a query. What
  is
   the common design to deal with this ?
  
   2 ideas I have are:
   - create a client service that is multithreaded to handled this
   - Use the Solr pagination to retrieve a batch of rows at a time
  (start,
   rows in Solr Admin console )
  
   Any other ideas that I may be missing ?
  
   Thanks,
   Matt
  
  
   
  
  
  
  
  
  
  
 



 --
 Sincerely yours
 Mikhail Khludnev
 Principal Engineer,
 Grid Dynamics

 http://www.griddynamics.com
  mkhlud...@griddynamics.com



Re: Processing a lot of results in Solr

2013-07-24 Thread Roman Chyla
On Tue, Jul 23, 2013 at 10:05 PM, Matt Lieber mlie...@impetus.com wrote:

 That sounds like a satisfactory solution for the time being -
 I am assuming you dump the data from Solr in a csv format?


JSON


 How did you implement the streaming processor ? (what tool did you use for
 this? Not familiar with that)


this is what dumps the docs:
https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/java/org/apache/solr/response/JSONDumper.java

it is called by one of our batch processors, which can pass it a bitset of
recs
https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/java/org/apache/solr/handler/batch/BatchProviderDumpIndex.java

as far as streaming is concerned, we were all very nicely surprised: a few
GB file (on the local network) took a ridiculously short time - in fact, a
colleague of mine assumed it was not working until we looked into the
downloaded file ;-). You may want to look at line 463 of
https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/java/org/apache/solr/handler/batch/BatchHandler.java

roman


 You say it takes a few minutes only to dump the data - how long does it to
 stream it back in, are performances acceptable (~ within minutes) ?

 Thanks,
 Matt

 On 7/23/13 6:57 PM, Roman Chyla roman.ch...@gmail.com wrote:

 Hello Matt,
 
 You can consider writing a batch processing handler, which receives a
 query
 and instead of sending results back, it writes them into a file which is
 then available for streaming (it has its own UUID). I am dumping many GBs
 of data from solr in few minutes - your query + streaming writer can go
 very long way :)
 
 roman
 
 
 On Tue, Jul 23, 2013 at 5:04 PM, Matt Lieber mlie...@impetus.com wrote:
 
  Hello Solr users,
 
  Question regarding processing a lot of docs returned from a query; I
  potentially have millions of documents returned back from a query. What
 is
  the common design to deal with this ?
 
  2 ideas I have are:
  - create a client service that is multithreaded to handled this
  - Use the Solr pagination to retrieve a batch of rows at a time
 (start,
  rows in Solr Admin console )
 
  Any other ideas that I may be missing ?
 
  Thanks,
  Matt
 
 
  
 
 
 
 
 
 
 


 









Re: How to debug an OutOfMemoryError?

2013-07-24 Thread Roman Chyla
_One_ idea would be to configure your java to dump the heap on the OOM error -
you can then load the dump into some analyzer, e.g. Eclipse, and that may
give you the desired answers (I unfortunately don't remember off the top of
my head how to activate the dump, but google will give you the answer)
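
(For the record, the standard HotSpot flags for that are along these lines:

java -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/some/dir ... -jar start.jar

the resulting .hprof file can then be opened in e.g. the Eclipse Memory
Analyzer (MAT).)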

roman


On Wed, Jul 24, 2013 at 11:38 AM, jimtronic jimtro...@gmail.com wrote:

 I've encountered an OOM that seems to come after the server has been up
 for a
 few weeks.

 While I would love for someone to just tell me you did X wrong, I'm more
 interested in trying to debug this. So, given the error below, where would
 I
 look next? The only odd thing that sticks out to me is that my log file had
 grown to about 70G. Would that cause an error like this? This is Solr 4.2.

 Jul 24, 2013 3:08:09 PM org.apache.solr.common.SolrException log
 SEVERE: null:java.lang.RuntimeException: java.lang.OutOfMemoryError: Java
 heap space
 at

 org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:651)
 at

 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:364)
 at

 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:141)
 at

 org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307)
 at
 org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
 at

 org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
 at
 org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560)
 at

 org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
 at

 org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1072)
 at
 org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:382)
 at

 org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
 at

 org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1006)
 at

 org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
 at

 org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
 at

 org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
 at

 org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
 at org.eclipse.jetty.server.Server.handle(Server.java:365)
 at

 org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:485)
 at

 org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
 at

 org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:926)
 at

 org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:988)
 at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:642)
 at
 org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
 at

 org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
 at

 org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
 at

 org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
 at

 org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
 at java.lang.Thread.run(Thread.java:619)
 Caused by: java.lang.OutOfMemoryError: Java heap space
 at org.apache.lucene.util.OpenBitSet.init(OpenBitSet.java:88)
 at
 org.apache.solr.search.DocSetCollector.collect(DocSetCollector.java:65)
 at org.apache.lucene.search.Scorer.score(Scorer.java:64)
 at
 org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:605)
 at
 org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:297)
 at

 org.apache.solr.search.SolrIndexSearcher.getDocSetNC(SolrIndexSearcher.java:1060)
 at

 org.apache.solr.search.SolrIndexSearcher.getPositiveDocSet(SolrIndexSearcher.java:763)
 at

 org.apache.solr.search.SolrIndexSearcher.getProcessedFilter(SolrIndexSearcher.java:880)
 at org.apache.solr.search.Grouping.execute(Grouping.java:284)
 at

 org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:384)
 at

 org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:208)
 at

 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
 at org.apache.solr.core.SolrCore.execute(SolrCore.java:1797)
 at

 org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:637)
 at

 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:343)
 at

 

Re: Document Similarity Algorithm at Solr/Lucene

2013-07-24 Thread Roman Chyla
This paper contains an excellent algorithm for plagiarism detection, but
beware: the published version had a mistake in the algorithm - look for the
corrections - I can't find them now, but I know they have been published
(perhaps by one of the co-authors). You could do it with Solr: create an
index of hashes, with the twist of storing the position in the original text
(the source of the hash) together with the token, and the Solr highlighting
would do the rest for you :)
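
The fingerprint selection itself - the (hash, position) pairs you would index
as tokens - is small; a rough sketch of the winnowing idea from the paper (toy
hash function, made-up k and w; this is just the basic scheme, so the warning
about the corrections above still applies):

import java.util.ArrayList;
import java.util.List;

public class Winnowing {
  // hash all k-grams, then keep the rightmost minimum hash of every window
  // of w consecutive k-gram hashes; returns (hash, position) pairs
  public static List<int[]> fingerprints(String text, int k, int w) {
    int n = text.length() - k + 1;
    List<int[]> selected = new ArrayList<int[]>();
    if (n < w) return selected;

    int[] hashes = new int[n];
    for (int i = 0; i < n; i++) {
      // toy hash; a real implementation would use a rolling hash
      hashes[i] = text.substring(i, i + k).hashCode();
    }

    int lastPicked = -1;
    for (int i = 0; i + w <= n; i++) {
      int min = i;
      for (int j = i; j < i + w; j++) {
        if (hashes[j] <= hashes[min]) min = j;       // rightmost minimum
      }
      if (min != lastPicked) {
        selected.add(new int[]{hashes[min], min});   // fingerprint + offset into text
        lastPicked = min;
      }
    }
    return selected;
  }
}

each selected position is what you would store together with the token, so
that the highlighting can point back into the original text.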

roman


On Tue, Jul 23, 2013 at 11:07 AM, Shashi Kant sk...@sloan.mit.edu wrote:

 Here is a paper that I found useful:
 http://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf


 On Tue, Jul 23, 2013 at 10:42 AM, Furkan KAMACI furkankam...@gmail.com
 wrote:
  Thanks for your comments.
 
  2013/7/23 Tommaso Teofili tommaso.teof...@gmail.com
 
  if you need a specialized algorithm for detecting blogposts plagiarism /
  quotations (which are different tasks IMHO) I think you have 2 options:
  1. implement a dedicated one based on your features / metrics / domain
  2. try to fine tune an existing algorithm that is flexible enough
 
  If I were to do it with Solr I'd probably do something like:
  1. index original blogposts in Solr (possibly using Jack's suggestion
  about ngrams / shingles)
  2. do MLT queries with candidate blogposts copies text
  3. get the first, say, 2-3 hits
  4. mark it as quote / plagiarism
  5. eventually train a classifier to help you mark other texts as quote /
  plagiarism
 
  HTH,
  Tommaso
 
 
 
  2013/7/23 Furkan KAMACI furkankam...@gmail.com
 
   Actually I need a specialized algorithm. I want to use that algorithm
 to
   detect duplicate blog posts.
  
   2013/7/23 Tommaso Teofili tommaso.teof...@gmail.com
  
Hi,
   
I you may leverage and / or improve MLT component [1].
   
HTH,
Tommaso
   
[1] : http://wiki.apache.org/solr/MoreLikeThis
   
   
2013/7/23 Furkan KAMACI furkankam...@gmail.com
   
 Hi;

 Sometimes a huge part of a document may exist in another
 document. As
like
 in student plagiarism or quotation of a blog post at another blog
  post.
 Does Solr/Lucene or its libraries (UIMA, OpenNLP, etc.) has any
 class
   to
 detect it?

   
  
 



Re: Using Solr to search between two Strings without using index

2013-07-25 Thread Roman Chyla
Hi,

I think you are pushing it too far - there is no 'string search' without an
index. And besides, these things are just better done with a few lines of
code - and if your array gets too big, then you should create the index...
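
Something like this would do (a rough sketch - plain substring matching;
adjust it to whatever 'match' means for you):

import java.util.ArrayList;
import java.util.List;

public class NoIndexSearch {
  public static void main(String[] args) {
    String[] docs = {"Input1 is good", "Input2 is better", "Input2 is sweet", "Input3 is bad"};
    String[] wanted = {"Input1", "Input2"};

    List<String> hits = new ArrayList<String>();
    for (String doc : docs) {
      for (String w : wanted) {
        if (doc.contains(w)) {   // plain substring match
          hits.add(doc);
          break;
        }
      }
    }
    System.out.println(hits);    // [Input1 is good, Input2 is better, Input2 is sweet]
  }
}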

roman


On Thu, Jul 25, 2013 at 9:06 AM, Rohit Kumar rohit.kku...@gmail.com wrote:

 Hi,

 I have a scenario.

 String array = [Input1 is good, Input2 is better, Input2 is sweet,
 Input3 is bad]

 I want to compare the string array against the given input :
 String inputarray= [Input1, Input2]


 It involves no indexes. I just want to use the power of string search to do
 a runtime search on the array and should return

 [Input1 is good, Input2 is better, Input2 is sweet]



 Thanks



Re: processing documents in solr

2013-07-27 Thread Roman Chyla
Dear list,
I've written a special processor exactly for this kind of operation:

https://github.com/romanchyla/montysolr/tree/master/contrib/adsabs/src/java/org/apache/solr/handler/batch

This is how we use it
http://labs.adsabs.harvard.edu/trac/ads-invenio/wiki/SearchEngineBatch

It is capable of processing an index of 200GB in a few minutes;
copying/streaming large amounts of data is normal

If there is general interest, we can create a JIRA issue - but given my
current workload, it will take longer and also somebody else will
*have to* invest their time and energy in testing it, reporting, etc. Of
course, feel free to create the jira yourself or reuse the code -
hopefully, you will improve it and let me know ;-)

Roman
On 27 Jul 2013 01:03, Joe Zhang smartag...@gmail.com wrote:

 Dear list:

 I have an ever-growing solr repository, and I need to process every single
 document to extract statistics. What would be a reasonable process that
 satifies the following properties:

 - Exhaustive: I have to traverse every single document
 - Incremental: in other words, it has to allow me to divide and conquer ---
 if I have processed the first 20k docs, next time I can start with 20001.

 A simple *:* query would satisfy the 1st but not the 2nd property. In
 fact, given that the processing will take very long, and the repository
 keeps growing, it is not even clear that the exhaustiveness is achieved.

 I'm running solr 3.6.2 in a single-machine setting; no hadoop capability
 yet. But I guess the same issues still hold even if I have the solr cloud
 environment, right, say in each shard?

 Any help would be greatly appreciated.

 Joe



Re: paging vs streaming. spawn from (Processing a lot of results in Solr)

2013-07-27 Thread Roman Chyla
Mikhail,
If your solution gives lazy loading of solr docs (and thus streaming of
huge result lists) it should be a big YES!
Roman
On 27 Jul 2013 07:55, Mikhail Khludnev mkhlud...@griddynamics.com wrote:

 Otis,
 You gave links to 'deep paging' when I asked about response streaming.
 Let me understand. From my POV, deep paging is a special case for regular
 search scenarios. We definitely need it in Solr. However, if we are talking
 about data analytic like problems, when we need to select an endless
 stream of responses (or store them in file as Roman did), 'deep paging' is
 a suboptimal hack.
 What's your vision on this?



Re: paging vs streaming. spawn from (Processing a lot of results in Solr)

2013-07-27 Thread Roman Chyla
Hi Mikhail,

I can see it is lazy-loading, but I can't judge how complex it becomes
(presumably, the filter dispatching mechanism is also doing other things -
it is there not only for streaming).

Let me just explain better what I found when I dug inside solr: documents
(results of the query) are loaded before they are passed into a writer - so
the writers are expecting to encounter the solr documents, but these
documents were loaded by one of the components before rendering them - so
it is kinda 'hard-coded'. But if solr was NOT loading these docs before
passing them to a writer, the writer could load them instead (hence lazy
loading, but the difference is in numbers - it could deal with hundreds of
thousands of docs, instead of a few thousand now).

I see one crucial point: this could work without any new handler/servlet -
solr would just gain a new parameter, something like: 'lazy=true' ;) and
people can use whatever 'wt' they did before

disclaimer: I don't know whether that would break other stuff, I only know
that I am using the same idea to dump what I need without breaking things
(so far... ;-)) - but obviously, I didn't want to patch solr core

roman


On Sat, Jul 27, 2013 at 3:52 PM, Mikhail Khludnev 
mkhlud...@griddynamics.com wrote:

 Roman,

 Let me briefly explain  the design

 special RequestParser stores servlet output stream into the context
 https://github.com/m-khl/solr-patches/compare/streaming#L7R22

 then special component injects special PostFilter/DelegatingCollector which
 writes right into output
 https://github.com/m-khl/solr-patches/compare/streaming#L2R146

 here is how it streams the doc, you see it's lazy enough
 https://github.com/m-khl/solr-patches/compare/streaming#L2R181

 I mention that it disables later collectors
 https://github.com/m-khl/solr-patches/compare/streaming#L2R57
 hence, no facets with streaming, yet as well as memory consumption.

 This test shows how it works
 https://github.com/m-khl/solr-patches/compare/streaming#L15R115

 all other code purposed for distributed search.



 On Sat, Jul 27, 2013 at 4:44 PM, Roman Chyla roman.ch...@gmail.com
 wrote:

  Mikhail,
  If your solution gives lazy loading of solr docs /and thus streaming of
  huge result lists/ it should be big YES!
  Roman
  On 27 Jul 2013 07:55, Mikhail Khludnev mkhlud...@griddynamics.com
  wrote:
 
   Otis,
   You gave links to 'deep paging' when I asked about response streaming.
   Let me understand. From my POV, deep paging is a special case for
 regular
   search scenarios. We definitely need it in Solr. However, if we are
  talking
   about data analytic like problems, when we need to select an endless
   stream of responses (or store them in file as Roman did), 'deep paging'
  is
   a suboptimal hack.
   What's your vision on this?
  
 



 --
 Sincerely yours
 Mikhail Khludnev
 Principal Engineer,
 Grid Dynamics

 http://www.griddynamics.com
  mkhlud...@griddynamics.com



Re: processing documents in solr

2013-07-27 Thread Roman Chyla
On Sat, Jul 27, 2013 at 4:17 PM, Shawn Heisey s...@elyograg.org wrote:

 On 7/27/2013 11:38 AM, Joe Zhang wrote:
  I have a constantly growing index, so not updating the index can't be
  practical...
 
  Going back to the beginning of this thread: when we use the vanilla
  *:*+pagination approach, would the ordering of documents remain stable?
   the index is dynamic: update/insertion only, no deletion.

 If you use a sort parameter with pagination, then you have stable
 ordering, unless as described with the 'b' example, a new document gets
 inserted into a position in the sort sequence that's before the current
 result page.

 One thing that you could do is make a copy of your index, set up a
 separate Solr installation that's not getting updates, and use that for
 your inspection.


Hi Shawn,

I guess if something prevents the current searcher from being recycled
(e.g. incrementing its ref count), it would be possible to re-use it for
the pagination - then the consumer could stay tied to the reader and the
order would be stable (seeing the same data), but there probably is not a
mechanism for this (?), nor would it be very wise to have such a mechanism
(?).
roman



 Thanks,
 Shawn




Re: Solr-4663 - Alternatives to use same data dir in different cores for optimal cache performance

2013-07-28 Thread Roman Chyla
Hi,
Yes, it can be done - if you search the mailing list for 'two solr instances
same datadir', you will find a post where I am describing our setup - it works
well even with automated deployments.

How do you measure performance? I am asking because one reason for us having
the same setup is sharing the OS cache; I'd be curious to see your numbers
and I can also (very soon) share ours.

roman


On Fri, Jul 26, 2013 at 3:23 AM, Dominik Siebel m...@dsiebel.de wrote:

 Hi,

 I just found SOLR-4663 beeing patched in the latest update I did.
 Does anyone know any other solution to use ONE physical index for various
 purposes?

 Why? I would like to use different solconfig.xmls in terms of cache sizes,
 result window size, etc. per business case for optimal performance, while
 relying on the same data.
 This is due to the fact the queries are mostly completely different in
 structure and result size and we only have one unified search index
 (indexing performance).
 Any suggestions (besides replicating the index to another core on the same
 machine, of course ;) )?


 Cheers!
 Dom



Measuring SOLR performance

2013-07-30 Thread Roman Chyla
Hello,

I have been wanting some tools for measuring performance of SOLR, similar
to Mike McCandless' lucene benchmark.

so yet another monitor was born; it is described here:
http://29min.wordpress.com/2013/07/31/measuring-solr-query-performance/

I tested it on the problem of garbage collectors (see the blogs for
details) and so far I can't conclude whether highly customized G1 is better
than highly customized CMS, but I think interesting details can be seen
there.

Hope this helps someone, and of course, feel free to improve the tool and
share!

roman


Re: Measuring SOLR performance

2013-07-31 Thread Roman Chyla
Hi Dmitry,
probably a mistake in the readme - try calling it with -q
/home/dmitry/projects/lab/solrjmeter/queries/demo/demo.queries

as for the base_url, I was testing it on solr4.0, where it tries contacting
/solr/admin/system - is it different for 4.3? I guess I should make it
configurable (it already is - the endpoint is set in check_options())

thanks

roman


On Wed, Jul 31, 2013 at 10:01 AM, Dmitry Kan solrexp...@gmail.com wrote:

 Ok, got the error fixed by modifying the base solr ulr in solrjmeter.py
 (added core name after /solr part).
 Next error is:

 WARNING: no test name(s) supplied nor found in:
 ['/home/dmitry/projects/lab/solrjmeter/demo/queries/demo.queries']

 It is a 'slow start with new tool' symptom I guess.. :)


 On Wed, Jul 31, 2013 at 4:39 PM, Dmitry Kan solrexp...@gmail.com wrote:

 Hi Roman,

 What  version and config of SOLR does the tool expect?

 Tried to run, but got:

 **ERROR**
   File solrjmeter.py, line 1390, in module
 main(sys.argv)
   File solrjmeter.py, line 1296, in main
 check_prerequisities(options)
   File solrjmeter.py, line 351, in check_prerequisities
 error('Cannot contact: %s' % options.query_endpoint)
   File solrjmeter.py, line 66, in error
 traceback.print_stack()
 Cannot contact: http://localhost:8983/solr


 complains about URL, clicking which leads properly to the admin page...
 solr 4.3.1, 2 cores shard

 Dmitry


 On Wed, Jul 31, 2013 at 3:59 AM, Roman Chyla roman.ch...@gmail.comwrote:

 Hello,

 I have been wanting some tools for measuring performance of SOLR, similar
 to Mike McCandles' lucene benchmark.

 so yet another monitor was born, is described here:
 http://29min.wordpress.com/2013/07/31/measuring-solr-query-performance/

 I tested it on the problem of garbage collectors (see the blogs for
 details) and so far I can't conclude whether highly customized G1 is
 better
 than highly customized CMS, but I think interesting details can be seen
 there.

 Hope this helps someone, and of course, feel free to improve the tool and
 share!

 roman






Re: Measuring SOLR performance

2013-07-31 Thread Roman Chyla
I'll try to run it with the new parameters and let you know how it goes.
I've rechecked the details for the G1 (default) garbage collector run and I can
confirm that 2 out of 3 runs were showing high max response times, in some
cases even 10secs, but the customized G1 never did - so the parameters
definitely had an effect, because the max time for the customized G1 never went
higher than 1.5secs (and that happened for 2 query classes only). Both the
CMS-custom and G1-custom results are similar; the G1 seems to have higher values
in the max fields, but that may be random. So, yes, I am now fairly confident
that the default G1 deserves to be called 'bad', and that these G1 parameters,
even if they don't seem G1-specific, have a real effect.
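
(For reference, the pause targets being discussed are set with G1 flags of
this general form:

-XX:+UseG1GC -XX:MaxGCPauseMillis=50 -XX:GCPauseIntervalMillis=1000

they are only targets, as Shawn notes below.)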
Thanks,

roman


On Tue, Jul 30, 2013 at 11:01 PM, Shawn Heisey s...@elyograg.org wrote:

 On 7/30/2013 6:59 PM, Roman Chyla wrote:
  I have been wanting some tools for measuring performance of SOLR, similar
  to Mike McCandles' lucene benchmark.
 
  so yet another monitor was born, is described here:
  http://29min.wordpress.com/2013/07/31/measuring-solr-query-performance/
 
  I tested it on the problem of garbage collectors (see the blogs for
  details) and so far I can't conclude whether highly customized G1 is
 better
  than highly customized CMS, but I think interesting details can be seen
  there.
 
  Hope this helps someone, and of course, feel free to improve the tool and
  share!

 I have a CMS config that's even more tuned than before, and it has made
 things MUCH better.  This new config is inspired by more info that I got
 on IRC:

 http://wiki.apache.org/solr/ShawnHeisey#GC_Tuning

 The G1 customizations in your blog post don't look like they are really
 G1-specific - they may be useful with CMS as well.  This statement also
 applies to some of the CMS parameters, so I would use those with G1 as
 well for any testing.

 UseNUMA looks interesting for machines that actually are NUMA.  All the
 information that I can find says it is only for the throughput
 (parallel) collector, so it's probably not doing anything for G1.

 The pause parameters you've got for G1 are targets only.  It will *try*
 to stick within those parameters, but if a collection requires more than
 50 milliseconds or has to happen more often than once a second, the
 collector will ignore what you have told it.

 Thanks,
 Shawn



