Copy in multivalued field and faceting
Hello, the field in this scenario is Title and contains several words. For a specific query, I would like to get the top ten words by frequency in a specific field. My idea was the following: - Title in my schema is stored/indexed in a specific field - A copyField copies the Title field content into a multivalued field. If my multivalued field uses a specific tokenizer which splits words, does it put each word into a separate value? - If so, by faceting on this multivalued field, I will get the top ten words, correct? Example: 1) Title: this is my title 2) copyField Title to a specific multivalued field F1 3) F1 contains: {this, is, my, title} Sorry for my English. Thanks, Jul -- View this message in context: http://lucene.472066.n3.nabble.com/Copy-in-multivalued-field-and-faceting-tp3584819p3584819.html Sent from the Solr - User mailing list archive at Nabble.com.
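If it helps, the idea described above is usually wired up in schema.xml roughly like this (a sketch only; the field type "text_ws" and the stored/indexed flags are my assumptions, not from the original message):

```xml
<!-- Hypothetical sketch of the copyField idea -->
<field name="Title" type="string" indexed="true" stored="true"/>
<field name="F1" type="text_ws" indexed="true" stored="false" multiValued="true"/>
<copyField source="Title" dest="F1"/>
```

As I understand it, copyField copies the raw value once, and it is the tokenizer on F1's field type that splits it into individual terms at index time; faceting with facet.field=F1&facet.limit=10&facet.sort=count would then count those terms.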
Sorting and searching on a field
Hi, I have a field in Solr that I want to be sortable. But at the same time, I want to be able to search on that field without using wildcards. Is that possible? For example, if I have a field Subject with the value "This is my first subject", searching in Solr as subject:first should give me this result, and the field Subject should be sortable. I have read about the option of copying this to a different field, using one for searching (tokenized) and one for sorting. But I am looking to do both things on the same field. Can someone please point me to a way to achieve this? Thanks and Regards, Swapna.
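For reference, the two-field approach mentioned above is commonly sketched like this in schema.xml (field and type names here are hypothetical, not from the original message):

```xml
<!-- Sketch: "subject" is tokenized for searching; "subject_sort" is a
     single-token string field suitable for sorting. -->
<field name="subject" type="text_general" indexed="true" stored="true"/>
<field name="subject_sort" type="string" indexed="true" stored="false"/>
<copyField source="subject" dest="subject_sort"/>
```

The usual reasoning is that sorting needs at most one token per document, while searching on words needs tokenization, which is why a single field rarely serves both purposes well.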
Possible to adjust FieldNorm?
Hi, Is it possible to adjust fieldNorm? I have a scenario where the search is not producing the desired result because of fieldNorm. Search terms: coaching leadership

Record 1: name=Ask the Coach, desc=...,...
Record 2: name=Coaching as a Leadership Development Tool Part 1, desc=...,...

Record 1 was scored higher than record 2, even though record 2 has two matches. The scoring is given below:

Record 1:
1.2878088 = (MATCH) weight(name_en:coach in 6430), product of:
  0.20103075 = queryWeight(name_en:coach), product of:
    6.406029 = idf(docFreq=160, maxDocs=35862)
    0.03138149 = queryNorm
  6.406029 = (MATCH) fieldWeight(name_en:coach in 6430), product of:
    1.0 = tf(termFreq(name_en:coach)=1)
    6.406029 = idf(docFreq=160, maxDocs=35862)
    1.0 = fieldNorm(field=name_en, doc=6430)

Record 2:
0.56341636 = (MATCH) weight(name_en:coach in 4744), product of:
  0.20103075 = queryWeight(name_en:coach), product of:
    6.406029 = idf(docFreq=160, maxDocs=35862)
    0.03138149 = queryNorm
  2.8026378 = (MATCH) fieldWeight(name_en:coach in 4744), product of:
    1.0 = tf(termFreq(crs_name_en:coach)=1)
    6.406029 = idf(docFreq=160, maxDocs=35862)
    0.4375 = fieldNorm(field=name_en, doc=4744)

Many thanks in advance. Chut -- View this message in context: http://lucene.472066.n3.nabble.com/Possible-to-adjust-FieldNorm-tp3584998p3584998.html Sent from the Solr - User mailing list archive at Nabble.com.
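One knob that directly affects this (offered as a hedged suggestion, not necessarily the right fix for this scoring problem): norms can be disabled per field in schema.xml, which makes fieldNorm 1.0 for every document. A hypothetical sketch, assuming a type named "text_en":

```xml
<!-- Sketch only: omitNorms disables length normalization and index-time
     boosts for this field, so short and long names score alike.
     A full reindex is needed for the change to take effect. -->
<field name="name_en" type="text_en" indexed="true" stored="true" omitNorms="true"/>
```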
Re: Copy in multivalued field and faceting
Sounds like it should work, by carefully choosing the tokenizer and then using the facet.sort and facet.limit parameters to do the faceting. Will wait to see any expert's comments on this one. Yunfei On Wed, Dec 14, 2011 at 12:26 AM, darul daru...@gmail.com wrote: Hello, Field for this scenario is Title and contains several words. For a specific query, I would like get the top ten words by frequency in a specific field. My idea was the following: - Title in my schema is stored/indexed in a specific field - A copyField copy Title field content into a multivalued field. If my multivalue field use a specific tokenizer which split words, does it fill each word in each multivalued items ? - If so, using faceting on this multivalue field, I will get top ten words, correct ? Example: 1) Title : this is my title 2) CopyField Title to specific multivalue field F1 3) F1 contains : {this, is, my, title} My english Thanks, Jul -- View this message in context: http://lucene.472066.n3.nabble.com/Copy-in-multivalued-field-and-faceting-tp3584819p3584819.html Sent from the Solr - User mailing list archive at Nabble.com.
Large RDBMS dataset
Hello, I have a very large dataset (> 1M records) in an RDBMS which I want my Solr application to pull data from. The problem is that the document fields which I have to index aren't in the same table; I have to join records with three other tables. Well, in fact they are views, but I don't think that makes any difference. This is the data import handler that I've actually written:

<?xml version="1.0"?>
<dataConfig>
  <dataSource type="JdbcDataSource"
              driver="net.sourceforge.jtds.jdbc.Driver"
              url="jdbc:jtds:sqlserver://YSQLDEV01BLQ/YooxProcessCluster1;instance=SVCSQLDEV"/>
  <document name="Products">
    <entity name="fd" query="SELECT * FROM clust_w_fast_dump ORDER BY endeca_id;">
      <entity name="fd2" query="SELECT macrocolor_id, color_descr, gsize_descr, size_descr FROM clust_w_fast_dump2_ByMarkets WHERE endeca_id='${fd.Endeca_ID}' ORDER BY endeca_id;"/>
      <entity name="cpd" query="SELECT DepartmentCode, Ranking, DepartmentPriceRangeCode FROM clust_w_CatalogProductsDepartments_ByMarket WHERE endeca_id='${fd.Endeca_ID}' ORDER BY endeca_id;"/>
      <entity name="env" query="SELECT Environment FROM clust_w_Environment WHERE endeca_id='${fd.Endeca_ID}' ORDER BY endeca_id;"/>
    </entity>
  </document>
</dataConfig>

It works, but it takes 1'38" to parse 100 records: that is about 1 rec/s! It means that digesting the whole dataset would take about 1 Ms (= 12 days). The problem is that for each record in fd, Solr makes three distinct SELECTs on the other three tables. Of course, this is absolutely inefficient. Is there a way to have Solr load every record in the four tables and join them once they are already loaded in memory? TIA
Solr Search Across Multiple Cores not working when quering on specific field
I have two Solr cores, core0 and core1. Both cores have the same schema and configuration. After indexing, data is retrieved from each core individually:

http://localhost:8983/solr/core0/select?q=fieldName:%22United%22
http://localhost:8983/solr/core1/select?q=fieldName:%22United%22

Searching on both cores: this URL is working:

http://localhost:8983/solr/core0/select?shards=localhost:8983/solr/core0,localhost:8983/solr/core1&q=iPo*

but when I search on a specific field it is not working:

http://localhost:8983/solr/core0/select?shards=localhost:8983/solr/core0,localhost:8983/solr/core1&q=mnemonic_value:United

Why is distributed search not working when I search on a particular field? Please help -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-Search-Across-Multiple-Cores-not-working-when-quering-on-specific-field-tp3585013p3585013.html Sent from the Solr - User mailing list archive at Nabble.com.
Getting Error while running Query
Hi All, I am sorry if I have sent this email to the wrong list. If so, kindly let me know! I am using Alfresco 4.0, which uses SOLR on top of Lucene. I am able to see the SOLR page and also able to fire queries, but they do not return any results and sometimes give errors. I am using the SOLR UI (https://localhost:8443/solr/alfresco/admin/). For example, when I search for @cm\:name:sanket it shows me an XML result like this:

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
    <lst name="params">
      <str name="explainOther"/>
      <str name="indent">on</str>
      <str name="hl.fl"/>
      <str name="wt">standard</str>
      <str name="hl">on</str>
      <str name="rows">10</str>
      <str name="version">2.2</str>
      <str name="fl">*,score</str>
      <str name="debugQuery">on</str>
      <str name="start">0</str>
      <str name="q">@cm\:cm:sanket</str>
      <str name="qt">standard</str>
      <str name="fq"/>
    </lst>
  </lst>
  <result name="response" numFound="0" start="0" maxScore="0.0"/>
  <lst name="highlighting"/>
  <lst name="debug">
    <str name="rawquerystring">@cm\:cm:sanket</str>
    <str name="querystring">@cm\:cm:sanket</str>
    <str name="parsedquery">@cm:cm:sanket</str>
    <str name="parsedquery_toString">@cm:cm:sanket</str>
    <lst name="explain"/>
    <str name="QParser">LuceneQParser</str>
    <arr name="filter_queries">

Also, sometimes I get an exception like: HTTP Status 400 - org.apache.lucene.queryParser.ParseException: Cannot parse '@cm:name:sanket': Encountered ":" at line 1, column 8. Was expecting one of: <EOF> "AND" ... "OR" ... "NOT" ... bla..bla..bla.. Please help me out. Thanking You! Sanket Shah
Re: Getting Error while running Query
On Wed, Dec 14, 2011 at 5:00 PM, Sanket Shah sanket.s...@cignex.com wrote: Hi All, I am sorry If I have sent this email at wrong list. If it is then kindly let me know! I am using Alfresco 4.0 which is having SOLR for Lucene. I am able to see the SOLR page and also able to fire queris But they do not return any results and sometimes giving errors. I am using SOLR UI (https://localhost:8443/solr/alfresco/admin/ ). [...] <result name="response" numFound="0" start="0" maxScore="0.0"/> [...] The numFound="0" indicates that no documents matched the search string. Maybe the indexing was not done properly: could you try searching for *:* from the Solr admin interface? This should return all documents indexed into Solr. Also sometime I get exception like HTTP Status 400 - org.apache.lucene.queryParser.ParseException: Cannot parse '@cm:name:sanket': Encountered ":" at line 1, column 8. The colon, ":", is a special character for Solr, and needs to be escaped, as you did in your first example. Please see http://wiki.apache.org/solr/SolrQuerySyntax#NOTE:_URL_Escaping_Special_Characters Regards, Gora
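For anyone building query strings client-side, the escaping could be done along these lines (a sketch loosely modeled on SolrJ's ClientUtils.escapeQueryChars; the exact special-character set should be double-checked against the wiki page above):

```python
def escape_query_chars(s):
    """Backslash-escape characters that are special in the Lucene/Solr
    query syntax. The character set below is an approximation, not an
    authoritative list."""
    special = set('\\+-!():^[]"{}~*?|&;/')
    out = []
    for ch in s:
        if ch in special or ch.isspace():
            out.append('\\')
        out.append(ch)
    return ''.join(out)

print(escape_query_chars('@cm:name:sanket'))  # -> @cm\:name\:sanket
```

This is why @cm\:name:sanket parses while @cm:name:sanket does not: the first colon is escaped, so the parser does not treat it as a field separator.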
RE: Getting Error while running Query
Thanks Gora for your reply. How can I come to know that Alfresco or Share is running on SOLR? I mean, when I log in, click some folders, or create or upload new files, how can I know that it is being done by SOLR and not by the old way (before Alfresco 4.0)? I have put the following things in my glob.prop file:

### Solr indexing ###
index.subsystem.name=solr
dir.keystore=${dir.root}/keystore
solr.port.ssl=8443
## newly added.
solr.host=localhost
solr.port=8080
# default keystores location
dir.keystore=classpath:alfresco/keystore
encryption.ssl.keystore.location=${dir.keystore}/ssl.keystore
encryption.ssl.keystore.provider=
encryption.ssl.keystore.type=JCEKS
encryption.ssl.keystore.keyMetaData.location=${dir.keystore}/ssl-keystore-passwords.properties
encryption.ssl.truststore.location=${dir.keystore}/ssl.truststore
encryption.ssl.truststore.provider=
encryption.ssl.truststore.type=JCEKS
encryption.ssl.truststore.keyMetaData.location=${dir.keystore}/ssl-truststore-passwords.properties

thanking you! -Original Message- From: Gora Mohanty [mailto:g...@mimirtech.com] Sent: Wednesday, December 14, 2011 5:18 PM To: solr-user@lucene.apache.org Subject: Re: Getting Error while running Query On Wed, Dec 14, 2011 at 5:00 PM, Sanket Shah sanket.s...@cignex.com wrote: Hi All, I am sorry If I have sent this email at wrong list. If it is then kindly let me know! I am using Alfresco 4.0 which is having SOLR for Lucene. I am able to see the SOLR page and also able to fire queris But they do not return any results and sometimes giving errors. I am using SOLR UI (https://localhost:8443/solr/alfresco/admin/ ). [...] result name=response numFound=0 start=0 maxScore=0.0/ [...] The 'numfound=0' indicates that no documents matched the search string. Maybe the indexing is not done properly: Could you try searching for *:* from the Solr admin. interface? This should return all documents indexed into Solr. 
Also sometime I get exception like HTTP Status 400 - org.apache.lucene.queryParser.ParseException: Cannot parse '@cm:name:sanket': Encountered : : at line 1, column 8. Was expecting one of: EOF AND ... OR ... NOT bla..bla ..bla.. The colon, :, is a special character for Solr, and needs to be escaped, as you did in your first example. Please see http://wiki.apache.org/solr/SolrQuerySyntax#NOTE:_URL_Escaping_Special_Characters Regards, Gora
Use Solr to process/analyze docs without indexing
Hello, I would like to use Solr to analyze/process documents using stemming analyzers, stop-word filters, etc., and then return the results instead of indexing them. Is there already some API/service out of the box to do this? Would it be easy to implement? I'm thinking of using a RequestHandler to receive the documents, process them with the analyzers specified in schema.xml, and return the results without going through the index... Is this possible? Has someone done it? Thanks!! -- View this message in context: http://lucene.472066.n3.nabble.com/Use-Solr-to-process-analyze-docs-without-indexing-tp3585263p3585263.html Sent from the Solr - User mailing list archive at Nabble.com.
Shutdown hook issue
Hi All, I'm experiencing some issues with solr. From time to time solr goes down. After checking the logs, I see that it's due to the shutdown hook being triggered. I still don't know why it happens but it seems to be related to solr being idle. Does anyone have any insights? I'm using Ubuntu 10.04.2 LTS and solr 3.1.0 running on Jetty (default configuration). Solr runs in background, so it doesn't seem to be related to a SIGINT unless ubuntu is sending it for some odd reason. Thanks, Adolfo.
Re: Use Solr to process/analyze docs without indexing
I would use Solr to analyze / process documents using stemming analyzers, stopwordsfilters, etc. and then return the results instead of indexing. There is already some api service out-of-box to do this? It would be easy to implement? I'm thinking of using a RequestHandler to receive the documents, process them with analyzers specified in the schema.xml and return the results without going through the index... is this possible? someone has done? Maybe this? http://wiki.apache.org/solr/AnalysisRequestHandler
Faceting with null dates
hello, I have the following faceting parameters, which give me some unwanted non-null dates in the result set. Is there a way to query the index so that it does not give me non-null dates in return? I.e., I would like to get a result set which contains only nulls on the validToDate, but as I am faceting on non-null values of the validToDate, I would still like to get the non-null values in the faceting result. The response example below gives me 10 results, of which 7 have non-null validToDates. What I would like to get is 3 results and 7 non-null validToDate facets. And as I write this, I start to wonder if this is possible at all, as the facets are dependent on the result set, and whether this might be better handled in the application layer by just extracting 10-7=3... Any help would be appreciated! br, ken

<str name="facet">true</str>
<str name="f.validToDate.facet.range.start">NOW/DAYS-4MONTHS</str>
<str name="facet.mincount">1</str>
<str name="q">(*:*)</str>
<arr name="facet.range"><str>validToDate</str></arr>
<str name="facet.range.end">NOW/DAY+1DAY</str>
<str name="facet.range.gap">+1MONTH</str>

<result name="response" numFound="10" start="0">
<lst name="facet_counts"><lst name="facet_ranges"><lst name="validToDate"><lst name="counts"><int name="2011-11-14T00:00:00Z">7</int>
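If your Solr version supports excluding a tagged filter from facet counts via LocalParams, the request could be built roughly like this (an untested sketch: the URL, the field name, and whether range facets honor the {!ex} local param in your version are all assumptions to verify):

```python
from urllib.parse import urlencode

params = {
    'q': '*:*',
    # restrict results to docs whose validToDate is null, and tag the filter...
    'fq': '{!tag=vtd}-validToDate:[* TO *]',
    'facet': 'true',
    # ...so the range facet can be computed as if that filter were absent
    'facet.range': '{!ex=vtd}validToDate',
    'f.validToDate.facet.range.start': 'NOW/DAYS-4MONTHS',
    'facet.range.end': 'NOW/DAY+1DAY',
    'facet.range.gap': '+1MONTH',
    'facet.mincount': '1',
}
url = 'http://localhost:8983/solr/select?' + urlencode(params)
print(url)
```

The intent is exactly the split described above: the result set contains only the null-dated documents, while the facet counts still cover the non-null ones.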
Re: Too many connections in CLOSE_WAIT state on master solr server
I'm guessing (and it's just a guess) that what's happening is that the container is queueing up your requests while waiting for the other connections to close, so Mikhail's suggestion seems like a good idea. Best Erick On Wed, Dec 14, 2011 at 12:28 AM, samarth s samarth.s.seksa...@gmail.com wrote: The updates to the master are user driven, and are needed to be visible quickly. Hence, the high frequency of replication. It may be that too many replication requests are being handled at a time, but why should that result in half closed connections? On Wed, Dec 14, 2011 at 2:47 AM, Erick Erickson erickerick...@gmail.com wrote: Replicating 40 cores every 20 seconds is just *asking* for trouble. How often do your cores change on the master? How big are they? Is there any chance you just have too many cores replicating at once? Best Erick On Tue, Dec 13, 2011 at 3:52 PM, Mikhail Khludnev mkhlud...@griddynamics.com wrote: You can try to reuse your connections (prevent them from closing) by specifying -Dhttp.maxConnections=http://download.oracle.com/javase/1.4.2/docs/guide/net/properties.htmlN in jvm startup params. At client JVM!. Number should be chosen considering the number of connection you'd like to keep alive. Let me know if it works for you. On Tue, Dec 13, 2011 at 2:57 PM, samarth s samarth.s.seksa...@gmail.comwrote: Hi, I am using solr replication and am experiencing a lot of connections in the state CLOSE_WAIT at the master solr server. These disappear after a while, but till then the master solr stops responding. There are about 130 open connections on the master server with the client as the slave m/c and all are in the state CLOSE_WAIT. Also, the client port specified on the master solr server netstat results is not visible in the netstat results on the client (slave solr) m/c. Following is my environment: - 40 cores in the master solr on m/c 1 - 40 cores in the slave solr on m/c 2 - The replication poll interval is 20 seconds. 
- Replication part in solrconfig.xml in the slave solr:

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <!-- fully qualified url for the replication handler of master -->
    <str name="masterUrl">$mastercorename/replication</str>
    <!-- Interval in which the slave should poll master. Format is HH:mm:ss.
         If this is absent slave does not poll automatically. But a fetchindex
         can be triggered from the admin or the http API -->
    <str name="pollInterval">00:00:20</str>
    <!-- The following values are used when the slave connects to the master
         to download the index files. Default values implicitly set as 5000ms
         and 1ms respectively. The user DOES NOT need to specify these unless
         the bandwidth is extremely low or if there is an extremely high
         latency -->
    <str name="httpConnTimeout">5000</str>
    <str name="httpReadTimeout">1</str>
  </lst>
</requestHandler>

Thanks for any pointers. -- Regards, Samarth -- Sincerely yours Mikhail Khludnev Developer Grid Dynamics tel. 1-415-738-8644 Skype: mkhludnev http://www.griddynamics.com mkhlud...@griddynamics.com -- Regards, Samarth
Re: Copy in multivalued field and faceting
I don't quite understand what you're trying to do. MultiValued is a bit misleading. All it means is that you can add the same field multiple times to a document, i.e. (XML example):

<doc>
  <add name="field">value1 value2 value3</add>
  <add name="field">value4 value5 value6</add>
</doc>

will succeed if "field" is multiValued and fail if not. This will work if "field" is NOT multiValued:

<doc>
  <add name="field">value1 value2 value3 value4 value5 value6</add>
</doc>

and, assuming WhitespaceTokenizer, the field "field" will contain the exact same tokens. The only difference *might* be the offsets, but don't worry about that quite yet; all it would really affect is phrase queries. With that as a preface, I don't see why copyField has anything to do with your problem: you'd get the same results faceting on the title field, assuming identical analyzer chains. Faceting on a text field is iffy; it can be quite expensive. What you'd get in the end, though, is a list of the top words in your corpus for that field, counted from the documents that satisfied the query. Which sounds like what you're after. Best Erick On Wed, Dec 14, 2011 at 4:59 AM, yunfei wu yunfei...@gmail.com wrote: Sounds like working by carefully choosing tokenizer, and then use facet.sort and facet.limit parameters to do faceting. Will see any expert's comments on this one. Yunfei On Wed, Dec 14, 2011 at 12:26 AM, darul daru...@gmail.com wrote: Hello, Field for this scenario is Title and contains several words. For a specific query, I would like get the top ten words by frequency in a specific field. My idea was the following: - Title in my schema is stored/indexed in a specific field - A copyField copy Title field content into a multivalued field. If my multivalue field use a specific tokenizer which split words, does it fill each word in each multivalued items ? - If so, using faceting on this multivalue field, I will get top ten words, correct ? 
Example: 1) Title : this is my title 2) CopyField Title to specific multivalue field F1 3) F1 contains : {this, is, my, title} My english Thanks, Jul -- View this message in context: http://lucene.472066.n3.nabble.com/Copy-in-multivalued-field-and-faceting-tp3584819p3584819.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: edismax phrase matching with a non-word char inbetween
What I think is happening here is that WordDelimiterFilterFactory is throwing away your non-alpha-numeric characters. You can see this in admin/analysis, which I've found *extremely* helpful when faced with this kind of question. Best Erick On Tue, Dec 13, 2011 at 10:37 AM, Robert Brown r...@intelcompute.com wrote: I have a field which is indexed and queried as follows:

<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="text-synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>

When searching for "street work" (with quotes), I'm getting matches and highlighting on things like... ...Oxford <em>Street</em> (<em>Work</em> Experience)... Why is this happening, and what can I do to stop it? I've set <int name="qs">0</int> in my config to try and avert this sort of behaviour; am I correct in thinking that this is used to ensure there are no words in-between the phrase words?
Re: Use Solr to process/analyze docs without indexing
Thanks iorixxx! I think that's exactly what I was looking for. -- View this message in context: http://lucene.472066.n3.nabble.com/Use-Solr-to-process-analyze-docs-without-indexing-tp3585263p3585522.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Getting Error while running Query
On Wed, Dec 14, 2011 at 5:34 PM, Sanket Shah sanket.s...@cignex.com wrote: Thanks Gora for your reply. How can I come to know that alfresco or share is running n SOLR? I meant, when I login, clicking some folders or creating or uploading new files. How can I know that it is being done by SOLR and not by the old way before alfresco 4.0. [...] I am sorry, but while we do use Alfresco, we have not yet had occasion to look at integrating Solr with Alfresco. So, I would be unable to help you here. I presume that you have tried looking at the resources thrown up by searching Google, e.g., http://wiki.alfresco.com/wiki/Alfresco_And_SOLR Regards, Gora
Re: Large RDBMS dataset
Instead of handling it from within Solr, I'd suggest writing an external application (e.g. in Python using pysolr) that wraps the (fast) SQL query you like. Then retrieve a batch of documents and write them to Solr. For extra speed, don't commit until you're done. /Martin On Wed, Dec 14, 2011 at 11:18 AM, Finotti Simone tech...@yoox.com wrote: Hello, I have a very large dataset (> 1M records) on the RDBMS which I want my Solr application to pull data from. Problem is that the document fields which I have to index aren't in the same table, but I have to join records with other tables. Well, in fact they are views, but I don't think that this makes any difference. [...] It works, but it takes 1'38" to parse 100 records: it means 1 rec/s! That means that digesting the whole dataset would take 1 Ms (= 12 days). The problem is that for each record in fd, Solr makes three distinct SELECTs on the other three tables. Of course, this is absolutely inefficient. Is there a way to have Solr loading every record in the four tables and join them when they are already loaded in memory? TIA
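To keep it self-contained, here is a rough sketch of that idea using only the Python standard library and Solr's XML update format instead of pysolr (the Solr URL, the batch size, and the document shape are all assumptions, not anything from this thread):

```python
from urllib.request import Request, urlopen
from xml.sax.saxutils import escape

SOLR_UPDATE = 'http://localhost:8983/solr/update'  # hypothetical endpoint

def batches(rows, size=500):
    """Yield successive chunks so each POST carries many docs, not one."""
    for i in range(0, len(rows), size):
        yield rows[i:i + size]

def to_add_xml(chunk):
    """Render a list of {field: value} dicts as a Solr <add> message."""
    docs = []
    for doc in chunk:
        fields = ''.join('<field name="%s">%s</field>'
                         % (escape(str(k)), escape(str(v)))
                         for k, v in doc.items())
        docs.append('<doc>%s</doc>' % fields)
    return '<add>%s</add>' % ''.join(docs)

def index_all(rows):
    for chunk in batches(rows):
        req = Request(SOLR_UPDATE, to_add_xml(chunk).encode('utf-8'),
                      {'Content-Type': 'text/xml'})
        urlopen(req)  # one round trip per batch of rows
    # commit once at the very end, as suggested above, not per batch
    urlopen(Request(SOLR_UPDATE, b'<commit/>',
                    {'Content-Type': 'text/xml'}))
```

The `rows` here would come from the single fast JOIN query run directly against the database, so Solr never issues per-record lookups.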
Re: Solr using very high I/O
Do you commit often? If so, try committing less often :) /Martin On Wed, Dec 7, 2011 at 12:16 PM, Adrian Fita adrian.f...@gmail.com wrote: Hi. I experience an issue where Solr is using huge ammounts of I/O. Basically it uses the whole HDD continously, leaving nothing to the other processes. Solr is called by a script which continously indexes some files. The index has around 800MB and I can't understand why it could trash the HDD so much. I could use some help on how to optimize Solr so it doesn't use so much I/O. Thank you. -- Fita Adrian
Using LocalParams in StatsComponent to create a price slider?
Hi, I'm using the StatsComponent to retrieve the lower and upper bounds of a price field to create a price slider. If someone sets the price range to $100-$200, I have to add a filter to the query. But then the lower and upper bounds are calculated on the filtered result. Is it possible to use LocalParams (like for facets) to ignore a specific filter? Thanks. Mark
Re: Large RDBMS dataset
On Wed, Dec 14, 2011 at 3:48 PM, Finotti Simone tech...@yoox.com wrote: Hello, I have a very large dataset (> 1M records) on the RDBMS which I want my Solr application to pull data from. [...] It works, but it takes 1'38" to parse 100 records: it means 1 rec/s! That means that digesting the whole dataset would take 1 Ms (= 12 days). Depending on the size of the data that you are pulling from the database, 1M records is not really that large a number. We were doing ~75GB of stored data from ~7 million records in about 9h, including quite complicated transformers. I would imagine that there is much room for improvement in your case also. Some notes on this: * If you have servers to throw at the problem, and a sensible way to shard your RDBMS data, use parallel indexing to multiple Solr cores, maybe on multiple servers, followed by a merge. In our experience, given enough RAM and adequate provisioning of database servers, indexing speed scales linearly with the total no. of cores. * Replicate your database, manually if needed. Look at the load on a database server during the indexing process, and provision enough database servers to match the no. of Solr indexing servers. * This point is leading into flamewar territory, but consider switching databases. From our (admittedly non-rigorous) measurements, MySQL was at least a factor of 2-3 faster than MS-SQL, with the same dataset. * Look at cloud computing. If finances permit, one should be able to shrink indexing times to almost any desired level. E.g., for the dataset that we used, I have little doubt that we could have shrunk the time down to less than 1h, at an affordable cost on Amazon EC2. Unfortunately, we have not yet had the opportunity to try this. The problem is that for each record in fd, Solr makes three distinct SELECT on the other three tables. Of course, this is absolutely inefficient. Is there a way to have Solr loading every record in the four tables and join them when they are already loaded in memory? 
For various reasons, we did not investigate this in depth, but you could also look at Solr's CachedSqlEntityProcessor. Regards, Gora
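For the DIH config posted earlier in this thread, using CachedSqlEntityProcessor on a sub-entity might look roughly like this (an untested sketch; check the DataImportHandler wiki for the exact attributes supported by your Solr version):

```xml
<!-- Sketch: the sub-entity's rows are fetched once with a single query
     and cached, then joined in memory against each parent "fd" row via
     the "where" mapping, instead of one SELECT per parent record. -->
<entity name="fd2"
        processor="CachedSqlEntityProcessor"
        query="SELECT endeca_id, macrocolor_id, color_descr, gsize_descr, size_descr
               FROM clust_w_fast_dump2_ByMarkets"
        where="endeca_id=fd.Endeca_ID"/>
```

Note the query drops the per-record WHERE clause: the whole table is loaded up front, which is exactly the in-memory join the original poster asked about, at the cost of holding the sub-entity's rows in RAM.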
Re: CRUD on solr Index while replicating between master/slave
Hi, We have an index which needs constant updates on the master. One more question. The scenario is: 1) Master starts replicating to slave (takes approx 15 mins) 2) We make some changes to the index on the master while it is replicating So the question is: what happens to the changes in the master index while it is replicating? Will the slave get them or not? Tarun Jain -=- - Original Message - From: Erick Erickson erickerick...@gmail.com To: solr-user@lucene.apache.org; Tarun Jain tjai...@yahoo.com Cc: Sent: Tuesday, December 13, 2011 4:18 PM Subject: Re: CRUD on solr Index while replicating between master/slave No, you can search on the master when replicating, no problem. But why do you want to? The whole point of master/slave setups is to separate indexing from searching machines. Best Erick On Tue, Dec 13, 2011 at 4:10 PM, Tarun Jain tjai...@yahoo.com wrote: Hi, Thanks. So just to clarify here again while replicating we cannot search on master index ? Tarun Jain -=- - Original Message - From: Otis Gospodnetic otis_gospodne...@yahoo.com To: solr-user@lucene.apache.org solr-user@lucene.apache.org Cc: Sent: Tuesday, December 13, 2011 3:03 PM Subject: Re: CRUD on solr Index while replicating between master/slave Hi, Master: Update/insert/delete docs -- Yes Slaves: Search -- Yes Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ From: Tarun Jain tjai...@yahoo.com To: solr-user@lucene.apache.org solr-user@lucene.apache.org Sent: Tuesday, December 13, 2011 11:15 AM Subject: CRUD on solr Index while replicating between master/slave Hi, When replication is happening between master to slave what operations can we do on the master and what operations are possible on the slave? I know it is not advisable to do DML on the slave index but I wanted to know this anyway. Also I understand that doing DML on a slave will make the slave index incompatible with the master. 
Master:
  Search -- Yes/No
  Update/insert/delete docs -- Yes/No
Slave:
  Search -- Yes/No
  Update/insert/delete docs -- Yes/No

Please share any other caveats that you have discovered regarding the above scenario that might be helpful. Thanks -=-
Re: Shutdown hook issue
Hi, Solr won't shut down by itself just because it's idle. :) You could run it with debugger attached and breakpoint set in the shutdown hook you are talking about and see what calls it. Otis Performance Monitoring SaaS for Solr - http://sematext.com/spm/solr-performance-monitoring/index.html From: Adolfo Castro Menna adolfo.castrome...@gmail.com To: solr-user@lucene.apache.org Sent: Wednesday, December 14, 2011 8:17 AM Subject: Shutdown hook issue Hi All, I'm experiencing some issues with solr. From time to time solr goes down. After checking the logs, I see that it's due to the shutdown hook being triggered. I still don't know why it happens but it seems to be related to solr being idle. Does anyone have any insights? I'm using Ubuntu 10.04.2 LTS and solr 3.1.0 running on Jetty (default configuration). Solr runs in background, so it doesn't seem to be related to a SIGINT unless ubuntu is sending it for some odd reason. Thanks, Adolfo.
Re: CRUD on solr Index while replicating between master/slave
Hi, The slave will get the changes next time it polls the master and master tells it the index has changed. Note that master doesn't replicate to slave, but rather the slave copies changes from the master. Otis Performance Monitoring SaaS for Solr - http://sematext.com/spm/solr-performance-monitoring/index.html From: Tarun Jain tjai...@yahoo.com To: solr-user@lucene.apache.org solr-user@lucene.apache.org Sent: Wednesday, December 14, 2011 10:43 AM Subject: Re: CRUD on solr Index while replicating between master/slave Hi, We have an index which needs constant updates in the master. One more question.. The scenario is 1) Master starts replicating to slave (takes approx 15 mins) 2) We do some changes to index on master while it is replicating So question is what happens to the changes in master index while it is replicating. Will the slave get it or not? Tarun Jain -=- - Original Message - From: Erick Erickson erickerick...@gmail.com To: solr-user@lucene.apache.org; Tarun Jain tjai...@yahoo.com Cc: Sent: Tuesday, December 13, 2011 4:18 PM Subject: Re: CRUD on solr Index while replicating between master/slave No, you can search on the master when replicating, no problem. But why do you want to? The whole point of master/slave setups is to separate indexing from searching machines. Best Erick On Tue, Dec 13, 2011 at 4:10 PM, Tarun Jain tjai...@yahoo.com wrote: Hi, Thanks. So just to clarify here again while replicating we cannot search on master index ? 
Tarun Jain -=-

- Original Message -
From: Otis Gospodnetic otis_gospodne...@yahoo.com
To: solr-user@lucene.apache.org solr-user@lucene.apache.org
Sent: Tuesday, December 13, 2011 3:03 PM
Subject: Re: CRUD on solr Index while replicating between master/slave

Hi,
Master: Update/insert/delete docs -- Yes
Slaves: Search -- Yes

Otis
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/

From: Tarun Jain tjai...@yahoo.com
To: solr-user@lucene.apache.org solr-user@lucene.apache.org
Sent: Tuesday, December 13, 2011 11:15 AM
Subject: CRUD on solr Index while replicating between master/slave

Hi, When replication is happening from master to slave, what operations can we do on the master, and what operations are possible on the slave? I know it is not advisable to do DML on the slave index, but I wanted to know this anyway. Also, I understand that doing DML on a slave will make the slave index incompatible with the master.

Master =
Search -- Yes/No
Update/insert/delete docs -- Yes/No

Slave =
Search -- Yes/No
Update/insert/delete docs -- Yes/No

Please share any other caveats that you have discovered regarding the above scenario that might be helpful. Thanks -=-
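For reference, the polling Otis describes is configured on the slave side of the ReplicationHandler in solrconfig.xml. A sketch of both sides; the host name and interval below are placeholders:

```xml
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <!-- Master: publish a new index version after each commit -->
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="confFiles">schema.xml,stopwords.txt</str>
  </lst>
  <!-- Slave: poll the master every 60 seconds and copy changed segments -->
  <lst name="slave">
    <str name="masterUrl">http://master-host:8983/solr/replication</str>
    <str name="pollInterval">00:00:60</str>
  </lst>
</requestHandler>
```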
NumericRangeQuery: what am I doing wrong?
I can't get NumericRangeQuery or TermQuery to work on my integer id field. I feel like I must be missing something obvious. I have a test index that has only two documents, id:9076628 and id:8003001. The id field is defined like so:

<field name="id" type="tint" indexed="true" stored="true" required="true" />

A MatchAllDocsQuery will return the 2 documents, but any queries I try on the id field return no results. For instance,

public void testIdRange() throws IOException {
    Query q = NumericRangeQuery.newIntRange("id", 1, 1000, true, true);
    System.out.println("query: " + q);
    assertEquals(2, searcher.search(q, 5).totalHits);
}

public void testIdSearch() throws IOException {
    Query q = new TermQuery(new Term("id", "9076628"));
    System.out.println("query: " + q);
    assertEquals(1, searcher.search(q, 5).totalHits);
}

Both tests fail with totalHits being 0. This is using solr/lucene trunk, but I tried also with 3.2 and got the same results. What could I be doing wrong here?

Thanks, --jay
Re: NumericRangeQuery: what am I doing wrong?
Maybe you should index your values differently? Here is what Lucene's 2.9 javadoc says:

"To use this, you must first index the numeric values using NumericField (expert: NumericTokenStream). If your terms are instead textual, you should use TermRangeQuery. NumericRangeFilter is the filter equivalent of this query."

(Javadoc links:
http://lucene.apache.org/java/2_9_0/api/all/org/apache/lucene/document/NumericField.html
http://lucene.apache.org/java/2_9_0/api/all/org/apache/lucene/search/TermRangeQuery.html
http://lucene.apache.org/java/2_9_0/api/all/org/apache/lucene/search/NumericRangeFilter.html)

Dmitry

On Wed, Dec 14, 2011 at 6:53 PM, Jay Luker lb...@reallywow.com wrote:
Re: cache monitoring tools?
Thanks, Justin. With zabbix I can gather jmx-exposed stats from SOLR; how about munin, what protocol / mechanism does it use to accumulate stats? It wasn't obvious from their online documentation...

On Mon, Dec 12, 2011 at 4:56 PM, Justin Caratzas justin.carat...@gmail.com wrote: Dmitry, The only added stress that munin puts on each box is the 1 request per stat per 5 minutes to our admin stats handler. Given that we get 25 requests per second, this doesn't make much of a difference. We don't have a sharded index (yet) as our index is only 2-3 GB, but we do have slave servers with replicated indexes that handle the queries, while our master handles updates/commits. Justin

Dmitry Kan dmitry@gmail.com writes: Justin, in terms of the overhead, have you noticed whether Munin adds much of it when used in production? In terms of the solr farm: how big is a shard's index (given you have a sharded architecture)? Dmitry

On Sun, Dec 11, 2011 at 6:39 PM, Justin Caratzas justin.carat...@gmail.com wrote: At my work, we use Munin and Nagios for monitoring and alerts. Munin is great because writing a plugin for it is so simple, and with Solr's statistics handler, we can track almost any solr stat we want. It also comes with included plugins for load, file system stats, processes, etc. http://munin-monitoring.org/ Justin

Paul Libbrecht p...@hoplahup.net writes: Allow me to chime in and ask a generic question about monitoring tools for people close to developers: are any of the tools mentioned in this thread actually able to show graphs of loads, e.g. cache counts or CPU load, in parallel to a console log or to an http request log? I am working on such a tool currently, but I have a bad feeling of reinventing the wheel. thanks in advance Paul

Le 8 déc. 2011 à 08:53, Dmitry Kan a écrit : Otis, Tomás: thanks for the great links! 2011/12/7 Tomás Fernández Löbbe tomasflo...@gmail.com Hi Dimitry, I pointed to the wiki page to enable JMX, then you can use any tool that visualizes JMX stuff, like Zabbix.
See http://www.lucidimagination.com/blog/2011/10/02/monitoring-apache-solr-and-lucidworks-with-zabbix/

On Wed, Dec 7, 2011 at 11:49 AM, Dmitry Kan dmitry@gmail.com wrote: The culprit seems to be the merger (frontend) SOLR. Talking to one shard directly takes substantially less time (1-2 sec).

On Wed, Dec 7, 2011 at 4:10 PM, Dmitry Kan dmitry@gmail.com wrote: Tomás: thanks. The page you gave didn't mention caches specifically; is there more documentation on this? I have used the solrmeter tool, which draws the cache diagrams; is there a similar tool that would use jmx directly and present the cache usage at runtime?

pravesh: I have increased the size of filterCache, but the search hasn't become any faster, taking almost 9 sec on avg :(

name: search
class: org.apache.solr.handler.component.SearchHandler
version: $Revision: 1052938 $
description: Search using components: org.apache.solr.handler.component.QueryComponent, org.apache.solr.handler.component.FacetComponent, org.apache.solr.handler.component.MoreLikeThisComponent, org.apache.solr.handler.component.HighlightComponent, org.apache.solr.handler.component.StatsComponent, org.apache.solr.handler.component.DebugComponent
stats:
handlerStart : 1323255147351
requests : 100
errors : 3
timeouts : 0
totalTime : 885438
avgTimePerRequest : 8854.38
avgRequestsPerSecond : 0.008789442

the stats (copying fieldValueCache as well here, to show term statistics):

name: fieldValueCache
class: org.apache.solr.search.FastLRUCache
version: 1.0
description: Concurrent LRU Cache(maxSize=1, initialSize=10, minSize=9000, acceptableSize=9500, cleanupThread=false)
stats:
lookups : 79
hits : 77
hitratio : 0.97
inserts : 1
evictions : 0
size : 1
warmupTime : 0
cumulative_lookups : 79
cumulative_hits : 77
cumulative_hitratio : 0.97
cumulative_inserts : 1
cumulative_evictions : 0
item_shingleContent_trigram :
{field=shingleContent_trigram, memSize=326924381, tindexSize=4765394, time=215426, phase1=213868, nTerms=14827061, bigTerms=35, termInstances=114359167, uses=78}

name: filterCache
class: org.apache.solr.search.FastLRUCache
version: 1.0
description: Concurrent LRU Cache(maxSize=153600, initialSize=4096, minSize=138240, acceptableSize=145920, cleanupThread=false)
stats:
lookups : 1082854
hits : 940370
hitratio : 0.86
inserts : 142486
evictions : 0
size : 142486
warmupTime : 0
cumulative_lookups : 1082854
cumulative_hits : 940370
cumulative_hitratio : 0.86
cumulative_inserts : 142486
cumulative_evictions : 0

index size: 3.25 GB

Does anyone have some pointers to where to look and what to optimize for query time?
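For context, the filterCache whose stats appear above is declared in solrconfig.xml; the sizes below mirror the posted stats, and the autowarm count is a placeholder:

```xml
<!-- Sketch: filterCache sized to match the stats in this thread.
     autowarmCount pre-populates the cache after each commit from the
     old searcher's entries; tune it to your commit frequency. -->
<filterCache class="solr.FastLRUCache"
             size="153600"
             initialSize="4096"
             autowarmCount="4096"/>
```

Note that an 86% hit ratio with zero evictions suggests the cache size itself is not the bottleneck here, consistent with Dmitry's observation that enlarging it didn't help.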
Optimal Setup
Background: We have around 100 web sites of various sizes (in terms of indexable content) and I'm trying to come up with the best architectural design from a performance perspective.
- Each of the sites needs DEV, TEST and LIVE indices.
- The content on the sites is divided into 5 groups (but it's likely that there will be more groups in future) which are similar enough to use the same schema, solrconfig, synonyms, stopwords etc.
- From a search view they are distinct per website (i.e. only that site's content should appear).
- The indexing mechanism is the same for all sites (i.e. they all use the web api).
- While not unlimited, we have a fair bit of flexibility on servers (although they are all virtual).

Questions: Is it better to
(a) have each site in its own core and webapp, e.g. /solr1/(DEV|TEST|LIVE) /solr2/(DEV|TEST|LIVE) etc
(b) have all the cores in one webapp, e.g. /solr1/(SITE1DEV|SITE1TEST|SITE1LIVE|SITE2DEV|SITE2TEST|SITE2LIVE) etc
(c) have 3 cores per content group and have a filter query param in all queries that only grabs that site's data, e.g. /solr1/(CONTENTGRP1DEV|CONTENTGRP1TEST)
(d) same as (c) except sharding across multiple servers
(e) have all the DEVs, TESTs and LIVEs on separate boxes with either the (b) or (c) setup, e.g. (b) box1: /solr1/(SITE1LIVE|SITE2LIVE...) or (c) /solr1/(CONTENTGRP1LIVE|CONTENTGRP2LIVE...)

Thanks for the help. Regards, Dave
Re: Optimal Setup
You need dev, test, and live on separate boxes so that you can do capacity tests. When you are sending queries to find out the max rate before overload, you need to do that on dev or test, not live. Also, you'll need to test new versions of Solr, so you need separate Solr installations. wunder

On Dec 14, 2011, at 9:30 AM, Dave Stuart wrote:
Re: Large RDBMS dataset
You can also consider using SolrJ to do this. I posted a small example a couple of days ago. Best Erick On Wed, Dec 14, 2011 at 10:39 AM, Gora Mohanty g...@mimirtech.com wrote: On Wed, Dec 14, 2011 at 3:48 PM, Finotti Simone tech...@yoox.com wrote: Hello, I have a very large dataset ( 1 Mrecords) on the RDBMS which I want my Solr application to pull data from. [...] It works, but it takes 1'38 to parse 100 records: it means 1 rec/s! That means that digesting the whole dataset would take 1 Ms (= 12 days). Depending on the size of the data that you are pulling from the database, 1M records is not really that large a number. We were doing ~75GB of stored data from ~7million records in about 9h, including quite complicated transfomers. I would imagine that there is much room for improvement in your case also. Some notes on this: * If you have servers to throw at the problem, and a sensible way to shard your RDBMS data, use parallel indexing to multiple Solr cores, maybe on multiple servers, followed by a merge. In our experience, given enough RAM and adequate provisioning of database servers, indexing speed scales linearly with the total no. of cores. * Replicate your database, manually if needed. Look at the load on a database server during the indexing process, and provision enough database servers to match the no. of Solr indexing servers. * This point is leading into flamewar territory, but consider switching databases. From our (admittedly non-rigorous measurements), mysql was at least a factor of 2-3 faster than MS-SQL, with the same dataset. * Look at cloud-computing. If finances permit, one should be able to shrink indexing times to almost any desired level. E.g., for the dataset that we used, I have little doubt that we could have shrunk the time down to less than 1h, at an affordable cost on Amazon EC2. Unfortunately, we have not yet had the opportunity to try this. 
The problem is that for each record in fd, Solr makes three distinct SELECT on the other three tables. Of course, this is absolutely inefficient. Is there a way to have Solr loading every record in the four tables and join them when they are already loaded in memory? For various reasons, we did not investigate this in depth, but you could also look at Solr's CachedSqlEntityProcessor. Regards, Gora
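A sketch of the CachedSqlEntityProcessor Gora mentions, for a DataImportHandler data-config.xml. The table and column names are made up for illustration; the point is that each sub-entity's query runs once, and subsequent per-row lookups are served from an in-memory cache instead of issuing a SELECT per parent record:

```xml
<document>
  <entity name="fd" query="SELECT id, title FROM fd">
    <!-- Without the processor, this SELECT would run once per fd row.
         CachedSqlEntityProcessor loads the table once and joins in memory,
         keyed by the "where" condition (Solr 3.x DIH syntax). -->
    <entity name="detail"
            processor="CachedSqlEntityProcessor"
            query="SELECT fd_id, attr FROM detail"
            where="fd_id=fd.id"/>
  </entity>
</document>
```

The trade-off is memory: the whole child table is held in RAM for the duration of the import, so this fits lookup tables far better than very wide ones.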
Re: CRUD on solr Index while replicating between master/slave
Whoa! Replicating takes 15 mins? That's a really long time. Are you including the polling interval here? Or is this just raw replication time? Because this is really suspicious. Are you optimizing your index all the time or something? Replication should pull down ONLY the changed segments. But optimizing changes *all* the segments (really, collapses them into one) and you'd be copying the full index each replication. Or are you committing after every few documents? Or? You need to understand why replication takes so long before going any further IMO. It may be perfectly legitimate, but on the surface it sure doesn't seem right.

Best
Erick

On Wed, Dec 14, 2011 at 10:52 AM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote:
Re: NumericRangeQuery: what am I doing wrong?
Hmmm, seems like it should work, but there are two things you might try:

1> Just execute the query in Solr: id:[1 TO 1000]. Does that work?
2> I'm really grasping at straws here, but it's *possible* that you need to use the same precisionStep as tint (8?). There's a constructor that takes precisionStep as a parameter, but the default is 4 in the 3.x code.

I guess it's also possible that you're not really connecting to the server you think you are, but I doubt it, as I expect your unit test is creating the index for you, in which case you can't do <1>.

Best
Erick

On Wed, Dec 14, 2011 at 12:14 PM, Dmitry Kan dmitry@gmail.com wrote:
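To make Erick's point <2> concrete: Solr's example schema defines tint with precisionStep="8", while NumericRangeQuery.newIntRange without an explicit step uses the 3.x default of 4, so the trie terms written at index time and the terms probed at query time don't line up. A Lucene 3.x fragment (not runnable standalone; it assumes the Lucene jars and the "id" field from this thread):

```java
// Query side: pass the SAME precisionStep the field was indexed with
// (8 for Solr's example "tint"), or no trie terms will match.
Query range = NumericRangeQuery.newIntRange("id", 8, 1, 10000000, true, true);

// A plain TermQuery on the string "9076628" also fails against a trie
// field: the indexed terms are prefix-encoded numeric terms, not the
// literal digits.
```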
Re: CRUD on solr Index while replicating between master/slave
Hi, We do optimize the whole index because we index our entire content every 4 hrs. From an application/business point of view the replication time is acceptable. Thanks for the information though. We will try to change this behaviour in the future so that the replication time is reduced.

Tarun Jain -=-

From: Erick Erickson erickerick...@gmail.com
To: solr-user@lucene.apache.org; Otis Gospodnetic otis_gospodne...@yahoo.com
Sent: Wednesday, December 14, 2011 1:52 PM
Subject: Re: CRUD on solr Index while replicating between master/slave
Re: How to get SolrServer
Hi Joey, if what you want is to customize Solr so that you do the indexing code on the server side, you could implement your own RequestHandler; then the only thing you need to do is add it to solrconfig.xml, and you can call it through an HTTP GET.

On Tue, Dec 13, 2011 at 4:42 PM, Schmidt Jeff j...@rvswithoutborders.com wrote:

Joey: I'm not sure what you mean by wrapping solr into your own web application. There is a way to embed Solr into your application (same JVM), but I've never used that. If you're talking about your servlet running in one JVM and Solr in another, then use the SolrJ client library to interact with Solr. I use CommonsHttpSolrServer (http://lucene.apache.org/solr/api/org/apache/solr/client/solrj/impl/CommonsHttpSolrServer.html) and specify the URL that locates the Solr server/core name. I use Spring to instantiate the server instance, and then I inject it where I need it.

<bean id="solrServerIngContent" class="org.apache.solr.client.solrj.impl.CommonsHttpSolrServer">
    <constructor-arg value="http://localhost:8091/solr/mycorename"/>
</bean>

This is equivalent to new CommonsHttpSolrServer("http://localhost:8091/solr/mycorename");

Check out the API link above and http://wiki.apache.org/solr/Solrj for examples on using the SolrJ API. Cheers, Jeff

On Dec 13, 2011, at 12:12 PM, Joey wrote: Hi, I am new to Solr and want to do some customized development. I have wrapped solr into my own web application, and want to write a servlet to index a file system. The question is: how can I get a SolrServer inside my Servlet?

-- View this message in context: http://lucene.472066.n3.nabble.com/How-to-get-SolrServer-tp3583304p3583304.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Possible to adjust FieldNorm?
From what I can see, the problem there is not with the field norm, but with the fact that leadership is not matching the second document for some reason. Is it possible that you are having some kind of analysis problem?

On Wed, Dec 14, 2011 at 6:50 AM, cnyee yeec...@gmail.com wrote:
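For readers puzzled by the 1.0 vs 0.4375 fieldNorm values in the explain output: with Lucene's DefaultSimilarity the raw length norm is 1/sqrt(number of terms in the field), which is then quantized into a single byte when stored, so short fields like "Ask the Coach" get a large boost. A self-contained sketch of the raw formula (token counts are illustrative, and the byte encoding itself is omitted):

```java
public class LengthNormDemo {
    // DefaultSimilarity's raw length norm: 1 / sqrt(number of terms).
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    public static void main(String[] args) {
        // "Ask the Coach" -> 3 tokens (before any stopword removal)
        System.out.println(lengthNorm(3));
        // "Coaching as a Leadership Development Tool Part 1" -> 8 tokens
        System.out.println(lengthNorm(8));
        // Stored norms are byte-quantized (a 3-bit mantissa), which is how
        // coarse values like 0.4375 in the explain output arise.
    }
}
```

The practical lever here is omitNorms="true" on the field (removing the length penalty entirely) or a custom Similarity overriding lengthNorm, rather than adjusting fieldNorm directly.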
Re: Shutdown hook issue
I think I found the issue. The ubuntu server is running the OOM-Killer, which might be sending a SIGINT to the java process, probably because of memory consumption. Thanks, Adolfo.

On Wed, Dec 14, 2011 at 12:44 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote:
Re: Solr Join with Dismax
Thanks Hoss! But unfortunately, the dismax parameters (like qf) are not passed over to the fromIndex. In fact, even if using var dereferencing makes dismax be selected as the fromQueryParser, the query that is passed to the JoinQuery object contains nothing to indicate that it should use dismax. The following code is from the method createParser in JoinQParserPlugin.java:

// With var dereferencing, this makes the fromQueryParser be dismax
QParser fromQueryParser = subQuery(v, "lucene");
// But after the call to getQuery, there is no indication that dismax should be used
Query fromQuery = fromQueryParser.getQuery();
JoinQuery jq = new JoinQuery(fromField, toField, fromIndex, fromQuery);

So I guess that as it is right now, dismax can't really be used with joins.

On Fri, Dec 9, 2011 at 3:20 PM, Chris Hostetter hossman_luc...@fucit.org wrote:

: Is there a specific reason why it is hard-coded to use the lucene
: QParser? I was looking at JoinQParserPlugin.java and here it is in
: createParser:
:
: QParser fromQueryParser = subQuery(v, "lucene");
:
: I could pass another param named fromQueryParser and use it instead of
: "lucene". But again, is there a reason why I should not do that?

It's definitely a bug, but we don't need a new local param: that hardcoded "lucene" should just be replaced with null, so that the defType local param will be checked (just like it can be in the BoostQParser)...

qf=text name
q={!join from=manu_id_s to=id defType=dismax}ipod

Note: even with that hardcoded "lucene" bug, you can still override the default by using var dereferencing to point at another param with its own local params specifying the type...

q={!join from=manu_id_s to=id v=$qq}
qq={!dismax}ipod

-Hoss

--
Pascal Dimassimo
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
Re: Shutdown hook issue
I am not an expert on this, but the oom-killer will kill off the process consuming the greatest amount of memory if the machine runs out of memory, and you should see something to that effect in the system log, /var/log/messages I think. François

On Dec 14, 2011, at 2:54 PM, Adolfo Castro Menna wrote:
Re: Copy in multivalued field and faceting
Hi Erick, just interested in this topic, so I might want to ask a further question based on Jul's topic. I read the documentation of facet.sort=count, which seems to return the facets ordered by the doc hit counts. So, suppose one doc has title "value1 value2 value3" and another doc has title "value2 value4 value5", and we use WhitespaceTokenizer (no matter whether it is designed as a single field or a multi-valued field); do we get the facet results as:

value2 - 2 docs
value1 - 1 doc
value3 - 1 doc
value4 - 1 doc
value5 - 1 doc

Is it a way to get top words? Does it cause a high performance cost?

Thanks, Yunfei

On Wed, Dec 14, 2011 at 5:51 AM, Erick Erickson erickerick...@gmail.com wrote:

I don't quite understand what you're trying to do. "MultiValued" is a bit misleading. All it means is that you can add the same field multiple times to a document, i.e. (XML example):

<doc>
  <add name="field">value1 value2 value3</add>
  <add name="field">value4 value5 value6</add>
</doc>

will succeed if "field" is multiValued and fail if not. This will work if "field" is NOT multiValued:

<doc>
  <add name="field">value1 value2 value3 value4 value5 value6</add>
</doc>

and, assuming WhitespaceTokenizer, the field will contain the exact same tokens either way. The only difference *might* be the offsets, but don't worry about that quite yet; all it would really affect is phrase queries.

With that as a preface, I don't see why copyField has anything to do with your problem: you'd get the same results faceting on the title field, assuming identical analyzer chains. Faceting on a text field is iffy; it can be quite expensive. What you'd get in the end, though, is a list of the top words in your corpus for that field, counted from the documents that satisfied the query. Which sounds like what you're after.

Best
Erick

On Wed, Dec 14, 2011 at 4:59 AM, yunfei wu yunfei...@gmail.com wrote: Sounds like it would work by carefully choosing the tokenizer, and then using the facet.sort and facet.limit parameters to do faceting. Will see any expert's comments on this one. 
Yunfei

On Wed, Dec 14, 2011 at 12:26 AM, darul daru...@gmail.com wrote:

Hello,

The field for this scenario is Title and contains several words. For a specific query, I would like to get the top ten words by frequency in a specific field. My idea was the following:

- Title in my schema is stored/indexed in a specific field
- A copyField copies the Title field content into a multivalued field. If my multivalued field uses a specific tokenizer which splits words, does it fill each word into its own multivalued item?
- If so, using faceting on this multivalued field, I will get the top ten words, correct?

Example:
1) Title: this is my title
2) copyField Title to a specific multivalued field F1
3) F1 contains: {this, is, my, title}

My english... Thanks, Jul

-- View this message in context: http://lucene.472066.n3.nabble.com/Copy-in-multivalued-field-and-faceting-tp3584819p3584819.html Sent from the Solr - User mailing list archive at Nabble.com.
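To make the faceting mechanics in this thread concrete, here is a hand-rolled, self-contained sketch (plain Java, not Solr code — the class and method names are made up for illustration) of what facet.field on a whitespace-tokenized field effectively computes: per-term document counts, sorted descending.

```java
import java.util.*;
import java.util.stream.*;

// Sketch of facet.field semantics on a tokenized field: for each document,
// take the set of distinct terms, then count documents per term.
public class TopTermsSketch {
    public static LinkedHashMap<String, Long> topTerms(List<String> titles, int limit) {
        Map<String, Long> counts = new HashMap<>();
        for (String title : titles) {
            // WhitespaceTokenizer analogue: split on whitespace; de-dup per doc,
            // because facet counts are per-document, not per-occurrence
            for (String term : new HashSet<>(Arrays.asList(title.split("\\s+")))) {
                counts.merge(term, 1L, Long::sum);
            }
        }
        // facet.sort=count + facet.limit analogue: highest doc count first
        return counts.entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
            .limit(limit)
            .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue,
                                      (a, b) -> a, LinkedHashMap::new));
    }

    public static void main(String[] args) {
        Map<String, Long> top = topTerms(Arrays.asList(
            "value1 value2 value3", "value2 value4 value5"), 10);
        System.out.println(top); // value2 appears in 2 docs, the rest in 1
    }
}
```

Note that each document contributes at most 1 to a term's count, which is why the facet result is a document frequency rather than a raw term frequency.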
Re: Copy in multivalued field and faceting
I read the documentation of facet.sort=count, which seems to return the facets ordered by the doc hit counts. So, suppose one doc has title "value1 value2 value3" and another doc has title "value2 value4 value5", and we use WhitespaceTokenizer (no matter whether designed as a single field or a multi-valued field); do we get the facet results as: value2 - 2 docs, value1 - 1 doc, value3 - 1 doc, value4 - 1 doc, value5 - 1 doc? Is it a way to get top words? Does it cause a high performance cost?

Consider using http://wiki.apache.org/solr/LukeRequestHandler for top terms. Faceting is meant more to 'drill down' into the search result set.
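For reference, the Luke request handler can report the top terms per field directly; with the stock example setup (the host, port, and field name here are assumptions — adjust to your install) the request looks something like:

```text
http://localhost:8983/solr/admin/luke?fl=title&numTerms=10
```

This asks Luke to report index-wide statistics for the title field, including its ten most frequent terms, without running a query at all.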
Re: Possible to adjust FieldNorm?
: From what I can see, the problem there is not with the field norm, but with
: the fact that leadership is not matching the second document for some
: reason. Is it possible that you are having some kind of analysis problem?

Agreed ... if those are your full score explanations for those two documents, then something is not right with how the second document is matching your query.

What exactly does your request look like? What exactly does the requestHandler configuration look like? What is the final parsed query according to the debug information? What does the fieldtype for name_en look like? What does analysis.jsp say about how those name_en field values are analyzed at index time? How is the word leadership analyzed at query time?

In general, if you want to disable norms you can set omitNorms on the field -- or you can customize the similarity to change the lengthNorm function.

: Search terms: coaching leadership
: Record 1: name=Ask the Coach, desc=...,...
: Record 2: name=Coaching as a Leadership Development Tool Part 1,
: desc=...,...
:
: Record 1 was scored higher than record 2, despite record 2 has two matches.
: The scoring is given below:
:
: Record 1:
: 1.2878088 = (MATCH) weight(name_en:coach in 6430), product of:
:   0.20103075 = queryWeight(name_en:coach), product of:
:     6.406029 = idf(docFreq=160, maxDocs=35862)
:     0.03138149 = queryNorm
:   6.406029 = (MATCH) fieldWeight(name_en:coach in 6430), product of:
:     1.0 = tf(termFreq(name_en:coach)=1)
:     6.406029 = idf(docFreq=160, maxDocs=35862)
:     1.0 = fieldNorm(field=name_en, doc=6430)
:
: Record 2:
: 0.56341636 = (MATCH) weight(name_en:coach in 4744), product of:
:   0.20103075 = queryWeight(name_en:coach), product of:
:     6.406029 = idf(docFreq=160, maxDocs=35862)
:     0.03138149 = queryNorm
:   2.8026378 = (MATCH) fieldWeight(name_en:coach in 4744), product of:
:     1.0 = tf(termFreq(crs_name_en:coach)=1)
:     6.406029 = idf(docFreq=160, maxDocs=35862)
:     0.4375 = fieldNorm(field=name_en, doc=4744)

-Hoss
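As a concrete illustration of the omitNorms suggestion above (the field name and type are taken from the thread and assumed; a full reindex is required before the change takes effect):

```xml
<!-- schema.xml: drop length normalization so short fields stop getting a big boost -->
<field name="name_en" type="text_en" indexed="true" stored="true" omitNorms="true"/>
```

With norms omitted, fieldNorm becomes 1.0 for every document, so the 1.0-vs-0.4375 difference in the two explanations above disappears from the scoring.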
Re: Solr Join with Dismax
Hi, I have been doing more tracing in the code, and I think that I understand a bit more. The problem does not seem to be dismax+join, but dismax+join+fromIndex.

When doing this joined dismax query on the same index:

http://localhost:8080/solr/gutenberg/select?q={!join+from=id+to=id+v=$qq}&qq={!dismax+qf='body%20tag^2'}solr

the query returned by the method fromQueryParser.getQuery() looks like this:

+(body:solr | tag:solr^2.0)

But when doing the same query across another core:

http://localhost:8080/solr/test/select/?q={!join+fromIndex=gutenberg+from=id+to=id+v=$qq}&qq={!dismax+qf='body%20tag^2'}solr

the query is:

+(body:solr)

We see that the second field defined in the qf param is not added to the query. Tracing deeper shows that this happens because the tag field does not exist in the test core, hence it is not added. This can be seen in SolrPluginUtils.java in the method getFieldQuery: any field not part of the current index won't be added to the query.

So the conclusion does not seem to be that dismax can't be used with joins, but that it can't be used with another core that does not have the same fields as the one where the initial query is made.

I just noticed SOLR-2824, so it is really a bug. I'll take the time to look at the patch attached to this ticket.

On Wed, Dec 14, 2011 at 2:55 PM, Pascal Dimassimo pascal.dimass...@sematext.com wrote: Thanks Hoss! But unfortunately, the dismax parameters (like qf) are not passed over to the fromIndex. In fact, even if using var dereferencing makes dismax be selected as the fromQueryParser, the query that is passed to the JoinQuery object contains nothing to indicate that it should use dismax. 
The following code is from the method createParser in JoinQParserPlugin.java:

// With var dereferencing, this makes the fromQueryParser be dismax
QParser fromQueryParser = subQuery(v, "lucene");
// But after the call to getQuery, there is no indication that dismax should be used
Query fromQuery = fromQueryParser.getQuery();
JoinQuery jq = new JoinQuery(fromField, toField, fromIndex, fromQuery);

So I guess that as it is right now, dismax can't really be used with joins.

On Fri, Dec 9, 2011 at 3:20 PM, Chris Hostetter hossman_luc...@fucit.org wrote:

: Is there a specific reason why it is hard-coded to use the lucene
: QParser? I was looking at JoinQParserPlugin.java and here it is in
: createParser:
:
: QParser fromQueryParser = subQuery(v, "lucene");
:
: I could pass another param named fromQueryParser and use it instead of
: "lucene". But again, is there a reason why I should not do that?

It's definitely a bug, but we don't need a new local param: that hardcoded "lucene" should just be replaced with null, so that the defType local param will be checked (just like it can in the BoostQParser)...

qf=text name
q={!join from=manu_id_s to=id defType=dismax}ipod

Note: even with that hardcoded "lucene" bug, you can still override the default by using var dereferencing to point at another param with its own localparams specifying the type...

qf=text name
q={!join from=manu_id_s to=id v=$qq}
qq={!dismax}ipod

-Hoss

-- Pascal Dimassimo Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ -- Pascal Dimassimo Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/
Re: Solr Join with Dismax
: I have been doing more tracing in the code. And I think that I understand a
: bit more. The problem does not seem to be dismax+join, but
: dismax+join+fromIndex.

Correct. join+dismax works fine as I already demonstrated...

: Note: even with that hardcoded lucene bug, you can still override the
: default by using var dereferencing to point at another param with its own
: localparams specifying the type...
:
: qf=text name
: q={!join from=manu_id_s to=id v=$qq}
: qq={!dismax}ipod

...the problem you are referring to now has nothing to do with dismax, and is specifically a bug in how the query is parsed when fromIndex is used (which I thought I already mentioned in this thread, but I see you found it independently)...

https://issues.apache.org/jira/browse/SOLR-2824

Did you file a Jira about defaulting to lucene instead of null so we can make the defType local param syntax work? (I haven't seen it in my email, but it's really an unrelated problem so it should be tracked separately.)

-Hoss
Re: NumericRangeQuery: what am I doing wrong?
On Wed, Dec 14, 2011 at 2:04 PM, Erick Erickson erickerick...@gmail.com wrote:

Hmmm, seems like it should work, but there are two things you might try:

1) Just execute the query in Solr: id:[1 TO 100]. Does that work?

Yep, that works fine.

2) I'm really grasping at straws here, but it's *possible* that you need to use the same precisionStep as tint (8?)? There's a constructor that takes precisionStep as a parameter, but the default is 4 in the 3.x code.

Ah-ha, that was it. I did not notice the alternate constructor. The field was originally indexed with Solr's default int type, which has precisionStep=0 (i.e., don't index at different precision levels). The equivalent value for the NumericRangeQuery constructor is 32. This isn't exactly intuitive, but I was able to figure it out with a careful reading of the javadoc.

Thanks! --jay
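For reference, the relevant definitions in the stock 3.x example schema look roughly like the following (quoted from memory, so double-check your own schema.xml) — the precisionStep attribute is what has to line up with the precisionStep argument passed to the NumericRangeQuery constructor:

```xml
<!-- "int" indexes at full precision only; "tint" indexes extra trie terms for fast ranges -->
<fieldType name="int"  class="solr.TrieIntField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
<fieldType name="tint" class="solr.TrieIntField" precisionStep="8" omitNorms="true" positionIncrementGap="0"/>
```

As the thread concludes, a schema precisionStep of 0 on a 32-bit int corresponds to passing 32 (i.e., full precision, single term per value) when building the NumericRangeQuery by hand.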
queryResultCache hit count is not being increased when programmatically adding Lucene queries as filters in the SearchComponent
In my application I need to deal with a very large number of filter queries that I cannot pass as HTTP parameters - instead I add them as filters on the ResponseBuilder:

public void process(ResponseBuilder rb) {
  List<Query> filters = rb.getFilters();
  if (filters == null) {
    filters = new ArrayList<Query>();
    rb.setFilters(filters);
  }
  filters.add(userAccessQuery);
  filters.add(auctionEndConditionQuery);
}

In /admin/stats.jsp I have noticed that if the code above gets executed, then my queryResultCache hit count does not increase. The following is the debug output: { responseHeader:{ status:0, QTime:31, params:{ _jstate:-NWHpuWq8R7oPBQGnJsHifjVh6blqvEe6DwUBpybB4ldLWmZEGqsSyvRk_0LX6a-U3fqO6Wd4kc, indent:true, wt:json, version:2, debugQuery:true, fl:vid, q:text:(Solara) OR slrDescEm:(\Solara\) OR vinLast8:(\Solara\), fq:(eColorId:\4\)}}, response:{numFound:2,start:0,docs:[ { vid:18372703}, { vid:19071820}] }, debug:{ rawquerystring:text:(Solara) OR slrDescEm:(\Solara\) OR vinLast8:(\Solara\), querystring:text:(Solara) OR slrDescEm:(\Solara\) OR vinLast8:(\Solara\), parsedquery:text:solara slrDescEm:solara vinLast8:solara, parsedquery_toString:text:solara slrDescEm:solara vinLast8:solara, explain:{ 18372703:\n0.285029 = (MATCH) product of:\n 0.855087 = (MATCH) sum of:\n0.855087 = (MATCH) weight(text:solara in 4146), product of:\n 0.44694334 = queryWeight(text:solara), product of:\n7.6527553 = idf(docFreq=23, maxDocs=18598)\n0.058402933 = queryNorm\n 1.9131888 = (MATCH) fieldWeight(text:solara in 4146), product of:\n 1.0 = tf(termFreq(text:solara)=1)\n7.6527553 = idf(docFreq=23, maxDocs=18598)\n0.25 = fieldNorm(field=text, doc=4146)\n 0.3334 = coord(1/3)\n, 19071820:\n0.285029 = (MATCH) product of:\n 0.855087 = (MATCH) sum of:\n0.855087 = (MATCH) weight(text:solara in 13815), product of:\n 0.44694334 = queryWeight(text:solara), product of:\n7.6527553 = idf(docFreq=23, maxDocs=18598)\n0.058402933 = queryNorm\n 1.9131888 = (MATCH) fieldWeight(text:solara in 13815), product 
of:\n 1.0 = tf(termFreq(text:solara)=1)\n7.6527553 = idf(docFreq=23, maxDocs=18598)\n0.25 = fieldNorm(field=text, doc=13815)\n 0.3334 = coord(1/3)\n}, QParser:LuceneQParser, filter_queries:[(eColorId:\4\)], parsed_filter_queries:[eColorId:4, (+(((+((+(+cgcId:4 +iter:[1 TO 6])) /** A VERY, VERY LONG LIST OF CONDITIONS HERE **/ ) (+(+cgcId:840 +iter:[1 TO 12])) (+(+cgcId:841 +iter:[1 TO 10])) (+(+cgcId:843 +iter:[1 TO 12] +blInd:true), ET:[1323899277225 TO *]], timing:{ time:31.0, prepare:{ time:0.0, com.openlane.search.solr.filter.GenericFilter:{ time:0.0}, org.apache.solr.handler.component.QueryComponent:{ time:0.0}, org.apache.solr.handler.component.FacetComponent:{ time:0.0}, org.apache.solr.handler.component.MoreLikeThisComponent:{ time:0.0}, org.apache.solr.handler.component.HighlightComponent:{ time:0.0}, org.apache.solr.handler.component.StatsComponent:{ time:0.0}, org.apache.solr.handler.component.DebugComponent:{ time:0.0}}, process:{ time:31.0, com.openlane.search.solr.filter.GenericFilter:{ time:31.0}, org.apache.solr.handler.component.QueryComponent:{ time:0.0}, org.apache.solr.handler.component.FacetComponent:{ time:0.0}, org.apache.solr.handler.component.MoreLikeThisComponent:{ time:0.0}, org.apache.solr.handler.component.HighlightComponent:{ time:0.0}, org.apache.solr.handler.component.StatsComponent:{ time:0.0}, org.apache.solr.handler.component.DebugComponent:{ time:0.0} Notice the difference between filter_queries and parsed_filter_queries If I block filters.add(userAccessQuery) then my queryResultCache hit count is being increased as it should. The following is the response with debugQuery=true in this case. 
{ responseHeader:{ status:0, QTime:16, params:{ indent:true, wt:json, version:2, debugQuery:true, fl:vid, q:text:(Solara) OR slrDescEm:(\Solara\) OR vinLast8:(\Solara\), devMode:bypassUserAccess, fq:(eColorId:\4\)}}, response:{numFound:3,start:0,docs:[ { vid:18372703}, { vid:19071820}, { vid:17192691}] }, debug:{ rawquerystring:text:(Solara) OR slrDescEm:(\Solara\) OR vinLast8:(\Solara\), querystring:text:(Solara) OR slrDescEm:(\Solara\) OR vinLast8:(\Solara\), parsedquery:text:solara slrDescEm:solara vinLast8:solara, parsedquery_toString:text:solara
Re: queryResultCache hit count is not being increased when programmatically adding Lucene queries as filters in the SearchComponent
Solr version: 3.2.0 -- View this message in context: http://lucene.472066.n3.nabble.com/queryResultCache-hit-count-is-not-being-increased-when-programmatically-adding-Lucene-queries-as-filt-tp3586892p3586904.html Sent from the Solr - User mailing list archive at Nabble.com.
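One thing worth checking (an educated guess, not a confirmed diagnosis from this thread): the queryResultCache key includes the filter list, so a cache hit requires the programmatically-built filter Query objects to compare equal across requests. Lucene's built-in queries implement equals()/hashCode(); a custom Query that relies on identity will never match a previously cached key. The sketch below (plain Java with made-up stand-in classes, not Solr code) shows the difference:

```java
import java.util.HashMap;
import java.util.Map;

// Stand-ins for filter queries used as part of a cache key.
public class CacheKeyDemo {
    // WITHOUT equals/hashCode: identity semantics, like a custom Query that
    // never overrides Object.equals
    static class IdentityFilter {
        final String expr;
        IdentityFilter(String expr) { this.expr = expr; }
    }

    // WITH value-based equals/hashCode, like Lucene's built-in queries
    static class ValueFilter {
        final String expr;
        ValueFilter(String expr) { this.expr = expr; }
        @Override public boolean equals(Object o) {
            return o instanceof ValueFilter && ((ValueFilter) o).expr.equals(expr);
        }
        @Override public int hashCode() { return expr.hashCode(); }
    }

    public static boolean hitsWithIdentity() {
        Map<Object, String> cache = new HashMap<Object, String>();
        cache.put(new IdentityFilter("userAccess"), "cached result");
        // a new, logically identical filter built for the next request
        return cache.containsKey(new IdentityFilter("userAccess"));
    }

    public static boolean hitsWithValue() {
        Map<Object, String> cache = new HashMap<Object, String>();
        cache.put(new ValueFilter("userAccess"), "cached result");
        return cache.containsKey(new ValueFilter("userAccess"));
    }

    public static void main(String[] args) {
        System.out.println("identity-based filter hit: " + hitsWithIdentity()); // false
        System.out.println("value-based filter hit: " + hitsWithValue());       // true
    }
}
```

If userAccessQuery is rebuilt per request from a Query class without value-based equality, every lookup misses, which would match the observed behavior.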
Re: NumericRangeQuery: what am I doing wrong?
I'm a little lost in this thread ... if you are programmatically constructing a NumericRangeQuery object to execute in the JVM against a Solr index, that suggests you are writing some sort of Solr plugin (or embedding Solr in some way).

Why manually construct the query using options that may or may not be correct if/when someone changes the schema, when you could just ask the FieldType to construct the appropriate query for you?

FieldType ft = IndexSchema.getFieldType("your field name");
Query q = ft.getRangeQuery(...);

? -Hoss
XPath with ExtractingRequestHandler
I want to restrict the HTML that is returned by Tika to basically:

/xhtml:html/xhtml:body//xhtml:div[@class='bibliographicData']/descendant::node()

and it seems that the XPath class being used does not support the '//' syntax. Is there any way to configure Tika to use a different XPath evaluation class?
Re: How to get SolrServer within my own servlet
: So what I want to do is to modify Solr a bit - add one servlet so I can
: trigger a full index of a folder in the file system.
...
: I guess there are two SolrServer instances (one is EmbeddedSolrServer,
: created by myself, and the other comes with Solr itself) and they are
: holding different indexes?

i suspect you are correct, but frankly i'm amazed what you are doing is working at all (you should be getting a write lock error from having two distinct Solr instances trying to write to the same directory).

I think you need to back up and explain better what your overall goal is -- embedding Solr in other apps is: a) a fairly advanced usage that i would not suggest you pursue until you have a better grasp of solr fundamentals; b) not something people usually do if they also want to be able to use solr via HTTP.

in general, if your only goal in mucking with the solr.war is to be able to index files on the local filesystem (relative to where Solr is running) there are a lot of other ways to approach that goal (use DIH, or write a custom RequestHandler you load as a plugin, etc...)

https://people.apache.org/~hossman/#xyproblem
XY Problem

Your question appears to be an "XY Problem" ... that is: you are dealing with "X", you are assuming "Y" will help you, and you are asking about "Y" without giving more details about the "X" so that we can understand the full issue. Perhaps the best solution doesn't involve "Y" at all? See Also: http://www.perlmonks.org/index.pl?node_id=542341

-Hoss
Re: Arabic suppport,
: how can I add arabic support to the solr? https://wiki.apache.org/solr/LanguageAnalysis https://wiki.apache.org/solr/LanguageAnalysis#Arabic -Hoss
Re: Possible to adjust FieldNorm?
Sorry, I did not give the full output in the first post. From the looks of it, the fieldNorm is saying that 1 match out of 3 words in record 1 is more significant than 2 matches out of 8 words in record 2. That may be true as simple arithmetic, but it is unsatisfactory in human 'meaning'. Here is the full explanation; record 2 has some boosting as well.

Record 1:
1.5843434 = (MATCH) sum of:
1.5372416 = (MATCH) sum of:
1.2878088 = (MATCH) max plus 0.1 times others of:
1.2878088 = (MATCH) weight(crs_name_en:coach in 6430), product of:
0.20103075 = queryWeight(crs_name_en:coach), product of:
6.406029 = idf(docFreq=160, maxDocs=35862)
0.03138149 = queryNorm
6.406029 = (MATCH) fieldWeight(crs_name_en:coach in 6430), product of:
1.0 = tf(termFreq(crs_name_en:coach)=1)
6.406029 = idf(docFreq=160, maxDocs=35862)
1.0 = fieldNorm(field=crs_name_en, doc=6430)
0.2494328 = (MATCH) max plus 0.1 times others of:
0.2494328 = (MATCH) weight(crs_desc_en:leadership in 6430), product of:
0.15826634 = queryWeight(crs_desc_en:leadership), product of:
5.043302 = idf(docFreq=628, maxDocs=35862)
0.03138149 = queryNorm
1.5760319 = (MATCH) fieldWeight(crs_desc_en:leadership in 6430), product of:
1.0 = tf(termFreq(crs_desc_en:leadership)=1)
5.043302 = idf(docFreq=628, maxDocs=35862)
0.3125 = fieldNorm(field=crs_desc_en, doc=6430)
0.04710189 = (MATCH) product of:
0.09420378 = (MATCH) sum of:
0.09420378 = (MATCH) product of:
0.3768151 = (MATCH) sum of:
0.3768151 = (MATCH) weight(published_year:2008 in 6430), product of:
0.10874291 = queryWeight(published_year:2008), product of:
3.4651926 = idf(docFreq=3047, maxDocs=35862)
0.03138149 = queryNorm
3.4651926 = (MATCH) fieldWeight(published_year:2008 in 6430), product of:
1.0 = tf(termFreq(published_year:2008)=1)
3.4651926 = idf(docFreq=3047, maxDocs=35862)
1.0 = fieldNorm(field=published_year, doc=6430)
0.25 = coord(1/4)
0.5 = coord(1/2)
0.0 = (MATCH) FunctionQuery(int(crs_stars)), product of:
0.0 = int(crs_stars)=0
2.5 = boost
0.03138149 = queryNorm 
Record 2:
1.5590522 = (MATCH) sum of:
1.0096307 = (MATCH) sum of:
0.6206793 = (MATCH) max plus 0.1 times others of:
0.56341636 = (MATCH) weight(crs_name_en:coach in 4744), product of:
0.20103075 = queryWeight(crs_name_en:coach), product of:
6.406029 = idf(docFreq=160, maxDocs=35862)
0.03138149 = queryNorm
2.8026378 = (MATCH) fieldWeight(crs_name_en:coach in 4744), product of:
1.0 = tf(termFreq(crs_name_en:coach)=1)
6.406029 = idf(docFreq=160, maxDocs=35862)
0.4375 = fieldNorm(field=crs_name_en, doc=4744)
0.11664742 = (MATCH) weight(meta_en:coach in 4744), product of:
0.11443973 = queryWeight(meta_en:coach), product of:
3.646727 = idf(docFreq=2541, maxDocs=35862)
0.03138149 = queryNorm
1.0192913 = (MATCH) fieldWeight(meta_en:coach in 4744), product of:
2.236068 = tf(termFreq(meta_en:coach)=5)
3.646727 = idf(docFreq=2541, maxDocs=35862)
0.125 = fieldNorm(field=meta_en, doc=4744)
0.4559821 = (MATCH) weight(crs_desc_en:coach in 4744), product of:
0.19534174 = queryWeight(crs_desc_en:coach), product of:
6.2247434 = idf(docFreq=192, maxDocs=35862)
0.03138149 = queryNorm
2.3342788 = (MATCH) fieldWeight(crs_desc_en:coach in 4744), product of:
2.0 = tf(termFreq(crs_desc_en:coach)=4)
6.2247434 = idf(docFreq=192, maxDocs=35862)
0.1875 = fieldNorm(field=crs_desc_en, doc=4744)
0.3889513 = (MATCH) max plus 0.1 times others of:
0.36372444 = (MATCH) weight(crs_name_en:leadership in 4744), product of:
0.16152287 = queryWeight(crs_name_en:leadership), product of:
5.147074 = idf(docFreq=566, maxDocs=35862)
0.03138149 = queryNorm
2.251845 = (MATCH) fieldWeight(crs_name_en:leadership in 4744), product of:
1.0 = tf(termFreq(crs_name_en:leadership)=1)
5.147074 = idf(docFreq=566, maxDocs=35862)
0.4375 = fieldNorm(field=crs_name_en, doc=4744)
0.04061773 = (MATCH) weight(meta_en:leadership in 4744), product of:
0.076728955 = queryWeight(meta_en:leadership), product of:
2.4450386 = idf(docFreq=8453, maxDocs=35862)
0.03138149 = queryNorm
0.5293664 = (MATCH) fieldWeight(meta_en:leadership in 4744), product of:
1.7320508 = tf(termFreq(meta_en:leadership)=3)
2.4450386 = idf(docFreq=8453, maxDocs=35862)
0.125 = fieldNorm(field=meta_en, doc=4744)
0.21165074 = (MATCH) weight(crs_desc_en:leadership in 4744), product of:
0.15826634 = queryWeight(crs_desc_en:leadership), product of:
5.043302 =
Re: How to get SolrServer within my own servlet
Hi Chris,

I think there won't be a deadlock, because there is only one place (my own servlet) that can trigger an index. Yes, I am trying to embed the Solr application - I could separate my servlet into another app and talk to Solr via HTTP, but then there would be two pieces of software (Solr and my own app) I have to maintain - which is something I don't like.

-- View this message in context: http://lucene.472066.n3.nabble.com/How-to-get-SolrServer-within-my-own-servlet-tp3583304p3587157.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Migrate Lucene 2.9 To SOLR
: I have an old project that uses Lucene 2.9. Is it possible to use the index
: created by Lucene in Solr? May I just copy the index to the data directory of
: Solr, or does some mechanism exist to import a Lucene index?

you can use an index created directly with lucene libraries in Solr, but in order for Solr to understand that index and do anything meaningful with it, you have to configure solr with a schema.xml file that makes sense given the custom code used to build that index (ie: what fields did you store, what fields did you index, what analyzers did you use, what fields did you index with term vectors, etc...)

-Hoss
Re: Too many connections in CLOSE_WAIT state on master solr server
Thanks Erick and Mikhail. I'll try this out.

On Wed, Dec 14, 2011 at 7:11 PM, Erick Erickson erickerick...@gmail.com wrote:

I'm guessing (and it's just a guess) that what's happening is that the container is queueing up your requests while waiting for the other connections to close, so Mikhail's suggestion seems like a good idea.

Best
Erick

On Wed, Dec 14, 2011 at 12:28 AM, samarth s samarth.s.seksa...@gmail.com wrote:

The updates to the master are user-driven and need to be visible quickly; hence the high frequency of replication. It may be that too many replication requests are being handled at a time, but why should that result in half-closed connections?

On Wed, Dec 14, 2011 at 2:47 AM, Erick Erickson erickerick...@gmail.com wrote:

Replicating 40 cores every 20 seconds is just *asking* for trouble. How often do your cores change on the master? How big are they? Is there any chance you just have too many cores replicating at once?

Best
Erick

On Tue, Dec 13, 2011 at 3:52 PM, Mikhail Khludnev mkhlud...@griddynamics.com wrote:

You can try to reuse your connections (prevent them from closing) by specifying -Dhttp.maxConnections=N (see http://download.oracle.com/javase/1.4.2/docs/guide/net/properties.html) in the JVM startup params. At the client JVM! The number should be chosen considering the number of connections you'd like to keep alive. Let me know if it works for you.

On Tue, Dec 13, 2011 at 2:57 PM, samarth s samarth.s.seksa...@gmail.com wrote:

Hi, I am using solr replication and am experiencing a lot of connections in the state CLOSE_WAIT at the master solr server. These disappear after a while, but till then the master solr stops responding. There are about 130 open connections on the master server with the client as the slave m/c, and all are in the state CLOSE_WAIT. Also, the client port specified in the netstat results on the master solr server is not visible in the netstat results on the client (slave solr) m/c. 
Following is my environment:
- 40 cores in the master solr on m/c 1
- 40 cores in the slave solr on m/c 2
- The replication poll interval is 20 seconds.
- Replication part in solrconfig.xml in the slave solr:

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <!-- fully qualified url for the replication handler of master -->
    <str name="masterUrl">$mastercorename/replication</str>
    <!-- Interval in which the slave should poll master. Format is HH:mm:ss.
         If this is absent the slave does not poll automatically. But a
         fetchindex can be triggered from the admin or the http API -->
    <str name="pollInterval">00:00:20</str>
    <!-- The following values are used when the slave connects to the master
         to download the index files. Default values implicitly set as 5000ms
         and 10000ms respectively. The user DOES NOT need to specify these
         unless the bandwidth is extremely low or if there is an extremely
         high latency -->
    <str name="httpConnTimeout">5000</str>
    <str name="httpReadTimeout">1</str>
  </lst>
</requestHandler>

Thanks for any pointers.
-- Regards, Samarth

-- Sincerely yours, Mikhail Khludnev, Developer, Grid Dynamics, tel. 1-415-738-8644, Skype: mkhludnev, http://www.griddynamics.com, mkhlud...@griddynamics.com

-- Regards, Samarth

-- Regards, Samarth
Re: Delta Replication in SOLR
On Dec 14, 2011, at 9:58 PM, mechravi25 wrote: We would like know whether it is possible to replicate only a certain documents from master to slave. More like a Delta Replication process. No, it is not. wunder -- Walter Underwood wun...@wunderwood.org
Re: Solr Search Across Multiple Cores not working when quering on specific field
but when i searched on a specific field then it is not working: http://localhost:8983/solr/core0/select?shards=localhost:8983/solr/core0,localhost:8983/solr/core1; q=mnemonic_value:United Why is distributed search not working when I search on a particular field?

Since you have a multiple-shard infrastructure, do the cores share the same configuration (schema.xml/solrconfig.xml etc.)? What error/output are you getting for the sharded query?

Regards
Pravesh

-- View this message in context: http://lucene.472066.n3.nabble.com/Solr-Search-Across-Multiple-Cores-not-working-when-quering-on-specific-field-tp3585013p3587890.html Sent from the Solr - User mailing list archive at Nabble.com.
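One thing worth trying (a guess — the thread does not confirm the cause): the query value contains an unencoded space, which can truncate the q parameter before it reaches Solr. Quoting and URL-encoding the value gives a request along these lines (host, ports, and field name taken from the message):

```text
http://localhost:8983/solr/core0/select?shards=localhost:8983/solr/core0,localhost:8983/solr/core1&q=mnemonic_value:%22United%22
```

If the field query works against a single core but not with the shards parameter, comparing the parsed query in debugQuery output on both requests should show whether the parameter is arriving intact.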