external jar as processor
Hi everybody, I have the following entities. I added the jar file into the WEB-INF/lib folder, but I don't know how to specify the field names in schema.xml. Can anybody help?

<entity processor="com.xxx.solr.handler.dataimport.FeedbackProcessor"
        url="http://test.xxx.com"
        appKey="qto9gjtI68pi7JRxVZ8Z"
        lastUpdate="${dataimporter.last_index_time}" />

<entity processor="com.xxx.solr.handler.dataimport.AnswersProcessor"
        url="http://abcs.xxx.com"
        pageSize="500"
        lastUpdate="${dataimporter.last_index_time}" />

--
View this message in context: http://lucene.472066.n3.nabble.com/external-jar-as-processor-tp3721915p3721915.html
Sent from the Solr - User mailing list archive at Nabble.com.
more sql-like commands for solr
Hi all, we have used Solr to provide search in many products. For each product we have to write some configuration and query expressions, and our users are not used to this. They are familiar with SQL, and they might describe their needs like this: "I want a query that can search books whose title contains java, group these books by publishing year, and order by matching score and freshness, where the weight of score is 2 and the weight of freshness is 1." Maybe they would be happy if they could use SQL-like statements to convey their needs:

select * from books where title contains java group by pub_year order by score^2, freshness^1

They may also like to be able to insert or delete documents, e.g.:

delete from books where title contains java and pub_year between 2011 and 2012

We could define a language similar to SQL and translate it to a Solr query string such as .../select/?q=+title:java^2 +pub_year:2011. This would be roughly what Apache Hive is for Hadoop.
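A translator like the one proposed above could start very small. Below is a minimal sketch in Python; the supported statement shape (a single "contains" clause) and the select-URL layout are illustrative assumptions, not an existing tool:

```python
import re

def sql_like_to_solr(stmt):
    # Translate one toy statement shape into a Solr select URL. The
    # grammar below (a single "contains" clause) and the URL layout are
    # illustrative assumptions, not an existing tool.
    m = re.match(r"select \* from (\w+) where (\w+) contains (\w+)",
                 stmt, re.IGNORECASE)
    if not m:
        raise ValueError("unsupported statement")
    core, field, word = m.groups()
    return f"/solr/{core}/select?q={field}:{word}"

print(sql_like_to_solr("select * from books where title contains java"))
# /solr/books/select?q=title:java
```

A real implementation would need a proper grammar for group-by, order-by weights, and boolean clauses, but the overall shape (parse, then emit Solr query syntax) stays the same.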
Re: Parallel indexing in Solr
On Mon, Feb 6, 2012 at 5:55 PM, Per Steffensen st...@designware.dk wrote:

Sami Siren skrev:
On Mon, Feb 6, 2012 at 2:53 PM, Per Steffensen st...@designware.dk wrote:
Actually right now I am trying to find out what my bottleneck is. The setup is more complex than I would bother you with, but basically I have servers with 80-90% IO-wait and only 5-10% real CPU usage. It might not be a Solr-related problem; I am investigating different things, but I just wanted to know a little more about how Jetty/Solr works in order to make a qualified guess.

What kind of / how many discs do you have for your shards? ...also, what kind of server are you experimenting with?

Grrr, that's where I have a little fight with operations. For now they gave me one (fairly big) machine with XenServer. I create my machines as Xen VMs on top of that. One of the things I don't like about this (besides that I don't trust Xen to do its virtualization right, or at least not provide me with correct readings on IO) is that disk space is assigned from an iSCSI-connected SAN that they all share (including the line out there). But for now it actually doesn't look like disk IO problems. It looks like network bottlenecks (but to some extent they also all share the network) among all the components in our setup - our client plus the Lily stack (HDFS, HBase, ZK, Lily Server, Solr etc.). Well, it is complex, but anyways...

You could try to isolate the bottleneck by testing the indexing speed from the local machine hosting Solr. Also, tools like iostat or sar might give you more details about the disk side.

--
Sami Siren
Re: Which Tokeniser (and/or filter)
I'm still finding matches across newlines.

index... i am fluent german racing
search... fluent german

Any suggestions? I've currently got this in wdftypes.txt for WordDelimiterFilterFactory:

\u000A => ALPHANUM
\u000B => ALPHANUM
\u000C => ALPHANUM
\u000D => ALPHANUM
# \u000D\u000A => ALPHA
\u0085 => ALPHANUM
\u2028 => ALPHANUM
\u2029 => ALPHANUM
\u2424 => ALPHANUM

---
IntelCompute
Web Design Local Online Marketing
http://www.intelcompute.com

On Mon, 6 Feb 2012 04:10:18 -0800 (PST), Ahmet Arslan iori...@yahoo.com wrote:
My fear is what will then happen with highlighting if I use re-mapping?
What do you mean by re-mapping?
Re: Which Tokeniser (and/or filter)
I'm still finding matches across newlines.

index... i am fluent german racing
search... fluent german

Any suggestions?

You can use a multiValued field for this. Split your document according to new line at the client side:

<arr>i am fluent</arr>
<arr>german racing</arr>

positionIncrementGap=100 will prevent the query "fluent german" from matching. Or, maybe you can inject artificial tokens via http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.PatternReplaceCharFilterFactory Your document becomes: i am fluent NEWLINE german racing
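The artificial-token idea in the reply above can be sketched outside Solr. This Python snippet mimics what a char filter replacing line breaks could produce before tokenization; the token name NEWLINE is the one suggested in the post, and the regex is an assumption about what such a filter would match:

```python
import re

def inject_newline_tokens(text):
    # Replace every line break with an artificial token before indexing,
    # in the spirit of the PatternReplaceCharFilterFactory suggestion
    # above. The token name NEWLINE comes from the post.
    return re.sub(r"[\r\n]+", " NEWLINE ", text)

doc = "i am fluent\ngerman racing"
print(inject_newline_tokens(doc))
# i am fluent NEWLINE german racing
```

With NEWLINE sitting between "fluent" and "german", a phrase query for "fluent german" no longer sees the two words as adjacent.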
Typical Cache Values
Based on the hit ratio of my caches, they seem to be pretty low. Here they are. What are typical values in your production setups? What are some of the things that can be done to improve the ratios?

queryResultCache
  lookups: 3234602  hits: 496  hitratio: 0.00  inserts: 3234239  evictions: 3230143  size: 4096  warmupTime: 8886
  cumulative_lookups: 3465734  cumulative_hits: 526  cumulative_hitratio: 0.00  cumulative_inserts: 3465208  cumulative_evictions: 3457151

documentCache
  lookups: 17647360  hits: 11935609  hitratio: 0.67  inserts: 5711851  evictions: 5707755  size: 4096  warmupTime: 0
  cumulative_lookups: 19009142  cumulative_hits: 12813630  cumulative_hitratio: 0.67  cumulative_inserts: 6195512  cumulative_evictions: 6187460

fieldValueCache
  lookups: 0  hits: 0  hitratio: 0.00  inserts: 0  evictions: 0  size: 0  warmupTime: 0
  cumulative_lookups: 0  cumulative_hits: 0  cumulative_hitratio: 0.00  cumulative_inserts: 0  cumulative_evictions: 0

filterCache
  lookups: 30059278  hits: 28813869  hitratio: 0.95  inserts: 1245744  evictions: 1245232  size: 512  warmupTime: 28005
  cumulative_lookups: 32155745  cumulative_hits: 30845811  cumulative_hitratio: 0.95  cumulative_inserts: 1309934  cumulative_evictions: 1309245

*Pranav Prakash*
temet nosce
Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com | Google http://www.google.com/profiles/pranny
Re: Parallel indexing in Solr
You could try to isolate the bottleneck by testing the indexing speed from the local machine hosting Solr. Also, tools like iostat or sar might give you more details about the disk side.

Yes, I am doing different things to isolate the bottleneck. I'm also profiling the JVM, and I am using iostat, top and sar already. Thanks. This question was originally just to get an early indication of whether or not Jetty was at all designed for parallel production-like processing. Now I believe it is, until I prove that it does not live up to my requirements. Thanks!

--
Sami Siren
Re: Symbols in synonyms
You're probably looking at a custom tokenizer and/or filter chain here, or at least creatively combining the ones that exist. The admin/analysis page will be your friend. Even if you define these as synonyms, the rest of the analysis chain may break them up, so you really have to look at the effects of the entire analysis chain. I'd start with a really simple one (not the stock ones) and build up. Especially beware of WordDelimiterFilterFactory, for instance.

Best
Erick

On Mon, Feb 6, 2012 at 4:39 AM, Robert Brown r...@intelcompute.com wrote:
Is it good practice, common, or even possible to put symbols in my list of synonyms? I'm having trouble indexing and searching for AE, with it being split on the "." - we already convert .net to dotnet, but don't want to store every combination of 2 letters, AE, ME, etc.

--
IntelCompute
Web Design Local Online Marketing
http://www.intelcompute.com
Re: Phonetic search and matching
What happens if you do NOT inject? Setting inject=false stores only the phonetic reduction, not the original text. In that case your false match on 13 would go away. Not sure what that means for the rest of your app, though.

Best
Erick

On Mon, Feb 6, 2012 at 5:44 AM, Dirk Högemann dirk.hoegem...@googlemail.com wrote:
Hi, I have a question on phonetic search and matching in Solr. In our application all the content of an article is written to a full-text search field, which provides stemming and a phonetic filter (Cologne phonetic for German). This is the relevant part of the configuration for the index analyzer (search is analogous):

<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="German2"/>
<filter class="solr.PhoneticFilterFactory" encoder="ColognePhonetic" inject="true"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>

Unfortunately this sometimes results in strange, but also explainable, matches. For example: the content field indexes the following string: "Donnerstag von 13 bis 17 Uhr." This results in a match if we search for "puf", as the result of the phonetic filter for this is 13. (As a consequence, the 13 is then also highlighted.) Does anyone have an idea how to handle this in a reasonable way, so that a search for "puf" does not match 13 in the content? Thanks in advance! Dirk
Re: Improving performance for SOLR geo queries?
So the obvious question is: what is your performance like without the distance filters? Without that knowledge, we have no clue whether the modifications you've made had any hope of speeding up your response times. As for the docs, any improvements you'd like to contribute would be happily received.

Best
Erick

2012/2/6 Matthias Käppler matth...@qype.com:
Hi, we need to perform fast geo lookups on an index of ~13M places, and we're running into performance problems here with SOLR. We haven't done a lot of query optimization / SOLR tuning up until now, so there are probably a lot of things we're missing. I was wondering if you could give me some feedback on the way we do things: whether they make sense, and especially why a supposed optimization we implemented recently seems to have no effect, when we actually thought it would help a lot.

What we do is this: our API is built on a Rails stack and talks to SOLR via a Ruby wrapper. We have a few filters that almost always apply, which we put in filter queries. The filter cache hit rate is excellent, about 97%, and cache size caps at 10k filters (max size is 32k, but it never seems to reach that many, probably because we replicate / delta update every few minutes). Still, geo queries are slow, about 250-500 msec on average. We send them with cache=false, so as to not flood the fq cache and cause undesirable evictions.

Now our idea was this: while the actual geo queries are poorly cacheable, we could clearly identify geographical regions which are queried more often than others (naturally, since we're a user-driven service). Therefore, we partition Earth into a static grid of overlapping boxes, where the grid size (the distance of the nodes) depends on the maximum allowed search radius. That way, for every user query, we would always be able to identify a single bounding box that covers it.
This larger bounding box (200 km edge length) we would send to SOLR as a cached filter query, along with the actual user query, which would still be sent uncached. Example: a user asks for places within 10 km around 49.14839,8.5691; then what we will send to SOLR is something like this:

fq={!bbox cache=false d=10 sfield=location_ll pt=49.14839,8.5691}
fq={!bbox cache=true d=100.0 sfield=location_ll pt=49.4684836290799,8.31165802979391}  -- this one we derive automatically

That way SOLR would intersect the two filters and return the same results as when only looking at the smaller bounding box, but keep the larger box in cache and speed up subsequent geo queries in the same regions. Or so we thought; unfortunately this approach did not help query execution times get better, at all.

Question is: why does it not help? Shouldn't it be faster to search on a cached bbox with only a few hundred thousand places? Is it a good idea to make these kinds of optimizations in the app layer (we do this as part of resolving the SOLR query in Ruby), and does it make sense at all? We're not sure what kind of optimizations SOLR already does in its query planner. The documentation is (sorry) miserable, and debugQuery yields no insight into which optimizations are performed. So this has been a hit-and-miss game for us, which is very ineffective considering that it takes considerable time to build these kinds of optimizations in the app layer. Would be glad to hear your opinions / experience around this. Thanks!

--
Matthias Käppler
Lead Developer API Mobile
Qype GmbH
Großer Burstah 50-52
20457 Hamburg
Telephone: +49 (0)40 - 219 019 2 - 160
Skype: m_kaeppler
Email: matth...@qype.com
Managing Director: Ian Brotherston
Amtsgericht Hamburg HRB 95913
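The grid partitioning described in this thread can be sketched in a few lines. This Python snippet snaps a query point to its grid-cell center so nearby requests derive the same large cached bbox filter; the cell size (0.9 degrees, roughly 100 km at this latitude) is an assumed value, not taken from the post:

```python
def grid_center(lat, lon, cell_deg=0.9):
    # Snap a query point to the center of its grid cell, so that every
    # nearby request reuses the same large cached bounding-box filter.
    # The cell size (0.9 degrees, roughly 100 km) is an assumed value,
    # not taken from the post.
    cell_lat = lat // cell_deg
    cell_lon = lon // cell_deg
    return ((cell_lat + 0.5) * cell_deg, (cell_lon + 0.5) * cell_deg)

lat, lon = 49.14839, 8.5691
clat, clon = grid_center(lat, lon)
fq_user = f"{{!bbox cache=false d=10 sfield=location_ll pt={lat},{lon}}}"
fq_grid = f"{{!bbox cache=true d=100.0 sfield=location_ll pt={clat:.4f},{clon:.4f}}}"
print(fq_grid)
```

Since any two points in the same cell produce an identical fq string, the large filter is a cache hit for all of them; whether intersecting it actually speeds up the small bbox query is exactly the open question in the thread.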
Re: Realtime profile data
You have several options:

1) If you can go to trunk (bleeding edge, I admit), you can get into the near real time (NRT) stuff.
2) You could maintain essentially a post-filter step, where your app maintains a list of deleted messages and removes them from the response. This will cause some of your counts (e.g. facets, grouping) to be slightly off.
3) Train your users to expect whatever latency you've built into the system (i.e. indexing, commit and replication).

Best
Erick

On Mon, Feb 6, 2012 at 10:42 AM, Pawel Rog pawelro...@gmail.com wrote:
Hello. I have a problem which I'd like to solve using Solr. I have a user profile which has some kind of messages in it. The user can filter messages, sort them, etc. The problem is with the delete operation. If the user clicks on a message to delete it, it's very hard to update the Solr index in real time, so when the user deletes a message it will still be visible. Do you have an idea how to solve the problem with removing data?
Re: Commit call - ReadTimeoutException - usage scenario for big update requests and the ioexception case
Right, I suspect you're hitting merges. How often are you committing? In other words, why are you committing explicitly? It's often better to use commitWithin on the add command and just let Solr do its work without explicitly committing. Going forward, this is fixed in trunk by the DocumentWriterPerThread improvements.

Best
Erick

On Mon, Feb 6, 2012 at 11:09 AM, Torsten Krah tk...@fachschaft.imn.htwk-leipzig.de wrote:
Hi, I wonder if it is possible to commit data to Solr without having to catch SocketTimeoutExceptions. I am calling commit(false, false) using a streaming server instance, but I still have to wait 30 seconds and catch the timeout from the HTTP method. It does not matter if it's 30 or 60; it will fail depending on how long it takes until the update request is processed. Or can I tweak things here? So what's the way to go here? Is there any other option, or must I catch those exceptions and go on like I do now? The operation itself does finish successfully - later on, when it's done - on the server side, and all the data is committed and searchable.

Regards
Torsten
Re: Typical Cache Values
See below...

On Tue, Feb 7, 2012 at 8:21 AM, Pranav Prakash pra...@gmail.com wrote:
Based on the hit ratio of my caches, they seem to be pretty low. Here they are. What are typical values in your production setups? What are some of the things that can be done to improve the ratios?

queryResultCache
  lookups: 3234602  hits: 496  hitratio: 0.00  inserts: 3234239  evictions: 3230143  size: 4096  warmupTime: 8886
  cumulative_lookups: 3465734  cumulative_hits: 526  cumulative_hitratio: 0.00  cumulative_inserts: 3465208  cumulative_evictions: 3457151

This is not unusual, but there's also not much reason to give this cache much memory in your case. This is the cache that is hit when a user pages through a result set. Your numbers would seem to indicate one of two things:
1) your window is smaller than 2 pages (see queryResultWindowSize in solrconfig.xml), or
2) your users are rarely going to the next page.
This cache isn't doing you much good, but then it's also not using that much in the way of resources.

documentCache
  lookups: 17647360  hits: 11935609  hitratio: 0.67  inserts: 5711851  evictions: 5707755  size: 4096  warmupTime: 0
  cumulative_lookups: 19009142  cumulative_hits: 12813630  cumulative_hitratio: 0.67  cumulative_inserts: 6195512  cumulative_evictions: 6187460

Again, this is actually quite reasonable. This cache is used to hold document data, and often doesn't have a great hit ratio. It is necessary though; it saves quite a few disk seeks when servicing a single query.

fieldValueCache
  lookups: 0  hits: 0  hitratio: 0.00  inserts: 0  evictions: 0  size: 0  warmupTime: 0
  cumulative_lookups: 0  cumulative_hits: 0  cumulative_hitratio: 0.00  cumulative_inserts: 0  cumulative_evictions: 0

Not doing much in the way of faceting, are you?
filterCache
  lookups: 30059278  hits: 28813869  hitratio: 0.95  inserts: 1245744  evictions: 1245232  size: 512  warmupTime: 28005
  cumulative_lookups: 32155745  cumulative_hits: 30845811  cumulative_hitratio: 0.95  cumulative_inserts: 1309934  cumulative_evictions: 1309245

Not a bad hit ratio here; this is where fq filters are stored. One caution: it is better to break out your filter queries into small chunks where possible. Rather than write fq=field1:val1 AND field2:val2, it's better to write fq=field1:val1&fq=field2:val2. Think of this cache as a map with the fq query as the key. If you write the fq the first way above, subsequent fqs for either half won't use the cache.

Best
Erick

*Pranav Prakash*
temet nosce
Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com | Google http://www.google.com/profiles/pranny
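The map-keyed-by-fq-string point above can be made concrete with a toy model. This Python sketch simulates the filterCache as a plain dict; the field names and values are made up:

```python
# A toy model of the filterCache: a map keyed by the literal fq string,
# illustrating the point above about splitting filter queries.
filter_cache = {}

def apply_fq(fq_params):
    hits = misses = 0
    for fq in fq_params:
        if fq in filter_cache:
            hits += 1
        else:
            filter_cache[fq] = object()   # stand-in for the cached DocSet
            misses += 1
    return (hits, misses)

# Combined fq: the second request reuses nothing from the first.
apply_fq(["field1:val1 AND field2:val2"])
print(apply_fq(["field1:val1 AND field2:val3"]))   # (0, 1)

filter_cache.clear()

# Split fq: the field1 half is a cache hit on the second request.
apply_fq(["field1:val1", "field2:val2"])
print(apply_fq(["field1:val1", "field2:val3"]))    # (1, 1)
```

The combined form misses the cache whenever any clause differs, while the split form reuses each clause independently, which is why the split style tends to raise the filterCache hit ratio.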
Display of highlighted search result should start with the beginning of the sentence that contains the search string.
Hi, we are using Solr 4.0 along with the FastVectorHighlighter (FVH), and there is an issue we are facing while highlighting. For our requirement we want the highlighted search result to start at the beginning of the sentence that contains the search string, and we need help to get this done. As of now this is not happening, and the highlighted term comes up first in the snippet in most scenarios. I have tried using the boundaryScanner parameter but am still not getting the desired result. Below is the configuration we are using:

<boundaryScanner name="simple" class="solr.highlight.SimpleBoundaryScanner" default="true">
  <lst name="defaults">
    <str name="hl.bs.maxScan">10</str>
    <str name="hl.bs.chars">.,!?&#9;&#10;&#13;</str>
  </lst>
</boundaryScanner>

I need help getting the display of the highlighted search result to start at the beginning of the sentence that contains the search string.

-Shyam
Re: Display of highlighted search result should start with the beginning of the sentence that contains the search string.
(12/02/08 0:50), Shyam Bhaskaran wrote:
Hi, we are using Solr 4.0 along with FVH, and there is an issue we are facing while highlighting. For our requirement we want the highlighted search result to start at the beginning of the sentence, and we need help to get this done. As of now this is not happening, and the highlighted term comes up first in the snippet in most scenarios. I have tried using the boundaryScanner parameter but am still not getting the desired result. Below is the configuration we are using:

<boundaryScanner name="simple" class="solr.highlight.SimpleBoundaryScanner" default="true">
  <lst name="defaults">
    <str name="hl.bs.maxScan">10</str>
    <str name="hl.bs.chars">.,!?&#9;&#10;&#13;</str>
  </lst>
</boundaryScanner>

I need help getting the display of the highlighted search result to start at the beginning of the sentence that contains the search string.

Please provide more detailed info, e.g. the field data that you indexed and the undesirable snippet you currently got. And have you tried BreakIteratorBoundaryScanner with hl.bs.type=SENTENCE?

koji
--
http://www.rondhuit.com/en/
RE: Display of highlighted search result should start with the beginning of the sentence that contains the search string.
Hi Koji, I have tried using hl.bs.type=SENTENCE and still see no improvement. We are storing PDF-extracted content in the field, which has termVectors enabled. For example, the field contains the following data extracted from a PDF:

User-defined resolution functions. The synthesis tool only supports the resolution functions for std_logic and std_logic_vector. Slices with range indices that do not evaluate to constants

When I search for the term std_logic, the following highlighted snippet is displayed:

functions for <em>std_logic</em> and std_logic_vector. * Slices with range indices that do not evaluate to constants

As you can see, the highlighted term does not start from the beginning of the sentence. Why is this, and how can I achieve this?

-Shyam
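The behaviour Shyam asks for can be sketched as client-side post-processing, independent of how FVH builds its fragments. This Python snippet locates the first hit and walks back to the previous sentence-ending punctuation to decide where the snippet starts; it is an illustration of the desired output, not Solr's internals:

```python
def snippet_from_sentence_start(text, term, max_len=120):
    # A client-side sketch of the requested behaviour (not how FVH
    # works internally): locate the first hit, walk back to the previous
    # sentence-ending punctuation, and cut the snippet from there.
    pos = text.find(term)
    if pos == -1:
        return ""
    start = max(text.rfind(c, 0, pos) for c in ".!?") + 1
    return text[start:start + max_len].strip()

text = ("User-defined resolution functions. The synthesis tool only supports "
        "the resolution functions for std_logic and std_logic_vector.")
print(snippet_from_sentence_start(text, "std_logic"))
```

For the sample field data this yields a snippet beginning at "The synthesis tool...", i.e. the start of the sentence containing the hit, which is what the boundary scanner is being asked to do.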
Re: Realtime profile data
Thank you. I'll try NRT and some post-filter :)

On Tue, Feb 7, 2012 at 3:09 PM, Erick Erickson erickerick...@gmail.com wrote:
You have several options:
1) If you can go to trunk (bleeding edge, I admit), you can get into the near real time (NRT) stuff.
2) You could maintain essentially a post-filter step, where your app maintains a list of deleted messages and removes them from the response. This will cause some of your counts (e.g. facets, grouping) to be slightly off.
3) Train your users to expect whatever latency you've built into the system (i.e. indexing, commit and replication).

Best
Erick
Re: Which Tokeniser (and/or filter)
This all seems a bit too much work for such a real-world scenario?

---
IntelCompute
Web Design Local Online Marketing
http://www.intelcompute.com

On Tue, 7 Feb 2012 05:11:01 -0800 (PST), Ahmet Arslan iori...@yahoo.com wrote:
I'm still finding matches across newlines.
index... i am fluent german racing
search... fluent german
Any suggestions?

You can use a multiValued field for this. Split your document according to new line at the client side:

<arr>i am fluent</arr>
<arr>german racing</arr>

positionIncrementGap=100 will prevent the query "fluent german" from matching. Or, maybe you can inject artificial tokens via http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.PatternReplaceCharFilterFactory Your document becomes: i am fluent NEWLINE german racing
Re: solrcore.properties
Walter Underwood wrote:
Looking at SOLR-1335 and the wiki, I'm not quite sure of the final behavior for this. These properties are per-core, and not visible in other cores, right?

Yes, that's right.

Walter Underwood wrote:
Are variables substituted in solr.xml, so I can swap in different properties files for dev, test, and prod? Like this:

<core name="mary" properties="conf/solrcore-${env:dev}.properties"/>

If that does not work, what are the best practices for managing dev/test/prod configs for Solr?

As you can see here http://wiki.apache.org/solr/CoreAdmin I am not sure you can set a property file to be loaded per core with this variable syntax. Can someone confirm? What we have here is a Maven project, with some variable properties in the .properties or .xml Solr configuration files. Then, when building the project, we use Maven profiles to generate the dev/prod... distributions. Hope it can help you,

Jul

--
View this message in context: http://lucene.472066.n3.nabble.com/solrcore-properties-tp3720446p3723212.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Commit call - ReadTimeoutException - usage scenario for big update requests and the ioexception case
Am 07.02.2012 15:12, schrieb Erick Erickson:
Right, I suspect you're hitting merges.

Guess so.

How often are you committing?

One time, after all the work is done.

In other words, why are you committing explicitly? It's often better to use commitWithin on the add command and just let Solr do its work without explicitly committing.

Tika extracts my docs and I fetch the results (memory, disk) externally. If all went OK as expected, I take those docs and add them to my Solr server instance. After I am done with adds + deletes I do a commit - one commit for all those docs, adding and deleting. If something went wrong before or between adding, updating or deleting docs, I call rollback and everything is like before (I am doing the update from one source only, so I can be sure that no one can call commit in between). commitWithin would break my ability to roll back things; that's why I want to call commit explicitly here.

Going forward, this is fixed in trunk by the DocumentWriterPerThread improvements.

Will this be backported to the upcoming 3.6?

Best
Erick

On Mon, Feb 6, 2012 at 11:09 AM, Torsten Krah tk...@fachschaft.imn.htwk-leipzig.de wrote:
Hi, I wonder if it is possible to commit data to Solr without having to catch SocketTimeoutExceptions. I am calling commit(false, false) using a streaming server instance, but I still have to wait 30 seconds and catch the timeout from the HTTP method. It does not matter if it's 30 or 60; it will fail depending on how long it takes until the update request is processed. Or can I tweak things here? So what's the way to go here? Is there any other option, or must I catch those exceptions and go on like I do now? The operation itself does finish successfully - later on, when it's done - on the server side, and all the data is committed and searchable.

Regards
Torsten
Re: Phonetic search and matching
Thanks Erick. In the first place we thought of removing numbers with a pattern filter; setting inject to false will have the same effect. If we want to be able to search for numbers in the content, this solution will not work, but another field without phonetic filtering, and searching in both fields, would be OK, right?

Dirk

Am 07.02.2012 14:01 schrieb Erick Erickson erickerick...@gmail.com:
What happens if you do NOT inject? Setting inject=false stores only the phonetic reduction, not the original text. In that case your false match on 13 would go away. Not sure what that means for the rest of your app, though.

Best
Erick

On Mon, Feb 6, 2012 at 5:44 AM, Dirk Högemann dirk.hoegem...@googlemail.com wrote:
Hi, I have a question on phonetic search and matching in Solr. In our application all the content of an article is written to a full-text search field, which provides stemming and a phonetic filter (Cologne phonetic for German). This is the relevant part of the configuration for the index analyzer (search is analogous):

<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="German2"/>
<filter class="solr.PhoneticFilterFactory" encoder="ColognePhonetic" inject="true"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>

Unfortunately this sometimes results in strange, but also explainable, matches. For example: the content field indexes the following string: "Donnerstag von 13 bis 17 Uhr." This results in a match if we search for "puf", as the result of the phonetic filter for this is 13. (As a consequence, the 13 is then also highlighted.) Does anyone have an idea how to handle this in a reasonable way, so that a search for "puf" does not match 13 in the content? Thanks in advance! Dirk
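The "remove numbers with a pattern filter" idea from this thread amounts to dropping numeric tokens before the phonetic filter sees them. A Python sketch of that transformation (not the actual Solr PatternReplaceFilterFactory configuration):

```python
def strip_standalone_numbers(text):
    # Drop purely numeric tokens before they reach the phonetic filter,
    # the pattern-filter idea mentioned above (a sketch, not the actual
    # Solr PatternReplaceFilterFactory configuration).
    return " ".join(t for t in text.split() if not t.isdigit())

print(strip_standalone_numbers("Donnerstag von 13 bis 17 Uhr."))
# Donnerstag von bis Uhr.
```

With 13 removed from the phonetic field, a search for "puf" (whose Cologne code collides with 13) can no longer match it; the numbers stay searchable through a second, non-phonetic field, as suggested in the reply.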
Missing search result...
Hi, all... I have a small problem retrieving the full set of query responses I need and would appreciate any help. I have a query string as follows:

+((Title:"sales") (+Title:sales) (TOC:"sales") (+TOC:sales) (Keywords:"sales") (+Keywords:sales) (text:"sales") (+text:sales) (sales)) +(RepType:"WRO Revenue Services") +(ContentType:SOP ContentType:"Key Concept") -(Topics:Backup)

The query is intended to be:

MUST have at least one of:
- exact phrase in field Title
- all of the phrase words in field Title
- exact phrase in field TOC
- all of the phrase words in field TOC
- exact phrase in field Keywords
- all of the phrase words in field Keywords
- exact phrase in field text
- all of the phrase words in field text
- any of the phrase words in field text

MUST have "WRO Revenue Services" in field RepType

MUST have at least one of:
- "SOP" in field ContentType
- "Key Concept" in field ContentType

MUST NOT have "Backup" in field Topics

It's almost working, but it misses a couple of items that contain a single occurrence of the word "sale" in an indexed field. The indexed field containing that single occurrence is named UrlContent. In schema.xml, UrlContent is defined as:

<field name="UrlContent" type="text" indexed="true" stored="false" required="false" omitNorms="false"/>

Copyfields are as follows:

<copyField source="Title" dest="text"/>
<copyField source="Keywords" dest="text"/>
<copyField source="TOC" dest="text"/>
<copyField source="Overview" dest="text"/>
<copyField source="UrlContent" dest="text"/>

Thanks,
Tim Hibbs
RE: Multi word synonyms
I suppose I could translate every user query to include the term with quotes, e.g. if someone searches for stock syrup I send a query like:

q=stock syrup OR "stock syrup"

Seems like a bit of a hack though; is there a better way of doing this?

Zac

-----Original Message-----
From: Zac Smith
Sent: Sunday, February 05, 2012 7:28 PM
To: solr-user@lucene.apache.org
Subject: RE: Multi word synonyms

Thanks for the response. This almost worked. I created a new field using the KeywordTokenizerFactory as you suggested. The only problem was that searches only found documents when quotes were used. E.g. with synonyms.txt set up like this:

simple syrup,sugar syrup,stock syrup

I indexed a document with the value 'simple syrup'. Searches only found the document when using quotes:
"simple syrup" or "stock syrup" matched
simple syrup (no quotes) did not match

Here is the field I created:

<fieldType name="synonym_searcher" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" tokenizerFactory="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Any ideas? Also, I am using dismax and Solr 3.5.0.

Thanks
Zac

-----Original Message-----
From: O. Klein [mailto:kl...@octoweb.nl]
Sent: Sunday, February 05, 2012 5:22 AM
To: solr-user@lucene.apache.org
Subject: Re: Multi word synonyms

Your query analyzer will tokenize simple sirup into "simple" and "sirup" and won't match on "simple syrup" in the synonyms.txt. So you have to change the query analyzer to KeywordTokenizerFactory as well. It might be an idea to make a field for synonyms only, with this tokenizer, and another field to search on, and use dismax. Never tried this though.

--
View this message in context: http://lucene.472066.n3.nabble.com/Multi-word-synonyms-tp3716292p3717215.html
Sent from the Solr - User mailing list archive at Nabble.com.
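The quote-rewriting hack Zac describes can be done in a few lines of client code. A Python sketch (the function name and fallback rule are illustrative assumptions):

```python
def quote_fallback(user_query):
    # The hack from the message above: OR the raw terms with the whole
    # query as a phrase, so the keyword-tokenized synonym field can
    # match the full string while normal fields match the loose terms.
    terms = user_query.strip()
    if " " in terms and not terms.startswith('"'):
        return f'{terms} OR "{terms}"'
    return terms

print(quote_fallback("stock syrup"))   # stock syrup OR "stock syrup"
```

Single-word queries pass through unchanged, since the keyword tokenizer already sees them as one token; only multi-word input needs the phrase alternative.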
Re: Which Tokeniser (and/or filter)
Well, this is a common approach. Someone has to split up the input into sentences (whatever they are). Putting them in multi-valued fields is trivial. Then you confine things to within sentences, and then you start searching phrases with a slop less than your positionIncrementGap...

Best
Erick

On Tue, Feb 7, 2012 at 12:27 PM, Robert Brown r...@intelcompute.com wrote:
This all seems a bit too much work for such a real-world scenario?

---
IntelCompute
Web Design Local Online Marketing
http://www.intelcompute.com

On Tue, 7 Feb 2012 05:11:01 -0800 (PST), Ahmet Arslan iori...@yahoo.com wrote:
I'm still finding matches across newlines.
index... i am fluent german racing
search... fluent german
Any suggestions?

You can use a multiValued field for this. Split your document according to new line at the client side:

<arr>i am fluent</arr>
<arr>german racing</arr>

positionIncrementGap=100 will prevent the query "fluent german" from matching. Or, maybe you can inject artificial tokens via http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.PatternReplaceCharFilterFactory Your document becomes: i am fluent NEWLINE german racing
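Why the multiValued-plus-gap approach works can be shown by modelling token positions. This Python sketch assigns positions across values the way a positionIncrementGap would space them; it is a simplified model of the behaviour, not Solr's actual position accounting:

```python
def token_positions(values, gap=100):
    # Assign token positions across a multiValued field, leaving a
    # positionIncrementGap-sized jump between values; a sketch of why
    # phrase queries with modest slop cannot match across sentences.
    positions = {}
    pos = 0
    for value in values:
        for token in value.split():
            positions.setdefault(token, []).append(pos)
            pos += 1
        pos += gap - 1   # first token of the next value lands gap away
    return positions

p = token_positions(["i am fluent", "german racing"])
print(p["fluent"], p["german"])   # [2] [102]
```

With "fluent" at position 2 and "german" at 102, a phrase query for "fluent german" with any slop below the gap cannot treat them as adjacent, which is exactly the cross-sentence match the thread is trying to prevent.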
Re: Phonetic search and matching
Yes, you could do that. I guess numbers will give you trouble under all circumstances. You may be able to do something like search against your non-phonetic field with higher boosts to preferentially do those matches. Best Erick

On Tue, Feb 7, 2012 at 2:30 PM, Dirk Högemann dirk.hoegem...@googlemail.com wrote: Thanks Erick. In the first place we thought of removing numbers with a pattern filter. Setting inject to false will have the same effect. If we want to be able to search for numbers in the content, this solution will not work, but another field without phonetic filtering and searching in both fields would be ok, right? Dirk

On 07.02.2012 14:01, Erick Erickson erickerick...@gmail.com wrote: What happens if you do NOT inject? Setting inject=false stores only the phonetic reduction, not the original text. In that case your false match on 13 would go away. Not sure what that means for the rest of your app though. Best Erick

On Mon, Feb 6, 2012 at 5:44 AM, Dirk Högemann dirk.hoegem...@googlemail.com wrote: Hi, I have a question on phonetic search and matching in Solr. In our application all the content of an article is written to a full-text search field, which provides stemming and a phonetic filter (Cologne phonetic for German). This is the relevant part of the configuration for the index analyzer (search is analogous):

<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="German2"/>
<filter class="solr.PhoneticFilterFactory" encoder="ColognePhonetic" inject="true"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>

Unfortunately this sometimes results in strange, but also explainable, matches. For example, the content field indexes the following string: Donnerstag von 13 bis 17 Uhr.
This results in a match if we search for puf, as the result of the phonetic filter for this is 13. (As a consequence the 13 is then also highlighted.) Does anyone have an idea how to handle this in a reasonable way, so that a search for puf does not match 13 in the content? Thanks in advance! Dirk
RE: Multi word synonyms
Isn't that what autoGeneratePhraseQueries=true is for? -- View this message in context: http://lucene.472066.n3.nabble.com/Multi-word-synonyms-tp3716292p3723886.html Sent from the Solr - User mailing list archive at Nabble.com.
I want to specify multiple facet prefixes per field
I simulated a hierarchical faceting browsing scheme using facet.prefix. However, it seems there can only be one facet.prefix per field. For OR queries, the browsing scheme requires multiple facet prefixes. For example: fq=facet1:term1 OR facet1:term2 OR facet1:term3. Something like the above is very powerful. For the hierarchical browsing, at this point what I want is to show the child terms (one level down) of term1, term2 and term3 (but not term4, term5 or term6). Now, if I add a facet.prefix, say f.facet1.facet.prefix=term1, it would give me all the child terms of term1, but I also want the children of term2 and term3. So what I want is to be able to do something like this: f.facet1.facet.prefix=term1 OR term2 OR term3. Is there a way to accomplish what I'm looking for?
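Since facet.prefix takes only a single prefix per field here, one possible client-side workaround is to request the facet counts without a prefix and keep only the terms under the wanted prefixes. The flat [term, count, term, count, ...] list below mirrors the shape Solr returns in facet_fields, but the helper itself is hypothetical (and transfers the full count list, which may be expensive for high-cardinality fields):

```python
def filter_facet_counts(counts, prefixes):
    """Client-side OR over several facet prefixes: keep only facet terms
    that start with any of the wanted prefixes.
    `counts` is the flat [term, count, term, count, ...] list."""
    pairs = list(zip(counts[::2], counts[1::2]))
    return [(t, c) for t, c in pairs
            if any(t.startswith(p) for p in prefixes)]

counts = ["term1/a", 3, "term2/b", 5, "term4/x", 2]
print(filter_facet_counts(counts, ["term1", "term2", "term3"]))
# [('term1/a', 3), ('term2/b', 5)]
```

The alternative is one request per prefix (each with its own f.facet1.facet.prefix), merged client-side, which keeps the transfer small at the cost of extra round trips.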
RE: Multi word synonyms
It doesn't seem to do it for me. My field type is:

<fieldType name="synonym_searcher" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" tokenizerFactory="solr.KeywordTokenizerFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>

I am using edismax and Solr 3.5, and multi-word values can only be matched when using quotes.

-----Original Message-----
From: O. Klein [mailto:kl...@octoweb.nl]
Sent: Tuesday, February 07, 2012 12:49 PM
To: solr-user@lucene.apache.org
Subject: RE: Multi word synonyms

Isn't that what autoGeneratePhraseQueries=true is for? -- View this message in context: http://lucene.472066.n3.nabble.com/Multi-word-synonyms-tp3716292p3723886.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Which Tokeniser (and/or filter)
A custom tokenizer/tokenfilter could set the position increment when a newline comes through as well. Erik

On Feb 7, 2012, at 15:28, Erick Erickson erickerick...@gmail.com wrote: Well, this is a common approach. Someone has to split up the input as sentences (whatever they are). Putting them in multi-valued fields is trivial. Then you confine things to within sentences, then you start searching phrases with a slop less than your incrementGap... Best Erick

On Tue, Feb 7, 2012 at 12:27 PM, Robert Brown r...@intelcompute.com wrote: This all seems a bit too much work for such a real-world scenario? --- IntelCompute Web Design Local Online Marketing http://www.intelcompute.com

On Tue, 7 Feb 2012 05:11:01 -0800 (PST), Ahmet Arslan iori...@yahoo.com wrote: I'm still finding matches across newlines. index... i am fluent german racing search... fluent german Any suggestions? You can use a multiValued field for this. Split your document according to the newline at the client side: <arr>i am fluent</arr> <arr>german racing</arr> positionIncrementGap=100 will prevent the query "fluent german" from matching. Or, maybe you can inject artificial tokens via http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.PatternReplaceCharFilterFactory Your document becomes: i am fluent NEWLINE german racing
RE: Multi word synonyms
Well, if you want both multi word and single words I guess you will have to create another field :) Or make queries like you suggested. -- View this message in context: http://lucene.472066.n3.nabble.com/Multi-word-synonyms-tp3716292p3724009.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: Multi word synonyms
Are you able to explain how I would create another field to fit my scenario? -Original Message- From: O. Klein [mailto:kl...@octoweb.nl] Sent: Tuesday, February 07, 2012 1:28 PM To: solr-user@lucene.apache.org Subject: RE: Multi word synonyms Well, if you want both multi word and single words I guess you will have to create another field :) Or make queries like you suggested. -- View this message in context: http://lucene.472066.n3.nabble.com/Multi-word-synonyms-tp3716292p3724009.html Sent from the Solr - User mailing list archive at Nabble.com.
URI Encoding with Solr and Weblogic
Hi, I am trying to get Solr 3.3.0 to process Arabic search requests using its admin interface. I have successfully managed to set it up on Tomcat using the URIEncoding attribute, but fail miserably on WebLogic 10. Invoking the URL http://localhost:7012/solr/select/?q=? returns the XML below:

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
    <lst name="params">
      <str name="q">تÙ?Ù?ئة</str>
    </lst>
  </lst>
  <result name="response" numFound="0" start="0"/>
</response>

The search term is just gibberish. Running the query through Luke or Tomcat returns the expected result and renders the search term correctly. I have tried to change the URI encoding and JVM default encoding by setting the following startup arguments in WebLogic: -Dfile.encoding=UTF-8 -Dweblogic.http.URIDecodeEncoding=UTF-8. I can see them being set through Solr's admin interface. They don't have any impact though. I am running out of ideas on how to get this working. Any thoughts and pointers are much appreciated. Thanks, Elisabeth
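One way to see this failure mode is to compare a UTF-8 percent-encoding round trip with a Latin-1 decode of the same bytes, which is roughly what a container without a UTF-8 URI decode setting does. The Arabic term below is an assumed stand-in for the original (mojibake'd) query term:

```python
from urllib.parse import quote, unquote

# An Arabic query term travels as percent-encoded UTF-8 bytes; a server
# that decodes those bytes as ISO-8859-1 produces mojibake like the
# "تÙ..." string seen in the response above.
term = "تعبئة"                                   # assumed example term
encoded = quote(term, safe="")
roundtrip = unquote(encoded, encoding="utf-8")    # correct decode
wrong = unquote(encoded, encoding="latin-1")      # misconfigured decode

print(encoded)              # %D8%AA%D8%B9%D8%A8%D8%A6%D8%A9
print(roundtrip == term)    # True
print(wrong == term)        # False -- this is the gibberish case
```

So the client side is usually fine; the fix has to happen wherever the container turns the request bytes back into characters.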
Re: Display of highlighted search result should start with the beginning of the sentence that contains the search string.
(12/02/08 1:54), Shyam Bhaskaran wrote: Hi Koji, I have tried using hl.bs.type=SENTENCE and still see no improvement. We are storing PDF-extracted content in the field, which has termVectors enabled. For example, the field contains the following data extracted from a PDF: User-defined resolution functions. The synthesis tool only supports the resolution functions for std_logic and std_logic_vector. Slices with range indices that do not evaluate to constants. When I search for the term std_logic, the following highlighted snippet is displayed: functions for <em>std_logic</em> and std_logic_vector. * Slices with range indices that do not evaluate to constants. As you can see, the highlighted term does not start from the beginning of the sentence. Why is this, and how can I achieve this?

Hi Shyam, Can you try to set hl.bs.chars=.!? and hl.bs.maxScan=100 or a larger number? SimpleBoundaryScanner will scan the stored data back and forth from the highlighted terms until it meets those settings. http://wiki.apache.org/solr/HighlightingParameters#hl.bs.maxScan koji -- http://www.rondhuit.com/en/
Re: Which Tokeniser (and/or filter)
: This all seems a bit too much work for such a real-world scenario? You haven't really told us what your scenario is. You said you want to split tokens on whitespace, full-stop (aka period) and comma only, but then in response to some suggestions you added comments about other things that you never mentioned previously... 1) evidently you don't want the . in foo.net to cause a split in tokens? 2) evidently you not only want token splits on newlines, but also position gaps to prevent phrases matching across newlines. ...these are kind of important details that affect the suggestions people might give you. Can you please provide some concrete examples of the types of data you have, the types of queries you want them to match, and the types of queries you *don't* want to match? -Hoss
struggling with solr.WordDelimiterFilterFactory and periods . or dots
hello all, i am struggling with getting solr.WordDelimiterFilterFactory to behave as is indicated in the Solr book (Smiley) on page 54. the example in the book reads like this: Here is an example exercising all options: WiFi-802.11b to Wi, Fi, WiFi, 802, 11, 80211, b, WiFi80211b. essentially i have the same requirement with embedded periods and need to return a successful search on a field, even if the user does NOT enter the period. i have a field, itemNo, that can contain periods. example content in the itemNo field: B12.0123. when the user searches on this field, they need to be able to enter an itemNo without the period and still find the item. example: the user enters B120123 and a document is returned with B12.0123. unfortunately, the search will NOT return the appropriate document if the user enters B120123. however, the search does work if the user enters B12 0123 (a space in place of the period). can someone help me understand what is missing from my configuration? this is snipped from my schema.xml file:

<fields>
  ...
  <field name="itemNo" type="text" indexed="true" stored="true"/>
  ...
</fields>

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

-- View this message in context: http://lucene.472066.n3.nabble.com/struggling-with-solr-WordDelimiterFilterFactory-and-periods-or-dots-tp3724822p3724822.html Sent from the Solr - User mailing list archive at Nabble.com.
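As a rough illustration (a toy approximation, not the actual Lucene implementation), WordDelimiterFilterFactory with generateWordParts, generateNumberParts and catenateAll enabled should emit both the split parts and the fully catenated form, so b120123 ought to be indexed alongside the parts of B12.0123:

```python
import re

def word_delimiter_tokens(token):
    """Very rough sketch of WordDelimiterFilterFactory output with
    generateWordParts=1, generateNumberParts=1, catenateAll=1:
    split into letter runs and digit runs, then also emit the
    catenated form (delimiters removed). Lowercasing stands in for
    the LowerCaseFilterFactory later in the chain."""
    parts = re.findall(r'[A-Za-z]+|[0-9]+', token)
    tokens = list(parts)
    catenated = ''.join(parts)
    if catenated != token and catenated not in tokens:
        tokens.append(catenated)
    return [t.lower() for t in tokens]

print(word_delimiter_tokens("B12.0123"))  # ['b', '12', '0123', 'b120123']
```

If Solr's Analysis page shows the catenated token b120123 at index time but the query B120123 still fails, a common cause worth checking is that the documents were indexed before the filter options were changed and need re-indexing.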
Re: Display of highlighted search result should start with the beginning of the sentence that contains the search string.
It seems a bug to me. Can you open a ticket? Thank you. Koji Sekiguchi from iPhone

On 2012/02/08, at 13:32, Shyam Bhaskaran shyam.bhaska...@synopsys.com wrote: Hi Koji, Thanks for the response. When I use hl.bs.chars=.!? and hl.bs.maxScan=200 I see improvements; below is the highlighted value: The synthesis tool only supports the resolution functions for <em>std_logic</em> and std_logic_vector. But in other cases I also see that some of the words break in between, as shown below. Original text: How Are Clock Gating Checks Inferred. When searching for the term clock, the highlighted text is displayed as shown below: w Are <em>Clock</em> Gating Checks Inferred. As you can see, only w is displayed from the word How. This issue goes away when I use hl.bs.chars=.!?&#9;&#10;&#13; but it creates the issue of highlighting not starting from the beginning of the sentence. Is there a way whereby I can have highlighting working in all cases? -Shyam
RE: Display of highlighted search result should start with the beginning of the sentence that contains the search string.
Hi Koji, Thanks for the response. When I use hl.bs.chars=.!? and hl.bs.maxScan=200 I see improvements; below is the highlighted value: The synthesis tool only supports the resolution functions for <em>std_logic</em> and std_logic_vector. But in other cases I also see that some of the words break in between, as shown below. Original text: How Are Clock Gating Checks Inferred. When searching for the term clock, the highlighted text is displayed as shown below: w Are <em>Clock</em> Gating Checks Inferred. As you can see, only w is displayed from the word How. This issue goes away when I use hl.bs.chars=.!?&#9;&#10;&#13; but it creates the issue of highlighting not starting from the beginning of the sentence. Is there a way whereby I can have highlighting working in all cases? -Shyam
How to use nested query in fq?
Hi Guys, I am using Solr 3.5 and would like to use an fq like getField(getDoc(uuid:workspace_${workspaceId}), isPublic):true
- workspace_${workspaceId}: workspaceId is an indexed field.
- getDoc(uuid:concat(workspace_, workspaceId)): returns the document whose uuid is workspace_${workspaceId}
- getField(getDoc(uuid:workspace_${workspaceId}), isPublic): returns the matched document's isPublic field
The use case is that I have workspace objects, and a workspace contains many sub-objects, such as work files, comments, datasets and so on. The workspace has an 'isPublic' field. If this field is true, then all registered users can access this workspace and all its sub-objects. Otherwise, only workspace members can access this workspace and its sub-objects. So I want to use fq to determine whether the document in question belongs to a public workspace or not. Is it possible? If not, how can I implement a similar feature — by implementing a ValueSourcePlugin? Any guidance or example on this? Or is there any better solution? It is possible to add the 'isPublic' field to all sub-objects, but that makes index updates more complex, so I am trying to find a better solution. Thanks very much in advance! Regards, Yandong
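Since an fq cannot dereference another document's field in Solr 3.5, one hedged two-step alternative is to resolve the workspace's isPublic flag with one cheap query (on uuid:workspace_<id>, cacheable client-side) and then build the sub-object filters in the client. The field names and helper below are assumptions for illustration:

```python
def build_filters(workspace_id, is_public_lookup, user_workspaces):
    """Two-step access sketch for Solr 3.5 (no join support):
    `is_public_lookup` stands in for the first query's result, i.e.
    the isPublic value of doc uuid:workspace_<workspace_id>;
    the return value is the list of fq clauses for the sub-object query."""
    if is_public_lookup(workspace_id):
        return []                                   # public: no restriction
    if workspace_id in user_workspaces:
        return ["workspaceId:%s" % workspace_id]    # member: restrict to it
    return ["id:nonexistent"]                       # no access: match nothing

lookup = {"ws1": True, "ws2": False}.get
print(build_filters("ws1", lookup, set()))      # []
print(build_filters("ws2", lookup, {"ws2"}))    # ['workspaceId:ws2']
```

The trade-off versus denormalizing isPublic onto every sub-object is one extra (cacheable) query per request instead of re-indexing all sub-objects when the flag changes.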
Re: is there any practice to load index into RAM to accelerate solr performance?
Experience has shown that it is much faster to run Solr with a small amount of memory and let the rest of the RAM be used by the operating system disk cache. That is, the OS is very good at keeping the right disk blocks in memory, much better than Solr. How much RAM is in the server and how much RAM does the JVM get? How big are the documents, and how large is the term index for your searches? How many documents do you get with each search? And, do you use filter queries? These are very powerful at limiting searches.

2012/2/7 James ljatreey...@163.com: Is there any practice to load the index into RAM to accelerate Solr performance? The overall document count is about 100 million. The search time is around 100ms. I am seeking some method to accelerate the response time of Solr. I have seen that some installations use SSD disks, but SSDs also cost a lot. I just want to know whether there is some method to load the index file into RAM, keep the RAM index and disk index synchronized, and then search on the RAM index.

-- Lance Norskog goks...@gmail.com
Re: Typical Cache Values
This is not unusual, but there's also not much reason to give this much memory in your case. This is the cache that is hit when a user pages through a result set. Your numbers would seem to indicate one of two things: 1) your window is smaller than 2 pages (see queryResultWindowSize in solrconfig.xml), or 2) your users rarely go to the next page. This cache isn't doing you much good, but then it's also not using that much in the way of resources.

True it is. Although the queryResultWindowSize is 30, I will be reducing it to 4 or so. And yes, we have observed that mostly people don't go beyond the first page.

documentCache — lookups: 17647360, hits: 11935609, hitratio: 0.67, inserts: 5711851, evictions: 5707755, size: 4096, warmupTime: 0, cumulative_lookups: 19009142, cumulative_hits: 12813630, cumulative_hitratio: 0.67, cumulative_inserts: 6195512, cumulative_evictions: 6187460

Again, this is actually quite reasonable. This cache is used to hold document data, and often doesn't have a great hit ratio. It is necessary though; it saves quite a few disk seeks when servicing a single query.

fieldValueCache — lookups: 0, hits: 0, hitratio: 0.00, inserts: 0, evictions: 0, size: 0, warmupTime: 0, cumulative_lookups: 0, cumulative_hits: 0, cumulative_hitratio: 0.00, cumulative_inserts: 0, cumulative_evictions: 0

Not doing much in the way of faceting, are you? No. We don't facet results.

filterCache — lookups: 30059278, hits: 28813869, hitratio: 0.95, inserts: 1245744, evictions: 1245232, size: 512, warmupTime: 28005, cumulative_lookups: 32155745, cumulative_hits: 30845811, cumulative_hitratio: 0.95, cumulative_inserts: 1309934, cumulative_evictions: 1309245

Not a bad hit ratio here; this is where fq filters are stored. One caution here: it is better to break out your filter queries where possible into small chunks. Rather than write fq=field1:val1 AND field2:val2, it's better to write fq=field1:val1&fq=field2:val2. Think of this cache as a map with the query as the key.
If you write the fq the first way above, subsequent fqs for either half won't use the cache. That was great advice. We do use the former approach, but going forward we will stick to the latter one. Thanks, Pranav
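The advice above (one fq parameter per clause instead of a single ANDed fq) can be sketched as a small request builder; the helper below is hypothetical, but the resulting query string is what the separate-fq form looks like on the wire:

```python
from urllib.parse import urlencode

def split_fq(filters):
    """Emit one fq parameter per clause instead of a single ANDed fq,
    so each clause gets its own filterCache entry and can be reused by
    queries that share only part of the filter."""
    return [("fq", f) for f in filters]

params = [("q", "*:*")] + split_fq(["field1:val1", "field2:val2"])
query_string = urlencode(params)
print(query_string)   # two separate fq= parameters, each cacheable alone
```

A later query filtering only on field1:val1 then hits the cached entry directly, which a combined "field1:val1 AND field2:val2" entry would never serve.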
Re: Chinese Phonetic search
you can convert Chinese words to pinyin and use n-grams to search for phonetically similar words

On Wed, Feb 8, 2012 at 11:10 AM, Floyd Wu floyd...@gmail.com wrote: Hi there, Has anyone here ever implemented phonetic search, especially with Chinese (traditional/simplified), using Solr or Lucene? Please share some thoughts or point me to a possible solution (hint me with search keywords). I've searched and read a lot of related articles but have had no luck. Many thanks. Floyd
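The pinyin-plus-n-gram idea might be sketched like this. The pinyin string below is an assumed transliteration, and a real pipeline would need a Chinese-to-pinyin converter (dictionary-based) upstream of the n-gram step:

```python
def ngrams(s, n=2):
    """Character n-grams over a pinyin string; indexing these lets
    phonetically close transliterations share many grams, so near
    matches still overlap even when they are not identical."""
    s = s.replace(" ", "")
    return [s[i:i + n] for i in range(len(s) - n + 1)]

# "zhong wen" is an assumed pinyin rendering of a Chinese word
print(ngrams("zhong wen"))  # ['zh', 'ho', 'on', 'ng', 'gw', 'we', 'en']
```

In Solr terms this corresponds roughly to a pinyin-producing analyzer followed by an NGramFilterFactory on a dedicated phonetic field, queried alongside the normal text field.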