Re: Lexical analysis tools for German language data
From: Tomas Zerolo

| | There can be transformations or inflections, like the s in
| | Weihnachtsbaum (Weihnachten/Baum).
|
| | I remember from my linguistics studies that the terminus technicus
| | for these is Fugenmorphem (interstitial or joint morpheme) [...]
|
| IANAL (I am not a linguist -- pun intended ;) but I've always read
| that as a genitive. Any pointers?

Admittedly, that's what you'd think, and despite linguistics telling me otherwise I'd maintain there's some truth in it. For this case, however, consider: die Weihnacht declines like die Nacht, so:

nom. die Weihnacht
gen. der Weihnacht
dat. der Weihnacht
akk. die Weihnacht

As you can see, there's no s to be found anywhere, not even in the genitive. But my gut feeling, like yours, is that this should indicate a genitive, and I would make a point of well-argued gut feeling being at least as relevant as formalist analysis.

Michael
Lexical analysis tools for German language data
Given an input of Windjacke (probably "wind jacket" in English), I'd like the code that prepares the data for the index (tokenizer etc.) to understand that this is a Jacke (jacket), so that a query for Jacke would include the Windjacke document in its result set.

It appears to me that such an analysis requires a dictionary-backed approach, which doesn't have to be perfect at all; a list of the most common 2000 words would probably do the job and fulfil a criterion of reasonable usefulness.

Do you know of any implementation techniques or working implementations to do this kind of lexical analysis for German language data? (Or other languages, for that matter?) What are they, where can I find them? I'm sure there is something out there (commercial or free) because I've seen lots of engines grokking German and the way it builds words.

Failing that, what are the proper terms to refer to these techniques, so one can search more successfully?

Michael
Re: Lexical analysis tools for German language data
| Given an input of Windjacke (probably "wind jacket" in English), I'd
| like the code that prepares the data for the index (tokenizer etc.) to
| understand that this is a Jacke (jacket), so that a query for Jacke
| would include the Windjacke document in its result set. It appears to
| me that such an analysis requires a dictionary-backed approach, which
| doesn't have to be perfect at all; a list of the most common 2000
| words would probably do the job and fulfil a criterion of reasonable
| usefulness.

A simple approach would obviously be a word list and a regular expression. There will, however, be nuts and bolts to take care of. A more sophisticated and tested approach might be known to you.

Michael
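To sketch the word-list-plus-regex idea in code (class and method names are invented, the joint morpheme is handled by the e?[ns] heuristic; a starting point, not a tested implementation):

  import java.util.ArrayList;
  import java.util.List;
  import java.util.Locale;
  import java.util.Set;

  /**
   * Naive dictionary-backed decompounder. If a compound ends in a known
   * base word and the remainder is itself a known word (possibly with a
   * trailing joint morpheme matching e?[ns]), emit the base word too.
   */
  public class NaiveDecompounder {

      private final Set<String> baseWords; // e.g. the ~2000 most common nouns

      public NaiveDecompounder(Set<String> baseWords) {
          this.baseWords = baseWords;
      }

      // "Windjacke" -> [windjacke, jacke]
      // "Weihnachtsbaum" -> [weihnachtsbaum, baum]
      public List<String> tokens(String input) {
          String t = input.toLowerCase(Locale.GERMAN);
          List<String> out = new ArrayList<String>();
          out.add(t);
          for (String base : baseWords) {
              if (!t.endsWith(base) || t.length() <= base.length()) continue;
              String head = t.substring(0, t.length() - base.length());
              // strip an optional Fugenmorphem (e, s, n, es, en) off the head
              String stripped = head.replaceFirst("e?[ns]$", "");
              if (baseWords.contains(head) || baseWords.contains(stripped)) {
                  out.add(base);
              }
          }
          return out;
      }
  }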
Re: Lexical analysis tools for German language data
From: Valeriy Felberg

| If you want the query jacke to match a document containing the word
| windjacke or kinderjacke, you could use a custom update processor.
| This processor could search the indexed text for words matching the
| pattern .*jacke and inject the word jacke into an additional field
| which you can search against. You would need a whole list of possible
| suffixes, of course.

Merci, Valeriy - I agree on the feasibility of such an approach. The list would likely have to be composed of the most frequently used terms for your specific domain. In our case, it's things people would buy in shops. Reducing overly complicated and convoluted product descriptions to proper basic terms - that would do the job. It's like going to a restaurant boasting fancy and unintelligible names for the dishes you may order when they are really just ordinary stuff like pork and potatoes.

Thinking some more about it, giving sufficient boost to the attached category data might also do the job. That would shift the burden of supplying proper semantics to the guys doing the categorization.

| It would slow down the update process but you don't need to split
| words during search.
|
| Le 12 avr. 2012 à 11:52, Michael Ludwig a écrit :
|
| | Given an input of Windjacke (probably "wind jacket" in English),
| | I'd like the code that prepares the data for the index (tokenizer
| | etc.) to understand that this is a Jacke (jacket), so that a query
| | for Jacke would include the Windjacke document in its result set.

A query for Windjacke or Kinderjacke would probably not have to be de-specialized to Jacke because, well, that's the user input, and users looking for specific things are probably doing so for a reason. If no matches are found you can still tell them to just broaden their search.

Michael
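A rough sketch of such an update processor, with hypothetical field names (title as the source, base_terms as the injected field) and a hard-coded suffix list:

  import java.io.IOException;

  import org.apache.solr.common.SolrInputDocument;
  import org.apache.solr.update.AddUpdateCommand;
  import org.apache.solr.update.processor.UpdateRequestProcessor;

  /**
   * Injects base words like "jacke" into a separate search field whenever
   * an indexed word ends in one of the known suffixes, so a query for
   * jacke also matches windjacke or kinderjacke. (Deployment also needs
   * an UpdateRequestProcessorFactory and an entry in the
   * updateRequestProcessorChain in solrconfig.xml.)
   */
  public class SuffixInjectingProcessor extends UpdateRequestProcessor {

      // domain-specific list of base terms; hard-coded for the sketch
      private static final String[] BASE_WORDS = { "jacke", "hose", "schuh" };

      public SuffixInjectingProcessor(UpdateRequestProcessor next) {
          super(next);
      }

      @Override
      public void processAdd(AddUpdateCommand cmd) throws IOException {
          SolrInputDocument doc = cmd.getSolrInputDocument();
          Object title = doc.getFieldValue("title"); // hypothetical source field
          if (title != null) {
              for (String word : title.toString().toLowerCase().split("\\W+")) {
                  for (String base : BASE_WORDS) {
                      if (word.endsWith(base) && word.length() > base.length()) {
                          doc.addField("base_terms", base); // hypothetical target
                      }
                  }
              }
          }
          super.processAdd(cmd); // pass control on down the chain
      }
  }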
Re: Lexical analysis tools for German language data
From: Markus Jelsma

| We've done a lot of tests with the HyphenationCompoundWordTokenFilter,
| using an FOP XML hyphenation file generated from the TeX patterns for
| the Dutch language, and have seen decent results. A bonus was that now
| some tokens can be stemmed properly because not all compounds are
| listed in the dictionary for the HunspellStemFilter.

Thank you for pointing me to these two filter classes.

| It does introduce a recall/precision problem but it at least returns
| results for those many users that do not properly use compounds in
| their search query.

Could you define what the term recall should be taken to mean in this context? I've also encountered it on the BASIStech website. Okay, I found a definition:

http://en.wikipedia.org/wiki/Precision_and_recall

Dank je wel!

Michael
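For reference, roughly how that filter is wired up in schema.xml; the analyzer chain and the file names (hyph_de.xml, dictionary_de.txt) are placeholders rather than a tested configuration:

  <fieldType name="text_compound" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <!-- hyphenation grammar converted from the TeX patterns via FOP -->
      <filter class="solr.HyphenationCompoundWordTokenFilterFactory"
              hyphenator="hyph_de.xml"
              dictionary="dictionary_de.txt"
              minWordSize="5" minSubwordSize="3" maxSubwordSize="15"
              onlyLongestMatch="true"/>
    </analyzer>
  </fieldType>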
Re: Lexical analysis tools for German language data
From: Walter Underwood

| German noun decompounding is a little more complicated than it might
| seem. There can be transformations or inflections, like the s in
| Weihnachtsbaum (Weihnachten/Baum).

I remember from my linguistics studies that the terminus technicus for these is Fugenmorphem (interstitial or joint morpheme). But there aren't many of them - phrased in a regex, it's /e?[ns]/. The Weihnachtsbaum in the example above is from the singular (die Weihnacht), then s, then Baum. Still, it's much more complex than, say, English or Italian.

| Internal nouns should be recapitalized, like Baum above.

Casing won't matter for indexing, I think. The way I would go about obtaining stems from compound words is by using a dictionary of stems and a regex. We'll see how far that'll take us.

| Some compounds probably should not be decompounded, like Fahrrad
| (fahren/Rad). With a dictionary-based stemmer, you might decide to
| avoid decompounding for words in the dictionary.

Good point.

| Note that highlighting gets pretty weird when you are matching only
| part of a word.

Guess it'll be weird when you get it wrong, like Noten in Notentriegelung.

| Luckily, a lot of compounds are simple, and you could well get a
| measurable improvement with a very simple algorithm. There isn't
| anything complicated about compounds like Orgelmusik or
| Netzwerkbetreuer.

Exactly.

| The Basis Technology linguistic analyzers aren't cheap or small, but
| they work well.

We will consider our needs and options. Thanks for your thoughts.

Michael
Re: Implementing PhraseQuery and MoreLikeThis Query in one app
SergeyG schrieb:

| Can both queries - PhraseQuery and MoreLikeThis Query - be implemented
| in the same app, taking into account the fact that for the former to
| work the stop words list needs to be included, and this results in the
| latter putting stop words among the most important words?

Why would the inclusion of a stopword list result in stopwords being of top importance in the MoreLikeThis query?

Michael Ludwig
Re: Search for phrase including prepositions
akinori schrieb:

| When I search make for, solr returns words that include both make and
| for, but when I type more than 3 words such as in order to, the result
| becomes 0, though the index is sure to have documents including all 3
| of the words. 2 words are ok but more than 3 words return zero. Why
| does this happen?

Hi Akinori,

I guess you're using the DisMax query parser. Please read this entire page:

http://wiki.apache.org/solr/DisMaxRequestHandler

The parameter that allows you to tweak this is the mm parameter.

Michael Ludwig
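For reference, the stock mm example from the DisMax documentation, which goes into the handler defaults in solrconfig.xml (the < characters are XML-escaped):

  <str name="mm">2&lt;-1 5&lt;-2 6&lt;90%</str>

That reads: for 1-2 terms all must match, for 3-5 terms one may be missing, for 6 terms two may be missing, and above that 90% of the terms must match.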
Re: Installing a patch in a solr nightly on Windows
Koji Sekiguchi schrieb: I'm not a Windows user, but I think you can use Linux command (e.g. patch, to apply SOLR-284 patch to Solr nightly build) on cygwin environment. The standalone patch utility for Win32 is another option. http://gnuwin32.sourceforge.net/packages/patch.htm Michael Ludwig
Re: Monitor search traffic
Gurjot Singh schrieb:

| Hi, is there a way to monitor the number of search queries made on the
| solr index?

http://localhost:8983/solr/admin/stats.jsp

Look for "requests".

Michael Ludwig
Re: spelling suggestion in solr.
Radha C. schrieb:

| Is the spelling suggestion feature available in solr? If yes, can you
| point me to some documentation?

Have you tried googling for: solr spelling? First hit:

http://wiki.apache.org/solr/SpellCheckComponent

Michael Ludwig
Re: SOLR SpellChecker and German Umlauts
Kraus, Ralf | pixelhouse GmbH schrieb:

| When I am searching for ONE word with a German umlaut like
| kräuterkeckse (the right word is kräuterkekse) the spellchecker gives
| me two corrections:
|
| Spellcheck for kr = kren
| Spellcheck for uterkeksse = butterkekse
|
| WHY does SOLR break this ONE word apart?

Moin Ralf,

please read the following threads to understand the issue. In short, you need to specify your query in spellcheck.q as well.

Re: French and SpellingQueryConverter - Shalin Shekhar Mangar
http://markmail.org/message/k35r7qmpatjvllsc

SpellCheckComponent: queryAnalyzerFieldType - Michael Ludwig
http://markmail.org/thread/dgi4llhc7x5wuroc

(BTW, the patch in SOLR-1204 is ready but still awaiting clarification. See comments from June 11 and 18.)

| My Config is:
|
| spellcheck = 'true';
| spellcheck.dictionary = 'jarowinkler'
| spellcheck.onlyMorePopular = 'true'
| spellcheck.build = 'false'
| spellcheck.count = 1

So add:

spellcheck.q = 'your query'

Michael Ludwig
Re: Search for phrase including prepositions
akinori schrieb:

| I indexed an English dictionary to solr. When I search apple juice,
| for example, solr understands the query as apple juice, which is what
| I want. However, when I search apple for, solr thinks that the query
| is just apple. How can I solve this? I think I have to understand the
| analyzer.

Exactly.

| Could anyone guide me?

Go to your analysis page, enter your field name (or type), check verbose output, enter your query, and press Analyze.

http://localhost:8983/solr/admin/analysis.jsp

You'll probably find that the word for is removed as a so-called stopword.

Michael Ludwig
Re: nested dismax queries
Ensdorf Ken schrieb:

| For example, a user might enter Alabama Biotechnology in the main
| search box, triggering a dismax request which returns lots of
| different types of results. They may then want to refine their search
| by selecting a specific industry from a drop-down box. We handle this
| by adding a filter query (fq=) to the original query. We have dozens
| of additional fields like this - some with a finite set of discrete
| values, some with arbitrary text values. The combinations are
| infinite, and I'm worried we will overwhelm the filterCache by
| supporting all of these cases as filter queries.

Filter queries with arbitrary text values may swamp the cache in 1.3. Otherwise, the combinations aren't infinite. Keep the filters separate in order to limit their number: specify two simple filters instead of one composite filter, i.e. fq=x:bla and fq=y:blub instead of fq=x:bla AND y:blub. See:

filterCache/@size, queryResultCache/@size, documentCache/@size
http://markmail.org/thread/tb6aanicpt43okcm

Michael Ludwig
Re: nested dismax queries
Ensdorf Ken schrieb:

| | Filter queries with arbitrary text values may swamp the cache in 1.3.
|
| Are you implying this won't happen in 1.4?

I intended to say just this, but I was on the wrong track.

| Can you point me to the feature that would mitigate this?

What I was thinking of is the following:

[#SOLR-475] multi-valued faceting via un-inverted field
https://issues.apache.org/jira/browse/SOLR-475

But as you can see, this refers to faceting on multi-valued fields, not to filter queries with arbitrary text. I was off on a tangent. Sorry.

To get back to your initial mail, I tend to think that drop-down boxes (the values of which you control) are a nice match for the filter query, whereas user-entered text is more likely to be a candidate for the main query.

Michael Ludwig
Re: Searching across multivalued fields
MilkDud schrieb:

| Michael Ludwig-4 wrote:
| | What do you expect the user to enter?
| | * "dream theater innocence faded" - certainly wrong
| | * "dream theater" "innocence faded" - much better
|
| Most likely they would just enter dream theater innocence faded, no
| quotes. Without any quotes around any fields, which is a large cause
| of the problem. Now if I index on the track level, then all those
| words would have to show up in just one track (including the album,
| artist, and track name), which is expected. If I index on the album
| level, however, those words just need to show up anywhere throughout
| the entire album.

Give the user separate form fields; in this case, don't use DisMax, and route each form field value to the appropriate field. Or go with DisMax; it has the mm option to fine-tune how multiple terms in the query should influence matching.

| So, while it will match dream theater - innocence faded, it will also
| match an album that has all the words dream theater innocence faded
| mentioned across all tracks, which for small queries can be very
| common. Basically, I'm looking for a way to say match all the words in
| the search query across the artist, album, and track name, but only
| looking at one track (a multivalued field) at a time, given a query
| without any quotes. Does that make sense at all?

If that's your use case (which I may have been unable to see up to now), then your approach of splitting up albums into tiny track documents makes sense.

| That is why I was leaning towards the track level index, such as:
| id, artist, album, track (all single valued)

Yes, that makes sense. Good luck! (Off for a week now.)

Michael Ludwig
Re: Searching across multivalued fields
MilkDud schrieb:

| Ok, so let's suppose I did index across just the album. Using that
| index, how would I be able to handle searches of the form artist name
| track name?

What does the user interface look like? Do you have separate fields for artists and tracks? Or just one field?

| If I do the search using a phrase query, this won't match anything
| because the artist and track are not in one field (hence my idea of
| creating a third concatenated field).

What do you expect the user to enter?

* "dream theater innocence faded" - certainly wrong
* "dream theater" "innocence faded" - much better

Use the DisMax query parser to read the query, as I suggested in my first reply. You need to become more familiar with the various search facilities; that will probably steer your ideas in more promising directions. Read up about DisMax.

| If I make it a non-phrase query, it'll return albums that have those
| words across all the tracks, which is not ideal. I.e. if you search
| for a track titled love me you will get back albums with the words
| love and me in different tracks.

That doesn't make sense to me. Did you inspect your query using debugQuery=true as I suggested? What did it boil down to?

| Basically, I'd like it to look at each track individually

That tells me you're thinking database and table scan.

| and if the artist + just one track match all the search terms, then
| that counts as a match. Does that make sense? If I index on the track
| level, that should work, but then I have to store album/artist info on
| each track.

I think the following makes much more sense: An album should be a document and have the following fields (and maybe more, if you have more data attached to it):

id - unique, an identifier
title - album title
interpret - the musician, possibly multi-valued
track - every song or whatever, definitely multi-valued

Read up about multi-valued fields (sample schema.xml, for example, or Google) if you're unsure what this is; your posting subject, however, suggests you aren't.

Regards,

Michael Ludwig
Re: Few Queries regarding indexes in Solr
Otis Gospodnetic schrieb:

| | | [...] nothing prevents the indexing client from sending the same
| | | doc to multiple shards. In some scenarios that's exactly what you
| | | want to do.
| |
| | What kind of scenario would that be?
|
| One scenario is making use of a small and a large core to provide near
| real-time search - you index to both - to the smaller one so you can
| flip/drop/purge+reopen it frequently and quickly, to the large one to
| persist. You search across both of them and remove dupes.

This makes sense. Thanks for taking the time to answer this.

Q: What is the most annoying thing in e-mail?
A: It never stops! Imagine it did one day!

Michael Ludwig
Re: FilterCache issue
Manepalli, Kalyan schrieb:

| I am seeing an issue with the filterCache setting on my solr app which
| is causing slower faceting. Here is the configuration:
|
| <filterCache class="solr.LRUCache" size="512" initialSize="512"
|     autowarmCount="256"/>
|
| hitratio: 0.00
| inserts: 973531
| evictions: 972978
| size: 512
| cumulative_hitratio: 0.00
| cumulative_inserts: 61170111
| cumulative_evictions: 61153787
|
| As we can see, the cache hit ratio is almost zero. How do I improve
| the filter cache?

Maybe these pages add some ideas to the mix:

http://wiki.apache.org/solr/FilterQueryGuidance
https://issues.apache.org/jira/browse/SOLR-475

Michael Ludwig
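Judging by the numbers above - 512 slots against ~61 million cumulative inserts, with nearly every insert evicting an entry - the cache is far too small for the number of distinct filters in play. A first experiment might simply be a larger cache (the sizes below are guesses, not a recommendation):

  <filterCache class="solr.LRUCache" size="16384" initialSize="4096" autowarmCount="1024"/>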
Re: Distributed querying using solr multicore.
Rakhi Khatwani schrieb:

| [...] how do we do a distributed search across multicores? Is it just
| like how we query using multiple shards?

I don't know how we're supposed to use it. I did the following:

http://flunder:8983/solr/xpg/select?q=bla&shards=flunder:8983/solr/xpg,flunder:8983/solr/kk

For SolrJ, see this thread:

Using SolrJ with multicore/shards - ahammad
http://markmail.org/thread/qnytfrk4dytmgjis

| if so, isn't there a better way to do that?

No idea.

Michael Ludwig
Re: Distributed querying using solr multicore.
Rakhi Khatwani schrieb:

| On Thu, Jun 18, 2009 at 3:51 PM, Michael Ludwig m...@as-guides.com wrote:
| | I don't know how we're supposed to use it. I did the following:
| | http://flunder:8983/solr/xpg/select?q=bla&shards=flunder:8983/solr/xpg,flunder:8983/solr/kk
|
| i am getting a page load error... cannot find server

This is not a public server, just an example for the syntax I found by trial and error.

Michael Ludwig
Re: Searching across multivalued fields
Hi Vicky,

Vicky_Dev schrieb:

| We are also facing the same problem mentioned in the post (we are
| using dismaxrequesthandler): when we are searching for
| q=prdTitle_s:ladybird&qt=dismax, we are getting 2 results - unique key
| ID = 1000 and unique key ID = 1001.

(1) Append debugQuery=true to your query and see how the DisMax query parser rewrites your query, interpreting what you think is a field name as just another query term.

(2) Proceed immediately to read the whole Wiki page explaining DisMax:

http://wiki.apache.org/solr/DisMaxRequestHandler

| Is it possible to just exact match which is nothing but unique key = 1001?

Yes, it is: q=id:1001

(1) Don't use DisMax here; it will not interpret field names.
(2) Replace id by whatever name you gave to your unique key field.

Michael Ludwig
Re: Searching across multivalued fields
MilkDud schrieb:

| To be more specific, I'm indexing a collection of music albums that
| have multiple tracks and an album artist. So, some searches will
| contain both the artist name and the track name. I can't make this a
| single phrase query as it is indexed across two separate fields.

Use the DisMaxRequestHandler and specify all fields you want to use in your query in the qf parameter.

<!-- qf = query fields: list of fields with boost factor -->
<str name="qf">artist^3 album^2 track^1</str>

http://wiki.apache.org/solr/DisMaxRequestHandler

Michael Ludwig
Re: Few Queries regarding indexes in Solr
Otis Gospodnetic schrieb:

| Regarding that 3rd answer below:

Putting it back in context (where it belongs :-) :

| | My (very limited) understanding of shards is that you repartition
| | your documents among shards and send each document to only one
| | shard. (Not sure this is correct.)
|
| Yes, that's what most people do, though nothing prevents the indexing
| client from sending the same doc to multiple shards. In some scenarios
| that's exactly what you want to do.

What kind of scenario would that be?

Michael Ludwig

--
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing in e-mail?
Re: what date format to pass for search in Solr?
chem leakhina schrieb:

| Does anyone know what date format to pass for searching in Solr?

A restricted subset of the W3C datetime format. See:

http://wiki.apache.org/solr/IndexingDates

| Could you give me any examples for searching with dates in solr?

Examples can be very easily found searching for something like solr date range query. For example, see:

http://www.nabble.com/Date-Range-Query-%2B-Fields-to16108517.html

Michael Ludwig
Re: Could solr build two different indexes?
fei dong schrieb:

| I wanna build many instances of solr. My requirement is to satisfy
| different product searches. Could I do that?

Yes. Read all of the following:

Multi-index Design - Chris Masters
http://markmail.org/thread/6p7viwpinrwmj6my

http://wiki.apache.org/solr/MultipleIndexes
http://wiki.apache.org/solr/CoreAdmin

Michael Ludwig
Re: Solr Query | Field:value with dismaxquery
prerna07 schrieb:

| I am facing an issue with a dismax query.
|
| ?q=facetFormat_product_s:Pfqs ePub eBook Sfqs - returns correct results
|
| ?q=facetFormat_product_s:Pfqs ePub eBook Sfqs&qt=dismaxrequest - does
| not return results, although the field facetFormat_product_s is
| defined in the dismaxrequest handler of solrconfig.xml

You mustn't include the field name in the query when sending the query to the DisMax query parser. The field name will be interpreted as just another term to build a query clause from.

| ?q=facetFormat_product_s:Pfqs Cassette Sfqs&qt=dismaxrequest - returns
| correct results

I'd attribute that to the mm (minimum match) parameter, the meaning of which you can understand by reading the following page, which it would probably make a lot of sense to read anyway:

http://wiki.apache.org/solr/DisMaxRequestHandler

Michael Ludwig
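If the field restriction is meant to stay exact while the rest of the query goes through DisMax, one way out is to move the fielded clause into a filter query, which is parsed by the standard parser (the value here is a placeholder):

?q=ePub eBook&qt=dismaxrequest&fq=facetFormat_product_s:ebook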
Re: fq vs. q
Fergus McMenemie schrieb:

| While q= and fq= affect the results portion of a search response, the
| facet.query only affects the facets portion of a response.
| facet.query(s) are only used where you want a facet summary of your
| query based on some kind of complex expression rather than the terms
| within a single field. I added the comment in that I think that a wiki
| page discussing fq vs q should also mention facet.query.

It now does:

http://wiki.apache.org/solr/FilterQueryGuidance

Michael Ludwig
Re: Searching across multivalued fields
MilkDud schrieb:

| Basically, what I am trying to do is index a collection of music for
| an online music store. This contains information on the track, album,
| and artist levels. These are all different object types in the same
| schema and it does contain a lot of redundant information.

What's a document in your case? If I were you, I'd probably organize the data so that each album is one document, because that's what you'd expect (shopping experience).

| For example, a track will have its own listing, but will show up again
| in the album listing and the artist listing for the objects that own
| that track.

Sounds a bit bizarre to me, but then I don't know much about your requirements.

| There are reasons it is done this way as we search/display across the
| three differently.

Hmm.

| That said, I have thought of ways of just indexing tracks and
| maintaining all the relevant information, but that seems to introduce
| its own issues.

An album should be a document and have the following fields (and maybe more, if you have more data attached to it):

id - unique, an identifier
title - album title
interpret - the musician, possibly multi-valued
track - every song or whatever, definitely multi-valued

Michael Ludwig
Re: Few Queries regarding indexes in Solr
Rakhi Khatwani schrieb:

| 1. Is it possible to query from another index folder (say index1) in
| solr?

I think you're looking for the multi-core feature.

http://wiki.apache.org/solr/MultipleIndexes
http://wiki.apache.org/solr/CoreAdmin

| 2. Is it possible to query 2 indexes (folders index1 and index2)
| stored on the same machine using the same port on a single solr
| instance?

Sounds like multi-core.

| 3. Consider a case: I have indexes in 2 shards, and I merge the
| indexes (present in 2 shards) onto the 3rd shard. Now I add more
| documents into shard 1, delete some documents from shard 2, and update
| the indexes. Is it possible to send only the differences to shard 3
| and then merge them at shard 3?

My (very limited) understanding of shards is that you repartition your documents among shards and send each document to only one shard. (Not sure this is correct.)

Michael Ludwig
Re: fq vs. q
Ensdorf Ken schrieb:

| I ran into this very issue recently as we are using a freshness filter
| for our data that can be 6/12/18 months etc. I discovered that even
| though we were only indexing with day-level granularity, we were
| specifying the query by computing a date down to the second, and thus
| virtually every filter was unique. It's amazing how something this
| simple could bring solr to its knees on a large data set.

I want to retrieve documents (TV programs) by a particular date and decided to convert the date to an integer, so I have:

* 20090615
* 20090616
* 20090617

etc. I lose all date logic (timezones) for that date field, but it works for this particular use case, as the date is merely a tag, and not a real date I need to perform more logic on than an integer allows. Also, an integer looks about as efficient as it gets, so I thought it preferable to a date for this use case. YMMV.

I think if you truncate dates to incomplete dates, you effectively also lose all the date logic. You may still apply it, but what would you take the result to mean? You can't regain precision you've decided to drop.

The actual points in time where my TV programs start and end are encoded as a UNIX timestamp with exactitude down to the second, also stored as an integer, as I don't need sub-second precision. This makes sense for my client, which is not Java, but PHP, so it uses the C library strftime and friends, which need UNIX timestamps.

Bottom line, I think it may make perfect sense to store dates and times in integers, depending on your use case and your client.

Michael Ludwig
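For illustration (the field name is invented, and this assumes a sortable integer field type so that range queries compare numerically): an exact day and a day range then become ordinary numeric clauses, e.g.

fq=airdate:20090616
fq=airdate:[20090615 TO 20090617]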
Re: fq vs. q
Fergus McMenemie schrieb:

| The article could explain the difference between fq= and facet.query=
| and when you should use one in preference to the other.

My understanding is that while these query modifiers rely on the same implementation (cached filters) to boost performance, they simply and obviously differ in that fq limits the result set to your filter criterion, whereas facet.query does not restrict the result but instead enhances it with statistical information gained from applying set intersection of result and facet query filters. It looks like facet.query is just a more flexible means of defining a filter than possible using a mere facet.field. Would that be approximately correct?

A question of mine: It appears to me that each facet.query invariably leads to one boolean filter, so if you wanted to do range faceting for a given field and obtain, say, results reduced from their actual continuum of values to three ranges {A,B,C}, you'd have to define three facet.query parameters accordingly. A mere facet.field, on the other hand, creates as many filters as there are unique values in the field. Is that correct?

Michael Ludwig
Re: fq vs. q
Shalin Shekhar Mangar schrieb:

| On Tue, Jun 15, 2009 at 4:39 PM, Michael Ludwig m...@as-guides.com wrote:
| | I think if you truncate dates to incomplete dates, you effectively
| | also lose all the date logic. You may still apply it, but what would
| | you take the result to mean? You can't regain precision you've
| | decided to drop.
|
| Note that with Trie search coming in (see example schema.xml in the
| nightly builds), this rounding may not be necessary any more.

http://svn.apache.org/repos/asf/lucene/solr/trunk/example/solr/conf/schema.xml

Not sure I understand correctly, but this sounds as if, given an integer field and a @precisionStep of 3, the original value is stored along with three copies that omit (1) the last bit, (2) the two last bits, (3) the three last bits. So a given range query might be optimized to an equality query. But I'm not sure I'm on the right track here.

Michael Ludwig
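For reference, the field type declaration this refers to, along the lines of the 1.4 example schema (exact attributes may differ by version):

  <fieldType name="tint" class="solr.TrieIntField" precisionStep="8" omitNorms="true" positionIncrementGap="0"/>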
Re: Joins or subselects in solr
Nasseam Elkarra schrieb:

| I am storing items in an index. Each item has a comma separated list
| of related items. Is it possible to bring back an item and all of its
| related items in one query? If so, how? And how would you distinguish
| between which one is the main item and which are the related ones?

Think about the data structure. You're saying there is a main item, which suggests there is some regularity to the underlying data structure, possibly a tree. If there is a main item, each item should store a reference to the main item. You could then perform a lookup specifying q=mainitem:12345. That would retrieve all items related to 12345 and solve the problem more efficiently than having each item store a list of all its related items.

I'm thinking of small or moderately sized trees here, such as they grow in mailing lists or discussion boards. If it's not a tree, but some less regular graph, then the notion of a main item needs clarification.

Michael Ludwig
Re: fq vs. q
Michael Ludwig schrieb:

| Martin Davidsson schrieb:
| | I've tried to read up on how to decide, when writing a query, what
| | criteria goes in the q parameter and what goes in the fq parameter,
| | to achieve optimal performance. Is there [...] some kind of rule of
| | thumb to help me decide how to split things up when querying against
| | one or more fields.
|
| This is a good question. I don't know if there is any such rule. I'm
| going to sum up my understanding of filter queries hoping that the
| pros will point out any flaws in my assumptions.

I've summarized what I've learnt about filter queries on this page:

http://wiki.apache.org/solr/FilterQueryGuidance

Michael Ludwig
Re: Customizing results
revas schrieb:

| What is GNU gettext and how can this be used in a multilanguage
| scenario?

It's an internationalization technology, so it is well suited to the tasks of internationalizing and localizing applications.

http://www.gnu.org/software/gettext/manual/
http://www.gnu.org/software/gettext/manual/html_node/Why.html

In your case, it might mean that the client is equipped with the language packages it needs and uses the name returned by Solr (likely the English term) to look up the translation by means of Gettext. But it certainly depends very much on your particular setup. It might be overkill for your particular situation.

Michael Ludwig
Re: Build Failed
Mukerjee, Neiloy (Neil) schrieb: When running ant example to do an example configuration, I get the following message: BUILD FAILED /home/stagger2/Solr/apache-solr-1.3.0/common-build.xml:149: Compile failed; see the compiler error output for details. I've tried reading through the files in question, but I can't seem to find the issue. Any suggestions? Run: ant -verbose Michael Ludwig
Re: dismax parsing applied to specific fields
Nick Jenkin schrieb:

| Hi, I was wondering if there is a way of applying dismax parsing to
| specific fields, where there are multiple fields being searched - all
| with different query values, e.g.:
|
| author:(tolkien) AND title:(the lord of the rings)
|
| would be something like:
|
| dismax(author, tolkien) AND dismax(title, the lord of the rings)
|
| I guess this can be thought of as having two separate dismax
| configurations, one searching author and one searching title - and the
| intersection of the results is returned.

http://wiki.apache.org/solr/DisMaxRequestHandler

This says that the DisMaxRequestHandler is simply the standard request handler with the default query parser set to the DisMax Query Parser. So maybe you could program your own CustomDisMaxRequestHandler that reuses the DisMax query parser (and probably other components) to achieve what you want.

Michael Ludwig
Re: Build Failed
Mukerjee, Neiloy (Neil) schrieb:

| Running ant -verbose still doesn't allow me to run an example
| configuration. I get the same error from ant example after getting the
| following from ant -verbose:
|
| Build sequence for target(s) `usage' is [usage]
| usage:
|     [echo] Welcome to the Solr project!
|     [echo] Use 'ant example' to create a runnable example configuration.
|     [echo] And for developers:
|     [echo] Use 'ant clean' to clean compiled files.
|     [echo] Use 'ant compile' to compile the source code.
|     [echo] Use 'ant dist' to build the project WAR and JAR files.
|     [echo] Use 'ant generate-maven-artifacts' to generate maven artifacts.
|     [echo] Use 'ant package' to generate zip, tgz, and maven artifacts for distribution.
|     [echo] Use 'ant test' to run unit tests.
|
| BUILD SUCCESSFUL

You might want to read up on Ant usage in the Ant User Manual, a copy of which should be part of your installation, or can be found on the web. Quick overview: ant -help

When I wrote ant -verbose, I meant ant -verbose your-target, so:

ant -verbose example

Michael Ludwig
Re: Faceting on text fields
Yao Ge schrieb:

| BTW, Carrot2 has a very impressive Clustering Workbench (based on
| Eclipse) that has built-in integration with Solr. If you have a Solr
| service running, it is just a matter of pointing the workbench at it.
| The clustering results and visualization are amazing.
| (http://project.carrot2.org/download.html)

A new world opens up for me ... Thanks for pointing out how cool this is!

Hint for other newcomers: Open the View Menu to configure the details of how you perform your search, e.g. your Solr URL in case it differs from the default, or your summary field, which is what gets used to analyze the data in order to determine clusters, if I understand correctly.

Michael Ludwig
Re: fq vs. q
Fergus McMenemie schrieb:

| On Tue, Jun 9, 2009 at 7:25 PM, Michael Ludwig m...@as-guides.com wrote:
| | A filter query is cached, which means that it is the more useful the
| | more often it is repeated. We know how often certain queries arise,
| | or at least have the means to collect that data - so we know what
| | might be candidates for filtering.
|
| Sorry, but I can't make any sense of the above. Could you have another
| go at explaining it?

Filtering a given query result R on bla:eins, bla:zwei, bla:drei or bla:vier is very common in my application. So while I could include this criterion in my main query (q) and hope for the queryResultCache to kick in, this would be unlikely to be efficient, as my primary query, which gave me R, likely varies a lot, resulting in a high number of distinct queries, with relatively low probability for a given query to occur frequently. So each of these query result sets would enter the queryResultCache as a distinct set, hence high contention, high eviction rate, poor cache efficiency.

Now I'm going to factor out those bla:{eins,zwei,drei,vier} filters from my primary query (q) and put them in the filter query (fq). The benefit is double:

(1) Solr has a dedicated cache space for filters, the usage of which I control by my usage of the filter query (fq). I can set up things so the usage of the primary query (q) is under the user's control while the usage of the filter query (fq) is under my application's control. I control this cache, I ensure its efficiency.

(2) Factoring out the filter query bla:{eins,zwei,drei,vier} from the primary query also reduces variation in the primary query, thus making the queryResultCache more efficient.

So instead of having a large number of distinct primary queries, no usage of the filterCache, and poor usage of the queryResultCache, I may have only, say, 3000 distinct primary queries, four cached filters in the filterCache (bla:{eins,zwei,drei,vier}), and a somewhat better usage of the queryResultCache.

I wrote that we know how often certain queries arise, or at least have the means to collect that data, because we know the application we're writing. So we either know the frequency of a given search pattern based on the usage our application makes of Solr and on the restrictions it imposes on the user by, say, using DisMax; or - if we give the user fine-grained control over the query language - we may somehow collect and analyze the actual queries in order to empirically determine actual search engine usage and optimize accordingly.

The result of a filter query is cached and then used to filter a primary query result using set intersection. If my filter query result comprises more than 50 % of the entire document collection, its selectivity is poor. I might need it despite this fact, but it might also be worthwhile thinking about how to reframe the requirement, allowing for more efficient filters.

| So, just to be explicit, if I have a query containing:
|
| fq=EventType:fair&fq=EventType:film&fq=LAT:[50 TO 60]&fq=LONG:[-1 TO 1]
|
| the first time this is encountered it is going to cause four queries
| of the entire index and cause four sets of document IDs to be cached.
| Subsequent queries will reuse the various cached entries as
| appropriate. Is that correct?

I do think so.

| I guess in the above case where my GEO search window will keep
| changing I should ideally arrange that the lat and long element is
| added to the q parameter to stop my cache being cluttered.

My understanding is that what varies heavily should *not* go into the filterCache. Your GEO search window might vary quite a bit (probably much more than EventType), so to me it looks like a candidate for the main query.

| Also, what happens when the filter cache is full? Is there any
| accounting of which cache entries are getting the most or most recent
| hits?

Good question!

Michael Ludwig
Re: Faceting on text fields
Yonik Seeley schrieb:

| Yep, all that sounds right. An additional optimization counts terms
| for the documents *not* in the set when the base set is over half the
| size of the index.

Cool :-) Thanks for confirming my assumptions!

Michael Ludwig
Re: Faceting on text fields
Otis Gospodnetic schrieb: Solr can already cluster top N hits using Carrot2: http://wiki.apache.org/solr/ClusteringComponent Would it be fair to say that clustering as detailed on the page you're referring to is a kind of dynamic faceting? The faceting not being done based on distinct values of certain fields, but on the presence (and frequency) of terms in one field? The main difference seems to be that with faceting, grouping criteria (facets) are known beforehand, while with clustering, grouping criteria (the significant terms which create clusters - the cluster keys) have yet to be determined. Is that a correct assessment? Michael Ludwig
Re: Customizing results
Manepalli, Kalyan schrieb:

| Hi, I am trying to customize the response that I receive from Solr. In
| the index I have multiple fields that contain the same data in
| different languages. At query time the client specifies the language.
| Based on this param, I want to return the value, copied into a
| different field. E.g.:
|
| <str name="location_da_dk">Lubang, Filippinerne</str>
| <str name="location_de_de">Lubang, Philippinen</str>
| <str name="location_en_us">Lubang, Philippines</str>
| <str name="location_es_es">Lubang, Filipinas</str>
|
| If the user specifies the language as de_de, then I want to return the
| result as:
|
| <str name="location">Lubang, Philippinen</str>

If you control how the client works, you could also consider using an internationalization technology such as GNU Gettext for this purpose. May or may not make sense in your particular situation.

Michael Ludwig
Re: How to disable posting updates from a remote server
ashokc schrieb: I find that I am freely able to post to my production SOLR server, from any other host that can run the post command. So somebody can wipe out the whole index by posting a delete query. Control this at the IP level, have your server listen on 127.0.0.1 or on a private subnet address. Michael Ludwig
Re: Solr relevancy score - conversion
Vijay_here schrieb:

| Would need a more proportionate score, like rounded to 100% (95%
| relevant, 80% relevant and so on). Is there a way to make solr return
| such relevance scores?

In XSLT:

<xsl:template match="result/doc">
  <xsl:variable name="score-percentage"
      select="round(100 * float[@name='score'] div ../@maxScore)"/>

The div is the XPath division operator. Should be a straightforward mapping to any other language.

Michael Ludwig
Re: copyfield and 'store' and highlighting
ashokc schrieb: Do I have to declare 'field1' also to be stored? 'field1' is never returned in the response. I find the following Wiki page helpful when dealing with @stored, @indexed and friends: http://wiki.apache.org/solr/FieldOptionsByUseCase Michael Ludwig
Re: fq vs. q
Martin Davidsson schrieb:

| I've tried to read up on how to decide, when writing a query, what
| criteria goes in the q parameter and what goes in the fq parameter, to
| achieve optimal performance. Is there [...] some kind of rule of thumb
| to help me decide how to split things up when querying against one or
| more fields.

This is a good question. I don't know if there is any such rule. I'm going to sum up my understanding of filter queries hoping that the pros will point out any flaws in my assumptions.

http://wiki.apache.org/solr/SolrCaching - filterCache

A filter query is cached, which means that it is the more useful the more often it is repeated. We know how often certain queries arise, or at least have the means to collect that data - so we know what might be candidates for filtering.

The result of a filter query is cached and then used to filter a primary query result using set intersection. If my filter query result comprises more than 50 % of the entire document collection, its selectivity is poor. I might need it despite this fact, but it might also be worthwhile thinking about how to reframe the requirement, allowing for more efficient filters.

Memory consumption is probably not a great concern here as the cache stores only document IDs. (And if those are integers, it's just 4 bytes each.) So having 100 filters containing 100,000 items on average, the memory consumption increase should be around 40 MB. By the way, are these document IDs (used in filterCache, documentCache, queryResultCache) the ones I configure in schema.xml, or does Solr map my IDs to integers in order to ensure efficiency?

A filter query should probably be orthogonal to the primary query, which means in plain English: unrelated to the primary query. To give an example, I have a field category, which is a required field. In the class of searches where I use a filter on that field, the primary search is for something entirely different, so in most cases it will not, or not necessarily, bias the primary result to any particular distribution of the category values. I then allow the application to apply filtering by category, incidentally, using faceting, which is a typical usage pattern, I guess.

Michael Ludwig
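To make the category example concrete (field name and value invented): the user's terms go into q, while the application adds the filter and the facets on its own account, e.g.

http://localhost:8983/solr/select?q=user+terms&fq=category:sport&facet=true&facet.field=category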
filterCache/@size, queryResultCache/@size, documentCache/@size
Common cache configuration parameters include @size (size attribute). http://wiki.apache.org/solr/SolrCaching For each of the following, does this mean the maximum size of: * filterCache/@size - filter query results? * queryResultCache/@size - query results? * documentCache/@size - documents? So if I know my tiny documents don't take up much memory (just 500 Bytes on average), I'd want to have very different settings for the documentCache than if I decided to store 10 KB per doc in Solr? And if I know that only 100 filters are possible, there is no point raising the filterCache/@size above that threshold? Given the following three filtering scenarios of (a) x:bla, (b) y:blub, and (c) x:bla AND y:blub, will I end up with two or three distinct filters? In other words, may filters be composites or are they decomposed as far as their number (relevant for @size) is concerned? Michael Ludwig
Re: filter on millions of IDs from external query
Ryan McKinley schrieb:

| I am working with an index of ~10 million documents. The index does
| not change often. I need to perform some external search criteria that
| will return some number of results -- this search could take up to 5
| mins and return anywhere from 0-10M docs.

If it really takes so long, then something is likely wrong. You might be able to achieve a significant improvement by reframing your requirement.

| I would like to use the output of this long running query as a filter
| in solr. Any suggestions on how to wire this all together?

Just use it as a filter query. The result will be cached, and the query won't have to be executed again (if I'm not mistaken) until a new index searcher is opened (after an index update and a commit), or until the filter query result is evicted from the cache, which you should make sure won't happen if your query really is so terribly expensive.

Michael Ludwig
Re: Field Compression
Fer-Bj schrieb:

| For all the documents we have a field called small_body, which is a
| 60-chars-max text field where we store the abstract for each article.
| We need to display this small_body, which we want to compress every
| time.

If this works like compressing individual files, the overhead for just 60 characters (which may be no more than 60 bytes) may mean that any attempt at compression results in inflation. On the other hand, if lower-level units (pages) are compressed (as opposed to individual fields), then I don't know what sense a configurable compression threshold might make. Maybe one of the pros can clarify.

| Last question: what's the best way to determine the compress threshold?

One fairly obvious way would be to index the same set of documents twice, with compression and then without, and then to compare the index size on disk. If you don't save, say, five or ten percent (YMMV), it might not be worth the effort.

Michael Ludwig
Re: Faceting on text fields
Yao Ge schrieb:

| The facet query is considerably slower compared to other facets from
| structured database fields (with highly repeated values). What I found
| interesting is that even after I constrained the search results to
| just a few hundred hits using other facets, these text facets are
| still very slow. I understand that text fields are not good candidates
| for faceting as they can contain a very large number of unique values.
| However, why is it still slow after my matching documents are reduced
| to hundreds? Is it because the whole filter is cached (regardless of
| the matching docs) and I don't have enough filter cache size to fit
| the whole list?

Very interesting questions! I think an answer would both require and further an understanding of how filters work, which might even lead to a more general guideline on when and how to use filters and facets. Even though faceting appears to have changed in 1.4 vs 1.3, it would still be interesting to understand the 1.3 side of things.

| Lastly, what I really want is to give the user a chance to visualize
| and filter on the top relevant words in the free-text fields. Are
| there alternatives to the facet field approach? Term vectors? I can do
| client-side processing based on the top N (say 100) hits for this, but
| it is my last option.

Also a very interesting data mining question! I'm sorry I don't have any answers for you. Maybe someone else does.

Best,

Michael Ludwig
Re: Faceting on text fields
Yonik Seeley schrieb:

| Are you using Solr 1.3? You might want to try the latest 1.4 test
| build - faceting has changed a lot.

I found two significant changes (but there may well be more):

[#SOLR-911] multi-select facets - ASF JIRA
https://issues.apache.org/jira/browse/SOLR-911

| Yao, it sounds like the following (which is in 1.4) might have a
| chance of helping your faceting performance issue:

[#SOLR-475] multi-valued faceting via un-inverted field - ASF JIRA
https://issues.apache.org/jira/browse/SOLR-475

Yonik, from your initial comment for SOLR-475:

| * To save space and speed up faceting, any term that matches enough
| * documents will not be un-inverted... it will be skipped while
| * building the un-inverted field structure, and will use a set
| * intersection method during faceting.

Does this mean that frequently occurring terms (which we can use for faceting in 1.3 without problems) are handled exactly as they were before, by allocating a slot in the filter cache upon request, while those zillions of pesky little fringe terms outside the mainstream, for which allocating a slot in the filter cache would be overkill (and possibly cause inefficient contention, eviction and, hence, a performance penalty), are now handled by the new structure mapping documents to term numbers?

So doing faceting for a given set of documents would result in (a) doing set intersection using those filter query results that have been set up (for the terms occurring in many documents), and (b) collecting all the pesky little terms from the new structure mapping documents to term numbers?

So basically, depending on expediency, you (a) know the facets and count the documents which display them, or you (b) take the documents and see what facets they have?

Michael Ludwig
Re: statistics about word distances in solr
Moin Jens,

Jens Fischer schrieb:

| I was wondering if there's an option to return statistics about
| distances from the query terms to the most frequent terms in the
| result documents. The additional information I'm looking for is the
| average distance between these terms and my search term.
|
| So let's say I have two docs:
|
| the house is red
| I live in a red house
|
| The search for house should also return the info:
|
| the:1 is:1 red:1.5 I:5 live:4

Could you explain what the distance here is? Something like edit distance? Ah, I see: You want the textual distance between the search term and other terms in the document, and then you want that averaged, i.e. the cumulative distance divided by the number of occurrences.

No idea if that functionality is available. However, the sort of calculation you want to perform requires the engine to not only collect all the terms to present as facets (much improved in 1.4, as I've just learned), but to also analyze each document (if I'm not mistaken) to determine the distance for each facet term from your primary query term. (Or terms.) The number of lookup operations is likely to scale as the product of the number of your primary search results, the number of your search terms, and the number of your facets. I assume this is an expensive operation.

Michael Ludwig
Re: fq vs. q
Shalin Shekhar Mangar schrieb:

| On Tue, Jun 9, 2009 at 7:25 PM, Michael Ludwig m...@as-guides.com wrote:
| | A filter query should probably be orthogonal to the primary query,
| | which means in plain English: unrelated to the primary query. To
| | give an example, I have a field category, which is a required field.
| | In the class of searches where I use a filter on that field, the
| | primary search is for something entirely different, so in most
| | cases, it will not, or not necessarily, bias the primary result to
| | any particular distribution of the category values. I then allow the
| | application to apply filtering by category, incidentally, using
| | faceting, which is a typical usage pattern, I guess.
|
| Yes and no. There are use-cases where the query is applicable only to
| the filtered set. For example, when the same index contains many
| different types of documents. It is just that the intersection may
| need to do more or less work.

Sorry, I don't understand. I used to think that the engine applies the filter to the primary query result. What you're saying here sounds as if it could also pre-filter my document collection to then apply a query to it (which should yield the same result). What does it mean that the query is applicable only to the filtered set?

And thanks for having clarified the other points!

Michael Ludwig
Re: filterCache/@size, queryResultCache/@size, documentCache/@size
Shalin Shekhar Mangar schrieb: On Tue, Jun 9, 2009 at 7:47 PM, Michael Ludwig m...@as-guides.com wrote: Given the following three filtering scenarios of (a) x:bla, (b) y:blub, and (c) x:bla AND y:blub, will I end up with two or three distinct filters? In other words, may filters be composites or are they decomposed as far as their number (relevant for @size) is concerned? It will be three. If you want to cache separately, send them as separate fq parameters. Thanks a lot for clarifying all my questions. Michael Ludwig
Re: fq vs. q
Shalin Shekhar Mangar schrieb: No, both filters and queries are computed on the entire index. My comment was related to the A filter query should probably be orthogonal to the primary query... part. I meant that both kinds of use-cases are common. Got it. Thanks :-) Michael Ludwig
Re: SpellCheckComponent: queryAnalyzerFieldType
Shalin Shekhar Mangar schrieb: Is it correct to say that when I intend to always use the spellcheck.q parameter I do not need to specify a queryAnalyzerFieldType in my spellcheck searchComponent, which I define in solrconfig.xml? Yes, that is correct. Even if a queryAnalyzerFieldType is not specified and your query uses q, then WhitespaceTokenizer is used by default. Thanks for clarifying. SpellingQueryConverter was written for a very simple use-case dealing with ASCII only. But there is no reason why we cannot extend it to cover the full UTF-8 set. Can you please open an issue and if possible, give a patch? Please see: https://issues.apache.org/jira/browse/SOLR-1204 Regards, Michael Ludwig
Re: spell checking
Walter Underwood schrieb: query suggest --wunder That's very good. On the other hand, I noticed how the term spellcheck is spread all over the place, and that would be a massive renaming orgy. An explanation at the appropriate place in the documentation is less invasive. I added two sentences to the Introduction of: http://wiki.apache.org/solr/SpellCheckComponent Michael Ludwig
Re: spell checking
Yao Ge schrieb: Maybe we should call this alternative search terms or suggested search terms instead of spell checking. It is misleading as there is no right or wrong in spelling, there is only popular (term frequency?) alternatives. I had exactly the same difficulty in understanding the concept because of the name given to the feature, which usually denotes just what it says, i.e. a spellchecker, which is driven by an authoritative dictionary and a set of rules, as integrated in word processors, in order to ensure orthography. What we have here is quite different from a spellchecker. IMHO, a name conveying the actual meaning, along the lines of suggest, would make more sense. Michael Ludwig
SpellCheckComponent: queryAnalyzerFieldType
Shalin Shekhar Mangar wrote: | If you use spellcheck.q parameter for specifying | the spelling query, then the field's analyzer will | be used [...] If you use the q parameter, then the | SpellingQueryConverter is used. http://markmail.org/message/k35r7qmpatjvllsc - message http://markmail.org/thread/gypvpfnsd5sggkpx - whole thread Is it correct to say that when I intend to always use the spellcheck.q parameter I do not need to specify a queryAnalyzerFieldType in my spellcheck searchComponent, which I define in solrconfig.xml? Given the limitations of the SpellingQueryConverter laid out in the thread referred to above, it seems you want to use the spellcheck.q parameter for anything but what can be encoded in ASCII. Is that true? Michael Ludwig
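For illustration, a request along these lines (host, core and component wiring as in the SpellCheckComponent documentation; the details are placeholders):

http://localhost:8983/solr/select?q=Käse&spellcheck=true&spellcheck.q=Käse&spellcheck.count=1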
Re: French and SpellingQueryConverter
Shalin Shekhar Mangar schrieb:

| On Mon, May 11, 2009 at 2:46 PM, Michael Ludwig m...@as-guides.com wrote:
| | Could you give an example of how the spellcheck.q parameter can be
| | brought into play (to take non-ASCII characters into account, so
| | that Käse isn't mishandled) given the following example:
|
| You will need to set the correct tokenizer and filters for your field
| which can handle your language correctly. Look at the GermanAnalyzer
| in Lucene contrib-analysis. It uses StandardTokenizer, StandardFilter,
| LowerCaseFilter, StopFilter, GermanStemFilter with a custom stopword
| list.

Hello Shalin,

thanks for your kind answer, and sorry for my delay in responding. Due to my newbieness in this domain, I misphrased my question. What I wanted to say (and Jonathan, too, I think) is that the regular expression in that SpellingQueryConverter only deals with ASCII, which is insufficient for most languages, including French and German.

I think the regular expression in SpellingQueryConverter should be something like:

(?:(?!(\w+:|\d+)))[\p{javaLowerCase}\p{javaUpperCase}\d_]+

vs.

(?:(?!(\w+:|\d+)))\w+

Then, correct German and French TokenStreams are generated in the example program I posted. But I may well have misunderstood the purpose of this class. You will know.

Michael Ludwig
Re: French and SpellingQueryConverter
Jonathan Mamou schrieb:

| Thanks Michael for your answer! I think that
| (?:(?!(\w+:|\d+)))[\p{L}]+ should also be OK.

Oh yes, that's much simpler and clearer than my suggestion. (Newbieness factor for Java-style regular expressions, too.) Or maybe this:

(?:(?!(\w+:|\d+)))[\p{L}\d_]+

:-)

Michael Ludwig
Re: Replication master+slave
Bryan Talbot schrieb: | So how are people managing solrconfig.xml files which are largely the same other than differences for replication? I don't think it's a good thing to maintain two copies of the same file and I'd like to avoid that. Maybe enabling the XInclude feature in DocumentBuilders would make it possible to modularize configuration files to make this possible?

This is already possible using the XML feature called entities, more precisely external general parsed entities (EGPE). I've never seen a parser that doesn't do entities.

C:\MILU\dev\XML # type egpe-net.xml
<!DOCTYPE Urmel [
  <!ENTITY egpe_from_the_net SYSTEM "http://lobster.as-guides.com/ds/solr.schema.ent">
  <!ENTITY egpe_from_the_local_disk SYSTEM "egpe-local.ent">
]>
<Urmel>
&egpe_from_the_net;
&egpe_from_the_local_disk;
</Urmel>

C:\MILU\dev\XML # type egpe-local.ent
<eins/>
<zwei/>
<drei/>

Michael Ludwig
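And for completeness, a minimal sketch of the XInclude route mentioned in the question, assuming a JAXP parser that supports it (the two factory flags are standard JAXP; the file name is invented):

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class XIncludeDemo {
    public static void main(String[] args) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);   // XInclude requires namespace support
        dbf.setXIncludeAware(true);    // resolve xi:include elements while parsing
        DocumentBuilder db = dbf.newDocumentBuilder();
        Document doc = db.parse("solrconfig-with-includes.xml");
        System.out.println(doc.getDocumentElement().getTagName());
    }
}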
Re: Selective Searches Based on User Identity
Terence Gannon schrieb: | Paul -- thanks for the reply, I appreciate it. That's a very practical approach, and is worth taking a closer look at. Actually, taking your idea one step further, perhaps three fields: 1) ownerUid (uid of the document's owner), 2) grantedUid (uid of users who have been granted access), and 3) deniedUid (uid of users specifically denied access to the document).

Grants might change quite a bit; the owner will likely remain the same. Wouldn't it be better to include only the owner in the document and store grants someplace else, like in an RDBMS or - if you don't want one - a lightweight embedded database like BDB? That way you could have your application tag an ineluctable filter query onto each and every user query, which would ensure that the results include only those documents whose owner has granted the user access.

Considering that I'm a Solr/Lucene newbie, this approach might have a disadvantage that escapes me, which is why other people haven't made this particular suggestion. If so, I'd be happy to learn why this isn't preferable.

Michael Ludwig
Re: Selective Searches Based on User Identity
Hi Terence,

Terence Gannon schrieb: | Yes, the ownerUid will likely be assigned once and never changed. But you still need it, in order to keep track of who has contributed which document.

Yes, of course!

| I've been going over some of the simpler query scenarios, and Solr is capable of handling them without having to resort to an external RDBMS.

The database is only to store grants - it's not to help with searching. It would look like this:

grantee | grant
--------+-----------------
fritz   | fred,frank,egon
frank   | egon,fritz
egon    | terence,frank
...

Each user is granted access to his own documents and to those he has received grants for.

| In order to limit documents to those which a given user owns, or those to which he has been granted access, the syntax fragment would be something like: ownerUid:ab2734 or grantedUid:ab2734

I think it could be: ownerUid:egon OR ownerUid:terence OR ownerUid:frank

No need to embed grants in the document. Ah, I see my mistake now. You want grants based on the document, not on the user - I had overlooked that fact. That makes my suggestion invalid.

| I'll plead ignorance of the 'ineluctable filter query' and will have to read up on that one.

I meant a filter query that the application tags onto the query on behalf of the user, without the user being able to do anything about it, so that he cannot circumvent the filter.

Best regards, Michael Ludwig
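A minimal SolrJ sketch of such an application-enforced filter, assuming the grants for user egon have already been fetched from the external store (the field name ownerUid is from the thread; everything else is invented):

import org.apache.solr.client.solrj.SolrQuery;

public class GrantFilterDemo {
    public static void main(String[] args) {
        // The query text comes from the user ...
        SolrQuery q = new SolrQuery("jacke");
        // ... the filter is appended by the application on the user's
        // behalf, built from the grants row for "egon" (terence, frank).
        q.addFilterQuery("ownerUid:egon OR ownerUid:terence OR ownerUid:frank");
        System.out.println(q); // q=jacke&fq=ownerUid:egon+OR+...
    }
}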
Re: French and SpellingQueryConverter
Shalin Shekhar Mangar schrieb:
| On Fri, May 8, 2009 at 2:14 AM, Jonathan Mamou ma...@il.ibm.com wrote:
| > SpellingQueryConverter always splits words with special characters. I think that the issue is in the SpellingQueryConverter class, Pattern.compile("(?:(?!(\\w+:|\\d+)))\\w+"). According to http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html, \w means a word character: [a-zA-Z_0-9]. I think that special characters should also be added to the regex. Same issue for the GermanAnalyzer as for the FrenchAnalyzer.
|
| http://wiki.apache.org/solr/SpellCheckComponent says: The SpellingQueryConverter class does not deal properly with non-ASCII characters. In this case, you have either to use spellcheck.q, or to implement your own QueryConverter. If you use the spellcheck.q parameter for specifying the spelling query, then the field's analyzer will be used (in this case, the FrenchAnalyzer). If you use the q parameter, then the SpellingQueryConverter is used.

Could you give an example of how the spellcheck.q parameter can be brought into play to take non-ASCII characters into account, so that Käse isn't mishandled, given the following example:

package org.apache.solr.spelling;

import org.apache.lucene.analysis.de.GermanAnalyzer;

public class GermanTest {
    public static void main(String[] args) {
        SpellingQueryConverter sqc = new SpellingQueryConverter();
        sqc.analyzer = new GermanAnalyzer();
        System.out.println(sqc.convert("Käse"));
    }
}

Note the result of the above, which is plain wrong, reads:

[(k,0,1,type=ALPHANUM), (se,2,4,type=ALPHANUM)]

Thanks. Michael Ludwig
Organizing multiple searchers around overlapping subsets of data
I have one type of document, but different searchers, each of which is interested in a different subset of the documents, which are different configurations of TV channels {A,B,C,D}.

* Application S1 is interested in all channels, i.e. {A,B,C,D}.
* Application S2 is interested in {A,B,C}.
* Application S3 is interested in {A,C,D}.
* Application S4 is interested in {B,D}.

As can be seen from this simplified example, the subsets are not disjoint, but do have considerable overlaps. The total data volume is only about 200 MB. There are four searchers, and they may become ten or a dozen. The set elements an application may or may not be interested in, however - the channels, which are {A,B,C,D} in this example - number not just four, but about 150, each of which has about 1000 documents.

What is the best way to organize this?

(a) Set up different cores for each application, i.e. go multi-core, thereby incurring a good deal of redundancy, but simplifying searches?
(b) Apply filter queries to select documents from only, say, 60, 80 or 110 out of 150 channels (see the sketch below)?
(c) Something else I'm not aware of?

Am I right in suspecting that multi-core makes less sense with increasing overlaps and hence redundancy?

Michael Ludwig
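For option (b), a hypothetical filter query (the field name channel is invented) that application S3 would tag onto every request:

q=<user query>&fq=channel:(A OR C OR D)

With about 150 channels, an application interested in, say, 110 of them would carry a correspondingly longer fq clause.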
Re: What are the Unicode encodings supported by Solr?
KK schrieb: | I'd like to know about the different Unicode[/any other?] encodings supported by Solr for posting docs [thru Solrj in my case]. Is it just UTF-8 and UCN that are supported, or are other character encodings like NCR(decimal), NCR(hex) etc supported as well?

Any numerical character reference (NCR), decimal or hexadecimal, is plain ASCII and therefore valid in a UTF-8 document, as long as it maps to a valid Unicode character.

| I found that for most of the pages the encoding is UTF-8 [in this case searching works fine] but for others the encoding is some other character encoding [like NCR(dec), NCR(hex) or might be something else, don't have much idea on this].

Whatever the encoding is, your application needs to know what it is when dealing with bytes read from the network.

| So when I fetch the page content thru java methods using InputStreamReaders and after stripping various tags what I obtained is raw text with some encoding not getting supported by Solr.

Did you make sure not to rely on your platform's default encoding (Charset) when constructing the InputStreamReader? If in doubt, take a look at the InputStreamReader constructors.

Michael Ludwig
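A minimal sketch of constructing the reader with an explicit charset instead of the platform default (the URL and the charset are assumptions; use whatever the server actually declares):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.Charset;

public class ExplicitCharsetDemo {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://example.com/page.html");
        // Decode the byte stream as UTF-8 instead of the platform default.
        BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), Charset.forName("UTF-8")));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(line);
        }
        in.close();
    }
}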
Re: Multi-index Design
Matt Weber schrieb: | http://wiki.apache.org/solr/MultipleIndexes

Thanks, Matt. Your explanation and the pointer to the Wiki have clarified things for me.

Michael Ludwig
Re: schema.xml: default values for @indexed and @stored
Otis Gospodnetic schrieb: | Attribute values for fields should be inherited from attribute values of their field types.

Thanks, that answers my question pertaining to @indexed and @stored in the fieldtype and field elements in schema.xml.

Michael Ludwig
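An illustrative schema.xml fragment (the type and field names are invented) of how such inheritance would play out:

<fieldtype name="text_de" class="solr.TextField" indexed="true" stored="true"/>

<!-- title inherits indexed="true" and stored="true" from text_de -->
<field name="title" type="text_de"/>

<!-- body overrides the stored value inherited from the type -->
<field name="body" type="text_de" stored="false"/>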
Re: unable to run the solr in tomcat 5.0
uday kumar maddigatla schrieb: | Hi, I'm new to this Solr. I got the distribution of Solr and placed the war file in tomcat/webapps. After that I don't know what to do. I got confused while reading the installation notes given in the wiki.

It might be easier for you to follow the instructions in the tutorial and run Solr in Jetty as per the distribution, which works out of the box:

http://lucene.apache.org/solr/tutorial.html

Michael Ludwig
Re: unable to run the solr in tomcat 5.0
uday kumar maddigatla schrieb: | The link shows things in Jetty, but I'm using Tomcat. If I run the command which is given in the link, it is trying to post the indexes at port number 8983. But in my case my Tomcat is running on 8080. Where do I change the port?

That's a basic Tomcat question. The answer is: in your Tomcat's server.xml configuration file. Look here:

http://tomcat.apache.org/tomcat-6.0-doc/config/

Then, look for the port parameter here:

http://tomcat.apache.org/tomcat-6.0-doc/config/http.html

You could also change the port in the address bar of your browser. Or even do a string replacement s/8983/8080/g on the Solr doc you're viewing.

Michael Ludwig
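For reference, the relevant fragment of Tomcat's conf/server.xml (attributes abbreviated; see the config docs linked above for the full set):

<!-- Tomcat listens where the HTTP Connector says; change the port value -->
<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           redirectPort="8443" />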
Re: unable to run the solr in tomcat 5.0
uday kumar maddigatla schrieb: | My intention is to use 8080 as port. Is there any other way that Solr will post the files on port 8080?

Solr doesn't post, it listens. Use the curl utility as indicated in the documentation:

http://wiki.apache.org/solr/UpdateXmlMessages

Michael Ludwig
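A hypothetical curl invocation against a Tomcat on port 8080 (the /solr context path and the file name assume the webapp is deployed under that name):

curl http://localhost:8080/solr/update -H "Content-Type: text/xml" --data-binary @mydocs.xml
curl http://localhost:8080/solr/update -H "Content-Type: text/xml" --data-binary "<commit/>"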
Re: unable to run the solr in tomcat 5.0
uday kumar maddigatla schrieb: | When I try to use the command java -jar post.jar *.*, it is trying to post files to the Solr which is there on port 8983.

The post.jar seems to be hardcoded to port 8983; that's why I pointed you to the curl utility, which lets you specify any port and address you can dream up. Seriously, read the docs, it'll help you :-)

Michael Ludwig
Re: Multi-index Design
Chris Masters schrieb: | - flatten the searchable objects as much as I can - use a type field to distinguish - into a single index - use multi-core approach to segregate domains of data

Some newbie questions:

(1) What is a type field? Is it to designate different types of documents, e.g. product descriptions and forum postings?
(2) Would I include such a type field in the data I send to the update facility, and maybe configure Solr to take special action depending on the value of that type field?
(3) Like writing the processing results to a domain dedicated to that type of data, which I could limit my search to, as per Otis' post?
(4) And is that what's called a core here?
(5) Or, failing (3), and lumping everything together in one search domain (core?), would I use that type field to limit my search to a particular type of data (along the lines of the example below)?

Michael Ludwig
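To illustrate what I mean in (5), a hypothetical request (field name and value invented):

q=jacke&fq=type:productDescription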
schema.xml: default values for @indexed and @stored
From the apache-solr-1.3.0\example\solr\conf\schema.xml file:

<!-- since fields of this type are by default not stored or indexed,
     any data added to them will be ignored outright -->
<fieldtype name="ignored" stored="false" indexed="false" class="solr.StrField" />

So for both fieldtype/@stored and fieldtype/@indexed, the default is true, correct?

And does the fieldtype configuration constitute a default for field, so that field/@stored and field/@indexed take their effective values according to field/@type? Or do these default to true regardless of what's specified in the respective fieldtype?

Michael Ludwig
Re: Problem adding unicoded docs to Solr through SolrJ
ahmed baseet schrieb: | I tried something stupid but working though. I first converted the whole string to a byte array and then used that byte array to create a new UTF-8 encoded string, like this:
|
| // Encode in Unicode UTF-8
| byte[] utfEncodeByteArray = textOnly.getBytes();

This yields a sequence of bytes using the platform's default charset, which may not be UTF-8. Check:

* String#getBytes()
* String#getBytes(String charsetName)

| String utfString = new String(utfEncodeByteArray, Charset.forName("UTF-8"));

Note that strings in Java are always internally encoded in UTF-16, so it doesn't make much sense to call it utfString, especially if you think that it is encoded in UTF-8, which it is not. The above operation is only guaranteed to succeed without losing data (resulting in ? in the output) when the sequence of bytes is valid as UTF-8, i.e. in this case when your platform encoding, which you've relied upon, is UTF-8.

| Then I passed the utfString to the function for posting to Solr and it works perfectly. But is there any intelligent way of doing all this, like going straight from a default encoded string to a UTF-8 encoded string, without going via a byte array?

It is a feature of java.lang.String that you don't need to know the encoding, as the string contains characters, not bytes. Only for input and output are you concerned with encoding. So where you're dealing with encodings, you're dealing with bytes. And when dealing with bytes on the wire, you're likely concerned with encodings, for example when the page you read via HTTP comes with a Content-Type header specifying the encoding, or when you send documents to the Solr indexer.

For more intelligent ways, you could take a look at the class java.nio.charset.Charset and the methods encode, decode, newEncoder and newDecoder.

Michael Ludwig
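A small sketch of those Charset methods (the sample string is my own):

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;

public class CharsetDemo {
    public static void main(String[] args) {
        Charset utf8 = Charset.forName("UTF-8");
        // Characters to bytes: encode
        ByteBuffer bytes = utf8.encode("Käse");
        System.out.println(bytes.remaining() + " bytes"); // 5 bytes: ä takes two
        // Bytes back to characters: decode
        CharBuffer chars = utf8.decode(bytes);
        System.out.println(chars); // Käse
    }
}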
Re: UTF8 compatibility
Muhammed Sameer schrieb: | We run post.jar periodically, ie after every 15 mins, to commit the changes. Is this approach correct?

Sounds reasonable to me.

| SimplePostTool: WARNING: Make sure your XML documents are encoded in UTF-8, other encodings are not currently supported

That's just to remind you not to try and post documents in another encoding. This seems to be a limitation of the SimplePostTool, not of Solr. I guess the reason is that in order for Solr to work quickly and reliably, it relies on the Content-Type of the request to determine the encoding. If, for example, you send XML encoded in ISO-8859-1, you have to specify that in two places:

* XML declaration: <?xml version="1.0" encoding="ISO-8859-1"?>
* HTTP header: Content-Type: text/xml; charset=ISO-8859-1

The SimplePostTool, however, being just what the name says, may not bother to read the encoding from the document and bring the HTTP content type header in line. Instead, it explicitly requests UTF-8, probably in the interest of simplicity. Well, that's just my theory. Can anyone confirm?

| So I tried to run the test_utf8.sh script and got the following output
{code}
Solr server is up.
HTTP GET is accepting UTF-8
HTTP POST is accepting UTF-8
HTTP POST defaults to UTF-8
ERROR: HTTP GET is not accepting UTF-8 beyond the basic multilingual plane
ERROR: HTTP POST is not accepting UTF-8 beyond the basic multilingual plane
ERROR: HTTP POST + URL params is not accepting UTF-8 beyond the basic multilingual plane
{code}
| Are these errors normal or do I need to change something?

I'm seeing the same output; don't worry, those are just some tests. It is possible to have Solr index documents containing characters outside of the BMP (Basic Multilingual Plane), which can be verified by posting something like this:

<add>
  <doc>
    <field name="id">1001</field>
    <field name="title">BMP plus 1 &#x10001;</field>
  </doc>
</add>

Maybe the test script output says that such characters cannot be used for querying. Hardly relevant if you consider that the BMP comprises even languages such as Telugu, Bopomofo and French.

Best, Michael Ludwig
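To illustrate the two places with a hypothetical curl call (the file name is made up; note the charset in the header matching the XML declaration):

curl http://localhost:8983/solr/update -H "Content-Type: text/xml; charset=ISO-8859-1" --data-binary @docs-latin1.xml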
Re: Performance and number of search results
Wouter Samaey schrieb: | Can someone please comment on the performance impact of the number of search results? Is there a big difference between querying for 1 result, 10, 20 or even 100?

Probably not, but YMMV, as the question is very general. Consider that for fast queries the HTTP round trip may well be the determining factor. Or XML parsing. If you've stored a lot of data in Solr and request all of it to be returned, the difference between 1 and 100 results may be the difference between 1 and 100 KB payload.

If you think it matters, the best thing for you would be to do some profiling for your specific scenario. The rule of thumb here is probably: get what you need.

Michael Ludwig
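In the spirit of "get what you need", a hypothetical request limiting both the row count and the returned fields (field names invented; rows and fl are the real parameters):

http://localhost:8983/solr/select?q=jacke&rows=10&fl=id,name,price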
Re: Problem adding unicoded docs to Solr through SolrJ
ahmed baseet schrieb:

| public void postToSolrUsingSolrj(String rawText, String pageId) {
|     [...]
|     doc.addField("features", rawText);
|     [...]

| In the above the param rawText is just the html stripped of all its tags, js, css etc, and pageId is the Url for that page. When I'm using this for English pages it's working perfectly fine, but the problem comes up when I'm trying to index some non-English pages.

Maybe you're constructing a string without specifying the encoding, so Java uses your default platform encoding?

String(byte[] bytes)
    Constructs a new String by decoding the specified array of bytes using the platform's default charset.
String(byte[] bytes, Charset charset)
    Constructs a new String by decoding the specified array of bytes using the specified charset.

| Now what I did is just extracted the raw text from that html page and manually created an xml page like this:

<?xml version="1.0" encoding="UTF-8"?>
<add>
  <doc>
    <field name="id">UTF2TEST</field>
    <field name="name">Test with some UTF-8 encoded characters</field>
    <field name="features">*some tamil unicode text here*</field>
  </doc>
</add>

| and posted this from the command line using the post.jar file. Now searching gives me the result, but unlike last time the browser shows the indexed text in Tamil itself and not the raw unicode.

Now that's perfect, isn't it?

| I tried doing something like this also:
| // Encode in Unicode UTF-8
| utfEncodedText = new String(rawText.getBytes("UTF-8"));
| but even this didn't help either.

No encoding specified in the String constructor, so the default platform encoding is used, which is likely not what you want. Consider the following example:

package milu;

import java.nio.charset.Charset;

public class StringAndCharset {
    public static void main(String[] args) {
        byte[] bytes = { 'K', (byte) 195, (byte) 164, 's', 'e' };
        System.out.println(Charset.defaultCharset().displayName());
        System.out.println(new String(bytes));
        System.out.println(new String(bytes, Charset.forName("UTF-8")));
    }
}

Output:

windows-1252
Käse (bad)
Käse (good)

Michael Ludwig
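For completeness, a minimal SolrJ sketch that passes the String straight through, with no byte-array detour (CommonsHttpSolrServer is the SolrJ client class of this era; the URL and field values are invented):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class PostDemo {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "http://example.com/some-tamil-page");
        doc.addField("features", "raw text, Tamil or otherwise, as a plain Java String");
        server.add(doc);
        server.commit(); // SolrJ handles the UTF-8 encoding on the wire
    }
}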
Highlighting using XML instead of strings?
http://wiki.apache.org/solr/HighlightingParameters

I can specify the strings to highlight matched text with using hl.simple.pre and hl.simple.post, for example <b> and </b>. The result looks like this:

<str>&lt;b&gt;Eumel&lt;/b&gt; NDR Ländermagazine</str>

However, what if, as the result of favouring XML over strings, I'd rather get something like this:

<str><b>Eumel</b> NDR Ländermagazine</str>

There could be a parameter hl.xml which I could use to request modified XML like this:

hl.xml=em
hl.xml=b

This would allow smoother processing with technologies like XSLT. Is such a feature available?

Michael Ludwig
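For reference, the string-based parameters as they work today (host and query invented; %3Cb%3E is the URL-encoded <b>):

http://localhost:8983/solr/select?q=Eumel&hl=true&hl.fl=title&hl.simple.pre=%3Cb%3E&hl.simple.post=%3C/b%3E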