Re: Another japanese analysis problem
Did you read through the CJK article series? Maybe there is something in there? http://discovery-grindstone.blogspot.com/2013/10/cjk-with-solr-for-libraries-part-1.html Sorry, no help on actual Japanese. Regards, Alex. Personal website: http://www.outerthoughts.com/ Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency On Fri, Apr 18, 2014 at 12:50 PM, Shawn Heisey s...@elyograg.org wrote: On 4/10/2014 11:53 AM, Shawn Heisey wrote: My analysis chain includes CJKBigramFilter on both the index and query sides. I have outputUnigrams enabled on the index side, but it is disabled on the query side. This has resulted in a problem with phrase queries. This is a subset of my index analysis for the three terms you can see in the ICUNF step, separated by spaces: https://www.dropbox.com/s/9q1x9pdbsjhzocg/bigram-position-problem.png Note that in the CJKBF step, the second unigram is output at position 2, pushing the English terms to 3 and 4. When the customer does a phrase query (lucene query parser) for the first two terms on this specific field, it doesn't match, because the query analysis doesn't output the unigrams and therefore the positions don't match. I would have expected both unigrams to be at position 1. Is this a bug or expected behavior? It's been a week with no reply. First I worked around this problem by disabling outputUnigrams on the index side, to match the query side. At that point, the customer was unable to do a search for a single character and find longer strings containing that character. I knew this would happen ... I did tell our project manager, but I do not know whether it was communicated to the customer. Then I tried setting outputUnigrams to true on both index and query. Just as I had anticipated, the customer was unhappy with getting results where a word containing only one character of their multi-character search string was present.
Re-stating the underlying problem and my question: The outputUnigrams option sets one of the unigrams from each bigram to the same position as the bigram, but then puts the other one at the next position, breaking phrase queries. This sounds like a bug. Is it a bug? If not, I would REALLY like a config option to produce the behavior that I expected. Thanks, Shawn
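For reference, a minimal sketch of the kind of analyzer configuration Shawn describes — outputUnigrams enabled at index time but not at query time. The field type name and the surrounding filters are assumptions for illustration, not his actual schema:

```xml
<fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- the ICUNF step mentioned above -->
    <filter class="solr.ICUNormalizer2FilterFactory"/>
    <!-- emits bigrams plus unigrams; the second unigram lands at position 2 -->
    <filter class="solr.CJKBigramFilterFactory" outputUnigrams="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ICUNormalizer2FilterFactory"/>
    <!-- bigrams only, so query positions differ from index positions -->
    <filter class="solr.CJKBigramFilterFactory" outputUnigrams="false"/>
  </analyzer>
</fieldType>
```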
'qt' parameter is not working in search call of SolrPhpClient
I am using SolrPhpClient for interacting with Solr via PHP. I am using a custom request handler ( /select_test ) with 'edismax' feature in the Solr config file:

<requestHandler name="/select_test" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <str name="wt">json</str>
    <str name="defType">edismax</str>
    <str name="qf">text name topic description</str>
    <str name="df">text</str>
    <str name="mm">100%</str>
    <str name="q.alt">*:*</str>
    <str name="rows">10</str>
    <str name="fl">*,score</str>
    <str name="mlt.qf">text name topic description</str>
    <str name="mlt.fl">text,name,topic,description</str>
    <int name="mlt.count">3</int>
  </lst>
</requestHandler>

I set the value for the 'qt' parameter to '/select_test' in the $search_options array and pass it as a parameter to the search function of the Apache_Solr_Service as below:

$search_options = array(
  'qt'   => '/select_test',
  'fq'   => 'topic:games',
  'sort' => 'name desc'
);
$result = $solr->search($query, 0, 10, $search_options);

It does not call the request handler at all. The call goes to the default '/select' handler in the Solr config file. Just to confirm, I put the custom request handler code in the default handler and it worked. Why is this happening? Am I not setting it right? Please help! -- View this message in context: http://lucene.472066.n3.nabble.com/qt-parameter-is-not-working-in-search-call-of-SolrPhpClient-tp4131934.html Sent from the Solr - User mailing list archive at Nabble.com.
solr parallel update and total indexing Issue
There is a big issue with running a Solr partial update and a total indexing in parallel. Total import syntax (working): dataimport?command=full-import&commit=true&optimize=true Update syntax (working): solr/update?softCommit=true' -H 'Content-type:application/json' -d '[{"id":1870719,"column":{"set":11}}]' Issue: if both are run in parallel, a commit takes place in between. Example: I have 10k documents in the index in total. I fire a Solr query to update 1000 records, and in between I fire a total import (full indexer). What happens is that a commit takes place in between, i.e. until the total indexer finishes I only get the limited set of records (1000). How to solve this? -- View this message in context: http://lucene.472066.n3.nabble.com/solr-parallel-update-and-total-indexing-Issue-tp4131935.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Another japanese analysis problem
On 4/18/2014 12:04 AM, Alexandre Rafalovitch wrote: Did you read through the CJK article series? Maybe there is something in there? http://discovery-grindstone.blogspot.com/2013/10/cjk-with-solr-for-libraries-part-1.html Sorry, no help on actual Japanese. Almost everything I know about the Japanese language has been learned in the last few weeks, working on this Solr config! That blog series looks like really awesome information. I will be trying out some of what they've mentioned. Thank you for pointing me in that direction. The author's index is a lot more complex than ours ... I'm really hoping to avoid having a lot of copies of each field. The index is already relatively large. I think I'll take my discussion about a possible bug in CJKBigramFilter to the dev list. Thanks, Shawn
Re: Where to specify numShards when startup up a cloud setup
Hi zzT, Putting numShards in core.properties also works. I struggled a little bit while figuring out this configuration approach. I knew I was not alone! ;-) On 2 April 2014 18:06, zzT zis@gmail.com wrote: It seems that I've figured out a configuration approach to this issue. I'm having the exact same issue, and the only viable solutions found on the net till now are 1) Pass -DnumShards=x when starting up the Solr server 2) Use the Collections API as indicated by Shawn. What I've noticed though - after making the call to /collections to create a collection - is that a new core entry is added inside solr.xml with the attribute numShards. So, right now I'm configuring solr.xml with the numShards attribute inside my core nodes. This way I don't have to worry about the annoying stuff you've already mentioned, e.g. waiting for Solr to start up etc. Of course the same logic applies here: the numShards param is meaningful only the first time. Even if you change it at a later point the # of shards stays the same. -- View this message in context: http://lucene.472066.n3.nabble.com/Where-to-specify-numShards-when-startup-up-a-cloud-setup-tp4078473p4128566.html Sent from the Solr - User mailing list archive at Nabble.com. -- All the best Liu Bo
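For anyone trying the core.properties route mentioned above, a minimal sketch — the core, collection, and shard names here are made up for illustration:

```
name=mycore
collection=mycollection
shard=shard1
numShards=2
```

As noted in the thread, numShards only matters the first time the collection is created; changing it later does not reshard.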
Having trouble with German compound words in Solr 4.7
Hello all, I'm a fairly new Solr user and I need my search function to handle compound words in German. I've searched through the archives and found that Solr already has a Filter Factory made for such words called DictionaryCompoundWordTokenFilterFactory. I've already built a list of words that I want split, and it seems like the filter is working correctly in most cases, the majority of our searches are clothing items so let's say /schwarzkleid/ (black dress) becomes /schwarz/ /kleid/, which is what I want to happen. However, it seems like the keyword search is done using an *OR* operator. So I'm seeing items that are either black or are dresses but I just want to see items that are both. I've also read that changing the default operator in schema.xml or adding q.op as *AND* in the solrconfig.xml will rectify this issue, but nothing has changed in my query results. It still uses the *OR* operator. I've tried using Extended dismax in my queries but I am using the Solr PHP library and I don't think it supports adding Dismax filters to the queries themselves (if I'm wrong, please correct me). By the way, I am using Zend Framework 2.0 in the backend and am communicating with Solr through the Solr PHP library: Solr PHP http://www.php.net/manual/tr/book.solr.php . Any suggestions on how to change the operator after my compound word queries have been split? Thanks! Ali -- View this message in context: http://lucene.472066.n3.nabble.com/Having-trouble-with-German-compound-words-in-Solr-4-7-tp4131964.html Sent from the Solr - User mailing list archive at Nabble.com.
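For reference, the decompounding filter referred to above is typically wired into the analyzer roughly like this — the dictionary filename and the size limits are illustrative, not taken from the poster's actual schema:

```xml
<analyzer type="index">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <!-- german-words.txt is the hand-built list of words to split on -->
  <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
          dictionary="german-words.txt"
          minWordSize="5" minSubwordSize="3" maxSubwordSize="15"
          onlyLongestMatch="false"/>
</analyzer>
```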
Re: Having trouble with German compound words in Solr 4.7
Make sure your field type has the autoGeneratePhraseQueries=true attribute (default is false). q.op only applies to explicit terms, not to terms which decompose into multiple terms. Confusing? Yes! -- Jack Krupansky -----Original Message----- From: Alistair Sent: Friday, April 18, 2014 6:11 AM To: solr-user@lucene.apache.org Subject: Having trouble with German compound words in Solr 4.7 Hello all, I'm a fairly new Solr user and I need my search function to handle compound words in German.
space between search terms
Hi, I have a field called title. It has values such as "indira nagar" as well as "indiranagar". If I type either of the keywords, it has to display both results. Can anybody help with how we can do this? I am using the title field in the following way:

<fieldType name="title" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1" splitOnNumerics="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="([^\w\d\*æøåÆØÅ ])" replacement="" replace="all"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1" splitOnNumerics="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="([^\w\d\*æøåÆØÅ ])" replacement="" replace="all"/>
    <filter class="solr.SynonymFilterFactory" ignoreCase="true" synonyms="synonyms_tf.txt" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

-- View this message in context: http://lucene.472066.n3.nabble.com/space-between-search-terms-tp4131967.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: space between search terms
Use an index-time synonym filter with a synonym entry: indira nagar,indiranagar But do not use that same filter at query time. Note that this may mess up some exact phrase queries, such as: q="indiranagar xyz" since the following term is actually positioned after the longest synonym. To resolve that, use a sloppy phrase: q="indiranagar xyz"~1 Or, set qs=1 for the edismax query parser. -- Jack Krupansky -----Original Message----- From: kumar Sent: Friday, April 18, 2014 6:34 AM To: solr-user@lucene.apache.org Subject: space between search terms Hi, I Have a field called title. It is having a values called indira nagar as well as indiranagar. If i type any of the keywords it has to display both results.
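Sketched as schema config, the index-time-only synonym suggestion would look something like this — the synonyms filename is an assumption; the key point is that the SynonymFilterFactory sits in the index analyzer only:

```xml
<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <!-- synonyms-index.txt contains the line: indira nagar,indiranagar -->
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms-index.txt"
          ignoreCase="true" expand="true"/>
</analyzer>
<analyzer type="query">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <!-- deliberately no synonym filter at query time -->
</analyzer>
```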
Re: multi word search for elevator (QueryElevationComponent) not working
Hi Remi, Thanks for your reply. I tried setting the query_text for apple ipod and added the required doc_id to elevate. I got the result, but again I am not able to get the desired result for NLP queries such as ipod nano generation 5 or apple ipod best music, as both queries contain ipod, for which I want my desired doc ids to be elevated. I also tried changing the QueryElevationComponent config: First with: <str name="queryFieldType">string</str> Second time: <str name="queryFieldType">text_general</str> But no success. Please correct me if I am not making the change you mentioned correctly. Is there any other way in Solr to achieve this (promoted search)? Please guide me. Regards, Niranjan -- View this message in context: http://lucene.472066.n3.nabble.com/multi-word-search-for-elevator-QueryElevationComponent-not-working-tp4131016p4131971.html Sent from the Solr - User mailing list archive at Nabble.com.
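For context, the elevation rules the component reads live in an elevate.xml file shaped roughly like this (the query text and document id below are placeholders, not Niranjan's actual values). Each <query> entry matches the full query text, which is why a rule defined for "ipod" does not fire for "ipod nano generation 5":

```xml
<elevate>
  <query text="ipod">
    <doc id="DOC1234"/>
  </query>
  <query text="apple ipod">
    <doc id="DOC1234"/>
  </query>
</elevate>
```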
Re: Having trouble with German compound words in Solr 4.7
Hey Jack, thanks for the reply. I added autoGeneratePhraseQueries=true to the fieldType and now it's giving me even more results! I'm not sure if the debug of my query will be helpful but I'll paste it just in case someone might have an idea. This produces 113524 results, whereas if I manually enter the query as keyword:schwarz AND keyword:kleid I only get 20283 results (which is the correct one). -- View this message in context: http://lucene.472066.n3.nabble.com/Having-trouble-with-German-compound-words-in-Solr-4-7-tp4131964p4131973.html Sent from the Solr - User mailing list archive at Nabble.com.
QueryElevationComponent always reads config from zookeeper
Hello, I was looking into the QueryElevationComponent component. As per the spec (http://wiki.apache.org/solr/QueryElevationComponent), if the config is not found in ZooKeeper, it should be loaded from the data directory. However, I see a bug. It doesn't seem to be working even in the latest 4.7.2 release. I have checked the latest code and found this:

Map<String, ElevationObj> getElevationMap(IndexReader reader, SolrCore core) throws Exception {
  synchronized (elevationCache) {
    Map<String, ElevationObj> map = elevationCache.get(null);
    if (map != null) return map;
    map = elevationCache.get(reader);
    if (map == null) {
      String f = initArgs.get(CONFIG_FILE);
      if (f == null) {
        throw new SolrException(SolrException.ErrorCode.SERVER_ERROR,
            "QueryElevationComponent must specify argument: " + CONFIG_FILE);
      }
      log.info("Loading QueryElevation from data dir: " + f);
      Config cfg;
      ZkController zkController = core.getCoreDescriptor().getCoreContainer().getZkController();
      if (zkController != null) {
        cfg = new Config(core.getResourceLoader(), f, null, null);
      } else {
        InputStream is = VersionedFile.getLatestFile(core.getDataDir(), f);
        cfg = new Config(core.getResourceLoader(), f, new InputSource(is), null);
      }
      map = loadElevationMap(cfg);
      elevationCache.put(reader, map);
    }
    return map;
  }
}

As per this code, we will never be able to load the config from the data directory if ZooKeeper is in use. Can we fix this issue? Thanks, Ronak
Re: cache warming questions
cool, thanks. Thanks, Kranti K. Parisa http://www.linkedin.com/in/krantiparisa On Thu, Apr 17, 2014 at 11:37 PM, Erick Erickson erickerick...@gmail.comwrote: No, the 5 most recently used in a query will be used to autowarm. If you have things you _know_ are going to be popular fqs, you could put them in newSearcher queries. Best, Erick On Thu, Apr 17, 2014 at 4:51 PM, Kranti Parisa kranti.par...@gmail.com wrote: Erik, I have a followup question on this topic. If we have used 10 unique FQs and when we configure filterCache=100 autoWarm=5, then which 5 out of the 10 will be repopulated in the case of new searcher? I don't think there is a way to set the preference or there is? Thanks, Kranti K. Parisa http://www.linkedin.com/in/krantiparisa On Thu, Apr 17, 2014 at 5:25 PM, Matt Kuiper matt.kui...@issinc.com wrote: Ok, that makes sense. Thanks again, Matt Matt Kuiper - Software Engineer Intelligent Software Solutions p. 719.452.7721 | matt.kui...@issinc.com www.issinc.com | LinkedIn: intelligent-software-solutions -Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Thursday, April 17, 2014 9:26 AM To: solr-user@lucene.apache.org Subject: Re: cache warming questions Don't go overboard warming here, you often hit diminishing returns very quickly. For instance, if the size is 512 you might set your autowarm count to 16 and get the most bang for your buck. Beyond some (usually small) number, the additional work you put in to warming is wasted. This is especially true if your autocommit (soft, or hard with openSearcher=true) is short. So while you're correct in your sizing bit, practically it's rarely that complicated since the autowarm count is usually so much smaller than the size that there's no danger of swapping them out. YMMV of course. Best, Erick On Wed, Apr 16, 2014 at 10:33 AM, Matt Kuiper matt.kui...@issinc.com wrote: Thanks Erick, this is helpful information! 
So it sounds like, at minimum the cache size (at least for filterCache and queryResultCache) should be the sum of the autowarmCount for that cache and the number of queries defined for the newSearcher listener. Otherwise some items in the caches will be evicted right away. Matt -Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Tuesday, April 15, 2014 5:21 PM To: solr-user@lucene.apache.org Subject: Re: cache warming questions bq: What does it mean that items will be regenerated or prepopulated from the current searcher's cache... You're right, the values aren't cached. They can't be since the internal Lucene document id is used to identify docs, and due to merging the internal ID may bear no relation to the old internal ID for a particular document. I find it useful to think of Solr's caches as a map where the key is the query and the value is some representation of the found documents. The details of the value don't matter, so I'll skip them. What matters is the key. Consider the filter cache. You put something like fq=price:[0 TO 100] on a URL. Solr then uses the fq clause as the key to the filterCache. Here's the sneaky bit. When you specify an autowarm count of N for the filterCache, when a new searcher is opened the first N keys from the map are re-executed in the new searcher's context and the results put into the new searcher's filterCache. bq: ...how does auto warming and explicit warming work together? They're orthogonal. IOW, the autowarming for each cache is executed as well as the newSearcher static warming queries. Use the static queries to do things like fill the sort caches etc. Incidentally, this bears on why there's a firstSearcher and newSearcher. The newSearcher queries are run in addition to the cache autowarms. firstSearcher static queries are only run when a Solr server is started the first time, and there are no cache entries to autowarm. 
So the firstSearcher queries might be quite a bit more complex than newSearcher queries. HTH, Erick On Tue, Apr 15, 2014 at 1:55 PM, Matt Kuiper matt.kui...@issinc.com wrote: Hello, I have a few questions regarding how Solr caches are warmed. My understanding is that there are two ways to warm internal Solr caches (only one way for document cache and lucene FieldCache): Auto warming - occurs when there is a current searcher handling requests and new searcher is being prepared. When a new searcher is opened, its caches may be prepopulated or autowarmed with cached object from caches in the old searcher. autowarmCount is the number of cached items that will be regenerated in the new searcher. http://wiki.apache.org/solr/SolrCaching#autowarmCount Explicit warming - where the static warming queries specified in
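Erick's two warming mechanisms map to solrconfig.xml roughly as follows — the cache sizes and the warming query are illustrative values, not from Matt's actual config:

```xml
<!-- autowarming: re-execute the first 16 filterCache keys in the new searcher -->
<filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="16"/>

<!-- explicit (static) warming: fixed queries run whenever a new searcher opens -->
<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst><str name="q">*:*</str><str name="fq">price:[0 TO 100]</str></lst>
  </arr>
</listener>
```

A firstSearcher listener takes the same shape and runs only on server startup, when there are no caches to autowarm.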
Re: Filtering Solr Queries
Is this a manageable list? That is, not a zillion names? If so, it seems like you could do this with synonyms. Assuming your string_ci bit is a string type, you'd need to change that to something like KeywordTokenizerFactory followed by filters, and you might want to add something like LowercaseFilterFactory to the chain. Best, Erick On Thu, Apr 17, 2014 at 9:47 PM, kumar pavan2...@gmail.com wrote: Hi, I am indexing the data using title, city and location fields, but different cities have the same location names, like rajaji nagar and rajajinagar. When a user types computers in rajaji nagar, it has to display results for computers in rajajinagar as well as computers in rajaji nagar. I am using the following schema:

<field name="city" type="string_ci" indexed="true" stored="false"/>
<field name="locality" type="string_ci" indexed="true" stored="false"/>
<field name="mytitle" type="textfullmatch" indexed="true" stored="false" multiValued="true" omitNorms="true" omitTermFreqAndPositions="true"/>

<fieldType name="textfullmatch" class="solr.TextField">
  <analyzer type="index">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="([\.,;:-_])" replacement="" replace="all"/>
    <filter class="solr.EdgeNGramFilterFactory" maxGramSize="50" minGramSize="2"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="([^\w\d\*æøåÆØÅ ])" replacement="" replace="all"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="([\.,;:-_])" replacement="" replace="all"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="([^\w\d\*æøåÆØÅ ])" replacement="" replace="all"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="^(.{30})(.*)?" replacement="$1" replace="all"/>
    <filter class="solr.SynonymFilterFactory" ignoreCase="true" synonyms="synonyms_fsw.txt" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

-- View this message in context: http://lucene.472066.n3.nabble.com/Filtering-Solr-Queries-tp4131924.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Having trouble with German compound words in Solr 4.7
Hi Alistair, quick email before catching my plane - I worked with similar requirements in the past, and tuning Solr can be tricky: * are you hitting the same Solr query handler (application versus manual checking)? * turn on debugging for your application's Solr queries so you see what query is actually executed * one thing I always do for prototyping is setting up the Solritas GUI using the same query handler as the application server Cheers, Siegfried Goeschl On 18 Apr 2014, at 06:06, Alistair ali...@gmail.com wrote: Hey Jack, thanks for the reply. I added autoGeneratePhraseQueries=true to the fieldType and now it's giving me even more results!
Re: 'qt' parameter is not working in search call of SolrPhpClient
You're confusing a couple of things here. The /select_test handler can be accessed by pointing your URL at it rather than using qt, i.e. the destination you're going to will be http://server:port/solr/collection/select_test rather than http://server:port/solr/collection/select Best, Erick On Thu, Apr 17, 2014 at 11:31 PM, harshrossi harshro...@gmail.com wrote: I am using SolrPhpClient for interacting with Solr via PHP. I am using a custom request handler ( /select_test ) with 'edismax' feature in Solr config file
multi-field suggestions
I've been working on getting AnalyzingInfixSuggester to make suggestions using tokens drawn from multiple fields. I've done this by copying tokens from each of those fields into a destination field, and building suggestions using that destination field. This allows me to use different analysis strategies for each of the fields, which I need, but it doesn't address a couple of remaining issues: 1. Some source fields are more important than others, and it would be good to be able to give their tokens greater weight somehow 2. The threshold is applied equally across all tokens, but for some fields we want to suggest singletons (threshold=0), while for others we want to use the threshold to exclude low-frequency terms. I looked a little bit at how to extend the whole framework from Solr on down to handle multiple source fields intrinsically, rather than using the copying technique, and it looks like I could possibly manage something like this by extending DocumentDictionary and plugging in a different DictionaryFactory. Does that sound like a good approach? Is there some better way to approach this problem? Thanks -Mike PS Sorry for the cross-post; I realized after I hit send this was probably a better question for solr-user than lucene...
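For readers unfamiliar with the copying technique described above: at the schema level the destination field is usually built with copyField, sketched below with placeholder field and type names. Note that copyField copies the raw source value, so the per-source analysis differences Mike needs would have to come from a different copying mechanism (e.g. a custom update processor):

```xml
<field name="suggest_all" type="text_suggest" indexed="true" stored="true" multiValued="true"/>
<copyField source="title" dest="suggest_all"/>
<copyField source="author" dest="suggest_all"/>
```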
Re: solr parallel update and total indexing Issue
Try not setting softCommit=true; that's going to take the current state of your index and make it visible. If your DIH process has deleted all your records, then that's the current state. Personally I wouldn't try to mix-n-match like this, the results will take forever to get right. If you absolutely must do something like this, I'd use collection aliasing to rebuild my index in a different collection, then switch from the old to the new one in a controlled fashion. Best, Erick On Thu, Apr 17, 2014 at 11:37 PM, ~$alpha` lavesh.ra...@gmail.com wrote: There is a bis issue in solr parallel update and total indexing
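The aliasing approach Erick mentions uses the Collections API CREATEALIAS action; a sketch, with made-up host and collection names — rebuild into products_v2, then repoint the alias so clients keep querying "products":

```
http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=products&collections=products_v2
```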
Re: Can I reconstruct text from tokens?
I believe you could use term vectors to retrieve all the terms in a document, with their offsets. Retrieving them from the inverted index would be expensive since the index is term-oriented, not document-oriented. Without tv, I think you essentially have to scan the entire term dictionary looking for terms in your document. So that will cost you probably more than it's worth? -Mike On 04/16/2014 11:50 AM, Alexandre Rafalovitch wrote: Hello, If I use very basic tokenizers, e.g. space based and no filters, can I reconstruct the text from the tokenized form? So, This is a test - This, is, a, test - This is a test? I know we store enough information, but I don't know internal API enough to know what I should be looking at for reconstruction algorithm. Any hints? The XY problem is that I want to store large amount of very repeatable text into Solr. I want the index to be as small as possible, so thought if I just pre-tokenized, my dictionary will be quite small. And I will be reconstructing some final form anyway. The other option is to just use compressed fields on stored field, but I assume that does not take cross-document efficiencies into account. And, it will be a read-only index after build, so I don't care about updates messing things up. Regards, Alex Personal website: http://www.outerthoughts.com/ Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency
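To make term vectors (with positions and offsets) available for this, the field would need them enabled in the schema, roughly like so — the field name and type are placeholders:

```xml
<field name="text" type="text_general" indexed="true" stored="false"
       termVectors="true" termPositions="true" termOffsets="true"/>
```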
Re: Indexing Big Data With or Without Solr
Thanks Furkan, I will definitely give it a try then. Thanks again! On Tue, Apr 15, 2014 at 7:53 PM, Furkan KAMACI furkankam...@gmail.com wrote: Hi Vineet; I've been using SolrCloud for this kind of Big Data and I think that you should consider using it. If you have any problems you can ask them here. Thanks; Furkan KAMACI 2014-04-15 13:20 GMT+03:00 Vineet Mishra clearmido...@gmail.com: Hi All, I have worked with Solr 3.5 to implement real-time search on some 100GB of data. That worked fine but was a little slow on complex queries (multiple grouped/joined queries). But now I want to index some real Big Data (around 4 TB or even more). Can SolrCloud be a solution for it? If not, what could be the best possible solution in this case? *Stats for the previous implementation:* It was a master-slave architecture with multiple normal standalone instances of Solr 3.5. There were around 12 Solr instances running on different machines. *Things to consider for the next implementation:* Since all the data is sensor data, duplicate and unique records are a factor. *Really urgent, please take the call on priority with a set of feasible solutions.* Regards
Boost Search results
Hi, When I started to compare the search results with the two options below, I see a lot of difference in the search results, esp. the *urls that show up on the top* (*relevancy* perspective). (1) Nutch 2.2.1 (with *Solr 4.0*) (2) Bing custom search set-up I wonder how I should tweak the boost parameters to get the best results on the top like Bing and Google do. Please suggest why I see a difference and what parameters are best to configure in Solr to achieve what I see from Bing or Google search relevancy. Here is what I got in solrconfig.xml:

<str name="defType">edismax</str>
<str name="qf">text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0 manu^1.1 cat^1.4</str>
<str name="q.alt">*:*</str>
<str name="rows">10</str>
<str name="fl">*,score</str>

Thanks
Re: Can I reconstruct text from tokens?
Sorry, didn't think this through. You're right, still the same problem.. On 16 Apr 2014 17:40, Alexandre Rafalovitch arafa...@gmail.com wrote: Why? I want stored=false, at which point a multivalued field is just offset values in the dictionary. Still have to reconstruct from offsets. Or am I missing something? Regards, Alex On 16/04/2014 10:59 pm, Ramkumar R. Aiyengar andyetitmo...@gmail.com wrote: Logically, if you tokenize and put the results in a multivalued field, you should be able to get all values in sequence? On 16 Apr 2014 16:51, Alexandre Rafalovitch arafa...@gmail.com wrote: [original question quoted earlier in the thread, trimmed]
Re: Boost Search results
Hi, replicating full-featured search engine behaviour is not going to work with Nutch and Solr out of the box. You are missing a thousand features such as proper main content extraction, deduplication, classification of content and hub or link pages, and much more. These things are possible to implement, but you may want to start with having your Solr request handler better configured; to begin with, your qf parameter does not have Nutch's default title and content fields selected. A Laxmi a.lakshmi...@gmail.com schreef: [original message quoted earlier in the thread, trimmed]
Re: Boost Search results
Hi Markus, Yes, you are right. I passed the qf from my front-end framework (PHP, which uses SolrClient). This is how I got it set up:

$this->solr->set_param('defType','edismax');
$this->solr->set_param('qf','title^10 content^5 url^5');

where you can see qf = title^10 content^5 url^5 On Fri, Apr 18, 2014 at 4:02 PM, Markus Jelsma markus.jel...@openindex.io wrote: [quoted text trimmed]
Re: Boost Search results
Markus, like I mentioned in my last email, I have got qf with title, content and url. That doesn't help a whole lot. Could you please advise if there are any other parameters I should consider for the Solr request handler config, or whether the numbers I have got for title, content and url in the qf parameter have to be modified? Thanks for your help.. On Fri, Apr 18, 2014 at 4:08 PM, A Laxmi a.lakshmi...@gmail.com wrote: [earlier thread quoted, trimmed]
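For experimenting with boost weights without redeploying solrconfig.xml, the edismax parameters can be assembled per request on the client side. A minimal sketch, assuming Python and a hypothetical core URL; the field names and weights are illustrative only:

```python
from urllib.parse import urlencode

def build_search_url(base, q, qf):
    """Assemble an edismax request so boosts can be tweaked per query.

    `qf` maps field name -> boost, e.g. {"title": 10} becomes "title^10".
    """
    params = {
        "q": q,
        "defType": "edismax",
        "qf": " ".join("%s^%s" % (field, boost) for field, boost in qf.items()),
        "fl": "*,score",  # return the score so boost changes are visible
    }
    return base + "/select?" + urlencode(params)

url = build_search_url("http://localhost:8983/solr/core1",
                       "solar panels", {"title": 10, "content": 5, "url": 5})
print(url)
```

Iterating on the weights this way, while watching the returned scores, is usually faster than editing the request handler defaults between every trial.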
Re: Can I reconstruct text from tokens?
Luke actually does this, or attempts to. The doc you assemble is lossy, though:
- It doesn't have stop words
- All capitalization is lost
- Original terms for synonyms are lost
- All punctuation is lost
- Original words that are stemmed are lost
- Anything you do with, say, ngrams will definitely be strange
- etc.
I don't think you can do this unless you store term information, and it's slow. Basically, all the filters in the analysis chain may change what goes into the index; that's their job. Each step may lose information. FWIW, Erick On Fri, Apr 18, 2014 at 12:36 PM, Ramkumar R. Aiyengar andyetitmo...@gmail.com wrote: [earlier thread quoted, trimmed]
Re: space between search terms
Hi Jack, I am planning to extract and publish such words for the Turkish language. But I am not sure how to utilize them. I wonder if there is a more flexible solution that would work at query time only. That would not require reindexing every time a new item is added. Ahmet On Friday, April 18, 2014 1:47 PM, Jack Krupansky j...@basetechnology.com wrote: Use an index-time synonym filter with a synonym entry: indira nagar,indiranagar But do not use that same filter at query time. But, that may mess up some exact phrase queries, such as: q="indiranagar xyz" since the following term is actually positioned after the longest synonym. To resolve that, use a sloppy phrase: q="indiranagar xyz"~1 Or, set qs=1 for the edismax query parser. -- Jack Krupansky -Original Message- From: kumar Sent: Friday, April 18, 2014 6:34 AM To: solr-user@lucene.apache.org Subject: space between search terms Hi, I have a field called title. It has values "indira nagar" as well as "indiranagar". If I type either of the keywords it has to display both results. Can anybody help with how we can do this?
I am using the title field in the following way:

<fieldType name="title" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1" splitOnNumerics="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="([^\w\d\*æøåÆØÅ ])" replacement="" replace="all"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1" splitOnNumerics="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="([^\w\d\*æøåÆØÅ ])" replacement="" replace="all"/>
    <filter class="solr.SynonymFilterFactory" ignoreCase="true" synonyms="synonyms_tf.txt" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

-- View this message in context: http://lucene.472066.n3.nabble.com/space-between-search-terms-tp4131967.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: space between search terms
Ahmet: Yeah, the index vs. query time bit is a pain. Often what people will do is take their best shot at index time, then accumulate omissions and use that list for query time. Then, whenever they can/need to re-index, merge the query-time list into the index-time list and start over. Not an ideal solution by any means, but one that people have made work. Best, Erick On Fri, Apr 18, 2014 at 4:38 PM, Ahmet Arslan iori...@yahoo.com wrote: [earlier thread quoted, trimmed]
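The merge-and-reindex step Erick describes can be sketched as a small script. A minimal sketch, assuming Python and synonym files in the usual one-entry-per-line Solr format; the entries shown are made up for illustration:

```python
def merge_synonym_lists(index_lines, query_lines):
    """Fold accumulated query-time synonym entries into the index-time
    list before a re-index, dropping exact duplicates and blank lines
    while keeping the original order.
    """
    seen = set()
    merged = []
    for line in index_lines + query_lines:
        entry = line.strip()
        if entry and entry not in seen:
            seen.add(entry)
            merged.append(entry)
    return merged

idx = ["indira nagar,indiranagar"]
qry = ["indira nagar,indiranagar", "bengaluru,bangalore"]
print(merge_synonym_lists(idx, qry))
# ['indira nagar,indiranagar', 'bengaluru,bangalore']
```

After writing the merged list back to the index-time synonyms file, re-index and start the query-time list empty again.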
Re: space between search terms
The LucidWorks Search query parser does indeed support multi-word synonyms at query time. I vaguely recall some Jira traffic on supporting multi-word synonyms at query time for some special cases, but a review of CHANGES.txt does not find any such changes that made it into a release yet. The simplest approach for now is to do the query-time synonym expansion in your app layer as a preprocessor. -- Jack Krupansky -Original Message- From: Ahmet Arslan Sent: Friday, April 18, 2014 7:38 PM To: solr-user@lucene.apache.org Subject: Re: space between search terms [earlier thread quoted, trimmed]
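Jack's app-layer preprocessor idea can be sketched as follows. A minimal sketch, assuming Python; the synonym table and query are illustrative only:

```python
# Hypothetical table: single token -> its multi-word synonym.
SYNONYMS = {"indiranagar": "indira nagar"}

def expand_query(q, synonyms=SYNONYMS):
    """Pre-expand multi-word synonyms before the query reaches Solr,
    since a stock query-time synonym filter can't emit multi-word
    alternatives reliably.  Each matching token becomes an OR group
    with the multi-word form as a quoted phrase.
    """
    out = []
    for token in q.split():
        alt = synonyms.get(token.lower())
        if alt:
            out.append('(%s OR "%s")' % (token, alt))
        else:
            out.append(token)
    return " ".join(out)

print(expand_query("indiranagar apartments"))
# (indiranagar OR "indira nagar") apartments
```

Because the expansion happens before the query parser sees the string, phrase positioning inside the index is never an issue, and the table can be updated without reindexing.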
need help from hard core solr experts - out of memory error
I have lots of log files and other files to support this issue (sometimes referenced in the text below) but I am not sure the best way to submit them. I don't want to overwhelm and I am not sure if this email will accept graphs and charts. Please provide direction and I will send them.

*Issue Description*
We are getting Out Of Memory errors when we try to execute a full import using the Data Import Handler. This error originally occurred on a production environment with a database containing 27 million records. Heap memory was configured for 6GB and the server had 32GB of physical memory. We have been able to replicate the error on a local system with 6 million records. We set the memory heap size to 64MB to accelerate the error replication. The indexing process has been failing in different scenarios. We have 9 test cases documented. In some of the test cases we increased the heap size to 128MB. In our first test case we set heap memory to 512MB, which also failed.

*Environment Values Used*
SOLR/Lucene version: 4.2.1
JVM version: Java(TM) SE Runtime Environment (build 1.7.0_07-b11) Java HotSpot(TM) 64-Bit Server VM (build 23.3-b01, mixed mode)
Indexer startup command:
set JVMARGS=-XX:MaxPermSize=364m -Xss256K -Xmx128m -Xms128m
java %JVMARGS% ^
-Dcom.sun.management.jmxremote.port=1092 ^
-Dcom.sun.management.jmxremote.ssl=false ^
-Dcom.sun.management.jmxremote.authenticate=false ^
-jar start.jar
SOLR indexing HTTP request parameters:
webapp=/solr path=/dataimport params={clean=false&command=full-import&wt=javabin&version=2}

The information we use for the database retrieval using the Data Import Handler is as follows:

<dataSource name="org_only" type="JdbcDataSource" driver="oracle.jdbc.OracleDriver" url="jdbc:oracle:thin:@{server name}:1521:{database name}" user="{username}" password="{password}" readOnly="false"/>

*The Query (simple, single table)*
select
NVL(cast(STU.ACCT_ADDRESS_ALL.R_ID as varchar2(100)), 'null') as SOLR_ID,
'STU.ACCT_ADDRESS_ALL' as SOLR_CATEGORY,
NVL(cast(STU.ACCT_ADDRESS_ALL.R_ID as varchar2(255)), ' ') as ADDRESSALLRID,
NVL(cast(STU.ACCT_ADDRESS_ALL.ADDR_TYPE as varchar2(255)), ' ') as ADDRESSALLADDRTYPECD,
NVL(cast(STU.ACCT_ADDRESS_ALL.LONGITUDE as varchar2(255)), ' ') as ADDRESSALLLONGITUDE,
NVL(cast(STU.ACCT_ADDRESS_ALL.LATITUDE as varchar2(255)), ' ') as ADDRESSALLLATITUDE,
NVL(cast(STU.ACCT_ADDRESS_ALL.ADDR_NAME as varchar2(255)), ' ') as ADDRESSALLADDRNAME,
NVL(cast(STU.ACCT_ADDRESS_ALL.CITY as varchar2(255)), ' ') as ADDRESSALLCITY,
NVL(cast(STU.ACCT_ADDRESS_ALL.STATE as varchar2(255)), ' ') as ADDRESSALLSTATE,
NVL(cast(STU.ACCT_ADDRESS_ALL.EMAIL_ADDR as varchar2(255)), ' ') as ADDRESSALLEMAILADDR
from STU.ACCT_ADDRESS_ALL

You can see this information in the database.xml file. Our main solrconfig.xml file contains the following differences compared to a newly downloaded solrconfig.xml file (the original content):

<config>
<lib dir="../../../dist/" regex="solr-dataimporthandler-.*\.jar"/>
<!-- Our libraries containing customized filters -->
<lib path="../../../../default/lib/common.jar"/>
<lib path="../../../../default/lib/webapp.jar"/>
<lib path="../../../../default/lib/commons-pool-1.4.jar"/>
<abortOnConfigurationError>${solr.abortOnConfigurationError:true}</abortOnConfigurationError>
<directoryFactory name="DirectoryFactory" class="org.apache.solr.core.StandardDirectoryFactory"/>
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">database.xml</str>
  </lst>
</requestHandler>
</config>

*Custom Libraries*
The common.jar contains a customized TokenFiltersFactory implementation that we use for indexing. These classes do some special treatment of the fields read from the database; how they are used is described in the schema.xml file. The webapp.jar file contains other related classes. The commons-pool-1.4.jar is an Apache API used for instance reuse. The logic used in the TokenFiltersFactory is contained in the following files: ConcatFilterFactory.java, ConcatFilter.java, MDFilterSchemaFactory.java, MDFilter.java, MDFilterPoolObjectFactory.java
Re: need help from hard core solr experts - out of memory error
I see heap size commands for 128 Meg and 512 Meg. That will certainly run out of memory. Why do you think you have 6G of heap with these settings? -Xmx128m -Xms128m -Xmx512m -Xms512m wunder On Apr 18, 2014, at 5:15 PM, Candygram For Mongo candygram.for.mo...@gmail.com wrote: [issue description quoted earlier in the thread, trimmed]
Re: need help from hard core solr experts - out of memory error
We consistently reproduce this problem on multiple systems configured with 6GB and 12GB of heap space. To quickly reproduce many cases for troubleshooting we reduced the heap space to 64, 128 and 512MB. With 6 or 12GB configured it takes hours to see the error. On Fri, Apr 18, 2014 at 5:54 PM, Walter Underwood wun...@wunderwood.orgwrote: I see heap size commands for 128 Meg and 512 Meg. That will certainly run out of memory. Why do you think you have 6G of heap with these settings? –Xmx128m –Xms128m –Xmx512m –Xms512m wunder On Apr 18, 2014, at 5:15 PM, Candygram For Mongo candygram.for.mo...@gmail.com wrote: I have lots of log files and other files to support this issue (sometimes referenced in the text below) but I am not sure the best way to submit. I don't want to overwhelm and I am not sure if this email will accept graphs and charts. Please provide direction and I will send them. *Issue Description* We are getting Out Of Memory errors when we try to execute a full import using the Data Import Handler. This error originally occurred on a production environment with a database containing 27 million records. Heap memory was configured for 6GB and the server had 32GB of physical memory. We have been able to replicate the error on a local system with 6 million records. We set the memory heap size to 64MB to accelerate the error replication. The indexing process has been failing in different scenarios. We have 9 test cases documented. In some of the test cases we increased the heap size to 128MB. In our first test case we set heap memory to 512MB which also failed. 
*Environment Values Used* *SOLR/Lucene version: *4.2.1* *JVM version: Java(TM) SE Runtime Environment (build 1.7.0_07-b11) Java HotSpot(TM) 64-Bit Server VM (build 23.3-b01, mixed mode) *Indexer startup command: set JVMARGS= -XX:MaxPermSize=364m -Xss256K –Xmx128m –Xms128m java %JVMARGS% ^ -Dcom.sun.management.jmxremote.port=1092 ^ -Dcom.sun.management.jmxremote.ssl=false ^ -Dcom.sun.management.jmxremote.authenticate=false ^ -jar start.jar *SOLR indexing HTTP parameters request: webapp=/solr path=/dataimport params={clean=falsecommand=full-importwt=javabinversion=2} The information we use for the database retrieve using the Data Import Handler is as follows: dataSource name=org_only type=JdbcDataSource driver=oracle.jdbc.OracleDriver url=jdbc:oracle:thin:@{server name}:1521:{database name} user={username} password={password} readOnly=false / *The Query (simple, single table)* *select* *NVL(cast(STU.ACCT_ADDRESS_ALL.R_ID as varchar2(100)), 'null')* *as SOLR_ID,* *'STU.ACCT_ADDRESS_ALL'* *as SOLR_CATEGORY,* *NVL(cast(STU.ACCT_ADDRESS_ALL.R_ID as varchar2(255)), ' ') as ADDRESSALLRID,* *NVL(cast(STU.ACCT_ADDRESS_ALL.ADDR_TYPE as varchar2(255)), ' ') as ADDRESSALLADDRTYPECD,* *NVL(cast(STU.ACCT_ADDRESS_ALL.LONGITUDE as varchar2(255)), ' ') as ADDRESSALLLONGITUDE,* *NVL(cast(STU.ACCT_ADDRESS_ALL.LATITUDE as varchar2(255)), ' ') as ADDRESSALLLATITUDE,* *NVL(cast(STU.ACCT_ADDRESS_ALL.ADDR_NAME as varchar2(255)), ' ') as ADDRESSALLADDRNAME,* *NVL(cast(STU.ACCT_ADDRESS_ALL.CITY as varchar2(255)), ' ') as ADDRESSALLCITY,* *NVL(cast(STU.ACCT_ADDRESS_ALL.STATE as varchar2(255)), ' ') as ADDRESSALLSTATE,* *NVL(cast(STU.ACCT_ADDRESS_ALL.EMAIL_ADDR as varchar2(255)), ' ') as ADDRESSALLEMAILADDR * *from STU.ACCT_ADDRESS_ALL* You can see this information in the database.xml file. 
Our main solrconfig.xml file contains the following differences compared to a newly downloaded solrconfig.xml file (the original content):

  <config>
    <lib dir="../../../dist/" regex="solr-dataimporthandler-.*\.jar" />
    <!-- Our libraries containing customized filters -->
    <lib path="../../../../default/lib/common.jar" />
    <lib path="../../../../default/lib/webapp.jar" />
    <lib path="../../../../default/lib/commons-pool-1.4.jar" />
    <abortOnConfigurationError>${solr.abortOnConfigurationError:true}</abortOnConfigurationError>
    <directoryFactory name="DirectoryFactory" class="org.apache.solr.core.StandardDirectoryFactory" />
    <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
      <lst name="defaults">
        <str name="config">database.xml</str>
      </lst>
    </requestHandler>
  </config>

*Custom Libraries*

The common.jar contains a customized TokenFiltersFactory implementation that we use for indexing. It does some special processing of the fields read from the database.
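One detail worth noting: the JVMARGS shown earlier in the thread contain en-dash characters (–Xmx, –Xms) rather than ASCII hyphens, which may just be email formatting, but if pasted literally the JVM would not parse them as options. Either way, a reliable way to see the heap actually in effect is to ask the runtime directly. A minimal sketch (class name is illustrative, not part of the project):

```java
// HeapProbe.java — print the effective max heap so the -Xmx value
// actually in force can be confirmed, regardless of what the startup
// script appears to say.
public class HeapProbe {
    public static void main(String[] args) {
        long maxBytes = Runtime.getRuntime().maxMemory();
        System.out.println("Effective max heap: " + (maxBytes / (1024 * 1024)) + " MB");
    }
}
```

Running this under the same JVMARGS as the indexer shows whether the 64MB/128MB/512MB test settings (or the intended 6GB) were really applied.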
is there any way to post images and attachments to this mailing list?
Re: is there any way to post images and attachments to this mailing list?
Just upload them in Google Drive and share the link with this group. On Fri, Apr 18, 2014 at 9:15 PM, Candygram For Mongo candygram.for.mo...@gmail.com wrote:
Re: need help from hard core solr experts - out of memory error
I have uploaded several files, including the problem description with graphics, to this link on Google Drive: https://drive.google.com/folderview?id=0B7UpFqsS5lSjWEhxRE1NN2tMNTQ&usp=sharing I shared it with this address (solr-user@lucene.apache.org), so I am hoping it can be accessed by people in the group. On Fri, Apr 18, 2014 at 5:15 PM, Candygram For Mongo candygram.for.mo...@gmail.com wrote: [issue description, environment values, query and solrconfig.xml differences quoted in full above]

*Custom Libraries*

The common.jar contains a customized TokenFiltersFactory implementation that we use for indexing. It does some special processing of the fields read from the database; how those classes are used is described in the schema.xml file. The webapp.jar file contains other related classes. The commons-pool-1.4.jar is an Apache API used for instance reuse. The logic used in the TokenFiltersFactory is contained in the following files: ConcatFilterFactory.java, ConcatFilter.java, MDFilterSchemaFactory.java, MDFilter.java
Re: Boost Search results
I guess you can apply some deboost to the URL field. Lakshmi, it would be easier to make suggestions if you also provided some kind of example of what you want to achieve. On Saturday, April 19, 2014, A Laxmi a.lakshmi...@gmail.com wrote: Markus, like I mentioned in my last email, I have got the qf with title, content and url. That doesn't help a whole lot. Could you please advise whether there are any other parameters I should consider for the Solr request handler config, or whether the numbers I have for title, content, and url in the qf parameter have to be modified? Thanks for your help. On Fri, Apr 18, 2014 at 4:08 PM, A Laxmi a.lakshmi...@gmail.com wrote: Hi Markus, Yes, you are right. I passed the qf from my front-end framework (PHP, which uses SolrClient). This is how I have it set up:

  $this->solr->set_param('defType', 'edismax');
  $this->solr->set_param('qf', 'title^10 content^5 url^5');

where you can see qf = title^10 content^5 url^5. On Fri, Apr 18, 2014 at 4:02 PM, Markus Jelsma markus.jel...@openindex.io wrote: Hi, replicating full-featured search engine behaviour is not going to work with Nutch and Solr out of the box. You are missing a thousand features such as proper main content extraction, deduplication, classification of content and hub or link pages, and much more. These things are possible to implement, but you may want to start by having your Solr request handler better configured; to begin with, your qf parameter does not have Nutch's default title and content fields selected. A Laxmi a.lakshmi...@gmail.com wrote: Hi, When I started to compare the search results with the two options below, I see a lot of difference in the search results, especially in the URLs that show up on top (a relevancy perspective). (1) Nutch 2.2.1 (with Solr 4.0) (2) Bing custom search set-up I wonder how I should tweak the boost parameters to get the best results on top, like Bing and Google do.
Please suggest why I see a difference, and what parameters are best to configure in Solr to achieve the relevancy I see from Bing or Google. Here is what I got in solrconfig.xml:

  <str name="defType">edismax</str>
  <str name="qf">
    text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0 manu^1.1 cat^1.4
  </str>
  <str name="q.alt">*:*</str>
  <str name="rows">10</str>
  <str name="fl">*,score</str>

Thanks -- Sent from Gmail Mobile
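Following Markus's point that the qf above names stock example fields rather than the ones Nutch actually writes (title, content, url), a hedged sketch of request handler defaults for this setup might look like the following. The boost values are illustrative starting points, not tuned numbers, and pf (phrase-field boosting) is one extra edismax parameter often worth trying for relevancy work:

```xml
<!-- Sketch only: field names must match the Nutch schema in use;
     boosts here are untested starting values, not recommendations. -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <!-- query across the fields Nutch populates -->
    <str name="qf">title^10 content^5 url^5</str>
    <!-- boost documents where the query terms appear as a phrase -->
    <str name="pf">title^20 content^10</str>
    <str name="q.alt">*:*</str>
    <str name="rows">10</str>
    <str name="fl">*,score</str>
  </lst>
</requestHandler>
```

Adding debugQuery=true to a request shows the score contribution of each clause, which is the most direct way to see why Bing-style ordering differs.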
Re: Indexing Big Data With or Without Solr
Vineet, please share your setup once you have SolrCloud running. Are you using Jetty or Tomcat? On Saturday, April 19, 2014, Vineet Mishra clearmido...@gmail.com wrote: Thanks Furkan, I will definitely give it a try then. Thanks again! On Tue, Apr 15, 2014 at 7:53 PM, Furkan KAMACI furkankam...@gmail.com wrote: Hi Vineet; I've been using SolrCloud for this kind of Big Data and I think you should consider using it. If you have any problems you can ask here. Thanks; Furkan KAMACI 2014-04-15 13:20 GMT+03:00 Vineet Mishra clearmido...@gmail.com: Hi All, I have worked with Solr 3.5 to implement real-time search on some 100GB of data. That worked fine but was a little slow on complex queries (multiple grouped/joined queries). But now I want to index some really big data (around 4TB or even more). Can SolrCloud be a solution for it? If not, what could be the best possible solution in this case? *Stats for the previous implementation:* It was a master-slave architecture with multiple standalone instances of Solr 3.5. There were around 12 Solr instances running on different machines. *Things to consider for the next implementation:* Since all the data is sensor data, duplication and uniqueness are a factor. *Really urgent, please take the call on priority with a set of feasible solutions.* Regards -- Sent from Gmail Mobile
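On the duplicate-sensor-reading concern raised above: Solr can collapse exact duplicates at index time with SignatureUpdateProcessorFactory, which hashes a chosen set of fields into a signature and overwrites documents that produce the same one. A sketch for solrconfig.xml, with hypothetical field names (sensor_id, ts, reading) standing in for whatever actually defines uniqueness in the data:

```xml
<!-- Sketch: signatureField must also be declared in schema.xml,
     and the chain must be referenced from the update handler in use. -->
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <bool name="overwriteDupes">true</bool>
    <!-- hypothetical fields; list the ones that make a reading unique -->
    <str name="fields">sensor_id,ts,reading</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

With overwriteDupes=true a re-sent reading replaces the earlier copy instead of accumulating, which is usually what sensor feeds want.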
Re: need help from hard core solr experts - out of memory error
On 4/18/2014 6:15 PM, Candygram For Mongo wrote: [issue description quoted in full above] One characteristic of a JDBC connection is that unless you tell it otherwise, it will try to retrieve the entire resultset into RAM before any results are delivered to the application. It's not Solr doing this; it's JDBC. In this case, there are 27 million rows in the resultset. It's highly unlikely that this much data (along with the rest of Solr's memory requirements) will fit in 6GB of heap.

JDBC has a built-in way to deal with this, called fetchSize. By using the batchSize parameter on your JdbcDataSource config, you can set the JDBC fetchSize. Set it to something small, between 100 and 1000, and you'll probably get rid of the OOM problem. http://wiki.apache.org/solr/DataImportHandler#Configuring_JdbcDataSource

If you had been using MySQL, I would have recommended that you set batchSize to -1. This sets fetchSize to Integer.MIN_VALUE, which tells the MySQL driver to stream results instead of trying to either batch them or return everything. I'm pretty sure the Oracle driver doesn't work this way -- you would have to modify the dataimport source code to use Oracle's streaming method. Thanks, Shawn
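Applied to the dataSource element quoted earlier in the thread, Shawn's suggestion would look roughly like this; batchSize maps straight onto the JDBC fetchSize, and 500 is an illustrative value from the 100-1000 range he recommends, not a measured optimum:

```xml
<!-- Sketch of the poster's dataSource with batchSize added;
     the {placeholder} values are kept as in the original message. -->
<dataSource name="org_only"
            type="JdbcDataSource"
            driver="oracle.jdbc.OracleDriver"
            url="jdbc:oracle:thin:@{server name}:1521:{database name}"
            user="{username}"
            password="{password}"
            readOnly="false"
            batchSize="500" />
```

With this in place the Oracle driver fetches 500 rows per round trip instead of buffering the whole 27-million-row resultset, which is what was exhausting the heap.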