Re: exact matches possible?
Hi Erik, thanks for the response. I have ensured the type is string and that the field is indexed. No luck though. (Schema setting under solr/conf): <field name="Word" type="string" indexed="true" stored="true" /> Query: Word:apple Desired result: apple Achieved results: apple, the red apple, pine-apple, etc. I have also tried your other suggestion: q={! f=Word}apple (attempting to eliminate any results with spaces), but that just gives errors (when calling from the solr/admin query interface). Am I doing something obviously wrong? Thanks again, Roland

Erik Hatcher wrote: It's certainly quite possible with Lucene/Solr. But you have to index the field to accommodate it. If you literally want an exact match query, use the string field type and then issue a term query. q=field:value will work in simple cases (where the value has no spaces or colons, or other query parser syntax), but q={!term f=field}value is the fail-safe way to do that. Erik

On Nov 2, 2011, at 07:08, Roland Tollenaar wrote: Hi, I am trying to do a search that will only match exact words on a field. I have read somewhere that this is not what Solr is meant for but I am still hoping that it's possible. This is an example of what I have tried (to exclude spaces) but the workaround does not seem to work. Word:apple NOT What I am really looking for is the = operator in SQL (eg Word='apple') but I cannot find its equivalent for Lucene. Thanks for the help. Regards, Roland
pingQuery problem ?
My Solr instance works well; when calling the ping page I get no problem. But in the logs I see these error lines repeated. Do you know how to solve this? solrconfig.xml Thanks
Re: Using Solr components for dictionary matching?
I really don't understand what you're asking. Could you give some examples of what you're trying to do? Best Erick

On Tue, Nov 1, 2011 at 10:38 AM, Nagendra Mishr nmi...@gmail.com wrote: Hi all, Is there a good guide on using Solr components as a dictionary matcher? I need to do some pre-processing that involves lots of dictionary lookups and it doesn't seem right to query Solr for each instance. Thanks in advance, Nagendra
Re: SOLRJ commitWithin inconsistent
Vijay: You may want to try Solr 3.3/3.4 with RankingAlgorithm as it supports NRT (Near Real Time updates). You can set the commit interval to about 15 mins or as desired. You can get more information about NRT with 3.3/3.4.0 from here: http://solr-ra.tgels.org/wiki/en/Near_Real_Time_Search_ver_3.x You can download Solr 3.3/3.4.0 with RankingAlgorithm 1.3 from here: http://solr-ra.tgels.org Regards, Nagendra Nagarajayya http://solr-ra.tgels.org http://rankingalgorithm.tgels.org

On 11/2/2011 8:40 PM, Vijay Sampath wrote: Hi, I'm using commitWithin for immediate commit. The response times are inconsistent: sometimes less than a second, sometimes more than 25 seconds. I'm not sending concurrent requests. Any idea? http://wiki.apache.org/solr/CommitWithin Snippet: UpdateRequest req = new UpdateRequest(); req.add(solrDoc); req.setCommitWithin(5000); req.process(server); Thanks, Vijay
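For reference, a self-contained SolrJ sketch along the lines of Vijay's snippet; the server URL and field names are placeholders (not taken from the thread), and this only shows where setCommitWithin fits, not a tuned setup:

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.common.SolrInputDocument;

public class CommitWithinExample {
    public static void main(String[] args) throws Exception {
        // Placeholder URL; point this at your own Solr instance
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "1");           // placeholder fields
        doc.addField("name", "example");

        UpdateRequest req = new UpdateRequest();
        req.add(doc);
        req.setCommitWithin(5000);         // ask Solr to commit within 5 seconds
        req.process(server);
    }
}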
Re: exact matches possible?
Roland - Is it possible that you indexed with a different field type and then changed to string without reindexing? A query on a string field will only match literally the exact value (barring any wildcard/regex syntax), so something is fishy with your example. Your query example was odd, not sure if you meant it literally, but given the Word field name the query would be q={!term f=Word}apple - maybe you thought term was meta, but it is meant literally here. Erik

On Nov 3, 2011, at 04:45, Roland Tollenaar wrote: Hi Erik, thanks for the response. I have ensured the type is string and that the field is indexed. No luck though. (Schema setting under solr/conf): <field name="Word" type="string" indexed="true" stored="true" /> Query: Word:apple Desired result: apple Achieved results: apple, the red apple, pine-apple, etc. I have also tried your other suggestion: q={! f=Word}apple (attempting to eliminate any results with spaces), but that just gives errors (when calling from the solr/admin query interface). Am I doing something obviously wrong? Thanks again, Roland

Erik Hatcher wrote: It's certainly quite possible with Lucene/Solr. But you have to index the field to accommodate it. If you literally want an exact match query, use the string field type and then issue a term query. q=field:value will work in simple cases (where the value has no spaces or colons, or other query parser syntax), but q={!term f=field}value is the fail-safe way to do that. Erik

On Nov 2, 2011, at 07:08, Roland Tollenaar wrote: Hi, I am trying to do a search that will only match exact words on a field. I have read somewhere that this is not what Solr is meant for but I am still hoping that it's possible. This is an example of what I have tried (to exclude spaces) but the workaround does not seem to work. Word:apple NOT What I am really looking for is the = operator in SQL (eg Word='apple') but I cannot find its equivalent for Lucene. Thanks for the help. Regards, Roland
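As a rough illustration of the term-query approach Erik describes, a minimal SolrJ sketch; the server URL is a placeholder and the snippet is untested, it only shows how the {!term} syntax is passed as the q parameter:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class TermQueryExample {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

        // Exact match on the string-typed "Word" field via the term query parser
        SolrQuery query = new SolrQuery("{!term f=Word}apple");
        QueryResponse rsp = server.query(query);
        System.out.println("hits: " + rsp.getResults().getNumFound());
    }
}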
Re: Multivalued fields question
multiValued has nothing to do with how many tokens are in the field; it's just whether you can call document.add(field1, val1) more than once on the same field. Or, equivalently, whether an input XML document has two field entries with the same name. So it strictly depends upon whether you want to take it upon yourself to make these long strings or call document.add once for each value in the field. The field is returned as an array if it's multiValued. Just to make your life interesting: if you define your increment gap as 0, there is no difference between how multiValued fields are searched as opposed to single-valued fields. FWIW Erick

On Tue, Nov 1, 2011 at 1:26 PM, Travis Low t...@4centurion.com wrote: Greetings. We're finally kicking off our little Solr project. We're indexing a paltry 25,000 records but each has MANY documents attached, so we're using Tika to parse those documents into a big long string, which we use in a call to solrj.addField(relateddoccontents, bigLongStringOfDocumentContents). We don't care about search results pointing back to a particular document, just one of the 25K records, so this should work. Now my question. Many of these records have related records in other tables, and there are several types of these related records. For example, we have record #100 that may have blue records with numbers , , , and , and red records with numbers , , , . Currently we're just handling these the same way as related document contents -- we concatenate them, separated by spaces, into one long string, then we do solrj.addField(redRecords, stringOfRedRecordNumbers). That is, stringOfRedRecordNumbers is . We have no need to show these records to the user in Solr search results, because we're going to use the database for displaying detailed information for any records found. Is there any reason to specify redRecords and blueRecords as multivalued fields in schema.xml? And if we did that, we'd call solrj.addField() once for each value, would we not? cheers, Travis
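To make the one-addField-call-per-value idea concrete, a small SolrJ sketch; the record numbers here are hypothetical (the actual values were not preserved in the message) and the field names are taken from Travis's description:

import org.apache.solr.common.SolrInputDocument;

public class MultiValuedExample {
    public static void main(String[] args) {
        // Hypothetical record numbers; the real values would come from the database
        String[] redRecordNumbers = {"1001", "1002", "1003"};

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "100");
        for (String number : redRecordNumbers) {
            // one addField call per value, rather than one concatenated string
            doc.addField("redRecords", number);
        }
        System.out.println(doc);
    }
}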
Re: pingQuery problem ?
One of my cores had a missing ping request handler.
RE: Jetty logging
Hi, remove slf4j-jdk14-1.6.1.jar from the war and repack it with slf4j-log4j12.jar and log4j-1.2.14.jar instead. - http://wiki.apache.org/solr/SolrLogging Regards, Kai Gülzau

-Original Message- From: darul [mailto:daru...@gmail.com] Sent: Thursday, November 03, 2011 11:26 AM To: solr-user@lucene.apache.org Subject: Jetty logging

Hello everybody, I do not find a solution on how to configure Jetty with slf4j and a log4j.properties file. In I have put: - log4j-1.2.14.jar - slf4j-api-1.3.1.jar in directory: - log4j.properties In the end, nothing happens when running Jetty. Do you have any ideas? Thanks, Julien
Re: Using Solr components for dictionary matching?
Assuming that by dictionary you mean (also) a thesaurus, you can consider using SIREn, which is a Solr/Lucene add-on able to index (and search) RDF data. In this way, you could index an already available thesaurus like LCSH or Agrovoc, or build and index your own vocabulary. Subsequently, querying its services for lookups will benefit from Solr/Lucene features. Best, Andrea

On 11/1/11, Nagendra Mishr nmi...@gmail.com wrote: Hi all, Is there a good guide on using Solr components as a dictionary matcher? I need to do some pre-processing that involves lots of dictionary lookups and it doesn't seem right to query Solr for each instance. Thanks in advance, Nagendra
RE: large scale indexing issues / single threaded bottleneck
Shishir, we have 35 million documents, and should be doing about 5000-1 new documents a day, but with very small documents: 40 fields which have at most a few terms, with many being single terms. You may occasionally see some impact from top level index merges but those should be very infrequent given your stated volumes. For more concrete advice, you should also provide information on the size of your documents, and your search volume. JRJ

-Original Message- From: Awasthi, Shishir [mailto:shishir.awas...@baml.com] Sent: Tuesday, November 01, 2011 10:58 PM To: solr-user@lucene.apache.org Subject: RE: large scale indexing issues / single threaded bottleneck

Roman, How frequently do you update your index? I have a need to do real time add/delete to SOLR documents at a rate of approximately 20/min. The total number of documents is in the range of 4 million. Will there be any performance issues? Thanks, Shishir

-Original Message- From: Roman Alekseenkov [mailto:ralekseen...@gmail.com] Sent: Sunday, October 30, 2011 6:11 PM To: solr-user@lucene.apache.org Subject: Re: large scale indexing issues / single threaded bottleneck

Guys, thank you for all the replies. I think I have figured out a partial solution for the problem on Friday night. Adding a whole bunch of debug statements to the info stream showed that every document is following the "update document" path instead of the "add document" path, meaning that all document IDs are getting into the pending deletes queue, and Solr has to rescan its index on every commit for potential deletions. This is single threaded and seems to get progressively slower with the index size. Adding overwrite=false to the URL in the /update handler did NOT help, as my debug statements showed that messages still go to the updateDocument() function with deleteTerm not being null. So, I hacked Lucene a little bit and set deleteTerm=null as a temporary solution in the beginning of updateDocument(), and it does not call applyDeletes() anymore. This gave a 6-8x performance boost, and now we can index about 9 million documents/hour (producing 20GB of index every hour). Right now it's at 1TB index size and going, without noticeable degradation of the indexing speed. This is decent, but still the 24-core machine is barely utilized :) Now I think it's hitting a merge bottleneck, where all indexing threads are being paused. And ConcurrentMergeScheduler with 4 threads is not helping much. I guess the changes on the trunk would definitely help, but we will likely stay on 3.4. Will dig more into the issue on Monday. Really curious to see why overwrite=false didn't help, but the hack did. Once again, thank you for the answers and recommendations Roman
RE: change solr url
The file that he refers to, web.xml, is inside the solr WAR file in the folder WEB-INF. That WAR file is in ...\example\webapps. You would have to uncomment the init-param section under filter-class and change the param-value to something else. But, as the comments in the filter-class section explain, you would also have to make other changes. If you are unfamiliar with how JEE Java applications are packaged, it might be best to leave it alone. Note that both alternatives that he has suggested would change the path for all of Solr, not just admin. JRJ

-Original Message- From: Ankita Patil [mailto:ankita.pa...@germinait.com] Sent: Tuesday, November 01, 2011 11:44 PM To: solr-user@lucene.apache.org Subject: Re: change solr url

I am not very clear. Could you explain a bit in detail or give an example. Ankita.

On 2 November 2011 06:26, Chris Hostetter hossman_luc...@fucit.org wrote: : Is it possible to change the url for solr admin?? : What i want is : : http://192.168.0.89:8983/solr/private/coreName/admin : : i want to add /private/ before the coreName. Is that possible? If yes how? You can either do this via settings in your servlet container (to specify that the mapping of the solr application should be solr/private instead of solr/) or you can modify the path-prefix value in Solr's web.xml (but that is not very well tested/supported) -Hoss
RE: Questions about Solr's security
It seems to me that this issue needs to be addressed in the FAQ and in the tutorial, and that somewhere there should be a /select lock-down how-to. This is not obvious to many (most?) users of Solr. It certainly wasn't obvious to me before I read this. JRJ

-Original Message- From: Erik Hatcher [mailto:erik.hatc...@gmail.com] Sent: Tuesday, November 01, 2011 3:50 PM To: solr-user@lucene.apache.org Subject: Re: Questions about Solr's security

SSL and auth don't address that /select can hit any request handler defined (/select?qt=/update&stream.body=<delete><query>*:*</query></delete>&commit=true). Be careful! But certainly knowing all the issues mentioned on this thread, it is possible to lock Solr down and make it safe to hit directly. But not out of the box or trivially. Erik

On Nov 1, 2011, at 16:09, Alireza Salimi wrote: I'm not sure if anybody has asked these questions before or not. Sorry if they are duplicates. The problem is that the clients (smart phones) of our Solr machines are outside the network in which the Solr machines are located. So, we need to somehow expose their service to the outside world. What's the safest way to do that? If we implement just a controlled app sitting between those clients, we're going to waste lots of processing power because of proxying between Solr and the clients. We might also ignore some HTTP headers that Solr would generate, such as HTTP cache headers. Anyway, creating such an application seems to be a lot of work which is not really needed. Erik, do you think even if we use SSL and HTTP authentication, it's still not a good idea to expose Solr services?

On Tue, Nov 1, 2011 at 3:57 PM, Erik Hatcher erik.hatc...@gmail.com wrote: Be aware that even /select could have some harmful effects, see https://issues.apache.org/jira/browse/SOLR-2854 (addressed on trunk). Even disregarding that issue, /select is a potential gateway to any request handler defined, via /select?qt=/req_handler Again, in general it's not a good idea to expose Solr to anything but a controlled app server. Erik

On Nov 1, 2011, at 15:51, Alireza Salimi wrote: What if we just expose '/select' paths - by firewalls and load balancers - and also use SSL and HTTP basic or digest access control?

On Tue, Nov 1, 2011 at 2:20 PM, Chris Hostetter hossman_luc...@fucit.org wrote: : I was wondering if it's a good idea to expose Solr to the outside world, : so that our clients running on smart phones will be able to use Solr. As a general rule of thumb, i would say that it is not a good idea to expose solr directly to the public internet. there are exceptions to this rule -- AOL hosted some live solr instances of the Sarah Palin emails for HuffPo -- but it is definitely an expert level type thing for people who are so familiar with solr they know exactly what to lock down to make it safe for typical users: put an application between your untrusted users and solr and only let that application generate safe, well-formed requests to Solr... https://wiki.apache.org/solr/SolrSecurity -Hoss

-- Alireza Salimi Java EE Developer

-- Walter Underwood Venture Asst. Scoutmaster Troop 14, Palo Alto, CA
RE: large scale indexing issues / single threaded bottleneck
Hi, we are currently thinking about the performance facts too. I wonder if there are any sites on the net describing what a large index is? People always talk about huge indexes and heavy commits etc., but I can't find stats about it in numbers, and no information about the hardware used. Maybe an article in the wiki would help. I expect our index to be about 4 to 5 gig with 500,000 docs and 80,000 commits a day. Is that considered to be large, medium or small? Greets Sebastian

-Original Message- From: Jaeger, Jay - DOT [mailto:jay.jae...@dot.wi.gov] Sent: Thursday, November 3, 2011 2:00 PM To: 'solr-user@lucene.apache.org' Subject: RE: large scale indexing issues / single threaded bottleneck

Shishir, we have 35 million documents, and should be doing about 5000-1 new documents a day, but with very small documents: 40 fields which have at most a few terms, with many being single terms. You may occasionally see some impact from top level index merges but those should be very infrequent given your stated volumes. For more concrete advice, you should also provide information on the size of your documents, and your search volume. JRJ

-Original Message- From: Awasthi, Shishir [mailto:shishir.awas...@baml.com] Sent: Tuesday, November 01, 2011 10:58 PM To: solr-user@lucene.apache.org Subject: RE: large scale indexing issues / single threaded bottleneck

Roman, How frequently do you update your index? I have a need to do real time add/delete to SOLR documents at a rate of approximately 20/min. The total number of documents is in the range of 4 million. Will there be any performance issues? Thanks, Shishir

-Original Message- From: Roman Alekseenkov [mailto:ralekseen...@gmail.com] Sent: Sunday, October 30, 2011 6:11 PM To: solr-user@lucene.apache.org Subject: Re: large scale indexing issues / single threaded bottleneck

Guys, thank you for all the replies. I think I have figured out a partial solution for the problem on Friday night. Adding a whole bunch of debug statements to the info stream showed that every document is following the "update document" path instead of the "add document" path, meaning that all document IDs are getting into the pending deletes queue, and Solr has to rescan its index on every commit for potential deletions. This is single threaded and seems to get progressively slower with the index size. Adding overwrite=false to the URL in the /update handler did NOT help, as my debug statements showed that messages still go to the updateDocument() function with deleteTerm not being null. So, I hacked Lucene a little bit and set deleteTerm=null as a temporary solution in the beginning of updateDocument(), and it does not call applyDeletes() anymore. This gave a 6-8x performance boost, and now we can index about 9 million documents/hour (producing 20GB of index every hour). Right now it's at 1TB index size and going, without noticeable degradation of the indexing speed. This is decent, but still the 24-core machine is barely utilized :) Now I think it's hitting a merge bottleneck, where all indexing threads are being paused. And ConcurrentMergeScheduler with 4 threads is not helping much. I guess the changes on the trunk would definitely help, but we will likely stay on 3.4. Will dig more into the issue on Monday. Really curious to see why overwrite=false didn't help, but the hack did.
Once again, thank you for the answers and recommendations Roman
RE: Jetty logging
Well, Jetty is running as a unix service. Here is the run command: jetty-logging.xml: With this configuration I have logs from Jetty but no logs from log4j. Example /logs/_mm_dd.stderrout.log:
2011-11-03 14:36:59.306:INFO::jetty-6.1-SNAPSHOT
Nov 3, 2011 2:36:59 PM org.apache.solr.core.SolrResourceLoader locateSolrHome INFO: JNDI not configured for solr (NoInitialContextEx)
Nov 3, 2011 2:36:59 PM org.apache.solr.core.SolrResourceLoader locateSolrHome INFO: using system property solr.solr.home: /opt/solr-slave/multicore
Nov 3, 2011 2:36:59 PM org.apache.solr.core.SolrResourceLoader init INFO: Solr home set to '/opt/solr-slave/multicore/'
Nov 3, 2011 2:36:59 PM org.apache.solr.servlet.SolrDispatchFilter init INFO: SolrDispatchFilter.init()
Nov 3, 2011 2:36:59 PM org.apache.solr.core.SolrResourceLoader locateSolrHome INFO: JNDI not configured for solr (NoInitialContextEx)
Nov 3, 2011 2:36:59 PM org.apache.solr.core.SolrResourceLoader locateSolrHome INFO: using system property solr.solr.home: /opt/solr-slave/multicore
Nov 3, 2011 2:36:59 PM org.apache.solr.core.CoreContainer$Initializer initialize
I would like Jetty to use my resource/log4j.properties file:
Re: how to apply sort and search both on multivalued field in solr
What does sorting on a multivalued field mean? Should the document appear, in your example, in the a's? c's? e's? p's? In the general case I can't think of a logical place to sort a document into a list when it has more than one token. Why wouldn't searching on your multivalued field and sorting on your min and max fields give you what you want? Can you give an example? Best Erick

On Wed, Nov 2, 2011 at 8:32 AM, vrpar...@gmail.com vrpar...@gmail.com wrote: Hello all, I did googling and also, as per the wiki, we can not apply sorting on a multivalued field. A workaround for that is to add two more fields for the particular multivalued field, min and max. e.g. if the multivalued field has 4 values abc, cde, efg, pqr then min=abc and max=pqr and we can sort on those. This is fine if it is only required to sort on the multivalued field, but I want to do searching and sorting on the same multivalued field, and then the result would not be fine. How to solve this problem? Thanks vishal parekh
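A rough SolrJ sketch of the min/max workaround discussed above; the companion field names Array1_min and Array1_max are hypothetical single-valued schema fields used only for sorting, and the values are just the example ones from the thread:

import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import org.apache.solr.common.SolrInputDocument;

public class MinMaxSortFields {
    public static void main(String[] args) {
        // Example values for the multivalued Array1 field
        List<String> values = Arrays.asList("abc", "cde", "efg", "pqr");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "1");
        for (String v : values) {
            doc.addField("Array1", v);                 // searchable, multiValued
        }
        // single-valued companion fields used only for sorting
        doc.addField("Array1_min", Collections.min(values));
        doc.addField("Array1_max", Collections.max(values));
        System.out.println(doc);
    }
}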
Re: DIH doesn't handle bound namespaces?
Hi Gary, From http://wiki.apache.org/solr/DataImportHandler#Usage_with_XML.2BAC8-HTTP_Datasource : It does not support namespaces, but it can handle xmls with namespaces. When you provide the xpath, just drop the namespace and give the rest (eg if the tag is 'dc:subject' the mapping should just contain 'subject'). Easy, isn't it? And you didn't need to write one line of code! Enjoy. You should be able to use xpath=//titleInfo/title without making any modifications (removing the namespace) to your xml. I hope that answers your question. Regards, Tricia

On Mon, Oct 31, 2011 at 9:24 AM, Moore, Gary gary.mo...@ars.usda.gov wrote: I'm trying to import some MODS XML using DIH. The XML uses bound namespacing:

<mods xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:mods="http://www.loc.gov/mods/v3" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://www.loc.gov/mods/v3" xsi:schemaLocation="http://www.loc.gov/mods/v3 http://www.loc.gov/mods/v3/mods-3-4.xsd" version="3.4">
  <mods:titleInfo>
    <mods:title>Malus domestica: Arnold</mods:title>
  </mods:titleInfo>
</mods>

However, XPathEntityProcessor doesn't seem to handle xpaths of the type xpath=//mods:titleInfo/mods:title. If I remove the namespaces from the source XML:

<mods xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:mods="http://www.loc.gov/mods/v3" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://www.loc.gov/mods/v3" xsi:schemaLocation="http://www.loc.gov/mods/v3 http://www.loc.gov/mods/v3/mods-3-4.xsd" version="3.4">
  <titleInfo>
    <title>Malus domestica: Arnold</title>
  </titleInfo>
</mods>

then xpath=//titleInfo/title works just fine. Can anyone confirm that this is the case and, if so, recommend a solution? Thanks Gary Gary Moore Technical Lead LCA Digital Commons Project NAL/ARS/USDA
Re: Stream still in memory after tika exception? Possible memoryleak?
Hi All, I'm experiencing a similar problem to the others' in the thread. I've recently upgraded from apache-solr-4.0-2011-06-14_08-33-23.war to apache-solr-4.0-2011-10-14_08-56-59.war and then apache-solr-4.0-2011-10-30_09-00-00.war to index ~5300 pdfs, of various sizes, using the TikaEntityProcessor. My indexing would run to completion and was completely successful under the June build. The only error was readability of the fulltext in highlighting. This was fixed in Tika 0.10 (TIKA-611). I chose to use the October 14 build of Solr because Tika 0.10 had recently been included (SOLR-2372). On the same machine, without changing any memory settings, my initial problem is a PermGen error. Fine, I increase the PermGen space. I've set the onError parameter to skip for the TikaEntityProcessor. Now I get several (6) "SEVERE: Exception thrown while getting data java.net.SocketTimeoutException: Read timed out" and "SEVERE: Exception in entity : tika:org.apache.solr.handler.dataimport.DataImportHandlerException: Exception in invoking url <url removed> # 2975" pairs. And after ~3881 documents, with auto commit set unreasonably frequently, I consistently get an Out of Memory error: "SEVERE: Exception while processing: f document : null:org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.OutOfMemoryError: Java heap space" The stack trace points to org.apache.pdfbox.io.RandomAccessBuffer.expandBuffer(RandomAccessBuffer.java:151) and org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:718). The October 30 build performs identically. Funny thing is that monitoring via JConsole doesn't reveal any memory issues. Because the out of memory error did not occur in June, this leads me to believe that a bug has been introduced to the code since then. Should I open an issue in JIRA? Thanks, Tricia

On Tue, Aug 30, 2011 at 12:22 PM, Marc Jacobs jacob...@gmail.com wrote: Hi Erick, I am using Solr 3.3.0, but with 1.4.1 the same problems. The connector is a homemade program in the C# programming language and is posting via http remote streaming (i.e. http://localhost:8080/solr/update/extract?stream.file=/path/to/file.doc&literal.id=1 ). I'm using Tika to extract the content (comes with the Solr Cell). A possible problem is that the file stream needs to be closed, after extracting, by the client application, but it seems that something is going wrong while getting a Tika exception: the stream never leaves the memory. At least that is my assumption. What is the common way to extract content from office files (pdf, doc, rtf, xls etc) and index them? To write a content extractor / validator yourself? Or is it possible to do this with the Solr Cell without getting a huge memory consumption? Please let me know. Thanks in advance. Marc

2011/8/30 Erick Erickson erickerick...@gmail.com: What version of Solr are you using, and how are you indexing? DIH? SolrJ? I'm guessing you're using Tika, but how? Best Erick

On Tue, Aug 30, 2011 at 4:55 AM, Marc Jacobs jacob...@gmail.com wrote: Hi all, Currently I'm testing Solr's indexing performance, but unfortunately I'm running into memory problems. It looks like Solr is not closing the file stream after an exception, but I'm not really sure. The current system I'm using has 150GB of memory and while I'm indexing the memory consumption is growing and growing (eventually more than 50GB). In the attached graph I indexed about 70k office documents (pdf, doc, xls etc) and between 1 and 2 percent throw an exception.
The commits are after 64MB, 60 seconds, or after a job (there are 6 evenly divided jobs). After indexing, the memory consumption isn't dropping. Even after an optimize command it's still there. What am I doing wrong? I can't imagine I'm the only one with this problem. Thanks in advance! Kind regards, Marc
Default value for dynamic fields
Is there any way to define the default value for the dynamic fields in SOLR? I use some dynamic fields of type float with _val_ and if they haven't been created at index time, the value defaults to 0. I would want this to be 1. Can that be changed?
Re: how to apply sort and search both on multivalued field in solr
Thanks Erick, what I gave ('abc', etc.) are values of one multivalued field in one document, but it might be confusing. Let's say I have one field named Array1 which has multiValued=true. Now I want to search on Array1, but I want only the affected values (which I can get in highlighting). I also want to sort on the field Array1, and whatever the response is, it should be sorted on only the affected values (the ones which contain the search term). Also, without search, sorting on Array1 sometimes works fine, sometimes not. Thanks Vishal Parekh
Re: exact matches possible?
Hi Erik, you are spot on with your guess. I had reinserted my data but apparently that does not reindex. Deleting everything and re-entering it was required. Behaviour now seems to be as desired. Thank you very much. PS, thanks for pointing out that the !term is literal. Where can I find that kind of information on the internet? I use the Lucene syntax page as my reference but it appears to be somewhat limited: http://lucene.apache.org/java/2_9_1/queryparsersyntax.html Kind regards, Roland

Erik Hatcher wrote: Roland - Is it possible that you indexed with a different field type and then changed to string without reindexing? A query on a string field will only match literally the exact value (barring any wildcard/regex syntax), so something is fishy with your example. Your query example was odd, not sure if you meant it literally, but given the Word field name the query would be q={!term f=Word}apple - maybe you thought term was meta, but it is meant literally here. Erik

On Nov 3, 2011, at 04:45, Roland Tollenaar wrote: Hi Erik, thanks for the response. I have ensured the type is string and that the field is indexed. No luck though. (Schema setting under solr/conf): <field name="Word" type="string" indexed="true" stored="true" /> Query: Word:apple Desired result: apple Achieved results: apple, the red apple, pine-apple, etc. I have also tried your other suggestion: q={! f=Word}apple (attempting to eliminate any results with spaces), but that just gives errors (when calling from the solr/admin query interface). Am I doing something obviously wrong? Thanks again, Roland

Erik Hatcher wrote: It's certainly quite possible with Lucene/Solr. But you have to index the field to accommodate it. If you literally want an exact match query, use the string field type and then issue a term query. q=field:value will work in simple cases (where the value has no spaces or colons, or other query parser syntax), but q={!term f=field}value is the fail-safe way to do that. Erik

On Nov 2, 2011, at 07:08, Roland Tollenaar wrote: Hi, I am trying to do a search that will only match exact words on a field. I have read somewhere that this is not what Solr is meant for but I am still hoping that it's possible. This is an example of what I have tried (to exclude spaces) but the workaround does not seem to work. Word:apple NOT What I am really looking for is the = operator in SQL (eg Word='apple') but I cannot find its equivalent for Lucene. Thanks for the help. Regards, Roland
Re: Selective Result Grouping
Ok I think I get this. I think this can be achieved if one could specify a filter inside a group, and only documents that pass the filter get grouped. For example, only group documents with the value image for the mimetype field. This filter should be specified per group command. Maybe we should open an issue for this? Martijn

On 1 November 2011 19:58, entdeveloper cameron.develo...@gmail.com wrote: Martijn v Groningen-2 wrote: When using the group.field option values must be the same otherwise they don't get grouped together. Maybe fuzzy grouping would be nice. Grouping videos and images based on mimetype should be easy, right? Videos have a mimetype that starts with video/ and images have a mimetype that starts with image/. Storing the mime type's subtype and type in separate fields and grouping on the type field would do the job. Of course you need to know the mimetype during indexing, but solutions like Apache Tika can do that for you. Not necessarily interested in grouping by mimetype (that's an analysis issue). I simply used videos and images as an example. I'm not sure what you mean by fuzzy grouping. But my goal is to have collapse be more selective somehow on what gets grouped. As a more specific example, I have a field called 'type', with the following possible field values: image, video, webpage. Basically I want to be able to collapse all the images into a single result so that they don't fill up the first page of the results. This is not possible with the current grouping implementation because if you call group.field=type, it'll group everything. I do not want to collapse videos or webpages, only images. I've attached a screenshot of Google's results page to help explain what I mean. http://lucene.472066.n3.nabble.com/file/n3471548/Screen_Shot_2011-11-01_at_11.52.04_AM.png Hopefully that makes more sense. If it's still not clear I can email you privately.

-- Met vriendelijke groet, Martijn van Groningen
Re: Default value for dynamic fields
On 11/3/2011 12:59 PM, Milan Dobrota wrote: Is there any way to define the default value for the dynamic fields in SOLR? I use some dynamic fields of type float with _val_ and if they haven't been created at index time, the value defaults to 0. I would want this to be 1. Can that be changed? Does specifying default=1 not work?
Re: Stopword filter - refreshing stop word list periodically
On Fri, Oct 14, 2011 at 10:06 PM, Jithin jithin1...@gmail.com wrote: What will be the name of this hard coded core? I was rearranging my directory structure, adding a separate directory for code. And it does work with a single core. In trunk, the core in a single-core setup is called collection1. So to reload that you'd call the URL: http://localhost:8983/solr/admin/cores?action=RELOAD&core=collection1 -- Sami Siren
Re: how to apply sort and search both on multivalued field in solr
Right, the behavior when sorting on a multivalued field is not defined, so results are unreliable. There's nothing that I know of that'll allow your sort to occur on the matched terms in a multiValued field. But, again, defining correct behavior here isn't easy. What if you searched for two terms and both terms matched a value in a single document's multiValued field? Which term should it sort by? Sorry, but sorting just doesn't work that way and I don't have any bright ideas for how to get this to work as you'd like. Best Erick

On Thu, Nov 3, 2011 at 1:06 PM, vrpar...@gmail.com vrpar...@gmail.com wrote: Thanks Erick, what I gave ('abc', etc.) are values of one multivalued field in one document, but it might be confusing. Let's say I have one field named Array1 which has multiValued=true. Now I want to search on Array1, but I want only the affected values (which I can get in highlighting). I also want to sort on the field Array1, and whatever the response is, it should be sorted on only the affected values (the ones which contain the search term). Also, without search, sorting on Array1 sometimes works fine, sometimes not. Thanks Vishal Parekh
Three questions about: Commit, single index vs multiple indexes and implementation advice
Hi guys! I have a couple of questions that I hope someone could help me with: 1) Recently I've implemented Solr in my app. My use case is not complicated. Suppose that there will be 50 concurrent users tops. This is an app like, let's say, a CRM. I tell you this so you have an idea in terms of how many read and write operations will be needed. What I do need is that the data that is added / updated be available right after it's added / updated (maybe a second later is ok). I know that the commit operation is expensive, so maybe doing a commit right after each write operation is not a good idea. I'm trying to use the autoCommit feature with a maxTime of 1000ms, but then the question arose: is this the best way to handle this type of situation? And if not, what should I do? 2) I'm using a single index per entity type because I've read that if the app is not handling lots of data (let's say, 1 million records) then it's safe to use a single index. Is this true? If not, why? 3) Is it a problem if I use a simple setup of Solr using a single core for this use case? If not, what do you recommend? Any help on any of these topics would be greatly appreciated. Thanks in advance!
Ordered proximity search
Hi, By ordered I mean term1 will always come before term2 in the document. I have two documents: 1. By ordered I mean term1 will always come before term2 in the document 2. By ordered I mean term2 will always come before term1 in the document If I make the query: "term1 term2"~Integer.MAX_VALUE my result is: 2 documents. How can I query to have one result (only if term1 comes before term2): By ordered I mean term1 will always come before term2 in the document Thanks
Re: Default value for dynamic fields
It doesn't work for me. 2011/11/3 Yury Kats yuryk...@yahoo.com On 11/3/2011 12:59 PM, Milan Dobrota wrote: Is there any way to define the default value for the dynamic fields in SOLR? I use some dynamic fields of type float with _val_ and if they haven't been created at index time, the value defaults to 0. I would want this to be 1. Can that be changed? Does specifying default=1 not work? -- Milan Dobrota Ruby on Rails developer milandobrota.com rubylove.info
Re: Can you please guide me through step-by-step installation of Solr Cell ?
: Caused by: org.apache.solr.common.SolrException: Error loading class 'solr.extraction.ExtractingRequestHandler' : : With the jetty and the provided example, I have no problem. It all happens when I use tomcat and solr. : : My setup is as follows: : : I downloaded the apache-solr-3.3.0 and unpacked it. I am using the : apache-solr-3.3.0 folder as my solr-home folder. Inside the dist : folder I have the apache-solr-3.3.0.war and copied everything from the : contrib/extraction/lib into dist. just copying jars into dist isn't going to make things magically work for you -- what matters is that your solr instance knows how to find those plugin jars. when you use the example jetty instance, the solrconfig.xml file has lib directives with relative paths that indicate where to find them. If you use a different solr home dir and/or move files around then those lib directives are no longer going to work... https://wiki.apache.org/solr/SolrPlugins#How_to_Load_Plugins https://wiki.apache.org/solr/SolrConfigXml#lib -Hoss
performance - dynamic fields versus static fields
Hi, Is there a handy resource on: a. the performance of dynamic fields versus static fields, and b. other pros and cons? Thanks.
Re: score based on unique words matching
: q=david bowie changes : : Problem : If a record mentions david bowie a lot, it beats out something : more relevant (more unique matches) ... : : A. (now appearing david bowie at the cineplex 7pm david bowie goes on stage, : then mr. bowie will sign autographs) : B. song: david bowie - changes : : (A) ends up more relevant because of the frequency or number of words in : it.. not cool... : I want it so the number of words matching will trump density/weight debugQuery=true is your friend .. it will show you exactly how the scores are being computed. The key factors in something like this are fieldNorm, tf, and the coord factor. The fieldNorm includes as a factor the length of the field, so as long as you have omitNorms=false configured for this field, doc #A should be penalized relative to doc #B for being longer -- but if you omit norms then that won't help you -- so start by checking that. The coord factor will penalize documents that don't match all of the clauses of a boolean query (ie: doc #A only matches 2/3 clauses because it doesn't match the word changes), so you could customize your Similarity implementation to make that coord penalty higher, but that requires some custom java code. As an extreme option, you could use omitTf to completely eliminate the term frequency from being a factor in scoring (so the number of times bowie appears won't affect the score, just that it appears at least once), but that probably isn't what you want: "david bowie changes some stuff" would get the same score as "david bowie changes david bowie". In general the simplest way to deal with a lot of this type of thing is to think about how you are structuring your query. Something as simple as using the dismax parser with your field in both the qf and pf params (and a little bit of slop in the ps param) may give you exactly what you want (since it will reward docs where the whole query string appears in the field)... https://wiki.apache.org/solr/DisMaxQParserPlugin -Hoss
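To illustrate Hoss's last suggestion, a small SolrJ sketch that builds a dismax request with qf, pf, and a little phrase slop; the field name "text" and the parameter values are assumptions for illustration only, not taken from the thread:

import org.apache.solr.client.solrj.SolrQuery;

public class DismaxPhraseBoost {
    public static void main(String[] args) {
        SolrQuery q = new SolrQuery("david bowie changes");
        q.set("defType", "dismax");
        q.set("qf", "text");          // assumed field name
        q.set("pf", "text");          // reward docs where the whole query appears as a phrase
        q.set("ps", "2");             // a little phrase slop
        q.set("debugQuery", "true");  // inspect how the score is computed
        System.out.println(q);        // prints the resulting request parameters
    }
}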
Re: Access Document Score in Custom Function Query (ValueSource)
: In this value source I compute another score for every document : using some features. I want to access the score of the query myField^2 : (for a given document) in this same value source. : : Ideas? Your ValueSource can wrap the score from the other query using a QueryValueSource. Just keep in mind that by definition function queries match every document in the index, so you'll still need to use the other query in some way (or use something like the frange parser to constrain the set of docs returned based on a range of values produced by your function) -Hoss
admin index version not updating
I have a setup with a master and single slave, using the collection distribution scripts. I'm not sure if it's relevant, but I'm running multicore also. I am on version 3.4.0 (we are upgrading from 1.3). My understanding is that the indexVersion (a number) reported by the stats page (admin/stats.jsp) is a timestamp that should correspond to the time of the latest snapshot. At least that's how it has behaved on version 1.3. When I install a new snapshot on the slave (snapinstaller), it does not report any errors, and logs/snapshot.current is updated with the latest snapshot, but the admin/stats page still reports the old version. Actually, the version number increases by 4 each time I install a new index, but doesn't update to anywhere near the time of the latest snapshot (it's a few days off at this point). I have verified that the slave is actually running on the latest index by searching for something that only exists in the latest index. Am I misunderstanding how to interpret the indexVersion, or is the latest snapshot not getting fully installed? Thanks Nathan
Re: DIH doesn't handle bound namespaces?
: It does not support namespaces, but it can handle xmls with namespaces. The real crux of the issue is that XPathEntityProcessor is terribly named. It should have been called LimitedXPathishSyntaxEntityProcessor or something like that, because it doesn't support full xpath syntax... "The XPathEntityProcessor implements a streaming parser which supports a subset of xpath syntax. Complete xpath syntax is not supported but most of the common use cases are covered..." ...i thought there was a DIH FAQ about this, but if not there really should be. -Hoss
Re: Dismax and phrases
Interesting, in the case where you use quotes...

: <result name="response" numFound="6888" start="0" maxScore="3.0879765"> ...
: </lst><str name="rawquerystring">asuntojen hinnat</str>
: <str name="querystring">asuntojen hinnat</str>

...there is one DisjunctionMaxQuery (expected) for the entire phrase, but in the sub-clauses for each individual field the clauses coming from your _fi fields are just building boolean OR queries of the terms from your phrase (instead of building an actual phrase query)...

: <str name="parsedquery">+DisjunctionMaxQuery((table.title_t:"asuntojen hinnat"^2.0 | title_t:"asuntojen hinnat"^2.0 | ingress_t:"asuntojen hinnat" | (text_fi:asunto text_fi:hinta) | (table.description_fi:asunto table.description_fi:hinta) | table.description_t:"asuntojen hinnat" | graphic.title_t:"asuntojen hinnat"^2.0 | ((graphic.title_fi:asunto graphic.title_fi:hinta)^2.0) | ((table.title_fi:asunto table.title_fi:hinta)^2.0) | table.contents_t:"asuntojen hinnat" | text_t:"asuntojen hinnat" | (ingress_fi:asunto ingress_fi:hinta) | (table.contents_fi:asunto table.contents_fi:hinta) | ((title_fi:asunto title_fi:hinta)^2.0))~0.01) () type:tie^6.0 type:kuv^2.0 type:tau^2.0 FunctionQuery((1.0/(3.16E-11*float(ms(const(1319437912691),date(date.modified_dt)))+1.0))^100.0)</str>

...is this perhaps a side effect of the new autoGeneratePhraseQueries option? ... you are explicitly specifying a quoted phrase, but maybe somewhere in the code path of the dismax parser that information is getting lost? Can you post the details of your schema.xml? (ie: the version property on the schema file, and the dynamicField/field + fieldType definitions for all these fields)

In contrast, your unquoted example is working exactly as I'd expect. A DisjunctionMaxQuery is built for each clause of the input, and the two DisjunctionMaxQuery objects are then combined in a BooleanQuery where the minNrShouldMatch property is set to 2.

: <result name="response" numFound="1065" start="0" maxScore="2.230382"></result> ...
: <str name="rawquerystring">asuntojen hinnat</str>
: <str name="querystring">asuntojen hinnat</str>
: <str name="parsedquery">+((DisjunctionMaxQuery((table.title_t:asuntojen^2.0 | title_t:asuntojen^2.0 | ingress_t:asuntojen | text_fi:asunto | table.description_fi:asunto | table.description_t:asuntojen | graphic.title_t:asuntojen^2.0 | graphic.title_fi:asunto^2.0 | table.title_fi:asunto^2.0 | table.contents_t:asuntojen | text_t:asuntojen | ingress_fi:asunto | table.contents_fi:asunto | title_fi:asunto^2.0)~0.01) DisjunctionMaxQuery((table.title_t:hinnat^2.0 | title_t:hinnat^2.0 | ingress_t:hinnat | text_fi:hinta | table.description_fi:hinta | table.description_t:hinnat | graphic.title_t:hinnat^2.0 | graphic.title_fi:hinta^2.0 | table.title_fi:hinta^2.0 | table.contents_t:hinnat | text_t:hinnat | ingress_fi:hinta | table.contents_fi:hinta | title_fi:hinta^2.0)~0.01))~2) () type:tie^6.0 type:kuv^2.0 type:tau^2.0 FunctionQuery((1.0/(3.16E-11*float(ms(const(1319438484878),date(date.modified_dt)))+1.0))^100.0)</str>

-Hoss
UnInvertedField vs FieldCache for facets for single-token text fields
I have some fields I facet on that are TextFields but have just a single token. The fieldType looks like this:

<fieldType name="myStringFieldType" class="solr.TextField" indexed="true" stored="false" omitNorms="true" sortMissingLast="true" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>

SimpleFacets uses an UnInvertedField for these fields because multiValuedFieldCache() returns true for TextField. I tried changing the type for these fields to the plain string type (StrField). The facets *seem* to be generated much faster. Is it expected that FieldCache would be faster than UnInvertedField for single-token strings like this? My goal is to make the facet re-generation after a commit as fast as possible. I would like to continue using TextField for these fields since I have a need for filters like LowerCaseFilterFactory, which still produces a single token. Is it safe to extend TextField and have multiValuedFieldCache() return false for these fields, so that UnInvertedField is not used? Or is there a better way to accomplish what I'm trying to do? -Michael
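For what it's worth, a minimal, untested Java sketch of the TextField subclass the question describes; whether overriding multiValuedFieldCache() this way is actually safe is exactly the open question in this message, and the class name is made up:

import org.apache.solr.schema.TextField;

// Hypothetical field type for TextFields known to produce a single token,
// so that faceting can use the FieldCache path instead of UnInvertedField.
public class SingleTokenTextField extends TextField {
    @Override
    public boolean multiValuedFieldCache() {
        return false;
    }
}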
Re: Dismax and phrases
: ...is this perhaps a side effect of the new autoGeneratePhraseQueries : option? ... you are explicitly specifying a quoted phrase, but : maybe somewhere in the code path of the dismax parser that information is : getting lost? FWIW: a) I just realized you said in your first message you were using Solr 1.4.1, which *definitely* predates the autoGeneratePhraseQueries option - so I'm really at a loss to understand how you are getting that query structure (definitely want to see your configs) b) I did some quick testing with Solr 3.4 using the example configs, and verified that regardless of how autoGeneratePhraseQueries is set on the fieldType for the name field, this request... http://localhost:8983/solr/select/?fl=name&debugQuery=true&q=%22samsung%20hard%20drive%22&defType=dismax&qf=name&qs=100 ...always produces a dismax query wrapped around a phrase query. -Hoss
RE: Questions about Solr's security
Me too!

-Original Message- From: Walter Underwood [mailto:wun...@wunderwood.org] Sent: Tuesday, November 01, 2011 1:02 PM To: solr-user@lucene.apache.org Subject: Re: Questions about Solr's security

I once had to deal with a severe performance problem caused by a bot that was requesting results starting at 5000. We disallowed requests over a certain number of pages in the front end to fix it. wunder

On Nov 1, 2011, at 12:57 PM, Erik Hatcher wrote: Be aware that even /select could have some harmful effects, see https://issues.apache.org/jira/browse/SOLR-2854 (addressed on trunk). Even disregarding that issue, /select is a potential gateway to any request handler defined, via /select?qt=/req_handler Again, in general it's not a good idea to expose Solr to anything but a controlled app server. Erik

On Nov 1, 2011, at 15:51, Alireza Salimi wrote: What if we just expose '/select' paths - by firewalls and load balancers - and also use SSL and HTTP basic or digest access control?

On Tue, Nov 1, 2011 at 2:20 PM, Chris Hostetter hossman_luc...@fucit.org wrote: : I was wondering if it's a good idea to expose Solr to the outside world, : so that our clients running on smart phones will be able to use Solr. As a general rule of thumb, i would say that it is not a good idea to expose solr directly to the public internet. there are exceptions to this rule -- AOL hosted some live solr instances of the Sarah Palin emails for HuffPo -- but it is definitely an expert level type thing for people who are so familiar with solr they know exactly what to lock down to make it safe for typical users: put an application between your untrusted users and solr and only let that application generate safe, well-formed requests to Solr... https://wiki.apache.org/solr/SolrSecurity -Hoss

-- Alireza Salimi Java EE Developer

-- Walter Underwood Venture Asst. Scoutmaster Troop 14, Palo Alto, CA
Re: BaseTokenFilterFactory not found in plugin
: myorg/solr/analysis/*.java`. I then made a `.jar` file from the .class files : and put the .jar file in the solr/lib/ directory. I modified schema.xml to : include the new filter: what exactly do you mean by the solr/lib/ directory? ... if you mean that solr is the solr home dir where you are running solr, so you have a structure like this... solr/conf/solrconfig.xml solr/conf/schema.xml solr/lib/your-jar-name.jar ...then that should be correct. If however you put it in some other lib directory (like, perhaps jetty's lib directory) then it might get loaded by a lower level class loader, so it has no runtime visibility of the classes loaded by Solr. When Solr starts up, the SolrResourceLoader explicitly logs every jar file it finds in its lib dir, or any jars explicitly specified, or loaded because of @sharedLib or <lib/> configurations, so check your logs to make sure your jar is listed there -- if it's not, but it's still getting loaded, then it's getting loaded by a different classloader. -Hoss
Re: Default value for dynamic fields
On Thu, Nov 3, 2011 at 12:59 PM, Milan Dobrota mi...@milandobrota.com wrote: Is there any way to define the default value for the dynamic fields in SOLR? I use some dynamic fields of type float with _val_ and if they haven't been created at index time, the value defaults to 0. I would want this to be 1. Can that be changed? On trunk, there are some new (currently undocumented) function queries that can do this: def(myfield,1) If there are not normally 0 values anyway, you can also map any 0 values encountered via map(), or min() if existing values are all positive. -Yonik http://www.lucidimagination.com
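A couple of hedged examples of how those functions might be plugged in (the field name weight_f is purely illustrative):

  trunk/4.x:  q={!func}def(weight_f,1)
  3.x:        q={!func}map(weight_f,0,0,1)

map(x,min,max,target) rewrites any value falling in [min,max] (here an exact 0) to the target value, so it only behaves like a default if real values are never legitimately 0 -- which is the caveat in the reply above.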
Re: facet with group by (or field collapsing)
I'm attempting the following query: http://{host}/apache-solr-3.3.0/select/?q=cesy&version=2.2&start=0&rows=10&indent=on&group=true&group.field=SIP&group.limit=1&facet=true&facet.field=REPOSITORYNAME The result is 4 matches, all in 1 group (with group.limit=1). Rather than show facet.field=REPOSITORYNAME's 4 facets, I want to see the REPOSITORYNAME facet with a count of 1 (for the 1 group returned), with the value of the REPOSITORYNAME field in the 1 doc returned in the group. Is this possible? I tried adding the parameter collapse.facet=after, but that seemed to have no effect. -- View this message in context: http://lucene.472066.n3.nabble.com/facet-with-group-by-or-field-collapsing-tp497252p3478515.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: facet with group by (or field collapsing)
collapse.facet=after doesn't exist in Solr 3.3. This parameter exists in the SOLR-236 patches and is implemented differently in the released versions of Solr. From Solr 3.4 you can use group.truncate. The facet counts are then computed based on the most relevant document of each group. Martijn On 3 November 2011 22:47, erj35 eric.ja...@yale.edu wrote: I'm attempting the following query: http://{host}/apache-solr-3.3.0/select/?q=cesy&version=2.2&start=0&rows=10&indent=on&group=true&group.field=SIP&group.limit=1&facet=true&facet.field=REPOSITORYNAME The result is 4 matches, all in 1 group (with group.limit=1). Rather than show facet.field=REPOSITORYNAME's 4 facets, I want to see the REPOSITORYNAME facet with a count of 1 (for the 1 group returned), with the value of the REPOSITORYNAME field in the 1 doc returned in the group. Is this possible? I tried adding the parameter collapse.facet=after, but that seemed to have no effect. -- View this message in context: http://lucene.472066.n3.nabble.com/facet-with-group-by-or-field-collapsing-tp497252p3478515.html Sent from the Solr - User mailing list archive at Nabble.com. -- Met vriendelijke groet, Martijn van Groningen
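Assuming an upgrade to 3.4, the original request could then be kept as-is with one extra parameter appended (untested sketch):

...&group=true&group.field=SIP&group.limit=1&group.truncate=true&facet=true&facet.field=REPOSITORYNAME

With group.truncate=true the REPOSITORYNAME facet counts are computed over the group heads rather than over all 4 matching documents, which should give the count of 1 described in the question.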
Access Score in Custom Function Query
Hi, I have a custom function query (value source) where I want to use the score for some computation. For example, for every document I want to add some number (obtained from an external file) to its score. I am achieving this like the following: http://localhost:PORT/myCore/select?q=queryString&qt=my_request_handler&fl=field1,field2,score&debugQuery=on&sort=myfunc(query($qq)) desc where the definitions of my_request_handler and qq are as follows:

<requestHandler name="my_request_handler" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="qq">{!dismax v=$q}</str>
    <str name="qf">field1^2 field^3</str>
  </lst>
</requestHandler>

Questions: 1. To obtain the score in my function query I am executing the dismax query again (myfunc(query($qq))). Could it slow things down? Is there any way I can access the score without querying again? 2. I also want to normalize the (query) score I get to a range between 0 and 1. Is there any way to access the MAX_SCORE in the same function query/value source (so that I can divide every score by that)? Thanks a lot guys Sid -- View this message in context: http://lucene.472066.n3.nabble.com/Access-Score-in-Custom-Function-Query-tp3478597p3478597.html Sent from the Solr - User mailing list archive at Nabble.com.
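On question 2, one hedged possibility (not something confirmed in this thread) is the scale() function, which rescales the values of an embedded function source to a target range, e.g.:

  sort=myfunc(scale(query($qq),0,1)) desc

That avoids needing the maximum score up front, but scale() has to compute the min and max of the embedded source across the documents first, so it does nothing to help the re-execution concern in question 1.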
Re: Access Document Score in Custom Function Query (ValueSource)
I understand that. Thanks. I just posted a related question, titled Access Score in Custom Function Query, where (among other things) I am asking about the performance aspects of this method. As you said, I need to execute some query first to create a constrained recall set and then apply my custom function query (which in turn executes another query) to it. In my case I am using the same query again: first to create the recall set (which also scores the docs, though I don't use those scores), and then again inside my custom function to get the score. I am worried it may slow things down. Comments? Thanks Sid -- View this message in context: http://lucene.472066.n3.nabble.com/Access-Document-Score-in-Custom-Function-Query-ValueSource-tp3432459p3478619.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: UnInvertedField vs FieldCache for facets for single-token text fields
Hi Michael, The FieldCache is a simpler data structure and easier to create, so I also expect it to be faster. Unfortunately, for TextField an UnInvertedField is always used, even if you have one token per document. I think overriding the multiValuedFieldCache method to return false would work. If you're using 4.0-dev (trunk) I'd use facet.method=fcs (this option is only usable if the multiValuedFieldCache method returns false). This is per-segment faceting, and the cache only has to be extended for new segments; this facet approach is better for indexes with frequent changes. I think it is even faster in your case than just using the FieldCache method (which operates on a top-level reader: after each commit the complete cache is invalidated and has to be recreated). Otherwise I'd try facet.method=enum, which is fast if you have few distinct facet values (the number of docs doesn't influence the performance that much). The facet.method=enum option is also valid for normal TextFields, so no need to have custom code. Martijn On 3 November 2011 21:16, Michael Ryan mr...@moreover.com wrote: I have some fields I facet on that are TextFields but have just a single token. The fieldType looks like this:

<fieldType name="myStringFieldType" class="solr.TextField" indexed="true" stored="false" omitNorms="true" sortMissingLast="true" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>

SimpleFacets uses an UnInvertedField for these fields because multiValuedFieldCache() returns true for TextField. I tried changing the type for these fields to the plain string type (StrField). The facets *seem* to be generated much faster. Is it expected that FieldCache would be faster than UnInvertedField for single-token strings like this? My goal is to make the facet re-generation after a commit as fast as possible. I would like to continue using TextField for these fields since I have a need for filters like LowerCaseFilterFactory, which still produces a single token. Is it safe to extend TextField and have multiValuedFieldCache() return false for these fields, so that UnInvertedField is not used? Or is there a better way to accomplish what I'm trying to do? -Michael -- Met vriendelijke groet, Martijn van Groningen
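For illustration, the enum approach is purely a request-time change, so the existing TextField/KeywordTokenizer analysis (with LowerCaseFilterFactory added) can stay as it is; the field name below is assumed rather than taken from the thread:

http://localhost:8983/solr/select?q=*:*&rows=0&facet=true&facet.field=myStringField&facet.method=enum

facet.method can also be overridden per field, e.g. f.myStringField.facet.method=enum, if other facet fields should keep the default behaviour.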
Highlighter showing matched query words only
Hello Folks, I am a newbie of Solr. I wonder if Solr Highlighter can show the matched query words only. Suppose my query is godfather AND pacino. I just want to display godfather and pacino in any of the highlighted fields. For the sake of performance, I do not want to use regular expressions to parse the text and locate the query words which are already enclosed between <em> and </em>. Solr obviously has already done the searching and highlighting, but the Solr output mixes what I want with what I do not want. I just want to get out the intermediate results, the matching query words, and nothing else. Is there a way to get the intermediate results, the matching query words, before they are mixed with other text? Thank you all very much for your help in advance! N. J. -- View this message in context: http://lucene.472066.n3.nabble.com/Highlighter-showing-matched-query-words-only-tp3478731p3478731.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: Using Solr components for dictionary matching?
The scenarios that could use dictionary matching: 1. A document being processed to see if it contains one of 10,000 terms. 2. Query completion as you type. 3. Basically the inverse of finding a document: the document is the query and the dictionary of terms is matched against it in parallel. Nagendra Sent from my Windows Phone From: Erick Erickson Sent: 11/3/2011 8:13 AM To: solr-user@lucene.apache.org Subject: Re: Using Solr components for dictionary matching? I really don't understand what you're asking. Could you give some examples of what you're trying to do? Best Erick On Tue, Nov 1, 2011 at 10:38 AM, Nagendra Mishr nmi...@gmail.com wrote: Hi all, Is there a good guide on using Solr components as a dictionary matcher? I need to do some pre-processing that involves lots of dictionary lookups and it doesn't seem right to query solr for each instance. Thanks in advance, Nagendra
Re: Using Solr components for dictionary matching?
On Thu, Nov 3, 2011 at 4:06 PM, Nagendra Mishr nmi...@gmail.com wrote: The scenarios that could use dictionary matching: 1. A document being processed to see if it contains one of 10,000 terms. 2. Query completion as you type. 3. Basically the inverse of finding a document: the document is the query and the dictionary of terms is matched against it in parallel. Try the Aho-Corasick algorithm - http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm It is a kind of dictionary-matching algorithm that locates elements of a finite set of strings (the dictionary) within an input text. It matches all patterns simultaneously. The complexity of the algorithm is linear in the length of the patterns plus the length of the searched text plus the number of output matches. HTH, Vijay
how to achieve google.com like results for phrase queries
Hello, I use nutch-1.3 crawled results in solr-3.4. I noticed that for two-word phrases like "newspaper latimes", latimes.com is not in the results at all. This may be due to the dismax defType that I use in the request handler:

<str name="defType">dismax</str>
<str name="qf">url^1.5 id^1.5 content^ title^1.2</str>
<str name="pf">url^1.5 id^1.5 content^0.5 title^1.2</str>

with mm as

<str name="mm">2&lt;-1 5&lt;-2 6&lt;90%</str>

However, changing it to

<str name="mm">1&lt;-1 2&lt;-1 5&lt;-2 6&lt;90%</str>

and q.op to OR or AND does not solve the problem. In this case latimes.com is ranked higher, but still not in first place. Also, in this case results containing both words are ranked very low, almost at the end. We need latimes.com ranked first, followed by the results containing both words, and so on. Any ideas how to modify the config to this end? Thanks in advance. Alex.
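One thing to experiment with (the boost values below are only illustrative, not a known fix for this index) is leaning harder on the phrase fields and keeping the phrase slop tight, so documents matching the whole phrase get a much larger boost than single-word matches:

<str name="pf">url^10 title^5 content^2</str>
<str name="ps">0</str>

Running the query with debugQuery=on then shows how much the phrase clause actually contributes for latimes.com versus the single-word matches, which makes it easier to tune the boosts.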
Re: Stopword filter - refreshing stop word list periodically
Thanks Sami. I ended up setting up a proper core as per the documentation, named core0. On Thu, Nov 3, 2011 at 11:07 PM, Sami Siren-2 [via Lucene] ml-node+s472066n3477844...@n3.nabble.com wrote: On Fri, Oct 14, 2011 at 10:06 PM, Jithin [hidden email] wrote: What will be the name of this hard-coded core? I was rearranging my directory structure, adding a separate directory for code. And it does work with a single core. In trunk the single-core setup's core is called collection1. So to reload that you'd call the url: http://localhost:8983/solr/admin/cores?action=RELOAD&core=collection1 -- Sami Siren -- Thanks Jithin Emmanuel -- View this message in context: http://lucene.472066.n3.nabble.com/Stopword-filter-refreshing-stop-word-list-periodically-tp3421611p3479040.html Sent from the Solr - User mailing list archive at Nabble.com.