Query multiple collections together
Hi, Would like to check, is there a way to query multiple collections together in a single query and return the results in one result set? For example, I have 2 collections and I want to search for records with the word 'solr' in both of the collections. Is there a query to do that, or must I query both collections separately, and get two different result sets? Regards, Edwin
Re: Query multiple collections together
You can query multiple collections by specifying the list of collections, e.g.: http://hostname:port/solr/gettingstarted/select?q=test&collection=collection1,collection2,collection3 On Sun, May 10, 2015 at 11:49 PM, Zheng Lin Edwin Yeo edwinye...@gmail.com wrote: Hi, Would like to check, is there a way to query multiple collections together in a single query and return the results in one result set? For example, I have 2 collections and I want to search for records with the word 'solr' in both of the collections. Is there a query to do that, or must I query both collections separately, and get two different result sets? Regards, Edwin -- Anshum Gupta
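[Editor's note: for what it's worth, the same request via SolrJ might look like this minimal sketch; the host, port, and collection names are placeholders, not from the thread.]

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class MultiCollectionQuery {
        public static void main(String[] args) throws Exception {
            try (HttpSolrClient client =
                     new HttpSolrClient("http://localhost:8983/solr/collection1")) {
                SolrQuery query = new SolrQuery("solr");
                // The "collection" parameter fans the query out across the
                // listed collections and merges everything into one result set.
                query.set("collection", "collection1,collection2");
                QueryResponse response = client.query(query);
                System.out.println("Total hits: " + response.getResults().getNumFound());
            }
        }
    }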
Re: Upgraded to 4.10.3, highlighting performance unusably slow
Has anyone looked at it? On Sun, May 3, 2015 at 10:18 AM, jaime spicciati jaime.spicci...@gmail.com wrote: We ran into this as well on 4.10.3 (not related to an upgrade). It was identified during load testing when a small percentage of queries would take more than 20 seconds to return. We were able to isolate it by rerunning the same query multiple times; regardless of cache hits, the queries would still take a long time to return. We used this method to narrow down the performance problem to a small number of very large records (many, many fields in a single record). We fixed it by turning on hl.requireFieldMatch on the query so that only fields that have an actual hit are passed through the highlighter. Hopefully this helps, Jaime Spicciati On Sat, May 2, 2015 at 8:20 PM, Joel Bernstein joels...@gmail.com wrote: Hi, Can you also include the details of your research that narrowed the issue to the highlighter? Joel Bernstein http://joelsolr.blogspot.com/ On Sat, May 2, 2015 at 5:27 PM, Ryan, Michael F. (LNG-DAY) michael.r...@lexisnexis.com wrote: Are you able to identify if there is a particular part of the code that is slow? A simple way to do this is to use the jstack command (assuming your server has the full JDK installed). You can run it like this: /path/to/java/bin/jstack PID If you run that a bunch of times while your highlight query is running, you might be able to spot the hotspot. Usually I'll do something like this to see the stacktrace for the thread running the query: /path/to/java/bin/jstack PID | grep SearchHandler -B30 A few more questions: - What are the response times you are seeing before and after the upgrade? Is unusably slow 1 second, 10 seconds...? - If you run the exact same query multiple times, is it consistently slow? Or is it only slow on the first run? - While the query is running, do you see high user CPU on your server, or high IO wait, or both? (You can check this with the top command or vmstat command in Linux.) -Michael -----Original Message----- From: Cheng, Sophia Kuen [mailto:sophia_ch...@hms.harvard.edu] Sent: Saturday, May 02, 2015 4:13 PM To: solr-user@lucene.apache.org Subject: Upgraded to 4.10.3, highlighting performance unusably slow Hello, We recently upgraded Solr from 3.8.0 to 4.10.3. We saw that this upgrade caused an incredible slowdown in our searches. We were able to narrow it down to the highlighting. The slowdown is extreme enough that we are holding back our release until we can resolve this. Our research indicated that TermVectors and the FastHighlighter were the way to go, however this still does nothing for the performance. I think we may be overlooking a crucial configuration, but cannot figure it out. I was hoping for some guidance and help. Sorry for the long email, I wanted to provide enough information. Our documents are largely dynamic fields, and so we have been using '*' as the field for highlighting. This is the same setting as in prior versions of Solr. The dynamic fields are of type 'text' and we added customizations to the schema.xml for the type 'text':

    <fieldType name="text" class="solr.TextField" positionIncrementGap="100" storeOffsetsWithPositions="true" termVectors="true" termPositions="true" termOffsets="true">
      <analyzer type="index">
        <!-- this charFilter removes all xml-tagging from the text: -->
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- Case insensitive stop word removal. add enablePositionIncrements=true in both the index and query analyzers to leave a 'gap' for more accurate phrase queries. -->
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
      </analyzer>
      <analyzer type="query">
        <!-- this charFilter removes all xml-tagging from the text. Needed also in query due to autosuggest -->
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
      </analyzer>
    </fieldType>
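[Editor's note: for reference, Jaime's hl.requireFieldMatch fix could be applied from SolrJ along these lines; a hedged sketch where the URL, query, and field list are placeholders.]

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;

    public class HighlightFieldMatch {
        public static void main(String[] args) throws Exception {
            try (HttpSolrClient client =
                     new HttpSolrClient("http://localhost:8983/solr/collection1")) {
                SolrQuery query = new SolrQuery("solr");
                query.setHighlight(true);
                query.set("hl.fl", "*");
                // Only fields with an actual query hit go through the
                // highlighter, instead of every dynamic field matched by "*".
                query.set("hl.requireFieldMatch", "true");
                System.out.println(client.query(query).getHighlighting());
            }
        }
    }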
Re: Unable to identify why faceting is taking so much time
On Mon, 2015-05-11 at 05:48 +0000, Abhishek Gupta wrote: According to this there are 137 records. Now I am faceting over these 137 records with facet.method=fc. Ideally it should just iterate over these 137 records and sum up the facets. That is only the ideal method if you are not planning on issuing subsequent calls: facet.method=fc does more work up front to ensure that later calls are fast. http://localhost:9020/search/p1-umShard-1/select?q=*:*&fq=(msgType:38+AND+snCreatedTime:[2015-04-15T00:00:00Z%20TO%20*])&facet.field=conversationId&facet=true&indent=on&wt=json&rows=0&facet.method=fc&debug=timing { responseHeader: { status: 0, QTime: 395103 }, [...] } According to this, faceting is taking 395036 ms. Why is it taking *395 seconds* just to calculate facets of 137 records? 6½ minutes is a long time, even for a first call. Do you have tens to hundreds of millions of documents in your index? Or do you have a similar amount of unique values in your facet? Either way, subsequent faceting calls should be much faster, and a switch to DocValues should lower your first-call time significantly. Toke Eskildsen, State and University Library, Denmark
Re: Query multiple collections together
Thank you for the query. Just to confirm, for the 'gettingstarted' in the query, does it matter which collection name I put? Regards, Edwin On 11 May 2015 15:51, Anshum Gupta ans...@anshumgupta.net wrote: You can query multiple collections by specifying the list of collections, e.g.: http://hostname:port/solr/gettingstarted/select?q=test&collection=collection1,collection2,collection3 On Sun, May 10, 2015 at 11:49 PM, Zheng Lin Edwin Yeo edwinye...@gmail.com wrote: Hi, Would like to check, is there a way to query multiple collections together in a single query and return the results in one result set? For example, I have 2 collections and I want to search for records with the word 'solr' in both of the collections. Is there a query to do that, or must I query both collections separately, and get two different result sets? Regards, Edwin -- Anshum Gupta
Re: Re: How to get the docs id after commit
You are right. I get the last commit time and the current commit time in the newSearcher listener, then query from last commit time to current commit time so that I can get the newest committed docs. Thanks. Best, WenLi -----Original Message----- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: May 11, 2015 9:47 To: solr-user@lucene.apache.org Subject: Re: Re: How to get the docs id after commit Not something really built into Solr. It's easy enough, at least conceptually, to build in a batch_id. The idea here would be that every doc in each batch would have a unique id (really, something you changed after each commit). That pretty much requires, though, that you control the indexing carefully (we're probably talking SolrJ here). There's no good way that I know to get this info after an autocommit, for instance. I suppose you could use a TimestampUpdateProcessorFactory and keep high water marks so a query like q=timestamp:[last_timestamp_I_checked TO most_recent_timestamp] would do it. Even that, though, has some issues in SolrCloud because each server's time may be slightly off. You can get around this by placing the TimestampUpdateProcessorFactory in _front_ of the distributed update processor in your update chain, but then you'd really require that all updates be sent to the _same_ machine, or that the commit intervals were guaranteed to be outside the clock skew on your machines. Bottom line is that you'd have to build it yourself; there's no OOB functionality here. Even all the docs that last committed is ambiguous. What about autocommits? Does last committed mean _just_ the ones between the last two autocommits? It seems like you really want all the docs committed since last time I asked. And for that, you really need to control the mechanism yourself. Not only does Solr not provide this OOB, I'm not even sure how it could be implemented in a general case unless Solr became transactional. Best, Erick On Sun, May 10, 2015 at 5:38 PM, liwen(李文).apabi l@founder.com.cn wrote: Sorry. The newest means all the docs that last committed, I need to get ids of these docs to trigger another server to do something. -----Original Message----- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: May 10, 2015 23:22 To: solr-user@lucene.apache.org Subject: Re: How to get the docs id after commit Not really. It's an ambiguous thing though, what's a newest document when a whole batch is committed at once? And in distributed mode, you can fire docs to any node in the cloud and they'll get to the right shard, but order is not guaranteed so newest is a fuzzy concept. I'd put a counter in my docs that I guaranteed was increasing and just q=*:*&rows=1&sort=timestamp desc. That should give you the most recent doc. Beware using a timestamp though if you're not absolutely sure that the clock times you use are comparable! Best, Erick On Sun, May 10, 2015 at 12:57 AM, liwen(李文).apabi l@founder.com.cn wrote: Hi, Solr Developers I want to get the newest committed docs in the postcommit event, then notify the other server which data can be used, but I can not find any way to get the newest docs after commit, so is there any way to do this? Thank you. Wen Li
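[Editor's note: a rough SolrJ sketch of Erick's high-water-mark idea, assuming a TimestampUpdateProcessorFactory populates a "timestamp" field; the field name, URL, and persisted mark are placeholders, and his clock-skew caveats still apply.]

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrDocument;

    public class NewlyCommittedDocs {
        public static void main(String[] args) throws Exception {
            try (HttpSolrClient client =
                     new HttpSolrClient("http://localhost:8983/solr/collection1")) {
                // High-water mark persisted from the previous check.
                String lastChecked = "2015-05-10T00:00:00Z";
                SolrQuery query = new SolrQuery("timestamp:[" + lastChecked + " TO NOW]");
                query.setFields("id", "timestamp");
                query.setSort("timestamp", SolrQuery.ORDER.asc);
                for (SolrDocument doc : client.query(query).getResults()) {
                    // These are the ids to hand to the other server.
                    System.out.println(doc.getFieldValue("id"));
                }
            }
        }
    }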
Re: Queries on SynonymFilterFactory
2015-05-11 4:44 GMT+01:00 Zheng Lin Edwin Yeo edwinye...@gmail.com: I've managed to run the synonyms with 10 different synonym files. Each of the synonym files is 1MB in size, consisting of about 1000 tokens, and each token has about 40-50 words. These lists of files are more extreme, which I probably won't use for the real environment, except now for testing purposes. The QTime is about 100-200, as compared to about 50 for a collection without synonyms configured. Is this timing considered fast or slow? Although the synonym files are big, there isn't that much indexed in my collection yet. I'm just afraid the performance will be affected when more documents come in. Whether it's fast or slow depends on your requirements :) For a human waiting for the response, I would say 100ms is quite fast. To understand what happens when the index scales up, you should prototype! Anyway, there are a lot of solutions in Solr to scale up your system! Cheers Regards, Edwin On 9 May 2015 00:14, Zheng Lin Edwin Yeo edwinye...@gmail.com wrote: Thank you for your suggestions. I can't do proper testing on that yet as I'm currently using a 4GB RAM normal PC machine, and all these probably require more RAM than what I have. I've tried running the setup with 20 synonym files, and the system went Out of Memory before I could test anything. For your option 2), do you mean that I'll need to download a synonym database (like the one with over 20MB in size which I have), and index them into an Ad Hoc Solr Core to manage them? I probably can only try them out properly when I can get a server machine with more RAM. Regards, Edwin On 8 May 2015 at 22:16, Alessandro Benedetti benedetti.ale...@gmail.com wrote: This is a quite big Synonym corpus! If it's not feasible to have only 1 big synonym file (I haven't checked, so I assume the 1 MB limit is true, even if strange) I would do an experiment: 1) testing query time with a Solr Classic config 2) Use an Ad Hoc Solr Core to manage Synonyms (in this way we can keep it updated and use it with a custom version of the Synonym filter that will get the Synonyms directly from another Solr instance). 2b) develop a Solr plugin to provide this approach If the synonym thesaurus is really big, I guess managing them through another Solr Core (or something similar) locally will be better than managing it with an external web service. Cheers 2015-05-08 12:16 GMT+01:00 Zheng Lin Edwin Yeo edwinye...@gmail.com: So it means having more than 10 or 20 synonym files locally will still be faster than accessing an external service? As I found out that ZooKeeper only allows the synonym.txt file to be a maximum of 1MB, and as my potential synonym file is more than 20MB, I'll need to split the file into more than 20 of them. Regards, Edwin -- -- Benedetti Alessandro Visiting card: http://about.me/alessandro_benedetti Tyger, tyger burning bright In the forests of the night, What immortal hand or eye Could frame thy fearful symmetry? William Blake - Songs of Experience -1794 England -- -- Benedetti Alessandro Visiting card: http://about.me/alessandro_benedetti Tyger, tyger burning bright In the forests of the night, What immortal hand or eye Could frame thy fearful symmetry? William Blake - Songs of Experience -1794 England
Re: Query multiple collections together
FWIR, you just need to make sure that it's a valid collection. It doesn't have to be one from the list of collections that you want to query, but the collection name you use in the URL should exist. e.g., assuming you have 2 collections foo (10 docs) and bar (5 docs): /solr/foo/select?q=*:*&collection=bar #results: 5 /solr/xyz/select?q=*:*&collection=bar will lead to an HTTP 404 response /solr/foo/select?q=*:* #results: 10 On Mon, May 11, 2015 at 12:59 AM, Zheng Lin Edwin Yeo edwinye...@gmail.com wrote: Thank you for the query. Just to confirm, for the 'gettingstarted' in the query, does it matter which collection name I put? Regards, Edwin On 11 May 2015 15:51, Anshum Gupta ans...@anshumgupta.net wrote: You can query multiple collections by specifying the list of collections, e.g.: http://hostname:port/solr/gettingstarted/select?q=test&collection=collection1,collection2,collection3 On Sun, May 10, 2015 at 11:49 PM, Zheng Lin Edwin Yeo edwinye...@gmail.com wrote: Hi, Would like to check, is there a way to query multiple collections together in a single query and return the results in one result set? For example, I have 2 collections and I want to search for records with the word 'solr' in both of the collections. Is there a query to do that, or must I query both collections separately, and get two different result sets? Regards, Edwin -- Anshum Gupta -- Anshum Gupta
Re: Query multiple collections together
Ok, thank you so much. Regards, Edwin On 11 May 2015 16:15, Anshum Gupta ans...@anshumgupta.net wrote: FWIR, you just need to make sure that it's a valid collection. It doesn't have to be one from the list of collections that you want to query, but the collection name you use in the URL should exist. e.g., assuming you have 2 collections foo (10 docs) and bar (5 docs): /solr/foo/select?q=*:*&collection=bar #results: 5 /solr/xyz/select?q=*:*&collection=bar will lead to an HTTP 404 response /solr/foo/select?q=*:* #results: 10 On Mon, May 11, 2015 at 12:59 AM, Zheng Lin Edwin Yeo edwinye...@gmail.com wrote: Thank you for the query. Just to confirm, for the 'gettingstarted' in the query, does it matter which collection name I put? Regards, Edwin On 11 May 2015 15:51, Anshum Gupta ans...@anshumgupta.net wrote: You can query multiple collections by specifying the list of collections, e.g.: http://hostname:port/solr/gettingstarted/select?q=test&collection=collection1,collection2,collection3 On Sun, May 10, 2015 at 11:49 PM, Zheng Lin Edwin Yeo edwinye...@gmail.com wrote: Hi, Would like to check, is there a way to query multiple collections together in a single query and return the results in one result set? For example, I have 2 collections and I want to search for records with the word 'solr' in both of the collections. Is there a query to do that, or must I query both collections separately, and get two different result sets? Regards, Edwin -- Anshum Gupta -- Anshum Gupta
Re: Solr custom component issue
Thanks Upayavira, I tried changing it to first-component in solrconfig.xml but no luck. Am I missing something here? Here I want to add my own qf fields with boost in the query.
Re: indexing java byte code in classes / jars
There's also Perl-backed ACK. http://beyondgrep.com/ Which does the job of searching code really well. And I think at least once I came across something that stemmed from ACK and claimed it was faster/better... googling... aah! The Silver Searcher it was. :-) http://betterthanack.com/ Regards, LAFK 2015-05-09 12:40 GMT+02:00 Mark javam...@gmail.com: Hi Alexandre, Solr ASM is the exact problem I'm looking to hack about with, so I'm keen to consider any code no matter how ugly or broken Regards Mark On 9 May 2015 at 10:21, Alexandre Rafalovitch arafa...@gmail.com wrote: If you only have classes/jars, use ASM. I have done this before, have some ugly code to share if you want. If you have sources, javadoc 8 is a good way too. I am doing that now for solr-start.com, code on Github. Regards, Alex On 9 May 2015 7:09 am, Mark javam...@gmail.com wrote: To answer why bytecode - because mostly the use case I have is looking to index as much detail as possible from jars/classes: extract class names, method names, signatures, packages / imports. I am considering using ASM in order to generate an analysis view of the class. The sort of use cases I have would be method / signature searches. For example: 1) show any classes with a method named parse* 2) show any classes with a method named parse that passes in a type *json* ...etc In the past I have written something to reverse out javadocs from just java bytecode; using Solr would make this idea considerably more powerful. Thanks for the suggestions so far On 8 May 2015 at 21:19, Erik Hatcher erik.hatc...@gmail.com wrote: Oh, and sorry, I omitted a couple of details: # creating the “java” core/collection bin/solr create -c java # I ran this from my Solr source code checkout, so that SolrLogFormatter.class just happened to be handy Erik On May 8, 2015, at 4:11 PM, Erik Hatcher erik.hatc...@gmail.com wrote: What kinds of searches do you want to run? Are you trying to extract class names, method names, and such and make those searchable? If that’s the case, you need some kind of “parser” to reverse engineer that information from .class and .jar files before feeding it to Solr, which would happen before analysis. Java itself comes with a javap command that can do this; whether this is the “best” way to go for your scenario I don’t know, but here’s an interesting example pasted below (using Solr 5.x). 
— Erik Hatcher, Senior Solutions Architect http://www.lucidworks.com javap build/solr-core/classes/java/org/apache/solr/SolrLogFormatter.class > test.txt bin/post -c java test.txt now search for coreInfoMap http://localhost:8983/solr/java/browse?q=coreInfoMap I tried to be cleverer and use the stdin option of bin/post, like this: javap build/solr-core/classes/java/org/apache/solr/SolrLogFormatter.class | bin/post -c java -url http://localhost:8983/solr/java/update/extract -type text/plain -params literal.id=SolrLogFormatter -out yes -d but something isn’t working right with the stdin detection like that (it does work to `cat test.txt | bin/post…` though, hmmm) test.txt looks like this, `cat test.txt`:

    Compiled from "SolrLogFormatter.java"
    public class org.apache.solr.SolrLogFormatter extends java.util.logging.Formatter {
      long startTime;
      long lastTime;
      java.util.Map<org.apache.solr.SolrLogFormatter$Method, java.lang.String> methodAlias;
      public boolean shorterFormat;
      java.util.Map<org.apache.solr.core.SolrCore, org.apache.solr.SolrLogFormatter$CoreInfo> coreInfoMap;
      public java.util.Map<java.lang.String, java.lang.String> classAliases;
      static java.lang.ThreadLocal<java.lang.String> threadLocal;
      public org.apache.solr.SolrLogFormatter();
      public void setShorterFormat();
      public java.lang.String format(java.util.logging.LogRecord);
      public void appendThread(java.lang.StringBuilder, java.util.logging.LogRecord);
      public java.lang.String _format(java.util.logging.LogRecord);
      public java.lang.String getHead(java.util.logging.Handler);
      public java.lang.String getTail(java.util.logging.Handler);
      public java.lang.String formatMessage(java.util.logging.LogRecord);
      public static void main(java.lang.String[]) throws java.lang.Exception;
      public static void go() throws java.lang.Exception;
      static {};
    }

On May 8, 2015, at 3:31 PM, Mark javam...@gmail.com wrote: I'm looking to use Solr to search over the byte code in Classes and Jars. Does anyone know or have experience of Analyzers, Tokenizers, and Token Filters for such a task? Regards Mark
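[Editor's note: a hedged variation on Erik's javap recipe using SolrJ instead of bin/post; the classpath, class name, core name, and the "text" catch-all field are placeholders or assumptions, not from the thread.]

    import java.io.InputStream;
    import java.util.Scanner;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class BytecodeIndexer {
        public static void main(String[] args) throws Exception {
            // Reverse-engineer the class signatures with javap, as in the thread.
            Process p = new ProcessBuilder(
                    "javap", "-classpath", "build/solr-core/classes/java",
                    "org.apache.solr.SolrLogFormatter").start();
            String signatures;
            try (InputStream in = p.getInputStream();
                 Scanner s = new Scanner(in).useDelimiter("\\A")) {
                signatures = s.hasNext() ? s.next() : "";
            }
            p.waitFor();
            try (HttpSolrClient client =
                     new HttpSolrClient("http://localhost:8983/solr/java")) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "SolrLogFormatter");
                // "text" is a hypothetical searchable field for the signatures.
                doc.addField("text", signatures);
                client.add(doc);
                client.commit();
            }
        }
    }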
Re: Slow highlighting on Solr 5.0.0
Thanks for the pointers. Using hl.usePhraseHighlighter=false does indeed make it a lot faster. Obviously it's not really a solution, though, since in 4.10 it wasn't a problem and turning it off has consequences. I'm looking forward to the improvements in the next releases. --Ere 8.5.2015, 19.06, Matt Hilt wrote: I've been looking into this again. The phrase highlighter is much slower than the default highlighter, so you might be able to add hl.usePhraseHighlighter=false to your query to make it faster. Note that the web interface will NOT help here, because that param is true by default, and the checkbox is basically broken in that respect. Also, the default highlighter doesn't seem to work in all cases the phrase highlighter does, though. Also, the current development branch of 5x is much better than 5.1, but not as good as 4.10. This ticket seems to be hitting on some of the issues at hand: https://issues.apache.org/jira/browse/SOLR-5855 I think this means they are getting there, but the performance is really still much worse than 4.10, and it's not obvious why. On 5/5/15, 2:06 AM, Ere Maijala ere.maij...@helsinki.fi wrote: I'm seeing the same with Solr 5.1.0 after upgrading from 4.10.2. Here are my timings: 4.10.2: process: 1432.0 highlight: 723.0 5.1.0: process: 9570.0 highlight: 8790.0 schema.xml and solrconfig.xml are available at https://github.com/NatLibFi/NDL-VuFind-Solr/tree/master/vufind/biblio/conf . A couple of jstack outputs taken when the query was executing are available at http://pastebin.com/eJrEy2Wb Any suggestions would be appreciated. Or would it make sense to just file a JIRA issue? --Ere 3.3.2015, 0.48, Matt Hilt wrote: Short form: While testing Solr 5.0.0 within our staging environment, I noticed that highlight-enabled queries are much slower than I saw with 4.10. Are there any obvious reasons why this might be the case? As far as I can tell, nothing has changed with the default highlight search component or its parameters. A little more detail: The bulk of the collection config set was stolen from the basic 4.X example config set. I changed my schema.xml and solrconfig.xml just enough to get 5.0 to create a new collection (removed non-trie fields, some other deprecated response handler definitions, etc). I can provide my version of the solr.HighlightComponent config, but it is identical to the sample_techproducts_configs example in 5.0. Are there any other config files I could provide that might be useful? Numbers on "much slower": I indexed a very small subset of my data into the new collection and used the /select interface to do a simple debug query. Solr 4.10 gives the following pertinent info: response: { numFound: 72628, ... } debug: { timing: { time: 95, process: { time: 94, query: { time: 6 }, highlight: { time: 84 }, debug: { time: 4 } } } } Whereas Solr 5.0 is: response: { numFound: 1093, ... } debug: { timing: { time: 6551, process: { time: 6549, query: { time: 0 }, highlight: { time: 6524 }, debug: { time: 25 } } } } -- Ere Maijala Kansalliskirjasto / The National Library of Finland -- Ere Maijala Kansalliskirjasto / The National Library of Finland
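[Editor's note: for anyone wanting to try Matt's workaround from SolrJ, a minimal sketch; the URL and query are placeholders, and his caveat applies that the default highlighter does not handle every case the phrase highlighter does.]

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;

    public class FasterHighlighting {
        public static void main(String[] args) throws Exception {
            try (HttpSolrClient client =
                     new HttpSolrClient("http://localhost:8983/solr/biblio")) {
                SolrQuery query = new SolrQuery("title:history");
                query.setHighlight(true);
                // Defaults to true; per the thread, the admin UI checkbox
                // does not actually unset it, so pass it explicitly.
                query.set("hl.usePhraseHighlighter", "false");
                System.out.println(client.query(query).getHighlighting());
            }
        }
    }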
Re: Solr custom component issue
On Mon, May 11, 2015, at 10:30 AM, nutchsolruser wrote: I can not set qf in solrconfig.xml file because my qf and boost values will be changing frequently . I am reading those values from external source. Can we not set qf value from searchComponent? Or is there any other way to do this? Changing frequently, but you can't pass them in via the search request?
Re: Solr custom component issue
You are adding a search component, and adding it as a last-component, meaning it will come after the Query component, which actually does the work. Given the parameters you have set, you will be using the default Lucene query parser, which doesn't honour the qf parameter, so it isn't surprising that the QueryComponent is ignoring qf. What is it that you are trying to do? Upayavira On Mon, May 11, 2015, at 09:33 AM, nutchsolruser wrote: Hi, I am trying to add my own query parameters in a Solr query using a Solr component. In the example below I am trying to add the qf parameter to the query. Below is the prepare method of my component. But Solr is not considering the qf parameter while searching; it is using the df parameter that I have added in the schema.xml file as the default search field.

    @Override
    public void prepare(ResponseBuilder rb) throws IOException {
        LOG.info("called Prepare");
        SolrQueryRequest req = rb.req;
        SolrQueryResponse rsp = rb.rsp;
        SolrParams params = req.getParams();
        ModifiableSolrParams modifiableSolrParams = new ModifiableSolrParams(params);
        modifiableSolrParams.set("qf", "journal");
        rb.req.setParams(modifiableSolrParams);
        QParser parser;
        try {
            parser = QParser.getParser(rb.getQueryString(), "edismax", req);
            rb.setQparser(parser);
        } catch (SyntaxError e) {
            e.printStackTrace();
        }
        LOG.info("Solr Request " + rb.req.toString());
    }

The relevant request handler in the solrconfig.xml file:

    <requestHandler name="/custom-api" class="solr.SearchHandler">
      <lst name="defaults">
        <str name="q.alt">*:*</str>
        <str name="df">description</str>
      </lst>
      <arr name="last-components">
        <str>custom-component</str>
      </arr>
    </requestHandler>

How can I add the qf param correctly in the query so that Solr can use it while searching?
Re: Solr custom component issue
If all you want to do is to hardwire a qf, you can do that in your requestHandler config in solrconfig.xml. If you want to extend how the edismax query parser works, you may well be better off subclassing the edismax query parser and passing in modified request parameters, but I'd explore getting your problem solved without coding first. Can you not set qf= in the request handler configuration? Make sure you set defType=edismax if you want qf to have any effect at all. Upayavira On Mon, May 11, 2015, at 10:09 AM, nutchsolruser wrote: Thanks Upayavira, I tried changing it to first-component in solrconfig.xml but no luck. Am I missing something here? Here I want to add my own qf fields with boost in the query.
Solr custom component issue
Hi, I am trying to add my own query parameters in a Solr query using a Solr component. In the example below I am trying to add the qf parameter to the query. Below is the prepare method of my component. But Solr is not considering the qf parameter while searching; it is using the df parameter that I have added in the schema.xml file as the default search field.

    @Override
    public void prepare(ResponseBuilder rb) throws IOException {
        LOG.info("called Prepare");
        SolrQueryRequest req = rb.req;
        SolrQueryResponse rsp = rb.rsp;
        SolrParams params = req.getParams();
        ModifiableSolrParams modifiableSolrParams = new ModifiableSolrParams(params);
        modifiableSolrParams.set("qf", "journal");
        rb.req.setParams(modifiableSolrParams);
        QParser parser;
        try {
            parser = QParser.getParser(rb.getQueryString(), "edismax", req);
            rb.setQparser(parser);
        } catch (SyntaxError e) {
            e.printStackTrace();
        }
        LOG.info("Solr Request " + rb.req.toString());
    }

The relevant request handler in the solrconfig.xml file:

    <requestHandler name="/custom-api" class="solr.SearchHandler">
      <lst name="defaults">
        <str name="q.alt">*:*</str>
        <str name="df">description</str>
      </lst>
      <arr name="last-components">
        <str>custom-component</str>
      </arr>
    </requestHandler>

How can I add the qf param correctly in the query so that Solr can use it while searching?
Re: Solr custom component issue
I can not set qf in the solrconfig.xml file because my qf and boost values will be changing frequently. I am reading those values from an external source. Can we not set the qf value from a searchComponent? Or is there any other way to do this?
SOLR 4.10.4 - error creating document
I'm getting the following error with 4.10.4:

WARN org.apache.solr.handler.dataimport.SolrWriter – Error creating document : SolrInputDocument(fields: [dcautoclasscode=310, dclang=unknown, ..., dcdocid=dd05ad427a58b49150a4ca36148187028562257a77643062382a1366250112ac]) org.apache.solr.common.SolrException: Exception writing document id ftumdeepblue:oai:deepblue.lib.umich.edu:2027.42/79437 to the index; possible analysis error. at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:168) at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69) at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51) ... at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:480) at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:461) Caused by: java.lang.IllegalArgumentException: Document contains at least one immense term in field=f_dcperson (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[102, 111, 114, 32, 97, 32, 114, 101, 118, 105, 101, 119, 32, 115, 101, 101, 32, 66, 114, 111, 119, 110, 105, 110, 103, 32, 32, 32, 50, 48]...', original message: bytes can be at most 32766 in length; got 38177 at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:687) ...

My huge field is dcdescription, with the following schema:

    <field name="dccreator" type="string" indexed="true" stored="true" multiValued="true"/>
    <field name="dcdescription" type="string" indexed="false" stored="true"/>
    <field name="f_dcperson" type="string" indexed="true" stored="true" multiValued="true"/>
    ...
    <copyField source="dccreator" dest="f_dcperson"/>
    <copyField source="dccontributor" dest="f_dcperson"/>

I guess I have to make dcdescription also multiValued=true? But why is it complaining about f_dcperson, which is already multiValued? Second guess: dcdescription is not multiValued, but filled to the max (32766). Then it is UTF8 encoded, going beyond 32766, which is larger than a single subfield of a multiValued field, and therefore the error? Is there any real explanation of this and how to prevent it? Regards Bernd
Re: SolrJ vs. plain old HTTP post
Hi Steve, The main advantage is that it uses a binary format, so XML/JSON overhead is avoided. You should also check whether Solr's Data Import Handler is a good fit for you. Thanks, Emir -- Monitoring * Alerting * Anomaly Detection * Centralized Log Management Solr Elasticsearch Support * http://sematext.com/ On 11.05.2015 14:21, Steven White wrote: Hi Everyone, If all that I need to do is send data to Solr to add / delete a Solr document, which tool is better for the job: SolrJ or plain old HTTP post? In other words, what are the advantages of using SolrJ when the need is to push data to Solr for indexing? Thanks, Steve
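[Editor's note: for reference, a minimal SolrJ add/delete sketch; the URL, ids, and field names are placeholders. The wire format is the binary javabin format Emir mentions rather than XML/JSON.]

    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class AddDeleteExample {
        public static void main(String[] args) throws Exception {
            try (HttpSolrClient client =
                     new HttpSolrClient("http://localhost:8983/solr/collection1")) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "doc-1");
                doc.addField("title", "hello solr");
                client.add(doc);              // add or overwrite by unique key
                client.deleteById("doc-2");   // delete a document by unique key
                client.commit();
            }
        }
    }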
SolrJ vs. plain old HTTP post
Hi Everyone, If all that I need to do is send data to Solr to add / delete a Solr document, which tool is better for the job: SolrJ or plain old HTTP post? In other words, what are the advantages of using SolrJ when the need is to push data to Solr for indexing? Thanks, Steve
Re: SOLR 4.10.4 - error creating document
Hi Bernd, The issue is with f_dcperson and what ends up in that field. It is configured to be string, which means it is not tokenized, so if some huge value is in either dccreator or dccontributor it will end up as a single term. The name suggests that it should not contain such values, but double check in your import code whether you are reading the wrong column, or concatenating contributors, or something else is causing the value to be too big. Also check if you have some copyField that should not be there. Thanks, Emir -- Monitoring * Alerting * Anomaly Detection * Centralized Log Management Solr Elasticsearch Support * http://sematext.com/ On 11.05.2015 14:13, Bernd Fehling wrote: I'm getting the following error with 4.10.4 WARN org.apache.solr.handler.dataimport.SolrWriter – Error creating document : SolrInputDocument(fields: [dcautoclasscode=310, dclang=unknown, ..., dcdocid=dd05ad427a58b49150a4ca36148187028562257a77643062382a1366250112ac]) org.apache.solr.common.SolrException: Exception writing document id ftumdeepblue:oai:deepblue.lib.umich.edu:2027.42/79437 to the index; possible analysis error. at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:168) at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69) at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51) ... at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:480) at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:461) Caused by: java.lang.IllegalArgumentException: Document contains at least one immense term in field=f_dcperson (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[102, 111, 114, 32, 97, 32, 114, 101, 118, 105, 101, 119, 32, 115, 101, 101, 32, 66, 114, 111, 119, 110, 105, 110, 103, 32, 32, 32, 50, 48]...', original message: bytes can be at most 32766 in length; got 38177 at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:687) ... My huge field is dcdescription, with the following schema: <field name="dccreator" type="string" indexed="true" stored="true" multiValued="true"/> <field name="dcdescription" type="string" indexed="false" stored="true"/> <field name="f_dcperson" type="string" indexed="true" stored="true" multiValued="true"/> ... <copyField source="dccreator" dest="f_dcperson"/> <copyField source="dccontributor" dest="f_dcperson"/> I guess I have to make dcdescription also multiValued=true? But why is it complaining about f_dcperson, which is already multiValued? Second guess: dcdescription is not multiValued, but filled to the max (32766). Then it is UTF8 encoded, going beyond 32766, which is larger than a single subfield of a multiValued field, and therefore the error? Is there any real explanation of this and how to prevent it? Regards Bernd
Solr query which returns only those docs whose tokens are all from a given list
Hi all, Also asked this here: http://stackoverflow.com/questions/30166116 For example, I have Solr docs in which the tags field is indexed: Doc1 - tags:T1 T2 Doc2 - tags:T1 T3 Doc3 - tags:T1 T4 Doc4 - tags:T1 T2 T3 Query1: get all docs with tags:T1 AND tags:T3. This works and will give Doc2 and Doc4. Query2: get all docs whose tags are all from the list [T1, T2, T3]. Expected: Doc1, Doc2, Doc4. How do I model Query2 in Solr? Please help me on this.
Re: Solr custom component issue
These boosting parameters will be configured outside Solr, and there is a separate module from which these values get populated. I am reading those values from an external datasource and I want to attach them to each request.
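[Editor's note: one way to act on Upayavira's advice without a custom component is to attach the externally sourced qf to each request from the client; a sketch, where loadBoosts() is a hypothetical stand-in for the external module.]

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;

    public class ExternalBoostQuery {
        static String loadBoosts() {
            // Hypothetical stand-in for the external boost source.
            return "journal^2.0 title^1.5 description^0.5";
        }

        public static void main(String[] args) throws Exception {
            try (HttpSolrClient client =
                     new HttpSolrClient("http://localhost:8983/solr/collection1")) {
                SolrQuery query = new SolrQuery("cancer research");
                query.set("defType", "edismax");  // qf only works with (e)dismax
                query.set("qf", loadBoosts());
                System.out.println(client.query(query).getResults().getNumFound());
            }
        }
    }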
Re: SolrJ vs. plain old HTTP post
Another advantage to SolrJ is with SolrCloud (ZK) awareness, and taking advantage of some routing optimizations client-side so the cluster has fewer hops to make. — Erik Hatcher, Senior Solutions Architect http://www.lucidworks.com On May 11, 2015, at 8:21 AM, Steven White swhite4...@gmail.com wrote: Hi Everyone, If all that I need to do is send data to Solr to add / delete a Solr document, which tool is better for the job: SolrJ or plain old HTTP post? In other words, what are the advantages of using SolrJ when the need is to push data to Solr for indexing? Thanks, Steve
storeOffsetsWithPositions does not reflect in the index
Hi, Using solr 4.10.2. Looks like storeOffsetsWithPositions has no effect, i.e. it does not store offsets in addition to positions. If we use termVectors=true termPositions=true termOffsets=true, then offsets and positions are available fine. Any ideas how to make storeOffsetsWithPositions work? -- Dmitry Kan Luke Toolbox: http://github.com/DmitryKey/luke Blog: http://dmitrykan.blogspot.com Twitter: http://twitter.com/dmitrykan SemanticAnalyzer: www.semanticanalyzer.info
Re: SOLR 4.10.4 - error creating document
After reading https://issues.apache.org/jira/browse/LUCENE-5472 one question still remains: why is it complaining about f_dcperson, which is a copyField, when the original problem field is dcdescription, which definitely is much larger than 32766? I would assume it complains about the dcdescription field. Or not? Bernd On 11.05.2015 at 14:13, Bernd Fehling wrote: I'm getting the following error with 4.10.4 WARN org.apache.solr.handler.dataimport.SolrWriter – Error creating document : SolrInputDocument(fields: [dcautoclasscode=310, dclang=unknown, ..., dcdocid=dd05ad427a58b49150a4ca36148187028562257a77643062382a1366250112ac]) org.apache.solr.common.SolrException: Exception writing document id ftumdeepblue:oai:deepblue.lib.umich.edu:2027.42/79437 to the index; possible analysis error. at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:168) at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69) at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51) ... at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:480) at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:461) Caused by: java.lang.IllegalArgumentException: Document contains at least one immense term in field=f_dcperson (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[102, 111, 114, 32, 97, 32, 114, 101, 118, 105, 101, 119, 32, 115, 101, 101, 32, 66, 114, 111, 119, 110, 105, 110, 103, 32, 32, 32, 50, 48]...', original message: bytes can be at most 32766 in length; got 38177 at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:687) ... My huge field is dcdescription, with the following schema: <field name="dccreator" type="string" indexed="true" stored="true" multiValued="true"/> <field name="dcdescription" type="string" indexed="false" stored="true"/> <field name="f_dcperson" type="string" indexed="true" stored="true" multiValued="true"/> ... <copyField source="dccreator" dest="f_dcperson"/> <copyField source="dccontributor" dest="f_dcperson"/> I guess I have to make dcdescription also multiValued=true? But why is it complaining about f_dcperson, which is already multiValued? Second guess: dcdescription is not multiValued, but filled to the max (32766). Then it is UTF8 encoded, going beyond 32766, which is larger than a single subfield of a multiValued field, and therefore the error? Is there any real explanation of this and how to prevent it? Regards Bernd
Re: SOLR 4.10.4 - error creating document
Hi Emir, the dcdescription field is definitely too big. But why is it complaining about f_dcperson and not dcdescription? Regards Bernd On 11.05.2015 at 15:12, Emir Arnautovic wrote: Hi Bernd, The issue is with f_dcperson and what ends up in that field. It is configured to be string, which means it is not tokenized, so if some huge value is in either dccreator or dccontributor it will end up as a single term. The name suggests that it should not contain such values, but double check in your import code whether you are reading the wrong column, or concatenating contributors, or something else is causing the value to be too big. Also check if you have some copyField that should not be there. Thanks, Emir -- Monitoring * Alerting * Anomaly Detection * Centralized Log Management Solr Elasticsearch Support * http://sematext.com/ On 11.05.2015 14:13, Bernd Fehling wrote: I'm getting the following error with 4.10.4 WARN org.apache.solr.handler.dataimport.SolrWriter – Error creating document : SolrInputDocument(fields: [dcautoclasscode=310, dclang=unknown, ..., dcdocid=dd05ad427a58b49150a4ca36148187028562257a77643062382a1366250112ac]) org.apache.solr.common.SolrException: Exception writing document id ftumdeepblue:oai:deepblue.lib.umich.edu:2027.42/79437 to the index; possible analysis error. at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:168) at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69) at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51) ... at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:480) at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:461) Caused by: java.lang.IllegalArgumentException: Document contains at least one immense term in field=f_dcperson (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[102, 111, 114, 32, 97, 32, 114, 101, 118, 105, 101, 119, 32, 115, 101, 101, 32, 66, 114, 111, 119, 110, 105, 110, 103, 32, 32, 32, 50, 48]...', original message: bytes can be at most 32766 in length; got 38177 at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:687) ... My huge field is dcdescription, with the following schema: <field name="dccreator" type="string" indexed="true" stored="true" multiValued="true"/> <field name="dcdescription" type="string" indexed="false" stored="true"/> <field name="f_dcperson" type="string" indexed="true" stored="true" multiValued="true"/> ... <copyField source="dccreator" dest="f_dcperson"/> <copyField source="dccontributor" dest="f_dcperson"/> I guess I have to make dcdescription also multiValued=true? But why is it complaining about f_dcperson, which is already multiValued? Second guess: dcdescription is not multiValued, but filled to the max (32766). Then it is UTF8 encoded, going beyond 32766, which is larger than a single subfield of a multiValued field, and therefore the error? Is there any real explanation of this and how to prevent it? Regards Bernd
Re: SOLR 4.10.4 - error creating document
Hi Bernd, The dcdescription field is not indexed. Thanks, Emir -- Monitoring * Alerting * Anomaly Detection * Centralized Log Management Solr Elasticsearch Support * http://sematext.com/ On 11.05.2015 15:22, Bernd Fehling wrote: Hi Emir, the dcdescription field is definitely too big. But why is it complaining about f_dcperson and not dcdescription? Regards Bernd On 11.05.2015 at 15:12, Emir Arnautovic wrote: Hi Bernd, The issue is with f_dcperson and what ends up in that field. It is configured to be string, which means it is not tokenized, so if some huge value is in either dccreator or dccontributor it will end up as a single term. The name suggests that it should not contain such values, but double check in your import code whether you are reading the wrong column, or concatenating contributors, or something else is causing the value to be too big. Also check if you have some copyField that should not be there. Thanks, Emir -- Monitoring * Alerting * Anomaly Detection * Centralized Log Management Solr Elasticsearch Support * http://sematext.com/ On 11.05.2015 14:13, Bernd Fehling wrote: I'm getting the following error with 4.10.4 WARN org.apache.solr.handler.dataimport.SolrWriter – Error creating document : SolrInputDocument(fields: [dcautoclasscode=310, dclang=unknown, ..., dcdocid=dd05ad427a58b49150a4ca36148187028562257a77643062382a1366250112ac]) org.apache.solr.common.SolrException: Exception writing document id ftumdeepblue:oai:deepblue.lib.umich.edu:2027.42/79437 to the index; possible analysis error. at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:168) at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69) at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51) ... at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:480) at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:461) Caused by: java.lang.IllegalArgumentException: Document contains at least one immense term in field=f_dcperson (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[102, 111, 114, 32, 97, 32, 114, 101, 118, 105, 101, 119, 32, 115, 101, 101, 32, 66, 114, 111, 119, 110, 105, 110, 103, 32, 32, 32, 50, 48]...', original message: bytes can be at most 32766 in length; got 38177 at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:687) ... My huge field is dcdescription, with the following schema: <field name="dccreator" type="string" indexed="true" stored="true" multiValued="true"/> <field name="dcdescription" type="string" indexed="false" stored="true"/> <field name="f_dcperson" type="string" indexed="true" stored="true" multiValued="true"/> ... <copyField source="dccreator" dest="f_dcperson"/> <copyField source="dccontributor" dest="f_dcperson"/> I guess I have to make dcdescription also multiValued=true? But why is it complaining about f_dcperson, which is already multiValued? Second guess: dcdescription is not multiValued, but filled to the max (32766). Then it is UTF8 encoded, going beyond 32766, which is larger than a single subfield of a multiValued field, and therefore the error? Is there any real explanation of this and how to prevent it? Regards Bernd
Re: SOLR 4.10.4 - error creating document
Hi Shawn, does that mean that if I set a length limit on dcdescription or make dcdescription multiValued, then the problem is solved, because f_dcperson is already multiValued? Regards Bernd On 11.05.2015 at 15:17, Shawn Heisey wrote: On 5/11/2015 6:13 AM, Bernd Fehling wrote: Caused by: java.lang.IllegalArgumentException: Document contains at least one immense term in field=f_dcperson (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[102, 111, 114, 32, 97, 32, 114, 101, 118, 105, 101, 119, 32, 115, 101, 101, 32, 66, 114, 111, 119, 110, 105, 110, 103, 32, 32, 32, 50, 48]...', original message: bytes can be at most 32766 in length; got 38177 at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:687) ... The field in question is f_dcperson, which according to your schema is a string type. If your schema follows the example fieldType definitions, then string is a solr.StrField, where the entire input is treated as one term. The field is multiValued and a copyField destination, so each value that is sent is one term. I went looking for this message in the code. It is logged when a MaxBytesLengthExceededException is thrown. This error is complaining that the size of the *term* (since it's a string type, likely the contents of an individual copyField source field) you are sending to the f_dcperson field has exceeded 32766, which is apparently the largest size for that field type. You'll either need to fix your source data or pick a field type that can handle more data. Thanks, Shawn
Re: SOLR 4.10.4 - error creating document
It turned out that I hadn't recognized that dcdescription is not indexed, only stored. So the next in the chain is f_dcperson, where dccreator and dcdescription are combined and indexed. And this is why the error shows up on f_dcperson (delayed error). Thanks for your help, regards. Bernd On 11.05.2015 at 15:35, Shawn Heisey wrote: On 5/11/2015 7:19 AM, Bernd Fehling wrote: After reading https://issues.apache.org/jira/browse/LUCENE-5472 one question still remains: why is it complaining about f_dcperson, which is a copyField, when the original problem field is dcdescription, which definitely is much larger than 32766? I would assume it complains about the dcdescription field. Or not? If the value resulting in the error does come from a copyField source that also uses a string type, then my guess here is that Solr has some prioritization that causes the copyField destination to be indexed before the sources. This ordering might make things go a little faster, because if it happens right after copying, all or most of the data for the destination field would already be sitting in one or more of the CPU caches. Cache hits are wonderful things for performance. Thanks, Shawn
Re: SOLR 4.10.4 - error creating document
Hi Emir, ahhh, yes you're right. I missed that. Now I understand why it is not complaining about dcdescription and the error shows up on f_dcperson. Delayed error ;-) Thanks Bernd On 11.05.2015 at 15:25, Emir Arnautovic wrote: Hi Bernd, The dcdescription field is not indexed. Thanks, Emir -- Monitoring * Alerting * Anomaly Detection * Centralized Log Management Solr Elasticsearch Support * http://sematext.com/ On 11.05.2015 15:22, Bernd Fehling wrote: Hi Emir, the dcdescription field is definitely too big. But why is it complaining about f_dcperson and not dcdescription? Regards Bernd On 11.05.2015 at 15:12, Emir Arnautovic wrote: Hi Bernd, The issue is with f_dcperson and what ends up in that field. It is configured to be string, which means it is not tokenized, so if some huge value is in either dccreator or dccontributor it will end up as a single term. The name suggests that it should not contain such values, but double check in your import code whether you are reading the wrong column, or concatenating contributors, or something else is causing the value to be too big. Also check if you have some copyField that should not be there. Thanks, Emir -- Monitoring * Alerting * Anomaly Detection * Centralized Log Management Solr Elasticsearch Support * http://sematext.com/ On 11.05.2015 14:13, Bernd Fehling wrote: I'm getting the following error with 4.10.4 WARN org.apache.solr.handler.dataimport.SolrWriter – Error creating document : SolrInputDocument(fields: [dcautoclasscode=310, dclang=unknown, ..., dcdocid=dd05ad427a58b49150a4ca36148187028562257a77643062382a1366250112ac]) org.apache.solr.common.SolrException: Exception writing document id ftumdeepblue:oai:deepblue.lib.umich.edu:2027.42/79437 to the index; possible analysis error. at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:168) at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69) at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51) ... at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:480) at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:461) Caused by: java.lang.IllegalArgumentException: Document contains at least one immense term in field=f_dcperson (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[102, 111, 114, 32, 97, 32, 114, 101, 118, 105, 101, 119, 32, 115, 101, 101, 32, 66, 114, 111, 119, 110, 105, 110, 103, 32, 32, 32, 50, 48]...', original message: bytes can be at most 32766 in length; got 38177 at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:687) ... My huge field is dcdescription, with the following schema: <field name="dccreator" type="string" indexed="true" stored="true" multiValued="true"/> <field name="dcdescription" type="string" indexed="false" stored="true"/> <field name="f_dcperson" type="string" indexed="true" stored="true" multiValued="true"/> ... <copyField source="dccreator" dest="f_dcperson"/> <copyField source="dccontributor" dest="f_dcperson"/> I guess I have to make dcdescription also multiValued=true? But why is it complaining about f_dcperson, which is already multiValued? Second guess: dcdescription is not multiValued, but filled to the max (32766). Then it is UTF8 encoded, going beyond 32766, which is larger than a single subfield of a multiValued field, and therefore the error? Is there any real explanation of this and how to prevent it? Regards Bernd
Re: SOLR 4.10.4 - error creating document
On 5/11/2015 6:13 AM, Bernd Fehling wrote: Caused by: java.lang.IllegalArgumentException: Document contains at least one immense term in field=f_dcperson (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[102, 111, 114, 32, 97, 32, 114, 101, 118, 105, 101, 119, 32, 115, 101, 101, 32, 66, 114, 111, 119, 110, 105, 110, 103, 32, 32, 32, 50, 48]...', original message: bytes can be at most 32766 in length; got 38177 at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:687) ... The field in question is f_dcperson, which according to your schema is a string type. If your schema follows the example fieldType definitions, then string is a solr.StrField, where the entire input is treated as one term. The field is multiValued and a copyField destination, so each value that is sent is one term. I went looking for this message in the code. It is logged when a MaxBytesLengthExceededException is thrown. This error is complaining that the size of the *term* (since it's a string type, likely the contents of an individual copyField source field) you are sending to the f_dcperson field has exceeded 32766, which is apparently the largest size for that field type. You'll either need to fix your source data or pick a field type that can handle more data. Thanks, Shawn
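[Editor's note: a hedged sketch of Shawn's "fix your source data" route: truncate oversized values client-side before they reach the string field, so no single term's UTF-8 encoding exceeds the 32766-byte limit from the error message.]

    import java.nio.charset.StandardCharsets;

    public class TermTruncator {
        // Lucene's per-term limit, taken from the error message in this thread.
        static final int MAX_TERM_BYTES = 32766;

        // Trim a value so its UTF-8 encoding fits under the limit. Naive but
        // sufficient for a sketch; a production version should also avoid
        // splitting surrogate pairs.
        static String truncateUtf8(String value) {
            int end = value.length();
            while (end > 0 && value.substring(0, end)
                    .getBytes(StandardCharsets.UTF_8).length > MAX_TERM_BYTES) {
                end--;
            }
            return value.substring(0, end);
        }
    }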
Re: SOLR 4.10.4 - error creating document
On 5/11/2015 7:19 AM, Bernd Fehling wrote: After reading https://issues.apache.org/jira/browse/LUCENE-5472 one question still remains. Why is it complaining about f_dcperson which is a copyField when the origin problem field is dcdescription which definately is much larger than 32766? I would assume it complains about dcdescription field. Or not? If the value resulting in the error does come from a copyField source that also uses a string type, then my guess here is that Solr has some prioritization that causes the copyField destination to be indexed before the sources. This ordering might make things go a little faster, because if it happens right after copying, all or most of the data for the destination field would already be sitting in one or more of the CPU caches. Cache hits are wonderful things for performance. Thanks, Shawn
Re: Solr query which return only those docs whose all tokens are from given list
Hi Naresh, couldn't you just model this as an OR query, since your requirement is at least one (but can be more than one), ie: tags:T1 tags:T2 tags:T3 -sujit

On Mon, May 11, 2015 at 4:14 AM, Naresh Yadav nyadav@gmail.com wrote: Hi all, also asked this here: http://stackoverflow.com/questions/30166116 For example I have SOLR docs in which a tags field is indexed:

Doc1 - tags:T1 T2
Doc2 - tags:T1 T3
Doc3 - tags:T1 T4
Doc4 - tags:T1 T2 T3

Query1: get all docs with tags:T1 AND tags:T3. This works and will give Doc2 and Doc4.
Query2: get all docs whose tags are all from [T1, T2, T3]. Expected: Doc1, Doc2, Doc4.

How to model Query2 in Solr? Please help me on this.
Re: SolrJ vs. plain old HTTP post
Thanks Erik and Emir. Erik: The fact that SolrJ is aware of SolrCloud is enough to put it over plain old HTTP post. Emir: I looked into Solr's data import handler; unfortunately, it won't work for my needs. To close the loop on this question, I will need to enable Jetty's SSL (the jetty that comes with Solr 5.1). If I do so, will SolrJ still work? Can I assume that SolrJ supports SSL? I Googled but cannot find the answer. Thanks again. Steve On Mon, May 11, 2015 at 8:39 AM, Erik Hatcher erik.hatc...@gmail.com wrote: Another advantage to SolrJ is SolrCloud (ZK) awareness, taking advantage of some routing optimizations client-side so the cluster has fewer hops to make. — Erik Hatcher, Senior Solutions Architect http://www.lucidworks.com http://www.lucidworks.com/ On May 11, 2015, at 8:21 AM, Steven White swhite4...@gmail.com wrote: Hi Everyone, If all that I need to do is send data to Solr to add / delete a Solr document, which tool is better for the job: SolrJ or plain old HTTP post? In other words, what are the advantages of using SolrJ when the need is to push data to Solr for indexing? Thanks, Steve
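To make the SolrJ side concrete, a minimal indexing sketch against SolrJ 5.x (the URL, collection and field names are placeholders; use CloudSolrClient instead when running SolrCloud):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

SolrClient client = new HttpSolrClient("http://localhost:8983/solr/mycollection");
SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", "doc-1");
doc.addField("title_t", "hello solrj");
client.add(doc);            // add or replace by unique key (throws SolrServerException/IOException)
client.deleteById("doc-2"); // delete by unique key
client.commit();            // or rely on autoCommit in solrconfig.xml
client.close();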
Re: indexing java byte code in classes / jars
How about Krugle? http://opensearch.krugle.org/ Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) On May 11, 2015, at 3:18 AM, Tomasz Borek tomasz.bo...@gmail.com wrote: There's also Perl-backed ACK. http://beyondgrep.com/ Which does the job of searching code really well. And I think at least once I came across something that stemmed from ACK and claimed it was faster/better... googling... aah! The Silver Searcher it was. :-) http://betterthanack.com/ pozdrawiam, LAFK 2015-05-09 12:40 GMT+02:00 Mark javam...@gmail.com: Hi Alexandre, Solr + ASM is exactly the problem I'm looking to hack about with, so I'm keen to consider any code no matter how ugly or broken. Regards Mark On 9 May 2015 at 10:21, Alexandre Rafalovitch arafa...@gmail.com wrote: If you only have classes/jars, use ASM. I have done this before, have some ugly code to share if you want. If you have sources, javadoc 8 is a good way too. I am doing that now for solr-start.com, code on Github. Regards, Alex On 9 May 2015 7:09 am, Mark javam...@gmail.com wrote: To answer why bytecode - because mostly the use case I have is looking to index as much detail as possible from jars/classes: extract class names, method names, signatures, packages / imports. I am considering using ASM in order to generate an analysis view of the class. The sort of use cases I have would be method / signature searches. For example: 1) show any classes with a method named parse* 2) show any classes with a method named parse that passes in a type *json* ...etc. In the past I have written something to reverse out javadocs from just java bytecode; using Solr would make this idea considerably more powerful. Thanks for the suggestions so far On 8 May 2015 at 21:19, Erik Hatcher erik.hatc...@gmail.com wrote: Oh, and sorry, I omitted a couple of details: # creating the “java” core/collection bin/solr create -c java # I ran this from my Solr source code checkout, so that SolrLogFormatter.class just happened to be handy Erik On May 8, 2015, at 4:11 PM, Erik Hatcher erik.hatc...@gmail.com wrote: What kinds of searches do you want to run? Are you trying to extract class names, method names, and such and make those searchable? If that’s the case, you need some kind of “parser” to reverse engineer that information from .class and .jar files before feeding it to Solr, which would happen before analysis. Java itself comes with a javap command that can do this; whether this is the “best” way to go for your scenario I don’t know, but here’s an interesting example pasted below (using Solr 5.x).
— Erik Hatcher, Senior Solutions Architect http://www.lucidworks.com

javap build/solr-core/classes/java/org/apache/solr/SolrLogFormatter.class > test.txt
bin/post -c java test.txt

now search for coreInfoMap: http://localhost:8983/solr/java/browse?q=coreInfoMap

I tried to be cleverer and use the stdin option of bin/post, like this:

javap build/solr-core/classes/java/org/apache/solr/SolrLogFormatter.class | bin/post -c java -url http://localhost:8983/solr/java/update/extract -type text/plain -params literal.id=SolrLogFormatter -out yes -d

but something isn’t working right with the stdin detection like that (it does work to `cat test.txt | bin/post…` though, hmmm). test.txt looks like this, `cat test.txt`:

Compiled from SolrLogFormatter.java
public class org.apache.solr.SolrLogFormatter extends java.util.logging.Formatter {
  long startTime;
  long lastTime;
  java.util.Map<org.apache.solr.SolrLogFormatter$Method, java.lang.String> methodAlias;
  public boolean shorterFormat;
  java.util.Map<org.apache.solr.core.SolrCore, org.apache.solr.SolrLogFormatter$CoreInfo> coreInfoMap;
  public java.util.Map<java.lang.String, java.lang.String> classAliases;
  static java.lang.ThreadLocal<java.lang.String> threadLocal;
  public org.apache.solr.SolrLogFormatter();
  public void setShorterFormat();
  public java.lang.String format(java.util.logging.LogRecord);
  public void appendThread(java.lang.StringBuilder, java.util.logging.LogRecord);
  public java.lang.String _format(java.util.logging.LogRecord);
  public java.lang.String getHead(java.util.logging.Handler);
  public java.lang.String getTail(java.util.logging.Handler);
  public java.lang.String formatMessage(java.util.logging.LogRecord);
  public static void main(java.lang.String[]) throws java.lang.Exception;
  public static void go() throws java.lang.Exception;
  static {};
}

On May 8, 2015, at 3:31 PM, Mark javam...@gmail.com wrote: I'm looking to use Solr to search over the byte code in classes and jars. Does anyone know of or have experience with Analyzers, Tokenizers, and Token Filters for such a task? Regards Mark
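Since ASM came up several times in this thread, here is a rough sketch of pulling class and method names out of a .class file with ASM 5 (the class name and printed fields are illustrative, error handling omitted; you would feed the strings into a SolrInputDocument instead of printing them):

import java.io.FileInputStream;
import org.objectweb.asm.ClassReader;
import org.objectweb.asm.ClassVisitor;
import org.objectweb.asm.MethodVisitor;
import org.objectweb.asm.Opcodes;

public class ClassIndexer {
  public static void main(String[] args) throws Exception {
    ClassReader reader = new ClassReader(new FileInputStream(args[0]));
    reader.accept(new ClassVisitor(Opcodes.ASM5) {
      @Override
      public void visit(int version, int access, String name, String signature,
                        String superName, String[] interfaces) {
        // e.g. put this into a class_name_s field of a SolrInputDocument
        System.out.println("class:  " + name.replace('/', '.'));
      }
      @Override
      public MethodVisitor visitMethod(int access, String name, String desc,
                                       String signature, String[] exceptions) {
        // desc carries the parameter/return types, which enables signature searches
        System.out.println("method: " + name + " " + desc);
        return null; // no need to walk the method bodies
      }
    }, ClassReader.SKIP_CODE); // skip bytecode instructions; we only want the structure
  }
}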
PatternReplaceCharFilter + solr.WhitespaceTokenizerFactory behaviour
I must be missing something obvious. I have a simple regex that removes a space-hyphen-space pattern. The unit test below works fine, but when I plug it into the schema and query, the regex does not match, since the input already gets split by space (further below). My understanding is that the charFilter operates on the raw input string and then passes it to the whitespace tokenizer, which seems to be the case, but I am not sure why I get an already split token stream.

Analyzer analyzer = new Analyzer() {
  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    Tokenizer tokenizer = new MockTokenizer(reader, MockTokenizer.WHITESPACE, false);
    return new TokenStreamComponents(tokenizer, tokenizer);
  }
  @Override
  protected Reader initReader(String fieldName, Reader reader) {
    return new PatternReplaceCharFilter(Pattern.compile("\\s+[\u002d,\u2011,\u2012,\u2013,\u2014,\u2212]\\s+"), " ", reader);
  }
};
final TokenStream tokens = analyzer.tokenStream("", new StringReader("a - b"));
tokens.reset();
final CharTermAttribute termAtt = tokens.addAttribute(CharTermAttribute.class);
while (tokens.incrementToken()) {
  System.out.println("=== " + new String(Arrays.copyOf(termAtt.buffer(), termAtt.length())));
}

I end up with:
=== a
=== b

Now I define the same in my schema:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100" multiValued="true" autoGeneratePhraseQueries="false">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\s+[\u002d,\u2011,\u2012,\u2013,\u2014,\u2212]\s+" replacement=" " />
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
  </analyzer>
</fieldType>
<field name="myfield" type="text" indexed="true" stored="false" multiValued="true"/>

When I query, the input already arrives split (e.g. a,-,b) in PatternReplaceCharFilter's processPattern method, so the regex cannot match: CharSequence processPattern(CharSequence input) ... even though the charFilter is defined before the tokenizer. Here is the query:

SolrQuery solrQuery = new SolrQuery("a - b");
solrQuery.setRequestHandler("/select");
solrQuery.set("defType", "edismax");
solrQuery.set("qf", "myfield");
solrQuery.set(CommonParams.ROWS, 0);
solrQuery.set(CommonParams.DEBUG, true);
solrQuery.set(CommonParams.DEBUG_QUERY, true);
QueryResponse response = solrSvr.query(solrQuery);
System.out.println("parsedQtoString " + response.getDebugMap().get("parsedquery_toString"));
System.out.println("parsedQ " + response.getDebugMap().get("parsedquery"));

Output is:
parsedQtoString +((myfield:a) (myfield:-) (myfield:b))
parsedQ (+(DisjunctionMaxQuery((myfield:a)) DisjunctionMaxQuery((myfield:-)) DisjunctionMaxQuery((myfield:b))))/no_coord
Re: Completion Suggester in Solr
Bumping this thread again in the group; I haven't received any responses for this. I have been stuck on this problem since last week; any help is highly appreciated. Thanks Pradeep On Wed, May 6, 2015 at 5:00 PM, Pradeep Bhattiprolu pbhatt...@gmail.com wrote: Hi, is there an equivalent of the Completion Suggester of ElasticSearch in Solr? I am a user who uses both Solr and ES, in different projects. I am not able to find a solution in Solr where I can use: 1) an FSA structure 2) multiple terms as synonyms 3) a weight assigned to each document based on certain heuristics, e.g. popularity score, user search history, etc. Any kind of help, pointers to relevant examples and documentation is highly appreciated. Thanks in advance. Pradeep
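For reference, Solr's closest analogue is the SuggestComponent (Solr 4.7+). A minimal, hedged sketch of a weighted, index-backed suggester (field names are illustrative; FST-based lookups such as FuzzyLookupFactory are available if you specifically want an FSA/FST structure, while synonym expansion would still come from your analysis chain):

<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">mySuggester</str>
    <str name="lookupImpl">AnalyzingInfixLookupFactory</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">title</str>
    <str name="weightField">popularity</str>
    <str name="suggestAnalyzerFieldType">text_general</str>
  </lst>
</searchComponent>
<requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="suggest">true</str>
    <str name="suggest.dictionary">mySuggester</str>
    <str name="suggest.count">10</str>
  </lst>
  <arr name="components"><str>suggest</str></arr>
</requestHandler>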
Re: Solr custom component issue
unsubscribe On Mon, May 11, 2015 at 6:58 PM, Upayavira u...@odoko.co.uk wrote: attaching them to each request, then just add qf= as a param to the URL, easy. On Mon, May 11, 2015, at 12:17 PM, nutchsolruser wrote: These boosting parameters will be configured outside Solr and there is seperate module from which these values get populated , I am reading those values from external datasource and I want to attach them to each request . -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-custom-component-issue-tp4204799p4204832.html Sent from the Solr - User mailing list archive at Nabble.com.
SOLR plugin: Retrieve all values of multivalued field
Hi folks, I'm playing with a custom SOLR plugin and I'm trying to retrieve the values for a multivalued field, using the code below.

== schema.xml:
<field name="my_field_name" type="string" indexed="true" stored="false" multiValued="true"/>

== input data:
<add>
  <doc>
    <field name="id">83127</field>
    <field name="my_field_name">somevalue</field>
    <field name="my_field_name">some other value</field>
    <field name="my_field_name">some other value 3</field>
    <field name="my_field_name">some other value 4</field>
  </doc>
</add>

== plugin:
SortedDocValues termsIndex = FieldCache.DEFAULT.getTermsIndex(atomicReader, "my_field_name");
...
int document = 12;
BytesRef spare = termsIndex.get(document);
String value = new String(spare.bytes, spare.offset, spare.length);

This only returns the value "some other value 3". Is there any way to obtain the other values as well (eg. "somevalue", "some other value")? Any help is gladly appreciated. Thanks, Costi
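One hedged approach for multivalued fields is FieldCache.getDocTermOrds, which uninverts to SortedSetDocValues and exposes every ord for a document, not just one term. A sketch against the Lucene 4.10 API (verify the exact signatures for your version; 5.x changed lookupOrd to return a BytesRef):

import org.apache.lucene.index.SortedSetDocValues;
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.util.BytesRef;

SortedSetDocValues values = FieldCache.DEFAULT.getDocTermOrds(atomicReader, "my_field_name");
values.setDocument(document);                 // position on the target doc first
BytesRef scratch = new BytesRef();
long ord;
while ((ord = values.nextOrd()) != SortedSetDocValues.NO_MORE_ORDS) {
  values.lookupOrd(ord, scratch);             // 4.x fills a reusable BytesRef
  System.out.println(scratch.utf8ToString()); // safer than new String(bytes, offset, length)
}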
Help to index nested document
Need your valuable inputs... I am indexing data from a database (one table) which is in this example format:

id name value
1 Joe 102724904
2 Joe 100996643

- id is the primary/unique key
- there can be the same name with different values
- if I use name as the unique key then SOLR removes the duplicate and indexes 1 document
- I am getting the result in the format below. Is there a way I can index the data so that value can be a child of name?

"response": {
  "numFound": 2,
  "start": 0,
  "docs": [
    { "id": 1, "name": "Joe", "value": [102724904] },
    { "id": 2, "name": "Joe", "value": [100996643] } ...

Expected format:

"docs": [
  { "name": "Joe", "value": [102724904, 100996643] }
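One hedged way to get this shape without remodelling the index is result grouping, which collapses documents that share a field value (this assumes name is a single-valued indexed field, as it is here; the URL is a placeholder):

http://localhost:8983/solr/collection1/select?q=*:*&group=true&group.field=name&group.limit=100&wt=json

Each group then carries all documents for one name, and the client merges their value fields; true parent/child modelling would instead require block-join indexing, where parent and child documents are added together.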
Re: PatternReplaceCharFilter + solr.WhitespaceTokenizerFactory behaviour
This trips up _everybody_ at one point or other. The problem is that the input goes through the query _parsing_ prior to getting to the field analysis, and the parser is sensitive to spaces. Consider the input (without quotes) of my dog. That gets broken up into default_field:my default_field:dog and only _then_ does the analysis chain, including your PatternReplaceCharFilterFactory, get applied to the individual tokens. So, your query input needs to escape the spaces, as in whatever\ -\ somethingelse, or perhaps quote the input, although this latter has other implications. Best, Erick On Mon, May 11, 2015 at 2:00 PM, Mihran Shahinian slowmih...@gmail.com wrote: <snip/>
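To make the escaping concrete, a short SolrJ sketch; ClientUtils.escapeQueryChars escapes whitespace as well as query-syntax characters (note it also escapes the hyphen, which the parser strips again before the text reaches the analyzer):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.util.ClientUtils;

// hand-escaped: the parser now hands "a - b" to the field analyzer as one chunk
SolrQuery q1 = new SolrQuery("a\\ -\\ b");
// or escape programmatically
SolrQuery q2 = new SolrQuery(ClientUtils.escapeQueryChars("a - b")); // yields a\ \-\ b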
Re: SOLR 4.10.4 - error creating document
I've got to ask: _how_ are you intending to search this field? On the surface, this feels like an XY problem. It's a string type. Therefore, if this is the input:

102, 111, 114, 32, 97, 32, 114, 101, 118, 105, 101, 119, 32, 115, 101, 101, 32, 66, 114

you'll only ever get a match if you search exactly:

102, 111, 114, 32, 97, 32, 114, 101, 118, 105, 101, 119, 32, 115, 101, 101, 32, 66, 114

None of these will match:

102
102, 32
32, 119, 32, 115
etc.

The idea of doing a match on a single _token_ that's over 32K long is pretty far out there, thus the check. The entire multiValued discussion is _probably_ a red herring and won't help you. multiValued has nothing to do with multiple terms; that's all up to your field type. So back up and tell us _how_ you intend to search this field. I'm guessing you really want to make it a text-based type instead. But that's just a guess. Best, Erick.

On Mon, May 11, 2015 at 8:43 AM, Bernd Fehling bernd.fehl...@uni-bielefeld.de wrote: It turned out that I hadn't noticed that dcdescription is not indexed, only stored. So the next in the chain is f_dcperson, where dccreator and dcdescription are combined and indexed. And this is why the error shows up on f_dcperson. (delay of error) Thanks for your help, regards. Bernd Am 11.05.2015 um 15:35 schrieb Shawn Heisey: <snip/>
Re: SolrJ vs. plain old HTTP post
On Mon, May 11, 2015 at 8:20 PM, Steven White swhite4...@gmail.com wrote: Thanks Erik and Emir. snip/ To close the loop on this question, I will need to enable Jetty's SSL (the jetty that comes with Solr 5.1). If I do so, will SolrJ still work, can I assume that SolrJ supports SSL? Yes, SolrJ can work with SSL enabled on the server as long as you pass the same JVM parameters on the client side to enable SSL e.g. -Djavax.net.ssl.keyStore= -Djavax.net.ssl.keyStorePassword= -Djavax.net.ssl.trustStore= -Djavax.net.ssl.trustStorePassword= See https://cwiki.apache.org/confluence/display/solr/Enabling+SSL#EnablingSSL-IndexadocumentusingCloudSolrClient I Google'ed but cannot find the answer. Thanks again. Steve On Mon, May 11, 2015 at 8:39 AM, Erik Hatcher erik.hatc...@gmail.com wrote: Another advantage to SolrJ is with SolrCloud (ZK) awareness, and taking advantage of some routing optimizations client-side so the cluster has less hops to make. — Erik Hatcher, Senior Solutions Architect http://www.lucidworks.com http://www.lucidworks.com/ On May 11, 2015, at 8:21 AM, Steven White swhite4...@gmail.com wrote: Hi Everyone, If all that I need to do is send data to Solr to add / delete a Solr document, which tool is better for the job: SolrJ or plain old HTTP post? In other word, what are the advantages of using SolrJ when the need is to push data to Solr for indexing? Thanks, Steve -- Regards, Shalin Shekhar Mangar.
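For example, the client-side equivalent in plain Java before constructing the client (paths and passwords are placeholders; the same four javax.net.ssl properties can equally be passed as -D JVM flags as above):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

System.setProperty("javax.net.ssl.keyStore", "/path/to/solr-ssl.keystore.jks");
System.setProperty("javax.net.ssl.keyStorePassword", "secret");
System.setProperty("javax.net.ssl.trustStore", "/path/to/solr-ssl.keystore.jks");
System.setProperty("javax.net.ssl.trustStorePassword", "secret");
SolrClient client = new HttpSolrClient("https://localhost:8983/solr/mycollection"); // note https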
Re: Solr query which return only those docs whose all tokens are from given list
Thanks Andrew, you got my problem precisely, but the solutions you suggested may not work for me. In my API I get only the list of authorized tags, i.e. [T1, T2, T3], and based on that alone I need to construct my Solr query. So the first solution with NOT (T4 OR T5) will not work. In the real case the tag ids T1, T2 are UUIDs, so a range query also will not work, as I have no control over the ordering of these ids. Looking for more suggestions?? Thanks Naresh On Mon, May 11, 2015 at 10:05 PM, Andrew Chillrud achill...@opentext.com wrote: <snip/>
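One hedged technique that needs neither the full tag vocabulary nor any term ordering: store the number of tags per document in an extra numeric field at index time (num_tags below is an assumed field you would have to add), then accept a document only when the count of its tags that fall inside the authorized list equals that total. Assuming each tag value occurs at most once per document, termfreq() contributes 1 per matching tag, so a function range query can express it:

q=tags:(T1 T2 T3)&fq={!frange l=1 u=1}div(sum(termfreq(tags,'T1'),termfreq(tags,'T2'),termfreq(tags,'T3')),num_tags)

On the example data, Doc3 (T1 T4) evaluates to 1/2 and is filtered out, while Doc1, Doc2 and Doc4 evaluate to exactly 1.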
Re: Best way to backup and restore an index for a cloud setup in 4.6.1?
Hi John, There are a few HTTP APIs for replication, one of which can let you take a backup of the index. Restoring can be as simple as just copying over the index in the right location on the disk. A new restore API will be released with the next version of Solr which will make some of these tasks easier. See https://cwiki.apache.org/confluence/display/solr/Index+Replication#IndexReplication-HTTPAPICommandsfortheReplicationHandler On Fri, May 8, 2015 at 10:26 PM, John Smith g10vstmo...@gmail.com wrote: All, With a cloud setup for a collection in 4.6.1, what is the most elegant way to backup and restore an index? We are specifically looking into the application of when doing a full reindex, with the idea of building an index on one set of servers, backing up the index, and then restoring that backup on another set of servers. Is there a better way to rebuild indexes on another set of servers? We are not sharding if that makes any difference. Thanks, g10vstmoney -- Regards, Shalin Shekhar Mangar.
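Concretely, the 4.x backup call looks like this (host, core and path are placeholders; numberToKeep is optional):

http://localhost:8983/solr/mycore/replication?command=backup&location=/backups/solr&numberToKeep=2

Restoring on the target set of servers then amounts to stopping the node, copying the snapshot directory's contents into the core's data/index directory, and starting it again, as described above.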
Re: Queries on SynonymFilterFactory
Yes sure, thanks for your advice. I'm still waiting for my server to arrive before I can scale up my system and do the testing. Right now the Solr running on my 4GB RAM system will crash if I try to scale up, as there's not enough memory to support it. Regards, Edwin

On 11 May 2015 at 19:11, Alessandro Benedetti benedetti.ale...@gmail.com wrote: 2015-05-11 4:44 GMT+01:00 Zheng Lin Edwin Yeo edwinye...@gmail.com: I've managed to run the synonyms with 10 different synonym files. Each synonym file is 1MB, consisting of about 1000 tokens, and each token has about 40-50 words. These lists are more extreme than what I'll use in the real environment; they are just for testing purposes. The QTime is about 100-200, compared to about 50 for a collection without synonyms configured. Is this timing considered fast or slow? Although the synonym files are big, there aren't that many documents in my collection yet. I'm just afraid the performance will be affected when more documents come in.

Whether it's fast or slow depends on your requirements :) For a human waiting for the response, I would say 100ms is quite fast. To understand what happens when the index scales up, you should prototype! Anyway, there are a lot of solutions in Solr to scale up your system! Cheers

Regards, Edwin On 9 May 2015 00:14, Zheng Lin Edwin Yeo edwinye...@gmail.com wrote: Thank you for your suggestions. I can't do proper testing on that yet as I'm currently using a normal 4GB RAM PC, and all of this probably requires more RAM than I have. I've tried running the setup with 20 synonym files, and the system went Out of Memory before I could test anything. For your option 2), do you mean that I'll need to download a synonym database (like the one over 20MB in size which I have), and index it into an ad hoc Solr core to manage it? I can probably only try this out properly when I get a server machine with more RAM. Regards, Edwin

On 8 May 2015 at 22:16, Alessandro Benedetti benedetti.ale...@gmail.com wrote: This is a quite big synonym corpus! If it's not feasible to have only 1 big synonym file (I haven't checked, so I assume the 1 Mb limit is true, even if strange) I would do an experiment: 1) test query time with a classic Solr config 2) use an ad hoc Solr core to manage synonyms (this way we can keep it updated and use it with a custom version of the synonym filter that gets the synonyms directly from another Solr instance) 2b) develop a Solr plugin to provide this approach. If the synonym thesaurus is really big, I guess managing it through another Solr core (or something similar) locally will be better than managing it with an external web service. Cheers

2015-05-08 12:16 GMT+01:00 Zheng Lin Edwin Yeo edwinye...@gmail.com: So it means that having more than 10 or 20 synonym files locally will still be faster than accessing an external service? I found out that zookeeper only allows the synonym.txt file to be a maximum of 1MB, and as my potential synonym file is more than 20MB, I'll need to split the file into more than 20 pieces. Regards, Edwin

-- -- Benedetti Alessandro Visiting card : http://about.me/alessandro_benedetti Tyger, tyger burning bright In the forests of the night, What immortal hand or eye Could frame thy fearful symmetry?
William Blake - Songs of Experience -1794 England -- -- Benedetti Alessandro Visiting card : http://about.me/alessandro_benedetti Tyger, tyger burning bright In the forests of the night, What immortal hand or eye Could frame thy fearful symmetry? William Blake - Songs of Experience -1794 England
Solr Multiword Synonym Problem
Hi all, I am trying to solve the Solr multiword synonym issue at our installation; I am currently using SOLR 4.9.x. I used com.lucidworks.analysis.AutoPhrasingTokenFilterFactory from the Lucidworks git repo in my schema.xml, and also used their com.lucidworks.analysis.AutoPhrasingQParserPlugin in solrconfig.xml. To make testing easier for the Solr community, I used the autophrases.txt below (one phrase per line): big apple new york city city of new york new york new york new york ny ny city ny ny new york

When I run a query for big+apple my parsed query converts perfectly:

"parsedquery": "(+DisjunctionMaxQuery((searchField:big_apple)))/no_coord",
"parsedquery_toString": "+(searchField:big_apple)", ..

but when I search for new+york+city, it converts to:

"parsedquery": "(+(DisjunctionMaxQuery((searchField:new_york_city)) DisjunctionMaxQuery((searchField:city))))/no_coord",
"parsedquery_toString": "+((searchField:new_york_city) (searchField:city))",
"explain": {},

Why is it trying to parse the word city separately? I thought that when it finds an exact match for new york city in autophrases.txt it should just replace the whitespace with an underscore (which is what I chose in my solrconfig). But if I comment out the following in my autophrases.txt:

#city of new york

it works fine; it doesn't perform a DisjunctionMaxQuery on city. Same with new york ny: since there is an entry in autophrases.txt beginning with ny, it searches for ny as well. It's like an overlap is causing this problem. Did anybody face this problem? If so, could you please throw some light on how you solved it? I used the Lucidworks branch from git, which was 10 months old. Any help is highly appreciated.

This is my solrconfig.xml:

<queryParser name="autophrasingParser" class="com.lucidworks.analysis.AutoPhrasingQParserPlugin">
  <str name="phrases">autophrases.txt</str>
  <str name="replaceWhitespaceWith">_</str>
  <str name="defType">edismax</str>
</queryParser>

<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">10</int>
    <str name="df">text</str>
    <str name="defType">autophrasingParser</str>
  </lst>
</requestHandler>

This is my setting from schema.xml:

<fieldType name="text_autophrase" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="com.lucidworks.analysis.AutoPhrasingTokenFilterFactory" phrases="autophrases.txt" includeTokens="true" replaceWhitespaceWith="_" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
  </analyzer>
</fieldType>

thanks SolrUser -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-Multiword-Synonym-Problem-tp4204979.html Sent from the Solr - User mailing list archive at Nabble.com.
boolean operators OR/NOT get highlighted by solr
Hi, We have a SOLR query like this:

q=ddmdate%3A2012-05-01T00%3A00%3A00Z+NOT+dddate%3A2010-06-11T00%3A00%3A00Z&wt=json&indent=true&hl=true&hl.simple.pre=%3Ch1%3E&hl.simple.post=%3C%2Fh1%3E&hl.requireFieldMatch=true&hl.preserveMulti=true&hl.fl=ot&f.ot.hl.fragsize=300&f.ot.hl.alternateField=ot&f.ot.hl.maxAlternateFieldLength=300&fl=id

And the response looks like this; notice the word not is highlighted by solr:

{
  "responseHeader": {
    "status": 0,
    "QTime": 5,
    "params": {
      "f.ot.hl.maxAlternateFieldLength": "300",
      "hl.requireFieldMatch": "true",
      "fl": "id",
      "f.ot.hl.alternateField": "ot",
      "indent": "true",
      "q": "ddmdate:2012-05-01T00:00:00Z NOT dddate:2010-06-11T00:00:00Z",
      "f.ot.hl.fragsize": "300",
      "hl.preserveMulti": "true",
      "hl.simple.pre": "<h1>",
      "hl.simple.post": "</h1>",
      "hl.fl": "ot",
      "wt": "json",
      "hl": "true"
    }
  },
  "response": { "numFound": 1, "start": 0, "docs": [ { "id": "xrbw0180" } ] },
  "highlighting": {
    "xrbw0180": {
      "ot": [" of this info getting out to consumers and others, therefore, please do <h1>not</h1> forward or provide copies to others. Hope this helps...\n\nKevin\n\nl\n\n5027717374#12;pgNbr=1\n"]
    }
  }
}

This happens with OR as well:

q=ddmdate%3A2012-05-01T00%3A00%3A00Z+OR+dddate%3A2010-06-11T00%3A00%3A00Z&wt=json&indent=true&hl=true&hl.simple.pre=%3Ch1%3E&hl.simple.post=%3C%2Fh1%3E&hl.requireFieldMatch=true&hl.preserveMulti=true&hl.fl=ot&f.ot.hl.fragsize=300&f.ot.hl.alternateField=ot&f.ot.hl.maxAlternateFieldLength=300&fl=id

{
  "responseHeader": {
    "status": 0,
    "QTime": 4,
    "params": {
      "f.ot.hl.maxAlternateFieldLength": "300",
      "hl.requireFieldMatch": "true",
      "fl": "id",
      "f.ot.hl.alternateField": "ot",
      "indent": "true",
      "q": "ddmdate:2012-05-01T00:00:00Z OR dddate:2010-06-11T00:00:00Z",
      "f.ot.hl.fragsize": "300",
      "hl.preserveMulti": "true",
      "hl.simple.pre": "<h1>",
      "hl.simple.post": "</h1>",
      "hl.fl": "ot",
      "wt": "json",
      "hl": "true"
    }
  },
  "response": { "numFound": 1, "start": 0, "docs": [ { "id": "xrbw0180" } ] },
  "highlighting": {
    "xrbw0180": {
      "ot": [" of this info getting out to consumers and others, therefore, please do not forward <h1>or</h1> provide copies to others. Hope this helps...\n\nKevin\n\nl\n\n5027717374#12;pgNbr=1\n"]
    }
  }
}

This does not happen with the AND operator. Is this a bug in solr? Or is it a feature that I can turn off? Rebecca Tang Applications Developer, UCSF CKM Industry Documents Digital Libraries E: rebecca.t...@ucsf.edu
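One hedged workaround (an editor suggestion, not from the thread): give the highlighter its own query via the hl.q parameter (available since Solr 3.5) so that the boolean operators in q never reach it, e.g.

...&hl=true&hl.fl=ot&hl.q=ot:(consumers copies)

where the ot:(...) terms are placeholders for whatever should actually be marked up.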
RE: Solr query which return only those docs whose all tokens are from given list
Based on his example, it sounds like Naresh not only wants the tags field to contain at least one of the values [T1, T2, T3] but also wants to exclude documents that contain a tag other than T1, T2, or T3 (Doc3 should not be retrieved). If the set of possible values in the tags field is limited and known, you could use a NOT (or '-') clause to accomplish this. If there were 5 possible tag values: tags:((T1 OR T2 OR T3) NOT (T4 OR T5)) However, this doesn't seem practical if the number of possible values is large or unlimited. Perhaps something could be done with range queries: tags:((T1 OR T2 OR T3) NOT ([* TO T1} OR {T1 TO T2} OR {T2 TO T3} OR {T3 TO *])) however this would require whatever is constructing the query to be aware of the lexical ordering of the terms in the index. Maybe there are more elegant solutions, but I am not aware of them. - Andy - -Original Message- From: sujitatgt...@gmail.com [mailto:sujitatgt...@gmail.com] On Behalf Of Sujit Pal Sent: Monday, May 11, 2015 10:40 AM To: solr-user@lucene.apache.org Subject: Re: Solr query which return only those docs whose all tokens are from given list <snip/>
Re: Solr query which return only those docs whose all tokens are from given list
A simple OR query should be fine : tags:(T1 T2 T3) Cheers 2015-05-11 15:39 GMT+01:00 Sujit Pal sujit@comcast.net: <snip/> -- -- Benedetti Alessandro Visiting card : http://about.me/alessandro_benedetti Tyger, tyger burning bright In the forests of the night, What immortal hand or eye Could frame thy fearful symmetry? William Blake - Songs of Experience -1794 England
Re: schema modification issue
Hi, Thanks for reporting, I’m working a test to reproduce. Can you please create a Solr JIRA issue for this?: https://issues.apache.org/jira/browse/SOLR/ Thanks, Steve On May 7, 2015, at 5:40 AM, User Zolr zolr.u...@gmail.com wrote: Hi there, I have come accross a problem that when using managed schema in SolrCloud, adding fields into schema would SOMETIMES end up prompting Can't find resource 'schema.xml' in classpath or '/configs/collectionName', cwd=/export/solr/solr-5.1.0/server, there is of course no schema.xml in configs, but 'schema.xml.bak' and 'managed-schema' i use solrj to create a collection: Path tempPath = getConfigPath(); client.uploadConfig(tempPath, name); //customized configs with solrconfig.xml using ManagedIndexSchemaFactory if(numShards==0){ numShards = getNumNodes(client); } Create request = new CollectionAdminRequest.Create(); request.setCollectionName(name); request.setNumShards(numShards); replicationFactor = (replicationFactor==0?DEFAULT_REPLICA_FACTOR:replicationFactor); request.setReplicationFactor(replicationFactor); request.setMaxShardsPerNode(maxShardsPerNode==0?replicationFactor:maxShardsPerNode); CollectionAdminResponse response = request.process(client); and adding fields to schema, either by curl or by httpclient, would sometimes yield the following error, but the error can be fixed by RELOADING the newly created collection once or several times: INFO - [{ responseHeader:{status:500,QTime:5}, errors:[Error reading input String Can't find resource 'schema.xml' in classpath or '/configs/collectionName', cwd=/export/solr/solr-5.1.0/server], error:{msg:Can't find resource 'schema.xml' in classpath or '/configs/collectionName', cwd=/export/solr/solr-5.1.0/server,trace:java.io.IOException: Can't find resource 'schema.xml' in classpath or '/configs/collectionName', cwd=/export/solr/solr-5.1.0/server at org.apache.solr.cloud.ZkSolrResourceLoader.openResource(ZkSolrResourceLoader.java:98) at org.apache.solr.schema.SchemaManager.getFreshManagedSchema(SchemaManager.java:421) at org.apache.solr.schema.SchemaManager.doOperations(SchemaManager.java:104) at org.apache.solr.schema.SchemaManager.performOperations(SchemaManager.java:94) at org.apache.solr.handler.SchemaHandler.handleRequestBody(SchemaHandler.java:57) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:143) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1984) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:829) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:446) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:220) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419) at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137) at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557) at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231) at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075) at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384) at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193) at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135) at 
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255) at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116) at org.eclipse.jetty.server.Server.handle(Server.java:368) at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489) at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53) at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:953) at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1014) at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:861) at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235) at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72) at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264) at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608) at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543) at java.lang.Thread.run(Thread.java:745)\n,code:500}}]
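For reference, the kind of Schema API call involved looks like this; a hedged example with placeholder collection and field names (per the report above, retrying the call or issuing a collection RELOAD works around the intermittent failure):

curl -X POST -H 'Content-type:application/json' --data-binary '{
  "add-field": { "name":"myfield", "type":"string", "stored":true }
}' http://localhost:8983/solr/collectionName/schema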
Re: Solr custom component issue
attaching them to each request, then just add qf= as a param to the URL, easy. On Mon, May 11, 2015, at 12:17 PM, nutchsolruser wrote: These boosting parameters will be configured outside Solr and there is seperate module from which these values get populated , I am reading those values from external datasource and I want to attach them to each request . -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-custom-component-issue-tp4204799p4204832.html Sent from the Solr - User mailing list archive at Nabble.com.
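A hedged sketch of what that looks like from SolrJ, with the boosts pulled from your external store per request (field names and weights below are illustrative):

import org.apache.solr.client.solrj.SolrQuery;

String qfFromExternalStore = "title^5.0 description^2.0"; // fetched per request; hypothetical values
SolrQuery q = new SolrQuery("laptop");
q.set("defType", "edismax");
q.set("qf", qfFromExternalStore);
q.set("bf", "popularity_f"); // optional function boost on an assumed numeric field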